# Hypothetical Document Embedding (HyDE) in Document Retrieval
## Overview
This code implements a Hypothetical Document Embedding (HyDE) system for document retrieval. HyDE is an innovative approach that transforms query questions into hypothetical documents containing the answer, aiming to bridge the gap between query and document distributions in vector space.
## Motivation
Traditional retrieval methods often struggle with the semantic gap between short queries and longer, more detailed documents. HyDE addresses this by expanding the query into a full hypothetical document, potentially improving retrieval relevance by making the query representation more similar to the document representation in the vector space.
## Key Components
1. PDF processing and text chunking
2. Vector store creation using FAISS and OpenAI embeddings
3. Language model for generating hypothetical documents
4. Custom `HyDERetriever` class implementing the HyDE technique
## Method Details
### Document Preprocessing and Vector Store Creation
1. The PDF is processed and split into chunks
2. A FAISS vector store is created using OpenAI embeddings for efficient similarity search
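
To make the chunking step concrete, here is a minimal sketch of splitting text into overlapping character windows. It is a simplification: the notebook uses `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `split_with_overlap` helper below cuts at fixed offsets. The function name and the sample text are assumptions made for illustration only.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that share an overlap.

    Hypothetical helper for illustration; not the splitter used in the notebook.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 2500-character sample so the chunk boundaries are easy to check
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.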
### Hypothetical Document Generation
1. A language model is used to generate a hypothetical document that answers the given query
2. The generation is guided by a prompt template that asks for a detailed document whose length matches the chunk size used in the vector store
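
As a rough illustration of how such a prompt is assembled, the sketch below fills the two placeholders with plain `str.format`. The notebook itself uses LangChain's `PromptTemplate` for this; the names `HYDE_TEMPLATE` and `build_hyde_prompt` are hypothetical, and the template wording only paraphrases the one defined in the `HyDERetriever` class.

```python
# Hypothetical template, paraphrasing the prompt used later in the notebook
HYDE_TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that "
    "directly answers this question. The document should be detailed and "
    "in-depth. The document size has to be exactly {chunk_size} characters."
)


def build_hyde_prompt(query, chunk_size=500):
    """Fill the template with the user query and the target document length."""
    return HYDE_TEMPLATE.format(query=query, chunk_size=chunk_size)


prompt = build_hyde_prompt("What is the main cause of climate change?")
print(prompt)
```

Passing the chunk size into the prompt is a request rather than a guarantee: language models tend to follow length instructions only approximately, so the generated document may be shorter or longer than asked.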
### Retrieval Process
The `HyDERetriever` class implements the following steps:
1. Generate a hypothetical document from the query using the language model
2. Use the hypothetical document as the search query in the vector store
3. Retrieve the most similar documents to this hypothetical document
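
The steps above can be sketched end to end without any external services. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words vector rather than an OpenAI embedding, `fake_llm` returns a canned answer instead of calling a language model, and a sorted list replaces the FAISS index. Only the control flow (generate, embed the generated text, rank by similarity) reflects the technique.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word-count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(query):
    """Hypothetical stand-in for the language model; ignores the query."""
    return ("Climate change is mainly caused by greenhouse gases such as carbon "
            "dioxide released by burning fossil fuels.")


def hyde_retrieve(query, chunks, generate=fake_llm, k=1):
    # Step 1: generate a hypothetical document that answers the query
    hypothetical_doc = generate(query)
    # Step 2: embed the hypothetical document instead of the raw query
    query_vec = embed(hypothetical_doc)
    # Step 3: rank stored chunks by similarity to the hypothetical document
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k], hypothetical_doc


chunks = [
    "Burning fossil fuels releases carbon dioxide, a greenhouse gas that traps heat.",
    "Coral reefs host a wide variety of marine species.",
    "Wind turbines convert kinetic energy into electricity.",
]
top, doc = hyde_retrieve("What is the main cause of climate change?", chunks, k=1)
print(top[0])
```

Note that the raw query shares no words with the first chunk, while the hypothetical document shares several. That is the effect HyDE relies on, shown here in a deliberately exaggerated form.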
## Key Features
1. Query Expansion: Transforms short queries into detailed hypothetical documents
2. Flexible Configuration: Allows adjustment of chunk size, overlap, and number of retrieved documents
3. Integration with OpenAI models for hypothetical document generation and embeddings for vector representation
## Benefits of this Approach
1. Improved Relevance: By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches
2. Handling Complex Queries: Particularly useful for complex or multi-faceted queries that might be difficult to match directly
3. Adaptability: The hypothetical document generation can adapt to different types of queries and document domains
4. Potential for Better Context Understanding: The expanded query might better capture the context and intent behind the original question
## Conclusion
HyDE addresses the semantic gap between queries and documents by using a language model to expand each query into a hypothetical document. This has the potential to improve retrieval relevance, especially for complex or nuanced queries, and could be particularly valuable in domains where understanding query intent and context is crucial.

In [1]:
import os
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI
load_dotenv()
openai_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
openai_api_version = os.getenv("AZURE_API_VERSION")

llm = AzureChatOpenAI(
    azure_deployment=openai_deployment,
    api_version="2024-10-01-preview",
    azure_endpoint=f"{openai_endpoint}openai/deployments/{openai_deployment}/chat/completions?api-version=2024-10-01-preview",
    temperature=0,
    logprobs=True,
)

In [2]:
path = "./data/Understanding_Climate_Change.pdf"

In [3]:
from helper_functions import *
from evaluation.evalute_rag import *
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a FAISS vector store using Azure OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()


    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len)
    texts = text_splitter.split_documents(documents)

    cleaned_texts = replace_t_with_space(texts)

    # Embed the chunks with Azure OpenAI. The deployment name is read here so
    # the function does not rely on a global defined in a later cell.
    from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings
    embeddings = AzureOpenAIEmbeddings(
        deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID"),
        model="text-embedding-ada-002",
        chunk_size=16
    )
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [4]:
from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings
openai_embedding = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID")
class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        self.llm = AzureChatOpenAI(
            azure_deployment=openai_deployment,
            api_version="2024-10-01-preview",
            azure_endpoint=f"{openai_endpoint}openai/deployments/{openai_deployment}/chat/completions?api-version=2024-10-01-preview",
            temperature=0,
            logprobs=True,
        )

        self.embeddings = AzureOpenAIEmbeddings(
            deployment=openai_embedding,
            model="text-embedding-ada-002",
            chunk_size=16
        )
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vectorstore = encode_pdf(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)
    
        
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template="""Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
            The document size has to be exactly {chunk_size} characters.""",
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        # Ask the LLM for a passage that answers the query, sized like the stored chunks
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        return self.hyde_chain.invoke(input_variables).content

    def retrieve(self, query, k=3):
        # Search the vector store with the hypothetical document, not the raw query
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc

In [5]:
retriever = HyDERetriever(path)

In [6]:
test_query = "What is the main cause of climate change?"
results, hypothetical_doc = retriever.retrieve(test_query)

In [7]:
docs_content = [doc.page_content for doc in results]

print("hypothetical_doc:\n")
print(text_wrap(hypothetical_doc)+"\n")
show_context(docs_content)

hypothetical_doc:

Climate change is primarily caused by human activities that increase the concentration of greenhouse gases in the
atmosphere. The burning of fossil fuels such as coal, oil, and natural gas for energy and transportation releases
significant amounts of carbon dioxide (CO2). Deforestation and land-use changes also contribute by reducing the number
of trees that can absorb CO2. Additionally, industrial processes and agricultural practices emit other potent greenhouse
gases like methane (CH4) and nitrous oxide (N2O), exacerbating the greenhouse effect and leading to global warming.

Context 1:
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhou se gases.  
Chapter 2: Causes of Climate Change  
Greenhouse Gases  
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH