# Hypothetical Document Embedding (HyDE) in Document Retrieval

## Overview

This notebook implements a Hypothetical Document Embedding (HyDE) system for document retrieval. HyDE rewrites a query as a hypothetical document that answers it, then searches with the embedding of that document instead of the raw query, aiming to bridge the gap between query and document distributions in vector space.

## Motivation

Traditional retrieval methods often struggle with the semantic gap between short queries and longer, more detailed documents. HyDE addresses this by expanding the query into a full hypothetical document, potentially improving retrieval relevance by making the query representation more similar to the document representations in the vector space.

## Key Components

1. PDF processing and text chunking
2. Vector store creation using FAISS and a local sentence-transformers embedding model (bge-m3)
3. Language model for generating hypothetical documents
4. Custom HyDERetriever class implementing the HyDE technique

## Method Details

### Document Preprocessing and Vector Store Creation

1. The PDF is processed and split into chunks.
2. A FAISS vector store is built from the chunk embeddings, which are produced by a local bge-m3 model, for efficient similarity search.
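
The chunking step can be illustrated with a minimal sketch. It assumes plain fixed-size character windows, which is a simplification: the `RecursiveCharacterTextSplitter` used later in this notebook prefers to break on separators such as paragraphs, lines, and spaces. The function name `split_with_overlap` is hypothetical and only serves to show how `chunk_size` and `chunk_overlap` interact.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Illustrative input: 1200 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.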

### Hypothetical Document Generation

1. A language model (in this notebook, Llama-3.2-3B-Instruct served through an OpenAI-compatible endpoint) generates a hypothetical document that answers the given query.
2. The generation is guided by a prompt template that asks for a detailed document whose length matches the chunk size used in the vector store.
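
As a rough sketch of how such a prompt is assembled, the snippet below fills the same two variables (`query` and `chunk_size`) using plain `str.format` instead of LangChain's `PromptTemplate`. The names `HYDE_TEMPLATE` and `build_hyde_prompt` are hypothetical. The length request should be read as a soft target: language models generally do not hit exact character counts.

```python
# Template text mirrors the prompt used by the retriever class in this notebook.
HYDE_TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that directly "
    "answers this question. The document should be detailed and in-depth. "
    "The document size has to be exactly {chunk_size} characters."
)


def build_hyde_prompt(query: str, chunk_size: int = 500) -> str:
    """Fill the template with the user query and the target document length."""
    return HYDE_TEMPLATE.format(query=query, chunk_size=chunk_size)


prompt = build_hyde_prompt("What is the main cause of climate change?")
print(prompt)
```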

### Retrieval Process

The `HyDERetriever` class implements the following steps:

1. Generate a hypothetical document from the query using the language model.
2. Use the hypothetical document as the search query in the vector store.
3. Retrieve the most similar documents to this hypothetical document.
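
The three steps above can be sketched end to end without any external service. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words embedding rather than a real model, `fake_llm` returns a canned answer in place of a language model call, and a sorted list replaces the FAISS index.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(query: str) -> str:
    """Hypothetical stand-in for the language model; returns a canned answer."""
    return ("Climate change is primarily driven by greenhouse gases such as carbon dioxide "
            "released by burning fossil fuels.")


def hyde_retrieve(query, chunks, k=2, generate=fake_llm):
    hypothetical_doc = generate(query)      # step 1: query -> hypothetical document
    query_vec = embed(hypothetical_doc)     # step 2: search with the document, not the query
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k], hypothetical_doc     # step 3: keep the k most similar chunks


chunks = [
    "Burning fossil fuels releases carbon dioxide and other greenhouse gases.",
    "Coral reefs host a large share of marine biodiversity.",
    "Deforestation reduces the number of trees that absorb carbon dioxide.",
]
top, hypo = hyde_retrieve("What is the main cause of climate change?", chunks, k=2)
for chunk in top:
    print(chunk)
```

Note that the original question shares no words with the best-matching chunk, while the hypothetical answer shares several. That vocabulary overlap is the gap HyDE tries to close.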

## Key Features

1. Query Expansion: Transforms short queries into detailed hypothetical documents.
2. Flexible Configuration: Allows adjustment of chunk size, overlap, and number of retrieved documents.
3. OpenAI-Compatible Integration: Uses the `ChatOpenAI` client against an OpenAI-compatible endpoint for hypothetical document generation, and a local embedding model for vector representation.

## Benefits of this Approach

1. Improved Relevance: By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches.
2. Handling Complex Queries: Particularly useful for complex or multi-faceted queries that might be difficult to match directly.
3. Adaptability: The hypothetical document generation can adapt to different types of queries and document domains.
4. Potential for Better Context Understanding: The expanded query might better capture the context and intent behind the original question.

## Implementation Details

1. Uses a chat model reached through the `ChatOpenAI` client for hypothetical document generation.
2. Employs FAISS for efficient similarity search in the vector space.
3. Allows for easy visualization of both the hypothetical document and retrieved results.

## Conclusion

Hypothetical Document Embedding (HyDE) represents an innovative approach to document retrieval, addressing the semantic gap between queries and documents. By leveraging advanced language models to expand queries into hypothetical documents, HyDE has the potential to significantly improve retrieval relevance, especially for complex or nuanced queries. This technique could be particularly valuable in domains where understanding query intent and context is crucial, such as legal research, academic literature review, or advanced information retrieval systems.

# Package Installation and Imports

The cell below imports the required packages and configures the language model client, which talks to an OpenAI-compatible server.


In [7]:
import os
import sys
from dotenv import load_dotenv

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta-llama/Llama-3.2-3B-Instruct",
    base_url="http://10.0.64.77:36363/v1",
    api_key="EMPTY",  # Any string works, but the argument is required
    temperature=0,
    max_tokens=4000,
)

### Define document(s) path

import os
os.makedirs('data', exist_ok=True)

!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf

In [8]:
path = "/data_hdd_16t/khanhtran/LLM/RAG/data/8.HyDE_doc/Understanding_Climate_Change.pdf"

### Define the HyDE retriever class - creating vector store, generating hypothetical document, and retrieving

In [9]:
from langchain_core.embeddings import Embeddings

class LocalEmbeddingWrapper(Embeddings):
    """Adapts a local sentence-transformers model to the LangChain Embeddings interface."""

    def __init__(self, model_path, device="cuda"):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_path, device=device)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Encode a batch of chunks; FAISS expects plain Python lists of floats.
        return self.model.encode(texts, convert_to_numpy=True).tolist()

    def embed_query(self, text: str) -> list[float]:
        # Encode a single query string with the same model as the documents.
        return self.model.encode([text], convert_to_numpy=True)[0].tolist()

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

def encode_pdf(filepath, chunk_size=500, chunk_overlap=100):
    """Load a PDF, split it into overlapping chunks, and index them in FAISS."""
    loader = PyPDFLoader(filepath)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    texts = text_splitter.split_documents(documents)

    # Embed the chunks with the local bge-m3 model and build the vector store.
    embeddings = LocalEmbeddingWrapper("/data_hdd_16t/khanhtran/bge-m3", device="cuda")
    vectorstore = FAISS.from_documents(texts, embeddings)
    return vectorstore

In [11]:

from langchain_core.prompts import PromptTemplate

class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        self.llm = ChatOpenAI(
            model="meta-llama/Llama-3.2-3B-Instruct",
            base_url="http://10.0.64.77:36363/v1",  # vLLM server URL
            api_key="EMPTY",  # can be any string; vLLM does not check the key
            temperature=0,
            max_tokens=4000,
        )
        
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vectorstore = encode_pdf(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)
    
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template="""
                Given the question '{query}', generate a hypothetical document that directly answers this question. 
                The document should be detailed and in-depth.
                The document size has to be exactly {chunk_size} characters.
            """
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        return self.hyde_chain.invoke(input_variables).content

    def retrieve(self, query, k=3):
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc

### Create a HyDE retriever instance

In [12]:
retriever = HyDERetriever(path)

### Demonstrate on a use case

In [13]:
test_query = "What is the main cause of climate change?"
results, hypothetical_doc = retriever.retrieve(test_query)

### Print the hypothetical document and the retrieved documents

In [14]:
import textwrap

def text_wrap(text, width=100):
    """
    Wrap text to specified width for better readability
    
    Args:
        text: Text to wrap
        width: Maximum line width (default: 100)
    
    Returns:
        Wrapped text string
    """
    return '\n'.join(textwrap.wrap(text, width=width))

def show_context(docs_content, width=100):
    """
    Display retrieved documents in a formatted way
    
    Args:
        docs_content: List of document content strings
        width: Maximum line width for wrapping (default: 100)
    """
    print("Retrieved Documents:\n")
    print("=" * width)
    
    for i, doc in enumerate(docs_content, 1):
        print(f"\nDocument {i}:")
        print("-" * width)
        print(text_wrap(doc, width=width))
        print("-" * width)
    
    print("\n" + "=" * width)

In [15]:
docs_content = [doc.page_content for doc in results]

print("Hypothetical_doc:\n")
print(text_wrap(hypothetical_doc)+"\n")
show_context(docs_content)

Hypothetical_doc:

**Climate Change: The Main Cause**  Climate change is primarily attributed to human activities,
particularly the emission of greenhouse gases (GHGs) from burning fossil fuels, deforestation, and
land-use changes. The main causes are:  1. **Carbon dioxide (CO2) emissions**: Released through
fossil fuel combustion (80%), deforestation (15%), and industrial processes (5%). 2. **Methane (CH4)
emissions**: Primarily from agriculture (30%), natural gas production and transport (20%), and
landfills (10%). 3. **Nitrous oxide (N2O) emissions**: Mainly from agriculture (60%), industrial
processes (20%), and fossil fuel combustion (10%).  These GHGs trap heat in the atmosphere, leading
to global warming and associated climate change impacts.

Retrieved Documents:


Document 1:
----------------------------------------------------------------------------------------------------
predict future trends. The evidence overwhelmingly shows that recent changes are primarily  driven
by h
