<a href="https://colab.research.google.com/github/duper203/RAG_Techniques_with_upstage/blob/main/upstage/07_HyDe_Hypothetical_Document_Embedding_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothetical Document Embedding (HyDE) in Document Retrieval


## Key Components


1. PDF processing and text chunking

2. Vector store creation using FAISS and Upstage embeddings

3. Language model for generating hypothetical documents

4. Custom HyDERetriever class implementing the HyDE technique


## Method Details

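1. Document Preprocessing and Vector Store Creation: the PDF is loaded, split into overlapping chunks, and embedded into a FAISS vector store with Upstage embeddings.

2. Hypothetical Document Generation: the language model writes a hypothetical document that answers the query, with a target length equal to the chunk size.

3. Retrieval Process: the hypothetical document, rather than the raw query, is used for the similarity search, and the most similar chunks are returned.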
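The three steps above can be sketched without any API calls. The following is a minimal illustration, not the notebook's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical stub that returns a canned passage in place of the language model, and `hyde_retrieve` is an illustrative name.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): a term-count vector instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(query):
    """Hypothetical stub for the language model: returns a canned answer-like passage."""
    return ("Climate change is primarily caused by greenhouse gas emissions "
            "from burning fossil fuels such as coal, oil, and gas.")


def hyde_retrieve(query, chunks, k=1):
    # Step 2: generate a hypothetical document that answers the query
    hypothetical_doc = fake_llm(query)
    # Step 3: rank chunks by similarity to the hypothetical document, not to the query
    doc_vec = embed(hypothetical_doc)
    ranked = sorted(chunks, key=lambda c: cosine(doc_vec, embed(c)), reverse=True)
    return ranked[:k], hypothetical_doc


# Step 1 (simplified): the "vector store" is just a list of text chunks
chunks = [
    "Burning fossil fuels like coal and oil releases greenhouse gas emissions.",
    "Renewable energy sources include solar and wind power.",
    "Historical records help scientists study past weather.",
]
top, hypo = hyde_retrieve("What is the main cause of climate change?", chunks)
print(top[0])
```

The query itself shares almost no words with the first chunk, while the hypothetical answer shares many. That gap is what HyDE tries to close: an answer-shaped text tends to sit closer to answer-bearing chunks than a short question does.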

In [None]:
! pip3 install -qU langchain-upstage langchain langchain-community pypdf faiss-cpu

In [2]:
import os
from google.colab import userdata

os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")

## Define document(s) path

In [12]:
path = "data/Understanding_Climate_Change.pdf"

## Helper Functions
* `replace_t_with_space`: Replaces all tab characters ('\t') with spaces.
* `text_wrap`: Wraps the input text to the specified width.
* `show_context`: Displays the contents of the provided context list.
* `encode_pdf`: Encodes a PDF book into a vector store using Upstage embeddings.

In [None]:
def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents


In [None]:
def show_context(context):
    """
    Display the contents of the provided context list.

    Args:
        context (list): A list of context items to be displayed.

    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i + 1}:")
        print(c)
        print("\n")

In [24]:
import textwrap

def text_wrap(text, width=120):
    """
    Wraps the input text to the specified width.

    Args:
        text (str): The input text to wrap.
        width (int): The width at which to wrap the text.

    Returns:
        str: The wrapped text.
    """
    return textwrap.fill(text, width=width)

In [9]:
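# Loaders and vector stores live in langchain-community; the splitter lives in langchain-text-splitters
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_upstage import UpstageEmbeddings
from langchain_community.vectorstores import FAISS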

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using Upstage embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and vector store
    embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore
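To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is only an illustration under stated assumptions: `sliding_window_chunks` is a hypothetical helper, and it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks and newlines, so the real chunk boundaries will differ.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Simplified illustration only; not how RecursiveCharacterTextSplitter picks boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.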

## Define the HyDE retriever class: create the vector store, generate the hypothetical document, and retrieve

In [14]:
from langchain_core.prompts import PromptTemplate
from langchain_upstage import ChatUpstage, UpstageEmbeddings


class HyDERetriever:
    """Retrieves chunks by searching with an LLM-generated hypothetical document instead of the raw query."""

    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        self.llm = ChatUpstage(model="solar-pro")

        # Not used directly below: encode_pdf creates its own embeddings instance
        self.embeddings = UpstageEmbeddings(model="solar-embedding-1-large")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vectorstore = encode_pdf(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        # Prompt asking the LLM for an answer-like document roughly the size of one chunk
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template=(
                "Given the question '{query}', generate a hypothetical document that directly answers this question. "
                "The document should be detailed and in-depth. "
                "The document size has to be exactly {chunk_size} characters."
            ),
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        """Ask the LLM to write a hypothetical document that answers the query."""
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        return self.hyde_chain.invoke(input_variables).content

    def retrieve(self, query, k=3):
        """Return the k chunks most similar to the hypothetical document, plus that document."""
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc
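The prompt asks for a document of exactly `chunk_size` characters, but a language model may not follow a character count precisely. If a closer length match to the stored chunks is wanted, one option is to trim the generated text before searching. The sketch below is an assumption, not part of the class above; `fit_to_chunk_size` is a hypothetical helper name.

```python
def fit_to_chunk_size(text, chunk_size):
    """Trim text to at most chunk_size characters, cutting at the last space when one exists.

    Hypothetical helper: not part of HyDERetriever as defined in this notebook.
    """
    if len(text) <= chunk_size:
        return text
    cut = text[:chunk_size]
    last_space = cut.rfind(" ")
    # Avoid ending mid-word when a space is available inside the window
    return cut[:last_space] if last_space > 0 else cut


print(fit_to_chunk_size("alpha beta gamma", 12))
```

It could be applied to the return value of `generate_hypothetical_document` before the similarity search. Whether trimming helps retrieval quality is an open question that would need to be checked on the actual data.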

## Create a HyDE retriever instance

In [15]:
retriever = HyDERetriever(path)

## Demonstrate on a use case

In [16]:
test_query = "What is the main cause of climate change?"
results, hypothetical_doc = retriever.retrieve(test_query)

## Print the hypothetical document and the retrieved documents

In [25]:
docs_content = [doc.page_content for doc in results]

print("hypothetical_doc:\n")
print(text_wrap(hypothetical_doc)+"\n")
show_context(docs_content)

hypothetical_doc:

Title: The Main Cause of Climate Change  Climate change, a significant global issue, is primarily caused by human
activities, specifically greenhouse gas emissions. The primary culprit is carbon dioxide (CO2), released through burning
fossil fuels (coal, oil, gas) for electricity, heat, and transportation.  Deforestation exacerbates the problem, as
trees absorb CO2. Methane, another potent greenhouse gas, is emitted during agriculture (livestock, rice paddies) and
waste management. Industrial processes and land use changes also contribute.  These gases trap heat in the atmosphere,
causing Earth's temperature to rise, leading to severe consequences: melting ice caps, sea-level rise, extreme weather,
and biodiversity loss.  In conclusion, human-induced greenhouse gas emissions, mainly CO2, are the main cause of climate
change. Immediate action is necessary to mitigate its effects and ensure a sustainable future.  (495 characters)

Context 1:
predict future trends. The 