# Hypothetical Document Embeddings (HyDE) 

- HyDE uses an LLM to generate a “fake” hypothetical document for a given user query. That document is then embedded, and its embedding is used to look up real documents that are similar to it. The underlying idea is that the hypothetical document may be closer to the real documents in the embedding space than the query itself is.

- Read the paper: [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496)

<img src="../figures/AI-HyDE-workflow.excalidraw.png" >
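
The intuition above can be illustrated with a toy example. The sketch below is not part of the HyDE pipeline: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `hypothetical_doc` is hand-written to stand in for LLM output. It only shows why an answer-shaped text can land closer to a real passage than the question does.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Toy bag-of-words 'embedding': lowercase token counts (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "When was Steve Jobs fired from Apple?"
# Hand-written stand-in for what an LLM might generate for the query.
hypothetical_doc = (
    "Steve Jobs departed Apple in 1985 after a power struggle "
    "with the board and the CEO John Sculley."
)
# A real passage of the kind stored in the vector database.
real_doc = (
    "In 1985, Jobs departed Apple after a long power struggle with the "
    "company's board and its then-CEO, John Sculley."
)

sim_query = cosine(toy_embed(query), toy_embed(real_doc))
sim_hypo = cosine(toy_embed(hypothetical_doc), toy_embed(real_doc))
print(f"query vs real doc:        {sim_query:.3f}")
print(f"hypothetical vs real doc: {sim_hypo:.3f}")
```

With this toy embedding the hypothetical document scores higher than the raw query because it shares the vocabulary of an answer. A dense embedding model captures more than word overlap, so the size of the gap here says nothing about real retrieval quality.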

# From Scratch

## 1. Model Preparation

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_groq import ChatGroq
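from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# ChatGroq reads the API key from the GROQ_API_KEY environment variable
llm = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct")

embedding_model = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    model_kwargs={"device": "cpu"},
    # unit-length vectors, so the dot product equals cosine similarity
    encode_kwargs={"normalize_embeddings": True},
)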
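
As a side note on `normalize_embeddings`: once vectors are scaled to unit length, their dot product equals their cosine similarity. The sketch below checks this in plain Python; the vectors and the helper names (`l2_normalize`, `dot`, `cosine`) are made up for illustration and are not part of any library used here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both numbers agree up to floating-point rounding.
print(cosine(a, b), dot(a_unit, b_unit))
```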

## 2. Data Loader

In [None]:
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# loading data
loader = WikipediaLoader(query="Steve Jobs", load_max_docs=5)
documents = loader.load()

# text splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)
docs = text_splitter.split_documents(documents=documents)
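
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, so its chunks will not match this output exactly.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.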

## 3. Vector Store & Retriever

In [None]:
from langchain_community.vectorstores import Chroma

# create the vector store from the split documents
db = Chroma.from_documents(
    documents=docs,
    embedding=embedding_model,
    persist_directory="output/steve_jobs_for_hyde.db",
)
# create the retriever; k is the number of chunks returned per query
retriever = db.as_retriever(search_kwargs={"k": 5})
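
Conceptually, a retriever with `k` set returns the `k` stored chunks whose embeddings are most similar to the query embedding. The sketch below shows that idea with brute-force cosine similarity over made-up 2-d vectors. `top_k` and `doc_vecs` are hypothetical names, and a real vector store uses its own index and configured distance metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to query_vec."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Made-up embeddings for three stored chunks.
doc_vecs = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.8, 0.6],
    "chunk_c": [0.0, 1.0],
}
result = top_k([1.0, 0.1], doc_vecs, k=2)
print(result)
# ['chunk_a', 'chunk_b']
```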

## 4. Query Generation

In [None]:
def get_hypo_doc(query):
    # Ask the LLM for a short passage that answers the question. This passage,
    # not the raw query, is what gets embedded for retrieval.
    template = """
    Write a short passage, no more than 200 words, that answers the question below.
    Return only the passage.

    question : {question}
    passage :
    """

    hyde_prompt = ChatPromptTemplate.from_template(template)
    hyde_chain = hyde_prompt | llm | StrOutputParser()  # LangChain Expression Language (LCEL)

    response = hyde_chain.invoke({"question": query})
    print(f"Hypothetical document for {query}:\n {response}")
    return response

query = 'When was Steve Jobs fired from Apple?'
get_hypo_doc(query)

## 5. Retrieve

In [None]:
# search with the hypothetical document instead of the raw query
matched_doc = retriever.invoke(get_hypo_doc(query))
print(matched_doc)

# From LangChain

Default prompt keys: ['web_search', 'sci_fact', 'arguana', 'trec_covid', 'fiqa', 'dbpedia_entity', 'trec_news', 'mr_tydi']
- web_search: for general web search queries; the prompt asks the LLM to write a passage that answers the question.
- sci_fact: for scientific fact verification; the prompt asks the LLM to write a scientific paper passage that supports or refutes the claim.


In [None]:
from langchain.chains import HypotheticalDocumentEmbedder

hyde_embedding_model = HypotheticalDocumentEmbedder.from_llm(
    llm = llm, 
    base_embeddings = embedding_model, 
    prompt_key = 'web_search'
)
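
A rough mental model of what such an embedder does, based on a reading of the library source (details may differ between versions): documents are embedded with the base model unchanged, while a query is first expanded into one or more hypothetical documents whose embeddings are averaged. The class below is a toy sketch of that idea, not LangChain's implementation. `ToyHyDEEmbedder`, `fake_generate`, and `fake_embed` are hypothetical names, and the 2-d "embedding" is a stand-in chosen so the example runs without a model.

```python
from statistics import fmean

class ToyHyDEEmbedder:
    """Toy HyDE wrapper: plain embedding for documents, generate-then-embed for queries."""

    def __init__(self, generate_fn, embed_fn):
        self.generate_fn = generate_fn  # query -> list of hypothetical texts
        self.embed_fn = embed_fn        # text -> vector

    def embed_documents(self, texts):
        # Stored documents are embedded as they are.
        return [self.embed_fn(t) for t in texts]

    def embed_query(self, query):
        # Embed each hypothetical document, then average dimension by dimension.
        vectors = [self.embed_fn(t) for t in self.generate_fn(query)]
        return [fmean(dim) for dim in zip(*vectors)]

def fake_generate(query):
    # Stand-in for an LLM call; returns fixed hypothetical documents.
    return ["jobs left apple in 1985", "jobs was ousted from apple"]

def fake_embed(text):
    # Stand-in embedding: [word count, character count].
    return [float(len(text.split())), float(len(text))]

embedder = ToyHyDEEmbedder(fake_generate, fake_embed)
query_vec = embedder.embed_query("When was Steve Jobs fired from Apple?")
print(query_vec)
# [5.0, 24.5]
```

Under this model, the same object can be handed to a vector store both for indexing and for querying, because only the query path involves the LLM.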

In [None]:
# the HyDE wrapper embeds stored documents with the base embedding model
doc_db = Chroma.from_documents(docs, hyde_embedding_model, persist_directory='output/steve_job_hyde_chains')

In [None]:
# pass the raw query; the HyDE embedder generates the hypothetical document internally
matched_docs_new = doc_db.similarity_search(query)

for doc in matched_docs_new:
    print(doc.page_content)
    print(' ')


# OUTPUT
# In 1997, Jobs returned to Apple as CEO after the company's acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications,
 
# In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets,
 
# On October 5, 2011, at the age of 56, Steve Jobs, the CEO of Apple, died due to complications from a relapse of islet cell neuroendocrine pancreatic cancer. Powell Jobs inherited the Steven P. Jobs Trust, which as of May 2013 had a 7.3% stake in The Walt Disney Company worth about $12.1 billion,
 
# conducted by Sorkin. The film covers fourteen years in the life of Apple Inc. co-founder Steve Jobs, specifically ahead of three press conferences he gave during that time - the formal unveiling of the Macintosh 128K on January 24, 1984; the unveiling of the NeXT Computer on October 12, 1988; and
