#### **Hypothetical Document Embeddings (HyDE) in Document Retrieval**

HyDE uses an LLM to expand the query into a hypothetical document, that is, a passage that could plausibly contain the answer to the query.

This hypothetical document then replaces the original query at search time: it is embedded, and the retriever returns the stored chunks that are semantically closest to it.

Advantages:
- **Improved relevance:** By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches.
- **Potential for better context understanding:** The expanded query might better capture the context and intent behind the original question.
- **Handles complex queries:** Useful for complex queries that are difficult to match directly against the stored chunks.
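
The flow described above can be sketched without any model at all. The following is a toy illustration, not the pipeline used in this notebook: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical placeholder for the LLM call, and the two-sentence corpus is made up.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fake_llm(query):
    """Hypothetical stand-in for the LLM that drafts an answer-like passage."""
    return (
        "The main cause of climate change is the increase in greenhouse gases "
        "such as carbon dioxide released by burning fossil fuels."
    )


def search(text, docs):
    """Return the document whose toy embedding is closest to the text."""
    text_vec = embed(text)
    return max(docs, key=lambda d: cosine(text_vec, embed(d)))


# Made-up corpus for illustration only.
corpus = [
    "Greenhouse gases such as carbon dioxide trap heat in the atmosphere.",
    "Spring is arriving earlier and winters are becoming shorter.",
]

query = "What is the main cause of climate change?"
hypothetical_doc = fake_llm(query)               # step 1: expand the query
best_match = search(hypothetical_doc, corpus)    # step 2: search with the expansion
print(best_match)
```

The hypothetical passage shares far more vocabulary with the relevant sentence than the short question does, which is the intuition behind searching with the expansion instead of the raw query.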

---

LLM used: Llama 3.2 (`llama3.2`), served through Ollama


In [1]:
from langchain_ollama import ChatOllama 

llm = ChatOllama(
    model='llama3.2',
    temperature=0,
    verbose=True
)

llm.invoke("Hey How are you?")


AIMessage(content="I'm just a language model, so I don't have emotions or feelings like humans do. However, I'm functioning properly and ready to help with any questions or tasks you may have! How can I assist you today?", additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-12-08T16:38:06.765709Z', 'done': True, 'done_reason': 'stop', 'total_duration': 21131627167, 'load_duration': 3708394875, 'prompt_eval_count': 30, 'prompt_eval_duration': 12692674375, 'eval_count': 46, 'eval_duration': 3256736958, 'logprobs': None, 'model_name': 'llama3.2', 'model_provider': 'ollama'}, id='lc_run--43eaa1f0-13e4-4d40-88e0-3295eaf1b6ee-0', usage_metadata={'input_tokens': 30, 'output_tokens': 46, 'total_tokens': 76})

---

#### **Embedding Model**

We use the `all-MiniLM-L6-v2` Sentence Transformers model, loaded through the `langchain_huggingface` integration.

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings 

embedding_model = HuggingFaceEmbeddings(model='all-MiniLM-L6-v2')

text = "This is a test document."
query_result = embedding_model.embed_query(text)

print(f"Dimension of embeddings : {len(query_result)}")
# show only the first 100 characters of the stringified vector
print(str(query_result)[:100] + "...")

Dimension of embeddings : 384
[-0.0383385606110096, 0.1234646886587143, -0.02864295430481434, 0.05365273356437683, 0.0088453618809...


---

##### **Step 1 - Load the document**

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/Understanding_Climate_Change.pdf"
loader = PyPDFLoader(file_path)

pages = loader.load()
print(f"Number of Pages : {len(pages)}")

Number of Pages : 33


In [7]:
# preview the first 300 characters of the first page
print(f"Page 1 : \n {pages[0].page_content[:300]}")

Page 1 : 
 Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an exte


--- 

##### **Step 2 - Create chunks**

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter 

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

chunks = text_splitter.split_documents(documents=pages)
print(f"Number of Chunks : {len(chunks)}")
print(f"Chunk 1 : \n {chunks[0]}")

Number of Chunks : 215
Chunk 1 : 
 page_content='Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human' metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-07-13T20:17:34+03:00', 'author': 'Nir', 'moddate': '2024-07-13T20:17:34+03:00', 'source': '../data/Understanding_Climate_Change.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}
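
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is an assumption-laden sketch and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers natural separators such as paragraph and line breaks, so its chunks are at most `chunk_size` long rather than exactly that long. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo_chunks = sliding_chunks("abcdefghijklmnopqrst", chunk_size=8, chunk_overlap=2)
print(demo_chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last two characters of one chunk reappear at the start of the next.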


---

##### **Create a vector store and retriever**

In [9]:
# Let's use Chroma as the vector store
from langchain_chroma import Chroma 

vector_store = Chroma(
    collection_name='reliable-rag',
    embedding_function=embedding_model,
    persist_directory='../data/persistent_vectordb/HyDE'
)

In [10]:
# Add the chunks to the vector store.
# Note: the store is persisted, so re-running this cell adds the same chunks again.
import time 

start = time.time()
vector_store.add_documents(documents=chunks)
end = time.time()

print(f"Time taken to store {len(chunks)} chunks : {end-start :.2f} seconds")

Time taken to store 215 chunks : 5.89 seconds


In [11]:
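## Create a retriever.
## MMR fetches the 5 most similar chunks (fetch_k), then returns the 2 (k) that balance relevance and diversity.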

retriever = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "fetch_k": 5}
)
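
The retriever above uses `search_type="mmr"` (maximal marginal relevance). As a rough illustration of the idea, and not the actual LangChain or Chroma implementation, the selection step can be sketched as follows. The function name `mmr_select`, the toy 2-D vectors, and the `lambda_mult=0.5` weighting are assumptions chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidates = [
    [0.9, 0.3],    # most relevant
    [0.9, 0.31],   # near-duplicate of the first candidate
    [0.8, -0.5],   # slightly less relevant, but different in content
]
picked = mmr_select(query_vec, candidates, k=2)
print(picked)
```

A plain top-2 similarity search would return the two near-duplicates here, whereas the MMR-style selection keeps the most relevant candidate and then prefers the one that adds something new.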

In [21]:
## Baseline: test the retriever on the raw query (no HyDE)
query = "What is the main cause of climate change?"
retrieved_docs = retriever.invoke(query)

for i, doc in enumerate(retrieved_docs):
    print(f"Doc {i+1} \n {doc.page_content}")
    print("-"*89)

Doc 1 
 Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human
-----------------------------------------------------------------------------------------
Doc 2 
 and infrastructure. Cities are particularly vulnerable due to the "urban heat island" effect. 
Heatwaves can lead to heat-related illnesses and exacerbate existing health conditions. 
Changing Seasons 
Climate change is altering the timing and length of seasons, affecting ecosystems and human 
activities. For example, spring is arriving earlier, and winters are becoming shorter and
-----------------------------------------------------------------------------------------


--- 

##### **Create an LLM chain to generate the hypothetical document**



In [13]:
## Data Model 
from pydantic import BaseModel, Field
from typing import Annotated
from langchain_core.prompts import PromptTemplate 

class HyDE_document(BaseModel):
    """
    A hypothetical document in which the answer to the query can be found.
    """
    hyde_doc: Annotated[str, Field(description="Hypothetical document that answers the query")]

# configure llm with output structure 
llm_hyde_generation = llm.with_structured_output(HyDE_document)

# prompt template for HyDE generation - input variables are {query} and {chunk_size}
template_hyde = """
Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
The document must be exactly {chunk_size} characters long.
"""

prompt_template_for_hyde = PromptTemplate(
    template=template_hyde,
    input_variables=['query', 'chunk_size']
)

## chain to generate Hypothetical Document
chain_for_hyde_gen = prompt_template_for_hyde | llm_hyde_generation

In [18]:
# Let's test this out; our chunk_size is 400
query = "What is the main cause of climate change?"
hypothetical_doc = chain_for_hyde_gen.invoke({'query' : query, 'chunk_size' : 400})

print(f"Hypothetical Document : \n {hypothetical_doc.hyde_doc}")

Hypothetical Document : 
 Climate Change: Main Cause

The primary driver of climate change is human activities releasing greenhouse gases (GHGs) into the atmosphere, primarily carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O). These GHGs trap heat, leading to global warming.

Causes:
1. Burning fossil fuels (coal, oil, gas)
2. Deforestation and land-use changes
3. Agriculture and livestock production
4. Industrial processes and transportation

Effects: Rising temperatures, sea-level rise, extreme weather events, and altered ecosystems.
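
The prompt asks for exactly `chunk_size` characters, but LLMs do not reliably hit exact character counts, and the sample output above appears to run past 400 characters. If matching the chunk length matters, one option is to check the length and trim before retrieval. The helper below is a hypothetical sketch (the name `trim_to_chunk_size` is not part of LangChain), and whether trimming actually improves retrieval is an assumption that would need testing.

```python
def trim_to_chunk_size(text, chunk_size):
    """Cut text to at most chunk_size characters, preferring to end on a word boundary."""
    text = text.strip()
    if len(text) <= chunk_size:
        return text
    cut = text[:chunk_size]
    # Back up to the last space, if any, so the result does not end mid-word.
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip()


sample = "Greenhouse gases trap heat in the atmosphere and warm the planet."
trimmed = trim_to_chunk_size(sample, 40)
print(len(sample), len(trimmed))
print(trimmed)
```

In the notebook this could be applied as `trim_to_chunk_size(hypothetical_doc.hyde_doc, 400)` before passing the text to the retriever.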


In [20]:
## Now we'll use the hypothetical doc instead of the raw query during retrieval
retrieved_docs = retriever.invoke(hypothetical_doc.hyde_doc)

for i, doc in enumerate(retrieved_docs):
    print(f"Doc : {i+1}")
    print(f"Content : {doc.page_content}")
    print("-"*89)

Doc : 1
Content : Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human
-----------------------------------------------------------------------------------------
Doc : 2
Content : Costs of Inaction 
Economic Impacts of Climate Change 
The economic costs of climate change include damage to infrastructure, reduced agricultural 
productivity, health care costs, and lost labor productivity. Extreme weather events, such as 
hurricanes and floods, can cause significant economic disruption. Investing in climate action 
now can prevent much higher costs in the future.
-----------------------------------------------------------------------------------------
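
To see what HyDE changed, the two result lists can be compared directly. The sketch below uses a hypothetical helper, `compare_results`, and short labels that abbreviate the chunks printed above (the section headings they contain); in the notebook one would pass `[doc.page_content for doc in retrieved_docs]` from each run instead.

```python
def compare_results(baseline, hyde):
    """Split two retrieval result lists into shared and method-specific items."""
    return {
        "shared": [d for d in baseline if d in hyde],
        "only_baseline": [d for d in baseline if d not in hyde],
        "only_hyde": [d for d in hyde if d not in baseline],
    }


# Labels are shorthand for the chunks shown in the outputs above.
baseline_results = ["Greenhouse Gases", "Changing Seasons"]
hyde_results = ["Greenhouse Gases", "Costs of Inaction"]

diff = compare_results(baseline_results, hyde_results)
print(diff)
```

For this query both approaches return the same top chunk, and they differ only in the second result.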
