#### **Contextual Chunk Headers**

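In this approach, each chunk is given a title or brief header that summarizes its content. The header is prepended to the chunk text before embedding.

Because the header is embedded together with the chunk, it adds context that the chunk alone may lack, which can improve retrieval quality.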

##### **Flow**
- Split the document into chunks
- Generate a title for each chunk using an LLM
- Prepend each title to its chunk, embed the result, and store it in a vector store
- Create a retriever, build the RAG flow, and test it
- Optionally, compare retriever performance with and without contextual chunk headers (CCH)
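
The comparison in the last step can be sketched without any model at all. The example below is a toy illustration, not the notebook's method: it assumes a bag-of-words "embedding" in place of a real embedding model, and the chunks, titles, and labelled queries are invented for the demonstration. It only shows the mechanics of measuring a top-1 hit rate with and without headers; the numbers say nothing about real-world gains.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (assumption, not a real model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top1(query: str, texts: list[str]) -> int:
    # Index of the text most similar to the query.
    q = bow(query)
    scores = [cosine(q, bow(t)) for t in texts]
    return scores.index(max(scores))


def hit_rate(texts: list[str], labelled: list[tuple[str, int]]) -> float:
    # Fraction of queries whose top-1 result is the expected chunk.
    hits = sum(top1(q, texts) == expected for q, expected in labelled)
    return hits / len(labelled)


# Hypothetical chunks and headers, invented for illustration.
chunks = [
    "Burning coal, oil, and natural gas releases large amounts of CO2.",
    "Trees absorb CO2, so clearing forests releases large amounts of CO2.",
]
titles = ["Fossil Fuels and Climate Change", "Deforestation and Climate Change"]
with_cch = [f"{t}\n\n{c}" for t, c in zip(titles, chunks)]

# Each query is paired with the index of the chunk that should be retrieved.
labelled = [("How do fossil fuels release CO2?", 0), ("What does deforestation do?", 1)]

print("without CCH:", hit_rate(chunks, labelled))
print("with CCH:   ", hit_rate(with_cch, labelled))
```

With a real setup, `top1` would be replaced by a call to the retriever, and the labelled queries would come from a hand-checked evaluation set.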




In [1]:
# LLM used for inference is Llama3.2 from langchain_ollama 
from langchain_ollama import ChatOllama 

llm = ChatOllama(
    model='llama3.2',
    temperature=0,
    verbose=True
)

llm.invoke("How are you?")

AIMessage(content="I'm just a language model, so I don't have feelings or emotions like humans do. However, I'm functioning properly and ready to assist you with any questions or tasks you may have! How can I help you today?", additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-12-13T15:31:29.535412Z', 'done': True, 'done_reason': 'stop', 'total_duration': 23526228084, 'load_duration': 5070066834, 'prompt_eval_count': 29, 'prompt_eval_duration': 12635922250, 'eval_count': 47, 'eval_duration': 3979463504, 'logprobs': None, 'model_name': 'llama3.2', 'model_provider': 'ollama'}, id='lc_run--de1d15ae-e5e7-4b8a-beb3-e8d282b00a09-0', usage_metadata={'input_tokens': 29, 'output_tokens': 47, 'total_tokens': 76})

In [2]:
# embedding model which we'll use is from Sentence Transformers provided by langchain_huggingface
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model='all-MiniLM-L6-v2')

sample_embeddings = embedding_model.embed_query("Hey How are you?")
print(f"Length of sample embeddings : {len(sample_embeddings)}")
print(f"Sample embedding : {sample_embeddings[:100]}")

Length of sample embeddings : 384
Sample embedding : [-0.013380538672208786, 0.003255972173064947, 0.10806030035018921, 0.08322358131408691, 0.02040085941553116, -0.049066152423620224, 0.0722508355975151, 0.002980925841256976, -0.08823534101247787, 0.016058299690485, -0.03367079421877861, -4.332493062975118e-06, -0.02510129101574421, 0.0007887802203185856, 0.060331884771585464, -0.0415474958717823, 0.07702311128377914, -0.14256997406482697, -0.13958506286144257, 0.06023767963051796, 0.003192346775904298, 0.018982844427227974, 0.02300790697336197, 0.06056844815611839, -0.07911035418510437, -0.05399537831544876, -0.0008475205395370722, 0.03202424943447113, -0.029674910008907318, -0.04484577104449272, -0.10411098599433899, 0.06399180740118027, -0.05713418126106262, -0.02695028856396675, -0.028776653110980988, 0.00333896791562438, -0.0355900302529335, -0.13525626063346863, 0.009469274431467056, 0.0003555373114068061, 0.009924577549099922, -0.0014938903041183949, -0.009747199714183807, -0.0

#### **Loading the Data and Making Chunks**

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/Understanding_Climate_Change.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()
print(f"Number of docs : {len(docs)}")

Number of docs : 33


In [5]:
# Using text splitter to make chunks 
from langchain_text_splitters import RecursiveCharacterTextSplitter 

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

print(f"Number of Chunks are : {len(chunks)}")
print(f"Sample chunk : {chunks[0].page_content}")

Number of Chunks are : 215
Sample chunk : Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human


#### **Building a Chain That Generates a Title for Each Chunk**

In [6]:
from pydantic import BaseModel, Field 
from typing import Annotated
from langchain_core.prompts import PromptTemplate

# Output Data Structure
class TitleOChunk(BaseModel):
    """ 
    This will return a string that represents suitable title for a Chunk.
    """
    title: Annotated[str, Field(description="Suitable title to the Chunk.")]

# configuring LLM with data structure 
llm_title_gen = llm.with_structured_output(TitleOChunk)

# prompt template for title generation
title_gen_template = """ 
    You are given a chunk of text: {chunk}

    Based on this chunk, generate a suitable title for it.
"""

prompt_template = PromptTemplate(
    template=title_gen_template,
    input_variables=['chunk']
)

# creating a chain 
title_gen_chain = prompt_template | llm_title_gen 

In [8]:
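# Let's test this on a single chunk.
# Pass only the text (page_content) so the metadata does not leak into the prompt.
response = title_gen_chain.invoke({'chunk' : chunks[6].page_content})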
print(f"Title suggested : {response.title}")
print("-"*89)
print(f"Chunk : {chunks[6]}")

Title suggested : Fossil Fuels and Climate Change
-----------------------------------------------------------------------------------------
Chunk : page_content='activities have intensified this natural process, leading to a warmer climate. 
Fossil Fuels 
Burning fossil fuels for energy releases large amounts of CO2. This includes coal, oil, and 
natural gas used for electricity, heating, and transportation. The industrial revolution marked 
the beginning of a significant increase in fossil fuel consumption, which continues to rise 
today. 
Coal' metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-07-13T20:17:34+03:00', 'author': 'Nir', 'moddate': '2024-07-13T20:17:34+03:00', 'source': '../data/Understanding_Climate_Change.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}


In [14]:
# Prepend each generated title to the start of its chunk.
# Note: this modifies the chunks in place, so re-running the cell
# would prepend a second title.
from tqdm import tqdm

def add_title_to_chunks(chunks):
    for chunk in tqdm(chunks):
        title_generated = title_gen_chain.invoke({'chunk' : chunk.page_content})
        chunk.page_content = title_generated.title + "\n\n" + chunk.page_content

add_title_to_chunks(chunks)

100%|██████████| 215/215 [13:07<00:00,  3.66s/it]
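
Since title generation takes several minutes here, it can be worth guarding against accidentally titling the same chunks twice. The following is a minimal sketch of a non-mutating variant, under stated assumptions: `Chunk` is a hypothetical stand-in for a LangChain `Document`, `with_titles` is an illustrative name, and `stub_title` stands in for the LLM call. It returns new chunks and records the title in the metadata so that already-titled chunks are skipped.

```python
from dataclasses import dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def with_titles(chunks: list[Chunk], make_title: Callable[[str], str]) -> list[Chunk]:
    """Return new chunks with a title prepended; the inputs are left untouched."""
    titled = []
    for chunk in chunks:
        if "title" in chunk.metadata:
            # Already titled in an earlier run: keep as is.
            titled.append(chunk)
            continue
        title = make_title(chunk.page_content)
        titled.append(
            replace(
                chunk,
                page_content=f"{title}\n\n{chunk.page_content}",
                metadata={**chunk.metadata, "title": title},
            )
        )
    return titled


def stub_title(text: str) -> str:
    # Assumption: placeholder for the LLM title generator.
    # Uses the first sentence, cut to 40 characters.
    return text.split(".")[0][:40]


original = [Chunk("Burning fossil fuels releases CO2. This includes coal.")]
once = with_titles(original, stub_title)
twice = with_titles(once, stub_title)  # second pass changes nothing

print(once[0].page_content)
```

Keeping the title in the metadata also makes it possible to inspect or filter on titles later without parsing the chunk text.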


##### **Next Steps**
- Build a vector store from the titled chunks
- Create a retriever
- Build a RAG class
- Use it to answer questions
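
One possible shape for the RAG class is sketched below. This is an assumption about structure, not the notebook's implementation: `SimpleRAG`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the retriever and model are injected as plain callables so the sketch runs without a vector store or an LLM.

```python
from typing import Callable


class SimpleRAG:
    """Sketch of a RAG wrapper; `retrieve` and `generate` are injected callables."""

    def __init__(
        self,
        retrieve: Callable[[str], list[str]],
        generate: Callable[[str], str],
    ):
        self.retrieve = retrieve
        self.generate = generate

    def build_prompt(self, question: str, contexts: list[str]) -> str:
        # Join the retrieved chunks (titles included) into one context block.
        context_block = "\n\n---\n\n".join(contexts)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context_block}\n\n"
            f"Question: {question}"
        )

    def answer(self, question: str) -> str:
        contexts = self.retrieve(question)
        return self.generate(self.build_prompt(question, contexts))


# Stubs (assumptions) so the sketch runs on its own.
def fake_retrieve(question: str) -> list[str]:
    return ["Fossil Fuels and Climate Change\n\nBurning fossil fuels releases CO2."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-character prompt)"


rag = SimpleRAG(fake_retrieve, fake_generate)
print(rag.answer("What releases CO2?"))
```

In this notebook's setting, `retrieve` could wrap a retriever call and return the `page_content` of each returned document, and `generate` could wrap `llm.invoke(prompt).content`. Injecting both as callables also makes it easy to swap in a retriever built without CCH for the comparison.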