# Hybrid RAG with Contextual Chunking

This notebook implements retrieval-augmented generation (RAG) question answering by combining four techniques:

1. Source documents are split in a sentence-aware manner using spaCy
2. Document chunks are enriched with an LLM-generated context header
3. Hybrid retrieval (embeddings and BM25 keyword search) is performed before generation
4. Retrieval results are fused using Reciprocal Rank Fusion (RRF)

## Step 1: Extract articles and chunk on sentence boundaries

In [7]:
# First, we will download the articles using newspaper3k.

from newspaper import Article

def extract_article(url):
    article = Article(url)
    article.download()
    article.parse()
    return f'{article.title}\n{article.text}'

In [8]:
urls = [
    'https://paulgraham.com/wealth.html',
    'https://paulgraham.com/foundermode.html',
    'https://paulgraham.com/hwh.html'
]

articles = [{'url': url, 'text': extract_article(url)} for url in urls]

In [143]:
# Let's take a look at one of the downloaded articles.

print(articles[1])

{'url': 'https://paulgraham.com/foundermode.html', 'text': 'Founder Mode\n\n\n\n\n\n\n\n\nSeptember 2024\n\n\n\nAt a YC event last week Brian Chesky gave a talk that everyone who was there will remember. Most founders I talked to afterward said it was the best they\'d ever heard. Ron Conway, for the first time in his life, forgot to take notes. I\'m not going to try to reproduce it here. Instead I want to talk about a question it raised.\n\n\n\nThe theme of Brian\'s talk was that the conventional wisdom about how to run larger companies is mistaken. As Airbnb grew, well-meaning people advised him that he had to run the company in a certain way for it to scale. Their advice could be optimistically summarized as "hire good people and give them room to do their jobs." He followed this advice and the results were disastrous. So he had to figure out a better way on his own, which he did partly by studying how Steve Jobs ran Apple. So far it seems to be working. Airbnb\'s free cash flow marg

In [144]:
# Now we will split on sentence boundaries using spaCy.
# You will probably need to run `python -m spacy download en_core_web_sm` first.

import spacy
nlp = spacy.load('en_core_web_sm')

sentencized = []
for article in articles:
    sentences = [str(s).strip() for s in nlp(article['text']).sents]
    sentencized.append(sentences)

In [145]:
# Let's see what one of the split articles looks like now.

sentencized[2][:10]

["How to Work Hard\n\n\n\n\n\n\n\n\nJune 2021\n\n\n\nIt might not seem there's much to learn about how to work hard.",
 "Anyone who's been to school knows what it entails, even if they chose not to do it.",
 'There are 12 year olds who work amazingly hard.',
 'And yet when I ask if I know more about working hard now than when I was in school, the answer is definitely yes.',
 "One thing I know is that if you want to do great things, you'll have to work very hard.",
 "I wasn't sure of that as a kid.",
 "Schoolwork varied in difficulty; one didn't always have to work super hard to do well.",
 'And some of the things famous adults did, they seemed to do almost effortlessly.',
 'Was there, perhaps, some way to evade hard work through sheer brilliance?',
 'Now I know the answer to that question.']
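
The first sentence in the output above still carries the title, the date, and a long run of newlines. A minimal sketch of one way to tidy this, assuming we only want to collapse whitespace before chunking; `normalize_whitespace` is a hypothetical helper that is not part of the original notebook, and it does not separate the title from the first sentence:

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


# A shortened version of the first sentence shown above.
raw = "How to Work Hard\n\n\n\nJune 2021\n\n\n\nIt might not seem there's much to learn about how to work hard."
cleaned = normalize_whitespace(raw)
print(cleaned)
```

It could be applied to each sentence in the list comprehension in place of `.strip()`, or left out if the newlines are useful downstream.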

In [146]:
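# We will create chunks of 10 sentences, each overlapping the previous chunk by 2 sentences.

def create_chunks(sent, chunk_size=10, overlap=2):
    # Advance by chunk_size - overlap so consecutive chunks share `overlap` sentences.
    step = chunk_size - overlap
    return [' '.join(sent[i:i+chunk_size]) for i in range(0, len(sent), step)]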

chunkized = []
for sentences in sentencized:
    chunkized.append(create_chunks(sentences, chunk_size=10, overlap=2))
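
To see which sentences end up in which chunk, here is a small sketch that reproduces the same windowing on sentence indices only. `chunk_spans` is a hypothetical helper added for illustration; it assumes the same stride of `chunk_size - overlap` used by `create_chunks`:

```python
def chunk_spans(n_sentences, chunk_size=10, overlap=2):
    """Return the (start, end) sentence indices covered by each chunk."""
    step = chunk_size - overlap
    return [(i, min(i + chunk_size, n_sentences)) for i in range(0, n_sentences, step)]


# A toy article of 20 sentences yields three windows; each window starts
# 8 sentences after the previous one, so neighbours share 2 sentences.
spans = chunk_spans(20)
print(spans)

# With exactly 10 sentences, the second window holds only the 2 overlap sentences.
short_spans = chunk_spans(10)
print(short_spans)
```

The second case shows that the last chunk of an article can be much shorter than the others, and may consist only of sentences already present in the previous chunk.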

## Step 2: Add context to each chunk using a faster LLM

In [32]:
# Set up the OPENAI_API_KEY environment variable.

import getpass
import os


def _set_env(var: str):
    if os.environ.get(var):
        return
    os.environ[var] = getpass.getpass(var + ":")

In [33]:
_set_env("OPENAI_API_KEY")

In [89]:
# Set up the chain to generate the contextual header for each chunk.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

fast_llm = ChatOpenAI(model="gpt-4o-mini")
str_parser = StrOutputParser()

contextual_rag_message = """
Given the document below, we want to explain what the chunk captures in the document.

DOCUMENT:
{document}

Here is the chunk we want to explain:

CHUNK:
{chunk}

Answer ONLY with a succinct explanation of the meaning of the chunk in the context of the whole document above.
"""

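# from_messages expects a list of messages; a bare ("human", ...) tuple would be
# read as two separate message strings.
contextual_rag_prompt = ChatPromptTemplate.from_messages([("human", contextual_rag_message)])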

context_chain = contextual_rag_prompt | fast_llm | str_parser

In [36]:
# We will try it out with one chunk first.

context_chain.invoke(input={'document': articles[0]['text'], 'chunk': chunkized[0][0]})

'The chunk introduces the central theme of the document, which is the potential for wealth creation through starting or joining a startup. It emphasizes that startups, particularly in the technology sector, have historically been reliable avenues for generating wealth. The comparison to medieval trading voyages suggests that startups operate in a high-risk, high-reward environment, tackling challenging technical problems to innovate and create value. This sets the stage for a deeper exploration of the dynamics, benefits, and challenges associated with startups throughout the rest of the essay.'

In [91]:
# Now that we know it works, we can generate a header for each chunk.

chunks_with_context = []

for i, article in enumerate(articles):
    print(f"Contextualizing article {i+1} / {len(articles)}")
    article_chunks = []
    for j, chunk in enumerate(chunkized[i]):
        print(f"Contextualizing chunk {j+1} / {len(chunkized[i])}")
        explanation = context_chain.invoke(input={'document': article['text'], 'chunk': chunk})
        full_chunk = f"{explanation}\n\n{chunk}"
        article_chunks.append(full_chunk)
    chunks_with_context.append(article_chunks)        

Contextualizing article 1 / 3
Contextualizing chunk 1 / 71
Contextualizing chunk 2 / 71
Contextualizing chunk 3 / 71
Contextualizing chunk 4 / 71
Contextualizing chunk 5 / 71
Contextualizing chunk 6 / 71
Contextualizing chunk 7 / 71
Contextualizing chunk 8 / 71
Contextualizing chunk 9 / 71
Contextualizing chunk 10 / 71
Contextualizing chunk 11 / 71
Contextualizing chunk 12 / 71
Contextualizing chunk 13 / 71
Contextualizing chunk 14 / 71
Contextualizing chunk 15 / 71
Contextualizing chunk 16 / 71
Contextualizing chunk 17 / 71
Contextualizing chunk 18 / 71
Contextualizing chunk 19 / 71
Contextualizing chunk 20 / 71
Contextualizing chunk 21 / 71
Contextualizing chunk 22 / 71
Contextualizing chunk 23 / 71
Contextualizing chunk 24 / 71
Contextualizing chunk 25 / 71
Contextualizing chunk 26 / 71
Contextualizing chunk 27 / 71
Contextualizing chunk 28 / 71
Contextualizing chunk 29 / 71
Contextualizing chunk 30 / 71
Contextualizing chunk 31 / 71
Contextualizing chunk 32 / 71
Contextualizing chu
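
The loop above calls the chain once per chunk, one after another. As a rough sketch of an alternative, the inputs can be built up front and sent through the chain's `batch` method. `build_context_inputs` and `attach_headers` are hypothetical helpers, the `max_concurrency` value is an arbitrary assumption, and the real `batch` call is left as a comment so the example runs without an API key:

```python
def build_context_inputs(articles, chunkized):
    """Flatten (article, chunk) pairs into the dicts expected by the prompt."""
    return [
        {'document': article['text'], 'chunk': chunk}
        for article, chunks in zip(articles, chunkized)
        for chunk in chunks
    ]


def attach_headers(inputs, explanations):
    """Prepend each explanation to its chunk, as in the loop above."""
    return [f"{exp}\n\n{inp['chunk']}" for inp, exp in zip(inputs, explanations)]


# Toy data standing in for the real articles and chunks.
toy_articles = [{'url': 'u1', 'text': 'Doc one.'}, {'url': 'u2', 'text': 'Doc two.'}]
toy_chunks = [['a', 'b'], ['c']]
inputs = build_context_inputs(toy_articles, toy_chunks)

# With the real chain this would be something like:
# explanations = context_chain.batch(inputs, config={"max_concurrency": 4})
explanations = [f"About {inp['chunk']}" for inp in inputs]  # stand-in for LLM output

flat_chunks = attach_headers(inputs, explanations)
print(flat_chunks)
```

Note that this produces one flat list rather than a list per article, which matches how the chunks are used in the retrieval step.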

In [130]:
# Let's take a look at what we got for one of the chunks.

print(chunks_with_context[0][2])

The chunk discusses the economic dynamics of startups, emphasizing that they allow individuals to accelerate their work life by focusing intensely for a short period, typically four years, instead of spreading their efforts over decades in traditional jobs. It outlines a hypothetical scenario in which a skilled hacker can achieve significantly higher productivity—potentially up to 36 times that of a corporate job—by working in a startup environment. This increased productivity is attributed to factors such as longer working hours, enhanced focus, and the absence of bureaucratic obstacles, ultimately presenting startups as a lucrative avenue for wealth creation in the technology sector.

The Proposition



Economically, you can think of a startup as a way to compress your whole working life into a few years. Instead of working at a low intensity for forty years, you work as hard as you possibly can for four. This pays especially well in technology, where you earn a premium for working f

## Step 3: Hybrid retrieval

*We will embed each chunk in a vector store and also build a BM25 keyword retriever over the same chunks. In Step 4 we will fuse the two result lists with Reciprocal Rank Fusion to produce the final context that we feed to the question-answering LLM.*

In [93]:
# Let's create the retrievers.

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.retrievers import BM25Retriever

# Make a flat list of chunks.
all_chunks = []
for article_chunks in chunks_with_context:
    all_chunks.extend(article_chunks)

vectorstore = Chroma.from_texts(texts=all_chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))
vector_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# Note: despite the variable name, this is a BM25 retriever (a TF-IDF-style keyword ranker).
tfidf_retriever = BM25Retriever.from_texts(texts=all_chunks, k=5)
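
To give a feel for what the keyword side of the hybrid search does, here is a small pure-Python sketch of a textbook Okapi BM25 score. It is an illustration only: `bm25_scores`, the whitespace tokenizer, and the `k1` and `b` defaults are assumptions, and the retriever used above may tokenize and weight terms differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "startups compress your working life into a few years",
    "working hard at school is not the same as working hard later",
    "founders run companies differently from managers",
]
scores = bm25_scores("compress working life", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Unlike the embedding retriever, this kind of scoring only rewards exact term matches, which is why the two retrievers can return different chunks for the same question.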

In [137]:
# Now let's ask a question that can be answered from the documents we ingested.

question = "How can I compress my career into as short a time as possible?"

vector_results = vector_retriever.invoke(question)
tfidf_results = tfidf_retriever.invoke(question)

In [138]:
# How many results did we get from each retriever? (Should be k=5)

print(f"Vector results: {len(vector_results)}")
print(f"TF-IDF results: {len(tfidf_results)}")

Vector results: 5
TF-IDF results: 5


## Step 4: Reciprocal Rank Fusion to combine query results from retrievers

In [139]:
# Reciprocal Rank Fusion merges the results of both retrievers into a single ranking,
# so that results returned by both retrievers score higher.
# Because scores are keyed by document, duplicate results are also removed.

from collections import defaultdict
from langchain.load import dumps, loads


def reciprocal_rank_fusion(*list_of_results_lists, k=60):
    # Accumulated RRF score per serialized document.
    rrf_scores = defaultdict(float)

    for result_list in list_of_results_lists:
        for rank, result in enumerate(result_list, 1):
            # Serialize the document so it can serve as a dictionary key.
            str_result = dumps(result)
            rrf_scores[str_result] += 1 / (rank + k)

    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [loads(doc_str) for doc_str, _ in sorted_results]


fused_results = reciprocal_rank_fusion(vector_results, tfidf_results)

In [140]:
# Let's see how many results we have now that we have removed duplicates.

len(fused_results)

9
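
A count of 9 is what we would expect if the two lists of 5 shared exactly one chunk. The toy example below works through the scoring on plain string IDs instead of documents: each item receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The IDs and the `rrf_scores` helper are made up for illustration.

```python
from collections import defaultdict


def rrf_scores(*ranked_lists, k=60):
    """Sum 1 / (k + rank) for every list an item appears in (ranks start at 1)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical chunk IDs; only "c3" is returned by both retrievers.
vector_ids = ["c3", "c7", "c1", "c9", "c4"]
keyword_ids = ["c8", "c3", "c5", "c2", "c6"]

fused = rrf_scores(vector_ids, keyword_ids)
for item, score in fused:
    print(item, round(score, 5))
```

Here "c3" comes out on top because it collects 1/61 from the first list and 1/62 from the second, while every other item scores from one list only.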

## Step 5: Feed the context and the question to an LLM to answer the question

*We are using a simple RAG prompt from the LangChain Hub. It is geared towards producing short answers.*

In [147]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


long_context_llm = ChatOpenAI(model="gpt-4o")
rag_prompt = hub.pull("rlm/rag-prompt")

# Let's check the prompt that will be fed to the LLM.
rag_prompt.invoke(input={
    'question': question, 
    'context': format_docs(fused_results[:5])   # We will use the top 5
})



ChatPromptValue(messages=[HumanMessage(content='You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don\'t know the answer, just say that you don\'t know. Use three sentences maximum and keep the answer concise.\nQuestion: How can I compress my career into as short a time as possible? \nContext: The chunk discusses the economic dynamics of startups, emphasizing that they allow individuals to accelerate their work life by focusing intensely for a short period, typically four years, instead of spreading their efforts over decades in traditional jobs. It outlines a hypothetical scenario in which a skilled hacker can achieve significantly higher productivity—potentially up to 36 times that of a corporate job—by working in a startup environment. This increased productivity is attributed to factors such as longer working hours, enhanced focus, and the absence of bureaucratic obstacles, ultimately presenting startups 

In [148]:
# Now let's generate the final answer.

rag_chain = rag_prompt | long_context_llm | StrOutputParser()

rag_chain.invoke(input={
    'question': question, 
    'context': format_docs(fused_results[:5])
})

'To compress your career into as short a time as possible, consider working intensely in a startup environment where you can focus and eliminate bureaucratic obstacles, potentially achieving significantly higher productivity. This approach allows you to concentrate your efforts into a few years, particularly in technology, where rapid work can yield substantial financial rewards. By leveraging your skills and working longer hours, you can achieve the equivalent of decades of traditional corporate work in a much shorter timeframe.'