# Lab 2: RAG

## We will build and evaluate a Question Answering Expert for a fictional company: Insurellm!

### BEFORE WE BEGIN:

Look at the knowledge-base - this is the company shared drive.

### For those new to RAG:

Does one of the Experts want to give an explanation?

We will be figuring out ways to insert relevant background information into the prompt.

Today will be more intense - please ask me lots of questions and ask for clarification whenever you need it.
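For anyone who wants to see the core idea before we start, here is a minimal sketch of inserting background information into a prompt. It is not the approach used in this lab: the `KNOWLEDGE` dictionary, the keyword matching, and the helper names are all made up for illustration, and the snippets are placeholders rather than real Insurellm content. The lab replaces the keyword lookup with vector search.

```python
import re

# Toy knowledge base: placeholder snippets, NOT real company content.
KNOWLEDGE = {
    "claims": "Placeholder note: claims are handled by the claims team.",
    "pricing": "Placeholder note: pricing is reviewed once a year.",
}


def find_relevant(question):
    """Return the snippets whose topic keyword appears in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return [text for topic, text in KNOWLEDGE.items() if topic in words]


def build_prompt(question):
    """Build a prompt that places the relevant snippets alongside the question."""
    context = "\n".join(find_relevant(question))
    return (
        f"Question: {question}\n\n"
        f"Relevant background:\n{context}\n\n"
        "Answer using the background above."
    )


prompt = build_prompt("How is pricing decided?")
print(prompt)
```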

In [None]:
import numpy as np
from IPython.display import Markdown, display
from pathlib import Path
from openai import OpenAI
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from chromadb import PersistentClient
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from tqdm import tqdm
import asyncio
from litellm import acompletion

In [None]:
MODEL = "gpt-4.1-nano"
db_name = "preprocessed_db"
collection_name = "docs"
embedding_model = "text-embedding-3-large"
load_dotenv(override=True)
openai = OpenAI()

## Loading data with semantic chunking and pre-processing

Loading in the data and splitting it into chunks with the help of an LLM

In [None]:

base_path = Path("knowledge-base")
documents = []

for folder in base_path.iterdir():
    doc_type = folder.name
    for file in folder.rglob("*.md"):
        with open(file, "r", encoding="utf-8") as f:
            documents.append({
                "type": doc_type,
                "source": file.as_posix(),
                "text": f.read()
            })

print(f"Loaded {len(documents)} documents")

In [None]:
documents[0]

In [None]:
class Result(BaseModel):
    page_content: str
    metadata: dict


class Chunk(BaseModel):
    headline: str = Field(
        description="A brief heading for this chunk, typically a few words, that is most likely to be surfaced in a query",
    )
    summary: str = Field(
        description="A few sentences summarizing the content of this chunk to answer common questions"
    )
    original_text: str = Field(
        description="The original text of this chunk from the provided document, exactly as is, not changed in any way"
    )

    def as_result(self, document):
        metadata = {"source": document["source"], "type": document["type"]}
        return Result(
            page_content=self.headline + "\n\n" + self.summary + "\n\n" + self.original_text,
            metadata=metadata,
        )


class Chunks(BaseModel):
    chunks: list[Chunk]


In [None]:
def make_prompt(document):
    how_many = (len(document["text"]) // 800) + 1
    return f"""
You take a document and you split the document into overlapping chunks for a KnowledgeBase.

The document is from the shared drive of a company called Insurellm.
The document is of type: {document["type"]}
The document has been retrieved from: {document["source"]}

A chatbot will use these chunks to answer questions about the company.
You should divide up the document as you see fit, being sure that the entire document is returned in the chunks - don't leave anything out.
This document should probably be split into {how_many} chunks, but you can have more or less as appropriate.
There should be overlap between the chunks as appropriate; typically about 25% overlap or about 50 words, so you have the same text in multiple chunks for best retrieval results.

For each chunk, you should provide a headline, a summary, and the original text of the chunk.
Together your chunks should represent the entire document with overlap.

Here is the document:

{document["text"]}

Respond with the chunks.
"""

def make_messages(document):
    return [
        {"role": "user", "content": make_prompt(document)},
    ]

In [None]:


async def process_document(document):
    messages = make_messages(document)
    response = await acompletion(model=MODEL, messages=messages, response_format=Chunks)
    reply = response.choices[0].message.content
    doc_as_chunks = Chunks.model_validate_json(reply).chunks
    return [chunk.as_result(document) for chunk in doc_as_chunks]


async def create_chunks(documents, batch_size=5):
    chunks = []
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i : i + batch_size]
        tasks = [process_document(doc) for doc in batch]
        results = await asyncio.gather(*tasks)
        for result in results:
            chunks.extend(result)
    return chunks

chunks = await create_chunks(documents)


In [None]:
print(len(chunks))
chunks[2]

In [None]:
def create_embeddings(chunks):
    chroma = PersistentClient(path=db_name)
    if collection_name in [c.name for c in chroma.list_collections()]:
        chroma.delete_collection(collection_name)

    texts = [chunk.page_content for chunk in chunks]
    emb = openai.embeddings.create(model=embedding_model, input=texts).data
    vectors = [e.embedding for e in emb]

    collection = chroma.get_or_create_collection(collection_name)

    ids = [str(i) for i in range(len(chunks))]
    metas = [chunk.metadata for chunk in chunks]

    collection.add(ids=ids, embeddings=vectors, documents=texts, metadatas=metas)
    print(f"Vectorstore created with {collection.count()} documents")

create_embeddings(chunks)
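The cell above sends every chunk to the embeddings endpoint in a single request, which may be rejected once the knowledge base grows. A possible workaround is to embed in batches. This is a sketch only: `batched`, `embed_in_batches`, and the batch size of 100 are illustrative choices, and `fake_embed` is a stand-in so the example runs without an API key.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call `embed_fn` once per batch and concatenate the resulting vectors in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    """Stand-in embedder (hypothetical): one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

In this notebook, `embed_fn` could wrap the existing call, for example `lambda batch: [e.embedding for e in openai.embeddings.create(model=embedding_model, input=batch).data]`.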

In [None]:
chroma = PersistentClient(path=db_name)
collection = chroma.get_or_create_collection(collection_name)

In [None]:
# How many documents are in the vector store? How many dimensions?

count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store")

In [None]:
# Gather the vectors, documents and metadata

result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']
doc_types = [metadata['source'].split('/')[1] for metadata in metadatas]
colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]

In [None]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding)

tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    # `scene` only applies to 3D plots; set the 2D axis titles directly
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
chroma = PersistentClient(path=db_name)
collection = chroma.get_or_create_collection(collection_name)

In [None]:
def fetch_context(question, k=5):
    query = openai.embeddings.create(model=embedding_model, input=[question]).data[0].embedding
    results = collection.query(query_embeddings=[query], n_results=k)
    chunks = []
    for result in zip(results["documents"][0], results["metadatas"][0]):
        chunks.append(Result(page_content=result[0], metadata=result[1]))
    return chunks


fetch_context("Who is Avery?", 3)

## Code to Call OpenAI

In [None]:
def make_context(chunks):
    result = ""
    for chunk in chunks:
        result += f"Extract from {chunk.metadata['source']}:\n{chunk.page_content}\n\n"
    return result

def make_rag_prompt(question,chunks):
    context = make_context(chunks)
    return f"""
The user has asked the following question:

{question}

For context, here are extracts from the Knowledge Base that might be relevant:

{context}

With this context, please answer the question. Reply only with the answer for the user.
"""

def make_rag_messages(question, chunks):
    return [
        {"role": "system", "content": "You are a helpful assistant that answers questions about the company Insurellm based on the context provided. If you don't know the answer, say so."},
        {"role": "user", "content": make_rag_prompt(question, chunks)}
    ]


In [None]:
question = "Who is Avery?"
chunks = fetch_context(question)
make_rag_messages(question, chunks)


In [None]:
def answer_question(question):
    chunks = fetch_context(question)
    messages = make_rag_messages(question, chunks)
    response = openai.chat.completions.create(model=MODEL, messages=messages, temperature=0)
    return response.choices[0].message.content

In [None]:
answer_question("Who is Avery?")

# CHALLENGE:

You will be changing or replacing 2 modules:

`ingest.py`

`answer.py`

They are VERY simple! Let's look at them.

## Now check out ingest.py

Then run at the terminal:

`uv run ingest.py`

In [None]:
!uv run ingest2.py

## Now check out answer.py

In [None]:
from answer2 import fetch_context, answer_question

fetch_context("Who is Avery?")

In [None]:
result, chunks = await answer_question("Who is Avery?")
display(Markdown(result))

## Now check out app.py

As long as you keep the same 2 functions in `answer.py`, this UI will keep working!
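The real signatures in `answer.py` are not shown in this notebook, so the following is only a guess at the interface, based on how the two functions are called above: `fetch_context(question)` returns a list of chunks, and `answer_question(question)` is awaited and returns the answer together with the chunks. The `Result` dataclass and the placeholder return values are hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Result:
    """Stand-in for the chunk type: text plus metadata (assumed shape)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fetch_context(question: str) -> list[Result]:
    """Return the chunks relevant to the question (placeholder retrieval)."""
    return [Result(page_content=f"(placeholder context for: {question})",
                   metadata={"source": "stub"})]


async def answer_question(question: str) -> tuple[str, list[Result]]:
    """Return the answer and the chunks it was based on (placeholder answer)."""
    chunks = fetch_context(question)
    answer = f"(placeholder answer built from {len(chunks)} chunk(s))"
    return answer, chunks


# In a script, run the coroutine with asyncio.run; inside a notebook, use `await` instead.
answer, chunks = asyncio.run(answer_question("Who is Avery?"))
print(answer)
```

Whatever you change inside these two functions, keeping the names, arguments, and return shapes stable is what lets the other modules keep calling them.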

In [None]:
!uv run app.py

## OK - Now it's time to EVALUATE!

### First check out tests.jsonl for all the questions

And see how it's loaded in test.py


In [None]:
from test import load_tests

test_data = load_tests()

print(len(test_data))
print(test_data[0])
print(test_data[1])



In [None]:
print(set(test.category for test in test_data))


## Now take a look at eval.py

`test_data[0]` is a very hard question that the RAG pipeline sometimes gets wrong  
`test_data[1]` is an easy question
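`eval.py` is not reproduced in this notebook, so the sketch below is not its implementation. It shows one common way to score retrieval, mean reciprocal rank (MRR), under the assumption that each test lists keywords that should appear in the retrieved chunks. The function names and sample strings are made up.

```python
def reciprocal_rank(keyword, retrieved_texts):
    """Return 1/rank of the first retrieved text containing the keyword, or 0.0 if none does."""
    for rank, text in enumerate(retrieved_texts, start=1):
        if keyword.lower() in text.lower():
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(keywords, retrieved_texts):
    """Average the reciprocal rank over all expected keywords."""
    if not keywords:
        return 0.0
    return sum(reciprocal_rank(k, retrieved_texts) for k in keywords) / len(keywords)


# Hypothetical retrieval output, best match first
retrieved = ["alpha beta", "gamma delta", "epsilon"]

# "gamma" is found at rank 2, "epsilon" at rank 3, "zeta" is never found
score = mean_reciprocal_rank(["gamma", "epsilon", "zeta"], retrieved)
print(round(score, 4))
```

A score of 1.0 would mean every keyword appears in the top-ranked chunk; lower scores mean the relevant text is further down the list or missing.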

In [None]:
from eval import evaluate_retrieval, evaluate_answer

evaluate_retrieval(test_data[0])

In [None]:
await evaluate_answer(test_data[0])

## AND FINALLY - all come together in a UI

In [None]:
!uv run evaluator.py

## Ideas for your experiments

### Quick wins

- Experiment with the encoder
- Experiment with chunking strategies

### Big change ideas

1. Pre-processing - use an LLM to rewrite (a) the chunks and/or (b) the questions / conversation history
2. Hierarchical RAG - summarize at different levels and do RAG over summaries
3. Tools!

# 10 RAG Techniques

1. **Chunking R&D:** experiment with chunking strategy to optimize for your commercial goal
2. **Encoder R&D:** select the best Encoder model based on a test set
3. **Improve Prompts:** general content, the current date, relevant context and history
4. **Document pre-processing:** use an LLM to make the chunks and/or text for encoding
5. **Query rewriting:** use an LLM to convert the user’s question to a RAG query
6. **Query expansion:** use an LLM to turn the question into multiple RAG queries
7. **Re-ranking:** use an LLM to sub-select from RAG results
8. **Hierarchical:** use an LLM to summarize at multiple levels
9. **Graph RAG:** retrieve content closely related to similar documents
10. **Agentic RAG:** use Agents for retrieval, combining with Memory and Tools such as SQL
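Query expansion (technique 6) produces several result lists that need to be merged into one. The list above does not say how to merge them; one widely used option is reciprocal rank fusion (RRF), sketched here. The document ids are placeholders and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of ids; each appearance adds 1 / (k + rank) to that id's score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical retrieval results for three rewrites of the same question
results_per_query = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "e"],
]

fused = reciprocal_rank_fusion(results_per_query)
print(fused)
```

Ids that rank highly across several queries rise to the top, so the fused list can be truncated to the top few chunks before building the prompt.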


Two hard questions that the techniques above can address:

- Who won the IIOTY award in 2023?

- What proportion of employees have a salary over $90,000?
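The salary question needs an aggregate over every employee record, which retrieving a handful of chunks is unlikely to provide. One option, in the spirit of technique 10, is to give the model a SQL tool. The sketch below shows only the query side, using an in-memory SQLite table; the table layout, names, and salary figures are invented for illustration and do not come from the knowledge base.

```python
import sqlite3

# In-memory database with made-up rows (NOT real Insurellm data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Employee A", 85000),
        ("Employee B", 95000),
        ("Employee C", 120000),
        ("Employee D", 70000),
    ],
)


def proportion_over(conn, threshold):
    """Return the fraction of employees whose salary exceeds the threshold."""
    # In SQLite a comparison evaluates to 0 or 1, so AVG gives the proportion
    (proportion,) = conn.execute(
        "SELECT AVG(salary > ?) FROM employees", (threshold,)
    ).fetchone()
    return proportion


print(proportion_over(conn, 90000))
```

In a full solution, an ingest step would first extract structured fields such as salary from the documents into a table like this, and the LLM would call the function as a tool.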

