# Workshop: From Simple to Agentic RAG

<a target="_blank" href="https://colab.research.google.com/github/parambharat/workshops/blob/main/rag/workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setup

Ensure you have a W&B account and API key. You should be able to get one from [here](https://wandb.ai/authorize).

Next head over to [Tavily](https://app.tavily.com/home) and sign up for a free account and get your API key.

We will use [`uv`](https://docs.astral.sh/uv/getting-started/installation/) to install the workshop package and its dependencies.


```bash
!git clone https://github.com/parambharat/workshops.git
%cd workshops/rag
!pip install uv
!uv sync
```

Create a `.env` file in the `workshops/rag` directory and add the API keys to it:

```.env
PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
TAVILY_API_KEY="YOUR_TAVILY_API_KEY"
WANDB_PROJECT="rag-workshop-iiit-blr"
```


In [None]:
!git clone https://github.com/parambharat/workshops.git
%cd workshops/rag
!pip install -qqq uv 
!uv sync -q

In [None]:
# @title API KEYS
import os
PINECONE_API_KEY = 'PINECONE_API_KEY' # @param {type:"string"}
OPENAI_API_KEY = 'OPENAI_API_KEY' # @param {type:"string"}
TAVILY_API_KEY = 'TAVILY_API_KEY' # @param {type:"string"}
WANDB_PROJECT = "rag-workshop-pc-nyc" # @param {type:"string"}

with open("./.env", "w+") as env_f:
  env_f.write(f'PINECONE_API_KEY="{PINECONE_API_KEY}"\n')
  env_f.write(f'OPENAI_API_KEY="{OPENAI_API_KEY}"\n')
  env_f.write(f'TAVILY_API_KEY="{TAVILY_API_KEY}"\n')
  env_f.write(f'WANDB_PROJECT="{WANDB_PROJECT}"')

os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["TAVILY_API_KEY"] = TAVILY_API_KEY
os.environ["WANDB_PROJECT"] = WANDB_PROJECT



In [1]:
# %load_ext dotenv
# %dotenv
# %load_ext autoreload
# %autoreload 2

In [None]:
import os
import nltk
nltk.download("wordnet")
nltk.download("punkt")
nltk.download("punkt_tab")

!wandb login

In [3]:
import json
import wandb
import weave
from copy import deepcopy
import nest_asyncio
nest_asyncio.apply()

In [None]:
from utils import convert_contents_to_text, make_id, render_doc, printmd, chunk_simple, chunk_markdown, load_dataset, chunk_dataset, run_llm
from retrieval_metrics import RetrievalScorer
from response_metrics import ResponseScorer
from retriever import TfidfSearchEngine, BM25SearchEngine, DenseSearchEngine, Retriever, RetrieverWithReranker, HybridRetrieverWithReranker, VectorStoreSearchEngine
from generation import SimpleResponseGenerator, QueryEnhancedResponseGenerator
from pipeline import SimpleRAGPipeline, QueryEnhancedRAGPipeline
from query_enhancer import QueryEnhancer
from agent import Agent

In [None]:
response = await run_llm(messages=[{"role": "user", "content": "What is Weights and Biases and what is Pinecone?"}])
printmd(response)

## Weave

Weave is a lightweight toolkit for tracking and evaluating LLM applications (this is a very simplified description of Weave, though). It helps you:

- Log and debug language model inputs, outputs, and traces
- Build rigorous, apples-to-apples evaluations for language model use cases
- Organize all the information generated across the LLM workflow, from experimentation to evaluations to production

Don't worry about the details for now. We will see more of it as we build our RAG pipeline.

For now, let's initialize W&B Weave. Once initialized, Weave will start tracking (more on this later) the inputs and outputs, along with underlying attributes (model name, top_k, etc.), of all the LLMs, functions, evaluations, etc. that we call.

In [None]:
# Initialize Weave with the project name
weave_client = weave.init(WANDB_PROJECT)

## Preparing the dataset
We'll use [Weave Datasets](https://weave-docs.wandb.ai/guides/core-types/datasets) to track and version data as the inputs and outputs of our W&B Runs. 

To learn more about how we prepared the dataset for this workshop, check out the [dataset preparation script](./download_finance_docs.py).


In [None]:
from download_finance_docs import PDFProcessor
processor = PDFProcessor()
data = processor.load_pdf_documents()

In [None]:
docs_dir = "../data/finance_docs"

In [None]:
import pathlib

docs_dir = pathlib.Path(docs_dir)
docs_files = sorted(docs_dir.rglob("*.pdf"))

print(f"Number of files: {len(docs_files)}\n")
print("First 5 files:\n{files}".format(files="\n".join(map(str, docs_files[:5]))))

Our documents are stored as dictionaries with content (_raw text_) and additional metadata.

Metadata is extra information for that data point which can be used to group together similar data points, or filter out a few data points.
We will see in future chapters the importance of metadata and why it should not be ignored while building the ingestion pipeline.

The metadata can be derived (`file_type`) or is inherent (`uri`) to the data point.


In [None]:
# `data` holds the documents loaded above; each one has a "content" field with the raw text
print("Total Words: ", sum(len(doc["content"].split()) for doc in data))


Checking the total number of tokens in your data source is good practice. In this case, the total is 500k+ tokens, which is more than most LLMs can process in a single prompt. Building a RAG pipeline is justified in such cases.

**Note:** We are simply counting words and calling them `tokens` here. This only approximates the actual token count, which will be a bigger number.
In practice we would use a [tokenizer](https://docs.cohere.com/docs/tokens-and-tokenizers) to calculate the token counts, but this naive calculation is an acceptable approximation for now.

In [None]:
# build weave dataset
raw_data = weave.Dataset(name="raw_data", rows=data)

# publish the dataset
weave.publish(raw_data)

## A simple RAG pipeline

First, we will build a very simple RAG pipeline.
We will mainly focus on how to preprocess and chunk the data followed by building the simplest retrieval engine without using any fancy "Vector databases".
The idea is to get a sense of the inner workings of a retrieval pipeline and understand the workflow from a user query to a generated response from an LLM.
This end-to-end workflow will help you understand the importance of each step in a RAG pipeline.

![](./imgs/SimpleRAG.png)

### Chunking the data

Each document contains a large number of tokens, so we need to split it into smaller chunks to manage the number of tokens per chunk. This approach serves three main purposes:

* Most embedding models have a limit of tokens per input (based on their training data and parameters).

* Chunking allows us to retrieve and send only the most relevant portions to our LLM, significantly reducing the total token count. This helps keep the LLM's cost and processing time manageable.

* When the text is small-sized, embedding models tend to generate better vectors as they can capture more fine-grained details and nuances in the text, resulting in more accurate representations.

![](./imgs/SimpleChunking.png)


When choosing chunk size, consider these trade-offs:

- Smaller chunks (100-200 tokens):
  * More precise retrieval
  * Better for finding specific details
  * May lack broader context

- Larger chunks (500-1000 tokens):
  * Provide more context
  * Capture more coherent ideas
  * May introduce noise and reduce precision

The optimal size depends on your data, expected queries, and model capabilities. Experiment with different sizes to find the best balance for your use case.

Here we are chunking each content (text) to a maximum length of 500 tokens (`CHUNK_SIZE`). This is called **FIXED CHUNKING**. Although naive, it's a good starting point.

For now, we will not be overlapping (`CHUNK_OVERLAP`) the content of one chunk with another chunk.

We will be using the `chunk_simple` function to chunk the data.
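
To make the idea of fixed chunking concrete, here is a minimal sketch. It is not the workshop's `chunk_simple` implementation: the function name `chunk_fixed` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def chunk_fixed(tokens, chunk_size=500, chunk_overlap=0):
    """Split a list of tokens into windows of at most `chunk_size` tokens.

    Hypothetical helper for illustration; consecutive windows share
    `chunk_overlap` tokens when it is greater than zero.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words approximate tokens for this toy example
words = "the quick brown fox jumps over the lazy dog".split()
no_overlap = chunk_fixed(words, chunk_size=4)
with_overlap = chunk_fixed(words, chunk_size=4, chunk_overlap=1)

print([len(chunk) for chunk in no_overlap])
print(with_overlap)
```

With an overlap of 1, the last word of each chunk is repeated as the first word of the next one, which is the behaviour `CHUNK_OVERLAP` refers to.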



In [None]:
document_chunks = []
for doc in raw_data.rows:
    # Chunk the content of each document
    chunks = chunk_simple(doc["content"], chunk_size=500, model="gpt-4o-mini")
    # Create a unique ID for the document
    doc_id = make_id(doc["content"])
    # Iterate over the chunks and create a new document for each chunk
    for chunk in chunks:
        doc_chunk = deepcopy(doc)
        # Store the chunk as part of the document
        doc_chunk["chunk"] = chunk
        # Convert the chunk to text to be used as input for the LLM
        doc_chunk["text"] = convert_contents_to_text(chunk)
        # Create a unique ID for the chunk
        doc_chunk["chunk_id"] = make_id(chunk)
        # Add the document ID to the chunk
        doc_chunk["doc_id"] = doc_id
        # Append the chunk to the list of document chunks
        document_chunks.append(doc_chunk)

# Let's see the first 5 chunks
document_chunks[:5]



In [None]:
dataset = weave.Dataset(name="chunked_data", rows=document_chunks)
weave.publish(dataset)

### Building the simplest retrieval engine

One of the key ingredients of most retrieval systems is representing the given modality (text in our case) as a vector.

This vector is a numerical representation of the "content" of that modality (text).

Text vectorization (text to vector) can be done using various techniques, such as:
- [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model)
- [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) (Term Frequency-Inverse Document Frequency)
- static embeddings like [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) and [GloVe](https://nlp.stanford.edu/projects/glove/)
- context-aware embeddings from transformer-based models like BERT, which capture the semantic meaning and relationships between words or sentences


First, we'll use TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize our contents. Here's why:

- **Simplicity**: TF-IDF is straightforward to implement and understand, making it an excellent starting point for RAG systems.
- **Efficiency**: It's computationally lightweight, allowing for quick processing of large document collections.
- **No training required**: Unlike embedding models, TF-IDF doesn't need pre-training, making it easy to get started quickly.
- **Interpretability**: The resulting vectors are directly related to word frequencies, making them easy to interpret.

While more advanced methods like embeddings **might** provide better performance, especially for semantic understanding, we'll explore these later as we progress through the workshop.
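
To see what TF-IDF retrieval does under the hood, here is a small pure-Python sketch. It is an illustration only and not the code behind `TfidfSearchEngine`: the helper names are hypothetical, tokenization is a plain lowercase split, and the smoothed IDF formula is one common variant chosen as an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace split is enough for a toy example
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a {term: weight} dict; unseen terms are dropped."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "apple reported higher net sales this year",
    "the bank raised interest rates again",
    "net sales of the iphone declined",
]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in docs]

query_vector = tfidf_vector("apple net sales", idf)
scores = [cosine(query_vector, vec) for vec in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best], scores)
```

The second document shares no words with the query, so its score is exactly zero. This is the lexical nature of sparse retrieval: no shared terms means no match, regardless of meaning.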


In [None]:
query = "How has Apple's total net sales changed over time?"

# Our search engine is a very simple class that first indexes the documents and
# then uses the search method to retrieve the top-k results
# that are most similar to the query using cosine distance.
# These are what we call sparse retrieval engines because they are based on the term frequency of the words in the documents;
# they have more to do with lexical similarity than semantic similarity.
tfidf_search_engine = await TfidfSearchEngine().fit(document_chunks)

# The retriever is a simple wrapper around the search engine that uses the search method
# this is mostly for the sake of consistency and to have a common interface for all the retrieval engines
class TFIDFRetriever(Retriever):
    pass

tfidf_retriever = TFIDFRetriever(search_engine=tfidf_search_engine)


retrieved_docs = await tfidf_retriever.invoke(query=query, top_k=5)
# we can look at the retrieved docs by printing them with the render_doc
# basically a fancy print statement with some formatting for the doc content and metadata
for doc in retrieved_docs:
    render_doc(doc)

Note that the `Retriever` class above inherits from `weave.Model`.

A `Model` is a combination of data (which can include configuration, trained model weights, or other information) and code that defines how the model operates. By structuring your code to be compatible with this API, you benefit from a structured way to version your application so you can more systematically keep track of your experiments.

To create a model in Weave, you need the following:

- a class that inherits from `weave.Model`
- type definitions on all attributes
- a typed `predict`, `invoke` or `forward` method decorated with `@weave.op`.

Think of `weave.op` as a drop-in replacement for a `print` or logging statement.
However, it does a lot more than just printing and logging: it tracks the inputs and outputs of the function and stores them as Weave objects.
In addition to state tracking, you also get a nice Weave UI to inspect the inputs, outputs, and other metadata.

If you have not initialized Weave by calling `weave.init`, the code will work as is, without any tracking.

The method decorated with `weave.op()` (`invoke` in our case) will track the model settings along with the inputs and outputs anytime you call it.

### Generating a response

There are two components of any RAG pipeline - a `Retriever` and a `ResponseGenerator`. Having designed a simple retriever, here we are designing a `SimpleResponseGenerator`.

The `SimpleResponseGenerator` takes the user question along with the retrieved context (documents) as inputs and makes an LLM call using the `model` and `prompt` (system prompt). This way the generated answer is grounded in the documentation (our use case). In this case we are using Cohere's `command-r` model.

As earlier, we have wrapped this `SimpleResponseGenerator` class with weave for tracking the inputs and the output.

The `ResponseGenerator` also has a `system_prompt` that we pass to the LLM. Consider this to be the set of instructions we give to the LLM on what to do with the user's query and the retrieved documents.
In practice, the system prompt can be detailed and involved (depending on the use case), but we are using a very simple prompt here.
Later we will iterate on it and show how improving the system prompt improves the quality of the generated response.

In [None]:
docs_data = [{"document": item["text"]} for item in retrieved_docs]
simple_response_generator = SimpleResponseGenerator()
response = await simple_response_generator.invoke(
    query=query, documents=docs_data
)
printmd(response["choices"][0]["message"]["content"])

### Bringing everything together

Finally, with all the components ready, we can bring everything together into a single pipeline.

Here, we define a `weave.Model` class `SimpleRAGPipeline` which combines the steps of retrieval and response generation.

We'll define an `invoke` method that takes the user query, retrieves relevant context using the retriever, and finally synthesizes a response using the response generator.

We'll also define a few convenience methods to format the documents retrieved from the retriever and create a system prompt for the response generator.

In [None]:
# we are using the SimpleRAGPipeline class to combine the steps of retrieval and response generation

class TFIDFRAGPipeline(SimpleRAGPipeline):
    pass

# Instantiate the pipeline with the retriever and response generator
tfidf_rag_pipeline = TFIDFRAGPipeline(
    retriever=tfidf_retriever,
    generator=simple_response_generator
)

# Invoke the pipeline with the user query
response = await tfidf_rag_pipeline.invoke(query=query,)
printmd(response["answer"])

## Evaluating the RAG Pipeline

Before we start making further changes to the RAG pipeline, it is good practice to evaluate the current state of the pipeline.
This will help us understand the current performance of the pipeline and identify areas for improvement.
Think of evaluation as a way to measure the performance of your system: a benchmark against which later versions can be compared.

Evaluating a RAG pipeline is a crucial step in ensuring its robustness and effectiveness.
We will evaluate the two main components of a RAG pipeline - retriever and response generator.

Evaluating the retriever can be considered component evaluation.
Depending on your RAG pipeline, there can be several components, and to ensure the robustness of your system,
it is recommended to come up with an evaluation for each component.

![](./imgs/EvolvingRAG.png)


### Collecting data for evaluation

We are using a subset of the evaluation dataset we had created for wandbot.

Learn more about how we created the evaluation dataset here:

- [How to Evaluate an LLM, Part 1: Building an Evaluation Dataset for our LLM System](https://wandb.ai/wandbot/wandbot-eval/reports/How-to-Evaluate-an-LLM-Part-1-Building-an-Evaluation-Dataset-for-our-LLM-System--Vmlldzo1NTAwNTcy)
- [How to Evaluate an LLM, Part 2: Manual Evaluation of Wandbot, our LLM-Powered Docs Assistant](https://wandb.ai/wandbot/wandbot-eval/reports/How-to-Evaluate-an-LLM-Part-2-Manual-Evaluation-of-Wandbot-our-LLM-Powered-Docs-Assistant--Vmlldzo1NzU4NTM3)

The main takeaways from these reports are:

- We first deployed wandbot for internal usage based on rigorous eyeballing-based evaluation.
- The user query distribution was thoroughly analyzed and clustered. We sampled representative queries from these clusters and created a gold-standard set of queries.
- We then used in-house MLEs to perform manual evaluation using Argilla. Creating such evaluation platforms is easy.
- To summarize, speed is the key here. Use whatever means you have to create a meaningful eval set.

The evaluation samples are logged as [`weave.Dataset`](https://wandb.github.io/weave/guides/core-types/datasets/). `weave.Dataset` enables you to collect examples for evaluation and automatically track versions for accurate comparisons.

Below we will download the latest version locally with a simple API.

In [None]:
# here we are downloading the latest version of the dataset from Weave
eval_dataset = weave.ref("weave:///a-sh0ts/rag-course-finance/object/eval_data:CoQDvdOENbZqkwg7IlhZm33drBCAf9OUNvf8ar6YHzM").get()

print("Number of evaluation samples: ", len(eval_dataset.rows))

Iterating through each sample is easy.

We have the question, ground truth answer and ground truth contexts.

In [None]:
eval_dataset.rows[0]

Again, we'll use W&B Weave for our evaluation purposes.
The `weave.Evaluation` class is a lightweight class that can be used to evaluate the performance of a `weave.Model` on a `weave.Dataset`.

### Evaluating the Retriever

The fundamental idea of evaluating a retriever is to check how well the retrieved content matches the expected content.
To evaluate a RAG pipeline, we need query and ground truth answer pairs.
Each ground truth answer must be grounded in some ground truth chunks.
Since this is a search problem, it's easiest to start with traditional information retrieval metrics.



You might already have access to such an evaluation dataset, depending on the nature of your application, or you can build one synthetically.
To build one, you can retrieve random documents/chunks and ask an LLM to generate query-answer pairs; the underlying documents/chunks will act as your ground truth chunks.

Here, we will look at different metrics that can be used to evaluate the retriever we built earlier.

### Metrics to evaluate retriever

We can evaluate a retriever using traditional ML metrics. We can also evaluate it by using a powerful LLM (next section).

Below we are importing both the traditional metrics and a classifier-based metric from the [`retrieval_metrics.py`](retrieval_metrics.py) file.

* **Hit Rate**: Measures the proportion of queries where the retriever successfully returns at least one relevant document.
* **MRR (Mean Reciprocal Rank)**: Evaluates how quickly the retriever returns the first relevant document, based on the reciprocal of its rank.
* **NDCG (Normalized Discounted Cumulative Gain)**: Assesses the quality of the ranked retrieval results, giving more importance to relevant documents appearing earlier.
* **MAP (Mean Average Precision)**: Computes the mean precision across all relevant documents retrieved, considering the rank of each relevant document.
* **Precision**: Measures the ratio of relevant documents retrieved to the total documents retrieved by the retriever.
* **Recall**: Evaluates the ratio of relevant documents retrieved to the total relevant documents available for the query.
* **F1 Score**: The harmonic mean of precision and recall, providing a balance between both metrics to gauge retriever performance.
* **Relevance Score**: A binary score predicted by a pre-trained classifier that measures the relevance of the retrieved documents to the query.

We will be using the `RetrievalScorer` class to evaluate the retriever.
Each metric expects an `output` which is a list of retrieved chunks from the retriever and `contexts` which is a list of ground truth contexts from the evaluation dataset.
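
The rank-based metrics above can be sketched for a single query as follows. This is a simplified illustration, not the code in `retrieval_metrics.py`: the function names are hypothetical, relevance is binary, and retrieved IDs are assumed to be unique. Averaging the per-query values over the evaluation set gives the reported hit rate and MRR.

```python
import math


def hit_rate(retrieved, relevant):
    # 1.0 if at least one relevant chunk was retrieved, else 0.0
    return 1.0 if any(doc_id in relevant for doc_id in retrieved) else 0.0


def reciprocal_rank(retrieved, relevant):
    # Reciprocal of the rank of the first relevant chunk (0.0 if none)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_recall_f1(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def ndcg(retrieved, relevant):
    # Binary gains: a relevant chunk at rank r contributes 1 / log2(r + 1)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved, start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example: chunk IDs returned by a retriever vs. ground truth chunk IDs
retrieved = ["c3", "c1", "c7", "c2", "c9"]
relevant = {"c1", "c2"}

print("hit rate:", hit_rate(retrieved, relevant))
print("reciprocal rank:", reciprocal_rank(retrieved, relevant))
print("precision, recall, f1:", precision_recall_f1(retrieved, relevant))
print("ndcg:", ndcg(retrieved, relevant))
```

Here both relevant chunks are retrieved (perfect recall), but they sit at ranks 2 and 4, so the rank-aware metrics (reciprocal rank and NDCG) are penalized even though recall is not.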


In [17]:
retrieval_evaluation = weave.Evaluation(
    name="Retrieval_Evaluation",
    dataset=eval_dataset,
    scorers=[RetrievalScorer(name="retrieval_scorer", description="Retrieval metrics")],
    preprocess_model_input=lambda x: {"query": x["question"], "top_k": 5},
)

In [None]:
tfidf_retrieval_scores = await retrieval_evaluation.evaluate(
    model=tfidf_retriever,
    __weave={"display_name": "TFIDF Retrieval"}
)


### Evaluating the Response Generation

Evaluating the generated response is a bit more complex. Although we have some ground truth answers, they are not directly comparable to the generated response. This could be due to a few reasons:

- The generated response might be a paraphrased version of the ground truth answer.
- The generated response might be a more detailed version of the ground truth answer.
- The generated response might be a more concise version of the ground truth answer.
- The generated response might be a more structured version of the ground truth answer.


To account for this, we use several complementary metrics. Some are based on the distance between the generated response and the ground truth answer, while others are predicted by a classifier:
- **Diff Score**: compute the similarity ratio between the normalized model output and the expected answer.
- **Levenshtein Score**: compute the Levenshtein ratio between the normalized model output and the answer.
- **ROUGE Score**: compute the ROUGE-L F1 score between the normalized model output and the reference answer.
- **BLEU Score**: compute the BLEU score between the normalized model output and the reference answer.
- **METEOR Score**: compute the METEOR score between the normalized model output and the reference answer.
- **Correctness Score**: a binary score predicted by a pre-trained classifier that measures the correctness of the generated response.
- **Helpfulness Score**: a binary score predicted by a pre-trained classifier that measures the helpfulness of the generated response.
- **Relevance Score**: a binary score predicted by a pre-trained classifier that measures the relevance of the generated response to the query.

While these metrics are not perfect, they provide a good north star for evaluating the response generation and can be used to compare the performance of the response generation model.

We will be using the `ResponseScorer` class to evaluate the response generation.
Each metric expects an `output` containing the generated response and the ground truth `answer`.
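
The two simplest distance-based metrics can be sketched with the standard library alone. This is an illustration and not the `ResponseScorer` implementation: the normalization step (lowercasing, stripping punctuation, collapsing whitespace) and the definition of the Levenshtein ratio as one minus the length-normalized edit distance are assumptions.

```python
import difflib
import string


def normalize(text):
    # Assumed normalization: lowercase, drop punctuation, collapse whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def diff_score(output, answer):
    """Similarity ratio in [0, 1] from difflib's sequence matcher."""
    return difflib.SequenceMatcher(None, normalize(output), normalize(answer)).ratio()


def levenshtein_distance(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_score(output, answer):
    a, b = normalize(output), normalize(answer)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


generated = "Net sales rose."
reference = "net sales rose"
print(diff_score(generated, reference), levenshtein_score(generated, reference))
print(levenshtein_score("Net sales increased slightly", reference))
```

Both scores reward surface overlap, so a correct paraphrase can still score low. That is why the classifier-based scores are reported alongside them.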

In [19]:
# Again, we are using the `weave.Evaluation` class to evaluate the response generation.
response_evaluation = weave.Evaluation(
    name="Response_Evaluation",
    dataset=eval_dataset,
    scorers=[ResponseScorer(name="response_scorer", description="Response metrics")],
    preprocess_model_input=lambda x: {"query": x["question"]},
)


In [None]:
tfidf_response_scores = await response_evaluation.evaluate(
    model=tfidf_rag_pipeline,
    __weave={"display_name": "TFIDF RAG Pipeline"}
)

## Improving the RAG Pipeline

One of the most common ways to improve the performance of a RAG pipeline is to use a better retrieval algorithm. It is the low hanging fruit and often the first thing to try. This is because the retrieval step is one of the most important steps in a RAG pipeline. With a good retrieval algorithm, we can retrieve more relevant documents and improve the quality and relevance of the generated response.

### Using BM25 for Retrieval

BM25 is a popular retrieval algorithm that ranks documents using the term frequency of the query words in each document, weighted by how rare each word is across the corpus and normalized by document length.
Many production IR systems use BM25.

For instance, Elasticsearch uses BM25 as its default ranking algorithm.
It's a simple algorithm that is easy to implement and has been shown to be effective in many cases.
Although it doesn't perform semantic search, it's a good starting point for improving the retrieval step.
You'll be surprised how much it alone can improve the performance of the RAG pipeline.

We will be using the `BM25SearchEngine` class to create a BM25 search engine.
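
For intuition, here is a compact sketch of the Okapi BM25 scoring formula. It is not the `BM25SearchEngine` implementation: the function name is hypothetical, tokenization is a lowercase whitespace split, and `k1=1.5`, `b=0.75` are commonly used defaults assumed here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in a non-empty corpus against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            # Rare terms get a higher IDF; the +1 keeps it non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates: repeating a word has diminishing returns
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "apple net sales increased",
    "apple apple apple apple apple apple pie recipe with cinnamon and sugar for the holidays",
    "interest rates and inflation outlook",
]
scores = bm25_scores("apple net sales", docs)
print(scores)
```

The second document repeats "apple" six times, yet it scores well below the first one. Term-frequency saturation and length normalization are the two main things BM25 adds over plain TF-IDF.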


In [None]:
# we are still fitting the search engine on the same document chunks as we did for the TFIDF search engine
bm25_search_engine = await BM25SearchEngine().fit(document_chunks)

# A wrapper around the search engine that uses the search method
class BM25Retriever(Retriever):
    pass


bm25_retriever = BM25Retriever(search_engine=bm25_search_engine)

# retrieved_docs = await bm25_retriever.invoke(query=query, top_k=5)

# for doc in retrieved_docs:
#     render_doc(doc)


In [None]:
# we are now ready to evaluate the retrieval performance of the BM25 retrieval algorithm
# we are using the same evaluation dataset as we did for the TFIDF retrieval algorithm
# this should allow us to compare the performance of the two retrieval algorithms
retrieval_scores = await retrieval_evaluation.evaluate(
    model=bm25_retriever,
    __weave={"display_name": "BM25 Retrieval"}
)

While evaluating the retrieval performance is important, we are interested more in the overall performance of the RAG pipeline.

Again, we will be using the `SimpleRAGPipeline` class to create a RAG pipeline using the same response generator as we did earlier.

We are also using the same evaluation dataset as we did for the TFIDF RAG pipeline. This should allow us to compare the performance of the two pipelines.

In [None]:
class BM25RAGPipeline(SimpleRAGPipeline):
    pass


bm25_rag_pipeline = BM25RAGPipeline(
    retriever=bm25_retriever,
    generator=simple_response_generator
)

# response = await bm25_rag_pipeline.invoke(query=query)

# printmd(response["answer"])

response_scores = await response_evaluation.evaluate(
    model=bm25_rag_pipeline,
    __weave={"display_name": "BM25 RAG Pipeline"}
)

### Using Dense Retrieval

Dense retrieval is a retrieval algorithm that uses the embeddings of the documents to retrieve the most relevant documents.

While technically we can use any machine learning model to create embeddings, we will use the embeddings created by some pre-trained embedding model here. Specifically, we will use the embeddings created by the `cohere` embed API.

We model this using the `DenseSearchEngine` class to create a search engine that uses the embeddings to retrieve the most relevant documents using cosine distance.

The retrieved documents are more likely to be semantically relevant than merely lexically similar, i.e. they are likely to match the query in meaning rather than in the exact words used.
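
The search step of dense retrieval can be sketched as a nearest-neighbour lookup over embedding vectors. The three-dimensional vectors below are made up for illustration; in practice they would come from an embedding model, and `dense_search` is a hypothetical helper rather than the `DenseSearchEngine` API.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_search(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs ranked by cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Assumption: toy embeddings; real ones have hundreds or thousands of dimensions
doc_vecs = {
    "revenue_report": [0.9, 0.1, 0.0],
    "sales_summary": [0.8, 0.3, 0.1],
    "hiring_policy": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

results = dense_search(query_vec, doc_vecs, top_k=2)
print(results)
```

Unlike the sparse engines, nothing here depends on shared words: two texts are close whenever the embedding model places them near each other in the vector space.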


In [None]:
# we are fitting the search engine on the same document chunks as we did earlier
dense_search_engine = DenseSearchEngine()
dense_search_engine = await dense_search_engine.fit(document_chunks)

# Another wrapper around the search engine
class DenseRetriever(Retriever):
    pass

dense_retriever = DenseRetriever(search_engine=dense_search_engine)

# retrieved_docs = await dense_retriever.invoke(query=query, top_k=5)
# for doc in retrieved_docs:
#     render_doc(doc)


# # Let's evaluate the retrieval performance of the dense retrieval algorithm
# retrieval_scores = await retrieval_evaluation.evaluate(
#     model=dense_retriever,
#     __weave={"display_name": "Dense Retrieval"}
# )


# The same drill, we create a RAG pipeline and evaluate it.
class DenseRAGPipeline(SimpleRAGPipeline):
    pass

dense_rag_pipeline = DenseRAGPipeline(
    retriever=dense_retriever,
    generator=simple_response_generator
)

response_scores = await response_evaluation.evaluate(
    model=dense_rag_pipeline,
    __weave={"display_name": "Dense RAG Pipeline"}
)


### Reranking the contexts

What if we retrieve more documents? That is, we cast a wider net and retrieve more documents to improve the recall of the retrieval step.

However, with more contexts it is important to pick the ones that add the most knowledge about the given query to the LLM context, i.e. we need to improve the precision of the retrieval step.

![](./imgs/Reranking.png)

For this, a re-ranking model (a separate machine learning model) can be used to calculate a matching score for a given query and document pair.
This score can then be used to rearrange vector search results, ensuring that the most relevant results are prioritized at the top of the list.
Cohere offers its own re-ranking model, which is quite popular.


The `DenseRerankedRetriever` class below applies a re-ranking model on top of the dense search engine: it first retrieves candidate documents and then re-ranks them.
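
The retrieve-then-rerank pattern can be sketched independently of any particular model. In this illustration both scoring functions are toy stand-ins (keyword overlap for the first stage, a phrase-match heuristic in place of a learned re-ranking model), and all names are hypothetical rather than part of `RetrieverWithReranker`.

```python
def retrieve_then_rerank(query, candidates, first_stage_score, rerank_score, top_k=10, top_n=5):
    """Take the top_k candidates from a cheap scorer, then keep the top_n after re-ranking."""
    shortlist = sorted(candidates, key=lambda doc: first_stage_score(query, doc), reverse=True)[:top_k]
    reranked = sorted(shortlist, key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:top_n]


def keyword_overlap(query, doc):
    # Cheap first-stage score: number of distinct query words found in the document
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_reranker(query, doc):
    # Stand-in for a learned re-ranker: reward an exact phrase match, then overlap density
    phrase_bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return phrase_bonus + keyword_overlap(query, doc) / len(doc.split())


candidates = [
    "sales of net fishing gear",
    "apple net sales rose",
    "weather report for today",
    "sales tax rules",
]
top = retrieve_then_rerank(
    "net sales", candidates, keyword_overlap, toy_reranker, top_k=3, top_n=2
)
print(top)
```

The first stage cannot tell the first two candidates apart (both contain "net" and "sales"), while the second stage promotes the one that actually talks about net sales. A real re-ranking model plays the same role with far better judgement, at a higher cost per document, which is why it only runs on the shortlist.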


In [None]:
# Essentially the search engine remains the same but the retrieved documents are re-ranked using a re-ranking model.

class DenseRerankedRetriever(RetrieverWithReranker):
    pass

dense_reranked_retriever = DenseRerankedRetriever(search_engine=dense_search_engine,)


# retrieved_docs = await dense_reranked_retriever.invoke(query=query, top_k=5)
# for doc in retrieved_docs:
#     render_doc(doc)

# Evaluate and compare again with the reranked pipeline.

# retrieval_scores = await retrieval_evaluation.evaluate(
#     model=dense_reranked_retriever,
#     __weave={"display_name": "Dense Reranked Retrieval"}
# )

class DenseRerankedRAGPipeline(SimpleRAGPipeline):
    pass

dense_reranked_rag_pipeline = DenseRerankedRAGPipeline(
    retriever=dense_reranked_retriever,
    generator=simple_response_generator
)

response_scores = await response_evaluation.evaluate(
    model=dense_reranked_rag_pipeline,
    __weave={"display_name": "Dense Reranked RAG Pipeline"}
)


### Hybrid Retrieval

Even though BM25 is an old model for retrieval tasks, it remains a strong baseline on various benchmarks. In machine learning, we often ensemble a few weak classifiers to build a stronger classifier; we can adopt the same idea in our retriever pipeline.

Here we show the concept of a hybrid retriever, which uses two or more retrievers, retrieves chunks from all of them, and then re-ranks the combined results.

![](./imgs/HybridRetrieval.png)

We retrieve chunks with both the BM25 and dense retrieval algorithms and then re-rank the combined results.

One additional thing to note here is that we deduplicate the chunks retrieved from both retrievers before re-ranking. Repeated chunks add no new information and can lead to worse performance if they are not removed.
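
The merge-and-deduplicate step can be sketched as follows. This is an illustration, not the `HybridRetrieverWithReranker` code: the helper name is hypothetical, and it assumes every retrieved chunk carries the `chunk_id` field we created during chunking.

```python
def merge_and_deduplicate(*result_lists, key="chunk_id"):
    """Concatenate result lists, keeping only the first occurrence of each chunk."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc[key] not in seen:
                seen.add(doc[key])
                merged.append(doc)
    return merged


# Toy results: chunk "b" is returned by both the sparse and the dense retriever
sparse_results = [
    {"chunk_id": "a", "text": "net sales by segment"},
    {"chunk_id": "b", "text": "total net sales over time"},
]
dense_results = [
    {"chunk_id": "b", "text": "total net sales over time"},
    {"chunk_id": "c", "text": "revenue growth year over year"},
]

candidates = merge_and_deduplicate(sparse_results, dense_results)
print([doc["chunk_id"] for doc in candidates])
```

The deduplicated candidates are what gets passed to the re-ranker, so each unique chunk is scored once and cannot occupy more than one slot in the final context.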

In [None]:
# Setup the hybrid retriever with the sparse and dense search engines.
hybrid_retriever = HybridRetrieverWithReranker(
    sparse_search_engine=bm25_search_engine,
    dense_search_engine=dense_search_engine
)

# retrieved_docs = await hybrid_retriever.invoke(query=query, top_k=10, top_n=5)

# for doc in retrieved_docs:
#     render_doc(doc)


# hybrid_retrieval_scores = await retrieval_evaluation.evaluate(
#     model=hybrid_retriever,
#     __weave={"display_name": "Hybrid Retrieval"}
# )

class HybridRAGPipeline(SimpleRAGPipeline):
    pass

hybrid_rag_pipeline = HybridRAGPipeline(
    retriever=hybrid_retriever,
    generator=simple_response_generator
)

hybrid_response_scores = await response_evaluation.evaluate(
    model=hybrid_rag_pipeline,
    __weave={"display_name": "Hybrid RAG Pipeline"}
)



## Can we do better?

What if we have more data? We initially used only the W&B documentation for our RAG pipeline. What if we use more data sources, like the `Weave Documentation`, `W&B Tutorials` and `Examples`, `W&B Blog posts`, `W&B Course content`, etc.?

While it might improve the performance of the RAG pipeline, it might also increase the cost and complexity of the pipeline.
Our naive `DenseSearchEngine` might not be able to handle such a large corpus of documents because it requires storing the embeddings in memory.

One specific way to handle some of these issues is to use a vector database to store the embeddings of the documents and then use a retriever to retrieve the most relevant documents.
Vector databases are optimized for storing and querying embeddings and are a good way to handle large corpora of documents. 
This allows us to work with a large corpus of documents without dealing with the complexity of storing the embeddings in memory.
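To get a feel for why keeping every embedding in memory stops scaling, here is a rough back-of-the-envelope estimate. The embedding dimension (1536) and the `float32` storage format are assumptions chosen for illustration, and `embedding_memory_gib` is a hypothetical helper; the actual footprint depends on the embedding model and on any index overhead.

```python
def embedding_memory_gib(num_chunks: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of a dense embedding matrix in GiB (float32 by default)."""
    return num_chunks * dim * bytes_per_value / 2**30


# Assumed embedding dimension of 1536; adjust for the model actually in use.
for num_chunks in (10_000, 1_000_000):
    size = embedding_memory_gib(num_chunks, dim=1536)
    print(f"{num_chunks:>9,} chunks -> {size:.2f} GiB")
```

Under these assumptions a million chunks already need several GiB for the raw vectors alone, before counting the text and metadata stored alongside them.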

We will be using the `VectorStoreSearchEngine` class to create a vector store search engine.

In [None]:
# full_dataset = load_dataset(docs_root)
# chunked_dataset = chunk_dataset(full_dataset, chunk_size=500)
# vectorstore_search_engine = VectorStoreSearchEngine()
# vectorstore_search_engine = await vectorstore_search_engine.fit(chunked_dataset)
vectorstore_search_engine = VectorStoreSearchEngine()
vectorstore_search_engine = await vectorstore_search_engine.load()

One additional advantage of using a vector database is that we can run more complex queries that include metadata filters.

For instance, we can retrieve only the markdown and notebook files from the dataset and skip the code documents. This is useful when we want to restrict the search to a specific document type or subset of documents before performing the retrieval.
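Conceptually, a metadata filter narrows the candidate set before similarity ranking. The sketch below shows that idea in plain Python with toy two-dimensional embeddings. The `filtered_search` function and the record layout are hypothetical; a real vector database applies the filter inside its index rather than scanning every record.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, records, allowed_file_types, top_k=5):
    """Keep only records whose file_type is allowed, then rank by similarity."""
    candidates = [r for r in records if r["file_type"] in allowed_file_types]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy records; the field names are assumptions for this example.
records = [
    {"id": "a", "file_type": "markdown", "embedding": [1.0, 0.0]},
    {"id": "b", "file_type": "python", "embedding": [0.9, 0.1]},
    {"id": "c", "file_type": "notebook", "embedding": [0.0, 1.0]},
]

hits = filtered_search([1.0, 0.0], records, {"markdown", "notebook"}, top_k=2)
print([h["id"] for h in hits])  # ['a', 'c']
```

Record `b` is a close match to the query but is excluded because its `file_type` is not in the allowed set.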

In [32]:
# results = await vectorstore_search_engine.search(
#     query=query,
#     top_k=5,
#     filters="file_type in ('notebook', 'markdown')")
# for doc in results:
#     render_doc(doc)

In [None]:
class VectorStoreRetriever(RetrieverWithReranker):
    pass

vectorstore_retriever = VectorStoreRetriever(search_engine=vectorstore_search_engine)

#TODO: Appropriate filters
# results = await vectorstore_retriever.invoke(query=query, top_k=10, top_n=5, filters="file_type in ('notebook', 'markdown')")
# for doc in results:
#     render_doc(doc)

class VectorStoreRAGPipeline(SimpleRAGPipeline):
    pass

vectorstore_rag_pipeline = VectorStoreRAGPipeline(
    retriever=vectorstore_retriever,
    generator=simple_response_generator)

# response = await vectorstore_rag_pipeline.invoke(query=query)

# printmd(response["answer"])

# We can now evaluate the response generation performance of the vector store RAG pipeline.
vectorstore_response_scores = await response_evaluation.evaluate(
    model=vectorstore_rag_pipeline,
    __weave={"display_name": "Vector Store RAG Pipeline"}
)
# Compare the scores of the vector store RAG pipeline with the earlier pipelines.

## Query Enhancement

Next let's look at another way to improve our RAG pipeline.
We will be implementing a *query enhancement* stage in our pipeline. Our `QueryEnhancer` will perform two key tasks:


![QueryEnhancer](./imgs/Advanced_RAG_Pipeline.png)

1. **Intent Classification**: Determine the user's intent based on the query. This helps us better understand the user's query and improve the response generation step.

2. **Query Expansion**: Break down queries into more focused sub-queries for retrieval. This improves retrieval by capturing different aspects of the original question.

These enhancements serve two primary purposes:
- Inform the response generator, allowing it to tailor its output based on language and intent.
- Improve the retrieval process by using more targeted sub-queries.
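To make the shape of the enhancer's output concrete, here is a toy, rule-based stand-in. It is not how the workshop's `QueryEnhancer` works (that one calls an LLM); the `EnhancedQuery` dataclass, the `toy_enhance` function, and the intent labels are all hypothetical and exist only to show the two pieces of information that flow downstream.

```python
from dataclasses import dataclass, field


@dataclass
class EnhancedQuery:
    """Hypothetical container for the enhancer's output."""
    query: str
    intent: str
    sub_queries: list[str] = field(default_factory=list)


def toy_enhance(query: str) -> EnhancedQuery:
    """Rule-based placeholder for the LLM-backed enhancer."""
    # Intent classification: a crude keyword check instead of a model call.
    intent = "how_to" if query.lower().startswith("how") else "general"
    # Query expansion: split a compound question into focused sub-queries.
    parts = [p.strip(" ?") for p in query.split(" and ") if p.strip(" ?")]
    return EnhancedQuery(query=query, intent=intent, sub_queries=parts)


enhanced = toy_enhance("How do I log metrics and version my dataset?")
print(enhanced.intent)       # how_to
print(enhanced.sub_queries)  # ['How do I log metrics', 'version my dataset']
```

The intent is passed to the response generator, while each sub-query is sent to the retriever separately.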

Let's implement our `QueryEnhancer` and see it in action:

In [None]:
# Again the `QueryEnhancer` is a `weave.Model`
# it contains two methods, one to get the intent prediction and one to generate search queries.
# the invoke method is a wrapper around the two methods.
# let's see how it works for a sample query.
query_enhancer = QueryEnhancer()
results = await query_enhancer.invoke(query=query)
results


We can now use the `QueryEnhancer` to create a `QueryEnhancedRAGPipeline`. However, unlike the `SimpleRAGPipeline`, this pipeline requires a `response_generator` that can handle the user's intent and query. Additionally, the `retriever` must now be able to handle the search queries generated by the `QueryEnhancer` and rerank and collate the results.

The `QueryEnhancedRAGPipeline` encapsulates this logic and provides a way to unify the pipeline with the `QueryEnhancer` and the response generator.
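The collation step can be sketched as follows. This is an illustrative simplification with hypothetical names (`retrieve_for_sub_queries`, `fake_search`, `FAKE_INDEX`): it keeps the best score per document across sub-queries, whereas the actual pipeline re-ranks the pooled results with a re-ranker.

```python
def retrieve_for_sub_queries(sub_queries, search_fn, top_k=3):
    """Run each sub-query, then collate hits, keeping the best score per document."""
    best = {}
    for sub_query in sub_queries:
        for doc_id, score in search_fn(sub_query, top_k):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest-scoring documents first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy search backend used only for this example.
FAKE_INDEX = {
    "log metrics": [("runs.md", 0.9), ("quickstart.md", 0.6)],
    "version dataset": [("artifacts.md", 0.8), ("quickstart.md", 0.7)],
}


def fake_search(query, top_k):
    return FAKE_INDEX.get(query, [])[:top_k]


collated = retrieve_for_sub_queries(["log metrics", "version dataset"], fake_search)
print(collated)  # [('runs.md', 0.9), ('artifacts.md', 0.8), ('quickstart.md', 0.7)]
```

`quickstart.md` is returned by both sub-queries but appears only once in the collated list, with the higher of its two scores.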

In [None]:
# Instantiate the `QueryEnhancedRAGPipeline` with the `QueryEnhancer`, `retriever` and `QueryEnhancedResponseGenerator`.
query_enhanced_rag_pipeline = QueryEnhancedRAGPipeline(
    query_enhancer=query_enhancer,
    retriever=vectorstore_retriever,
    generator=QueryEnhancedResponseGenerator()
)
# We can now evaluate the response generation performance of the query enhanced RAG pipeline.
# response = await query_enhanced_rag_pipeline.invoke(query=query)
# printmd(response["answer"])

query_enhanced_response_scores = await response_evaluation.evaluate(
    model=query_enhanced_rag_pipeline,
    __weave={"display_name": "Query Enhanced RAG Pipeline"}
)

## Agentic RAG

So far we have seen a few ways to improve our RAG pipeline. 
The query-enhanced RAG pipeline is already somewhat agentic: the `QueryEnhancer` uses an LLM to generate search queries and intent predictions, and we then manually route these to the retriever and the response generator. It also used Structured Outputs to return the results in a structured format, which is an example of an LLM using tool calling.

Tool use allows for greater flexibility in accessing and utilizing data sources, thus unlocking new use cases not possible with a standard RAG approach.

In a setting where data sources are diverse with non-homogeneous formats (structured/semi-structured/unstructured), this approach becomes even more important.

We'll look at how we can implement an agentic RAG system using a tool use approach. The agent can search for information about how to use the product, retrieve information from the internet, and search code examples.

Concretely, we'll cover the following use cases:

- `search_docs`: Searches the W&B documentation indexed earlier in the workshop
- `search_internet`: Searches the internet for general queries

With tool use we can take our pipeline a step further and create an agentic system.
We will be using the `Agent` class to create an agentic RAG pipeline. 

The `Agent` class is a `weave.Model` that encapsulates a set of tools, the LLM calls, and the routing logic between them. Let's see how it works.
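Before running the real agent, it may help to see the routing idea in isolation. The sketch below is a simplified illustration, not the workshop's `Agent` class: the tool bodies are stubs, and `choose_tool` is a hypothetical keyword rule standing in for the LLM's tool-calling decision.

```python
def search_docs(query: str) -> str:
    """Stub tool: a real implementation would query the document retriever."""
    return f"[docs] results for: {query}"


def search_internet(query: str) -> str:
    """Stub tool: a real implementation would call a web search API."""
    return f"[web] results for: {query}"


# Registry mapping tool names to callables, mirroring how tools are exposed to an LLM.
TOOLS = {"search_docs": search_docs, "search_internet": search_internet}


def choose_tool(query: str) -> str:
    """Placeholder for the LLM's tool choice; a real agent lets the model decide."""
    product_terms = ("wandb", "w&b", "weave")
    if any(term in query.lower() for term in product_terms):
        return "search_docs"
    return "search_internet"


def run_agent(query: str) -> dict:
    """Pick a tool, call it, and return the observation for the generation step."""
    tool_name = choose_tool(query)
    observation = TOOLS[tool_name](query)
    return {"tool": tool_name, "observation": observation}


print(run_agent("How do I log a table in wandb?")["tool"])  # search_docs
print(run_agent("What is the capital of France?")["tool"])  # search_internet
```

In an actual agent, the model receives the tool descriptions, returns the name and arguments of the tool it wants to call, and may loop over several tool calls before producing the final answer.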



In [None]:
agent = Agent()

answer = await agent.invoke(query=query)
printmd(answer["answer"])
