# Techniques for Improving the Effectiveness of RAG Systems

Execute the cell below to load the video presentation that accompanies this notebook, and watch it before working through the materials in this notebook.

In [1]:
from IPython.display import HTML

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-fx-20-v1/lesson-03.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

---

## Lesson 03: Evaluating Retrieval

Welcome to Lesson 03! Now that we have set up our retriever and can get search results through API calls, we could present the results in many ways. We could hook into an existing tool that consumes APIs, or build something custom with a framework designed for quick web apps, like Streamlit. Even just dumping results into a spreadsheet and visualizing them can help evaluate search results.

To validate our system, however, we need to record feedback! While we might start with qualitative feedback from manually observing the results of a few searches, we'll need to start recording the feedback data in order to make quantitative assessments of our RAG system's effectiveness and iterate on the design.

Certainly, you'll often be able to tell with just a few starter searches if something is drastically wrong, so getting that kind of qualitative feedback first is valuable. Building effective RAG systems requires rapid iterative prototyping, so it's good to use a small test dataset for these trials--like our relatively tiny dataset of just a few hundred blog articles for this course.

**This notebook will focus on search, the LLM, and the judge UI and database.**


<div style="text-align: center;">
<img src="img/03_overview.png" width="850" alt="architecture diagram showing the search, judge UI/database, and LLM highlighted">
</div>

---

## Restart the Services

To make sure you're starting this lesson with all your services in the correct state, please restart them by running the following cell.

In [2]:
!./restart.sh

Bringing containerized services down...
Services down.
Bringing containerized services back up...
Services back up.


---

## The Judge UI

Using FastAPI as a backend and Vue.js as a frontend, we've custom-built a Judge UI for searching different asset types via semantic or keyword search. It is essentially a bit of wiring that makes very simple REST calls to the router. Via SQLAlchemy, we've integrated FastAPI with a small SQLite database that stores the feedback we generate by voting on results.

### Launching the Judge UI

You already launched the `judge` service in Lesson 00.

In [3]:
!docker-compose logs judge

judge-1  | 2024-08-09 23:55:43,244 |     INFO | database | /judge/db/sql_app.db
judge-1  | * 'orm_mode' has been renamed to 'from_attributes'
judge-1  | 2024-08-09 23:55:43,275 |     INFO | main | Instantiating FastAPI app
judge-1  | 2024-08-09 23:55:43,275 |     INFO | main | Formatting uvicorn loggers before app startup
judge-1  | 2024-08-09 23:55:43,275 |     INFO | main | Enabling CORS
judge-1  | 2024-08-09 23:55:43,276 |     INFO | main | FastAPI Setup Complete
judge-1  | 2024-08-09 23:55:43,308 |     INFO | uvicorn.error | Started server process [1]
judge-1  | 2024-08-09 23:55:43,308 |     INFO | uvicorn.error | Waiting for application startup.
judge-1  | 2024-08-09 23:55:43,308 |     INFO | main | Re-formatting uvicorn loggers on startup
judge-1  | 2024-08-09 23:55:43,30

### Viewing the Judge UI

The `judge` service is available on port 5007. Execute the following cell to generate a link to open it in a new browser tab.

In [4]:
%%js
// Use hostname (not host) so a port in the notebook's own URL is not carried over.
var host = window.location.hostname;
var url = 'http://' + host + ':5007';
element.innerHTML = '<a style="color:green;" target="_blank" href="' + url + '">Click to open the judge UI.</a>';

<IPython.core.display.Javascript object>

Note how it lets you choose what type of search, switch some of the basic parameters, and select which asset type you are searching.

Using the Judge UI, compare the results from searching through the summarization chunks vs. the standard chunking. Standard chunks are much more granular and better suited for very specific Q&A tasks. Summarized chunks are better for discovering content you may want to then dive in deeper and read. 

With standard chunking, the chunks are ordered by score, so sometimes you get multiple sections from the same document; in an app UI, you might organize those search results together so users still see multiple relevant documents in their results.

**For the remainder of this notebook we're going to focus on our summarization method of chunking, so we can leave that asset type checked and uncheck the other one in the UI.**

We're doing so because the summarization method produces fewer overall chunks, which makes evaluation faster for the purposes of this introductory notebook. However, the same principles would apply for standard chunking.

---

## Semantic and Keyword Search

Remember how some searches are more suited to semantic search and some to keyword search? Imagine the use case where we want to search for a product name, and in particular a relatively new product that may not be in the training data of our embedding model.

Search for `H200` using semantic search and note how some results are mismatches, for example documents about H2O.ai.

---

## Voting with the Judge UI

Now let's vote. First let's evaluate how semantic search does on this query.

1. Make sure we're in Semantic Search mode.
2. Type `H200` in the search box.
3. Make sure only "TechBlog Posts Summaries" is checked in Asset Types.
4. Vote thumbs up or down on the 10 results. This populates our SQLite database.

Now let's evaluate how keyword search does on this query.
1. Switch to Keyword Search mode.
2. Type `*H200` in the search box. **Include the asterisk so that we get wildcard matches on the closely-related GH200.**
3. Make sure only "TechBlog Posts Summaries" is checked in Asset Types.
4. Vote thumbs up or down on the 10 results. This populates our SQLite database.

---

## Load Votes into Pandas DataFrame

SQLite DBs are each a single file, which we can easily query with Python and load into a Pandas DataFrame.

In [5]:
import os 
os.getcwd()

'/workspace/dli'

In [6]:
sql_db_filepath = os.path.abspath(os.path.join(os.getcwd(), "db", "sql_app.db"))
sql_db_filepath

'/workspace/dli/db/sql_app.db'

In [7]:
import pandas as pd
import sqlite3

def select_all_feedback() -> pd.DataFrame: 
    # Read sqlite query results into a pandas DataFrame
    con = sqlite3.connect(sql_db_filepath)
    df = pd.read_sql_query("SELECT * FROM feedback", con)
    con.close()
    return df

df = select_all_feedback()

# Verify that result of SQL query is stored in the dataframe
print(df.head(20))

Empty DataFrame
Columns: [id, query, chunk_id, search_type, keyword_search_field, keyword_search_type, asset_type, k, results_idx, n_results, vote_value, username, created]
Index: []


In [8]:
print(df.shape)

(0, 13)


---

## Precision and Recall

As in most data science evaluation tasks, we are interested in precision and recall. As a reminder, they are defined as:

- **Precision:** Total number of *relevant* documents retrieved / Total number of documents retrieved.
- **Recall:** Total number of relevant documents *retrieved* / Total number of relevant documents *in the database*.

In retrievers, those scores are typically calculated with the system set for some arbitrary number of results K (indicated as Precision@K, Recall@K).
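
The definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the document ids, and the relevance set are all made up for the example and are not part of the course code.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranked results for one query, and the ids judged relevant.
retrieved_ids = ["d1", "d7", "d3", "d9", "d4"]
relevant_ids = {"d1", "d3", "d5", "d8"}

p5 = precision_at_k(retrieved_ids, relevant_ids, k=5)
r5 = recall_at_k(retrieved_ids, relevant_ids, k=5)
print(f"Precision@5 = {p5:.2f}, Recall@5 = {r5:.2f}")
# Precision@5 = 0.40, Recall@5 = 0.50
```

Two of the five retrieved ids are relevant (precision 2/5), and two of the four relevant ids were retrieved (recall 2/4).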

You may see variations on these like Mean Average Precision, F1 score (F1@K), or other rank-based metrics like Mean Reciprocal Rank (MRR) or Normalized Discounted Cumulative Gain (NDCG). We'll just focus on Precision and Recall in this course.

<div style="text-align: center;">
<img src="img/precision-recall.png" width="600" alt="Precision Recall">
</div>

---

## Calculating Precision

We voted on the top results for these two search methods, so we can calculate Precision@K (where K is the number of results we voted on per search) with a bit of pandas `groupby` magic.


In [9]:
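# First filter to all the feedback we put in manually (a.k.a. human feedback).
# .copy() gives us an independent DataFrame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning.
hf = df[df['username'] != 'llmjudge'].copy()

# Next transform the query column: strip wildcards and lowercase everything.
# regex=False treats "*" as a literal character rather than a regex pattern.
hf["query"] = hf["query"].str.replace("*", "", regex=False).str.lower()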

print(hf.shape)
print(hf.head())

(0, 13)
Empty DataFrame
Columns: [id, query, chunk_id, search_type, keyword_search_field, keyword_search_type, asset_type, k, results_idx, n_results, vote_value, username, created]
Index: []


In [10]:
result = hf.groupby(["query", "search_type", "asset_type"]).aggregate(precision=("vote_value", "mean")).reset_index(drop=False)
print(result)

Empty DataFrame
Columns: [query, search_type, asset_type, precision]
Index: []


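In our run of the single "H200" query, keyword search of the summarization chunks had a precision of 1.0, and semantic search of the summarization chunks had a precision of 0.6. Your numbers will reflect your own votes; an empty result like the one shown above means no votes have been recorded yet.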

We would then repeat this for not just a single query, but multiple queries across our domain of interest, averaging our precision values across those queries. We would then repeat this process with other chunking strategies for consistency.
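
To make the aggregation concrete without depending on your own votes, here is a minimal sketch that runs the same idea against an in-memory SQLite table. Everything in it is an assumption made for illustration: the rows are invented, the table has only a subset of the real `feedback` columns, the `search_type` labels `semantic` and `keyword` are placeholders, and votes are assumed to be stored as 1 (thumbs up) or 0 (thumbs down).

```python
import sqlite3

# In-memory stand-in for the feedback table; every row below is made up.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE feedback "
    "(query TEXT, search_type TEXT, chunk_id TEXT, vote_value INTEGER)"
)
votes = [
    ("H200", "semantic", "c1", 1),
    ("H200", "semantic", "c2", 1),
    ("H200", "semantic", "c3", 0),
    ("H200", "semantic", "c4", 1),
    ("H200", "semantic", "c5", 0),
    ("*H200", "keyword", "c1", 1),
    ("*H200", "keyword", "c6", 1),
    ("*H200", "keyword", "c7", 1),
    ("*H200", "keyword", "c8", 1),
    ("*H200", "keyword", "c9", 1),
]
con.executemany("INSERT INTO feedback VALUES (?, ?, ?, ?)", votes)

# Strip the wildcard and lowercase the query, then average the 0/1 votes
# per (query, search_type) group to get precision.
precision_rows = con.execute(
    """
    SELECT LOWER(REPLACE(query, '*', '')) AS norm_query,
           search_type,
           AVG(vote_value) AS precision_at_k
    FROM feedback
    GROUP BY norm_query, search_type
    ORDER BY search_type
    """
).fetchall()
con.close()

for row in precision_rows:
    print(row)
# ('h200', 'keyword', 1.0)
# ('h200', 'semantic', 0.6)
```

Because the votes are 0 or 1, the mean of `vote_value` within a group is exactly the fraction of voted-on results that were relevant.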

---

## Calculating Recall

Recall is a trickier metric to evaluate than precision because our denominator is the total number of relevant documents in the entire database--and finding relevant documents automatically is the point of the system we are trying to build!

If we were to brute force this with human feedback through the UI, it would take a long time, especially as the database grows in size--a good reminder to do initial evaluation on a subset of the dataset.

We can make recall easier to calculate with workarounds. For example, we could try a number of different search types (both semantic and keyword) and chunking strategies with the same query, then assume (or test!) that, taken together, those strategies found most of the relevant documents.

We've actually done that in our case on a small scale; ideally, we'd want to be more thorough, with additional strategies.

First, we need to calculate the total number of unique chunk ids that received a positive vote (with summarization chunks, that is one chunk per document):

In [11]:
totalpos = hf[hf["vote_value"] == 1].groupby("query").aggregate(totalpos=("chunk_id", "nunique")).reset_index(drop=False)
totalpos

Empty DataFrame
Columns: [query, totalpos]
Index: []


Now, we join this to our `hf` dataframe so we can calculate recall:

In [12]:
hf = pd.merge(hf, totalpos, on=["query"])

In [13]:
hf_grouped = hf.groupby(["query", "search_type", "asset_type"]).aggregate(precision=("vote_value", "mean"), recall=("vote_value", "sum"), totalpos=("totalpos", "first")).reset_index(drop=False)
hf_grouped["recall"] = hf_grouped["recall"] / hf_grouped["totalpos"]
hf_grouped

Empty DataFrame
Columns: [query, search_type, asset_type, precision, recall, totalpos]
Index: []


And there we have our recall! Or at least a plausible proxy for it, much faster to calculate than going through every document in the database manually.
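
The pooling idea behind this proxy can also be shown in plain Python on a handful of invented votes. The chunk ids, the `search_type` labels, and the 0/1 vote encoding below are assumptions for the example only.

```python
from collections import defaultdict

# Hypothetical votes for a single query: (search_type, chunk_id, vote_value).
votes = [
    ("semantic", "c1", 1), ("semantic", "c2", 0), ("semantic", "c3", 1),
    ("keyword", "c1", 1), ("keyword", "c4", 1), ("keyword", "c5", 1),
]

# Pool: every chunk that received a positive vote from any search type.
# This pooled set stands in for "all relevant documents in the database".
pooled_relevant = {chunk for _, chunk, vote in votes if vote == 1}

# Positive chunks found by each search type on its own.
found = defaultdict(set)
for search_type, chunk, vote in votes:
    if vote == 1:
        found[search_type].add(chunk)

recall = {st: len(chunks) / len(pooled_relevant) for st, chunks in found.items()}
print(recall)
# {'semantic': 0.5, 'keyword': 0.75}
```

The pooled set here has four chunks (`c1`, `c3`, `c4`, `c5`), so the proxy is only as good as the pool: any relevant chunk that no strategy surfaced is invisible to it, which is why adding more strategies makes the estimate more trustworthy.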

---

## LLM As A Judge

Getting human feedback is pretty important for evaluation. However, unless you have an automated way of continually collecting user preferences--as search engines do when they track whether a user clicked on a search result--it will require a lot of dedicated person-hours to build up your database of feedback, especially as your database of document chunks scales.

This leads to a natural follow-up question: can we ask a machine to evaluate whether a document is relevant or not? If so, we could solve the problem by just throwing compute at it.

We are actually getting close to another important topic in information retrieval systems: **rerankers**.

---

## Bi-encoders vs Cross-encoders

So far we've been using a "bi-encoder" to embed our documents. It's called a bi-encoder because we are calculating the output score from two embeddings--in our case, the user's query and the document text they are searching through. We are calculating the output score as a similarity between the embedding vectors, like cosine similarity.

Typical rerankers use cross-encoders instead of bi-encoders. They are lightweight language models (compared to something like GPT). They take in two pieces of text at once and encode them jointly, spitting out a similarity score directly instead of producing two vectors and then calculating an output score. 

These models are usually more accurate than bi-encoders+cosine similarity because they can take into account interactions between the pieces of text. However, they don't scale as well as bi-encoders, because cross-encoders need to directly score each query-result pair.
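
As a rough illustration of the bi-encoder side of this trade-off, the sketch below ranks documents by cosine similarity. The three-dimensional vectors are invented stand-ins for real embeddings (which typically have hundreds of dimensions), and the document names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up document embeddings; in a bi-encoder setup these are computed
# once, ahead of time, and stored in the vector database.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
    "doc_c": [0.7, 0.2, 0.1],
}

# Only the query has to be embedded at search time.
query_vector = [1.0, 0.0, 0.0]

ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
# ['doc_a', 'doc_c', 'doc_b']
```

The scoring step is just vector arithmetic over precomputed embeddings, which is why bi-encoders scale well. A cross-encoder has no precomputed document vectors to reuse, so each query-document pair costs a full model forward pass.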

If we had a really good cross-encoder model, we could ask that model to grade each query against each document and calculate precision and recall through that.

---

## Alternatives for Reranking

However, training a good cross-encoder is challenging, in part because the datasets we would most like to use (broad search engine results, which are essentially human evaluations of query-passage matches!) are often proprietary. The best cross-encoders on Hugging Face, for instance, often use the [MS MARCO](https://microsoft.github.io/msmarco/) dataset--an excellent search engine dataset with a license that restricts commercial use.

In the absence of a good, readily-available cross-encoder, we can turn to a general purpose LLM like GPT, Llama 2, Mistral, etc. These are orders of magnitude larger and slower than a simple cross-encoder, but should be more powerful than the cross-encoder at determining relevance and therefore able to perform evaluation.

---

## Load Tech Blog Summaries

We first need access to all the document summaries that we saved in Lesson 02.

In [14]:
import json

# load the summaries from the json file
with open("data/techblogs_summaries/saved.json", "r") as f:
    saved_summaries = json.load(f)

In [15]:
len(saved_summaries)

150

In [16]:
summary = saved_summaries['https://developer.nvidia.com/blog/create-share-and-scale-enterprise-ai-workflows-with-nvidia-ai-workbench-now-in-beta/'][0]['text']
print(summary)

NVIDIA AI Workbench, now in beta, aims to streamline how enterprise developers create, share, and scale AI and machine learning projects. It allows developers to create, collaborate, and migrate AI workloads on their GPU-enabled environment of choice. The beta release includes new features such as simplified setup and installation on Windows, Ubuntu, and macOS, expanded support for container runtimes and Git servers, and new base images for project creation. The article walks through a coding copilot reference example using AI Workbench to fine-tune a generative AI model on a GPU system. Key concepts include AI Workbench Projects, fine-tuning methods such as Quantized Low Rank Adaptation (QLoRA), and a walkthrough of a Mistral 7B fine-tuning project. AI Workbench helps simplify the process of developing and deploying AI models by providing an intuitive user experience, streamlined configuration, and automation for handling Git and container-based developer environments.


---

## Use LLM to Check Summary Relevance

Now we send a request to Mixtral to determine if this text is relevant to the query "H200". Here we import a `ChatOpenAI` instance of our local NIM Mixtral 8x7B model configured and ready for use with LangChain from an [`llms` helper file](llms.py).

In [17]:
from llms import llms

In [18]:
llm = llms.nim_mixtral_llm

### Optional Remote LLMs

Optionally, instead of using our local model, you can also use either NVIDIA AI Foundation's Mixtral 8x7B model or OpenAI's gpt-3.5-turbo.

For either of these 2 options you'll need an API key. For more details about NVIDIA AI Foundation and obtaining a free API key, see [the notebook *NVIDIA AI Foundation.ipynb*](./NVIDIA%20AI%20Foundation.ipynb).

After obtaining an appropriate API key, uncomment the appropriate cell below, add your API key, and run the cell to set `llm` to the remote LLM you chose to work with.

#### NVIDIA AI Foundation Mixtral 8x7B

In [19]:
# from llms import set_api_key
# set_api_key('NVIDIA_API_KEY', '<your_nvidia_api_key>')
# llm = llms.nvai_mixtral_llm

#### OpenAI gpt-3.5-turbo

In [20]:
# from llms import set_api_key
# set_api_key('OPENAI_API_KEY', '<your_openai_api_key>')
# llm = llms.openai_gpt3_llm

In [21]:
import asyncio

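# Initialize a semaphore that caps the number of concurrent LLM requests at 3.
limit = asyncio.Semaphore(3)

async def async_generate(llm, msg):
    # Hold the semaphore for the duration of the request so that at most
    # 3 requests are in flight at any one time.
    async with limit:
        resp = await llm.agenerate([msg])
    return resp.generations[0][0].text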
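
One way a semaphore like `limit` can cap concurrency is sketched below with a stand-in for the LLM. `FakeLLM`, its `answer` method, and `limited_call` are hypothetical names invented for this example; they are not part of LangChain or the course's `llms` helper.

```python
import asyncio


class FakeLLM:
    """Stand-in for a chat model that records how many calls overlap."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0

    async def answer(self, msg):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulate network latency
        self.in_flight -= 1
        return f"echo: {msg}"


async def limited_call(llm, msg, limit):
    # Only callers holding the semaphore may talk to the model.
    async with limit:
        return await llm.answer(msg)


async def main():
    limit = asyncio.Semaphore(3)
    llm = FakeLLM()
    msgs = [f"passage {i}" for i in range(10)]
    replies = await asyncio.gather(*(limited_call(llm, m, limit) for m in msgs))
    return replies, llm.max_in_flight


replies, max_in_flight = asyncio.run(main())
print(len(replies), max_in_flight)
```

All ten requests complete, but no more than three are ever in flight at once. Inside a Jupyter notebook, where an event loop is already running, you would write `replies, max_in_flight = await main()` instead of calling `asyncio.run`.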

In [22]:
summary

'NVIDIA AI Workbench, now in beta, aims to streamline how enterprise developers create, share, and scale AI and machine learning projects. It allows developers to create, collaborate, and migrate AI workloads on their GPU-enabled environment of choice. The beta release includes new features such as simplified setup and installation on Windows, Ubuntu, and macOS, expanded support for container runtimes and Git servers, and new base images for project creation. The article walks through a coding copilot reference example using AI Workbench to fine-tune a generative AI model on a GPU system. Key concepts include AI Workbench Projects, fine-tuning methods such as Quantized Low Rank Adaptation (QLoRA), and a walkthrough of a Mistral 7B fine-tuning project. AI Workbench helps simplify the process of developing and deploying AI models by providing an intuitive user experience, streamlined configuration, and automation for handling Git and container-based developer environments.'

In [23]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful AI bot being used by NVIDIA to determine if a passage of text is relevant to a search query. "
            + "All of the passages of text are in some way related to NVIDIA, so in order to be relevant it needs to be a strict match between the topic of the passage and the topic of the query. "
            + 'Format your output as a JSON object with a single boolean field "relevant". ',
        ),
        (
            "user",
            'Is the following passage strictly relevant to a search query for "{query}"?\nPassage: {passage}',
        ),
    ]
)



In [24]:
batch_messages = []

# truncating the list to limit to a subset of urls 
# this example is just to illustrate the point
# the first 5 urls should be irrelevant to H200, the next 2 should be relevant 
urls = list(saved_summaries.keys())[0:5] + ['https://developer.nvidia.com/blog/one-giant-superchip-for-llms-recommenders-and-gnns-introducing-nvidia-gh200-nvl32/', 'https://developer.nvidia.com/blog/nvidia-tensorrt-llm-enhancements-deliver-massive-large-language-model-speedups-on-nvidia-h200/']

for url in urls:
    title = saved_summaries[url][0]['document_title']
    summary = saved_summaries[url][0]['text']
    print(title)
    print(summary)
    print("-----")
    passage = title + "\n" + summary
    messages = template.format_messages(query="H200", passage=passage)
    batch_messages.append(messages)


Create, Share, and Scale Enterprise AI Workflows with NVIDIA AI Workbench, Now in Beta
NVIDIA AI Workbench, now in beta, aims to streamline how enterprise developers create, share, and scale AI and machine learning projects. It allows developers to create, collaborate, and migrate AI workloads on their GPU-enabled environment of choice. The beta release includes new features such as simplified setup and installation on Windows, Ubuntu, and macOS, expanded support for container runtimes and Git servers, and new base images for project creation. The article walks through a coding copilot reference example using AI Workbench to fine-tune a generative AI model on a GPU system. Key concepts include AI Workbench Projects, fine-tuning methods such as Quantized Low Rank Adaptation (QLoRA), and a walkthrough of a Mistral 7B fine-tuning project. AI Workbench helps simplify the process of developing and deploying AI models by providing an intuitive user experience, streamlined configuration, and 

In [25]:
response = llm.generate(batch_messages)
for gen in response.generations:
    print(gen[0].text)

APIConnectionError: Connection error.

These results can now be saved in a data structure and used as a proxy for ground truth when evaluating the retrieval results and calculating precision and recall.
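
Since the prompt asks for a JSON object with a single boolean field `relevant`, one possible way to turn the raw replies into verdicts is sketched below. `parse_relevance` is a hypothetical helper, and the sample replies are invented; real model output may need more defensive handling.

```python
import json
import re


def parse_relevance(text):
    """Return True/False from an LLM reply, or None if no verdict is found."""
    # Models sometimes wrap the JSON in extra prose, so grab the first {...}.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    value = payload.get("relevant")
    return value if isinstance(value, bool) else None


# Made-up replies of the kind a chat model might return.
replies = [
    '{"relevant": false}',
    'Sure! Here is the result:\n{"relevant": true}',
    "I am not sure.",
]
verdicts = [parse_relevance(r) for r in replies]
print(verdicts)
# [False, True, None]
```

Returning `None` for unparseable replies keeps them distinguishable from a genuine "not relevant" verdict, so they can be retried or excluded rather than silently counted as negatives.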

The code above can be easily modified to perform the full evaluation with an LLM as a judge. In that case, you may want to evaluate which LLM can most cost-effectively perform that task. This kind of high-intensity, sustained generation could be well-suited for a smaller model run in a batch job--and the smaller the model, the more easily you could also tune it to more closely mimic human evaluations.

---

## Recap

The final product of this course and accompanying notebook for Lesson 04 uses only semantic search, but hopefully you can see the potential effectiveness of combining the two methods together. 

How could you create a hybrid search that merged/re-ranked the results of both search types? We'll leave that for you to research and work on yourself after this course is over.

---

## Next Lesson

Move to the next lesson by double-clicking *Lesson 04.ipynb* on the file-viewer on the left-hand side of your Jupyter Lab environment.