# Techniques for Improving the Effectiveness of RAG Systems

Execute the cell below to load the video presentation that accompanies this notebook, and watch it before working through the materials in this notebook.

In [1]:
from IPython.display import HTML

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-fx-20-v1/lesson-04.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

---

## Lesson 04: Better Generations

We've covered data exploration and chunking strategies, loading vector databases for retrieval, and comparing and evaluating retrieval performance. We're now ready to create the final product of this course, a functioning RAG web application.

We will build our application to be capable of multiple features:
- **Asset Discovery and Summarization**: our app should take a given topic from a user's query and find NVIDIA resources that are relevant to that topic and summarize the search results.
- **Question Answering**: our app should be able to answer a specific user question based on details it can extract from NVIDIA resources.
- **Coding Assistant**: our app should also be able to detect if the user wants it to write some code for them.

The system should handle each of these features differently. In our case, we have two indices in our database: one is composed of chunks of the actual contents of the blog posts, and the other is composed of chunks of summaries of the blog posts. The Asset Discovery and Summarization task is better suited to the summary index, whereas the Question Answering and Coding Assistant tasks are better suited to the chunks pulled directly from the blogs.

<div style="text-align: center;">
<img src="img/blog-chunks.png" width="600" alt="Blog Chunks">
</div>

**This notebook will focus on the app UI, search, and LLM.**


<div style="text-align: center;">
<img src="img/04_overview.png" width="850" alt="architecture diagram with app UI, search, and LLM components highlighted">
</div>

---

## Tailoring LLMs for RAG

When we prompt our LLM to carry out these tasks, our prompts will vary depending on the task. The prompt for question answering will differ from the discovery/summarization prompt considerably.

In fact, it might not be just the prompt that differs based on the task! We might have multiple different *models* that carry out different tasks. This loosely resembles Mixture-of-Experts models such as the Mistral variant known as "Mixtral" (and, reportedly, GPT-4), in which a learned router sends each token to a small subset of expert sub-networks. Similarly, if we wanted to have a model tailored to summarize our domain's content, we'd want some way to route only summarization requests to this fine-tuned model, and not question-answering requests.

We can accomplish that tailoring through parameter-efficient finetuning (PEFT), which requires much less data and compute than full finetuning of an LLM. See the session [Tailoring LLMs to Your Use Case](https://www.nvidia.com/en-us/on-demand/session/llmdevday23-02/#:~:text=Push%20LLMs%20beyond%20the%20quality,practical%2C%20real%2Dworld%20examples.) from NVIDIA's recent LLM Developer Day to learn more about customizing LLMs. During that recorded session, we showed how to p-tune a GPT model, and how to use low-rank adaptation (LoRA) with a Mistral 7B model; both are forms of PEFT. For this lesson, however, we'll stick with an out-of-the-box (OOB), general-purpose LLM, just with different prompts for each task.
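To give a rough sense of why LoRA is "parameter-efficient", here is a small arithmetic sketch. It assumes the usual LoRA setup, in which a frozen weight matrix of shape `d_out x d_in` is adapted by two trainable low-rank factors of shapes `d_out x rank` and `rank x d_in`. The sizes below are illustrative and not taken from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is finetuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the two low-rank factors (d_out x rank and rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumption, not from a specific model).
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full finetuning: {full:,} parameters")
print(f"LoRA (rank={rank}): {lora:,} parameters")
print(f"fraction trained: {lora / full:.4%}")
```

For a single matrix of this size, the low-rank factors hold well under 1% of the parameters that full finetuning would update.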

NVIDIA Deep Learning Institute also has an enterprise workshop *[Efficient LLM Customization](https://courses.nvidia.com/courses/course-v1:DLI+C-FX-10+V2/)* which takes a deep dive into performing PEFT on several models for a variety of tasks, as well as several methods for synthetic data generation in service of PEFT.

---

## Restart the Services

To make sure you're starting this lesson with all your services in the correct state, please restart them by running the following cell.

In [2]:
!./restart.sh

Bringing containerized services down...
Services down.
Bringing containerized services back up...
Services back up.


---

## User Intent Classification

In order to make our generation process more modular, we will need some sort of classification model that sits in front. This will classify the user's intent.

Here we import a `ChatOpenAI` instance of our local NIM Mixtral 8x7B model configured and ready for use with LangChain from an [`llms` helper file](llms.py).

In [10]:
from llms import llms

In [11]:
llm = llms.nim_mixtral_llm

### Optional Remote LLMs

Optionally, instead of using our local model, you can also use either NVIDIA AI Foundation's Mixtral 8x7B model or OpenAI's gpt-3.5-turbo.

For either of these two options you'll need an API key. For more details about NVIDIA AI Foundation and obtaining a free API key, see [the notebook *NVIDIA AI Foundation.ipynb*](./NVIDIA%20AI%20Foundation.ipynb).

After obtaining an appropriate API key, uncomment the appropriate cell below, add your API key, and run the cell to set `llm` to the remote LLM you chose to work with.

#### NVIDIA AI Foundation Mixtral 8x7B

In [5]:
# from llms import set_api_key
# set_api_key('NVIDIA_API_KEY', '<your_nvidia_api_key>')
# llm = llms.nvai_mixtral_llm

#### OpenAI gpt-3.5-turbo

In [6]:
# from llms import set_api_key
# set_api_key('OPENAI_API_KEY', '<your_openai_api_key>')
# llm = llms.openai_gpt3_llm

We're going to use an OOB LLM and a few-shot prompt to do our classification. This is not ideal, as mentioned earlier: every request resends all of the few-shot example tokens, and you pay for them each time. Consider modifying this notebook to use a fine-tuned model instead!

While it may not be cost-effective to use a large language model on the scale of GPT or Mistral for classification, consider the tradeoff between speed-of-development/time-to-market and application speed/cost. Using a larger general-purpose LLM and combining it with a few-shot prompt is often the quickest way to get something up and running, generating training data for your embedder. Later, you can optimize to save costs and reduce latency. Plus, when you use an LLM, you can easily adjust the prompt on the fly if you want to add a new category, for instance.

In [12]:
import json
from langchain_core.messages import AIMessage, SystemMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

In [13]:
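def extract_json(text):
    """Return the first balanced {...} span in `text` parsed as JSON, or None on failure.

    Braces that appear inside JSON string values are not handled specially.
    """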
    stack = []
    start_index = None

    for i, char in enumerate(text):
        if char == '{':
            if not stack:
                start_index = i
            stack.append(char)
        elif char == '}':
            if stack:
                stack.pop()
                if not stack:
                    end_index = i + 1
                    json_str = text[start_index:end_index]
                    try:
                        json_obj = json.loads(json_str)
                        return json_obj
                    except json.JSONDecodeError:
                        print("Error: JSON decoding failed.")
                        return None
            else:
                print("Error: Unmatched '}' character.")
                return None

    print("No JSON object found in the text.")
    return None

In [14]:
system_message = "You are a helpful AI bot being used in a technical domain. Format your output as a JSON object."
human_msg_pt = HumanMessagePromptTemplate.from_template(
    'First, is the following text a user question that needs answering or just a topic to learn more about? Second, if the text is a user question that needs answering, is the question asking for code to be written?\nText: {text}'
)
# three classification categories
code_question = AIMessage(content="{\n  \"is_user_question\": true,\n  \"asks_for_code\": true\n}")
regular_question = AIMessage(content="{\n  \"is_user_question\": true,\n  \"asks_for_code\": false\n}")
not_question = AIMessage(content="{\n  \"is_user_question\": false\n}")

prompt = ChatPromptTemplate(
    messages=[
        SystemMessage(content=system_message),
        human_msg_pt.format(text="how do I install cuda drivers"),
        code_question,
        human_msg_pt.format(text="what is the right NVIDIA SDK to use for computer vision"),
        regular_question,
        human_msg_pt.format(text="recommender systems for online shopping"),
        not_question,
        human_msg_pt.format(text="How to import rapids cudf in python?"),
        code_question,
        human_msg_pt.format(text="Generate code to make a Python web server."),
        code_question,
        human_msg_pt.format(text="biomedical devices"),
        not_question,
        human_msg_pt.format(text="write some code that prints hello world"),
        code_question,
        human_msg_pt.format(text="The leading cause of death in the 16th century was infection."),
        not_question,
        human_msg_pt.format(text="NVIDIA Merlin SDK for recommendation systems"),
        not_question,
        human_msg_pt.format(text="who founded the company NVIDIA?"),
        regular_question,
        human_msg_pt,
    ]
)
chain = prompt | llm

generation = chain.invoke({"text": "what libraries should I learn in C++"})
print(extract_json(generation.content))

{'is_user_question': True, 'asks_for_code': False}


In [15]:
generation = chain.invoke({"text": "What is a major seventh chord?"})
print(extract_json(generation.content))

{'is_user_question': True, 'asks_for_code': False}


In [16]:
generation = chain.invoke({"text": "omniverse scene lighting"})
print(extract_json(generation.content))

{'is_user_question': False}


In [17]:
generation = chain.invoke({"text": "Deep learning techniques for obstacle avoidance in autonomous mobile robots"})
print(extract_json(generation.content))

{'is_user_question': False}


In [18]:
generation = chain.invoke({"text": "Generate code to write a simple Python web app."})
print(extract_json(generation.content))

{'is_user_question': True, 'asks_for_code': True}


In [19]:
generation = chain.invoke({"text": "how would you write a print statement in C++"})
print(extract_json(generation.content))

{'is_user_question': True, 'asks_for_code': True}


Looks like our classifier is doing its job.
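Note that the classifier's output is not uniform: the "not a question" examples omit the `asks_for_code` key, and `extract_json` returns `None` when parsing fails. Below is a minimal sketch of how the parsed result could be normalized into a single mode label before routing. The function name, the mode names (`"summarize"`, `"qa"`, `"code"`), and the choice to fall back to summarization on a parse failure are all assumptions made for illustration, not the course app's actual behavior.

```python
from typing import Optional


def normalize_intent(parsed: Optional[dict]) -> str:
    """Map the classifier's parsed JSON to one of three hypothetical mode names."""
    if not isinstance(parsed, dict):
        # Parsing failed: fall back to the topic/summary mode (an assumption).
        return "summarize"
    if not parsed.get("is_user_question", False):
        return "summarize"
    if parsed.get("asks_for_code", False):
        return "code"
    return "qa"


print(normalize_intent({"is_user_question": True, "asks_for_code": True}))
print(normalize_intent({"is_user_question": True, "asks_for_code": False}))
print(normalize_intent({"is_user_question": False}))
print(normalize_intent(None))
```

Using `dict.get` with a default keeps the helper from raising a `KeyError` when the model leaves out a field.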

---

## Web Service

Now we're going to discuss our final service, which we're going to call `web` for Web App. You launched this service in Lesson 00.

In [20]:
!docker-compose logs web

web-1  | [2024-08-12 15:01:31 +0000] [7] [INFO] Starting gunicorn 21.2.0
web-1  | [2024-08-12 15:01:31 +0000] [7] [INFO] Listening at: http://0.0.0.0:5000 (7)
web-1  | [2024-08-12 15:01:31 +0000] [7] [INFO] Using worker: gevent
web-1  | [2024-08-12 15:01:31 +0000] [9] [INFO] Booting worker with pid: 9
web-1  | [2024-08-12 15:01:31 +0000] [10] [INFO] Booting worker with pid: 10
web-1  | [2024-08-12 15:01:31 +0000] [11] [INFO] Booting worker with pid: 11


We're going to have three different modes to our web app, based on the classified intent in the user's query. Each one will map to a separate prompt, with different styles of chunks being searched for as context depending on the classification.
1. When a user's query is detected to not be in question form, we're going to search the `summarize_techblogs` index to discover which assets might be relevant. We're then going to additionally summarize over all those assets.
2. When a user's query is detected to be in question form, but not a code question, we're going to search the `techblogs` index and pull the non-code text to inject into the LLM prompt as context.
3. When a user's query is detected to be a question that asks the LLM to write code, we're going to search the `techblogs` index and grab the code data we stored as metadata. That's why we did all that extra work back in the chunking stage!
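The three modes above can be sketched as a small routing table. This is only an illustration of the idea, not the code in the `web` service. The index names come from the list above, while the `Route` class, the `context_field` values, and the `prompt_name` labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    index: str          # search index to query
    context_field: str  # hypothetical name of the chunk field injected as context
    prompt_name: str    # hypothetical label for the prompt template to use


ROUTES = {
    "summarize": Route("summarize_techblogs", "text", "summarize"),
    "qa": Route("techblogs", "text", "qa"),
    "code": Route("techblogs", "code", "code"),
}


def route_query(classification: dict) -> Route:
    """Pick a route from the classifier's JSON output."""
    if not classification.get("is_user_question", False):
        return ROUTES["summarize"]
    if classification.get("asks_for_code", False):
        return ROUTES["code"]
    return ROUTES["qa"]


print(route_query({"is_user_question": False}))
print(route_query({"is_user_question": True, "asks_for_code": False}))
print(route_query({"is_user_question": True, "asks_for_code": True}))
```

Keeping the mapping in one table means that adding a fourth mode later would touch only `ROUTES` and the classifier prompt.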

<div style="text-align: center;">
<img src="img/end-to-end.png" width="600" alt="End-to-End">
</div>

### Viewing the Web App

The final product `web` app is available on port 5000. Execute the following cell to generate a link to open it in a new browser tab.

In [21]:
%%js
var host = window.location.hostname;  // hostname excludes any port, unlike window.location.host
var url = 'http://' + host + ':5000';
element.innerHTML = '<a style="color:green;" target="_blank" href="' + url + '">Click to open the final product web app.</a>';

<IPython.core.display.Javascript object>

---

## Try the Web App

Try a few different examples of the three types of queries, such as the two built-in examples in the buttons. These should lead to summarization responses.

Now try some question-answering and code queries. Here are some examples that are answerable based on the documents we indexed:
- Which musician worked with the company Moment Factory on her world tour?
- Write me some code using cgroups to isolate a GPU
- What is NVIDIA Workbench?

---

## Citing Sources

One feature we added to this app is the ability to cite sources used for summarization and question answering. This is due to some instructions we added to our prompt, which you can check out at `web/src/chains.py`. With a bit of string replacement, we can turn those source numbers into links that cause the page to scroll to the proper source. 

You may have seen something similar in Bing search or Perplexity AI's search.
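As a rough illustration of the string replacement mentioned above, the sketch below turns bracketed source numbers into in-page anchor links. It assumes the LLM cites sources as `[1]`, `[2]`, and so on, and that each source element on the page has an id like `source-1`. Both conventions are assumptions for this example, so check `web/src/chains.py` and the app's templates for what the course app actually does.

```python
import re


def link_citations(answer: str, num_sources: int) -> str:
    """Replace [n] markers with anchor links, leaving out-of-range numbers untouched."""

    def to_link(match: re.Match) -> str:
        n = int(match.group(1))
        if 1 <= n <= num_sources:
            return f'<a href="#source-{n}">[{n}]</a>'
        return match.group(0)  # not a valid source number: keep the original text

    return re.sub(r"\[(\d+)\]", to_link, answer)


print(link_citations("CUDA is a parallel computing platform [1][3].", num_sources=2))
```

Checking the number against `num_sources` keeps a hallucinated citation from producing a link that scrolls nowhere.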

---

## Additional Discussion Topics

### Query Transformation
We could expand short queries to be more detailed, adding context from what we know about the user, including their role or previous interests. Alternatively we could condense really long queries to a more manageable size. Query transformation could even enable keyword search when appropriate.
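One way to picture this is a helper that decides whether a query needs rewriting and, if so, builds the instruction to send to an LLM. This is a hedged sketch: the function name, the word-count thresholds, and the prompt wording are all hypothetical, and the LLM call itself is left out.

```python
from typing import Optional


def build_transform_prompt(
    query: str,
    user_role: str = "",
    min_words: int = 4,   # hypothetical threshold for "short" queries
    max_words: int = 40,  # hypothetical threshold for "long" queries
) -> Optional[str]:
    """Return a rewrite instruction for an LLM, or None if the query can be used as-is."""
    n_words = len(query.split())
    if n_words < min_words:
        role_hint = f" The user is a {user_role}." if user_role else ""
        return (
            "Rewrite this short search query as one detailed question."
            + role_hint
            + f"\nQuery: {query}"
        )
    if n_words > max_words:
        return (
            "Condense this query to a single sentence that keeps its key terms."
            f"\nQuery: {query}"
        )
    return None


print(build_transform_prompt("omniverse lighting", user_role="3D artist"))
print(build_transform_prompt("How do I import RAPIDS cuDF in Python?"))
```

The returned string would then be passed to the same `llm` used elsewhere in this notebook, and the model's rewrite would replace the original query at search time.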

### Quantitative Analysis of Generation
We covered precision and recall as retrieval metrics, but it also makes sense to evaluate your LLM generation quantitatively. The primary metrics we typically consider are faithfulness (does the generation match the retrieved data) and answer relevancy (does the generation match the query intent). Refer to the [RAGAS framework](https://docs.ragas.io/en/stable/concepts/metrics/index.html) for further guidance on using an LLM to evaluate generation performance.
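To make faithfulness concrete: roughly speaking, a RAGAS-style faithfulness score is the fraction of claims in the generated answer that the retrieved context supports. The sketch below shows only that final ratio. In practice an LLM judge would extract the claims and produce the verdicts; here they are hard-coded, hypothetical values, and returning 0.0 for an answer with no claims is an assumption of this sketch.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims judged to be supported by the retrieved context."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores zero
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for four claims extracted from one generated answer.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))
```

Averaging this score over a set of test queries gives a single number to track as you change prompts, chunking, or models.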

---

## Recap and Final Thoughts

Execute the cell below to load and then watch the course conclusion video.

In [22]:
from IPython.display import HTML

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-fx-20-v1/conclusion.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

We hope this rapid-fire intro left you with some valuable techniques for improving RAG systems. We provided a lot of code you're free to use, but more importantly, we hope these ideas will help you think about how to design and evaluate a useful system that goes beyond naive RAG. 

To recap:
- Build your RAG system in a modular way. Containerize each component so you can scale it as needed. Don't build monoliths.
- Match your RAG system's task to the data you have at your disposal. Consider the features of your data, like the presence of code, or implicit structure from HTML.
- Try several chunking strategies, and consider looking into more advanced strategies as suggested in Lesson 01. Generate different indices corresponding to different chunking strategies so you can compare them.
- Try both semantic and keyword search. Consider combining them for hybrid search. 
- Evaluate the precision/recall of your retrieval system using human-as-a-judge and LLM-as-a-judge frameworks.
- If you have a variety of tasks your system can perform, use multiple prompts, multiple retrieval indexes for context (mapping to different chunking strategies), and potentially even multiple models. Consider our approach of using a classifier to route to the appropriate expert model.

Thank you, and we'll see you in your next DLI!