<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB6: Advanced RAG
In this lab we cover some advanced RAG topics: re-ranking, guardrails and prompt enhancement. We will use NVIDIA's NIM API. The models we call are part of NVIDIA's NeMo framework. You can explore which models are available at [https://build.nvidia.com/](https://build.nvidia.com/).



***

## Re-Rank

In RAG pipelines we retrieve additional context from external sources to help the LLM answer more accurately. A typical RAG pipeline retrieves a small number of context chunks, e.g. 2 or 3. This number is kept small to keep latency (in particular the time to first token, or TTFT) low for a better user experience. The downside is the risk that the chunks returned by the retrieval mechanism (e.g. cosine similarity search) are not the most relevant ones. Increasing the number of chunks retrieved increases the likelihood of retrieving the most relevant context. Re-ranking can be applied to that larger set of chunks to select the best ones and pass only those to the model, without dramatically increasing latency.

Re-ranking the output of a retriever typically yields more relevant results than the retriever alone. In hybrid retrieval solutions (e.g. vector databases + web search) re-ranking is essential, as it helps combine and compare results from different sources of data.
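The two-stage idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`retrieve_then_rerank`, `word_overlap`), the toy embeddings and the word-overlap scorer are all hypothetical stand-ins. In a real pipeline the first stage would be a vector database query and the second stage a call to a re-ranking model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_then_rerank(query_vec, query_text, chunks, rerank_score,
                         wide_k=4, final_k=2):
    """chunks is a list of (text, embedding) pairs."""
    # Stage 1: cheap vector search over all chunks, keeping a wide candidate set.
    candidates = sorted(chunks, key=lambda c: cosine(query_vec, c[1]),
                        reverse=True)[:wide_k]
    # Stage 2: the more expensive re-ranker only scores the candidates.
    reranked = sorted(candidates, key=lambda c: rerank_score(query_text, c[0]),
                      reverse=True)
    return [text for text, _ in reranked[:final_k]]

def word_overlap(query, text):
    """Hypothetical stand-in for a re-ranking model: counts shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Toy data: the embeddings are made up for illustration.
chunks = [
    ("a100 memory", [1.0, 0.0]),
    ("h100 memory bandwidth", [0.8, 0.6]),
    ("grace cpu", [0.0, 1.0]),
]
print(retrieve_then_rerank([1.0, 0.0], "h100 memory bandwidth", chunks,
                           word_overlap, wide_k=2, final_k=1))
```

In this toy data the vector search alone would rank the A100 chunk first, while the second stage promotes the H100 chunk, which is the behaviour re-ranking is meant to provide.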

NVIDIA's NeMo framework provides re-ranking models, which are also available as NIMs. These are specialised transformer models, which means they can be accelerated with GPUs.

According to NVIDIA's catalog documentation, the re-ranking models don't use the OpenAI API syntax, so we will use the "requests" library to make standard REST API calls. Let's start by installing the dependencies.

In [None]:
!pip install requests

Now we can import it

In [None]:
import requests

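Next we read the NIM API key and store it in a variable called "apikey" for future use. The cell below reads it from a Google Colab secret named "apikey"; the commented-out lines show the alternative of reading it from the NVIDIA_API_KEY environment variable. You can uncomment the "print" command if you want to validate that it has been read correctly.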

In [None]:
#import os
#apikey = os.environ["NVIDIA_API_KEY"]
#change from OS variable import to using Google Colab secret
from google.colab import userdata
apikey = userdata.get('apikey')
#print(apikey)

This is the endpoint provided by NVIDIA. We also build a header that includes the API key to authenticate our requests

In [None]:
invoke_url = "https://ai.api.nvidia.com/v1/retrieval/nvidia/nv-rerankqa-mistral-4b-v3/reranking"

headers = {
    "Authorization": "Bearer " + apikey,
    "Accept": "application/json",
}

There are several re-ranking models available. We will use this one, which is a fine-tuned version of a Mistral model with fewer layers for faster inference: ```nvidia/nv-rerankqa-mistral-4b-v3```

The payload for the API call requires the "query" itself and the "passages" we want to re-rank. In a real RAG pipeline the "passages" will be the output of the retrieval, but here we are going to define the "passages" ourselves.

In [None]:
passages = [
    {"text": "The Hopper GPU is paired with the Grace CPU using NVIDIA's ultra-fast chip-to-chip interconnect, delivering 900GB/s of bandwidth, 7X faster than PCIe Gen5. This innovative design will deliver up to 30X higher aggregate system memory bandwidth to the GPU compared to today's fastest servers and up to 10X higher performance for applications running terabytes of data."},
    {"text": "A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands. The A100 80GB debuts the world's fastest memory bandwidth at over 2 terabytes per second (TB/s) to run the largest models and datasets."},
    {"text": "Accelerated servers with H100 deliver the compute power—along with 3 terabytes per second (TB/s) of memory bandwidth per GPU and scalability with NVLink and NVSwitch™."}
  ]

Let's assemble the full payload

In [None]:
payload = {
  "model": "nvidia/nv-rerankqa-mistral-4b-v3",
  "query": {"text": "What is the GPU memory bandwidth of H100 SXM?"},
  "passages": passages,
  "truncate": "END"
}

The "truncate" parameter dictates what to do when the token limit is exceeded. When computing the limit, the query and the passages are added together. There are two options:
- ```"truncate": "NONE"``` returns an error if the token limit is exceeded
- ```"truncate": "END"``` ignores tokens beyond the model's limit. For example, if the model has a limit of 500 tokens but the query + passages add up to 600, then the final 100 tokens will be dropped
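To make the two options concrete, here is a small local simulation of the behaviour described above. It is a sketch only: `apply_truncate` is a hypothetical helper, the "tokens" are just list items, and the real service applies its own tokenizer and limit on the server side.

```python
def apply_truncate(tokens, limit, truncate="END"):
    """Simulate the two truncate modes on an already-tokenized input."""
    if len(tokens) <= limit:
        return tokens  # within the limit: nothing to do in either mode
    if truncate == "NONE":
        # Mirrors the API returning an error instead of a result.
        raise ValueError(f"input has {len(tokens)} tokens, limit is {limit}")
    if truncate == "END":
        return tokens[:limit]  # drop everything past the limit
    raise ValueError(f"unknown truncate mode: {truncate}")

# The example from the text: 600 tokens against a limit of 500.
tokens = [f"tok{i}" for i in range(600)]
kept = apply_truncate(tokens, limit=500, truncate="END")
print(len(tokens) - len(kept), "tokens dropped")
```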

Finally, we can send the API call and look at the response

In [None]:
response = requests.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()  # fail early on HTTP errors such as an invalid API key
response_body = response.json()
print(response_body)

The response is the list of passages sorted in decreasing order of relevance. The score is reported as a "logit", which is the raw, unnormalized prediction the model makes, so a higher value means a more relevant passage. Note that the response doesn't send back the actual passages, just their index in the original list (in Python, the first element of a list has index 0).

Now we could for example decide to send only the top passage

In [None]:
print("The best match is: ", passages[response_body["rankings"][0]["index"]]["text"])
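In practice we often want the top few passages rather than only the best one. The sketch below generalizes the lookup above. It assumes each ranking entry carries a "logit" key next to "index" (check your own response to confirm), uses made-up sample data instead of a live API call, and applies a sigmoid as one common way of squashing a logit into a 0 to 1 score. The helper name `top_passages` is hypothetical.

```python
import math

def top_passages(passages, rankings, k=2):
    """Map the first k ranking entries back to their passage text."""
    results = []
    for entry in rankings[:k]:
        logit = entry["logit"]  # assumed key name for the raw score
        results.append({
            "text": passages[entry["index"]]["text"],
            "logit": logit,
            # Sigmoid: squashes the raw logit into the (0, 1) range.
            "score": 1.0 / (1.0 + math.exp(-logit)),
        })
    return results

# Made-up data with the same shape as the lab's passages and response.
sample_passages = [{"text": "passage A"}, {"text": "passage B"}, {"text": "passage C"}]
sample_rankings = [
    {"index": 2, "logit": 3.0},
    {"index": 0, "logit": -1.0},
    {"index": 1, "logit": -4.0},
]
for item in top_passages(sample_passages, sample_rankings, k=2):
    print(f'{item["score"]:.3f}  {item["text"]}')
```

With the real response you would call `top_passages(passages, response_body["rankings"], k=2)` and pass the resulting texts to the LLM as context.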

***

## Guardrails

A common requirement for enhanced RAG solutions is to implement guardrails to moderate the conversation:
- to ensure it stays within the specific topic the application has been designed for
- to ensure both the user prompt and the model's response are safe and free from violence, criminality, discrimination, etc.

NVIDIA provides several models in the NIM catalog to address these two requirements as part of the NeMo framework. Increasingly, these models are deployed as part of an agentic RAG workflow, where a specific agent uses one of these models.

### Topic Control

Topic control is best applied before completion. Typically, an agent in an agentic workflow would use a model like NVIDIA's
```llama-3.1-nemoguard-8b-topic-control``` to detect whether the prompt is off-topic before passing it to the main LLM. If the prompt is off-topic, the agent can notify the user and politely encourage relevant questions.

This model does follow the OpenAI REST API interface, so we will install and import the OpenAI Python library.

In [None]:
!pip install openai

Let's import the library

In [None]:
from openai import OpenAI

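Next we read the NIM API key and store it in a variable called "apikey" for future use. As before, the cell reads it from a Google Colab secret named "apikey"; the commented-out lines show the alternative of reading it from the NVIDIA_API_KEY environment variable. You can uncomment the "print" command if you want to validate that it has been read correctly.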

In [None]:
#import os
#apikey = os.environ["NVIDIA_API_KEY"]
#change from OS variable import to using Google Colab secret
from google.colab import userdata
apikey = userdata.get('apikey')
#print(apikey)

Let's create a client instance. This client can access all the models, so there is no need for a separate client connection for each model. Notice how we specify the API key: the client uses your own key, stored in "apikey" in the previous step.

In [None]:
client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = apikey
)

We can now use the client connection to check whether the user's prompt is off-topic or not

In [None]:
completion = client.chat.completions.create(
  model="nvidia/llama-3.1-nemoguard-8b-topic-control",
  messages=[
      {
          "role":"system",
          "content":"You are to act as an investor relations bot for ABC, providing users with factual, publicly available information related to the company's financial performance and corporate updates. Your role is to ensure that you respond only to relevant queries and adhere to the following guidelines:\n\n1. Do not answer questions about future predictions, such as profit forecasts or future revenue outlook.\n2. Do not provide any form of investment advice, including recommendations to buy, sell, or hold ABC stock or any other securities. Never recommend any stock or investment.\n3. Do not engage in discussions that require personal opinions or subjective judgments. Never make any subjective statements about ABC, its stock or its products.\n4. If a user asks about topics irrelevant to ABC's investor relations or financial performance, politely redirect the conversation or end the interaction.\n5. Your responses should be professional, accurate, and compliant with investor relations guidelines, focusing solely on providing transparent, up-to-date information about ABC that is already publicly available."
      },
      {
          "role":"user",
          "content":"Can you speculate on the potential impact of a recession on ABCs business?"
      }
  ],
  temperature=0.5,
  top_p=1,
  max_tokens=1024
)
print(completion.choices[0].message)
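Once we have the verdict, the agent needs to act on it. The sketch below shows one way to gate the main LLM call. It assumes the model's reply text is either "on-topic" or "off-topic" (verify this against the output printed above), and the names `is_on_topic` and `route_prompt` as well as the refusal wording are hypothetical. Anything other than an explicit "on-topic" is blocked, so an unexpected reply fails closed.

```python
REFUSAL = ("Sorry, I can only help with questions about ABC's publicly "
           "available financial information. Is there something along "
           "those lines I can help with?")

def is_on_topic(verdict_text):
    """True only when the guard model explicitly says the prompt is on-topic."""
    # Assumption: the reply is the plain text "on-topic" or "off-topic".
    return verdict_text.strip().lower() == "on-topic"

def route_prompt(user_prompt, verdict_text):
    """Decide whether to forward the prompt to the main LLM."""
    if is_on_topic(verdict_text):
        return {"forward_to_llm": True, "prompt": user_prompt}
    # Off-topic or unrecognized verdict: reply politely, skip the main LLM.
    return {"forward_to_llm": False, "reply": REFUSAL}

print(route_prompt("What was ABC's revenue last quarter?", "on-topic"))
print(route_prompt("Can you speculate on a recession?", "off-topic"))
```

In the notebook, the verdict text would come from `completion.choices[0].message.content`.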

### Content Safety

Another form of guardrails is to ensure content safety. This is applicable to both the user prompt and the response from the LLM. In this exercise we are going to use NVIDIA's ```llama-3.1-nemoguard-8b-content-safety``` model. As its name indicates, it is a fine-tuned version of llama3.1-8b that just classifies prompts and responses as safe or unsafe.

This model also follows the OpenAI REST API interface. We don't need to install or import the OpenAI Python library, read the NVIDIA key or create a client connection again, because we have already done all of that in the previous section.

[Click here to explore the model card](https://build.nvidia.com/nvidia/llama-3_1-nemoguard-8b-content-safety/modelcard) to see what content categories are used to determine whether the content is unsafe

Notice below how we are passing the prompt (role=user) and the response from the LLM (role=assistant)


In [None]:
completion = client.chat.completions.create(
  model="nvidia/llama-3.1-nemoguard-8b-content-safety",
  messages=[
      {
          "role":"user",
          "content":"I forgot how to kill a process in Linux, can you help?"
      },
      {
          "role":"assistant",
          "content":"Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."
      }
  ],
  stream=False
)

print(completion.choices[0].message)

As you can see, the model realizes that within the context of Linux administration the word "kill" is in fact safe.

You can experiment further by creating your own prompts and responses and observing whether they are flagged as safe or unsafe by the model.
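To use the verdict programmatically rather than just reading it, the reply can be parsed. The sketch below assumes the reply text is a JSON object with "User Safety" and "Response Safety" keys and an optional comma-separated "Safety Categories" string; confirm this against the output you get and the model card. The function name `parse_safety_verdict` is hypothetical. Anything that cannot be parsed is treated as unsafe, so the check fails closed.

```python
import json

def parse_safety_verdict(raw_reply):
    """Turn the guard model's reply text into booleans plus a category list."""
    blocked = {"user_safe": False, "response_safe": False, "categories": []}
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return blocked  # not JSON: treat as unsafe
    if not isinstance(data, dict):
        return blocked
    # Assumed key names; a missing key counts as "not confirmed safe".
    raw_categories = str(data.get("Safety Categories", ""))
    return {
        "user_safe": data.get("User Safety") == "safe",
        "response_safe": data.get("Response Safety") == "safe",
        "categories": [c.strip() for c in raw_categories.split(",") if c.strip()],
    }

# Made-up replies in the assumed format.
print(parse_safety_verdict('{"User Safety": "safe", "Response Safety": "safe"}'))
print(parse_safety_verdict(
    '{"User Safety": "unsafe", "Safety Categories": "Violence, Threat"}'))
```

In the notebook, the text to parse would be `completion.choices[0].message.content`. Note that if only a user prompt is sent, the missing "Response Safety" key makes `response_safe` False under this sketch.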

***

## Prompt Enhancement

Prompt engineering is an art and it can make a big difference in the output we get from Large Language Models. That's the reason some agentic RAG solutions include a "Prompt Enhancement" agent. The mission of this type of agent is to take the user's prompt and rephrase it and enhance it in a way that can yield better results when passed to the LLM.

In the previous advanced RAG techniques we used dedicated models that were created for a specific task. An agent needs to use a model. However, not all agents in agentic workflows (including agentic RAG) need their own specialized model. For some use cases, we can use the main LLM with a special "system" prompt that tells it to behave in a certain way. In this case we are using a "system" prompt to get the LLM to behave like a "prompt engineer".

In [None]:
messages = [
    {
        "role": "system",
        "content": "you are a prompt engineering assistant that helps users improve their prompts so that they can get a better response from a generative AI model. When you get the user's prompt, firstly correct any syntax errors and provide enhanced instructions and related questions so that it can produce better results with a generative AI model. The response must be only a JSON structure with two keys called 'original_prompt' and 'enhanced_prompt'. Do not explain the enhancements you made and don't format the output as markdown, just the plain JSON"
    },
    {
        "role": "user",
        "content": "most popular pets in Australia",
    },
]

Notice how the "user" didn't even formulate a question. The syntax for requesting a completion from the LLM is the same as the one we used in Lab01.

In [None]:
completion = client.chat.completions.create(
  model="meta/llama-3.2-3b-instruct",
  messages=messages,
  temperature=0.2,
  max_tokens=1024
)

In [None]:
print(completion.choices[0].message)
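In an agentic workflow the enhanced prompt would then be passed on to the main LLM, which means extracting it from the reply. The sketch below is one defensive way to do that. It relies on the two keys requested in the system prompt, tolerates the model wrapping the JSON in extra text or markdown fences despite the instructions, and falls back to the original prompt if parsing fails. The function name `extract_enhanced_prompt` is hypothetical and the sample replies are made up.

```python
import json

def extract_enhanced_prompt(raw_reply, fallback):
    """Return the 'enhanced_prompt' value, or fallback if it can't be found."""
    # Keep only the outermost {...} span in case the model added extra text.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    enhanced = data.get("enhanced_prompt") if isinstance(data, dict) else None
    if isinstance(enhanced, str) and enhanced.strip():
        return enhanced
    return fallback

# Made-up reply in the format requested by the system prompt.
reply = ('{"original_prompt": "most popular pets in Australia", '
         '"enhanced_prompt": "What are the most popular pets in Australia?"}')
print(extract_enhanced_prompt(reply, fallback="most popular pets in Australia"))
```

In the notebook, `raw_reply` would be `completion.choices[0].message.content` and `fallback` the user's original prompt.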

Experiment with your own simple prompts and observe how the prompt gets enhanced

### End of Lab 6