# Retrieval Augmented Generation (RAG) with Hosted NIMs

In this tutorial, we will experiment with retrieval augmented generation (RAG), a technique that injects relevant context into the prompt of a large language model (LLM). This can reduce hallucinations and give the LLM access to up-to-date information.

There are two main steps to RAG:
1. **Document ingestion and embedding:** documents are split into "chunks", and an embedding representation of each chunk is created and stored.
2. **Querying and retrieval:** the user's query is embedded, relevant chunks are retrieved, and those chunks are injected as context into the prompt of the large language model (LLM). The LLM then answers the user's query based on the retrieved chunks.
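
To make the two steps concrete, here is a minimal, self-contained sketch. It is not the pipeline used later in this tutorial: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical helpers introduced only for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: ingest the chunks and store their embeddings
chunks = [
    "Titanic is a 1997 film directed by James Cameron.",
    "FAISS is a library for vector similarity search.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 2: embed the query and retrieve the most similar chunks
def retrieve(query, index, k=1):
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The retrieved chunks would then be placed in the LLM prompt as context
print(retrieve("Who directed Titanic?", index, k=1))
```

In a real system, the embedding model and the vector store replace `embed` and the `index` list, but the flow stays the same.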

This tutorial is adapted from https://nvidia.github.io/GenerativeAIExamples/latest/notebooks/02_Option%281%29_NVIDIA_AI_endpoint_simple.html.

We'll be using hosted NIMs (NVIDIA inference microservices) for this tutorial.

**Please do not input sensitive information (e.g. PHI).**

In [1]:
# Install dependencies
# Note, you might need to restart the notebook and rerun this cell
!pip install langchain langchain_nvidia_ai_endpoints langchain_community
!pip install faiss-cpu  # replace with faiss-gpu if you are using a GPU

Collecting faiss-cpu
  Using cached faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


In [2]:
# Get API key
from google.colab import userdata
NVIDIA_API_KEY = userdata.get('NVIDIA_API_KEY')

In [3]:
# Get LLM NIM endpoint
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", nvidia_api_key=NVIDIA_API_KEY, max_tokens=1024)

result = llm.invoke("Write a ballad about LangChain.")
print(result.content)

 (Verse 1)
In a world of knowledge, vast and wide,
A hero rose, with vision inside.
Named LangChain, a name well earned,
For bridging gaps, learning curves turned.

(Chorus)
LangChain, oh great LangChain,
Unlocking doors to wisdom's domain.
Your chains of language, strong and bright,
Guide us through the darkest night.

(Verse 2)
Through the power of AI so potent,
LangChain forged a harmonious bond.
Between each word, each phrase, each line,
A tapestry of thought intertwined.

(Chorus)
LangChain, oh great LangChain,
Neath your banner, we shall gain.
The secrets locked within each tongue,
No barrier too wide, no task too long.

(Bridge)
From Mandarin's rise to English's fall,
Through whispered tales of Sanskrit's call.
Beneath your gaze, each letter sings,
As bridges built on logic springs.

(Verse 3)
With every corner of the globe connected,
All stories, now in one shared speech.
LangChain's legacy, profound and tall,
A beacon of unity for us all.

(Chorus)
LangChain, oh great LangChai

In [4]:
# Get Embedding NIM endpoint
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(model="NV-Embed-QA", api_key=NVIDIA_API_KEY)

## Step 1: Process and embed the documents

In this step, we will read in the txt files and chunk them. Then, we will embed the chunks using our embedding NIM, and save the embeddings in a vector database.

The `toy_data` used by the original tutorial can be found here: https://github.com/NVIDIA/GenerativeAIExamples/tree/main/notebooks/toy_data.

The `unstructured` library can help with more complex data, such as PDFs.
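
As a rough illustration of what "chunking" means, the sketch below greedily packs space-separated words into chunks of at most `chunk_size` characters. It is a simplified stand-in, not LangChain's `CharacterTextSplitter` implementation (which, for example, also supports overlapping chunks); `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(text, chunk_size=400, separator=" "):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = " ".join(["word"] * 30)  # 30 words, 149 characters
sample_chunks = split_into_chunks(sample, chunk_size=50)
print(len(sample_chunks), [len(c) for c in sample_chunks])
```

Smaller chunks make retrieval more precise but give the LLM less surrounding context per chunk, so the chunk size is usually tuned per dataset.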

In [6]:
import os
from tqdm import tqdm
from pathlib import Path

# Read the text files line by line and record which file each line came from
ps = os.listdir("toy_data/")
data = []
sources = []
for p in ps:
    if p.endswith('.txt'):
        path2file="./toy_data/"+p
        with open(path2file,encoding="utf-8") as f:
            lines=f.readlines()
            for line in lines:
                if len(line)>=1:
                    data.append(line)
                    sources.append(path2file)

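# Drop blank lines, keeping `sources` aligned with `documents` so that
# sources[i] is still the file that documents[i] came from
pairs = [(d, s) for d, s in zip(data, sources) if d != '\n']
documents = [d for d, _ in pairs]
sources = [s for _, s in pairs]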
len(data), len(documents), data[0]

(203,
 122,
 "Titanic is a 1997 American epic romance and disaster film directed, written, produced, and co-edited by James Cameron. Incorporating both historical and fictionalized aspects, it is based on accounts of the sinking of RMS Titanic in 1912. Kate Winslet and Leonardo DiCaprio star as members of different social classes who fall in love during the ship's maiden voyage. The film also features Billy Zane, Kathy Bates, Frances Fisher, Gloria Stuart, Bernard Hill, Jonathan Hyde, Victor Garber, and Bill Paxton.\n")

In [7]:
# Here we create a vector store from the documents and save it to disk.
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter

# Split each document on spaces into chunks of roughly 400 characters
text_splitter = CharacterTextSplitter(chunk_size=400, separator=" ")
docs = []
metadatas = []

for i, d in enumerate(documents):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": sources[i]}] * len(splits))

store = FAISS.from_texts(docs, embedder , metadatas=metadatas)
store.save_local('toy_data/nv_embedding')

# You only need to do this once; later on we will reload the saved vector store from disk

In [8]:
# Load the vector store back.
store = FAISS.load_local("toy_data/nv_embedding",
                         embedder,
                         # We only turn this on because we are loading the vector store we saved
                         # in the previous cell. Loading relies on pickle, so use with caution
                         # and never enable it for files you do not trust!
                         allow_dangerous_deserialization=True)

## Step 2: Querying

In this step, we use LangChain to simplify querying: the user's query is embedded, similar chunks are retrieved and passed to the LLM as context, and the model then answers the question using that context.
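
To show what "passed to the LLM as context" amounts to, here is a plain-Python sketch of assembling the messages by hand. The system template mirrors the one used in this tutorial's prompt; `build_messages` is a hypothetical helper, and the example chunks are illustrative only.

```python
SYSTEM_TEMPLATE = (
    "Answer solely based on the following context:\n"
    "<Documents>\n{context}\n</Documents>"
)

def build_messages(question, retrieved_chunks):
    """Join the retrieved chunks and place them in the system message."""
    context = "\n\n".join(retrieved_chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("user", question),
    ]

messages = build_messages(
    "Who directed Titanic?",
    ["Titanic is a 1997 film directed by James Cameron.",
     "Kate Winslet and Leonardo DiCaprio star in the film."],
)
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Keeping the context inside explicit delimiters such as `<Documents>` makes it easier for the model to tell the retrieved text apart from the instructions.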

A common additional step here is to use a reranker model to rerank the retrieved chunks.
- Here is a reranker NIM you can use: https://build.nvidia.com/explore/retrieval#rerank-qa-mistral-4b
- Here's a nice blog post by Pinecone to learn more about reranking: https://pinecone.io/learn/series/rag/rerankers/
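
The reranking idea can be sketched as a second scoring pass over the chunks returned by the first-stage retriever. The sketch below uses a toy token-overlap score as a stand-in for a real reranker model; it does not call the reranker NIM, and `overlap_score` and `rerank` are hypothetical names.

```python
import re

def overlap_score(query, chunk):
    """Toy relevance score: fraction of query tokens that also appear in the chunk."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    chunk_tokens = set(re.findall(r"\w+", chunk.lower()))
    return len(query_tokens & chunk_tokens) / len(query_tokens) if query_tokens else 0.0

def rerank(query, chunks, top_n=2, score_fn=overlap_score):
    """Re-order first-stage results by a second-stage score and keep the best top_n."""
    return sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)[:top_n]

candidates = [
    "See also: list of disaster films.",
    "The film was directed by James Cameron.",
    "Kate Winslet and Leonardo DiCaprio star in the film.",
]
top = rerank("Who directed the film?", candidates, top_n=2)
print(top)
```

A typical setup retrieves a larger candidate set (say, the top 20 chunks) cheaply with embeddings, then lets the slower but more accurate reranker choose the few chunks that actually go into the prompt.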

In [9]:
retriever = store.as_retriever()

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("Tell me about Sweden.")


' I\'m sorry for any confusion, but the documents provided don\'t contain any information about Sweden. They seem to be related to the "Titanic" film, containing sections like "Effects," "Editing," "See also," and "Accolades." If you have information related to Sweden, I\'d be happy to help based on that.'

# Now, it's your turn!

You can use your own documents instead of the ones provided to create a RAG chatbot.
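
As a starting point, here is a small sketch for loading your own `.txt` files into the same `documents` and `sources` lists used in Step 1. The helper name `load_text_lines` and the folder name `my_data` are assumptions for illustration; adapt them to your own setup.

```python
from pathlib import Path

def load_text_lines(folder):
    """Return (documents, sources): one entry per non-blank line of every .txt file in folder."""
    documents, sources = [], []
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    documents.append(line)
                    sources.append(str(path))
    return documents, sources

# Example usage with a hypothetical folder name; a missing folder simply yields empty lists
documents, sources = load_text_lines("my_data")
print(len(documents), "lines loaded")
```

Because blank lines are skipped while reading, `sources[i]` always names the file that `documents[i]` came from.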

If you want some inspiration, check out these synthetic clinical notes: https://huggingface.co/datasets/starmpcc/Asclepius-Synthetic-Clinical-Notes

It's out of scope for this tutorial, but to use RAG in a nice UI (i.e. not in Colab), I recommend following this NVIDIA tutorial: https://nvidia.github.io/GenerativeAIExamples/latest/multi-turn.html

Stay tuned for some healthcare examples!