# RAG Example Using NVIDIA API Catalog and LangChain

This notebook shows how to use LangChain to interact with NVIDIA-hosted NIM microservices, such as chat, embedding, and reranking models, to build a simple retrieval-augmented generation (RAG) application.

## Terminology

#### RAG

- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of retrieving the appropriate information and inserting it into the model prompt is known as retrieval-augmented generation (RAG).

The preceding summary of RAG originates in the [Build a RAG App](https://python.langchain.com/v0.2/docs/tutorials/rag/) tutorial in the LangChain v0.2 documentation.
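
As a rough illustration of the idea (not the pipeline built later in this notebook), the sketch below retrieves the most relevant snippet with a naive word-overlap score and inserts it into a prompt. The corpus, the `tokenize`, `retrieve`, and `build_prompt` helpers, and the scoring rule are all assumptions made for this example; a real application would use embeddings and a vector store instead.

```python
import re

# Hypothetical mini corpus standing in for private or recent data.
CORPUS = [
    "The capital and largest city of Sweden is Stockholm.",
    "Sweden borders Norway to the west and north.",
    "FAISS is a library for similarity search over dense vectors.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k snippets sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Insert the retrieved snippets into the prompt ahead of the question."""
    joined = "\n".join(context)
    return f"Answer solely based on the following context:\n{joined}\n\nQuestion: {question}"

question = "What is the capital of Sweden?"
context = retrieve(question, CORPUS)
prompt_text = build_prompt(question, context)
print(prompt_text)
```

The model never sees the whole corpus, only the snippet that the retrieval step selected, which is the essence of the technique.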

#### NIM

- [NIM microservices](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) are containerized microservices that simplify the deployment of generative AI models like LLMs and are optimized to run on NVIDIA GPUs. 
- NIM microservices support models across domains like chat, embedding, reranking, and more from both the community and NVIDIA.

#### NVIDIA API Catalog

- [NVIDIA API Catalog](https://build.nvidia.com/explore/discover) is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.

#### langchain-nvidia-ai-endpoints

- The [`langchain-nvidia-ai-endpoints`](https://pypi.org/project/langchain-nvidia-ai-endpoints/) Python package contains LangChain integrations for building applications that communicate with NVIDIA NIM microservices.

## Installation and Requirements

Create a Python environment (preferably with Conda) using Python version 3.10.14. 
To install Jupyter Lab, refer to the [installation](https://jupyter.org/install) page.

In [None]:
# Requirements
!pip install langchain==0.2.5
!pip install langchain_community==0.2.5
!pip install faiss-cpu==1.8.0 # replace with faiss-gpu if you are using GPU
!pip install langchain-nvidia-ai-endpoints==0.1.2

## Getting Started!

To get started you need an `NVIDIA_API_KEY` to use the NVIDIA API Catalog:

1) Create a free account with [NVIDIA](https://build.nvidia.com/explore/discover).
2) Click on your model of choice.
3) Under Input select the Python tab, and click **Get API Key** and then click **Generate Key**.
4) Copy and save the generated key as `NVIDIA_API_KEY`. You should then have access to the endpoints.

In [1]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## RAG Example using LLM & Embedding

### 1) Initialize the LLM

The `ChatNVIDIA` class is part of LangChain's integration (`langchain_nvidia_ai_endpoints`) with NVIDIA NIM microservices.
It provides access to NVIDIA NIM for chat applications and can connect to hosted or locally deployed microservices.

Here we use **mixtral-8x7b-instruct-v0.1**.

In [2]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", max_tokens=1024)

# Here we are using the mixtral-8x7b-instruct-v0.1 model,
# but you are free to choose any model hosted on the NVIDIA API Catalog.
# Uncomment the code below to list the available models.
# ChatNVIDIA.get_available_models()

### 2) Initialize the embedding model
`NVIDIAEmbeddings` is a client for NVIDIA embedding models that provides access to an NVIDIA NIM for embedding. It can connect to a hosted NIM or to a local NIM through a base URL.

We selected **NV-Embed-QA** as the embedding model.

In [3]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

### 3) Obtain a toy text dataset
Here we load toy data from text documents. In a real application, data can be loaded from various sources.
Read [here](https://python.langchain.com/v0.2/docs/tutorials/rag/#go-deeper) for more on loading data from different sources.

In [4]:
import os
from pathlib import Path

# For this example we load a toy dataset (a simple text file with some information about Sweden)
TOY_DATA_PATH = "./data/"
# We read in the text data and prepare it for the vector store
ps = os.listdir(TOY_DATA_PATH)
data = []
sources = []
for p in ps:
    if p.endswith('.txt'):
        path2file = TOY_DATA_PATH + p
        with open(path2file, encoding="utf-8") as f:
            lines = f.readlines()
            for line in lines:
                if len(line) >= 1:
                    data.append(line)
                    sources.append(path2file)

In [5]:
# Do some basic cleaning and remove empty lines
documents=[d for d in data if d != '\n']
len(data), len(documents), data[0]

(400,
 230,
 'Sweden, formally the Kingdom of Sweden, is a Nordic country located on the Scandinavian Peninsula in Northern Europe. It borders Norway to the west and north, Finland to the east, and is connected to Denmark in the southwest by a bridge–tunnel across the Öresund. At 447,425 square kilometres (172,752 sq mi), Sweden is the largest Nordic country, the third-largest country in the European Union, and the fifth-largest country in Europe. The capital and largest city is Stockholm. Sweden has a total population of 10.5 million, and a low population density of 25.5 inhabitants per square kilometre (66/sq mi), with around 87% of Swedes residing in urban areas, which cover 1.5% of the entire land area, in the central and southern half of the country.\n')

### 4) Process the documents into vectorstore and save it to disk

Real-world documents can be very long, which makes them hard to fit into the context window of many models. Even models that can fit a full document in their context window can struggle to find information in very long inputs.

To handle this, we split each document into chunks for embedding and vector storage. For more on text splitting, see [here](https://python.langchain.com/v0.2/docs/concepts/#text-splitters).
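
To make the chunk size and chunk overlap settings concrete, here is a simplified character-window splitter. It is only a sketch of the general idea: `chunk_text` is a hypothetical helper, and LangChain's `CharacterTextSplitter` works differently in detail (it splits on a separator and merges the pieces), so its chunk boundaries will not match these.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one neighboring chunk, at the cost of storing some text twice.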

In [6]:
# Here we create a faiss vector store from the documents and save it to disk.
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=400, separator=" ", chunk_overlap=80)
docs = []
metadatas = []

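# Pair each non-empty line with its source so the metadata stays aligned after the filtering above
doc_sources = [s for d, s in zip(data, sources) if d != '\n']

for d, source in zip(documents, doc_sources):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": source}] * len(splits))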

To enable search at runtime, we index the text chunks by embedding each document split and storing the embeddings in a vector database. Later, at query time, we embed the query and perform a similarity search to find the stored splits whose embeddings are most similar to the query embedding.
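
The following sketch shows the search step in miniature, using hand-written three-dimensional vectors in place of real embeddings. The `index` contents, the `similarity_search` helper, and the choice of cosine similarity are assumptions for illustration; the FAISS store used in this notebook may rank by a different metric, such as L2 distance.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real embedding models return far longer vectors.
index = {
    "Stockholm is the capital of Sweden.": [0.9, 0.1, 0.0],
    "Sweden borders Norway and Finland.": [0.6, 0.7, 0.1],
    "FAISS performs vector similarity search.": [0.0, 0.1, 0.9],
}

def similarity_search(query_vector: list[float], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda text: cosine_similarity(query_vector, index[text]),
        reverse=True,
    )
    return ranked[:k]

query_vector = [1.0, 0.2, 0.0]  # pretend this is the embedded query
top_matches = similarity_search(query_vector)
print(top_matches)
```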

In [7]:
# You only need to do this once; later on we restore the already saved vector store.
# store = FAISS.from_texts(docs, embedder , metadatas=metadatas)
VECTOR_STORE = './data/nv_embedding'
# store.save_local(VECTOR_STORE)

### 5) Read the previously processed and saved vector store back

In [8]:
# Load the FAISS vector store back.
store = FAISS.load_local(VECTOR_STORE, embedder, allow_dangerous_deserialization=True)

### 6) Wrap the restored vector store into a retriever and ask our question

In [9]:
# Create the retriever
retriever = store.as_retriever()

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

# LangChain's LCEL (LangChain Expression Language) Runnable protocol is used to define the chain
# LCEL lets you pipe components and functions together
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

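# chain.invoke("Tell me about Sweden.")
# chain.invoke("请向我介绍瑞典。")  # "Please tell me about Sweden."
# chain.invoke("请使用中文向我介绍瑞典的教育。")  # "Please tell me about education in Sweden, in Chinese."
# The query below asks: "Please tell me about Swedish culture, in Chinese."
chain.invoke("请使用中文向我介绍瑞典的文化。")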

'根据提供的上下文，我可以告诉你瑞典有着丰富的文化。然而，由于提供的详细信息很有限，我只能简要介绍一下瑞典文化的一些方面。\n\n请注意，以下内容并不是详尽的瑞典文化描述，仅供参考。\n\n瑞典的文化被 Europen Union 列为非物质文化遗产，这说明瑞典文化的 immense value。瑞典有着独特的文化传统，包括面具和木偶戏、诗歌和民歌，以及传统节日和セリモン尼（Samernas högtider）。\n\n另外，影响瑞典文化的因素还有其地理位置，以及历史上与挪威、丹麦、芬兰和其他北欧国家的联系。这都使得瑞典的文化日益发达和多元化。\n\n请注意，这里的信息每个细节都源自提供的文本，文本并没有提供详细的信息。如果你需要更多关于瑞典文化的信息，建议去查阅更多的资料。'
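
The `|` syntax in the chain above composes steps so that the output of one becomes the input of the next. The toy `Step` class below imitates that behavior with Python's `__or__` operator. It is an illustration only and not how LangChain implements LCEL; `Step`, `fake_retriever`, `fake_prompt`, and `fake_llm` are hypothetical stand-ins that make no network calls.

```python
from typing import Any, Callable

class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Composition: run this step first, then feed its result to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)

# Hypothetical stand-ins for the retriever, prompt, and LLM stages.
fake_retriever = Step(lambda q: {"context": "Stockholm is the capital of Sweden.", "question": q})
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda text: f"[model reply to {len(text)} characters of prompt]")

toy_chain = fake_retriever | fake_prompt | fake_llm
toy_result = toy_chain.invoke("What is the capital of Sweden?")
print(toy_result)
```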

In [21]:
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Define the structured model for the output format
class ResponseFormat(BaseModel):
    answer: str = Field(..., description="The main answer to the user's query")
    source: str = Field(..., description="The source of the answer, such as document ID or URL")

# Create the parser
output_parser = PydanticOutputParser(pydantic_object=ResponseFormat)

# Add the parser's format instructions to the prompt template
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

# The instructions contain literal braces, so they are supplied through a template
# variable (filled in with .partial) rather than pasted into the template string.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            ("Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>\n"
             "Format your output as JSON with the following instructions:\n{format_instructions}"
            )
        ),
        ("user", "{question}")
    ]
).partial(format_instructions=format_instructions)

# The string below is not a JSON object with "answer" and "source" fields,
# so parsing it raises OutputParserException.
parsed_output = output_parser.parse('{"Tell me about Sweden."}')
print(parsed_output.answer)


# # LangChain's LCEL (LangChain Expression Language) Runnable protocol is used to define the chain
# # LCEL lets you pipe components and functions together
# chain = (
#     {"context": retriever, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | output_parser
# )

# try:
#     chain.invoke("Tell me about Sweden.")
#     # # chain.invoke("请向我介绍瑞典。")  # "Please tell me about Sweden."
#     # # chain.invoke("请使用中文向我介绍瑞典的教育。")  # "Please tell me about education in Sweden, in Chinese."
#     # chain.invoke("请使用中文向我介绍瑞典的文化。")  # "Please tell me about Swedish culture, in Chinese."
# except Exception as e:
#     print(f"Error while parsing the output: {e}")
#     result = ResponseFormat(answer="Unable to generate an answer", source="unknown")
#     print(result)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"answer": {"description": "The main answer to the user's query", "title": "Answer", "type": "string"}, "source": {"description": "The source of the answer, such as document ID or URL", "title": "Source", "type": "string"}}, "required": ["answer", "source"]}
```
Yes


OutputParserException: Invalid json output: {"Tell me about Sweden."}
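
The exception above arises because the parsed string is not a JSON object containing the required `answer` and `source` fields. A standard-library sketch of the same kind of check follows. `parse_response`, the `Response` dataclass, and the example file path are hypothetical, and the checks only approximate what a Pydantic-based parser does.

```python
import json
from dataclasses import dataclass

@dataclass
class Response:
    answer: str
    source: str

def parse_response(text: str) -> Response:
    """Parse model output as JSON and check that the required string fields are present."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid json output: {text}") from exc
    if not isinstance(payload, dict):
        raise ValueError(f"Expected a JSON object, got: {text}")
    missing = [key for key in ("answer", "source") if not isinstance(payload.get(key), str)]
    if missing:
        raise ValueError(f"Missing or non-string fields: {missing}")
    return Response(answer=payload["answer"], source=payload["source"])

# A well-formed reply (the source path is a made-up example value).
good = parse_response('{"answer": "Stockholm", "source": "./data/sweden.txt"}')
print(good)

# The malformed string from the cell above is rejected.
try:
    parse_response('{"Tell me about Sweden."}')
except ValueError as err:
    print(err)
```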

## RAG Example with LLM, Embedding & Reranking

In [10]:
# Let's test a more complex query using the above LLM Embedding chain and see if the reranker can help.
chain.invoke("In which year Gustav's grandson ascended the throne?")

"The text does not provide information about Gustav's grandson ascending the throne. The text only provides information up to the establishment of the House of Bernadotte in 1818."

### Enhancing accuracy for single data sources

This example demonstrates how a reranking model can be used to combine retrieval results and improve accuracy when retrieving documents.

Typically, reranking is a critical piece of high-accuracy, efficient retrieval pipelines. Generally, there are two important use cases:

- Combining results from multiple data sources
- Enhancing accuracy for single data sources

Here, we focus on demonstrating only the second use case. To learn more, see [here](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/docs/retrievers/nvidia_rerank.ipynb).
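
The second use case can be pictured as a two-stage process: a fast first-stage retriever returns a broad candidate set, and a more precise scorer reorders it and keeps only the top few. The sketch below uses made-up scores. `first_stage`, `rerank`, and both scoring tables are assumptions for illustration and do not call any NVIDIA model.

```python
# Hypothetical scores: a cheap first-stage score and a more accurate reranker score per passage.
FIRST_STAGE_SCORE = {"passage_a": 0.80, "passage_b": 0.75, "passage_c": 0.70, "passage_d": 0.20}
RERANKER_SCORE = {"passage_a": 0.30, "passage_b": 0.10, "passage_c": 0.95, "passage_d": 0.90}

def first_stage(k: int) -> list[str]:
    """Return the k passages with the highest first-stage score (broad recall)."""
    return sorted(FIRST_STAGE_SCORE, key=FIRST_STAGE_SCORE.get, reverse=True)[:k]

def rerank(candidates: list[str], top_n: int) -> list[str]:
    """Reorder the candidates by reranker score and keep the best top_n (precision)."""
    return sorted(candidates, key=RERANKER_SCORE.get, reverse=True)[:top_n]

candidates = first_stage(k=3)
final = rerank(candidates, top_n=2)
print(candidates)
print(final)
```

In this toy data, `passage_c` moves from third place to first after reranking, while `passage_d` is never reconsidered because the first stage did not return it. That is one reason to keep the first-stage `k` generous.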

In [11]:
from langchain_nvidia_ai_endpoints import NVIDIARerank
from langchain_core.runnables import RunnableParallel

# We will narrow the collection to 100 results and further narrow it to 10 with the reranker.
retriever = store.as_retriever(search_kwargs={'k':100}) # typically k will be 1000 for real world use-cases
ranker = NVIDIARerank(model='nv-rerank-qa-mistral-4b:1', top_n=10)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

reranker = lambda input: ranker.compress_documents(query=input['question'], documents=input['context'])

chain_with_ranker = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | {"context": reranker, "question": lambda input: input['question']}
    | prompt
    | llm
    | StrOutputParser()
)
chain_with_ranker.invoke("In which year Gustav's grandson ascended the throne?")


"Gustav's grandson, Sigismund, ascended the throne in 1592."

#### Note:
- In this notebook, we have used NVIDIA NIM microservices from the NVIDIA API Catalog.
- The classes used above, `ChatNVIDIA`, `NVIDIAEmbeddings`, and `NVIDIARerank`, also support self-hosted NIM microservices.
- Change the `base_url` to your deployed NIM URL.
- Example: `llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")`
- NIM can be hosted locally using Docker, following the [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) documentation.

In [12]:
# Example Code snippet if you want to use a self-hosted NIM
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# connect to an LLM NIM running at localhost:8000, specifying a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")
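
One possible way to switch between the hosted catalog and a self-hosted NIM without editing code is to read the endpoint from the environment. The helper below is a sketch under that assumption: `chat_client_kwargs` and the `NIM_BASE_URL` and `NIM_MODEL` variable names are hypothetical and not part of `langchain-nvidia-ai-endpoints`. It only builds the keyword arguments, so it runs without the package installed.

```python
import os

DEFAULT_MODEL = "meta/llama3-8b-instruct"

def chat_client_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for ChatNVIDIA, adding base_url only when NIM_BASE_URL is set."""
    kwargs = {"model": env.get("NIM_MODEL", DEFAULT_MODEL)}  # hypothetical variable name
    base_url = env.get("NIM_BASE_URL")  # hypothetical variable name
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs

print(chat_client_kwargs({}))
print(chat_client_kwargs({"NIM_BASE_URL": "http://localhost:8000/v1"}))
# Usage (requires langchain-nvidia-ai-endpoints): llm = ChatNVIDIA(**chat_client_kwargs())
```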