# RAG Example Using NVIDIA API Catalog and LangChain

This notebook shows how to use LangChain with NVIDIA-hosted NIM microservices, such as chat, embedding, and reranking models, to build a simple retrieval-augmented generation (RAG) application.

## Terminology

#### RAG

- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of bringing the appropriate information and inserting it into the model prompt is known as retrieval augmented generation (RAG).

The preceding summary of RAG is adapted from the [Build a RAG App](https://python.langchain.com/v0.2/docs/tutorials/rag/) tutorial in the LangChain v0.2 documentation.
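
To make the retrieve-then-insert idea concrete, here is a minimal sketch in plain Python. It is not how LangChain or NIM implement RAG: the word-overlap scoring, the helper names (`tokenize`, `retrieve`, `build_prompt`), and the two-sentence corpus are all assumptions made up for illustration. A real application would use an embedding model for retrieval and send the prompt to an LLM.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Insert the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer solely based on the following:\n{context}\n\nQuestion: {question}"

# Hypothetical toy corpus, for illustration only
corpus = [
    "The library opens at nine in the morning.",
    "Parking permits are renewed every January.",
]
prompt = build_prompt("When does the library open?", corpus)
print(prompt)
```

Only the sentence about the library ends up in the prompt, which is the "augmentation" step: the model sees the relevant data even though it was never trained on it.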

#### NIM

- [NIM microservices](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) are containerized microservices that simplify the deployment of generative AI models like LLMs and are optimized to run on NVIDIA GPUs.
- NIM microservices support models across domains like chat, embedding, reranking, and more from both the community and NVIDIA.

#### NVIDIA API Catalog

- [NVIDIA API Catalog](https://build.nvidia.com/explore/discover) is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.

#### langchain-nvidia-ai-endpoints

- The [`langchain-nvidia-ai-endpoints`](https://pypi.org/project/langchain-nvidia-ai-endpoints/) Python package contains LangChain integrations for building applications that communicate with NVIDIA NIM microservices.

## Installation and Requirements

Create a Python environment (preferably with Conda) using Python version 3.10.14.
To install Jupyter Lab, refer to the [installation](https://jupyter.org/install) page.

In [4]:
# Requirements
!pip install langchain==0.2.5
!pip install langchain_community==0.2.5
!pip install langchain-nvidia-ai-endpoints==0.1.2
!pip install requests faiss-cpu pdfplumber spacy beautifulsoup4
# If you are using a GPU, install faiss-gpu instead of faiss-cpu in the line above






## Getting Started!

To get started you need an `NVIDIA_API_KEY` to use the NVIDIA API Catalog:

1) Create a free account with [NVIDIA](https://build.nvidia.com/explore/discover).
2) Click on your model of choice.
3) Under **Input**, select the **Python** tab, click **Get API Key**, and then click **Generate Key**.
4) Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.

In [15]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## RAG Example using LLM & Embedding

### 1) Initialize the LLM

The ChatNVIDIA class is part of LangChain's integration (langchain_nvidia_ai_endpoints) with NVIDIA NIM microservices.
It allows access to NVIDIA NIM for chat applications, connecting to hosted or locally-deployed microservices.

Here we will use **mixtral-8x7b-instruct-v0.1**.

In [13]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", max_tokens=1024)

# Here we use the mixtral-8x7b-instruct-v0.1 model,
# but you are free to choose any chat model hosted on the NVIDIA API Catalog.
# Uncomment the line below to list the available models.
# ChatNVIDIA.get_available_models()

### 2) Initialize the embedding
`NVIDIAEmbeddings` is a client for NVIDIA embedding models that provides access to an NVIDIA NIM for embedding. It can connect to a hosted NIM or, by setting a base URL, to a local NIM.

We selected **NV-Embed-QA** as the embedding model.

In [7]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

### 3) Obtain a toy text dataset
Here we load a small toy dataset of text scraped from a website. In a real application, data can be loaded from many different sources.
Read [here](https://python.langchain.com/v0.2/docs/tutorials/rag/#go-deeper) for more on loading data from different sources.

We can either:

A) Parse through a single webpage

In [6]:
# import bs4
# from langchain_community.document_loaders import WebBaseLoader

# # Only keep post title, headers, and content from the full HTML.
# bs4_strainer = bs4.SoupStrainer()
# loader = WebBaseLoader(
#     web_paths=("https://www.iras.gov.sg/taxes",),
#     bs_kwargs={"parse_only": bs4_strainer},
# )
# docs = loader.load()

# print("length of docs:", len(docs[0].page_content), "\ncontent:", docs)



43131

B) Walk through a webpage and find all sub-webpages and scrape the parent and children

In [61]:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import urllib.parse  # To handle URL joining
import time
from langchain_community.document_loaders import WebBaseLoader

base_domain = "https://www.iras.gov.sg/taxes"
domain_content = []
visited_urls = set()

def scrape_website(base_url, depth=0, max_depth=2):
    """
    Recursively scrape a website by visiting links starting from base_url.

    Parameters:
    - base_url: The URL to start scraping from.
    - depth: The current recursion depth.
    - max_depth: The maximum recursion depth to avoid infinite loops.
    """
    if base_url in visited_urls or depth > max_depth:
        return

    try:
        response = requests.get(base_url, timeout=10)  # Time out rather than hang on a slow page
        if response.status_code != 200:
            print(f"Failed to retrieve {base_url}")
            return
    except requests.RequestException as e:
        print(f"Error accessing {base_url}: {e}")
        return

    visited_urls.add(base_url)
    # soup = BeautifulSoup(response.content, 'html.parser')
    # page_text = soup.get_text()

    print("Current url:", base_url)
    loader = WebBaseLoader(
        web_paths=(base_url,),  # WebBaseLoader downloads the page again on its own
        bs_kwargs={"parse_only": SoupStrainer(['main'])},
    )
    domain_content.append(loader.load())
    
    time.sleep(0.25)  # Pause after each page to avoid overloading the server

    # Follow every link on the current page that stays under base_domain
    for link in BeautifulSoup(response.content, 'html.parser').find_all('a', href=True):
        absolute_url = urllib.parse.urljoin(base_url, link['href'])
        absolute_url, _ = urllib.parse.urldefrag(absolute_url)  # Drop "#fragment" so the same page is not scraped twice
        if absolute_url.startswith(base_domain):  # Skip external sites
            scrape_website(absolute_url, depth + 1, max_depth)

scrape_website(base_domain, max_depth=2)
print("Number of subpages:", len(domain_content))

Current url: https://www.iras.gov.sg/taxes
Current url: https://www.iras.gov.sg/taxes/individual-income-tax
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax/managing-mytax-portal-account
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax/understanding-my-income-tax-filing
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax/tax-residency-and-tax-rates
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax/what-is-taxable-what-is-not
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax/tax-reliefs-rebates-and-deductions
Current url: https://www.iras.gov.sg/taxes/individual-income-tax/basics-of-individual-income-tax/receive-tax-bill-pay-tax-check-refunds
Current url: https

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Current url: https://www.iras.gov.sg/taxes/international-tax/list-of-dtas-limited-dtas-and-eoi-arrangements
Current url: https://www.iras.gov.sg/taxes/withholding-tax
Current url: https://www.iras.gov.sg/taxes/withholding-tax/basics-of-withholding-tax
Current url: https://www.iras.gov.sg/taxes/withholding-tax/basics-of-withholding-tax/overview-of-withholding-tax-(WHT)
Current url: https://www.iras.gov.sg/taxes/withholding-tax/basics-of-withholding-tax/types-of-payment-and-withholding-tax-rates
Current url: https://www.iras.gov.sg/taxes/withholding-tax/payments-to-non-resident-company
Current url: https://www.iras.gov.sg/taxes/withholding-tax/payments-to-non-resident-company/payments-that-are-subject-to-withholding-tax
Current url: https://www.iras.gov.sg/taxes/withholding-tax/payments-to-non-resident-company/payments-that-are-not-subject-to-withholding-tax
Current url: https://www.iras.gov.sg/taxes/withholding-tax/payments-to-non-resident-director
Current url: https://www.iras.gov.sg/t
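
The link-following step in the scraper above can be tried offline, without any network access. The sketch below is an illustration only: the helper name `in_scope_links` and the `example.com` URLs are hypothetical. It uses the standard-library functions `urljoin` and `urldefrag` to resolve relative links, drop `#fragment` suffixes, and keep only URLs under a base prefix.

```python
from urllib.parse import urljoin, urldefrag

def in_scope_links(page_url, hrefs, base):
    """Resolve hrefs against page_url and keep unique URLs that start with base."""
    seen = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(base) and absolute not in seen:
            seen.append(absolute)
    return seen

# Hypothetical hrefs found on an example.com page, for illustration only
links = in_scope_links(
    "https://example.com/docs/intro",
    ["/docs/setup", "setup#install", "https://other.example.org/docs", "/blog"],
    base="https://example.com/docs",
)
print(links)
```

Here the first two links resolve to the same page once the fragment is removed, and the last two fall outside the base prefix, so a crawler built this way would visit a single new URL.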

### 4) Process the documents into vectorstore and save it to disk

Real-world documents can be very long, which makes them hard to fit in the context window of many models. Even models that can fit a full document in their context window can struggle to find information in very long inputs.

To handle this, we’ll split each document into chunks for embedding and vector storage. More on text splitting [here](https://python.langchain.com/v0.2/docs/concepts/#text-splitters).
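
As a rough illustration of chunking with overlap, the sketch below cuts a string into fixed-size character windows in which neighbouring chunks share a few characters. This is a simplification, not the algorithm used by `CharacterTextSplitter`, which splits on a separator and then merges pieces up to the chunk size. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # How far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.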

In [82]:
# Here we create a faiss vector store from the documents and save it to disk.
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=400, separator=" ", chunk_overlap=80)

documents = []
# Note from the original run: the data had to be cut to about a quarter because of token limits;
# that reduction is not shown here.
for docs in domain_content:  # Each entry is the list of Documents loaded from one page
    for doc in docs:
        text = " ".join(doc.page_content.split())  # Collapse newlines, tabs, and repeated spaces
        documents.extend(text_splitter.split_text(text))

# metadatas = []
# for i, d in enumerate(documents):
#     splits = text_splitter.split_text(d)
#     docs.extend(splits)
#     # metadatas.extend([{"source": sources[i]}] * len(splits))

# You only need to do this once; later on we restore the saved vector store from disk
store = FAISS.from_texts(documents, embedder)
VECTOR_STORE = './data/nv_embedding'
store.save_local(VECTOR_STORE)

To enable search at runtime, we index the text chunks by embedding each document split and storing the embeddings in a vector database. Later, to search, we embed the query and run a similarity search to find the stored splits whose embeddings are most similar to the query embedding.
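
The search step can be sketched with hand-made vectors. Everything below is an assumption for illustration: the three-dimensional "embeddings" are invented, and cosine similarity is used because it is easy to read, while the FAISS store may be configured with a different distance metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vector, index, k=2):
    """Return the k (text, score) pairs whose vectors are most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings", made up for illustration
index = [
    ("chunk about filing deadlines", [0.9, 0.1, 0.0]),
    ("chunk about tax rates", [0.1, 0.9, 0.0]),
    ("chunk about refunds", [0.0, 0.2, 0.9]),
]
results = similarity_search([1.0, 0.0, 0.0], index, k=2)
print([text for text, _ in results])
```

A vector database does the same ranking, but with index structures that avoid comparing the query against every stored vector.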

### 5) Read the previously processed and saved vector store back

In [83]:
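# Load the FAISS vector store back.
# allow_dangerous_deserialization is required because part of the saved store is pickled;
# only load files that you created yourself.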
store = FAISS.load_local(VECTOR_STORE, embedder, allow_dangerous_deserialization=True)

### 6) Wrap the restored vector store into a retriever and ask our question

In [84]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate

retriever = store.as_retriever()

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

# LangChain Expression Language (LCEL) and its Runnable protocol are used to define the chain.
# LCEL lets you pipe components and functions together with the | operator.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is component two hundred and twenty five million?"))
print(chain.invoke("How do i file taxes for my company?"))

There is no information provided in the documents about a component or entity called "two hundred and twenty five million." The documents contain information about various financial transactions, mortgage and loan balances, estate duty calculations, and rules related to M&A allowance and motor vehicle expenses, but there is no mention of a "component" with a value of 225 million.
To file taxes for your company in Singapore, you need to follow these steps:

1. Sign up for GIRO or PayNow Corporate with your business/organisation's bank account to receive CIT and GST refunds.
2. If your company qualifies, file Form C-S by logging in to mytax.iras.gov.sg. You can do this by being authorized by your company to act for its Corporate Income Tax matters via Corppass.
3. In case your company is being liquidated, the appointed liquidator will need to request access to myTax Portal on your behalf via Corppass.

For further details, you can refer to the FAQs on Electronic Refund for Corporate Inco
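
The `|` composition used to build the chain above can be imitated in a few lines of plain Python. This is a toy model of the idea, not LangChain's Runnable implementation: the `Step` class and the four stand-in steps are hypothetical, and the fake "LLM" simply upper-cases its input.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # The combined step feeds this step's output into the next one
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

# Hypothetical stand-ins for the retriever, prompt, model, and output parser
gather = Step(lambda q: {"context": "doc text", "question": q})
render = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda p: {"content": p.upper()})
parse = Step(lambda m: m["content"])

toy_chain = gather | render | fake_llm | parse
print(toy_chain.invoke("hello"))
```

Each stage only needs to accept what the previous stage returns, which is why the retriever, prompt, model, and parser can be swapped independently.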

## RAG Example with LLM, Embedding & Reranking

In [12]:
# Let's test a more complex query using the above LLM Embedding chain and see if the reranker can help.
chain.invoke("In which year Gustav's grandson ascended the throne?")

"The document does not provide information on when Gustav's grandson ascended the throne. There is no mention of any Gustav, his grandson, or the event of ascending the throne in the provided document."

### Enhancing accuracy for single data sources

This example demonstrates how a reranking model can be used to reorder retrieval results and improve the accuracy of document retrieval.

Typically, reranking is a critical piece of high-accuracy, efficient retrieval pipelines. Generally, there are two important use cases:

- Combining results from multiple data sources
- Enhancing accuracy for single data sources

Here, we demonstrate only the second use case. To learn more, see the [NVIDIARerank notebook](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/docs/retrievers/nvidia_rerank.ipynb).
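
The two-stage pattern (retrieve a broad candidate set cheaply, then reorder it with a more careful scorer) can be sketched without any model. In the sketch below, every name is hypothetical: `word_overlap` stands in for the vector search and `phrase_match` stands in for a reranking model such as the one behind `NVIDIARerank`.

```python
def retrieve_then_rerank(query, documents, cheap_score, precise_score, k, top_n):
    """Keep the k best documents by cheap_score, then the top_n of those by precise_score."""
    candidates = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]

def word_overlap(query, doc):
    # Cheap first-stage score: number of shared lowercase words
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    # Stand-in for a reranking model: rewards documents containing the exact query phrase
    return (1 if query.lower() in doc.lower() else 0, word_overlap(query, doc))

docs = [
    "tax on company income",
    "how to file company tax returns",
    "company picnic schedule",
    "file a noise complaint",
]
top = retrieve_then_rerank("company tax", docs, word_overlap, phrase_match, k=3, top_n=1)
print(top)
```

The first stage ranks the first two documents equally, and the second stage promotes the one containing the exact phrase. The expensive scorer only ever sees `k` candidates, which is what keeps the pipeline efficient.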

In [85]:
from langchain_nvidia_ai_endpoints import NVIDIARerank
from langchain_core.runnables import RunnableParallel

# We will narrow the collection to 100 results and further narrow it to 10 with the reranker.
retriever = store.as_retriever(search_kwargs={'k':100}) # typically k will be 1000 for real world use-cases
ranker = NVIDIARerank(model='nv-rerank-qa-mistral-4b:1', top_n=10)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
        ),
        ("user", "{question}"),
    ]
)

reranker = lambda input: ranker.compress_documents(query=input['question'], documents=input['context'])

chain_with_ranker = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | {"context": reranker, "question": lambda input: input['question']}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain_with_ranker.invoke("How do i file taxes for my company?"))


Based on the provided documents, to file taxes for your company, you should first visit the Basic Guide to Corporate Income Tax for Companies page to get help with filing your company’s tax returns for the first time. If your company was newly incorporated in 2023 and derives income or started business in that year, you should file your Corporate Income Tax.

You have to file via mytax.iras.gov.sg and be authorized by your company to act for its Corporate Income Tax matters via Corppass. You can choose to file Form C-S, Form C-S (Lite), or Form C, depending on your company's eligibility. Refer to the step-by-step guides on Corppass setup for assistance.

For filing, ensure you are duly authorized by your company as an 'Approver' for Corporate Tax (Filing and Applications) in Corppass, have your Singpass, and your company’s Unique Entity Number (UEN)/Entity ID. Then, you can file Form for Dormant Company via mytax.iras.gov.sg.

There are different filing requirements depending on whethe

#### Note:
- In this notebook, we have used NVIDIA NIM microservices from the NVIDIA API Catalog.
- The classes used above, `ChatNVIDIA`, `NVIDIAEmbeddings`, and `NVIDIARerank`, also support self-hosted NIM microservices.
- Change the `base_url` to your deployed NIM URL.
- Example: `llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")`
- NIM can be hosted locally using Docker, following the [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) documentation.

In [None]:
# Example Code snippet if you want to use a self-hosted NIM
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# connect to an LLM NIM running at localhost:8000, specifying a specific model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")