# Using the NVIDIA Reranker to optimize retrieval pipelines

Reranking is crucial for achieving high accuracy and efficiency in retrieval pipelines. It plays a vital role, particularly when the pipeline incorporates citations from diverse datastores, where each datastore may employ its own unique similarity scoring algorithm. Reranking serves two primary purposes:

1. Improving accuracy for individual citations within each datastore.
2. Integrating results from multiple datastores to provide a cohesive and relevant set of citations.

This playbook goes over how to use the NeMo Retriever Text Reranking NIM (Text Reranking NIM) with LangChain for document compression and retrieval via the `NVIDIARerank` class.
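To make the second purpose concrete, here is a minimal sketch of pooling hits from two datastores whose native scores are on different scales, then ordering them with a single scorer. Everything in it is hypothetical and for illustration only: the `Hit` class, the sample data, and `toy_relevance`, a word-overlap function that stands in for a real reranking model such as the Text Reranking NIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    source: str
    score: float  # datastore-specific score; scales differ between stores


def toy_relevance(query: str, text: str) -> float:
    """Stand-in for a reranker: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def merge_and_rerank(query, result_lists, score_fn=toy_relevance, top_n=3):
    """Pool hits from several datastores, drop duplicates, and rank with one scorer."""
    seen = set()
    pooled = []
    for hits in result_lists:
        for hit in hits:
            if hit.text not in seen:
                seen.add(hit.text)
                pooled.append(hit)
    # The per-store scores are ignored here because they are not comparable.
    ranked = sorted(pooled, key=lambda h: score_fn(query, h.text), reverse=True)
    return ranked[:top_n]


vector_hits = [
    Hit("The H200 is based on the Hopper architecture", "vector_db", 0.83),
    Hit("GPUs accelerate matrix math", "vector_db", 0.79),
]
keyword_hits = [
    Hit("Hopper architecture powers the H200", "keyword_index", 12.4),
    Hit("The H200 is based on the Hopper architecture", "keyword_index", 11.9),
]

top = merge_and_rerank(
    "what architecture is the H200 based on", [vector_hits, keyword_hits]
)
for hit in top:
    print(hit.source, "|", hit.text)
```

The design point is that one scorer sees every candidate, so a cosine similarity of 0.83 and a keyword score of 12.4 never have to be compared directly.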

## Step 1 - Install the requirements

For this example, we install the LangChain packages and other dependencies needed to build our pipeline. Additionally, we download the reranker model that we will be using for our examples.

In [65]:
!pip install faiss_cpu==1.8.0
!pip install fastapi==0.104.1
!pip install langchain==0.1.11
!pip install langchain-community==0.0.25
!pip install langchain-core==0.1.29
!pip install langchain-nvidia-ai-endpoints==0.1.4
!pip install numpy==1.26.4
!pip install sentence-transformers==2.2.2
!pip install unstructured==0.11.8

In [20]:
%%bash

wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/reranker-launchable/install-docker.sh -O install-docker
chmod +x install-docker
sudo ./install-docker

--2024-08-05 22:29:22--  https://raw.githubusercontent.com/brevdev/notebooks/main/assets/reranker-launchable/install-docker.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 852 [text/plain]
Saving to: ‘install-docker’

     0K                                                       100%  105M=0s

2024-08-05 22:29:22 (105 MB/s) - ‘install-docker’ saved [852/852]



This step requires your NGC API key. Additionally, it could take up to 5 minutes to pull the model. The script keeps checking and exits once the model is successfully running.

In [25]:
%%bash

wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/reranker-launchable/setup-nim.sh -O setup-nim
chmod +x setup-nim
export NGC_API_KEY="<enter key here>"  # quoted so bash does not parse < and > as redirections
sudo -E ./setup-nim

--2024-08-05 22:30:31--  https://raw.githubusercontent.com/brevdev/notebooks/main/assets/reranker-launchable/setup-nim.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1060 (1.0K) [text/plain]
Saving to: ‘setup-nim’

     0K .                                                     100%  125M=0s

2024-08-05 22:30:32 (125 MB/s) - ‘setup-nim’ saved [1060/1060]
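If you want to confirm from Python that the NIM is up before moving on, a polling loop like the one below may help. This is a sketch under assumptions: the contents of `setup-nim` are not shown here, and the readiness URL `http://localhost:8000/v1/health/ready` is assumed from the port used later in this playbook, so verify it against the NIM documentation. The helper names `http_ready` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(probe, max_wait_s=300.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `max_wait_s` seconds have passed."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example (assumed URL; requires the container to be starting on this machine):
# ready = wait_until_ready(
#     lambda: http_ready("http://localhost:8000/v1/health/ready")
# )
```

The probe is passed in as a callable so the waiting logic can be exercised without a running container.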



## Step 2 - Building a pipeline using LangChain

The `langchain-nvidia-ai-endpoints` package provides a `ChatNVIDIA` class that lets you work with different NVIDIA NIM models. For this example, we run the reranker NIM locally and use the API catalog to access Meta Llama 3.1.

In [57]:
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA

os.environ["NVIDIA_API_KEY"] = "enter api key here"
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
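Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from the environment and falls back to a hidden prompt. The helper name `resolve_api_key` is hypothetical and not part of any NVIDIA or LangChain package.

```python
import getpass
import os


def resolve_api_key(env=os.environ, ask=getpass.getpass, var_name="NVIDIA_API_KEY"):
    """Return the key from the environment, or prompt for it without echoing."""
    key = env.get(var_name, "").strip()
    if not key:
        key = ask(f"Enter {var_name}: ").strip()
    if not key:
        raise ValueError(f"{var_name} is empty")
    return key


# In the notebook, this could replace the hardcoded assignment:
# os.environ["NVIDIA_API_KEY"] = resolve_api_key()
```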



In [58]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", (
        # Adjacent string literals are concatenated, so each needs a trailing space
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

In [59]:
print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"}))

A CPU (Central Processing Unit) is the primary component that performs calculations and executes instructions for a computer, whereas a GPU (Graphics Processing Unit) is specifically designed to handle large amounts of graphical data and complex calculations, typically found in gaming and graphics-intensive applications.


In [60]:
print(chain.invoke({"question": "What does the H in the NVIDIA H200 stand for?"}))

Unfortunately, I don't have information on what the H in NVIDIA H200 specifically stands for. However, the "H" in NVIDIA H100, H200, and other related models is likely a classification or a codename indicating a high-end or high-performance variant.


## Reranking with Text Reranking NIM

To answer the previous question, let's build a simple retrieval and reranking pipeline that finds the piece of information most relevant to the query.

Load the NVIDIA H200 datasheet to use in the retrieval pipeline. LangChain provides a variety of document loaders for various types of documents, such as HTML, PDF, and code, from sources and locations such as private S3 buckets and public websites. The following example uses a LangChain `PyPDFLoader` to load a datasheet about the NVIDIA H200 Tensor Core GPU.

In [61]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")

document = loader.load()
document[0]

Document(page_content='NVIDIA H200 Tensor Core GPU\u2002|\u2002Datasheet\u2002|\u2002 1NVIDIA H200 Tensor Core GPU\nSupercharging AI and HPC workloads.\nHigher Performance With Larger, Faster Memory\nThe NVIDIA H200 Tensor Core GPU supercharges generative AI and high-\nperformance computing (HPC) workloads with game-changing performance  \nand memory capabilities. \nBased on the NVIDIA Hopper™ architecture , the NVIDIA H200 is the first GPU to \noffer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—\nthat’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU  with \n1.4X more memory bandwidth. The H200’s larger and faster memory accelerates \ngenerative AI and large language models, while advancing scientific computing for \nHPC workloads with better energy efficiency and lower total cost of ownership. \nUnlock Insights With High-Performance LLM Inference\nIn the ever-evolving landscape of AI, businesses rely on large language models to \naddress a diver

Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, such as a long document, into smaller segments. This technique is valuable because it helps **optimize the relevance of the content returned from the vector database**.

LangChain provides a variety of document transformers, such as text splitters. The following example uses a `RecursiveCharacterTextSplitter`, which divides a large body of text into smaller chunks based on a specified chunk size. It splits recursively using an ordered list of separators, such as `"\n\n"`, `"\n"`, `" "`, and `""`. The process begins by attempting to split the text using the first separator in the list. If a resulting chunk is still larger than the desired chunk size, it moves on to the next separator and splits that chunk again. This continues until all chunks fit within the specified maximum chunk size.
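The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops the separators that fall on chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # This piece is too large on its own: flush what we have,
            # then split the piece with the next separator in the list.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            continue
        # Greedily merge small pieces while they still fit in one chunk.
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "alpha beta gamma\n\n"
    "delta epsilon zeta eta theta\n"
    "supercalifragilisticexpialidocious"
)
chunks = recursive_split(sample, chunk_size=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Coarser separators are tried first so that paragraphs and lines stay together when they fit. Only text with no usable separator, like the long word in the sample, is cut at arbitrary character positions.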

There are some nuanced complexities to text splitting since, in theory, semantically related text should be kept together.

In [62]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))

Number of chunks from the document: 16


The following example shows how to use LangChain to interact with the Text Reranking NIM using the `NVIDIARerank` class from the same `langchain-nvidia-ai-endpoints` package that provided `ChatNVIDIA` in the first example.

In [63]:
from langchain_nvidia_ai_endpoints import NVIDIARerank

query = "What does the H in the NVIDIA H200 stand for?"

# Initialize and connect to a NeMo Retriever Text Reranking NIM running at localhost:8000
reranker = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3",
                        base_url="http://localhost:8000/v1")

reranked_chunks = reranker.compress_documents(query=query,
                                              documents=document_chunks)
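For readers curious about what happens under the wrapper, the sketch below shows how a ranking request could be sent to the NIM directly over HTTP. Treat the details as assumptions to check against the Text Reranking NIM API reference: the `/ranking` path under the base URL, the `query`/`passages` payload shape, and the `rankings`/`index`/`logit` response fields reflect my understanding of the REST API, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_ranking_payload(model, query, passages):
    """Assumed request shape: one query and a list of candidate passages."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
    }


def order_by_ranking(passages, response):
    """Return (passage, logit) pairs sorted from most to least relevant."""
    rankings = sorted(response["rankings"], key=lambda r: r["logit"], reverse=True)
    return [(passages[r["index"]], r["logit"]) for r in rankings]


def rank(base_url, model, query, passages):
    """POST the payload to the (assumed) ranking endpoint and order the passages."""
    request = urllib.request.Request(
        f"{base_url}/ranking",
        data=json.dumps(build_ranking_payload(model, query, passages)).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return order_by_ranking(passages, json.load(response))


# Example (requires the NIM running locally, as in the cell above):
# rank("http://localhost:8000/v1", "nvidia/nv-rerankqa-mistral-4b-v3",
#      "What does the H in the NVIDIA H200 stand for?",
#      [chunk.page_content for chunk in document_chunks])
```

If the scores are raw logits, as assumed here, they are unbounded rather than probabilities, so they are best used for ordering passages within one query.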

In [64]:
for chunk in reranked_chunks:

    # Access the metadata of the document
    metadata = chunk.metadata

    # Get the page content
    page_content = chunk.page_content

    # Print the relevance score if it exists in the metadata, followed by page content
    if 'relevance_score' in metadata:
        print(f"Relevance Score:{metadata['relevance_score']}, Page Content:{page_content}...")
    print(f"{'-' * 100}")

Relevance Score:16.625, Page Content:NVIDIA H200 Tensor Core GPU | Datasheet |  1NVIDIA H200 Tensor Core GPU
Supercharging AI and HPC workloads.
Higher Performance With Larger, Faster Memory
The NVIDIA H200 Tensor Core GPU supercharges generative AI and high-
performance computing (HPC) workloads with game-changing performance  
and memory capabilities. 
Based on the NVIDIA Hopper™ architecture , the NVIDIA H200 is the first GPU to 
offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—...
----------------------------------------------------------------------------------------------------
Relevance Score:11.5078125, Page Content:NVIDIA H200 Tensor Core GPU | Datasheet |  3Unleashing AI Acceleration for Mainstream Enterprise Servers 
With H200 NVL
The NVIDIA H200 NVL is the ideal choice for customers with space constraints within  
the data center, delivering acceleration for every AI and HPC workload regardless of size. 
With a 1.5X memory increase and a 1.2X band

## Using the reranker for context

We can build a similar chain that adds the highest-ranked chunks to the prompt as context. Let's build one below; the example passes only the top-ranked chunk.

In [72]:
prompt_with_context = ChatPromptTemplate.from_messages([
    ("system", (
        # Adjacent string literals are concatenated, so each needs a trailing space
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "You have the following context for the question: {context}. "
        "Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt_with_context | llm | StrOutputParser()

For comparison, recall that without any context the model answered that it did not have information on what the H in NVIDIA H200 stands for. The new prompt requires a `context` value, so we now supply the top-ranked chunk along with the question:


In [80]:
print(chain.invoke({"context": reranked_chunks[0].page_content, "question": "What does the H in the NVIDIA H200 stand for?"}))

The "H" in NVIDIA H200 stands for "Hopper", which is the architecture on which the H200 is based.
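To pass more than one chunk, the top-ranked chunks can be joined into a single context string. The sketch below is one way to do that. `Chunk` is a hypothetical stand-in for a LangChain `Document` so the example runs on its own, and `build_context` is a hypothetical helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(chunks, k=2, separator="\n\n"):
    """Join the page content of the k chunks with the highest relevance_score."""
    ranked = sorted(
        chunks,
        key=lambda c: c.metadata.get("relevance_score", float("-inf")),
        reverse=True,
    )
    return separator.join(c.page_content.strip() for c in ranked[:k])


example_chunks = [
    Chunk("H200 NVL overview", {"relevance_score": 11.5}),
    Chunk("Based on the NVIDIA Hopper architecture", {"relevance_score": 16.6}),
    Chunk("Unrelated footer text", {"relevance_score": -4.0}),
]
context = build_context(example_chunks, k=2)
print(context)

# With the objects from this notebook, the call might look like:
# chain.invoke({"context": build_context(reranked_chunks), "question": query})
```

Keeping `k` small limits the prompt to the passages the reranker scored highest, which tends to matter more as the number of source documents grows.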
