### Table of Contents

* [Introduction](#introduction)
* [Retrieval Augmented Generation (RAG)](#RAG)
* [Example 1: Document Question-Answering with LangChain using Nvidia API Catalog](#apicatalog)
* [Document Ingestion](#ingestion)
* [Retrieval & Generation](#retrieval)
* [Ensemble Retriever using BM25Retriever and FAISS](#ensembleretrieval)
* [Reranker](#reranker)
* [Example 2: Chat with PDF](#pdf)
* [Running with NIM](#NIM)
* [Conclusion](#conclusion)

### Introduction <a name="introduction"></a>

This notebook demonstrates how to use LangChain to build a simple RAG chatbot that references a custom knowledge base using the NeMo Retriever from [build.nvidia.com](https://build.nvidia.com/explore/discover). For more details, see the [docs](https://docs.nvidia.com/cloud-functions/user-guide/latest/cloud-function/api.html).

#### NVCF (NVIDIA AI Foundation Endpoints)

NVIDIA AI Foundation Endpoints (NVCF) give users easy access to NVIDIA-hosted API endpoints for NVIDIA AI Foundation Models such as Mixtral 8x7B, Llama 2, and Stable Diffusion. These models, available through the NVIDIA NGC catalog, are optimized and tested on the NVIDIA AI platform, which makes them fast and easy to evaluate, customize, and run on any accelerated stack.

NeMo NIM and NVIDIA Cloud Functions fit into LLM frameworks such as LangChain and LlamaIndex thanks to their OpenAI-compliant API endpoints. Examples of other embedding endpoints (e.g. HuggingFaceEmbeddings) are provided to showcase the "plug and play" integration of NIM/NVCF and the ability to interchange components within an existing LangChain workflow.

#### LangChain

LangChain provides a simple framework for connecting LLMs to your own data sources. LLMs are trained only up to a fixed point in time and do not contain knowledge that is proprietary to an enterprise, so on their own they can't answer questions about new or proprietary information. LangChain addresses this by letting you retrieve relevant content from your own data and pass it to the LLM at query time.


####  A simple chatbot using LangChain and NVIDIA AI Foundation Endpoints

Please see [here](https://python.langchain.com/v0.1/docs/integrations/text_embedding/nvidia_ai_endpoints/) if you need help with generating the NVIDIA_API_KEY.

In [None]:
# !pip install langchain
# !pip install langchain_nvidia_ai_endpoints
# !pip install faiss-cpu
# !pip install beautifulsoup4
# !pip install -U langchain-community
# !pip install rank_bm25
# !pip install unstructured[all-docs]
# !pip install unstructured
# !pip install opencv-python==4.8.0.74
# if the client machine has a GPU, you can use faiss-gpu instead of faiss-cpu
# !pip install faiss-gpu accelerate
# if on mac install wget through brew

In [None]:
import os
import getpass
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain.docstore.document import Document
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

In [None]:
nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
os.environ["NVIDIA_API_KEY"] = nvapi_key

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1") # access LLM via NVIDIA AI Foundation Endpoints

prompt = ChatPromptTemplate.from_messages([
    ("system", (
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than three sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

In [None]:
print(chain.invoke({"question": "What's the difference between a GPU and a CPU?"})) 

#### Problems with regular LLMs

![Alt Text](images/1.png)

The above example works well for a general question. However, since LLMs are only trained up to a fixed point in time and do not contain knowledge that is proprietary to an enterprise, they can't answer questions about new or proprietary knowledge.

In [None]:
print(chain.invoke({"question": "How much memory does the NVIDIA H200 have?"})) 

In [None]:
print(chain.invoke({"question": "What is Triton ?"})) 

### RAG <a name="RAG"></a>

Retrieval-augmented generation (RAG) is an approach that boosts the factual correctness and trustworthiness of AI language models by incorporating information retrieved from external data sources. It addresses a limitation in how large language models (LLMs) function.

At their core, LLMs are neural networks, often evaluated by the number of parameters they possess. These parameters encode the general patterns and rules of how words are combined to form sentences, based on the training data. However, LLMs lack direct access to factual knowledge beyond what is captured in their parameters during training.

RAG techniques integrate LLMs with retrieval systems that can fetch relevant facts, data, or passages from external knowledge bases or databases. By augmenting the LLM's output with this retrieved information, RAG aims to produce responses that are not only fluent and coherent but also grounded in factual knowledge, enhancing the overall accuracy and reliability of the AI system.
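
The retrieve-then-augment flow described above can be sketched without any external services. This is a toy illustration only: the knowledge base, the token-overlap score, and the helper names (`retrieve`, `build_prompt`) are assumptions made for this sketch, and a real system would use an embedding model and a vector store instead.

```python
from collections import Counter
from typing import List

# Hypothetical mini knowledge base; a real system would query a vector store.
KNOWLEDGE_BASE = [
    "Triton Inference Server serves models from multiple frameworks.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "BM25 is a lexical ranking function based on term frequency.",
]

def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    return [t.strip(".,?!").lower() for t in text.split()]

def overlap_score(query: str, passage: str) -> int:
    """Count query tokens that also occur in the passage (a toy relevance score)."""
    shared = Counter(tokenize(query)) & Counter(tokenize(passage))
    return sum(shared.values())

def retrieve(query: str, k: int = 1) -> List[str]:
    """Step 1 (retrieval): return the k passages with the highest score."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: List[str]) -> str:
    """Step 2 (augmentation): place the retrieved passages in the LLM prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "What is Triton Inference Server?"
top_passages = retrieve(question)
rag_prompt = build_prompt(question, top_passages)
print(rag_prompt)  # Step 3 (generation) would send this prompt to the LLM
```

The rest of this notebook follows the same three steps, with FAISS and BM25 handling retrieval and an NVIDIA-hosted LLM handling generation.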

### Example 1: Document Question-Answering with LangChain using Nvidia API Catalog <a name="apicatalog"></a>

![Alt Text](images/2.png)

### Document Ingestion - Generate embeddings and store in the vector store. <a name="ingestion"></a>

In [None]:
import re
from typing import List, Union
import requests
from bs4 import BeautifulSoup

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Load the HTML content of a document from a given URL and return its text.

    Args:
        url: The URL of the document.

    Returns:
        The plain-text content of the document, or an empty string if the
        request or the parsing fails (the error is printed, not raised).

    """
    try:
        response = requests.get(url)
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""

In [None]:
def create_embeddings(embedding_path: str = "./embed"):
    print(f"Storing embeddings to {embedding_path}")

    # List of web pages containing NVIDIA Triton technical documentation
    urls = [
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html",
    ]

    # One splitter is reused for every page
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
        length_function=len,
    )

    for url in urls:
        document = html_document_loader(url)
        if not document:
            continue  # skip pages that failed to load
        # Split only the current page so earlier pages are not indexed again
        texts = text_splitter.create_documents([document])
        index_docs(url, text_splitter, texts, embedding_path)
    print("Generated embeddings successfully")

In [None]:
def index_docs(url: Union[str, bytes], splitter, documents: List[Document], dest_embed_dir) -> None:
    """
    Split the documents into chunks and create embeddings for each chunk.

    Args:
        url: Source URL for the documents.
        splitter: Splitter used to split each document.
        documents: List of Document objects whose embeddings need to be created.
        dest_embed_dir: Destination directory for the embeddings.

    Returns:
        None
    """
    embeddings = NVIDIAEmbeddings(model="NV-Embed-QA")
    for i, chunk in enumerate(documents):
        texts = splitter.split_text(chunk.page_content)

        # Collect the chunk texts for lexical (BM25) search
        BM25_DOCS.extend(texts)

        # Create one metadata entry per text so both lists stay aligned
        metadatas = [
            {
                "source": url,
                "chunk_index": i,
                "retriever": "FAISS"
            }
            for _ in texts
        ]
        #create embeddings and add to vector store
        if os.path.exists(dest_embed_dir):
            update = FAISS.load_local(folder_path=dest_embed_dir, embeddings=embeddings, allow_dangerous_deserialization=True)
            update.add_texts(texts, metadatas=metadatas)
            update.save_local(folder_path=dest_embed_dir)
        else:
            docsearch = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas)
            docsearch.save_local(folder_path=dest_embed_dir)

In [None]:
BM25_DOCS = []
create_embeddings()
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", api_key=nvapi_key)

### Retrieval & Generation <a name="retrieval"></a>

In [None]:
# Load the FAISS index that was saved during ingestion
embedding_path = "embed/"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model, allow_dangerous_deserialization=True)

In [None]:
chat = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", temperature=1, max_tokens=1000, top_p=1.0)

chat_history = []

memory = ConversationBufferMemory(
    input_key="question",
    output_key="answer",
    memory_key="chat_history",
    return_messages=True,
)

question_generator = LLMChain(llm=chat, prompt=CONDENSE_QUESTION_PROMPT)

doc_chain = load_qa_chain(chat , chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain(
    retriever=docsearch.as_retriever(search_kwargs={"k": 20}),
    combine_docs_chain=doc_chain,
    memory=memory,
    question_generator=question_generator,
    return_source_documents = False,
    verbose = False
)

In [None]:
query = "What is Triton?"
result = qa({"question": query})
print(result.get("answer"))

In [None]:
query = "Explain its architecture ?"
result = qa({"question": query})
print(result.get("answer"))

In [None]:
query = "What backends are supported by Triton ?"
result = qa({"question": query, "chat_history": []})
print(result.get("answer"))

In [None]:
query = "Does it support ONNX ?"
result = qa({"question": query})
print(result.get("answer"))

In [None]:
query = "But Why ?"
result = qa({"question": query})
print(result.get("answer"))

In [None]:
# len(BM25_DOCS)

### Ensemble Retriever using BM25Retriever and FAISS <a name="ensembleretrieval"></a>
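
An ensemble retriever merges the ranked results of several retrievers into one list. LangChain's documentation describes `EnsembleRetriever` as using Reciprocal Rank Fusion; the sketch below shows a weighted form of that idea under stated assumptions. The function name `weighted_rrf`, the document IDs, and the constant `c=60` are illustrative choices for this sketch, not the library's internals.

```python
from collections import defaultdict
from typing import Dict, List

def weighted_rrf(rankings: List[List[str]], weights: List[float], c: int = 60) -> List[str]:
    """Fuse ranked lists: a document earns weight / (rank + c) for each list it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical lexical results
faiss_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical semantic results

# Same weighting as in this notebook: 0.2 for BM25, 0.8 for FAISS
fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.2, 0.8])
print(fused)
```

Documents that rank well in the heavily weighted list, or that appear in both lists, rise to the top of the fused ranking.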

In [None]:
bm25_retriever = BM25Retriever.from_texts(
    BM25_DOCS, metadatas=[{"retriever": "BM25"}] * len(BM25_DOCS)
)
bm25_retriever.k = 4

In [None]:
bm25_retriever

In [None]:
# Load the FAISS index that was saved during ingestion
embedding_path = "embed/"
faiss_vectorstore = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model, allow_dangerous_deserialization=True)

In [None]:
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 4})

In [None]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.2, 0.8]
)

In [None]:
all_docs = ensemble_retriever.get_relevant_documents("What is Triton?")

In [None]:
for doc in all_docs:
    metadata = doc.metadata
    print(metadata)

### Reranker <a name="reranker"></a>

The similarity scores returned by FAISS depend on the distance metric it uses, typically cosine similarity or Euclidean distance. By default, FAISS uses Euclidean distance, where a lower score indicates higher similarity. A reranker then re-scores the retrieved passages against the query so that the most relevant ones come first.
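
To make the two metrics concrete, here is a small pure-Python sketch. The vectors are made-up two-dimensional examples, and the helper names are assumptions for illustration; this shows the metrics themselves, not how FAISS computes them internally.

```python
import math
from typing import Sequence

def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]   # hypothetical query embedding
near_vec = [0.9, 0.1]    # points in almost the same direction as the query
far_vec = [0.0, 1.0]     # orthogonal to the query

for name, vec in [("near", near_vec), ("far", far_vec)]:
    print(name, round(l2_distance(query_vec, vec), 3), round(cosine_similarity(query_vec, vec), 3))
```

Note that the two metrics sort in opposite directions: results are ordered by ascending distance but by descending similarity.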

In [None]:
# NVIDIARerank.get_available_models()

In [None]:
from langchain_nvidia_ai_endpoints import NVIDIARerank
ranker = NVIDIARerank()
ensemble_retriever_docs = ensemble_retriever.get_relevant_documents("What is Triton?")

In [None]:
ensemble_retriever_docs

In [None]:
reranked_docs = ranker.compress_documents(query="What is Triton ?", documents=ensemble_retriever_docs) 

In [None]:
for rd in reranked_docs:
    print(rd.metadata)

In [None]:
chat = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", temperature=1, max_tokens=1000, top_p=1.0)
qa_chain = ConversationalRetrievalChain.from_llm(chat, ensemble_retriever)

In [None]:
result = qa_chain({"question": "What is Triton ?", "chat_history": []})
print(result["answer"])

### Example 2: Chat with PDF <a name="pdf"></a>

Let's take it one step further! Instead of manually creating a knowledge base, this example demonstrates how entire documents can be processed and added to a vector database.

LangChain provides a variety of document loaders that load various types of documents (HTML, PDF, code) from many different sources and locations (private S3 buckets, public websites). A document loader provides a `load` method that reads data from a configured source and returns it as Documents. A Document is a piece of text (the `page_content`) and associated metadata.

In this example, we use a LangChain `UnstructuredFileLoader` to load two datasheets: one for the NVIDIA H200 Tensor Core GPU and one for the NVIDIA DGX GH200.

In [None]:
!mkdir -p $PWD/pdfs

In [None]:
! wget -O "pdfs/h200-datasheet.pdf" -nc --user-agent="Mozilla" https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf

In [None]:
! wget -O "pdfs/DGX_GH200_datasheet.pdf" -nc --user-agent="Mozilla" https://nvdam.widen.net/content/gzjjk9m31f/original/dgx-scale-ai-infrastructure-dgx-gh200-datasheet-nvidia-us-3043177-r3-web.pdf

In [None]:
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader(["pdfs/DGX_GH200_datasheet.pdf", "pdfs/h200-datasheet.pdf"])
document = loader.load()

Once documents have been loaded, they are often transformed. One such transformation is chunking, which breaks a large piece of text, such as a long document, into smaller segments. Chunking is valuable because it helps optimize the relevance of the content returned from the vector database.

LangChain provides a variety of document transformers, such as text splitters. In this example, we use a `RecursiveCharacterTextSplitter`, which divides a large text into smaller chunks based on a specified chunk size. It works recursively over an ordered list of separators (e.g., "\n\n", "\n", " ", ""). The process begins by splitting the text on the first separator. If a resulting chunk is still larger than the desired chunk size, the splitter moves to the next separator and splits that chunk again. This continues until all chunks fit within the specified maximum chunk size. Text splitting has some nuance, since semantically related text should, in theory, be kept together.
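
The recursive strategy described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch omits features of the real splitter such as chunk overlap.

```python
from typing import List, Sequence

def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text so that every chunk has at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
sample_chunks = recursive_split(sample, chunk_size=12)
print(sample_chunks)
```

In the sample, the paragraph break is tried first; only the middle paragraph exceeds the limit, so only that paragraph is split again on spaces.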

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

document_chunks = text_splitter.split_documents(document)

In [None]:
print("Number of chunks from the document:", len(document_chunks)) 

In [None]:
# Embed the PDF chunks and build an in-memory FAISS index
chatsearch = FAISS.from_documents(document_chunks, embedding=embedding_model)

In [None]:
chat = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1", temperature=1, max_tokens=1000, top_p=1.0)

chat_history = []

memory = ConversationBufferMemory(
    input_key="question",
    output_key="answer",
    memory_key="chat_history",
    return_messages=True,
)

question_generator = LLMChain(llm=chat, prompt=CONDENSE_QUESTION_PROMPT)

doc_chain = load_qa_chain(chat , chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain(
    retriever=chatsearch.as_retriever(search_kwargs={"k": 20}),
    combine_docs_chain=doc_chain,
    memory=memory,
    question_generator=question_generator,
    return_source_documents = False,
    verbose = False
)

In [None]:
query = "What is DGX GH200 ?"
result = qa({"question": query, "chat_history": chat_history})
print(result.get("answer"))

In [None]:
query = "How much TFLOPs does it have ?"
result = qa({"question": query, "chat_history": chat_history})
print(result.get("answer"))

In [None]:
query = "What is the difference between DGX GH200 and H200?"
result = qa({"question": query, "chat_history": chat_history})
print(result.get("answer"))

### NIM Workflow <a name="NIM"></a>

Now you can take the entire application we built, replace the cloud endpoints, and deploy it in the cloud of your choice or on-premises. For more information on deploying NIMs locally, take a look at the docs [here](https://docs.nvidia.com/nim/large-language-models/latest/index.html).

In [None]:
model_name = "meta/llama3-70b-instruct"
base_url = "http://localhost:8000"

In [None]:
# Switch between API and NIM easily.
llm = ChatNVIDIA(model=model_name, base_url=base_url + "/v1", temperature=1, max_tokens=1000, top_p=1.0)

memory = ConversationBufferMemory(
    input_key="question",
    output_key="answer",
    memory_key="chat_history",
    return_messages=True,
)
question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)

doc_chain = load_qa_chain(llm , chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain(
    retriever=docsearch.as_retriever(),
    combine_docs_chain=doc_chain,
    memory=memory,
    question_generator=question_generator,
    return_source_documents = False,
    verbose = False
)

In [None]:
query = "What is Triton?"
result = qa({"question": query})
print(result.get("answer"))

In [None]:
query = "Explain its architecture ?"
result = qa({"question": query})
print(result.get("answer"))

In [None]:
query = "Does it support ONNX ?"
result = qa({"question": query})
print(result.get("answer"))

### Conclusion <a name="conclusion"></a>

Throughout this workshop, we explored the powerful capabilities of the NVIDIA AI platform for building question-answering systems. We started by leveraging the NVIDIA API to construct a Retrieval Augmented Generation (RAG) workflow, which combines information retrieval and language generation models.

We then delved deeper into the retrieval aspect, investigating different retrievers for lexical and semantic document retrieval. By utilizing the NVIDIA Reranker, we learned how to rank and combine the retrieved chunks from multiple retrievers, enhancing the overall quality of the retrieved information.

Additionally, we examined various document loaders available in LangChain, enabling us to interact with and retrieve information from PDF documents seamlessly. This capability opens up a wide range of applications, allowing us to leverage existing knowledge bases and documentation effectively.

Finally, we demonstrated how to replicate the same workflow locally using NIMs (NVIDIA Inference Microservices), showcasing the flexibility and portability of the NVIDIA AI platform. With a single API change, we transitioned from a cloud-based solution to a locally running instance, empowering users to deploy and scale their applications according to their specific requirements.

Through hands-on exercises and practical examples, this workshop has equipped you with the knowledge and skills to harness the power of NVIDIA's AI technologies for building advanced question-answering systems. Whether you're working with structured or unstructured data, the techniques and tools covered in this workshop will enable you to develop intelligent and efficient solutions tailored to your specific needs.