# Build a RAG using a locally hosted NIM

This notebook demonstrates how to build a RAG using NVIDIA NIM microservices. We locally host a Llama3-8b-instruct model using [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) and connect to it using the [LangChain NVIDIA AI Endpoints](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/) package.

We then create a vector store by downloading web pages, generating their embeddings, and storing them in a FAISS index. We then showcase a conversational chat chain for querying the vector store. For this example, we use the NVIDIA Triton documentation website, though the code can be easily modified to use any other source. For the embedding model, we use [the GPU-accelerated NV-Embed-QA model from the NVIDIA API Catalog](https://build.nvidia.com/nvidia/embed-qa-4).

### First stage: load the NVIDIA Triton documentation from the web, split the data into chunks, and generate embeddings stored with FAISS

To get started:

1. Generate an NGC CLI API key on build.nvidia.com. Pass this key to docker run in the next section as the NGC_API_KEY environment variable so that the NIM can download the appropriate models and resources when it starts.

Note: To run this notebook, you need to launch the NIM Docker container from a terminal outside of the browser-based notebook environment. Run the commands in the first 3 cells from a terminal, then begin with the 4th cell (the curl inference command) within the notebook environment (web browser).

Launch the NIM LLM microservice by executing this command from the terminal where you have exported all the environment variables.

In [1]:
%%bash
# Stop and remove existing container
docker stop meta-llama3-8b-instruct || true
docker rm meta-llama3-8b-instruct || true

# Replace the placeholder below with your NGC API key
NGC_API_KEY="NVIDIA_API_KEY"

echo "${NGC_API_KEY}" | docker login nvcr.io -u "\$oauthtoken" --password-stdin

# Create and set permissions for cache directories
mkdir -p $HOME/.nim-cache
chmod -R 777 $HOME/.nim-cache

# Start container
docker run -d \
    --name meta-llama3-8b-instruct \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY="${NGC_API_KEY}" \
    -v $HOME/.nim-cache:/opt/nim/.cache \
    --user root \
    --network=container:verb-workspace \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

echo "Container started, waiting for service to be ready..."

# Loop until the service is ready
max_attempts=30
attempt=0
while [ $attempt -lt $max_attempts ]; do
    if curl -s http://localhost:8000/v1/health/ready > /dev/null; then
        echo "Service is ready!"
        break
    else
        attempt=$((attempt + 1))
        echo "Attempt $attempt/$max_attempts: Service not ready yet, waiting..."
        sleep 10
    fi
done

if [ $attempt -eq $max_attempts ]; then
    echo "Service failed to become ready after $max_attempts attempts"
    docker logs meta-llama3-8b-instruct
    exit 1
fi

Error response from daemon: No such container: meta-llama3-8b-instruct
Error response from daemon: No such container: meta-llama3-8b-instruct

Login Succeeded


Unable to find image 'nvcr.io/nim/meta/llama3-8b-instruct:1.0.0' locally
1.0.0: Pulling from nim/meta/llama3-8b-instruct

3116ed3731d3a83ebb64278d0565319b63ffe527216124b9663a7a034b3d8b0c
Container started, waiting for service to be ready...
Attempt 1/30: Service not ready yet, waiting...
Attempt 2/30: Service not ready yet, waiting...
Attempt 3/30: Service not ready yet, waiting...
Attempt 4/30: Service not ready yet, waiting...
Attempt 5/30: Service not ready yet, waiting...
Service is ready!
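
The readiness loop above can also be written in Python, which is convenient if you later want to run the check from a notebook cell. This is a minimal sketch, not part of the original notebook: the helper names `wait_until_ready` and `nim_is_ready` are hypothetical, and the health URL is the same one the bash loop polls. Unlike the bash loop, `nim_is_ready` treats only an HTTP 200 response as ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    max_attempts: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        if is_ready():
            return True
        print(f"Attempt {attempt}/{max_attempts}: service not ready yet, waiting...")
        sleep(delay_seconds)
    return False


def nim_is_ready(url: str = "http://localhost:8000/v1/health/ready") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet
        return False


# Usage (requires the container to be running):
# if not wait_until_ready(nim_is_ready):
#     print("Service failed to become ready")
```

The check function and the sleep function are passed in as arguments so the polling logic can be exercised without a running container.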


Before we continue and connect the NIM to LangChain, let's test it with a simple OpenAI-compatible completions request. You can execute this command and all subsequent ones from your web browser.

In [2]:
!curl -X 'POST' \
    "http://0.0.0.0:8000/v1/completions" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama3-8b-instruct", "prompt": "Once upon a time", "max_tokens": 64}'

{"id":"cmpl-7b43d98dc1264bfd9b5b57a88bc98114","object":"text_completion","created":1734035097,"model":"meta/llama3-8b-instruct","choices":[{"index":0,"text":", in a small village nestled in the rolling hills of the countryside, there lived a young girl named Sophie. Sophie was a curious and adventurous child, with a mop of curly brown hair and a smile that could light up a room. She loved to explore the world around her, and was constantly asking questions about the mysteries","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":69,"completion_tokens":64}}
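
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the NIM is reachable at the base URL used in the curl command; the helper names `build_completion_request` and `complete` are hypothetical and not part of any NVIDIA package.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://0.0.0.0:8000/v1",
    model: str = "meta/llama3-8b-instruct",
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires the NIM container to be running):
# print(complete("Once upon a time"))
```

The response is parsed the same way the JSON output above is structured: the generated text lives under `choices[0].text`.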

Now set up the LangChain flow by installing the prerequisite libraries.

In [10]:
!pip uninstall -y langchain langchain-nvidia-ai-endpoints langchain-core langchain-community langchain-text-splitters langsmith
!pip install --upgrade pip
!pip install langchain
!pip install langchain-nvidia-ai-endpoints
!pip install faiss-gpu
!pip install -U langchain-community

Found existing installation: langchain 0.3.11
Uninstalling langchain-0.3.11:
  Successfully uninstalled langchain-0.3.11
Found existing installation: langchain-nvidia-ai-endpoints 0.3.5
Uninstalling langchain-nvidia-ai-endpoints-0.3.5:
  Successfully uninstalled langchain-nvidia-ai-endpoints-0.3.5
Found existing installation: langchain-core 0.3.24
Uninstalling langchain-core-0.3.24:
  Successfully uninstalled langchain-core-0.3.24
Found existing installation: langchain-text-splitters 0.3.2
Uninstalling langchain-text-splitters-0.3.2:
  Successfully uninstalled langchain-text-splitters-0.3.2
Found existing installation: langsmith 0.2.3
Uninstalling langsmith-0.2.3:
  Successfully uninstalled langsmith-0.2.3
Collecting langchain
  Using cached langchain-0.3.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.24 (from langchain)
  Using cached langchain_core-0.3.24-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langch

Set up the NVIDIA API key, which you can get from the [API Catalog](https://build.nvidia.com/). This key is used to communicate with the GPU-accelerated, cloud-hosted embedding model.

In [1]:
import getpass
import os

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

Enter your NVIDIA API key:  ········


We can now connect to the deployed NIM LLM from LangChain by specifying its base URL.

In [2]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

result = llm.invoke("What is the capital of France?")
print(result.content)

The capital of France is Paris.


Import all the libraries required for building the LangChain retrieval chain.

In [3]:
import os
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
# Vector stores and text splitters now live in their own packages
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

Helper function for loading HTML pages, which we'll use to generate the embeddings. We'll use it later to load the relevant HTML documents from the Triton documentation website and convert them to a vector store.

In [17]:
import re
from typing import List, Union

import requests
from bs4 import BeautifulSoup

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Loads the HTML content of a document from a given URL and returns its text content.

    Args:
        url: The URL of the document.

    Returns:
        The plain-text content of the document, or an empty string if the
        HTTP request or the parsing fails (errors are printed, not raised).
    """
    try:
        response = requests.get(url)
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""
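
To see what the cleaning step produces without fetching a page, here is a self-contained sketch that applies the same idea (drop script and style content, then collapse whitespace) to an inline HTML string. It deliberately swaps BeautifulSoup for the standard-library `html.parser` module so it runs anywhere; the class and function names are hypothetical. One difference from the loader above: text nodes are joined with spaces here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping everything inside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


sample = (
    "<html><head><style>p {color: red;}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Triton</h1>\n<p>Inference   Server</p></body></html>"
)
print(html_to_text(sample))  # prints: Triton Inference Server
```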

Read the HTML files and split the text in preparation for embedding generation.
Note: the chunk_size value must match the specific model used for embedding generation.

Pay attention to the chunk_size parameter of the text splitter. Setting the right chunk size is critical for RAG performance, because much of a RAG’s success depends on the retrieval step finding the right context for generation. The entire prompt (retrieved chunks + user query) must fit within the LLM’s context window, so avoid chunk sizes that are too large and balance them against the estimated query size. For example, while OpenAI LLMs have a context window of 8k-32k tokens, Llama3 is limited to 8k tokens. Note that when length_function=len is used, chunk_size counts characters rather than tokens. Experiment with different chunk sizes; typical values are 100-600, depending on the LLM.
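
The budget described above can be checked with a rough back-of-the-envelope calculation. This is a sketch under stated assumptions, not a measurement: the ratio of about 4 characters per token is a common heuristic for English text rather than a property of the Llama3 tokenizer, the prompt template overhead is ignored, and the function names are hypothetical.

```python
import math


def estimate_prompt_tokens(
    chunk_size_chars: int,
    num_chunks: int,
    query_tokens: int,
    chars_per_token: float = 4.0,  # heuristic assumption, not a tokenizer fact
) -> int:
    """Estimate tokens used by the retrieved chunks plus the user query."""
    chunk_tokens = num_chunks * chunk_size_chars / chars_per_token
    return math.ceil(chunk_tokens + query_tokens)


def fits_context_window(
    chunk_size_chars: int,
    num_chunks: int,
    query_tokens: int,
    max_new_tokens: int,
    context_window: int,
    chars_per_token: float = 4.0,
) -> bool:
    """Check that the prompt plus the generated tokens fit in the context window."""
    prompt_tokens = estimate_prompt_tokens(
        chunk_size_chars, num_chunks, query_tokens, chars_per_token
    )
    return prompt_tokens + max_new_tokens <= context_window


# 4 retrieved chunks of 1000 characters, a 50-token query,
# up to 1000 generated tokens, and an 8k-token context window
print(estimate_prompt_tokens(1000, 4, 50))           # prints: 1050
print(fits_context_window(1000, 4, 50, 1000, 8192))  # prints: True
```

For exact numbers, count tokens with the model's own tokenizer instead of the character heuristic.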

In [26]:
def create_embeddings(embedding_path: str = "./data/nv_embedding"):
    """Download the Triton pages, split them into chunks, and index them at embedding_path."""
    print(f"Storing embeddings to {embedding_path}")

    # List of web pages containing NVIDIA Triton technical documentation
    urls = [
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html",
         "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html",
    ]

    # Download each page and keep only the ones that loaded successfully
    documents = []
    for url in urls:
        document = html_document_loader(url)
        if document:
            documents.append(document)

    # chunk_size and chunk_overlap are measured in characters because length_function=len
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
        length_function=len,
    )
    texts = text_splitter.create_documents(documents)
    # index_docs does not use its url argument; the last URL is passed only to satisfy the signature
    index_docs(urls[-1], text_splitter, texts, embedding_path)
    print("Generated embedding successfully")

Generate embeddings using the NVIDIA Retrieval QA Embedding NIM and NVIDIA AI Endpoints for LangChain, and save them to an offline vector store in the ./data/nv_embedding directory for future reuse.

In [27]:
def index_docs(url: Union[str, bytes], splitter, documents: List, dest_embed_dir) -> None:
    """
    Split the documents into chunks, embed them, and add them to the FAISS index on disk

    Args:
        url: Source url for the documents (currently unused).
        splitter: Splitter used to split the documents
        documents: list of LangChain Document objects whose embeddings need to be created
        dest_embed_dir: destination directory for embeddings

    Returns:
        None
    """
    embeddings = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")

    for document in documents:
        texts = splitter.split_text(document.page_content)
        if not texts:
            continue

        # attach the document's metadata to every chunk produced from it
        metadatas = [document.metadata] * len(texts)

        # create embeddings and add to vector store
        if os.path.exists(dest_embed_dir):
            update = FAISS.load_local(folder_path=dest_embed_dir, embeddings=embeddings, allow_dangerous_deserialization=True)
            update.add_texts(texts, metadatas=metadatas)
            update.save_local(folder_path=dest_embed_dir)
        else:
            docsearch = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas)
            docsearch.save_local(folder_path=dest_embed_dir)

### Second stage: load the embeddings from the vector store and build a RAG using NVIDIAEmbeddings

Create the embedding model using the NVIDIA Retrieval QA Embedding NIM from the API Catalog. This model represents words, phrases, or other pieces of text as vectors of numbers, so that related words and phrases map to nearby vectors. See here for reference: https://build.nvidia.com/nvidia/embed-qa-4

In [28]:


create_embeddings()

embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")


Storing embeddings to ./data/nv_embedding
Generated embedding successfully
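
To make the idea of text as vectors of numbers concrete, the following toy sketch ranks a few chunks by cosine similarity to a query vector. Everything in it is illustrative: the three-dimensional vectors and chunk names are made up (real embeddings have many more dimensions), and cosine similarity is just one common measure; the FAISS index used in this notebook may rank by a different metric such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in chunk_vectors.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical 3-dimensional "embeddings" for three chunks
toy_chunks = {
    "onnx backend": [0.9, 0.1, 0.0],
    "model repository": [0.1, 0.9, 0.1],
    "release notes": [0.0, 0.2, 0.9],
}

# A query vector that points mostly in the same direction as "onnx backend"
print(top_k([0.8, 0.2, 0.0], toy_chunks, k=2))
# prints: ['onnx backend', 'model repository']
```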


Load the saved vector store from disk using FAISS.

In [29]:
# Load the saved FAISS index; the embedding model is needed to embed incoming queries
embedding_path = "./data/nv_embedding"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model, allow_dangerous_deserialization=True)
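
Before wiring the store into a chain, it can help to look at what the retriever will actually return for a question, for example when the model later answers that it does not know. The helper below is a sketch: `preview_retrieval` is a hypothetical name, and it only assumes that the store exposes LangChain's `similarity_search(query, k=...)` method and that the returned documents have a `page_content` attribute.

```python
def preview_retrieval(store, query, k=4, max_chars=200):
    """Print and return short previews of the k chunks retrieved for a query."""
    documents = store.similarity_search(query, k=k)
    previews = [doc.page_content[:max_chars] for doc in documents]
    for rank, preview in enumerate(previews, start=1):
        print(f"[{rank}] {preview}")
    return previews


# Usage in the notebook (requires the loaded index and a valid API key):
# preview_retrieval(docsearch, "What is Triton?")
```

If the previews do not contain the information needed to answer the question, the problem lies in retrieval (chunking, indexing, or the query) rather than in the LLM.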

Create a ConversationalRetrievalChain using the local NIM. We'll use the Llama3 8B NIM we deployed locally, add memory for chat history, and connect to the vector store through its retriever. See here for reference: https://python.langchain.com/docs/modules/chains/popular/chat_vector_db#conversationalretrievalchain-with-streaming-to-stdout

In [30]:
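# Connect to the locally hosted NIM
llm = ChatNVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

# Keep the running conversation so follow-up questions can refer to earlier turns
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Prompt used to answer a question from the retrieved context
qa_prompt = QA_PROMPT

# Standalone "stuff" QA chain; it is not used by the retrieval chain below
doc_chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_PROMPT)

# Retrieval chain: retrieves chunks for each question and answers from them
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(),
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={'prompt': qa_prompt},
)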
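
Conceptually, each turn of a conversational retrieval chain goes through three steps: rewrite a follow-up into a standalone question using the chat history, retrieve chunks for that question, and answer from the retrieved context. The sketch below illustrates that flow with plain callables. It is a simplified model for intuition, not LangChain's implementation; the function name and the `condense`, `retrieve`, and `answer` parameters are hypothetical stand-ins for the LLM and retriever calls.

```python
from typing import Callable, List, Tuple


def conversational_rag_turn(
    question: str,
    chat_history: List[Tuple[str, str]],
    condense: Callable[[str, List[Tuple[str, str]]], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, str], str],
) -> str:
    """Run one chat turn: condense, retrieve, answer, then record the turn."""
    # 1. A follow-up such as "But why?" only makes sense with history,
    #    so rewrite it as a standalone question when history exists
    standalone = condense(question, chat_history) if chat_history else question

    # 2. Fetch the chunks most relevant to the standalone question
    chunks = retrieve(standalone)

    # 3. "Stuff" all retrieved chunks into one context and answer from it
    context = "\n\n".join(chunks)
    reply = answer(context, standalone)

    # 4. Remember the turn so later follow-ups can refer to it
    chat_history.append((question, reply))
    return reply
```

In the notebook, the memory object plays the role of `chat_history`, the retriever built from the vector store plays the role of `retrieve`, and the NIM-hosted LLM performs both the condensing and the answering.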

Now try asking a question about Triton.

In [32]:
query = "What is Triton?"
result = qa.invoke({"question": query})
print(result.get("answer"))

I don't know.


Ask another question about Triton

In [95]:
query = "Does Triton support ONNX?"
result = qa.invoke({"question": query})
print(result.get("answer"))

Yes, Triton supports ONNX models. According to the provided context, Triton supports all ONNX models that are supported by the version of ONNX Runtime being used by Triton.


Finally, showcase the chat capabilities by asking a follow-up question about the previous query.

In [96]:
query = "But why?"
result = qa.invoke({"question": query})
print(result.get("answer"))

Triton supports all ONNX models that are supported by the version of ONNX Runtime being used by Triton because it relies on ONNX Runtime to run the models.
