<p> <center> <a href="../../Start-NIM-RAG.ipynb">Home Page</a> </center> </p>

<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="rag_nim_endpoints.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 34%; text-align: center;">
        <a href="rag_nim_endpoints.ipynb">1</a>
        <a >2</a>
        <a href="nim_lora_adapter.ipynb">3</a>
        <!-- <a href="challenge.ipynb">4</a> -->
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="nim_lora_adapter.ipynb">Next Notebook</a></span>
</div>

# Building RAG With A Localized NIM
---

This notebook will demonstrate building a Retrieval Augmented Generation (RAG) pipeline using localized NVIDIA Inference Microservice (NIM). The notebook will walk you through setting up your NVIDIA API Key, pulling and deploying a NIM image, and building a RAG application that uses the locally deployed NIM.


### Setup NVIDIA API Key

In the previous notebook, we learned how to set up our generated NVIDIA API key. As a requirement for this notebook, you must set the key as the environment variable `NVIDIA_API_KEY` in order to pull the NIM docker images of your choice. If you don't have a key yet, please visit the NVIDIA NIMs API [homepage](https://build.nvidia.com/explore/discover) and generate one. Please run the cell below, enter your NVIDIA API key in the text box that appears, and press Enter.

In [4]:
import os
import getpass

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
    os.environ["NGC_API_KEY"] = nvapi_key


Please execute the cell below to ensure that your docker daemon is up and running.

In [5]:
! docker ps | egrep "^CONTAINER ID|nim"

CONTAINER ID   IMAGE                                                      COMMAND                  CREATED        STATUS        PORTS                                                                             NAMES


**Expected Output (if you have no running containers):**

```python

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

```

### Login to NVCR (NVIDIA Container Registry)

To access a NIM docker image, you must log in via `docker login nvcr.io`. This process requires the default username, passed as `--username '$oauthtoken'`, and `--password-stdin`, which reads the value of `$NGC_API_KEY`.

In [6]:
! echo -e "$NGC_API_KEY" 

nvapi-<redacted>
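
Echoing the key in full, as in the cell above, leaves the secret in the notebook output. A minimal sketch of a safer check follows; `mask_key` is a hypothetical helper written for this example, not part of any NVIDIA tooling, and it prints only a short prefix of the key.

```python
import os

def mask_key(key: str, visible: int = 6) -> str:
    """Return the first `visible` characters of `key` followed by asterisks."""
    if not key:
        return "<not set>"
    return key[:visible] + "*" * max(len(key) - visible, 0)

# Confirms the variable is populated without revealing the secret itself.
print(mask_key(os.environ.get("NGC_API_KEY", "")))
```

The same helper can be reused for `NVIDIA_API_KEY`, since both variables hold the same value in this notebook.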


In [7]:
! echo -e "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


**Expected Output**:
```
WARNING! Your password will be stored unencrypted in /home/yagupta/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
```

### Pull The Image 

The next step is to pull the docker image for the LLM. We demonstrate this step by pulling `llama3-8b-instruct:1.0.0`.

In [8]:
! docker pull nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

1.0.0: Pulling from nim/meta/llama3-8b-instruct
Digest: sha256:7fe6071923b547edd9fba87c891a362ea0b4a88794b8a422d63127e54caa6ef7
Status: Image is up to date for nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0


**Likely output:** (When you have the image pulled already)
```python

1.0.0: Pulling from nim/meta/llama3-8b-instruct
Digest: sha256:7fe6071923b547edd9fba87c891a362ea0b4a88794b8a422d63127e54caa6ef7
Status: Image is up to date for nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

### Pull The Embedding Image 

Next, pull the docker image for the embedding model. We demonstrate this step by pulling `nv-embedqa-e5-v5:1.0.1`.

In [9]:
! docker pull nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.1

1.0.1: Pulling from nim/nvidia/nv-embedqa-e5-v5
Digest: sha256:128c31a60c4200f02059cb90a8aad0200fcd05fa76700cbf99167de5619c6a46
Status: Image is up to date for nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.1
nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.1


**Likely output:** (When you have the image pulled already)
```python

1.0.1: Pulling from nim/nvidia/nv-embedqa-e5-v5
Digest: sha256:128c31a60c4200f02059cb90a8aad0200fcd05fa76700cbf99167de5619c6a46
Status: Image is up to date for nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.1
nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.1
```

Let's check the model image by listing available images. *Please note that the `IMAGE ID` may differ from what you see under the expected output below*.

In [10]:
! docker image ls | egrep "^REPOSITORY|nim"

REPOSITORY                            TAG                       IMAGE ID       CREATED         SIZE
nvcr.io/nim/nvidia/nv-embedqa-e5-v5   1.0.1                     fa5c1fc5ccb3   4 months ago    15.7GB
nvcr.io/nim/meta/llama3-8b-instruct   1.0.0                     3cb29b0d79e6   5 months ago    12.5GB


**Expected Output**:

```python
REPOSITORY                            TAG                       IMAGE ID       CREATED         SIZE
nvcr.io/nim/nvidia/nv-embedqa-e5-v5   1.0.1                     fa5c1fc5ccb3   3 months ago    15.7GB
nvcr.io/nim/meta/llama3-8b-instruct   1.0.0                     3cb29b0d79e6   5 months ago    12.5GB
```


#### Setting up Cache for the Model Artifacts

On first start, a NIM downloads the model files and selects the profile best suited to the available hardware. Set a location for caching these model artifacts as `LOCAL_NIM_CACHE` and export the variable so that later runs can reuse them.

In [11]:
# Directory on the host where downloaded model artifacts are cached between runs
os.environ['LOCAL_NIM_CACHE'] = "/local/.cache/nim"
!echo $LOCAL_NIM_CACHE

/local/.cache/nim


In [12]:
!mkdir -p "$LOCAL_NIM_CACHE"
!chmod 755 "$LOCAL_NIM_CACHE"

In [13]:
import random
import socket

def find_available_port(start=11000, end=11999):
    while True:
        # Randomly select a port between start and end range
        port = random.randint(start, end)
        
        # Try to create a socket and bind to the port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("localhost", port))
                # If binding is successful, the port is free
                return port
            except OSError:
                # If binding fails, the port is in use, continue to the next iteration
                continue

# Find and print an available port
os.environ['LLM_CONTAINER_PORT'] = str(find_available_port())
print(f"You have been allotted the available port for llm: {os.environ['LLM_CONTAINER_PORT']}")

os.environ['EMBED_CONTAINER_PORT'] = str(find_available_port())
print(f"You have been allotted the available port for embeddings: {os.environ['EMBED_CONTAINER_PORT']}")

You have been allotted the available port for llm: 11777
You have been allotted the available port for embeddings: 11807


In [14]:
! docker run -it -d --rm \
--gpus '"device=5,6"' \
--name=llm_nim \
--shm-size=16GB  \
-e NGC_API_KEY \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-u $(id -u) \
-p $LLM_CONTAINER_PORT:8000 \
nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

# To make sure the local NIM container has finished loading and is no longer in a pending state, we wait before continuing
! sleep 60

7321e9c478f315eeff2adcf22e76e76ed614bbe819f14b5b1319d63d43414cad
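
A fixed `sleep 60` can be too short on a cold cache and needlessly long on a warm one. As an alternative, the sketch below polls the `/v1/health/ready` endpoint that the container lists in its logs. The helper names `http_ok` and `wait_until_ready`, the timeout values, and the assumption that the endpoint answers HTTP 200 once the model is loaded are all illustrative, not taken from the NIM documentation.

```python
import time
import urllib.error
import urllib.request

def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False

def wait_until_ready(url, max_wait=600.0, interval=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `max_wait` seconds of sleeping have passed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval

# Hypothetical usage once LLM_CONTAINER_PORT has been set by the earlier cell:
# ready_url = f"http://0.0.0.0:{os.environ['LLM_CONTAINER_PORT']}/v1/health/ready"
# print(wait_until_ready(ready_url))
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running container.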


In [15]:
! docker logs --tail 45 llm_nim

INFO 11-26 00:49:22.421 api_server.py:456] Serving endpoints:
  0.0.0.0:8000/openapi.json
  0.0.0.0:8000/docs
  0.0.0.0:8000/docs/oauth2-redirect
  0.0.0.0:8000/metrics
  0.0.0.0:8000/v1/health/ready
  0.0.0.0:8000/v1/health/live
  0.0.0.0:8000/v1/models
  0.0.0.0:8000/v1/version
  0.0.0.0:8000/v1/chat/completions
  0.0.0.0:8000/v1/completions
INFO 11-26 00:49:22.421 api_server.py:460] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,


**Expected Output:**
```
WARNING 09-10 12:08:40.618 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 09-10 12:08:40.631 api_server.py:456] Serving endpoints:
  0.0.0.0:8000/openapi.json
  0.0.0.0:8000/docs
  0.0.0.0:8000/docs/oauth2-redirect
  0.0.0.0:8000/metrics
  0.0.0.0:8000/v1/health/ready
  0.0.0.0:8000/v1/health/live
  0.0.0.0:8000/v1/models
  0.0.0.0:8000/v1/version
  0.0.0.0:8000/v1/chat/completions
  0.0.0.0:8000/v1/completions
INFO 09-10 12:08:40.631 api_server.py:460] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'

INFO 09-10 12:08:40.681 server.py:82] Started server process [32]
INFO 09-10 12:08:40.681 on.py:48] Waiting for application startup.
INFO 09-10 12:08:40.710 on.py:62] Application startup complete.
INFO 09-10 12:08:40.712 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

### Initiate A Quick Test
You can quickly test that your NIM is up and running via two methods:
- LangChain NVIDIA Endpoints
- A simple OpenAI completion request

**Parameter description:**
- **base_url**: The URL where the NIM docker image is deployed.
- **model**: The name of the NIM model deployed. 
- **temperature**: To modulate the randomness of sampling. Reducing the temperature increases the chance of selecting words with high probabilities.
- **top_p**: To control how deterministic the model is. If you are looking for exact and factual answers, keep this low. If you seek more diverse responses, increase to a higher value.
- **max_tokens**: maximum number of output tokens to be generated.
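
The effect of `temperature` and `top_p` can be sketched on a toy distribution. The snippet below is a simplified illustration in plain Python, not the NIM's actual sampler; the logits and the helper names `apply_temperature` and `top_p_filter` are made up for the example.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
print(apply_temperature(logits, 1.0))
print(apply_temperature(logits, 0.1))
print(top_p_filter(apply_temperature(logits, 1.0), 0.8))
```

With the lower temperature, almost all probability mass moves to the highest-scoring token, while the `top_p` filter drops the least likely tokens before sampling.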


In [23]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['LLM_CONTAINER_PORT']), model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=100, top_p=1.0)

result = llm.invoke("Who is Tim Rosenfield?")
print(result.content)

Tim Rosenfield is an American journalist and news anchor who has worked for several major news organizations, including CNN, MSNBC, and HLN (Headline News). He is best known for his work as a news anchor and correspondent, covering a wide range of topics including politics, business, and entertainment.

Rosenfield has had a long and distinguished career in journalism, with over three decades of experience in the industry. He has worked as a news anchor and correspondent for several major networks, including CNN,


In [24]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['LLM_CONTAINER_PORT']), model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=100, top_p=1.0)

result = llm.invoke("What is immersion technology?")
print(result.content)

Immersion technology refers to a range of technologies that aim to create a more immersive and engaging experience for users, often by simulating a sense of presence or interaction with a virtual environment. Immersion technologies can be used in various fields, including entertainment, education, healthcare, and more.

Some common examples of immersion technologies include:

1. Virtual Reality (VR): VR uses a headset or other device to create a simulated environment that users can interact with using controllers or other devices.
2. Augmented Reality


In [16]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['LLM_CONTAINER_PORT']), model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=100, top_p=1.0)

result = llm.invoke("What is a HyperCube?")
print(result.content)

A HyperCube is a mathematical concept that represents a higher-dimensional analog of a cube. In essence, it's a geometric shape that exists in a space with more than three dimensions.

In traditional geometry, a cube is a three-dimensional shape with six square faces, each of which is a two-dimensional square. A HyperCube, on the other hand, is a shape that exists in a space with four or more dimensions. The number of dimensions in a HyperCube is typically denoted by the letter "


In [17]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['LLM_CONTAINER_PORT']), model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=100, top_p=1.0)

result = llm.invoke("What is SMC role in sustainable AI?")
print(result.content)

SMC (Social Media Company) plays a crucial role in sustainable AI (Artificial Intelligence) in several ways:

1. **Data Collection and Management**: SMCs are responsible for collecting and managing vast amounts of user-generated data, which is essential for training AI models. By leveraging this data, AI systems can learn to recognize patterns, make predictions, and improve decision-making.
2. **Data Labeling and Annotation**: SMCs can provide labeled and annotated data to AI developers, enabling them


If a cell returns an error, wait a short while and rerun it. The error may simply mean that the NIM container has not finished starting up.
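
The second test method listed earlier, a plain OpenAI-style request, is not shown in the cells above. A minimal sketch using only the standard library follows. It targets the `/v1/chat/completions` endpoint that appears in the container logs; the helper names are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI chat completion format.

```python
import json
import os
import urllib.request

def build_chat_request(port, prompt, model="meta/llama3-8b-instruct",
                       temperature=0.1, top_p=1.0, max_tokens=100):
    """Build the URL and JSON payload for a non-streaming chat completion call."""
    url = f"http://0.0.0.0:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return url, payload

def send_chat_request(url, payload, timeout=60):
    """POST the payload and return the assistant message text (assumed response layout)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Only contacts the server when the port variable from the earlier cell is present.
if "LLM_CONTAINER_PORT" in os.environ:
    url, payload = build_chat_request(os.environ["LLM_CONTAINER_PORT"], "What is a HyperCube?")
    print(send_chat_request(url, payload))
```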

### RAG Application 

In this section, we follow the steps from the previous notebook to build a RAG application on top of the locally deployed NIM. Unlike the previous notebook, which built a conversational retrieval chain from two LLMs, this one uses a single LLM, `llama3-8b-instruct`, because each NIM image serves one base model. It is possible to combine the locally deployed NIM with remotely hosted endpoints, but for clarity we stick with the single-LLM approach.
 

#### Import libraries

#### Create Web Link Data Source

You can replace and add more web links of your choice. 

In [19]:
# all_urls = ["https://smc.co",
#        "https://smc.co/about-us",
#        "https://smc.co/pricing",
#        "https://smc.co/sustainability",
#        "https://smc.co/future-state"
#       ]

In [16]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from typing import Set, List, Tuple
import time
from collections import deque

class LinkCrawler:
    ALLOWED_DOMAINS = {'smc.co', 'firmus.co'}  # Only these domains allow deeper crawling
    
    def __init__(self, base_url: str, max_depth: int = 2, delay: float = 0.1, verbose: bool = False):
        if not base_url.startswith(('http://', 'https://')):
            base_url = 'https://' + base_url
            
        self.base_url = base_url
        self.base_domain = urlparse(base_url).netloc
        
        # Adjust max_depth based on domain
        if not any(self.base_domain.endswith(allowed) for allowed in self.ALLOWED_DOMAINS):
            self.max_depth = 1
        else:
            self.max_depth = max_depth
            
        self.visited_urls = set()
        self.ignored_urls = set()
        self.delay = delay
        self.verbose = verbose
        
    def get_domain(self, url: str) -> str:
        """Extract the main domain from a URL"""
        domain = urlparse(url).netloc
        return domain
    
    def normalize_url(self, url: str) -> str:
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url.lstrip('/:')
            
        parsed_url = urlparse(url)
        normalized_url = parsed_url.scheme + "://" + parsed_url.netloc + parsed_url.path.rstrip('/')
        return normalized_url.split('#')[0].split('?')[0]
    
    def is_valid_url(self, url: str) -> bool:
        try:
            parsed = urlparse(url)
            return parsed.scheme in ['http', 'https']
        except ValueError:
            # urlparse raises ValueError on malformed URLs (e.g. a bad IPv6 literal)
            return False
    
    def get_page_links(self, url: str) -> Tuple[Set[str], Set[str]]:
        allowed_links = set()
        ignored_links = set()
        
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, timeout=10, headers=headers)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a', href=True):
                href = link['href']
                absolute_url = urljoin(url, href)
                normalized_url = self.normalize_url(absolute_url)
                
                if self.is_valid_url(normalized_url):
                    if any(self.get_domain(normalized_url).endswith(allowed) for allowed in self.ALLOWED_DOMAINS):
                        allowed_links.add(normalized_url)
                    else:
                        ignored_links.add(normalized_url)
                    
            time.sleep(self.delay)
            
        except Exception as e:
            if self.verbose:
                print(f"Error processing {url}: {e}")
        
        return allowed_links, ignored_links
    
    def crawl(self) -> Tuple[List[str], List[str]]:
        queue = deque([(self.normalize_url(self.base_url), 0)])
        
        while queue:
            current_url, depth = queue.popleft()
            
            if depth >= self.max_depth:
                continue
                
            if current_url in self.visited_urls:
                continue
                
            if self.verbose:
                print(f"Crawling: {current_url} (Depth: {depth})")
                
            self.visited_urls.add(current_url)
            allowed_links, ignored_links = self.get_page_links(current_url)
            
            self.ignored_urls.update(ignored_links)
            
            for link in allowed_links:
                if link not in self.visited_urls:
                    queue.append((link, depth + 1))
        
        return sorted(list(self.visited_urls)), sorted(list(self.ignored_urls))

if __name__ == "__main__":
    target_urls = ["smc.co", "firmus.co"]
    default_max_depth = 5
    all_allowed_urls = set()
    all_ignored_urls = set()
    
    # Define domains to ignore
    IGNORED_DOMAINS = {'usda.gov', 'washingtonpost.com', 'amd.com'}
    
    for target_url in target_urls:
        try:
            print(f"\nCrawling {target_url}...")
            crawler = LinkCrawler(target_url, max_depth=default_max_depth, verbose=False)
            domain_type = "unrestricted" if any(target_url.endswith(d) for d in crawler.ALLOWED_DOMAINS) else "restricted"
            print(f"Domain type: {domain_type} (depth={crawler.max_depth})")
            
            allowed_links, ignored_links = crawler.crawl()
            all_allowed_urls.update(allowed_links)
            all_ignored_urls.update(ignored_links)
            time.sleep(2)
        except Exception as e:
            print(f"Error crawling {target_url}: {e}")
    
    # Combine all URLs and filter out ignored domains
    all_urls = {url for url in all_allowed_urls.union(all_ignored_urls) 
                if not any(urlparse(url).netloc.endswith(domain) 
                          for domain in IGNORED_DOMAINS)}

    all_urls = {url for url in all_urls if not url.startswith('https://mailto:')}
    
    print(f"\nFound {len(all_urls)} total unique links (excluding {', '.join(IGNORED_DOMAINS)}):")
#    for url in sorted(all_urls):
#        print(url)


Crawling smc.co...
Domain type: unrestricted (depth=5)

Crawling firmus.co...
Domain type: unrestricted (depth=5)

Found 136 total unique links (excluding usda.gov, amd.com, washingtonpost.com):
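
To see what the crawler's URL normalization does, and why a `https://mailto:` filter appears to be needed afterwards, here is a standalone copy of `LinkCrawler.normalize_url` run on a few inputs. The sample inputs are made up for illustration.

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Standalone copy of LinkCrawler.normalize_url for experimentation."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url.lstrip("/:")
    parsed = urlparse(url)
    normalized = parsed.scheme + "://" + parsed.netloc + parsed.path.rstrip("/")
    return normalized.split("#")[0].split("?")[0]

# Query strings, fragments and trailing slashes collapse to one canonical form.
print(normalize_url("smc.co/about-us/?utm=x#team"))
print(normalize_url("https://smc.co/about-us"))
# A mailto link has no http(s) scheme, so it gets an https:// prefix added.
print(normalize_url("mailto:info@example.com"))
```

Because every scheme-less link is prefixed with `https://`, non-web links such as `mailto:` pass the scheme check and have to be removed in a separate step.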


#### Create A Function To Load HTML Files

Below is a helper function that downloads an HTML page and extracts its plain text, which we'll use to generate the embeddings. 

In [17]:
import re
import requests
from bs4 import BeautifulSoup
from typing import List, Union

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Loads the HTML content of a document from a given URL and returns its text content.

    Args:
        url: The URL of the document.

    Returns:
        The content of the document.

    Note:
        Request and parsing errors are caught and printed; an empty string
        is returned in that case rather than raising.

    """
    try:
        response = requests.get(url, timeout=10)
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""

#### Create Embeddings and Document Text Splitter

Let's create a function that sets the path for storing our embeddings, runs the `html_document_loader` function on every crawled URL, and splits the documents into chunks of text.

In [18]:
def create_embeddings(embeddings_model, embedding_path: str = "./embed"):
    # Load every crawled URL, split the text into chunks, and index them under embedding_path.
    print(f"Storing embeddings to {embedding_path}")

    documents = []
    for url in all_urls:
#        print(f"Working on URL {url}")
        document = html_document_loader(url)
        documents.append(document)


    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,
        length_function=len,
    )
    print("Total documents:",len(documents))
    texts = text_splitter.create_documents(documents)
    print("Total texts:",len(texts))
    # url here is the last URL from the loop above; index_docs accepts it but does not use it
    index_docs(embeddings_model, url, text_splitter, texts, embedding_path)
    print("Generated embedding successfully")
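
To make the `chunk_size=500` and `chunk_overlap=100` settings concrete, here is a simplified fixed-window chunker. It is only an illustration of overlapping windows: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and spaces, which this sketch does not attempt. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

sample = "".join(str(i % 10) for i in range(1200))  # 1200 characters of filler text
pieces = chunk_text(sample)
print([len(p) for p in pieces])
print(pieces[0][-100:] == pieces[1][:100])
```

The overlap means the last 100 characters of one chunk are repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk.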

#### Generate Embeddings Using NVIDIA AI Endpoints From LangChain

In this section we demonstrate how to generate embeddings using NVIDIA AI Endpoints for LangChain and save them to an offline vector store in the `./embed` directory for future reuse.

In [19]:
! docker run -it -d --rm \
   --name embeddings_nim \
   --gpus '"device=5,6"' \
   --shm-size=16GB \
   -e NGC_API_KEY \
   -v "$LOCAL_NIM_CACHE:/opt/nim/.cache"  \
   -u $(id -u) \
   -p $EMBED_CONTAINER_PORT:8000 \
   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.1

! sleep 60

44ac65244729dfe096534894c131a5c7df20d79f2824c07c1bc349b453ca713b


In [20]:
! docker logs --tail 45 embeddings_nim

|          |                                | ":"4"}}                        |
|          |                                |                                |
+----------+--------------------------------+--------------------------------+

I1126 00:54:10.831496 272 server.cc:676] 
+-----------------------------------+---------+--------+
| Model                             | Version | Status |
+-----------------------------------+---------+--------+
| nvidia_nv_embedqa_e5_v5           | 1       | READY  |
| nvidia_nv_embedqa_e5_v5_model     | 1       | READY  |
| nvidia_nv_embedqa_e5_v5_tokenizer | 1       | READY  |
+-----------------------------------+---------+--------+

I1126 00:54:10.910448 272 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA H100 80GB HBM3"
I1126 00:54:10.910472 272 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA H100 80GB HBM3"
I1126 00:54:10.919560 272 metrics.cc:770] "Collecting CPU metrics"
I1126 00:54:10.919667 272 tritonserver.cc:2557] 
+-----------

In [22]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embeddings_model = NVIDIAEmbeddings(base_url="http://172.17.0.1:{}/v1".format(os.environ['EMBED_CONTAINER_PORT']))

#embeddings_model = NVIDIAEmbeddings(base_url="http://210.87.104.19:18000/v1")

Set model using model parameter. 
To get available models use available_models property.


Below, we create an `index_docs` function that splits each document's page content into chunks, collects the chunk texts and their metadata, and indexes them with [FAISS](https://faiss.ai/index.html). The embeddings are stored locally.

In [23]:
from typing import List, Union


def index_docs(embeddings_model, url: Union[str, bytes], splitter, documents: List[str], dest_embed_dir: str) -> None:
    """
    Split the documents into chunks and create embeddings for them.
    
    Args:
        embeddings_model: Model used for creating embeddings.
        url: Source url for the documents.
        splitter: Splitter used to split the documents.
        documents: List of documents whose embeddings need to be created.
        dest_embed_dir: Destination directory for embeddings.
    """
    texts = []
    metadatas = []

    for document in documents:
        chunk_texts = splitter.split_text(document.page_content)
        texts.extend(chunk_texts)
        metadatas.extend([document.metadata] * len(chunk_texts))

    if os.path.exists(dest_embed_dir):
        docsearch = FAISS.load_local(
            folder_path=dest_embed_dir, 
            embeddings=embeddings_model, 
            allow_dangerous_deserialization=True
        )
        docsearch.add_texts(texts, metadatas=metadatas)
    else:
        docsearch = FAISS.from_texts(texts, embedding=embeddings_model, metadatas=metadatas)

    docsearch.save_local(folder_path=dest_embed_dir)
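
As a rough picture of what the index does at query time, the sketch below runs a brute-force nearest-neighbour search over a handful of made-up vectors. It assumes Euclidean (L2) distance as the metric and uses three-dimensional toy vectors; real embeddings come from the embedding NIM and FAISS performs the search far more efficiently. The names `l2_distance`, `nearest_chunks` and `toy_index` are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_chunks(query_vec, index, k=2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up 3-dimensional "embeddings" standing in for real ones.
toy_index = [
    ("immersion cooling", [0.9, 0.1, 0.0]),
    ("GPU pricing", [0.1, 0.9, 0.0]),
    ("data center energy use", [0.8, 0.2, 0.1]),
]
for text, dist in nearest_chunks([1.0, 0.0, 0.0], toy_index):
    print(f"{dist:.3f}  {text}")
```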

#### Load Embeddings from the Vector Store and Build a RAG using NVIDIA Endpoints

Next, we call the function `create_embeddings` and load the documents from the [vector store](https://developer.nvidia.com/blog/accelerating-vector-search-fine-tuning-gpu-index-algorithms/) using FAISS. The vector store holds the embeddings, which are high-dimensional vector representations of the text chunks, so that chunks relevant to a query can be found by vector similarity.

Please run the two cells below. 

In [26]:
%%time

from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
create_embeddings(embeddings_model=embeddings_model)

Storing embeddings to ./embed
Total documents: 136
Total texts: 72498
Generated embedding successfully
CPU times: user 24 s, sys: 5.75 s, total: 29.7 s
Wall time: 2min 12s


In [27]:
# load Embed documents
! ls -lh ./embed

embedding_path = "./embed/"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embeddings_model, allow_dangerous_deserialization=True)

total 347M
-rw-rw-r-- 1 ubuntu ubuntu 284M Nov 26 01:00 index.faiss
-rw-rw-r-- 1 ubuntu ubuntu  64M Nov 26 01:00 index.pkl


### Create A Conversational Retrieval Chain With llama3-8b-instruct

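The cell below queries the vector store directly, first with a similarity search and then through an MMR retriever, before we build the chain.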

In [28]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.memory import ConversationBufferMemory
from langchain.chains.question_answering import load_qa_chain


query = "pete blain"
query_embedding = embeddings_model.embed_query(query)
#print (query_embedding)

# Perform a similarity search; the vector store embeds the query text itself
results = docsearch.similarity_search(query, k=7)

# Output the results (the most similar documents)
#print("Most relevant documents:")
#for result in results:
#    print(f"* {result.page_content} [{result.metadata}]")
    
retriever = docsearch.as_retriever(search_type="mmr", search_kwargs={"k": 7})
response = retriever.invoke(query)
#retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

print (response)


[Document(metadata={}, page_content='Director of Product & AI.As our technical lead and subject matter expert in AI, DevOps, Machine Learning & Supercomputing, Dr Peter Blain’s experience stems from his PhD in Artificial Intelligence & Computer Systems Engineering, and his work as a Former Systems Architect at the Australian Integrated Marine Observing System (IMOS). Besides deep experience with Australia’s research HPC clusters, he has also developed systems and code at commercial (Biteable), research (IMOS & TPAC), and financial'), Document(metadata={}, page_content='endobj 2572 0 obj <</IsMap false/S/URI/URI(https://www.britannica.com/biography/Woodrow-Wilson)>> endobj 2573 0 obj <</IsMap false/S/URI/URI(https://www.britannica.com/topic/National-Park-Service)>> endobj 2574 0 obj <</IsMap false/S/URI/URI(https://en.wikipedia.org/wiki/Grand_Canyon_National_Park#cite_note-9)>> endobj 2575 0 obj <</Annots 2576 0 R/ArtBox[0.0 0.0 612.0 792.0]/BleedBox[0.0 0.0 612.0 792.0]/Contents 2585 0
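
The retriever above uses `search_type="mmr"` (maximal marginal relevance), which trades some relevance for diversity among the returned chunks. The sketch below shows the general greedy idea on toy vectors. It uses cosine similarity and a weight of 0.5 as illustrative choices; it is not LangChain's implementation, and the names `cosine` and `mmr_select` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0 when nothing is chosen yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

docs = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]  # toy vectors; 0 and 1 are near-duplicates
print(mmr_select([1.0, 1.0], docs, k=2))                   # diversity-aware selection
print(mmr_select([1.0, 1.0], docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With the diversity term active, the second pick skips the near-duplicate of the first result in favour of a less similar vector.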

In [38]:
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.memory import ConversationBufferMemory
from langchain.chains.question_answering import load_qa_chain

llm = ChatNVIDIA(base_url="http://0.0.0.0:{}/v1".format(os.environ['LLM_CONTAINER_PORT']),
                 model="meta/llama3-8b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa_prompt=QA_PROMPT

doc_chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(search_kwargs={"k": 7}),
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={'prompt': qa_prompt},
)
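
The `chain_type="stuff"` setting means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A minimal sketch of that idea is shown below. The prompt wording and the function `build_stuff_prompt` are hypothetical and do not reproduce LangChain's `QA_PROMPT`; the chunk strings are placeholders.

```python
def build_stuff_prompt(question: str, chunks: list) -> str:
    """Concatenate retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder chunks standing in for what the retriever would return.
retrieved = [
    "First retrieved chunk of text.",
    "Second retrieved chunk of text.",
]
print(build_stuff_prompt("What does the first chunk say?", retrieved))
```

Because every chunk ends up in one prompt, the number of retrieved chunks (`k`) and the chunk size together determine how much of the model's context window the retrieval step consumes.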

In [30]:
#ChatNVIDIA.get_available_models()

### Test With Query

In [36]:
query = "Who is Sean Zhan from Sustainable Metal Cloud?"
result = qa({"question": query})
print(result.get("answer"))


I don't know. The provided text does not mention Sean Zhan from Sustainable Metal Cloud.


In [40]:
query = "Who is Derek Ngo from Sustainable Metal Cloud?"
result = qa({"question": query})
print(result.get("answer"))

According to the provided context, Derek Ngo is a Solutions Architect at Sustainable Metal Cloud.


In [41]:
query = "What is a green data center?"
result = qa({"question": query})
print(result.get("answer"))

Based on the provided context, a green data center refers to a data center that is designed to be more sustainable and energy-efficient. It appears to be a data center that uses environmentally friendly technologies and practices to reduce its carbon footprint and energy consumption.


In [42]:
query = "What is firmus immersion technology?"
result = qa({"question": query})
print(result.get("answer"))

According to the provided context, Firmus Immersion Technology is an advanced immersion cooling technology that allows for a significant reduction in energy consumption, making it a groundbreaking innovation in the data center industry.


In [43]:
query = "What is Sustainable AI?"
result = qa({"question": query})
print(result.get("answer"))

Based on the provided context, it seems that Sustainable AI refers to the development and implementation of AI technology that is environmentally sustainable and has a reduced carbon footprint. This is achieved through the use of innovative cooling technologies, such as immersion cooling, which reduce energy consumption and overall carbon emissions.


In [44]:
query = "How does SMC achieve sustainable AI?"
result = qa({"question": query})
print(result.get("answer"))

According to the text, Sustainable Metal Cloud achieves sustainable AI through its energy-efficient technology, which has been verified by MLCommons. The company has achieved significant energy reductions, from 15 kW net power to operate to 451 kWh, and has also achieved a 7% performance improvement. Additionally, the company's technology aims to influence industry narratives and establish new benchmarks for sustainable AI practices.


In [45]:
query = "What is a HyperCube?"
result = qa({"question": query})
print(result.get("answer"))

Based on the provided context, a HyperCube appears to be a technology platform that provides scalable and flexible infrastructure for deploying artificial intelligence (AI) and machine learning workloads. It is powered by NVIDIA H100 GPUs and is designed to deliver robust performance while keeping sustainability in mind. Additionally, it is described as being highly efficient, with a Total Usage Effectiveness (TUE) of less than 1.15 in Singapore and operating within a sub 1.05 PUE envelope.


In [46]:
query = "What are features and capabilities of a hypercube?"
result = qa({"question": query})
print(result.get("answer"))

According to the provided context, the features and capabilities of a HyperCube include:

* Scalable and Flexible Infrastructure: modular design that adapts seamlessly to deploy AI at any scale
* Uncompromised AI Power: powered by NVIDIA H100 GPUs, delivering robust performance for AI and machine learning workloads
* Sustainable Cooling Innovation: immersion cooling technology that achieves drastic energy and CO₂ reductions
* Adaptability and Scalability: modular design integrates seamlessly into various data center locations for cloud AZs or edge locations
* Efficiency: operates within a sub 1.05 PUE envelope, extracts up to 30% further efficiency from the compute, and is amongst the most efficient compute platform in the world by Total Usage Effectiveness (TUE)
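
The test cells above repeat the same three lines for each query. As a small convenience, the pattern could be wrapped in a helper like the sketch below. `ask_all` and `fake_qa` are hypothetical names introduced here; `fake_qa` only mimics the chain's input and output shape (a dict with a `question` key in, a dict with an `answer` key out) so the example runs without a deployed NIM.

```python
def ask_all(qa_fn, queries):
    """Run each query through qa_fn and collect (query, answer) pairs."""
    results = []
    for query in queries:
        response = qa_fn({"question": query})
        results.append((query, response.get("answer", "")))
    return results

# Stand-in for the real chain: echoes the question back as the answer.
def fake_qa(inputs):
    return {"answer": "echo: " + inputs["question"]}

pairs = ask_all(fake_qa, ["What is a green data center?", "What is a HyperCube?"])
for query, answer in pairs:
    print(f"Q: {query}\nA: {answer}\n")
```

With the real chain, `ask_all(qa, [...])` should behave the same way. Note that the chain was built with `memory`, so earlier questions in the list may influence how later ones are interpreted.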


Before we move ahead, let's free up GPU VRAM by stopping the Docker containers.

In [47]:
! docker container stop llm_nim embeddings_nim

llm_nim
embeddings_nim


In [48]:
! docker ps | egrep "^CONTAINER ID|nim"

CONTAINER ID   IMAGE                                                      COMMAND                  CREATED        STATUS        PORTS                                                                             NAMES
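
The header-only output above indicates that no NIM container is still running. If you prefer to check this programmatically, the sketch below mirrors the `egrep "^CONTAINER ID|nim"` filter in Python. The function names and the sample `docker ps` text are hypothetical and made up for illustration.

```python
import re

# Same pattern as the egrep call: the header line, or any line containing "nim".
NIM_PATTERN = re.compile(r"^CONTAINER ID|nim")

def filter_nim_rows(ps_output):
    """Keep the header row and any row that mentions 'nim'."""
    return [line for line in ps_output.splitlines() if NIM_PATTERN.search(line)]

def running_nim_containers(ps_output):
    """Return only the container rows, dropping the header."""
    return [row for row in filter_nim_rows(ps_output)
            if not row.startswith("CONTAINER ID")]

# Made-up sample text in the general shape of `docker ps` output.
header = "CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES"
sample_running = header + "\n" + "0123456789ab   example/image   cmd   1h   Up   -   llm_nim"
sample_stopped = header

print(running_nim_containers(sample_running))
print(running_nim_containers(sample_stopped))
```

To apply this to live output, the text could be obtained with `subprocess.run(["docker", "ps"], capture_output=True, text=True).stdout`, assuming the Docker CLI is on your path.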


The next notebook walks through adding PEFT functionality, such as LoRA adapters, to NIMs.

In [65]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

llm = ChatNVIDIA(
    base_url="http://0.0.0.0:{}/v1".format(os.environ['LLM_CONTAINER_PORT']),
    model="meta/llama3-8b-instruct",
    temperature=0.1,
    max_tokens=1000,
    top_p=1.0,
)

question = "Who is Sean Zhan?"

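prompt = ChatPromptTemplate.from_messages([
    ("system", (
        # Adjacent string literals are concatenated as-is, so each sentence
        # ends with a space to keep them from running together.
        "You are friendly AI! "
        "Do not hallucinate. "
        "Cite source of your answer."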
#        "You are unhelpful and nasty AI!"
#        "Your responses should be concise and no longer than two sentences."
#        "Say you don't know if you don't have this information."
#        "Say ask Jarvis if you don't know."
    )),
    ("user", "{question}")
])

chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "Who is Sean Zhan from Sustainable Metal Cloud?"}))


I apologize, but I couldn't find any information on a person named Sean Zhan from Sustainable Metal Cloud. It's possible that the information you're looking for is not publicly available or is not well-known.

If you could provide more context or details about Sustainable Metal Cloud or Sean Zhan, I may be able to help you better. Alternatively, you can try searching online or checking industry reports and publications to see if they have any information on the topic.


---

## References

- https://developer.nvidia.com/blog/tips-for-building-a-rag-pipeline-with-nvidia-ai-langchain-ai-endpoints/
- https://nvidia.github.io/GenerativeAIExamples/latest/notebooks/05_RAG_for_HTML_docs_with_Langchain_NVIDIA_AI_Endpoints.html

## Licensing

Copyright © 2024 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
