# Techniques for Improving the Effectiveness of RAG Systems

Execute the cell below to load the video presentation that accompanies this notebook, and watch it before working through the materials in this notebook.

In [1]:
from IPython.display import HTML

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-fx-20-v1/lesson-02.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

---

## Lesson 02: Loading the Vector/Document Database

Welcome to Lesson 02! In this section, we will focus on how to launch a database and search it with both semantic and keyword search.

As we mentioned in Lesson 00, our RAG system is composed of modular, independently scalable services, each running in its own container. This architecture is well suited to deployment in a cloud environment.

In addition to using `docker-compose`, which is particularly well suited to single node deployments like the environment you are working in today, public cloud providers include managed container orchestration services that help run this kind of architecture; popular examples include [Amazon Elastic Container Service (ECS)](https://aws.amazon.com/ecs/), [Azure Container Apps](https://azure.microsoft.com/en-us/products/container-apps/), and [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine).

NVIDIA provides a related service, with a serverless API to deploy and manage AI workloads on GPUs, called [NVIDIA Cloud Functions (NVCF)](https://docs.nvidia.com/cloud-functions/user-guide/latest/cloud-function/overview.html).

The NVCF API supports HTTP polling, HTTP streaming, and gRPC. Deep learning models (including embedding functions and LLMs) are especially easy to prepare and serve on NVCF through [Triton Inference Server](https://developer.nvidia.com/triton-inference-server), but NVCF supports containers using other backends as well.

**This notebook will focus on the embedder and hybrid search.**


<div style="text-align: center;">
<img src="img/02_overview.png" width="850" alt="architecture diagram with the embedder and search highlighted">
</div>

---

## Restart the Services

To make sure you're starting this lesson with all your services in the correct state, please restart them by running the following cell.

In [2]:
!./restart.sh

Bringing containerized services down...
Services down.
Bringing containerized services back up...
Services back up.


---

## Serving our Embedding Model with Triton

You deployed an NVIDIA Triton Inference Server in Lesson 00. The following cell prints its logs.

In [3]:
!docker-compose logs triton

triton-1  |
triton-1  | == Triton Inference Server ==
triton-1  |
triton-1  | NVIDIA Release 22.01 (build 31237563)
triton-1  |
triton-1  | Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-1  |
triton-1  | Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
triton-1  |
triton-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-1  |
triton-1  | I0809 23:49:38.170078 7 metrics.cc:298] Collecting metrics for GPU 0: NVIDIA A100 80GB PCIe
triton-1  | I0809 23:49:38.418966 7 libtorch.cc:1227] TRITONBACKEND_Initialize: pytorch
triton-1  | I

NVIDIA Triton Inference Server is open-source software for fast and scalable AI inference on both GPU and CPU, and it's the standard way we run inference at NVIDIA. For more information, see the [Triton Inference Server readme on GitHub](https://github.com/triton-inference-server/server#documentation).

Triton:
- supports models and code in Python, C++, TensorFlow 1.x and 2.x, PyTorch, ONNX, TensorRT, RAPIDS FIL (for XGBoost, Scikit-learn Random Forest, and LightGBM), and OpenVINO.
- optimizes inference for multiple query types (real-time, batch, streaming) and also supports model ensembles.
- works with NVIDIA GPUs and with x86 and ARM CPUs.
- runs on scale-out cloud or data center, enterprise edge, and even on embedded devices like the NVIDIA Jetson, in both bare metal and virtualized environments (e.g. VMware vSphere), with dedicated NVIDIA Triton builds for running on Windows, Jetson, and ARM SBSA.

---

## Convert Embedding Model to TensorRT Engine

To embed our chunks in a vector database, we'll need an embedding model. As we mentioned in the previous lesson, we're using the `SentenceTransformers` framework with the `e5-large-unsupervised` embedding model. In order to further increase inference speed, we can convert the PyTorch model into a TensorRT engine file and then serve the TensorRT engine with Triton.

We have already performed this conversion for you, and the model is already available in your running `triton` service, but to do it yourself, please refer to `triton/README.md`.

---

## Health Check Triton

The following command checks that Triton is up and running. We expect an HTTP 200 status in the response.

In [4]:
!curl -v http://triton:8000/v2/health/ready

*   Trying 172.19.0.5:8000...
* Connected to triton (172.19.0.5) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: triton:8000
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host triton left intact
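
The same readiness check can be made from Python, which is convenient when a script should wait for Triton before sending inference requests. The sketch below uses only the standard library. The helper name `triton_is_ready` and its default timeout are assumptions made for illustration and are not part of the course code.

```python
import urllib.error
import urllib.request


def triton_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v2/health/ready answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False
```

In this environment the call would be `triton_is_ready("http://triton:8000")`; in your own deployment the hostname and port may differ.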


---

## Embed With Triton Function

Inside our `router` container's source code, there is a function that takes in text as strings, sends them to Triton Inference Server using the Triton Python client, and receives vectors of floating-point numbers in response. Here we look at that function and its supporting source code.

### Imports

Here we make the necessary imports for our function.

In [5]:
from typing import List
import time
import numpy as np
import tritonclient.http

### Variables

Here we define variables that will be used within our function.

Note: `triton_host` in this environment is `triton`, but in your own environment, depending on how you run the `triton` container, this hostname may very well be different.

In [6]:
triton_host = "triton"
triton_port = "8000"
triton_model_name = "transformer_tensorrt_inference"
triton_model_version = "1"

triton_url = f"{triton_host}:{triton_port}"

### Function Definition

In [7]:
def embed_with_triton(query: List[str]) -> List[List[float]]:
    """Embed a batch of strings with the Triton-served model (one vector per string)."""
    triton_client = tritonclient.http.InferenceServerClient(
        url=triton_url, verbose=False
    )

    # A single BYTES input named "TEXT" carries the whole batch of strings
    triton_batch_size = len(query)
    triton_text_input = tritonclient.http.InferInput(
        name="TEXT", shape=(triton_batch_size,), datatype="BYTES"
    )
    triton_text_input.set_data_from_numpy(np.asarray(query, dtype=object))
    triton_inputs = [triton_text_input]

    # Request the "output" tensor as JSON rather than as binary data
    triton_outputs = [
        tritonclient.http.InferRequestedOutput("output", binary_data=False)
    ]

    inference_results = triton_client.infer(
        model_name=triton_model_name,
        model_version=triton_model_version,
        inputs=triton_inputs,
        outputs=triton_outputs,
    )

    # Convert the NumPy result into plain nested Python lists
    embedded_query = inference_results.as_numpy("output").tolist()
    return embedded_query

---

## Try Embed With Triton Function

Note that when used in information retrieval, this embedding model works best when text queries start with the prefix "query: " and text documents start with the prefix "passage: ", according to [the model card on Hugging Face](https://huggingface.co/intfloat/e5-large-unsupervised#faq).

In [8]:
embedded_query = embed_with_triton(["query: deep learning"])
print(embedded_query)

[[-0.0250091552734375, -0.057159423828125, -0.01119232177734375, -0.016693115234375, 0.033477783203125, 0.0107879638671875, -0.037811279296875, -0.0306243896484375, 0.0199432373046875, -0.058837890625, 0.0272979736328125, 0.0238037109375, 0.0172119140625, 0.023040771484375, -0.0020465850830078125, 0.0271759033203125, -0.0022258758544921875, 0.01262664794921875, 0.019866943359375, 0.001186370849609375, 0.01520538330078125, 0.0262603759765625, 0.0290374755859375, 0.0178985595703125, 0.0020465850830078125, 0.045562744140625, -0.010589599609375, -0.00353240966796875, 0.00022268295288085938, 0.053924560546875, -0.00968170166015625, -0.036468505859375, 0.0286407470703125, 0.0264892578125, 0.0167388916015625, -0.01507568359375, 0.01428985595703125, -0.034881591796875, -0.048248291015625, -0.015899658203125, 0.027374267578125, -0.03411865234375, 0.028167724609375, 0.03265380859375, -0.009521484375, 0.0193328857421875, -0.006137847900390625, -0.038970947265625, -0.01499176025390625, -0.03198242

A list containing one string (in this case, a query) produces a list containing one 1024-dimensional vector of floating-point numbers.

In [9]:
print(len(embedded_query))
print(len(embedded_query[0]))

1
1024
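
Because the prefix matters for retrieval quality, it can help to add it in one place rather than by hand. The helper below is a small sketch of that idea; the name `add_e5_prefix` is hypothetical and does not appear in the `router` source code.

```python
from typing import List


def add_e5_prefix(texts: List[str], kind: str) -> List[str]:
    """Prepend the 'query: ' or 'passage: ' prefix described on the model card."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = f"{kind}: "
    # Leave strings that already carry the prefix untouched
    return [t if t.startswith(prefix) else prefix + t for t in texts]


print(add_e5_prefix(["deep learning"], "query"))
# ['query: deep learning']
```

The returned list could then be passed to `embed_with_triton` in the same way as the hand-written query above.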


---

## Redis

Now that we have our embedding model set up, let's look at [Redis](https://redis.io/), an open-source database (among other things).

We chose Redis as our database for a number of reasons.
1. Redis is extremely fast, and we need to minimize latency for the operations it will be performing.
2. Redis is well-supported and easy to deploy through a ready-to-go container.
3. Redis supports both vector and keyword search: vector search through the relatively recent [RedisVL](https://github.com/RedisVentures/redisvl) project, and a fairly robust suite of [search and query features](https://redis.io/docs/interact/search-and-query/) for more traditional keyword search. Notably, Redis supports BM25, the default algorithm behind the popular Elasticsearch system--making it easy to transition smoothly between the two systems.
4. Redis unifies our vector database with our document (and metadata) database, so we don't have to worry about maintaining keys in a separate index like [FAISS](https://faiss.ai/).
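
To give a feel for what BM25 rewards, here is a compact sketch of the textbook Okapi BM25 scoring function in plain Python. It is not Redis's implementation: the whitespace tokenization, the smoothed IDF variant, and the parameter values `k1=1.2` and `b=0.75` are common choices assumed here for illustration, and the sketch expects a non-empty list of documents.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, made up for this example
docs = [
    "the h200 gpu has more memory".split(),
    "mobile phone cameras use small pixels".split(),
    "gpu memory bandwidth matters".split(),
]
print(bm25_scores(["h200"], docs))
```

Only the first document contains the query term, so it is the only one with a non-zero score. Exact-term matching of this kind is what makes keyword search a good fit for product names.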

### Viewing the Redis Service

You launched the Redis service in Lesson 00.

In [10]:
!docker-compose logs redis

redis-1  | 9:C 09 Aug 2024 23:49:37.779 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis-1  | 9:C 09 Aug 2024 23:49:37.779 * Redis version=7.2.4, bits=64, commit=00000000, modified=0, pid=9, just started
redis-1  | 9:C 09 Aug 2024 23:49:37.779 * Configuration loaded
redis-1  | 9:M 09 Aug 2024 23:49:37.780 * monotonic clock: POSIX clock_gettime
redis-1  | 9:M 09 Aug 2024 23:49:37.780 * Running mode=standalone, port=6379.
redis-1  | 9:M 09 Aug 2024 23:49:37.780 * Module 'RedisCompat' loaded from /opt/redis-stack/lib/rediscompat.so
redis-1  | 9:M 09 Aug 2024 23:49:37.781 * <search> Redis version found by RedisSearch : 7.2.4 - oss
redis-1  | 9:M 09 Aug 2024 23:49:37.781 * <search> RediSearch version 2.8.13 (Git=2.8-c1535c6)
redis-1  | 9:M 09 Aug 2024 23:49:37.781 * <search> Low level api version 1 initialized successfully
redis-1  | 9:M 09 Aug 2024 23:49:37.781 * <search> concurrent writes: OFF, 

---

## Router Service

We are also going to use a `router` service whose job is to serve as an API entry point and route calls between the other three components: `chunking`, `triton`, and `redis`.

### Viewing the Router Service

You already launched the `router` service in Lesson 00.

In [11]:
!docker-compose logs router

router-1  | INFO:     Started server process [1]
router-1  | INFO:     Waiting for application startup.
router-1  | INFO:     Application startup complete.
router-1  | INFO:     Uvicorn running on http://0.0.0.0:5006 (Press CTRL+C to quit)


---

## Router Service API Docs

Like the `chunking` service, `router` is also a FastAPI web application, with automatic documentation generation. Inside the `router` application, we use LangChain, which makes it easy to experiment by swapping out components like LLMs and prompts.

The `router` service is available on port 5006. Execute the following cell to generate a link to open it in a new browser tab.

In [12]:
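%%js
// Use hostname rather than host so that a port in the notebook URL is not duplicated
var host = window.location.hostname;
var url = 'http://' + host + ':5006';
element.innerHTML = '<a style="color:green;" target="_blank" href="' + url + '">Click to open router service API docs.</a>';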

<IPython.core.display.Javascript object>

You can see that we have endpoints for searching, as well as endpoints for inserting, deleting, and dumping data.

---

## Loading the Redis Database

Before we can search, we need to fill our database with data!

As mentioned previously, we'll use the asynchronous `httpx` library to load our database.

In [13]:
# Get the asset types the router expects
import httpx 
import json

response = httpx.get("http://router:5006/asset-types")
asset_types_json = response.json()

In [14]:
print(json.dumps(asset_types_json, indent=2))

[
  {
    "id": "doc:assettypes:techblogs",
    "display_title": "TechBlog Posts",
    "chunking_params": "{\"strategy\": \"heading_section_sentence\", \"code_behavior\": \"remove_code_sections\", \"chunk_min_words\": 250, \"chunk_overlap_words\": 50}",
    "last_indexed": "2024-03-20T03:58:55",
    "display_default": true,
    "group": "Written Content",
    "display_sort_order": 1,
    "name": "techblogs",
    "group_sort_order": 2
  },
  {
    "id": "doc:assettypes:summarize_techblogs",
    "display_title": "TechBlog Posts Summaries",
    "chunking_params": "{\"strategy\": \"summarization\", \"code_behavior\": \"remove_code_sections\"}",
    "last_indexed": "2024-03-20T04:01:05",
    "display_default": true,
    "group": "Written Content",
    "display_sort_order": 2,
    "name": "summarize_techblogs",
    "group_sort_order": 2
  }
]


Let's first upload the chunks we got by breaking each article into sentence groups.

In [15]:
import os

In [16]:
data_dir = os.path.join(os.getcwd(), 'data', 'techblogs')
file_list = [x for x in sorted(os.listdir(data_dir)) if x.endswith(".json")]

In [17]:
payloads = []

for filename in file_list:
    with open(os.path.join(data_dir, filename), "r") as in_file:
        data = json.load(in_file)
    for item in data:
        # skip items that do not link to developer.nvidia.com/blog
        # (posts from https://blogs.nvidia.com are not included in this lesson)
        if not item["link"].startswith("https://developer.nvidia.com/blog"):
            continue
        document_title = item["title"]["rendered"]
        document_url = item["link"]
        document_html = item["content"]["rendered"]
        document_date = item["date_gmt"]
        document_date_modified = item["modified_gmt"]
        payloads.append(
            {
                "strategy": "heading_section_sentence",
                "code_behavior": "remove_code_sections",
                "chunk_min_words": 250,
                "chunk_overlap_words": 50,
                "input_type": "html",
                "input_str": document_html,
                "additional_metadata": {
                    "document_title": document_title,
                    "document_url": document_url,
                    "document_date": document_date,
                    "document_date_modified": document_date_modified,
                },
            }
        )

In [18]:
print(f"Total num payloads: {len(payloads)}")

Total num payloads: 150
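
The payload-building logic above can also be wrapped in a function, so the same code can serve different chunking strategies. The sketch below is one possible refactoring, not the course's own helper: the function name `build_chunking_payload` is hypothetical, while the field names are taken from the loop above.

```python
from typing import Any, Dict, Optional

BLOG_PREFIX = "https://developer.nvidia.com/blog"


def build_chunking_payload(item: Dict[str, Any], strategy: str,
                           **chunking_params: Any) -> Optional[Dict[str, Any]]:
    """Turn one exported blog item into a chunking payload, or None to skip it."""
    if not item["link"].startswith(BLOG_PREFIX):
        return None
    return {
        "strategy": strategy,
        "code_behavior": "remove_code_sections",
        **chunking_params,  # e.g. chunk_min_words, chunk_overlap_words
        "input_type": "html",
        "input_str": item["content"]["rendered"],
        "additional_metadata": {
            "document_title": item["title"]["rendered"],
            "document_url": item["link"],
            "document_date": item["date_gmt"],
            "document_date_modified": item["modified_gmt"],
        },
    }


# Made-up item with the same shape as the exported blog data
example_item = {
    "link": "https://developer.nvidia.com/blog/example-post/",
    "title": {"rendered": "Example Post"},
    "content": {"rendered": "<p>Hello</p>"},
    "date_gmt": "2024-01-01T00:00:00",
    "modified_gmt": "2024-01-02T00:00:00",
}
payload = build_chunking_payload(
    example_item, "heading_section_sentence",
    chunk_min_words=250, chunk_overlap_words=50,
)
print(payload["additional_metadata"]["document_url"])
```

Returning `None` for skipped items keeps the filtering rule next to the payload definition; the caller can then drop the `None` values with a list comprehension.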


In [19]:
payloads[0]

{'strategy': 'heading_section_sentence',
 'code_behavior': 'remove_code_sections',
 'chunk_min_words': 250,
 'chunk_overlap_words': 50,
 'input_type': 'html',
 'input_str': '<div style="margin-top: 0px; margin-bottom: 0px;" class="sharethis-inline-share-buttons" ></div>\n<p><a href="https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/">NVIDIA AI Workbench</a> is now in beta, bringing a wealth of new features to streamline how enterprise developers create, use, and share AI and machine learning (ML) projects. Announced at SIGGRAPH 2023, NVIDIA AI Workbench enables developers to create, collaborate, and migrate AI workloads on their GPU-enabled environment of choice. To learn more, see <a href="https://developer.nvidia.com/blog/develop-and-deploy-scalable-generative-ai-models-seamlessly-with-nvidia-ai-workbench/">Develop and Deploy Scalable Generative AI Models Seamlessly with NVIDIA AI Workbench</a>.</p>\n\n\n\n<p>This post explains how NVIDIA AI Workbench hel

In [20]:
import asyncio
import httpx

In [21]:
chunking_url = "http://chunking:5005/api/chunking"
existing_items_url = "http://router:5006/search/keyword"
delete_url = "http://router:5006/data/delete"
insert_url = "http://router:5006/data/insert"

In [22]:
# Initialize a semaphore object with a limit of 3.
limit = asyncio.Semaphore(3)
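
To see what the semaphore does, the following self-contained sketch runs ten dummy jobs and records how many were in flight at the same time. The function name `run_limited` and the use of `asyncio.sleep` as a stand-in for an HTTP request are illustrative assumptions.

```python
import asyncio


async def run_limited(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks dummy jobs and return the peak number running at once."""
    limit = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def job() -> None:
        nonlocal running, peak
        async with limit:  # at most max_concurrent jobs get past this line
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for an HTTP request
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak


# In a notebook, write `peak = await run_limited(10, 3)` instead,
# because Jupyter already runs an event loop.
peak = asyncio.run(run_limited(10, 3))
print(peak)
```

All ten jobs are scheduled at once, but the reported peak never exceeds the limit of 3. This is how the upload code below avoids overwhelming the `chunking` and `router` services.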

In [23]:
# chunk up an article
async def chunking_request(client: httpx.AsyncClient, payload: dict):
    chunking_resp = await client.post(chunking_url, json=payload, timeout=15)
    return chunking_resp.json()

In [24]:
# see if any chunks already exist in the db that match this document url
async def get_existing_items_request(client: httpx.AsyncClient, payload: dict, asset_type: str):
    existing_items_resp = await client.post(
        existing_items_url,
        json={
            "field": "document_url",
            "value": payload["additional_metadata"]["document_url"],
            "asset_types": [asset_type],
            "search_type": "exact",
            "k": 1000,  # some large number to ensure we don't hit default limit of 10
        },
        timeout=15,
    )
    return existing_items_resp.json()

In [25]:
# delete items with certain ids
async def delete_request(client: httpx.AsyncClient, results: list, asset_type: str):
    delete_resp = await client.post(
        delete_url,
        json={
            "asset_type": asset_type,
            "ids": [x["id"] for x in results],
        },
        timeout=15,
    )
    print(delete_resp.status_code)
    return delete_resp.json()

In [26]:
async def upload_techblogs_chunks(client: httpx.AsyncClient, payload: dict):
    async with limit:
        try:
            chunks = await chunking_request(client, payload)
        except Exception:  # retry once
            chunks = await chunking_request(client, payload)
        print(
            f"{payload['additional_metadata']['document_url']} | num chunks: {len(chunks)}"
        )

        # gets ids of existing items with this url
        try:
            existing_items = await get_existing_items_request(client, payload, "techblogs")
        except Exception:  # retry once
            existing_items = await get_existing_items_request(client, payload, "techblogs")

        if len(existing_items) > 0:
            results = existing_items[0]["results"]
            if len(results) > 0:
                # delete items that are associated with this url
                try:
                    deleted_items = await delete_request(client, results, "techblogs")
                except Exception:  # retry once
                    deleted_items = await delete_request(client, results, "techblogs")
                print(f"Deleted ids reponse: {deleted_items}")

        # insert: send chunks to redis
        resp = await client.post(
            insert_url,
            json={
                "asset_type": "techblogs",
                "chunks": chunks,
            },
            timeout=15,
        )
        print(f"Inserted {len(resp.json())} chunks")
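
The "retry once" pattern appears three times in the function above. One way to avoid the repetition is a small generic helper, sketched below. The name `with_retries` and its parameters are hypothetical; the helper is not part of the course code.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def with_retries(make_call: Callable[[], Awaitable[Any]],
                       attempts: int = 2, delay: float = 0.0) -> Any:
    """Await make_call(); if it raises, try again, up to `attempts` calls in total."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
```

With this helper, the first step of the upload could be written as `chunks = await with_retries(lambda: chunking_request(client, payload))`. The lambda matters: a coroutine object can only be awaited once, so each attempt needs a fresh call.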

In [27]:
async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        for payload in payloads:
            tasks.append(upload_techblogs_chunks(client, payload))

        await asyncio.gather(*tasks)

In [28]:
start = time.perf_counter()

# If this were not in Jupyter we would run this
# asyncio.run(main())

# Since we are in a notebook, Jupyter is already running its own event loop
# so we can just simply await main()
await main()

end = time.perf_counter()

print(f"Took {end - start} seconds")

# This should take around 2-3 minutes

https://developer.nvidia.com/blog/create-share-and-scale-enterprise-ai-workflows-with-nvidia-ai-workbench-now-in-beta/ | num chunks: 7
200
Deleted ids reponse: {'items_deleted': 7}
Inserted 7 chunks
https://developer.nvidia.com/blog/modernizing-the-data-center-with-accelerated-networking/ | num chunks: 4
200
Deleted ids reponse: {'items_deleted': 4}
Inserted 4 chunks
https://developer.nvidia.com/blog/emulating-the-attention-mechanism-in-transformer-models-with-a-fully-convolutional-network/ | num chunks: 10
200
Deleted ids reponse: {'items_deleted': 10}
Inserted 10 chunks
https://developer.nvidia.com/blog/announcing-metropolis-microservices-on-nvidia-jetson-orin-for-rapid-edge-ai-development/ | num chunks: 4
200
Deleted ids reponse: {'items_deleted': 4}
Inserted 4 chunks
https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ | num chunks: 5
200
Deleted ids reponse: {'items_deleted': 5}
Inserted 5 chunks
https://developer.nvidia.com/blog/build-enterprise-gr

In [29]:
techblogs_assettype = None

for assettype in asset_types_json:
    if assettype["name"] == "techblogs":
        techblogs_assettype = assettype

In [30]:
print(json.dumps(techblogs_assettype, indent=2))

{
  "id": "doc:assettypes:techblogs",
  "display_title": "TechBlog Posts",
  "chunking_params": "{\"strategy\": \"heading_section_sentence\", \"code_behavior\": \"remove_code_sections\", \"chunk_min_words\": 250, \"chunk_overlap_words\": 50}",
  "last_indexed": "2024-03-20T03:58:55",
  "display_default": true,
  "group": "Written Content",
  "display_sort_order": 1,
  "name": "techblogs",
  "group_sort_order": 2
}


We want to update the `assettypes` Redis index with metadata describing how the `techblogs` index was created.

In [31]:
techblogs_assettype["chunking_params"] = json.dumps(
    {
        "strategy": "heading_section_sentence",
        "code_behavior": "remove_code_sections",
        "chunk_min_words": 250,
        "chunk_overlap_words": 50,
    }
)

In [32]:
print(json.dumps(techblogs_assettype, indent=2))

{
  "id": "doc:assettypes:techblogs",
  "display_title": "TechBlog Posts",
  "chunking_params": "{\"strategy\": \"heading_section_sentence\", \"code_behavior\": \"remove_code_sections\", \"chunk_min_words\": 250, \"chunk_overlap_words\": 50}",
  "last_indexed": "2024-03-20T03:58:55",
  "display_default": true,
  "group": "Written Content",
  "display_sort_order": 1,
  "name": "techblogs",
  "group_sort_order": 2
}


In [33]:
update_asset_types_url = "http://router:5006/asset-types/update"
response = httpx.post(update_asset_types_url, json={"data": techblogs_assettype})
print(json.dumps(response.json(), indent=2))

{
  "display_title": "TechBlog Posts",
  "chunking_params": "{\"strategy\": \"heading_section_sentence\", \"code_behavior\": \"remove_code_sections\", \"chunk_min_words\": 250, \"chunk_overlap_words\": 50}",
  "last_indexed": "2024-08-09T23:52:23",
  "display_default": 1,
  "group": "Written Content",
  "display_sort_order": 1,
  "name": "techblogs",
  "group_sort_order": 2
}


Now we can save our database by hitting the `/data/dump` router endpoint.

In [34]:
dump_response = httpx.post("http://router:5006/data/dump")

In [35]:
print(json.dumps(dump_response.json(), indent=2))

{
  "success": true
}


We can confirm that the Redis database was saved by comparing the time of the last successful save with the current time.

In [36]:
import redis 

r = redis.Redis(host='redis', port=6379)
r.lastsave()

datetime.datetime(2024, 8, 9, 23, 51, 27)

In [37]:
from datetime import datetime
datetime.now()

datetime.datetime(2024, 8, 9, 23, 52, 24, 172130)
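
Rather than comparing the two timestamps by eye, the comparison can be made explicit. The sketch below assumes both values are naive datetimes from the same clock and time zone; the helper name `saved_recently` and the five-minute threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def saved_recently(last_save: datetime, now: datetime,
                   max_age: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the last snapshot is no older than max_age."""
    return timedelta(0) <= now - last_save <= max_age


# Values copied from the two cell outputs above
last_save = datetime(2024, 8, 9, 23, 51, 27)
now = datetime(2024, 8, 9, 23, 52, 24, 172130)
print(now - last_save)                 # 0:00:57.172130
print(saved_recently(last_save, now))  # True
```

With a live connection, the same check would read `saved_recently(r.lastsave(), datetime.now())`.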

In [38]:
# if we wanted to drop an index

# from langchain.vectorstores.redis import Redis

# Redis.drop_index(
#     index_name="assettypes", delete_documents=True, redis_url="redis://localhost:6379"
# )

Now we can head back to our API docs and confirm that our data is available. Try the semantic search examples: use the dropdown to select the second example (semantic search), and set `k` and `asset_types` to non-default values.

Next, try the keyword search example, but modify the query to `cell phone`. The literal phrase `cell phone` doesn't return any articles, but the phrase `mobile phone` does. In cases where the concept is more important than the exact words, semantic search can help.

If we go back to the semantic search and modify the third example to try `cell phone`, we do get relevant results. The second result should be good. The first result talks about pixels and cameras, both of which are associated with cell phones, directly and through the Google Pixel phone. If you see many cases like this, where a word's multiple meanings throw off your domain-specific search, you can fine-tune the embedding model to prefer domain-related matches over general ones.

Conversely, let's try a search for a specific product name that has no general-language meaning, like `H200`. Here is where a keyword search makes more sense, because we want the exact product name and not the meaning of "H" and "200" (try in particular wildcard search: `*H200`).
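
The contrast between the two search styles can be reproduced in miniature. In the sketch below, keyword search is modeled as a literal phrase match and semantic search as cosine similarity between vectors. The three-dimensional vectors are made up for illustration and are not real embeddings, and cosine similarity is assumed here as the distance measure.

```python
import math
from typing import Dict, List


def phrase_match(query: str, text: str) -> bool:
    """Literal keyword search: True only if the exact phrase occurs."""
    return query.lower() in text.lower()


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors: Dict[str, List[float]] = {
    "cell phone": [0.9, 0.1, 0.0],
    "mobile phone": [0.8, 0.2, 0.1],
    "H200": [0.0, 0.1, 0.9],
}

doc = "Optimizing inference on a mobile phone"
print(phrase_match("cell phone", doc))    # False
print(phrase_match("mobile phone", doc))  # True

near = cosine(toy_vectors["cell phone"], toy_vectors["mobile phone"])
far = cosine(toy_vectors["cell phone"], toy_vectors["H200"])
print(near > far)                         # True
```

The literal match misses the synonym, while the vector comparison places `cell phone` close to `mobile phone` and far from `H200`. For an exact product name, the literal match is the behavior we want.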

---

## Programmatic Search

Now instead of using the docs, let's use Python and httpx to hit our `/search/semantic` endpoint. In this environment, the `router` service is available at the hostname `router`.

In [39]:
search_endpoint = "http://router:5006/search/semantic"

response = httpx.post(
    search_endpoint, json={"query": "cgroups", "k": 3, "asset_types": ["techblogs"]}
)
response.json()

[{'asset_type': 'techblogs',
  'display_title': 'TechBlog Posts',
  'results': [{'id': 'doc:techblogs:af9ca8617af14615992e18c76c822cb8',
    'text': 'Improving CUDA Initialization Times Using cgroups in Certain Scenarios\nMany CUDA applications running on multi-GPU platforms usually use a single GPU for their compute needs. In such scenarios, a performance penalty is paid by applications because CUDA has to enumerate/initialize all the GPUs on the system. If a CUDA application does not require other GPUs to be visible and accessible, you can launch such applications by isolating the unwanted GPUs from the CUDA process and eliminating unnecessary initialization steps. This post discusses the various methods to accomplish this and their performance benefits. \nGPU isolation\nGPU isolation can be achieved on Linux systems by using Linux tools like ```cgroups```. In this section, we first discuss a lower-level approach and then a higher-level possible approach. Another method exposed by CU

Notice we have some valuable metadata from our chunking service.

- `text_components` contains a list of the sentences that were found in this section.
- `contains_code` holds one boolean per sentence, indicating whether that sentence contains code.
- `only_code` holds one boolean per sentence, indicating whether the entire sentence consists of code.

Because we specified `remove_code_sections` as the `code_behavior`, the `text` attribute of each item in `results` is essentially a concatenation of the sentences that were not entirely made up of code (i.e., `only_code == False`). You will still see a small amount of code in the text (i.e., `contains_code == True and only_code == False`), but this is usually a single word within a sentence of natural language.

The advantage of this approach is that we can still extract the `only_code` sections, because they remain available through the metadata.

In [40]:
result1 = response.json()[0]['results'][1]
result1

{'id': 'doc:techblogs:34c34a6d8bf34dcf9d8c3086017b67e0',
 'text': 'Isolating GPUs using cgroups V1\nMultiple processes can be added to the ```tasks``` file to propagate the same controls to more than one process. \nIsolating GPUs using the bubblewrap utility\nThe bubblewrap utility (bwrap) is a higher-level utility available for sandboxing and access control in Linux operating systems, which can be used to achieve the same effect as the solution presented earlier. You can use this to conveniently restrict or allow access to specific GPUs from a CUDA process: More than one GPU can be exposed to a CUDA process by extending the ```dev-bind``` option in the code example. \nPerformance benefits of GPU isolation\nIn this section, we compare the performance of the CUDA driver initialization API (cuInit) with and without GPU isolation, measured over 256 iterations. The APIs are being run on an x86-based machine with four A100 class GPUs. Bar graph shows the performance of cuInit API running on

First, because of constraints on how data structures can be stored in Redis, these list-valued fields come back as JSON strings, so we need to parse them into lists.

In [41]:
heading_section_index = json.loads(result1["heading_section_index"])
heading_section_title = json.loads(result1["heading_section_title"])
paragraph_index = json.loads(result1["paragraph_index"])
contains_code = json.loads(result1["contains_code"])
only_code = json.loads(result1["only_code"])
text_components = json.loads(result1["text_components"])

In [42]:
assert len(heading_section_index) == len(heading_section_title) == len(paragraph_index) == len(contains_code) == len(only_code) == len(text_components)
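
Since the six lists are parallel, an alternative is to decode them together and work with one record per sentence. The helper below is a sketch of that idea. The function name `to_sentence_records` is hypothetical, the field names come from the cell above, and the sample `result` is made up rather than taken from the database.

```python
import json
from typing import Any, Dict, List

LIST_FIELDS = [
    "heading_section_index",
    "heading_section_title",
    "paragraph_index",
    "contains_code",
    "only_code",
    "text_components",
]


def to_sentence_records(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the JSON-encoded list fields and return one dict per sentence."""
    columns = {name: json.loads(result[name]) for name in LIST_FIELDS}
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError(f"list fields differ in length: {sorted(lengths)}")
    n = lengths.pop()
    return [{name: columns[name][i] for name in LIST_FIELDS} for i in range(n)]


# Made-up search result with two sentences, the second one pure code
sample_result = {
    "heading_section_index": "[0, 0]",
    "heading_section_title": '["Intro", "Intro"]',
    "paragraph_index": "[0, 1]",
    "contains_code": "[false, true]",
    "only_code": "[false, true]",
    "text_components": '["Install the tool.", "pip install tool"]',
}
records = to_sentence_records(sample_result)
print([r["text_components"] for r in records if r["only_code"]])
# ['pip install tool']
```

Filtering the records by `only_code` or `contains_code` then replaces index bookkeeping across six separate lists.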

Next, let's get all the text, including both natural language and code sections, and join it together as it appeared in the article.

In [43]:
text = ""
last_hsi = None

for i in range(len(text_components)):
    if last_hsi is None or last_hsi != heading_section_index[i]:
        text += heading_section_title[i] + "\n"
    text += text_components[i]
    if only_code[i]:
        text += "\n"
    else:
        text += " "
    # look ahead
    if i < len(text_components) - 1:
        if paragraph_index[i] != paragraph_index[i+1]:
            text += "\n"
    
    last_hsi = heading_section_index[i]

print(text.strip())

Isolating GPUs using cgroups V1
Multiple processes can be added to the ```tasks``` file to propagate the same controls to more than one process. 
Isolating GPUs using the bubblewrap utility
The bubblewrap utility (bwrap) is a higher-level utility available for sandboxing and access control in Linux operating systems, which can be used to achieve the same effect as the solution presented earlier. You can use this to conveniently restrict or allow access to specific GPUs from a CUDA process: 
```
# install bubblewrap utility on Debian-like systems
$>sudo apt-get install -y bubblewrap
# create a simple shell script that uses bubblewap for binding the required GPU to the launched process
#!/bin/sh
# bwrap.sh
GPU=$1;shift   # 0, 1, 2, 3, ..
if [ "$GPU" = "" ]; then echo "missing arg: gpu id"; exit 1; fi
bwrap \
        --bind / / \
        --dev /dev --dev-bind /dev/nvidiactl /dev/nvidiactl --dev-bind /dev/nvidia-uvm /dev/nvidia-uvm  \
        --dev-bind /dev/nvidia$GPU /dev/nvidia$GPU \
  

Or, we can extract exclusively the code and ignore the natural language.

In [44]:
text = ""

for i in range(len(text_components)):    
    if only_code[i]:
        text += text_components[i]
        text += "\n"

print(text.strip())

```
# install bubblewrap utility on Debian-like systems
$>sudo apt-get install -y bubblewrap
# create a simple shell script that uses bubblewap for binding the required GPU to the launched process
#!/bin/sh
# bwrap.sh
GPU=$1;shift   # 0, 1, 2, 3, ..
if [ "$GPU" = "" ]; then echo "missing arg: gpu id"; exit 1; fi
bwrap \
        --bind / / \
        --dev /dev --dev-bind /dev/nvidiactl /dev/nvidiactl --dev-bind /dev/nvidia-uvm /dev/nvidia-uvm  \
        --dev-bind /dev/nvidia$GPU /dev/nvidia$GPU \
        "$@"
# Launch the CUDA process with the bubblewrap utility to only allow access to a specific GPU while running
$> ./bwrap.sh 0 ./test_cuda_app <args>
```


---

## Storing TechBlog Summaries in Redis

Let's take our code from the previous lesson and use it to generate summaries for each article in our batch asynchronously.

We will use an LLM to generate the summaries, so let's begin by instantiating an LLM instance to work with. Here we import a `ChatOpenAI` instance of our local NIM Mixtral 8x7B model, configured and ready for use with LangChain, from an [`llms` helper file](llms.py).

In [45]:
from llms import llms

In [46]:
llm = llms.nim_mixtral_llm

### Optional Remote LLMs

Optionally, instead of using our local model, you can also use either NVIDIA AI Foundation's Mixtral 8x7B model or OpenAI's gpt-3.5-turbo.

Either option requires an API key. For more details about NVIDIA AI Foundation and obtaining a free API key, see [the notebook *NVIDIA AI Foundation.ipynb*](./NVIDIA%20AI%20Foundation.ipynb).

After obtaining an appropriate API key, uncomment the appropriate cell below, add your API key, and run the cell to set `llm` to the remote LLM you chose to work with.

#### NVIDIA AI Foundation Mixtral 8x7B

In [47]:
# from llms import set_api_key
# set_api_key('NVIDIA_API_KEY', '<your_nvidia_api_key>')
# llm = llms.nvai_mixtral_llm

#### OpenAI gpt-3.5-turbo

In [48]:
# from llms import set_api_key
# set_api_key('OPENAI_API_KEY', '<your_openai_api_key>')
# llm = llms.openai_gpt3_llm

### Construct Tech Blog Payloads

Next, we'll construct a payload for each tech blog containing the blog's HTML, various metadata fields, and chunking guidelines.

In [49]:
payloads = []

file_list = [x for x in sorted(os.listdir(data_dir)) if x.endswith('.json')]

for filename in file_list:
    with open(os.path.join(data_dir, filename), 'r') as in_file:
        data = json.load(in_file)
        
    for item in data:
        
        # skip items that do not link to developer.nvidia.com/blog
        # (to also keep blogs.nvidia.com posts, extend the condition with:
        #  and not item['link'].startswith("https://blogs.nvidia.com"))
        if not item['link'].startswith("https://developer.nvidia.com/blog"):
            # print(f"Skipping URL {item['link']}")
            continue
            
        document_title = item['title']['rendered']
        document_url = item['link']
        document_html = item['content']['rendered']
        document_date = item['date_gmt']
        document_date_modified = item['modified_gmt']
        
        payload = {
            "strategy": "heading_section",
            "code_behavior": "remove_code_sections",
            "input_type": "html",
            "input_str": document_html,
            "additional_metadata": {
                "document_title": document_title,
                "document_url": document_url,
                "document_date": document_date,
                "document_date_modified": document_date_modified,
            }
        }
        
        payloads.append(payload)

In [50]:
print(f"Total num payloads: {len(payloads)}")

Total num payloads: 150


In [51]:
payloads[0]

{'strategy': 'heading_section',
 'code_behavior': 'remove_code_sections',
 'input_type': 'html',
 'input_str': '<div style="margin-top: 0px; margin-bottom: 0px;" class="sharethis-inline-share-buttons" ></div>\n<p><a href="https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/">NVIDIA AI Workbench</a> is now in beta, bringing a wealth of new features to streamline how enterprise developers create, use, and share AI and machine learning (ML) projects. Announced at SIGGRAPH 2023, NVIDIA AI Workbench enables developers to create, collaborate, and migrate AI workloads on their GPU-enabled environment of choice. To learn more, see <a href="https://developer.nvidia.com/blog/develop-and-deploy-scalable-generative-ai-models-seamlessly-with-nvidia-ai-workbench/">Develop and Deploy Scalable Generative AI Models Seamlessly with NVIDIA AI Workbench</a>.</p>\n\n\n\n<p>This post explains how NVIDIA AI Workbench helps streamline the AI workflow and details new features of the 

---

In [52]:
import asyncio
import httpx
import json
from langchain_core.prompts import ChatPromptTemplate

In [53]:
summaries = [None] * len(payloads)

In [54]:
# load the summaries from the json file
with open("data/techblogs_summaries/saved.json", "r") as f:
    saved_summaries = json.load(f)

In [55]:
# Initialize a semaphore so that at most 3 articles are processed concurrently.
limit = asyncio.Semaphore(3)
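
To see what the semaphore buys us, here is a minimal, self-contained sketch. The names `worker`, `run_demo`, and `state` are hypothetical and exist only for this illustration, and `asyncio.sleep` stands in for the real chunking and LLM calls. It launches ten tasks at once and records how many were ever inside the guarded block at the same time.

```python
import asyncio


async def worker(limit: asyncio.Semaphore, state: dict) -> None:
    # At most `max_concurrent` workers can be inside this block at once;
    # the rest wait here until a slot is released.
    async with limit:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for an HTTP or LLM call
        state["active"] -= 1


async def run_demo(num_tasks: int = 10, max_concurrent: int = 3) -> int:
    limit = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    # All tasks are scheduled immediately, but the semaphore throttles them.
    await asyncio.gather(*(worker(limit, state) for _ in range(num_tasks)))
    return state["peak"]


# In a script we start the event loop ourselves.
# Inside Jupyter, use `peak = await run_demo()` instead.
peak = asyncio.run(run_demo())
print(f"Peak concurrent workers: {peak}")
```

Without the `async with limit:` line, all ten workers would run at once, which is what we want to avoid when each one sends requests to a shared LLM endpoint.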

In [56]:
async def async_generate(llm, msg):
    resp = await llm.agenerate([msg])
    return resp.generations[0][0].text

In [57]:
# Uncomment the following line if we want to save new summaries.
# saved_summaries = {}

async def upload_techblogs_summaries(llm, client: httpx.AsyncClient, payload: dict):
    async with limit:

        try:
            chunks = await chunking_request(client, payload)
        except Exception:  # retry once (a bare except would also swallow task cancellation)
            chunks = await chunking_request(client, payload)
        print(
            f"{payload['additional_metadata']['document_url']} | num chunks: {len(chunks)}"
        )

        clean_text_no_code = "\n".join([x["text"] for x in chunks])
        clean_text_with_code = "\n".join([ x["heading_section_title"][0] + "\n" + "\n".join(x["text_components"]) for x in chunks])
        
        # Ask LLM for summaries

        # uncomment if we want to save new summaries
        # template = ChatPromptTemplate.from_messages(
        #     [("user", "Summarize the following article in 200 words or less:\n{user_input}")]
        # )

        # msg = template.format_messages(
        #     user_input=clean_text_no_code
        # )

        # summary = await async_generate(llm, msg)
        # summary_with_metadata = [
        #     {
        #         "text": payload["additional_metadata"]["document_title"] + "\n" + summary,
        #         "text_components": [ x["heading_section_title"][0] + "\n" + "\n".join(x["text_components"]) for x in chunks],
        #         "document_title": payload["additional_metadata"]["document_title"],
        #         "document_url": payload["additional_metadata"]["document_url"],
        #         "document_date": payload["additional_metadata"]["document_date"],
        #         "document_date_modified": payload["additional_metadata"]["document_date_modified"],
        #         "document_full_text": clean_text_with_code
        #     }
        # ]
        # saved_summaries[payload["additional_metadata"]["document_url"]] = summary_with_metadata

        # load summary we've already generated
        # comment the following line if we want to save new summaries
        summary_with_metadata = saved_summaries[payload["additional_metadata"]["document_url"]]

        # gets ids of existing items with this url
        try:
            existing_items = await get_existing_items_request(client, payload, "summarize_techblogs")
        except Exception:  # retry once
            existing_items = await get_existing_items_request(client, payload, "summarize_techblogs")

        if len(existing_items) > 0:
            results = existing_items[0]["results"]
            if len(results) > 0:
                # delete items that are associated with this url
                try:
                    deleted_items = await delete_request(client, results, "summarize_techblogs")
                except Exception:  # retry once
                    deleted_items = await delete_request(client, results, "summarize_techblogs")
                print(f"Deleted ids reponse: {deleted_items}")

        # insert: send chunks to redis
        resp = await client.post(
            insert_url,
            json={
                "asset_type": "summarize_techblogs",
                "chunks": summary_with_metadata,
            },
            timeout=15,
        )
        print(f"Inserted {len(resp.json())} chunks")
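
The function above repeats the same "try, then retry once" pattern three times. One possible way to factor it out is sketched below. `with_retry` and `flaky` are hypothetical names introduced for this illustration and are not part of the course helpers. The `flaky` coroutine fails on its first call only, simulating a transient network error.

```python
import asyncio


async def with_retry(make_call, retries: int = 1, delay: float = 0.0):
    """Await make_call(); if it raises, try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return await make_call()
        except Exception:
            # Out of attempts: let the caller see the last error.
            if attempt == retries:
                raise
            await asyncio.sleep(delay)


calls = {"n": 0}


async def flaky() -> str:
    # Fails on the first call only, to simulate a transient error.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return "ok"


# Inside Jupyter, use `result = await with_retry(flaky)` instead.
result = asyncio.run(with_retry(flaky))
print(result, "after", calls["n"], "calls")
```

With a helper like this, a call such as `chunks = await with_retry(lambda: chunking_request(client, payload))` would replace each `try`/`except` pair. Passing a callable rather than a coroutine object matters here, because a coroutine object can only be awaited once.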

In [58]:
async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # for i in range(0, 7):
        for i in range(0, len(payloads)):
            tasks.append(upload_techblogs_summaries(llm, client, payloads[i]))

        await asyncio.gather(*tasks)
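
By default, `asyncio.gather` raises the first exception it encounters to the caller, so a single failing article would surface as an error from `main()` even though the other uploads keep running. If we would rather collect failures and inspect them afterwards, `return_exceptions=True` is one option. The sketch below uses hypothetical stand-ins (`process`, `run_all`) in place of the real upload coroutine.

```python
import asyncio


async def process(i: int) -> int:
    # Stand-in for an upload; item 2 fails to simulate a bad article.
    if i == 2:
        raise ValueError(f"could not process item {i}")
    return i * 10


async def run_all(n: int):
    # Exceptions are returned in place of results instead of being raised.
    results = await asyncio.gather(
        *(process(i) for i in range(n)), return_exceptions=True
    )
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [(i, r) for i, r in enumerate(results) if isinstance(r, Exception)]
    return successes, failures


# Inside Jupyter, use `successes, failures = await run_all(4)` instead.
successes, failures = asyncio.run(run_all(4))
print(f"{len(successes)} succeeded, {len(failures)} failed")
for index, error in failures:
    print(f"  item {index}: {error!r}")
```

Results come back in the same order as the awaitables passed in, so the index of each failure maps directly back to the corresponding entry in `payloads`.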

In [None]:
start = time.perf_counter()

# If this were not in Jupyter we would run this
# asyncio.run(main())

# Since we are in a notebook, Jupyter is already running its own event loop
# so we can just simply await main()
await main()

end = time.perf_counter()

print(f"Took {end - start} seconds")

# This should take around 2-3 minutes

https://developer.nvidia.com/blog/create-share-and-scale-enterprise-ai-workflows-with-nvidia-ai-workbench-now-in-beta/ | num chunks: 9
200
Deleted ids reponse: {'items_deleted': 1}
Inserted 1 chunks
https://developer.nvidia.com/blog/emulating-the-attention-mechanism-in-transformer-models-with-a-fully-convolutional-network/ | num chunks: 6
200
Deleted ids reponse: {'items_deleted': 1}
Inserted 1 chunks
https://developer.nvidia.com/blog/modernizing-the-data-center-with-accelerated-networking/ | num chunks: 8
200
Deleted ids reponse: {'items_deleted': 1}
Inserted 1 chunks
https://developer.nvidia.com/blog/announcing-metropolis-microservices-on-nvidia-jetson-orin-for-rapid-edge-ai-development/ | num chunks: 10
200
Deleted ids reponse: {'items_deleted': 1}
Inserted 1 chunks
https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ | num chunks: 9
200
Deleted ids reponse: {'items_deleted': 1}
Inserted 1 chunks
https://developer.nvidia.com/blog/build-enterprise-grad

In [None]:
# save the summaries as a json file
with open("data/techblogs_summaries/saved.json", "w") as f:
    json.dump(saved_summaries, f)

In [None]:
len(saved_summaries)

In [None]:
saved_summaries[payloads[0]['additional_metadata']['document_url']]

In [None]:
techblogs_summaries_assettype = None

for assettype in asset_types_json:
    if assettype["name"] == "summarize_techblogs":
        techblogs_summaries_assettype = assettype

print(json.dumps(techblogs_summaries_assettype, indent=2))

techblogs_summaries_assettype["chunking_params"] = json.dumps(
    {
        "strategy": "summarization",
        "code_behavior": "remove_code_sections",
    }
)

print(json.dumps(techblogs_summaries_assettype, indent=2))


update_asset_types_url = "http://router:5006/asset-types/update"
response = httpx.post(update_asset_types_url, json={"data": techblogs_summaries_assettype})
print(json.dumps(response.json(), indent=2))

dump_response = httpx.post("http://router:5006/data/dump")
print(json.dumps(dump_response.json(), indent=2))

---

## Recap

We now have two indexes in Redis: `techblogs`, which contains chunks of roughly 250 words, and `summarize_techblogs`, which contains the LLM-generated summaries.

In the next notebook, we'll look at how we can evaluate the search results from these indexes.

---

## Next Lesson

Move to the next lesson by double-clicking *Lesson 03.ipynb* in the file viewer on the left-hand side of your JupyterLab environment.