**Name:** Bhaskar Boruah

**Email:** boruah.bhaskar@gmail.com

**Problem Statement:** Problem-01: AI-Assisted Learning for NVIDIA SDKs and Toolkits

The objective is to develop an AI-powered Language Model (LLM) that assists users in understanding and effectively using NVIDIA SDKs and toolkits. The envisioned platform will serve as an interactive and user-friendly hub, providing comprehensive information, examples, and guidance on NVIDIA's SDKs and toolkits. By leveraging the power of language models and NVIDIA's cutting-edge technologies, the aim is to simplify the learning curve for developers and empower them to utilize NVIDIA's technologies more efficiently.



## Introduction

In this notebook, we explore how to use the open-source **Llama-2-7b-chat** model with **Hugging Face transformers** and the **LangChain** framework, backed by a **FAISS** vector store. The vector store is built from documents fetched from the NVIDIA documentation website, related articles and tutorials, and community forums and Q&A platforms such as the NVIDIA Developer Forums. The goal is an LLM-based assistant that helps users understand and use NVIDIA SDKs and toolkits effectively.


### Stages

1.   Prepare the Data
2.   Query for Relevant Data
3.   Craft the Response



## Installing All Required Libraries

**Note:** To avoid slow execution, change the runtime type in Google Colab as follows:

**Runtime > Change runtime type > Hardware accelerator > T4 GPU**

In [59]:
!pip install -qU transformers accelerate einops langchain xformers bitsandbytes faiss-gpu sentence_transformers

## Dataset

The dataset is composed of the following sources to develop an AI-powered language model (LLM) to assist users in understanding and effectively using various NVIDIA SDKs and toolkits:

**NVIDIA Documentation:** This documentation contains comprehensive information, usage guidelines, examples, and troubleshooting details for each SDK and toolkit. It covers topics such as installation, API references, sample code, and best practices. Link: https://docs.nvidia.com/


**Internet Articles and Tutorials:** Alongside the official documentation, incorporating relevant articles and tutorials from the internet can enhance the dataset. Blog posts, tutorials, and guides authored by developers and technology enthusiasts provide real-world insights, use cases, and tips for effectively utilizing NVIDIA SDKs and toolkits.


**Community Forums and Q&A Platforms:** Community forums like NVIDIA Developer Forums, question-and-answer platforms such as Stack Overflow, and discussions on GitHub can serve as valuable sources of information. These platforms host discussions, provide solutions to common issues, and address user queries related to NVIDIA SDKs and toolkits. Link: https://forums.developer.nvidia.com

## Data Preparation

**Ingesting Data using Document Loader:**

We define a function, **get_all_links**, which sends a request to the specified URL, parses the HTML content using BeautifulSoup, and returns all the URLs found in the anchor (`<a>`) tags on the page. Using these links, we ingest data with the WebBaseLoader document loader, which collects data by scraping webpages.


In [60]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_all_links(base_url):

    # Send a request to the website
    response = requests.get(base_url)

    # List to store all the web links
    web_link_list = []

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all the anchor tags (links) in the HTML
        links = soup.find_all('a')

        # Extract the URLs and collect them in the list
        for link in links:
            href = link.get('href')
            if href:
                # Make the URL absolute using urljoin
                absolute_url = urljoin(base_url, href)

                web_link_list.append(absolute_url)
                #print(absolute_url)
    else:
        web_link_list.append(base_url)
        #print(f"Failed to retrieve content. Status code: {response.status_code}")

    return web_link_list

Next, we ingest the data with the WebBaseLoader document loader. In this case, we collect data from the NVIDIA documentation website, community forums and Q&A platforms, and internet articles.

In [61]:
# Get all the links from NVIDIA Documentation site
doc_nvidia_base_url = 'https://docs.nvidia.com/'
doc_nvidia_base_links= get_all_links(doc_nvidia_base_url)

# Get all the links for NVIDIA Community Forums and Q&A platforms
nvidia_devp_forum_url = 'https://forums.developer.nvidia.com/KB' #'https://forums.developer.nvidia.com'
nvidia_devp_forum_links = get_all_links(nvidia_devp_forum_url)

## Appending Internet Articles and Tutorials
internet_article_links = ['https://linuxconfig.org/how-to-install-cuda-on-ubuntu-20-04-focal-fossa-linux',
'https://www.techrxiv.org/doi/full/10.36227/techrxiv.24207888.v1',
'https://gist.github.com/denguir/b21aa66ae7fb1089655dd9de8351a202',
'https://medium.com/@mertguvencli/how-to-setup-nvidia-driver-cuda-toolkit-and-cudnn-in-ubuntu-20-4-ac5efedb4427',
'https://askubuntu.com/questions/1352541/how-do-i-install-cuda-for-ubuntu-studio-21-04'
]

web_links = doc_nvidia_base_links + nvidia_devp_forum_links + internet_article_links

print(len(web_links))
print(web_links)


164
['https://developer.nvidia.com/', 'https://developer.nvidia.com/blog/', 'https://forums.developer.nvidia.com/', 'https://docs.nvidia.com/login', 'https://docs.nvidia.com/', 'https://developer.nvidia.com/', 'https://developer.nvidia.com/blog/', 'https://forums.developer.nvidia.com/', 'https://docs.nvidia.com/login', 'https://docs.nvidia.com/#featured', 'https://docs.nvidia.com/#products', 'https://docs.nvidia.com/#all-documents', 'https://docs.omniverse.nvidia.com/', 'https://docs.omniverse.nvidia.com/', 'https://docs.nvidia.com/launchpad/index.html', 'https://docs.nvidia.com/launchpad/index.html', 'https://docs.nvidia.com/cuda/doc/index.html', 'https://docs.nvidia.com/cuda/doc/index.html', 'https://docs.nvidia.com/ai-enterprise/index.html', 'https://docs.nvidia.com/ai-enterprise/index.html', 'https://academy.nvidia.com/en/nvidia-certified-associate-data-center/', 'https://academy.nvidia.com/en/nvidia-certified-associate-data-center/', 'https://academy.nvidia.com/en/nvidia-certified
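The printed list contains repeated URLs and fragment-only variants such as `#featured`, so the same page may be scraped more than once. A minimal sketch of one way to clean the list before loading is shown here; the helper name `dedupe_links` and the sample list are hypothetical and not part of the original notebook.

```python
from urllib.parse import urldefrag

def dedupe_links(links):
    """Return links with fragments removed and duplicates dropped, preserving order."""
    seen = set()
    unique = []
    for link in links:
        # urldefrag splits 'https://host/page#section' into the URL and its fragment
        url, _fragment = urldefrag(link)
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

# Hypothetical sample resembling the output above
sample_links = [
    'https://docs.nvidia.com/',
    'https://docs.nvidia.com/#featured',
    'https://docs.nvidia.com/login',
    'https://docs.nvidia.com/login',
]
print(dedupe_links(sample_links))
```

Applied to `web_links`, this would shrink the list handed to the loader; whether dropping fragments is appropriate depends on the sites being scraped.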

In [62]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_links)
documents = loader.load()

In [63]:
# view one document

documents[2]

Document(page_content='\n\n\n\nNVIDIA Developer Forums - NVIDIA Developer Forums\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    NVIDIA Developer Forums\n  \n\n\n\n\n\n\n\nCategory\nTopics\n\n\n\n\n\n\n\n\n\n\nCommunity Information\n\n\nDiscussion area for community platform issues and requests.\n\n\n\n0\n\n\n\n\n\n\n\n\n\nRegional Activities & Discussions\n\n\nThis is a place to have discussions which will be of interest to developers in specific geographic regions.\n\n\n\n0\n\n\n\n\n\n\n\n\n\nTechnical Blogs & Events\n\n\nFind discussions about our technical blogs, our live connect with experts events, recorded presentations and webinars.\nDiscuss the topics with peers, post questions for the presenters and authors.\n\n\n\n24\n\n\n\n\n\n\n\n\n\nAutonomous Machines\n\n\n\n\n\n\n2\n\n\n\n\n\n\n\n\n\nAutonomous Vehicles\n\n\n\n\n\n\n0\n\n\n\n\n\n\n\n\n\nAI & Data Science\n\n\n\n\n\n\n10\n\n\n\n\n\n\n\

## Splitting into Chunks Using Text Splitters
Split the text into small chunks. We initialize RecursiveCharacterTextSplitter and pass it the documents. Here each chunk holds at most 500 characters, with a 20-character overlap between consecutive chunks.

In [64]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
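To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a string with a plain fixed-size sliding window. This is a deliberate simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line boundaries, which this sketch ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

# Tiny values make the overlap visible: 'd' and 'g' appear in two chunks each
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears partly in the neighbouring chunk, which can help retrieval when relevant text straddles two chunks.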

## Creating Embeddings and Storing in Vector Store
We create an embedding for each chunk of text and store it in the vector store (here, FAISS). We use the **all-mpnet-base-v2** Sentence Transformer model to convert each chunk into a vector before it is stored.

In [65]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# storing embeddings in the vector store
vectorstore = FAISS.from_documents(all_splits, embeddings)
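Conceptually, a vector store answers a query by embedding it and returning the chunks whose vectors lie closest. The pure-Python sketch below illustrates that idea with made-up three-dimensional vectors; real embeddings have far more dimensions, FAISS uses optimized index structures, and the choice of Euclidean distance here is an assumption made for illustration. All names in the block are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_chunks(query_vector, indexed_chunks, k=2):
    """Return the texts of the k chunks whose vectors are closest to the query."""
    ranked = sorted(indexed_chunks, key=lambda item: l2_distance(query_vector, item[1]))
    return [text for text, _vector in ranked[:k]]

# Toy index of (chunk text, embedding) pairs with invented vectors
toy_index = [
    ("CUDA installation guide", [0.9, 0.1, 0.0]),
    ("Omniverse overview", [0.1, 0.9, 0.0]),
    ("cuDNN release notes", [0.7, 0.2, 0.1]),
]
print(nearest_chunks([1.0, 0.0, 0.0], toy_index, k=2))
```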



Save the vector store, including its embeddings, to a local folder.

In [79]:
vectorstore.save_local(folder_path='/content/vs_data')

Load the vector store data back from disk.

In [82]:
vectorstore_data = FAISS.load_local('/content/vs_data/', embeddings)

## Implement the Hugging Face pipeline in LangChain

We initialize the text-generation pipeline with Hugging Face transformers. The pipeline requires the following three things, which we must initialize first:
1. An LLM: meta-llama/Llama-2-7b-chat-hf
2. A tokenizer for the selected model
3. A stopping criteria object

First, initialize the model and move it to a CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

You also need an access token to download the model from Hugging Face in code. To generate one, go to your **Hugging Face Profile > Settings > Access Token > New Token > Generate a Token.** Copy the token and add it to the code below.

In [67]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items; paste your own access token here (never commit a real token)
hf_auth = '<YOUR_HF_ACCESS_TOKEN>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")




Model loaded on cuda:0
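To see why the 4-bit quantization configured above matters on a T4 GPU, a rough estimate of the memory needed for the model weights is useful. The sketch below is back-of-the-envelope arithmetic only: the parameter count of about 6.74 billion is an assumption, and the estimate ignores activations, the key-value cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

approx_params = 6.74e9  # assumption: approximate parameter count of a "7B" model

fp16_gb = weight_memory_gb(approx_params, 16)  # weights stored as 16-bit floats
nf4_gb = weight_memory_gb(approx_params, 4)    # weights stored in 4 bits

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would occupy most of a 16 GB GPU, while the 4-bit version leaves room for the embedding model and generation.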


The pipeline requires a tokenizer, which translates human-readable plaintext into LLM-readable token IDs. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which can be initialized with this code:

In [68]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)




Now, we need to define the stopping criteria of the model. The stopping criteria let us specify when the model should stop generating text. Without them, the model tends to go off on a tangent after answering the initial question.

In [69]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

Convert these stop token ids into LongTensor objects.

In [70]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

We can do a quick spot check that no `<unk>` token IDs (0) appear in stop_token_ids. There are none, so we can move on to building the stopping criteria object, which checks whether the stopping criteria have been satisfied, that is, whether any of these token ID combinations have been generated.

In [71]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
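The check inside `StopOnTokens` boils down to a suffix comparison: generation stops when the most recent token IDs equal one of the stop sequences. The sketch below reproduces that logic with plain Python lists instead of tensors. The stop sequences reuse the IDs printed earlier with the first two entries (1 and 29871) dropped; treating those as tokens the tokenizer prepends rather than as part of the stop string is an assumption worth verifying, for example by tokenizing with `add_special_tokens=False`. The function name is hypothetical.

```python
def ends_with_stop_sequence(generated_ids, stop_sequences):
    """Return True if the generated IDs end with any of the stop sequences."""
    for stop_ids in stop_sequences:
        # Compare only the tail of the generated IDs, as StopOnTokens does
        if len(stop_ids) <= len(generated_ids) and generated_ids[-len(stop_ids):] == stop_ids:
            return True
    return False

# Assumption: stop IDs from the earlier output, minus the two leading entries
stop_sequences = [[13, 29950, 7889, 29901], [13, 28956, 13]]

print(ends_with_stop_sequence([450, 1234, 13, 29950, 7889, 29901], stop_sequences))
print(ends_with_stop_sequence([450, 1234, 13], stop_sequences))
```

If the stop sequences keep a leading beginning-of-sequence ID, the suffix comparison is unlikely to match in the middle of a generation, so it is worth inspecting what the tokenizer actually returns for each stop string.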

In [72]:
# initialize the Hugging Face pipeline

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; lower values make outputs more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

Run this code to confirm that everything is working fine.


In [73]:
res = generate_text("Explain me the difference between Data Lakehouse and Data Warehouse.")
print(res[0]["generated_text"])

Explain me the difference between Data Lakehouse and Data Warehouse. Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of structured and unstructured data. A data lakehouse is a centralized repository that stores all the data from various sources in its raw form, without any predefined schema or structure. On the other hand, a data warehouse is a structured repository that stores data in a specific format, typically optimized for querying and analysis.

Here are some key differences between a data lakehouse and a data warehouse:

1. Structure: A data lakehouse has no predefined schema, whereas a data warehouse has a rigid schema that defines how the data should be organized and stored.
2. Data Types: A data lakehouse can store various types of data, including structured, semi-structured, and unstructured data, while a data warehouse typically stores only structured data

In [74]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
llm(prompt="Explain me the difference between Data Lakehouse and Data Warehouse.")

" Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both concepts have gained popularity in recent years. A data lakehouse is a centralized repository that stores all the raw data from various sources in its original form, while a data warehouse is a structured repository that organizes data into a specific schema for querying and analysis.\nData Lakehouse vs Data Warehouse: What's the Difference? - DataCamp\nA data lakehouse is a centralized repository that stores all the raw data from various sources in its original form, while a data warehouse is a structured repository that organizes data into a specific schema for querying and analysis. The main differences between a data lakehouse and a data warehouse are: Data Lakehouse: Raw Data Storage A data lakehouse stores all the raw data from various sources in its original form, including structured, semi-structured, and unstructured data. This means that the da

## Initializing Chain
We now initialize a ConversationalRetrievalChain. This chain gives us a chatbot with memory that relies on the vector store to find relevant information in the ingested documents.

Additionally, we can return the source documents used to answer a question by passing the optional parameter return_source_documents=True when constructing the chain. In the code below it is set to False, so only the answer is returned.

In [87]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(llm, vectorstore_data.as_retriever(), return_source_documents=False)
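At a high level, a retrieval chain places the retrieved chunks into the prompt ahead of the question so that the LLM can answer from them. The sketch below shows that idea in plain Python. The template wording and the function name `build_prompt` are hypothetical and are not LangChain's actual prompt; the real chain also rewrites a follow-up question using the chat history before retrieving.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved context and the question into a single prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Placeholder chunk texts stand in for what the retriever would return
prompt = build_prompt(
    "What is the NVIDIA CUDA Toolkit?",
    ["chunk one text", "chunk two text"],
)
print(prompt)
```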

## Question-Answering

In [91]:
chat_history = []

query = "What is the NVIDIA CUDA Toolkit?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

 The NVIDIA CUDA Toolkit is a comprehensive development environment for C and C++ developers building GPU-accelerated applications. It provides tools and libraries for developing, optimizing, and deploying applications on various hardware platforms, including embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers.


This time, the previous question and answer are included as chat history, which makes it possible to ask follow-up questions.

In [92]:
chat_history = [(query, result["answer"])]

query = "How can I install it on windows?"  # "it" refers to the NVIDIA CUDA Toolkit
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

 You can download the latest version of the CUDA Toolkit from the official NVIDIA website by following these steps:

1. Open a web browser and navigate to the NVIDIA CUDA website (<https://developer.nvidia.com/cuda-toolkit>).
2. Click on the "Download" button next to the version of CUDA you want to install (e.g., "CUDA Toolkit 11.7.1").
3. Follow the prompts to download the installation package.
4. Once the download is complete, run the installation file and follow the prompts to install CUDA.

If you have any issues during the installation process, you can refer to the NVIDIA CUDA documentation or contact NVIDIA support for help.
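The cells in this section rebuild `chat_history` by hand with only the most recent turn. A minimal sketch of a helper that keeps every turn is shown below. The helper name `ask` is hypothetical, and `fake_chain` is a stand-in used only so the example runs without a model; with the real chain you would pass `chain` instead.

```python
def ask(chain, question, chat_history):
    """Query the chain and append the (question, answer) pair to the history."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]

def fake_chain(inputs):
    """Stand-in for the real chain: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to '{inputs['question']}' with {turns} earlier turn(s)"}

history = []
ask(fake_chain, "What is the NVIDIA CUDA Toolkit?", history)
print(ask(fake_chain, "How can I install it on windows?", history))
print(len(history))
```

Keeping the full history gives follow-up questions more context, at the cost of longer prompts as the conversation grows.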


In [94]:
chat_history = [(query, result["answer"])]

query = "What is the difference between NVIDIA's BioMegatron and Megatron 530B LLM?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])



 Based on the information provided in the given documents, there is no mention of a product called "BioMegatron" or "Megatron 530B LLM". It appears that the documents are focused on NVIDIA's GH200 Superchip-based MGX servers, which are designed for accelerated applications that are tightly coupled with the underlying CPU and memory platform. Therefore, I cannot provide an answer to your question as there is no relevant information available in the given documents.
