* Name: Arindam Choudhury
* Uplevel Email Address: arindam.choudhury.email@gmail.com
* Problem Statement: Problem-01: AI-Assisted Learning for NVIDIA SDKs and Toolkits

> # 2. AC_NvidiaLLM_helper_embedding_vecotorDB_loading.ipynb
### Create Embeddings and Load Vector Database:

The content files generated earlier are loaded back from the saved location "NVIDIA_DATA_FOLDER". The data is split into chunks, embeddings are generated with several embedding models, and one vector database per model is created and saved back to the same folder.

1. NVIDIA Main page                 ---> `/NVIDIA_DATA_FOLDER/1_NVIDIA_MAIN_PAGE_CONTENT`
2. NVIDIA Main document page        ---> `/NVIDIA_DATA_FOLDER/2_NVIDIA_MAIN_DOC_PAGE_CONTENT`
3. NVIDIA All document pages        ---> `/NVIDIA_DATA_FOLDER/3_NVIDIA_DOCUMENT_PAGE_CONTENT`
4. NVIDIA Forum pages               ---> `/NVIDIA_DATA_FOLDER/4_NVIDIA_FORUM_PAGE_CONTENT`
5. NVIDIA Customer Help pages       ---> `/NVIDIA_DATA_FOLDER/5_NVIDIA_CUSTHELP_PAGE_CONTENT`
6. Wikipedia (NVIDIA) page          ---> `/NVIDIA_DATA_FOLDER/6_NVIDIA_WIKIPEDIA_PAGE_CONTENT`
7. Stack-Overflow pages for NVIDIA  ---> `/NVIDIA_DATA_FOLDER/7_NVIDIA_STACK_OVERFLOW_QA_PAGE_CONTENT`

> #### Import Necessary Libraries

In [1]:
from langchain.embeddings import HuggingFaceEmbeddings, GooglePalmEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
import pickle
import os

> #### Google API Key Setup Guide

##### You need an API key for the Google PaLM generative LLM to use Google embeddings

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

<a class="button button-primary" href="https://makersuite.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `GOOGLE_API_KEY`.

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GOOGLE_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`

Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
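
The options above can be combined into a single lookup: prefer a key passed explicitly, otherwise fall back to the `GOOGLE_API_KEY` environment variable. The following is a minimal sketch; `resolve_api_key` is a hypothetical helper introduced here for illustration and is not part of any Google SDK.

```python
import os


def resolve_api_key(explicit_key=None, env=None, var_name="GOOGLE_API_KEY"):
    """Return an API key from an explicit argument or from the environment.

    Hypothetical helper for illustration only; it is not part of any SDK.
    """
    # Allow a custom mapping to be injected (useful for testing); default to os.environ.
    env = os.environ if env is None else env
    key = explicit_key or env.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; create a key and export it first.")
    return key
```

Failing early with a clear message when the key is missing tends to be easier to debug than an authentication error raised later by the embedding client.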

> #### Connect Google Drive
* ### Make sure to unzip Arindam_Choudhury_NvidiaLLM.zip and save it to your "MyDrive" location

In [None]:
from google.colab import userdata, drive
drive.mount('/content/drive', force_remount=True)

> #### Define global variables for this notebook

In [2]:
GOOGLE_API_KEY               = 'AI-'
#os.environ["OPENAI_API_KEY"] = 'sk- '
save_data_folder             = '/content/drive/MyDrive/Arindam_Choudhury_NvidiaLLM/NVIDIA_DATA_FOLDER/'
vactor_db_name_MiniLM        = 'FAIS_INDEX_VDB_MiniLM'
vactor_db_name_mpnet         = 'FIAS_INDEX_VDB_mpnet'
vactor_db_name_Google        = 'FIAS_INDEX_VDB_Google'
#vactor_db_name_G_Palm        = 'FIAS_INDEX_VDB_G_Palm'
#vactor_db_name_OpenAI        = 'FIAS_INDEX_VDB_OpenAI'

> #### Function: Load file

In [3]:
def load_data(file_name):
    """Load a pickled content file from save_data_folder."""
    with open(save_data_folder + file_name, 'rb') as file:
        return pickle.load(file)
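
For reference, the loader above expects files written with `pickle.dump`. The sketch below shows a save/load round trip in a temporary folder. The names `save_data` and `load_data_from`, the file name `1_EXAMPLE_CONTENT`, and the sample record are all hypothetical and exist only for this illustration.

```python
import os
import pickle
import tempfile


def save_data(folder, file_name, obj):
    """Hypothetical counterpart to load_data: pickle obj to folder/file_name."""
    with open(os.path.join(folder, file_name), "wb") as file:
        pickle.dump(obj, file)


def load_data_from(folder, file_name):
    """Same logic as load_data, with the folder passed in explicitly."""
    with open(os.path.join(folder, file_name), "rb") as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp:
    # Made-up sample record standing in for scraped page content.
    pages = [{"page_content": "CUDA Toolkit overview", "source": "example"}]
    save_data(tmp, "1_EXAMPLE_CONTENT", pages)
    restored = load_data_from(tmp, "1_EXAMPLE_CONTENT")

print(restored == pages)
```

Note that `pickle.load` can execute arbitrary code while deserializing, so it should only be used on files you created yourself or otherwise trust.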

> #### Initialize Embedding models

In [4]:
# 'mps' selects the Apple Silicon GPU backend; use 'cuda' or 'cpu' on other hardware (e.g. Colab)
EMBEDDING_MiniLM = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-l6-v2", show_progress=True, model_kwargs={'device': 'mps'})
EMBEDDING_mpnet  = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", show_progress=True, model_kwargs={'device': 'mps'})
EMBEDDING_Google = GoogleGenerativeAIEmbeddings(model="models/embedding-001", task_type = "RETRIEVAL_QUERY", google_api_key=GOOGLE_API_KEY)
#EMBEDDING_G_PALM = GooglePalmEmbeddings(model="models/embedding-gecko-001", google_api_key=GOOGLE_API_KEY)
#EMBEDDING_OpenAI = OpenAIEmbeddings(model="text-embedding-3-small")

> #### Load back the content files

In [5]:
NVIDIA_MAIN_PAGE_CONTENT              = load_data('1_NVIDIA_MAIN_PAGE_CONTENT')
NVIDIA_MAIN_DOC_PAGE_CONTENT          = load_data('2_NVIDIA_MAIN_DOC_PAGE_CONTENT')
NVIDIA_DOCUMENT_PAGE_CONTENT          = load_data('3_NVIDIA_DOCUMENT_PAGE_CONTENT')
NVIDIA_FORUM_PAGE_CONTENT             = load_data('4_NVIDIA_FORUM_PAGE_CONTENT')
NVIDIA_CUSTHELP_PAGE_CONTENT          = load_data('5_NVIDIA_CUSTHELP_PAGE_CONTENT')
NVIDIA_WIKIPEDIA_PAGE_CONTENT         = load_data('6_NVIDIA_WIKIPEDIA_PAGE_CONTENT')
NVIDIA_STACK_OVERFLOW_QA_PAGE_CONTENT = load_data('7_NVIDIA_STACK_OVERFLOW_QA_PAGE_CONTENT')

all_page_document = []  # collect all loaded documents into a single Python list
all_page_document.extend(NVIDIA_MAIN_PAGE_CONTENT)
all_page_document.extend(NVIDIA_MAIN_DOC_PAGE_CONTENT)
all_page_document.extend(NVIDIA_DOCUMENT_PAGE_CONTENT)
all_page_document.extend(NVIDIA_FORUM_PAGE_CONTENT)
all_page_document.extend(NVIDIA_CUSTHELP_PAGE_CONTENT)
all_page_document.extend(NVIDIA_WIKIPEDIA_PAGE_CONTENT)
all_page_document.extend(NVIDIA_STACK_OVERFLOW_QA_PAGE_CONTENT)

> #### Function: Define Text Splitter into Chunks

In [6]:
def doc_split(all_page_document, chunk_size, chunk_overlap):
    print("Total documents before split  : ", len(all_page_document))
    TEXT_SPLITTER    = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap = chunk_overlap)
    split_documents  = TEXT_SPLITTER.split_documents(all_page_document)
    print("Total Split documents/contexts: ", len(split_documents))
    return split_documents

In [7]:
chunk_size = 1024
chunk_overlap = 0
split_documents = doc_split(all_page_document, chunk_size, chunk_overlap)

Total documents before split  :  36793
Total Split documents/contexts:  808127
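
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts a string into fixed-size character windows. It is not the separator-aware algorithm used by `RecursiveCharacterTextSplitter`; `split_fixed` is a hypothetical function written only to illustrate how the two parameters interact.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Simplified illustration: consecutive windows share chunk_overlap characters.
    This is NOT the separator-aware logic of RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, 4, 0))  # ['abcd', 'efgh', 'ij']
print(split_fixed(sample, 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_overlap = 0`, as in the cell above, neighbouring chunks share no text. A non-zero overlap repeats some characters across chunk boundaries, which increases the number of chunks but can keep a sentence that straddles a boundary retrievable from either side.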


> #### Function: Create Embeddings and Save in FAISS DB

In [13]:
def create_vector_db(split_documents, EMBEDDING, vactor_db_name):
    """Embed the split documents, build a FAISS index, and save it under save_data_folder."""
    vector_store = FAISS.from_documents(documents=split_documents, embedding=EMBEDDING)
    print("Number of vectors added in DB: ", vector_store.index.ntotal)

    vector_store.save_local(save_data_folder + vactor_db_name)  # save vector DB to local folder
    return "Vector DataBase Loading and Saving Completed !!"
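
Conceptually, a vector database stores one embedding vector per chunk and, at query time, returns the chunks whose vectors lie closest to the query vector. The toy sketch below illustrates that idea with a brute-force squared-L2 search in plain Python. `ToyVectorStore` is a hypothetical class and the two-dimensional vectors are made up; this is not the FAISS API and says nothing about how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyVectorStore:
    """Hypothetical in-memory store for illustration; not the FAISS API."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        self.texts.append(text)
        self.vectors.append(list(vector))

    @property
    def ntotal(self):
        # Number of stored vectors, analogous to the count printed above.
        return len(self.vectors)

    def search(self, query_vector, k=1):
        # Rank every stored vector by distance to the query (brute force).
        ranked = sorted(
            range(self.ntotal),
            key=lambda i: squared_l2(self.vectors[i], query_vector),
        )
        return [self.texts[i] for i in ranked[:k]]


store = ToyVectorStore()
store.add("CUDA kernels", [1.0, 0.0])      # made-up 2-D "embeddings"
store.add("driver install", [0.0, 1.0])
store.add("cuDNN setup", [0.9, 0.1])

print(store.ntotal)                         # 3
print(store.search([1.0, 0.05], k=2))       # ['CUDA kernels', 'cuDNN setup']
```

In the notebook, the embedding model produces the vectors and the vector count matches the number of split documents, one vector per chunk.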

> #### Create Embedding using `HuggingFaceEmbeddings / sentence-transformers/all-MiniLM-l6-v2`

In [10]:
create_vector_db(split_documents, EMBEDDING_MiniLM, vactor_db_name_MiniLM)

Number of vectors added in DB:  808127


'Vector DataBase Loading and Saving Completed !!'

> #### Create Embedding using `HuggingFaceEmbeddings / sentence-transformers/all-mpnet-base-v2`

In [11]:
create_vector_db(split_documents, EMBEDDING_mpnet, vactor_db_name_mpnet)

Number of vectors added in DB:  808127


'Vector DataBase Loading and Saving Completed !!'

> #### Create Embedding using `GoogleGenerativeAIEmbeddings / models/embedding-001`

In [12]:
create_vector_db(split_documents, EMBEDDING_Google, vactor_db_name_Google)

Number of vectors added in DB:  808127


'Vector DataBase Loading and Saving Completed !!'