# **Pinecone DB:**

In [None]:
!pip install langchain
!pip install sentence-transformers==2.2.2
!pip install pinecone-client==2.2.4
!pip install pypdf
!pip install langchain_google_genai

In [4]:
import os

ROOT = "/content/drive/MyDrive/Generative AI/Vector Database"
os.chdir(ROOT)

## **Load Data Sets:**

* Get the data
* Unzip the data

In [5]:
from langchain.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("pdf_data")
data = loader.load()
data

[Document(page_content='Profile\nA results-driven BCA graduate with a passion for leveraging advanced data science techniques to extract valuable insights\nfrom complex datasets, my expertise lies in the strategic application of statistical methods, machine learning, deep learning,\nNLP, computer vision, and state-of-the-art tools. I possess the ability to perform in-depth data analysis and manipulation\nand can develop full-stack data science projects. Currently, I am expanding my knowledge in MLOps and Generative\nAI.\nSkills\nPython|Excel|Power BI|SQL|Rest-API (Flask)|Operating System (Windows | Ubuntu)|Statistics\nMachine Learning (Scikit-learn)|Deep Learning (PyTorch | TensorFlow)\nComputer Vision (Convolutional Neural Network | Image processing | Object Classification)\nNatural Language Processing (Classification | Summarization | Generation | Name Entity Recognition | RNN | LSTM |\nGRU | Transformers | GPT)\nGit|GitHub|ML-Ops (DVC | MLFlow | Docker | Kubernetes | CI-CD (GitHub A

## **Data Chunking:**


In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=70, chunk_overlap=20)
texts_chunks = text_splitter.split_documents(data)

In [9]:
texts_chunks[2]

Document(page_content='leveraging advanced data science techniques to extract valuable', metadata={'source': 'pdf_data/Dibyendu-Biswas-Resume.pdf', 'page': 0})

In [10]:
# number of documents after data chunking:

len(texts_chunks)

144
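
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window chunker in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as newlines and spaces, so its chunks will not match these exactly. The function name `chunk_text` and the sample string are hypothetical and used only for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into character windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last 4 characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks. With `chunk_size=70`, as used above, each chunk holds roughly one line of the resume.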

## **Embeddings:**

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


In [13]:
query_result = embeddings.embed_query("Hello World")
print(len(query_result))

384


## **Create a Vector DB (Pinecone):**

### **Access the Pinecone key**

In [12]:

# Read the key from the environment; API keys should not be hardcoded in a notebook.
PINECONE_API_KEY = os.environ['PINECONE_API_KEY']
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', 'gcp-starter')

In [14]:
import pinecone
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "test" # put in the name of your pinecone index here


### **Store the Embedding Vectors in Pinecone DB:**

In [15]:
from langchain.vectorstores import Pinecone

docsearch = Pinecone.from_texts([t.page_content for t in texts_chunks], embeddings, index_name=index_name)

### **If you already have an index, you can load it like this:**

In [16]:
docsearch = Pinecone.from_existing_index(index_name, embeddings)
docsearch

<langchain_community.vectorstores.pinecone.Pinecone at 0x7d4bffb6ff70>

## **Perform Semantic Search:**

In [29]:
query = "Tell me something about Dibyendu Biswas?"
docs = docsearch.similarity_search(query, k=100) # retrieve the 100 most similar chunks
docs

[Document(page_content='in overall user experience.Dibyendu Biswas'),
 Document(page_content='•Date of Birth: 12-Mar-1999.\n•Father’s Name: Tapan Kumar Biswas.'),
 Document(page_content='dibyendubiswas1998@gmail.com 1 / 2'),
 Document(page_content='dibyendubiswas1998@gmail.com 2 / 2'),
 Document(page_content='dibyendubiswas1998@gmail.com 9907278562 Siliguri, West Bengal'),
 Document(page_content='https://github.com/dibyendubiswas1998'),
 Document(page_content='Education\nBCA, IGNOU 2023 | Siliguri, India\nPersonal Information'),
 Document(page_content='Data Scientist Intern, ineuron.ai Sep\xa02023 – Dec\xa02023 | Bangalore,'),
 Document(page_content='•Nationality: Indian.\n•Marital Status: Single.'),
 Document(page_content='https://www.linkedin.com/in/dibyendubiswas1998'),
 Document(page_content='•Languages: English, Bengali, and Hindi.'),
 Document(page_content='Data Scientist'),
 Document(page_content='to extract valuable insights'),
 Document(page_content='Projects\nNews Short, Text
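
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below shows that ranking step in plain Python, assuming a cosine metric; the metric actually configured on the `test` index is not shown in this notebook. The helper names and the 3-dimensional toy vectors are hypothetical stand-ins for the 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))  # [0, 1]
```

A vector database performs the same ranking, but typically with an approximate nearest-neighbour index so that it does not have to score every stored vector.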

## **Integration with Language Models (LLM):**

### **Access the Gemini API Key:**

In [21]:
from google.colab import userdata
Gemini_API_Key = userdata.get('gimini_api_key_1') # get the Gemini API Key

### **Load the Gemini LLM model:**

In [22]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key=Gemini_API_Key, convert_system_message_to_human=True) # load the gemini-pro model
llm

ChatGoogleGenerativeAI(model='gemini-pro', google_api_key=SecretStr('**********'), client=genai.GenerativeModel(
    model_name='models/gemini-pro',
    generation_config={},
    safety_settings={},
    tools=None,
), convert_system_message_to_human=True)

## **Generating Answers:**

### **Create a LangChain Chain to Retrieve the Information:**

In [30]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=docsearch.as_retriever(),
                                  return_source_documents=True)

In [33]:
query = "Tell me something about 'Dibyendu Biswas'?"  # the question to ask
llm_response = qa_chain.invoke(query)  # retrieve relevant chunks and generate an answer from them

print(llm_response['result'])

The provided text does not mention anything about 'Dibyendu Biswas' other than their name, date of birth, and father's name. So I cannot answer this question.
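
The thin answer is likely a consequence of how little context reaches the model. The `"stuff"` chain type places the retrieved chunks into a single prompt, and, as far as I know, `as_retriever()` returns only 4 chunks unless told otherwise; with 70-character chunks that is just a few short lines of text. The sketch below imitates the stuffing step in plain Python. The function `build_stuff_prompt` and its prompt wording are hypothetical, not LangChain's actual template, and the chunks are a handful copied from the search output above.

```python
def build_stuff_prompt(question, chunks, k):
    """Concatenate the first k retrieved chunks into one prompt (hypothetical wording)."""
    context = "\n\n".join(chunks[:k])
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    "in overall user experience.Dibyendu Biswas",
    "Education\nBCA, IGNOU 2023 | Siliguri, India\nPersonal Information",
    "Data Scientist",
    "to extract valuable insights",
]
question = "Tell me something about 'Dibyendu Biswas'?"

small_prompt = build_stuff_prompt(question, retrieved, k=1)
large_prompt = build_stuff_prompt(question, retrieved, k=3)
print(len(small_prompt), len(large_prompt))  # the prompt grows with k
```

In the chain itself, one option to explore is passing `search_kwargs={"k": ...}` to `docsearch.as_retriever()`, or re-chunking with a larger `chunk_size`, so that the model sees more of the resume.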
