<a href="https://colab.research.google.com/github/dhananjai14/LLM_tutorials/blob/main/Vector_Database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Database and Integration with LangChain

**What is a vector database?**

A vector database is a specialized type of database optimized for storing and querying high-dimensional vector data. These vectors are typically numerical representations (embeddings) of unstructured data such as text, images, or audio, and they capture features or characteristics of the data they represent.

These vectors are often high-dimensional, especially when they are derived from deep learning models. For instance, an image passed through a neural network might be turned into a vector with hundreds or thousands of dimensions. For example:
1. In the research paper "Attention Is All You Need," which introduced the Transformer model, each token is represented by a vector of size 512 in the base model.
2. In GPT-3, developed by OpenAI, the largest model (175B parameters) represents each token with a vector of dimension 12,288. OpenAI has not published the corresponding figure for GPT-4.
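
To make "vectors capture characteristics of the data" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The 4-dimensional vectors and their names (`cat`, `kitten`, `car`) are made-up assumptions for illustration; real embeddings come from a model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not produced by any model)
cat = [0.9, 0.1, 0.0, 0.3]
kitten = [0.8, 0.2, 0.1, 0.4]
car = [0.0, 0.9, 0.8, 0.1]

# Related concepts should score higher than unrelated ones
print(cosine_similarity(cat, kitten))
print(cosine_similarity(cat, car))
```

With these toy values the first score is higher than the second, which is the property a semantic search relies on.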


**Why are vector databases used?**

1. Efficient Similarity Searches: Traditional databases are not optimized for high-dimensional similarity searches. Vector databases use specialized indexing techniques (e.g., KD-Trees, Annoy, HNSW) to search for and retrieve similar vectors efficiently, often trading a small amount of accuracy for speed (approximate nearest-neighbor search).

2. Performance: Vector databases offer high-performance querying capabilities, often using in-memory storage and optimized data structures.
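
For contrast with the indexing techniques listed above, the sketch below shows the naive baseline they are designed to avoid: a brute-force scan that scores the query against every stored vector. The store contents and the helper name `brute_force_search` are assumptions for illustration only.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k=2):
    """Return the k (id, score) pairs most similar to the query.

    Every stored vector is scored, so the cost grows linearly with
    the number of vectors. Indexes such as HNSW avoid this full scan.
    """
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    top = heapq.nlargest(k, scored)
    return [(vec_id, score) for score, vec_id in top]

# Hypothetical 3-dimensional store (assumed values)
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}

results = brute_force_search([1.0, 0.05, 0.0], store, k=2)
print(results)
```

A linear scan like this is fine for a few thousand vectors; the specialized indexes matter once the collection grows to millions.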


**Common use cases**

1. Recommendation Systems: By representing users and items as vectors, recommendation systems can find the most relevant items for a user based on vector similarity.

2. Image and Video Search: Vectors can represent visual features of images and videos, allowing for efficient content-based retrieval.

3. Natural Language Processing: Text documents, sentences, and words can be converted into vectors using embeddings (e.g., Word2Vec, BERT), enabling semantic search and text classification.
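
As a rough illustration of the recommendation use case, the sketch below builds a user vector by averaging the vectors of liked items and recommends the unseen item with the highest dot product. The item names and 2-dimensional vectors are invented for this example, and averaging is just one simple way to form a user vector.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical item embeddings (assumed values)
items = {
    "sci-fi film": [0.9, 0.1],
    "space documentary": [0.8, 0.3],
    "romantic comedy": [0.1, 0.9],
    "space opera series": [0.85, 0.2],
}

liked = ["sci-fi film", "space documentary"]
user_vector = mean_vector([items[name] for name in liked])

# Score only the items the user has not already liked
candidates = {name: vec for name, vec in items.items() if name not in liked}
recommendation = max(candidates, key=lambda name: dot(user_vector, candidates[name]))
print(recommendation)
```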

**Examples of Vector Databases**

1. Chroma DB (Local DB)
2. FAISS (Facebook AI Similarity Search) (Local DB)
3. Pinecone (Cloud based)
4. MongoDB (Cloud based)



# ChromaDB

Official website: https://docs.trychroma.com/

Documentation: https://docs.trychroma.com/getting-started

The flow of this notebook is as follows:
* Step 1: Download the dataset and extract the text files.
* Step 2: Convert the data into embeddings and store them in Chroma DB.
* Step 3: Build an LLM chain to perform question answering (QA) over the documents.

In [None]:

!pip install langchain-community langchain-openai chromadb langchain langchain-chroma openai tiktoken -q



In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')


### Load the Dataset

In [None]:
# Dataset to be stored in the DB
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip
!unzip -q new_articles.zip -d new_articles

In [None]:
import os
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader
# from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter



In [None]:
# load the document
loader = DirectoryLoader('./new_articles', glob='./*.txt', loader_cls=TextLoader)
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
len(texts)



233
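
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. This is an assumption-laden sketch, not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraphs and sentences. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where consecutive chunks share
    chunk_overlap characters. Simplified illustration only."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.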

### Converting to Embeddings and Storing in Chroma DB

In [None]:
# Create the OpenAI embedding function
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
# Directory where Chroma persists the collection on disk
persist_directory = 'dB'
# Embed the chunks and load them into Chroma
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embeddings,
                                 persist_directory=persist_directory)



In [None]:
# Reload the persisted DB from disk
vectordb = Chroma(persist_directory='dB', embedding_function=embeddings)

In [None]:
# Performing semantic search

query = "How much money did Microsoft raise?"
docs = vectordb.similarity_search(query)
print(docs[0].page_content)

April 28, 2023

VC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.

April 25, 2023

Called ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”


In [None]:
# Wrap the vector store as a retriever that returns the top 2 chunks
retriver = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriver.invoke(query)
print(docs)

[Document(metadata={'source': 'new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}, page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”'), Document(metadata={'source': 'new_articles/05-07-3one4-capital-

### Generating a response using an LLM

In [None]:
# Making a chain
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


# Creating Prompt
system_prompt = ("You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise. \n\n {context}")

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

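# Initializing the LLM. `OpenAI` is the completion-style wrapper, so the chat
# prompt is flattened to plain text; the answer may start with a role prefix.
llm = OpenAI(api_key=OPENAI_API_KEY)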

# Creating chains
question_answer_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)
qa_chain = create_retrieval_chain(retriever=retriver, combine_docs_chain=question_answer_chain)

# generating response
response = qa_chain.invoke({"input": "How much money did Microsoft raise?"})
print(response["answer"])




System: Microsoft raised $10 billion in a big investment announced earlier this year.
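
Conceptually, the "stuff" documents chain used above places all retrieved chunks into the `{context}` placeholder of the prompt before calling the LLM. The sketch below imitates that step in plain Python. The function `stuff_prompt`, the shortened template, the abbreviated chunk texts, and the `"\n\n"` separator are all assumptions for illustration, not LangChain's actual implementation.

```python
# Shortened, hypothetical version of the system prompt used above
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n{context}"
)

def stuff_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join all retrieved chunks into one context string and build
    (role, content) messages, mimicking the 'stuff' strategy."""
    context = separator.join(retrieved_chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]

# Abbreviated stand-ins for the chunks a retriever would return
chunks = [
    "The size of Microsoft's investment is believed to be around $10 billion.",
    "ChatGPT Business is a forthcoming offering for professionals.",
]

messages = stuff_prompt("How much money did Microsoft raise?", chunks)
for role, content in messages:
    print(f"{role}: {content}")
```

Because every chunk is placed into a single prompt, this strategy only works while the combined chunks fit in the model's context window, which is one reason the retriever above is limited to `k=2`.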
