## Introduction to Vectorstores

One of the most popular ways of using Large Language Models (LLMs) has been to take an unstructured natural language query and run a similarity search over a wide range of data sources.

This is achieved by embedding all source data and storing the resulting embedding vectors in a Vectorstore.

#### What is an Embedding

Embeddings create a vector representation of a piece of text. 

<img src="https://cdn.openai.com/embeddings/draft-20220124e/vectors-1.svg">

This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
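
To make "most similar in the vector space" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are hand-written, hypothetical values chosen for illustration; a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
router = [0.0, 0.1, 0.9]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_router = cosine_similarity(cat, router)

# Related texts should score higher than unrelated ones
print(sim_cat_kitten > sim_cat_router)
```

A score close to 1 means the two vectors point in nearly the same direction, which we interpret as the two texts being semantically close.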

A Vectorstore is similar to a database that stores the embedded data and allows performing a search on that data based on the embedding vectors. 


<img src="https://python.langchain.com/assets/images/vector_stores-125d1675d58cfb46ce9054c9019fea72.jpg" width="800" height="400">

An embedding model converts the unstructured natural language query into a vector, and the vectorstore retrieves the stored embedding vectors that are 'most similar' to the embedded query.
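
The idea can be sketched with a toy in-memory store. This is only an illustration of the concept, not how Chroma or FAISS are implemented: the `ToyVectorStore` class, its method names, and the hand-written vectors are all assumptions made up for this example, and real vectorstores use indexing structures rather than scanning every item.

```python
import math

class ToyVectorStore:
    """Hypothetical, brute-force vector store for illustration only."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def similarity_search(self, query_vector, k=2):
        # Score every stored vector against the query, highest similarity first
        ranked = sorted(
            self._items,
            key=lambda item: self._cosine(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
# Hand-written stand-ins for real embeddings
store.add([0.9, 0.1, 0.0], "Multicast VPN overview")
store.add([0.1, 0.9, 0.0], "BGP route reflection")
store.add([0.0, 0.1, 0.9], "Cooking recipes")

# Pretend this vector is the embedded user query
top = store.similarity_search([0.8, 0.2, 0.0], k=2)
print(top)
```

The query vector points in almost the same direction as the first stored vector, so that text is returned first.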

### Most Popular Vectorstores

Two of the most common free-to-use vectorstores that can be installed locally are:
* Chroma (the `chromadb` package)
* FAISS

Let's look at installing and using both of these vectorstores.

For both demonstrations we will use the embedding model from OpenAI.

In [2]:
#! pip install langchain-openai chromadb
# following import of pysqlite3 is to override the system sqlite3 version 3.31 which is unsupported by chromadb
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import os, getpass

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

os.environ['OPENAI_API_KEY'] = getpass.getpass()


Two other important building blocks when working with vectorstores are LangChain's "Document Loaders" and "Text/Data Splitters".

Document Loaders ingest various types of data sources into your program. The sources could be structured data, such as SQL or etcd databases, or unstructured data, such as web page contents, text documents, PDFs, etc.

For our use case and for simplicity, we are going to use langchain's "WebBaseLoader", which can load a document directly from a URL.

Each LLM has a limited context window, which caps the amount of data we can feed into the model at a time. This is where "Splitters" come into the picture: the source data is split into chunks that are small enough to fit into the LLM's context window.

There are different types of splitters built into the "langchain.text_splitter" module. In this case we are going to use the simple "CharacterTextSplitter" class.
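
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that slices a string into fixed-width pieces. The `split_text` function is a hypothetical helper written for this example. It is not the `CharacterTextSplitter` implementation, which splits on a separator first (a blank line by default) and then merges the pieces into chunks.

```python
def split_text(text, chunk_size, chunk_overlap=0):
    """Slice text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

The overlap repeats the tail of one chunk at the head of the next, which can help when a sentence straddles a chunk boundary.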

In [3]:
#Beautiful soup package required for WebBaseLoader
#! pip install bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter


### Overall Program flow for performing similarity search

* Load the document using "WebBaseLoader"
* Split the document into chunks using "CharacterTextSplitter"
* Embed each chunk of the document using "OpenAIEmbeddings"
* Load the embedded data into the "Chroma" vectorstore

Let's put all of the above steps into code.

In [None]:
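# 1. Load the RFC text from its URL
mcast_vpn_rfc_loader = WebBaseLoader("https://www.rfc-editor.org/rfc/rfc6517.txt")
mcast_vpn_rfc_document = mcast_vpn_rfc_loader.load()

# 2. Split the document into chunks of roughly 1000 characters, with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
split_documents = text_splitter.split_documents(mcast_vpn_rfc_document)

# 3. Embed each chunk and store the vectors in Chroma
db = Chroma.from_documents(split_documents, OpenAIEmbeddings())

# 4. Embed the query and return the most similar chunks
query = "what is the document about?"
db.similarity_search(query)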

We have now successfully indexed our data chunks in a Vectorstore DB.

The next step is to prompt an LLM with the relevant chunks of data retrieved by the vectorstore similarity search, so that it can answer the user query.

This is done with a "Retriever" or a "Retrieval Chain", which we will cover in the next notebook.