# Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

![Vector](immagini/17_vector.png)

We've now split our document into small, semantically meaningful chunks. The next step is to put these chunks into an index, so that we can easily retrieve them when it comes time to answer questions about this corpus of data.

To do that, we're going to utilize __embeddings and vector stores__. 


First, these are incredibly important for building chatbots over your data. Second, we're going to go a bit deeper and talk about edge cases where this generic method can actually fail. Don't worry, we're going to fix those later on.

But for now, let's talk about vector stores and embeddings. 

And this comes after text splitting, when we're ready to store the documents in an easily accessible format. 


![Vector](immagini/18_vector.png)

What are __EMBEDDINGS__?

They take a piece of text, and they create a numerical representation of that text. Text with similar content will have similar vectors in this numeric space. What that means is we can then compare those vectors and find pieces of text that are similar. 

So, in the example below, we can see that the two sentences about pets are very similar, while a sentence about a pet and a sentence about a car are not very similar. 


![Vector](immagini/19_vector.png)
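
To make "similar vectors" concrete, here is a minimal sketch using hand-made three-dimensional vectors. The numbers are invented purely for illustration (real embeddings come from a model and have far more dimensions), and `cosine_similarity` is a helper defined here, not a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the values are made up for illustration.
pet_sentence_1 = [0.9, 0.1, 0.0]
pet_sentence_2 = [0.8, 0.2, 0.1]
car_sentence = [0.1, 0.1, 0.9]

pets_score = cosine_similarity(pet_sentence_1, pet_sentence_2)
pet_car_score = cosine_similarity(pet_sentence_1, car_sentence)

print(f"pet vs pet: {pets_score:.3f}")
print(f"pet vs car: {pet_car_score:.3f}")
```

The two pet vectors point in nearly the same direction and score close to 1, while the pet and car vectors score much lower.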

As a reminder of the full end-to-end workflow, we start with documents, we then create smaller splits of those documents, we then create embeddings of those splits, and then we store all of those in a vector store. 

A __VECTOR STORE__ is a database where you can easily look up similar vectors later on. This will become useful when we're trying to find documents that are relevant for a question at hand.

We can then take the question at hand, create an embedding, and then do comparisons to all the different vectors in the vector store, and then pick the n most similar. We then take those n most similar chunks, and pass them along with the question into an LLM, and get back an answer. 


![Vector](immagini/20_vector.png)
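
The retrieval step described above can be sketched as a brute-force scan: score every stored vector against the question vector and keep the n best. This is only an illustration under stated assumptions. `toy_store`, `question_vector`, and `top_n` are hypothetical names, the vectors are made up, and a real vector store typically uses an index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vector, store, n):
    """Return the n (score, text) pairs most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

# Hypothetical in-memory "vector store": (chunk text, made-up embedding).
toy_store = [
    ("chunk about course logistics", [0.9, 0.1, 0.1]),
    ("chunk about linear regression", [0.1, 0.9, 0.2]),
    ("chunk about office hours", [0.8, 0.2, 0.0]),
    ("chunk about gradient descent", [0.2, 0.8, 0.3]),
]

# Pretend this is the embedding of a question about getting help.
question_vector = [1.0, 0.1, 0.0]

results = top_n(question_vector, toy_store, n=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The selected chunks would then be passed, together with the question, to the LLM.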

In [None]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

We just discussed `Document Loading` and `Splitting`.

We're going to be working with the same set of documents. These are the CS229 lectures. 

Notice that we're actually going to duplicate the first lecture. This is for the purposes of simulating some dirty data. 

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

We can then use the recursive character text splitter to create chunks. We can see that we've now created over 200 different chunks.

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

In [None]:
splits = text_splitter.split_documents(docs)

In [None]:
len(splits)

*OUTPUT*

209
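
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraph and line breaks). `sliding_window_chunks` is a hypothetical helper written only for this sketch.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is the role the overlap plays: context at a chunk boundary is not lost entirely.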

## Embeddings

Let's take our splits and create embeddings for all of them. We'll use OpenAI to create these embeddings. 

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

Before embedding the real chunks, let's try this out on three toy sentences. The first two are very similar and the third one is unrelated. We can then use the embedding class to create an embedding for each sentence. 

In [None]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [None]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

We can then use NumPy to compare them, and see which ones are most similar. 

In [None]:
import numpy as np

We'll use a dot product to compare each pair of embeddings.

The important thing to know is that a higher value means the two sentences are more similar. 

In [None]:
np.dot(embedding1, embedding2)

*OUTPUT*

0.9631853877103519

In [None]:
np.dot(embedding1, embedding3)

*OUTPUT*

0.770999765129468

In [None]:
np.dot(embedding2, embedding3)

*OUTPUT*

0.7596334120325541
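
A note on why a plain dot product is a reasonable score here: OpenAI documents its embeddings as normalized to unit length, and for unit-length vectors the dot product equals cosine similarity. The sketch below checks that relationship on small made-up vectors. The helper names and the numbers are assumptions for illustration, not part of the lesson's code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors of different lengths, for illustration only.
a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, but twice as long
c = [4.0, 3.0]

# Raw dot products depend on vector length as well as direction.
raw_ab = dot(a, b)
raw_ac = dot(a, c)

# After normalizing, the dot product matches cosine similarity.
unit_ab = dot(normalize(a), normalize(b))
unit_ac = dot(normalize(a), normalize(c))

print(raw_ab, raw_ac)
print(round(unit_ab, 3), round(unit_ac, 3))
```

If you switch to an embedding model that does not return unit-length vectors, it may be safer to compute cosine similarity explicitly.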

## Vectorstores

It's time to create embeddings for all the chunks of the PDFs and then store them in a vector store.

In [None]:
# ! pip install chromadb

In [None]:
from langchain.vectorstores import Chroma

The vector store that we'll use for this lesson is Chroma, which we imported above. LangChain has integrations with over 30 different vector stores. We chose Chroma because it's lightweight and runs in memory, which makes it very easy to get started with. Other vector stores offer hosted solutions, which can be useful when you're trying to persist large amounts of data or persist it in cloud storage somewhere. 

In [None]:
# save in a variable called persist_directory

persist_directory = 'docs/chroma/'

Let's also make sure that nothing is there already. Leftover files from a previous run can throw things off, and we don't want that to happen. 

In [None]:
!rm -rf ./docs/chroma  # remove old database files if any

In [None]:
# Let's now create the vector store

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory  # directory where the store is saved on disk
)

In [None]:
print(vectordb._collection.count())

*OUTPUT*

209

We can see that it's 209, which is the same as the number of splits that we had from before. 

### Similarity Search

In [None]:
question = "is there an email i can ask for help"

We're going to use the `similarity_search` method, passing in the question and `k=3`, which specifies the number of documents that we want to return. 

In [None]:
docs = vectordb.similarity_search(question,k=3)

In [None]:
len(docs)

*OUTPUT*

3

In [None]:
docs[0].page_content

*OUTPUT*

![Vector](immagini/21_vector.png)

Let's save this so we can use it later!

In [None]:
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily. 

But there are some failure modes that can creep in. 

Here are some edge cases that can arise. We'll fix them in the next class.

In [None]:
question = "what did they say about matlab?"

In [None]:
docs = vectordb.similarity_search(question,k=5)

Notice that we're getting duplicate chunks: `docs[0]` and `docs[1]` are identical. This is because, when we loaded the PDFs, we deliberately included `MachineLearning-Lecture01.pdf` twice, so its chunks appear twice in the index.

Semantic search fetches the most similar documents, but does not enforce diversity. This is bad because the same information sits in two different chunks, and we're going to pass both of them to the language model down the line. The second copy adds no real value. It would be much better to have a different, distinct chunk that the language model could draw on.


In [None]:
docs[0]

*OUTPUT*

![Vector](immagini/22_vector.png)

In [None]:
docs[1]

*OUTPUT*

![Vector](immagini/22_vector.png)
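
As a rough illustration of the problem, and not necessarily the approach taken in the next lecture, exact duplicates can be dropped after retrieval by comparing chunk text. `Doc` is a minimal stand-in for a LangChain document, `drop_duplicate_chunks` is a hypothetical helper, and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical, for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_duplicate_chunks(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

# Placeholder results that mimic the duplicated first lecture.
retrieved = [
    Doc("placeholder text of chunk A", {"page": 8}),
    Doc("placeholder text of chunk A", {"page": 8}),  # exact duplicate
    Doc("placeholder text of chunk B", {"page": 9}),
]

unique_docs = drop_duplicate_chunks(retrieved)
print(len(retrieved), "->", len(unique_docs))
```

Note the limitation: this leaves fewer than `k` results and does nothing to bring in a more diverse chunk to replace the duplicate, so it treats the symptom rather than the retrieval step itself.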

Here is another failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [None]:
question = "what did they say about regression in the third lecture?"

In [None]:
docs = vectordb.similarity_search(question,k=5)

So, let's loop over all the documents and print out the metadata. 

In [None]:
for doc in docs:
    print(doc.metadata)

*OUTPUT*

![Vector](immagini/23_vector.png)

We can see that there's actually a combination of results: some from the third lecture, some from the second lecture, and some from the first.

The intuition for why this fails is that "the third lecture" is a piece of structured information: we want documents from only that lecture. But we're doing a purely semantic lookup based on embeddings, which creates a single embedding for the whole sentence, and that embedding is probably more focused on regression. As a result, we get chunks that are relevant to regression regardless of which lecture they come from.

If we take a look at the fifth document, the one that comes from the first lecture, we can see that it does in fact mention regression. The search picks up on that, but not on the fact that it's only supposed to query documents from the third lecture, because that structured constraint isn't really captured in the semantic embedding that we've created. 

In [None]:
print(docs[4].page_content)

*OUTPUT*

![Vector](immagini/24_vector.png)
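
One way to think about this failure is that "the third lecture" should act as a filter on metadata rather than as part of the semantic query. The sketch below applies such a filter in plain Python to stand-in results. `Doc`, `filter_by_source`, and the sample data are hypothetical, and the `source` key is assumed to hold the file path, as in the metadata printed above.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical, for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_source(docs, source):
    """Keep only documents whose metadata 'source' matches exactly."""
    return [doc for doc in docs if doc.metadata.get("source") == source]

lecture01 = "docs/cs229_lectures/MachineLearning-Lecture01.pdf"
lecture03 = "docs/cs229_lectures/MachineLearning-Lecture03.pdf"

# Stand-in search results; page numbers and text are placeholders.
results = [
    Doc("placeholder chunk A", {"source": lecture03, "page": 0}),
    Doc("placeholder chunk B", {"source": lecture01, "page": 8}),
    Doc("placeholder chunk C", {"source": lecture03, "page": 4}),
]

only_third = filter_by_source(results, lecture03)
for doc in only_third:
    print(doc.metadata)
```

Filtering after retrieval can leave fewer results than requested, so in practice it is usually preferable to apply the constraint during the search itself. LangChain's Chroma wrapper accepts a `filter` argument on `similarity_search` for this, though the details may vary with the version you are using.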

Try changing `k`, the number of documents that you retrieve. You'll probably notice that when you make it larger, you'll retrieve more documents, but the documents towards the tail end may not be as relevant as the ones at the beginning. 

Approaches discussed in the next lecture can be used to address both!