# Vectorstores and Embeddings

The steps in this notebook include:
- **Use LangChain OpenAI Embeddings and the LangChain Chroma vector store**

## Contents
1. [Installation](#installation)
2. [Embeddings](#embeddings)
3. [Vector Stores](#vectorstores)  
4. [Similarity Search](#similarity)
5. [Failure modes](#failure)

**Source:** https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/4/vectorstores-and-embedding

![overview.png](./images/overview.png)

# **Installation** <a name="installation"></a>

In [2]:
!pip install -U langchain openai python-dotenv

Collecting langchain
  Downloading langchain-0.0.335-py3-none-any.whl.metadata (16 kB)
Collecting openai
  Downloading openai-1.2.3-py3-none-any.whl.metadata (16 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting anyio<4.0 (from langchain)
  Downloading anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.2-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langsmith<0.1.0,>=0.0.63 (from langchain)
  Downloading langsmith-0.0.63-py3-none-any.whl.metadata (10 kB)
Collecting pydantic<3,>=1 (from langchain)
  Downloading pydantic-2.4.2-py3-none-any.whl.metadata (158 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading dist

In [22]:
import os
import openai
import sys

sys.path.append('../..')

# Load the API key from a local .env file rather than hardcoding it in the notebook
# (expects a line of the form OPENAI_API_KEY=... in .env)
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

Next come **Document Loading** and **Splitting**: we load the lecture PDFs and split them into chunks.

In [6]:
!pip install -U pypdf 

Collecting pypdf
  Downloading pypdf-3.17.0-py3-none-any.whl.metadata (7.5 kB)
Downloading pypdf-3.17.0-py3-none-any.whl (277 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.17.0


In [7]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("data/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("data/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("data/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("data/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

<div class="alert alert-info"> 💡<b>extend():</b>
The <code>extend()</code> method adds all the elements of an <i>iterable</i> (list, tuple, string, etc.) to the end of the list, so <code>docs</code> ends up as one flat list of pages rather than a list of lists.
</div>
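To make the difference concrete, the short sketch below contrasts `extend()` with `append()` on plain lists. The page names are placeholders standing in for the `Document` objects a loader returns, not real loader output.

```python
# Placeholder strings stand in for the Document objects a loader returns.
pages_a = ["lecture01-page0", "lecture01-page1"]
pages_b = ["lecture02-page0"]

flat = []
flat.extend(pages_a)   # adds each element of pages_a
flat.extend(pages_b)   # adds each element of pages_b

nested = []
nested.append(pages_a)  # adds the list itself as a single element
nested.append(pages_b)

print(flat)    # ['lecture01-page0', 'lecture01-page1', 'lecture02-page0']
print(nested)  # [['lecture01-page0', 'lecture01-page1'], ['lecture02-page0']]
```

A flat list is what the splitter expects, which is why the loading loop uses `extend()`.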

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

print(len(docs))
print(len(splits))
print(len(splits[0].page_content))

78
209
1499


We have **78 pages** from 4 PDF loads (3 distinct lectures, with Lecture 1 loaded twice on purpose). The recursive text splitter split these `Documents` into **209 chunks**, where each chunk's `page_content` is at most 1500 characters long (`chunk_size`).
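To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch: a fixed-size sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, lines, and words), and the function name `sliding_window_chunks` is hypothetical. It only illustrates how overlap makes neighbouring chunks share text.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With these toy settings, the last three characters of each chunk reappear at the start of the next one, in the same way that the notebook's `chunk_overlap = 150` lets neighbouring 1500-character chunks share context.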

# **Embeddings**  <a name="embeddings"></a>

Let's take our splits and embed them.

In [18]:
!pip install -U tiktoken

Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1


In [7]:
import numpy as np
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

<div class="alert alert-info"> 💡<b>OpenAIEmbeddings:</b>  
    
By default, the LangChain <b>OpenAIEmbeddings</b> class uses the <code>text-embedding-ada-002</code> model.  
At the time of writing, OpenAI recommends <code>text-embedding-ada-002</code> for nearly all use cases (it's better, cheaper, and simpler to use). <a href="https://platform.openai.com/docs/guides/embeddings/embedding-models">More</a>.
</div>

In [8]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [11]:
np.dot(embedding1, embedding2)

0.9631676073007296

In [12]:
np.dot(embedding1, embedding3)

0.7710631888387288

In [13]:
np.dot(embedding2, embedding3)

0.7596683332753217
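The dot products above act as similarity scores: the two sentences about dogs score about 0.96, while each of them scores about 0.77 against the weather sentence. OpenAI documents its embeddings as normalized to length 1, in which case the dot product equals the cosine similarity. The sketch below checks that relationship on made-up 3-dimensional vectors (real embeddings have far more dimensions, and the helper names are illustrative).

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors, chosen so that the first two point in similar directions.
dogs = [0.9, 0.1, 0.2]
canines = [0.8, 0.2, 0.25]
weather = [0.1, 0.9, 0.3]

print(cosine_similarity(dogs, canines))
print(cosine_similarity(dogs, weather))

# After scaling to unit length, the plain dot product gives the same number.
print(dot(normalize(dogs), normalize(canines)))
```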

# **Vectorstores**  <a name="vectorstores"></a>

In [24]:
!pip install -U chromadb

Collecting chromadb
  Downloading chromadb-0.4.17-py3-none-any.whl.metadata (7.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.104.1-py3-none-any.whl.metadata (24 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.24.0.post1-py3-none-any.whl.metadata (6.4 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.0.2-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting pulsar-client>=3.1.0 (from chromadb)
  Downloading pulsar_client-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.16.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry

<div class="alert alert-info"> 💡<b>ChromaDB:</b>  
    
<b>ChromaDB</b> is an open-source vector store used for storing and retrieving vector embeddings. Its main use is to save embeddings along with metadata to be used later by large language models. Additionally, it can also be used for semantic search engines over text data. <a href="https://docs.trychroma.com/">More</a>.  
    
(The <code>chromadb</code> Python package must be installed.)
</div>


In [28]:
from langchain.vectorstores import Chroma

In [29]:
persist_directory = './db/'

In [30]:
!rm -rf ./db/  # remove old database files if any (same path as persist_directory)

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [19]:
print(vectordb._collection.count())

209


# **Similarity Search** <a name="similarity"></a>

In [34]:
question = "is there an email i can ask for help"

docs = vectordb.similarity_search(question,k=3)

In [22]:
len(docs)

3
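Conceptually, `similarity_search(question, k=3)` embeds the question and returns the `k` stored chunks whose embeddings are closest to it. The sketch below shows that idea as a brute-force scan with cosine similarity. It is an illustration under assumptions, not Chroma's implementation: a real vector store typically relies on an index rather than scanning every vector, and the `store` contents, the 2-dimensional vectors, and the `top_k` helper are all made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical store: (chunk text, made-up 2-d embedding) pairs.
store = [
    ("contact email for the TAs", [0.9, 0.1]),
    ("homework submission policy", [0.6, 0.5]),
    ("MATLAB and Octave", [0.1, 0.9]),
    ("study groups", [0.4, 0.6]),
]
query_vec = [1.0, 0.0]  # made-up embedding of the question

for score, text in top_k(query_vec, store, k=3):
    print(f"{score:.3f}  {text}")
```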

In [23]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to f

Let's save the vector store to disk so we can use it later!

In [24]:
vectordb.persist()

# **Failure modes**  <a name="failure"></a>

This seems great, and basic similarity search will get you 80% of the way there very easily. 

But some failure modes can crop up.

Here are some edge cases that can arise; we'll fix them in the next class.

In [25]:
question = "what did they say about matlab?"

In [26]:
docs = vectordb.similarity_search(question,k=5)

Notice that we're getting duplicate chunks (because of the duplicate `MachineLearning-Lecture01.pdf` in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [27]:
docs[0]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,

In [28]:
docs[1]

Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class,
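One simple workaround for exact duplicates is to drop repeated `page_content` values after retrieval, keeping the rank order. The sketch below is an illustration only and not necessarily the approach the next lecture takes. The `Doc` class is a hypothetical stand-in for a LangChain `Document`, and the texts and page numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_exact_duplicates(results):
    """Keep the first occurrence of each page_content, preserving rank order."""
    seen = set()
    unique = []
    for doc in results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Made-up results: the first two entries mimic the duplicated chunk.
results = [
    Doc("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Doc("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Doc("we'll actually have a short MATLAB tutorial", {"page": 9}),
]
unique = drop_exact_duplicates(results)
print(len(results), "->", len(unique))  # 3 -> 2
```

This only removes byte-identical chunks and leaves near-duplicates in place. It also shrinks the result list below `k`, so in practice one would retrieve more candidates than needed before filtering.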

We can see a new failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [29]:
question = "what did they say about regression in the third lecture?"

In [30]:
docs = vectordb.similarity_search(question,k=5)

In [31]:
for doc in docs:
    print(doc.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
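Two of the five results come from lectures 1 and 2, so the search evidently did not treat "the third lecture" as a hard constraint. A plausible explanation is that the embedding captures the overall meaning of the question rather than a structured condition on the source. One way to enforce such a condition is to filter on metadata. The sketch below applies a post-filter to the metadata printed above. It is an illustration, not necessarily the fix the next lecture uses, and the variable names are hypothetical.

```python
# Metadata copied from the output above, in ranked order.
retrieved = [
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 0},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 14},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf", "page": 0},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 6},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 8},
]

# Keep only results whose source is the third lecture.
wanted_source = "docs/cs229_lectures/MachineLearning-Lecture03.pdf"
filtered = [meta for meta in retrieved if meta["source"] == wanted_source]

for meta in filtered:
    print(meta)
```

Filtering after retrieval can leave fewer than `k` results, so applying the filter inside the vector store query, where the store supports it, is usually preferable.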


In [32]:
print(docs[4].page_content)

into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this 
machine learning stuff was actually useful. So what was it that you learned? Was it 
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you 
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."  
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutori al in one of the discussion sections for 
those of you that don't know it.  
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion 
sections will be taught by the TAs, and atte ndance at discussion secti

Approaches discussed in the next lecture can be used to address both!