# 2.3 Vectorstores and Embeddings - part 1


## Setup

### Install dependencies

In [6]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install chromadb~=0.5.18 --upgrade --quiet
%pip install langchain~=0.3.7 langchain_openai~=0.2.6 langchain_community~=0.3.5 --upgrade --quiet

# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Load environment variables

In [7]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")
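
If no `.env` file is found, the failure only shows up later, when the embedding model is first called. A minimal sketch of an early check, assuming the two variable names from the Colab comment above are the ones that matter; `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`:

```python
import os

# Assumption: these are the variables the Azure OpenAI client needs in this setup
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```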

### Setup Models

In [8]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
api_version = "2024-10-01-preview"
embedding_model = AzureOpenAIEmbeddings(model="text-embedding-3-large", openai_api_version=api_version)

### Setup path to data 

In [9]:
data_path = "../data"

We just discussed `Document Loading` and `Splitting`. Here we repeat both steps to produce the chunks we will embed and store.

In [10]:
# In LangChain 0.3 the document loaders live in langchain_community
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader(f"{data_path}/MachineLearning-Lecture01.pdf"),
    PyPDFLoader(f"{data_path}/MachineLearning-Lecture01.pdf"),
    PyPDFLoader(f"{data_path}/MachineLearning-Lecture02.pdf"),
    PyPDFLoader(f"{data_path}/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [11]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [12]:
splits = text_splitter.split_documents(docs)

In [13]:
len(splits)

208
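
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `split_with_overlap` is a hypothetical helper for illustration only:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the job of the overlap: text that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.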

## Embeddings

Before embedding the splits, let's look at what embeddings capture, using three short example sentences.

In [14]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [15]:
embedding1 = embedding_model.embed_query(sentence1)
embedding2 = embedding_model.embed_query(sentence2)
embedding3 = embedding_model.embed_query(sentence3)

print(embedding1[:10])

[-0.02029295265674591, -0.014658346772193909, -0.007224570494145155, -0.005373099818825722, 0.022524481639266014, 0.012113010510802269, -0.012991674244403839, -0.003120651701465249, -0.0017503544222563505, 0.04041854292154312]


In [16]:
import numpy as np

Embeddings 1 and 2 should be similar. We use NumPy's dot product as the similarity score: the higher the value, the more similar the two sentences.

In [17]:
np.dot(embedding1, embedding2)

0.8321759221175041

Embedding 3 should be much less similar to both of them.

In [18]:
np.dot(embedding1, embedding3)

0.15657078149987852

In [19]:
np.dot(embedding2, embedding3)

0.11709408419034396
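
The plain dot product works as a similarity score here because OpenAI documents its embeddings as normalized to length 1, in which case the dot product equals the cosine similarity. For vectors that are not normalized, you would divide by the norms. A minimal sketch with made-up toy vectors (not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product divided by both norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, chosen so the arithmetic is easy to follow
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# After scaling to unit length, the dot product alone gives the same value
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)

print(cosine_similarity(u, v))
print(float(np.dot(u_unit, v_unit)))
```

Both lines print a value of about 0.96 (24 / 25).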

## Vectorstores

In [20]:
# In LangChain 0.3 the vector store integrations live in langchain_community
from langchain_community.vectorstores import Chroma

In [21]:
import os
import shutil

# Optional persist_directory to save the database
persist_directory = './db/chroma-ML-docs/'

# Remove the directory and everything in it if it already exists
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)

In [22]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding_model,
    #persist_directory=persist_directory # Optionally persist the database
)

In [23]:
print(vectordb._collection.count())

208


### Similarity Search

In [24]:
question = "is there an email i can ask for help"

In [25]:
docs = vectordb.similarity_search(question, k=3)

In [26]:
len(docs)

3
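
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors are closest to it. A brute-force sketch of that idea, using made-up 3-dimensional vectors in place of real embeddings and cosine similarity as the score; Chroma itself relies on an approximate nearest-neighbor index rather than this exhaustive scan, and `top_k` is a hypothetical helper:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k rows of doc_vecs most similar to query_vec (cosine)."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    scores = doc_vecs @ query_vec / (doc_norms * np.linalg.norm(query_vec))
    # argsort sorts ascending, so negate the scores to rank best-first
    return np.argsort(-scores)[:k].tolist()

# Toy "index": rows 0 and 1 point in nearly the same direction
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

print(top_k(query, doc_vecs, k=2))  # [0, 1]
```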

In [27]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework problems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thing that I think will help you to succeed and \ndo well in this class and even help you to enjoy this class more is if you form a study \ngroup.  \nSo start looking around where you're sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to form st

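To save the index for later use, pass `persist_directory` to `Chroma.from_documents` (commented out above); Chroma then writes the collection to that directory.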

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily. 

But there are some failure modes that can crop up.

Here are some edge cases that can arise - we'll fix them in the next class.

In [28]:
question = "what did they say about matlab?"

In [29]:
docs = vectordb.similarity_search(question, k=5)

Notice that we're getting duplicate chunks (because of the duplicate `MachineLearning-Lecture01.pdf` in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [30]:
docs[0]

Document(metadata={'page': 8, 'source': '../data/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer featur

In [31]:
docs[1]

Document(metadata={'page': 8, 'source': '../data/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer featur
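
One way to confirm the duplication is to compare the text of the returned chunks. The sketch below uses a small stand-in `Chunk` class instead of LangChain's `Document` so that it runs on its own, with shortened toy text. `drop_duplicate_chunks` is a hypothetical helper that only removes exact duplicates from a result list; it is not the fix developed in the next notebook:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_duplicate_chunks(chunks):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk.page_content not in seen:
            seen.add(chunk.page_content)
            unique.append(chunk)
    return unique

# Toy results: the first two entries mimic the duplicated Lecture01 chunk
results = [
    Chunk("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Chunk("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Chunk("a different chunk", {"page": 9}),
]

unique = drop_duplicate_chunks(results)
print(len(results), len(unique))  # 3 2
```

Filtering after retrieval still wastes result slots on duplicates, so removing them before indexing, or retrieving with diversity in mind, is usually the better place to address this.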

Here is a second failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [32]:
question = "what did they say about regression in the third lecture?"

In [33]:
docs = vectordb.similarity_search(question, k=5)

In [34]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': '../data/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': '../data/MachineLearning-Lecture02.pdf'}
{'page': 6, 'source': '../data/MachineLearning-Lecture03.pdf'}
{'page': 13, 'source': '../data/MachineLearning-Lecture03.pdf'}
{'page': 14, 'source': '../data/MachineLearning-Lecture03.pdf'}
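
The phrase "the third lecture" is a constraint on each chunk's metadata, not on its text, so the embedding alone cannot enforce it. A minimal sketch of filtering results on the `source` metadata field, using plain dictionaries copied from the output above; `filter_by_source` is a hypothetical helper:

```python
def filter_by_source(metadatas, source):
    """Keep only the metadata entries whose 'source' matches exactly."""
    return [m for m in metadatas if m.get("source") == source]

# Metadata of the five results printed above
retrieved = [
    {"page": 0, "source": "../data/MachineLearning-Lecture03.pdf"},
    {"page": 0, "source": "../data/MachineLearning-Lecture02.pdf"},
    {"page": 6, "source": "../data/MachineLearning-Lecture03.pdf"},
    {"page": 13, "source": "../data/MachineLearning-Lecture03.pdf"},
    {"page": 14, "source": "../data/MachineLearning-Lecture03.pdf"},
]

lecture3 = filter_by_source(retrieved, "../data/MachineLearning-Lecture03.pdf")
print([m["page"] for m in lecture3])  # [0, 6, 13, 14]
```

LangChain's Chroma wrapper also accepts a `filter` argument on `similarity_search` (for example `filter={"source": "../data/MachineLearning-Lecture03.pdf"}`), which applies the restriction inside the vector store rather than after retrieval.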


In [35]:
print(docs[4].page_content)

Student:It’s the lowest it –  
Instructor (Andrew Ng):No, exactly. Right. So zero to the same, this is not the same, 
right? And the reason is, in logistic regression this is different from before, right? The 
definition of this H subscript theta of XI is not the same as the definition I was using in 
the previous lecture. And in particular this is no longer theta transpose XI. This is not a 
linear function anymore. This is a logistic function of theta transpose XI. Okay? So even 
though this looks cosmetically similar, even though this is similar on the surface, to the 
Bastrian descent rule I derived last time for least squares regression this is actually a 
totally different learning algorithm. Okay? And it turns out that there’s actually no 
coincidence that you ended up with the same learning rule. We’ll actually talk a bit more 
about this later when we talk about generalized linear models. But this is one of the most 
elegant generalized learning models that we’ll see later. Th

### How do we fix this?
The **retrieval** (2.4) notebook will cover solutions to these problems.
