# Vectorstores and Embeddings
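Recall the overall workflow for retrieval augmented generation (RAG): documents are loaded, split into chunks, embedded, and stored in a vector store that is searched at query time.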

In [11]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

We just discussed Document Loading and Splitting.

In [12]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf"),
    PyPDFLoader("https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf"),
    PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/cs229-notes2.pdf"),
    PyPDFLoader("https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [13]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum number of characters per chunk
    chunk_overlap=150   # characters shared between consecutive chunks
)

In [14]:
splits = text_splitter.split_documents(docs)

In [None]:
len(splits)

149
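
To see what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-width character chunking with overlap. It is an illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which the hypothetical `chunk_text` helper below does not attempt.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: the alphabet, with small sizes so the overlap is easy to see.
chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.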

## Embeddings
Let's take our splits and embed them.

In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [6]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [7]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [8]:
import numpy as np

In [None]:
np.dot(embedding1, embedding2)

0.963190818034527

In [None]:
np.dot(embedding1, embedding3)

0.7711177200797418

In [None]:
np.dot(embedding2, embedding3)

0.7596334120325523
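
The plain dot product works as a similarity score here on the assumption that the embedding vectors have unit length (OpenAI documents its embeddings as normalized). For vectors that are not normalized, cosine similarity divides by the two norms. The sketch below uses made-up 3-dimensional vectors in place of real embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors standing in for real embeddings (assumption, not API output).
v_dogs = [0.9, 0.1, 0.2]
v_canines = [0.8, 0.2, 0.2]
v_weather = [0.1, 0.9, 0.3]

sim_close = cosine_similarity(v_dogs, v_canines)
sim_far = cosine_similarity(v_dogs, v_weather)
print(sim_close, sim_far)
```

As with the real embeddings above, the two related vectors score higher than the unrelated pair.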

## Vectorstores

In [None]:
#%pip install chromadb


In [1]:
from langchain.vectorstores import Chroma

In [9]:
persist_directory = 'docs/chroma/'

In [3]:
import shutil
shutil.rmtree('./docs/chroma', ignore_errors=True)  # remove old database files if any (works on Windows, macOS, and Linux)


In [15]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [16]:
print(vectordb._collection.count())

149


## Similarity Search

In [17]:
question = "is there an email i can ask for help"

In [18]:
docs = vectordb.similarity_search(question, k=3)

In [19]:
len(docs)

3
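
Conceptually, a similarity search embeds the question and returns the `k` stored chunks whose vectors score highest against it. The sketch below does this by brute force with NumPy over made-up 2-dimensional vectors. It is not how Chroma is implemented, and `top_k` is a hypothetical helper.

```python
import numpy as np


def top_k(query_vec, stored_vecs, k):
    """Return indices of the k stored vectors most similar to query_vec (cosine)."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(stored_vecs, dtype=float)
    q = q / np.linalg.norm(q)                         # normalize the query
    m = m / np.linalg.norm(m, axis=1, keepdims=True)  # normalize each stored row
    scores = m @ q                                    # cosine similarity per row
    order = np.argsort(-scores)                       # highest score first
    return order[:k].tolist()


# Made-up vectors standing in for embedded chunks (assumption, not real data).
stored = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
]
best = top_k([1.0, 0.05], stored, k=2)
print(best)
```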

In [20]:
docs[0].page_content

'mined (according top(y))asbefore. Then, thesender oftheemail writes the\nemail by\x0crstgenerating x1fromsome multinomial distribution overwords\n(p(x1jy)).Next, thesecond wordx2ischosen independen tlyofx1butfrom\nthesame multinomial distribution, andsimilarly forx3,x4,andsoon,until\nallnwordsoftheemail havebeengenerated. Thus,theoverallprobabilit yof\namessage isgivenbyp(y)Qn\ni=1p(xijy).Notethatthisformulalookslikethe\nonewehadearlier fortheprobabilit yofamessage under themulti-variate\nBernoulli eventmodel,butthattheterms intheformulanowmean verydif-\nferentthings. Inparticular xijyisnowamultinomial, rather thanaBernoulli\ndistribution.\nTheparameters forournewmodelare\x1ey=p(y)asbefore, \x1eijy=1=\np(xj=ijy=1)(foranyj)and\x1eijy=0=p(xj=ijy=0).Notethatwehave\nassumed thatp(xjjy)isthesameforallvaluesofj(i.e.,thatthedistribution\naccording towhichawordisgenerated doesnotdependonitsposition j\nwithin theemail).\nIfwearegivenatraining setf(x(i);y(i));i=1;:::;mgwhere x(i)=\n(x(i)\n1;x(i

Let's save this so we can use it later!

In [None]:
vectordb.persist()

## Failure modes
This seems great, and basic similarity search will get you 80% of the way there very easily.

But some failure modes can crop up.

Here are two edge cases that can arise; we'll fix them in the next lecture.

In [21]:
question = "what did they say about matlab?"

In [22]:
docs = vectordb.similarity_search(question, k=5)

Notice that we're getting duplicate chunks (because cs229-notes1.pdf was loaded twice into the index).

Semantic search fetches all similar documents, but does not enforce diversity.

docs[0] and docs[1] are identical.

In [23]:
docs[0]

Document(page_content='ni\x0ccan tinsigh tintothestructure oftheproblem, andwerealsoabletowrite\ntheentirealgorithm interms ofonlyinner products betweeninput feature\nvectors. Inthenextsection, wewillexploit thispropertytoapply theker-\nnelstoourclassi\x0ccation problem. Theresulting algorithm, supportvector\nmachines ,willbeabletoe\x0ecien tlylearn inveryhighdimensional spaces.\n7Kernels\nBackinourdiscussion oflinear regression, wehadaproblem inwhichthe\ninput xwastheliving areaofahouse, andweconsidered performing regres-', metadata={'page': 12, 'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmppfmxqozb\\tmp.pdf'})

In [None]:
docs[1]
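
One simple stopgap for exact duplicates, offered here as an assumption and not as the approach from the next lecture, is to drop any result whose text has already been seen. The sketch uses a small stand-in `Doc` class instead of LangChain's `Document` so that it runs without dependencies; `dedupe_by_content` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Made-up results that mimic the duplicate pair seen above.
results = [
    Doc("kernels and support vector machines", {"page": 12}),
    Doc("kernels and support vector machines", {"page": 12}),
    Doc("linear regression", {"page": 1}),
]
unique_results = dedupe_by_content(results)
print(len(unique_results))
```

This only catches verbatim duplicates. Near-duplicates, and the lost slots in the top `k`, call for a retrieval method that takes diversity into account.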

Here is a second failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [24]:
question = "what did they say about regression in the third lecture?"

In [25]:
docs = vectordb.similarity_search(question, k=5)

In [26]:
for doc in docs:
    print(doc.metadata)

{'page': 1, 'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmps28qlbe1\\tmp.pdf'}
{'page': 1, 'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmpepkj5t9_\\tmp.pdf'}
{'page': 0, 'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmpgsbgz6ge\\tmp.pdf'}
{'page': 12, 'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmppfmxqozb\\tmp.pdf'}
{'page': 2, 'source': 'C:\\Users\\alexm\\AppData\\Local\\Temp\\tmps28qlbe1\\tmp.pdf'}


In [27]:
print(docs[4].page_content)

3
Part I
Linear Regression
To make our housing example more interesting, let’s consider a slightly richer
dataset in which we also know the number of bedrooms in each house:
Living area (feet2)#bedrooms Price (1000$s)
2104 3 400
1600 3 330
2400 3 369
1416 2 232
3000 4 540
.........
Here, thex’s are two-dimensional vectors in R2. For instance, x(i)
1is the
living area of the i-th house in the training set, and x(i)
2is its number of
bedrooms. (In general, when designing a learning problem, it will be up to
you to decide what features to choose, so if you are out in Portland gathering
housing data, you might also decide to include other features such as whether
each house has a ﬁreplace, the number of bathrooms, and so on. We’ll say
more about feature selection later, but for now let’s take the features as
given.)
To perform supervised learning, we must decide how we’re going to rep-
resent functions/hypotheses hin a computer. As an initial choice, let’s say
we decide to approximate yas 
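
The phrase "third lecture" is structured information that the embedding of the question does not reliably capture, while the metadata on each result does record where the chunk came from. As a rough sketch of the idea, and not necessarily the method used in the next lecture, results could be post-filtered on their metadata. The file names and the `filter_by_source` helper below are made up for illustration.

```python
def filter_by_source(results, wanted_source):
    """Keep only results whose metadata 'source' equals wanted_source."""
    return [r for r in results if r["metadata"].get("source") == wanted_source]


# Made-up results; real ones would carry the loader's own source paths.
results = [
    {"text": "regression intro", "metadata": {"source": "notes1.pdf", "page": 1}},
    {"text": "regression recap", "metadata": {"source": "notes3.pdf", "page": 0}},
    {"text": "kernels", "metadata": {"source": "notes3.pdf", "page": 12}},
]
third = filter_by_source(results, "notes3.pdf")
print([r["text"] for r in third])
```

Filtering after retrieval can leave fewer than `k` results, which is one reason to apply such a filter inside the vector store query where that is supported.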

Approaches discussed in the next lecture can be used to address both!