# 3 Vectorstores, Embeddings, Retrieval

The split documents (chunks) have to be indexed so that they can be retrieved easily when answering questions about the documents.

In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ["OPENAI_API_KEY"]

In [2]:
# Load documents
from langchain_community.document_loaders import PyPDFLoader
loaders = [
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf"),
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

In [3]:
# Create chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)
len(splits)

151
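
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification, not the algorithm used by `RecursiveCharacterTextSplitter` (which prefers to cut on separators such as paragraph breaks), and `split_with_overlap` is a hypothetical helper written for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

toy_chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last 3 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is short enough.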

In [4]:
# Create embeddings
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [5]:
str1 = "I love dogs"
str2 = "I love canines"  # "closer" to str1
str3 = "Cameroon is a beautiful country"  # "far" from str1

emb1 = embedding.embed_query(str1)
emb2 = embedding.embed_query(str2)
emb3 = embedding.embed_query(str3)

In [6]:
import numpy as np

print("emb1.emb2 =", np.dot(emb1, emb2))  # "high"
print("emb1.emb3 =", np.dot(emb1, emb3))  # "low"
print("emb2.emb3 =", np.dot(emb2, emb3))  # "low"

emb1.emb2 = 0.9631664400467244
emb1.emb3 = 0.7630367804422375
emb2.emb3 = 0.7693401811468166
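
The dot product works as a similarity score here because OpenAI embeddings are, as far as I know, normalized to unit length; for unit vectors the dot product equals the cosine similarity. For vectors of arbitrary length the two differ, as this small sketch with made-up 3-dimensional vectors shows (`cosine_similarity` is a helper defined here, not a library call).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (independent of their lengths)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, made up for illustration (not real embeddings)
v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 4.0, 4.0]   # same direction as v1, twice the length
v3 = [2.0, -1.0, 0.0]  # orthogonal to v1

print(np.dot(v1, v2))             # 18.0, grows with vector length
print(cosine_similarity(v1, v2))  # 1.0, same direction
print(cosine_similarity(v1, v3))  # 0.0, orthogonal
```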


## Vectorstores

In [7]:
import shutil
import os

persist_directory = "docs/chroma"  # to save the index

# Remove old database files if any
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)

In [8]:
# Create the vectorstore
# Chroma is an example of a vectorstore (lightweight; in-memory unless persist_directory is set)
from langchain_chroma import Chroma

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory,
)
vectordb._collection.count()

151

## Similarity search

In [9]:
question = "Is there an email I can ask for help?"
docs = vectordb.similarity_search(question, k=3)
print(docs[0].page_content)

cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework problems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thing that I think will help you to succeed and 
do well in this class and even help you to enjoy this class more is if you form a study 
group.  
So start looking around where you're sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study groups 
a

In [10]:
# Even if the question has nothing to do with the documents,
# similarity search still returns the k nearest chunks: it ranks by
# distance and applies no relevance cutoff
question = "What is said about Sergio Ramos?"
docs = vectordb.similarity_search(question, k=3)
print(docs[0].page_content)

he says it in sort of a really touching, sincere way, and then he has this — you can see it 
in his eyes — he has this deep appreciation of the truth and beauty in the universe as 
revealed to him by the math he does.  
In this class, I'm not gonna do any truth and beauty. In this class, I'm gonna talk about 
learning theory to try to convey to you an understanding of how and why learning 
algorithms work so that we can apply these learning algorithms as effectively as possible.  
So, for example, it turns out you can prove surprisingly deep theorems on when you can 
guarantee that a learning algorithm will work, all right? So think about a learning
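
One common way to avoid returning unrelated chunks is to drop results whose score falls below a minimum. The sketch below shows the idea with a brute-force search over made-up 2-D vectors; it is not how Chroma is implemented, `top_k` is a hypothetical helper, and the threshold of 0.9 is an arbitrary value chosen for the example.

```python
import numpy as np

def top_k(query, vectors, k, min_score=None):
    """Return (index, cosine score) pairs for the k vectors most similar to query."""
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)[:k]  # highest score first
    hits = [(int(i), float(scores[i])) for i in order]
    if min_score is not None:
        # Optional cutoff: discard weak matches instead of always returning k results
        hits = [(i, s) for i, s in hits if s >= min_score]
    return hits

# Toy 2-D "embeddings", made up for illustration
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_query = [0.6, 0.8]

print(top_k(toy_query, toy_vectors, k=2))                 # always 2 hits
print(top_k(toy_query, toy_vectors, k=2, min_score=0.9))  # only the strong match remains
```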


## Retriever

In [11]:
# Toy texts about a species of mushroom (Amanita phalloides)
texts = [
    "The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).",
    "A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.",
    "A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.",
]

vectordb = Chroma.from_texts(texts, embedding=embedding)
question = "Tell me about all-white mushrooms with large fruiting bodies"
vectordb.similarity_search(question, k=2)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [12]:
# Maximal Marginal Relevance (MMR):
# - fetch the fetch_k chunks most similar to the question
# - among them, select the k that best balance relevance and diversity
vectordb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]
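
The selection step can be sketched as a greedy loop: at each step, pick the candidate with the best trade-off between similarity to the query and similarity to what has already been selected. This is a minimal illustration on made-up 2-D vectors, assuming cosine similarity and a trade-off weight called `lambda_mult`; the library's own implementation may differ in detail.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: return indices of k candidates balancing relevance and diversity."""
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to anything already selected
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy vectors, made up for illustration; they play the role of the fetch_k candidates
toy_query = np.array([1.0, 0.0])
toy_candidates = [
    np.array([1.0, 0.10]),   # most relevant
    np.array([1.0, 0.12]),   # near-duplicate of the first
    np.array([0.7, -0.7]),   # less relevant, but different
]

print(mmr(toy_query, toy_candidates, k=2))                   # [0, 2]
print(mmr(toy_query, toy_candidates, k=2, lambda_mult=1.0))  # [0, 1], pure relevance
```

With `lambda_mult=1.0` the diversity penalty vanishes and the result matches a plain similarity search, which mirrors the difference between the two outputs above.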

In [13]:
# Self-query: some questions restrict the answer to certain documents
# (e.g. "in the third lecture"). That requires filtering on metadata, which
# similarity_search does not do by itself: it only compares content embeddings.
# An LLM extracts the search string and the metadata filter from the question.
from langchain_openai import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo  # to describe the metadata

# Retrieve the previously saved index
vectordb = Chroma(
    embedding_function=embedding,
    persist_directory=persist_directory,
)


# Metadata description
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/MachineLearning-Lecture01.pdf`, `docs/MachineLearning-Lecture02.pdf`, or `docs/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

llm = OpenAI()
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents="Machine Learning lecture notes",  # document content description
    metadata_field_info=metadata_field_info,
)

In [14]:
question = "What did they say about regression in the third lecture?"
docs = retriever.invoke(question)  # a retriever takes the query string directly
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 5, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/MachineLearning-Lecture03.pdf'}
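
Conceptually, the self-query retriever splits the question into a search string and a metadata filter, then searches only among the chunks that pass the filter. The sketch below imitates that with plain Python: the structured query is hand-written (no LLM call), the chunk texts are made up, and a keyword match stands in for the similarity search.

```python
# Toy chunks with the same metadata fields as above (texts are made up)
toy_chunks = [
    {"text": "intro to regression", "source": "docs/MachineLearning-Lecture02.pdf", "page": 3},
    {"text": "locally weighted regression", "source": "docs/MachineLearning-Lecture03.pdf", "page": 0},
    {"text": "logistic regression", "source": "docs/MachineLearning-Lecture03.pdf", "page": 5},
    {"text": "course logistics", "source": "docs/MachineLearning-Lecture01.pdf", "page": 1},
]

# What a query constructor might produce for
# "What did they say about regression in the third lecture?" (hand-written here)
structured_query = {
    "query": "regression",
    "filter": {"source": "docs/MachineLearning-Lecture03.pdf"},
}

def apply_filter(chunks, metadata_filter):
    """Keep only chunks whose metadata equals every key/value in the filter."""
    return [c for c in chunks if all(c.get(key) == value for key, value in metadata_filter.items())]

filtered = apply_filter(toy_chunks, structured_query["filter"])
# The similarity search would run on `filtered` only; a keyword match stands in for it
hits = [c for c in filtered if structured_query["query"] in c["text"]]
for c in hits:
    print(c["page"], c["source"])
```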


In [15]:
# Compression: pass only the relevant sentences of the relevant documents to the LLM
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print("\n---\n".join(f"Document {i+1}:\n{doc.page_content}" for i, doc in enumerate(docs)))

compressor = LLMChainExtractor.from_llm(llm)
retriever = vectordb.as_retriever(search_type="mmr")  # combine techniques

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:
- MATLAB or in Octave
- easy to learn tool to use for implementing a lot of learning algorithms
- Octave
- a software package called Octave
- fewer features than MATLAB
- free
- just about everything
- once a colleague of mine at a different university, not at Stanford, actually teaches another machine learning course.
---
Document 2:
"learning theory to try to convey to you an understanding of how and why learning algorithms work so that we can apply these learning algorithms as effectively as possible."
---
Document 3:
"Instructor (Andrew Ng):Say that again."

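The compression step can be pictured as a filter applied to each retrieved document: keep the passages relevant to the question and drop documents where nothing is left. In the sketch below a simple keyword overlap stands in for the LLM extractor, so it is only an illustration of the data flow; `compress`, the stop-word list, and the toy documents are all made up for this example.

```python
import re

def compress(doc, question):
    """Keep only sentences sharing at least one non-trivial word with the question."""
    stop = {"what", "did", "they", "say", "about", "the", "a", "is"}
    keywords = set(re.findall(r"\w+", question.lower())) - stop
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)

# Toy "retrieved" documents, made up for illustration
toy_docs = [
    "We will use MATLAB in this class. The weather is nice today.",
    "Please form study groups.",
]
toy_question = "what did they say about matlab?"

# Compress each document and drop the ones where nothing relevant remains
compressed = [c for c in (compress(d, toy_question) for d in toy_docs) if c]
print(compressed)
# ['We will use MATLAB in this class.']
```

An LLM-based extractor judges relevance far more flexibly than keyword overlap, which is why the real output above can also contain loosely related passages.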