# Vectorstores and Embeddings


We just discussed `Document Loading` and `Splitting`.

In [1]:
! pip3 install pypdf



In [2]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("data/machine_learning_linear_reg.pdf"),
    PyPDFLoader("data/machine_learning_Decision Tree.pdf"),
    PyPDFLoader("data/machine_learning_XGBoost.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [3]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)

In [5]:
len(splits)

9

## Embeddings

Let's take our splits and embed them.

In [6]:
# from langchain.embeddings.openai import OpenAIEmbeddings
from utils import SaladOllamaEmbeddings
embedding = SaladOllamaEmbeddings()

In [7]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [10]:
import numpy as np

In [13]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [14]:
np.dot(embedding1, embedding2)

11769.093646134284

In [None]:
np.dot(embedding1, embedding3)

0.7710630976675918

In [None]:
np.dot(embedding2, embedding3)

0.7596682675219103
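The raw dot products above depend on the length of each vector as well as its direction, so scores are only comparable when the embeddings are normalized. A minimal sketch of cosine similarity follows, assuming toy 3-dimensional vectors in place of real embeddings (the vectors and the `cosine_similarity` helper are illustrative, not part of the notebook). With NumPy, the same result comes from `np.dot` divided by the product of the two `np.linalg.norm` values.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both vectors to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only)
v_dogs = [2.0, 0.1, 0.0]
v_canines = [4.0, 0.3, 0.0]   # nearly the same direction as v_dogs, larger magnitude
v_weather = [0.0, 0.2, 3.0]   # points in a very different direction

sim_close = cosine_similarity(v_dogs, v_canines)
sim_far = cosine_similarity(v_dogs, v_weather)
print(sim_close > sim_far)
```

Because each vector is divided by its own length, the score always falls between -1 and 1 regardless of how large the raw embedding values are.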

## Vectorstores

In [15]:
! pip install chromadb



In [31]:
from langchain.vectorstores import Chroma

In [57]:
persist_directory = '/tmp/docs/chroma/'

In [51]:
!rm -rf /tmp/docs/chroma  # remove old database files if any (must match persist_directory)

In [56]:
!pwd

/Users/ashique/Playground/LLMDeepDive


In [1]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)


In [59]:
print(vectordb._collection.count())

9


### Similarity Search

In [None]:
question = "Which algorithm uses boosting technique ?"

In [None]:
docs = vectordb.similarity_search(question,k=3)

In [None]:
len(docs)

3

In [None]:
print(docs[0].page_content)

XGBoost   Introduc)on  XGBoost (Extreme Gradient Boos2ng) is a powerful and eﬃcient implementa2on of the gradient boos2ng algorithm. It is widely used for supervised learning tasks, including classiﬁca2on, regression, and ranking problems. XGBoost builds an ensemble of weak predic2on models, typically decision trees, to create a strong predic2ve model with high accuracy and generalizability.  How It Works  XGBoost sequen2ally builds a strong model by adding weak models that collec2vely minimize a predeﬁned loss func2on. It uses gradient boos2ng to op2mize the model by ﬁGng new models to the residuals of the previous models. XGBoost employs a combina2on of regulariza2on techniques and parallel processing to enhance model performance and reduce overﬁGng.  Mathema)cal Intui)on  XGBoost op2mizes the objec2ve func2on by compu2ng the ﬁrst and second deriva2ves of the loss func2on. It uses the Taylor series expansion to approximate the loss func2on, leading to a simpliﬁed yet eﬀec2ve op2miza2

Let's save this so we can use it later!

In [None]:
vectordb.persist()

## Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily.

But there are some failure modes that can creep up.

Here are some edge cases that can arise - we'll fix them in the next class.

In [None]:
question = "Does Linear Regression handle Outliers ?"

In [None]:
docs = vectordb.similarity_search(question,k=5)

Notice that we're getting duplicate chunks (because of the duplicate `machine_learning_linear_reg.pdf` in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [None]:
print(docs[0])

page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.' metadata={'page': 0, 'source': '/content/machine_learning_linear_reg.pdf'}


In [None]:
print(docs[1])

page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.' metadata={'page': 0, 'source': '/content/machine_learning_linear_reg.pdf'}
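One simple way to cope with exact duplicates like these is to drop repeated chunks after retrieval. The sketch below assumes a small `Doc` dataclass as a stand-in for a LangChain document, and `drop_duplicate_chunks` is a hypothetical helper, not a library function. It handles only verbatim repeats, not near-duplicates.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (hypothetical helper)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def drop_duplicate_chunks(docs):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

results = [
    Doc("linear regression is sensitive to outliers", {"page": 0}),
    Doc("linear regression is sensitive to outliers", {"page": 0}),
    Doc("decision trees can overfit", {"page": 1}),
]
unique_results = drop_duplicate_chunks(results)
print(len(results), len(unique_results))
```

Deduplicating after the search means fewer than `k` results may remain, so in practice one would request more candidates than needed.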


We can see a second failure mode.

The question below asks about Decision Trees, but the results include chunks from the XGBoost document as well.

In [None]:
question = "Why to use Decision Tree and When to use ?"

In [None]:
docs = vectordb.similarity_search(question,k=5)

In [None]:
for doc in docs:
    print(doc.metadata)

{'page': 1, 'source': '/content/machine_learning_Decision Tree.pdf'}
{'page': 0, 'source': '/content/machine_learning_Decision Tree.pdf'}
{'page': 0, 'source': '/content/machine_learning_Decision Tree.pdf'}
{'page': 0, 'source': '/content/machine_learning_XGBoost.pdf'}
{'page': 1, 'source': '/content/machine_learning_XGBoost.pdf'}


In [None]:
print(docs[4].page_content)

customiza2on and hyperparameter tuning, allowing users to op2mize model performance based on speciﬁc requirements. Disadvantages  Despite its advantages, XGBoost has certain limita2ons, such as the increased computa2onal complexity and longer training 2mes compared to simpler algorithms. It may require signiﬁcant computa2onal resources, limi2ng its applicability in resource-constrained environments. The black-box nature of the model can also hinder interpretability, making it challenging to understand the decision-making process for complex predic2ons.


## Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
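The idea can be sketched as a greedy loop: at each step, pick the candidate with the best trade-off between similarity to the query and similarity to what has already been selected. The code below is a minimal illustration on toy 2-dimensional vectors, not the vector store's own implementation. The function name `mmr_select` and the weight name `lambda_mult` are assumptions chosen for this example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of up to k documents balancing relevance and novelty."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            # Redundancy is the similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

query = [0.8, 0.6]
doc_vectors = [
    [1.0, 0.0],  # 0: most relevant
    [1.0, 0.0],  # 1: exact duplicate of 0
    [0.0, 1.0],  # 2: less relevant, but different from 0
]
picked = mmr_select(query, doc_vectors, k=2, lambda_mult=0.5)
print(picked)
```

With `lambda_mult=1.0` the redundancy term vanishes and the loop reduces to plain similarity ranking, which would return the duplicate pair.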

In [None]:
question = "Disadvantages of Decision tree algorithm"
docs_ss = vectordb.similarity_search(question,k=3)

In [None]:
docs_ss[0].page_content[:100]

'their versa-lity, decision trees can be prone to overﬁEng, especially when dealing with complex data'

In [None]:
docs_ss[1].page_content[:100]

'Decision trees oﬀer various advantages, including their interpretability and ease of understanding. '

Now compare these with the results from `MMR`.

In [None]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)



In [None]:
docs_mmr[0].page_content[:100]

'their versa-lity, decision trees can be prone to overﬁEng, especially when dealing with complex data'

In [None]:
docs_mmr[1].page_content[:100]

'Decision trees oﬀer various advantages, including their interpretability and ease of understanding. '

## Addressing Specificity: working with metadata

In the failure-modes section, we showed that a question about one document (Decision Trees) can return chunks from other documents as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.

In [None]:
question = "what did they say about XGBoost ?"

In [None]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"/content/machine_learning_XGBoost.pdf"}
)

In [None]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': '/content/machine_learning_XGBoost.pdf'}
{'page': 1, 'source': '/content/machine_learning_XGBoost.pdf'}
{'page': 0, 'source': '/content/machine_learning_XGBoost.pdf'}
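Conceptually, a metadata filter removes non-matching chunks from the candidate set before ranking by similarity. The sketch below illustrates that idea on plain dictionaries. The helper names, the short file names, and the similarity scores are all made up for illustration, and a real vector store applies the filter inside its index.

```python
def matches(metadata, where):
    """True when every key in `where` has an equal value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())

def filtered_search(chunks, scores, where, k=3):
    """Rank only the chunks whose metadata passes the filter, highest score first."""
    allowed = [(s, c) for s, c in zip(scores, chunks) if matches(c["metadata"], where)]
    allowed.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in allowed[:k]]

chunks = [
    {"text": "XGBoost intro", "metadata": {"source": "xgboost.pdf", "page": 0}},
    {"text": "Decision tree pros", "metadata": {"source": "decision_tree.pdf", "page": 0}},
    {"text": "XGBoost drawbacks", "metadata": {"source": "xgboost.pdf", "page": 1}},
]
scores = [0.70, 0.95, 0.80]  # made-up similarity scores for illustration
hits = filtered_search(chunks, scores, where={"source": "xgboost.pdf"}, k=3)
print([h["text"] for h in hits])
```

Note that the highest-scoring chunk overall is excluded because its `source` does not match, which is exactly the behaviour wanted when the question targets one document.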


In [None]:
docs[0].page_content

'XGBoost   Introduc)on  XGBoost (Extreme Gradient Boos2ng) is a powerful and eﬃcient implementa2on of the gradient boos2ng algorithm. It is widely used for supervised learning tasks, including classiﬁca2on, regression, and ranking problems. XGBoost builds an ensemble of weak predic2on models, typically decision trees, to create a strong predic2ve model with high accuracy and generalizability.  How It Works  XGBoost sequen2ally builds a strong model by adding weak models that collec2vely minimize a predeﬁned loss func2on. It uses gradient boos2ng to op2mize the model by ﬁGng new models to the residuals of the previous models. XGBoost employs a combina2on of regulariza2on techniques and parallel processing to enhance model performance and reduce overﬁGng.  Mathema)cal Intui)on  XGBoost op2mizes the objec2ve func2on by compu2ng the ﬁrst and second deriva2ves of the loss func2on. It uses the Taylor series expansion to approximate the loss func2on, leading to a simpliﬁed yet eﬀec2ve op2miza

In [None]:
docs[1].page_content

'customiza2on and hyperparameter tuning, allowing users to op2mize model performance based on speciﬁc requirements. Disadvantages  Despite its advantages, XGBoost has certain limita2ons, such as the increased computa2onal complexity and longer training 2mes compared to simpler algorithms. It may require signiﬁcant computa2onal resources, limi2ng its applicability in resource-constrained environments. The black-box nature of the model can also hinder interpretability, making it challenging to understand the decision-making process for complex predic2ons.'

In [None]:
docs[2].page_content

'It may require ﬁne-tuning of various hyperparameters to achieve the best results, making it more complex to implement compared to simpler algorithms. Addi2onally, the interpretability of the resul2ng models may be challenging due to the complexity of the ensemble learning process. Advantages  XGBoost oﬀers several advantages, including its ability to handle complex, high-dimensional data and its robustness against overﬁGng. It provides high predic2ve accuracy and generalizability, making it suitable for a wide range of real-world applica2ons. XGBoost is highly scalable and can eﬃciently handle large datasets. It also oﬀers ﬂexibility in terms of'

## Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
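To make the two extracted pieces concrete, the sketch below uses a keyword lookup as a stand-in for the LLM. Everything here is hypothetical: `StructuredQuery`, `KNOWN_SOURCES`, and `infer_structured_query` are names invented for this illustration. A real self-query retriever prompts a model to produce the query and filter instead.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredQuery:
    query: str                     # text to embed for the vector search
    source_filter: Optional[str]   # metadata value to filter on, if one was inferred

# Hypothetical keyword-to-source mapping; an LLM would infer this from the question
KNOWN_SOURCES = {
    "xgboost": "/content/machine_learning_XGBoost.pdf",
    "decision tree": "/content/machine_learning_Decision Tree.pdf",
    "linear regression": "/content/machine_learning_linear_reg.pdf",
}

def infer_structured_query(question):
    """Split a question into a search string plus an optional `source` filter."""
    lowered = question.lower()
    for keyword, source in KNOWN_SOURCES.items():
        if keyword in lowered:
            return StructuredQuery(query=question, source_filter=source)
    return StructuredQuery(query=question, source_filter=None)

sq = infer_structured_query("What are the points to remember in Decision Tree ?")
print(sq.source_filter)
```

The inferred filter is only useful if its value matches the stored metadata exactly, so the attribute descriptions given to the LLM should list the real `source` values.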

In [None]:
!pip install lark

Collecting lark
  Downloading lark-1.1.8-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.1.8


In [None]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
#import lark

In [None]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The document the chunk is from, should be one of `/content/machine_learning_XGBoost.pdf`, `/content/machine_learning_Decision Tree.pdf`, or `/content/machine_learning_linear_reg.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [None]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [None]:
question = "What are the points to remember in Decision Tree ?"

In [None]:
docs = retriever.get_relevant_documents(question)

In [None]:
for d in docs:
    print(d.metadata)

In [None]:
print(docs)

[]


## Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
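As a rough illustration of the idea, the sketch below keeps only the sentences of a chunk that share a content word with the question. This keyword overlap is a deliberately crude stand-in for the LLM-based extraction used in this section. The function name, the stopword list, and the sample text are assumptions made for the example.

```python
import re

def extract_relevant_sentences(text, question, min_overlap=1):
    """Keep sentences that share at least `min_overlap` content words with the question."""
    stopwords = {"what", "did", "they", "say", "about", "the", "is", "a", "of", "to", "and"}
    query_terms = {w for w in re.findall(r"[a-z]+", question.lower()) if w not in stopwords}
    kept = []
    # Split on whitespace that follows sentence-ending punctuation
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if len(words & query_terms) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)

chunk = (
    "Linear regression is sensitive to outliers. "
    "The weather was pleasant that day. "
    "Regression lines are fit by least squares."
)
compressed = extract_relevant_sentences(chunk, "what did they say about linear regression?")
print(compressed)
```

The unrelated middle sentence is dropped, so less text is passed downstream. An LLM-based compressor makes the same kind of cut using meaning rather than shared words.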

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [None]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [None]:
# Build an LLM-backed compressor that extracts only the query-relevant text
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [None]:
question = "what did they say about Linear Regression ?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

"its reliance on the linearity assumption between the dependent and independent variables" and "If the relationship between the variables is non-linear, the model may not accurately represent the data" and "linear regression is sensitive to outliers" and "its performance may be impacted by the presence of multicollinearity among the independent variables"
----------------------------------------------------------------------------------------------------
Document 2:

"its reliance on the linearity assumption between the dependent and independent variables" and "If the relationship between the variables is non-linear, the model may not accurately represent the data" and "linear regression is sensitive to outliers" and "its performance may be impacted by the presence of multicollinearity among the independent variables"
----------------------------------------------------------------------------------------------------
Document 3:

Linear Regression is a fundamental supervis

## Combining various techniques

In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [None]:
question = "what did they say about Linear Regression ?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

"its reliance on the linearity assumption between the dependent and independent variables" and "If the relationship between the variables is non-linear, the model may not accurately represent the data" and "linear regression is sensitive to outliers" and "its performance may be impacted by the presence of multicollinearity among the independent variables"
----------------------------------------------------------------------------------------------------
Document 2:

Linear Regression is a fundamental supervised learning algorithm used in the ﬁeld of sta5s5cs and machine learning. It is employed to establish the rela5onship between a dependent variable and one or more independent variables. The objec5ve of linear regression is to ﬁnd the best-ﬁ?ng straight line that can depict the rela5onship between the variables. This line serves as a predic5ve model for future data points.


## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
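To show what a term-based retriever does without any embedding model, here is a minimal TF-IDF ranking sketch in plain Python. It assumes whitespace tokenization and a smoothed IDF formula. The function names and sample texts are made up for the example, and a library implementation will differ in its details.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_rank(texts, query):
    """Return indices of `texts` sorted by TF-IDF cosine similarity to `query`, best first."""
    docs_tokens = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency; unseen terms do not divide by zero
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    def vectorize(tokens):
        return {term: count * idf(term) for term, count in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(term, 0.0) for term, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in docs_tokens]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

texts = [
    "xgboost builds an ensemble of weak decision trees",
    "linear regression fits a straight line",
    "decision trees split data on feature thresholds",
]
ranking = tfidf_rank(texts, "xgboost ensemble")
print(ranking[0])
```

Unlike embedding search, this only rewards exact word overlap, so a query phrased with synonyms would score zero against every text.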

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Load PDF
loader = PyPDFLoader("/content/machine_learning_XGBoost.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

In [None]:
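# Build an SVM retriever over the text splits (requires scikit-learn)
svm_retriever = SVMRetriever.from_texts(splits, embedding)
question = "What are major concepts for XGBoost topic ?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]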

In [None]:
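# Build a TF-IDF retriever over the same text splits (requires scikit-learn)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
question = "what did they say about XGBoost ?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]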

In [60]:
import shutil

# Folder path to be zipped
folder_path = '/tmp/docs/chroma'  # Replace with your folder path

# Destination path for the zip file
output_zip_path = '/Users/ashique/Playground/LLMDeepDive/docs/persistent_chroma_db'  # Replace with your desired output path and filename

# Use shutil library to create the zip file
shutil.make_archive(output_zip_path, 'zip', folder_path)


'/Users/ashique/Playground/LLMDeepDive/docs/persistent_chroma_db.zip'