# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) system.

Let's use our vectorDB from before.

## Vectorstore retrieval


In [1]:
! pip3 install langchain chromadb pypdf

Collecting chromadb
  Downloading chromadb-0.5.0-py3-none-any.whl.metadata (7.3 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp311-cp311-macosx_11_0_arm64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.3-py3-none-any.whl.metadata (24 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.29.0-py3-none-any.whl.metadata (6.3 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Using cached posthog-3.5.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Using cached onnxruntime-1.17.3-cp311-cp311-macosx_11_0_universal2.whl.metadata (4.4 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.24.0-py3-none-any.whl.metadata (1.3 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (fro

In [2]:
%pip install lark

Collecting lark
  Downloading lark-1.1.9-py3-none-any.whl.metadata (1.9 kB)
Downloading lark-1.1.9-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.1.9
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Unzipping and setting up ChromaDB used in previous notebook
import zipfile
import os

# Path to the zip file
zip_file_path = 'data/chroma.zip'  # Replace with the path to your zip file

# Destination folder for the extracted contents
extracted_folder_path = 'chroma'  # Replace with the path where you want to extract the contents

# Create the destination directory if it does not exist
os.makedirs(extracted_folder_path, exist_ok=True)

# Extract the contents of the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_folder_path)


### Similarity Search

In [1]:
from langchain.vectorstores import Chroma
from utils import SaladOllamaEmbeddings
persist_directory = 'chroma'

In [2]:
embedding = SaladOllamaEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [3]:
print(vectordb._collection.count())

12


In [4]:
texts = [
    """The pineapple, a tropical fruit with a spiky, tough exterior, is known for its sweet and tangy flavor, making it a popular choice for refreshing beverages and desserts.""",
    """Mangos, with their juicy yellow-orange flesh and sweet aroma, are a beloved tropical fruit that adds a delightful burst of flavor to salads, smoothies, and chutneys.""",
    """The avocado, a versatile and creamy fruit, is known for its high nutrient content, making it a popular choice for salads, sandwiches, and the ever-popular guacamole dip.""",
]

In [5]:
chroma_db = Chroma.from_texts(texts, embedding=embedding)

In [6]:
question = "Which fruit is known for its creamy nature ?"

In [7]:
chroma_db.similarity_search(question, k=2)

[Document(page_content='The pineapple, a tropical fruit with a spiky, tough exterior, is known for its sweet and tangy flavor, making it a popular choice for refreshing beverages and desserts.'),
 Document(page_content='Mangos, with their juicy yellow-orange flesh and sweet aroma, are a beloved tropical fruit that adds a delightful burst of flavor to salads, smoothies, and chutneys.')]

In [8]:
chroma_db.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='The pineapple, a tropical fruit with a spiky, tough exterior, is known for its sweet and tangy flavor, making it a popular choice for refreshing beverages and desserts.'),
 Document(page_content='Mangos, with their juicy yellow-orange flesh and sweet aroma, are a beloved tropical fruit that adds a delightful burst of flavor to salads, smoothies, and chutneys.')]

### Addressing Diversity: Maximum marginal relevance

In the last class, we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
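To make the trade-off concrete, here is a minimal pure-Python sketch of MMR selection. The function name `mmr_select`, the toy 2-D vectors, and the `lambda_mult` weighting are illustrative assumptions; this is not necessarily how Chroma or LangChain implement it internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to documents already picked.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to any already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (hypothetical): docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # diversity weighted in
```

With relevance only, the two near-duplicates are returned together; with a lower `lambda_mult`, the second slot goes to the less similar document.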

In [10]:
question = "What is limitations of XGBoost"
docs_ss = vectordb.similarity_search(question, k=3)

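InvalidDimensionException: Embedding dimension 4096 does not match collection dimensionality 1536

This error means the query was embedded with a model that produces 4096-dimensional vectors, while the persisted collection stores 1536-dimensional vectors. Queries must use the same embedding model that built the collection. The outputs shown in the following cells appear to come from a run in which the two matched.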

In [None]:
docs_ss[0].page_content[:200]

'customization and hyperparameter tuning, allowing users to optimize model performance based on specific requirements. Disadvantages  Despite its advantages, XGBoost has certain limitations, such as the in'

In [None]:
docs_ss[1].page_content[:200]

'It may require fine-tuning of various hyperparameters to achieve the best results, making it more complex to implement compared to simpler algorithms. Additionally, the interpretability of the resulting '

Note the difference in results with `MMR`.

In [None]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)



In [None]:
docs_mmr[0].page_content[:100]

'customization and hyperparameter tuning, allowing users to optimize model performance based on specific'

In [None]:
docs_mmr[1].page_content[:100]

'their versatility, decision trees can be prone to overfitting, especially when dealing with complex data'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
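As a rough illustration of the idea, the sketch below filters chunks on their metadata first and only then ranks the survivors by similarity. The chunk texts, the 2-D vectors, the short file names, and the helper `filtered_search` are all hypothetical; a real vectorstore applies the filter inside its own index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical chunks: each has a toy embedding plus metadata.
chunks = [
    {"text": "Linear regression is simple to interpret.", "vec": [0.9, 0.1],
     "metadata": {"source": "linear_reg.pdf", "page": 1}},
    {"text": "XGBoost offers extensive hyperparameter tuning.", "vec": [0.8, 0.3],
     "metadata": {"source": "xgboost.pdf", "page": 0}},
    {"text": "Linear regression is sensitive to outliers.", "vec": [0.5, 0.7],
     "metadata": {"source": "linear_reg.pdf", "page": 0}},
]


def filtered_search(query_vec, chunks, k, where):
    """Keep chunks whose metadata matches every key in `where`, then rank by similarity."""
    matches = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in where.items())
    ]
    matches.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return matches[:k]


results = filtered_search([1.0, 0.0], chunks, k=2, where={"source": "linear_reg.pdf"})
for r in results:
    print(r["metadata"])
```

The XGBoost chunk is close to the query vector, but it never reaches the ranking step because its `source` does not match the filter.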

In [None]:
question = "What are the advantages of Linear Regression ?"

In [None]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"/content/machine_learning_linear_reg.pdf"}
)

In [None]:
for d in docs:
    print(d.metadata)

{'page': 1, 'source': '/content/machine_learning_linear_reg.pdf'}
{'page': 1, 'source': '/content/machine_learning_linear_reg.pdf'}
{'page': 0, 'source': '/content/machine_learning_linear_reg.pdf'}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
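The sketch below shows the shape of what a self-query step produces: a search string plus a metadata filter. The keyword lookup in `toy_self_query` is a rule-based stand-in for the LLM, and `StructuredQuery` and `KNOWN_SOURCES` are hypothetical names made up for this illustration. The real retriever can also rewrite the query string, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    """A search string plus a metadata filter inferred from the question."""
    query: str
    filter: dict = field(default_factory=dict)


# Assumed mapping from topic phrases to `source` values, for this sketch only.
KNOWN_SOURCES = {
    "linear regression": "/content/machine_learning_linear_reg.pdf",
    "decision tree": "/content/machine_learning_Decision Tree.pdf",
    "xgboost": "/content/machine_learning_XGBoost.pdf",
}


def toy_self_query(question):
    """Rule-based stand-in for the LLM: infer a `source` filter from the question text."""
    lowered = question.lower()
    for phrase, source in KNOWN_SOURCES.items():
        if phrase in lowered:
            return StructuredQuery(query=question, filter={"source": source})
    return StructuredQuery(query=question)  # no filter could be inferred


sq = toy_self_query("What are the advantages of Linear Regression ?")
print(sq.query)
print(sq.filter)
```

The resulting filter has the same form as one written by hand for `similarity_search`; the difference is only in who writes it.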

In [None]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [None]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `/content/machine_learning_linear_reg.pdf`, `/content/machine_learning_Decision Tree.pdf`, or `/content/machine_learning_XGBoost.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [None]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [None]:
question = "What are the advantages of Linear Regression ?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [None]:
docs = retriever.get_relevant_documents(question)

In [None]:
for d in docs:
    print(d.metadata)

{'page': 1, 'source': '/content/machine_learning_linear_reg.pdf'}
{'page': 1, 'source': '/content/machine_learning_linear_reg.pdf'}
{'page': 0, 'source': '/content/machine_learning_linear_reg.pdf'}
{'page': 0, 'source': '/content/machine_learning_linear_reg.pdf'}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
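The sketch below illustrates the idea with a deliberately crude heuristic: keep only the sentences that share a keyword with the question. The function `toy_compress`, its stopword list, and the sample text are assumptions for illustration; an LLM-based compressor judges relevance far more flexibly than keyword overlap.

```python
import re


def toy_compress(document, question, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` keywords with the question.

    This keyword heuristic is a stand-in for an LLM-based extraction step.
    """
    stopwords = {"what", "are", "is", "the", "of", "a", "an", "and", "to", "in"}
    keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if w not in stopwords}
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        s for s in sentences
        if len(keywords & set(re.findall(r"[a-z]+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


doc = (
    "The course meets twice a week. "
    "Decision trees tend to overfit the training data. "
    "Slides are posted after each session."
)
print(toy_compress(doc, "What are the limitations of decision trees?"))
```

Only the sentence about decision trees survives, so the downstream LLM call receives a shorter and more focused context.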

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [None]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [None]:
# Build an LLM-based compressor that extracts only the query-relevant passages
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [None]:
question = "What are limitations of Linear Regression and Decision Tree models ?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

Decision trees offer various advantages, including their interpretability and ease of understanding. They can handle both numerical and categorical data, making them suitable for a wide range of applications. Decision trees require minimal data preprocessing and can handle missing values. They are also robust to outliers and do not require feature scaling. Furthermore, decision trees can provide insights into the most critical features driving the decision-making process.  Disadvantages  One of the main drawbacks of decision trees is their tendency to overfit the training data, leading to poor generalization on unseen data. They may not capture complex relationships well, especially when dealing with high-dimensional data. Additionally, small changes in the data can lead to significant changes in the resulting tree structure, making them less stable compared to other algorithms.
----------------------------------------------------------------------------------------------------
Doc

## Combining various techniques

In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [None]:
question = "what did they say about Decision Tree?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

They may create overly complex trees that fail to generalize well to unseen data. Decision trees are also sensitive to small variations in the training data and can be unstable, leading to different results with slight changes in the input data. Additionally, decision trees can struggle to capture relationships between features that are not explicitly represented in the data.
----------------------------------------------------------------------------------------------------
Document 2:

Decision trees are versatile supervised learning algorithms used for both classification and regression tasks in machine learning. They mimic the human decision-making process by creating a model that predicts the value of a target variable based on several input features. Decision trees partition the data into subsets based on the selected features, with each partition representing a node in the tree. The algorithm selects the feature that best separates the data points, creating branches

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
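To show that retrieval does not need embeddings at all, here is a small pure-Python sketch of TF-IDF ranking. The helper names `tokenize` and `tfidf_rank` and the three sample sentences are hypothetical, and the scoring is simplified; it is not the library's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_rank(query, docs):
    """Return document indices sorted by a simple TF-IDF score for the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Sum term frequency times inverse document frequency over matching terms.
        score = sum(
            (tf[term] / len(tokens)) * math.log(n_docs / df[term])
            for term in query_terms
            if term in tf
        )
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


docs = [
    "Linear regression fits a straight line to the data.",
    "Decision trees split the data into subsets.",
    "XGBoost builds an ensemble of boosted trees.",
]
ranking = tfidf_rank("linear regression advantages", docs)
print(docs[ranking[0]])
```

Because the score depends only on shared words, a query phrased with synonyms the documents never use would score zero everywhere, which is the main trade-off against embedding-based search.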

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Load PDF
loader = PyPDFLoader("/content/machine_learning_linear_reg.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [None]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [None]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content='Linear Regression  Introduction  Linear regression is a fundamental supervised learning algorithm used in the field of statistics and machine learning. It is employed to establish the relationship between a dependent variable and one or more independent variables. The objective of linear regression is to find the best-fitting straight line that can depict the relationship between the variables. This line serves as a predictive model for future data points.  How It Works  Linear regression works by minimizing the vertical distances between the observed data points and the predicted values generated by the linear approximation. It accomplishes this through the method of least squares, which involves minimizing the sum of the squares of the differences between the observed and predicted values. The algorithm computes the slope and intercept of the line that minimizes the overall error, thereby determining the best-fit line.  Mathematical Intuition  The equation for a simple linear reg

In [None]:
question = "what did they say about advantages of Linear regression model ?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content='including its reliance on the linearity assumption between the dependent and independent variables. If the relationship between the variables is non-linear, the model may not accurately represent the data. Additionally, linear regression is sensitive to outliers, and its performance may be impacted by the presence of multicollinearity among the independent variables.   Advantages  Despite its limitations, linear regression offers various advantages. It provides a simple and interpretable framework for understanding the relationship between variables. It is computationally efficient and well-suited for scenarios where the relationship between variables can be adequately captured by a linear model. Moreover, linear regression serves as a fundamental building block for more complex regression models and is widely used for predictive analytics and forecasting tasks.  Disadvantages  One of the significant drawbacks of linear regression is its inability to capture complex relationships