# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) system.


## Vectorstore retrieval


In [None]:
! pip3 install langchain chromadb pypdf

In [None]:
%pip install lark

### Similarity Search

In [114]:
from langchain.vectorstores import Chroma
from utils import SaladOllamaEmbeddings

In [115]:
embedding = SaladOllamaEmbeddings()

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a popular programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Data science involves extracting insights from data."
]

In [119]:
chroma_db = Chroma.from_texts(documents, embedding=embedding)
# chroma_db.delete(chroma_db.get()["ids"])

In [120]:
print(chroma_db._collection.count())

4


In [123]:
question  = "python programming language ?"
chroma_db.similarity_search(question, k=2)

[Document(page_content='Python is a popular programming language.'),
 Document(page_content='Machine learning is a subset of artificial intelligence.')]

In [124]:
chroma_db.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(page_content='Python is a popular programming language.'),
 Document(page_content='Machine learning is a subset of artificial intelligence.')]

## Loading PDFs into Chroma

In [126]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("data/machine_learning_linear_reg.pdf"),
    PyPDFLoader("data/machine_learning_linear_reg.pdf"),
    PyPDFLoader("data/machine_learning_Decision Tree.pdf"),
    PyPDFLoader("data/machine_learning_XGBoost.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [127]:
docs

[Document(page_content='                  Linear Regression  Introduc)on  Linear regression is a fundamental supervised learning algorithm used in the ﬁeld of sta5s5cs and machine learning. It is employed to establish the rela5onship between a dependent variable and one or more independent variables. The objec5ve of linear regression is to ﬁnd the best-ﬁ?ng straight line that can depict the rela5onship between the variables. This line serves as a predic5ve model for future data points.  How It Works  Linear regression works by minimizing the ver5cal distances between the observed data points and the predicted values generated by the linear approxima5on. It accomplishes this through the method of least squares, which involves minimizing the sum of the squares of the diﬀerences between the observed and predicted values. The algorithm computes the slope and intercept of the line that minimizes the overall error, thereby determining the best-ﬁt line.  Mathema)cal Intui)on  The equa5on for 

In [128]:

# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
len(splits)


12

In [132]:
# vectordb.delete(ids=vectordb.get()["ids"])

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory = "retrieval_db"
)

In [130]:
print(vectordb._collection.count())

24


### Addressing Diversity: Maximum marginal relevance

In the last class, we raised one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` (MMR) strives to achieve both relevance to the query *and diversity* among the results.
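
To make the idea concrete, here is a minimal pure-Python sketch of greedy MMR selection. It is an illustration under stated assumptions, not the code the vectorstore runs: the 2-D vectors are made-up toy embeddings, and `mmr_select`, `cosine`, and `lambda_mult=0.5` are names and values chosen for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings (assumed values): documents 0 and 1 are exact duplicates.
toy_query = [1.0, 0.0]
toy_vectors = [[0.9, 0.1], [0.9, 0.1], [0.8, -0.6]]
picked = mmr_select(toy_query, toy_vectors, k=2)
print(picked)
```

The duplicate at index 1 is skipped in favour of the less similar document at index 2, which is the behaviour we want from messy data like a PDF loaded twice.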

In [48]:
question = "What are the advantages of Linear Regression "
docs_ss = vectordb.similarity_search(question,k=3)

In [49]:
docs_ss

[Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='Advantages  Despite its limita5ons, linear regre

In [50]:
docs_ss[0].page_content[:200]

'including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the da'

In [51]:
docs_ss[1].page_content[:200]

'including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the da'

The first two similarity-search results are identical, because the same PDF was loaded twice. Note the difference in results with `MMR`.

In [52]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


In [53]:
docs_mmr

[Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='their versa-lity, decision trees can be prone to overﬁEng, especially when dealing with complex datasets. They may create overly complex trees that fail to generalize well to unseen data. Decision trees are also sensi-ve to small varia-ons in the training data and can be unstable, leading to diﬀerent results with slight changes in the input data. Addi-onally, decision trees can struggle to capture rela-onships between features that are not explicitly represented in the data.  Advantages', metadata={'page': 

In [54]:
docs_mmr[0].page_content[:100]

'including its reliance on the linearity assump5on between the dependent and independent variables. I'

In [55]:
docs_mmr[1].page_content[:100]

'their versa-lity, decision trees can be prone to overﬁEng, especially when dealing with complex data'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can also return results from other lectures.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
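
Conceptually, a metadata filter narrows the candidate set before ranking. The sketch below shows that idea in plain Python. It is only an illustration: `filter_then_rank`, the chunk texts, and the scores are hypothetical, and a real vectorstore applies the filter inside its own index.

```python
def filter_then_rank(chunks, scores, metadata_filter, k=3):
    """Keep chunks whose metadata matches every filter key, then take the top k by score."""
    kept = [
        (score, chunk)
        for chunk, score in zip(chunks, scores)
        if all(chunk["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in kept[:k]]

# Toy chunks with made-up text and similarity scores.
toy_chunks = [
    {"text": "Decision trees can overfit.",
     "metadata": {"source": "data/machine_learning_Decision Tree.pdf", "page": 0}},
    {"text": "Linear regression is interpretable.",
     "metadata": {"source": "data/machine_learning_linear_reg.pdf", "page": 1}},
    {"text": "Linear regression fits a straight line.",
     "metadata": {"source": "data/machine_learning_linear_reg.pdf", "page": 0}},
]
toy_scores = [0.9, 0.7, 0.8]

hits = filter_then_rank(
    toy_chunks,
    toy_scores,
    {"source": "data/machine_learning_linear_reg.pdf"},
    k=2,
)
for hit in hits:
    print(hit["metadata"])
```

Even though the decision-tree chunk has the highest score, it never reaches the ranking step because its `source` does not match.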

In [133]:
question = "What are the advantages of Linear Regression ?"

In [134]:
docs = vectordb.max_marginal_relevance_search(
    question,
    k=3,
    filter={"source":"data/machine_learning_linear_reg.pdf"}
)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


In [135]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}
{'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}
{'page': 1, 'source': 'data/machine_learning_linear_reg.pdf'}


In [136]:
docs

[Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='Linear Regression  Introduc)on  Linear regression is a fundamental supervised learning algorithm used in the ﬁeld of sta5s5cs and machine learning. It is employed to establish the rela5onship between a dependent variable and one or more independent variables. The objec5ve of linear regression is to ﬁnd the best-ﬁ?ng straight line that can depict the rela5onship between the variables. This line serves as a predic5ve model for future data points.  How It Works  Linear regression works by minimizing the ver5ca

### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
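
As a rough picture of what "compression" means here, the sketch below keeps only the sentences of a chunk that share a content word with the query. This keyword overlap is a deliberately crude stand-in for the LLM-based extraction; `extract_relevant_sentences`, the stopword list, and the sample chunk are all assumptions made for this example.

```python
import re

def extract_relevant_sentences(text, query):
    """Keep only sentences that share at least one content word with the query."""
    stopwords = {"what", "are", "the", "of", "and", "a", "is", "to", "in"}
    query_terms = {
        word for word in re.findall(r"[a-z]+", query.lower())
        if word not in stopwords
    }
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        sentence for sentence in sentences
        if query_terms & set(re.findall(r"[a-z]+", sentence.lower()))
    ]

# Made-up chunk mixing relevant and irrelevant sentences.
toy_chunk = (
    "Decision trees can be prone to overfitting. "
    "The weather was pleasant that day. "
    "Linear regression is sensitive to outliers."
)
compressed = extract_relevant_sentences(
    toy_chunk, "What are limitations of Decision Tree models?"
)
print(compressed)
```

An LLM extractor can match on meaning rather than exact words, so it would be able to keep a sentence about "overfitting" even when the query never uses that term.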

In [77]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

from utils import SaladChatOllama
llm = SaladChatOllama()

In [78]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [79]:
# Wrap our vectorstore
compressor = LLMChainExtractor.from_llm(llm)

In [80]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [81]:
question = "What are limitations of Linear Regression and Decision Tree models ?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

The following parts of the context are relevant to answer the question about the limitations of Linear Regression and Decision Tree models:

* "their versatility, decision trees can be prone to overfitting, especially when dealing with complex datasets."
* "They may create overly complex trees that fail to generalize well to unseen data."
* "Decision trees are also sensitive to small variations in the training data and can be unstable, leading to different results with slight changes in the input data."
* "decision trees can struggle to capture relationships between features that are not explicitly represented in the data."
----------------------------------------------------------------------------------------------------
Document 2:

The following parts of the context are relevant to the question:

* The limitations of Linear Regression include:
	+ Reliance on the linearity assumption between the dependent and independent variables (non-linear relationships may not be ac

## Combining various techniques

In [82]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [83]:
question = "what did they say about Decision Tree?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


Document 1:

The following parts of the context are relevant to answer the question about Decision Trees:

* "overly complex trees"
* "sensitive to small variations in the training data"
* "struggle to capture relationships between features that are not explicitly represented in the data"
----------------------------------------------------------------------------------------------------
Document 2:

The following parts of the context are relevant to the question:

* The reliance of Decision Trees on the linearity assumption between the dependent and independent variables.
* The sensitivity of linear regression to outliers.
* The impact of multicollinearity among independent variables on the performance of linear regression.
----------------------------------------------------------------------------------------------------
Document 3:

The following parts of the context are relevant to answer the question about Decision Trees:

1. Linear Regression introduction: The context provides a

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
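
To show what TF-IDF retrieval does without any embedding model, here is a small standard-library sketch. It is not the library's implementation: `tfidf_rank`, `tokenize`, and the smoothed IDF formula are choices made for this example, and the texts are the four toy sentences from the similarity search section.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(query, texts):
    """Return document indices sorted by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(text) for text in texts]
    n_docs = len(texts)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def vectorize(tokens):
        counts = Counter(tokens)
        # Smoothed IDF (an assumed variant) so unseen terms do not divide by zero.
        return {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }

    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))

    query_vec = vectorize(tokenize(query))
    scores = []
    for tokens in tokenized:
        doc_vec = vectorize(tokens)
        dot = sum(doc_vec.get(term, 0.0) * w for term, w in query_vec.items())
        denom = norm(doc_vec) * norm(query_vec)
        scores.append(dot / denom if denom else 0.0)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

toy_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a popular programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Data science involves extracting insights from data.",
]
ranking = tfidf_rank("python programming language", toy_texts)
print(toy_texts[ranking[0]])
```

Because the score depends only on shared words, this kind of retriever finds exact term matches well but cannot relate a query to a chunk that uses different vocabulary.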

In [84]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [85]:
# Load PDF
loader = PyPDFLoader("data/machine_learning_linear_reg.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


In [None]:
! pip install scikit-learn

In [89]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [90]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]



Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.   Advantages  Despite its limita5ons, linear regression oﬀers various advantages. It provides a simple and interpretable framework for understanding the rela5onship between variables. It is computa5onally eﬃcient and well-suited for scenarios where the rela5onship between variables can be adequately captured by a linear model. Moreover, linear regression serves as a fundamental building block for more complex regression models and is widely used for predic5ve analy5cs and forecas5ng tasks.  Disadvantages  One of the signiﬁcant drawbacks of linear regression is its inability to capture complex rela5onships

In [91]:
question = "what did they say about advantages of Linear regression model ?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.   Advantages  Despite its limita5ons, linear regression oﬀers various advantages. It provides a simple and interpretable framework for understanding the rela5onship between variables. It is computa5onally eﬃcient and well-suited for scenarios where the rela5onship between variables can be adequately captured by a linear model. Moreover, linear regression serves as a fundamental building block for more complex regression models and is widely used for predic5ve analy5cs and forecas5ng tasks.  Disadvantages  One of the signiﬁcant drawbacks of linear regression is its inability to capture complex rela5onships

### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
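
The two-part output described above can be pictured as a small structured object. In the sketch below, a regular expression plays the role of the LLM and only understands four-digit years; `StructuredQuery`, `toy_parse`, and `apply_filter` are hypothetical names for illustration, not part of `SelfQueryRetriever`.

```python
import re
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    query: str                                   # text used for the vector search
    filter: dict = field(default_factory=dict)   # metadata constraints

def toy_parse(question):
    """Hypothetical stand-in for the LLM: move a four-digit year into the filter."""
    match = re.search(r"\b(19|20)\d{2}\b", question)
    if not match:
        return StructuredQuery(query=question)
    remaining = question[:match.start()] + question[match.end():]
    return StructuredQuery(
        query=" ".join(remaining.split()),
        filter={"year_of_launch": int(match.group())},
    )

def apply_filter(records, structured):
    """Keep records whose metadata equals every value in the filter."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in structured.filter.items())
    ]

# Metadata-only toy records.
toy_records = [
    {"name": "Mangalyaan", "year_of_launch": 2013},
    {"name": "Chandrayaan-2", "year_of_launch": 2019},
]
parsed = toy_parse("missions launched in 2019")
matches = apply_filter(toy_records, parsed)
print(parsed)
print(matches)
```

The real retriever relies on the LLM to emit a valid filter over the declared attributes, so its reliability depends on how well the model follows the expected output format.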

In [107]:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

from utils import SaladChatOllama,SaladOllamaEmbeddings

document_content_description = "India's space missions"
llm = SaladChatOllama(temperature=0)
embeddings = SaladOllamaEmbeddings()

In [106]:
metadata_field_info = [
    AttributeInfo(
        name="name",
        description="Name of the space mission",
        type="string",
    ),
    AttributeInfo(
        name="year_of_launch",
        description="Year of launch of the space mission",
        type="integer",
    ),
    AttributeInfo(
        name="place_of_launch",
        description="Place of launch of the space mission",
        type="string",
    ),
]


In [108]:
from langchain_core.documents import Document

docs = [
    Document(
        page_content="India's first interplanetary mission, aimed to study Mars",
        metadata={"name": "Mangalyaan", "year_of_launch": 2013, "place_of_launch": "Sriharikota"},
    ),
    Document(
        page_content="India's lunar probe mission, aimed to explore the Moon's south pole region",
        metadata={"name": "Chandrayaan-2", "year_of_launch": 2019, "place_of_launch": "Sriharikota"},
    ),
    Document(
        page_content="India's first mission to Mars, designed to demonstrate the country's technological capability",
        metadata={"name": "Mars Orbiter Mission", "year_of_launch": 2013, "place_of_launch": "Sriharikota"},
    ),
    Document(
        page_content="India's first satellite launch vehicle, used for placing satellites into orbit",
        metadata={"name": "SLV-3", "year_of_launch": 1980, "place_of_launch": "Sriharikota"},
    ),
    Document(
        page_content="India's first successful satellite launch vehicle, used for placing satellites into geostationary orbit",
        metadata={"name": "GSLV Mk II", "year_of_launch": 2003, "place_of_launch": "Sriharikota"},
    ),
    Document(
        page_content="India's geostationary launch vehicle, used for placing satellites into geostationary orbit",
        metadata={"name": "GSLV Mk III", "year_of_launch": 2014, "place_of_launch": "Sriharikota"},
    ),
]
vectorstore = Chroma.from_documents(docs, embeddings,persist_directory="selfquery")

In [109]:
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True
)

In [112]:
question = "India's first mission to Mars?"

In [113]:
docs = retriever.get_relevant_documents(question)

OutputParserException: Parsing text
To structure the user's query to match the request schema, we need to analyze the query and identify the relevant attributes and operations that can be applied to filter the data.

In this case, the user's query is "What are songs that were not published on Spotify". Based on the provided data source, we can identify the following attributes:

* artist
* length
* genre

We can also identify the logical operations that can be applied to filter the data:

* and (logical conjunction)
* or (logical disjunction)

Based on the user's query, we can construct the following structured request:
```json
{
    "query": "",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), or(eq(\"artist\", \"Katy Perry\")))), NO_FILTER"
}
```
Explanation:

* The query string is empty because the user did not provide any text to compare with the document contents.
* The filter string is "and(or(eq(\"artist\", \"Taylor Swift\"), or(eq(\"artist\", \"Katy Perry\")))), NO_FILTER". This filter applies the logical operation "and" (logical conjunction) to the two comparison statements. The first statement compares the "artist" attribute with "Taylor Swift", and the second statement compares the "artist" attribute with "Katy Perry". Both statements must be true for the document to match the filter. The "NO_FILTER" value is returned if there are no filters that should be applied.

Note: In this example, we only used the logical operators and (logical conjunction) and or (logical disjunction) listed in the request schema. We did not use any other operators or attributes.
 raised following error:
Received invalid attributes artist. Allowed attributes are ['name', 'year_of_launch', 'place_of_launch']