# Retrieval

The steps in this notebook include: 
- **Use the LangChain Chroma vectorstore and LangChain retrievers** 

## Contents
1. [Installation](#installation)
2. [Similarity Search](#similarity)
3. [Maximum marginal relevance (MMR)](#MRR)  
4. [SelfQuery](#selfquery)
5. [Compression](#compression)
6. [Combining various techniques](#combining)
7. [Other types of retrieval](#others)

**Source:** https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/5/retrieval

![overview.png](./images/overview.png)

# **Installation** <a name="installation"></a>

In [2]:
!pip install -U langchain openai python-dotenv lark

Collecting lark
  Downloading lark-1.1.8-py3-none-any.whl.metadata (1.9 kB)
Downloading lark-1.1.8-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.1.8


In [22]:
import os
import openai
import sys

sys.path.append('../..')

# Load the API key from a local .env file rather than hardcoding it in the notebook
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

# **Similarity Search** <a name="similarity"></a>

In [None]:
#!rm -rf ./db/chroma  # remove old database files if any

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

persist_directory = 'db/chroma/'

embedding = OpenAIEmbeddings()

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

print(vectordb._collection.count())

<div class="alert alert-info"> 💡<b>ChromaDB:</b>  
    
<b>ChromaDB</b> is an open-source vector store used for storing and retrieving vector embeddings. Its main use is to save embeddings along with metadata to be used later by large language models. Additionally, it can also be used for semantic search engines over text data. <a href="https://docs.trychroma.com/">More</a>.  
    
(The <code>chromadb</code> Python package must be installed.)

<pre>
# save to disk
db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
docs = db.similarity_search(query)

# load from disk
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
docs = db2.similarity_search(query)
</pre>
</div>



In [None]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

<div class="alert alert-info"> 💡<b>ChromaDB:</b>   

Create a Chroma vectorstore from a list of <code>Documents</code>:
<code>from_documents(documents[, embedding, ids, ...])</code>

Create a Chroma vectorstore from raw texts:
<code>from_texts(texts[, embedding, metadatas, ...])</code>

</div>


In [None]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

print(smalldb.similarity_search(question, k=2),"\n")
print(smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3))

<div class="alert alert-info"> <b>Similarity Search:</b>  
    
    Run similarity search with Chroma.

        Args:
            - query (str): Query text to search for.
            - k (int): Number of results to return. Defaults to 4.
            - filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
        
        Returns:
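            - List[Document]: List of documents most similar to the query text. Similarity is measured by embedding distance; Chroma's default metric is squared L2 unless the collection is configured otherwise.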
    
<br/>  
<b>Similarity search by vector:</b> It is also possible to do a search for documents similar to a given embedding vector using <code>similarity_search_by_vector</code> which accepts an embedding vector as a parameter instead of a string. <br/> 

  
<br/> 
<b>Maximum marginal relevance search (MMR):</b>  

    Return docs selected using the maximal marginal relevance. Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents.

        Parameters
            - query: Text to look up documents similar to.
            - k: Number of Documents to return. Defaults to 4.
            - fetch_k: Number of Documents to fetch to pass to MMR algorithm.
            - lambda_mult: Number between 0 and 1 that determines the degree
                        of diversity among the results with 0 corresponding
                        to maximum diversity and 1 to minimum diversity.
                        Defaults to 0.5.
            - filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

        Returns:
            - List of Documents selected by maximal marginal relevance.

<br/>  
<b>Maximum marginal relevance search (MMR) by vector:</b> It is also possible to run MMR on a vector with <code>max_marginal_relevance_search_by_vector</code>, which accepts an embedding vector as a parameter instead of a string. <br/> 
<br/>
<b><i>Note:</i></b> The MMR algorithm uses the <code>maximal_marginal_relevance()</code> function to calculate the maximal marginal relevance (similarity is measured with <i>cosine similarity</i>). It is one of the <i>utility functions</i> for working with vectors and vectorstores (<a href="https://api.python.langchain.com/en/latest/_modules/langchain/vectorstores/utils.html">langchain.vectorstores.utils</a>).

</div>
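
To make the idea of a similarity search concrete, here is a minimal pure-Python sketch of top-k ranking by cosine similarity. It is not Chroma's implementation: the helper names `cosine_similarity` and `top_k_by_similarity` and the 2-D vectors are hypothetical, and the metric a real collection uses depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_similarity(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical 2-D "embeddings", for illustration only.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k_by_similarity([1.0, 0.0], doc_vecs, k=2))  # prints [0, 1]
```

A vector store does the same ranking over stored embeddings, typically with an approximate index so that it does not have to score every vector.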


# **Maximum marginal relevance (MMR)** <a name="MRR"></a>


In `Vectorstores_&_Embeddings_03.ipynb` we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.
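
The trade-off between relevance and diversity can be sketched as a greedy selection loop. This is a simplified pure-Python illustration, not LangChain's code: `mmr_select`, `cosine`, and the example vectors are hypothetical, and the scoring rule (relevance weighted by `lambda_mult`, minus redundancy weighted by `1 - lambda_mult`) is the commonly used MMR formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Greedily pick k indices, trading query relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Hypothetical vectors: documents 0 and 1 are exact duplicates, document 2 differs.
doc_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [0.8, 0.6]
print(mmr_select(query, doc_vecs, k=2, lambda_mult=0.5))  # prints [0, 2]
print(mmr_select(query, doc_vecs, k=2, lambda_mult=1.0))  # prints [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches a plain similarity search, so the duplicate is returned. With `lambda_mult=0.5` the duplicate is penalized and the less similar but distinct document is chosen instead.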

In [None]:
question = "what did they say about matlab?"

docs_ss = vectordb.similarity_search(question,k=3)

In [None]:
docs_ss[0].page_content[:100]

In [None]:
docs_ss[1].page_content[:100]

From the latest lab, we have **209 chunks** from 4 PDFs, where each `page_content` is shorter than 1500 characters.
````
# Duplicate documents on purpose - messy data
    "data/MachineLearning-Lecture01.pdf",
    "data/MachineLearning-Lecture01.pdf",
    "data/MachineLearning-Lecture02.pdf",
    "data/MachineLearning-Lecture03.pdf"
````


Note the difference in results with `MMR`:

In [None]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [None]:
docs_mmr[0].page_content[:100]

In [None]:
docs_mmr[1].page_content[:100]

# **SelfQuery** <a name="selfquery"></a>


**Working with metadata**

We showed that a question about one lecture can include results from other lectures as well. To address this, many vectorstores support _operations_ on **metadata**.

- **metadata** provides context for each embedded chunk.
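
Conceptually, a metadata filter restricts the candidate chunks before (or while) they are ranked. The sketch below shows exact-match filtering in plain Python. The `filter_by_metadata` helper, the dictionary layout, and the shortened file names are hypothetical and only mirror the `page_content`/`metadata` shape used in this notebook.

```python
def filter_by_metadata(chunks, metadata_filter):
    """Keep chunks whose metadata matches every key/value pair in the filter."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]

# Hypothetical chunks, for illustration only.
chunks = [
    {"page_content": "intro to the course", "metadata": {"source": "Lecture01.pdf", "page": 0}},
    {"page_content": "linear regression", "metadata": {"source": "Lecture02.pdf", "page": 3}},
    {"page_content": "locally weighted regression", "metadata": {"source": "Lecture03.pdf", "page": 1}},
]

for c in filter_by_metadata(chunks, {"source": "Lecture03.pdf"}):
    print(c["metadata"])
```

Only the chunk from the third lecture survives the filter, regardless of how similar the other chunks are to the question.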

In [None]:
question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

for d in docs:
    print(d.metadata)

**Working with metadata using self-query retriever**

However, the metadata filter can often be inferred from the query itself. To do this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
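
To illustrate the two outputs listed above, here is a toy, rule-based stand-in for the LLM step. It is an assumption-laden sketch, not how `SelfQueryRetriever` works internally: the `StructuredQuery` dataclass, the `toy_self_query` function, the regular expression, and the file-name mapping are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical mapping from ordinal words to lecture files (names are illustrative).
ORDINAL_TO_SOURCE = {
    "first": "MachineLearning-Lecture01.pdf",
    "second": "MachineLearning-Lecture02.pdf",
    "third": "MachineLearning-Lecture03.pdf",
}

@dataclass
class StructuredQuery:
    query: str                                  # text used for the vector search
    filter: dict = field(default_factory=dict)  # metadata filter passed alongside it

def toy_self_query(question):
    """Rule-based stand-in for the LLM: split a question into query text and filter."""
    match = re.search(r"\bin the (first|second|third) lecture\b", question, re.IGNORECASE)
    if not match:
        return StructuredQuery(query=question.strip())
    source = ORDINAL_TO_SOURCE[match.group(1).lower()]
    # Remove the matched phrase so only the semantic part of the question remains.
    remaining = (question[:match.start()] + question[match.end():]).strip(" ?")
    return StructuredQuery(query=remaining, filter={"source": source})

print(toy_self_query("what did they say about regression in the third lecture?"))
```

A real LLM handles far more phrasings than this single pattern, but the result has the same shape: a search string plus a filter that the vectorstore applies.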

In [None]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [None]:
document_content_description = "Lecture notes"

llm = OpenAI(temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

We will receive a warning about `predict_and_parse` being deprecated the first time we execute the next line. This can be safely ignored.

In [None]:
question = "what did they say about regression in the third lecture?"

docs = retriever.get_relevant_documents(question)

In [None]:
for d in docs:
    print(d.metadata)

***Note:*** We saw in the first lab that `PyPDFLoader` creates `Documents` that contain text (`page_content`) and metadata.
````
> page
    Document(page_content='MachineLearning-Lecture01  \nInstru...')
    
> page.metadata
    {'source': 'data/MachineLearning-Lecture01.pdf', 'page': 0}
````


# **Compression** <a name="compression"></a>


Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of **irrelevant text**.

Passing that full document through your application can lead to **more expensive LLM calls and poorer responses**.  
--> Contextual compression is meant to fix this.
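
The effect of compression can be sketched without an LLM by keeping only the sentences that share a keyword with the query. This keyword heuristic is a deliberately crude stand-in for an LLM-based extractor. The `keywords` and `compress` helpers, the stopword list, and the example chunk are hypothetical.

```python
import re

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "an", "is", "in", "of"}

def keywords(text):
    """Lowercased word set with a few common stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def compress(document, query):
    """Keep only the sentences that share at least one keyword with the query."""
    query_terms = keywords(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords(s) & query_terms]
    return " ".join(kept)

# Hypothetical chunk, for illustration only.
chunk = (
    "The class has four problem sets. "
    "Homework can be done in MATLAB or Octave. "
    "Office hours are on Fridays."
)
print(compress(chunk, "what did they say about matlab?"))
# prints: Homework can be done in MATLAB or Octave.
```

Only the relevant sentence is passed on, which is the behavior we want from a compressor: shorter context for the LLM and less irrelevant text.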

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [None]:
# Wrap our vectorstore
llm = OpenAI(temperature=0)

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

LLM wrappers are simply an intermediate layer that lets us connect to a large language model.

**LLMChainExtractor**: Document compressor that uses an LLM chain to extract the relevant parts of documents.

**ContextualCompressionRetriever**: Retriever that wraps a base retriever and compresses the results.

>**Parameters:** 
>- `base_compressor` _langchain.retrievers.document_compressors.base.BaseDocumentCompressor_ – Compressor for compressing retrieved documents.
>- `base_retriever` _langchain.schema.retriever.BaseRetriever_ – Base Retriever to use for getting relevant documents.



In [None]:
question = "what did they say about matlab?"

compressed_docs = compression_retriever.get_relevant_documents(question)

pretty_print_docs(compressed_docs)

# **Combining various techniques** <a name="combining"></a>


In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [None]:
question = "what did they say about matlab?"

compressed_docs = compression_retriever.get_relevant_documents(question)

pretty_print_docs(compressed_docs)

# **Other types of retrieval** <a name="others"></a>


It's worth noting that a vectordb is not the only kind of tool to retrieve documents.

The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
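
As a rough illustration of retrieval without embeddings, the sketch below ranks texts by a plain TF-IDF score of the query terms. It is not the implementation behind `TFIDFRetriever`: the `tokenize` and `tfidf_rank` helpers, the scoring formula (term frequency times log inverse document frequency), and the example texts are assumptions made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(texts, query):
    """Rank text indices by the summed TF-IDF weight of the query terms they contain."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: number of texts containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(
            (tf[term] / len(tokens)) * math.log(n_docs / df[term])
            for term in query_terms
            if term in tf
        )
        scores.append((score, i))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scores]

# Hypothetical texts, for illustration only.
texts = [
    "matlab is used for the homework",
    "the class covers supervised learning",
    "office hours are held on the weekend",
]
print(tfidf_rank(texts, "what did they say about matlab?"))  # prints [0, 1, 2]
```

Because the score depends on exact term overlap, this kind of retriever works well for distinctive keywords such as "matlab" but cannot match paraphrases the way embeddings can.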

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Load PDF
loader = PyPDFLoader("data/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [None]:
question = "What are major topics for this class?"

docs_svm=svm_retriever.get_relevant_documents(question)
print(docs_svm[0])

In [None]:
question = "what did they say about matlab?"

docs_tfidf=tfidf_retriever.get_relevant_documents(question)
print(docs_tfidf[0])

By looking at the entire dataset and separating positive and negative examples through an optimally positioned hyperplane, SVM is capable of providing high-quality results for complex query types.

Similarities between **KNN** and **SVM**:
- Both are supervised machine learning algorithms.
- They can be used for various NLP tasks, including classification or information retrieval.
- They work well with high-dimensional data, such as sentence embeddings.

Differences between **KNN** and **SVM**:
- KNN is a lazy learning technique that takes into account only the K nearest neighbors, whereas SVM is a model-based technique that considers the overall structure of the entire dataset.
- KNN emphasizes local similarity, while SVM focuses on global relationships and margins between classes.
- KNN generally has a faster training phase as it doesn’t involve model building, but SVM can be computationally expensive and slower during training due to hyperplane fitting.  

[More](https://blog.gopenai.com/knowing-your-neighbors-or-harnessing-support-selecting-knn-or-svm-for-prompt-engineering-d43807580753).