# Retrievers

Retrievers are components that search a large collection or knowledge base and return the pieces of information (e.g., documents, embeddings) most relevant to a user’s query.

Key Points:
- Retrieval Process: A retriever typically uses similarity search, or another retrieval strategy, to find the documents most relevant to the query.
- Integration with Vector Stores: Retrievers often sit on top of a vector store (e.g., Pinecone, FAISS, or Chroma as used below), which compares the query vector to the stored vectors to run the similarity search efficiently.
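
To make the retrieval process above concrete, here is a minimal sketch of a top-k similarity search in plain Python. The 3-dimensional vectors are made-up values for illustration only, and `retrieve_top_k` is a hypothetical helper, not a LangChain function. A real vector store does the same comparison over model-generated embeddings, usually with an index that avoids scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vector, stored_vectors, k=1):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (made-up values, for illustration only)
toy_store = {
    "plate_tectonics": [0.9, 0.1, 0.0],
    "photosynthesis": [0.1, 0.9, 0.1],
    "crispr": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

top = retrieve_top_k(toy_query, toy_store, k=2)
for doc_id, score in top:
    print(f"{doc_id}: {score:.4f}")
```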

- Setup
  
You can change the setup (documents, LLM, embeddings, vector store) to match your preferences and the APIs you have access to.

In [28]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
from langchain_core.documents import Document

# documents
documents = [
    Document(
        page_content="Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods from carbon dioxide and water.",
        metadata={
            "source": "Biology Textbook",
            "author": "Dr. Alice Green",
            "date": "2021-03-15"
        }
    ),
    Document(
        page_content="Quantum entanglement is a physical phenomenon that occurs when pairs or groups of particles are generated such that the quantum state of each particle cannot be described independently of the others.",
        metadata={
            "source": "Physics Journal",
            "author": "Dr. John Quantum",
            "date": "2022-06-01"
        }
    ),
    Document(
        page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.",
        metadata={
            "source": "Geology Magazine",
            "author": "Dr. Emily Stone",
            "date": "2020-11-23"
        }
    ),
    Document(
        page_content="CRISPR is a powerful tool for editing genomes, allowing researchers to easily alter DNA sequences and modify gene function.",
        metadata={
            "source": "Genetics Weekly",
            "author": "Dr. Rachel Gene",
            "date": "2023-01-10"
        }
    )
]

# query
query_text = "What is the theory of plate tectonics?"

# =================== Change this section to match the APIs you have access to ===================
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
from langchain_chroma import Chroma

# llm
repo_id = "tiiuae/falcon-rw-1b"
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    task="text-generation",
    temperature=0.5,
    max_new_tokens=128,
)

# embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

# Vectorstore loaded from an existing persisted DB (assumes the documents were already indexed there)
vectorstore = Chroma(
    persist_directory="./db/chroma_db",
    embedding_function=embeddings
)


## Vector Store Retrievers

- Basic retriever from vector store

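The vector store in this setup is Chroma, not FAISS. Unless the collection was created with a different metric (such as cosine), Chroma ranks results by squared L2 (Euclidean) distance, so a smaller distance means a stored vector is closer to the query vector. With cosine similarity the idea is the same but the scale is reversed: a smaller angle between the query vector and a stored vector (a higher cosine similarity) indicates a more relevant document.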
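
The ranking a similarity search returns depends on the metric the store uses. As a minimal sketch with made-up 2-D vectors, the snippet below shows how cosine similarity and squared L2 distance relate: for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine`, so both metrics produce the same ordering (high similarity corresponds to low distance). Whether your embeddings are unit-length depends on the embedding model, so treat that as an assumption.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2_distance(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors, normalized to unit length
query_vec = normalize([3.0, 4.0])
doc_vec = normalize([4.0, 3.0])

cos_sim = cosine_similarity(query_vec, doc_vec)
sq_dist = squared_l2_distance(query_vec, doc_vec)

print(f"cosine similarity:   {cos_sim:.4f}")
print(f"squared L2 distance: {sq_dist:.4f}")
print(f"2 - 2 * cosine:      {2 - 2 * cos_sim:.4f}")
```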

In [6]:
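retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
# invoke() replaces the deprecated get_relevant_documents();
# a separate name avoids overwriting the `documents` list from the setup
retrieved_docs = retriever.invoke(query_text)
print(retrieved_docs)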

[Document(id='f55325f8-c96d-4b37-b17a-4ebeb978a845', metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.")]


- Perform similarity search with scores

In [11]:
# similarity_search_with_score embeds the query text itself,
# so no separate embed_query call is needed
docs_and_scores = vectorstore.similarity_search_with_score(
    query=query_text,
    k=6
)

# Display documents with similarity scores
for doc, score in docs_and_scores:
    print(f"Score: {score:.4f}")
    print(f"Content: {doc.page_content}\n---")

# The lower the score, the more similar the document (the scores are distances returned by the Chroma vector store, not similarities).

Score: 0.4785
Content: The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.
---
Score: 0.4785
Content: The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.
---
Score: 0.4785
Content: The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.
---
Score: 0.4785
Content: The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.
---
Score: 1.9094
Content: Quantum entanglement is a physical phenomenon that occurs when pairs or groups of particles are generated such that the quantum state of each particle cannot be described independently of the others.
---
Score: 1.9094
Content: Quantum entanglement is a phys

- Retriever with metadata filtering

In [15]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        "filter": {"source": "Geology Magazine"}  # only consider documents from this source
    }
)

# invoke() replaces the deprecated get_relevant_documents()
filtered_docs = retriever.invoke(query_text)
filtered_docs

[Document(id='f55325f8-c96d-4b37-b17a-4ebeb978a845', metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation.")]
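
As a rough illustration of what an equality filter such as `{"source": "Geology Magazine"}` means, the sketch below keeps only the records whose metadata matches every key in the filter before any ranking would happen. `matches_filter` and the toy records are hypothetical and only mirror the simple equality form used above. This is not how Chroma implements filtering internally.

```python
def matches_filter(metadata, metadata_filter):
    """Return True when every filter key equals the document's metadata value."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())

# Toy records reusing two sources from the setup (content shortened for illustration)
toy_docs = [
    {"content": "Plate tectonics ...", "metadata": {"source": "Geology Magazine"}},
    {"content": "Quantum entanglement ...", "metadata": {"source": "Physics Journal"}},
]

metadata_filter = {"source": "Geology Magazine"}

# Only documents that pass the filter remain candidates for similarity ranking
candidates = [doc for doc in toy_docs if matches_filter(doc["metadata"], metadata_filter)]
print([doc["content"] for doc in candidates])
```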

- MMR retriever (Maximal Marginal Relevance)

MMR does not rely on similarity to the query alone; diversity among the results also matters. It balances two objectives:
- Relevance: the retrieved documents should be highly relevant to the query. This is usually measured by the similarity between the document and the query.
- Diversity: the returned documents should not be redundant, so documents that are too similar to the ones already selected are penalized.

The MMR score for a document is computed as:

$$\mathrm{MMR}(D) = \lambda \cdot \mathrm{Relevance}(D) - (1 - \lambda) \cdot \mathrm{Similarity}(D, \text{already retrieved})$$

Where:
- $\mathrm{Relevance}(D)$ is the similarity between the document $D$ and the query.
- $\mathrm{Similarity}(D, \text{already retrieved})$ is the similarity between the document $D$ and the already retrieved documents.
- $\lambda$ is a tuning parameter (usually between 0 and 1) that controls the trade-off between relevance and diversity.

In [16]:
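retriever = vectorstore.as_retriever(
    search_type="mmr",
    # fetch_k: number of candidates fetched by similarity and passed to MMR
    # k: number of documents MMR finally returns
    search_kwargs={"k": 4, "fetch_k": 10}
)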
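
To see the trade-off in action, here is a minimal greedy MMR sketch in plain Python. It assumes cosine similarity for both terms and takes the maximum similarity to the already selected documents as the redundancy term, which is one common reading of the formula. `mmr_select` and the toy vectors are hypothetical. The parameter name `lambda_mult` mirrors the optional `lambda_mult` entry that LangChain accepts in `search_kwargs` for MMR, where 1 means minimum diversity and 0 means maximum diversity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR over (doc_id, vector) pairs; returns the selected ids in order."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best_item, best_score = None, float("-inf")
        for item in remaining:
            _, vec = item
            relevance = cosine_similarity(query_vec, vec)
            # Redundancy: highest similarity to anything already selected (0 if nothing is)
            redundancy = max(
                (cosine_similarity(vec, chosen_vec) for _, chosen_vec in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_item, best_score = item, score
        selected.append(best_item)
        remaining.remove(best_item)
    return [doc_id for doc_id, _ in selected]

# Made-up 2-D vectors: two identical "tectonics" entries and one related but distinct entry
toy_query = [1.0, 0.0]
toy_candidates = [
    ("tectonics", [0.9, 0.1]),
    ("tectonics_copy", [0.9, 0.1]),
    ("earthquakes", [0.7, -0.7]),
]

balanced = mmr_select(toy_query, toy_candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(toy_query, toy_candidates, k=2, lambda_mult=1.0)
print("lambda_mult=0.5:", balanced)
print("lambda_mult=1.0:", relevance_only)
```

With `lambda_mult=1.0` the diversity penalty vanishes and the duplicate is returned second. With `lambda_mult=0.5` the duplicate is penalized and the distinct document is picked instead.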

## Contextual Compression
Contextual Compression is a retrieval technique in LangChain in which the less relevant or redundant parts of the retrieved documents are removed before the documents are passed to the LLM. The idea is to compress the retrieved content based on the context of the user’s query.

In [20]:
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever

# Create base retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# LLM-based document compressor
compressor = LLMChainExtractor.from_llm(llm)

# Create compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# Get compressed results
compressed_docs = compression_retriever.invoke(query_text)  # invoke() replaces the deprecated get_relevant_documents()
print(compressed_docs)

[Document(metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The extracted relevant parts of the context are:\n\nThe theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation."), Document(metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The extracted relevant parts are:\n\nThe theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation."), Document(metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="Extracted relevant parts:\n\nThe theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation."), Document(metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23'
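
The compression idea can also be illustrated without an LLM. The sketch below swaps in a much simpler technique, keyword overlap, as a stand-in for `LLMChainExtractor`: it keeps only the sentences that share at least one non-stopword with the query. `compress_document`, the stopword list, and the toy text (including the off-topic sentence) are made up for illustration. An LLM-based extractor judges relevance far more flexibly than this.

```python
import re

# Tiny, hand-picked stopword list (an assumption for this toy example)
STOPWORDS = {"what", "is", "the", "of", "a", "an", "and", "this"}

def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def compress_document(text, query, min_overlap=1):
    """Keep only the sentences sharing at least min_overlap terms with the query."""
    query_terms = tokenize(query)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if len(query_terms & tokenize(s)) >= min_overlap]
    return " ".join(kept)

toy_text = (
    "The theory of plate tectonics describes the large-scale motion of Earth's lithosphere. "
    "This issue also includes a crossword puzzle. "
    "Tectonics explains earthquakes and mountain formation."
)
toy_query = "What is the theory of plate tectonics?"

compressed = compress_document(toy_text, toy_query)
print(compressed)
```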

## Self-Query Retriever
The Self-Query Retriever in LangChain lets an LLM (like GPT) generate its own search query and metadata filter conditions from the user’s natural-language question, so you do not have to define manually how to search your vector store.

**Note: the Self-Query Retriever is not supported for FAISS.**

In [22]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo

# Describe the metadata fields the LLM is allowed to filter on
# (the query constructor also needs the `lark` package installed)
metadata_field_info = [
    AttributeInfo(name="source", description="The source of the document", type="string"),
    AttributeInfo(name="author", description="The author of the document", type="string"),
    AttributeInfo(name="date", description="The date the document was written (YYYY-MM-DD)", type="string"),
]

# Create self-query retriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Documents about various science topics",
    metadata_field_info=metadata_field_info,
)

# Query with automatic metadata filtering
docs = retriever.invoke(query_text)  # invoke() replaces the deprecated get_relevant_documents()

print(docs)


[Document(id='f55325f8-c96d-4b37-b17a-4ebeb978a845', metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation."), Document(id='001c2370-a108-477e-b23d-b357b349f4d4', metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation."), Document(id='fd03d92a-4313-4acc-927c-c3c9eea04d46', metadata={'author': 'Dr. Emily Stone', 'date': '2020-11-23', 'source': 'Geology Magazine'}, page_content="The theory of plate tectonics describes the large-scale motion of Earth's lithosphere and explains phenomena such as earthquakes and mountain formation."), Document(id='ad2f7074-8cd7-4541-b59d-73aa46cf8ab2', metad
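
To make the idea of a self-generated query and filter more concrete, the sketch below shows the kind of structured result the LLM step is meant to produce: a search string plus a metadata filter. Here a simple rule (looking for "by <author>") stands in for the LLM, so this is only an illustration of the output shape, not of how `SelfQueryRetriever` works internally. `StructuredQuery`, `build_structured_query`, and `KNOWN_AUTHORS` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """Search text plus metadata conditions, as a self-query step might emit."""
    query: str
    metadata_filter: dict = field(default_factory=dict)

# Author names taken from the documents in the setup
KNOWN_AUTHORS = ["Dr. Alice Green", "Dr. Emily Stone"]

def build_structured_query(question):
    """Rule-based stand-in for the LLM: split 'by <author>' out of the question."""
    search_text = question
    metadata_filter = {}
    for author in KNOWN_AUTHORS:
        phrase = f"by {author}"
        if phrase in question:
            metadata_filter["author"] = author
            # Only the topic part is kept for the similarity search
            search_text = question.replace(phrase, "").strip()
    return StructuredQuery(query=search_text, metadata_filter=metadata_filter)

with_author = build_structured_query("plate tectonics by Dr. Emily Stone")
without_author = build_structured_query("plate tectonics")
print(with_author)
print(without_author)
```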