# RAG Retriever Comparison
This notebook compares several retrieval techniques for Retrieval-Augmented Generation (RAG) to determine which gives the best results when answering questions over CyberArk API documentation.

## Introduction
Retrieval quality is critical for RAG systems: the retrieved documents form the context the LLM sees, so they directly determine the quality of the generated answers. In this notebook, we compare several retrieval techniques on a single test query to find the approach that works best.

## 1. Setup and Prerequisites


In [1]:
# Install required packages
%pip install -q langchain-ollama langchain langchain-community faiss-cpu langchain_huggingface rank_bm25

# Initialize the LLM
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.2:latest", temperature=0.5)

# Initialize the embedding model
from langchain_huggingface import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the vector store
from langchain_community.vectorstores import FAISS
loaded_faiss_store = FAISS.load_local(
    "/workspaces/RAG_BOT/LocalEmbeddings/huggingface_faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True
)
print("FAISS vector store loaded successfully.")

FAISS vector store loaded successfully.


## 2. Retriever Implementations


In [2]:
# Test query
test_query = "How to delete a policy?"

# Function to create QA chain and get answer
def get_answer(retriever, query, description=""):
    from langchain.chains import RetrievalQA
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True
    )
    
    print(f"\n=== {description} ===")
    print(f"Question: {query}")
    
    result = qa_chain.invoke({"query": query})
    
    print("Answer:")
    print(result["result"])
    print(f"Number of source documents: {len(result['source_documents'])}")
    
    return result

#### 2.1 Basic Vector Retriever
The simplest approach using vector similarity search.

In [4]:
# Basic vector retriever
basic_retriever = loaded_faiss_store.as_retriever()

# Test retrieval
basic_docs = basic_retriever.invoke(test_query)
print(f"Basic retriever returned {len(basic_docs)} documents")

# Get answer
basic_result = get_answer(basic_retriever, test_query, "Basic Vector Retriever")

Basic retriever returned 4 documents

=== Basic Vector Retriever ===
Question: How to delete a policy?
Answer:
To delete a policy, you can send a POST request to the /Policy/DeletePolicyBlock endpoint with a JSON body containing the path of the policy block to be deleted. The request should include the required field "path" and follow the schema properties specified in the request body.

Here is an example of what the JSON request body might look like:
```json
{
  "path": "string_value"
}
```
Make sure to include the correct path and content type (application/json) in your request.
Number of source documents: 4
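
To make the idea of vector similarity search concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The 2-D vectors are hypothetical stand-ins for real embeddings, and cosine similarity is chosen only for illustration; the FAISS index loaded above may use a different metric (for example L2 distance).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Hypothetical 2-D "embeddings", for illustration only
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.05]
print(top_k(toy_query, toy_docs, k=2))
```

The default of `k=4` mirrors the four documents the basic retriever returned above.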


#### 2.2 Maximum Marginal Relevance (MMR) Retriever
MMR promotes diversity in retrieved documents, balancing relevance with information diversity.

In [5]:
# MMR retriever
mmr_retriever = loaded_faiss_store.as_retriever(search_type="mmr")

# Test retrieval
mmr_docs = mmr_retriever.invoke(test_query)
print(f"MMR retriever returned {len(mmr_docs)} documents")

# Get answer
mmr_result = get_answer(mmr_retriever, test_query, "Maximum Marginal Relevance Retriever")

MMR retriever returned 4 documents

=== Maximum Marginal Relevance Retriever ===
Question: How to delete a policy?
Answer:
To delete a policy, you need to send a POST request to the /Policy/DeletePolicyBlock endpoint with the required fields in the JSON payload.

Here is an example of what the JSON payload might look like:

```json
{
  "path": "string_value"
}
```

Replace `"string_value"` with the actual path of the policy block you want to delete.

Also, make sure that the request body contains the `Content-Type` header set to `application/json`.

The API will return a response with a status code of 200 and a JSON body containing the result of the deletion operation. If the deletion is successful, the response will contain a `Result` field with a value of `true`. If there is an error, the response will contain an `Error` field with an error message.
Number of source documents: 4
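
The relevance/diversity trade-off can be sketched as a greedy selection loop. This is an illustrative implementation of the general MMR idea, not LangChain's internal code; the unit-length vectors are hypothetical, and `lambda_mult` is named after the LangChain `search_kwargs` option that, as far as we know, plays the same role.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of the selected documents in pick order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = dot(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical unit-length vectors: docs 0 and 1 are near-duplicates
toy_query = [0.8, 0.6]
toy_docs = [[0.96, 0.28], [1.0, 0.0], [0.0, 1.0]]
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5))  # diversity-aware
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the selection reduces to plain similarity ranking; lower values favor documents that differ from those already chosen.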


#### 2.3 Similarity Score Threshold Retriever
Returns only documents whose relevance score meets a threshold (0.3 in this example), which helps filter out weakly related results.

In [8]:
# Similarity Score Threshold retriever
sst_retriever = loaded_faiss_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.3, "k": 2}
)

# Test retrieval
sst_docs = sst_retriever.invoke(test_query)
print(f"SST retriever returned {len(sst_docs)} documents")

# Get answer
sst_result = get_answer(sst_retriever, test_query, "Similarity Score Threshold Retriever")

SST retriever returned 2 documents

=== Similarity Score Threshold Retriever ===
Question: How to delete a policy?
Answer:
To delete a policy, you need to send a POST request to the `/Policy/DeletePolicyBlock` endpoint with a JSON payload containing the path of the policy block to be deleted. The request body is required and must have a `path` field with a string value.
Number of source documents: 2
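
The filtering step can be sketched as follows, assuming each document already carries a relevance score normalized to the range 0 to 1 (higher is better). The chunk names and scores are hypothetical; the parameter names match the `search_kwargs` used above.

```python
def filter_by_threshold(scored_docs, score_threshold=0.3, k=2):
    """Take the k best-scoring documents, then drop any below score_threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked[:k] if score >= score_threshold]

# Hypothetical (chunk, relevance score) pairs, for illustration only
toy_scored = [("chunk-b", 0.41), ("chunk-a", 0.62), ("chunk-c", 0.18)]
print(filter_by_threshold(toy_scored, score_threshold=0.3, k=2))
print(filter_by_threshold(toy_scored, score_threshold=0.5, k=2))
```

Note that a strict threshold can return fewer than `k` documents, or none at all, so the downstream chain should be able to handle an empty context.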


#### 2.4 MultiQuery Retriever
Generates multiple query variations to improve retrieval coverage.

In [9]:
# MultiQuery retriever
from langchain.retrievers.multi_query import MultiQueryRetriever

# Configure logging to see generated queries
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Create retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=loaded_faiss_store.as_retriever(),
    llm=llm
)

# Test retrieval
multi_query_docs = multi_query_retriever.invoke(test_query)
print(f"MultiQuery retriever returned {len(multi_query_docs)} documents")

# Get answer
multi_query_result = get_answer(multi_query_retriever, test_query, "MultiQuery Retriever")

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three different versions of the original question:', 'How to remove or terminate an existing insurance policy?', 'Can someone assist me in canceling my current health insurance plan?', 'What steps can I take to discontinue coverage for a specific policy?', "These alternative questions capture similar intent and concepts as the original question, but with slight variations in wording and phrasing. By considering different perspectives on the user's query, we can increase the chances of retrieving relevant documents from the vector database."]


MultiQuery retriever returned 9 documents

=== MultiQuery Retriever ===
Question: How to delete a policy?


INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three different versions of the original question:', 'How can I remove or cancel a policy?', ' ', 'What steps do I need to take to delete an existing policy?', 'Can someone help me with deleting a policy that is no longer needed?']


KeyboardInterrupt: 
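
Conceptually, a multi-query retriever runs the base search once per generated query and merges the results. The sketch below illustrates that merge step under simplified assumptions: `search_fn` and the toy index are hypothetical stand-ins for the FAISS retriever, and documents are represented by plain string IDs.

```python
def multi_query_retrieve(queries, search_fn):
    """Run search_fn for every query and return the de-duplicated union of results."""
    seen = set()
    merged = []
    for query in queries:
        for doc in search_fn(query):
            if doc not in seen:  # keep only the first occurrence of each document
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical query -> chunk-id index, for illustration only
toy_index = {
    "delete a policy": ["c1", "c2"],
    "remove a policy": ["c2", "c3"],
    "cancel insurance": ["c9"],
}

def toy_search(query):
    return toy_index.get(query, [])

print(multi_query_retrieve(list(toy_index), toy_search))
```

This is one plausible reason the run above returned 9 documents although the base retriever returns 4 per query. It also shows the risk visible in the logged queries: an off-topic rephrasing (such as the insurance-related ones) or a preamble line contributes its own, likely irrelevant, results to the union.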

#### 2.5 Custom MultiQuery Retriever with Output Parser
A more tailored approach to query generation for domain-specific needs.

In [None]:
# Custom MultiQuery with Output Parser
from typing import List
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate

# Create custom output parser
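class LineListOutputParser(BaseOutputParser[List[str]]):
    """Split the LLM output into one query per non-empty line."""

    def parse(self, text: str) -> List[str]:
        # Strip each line and drop blank or whitespace-only lines
        return [line.strip() for line in text.splitlines() if line.strip()]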

output_parser = LineListOutputParser()

# Domain-specific prompt for CyberArk API documentation
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""
You are an expert AI assistant helping users query CyberArk API documentation.

Given the user question below, generate up to 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks.

These alternatives should:
- Include any possible synonyms or related terms.
- Expand abbreviations.
- Include common ways developers might phrase this question.
- Optionally incorporate relevant concepts such as endpoint names, paths, HTTP methods, parameters, request or response schemas if they are likely relevant.
- Be clear and natural in language.

Each alternative should be on its own line.

User Question:
{question}

Alternative queries:
"""
)

# Create chain and retriever
llm_chain = QUERY_PROMPT | llm | output_parser

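# parser_key is deprecated in recent LangChain releases and is omitted here;
# the chain's output parser already returns a list of query strings
custom_multi_query_retriever = MultiQueryRetriever(
    retriever=loaded_faiss_store.as_retriever(),
    llm_chain=llm_chain
)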

# Test retrieval
custom_multi_query_docs = custom_multi_query_retriever.invoke(test_query)
print(f"Custom MultiQuery retriever returned {len(custom_multi_query_docs)} documents")

# Get answer
custom_multi_query_result = get_answer(custom_multi_query_retriever, test_query, "Custom MultiQuery Retriever")
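
The logs in section 2.4 show preamble text and whitespace-only strings being treated as queries. A stricter line cleaner could reduce that noise. The sketch below is one possible heuristic, not part of LangChain; `clean_query_lines` is a hypothetical helper whose logic could be moved into the `parse` method of a custom output parser.

```python
import re

def clean_query_lines(text):
    """Split LLM output into candidate queries, dropping blanks and preamble lines."""
    queries = []
    for line in text.splitlines():
        # Remove leading list markers such as "1.", "2)", "-" or "*"
        line = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if not line:
            continue  # blank or whitespace-only line
        if line.endswith(":"):
            continue  # likely a preamble such as "Here are three versions:"
        queries.append(line)
    return queries

# Sample text modeled on the generated-queries log shown in section 2.4
raw_output = (
    "Here are three different versions of the original question:\n"
    "1. How can I remove or cancel a policy?\n"
    " \n"
    "- What steps do I need to take to delete an existing policy?\n"
)
print(clean_query_lines(raw_output))
```

This heuristic does not catch trailing commentary that the model appends as a full sentence; filtering that reliably would need a stricter prompt or a structured output format.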

#### 2.6 Ensemble Retriever (BM25 + FAISS)
Combines keyword-based (BM25) and semantic (FAISS) search for better coverage.

In [None]:
# Ensemble Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Extract documents from the docstore
try:
    all_docs = [loaded_faiss_store.docstore._dict[doc_id] for doc_id in loaded_faiss_store.index_to_docstore_id.values()]
except AttributeError:
    # Fallback for docstores without a private _dict: use the public search() lookup
    all_docs = [loaded_faiss_store.docstore.search(doc_id) for doc_id in loaded_faiss_store.index_to_docstore_id.values()]

# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 2

# Create ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, basic_retriever], 
    weights=[0.5, 0.5]
)

# Test retrieval
ensemble_docs = ensemble_retriever.invoke(test_query)
print(f"Ensemble retriever returned {len(ensemble_docs)} documents")

# Get answer
ensemble_result = get_answer(ensemble_retriever, test_query, "Ensemble Retriever (BM25 + FAISS)")
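
To illustrate how two ranked lists can be merged, here is a minimal sketch of weighted Reciprocal Rank Fusion (RRF). To our understanding, LangChain's `EnsembleRetriever` uses a scheme of this kind with a constant of 60, but treat that as an assumption; the chunk IDs below are hypothetical.

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (rank + c) to a document's score."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings: BM25 returns 2 chunks, the vector retriever returns 4
bm25_ranking = ["c7", "c2"]
vector_ranking = ["c2", "c4", "c5", "c1"]
print(weighted_rrf([bm25_ranking, vector_ranking], weights=[0.5, 0.5]))
```

A document found by both retrievers (`c2` here) accumulates score from each list and rises to the top, which is the main benefit of combining keyword and semantic search.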

#### 2.7 Enhanced Ensemble with Custom MultiQuery
Combines BM25 keyword search with the custom query-expansion retriever from section 2.5.

In [None]:
# Enhanced Ensemble with Custom MultiQuery
enhanced_ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, custom_multi_query_retriever], 
    weights=[0.4, 0.6]
)

# Test retrieval
enhanced_ensemble_docs = enhanced_ensemble_retriever.invoke(test_query)
print(f"Enhanced Ensemble retriever returned {len(enhanced_ensemble_docs)} documents")

# Get answer
enhanced_ensemble_result = get_answer(enhanced_ensemble_retriever, test_query, "Enhanced Ensemble (BM25 + Custom MultiQuery)")

#### 2.8 Document Reordering for Long Context
Optimizes document order for better context utilization in the LLM.

In [6]:
# Document Reordering
from langchain_community.document_transformers import LongContextReorder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate

# Get documents using the custom MultiQuery retriever (the cell in section 2.5 must be run first)
retrieved_docs = custom_multi_query_retriever.invoke(test_query)

# Apply reordering
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(retrieved_docs)

# Define a prompt template
prompt_template = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "query"],
)

# Create and invoke the chain with reordered documents
chain = create_stuff_documents_chain(llm, prompt)
reordered_response = chain.invoke({"context": reordered_docs, "query": test_query})

print("\n=== Long Context Reordering ===")
print(f"Question: {test_query}")
print("Answer:")
print(reordered_response)

NameError: name 'custom_multi_query_retriever' is not defined
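
The reordering itself is simple to sketch. The function below places the most relevant documents at the beginning and end of the context and the least relevant in the middle, which is the idea behind `LongContextReorder`; it is an illustrative re-implementation written for this notebook, not the library's code, and the document labels are hypothetical.

```python
def reorder_for_long_context(docs_by_relevance):
    """Move the most relevant items to the two ends, the least relevant to the middle.

    Expects the input sorted from most to least relevant.
    """
    reordered = []
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            reordered.append(doc)     # odd steps go to the back
        else:
            reordered.insert(0, doc)  # even steps go to the front
    return reordered

# Hypothetical labels: d1 is the most relevant, d5 the least
print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
```

The motivation is that LLMs tend to use information at the start and end of a long context more reliably than information in the middle.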

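## 3. Retriever Comparison
The table below summarizes each retriever: the number of documents it returned for the test query, plus its qualitative advantages and limitations.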

In [7]:
import pandas as pd

# Define our comparison metrics
retriever_comparison = pd.DataFrame({
    "Retriever Type": [
        "Basic Vector", 
        "MMR", 
        "Similarity Threshold", 
        "MultiQuery", 
        "Custom MultiQuery", 
        "Ensemble (BM25 + Vector)",
        "Enhanced Ensemble",
        "Reordered Custom MultiQuery"
    ],
    "Documents Retrieved": [
        len(basic_docs),
        len(mmr_docs),
        len(sst_docs),
        len(multi_query_docs),
        len(custom_multi_query_docs),
        len(ensemble_docs),
        len(enhanced_ensemble_docs),
        len(retrieved_docs)
    ],
    "Advantages": [
        "Simple, fast",
        "Diverse results, reduces redundancy",
        "Filters low-quality matches",
        "Improves query coverage",
        "Domain-specific query expansion",
        "Combines keyword and semantic search",
        "Keyword search plus domain-specific query expansion",
        "Optimizes document ordering for LLMs"
    ],
    "Limitations": [
        "Sensitive to query phrasing",
        "May miss some relevant documents",
        "Risk of excluding relevant documents",
        "Computationally expensive",
        "Quality depends on prompt design",
        "More complex implementation",
        "Highest complexity",
        "Post-processing step only"
    ]
})

print("Retriever Comparison:")
retriever_comparison

NameError: name 'sst_docs' is not defined
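
Document counts alone do not show which retriever finds the right chunks. If a small set of hand-labeled relevant chunks were available for each test query, metrics such as recall@k and reciprocal rank could be added to the table. The sketch below assumes such labels exist; the chunk IDs are hypothetical and do not come from the runs above.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval result and hand-labeled relevant chunks
toy_retrieved = ["c4", "c2", "c9", "c1"]
toy_relevant = ["c2", "c1"]
print(recall_at_k(toy_retrieved, toy_relevant, k=2))
print(reciprocal_rank(toy_retrieved, toy_relevant))
```

Averaging these values over several test queries would give a more reliable comparison than a single query.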