# RAG Retriever Comparison
This notebook compares different retrieval techniques for Retrieval-Augmented Generation (RAG) systems to determine which one provides the best results for our use case with CyberArk API documentation.

## Introduction
Retrieval quality is critical for RAG systems: the retrieved documents directly determine the quality of the generated answers. This notebook compares several retrieval techniques on the same test query to find the best fit.

## 1. Setup and Prerequisites


In [4]:
# Install required packages
%pip install -q langchain-ollama langchain langchain-community faiss-cpu langchain_huggingface rank_bm25 sentence-transformers


In [None]:

# Initialize the LLM
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.2:latest", temperature=0.5)

# Initialize the embedding model
from langchain_huggingface import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the vector store
from langchain_community.vectorstores import FAISS  # langchain.vectorstores is deprecated
loaded_faiss_store = FAISS.load_local(
    "/content/RAG_BOT/LocalEmbeddings/Hugging_split_enriched_faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True
)
print("FAISS vector store loaded successfully.")

## 2. Retriever Implementations


In [26]:
# Test query
test_query = "How to delete a policy?"

# Helper: run a retriever on a query, print the retrieved documents, and return them
def get_answer(retriever, query, description=""):
    """Retrieve documents for `query`, print them, and return the list."""
    docs = retriever.invoke(query)

    print(f"\n=== {description} ===\n")
    print(f"Question: {query}")
    print(f"\nNumber of source documents: {len(docs)}\n")
    print("Documents:\n")
    for idx, item in enumerate(docs):
        print(f"\n ##### Document - {idx + 1} ####################################################################################################################### \n")
        print(item)
    # Return the documents so the *_result variables below are not None
    return docs

## 2.1 Basic Vector Retriever
The simplest approach: plain vector similarity search, which returns the four nearest chunks by default.
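To make the idea concrete, here is a minimal sketch of nearest-neighbour lookup over toy vectors. It is an illustration only: the function name `top_k_by_distance` and the toy vectors are hypothetical, and Euclidean distance is an assumption, since the metric used by the saved index is not stated in this notebook.

```python
import math

def top_k_by_distance(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec."""
    # Pair each distance with its index so sorting keeps track of the source vector
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()
    return [idx for _, idx in distances[:k]]

# Hypothetical 2-D "embeddings"; real embeddings have hundreds of dimensions
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_k_by_distance([1.0, 0.0], toy_docs, k=2)
print(nearest)  # [0, 1]
```

A vector store does the same thing at scale, usually with an index structure that avoids comparing the query against every stored vector.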

In [27]:
# Basic vector retriever
basic_retriever = loaded_faiss_store.as_retriever()

# Test retrieval
basic_docs = basic_retriever.invoke(test_query)
print(f"Basic retriever returned {len(basic_docs)} documents")

# Get answer
basic_result = get_answer(basic_retriever, test_query, "Basic Vector Retriever")

Basic retriever returned 4 documents

=== Basic Vector Retriever ===

Question: How to delete a policy?

Number of source documents: 4

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='**Overview**

Delete policy endpoint allows users to delete a specific policy block from their system. This endpoint is used in policy management and requires authentication via bearer token.

To use this endpoint, you need to send a POST request to `/Policy/DeletePolicyBlock` with the policy block path as a JSON payload. The endpoint uses bearer authentication for security purposes.

**Key Search Terms**

* Delete policy
* Policy block deletion
* API policy management
* Bearer authentication
* Policy management endpoint

**Example User Questions**

* What is the syntax to delete a policy block using this endpoint?
* How do I authenticate with bearer token for this endpoint?
* Can I use

## 2.2 Maximum Marginal Relevance (MMR) Retriever
MMR promotes diversity in retrieved documents, balancing relevance with information diversity.
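The trade-off can be sketched in a few lines of plain Python. This is an illustrative greedy MMR selection, not the vector store's own code: `mmr_select`, `cosine`, and the toy vectors are hypothetical names and data. The `lambda_mult` weight (1.0 means pure relevance, lower values favour diversity) follows the usual MMR formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(idx):
            relevance = cosine(query_vec, doc_vecs[idx])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical vectors: documents 0 and 1 are near-duplicates, document 2 differs
toy_docs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0]]
query = [1.0, 0.3]
diverse = mmr_select(query, toy_docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, toy_docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the dissimilar document; with `lambda_mult=1.0` the selection falls back to plain relevance ranking.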

In [28]:
# MMR retriever
mmr_retriever = loaded_faiss_store.as_retriever(search_type="mmr")

# Test retrieval
mmr_docs = mmr_retriever.invoke(test_query)
print(f"MMR retriever returned {len(mmr_docs)} documents")

# Get answer
mmr_result = get_answer(mmr_retriever, test_query, "Maximum Marginal Relevance Retriever")

MMR retriever returned 4 documents

=== Maximum Marginal Relevance Retriever ===

Question: How to delete a policy?

Number of source documents: 4

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='**Overview**

Delete policy endpoint allows users to delete a specific policy block from their system. This endpoint is used in policy management and requires authentication via bearer token.

To use this endpoint, you need to send a POST request to `/Policy/DeletePolicyBlock` with the policy block path as a JSON payload. The endpoint uses bearer authentication for security purposes.

**Key Search Terms**

* Delete policy
* Policy block deletion
* API policy management
* Bearer authentication
* Policy management endpoint

**Example User Questions**

* What is the syntax to delete a policy block using this endpoint?
* How do I authenticate with bearer token for this endpoint?

## 2.3 Similarity Score Threshold Retriever
Returns only documents whose relevance score meets a threshold (0.3 in this notebook), which helps filter out irrelevant results.
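As a rough sketch of what the threshold does, the snippet below filters hypothetical (document, score) pairs. The function name `filter_by_threshold`, the document labels, and the scores are all made up for illustration; scores are assumed to be normalized to the range 0 to 1, with higher meaning more relevant.

```python
def filter_by_threshold(scored_docs, score_threshold=0.3, k=2):
    """Keep at most k documents whose score is at least score_threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    # Highest score first, then cut to k
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical relevance scores for four chunks
toy_scores = [("a", 0.82), ("b", 0.25), ("c", 0.41), ("d", 0.35)]
kept = filter_by_threshold(toy_scores, score_threshold=0.3, k=2)
print(kept)  # [('a', 0.82), ('c', 0.41)]
```

Raising the threshold can return fewer than `k` documents, or none at all, which is the main risk of this retriever.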

In [29]:
# Similarity Score Threshold retriever
sst_retriever = loaded_faiss_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.3, "k": 2}
)

# Test retrieval
sst_docs = sst_retriever.invoke(test_query)
print(f"SST retriever returned {len(sst_docs)} documents")

# Get answer
sst_result = get_answer(sst_retriever, test_query, "Similarity Score Threshold Retriever")

SST retriever returned 2 documents

=== Similarity Score Threshold Retriever ===

Question: How to delete a policy?

Number of source documents: 2

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='**Overview**

Delete policy endpoint allows users to delete a specific policy block from their system. This endpoint is used in policy management and requires authentication via bearer token.

To use this endpoint, you need to send a POST request to `/Policy/DeletePolicyBlock` with the policy block path as a JSON payload. The endpoint uses bearer authentication for security purposes.

**Key Search Terms**

* Delete policy
* Policy block deletion
* API policy management
* Bearer authentication
* Policy management endpoint

**Example User Questions**

* What is the syntax to delete a policy block using this endpoint?
* How do I authenticate with bearer token for this endpoint?

## 2.4 MultiQuery Retriever
Generates multiple query variations to improve retrieval coverage.
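Conceptually, the retriever runs one search per generated query and merges the results without duplicates, which is why the document count is larger than for a single search but smaller than queries times k. A minimal sketch of that merge step, with a hypothetical `multi_query_retrieve` helper and a toy lookup table standing in for the vector store:

```python
def multi_query_retrieve(queries, search):
    """Run `search` for every query and merge results, dropping duplicates."""
    seen, merged = set(), []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical toy index: query -> ranked chunk ids
toy_index = {
    "How to delete a policy?": ["delete_policy", "set_policy"],
    "How can I remove an existing policy?": ["delete_policy", "get_policy"],
    "What steps cancel an insurance policy?": ["user_attributes", "set_policy"],
}

merged = multi_query_retrieve(list(toy_index), toy_index.get)
print(merged)  # ['delete_policy', 'set_policy', 'get_policy', 'user_attributes']
```

The third toy query mimics an off-topic rewrite: a generated query that drifts away from the domain pulls unrelated chunks into the merged result.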

In [38]:
# MultiQuery retriever
from langchain.retrievers.multi_query import MultiQueryRetriever

# Configure logging to see generated queries
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Create retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=loaded_faiss_store.as_retriever(),
    llm=llm
)

# Test retrieval
multi_query_docs = multi_query_retriever.invoke(test_query)  # get_relevant_documents is deprecated
print(f"MultiQuery retriever returned {len(multi_query_docs)} documents")

# Get answer
multi_query_result = get_answer(multi_query_retriever, test_query, "MultiQuery Retriever")

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three different versions of the original question:', '1. What steps can I take to remove or cancel an existing insurance policy?', '2. How do I terminate or void an active policy, and what documents are required?', '3. Can you provide guidance on how to deactivate or nullify a previously purchased policy, including any necessary procedures?', 'These alternative questions aim to capture different aspects of the original query while using slightly varying phrasing and terminology. By exploring these different perspectives, we can potentially retrieve more relevant documents from the vector database and increase the chances of finding accurate information for the user.']


MultiQuery retriever returned 14 documents


INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three different versions of the original question:', 'How can I remove or cancel an existing policy?', 'What steps are involved in deleting or terminating a policy?', 'Can someone help me with the process of removing or taking down a policy?']



=== MultiQuery Retriever ===

Question: How to delete a policy?

Number of source documents: 10

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='ports to. Default: null
                    * PictureUri (string): File system path to user picture file. Default: null
                    * DisplayName (string): Display name of user. Default: null
                    * Uuid (string): Unique Id of the user. Ex: 'cb9b5761-6cfe-45a5-8ecf-ce9fa9e0ff82'
                    * OfficeNumber (string): User's office number. Default: null
                    * MobileNumber (integer): User's mobile number. Default: null
                    * PreferredCulture (string): User preferred culture. Default: null
                    * StartDate (string): Start date for the user. Default: null
                    * Mail (string): Email address of the user. Ex: 'mark@email.com'
              

## 2.5 Custom MultiQuery Retriever with Output Parser
A more tailored approach to query generation for domain-specific needs.
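One practical detail: a parser that only splits on newlines passes everything the model writes to the retriever, including list numbering and any preamble or closing commentary. A stricter parsing step could look like the sketch below. The function name `clean_query_lines` and the rule of keeping only lines that end in a question mark are assumptions made for illustration, not part of this notebook's pipeline.

```python
import re
from typing import List

# Matches leading list markers such as "1. ", "2) ", "- " or "* "
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")

def clean_query_lines(text: str) -> List[str]:
    """Split LLM output into queries, dropping numbering and non-question lines."""
    queries = []
    for line in text.strip().split("\n"):
        line = _LIST_MARKER.sub("", line.strip())
        # Heuristic (assumption): real queries are phrased as questions
        if line.endswith("?"):
            queries.append(line)
    return queries

sample = (
    "Here are three different versions of the original question:\n"
    "1. How can I remove an existing policy?\n"
    "2. What steps are involved in deleting a policy?\n"
    "\n"
    "These alternatives aim to capture different aspects of the query."
)
cleaned = clean_query_lines(sample)
print(cleaned)
```

The same logic could be placed inside the `parse` method of a custom output parser. The question-mark heuristic would drop legitimate queries phrased as statements, so it should be checked against real model output before use.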

In [43]:
# Custom MultiQuery with Output Parser
from typing import List
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate

# Create custom output parser
class LineListOutputParser(BaseOutputParser[List[str]]):
    def parse(self, txt: str) -> List[str]:
        lines = txt.strip().split("\n")
        return list(filter(None, lines))

output_parser = LineListOutputParser()

# Domain-specific prompt for CyberArk API documentation
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""
You are an expert AI assistant helping users query CyberArk API documentation.

Given the user question below, generate up to 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks.

These alternatives should:
- Include any possible synonyms or related terms.
- Expand abbreviations.
- Include common ways developers might phrase this question.
- Optionally incorporate relevant concepts such as endpoint names, paths, HTTP methods, parameters, request or response schemas if they are likely relevant.
- Be clear and natural in language.

Each alternative should be on its own line.

User Question:
{question}

Alternative queries:
"""
)

# Create chain and retriever
llm_chain = QUERY_PROMPT | llm | output_parser

# parser_key is deprecated and unused: llm_chain already ends in an output parser
custom_multi_query_retriever = MultiQueryRetriever(
    retriever=loaded_faiss_store.as_retriever(),
    llm_chain=llm_chain
)


# Get answer
custom_multi_query_result = get_answer(custom_multi_query_retriever, test_query, "Custom MultiQuery Retriever")

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the process for removing an existing policy from the CyberArk server?', '2. How do I revoke or delete a specific policy configuration in CyberArk?', '3. Can you provide instructions on how to remove a previously created policy from the CyberArk API?', '4. What are the steps involved in deleting a policy using the CyberArk management console and API?', "5. How can I remove an existing policy from the CyberArk server using the 'deletePolicy' endpoint or another related method?", 'These alternatives aim to capture different aspects of the original question, including synonyms ("revoke"), expanding on the concept of "policy", and incorporating common developer phrasing ("remove", "revoking").']



=== Custom MultiQuery Retriever ===

Question: How to delete a policy?

Number of source documents: 6

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='tted)**

ENDPOINT: Set policy
PATH: /Policy/SavePolicyBlock3
METHOD: POST
TAGS: Policy Management
DESCRIPTION: Replaces and deprecates SavePolicyBlock2, by adding the "rev stamp" functionality that helps prevent change loss when policies are being edited by multiple users.

METADATA:
  * x-idap-anon: False
  * x-codegen-request-body-name: payload

REQUEST BODY:
  Content Type: application/json
  Schema Properties:
    * policy (object): JSON object containing the details of the policy being added or updated.
      Nested properties:
        * Newpolicy (boolean): boolean set to true if this is a new policy (add)
        * Version (integer): Policy structure version number. 1, at this time.
        * Path (string): pat

## 2.6 Ensemble Retriever (BM25 + Basic)
Combines keyword-based (BM25) and semantic (Basic) search for better coverage.
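Ensemble retrieval needs a way to merge rankings whose scores are not comparable. A common choice is weighted Reciprocal Rank Fusion, sketched below under stated assumptions: the helper name `weighted_rrf`, the chunk ids, and the constant `c=60` (a conventional default for this method) are illustrative, not taken from this notebook.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings: 2 keyword hits and 4 vector hits with one overlap
bm25_hits = ["delete_policy", "set_policy"]
vector_hits = ["delete_policy", "get_policy", "policy_links", "list_users"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.5, 0.5])
print(fused[0], len(fused))  # delete_policy 5
```

Two results from one retriever and four from the other, with one document in common, give five unique documents. A document found by both retrievers accumulates score from each list and rises to the top.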

In [40]:
# Ensemble Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Extract documents from the docstore
try:
    all_docs = [loaded_faiss_store.docstore._dict[doc_id] for doc_id in loaded_faiss_store.index_to_docstore_id.values()]
except AttributeError:
    # Fallback for different docstore structure
    all_docs = [loaded_faiss_store.docstore.get(doc_id) for doc_id in loaded_faiss_store.index_to_docstore_id.values()]

# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 2

# Create ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, basic_retriever],
    weights=[0.5, 0.5]
)

# Test retrieval
ensemble_docs = ensemble_retriever.invoke(test_query)
print(f"Ensemble retriever returned {len(ensemble_docs)} documents")

# Get answer
ensemble_result = get_answer(ensemble_retriever, test_query, "Ensemble Retriever (BM25 + FAISS)")

Ensemble retriever returned 5 documents

=== Ensemble Retriever (BM25 + FAISS) ===

Question: How to delete a policy?

Number of source documents: 5

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='**Overview**

Delete policy endpoint allows users to delete a specific policy block from their system. This endpoint is used in policy management and requires authentication via bearer token.

To use this endpoint, you need to send a POST request to `/Policy/DeletePolicyBlock` with the policy block path as a JSON payload. The endpoint uses bearer authentication for security purposes.

**Key Search Terms**

* Delete policy
* Policy block deletion
* API policy management
* Bearer authentication
* Policy management endpoint

**Example User Questions**

* What is the syntax to delete a policy block using this endpoint?
* How do I authenticate with bearer token for this endpoin

## 2.7 Enhanced Ensemble with Custom MultiQuery
Combines keyword search (BM25) with the custom query expansion from section 2.5.

In [41]:
# Enhanced Ensemble with Custom MultiQuery
enhanced_ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, custom_multi_query_retriever],
    weights=[0.4, 0.6]
)

# Test retrieval
enhanced_ensemble_docs = enhanced_ensemble_retriever.invoke(test_query)
print(f"Enhanced Ensemble retriever returned {len(enhanced_ensemble_docs)} documents")

# Get answer
enhanced_ensemble_result = get_answer(enhanced_ensemble_retriever, test_query, "Enhanced Ensemble (BM25 + Custom MultiQuery)")

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks:', '1. What is the API endpoint for deleting a CyberArk policy, and what HTTP method should be used?', '2. How do I remove an existing policy from CyberArk using the API, including any specific parameters or headers required?', '3. Can you provide information on deleting a policy in CyberArk, specifically regarding the request body schema and response codes?', '4. What is the process for deactivating or removing a policy from the CyberArk console via the API, including endpoint names and HTTP methods?', '5. How do I delete a specific policy from the CyberArk server using the API, including any considerations for policy hierarchies or relationships?', 'These alternatives aim to provide more context and specificity around the question, while also incorporating relevant terms and concepts that might be i

Enhanced Ensemble retriever returned 14 documents


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the procedure for removing an existing policy from CyberArk?', '2. How can I delete a specific policy using the CyberArk API endpoint?', '3. What HTTP method and parameters are required to remove or update a policy in CyberArk?', '4. Can you provide instructions on how to delete a policy via the CyberArk Management Console API?', '5. Are there any specific considerations or caveats when deleting a policy in CyberArk, such as impact on existing groups or users?', "Note: I've tried to cover various aspects of the question, including the procedure, endpoint, HTTP method and parameters, API usage, and potential considerations."]



=== Enhanced Ensemble (BM25 + Custom MultiQuery) ===

Question: How to delete a policy?

Number of source documents: 12

Documents:


 ##### Document - 1 ####################################################################################################################### 

page_content='**Overview**

Delete policy endpoint allows users to delete a specific policy block from their system. This endpoint is used in policy management and requires authentication via bearer token.

To use this endpoint, you need to send a POST request to `/Policy/DeletePolicyBlock` with the policy block path as a JSON payload. The endpoint uses bearer authentication for security purposes.

**Key Search Terms**

* Delete policy
* Policy block deletion
* API policy management
* Bearer authentication
* Policy management endpoint

**Example User Questions**

* What is the syntax to delete a policy block using this endpoint?
* How do I authenticate with bearer token for this endpoint?
* Can I use this endpoint

## 2.8 Document Reordering for Long Context
Reorders the retrieved documents so that the most relevant ones sit at the start and end of the context, where LLMs tend to use information most reliably.
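The reordering idea can be sketched without any library. The helper below is an illustration of the "most relevant at the edges, least relevant in the middle" pattern. The name `reorder_for_long_context` is hypothetical, and the library's own implementation may differ in detail.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant items at both ends and the least relevant in the middle.

    Expects the input sorted from most to least relevant.
    """
    reordered = []
    # Walk from least to most relevant, alternating between front and back
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered

ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
reordered = reorder_for_long_context(ranked)
print(reordered)  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The two most relevant documents end up first and last, while the least relevant one lands in the middle of the context.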

In [42]:
# Document Reordering
from langchain_community.document_transformers import LongContextReorder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate

# Get documents using the custom MultiQuery retriever
retrieved_docs = custom_multi_query_retriever.invoke(test_query)

# Apply reordering
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(retrieved_docs)

# Define a prompt template
prompt_template = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "query"],
)

# Create and invoke the chain with reordered documents
chain = create_stuff_documents_chain(llm, prompt)
reordered_response = chain.invoke({"context": reordered_docs, "query": test_query})

print("\n=== Long Context Reordering ===")
print(f"Question: {test_query}")
print("Answer:")
print(reordered_response)

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are 5 alternative rephrasings or expansions of the question:', '1. How do I remove an existing policy from the CyberArk server?', '2. What is the syntax for deleting a policy in CyberArk API, and what parameters should be included?', '3. Can you provide information on how to delete a specific policy by ID/name using the CyberArk REST API endpoint?', '4. What HTTP method (e.g. GET, POST, DELETE) is used to delete a policy in CyberArk, and are there any specific request body requirements?', '5. How do I update or remove an existing policy configuration in CyberArk using the CyberArk SDK/Python library, including any relevant code snippets?']



=== Long Context Reordering ===
Question: How to delete a policy?
Answer:
To delete a policy using the "/Policy/DeletePolicyBlock" endpoint, you need to send a POST request with a JSON payload containing the policy block path. The endpoint requires bearer authentication for security purposes.

Here is an example of how to delete a policy:

* HTTP Method: POST
* Path: `/Policy/DeletePolicyBlock`
* Security Requirement: Bearer Authentication
* Request Body:
  ```json
{
  "path": "string_value"
}
```
* Response:
  * Status Code: 200 (API-Result)
  * Content Type: `*/*`
  * Response Schema: PolicyDeletePolicyBlock
  * Response Body Properties:
    * `Result` (boolean): Did the policy block delete succeed.
    * `Error` (object): Error message text on failure, may be null

Note that if the policy block does not exist in your system, an error message will be returned in the response body.


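## 3. Retriever Comparison
The table below summarizes the retrievers. Document counts come from individual runs on the test query. For the LLM-based retrievers they vary between runs because the generated queries differ (MultiQuery returned 14 and then 10 documents above).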

| Retriever Type | Documents Retrieved | Advantages | Limitations |
|----------------|---------------------|------------|-------------|
| Basic Vector | 4 | Simple, fast | Sensitive to query phrasing |
| MMR | 4 | Diverse results, reduces redundancy | May miss some relevant documents |
| Similarity Threshold | 2 | Filters low-quality matches | Risk of excluding relevant documents |
| MultiQuery | 14 | Improves query coverage | Computationally expensive |
| Custom MultiQuery | 9 | Domain-specific query expansion | Quality depends on prompt design |
| Ensemble (BM25 + Vector) | 5 | Combines keyword and semantic search | More complex implementation |
| Enhanced Ensemble | 14 | Best of both worlds approach | Highest complexity |
| Reordered Custom MultiQuery | 11 | Optimizes document ordering for LLMs | Post-processing step only |
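
Document counts alone do not show whether the right chunk was retrieved. One simple, hedged way to compare retrievers is to record the rank at which the expected endpoint first appears. The helper `first_hit_rank` and the abbreviated result texts below are hypothetical and for illustration only; the marker `/Policy/DeletePolicyBlock` is the endpoint path that appears in the retrieved documents above.

```python
def first_hit_rank(doc_texts, marker):
    """Return the 1-based rank of the first text containing `marker`, or None."""
    for rank, text in enumerate(doc_texts, start=1):
        if marker in text:
            return rank
    return None

# Hypothetical, abbreviated result texts; not actual notebook output
toy_results = {
    "Retriever A": ["POST /Policy/DeletePolicyBlock ...", "POST /Policy/SavePolicyBlock3 ..."],
    "Retriever B": ["POST /Policy/SavePolicyBlock3 ...", "POST /Policy/DeletePolicyBlock ..."],
    "Retriever C": ["PictureUri (string) ...", "DisplayName (string) ..."],
}

ranks = {
    name: first_hit_rank(texts, "/Policy/DeletePolicyBlock")
    for name, texts in toy_results.items()
}
print(ranks)  # {'Retriever A': 1, 'Retriever B': 2, 'Retriever C': None}
```

In the notebook, the same helper could be fed `[doc.page_content for doc in docs]` for each retriever's results. A lower rank is better, and `None` means the expected chunk was not retrieved.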