### Prerequisites

In [None]:
!pip install colab-xterm
%load_ext colabxterm
%xterm

# RAG Retriever Comparison
This notebook compares different retrieval techniques for Retrieval-Augmented Generation (RAG) systems to determine which one provides the best results for our use case with CyberArk API documentation.

## Introduction
Retrieval quality is critical for RAG systems: the retrieved documents directly determine the quality of the generated answers. In this notebook, we compare several retrieval techniques on the same test query to find the approach that works best for this documentation set.

### 1. Setup and Prerequisites


In [None]:
# Install required packages
%pip install -q langchain-ollama langchain langchain-community faiss-cpu langchain_huggingface rank_bm25


In [None]:

# Initialize the LLM
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3.2:latest", temperature=0.5)

# Initialize the embedding model
from langchain_huggingface import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the vector store (FAISS lives in langchain_community; the langchain.vectorstores path is deprecated)
from langchain_community.vectorstores import FAISS
loaded_faiss_store = FAISS.load_local(
    "/content/RAG_BOT/LocalEmbeddings/huggingface_faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True
)
print("FAISS vector store loaded successfully.")

FAISS vector store loaded successfully.


### 2. Retriever Implementations


In [None]:
# Test query
test_query = "How to delete a policy?"

# Function to create QA chain and get answer
def get_answer(retriever, query, description=""):
    from langchain.chains import RetrievalQA

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True
    )

    print(f"\n=== {description} ===")
    print(f"Question: {query}")

    result = qa_chain.invoke({"query": query})

    print("Answer:")
    print(result["result"])
    print(f"Number of source documents: {len(result['source_documents'])}")

    return result

#### 2.1 Basic Vector Retriever
The simplest approach using vector similarity search.

In [None]:
# Basic vector retriever
basic_retriever = loaded_faiss_store.as_retriever()

# Test retrieval
basic_docs = basic_retriever.invoke(test_query)
print(f"Basic retriever returned {len(basic_docs)} documents")

# Get answer
basic_result = get_answer(basic_retriever, test_query, "Basic Vector Retriever")

Basic retriever returned 4 documents

=== Basic Vector Retriever ===
Question: How to delete a policy?
Answer:
To delete a policy, you can use the DELETE method on the "/Policy/DeletePolicyBlock" endpoint. You will need to provide the path of the policy block to be deleted in the request body as a JSON object with a single field "path". The request should include the Content-Type header set to application/json.

Example:
```json
{
  "path": "string_value"
}
```
This will send a DELETE request to the "/Policy/DeletePolicyBlock" endpoint with the specified policy block path, and return a response indicating whether the deletion was successful or not.
Number of source documents: 4


#### 2.2 Maximum Marginal Relevance (MMR) Retriever
MMR promotes diversity in retrieved documents, balancing relevance with information diversity.
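
To make the selection rule concrete, here is a minimal plain-Python sketch of greedy MMR. It is an illustration, not LangChain's internal code: the toy 2-D vectors and the `lambda_mult` value of 0.5 are assumptions chosen so that a near-duplicate document gets skipped in favor of a more diverse one.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: repeatedly pick the document that maximizes
    lambda * sim(query, doc) - (1 - lambda) * max sim(doc, already selected)."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (assumed): doc 1 is a near-duplicate of doc 0, doc 2 points elsewhere
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]

print(mmr_select(query, docs, k=2))                   # [0, 2]: skips the near-duplicate
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]: pure similarity ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches plain similarity search; lower values trade relevance for diversity.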

In [None]:
# MMR retriever
mmr_retriever = loaded_faiss_store.as_retriever(search_type="mmr")

# Test retrieval
mmr_docs = mmr_retriever.invoke(test_query)
print(f"MMR retriever returned {len(mmr_docs)} documents")

# Get answer
mmr_result = get_answer(mmr_retriever, test_query, "Maximum Marginal Relevance Retriever")

MMR retriever returned 4 documents

=== Maximum Marginal Relevance Retriever ===
Question: How to delete a policy?
Answer:
To delete a policy, you can use the DELETE endpoint for Policy Management. The endpoint URL is `/Policy/DeletePolicyBlock` and it requires a JSON payload with the `path` property containing the path of the policy block to be deleted.

Here's an example request:

```json
{
  "path": "/path/to/policy/block"
}
```

You should send a POST request to the `/Policy/DeletePolicyBlock` endpoint with this JSON payload. The response will indicate whether the deletion was successful or not, along with any error messages that may have occurred.

For example, if the deletion is successful, the response might look like this:

```json
{
  "Result": true,
  "Error": {}
}
```

If there's an error, the response might include an `Error` object with details about what went wrong.
Number of source documents: 4


#### 2.3 Similarity Score Threshold Retriever
Only returns documents above a certain similarity threshold, helping filter out irrelevant results.
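
As a rough sketch of what this filter does, assuming relevance scores normalized to the 0 to 1 range (higher is better), the retriever keeps documents that meet the threshold and then caps the result at `k`. The document names and scores below are hypothetical.

```python
def filter_by_threshold(scored_docs, score_threshold=0.3, k=2):
    """Keep at most k documents whose relevance score meets the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked if score >= score_threshold][:k]

# Hypothetical (document, relevance score) pairs
scored = [("chunk-a", 0.62), ("chunk-b", 0.41), ("chunk-c", 0.35), ("chunk-d", 0.12)]

print(filter_by_threshold(scored))                       # ['chunk-a', 'chunk-b']
print(filter_by_threshold(scored, score_threshold=0.5))  # ['chunk-a']
```

Note that `chunk-c` clears the 0.3 threshold but is still dropped by `k=2`, so with these settings the retriever never returns more than two documents.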

In [None]:
# Similarity Score Threshold retriever
sst_retriever = loaded_faiss_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.3, "k": 2}
)

# Test retrieval
sst_docs = sst_retriever.invoke(test_query)
print(f"SST retriever returned {len(sst_docs)} documents")

# Get answer
sst_result = get_answer(sst_retriever, test_query, "Similarity Score Threshold Retriever")

SST retriever returned 2 documents

=== Similarity Score Threshold Retriever ===
Question: How to delete a policy?
Answer:
To delete a policy, you need to send a POST request to the /Policy/DeletePolicyBlock endpoint with the required JSON payload containing the path of the policy block to be deleted. The request body should have the "path" field with a string value.

For example:

```json
{
  "path": "string_value"
}
```

This will return a response in the format specified by the PolicyDeletePolicyBlock schema, indicating whether the deletion was successful or not.
Number of source documents: 2


#### 2.4 MultiQuery Retriever
Generates multiple query variations to improve retrieval coverage.
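
Conceptually, the retriever runs each generated query against the base retriever and returns the de-duplicated union of the results, which is why it can return more documents than a single search. A minimal sketch of that merge step, with a hypothetical `fake_search` function standing in for the vector store:

```python
def multi_query_retrieve(queries, search):
    """Run every query through `search` and return the de-duplicated union,
    preserving first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical stand-in for a vector search: maps a query to ranked chunk labels
FAKE_INDEX = {
    "How to delete a policy?": ["/Policy/DeletePolicyBlock", "/Policy/SavePolicyBlock3"],
    "Which endpoint removes a policy block?": ["/Policy/DeletePolicyBlock", "unrelated-chunk"],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

merged = multi_query_retrieve(list(FAKE_INDEX), fake_search)
print(merged)
# ['/Policy/DeletePolicyBlock', '/Policy/SavePolicyBlock3', 'unrelated-chunk']
```

The sketch also shows the risk: any off-topic query variation adds its own results to the union.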

In [None]:
# MultiQuery retriever
from langchain.retrievers.multi_query import MultiQueryRetriever

# Configure logging to see generated queries
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Create retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=loaded_faiss_store.as_retriever(),
    llm=llm
)

# Test retrieval (invoke replaces the deprecated get_relevant_documents)
multi_query_docs = multi_query_retriever.invoke(test_query)
print(f"MultiQuery retriever returned {len(multi_query_docs)} documents")

# Get answer
multi_query_result = get_answer(multi_query_retriever, test_query, "MultiQuery Retriever")

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three different versions of the original question:', 'How to remove or cancel an existing insurance policy?', ' ', 'What steps can I take to terminate my current health insurance policy?', 'Can someone help me with deleting or revoking my auto insurance policy?']


MultiQuery retriever returned 10 documents

=== MultiQuery Retriever ===
Question: How to delete a policy?


INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three different versions of the original question:', 'How to cancel or terminate an existing policy?', 'What steps can I take to remove a current insurance policy from my records?', 'Can anyone help me with deleting or voiding a pre-existing policy agreement?']


Answer:
To delete a policy, you can use the Delete Policy Block endpoint with a POST method and provide the path of the policy block to be deleted in the request body. The request should have a JSON payload with the path field as a string.

Here is an example of how the request would look like:

```json
{
  "path": "/Policy/SavePolicyBlock3"
}
```

This endpoint requires authentication using the bearerAuth security scheme and returns a response in the format specified by the PolicyDeletePolicyBlock schema.
Number of source documents: 6


#### 2.5 Custom MultiQuery Retriever with Output Parser
A more tailored approach to query generation for domain-specific needs.
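
One thing to watch with line-based parsing: models often wrap the list in an introductory sentence, number each item, and add a closing remark, and a parser that only splits on newlines passes all of those through as queries (the logged queries later in this section show this). A possible stricter cleanup is sketched below. The question-mark filter is a heuristic assumption, not part of LangChain, and `clean_query_lines` is a hypothetical helper name.

```python
import re

# Matches leading list markers such as "1. ", "2) ", "- ", "* "
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")

def clean_query_lines(text):
    """Split LLM output into queries, dropping list markers and non-question lines."""
    queries = []
    for line in text.strip().splitlines():
        line = _LIST_MARKER.sub("", line).strip()
        # Heuristic (assumption): genuine alternatives are phrased as questions
        if line.endswith("?"):
            queries.append(line)
    return queries

sample = """Here are 3 alternative rephrasings:
1. How do I remove an existing policy from CyberArk?
2. What is the API endpoint and HTTP method required to delete a policy in CyberArk?

These alternatives aim to provide additional context."""

for q in clean_query_lines(sample):
    print(q)
```

The same logic could be placed inside the `parse` method of a custom output parser so that only the cleaned queries reach the retriever.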

In [None]:
# Custom MultiQuery with Output Parser
from typing import List
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate

# Create custom output parser
class LineListOutputParser(BaseOutputParser[List[str]]):
    def parse(self, txt: str) -> List[str]:
        lines = txt.strip().split("\n")
        return list(filter(None, lines))

output_parser = LineListOutputParser()

# Domain-specific prompt for CyberArk API documentation
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""
You are an expert AI assistant helping users query CyberArk API documentation.

Given the user question below, generate up to 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks.

These alternatives should:
- Include any possible synonyms or related terms.
- Expand abbreviations.
- Include common ways developers might phrase this question.
- Optionally incorporate relevant concepts such as endpoint names, paths, HTTP methods, parameters, request or response schemas if they are likely relevant.
- Be clear and natural in language.

Each alternative should be on its own line.

User Question:
{question}

Alternative queries:
"""
)

# Create chain and retriever
llm_chain = QUERY_PROMPT | llm | output_parser

custom_multi_query_retriever = MultiQueryRetriever(
    retriever=loaded_faiss_store.as_retriever(),
    llm_chain=llm_chain,
    parser_key="lines"
)

# Test retrieval
custom_multi_query_docs = custom_multi_query_retriever.invoke(test_query)
print(f"Custom MultiQuery retriever returned {len(custom_multi_query_docs)} documents")

# Get answer
custom_multi_query_result = get_answer(custom_multi_query_retriever, test_query, "Custom MultiQuery Retriever")

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are five alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks:', "1. How to remove an existing policy from CyberArk's configuration?", '2. What is the API endpoint and HTTP method required to delete a policy in CyberArk?', '3. Can you provide details on the request body schema for deleting a policy, including any required parameters or fields?', '4. How do I use the CyberArk API to revoke or invalidate an existing policy, and what are the implications for access control?', '5. Are there specific steps or considerations needed when deleting a policy in CyberArk, such as handling dependencies or relationships with other policies?', 'These alternatives aim to provide additional context, clarify the request, or explore related concepts that might be relevant to retrieving the most accurate documentation chunks.']


Custom MultiQuery retriever returned 13 documents

=== Custom MultiQuery Retriever ===
Question: How to delete a policy?


INFO:langchain.retrievers.multi_query:Generated queries: ['Here are 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks:', '1. How do I remove an existing policy from the CyberArk server?', '2. What is the API endpoint and HTTP method required to delete a policy in CyberArk?', '3. Can you provide instructions on how to use the CyberArk API to revoke or delete a specific policy ID?', '4. How can I find the schema for the DeletePolicy request in the CyberArk API documentation, including any required parameters and query parameters?', '5. What are the different methods (GET, POST, PUT, DELETE) used to manage policies in the CyberArk API, and how do I use them to delete a policy?', 'These alternative queries incorporate synonyms, expand abbreviations, and include common ways developers might phrase this question, while also incorporating relevant concepts such as endpoint names, paths, HTTP methods, parameters, request or 

Answer:
To delete a policy, you can use the `DELETE` method on the `/policies/{policyId}` endpoint.

Here's an example of how you can do this using the CyberArk Identity API:

```
curl -X DELETE \
  https://<your-organization-domain>.cyberark.com/api/v2/policies/<policyId> \
  -H 'Authorization: Bearer <your-api-token>' \
  -H 'Content-Type: application/json'
```

Replace `<your-organization-domain>` with your organization's domain, `<policyId>` with the ID of the policy you want to delete, and `<your-api-token>` with a valid API token for your organization.

Note that this will permanently delete the policy from the CyberArk Identity system. Be sure to test this endpoint in a non-production environment before using it in production.

Also, if you want to delete all policies at once, you can use the `/policies` endpoint and set the `limit` parameter to `-1`. This will return an empty array, indicating that no policies were found.

```
curl -X GET \
  https://<your-organization-domain>.
```

#### 2.6 Ensemble Retriever (BM25 + FAISS)
Combines keyword-based (BM25) and semantic (FAISS) search for better coverage.
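
LangChain documents `EnsembleRetriever` as merging its result lists with weighted Reciprocal Rank Fusion. The following standalone sketch shows the idea; the constant `c=60` is the customary RRF default and is assumed here, and the document labels are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document scores sum(weight / (rank + c))
    over the lists it appears in, with ranks starting at 1."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked results from the two retrievers
bm25_hits = ["doc-delete", "doc-save"]
vector_hits = ["doc-get", "doc-delete", "doc-list", "doc-save"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.5, 0.5])
print(fused)
# ['doc-delete', 'doc-save', 'doc-get', 'doc-list']
```

Documents found by both retrievers rise to the top, and the fused list is the union of both inputs, so its length depends on how much the two result sets overlap.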

In [None]:
# Ensemble Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Extract documents from the docstore
try:
    all_docs = [loaded_faiss_store.docstore._dict[doc_id] for doc_id in loaded_faiss_store.index_to_docstore_id.values()]
except AttributeError:
    # Fallback for different docstore structure
    all_docs = [loaded_faiss_store.docstore.get(doc_id) for doc_id in loaded_faiss_store.index_to_docstore_id.values()]

# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 2

# Create ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, basic_retriever],
    weights=[0.5, 0.5]
)

# Test retrieval
ensemble_docs = ensemble_retriever.invoke(test_query)
print(f"Ensemble retriever returned {len(ensemble_docs)} documents")

# Get answer
ensemble_result = get_answer(ensemble_retriever, test_query, "Ensemble Retriever (BM25 + FAISS)")

Ensemble retriever returned 6 documents

=== Ensemble Retriever (BM25 + FAISS) ===
Question: How to delete a policy?
Answer:
To delete a policy, you would send a POST request to the /Policy/DeletePolicyBlock endpoint with a JSON body containing the path of the policy block you want to delete. The request body should have a single field named "path" which contains the path of the policy block.

For example:

```json
{
  "path": "string_value"
}
```

This would be sent in the request body, and the response would contain a "Result" field indicating whether the deletion was successful or not.
Number of source documents: 6


#### 2.7 Enhanced Ensemble with Custom MultiQuery
Combining the best techniques: keyword search and custom query expansion.

In [None]:
# Enhanced Ensemble with Custom MultiQuery
enhanced_ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, custom_multi_query_retriever],
    weights=[0.4, 0.6]
)

# Test retrieval
enhanced_ensemble_docs = enhanced_ensemble_retriever.invoke(test_query)
print(f"Enhanced Ensemble retriever returned {len(enhanced_ensemble_docs)} documents")

# Get answer
enhanced_ensemble_result = get_answer(enhanced_ensemble_retriever, test_query, "Enhanced Ensemble (BM25 + Custom MultiQuery)")

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks:', '1. How do I remove an existing policy from CyberArk?', '2. What is the procedure for deleting a pre-defined policy in CyberArk Constrained Database?', '3. Can you provide instructions on how to delete a policy using the CyberArk API, including any required authentication or authorization steps?', '4. How to revoke access to a specific policy via the CyberArk REST API endpoint?', '5. Are there any specific parameters or HTTP methods (e.g. DELETE, POST) that need to be used when deleting a policy in CyberArk, and what is the expected response schema?', 'These alternative queries incorporate synonyms (e.g. "remove" instead of "delete"), expand abbreviations (e.g. "Constrained Database" for "CDB"), and include common developer phrases (e.g. "using the CyberArk API"). They also try to identify relevant

Enhanced Ensemble retriever returned 14 documents

=== Enhanced Ensemble (BM25 + Custom MultiQuery) ===
Question: How to delete a policy?


INFO:langchain.retrievers.multi_query:Generated queries: ['Here are 5 alternative rephrasings or expansions of the question that could help retrieve the most relevant documentation chunks:', '1. How do I remove an existing policy from CyberArk?', '2. What is the API endpoint for deleting a policy in CyberArk, and what HTTP method should be used?', "3. Can you provide details on how to delete a specific policy using CyberArk's REST API or SDK?", "4. Are there any parameters or query strings that need to be passed when requesting the deletion of a policy via CyberArk's API?", "5. How do I handle errors or edge cases when attempting to delete a policy through CyberArk's API, and what are the response schema expectations for a successful deletion?", 'These alternatives aim to provide more context, clarify the request, and include relevant details that might help retrieve the most accurate documentation chunks from the CyberArk API documentation.']


Answer:
To delete a policy, you can use the `Delete Policy` endpoint. Here are the steps:

1. Open the API documentation for your CyberArk Cloud Directory instance.
2. Navigate to the `Policy Management` section.
3. Click on the "Delete Policy" endpoint.
4. Provide the UUID of the policy you want to delete in the request body.

The request body should contain a JSON object with the following properties:
```json
{
  "uuid": "string_value"
}
```
Replace `"string_value"` with the actual UUID of the policy you want to delete.

Example Request:
```bash
POST /policy-management/delete-policy HTTP/1.1
Content-Type: application/json

{
  "uuid": "12345678-1234-1234-1234-123456789012"
}
```
This will send a request to the `Delete Policy` endpoint with the UUID of the policy to be deleted.

Note: Make sure you have the necessary permissions and access rights to delete policies in your CyberArk Cloud Directory instance.
Number of source documents: 12


#### 2.8 Document Reordering for Long Context
Optimizes document order for better context utilization in the LLM.
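
The idea behind this reordering is that LLMs tend to use information at the start and end of a long context better than information in the middle. A minimal sketch of such a reordering is shown below; it illustrates the technique and is not claimed to match `LongContextReorder` line for line.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant items at both ends of the list and the
    least relevant in the middle. Input is sorted most relevant first."""
    reordered = []
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            reordered.append(doc)      # goes toward the end
        else:
            reordered.insert(0, doc)   # goes toward the start
    return reordered

ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_long_context(ranked))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The set of documents is unchanged; only their position in the prompt differs.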

In [None]:
# Document Reordering
from langchain_community.document_transformers import LongContextReorder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate

# Get documents using our best retriever
retrieved_docs = custom_multi_query_retriever.invoke(test_query)

# Apply reordering
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(retrieved_docs)

# Define a prompt template
prompt_template = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "query"],
)

# Create and invoke the chain with reordered documents
chain = create_stuff_documents_chain(llm, prompt)
reordered_response = chain.invoke({"context": reordered_docs, "query": test_query})

print("\n=== Long Context Reordering ===")
print(f"Question: {test_query}")
print("Answer:")
print(reordered_response)

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are 5 alternative rephrasings or expansions of the user question that could help retrieve the most relevant documentation chunks:', '1. What is the procedure for removing an existing policy from the CyberArk server?', '2. How do I delete a specific policy using its GUID (e.g., /policies/{policyGuid}/delete)?', '3. Can you provide instructions on how to remove a policy that has been assigned to a user or group in CyberArk?', '4. What is the HTTP method and endpoint for deleting a policy in CyberArk (e.g., DELETE /policies/{policyId})?', '5. Are there any specific parameters or query options required when deleting a policy, such as the --force option?']



=== Long Context Reordering ===
Question: How to delete a policy?
Answer:
To delete a policy, use the endpoint `/Policy/SavePolicyBlock3` with a `POST` request. 

This endpoint allows you to replace and deprecate the previous `SavePolicyBlock2` by adding the "rev stamp" functionality that helps prevent change loss when policies are being edited by multiple users.

The request body should contain an object with two properties: `policy` and `plinks`. The `policy` property is a JSON object containing the details of the policy being added or updated, including `Newpolicy`, `Version`, `Path`, `RevStamp`, and `Description`. If `Newpolicy` is false, then `RevStamp` must be provided.

The `plinks` property is an array of plink objects representing the current plinks file. 

Note: This endpoint requires authentication with a bearer token.


### 3. Retriever Comparison
Let's systematically compare the retrievers:

In [None]:
import pandas as pd

# Define our comparison metrics
retriever_comparison = pd.DataFrame({
    "Retriever Type": [
        "Basic Vector",
        "MMR",
        "Similarity Threshold",
        "MultiQuery",
        "Custom MultiQuery",
        "Ensemble (BM25 + Vector)",
        "Enhanced Ensemble",
        "Reordered Custom MultiQuery"
    ],
    "Documents Retrieved": [
        len(basic_docs),
        len(mmr_docs),
        len(sst_docs),
        len(multi_query_docs),
        len(custom_multi_query_docs),
        len(ensemble_docs),
        len(enhanced_ensemble_docs),
        len(retrieved_docs)
    ],
    "Advantages": [
        "Simple, fast",
        "Diverse results, reduces redundancy",
        "Filters low-quality matches",
        "Improves query coverage",
        "Domain-specific query expansion",
        "Combines keyword and semantic search",
        "Best of both worlds approach",
        "Optimizes document ordering for LLMs"
    ],
    "Limitations": [
        "Sensitive to query phrasing",
        "May miss some relevant documents",
        "Risk of excluding relevant documents",
        "Computationally expensive",
        "Quality depends on prompt design",
        "More complex implementation",
        "Highest complexity",
        "Post-processing step only"
    ]
})

print("Retriever Comparison:")
retriever_comparison

Retriever Comparison:


| | Retriever Type | Documents Retrieved | Advantages | Limitations |
|---|---|---|---|---|
| 0 | Basic Vector | 4 | Simple, fast | Sensitive to query phrasing |
| 1 | MMR | 4 | Diverse results, reduces redundancy | May miss some relevant documents |
| 2 | Similarity Threshold | 2 | Filters low-quality matches | Risk of excluding relevant documents |
| 3 | MultiQuery | 10 | Improves query coverage | Computationally expensive |
| 4 | Custom MultiQuery | 13 | Domain-specific query expansion | Quality depends on prompt design |
| 5 | Ensemble (BM25 + Vector) | 6 | Combines keyword and semantic search | More complex implementation |
| 6 | Enhanced Ensemble | 14 | Best of both worlds approach | Highest complexity |
| 7 | Reordered Custom MultiQuery | 10 | Optimizes document ordering for LLMs | Post-processing step only |


✅ Basic retriever: simple and fast; precise when the query wording matches the docs, but can miss relevant chunks
✅ MMR: more diverse, reduces duplication
✅ SST: filters weak matches, but sometimes too strict
✅ MultiQuery: broader recall, but the generated queries drifted off-domain (insurance policies)
✅ Custom MultiQuery: better for domain terms (CyberArk API)
✅ Ensemble: combines semantic & keyword, robust
✅ Enhanced Ensemble (BM25 + Custom MQ): widest coverage (most documents retrieved)
✅ LongContextReorder: doesn't change which docs are retrieved, only their order in the final prompt
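
Document counts alone do not show whether the chunk describing the right endpoint was retrieved; in the runs above, several answers named `/Policy/DeletePolicyBlock` while others did not. One simple check, sketched here under the assumption that the expected chunk contains that path as text, is to record the rank at which it first appears. The `results` dictionary below uses hypothetical chunk texts; in the notebook it could be built from `[doc.page_content for doc in basic_docs]` and so on.

```python
def hit_rank(texts, expected_marker):
    """Return the 1-based rank of the first text containing the marker, or None."""
    for rank, text in enumerate(texts, start=1):
        if expected_marker in text:
            return rank
    return None

# Hypothetical retrieved chunk texts per retriever
results = {
    "retriever_a": [
        "/Policy/SavePolicyBlock3: save a policy block",
        "/Policy/DeletePolicyBlock: delete a policy block",
    ],
    "retriever_b": ["/Policy/SavePolicyBlock3: save a policy block"],
}

for name, texts in results.items():
    print(name, hit_rank(texts, "/Policy/DeletePolicyBlock"))
# retriever_a 2
# retriever_b None
```

Running a check like this over a small set of questions with known target endpoints would give a hit rate per retriever to set beside the counts in the table.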

### 4. Recommendations: Which should you pick?
Based on your scenario (CyberArk API Docs RAG Chatbot):

✔ Domain is structured (API paths, methods, etc.)
✔ User queries are often poorly phrased (“delete policy” vs “DELETE /policy”)
✔ High recall is critical (must not miss the right endpoint)
✔ Also want precision to avoid irrelevant clutter

Hence, you want:

👉 Semantic retrieval (FAISS) to catch paraphrases
👉 Keyword retrieval (BM25) for exact terms
👉 Domain-aware query expansion
👉 Reordering for prompt context optimization

🔥 Recommended setup for production:
✅ Retriever = Enhanced Ensemble (BM25 + Custom MultiQuery Retriever)
✅ Add LongContextReorder as a post-processing step

✅ Why?

👉 Custom MultiQuery expands the user question with domain-specific phrasing.
👉 BM25 ensures that exact API paths, methods, and keywords still match.
👉 Ensemble weights (e.g. 0.4 BM25, 0.6 Custom MQ) balance lexical and semantic retrieval.
👉 Reordering places the most relevant documents at the start and end of the context, where LLMs tend to use them best, and the least relevant in the middle.

In [None]:
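# Retrieve with the enhanced ensemble, then reorder its results
enhanced_ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, custom_multi_query_retriever],
    weights=[0.4, 0.6]
)
final_docs = enhanced_ensemble_retriever.invoke(test_query)

# Reorder the ensemble's documents (not the earlier Custom MultiQuery results)
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(final_docs)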


=== FINAL COMBINATION ===
Question: How to delete a policy?
