# Cuttlefish3 RAG Chunking & Retrieval Evaluation

This notebook evaluates seven RAG retrieval methods on JIRA issue data from the Cuttlefish3 project.

**Retrieval Methods Evaluated:**
- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (Rerank)
- Ensemble Retrieval
- Semantic Chunking

**Data Sources:**
- JIRA Issues: `JIRA_OPEN_DATA_LARGESET_DATESHIFTED.csv`
- Golden Dataset: `cuttlefish-jira-golden-dataset-20250731-122634`

## Setup: Dependencies & API Keys

In [1]:
# Install required packages with the version constraints from requirements.txt.
# Specifiers are quoted so the shell does not treat ">" as an output redirect.
!pip install -q "jupyter>=1.1.1"
!pip install -q "langchain-experimental>=0.3.4"
!pip install -q "langchain>=0.3.19"
!pip install -q "cohere>=5.12.0,<5.13.0"
!pip install -q "langchain-cohere==0.4.4"
!pip install -q "langchain-openai>=0.3.7"
!pip install -q "qdrant-client>=1.13.2"
!pip install -q "rank-bm25>=0.2.2"
!pip install -q "langchain-qdrant>=0.2.0"
!pip install -q ragas langsmith datasets pandas numpy pillow rapidfuzz langchain-community


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip

In [2]:
import os
import getpass
from uuid import uuid4

# API Keys
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")
    
if "COHERE_API_KEY" not in os.environ:
    os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API Key: ")

# LangSmith Configuration
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
if "LANGCHAIN_API_KEY" not in os.environ:
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key: ")
os.environ["LANGCHAIN_PROJECT"] = f"Cuttlefish3 RAG Evaluation - {uuid4().hex[0:8]}"

print("✅ API keys configured successfully!")

✅ API keys configured successfully!


## Data Loading: JIRA Issues

In [3]:
import csv
from langchain_core.documents import Document

# Set CSV field size limit for large JIRA descriptions
csv.field_size_limit(10000000)

print("Loading JIRA issue data...")
jira_documents = []

with open('./JIRA_OPEN_DATA_LARGESET_DATESHIFTED.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    
    for i, row in enumerate(reader):
        title = row.get('title', '').strip()
        description = row.get('description', '').strip()
        
        # Skip empty entries
        if not title and not description:
            continue
            
        # Create combined content for better RAG performance
        if title and description:
            content = f"Title: {title}\n\nDescription: {description}"
        elif title:
            content = f"Title: {title}"
        else:
            content = f"Description: {description}"
        
        # Create document with JIRA metadata
        doc = Document(
            page_content=content,
            metadata={
                "key": row.get('key', ''),
                "project": row.get('project', ''),
                "project_name": row.get('project_name', ''),
                "priority": row.get('priority', ''),
                "type": row.get('type', ''),
                "status": row.get('status', ''),
                "created": row.get('created', ''),
                "title": title,
                "description_length": len(description)
            }
        )
        
        jira_documents.append(doc)
        
        # Limit to first 1000 documents for manageable processing
        if len(jira_documents) >= 1000:
            break

print(f"✅ Loaded {len(jira_documents)} JIRA issue documents")
print(f"Sample document: {jira_documents[0].page_content[:200]}...")

# Show project distribution
import pandas as pd
projects = [doc.metadata['project'] for doc in jira_documents]
project_counts = pd.Series(projects).value_counts().head(5)
print(f"\nTop 5 projects in dataset:")
for project, count in project_counts.items():
    print(f"  {project}: {count} issues")

Loading JIRA issue data...
✅ Loaded 1000 JIRA issue documents
Sample document: Title: MAX_VERSIONS not respected.

Description: Below is a report from the list.  I confirmed playing in shell that indeed we have this problem.  Lets fix for 0.2.1.{code}Hello.I made some tests with...

Top 5 projects in dataset:
  HBASE: 1000 issues


## Base Components: Embeddings, VectorStore & LLM

In [4]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Base vectorstore for JIRA issues
vectorstore = Qdrant.from_documents(
    jira_documents,
    embeddings,
    location=":memory:",
    collection_name="JiraIssues"
)

# Chat model
chat_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

# RAG prompt template
RAG_TEMPLATE = """\
You are a technical support assistant specializing in software issue resolution.
Use the JIRA issue context provided below to answer the question accurately.

If you don't know the answer based on the context, say so clearly.

Question: {question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

print("✅ Base components configured successfully!")

✅ Base components configured successfully!


## Chain Builder Function

In [5]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

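def build_rag_chain(retriever, chain_name="RAG"):
    """Build a RAG chain that returns both the model response and the retrieved context."""
    chain = (
        # Step 1: retrieve documents for the question and keep the question itself
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        # Step 2: pass-through; "context" is carried forward unchanged
        | RunnablePassthrough.assign(context=itemgetter("context"))
        # Step 3: generate the answer and return it alongside the retrieved documents
        | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
    )
    print(f"✅ {chain_name} chain created")
    return chain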

print("✅ Chain builder function ready")

✅ Chain builder function ready


## Retrieval Method 1: Naive Retrieval

In [6]:
# Naive retriever - simple cosine similarity
naive_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
naive_chain = build_rag_chain(naive_retriever, "Naive")

# Test query
test_result = naive_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ Naive chain created
Test result: Common issues with HBase, based on the provided context, include:

1. **Stuck Regions During Closure**: HBase can become stuck when closing regions concurrently, which can lead to performance issues o...


## Retrieval Method 2: BM25

In [7]:
from langchain_community.retrievers import BM25Retriever

# BM25 retriever - keyword-based sparse retrieval
bm25_retriever = BM25Retriever.from_documents(jira_documents, k=10)
bm25_chain = build_rag_chain(bm25_retriever, "BM25")

# Test query
test_result = bm25_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ BM25 chain created
Test result: Common issues with HBase, as indicated by the provided JIRA issues, include:

1. **IOExceptions during Table Creation**: There are failure cases in the CreateTable Handler, particularly IOExceptions t...


## Retrieval Method 3: Multi-Query

In [8]:
from langchain.retrievers.multi_query import MultiQueryRetriever

# Multi-query retriever - generates multiple queries for better coverage
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, 
    llm=chat_model
)
multi_query_chain = build_rag_chain(multi_query_retriever, "Multi-Query")

# Test query
test_result = multi_query_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ Multi-Query chain created
Test result: Common issues with HBase, as indicated by the provided JIRA issues, include:

1. **Security Configuration**: There are concerns about enforcing secure Hadoop as a requirement for secure HBase, which c...


## Retrieval Method 4: Parent Document

In [9]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models

# Parent document retriever - search small chunks, return large documents
parent_docs = jira_documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

# Create new vectorstore for parent-child relationship
parent_client = QdrantClient(location=":memory:")
parent_client.create_collection(
    collection_name="jira_parent_docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_vectorstore = QdrantVectorStore(
    collection_name="jira_parent_docs", 
    embedding=embeddings, 
    client=parent_client
)

store = InMemoryStore()
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

parent_document_retriever.add_documents(parent_docs, ids=None)
parent_chain = build_rag_chain(parent_document_retriever, "Parent Document")

# Test query
test_result = parent_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ Parent Document chain created
Test result: Common issues with HBase, based on the provided context, include:

1. **Stuck Regions During Closure**: There is a known issue where HBase can become stuck when closing regions concurrently, which can...
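
The parent-document idea (search small chunks, return the full issue they came from) can be sketched without any vector store. This is a minimal illustration, not the `ParentDocumentRetriever` implementation: `split_into_children`, `overlap_score`, and `parent_document_search` are hypothetical helpers, the fixed-width splitter stands in for `RecursiveCharacterTextSplitter`, and token overlap stands in for embedding similarity.

```python
def split_into_children(text, chunk_size=50):
    """Fixed-width character chunks (assumed stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def overlap_score(query, chunk):
    """Toy relevance: shared lowercase tokens (assumed stand-in for cosine similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def parent_document_search(query, parents, k=2, chunk_size=50):
    # Index step: every child chunk remembers the id of its parent document.
    children = [
        (parent_id, chunk)
        for parent_id, text in parents.items()
        for chunk in split_into_children(text, chunk_size)
    ]
    # Search step: rank the small chunks, not the full documents.
    ranked = sorted(children, key=lambda pc: overlap_score(query, pc[1]), reverse=True)
    # Return step: map hits back to parents, de-duplicated in rank order.
    hits = {}
    for parent_id, chunk in ranked:
        if overlap_score(query, chunk) == 0:
            break  # remaining chunks share no tokens with the query
        hits.setdefault(parent_id, parents[parent_id])
        if len(hits) == k:
            break
    return list(hits.items())


# Made-up issues for illustration only.
parents = {
    "HBASE-1": "Title: Region close hangs. Description: closing regions concurrently can stall the master.",
    "HBASE-2": "Title: Snapshot restore fails. Description: NullPointerException in MetaEditor during restore.",
}
results = parent_document_search("snapshot restore NullPointerException", parents)
print(results[0][0])
```

Because no `parent_splitter` is passed in the cell above, the retriever there likewise returns whole JIRA issues rather than intermediate-sized parent chunks.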


## Retrieval Method 5: Contextual Compression

In [10]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Contextual compression with reranking
compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, 
    base_retriever=naive_retriever
)
compression_chain = build_rag_chain(compression_retriever, "Contextual Compression")

# Test query
test_result = compression_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ Contextual Compression chain created
Test result: Common issues with HBase, as indicated by the provided JIRA issues, include:

1. **Race Conditions**: There are race conditions in the HCM.getMaster method that can stall clients. This issue arises wh...


## Retrieval Method 6: Ensemble

In [11]:
from langchain.retrievers import EnsembleRetriever

# Ensemble retriever - combines multiple retrievers
retriever_list = [
    bm25_retriever, 
    naive_retriever, 
    parent_document_retriever, 
    compression_retriever, 
    multi_query_retriever
]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, 
    weights=equal_weighting
)
ensemble_chain = build_rag_chain(ensemble_retriever, "Ensemble")

# Test query
test_result = ensemble_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ Ensemble chain created
Test result: Common issues with HBase, as indicated by the provided JIRA issues, include:

1. **Concurrency Problems**: HBase can experience deadlocks or stalls when closing regions concurrently, as noted in HBASE...
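
`EnsembleRetriever` merges the ranked lists from its retrievers with weighted Reciprocal Rank Fusion. The sketch below shows that fusion rule in plain Python, assuming the constant `c=60` and 1-based ranks; `weighted_rrf` and the document ids are illustrative names, not part of LangChain.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each hit contributes weight / (rank + c) to its document."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rankings from a sparse and a dense retriever.
bm25_hits = ["HBASE-3", "HBASE-1", "HBASE-7"]
dense_hits = ["HBASE-1", "HBASE-9", "HBASE-3"]
fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.5, 0.5])
print(fused)
```

Documents ranked well by several retrievers rise to the top. Note that in the cell above, the compression and multi-query retrievers both wrap `naive_retriever`, so with equal weights the dense signal enters the fusion through several lists while BM25 contributes only one.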


## Retrieval Method 7: Semantic Chunking

In [12]:
from langchain_experimental.text_splitter import SemanticChunker

# Semantic chunking - split based on semantic similarity
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

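# Use only the first 50 documents to limit processing time.
# Note: this index covers 5% of the 1000 issues the other retrievers search,
# so its evaluation scores are not directly comparable with theirs.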
semantic_documents = semantic_chunker.split_documents(jira_documents[:50])

semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JiraSemanticChunks"
)

semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 10})
semantic_chain = build_rag_chain(semantic_retriever, "Semantic Chunking")

print(f"✅ Created {len(semantic_documents)} semantic chunks from {len(jira_documents[:50])} documents")

# Test query
test_result = semantic_chain.invoke({"question": "What are common issues with HBase?"})
print(f"Test result: {test_result['response'].content[:200]}...")

✅ Semantic Chunking chain created
✅ Created 73 semantic chunks from 50 documents
Test result: Common issues with HBase, as indicated by the provided JIRA issues, include:

1. **Corrupt HFiles**: Corrupt HFiles can lead to resource leaks and Out of Memory (OOM) errors on region servers. This ca...
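
With `breakpoint_threshold_type="percentile"`, the chunker splits wherever the embedding distance between neighbouring sentences exceeds a percentile of all such distances. The sketch below illustrates that rule on made-up distances. It is a simplification under stated assumptions: `percentile` and `split_at_breakpoints` are hypothetical helpers, the distances are hard-coded rather than computed from embeddings, and the cut-off `q` is chosen for the toy data (the library's default cut-off is assumed to be higher).

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def split_at_breakpoints(sentences, distances, q=95):
    """distances[i] is the distance between sentences[i] and sentences[i + 1]."""
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Regions fail to close.",
    "The master stalls.",
    "Restart fixes it.",
    "Unrelated: docs typo.",
    "Fix the spelling.",
]
distances = [0.10, 0.12, 0.80, 0.15]  # made-up values, not real embedding distances
chunks = split_at_breakpoints(sentences, distances, q=75)
print(chunks)
```

Short JIRA issues contain few sentences, which is consistent with the 50 documents above yielding only 73 chunks.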


## Load Golden Dataset for Evaluation

In [15]:
from langsmith import Client

# Load golden dataset from LangSmith
dataset_name = "cuttlefish-jira-golden-dataset-20250731-122634"
client = Client()

print(f"Loading golden dataset: {dataset_name}")

try:
    golden_examples = list(client.list_examples(dataset_name=dataset_name))
    print(f"✅ Loaded {len(golden_examples)} examples from golden dataset")
    
    # Convert to evaluation format
    evaluation_data = []
    for example in golden_examples:
        eval_sample = {
            "user_input": example.inputs["question"],
            "reference_contexts": example.metadata.get("reference_contexts", []),
            "reference": example.outputs["answer"],
            "response": None,
            "retrieved_contexts": None
        }
        evaluation_data.append(eval_sample)
    
    # Simple evaluation sample class
    class EvaluationSample:
        def __init__(self, data):
            self.user_input = data["user_input"]
            self.reference_contexts = data["reference_contexts"]
            self.reference = data["reference"]
            self.response = data["response"]
            self.retrieved_contexts = data["retrieved_contexts"]
    
    evaluation_dataset = [EvaluationSample(data) for data in evaluation_data]
    
    print(f"📋 Sample questions:")
    for i, sample in enumerate(evaluation_dataset[:3]):
        print(f"  {i+1}. {sample.user_input[:100]}...")
    
    print(f"\n✅ Golden dataset ready: {len(evaluation_dataset)} test cases")

except Exception as e:
    print(f"❌ Error loading golden dataset: {e}")
    print("Please ensure the golden dataset exists in LangSmith")

Loading golden dataset: cuttlefish-jira-golden-dataset-20250731-122634
✅ Loaded 15 examples from golden dataset
📋 Sample questions:
  1. Why does the NullPointerException occur in MetaEditor when trying to restore a snapshot, and how doe...
  2. What issues arise from the custom implementation of ReplicationSink and how do they relate to the me...
  3. What are the performance implications of excessive readpoint checks in MemStoreScanner and how do th...

✅ Golden dataset ready: 15 test cases


## Evaluation Function

In [16]:
from langchain_community.callbacks import get_openai_callback
from ragas import evaluate, EvaluationDataset, RunConfig
from ragas.metrics import (
    LLMContextRecall, Faithfulness, FactualCorrectness,
    ResponseRelevancy, ContextRecall, NoiseSensitivity
)
from ragas.llms import LangchainLLMWrapper
import pandas as pd
import time

def evaluate_rag_chain(chain, dataset, method_name):
    """Evaluate a RAG chain using RAGAS metrics with cost and latency tracking."""
    print(f"🚀 Evaluating {method_name}...")
    
    start_time = time.time()
    
    with get_openai_callback() as cb:
        # Generate responses
        print("   Generating responses...")
        for i, sample in enumerate(dataset):
            if i % 5 == 0:
                print(f"      Processing {i+1}/{len(dataset)}")
            
            response = chain.invoke({"question": sample.user_input})
            sample.response = response["response"].content if hasattr(response["response"], 'content') else str(response["response"])
            sample.retrieved_contexts = [
                context.page_content for context in response["context"]
            ]
        
        # Convert to RAGAS dataset
        print("   Running RAGAS evaluation...")
        ragas_data = []
        for sample in dataset:
            ragas_data.append({
                "user_input": sample.user_input,
                "response": sample.response,
                "retrieved_contexts": sample.retrieved_contexts,
                "reference_contexts": sample.reference_contexts,
                "reference": sample.reference
            })
        
        ragas_df = pd.DataFrame(ragas_data)
        evaluation_dataset = EvaluationDataset.from_pandas(ragas_df)
        
        # Configure evaluator
        evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0))
        run_config = RunConfig(timeout=360)
        
        # Run evaluation
        ragas_results = evaluate(
            dataset=evaluation_dataset,
            metrics=[
                LLMContextRecall(),
                Faithfulness(), 
                FactualCorrectness(),
                ResponseRelevancy(),
                ContextRecall(),
                NoiseSensitivity()
            ],
            llm=evaluator_llm,
            run_config=run_config
        )
        
        results_df = ragas_results.to_pandas()
    
    end_time = time.time()
    
    # Compile results
    results = {
        "method": method_name,
        "context_recall": results_df['context_recall'].mean() if 'context_recall' in results_df.columns else 0.0,
        "faithfulness": results_df['faithfulness'].mean() if 'faithfulness' in results_df.columns else 0.0,
        "factual_correctness": results_df['factual_correctness(mode=f1)'].mean() if 'factual_correctness(mode=f1)' in results_df.columns else 0.0,
        "response_relevancy": results_df['answer_relevancy'].mean() if 'answer_relevancy' in results_df.columns else 0.0,
        "noise_sensitivity": results_df['noise_sensitivity(mode=relevant)'].mean() if 'noise_sensitivity(mode=relevant)' in results_df.columns else 0.0,
        "total_cost_usd": cb.total_cost,
        "cost_per_query": cb.total_cost / len(dataset) if len(dataset) > 0 else 0,
        "total_tokens": cb.total_tokens,
        "tokens_per_query": cb.total_tokens / len(dataset) if len(dataset) > 0 else 0,
        "total_latency_seconds": end_time - start_time,
        "latency_per_query": (end_time - start_time) / len(dataset) if len(dataset) > 0 else 0,
    }
    
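    # Calculate average score (unweighted mean of the five metrics).
    # Caveat: RAGAS noise sensitivity is lower-is-better, so averaging it
    # unchanged rewards noisier responses; interpret the average accordingly.
    metrics = ['context_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'noise_sensitivity']
    results['average_score'] = sum(results[metric] for metric in metrics) / len(metrics)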
    
    print(f"✅ {method_name} completed!")
    print(f"   Average Score: {results['average_score']:.4f}")
    print(f"   Cost: ${results['total_cost_usd']:.4f}")
    print(f"   Latency: {results['total_latency_seconds']:.2f}s")
    
    return results

print("✅ Evaluation function ready")

✅ Evaluation function ready
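
The average computed in `evaluate_rag_chain` treats all five metrics as higher-is-better, but RAGAS noise sensitivity is lower-is-better. One possible alternative, shown here as a sketch rather than as the scoring used for the results in this notebook, inverts that metric before averaging. `direction_aware_average` and the example values are hypothetical.

```python
HIGHER_IS_BETTER = ["context_recall", "faithfulness", "factual_correctness", "response_relevancy"]
LOWER_IS_BETTER = ["noise_sensitivity"]


def direction_aware_average(results):
    """Average the metrics after flipping lower-is-better ones onto a higher-is-better scale."""
    values = [results[m] for m in HIGHER_IS_BETTER]
    values += [1.0 - results[m] for m in LOWER_IS_BETTER]
    return sum(values) / len(values)


# Made-up metric values for illustration only.
example = {
    "context_recall": 0.8,
    "faithfulness": 0.9,
    "factual_correctness": 0.5,
    "response_relevancy": 0.9,
    "noise_sensitivity": 0.4,
}
plain_average = sum(example.values()) / len(example)
adjusted_average = direction_aware_average(example)
print(f"plain={plain_average:.2f} adjusted={adjusted_average:.2f}")
```

Under the plain average, a method that becomes more sensitive to noise scores higher; under the adjusted average it scores lower, which matches the metric's intent.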


## Run Evaluations

In [15]:
# Prepare evaluation chains and datasets
chains_to_evaluate = {
    "Naive": naive_chain,
    "BM25": bm25_chain,
    "Multi-Query": multi_query_chain,
    "Parent Document": parent_chain,
    "Contextual Compression": compression_chain,
    "Ensemble": ensemble_chain,
    "Semantic Chunking": semantic_chain
}

print("🚀 Starting comprehensive evaluation...")
print(f"Evaluating {len(chains_to_evaluate)} retrieval methods")
print(f"Using {len(evaluation_dataset)} test cases\n")

# Store all results
all_results = {}

# Run evaluations (Note: the recorded run took roughly 5 to 13 minutes per method)
for method_name, chain in chains_to_evaluate.items():
    print(f"{'='*60}")
    
    # Create fresh dataset copy for each evaluation
    method_dataset = [EvaluationSample({
        "user_input": sample.user_input,
        "reference_contexts": sample.reference_contexts,
        "reference": sample.reference,
        "response": None,
        "retrieved_contexts": None
    }) for sample in evaluation_dataset]
    
    try:
        results = evaluate_rag_chain(chain, method_dataset, method_name)
        all_results[method_name] = results
    except Exception as e:
        print(f"❌ Error evaluating {method_name}: {e}")
        continue

print(f"\n✅ Evaluation completed for {len(all_results)}/{len(chains_to_evaluate)} methods")

🚀 Starting comprehensive evaluation...
Evaluating 7 retrieval methods
Using 15 test cases

🚀 Evaluating Naive...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[23]: TimeoutError()
Exception raised in Job[59]: TimeoutError()


✅ Naive completed!
   Average Score: 0.6641
   Cost: $0.2772
   Latency: 614.85s
🚀 Evaluating BM25...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[17]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[59]: TimeoutError()


✅ BM25 completed!
   Average Score: 0.5220
   Cost: $0.2242
   Latency: 488.77s
🚀 Evaluating Multi-Query...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[54]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[71]: TimeoutError()
Exception raised in Job[83]: TimeoutError()


✅ Multi-Query completed!
   Average Score: 0.6941
   Cost: $0.3040
   Latency: 770.45s
🚀 Evaluating Parent Document...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

✅ Parent Document completed!
   Average Score: 0.5747
   Cost: $0.1311
   Latency: 332.73s
🚀 Evaluating Contextual Compression...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

✅ Contextual Compression completed!
   Average Score: 0.6352
   Cost: $0.1364
   Latency: 329.91s
🚀 Evaluating Ensemble...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[71]: TimeoutError()
Exception raised in Job[83]: TimeoutError()
Exception raised in Job[89]: TimeoutError()


✅ Ensemble completed!
   Average Score: 0.7237
   Cost: $0.3978
   Latency: 727.74s
🚀 Evaluating Semantic Chunking...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[59]: TimeoutError()


✅ Semantic Chunking completed!
   Average Score: 0.7222
   Cost: $0.2776
   Latency: 497.23s

✅ Evaluation completed for 7/7 methods
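
The `TimeoutError` lines above matter for interpreting the scores. Assuming a timed-out job leaves a NaN in the per-sample results (the usual behaviour when exceptions are not re-raised), the pandas `.mean()` calls in `evaluate_rag_chain` skip those samples silently, so some averages are computed over fewer than 15 test cases. A small sketch of reporting coverage alongside the mean, using the hypothetical helper `mean_with_coverage` and made-up scores:

```python
import math


def mean_with_coverage(scores):
    """Return (mean over valid scores, number of valid scores, total number of scores)."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    mean = sum(valid) / len(valid) if valid else float("nan")
    return mean, len(valid), len(scores)


# Made-up per-sample scores; NaN marks a job that timed out.
faithfulness_scores = [0.9, float("nan"), 0.7, 0.8, float("nan")]
mean, n_valid, n_total = mean_with_coverage(faithfulness_scores)
print(f"faithfulness: {mean:.2f} over {n_valid}/{n_total} samples")
```

Methods with many timeouts (Multi-Query and Ensemble in the run above) are the ones whose averages rest on the fewest samples.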


In [17]:
import json
import os
from datetime import datetime

# Results file path
results_file = "cuttlefish3_rag_evaluation_results.json"

# Try to load existing results first
all_results = {}
if os.path.exists(results_file):
    print(f"🔄 Loading existing results from {results_file}")
    try:
        with open(results_file, 'r') as f:
            saved_data = json.load(f)
            all_results = saved_data.get("results", {})
            print(f"✅ Loaded {len(all_results)} existing evaluation results")
            print(f"   Previously evaluated methods: {list(all_results.keys())}")
    except Exception as e:
        print(f"⚠️  Error loading existing results: {e}")
        all_results = {}

# Prepare evaluation chains and datasets
chains_to_evaluate = {
    "Naive": naive_chain,
    "BM25": bm25_chain,
    "Multi-Query": multi_query_chain,
    "Parent Document": parent_chain,
    "Contextual Compression": compression_chain,
    "Ensemble": ensemble_chain,
    "Semantic Chunking": semantic_chain
}

# Filter out already completed evaluations
remaining_chains = {name: chain for name, chain in chains_to_evaluate.items() 
                   if name not in all_results}

if remaining_chains:
    print(f"\n🚀 Starting evaluation for remaining methods...")
    print(f"Evaluating {len(remaining_chains)} retrieval methods")
    print(f"Methods to evaluate: {list(remaining_chains.keys())}")
    print(f"Using {len(evaluation_dataset)} test cases\n")
    
    # Run evaluations for remaining methods
    for method_name, chain in remaining_chains.items():
        print(f"{'='*60}")
        
        # Create fresh dataset copy for each evaluation
        method_dataset = [EvaluationSample({
            "user_input": sample.user_input,
            "reference_contexts": sample.reference_contexts,
            "reference": sample.reference,
            "response": None,
            "retrieved_contexts": None
        }) for sample in evaluation_dataset]
        
        try:
            results = evaluate_rag_chain(chain, method_dataset, method_name)
            all_results[method_name] = results
            
            # Save results immediately after each evaluation
            save_data = {
                "timestamp": datetime.now().isoformat(),
                "dataset_info": {
                    "dataset_name": "cuttlefish-jira-golden-dataset-20250731-122634",
                    "num_test_cases": len(evaluation_dataset),
                    "jira_documents_count": len(jira_documents)
                },
                "results": all_results
            }
            
            with open(results_file, 'w') as f:
                json.dump(save_data, f, indent=2)
            print(f"💾 Saved results to {results_file}")
            
        except Exception as e:
            print(f"❌ Error evaluating {method_name}: {e}")
            # Save partial results even on error
            save_data = {
                "timestamp": datetime.now().isoformat(),
                "dataset_info": {
                    "dataset_name": "cuttlefish-jira-golden-dataset-20250731-122634",
                    "num_test_cases": len(evaluation_dataset),
                    "jira_documents_count": len(jira_documents)
                },
                "results": all_results,
                "last_error": {
                    "method": method_name,
                    "error": str(e),
                    "timestamp": datetime.now().isoformat()
                }
            }
            with open(results_file, 'w') as f:
                json.dump(save_data, f, indent=2)
            continue

    print(f"\n✅ Evaluation completed for {len(all_results)}/{len(chains_to_evaluate)} methods")
else:
    print(f"\n✅ All evaluations already completed!")
    print(f"Found results for: {list(all_results.keys())}")

# Final save
final_save_data = {
    "timestamp": datetime.now().isoformat(),
    "dataset_info": {
        "dataset_name": "cuttlefish-jira-golden-dataset-20250731-122634",
        "num_test_cases": len(evaluation_dataset),
        "jira_documents_count": len(jira_documents)
    },
    "results": all_results
}

with open(results_file, 'w') as f:
    json.dump(final_save_data, f, indent=2)

print(f"\n💾 Final results saved to {results_file}")
print(f"📊 Total methods evaluated: {len(all_results)}")
print(f"🔍 To reload results later, simply re-run this cell")


🚀 Starting evaluation for remaining methods...
Evaluating 7 retrieval methods
Methods to evaluate: ['Naive', 'BM25', 'Multi-Query', 'Parent Document', 'Contextual Compression', 'Ensemble', 'Semantic Chunking']
Using 15 test cases

🚀 Evaluating Naive...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[23]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[71]: TimeoutError()


✅ Naive completed!
   Average Score: 0.6603
   Cost: $0.2819
   Latency: 544.21s
💾 Saved results to cuttlefish3_rag_evaluation_results.json
🚀 Evaluating BM25...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[23]: TimeoutError()
Exception raised in Job[59]: TimeoutError()


✅ BM25 completed!
   Average Score: 0.5306
   Cost: $0.2239
   Latency: 482.87s
💾 Saved results to cuttlefish3_rag_evaluation_results.json
🚀 Evaluating Multi-Query...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[71]: TimeoutError()
Exception raised in Job[83]: TimeoutError()


✅ Multi-Query completed!
   Average Score: 0.6863
   Cost: $0.3060
   Latency: 678.13s
💾 Saved results to cuttlefish3_rag_evaluation_results.json
🚀 Evaluating Parent Document...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

✅ Parent Document completed!
   Average Score: 0.5816
   Cost: $0.1314
   Latency: 348.02s
💾 Saved results to cuttlefish3_rag_evaluation_results.json
🚀 Evaluating Contextual Compression...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

✅ Contextual Compression completed!
   Average Score: 0.6413
   Cost: $0.1359
   Latency: 313.61s
💾 Saved results to cuttlefish3_rag_evaluation_results.json
🚀 Evaluating Ensemble...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[65]: TimeoutError()
Exception raised in Job[71]: TimeoutError()
Exception raised in Job[83]: TimeoutError()
Exception raised in Job[89]: TimeoutError()


✅ Ensemble completed!
   Average Score: 0.6818
   Cost: $0.3487
   Latency: 704.28s
💾 Saved results to cuttlefish3_rag_evaluation_results.json
🚀 Evaluating Semantic Chunking...
   Generating responses...
      Processing 1/15
      Processing 6/15
      Processing 11/15
   Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()


✅ Semantic Chunking completed!
   Average Score: 0.7439
   Cost: $0.2775
   Latency: 500.40s
💾 Saved results to cuttlefish3_rag_evaluation_results.json

✅ Evaluation completed for 7/7 methods

💾 Final results saved to cuttlefish3_rag_evaluation_results.json
📊 Total methods evaluated: 7
🔍 To reload results later, simply re-run this cell
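
The log above reports `TimeoutError` for some RAGAS jobs: 2 for BM25, 9 for Multi-Query, 0 for Parent Document, 0 for Contextual Compression, 14 for Ensemble, and 2 for Semantic Chunking, each out of 90 jobs. Jobs that time out presumably contribute no score, so the averages for Multi-Query and Ensemble may rest on fewer samples than the others. A minimal sketch of how such a tally could be produced from captured log text follows; `count_timeouts` and the regular expressions are hypothetical helpers written for this illustration, not part of the notebook.

```python
import re
from collections import Counter

# Header lines look like "🚀 Evaluating BM25..."; the tqdm progress line
# "Evaluating:   0%|..." has a colon right after the word and does not match.
METHOD_LINE = re.compile(r"Evaluating (?P<method>[^:]+?)\.\.\.\s*$")
TIMEOUT_LINE = re.compile(r"Exception raised in Job\[\d+\]: TimeoutError\(\)")

def count_timeouts(log_text):
    """Count TimeoutError lines under each 'Evaluating <method>...' header."""
    counts = Counter()
    current = None
    for line in log_text.splitlines():
        header = METHOD_LINE.search(line)
        if header:
            current = header.group("method")
            counts[current] += 0  # register the method even with zero timeouts
        elif current is not None and TIMEOUT_LINE.search(line):
            counts[current] += 1
    return dict(counts)

# A shortened excerpt of the log, used only to exercise the helper.
sample_log = """\
🚀 Evaluating BM25...
Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
✅ BM25 completed!
🚀 Evaluating Parent Document...
Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]
✅ Parent Document completed!
"""

timeouts = count_timeouts(sample_log)
print(timeouts)
```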


## Load Existing Results (Optional)

Run this cell if you want to load previously saved evaluation results without re-running evaluations:

In [None]:
# Optional: Load existing results without running evaluations
import json
import os

results_file = "cuttlefish3_rag_evaluation_results.json"

if os.path.exists(results_file):
    print(f"📂 Loading results from {results_file}")
    with open(results_file, 'r') as f:
        saved_data = json.load(f)
        all_results = saved_data.get("results", {})
        
    print(f"✅ Loaded evaluation results for {len(all_results)} methods:")
    for method, results in all_results.items():
        print(f"   • {method}: Average Score = {results['average_score']:.4f}")
        
    print(f"\n📊 Dataset info:")
    dataset_info = saved_data.get("dataset_info", {})
    print(f"   • Dataset: {dataset_info.get('dataset_name', 'N/A')}")
    print(f"   • Test cases: {dataset_info.get('num_test_cases', 'N/A')}")
    print(f"   • JIRA documents: {dataset_info.get('jira_documents_count', 'N/A')}")
    print(f"   • Last updated: {saved_data.get('timestamp', 'N/A')}")
else:
    print(f"❌ No saved results found at {results_file}")
    print("Run the evaluation cell above to generate results.")
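
The loading cell above reads three top-level keys from the JSON file: `results`, `dataset_info`, and `timestamp`. As a rough sketch of the matching write side, the snippet below saves a file in that layout and reads it back. The helper name `save_results` and the exact set of fields are assumptions made for illustration; the notebook's own saving code may differ.

```python
import json
import os
import tempfile
from datetime import datetime

def save_results(path, all_results, dataset_info):
    """Write results in the layout the loading cell reads (hypothetical helper)."""
    payload = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "dataset_info": dataset_info,
        "results": all_results,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    return payload

# Round-trip a single method's score through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.json")
    save_results(
        path,
        {"BM25": {"average_score": 0.5306}},
        {"dataset_name": "cuttlefish-jira-golden-dataset-20250731-122634",
         "num_test_cases": 15},
    )
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded["results"]["BM25"]["average_score"])
```

Fields missing from `dataset_info` are harmless on reload, since the loading cell uses `.get(..., 'N/A')` for each of them.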

In [18]:
import pandas as pd

print("="*80)
print("CUTTLEFISH3 JIRA RAG EVALUATION RESULTS")
print("="*80)

if not all_results:
    print("❌ No evaluation results available. Please run the evaluation cells above.")
else:
    # Create comprehensive comparison DataFrame
    comparison_data = []
    
    for method_name, results in all_results.items():
        comparison_data.append({
            "Method": method_name,
            "Avg Score": f"{results['average_score']:.4f}",
            "Context Recall": f"{results['context_recall']:.4f}",
            "Faithfulness": f"{results['faithfulness']:.4f}",
            "Factual Correctness": f"{results['factual_correctness']:.4f}",
            "Response Relevancy": f"{results['response_relevancy']:.4f}",
            "Noise Sensitivity": f"{results['noise_sensitivity']:.4f}",
            "Total Cost": f"${results['total_cost_usd']:.4f}",
            "Cost/Query": f"${results['cost_per_query']:.4f}",
            "Total Time": f"{results['total_latency_seconds']:.1f}s",
            "Time/Query": f"{results['latency_per_query']:.1f}s",
            "Tokens/Query": f"{results['tokens_per_query']:.0f}"
        })
    
    results_df = pd.DataFrame(comparison_data)
    
    print("📊 COMPREHENSIVE RESULTS:")
    print(results_df.to_string(index=False))
    
    # Performance analysis
    print(f"\n🏆 PERFORMANCE ANALYSIS:")
    print("-" * 50)
    
    # Best overall performance
    best_method = max(all_results.items(), key=lambda x: x[1]['average_score'])
    print(f"🥇 Best Overall: {best_method[0]} (Score: {best_method[1]['average_score']:.4f})")
    
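    # Most cost-effective (lowest cost per query among methods with a recorded cost)
    costed = [r for r in all_results.items() if r[1]['total_cost_usd'] > 0]
    if costed:  # min() raises ValueError on an empty sequence
        cost_effective = min(costed, key=lambda x: x[1]['cost_per_query'])
        print(f"💰 Most Cost-Effective: {cost_effective[0]} (${cost_effective[1]['cost_per_query']:.4f}/query)")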
    
    # Fastest method
    fastest = min(all_results.items(), key=lambda x: x[1]['total_latency_seconds'])
    print(f"⚡ Fastest: {fastest[0]} ({fastest[1]['total_latency_seconds']:.1f}s total)")
    
    # Individual metric leaders
    metrics = ['context_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'noise_sensitivity']
    print(f"\n🎯 Metric Leaders:")
    for metric in metrics:
        leader = max(all_results.items(), key=lambda x: x[1][metric])
        print(f"   {metric.replace('_', ' ').title()}: {leader[0]} ({leader[1][metric]:.4f})")

print(f"\n{'='*80}")
print("🎉 CUTTLEFISH3 JIRA RAG EVALUATION COMPLETE!")
print(f"{'='*80}")

CUTTLEFISH3 JIRA RAG EVALUATION RESULTS
📊 COMPREHENSIVE RESULTS:
                Method Avg Score Context Recall Faithfulness Factual Correctness Response Relevancy Noise Sensitivity Total Cost Cost/Query Total Time Time/Query Tokens/Query
                 Naive    0.6603         0.8722       0.7977              0.6787             0.8234            0.1294    $0.2819    $0.0188     544.2s      36.3s        77353
                  BM25    0.5306         0.6444       0.7844              0.4860             0.6351            0.1031    $0.2239    $0.0149     482.9s      32.2s        58071
           Multi-Query    0.6863         0.9222       0.7993              0.7000             0.8890            0.1209    $0.3060    $0.0204     678.1s      45.2s        85142
       Parent Document    0.5816         0.6733       0.7914              0.5840             0.6293            0.2300    $0.1314    $0.0088     348.0s      23.2s        35763
Contextual Compression    0.6413         0.7778       0.7800

A Google Sheets version of this table is available here: https://docs.google.com/spreadsheets/d/1raoUdbfGGrQIicetcCilAQuSlpLUG8IQJLtV5Tv3ztk/edit?usp=sharing

## Summary & Recommendations

### Chunking:

Semantic Chunking performs better than the default `RecursiveCharacterTextSplitter` across the board. Its higher Noise Sensitivity may look like a drawback, and the RAGAS documentation does describe this metric as lower-is-better. Our reading (checked with Cursor in Prompt 11 of PROMPTS.md) is that the higher value is not a problem here: it indicates that Semantic Chunking is more sensitive to context changes, which is beneficial for the kind of queries we expect from this app, i.e., technical support scenarios.
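
The Avg Score values in the results table are consistent with an unweighted mean of the five RAGAS metrics, with Noise Sensitivity included as-is. The sketch below checks that arithmetic on the Naive row and shows how the average would shift if Noise Sensitivity were instead counted as `1 - value`. That inverted variant is a hypothetical alternative for comparison and does not appear to be what the notebook computes.

```python
# Metric values for the Naive row, copied from the results table.
naive = {
    "context_recall": 0.8722,
    "faithfulness": 0.7977,
    "factual_correctness": 0.6787,
    "response_relevancy": 0.8234,
    "noise_sensitivity": 0.1294,
}

def plain_average(metrics):
    """Unweighted mean of all metric values, as the table's Avg Score appears to be."""
    return sum(metrics.values()) / len(metrics)

def inverted_noise_average(metrics):
    """Hypothetical variant: count noise sensitivity as (1 - value) before averaging."""
    adjusted = dict(metrics, noise_sensitivity=1.0 - metrics["noise_sensitivity"])
    return sum(adjusted.values()) / len(adjusted)

print(f"plain average:          {plain_average(naive):.4f}")
print(f"inverted-noise average: {inverted_noise_average(naive):.4f}")
```

Because the plain average rewards a higher Noise Sensitivity, a method's ranking by Avg Score depends on how that metric is interpreted.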

### Retrieval 

Looking at Response Relevancy, the Ensemble approach scores very well but is the slowest method, at about 47 s per request. Multi-Query is next best on that metric but is barely faster, at about 45 s per request. Contextual Compression scores similarly to Multi-Query on Response Relevancy and takes only about 21 s per request.
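
The per-request figures quoted here follow from dividing each method's total latency by the 15 test queries. A short sketch of that arithmetic, using the totals printed in the evaluation log and results table, is shown below; the variable names are illustrative only.

```python
NUM_QUERIES = 15  # from the "Processing n/15" lines in the evaluation log

# (total cost in USD, total latency in seconds), copied from the log and table above.
totals = {
    "Naive": (0.2819, 544.21),
    "BM25": (0.2239, 482.87),
    "Multi-Query": (0.3060, 678.13),
    "Parent Document": (0.1314, 348.02),
    "Contextual Compression": (0.1359, 313.61),
    "Ensemble": (0.3487, 704.28),
    "Semantic Chunking": (0.2775, 500.40),
}

# Divide each total by the number of queries to get per-query figures.
per_query = {
    name: {"cost_usd": cost / NUM_QUERIES, "latency_s": latency / NUM_QUERIES}
    for name, (cost, latency) in totals.items()
}

# Print methods from fastest to slowest.
for name, stats in sorted(per_query.items(), key=lambda kv: kv[1]["latency_s"]):
    print(f"{name:<24} {stats['latency_s']:5.1f}s/query  ${stats['cost_usd']:.4f}/query")
```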

Given this, we will use the following retrieval mechanisms:

1. If the user needs an answer quickly, e.g., during a production incident, use Contextual Compression retrieval.
2. If the user can wait about a minute, use Ensemble retrieval.

During the previous assessment we also saw that, for some types of queries, a keyword search finds the relevant information while semantic similarity fails to retrieve it. These are usually queries for specific entities, e.g., ticket numbers. Because this project deals with tickets, we will also include BM25 as another retrieval option.
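
The selection rules above could be wired together roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the function `choose_retriever`, the `urgent` flag, the `ABC-1234` key pattern, and the choice to check for a ticket key before checking urgency are all hypothetical.

```python
import re

# Assumed shape of a JIRA issue key, e.g. "ABC-1234"; adjust to the project's keys.
TICKET_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def choose_retriever(query: str, urgent: bool) -> str:
    """Return the name of the retrieval method to use for this query."""
    if TICKET_KEY.search(query):
        # Exact entity lookups are where keyword search did better than embeddings.
        return "BM25"
    if urgent:
        # Fastest of the methods with strong Response Relevancy (about 21 s/request).
        return "Contextual Compression"
    # Slower (about 47 s/request), used when the user can wait.
    return "Ensemble"

print(choose_retriever("What is the status of ABC-1234?", urgent=False))
print(choose_retriever("Login fails after upgrade, production is down", urgent=True))
print(choose_retriever("Why do memory leaks happen in the XML parser?", urgent=False))
```

Checking for a ticket key first means an urgent query that names a ticket still goes to BM25; swapping the first two branches would favor latency instead.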
