# Multi-Strategy Document Retrieval and Evaluation System

## Objective
Build a Python/LangChain system that compares four different document retrieval strategies against a golden source benchmark, evaluating their accuracy and cost-effectiveness for question-answering tasks.

## Input Requirements

### 1. Document Dataset (Excel)
**File:** `document_dataset.xlsx`
| TOC_Number | Text |
|------------|------|
| 1.1 | Introduction to machine learning fundamentals |
| 1.2 | Supervised learning algorithms and applications |
| ... | ... |

### 2. Golden Source Questions (Excel)
**File:** `golden_source.xlsx`
| Question_ID | Question_Text | TOC_1 | TOC_2 | TOC_3 | TOC_4 | TOC_5 |
|-------------|--------------|--------|--------|--------|--------|--------|
| Q1 | How do I evaluate my ML model? | 2.2 | 1.2 | 2.1 | 1.3 | 3.1 |
| Q2 | What are neural network types? | 3.1 | 3.2 | 1.1 | 1.2 | 2.1 |
| ... | ... | ... | ... | ... | ... | ... |
| Q10 | ... | ... | ... | ... | ... | ... |

### 3. Category Definitions
- **Category_1** (Topic): List of valid categories to be provided
- **Category_2** (Level): List of valid categories to be provided

## System Components

### 1. Data Preprocessing

#### 1.1 Document Categorization
- For each row in the document dataset:
  - Automatically assign Category_1 (Topic) based on text content
  - Automatically assign Category_2 (Level) based on text complexity
- Methods:
  - Zero-shot classification using LLM
  - Or embedding similarity to category descriptions
  - Or keyword/rule-based assignment
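
As a rough illustration of the keyword/rule-based option, the sketch below scores each category by how often its keywords occur in the text. The keyword lists and the `assign_topic` helper are hypothetical and only cover two topics; the real category lists are still to be provided.

```python
# Hypothetical keyword rules, for illustration only.
TOPIC_KEYWORDS = {
    "Fundamentals": ["introduction", "fundamentals", "basic"],
    "Algorithms": ["algorithm", "regression", "clustering"],
}

def assign_topic(text, rules=TOPIC_KEYWORDS, default="Uncategorized"):
    """Return the category whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {cat: sum(lowered.count(kw) for kw in kws) for cat, kws in rules.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matches at all.
    return best if scores[best] > 0 else default

print(assign_topic("Introduction to machine learning fundamentals"))    # Fundamentals
print(assign_topic("Supervised learning algorithms and applications"))  # Algorithms
```

Rules like these cost nothing per document but need maintenance as the corpus grows, which is one reason to compare them against the LLM and embedding options.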

#### 1.2 Processed Dataset Structure
After categorization, create:
| TOC_Number | Text | Category_1 | Category_2 |
|------------|------|------------|------------|
| 1.1 | Introduction to machine learning fundamentals | Fundamentals | Beginner |
| 1.2 | Supervised learning algorithms and applications | Algorithms | Intermediate |
| ... | ... | ... | ... |

#### 1.3 Embedding Generation
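- Create embeddings for all document texts with an embedding model (e.g., an OpenAI embedding model; Anthropic does not provide an embeddings endpoint for Claude)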
- Store embeddings with metadata
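
One possible way to keep each embedding together with its metadata is a small record type, sketched below. `EmbeddedDoc` and `toy_embed` are hypothetical names, and `toy_embed` is only a deterministic stand-in so the example runs offline; a real system would call the embedding model instead.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EmbeddedDoc:
    toc_number: str
    text: str
    category_1: str
    category_2: str
    embedding: List[float]

def toy_embed(text, dim=8):
    """Stand-in embedder (NOT a real model): buckets character codes into `dim` slots."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    return vec

rows = [
    ("1.1", "Introduction to machine learning fundamentals", "Fundamentals", "Beginner"),
    ("1.2", "Supervised learning algorithms and applications", "Algorithms", "Intermediate"),
]
# Each record carries the vector and the metadata needed for filtering later.
store = [EmbeddedDoc(toc, text, c1, c2, toy_embed(text)) for toc, text, c1, c2 in rows]
print(store[0].toc_number, len(store[0].embedding))
```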

### 2. Question Processing
- Embed each incoming question
- Categorize each question into both category types, using the same method as for the documents

### 3. Four Retrieval Strategies

#### Strategy A: Pure Embedding Similarity
- Compute cosine similarity between question and document embeddings
- Return top 5 most similar documents
- Cost: One embedding call for the question; document embeddings are precomputed
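
The ranking step of Strategy A can be sketched in plain Python as follows. The 2-D vectors are made up for illustration, and `top_k_by_similarity` is a hypothetical helper name; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_by_similarity(question_vec, doc_vecs, k=5):
    """doc_vecs maps TOC number -> embedding; returns the k most similar TOC numbers."""
    ranked = sorted(doc_vecs, key=lambda toc: cosine(question_vec, doc_vecs[toc]), reverse=True)
    return ranked[:k]

# Toy vectors, invented for this example.
docs = {"1.1": [1.0, 0.0], "1.2": [0.8, 0.6], "2.2": [0.0, 1.0]}
print(top_k_by_similarity([0.1, 0.9], docs, k=2))  # prints ['2.2', '1.2']
```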

#### Strategy B: Category Filtering
- Filter documents matching question's categories
- Return up to 5 matches (by TOC order)
- Cost: Metadata query only (the question must already be categorized, see Question Processing)
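
A minimal sketch of the filtering step is shown below. It assumes TOC numbers are dot-separated integers and interprets "TOC order" as numeric order of those parts; both are assumptions, and `retrieve_by_category` here is an illustrative standalone function rather than the final method.

```python
def retrieve_by_category(docs, topic, level, k=5):
    """docs: list of dicts with TOC_Number, Category_1 and Category_2 keys."""
    matches = [d for d in docs if d["Category_1"] == topic and d["Category_2"] == level]
    # Sort by numeric TOC parts so that "1.10" comes after "1.2".
    matches.sort(key=lambda d: tuple(int(p) for p in d["TOC_Number"].split(".")))
    return [d["TOC_Number"] for d in matches[:k]]

docs = [
    {"TOC_Number": "1.10", "Category_1": "Algorithms", "Category_2": "Intermediate"},
    {"TOC_Number": "1.2", "Category_1": "Algorithms", "Category_2": "Intermediate"},
    {"TOC_Number": "1.1", "Category_1": "Fundamentals", "Category_2": "Beginner"},
]
print(retrieve_by_category(docs, "Algorithms", "Intermediate"))  # prints ['1.2', '1.10']
```

Because there is no relevance ranking, the result can contain fewer than five items, or five items that merely come first in the table of contents.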

#### Strategy C: Hybrid (Categories + Similarity)
- Filter by categories first
- Rank filtered results by embedding similarity
- Return top 5
- Cost: Filtering + embedding lookup
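
The two stages of Strategy C can be sketched as a filter followed by a similarity sort. The vectors and the standalone `retrieve_hybrid` function below are illustrative assumptions, not the final implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_hybrid(question_vec, question_cats, docs, k=5):
    """Keep documents whose (Category_1, Category_2) match, then rank them by similarity."""
    candidates = [d for d in docs if (d["Category_1"], d["Category_2"]) == question_cats]
    candidates.sort(key=lambda d: cosine(question_vec, d["embedding"]), reverse=True)
    return [d["TOC_Number"] for d in candidates[:k]]

# Toy data: "1.1" is very close to the question but sits in a different category.
docs = [
    {"TOC_Number": "1.2", "Category_1": "Algorithms", "Category_2": "Intermediate", "embedding": [0.8, 0.6]},
    {"TOC_Number": "2.1", "Category_1": "Algorithms", "Category_2": "Intermediate", "embedding": [0.0, 1.0]},
    {"TOC_Number": "1.1", "Category_1": "Fundamentals", "Category_2": "Beginner", "embedding": [0.1, 0.9]},
]
print(retrieve_hybrid([0.0, 1.0], ("Algorithms", "Intermediate"), docs))  # prints ['2.1', '1.2']
```

The example also shows the main risk of filtering first: a highly similar document is dropped when its category differs from the question's, so categorization errors directly limit precision.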

#### Strategy D: Full Context with Cache
- Load entire document into LLM context
- Use prompt caching for cost reduction
- Let LLM select 5 most relevant sections
- Cost: Full LLM inference (reduced with caching)
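
For Strategy D, the prompt construction and the parsing of the model's reply might look like the sketch below. The prompt wording and the helper names `build_prompt` and `parse_toc_reply` are assumptions. The static context is placed before the question because provider-side prompt caching generally works on a shared prompt prefix; check the provider's documentation for the exact rules.

```python
import re

def build_prompt(sections, question, k=5):
    """sections: list of (toc_number, text) pairs covering the whole document set."""
    context = "\n".join(f"TOC {toc}: {text}" for toc, text in sections)
    # Static context first, variable question last, so the prefix stays identical across questions.
    return (
        f"Select the {k} sections most relevant to the question.\n"
        "Answer with a comma-separated list of TOC numbers only.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}\n"
    )

def parse_toc_reply(reply, valid_tocs, k=5):
    """Extract up to k distinct, known TOC numbers from a free-text model reply."""
    found = []
    for token in re.findall(r"\d+(?:\.\d+)+", reply):
        if token in valid_tocs and token not in found:
            found.append(token)
    return found[:k]

sections = [("1.2", "Supervised learning"), ("2.2", "Model evaluation"), ("3.1", "Neural networks")]
prompt = build_prompt(sections, "How do I evaluate my ML model?")
# A made-up reply with a duplicate and an unknown section number.
print(parse_toc_reply("TOC 2.2, 1.2, 9.9, 2.2 and 3.1.", {toc for toc, _ in sections}))
```

Validating the reply against the known TOC numbers keeps hallucinated or duplicated section numbers from counting as retrieved results.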

### 4. Evaluation Metrics
For each question and strategy, calculate:
- **Matches**: Number of retrieved TOCs that appear among the question's five golden source TOCs
- **Precision@5**: Matches / 5
- **Cost**: Based on API calls (embeddings, LLM tokens)
- **Latency**: Time to retrieve results
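
As a worked example of the first two metrics, the sketch below scores a hypothetical retrieval result against the Q1 row of the golden source table. The retrieved list is invented, and `precision_at_k` is an illustrative helper name.

```python
def precision_at_k(retrieved, golden, k=5):
    """Return (matches, precision) for the first k retrieved TOC numbers."""
    matches = len(set(retrieved[:k]) & set(golden))
    return matches, matches / k

golden_q1 = ["2.2", "1.2", "2.1", "1.3", "3.1"]  # Q1 row of the golden source table
retrieved = ["2.2", "2.1", "4.1", "1.1", "3.1"]  # hypothetical strategy output

print(precision_at_k(retrieved, golden_q1))  # prints (3, 0.6)
```

Dividing by k rather than by the number of returned items means a strategy that returns fewer than five results is penalized for the missing ones.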

## Implementation Steps

```python
# Pseudo-code structure
class DocumentCategorizer:
    def __init__(self, category_1_list, category_2_list, llm_model):
        self.categories_1 = category_1_list
        self.categories_2 = category_2_list
        self.llm = llm_model
        
    def categorize_text(self, text):
        # Return (category_1, category_2)
        # Option 1: LLM-based categorization
        prompt = f"""
        Categorize this text:
        "{text}"
        
        Category 1 options: {self.categories_1}
        Category 2 options: {self.categories_2}
        
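        Reply with exactly: category_1, category_2
        """
        # A chat model returns a message object, so parse its text into the two labels.
        reply = self.llm.invoke(prompt).content
        cat1, cat2 = (part.strip(" ()'\"\n") for part in reply.split(",", 1))
        return cat1, cat2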
        
    def categorize_dataset(self, document_df):
        # Add category columns to dataframe
        for idx, row in document_df.iterrows():
            cat1, cat2 = self.categorize_text(row['Text'])
            document_df.loc[idx, 'Category_1'] = cat1
            document_df.loc[idx, 'Category_2'] = cat2
        return document_df

class DocumentRetriever:
    def __init__(self, categorized_document_df, embedding_model):
        # Initialize embeddings and categories
        self.documents = categorized_document_df
        self.embedding_model = embedding_model

    def categorize_question(self, question):
        # Return (category_1, category_2)
        raise NotImplementedError

    def retrieve_by_embedding(self, question, k=5):
        # Strategy A implementation
        raise NotImplementedError

    def retrieve_by_category(self, question, k=5):
        # Strategy B implementation
        raise NotImplementedError

    def retrieve_hybrid(self, question, k=5):
        # Strategy C implementation
        raise NotImplementedError

    def retrieve_full_context(self, question, k=5):
        # Strategy D implementation
        raise NotImplementedError

class Evaluator:
    def __init__(self, golden_source_df):
        # Load golden source
        self.golden_source = golden_source_df

    def evaluate_strategy(self, strategy_results, question_id):
        # Calculate precision and matches
        raise NotImplementedError

    def generate_report(self, all_results):
        # Create summary and detailed tables
        raise NotImplementedError
```

## Expected Outputs

### 1. Categorized Document Dataset (Excel)
**File:** `document_dataset_categorized.xlsx`
- Original columns plus Category_1 and Category_2
- Export for verification and manual correction if needed

### 2. Summary Table

| Strategy | Avg Precision@5 | Total Cost ($) | Avg Latency | Best For |
|----------|-----------------|----------------|-------------|----------|
| A: Embedding | 0.62 | 0.0010 | 50ms | Semantic search |
| B: Categories | 0.48 | 0.0000 | 10ms | Quick filtering |
| C: Hybrid | 0.74 | 0.0010 | 60ms | Balanced approach |
| D: Full Context | 0.88 | 0.0500 | 500ms | High accuracy |

### 3. Detailed Results (Excel Export)
- **Sheet 1**: Summary statistics
- **Sheet 2**: Per-question results for all strategies
- **Sheet 3**: Cost breakdown by component
- **Sheet 4**: Retrieved TOCs vs Golden TOCs comparison
- **Sheet 5**: Categorized documents

### 4. Cost Breakdown Example
```
Initial Setup:
- Document categorization: 100 docs × $0.001 = $0.10 (one-time)
- Document embeddings: 100 docs × $0.0001 = $0.01 (one-time)

Per-Question Costs:
- Question categorization: $0.001
- Strategy A: $0.0001 (embedding lookup)
- Strategy B: $0.0000 (metadata only)
- Strategy C: $0.0001 (filtering + embedding)
- Strategy D: $0.005 (full context) or $0.0005 (cached)
```
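
To make the arithmetic explicit, the sketch below turns the example rates above into run totals for the ten golden source questions. The figures are the illustrative ones from the breakdown, not measured prices, and `run_cost` is a hypothetical helper.

```python
# One-time setup from the example: categorization plus embeddings for 100 documents.
SETUP_COST = 100 * 0.001 + 100 * 0.0001

PER_QUESTION = {"A": 0.0001, "B": 0.0, "C": 0.0001, "D": 0.005, "D_cached": 0.0005}
QUESTION_CATEGORIZATION = 0.001  # only Strategies B and C need categorized questions

def run_cost(strategy, n_questions, include_categorization=False):
    """Total retrieval cost for n questions, optionally adding question categorization."""
    per_question = PER_QUESTION[strategy]
    if include_categorization and strategy in ("B", "C"):
        per_question += QUESTION_CATEGORIZATION
    return per_question * n_questions

print(f"setup: ${SETUP_COST:.2f}")
for name in PER_QUESTION:
    print(f"{name}: ${run_cost(name, 10):.4f}")
print(f"C incl. categorization: ${run_cost('C', 10, include_categorization=True):.4f}")
```

Whether question categorization is charged to Strategies B and C or reported separately changes their ranking on cost, so the report should state which convention it uses.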

## Technical Specifications

### Required Libraries
```python
import pandas as pd
import numpy as np
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from sklearn.metrics.pairwise import cosine_similarity
import openpyxl
from tqdm import tqdm  # For progress bars during categorization
```

### Configuration Parameters
```python
config = {
    "embedding_model": "text-embedding-ada-002",
    "llm_model": "gpt-4",
    "retrieval_k": 5,
    "categories_1": ["Fundamentals", "Algorithms", "Data Processing", "Advanced Topics"],
    "categories_2": ["Beginner", "Intermediate", "Advanced", "Expert"],
    "cost_per_embedding": 0.0001,
    "cost_per_1k_tokens": 0.03,
    "use_cache": True,
    "categorization_method": "llm"  # Options: "llm", "embedding", "keyword"
}
```

### Usage Example
```python
# Step 1: Load and categorize documents
documents_df = pd.read_excel("document_dataset.xlsx")
golden_df = pd.read_excel("golden_source.xlsx")

categorizer = DocumentCategorizer(
    config["categories_1"], 
    config["categories_2"], 
    ChatOpenAI(model=config["llm_model"])
)

print("Categorizing documents...")
categorized_df = categorizer.categorize_dataset(documents_df)
categorized_df.to_excel("document_dataset_categorized.xlsx", index=False)

# Step 2: Initialize retrieval system
retriever = DocumentRetriever(categorized_df, OpenAIEmbeddings(model=config["embedding_model"]))
evaluator = Evaluator(golden_df)

# Step 3: Run evaluation
results = {}
for idx, row in golden_df.iterrows():
    question_id = row['Question_ID']
    question = row['Question_Text']
    
    results[question_id] = {
        'A': retriever.retrieve_by_embedding(question),
        'B': retriever.retrieve_by_category(question),
        'C': retriever.retrieve_hybrid(question),
        'D': retriever.retrieve_full_context(question)
    }

# Step 4: Generate report
evaluator.generate_report(results)
```

## Deliverables
1. Categorized document dataset (Excel file)
2. Working Python script with:
   - Automatic document categorization
   - Four retrieval strategies
   - Evaluation framework
3. Excel report with comprehensive evaluation metrics
4. Cost analysis including:
   - One-time setup costs (categorization, embeddings)
   - Per-query costs for each strategy
5. Recommendations based on accuracy/cost tradeoffs

## Notes
- The categorization step is crucial for Strategies B and C, because both filter on the assigned categories
- Consider manual verification of categories for critical documents
- Category definitions should be clear and mutually exclusive
- Save categorized dataset for reuse in future experiments
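
To support the manual verification mentioned above, a simple check can list the rows whose labels fall outside the configured category lists. The sketch below uses the category lists from the configuration section; `rows_needing_review` and the sample rows are hypothetical.

```python
VALID_TOPICS = {"Fundamentals", "Algorithms", "Data Processing", "Advanced Topics"}
VALID_LEVELS = {"Beginner", "Intermediate", "Advanced", "Expert"}

def rows_needing_review(rows):
    """Return TOC numbers whose labels are not in the configured category lists."""
    return [
        r["TOC_Number"] for r in rows
        if r["Category_1"] not in VALID_TOPICS or r["Category_2"] not in VALID_LEVELS
    ]

# Sample rows, invented for this example.
rows = [
    {"TOC_Number": "1.1", "Category_1": "Fundamentals", "Category_2": "Beginner"},
    {"TOC_Number": "2.2", "Category_1": "Model Evaluation", "Category_2": "Intermediate"},
    {"TOC_Number": "3.1", "Category_1": "Advanced Topics", "Category_2": "Uncategorized"},
]
print(rows_needing_review(rows))  # prints ['2.2', '3.1']
```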

In [None]:
import os
import time
import pandas as pd
import numpy as np
import openpyxl
from tqdm import tqdm
from dotenv import load_dotenv

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from sklearn.metrics.pairwise import cosine_similarity

# --- Configuration ---
load_dotenv()

config = {
    "embedding_model": "text-embedding-3-small",
    "llm_model": "gpt-4-turbo",
    "retrieval_k": 5,
    "categories_1": ["Fundamentals", "Algorithms", "Data Processing", "Advanced Topics", "Evaluation"],
    "categories_2": ["Beginner", "Intermediate", "Advanced", "Expert"],
    # Per-1M-token rates in USD. Treat these as estimates and check them against the
    # provider's current price list for the models named above.
    "cost_embedding_per_1m_tokens": 0.02,
    "cost_llm_input_per_1m_tokens": 5.00,
    "cost_llm_output_per_1m_tokens": 15.00,
    "use_cache": True,
}

# --- Helper Functions & Setup ---

def create_dummy_files():
    """Creates dummy Excel files for demonstration if they don't exist."""
    doc_file = "document_dataset.xlsx"
    golden_file = "golden_source.xlsx"
    if not os.path.exists(doc_file):
        print(f"Creating dummy file: {doc_file}")
        pd.DataFrame({
            "TOC_Number": ["1.1", "1.2", "1.3", "2.1", "2.2", "2.3", "3.1", "3.2", "3.3", "4.1"],
            "Text": [
                "Introduction to machine learning, covering basic concepts like variables and data types.",
                "Exploring supervised learning algorithms, including linear regression and logistic regression.",
                "An overview of unsupervised learning techniques such as k-means clustering.",
                "Deep dive into data preprocessing: handling missing values, scaling features.",
                "Methods for model evaluation: confusion matrix, precision, recall, and F1-score.",
                "Feature engineering strategies to improve model performance.",
                "A look at neural networks and their fundamental architecture.",
                "Advanced neural network types: Convolutional Neural Networks (CNNs) for images.",
                "Recurrent Neural Networks (RNNs) for sequence data.",
                "Understanding transfer learning and fine-tuning pre-trained models."
            ]
        }).to_excel(doc_file, index=False)
    if not os.path.exists(golden_file):
        print(f"Creating dummy file: {golden_file}")
        pd.DataFrame({
            "Question_ID": [f"Q{i}" for i in range(1, 6)],
            "Question_Text": [
                "How do I evaluate my machine learning model?", "What are the main types of neural networks?",
                "How should I start learning about ML?", "What is the difference between supervised and unsupervised learning?",
                "How to prepare data for a model?"
            ],
            "TOC_1": ["2.2", "3.2", "1.1", "1.2", "2.1"], "TOC_2": ["2.3", "3.1", "1.2", "1.3", "2.3"],
            "TOC_3": ["1.2", "3.3", "2.1", "3.1", "4.1"], "TOC_4": ["4.1", "4.1", "1.3", "2.2", "1.2"],
            "TOC_5": ["1.1", "1.1", "3.1", "2.1", "3.2"],
        }).to_excel(golden_file, index=False)

# --- System Components ---

class CostTracker:
    """Accumulates estimated API cost, both in total and per pipeline component."""
    # Maps a usage kind to the config key holding its price per one million tokens.
    RATE_KEYS = {
        "embedding": "cost_embedding_per_1m_tokens",
        "llm_input": "cost_llm_input_per_1m_tokens",
        "llm_output": "cost_llm_output_per_1m_tokens",
    }

    def __init__(self, config):
        self.config = config
        self.total_cost = 0
        self.cost_breakdown = {
            "setup_categorization": 0, "setup_embedding": 0, "query_categorization": 0,
            "query_embedding": 0, "query_llm_context": 0
        }

    def _calculate_cost(self, tokens, kind):
        # Unknown kinds are treated as free.
        rate_key = self.RATE_KEYS.get(kind)
        return (tokens / 1_000_000) * self.config[rate_key] if rate_key else 0

    def add_cost(self, tokens, kind, component):
        cost = self._calculate_cost(tokens, kind)
        self.total_cost += cost
        # Only known components appear in the breakdown; the total always grows.
        if component in self.cost_breakdown:
            self.cost_breakdown[component] += cost
        return cost

    def get_summary(self):
        return {"total_cost": self.total_cost, "breakdown": self.cost_breakdown}

class DocumentCategorizer:
    def __init__(self, category_1_list, category_2_list, llm, cost_tracker):
        self.llm, self.cost_tracker = llm, cost_tracker
        class Categories(BaseModel):
            category_1: str = Field(description=f"The topic from the list: {category_1_list}")
            category_2: str = Field(description=f"The level from the list: {category_2_list}")
        self.parser = PydanticOutputParser(pydantic_object=Categories)
        self.prompt = PromptTemplate(
            template="Analyze the text and assign categories.\n{format_instructions}\nText: \"{text}\"",
            input_variables=["text"], partial_variables={"format_instructions": self.parser.get_format_instructions()},
        )
        self.chain = self.prompt | self.llm | self.parser
        
    def categorize_text(self, text: str):
        try:
            input_tokens, result = len(text) // 4, self.chain.invoke({"text": text})
            output_tokens = len(str(result)) // 4
            self.cost_tracker.add_cost(input_tokens, "llm_input", "setup_categorization")
            self.cost_tracker.add_cost(output_tokens, "llm_output", "setup_categorization")
            return result.category_1, result.category_2
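        except Exception as e:
            # Fall back to a sentinel label, but report the failure instead of hiding it.
            tqdm.write(f"Categorization failed ({type(e).__name__}: {e}); using 'Uncategorized'.")
            return "Uncategorized", "Uncategorized"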
            
    def categorize_dataset(self, df: pd.DataFrame):
        cats = [self.categorize_text(row['Text']) for _, row in tqdm(df.iterrows(), total=len(df), desc="Categorizing Documents")]
        # Reuse df's index so the new columns align even if df is not 0..n-1 indexed.
        return pd.concat([df, pd.DataFrame(cats, columns=["Category_1", "Category_2"], index=df.index)], axis=1)

class DocumentRetriever:
    def __init__(self, df, config, cost_tracker):
        self.df, self.config, self.cost_tracker, self.llm_cache = df.copy(), config, cost_tracker, {}
        self.embedding_model = OpenAIEmbeddings(model=config["embedding_model"])
        self.llm = ChatOpenAI(model=config["llm_model"], temperature=0)
        self.categorizer = DocumentCategorizer(config["categories_1"], config["categories_2"], self.llm, cost_tracker)
        print("Generating document embeddings...")
        texts, total_tokens = self.df['Text'].tolist(), sum(len(t)//4 for t in self.df['Text'])
        self.cost_tracker.add_cost(total_tokens, 'embedding', 'setup_embedding')
        self.df['embedding'] = self.embedding_model.embed_documents(texts)
        
    def _categorize_question(self, question, question_id):
        if not hasattr(self, '_cat_cache'): self._cat_cache = {}
        if question_id in self._cat_cache: return self._cat_cache[question_id]
        # categorize_text already books its token cost under 'setup_categorization'.
        # Move that amount to 'query_categorization' instead of charging it a second time.
        breakdown = self.cost_tracker.cost_breakdown
        setup_before = breakdown['setup_categorization']
        cat1, cat2 = self.categorizer.categorize_text(question)
        breakdown['query_categorization'] += breakdown['setup_categorization'] - setup_before
        breakdown['setup_categorization'] = setup_before
        self._cat_cache[question_id] = (cat1, cat2)
        return cat1, cat2
        
    def retrieve(self, s_name, question, k, q_id):
        start = time.time()
        if s_name == 'A': tocs, cost = self.retrieve_by_embedding(question, k)
        elif s_name == 'B': tocs, cost = self.retrieve_by_category(question, k, q_id)
        elif s_name == 'C': tocs, cost = self.retrieve_hybrid(question, k, q_id)
        elif s_name == 'D': tocs, cost = self.retrieve_full_context(question, k)
        else: raise ValueError(f"Unknown strategy: {s_name}")
        return tocs, (time.time() - start) * 1000, cost
        
    def retrieve_by_embedding(self, question, k):
        q_emb = self.embedding_model.embed_query(question)
        cost = self.cost_tracker.add_cost(len(question)//4, "embedding", "query_embedding")
        sims = cosine_similarity([q_emb], np.array(self.df['embedding'].tolist()))[0]
        return self.df.iloc[np.argsort(sims)[::-1][:k]]['TOC_Number'].tolist(), cost
        
    def retrieve_by_category(self, question, k, q_id):
        cat1, cat2 = self._categorize_question(question, q_id)
        df = self.df[(self.df['Category_1'] == cat1) & (self.df['Category_2'] == cat2)]
        if df.empty: df = self.df[self.df['Category_1'] == cat1]
        return df['TOC_Number'].head(k).tolist(), 0
        
    def retrieve_hybrid(self, question, k, q_id):
        cost = self.cost_tracker.add_cost(len(question)//4, "embedding", "query_embedding")
        cat1, cat2 = self._categorize_question(question, q_id)
        # Candidate pool: exact (Category_1, Category_2) matches plus all other documents
        # sharing Category_1. After de-duplication this is effectively a Category_1 filter.
        df = pd.concat([
            self.df[(self.df['Category_1']==cat1)&(self.df['Category_2']==cat2)], 
            self.df[self.df['Category_1']==cat1]
        ]).drop_duplicates(subset=['TOC_Number']).reset_index(drop=True)
        if df.empty: return [], cost
        q_emb = self.embedding_model.embed_query(question)
        df['sim'] = cosine_similarity([q_emb], np.array(df['embedding'].tolist()))[0]
        return df.sort_values('sim', ascending=False).head(k)['TOC_Number'].tolist(), cost
        
    def retrieve_full_context(self, question, k):
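        # Memoizes the final answer per question text. This is local result caching, not
        # provider-side prompt caching, and a cached entry ignores k.
        if self.config['use_cache'] and question in self.llm_cache: return self.llm_cache[question][0], 0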
        context_str = "\n".join([f"TOC {row['TOC_Number']}: {row['Text']}" for _, row in self.df.iterrows()])
        prompt = f"""Given the document context below, identify the TOP {k} `TOC_Number`s most relevant to the user's question. Return only a comma-separated list of TOC numbers (e.g., 1.1, 2.3, 3.2).
CONTEXT:
---
{context_str}
---
QUESTION: "{question}"
Relevant TOC_Numbers:
"""
        res = self.llm.invoke(prompt).content.strip()
        # Drop empty fragments and cap the list at k in case the model returns more.
        tocs = [t.strip() for t in res.split(',') if t.strip()][:k]
        cost = self.cost_tracker.add_cost(len(prompt)//4, 'llm_input', 'query_llm_context') + \
               self.cost_tracker.add_cost(len(res)//4, 'llm_output', 'query_llm_context')
        if self.config['use_cache']: self.llm_cache[question] = (tocs, cost)
        return tocs, cost

class Evaluator:
    def __init__(self, golden_source_df, k):
        self.k = k
        self.golden_df = golden_source_df
        toc_cols = [f'TOC_{i}' for i in range(1, 6)]
        self.golden_map = {row['Question_ID']: set(row[toc_cols].astype(str).values) for _, row in self.golden_df.iterrows()}
            
    def evaluate_run(self, retrieved_tocs, question_id):
        golden_tocs = self.golden_map.get(question_id, set())
        matches = len(set(retrieved_tocs).intersection(golden_tocs))
        precision = matches / self.k if self.k > 0 else 0
        return {'matches': matches, 'precision': precision, 'retrieved_tocs': ", ".join(map(str, retrieved_tocs)), 'golden_tocs': ", ".join(map(str, sorted(list(golden_tocs))))}

    def generate_report(self, all_results, cost_tracker, categorized_df):
        print("\nGenerating final report...")
        report_filename = "retrieval_evaluation_report.xlsx"
        strategy_map = {'A': 'A: Embedding', 'B': 'B: Categories', 'C': 'C: Hybrid', 'D': 'D: Full Context'}

        # 1. Per-Question Accuracy Summary
        per_q_summary_data = []
        for q_id, q_results in all_results.items():
            golden_tocs_set = self.golden_map.get(q_id, set())
            golden_tocs_str = ", ".join(map(str, sorted(list(golden_tocs_set))))
            for s_code, result in q_results.items():
                if s_code == 'categorization_cost': continue
                eval_metrics = self.evaluate_run(result['tocs'], q_id)
                per_q_summary_data.append({
                    'Question_ID': q_id,
                    'Strategy': strategy_map[s_code],
                    'Golden TOCs': golden_tocs_str,
                    'Retrieved TOCs': eval_metrics['retrieved_tocs'],
                    'Overlap %': f"{eval_metrics['precision']:.0%}"
                })
        per_q_summary_df = pd.DataFrame(per_q_summary_data)

        # 2. Detailed Per-Question Metrics
        detailed_metrics_data = []
        for q_id, q_results in all_results.items():
            row = {'Question_ID': q_id, 'Question_Text': self.golden_df[self.golden_df['Question_ID'] == q_id]['Question_Text'].iloc[0]}
            q_cat_cost = q_results.get('categorization_cost', 0)
            for s_code, result in q_results.items():
                if s_code == 'categorization_cost': continue
                eval_metrics = self.evaluate_run(result['tocs'], q_id)
                final_cost = result['cost'] + (q_cat_cost if s_code in ['B', 'C'] else 0)
                row[f'{s_code}_Matches'] = eval_metrics['matches']
                row[f'{s_code}_Precision'] = eval_metrics['precision']
                row[f'{s_code}_Latency(ms)'] = result['latency']
                row[f'{s_code}_Cost($)'] = final_cost
            detailed_metrics_data.append(row)
        detailed_metrics_df = pd.DataFrame(detailed_metrics_data)

        # 3. Strategy-Level Summary
        summary_data = []
        for s_code, s_name in strategy_map.items():
            summary_data.append({
                'Strategy': s_name,
                'Avg Precision': f"{detailed_metrics_df[f'{s_code}_Precision'].mean():.2%}",
                'Total Query Cost ($)': f"{detailed_metrics_df[f'{s_code}_Cost($)'].sum():.6f}",
                'Avg Latency (ms)': f"{detailed_metrics_df[f'{s_code}_Latency(ms)'].mean():.2f}"
            })
        summary_df = pd.DataFrame(summary_data)
        
        # 4. Cost Breakdown
        costs = cost_tracker.get_summary()
        cost_df = pd.DataFrame({
            'Component': [
                'Setup: Document Categorization', 'Setup: Document Embeddings', '---',
                'Total Query Costs (Aggregated)', '---', 'Total Estimated Cost (Setup + Query)'
            ],
            'Cost ($)': [
                f"{costs['breakdown']['setup_categorization']:.6f}", f"{costs['breakdown']['setup_embedding']:.6f}", '---',
                f"{costs['total_cost'] - costs['breakdown']['setup_categorization'] - costs['breakdown']['setup_embedding']:.6f}", '---',
                f"{costs['total_cost']:.6f}"
            ]
        })

        # 5. Write all DataFrames to a multi-sheet Excel file
        with pd.ExcelWriter(report_filename, engine='openpyxl') as writer:
            per_q_summary_df.to_excel(writer, sheet_name='Per-Question Accuracy Summary', index=False)
            summary_df.to_excel(writer, sheet_name='Strategy-Level Summary', index=False)
            detailed_metrics_df.to_excel(writer, sheet_name='Detailed Per-Question Metrics', index=False)
            cost_df.to_excel(writer, sheet_name='Cost Breakdown', index=False)
            categorized_df.drop(columns=['embedding'], errors='ignore').to_excel(
                writer, sheet_name='Categorized Documents', index=False
            )
            
        print("\n--- Evaluation Report ---")
        print("Per-Question Accuracy Summary (Top 5 rows):")
        print(per_q_summary_df.head().to_string(index=False))
        print("\nStrategy-Level Summary:")
        print(summary_df.to_string(index=False))
        print(f"\nFull report with 5 sheets saved to '{report_filename}'")
        print(f"Total estimated cost for this run: ${costs['total_cost']:.4f}")

# --- Main Execution ---
if __name__ == "__main__":
    create_dummy_files()
    if not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY environment variable not set. Please create a .env file.")

    cost_tracker = CostTracker(config)
    llm = ChatOpenAI(model=config["llm_model"], temperature=0)
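    # Read TOC numbers as strings so they compare equal to the golden TOCs and to
    # Strategy D's parsed output, even if Excel stored them as numeric cells.
    docs_df = pd.read_excel("document_dataset.xlsx", dtype={"TOC_Number": str})
    golden_df = pd.read_excel("golden_source.xlsx", dtype=str)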

    categorizer = DocumentCategorizer(config["categories_1"], config["categories_2"], llm, cost_tracker)
    categorized_df = categorizer.categorize_dataset(docs_df)
    categorized_df.to_excel("document_dataset_categorized.xlsx", index=False)
    print("Categorized documents saved.")

    retriever = DocumentRetriever(categorized_df, config, cost_tracker)
    
    all_results = {}
    print("\nRunning retrieval strategies for all questions...")
    for _, row in tqdm(golden_df.iterrows(), total=len(golden_df), desc="Evaluating Questions"):
        q_id, question = row['Question_ID'], row['Question_Text']
        cost_tracker.cost_breakdown['query_categorization'] = 0
        
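        # B runs before C, so B's latency includes the question categorization call;
        # C then reuses the cached categories.
        all_results[q_id] = {}
        for s_code in ['A', 'B', 'C', 'D']:
            tocs, latency, cost = retriever.retrieve(s_code, question, config["retrieval_k"], q_id)
            all_results[q_id][s_code] = {'tocs': tocs, 'latency': latency, 'cost': cost}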
        all_results[q_id]['categorization_cost'] = cost_tracker.cost_breakdown['query_categorization']
        if hasattr(retriever, '_cat_cache'): retriever._cat_cache.clear()

    evaluator = Evaluator(golden_df, config["retrieval_k"])
    evaluator.generate_report(all_results, cost_tracker, retriever.df)