# 📊 RAG System Evaluation

**Purpose:**  
Assess our Retrieval-Augmented Generation (RAG) pipeline’s performance on a standardized question set using Ragas metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall, and Answer Correctness.

**Notebook Outline:**
1. **Environment Setup & Imports**  
2. **Initialize RAGChatbot & ChromaDB Retriever**  
3. **Baseline Evaluation**  
   - Query the RAG system  
   - Collect responses & contexts  
   - Compute baseline Ragas metrics  
4. **Prompt-Locked Evaluation**  
   - Enforce “use only these snippets” template  
   - Re-generate responses  
   - Re-compute Ragas metrics  
5. **Compare & Visualize Results**  
   - Compute average scores for both runs  
   - Plot side-by-side bar charts  
6. **Save Outputs**  
   - Export evaluation CSVs  
   - Save chart images  
7. **Next Steps & Future Experiments**  
   - Ideas for further tuning, ablations, UI integration  


# Summary of Findings

## Section 1: Baseline Evaluation

**Baseline Evaluation Summary**

* **Retrieval Performance (ChromaDB)**

  * Average Precision: **0.87**
  * Average Recall: **0.94**

* **Answer Generation Quality (LLM)**

  * Average Relevancy: **0.24**
  * Average Faithfulness: **0.26**
  * Average Correctness: **0.26**

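> **Insight:** Retrieval is strong (context precision 0.87, context recall 0.94), but answer quality lags (relevancy, faithfulness, and correctness are all at or below 0.26), which motivates prompt-locking with a “use-only-this-context” template.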

---

**Average Baseline RAG Evaluation Metrics**

| Metric              | Average Score |
| ------------------- | ------------: |
| context\_precision  |          0.87 |
| context\_recall     |          0.94 |
| answer\_relevancy   |          0.24 |
| faithfulness        |          0.26 |
| answer\_correctness |          0.26 |

---
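
The averages in the table above are per-metric means over the per-question scores. The sketch below shows one way to compute them, assuming each row is a dict keyed by metric name (as in the exported CSV). `average_metrics` and `toy_rows` are illustrative names, and the toy values are placeholders, not the notebook's results. Because `evaluate` is called with `raise_exceptions=False`, a failed scoring call yields NaN for that row, so the sketch skips NaN values instead of letting them poison the mean.

```python
import math
from statistics import mean

METRICS = [
    "context_precision",
    "context_recall",
    "answer_relevancy",
    "faithfulness",
    "answer_correctness",
]

def average_metrics(rows, metrics=METRICS):
    """Return the mean of each metric over rows, skipping missing or NaN scores."""
    averages = {}
    for name in metrics:
        scores = [
            row[name]
            for row in rows
            if row.get(name) is not None and not math.isnan(row[name])
        ]
        # A metric with no valid scores is reported as NaN rather than 0.
        averages[name] = round(mean(scores), 2) if scores else float("nan")
    return averages

# Placeholder rows for illustration only; these are NOT the notebook's results.
toy_rows = [
    {"context_precision": 1.0, "context_recall": 1.0, "answer_relevancy": 0.5,
     "faithfulness": 0.0, "answer_correctness": 0.25},
    {"context_precision": 0.5, "context_recall": 1.0, "answer_relevancy": float("nan"),
     "faithfulness": 0.5, "answer_correctness": 0.75},
]
print(average_metrics(toy_rows))
```

With a pandas DataFrame, `df[METRICS].mean()` gives the same behaviour, since pandas skips NaN by default.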


![Average Baseline RAG Evaluation Metrics](images/baseline_metrics.png)

*Figure: Gap between strong retrieval and weaker answer generation.*

## Section 2: Prompt‐Locked Evaluation & Ragas Scoring
**Section 2: Prompt-Locked Evaluation Summary**

After enforcing the “use only these 3 snippets” prompt, faithfulness, answer relevancy, and answer correctness all rise sharply. Context precision rises to 1.00, while context recall falls from 0.94 to 0.78. Here are the **average** scores over the 12-question set:

| Metric              | Baseline Avg. | Prompt-Locked Avg. |
| ------------------- | ------------: | -----------------: |
| context\_precision  |          0.87 |               1.00 |
| context\_recall     |          0.94 |               0.78 |
| faithfulness        |          0.26 |               0.94 |
| answer\_relevancy   |          0.24 |               0.78 |
| answer\_correctness |          0.26 |               0.49 |

> **Key improvements**
>
> * **Faithfulness** ↑ from 0.26 → 0.94
> * **Relevancy** ↑ from 0.24 → 0.78
> * **Correctness** ↑ from 0.26 → 0.49

---
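
To make the size of each change explicit, the absolute difference per metric can be computed directly from the two columns of the table. This is a small sketch using the averages reported above; `metric_deltas` is an illustrative helper name, not part of the notebook.

```python
# Average scores copied from the comparison table above.
baseline = {
    "context_precision": 0.87,
    "context_recall": 0.94,
    "faithfulness": 0.26,
    "answer_relevancy": 0.24,
    "answer_correctness": 0.26,
}
locked = {
    "context_precision": 1.00,
    "context_recall": 0.78,
    "faithfulness": 0.94,
    "answer_relevancy": 0.78,
    "answer_correctness": 0.49,
}

def metric_deltas(before, after):
    """Absolute change per metric (after minus before), rounded to two decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}

deltas = metric_deltas(baseline, locked)

# Print metrics from largest gain to largest loss.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {baseline[name]:.2f} -> {locked[name]:.2f}  ({delta:+.2f})")
```

Sorting by delta puts the generation metrics at the top and makes the one regression (context recall) easy to spot.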

**Visualization**

![Prompt-Locked vs. Baseline Metrics](images/baseline_vs_locked_metrics.png)


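The chart shows that prompt-locking sharply reduces hallucination (faithfulness 0.26 → 0.94) and improves relevancy and correctness, although correctness remains below 0.5. Context recall is lower in the prompt-locked run (0.94 → 0.78). Because both runs retrieve contexts with the same `retrieve_contexts(question, top_k=3)` helper, the shifts in context precision and recall may reflect variance in the LLM-based scoring rather than an effect of the prompt.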
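
A side-by-side bar chart like the one referenced above could be produced along these lines. This is a sketch, not the notebook's plotting cell: `plot_comparison` and the output filename are assumptions, and the scores are the averages from the comparison table.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

metrics = ["context_precision", "context_recall", "faithfulness",
           "answer_relevancy", "answer_correctness"]
baseline = [0.87, 0.94, 0.26, 0.24, 0.26]
locked = [1.00, 0.78, 0.94, 0.78, 0.49]

def plot_comparison(metrics, baseline, locked, out_path):
    """Draw grouped bars (baseline vs. prompt-locked) and save the figure."""
    positions = range(len(metrics))
    width = 0.38
    fig, ax = plt.subplots(figsize=(9, 4))
    # Shift each series half a bar width left/right of the tick position.
    ax.bar([p - width / 2 for p in positions], baseline, width, label="Baseline")
    ax.bar([p + width / 2 for p in positions], locked, width, label="Prompt-locked")
    ax.set_xticks(list(positions))
    ax.set_xticklabels(metrics, rotation=20, ha="right")
    ax.set_ylim(0, 1.05)
    ax.set_ylabel("Average score")
    ax.set_title("Baseline vs. prompt-locked Ragas metrics")
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    return ax

# Writes to the working directory; adjust the path to match your images/ folder.
ax = plot_comparison(metrics, baseline, locked, "baseline_vs_locked_metrics.png")
```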

# Next Steps & Future Experiments
   - **Retrieval Hyperparameter Tuning**  
     Adjust the number of returned contexts, embedding model variants, and similarity thresholds to optimize precision–recall tradeoffs.  
   - **Ablation Studies**  
     Systematically disable or vary components (e.g. prompt templates, context chunk size, reranking) to measure their individual impact on faithfulness and relevance.  
   - **Prompt Engineering Variations**  
     Experiment with different “locked” prompt templates (e.g. add an instruction to cite sources) and generation settings (e.g. temperature) to see how they affect hallucination rates.  
   - **Chaining & Multi-Turn Scenarios**  
     Extend evaluation beyond single-turn to multi-turn dialogues, testing how context accumulation and chat history affect performance.  
   - **UI Integration & A/B Testing**  
     Deploy two interface variants (baseline vs. prompt-locked) to real users (e.g. DSI staff) and collect qualitative feedback on responsiveness, clarity, and trust.  
   - **Metric Expansion**  
     Incorporate human-in-the-loop ratings or additional automated metrics (e.g. linguistic quality, answer completeness) to complement Ragas scores.  
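
Two of the ideas above, varying the number of returned contexts and adding a cite-your-sources instruction, both need a prompt builder that is not hard-wired to three snippets. The sketch below is one hypothetical way to do that. `build_locked_prompt` and its `cite_sources` flag are illustrative and not part of the notebook, and the wording follows the locked template used in the prompt-locked run.

```python
def build_locked_prompt(question, contexts, cite_sources=False):
    """Build a 'use only these snippets' prompt for any number of contexts."""
    # Number the snippets so the model (and any citations) can refer to them.
    numbered = "\n".join(f"{i}. {ctx}" for i, ctx in enumerate(contexts, start=1))
    lines = [
        f"Use only the following {len(contexts)} context snippets "
        "to answer the question exactly.",
        "If the answer is not contained in these snippets, reply:",
        '"Not available in the provided context."',
    ]
    if cite_sources:
        # Optional variation: ask for snippet-number citations.
        lines.append(
            "After each claim, cite the supporting snippet number in brackets, e.g. [2]."
        )
    lines += ["", "Context:", numbered, "", "Question:", question, "", "Answer:"]
    return "\n".join(lines)

# Placeholder snippets for illustration only.
prompt = build_locked_prompt(
    "Is the MS in Applied Data Science program STEM/OPT eligible?",
    ["snippet A", "snippet B"],
    cite_sources=True,
)
print(prompt)
```

Running the same evaluation loop over several `top_k` values and both settings of `cite_sources` would give a small ablation grid to compare against the baseline.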


In [None]:
# pip install ragas langchain-openai
# !pip install -qU langchain-chroma langchain-core  # run once

In [1]:
import sys, pathlib
PROJECT_ROOT = pathlib.Path().resolve().parent      # parent of notebooks/
SRC_DIR = PROJECT_ROOT / "src"
sys.path.append(str(SRC_DIR))                       # modules inside src/ are now importable

In [2]:
# ─────────────────────────────────────────────────────────────────────
# 0. Setup imports and paths
# ─────────────────────────────────────────────────────────────────────
import sys, pathlib
from dotenv import load_dotenv

# The notebook lives in notebooks/, so src/ sits under the parent directory; add it to sys.path
PROJECT_ROOT = pathlib.Path().resolve().parent
sys.path.append(str(PROJECT_ROOT / "src"))

load_dotenv()  # loads OPENAI_API_KEY or other credentials

True

# Baseline Evaluation


In [None]:
# RAG_system_evaluation_v2.ipynb — Revised for Manual Retrieval
# Constraint: Cannot modify src files (chatbot.py, graph.py, etc.)
# Goal: Evaluate RAG system using manual ChromaDB retrieval for contexts.

# --- Section 1: Environment setup & imports ---
import os
import sys
import json
import logging
import traceback
import pandas as pd
from tqdm import tqdm
from datasets import Dataset
from dotenv import load_dotenv
from pathlib import Path
import chromadb


# Setup basic logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(name)s - %(message)s")

# Helper for full tracebacks
def print_full_traceback():
    print("--- FULL TRACEBACK ---")
    traceback.print_exc()
    print("--- END TRACEBACK ---")


# --- Section 2: Import & initialize RAGChatbot ---
module_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from src.rag_chatbot import RAGChatbot
    print("Imported RAGChatbot from src.rag_chatbot")
except ImportError:
    from src.chatbot import RAGChatbot
    print("Imported RAGChatbot from src.chatbot")

rag_chatbot_instance = None
is_rag_chatbot_initialized = False


# Load .env for OPENAI_API_KEY
dotenv_path = os.path.join(os.path.abspath('..'), '.env')
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path)
elif os.path.exists(".env"):
    load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


# Ragas imports
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

ragas_eval_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-3.5-turbo-0125", openai_api_key=OPENAI_API_KEY)
)


# --- Section 3: ChromaDB config & manual retrieval helper ---
PROJECT_ROOT_DIR   = os.path.abspath(os.path.join(os.getcwd(), '..'))
CHROMA_PERSIST_DIR = os.path.join(PROJECT_ROOT_DIR, "data", "chroma_db", "header_chunks")
CHROMA_COLLECTION_NAME = "uchicago_ms_applied_ds_header_chunks"

print(f"ChromaDB path: {CHROMA_PERSIST_DIR}")

def retrieve_contexts(question: str, top_k: int = 3):
    """
    Query ChromaDB for the top_k most relevant chunks.
    """
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    coll   = client.get_collection(name=CHROMA_COLLECTION_NAME)
    results = coll.query(query_texts=[question], n_results=top_k)
    return results["documents"][0]  # list of strings


# --- Section 4: Initialize RAGChatbot instance ---
def initialize_production_rag():
    global rag_chatbot_instance, is_rag_chatbot_initialized
    if is_rag_chatbot_initialized:
        return

    # debug=True is passed below; remove it if your RAGChatbot does not support the flag
    rag_chatbot_instance = RAGChatbot(model="gpt-3.5-turbo-0125", debug=True)
    is_rag_chatbot_initialized = True
    print("RAGChatbot initialized.")


initialize_production_rag()


# --- Section 5: Query function using manual retrieval ---
def get_production_rag_response(question_dict: dict) -> dict:
    question = question_dict["question"]

    # 1) Get the answer from your chatbot
    try:
        resp = rag_chatbot_instance.chat(question, stream=False)
        answer = resp[0] if isinstance(resp, tuple) else resp
    except Exception as e:
        answer = f"Error generating answer: {e}"

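    # 2) Manually retrieve top-3 contexts.
    #    The src files cannot be modified, so these contexts are fetched separately
    #    from the chatbot's own retrieval and may differ from what the model saw.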
    try:
        contexts = retrieve_contexts(question, top_k=3)
    except Exception as e:
        print("⚠️ Manual retrieval failed:", e)
        contexts = []

    return {"answer": str(answer).strip(), "contexts": contexts}


# --- Section 6: Define evaluation set & generate data ---
evaluation_set = [
    {"question": "What is tuition cost for the program?",
     "ground_truth": "Tuition for the MS in Applied Data Science program: $5,967 per course/$71,604 total tuition"},
    {"question": "What scholarships are available for the program?",
     "ground_truth": "The Data Science Institute Scholarship, MS in Applied Data Science Alumni Scholarship"},
    {"question": "What are the minimum scores for the TOEFL and IELTS English Language Requirement?",
     "ground_truth": "Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore requirement); IELTS, 7 (no subscore requirement)."},
    {"question": "Is there an application fee waiver?",
     "ground_truth": "For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy."},
    {"question": "What are the deadlines for the in-person program?",
     "ground_truth": "November 7, 2024 – Priority Application Deadline; December 4, 2024 – Scholarship Priority Deadline; January 21, 2025 – International Application Deadline (requiring visa sponsorship from UChicago); March 4, 2025 – Second Priority Application Deadline; May 6, 2025 – Third Priority Application Deadline; June 23, 2025 – Final Application Deadline"},
    {"question": "How long will it take for me to receive a decision on my application?",
     "ground_truth": "In-Person application decisions are released approximately 1 to 2 months after each respected deadline. Online application decisions are released on a rolling basis"},
    {"question": "Can I set up an advising appointment with the enrollment management team?",
     "ground_truth": "Yes, meet your admissions counselor by scheduling an appointment https://apply-psd.uchicago.edu/portal/applied-data-science"},
    {"question": "Where can I mail my official transcripts?",
     "ground_truth": "The University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 950\nChicago, Illinois 60611"},
    {"question": "Does the Master’s in Applied Data Science Online program provide visa sponsorship?",
     "ground_truth": "Only our In-Person, Full-Time program is Visa eligible"},
    {"question": "How do I apply to the MBA/MS program?",
     "ground_truth": "Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process. Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest"},
    {"question": "Is the MS in Applied Data Science program STEM/OPT eligible?",
     "ground_truth": "The MS in Applied Data Science program is STEM/OPT eligible"},
    {"question": "How many courses must you complete to earn UChicago’s Master’s in Applied Data Science?",
     "ground_truth": "To earn the MS-ADS degree students must successfully complete 12 courses (6 core, 4 elective, 2 Capstone) and our tailored Career Seminar"}
]

generated_data_for_ragas = []
print("\nGenerating RAG responses…")
for item in tqdm(evaluation_set):
    resp = get_production_rag_response(item)
    generated_data_for_ragas.append({
        "question":    item["question"],
        "answer":      resp["answer"],
        "contexts":    resp["contexts"],
        "ground_truth": item["ground_truth"]
    })

eval_dataset = Dataset.from_list(generated_data_for_ragas)
print(f"Created evaluation dataset with {len(eval_dataset)} samples.")


# --- Section 7: Run Ragas evaluation ---
metrics = [faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness]
for m in metrics:
    if getattr(m, 'llm', None) is None:
        m.llm = ragas_eval_llm

print("\nRunning Ragas evaluation…")
result = evaluate(dataset=eval_dataset, metrics=metrics, raise_exceptions=False)


# --- Section 8: Display & save results ---
# pandas is already imported above; normalize the result to a DataFrame
if isinstance(result, dict):
    df = pd.DataFrame([result])
else:
    df = result.to_pandas()  # same call the original Dataset and fallback branches made

print("\n--- Ragas Evaluation Metrics Summary ---")
print(df.to_string())

# Save to CSV
out_dir = os.path.join(PROJECT_ROOT_DIR, "results")
os.makedirs(out_dir, exist_ok=True)
csv_path = os.path.join(out_dir, "ragas_evaluation_results.csv")
df.to_csv(csv_path, index=False)
print(f"\nResults saved to {os.path.relpath(csv_path, PROJECT_ROOT_DIR)}")


Imported RAGChatbot from src.rag_chatbot
ChromaDB path: /Users/danielkim/gen-ai-midterm-project/data/chroma_db/header_chunks
RAGChatbot initialized.

Generating RAG responses…


  0%|          | 0/12 [00:00<?, ?it/s]

Using invoke method


2025-05-15 01:16:23,478 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:24,302 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  8%|▊         | 1/12 [00:02<00:30,  2.81s/it]

Using invoke method


2025-05-15 01:16:26,357 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:26,365 - INFO - sentence_transformers.SentenceTransformer - Use pytorch device_name: mps
2025-05-15 01:16:26,365 - INFO - sentence_transformers.SentenceTransformer - Load pretrained SentenceTransformer: all-MiniLM-L6-v2


Retrieved existing collection 'uchicago_ms_applied_ds_header_chunks'
Retrieving documents for query: What scholarships are available for the program?
Collection: Collection(name=uchicago_ms_applied_ds_header_chunks)
Total documents in collection: 203


Performing direct tuition information search
Found 17 tuition-related documents


2025-05-15 01:16:31,323 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 17%|█▋        | 2/12 [00:09<00:50,  5.06s/it]

Using invoke method


2025-05-15 01:16:33,008 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:33,822 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 25%|██▌       | 3/12 [00:12<00:35,  3.96s/it]

Using invoke method


2025-05-15 01:16:36,183 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 33%|███▎      | 4/12 [00:15<00:28,  3.60s/it]

Using invoke method


2025-05-15 01:16:38,737 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:39,556 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 42%|████▏     | 5/12 [00:18<00:24,  3.47s/it]

Using invoke method


2025-05-15 01:16:41,994 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:42,454 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 50%|█████     | 6/12 [00:20<00:18,  3.15s/it]

Using invoke method


2025-05-15 01:16:44,606 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:45,297 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 58%|█████▊    | 7/12 [00:23<00:15,  3.06s/it]

Using invoke method


2025-05-15 01:16:47,241 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 67%|██████▋   | 8/12 [00:25<00:10,  2.64s/it]

Using invoke method


2025-05-15 01:16:49,204 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Retrieving documents for query: Does the Master’s in Applied Data Science Online program provide visa sponsorship?
Collection: Collection(name=uchicago_ms_applied_ds_header_chunks)
Total documents in collection: 203


2025-05-15 01:16:51,743 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 75%|███████▌  | 9/12 [00:30<00:09,  3.28s/it]

Using invoke method


2025-05-15 01:16:53,790 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:16:54,629 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 83%|████████▎ | 10/12 [00:33<00:06,  3.18s/it]

Using invoke method


2025-05-15 01:16:56,657 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 92%|█████████▏| 11/12 [00:35<00:02,  2.80s/it]

Using invoke method


2025-05-15 01:16:58,810 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Retrieving documents for query: How many courses must you complete to earn UChicago’s Master’s in Applied Data Science?
Collection: Collection(name=uchicago_ms_applied_ds_header_chunks)
Total documents in collection: 203


2025-05-15 01:17:02,909 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
100%|██████████| 12/12 [00:41<00:00,  3.44s/it]


Created evaluation dataset with 12 samples.

Running Ragas evaluation…


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

2025-05-15 01:17:05,044 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,066 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,099 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,104 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,105 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,153 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,159 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,172 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:17:05,340 - INFO -


--- Ragas Evaluation Metrics Summary ---
[per-question results table truncated in the export; first column: user_input]

# Section 2: Prompt‐Locked Evaluation & Ragas Scoring

In [13]:
# RAG_system_evaluation_v2_prompt_lock.ipynb — Full Code with Prompt Locking

# --- Section 1: Environment setup & imports ---
import os
import sys
import logging
import traceback
from dotenv import load_dotenv
import pandas as pd
from tqdm import tqdm
from datasets import Dataset
import chromadb

# Visualization (optional)
import matplotlib.pyplot as plt

# Setup basic logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(name)s - %(message)s")

# Helper for full tracebacks
def print_full_traceback():
    print("--- FULL TRACEBACK ---")
    traceback.print_exc()
    print("--- END TRACEBACK ---")


# --- Section 2: Import & initialize RAGChatbot ---
module_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from src.rag_chatbot import RAGChatbot
    logging.info("Imported RAGChatbot from src.rag_chatbot")
except ImportError:
    from src.chatbot import RAGChatbot
    logging.info("Imported RAGChatbot from src.chatbot")

rag_chatbot_instance = None
is_rag_chatbot_initialized = False

# Load .env for OPENAI_API_KEY
dotenv_path = os.path.join(os.path.abspath('..'), '.env')
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path)
elif os.path.exists(".env"):
    load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


# Ragas imports
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

ragas_eval_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-3.5-turbo-0125", openai_api_key=OPENAI_API_KEY)
)


# --- Section 3: ChromaDB config & manual retrieval helper ---
PROJECT_ROOT_DIR   = os.path.abspath(os.path.join(os.getcwd(), '..'))
CHROMA_PERSIST_DIR = os.path.join(PROJECT_ROOT_DIR, "data", "chroma_db", "header_chunks")
CHROMA_COLLECTION_NAME = "uchicago_ms_applied_ds_header_chunks"

logging.info(f"ChromaDB path: {CHROMA_PERSIST_DIR}")

def retrieve_contexts(question: str, top_k: int = 3):
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    coll   = client.get_collection(name=CHROMA_COLLECTION_NAME)
    results = coll.query(query_texts=[question], n_results=top_k)
    return results["documents"][0]  # list of strings


# --- Section 4: Initialize RAGChatbot instance ---
def initialize_production_rag():
    global rag_chatbot_instance, is_rag_chatbot_initialized
    if is_rag_chatbot_initialized:
        return

    rag_chatbot_instance = RAGChatbot(model="gpt-3.5-turbo-0125", debug=True)
    is_rag_chatbot_initialized = True
    logging.info("RAGChatbot initialized.")

initialize_production_rag()


# --- Section 5: Prompt-Locked Query Function ---
PROMPT_TEMPLATE = """
Use only the following 3 context snippets to answer the question exactly.
If the answer is not contained in these snippets, reply:
“Not available in the provided context.”

Context:
1. {ctx0}
2. {ctx1}
3. {ctx2}

Question:
{question}

Answer:
"""

def get_production_rag_response_locked(question_dict: dict) -> dict:
    question = question_dict["question"]

    # 1) Retrieve top-3 contexts
    try:
        contexts = retrieve_contexts(question, top_k=3)
    except Exception as e:
        logging.warning(f"Retrieval failed: {e}")
        contexts = []

    # 2) Build the locked prompt
    ctxs = contexts + [""] * (3 - len(contexts))
    prompt = PROMPT_TEMPLATE.format(
        ctx0=ctxs[0],
        ctx1=ctxs[1],
        ctx2=ctxs[2],
        question=question
    )

    # 3) Query the chatbot with the locked prompt
    try:
        resp = rag_chatbot_instance.chat(prompt, stream=False)
        answer = resp[0] if isinstance(resp, tuple) else resp
    except Exception as e:
        answer = f"Error generating answer: {e}"

    # str() guards against non-string responses, matching the baseline function
    return {"answer": str(answer).strip(), "contexts": contexts}


# --- Section 6: Define evaluation set & generate data ---
evaluation_set = [
    {"question": "What is tuition cost for the program?",
     "ground_truth": "Tuition for the MS in Applied Data Science program: $5,967 per course/$71,604 total tuition"},
    {"question": "What scholarships are available for the program?",
     "ground_truth": "The Data Science Institute Scholarship, MS in Applied Data Science Alumni Scholarship"},
    {"question": "What are the minimum scores for the TOEFL and IELTS English Language Requirement?",
     "ground_truth": "Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore requirement); IELTS, 7 (no subscore requirement)."},
    {"question": "Is there an application fee waiver?",
     "ground_truth": "For questions regarding an application fee waiver, please refer to the Physical Sciences Division fee waiver policy."},
    {"question": "What are the deadlines for the in-person program?",
     "ground_truth": "November 7, 2024 – Priority Application Deadline; December 4, 2024 – Scholarship Priority Deadline; January 21, 2025 – International Application Deadline (requiring visa sponsorship from UChicago); March 4, 2025 – Second Priority Application Deadline; May 6, 2025 – Third Priority Application Deadline; June 23, 2025 – Final Application Deadline"},
    {"question": "How long will it take for me to receive a decision on my application?",
     "ground_truth": "In-Person application decisions are released approximately 1 to 2 months after each respected deadline. Online application decisions are released on a rolling basis"},
    {"question": "Can I set up an advising appointment with the enrollment management team?",
     "ground_truth": "Yes, meet your admissions counselor by scheduling an appointment https://apply-psd.uchicago.edu/portal/applied-data-science"},
    {"question": "Where can I mail my official transcripts?",
     "ground_truth": "The University of Chicago\nAttention: MS in Applied Data Science Admissions\n455 N Cityfront Plaza Dr., Suite 950\nChicago, Illinois 60611"},
    {"question": "Does the Master’s in Applied Data Science Online program provide visa sponsorship?",
     "ground_truth": "Only our In-Person, Full-Time program is Visa eligible"},
    {"question": "How do I apply to the MBA/MS program?",
     "ground_truth": "Applicants interested in the Joint MBA/MS degree will apply through Booth’s centralized, joint-application process. Applicants should complete the Chicago Booth Full-Time MBA application and select the MBA/MS in Applied Data Science as their program of interest"},
    {"question": "Is the MS in Applied Data Science program STEM/OPT eligible?",
     "ground_truth": "The MS in Applied Data Science program is STEM/OPT eligible"},
    {"question": "How many courses must you complete to earn UChicago’s Master’s in Applied Data Science?",
     "ground_truth": "To earn the MS-ADS degree students must successfully complete 12 courses (6 core, 4 elective, 2 Capstone) and our tailored Career Seminar"}
]

generated_data_for_ragas = []
logging.info("Generating RAG responses with prompt locking…")
for item in tqdm(evaluation_set):
    resp = get_production_rag_response_locked(item)
    generated_data_for_ragas.append({
        "question":     item["question"],
        "answer":       resp["answer"],
        "contexts":     resp["contexts"],
        "ground_truth": item["ground_truth"]
    })

eval_dataset = Dataset.from_list(generated_data_for_ragas)
logging.info(f"Created evaluation dataset with {len(eval_dataset)} samples.")


# --- Section 7: Run Ragas evaluation ---
metrics = [faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness]
for m in metrics:
    if getattr(m, 'llm', None) is None:
        m.llm = ragas_eval_llm

logging.info("Running Ragas evaluation with prompt locking…")
result = evaluate(dataset=eval_dataset, metrics=metrics, raise_exceptions=False)


# --- Section 8: Display & save results ---
# pandas is already imported above; normalize the result to a DataFrame
if isinstance(result, dict):
    df = pd.DataFrame([result])
else:
    df = result.to_pandas()  # same call the original Dataset and fallback branches made

print("\n--- Ragas Evaluation Metrics Summary (Prompt-Locked) ---")
print(df.to_string())

# Save to CSV
out_dir = os.path.join(PROJECT_ROOT_DIR, "results")
os.makedirs(out_dir, exist_ok=True)
csv_path = os.path.join(out_dir, "ragas_evaluation_prompt_locked.csv")
df.to_csv(csv_path, index=False)
logging.info(f"Results saved to {csv_path}")


2025-05-15 01:22:26,249 - INFO - root - Imported RAGChatbot from src.rag_chatbot
2025-05-15 01:22:26,266 - INFO - root - ChromaDB path: /Users/danielkim/gen-ai-midterm-project/data/chroma_db/header_chunks
2025-05-15 01:22:26,288 - INFO - root - RAGChatbot initialized.
2025-05-15 01:22:26,289 - INFO - root - Generating RAG responses with prompt locking…
  0%|          | 0/12 [00:00<?, ?it/s]

Using invoke method


2025-05-15 01:22:29,054 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:29,621 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  8%|▊         | 1/12 [00:03<00:36,  3.33s/it]

Using invoke method


2025-05-15 01:22:31,435 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:32,337 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 17%|█▋        | 2/12 [00:06<00:29,  2.97s/it]

Using invoke method


2025-05-15 01:22:34,075 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:34,759 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 25%|██▌       | 3/12 [00:08<00:24,  2.72s/it]

Using invoke method


2025-05-15 01:22:36,777 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:37,340 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 33%|███▎      | 4/12 [00:11<00:21,  2.66s/it]

Using invoke method


2025-05-15 01:22:39,339 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:39,784 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 42%|████▏     | 5/12 [00:13<00:18,  2.58s/it]

Using invoke method


2025-05-15 01:22:41,909 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:42,431 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 50%|█████     | 6/12 [00:16<00:15,  2.61s/it]

Using invoke method


2025-05-15 01:22:44,623 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:45,540 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 58%|█████▊    | 7/12 [00:19<00:13,  2.77s/it]

Using invoke method


2025-05-15 01:22:47,592 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:48,031 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 67%|██████▋   | 8/12 [00:21<00:10,  2.68s/it]

Using invoke method


2025-05-15 01:22:49,773 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:50,251 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 75%|███████▌  | 9/12 [00:23<00:07,  2.54s/it]

Using invoke method


2025-05-15 01:22:52,196 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:53,131 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 83%|████████▎ | 10/12 [00:26<00:05,  2.64s/it]

Using invoke method


2025-05-15 01:22:55,073 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:55,614 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 92%|█████████▏| 11/12 [00:29<00:02,  2.59s/it]

Using invoke method


2025-05-15 01:22:57,577 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:58,125 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
100%|██████████| 12/12 [00:31<00:00,  2.65s/it]
2025-05-15 01:22:58,135 - INFO - root - Created evaluation dataset with 12 samples.
2025-05-15 01:22:58,136 - INFO - root - Running Ragas evaluation with prompt locking…


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

2025-05-15 01:22:59,133 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:59,136 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:59,145 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:59,147 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:59,465 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-05-15 01:22:59,836 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-05-15 01:22:59,879 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:59,882 - INFO - httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-15 01:22:59,884 - INFO - httpx - HTT


--- Ragas Evaluation Metrics Summary (Prompt-Locked) ---
user_input

# **Next Steps & Future Experiments**  
   - **Retrieval Hyperparameter Tuning**  
     Vary the number of returned contexts, the embedding model, and the similarity threshold to tune the precision–recall trade-off.  
   - **Ablation Studies**  
     Systematically disable or vary one component at a time (e.g., prompt template, context chunk size, reranking) to measure its individual impact on faithfulness and answer relevancy.  
   - **Prompt Engineering Variations**  
     Experiment with different “locked” prompt templates (e.g., adding an instruction to cite sources) and with generation settings such as temperature to see how they affect hallucination rates.  
   - **Chaining & Multi-Turn Scenarios**  
     Extend evaluation beyond single-turn to multi-turn dialogues, testing how context accumulation and chat history affect performance.  
   - **UI Integration & A/B Testing**  
     Deploy two interface variants (baseline vs. prompt-locked) to real users (e.g. DSI staff) and collect qualitative feedback on responsiveness, clarity, and trust.  
   - **Metric Expansion**  
     Incorporate human-in-the-loop ratings or additional automated metrics (e.g. linguistic quality, answer completeness) to complement Ragas scores.  
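
The tuning and ablation ideas above could be organised as a small grid of configurations, each evaluated once and recorded as one row. The following is a minimal sketch under stated assumptions, not part of the notebook: `SEARCH_SPACE`, `build_configs`, `run_ablation`, `best_by`, and the `evaluate_config` callable are hypothetical names, and the option values are placeholders rather than tuned settings. In practice `evaluate_config` would rebuild the retriever and chatbot for the given settings, run Ragas, and return the average score per metric.

```python
from itertools import product

# Hypothetical search space: placeholder values, not recommended settings.
SEARCH_SPACE = {
    "top_k": [2, 4, 8],
    "chunk_size": [256, 512],
    "prompt_template": ["baseline", "prompt_locked"],
}


def build_configs(space):
    """Expand a dict of option lists into one config dict per combination."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def run_ablation(configs, evaluate_config):
    """Evaluate every config and return one row (settings + scores) per run.

    evaluate_config is a user-supplied callable (hypothetical here) that takes
    a config dict and returns a dict mapping metric name -> average score.
    """
    rows = []
    for config in configs:
        scores = evaluate_config(config)
        rows.append({**config, **scores})
    return rows


def best_by(rows, metric):
    """Return the row with the highest value for the given metric."""
    return max(rows, key=lambda row: row[metric])


def dummy_evaluate(config):
    """Stand-in scorer so the sketch runs without an LLM; replace with a real run."""
    return {"faithfulness": 0.0, "answer_relevancy": 0.0}


configs = build_configs(SEARCH_SPACE)
rows = run_ablation(configs, dummy_evaluate)
```

Keeping each run as a flat dict means the rows can be passed directly to `pd.DataFrame(rows)` and saved alongside the existing evaluation CSVs. A full grid grows multiplicatively, so for a one-factor ablation it may be cheaper to vary a single key while holding the others at their baseline values.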
