# LLM Evaluation & Prompt Testing

## Current System Overview

### RAG Pipeline Architecture

**Purpose**: An AI civics tutor that evaluates user answers for the USCIS citizenship test and provides educational feedback.

**Components**:
1. **Question Bank**: Official USCIS civics test questions with up-to-date correct answers
2. **Vector Database**: Qdrant containing embedded USCIS Civics Guide passages
3. **Retrieval**: Semantic search retrieves relevant context from civics guide
4. **LLM Evaluation**: OpenAI model evaluates answer and generates educational feedback

**System Inputs**:
- **Question**: The civics test question being asked
- **Correct Answers**: Up-to-date acceptable answers (updated monthly via automated ingestion)
- **User Answer**: The student's submitted answer
- **Context**: Retrieved passages from USCIS Civics Guide (2-4 chunks via RAG)
- **User State**: Student's location for state-specific questions (governor, senators, etc.)

**System Outputs**:
The bot returns a JSON response with three fields:

1. **`success`** (boolean): Pass/fail evaluation of the user's answer
2. **`reason`** (string): Explanation for why the answer passed or failed
3. **`background_info`** (string): Educational context to help student learn and retain information
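
As a minimal sketch of how that three-field contract could be enforced before any metrics are computed, the snippet below parses a raw reply and checks field presence and types. The names `BotResponse` and `parse_bot_response` are hypothetical and are not part of the existing pipeline.

```python
import json
from dataclasses import dataclass


@dataclass
class BotResponse:
    """The three fields the tutor is documented to return."""
    success: bool
    reason: str
    background_info: str


def parse_bot_response(raw: str) -> BotResponse:
    """Parse the tutor's JSON reply and verify the three documented fields."""
    data = json.loads(raw)
    expected = {"success": bool, "reason": str, "background_info": str}
    for field, field_type in expected.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], field_type):
            raise ValueError(f"{field} should be of type {field_type.__name__}")
    return BotResponse(**{name: data[name] for name in expected})
```

Rejecting malformed replies early keeps a string such as `"yes"` in `success` from silently skewing the pass/fail statistics later on.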

**Optimal RAG Configuration** (from the [retrieval evaluation analysis](./04_retrieval_evaluation.ipynb)):
- Context limit: 4 chunks
- Score threshold: 0.3
- Query expansion: False
- Temperature: 0.5
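
One way to keep these settings fixed while only the prompt changes is to pin them in a single immutable object. The sketch below is an assumption about how that could look: the `RagConfig` class is hypothetical, its field names loosely mirror the `rag_*` and `llm_temperature` columns logged in the feedback table, and the comment on the threshold's meaning is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Retrieval and generation settings held constant during prompt testing."""
    context_limit: int = 4         # number of retrieved chunks passed to the LLM
    score_threshold: float = 0.3   # assumed: minimum retrieval score to keep a chunk
    query_expansion: bool = False
    temperature: float = 0.5       # LLM sampling temperature


BASELINE_CONFIG = RagConfig()
```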

---

## Current Problems Identified

### Problem 1: LLM Using Outdated Training Data Instead of Provided Answers
**Description**: The LLM sometimes relies on its training data instead of the correct answers explicitly provided in the prompt.

**Critical Example**:
- **Question**: "Who is the current US President?"
- **Provided correct answer**: "Donald Trump" (from monthly-updated data ingestion)
- **User answer**: "Donald Trump"
- **LLM behavior**: Marks the answer as **INCORRECT** because the LLM's training data dates from 2024, when Biden was president
- **Root cause**: LLM ignoring the `answers` parameter and using its outdated world knowledge

**Impact**: 
- System gives wrong evaluations for time-sensitive questions (presidents, governors, senators, representatives)
- Defeats the purpose of monthly automated data updates
- Users get marked wrong when they're actually correct

**Affected question types**:
- Current officeholders (president, VP, governors, senators, representatives)
- Any time-sensitive civics information that changes
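
A cheap deterministic check can surface the clearest cases of this problem without an LLM: if the user's answer matches a provided answer exactly (ignoring case and spacing) and the bot still failed it, something overrode the `answers` list. The sketch below assumes each row is a dict whose `correct_answers` is a list of strings; the function names are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def find_override_suspects(rows):
    """Return rows where the user answer exactly matches a provided answer
    (after normalization) but the bot still marked it as a failure."""
    suspects = []
    for row in rows:
        accepted = {normalize(answer) for answer in row["correct_answers"]}
        if normalize(row["user_answer"]) in accepted and not row["success"]:
            suspects.append(row)
    return suspects
```

This only catches exact matches, so it gives a lower bound on the problem rather than a full measurement; paraphrased answers still need the LLM-as-judge check.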

---

### Problem 2: Grading Too Harsh
**Description**: The evaluation is overly strict and marks semantically correct answers as wrong.

**Examples**:
- ❌ User makes minor **typos** but answer is semantically correct → marked as fail
- ❌ Question asks for **2 answers**, user provides **3 answers** where 2 are correct → marked as fail  
- ❌ User answer is **semantically equivalent** but doesn't use exact wording → marked as fail

**Example**:
- Correct answer: "freedom of speech"
- User answer: "fredom of speach" (typos) → marked wrong
- User answer: "right to free speech" (equivalent) → marked wrong

**Impact**: Students get discouraged when correct answers are marked wrong due to technicalities.
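
To make the typo and extra-answer cases concrete, here is a rough sketch of a lenient reference grader built on the standard library's `difflib.SequenceMatcher`. It is not the bot's grading logic; the function names, the 0.85 threshold, and the splitting rule are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def is_close(candidate: str, accepted: str, threshold: float = 0.85) -> bool:
    """True when two strings are nearly identical at the character level."""
    a, b = candidate.lower().strip(), accepted.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def count_correct_items(user_answer, accepted_answers, threshold=0.85):
    """Count distinct accepted answers matched by the user's answer."""
    # Try the whole answer first, then each part split on commas, semicolons, or "and".
    parts = re.split(r",|;|\band\b", user_answer)
    items = [user_answer] + [part.strip() for part in parts if part.strip()]
    matched = set()
    for item in items:
        for accepted in accepted_answers:
            if is_close(item, accepted, threshold):
                matched.add(accepted)
                break
    return len(matched)


def lenient_pass(user_answer, accepted_answers, required=1, threshold=0.85):
    """Pass when enough accepted answers are matched; extra wrong items are ignored."""
    return count_correct_items(user_answer, accepted_answers, threshold) >= required
```

With these settings, "fredom of speach" passes against "freedom of speech", and "Arizona, California, Michigan" passes a two-answer question. Character-level similarity cannot recognize paraphrases such as "right to free speech", which is why semantic equivalence is left to the LLM.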

---

### Problem 3: Background Info Too Short/Generic
**Description**: The background information often lacks educational depth and substance.

**Symptoms**:
- Background info is only 1-2 sentences (too short to be educational)
- Information appears generic and doesn't leverage the retrieved context
- Students report feeling like they "learned nothing" from the background

**Current prompt guidance**: "Keep background info short" (likely contributing to this issue)

**Impact**: Defeats the learning purpose - students don't retain information beyond rote memorization.

---

### Problem 4: Poor Context Utilization
**Description**: Unclear whether the LLM actually uses the retrieved RAG context to generate background info.

**Observation**: Some background info appears generic enough that it could have been generated without any context retrieval.

**Impact**: 
- Wastes the RAG pipeline effort
- Missing opportunity to provide authoritative, context-grounded education
- Background info lacks depth and specificity

---

### Problem 5: Redundancy Between Reason & Background
**Description**: The `reason` and `background_info` fields often contain overlapping or repetitive information.

**Example**:
```json
{
  "reason": "Your answer is correct. The three branches of government are legislative, executive, and judicial.",
  "background_info": "The U.S. government has three branches: legislative, executive, and judicial. This separation ensures checks and balances."
}
```

**Impact**: 
- Wastes tokens and API costs
- Poor user experience (feels repetitive)
- Background info doesn't add value beyond the reason

---

## Evaluation Goals

### Primary Objectives

1. **Fix Training Data Override Issue** 🔴 **CRITICAL**
   - Ensure LLM strictly uses provided `answers` parameter, not its training data
   - Test with time-sensitive questions (current president, governors, etc.)
   - Success metric: 100% accuracy on questions with outdated training data

2. **Reduce Harsh Grading**
   - Accept answers with minor typos (high edit distance similarity)
   - Accept semantically equivalent answers (same meaning, different words)
   - Pass multi-part answers when the number of correct items meets the minimum the question asks for, even if extra items are wrong
   - Success metric: the bot passes an answer where 2 items are required, 3 are given, and 2 of the 3 are correct; measured via LLM-as-judge grading accuracy

3. **Improve Background Info Quality**
   - Increase substantive length (target: 50-80 words minimum)
   - Ensure educational value and learning retention
   - Success metric: Average word count ≥50, higher LLM-as-judge scores

4. **Ensure Context Usage**
   - Background info must demonstrably use retrieved RAG context
   - Include specific facts/details from context passages
   - Success metric: High context overlap ratio, LLM-as-judge confirms context usage

5. **Eliminate Redundancy**
   - Make `reason` and `background_info` distinct and complementary
   - `reason` = explain grading decision
   - `background_info` = teach something new beyond grading
   - Success metric: Reason-background similarity score <0.5

### Success Criteria Summary
We will track the following metrics and watch how they trend as the prompt changes.
#### Quantitative metrics
- Positive feedback rate
- Background word count
- Context usage in background
- Reason-background similarity
#### Qualitative metrics (LLM-as-judge)
- Override of context data
- Grading accuracy
- Background info quality
- Overall learning helpfulness


## Procedure
For now we focus on the 2008 test for Arizona, for two reasons: I am from AZ, and the 2008 test is the one everyone is currently taking, since Oct 20 has not yet passed.

We will:
1. Run the current exam setup ~50 times to create a golden dataset with actual feedback.
2. Check the basic quantitative metrics to see how we are doing, and whether the issues described above show up consistently.
3. Use LLM-as-judge to check the qualitative metrics, which should give us insight into overall performance.
4. Log our current baseline scores.
5. Try out some prompt changes.
6. See whether our metrics improve or worsen.
7. Settle on the best prompt.
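
For the compare-against-baseline step of the list above, a small helper could report the direction of change per metric. This is only a sketch: the metric keys and the `compare_to_baseline` name are hypothetical, and any numbers fed to it should come from real logged runs.

```python
# Whether a larger value counts as an improvement for each tracked metric.
HIGHER_IS_BETTER = {
    "positive_feedback_rate": True,
    "avg_background_words": True,
    "mean_reason_bkg_similarity": False,
}


def compare_to_baseline(baseline: dict, candidate: dict) -> dict:
    """Return the delta and an improved flag for every tracked metric."""
    report = {}
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[name] - baseline[name]
        improved = delta > 0 if higher_is_better else delta < 0
        report[name] = {"delta": round(delta, 3), "improved": improved}
    return report
```

Keeping the direction of each metric in one table avoids misreading a drop in reason-background similarity as a regression.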

In [49]:
import pandas as pd
import psycopg2
from dotenv import load_dotenv
import os

In [50]:

# Load environment variables from .env file
load_dotenv()

# Get the database URL
DATABASE_URL = os.getenv('DATABASE_URL')

# Connect to Neon
conn = psycopg2.connect(DATABASE_URL)
print("✅ Connected to Neon!")


✅ Connected to Neon!


In [51]:
# Load feedback data
df = pd.read_sql("""
    SELECT * FROM feedback 
    ORDER BY timestamp DESC
""", conn)

print(f"Total feedback entries: {len(df)}")
df.head()


Total feedback entries: 70


| | id | timestamp | user_state | test_year | question_text | correct_answers | user_answer | success | reason | background_info | feedback_type | session_id | rag_context_limit | rag_score_threshold | rag_query_expansion | system_prompt | user_prompt | model | llm_temperature | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 2025-10-16 23:12:25.165481 | Arizona | 2008 | What does the President’s Cabinet do? | `{"advises the President"}` | they advice the president on their area of exp... | False | Almost there! The correct answer is that the P... | The President’s Cabinet is like a secret sauce... | positive | 08eee2c4-7a3c-434d-b7d8-b219e6acef5d | 4 | 0.3 | False | `\nYou are a friendly USCIS officer helping the...` | `\n<question>\nWhat does the President’s Cabine...` | gpt-4o-mini | 0.5 | `Page 13:\nTHE EXECUTIVE BRANCH\nIn this chapte...` |
| 1 | 69 | 2025-10-16 23:12:01.025205 | Arizona | 2008 | Who does a U.S. Senator represent? | `{"all people of the state"}` | their state | False | Close, but not quite! A U.S. Senator represent... | Each state has two U.S. Senators, so it’s like... | positive | 08eee2c4-7a3c-434d-b7d8-b219e6acef5d | 4 | 0.3 | False | `\nYou are a friendly USCIS officer helping the...` | `\n<question>\nWho does a U.S. Senator represen...` | gpt-4o-mini | 0.5 | `Page 11:\nElecting Members of the House of Rep...` |
| 2 | 68 | 2025-10-16 23:10:41.910605 | Arizona | 2008 | Why do some states have more Representatives t... | `{"(because of) the state’ s population","(beca...` | some states they got more people than othersta... | False | Close, but not quite! The correct answer is th... | Did you know that California has 52 representa... | negative | 08eee2c4-7a3c-434d-b7d8-b219e6acef5d | 4 | 0.3 | False | `\nYou are a friendly USCIS officer helping the...` | `\n<question>\nWhy do some states have more Rep...` | gpt-4o-mini | 0.5 | `Page 10:\nTHE LEGISLATIVE BRANCH (CONGRESS)\nI...` |
| 3 | 67 | 2025-10-16 23:10:05.985411 | Arizona | 2008 | What is the economic system in the United States? | `{"capitalist economy","market economy"}` | capitalism economy | False | Close, but not quite! The correct terms are 'c... | In a market economy, the government doesn’t co... | negative | 08eee2c4-7a3c-434d-b7d8-b219e6acef5d | 4 | 0.3 | False | `\nYou are a friendly USCIS officer helping the...` | `\n<question>\nWhat is the economic system in t...` | gpt-4o-mini | 0.5 | `Page 32:\nAMERICAN HISTORY: 1900-2001\nIn this...` |
| 4 | 66 | 2025-10-16 23:09:46.921651 | Arizona | 2008 | When is the last day you can send in federal i... | `{"April 15"}` | apr 15 | True | Correct! The last day to send in federal incom... | If April 15 falls on a weekend or holiday, the... | positive | 08eee2c4-7a3c-434d-b7d8-b219e6acef5d | 4 | 0.3 | False | `\nYou are a friendly USCIS officer helping the...` | `\n<question>\nWhen is the last day you can sen...` | gpt-4o-mini | 0.5 | No relevant context found. |


## Quantitative metrics

In [52]:
# function to get positive feedback rate
def get_positive_feedback_rate(df):
    """
    Calculate the percentage of positive feedback.
    
    Args:
        df: DataFrame with 'feedback_type' column containing 'positive' or 'negative'
    
    Returns:
        float: Percentage of positive feedback (0-100)
    """
    positive_count = (df['feedback_type'] == 'positive').sum()
    total_count = len(df)
    
    if total_count == 0:
        return 0.0
    
    return (positive_count / total_count) * 100

In [53]:
def add_background_word_count(df):
    """
    Add background_info_word_count column to DataFrame.
    
    Args:
        df: DataFrame with 'background_info' column
    
    Returns:
        DataFrame: Original DataFrame with new 'background_info_word_count' column added
    """
    df['background_info_word_count'] = df['background_info'].str.split().str.len()
    return df

In [54]:
def calculate_context_usage(df):
    """
    Calculate what percentage of context words appear in background_info.
    
    Returns:
        Series: Context usage ratio for each row (0-1)
    """
    def word_overlap_ratio(row):
        # Get words from context and background
        context_words = set(str(row['context']).lower().split())
        background_words = set(str(row['background_info']).lower().split())
        
        # Remove common stop words to focus on meaningful overlap
        stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'is', 'are', 'was', 'were', 'page'}
        context_words -= stop_words
        background_words -= stop_words
        
        # Calculate overlap
        if len(context_words) == 0:
            return 0.0
        
        overlap = len(context_words & background_words)
        return overlap / len(context_words)
    
    return df.apply(word_overlap_ratio, axis=1)

def calculate_ngram_overlap(df, n=3):
    """
    Calculate overlap of n-grams between context and background.
    Better at catching actual usage vs coincidental word matches.
    """
    def get_ngrams(text, n):
        # Build the set of word n-grams (as space-joined strings) for one text
        words = text.lower().split()
        return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}
    
    def ngram_overlap_ratio(row):
        context_ngrams = get_ngrams(str(row['context']), n)
        background_ngrams = get_ngrams(str(row['background_info']), n)
        
        if len(context_ngrams) == 0:
            return 0.0
        
        overlap = len(context_ngrams & background_ngrams)
        return overlap / len(context_ngrams)
    
    return df.apply(ngram_overlap_ratio, axis=1)

def calculate_key_term_usage(df):
    """
    Check if key terms from context appear in background.
    Focuses on proper nouns, numbers, and capitalized terms.
    """
    import re
    
    def extract_key_terms(text):
        # Find capitalized words (likely proper nouns)
        capitalized = set(re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', text))
        # Find numbers
        numbers = set(re.findall(r'\b\d+\b', text))
        # Find words in quotes
        quoted = set(re.findall(r'"([^"]*)"', text))
        
        return capitalized | numbers | quoted
    
    def key_term_ratio(row):
        context_terms = extract_key_terms(str(row['context']))
        background_text = str(row['background_info']).lower()
        
        if len(context_terms) == 0:
            return 0.0
        
        # Check how many key terms appear in background
        found = sum(1 for term in context_terms if term.lower() in background_text)
        return found / len(context_terms)
    
    return df.apply(key_term_ratio, axis=1)

In [55]:
def analyze_context_usage(df):
    """
    Comprehensive context usage analysis using multiple metrics.
    """
    # Word overlap
    df['word_overlap'] = calculate_context_usage(df)
    
    # 3-gram overlap (phrases)
    df['phrase_overlap'] = calculate_ngram_overlap(df, n=3)
    
    # Key terms
    df['key_term_usage'] = calculate_key_term_usage(df)
    
    # Summary stats
    stats = {
        'word_overlap_mean': df['word_overlap'].mean(),
        'phrase_overlap_mean': df['phrase_overlap'].mean(),
        'key_term_usage_mean': df['key_term_usage'].mean(),
        'low_usage_count': len(df[df['word_overlap'] < 0.3])  # Flag low usage
    }
    
    return df, stats

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

def add_reason_background_similarity(df):
    """
    Calculate cosine similarity between reason and background_info using TF-IDF.
    Adds 'reason_bkg_info_similarity' column to DataFrame.
    
    Args:
        df: DataFrame with 'reason' and 'background_info' columns
    
    Returns:
        DataFrame: Original DataFrame with new 'reason_bkg_info_similarity' column added
    """
    similarities = []
    
    for _, row in df.iterrows():
        reason = str(row['reason'])
        background = str(row['background_info'])
        
        # Fit a fresh TF-IDF vocabulary on just this pair of texts
        vectorizer = TfidfVectorizer()
        try:
            tfidf_matrix = vectorizer.fit_transform([reason, background])
            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        except ValueError:
            # Raised when the vocabulary is empty (e.g. both texts are blank)
            similarity = 0.0
        
        similarities.append(similarity)
    
    df['reason_bkg_info_similarity'] = similarities
    return df

In [57]:
# add to df the word count
df = add_background_word_count(df)


positive_rate = get_positive_feedback_rate(df)
print(f"Positive Feedback Rate: {positive_rate:.1f}%")
print(f"Average: {df['background_info_word_count'].mean():.1f} words")
print(f"Range: {df['background_info_word_count'].min()} - {df['background_info_word_count'].max()} words")

Positive Feedback Rate: 70.0%
Average: 29.5 words
Range: 16 - 49 words
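
With only 70 entries, the 70.0% positive rate (49 of 70) carries noticeable sampling uncertainty, which matters when comparing prompt variants later. As a rough sketch, a Wilson score interval can quantify it; the helper name `wilson_interval` is hypothetical, and the calculation assumes the feedback entries are independent.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (center - half_width, center + half_width)


low, high = wilson_interval(49, 70)
print(f"Positive feedback rate: 70.0% (95% interval: {low:.1%} to {high:.1%})")
```

For 49 of 70 the interval spans roughly 58% to 79%, so a prompt change that moves the rate by only a few points on a sample of this size should not be read as a real improvement.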


In [59]:
df_usage, usage_stats = analyze_context_usage(df)
df = add_reason_background_similarity(df)

print("Context Usage Analysis:")
print(f" Mean Word overlap: {usage_stats['word_overlap_mean']:.1%}")
print(f" Mean Phrase overlap: {usage_stats['phrase_overlap_mean']:.1%}")
print(f" Mean Key term usage: {usage_stats['key_term_usage_mean']:.1%}")
print(f" Low usage cases (word overlap < 30%): {usage_stats['low_usage_count']}")

# Analyze
print(f"Mean similarity: {df['reason_bkg_info_similarity'].mean():.2f}")
print(f"High redundancy (>0.5): {len(df_usage[df_usage['reason_bkg_info_similarity'] > 0.5])}")

Context Usage Analysis:
 Mean Word overlap: 2.6%
 Mean Phrase overlap: 0.4%
 Mean Key term usage: 9.1%
 Low usage cases (word overlap < 30%): 70
Mean similarity: 0.25
High redundancy (>0.5): 1


#### Quick analysis on quantitative values
- Feedback rate: `70% is positive`. We can do better; much of the negative feedback comes from the issues described above.
- Background length: 16-49 words, with an average of `~30 words`.
- Context usage: we analyzed 3 metrics. The highest of them (and it is still low) is word overlap, with an `avg word overlap of 2.6%`. This may be better analyzed with an LLM.
- Reason/background similarity: cosine similarity between reason and background info gives a mean of 0.25 (`25%`). Not much redundancy, which looks good; let's confirm with an LLM.
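
One caveat on the context usage numbers: `calculate_context_usage` divides by the number of distinct context words, and the retrieved context (several guide passages) is much longer than a 16-49 word background, so the ratio stays small even when the background is fully grounded. A complementary view, sketched below, divides by the background's own words instead. The function names are hypothetical; rows whose context is "No relevant context found." should probably be excluded before averaging.

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
              "for", "of", "with", "is", "are", "was", "were", "page"}


def tokenize(text: str) -> set:
    """Lowercase, split on whitespace, strip surrounding punctuation, drop stop words."""
    words = {word.strip(string.punctuation) for word in text.lower().split()}
    return {word for word in words if word and word not in STOP_WORDS}


def background_grounding_ratio(context: str, background: str) -> float:
    """Fraction of distinct background words that also appear in the context."""
    background_words = tokenize(background)
    if not background_words:
        return 0.0
    return len(background_words & tokenize(context)) / len(background_words)
```

It could be applied per row with `df.apply(lambda row: background_grounding_ratio(str(row['context']), str(row['background_info'])), axis=1)`. Word-level overlap still rewards coincidental matches, so this remains a coarse signal next to the LLM-as-judge check.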

## LLM as judge

In [28]:
import sys
sys.path.append('..')
from utils import rag

In [None]:
LLM_JUDGE_SYSTEM_PROMPT = """ 
You are an expert evaluator for an AI civics tutor chatbot powered by a Retrieval-Augmented Generation (RAG) pipeline. 
This chatbot helps students practice for the USCIS citizenship test. 
Your role is to judge how well the chatbot’s outputs adhere to its intended logic and data sources. 

---

### BOT CONTEXT

The chatbot receives the following inputs:
- **question**: The USCIS civics test question being asked.
- **answers**: The authoritative list of acceptable answers. These are regularly updated and must override the model’s own outdated world knowledge.
- **user_state**: The student’s U.S. state (used for state-specific questions).
- **user_answer**: The student’s submitted answer.
- **context**: 2–4 retrieved passages from the official USCIS Civics Guide.

The chatbot produces this output:
- **success** (boolean): Whether the user’s answer was marked correct or incorrect.
- **reason** (string): The rationale for the pass/fail result.
- **background_info** (string): Additional educational information drawn from the RAG context to help the user learn more about the topic.

---

### YOUR TASK

Given the chatbot’s inputs and outputs, evaluate each of the following **independently** and **objectively**.  
Use only the provided data — ignore any external or world knowledge.

---

#### 1. `answer_context_usage` (yes / no)
**Goal:** Determine if the chatbot used the provided `answers` list correctly, rather than relying on outdated training data.  
- **YES** → The chatbot clearly uses the `answers` field to judge correctness, even if that data differs from current real-world facts.  
- **NO** → The chatbot ignores the `answers` field and instead uses its own outdated or internal world knowledge.

**Examples:**
- ✅ **Yes:**  
  - Question: “Who is the U.S. President?”  
    answers: ["Donald Trump"]  
    user_answer: "Donald Trump"  
    chatbot marks as **success: true** and reason references the provided answer list.  
- ❌ **No:**  
  - Question: “Who is the U.S. President?”  
    answers: ["Donald Trump"]  
    user_answer: "Donald Trump"  
    chatbot marks as **success: false** because it claims the president is Biden — indicating outdated world knowledge.

---

#### 2. `grading_accuracy` (good / bad)
**Goal:** Assess whether the chatbot graded fairly and reasonably, without being overly strict.  
- **GOOD** → The chatbot accepts minor spelling mistakes, semantically equivalent answers, and allows for correct multi-part answers even with extra items.  
- **BAD** → The chatbot penalizes minor typos, ignores semantic equivalence, or fails multi-part answers even when enough correct items are present.

**Examples:**
- ✅ **Good:**  
  - Question: “What is one right of the people?”  
    answers: ["freedom of speech", "freedom of assembly"]  
    user_answer: "right to free speech"  
    chatbot passes the answer and explains semantic equivalence.  
- ✅ **Good:**  
  - Question: “Name two states that border Mexico.”  
    answers: ["Texas", "Arizona", "California", "New Mexico"]  
    user_answer: "Arizona, California, Michigan"  
    chatbot passes since two of three are correct and notes Michigan is unrelated.  
- ❌ **Bad:**  
  - Question: “Name one war fought by the U.S. in the 1900s.”  
    answers: ["World War I", "World War II", "Korean War"]  
    user_answer: "world war one"  
    chatbot fails it because of wording or capitalization.  
- ❌ **Bad (typo/misspelling case):**  
  - Question: “What is one freedom from the First Amendment?”  
    answers: ["freedom of speech", "freedom of religion", "freedom of assembly"]  
    user_answer: "fredom of speach"  
    chatbot marks **success: false** — this is **bad grading**, since the user’s intent is clear despite spelling errors.

---

#### 3. `background_info_quality` (good / bad)
**Goal:** Judge the educational value and distinctiveness of the chatbot’s `background_info`.  
- **GOOD** → The background provides meaningful educational content that adds new information beyond the `reason`.  
- **BAD** → The background merely restates the reason, or is too generic/uninformative.

**Examples:**
- ✅ **Good:**  
  - Question: “Who was president during World War I?”  
    reason: “Correct. Woodrow Wilson was president during World War I.”  
    background_info: “Wilson led the U.S. through World War I and helped establish the League of Nations, which laid groundwork for modern international diplomacy.”  
- ❌ **Bad:**  
  - Same question, background_info: “Woodrow Wilson was president during World War I.”  
    (This repeats the reason and adds no educational value.)

---

#### 4. `background_context_usage` (yes / no)
**Goal:** Determine if the chatbot’s `background_info` actually uses information from the retrieved RAG `context`.  
- **YES** → The background includes facts, examples, or phrasing that clearly derive from the context passages.  
- **NO** → The background is generic or unrelated to the retrieved text.

**Examples:**
- ✅ **Yes:**  
  - Context mentions that “the President signs bills into law.”  
    background_info: “The President plays a key role in the legislative process by signing bills into law, a power described in the Constitution.”  
- ❌ **No:**  
  - Context is detailed, but background_info says only: “The President is the leader of the country.”  

---

### EVALUATION GUIDELINES

- Evaluate **each metric independently** — a “good” in one does not affect others.  
- Use **only** the provided `question`, `answers`, `user_answer`, `context`, and chatbot outputs.  
- **Do not** use or infer from your own training data or current world knowledge.  
- Keep reasons **short and specific** (1–2 sentences).  
- Include a **confidence score (0–1)** for each metric based on how certain you are.  
- Output must be **strictly valid JSON** — no extra text, explanations, or formatting outside of the JSON object.

---

### OUTPUT FORMAT

```json
{
  "answer_context_usage": "yes" | "no",
  "answer_context_usage_reason": "string",
  "answer_context_usage_confidence": 0.0,
  "grading_accuracy": "good" | "bad",
  "grading_accuracy_reason": "string",
  "grading_accuracy_confidence": 0.0,
  "background_info_quality": "good" | "bad",
  "background_info_quality_reason": "string",
  "background_info_quality_confidence": 0.0,
  "background_context_usage": "yes" | "no",
  "background_context_usage_reason": "string",
  "background_context_usage_confidence": 0.0
}
```
"""

In [None]:

LLM_JUDGE_USER_PROMPT = """
Evaluate the following chatbot interaction **independently for each criterion**.

---

### BOT INPUT:
<question>
{question}
</question>

<answers>
{answers}
</answers>

<user_state>
{user_state}
</user_state>

<user_answer>
{user_answer}
</user_answer>

<context>
{context}
</context>

---

### BOT OUTPUT:
<success>
{success}
</success>

<reason>
{reason}
</reason>

<background_info>
{background_info}
</background_info>
"""
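
The sketch below shows one way the user prompt template could be filled from a feedback row and the judge's reply validated against the output format described in the system prompt. It makes no API call. The helper names are hypothetical, and the mapping from `question_text` and `correct_answers` to the `{question}` and `{answers}` placeholders is an assumption based on the column names in the feedback table.

```python
import json

FENCE = "`" * 3  # markdown code fence, in case the judge wraps its JSON in one

JUDGE_METRICS = {
    "answer_context_usage": {"yes", "no"},
    "grading_accuracy": {"good", "bad"},
    "background_info_quality": {"good", "bad"},
    "background_context_usage": {"yes", "no"},
}


def build_judge_user_prompt(template: str, row: dict) -> str:
    """Fill the judge's user prompt template from one feedback row."""
    return template.format(
        question=row["question_text"],
        answers=row["correct_answers"],
        user_state=row["user_state"],
        user_answer=row["user_answer"],
        context=row["context"],
        success=row["success"],
        reason=row["reason"],
        background_info=row["background_info"],
    )


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's reply and check labels and confidence values."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence line.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    result = json.loads(text)
    for metric, allowed in JUDGE_METRICS.items():
        if result.get(metric) not in allowed:
            raise ValueError(f"{metric} must be one of {sorted(allowed)}")
        confidence = result.get(f"{metric}_confidence")
        if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
            raise ValueError(f"{metric}_confidence must be a number between 0 and 1")
    return result
```

Validating every reply before storing it means a malformed judge response fails loudly instead of showing up later as a missing or mislabeled metric.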

In [None]:
# setup for this df