# Evaluation: TechGadgets Support Bot

This notebook runs the base model and the fine-tuned model on the same test cases and compares their scores:
1. Load test cases
2. Test both models on all test cases
3. Score responses based on evaluation criteria
4. Generate comparison table and analysis

## Step 1: Setup and Load Test Cases

In [9]:
import os
import sys
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv
import json

# Add data directory to path to import test cases
sys.path.append("../data")

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Model configuration
BASE_MODEL = "gpt-4o-mini-2024-07-18"
# Set this to your fine-tuned model ID after training completes
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:personal:techgadgets-support:D7e2cpyK"  # Replace with actual ID

# Load test cases
from test_cases import test_cases

print(f"✅ Loaded {len(test_cases)} test cases")
print(f"Base model: {BASE_MODEL}")
print(f"Fine-tuned model: {FINE_TUNED_MODEL}")
print("\nTest cases:")
for i, test_case in enumerate(test_cases, 1):
    print(f"  {i}. {test_case}")

✅ Loaded 10 test cases
Base model: gpt-4o-mini-2024-07-18
Fine-tuned model: ft:gpt-4o-mini-2024-07-18:personal:techgadgets-support:D7e2cpyK

Test cases:
  1. What is your return policy?
  2. Do you offer express shipping?
  3. How can I contact customer support?
  4. My order hasn't arrived yet and it's been over a week. What should I do?
  5. Can I return an item I bought last month if I still have the receipt?
  6. I need to change my shipping address for an order that's already been placed. Is that possible?
  7. Do you price match with other electronics retailers?
  8. What kind of warranty do your products come with?
  9. I accidentally ordered the wrong item. Can I exchange it instead of returning?
  10. I want to cancel my current order and also ask about your refund policy for future purchases


## Step 2: Evaluation Functions

In [10]:
def get_response(model, user_message, system_message=None):
    """Get response from a model"""
    if system_message is None:
        system_message = "You are a helpful customer support assistant."
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            temperature=0.7,
            max_tokens=500
        )
        return response.choices[0].message.content
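    except Exception as e:
        # Note: the error text is returned as the "response", so a failed call
        # is still scored like any other reply by the functions below
        return f"Error: {e}"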

def check_brand_mention(response):
    """Check if TechGadgets is mentioned (0 or 1)"""
    response_lower = response.lower()
    return 1 if "techgadgets" in response_lower else 0

def check_policy_accuracy(question, response):
    """Check if relevant policies are mentioned correctly (0-1)"""
    question_lower = question.lower()
    response_lower = response.lower()
    
    relevant_policies = []
    policy_scores = []
    
    # Return policy
    if any(kw in question_lower for kw in ["return", "refund", "money back", "send back"]):
        relevant_policies.append("return")
        if "30-day" in response_lower and ("money-back" in response_lower or "guarantee" in response_lower):
            policy_scores.append(1.0)
        elif "30" in response_lower and "day" in response_lower:
            policy_scores.append(0.7)
        else:
            policy_scores.append(0.0)
    
    # Shipping policy
    if any(kw in question_lower for kw in ["shipping", "delivery", "arrive", "when will", "how long"]):
        relevant_policies.append("shipping")
        if "3-5" in response_lower or "3 to 5" in response_lower:
            policy_scores.append(1.0)
        elif "business days" in response_lower:
            policy_scores.append(0.7)
        else:
            policy_scores.append(0.0)
        
        if "express" in question_lower or "2-day" in question_lower:
            relevant_policies.append("express_shipping")
            if "$9.99" in response_lower or "9.99" in response_lower:
                policy_scores.append(1.0)
            elif "express" in response_lower:
                policy_scores.append(0.5)
            else:
                policy_scores.append(0.0)
    
    # Support hours
    if any(kw in question_lower for kw in ["support", "help", "contact", "customer service", "phone", "chat"]):
        relevant_policies.append("support")
        if "24/7" in response_lower or "24 hours" in response_lower:
            policy_scores.append(1.0)
        elif "chat" in response_lower or "phone" in response_lower:
            policy_scores.append(0.5)
        else:
            policy_scores.append(0.0)
    
    # Warranty
    if any(kw in question_lower for kw in ["warranty", "guarantee", "coverage"]):
        relevant_policies.append("warranty")
        if "1-year" in response_lower or "one year" in response_lower:
            policy_scores.append(1.0)
        elif "warranty" in response_lower:
            policy_scores.append(0.5)
        else:
            policy_scores.append(0.0)
    
    # Price match
    if any(kw in question_lower for kw in ["price", "cost", "match", "cheaper", "competitor"]):
        relevant_policies.append("price_match")
        if "price match" in response_lower or "match competitor" in response_lower:
            policy_scores.append(1.0)
        elif "price" in response_lower:
            policy_scores.append(0.5)
        else:
            policy_scores.append(0.0)
    
    if not relevant_policies:
        return 1.0  # No specific policy needed
    
    return sum(policy_scores) / len(policy_scores) if policy_scores else 0.0

def check_tone_professionalism(response):
    """Check if tone is professional and friendly (0-1)"""
    response_lower = response.lower()
    
    # Positive indicators
    professional_words = ["appreciate", "understand", "assist", "help", "glad", "happy", "pleased", "thank"]
    positive_count = sum(1 for word in professional_words if word in response_lower)
    
    # Negative indicators
    unprofessional_words = ["whatever", "deal with it", "figure it out", "not my problem"]
    negative_count = sum(1 for phrase in unprofessional_words if phrase in response_lower)
    
    # Base score
    score = 0.7
    
    # Add points for professional language
    score += min(0.3, positive_count * 0.05)
    
    # Subtract points for unprofessional language
    score -= negative_count * 0.3
    
    # Check for empathy/acknowledgment
    if any(phrase in response_lower for phrase in ["i understand", "i realize", "i can help", "i'm here"]):
        score += 0.1
    
    return max(0.0, min(1.0, score))

def check_format_consistency(response):
    """Check if response has consistent format (0-1)"""
    response_lower = response.lower()
    
    score = 0.5  # Base score
    
    # Check for structure: a line starting with "1.", "2." or "3." counts as a numbered step
    if any(line.lstrip().startswith(("1.", "2.", "3.")) for line in response.splitlines()):
        score += 0.2  # Has numbered steps
    
    # Check for actionable content
    if any(word in response_lower for word in ["can", "will", "should", "please", "contact", "visit"]):
        score += 0.2
    
    # Check for closing/contact info when appropriate
    if "contact" in response_lower or "support" in response_lower or "help" in response_lower:
        score += 0.1
    
    return min(1.0, score)

def score_response(question, response):
    """Score a response based on all criteria"""
    scores = {
        "brand_mention": check_brand_mention(response),
        "policy_accuracy": check_policy_accuracy(question, response),
        "tone_professionalism": check_tone_professionalism(response),
        "format_consistency": check_format_consistency(response)
    }
    
    # Calculate overall score (weighted average)
    scores["overall"] = (
        scores["brand_mention"] * 0.3 +
        scores["policy_accuracy"] * 0.3 +
        scores["tone_professionalism"] * 0.2 +
        scores["format_consistency"] * 0.2
    )
    
    return scores

print("✅ Evaluation functions defined")

✅ Evaluation functions defined
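
To make the weighting concrete, the sketch below recomputes the overall score for test case 1 from the per-criterion scores reported in the Step 5 comparison table. `WEIGHTS` and `overall` are illustrative names and not part of the notebook; the weights themselves are copied from `score_response`.

```python
# Weights copied from score_response in the cell above.
WEIGHTS = {
    "brand_mention": 0.3,
    "policy_accuracy": 0.3,
    "tone_professionalism": 0.2,
    "format_consistency": 0.2,
}

def overall(scores):
    """Weighted average of the four criterion scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Criterion scores for test case 1, as reported in the Step 5 comparison table.
base_case_1 = {
    "brand_mention": 0,
    "policy_accuracy": 0.70,
    "tone_professionalism": 0.70,
    "format_consistency": 0.80,
}
ft_case_1 = {
    "brand_mention": 1,
    "policy_accuracy": 0.70,
    "tone_professionalism": 1.00,
    "format_consistency": 0.80,
}

print(f"Base overall:       {overall(base_case_1):.2f}")
print(f"Fine-tuned overall: {overall(ft_case_1):.2f}")
```

Because brand mention is scored as 0 or 1 and carries 30% of the weight, a response that never names TechGadgets cannot score above 0.70 overall, whatever its policy, tone, and format scores are.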


## Step 3: Test Base Model

In [11]:
# System message for base model (generic)
base_system_message = "You are a helpful customer support assistant."

print("Testing base model on all test cases...")
base_results = []

for i, test_case in enumerate(test_cases, 1):
    print(f"\n[{i}/{len(test_cases)}] Testing: {test_case[:50]}...")
    
    response = get_response(BASE_MODEL, test_case, base_system_message)
    scores = score_response(test_case, response)
    
    base_results.append({
        "question": test_case,
        "response": response,
        "brand_mention": scores["brand_mention"],
        "policy_accuracy": scores["policy_accuracy"],
        "tone_professionalism": scores["tone_professionalism"],
        "format_consistency": scores["format_consistency"],
        "overall_score": scores["overall"]
    })
    
    print(f"  Brand mention: {scores['brand_mention']:.1f}")
    print(f"  Policy accuracy: {scores['policy_accuracy']:.2f}")
    print(f"  Overall score: {scores['overall']:.2f}")

print("\n✅ Base model testing complete!")

Testing base model on all test cases...

[1/10] Testing: What is your return policy?...
  Brand mention: 0.0
  Policy accuracy: 0.70
  Overall score: 0.51

[2/10] Testing: Do you offer express shipping?...
  Brand mention: 0.0
  Policy accuracy: 0.25
  Overall score: 0.38

[3/10] Testing: How can I contact customer support?...
  Brand mention: 0.0
  Policy accuracy: 0.50
  Overall score: 0.48

[4/10] Testing: My order hasn't arrived yet and it's been over a w...
  Brand mention: 0.0
  Policy accuracy: 0.00
  Overall score: 0.31

[5/10] Testing: Can I return an item I bought last month if I stil...
  Brand mention: 0.0
  Policy accuracy: 0.00
  Overall score: 0.30

[6/10] Testing: I need to change my shipping address for an order ...
  Brand mention: 0.0
  Policy accuracy: 0.00
  Overall score: 0.31

[7/10] Testing: Do you price match with other electronics retailer...
  Brand mention: 0.0
  Policy accuracy: 1.00
  Overall score: 0.63

[8/10] Testing: What kind of warranty do your produ

## Step 4: Test Fine-Tuned Model

**Note**: Make sure to update `FINE_TUNED_MODEL` with your actual fine-tuned model ID before running this cell.
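
A small guard can catch a forgotten placeholder before any API calls are made. This is only a sketch: `looks_like_fine_tuned_id` is a hypothetical helper, and it assumes fine-tuned IDs take the `ft:<base model>:...` shape of the example ID in Step 1.

```python
def looks_like_fine_tuned_id(model_id: str, base_model: str) -> bool:
    """Return True if model_id starts with 'ft:<base_model>:' (assumed ID shape)."""
    return model_id.startswith(f"ft:{base_model}:")

# Values copied from the Step 1 configuration cell.
BASE_MODEL = "gpt-4o-mini-2024-07-18"
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:personal:techgadgets-support:D7e2cpyK"

if not looks_like_fine_tuned_id(FINE_TUNED_MODEL, BASE_MODEL):
    raise ValueError(
        f"FINE_TUNED_MODEL does not look like a fine-tuned ID: {FINE_TUNED_MODEL}"
    )
print("Fine-tuned model ID has the expected prefix")
```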

In [12]:
# System message for fine-tuned model (must match training data exactly)
ft_system_message = (
    "You are a helpful customer support assistant for "
    "TechGadgets, an online electronics store. "
    "Always be friendly and professional. "
    "Always mention TechGadgets in your responses. "
    "Use company policies: 30-day money-back guarantee, "
    "standard shipping 3-5 business days, express 2-day shipping for $9.99, "
    "24/7 chat support, Mon-Fri 9AM-6PM phone support, "
    "1-year manufacturer warranty, and price matching."
)

print("Testing fine-tuned model on all test cases...")
ft_results = []

for i, test_case in enumerate(test_cases, 1):
    print(f"\n[{i}/{len(test_cases)}] Testing: {test_case[:50]}...")
    
    response = get_response(FINE_TUNED_MODEL, test_case, ft_system_message)
    scores = score_response(test_case, response)
    
    ft_results.append({
        "question": test_case,
        "response": response,
        "brand_mention": scores["brand_mention"],
        "policy_accuracy": scores["policy_accuracy"],
        "tone_professionalism": scores["tone_professionalism"],
        "format_consistency": scores["format_consistency"],
        "overall_score": scores["overall"]
    })
    
    print(f"  Brand mention: {scores['brand_mention']:.1f}")
    print(f"  Policy accuracy: {scores['policy_accuracy']:.2f}")
    print(f"  Overall score: {scores['overall']:.2f}")

print("\n✅ Fine-tuned model testing complete!")

Testing fine-tuned model on all test cases...

[1/10] Testing: What is your return policy?...
  Brand mention: 1.0
  Policy accuracy: 0.70
  Overall score: 0.87

[2/10] Testing: Do you offer express shipping?...
  Brand mention: 1.0
  Policy accuracy: 0.25
  Overall score: 0.71

[3/10] Testing: How can I contact customer support?...
  Brand mention: 1.0
  Policy accuracy: 1.00
  Overall score: 0.96

[4/10] Testing: My order hasn't arrived yet and it's been over a w...
  Brand mention: 1.0
  Policy accuracy: 0.00
  Overall score: 0.63

[5/10] Testing: Can I return an item I bought last month if I stil...
  Brand mention: 1.0
  Policy accuracy: 0.70
  Overall score: 0.81

[6/10] Testing: I need to change my shipping address for an order ...
  Brand mention: 1.0
  Policy accuracy: 0.00
  Overall score: 0.63

[7/10] Testing: Do you price match with other electronics retailer...
  Brand mention: 1.0
  Policy accuracy: 1.00
  Overall score: 0.91

[8/10] Testing: What kind of warranty do your

## Step 5: Create Comparison Table

In [13]:
# Create comparison DataFrame
comparison_data = []

for i, (base, ft) in enumerate(zip(base_results, ft_results), 1):
    comparison_data.append({
        "Test Case #": i,
        "Question": base["question"],
        "Base - Brand Mention": base["brand_mention"],
        "FT - Brand Mention": ft["brand_mention"],
        "Base - Policy Accuracy": f"{base['policy_accuracy']:.2f}",
        "FT - Policy Accuracy": f"{ft['policy_accuracy']:.2f}",
        "Base - Tone": f"{base['tone_professionalism']:.2f}",
        "FT - Tone": f"{ft['tone_professionalism']:.2f}",
        "Base - Format": f"{base['format_consistency']:.2f}",
        "FT - Format": f"{ft['format_consistency']:.2f}",
        "Base - Overall": f"{base['overall_score']:.2f}",
        "FT - Overall": f"{ft['overall_score']:.2f}",
        "Improvement": f"{ft['overall_score'] - base['overall_score']:.2f}"
    })

comparison_df = pd.DataFrame(comparison_data)

# Calculate summary statistics
summary = {
    "Metric": ["Brand Mention", "Policy Accuracy", "Tone Professionalism", "Format Consistency", "Overall Score"],
    "Base Model Avg": [
        sum(r["brand_mention"] for r in base_results) / len(base_results),
        sum(r["policy_accuracy"] for r in base_results) / len(base_results),
        sum(r["tone_professionalism"] for r in base_results) / len(base_results),
        sum(r["format_consistency"] for r in base_results) / len(base_results),
        sum(r["overall_score"] for r in base_results) / len(base_results)
    ],
    "Fine-Tuned Avg": [
        sum(r["brand_mention"] for r in ft_results) / len(ft_results),
        sum(r["policy_accuracy"] for r in ft_results) / len(ft_results),
        sum(r["tone_professionalism"] for r in ft_results) / len(ft_results),
        sum(r["format_consistency"] for r in ft_results) / len(ft_results),
        sum(r["overall_score"] for r in ft_results) / len(ft_results)
    ]
}

summary["Improvement"] = [
    summary["Fine-Tuned Avg"][i] - summary["Base Model Avg"][i]
    for i in range(len(summary["Metric"]))
]

summary_df = pd.DataFrame(summary)

# Display results
print("=" * 80)
print("COMPARISON SUMMARY")
print("=" * 80)
print(summary_df.to_string(index=False))
print("\n" + "=" * 80)
print("DETAILED COMPARISON")
print("=" * 80)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)
print(comparison_df.to_string(index=False))

# Save to CSV
os.makedirs("../results", exist_ok=True)
comparison_df.to_csv("../results/comparison_table.csv", index=False)
summary_df.to_csv("../results/summary_statistics.csv", index=False)

print("\n✅ Comparison tables saved to results/")

COMPARISON SUMMARY
              Metric  Base Model Avg  Fine-Tuned Avg  Improvement
       Brand Mention          0.0000          1.0000        1.000
     Policy Accuracy          0.2950          0.4150        0.120
Tone Professionalism          0.7650          0.8550        0.090
  Format Consistency          0.7800          0.7700       -0.010
       Overall Score          0.3975          0.7495        0.352

DETAILED COMPARISON
 Test Case #                                                                                        Question  Base - Brand Mention  FT - Brand Mention Base - Policy Accuracy FT - Policy Accuracy Base - Tone FT - Tone Base - Format FT - Format Base - Overall FT - Overall Improvement
           1                                                                     What is your return policy?                     0                   1                   0.70                 0.70        0.70      1.00          0.80        0.80           0.51         0.87        0.3
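
The summary figures can be cross-checked by hand. Since the overall score is a fixed weighted sum of the four criteria, the average overall score should equal the same weighted sum of the per-criterion averages. The sketch below applies the `score_response` weights to the averages printed in the summary; the variable and function names are illustrative.

```python
# Weights from score_response, in the order: brand, policy, tone, format.
weights = [0.3, 0.3, 0.2, 0.2]

# Per-criterion averages copied from the COMPARISON SUMMARY output.
base_avgs = [0.0000, 0.2950, 0.7650, 0.7800]
ft_avgs = [1.0000, 0.4150, 0.8550, 0.7700]

def weighted_sum(avgs):
    """Apply the criterion weights to a list of per-criterion averages."""
    return sum(w * a for w, a in zip(weights, avgs))

print(f"Base overall:       {weighted_sum(base_avgs):.4f}")  # reported: 0.3975
print(f"Fine-tuned overall: {weighted_sum(ft_avgs):.4f}")    # reported: 0.7495
```

Both values agree with the reported averages. The same arithmetic shows that 0.30 of the 0.352 overall improvement comes from the brand-mention term alone (0.3 × 1.0), with policy accuracy, tone, and format contributing the remaining 0.052 between them.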

## Step 6: Generate HTML Report

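The cell below builds a single-file HTML report comparing the base and fine-tuned models and saves it to `../results/evaluation_report.html` (relative to this notebook). All styles are inline; the page requests the DM Sans and JetBrains Mono web fonts from Google Fonts and falls back to system fonts when offline. Open the file in a browser to view the results.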
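
Model responses are inserted into the page as text, so they are escaped first. A minimal illustration of what the standard library's `html.escape` does to markup characters (the sample string is made up for this example):

```python
import html

# A made-up string containing characters that are special in HTML.
sample = '<b>Save 10%</b> on "TechGadgets" & more'
escaped = html.escape(sample)

print(escaped)
# &lt;b&gt;Save 10%&lt;/b&gt; on &quot;TechGadgets&quot; &amp; more
```

Escaping lengthens the string (a single `&` becomes the five characters `&amp;`), so any length-based truncation is safer when applied to the raw text before escaping.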

In [15]:
import html

def escape_html(s):
    return html.escape(str(s)) if s is not None else ""

# Build summary metrics HTML
summary_rows = ""
for _, row in summary_df.iterrows():
    metric = escape_html(row["Metric"])
    base_val = float(row["Base Model Avg"])
    ft_val = float(row["Fine-Tuned Avg"])
    imp = float(row["Improvement"])
    imp_class = "positive" if imp >= 0 else "negative"
    imp_sign = "+" if imp >= 0 else ""
    summary_rows += f"""
    <tr>
      <td class="metric-name">{metric}</td>
      <td><div class="score-bar"><span class="fill base" style="width:{min(100, base_val*100)}%"></span><span class="score-num">{base_val:.2f}</span></div></td>
      <td><div class="score-bar"><span class="fill ft" style="width:{min(100, ft_val*100)}%"></span><span class="score-num">{ft_val:.2f}</span></div></td>
      <td class="improve {imp_class}">{imp_sign}{imp:.2f}</td>
    </tr>"""

# Build detailed comparison table rows
detail_rows = ""
for _, row in comparison_df.iterrows():
    q = escape_html(row["Question"])
    if len(q) > 80:
        q_short = q[:77] + "..."
    else:
        q_short = q
    detail_rows += f"""
    <tr>
      <td>{int(row['Test Case #'])}</td>
      <td class="question" title="{q}">{q_short}</td>
      <td>{int(row['Base - Brand Mention'])}</td>
      <td>{int(row['FT - Brand Mention'])}</td>
      <td>{row['Base - Policy Accuracy']}</td>
      <td>{row['FT - Policy Accuracy']}</td>
      <td>{row['Base - Overall']}</td>
      <td>{row['FT - Overall']}</td>
      <td class="improve {'positive' if float(row['Improvement']) >= 0 else 'negative'}">{row['Improvement']}</td>
    </tr>"""

# Build sample response comparison cards (first 3)
sample_cards = ""
for i in range(min(3, len(test_cases))):
    q = escape_html(test_cases[i])
    base_raw = base_results[i]["response"]
    ft_raw = ft_results[i]["response"]
    # Truncate the raw text before escaping so the cut cannot land inside an entity or a <br> tag
    base_resp = escape_html(base_raw[:500]).replace("\n", "<br>") + ("..." if len(base_raw) > 500 else "")
    ft_resp = escape_html(ft_raw[:500]).replace("\n", "<br>") + ("..." if len(ft_raw) > 500 else "")
    base_score = base_results[i]["overall_score"]
    ft_score = ft_results[i]["overall_score"]
    sample_cards += f"""
    <div class="sample-card">
      <h3 class="sample-question">Test {i+1}: {q}</h3>
      <div class="two-col">
        <div class="response-box base">
          <h4>Base model</h4>
          <p class="score">Score: {base_score:.2f}</p>
          <div class="response-text">{base_resp}</div>
        </div>
        <div class="response-box ft">
          <h4>Fine-tuned model</h4>
          <p class="score">Score: {ft_score:.2f}</p>
          <div class="response-text">{ft_resp}</div>
        </div>
      </div>
    </div>"""

overall_base = summary_df[summary_df["Metric"] == "Overall Score"]["Base Model Avg"].values[0]
overall_ft = summary_df[summary_df["Metric"] == "Overall Score"]["Fine-Tuned Avg"].values[0]
overall_imp = overall_ft - overall_base

html_content = f"""<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>TechGadgets Support Bot — Evaluation Report</title>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
  <link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,opsz,wght@0,9..40,400;0,9..40,600;0,9..40,700;1,9..40,400&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
  <style>
    :root {{
      --bg: #0f0f12;
      --surface: #18181c;
      --surface2: #1f1f24;
      --text: #e4e4e7;
      --text-muted: #a1a1aa;
      --accent: #22d3ee;
      --accent-dim: #0891b2;
      --base-color: #71717a;
      --ft-color: #34d399;
      --positive: #34d399;
      --negative: #f87171;
      --border: #27272a;
    }}
    * {{ box-sizing: border-box; }}
    body {{
      font-family: 'DM Sans', system-ui, sans-serif;
      background: var(--bg);
      color: var(--text);
      line-height: 1.6;
      margin: 0;
      padding: 2rem;
      min-height: 100vh;
    }}
    .container {{ max-width: 1200px; margin: 0 auto; }}
    h1 {{
      font-size: 1.75rem;
      font-weight: 700;
      margin: 0 0 0.25rem 0;
      letter-spacing: -0.02em;
    }}
    .subtitle {{
      color: var(--text-muted);
      font-size: 0.95rem;
      margin-bottom: 2rem;
    }}
    .hero {{
      background: linear-gradient(135deg, var(--surface) 0%, var(--surface2) 100%);
      border: 1px solid var(--border);
      border-radius: 12px;
      padding: 1.5rem 2rem;
      margin-bottom: 2rem;
      display: flex;
      align-items: center;
      justify-content: space-between;
      flex-wrap: wrap;
      gap: 1rem;
    }}
    .hero-models {{
      font-family: 'JetBrains Mono', monospace;
      font-size: 0.8rem;
      color: var(--text-muted);
    }}
    .big-score {{
      text-align: right;
    }}
    .big-score .label {{ font-size: 0.8rem; color: var(--text-muted); text-transform: uppercase; letter-spacing: 0.05em; }}
    .big-score .value {{ font-size: 2rem; font-weight: 700; }}
    .big-score .base .value {{ color: var(--base-color); }}
    .big-score .ft .value {{ color: var(--ft-color); }}
    .improve.positive {{ color: var(--positive); }}
    .improve.negative {{ color: var(--negative); }}
    section {{
      margin-bottom: 2.5rem;
    }}
    section h2 {{
      font-size: 1.15rem;
      font-weight: 600;
      margin-bottom: 1rem;
      color: var(--text);
      border-bottom: 1px solid var(--border);
      padding-bottom: 0.5rem;
    }}
    table {{
      width: 100%;
      border-collapse: collapse;
      font-size: 0.9rem;
    }}
    th, td {{
      padding: 0.6rem 0.75rem;
      text-align: left;
      border-bottom: 1px solid var(--border);
    }}
    th {{
      color: var(--text-muted);
      font-weight: 600;
      font-size: 0.75rem;
      text-transform: uppercase;
      letter-spacing: 0.04em;
    }}
    td.metric-name {{ font-weight: 500; }}
    td.question {{ max-width: 280px; }}
    .score-bar {{
      position: relative;
      height: 22px;
      background: var(--surface2);
      border-radius: 4px;
      overflow: hidden;
    }}
    .score-bar .fill {{
      position: absolute;
      left: 0;
      top: 0;
      bottom: 0;
      border-radius: 4px;
      min-width: 2px;
    }}
    .score-bar .fill.base {{ background: var(--base-color); }}
    .score-bar .fill.ft {{ background: var(--ft-color); }}
    .score-bar .score-num {{
      position: relative;
      z-index: 1;
      font-family: 'JetBrains Mono', monospace;
      font-size: 0.75rem;
      padding-left: 6px;
      line-height: 22px;
      color: var(--text);
    }}
    .sample-card {{
      background: var(--surface);
      border: 1px solid var(--border);
      border-radius: 10px;
      padding: 1.25rem;
      margin-bottom: 1.25rem;
    }}
    .sample-question {{
      font-size: 1rem;
      font-weight: 600;
      margin: 0 0 1rem 0;
      color: var(--accent);
    }}
    .two-col {{
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 1rem;
    }}
    @media (max-width: 768px) {{ .two-col {{ grid-template-columns: 1fr; }} }}
    .response-box {{
      padding: 1rem;
      border-radius: 8px;
      border: 1px solid var(--border);
    }}
    .response-box.base {{ background: rgba(113,113,122,0.08); border-color: var(--base-color); }}
    .response-box.ft {{ background: rgba(52,211,153,0.08); border-color: var(--ft-color); }}
    .response-box h4 {{ margin: 0 0 0.5rem 0; font-size: 0.85rem; color: var(--text-muted); }}
    .response-box .score {{ font-family: 'JetBrains Mono', monospace; font-size: 0.8rem; margin: 0 0 0.5rem 0; }}
    .response-text {{ font-size: 0.85rem; color: var(--text); max-height: 180px; overflow-y: auto; }}
    footer {{ margin-top: 2rem; padding-top: 1rem; border-top: 1px solid var(--border); font-size: 0.8rem; color: var(--text-muted); }}
  </style>
</head>
<body>
  <div class="container">
    <h1>TechGadgets Support Bot — Evaluation Report</h1>
    <p class="subtitle">Base model vs fine-tuned model comparison</p>

    <div class="hero">
      <div class="hero-models">
        <div>Base: {escape_html(BASE_MODEL)}</div>
        <div>Fine-tuned: {escape_html(FINE_TUNED_MODEL)}</div>
      </div>
      <div class="big-score">
        <div class="base"><span class="label">Base avg</span><div class="value">{overall_base:.2f}</div></div>
        <div class="ft"><span class="label">Fine-tuned avg</span><div class="value">{overall_ft:.2f}</div></div>
        <div class="improve {'positive' if overall_imp >= 0 else 'negative'}"><span class="label">Improvement</span><div class="value">{'+' if overall_imp >= 0 else ''}{overall_imp:.2f}</div></div>
      </div>
    </div>

    <section>
      <h2>Summary by metric</h2>
      <table>
        <thead>
          <tr>
            <th>Metric</th>
            <th>Base model</th>
            <th>Fine-tuned</th>
            <th>Improvement</th>
          </tr>
        </thead>
        <tbody>
          {summary_rows}
        </tbody>
      </table>
    </section>

    <section>
      <h2>Detailed comparison (all test cases)</h2>
      <table>
        <thead>
          <tr>
            <th>#</th>
            <th>Question</th>
            <th>Base brand</th>
            <th>FT brand</th>
            <th>Base policy</th>
            <th>FT policy</th>
            <th>Base overall</th>
            <th>FT overall</th>
            <th>Improvement</th>
          </tr>
        </thead>
        <tbody>
          {detail_rows}
        </tbody>
      </table>
    </section>

    <section>
      <h2>Sample responses (first 3 test cases)</h2>
      {sample_cards}
    </section>

    <footer>
      Generated from notebook 03_evaluation. Base model: {escape_html(BASE_MODEL)} · Fine-tuned: {escape_html(FINE_TUNED_MODEL)}
    </footer>
  </div>
</body>
</html>"""

os.makedirs("../results", exist_ok=True)
report_path = "../results/evaluation_report.html"
with open(report_path, "w", encoding="utf-8") as f:
    f.write(html_content)

print(f"✅ HTML report saved to {report_path}")
print("   Open this file in a browser to view the comparison.")

✅ HTML report saved to ../results/evaluation_report.html
   Open this file in a browser to view the comparison.


In [16]:
# Display side-by-side comparison of a few examples
print("Sample Response Comparisons:\n")
print("=" * 80)

for i in range(min(3, len(test_cases))):
    print(f"\nTest Case {i+1}: {test_cases[i]}")
    print("-" * 80)
    print("BASE MODEL:")
    print(base_results[i]["response"][:300] + "..." if len(base_results[i]["response"]) > 300 else base_results[i]["response"])
    print(f"\nScore: {base_results[i]['overall_score']:.2f}")
    print("\n" + "-" * 80)
    print("FINE-TUNED MODEL:")
    print(ft_results[i]["response"][:300] + "..." if len(ft_results[i]["response"]) > 300 else ft_results[i]["response"])
    print(f"\nScore: {ft_results[i]['overall_score']:.2f}")
    print("\n" + "=" * 80)

# Save example outputs
with open("../results/example_outputs.txt", "w", encoding="utf-8") as f:
    f.write("TECHGADGETS SUPPORT BOT - EXAMPLE OUTPUTS\n")
    f.write("=" * 80 + "\n\n")
    
    for i in range(len(test_cases)):
        f.write(f"TEST CASE {i+1}\n")
        f.write(f"Question: {test_cases[i]}\n\n")
        f.write("BASE MODEL RESPONSE:\n")
        f.write(base_results[i]["response"] + "\n\n")
        f.write(f"Score: {base_results[i]['overall_score']:.2f}\n\n")
        f.write("-" * 80 + "\n\n")
        f.write("FINE-TUNED MODEL RESPONSE:\n")
        f.write(ft_results[i]["response"] + "\n\n")
        f.write(f"Score: {ft_results[i]['overall_score']:.2f}\n\n")
        f.write("=" * 80 + "\n\n")

print("\n✅ Example outputs saved to results/example_outputs.txt")

Sample Response Comparisons:


Test Case 1: What is your return policy?
--------------------------------------------------------------------------------
BASE MODEL:
Our return policy typically allows customers to return items within a specified time frame (usually 30 days) from the date of purchase, provided that the items are in their original condition and packaging. Please note that certain items, such as personalized or final sale products, may not be eligi...

Score: 0.51

--------------------------------------------------------------------------------
FINE-TUNED MODEL:
Thanks for reaching out! At TechGadgets, I completely understand your curiosity about our return policy. Allow me to provide you with the details. Our return policy allows customers to return their items within 30 days of purchase for a full refund or exchange, as long as the items are in their orig...

Score: 0.87


Test Case 2: Do you offer express shipping?
-------------------------------------------------------