# Model Quality Metrics

## Overview
This notebook evaluates a model's answers to questions about US city demographics, using programmatic checks against ground-truth data and the LLM-as-a-Judge methodology.

## What We'll Cover
1. **Programmatic Testing** - Verify model accuracy against ground truth city data
2. **LLM as a Judge** - Assess response quality, accuracy, and analytical depth
3. **Evaluation Analysis** - Compare performance across question types and complexity levels
4. **Results Visualization** - Present findings and recommendations

## Prerequisites
- AWS account with Bedrock access
- Python 3.10+
- boto3 and pandas libraries

## Setup and Dependencies

In [18]:
import boto3
import json
import pandas as pd
import re
import time
from time import sleep
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Optional, Any
import random


bedrock = boto3.client("bedrock-runtime")


## 1. Load the Dataset

We will use the US Cities Population Dataset, which contains demographic and geographic information for the 314 most populous cities in the United States. Populations range from over 8.4 million (New York City) down to approximately 100,000 residents (Sunrise, FL).

**Dataset features:**
- **city**: Name of the city
- **state**: Two-letter state abbreviation
- **population**: Population count, stored as a comma-formatted string
- **land_area_mi2**: Land area in square miles

**Dataset characteristics:**
- **Size**: 314 cities across all 50 US states plus Washington DC
- **Population range**: 8,478,072 (New York) to 100,128 (Sunrise, FL)
- **Geographic coverage**: Represents major metropolitan areas nationwide
- **Data quality**: Some entries contain footnote references that may need cleaning

Because the dataset includes both population and land area, it supports population density calculations and comparisons between cities. It is a single snapshot, so it cannot by itself show growth trends over time.
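
The cleaning mentioned above might look like the following minimal sketch. It assumes footnote markers take the bracketed form seen in the data (for example `New York[c]`) and that populations are comma-formatted strings. The helper names `clean_city_name`, `parse_population`, and `population_density` are illustrative and not part of the notebook.

```python
import re

# Assumption: footnote markers are short bracketed tokens such as "[c]" or "[12]".
_FOOTNOTE = re.compile(r"\[[^\]]*\]")


def clean_city_name(raw: str) -> str:
    """Strip bracketed footnote markers and surrounding whitespace."""
    return _FOOTNOTE.sub("", raw).strip()


def parse_population(raw) -> int:
    """Convert a comma-formatted population string (or a number) to int."""
    if isinstance(raw, str):
        raw = raw.replace(",", "").strip()
    return int(raw)


def population_density(population: int, land_area_mi2: float) -> float:
    """Return people per square mile."""
    if land_area_mi2 <= 0:
        raise ValueError("land area must be positive")
    return population / land_area_mi2


name = clean_city_name("New York[c]")
pop = parse_population("8,478,072")
print(name, pop, round(population_density(pop, 300.5), 1))
```

With pandas, the same helpers could be applied column-wise, for example `df["city"].map(clean_city_name)` and `df["population"].map(parse_population)`.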





In [19]:
import pandas as pd

# Load the dataset
df = pd.read_csv("./city_pop.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# View first 3 rows with better formatting
print("\nFirst 3 rows:")
for i in range(min(3, len(df))):
    print(f"\n--- Row {i+1} ---")
    print(f"City: {df.iloc[i]['city']}")
    print(f"State: {df.iloc[i]['state']}")
    print(f"Population: {df.iloc[i]['population']}")
    print(f"Land Area (sq mi): {df.iloc[i]['land_area_mi2']}")
    print("-" * 50)

Dataset shape: (346, 4)
Columns: ['city', 'state', 'population', 'land_area_mi2']

First 3 rows:

--- Row 1 ---
City: New York[c]
State: NY
Population: 8,478,072
Land Area (sq mi): 300.5
--------------------------------------------------

--- Row 2 ---
City: Los Angeles
State: CA
Population: 3,878,704
Land Area (sq mi): 469.5
--------------------------------------------------

--- Row 3 ---
City: Chicago
State: IL
Population: 2,721,308
Land Area (sq mi): 227.7
--------------------------------------------------


## 2. Programmatic Model Testing

Now, we will perform programmatic testing on our dataset. We will ask the model a series of questions about specific cities and verify whether the model's responses match the data we have in our dataset. This approach allows us to systematically evaluate the model's accuracy against our ground truth data.


In [20]:

# Define our test questions
test_questions = [
    "What is the land area of New York?",
    "What is the land area of Los Angeles in square miles?",
    "What is the population of Chicago?",
    "What is the land area of Houston?",
    "Which city has a larger population: Phoenix or New York?",
    "What is the land area of San Francisco?",
    "What is the population of Seattle?",
    "What is the total land area of Boston?",
    "What is the land area of Las Vegas?"
]


In [21]:
def bedrock_call(prompt: str, df: pd.DataFrame) -> Dict[str, Any]:
    """Call the model through the Bedrock Converse API and parse its JSON reply."""
    
    structured_prompt = f"""
    You will be asked questions about city populations and land areas.
    
    Answer the following question: {prompt}
    
    For direct questions about population, respond in this JSON format only:
    {{
        "answer": [numerical answer only, no commas or text],
        "city": [city name],
        "metric": "population"
    }}

    For direct questions about land area, respond in this JSON format only:
    {{
        "answer": [numerical answer as decimal, like 46.9],
        "city": [city name],
        "metric": "land_area_mi2"
    }}

    For comparison questions, respond in this JSON format only:
    {{
        "answer": [numerical answer for larger city],
        "city": [name of larger city],
        "metric": [what was compared],
        "comparison": true
    }}

    Respond with the JSON only, no additional text.
    """
    
    response = bedrock.converse(
        # modelId='amazon.nova-micro-v1:0',
        modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
        messages=[
            {
                'role': 'user',
                'content': [
                    {
                        'text': structured_prompt
                    }
                ]
            }
        ],
        inferenceConfig={
            'maxTokens': 300,
            'temperature': 0
        }
    )
    
    response_text = response['output']['message']['content'][0]['text']
    return json.loads(response_text)
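
Calling `json.loads` on the raw reply raises an error if the model wraps its JSON in a code fence or adds a sentence around it. A more forgiving parser could be sketched as follows; `parse_json_response` is a hypothetical helper, and the fallback assumes the reply contains a single JSON object.

```python
import json
import re
from typing import Any, Dict


def parse_json_response(text: str) -> Dict[str, Any]:
    """Parse a JSON object from model output that may include extra text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which also skips code fences.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


# Build a fenced reply without writing a literal fence inside this example.
fence = "`" * 3
fenced = fence + 'json\n{"answer": 46.9, "city": "San Francisco"}\n' + fence
print(parse_json_response(fenced))
```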

Here we set up a function that programmatically verifies whether the model's response matches our dataset.
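
The verification function compares population exactly and land area within 0.1 square miles. Population figures differ between census counts and yearly estimates, so exact equality is a strict bar. One alternative, sketched here as an assumption rather than the notebook's method, is a relative tolerance check. The name `values_match` and the 1% default are hypothetical.

```python
import math


def values_match(answer: float, actual: float, rel_tol: float = 0.01) -> bool:
    """Return True when the two values differ by at most rel_tol of the larger one."""
    return math.isclose(answer, actual, rel_tol=rel_tol, abs_tol=0.0)


# Land areas from the test run: model answer versus dataset value.
print(values_match(302.6, 300.5))  # New York
print(values_match(502.7, 469.5))  # Los Angeles
```

A looser tolerance trades strictness for robustness to data vintage, so the threshold should reflect how much drift between sources is acceptable.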

In [22]:



def verify_answer(response: Dict[str, Any], df: pd.DataFrame, question: str) -> bool:
    """Verify if the answer matches our dataset."""
    try:
        city = response['city']
        metric = response['metric']
        
        # Handle comparison questions
        if response.get('comparison'):
            # Drop the trailing "?" so the second city name matches literally
            cities = question.split(':')[1].strip().rstrip('?').split(' or ')
            city1, city2 = [c.strip() for c in cities]
            
            # Get values and handle both int and float
            val1_raw = df[df['city'].str.contains(city1, case=False, regex=False)][metric].values[0]
            val2_raw = df[df['city'].str.contains(city2, case=False, regex=False)][metric].values[0]
            
            if isinstance(val1_raw, str):
                val1 = float(val1_raw.replace(',', ''))
            else:
                val1 = float(val1_raw)
                
            if isinstance(val2_raw, str):
                val2 = float(val2_raw.replace(',', ''))
            else:
                val2 = float(val2_raw)
            
            expected_city = city1 if val1 > val2 else city2
            return city.lower() in expected_city.lower() or expected_city.lower() in city.lower()
        
        # Handle direct questions
        # First try a case-insensitive substring match on the city name
        matching_rows = df[df['city'].str.contains(city, case=False, regex=False)]
        
        # If no match, try without brackets/footnotes
        if len(matching_rows) == 0:
            city_clean = city.split('[')[0].strip()  # Remove footnote markers
            matching_rows = df[df['city'].str.contains(city_clean, case=False, regex=False)]
        
        # If still no match, try the other way around (dataset city contains response city)
        if len(matching_rows) == 0:
            for idx, row in df.iterrows():
                dataset_city_clean = row['city'].split('[')[0].strip()
                if city.lower() in dataset_city_clean.lower() or dataset_city_clean.lower() in city.lower():
                    matching_rows = df.loc[[idx]]  # iterrows yields index labels
                    break
        
        if len(matching_rows) == 0:
            print(f"No match found for city: '{city}'")
            return False
            
        actual_value = matching_rows[metric].values[0]
        
        # Handle population (integer) vs land_area (float)
        if metric == 'population':
            if isinstance(actual_value, str):
                actual_value = int(actual_value.replace(',', ''))
            answer = int(response['answer'])
            return answer == actual_value
        else:  # land_area_mi2
            if isinstance(actual_value, str):
                actual_value = float(actual_value.replace(',', ''))
            answer = float(response['answer'])
            return abs(answer - actual_value) < 0.1  # Allow small floating point differences
            
    except Exception as e:
        print(f"Verification error: {str(e)}")
        return False


def run_tests(questions: List[str], df: pd.DataFrame) -> List[Dict[str, Any]]:
    """Run all test questions and collect results."""
    results = []
    
    for i, question in enumerate(questions):
        print(f"Testing {i+1}/{len(questions)}: {question}")
        
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = bedrock_call(question, df)
                is_correct = verify_answer(response, df, question)
                
                results.append({
                    "question": question,
                    "response": response,
                    "passed": is_correct
                })
                break
                
            except Exception as e:
                if "ThrottlingException" in str(e) and attempt < max_retries - 1:
                    wait_time = (2 ** attempt) + random.uniform(0, 1)  # Exponential backoff
                    print(f"   Throttled, waiting {wait_time:.1f}s before retry...")
                    sleep(wait_time)
                else:
                    print(f"   Error: {str(e)}")
                    results.append({
                        "question": question,
                        "error": str(e),
                        "passed": False
                    })
                    break
        
        # Base delay between requests
        sleep(2)
    
    return results

# Load the dataset
df = pd.read_csv("./city_pop.csv")

# Run tests
test_results = run_tests(test_questions, df)

# Save results
with open("test_results.json", 'w') as f:
    json.dump(test_results, f, indent=2)

# Print summary
passed_tests = sum(1 for result in test_results if result['passed'])
print(f"\nTest Summary:")
print(f"Passed: {passed_tests}/{len(test_questions)}")
print(f"Success Rate: {(passed_tests/len(test_questions))*100:.2f}%")

# Print detailed results
print("\nDetailed Results:")
for result in test_results:
    status = "✅ PASS" if result['passed'] else "❌ FAIL"
    print(f"{status} - {result['question']}")
    if 'error' in result:
        print(f"   Error: {result['error']}")
    elif not result['passed']:
        print(f"   Response: {json.dumps(result['response'], indent=2)}")


Testing 1/9: What is the land area of of New York?
Testing 2/9: What is the land area of Los Angeles in square miles?
Testing 3/9: What is the population of Chicago?
Testing 4/9: What is the land area of Houston?
Testing 5/9: Which city has a larger population: Phoenix or New York?
Testing 6/9: What is the land area of San Francisco?
Testing 7/9: What is the population of Seattle?
Testing 8/9: What is the total land area of Boston?
   Throttled, waiting 1.8s before retry...
Testing 9/9: What is the land area of Las Vegas?
   Throttled, waiting 1.7s before retry...

Test Summary:
Passed: 3/9
Success Rate: 33.33%

Detailed Results:
❌ FAIL - What is the land area of of New York?
   Response: {
  "answer": 302.6,
  "city": "New York",
  "metric": "land_area_mi2"
}
❌ FAIL - What is the land area of Los Angeles in square miles?
   Response: {
  "answer": 502.7,
  "city": "Los Angeles",
  "metric": "land_area_mi2"
}
❌ FAIL - What is the population of Chicago?
   Response: {
  "answer": 274638

## 2. LLM-as-a-Judge Evaluation

LLM-as-a-Judge is an evaluation methodology in which a capable language model, either the model under test or a different one, acts as an automated evaluator. In place of human judges, the LLM assesses AI responses against specific criteria and returns feedback, scores, and the reasoning behind them.

This approach has gained significant traction in both research and industry applications due to its ability to provide consistent, scalable, and nuanced evaluations.
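
The core loop can be sketched in a few lines: build a rubric prompt, send it to a judge, and parse the score. This is a minimal illustration, not the notebook's implementation. `judge_response` and `fake_judge` are hypothetical names, and the stub stands in for a real model call so the example runs offline.

```python
import re
from typing import Callable, Optional


def judge_response(question: str, answer: str,
                   call_model: Callable[[str], str]) -> Optional[float]:
    """Ask a judge model to score an answer and return the numeric score."""
    prompt = (
        "Score the answer to the question from 0 to 10.\n"
        f"<question>{question}</question>\n"
        f"<answer>{answer}</answer>\n"
        "Reply as <score>X/10</score>."
    )
    reply = call_model(prompt)
    match = re.search(r"<score>\s*(\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same score."""
    return "<score>7/10</score>"


print(judge_response("What is the population of Chicago?", "2,721,308", fake_judge))
```

Passing the model call as a parameter keeps the scoring logic testable without network access.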


In [23]:

# To begin, we ask the model questions based on the same dataset

def generate_model_response(question: str, context_data: str = "") -> str:
    """Generate a model response to a cities question using Bedrock."""
    
    prompt = f"""
    You are an AI assistant with knowledge about US cities demographics. Answer the following question about US cities based on your knowledge.
    
    Question: {question}
    
    {f"Context data from dataset: {context_data}" if context_data else ""}
    
    Provide a clear, informative response. If the question involves calculations (like population density), show your work.
    """
    
    try:
        response = bedrock.converse(
            modelId='us.anthropic.claude-3-7-sonnet-20250219-v1:0',
            messages=[
                {
                    'role': 'user',
                    'content': [{'text': prompt}]
                }
            ],
            inferenceConfig={
                'maxTokens': 500,
                'temperature': 0.1
            }
        )
        
        return response['output']['message']['content'][0]['text']
    except Exception as e:
        return f"Error generating response: {str(e)}"

print("Model response generation function loaded")


Model response generation function loaded


In [24]:
# Define test questions for cities evaluation
cities_questions = [
    {
        'question': 'What is the population of New York City?',
        'context': 'New York[c], NY: population=8,478,072, land_area_mi2=300.5'
    },
    {
        'question': 'Which city has a larger population: Los Angeles or Chicago?',
        'context': 'Los Angeles, CA: population=3,878,704, land_area_mi2=469.5\nChicago, IL: population=2,721,308, land_area_mi2=227.7'
    },
    {
        'question': 'What is the population density of San Francisco?',
        'context': 'San Francisco, CA: population=815,201, land_area_mi2=46.9'
    },
    {
        'question': 'What is the land area of Houston in square miles?',
        'context': 'Houston, TX: population=2,304,580, land_area_mi2=670.2'
    },
    {
        'question': 'List the top 3 most populous cities in the United States.',
        'context': 'Top cities: New York (8,478,072), Los Angeles (3,878,704), Chicago (2,721,308)'
    },
    {
        'question': 'Calculate the population density of Chicago.',
        'context': 'Chicago, IL: population=2,721,308, land_area_mi2=227.7'
    }
]

# Generate model responses
print("Generating model responses...")
cities_responses = []

for i, item in enumerate(cities_questions):
    print(f"Generating response {i+1}/{len(cities_questions)}: {item['question']}")
    
    # Generate model response
    model_response = generate_model_response(item['question'], item['context'])
    
    # Create response data structure
    response_data = {
        'question': item['question'],
        'model_response': model_response,
        'context': item['context']
    }
    
    cities_responses.append(response_data)
    
    # Add delay to avoid rate limiting
    sleep(1)

print(f"\nGenerated {len(cities_responses)} model responses for evaluation")

# Display sample responses
print("\nSample generated responses:")
for i, response in enumerate(cities_responses[:2]):
    print(f"\n--- Response {i+1} ---")
    print(f"Question: {response['question']}")
    print(f"Model Response: {response['model_response'][:200]}...")
    print("-" * 50)

Generating model responses...
Generating response 1/6: What is the population of New York City?
Generating response 2/6: Which city has a larger population: Los Angeles or Chicago?
Generating response 3/6: What is the population density of San Francisco?
Generating response 4/6: What is the land area of Houston in square miles?
Generating response 5/6: List the top 3 most populous cities in the United States.
Generating response 6/6: Calculate the population density of Chicago.

Generated 6 model responses for evaluation

Sample generated responses:

--- Response 1 ---
Question: What is the population of New York City?
Model Response: The population of New York City is 8,478,072 according to the provided data....
--------------------------------------------------

--- Response 2 ---
Question: Which city has a larger population: Los Angeles or Chicago?
Model Response: Based on the context data provided, Los Angeles has a larger population than Chicago. Los Angeles has a population of 3,

In [30]:
# Load and configure the judge prompt
judge_prompt_template = """
You will be given a question about US cities demographics and population data. 
Your task is to evaluate a model's response for accuracy, completeness, and analytical quality.

Here is the question about US cities:
<question>{QUESTION}</question>

Here is the model's response:
<model_response>{MODEL_RESPONSE}</model_response>

Here is the context from the data:
<dataset>{context}</dataset>

**Dataset Context:** The response should be based on the US Cities Population Dataset containing the 314 most populous US cities with the following features:
- **city**: City name
- **state**: Two-letter state abbreviation  
- **population**: Population count (may include commas/formatting)
- **land_area_mi2**: Land area in square miles
- **Coverage**: Cities from 8.4M+ (NYC) down to ~100K residents

First, analyze the question type and evaluate the model response based on:

1. **Data Accuracy**: Are population figures, city names, and geographic information correct?
2. **Calculation Correctness**: If calculations are involved (density, rankings, comparisons), are they mathematically sound?
3. **Geographic Knowledge**: Does the response demonstrate proper understanding of US geography and state locations?
4. **Analytical Depth**: For complex queries, does the response provide meaningful insights beyond basic data retrieval?
5. **Data Handling**: Does the response appropriately handle data formatting issues (commas in numbers, footnotes, etc.)?

Then, classify the question type:
1. **Factual Lookup**: Simple data retrieval (population of specific city)
2. **Ranking/Comparison**: Ordering cities by metrics or comparing multiple cities
3. **Calculation-Based**: Requires mathematical operations (density, growth rates, etc.)
4. **Geographic Analysis**: Regional patterns, state-level analysis, geographic distribution
5. **Trend Analysis**: Population patterns, urban development insights

Provide your evaluation in the following format:

<analysis>
[Your detailed analysis of the response quality, noting any factual errors, missing information, or analytical strengths/weaknesses]
</analysis>

<question_type>factual_lookup/ranking_comparison/calculation_based/geographic_analysis/trend_analysis</question_type>

<complexity>Basic/Intermediate/Advanced</complexity>

<score>X/10</score>

<reasoning>
[Explanation for the score based on accuracy, completeness, analytical quality, and appropriate handling of the dataset characteristics]
</reasoning>

<improvements>
[Specific suggestions for how the response could be enhanced, if applicable]
</improvements>
"""

print("Judge prompt template loaded successfully")
print(f"\nJudge prompt preview:")
print(judge_prompt_template[:300] + "...")

Judge prompt template loaded successfully

Judge prompt preview:

You will be given a question about US cities demographics and population data. 
Your task is to evaluate a model's response for accuracy, completeness, and analytical quality.

Here is the question about US cities:
<question>{QUESTION}</question>

Here is the model's response:
<model_response>{MODE...


## 3. LLM-as-a-Judge Evaluation Setup

Configure the judge model and define helpers that build the judge prompt and run evaluations in parallel.

In [31]:
# Judge model configuration
JUDGE_MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

def build_judge_prompt(question: str, model_response: str, context: str = "") -> str:
    """Build the judge prompt for evaluating US cities demographic analysis responses."""
    
    # Replace placeholders in the judge prompt template
    formatted_prompt = judge_prompt_template.replace("{QUESTION}", question)
    formatted_prompt = formatted_prompt.replace("{MODEL_RESPONSE}", model_response)
    formatted_prompt = formatted_prompt.replace("{context}", context)
    
    return formatted_prompt

print("Updated build_judge_prompt function loaded")


def call_judge_model(prompt: str) -> str:
    """Call the judge model to evaluate a response using boto3 directly."""
    try:
        response = bedrock.converse(
            modelId=JUDGE_MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": 0.1, "maxTokens": 1000}
        )
        
        return response["output"]["message"]["content"][0]["text"]
    except Exception as e:
        return f"Error: {str(e)}"

def call_threaded_evaluation(prompts: List[str], max_workers=3) -> List[str]:
    """Process evaluation requests in parallel using boto3."""
    future_to_position = {}
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for i, prompt in enumerate(prompts):
            future = executor.submit(call_judge_model, prompt)
            future_to_position[future] = i
        
        responses = [None] * len(prompts)
        
        for future in as_completed(future_to_position):
            position = future_to_position[future]
            try:
                response = future.result()
                responses[position] = response
            except Exception as exc:
                print(f"Request at position {position} generated an exception: {exc}")
                responses[position] = f"Error: {str(exc)}"
        
    return responses

print(f"Judge model: {JUDGE_MODEL_ID}")

Updated build_judge_prompt function loaded
Judge model: us.anthropic.claude-3-7-sonnet-20250219-v1:0
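
Chained `str.replace` calls rescan text that earlier replacements inserted, so a model response containing the literal text `{context}` would itself be rewritten. A single-pass substitution avoids this. The sketch below is one possible approach, and `fill_template` is a hypothetical helper.

```python
import re
from typing import Dict


def fill_template(template: str, values: Dict[str, str]) -> str:
    """Replace each {NAME} placeholder once, without rescanning inserted text."""
    if not values:
        return template
    pattern = re.compile("|".join(re.escape("{" + key + "}") for key in values))
    # m.group(0) is the matched placeholder; strip the braces to get the key.
    return pattern.sub(lambda m: values[m.group(0)[1:-1]], template)


template = "<q>{QUESTION}</q><r>{MODEL_RESPONSE}</r><d>{context}</d>"
filled = fill_template(template, {
    "QUESTION": "Population of Chicago?",
    "MODEL_RESPONSE": "See {context} for details.",
    "context": "Chicago, IL: population=2,721,308",
})
print(filled)
```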


## 4. Run LLM-as-a-Judge Evaluation

Evaluate the responses using the judge model.

In [32]:
if not cities_responses:
    print("No cities responses to evaluate. Please run the model response generation cell first.")
else:
    print(f"Starting evaluation of {len(cities_responses)} responses...")
    
    # Prepare evaluation prompts
    evaluation_prompts = []
    
    for response_data in cities_responses:
        # Extract the question and response
        question = response_data['question']
        model_response = response_data['model_response']
        
        # Extract relevant context data if available
        context = response_data.get('context', '')
        
        judge_prompt = build_judge_prompt(
            question=question,
            model_response=model_response,
            context=context
        )
        
        evaluation_prompts.append(judge_prompt)
    
    # Run evaluations in parallel
    print("Running evaluations (this may take a few minutes)...")
    evaluation_results = call_threaded_evaluation(evaluation_prompts)
    
    print(f"Completed {len(evaluation_results)} evaluations")


Starting evaluation of 6 responses...
Running evaluations (this may take a few minutes)...
Completed 6 evaluations


## 5. Parse and Analyze Results

Extract structured information from the judge evaluations.

In [34]:
def extract_evaluation_components(evaluation_text: str) -> Dict:
    """Extract structured components from judge evaluation."""
    
    # Regex patterns for extracting components - updated for cities evaluation format
    patterns = {
        'analysis': r'<analysis>(.*?)</analysis>',
        'question_type': r'<question_type>(.*?)</question_type>',
        'complexity': r'<complexity>(.*?)</complexity>',
        'score': r'<score>(.*?)</score>',
        'reasoning': r'<reasoning>(.*?)</reasoning>',
        'improvements': r'<improvements>(.*?)</improvements>'
    }
    
    extracted = {}
    
    for key, pattern in patterns.items():
        match = re.search(pattern, evaluation_text, re.DOTALL | re.IGNORECASE)
        if match:
            extracted[key] = match.group(1).strip()
        else:
            extracted[key] = None
    
    # Extract numeric score
    if extracted['score']:
        score_match = re.search(r'(\d+(?:\.\d+)?)', extracted['score'])
        if score_match:
            try:
                extracted['numeric_score'] = float(score_match.group(1))
            except ValueError:
                extracted['numeric_score'] = None
        else:
            extracted['numeric_score'] = None
    else:
        extracted['numeric_score'] = None
    
    return extracted

if 'evaluation_results' in locals() and evaluation_results:
    # Parse all evaluation results
    parsed_evaluations = []
    
    for i, (response_data, evaluation_text) in enumerate(zip(cities_responses, evaluation_results)):
        if not evaluation_text.startswith("Error:"):
            parsed_eval = extract_evaluation_components(evaluation_text)
            
            # Combine with original response data
            combined_result = {
                **response_data,
                'evaluation_text': evaluation_text,
                **parsed_eval
            }
            
            parsed_evaluations.append(combined_result)
        else:
            print(f"Evaluation error for response {i}: {evaluation_text}")
    
    print(f"Successfully parsed {len(parsed_evaluations)} evaluations")
    
    # Create DataFrame for analysis
    df_evaluations = pd.DataFrame(parsed_evaluations)
    
    print("\nEvaluation DataFrame created with columns:")
    print(list(df_evaluations.columns))
    
    # Display summary statistics for cities evaluation
    if not df_evaluations.empty:
        print(f"\nEvaluation Summary:")
        print(f"Average Score: {df_evaluations['numeric_score'].mean():.2f}")
        print(f"Score Range: {df_evaluations['numeric_score'].min():.1f} - {df_evaluations['numeric_score'].max():.1f}")
        
        print(f"\nQuestion Type Distribution:")
        print(df_evaluations['question_type'].value_counts())
        
        print(f"\nComplexity Distribution:")
        print(df_evaluations['complexity'].value_counts())
        
else:
    print("No evaluation results to parse.")


Successfully parsed 6 evaluations

Evaluation DataFrame created with columns:
['question', 'model_response', 'context', 'evaluation_text', 'analysis', 'question_type', 'complexity', 'score', 'reasoning', 'improvements', 'numeric_score']

Evaluation Summary:
Average Score: 10.00
Score Range: 10.0 - 10.0

Question Type Distribution:
question_type
factual_lookup        2
ranking_comparison    2
calculation_based     2
Name: count, dtype: int64

Complexity Distribution:
complexity
Basic    6
Name: count, dtype: int64
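
The parser above keeps the first number found in the score tag. A stricter variant, sketched here as an assumption, also reads the denominator so that malformed values such as `12/10` are rejected and scores on different scales become comparable. `normalized_score` is a hypothetical helper.

```python
import re
from typing import Optional

_SCORE = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)")


def normalized_score(raw: Optional[str]) -> Optional[float]:
    """Turn a string like '8/10' into a fraction in [0, 1], else None."""
    if not raw:
        return None
    match = _SCORE.search(raw)
    if match is None:
        return None
    value, scale = float(match.group(1)), float(match.group(2))
    if scale <= 0 or value > scale:
        return None
    return value / scale


for text in ["10/10", "7.5/10", "12/10", "excellent"]:
    print(text, normalized_score(text))
```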


## 6. View Detailed Metrics Table

Save the evaluation results to JSON and CSV, then load the CSV file to display all evaluation metrics for detailed analysis.

In [36]:
# Save cities evaluation results to files
if 'df_evaluations' in locals() and not df_evaluations.empty:
    print("Saving cities evaluation results...")
    
    # Save detailed results to JSON
    output_file = "cities_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(parsed_evaluations, f, indent=2, default=str)
    print(f"Detailed results saved to: {output_file}")
    
    # Save summary CSV
    summary_columns = ['question', 'numeric_score', 'question_type', 'complexity', 
                      'analysis', 'reasoning', 'improvements']
    
    available_columns = [col for col in summary_columns if col in df_evaluations.columns]
    if available_columns:
        summary_df = df_evaluations[available_columns]
        summary_df.to_csv("cities_evaluation_summary.csv", index=False)
        print(f"Summary CSV saved to: cities_evaluation_summary.csv")
    
    print(f"\nEvaluation complete! {len(df_evaluations)} results processed.")
    print("Run the next cell to view the detailed metrics table.")
    
else:
    print("No evaluation data available to save.")

Saving cities evaluation results...
Detailed results saved to: cities_evaluation_results.json
Summary CSV saved to: cities_evaluation_summary.csv

Evaluation complete! 6 results processed.
Run the next cell to view the detailed metrics table.


In [37]:
# Load and display the CSV metrics file
import pandas as pd
import os

csv_file = "cities_evaluation_summary.csv"

if os.path.exists(csv_file):
    print(f"Loading metrics from: {csv_file}")
    
    # Load the CSV file
    metrics_df = pd.read_csv(csv_file)
    
    print(f"\nCITIES EVALUATION METRICS TABLE ({len(metrics_df)} records)")
    print("=" * 80)
    
    # Configure pandas display options for better viewing
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', 50)
    
    # Display the dataframe
    display(metrics_df)
    
    # Display additional cities-specific analysis
    if not metrics_df.empty and 'numeric_score' in metrics_df.columns:
        print(f"\n" + "=" * 80)
        print("CITIES EVALUATION ANALYSIS")
        print("=" * 80)
        
        print(f"\nSCORE STATISTICS:")
        print(f"  Average Score: {metrics_df['numeric_score'].mean():.2f}/10")
        print(f"  Median Score: {metrics_df['numeric_score'].median():.2f}/10")
        print(f"  Score Range: {metrics_df['numeric_score'].min():.1f} - {metrics_df['numeric_score'].max():.1f}")
        
        if 'question_type' in metrics_df.columns:
            print(f"\nQUESTION TYPE PERFORMANCE:")
            type_stats = metrics_df.groupby('question_type')['numeric_score'].agg(['mean', 'count'])
            for question_type, stats in type_stats.iterrows():
                print(f"  {question_type.replace('_', ' ').title()}: {stats['mean']:.2f} avg ({int(stats['count'])} questions)")
        
        if 'complexity' in metrics_df.columns:
            print(f"\nCOMPLEXITY PERFORMANCE:")
            complexity_stats = metrics_df.groupby('complexity')['numeric_score'].agg(['mean', 'count'])
            for complexity, stats in complexity_stats.iterrows():
                print(f"  {complexity}: {stats['mean']:.2f} avg ({int(stats['count'])} questions)")
    
else:
    print(f"CSV file not found: {csv_file}")
    print("Please run the cities evaluation cells first to generate the metrics file.")
    print("\nExpected workflow:")
    print("1. Generate model responses to cities questions")
    print("2. Run LLM-as-a-Judge evaluation")
    print("3. Parse and save evaluation results")
    print("4. View this metrics summary")



Loading metrics from: cities_evaluation_summary.csv

CITIES EVALUATION METRICS TABLE (6 records)


| | question | numeric_score | question_type | complexity | analysis | reasoning | improvements |
|---|---|---|---|---|---|---|---|
| 0 | What is the population of New York City? | 10.0 | factual_lookup | Basic | The model's response is factually accurate. It... | The model deserves a perfect score because:\n1... | The response is appropriate as is for this bas... |
| 1 | Which city has a larger population: Los Angele... | 10.0 | ranking_comparison | Basic | The model's response is factually accurate and... | The model deserves a perfect score because it:... | No significant improvements needed. The respon... |
| 2 | What is the population density of San Francisco? | 10.0 | calculation_based | Basic | The model's response correctly calculates the ... | The model response deserves a perfect score be... | The response is comprehensive and accurate as ... |
| 3 | What is the land area of Houston in square miles? | 10.0 | factual_lookup | Basic | The model's response is factually accurate. It... | The model deserves a perfect score because:\n1... | No improvements needed. The response is factua... |
| 4 | List the top 3 most populous cities in the Uni... | 10.0 | ranking_comparison | Basic | The model's response is factually accurate and... | The model deserves a perfect score because:\n1... | The response is already excellent and complete... |
| 5 | Calculate the population density of Chicago. | 10.0 | calculation_based | Basic | The model's response correctly calculates the ... | The model's response deserves a perfect score ... | While the response is already excellent, poten... |



CITIES EVALUATION ANALYSIS

SCORE STATISTICS:
  Average Score: 10.00/10
  Median Score: 10.00/10
  Score Range: 10.0 - 10.0

QUESTION TYPE PERFORMANCE:
  Calculation Based: 10.00 avg (2 questions)
  Factual Lookup: 10.00 avg (2 questions)
  Ranking Comparison: 10.00 avg (2 questions)

COMPLEXITY PERFORMANCE:
  Basic: 10.00 avg (6 questions)
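
Every response here received 10/10, which leaves the judge with no spread to distinguish stronger answers from weaker ones. A quick check for this kind of ceiling effect could look like the sketch below. `summarize_scores` and its output keys are hypothetical, and treating zero spread as "saturated" is an assumption about what is worth flagging.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_scores(scores: List[float], max_score: float = 10.0) -> Dict[str, object]:
    """Summarize judge scores and flag a result set with no spread."""
    if not scores:
        raise ValueError("scores must not be empty")
    spread = pstdev(scores)  # population standard deviation
    return {
        "mean": mean(scores),
        "spread": spread,
        "at_ceiling": sum(1 for s in scores if s >= max_score) / len(scores),
        "saturated": spread == 0.0,
    }


print(summarize_scores([10.0] * 6))
```

When a run is flagged, harder questions or a stricter rubric are two possible ways to recover signal.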


## Conclusion

This notebook demonstrated two ways to evaluate model answers about US city demographics: programmatic checks against ground truth and LLM-as-a-Judge scoring. Key achievements:

- **Comprehensive Evaluation**: Used a structured rubric to assess data accuracy, calculation correctness, and analytical depth
- **Multi-faceted Testing**: Combined programmatic testing with qualitative judge evaluation
- **Scalable Framework**: Created a reusable evaluation pipeline for demographic data analysis
- **Performance Insights**: Compared scores across question types and complexity levels. With dataset context supplied, all six responses scored 10/10 and were rated Basic, whereas 3 of 9 programmatic tests passed when the model answered without context