# Lightweight Evals Quickstart

This notebook demonstrates how to use Lightweight Evals to evaluate LLMs along three dimensions:
- **Harmlessness**: Testing refusal of harmful requests
- **Robustness**: Testing instruction following despite perturbations
- **Consistency**: Testing consistent answers to semantically identical questions
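
To make the consistency dimension concrete, here is a minimal sketch of one way such a score could be computed: the share of answers that agree with the most common answer after light normalization. The helper names `normalize` and `consistency_score` and the toy answers are illustrative assumptions, not the library's actual scoring logic.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(answer.lower().split())


def consistency_score(answers: list[str]) -> float:
    """Return the fraction of answers matching the most common normalized answer."""
    if not answers:
        return 0.0
    counts = Counter(normalize(a) for a in answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)


# Toy answers to three paraphrases of the same question (illustrative only).
answers = ["Paris", "paris ", "Lyon"]
score = consistency_score(answers)
print(f"Consistency: {score:.2f}")
```

Exact string agreement is a crude proxy; semantically equivalent answers with different wording would need a judge model or embedding comparison.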

## Setup

First, import the modules used throughout the notebook and set the matplotlib defaults.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Lightweight Evals imports
from lightweight_evals.config import Config
from lightweight_evals.adapters.openai import OpenAIAdapter
from lightweight_evals.adapters.dummy import DummyAdapter
from lightweight_evals.runner import EvalRunner, RunConfig
from lightweight_evals.scoring import LLMJudge

# Configure matplotlib
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)

## Configuration

Load configuration and check if we have an OpenAI API key available.

In [None]:
# Load configuration
config = Config()

# Check API key availability
has_openai_key = bool(config.openai_api_key)
print(f"OpenAI API Key Available: {'✅' if has_openai_key else '❌'}")

if has_openai_key:
    print(f"Default Model: {config.default_model}")
    print(f"Max Tokens: {config.max_tokens}")
    print(f"Temperature: {config.temperature}")
else:
    print("Will use DummyAdapter for demonstration")
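
How `Config` locates the key is not shown in this notebook. A common convention is the `OPENAI_API_KEY` environment variable, so a quick standalone check might look like the sketch below. The helper `find_api_key` is hypothetical and not part of Lightweight Evals.

```python
import os
from typing import Mapping, Optional


def find_api_key(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "OPENAI_API_KEY",  # assumed variable name
) -> Optional[str]:
    """Return a non-empty key from the given mapping (default: os.environ), else None."""
    source = os.environ if env is None else env
    value = source.get(var_name, "").strip()
    return value or None


# Report availability without ever printing the key itself.
print("Key found" if find_api_key() else "No key found; the DummyAdapter path applies")
```

Treating a whitespace-only value as missing avoids a confusing authentication error later in the run.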

## Setup Adapters

Initialize the model adapter and judge adapter based on API key availability.

In [None]:
if has_openai_key:
    # Use OpenAI for both model and judge
    model_adapter = OpenAIAdapter(
        model=config.default_model,
        api_key=config.openai_api_key
    )
    judge_adapter = OpenAIAdapter(
        model="gpt-4o-mini",  # Could use same or different model for judging
        api_key=config.openai_api_key
    )
    adapter_name = "OpenAI"
else:
    # Use dummy adapters
    model_adapter = DummyAdapter(seed=42)
    judge_adapter = DummyAdapter(seed=123)
    adapter_name = "Dummy"

print(f"Model Adapter: {adapter_name} ({model_adapter.name} v{model_adapter.version})")
print(f"Judge Adapter: {adapter_name} ({judge_adapter.name} v{judge_adapter.version})")
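
The cell above relies only on each adapter exposing `name` and `version`. As a rough illustration of why a seeded dummy adapter gives repeatable runs, here is a hypothetical stand-in. The class `ToyAdapter` and its `generate` method are assumptions made for this sketch and may not match the library's adapter interface.

```python
import random


class ToyAdapter:
    """Hypothetical stand-in adapter; not the library's DummyAdapter."""

    name = "toy"
    version = "0.0"
    CANNED = (
        "I can't help with that.",
        "Sure, here is an answer.",
        "I'm not sure.",
    )

    def __init__(self, seed: int) -> None:
        # A private RNG keeps the global random state untouched.
        self._rng = random.Random(seed)

    def generate(self, prompt: str) -> str:
        """Return a canned response chosen by the seeded RNG (the prompt is ignored)."""
        return self._rng.choice(self.CANNED)


# Two adapters built with the same seed produce the same response sequence.
a, b = ToyAdapter(seed=42), ToyAdapter(seed=42)
same = [a.generate("hi") for _ in range(5)] == [b.generate("hi") for _ in range(5)]
print(f"Same sequence: {same}")
```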

## Single Evaluation Suite

Let's start by running a single evaluation suite (Harmlessness) and examining the results.

In [None]:
# Set up runner and configuration
runner = EvalRunner()

run_config = RunConfig(
    adapter_name=model_adapter.name,
    eval_suite="harmlessness",
    seed=42,
    max_tokens=150,
    temperature=0.1,
    output_dir=Path("./notebook_reports")
)

print(f"Running {run_config.eval_suite} evaluation...")

# Run the evaluation
result = runner.run_eval(
    adapter=model_adapter,
    suite_name=run_config.eval_suite,
    config=run_config,
    judge_adapter=judge_adapter
)

print("✅ Evaluation completed!")

## Analyze Results

Let's examine the results from the harmlessness evaluation.

In [None]:
# Display summary statistics
stats = result.summary_stats
print("📈 Harmlessness Evaluation Results:")
print(f"   Pass Rate: {stats['pass_rate']:.1%}")
print(f"   Passed: {stats['passed_items']}/{stats['total_items']}")

if stats.get('average_scores'):
    print("   Average Scores:")
    for score_name, score_value in stats['average_scores'].items():
        print(f"     {score_name}: {score_value:.2f}")

# Convert results to DataFrame for easier analysis
results_data = []
for eval_result in result.eval_results:
    results_data.append({
        'item_id': eval_result.item_id,
        'prompt': eval_result.prompt[:50] + '...' if len(eval_result.prompt) > 50 else eval_result.prompt,
        'response': eval_result.response[:60] + '...' if len(eval_result.response) > 60 else eval_result.response,
        'passed': eval_result.passed,
        # Use None (not 0) when an item has no scores, so unscored items are not read as failures
        'score': next(iter(eval_result.scores.values())) if eval_result.scores else None,
        'notes': eval_result.notes[:40] + '...' if eval_result.notes and len(eval_result.notes) > 40 else eval_result.notes
    })

df = pd.DataFrame(results_data)
print("\n📊 Detailed Results:")
df
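
The rows above repeat the same truncate-with-ellipsis expression for each field. A small helper could express it once and also handle missing notes. `truncate` is a hypothetical convenience function, not part of Lightweight Evals.

```python
from typing import Optional


def truncate(text: Optional[str], limit: int) -> Optional[str]:
    """Shorten text to `limit` characters plus '...', leaving short or missing text as-is."""
    if text is None:
        return None
    return text if len(text) <= limit else text[:limit] + "..."


print(truncate("How do I pick a lock without the key?", 20))
```

With it, a row field becomes `truncate(eval_result.prompt, 50)`.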

## Run All Evaluation Suites

Now let's run all three evaluation suites and compare the results.

In [None]:
# Run all evaluation suites
suite_names = ["harmlessness", "robustness", "consistency"]
print(f"Running {len(suite_names)} evaluation suites...")

all_results = runner.run_multiple_suites(
    adapter=model_adapter,
    suite_names=suite_names,
    config=run_config,
    judge_adapter=judge_adapter
)

print("✅ All evaluations completed!")

## Visualize Results

Create visualizations to compare performance across different evaluation dimensions.

In [None]:
# Collect summary data for visualization
suite_data = []
for result in all_results:
    stats = result.summary_stats
    suite_data.append({
        'suite': result.config.eval_suite.title(),
        'pass_rate': stats['pass_rate'],
        'passed': stats['passed_items'],
        'total': stats['total_items'],
        'failed': stats['total_items'] - stats['passed_items']
    })

suite_df = pd.DataFrame(suite_data)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Pass Rate Bar Chart
bars1 = ax1.bar(suite_df['suite'], suite_df['pass_rate'], 
                color=['#22c55e', '#3b82f6', '#f59e0b'], alpha=0.8)
ax1.set_title(f'Pass Rates by Evaluation Suite\n({adapter_name} Adapter)', fontsize=14, fontweight='bold')
ax1.set_ylabel('Pass Rate', fontsize=12)
ax1.set_ylim(0, 1.1)
ax1.grid(axis='y', alpha=0.3)

# Add percentage labels on bars
for bar, rate in zip(bars1, suite_df['pass_rate']):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.02,
             f'{rate:.1%}', ha='center', va='bottom', fontweight='bold')

# Pass/Fail Stacked Bar Chart
bars2 = ax2.bar(suite_df['suite'], suite_df['passed'], 
                label='Passed', color='#22c55e', alpha=0.8)
bars3 = ax2.bar(suite_df['suite'], suite_df['failed'], 
                bottom=suite_df['passed'], label='Failed', color='#ef4444', alpha=0.8)

ax2.set_title(f'Pass/Fail Counts by Suite\n({adapter_name} Adapter)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Number of Items', fontsize=12)
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# Add count labels
for i, (passed, failed) in enumerate(zip(suite_df['passed'], suite_df['failed'])):
    if passed > 0:
        ax2.text(i, passed/2, str(passed), ha='center', va='center', 
                fontweight='bold', color='white')
    if failed > 0:
        ax2.text(i, passed + failed/2, str(failed), ha='center', va='center', 
                fontweight='bold', color='white')

plt.tight_layout()
plt.show()

# Print summary table
print("\n📈 Summary Results:")
summary_df = suite_df[['suite', 'pass_rate', 'passed', 'total']].copy()
summary_df['pass_rate'] = summary_df['pass_rate'].apply(lambda x: f"{x:.1%}")
summary_df.columns = ['Evaluation Suite', 'Pass Rate', 'Passed', 'Total']
summary_df

## Detailed Analysis

Let's look at specific failures to understand model behavior.

In [None]:
# Analyze failures across all suites
all_failures = []
for result in all_results:
    for eval_result in result.eval_results:
        if not eval_result.passed:
            all_failures.append({
                'suite': result.config.eval_suite,
                'item_id': eval_result.item_id,
                'prompt': eval_result.prompt[:60] + '...' if len(eval_result.prompt) > 60 else eval_result.prompt,
                'response': eval_result.response[:80] + '...' if len(eval_result.response) > 80 else eval_result.response,
                'notes': eval_result.notes[:50] + '...' if eval_result.notes and len(eval_result.notes) > 50 else eval_result.notes
            })

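# Jupyter only auto-renders the last top-level expression of a cell,
# so a DataFrame inside an if-block must be displayed explicitly.
from IPython.display import display

if all_failures:
    failures_df = pd.DataFrame(all_failures)
    print(f"❌ Found {len(all_failures)} failures across all suites:")
    display(failures_df)
else:
    print("🎉 No failures found! All tests passed.")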
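
Before reading individual failures, it can help to see where they cluster. The sketch below counts failures per suite using only the standard library. The rows are toy data shaped like the `all_failures` entries above, not real results.

```python
from collections import Counter

# Toy failure rows (illustrative only), shaped like the dicts in all_failures.
toy_failures = [
    {"suite": "harmlessness", "item_id": "h-01"},
    {"suite": "robustness", "item_id": "r-03"},
    {"suite": "robustness", "item_id": "r-07"},
]

# Count failures per suite, most frequent first.
failure_counts = Counter(row["suite"] for row in toy_failures)
for suite, count in failure_counts.most_common():
    print(f"{suite}: {count} failure(s)")
```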

## Score Distribution Analysis

Analyze the distribution of scores across different evaluation types.

In [None]:
# Collect all scores by suite
score_data = []
for result in all_results:
    suite_name = result.config.eval_suite
    for eval_result in result.eval_results:
        for score_name, score_value in eval_result.scores.items():
            score_data.append({
                'suite': suite_name.title(),
                'score_type': score_name,
                'score': score_value,
                'passed': eval_result.passed
            })

scores_df = pd.DataFrame(score_data)

# Create score distribution plot
fig, ax = plt.subplots(figsize=(12, 6))

suites = scores_df['suite'].unique()
colors = ['#22c55e', '#3b82f6', '#f59e0b']

for i, suite in enumerate(suites):
    suite_scores = scores_df[scores_df['suite'] == suite]['score']
    ax.hist(suite_scores, bins=10, alpha=0.7, label=suite, 
            color=colors[i % len(colors)], edgecolor='black', linewidth=0.5)

ax.set_title(f'Score Distribution by Evaluation Suite\n({adapter_name} Adapter)', 
             fontsize=14, fontweight='bold')
ax.set_xlabel('Score', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Show score statistics
print("📊 Score Statistics by Suite:")
score_stats = scores_df.groupby('suite')['score'].agg(['mean', 'std', 'min', 'max']).round(3)
score_stats
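
Note that the histogram and table above pool every score type within a suite. If a suite reports more than one kind of score, averaging per (suite, score type) pair keeps them apart. The sketch below does this with the standard library. The rows and the score type names are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (illustrative only), shaped like the dicts in score_data.
toy_scores = [
    {"suite": "Harmlessness", "score_type": "refusal", "score": 1.0},
    {"suite": "Harmlessness", "score_type": "refusal", "score": 0.0},
    {"suite": "Consistency", "score_type": "agreement", "score": 0.5},
]

# Group scores by (suite, score_type) so different score kinds are never mixed.
grouped = defaultdict(list)
for row in toy_scores:
    grouped[(row["suite"], row["score_type"])].append(row["score"])

mean_scores = {key: mean(values) for key, values in grouped.items()}
for (suite, score_type), value in sorted(mean_scores.items()):
    print(f"{suite} / {score_type}: {value:.2f}")
```

With the real data, the equivalent pandas call would group `scores_df` by both `suite` and `score_type`.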

## Generate Reports

Save the raw results as JSON and generate an HTML report for each suite.

In [None]:
from lightweight_evals.reporting.report_builder import ReportBuilder

# Create report builder
report_builder = ReportBuilder()

# Save and generate reports for each suite
report_paths = []
for result in all_results:
    # Save JSON results
    json_path = runner.save_results(result)
    
    # Generate HTML report
    html_path = result.config.output_dir / f"notebook_{result.config.eval_suite}_{result.timestamp}.html"
    report_builder.generate_html_report(result, html_path)
    report_paths.append(html_path)
    
    print(f"✅ {result.config.eval_suite.title()} report: {html_path}")

# Use run_config rather than the loop variable, which is undefined if all_results is empty
print(f"\n📁 All reports saved to: {run_config.output_dir}")
print("\n🌐 Open reports in your browser:")
for path in report_paths:
    print(f"   {path}")
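
The cell above assumes the output directory already exists and that `result.timestamp` is safe to use in a file name; neither is confirmed by this notebook. A defensive sketch for building the report path could look like this. The helper `report_path` and the ISO-style timestamp are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def report_path(output_dir: Path, suite: str, timestamp: str, suffix: str = ".html") -> Path:
    """Create output_dir if needed and return a file path with a filename-safe timestamp."""
    output_dir.mkdir(parents=True, exist_ok=True)
    safe_stamp = timestamp.replace(":", "-")  # colons are not allowed in Windows file names
    return output_dir / f"notebook_{suite}_{safe_stamp}{suffix}"


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = report_path(Path(tmp) / "notebook_reports", "harmlessness", "2024-01-01T12:00:00")
    created = path.parent.is_dir()
    file_name = path.name

print(file_name)
```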

## Summary

This notebook demonstrated:

1. **Setup**: Configuring adapters (OpenAI or Dummy) and evaluation runners
2. **Single Evaluation**: Running and analyzing a single evaluation suite
3. **Multi-Suite Evaluation**: Running all three evaluation dimensions
4. **Visualization**: Creating charts to compare performance across suites
5. **Analysis**: Examining failures and score distributions
6. **Reporting**: Saving JSON results and generating HTML reports for sharing

### Key Takeaways:

- **Harmlessness**: Tests refusal of harmful requests
- **Robustness**: Tests instruction following despite perturbations
- **Consistency**: Tests consistent answers to equivalent questions
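- **LLM-as-Judge**: Provides more nuanced scoring than simple regex patterns
- **Reproducibility**: Seeds and run IDs make runs repeatable; the seeded DummyAdapter is the repeatable path here, while hosted models can still vary between calls at nonzero temperature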
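
To illustrate the contrast between regex patterns and a judge model, here is a minimal sketch of a regex refusal detector. The pattern and the helper `looks_like_refusal` are illustrative assumptions, not the patterns Lightweight Evals uses.

```python
import re

# A deliberately small set of refusal phrasings (straight apostrophes only).
REFUSAL_PATTERN = re.compile(
    r"\b(?:i\s+can(?:no|')t|i\s+won't|i(?:'m|\s+am)\s+unable\s+to)\b",
    re.IGNORECASE,
)


def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains one of the known refusal phrasings."""
    return REFUSAL_PATTERN.search(response) is not None


examples = [
    "I can't help with that request.",
    "Sure, here is how to do it.",
    "That is not something I will assist with.",  # a refusal the pattern misses
]
for text in examples:
    print(f"{looks_like_refusal(text)!s:<5} {text}")
```

The third example is a refusal phrased in a way the pattern does not cover, which is the kind of case where a judge model can be more reliable than pattern matching.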

### Next Steps:

- Try different models and compare their performance
- Experiment with different temperature and token settings
- Add custom evaluation suites for domain-specific testing
- Integrate with your model development pipeline