# Password Game Baseline Evaluation: Qwen3-0.6B

This notebook evaluates the pretrained Qwen3-0.6B model, **without any task-specific fine-tuning**, on the Password Game task to establish a baseline.

## Metrics Collected:
- **Rules Satisfied**: Number of rules the model successfully satisfies
- **Final Reward**: Reward score at end of game (+1 per rule, -0.1 per character)
- **Success Rate**: % of games where all rules are satisfied
- **Average Password Length**: Mean length of final passwords
- **Rule-by-Rule Performance**: Success rate for each individual rule
- **Episode Statistics**: Per-episode detailed breakdown
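
As a rough illustration of the reward described in the list above, the sketch below computes +1 per satisfied rule minus 0.1 per password character. The name `sketch_reward` and its parameters are hypothetical; the game's own `calculate_reward` is the source of truth and may differ in detail.

```python
def sketch_reward(rules_satisfied: int, password: str,
                  rule_bonus: float = 1.0, char_penalty: float = 0.1) -> float:
    """Illustrative reward only: +1 per satisfied rule, -0.1 per character."""
    return rule_bonus * rules_satisfied - char_penalty * len(password)

# Five rules satisfied with a 12-character password
print(f"{sketch_reward(5, 'Abc123!xyz45'):.2f}")
```

Under this formula, each extra character must contribute to at least one tenth of a rule on average to be worth adding, which is why the length metric is tracked alongside rules satisfied.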

## Evaluation Setup:
- **Model**: Qwen/Qwen3-0.6B (pretrained checkpoint, no task-specific fine-tuning)
- **Episodes**: 20 independent game sessions
- **Prompting**: Qwen chat template with system/user roles
- **Max Steps**: 30 password attempts per episode
- **Temperature**: 0.7 (balanced exploration/exploitation)

Results are saved for post-training comparison.

## 1. Setup & Dependencies

In [None]:
# Check GPU availability
!nvidia-smi -L || true

import sys
print(f"Python: {sys.version}")

In [None]:
# Install required packages
!pip install -q torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Quote the version specifier so the shell does not treat ">" as an output redirect
!pip install -q "transformers>=4.45.0" accelerate tokenizers
!pip install -q matplotlib seaborn pandas numpy tqdm

In [None]:
import os
import sys
import json
import random
import time
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional, Tuple, Any
from dataclasses import dataclass, asdict
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm

# Add tasks directory to path
sys.path.insert(0, '/home/user/notebooks/tasks/password-game')
from game import PasswordGame, rules

print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")

# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

## 2. Configuration

In [None]:
@dataclass
class EvalConfig:
    """Baseline evaluation configuration."""
    
    # Model
    model_name: str = "Qwen/Qwen3-0.6B"
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    precision: str = "bfloat16"  # bfloat16, float16, float32
    
    # Evaluation
    num_episodes: int = 20  # Number of independent game sessions
    max_steps_per_episode: int = 30  # Max password attempts per game
    
    # Generation
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 50
    max_new_tokens: int = 256
    
    # Prompting
    use_chat_template: bool = True
    system_prompt: str = """You are an AI assistant playing the Password Game. You will receive rules one at a time, and must create a password that satisfies ALL current and previous rules. 

Instructions:
1. Read the current rule carefully
2. Review all previous rules
3. Generate a password satisfying ALL rules
4. ONLY output the password string, nothing else
5. Keep password as short as possible while satisfying rules

Important: Your response should ONLY contain the password - no explanations, no thinking, just the password string."""
    
    # Output
    output_dir: str = f"./baseline_eval_{int(time.time())}"
    save_episodes: bool = True  # Save individual episode data
    
    # Reproducibility
    seed: int = 42
    
    def __post_init__(self):
        os.makedirs(self.output_dir, exist_ok=True)

config = EvalConfig()

print("=" * 80)
print("BASELINE EVALUATION CONFIGURATION")
print("=" * 80)
print(f"Model: {config.model_name}")
print(f"Device: {config.device}")
print(f"Episodes: {config.num_episodes}")
print(f"Max Steps/Episode: {config.max_steps_per_episode}")
print(f"Temperature: {config.temperature}")
print(f"Output: {config.output_dir}")
print("=" * 80)

# Save config
with open(os.path.join(config.output_dir, "config.json"), "w") as f:
    json.dump(asdict(config), f, indent=2)

In [None]:
def set_seed(seed: int):
    """Set random seed for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(config.seed)
print(f"✓ Random seed set to {config.seed}")

## 3. Load Model

In [None]:
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    config.model_name,
    trust_remote_code=True
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"✓ Tokenizer loaded (vocab size: {len(tokenizer)})")
print(f"  - Pad token: {tokenizer.pad_token}")
print(f"  - EOS token: {tokenizer.eos_token}")

In [None]:
print("Loading model...")
dtype_map = {
    "bfloat16": torch.bfloat16,
    "float16": torch.float16,
    "float32": torch.float32
}
dtype = dtype_map[config.precision]

model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True
)

model.eval()  # Set to evaluation mode

num_params = sum(p.numel() for p in model.parameters())
print(f"✓ Model loaded ({num_params/1e9:.2f}B parameters)")
print(f"  - Device: {next(model.parameters()).device}")
print(f"  - Dtype: {next(model.parameters()).dtype}")

## 4. Prompt Engineering

In [None]:
def build_prompt(game: PasswordGame, step: int) -> str:
    """
    Build prompt for current game state using Qwen chat template.
    
    Args:
        game: PasswordGame instance
        step: Current step number
    
    Returns:
        Formatted prompt string
    """
    state = game.get_minimal_game_state()
    
    # Build user message with all rules
    user_msg = f"""Step {step + 1}: Create a password satisfying these rules:

"""
    
    for i, rule in enumerate(state['all_rules']):
        user_msg += f"{i + 1}. {rule}\n"
    
    user_msg += "\nYour password:"
    
    # Use Qwen chat template
    if config.use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
        messages = [
            {"role": "system", "content": config.system_prompt},
            {"role": "user", "content": user_msg}
        ]
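        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            # Qwen3's chat template accepts enable_thinking; turn it off so the
            # reply starts with the password rather than a <think> block
            enable_thinking=False
        )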
    else:
        # Fallback format
        prompt = f"System: {config.system_prompt}\n\nUser: {user_msg}\n\nAssistant:"
    
    return prompt

# Test prompt generation
test_game = PasswordGame()
test_prompt = build_prompt(test_game, 0)
print("Sample prompt (first 500 chars):")
print("=" * 80)
print(test_prompt[:500] + "...")
print("=" * 80)

In [None]:
def generate_password(prompt: str) -> str:
    """
    Generate password from prompt using model.
    
    Args:
        prompt: Formatted prompt string
    
    Returns:
        Generated password string (cleaned)
    """
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=2048
    ).to(model.device)  # inputs follow the model's device (device_map="auto")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=config.max_new_tokens,
            temperature=config.temperature,
            top_p=config.top_p,
            top_k=config.top_k,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the generated part
    generated_ids = outputs[0][inputs.input_ids.shape[1]:]
    password = tokenizer.decode(generated_ids, skip_special_tokens=True)
    
    # Clean up the password - extract just the password string
    password = password.strip()
    
    # If model added explanation, try to extract just password
    # Look for common patterns
    if '\n' in password:
        password = password.split('\n')[0].strip()
    
    # Remove common prefixes
    for prefix in ['Password:', 'password:', 'Answer:', 'answer:']:
        if password.startswith(prefix):
            password = password[len(prefix):].strip()
    
    return password

# Test generation
test_password = generate_password(test_prompt)
print(f"Test password: '{test_password}'")
print(f"Length: {len(test_password)}")
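
Qwen3 chat models can emit a `<think>...</think>` block before the answer when thinking mode is active, in which case taking the first line of the raw output would return the tag rather than a password. The sketch below is one possible, more defensive cleaner; `extract_password` is a hypothetical helper, not part of the notebook or of any library, and it assumes the tags survive decoding as plain text.

```python
import re

def extract_password(raw: str) -> str:
    """Hypothetical cleaner: drop <think> blocks, then keep the first non-empty line."""
    # Remove complete <think>...</think> blocks (DOTALL lets '.' span newlines)
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # An unclosed <think> means generation stopped mid-reasoning: no answer to extract
    if "<think>" in text:
        return ""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    # Strip a leading label such as "Password:" or "Answer:" (case-insensitive)
    return re.sub(r"^(password|answer)\s*:\s*", "", lines[0], flags=re.IGNORECASE)

print(extract_password("<think>\nneeds a digit\n</think>\n\nPassword: Abc123!"))
```

Returning an empty string for a truncated reasoning block keeps the episode loop running while making the failure visible in the recorded passwords.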

## 5. Episode Evaluation

In [None]:
@dataclass
class EpisodeResult:
    """Results from a single episode."""
    episode_id: int
    success: bool
    final_reward: float
    rules_satisfied: int
    total_rules: int
    final_password: str
    password_length: int
    steps_taken: int
    rule_progression: List[int]  # Rules satisfied at each step
    passwords: List[str]  # All attempted passwords
    rule_feedback: Dict  # Detailed rule feedback
    game_state: Dict  # Final game state

def run_episode(episode_id: int, verbose: bool = False) -> EpisodeResult:
    """
    Run a single evaluation episode.
    
    Args:
        episode_id: Episode number
        verbose: Print detailed progress
    
    Returns:
        EpisodeResult with metrics and data
    """
    game = PasswordGame()
    
    passwords = []
    rule_progression = []
    
    if verbose:
        print(f"\n{'='*80}")
        print(f"Episode {episode_id}")
        print(f"{'='*80}")
        print(f"Captcha: {game.captcha}")
        print(f"Country: {game.country}")
        print(f"Wordle: {game.wordle_answer}")
        print(f"Moon: {game.moon_phase}")
    
    current_password = ""
    
    for step in range(config.max_steps_per_episode):
        if not game.game_active:
            break
        
        # Generate password
        prompt = build_prompt(game, step)
        current_password = generate_password(prompt)
        passwords.append(current_password)
        
        # Get feedback
        feedback = game.get_rule_feedback(current_password)
        rule_progression.append(feedback['total_passing'])
        
        if verbose:
            print(f"\nStep {step + 1}:")
            print(f"  Password: {current_password[:50]}{'...' if len(current_password) > 50 else ''}")
            print(f"  Length: {len(current_password)}")
            print(f"  Rules satisfied: {feedback['total_passing']}/{len(feedback['rules_checked'])}")
            print(f"  Reward: {feedback['reward']:.2f}")
        
        # Check if all current rules are satisfied
        current_rule = game.get_current_rule()
        all_satisfied = all(r['passes'] for r in feedback['rules_checked'])
        
        if all_satisfied:
            if verbose:
                print(f"  ✓ All rules satisfied! Advancing...")
            game.advance_rule()
        else:
            if verbose:
                failed_rules = [r for r in feedback['rules_checked'] if not r['passes']]
                print(f"  ✗ Failed {len(failed_rules)} rules:")
                for r in failed_rules[:3]:  # Show first 3 failures
                    print(f"    - Rule {r['rule_index']}: {r['rule_text'][:60]}...")
    
    # Final evaluation
    final_feedback = game.get_rule_feedback(current_password)
    final_reward = game.calculate_reward(current_password)
    
    success = not game.game_active  # Game ends when all rules satisfied
    
    result = EpisodeResult(
        episode_id=episode_id,
        success=success,
        final_reward=final_reward,
        rules_satisfied=final_feedback['total_passing'],
        total_rules=len(rules),
        final_password=current_password,
        password_length=len(current_password),
        steps_taken=len(passwords),
        rule_progression=rule_progression,
        passwords=passwords,
        rule_feedback=final_feedback,
        game_state=game.get_game_state()
    )
    
    if verbose:
        print(f"\n{'='*80}")
        print(f"Episode {episode_id} Complete")
        print(f"  Success: {success}")
        print(f"  Rules: {result.rules_satisfied}/{result.total_rules}")
        print(f"  Reward: {result.final_reward:.2f}")
        print(f"  Password length: {result.password_length}")
        print(f"  Steps: {result.steps_taken}")
        print(f"{'='*80}")
    
    return result

# Test single episode
print("Running test episode...")
test_result = run_episode(0, verbose=True)

## 6. Run Full Evaluation

In [None]:
print("=" * 80)
print(f"RUNNING {config.num_episodes} EVALUATION EPISODES")
print("=" * 80)

results = []

for episode_id in tqdm(range(config.num_episodes), desc="Episodes"):
    result = run_episode(episode_id, verbose=False)
    results.append(result)
    
    # Save individual episode if configured
    if config.save_episodes:
        episode_dir = os.path.join(config.output_dir, "episodes")
        os.makedirs(episode_dir, exist_ok=True)
        
        episode_data = {
            "episode_id": result.episode_id,
            "success": result.success,
            "final_reward": result.final_reward,
            "rules_satisfied": result.rules_satisfied,
            "total_rules": result.total_rules,
            "final_password": result.final_password,
            "password_length": result.password_length,
            "steps_taken": result.steps_taken,
            "rule_progression": result.rule_progression,
            "passwords": result.passwords,
            "rule_feedback": result.rule_feedback
        }
        
        with open(os.path.join(episode_dir, f"episode_{episode_id:03d}.json"), "w") as f:
            json.dump(episode_data, f, indent=2)

print("\n✓ Evaluation complete!")

## 7. Compute Metrics

In [None]:
# Aggregate metrics
metrics = {
    "num_episodes": len(results),
    "success_rate": sum(r.success for r in results) / len(results),
    "avg_rules_satisfied": np.mean([r.rules_satisfied for r in results]),
    "std_rules_satisfied": np.std([r.rules_satisfied for r in results]),
    "max_rules_satisfied": max(r.rules_satisfied for r in results),
    "min_rules_satisfied": min(r.rules_satisfied for r in results),
    "avg_final_reward": np.mean([r.final_reward for r in results]),
    "std_final_reward": np.std([r.final_reward for r in results]),
    "avg_password_length": np.mean([r.password_length for r in results]),
    "std_password_length": np.std([r.password_length for r in results]),
    "avg_steps_taken": np.mean([r.steps_taken for r in results]),
    "std_steps_taken": np.std([r.steps_taken for r in results]),
}

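# Rule-by-rule analysis: count passes per rule. Adding int(passes) also registers
# rules that never pass, so they show up with a 0% rate instead of being omitted.
rule_success_rates = defaultdict(int)
for result in results:
    for rule_check in result.rule_feedback['rules_checked']:
        rule_success_rates[rule_check['rule_index']] += int(rule_check['passes'])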

rule_success_rates = {
    rule_idx: count / len(results)
    for rule_idx, count in rule_success_rates.items()
}

metrics['rule_success_rates'] = rule_success_rates

# Print summary
print("=" * 80)
print("BASELINE EVALUATION RESULTS")
print("=" * 80)
print(f"Episodes: {metrics['num_episodes']}")
print(f"\nSuccess Rate: {metrics['success_rate']*100:.1f}%")
print(f"\nRules Satisfied: {metrics['avg_rules_satisfied']:.2f} ± {metrics['std_rules_satisfied']:.2f}")
print(f"  - Min: {metrics['min_rules_satisfied']}")
print(f"  - Max: {metrics['max_rules_satisfied']}")
print(f"  - Total Rules: {results[0].total_rules}")
print(f"\nFinal Reward: {metrics['avg_final_reward']:.2f} ± {metrics['std_final_reward']:.2f}")
print(f"\nPassword Length: {metrics['avg_password_length']:.1f} ± {metrics['std_password_length']:.1f}")
print(f"\nSteps Taken: {metrics['avg_steps_taken']:.1f} ± {metrics['std_steps_taken']:.1f}")
print("=" * 80)

# Save metrics
with open(os.path.join(config.output_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f, indent=2)

print(f"\n✓ Metrics saved to {config.output_dir}/metrics.json")
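
With only 20 episodes, the success rate is a coarse estimate, so it can help to report an interval alongside it. The sketch below computes a Wilson score interval for a binomial proportion; it is an addition for illustration, not part of the original metrics, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(0, 20)
print(f"0/20 successes -> approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

Even zero successes in 20 episodes leaves a non-trivial upper bound, which is worth keeping in mind when comparing against a post-training run of the same size.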

## 8. Visualization

In [None]:
# Create summary plots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Password Game Baseline Evaluation - Qwen3-0.6B', fontsize=16, fontweight='bold')

# 1. Rules Satisfied Distribution
ax = axes[0, 0]
rules_satisfied = [r.rules_satisfied for r in results]
ax.hist(rules_satisfied, bins=20, edgecolor='black', alpha=0.7)
ax.axvline(metrics['avg_rules_satisfied'], color='red', linestyle='--', linewidth=2, label=f"Mean: {metrics['avg_rules_satisfied']:.1f}")
ax.set_xlabel('Rules Satisfied')
ax.set_ylabel('Frequency')
ax.set_title('Rules Satisfied Distribution')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Final Reward Distribution
ax = axes[0, 1]
final_rewards = [r.final_reward for r in results]
ax.hist(final_rewards, bins=20, edgecolor='black', alpha=0.7, color='green')
ax.axvline(metrics['avg_final_reward'], color='red', linestyle='--', linewidth=2, label=f"Mean: {metrics['avg_final_reward']:.2f}")
ax.set_xlabel('Final Reward')
ax.set_ylabel('Frequency')
ax.set_title('Final Reward Distribution')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Password Length Distribution
ax = axes[0, 2]
password_lengths = [r.password_length for r in results]
ax.hist(password_lengths, bins=20, edgecolor='black', alpha=0.7, color='orange')
ax.axvline(metrics['avg_password_length'], color='red', linestyle='--', linewidth=2, label=f"Mean: {metrics['avg_password_length']:.1f}")
ax.set_xlabel('Password Length')
ax.set_ylabel('Frequency')
ax.set_title('Password Length Distribution')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Success Rate
ax = axes[1, 0]
success_count = sum(r.success for r in results)
fail_count = len(results) - success_count
ax.bar(['Success', 'Failure'], [success_count, fail_count], color=['green', 'red'], alpha=0.7, edgecolor='black')
ax.set_ylabel('Count')
ax.set_title(f'Success Rate: {metrics["success_rate"]*100:.1f}%')
ax.grid(True, alpha=0.3, axis='y')

# 5. Steps Taken Distribution
ax = axes[1, 1]
steps_taken = [r.steps_taken for r in results]
ax.hist(steps_taken, bins=15, edgecolor='black', alpha=0.7, color='purple')
ax.axvline(metrics['avg_steps_taken'], color='red', linestyle='--', linewidth=2, label=f"Mean: {metrics['avg_steps_taken']:.1f}")
ax.set_xlabel('Steps Taken')
ax.set_ylabel('Frequency')
ax.set_title('Steps Taken Distribution')
ax.legend()
ax.grid(True, alpha=0.3)

# 6. Rule Progression (average across episodes)
ax = axes[1, 2]
max_steps = max(len(r.rule_progression) for r in results)
avg_progression = []
for step in range(max_steps):
    step_values = [r.rule_progression[step] for r in results if step < len(r.rule_progression)]
    if step_values:
        avg_progression.append(np.mean(step_values))

ax.plot(range(len(avg_progression)), avg_progression, marker='o', linewidth=2)
ax.set_xlabel('Step')
ax.set_ylabel('Avg Rules Satisfied')
ax.set_title('Average Rule Progression')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(config.output_dir, 'summary_plots.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✓ Summary plots saved to {config.output_dir}/summary_plots.png")

In [None]:
# Rule-by-rule performance
fig, ax = plt.subplots(figsize=(16, 8))

rule_indices = sorted(rule_success_rates.keys())
success_rates = [rule_success_rates[i] * 100 for i in rule_indices]

colors = ['green' if rate > 50 else 'orange' if rate > 25 else 'red' for rate in success_rates]

bars = ax.bar(rule_indices, success_rates, color=colors, alpha=0.7, edgecolor='black')

ax.axhline(50, color='gray', linestyle='--', linewidth=1, alpha=0.5, label='50% threshold')
ax.set_xlabel('Rule Index', fontsize=12)
ax.set_ylabel('Success Rate (%)', fontsize=12)
ax.set_title('Success Rate by Rule (Baseline)', fontsize=14, fontweight='bold')
ax.set_xticks(rule_indices)
ax.set_xticklabels(rule_indices, rotation=45)
ax.grid(True, alpha=0.3, axis='y')
ax.legend()

# Add value labels on bars
for i, (idx, rate) in enumerate(zip(rule_indices, success_rates)):
    ax.text(idx, rate + 1, f'{rate:.0f}%', ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.savefig(os.path.join(config.output_dir, 'rule_performance.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✓ Rule performance plot saved to {config.output_dir}/rule_performance.png")

In [None]:
# Correlation analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Rules vs Reward
ax = axes[0]
rules_satisfied = [r.rules_satisfied for r in results]
final_rewards = [r.final_reward for r in results]
ax.scatter(rules_satisfied, final_rewards, alpha=0.6, s=100, edgecolor='black')
ax.set_xlabel('Rules Satisfied')
ax.set_ylabel('Final Reward')
ax.set_title('Rules Satisfied vs Final Reward')
ax.grid(True, alpha=0.3)

# Correlation coefficient
corr = np.corrcoef(rules_satisfied, final_rewards)[0, 1]
ax.text(0.05, 0.95, f'Correlation: {corr:.3f}', transform=ax.transAxes, 
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
        verticalalignment='top')

# Password Length vs Rules
ax = axes[1]
password_lengths = [r.password_length for r in results]
ax.scatter(password_lengths, rules_satisfied, alpha=0.6, s=100, edgecolor='black', color='orange')
ax.set_xlabel('Password Length')
ax.set_ylabel('Rules Satisfied')
ax.set_title('Password Length vs Rules Satisfied')
ax.grid(True, alpha=0.3)

# Correlation coefficient
corr = np.corrcoef(password_lengths, rules_satisfied)[0, 1]
ax.text(0.05, 0.95, f'Correlation: {corr:.3f}', transform=ax.transAxes, 
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5),
        verticalalignment='top')

plt.tight_layout()
plt.savefig(os.path.join(config.output_dir, 'correlations.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✓ Correlation plots saved to {config.output_dir}/correlations.png")

## 9. Results Summary Table

In [None]:
# Create detailed results table
results_data = []
for r in results:
    results_data.append({
        'Episode': r.episode_id,
        'Success': '✓' if r.success else '✗',
        'Rules': f"{r.rules_satisfied}/{r.total_rules}",
        'Reward': f"{r.final_reward:.2f}",
        'Password Length': r.password_length,
        'Steps': r.steps_taken,
        'Final Password (preview)': r.final_password[:50] + '...' if len(r.final_password) > 50 else r.final_password
    })

df = pd.DataFrame(results_data)

# Display first 10 episodes
print("\nResults Summary (first 10 episodes):")
print(df.head(10).to_string(index=False))

# Save full table
df.to_csv(os.path.join(config.output_dir, 'results_table.csv'), index=False)
print(f"\n✓ Full results table saved to {config.output_dir}/results_table.csv")

# Summary statistics table
summary_data = {
    'Metric': [
        'Success Rate',
        'Avg Rules Satisfied',
        'Avg Final Reward',
        'Avg Password Length',
        'Avg Steps Taken'
    ],
    'Value': [
        f"{metrics['success_rate']*100:.1f}%",
        f"{metrics['avg_rules_satisfied']:.2f} ± {metrics['std_rules_satisfied']:.2f}",
        f"{metrics['avg_final_reward']:.2f} ± {metrics['std_final_reward']:.2f}",
        f"{metrics['avg_password_length']:.1f} ± {metrics['std_password_length']:.1f}",
        f"{metrics['avg_steps_taken']:.1f} ± {metrics['std_steps_taken']:.1f}"
    ]
}

summary_df = pd.DataFrame(summary_data)
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)

summary_df.to_csv(os.path.join(config.output_dir, 'summary_statistics.csv'), index=False)

## 10. Best & Worst Episodes Analysis

In [None]:
# Find best and worst episodes
best_episode = max(results, key=lambda r: r.rules_satisfied)
worst_episode = min(results, key=lambda r: r.rules_satisfied)

print("=" * 80)
print("BEST EPISODE")
print("=" * 80)
print(f"Episode ID: {best_episode.episode_id}")
print(f"Success: {best_episode.success}")
print(f"Rules Satisfied: {best_episode.rules_satisfied}/{best_episode.total_rules}")
print(f"Final Reward: {best_episode.final_reward:.2f}")
print(f"Password Length: {best_episode.password_length}")
print(f"Steps Taken: {best_episode.steps_taken}")
print(f"\nFinal Password: {best_episode.final_password}")
print(f"\nRule Progression: {best_episode.rule_progression}")

print("\n" + "=" * 80)
print("WORST EPISODE")
print("=" * 80)
print(f"Episode ID: {worst_episode.episode_id}")
print(f"Success: {worst_episode.success}")
print(f"Rules Satisfied: {worst_episode.rules_satisfied}/{worst_episode.total_rules}")
print(f"Final Reward: {worst_episode.final_reward:.2f}")
print(f"Password Length: {worst_episode.password_length}")
print(f"Steps Taken: {worst_episode.steps_taken}")
print(f"\nFinal Password: {worst_episode.final_password}")
print(f"\nRule Progression: {worst_episode.rule_progression}")

# Failed rules in worst episode
print("\nFailed Rules in Worst Episode:")
for rule_check in worst_episode.rule_feedback['rules_checked']:
    if not rule_check['passes']:
        print(f"  ✗ Rule {rule_check['rule_index']}: {rule_check['rule_text']}")

print("=" * 80)

## 11. Export for Post-Training Comparison

In [None]:
# Create comprehensive export for post-training comparison
baseline_export = {
    "timestamp": datetime.now().isoformat(),
    "config": asdict(config),
    "model": {
        "name": config.model_name,
        "num_parameters": num_params,
        "dtype": str(dtype)
    },
    "metrics": {
        "num_episodes": metrics['num_episodes'],
        "success_rate": metrics['success_rate'],
        "avg_rules_satisfied": metrics['avg_rules_satisfied'],
        "std_rules_satisfied": metrics['std_rules_satisfied'],
        "max_rules_satisfied": metrics['max_rules_satisfied'],
        "min_rules_satisfied": metrics['min_rules_satisfied'],
        "avg_final_reward": metrics['avg_final_reward'],
        "std_final_reward": metrics['std_final_reward'],
        "avg_password_length": metrics['avg_password_length'],
        "std_password_length": metrics['std_password_length'],
        "avg_steps_taken": metrics['avg_steps_taken'],
        "std_steps_taken": metrics['std_steps_taken']
    },
    "rule_performance": rule_success_rates,
    "episode_results": [
        {
            "episode_id": r.episode_id,
            "success": r.success,
            "rules_satisfied": r.rules_satisfied,
            "final_reward": r.final_reward,
            "password_length": r.password_length,
            "steps_taken": r.steps_taken
        }
        for r in results
    ]
}

# Save baseline export
export_path = os.path.join(config.output_dir, "baseline_export.json")
with open(export_path, "w") as f:
    json.dump(baseline_export, f, indent=2)

print(f"✓ Baseline export saved to {export_path}")
print(f"\nUse this file to compare with post-training results!")
print(f"\nAll results saved to: {config.output_dir}")
print("\nFiles generated:")
print("  - config.json: Evaluation configuration")
print("  - metrics.json: Aggregated metrics")
print("  - baseline_export.json: Full baseline data for comparison")
print("  - summary_plots.png: Visual summary")
print("  - rule_performance.png: Per-rule success rates")
print("  - correlations.png: Correlation analysis")
print("  - results_table.csv: Detailed episode results")
print("  - summary_statistics.csv: Summary statistics")
print("  - episodes/: Individual episode data (JSON)")

## 12. Conclusion

The baseline evaluation is complete. Key points to read off the results:

1. **Model Capability**: How far the pretrained Qwen3-0.6B gets without fine-tuning (see success rate and average rules satisfied in `metrics.json`)
2. **Rule Difficulty**: Which rules have the lowest success rates (see `rule_performance.png`)
3. **Password Complexity**: How password length trades off against rules satisfied under the length penalty (see `correlations.png`)
4. **Training Opportunity**: The gap between these baseline numbers and a perfect score is the headroom available to RL training

### Next Steps:
1. Use this baseline to train a policy with PPO/GRPO
2. Compare post-training results using `baseline_export.json`
3. Analyze which rules improved most after training
4. Iterate on reward function if needed

### Using the Baseline:
```python
import json

# Load baseline for comparison (the file is written to config.output_dir)
with open('baseline_export.json', 'r') as f:
    baseline = json.load(f)

baseline_reward = baseline['metrics']['avg_final_reward']
baseline_success = baseline['metrics']['success_rate']

# Compare with post-training; trained_reward and trained_success are the same
# metrics computed from the post-training evaluation run
improvement_reward = trained_reward - baseline_reward
improvement_success = trained_success - baseline_success
```
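
To see which rules improved most after training, the per-rule success rates stored under `rule_performance` in `baseline_export.json` can be compared against the same dictionary from a post-training run. The sketch below is one way to do that; `rule_improvements` is a hypothetical helper, and the rates in the usage example are made up purely for illustration.

```python
from typing import Dict, List, Tuple

def rule_improvements(baseline_rates: Dict, trained_rates: Dict) -> List[Tuple[str, float]]:
    """Return (rule_index, delta) pairs sorted by largest improvement first."""
    # Normalise keys to strings: JSON stores dict keys as strings even if they were ints
    base = {str(k): v for k, v in baseline_rates.items()}
    trained = {str(k): v for k, v in trained_rates.items()}
    # Rules missing from one side are treated as a 0.0 success rate
    all_rules = set(base) | set(trained)
    deltas = [(r, trained.get(r, 0.0) - base.get(r, 0.0)) for r in all_rules]
    return sorted(deltas, key=lambda item: item[1], reverse=True)

# Made-up rates, for illustration only
example_baseline = {"0": 0.9, "1": 0.4, "2": 0.0}
example_trained = {"0": 0.95, "1": 0.8, "2": 0.1}
for rule, delta in rule_improvements(example_baseline, example_trained):
    print(f"Rule {rule}: {delta:+.2f}")
```

Normalising keys matters because the in-memory dictionary is keyed by whatever `rule_index` is, while the reloaded JSON is always keyed by strings.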