# Deckbuilding Model Evaluation

This notebook evaluates the deckbuilding model on validation data by comparing the decks it predicts against the decks human players built from the same pools.

## Metrics

1. **Card accuracy**: Percentage of individual cards that match between the predicted deck and the human deck
2. **Difference distribution**: Histogram of how many examples have a predicted deck that differs from the human deck by N cards
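
As a rough illustration of the two metrics, the sketch below computes matches for one deck from per-card count vectors and tallies a difference histogram over several examples. It assumes a copy counts as matched when both decks contain it (element-wise minimum) and that the total is the human deck size; the actual `sdb.compute_card_accuracy` may define these differently. The names `card_matches` and `difference_histogram` are hypothetical helpers, not part of the package.

```python
import numpy as np


def card_matches(predicted, human):
    """Return (matches, total) for two per-card count vectors.

    Assumption: a copy is matched when both decks contain it, so matches is
    the sum of element-wise minimums and total is the human deck size.
    """
    predicted = np.asarray(predicted, dtype=int)
    human = np.asarray(human, dtype=int)
    matches = int(np.minimum(predicted, human).sum())
    total = int(human.sum())
    return matches, total


def difference_histogram(num_different):
    """Entry N of the result is the number of examples with N cards different."""
    return np.bincount(np.asarray(num_different, dtype=int))


# Toy example: a 5-card vocabulary and 4-card decks.
human = [2, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 0]
matches, total = card_matches(predicted, human)
print(f"{matches}/{total} matched, {total - matches} different")
print(difference_histogram([0, 1, 1, 3]))
```

In the toy example the predicted deck swaps one copy of the first card for the fourth card, so three of the four human cards are matched and one differs.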

## Setup and Imports

In [1]:
import sys
sys.path.append('..')

import torch
import statisticaldeckbuild as sdb
import matplotlib.pyplot as plt
import numpy as np

# Enable inline plotting
%matplotlib inline

## Configuration

In [2]:
# Set configuration
SET_ABBREVIATION = "FDN"
DRAFT_MODE = "Premier"

# Paths
VAL_DATASET_PATH = f"../data/training_sets/{SET_ABBREVIATION}_{DRAFT_MODE}_deckbuild_val.pth"
MODEL_FOLDER = "../data/models/"
CARDS_FOLDER = "../data/cards/"

## Load Validation Dataset

In [3]:
# Load validation dataset
val_dataset = torch.load(VAL_DATASET_PATH, weights_only=False)

print("Loaded validation dataset:")
print(f"  Number of examples: {len(val_dataset)}")
print(f"  Number of cards: {len(val_dataset.cardnames)}")
print(f"  Deck shape: {val_dataset.decks.shape}")
print(f"  Sideboard shape: {val_dataset.sideboards.shape}")

Loaded validation dataset:
  Number of examples: 7483
  Number of cards: 281
  Deck shape: (7483, 281)
  Sideboard shape: (7483, 281)


## Initialize Deckbuilder

In [4]:
# Initialize the deck builder
builder = sdb.IterativeDeckBuilder(
    set_abbreviation=SET_ABBREVIATION,
    draft_mode=DRAFT_MODE,
    model_folder=MODEL_FOLDER,
    cards_folder=CARDS_FOLDER,
)

print("Initialized IterativeDeckBuilder")
print(f"  Device: {builder.device}")
print(f"  Number of cards: {builder.num_cards}")

AttributeError: module 'statisticaldeckbuild' has no attribute 'IterativeDeckBuilder'
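
The error above means the imported `statisticaldeckbuild` module does not expose `IterativeDeckBuilder` at its top level. One way to investigate is to list what the module does expose and look for similarly named attributes. The sketch below does this with the standard library only; `public_names` and `suggest_names` are hypothetical helpers, and the stand-in module exists only so the example runs without the real package.

```python
import difflib
import types


def public_names(module):
    """List the non-underscore attributes a module exposes."""
    return sorted(name for name in dir(module) if not name.startswith("_"))


def suggest_names(module, wanted, n=3):
    """Return exposed names that look similar to the one that failed."""
    return difflib.get_close_matches(wanted, public_names(module), n=n)


# Stand-in module so the sketch runs without the real package. Its attribute
# names are placeholders and say nothing about what statisticaldeckbuild exports.
demo = types.ModuleType("demo")
demo.DeckBuilder = object
demo.print_summary = lambda results: None

print(public_names(demo))
print(suggest_names(demo, "IterativeDeckBuilder"))
```

In the notebook, calling `suggest_names(sdb, "IterativeDeckBuilder")` would show whether the class is available under a different name. If nothing similar turns up, the class may live in a submodule or the installed package version may differ from the one this notebook was written for.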

## Quick Test - Evaluate on Small Subset

First, let's run on a small subset to verify everything works correctly.

In [None]:
# Run evaluation on small subset
results_small = sdb.evaluate_deckbuilder(
    val_dataset=val_dataset,
    builder=builder,
    max_examples=10,  # Quick test with just 10 examples
    progress_interval=5,
    verbose=True,
    save_results=False,  # Don't save for quick test
)

In [None]:
# Print summary of small test
sdb.print_summary(results_small)

## Medium Test - 100 Examples

Now let's run on a larger subset to get more reliable statistics.

In [None]:
# Run evaluation on 100 examples
results_medium = sdb.evaluate_deckbuilder(
    val_dataset=val_dataset,
    builder=builder,
    max_examples=100,
    progress_interval=10,
    verbose=True,
    save_results=True,  # Save results
    output_path=f"eval_{SET_ABBREVIATION}_{DRAFT_MODE}_100examples.json",
)

In [None]:
# Print detailed summary
sdb.print_summary(results_medium)

In [None]:
# Plot difference distribution
sdb.plot_difference_distribution(
    results_medium,
    save_path=f"eval_{SET_ABBREVIATION}_{DRAFT_MODE}_100examples_distribution.png"
)

## Inspect Individual Examples

Let's look at some individual examples to understand what the model is doing.

In [None]:
# Look at per-example results
per_example = results_medium["per_example_results"]

# Sort by number of cards different
sorted_examples = sorted(per_example, key=lambda x: x["num_different"])

print("Best 5 predictions (fewest cards different):")
for ex in sorted_examples[:5]:
    print(f"  Example {ex['index']:3d}: {ex['num_different']:2d} cards different, {ex['num_matches']:2d} matches")

print("\nWorst 5 predictions (most cards different):")
for ex in sorted_examples[-5:]:
    print(f"  Example {ex['index']:3d}: {ex['num_different']:2d} cards different, {ex['num_matches']:2d} matches")

## Detailed Example Inspection

Let's look at a specific example in detail.

In [None]:
# Pick an example to inspect (e.g., the best one)
example_idx = sorted_examples[0]["index"]

# Get the pool and build the deck
pool = sdb.evaluate.pool_from_dataset_example(val_dataset, example_idx)
result = builder.build_deck(pool, target_deck_size=23, verbose=True)

print(f"\nExample {example_idx}:")
print(f"Pool size: {len(pool)} cards")

In [None]:
# Print the predicted deck
builder.print_deck_and_sideboard(result)

In [None]:
# Compare with human deck
human_deck = val_dataset.decks[example_idx]
predicted_deck = sdb.evaluate.predicted_deck_to_counts(result, val_dataset.cardnames)

print("\nComparison with human deck:")
print(f"{'Card Name':<30} {'Human':>6} {'Predicted':>9} {'Match':>6}")
print("-" * 55)

for i, card_name in enumerate(val_dataset.cardnames):
    if human_deck[i] > 0 or predicted_deck[i] > 0:
        human_count = int(human_deck[i])
        pred_count = int(predicted_deck[i])
        match = "✓" if human_count == pred_count else "✗"
        print(f"{card_name:<30} {human_count:>6} {pred_count:>9} {match:>6}")

matches, total = sdb.compute_card_accuracy(predicted_deck, human_deck)
print(f"\nMatches: {matches}/{total} ({100*matches/total:.1f}%)")

## Full Evaluation (Optional)

Run evaluation on the full validation set. This may take a while depending on the dataset size.
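
To get a rough idea of how long the full run might take, one option is to time a small subset and scale linearly. The sketch below assumes the cost per example is roughly constant, which may not hold if model loading dominates short runs; `estimate_full_runtime` is a hypothetical helper, not part of the package.

```python
import time


def estimate_full_runtime(run_subset, subset_size, full_size):
    """Time one subset run and scale the elapsed seconds linearly to full_size."""
    start = time.perf_counter()
    run_subset()
    elapsed = time.perf_counter() - start
    return elapsed * full_size / subset_size


# Stand-in workload so the sketch runs on its own; in the notebook the callable
# would wrap the 10-example evaluation call instead.
estimate = estimate_full_runtime(lambda: time.sleep(0.01), subset_size=10, full_size=7483)
print(f"Estimated full run: {estimate:.1f} s")
```

Here `full_size=7483` is the number of validation examples reported when the dataset was loaded.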

In [None]:
# Uncomment to run full evaluation
# results_full = sdb.evaluate_deckbuilder(
#     val_dataset=val_dataset,
#     builder=builder,
#     max_examples=None,  # Evaluate all examples
#     progress_interval=100,
#     verbose=True,
#     save_results=True,
#     output_path=f"eval_{SET_ABBREVIATION}_{DRAFT_MODE}_full.json",
# )

In [None]:
# Uncomment to print full results
# sdb.print_summary(results_full)

In [None]:
# Uncomment to plot full results
# sdb.plot_difference_distribution(
#     results_full,
#     save_path=f"eval_{SET_ABBREVIATION}_{DRAFT_MODE}_full_distribution.png"
# )

## Analysis and Conclusions

Based on the evaluation results, analyze:
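- Is the card accuracy above 50%? (A baseline that picks cards at random from the full card list would score near 0%; picking at random from the pool itself would likely score considerably higher, so it is worth comparing against that as well.)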
- What is the typical number of cards different?
- Where does the difference distribution peak?
- Are there any outliers with very high difference counts?
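
The questions above can be answered from the per-example records. The sketch below assumes each record carries the `index`, `num_different`, and `num_matches` keys used earlier in this notebook, and that matches plus differences equals the deck size. The `summarize` helper and its outlier rule (more than twice the median difference) are hypothetical choices for illustration.

```python
import numpy as np


def summarize(per_example, outlier_factor=2.0):
    """Summarize records that carry index, num_matches and num_different."""
    diffs = np.array([ex["num_different"] for ex in per_example], dtype=int)
    matches = np.array([ex["num_matches"] for ex in per_example], dtype=int)
    totals = matches + diffs  # assumption: matches + different = deck size
    counts = np.bincount(diffs)  # counts[N] = examples with N cards different
    median = float(np.median(diffs))
    # Hypothetical outlier rule: more than outlier_factor times the median.
    threshold = outlier_factor * max(median, 1.0)
    return {
        "card_accuracy": float(matches.sum() / totals.sum()),
        "median_different": median,
        "peak_different": int(counts.argmax()),
        "outlier_indices": [
            ex["index"] for ex in per_example if ex["num_different"] > threshold
        ],
    }


# Made-up records for illustration only, not real evaluation output.
demo_records = [
    {"index": 0, "num_different": 2, "num_matches": 21},
    {"index": 1, "num_different": 3, "num_matches": 20},
    {"index": 2, "num_different": 3, "num_matches": 20},
    {"index": 3, "num_different": 9, "num_matches": 14},
]
print(summarize(demo_records))
```

In the notebook, `summarize(results_medium["per_example_results"])` would apply the same summary to the 100-example run.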

Use this information to guide model improvements.