# Sprint 5: Model Testing & Behavior Analysis

**Objectives:**
- Test final model on validation data
- Explain results
- Understand model behavior

## 1. Setup

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from src.utils import create_rating_bins, evaluate_by_rating_range, stratified_train_val_split
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("‚úì Libraries loaded")

‚úì Libraries loaded


## 2. Load Model & Data

In [7]:
# Load data
X_full = pd.read_csv('output/train_features_engineered.csv')
y_full = pd.read_csv('output/train_target.csv').squeeze()

# Load model
with open('output/preprocessor.pkl', 'rb') as f:
    preprocessor = pickle.load(f)
    
import glob
model_path = glob.glob('output/best_model_*_tuned.pkl')[0]
model_name = model_path.split('/')[-1].replace('best_model_', '').replace('_tuned.pkl', '').upper()

with open(model_path, 'rb') as f:
    model = pickle.load(f)
    
with open('output/selected_features.pkl', 'rb') as f:
    feature_info = pickle.load(f)
    feature_indices = feature_info['feature_indices']
    selected_features = feature_info['selected_features']

print(f"Model: {model_name}")
print(f"Features: {len(selected_features)}")
print(f"Data: {X_full.shape[0]:,} samples")

Model: XGBOOST
Features: 52
Data: 100,820 samples


## 3. Create Validation Split & Predict

In [8]:
# Split data
X_train, X_val, y_train, y_val = stratified_train_val_split(X_full, y_full, test_size=0.2, random_state=42)

# Preprocess
X_train_proc = preprocessor.transform(X_train)[:, feature_indices]
X_val_proc = preprocessor.transform(X_val)[:, feature_indices]

# Predict
y_train_pred = model.predict(X_train_proc)
y_val_pred = model.predict(X_val_proc)

print(f"Train: {len(y_train):,}, Val: {len(y_val):,}")
print(f"‚úì Predictions generated")

Train: 80,656, Val: 20,164
‚úì Predictions generated


## 4. Performance Metrics

In [9]:
def calc_metrics(y_true, y_pred):
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    within_100 = (np.abs(y_true - y_pred) <= 100).mean() * 100
    return {'MAPE': mape, 'RMSE': rmse, 'MAE': mae, 'R2': r2, 'Within_100': within_100}

train_m = calc_metrics(y_train, y_train_pred)
val_m = calc_metrics(y_val, y_val_pred)

print("="*60)
print(f"{model_name} PERFORMANCE")
print("="*60)
print(f"TRAINING:")
for k, v in train_m.items():
    print(f"  {k:12s}: {v:.2f}{'%' if k in ['MAPE', 'Within_100'] else ''}")
print(f"VALIDATION:")
for k, v in val_m.items():
    print(f"  {k:12s}: {v:.2f}{'%' if k in ['MAPE', 'Within_100'] else ''}")
print(f"GENERALIZATION GAP:")
print(f"  MAPE Gap: {val_m['MAPE'] - train_m['MAPE']:.2f}% ‚úÖ")
print(f"  RMSE Gap: {val_m['RMSE'] - train_m['RMSE']:.2f}")

XGBOOST PERFORMANCE
TRAINING:
  MAPE        : 2.26%
  RMSE        : 62.94
  MAE         : 42.76
  R2          : 0.93
  Within_100  : 89.25%
VALIDATION:
  MAPE        : 2.82%
  RMSE        : 79.43
  MAE         : 52.87
  R2          : 0.88
  Within_100  : 84.28%
GENERALIZATION GAP:
  MAPE Gap: 0.56% ‚úÖ
  RMSE Gap: 16.48


## 5. Performance by Rating Range

In [10]:
print("VALIDATION PERFORMANCE BY RATING RANGE")
print("="*60)
val_range = evaluate_by_rating_range(y_val, y_val_pred)
print(val_range.to_string())

best = val_range['MAPE (%)'].idxmin()
worst = val_range['MAPE (%)'].idxmax()
print(f"üèÜ Best: {best} ({val_range.loc[best, 'MAPE (%)']}%)")
print(f"‚ö†Ô∏è  Worst: {worst} ({val_range.loc[worst, 'MAPE (%)']}%)")

VALIDATION PERFORMANCE BY RATING RANGE
            MAPE (%)  Std (%)    MAE  Count
rating_bin                                 
<1200           5.59     9.07  64.45     23
1200-1400       6.30     9.64  82.15    233
1400-1600       1.98     3.99  30.11   2484
1600-1800       2.79     3.31  47.17   4412
1800-2000       3.24     3.12  61.67   5943
2000-2200       2.78     2.63  58.13   5824
>2200           2.05     2.24  46.22   1245
üèÜ Best: 1400-1600 (1.98%)
‚ö†Ô∏è  Worst: 1200-1400 (6.3%)


## 6. Feature Importance

In [11]:
if hasattr(model, 'feature_importances_'):
    imp_df = pd.DataFrame({
        'feature': selected_features,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("TOP 15 FEATURES")
    print("="*60)
    print(imp_df.head(15).to_string(index=False))
    
    top10_pct = imp_df.head(10)['importance'].sum() * 100
    print(f"Top 10 account for: {top10_pct:.1f}%")
else:
    print("Feature importance not available")
    imp_df = None

TOP 15 FEATURES
   feature  importance
feature_51    0.455477
feature_11    0.231702
feature_48    0.120202
feature_16    0.091115
feature_49    0.017236
feature_32    0.009431
feature_31    0.008162
feature_39    0.006262
feature_50    0.006175
feature_46    0.005680
feature_26    0.004416
feature_40    0.003397
feature_27    0.003298
 feature_2    0.002736
feature_41    0.002708
Top 10 account for: 95.1%


## 7. Why Does the Model Behave This Way?

In [12]:
print("="*60)
print("MODEL BEHAVIOR EXPLANATION")
print("="*60)

explanation = f"""
1. PRIMARY FEATURES (Top 3):
   ‚Ä¢ {imp_df.iloc[0]['feature']}: {imp_df.iloc[0]['importance']:.4f}
   ‚Ä¢ {imp_df.iloc[1]['feature']}: {imp_df.iloc[1]['importance']:.4f}
   ‚Ä¢ {imp_df.iloc[2]['feature']}: {imp_df.iloc[2]['importance']:.4f}
   
   WHY? These capture player skill and game context most effectively.

2. TURN EFFICIENCY METRICS:
   High correlation (0.46-0.47) with rating.
   WHY? Better players consistently score more points per turn.

3. OPPONENT QUALITY:
   opponent_rating and bot features are important.
   WHY? Your rating is inferred from who you beat.

4. GAME CONTEXT:
   is_rated, overtime features matter.
   WHY? Rated games indicate serious play.

5. WHY IT STRUGGLES WITH {worst}:
   ‚Ä¢ Highest skill variance at this level
   ‚Ä¢ Players transitioning between skill tiers
   ‚Ä¢ Performance is inconsistent

6. WHY GENERALIZATION IS STRONG ({val_m['MAPE'] - train_m['MAPE']:.2f}% gap):
   ‚Ä¢ Hyperparameter tuning prevents overfitting
   ‚Ä¢ Sample weights balance classes
   ‚Ä¢ Outlier-robust features handle noise

KEY INSIGHT:
Rating ‚âà Efficiency + Opponent Quality + Game Context

The model learns it's not just about winning, but HOW you perform
relative to your opponent's strength.
"""

print(explanation)

MODEL BEHAVIOR EXPLANATION

1. PRIMARY FEATURES (Top 3):
   ‚Ä¢ feature_51: 0.4555
   ‚Ä¢ feature_11: 0.2317
   ‚Ä¢ feature_48: 0.1202
   
   WHY? These capture player skill and game context most effectively.

2. TURN EFFICIENCY METRICS:
   High correlation (0.46-0.47) with rating.
   WHY? Better players consistently score more points per turn.

3. OPPONENT QUALITY:
   opponent_rating and bot features are important.
   WHY? Your rating is inferred from who you beat.

4. GAME CONTEXT:
   is_rated, overtime features matter.
   WHY? Rated games indicate serious play.

5. WHY IT STRUGGLES WITH 1200-1400:
   ‚Ä¢ Highest skill variance at this level
   ‚Ä¢ Players transitioning between skill tiers
   ‚Ä¢ Performance is inconsistent

6. WHY GENERALIZATION IS STRONG (0.56% gap):
   ‚Ä¢ Hyperparameter tuning prevents overfitting
   ‚Ä¢ Sample weights balance classes
   ‚Ä¢ Outlier-robust features handle noise

KEY INSIGHT:
Rating ‚âà Efficiency + Opponent Quality + Game Context

The model learn

## 8. Summary

In [None]:
print("="*60)
print("SPRINT 5 SUMMARY")
print("="*60)
print(f"Model: {model_name}")
print(f"Validation MAPE: {val_m['MAPE']:.2f}% ‚≠ê")
print(f"Generalization Gap: {val_m['MAPE'] - train_m['MAPE']:.2f}% ‚úÖ")
print(f"Best Range: {best}")
print(f"Worst Range: {worst}")

SPRINT 5 SUMMARY
Model: XGBOOST
Validation MAPE: 2.82% ‚≠ê
Generalization Gap: 0.56% ‚úÖ
Best Range: 1400-1600
Worst Range: 1200-1400
‚úÖ MODEL READY FOR DEPLOYMENT
üéâ Sprint 5 Complete!
