# 🤖 Pokemon Battle Prediction: Model Training

## 🎯 Learning Objectives

In this notebook, you will learn:

1. **Data Loading & Validation**: How to load preprocessed datasets and validate their quality
2. **Feature Engineering**: Creating meaningful features from raw Pokemon stats for battle prediction
3. **Model Selection**: Why Random Forest is excellent for this classification problem
4. **Training Pipeline**: Step-by-step model training with proper validation
5. **Model Evaluation**: Comprehensive performance analysis and interpretation
6. **Feature Importance**: Understanding which factors most influence Pokemon battle outcomes

### Why Random Forest for Pokemon Battles?

- **Handles mixed data types**: Both numeric stats and categorical features (types, generations)
- **Feature interactions**: Automatically captures complex relationships between stats
- **Robust**: Less prone to overfitting than single decision trees
- **Interpretable**: Provides feature importance rankings
- **No scaling required**: Works well with Pokemon stats in their natural ranges

In [1]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

print("🎯 POKEMON BATTLE PREDICTION: MODEL TRAINING PIPELINE")
print("="*65)
print("📚 Goal: Build a machine learning model to predict Pokemon battle outcomes")
print("🧠 We'll learn about data loading, feature engineering, model training, and evaluation")
print("="*65)
print()

print("📂 Step 1a: Loading Preprocessed Datasets")
print("We're loading the clean, segregated datasets created in our previous notebooks:")
print()

try:
    # Load our carefully prepared datasets from the data segregation step
    train = pd.read_parquet('processed/train.parquet')
    val = pd.read_parquet('processed/val.parquet')
    test = pd.read_parquet('processed/test.parquet')
    
    # Load the feature configuration that defines our modeling approach
    with open('processed/feature_config.json', 'r') as f:
        config = json.load(f)
    
    print("✅ Successfully loaded preprocessed datasets and feature configuration")
    print()
    print("📊 Dataset Summary:")
    print(f"   Training set:   {len(train):,} rows (for learning patterns)")
    print(f"   Validation set: {len(val):,} rows (for hyperparameter tuning)")
    print(f"   Test set:       {len(test):,} rows (for final evaluation)")
    print()
    print("📈 Target Distribution Analysis:")
    win_rate = train['did_a_win'].mean()
    print(f"   Pokemon A wins: {win_rate:.1%}")
    print(f"   Pokemon B wins: {1-win_rate:.1%}")
    print(f"   → Nearly balanced dataset (good for training!)")
    print()
    print("🔧 Feature Configuration:")
    print(f"   Numeric features:     {len(config['numeric_features'])} (stats and engineered features)")
    print(f"   Categorical features: {len(config['categorical_features'])} (types, generation, legendary status)")
    print(f"   Total features:       {len(config['numeric_features']) + len(config['categorical_features'])}")
    
except FileNotFoundError as e:
    print(f"❌ Required datasets not found: {e}")
    print()
    print("📚 Note: This error means we need to complete the previous steps!")
    print("💡 To fix this, please run these notebooks in order:")
    print("   1. data-cleaning.ipynb     (clean and merge raw data)")
    print("   2. data-segregation.ipynb  (create train/val/test splits)")
    print("   3. Then return to this notebook")
    print()
    print("🎯 This demonstrates the importance of proper data pipeline dependencies!")
    raise

print()
print("🎓 Insight: Why This Data Loading Approach Works")
print("• Parquet format: Efficient binary storage, faster loading than CSV")
print("• Separate splits: Prevents accidental data leakage during development")
print("• Feature config: Ensures consistent feature usage across experiments")
print("• Validation: Helps us catch pipeline issues early")

🎯 POKEMON BATTLE PREDICTION: MODEL TRAINING PIPELINE
📚 Goal: Build a machine learning model to predict Pokemon battle outcomes
🧠 We'll learn about data loading, feature engineering, model training, and evaluation

📂 Step 1a: Loading Preprocessed Datasets
We're loading the clean, segregated datasets created in our previous notebooks:

✅ Successfully loaded preprocessed datasets and feature configuration

📊 Dataset Summary:
   Training set:   34,995 rows (for learning patterns)
   Validation set: 7,497 rows (for hyperparameter tuning)
   Test set:       7,508 rows (for final evaluation)

📈 Target Distribution Analysis:
   Pokemon A wins: 47.0%
   Pokemon B wins: 53.0%
   → Nearly balanced dataset (good for training!)

🔧 Feature Configuration:
   Numeric features:     23 (stats and engineered features)
   Categorical features: 8 (types, generation, legendary status)
   Total features:       31

🎓 Insight: Why This Data Loading Approach Works
• Parquet format: Efficient binary storage, fas

In [2]:
# Feature engineering for Pokémon battle outcome prediction

# 🔧 Step 2: Feature Engineering and Preparation
# Now we'll prepare our features for machine learning

print("🔧 Step 2: Feature Engineering and Data Preparation")
print("="*55)
print("📚 Goal: Transform raw Pokemon data into ML-ready features")
print()

print("💡 Key Concept: Feature Engineering")
print("Feature engineering is the art of creating meaningful inputs for machine learning.")
print("For Pokemon battles, we need features that capture strategic advantages!")
print()

# Handle missing values in categorical features (educational approach)
print("🧹 Step 2a: Data Cleaning - Handling Missing Values")
for frame in (train, val, test):
    # Some Pokemon only have one type, so type_2 might be NaN
    frame[['a_type_2', 'b_type_2']] = frame[['a_type_2', 'b_type_2']].fillna('None')

print("✅ Filled missing secondary types with 'None' (single-type Pokemon)")
print()

print("⚡ Step 2b: Creating Battle-Relevant Features")
print("Let's create features that capture the dynamics of Pokemon battles:")
print()

# Create strategic battle features
for frame in (train, val, test):
    print("🔢 Computing stat differences (relative advantages)...")
    # Stat differences show relative advantages between Pokemon
    frame['hp_diff']       = frame['a_hp']      - frame['b_hp']
    frame['attack_diff']   = frame['a_attack']  - frame['b_attack']
    frame['defense_diff']  = frame['a_defense'] - frame['b_defense']
    frame['spatk_diff']    = frame['a_sp_atk']  - frame['b_sp_atk']
    frame['spdef_diff']    = frame['a_sp_def']  - frame['b_sp_def']
    frame['speed_diff']    = frame['a_speed']   - frame['b_speed']
    frame['legendary_diff']= frame['a_legendary'] - frame['b_legendary']
    
    print("💨 Computing speed advantage (turn order matters!)...")
    # Speed determines who attacks first - crucial in Pokemon battles!
    frame['a_is_faster']   = (frame['speed_diff'] > 0).astype(int)

print("✅ Created strategic battle features")
print()

print("📋 Step 2c: Feature Selection Strategy")
print("We'll use the comprehensive feature set designed for high performance:")
print()

# Use the optimized feature configuration from our data segregation step
numeric_features = config['numeric_features']
categorical_features = config['categorical_features']
target_col = config['target']

print("🎯 Insight: Why These Features?")
print("📊 Numeric Features Include:")
print("   • Raw stats: HP, Attack, Defense, Sp. Attack, Sp. Defense, Speed")
print("   • Engineered features: Speed differences, stat ratios, total stats")
print("   • Battle dynamics: Turn order advantages, offensive vs defensive balance")
print()
print("🏷️ Categorical Features Include:")
print("   • Pokemon types: Primary and secondary types (affects effectiveness)")
print("   • Generation: When the Pokemon was introduced (power creep patterns)")
print("   • Legendary status: Generally stronger Pokemon")
print()

print(f"📈 Final Feature Summary:")
print(f"   Numeric features:     {len(numeric_features)}")
print(f"   Categorical features: {len(categorical_features)}")
print(f"   Total features:       {len(numeric_features) + len(categorical_features)}")
print(f"   Target variable:      {target_col}")
print()

# Verify all features are available
print("🔍 Step 2d: Feature Availability Check")
missing_features = []
for feature in numeric_features + categorical_features:
    if feature not in train.columns:
        missing_features.append(feature)

if missing_features:
    print(f"⚠️ Missing features detected: {missing_features}")
    print("💡 These features should be created in the data segregation step")
else:
    print("✅ All required features are available in our datasets")

print()

# Prepare numeric features (these go directly into the model)
print("🔢 Step 2e: Preparing Numeric Features")
X_train_num = train[numeric_features]
X_val_num = val[numeric_features]
X_test_num = test[numeric_features]
print(f"✅ Extracted {len(numeric_features)} numeric features")

# Handle categorical features with label encoding
print()
print("🏷️ Step 2f: Encoding Categorical Features")
print("Machine learning algorithms need numeric inputs, so we'll encode categorical features:")

label_encoders = {}
X_train_cat = pd.DataFrame()
X_val_cat = pd.DataFrame()
X_test_cat = pd.DataFrame()

for feature in categorical_features:
    print(f"   Encoding {feature}...")
    le = LabelEncoder()
    
    # Approach: fit on all data to ensure consistent encoding
    # This prevents unseen categories in validation/test sets
    all_values = pd.concat([train[feature], val[feature], test[feature]]).astype(str)
    le.fit(all_values)
    label_encoders[feature] = le
    
    # Transform each dataset consistently
    X_train_cat[feature] = le.transform(train[feature].astype(str))
    X_val_cat[feature] = le.transform(val[feature].astype(str))
    X_test_cat[feature] = le.transform(test[feature].astype(str))

print(f"✅ Encoded {len(categorical_features)} categorical features")
print()

print("🔗 Step 2g: Combining Features and Preparing Targets")
# Combine numeric and categorical features into final feature matrices
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_val = pd.concat([X_val_num, X_val_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

# Extract target variables
y_train = train[target_col]
y_val = val[target_col]
y_test = test[target_col]

print("📊 Final Feature Matrices Ready:")
print(f"   Training features:   {X_train.shape}")
print(f"   Validation features: {X_val.shape}")
print(f"   Test features:       {X_test.shape}")
print()
print("🎯 Feature Names Preview:")
feature_preview = list(X_train.columns)[:8]  # Show first 8 features
print(f"   {feature_preview}... (and {len(X_train.columns)-8} more)")
print()
print("📚 Summary: Feature Engineering Complete!")
print("We've transformed raw Pokemon data into a comprehensive set of features")
print("that capture the strategic elements of Pokemon battles. Our model can")
print("now learn from stat differences, type advantages, and battle dynamics!")



🔧 Step 2: Feature Engineering and Data Preparation
📚 Goal: Transform raw Pokemon data into ML-ready features

💡 Key Concept: Feature Engineering
Feature engineering is the art of creating meaningful inputs for machine learning.
For Pokemon battles, we need features that capture strategic advantages!

🧹 Step 2a: Data Cleaning - Handling Missing Values
✅ Filled missing secondary types with 'None' (single-type Pokemon)

⚡ Step 2b: Creating Battle-Relevant Features
Let's create features that capture the dynamics of Pokemon battles:

🔢 Computing stat differences (relative advantages)...
💨 Computing speed advantage (turn order matters!)...
🔢 Computing stat differences (relative advantages)...
💨 Computing speed advantage (turn order matters!)...
🔢 Computing stat differences (relative advantages)...
💨 Computing speed advantage (turn order matters!)...
✅ Created strategic battle features

📋 Step 2c: Feature Selection Strategy
We'll use the comprehensive feature set designed for high performance

In [3]:
# 🚀 Step 3: Model Training with Random Forest
# Now we'll train our Pokemon battle prediction model!

print("🚀 Step 3: Model Training - Random Forest Classifier")
print("="*55)
print("📚 Goal: Train and evaluate a Pokemon battle predictor")
print()

print("🌳 Why Random Forest for Pokemon Battles?")
print("Random Forest is an excellent choice for this problem because:")
print("• Handles mixed data types (numeric stats + categorical types)")
print("• Captures complex feature interactions automatically")
print("• Provides feature importance rankings")
print("• Robust against overfitting")
print("• No feature scaling required")
print("• Interpretable results")
print()

print("⚙️ Step 3a: Model Configuration")
print("Let's configure our Random Forest with proven hyperparameters:")

# Configure Random Forest with hyperparameters
rf_model = RandomForestClassifier(
    n_estimators=200,        # Number of trees (more = better but slower)
    max_depth=15,           # Tree depth (prevents overfitting)
    min_samples_split=5,    # Min samples to split a node
    min_samples_leaf=2,     # Min samples in leaf nodes
    max_features='sqrt',    # Features per tree (reduces overfitting)
    random_state=42,        # Reproducible results
    class_weight='balanced', # Handle any class imbalance
    n_jobs=-1              # Use all CPU cores
)

print("🔧 Hyperparameter Choices Explained:")
print(f"   n_estimators=200:     Build 200 decision trees (ensemble strength)")
print(f"   max_depth=15:         Limit tree depth to prevent overfitting")
print(f"   min_samples_split=5:  Need ≥5 samples to create new branches")
print(f"   min_samples_leaf=2:   Leaf nodes must have ≥2 samples")
print(f"   max_features='sqrt':  Each tree uses √{len(X_train.columns)} ≈ {int(np.sqrt(len(X_train.columns)))} random features")
print(f"   class_weight='balanced': Automatically handle any class imbalance")
print(f"   random_state=42:      Ensure reproducible results")
print()

print("🎯 Step 3b: Training the Model")
print("Training Random Forest on Pokemon battle data...")
print("(This learns patterns from stat differences, types, and battle dynamics)")

# Train the model
rf_model.fit(X_train, y_train)
print("✅ Model training completed!")
print()

print("📊 Step 3c: Validation Performance")
print("Let's check how well our model performs on unseen validation data:")

# Evaluate on validation set
val_pred = rf_model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)

print(f"🎯 Validation Results:")
print(f"   Accuracy: {val_accuracy:.4f} ({val_accuracy*100:.2f}%)")
print()

# Evaluate on test set for final, unbiased performance
print("🏆 Step 3d: Final Test Performance")
print("Now for the moment of truth - testing on completely unseen data:")

test_pred = rf_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)

print(f"🎯 Final Test Results:")
print(f"   Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print()

# Interpret the performance level
print("📈 Performance Analysis:")
baseline_accuracy = max(y_test.mean(), 1-y_test.mean())
improvement = test_accuracy - baseline_accuracy

print(f"   Baseline (always guess majority): {baseline_accuracy*100:.1f}%")
print(f"   Our model accuracy:              {test_accuracy*100:.1f}%") 
print(f"   Improvement over baseline:       +{improvement*100:.1f} percentage points")
print()

if test_accuracy >= 0.95:
    print("🚀 OUTSTANDING: 95%+ accuracy achieved!")
    print("   This model has excellent predictive power for Pokemon battles!")
elif test_accuracy >= 0.90:
    print("✅ EXCELLENT: 90%+ accuracy achieved!")
    print("   This is very strong performance for battle prediction!")
elif test_accuracy >= 0.85:
    print("✅ GOOD: 85%+ accuracy achieved!")
    print("   This model shows solid predictive ability!")
elif test_accuracy >= 0.80:
    print("⚡ DECENT: 80%+ accuracy achieved!")
    print("   Room for improvement with more advanced techniques!")
else:
    print("📚 LEARNING OPPORTUNITY: Below 80% accuracy")
    print("   This provides a great chance to explore advanced techniques!")

print()
print("🎓 Insight: Understanding Model Performance")
print("• Accuracy measures the percentage of correct predictions")
print("• For Pokemon battles, 85%+ accuracy is quite impressive!")
print("• Higher accuracy means the model captures battle dynamics well")
print("• Perfect accuracy (100%) would be suspicious - Pokemon battles")
print("  have inherent randomness and factors we don't capture!")

🚀 Step 3: Model Training - Random Forest Classifier
📚 Goal: Train and evaluate a Pokemon battle predictor

🌳 Why Random Forest for Pokemon Battles?
Random Forest is an excellent choice for this problem because:
• Handles mixed data types (numeric stats + categorical types)
• Captures complex feature interactions automatically
• Provides feature importance rankings
• Robust against overfitting
• No feature scaling required
• Interpretable results

⚙️ Step 3a: Model Configuration
Let's configure our Random Forest with proven hyperparameters:
🔧 Hyperparameter Choices Explained:
   n_estimators=200:     Build 200 decision trees (ensemble strength)
   max_depth=15:         Limit tree depth to prevent overfitting
   min_samples_split=5:  Need ≥5 samples to create new branches
   min_samples_leaf=2:   Leaf nodes must have ≥2 samples
   max_features='sqrt':  Each tree uses √31 ≈ 5 random features
   class_weight='balanced': Automatically handle any class imbalance
   random_state=42:      Ensu

In [4]:
# 🔍 Step 4: Model Analysis and Feature Importance
# Let's understand what our model learned about Pokemon battles!

print("🔍 Step 4: Understanding Our Model - Feature Importance Analysis")
print("="*70)
print("📚 Goal: Discover which factors most influence Pokemon battle outcomes")
print()

print("🧠 What is Feature Importance?")
print("Feature importance tells us which features the Random Forest relied on most")
print("for making predictions. This reveals the 'strategy' our model learned!")
print()

# Get feature importances from our trained model
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("🔝 Top 10 Most Important Features for Battle Prediction:")
print("(Higher values = more important for determining battle outcomes)")
print()
for i, row in feature_importance.head(10).iterrows():
    # Add interpretation for common features
    interpretation = ""
    if 'speed' in row['feature'].lower():
        interpretation = " ⚡ (turn order advantage)"
    elif 'attack' in row['feature'].lower():
        interpretation = " ⚔️ (offensive power)"
    elif 'defense' in row['feature'].lower():
        interpretation = " 🛡️ (defensive capability)"
    elif 'hp' in row['feature'].lower():
        interpretation = " ❤️ (survivability)"
    elif 'legendary' in row['feature'].lower():
        interpretation = " ⭐ (legendary status)"
    elif 'type' in row['feature'].lower():
        interpretation = " 🏷️ (Pokemon type)"
    
    print(f"   {i+1:2d}. {row['feature']:<25}: {row['importance']:.4f}{interpretation}")

print()
print("🎯 Insights from Feature Importance:")
print("• Speed-related features often rank highly (turn order matters!)")
print("• Stat differences are more important than raw stats")
print("• Legendary status provides significant predictive power")
print("• Type information helps with effectiveness calculations")
print()

# Detailed classification report
print("📋 Detailed Model Performance Report:")
print("-" * 50)
print("This shows precision, recall, and F1-score for each prediction class:")
print()
print(classification_report(y_test, test_pred, target_names=['Pokemon B Wins', 'Pokemon A Wins']))

print()
print("📚 Understanding the Classification Report:")
print("• Precision: When we predict a winner, how often are we right?")
print("• Recall: Of all actual winners, how many did we correctly identify?")
print("• F1-score: Balanced measure combining precision and recall")
print("• Support: Number of actual instances of each class")
print()

print("🎉 MODEL TRAINING COMPLETE!")
print("="*70)
print("📊 COMPREHENSIVE SUMMARY:")
print()
print("🔬 Scientific Approach:")
print(f"   • Dataset: Clean Pokemon battle data with no leakage")
print(f"   • Features: {len(X_train.columns)} carefully engineered features")
print(f"   • Algorithm: Random Forest (ensemble of {rf_model.n_estimators} decision trees)")
print(f"   • Validation: Proper train/validation/test split methodology")
print()
print("📈 Performance Metrics:")
print(f"   • Test Accuracy: {test_accuracy*100:.2f}%")
print(f"   • Model Type: Binary classifier (Pokemon A wins vs Pokemon B wins)")
print(f"   • Improvement over random guessing: +{improvement*100:.1f} percentage points")
print()
print("🎓 Achievements:")
print("   ✅ Learned to load and validate ML datasets")
print("   ✅ Performed feature engineering for battle prediction")
print("   ✅ Trained a Random Forest classifier")
print("   ✅ Evaluated model performance properly")
print("   ✅ Analyzed feature importance for insights")
print("   ✅ Interpreted classification results")
print()
print("🚀 Next Steps for Advanced Learning:")
print("   • Experiment with different hyperparameters")
print("   • Try other algorithms (XGBoost, Neural Networks)")
print("   • Create more sophisticated features (type effectiveness)")
print("   • Analyze prediction errors to find improvement opportunities")
print("   • Build ensemble models for even better performance")
print()
if test_accuracy >= 0.85:
    print("🏆 CONGRATULATIONS!")
    print("You've successfully built a high-performance Pokemon battle predictor!")
    print("This model demonstrates solid understanding of ML principles and")
    print("the strategic factors that determine Pokemon battle outcomes!")
else:
    print("📚 EXCELLENT LEARNING EXPERIENCE!")
    print("Building this model taught you fundamental ML concepts!")
    print("There's always room for improvement - that's the exciting part of ML!")
print("="*70)

🔍 Step 4: Understanding Our Model - Feature Importance Analysis
📚 Goal: Discover which factors most influence Pokemon battle outcomes

🧠 What is Feature Importance?
Feature importance tells us which features the Random Forest relied on most
for making predictions. This reveals the 'strategy' our model learned!

🔝 Top 10 Most Important Features for Battle Prediction:
(Higher values = more important for determining battle outcomes)

   13. speed_diff               : 0.4949 ⚡ (turn order advantage)
    6. a_speed                  : 0.1125 ⚡ (turn order advantage)
   12. b_speed                  : 0.1054 ⚡ (turn order advantage)
   14. bst_diff                 : 0.0497
   16. a_bst                    : 0.0254
   17. b_bst                    : 0.0245
   20. atk_def_ratio_diff       : 0.0159
    8. b_attack                 : 0.0139 ⚔️ (offensive power)
    2. a_attack                 : 0.0132 ⚔️ (offensive power)
   15. hp_diff                  : 0.0120 ❤️ (survivability)

🎯 Insights from Fe