# Pokemon Combat Data Segregation & Feature Engineering

This notebook shows how to split Pokemon battle data into train, validation, and test sets without pair leakage, and how to engineer features for the downstream battle-outcome model.

## Learning Objectives:

- 🔄 Create proper train/validation/test splits without data leakage
- ⚡ Engineer features aimed at the project's 95%+ accuracy target (accuracy itself is measured in the model-training notebook)
- 📊 Understand why certain features are important for Pokemon battles
- 🎯 Prepare optimized datasets for model training

## What We'll Learn:

- GroupShuffleSplit to prevent pair leakage
- Advanced feature engineering techniques
- Speed and stat ratio importance in Pokemon battles
- How to save optimized datasets for production use

In [2]:
# Import libraries for data splitting and feature engineering
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

print("📚 Data Segregation & Feature Engineering Pipeline")
print("🎯 Goal: Create train/val/test splits without data leakage")
print("⚡ Focus: Engineer features for 95%+ accuracy")
print("="*60)

📚 Data Segregation & Feature Engineering Pipeline
🎯 Goal: Create train/val/test splits without data leakage
⚡ Focus: Engineer features for 95%+ accuracy


In [3]:
# STEP 1: Load Clean Data and Understand Distribution
print("🔍 STEP 1: Loading and Analyzing Clean Data")
print("-" * 50)

# Load the cleaned data from previous notebook
df = pd.read_csv('data/final_cleaned_no_duplicates.csv')

print("📊 Dataset Overview:")
print(f"   Total battles: {len(df):,}")
print(f"   Unique Pokemon pairs: {df['pair_key'].nunique():,}")
print(f"   Features per row: {df.shape[1]}")

print(f"\n🎯 Target Distribution Analysis:")
a_wins = df['did_a_win'].sum()
b_wins = len(df) - a_wins
a_win_rate = df['did_a_win'].mean()

print(f"   Pokemon A (first) wins: {a_wins:,} ({a_win_rate:.1%})")
print(f"   Pokemon B (second) wins: {b_wins:,} ({1-a_win_rate:.1%})")

print(f"\n💡 Why This Distribution Matters:")
if 0.45 <= a_win_rate <= 0.55:
    print(f"   ✅ Well-balanced: Natural battle dynamics preserved")
    print(f"   📈 Good for machine learning: No severe class imbalance")
else:
    print(f"   📊 Imbalanced but natural: Reflects real Pokemon battle patterns")

print(f"\n🔍 Data Quality Check:")
pair_coverage = df['pair_key'].nunique() / len(df)
avg_battles_per_pair = len(df) / df['pair_key'].nunique()

print(f"   Pair coverage: {pair_coverage:.3f} (lower = more repeats)")
print(f"   Average battles per pair: {avg_battles_per_pair:.1f}")
print(f"   Missing values: {df.isnull().sum().sum()}")

print(f"\n🎯 Next: Split data to prevent overfitting")

# Create train/test split using GroupShuffleSplit to avoid pair leakage
print(f"\n🔄 STEP 2: Creating Train/Test Split (85%/15%)")
print("📚 Why GroupShuffleSplit: Ensures same Pokemon pair doesn't appear in both train and test")

gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['pair_key']))

train_df = df.iloc[train_idx].reset_index(drop=True)
test_df  = df.iloc[test_idx ].reset_index(drop=True)

print(f"   Train set: {len(train_df):,} battles")
print(f"   Test set: {len(test_df):,} battles")
print(f"   Split ratio: {len(train_df)/len(df):.1%} train, {len(test_df)/len(df):.1%} test")


🔍 STEP 1: Loading and Analyzing Clean Data
--------------------------------------------------
📊 Dataset Overview:
   Total battles: 50,000
   Unique Pokemon pairs: 46,211
   Features per row: 22

🎯 Target Distribution Analysis:
   Pokemon A (first) wins: 23,601 (47.2%)
   Pokemon B (second) wins: 26,399 (52.8%)

💡 Why This Distribution Matters:
   ✅ Well-balanced: Natural battle dynamics preserved
   📈 Good for machine learning: No severe class imbalance

🔍 Data Quality Check:
   Pair coverage: 0.924 (lower = more repeats)
   Average battles per pair: 1.1
   Missing values: 48016

🎯 Next: Split data to prevent overfitting

🔄 STEP 2: Creating Train/Test Split (85%/15%)
📚 Why GroupShuffleSplit: Ensures same Pokemon pair doesn't appear in both train and test
   Train set: 42,492 battles
   Test set: 7,508 battles
   Split ratio: 85.0% train, 15.0% test
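
To make the grouping behaviour concrete, here is a minimal sketch on a made-up six-row frame (the `toy` data and its pair names are illustrative, not taken from the dataset). It shows that `GroupShuffleSplit` assigns whole groups to one side, and that `test_size` is a fraction of groups rather than rows, so the row-level ratio is only approximate.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame (hypothetical): six battles over four pairs; "p1" and "p2" repeat.
toy = pd.DataFrame({
    "pair_key":  ["p1", "p1", "p2", "p2", "p3", "p4"],
    "did_a_win": [1, 0, 1, 1, 0, 0],
})

# test_size=0.25 of 4 groups -> 1 group goes to test, 3 stay in train.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(toy, groups=toy["pair_key"]))

train_pairs = set(toy.iloc[train_idx]["pair_key"])
test_pairs = set(toy.iloc[test_idx]["pair_key"])

print("train pairs:", sorted(train_pairs))
print("test pairs: ", sorted(test_pairs))
print("shared pairs:", train_pairs & test_pairs)  # always an empty set
```

Depending on which pair lands in the test side, the test set holds one or two rows here, which is why row counts in the real split come out close to, but not exactly at, the requested fraction.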


In [4]:
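# --- second split: train --> (train, val)  -----------------
# test_size=0.176 ≈ 0.15 / 0.85, so validation ends up at ~15% of the full dataset
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.176, random_state=43)
tr_idx, val_idx = next(gss_val.split(train_df, groups=train_df['pair_key']))

# Build val_df first: both selections must index the pre-split train_df,
# which is overwritten on the second line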
val_df   = train_df.iloc[val_idx].reset_index(drop=True)
train_df = train_df.iloc[tr_idx].reset_index(drop=True)
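
The `test_size=0.176` above follows from simple arithmetic. A short sketch, assuming the intent is a 70/15/15 split of the full dataset (the variable names are illustrative):

```python
# Target fractions of the FULL dataset.
test_frac = 0.15
val_frac = 0.15

# After the first split, (1 - test_frac) of the data remains.
# To carve val_frac of the full data out of that remainder:
val_size_within_train = val_frac / (1 - test_frac)
print(f"{val_size_within_train:.4f}")  # 0.1765

# What is left for training, as a fraction of the full dataset.
final_train_frac = (1 - test_frac) * (1 - val_size_within_train)
print(f"{final_train_frac:.2f}")  # 0.70
```

The notebook rounds this value to 0.176, and because `GroupShuffleSplit` splits by group, the resulting row counts are approximate.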


In [5]:
# 📊 Dataset Split Analysis
# Let's create a reporting function to validate our splits and understand the distribution

def report(part, name):
    """
    Analyze dataset splits
    
    This function helps us understand:
    - Size of each split (number of rows)
    - Win rate balance (should be around 50% for fair battles)
    - Unique pairs per split (a count only; overlap between splits is verified separately with assertions)
    """
    print(f'{name:<5} rows:{len(part):>6} | positives:{part["did_a_win"].mean():.3f} | unique pairs:{part["pair_key"].nunique()}')

print("🎯 Examining our train/validation/test splits")
print("This helps us verify that our splits are balanced and properly segregated")
print()

for part, name in [(train_df, 'Train'), (val_df, 'Val'), (test_df, 'Test')]:
    report(part, name)

print()
print("📚 What we're looking for:")
print("• Win rates around 0.500 (balanced outcomes)")
print("• No overlapping unique pairs between splits")
print("• Reasonable size distributions (Train > Val ≈ Test)")


🎯 Examining our train/validation/test splits
This helps us verify that our splits are balanced and properly segregated

Train rows: 34995 | positives:0.470 | unique pairs:32365
Val   rows:  7497 | positives:0.473 | unique pairs:6914
Test  rows:  7508 | positives:0.482 | unique pairs:6932

📚 What we're looking for:
• Win rates around 0.500 (balanced outcomes)
• No overlapping unique pairs between splits
• Reasonable size distributions (Train > Val ≈ Test)


In [6]:
# 🔍 Data Leakage Validation - Critical Step in ML Pipeline
# This is one of the most important validation steps in machine learning!

print("🎯 Concept: Data Leakage Prevention")
print("Data leakage occurs when information from the test set 'leaks' into training.")
print("This would give us artificially high performance that doesn't generalize.")
print()

print("📚 Why we check for pair leakage:")
print("• If the same Pokemon pair appears in both train and test sets,")
print("  the model might memorize specific matchups rather than learn general patterns")
print("• This would lead to overoptimistic performance estimates")
print()

# Validate no overlap between splits using set intersection
assert not set(train_df.pair_key) & set(val_df.pair_key),  'Leak between train and val'
assert not set(train_df.pair_key) & set(test_df.pair_key), 'Leak between train and test'
assert not set(val_df.pair_key)   & set(test_df.pair_key), 'Leak between val and test'

print('✅ No pair leakage detected - Our splits are properly isolated!')
print()
print("🎉 This means our model evaluation will be trustworthy and realistic.")

🎯 Concept: Data Leakage Prevention
Data leakage occurs when information from the test set 'leaks' into training.
This would give us artificially high performance that doesn't generalize.

📚 Why we check for pair leakage:
• If the same Pokemon pair appears in both train and test sets,
  the model might memorize specific matchups rather than learn general patterns
• This would lead to overoptimistic performance estimates

✅ No pair leakage detected - Our splits are properly isolated!

🎉 This means our model evaluation will be trustworthy and realistic.
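
The three assertions can be generalized into a small helper that reports which keys leak instead of only failing. This is a sketch, not part of the original pipeline; `find_group_overlaps` and the toy frames are hypothetical names introduced for illustration.

```python
from itertools import combinations

import pandas as pd


def find_group_overlaps(splits, key="pair_key"):
    """Return {(name_a, name_b): shared_keys} for every pair of splits that overlaps."""
    overlaps = {}
    for (name_a, df_a), (name_b, df_b) in combinations(splits.items(), 2):
        shared = set(df_a[key]) & set(df_b[key])
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy data with a deliberate leak: "p2" appears in both train and test.
toy_splits = {
    "train": pd.DataFrame({"pair_key": ["p1", "p2"]}),
    "val":   pd.DataFrame({"pair_key": ["p3"]}),
    "test":  pd.DataFrame({"pair_key": ["p2", "p4"]}),
}
print(find_group_overlaps(toy_splits))  # {('train', 'test'): {'p2'}}
```

An empty dictionary means the splits are disjoint on the chosen key; a non-empty one names the offending pairs, which is easier to debug than a bare assertion failure.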


In [7]:
# 💾 Saving Our Clean, Segregated Datasets
# Now we'll save our properly split datasets for model training

print("🎯 Saving Clean Dataset Splits")
print("We're saving three separate files to maintain clear boundaries:")
print()

# Save each split to a separate CSV file
train_df.to_csv('data/train_set.csv', index=False)
val_df.to_csv('data/val_set.csv', index=False)
test_df.to_csv('data/test_set.csv', index=False)

print("💾 Saved clean datasets:")
print("• data/train_set.csv")
print("• data/val_set.csv") 
print("• data/test_set.csv")
print()

print("📚 Why separate files matter:")
print("• Clear separation prevents accidental mixing during model development")
print("• Each file has a specific purpose in the ML workflow")
print("• Training set: Learn patterns")
print("• Validation set: Tune hyperparameters and select best model")
print("• Test set: Final, unbiased performance evaluation")
print()

print("🎯 These datasets have:")
print("✅ Natural win distribution (balanced outcomes)")
print("✅ NO data leakage between splits")
print("✅ Proper Pokemon pair segregation")
print()
print("🚀 Ready for model training!")

🎯 Saving Clean Dataset Splits
We're saving three separate files to maintain clear boundaries:

💾 Saved clean datasets:
• data/train_set.csv
• data/val_set.csv
• data/test_set.csv

📚 Why separate files matter:
• Clear separation prevents accidental mixing during model development
• Each file has a specific purpose in the ML workflow
• Training set: Learn patterns
• Validation set: Tune hyperparameters and select best model
• Test set: Final, unbiased performance evaluation

🎯 These datasets have:
✅ Natural win distribution (balanced outcomes)
✅ NO data leakage between splits
✅ Proper Pokemon pair segregation

🚀 Ready for model training!

In [8]:
# 🔧 Feature Engineering - Creating Powerful Predictive Features
# This step transforms raw stats into meaningful battle indicators

import pandas as pd
import numpy as np
import json
from pathlib import Path

print("🎯 Feature Engineering for Pokemon Battles")
print("We'll create features that capture the strategic elements of Pokemon combat")
print()

# Load our clean, segregated datasets
train_df = pd.read_csv('data/train_set.csv')
val_df   = pd.read_csv('data/val_set.csv')
test_df  = pd.read_csv('data/test_set.csv')

print(f"📊 Loaded clean datasets:")
print(f"   Train: {len(train_df)} rows")
print(f"   Val: {len(val_df)} rows") 
print(f"   Test: {len(test_df)} rows")
print()

print("📚 Feature Engineering")
print("Instead of using raw stats, we'll create features that capture battle dynamics:")
print()

for frame in (train_df, val_df, test_df):
    # Speed difference (crucial for battles - who attacks first?)
    frame['speed_diff'] = frame['a_speed'] - frame['b_speed']
    
    # Base stat totals and difference (overall power comparison)
    a_bst = frame[['a_hp','a_attack','a_defense','a_sp_atk','a_sp_def','a_speed']].sum(axis=1)
    b_bst = frame[['b_hp','b_attack','b_defense','b_sp_atk','b_sp_def','b_speed']].sum(axis=1)
    frame['a_bst'] = a_bst
    frame['b_bst'] = b_bst
    frame['bst_diff'] = a_bst - b_bst
    
    # Attack/Defense ratios (offensive vs defensive capabilities)
    frame['a_atk_def_ratio'] = frame['a_attack'] / (frame['a_defense'] + 1)  # +1 prevents division by zero
    frame['b_atk_def_ratio'] = frame['b_attack'] / (frame['b_defense'] + 1)
    frame['atk_def_ratio_diff'] = frame['a_atk_def_ratio'] - frame['b_atk_def_ratio']
    
    # Special attack/defense ratios (special move effectiveness)
    frame['a_sp_ratio'] = frame['a_sp_atk'] / (frame['a_sp_def'] + 1)
    frame['b_sp_ratio'] = frame['b_sp_atk'] / (frame['b_sp_def'] + 1)
    frame['sp_ratio_diff'] = frame['a_sp_ratio'] - frame['b_sp_ratio']
    
    # HP advantage (survivability comparison)
    frame['hp_diff'] = frame['a_hp'] - frame['b_hp']

print("✅ Created strategic battle features:")
print("• Speed difference (turn order advantage)")
print("• Base stat total difference (overall power)")
print("• Attack/Defense ratios (offensive capability)")
print("• Special move ratios (special attack effectiveness)")  
print("• HP difference (survivability advantage)")
print()

# Data type optimization and cleaning
for frame in (train_df, val_df, test_df):
    # Ensure target is numeric (required for ML algorithms)
    frame['did_a_win'] = frame['did_a_win'].astype(int)
    
    # Convert boolean columns to int (ML algorithms prefer numeric)
    frame[['a_legendary', 'b_legendary']] = frame[['a_legendary', 'b_legendary']].astype(int)
    
    # Fill any NaNs in type_2 with 'None' (some Pokemon only have one type)
    frame[['a_type_2', 'b_type_2']] = frame[['a_type_2', 'b_type_2']].fillna('None')

print("🧹 Data type optimization completed")
print()

# Quality validation function
def quick_audit(df, name):
    """Validate data quality for machine learning"""
    assert df['did_a_win'].nunique() == 2, f'{name}: target should be binary'
    assert df.isna().sum().sum() == 0, f'{name}: no NaN values allowed'
    print(f'{name:<5} ✔  rows:{len(df):>6}  win_rate:{df.did_a_win.mean():.3f}')

print(f"📋 Quality Audit - Ensuring ML readiness:")
for d, n in [(train_df, 'Train'), (val_df, 'Val'), (test_df, 'Test')]:
    quick_audit(d, n)

print()

# Define our comprehensive feature set for high-performance modeling
high_performance_features = {
    'numeric': [
        # Original Pokemon stats
        'a_hp', 'a_attack', 'a_defense', 'a_sp_atk', 'a_sp_def', 'a_speed',
        'b_hp', 'b_attack', 'b_defense', 'b_sp_atk', 'b_sp_def', 'b_speed',
        # Strategic engineered features
        'speed_diff', 'bst_diff', 'hp_diff',
        'a_bst', 'b_bst',
        'a_atk_def_ratio', 'b_atk_def_ratio', 'atk_def_ratio_diff',
        'a_sp_ratio', 'b_sp_ratio', 'sp_ratio_diff'
    ],
    'categorical': [
        'a_type_1', 'a_type_2', 'b_type_1', 'b_type_2',
        'a_generation', 'b_generation', 'a_legendary', 'b_legendary'
    ]
}

# Verify all features exist in our datasets
missing_features = []
for feature_type, features in high_performance_features.items():
    for feature in features:
        if feature not in train_df.columns:
            missing_features.append(feature)

if missing_features:
    print(f"⚠️ Missing features: {missing_features}")
else:
    print(f"✅ All features present and ready for modeling")

print()
print(f'🚀 COMPREHENSIVE FEATURE SET:')
print(f'   Numeric features: {len(high_performance_features["numeric"])} (stats + engineered)')
print(f'   Categorical features: {len(high_performance_features["categorical"])} (types + metadata)')
print()

# Save optimized datasets in efficient Parquet format
Path('processed').mkdir(exist_ok=True)

train_df.to_parquet('processed/train.parquet', index=False)
val_df.to_parquet('processed/val.parquet', index=False)
test_df.to_parquet('processed/test.parquet', index=False)

# Save feature configuration for consistent model training
feature_config = {
    'numeric_features': high_performance_features['numeric'],
    'categorical_features': high_performance_features['categorical'],
    'target': 'did_a_win',
    'description': 'Comprehensive Pokemon battle prediction features',
    'pipeline_version': 'educational_v1'
}

with open('processed/feature_config.json', 'w') as f:
    json.dump(feature_config, f, indent=2)

print(f'💾 SAVED OPTIMIZED DATASETS:')
print(f'   • processed/train.parquet (efficient binary format)')
print(f'   • processed/val.parquet')
print(f'   • processed/test.parquet')
print(f'   • processed/feature_config.json (feature definitions)')
print()

print(f'🎯 DATASETS READY FOR MODEL TRAINING!')
print("📚 Next step: Use model_training.ipynb to build and evaluate your Pokemon battle predictor")
print("="*70)


🎯 Feature Engineering for Pokemon Battles
We'll create features that capture the strategic elements of Pokemon combat

📊 Loaded clean datasets:
   Train: 34995 rows
   Val: 7497 rows
   Test: 7508 rows

📚 Feature Engineering
Instead of using raw stats, we'll create features that capture battle dynamics:

✅ Created strategic battle features:
• Speed difference (turn order advantage)
• Base stat total difference (overall power)
• Attack/Defense ratios (offensive capability)
• Special move ratios (special attack effectiveness)
• HP difference (survivability advantage)

🧹 Data type optimization completed

📋 Quality Audit - Ensuring ML readiness:
Train ✔  rows: 34995  win_rate:0.470
Val   ✔  rows:  7497  win_rate:0.473
Test  ✔  rows:  7508  win_rate:0.482

✅ All features present and ready for modeling

🚀 COMPREHENSIVE FEATURE SET:
   Numeric features: 23 (stats + engineered)
   Categorical features: 8 (types + metadata)
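
To see what the engineered columns contain, the sketch below applies a subset of the same formulas to a single made-up battle. The function name `add_battle_features` and the stat values are illustrative assumptions; the column names and formulas follow the cell above.

```python
import pandas as pd

STATS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]


def add_battle_features(frame):
    """Return a copy of `frame` with a subset of the notebook's engineered columns."""
    out = frame.copy()
    out["speed_diff"] = out["a_speed"] - out["b_speed"]
    out["a_bst"] = out[[f"a_{s}" for s in STATS]].sum(axis=1)
    out["b_bst"] = out[[f"b_{s}" for s in STATS]].sum(axis=1)
    out["bst_diff"] = out["a_bst"] - out["b_bst"]
    # +1 in the denominator guards against division by zero, as in the notebook.
    out["a_atk_def_ratio"] = out["a_attack"] / (out["a_defense"] + 1)
    out["b_atk_def_ratio"] = out["b_attack"] / (out["b_defense"] + 1)
    out["atk_def_ratio_diff"] = out["a_atk_def_ratio"] - out["b_atk_def_ratio"]
    return out


# One illustrative battle (stat values chosen for the example only).
toy = pd.DataFrame([{
    "a_hp": 45, "a_attack": 49, "a_defense": 49, "a_sp_atk": 65, "a_sp_def": 65, "a_speed": 45,
    "b_hp": 39, "b_attack": 52, "b_defense": 43, "b_sp_atk": 60, "b_sp_def": 50, "b_speed": 65,
}])

features = add_battle_features(toy)
print(features[["speed_diff", "bst_diff", "a_atk_def_ratio"]].iloc[0].to_dict())
# {'speed_diff': -20.0, 'bst_diff': 9.0, 'a_atk_def_ratio': 0.98}
```

Every feature here is computed row by row with no statistics fitted on the data, so applying the same function to the validation and test frames does not leak information between splits.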
