# Notebook 26 - Opening Recommender Model: Training Pipeline

### 0. Overview and Goals

This notebook defines the full pipeline for training the chess opening recommender model.  
The objective is to predict **player-opening performance scores** (`(wins + 0.5 * draws) / num_games`) for openings a player hasn't yet played, based on their results in the openings they *have* played.  

The model will use **matrix factorization** with **stochastic gradient descent (SGD)** to learn latent factors representing player and opening characteristics.  
All computations will be implemented in **PyTorch**, with data loaded from my local **DuckDB** database.

**High-level specs:**
- Use only *White* openings initially (we'll extend to Black later).  
- Data source: processed player-opening stats from local DuckDB.  
- Predict: normalized "score" = `(wins + 0.5 * draws) / total_games` (a win rate in which a draw counts as half a win).  
- Filter: only include entries with ≥ `MIN_GAMES_THRESHOLD` games (originally planned as 50; Step 2 uses 10 together with confidence weighting).  
- Ignore: rating differences, time controls, and other metadata.  
- Model parameters (to be defined in appropriate places for easy editing):  
  - `NUM_FACTORS`, `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`, `NUM_PLAYERS_TO_PROCESS`  
- Logging and checkpoints throughout for reproducibility.  
- All random operations seeded for deterministic runs.  
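
As a concrete reference for the score defined in the specs above, here is a minimal sketch of the calculation. The function name `performance_score` and the choice to return `None` when no games were played are illustrative assumptions, not part of the pipeline (the SQL in Step 1 handles the zero-games case with `NULLIF`).

```python
def performance_score(wins, draws, losses):
    """Return (wins + 0.5 * draws) / total games, or None if no games were played."""
    total = wins + draws + losses
    if total == 0:
        return None  # assumption: mirror the NULL that NULLIF produces in SQL
    return (wins + 0.5 * draws) / total


# 6 wins, 2 draws, 2 losses out of 10 games
print(performance_score(6, 2, 2))  # 0.7
```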

---

### 1. Data Extraction
- Connect to local DuckDB
- Pull all processed player-opening statistics from
- Verify schema consistency:  
  - Required columns: `player_id`, `opening_id`, `eco`, `num_games`, `wins`, `draws`, `losses`.  
- Include a row-count sanity check.
- Only include players rated 1200 or higher (`rating >= 1200`)

---

### 2. Data Sanitization & Normalization
- Optionally normalize scores if needed for MF convergence.  
- Drop players with no qualifying openings and openings with no qualifying players.  
  - I believe there shouldn't be any, but we'll double-check.
- Resequence `player_id` and `opening_id` to be sequential integers; right now there are gaps because of entries we deleted from the DB.
- Check for sparsity consistency (no implicit zeros yet).  
- Note that this data has already been split into White and Black games further up the pipeline.
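
The ID resequencing step above can be sketched with `pd.factorize`, which maps each distinct value to a dense 0-based code. The helper name `resequence_ids` and the sample values are hypothetical; the returned lookup keeps the original IDs so predictions can be mapped back to the database later.

```python
import pandas as pd


def resequence_ids(df, column):
    """Replace `column` with dense 0..n-1 codes and return a code -> original ID lookup."""
    codes, uniques = pd.factorize(df[column], sort=True)
    out = df.copy()
    out[column] = codes
    id_map = pd.Series(uniques, name=f"original_{column}")
    return out, id_map


# Toy frame with gaps in player_id (illustrative values only)
toy = pd.DataFrame({"player_id": [10, 42, 10, 97], "score": [0.5, 0.6, 0.4, 0.7]})
dense, player_id_map = resequence_ids(toy, "player_id")
print(dense["player_id"].tolist())  # [0, 1, 0, 2]
print(player_id_map.tolist())       # [10, 42, 97]
```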

### Data Quality
- Drop entries with fewer than `MIN_GAMES_THRESHOLD` games
- Handle any duplicate `(player_id, opening_id)` combinations
- Remove players with no qualifying openings
- Remove openings with no qualifying players
- Verify no null values remain

### ECO Codes
- Keep ECO codes for later categorical encoding (Step 4)
- ECO will be used as opening side information (similar to rating for players)

### Confidence Weighting
- Use `MIN_GAMES_THRESHOLD = 10` to keep more data
- Add a **confidence weight** column: `confidence = num_games / (num_games + K)` where K ≈ 50
- This weight will be used in the loss function to down-weight entries backed by few games
- High-game-count entries → high confidence → larger loss impact
- Low-game-count entries → low confidence → smaller loss impact
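
To make the weighting concrete, here is a small sketch of the confidence formula and one way it could enter the loss. The `weighted_mse` helper and its normalization by the sum of weights are assumptions for illustration; the notebook has not yet fixed the exact form of the weighted loss.

```python
def confidence(num_games, k=50):
    """confidence = num_games / (num_games + K)."""
    return num_games / (num_games + k)


def weighted_mse(predictions, targets, weights):
    """Squared errors averaged with per-entry weights (assumed form of the loss)."""
    total_weight = sum(weights)
    weighted_errors = sum(w * (p - t) ** 2 for p, t, w in zip(predictions, targets, weights))
    return weighted_errors / total_weight


games = [10, 50, 200]
weights = [confidence(n) for n in games]
print([round(w, 3) for w in weights])  # [0.167, 0.5, 0.8]
```

With K = 50, an entry needs 50 games to reach a weight of 0.5, so a 10-game entry contributes roughly a fifth as much to the loss as a 200-game entry.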

### Player Rating (Side Information)
- **Player ratings are side information** - they describe player characteristics, not individual player-opening interactions
- Ratings will be stored separately and joined to player embeddings during training
- We'll **normalize ratings** (likely z-score normalization) to avoid scaling issues with the embedding layer
- Rating normalization will be done once after extraction, not per-row
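
A minimal sketch of the one-off z-score normalization described above, assuming a per-player table with a `rating` column (the frame and its values are made up for illustration). Keeping `rating_mean` and `rating_std` around lets the same transform be applied later to held-out players.

```python
import pandas as pd

# Hypothetical per-player table: one row per player, not per player-opening pair
players = pd.DataFrame({"player_id": [0, 1, 2, 3], "rating": [1200, 1500, 1800, 2100]})

# Compute the statistics once, then reuse them for every player
rating_mean = players["rating"].mean()
rating_std = players["rating"].std(ddof=0)  # population std; an assumption
players["rating_z"] = (players["rating"] - rating_mean) / rating_std

print(players)
```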

---

### 3. Data Splits
- Split into train/test/val sets.  
- Ensure every player and every opening appears at least once in the training data.  
- Strategy:  
  - Sample unique players and openings to guarantee coverage in train.  
  - Remaining data → stratified random split into train/test.  
  - Deduplicate and merge unique IDs back into train if needed.
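
One way to sketch the coverage guarantee: pin one row per player and one row per opening into train, then split whatever is left at random. This is a simplification under stated assumptions: the helper `split_with_coverage` is hypothetical, the split is plain random rather than stratified, it produces only train/test (a validation set could be cut from the remainder the same way), and it expects a unique DataFrame index.

```python
import itertools

import pandas as pd


def split_with_coverage(df, test_frac=0.2, seed=42):
    """Random train/test split in which every player and opening appears in train."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    # One randomly chosen row per player and per opening is pinned to train
    pinned_index = shuffled.drop_duplicates("player_id").index.union(
        shuffled.drop_duplicates("opening_id").index
    )
    pinned = df.loc[pinned_index]
    rest = df.drop(index=pinned_index)
    # The test set is drawn only from rows that are not needed for coverage
    n_test = min(int(round(test_frac * len(df))), len(rest))
    test = rest.sample(n=n_test, random_state=seed)
    train = pd.concat([pinned, rest.drop(index=test.index)])
    return train, test


# Toy data: 4 players x 4 openings
pairs = list(itertools.product(range(4), range(4)))
toy = pd.DataFrame(pairs, columns=["player_id", "opening_id"])
toy["score"] = 0.5
train, test = split_with_coverage(toy, test_frac=0.25)
print(len(train), len(test))  # 12 4
```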

---

### 4. Enumerate Categorical Variables
- Enumerate `eco` (if included) as an integer categorical variable.  
- Confirm all columns are numeric and compatible with PyTorch tensors.  
- Verify no missing or out-of-range IDs.
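
As a sketch of the ECO enumeration, pandas' categorical dtype assigns each distinct code an integer and keeps the lookup. The column name `eco_code` and the sample rows are assumptions; a code of `-1` would indicate a missing ECO value, which the check at the end guards against.

```python
import pandas as pd

toy = pd.DataFrame({"opening_id": [0, 1, 2, 3], "eco": ["B20", "C50", "B20", "A00"]})

eco_cat = toy["eco"].astype("category")
toy["eco_code"] = eco_cat.cat.codes.astype("int64")  # categories are sorted: A00=0, B20=1, C50=2
eco_lookup = dict(enumerate(eco_cat.cat.categories))  # integer code -> ECO string

# cat.codes uses -1 for missing values, so this doubles as a null check
assert (toy["eco_code"] >= 0).all()
print(toy["eco_code"].tolist())  # [1, 2, 1, 0]
```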

---

### 5. Training Data Structure
- Each row: one `(player_id, opening_id, score)` record.
- Include other fields: `eco`, `num_games`, etc.
- Convert DataFrame to PyTorch tensors (`torch.long` for IDs, `torch.float` for scores).  
- Log dataset shapes and sparsity metrics.
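
The conversion described above might look like the following sketch. The toy frame stands in for the cleaned data and assumes IDs have already been resequenced to start at 0; sparsity here means the fraction of the player x opening matrix with no observed entry.

```python
import pandas as pd
import torch

# Stand-in for the cleaned, resequenced data (illustrative values)
toy = pd.DataFrame({
    "player_id": [0, 0, 1, 2],
    "opening_id": [0, 1, 1, 0],
    "score": [0.61, 0.48, 0.55, 0.40],
})

player_ids = torch.tensor(toy["player_id"].to_numpy(), dtype=torch.long)
opening_ids = torch.tensor(toy["opening_id"].to_numpy(), dtype=torch.long)
scores = torch.tensor(toy["score"].to_numpy(), dtype=torch.float)

num_players = int(player_ids.max()) + 1
num_openings = int(opening_ids.max()) + 1
sparsity = 1.0 - len(toy) / (num_players * num_openings)

print(player_ids.shape, opening_ids.shape, scores.shape)
print(f"sparsity: {sparsity:.3f}")
```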

---

### 6. Training Setup
Define constants:
- `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`, `NUM_FACTORS`  
- Loss functions: MSE and RMSE  
- Activation: sigmoid or none (depending on score normalization)  
- Optimizer: SGD  
- Figure out if there's anything else we need to design or specify

Implement helper functions:
- `train_one_epoch()`
- `evaluate_model()`
- `calculate_rmse()`
- `save_checkpoint()`  

Ensure detailed logging, ETA reporting, and reproducible random seeds.
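
To show how these pieces could fit together, here is a minimal sketch of a matrix factorization model with `train_one_epoch()` and `calculate_rmse()` helpers. Several choices are assumptions rather than decisions recorded in this notebook: the bias terms, the sigmoid output, the unweighted MSE loss, the initialization scale, and the toy data and hyperparameters.

```python
import torch
from torch import nn


class MatrixFactorization(nn.Module):
    """Score = sigmoid(player_vector . opening_vector + player_bias + opening_bias)."""

    def __init__(self, num_players, num_openings, num_factors):
        super().__init__()
        self.player_factors = nn.Embedding(num_players, num_factors)
        self.opening_factors = nn.Embedding(num_openings, num_factors)
        self.player_bias = nn.Embedding(num_players, 1)
        self.opening_bias = nn.Embedding(num_openings, 1)
        nn.init.normal_(self.player_factors.weight, std=0.01)
        nn.init.normal_(self.opening_factors.weight, std=0.01)
        nn.init.zeros_(self.player_bias.weight)
        nn.init.zeros_(self.opening_bias.weight)

    def forward(self, player_ids, opening_ids):
        dot = (self.player_factors(player_ids) * self.opening_factors(opening_ids)).sum(dim=1)
        bias = self.player_bias(player_ids).squeeze(1) + self.opening_bias(opening_ids).squeeze(1)
        return torch.sigmoid(dot + bias)


def train_one_epoch(model, optimizer, player_ids, opening_ids, scores, batch_size):
    """Run one pass of mini-batch SGD and return the mean training MSE."""
    model.train()
    order = torch.randperm(len(scores))
    total_loss = 0.0
    for start in range(0, len(scores), batch_size):
        idx = order[start:start + batch_size]
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(player_ids[idx], opening_ids[idx]), scores[idx])
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(idx)
    return total_loss / len(scores)


def calculate_rmse(model, player_ids, opening_ids, scores):
    """Root mean squared error without tracking gradients."""
    model.eval()
    with torch.no_grad():
        predictions = model(player_ids, opening_ids)
    return torch.sqrt(nn.functional.mse_loss(predictions, scores)).item()


# Toy run: 3 players x 2 openings, illustrative scores and hyperparameters
torch.manual_seed(42)
player_ids = torch.tensor([0, 0, 1, 1, 2, 2])
opening_ids = torch.tensor([0, 1, 0, 1, 0, 1])
scores = torch.tensor([0.60, 0.70, 0.40, 0.30, 0.55, 0.45])

model = MatrixFactorization(num_players=3, num_openings=2, num_factors=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

rmse_before = calculate_rmse(model, player_ids, opening_ids, scores)
for epoch in range(50):
    train_one_epoch(model, optimizer, player_ids, opening_ids, scores, batch_size=2)
rmse_after = calculate_rmse(model, player_ids, opening_ids, scores)
print(f"RMSE before: {rmse_before:.4f}, after: {rmse_after:.4f}")
```

The sigmoid keeps predictions inside the 0 to 1 range of the score; dropping it is the "none" option listed under activation above.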

---

### 7. Training Loop
- Initialize player and opening embeddings.  
- Iterate through epochs with mini-batch SGD (`BATCH_SIZE = 1024`).  
- Compute and log MSE/RMSE per epoch.  
- Save model checkpoints locally after each epoch.

---

### 8. Evaluation
- Evaluate on test set.  
- Report MSE, RMSE, and visual diagnostics (predicted vs actual score).  
- Inspect a few player and opening latent factors for sanity.

---

### 9. Cross-Validation & Hyperparameter Tuning
- Define ranges for:  
  - `NUM_FACTORS`, `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`  
- Perform small-scale grid or random search for best configuration.  
- Compare validation RMSE across runs.
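
A small-scale grid search can be sketched with `itertools.product`. The search ranges below are placeholders, and `validation_rmse` is a hypothetical stand-in that returns a made-up number; in the real notebook it would train a model with the given configuration and return its RMSE on the validation set.

```python
import itertools

# Placeholder ranges; the real ones are still to be defined
search_space = {
    "NUM_FACTORS": [16, 32],
    "LEARNING_RATE": [0.01, 0.1],
    "BATCH_SIZE": [1024],
}


def validation_rmse(config):
    """Hypothetical stand-in for 'train with config, return validation RMSE'."""
    return abs(config["NUM_FACTORS"] - 32) / 100 + abs(config["LEARNING_RATE"] - 0.01)


def grid_search(space, score_fn):
    """Evaluate every combination and return (best_score, best_config)."""
    keys = list(space)
    results = []
    for values in itertools.product(*(space[key] for key in keys)):
        config = dict(zip(keys, values))
        results.append((score_fn(config), config))
    return min(results, key=lambda result: result[0])


best_rmse, best_config = grid_search(search_space, validation_rmse)
print(best_rmse, best_config)
```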

---

### 10. Next Steps
- Extend model to include Black openings.  
- Experiment with hybrid inputs (player rating, ECO grouping).  
- Consider implicit feedback handling (unplayed openings as zeros).  
- Integrate trained model into API for recommendation output.

---

**Notes:**  
- Every random seed and parameter definition will be explicit.  
- Every major step includes row-count, schema, and type validation.  
- Model artifacts and logs will be saved locally for reproducibility.
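
For the explicit seeding mentioned in the notes, a helper along these lines could be called at the top of each step. The name `seed_everything` is an assumption, and the sketch covers only CPU random number generators; GPU determinism needs additional PyTorch settings that are not shown here.

```python
import random

import numpy as np
import torch


def seed_everything(seed=42):
    """Seed Python, NumPy, and PyTorch RNGs so repeated runs draw the same numbers."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


seed_everything(42)
first_draw = (random.random(), np.random.rand(), torch.rand(1).item())
seed_everything(42)
second_draw = (random.random(), np.random.rand(), torch.rand(1).item())
print(first_draw == second_draw)  # True
```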


## Step 1: Data Extraction

Connect to DuckDB and extract all player-opening statistics.
Verify schema and perform sanity checks.

In [None]:
# Setup and imports
from pathlib import Path
import pandas as pd
import sys

# Add utils to path
sys.path.append(str(Path.cwd() / 'utils'))
from database.db_utils import get_db_connection

# Configuration
DB_PATH = Path.cwd().parent / "data" / "processed" / "chess_games.db"
COLOR_FILTER = 'w'  # 'w' for white, 'b' for black
MIN_HOLDOUT_PLAYERS = 1000  # Minimum number of players to reserve for fold-in verification. These will not be used at all in this notebook for training, test/val or anything else.

print("=" * 60)
print("STEP 1: DATA EXTRACTION")
print("=" * 60)
print(f"\n📁 Database: {DB_PATH}")
print(f"📁 Database exists: {DB_PATH.exists()}")
print(f"🎨 Color filter: {'White' if COLOR_FILTER == 'w' else 'Black'}")
print(f"🔒 Minimum holdout players: {MIN_HOLDOUT_PLAYERS:,}")

if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found at {DB_PATH}")

In [None]:
# Connect to DuckDB and extract player-opening statistics
con = get_db_connection(str(DB_PATH))

try:
    print(f"\n1️⃣  Extracting player-opening statistics (color: '{COLOR_FILTER}')...")

    # Extract stats with calculated score and num_games
    # Filter by color, minimum rating, and calculate score in the database
    MIN_RATING = 1200
    print(f"   • Minimum rating filter: {MIN_RATING}")

    # First, get all eligible players and randomly select holdout set
    print(f"\n2️⃣  Selecting holdout players for fold-in verification...")
    print(f"   • Holdout size: {MIN_HOLDOUT_PLAYERS:,} players minimum")
    
    # Get all players with sufficient data.
    # ORDER BY fixes the row order; without it the database may return rows in
    # any order, and the seeded sample below could pick different players per run.
    player_query = f"""
        SELECT DISTINCT p.id as player_id
        FROM player p
        JOIN player_opening_stats pos ON p.id = pos.player_id
        WHERE p.rating >= {MIN_RATING}
        AND pos.color = '{COLOR_FILTER}'
        ORDER BY p.id
    """

    # .df() already returns a pandas DataFrame
    all_eligible_players = con.execute(player_query).df()
    total_eligible = len(all_eligible_players)
    print(f"   • Total eligible players: {total_eligible:,}")
    
    if total_eligible < MIN_HOLDOUT_PLAYERS:
        raise ValueError(f"Not enough eligible players ({total_eligible:,}) to create holdout set of {MIN_HOLDOUT_PLAYERS:,}")
    
    # Randomly sample holdout players (deterministic with seed)
    import numpy as np
    np.random.seed(42)  # For reproducibility
    
    holdout_player_ids = np.random.choice(
        all_eligible_players['player_id'].values,
        size=MIN_HOLDOUT_PLAYERS,
        replace=False
    )
    
    training_player_ids = set(all_eligible_players['player_id'].values) - set(holdout_player_ids)
    
    print(f"   • Holdout players selected: {len(holdout_player_ids):,}")
    print(f"   • Training players available: {len(training_player_ids):,}")
    print(f"   • Holdout percentage: {100 * len(holdout_player_ids) / total_eligible:.1f}%")
    
    # Convert training player IDs to SQL-friendly string (sorted so the query text is stable)
    training_player_ids_str = ','.join(map(str, sorted(training_player_ids)))

    # Extract data ONLY for training players
    print(f"\n3️⃣  Extracting training data (excluding holdout players)...")
    
    query = f"""
        SELECT 
            pos.player_id,
            pos.opening_id,
            pos.num_wins + pos.num_draws + pos.num_losses as num_games,
            (pos.num_wins + (pos.num_draws * 0.5)) / 
                NULLIF(pos.num_wins + pos.num_draws + pos.num_losses, 0) as score,
            o.eco
        FROM player_opening_stats pos
        JOIN opening o ON pos.opening_id = o.id
        JOIN player p ON pos.player_id = p.id
        WHERE pos.color = '{COLOR_FILTER}'
        AND p.rating >= {MIN_RATING}
        AND pos.player_id IN ({training_player_ids_str})
        ORDER BY pos.player_id, pos.opening_id
    """
    
    raw_data = con.execute(query).df()

    print(f"   ✓ Extracted {len(raw_data):,} rows")

    # Keep holdout player IDs in memory for later use (nothing is written to disk here)
    holdout_players_df = pd.DataFrame({'player_id': holdout_player_ids})
    print(f"\n   💾 Stored holdout_players_df with {len(holdout_players_df):,} player IDs (in memory)")
    print(f"   • These players are COMPLETELY UNSEEN by the training process")
    print(f"   • Use them later for fold-in verification")
    
    # Schema verification
    print("\n4️⃣  Verifying schema...")
    required_columns = ['player_id', 'opening_id', 'num_games', 'score', 'eco']

    for col in required_columns:
        if col not in raw_data.columns:
            raise ValueError(f"Missing required column: {col}")

    print(f"   ✓ All required columns present: {required_columns}")

    # Data types verification
    print("\n5️⃣  Checking data types...")
    print(f"   • player_id: {raw_data['player_id'].dtype}")
    print(f"   • opening_id: {raw_data['opening_id'].dtype}")
    print(f"   • num_games: {raw_data['num_games'].dtype}")
    print(f"   • score: {raw_data['score'].dtype}")
    print(f"   • eco: {raw_data['eco'].dtype}")

    # Basic statistics
    print("\n6️⃣  Data statistics...")
    print(f"   • Total rows: {len(raw_data):,}")
    print(f"   • Unique players: {raw_data['player_id'].nunique():,}")
    print(f"   • Unique openings: {raw_data['opening_id'].nunique():,}")
    print(f"   • Total games (sum): {raw_data['num_games'].sum():,}")

    # Player ID range
    print(f"\n   Player ID range:")
    print(f"   • Min: {raw_data['player_id'].min()}")
    print(f"   • Max: {raw_data['player_id'].max()}")

    # Opening ID range
    print(f"\n   Opening ID range:")
    print(f"   • Min: {raw_data['opening_id'].min()}")
    print(f"   • Max: {raw_data['opening_id'].max()}")

    # Games per entry statistics
    print(f"\n   Games per entry:")
    print(f"   • Min: {raw_data['num_games'].min()}")
    print(f"   • Max: {raw_data['num_games'].max()}")
    print(f"   • Mean: {raw_data['num_games'].mean():.1f}")
    print(f"   • Median: {raw_data['num_games'].median():.0f}")

    # Score statistics
    print(f"\n   Score distribution:")
    print(f"   • Min: {raw_data['score'].min():.4f}")
    print(f"   • Max: {raw_data['score'].max():.4f}")
    print(f"   • Mean: {raw_data['score'].mean():.4f}")
    print(f"   • Median: {raw_data['score'].median():.4f}")

    # Check for null values
    print("\n7️⃣  Checking for null values...")
    null_counts = raw_data.isnull().sum()
    if null_counts.sum() == 0:
        print("   ✓ No null values found")
    else:
        print("   ⚠️  Found null values:")
        for col, count in null_counts[null_counts > 0].items():
            print(f"      • {col}: {count} nulls")

    # Sample data
    print("\n8️⃣  Sample of extracted data (first 10 rows):")
    print(raw_data.head(10).to_string())

    print("\n" + "=" * 60)
    print("✅ DATA EXTRACTION COMPLETE")
    print("=" * 60)
    print(f"\nData shape: {raw_data.shape}")
    print(f"Columns: {list(raw_data.columns)}")
    print(f"\n⚠️  IMPORTANT: {len(holdout_player_ids):,} players held out for fold-in verification")
    print(f"   • Access via: holdout_players_df")
    print(f"   • These players will NOT appear in any training, validation, or test splits")

finally:
    con.close()
    print("\n✓ Database connection closed")

## Step 2: Data Sanitization & Normalization

Filter low-quality data, handle duplicates, and prepare for training.

In [None]:
# 2a. Filter low-quality data, handle duplicates, and prepare for training.

import numpy as np

# Configuration
MIN_GAMES_THRESHOLD = 10

print("=" * 60)
print("STEP 2: DATA SANITIZATION & NORMALIZATION")
print("=" * 60)
print(f"\n⚙️  Configuration:")
print(f"   • MIN_GAMES_THRESHOLD: {MIN_GAMES_THRESHOLD}")

# Start with raw_data from Step 1
print(f"\n📊 Starting data shape: {raw_data.shape}")
print(f"   • Rows: {len(raw_data):,}")
print(f"   • Unique players: {raw_data['player_id'].nunique():,}")
print(f"   • Unique openings: {raw_data['opening_id'].nunique():,}")

# 1. Filter by minimum games threshold
print(f"\n1️⃣  Filtering entries with < {MIN_GAMES_THRESHOLD} games...")
before_filter = len(raw_data)
clean_data = raw_data.query(f'num_games >= {MIN_GAMES_THRESHOLD}').copy()
num_rows_after_filter = len(clean_data)
num_rows_filtered_out = before_filter - num_rows_after_filter

print(f"   • Before: {before_filter:,} rows")
print(f"   • After: {num_rows_after_filter:,} rows")
print(f"   • Filtered out: {num_rows_filtered_out:,} rows ({100*num_rows_filtered_out/before_filter:.1f}%)")

# 2. Check for duplicates
print(f"\n2️⃣  Checking for duplicate (player_id, opening_id) combinations...")
num_duplicates = clean_data.duplicated(subset=['player_id', 'opening_id']).sum()

if num_duplicates > 0:
    print(f"   ⚠️  Found {num_duplicates} duplicate entries")
    dup_mask = clean_data.duplicated(subset=['player_id', 'opening_id'], keep=False)
    print("\n   Sample of duplicates:")
    print(clean_data[dup_mask].head(10).to_string())

    # Keep only first occurrence of any duplicate player-opening pair
    print("\n   Removing duplicates (keeping first occurrence)...")
    clean_data = clean_data.drop_duplicates(subset=['player_id', 'opening_id'], keep='first')
    print(f"   ✓ After deduplication: {len(clean_data):,} rows")
else:
    print(f"   ✓ No duplicates found")

# 3. Remove players with no qualifying openings
# Note that a few players only play stuff like the Van't Kruijs which we've excluded,
# so a small number of players will be excluded here.
# The threshold filter above already dropped every row for such players, so the
# "before" count comes from raw_data; counting from clean_data would always report 0 removed.
print(f"\n3️⃣  Removing players with no qualifying openings...")
players_before = raw_data['player_id'].nunique()

# Count qualifying openings per player
num_openings_per_player = clean_data.groupby('player_id').size()
players_with_data = num_openings_per_player[num_openings_per_player > 0].index

# Filter (defensive: every player still in clean_data has at least one row)
clean_data = clean_data[clean_data['player_id'].isin(players_with_data)]
players_after = clean_data['player_id'].nunique()

print(f"   • Players before: {players_before:,}")
print(f"   • Players after: {players_after:,}")
print(f"   • Removed: {players_before - players_after}")

# 4. Remove openings with no qualifying players
# As with players, the "before" count comes from raw_data so the report reflects
# openings that lost all their entries to the threshold filter.
print(f"\n4️⃣  Removing openings with no qualifying players...")
num_openings_before = raw_data['opening_id'].nunique()

# Count qualifying players per opening
num_players_per_opening = clean_data.groupby('opening_id').size()
openings_with_data = num_players_per_opening[num_players_per_opening > 0].index

# Filter (defensive: every opening still in clean_data has at least one row)
clean_data = clean_data[clean_data['opening_id'].isin(openings_with_data)]
openings_after = clean_data['opening_id'].nunique()

print(f"   • Openings before: {num_openings_before:,}")
print(f"   • Openings after: {openings_after:,}")
print(f"   • Removed: {num_openings_before - openings_after}")

# 5. Verify no null values
print(f"\n5️⃣  Verifying no null values...")
null_counts = clean_data.isna().sum()
if null_counts.sum() == 0:
    print("   ✓ No null values found")
else:
    print("   ⚠️  Found null values:")
    for col, count in null_counts[null_counts > 0].items():
        print(f"      • {col}: {count} nulls")
    # Drop rows with nulls
    clean_data = clean_data.dropna()
    print(f"   ✓ Dropped null rows. New shape: {clean_data.shape}")

# TODO: Add confidence weighting column
# TODO: Extract and normalize player ratings (side information)

# Reset index
clean_data = clean_data.reset_index(drop=True)

# Final statistics
print(f"\n6️⃣  Final data statistics:")
print(f"   • Total rows: {len(clean_data):,}")
print(f"   • Unique players: {clean_data['player_id'].nunique():,}")
print(f"   • Unique openings: {clean_data['opening_id'].nunique():,}")
print(f"   • Total games: {clean_data['num_games'].sum():,}")
print(f"   • Avg games per entry: {clean_data['num_games'].mean():.1f}")
print(f"   • Avg openings per player: {len(clean_data) / clean_data['player_id'].nunique():.1f}")
print(f"   • Avg players per opening: {len(clean_data) / clean_data['opening_id'].nunique():.1f}")

# Score distribution
print(f"\n   Score statistics:")
print(f"   • Min: {clean_data['score'].min():.4f}")
print(f"   • 25th percentile: {clean_data['score'].quantile(0.25):.4f}")
print(f"   • Median: {clean_data['score'].median():.4f}")
print(f"   • 75th percentile: {clean_data['score'].quantile(0.75):.4f}")
print(f"   • Max: {clean_data['score'].max():.4f}")
print(f"   • Mean: {clean_data['score'].mean():.4f}")
print(f"   • Std: {clean_data['score'].std():.4f}")

# Sample of cleaned data
print(f"\n7️⃣  Sample of cleaned data (10 random rows):")
print(clean_data.sample(min(10, len(clean_data)), random_state=42).to_string())

print("\n" + "=" * 60)
print("✅ DATA SANITIZATION COMPLETE")
print("=" * 60)
print(f"\nCleaned data shape: {clean_data.shape}")
print(f"Data reduction: {100 * (1 - len(clean_data)/len(raw_data)):.1f}%")

In [None]:
# 2b. Apply hierarchical Bayesian shrinkage to adjust scores based on sample size confidence

# Check if confidence already exists - if so, skip this processing
if 'confidence' in clean_data.columns:
    print("=" * 60)
    print("⏭️  SKIPPING STEP 2B: HIERARCHICAL BAYESIAN SCORE ADJUSTMENT")
    print("=" * 60)
    print("\n✓ 'confidence' column already exists in data")
    print("   This indicates hierarchical Bayesian processing has already been applied.")
    print(f"\nCurrent data shape: {clean_data.shape}")
    print(f"Confidence range: [{clean_data['confidence'].min():.4f}, {clean_data['confidence'].max():.4f}]")
else:
    # Define the processing function
    # This is a long function, I recommend you fold it down in your editor
    def apply_hierarchical_bayesian_shrinkage(data, k_player=50):
        """
        Apply two-level hierarchical Bayesian shrinkage to adjust scores.
        
        A lot of our player-opening entries have a small number of games played, because openings are so specific.
        This introduces sample size issues.
        
        We use TWO-LEVEL shrinkage:
        Level 1: Calculate opening-specific means (these are our "ground truth" for each opening)
        Level 2: Shrink individual player-opening scores toward their opening's mean
        This is better than shrinking toward global mean because different openings have different baseline win rates
        
        Parameters:
        -----------
        data : pd.DataFrame
            Clean data with columns: player_id, opening_id, score, num_games, eco
        k_player : int
            Shrinkage constant for player-opening scores (default: 50)
            
        Returns:
        --------
        pd.DataFrame
            Data with adjusted scores and new 'confidence' column
        """
        print("=" * 60)
        print("STEP 2B: HIERARCHICAL BAYESIAN SCORE ADJUSTMENT")
        print("=" * 60)
        
        print(f"\n⚙️  Configuration:")
        print(f"   • K_PLAYER (shrinkage constant): {k_player}")
        print(f"   • Method: Two-level empirical Bayes shrinkage")
        print(f"   • Level 1: Calculate opening-specific means")
        print(f"   • Level 2: Shrink player scores toward opening means")

        # Calculate global mean score for comparison
        global_mean_score = data["score"].mean()
        print(f"\n📊 Global statistics:")
        print(f"   • Global mean score: {global_mean_score:.4f}")
        print(f"   • Total entries: {len(data):,}")
        print(f"   • Unique openings: {data['opening_id'].nunique():,}")

        # Store original scores for comparison
        data = data.copy()  # Best practice: work on a copy
        data["score_original"] = data["score"].copy()

        # LEVEL 1: Calculate opening-specific means and statistics
        print(f"\n1️⃣  LEVEL 1: Calculating opening-specific means...")
        
        opening_stats = (
            data.groupby("opening_id")
            .agg(
                {
                    "score": "mean",
                    "num_games": "sum",
                    "player_id": "count",  # Number of players who played this opening
                }
            )
            .rename(
                columns={
                    "score": "opening_mean",
                    "num_games": "opening_total_games",
                    "player_id": "opening_num_players",
                }
            )
        )
        
        print(f"   ✓ Calculated means for {len(opening_stats):,} openings")

        # Opening mean statistics
        print(f"\n   Opening mean score distribution:")
        print(f"   • Min: {opening_stats['opening_mean'].min():.4f}")
        print(f"   • 25th percentile: {opening_stats['opening_mean'].quantile(0.25):.4f}")
        print(f"   • Median: {opening_stats['opening_mean'].median():.4f}")
        print(f"   • 75th percentile: {opening_stats['opening_mean'].quantile(0.75):.4f}")
        print(f"   • Max: {opening_stats['opening_mean'].max():.4f}")
        print(f"   • Std: {opening_stats['opening_mean'].std():.4f}")

        # Show distribution of opening sizes
        print(f"\n   Opening sample size distribution:")
        print(
            f"   • Total games per opening (median): {opening_stats['opening_total_games'].median():.0f}"
        )
        print(
            f"   • Players per opening (median): {opening_stats['opening_num_players'].median():.0f}"
        )
        print(
            f"   • Total games range: [{opening_stats['opening_total_games'].min():.0f}, {opening_stats['opening_total_games'].max():.0f}]"
        )
        print(
            f"   • Players range: [{opening_stats['opening_num_players'].min():.0f}, {opening_stats['opening_num_players'].max():.0f}]"
        )
        
        # Merge opening means back into main dataframe
        data = data.merge(
            opening_stats[["opening_mean"]], left_on="opening_id", right_index=True, how="left"
        )
        
        # LEVEL 2: Shrink player-opening scores toward opening-specific means
        print(f"\n2️⃣  LEVEL 2: Shrinking player scores toward opening means...")
        print(
            f"   Formula: adjusted_score = (num_games × player_score + {k_player} × opening_mean) / (num_games + {k_player})"
        )

        numerator = (data["num_games"] * data["score_original"]) + (
            k_player * data["opening_mean"]
        )
        denominator = data["num_games"] + k_player
        data["score"] = numerator / denominator

        print(f"   ✓ Scores adjusted for {len(data):,} entries")

        # Calculate confidence weights (will be used in loss function later)
        print(f"\n3️⃣  Calculating confidence weights...")
        data["confidence"] = data["num_games"] / (
            data["num_games"] + k_player
        )
        print(f"   ✓ Confidence weights calculated")
        print(f"   • Formula: confidence = num_games / (num_games + {k_player})")
        print(
            f"   • Range: [{data['confidence'].min():.4f}, {data['confidence'].max():.4f}]"
        )
        
        # Statistics on the adjustment
        score_diff = data["score"] - data["score_original"]
        print(f"\n4️⃣  Adjustment statistics:")
        print(f"   • Mean adjustment: {score_diff.mean():.6f}")
        print(f"   • Std adjustment: {score_diff.std():.6f}")
        print(f"   • Max adjustment: {score_diff.max():.6f}")
        print(f"   • Min adjustment: {score_diff.min():.6f}")

        # Show average adjustment within each num_games quartile bucket
        print(f"\n   Adjustment by num_games quartiles:")
        quartiles = data["num_games"].quantile([0.25, 0.5, 0.75])
        q1, q2, q3 = quartiles[0.25], quartiles[0.5], quartiles[0.75]
        print(
            f"   • 1st quartile (n ≤ {q1:.0f} games): avg adjustment = {score_diff[data['num_games'] <= q1].mean():.6f}"
        )
        print(
            f"   • 2nd quartile ({q1:.0f} < n ≤ {q2:.0f} games): avg adjustment = {score_diff[(data['num_games'] > q1) & (data['num_games'] <= q2)].mean():.6f}"
        )
        print(
            f"   • 3rd quartile ({q2:.0f} < n ≤ {q3:.0f} games): avg adjustment = {score_diff[(data['num_games'] > q2) & (data['num_games'] <= q3)].mean():.6f}"
        )
        print(
            f"   • 4th quartile (n > {q3:.0f} games): avg adjustment = {score_diff[data['num_games'] > q3].mean():.6f}"
        )
        
        # New score distribution after adjustment
        print(f"\n5️⃣  Adjusted score statistics:")
        print(f"   • Min: {data['score'].min():.4f}")
        print(f"   • 25th percentile: {data['score'].quantile(0.25):.4f}")
        print(f"   • Median: {data['score'].median():.4f}")
        print(f"   • 75th percentile: {data['score'].quantile(0.75):.4f}")
        print(f"   • Max: {data['score'].max():.4f}")
        print(f"   • Mean: {data['score'].mean():.4f}")
        print(f"   • Std: {data['score'].std():.4f}")
        
        # Detailed sample showing the effect across different game counts
        print(f"\n6️⃣  Sample comparisons (showing effect of hierarchical shrinkage):")

        # Each group: (heading, boolean mask selecting its rows)
        sample_groups = [
            (
                "Low-game entries (10-20 games) - HIGH shrinkage toward opening mean:",
                (data["num_games"] >= 10) & (data["num_games"] <= 20),
            ),
            (
                "Medium-game entries (50-100 games) - MODERATE shrinkage:",
                (data["num_games"] >= 50) & (data["num_games"] <= 100),
            ),
            (
                "High-game entries (200+ games) - LOW shrinkage:",
                data["num_games"] >= 200,
            ),
        ]

        for title, mask in sample_groups:
            print(f"\n   {'='*120}")
            print(f"   {title}")
            print(f"   {'='*120}")

            # Sample up to 10 rows from the group (fewer if the group is small)
            group = data[mask]
            group_sample = group.sample(min(10, len(group)), random_state=42)
            for idx, row in group_sample.iterrows():
                adjustment = row["score"] - row["score_original"]
                print(
                    f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} | Games: {row['num_games']:>3} | "
                    f"Opening mean: {row['opening_mean']:.4f} | Original: {row['score_original']:.4f} → Adjusted: {row['score']:.4f} | "
                    f"Diff: {adjustment:>+.4f} | Confidence: {row['confidence']:.3f}"
                )
        
        # Show extreme cases - comparing to both opening mean AND global mean
        print(f"\n7️⃣  Extreme cases (showing why opening-specific shrinkage matters):")

        # Find entries where opening mean differs significantly from global mean
        data["opening_deviation_from_global"] = (
            data["opening_mean"] - global_mean_score
        ).abs()

        # Reduce to one row per opening BEFORE ranking. Ranking the row-level frame
        # first would return several rows of the same opening, so fewer than five
        # distinct openings would survive the deduplication.
        per_opening = data.drop_duplicates("opening_id")[
            ["opening_id", "opening_mean", "eco"]
        ]

        print(f"\n   Openings with HIGHEST win rates (strong for White):")
        strong_openings = per_opening.nlargest(5, "opening_mean")
        for idx, row in strong_openings.iterrows():
            num_entries = len(data[data["opening_id"] == row["opening_id"]])
            deviation = row["opening_mean"] - global_mean_score
            print(
                f"   Opening {row['opening_id']:>4} ({row['eco']:>3}): mean = {row['opening_mean']:.4f} "
                f"({deviation:+.4f} vs global) | {num_entries} player entries"
            )

        print(f"\n   Openings with LOWEST win rates (weak for White):")
        weak_openings = per_opening.nsmallest(5, "opening_mean")
        for idx, row in weak_openings.iterrows():
            num_entries = len(data[data["opening_id"] == row["opening_id"]])
            deviation = row["opening_mean"] - global_mean_score
            print(
                f"   Opening {row['opening_id']:>4} ({row['eco']:>3}): mean = {row['opening_mean']:.4f} "
                f"({deviation:+.4f} vs global) | {num_entries} player entries"
            )
        
        # Show specific examples where hierarchical shrinkage made a difference
        print(f"\n8️⃣  Examples showing hierarchical shrinkage benefit:")
        
        # Find entries with strong openings where player did well
        strong_opening_ids = data.nlargest(50, "opening_mean")["opening_id"].unique()
        # Build the filter once and reuse it for the selection and the sample size
        strong_mask = (
            data["opening_id"].isin(strong_opening_ids)
            & (data["num_games"] <= 20)
            & (data["score_original"] > 0.6)
        )
        strong_examples = data[strong_mask].sample(
            min(3, int(strong_mask.sum())), random_state=42
        )
        
        print(
            f"\n   Strong opening + good player performance (shrunk toward HIGH opening mean):"
        )
        for idx, row in strong_examples.iterrows():
            adjustment = row["score"] - row["score_original"]
            global_shrink_would_be = (
                (row["num_games"] * row["score_original"]) + (k_player * global_mean_score)
            ) / (row["num_games"] + k_player)
            difference = row["score"] - global_shrink_would_be
            print(
                f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} ({row['eco']:>3}) | Games: {row['num_games']:>2} | "
                f"Opening mean: {row['opening_mean']:.4f} | Original: {row['score_original']:.4f} → {row['score']:.4f}"
            )
            print(
                f"      If we'd shrunk to global mean: {global_shrink_would_be:.4f} (would lose {difference:+.4f} of deserved credit)"
            )
        
        # Find entries with weak openings where player did poorly
        weak_opening_ids = data.nsmallest(50, "opening_mean")["opening_id"].unique()
        # Build the filter once and reuse it for the selection and the sample size
        weak_mask = (
            data["opening_id"].isin(weak_opening_ids)
            & (data["num_games"] <= 20)
            & (data["score_original"] < 0.45)
        )
        weak_examples = data[weak_mask].sample(
            min(3, int(weak_mask.sum())), random_state=42
        )
        
        print(f"\n   Weak opening + poor player performance (shrunk toward LOW opening mean):")
        for idx, row in weak_examples.iterrows():
            adjustment = row["score"] - row["score_original"]
            global_shrink_would_be = (
                (row["num_games"] * row["score_original"]) + (k_player * global_mean_score)
            ) / (row["num_games"] + k_player)
            difference = row["score"] - global_shrink_would_be
            print(
                f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} ({row['eco']:>3}) | Games: {row['num_games']:>2} | "
                f"Opening mean: {row['opening_mean']:.4f} | Original: {row['score_original']:.4f} → {row['score']:.4f}"
            )
            print(
                f"      If we'd shrunk to global mean: {global_shrink_would_be:.4f} (would unfairly boost by {-difference:+.4f})"
            )
        
        # Drop temporary columns
        print(f"\n9️⃣  Cleaning up...")
        data = data.drop(
            columns=["score_original", "opening_mean", "opening_deviation_from_global"]
        )
        print(f"   ✓ Removed temporary columns")
        
        print(f"\n" + "=" * 60)
        print("✅ HIERARCHICAL BAYESIAN ADJUSTMENT COMPLETE")
        print("=" * 60)
        print(f"\nFinal data shape: {data.shape}")
        print(f"Columns: {list(data.columns)}")
        print(f"\nNew columns added:")
        print(f"   • 'confidence': weight for loss function (range [0,1])")
        print(f"   • 'score': adjusted using hierarchical Bayesian shrinkage")
        print(f"\nKey improvement over simple shrinkage:")
        print(f"   • Player scores now shrink toward OPENING-SPECIFIC means, not global mean")
        print(f"   • Preserves opening difficulty differences")
        print(f"   • More accurate for both strong and weak openings")
        
        return data
    
    # Configuration for Bayesian shrinkage
    K_PLAYER = 50  # Shrinkage constant for player-opening scores
    
    # Call the function
    clean_data = apply_hierarchical_bayesian_shrinkage(clean_data, k_player=K_PLAYER)
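
To make the comparison printed by the cell above concrete, here is a minimal worked sketch of the two shrinkage targets. It reuses the global-mean formula from the cell, `(n * score + k * global_mean) / (n + k)`, and assumes the hierarchical version has the same form with the opening mean as the prior. That assumption, the helper name `shrink`, and every number below are illustrative and not taken from the data.

```python
def shrink(raw_score: float, num_games: int, prior_mean: float, k: float) -> float:
    """Pull a raw score toward prior_mean; k behaves like k pseudo-games at the prior."""
    return (num_games * raw_score + k * prior_mean) / (num_games + k)

K_PLAYER = 50          # same constant as the cell above
raw_score = 0.70       # hypothetical: 7 points from 10 games
num_games = 10
opening_mean = 0.56    # hypothetical opening that scores well for White
global_mean = 0.50     # hypothetical global mean score

toward_opening = shrink(raw_score, num_games, opening_mean, K_PLAYER)
toward_global = shrink(raw_score, num_games, global_mean, K_PLAYER)

print(f"shrunk toward opening mean: {toward_opening:.4f}")
print(f"shrunk toward global mean:  {toward_global:.4f}")
print(f"difference:                 {toward_opening - toward_global:+.4f}")
```

The gap between the two results equals `k / (n + k)` times the distance between the two priors, so the choice of prior matters most for entries with few games.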


In [None]:
print(clean_data.sample().to_string())

In [None]:
# 2c. Gather player rating statistics (no mutation, just exploration)

print("=" * 60)
print("STEP 2C: PLAYER RATING STATISTICS")
print("=" * 60)

# Connect to database and extract player ratings
con = get_db_connection(str(DB_PATH))

try:
    print(f"\n1️⃣  Extracting player ratings from database...")
    
    # Get unique player IDs from our clean_data
    unique_player_ids = clean_data['player_id'].unique()
    player_ids_str = ','.join(map(str, unique_player_ids))
    
    # Query to get player ratings
    rating_query = f"""
        SELECT 
            id as player_id,
            name,
            title,
            rating
        FROM player
        WHERE id IN ({player_ids_str})
    """
    
    # .df() already returns a pandas DataFrame, so no extra wrapping is needed
    player_ratings = con.execute(rating_query).df()
    print(f"   ✓ Retrieved ratings for {len(player_ratings):,} players")
    
finally:
    con.close()
    print("   ✓ Database connection closed")

# Merge ratings into clean_data for analysis
print(f"\n2️⃣  Merging ratings with clean_data...")
clean_data_with_ratings = clean_data.merge(player_ratings[['player_id', 'rating']], on='player_id', how='left')
print(f"   ✓ Merged successfully")

# Check for missing ratings
num_missing_ratings = clean_data_with_ratings['rating'].isna().sum()
if num_missing_ratings > 0:
    print(f"   ⚠️  {num_missing_ratings:,} entries ({100*num_missing_ratings/len(clean_data_with_ratings):.2f}%) have missing ratings")
else:
    print(f"   ✓ All entries have ratings")

# Basic rating statistics
print(f"\n3️⃣  Basic rating statistics:")
print(f"   • Count: {player_ratings['rating'].notna().sum():,}")
print(f"   • Missing: {player_ratings['rating'].isna().sum():,}")
print(f"   • Min: {player_ratings['rating'].min():.0f}")
print(f"   • Max: {player_ratings['rating'].max():.0f}")
print(f"   • Mean: {player_ratings['rating'].mean():.2f}")
print(f"   • Median: {player_ratings['rating'].median():.0f}")
print(f"   • Std Dev: {player_ratings['rating'].std():.2f}")

# Quartile statistics
print(f"\n4️⃣  Quartile statistics:")
print(f"   • 25th percentile: {player_ratings['rating'].quantile(0.25):.0f}")
print(f"   • 50th percentile (median): {player_ratings['rating'].quantile(0.50):.0f}")
print(f"   • 75th percentile: {player_ratings['rating'].quantile(0.75):.0f}")

# Granular percentile statistics (5% increments)
print(f"\n5️⃣  Detailed percentile distribution (5% increments):")
percentiles = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
               0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00]

print(f"\n   {'Percentile':<12} {'Rating':<10} {'Visual'}")
print(f"   {'-'*12} {'-'*10} {'-'*40}")

for p in percentiles:
    rating_value = player_ratings['rating'].quantile(p)
    # Create a simple bar visualization
    bar_length = int((rating_value - player_ratings['rating'].min()) / 
                     (player_ratings['rating'].max() - player_ratings['rating'].min()) * 40)
    bar = '█' * bar_length
    print(f"   {p*100:>5.0f}%       {rating_value:>7.0f}    {bar}")

# Rating ranges and counts
print(f"\n6️⃣  Rating distribution by range:")
rating_ranges = [
    (0, 1000), (1000, 1200), (1200, 1400), (1400, 1600), 
    (1600, 1800), (1800, 2000), (2000, 2200), (2200, 2400), 
    (2400, 2600), (2600, 3000)
]

print(f"\n   {'Range':<15} {'Count':<10} {'Percentage':<12} {'Visual'}")
print(f"   {'-'*15} {'-'*10} {'-'*12} {'-'*40}")

for low, high in rating_ranges:
    count = len(player_ratings[(player_ratings['rating'] >= low) & (player_ratings['rating'] < high)])
    pct = 100 * count / len(player_ratings)
    bar_length = int(pct * 0.4)  # Scale for visualization
    bar = '█' * bar_length
    print(f"   {low:>4}-{high:<8} {count:>7,}    {pct:>6.2f}%      {bar}")

# Interquartile range
iqr = player_ratings['rating'].quantile(0.75) - player_ratings['rating'].quantile(0.25)
print(f"\n7️⃣  Spread statistics:")
print(f"   • Range: {player_ratings['rating'].max() - player_ratings['rating'].min():.0f}")
print(f"   • Interquartile Range (IQR): {iqr:.0f}")
print(f"   • 10th-90th percentile range: {player_ratings['rating'].quantile(0.90) - player_ratings['rating'].quantile(0.10):.0f}")

# Skewness and kurtosis if available
try:
    from scipy.stats import skew, kurtosis
    skewness = skew(player_ratings['rating'].dropna())
    kurt = kurtosis(player_ratings['rating'].dropna())
    print(f"\n8️⃣  Distribution shape:")
    print(f"   • Skewness: {skewness:.4f} {'(right-skewed)' if skewness > 0 else '(left-skewed)' if skewness < 0 else '(symmetric)'}")
    print(f"   • Kurtosis: {kurt:.4f} {'(heavy-tailed)' if kurt > 0 else '(light-tailed)' if kurt < 0 else '(normal)'}")
except ImportError:
    print(f"\n8️⃣  Distribution shape:")
    print(f"   • scipy not available for skewness/kurtosis calculation")

# Sample of players at different rating levels
print(f"\n9️⃣  Sample players at different rating levels:")
sample_percentiles = [0.1, 0.25, 0.5, 0.75, 0.9]
for p in sample_percentiles:
    rating_threshold = player_ratings['rating'].quantile(p)
    # Get a player near this rating
    sample_player = player_ratings.iloc[(player_ratings['rating'] - rating_threshold).abs().argsort()[:1]]
    print(f"\n   ~{p*100:.0f}th percentile (rating ≈ {rating_threshold:.0f}):")
    for idx, row in sample_player.iterrows():
        # print(f"      Player {row['player_id']}: {row['name']} - Rating: {row['rating']:.0f} {f'({row['title']})' if pd.notna(row['title']) else ''}")
        title_str = f" ({row['title']})" if pd.notna(row['title']) else ""
        print(f"      Player {row['player_id']}: {row['name']} - Rating: {row['rating']:.0f}{title_str}")

print("\n" + "=" * 60)
print("✅ RATING STATISTICS COMPLETE")
print("=" * 60)
print(f"\nKey takeaways:")
print(f"   • Total players: {len(player_ratings):,}")
print(f"   • Rating range: [{player_ratings['rating'].min():.0f}, {player_ratings['rating'].max():.0f}]")
print(f"   • Mean ± std: {player_ratings['rating'].mean():.0f} ± {player_ratings['rating'].std():.0f}")
print(f"   • Median: {player_ratings['rating'].median():.0f}")
print(f"\n   Next steps: Normalize ratings for model input")

In [None]:
# 2d. Normalize player ratings using z-score normalization (for use as side information in MF model)

# Check if we've already created the player_side_info table
if 'player_side_info' in globals() and 'rating_z' in player_side_info.columns:
    print("=" * 60)
    print("⏭️  SKIPPING STEP 2D: RATING NORMALIZATION")
    print("=" * 60)
    print("\n✓ 'player_side_info' table already exists")
    print("   This indicates rating normalization has already been applied.")
    print(f"\nPlayer side info shape: {player_side_info.shape}")
    
    # Show statistics
    print(f"\n📊 Existing normalized rating statistics:")
    print(f"   • Min: {player_side_info['rating_z'].min():.4f}")
    print(f"   • Max: {player_side_info['rating_z'].max():.4f}")
    print(f"   • Mean: {player_side_info['rating_z'].mean():.6f} (should be ~0)")
    print(f"   • Std: {player_side_info['rating_z'].std():.6f} (should be ~1)")
    
    print(f"\n📋 Sample of existing normalized ratings:")
    sample_data = player_side_info.sample(min(10, len(player_side_info)), random_state=42)
    for idx, row in sample_data.iterrows():
        # player_side_info keeps only 'rating_z' (indexed by player_id); 'name' and the
        # raw 'rating' are not stored in it, so only the z-score can be printed here.
        print(f"   Player {idx:>5} | Z-score: {row['rating_z']:>6.3f}")
else:
    def normalize_player_ratings(player_ratings_df):
        """
        Apply z-score normalization to player ratings for use as side information.
        
        This creates a SEPARATE table of player-level features, NOT merged into clean_data.
        Rating is side information - it describes the player, not the player-opening interaction.
        
        During training, the model will LOOK UP each player's rating_z from this table.
        
        Parameters:
        -----------
        player_ratings_df : pd.DataFrame
            Player ratings with columns: player_id, name, title, rating
            
        Returns:
        --------
        tuple: (player_side_info DataFrame, RATING_MEAN, RATING_STD)
        """
        print("=" * 60)
        print("STEP 2D: NORMALIZE PLAYER RATINGS (SIDE INFORMATION)")
        print("=" * 60)
        
        print(f"\n⚙️  Normalization strategy: Z-score")
        print(f"   • Formula: (rating - mean) / std")
        print(f"   • Purpose: Scale ratings for use as side information in MF model")
        print(f"   • Storage: SEPARATE lookup table, NOT merged into clean_data")
        print(f"   • Usage: Model will lookup player_id → rating_z during training")
        
        # Calculate normalization parameters
        RATING_MEAN = player_ratings_df['rating'].mean()
        RATING_STD = player_ratings_df['rating'].std()
        
        print(f"\n1️⃣  Normalization parameters (calculated from {len(player_ratings_df):,} players):")
        print(f"   • Mean: {RATING_MEAN:.2f}")
        print(f"   • Std Dev: {RATING_STD:.2f}")
        
        # Create side information table - only keep player_id and rating for now
        player_side_info = player_ratings_df[['player_id', 'rating']].copy()
        player_side_info['rating_z'] = (player_side_info['rating'] - RATING_MEAN) / RATING_STD
        
        print(f"\n2️⃣  Normalized rating statistics:")
        print(f"   • Min: {player_side_info['rating_z'].min():.4f}")
        print(f"   • Max: {player_side_info['rating_z'].max():.4f}")
        print(f"   • Mean: {player_side_info['rating_z'].mean():.6f} (should be ~0)")
        print(f"   • Std: {player_side_info['rating_z'].std():.6f} (should be ~1)")
        print(f"   • Range: [{player_side_info['rating_z'].min():.2f}, {player_side_info['rating_z'].max():.2f}]")
        
        print(f"\n3️⃣  Sample normalized ratings across skill levels:")
        sample_percentiles = [0.1, 0.25, 0.5, 0.75, 0.9]
        for p in sample_percentiles:
            rating_threshold = player_side_info['rating'].quantile(p)
            sample_player = player_side_info.iloc[(player_side_info['rating'] - rating_threshold).abs().argsort()[:1]]
            for idx, row in sample_player.iterrows():
                # player_id is still a column here (the index is set later), so read it
                # from the row; idx is only the DataFrame's row label.
                print(f"   ~{p*100:.0f}th percentile: Player {int(row['player_id'])} | "
                      f"Rating: {row['rating']:>4.0f} → Z-score: {row['rating_z']:>6.3f}")
        
        print(f"\n4️⃣  Interpretation guide:")
        print(f"   • rating_z ≈ {(1200 - RATING_MEAN)/RATING_STD:.1f}: 1200 player (minimum)")
        print(f"   • rating_z ≈ {(player_side_info['rating'].quantile(0.25) - RATING_MEAN)/RATING_STD:.1f}: {player_side_info['rating'].quantile(0.25):.0f} player (25th percentile)")
        print(f"   • rating_z ≈  0.0: {RATING_MEAN:.0f} player (mean)")
        print(f"   • rating_z ≈ {(player_side_info['rating'].quantile(0.75) - RATING_MEAN)/RATING_STD:.1f}: {player_side_info['rating'].quantile(0.75):.0f} player (75th percentile)")
        print(f"   • rating_z ≈ {(player_side_info['rating'].max() - RATING_MEAN)/RATING_STD:.1f}: {player_side_info['rating'].max():.0f} player (maximum)")
        
        print(f"\n5️⃣  Side information table structure:")
        print(f"   • Shape: {player_side_info.shape}")
        print(f"   • Columns: {list(player_side_info.columns)}")
        print(f"   • Indexing: Setting player_id as index for O(1) lookups")
        
        # Set player_id as index for fast lookups
        player_side_info = player_side_info.set_index('player_id')
        
        print(f"\n6️⃣  Sample entries from side information table:")
        sample_data = player_side_info.sample(min(10, len(player_side_info)), random_state=42)
        for idx, row in sample_data.iterrows():
            print(f"   Player {idx:>5} | Rating: {row['rating']:>4.0f} → Z-score: {row['rating_z']:>6.3f}")
        
        print(f"\n7️⃣  Removing unnecessary columns...")
        # Drop rating column - we only need rating_z for the model
        player_side_info = player_side_info.drop(columns=['rating'])
        print(f"   ✓ Dropped 'rating' column (only keeping 'rating_z')")
        print(f"   • Final columns: {list(player_side_info.columns)}")
        
        print(f"\n8️⃣  Verifying all clean_data players have ratings:")
        # This is important - make sure every player in clean_data has a rating
        missing_players = set(clean_data['player_id'].unique()) - set(player_side_info.index)
        if len(missing_players) > 0:
            print(f"   ⚠️  WARNING: {len(missing_players)} players in clean_data are missing from side_info!")
            print(f"   Missing player IDs: {sorted(list(missing_players))[:10]}...")
        else:
            # Report the number of players in clean_data, not the size of the side table
            print(f"   ✓ All {clean_data['player_id'].nunique():,} players in clean_data have side information")
        
        print("\n" + "=" * 60)
        print("✅ RATING NORMALIZATION COMPLETE")
        print("=" * 60)
        print(f"\nCreated: player_side_info")
        print(f"   • Shape: {player_side_info.shape}")
        print(f"   • Index: player_id")
        print(f"   • Columns: {list(player_side_info.columns)}")
        
        print(f"\n📊 Data structure summary:")
        print(f"   • clean_data: {clean_data.shape[0]:,} rows (player-opening interactions)")
        print(f"   • player_side_info: {len(player_side_info):,} rows (one per player)")
        print(f"   • Rating storage: ONE value per player (not duplicated per interaction)")
        
        print(f"\n⚠️  CRITICAL: Save these parameters for inference!")
        print(f"   RATING_MEAN = {RATING_MEAN:.2f}")
        print(f"   RATING_STD = {RATING_STD:.2f}")
        print(f"\n   You'll need them to normalize ratings for new users at inference time.")
        
        return player_side_info, RATING_MEAN, RATING_STD
    
    # Call the function
    player_side_info, RATING_MEAN, RATING_STD = normalize_player_ratings(player_ratings)
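
The cell's closing message says `RATING_MEAN` and `RATING_STD` must be saved for inference. A minimal sketch of that round trip follows, using only the standard library. The numeric values, the JSON layout, and the helper name `rating_to_z` are assumptions for illustration, not values produced by this notebook.

```python
import json

# Hypothetical training-time statistics; the real ones come from normalize_player_ratings
params = {"rating_mean": 1800.0, "rating_std": 300.0}

# Persist next to the model checkpoint (shown here as a string round trip)
saved = json.dumps(params)
loaded = json.loads(saved)

def rating_to_z(rating: float, mean: float, std: float) -> float:
    """Apply the training-time z-score transform to a new player's rating."""
    if std <= 0:
        raise ValueError("std must be positive")
    return (rating - mean) / std

new_user_z = rating_to_z(2100, loaded["rating_mean"], loaded["rating_std"])
print(f"rating 2100 -> rating_z {new_user_z:.3f}")
```

Reusing the training statistics, rather than recomputing them on whoever shows up at inference time, keeps a given `rating_z` value meaning the same thing to the model.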


In [None]:
print(clean_data.sample(20).to_string())

In [None]:
print(player_side_info.sample(10).to_string())

## Step 3: Train/Test/Val splits

Here, I split my data and drop columns that are no longer needed. We're very close to being able to train our model!
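
The 75/15/10 split used in this step is done in two stages, so the second stage's `test_size` has to be expressed relative to what remains after the test set is removed. The arithmetic can be checked exactly with fractions. This is a sketch of the reasoning only and does not call scikit-learn.

```python
from fractions import Fraction

test_frac = Fraction(10, 100)            # stage 1: hold out 10% for test
remaining = 1 - test_frac                # 90% left for train + validation
val_within_remaining = Fraction(15, 90)  # stage 2: test_size for the second split

val_frac = remaining * val_within_remaining
train_frac = remaining - val_frac

print(f"train={float(train_frac):.2f} val={float(val_frac):.2f} test={float(test_frac):.2f}")
```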

In [None]:
# Step 3: Train/Validation/Test Split (75/15/10) - OPTIMIZED

from sklearn.model_selection import train_test_split
import numpy as np

print("=" * 60)
print("STEP 3: TRAIN/VALIDATION/TEST SPLIT")
print("=" * 60)

print(f"\n⚙️  Configuration:")
print(f"   • Train: 75%")
print(f"   • Validation: 15%")
print(f"   • Test: 10%")
print(f"   • Random seed: 42 (for reproducibility)")

# Prepare the data
print(f"\n1️⃣  Preparing data for split...")

# Drop num_games from clean_data - we don't need it for training
# Keep: player_id, opening_id, score, eco, confidence
X = clean_data[["player_id", "opening_id", "eco", "confidence"]].copy()
y = clean_data["score"].copy()

print(f"   • Features (X): {X.shape}")
print(f"   • Target (y): {y.shape}")
print(f"   • Feature columns: {list(X.columns)}")

# Clean up player_side_info - only keep rating_z
print(f"\n2️⃣  Cleaning player side information...")
player_side_info_clean = player_side_info[["rating_z"]].copy()
print(f"   • Original player_side_info shape: {player_side_info.shape}")
print(f"   • Cleaned player_side_info shape: {player_side_info_clean.shape}")
print(f"   • Columns: {list(player_side_info_clean.columns)}")

# OPTIMIZED: Split a positional index array, then select each split's rows once
print(f"\n3️⃣  Splitting data (optimized approach)...")
idx = np.arange(len(X))

# First split: separate out test set (10%)
idx_temp, idx_test = train_test_split(idx, test_size=0.10, random_state=42, shuffle=True)

# Second split: split remaining into train (75%) and val (15%)
# 15% of original = 15/90 ≈ 0.1667 of temp
idx_train, idx_val = train_test_split(idx_temp, test_size=15/90, random_state=42, shuffle=True)

# Create splits using iloc (integer-array indexing returns copies, not views)
X_train, y_train = X.iloc[idx_train], y.iloc[idx_train]
X_val, y_val = X.iloc[idx_val], y.iloc[idx_val]
X_test, y_test = X.iloc[idx_test], y.iloc[idx_test]

print(f"   • Train: {len(X_train):,} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"   • Validation: {len(X_val):,} samples ({len(X_val)/len(X)*100:.1f}%)")
print(f"   • Test: {len(X_test):,} samples ({len(X_test)/len(X)*100:.1f}%)")

# Verify the split
print(f"\n4️⃣  Verification:")
total = len(X_train) + len(X_val) + len(X_test)
print(f"   • Total samples: {total:,} (should equal {len(X):,})")
print(f"   • Train %: {len(X_train)/total*100:.2f}% (target: 75%)")
print(f"   • Val %: {len(X_val)/total*100:.2f}% (target: 15%)")
print(f"   • Test %: {len(X_test)/total*100:.2f}% (target: 10%)")

# OPTIMIZED: Pre-compute unique arrays once
print(f"\n5️⃣  Computing coverage statistics (cached)...")
players_train = X_train["player_id"].unique()
players_val = X_val["player_id"].unique()
players_test = X_test["player_id"].unique()

openings_train = X_train["opening_id"].unique()
openings_val = X_val["opening_id"].unique()
openings_test = X_test["opening_id"].unique()

print(f"\n   Players:")
print(f"   • Train: {len(players_train):,} unique players")
print(f"   • Val: {len(players_val):,} unique players")
print(f"   • Test: {len(players_test):,} unique players")
print(f"   • Total unique: {X['player_id'].nunique():,} players")

print(f"\n   Openings:")
print(f"   • Train: {len(openings_train):,} unique openings")
print(f"   • Val: {len(openings_val):,} unique openings")
print(f"   • Test: {len(openings_test):,} unique openings")
print(f"   • Total unique: {X['opening_id'].nunique():,} openings")

# OPTIMIZED: Use NumPy setdiff1d for cold-start analysis (C-speed)
print(f"\n6️⃣  Cold start analysis (vectorized)...")

val_cold_players = np.setdiff1d(players_val, players_train, assume_unique=True)
val_cold_openings = np.setdiff1d(openings_val, openings_train, assume_unique=True)

test_cold_players = np.setdiff1d(players_test, players_train, assume_unique=True)
test_cold_openings = np.setdiff1d(openings_test, openings_train, assume_unique=True)

print(f"\n   Validation set:")
print(f"   • Players not in train: {len(val_cold_players):,} ({len(val_cold_players)/len(players_val)*100:.1f}%)")
print(f"   • Openings not in train: {len(val_cold_openings):,} ({len(val_cold_openings)/len(openings_val)*100:.1f}%)")

print(f"\n   Test set:")
print(f"   • Players not in train: {len(test_cold_players):,} ({len(test_cold_players)/len(players_test)*100:.1f}%)")
print(f"   • Openings not in train: {len(test_cold_openings):,} ({len(test_cold_openings)/len(openings_test)*100:.1f}%)")

# OPTIMIZED: Compute stats in one pass using describe()
print(f"\n7️⃣  Score distribution across splits:")

y_train_stats = y_train.describe()
y_val_stats = y_val.describe()
y_test_stats = y_test.describe()

print(f"\n   Train:")
print(f"   • Mean: {y_train_stats['mean']:.4f}")
print(f"   • Std: {y_train_stats['std']:.4f}")
print(f"   • Min: {y_train_stats['min']:.4f}")
print(f"   • Max: {y_train_stats['max']:.4f}")

print(f"\n   Validation:")
print(f"   • Mean: {y_val_stats['mean']:.4f}")
print(f"   • Std: {y_val_stats['std']:.4f}")
print(f"   • Min: {y_val_stats['min']:.4f}")
print(f"   • Max: {y_val_stats['max']:.4f}")

print(f"\n   Test:")
print(f"   • Mean: {y_test_stats['mean']:.4f}")
print(f"   • Std: {y_test_stats['std']:.4f}")
print(f"   • Min: {y_test_stats['min']:.4f}")
print(f"   • Max: {y_test_stats['max']:.4f}")

# OPTIMIZED: Compute confidence stats in one pass
print(f"\n8️⃣  Confidence distribution across splits:")

conf_train_stats = X_train['confidence'].describe()
conf_val_stats = X_val['confidence'].describe()
conf_test_stats = X_test['confidence'].describe()

print(f"\n   Train:")
print(f"   • Mean: {conf_train_stats['mean']:.4f}")
print(f"   • Median: {conf_train_stats['50%']:.4f}")

print(f"\n   Validation:")
print(f"   • Mean: {conf_val_stats['mean']:.4f}")
print(f"   • Median: {conf_val_stats['50%']:.4f}")

print(f"\n   Test:")
print(f"   • Mean: {conf_test_stats['mean']:.4f}")
print(f"   • Median: {conf_test_stats['50%']:.4f}")

print("\n" + "=" * 60)
print("✅ DATA SPLIT COMPLETE")
print("=" * 60)

print(f"\n📊 Summary:")
print(f"   • Training data: {len(X_train):,} samples (75%)")
print(f"   • Validation data: {len(X_val):,} samples (15%)")
print(f"   • Test data: {len(X_test):,} samples (10%)")
print(f"   • Player side info: {len(player_side_info_clean):,} players")
print(f"   • Side info columns: {list(player_side_info_clean.columns)}")

print(f"\n📦 Available datasets:")
print(f"   • X_train, y_train - Training features and targets")
print(f"   • X_val, y_val - Validation features and targets")
print(f"   • X_test, y_test - Test features and targets")
print(f"   • player_side_info_clean - Player ratings (indexed by player_id)")

print(f"\n💡 Next steps:")
print(f"   • Enumerate ECO codes as categorical features")
print(f"   • Convert to PyTorch tensors")
print(f"   • Build matrix factorization model with side information")
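
The cold-start counts in the cell above matter because a player or opening that appears only in validation or test contributes no training examples, so its embedding never receives a data-driven gradient update. A dependency-free sketch of the same set-difference check, with made-up IDs:

```python
train_players = {0, 1, 2, 3}   # hypothetical IDs seen in training
val_players = {1, 3, 4}        # hypothetical IDs seen in validation

# IDs present in validation but absent from training are cold-start cases
cold_players = sorted(val_players - train_players)
cold_share = len(cold_players) / len(val_players)

print(f"cold-start players: {cold_players} ({cold_share:.1%} of validation players)")
```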


## Step 3b: Remap Player and Opening IDs to Sequential Integers

**Why remap IDs?**
- Database IDs may have gaps (e.g., [1, 5, 10, 15, ...]) from deleted entries
- `nn.Embedding` only accepts indices from 0 to `num_embeddings - 1`, so IDs must be 0-based
- Remapping to contiguous indices saves memory (no unused embedding slots)

**Process:**
1. Check whether IDs are already 0-based sequential; 1-based or gapped IDs get remapped
2. If not, create mappings: old_id → new_sequential_id
3. Remap all DataFrames and side info tables
4. Verify mappings with spot checks

This ensures embeddings use minimal memory and indices align properly.
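
As a rough illustration of steps 2 and 3, the sketch below remaps a gapped ID list to 0-based indices with plain dictionaries. The IDs are invented, and this is a simplified stand-in for the `check_and_remap_ids` function in the next cell, not a copy of it.

```python
db_ids = [1, 5, 10, 15, 5, 1]   # hypothetical database IDs with gaps and repeats

unique_sorted = sorted(set(db_ids))
id_to_idx = {old: new for new, old in enumerate(unique_sorted)}
idx_to_id = {new: old for old, new in id_to_idx.items()}

remapped = [id_to_idx[i] for i in db_ids]      # forward mapping
restored = [idx_to_id[i] for i in remapped]    # reverse mapping recovers the originals

# Embedding rows needed: max ID + 1 without remapping vs. number of unique IDs with it
rows_without = max(db_ids) + 1
rows_with = len(unique_sorted)

print(remapped)                 # [0, 1, 2, 3, 1, 0]
print(rows_without, rows_with)  # 16 4
```

Keeping the reverse dictionary around is what allows model outputs to be translated back into database IDs later.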

In [None]:
# Step 3b: Remap player and opening IDs to 0-based sequential integers

print("=" * 60)
print("STEP 3B: REMAP IDs TO SEQUENTIAL INTEGERS")
print("=" * 60)

def check_and_remap_ids(df_list, id_column, entity_name):
    """
    Check if IDs are sequential starting from 0, and remap if not.
    
    Parameters:
    -----------
    df_list : list of DataFrames
        List of DataFrames containing the ID column to check/remap
    id_column : str
        Name of the ID column ('player_id' or 'opening_id')
    entity_name : str
        Name for logging ('player' or 'opening')
        
    Returns:
    --------
    tuple: (list of DataFrames, mappings, needs_remapping bool)
        mappings is (id_to_idx, idx_to_id) when remapping happened, otherwise None
    """
    # Get all unique IDs across all dataframes
    all_ids = pd.concat([df[id_column] for df in df_list]).unique()
    all_ids_sorted = sorted(all_ids)
    
    print(f"\n{'='*60}")
    print(f"Checking {entity_name} IDs...")
    print(f"{'='*60}")
    print(f"   • Total unique {entity_name}s: {len(all_ids_sorted)}")
    print(f"   • ID range: [{all_ids_sorted[0]}, {all_ids_sorted[-1]}]")
    
    # Check if IDs are already 0-based sequential (0, 1, 2, ...)
    expected_sequential = list(range(len(all_ids_sorted)))
    is_sequential = (all_ids_sorted == expected_sequential)
    
    if is_sequential:
        print(f"   ✓ {entity_name} IDs are already 0-based sequential - no remapping needed!")
        return df_list, None, False
    
    # Check if IDs are 1-based sequential (1, 2, 3, ...)
    expected_sequential_1based = list(range(1, len(all_ids_sorted) + 1))
    is_sequential_1based = (all_ids_sorted == expected_sequential_1based)
    
    if is_sequential_1based:
        print(f"   ⚠️  {entity_name} IDs are 1-based sequential - will remap to 0-based")
    else:
        # Calculate gaps
        num_gaps = (all_ids_sorted[-1] - all_ids_sorted[0] + 1) - len(all_ids_sorted)
        print(f"   ⚠️  {entity_name} IDs have gaps - will remap to 0-based sequential")
        print(f"   • Number of gaps: {num_gaps}")
        print(f"   • Wasted embedding slots without remapping: {num_gaps}")
    
    # Create mapping: old_id -> new_idx (0-based)
    id_to_idx = {old_id: new_idx for new_idx, old_id in enumerate(all_ids_sorted)}
    idx_to_id = {new_idx: old_id for old_id, new_idx in id_to_idx.items()}
    
    print(f"\n   Creating mapping...")
    print(f"   • Example mappings:")
    sample_ids = all_ids_sorted[:5] + all_ids_sorted[-5:]
    for old_id in sample_ids[:10]:  # Show first 5 and last 5
        print(f"      {entity_name}_id {old_id} → {id_to_idx[old_id]}")
    
    # Remap all DataFrames
    print(f"\n   Remapping {len(df_list)} DataFrames...")
    remapped_dfs = []
    for i, df in enumerate(df_list):
        df_copy = df.copy()
        df_copy[id_column] = df_copy[id_column].map(id_to_idx)
        remapped_dfs.append(df_copy)
        print(f"   ✓ Remapped DataFrame {i+1}/{len(df_list)}")
    
    return remapped_dfs, (id_to_idx, idx_to_id), True

# 1. Remap player IDs
print(f"\n1️⃣  Processing player IDs...")
player_dfs = [X_train, X_val, X_test, clean_data, player_side_info.reset_index()]
remapped_player_dfs, player_mappings, player_remapped = check_and_remap_ids(
    player_dfs, 'player_id', 'player'
)

if player_remapped:
    X_train, X_val, X_test, clean_data, player_side_info_remapped = remapped_player_dfs
    player_id_to_idx, player_idx_to_id = player_mappings
    player_side_info = player_side_info_remapped.set_index('player_id')
    # Rebuild the Step 3 lookup table so it is indexed by the NEW player IDs as well
    player_side_info_clean = player_side_info[["rating_z"]].copy()
    print(f"\n   ✅ Player ID remapping complete!")
else:
    player_id_to_idx, player_idx_to_id = None, None

# 2. Remap opening IDs
print(f"\n2️⃣  Processing opening IDs...")
opening_dfs = [X_train, X_val, X_test, clean_data]
remapped_opening_dfs, opening_mappings, opening_remapped = check_and_remap_ids(
    opening_dfs, 'opening_id', 'opening'
)

if opening_remapped:
    X_train, X_val, X_test, clean_data = remapped_opening_dfs
    opening_id_to_idx, opening_idx_to_id = opening_mappings
    print(f"\n   ✅ Opening ID remapping complete!")
else:
    opening_id_to_idx, opening_idx_to_id = None, None

# 3. Spot checks to verify mappings
print(f"\n3️⃣  Running spot checks to verify ID remapping correctness...")
print(f"   Strategy: Sample entries from the remapped X_train and validate their new IDs")

# Sample 10 entries: first, last, and 8 in between
print(f"\n   Sampling 10 entries from X_train (after remapping was applied)...")
total_rows = len(X_train)
# Get indices: first, last, and 8 evenly spaced in between
sample_indices = [0]  # First row
step = (total_rows - 1) // 9  # Divide remaining rows into 9 parts
sample_indices.extend([min(i * step, total_rows - 1) for i in range(1, 9)])
sample_indices.append(total_rows - 1)  # Last row

print(f"   • Sample indices: {sample_indices}")
print(f"   • These represent: first, evenly spaced middle rows, and last")

# Store samples with their NEW (remapped) IDs
samples = []
for idx in sample_indices:
    row = X_train.iloc[idx]
    samples.append({
        'index': idx,
        'new_player_id': row['player_id'],
        'new_opening_id': row['opening_id'],
        'confidence': row['confidence']
    })

print(f"\n   Verification checks:")
print(f"   {'#':<4} {'Row Idx':<10} {'New Player':<12} {'New Opening':<12} {'Confidence':<12} {'Status':<15}")
print(f"   {'-'*4} {'-'*10} {'-'*12} {'-'*12} {'-'*12} {'-'*15}")

all_checks_passed = True
for i, sample in enumerate(samples, 1):
    new_player_id = sample['new_player_id']
    new_opening_id = sample['new_opening_id']
    confidence = sample['confidence']
    idx = sample['index']
    
    checks = []
    
    # Check 1: New player ID is valid (0-based sequential)
    if player_remapped:
        player_valid = 0 <= new_player_id < len(player_id_to_idx)
        checks.append(("player_id", player_valid))
    else:
        checks.append(("player_id", True))  # No remapping needed means original was valid
    
    # Check 2: New opening ID is valid (0-based sequential)
    if opening_remapped:
        opening_valid = 0 <= new_opening_id < len(opening_id_to_idx)
        checks.append(("opening_id", opening_valid))
    else:
        checks.append(("opening_id", True))  # No remapping needed means original was valid
    
    # Check 3: Player exists in player_side_info
    player_exists = new_player_id in player_side_info.index
    checks.append(("in_side_info", player_exists))
    
    # Check 4: Opening exists in clean_data
    opening_exists = new_opening_id in clean_data['opening_id'].values
    checks.append(("in_clean", opening_exists))
    
    # All checks must pass
    all_pass = all(check[1] for check in checks)
    status = "✓ PASS" if all_pass else f"✗ FAIL ({','.join([c[0] for c in checks if not c[1]])})"
    
    if not all_pass:
        all_checks_passed = False
    
    print(f"   {i:<4} {idx:<10} {new_player_id:<12} {new_opening_id:<12} {confidence:<12.4f} {status:<15}")

# Additional verification: check if we can reverse map
if player_remapped or opening_remapped:
    print(f"\n   Reverse mapping verification (sample of 3 entries):")
    print(f"   {'#':<4} {'New→Old Player':<30} {'New→Old Opening':<30}")
    print(f"   {'-'*4} {'-'*30} {'-'*30}")
    
    for i in [0, len(samples)//2, len(samples)-1]:  # First, middle, last
        sample = samples[i]
        
        # Reverse map player
        if player_remapped:
            old_player = player_idx_to_id.get(sample['new_player_id'], 'NOT_FOUND')
            player_str = f"{sample['new_player_id']} → {old_player}"
        else:
            player_str = f"{sample['new_player_id']} (unchanged)"
        
        # Reverse map opening
        if opening_remapped:
            old_opening = opening_idx_to_id.get(sample['new_opening_id'], 'NOT_FOUND')
            opening_str = f"{sample['new_opening_id']} → {old_opening}"
        else:
            opening_str = f"{sample['new_opening_id']} (unchanged)"
        
        print(f"   {i+1:<4} {player_str:<30} {opening_str:<30}")

if all_checks_passed:
    print(f"\n   ✅ All spot checks passed! ID mappings are correct.")
else:
    print(f"\n   ⚠️  Some spot checks failed - investigate immediately!")
    raise ValueError("ID remapping verification failed!")

# 4. Summary
print(f"\n4️⃣  Summary:")
print(f"\n   Player IDs:")
if player_remapped:
    print(f"   • Original range: [{player_idx_to_id[0]}, {player_idx_to_id[len(player_idx_to_id)-1]}]")
    print(f"   • New range: [0, {len(player_idx_to_id)-1}]")
    print(f"   • Mapping saved as: player_id_to_idx, player_idx_to_id")
else:
    print(f"   • No remapping needed - IDs already sequential")
    
print(f"\n   Opening IDs:")
if opening_remapped:
    print(f"   • Original range: [{opening_idx_to_id[0]}, {opening_idx_to_id[len(opening_idx_to_id)-1]}]")
    print(f"   • New range: [0, {len(opening_idx_to_id)-1}]")
    print(f"   • Mapping saved as: opening_id_to_idx, opening_idx_to_id")
else:
    print(f"   • No remapping needed - IDs already sequential")

print(f"\n   Updated DataFrames:")
print(f"   • X_train: {X_train.shape}")
print(f"   • X_val: {X_val.shape}")
print(f"   • X_test: {X_test.shape}")
print(f"   • clean_data: {clean_data.shape}")
print(f"   • player_side_info: {player_side_info.shape}")

print("\n" + "=" * 60)
print("✅ ID REMAPPING COMPLETE")
print("=" * 60)

print(f"\n💡 Important:")
print(f"   • All player_id and opening_id values are now 0-based sequential")
print(f"   • Use these for embedding layers: nn.Embedding(num_players, dim)")
print(f"   • Save mappings for inference (to convert new user/opening IDs)")
print(f"   • player_side_info index is now 0-based sequential player IDs")
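
The remapping cell leaves a forward and a reverse dictionary for each ID type. As a rough illustration of how such a pair behaves, the sketch below builds one from a toy list of IDs with gaps. It assumes indices are assigned in sorted order of the original IDs, which may not match what `check_and_remap_ids` does internally; `build_id_mappings` and `to_index` are hypothetical helper names, not functions from this notebook.

```python
from typing import Dict, Iterable, Optional, Tuple


def build_id_mappings(raw_ids: Iterable[int]) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Map arbitrary integer IDs to 0-based contiguous indices.

    Assumption: indices follow the sorted order of the unique raw IDs.
    """
    unique_sorted = sorted(set(raw_ids))
    id_to_idx = {raw: idx for idx, raw in enumerate(unique_sorted)}
    idx_to_id = {idx: raw for raw, idx in id_to_idx.items()}
    return id_to_idx, idx_to_id


def to_index(raw_id: int, id_to_idx: Dict[int, int]) -> Optional[int]:
    """Return the embedding index for a raw ID, or None if it was never seen."""
    return id_to_idx.get(raw_id)


# Toy IDs with gaps and a duplicate, similar to what deleted DB rows leave behind
raw_player_ids = [17, 3, 42, 3, 108]
player_id_to_idx, player_idx_to_id = build_id_mappings(raw_player_ids)

print(player_id_to_idx)                 # {3: 0, 17: 1, 42: 2, 108: 3}
print(to_index(42, player_id_to_idx))   # 2
print(to_index(999, player_id_to_idx))  # None (unseen ID, no embedding row)
```

Returning `None` for unseen IDs is one possible choice; at inference time the caller still has to decide how to handle a player or opening that has no embedding row.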

## 4. Enumerate Categorical Variables

I believe the only variable we need to enumerate here is `eco`. That's the broad categorization of a specific opening.

Notes:

- One ECO code can cover many distinct openings
- Codes are ordered by letter, then by number. For instance, C21 and C44 both belong to the `C` family.
- Maybe we make this side information?

First, let's get some data on ECO codes to help us better understand what we're working with.

In [None]:
# Step 4: ECO Code Statistics (no mutations, just exploration)

print("=" * 60)
print("STEP 4: ECO CODE STATISTICS")
print("=" * 60)

# Basic ECO statistics across all data
print(f"\n1️⃣  Overall ECO statistics:")
print(f"   • Total unique ECO codes: {clean_data['eco'].nunique()}")
print(f"   • Total entries: {len(clean_data):,}")
print(f"   • Missing ECO values: {clean_data['eco'].isna().sum()}")

# ECO value counts
eco_counts = clean_data['eco'].value_counts().sort_index()
print(f"\n2️⃣  Distribution of entries per ECO code:")
print(f"   • Mean entries per ECO: {eco_counts.mean():.1f}")
print(f"   • Median entries per ECO: {eco_counts.median():.1f}")
print(f"   • Min entries: {eco_counts.min()}")
print(f"   • Max entries: {eco_counts.max()}")
print(f"   • Std: {eco_counts.std():.1f}")

# ECO by first letter (family)
print(f"\n3️⃣  ECO families (by first letter):")
eco_families = clean_data['eco'].str[0].value_counts().sort_index()
print(f"\n   {'Family':<8} {'Count':<10} {'Percentage':<12} {'Visual'}")
print(f"   {'-'*8} {'-'*10} {'-'*12} {'-'*40}")
for family, count in eco_families.items():
    pct = 100 * count / len(clean_data)
    bar_length = int(pct * 0.4)
    bar = '█' * bar_length
    print(f"   {family:<8} {count:>7,}    {pct:>6.2f}%      {bar}")

# Top 20 most common ECO codes
print(f"\n4️⃣  Top 20 most common ECO codes:")
top_eco = clean_data['eco'].value_counts().head(20)
print(f"\n   {'Rank':<6} {'ECO':<6} {'Count':<10} {'Percentage':<12} {'Visual'}")
print(f"   {'-'*6} {'-'*6} {'-'*10} {'-'*12} {'-'*30}")
for i, (eco, count) in enumerate(top_eco.items(), 1):
    pct = 100 * count / len(clean_data)
    bar_length = int(pct * 0.3)
    bar = '█' * bar_length
    print(f"   {i:<6} {eco:<6} {count:>7,}    {pct:>6.2f}%      {bar}")

# Bottom 20 least common ECO codes
print(f"\n5️⃣  Bottom 20 least common ECO codes:")
bottom_eco = clean_data['eco'].value_counts().tail(20)
print(f"\n   {'Rank':<6} {'ECO':<6} {'Count':<10} {'Visual'}")
print(f"   {'-'*6} {'-'*6} {'-'*10} {'-'*30}")
for i, (eco, count) in enumerate(bottom_eco.items(), 1):
    bar_length = min(count, 30)
    bar = '█' * bar_length
    print(f"   {i:<6} {eco:<6} {count:>7,}    {bar}")

# ECO code format analysis
print(f"\n6️⃣  ECO code format analysis:")
eco_lengths = clean_data['eco'].str.len().value_counts().sort_index()
print(f"   • ECO code lengths:")
for length, count in eco_lengths.items():
    pct = 100 * count / len(clean_data)
    print(f"      {length} characters: {count:,} ({pct:.2f}%)")

# Check for any unusual ECO codes
print(f"\n7️⃣  Sample of ECO codes:")
sample_eco = clean_data['eco'].drop_duplicates().sample(min(20, clean_data['eco'].nunique()), random_state=42).sort_values()
print(f"   {', '.join(sample_eco.values)}")

# ECO statistics by split
print(f"\n8️⃣  ECO distribution across splits:")
print(f"\n   Train split:")
print(f"   • Unique ECO codes: {X_train['eco'].nunique()}")
print(f"   • Total entries: {len(X_train):,}")

print(f"\n   Validation split:")
print(f"   • Unique ECO codes: {X_val['eco'].nunique()}")
print(f"   • Total entries: {len(X_val):,}")
val_new_eco = set(X_val['eco'].unique()) - set(X_train['eco'].unique())
print(f"   • ECO codes not in train: {len(val_new_eco)}")

print(f"\n   Test split:")
print(f"   • Unique ECO codes: {X_test['eco'].nunique()}")
print(f"   • Total entries: {len(X_test):,}")
test_new_eco = set(X_test['eco'].unique()) - set(X_train['eco'].unique())
print(f"   • ECO codes not in train: {len(test_new_eco)}")

# Average score by ECO code (top 10 and bottom 10)
print(f"\n9️⃣  Average score by ECO code:")
eco_scores = clean_data.groupby('eco')['score'].agg(['mean', 'count']).sort_values('mean', ascending=False)

print(f"\n   Top 10 ECO codes by average score:")
print(f"\n   {'ECO':<6} {'Avg Score':<12} {'Count':<10}")
print(f"   {'-'*6} {'-'*12} {'-'*10}")
for eco, row in eco_scores.head(10).iterrows():
    print(f"   {eco:<6} {row['mean']:<12.4f} {int(row['count']):>7,}")

print(f"\n   Bottom 10 ECO codes by average score:")
print(f"\n   {'ECO':<6} {'Avg Score':<12} {'Count':<10}")
print(f"   {'-'*6} {'-'*12} {'-'*10}")
for eco, row in eco_scores.tail(10).iterrows():
    print(f"   {eco:<6} {row['mean']:<12.4f} {int(row['count']):>7,}")

# ECO codes with high variance in scores
print(f"\n🔟  ECO codes with highest score variance (may indicate difficulty):")
eco_variance = clean_data.groupby('eco')['score'].agg(['var', 'std', 'count']).sort_values('var', ascending=False).head(10)
print(f"\n   {'ECO':<6} {'Variance':<12} {'Std Dev':<12} {'Count':<10}")
print(f"   {'-'*6} {'-'*12} {'-'*12} {'-'*10}")
for eco, row in eco_variance.iterrows():
    print(f"   {eco:<6} {row['var']:<12.4f} {row['std']:<12.4f} {int(row['count']):>7,}")

# Number of openings per ECO code
print(f"\n1️⃣1️⃣  Openings per ECO code:")
# Connect to database to get opening counts
con = get_db_connection(str(DB_PATH))
try:
    eco_opening_query = """
        SELECT eco, COUNT(DISTINCT id) as num_openings
        FROM opening
        GROUP BY eco
        ORDER BY num_openings DESC
    """
    eco_opening_counts = pd.DataFrame(con.execute(eco_opening_query).df())
    
    # Filter to only ECO codes in our data
    eco_opening_counts = eco_opening_counts[eco_opening_counts['eco'].isin(clean_data['eco'].unique())]
    
    print(f"   • Mean openings per ECO: {eco_opening_counts['num_openings'].mean():.1f}")
    print(f"   • Median openings per ECO: {eco_opening_counts['num_openings'].median():.1f}")
    print(f"   • Max openings per ECO: {eco_opening_counts['num_openings'].max()}")
    print(f"   • Min openings per ECO: {eco_opening_counts['num_openings'].min()}")
    
    print(f"\n   Top 10 ECO codes by number of openings:")
    print(f"\n   {'ECO':<6} {'# Openings':<12}")
    print(f"   {'-'*6} {'-'*12}")
    for idx, row in eco_opening_counts.head(10).iterrows():
        print(f"   {row['eco']:<6} {int(row['num_openings']):>10}")
    
finally:
    con.close()

print("\n" + "=" * 60)
print("✅ ECO CODE STATISTICS COMPLETE")
print("=" * 60)

print(f"\n📊 Key takeaways:")
print(f"   • Total unique ECO codes: {clean_data['eco'].nunique()}")
print(f"   • Most common family: {eco_families.idxmax()} ({eco_families.max():,} entries)")
print(f"   • Most common ECO: {top_eco.index[0]} ({top_eco.iloc[0]:,} entries)")
# Report split coverage from the sets computed above instead of asserting it
if not val_new_eco and not test_new_eco:
    print(f"   • All ECO codes in validation/test also appear in train (good for training)")
else:
    print(f"   • ECO codes missing from train: {len(val_new_eco)} in validation, {len(test_new_eco)} in test")
print(f"\n💡 Next steps:")
print(f"   • Enumerate ECO codes as integers for categorical encoding")
print(f"   • Consider ECO as opening-level side information (similar to player ratings)")
print(f"   • Verify all ECO codes in validation/test exist in training set")

## 4b. Create ECO Side Information

**Why ECO is Side Information:**
- ECO codes describe **opening characteristics**, not individual player-opening interactions
- Similar to how player ratings describe players, ECO describes openings
- Each opening has ONE ECO code (not per player-opening pair)

**Implementation Strategy:**
- Split ECO codes into two categorical features:
  - `eco_letter`: A, B, C, D, or E → encoded as integers 0-4
  - `eco_number`: The numeric part (e.g., "21" from "C21") → encoded as sequential integers
- Store in a separate `opening_side_info` lookup table (indexed by opening_id)
- Remove `eco` from train/test/val DataFrames (it's not interaction data)
- During training, the model will look up opening_id → (eco_letter, eco_number)

**Why Split ECO into Letter and Number:**
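- ECO families (A-E) represent fundamentally different opening types:
  - **A**: Flank openings (English, Réti, Bird's, etc.), plus 1.d4 lines such as the Dutch and Benoni
  - **B**: Semi-open games other than the French (Sicilian, Caro-Kann, Pirc, etc.)
  - **C**: Open games and the French Defense (King's pawn openings, Spanish, Italian, etc.)
  - **D**: Closed and semi-closed games (Queen's Gambit, Slav, Grünfeld, etc.)
  - **E**: Indian defenses (King's Indian, Nimzo-Indian, etc.)
- Numbers within each family group related variations (C20-C29, C30-C39, etc.)
- The same number under different letters (e.g., A21 and C21) refers to unrelated openings, so the number feature is only meaningful alongside the letter
- Model can learn separate embeddings for family vs variation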

**Categorical Encoding:**
- Both features will be treated as categorical (not ordinal)
- Higher numbers don't mean "better" openings
- Model will learn embedding vectors for each category
- This allows the model to capture non-linear relationships between ECO codes and performance
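
The lookup-then-embed flow described above could look roughly like the following in PyTorch. This is a sketch under stated assumptions, not the notebook's actual model: the class name `OpeningEncoder`, the decision to sum the three embeddings, and every size and toy value below are hypothetical.

```python
import torch
import torch.nn as nn


class OpeningEncoder(nn.Module):
    """Hypothetical sketch: opening latent factors plus ECO letter/number embeddings."""

    def __init__(self, num_openings: int, num_eco_letters: int,
                 num_eco_numbers: int, num_factors: int = 8):
        super().__init__()
        self.opening_factors = nn.Embedding(num_openings, num_factors)
        self.eco_letter_emb = nn.Embedding(num_eco_letters, num_factors)
        self.eco_number_emb = nn.Embedding(num_eco_numbers, num_factors)

    def forward(self, opening_idx, eco_letter_idx, eco_number_idx):
        # Summing is one simple way to combine the vectors (an assumption here);
        # concatenation followed by a linear layer is a common alternative.
        return (self.opening_factors(opening_idx)
                + self.eco_letter_emb(eco_letter_idx)
                + self.eco_number_emb(eco_number_idx))


torch.manual_seed(0)
encoder = OpeningEncoder(num_openings=3, num_eco_letters=5, num_eco_numbers=100)

# Toy side-info lookup: row i holds the encoded ECO letter/number of opening i
eco_letter_table = torch.tensor([2, 2, 1])
eco_number_table = torch.tensor([21, 44, 7])

opening_idx = torch.tensor([0, 2])  # a mini-batch of two opening indices
vectors = encoder(opening_idx,
                  eco_letter_table[opening_idx],
                  eco_number_table[opening_idx])
print(vectors.shape)  # torch.Size([2, 8])
```

Because both ECO features go through `nn.Embedding`, the integer codes act purely as row indices, which matches the categorical (not ordinal) treatment described above.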

In [None]:
# 4b. Create ECO side information and remove ECO from train/test/val DataFrames

# Check if ECO processing has already been done
if 'eco' not in X_train.columns and 'opening_side_info' in globals():
    print("=" * 60)
    print("⏭️  SKIPPING STEP 4B: ECO SIDE INFORMATION CREATION")
    print("=" * 60)
    print("\n✓ ECO column already removed from train/test/val data")
    print("✓ 'opening_side_info' table already exists")
    print(f"\nOpening side info shape: {opening_side_info.shape}")
    print(f"Columns: {list(opening_side_info.columns)}")
    
    # Show statistics (opening_side_info only keeps the *_cat columns)
    print(f"\n📊 Existing ECO encoding statistics:")
    print(f"   • Unique eco_letter_cat values: {opening_side_info['eco_letter_cat'].nunique()}")
    print(f"   • Unique eco_number_cat values: {opening_side_info['eco_number_cat'].nunique()}")
    print(f"   • eco_letter_cat range: [{opening_side_info['eco_letter_cat'].min()}, {opening_side_info['eco_letter_cat'].max()}]")
    print(f"   • eco_number_cat range: [{opening_side_info['eco_number_cat'].min()}, {opening_side_info['eco_number_cat'].max()}]")
    
    print(f"\n📋 Sample of existing ECO encoding:")
    sample_data = opening_side_info.sample(min(10, len(opening_side_info)), random_state=42)
    for idx, row in sample_data.iterrows():
        # Raw ECO strings were dropped, so decode with the reverse mappings
        # created on the first run of this cell
        letter_str = eco_int_to_letter[row['eco_letter_cat']]
        number_str = eco_int_to_number[row['eco_number_cat']]
        eco_str = f"{letter_str}{number_str}"
        print(f"   Opening {idx:>4} | ECO: {eco_str:>3} → Letter: {row['eco_letter_cat']} ({letter_str}), Number: {row['eco_number_cat']:>2} ({number_str:>2})")
else:
    def create_eco_side_information(clean_data_df, X_train_df, X_val_df, X_test_df):
        """
        Create ECO side information table and remove ECO from train/test/val DataFrames.
        
        ECO codes are opening-level features, not player-opening interaction features.
        We split each ECO code (e.g., "C21") into:
        - eco_letter: The letter part (A, B, C, D, or E)
        - eco_number: The numeric part (e.g., 21)
        
        Both are encoded as sequential integers for use as categorical features in embeddings.
        
        Parameters:
        -----------
        clean_data_df : pd.DataFrame
            Full cleaned data with opening_id and eco columns
        X_train_df, X_val_df, X_test_df : pd.DataFrame
            Train/val/test feature DataFrames (will be modified to remove 'eco')
            
        Returns:
        --------
        tuple: (opening_side_info, eco_letter_map, eco_number_map, X_train, X_val, X_test)
        """
        print("=" * 60)
        print("STEP 4B: CREATE ECO SIDE INFORMATION")
        print("=" * 60)
        
        print(f"\n⚙️  Strategy:")
        print(f"   • Extract unique opening_id → eco mappings from clean_data")
        print(f"   • Split ECO codes: 'C21' → letter='C', number='21'")
        print(f"   • Encode as sequential integers (categorical, not ordinal)")
        print(f"   • Store in opening_side_info lookup table")
        print(f"   • Remove 'eco' column from X_train, X_val, X_test")
        
        # Extract unique opening → ECO mappings
        print(f"\n1️⃣  Extracting unique opening-ECO mappings...")
        opening_eco_map = clean_data_df[['opening_id', 'eco']].drop_duplicates().set_index('opening_id')
        print(f"   ✓ Extracted {len(opening_eco_map):,} unique openings")
        
        # Verify one-to-one mapping
        eco_per_opening = clean_data_df.groupby('opening_id')['eco'].nunique()
        if (eco_per_opening > 1).any():
            problematic = eco_per_opening[eco_per_opening > 1]
            print(f"   ⚠️  WARNING: {len(problematic)} openings have multiple ECO codes!")
            print(f"   Problematic opening IDs: {problematic.index.tolist()[:10]}...")
        else:
            print(f"   ✓ Verified: Each opening has exactly one ECO code (good!)")
        
        # Split ECO into letter and number components
        print(f"\n2️⃣  Splitting ECO codes into letter and number components...")
        opening_eco_map['eco_letter_str'] = opening_eco_map['eco'].str[0]  # First character (A-E)
        opening_eco_map['eco_number_str'] = opening_eco_map['eco'].str[1:]  # Remaining characters (numeric)
        
        print(f"   ✓ Extracted letter and number components")
        print(f"   • Unique letters: {opening_eco_map['eco_letter_str'].unique()}")
        print(f"   • Unique numbers: {opening_eco_map['eco_number_str'].nunique()}")
        
        # Create encoding mappings for eco_letter (A-E → 0-4)
        print(f"\n3️⃣  Encoding ECO letters as categorical integers...")
        unique_letters = sorted(opening_eco_map['eco_letter_str'].unique())
        eco_letter_to_int = {letter: idx for idx, letter in enumerate(unique_letters)}
        eco_int_to_letter = {idx: letter for letter, idx in eco_letter_to_int.items()}
        
        opening_eco_map['eco_letter'] = opening_eco_map['eco_letter_str'].map(eco_letter_to_int)
        
        print(f"   ✓ Letter encoding created:")
        for letter, idx in sorted(eco_letter_to_int.items()):
            count = (opening_eco_map['eco_letter_str'] == letter).sum()
            print(f"      '{letter}' → {idx} ({count:,} openings)")
        
        # Create encoding mappings for eco_number (00-99 → sequential integers)
        print(f"\n4️⃣  Encoding ECO numbers as categorical integers...")
        unique_numbers = sorted(opening_eco_map['eco_number_str'].unique())
        eco_number_to_int = {num: idx for idx, num in enumerate(unique_numbers)}
        eco_int_to_number = {idx: num for num, idx in eco_number_to_int.items()}
        
        opening_eco_map['eco_number'] = opening_eco_map['eco_number_str'].map(eco_number_to_int)
        
        print(f"   ✓ Number encoding created:")
        print(f"      {len(unique_numbers)} unique numbers mapped to [0, {len(unique_numbers)-1}]")
        print(f"      Range: '{unique_numbers[0]}' → 0, ..., '{unique_numbers[-1]}' → {len(unique_numbers)-1}")
        
        # Show distribution of numbers
        print(f"\n   Distribution of ECO numbers (top 10):")
        number_counts = opening_eco_map['eco_number_str'].value_counts().head(10)
        for num, count in number_counts.items():
            encoded = eco_number_to_int[num]
            print(f"      '{num}' (→ {encoded:>2}): {count:>3} openings")
        
        # Create final opening_side_info table
        print(f"\n5️⃣  Creating opening_side_info lookup table...")
        # Only keep the encoded categorical columns and rename them for clarity
        opening_side_info = opening_eco_map[['eco_letter', 'eco_number']].copy()
        opening_side_info = opening_side_info.rename(columns={
            'eco_letter': 'eco_letter_cat',  # _cat suffix indicates categorical encoding
            'eco_number': 'eco_number_cat'
        })
        
        print(f"   ✓ Created opening_side_info")
        print(f"      • Shape: {opening_side_info.shape}")
        print(f"      • Index: opening_id")
        print(f"      • Columns: {list(opening_side_info.columns)}")
        print(f"      • eco_letter_cat: Categorical encoding of ECO letter (A-E → 0-4)")
        print(f"      • eco_number_cat: Categorical encoding of ECO number")
        
        # Verify all openings in train/val/test have ECO info
        print(f"\n6️⃣  Verifying coverage of train/val/test openings...")
        all_openings = set(X_train_df['opening_id'].unique()) | \
                       set(X_val_df['opening_id'].unique()) | \
                       set(X_test_df['opening_id'].unique())
        
        missing_openings = all_openings - set(opening_side_info.index)
        if len(missing_openings) > 0:
            print(f"   ⚠️  WARNING: {len(missing_openings)} openings in splits are missing ECO info!")
            print(f"   Missing opening IDs: {sorted(list(missing_openings))[:10]}...")
        else:
            print(f"   ✓ All {len(all_openings):,} openings in train/val/test have ECO side information")
        
        # Remove ECO from train/val/test DataFrames
        print(f"\n7️⃣  Removing 'eco' column from train/val/test DataFrames...")
        print(f"   • X_train before: {X_train_df.shape}, columns: {list(X_train_df.columns)}")
        print(f"   • X_val before: {X_val_df.shape}, columns: {list(X_val_df.columns)}")
        print(f"   • X_test before: {X_test_df.shape}, columns: {list(X_test_df.columns)}")
        
        X_train_clean = X_train_df.drop(columns=['eco'])
        X_val_clean = X_val_df.drop(columns=['eco'])
        X_test_clean = X_test_df.drop(columns=['eco'])
        
        print(f"\n   After removing 'eco':")
        print(f"   • X_train: {X_train_clean.shape}, columns: {list(X_train_clean.columns)}")
        print(f"   • X_val: {X_val_clean.shape}, columns: {list(X_val_clean.columns)}")
        print(f"   • X_test: {X_test_clean.shape}, columns: {list(X_test_clean.columns)}")
        
        # Sample data showing the transformation
        print(f"\n8️⃣  Sample of ECO encoding (10 random openings):")
        sample_openings = opening_side_info.sample(min(10, len(opening_side_info)), random_state=42)
        
        # For display purposes, reconstruct original values from the categorical encodings
        print(f"\n   {'Opening ID':<12} {'Letter (Cat)':<15} {'Number (Cat)':<15} {'Reconstructed ECO':<18}")
        print(f"   {'-'*12} {'-'*15} {'-'*15} {'-'*18}")
        for idx, row in sample_openings.iterrows():
            letter_str = eco_int_to_letter[row['eco_letter_cat']]
            number_str = eco_int_to_number[row['eco_number_cat']]
            reconstructed = f"{letter_str}{number_str}"
            print(f"   {idx:<12} {row['eco_letter_cat']:<15} {row['eco_number_cat']:<15} {reconstructed:<18}")
        
        # Show ECO family distribution
        print(f"\n9️⃣  ECO family distribution in opening_side_info:")
        letter_dist = opening_side_info['eco_letter_cat'].value_counts().sort_index()
        print(f"\n   {'Encoded':<8} {'Letter':<8} {'Count':<10} {'Percentage':<12} {'Visual'}")
        print(f"   {'-'*8} {'-'*8} {'-'*10} {'-'*12} {'-'*40}")
        for encoded_val, count in letter_dist.items():
            letter = eco_int_to_letter[encoded_val]
            pct = 100 * count / len(opening_side_info)
            bar_length = int(pct * 0.4)
            bar = '█' * bar_length
            print(f"   {encoded_val:<8} {letter:<8} {count:>7,}    {pct:>6.2f}%      {bar}")
        
        print("\n" + "=" * 60)
        print("✅ ECO SIDE INFORMATION CREATION COMPLETE")
        print("=" * 60)
        
        print(f"\nCreated: opening_side_info")
        print(f"   • Shape: {opening_side_info.shape}")
        print(f"   • Index: opening_id (for O(1) lookups)")
        print(f"   • Columns: {list(opening_side_info.columns)}")
        
        print(f"\n📊 Data structure summary:")
        print(f"   • X_train: {X_train_clean.shape[0]:,} rows, {X_train_clean.shape[1]} features")
        print(f"   • X_val: {X_val_clean.shape[0]:,} rows, {X_val_clean.shape[1]} features")
        print(f"   • X_test: {X_test_clean.shape[0]:,} rows, {X_test_clean.shape[1]} features")
        print(f"   • opening_side_info: {len(opening_side_info):,} openings (one row per opening)")
        print(f"   • ECO storage: ONE entry per opening (not duplicated per interaction)")
        
        print(f"\n⚠️  CRITICAL: Save these mappings for inference!")
        print(f"   • eco_letter_to_int: {eco_letter_to_int}")
        print(f"   • eco_number_to_int: (dict with {len(eco_number_to_int)} entries)")
        print(f"\n   You'll need them to encode ECO codes for new openings at inference time.")
        
        print(f"\n💡 Model usage:")
        print(f"   During training, for each (player_id, opening_id) pair:")
        print(f"   1. Lookup opening_id → opening_side_info[opening_id]")
        print(f"   2. Get eco_letter_cat and eco_number_cat (already encoded as integers)")
        print(f"   3. Feed into categorical embedding layers")
        print(f"   4. Combine with opening latent factors")
        
        print(f"\n🧹 Final cleanup:")
        print(f"   • Kept only encoded categorical columns: eco_letter_cat, eco_number_cat")
        print(f"   • Removed raw ECO strings (eco, eco_letter_str, eco_number_str)")
        print(f"   • Column names clearly indicate categorical encoding (_cat suffix)")
        
        return opening_side_info, eco_letter_to_int, eco_number_to_int, X_train_clean, X_val_clean, X_test_clean
    
    # Call the function
    opening_side_info, eco_letter_map, eco_number_map, X_train, X_val, X_test = create_eco_side_information(
        clean_data, X_train, X_val, X_test
    )
    
    # Create reverse mappings for decoding (needed for verification and debugging)
    eco_int_to_letter = {v: k for k, v in eco_letter_map.items()}
    eco_int_to_number = {v: k for k, v in eco_number_map.items()}
    print(f"\n✓ Created reverse mappings for ECO decoding:")
    print(f"   • eco_int_to_letter: {len(eco_int_to_letter)} entries")
    print(f"   • eco_int_to_number: {len(eco_int_to_number)} entries")
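
The cell above stresses that the ECO mappings must be saved for inference. One possible way to do that is sketched below with the standard-library `json` module. The toy mappings, the file name `eco_mappings.json`, and the helper `encode_eco` are all hypothetical and only mirror the shape of `eco_letter_map` and `eco_number_map`.

```python
import json
import tempfile
from pathlib import Path

# Toy mappings in the same shape as eco_letter_map / eco_number_map (assumed values)
toy_letter_map = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
toy_number_map = {"00": 0, "21": 1, "44": 2}


def encode_eco(eco, letter_map, number_map):
    """Split an ECO code such as 'C21' and return (letter_cat, number_cat).

    A component that was not present when the maps were built comes back as None.
    """
    return letter_map.get(eco[0]), number_map.get(eco[1:])


# Save and reload; a throwaway temp directory is used purely for illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "eco_mappings.json"
    path.write_text(json.dumps({"letter": toy_letter_map, "number": toy_number_map}))
    loaded = json.loads(path.read_text())

print(encode_eco("C21", loaded["letter"], loaded["number"]))  # (2, 1)
print(encode_eco("E99", loaded["letter"], loaded["number"]))  # (4, None)
```

JSON object keys are always strings, so the forward maps (string to int) survive a round trip unchanged, while reverse maps with integer keys would come back with string keys. Rebuilding the reverse maps from the loaded forward maps avoids that mismatch.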

In [None]:
# Verify final data structure after ECO processing

print("=" * 60)
print("VERIFICATION: FINAL DATA STRUCTURE")
print("=" * 60)

print(f"\n1️⃣  Train/Val/Test DataFrames (ECO removed):")
print(f"\n   X_train:")
print(f"   • Shape: {X_train.shape}")
print(f"   • Columns: {list(X_train.columns)}")
print(f"   • Sample:")
print(X_train.head(3).to_string())

print(f"\n   X_val:")
print(f"   • Shape: {X_val.shape}")
print(f"   • Columns: {list(X_val.columns)}")

print(f"\n   X_test:")
print(f"   • Shape: {X_test.shape}")
print(f"   • Columns: {list(X_test.columns)}")

print(f"\n2️⃣  Side Information Tables:")

print(f"\n   player_side_info (indexed by player_id):")
print(f"   • Shape: {player_side_info.shape}")
print(f"   • Columns: {list(player_side_info.columns)}")
print(f"   • Sample:")
print(player_side_info.head(3).to_string())

print(f"\n   opening_side_info (indexed by opening_id):")
print(f"   • Shape: {opening_side_info.shape}")
print(f"   • Columns: {list(opening_side_info.columns)}")
print(f"   • Sample:")
print(opening_side_info.head(3).to_string())

print(f"\n3️⃣  Encoding Mappings (for inference):")
print(f"\n   eco_letter_map:")
for k, v in sorted(eco_letter_map.items()):
    print(f"      '{k}' → {v}")

print(f"\n   eco_number_map (first 10):")
for k, v in sorted(eco_number_map.items())[:10]:
    print(f"      '{k}' → {v}")
print(f"      ... ({len(eco_number_map)} total)")

print(f"\n4️⃣  Example: Lookup flow for a random train sample:")
sample = X_train.sample(1, random_state=42).iloc[0]
# A single row mixes integer IDs with float columns, so pandas may upcast the
# IDs to float; cast back to int before using them as lookup keys
player_id = int(sample['player_id'])
opening_id = int(sample['opening_id'])

print(f"\n   Sample interaction:")
print(f"   • player_id: {player_id}")
print(f"   • opening_id: {opening_id}")
print(f"   • confidence: {sample['confidence']:.4f}")

print(f"\n   Player side info lookup:")
player_info = player_side_info.loc[player_id]
print(f"   • rating_z: {player_info['rating_z']:.4f}")

print(f"\n   Opening side info lookup:")
opening_info = opening_side_info.loc[opening_id]
print(f"   • eco_letter_cat: {opening_info['eco_letter_cat']} (letter: {eco_int_to_letter[opening_info['eco_letter_cat']]})")
print(f"   • eco_number_cat: {opening_info['eco_number_cat']} (number: {eco_int_to_number[opening_info['eco_number_cat']]})")

print("\n" + "=" * 60)
print("✅ VERIFICATION COMPLETE")
print("=" * 60)

print(f"\n📦 Ready for PyTorch tensor conversion:")
print(f"   • Features: player_id, opening_id, confidence")
print(f"   • Target: score (in y_train, y_val, y_test)")
print(f"   • Player side info: rating_z")
print(f"   • Opening side info: eco_letter_cat, eco_number_cat")
print(f"\n   All ECO and rating data is now properly separated as side information!")
print(f"\n   Side info tables are clean - only contain necessary model inputs!")

In [None]:
# Verification: Sample 100 player-opening pairs with reconstructed ECO codes and opening names
# Doing this to make sure that our ECO encoding/decoding is correct

print("=" * 100)
print("VERIFICATION: ECO RECONSTRUCTION AND OPENING NAMES")
print("=" * 100)

# Sample 100 random player-opening pairs from training data
sample_size = 100
sample_data = X_train.sample(min(sample_size, len(X_train)), random_state=42)

print(f"\nSampling {len(sample_data)} player-opening pairs for verification...\n")

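# Get unique opening IDs from sample
opening_ids = sample_data['opening_id'].unique().astype(int)

# After the ID remapping step, X_train holds 0-based indices rather than database IDs,
# so translate back to the original IDs before querying the `opening` table.
# Assumes opening_idx_to_id is a dict (index -> original ID), or None if no remap happened.
if opening_idx_to_id is not None:
    db_opening_ids = [int(opening_idx_to_id[int(i)]) for i in opening_ids]
else:
    db_opening_ids = [int(i) for i in opening_ids]
db_id_to_idx = dict(zip(db_opening_ids, (int(i) for i in opening_ids)))
opening_ids_str = ','.join(map(str, db_opening_ids))

# Query database for opening names
con = get_db_connection(str(DB_PATH))
try:
    opening_query = f"""
        SELECT id, name, eco
        FROM opening
        WHERE id IN ({opening_ids_str})
    """
    opening_names = con.execute(opening_query).df()
    # Re-key by the remapped opening ID so the .loc lookups below keep working
    opening_names['id'] = opening_names['id'].map(db_id_to_idx)
    opening_names = opening_names.set_index('id')
finally:
    con.close()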

# Create reverse mappings for ECO decoding
eco_int_to_letter = {v: k for k, v in eco_letter_map.items()}
eco_int_to_number = {v: k for k, v in eco_number_map.items()}

# Build verification table
print(f"{'#':<4} {'Player':<8} {'Opening':<9} {'ECO (DB)':<10} {'Reconstructed':<13} {'Match':<6} {'Opening Name':<50}")
print("=" * 100)

matches = 0
for i, (idx, row) in enumerate(sample_data.iterrows(), 1):
    player_id = int(row['player_id'])
    opening_id = int(row['opening_id'])
    
    # Lookup opening side info
    opening_info = opening_side_info.loc[opening_id]
    
    # Reconstruct ECO from encoded categorical values
    eco_letter_encoded = opening_info['eco_letter_cat']
    eco_number_encoded = opening_info['eco_number_cat']
    
    eco_letter_decoded = eco_int_to_letter[eco_letter_encoded]
    eco_number_decoded = eco_int_to_number[eco_number_encoded]
    
    reconstructed_eco = f"{eco_letter_decoded}{eco_number_decoded}"
    
    # Get original ECO from database
    db_eco = opening_names.loc[opening_id, 'eco']
    opening_name = opening_names.loc[opening_id, 'name']
    
    # Check if they match
    match = "✓" if reconstructed_eco == db_eco else "✗"
    if reconstructed_eco == db_eco:
        matches += 1
    
    # Truncate opening name if too long
    if len(opening_name) > 48:
        opening_name = opening_name[:45] + "..."
    
    print(f"{i:<4} {player_id:<8} {opening_id:<9} {db_eco:<10} {reconstructed_eco:<13} {match:<6} {opening_name:<50}")

print("=" * 100)
print(f"\n✅ Verification Results:")
print(f"   • Total samples: {len(sample_data)}")
print(f"   • Matches: {matches}/{len(sample_data)} ({100*matches/len(sample_data):.1f}%)")
print(f"   • Mismatches: {len(sample_data) - matches}")

if matches == len(sample_data):
    print(f"\n🎉 Perfect! All ECO codes reconstructed correctly!")
else:
    print(f"\n⚠️  Warning: Some ECO codes did not match. Investigate mismatches above.")
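
The row-by-row ECO check can also be written as a vectorized round trip over the whole opening table. The sketch below is only illustrative: it runs on small made-up tables, and the toy `eco_letter_map` and `eco_number_map` assume the number part is stored as a zero-padded string, which may not match how the real maps in this notebook are keyed. `reconstruct_eco` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-ins (assumptions): the real maps and tables are built earlier in the notebook.
eco_letter_map = {"A": 0, "B": 1, "C": 2}
eco_number_map = {"00": 0, "20": 1, "65": 2}

opening_side_info = pd.DataFrame(
    {"eco_letter_cat": [1, 2, 0], "eco_number_cat": [1, 2, 0]},
    index=pd.Index([10, 11, 12], name="opening_id"),
)
opening_names = pd.DataFrame(
    {"eco": ["B20", "C65", "A00"]},
    index=pd.Index([10, 11, 12], name="id"),
)


def reconstruct_eco(side_info, letter_map, number_map):
    """Decode the two categorical columns back into one ECO string per opening."""
    int_to_letter = {v: k for k, v in letter_map.items()}
    int_to_number = {v: k for k, v in number_map.items()}
    letters = side_info["eco_letter_cat"].map(int_to_letter)
    numbers = side_info["eco_number_cat"].map(int_to_number)
    return letters + numbers  # element-wise string concatenation


reconstructed = reconstruct_eco(opening_side_info, eco_letter_map, eco_number_map)

# Align the database ECO codes to the same opening IDs before comparing.
expected = opening_names["eco"].reindex(reconstructed.index)
mismatch_mask = reconstructed.to_numpy() != expected.to_numpy()
mismatches = reconstructed[mismatch_mask]

print(f"{len(mismatches)} mismatches out of {len(reconstructed)}")
```

Checking every opening this way, rather than a sample, makes an encoding bug in a rarely played opening harder to miss.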

## 5. Data Verification and Examination
We're almost there. Let's examine our data structures to check for any obvious flaws.

In [None]:


print("X_train \n", X_train.head())
print("="*60)
print("X_val \n", X_val.head())
print("=" * 60)
print("X_test \n", X_test.head())
print("=" * 60)
print("y_train \n", y_train.head())
print("=" * 60)
print("y_val \n", y_val.head())
print("=" * 60) 
print("y_test \n", y_test.head())

# Now side information
print("player_side_info \n", player_side_info.head())
print("=" * 60)
print("opening_side_info \n", opening_side_info.head())

In [None]:
# Final verification: Display cleaned side info tables

print("=" * 60)
print("CLEANED SIDE INFORMATION TABLES")
print("=" * 60)

print(f"\n📊 player_side_info (cleaned):")
print(f"   • Shape: {player_side_info.shape}")
print(f"   • Columns: {list(player_side_info.columns)}")
print(f"   • Index: player_id")
print(f"\n   Sample (5 rows):")
print(player_side_info.head().to_string())

print(f"\n📊 opening_side_info (cleaned):")
print(f"   • Shape: {opening_side_info.shape}")
print(f"   • Columns: {list(opening_side_info.columns)}")
print(f"   • Index: opening_id")
print(f"\n   Sample (5 rows):")
print(opening_side_info.head().to_string())

print(f"\n✅ Both side info tables contain ONLY the necessary model inputs:")
print(f"   • player_side_info: rating_z (normalized rating)")
print(f"   • opening_side_info: eco_letter_cat, eco_number_cat (categorical encodings)")
print(f"   • No unnecessary columns (names, titles, raw strings, etc.)")
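
The printed summary states what the side info tables should contain, but it does not enforce it. A minimal sketch of turning that claim into a hard check follows. `check_side_info` is a hypothetical helper, the tables are toy data, and the index names `player_id` and `opening_id` are assumed to be set on the real DataFrames.

```python
import pandas as pd


def check_side_info(df, expected_columns, index_name):
    """Raise ValueError unless df has exactly the expected columns, the expected
    index name, a unique index, and no missing values."""
    if set(df.columns) != set(expected_columns):
        raise ValueError(
            f"Expected columns {sorted(expected_columns)}, found {sorted(df.columns)}"
        )
    if df.index.name != index_name:
        raise ValueError(f"Expected index '{index_name}', found '{df.index.name}'")
    if not df.index.is_unique:
        raise ValueError("Index contains duplicate IDs")
    if df.isna().any().any():
        raise ValueError("Table contains missing values")


# Toy tables (assumptions) shaped like the cleaned side info described above.
player_side_info = pd.DataFrame(
    {"rating_z": [-0.5, 0.0, 1.25]},
    index=pd.Index([0, 1, 2], name="player_id"),
)
opening_side_info = pd.DataFrame(
    {"eco_letter_cat": [1, 2], "eco_number_cat": [20, 65]},
    index=pd.Index([0, 1], name="opening_id"),
)

check_side_info(player_side_info, ["rating_z"], "player_id")
check_side_info(opening_side_info, ["eco_letter_cat", "eco_number_cat"], "opening_id")
print("Side info tables passed all checks")
```

Raising instead of printing means a stray column or a missing value stops the notebook before tensor conversion rather than surfacing later as a shape or NaN problem.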

## Step 5: Convert to PyTorch Tensors

**What are tensors?**
Tensors are PyTorch's multi-dimensional arrays. They behave much like NumPy arrays, but they can also live on a GPU and track gradients for automatic differentiation, which is what makes them suitable for training models.

**What we need to convert:**

1. **Main features** (X_train, X_val, X_test):
   - `player_id` → long tensor (integer IDs)
   - `opening_id` → long tensor (integer IDs)
   - `confidence` → float tensor (weights for loss function)

2. **Targets** (y_train, y_val, y_test):
   - `score` → float tensor (what we're predicting)

3. **Player side info** (player_side_info):
   - `rating_z` → float tensor (normalized ratings)
   - Indexed by player_id for fast lookup

4. **Opening side info** (opening_side_info):
   - `eco_letter_cat` → long tensor (categorical)
   - `eco_number_cat` → long tensor (categorical)
   - Indexed by opening_id for fast lookup
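
"Indexed by ID for fast lookup" is only free when the IDs are contiguous and 0-based, because then the ID doubles as the row position. If there are gaps, one option is a dense lookup tensor with one slot per possible ID. The sketch below shows that idea on toy data. `build_lookup` is a hypothetical helper, and it is written for float-valued side info only, since it uses NaN to mark unused slots.

```python
import pandas as pd
import torch

# Toy side-info table with non-contiguous player IDs (an assumption for illustration).
player_side_info = pd.DataFrame(
    {"rating_z": [-0.5, 0.0, 1.25]},
    index=pd.Index([3, 7, 8], name="player_id"),
)


def build_lookup(values, ids):
    """Return a dense float tensor where lookup[id] holds the value for that id.
    Slots for IDs that never occur are filled with NaN so misuse is visible."""
    lookup = torch.full((int(ids.max()) + 1,), float("nan"), dtype=values.dtype)
    lookup[ids] = values
    return lookup


ids = torch.tensor(player_side_info.index.values, dtype=torch.long)
ratings = torch.tensor(player_side_info["rating_z"].values, dtype=torch.float32)
rating_lookup = build_lookup(ratings, ids)

# Look up ratings for a batch of player IDs with plain tensor indexing.
batch_player_ids = torch.tensor([8, 3, 3], dtype=torch.long)
batch_ratings = rating_lookup[batch_player_ids]
print(batch_ratings)
```

The trade-off is memory: the lookup has `max_id + 1` slots, so resequencing IDs to be contiguous keeps it as small as possible.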

**Why these data types?**
- `long` (int64): For IDs and categorical features that will be embedded
- `float` (float32): For continuous values like scores, confidence, and normalized ratings
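
The dtype choice matters in practice because embedding layers take integer indices and return float vectors. Here is a small sketch of that behavior using `torch.nn.Embedding`; the sizes are arbitrary toy values, not the notebook's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy embedding table: 5 possible IDs, 3 latent factors each.
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Integer (long) IDs select rows of the embedding table.
ids_long = torch.tensor([0, 2, 4], dtype=torch.long)
vectors = embedding(ids_long)
print(vectors.shape, vectors.dtype)

# The same IDs stored as floats are rejected by the embedding layer.
ids_float = ids_long.to(torch.float32)
try:
    embedding(ids_float)
    float_ids_accepted = True
except (RuntimeError, TypeError):
    float_ids_accepted = False
print(f"float IDs accepted: {float_ids_accepted}")
```

Scores, confidence weights, and `rating_z` stay in float32 because they enter the loss and the model arithmetic directly rather than acting as indices.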



In [None]:
# Confirm PyTorch is importable (this cell does not install it; install it first if the import fails)
import sys
import subprocess
import torch

In [None]:
# Step 5: Convert all data to PyTorch tensors

import torch

print("=" * 60)
print("STEP 5: CONVERT TO PYTORCH TENSORS")
print("=" * 60)

# Set random seed for reproducibility
torch.manual_seed(42)

print(f"\n⚙️  Configuration:")
print(f"   • PyTorch version: {torch.__version__}")
print(f"   • CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   • CUDA device: {torch.cuda.get_device_name(0)}")
print(f"   • Default dtype: float32 for continuous, int64 for IDs/categorical")

# 0. Index alignment sanity checks
print(f"\n0️⃣  Index alignment sanity checks...")

# Check player_side_info is properly indexed
print(f"   Checking player_side_info index alignment...")
player_ids_sorted = sorted(player_side_info.index.values)
if player_ids_sorted != list(player_side_info.index.values):
    print(f"   ⚠️  player_side_info index is not sorted - this is OK, we'll use index values directly")
else:
    print(f"   ✓ player_side_info index is sorted")
print(f"   • Index range: [{player_side_info.index.min()}, {player_side_info.index.max()}]")
print(f"   • Index dtype: {player_side_info.index.dtype}")

# Check opening_side_info is properly indexed
print(f"\n   Checking opening_side_info index alignment...")
opening_ids_sorted = sorted(opening_side_info.index.values)
if opening_ids_sorted != list(opening_side_info.index.values):
    print(f"   ⚠️  opening_side_info index is not sorted - this is OK, we'll use index values directly")
else:
    print(f"   ✓ opening_side_info index is sorted")
print(f"   • Index range: [{opening_side_info.index.min()}, {opening_side_info.index.max()}]")
print(f"   • Index dtype: {opening_side_info.index.dtype}")

# CRITICAL: Check if indices are contiguous 0-based
# If opening_side_info.index = [0, 1, 2, ..., N-1], then we can use opening_id as direct array index
# If not (e.g., [10, 15, 23, ...]), we'll need a mapping dictionary
print(f"\n   Checking if indices are contiguous 0-based...")
player_contiguous = (player_side_info.index == range(len(player_side_info))).all()
opening_contiguous = (opening_side_info.index == range(len(opening_side_info))).all()

print(f"   • player_side_info contiguous 0-based: {player_contiguous}")
print(f"   • opening_side_info contiguous 0-based: {opening_contiguous}")

if not player_contiguous:
    print(f"   ℹ️  Player IDs are NOT 0-based contiguous - will need mapping for embedding lookup")
if not opening_contiguous:
    print(f"   ℹ️  Opening IDs are NOT 0-based contiguous - will need mapping for embedding lookup")

# 1. Convert main features (train/val/test)
print(f"\n1️⃣  Converting main features (X_train, X_val, X_test)...")

# Train set
player_ids_train = torch.tensor(X_train['player_id'].values, dtype=torch.long)
opening_ids_train = torch.tensor(X_train['opening_id'].values, dtype=torch.long)
confidence_train = torch.tensor(X_train['confidence'].values, dtype=torch.float32)

print(f"   Train tensors:")
print(f"   • player_ids_train: {player_ids_train.shape}, dtype={player_ids_train.dtype}")
print(f"   • opening_ids_train: {opening_ids_train.shape}, dtype={opening_ids_train.dtype}")
print(f"   • confidence_train: {confidence_train.shape}, dtype={confidence_train.dtype}")

# Validation set
player_ids_val = torch.tensor(X_val['player_id'].values, dtype=torch.long)
opening_ids_val = torch.tensor(X_val['opening_id'].values, dtype=torch.long)
confidence_val = torch.tensor(X_val['confidence'].values, dtype=torch.float32)

print(f"\n   Validation tensors:")
print(f"   • player_ids_val: {player_ids_val.shape}, dtype={player_ids_val.dtype}")
print(f"   • opening_ids_val: {opening_ids_val.shape}, dtype={opening_ids_val.dtype}")
print(f"   • confidence_val: {confidence_val.shape}, dtype={confidence_val.dtype}")

# Test set
player_ids_test = torch.tensor(X_test['player_id'].values, dtype=torch.long)
opening_ids_test = torch.tensor(X_test['opening_id'].values, dtype=torch.long)
confidence_test = torch.tensor(X_test['confidence'].values, dtype=torch.float32)

print(f"\n   Test tensors:")
print(f"   • player_ids_test: {player_ids_test.shape}, dtype={player_ids_test.dtype}")
print(f"   • opening_ids_test: {opening_ids_test.shape}, dtype={opening_ids_test.dtype}")
print(f"   • confidence_test: {confidence_test.shape}, dtype={confidence_test.dtype}")

# 2. Convert targets (scores)
print(f"\n2️⃣  Converting target scores (y_train, y_val, y_test)...")

scores_train = torch.tensor(y_train.values, dtype=torch.float32)
scores_val = torch.tensor(y_val.values, dtype=torch.float32)
scores_test = torch.tensor(y_test.values, dtype=torch.float32)

print(f"   • scores_train: {scores_train.shape}, dtype={scores_train.dtype}")
print(f"   • scores_val: {scores_val.shape}, dtype={scores_val.dtype}")
print(f"   • scores_test: {scores_test.shape}, dtype={scores_test.dtype}")

print(f"\n   Score ranges (sanity check):")
print(f"   • Train: [{scores_train.min():.4f}, {scores_train.max():.4f}]")
print(f"   • Val: [{scores_val.min():.4f}, {scores_val.max():.4f}]")
print(f"   • Test: [{scores_test.min():.4f}, {scores_test.max():.4f}]")

# 3. Convert player side information
print(f"\n3️⃣  Converting player side information...")

# Create tensor of all player ratings (indexed by player_id)
# Since player_side_info is indexed by player_id, we need to ensure coverage
player_ratings_tensor = torch.tensor(player_side_info['rating_z'].values, dtype=torch.float32)
player_ids_in_side_info = torch.tensor(player_side_info.index.values, dtype=torch.long)

print(f"   • player_ratings_tensor: {player_ratings_tensor.shape}, dtype={player_ratings_tensor.dtype}")
print(f"   • player_ids_in_side_info: {player_ids_in_side_info.shape}, dtype={player_ids_in_side_info.dtype}")
print(f"   • Rating range: [{player_ratings_tensor.min():.4f}, {player_ratings_tensor.max():.4f}]")

# Verify all player IDs in train/val/test are covered
all_player_ids = torch.cat([player_ids_train, player_ids_val, player_ids_test]).unique()
missing_players = set(all_player_ids.tolist()) - set(player_ids_in_side_info.tolist())
if len(missing_players) > 0:
    print(f"   ⚠️  WARNING: {len(missing_players)} players in splits missing from side_info!")
else:
    print(f"   ✓ All {len(all_player_ids)} unique players in splits have side information")

# 4. Convert opening side information
print(f"\n4️⃣  Converting opening side information...")

# Verify column names exist (using eco_letter_cat and eco_number_cat consistently)
if 'eco_letter_cat' not in opening_side_info.columns:
    raise ValueError(f"Column 'eco_letter_cat' not found. Available: {opening_side_info.columns.tolist()}")
if 'eco_number_cat' not in opening_side_info.columns:
    raise ValueError(f"Column 'eco_number_cat' not found. Available: {opening_side_info.columns.tolist()}")

# Create tensors for opening ECO features (indexed by opening_id)
opening_eco_letter_tensor = torch.tensor(opening_side_info['eco_letter_cat'].values, dtype=torch.long)
opening_eco_number_tensor = torch.tensor(opening_side_info['eco_number_cat'].values, dtype=torch.long)
opening_ids_in_side_info = torch.tensor(opening_side_info.index.values, dtype=torch.long)

print(f"   • opening_eco_letter_tensor: {opening_eco_letter_tensor.shape}, dtype={opening_eco_letter_tensor.dtype}")
print(f"   • opening_eco_number_tensor: {opening_eco_number_tensor.shape}, dtype={opening_eco_number_tensor.dtype}")
print(f"   • opening_ids_in_side_info: {opening_ids_in_side_info.shape}, dtype={opening_ids_in_side_info.dtype}")
print(f"   • ECO letter range: [{opening_eco_letter_tensor.min()}, {opening_eco_letter_tensor.max()}]")
print(f"   • ECO number range: [{opening_eco_number_tensor.min()}, {opening_eco_number_tensor.max()}]")

# Verify all opening IDs in train/val/test are covered
all_opening_ids = torch.cat([opening_ids_train, opening_ids_val, opening_ids_test]).unique()
missing_openings = set(all_opening_ids.tolist()) - set(opening_ids_in_side_info.tolist())
if len(missing_openings) > 0:
    print(f"   ⚠️  WARNING: {len(missing_openings)} openings in splits missing from side_info!")
else:
    print(f"   ✓ All {len(all_opening_ids)} unique openings in splits have side information")

# 5. Summary statistics
print(f"\n5️⃣  Summary statistics:")
print(f"\n   Dataset sizes:")
print(f"   • Train: {len(scores_train):,} samples")
print(f"   • Val: {len(scores_val):,} samples")
print(f"   • Test: {len(scores_test):,} samples")

print(f"\n   Unique entities:")
print(f"   • Players: {len(player_ids_in_side_info):,}")
print(f"   • Openings: {len(opening_ids_in_side_info):,}")

print(f"\n   Vocabulary sizes (for embedding layers):")
print(f"   • num_players: {player_ids_in_side_info.max() + 1} (max player_id + 1)")
print(f"   • num_openings: {opening_ids_in_side_info.max() + 1} (max opening_id + 1)")
print(f"   • num_eco_letters: {opening_eco_letter_tensor.max() + 1} (max eco_letter_cat + 1)")
print(f"   • num_eco_numbers: {opening_eco_number_tensor.max() + 1} (max eco_number_cat + 1)")

# 6. Memory usage (approximate)
print(f"\n6️⃣  Approximate memory usage:")
# Each tensor's footprint is bytes per element times number of elements;
# the helper below reports it in MB and the per-tensor results are summed.
def tensor_memory_mb(t):
    """Calculate tensor memory in MB"""
    return (t.element_size() * t.nelement()) / (1024 * 1024)

tensors = [
    player_ids_train, opening_ids_train, confidence_train, scores_train,
    player_ids_val, opening_ids_val, confidence_val, scores_val,
    player_ids_test, opening_ids_test, confidence_test, scores_test,
    player_ratings_tensor, opening_eco_letter_tensor, opening_eco_number_tensor
]
total_memory_mb = sum(tensor_memory_mb(t) for t in tensors)
print(f"   • Total tensor memory: {total_memory_mb:.2f} MB")

print("\n" + "=" * 60)
print("✅ TENSOR CONVERSION COMPLETE")
print("=" * 60)

print(f"\n📦 Available tensors for training:")
print(f"\n   Main features (train/val/test):")
print(f"   • player_ids_train, player_ids_val, player_ids_test")
print(f"   • opening_ids_train, opening_ids_val, opening_ids_test")
print(f"   • confidence_train, confidence_val, confidence_test")

print(f"\n   Targets (train/val/test):")
print(f"   • scores_train, scores_val, scores_test")

print(f"\n   Side information:")
print(f"   • player_ratings_tensor (indexed by player_ids_in_side_info)")
print(f"   • opening_eco_letter_tensor (indexed by opening_ids_in_side_info)")
print(f"   • opening_eco_number_tensor (indexed by opening_ids_in_side_info)")

print(f"\n💡 Ready for model training!")
print(f"   These tensors can be directly fed into PyTorch DataLoaders and models.")
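
As a rough illustration of what "fed into PyTorch DataLoaders" could look like, the sketch below wraps four toy tensors in a `TensorDataset` and iterates over seeded, shuffled mini-batches. The tensor values, the batch size of 2, and the constant placeholder prediction are all assumptions for demonstration; the real model and `BATCH_SIZE` are defined elsewhere in the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy tensors (assumptions) with the same names and dtypes as the converted training data.
player_ids_train = torch.tensor([0, 1, 2, 0, 1], dtype=torch.long)
opening_ids_train = torch.tensor([3, 3, 4, 5, 5], dtype=torch.long)
confidence_train = torch.tensor([0.9, 0.5, 0.7, 0.6, 0.8], dtype=torch.float32)
scores_train = torch.tensor([0.55, 0.48, 0.61, 0.50, 0.52], dtype=torch.float32)

# TensorDataset zips tensors of equal length into (player, opening, confidence, score) rows.
train_dataset = TensorDataset(
    player_ids_train, opening_ids_train, confidence_train, scores_train
)

# A seeded generator keeps the shuffle order reproducible across runs.
generator = torch.Generator().manual_seed(42)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True, generator=generator)

samples_seen = 0
for player_ids, opening_ids, confidence, scores in train_loader:
    # Placeholder prediction of 0.5 for every row, standing in for a real model.
    predictions = torch.full_like(scores, 0.5)
    # Confidence-weighted mean squared error for this batch.
    loss = (confidence * (predictions - scores) ** 2).sum() / confidence.sum()
    samples_seen += len(scores)
    print(f"batch size={len(scores)}, loss={loss.item():.5f}")
```

Passing an explicit `generator` fits the goal of deterministic runs, since the shuffle order no longer depends on whatever else has consumed the global random state.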