# Notebook 26 - Opening Recommender Model: Training Pipeline

### 0. Overview and Goals

This notebook defines the full pipeline for training the chess opening recommender model.  
The objective is to predict **player–opening performance scores**, defined as `(wins + 0.5 * draws) / num_games`, for openings a player hasn't yet played, based on their results in the openings they *have* played.  

The model will use **matrix factorization** with **stochastic gradient descent (SGD)** to learn latent factors representing player and opening characteristics.  
All computations will be implemented in **PyTorch**, with data loaded from my local **DuckDB** database.

**High-level specs:**
- Use only *White* openings initially (we'll extend to Black later).  
- Data source: processed player–opening stats from local DuckDB.  
- Predict: normalized "score" = (wins + 0.5 × draws) / total games (a win rate that counts each draw as half a win).  
- Filter: only include entries with ≥ `MIN_GAMES_THRESHOLD` games (planned default = 50; Step 2 lowers this to 10 and adds confidence weighting).  
- Ignore: rating differences, time controls, and other metadata.  
- Model parameters (to be defined in appropriate places for easy editing):  
  - `NUM_FACTORS`, `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`, `NUM_PLAYERS_TO_PROCESS`  
- Logging and checkpoints throughout for reproducibility.  
- All random operations seeded for deterministic runs.  
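
As a minimal sketch of the target defined above: a win counts as one point and a draw as half a point, divided by total games. The function name `performance_score` is illustrative only; the notebook itself computes the same quantity in SQL during Step 1.

```python
def performance_score(wins: int, draws: int, losses: int) -> float:
    """Return (wins + 0.5 * draws) / total games, a value in [0, 1]."""
    total_games = wins + draws + losses
    if total_games == 0:
        raise ValueError("score is undefined for zero games")
    return (wins + 0.5 * draws) / total_games


# 6 wins, 2 draws, 2 losses over 10 games
print(performance_score(6, 2, 2))  # 0.7
```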

---

### 1. Data Extraction
- Connect to local DuckDB
- Pull all processed player–opening statistics
- Verify schema consistency:  
  - Required columns: `player_id`, `opening_id`, `eco`, `num_games`, `wins`, `draws`, `losses`.  
- Include a row-count sanity check.
- Include only players rated 1200 or higher (`MIN_RATING = 1200`)

---

### 2. Data Sanitization & Normalization
- Optionally normalize scores if needed for MF convergence.  
- Drop players with no qualifying openings and openings with no qualifying players.  
  - I believe there shouldn't be any but we'll double check.
- Resequence `player_id` and `opening_id` into contiguous integers. Right now there are gaps because of entries we deleted from the DB.
- Check for sparsity consistency (no implicit zeros yet).  
- Note that this data has already been split into White and Black games further up the pipeline.
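
One way the resequencing step could look, assuming pandas and a toy frame with made-up values. The column names `player_idx` and `opening_idx` and the reverse-lookup dicts are hypothetical, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with gaps in both ID columns (values are made up for illustration)
df = pd.DataFrame(
    {
        "player_id": [3, 3, 7, 10],
        "opening_id": [2, 15, 15, 40],
        "score": [0.55, 0.48, 0.61, 0.50],
    }
)

# factorize(sort=True) maps the sorted unique IDs to 0..n-1
player_idx, player_ids = pd.factorize(df["player_id"], sort=True)
opening_idx, opening_ids = pd.factorize(df["opening_id"], sort=True)

df["player_idx"] = player_idx
df["opening_idx"] = opening_idx

# Keep the reverse mapping so predictions can be translated back to DB IDs
idx_to_player_id = dict(enumerate(player_ids))
idx_to_opening_id = dict(enumerate(opening_ids))

print(df[["player_id", "player_idx", "opening_id", "opening_idx"]])
```

Contiguous zero-based indices matter because embedding tables are sized by the largest index, so gaps would waste rows that never receive a gradient.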

### Data Quality
- Drop entries with fewer than `MIN_GAMES_THRESHOLD` games
- Handle any duplicate `(player_id, opening_id)` combinations
- Remove players with no qualifying openings
- Remove openings with no qualifying players
- Verify no null values remain

### ECO Codes
- Keep ECO codes for later categorical encoding (Step 4)
- ECO will be used as opening side information (similar to rating for players)

### Confidence Weighting
- Use `MIN_GAMES_THRESHOLD = 10` to keep more data
- Add a **confidence weight** column: `confidence = num_games / (num_games + K)` where K ≈ 50
- This weight will be used in the loss function to down-weight entries backed by few games
- High-game-count entries → high confidence → larger loss impact
- Low-game-count entries → low confidence → smaller loss impact
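
A small sketch of how the confidence column and a confidence-weighted loss might fit together. It uses NumPy for readability; the same arithmetic carries over to PyTorch tensors. Normalizing by the sum of the weights is an assumption made here, not something the plan above specifies, and `weighted_mse` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

K = 50  # confidence constant from the plan above

df = pd.DataFrame({"num_games": [10, 50, 450], "score": [0.70, 0.70, 0.70]})
df["confidence"] = df["num_games"] / (df["num_games"] + K)


def weighted_mse(pred, target, weight):
    """Squared errors scaled by confidence, normalized by the total weight."""
    pred, target, weight = map(np.asarray, (pred, target, weight))
    return float(np.sum(weight * (pred - target) ** 2) / np.sum(weight))


# The 450-game row (confidence 0.9) pulls on the loss roughly 5x harder
# than the 10-game row (confidence about 0.17)
print(df)
print(weighted_mse([0.5, 0.5, 0.5], df["score"], df["confidence"]))
```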

### Player Rating (Side Information)
- **Player ratings are side information** - they describe player characteristics, not individual player-opening interactions
- Ratings will be stored separately and joined to player embeddings during training
- We'll **normalize ratings** (likely z-score normalization) to avoid scaling issues with the embedding layer
- Rating normalization will be done once after extraction, not per-row
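
A minimal sketch of the one-off z-score normalization, assuming a separate per-player table. The `players` frame, its values, and the `rating_z` column are made up for illustration, and using the population standard deviation (`ddof=0`) is a choice rather than something fixed by the plan.

```python
import pandas as pd

# Hypothetical per-player ratings table (one row per player, values made up)
players = pd.DataFrame(
    {"player_id": [1, 2, 3, 4], "rating": [1200, 1500, 1800, 2100]}
)

# Compute the statistics once over the player table, not per interaction row,
# so players with many openings are not over-represented
rating_mean = players["rating"].mean()
rating_std = players["rating"].std(ddof=0)  # population std; ddof is a choice

players["rating_z"] = (players["rating"] - rating_mean) / rating_std
print(players)
```

The mean and standard deviation should be saved alongside the model so that ratings seen at inference time can be scaled the same way.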

---

### 3. Data Splits
- Split into train/test/val sets.  
- Ensure every player and every opening appears at least once in the training data.  
- Strategy:  
  - Sample unique players and openings to guarantee coverage in train.  
  - Remaining data → stratified random split into train/test.  
  - Deduplicate and merge unique IDs back into train if needed.
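
The coverage guarantee could be sketched roughly as follows. This version pins one random row per player and one per opening to the training set, then draws the test rows from what is left. It uses a plain random split rather than the stratified split mentioned above, covers train/test only (a validation set could be carved out of the remainder the same way), and `split_with_coverage` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def split_with_coverage(df, test_frac=0.2, seed=42):
    """Random train/test split that keeps every player and opening in train."""
    rng = np.random.default_rng(seed)
    shuffled = df.sample(frac=1.0, random_state=seed)

    # One row per player and one row per opening are pinned to train
    pinned_idx = set(shuffled.drop_duplicates("player_id").index)
    pinned_idx |= set(shuffled.drop_duplicates("opening_id").index)

    # Test rows may only come from rows that are not pinned
    rest = shuffled.drop(index=list(pinned_idx))
    n_test = min(int(round(test_frac * len(df))), len(rest))
    test_idx = rng.choice(rest.index.to_numpy(), size=n_test, replace=False)

    test = df.loc[test_idx]
    train = df.drop(index=test_idx)
    return train, test


# Toy interactions (made-up values)
toy = pd.DataFrame(
    {
        "player_id": [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
        "opening_id": [0, 1, 2, 0, 1, 1, 2, 3, 0, 3],
        "score": np.linspace(0.4, 0.6, 10),
    }
)
train, test = split_with_coverage(toy, test_frac=0.2)
print(len(train), len(test))
```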

---

### 4. Enumerate Categorical Variables
- Enumerate `eco` (if included) as an integer categorical variable.  
- Confirm all columns are numeric and compatible with PyTorch tensors.  
- Verify no missing or out-of-range IDs.
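
A short sketch of the ECO enumeration, assuming pandas categoricals. The `eco_idx` column and `eco_lookup` dict are hypothetical names and the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({"opening_id": [0, 1, 2, 3], "eco": ["C50", "B20", "C50", "A00"]})

# Categorical codes follow sorted category order: A00 -> 0, B20 -> 1, C50 -> 2
eco_cat = df["eco"].astype("category")
df["eco_idx"] = eco_cat.cat.codes.astype("int64")

# Lookup table for decoding model inputs back to ECO strings
eco_lookup = dict(enumerate(eco_cat.cat.categories))

# Codes are -1 for missing values, so this doubles as a null check
assert (df["eco_idx"] >= 0).all(), "unexpected missing ECO code"
print(df)
```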

---

### 5. Training Data Structure
- Each row: one `(player_id, opening_id, score)` record.
- Include other fields alongside each record (`eco`, `num_games`, etc.)
- Convert DataFrame to PyTorch tensors (`torch.long` for IDs, `torch.float` for scores).  
- Log dataset shapes and sparsity metrics.
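
The conversion might look like the sketch below, assuming the IDs have already been resequenced to zero-based indices. The column names `player_idx` and `opening_idx` and the toy values are assumptions for illustration.

```python
import pandas as pd
import torch

df = pd.DataFrame(
    {
        "player_idx": [0, 0, 1, 2],
        "opening_idx": [0, 1, 1, 2],
        "score": [0.55, 0.48, 0.61, 0.50],
    }
)

# Embedding lookups need int64 indices; scores are float32 regression targets
player_t = torch.tensor(df["player_idx"].to_numpy(), dtype=torch.long)
opening_t = torch.tensor(df["opening_idx"].to_numpy(), dtype=torch.long)
score_t = torch.tensor(df["score"].to_numpy(), dtype=torch.float32)

n_players = int(player_t.max()) + 1
n_openings = int(opening_t.max()) + 1

# Sparsity: share of the player x opening matrix with no observed score
sparsity = 1.0 - len(df) / (n_players * n_openings)
print(player_t.shape, opening_t.shape, score_t.shape)
print(f"sparsity: {sparsity:.3f}")
```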

---

### 6. Training Setup
Define constants:
- `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`, `NUM_FACTORS`  
- Loss functions: MSE and RMSE  
- Activation: sigmoid or none (depending on score normalization)  
- Optimizer: SGD  
- Figure out if there's anything else we need to design or specify

Implement helper functions:
- `train_one_epoch()`
- `evaluate_model()`
- `calculate_rmse()`
- `save_checkpoint()`  

Ensure detailed logging, ETA reporting, and reproducible random seeds.
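
Two of the pieces above, sketched under assumptions: `set_seed` is a hypothetical helper (not listed in the plan) that seeds the three RNGs a PyTorch notebook typically touches, and `calculate_rmse` is one plausible shape for the helper of that name.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def calculate_rmse(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Root mean squared error as a Python float."""
    return torch.sqrt(torch.mean((pred - target) ** 2)).item()


# Re-seeding reproduces the same random draws
set_seed(42)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)
print(torch.equal(a, b))

print(calculate_rmse(torch.tensor([0.5, 0.5]), torch.tensor([0.4, 0.6])))
```

Note that seeding alone does not guarantee bit-identical results across different hardware or library versions.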

---

### 7. Training Loop
- Initialize player and opening embeddings.  
- Iterate through epochs with mini-batch SGD (`BATCH_SIZE = 1024`).  
- Compute and log MSE/RMSE per epoch.  
- Save model checkpoints locally after each epoch.
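
A compact sketch of what the model and one epoch of mini-batch SGD could look like. Several choices here are assumptions rather than decisions recorded in this notebook: the class name `MatrixFactorization`, the per-player and per-opening bias terms, the sigmoid output (one of the two activation options listed in Section 6), and the toy data.

```python
import torch
from torch import nn


class MatrixFactorization(nn.Module):
    """Dot-product matrix factorization with per-player and per-opening biases."""

    def __init__(self, n_players: int, n_openings: int, num_factors: int):
        super().__init__()
        self.player_factors = nn.Embedding(n_players, num_factors)
        self.opening_factors = nn.Embedding(n_openings, num_factors)
        self.player_bias = nn.Embedding(n_players, 1)
        self.opening_bias = nn.Embedding(n_openings, 1)

    def forward(self, player_idx, opening_idx):
        dot = (self.player_factors(player_idx) * self.opening_factors(opening_idx)).sum(dim=1)
        bias = self.player_bias(player_idx).squeeze(1) + self.opening_bias(opening_idx).squeeze(1)
        # Sigmoid keeps predictions inside the score range
        return torch.sigmoid(dot + bias)


def train_one_epoch(model, optimizer, players, openings, scores, batch_size=1024):
    """One pass of mini-batch SGD; returns the mean training MSE."""
    model.train()
    perm = torch.randperm(len(scores))  # reshuffle rows every epoch
    total_loss = 0.0
    for start in range(0, len(scores), batch_size):
        batch = perm[start:start + batch_size]
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(players[batch], openings[batch]), scores[batch])
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(batch)
    return total_loss / len(scores)


# Toy data: 3 players, 3 openings, 5 observed scores (made-up values)
torch.manual_seed(0)
players = torch.tensor([0, 0, 1, 2, 2])
openings = torch.tensor([0, 1, 1, 0, 2])
scores = torch.tensor([0.55, 0.48, 0.61, 0.50, 0.52])

model = MatrixFactorization(n_players=3, n_openings=3, num_factors=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
losses = [
    train_one_epoch(model, optimizer, players, openings, scores, batch_size=2)
    for _ in range(200)
]
print(f"first epoch MSE: {losses[0]:.4f}, last epoch MSE: {losses[-1]:.4f}")
```

Swapping `mse_loss` for a confidence-weighted loss would only change the line that computes `loss`.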

---

### 8. Evaluation
- Evaluate on test set.  
- Report MSE, RMSE, and visual diagnostics (predicted vs actual score).  
- Inspect a few player and opening latent factors for sanity.

---

### 9. Cross-Validation & Hyperparameter Tuning
- Define ranges for:  
  - `NUM_FACTORS`, `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`  
- Perform small-scale grid or random search for best configuration.  
- Compare validation RMSE across runs.
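
A small-scale grid search can be enumerated with the standard library alone. The value ranges below are placeholders, not ranges this notebook has settled on, and `iter_configs` and `pick_best` are hypothetical helper names.

```python
from itertools import product

# Hypothetical search ranges; the notebook has not fixed these values
param_grid = {
    "NUM_FACTORS": [16, 32],
    "LEARNING_RATE": [0.01, 0.1],
    "BATCH_SIZE": [1024],
    "N_EPOCHS": [10, 20],
}


def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def pick_best(results):
    """Return the (config, val_rmse) pair with the lowest validation RMSE."""
    return min(results, key=lambda item: item[1])


configs = list(iter_configs(param_grid))
print(len(configs))  # 2 * 2 * 1 * 2 = 8 combinations
```

Each config would be trained once and paired with its validation RMSE before calling `pick_best`; a random search would sample from `configs` instead of iterating over all of them.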

---

### 10. Next Steps
- Extend model to include Black openings.  
- Experiment with hybrid inputs (player rating, ECO grouping).  
- Consider implicit feedback handling (unplayed openings as zeros).  
- Integrate trained model into API for recommendation output.

---

**Notes:**  
- Every random seed and parameter definition will be explicit.  
- Every major step includes row-count, schema, and type validation.  
- Model artifacts and logs will be saved locally for reproducibility.


## Step 1: Data Extraction

Connect to DuckDB and extract all player-opening statistics.
Verify schema and perform sanity checks.

In [1]:
# Setup and imports
from pathlib import Path
import pandas as pd
import sys

# Add utils to path
sys.path.append(str(Path.cwd() / 'utils'))
from database.db_utils import get_db_connection

# Configuration
DB_PATH = Path.cwd().parent / "data" / "processed" / "chess_games.db"
COLOR_FILTER = 'w'  # 'w' for white, 'b' for black

print("=" * 60)
print("STEP 1: DATA EXTRACTION")
print("=" * 60)
print(f"\n📁 Database: {DB_PATH}")
print(f"📁 Database exists: {DB_PATH.exists()}")
print(f"🎨 Color filter: {'White' if COLOR_FILTER == 'w' else 'Black'}")

if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found at {DB_PATH}")

STEP 1: DATA EXTRACTION

📁 Database: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
📁 Database exists: True
🎨 Color filter: White


In [2]:
# Connect to DuckDB and extract player-opening statistics
con = get_db_connection(str(DB_PATH))

try:
    print(f"\n1️⃣  Extracting player-opening statistics (color: '{COLOR_FILTER}')...")
    
    # Extract stats with calculated score and num_games
    # Filter by color, minimum rating, and calculate score in the database
    MIN_RATING = 1200
    print(f"   • Minimum rating filter: {MIN_RATING}")
    
    query = f"""
        SELECT 
            pos.player_id,
            pos.opening_id,
            pos.num_wins + pos.num_draws + pos.num_losses as num_games,
            (pos.num_wins + (pos.num_draws * 0.5)) / 
                NULLIF(pos.num_wins + pos.num_draws + pos.num_losses, 0) as score,
            o.eco
        FROM player_opening_stats pos
        JOIN opening o ON pos.opening_id = o.id
        JOIN player p ON pos.player_id = p.id
        WHERE pos.color = '{COLOR_FILTER}'
        AND p.rating >= {MIN_RATING}
        ORDER BY pos.player_id, pos.opening_id
    """
    
    # .df() already returns a pandas DataFrame, so no extra wrapping is needed
    raw_data = con.execute(query).df()
    
    print(f"   ✓ Extracted {len(raw_data):,} rows")
    
    # Schema verification
    print("\n2️⃣  Verifying schema...")
    required_columns = ['player_id', 'opening_id', 'num_games', 'score', 'eco']
    
    for col in required_columns:
        if col not in raw_data.columns:
            raise ValueError(f"Missing required column: {col}")
    
    print(f"   ✓ All required columns present: {required_columns}")
    
    # Data types verification
    print("\n3️⃣  Checking data types...")
    for col in required_columns:
        print(f"   • {col}: {raw_data[col].dtype}")
    
    # Basic statistics
    print("\n4️⃣  Data statistics...")
    print(f"   • Total rows: {len(raw_data):,}")
    print(f"   • Unique players: {raw_data['player_id'].nunique():,}")
    print(f"   • Unique openings: {raw_data['opening_id'].nunique():,}")
    print(f"   • Total games (sum): {raw_data['num_games'].sum():,}")
    
    # ID ranges (gaps are expected; IDs are resequenced later in the pipeline)
    for label, col in [("Player ID", "player_id"), ("Opening ID", "opening_id")]:
        print(f"\n   {label} range:")
        print(f"   • Min: {raw_data[col].min()}")
        print(f"   • Max: {raw_data[col].max()}")
    
    # Games per entry statistics
    print("\n   Games per entry:")
    print(f"   • Min: {raw_data['num_games'].min()}")
    print(f"   • Max: {raw_data['num_games'].max()}")
    print(f"   • Mean: {raw_data['num_games'].mean():.1f}")
    print(f"   • Median: {raw_data['num_games'].median():.0f}")
    
    # Score statistics
    print("\n   Score distribution:")
    print(f"   • Min: {raw_data['score'].min():.4f}")
    print(f"   • Max: {raw_data['score'].max():.4f}")
    print(f"   • Mean: {raw_data['score'].mean():.4f}")
    print(f"   • Median: {raw_data['score'].median():.4f}")
    
    # Check for null values
    print("\n5️⃣  Checking for null values...")
    null_counts = raw_data.isnull().sum()
    if null_counts.sum() == 0:
        print("   ✓ No null values found")
    else:
        print("   ⚠️  Found null values:")
        for col, count in null_counts[null_counts > 0].items():
            print(f"      • {col}: {count} nulls")
    
    # Sample data
    print("\n6️⃣  Sample of extracted data (first 10 rows):")
    print(raw_data.head(10).to_string())
    
    print("\n" + "=" * 60)
    print("✅ DATA EXTRACTION COMPLETE")
    print("=" * 60)
    print(f"\nData shape: {raw_data.shape}")
    print(f"Columns: {list(raw_data.columns)}")
    
finally:
    con.close()
    print("\n✓ Database connection closed")


1️⃣  Extracting player-opening statistics (color: 'w')...
   • Minimum rating filter: 1200

   ✓ Extracted 11,802,584 rows

2️⃣  Verifying schema...
   ✓ All required columns present: ['player_id', 'opening_id', 'num_games', 'score', 'eco']

3️⃣  Checking data types...
   • player_id: int32
   • opening_id: int32
   • num_games: int32
   • score: float64
   • eco: object

4️⃣  Data statistics...
   • Total rows: 11,802,584
   • Unique players: 49,551
   • Unique openings: 2,991
   • Total games (sum): 233,559,168

   Player ID range:
   • Min: 1
   • Max: 50000

   Opening ID range:
   • Min: 2
   • Max: 3589

   Games per entry:
   • Min: 1
   • Max: 13462
   • Mean: 19.8
   • Median: 3

   Score distribution:
   • Min: 0.0000
   • Max: 1.0000
   • Mean: 0.5007
   • Median: 0.5000

5️⃣  Checking for null values...
   ✓ No null values found

6️⃣  Sample

## Step 2: Data Sanitization & Normalization

Filter low-quality data, handle duplicates, and prepare for training.

In [3]:
# 2a. Filter low-quality data, handle duplicates, and prepare for training.

import numpy as np

# Configuration
MIN_GAMES_THRESHOLD = 10

print("=" * 60)
print("STEP 2: DATA SANITIZATION & NORMALIZATION")
print("=" * 60)
print("\n⚙️  Configuration:")
print(f"   • MIN_GAMES_THRESHOLD: {MIN_GAMES_THRESHOLD}")

# Start with raw_data from Step 1
print(f"\n📊 Starting data shape: {raw_data.shape}")
print(f"   • Rows: {len(raw_data):,}")
print(f"   • Unique players: {raw_data['player_id'].nunique():,}")
print(f"   • Unique openings: {raw_data['opening_id'].nunique():,}")

# 1. Filter by minimum games threshold
print(f"\n1️⃣  Filtering entries with < {MIN_GAMES_THRESHOLD} games...")
before_filter = len(raw_data)
clean_data = raw_data[raw_data['num_games'] >= MIN_GAMES_THRESHOLD].copy()
num_rows_after_filter = len(clean_data)
num_rows_filtered_out = before_filter - num_rows_after_filter

print(f"   • Before: {before_filter:,} rows")
print(f"   • After: {num_rows_after_filter:,} rows")
print(f"   • Filtered out: {num_rows_filtered_out:,} rows ({100*num_rows_filtered_out/before_filter:.1f}%)")

# 2. Check for duplicates
print("\n2️⃣  Checking for duplicate (player_id, opening_id) combinations...")
num_duplicates = clean_data.duplicated(subset=['player_id', 'opening_id']).sum()

if num_duplicates > 0:
    print(f"   ⚠️  Found {num_duplicates} duplicate entries")
    dup_mask = clean_data.duplicated(subset=['player_id', 'opening_id'], keep=False)
    print("\n   Sample of duplicates:")
    print(clean_data[dup_mask].head(10).to_string())
    
    # Keep only first occurrence of any duplicate player-opening pair
    print("\n   Removing duplicates (keeping first occurrence)...")
    clean_data = clean_data.drop_duplicates(subset=['player_id', 'opening_id'], keep='first')
    print(f"   ✓ After deduplication: {len(clean_data):,} rows")
else:
    print("   ✓ No duplicates found")

# 3. Remove players with no qualifying openings
# Note: a few players only play stuff like the Van't Kruijs, which we've excluded,
# so a small number of players end up with no qualifying rows.
# Those players already dropped out at the threshold filter above: every player
# still in clean_data has at least one row, so this step is a sanity check and
# is expected to remove nothing.
print("\n3️⃣  Removing players with no qualifying openings...")
players_before = clean_data['player_id'].nunique()

# Count openings per player
num_openings_per_player = clean_data.groupby('player_id').size()
players_with_data = num_openings_per_player[num_openings_per_player > 0].index

# Filter
clean_data = clean_data[clean_data['player_id'].isin(players_with_data)]
players_after = clean_data['player_id'].nunique()

print(f"   • Players before: {players_before:,}")
print(f"   • Players after: {players_after:,}")
print(f"   • Removed: {players_before - players_after}")

# 4. Remove openings with no qualifying players
# (same caveat as step 3: this is a sanity check after the threshold filter)
print("\n4️⃣  Removing openings with no qualifying players...")
num_openings_before = clean_data['opening_id'].nunique()

# Count players per opening
num_players_per_opening = clean_data.groupby('opening_id').size()
openings_with_data = num_players_per_opening[num_players_per_opening > 0].index

# Filter
clean_data = clean_data[clean_data['opening_id'].isin(openings_with_data)]
openings_after = clean_data['opening_id'].nunique()

print(f"   • Openings before: {num_openings_before:,}")
print(f"   • Openings after: {openings_after:,}")
print(f"   • Removed: {num_openings_before - openings_after}")

# 5. Verify no null values
print("\n5️⃣  Verifying no null values...")
null_counts = clean_data.isna().sum()
if null_counts.sum() == 0:
    print("   ✓ No null values found")
else:
    print("   ⚠️  Found null values:")
    for col, count in null_counts[null_counts > 0].items():
        print(f"      • {col}: {count} nulls")
    # Drop rows with nulls
    clean_data = clean_data.dropna()
    print(f"   ✓ Dropped null rows. New shape: {clean_data.shape}")

# TODO: Add confidence weighting column
# TODO: Extract and normalize player ratings (side information)

# Reset index so row labels are contiguous after filtering
clean_data = clean_data.reset_index(drop=True)

# Final statistics
n_players = clean_data['player_id'].nunique()
n_openings = clean_data['opening_id'].nunique()
print("\n6️⃣  Final data statistics:")
print(f"   • Total rows: {len(clean_data):,}")
print(f"   • Unique players: {n_players:,}")
print(f"   • Unique openings: {n_openings:,}")
print(f"   • Total games: {clean_data['num_games'].sum():,}")
print(f"   • Avg games per entry: {clean_data['num_games'].mean():.1f}")
print(f"   • Avg openings per player: {len(clean_data) / n_players:.1f}")
print(f"   • Avg players per opening: {len(clean_data) / n_openings:.1f}")

# Score distribution
score = clean_data['score']
print("\n   Score statistics:")
print(f"   • Min: {score.min():.4f}")
print(f"   • 25th percentile: {score.quantile(0.25):.4f}")
print(f"   • Median: {score.median():.4f}")
print(f"   • 75th percentile: {score.quantile(0.75):.4f}")
print(f"   • Max: {score.max():.4f}")
print(f"   • Mean: {score.mean():.4f}")
print(f"   • Std: {score.std():.4f}")

# Sample of cleaned data
print("\n7️⃣  Sample of cleaned data (10 random rows):")
print(clean_data.sample(min(10, len(clean_data)), random_state=42).to_string())

print("\n" + "=" * 60)
print("✅ DATA SANITIZATION COMPLETE")
print("=" * 60)
print(f"\nCleaned data shape: {clean_data.shape}")
print(f"Data reduction: {100 * (1 - len(clean_data)/len(raw_data)):.1f}%")

STEP 2: DATA SANITIZATION & NORMALIZATION

⚙️  Configuration:
   • MIN_GAMES_THRESHOLD: 10

📊 Starting data shape: (11802584, 5)
   • Rows: 11,802,584
   • Unique players: 49,551
   • Unique openings: 2,991

1️⃣  Filtering entries with < 10 games...
   • Before: 11,802,584 rows
   • After: 2,956,680 rows
   • Filtered out: 8,845,904 rows (74.9%)

2️⃣  Checking for duplicate (player_id, opening_id) combinations...
   ✓ No duplicates found

3️⃣  Removing players with no qualifying openings...
   • Players before: 49,467
   • Players after: 49,467
   • Removed: 0

4️⃣  Removing openings with no qualifying player

In [4]:
# 2b. Apply hierarchical Bayesian shrinkage to adjust scores based on sample size confidence

# Check if confidence already exists - if so, skip this processing
if 'confidence' in clean_data.columns:
    print("=" * 60)
    print("⏭️  SKIPPING STEP 2B: HIERARCHICAL BAYESIAN SCORE ADJUSTMENT")
    print("=" * 60)
    print("\n✓ 'confidence' column already exists in data")
    print("   This indicates hierarchical Bayesian processing has already been applied.")
    print(f"\nCurrent data shape: {clean_data.shape}")
    print(f"Confidence range: [{clean_data['confidence'].min():.4f}, {clean_data['confidence'].max():.4f}]")
else:
    # Define the processing function
    # This is a long function, I recommend you fold it down in your editor
    def apply_hierarchical_bayesian_shrinkage(data, k_player=50):
        """
        Apply two-level hierarchical Bayesian shrinkage to adjust scores.
        
        A lot of our player-opening entries have a small number of games played, because openings are so specific.
        This introduces sample size issues.
        
        We use TWO-LEVEL shrinkage:
        Level 1: Calculate opening-specific means (these are our "ground truth" for each opening)
        Level 2: Shrink individual player-opening scores toward their opening's mean
        This is better than shrinking toward global mean because different openings have different baseline win rates
        
        Parameters:
        -----------
        data : pd.DataFrame
            Clean data with columns: player_id, opening_id, score, num_games, eco
        k_player : int
            Shrinkage constant for player-opening scores (default: 50)
            
        Returns:
        --------
        pd.DataFrame
            Data with adjusted scores and new 'confidence' column
        """
        print("=" * 60)
        print("STEP 2B: HIERARCHICAL BAYESIAN SCORE ADJUSTMENT")
        print("=" * 60)
        
        print("\n⚙️  Configuration:")
        print(f"   • K_PLAYER (shrinkage constant): {k_player}")
        print("   • Method: Two-level empirical Bayes shrinkage")
        print("   • Level 1: Calculate opening-specific means")
        print("   • Level 2: Shrink player scores toward opening means")
        
        # Calculate global mean score for comparison
        global_mean_score = data["score"].mean()
        print("\n📊 Global statistics:")
        print(f"   • Global mean score: {global_mean_score:.4f}")
        print(f"   • Total entries: {len(data):,}")
        print(f"   • Unique openings: {data['opening_id'].nunique():,}")
        
        # Store original scores for comparison
        data = data.copy()  # Best practice: work on a copy
        data["score_original"] = data["score"].copy()
        
        # LEVEL 1: Calculate opening-specific means and statistics
        print("\n1️⃣  LEVEL 1: Calculating opening-specific means...")
        
        # Named aggregation: one row per opening.
        # opening_mean is the unweighted mean of per-player scores, so a
        # 10-game entry counts as much as a 500-game entry.
        opening_stats = data.groupby("opening_id").agg(
            opening_mean=("score", "mean"),
            opening_total_games=("num_games", "sum"),
            opening_num_players=("player_id", "count"),  # Number of players who played this opening
        )
        
        print(f"   ✓ Calculated means for {len(opening_stats):,} openings")
        
        # Opening mean statistics
        print("\n   Opening mean score distribution:")
        print(f"   • Min: {opening_stats['opening_mean'].min():.4f}")
        print(f"   • 25th percentile: {opening_stats['opening_mean'].quantile(0.25):.4f}")
        print(f"   • Median: {opening_stats['opening_mean'].median():.4f}")
        print(f"   • 75th percentile: {opening_stats['opening_mean'].quantile(0.75):.4f}")
        print(f"   • Max: {opening_stats['opening_mean'].max():.4f}")
        print(f"   • Std: {opening_stats['opening_mean'].std():.4f}")
        
        # Show distribution of opening sizes
        print("\n   Opening sample size distribution:")
        print(
            f"   • Total games per opening (median): {opening_stats['opening_total_games'].median():.0f}"
        )
        print(
            f"   • Players per opening (median): {opening_stats['opening_num_players'].median():.0f}"
        )
        print(
            f"   • Total games range: [{opening_stats['opening_total_games'].min():.0f}, {opening_stats['opening_total_games'].max():.0f}]"
        )
        print(
            f"   • Players range: [{opening_stats['opening_num_players'].min():.0f}, {opening_stats['opening_num_players'].max():.0f}]"
        )
        
        # Merge opening means back into main dataframe
        data = data.merge(
            opening_stats[["opening_mean"]], left_on="opening_id", right_index=True, how="left"
        )
        
        # LEVEL 2: Shrink player-opening scores toward opening-specific means
        print("\n2️⃣  LEVEL 2: Shrinking player scores toward opening means...")
        print(
            f"   Formula: adjusted_score = (num_games × player_score + {k_player} × opening_mean) / (num_games + {k_player})"
        )
        
        numerator = (data["num_games"] * data["score_original"]) + (
            k_player * data["opening_mean"]
        )
        denominator = data["num_games"] + k_player
        data["score"] = numerator / denominator
        
        print(f"   ✓ Scores adjusted for {len(data):,} entries")
        
        # Calculate confidence weights (will be used in loss function later)
        print("\n3️⃣  Calculating confidence weights...")
        data["confidence"] = data["num_games"] / (data["num_games"] + k_player)
        print("   ✓ Confidence weights calculated")
        print(f"   • Formula: confidence = num_games / (num_games + {k_player})")
        print(
            f"   • Range: [{data['confidence'].min():.4f}, {data['confidence'].max():.4f}]"
        )
        
        # Statistics on the adjustment
        score_diff = data["score"] - data["score_original"]
        print("\n4️⃣  Adjustment statistics:")
        print(f"   • Mean adjustment: {score_diff.mean():.6f}")
        print(f"   • Std adjustment: {score_diff.std():.6f}")
        print(f"   • Max adjustment: {score_diff.max():.6f}")
        print(f"   • Min adjustment: {score_diff.min():.6f}")
        
        # Show distribution of adjustments, binned by num_games quartile
        print("\n   Adjustment by num_games quartiles:")
        quartiles = data["num_games"].quantile([0.25, 0.5, 0.75])
        print(
            f"   • 25th percentile (n={quartiles[0.25]:.0f} games): avg adjustment = {score_diff[data['num_games'] <= quartiles[0.25]].mean():.6f}"
        )
        print(
            f"   • 50th percentile (n={quartiles[0.5]:.0f} games): avg adjustment = {score_diff[(data['num_games'] > quartiles[0.25]) & (data['num_games'] <= quartiles[0.5])].mean():.6f}"
        )
        print(
            f"   • 75th percentile (n={quartiles[0.75]:.0f} games): avg adjustment = {score_diff[(data['num_games'] > quartiles[0.5]) & (data['num_games'] <= quartiles[0.75])].mean():.6f}"
        )
        print(
            f"   • >75th percentile (n>{quartiles[0.75]:.0f} games): avg adjustment = {score_diff[data['num_games'] > quartiles[0.75]].mean():.6f}"
        )
        
        # New score distribution after adjustment
        print("\n5️⃣  Adjusted score statistics:")
        print(f"   • Min: {data['score'].min():.4f}")
        print(f"   • 25th percentile: {data['score'].quantile(0.25):.4f}")
        print(f"   • Median: {data['score'].median():.4f}")
        print(f"   • 75th percentile: {data['score'].quantile(0.75):.4f}")
        print(f"   • Max: {data['score'].max():.4f}")
        print(f"   • Mean: {data['score'].mean():.4f}")
        print(f"   • Std: {data['score'].std():.4f}")
        
        # Detailed sample showing the effect across different game counts
        print("\n6️⃣  Sample comparisons (showing effect of hierarchical shrinkage):")
        
        def print_sample_section(title, sample):
            """Print a banner followed by one comparison line per sampled row."""
            print(f"\n   {'='*120}")
            print(f"   {title}")
            print(f"   {'='*120}")
            for _, row in sample.iterrows():
                adjustment = row["score"] - row["score_original"]
                print(
                    f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} | Games: {row['num_games']:>3} | "
                    f"Opening mean: {row['opening_mean']:.4f} | Original: {row['score_original']:.4f} → Adjusted: {row['score']:.4f} | "
                    f"Diff: {adjustment:>+.4f} | Confidence: {row['confidence']:.3f}"
                )
        
        # Series.between is inclusive on both ends, matching >= and <=
        low_game_rows = data[data["num_games"].between(10, 20)]
        low_game_sample = low_game_rows.sample(min(10, len(low_game_rows)), random_state=42)
        print_sample_section(
            "Low-game entries (10-20 games) - HIGH shrinkage toward opening mean:",
            low_game_sample,
        )
        
        med_game_rows = data[data["num_games"].between(50, 100)]
        med_game_sample = med_game_rows.sample(min(10, len(med_game_rows)), random_state=42)
        print_sample_section(
            "Medium-game entries (50-100 games) - MODERATE shrinkage:", med_game_sample
        )
        
        high_game_rows = data[data["num_games"] >= 200]
        high_game_sample = high_game_rows.sample(min(10, len(high_game_rows)), random_state=42)
        print_sample_section(
            "High-game entries (200+ games) - LOW shrinkage:", high_game_sample
        )
        
        # Show extreme cases - comparing to both opening mean AND global mean
        print("\n7️⃣  Extreme cases (showing why opening-specific shrinkage matters):")
        
        # Find entries where opening mean differs significantly from global mean
        data["opening_deviation_from_global"] = (
            data["opening_mean"] - global_mean_score
        ).abs()
        
        # Rank openings via opening_stats (one row per opening). Ranking the rows
        # of `data` instead can return the same opening several times, because
        # every entry of an opening carries the same opening_mean.
        eco_by_opening = data.drop_duplicates("opening_id").set_index("opening_id")["eco"]
        
        print("\n   Openings with HIGHEST win rates (strong for White):")
        strong_openings = opening_stats.nlargest(5, "opening_mean")
        for opening_id, row in strong_openings.iterrows():
            deviation = row["opening_mean"] - global_mean_score
            print(
                f"   Opening {opening_id:>4} ({eco_by_opening[opening_id]:>3}): mean = {row['opening_mean']:.4f} "
                f"({deviation:+.4f} vs global) | {int(row['opening_num_players'])} player entries"
            )
        
        print("\n   Openings with LOWEST win rates (weak for White):")
        weak_openings = opening_stats.nsmallest(5, "opening_mean")
        for opening_id, row in weak_openings.iterrows():
            deviation = row["opening_mean"] - global_mean_score
            print(
                f"   Opening {opening_id:>4} ({eco_by_opening[opening_id]:>3}): mean = {row['opening_mean']:.4f} "
                f"({deviation:+.4f} vs global) | {int(row['opening_num_players'])} player entries"
            )
        
        # Show specific examples where hierarchical shrinkage made a difference
        print(f"\n8️⃣  Examples showing hierarchical shrinkage benefit:")
        
        # Find entries with strong openings where player did well.
        # Rank one row per opening so we really get the 50 strongest openings.
        strong_opening_ids = (
            data.drop_duplicates("opening_id").nlargest(50, "opening_mean")["opening_id"]
        )
        strong_mask = (
            data["opening_id"].isin(strong_opening_ids)
            & (data["num_games"] <= 20)
            & (data["score_original"] > 0.6)
        )
        strong_examples = data[strong_mask].sample(
            min(3, int(strong_mask.sum())),
            random_state=42,
        )
        
        print(
            f"\n   Strong opening + good player performance (shrunk toward HIGH opening mean):"
        )
        for _, row in strong_examples.iterrows():
            # What the score would have been if shrunk toward the global mean instead
            global_shrink_would_be = (
                (row["num_games"] * row["score_original"]) + (k_player * global_mean_score)
            ) / (row["num_games"] + k_player)
            difference = row["score"] - global_shrink_would_be
            print(
                f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} ({row['eco']:>3}) | Games: {row['num_games']:>2} | "
                f"Opening mean: {row['opening_mean']:.4f} | Original: {row['score_original']:.4f} → {row['score']:.4f}"
            )
            print(
                f"      If we'd shrunk to global mean: {global_shrink_would_be:.4f} (would lose {difference:+.4f} of deserved credit)"
            )
        
        # Find entries with weak openings where player did poorly.
        # Rank one row per opening so we really get the 50 weakest openings.
        weak_opening_ids = (
            data.drop_duplicates("opening_id").nsmallest(50, "opening_mean")["opening_id"]
        )
        weak_mask = (
            data["opening_id"].isin(weak_opening_ids)
            & (data["num_games"] <= 20)
            & (data["score_original"] < 0.45)
        )
        weak_examples = data[weak_mask].sample(
            min(3, int(weak_mask.sum())),
            random_state=42,
        )
        
        print(f"\n   Weak opening + poor player performance (shrunk toward LOW opening mean):")
        for _, row in weak_examples.iterrows():
            # What the score would have been if shrunk toward the global mean instead
            global_shrink_would_be = (
                (row["num_games"] * row["score_original"]) + (k_player * global_mean_score)
            ) / (row["num_games"] + k_player)
            difference = row["score"] - global_shrink_would_be
            print(
                f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} ({row['eco']:>3}) | Games: {row['num_games']:>2} | "
                f"Opening mean: {row['opening_mean']:.4f} | Original: {row['score_original']:.4f} → {row['score']:.4f}"
            )
            print(
                f"      If we'd shrunk to global mean: {global_shrink_would_be:.4f} (would unfairly boost by {-difference:+.4f})"
            )
        
        # Drop temporary columns
        print(f"\n9️⃣  Cleaning up...")
        data = data.drop(
            columns=["score_original", "opening_mean", "opening_deviation_from_global"]
        )
        print(f"   ✓ Removed temporary columns")
        
        print(f"\n" + "=" * 60)
        print("✅ HIERARCHICAL BAYESIAN ADJUSTMENT COMPLETE")
        print("=" * 60)
        print(f"\nFinal data shape: {data.shape}")
        print(f"Columns: {list(data.columns)}")
        print(f"\nNew columns added:")
        print(f"   • 'confidence': weight for loss function (range [0,1])")
        print(f"   • 'score': adjusted using hierarchical Bayesian shrinkage")
        print(f"\nKey improvement over simple shrinkage:")
        print(f"   • Player scores now shrink toward OPENING-SPECIFIC means, not global mean")
        print(f"   • Preserves opening difficulty differences")
        print(f"   • More accurate for both strong and weak openings")
        
        return data
    
    # Configuration for Bayesian shrinkage
    K_PLAYER = 50  # Shrinkage constant for player-opening scores
    
    # Call the function
    clean_data = apply_hierarchical_bayesian_shrinkage(clean_data, k_player=K_PLAYER)


STEP 2B: HIERARCHICAL BAYESIAN SCORE ADJUSTMENT

⚙️  Configuration:
   • K_PLAYER (shrinkage constant): 50
   • Method: Two-level empirical Bayes shrinkage
   • Level 1: Calculate opening-specific means
   • Level 2: Shrink player scores toward opening means

📊 Global statistics:
   • Global mean score: 0.5111
   • Total entries: 2,956,680
   • Unique openings: 2,717

1️⃣  LEVEL 1: Calculating opening-specific means...
   ✓ Calculated means for 2,717 openings

   Opening mean score distribution:
   • Min: 0.1667
   • 25th percentile: 0.4962
   • Median: 0.5163
   • 75th percentile: 0.5365
   • Max: 1.0000
   • Std: 0.0504

   Opening sample size distribution:
   • Total games per opening (median): 4902
   • Players per opening (median): 159
   • Total games range: [10, 5613106]
   • Players range: [1, 43778]

In [5]:
print(clean_data.sample().to_string())

         player_id  opening_id  num_games     score  eco  confidence
1906751      31564        1374         28  0.545827  C00    0.358974
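
A quick worked example of the shrinkage arithmetic, as a sketch rather than the notebook's actual implementation. It assumes the opening-level adjustment has the same form as the `global_shrink_would_be` expression in step 2B, with the opening mean in place of the global mean, and that `confidence` is `num_games / (num_games + k)`. The second assumption is at least consistent with the sample row above (28 games, confidence 0.358974 = 28/78). The opening mean and raw score below are made-up values; the helper names are hypothetical.

```python
K_PLAYER = 50          # shrinkage constant used in step 2B
GLOBAL_MEAN = 0.5111   # global mean score printed by step 2B


def shrink(score, num_games, prior_mean, k=K_PLAYER):
    """Weighted average of the observed score and a prior mean.

    Assumed form: (n * score + k * prior_mean) / (n + k).
    """
    return (num_games * score + k * prior_mean) / (num_games + k)


def confidence(num_games, k=K_PLAYER):
    """Assumed confidence weight: n / (n + k), always in [0, 1)."""
    return num_games / (num_games + k)


# Hypothetical entry: 28 games, raw score 0.60, in an opening averaging 0.54
num_games = 28
raw_score = 0.60
opening_mean = 0.54

to_opening = shrink(raw_score, num_games, opening_mean)
to_global = shrink(raw_score, num_games, GLOBAL_MEAN)
conf = confidence(num_games)

print(f"confidence:               {conf:.6f}")
print(f"shrunk to opening mean:   {to_opening:.4f}")
print(f"shrunk to global mean:    {to_global:.4f}")
print(f"difference (open - glob): {to_opening - to_global:+.4f}")
```

Under these assumptions the adjusted score is the same as `conf * raw_score + (1 - conf) * prior_mean`, so the two targets differ only through the prior: the opening-based value ends up higher here because the hypothetical opening mean sits above the global mean.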


In [6]:
# 2c. Gather player rating statistics (no mutation, just exploration)

print("=" * 60)
print("STEP 2C: PLAYER RATING STATISTICS")
print("=" * 60)

# Connect to database and extract player ratings
con = get_db_connection(str(DB_PATH))

try:
    print(f"\n1️⃣  Extracting player ratings from database...")
    
    # Get unique player IDs from our clean_data
    unique_player_ids = clean_data['player_id'].unique()
    player_ids_str = ','.join(map(str, unique_player_ids))
    
    # Query to get player ratings
    rating_query = f"""
        SELECT 
            id as player_id,
            name,
            title,
            rating
        FROM player
        WHERE id IN ({player_ids_str})
    """
    
    # .df() already returns a pandas DataFrame, so no extra wrapping is needed
    player_ratings = con.execute(rating_query).df()
    print(f"   ✓ Retrieved ratings for {len(player_ratings):,} players")
    
finally:
    con.close()
    print("   ✓ Database connection closed")

# Merge ratings into clean_data for analysis
print(f"\n2️⃣  Merging ratings with clean_data...")
clean_data_with_ratings = clean_data.merge(player_ratings[['player_id', 'rating']], on='player_id', how='left')
print(f"   ✓ Merged successfully")

# Check for missing ratings
num_missing_ratings = clean_data_with_ratings['rating'].isna().sum()
if num_missing_ratings > 0:
    print(f"   ⚠️  {num_missing_ratings:,} entries ({100*num_missing_ratings/len(clean_data_with_ratings):.2f}%) have missing ratings")
else:
    print(f"   ✓ All entries have ratings")

# Basic rating statistics
print(f"\n3️⃣  Basic rating statistics:")
print(f"   • Count: {player_ratings['rating'].notna().sum():,}")
print(f"   • Missing: {player_ratings['rating'].isna().sum():,}")
print(f"   • Min: {player_ratings['rating'].min():.0f}")
print(f"   • Max: {player_ratings['rating'].max():.0f}")
print(f"   • Mean: {player_ratings['rating'].mean():.2f}")
print(f"   • Median: {player_ratings['rating'].median():.0f}")
print(f"   • Std Dev: {player_ratings['rating'].std():.2f}")

# Quartile statistics
print(f"\n4️⃣  Quartile statistics:")
print(f"   • 25th percentile: {player_ratings['rating'].quantile(0.25):.0f}")
print(f"   • 50th percentile (median): {player_ratings['rating'].quantile(0.50):.0f}")
print(f"   • 75th percentile: {player_ratings['rating'].quantile(0.75):.0f}")

# Granular percentile statistics (5% increments)
print(f"\n5️⃣  Detailed percentile distribution (5% increments):")
percentiles = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
               0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00]

print(f"\n   {'Percentile':<12} {'Rating':<10} {'Visual'}")
print(f"   {'-'*12} {'-'*10} {'-'*40}")

for p in percentiles:
    rating_value = player_ratings['rating'].quantile(p)
    # Create a simple bar visualization
    bar_length = int((rating_value - player_ratings['rating'].min()) / 
                     (player_ratings['rating'].max() - player_ratings['rating'].min()) * 40)
    bar = '█' * bar_length
    print(f"   {p*100:>5.0f}%       {rating_value:>7.0f}    {bar}")

# Rating ranges and counts
print(f"\n6️⃣  Rating distribution by range:")
rating_ranges = [
    (0, 1000), (1000, 1200), (1200, 1400), (1400, 1600), 
    (1600, 1800), (1800, 2000), (2000, 2200), (2200, 2400), 
    (2400, 2600), (2600, 3000)
]

print(f"\n   {'Range':<15} {'Count':<10} {'Percentage':<12} {'Visual'}")
print(f"   {'-'*15} {'-'*10} {'-'*12} {'-'*40}")

for low, high in rating_ranges:
    count = len(player_ratings[(player_ratings['rating'] >= low) & (player_ratings['rating'] < high)])
    pct = 100 * count / len(player_ratings)
    bar_length = int(pct * 0.4)  # Scale for visualization
    bar = '█' * bar_length
    print(f"   {low:>4}-{high:<8} {count:>7,}    {pct:>6.2f}%      {bar}")

# Interquartile range
iqr = player_ratings['rating'].quantile(0.75) - player_ratings['rating'].quantile(0.25)
print(f"\n7️⃣  Spread statistics:")
print(f"   • Range: {player_ratings['rating'].max() - player_ratings['rating'].min():.0f}")
print(f"   • Interquartile Range (IQR): {iqr:.0f}")
print(f"   • 10th-90th percentile range: {player_ratings['rating'].quantile(0.90) - player_ratings['rating'].quantile(0.10):.0f}")

# Skewness and kurtosis if available
try:
    from scipy.stats import skew, kurtosis
    skewness = skew(player_ratings['rating'].dropna())
    kurt = kurtosis(player_ratings['rating'].dropna())
    print(f"\n8️⃣  Distribution shape:")
    print(f"   • Skewness: {skewness:.4f} {'(right-skewed)' if skewness > 0 else '(left-skewed)' if skewness < 0 else '(symmetric)'}")
    print(f"   • Kurtosis: {kurt:.4f} {'(heavy-tailed)' if kurt > 0 else '(light-tailed)' if kurt < 0 else '(normal)'}")
except ImportError:
    print(f"\n8️⃣  Distribution shape:")
    print(f"   • scipy not available for skewness/kurtosis calculation")

# Sample of players at different rating levels
print(f"\n9️⃣  Sample players at different rating levels:")
sample_percentiles = [0.1, 0.25, 0.5, 0.75, 0.9]
for p in sample_percentiles:
    rating_threshold = player_ratings['rating'].quantile(p)
    # Get a player near this rating
    sample_player = player_ratings.iloc[(player_ratings['rating'] - rating_threshold).abs().argsort()[:1]]
    print(f"\n   ~{p*100:.0f}th percentile (rating ≈ {rating_threshold:.0f}):")
    for idx, row in sample_player.iterrows():
        # print(f"      Player {row['player_id']}: {row['name']} - Rating: {row['rating']:.0f} {f'({row['title']})' if pd.notna(row['title']) else ''}")
        title_str = f" ({row['title']})" if pd.notna(row['title']) else ""
        print(f"      Player {row['player_id']}: {row['name']} - Rating: {row['rating']:.0f}{title_str}")

print("\n" + "=" * 60)
print("✅ RATING STATISTICS COMPLETE")
print("=" * 60)
print(f"\nKey takeaways:")
print(f"   • Total players: {len(player_ratings):,}")
print(f"   • Rating range: [{player_ratings['rating'].min():.0f}, {player_ratings['rating'].max():.0f}]")
print(f"   • Mean ± std: {player_ratings['rating'].mean():.0f} ± {player_ratings['rating'].std():.0f}")
print(f"   • Median: {player_ratings['rating'].median():.0f}")
print(f"\n   Next steps: Normalize ratings for model input")

STEP 2C: PLAYER RATING STATISTICS

1️⃣  Extracting player ratings from database...
   ✓ Retrieved ratings for 49,467 players
   ✓ Database connection closed

2️⃣  Merging ratings with clean_data...
   ✓ Merged successfully
   ✓ All entries have ratings

3️⃣  Basic rating statistics:
   • Count: 49,467
   • Missing: 0
   • Min: 1200
   • Max: 2823
   • Mean: 1765.38
   • Median: 1762
   • Std Dev: 249.22

4️⃣  Quartile statistics:
   • 25th percentile: 1584
   • 50th percentile (median): 1762
   • 75th percentile: 1937

5️⃣  Detailed percentile distribution (5% increments):

   Percentile   Rating     Visual
   ------------ ---------- ----------------------------------------
       0%          1200    
       5%          1359    ███
      10%          1436    █████
      15%          1494    ███████
      20%          1541    ████████
      25%          1584    █████████
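
Step 2C only reports distribution shape when scipy is importable. As a hedged fallback sketch, the same two numbers can be computed from central moments in plain Python. This uses the biased (population) moment estimators and excess kurtosis, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute with their default arguments. The function name `shape_stats` and the toy ratings are hypothetical, not part of the notebook.

```python
import math


def shape_stats(values):
    """Return (skewness, excess_kurtosis) from biased central moments.

    Missing values (None or NaN) are dropped first, mirroring the dropna()
    call in the scipy branch of step 2C.
    """
    xs = [float(v) for v in values if v is not None and not math.isnan(v)]
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one non-missing value")

    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # variance (population)
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")

    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Made-up ratings with one strong outlier at the top
toy_ratings = [1200, 1450, 1600, 1750, 1800, 1950, 2100, 2800]
skewness, excess_kurtosis = shape_stats(toy_ratings)
print(f"Skewness: {skewness:.4f}")
print(f"Excess kurtosis: {excess_kurtosis:.4f}")
```

In the notebook this could be called with `player_ratings['rating']` inside the `except ImportError` branch, so the step still prints shape statistics without the extra dependency.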
     

In [None]:
# 2d. Normalize player ratings using z-score normalization (for use as side information in MF model)

# TODO we'll need to save the mean and std for use during inference later. Copilot, you'll see later that I forgot this; remind me

# Check if rating_z already exists - if so, skip processing but still show the statistics
if 'rating_z' in clean_data.columns:
    print("=" * 60)
    print("⏭️  SKIPPING STEP 2D: RATING NORMALIZATION")
    print("=" * 60)
    print("\n✓ 'rating_z' column already exists in data")
    print("   This indicates rating normalization has already been applied.")
    print(f"\nCurrent data shape: {clean_data.shape}")
    
    # Still show statistics for review
    print(f"\n📊 Existing normalized rating statistics:")
    print(f"   • Min: {clean_data['rating_z'].min():.4f}")
    print(f"   • Max: {clean_data['rating_z'].max():.4f}")
    print(f"   • Mean: {clean_data['rating_z'].mean():.6f} (should be ~0)")
    print(f"   • Std: {clean_data['rating_z'].std():.6f} (should be ~1)")
    
    # Show sample with original and normalized ratings
    print(f"\n📋 Sample of existing normalized ratings:")
    # Cap the sample size so this also works with fewer than 10 players
    unique_ratings = clean_data[['player_id', 'rating', 'rating_z']].drop_duplicates('player_id')
    sample_data = unique_ratings.sample(min(10, len(unique_ratings)), random_state=42)
    for idx, row in sample_data.iterrows():
        # iterrows() upcasts this all-numeric row to float, so cast the ID back to int
        print(f"   Player {int(row['player_id']):>5} | Original: {row['rating']:>4.0f} → Normalized: {row['rating_z']:>6.3f}")
else:
    # Define the normalization function
    def normalize_player_ratings(data, player_ratings_df):
        """
        Apply z-score normalization to player ratings for use as side information.
        
        Why z-score normalization?
        - Rescales ratings from roughly 1200-2800 to mean 0 / std 1, far closer to the
          scale of embedding initialization (typically N(0, 0.1)) than raw ratings
        - Prevents gradient scale mismatch during training
        - Standard practice for side information in matrix factorization
        - Preserves relative differences between players
        
        Parameters:
        -----------
        data : pd.DataFrame
            Clean data with player_id column
        player_ratings_df : pd.DataFrame
            Player ratings with columns: player_id, rating
            
        Returns:
        --------
        pd.DataFrame
            Data with 'rating' and 'rating_z' columns added
        """
        print("=" * 60)
        print("STEP 2D: NORMALIZE PLAYER RATINGS")
        print("=" * 60)
        
        print(f"\n⚙️  Normalization strategy: Z-score")
        print(f"   • Formula: (rating - mean) / std")
        print(f"   • Purpose: Scale ratings for use as side information in MF model")
        print(f"   • Benefits: Prevents gradient dominance, matches embedding scale")
        
        # Calculate normalization parameters from player_ratings
        RATING_MEAN = player_ratings_df['rating'].mean()
        RATING_STD = player_ratings_df['rating'].std()
        
        print(f"\n1️⃣  Normalization parameters (calculated from {len(player_ratings_df):,} players):")
        print(f"   • Mean: {RATING_MEAN:.2f}")
        print(f"   • Std Dev: {RATING_STD:.2f}")
        
        # Apply z-score normalization to player_ratings
        player_ratings_df = player_ratings_df.copy()
        player_ratings_df['rating_z'] = (player_ratings_df['rating'] - RATING_MEAN) / RATING_STD
        
        print(f"\n2️⃣  Normalized rating statistics:")
        print(f"   • Min: {player_ratings_df['rating_z'].min():.4f}")
        print(f"   • Max: {player_ratings_df['rating_z'].max():.4f}")
        print(f"   • Mean: {player_ratings_df['rating_z'].mean():.6f} (should be ~0)")
        print(f"   • Std: {player_ratings_df['rating_z'].std():.6f} (should be ~1)")
        print(f"   • Range: [{player_ratings_df['rating_z'].min():.2f}, {player_ratings_df['rating_z'].max():.2f}]")
        
        print(f"\n3️⃣  Sample normalized ratings across skill levels:")
        sample_percentiles = [0.1, 0.25, 0.5, 0.75, 0.9]
        for p in sample_percentiles:
            rating_threshold = player_ratings_df['rating'].quantile(p)
            sample_player = player_ratings_df.iloc[(player_ratings_df['rating'] - rating_threshold).abs().argsort()[:1]]
            for idx, row in sample_player.iterrows():
                print(f"   ~{p*100:.0f}th percentile: {row['name']:<20} | "
                      f"Original: {row['rating']:>4.0f} → Normalized: {row['rating_z']:>6.3f}")
        
        print(f"\n4️⃣  Interpretation guide:")
        print(f"   • rating_z ≈ {(1200 - RATING_MEAN)/RATING_STD:.1f}: 1200 player (minimum)")
        print(f"   • rating_z ≈ {(player_ratings_df['rating'].quantile(0.25) - RATING_MEAN)/RATING_STD:.1f}: {player_ratings_df['rating'].quantile(0.25):.0f} player (25th percentile)")
        print(f"   • rating_z ≈  0.0: {RATING_MEAN:.0f} player (mean)")
        print(f"   • rating_z ≈ {(player_ratings_df['rating'].quantile(0.75) - RATING_MEAN)/RATING_STD:.1f}: {player_ratings_df['rating'].quantile(0.75):.0f} player (75th percentile)")
        print(f"   • rating_z ≈ {(player_ratings_df['rating'].max() - RATING_MEAN)/RATING_STD:.1f}: {player_ratings_df['rating'].max():.0f} player (maximum)")
        
        # Merge rating and rating_z into the main data
        print(f"\n5️⃣  Merging normalized ratings into clean_data...")
        data = data.merge(
            player_ratings_df[['player_id', 'rating', 'rating_z']], 
            on='player_id', 
            how='left'
        )
        print(f"   ✓ Merged successfully")
        
        # Verify no missing values
        missing_ratings = data['rating_z'].isna().sum()
        if missing_ratings > 0:
            print(f"   ⚠️  Warning: {missing_ratings} entries have missing rating_z values")
        else:
            print(f"   ✓ All entries have rating_z values")
        
        # Show sample of merged data
        print(f"\n6️⃣  Sample of data with normalized ratings (10 random entries):")
        sample_data = data.sample(min(10, len(data)), random_state=42)
        for idx, row in sample_data.iterrows():
            print(f"   Player {row['player_id']:>5} | Opening {row['opening_id']:>4} | "
                  f"Rating: {row['rating']:>4.0f} → Z-score: {row['rating_z']:>6.3f} | "
                  f"Score: {row['score']:.4f} | Games: {row['num_games']:>3}")
        
        print("\n" + "=" * 60)
        print("✅ RATING NORMALIZATION COMPLETE")
        print("=" * 60)
        print(f"\nFinal data shape: {data.shape}")
        print(f"New columns added:")
        print(f"   • 'rating': original player rating (side information)")
        print(f"   • 'rating_z': z-score normalized rating (for model input)")
        
        print(f"\n⚠️  CRITICAL: Save these parameters for inference!")
        print(f"   RATING_MEAN = {RATING_MEAN:.2f}")
        print(f"   RATING_STD = {RATING_STD:.2f}")
        print(f"\n   You'll need them to normalize ratings for new users at inference time.")
        
        return data, RATING_MEAN, RATING_STD
    
    # Call the function
    clean_data, RATING_MEAN, RATING_STD = normalize_player_ratings(clean_data, player_ratings)
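
On the TODO about keeping the normalization parameters: one possible approach (an assumption, not what the notebook currently does) is to write them to a small JSON file next to the model checkpoint and reload them at inference. The file name `rating_norm.json` and the helper names are hypothetical; the mean and std are the rounded values printed by step 2D.

```python
import json
import tempfile
from pathlib import Path


def save_rating_norm(path, mean, std):
    """Write the z-score parameters to a JSON file."""
    payload = {"rating_mean": float(mean), "rating_std": float(std)}
    Path(path).write_text(json.dumps(payload))


def load_rating_norm(path):
    """Read the z-score parameters back as a (mean, std) tuple."""
    params = json.loads(Path(path).read_text())
    return params["rating_mean"], params["rating_std"]


def normalize_rating(rating, mean, std):
    """Same formula as step 2D: (rating - mean) / std."""
    return (rating - mean) / std


# Rounded values as printed by step 2D. In the notebook these would be the
# RATING_MEAN and RATING_STD returned by normalize_player_ratings().
RATING_MEAN, RATING_STD = 1765.38, 249.22

# A temporary directory keeps the example from leaving files behind;
# a real run would use a stable location instead.
with tempfile.TemporaryDirectory() as tmp:
    norm_path = Path(tmp) / "rating_norm.json"  # hypothetical file name
    save_rating_norm(norm_path, RATING_MEAN, RATING_STD)
    mean, std = load_rating_norm(norm_path)

# Normalize a new user's rating at inference time with the training parameters
z_min = normalize_rating(1200, mean, std)
print(f"rating 1200 -> rating_z {z_min:.4f}")
```

The value for a 1200-rated player should come out close to the minimum `rating_z` reported by step 2D, which is a cheap check that the reloaded parameters match the ones used in training.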


STEP 2D: NORMALIZE PLAYER RATINGS

⚙️  Normalization strategy: Z-score
   • Formula: (rating - mean) / std
   • Purpose: Scale ratings for use as side information in MF model
   • Benefits: Prevents gradient dominance, matches embedding scale

1️⃣  Normalization parameters (calculated from 49,467 players):
   • Mean: 1765.38
   • Std Dev: 249.22

2️⃣  Normalized rating statistics:
   • Min: -2.2686
   • Max: 4.2437
   • Mean: -0.000000 (should be ~0)
   • Std: 1.000000 (should be ~1)
   • Range: [-2.27, 4.24]

3️⃣  Sample normalized ratings across skill levels:
   ~10th percentile: no_cry               | Original: 1436 → Normalized: -1.322
   ~25th percentile: Bala_Nuthulapati     | Original: 1584 → Normalized: -0.728
   ~50th percentile: aj2345               | Original: 1762 → Normalized: -0.014
   ~75th percentile: AAJ_88               | Original: 1937 → Normalized:  0.689
   ~90th percentile: Finnja               | Original: 2089 → Norma