# Notebook 26 - Opening Recommender Model: Training Pipeline

### 0. Overview and Goals

This notebook defines the full pipeline for training the chess opening recommender model.  
The objective is to predict **player-opening performance scores**, defined as (wins + 0.5 × draws) / number of games, for openings a player hasn't yet played, based on their results in the openings they *have* played.  

The model will use **matrix factorization** with **stochastic gradient descent (SGD)** to learn latent factors representing player and opening characteristics.  
All computations will be implemented in **PyTorch**, with data loaded from my local **DuckDB** database.

**High-level specs:**
- Use only *White* openings initially (we'll extend to Black later).  
- Data source: processed player-opening stats from local DuckDB.  
- Predict: normalized "score" = draw-adjusted win rate, (wins + 0.5 × draws) / total games.  
- Filter: only include entries with ≥ `MIN_GAMES_THRESHOLD` games (default = 50).  
- Ignore: rating differences, time controls, and other metadata.  
- Model parameters (to be defined in appropriate places for easy editing):  
  - `NUM_FACTORS`, `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`, `NUM_PLAYERS_TO_PROCESS`  
- Logging and checkpoints throughout for reproducibility.  
- All random operations seeded for deterministic runs.  
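
The score definition above can be written as a small helper. This is only a sketch: the function name `performance_score` is hypothetical, and it assumes raw win, draw, and loss counts are available for a player-opening pair.

```python
def performance_score(wins: int, draws: int, losses: int) -> float:
    """Return (wins + 0.5 * draws) / total games for one player-opening pair."""
    total = wins + draws + losses
    if total == 0:
        # The score is undefined when no games were played.
        raise ValueError("score is undefined for zero games")
    return (wins + 0.5 * draws) / total


# 6 wins, 2 draws, 2 losses: (6 + 1) / 10
print(performance_score(6, 2, 2))  # 0.7
```

A draw counts as half a win, so a player who draws every game scores exactly 0.5.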

---

### 1. Data Extraction
- Connect to local DuckDB
- Pull all processed player-opening statistics.
- Verify schema consistency:  
  - Required columns: `player_id`, `opening_id`, `eco`, `num_games`, `wins`, `draws`, `losses`.  
- Include a row-count sanity check.

---

### 2. Data Sanitization & Normalization
- Optionally normalize scores if needed for MF convergence.  
- Drop players with no qualifying openings and openings with no qualifying players.  
  - I believe there shouldn't be any, but we'll double-check.
- Resequence `player_id` and `opening_id` to be sequential integers. Right now there are gaps because of entries we deleted from the DB.
- Consider including `eco` as an enumerated categorical variable.  
- Check for sparsity consistency (no implicit zeros yet).  
- Note that this data has already been split into White and Black games further up the pipeline.
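
One possible way to resequence the IDs is `pandas.factorize`, which maps each distinct value to a zero-based code. The sketch below uses a toy frame, and the output column names `player_idx` and `opening_idx` are assumptions, not names from the pipeline.

```python
import pandas as pd

# Toy data with gaps in both ID columns.
df = pd.DataFrame({
    "player_id": [1, 1, 7, 50000],
    "opening_id": [2, 3589, 2, 14],
})

# factorize returns (codes, uniques); codes run 0..n-1 in order of first appearance.
player_codes, player_index = pd.factorize(df["player_id"])
opening_codes, opening_index = pd.factorize(df["opening_id"])

df["player_idx"] = player_codes
df["opening_idx"] = opening_codes

# Keep the uniques: player_index[i] is the original player_id for row index i
# of the embedding table, which is needed to map predictions back to the database.
print(df)
```

The contiguous codes can be used directly as row indices into an embedding table, while the saved `player_index` and `opening_index` recover the original database IDs.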

### Data Quality
- Drop entries with fewer than `MIN_GAMES_THRESHOLD` games
- Handle any duplicate `(player_id, opening_id)` combinations
- Remove players with no qualifying openings
- Remove openings with no qualifying players
- Verify no null values remain
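
A rough sketch of these checks in pandas follows. The notes do not say how duplicates should be handled, so merging them (summing games and taking the games-weighted mean score) is an assumption, as is the helper name `clean_stats`.

```python
import pandas as pd


def clean_stats(df: pd.DataFrame, min_games: int) -> pd.DataFrame:
    """Merge duplicate pairs, apply the games threshold, and check for nulls."""
    # Assumption: duplicates are merged by summing games and points.
    df = df.assign(points=df["score"] * df["num_games"])
    merged = df.groupby(["player_id", "opening_id"], as_index=False).agg(
        num_games=("num_games", "sum"),
        points=("points", "sum"),
        eco=("eco", "first"),
    )
    merged["score"] = merged["points"] / merged["num_games"]
    merged = merged.drop(columns="points")

    # In this long format, dropping rows also drops any player or opening
    # that has no qualifying entries left.
    kept = merged[merged["num_games"] >= min_games].reset_index(drop=True)

    if kept.isnull().any().any():
        raise ValueError("null values remain after cleaning")
    return kept


toy = pd.DataFrame({
    "player_id": [1, 1, 2, 3],
    "opening_id": [10, 10, 10, 11],
    "num_games": [6, 6, 3, 20],
    "score": [0.5, 1.0, 0.5, 0.4],
    "eco": ["B20", "B20", "B20", "C50"],
})
cleaned = clean_stats(toy, min_games=10)
print(cleaned)
```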

### ECO Codes
- Keep ECO codes for later categorical encoding (Step 4)
- ECO will be used as opening side information (similar to rating for players)

### Confidence Weighting
- Use `MIN_GAMES_THRESHOLD = 10` (rather than the default of 50 noted in the overview) to keep more data
- Add a **confidence weight** column: `confidence = num_games / (num_games + K)` where K ≈ 50
- This weight will be used in the loss function to down-weight entries backed by few games
- High-game-count entries → high confidence → larger loss impact
- Low-game-count entries → low confidence → smaller loss impact
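
To get a feel for how the weight behaves, here is a small sketch of the formula with K = 50. The function name `confidence_weight` is hypothetical.

```python
def confidence_weight(num_games: int, k: float = 50.0) -> float:
    """confidence = num_games / (num_games + K); approaches 1 as num_games grows."""
    return num_games / (num_games + k)


for n in (10, 50, 500):
    print(n, round(confidence_weight(n), 3))
# 10 0.167
# 50 0.5
# 500 0.909
```

An entry with exactly K games gets weight 0.5, so K acts as the game count at which an observation is trusted half as much as a fully reliable one.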

### Player Rating (Side Information)
- **Player ratings are side information** - they describe player characteristics, not individual player-opening interactions
- Ratings will be stored separately and joined to player embeddings during training
- We'll **normalize ratings** (likely z-score normalization) to avoid scaling issues with the embedding layer
- Rating normalization will be done once after extraction, not per-row
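
A minimal sketch of the z-score step, assuming a separate per-player table with a `rating` column (both the table layout and the column names are assumptions; the extraction query in this notebook does not pull ratings):

```python
import pandas as pd

# Hypothetical per-player table, one row per player.
players = pd.DataFrame({
    "player_id": [1, 2, 3],
    "rating": [1500.0, 1800.0, 2100.0],
})

# Compute the statistics once over all players, then reuse them everywhere.
rating_mean = players["rating"].mean()
rating_std = players["rating"].std(ddof=0)  # population standard deviation

players["rating_z"] = (players["rating"] - rating_mean) / rating_std
print(players)
```

Storing `rating_mean` and `rating_std` alongside the model would let new players be normalized with the same statistics at inference time.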

---

### 3. Data Splits
- Split into train/test/val sets.  
- Ensure every player and every opening appears at least once in the training data.  
- Strategy:  
  - Sample unique players and openings to guarantee coverage in train.  
  - Remaining data → stratified random split into train/test.  
  - Deduplicate and merge unique IDs back into train if needed.
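
The coverage guarantee could be sketched as follows. This version pins one row per player and one per opening to train, then draws the test rows at random from what is left. It uses a plain random draw rather than a stratified one and produces only train and test, so treat it as a starting point; the function name `coverage_split` is hypothetical, and it assumes the frame has a unique index.

```python
import numpy as np
import pandas as pd


def coverage_split(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 42):
    """Split rows so every player_id and opening_id appears in train."""
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))]

    # The first shuffled row for each player and each opening is pinned to train.
    pinned = set(shuffled.drop_duplicates("player_id").index)
    pinned |= set(shuffled.drop_duplicates("opening_id").index)

    # Test rows come only from the unpinned remainder.
    rest = shuffled.loc[~shuffled.index.isin(pinned)]
    n_test = min(int(round(test_frac * len(df))), len(rest))
    test = rest.iloc[:n_test]
    train = df.drop(index=test.index)
    return train, test


toy = pd.DataFrame(
    [(p, o) for p in range(5) for o in range(4)],
    columns=["player_id", "opening_id"],
)
train, test = coverage_split(toy, test_frac=0.2, seed=1)
print(len(train), len(test))
```

A validation set could be carved out of the unpinned remainder in the same way.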

---

### 4. Enumerate Categorical Variables
- Enumerate `eco` (if included) as an integer categorical variable.  
- Confirm all columns are numeric and compatible with PyTorch tensors.  
- Verify no missing or out-of-range IDs.
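
One way to enumerate `eco` is through `pandas.Categorical`, whose integer codes can be passed to an embedding layer later. The column name `eco_idx` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"eco": ["B20", "C50", "B20", "A00"]})

# Categories are sorted, so the codes here are A00 -> 0, B20 -> 1, C50 -> 2.
eco_cat = pd.Categorical(df["eco"])
df["eco_idx"] = eco_cat.codes

# A code of -1 would mean a missing ECO value, so check that none slipped through.
if (df["eco_idx"] < 0).any():
    raise ValueError("found rows with a missing ECO code")

n_eco = len(eco_cat.categories)
print(df)
print("distinct ECO codes:", n_eco)
```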

---

### 5. Training Data Structure
- Each row: one `(player_id, opening_id, score)` record.
- Include other fields as well (`eco`, `num_games`, etc.).
- Convert DataFrame to PyTorch tensors (`torch.long` for IDs, `torch.float` for scores).  
- Log dataset shapes and sparsity metrics.
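
A sketch of the tensor conversion, assuming the IDs have already been resequenced into zero-based `player_idx` and `opening_idx` columns (hypothetical names):

```python
import pandas as pd
import torch

# Toy frame standing in for the cleaned training data.
df = pd.DataFrame({
    "player_idx": [0, 0, 1, 2],
    "opening_idx": [0, 1, 0, 2],
    "score": [0.55, 0.40, 0.62, 0.50],
    "num_games": [120, 15, 48, 73],
})

# Embedding lookups require integer (long) indices; targets are floats.
player_t = torch.tensor(df["player_idx"].to_numpy(), dtype=torch.long)
opening_t = torch.tensor(df["opening_idx"].to_numpy(), dtype=torch.long)
score_t = torch.tensor(df["score"].to_numpy(), dtype=torch.float)
games_t = torch.tensor(df["num_games"].to_numpy(), dtype=torch.float)

# Density = observed pairs / all possible player-opening pairs.
n_players = int(player_t.max()) + 1
n_openings = int(opening_t.max()) + 1
density = len(df) / (n_players * n_openings)
print(f"rows={len(df)}, players={n_players}, openings={n_openings}, density={density:.3f}")
```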

---

### 6. Training Setup
Define constants:
- `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`, `NUM_FACTORS`  
- Loss functions: MSE and RMSE  
- Activation: sigmoid or none (depending on score normalization)  
- Optimizer: SGD  
- Figure out if there's anything else we need to design or specify

Implement helper functions:
- `train_one_epoch()`
- `evaluate_model()`
- `calculate_rmse()`
- `save_checkpoint()`  

Ensure detailed logging, ETA reporting, and reproducible random seeds.
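
To make the setup concrete, here is one possible shape for the model and loss. It is a sketch under several assumptions: the class name `MatrixFactorization`, the per-player and per-opening bias terms, the 0.05 initialization scale, and the sigmoid output are all choices made for illustration, and `weighted_mse` simply applies the confidence weights described earlier.

```python
import torch
from torch import nn


class MatrixFactorization(nn.Module):
    """Predicts a score in (0, 1) from player and opening latent factors."""

    def __init__(self, n_players: int, n_openings: int, num_factors: int):
        super().__init__()
        self.player_factors = nn.Embedding(n_players, num_factors)
        self.opening_factors = nn.Embedding(n_openings, num_factors)
        # Bias terms are an addition for this sketch, not part of the plan above.
        self.player_bias = nn.Embedding(n_players, 1)
        self.opening_bias = nn.Embedding(n_openings, 1)
        for emb in (self.player_factors, self.opening_factors):
            nn.init.normal_(emb.weight, std=0.05)
        for emb in (self.player_bias, self.opening_bias):
            nn.init.zeros_(emb.weight)

    def forward(self, player_idx: torch.Tensor, opening_idx: torch.Tensor) -> torch.Tensor:
        dot = (self.player_factors(player_idx) * self.opening_factors(opening_idx)).sum(dim=1)
        logits = dot + self.player_bias(player_idx).squeeze(1) + self.opening_bias(opening_idx).squeeze(1)
        # Sigmoid keeps predictions inside the valid score range.
        return torch.sigmoid(logits)


def weighted_mse(pred: torch.Tensor, target: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Mean squared error where each row counts in proportion to its weight."""
    return (weight * (pred - target) ** 2).sum() / weight.sum()
```

With all weights equal, `weighted_mse` reduces to the ordinary MSE, so the confidence weighting can be switched off by passing a tensor of ones.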

---

### 7. Training Loop
- Initialize player and opening embeddings.  
- Iterate through epochs with mini-batch SGD (`BATCH_SIZE = 1024`).  
- Compute and log MSE/RMSE per epoch.  
- Save model checkpoints locally after each epoch.
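
The loop might look roughly like the sketch below. It runs on synthetic data with a deliberately tiny model so that it is self-contained; `TinyMF`, the checkpoint file naming, and all sizes and hyperparameter values here are placeholders rather than the notebook's settings.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


class TinyMF(nn.Module):
    """Stand-in model: sigmoid of the player-opening dot product."""

    def __init__(self, n_players, n_openings, num_factors):
        super().__init__()
        self.p = nn.Embedding(n_players, num_factors)
        self.o = nn.Embedding(n_openings, num_factors)

    def forward(self, pi, oi):
        return torch.sigmoid((self.p(pi) * self.o(oi)).sum(dim=1))


def train_one_epoch(model, loader, optimizer):
    """Run one pass over the loader and return the mean training MSE."""
    model.train()
    total, n = 0.0, 0
    for pi, oi, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(pi, oi), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * len(y)
        n += len(y)
    return total / n


def save_checkpoint(model, optimizer, epoch, directory):
    """Write model and optimizer state for one epoch; returns the file path."""
    path = Path(directory) / f"checkpoint_epoch_{epoch:03d}.pt"
    torch.save(
        {"epoch": epoch, "model_state": model.state_dict(), "optimizer_state": optimizer.state_dict()},
        path,
    )
    return path


# Synthetic stand-in data: 512 random (player, opening, score) rows.
players = torch.randint(0, 20, (512,))
openings = torch.randint(0, 8, (512,))
scores = torch.rand(512)
loader = DataLoader(TensorDataset(players, openings, scores), batch_size=64, shuffle=True)

model = TinyMF(20, 8, num_factors=4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

history = []
with tempfile.TemporaryDirectory() as tmp:
    for epoch in range(1, 4):
        mse = train_one_epoch(model, loader, optimizer)
        path = save_checkpoint(model, optimizer, epoch, tmp)
        history.append(mse)
        print(f"epoch {epoch}: train MSE={mse:.4f}, RMSE={mse ** 0.5:.4f}, saved {path.name}")
```

In the real pipeline the temporary directory would be replaced by a persistent checkpoint folder.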

---

### 8. Evaluation
- Evaluate on test set.  
- Report MSE, RMSE, and visual diagnostics (predicted vs actual score).  
- Inspect a few player and opening latent factors for sanity.
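
The two reported metrics can be computed with a few lines of NumPy. This is a sketch of what `calculate_rmse()` might look like; the actual helper may take tensors instead of arrays.

```python
import numpy as np


def calculate_mse(pred, actual) -> float:
    """Mean squared error between predicted and actual scores."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((pred - actual) ** 2))


def calculate_rmse(pred, actual) -> float:
    """Root mean squared error, in the same units as the score itself."""
    return float(np.sqrt(calculate_mse(pred, actual)))


# Two predictions, each off by 0.1.
print(calculate_rmse([0.5, 0.5], [0.4, 0.6]))
```

RMSE is easier to read than MSE here because it is on the same 0 to 1 scale as the score.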

---

### 9. Cross-Validation & Hyperparameter Tuning
- Define ranges for:  
  - `NUM_FACTORS`, `LEARNING_RATE`, `BATCH_SIZE`, `N_EPOCHS`  
- Perform small-scale grid or random search for best configuration.  
- Compare validation RMSE across runs.
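
A small search harness could be sketched like this. The value ranges are placeholders, and `validate` is a hypothetical callable that would train a model with the given configuration and return its validation RMSE.

```python
import itertools
import random

# Placeholder ranges; the real ones are still to be defined.
search_space = {
    "NUM_FACTORS": [16, 32, 64],
    "LEARNING_RATE": [0.01, 0.05],
    "BATCH_SIZE": [512, 1024],
    "N_EPOCHS": [10, 20],
}


def grid(space):
    """Yield every combination in the search space as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_search(space, n_trials, seed=0):
    """Sample n_trials configurations with a seeded generator."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]


def pick_best(configs, validate):
    """Return (rmse, config) for the configuration with the lowest validation RMSE."""
    results = [(validate(config), config) for config in configs]
    return min(results, key=lambda r: r[0])


print(sum(1 for _ in grid(search_space)), "grid configurations")
```

A full grid over these ranges is 24 training runs, so random search with a fixed seed may be the cheaper first pass.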

---

### 10. Next Steps
- Extend model to include Black openings.  
- Experiment with hybrid inputs (player rating, ECO grouping).  
- Consider implicit feedback handling (unplayed openings as zeros).  
- Integrate trained model into API for recommendation output.

---

**Notes:**  
- Every random seed and parameter definition will be explicit.  
- Every major step includes row-count, schema, and type validation.  
- Model artifacts and logs will be saved locally for reproducibility.
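
For the seeding note, a helper along these lines is a common pattern. The name `seed_everything` and the seed value are assumptions.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the generators on all devices


seed_everything(42)
first = torch.rand(3)
seed_everything(42)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Seeding alone does not guarantee bit-identical results on every backend, since some GPU kernels are non-deterministic, but it makes CPU runs repeatable.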


## Step 1: Data Extraction

Connect to DuckDB and extract all player-opening statistics.
Verify schema and perform sanity checks.

In [3]:
# Setup and imports
from pathlib import Path
import pandas as pd
import sys

# Add utils to path
sys.path.append(str(Path.cwd() / 'utils'))
from database.db_utils import get_db_connection

# Configuration
DB_PATH = Path.cwd().parent / "data" / "processed" / "chess_games.db"
COLOR_FILTER = 'w'  # 'w' for white, 'b' for black

print("=" * 60)
print("STEP 1: DATA EXTRACTION")
print("=" * 60)
print(f"\n📁 Database: {DB_PATH}")
print(f"📁 Database exists: {DB_PATH.exists()}")
print(f"🎨 Color filter: {'White' if COLOR_FILTER == 'w' else 'Black'}")

if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found at {DB_PATH}")

STEP 1: DATA EXTRACTION

📁 Database: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
📁 Database exists: True
🎨 Color filter: White


In [6]:
# Connect to DuckDB and extract player-opening statistics
con = get_db_connection(str(DB_PATH))

try:
    print(f"\n1️⃣  Extracting player-opening statistics (color: '{COLOR_FILTER}')...")
    
    # Extract stats with calculated score and num_games
    # Filter by color and calculate score in the database
    query = f"""
        SELECT 
            pos.player_id,
            pos.opening_id,
            pos.num_wins + pos.num_draws + pos.num_losses as num_games,
            (pos.num_wins + (pos.num_draws * 0.5)) / 
                NULLIF(pos.num_wins + pos.num_draws + pos.num_losses, 0) as score,
            o.eco
        FROM player_opening_stats pos
        JOIN opening o ON pos.opening_id = o.id
        WHERE pos.color = '{COLOR_FILTER}'
        ORDER BY pos.player_id, pos.opening_id
    """
    
    raw_data = con.execute(query).df()
    
    print(f"   ✓ Extracted {len(raw_data):,} rows")
    
    # Schema verification
    print("\n2️⃣  Verifying schema...")
    required_columns = ['player_id', 'opening_id', 'num_games', 'score', 'eco']
    
    for col in required_columns:
        if col not in raw_data.columns:
            raise ValueError(f"Missing required column: {col}")
    
    print(f"   ✓ All required columns present: {required_columns}")
    
    # Data types verification
    print("\n3️⃣  Checking data types...")
    print(f"   • player_id: {raw_data['player_id'].dtype}")
    print(f"   • opening_id: {raw_data['opening_id'].dtype}")
    print(f"   • num_games: {raw_data['num_games'].dtype}")
    print(f"   • score: {raw_data['score'].dtype}")
    print(f"   • eco: {raw_data['eco'].dtype}")
    
    # Basic statistics
    print("\n4️⃣  Data statistics...")
    print(f"   • Total rows: {len(raw_data):,}")
    print(f"   • Unique players: {raw_data['player_id'].nunique():,}")
    print(f"   • Unique openings: {raw_data['opening_id'].nunique():,}")
    print(f"   • Total games (sum): {raw_data['num_games'].sum():,}")
    
    # Player ID range
    print("\n   Player ID range:")
    print(f"   • Min: {raw_data['player_id'].min()}")
    print(f"   • Max: {raw_data['player_id'].max()}")
    
    # Opening ID range
    print("\n   Opening ID range:")
    print(f"   • Min: {raw_data['opening_id'].min()}")
    print(f"   • Max: {raw_data['opening_id'].max()}")
    
    # Games per entry statistics
    print("\n   Games per entry:")
    print(f"   • Min: {raw_data['num_games'].min()}")
    print(f"   • Max: {raw_data['num_games'].max()}")
    print(f"   • Mean: {raw_data['num_games'].mean():.1f}")
    print(f"   • Median: {raw_data['num_games'].median():.0f}")
    
    # Score statistics
    print("\n   Score distribution:")
    print(f"   • Min: {raw_data['score'].min():.4f}")
    print(f"   • Max: {raw_data['score'].max():.4f}")
    print(f"   • Mean: {raw_data['score'].mean():.4f}")
    print(f"   • Median: {raw_data['score'].median():.4f}")
    
    # Check for null values
    print("\n5️⃣  Checking for null values...")
    null_counts = raw_data.isnull().sum()
    if null_counts.sum() == 0:
        print("   ✓ No null values found")
    else:
        print("   ⚠️  Found null values:")
        for col, count in null_counts[null_counts > 0].items():
            print(f"      • {col}: {count} nulls")
    
    # Sample data
    print("\n6️⃣  Sample of extracted data (first 10 rows):")
    print(raw_data.head(10).to_string())
    
    print("\n" + "=" * 60)
    print("✅ DATA EXTRACTION COMPLETE")
    print("=" * 60)
    print(f"\nData shape: {raw_data.shape}")
    print(f"Columns: {list(raw_data.columns)}")
    
finally:
    con.close()
    print("\n✓ Database connection closed")


1️⃣  Extracting player-opening statistics (color: 'w')...


   ✓ Extracted 11,877,700 rows

2️⃣  Verifying schema...
   ✓ All required columns present: ['player_id', 'opening_id', 'num_games', 'score', 'eco']

3️⃣  Checking data types...
   • player_id: int32
   • opening_id: int32
   • num_games: int32
   • score: float64
   • eco: object

4️⃣  Data statistics...
   • Total rows: 11,877,700
   • Unique players: 49,906
   • Unique openings: 2,991
   • Total games (sum): 235,152,459

   Player ID range:
   • Min: 1
   • Max: 50000

   Opening ID range:
   • Min: 2
   • Max: 3589

   Games per entry:
   • Min: 1
   • Max: 13462
   • Mean: 19.8
   • Median: 3

   Score distribution:
   • Min: 0.0000
   • Max: 1.0000
   • Mean: 0.5006
   • Median: 0.5000

5️⃣  Checking for null values...
   ✓ No null values found

6️⃣  Sample