# Optimized Chess Data Processing

This notebook implements efficient strategies for processing a large PGN database, focusing on:

1. **Early filtering** - Filter games during reading to avoid loading unnecessary data
2. **Incremental processing** - Process in chunks and save progress 
3. **Parallelization** - Use multiple CPU cores where beneficial
4. **Type safety** - Strong typing for better IDE support and code quality

In [1]:
# Import necessary libraries
from typing import TypedDict, Optional, Dict, List, Set, Iterator, Tuple, Any, Literal, Union
import chess.pgn
import zstandard as zstd
import io
import time
import json
import os
import multiprocessing
from pathlib import Path
import pickle
from dataclasses import dataclass, field

## Type Definitions

First, let's define strong types for our data structures to ensure type safety and better IDE support.

In [2]:
# Define types for game results
GameResult = Literal["1-0", "0-1", "1/2-1/2", "*"]

# Time control categories
TimeControlCategory = Literal["bullet", "blitz", "rapid", "classical", "correspondence", "unknown"]

class GameHeaders(TypedDict, total=False):
    """Type for game headers extracted from PGN."""
    Event: str
    Site: str
    Date: str
    White: str
    Black: str
    Result: GameResult
    WhiteElo: str
    BlackElo: str
    ECO: str
    Opening: str
    TimeControl: str
    Termination: str
    WhiteRatingDiff: str
    BlackRatingDiff: str

class OpeningResults(TypedDict):
    """Statistics for a player's results with a particular opening."""
    opening_name: str
    results: Dict[str, Union[int, float]]

# Player statistics structure
class PlayerStats(TypedDict):
    """Statistics for an individual player."""
    rating: int
    white_games: Dict[str, OpeningResults]  # ECO code -> results
    black_games: Dict[str, OpeningResults]  # ECO code -> results
    num_games_total: int

# Configuration parameters with defaults
@dataclass
class ProcessingConfig:
    """Configuration for the game processing pipeline."""
    # Filtering parameters
    min_elo: int = 1500  # Minimum player Elo to include
    exclude_time_controls: Set[str] = field(default_factory=lambda: {"bullet", "hyperbullet", "ultrabullet"})
    min_moves: int = 10  # Minimum number of moves in game (not yet enforced by should_include_game)
    
    # Processing parameters
    chunk_size: int = 100_000  # Number of games to process in each chunk
    max_chunks: Optional[int] = None  # Maximum number of chunks to process (None for all)
    save_interval: int = 1  # Save after processing this many chunks
    
    # File paths
    save_dir: str = "../data/processed"
    player_data_file: str = "player_stats.pkl"  # Written with pickle by save_progress, so not a JSON file
    progress_file: str = "processing_progress.json"
    
    # Parallelization
    use_parallel: bool = True
    num_processes: int = max(1, multiprocessing.cpu_count() - 1)  # Use all but one CPU core, never fewer than one
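
Because `exclude_time_controls` is a `set`, a configuration like this cannot be passed to `json.dump` as-is. The following is a minimal sketch of one way to make it JSON-safe. `DemoConfig` is a cut-down, hypothetical stand-in for `ProcessingConfig`, and `config_to_jsonable` is an illustrative helper name, not part of the notebook.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Set


@dataclass
class DemoConfig:
    """Cut-down, hypothetical stand-in for ProcessingConfig."""
    min_elo: int = 1500
    exclude_time_controls: Set[str] = field(
        default_factory=lambda: {"bullet", "ultrabullet"}
    )


def config_to_jsonable(config: Any) -> Dict[str, Any]:
    """Convert a dataclass instance into a dict that json.dumps accepts."""
    # Sets are not JSON-serializable, so emit them as sorted lists.
    return {
        key: sorted(value) if isinstance(value, set) else value
        for key, value in asdict(config).items()
    }


encoded = json.dumps(config_to_jsonable(DemoConfig()))
print(encoded)
```

Sorting the set keeps the saved file stable between runs, which makes progress files easier to compare.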

## Helper Functions

Let's define helper functions for time control categorization and game filtering.

In [3]:
def categorize_time_control(time_control: str) -> TimeControlCategory:
    """
    Categorize the time control string into standard categories.
    
    Args:
        time_control: The time control string from the game headers
        
    Returns:
        The category of time control
    """
    # Handle missing or malformed time control
    if not time_control or time_control == "?" or time_control == "Unknown":
        return "unknown"
    
    # Split time control into initial time and increment
    # Format is typically "initial+increment" like "180+2" (3 minutes + 2 second increment)
    parts = time_control.split("+")
    
    try:
        # Initial time in seconds
        initial_time = int(parts[0])
        
        # Categorize based on initial time
        if initial_time < 180:  # Less than 3 minutes
            return "bullet"
        elif initial_time < 600:  # 3-10 minutes
            return "blitz"
        elif initial_time < 1800:  # 10-30 minutes
            return "rapid"
        elif initial_time <= 6000:  # 30-100 minutes
            return "classical"
        else:  # More than 100 minutes
            return "correspondence"
    except (ValueError, IndexError):
        # Handle correspondence format like "1 day"
        if "day" in time_control.lower():
            return "correspondence"
        return "unknown"

def should_include_game(headers: GameHeaders, config: ProcessingConfig) -> bool:
    """
    Determine if a game should be included based on filtering criteria.
    
    Args:
        headers: The game headers
        config: Processing configuration
        
    Returns:
        True if the game should be included, False otherwise
    """
    # Skip games with missing essential information
    if not all(key in headers for key in ['White', 'Black', 'Result', 'TimeControl']):
        return False
    
    # Skip games with too low Elo
    try:
        white_elo = int(headers.get('WhiteElo', '0'))
        black_elo = int(headers.get('BlackElo', '0'))
        if white_elo < config.min_elo or black_elo < config.min_elo:
            return False
    except ValueError:
        return False
    
    # Skip games with excluded time controls
    time_control = headers.get('TimeControl', '')
    tc_category = categorize_time_control(time_control)
    
    if tc_category in config.exclude_time_controls:
        return False
        
    # Skip abandoned or incomplete games
    if headers.get('Result', '') == '*':
        return False
    
    # Game passes all filters
    return True
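
`categorize_time_control` looks only at the initial time and ignores the increment. As far as I know, Lichess classifies speed by an estimated duration of initial time plus 40 times the increment; the sketch below follows that rule. The thresholds and the treatment of `"-"` as correspondence are assumptions that should be checked against the Lichess documentation, and `categorize_by_estimated_duration` is a hypothetical name.

```python
def categorize_by_estimated_duration(time_control: str) -> str:
    """Categorize a PGN TimeControl value by initial + 40 * increment (seconds)."""
    if time_control == "-":
        # Assumption: Lichess exports use "-" for correspondence games.
        return "correspondence"

    initial_text, _, increment_text = time_control.partition("+")
    try:
        initial = int(initial_text)
        increment = int(increment_text) if increment_text else 0
    except ValueError:
        # Covers "", "?", and any other non-numeric value.
        return "unknown"

    estimated = initial + 40 * increment
    # Thresholds are assumptions based on Lichess's published speed categories.
    if estimated < 30:
        return "ultrabullet"
    if estimated < 180:
        return "bullet"
    if estimated < 480:
        return "blitz"
    if estimated < 1500:
        return "rapid"
    return "classical"


for example in ("60+0", "60+3", "180+2", "600+0", "1800+0", "-", "?"):
    print(example, "->", categorize_by_estimated_duration(example))
```

Under this rule `"60+3"` counts as blitz (60 + 120 = 180 seconds), whereas the initial-time-only rule above files it under bullet.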

## Efficient PGN Reading

Now let's implement an efficient reader that filters games early and processes in chunks.

In [5]:
def game_reader(file_path: str, config: ProcessingConfig) -> Iterator[chess.pgn.Game]:
    """
    Generator that reads and yields games from a compressed PGN file,
    filtering on headers so that excluded games never reach later stages.
    Each game is still fully parsed by chess.pgn.read_game before the check.
    
    Args:
        file_path: Path to the compressed PGN file
        config: Processing configuration
        
    Yields:
        Chess games that pass the filtering criteria
    """
    with open(file_path, 'rb') as f:
        dctx = zstd.ZstdDecompressor()
        stream_reader = dctx.stream_reader(f)
        text_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
        
        games_processed = 0
        games_included = 0
        start_time = time.time()
        
        while True:
            # Read the next game
            game = chess.pgn.read_game(text_stream)
            
            # Check if we've reached the end of the file
            if game is None:
                break
                
            games_processed += 1
            
            # Extract headers for filtering
            headers = dict(game.headers)
            
            # Check if game meets inclusion criteria
            if should_include_game(headers, config):
                games_included += 1
                yield game
            
            # Print progress periodically
            if games_processed % 5_000 == 0:
                elapsed = time.time() - start_time
                inclusion_rate = (games_included / games_processed) * 100 if games_processed > 0 else 0
                print(f"Processed: {games_processed} games, Included: {games_included} " +
                      f"({inclusion_rate:.1f}%) in {elapsed:.2f} seconds")

def process_game_chunk(chunk: List[chess.pgn.Game]) -> Dict[str, PlayerStats]:
    """
    Process a chunk of games and return player statistics.
    
    Args:
        chunk: A list of chess games to process
        
    Returns:
        Dictionary mapping player usernames to their statistics
    """
    players_data: Dict[str, PlayerStats] = {}
    
    for game in chunk:
        headers = dict(game.headers)
        
        white_player = headers['White']
        black_player = headers['Black']
        white_elo = int(headers.get('WhiteElo', 0))
        black_elo = int(headers.get('BlackElo', 0))
        result = headers['Result']
        eco_code = headers.get('ECO', 'Unknown')
        opening_name = headers.get('Opening', 'Unknown Opening')
        
        # Process white player's game
        if white_player not in players_data:
            players_data[white_player] = {
                "rating": white_elo,
                "white_games": {},
                "black_games": {},
                "num_games_total": 0
            }
        
        # Update white player's data
        if eco_code not in players_data[white_player]["white_games"]:
            players_data[white_player]["white_games"][eco_code] = {
                "opening_name": opening_name,
                "results": {
                    "num_games": 0,
                    "num_wins": 0,
                    "num_losses": 0,
                    "num_draws": 0,
                    "score_percentage_with_opening": 0
                }
            }
        
        # Update game counts
        players_data[white_player]["num_games_total"] += 1
        players_data[white_player]["white_games"][eco_code]["results"]["num_games"] += 1
        
        # Update result counts
        if result == "1-0":  # White win
            players_data[white_player]["white_games"][eco_code]["results"]["num_wins"] += 1
        elif result == "0-1":  # Black win (white loss)
            players_data[white_player]["white_games"][eco_code]["results"]["num_losses"] += 1
        elif result == "1/2-1/2":  # Draw
            players_data[white_player]["white_games"][eco_code]["results"]["num_draws"] += 1
            
        # Update score percentage
        wins = players_data[white_player]["white_games"][eco_code]["results"]["num_wins"]
        draws = players_data[white_player]["white_games"][eco_code]["results"]["num_draws"]
        total = players_data[white_player]["white_games"][eco_code]["results"]["num_games"]
        score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
        players_data[white_player]["white_games"][eco_code]["results"]["score_percentage_with_opening"] = round(score, 1)
        
        # Similarly process black player's game
        if black_player not in players_data:
            players_data[black_player] = {
                "rating": black_elo,
                "white_games": {},
                "black_games": {},
                "num_games_total": 0
            }
        
        # Update black player's data
        if eco_code not in players_data[black_player]["black_games"]:
            players_data[black_player]["black_games"][eco_code] = {
                "opening_name": opening_name,
                "results": {
                    "num_games": 0,
                    "num_wins": 0,
                    "num_losses": 0,
                    "num_draws": 0,
                    "score_percentage_with_opening": 0
                }
            }
        
        # Update game counts
        players_data[black_player]["num_games_total"] += 1
        players_data[black_player]["black_games"][eco_code]["results"]["num_games"] += 1
        
        # Update result counts
        if result == "0-1":  # Black win
            players_data[black_player]["black_games"][eco_code]["results"]["num_wins"] += 1
        elif result == "1-0":  # White win (black loss)
            players_data[black_player]["black_games"][eco_code]["results"]["num_losses"] += 1
        elif result == "1/2-1/2":  # Draw
            players_data[black_player]["black_games"][eco_code]["results"]["num_draws"] += 1
            
        # Update score percentage
        wins = players_data[black_player]["black_games"][eco_code]["results"]["num_wins"]
        draws = players_data[black_player]["black_games"][eco_code]["results"]["num_draws"]
        total = players_data[black_player]["black_games"][eco_code]["results"]["num_games"]
        score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
        players_data[black_player]["black_games"][eco_code]["results"]["score_percentage_with_opening"] = round(score, 1)
    
    return players_data

def merge_player_stats(data1: Dict[str, PlayerStats], data2: Dict[str, PlayerStats]) -> Dict[str, PlayerStats]:
    """
    Merge two player statistics dictionaries.
    
    Args:
        data1: First player statistics dictionary
        data2: Second player statistics dictionary
        
    Returns:
        Merged player statistics
    """
    # Shallow copy: the nested per-player dicts are shared with data1 and updated in place
    merged_data: Dict[str, PlayerStats] = data1.copy()
    
    for player, stats in data2.items():
        if player not in merged_data:
            merged_data[player] = stats
        else:
            # Update total game count
            merged_data[player]["num_games_total"] += stats["num_games_total"]
            
            # Update white games
            for eco, opening_data in stats["white_games"].items():
                if eco not in merged_data[player]["white_games"]:
                    merged_data[player]["white_games"][eco] = opening_data
                else:
                    # Update results for this opening
                    merged_data[player]["white_games"][eco]["results"]["num_games"] += opening_data["results"]["num_games"]
                    merged_data[player]["white_games"][eco]["results"]["num_wins"] += opening_data["results"]["num_wins"]
                    merged_data[player]["white_games"][eco]["results"]["num_losses"] += opening_data["results"]["num_losses"]
                    merged_data[player]["white_games"][eco]["results"]["num_draws"] += opening_data["results"]["num_draws"]
                    
                    # Recalculate score percentage
                    wins = merged_data[player]["white_games"][eco]["results"]["num_wins"]
                    draws = merged_data[player]["white_games"][eco]["results"]["num_draws"]
                    total = merged_data[player]["white_games"][eco]["results"]["num_games"]
                    score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
                    merged_data[player]["white_games"][eco]["results"]["score_percentage_with_opening"] = round(score, 1)
            
            # Update black games
            for eco, opening_data in stats["black_games"].items():
                if eco not in merged_data[player]["black_games"]:
                    merged_data[player]["black_games"][eco] = opening_data
                else:
                    # Update results for this opening
                    merged_data[player]["black_games"][eco]["results"]["num_games"] += opening_data["results"]["num_games"]
                    merged_data[player]["black_games"][eco]["results"]["num_wins"] += opening_data["results"]["num_wins"]
                    merged_data[player]["black_games"][eco]["results"]["num_losses"] += opening_data["results"]["num_losses"]
                    merged_data[player]["black_games"][eco]["results"]["num_draws"] += opening_data["results"]["num_draws"]
                    
                    # Recalculate score percentage
                    wins = merged_data[player]["black_games"][eco]["results"]["num_wins"]
                    draws = merged_data[player]["black_games"][eco]["results"]["num_draws"]
                    total = merged_data[player]["black_games"][eco]["results"]["num_games"]
                    score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
                    merged_data[player]["black_games"][eco]["results"]["score_percentage_with_opening"] = round(score, 1)
    
    return merged_data
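
The score formula repeated in `process_game_chunk` and `merge_player_stats` counts a win as one point and a draw as half a point. When merging, the percentage has to be recomputed from the summed counts, because averaging two percentages gives the wrong answer when the game counts differ. Here is a small sketch of that step, using hypothetical helper names (`score_percentage`, `merge_results`) and returning a new dict instead of updating in place:

```python
from typing import Dict, Union

Number = Union[int, float]
COUNT_KEYS = ("num_games", "num_wins", "num_losses", "num_draws")


def score_percentage(wins: int, draws: int, total: int) -> float:
    """Win = 1 point, draw = 0.5 points, expressed as a percentage of games."""
    if total <= 0:
        return 0.0
    return round((wins + 0.5 * draws) / total * 100, 1)


def merge_results(a: Dict[str, Number], b: Dict[str, Number]) -> Dict[str, Number]:
    """Sum the raw counts, then derive the percentage from the summed counts."""
    merged: Dict[str, Number] = {key: a[key] + b[key] for key in COUNT_KEYS}
    merged["score_percentage_with_opening"] = score_percentage(
        merged["num_wins"], merged["num_draws"], merged["num_games"]
    )
    return merged


first = {"num_games": 4, "num_wins": 2, "num_losses": 1, "num_draws": 1}   # 62.5%
second = {"num_games": 2, "num_wins": 1, "num_losses": 0, "num_draws": 1}  # 75.0%
combined = merge_results(first, second)
print(combined["score_percentage_with_opening"])  # 66.7, not the 68.75 average
```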

## Main Processing Pipeline

Now let's implement the main processing pipeline that ties everything together.

In [6]:
def save_progress(players_data: Dict[str, PlayerStats], 
                  chunk_num: int, 
                  config: ProcessingConfig) -> None:
    """
    Save current progress to disk.
    
    Args:
        players_data: Current player statistics
        chunk_num: Current chunk number
        config: Processing configuration
    """
    # Create save directory if it doesn't exist
    save_dir = Path(config.save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    
    # Save player data
    player_data_path = save_dir / config.player_data_file
    
    # For large datasets, pickle can be more efficient than JSON
    with open(player_data_path, 'wb') as f:
        pickle.dump(players_data, f)
        
    # Save progress information
    progress_path = save_dir / config.progress_file
    progress_info = {
        "last_chunk_processed": chunk_num,
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "num_players": len(players_data),
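        # Sets (exclude_time_controls) are not JSON-serializable, so store them as sorted lists
        "config": {k: sorted(v) if isinstance(v, set) else v for k, v in vars(config).items()}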
    }
    
    with open(progress_path, 'w') as f:
        json.dump(progress_info, f, indent=2)
    
    print(f"Saved progress after chunk {chunk_num}. " +
          f"Current data includes {len(players_data)} players.")

def load_progress(config: ProcessingConfig) -> Tuple[Dict[str, PlayerStats], int]:
    """
    Load previous progress from disk.
    
    Args:
        config: Processing configuration
        
    Returns:
        Tuple of (player_data, last_chunk_processed)
    """
    player_data_path = Path(config.save_dir) / config.player_data_file
    progress_path = Path(config.save_dir) / config.progress_file
    
    # Default values if no saved progress
    players_data: Dict[str, PlayerStats] = {}
    last_chunk = 0
    
    # Load player data if it exists
    if player_data_path.exists():
        try:
            with open(player_data_path, 'rb') as f:
                players_data = pickle.load(f)
            print(f"Loaded player data with {len(players_data)} players.")
        except Exception as e:
            print(f"Error loading player data: {e}")
            players_data = {}
    
    # Load progress info if it exists
    if progress_path.exists():
        try:
            with open(progress_path, 'r') as f:
                progress_info = json.load(f)
                last_chunk = progress_info.get("last_chunk_processed", 0)
            print(f"Resuming from chunk {last_chunk}.")
        except Exception as e:
            print(f"Error loading progress info: {e}")
            last_chunk = 0
            
    return players_data, last_chunk

def process_pgn_file(file_path: str, config: ProcessingConfig) -> Dict[str, PlayerStats]:
    """
    Process a PGN file in chunks, with support for resuming from previous progress.
    
    Args:
        file_path: Path to the PGN file
        config: Processing configuration
        
    Returns:
        Player statistics dictionary
    """
    # Check if file exists
    if not Path(file_path).exists():
        raise FileNotFoundError(f"PGN file not found: {file_path}")
    
    # Load previous progress if any
    players_data, start_chunk = load_progress(config)
    
    # Create game reader
    game_gen = game_reader(file_path, config)
    
    # Skip chunks that were already processed (skipped games are still read and parsed)
    for skipped_chunk in range(start_chunk):
        print(f"Skipping chunk {skipped_chunk + 1}...")
        for _ in range(config.chunk_size):
            try:
                next(game_gen)
            except StopIteration:
                print("Reached end of file while skipping chunks.")
                return players_data
    
    # Process chunks
    chunk_num = start_chunk
    while True:
        if config.max_chunks and chunk_num >= start_chunk + config.max_chunks:
            print(f"Reached maximum number of chunks ({config.max_chunks}). Stopping.")
            break
            
        # Collect a chunk of games
        chunk = []
        for _ in range(config.chunk_size):
            try:
                game = next(game_gen)
                chunk.append(game)
            except StopIteration:
                print("Reached end of file.")
                break
                
        if not chunk:
            print("No more games to process.")
            break
            
        print(f"Processing chunk {chunk_num + 1} with {len(chunk)} games...")
        chunk_start_time = time.time()
        
        # Process chunk
        if config.use_parallel and len(chunk) >= 1000:  # Only parallelize for larger chunks
            # Split chunk into smaller pieces for parallel processing:
            # at least one worker, and roughly 100 or more games per worker
            num_processes = max(1, min(config.num_processes, len(chunk) // 100))
            piece_size = -(-len(chunk) // num_processes)  # Ceiling division
            chunks = [chunk[i:i + piece_size] for i in range(0, len(chunk), piece_size)]
            
            # Process chunks in parallel.
            # Note: with the "spawn" start method, process_game_chunk must live in an
            # importable module; workers cannot find a function defined in a notebook cell.
            with multiprocessing.Pool(processes=num_processes) as pool:
                results = pool.map(process_game_chunk, chunks)
                
            # Merge results
            chunk_data = {}
            for result in results:
                chunk_data = merge_player_stats(chunk_data, result)
        else:
            # Process chunk sequentially
            chunk_data = process_game_chunk(chunk)
            
        # Merge with existing data
        players_data = merge_player_stats(players_data, chunk_data)
        
        chunk_end_time = time.time()
        print(f"Processed chunk in {chunk_end_time - chunk_start_time:.2f} seconds.")
        
        # Save progress periodically
        chunk_num += 1
        if chunk_num % config.save_interval == 0:
            save_progress(players_data, chunk_num, config)
    
    # Save final progress
    save_progress(players_data, chunk_num, config)
    
    return players_data
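
The chunk-collection and chunk-skipping loops in `process_pgn_file` can also be written with `itertools.islice`. The sketch below shows the idea on plain integers; `chunked` and `skip` are hypothetical helper names, and `skip` follows the "consume" recipe from the `itertools` documentation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `size` items until the input is exhausted."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def skip(iterator: Iterator[T], count: int) -> None:
    """Advance `iterator` by `count` items, stopping early if it runs out."""
    next(islice(iterator, count, count), None)


print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]

numbers = iter(range(10))
skip(numbers, 4)
print(next(numbers))  # 4
```

With a game generator in place of `range`, `skip(game_gen, start_chunk * chunk_size)` would stand in for the nested skip loops, though every skipped game is still parsed by the reader.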

## Usage Example

Let's set up the configuration and run the processing pipeline.

In [None]:
# Path to the compressed PGN file
pgn_path = "/Users/a/Documents/personalprojects/chess-opening-recommender/data/raw/lichess_db_standard_rated_2025-07.pgn.zst"

# Create a configuration with customized parameters
config = ProcessingConfig(
    # Filtering parameters
    min_elo=1200,  # Only include games where both players are rated at least this Elo
    exclude_time_controls={"bullet", "hyperbullet", "ultrabullet"},
    min_moves=15,  # Intended to exclude very short games (not currently checked by should_include_game)
    
    # Processing parameters
    chunk_size=10000,  # Process this many games at once
    max_chunks=5,  # Process a maximum of 5 chunks (50K games) - change as needed
    save_interval=1,  # Save after each chunk
    
    # File paths
    save_dir="../data/processed",
    
    # Parallelization
    use_parallel=True,
    num_processes=max(1, multiprocessing.cpu_count() - 1)
)

# Run the processing pipeline
players_data = process_pgn_file(pgn_path, config)

# Show some statistics
print(f"Total number of players: {len(players_data)}")

# Show a sample player
import random
if players_data:
    sample_player = random.choice(list(players_data.keys()))
    print(f"\nSample stats for player: {sample_player}")
    print(f"Rating: {players_data[sample_player]['rating']}")
    print(f"Total games: {players_data[sample_player]['num_games_total']}")
    print("\nWhite openings:")
    for eco, data in players_data[sample_player]['white_games'].items():
        print(f"  {eco} - {data['opening_name']}: {data['results']['score_percentage_with_opening']}% score in {data['results']['num_games']} games")
    print("\nBlack openings:")
    for eco, data in players_data[sample_player]['black_games'].items():
        print(f"  {eco} - {data['opening_name']}: {data['results']['score_percentage_with_opening']}% score in {data['results']['num_games']} games")

Processed: 5000 games, Included: 2540 (50.8%) in 13.85 seconds
Processed: 10000 games, Included: 5073 (50.7%) in 27.43 seconds
Processed: 15000 games, Included: 7550 (50.3%) in 40.27 seconds
Processing chunk 1 with 10000 games...


Process SpawnPoolWorker-3:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/miniconda3/lib/python3.12/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/opt/miniconda3/lib/python3.12/multiprocessing/queues.py", line 389, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'process_game_chunk' on <module '__main__' (<class '_frozen_importlib.BuiltinImporter'>)>
Process SpawnPoolWorker-1:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwa
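
The `AttributeError` in the traceback above is typical of `multiprocessing` with the `spawn` start method (the default on macOS) inside a notebook: worker processes cannot find functions that were defined in notebook cells. A common workaround is to put the worker function in an importable `.py` module and to send it plain records rather than `chess.pgn.Game` objects, which are typically much heavier to pickle. The following is a sketch under those assumptions; the module name `chess_workers.py` and the names `GameRecord`, `to_record`, `count_games`, and `count_in_parallel` are all hypothetical.

```python
# Contents of a hypothetical module, e.g. chess_workers.py, imported by the notebook.
import multiprocessing
from collections import Counter
from typing import List, Mapping, Tuple

# (white, black, result, eco) as plain strings
GameRecord = Tuple[str, str, str, str]


def to_record(headers: Mapping[str, str]) -> GameRecord:
    """Reduce PGN headers to the few fields the aggregation step needs."""
    return (
        headers["White"],
        headers["Black"],
        headers["Result"],
        headers.get("ECO", "Unknown"),
    )


def count_games(records: List[GameRecord]) -> Counter:
    """Worker: count games per (player, color, ECO code)."""
    counts: Counter = Counter()
    for white, black, _result, eco in records:
        counts[(white, "white", eco)] += 1
        counts[(black, "black", eco)] += 1
    return counts


def count_in_parallel(pieces: List[List[GameRecord]], num_processes: int) -> Counter:
    """Run count_games over each piece in a process pool and sum the partial counts."""
    with multiprocessing.Pool(processes=num_processes) as pool:
        partials = pool.map(count_games, pieces)
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total
```

In the notebook, the cell would then call something like `from chess_workers import to_record, count_in_parallel`, so that the workers can import the same functions by module name.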

## Analysis of Efficiency Improvements

Our optimized approach provides several major efficiency improvements:

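1. **Early filtering**: We filter games as we read them, so games that fail the criteria are never accumulated in memory or passed to the aggregation step. Each game is still parsed by `chess.pgn.read_game` before its headers are checked.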

2. **Chunked processing**: We process games in chunks, allowing us to save progress and resume later.

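3. **Parallel processing**: We split each chunk across multiple CPU cores for the aggregation step. In the run shown above, the pool workers raised an `AttributeError` when looking up `process_game_chunk`, so this benefit is not yet demonstrated by the notebook as written.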

4. **Incremental saving**: We save progress after each chunk, so we don't lose work if the process crashes.

5. **Strong typing**: We use type hints throughout, providing better IDE support and code quality.

With these optimizations, we can process large datasets much more efficiently, potentially reducing the processing time from about a day to a few hours, as estimated in the next section.

## Estimating Time Savings

Let's analyze the potential time savings from these optimizations:

1. **Early filtering**: If 70% of games can be filtered out early (e.g., bullet games, low-rated players), we only need to fully process 30% of the data. For comparison, the sample run above kept about 50% of games with `min_elo=1200`.

2. **Parallel processing**: Using multiple CPU cores (e.g., 7 on an 8-core machine) can provide a ~5-6x speedup for the processing phase.

3. **Incremental saving**: Even if processing takes several hours, you can stop and resume at any time without losing progress.

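Instead of processing 34 million games at about 5,000 games per 13 seconds (roughly 1.4 million/hour), we might be able to process 5-10 million effective games per hour with these optimizations.

For example:
- Original approach: 34 million games ÷ 1.4 million/hour ≈ 24 hours
- Optimized approach: 30% of 34 million games ÷ (1.4 million × 5 speedup)/hour ≈ 1.5 hours

Treat the optimized figure as a best case. The 13-second rate was measured in `game_reader`, which decompresses and parses every game sequentially before filtering, so neither the filter nor the worker pool shortens that stage; only the aggregation step benefits.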
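
The arithmetic above can be reproduced in a few lines. The second function applies Amdahl's law to show how a sequential stage caps the overall gain. The parallel fraction of 0.5 used below is purely hypothetical, not a measurement from this notebook, and the function names are illustrative.

```python
def hours_needed(total_games: float, kept_fraction: float,
                 games_per_hour: float, speedup: float) -> float:
    """Estimate used in the text: kept games divided by the sped-up rate."""
    return total_games * kept_fraction / (games_per_hour * speedup)


def amdahl_speedup(parallel_fraction: float, worker_speedup: float) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime is accelerated."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / worker_speedup)


baseline = hours_needed(34e6, 1.0, 1.4e6, 1.0)
optimistic = hours_needed(34e6, 0.3, 1.4e6, 5.0)
print(f"Baseline:   {baseline:.1f} hours")    # about 24.3
print(f"Optimistic: {optimistic:.1f} hours")  # about 1.5

# Hypothetical: if only half of the runtime were parallelizable,
# a 5x worker speedup would give well under 2x overall.
print(f"Overall speedup: {amdahl_speedup(0.5, 5.0):.2f}x")  # about 1.67
```

Measuring how the runtime splits between reading and aggregation on a small `max_chunks` run would replace the hypothetical fraction with a real one.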

## Next Steps

1. Run the optimized processing pipeline with a small `max_chunks` value to test it.
2. Adjust the `min_elo`, `exclude_time_controls`, and other parameters to match your needs.
3. When satisfied with the results, set `max_chunks` to `None` to process the entire file.
4. Use the saved player data for your chess opening recommendation system.