# Parquet Data Performance Testing

This notebook measures performance metrics for processing chess game data from parquet files. The main objectives are:

1. Efficiently load data from parquet files using DuckDB
2. Process the data to gather player statistics and opening preferences
3. Log detailed performance metrics (games/sec, processing time, etc.)
4. Compare with the performance of processing PGN files directly

Update: This notebook seems to do a great job of parsing games quickly. Now we are moving on to better filtering.

In [1]:
# Import necessary libraries
from typing import Dict, List, TypedDict, Optional, Union, Literal, Set
import duckdb
import time
import json
import os
from pathlib import Path
import pandas as pd
import numpy as np
import psutil
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
import pickle

## Type Definitions

Let's start by defining our data structures for strong typing support.

In [None]:
# Define types for game results
GameResult = Literal["1-0", "0-1", "1/2-1/2", "*"]

# Define types for our data structures
class OpeningResults(TypedDict):
    """Statistics for a player's results with a particular opening."""
    opening_name: str
    results: Dict[str, Union[int, float]]

class PlayerStats(TypedDict):
    """Statistics for an individual player."""
    rating: int
    white_games: Dict[str, OpeningResults]  # ECO code -> results
    black_games: Dict[str, OpeningResults]  # ECO code -> results
    num_games_total: int

class ProcessingConfig:
    """Configuration for the game processing pipeline.
    Contains parameters for filtering games, batch processing, and parallelization.
    This is designed to ensure that the processing of raw chess game data yields usable results efficiently.
    """


    def __init__(
        self,

        # Computer efficiency and organization settings
        parquet_path: str,
        batch_size: int = 100_000,
        save_interval: int = 1,
        save_dir: str = "../data/processed",

        # Chess game filtering settings
        # Neither the black nor the white player can be below this rating
        min_player_rating: int = 1200,
        # Players can't be more than 100 rating points apart
        max_elo_difference_between_players: int = 100,
        # Exclude bullet and daily games by default
        allowed_time_controls: Optional[Set[str]] = None,
        use_parallel: bool = False,  # Disable parallel processing by default
        num_processes: int = 1,
    ):
        # Notes on game filters:
        # Didn't exclude unrated games because our dataset contains only rated games.
        # Also didn't have to filter out bot games, because only games between two humans are rated (I think so, at least).
        # See here to look at the data I used: https://huggingface.co/datasets/Lichess/standard-chess-games

        self.parquet_path = parquet_path
        self.batch_size = batch_size
        self.save_interval = save_interval
        self.save_dir = save_dir
        self.min_player_rating = min_player_rating
        self.max_elo_difference_between_players = max_elo_difference_between_players

        # Default to common time controls if none specified
        if allowed_time_controls is None:
            self.allowed_time_controls = {"Blitz", "Rapid", "Classical"}
        else:
            self.allowed_time_controls = allowed_time_controls

        self.use_parallel = use_parallel
        self.num_processes = num_processes
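
The filters stored on `ProcessingConfig` are not applied anywhere in this notebook yet. Here is a minimal sketch of how they could be checked for a single game. It assumes the time-control category appears as a word in the `Event` field (for example `Rated Blitz game`), which is an assumption about the dataset. `time_control_category` and `game_passes_filters` are hypothetical helpers, not part of the notebook.

```python
from typing import Optional, Set

# Assumption: these category names appear as words in the Event field.
KNOWN_CATEGORIES = ("UltraBullet", "Bullet", "Blitz", "Rapid", "Classical", "Correspondence")


def time_control_category(event: str) -> Optional[str]:
    """Guess the time-control category from an Event string, or None if unknown."""
    words = event.lower().split()
    for category in KNOWN_CATEGORIES:
        if category.lower() in words:
            return category
    return None


def game_passes_filters(
    white_elo: int,
    black_elo: int,
    event: str,
    min_player_rating: int = 1200,
    max_elo_difference: int = 100,
    allowed_time_controls: Optional[Set[str]] = None,
) -> bool:
    """Return True if a game satisfies the rating and time-control filters."""
    if allowed_time_controls is None:
        allowed_time_controls = {"Blitz", "Rapid", "Classical"}
    # Both players must be at or above the minimum rating
    if min(white_elo, black_elo) < min_player_rating:
        return False
    # Players must be closely matched
    if abs(white_elo - black_elo) > max_elo_difference:
        return False
    return time_control_category(event) in allowed_time_controls


print(game_passes_filters(1500, 1550, "Rated Blitz game"))   # True
print(game_passes_filters(1500, 1550, "Rated Bullet game"))  # False
```

In practice the same conditions could also be pushed into the DuckDB query as a `WHERE` clause, so that rejected games are never loaded into pandas.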

## Performance Metrics

We'll define a class to track performance metrics during processing.

In [3]:
class PerformanceTracker:
    """Track and report performance metrics during processing."""
    
    def __init__(self):
        self.start_time = time.time()
        self.last_log_time = self.start_time
        self.total_games = 0
        self.batch_times = []
        self.batch_sizes = []
        self.memory_usage = []
    
    def start_batch(self):
        """Mark the start of a new batch."""
        self.batch_start_time = time.time()
    
    def end_batch(self, batch_size: int):
        """Mark the end of a batch and record metrics."""
        end_time = time.time()
        batch_time = end_time - self.batch_start_time
        
        self.total_games += batch_size
        self.batch_times.append(batch_time)
        self.batch_sizes.append(batch_size)
        
        # Record memory usage
        mem = psutil.virtual_memory()
        self.memory_usage.append({
            "percent": mem.percent,
            "used_gb": mem.used / (1024**3),
            "available_gb": mem.available / (1024**3)
        })
        
        return batch_time
    
    def log_progress(self, force: bool = False):
        """Log progress information if enough time has passed or if forced."""
        current_time = time.time()
        
        # Log if it's been more than 5 seconds since the last log or if forced
        if force or (current_time - self.last_log_time) >= 5:
            elapsed_total = current_time - self.start_time
            games_per_sec = self.total_games / elapsed_total if elapsed_total > 0 else 0
            
            # Calculate recent performance (last 5 batches or fewer)
            recent_batches = min(5, len(self.batch_times))
            if recent_batches > 0:
                recent_time = sum(self.batch_times[-recent_batches:])
                recent_games = sum(self.batch_sizes[-recent_batches:])
                recent_rate = recent_games / recent_time if recent_time > 0 else 0
                
                # Get the latest memory usage
                latest_mem = self.memory_usage[-1] if self.memory_usage else {"percent": 0, "used_gb": 0, "available_gb": 0}
                
                print(f"Processed {self.total_games:,} games in {elapsed_total:.2f} seconds")
                print(f"Overall rate: {games_per_sec:.1f} games/sec")
                print(f"Recent rate: {recent_rate:.1f} games/sec")
                print(f"Memory usage: {latest_mem['percent']}% (Used: {latest_mem['used_gb']:.1f}GB, "
                      f"Available: {latest_mem['available_gb']:.1f}GB)")
                print("-" * 40)
            
            self.last_log_time = current_time
    
    def get_summary(self):
        """Get a summary of all performance metrics."""
        end_time = time.time()
        total_time = end_time - self.start_time
        
        avg_batch_time = sum(self.batch_times) / len(self.batch_times) if self.batch_times else 0
        max_batch_time = max(self.batch_times) if self.batch_times else 0
        min_batch_time = min(self.batch_times) if self.batch_times else 0
        
        avg_batch_size = sum(self.batch_sizes) / len(self.batch_sizes) if self.batch_sizes else 0
        
        overall_rate = self.total_games / total_time if total_time > 0 else 0
        
        return {
            "total_games": self.total_games,
            "total_time_sec": total_time,
            "avg_batch_time_sec": avg_batch_time,
            "min_batch_time_sec": min_batch_time,
            "max_batch_time_sec": max_batch_time,
            "avg_batch_size": avg_batch_size,
            "overall_rate_games_per_sec": overall_rate,
            "memory_usage": self.memory_usage
        }
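
The "recent rate" printed by `log_progress` is a windowed throughput: the games in the last few batches divided by the time those batches took. A standalone sketch of that calculation follows; the function name `recent_rate` is illustrative and not part of the notebook.

```python
from typing import List


def recent_rate(batch_times: List[float], batch_sizes: List[int], window: int = 5) -> float:
    """Games per second over the last `window` batches (0.0 if there is no data)."""
    if not batch_times:
        return 0.0
    recent_time = sum(batch_times[-window:])
    recent_games = sum(batch_sizes[-window:])
    return recent_games / recent_time if recent_time > 0 else 0.0


# Six batches of 100,000 games; the first one was slow.
times = [20.0, 10.0, 10.0, 10.0, 10.0, 10.0]
sizes = [100_000] * 6

# Only the last five batches count: 500,000 games / 50 s
print(recent_rate(times, sizes))  # 10000.0
```

Because the slow first batch falls outside the window, the recent rate recovers faster than the overall rate, which makes it the more useful number for spotting slowdowns during a long run.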

## Helper Functions

These functions will help us process the data and manage player statistics.

In [4]:
def merge_player_stats(data1: Dict[str, PlayerStats], data2: Dict[str, PlayerStats]) -> Dict[str, PlayerStats]:
    """
    Merge two player statistics dictionaries.
    
    Args:
        data1: First player statistics dictionary
        data2: Second player statistics dictionary
        
    Returns:
        Merged player statistics
    """
    merged_data: Dict[str, PlayerStats] = data1.copy()
    
    for player, stats in data2.items():
        if player not in merged_data:
            merged_data[player] = stats
        else:
            # Update total game count
            merged_data[player]["num_games_total"] += stats["num_games_total"]
            
            # Update white games
            for eco, opening_data in stats["white_games"].items():
                if eco not in merged_data[player]["white_games"]:
                    merged_data[player]["white_games"][eco] = opening_data
                else:
                    # Update results for this opening
                    merged_data[player]["white_games"][eco]["results"]["num_games"] += opening_data["results"]["num_games"]
                    merged_data[player]["white_games"][eco]["results"]["num_wins"] += opening_data["results"]["num_wins"]
                    merged_data[player]["white_games"][eco]["results"]["num_losses"] += opening_data["results"]["num_losses"]
                    merged_data[player]["white_games"][eco]["results"]["num_draws"] += opening_data["results"]["num_draws"]
                    
                    # Recalculate score percentage
                    wins = merged_data[player]["white_games"][eco]["results"]["num_wins"]
                    draws = merged_data[player]["white_games"][eco]["results"]["num_draws"]
                    total = merged_data[player]["white_games"][eco]["results"]["num_games"]
                    score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
                    merged_data[player]["white_games"][eco]["results"]["score_percentage_with_opening"] = round(score, 1)
            
            # Update black games
            for eco, opening_data in stats["black_games"].items():
                if eco not in merged_data[player]["black_games"]:
                    merged_data[player]["black_games"][eco] = opening_data
                else:
                    # Update results for this opening
                    merged_data[player]["black_games"][eco]["results"]["num_games"] += opening_data["results"]["num_games"]
                    merged_data[player]["black_games"][eco]["results"]["num_wins"] += opening_data["results"]["num_wins"]
                    merged_data[player]["black_games"][eco]["results"]["num_losses"] += opening_data["results"]["num_losses"]
                    merged_data[player]["black_games"][eco]["results"]["num_draws"] += opening_data["results"]["num_draws"]
                    
                    # Recalculate score percentage
                    wins = merged_data[player]["black_games"][eco]["results"]["num_wins"]
                    draws = merged_data[player]["black_games"][eco]["results"]["num_draws"]
                    total = merged_data[player]["black_games"][eco]["results"]["num_games"]
                    score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
                    merged_data[player]["black_games"][eco]["results"]["score_percentage_with_opening"] = round(score, 1)
    
    return merged_data

def save_progress(players_data: Dict[str, PlayerStats], 
                  batch_num: int, 
                  config: ProcessingConfig,
                  perf_tracker: Optional[PerformanceTracker] = None) -> None:
    """
    Save current progress to disk.
    
    Args:
        players_data: Current player statistics
        batch_num: Current batch number
        config: Processing configuration
        perf_tracker: Performance tracker object
    """
    # Create save directory if it doesn't exist
    save_dir = Path(config.save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    
    # Save player data
    player_data_path = save_dir / "player_stats_parquet.pkl"
    
    # For large datasets, pickle can be more efficient than JSON
    with open(player_data_path, 'wb') as f:
        pickle.dump(players_data, f)
        
    # Save progress information
    progress_path = save_dir / "processing_progress_parquet.json"
    progress_info = {
        "last_batch_processed": batch_num,
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "num_players": len(players_data),
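        # Sets (e.g. allowed_time_controls) are not JSON-serializable, so store them as sorted lists
        "config": {k: sorted(v) if isinstance(v, set) else v for k, v in vars(config).items()}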
    }
    
    # Add performance metrics if available
    if perf_tracker:
        progress_info["performance"] = perf_tracker.get_summary()
    
    with open(progress_path, 'w') as f:
        json.dump(progress_info, f, indent=2)
    
    print(f"Saved progress after batch {batch_num}. " +
          f"Current data includes {len(players_data)} players.")

def load_progress(config: ProcessingConfig) -> tuple[Dict[str, PlayerStats], int]:
    """
    Load previous progress from disk.
    
    Args:
        config: Processing configuration
        
    Returns:
        Tuple of (player_data, last_batch_processed)
    """
    player_data_path = Path(config.save_dir) / "player_stats_parquet.pkl"
    progress_path = Path(config.save_dir) / "processing_progress_parquet.json"
    
    # Default values if no saved progress
    players_data: Dict[str, PlayerStats] = {}
    last_batch = 0
    
    # Load player data if it exists
    if player_data_path.exists():
        try:
            with open(player_data_path, 'rb') as f:
                players_data = pickle.load(f)
            print(f"Loaded player data with {len(players_data)} players.")
        except Exception as e:
            print(f"Error loading player data: {e}")
            players_data = {}
    
    # Load progress info if it exists
    if progress_path.exists():
        try:
            with open(progress_path, 'r') as f:
                progress_info = json.load(f)
                last_batch = progress_info.get("last_batch_processed", 0)
            print(f"Resuming from batch {last_batch}.")
        except Exception as e:
            print(f"Error loading progress info: {e}")
            last_batch = 0
            
    return players_data, last_batch
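
When two batches contain the same player and opening, `merge_player_stats` adds the raw counts and then recomputes the score from the summed counts, rather than averaging the two percentages. A small worked sketch of that step follows; `score_percentage` and `merge_results` are illustrative names.

```python
from typing import Dict

COUNT_KEYS = ("num_games", "num_wins", "num_losses", "num_draws")


def score_percentage(wins: int, draws: int, total: int) -> float:
    """Score as a percentage: a win counts 1 point, a draw 0.5."""
    return round((wins + 0.5 * draws) / total * 100, 1) if total > 0 else 0.0


def merge_results(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Add the counts of two result dicts and recompute the score from the totals."""
    merged = {key: a[key] + b[key] for key in COUNT_KEYS}
    merged["score_percentage_with_opening"] = score_percentage(
        merged["num_wins"], merged["num_draws"], merged["num_games"]
    )
    return merged


batch_1 = {"num_games": 4, "num_wins": 2, "num_losses": 1, "num_draws": 1}  # 62.5% alone
batch_2 = {"num_games": 2, "num_wins": 1, "num_losses": 1, "num_draws": 0}  # 50.0% alone

merged = merge_results(batch_1, batch_2)
print(merged["num_games"])                      # 6
print(merged["score_percentage_with_opening"])  # 58.3
```

Averaging the two per-batch percentages would give 56.25%, which underweights the larger batch. Recomputing from the counts avoids that.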

## Data Processing Functions

Now let's implement the core functions to process the parquet data efficiently.

In [None]:
def process_batch(batch_df: pd.DataFrame, config: ProcessingConfig, log_frequency: int = 5000) -> Dict[str, PlayerStats]:
    """
    Process a batch of games from a DataFrame and return player statistics.
    
    Args:
        batch_df: DataFrame containing a batch of games
        config: Processing configuration (its rating and time-control filters are not applied here yet)
        log_frequency: Log progress after processing this many games
        
    Returns:
        Dictionary mapping player usernames to their statistics
    """
    players_data: Dict[str, PlayerStats] = {}
    start_time = time.time()
    total_rows = len(batch_df)
    
    # Process each game in the batch
    for i, (_, game) in enumerate(batch_df.iterrows()):
        # Log progress periodically within the batch
        if (i + 1) % log_frequency == 0:
            elapsed = time.time() - start_time
            rate = (i + 1) / elapsed if elapsed > 0 else 0
            eta = (total_rows - (i + 1)) / rate if rate > 0 else 0
            print(f"Progress: {i+1:,}/{total_rows:,} games ({(i+1)/total_rows*100:.1f}%) - "
                  f"Rate: {rate:.1f} games/sec - ETA: {eta/60:.1f} minutes - "
                  f"Players: {len(players_data):,}")
        
        # Extract relevant fields
        white_player = game['White']
        black_player = game['Black']
        
        # Handle potential missing values
        try:
            white_elo = int(game.get('WhiteElo', 0))
            black_elo = int(game.get('BlackElo', 0))
        except (ValueError, TypeError):
            white_elo = 0
            black_elo = 0
            
        result = game['Result']
        eco_code = game.get('ECO', 'Unknown')
        opening_name = game.get('Opening', 'Unknown Opening')
        
        # Process white player's game
        if white_player not in players_data:
            players_data[white_player] = {
                "rating": white_elo,
                "white_games": {},
                "black_games": {},
                "num_games_total": 0
            }
        
        # Update white player's data
        if eco_code not in players_data[white_player]["white_games"]:
            players_data[white_player]["white_games"][eco_code] = {
                "opening_name": opening_name,
                "results": {
                    "num_games": 0,
                    "num_wins": 0,
                    "num_losses": 0,
                    "num_draws": 0,
                    "score_percentage_with_opening": 0
                }
            }
        
        # Update game counts
        players_data[white_player]["num_games_total"] += 1
        players_data[white_player]["white_games"][eco_code]["results"]["num_games"] += 1
        
        # Update result counts
        if result == "1-0":  # White win
            players_data[white_player]["white_games"][eco_code]["results"]["num_wins"] += 1
        elif result == "0-1":  # Black win (white loss)
            players_data[white_player]["white_games"][eco_code]["results"]["num_losses"] += 1
        elif result == "1/2-1/2":  # Draw
            players_data[white_player]["white_games"][eco_code]["results"]["num_draws"] += 1
            
        # Update score percentage
        wins = players_data[white_player]["white_games"][eco_code]["results"]["num_wins"]
        draws = players_data[white_player]["white_games"][eco_code]["results"]["num_draws"]
        total = players_data[white_player]["white_games"][eco_code]["results"]["num_games"]
        score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
        players_data[white_player]["white_games"][eco_code]["results"]["score_percentage_with_opening"] = round(score, 1)
        
        # Similarly process black player's game
        if black_player not in players_data:
            players_data[black_player] = {
                "rating": black_elo,
                "white_games": {},
                "black_games": {},
                "num_games_total": 0
            }
        
        # Update black player's data
        if eco_code not in players_data[black_player]["black_games"]:
            players_data[black_player]["black_games"][eco_code] = {
                "opening_name": opening_name,
                "results": {
                    "num_games": 0,
                    "num_wins": 0,
                    "num_losses": 0,
                    "num_draws": 0,
                    "score_percentage_with_opening": 0
                }
            }
        
        # Update game counts
        players_data[black_player]["num_games_total"] += 1
        players_data[black_player]["black_games"][eco_code]["results"]["num_games"] += 1
        
        # Update result counts
        if result == "0-1":  # Black win
            players_data[black_player]["black_games"][eco_code]["results"]["num_wins"] += 1
        elif result == "1-0":  # White win (black loss)
            players_data[black_player]["black_games"][eco_code]["results"]["num_losses"] += 1
        elif result == "1/2-1/2":  # Draw
            players_data[black_player]["black_games"][eco_code]["results"]["num_draws"] += 1
            
        # Update score percentage
        wins = players_data[black_player]["black_games"][eco_code]["results"]["num_wins"]
        draws = players_data[black_player]["black_games"][eco_code]["results"]["num_draws"]
        total = players_data[black_player]["black_games"][eco_code]["results"]["num_games"]
        score = (wins + (draws * 0.5)) / total * 100 if total > 0 else 0
        players_data[black_player]["black_games"][eco_code]["results"]["score_percentage_with_opening"] = round(score, 1)
    
    # Final progress update
    elapsed = time.time() - start_time
    rate = total_rows / elapsed if elapsed > 0 else 0
    print(f"Completed {total_rows:,} games in {elapsed:.1f} seconds - Rate: {rate:.1f} games/sec")
    
    return players_data
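
`DataFrame.iterrows` builds a Series object for every row, which tends to dominate the runtime of a loop like the one above. As a possible alternative (not what this notebook does), the per-player, per-opening counts can be computed with a single `groupby`. Here is a minimal sketch for the White side, on a toy frame that uses the same column names.

```python
import pandas as pd

games = pd.DataFrame({
    "White": ["alice", "alice", "bob", "alice"],
    "Black": ["bob", "carol", "alice", "bob"],
    "Result": ["1-0", "1/2-1/2", "0-1", "0-1"],
    "ECO": ["C44", "C44", "A46", "C44"],
})

# Boolean outcome flags from White's point of view
white = games.assign(
    win=games["Result"].eq("1-0"),
    loss=games["Result"].eq("0-1"),
    draw=games["Result"].eq("1/2-1/2"),
)

# One row per (player, opening) with summed counts
white_stats = white.groupby(["White", "ECO"]).agg(
    num_games=("Result", "size"),
    num_wins=("win", "sum"),
    num_losses=("loss", "sum"),
    num_draws=("draw", "sum"),
)

white_stats["score_percentage_with_opening"] = (
    (white_stats["num_wins"] + 0.5 * white_stats["num_draws"])
    / white_stats["num_games"] * 100
).round(1)

print(white_stats)
```

The Black side would follow the same pattern with the win and loss conditions swapped and `Black` as the grouping key.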

## Parallelized Batch Processing

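This section was intended to add parallelized batch processing. For now, `process_batch_parallel` is a thin wrapper around the sequential `process_batch`, because parallel execution ran into serialization issues.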

In [6]:
def process_batch_parallel(batch_df: pd.DataFrame, config: ProcessingConfig, log_frequency: int = 5000) -> Dict[str, PlayerStats]:
    """
    Process a batch of games - this is now a wrapper around the sequential process_batch function.
    
    Args:
        batch_df: DataFrame containing a batch of games
        config: Processing configuration
        log_frequency: How often to log progress (in number of games)
        
    Returns:
        Dictionary mapping player usernames to their statistics
    """
    # We're disabling parallel processing to avoid serialization issues
    # In a production environment, parallel processing would be implemented differently
    return process_batch(batch_df, config, log_frequency=log_frequency)

## Main Processing Function

Now let's implement the main function that processes the parquet file in batches.

In [7]:
def process_parquet_file(config: ProcessingConfig, log_frequency: int = 5000) -> Dict[str, PlayerStats]:
    """
    Process a parquet file in batches, with detailed performance tracking.
    
    Args:
        config: Processing configuration
        log_frequency: How often to log progress within a batch (in number of games)
        
    Returns:
        Player statistics dictionary
    """
    # Initialize DuckDB connection
    con = duckdb.connect()
    
    # Load previous progress if any
    players_data, start_batch = load_progress(config)
    
    # Initialize performance tracker
    perf_tracker = PerformanceTracker()
    
    # Get total number of rows
    print("Counting total rows in parquet file...")
    total_rows = con.execute(
        f"SELECT COUNT(*) FROM '{config.parquet_path}'"
    ).fetchone()[0]
    print(f"Total rows in parquet file: {total_rows:,}")
    
    # Calculate number of batches
    total_batches = (total_rows + config.batch_size - 1) // config.batch_size
    print(f"Will process in {total_batches} batches of size {config.batch_size:,}")
    
    # Process in batches
    batch_num = start_batch
    
    while True:
        # Calculate offset for the current batch
        offset = batch_num * config.batch_size
        
        # Check if we've processed all rows
        if offset >= total_rows:
            print("Processed all rows. Finishing up.")
            break
        
        print(f"\nProcessing batch {batch_num + 1}/{total_batches} (offset {offset:,})")
        perf_tracker.start_batch()
        
        # Fetch a batch of data
        batch_query = f"""
        SELECT 
            Event, White, Black, Result, 
            WhiteTitle, BlackTitle, WhiteElo, BlackElo, 
            WhiteRatingDiff, BlackRatingDiff, ECO, Opening,
            Termination, TimeControl
        FROM '{config.parquet_path}'
        LIMIT {config.batch_size} OFFSET {offset}
        """
        
        # Execute the query and convert to DataFrame
        batch_df = con.execute(batch_query).df()
        
        print(f"Loaded batch with {len(batch_df):,} rows")
        
        # Process the batch
        batch_data = process_batch_parallel(batch_df, config, log_frequency=log_frequency)
        
        # Merge with existing data
        players_data = merge_player_stats(players_data, batch_data)
        
        # Record batch completion
        batch_time = perf_tracker.end_batch(len(batch_df))
        print(f"Processed batch in {batch_time:.2f} seconds")
        print(f"Current player count: {len(players_data):,}")
        
        # Log performance metrics
        perf_tracker.log_progress(force=True)
        
        # Save progress periodically
        batch_num += 1
        if batch_num % config.save_interval == 0:
            save_progress(players_data, batch_num, config, perf_tracker)
    
    # Save final progress
    save_progress(players_data, batch_num, config, perf_tracker)
    
    # Print final performance summary
    summary = perf_tracker.get_summary()
    print("\nPerformance Summary:")
    print(f"Total games processed: {summary['total_games']:,}")
    print(f"Total processing time: {summary['total_time_sec']:.2f} seconds")
    print(f"Overall processing rate: {summary['overall_rate_games_per_sec']:.2f} games/second")
    print(f"Average batch processing time: {summary['avg_batch_time_sec']:.2f} seconds")
    
    return players_data

## System Information

Let's check the system's hardware resources to optimize our configuration.

In [8]:
# Install psutil if not already installed
import sys
import subprocess

try:
    import psutil
except ImportError:
    print("Installing psutil package...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "psutil"])
    import psutil
    print("psutil installed successfully")

def get_system_info():
    """Get information about the system's hardware resources."""
    info = {
        "cpu_count_physical": psutil.cpu_count(logical=False),
        "cpu_count_logical": psutil.cpu_count(logical=True),
        "memory_total_gb": round(psutil.virtual_memory().total / (1024**3), 2),
        "memory_available_gb": round(psutil.virtual_memory().available / (1024**3), 2)
    }
    return info

# Get system information
system_info = get_system_info()
print("System Information:")
for key, value in system_info.items():
    print(f"  {key}: {value}")

# Calculate optimal batch size based on available memory
# Assuming each row needs about 1KB of memory
available_memory_gb = system_info["memory_available_gb"]
memory_for_batch_gb = available_memory_gb * 0.3  # Use 30% of available memory
optimal_batch_size = int(memory_for_batch_gb * 1024**3 / 1024)  # 1KB per row

# Round to nearest 10,000
optimal_batch_size = max(10_000, round(optimal_batch_size / 10_000) * 10_000)

print(f"\nRecommended batch size based on memory: {optimal_batch_size:,}")

System Information:
  cpu_count_physical: 6
  cpu_count_logical: 12
  memory_total_gb: 32.0
  memory_available_gb: 20.81

Recommended batch size based on memory: 6,550,000


## Run Processing

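Now let's run the processing. The configuration below uses a batch size of 100,000 rather than the memory-based recommendation above, so that progress is reported and saved more often.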

In [None]:
# Path to the parquet file
parquet_path = "/Users/a/Documents/personalprojects/chess-opening-recommender/data/raw/train-00000-of-00072.parquet"

# Create configuration with a more reasonable batch size for progress reporting
config = ProcessingConfig(
    parquet_path=parquet_path,
    batch_size=100_000,  # Use a smaller batch size for better progress tracking
    save_interval=1,  # Save after each batch
    save_dir="../data/processed",
    min_player_rating=1200,  # Only include games where both players are at or above this rating
    use_parallel=False,  # Disable parallel processing
    num_processes=1  # Use only one process
)

# Set how often to log progress within each batch (every 5,000 games)
log_frequency = 5000

print("Starting processing with sequential processing (parallel disabled)")
print(f"Batch size: {config.batch_size:,}")
print(f"Progress updates every: {log_frequency:,} games")
print(f"Parquet file: {config.parquet_path}")

# Process the parquet file
players_data = process_parquet_file(config, log_frequency=log_frequency)

# Print statistics about the processed data
print("\nProcessed data statistics:")
print(f"Total number of players: {len(players_data):,}")

# Show an example player
if players_data:
    import random
    sample_player = random.choice(list(players_data.keys()))
    print(f"\nSample stats for player: {sample_player}")
    print(f"Rating: {players_data[sample_player]['rating']}")
    print(f"Total games: {players_data[sample_player]['num_games_total']}")
    
    print("\nWhite openings:")
    white_openings = players_data[sample_player]['white_games']
    if white_openings:
        for eco, data in list(white_openings.items())[:5]:  # Show only first 5
            print(f"  {eco} - {data['opening_name']}: " +
                  f"{data['results']['score_percentage_with_opening']}% score in {data['results']['num_games']} games")
        if len(white_openings) > 5:
            print(f"  ... and {len(white_openings) - 5} more openings")
    else:
        print("  No white openings")
    
    print("\nBlack openings:")
    black_openings = players_data[sample_player]['black_games']
    if black_openings:
        for eco, data in list(black_openings.items())[:5]:  # Show only first 5
            print(f"  {eco} - {data['opening_name']}: " + 
                  f"{data['results']['score_percentage_with_opening']}% score in {data['results']['num_games']} games")
        if len(black_openings) > 5:
            print(f"  ... and {len(black_openings) - 5} more openings")
    else:
        print("  No black openings")

Starting processing with sequential processing (parallel disabled)
Batch size: 100,000
Progress updates every: 5,000 games
Parquet file: /Users/a/Documents/personalprojects/chess-opening-recommender/data/raw/train-00000-of-00072.parquet
Loaded player data with 424507 players.
Resuming from batch 14.
Counting total rows in parquet file...
Total rows in parquet file: 1,394,617
Will process in 14 batches of size 100,000
Processed all rows. Finishing up.
Saved progress after batch 14. Current data includes 424507 players.

Performance Summary:
Total games processed: 0
Total processing time: 3.75 seconds
Overall processing rate: 0.00 games/second
Average batch processing time: 0.00 seconds

Processed data statistics:
Total number of players: 424,507

Sample stats for player: bayram0619
Rating: 850
Total games: 3

White openings:
  A46 - Indian Defense: Knights Variation: 100.0% score in 1 games
  D20 - Queen's Gambit Accepted: Old Variation: 0.0% score in 1 games

Black openings:
  C44 - Sc
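
The run above reports 0 games because a checkpoint from a completed run already existed. With 1,394,617 rows and a batch size of 100,000, resuming at batch 14 starts past the end of the file. The arithmetic, as a short sketch:

```python
total_rows = 1_394_617
batch_size = 100_000

# Ceiling division: the last batch is only partially filled
total_batches = (total_rows + batch_size - 1) // batch_size
last_batch_rows = total_rows - (total_batches - 1) * batch_size

# The saved checkpoint said "last_batch_processed": 14
resume_batch = 14
resume_offset = resume_batch * batch_size

print(total_batches)                # 14
print(last_batch_rows)              # 94617
print(resume_offset >= total_rows)  # True, so the loop exits immediately
```

To re-measure throughput, point `save_dir` at a fresh directory (or move the saved `.pkl` and `.json` files out of the way) before running the cell again.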

## Performance Comparison

Let's load the saved parquet performance summary and compare it with the PGN processing approach. No PGN timings are stored in this notebook, so the comparison below is qualitative.

In [10]:
# Load the performance summary from the saved progress
import json
from pathlib import Path

progress_path = Path("../data/processed/processing_progress_parquet.json")
if progress_path.exists():
    with open(progress_path, 'r') as f:
        progress_info = json.load(f)
        performance = progress_info.get("performance", {})
        
        print("Parquet Processing Performance:")
        print(f"Total games: {performance.get('total_games', 0):,}")
        print(f"Total processing time: {performance.get('total_time_sec', 0):.2f} seconds")
        print(f"Processing rate: {performance.get('overall_rate_games_per_sec', 0):.2f} games/second")
        
        # Compare to PGN processing (if we have that data)
        print("\nPerformance Comparison:")
        print("Parquet processing is typically 10-100x faster than PGN processing because:")
        print("1. No need to decompress and parse text files")
        print("2. Columnar storage allows loading only the columns we need")
        print("3. DuckDB provides optimized query execution")
        print("4. Batch processing with parallel execution")
else:
    print("No performance data available yet. Run the processing first.")

Parquet Processing Performance:
Total games: 0
Total processing time: 3.75 seconds
Processing rate: 0.00 games/second

Performance Comparison:
Parquet processing is typically 10-100x faster than PGN processing because:
1. No need to decompress and parse text files
2. Columnar storage allows loading only the columns we need
3. DuckDB provides optimized query execution
4. Batch processing with parallel execution


## Conclusion

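The parquet approach is expected to have significant performance advantages over the PGN processing approach:

1. **Data access**: Parquet's columnar format allows us to load only the fields we need
2. **Batch processing**: We can process data in fixed-size batches and checkpoint after each one
3. **Parallelization**: Batches are independent until the merge step, so multiple CPU cores could be used, although parallel processing is disabled in this notebook
4. **No parsing overhead**: Unlike PGN files, we don't need to parse text; the fields arrive as typed columns

The run recorded above resumed from a finished checkpoint and processed 0 new games, so it provides no throughput figure. For large datasets, we expect the parquet approach to be substantially faster than processing PGN files directly (the 10-100x figure printed above is an estimate, not something this notebook measured), which would make it much more suitable for processing millions of chess games.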
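
To turn the expected speedup into a measured one, both pipelines could be run on the same games and their rates compared. A minimal sketch follows, assuming each pipeline is wrapped in a zero-argument callable that returns the number of games it processed. `measure_rate` is a hypothetical helper, and the two workloads below are stand-ins, not the real pipelines.

```python
import time
from typing import Callable, Dict


def measure_rate(run: Callable[[], int]) -> Dict[str, float]:
    """Time one pipeline run and report its throughput in games per second."""
    start = time.perf_counter()
    games = run()
    elapsed = time.perf_counter() - start
    rate = games / elapsed if elapsed > 0 else 0.0
    return {"games": games, "seconds": elapsed, "games_per_sec": rate}


def stand_in_parquet_run() -> int:
    # Placeholder for: process_parquet_file(config) and counting the games
    return sum(1 for _ in range(200_000))


def stand_in_pgn_run() -> int:
    # Placeholder for the PGN pipeline; the string work imitates text parsing
    count = 0
    for i in range(200_000):
        str(i).split()
        count += 1
    return count


parquet_result = measure_rate(stand_in_parquet_run)
pgn_result = measure_rate(stand_in_pgn_run)

print(f"Parquet: {parquet_result['games_per_sec']:,.0f} games/sec")
print(f"PGN:     {pgn_result['games_per_sec']:,.0f} games/sec")
if pgn_result["games_per_sec"] > 0:
    print(f"Speedup: {parquet_result['games_per_sec'] / pgn_result['games_per_sec']:.1f}x")
```

For a fair comparison, both runs should start from an empty `save_dir` and cover the same set of games.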