# Parquet Data Performance Testing

This notebook measures performance metrics for processing chess game data from parquet files. The main objectives are:

1. Efficiently load data from parquet files using DuckDB
2. Process the data to gather player statistics and opening preferences
3. Log detailed performance metrics (games/sec, processing time, etc.)
4. Compare with the performance of processing PGN files directly

Update: This notebook already parses games quickly, so the focus has moved to better filtering.

NOTE: To adjust game filtering or processing parameters quickly, edit `ProcessingConfig`. I put as many relevant parameters in there as I could think of.
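
As a rough illustration of adjusting parameters through the config, here is a minimal sketch. `ProcessingConfigSketch` is a hypothetical stand-in, not the real `ProcessingConfig`: its field names follow the way this notebook constructs the config, and its defaults are simply the values used in the run cell, which may differ from the real class defaults.

```python
from dataclasses import dataclass, field, replace
from typing import Set


@dataclass
class ProcessingConfigSketch:
    """Hypothetical stand-in; field names mirror how this notebook uses ProcessingConfig."""
    parquet_path: str
    batch_size: int = 100_000
    save_interval: int = 1
    save_dir: str = "../data/processed"
    min_player_rating: int = 1200
    max_elo_difference_between_players: int = 100
    allowed_time_controls: Set[str] = field(
        default_factory=lambda: {"Blitz", "Rapid", "Classical"}
    )


base = ProcessingConfigSketch(parquet_path="")

# Tighten the filters without touching any processing code.
strict = replace(base, min_player_rating=1800, allowed_time_controls={"Classical"})
print(strict.min_player_rating, strict.allowed_time_controls, strict.batch_size)
```

`dataclasses.replace` returns a modified copy, so `base` keeps its original values and several filter variants can be compared side by side.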

In [None]:
import sys
from pathlib import Path

# Resolve paths so the `notebooks.*` imports work
current_dir = Path.cwd()  # The notebook's working directory
project_root = current_dir.parent  # Move up to the project root

# Add both to sys.path so imports work regardless of where the notebook is launched
for path in (current_dir, project_root):
    if str(path) not in sys.path:
        sys.path.append(str(path))

# Now try the import with the explicit path
from notebooks.utils.file_processing.types_and_classes import (  # noqa: E402
    PlayerStats,
    ProcessingConfig,
    PerformanceTracker,
)

from notebooks.utils.file_processing.save_and_load_progress import (  # noqa: E402
    save_progress,
    load_progress,
)

# Import the remaining libraries (sys and Path are already imported above)
from typing import Dict, Optional, List  # noqa: E402
import duckdb  # noqa: E402
import time  # noqa: E402
import pandas as pd  # noqa: E402
import psutil  # noqa: E402
import pickle  # noqa: E402

## Performance Metrics

Performance metrics are tracked during processing by the `PerformanceTracker` class imported above.
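
The tracker's source is not shown in this notebook, so the following is only a sketch of how such a class might time batches. `PerformanceTrackerSketch` is a hypothetical name; the method names (`start_batch`, `end_batch`, `get_summary`) and summary keys are copied from how the tracker is called later in this notebook, and the injectable `clock` parameter is an assumption added to make the example deterministic.

```python
import time
from typing import Callable, Dict, List, Optional


class PerformanceTrackerSketch:
    """Hypothetical stand-in for PerformanceTracker; only batch timing is modeled."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._batch_start: Optional[float] = None
        self.batch_times: List[float] = []
        self.total_games = 0

    def start_batch(self) -> None:
        self._batch_start = self._clock()

    def end_batch(self, num_games: int) -> float:
        """Record a finished batch and return its duration in seconds."""
        if self._batch_start is None:
            raise RuntimeError("end_batch() called before start_batch()")
        elapsed = self._clock() - self._batch_start
        self._batch_start = None
        self.batch_times.append(elapsed)
        self.total_games += num_games
        return elapsed

    def get_summary(self) -> Dict[str, float]:
        total_time = sum(self.batch_times)
        num_batches = len(self.batch_times)
        return {
            "total_games": self.total_games,
            "total_time_sec": total_time,
            "overall_rate_games_per_sec": self.total_games / total_time if total_time else 0.0,
            "avg_batch_time_sec": total_time / num_batches if num_batches else 0.0,
        }


# Fake clock: two batches that each take 2 seconds.
ticks = iter([0.0, 2.0, 10.0, 12.0])
tracker = PerformanceTrackerSketch(clock=lambda: next(ticks))
tracker.start_batch()
first = tracker.end_batch(1000)
tracker.start_batch()
second = tracker.end_batch(3000)
summary = tracker.get_summary()
print(summary)
```

Only time spent between `start_batch` and `end_batch` counts toward the rate here, so idle time between batches does not lower the games/sec figure.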

## Helper Functions

These functions will help us process the data and manage player statistics.

In [49]:
def is_valid_game(row: pd.Series, config: ProcessingConfig) -> bool:
    """
    Check if a game meets the filtering criteria. This ensures only relevant, informative games are processed.
    
    Args:
        row: A row from the DataFrame representing a game
        config: Processing configuration
        
    Returns:
        True if the game passes our filters, False otherwise
    """
    # Check player ratings
    if (row['WhiteElo'] < config.min_player_rating or 
        row['BlackElo'] < config.min_player_rating):
        return False

    # Check rating difference
    if abs(row['WhiteElo'] - row['BlackElo']) > config.max_elo_difference_between_players:
        return False

    # "Event" column on game contains time control, they're titled like "Rated Blitz Games"
    # Check that the time control is in the allowed time controls (case insensitive)
    event_lower = row["Event"].lower()
    if not any(tc.lower() in event_lower for tc in config.allowed_time_controls):
        return False

    # Check for valid result
    # If it's something weird that's not a win loss or draw, toss it out
    if row['Result'] not in {"1-0", "0-1", "1/2-1/2"}:
        return False

    return True
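
To make the filter rules concrete, here is a small worked example on plain dictionaries. `passes_filters` is a condensed restatement of the same four checks, `SimpleNamespace` stands in for `ProcessingConfig`, and the games are made-up illustrative rows, not data from the parquet files.

```python
from types import SimpleNamespace

VALID_RESULTS = {"1-0", "0-1", "1/2-1/2"}


def passes_filters(game, config) -> bool:
    """Same four checks as is_valid_game, written for any mapping-like row."""
    if min(game["WhiteElo"], game["BlackElo"]) < config.min_player_rating:
        return False
    if abs(game["WhiteElo"] - game["BlackElo"]) > config.max_elo_difference_between_players:
        return False
    event = game["Event"].lower()
    if not any(tc.lower() in event for tc in config.allowed_time_controls):
        return False
    return game["Result"] in VALID_RESULTS


# Illustrative values only; SimpleNamespace stands in for ProcessingConfig.
config = SimpleNamespace(
    min_player_rating=1200,
    max_elo_difference_between_players=100,
    allowed_time_controls={"Blitz", "Rapid", "Classical"},
)
games = [
    {"Event": "Rated Blitz game", "WhiteElo": 1500, "BlackElo": 1540, "Result": "1-0"},
    {"Event": "Rated Bullet game", "WhiteElo": 1500, "BlackElo": 1540, "Result": "0-1"},
    {"Event": "Rated Rapid game", "WhiteElo": 1500, "BlackElo": 1700, "Result": "1/2-1/2"},
    {"Event": "Rated Classical game", "WhiteElo": 1100, "BlackElo": 1150, "Result": "1-0"},
    {"Event": "Rated Blitz game", "WhiteElo": 1500, "BlackElo": 1540, "Result": "*"},
]
accepted = [passes_filters(g, config) for g in games]
print(accepted)
```

Only the first game passes: the others fail on time control, rating gap, minimum rating, and result respectively.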

## Main Processing Function

The main function, `process_parquet_file`, processes one parquet file in batches. It is defined further down, in the same cell as the system information checks.
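
As a sketch of what "in batches" means here: the file is read in fixed-size slices, each identified by a batch number and a row offset, with a ceiling division so a partial last batch still counts. `batch_offsets` is a hypothetical helper written for this illustration; the notebook itself does the same arithmetic inline.

```python
from typing import Iterator, Tuple


def batch_offsets(total_rows: int, batch_size: int,
                  start_batch: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (batch_num, offset) pairs for LIMIT/OFFSET-style batching."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division: a partial final batch still counts as a batch.
    total_batches = (total_rows + batch_size - 1) // batch_size
    for batch_num in range(start_batch, total_batches):
        yield batch_num, batch_num * batch_size


plan = list(batch_offsets(total_rows=250_000, batch_size=100_000))
resumed = list(batch_offsets(total_rows=250_000, batch_size=100_000, start_batch=2))
finished = list(batch_offsets(total_rows=250_000, batch_size=100_000, start_batch=3))
print(plan, resumed, finished)
```

Resuming from a saved batch number simply skips the earlier offsets, and a start batch past the end yields nothing, which corresponds to the "already done" case.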

## Multi-File Processing with Duplicate Detection

Before defining our main processing function, let's implement a system to handle duplicate file detection. This is essential when processing multiple parquet files that might have similar names but come from different months or batches. Our approach uses metadata fingerprinting to uniquely identify each file.

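This uses `FileRegistry` from the dupe-check utilities in our utils folder (`raw_data_file_dupe_checks`). If that import fails, the cell falls back to a simpler registry that only compares file paths.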
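
The registry's actual fingerprinting logic is not shown in this notebook. As one possible approach, a fingerprint can combine the file size with a hash of the first bytes, so that identity depends on content rather than on the file name. `file_fingerprint` and its `head_bytes` parameter are assumptions made for this sketch, not the real utility's API.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path: Path, head_bytes: int = 64 * 1024) -> str:
    """Hash the file size plus the first `head_bytes` bytes; the name is deliberately ignored."""
    digest = hashlib.sha256()
    digest.update(str(path.stat().st_size).encode())
    digest.update(b":")
    with open(path, "rb") as f:
        digest.update(f.read(head_bytes))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    jan = Path(tmp) / "jan" / "games.parquet"
    feb = Path(tmp) / "feb" / "games.parquet"
    copy = Path(tmp) / "games_copy.parquet"
    # Placeholder bytes, not real parquet content.
    for p, payload in [(jan, b"january games"), (feb, b"february games"), (copy, b"january games")]:
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(payload)
    same_name_differs = file_fingerprint(jan) != file_fingerprint(feb)
    renamed_copy_matches = file_fingerprint(jan) == file_fingerprint(copy)

print(same_name_differs, renamed_copy_matches)
```

Hashing only the head of the file keeps the check cheap for multi-gigabyte files, at the cost of missing differences that appear only later in the file; a full-file hash is the safer but slower alternative.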

In [50]:
# Add the parent directory to the Python path to enable imports
notebooks_dir = Path.cwd().parent
if str(notebooks_dir) not in sys.path:
    sys.path.append(str(notebooks_dir))

# Try to import our custom file registry utility
try:
    from notebooks.utils.file_processing.raw_data_file_dupe_checks import FileRegistry
    print("Successfully imported FileRegistry")
except ImportError:
    print("Could not import FileRegistry - file duplicate checks will not be available")
    
    # Define a simple FileRegistry class if the import fails
    class FileRegistry:
        """Simple FileRegistry implementation for duplicate detection."""
        def __init__(self):
            self.registry_path = Path(notebooks_dir) / "data/processed/file_registry.json"
            self.processed_files = set()
            self._load_registry()
            
        def _load_registry(self):
            import json
            if self.registry_path.exists():
                try:
                    with open(self.registry_path, 'r') as f:
                        data = json.load(f)
                        self.processed_files = set(data.get('processed_files', []))
                except Exception as e:
                    print(f"Warning: Could not load registry: {e}")
        
        def _save_registry(self):
            import json
            try:
                self.registry_path.parent.mkdir(parents=True, exist_ok=True)
                with open(self.registry_path, 'w') as f:
                    json.dump({'processed_files': list(self.processed_files)}, f)
            except Exception as e:
                print(f"Warning: Could not save registry: {e}")
                
        def is_file_processed(self, file_path: str) -> bool:
            return str(file_path) in self.processed_files
            
        def mark_file_processed(self, file_path: str) -> None:
            self.processed_files.add(str(file_path))
            self._save_registry()
            
        def mark_file_skipped(self, file_path: str) -> None:
            self.processed_files.add(str(file_path))
            self._save_registry()

Successfully imported FileRegistry


## System Information

Let's check the system's hardware resources to optimize our configuration.

In [51]:
# Install psutil if not already installed
import sys
import subprocess

try:
    import psutil
except ImportError:
    print("Installing psutil package...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "psutil"])
    import psutil
    print("psutil installed successfully")

def get_system_info():
    """Get information about the system's hardware resources."""
    info = {
        "cpu_count_physical": psutil.cpu_count(logical=False),
        "cpu_count_logical": psutil.cpu_count(logical=True),
        "memory_total_gb": round(psutil.virtual_memory().total / (1024**3), 2),
        "memory_available_gb": round(psutil.virtual_memory().available / (1024**3), 2)
    }
    return info

# Get system information
system_info = get_system_info()
print("System Information:")
for key, value in system_info.items():
    print(f"  {key}: {value}")

# Calculate optimal batch size based on available memory
# Assuming each row needs about 1KB of memory
available_memory_gb = system_info["memory_available_gb"]
memory_for_batch_gb = available_memory_gb * 0.3  # Use 30% of available memory
optimal_batch_size = int(memory_for_batch_gb * 1024**3 / 1024)  # 1KB per row

# Round to nearest 10,000
optimal_batch_size = max(10_000, round(optimal_batch_size / 10_000) * 10_000)

print(f"\nRecommended batch size based on memory: {optimal_batch_size:,}")

def process_parquet_file(config: ProcessingConfig, 
                         players_data: Dict[str, PlayerStats],
                         log_frequency: int = 5000, 
                         file_context: Optional[Dict] = None) -> Dict[str, PlayerStats]:
    """
    Process a single parquet file in batches, updating a shared players_data dictionary.
    This function orchestrates the processing of one file. It uses a file registry
    to skip already processed files and manages batching. For each batch, it calls
    process_batch to perform the actual game data processing.
    
    Args:
        config: Processing configuration for this specific file.
        players_data: The shared dictionary of player statistics to update.
        log_frequency: How often to log progress within a batch.
        file_context: Dictionary with context for multi-file processing.
        
    Returns:
        The same players_data dictionary, updated in place.
    """
    # Check if file has already been processed using FileRegistry
    registry = None  # Stays None if the registry cannot be created
    try:
        registry = FileRegistry()
        if registry.is_file_processed(config.parquet_path):
            print(f"Skipping already processed file: {Path(config.parquet_path).name}")
            return players_data  # Return the unchanged data, matching the other exit paths
    except Exception as e:
        print(f"Warning: Could not check file registry: {e}")
    
    # Initialize DuckDB connection
    con = duckdb.connect()
    
    # Remove any progress file left over from an earlier file or run, so that its
    # batch counter cannot affect this file.
    progress_path = Path(config.save_dir) / "processing_progress_parquet.json"
    if progress_path.exists():
        try:
            progress_path.unlink()
            print(f"Reset progress file for new run: {progress_path}")
        except OSError as e:
            print(f"Error removing progress file: {e}")
    
    # Read the starting batch number from load_progress. Because the progress file was
    # just reset, this is expected to be the first batch (assuming load_progress reads
    # that file). The player data it returns is discarded: we keep using the
    # `players_data` dictionary that was passed in, which contains the combined data
    # from all previous files.
    _, start_batch = load_progress(config)

    # Initialize performance tracker
    perf_tracker = PerformanceTracker()
    
    # Get total number of rows
    print("Counting total rows in parquet file...")
    total_rows = con.execute(
        f"SELECT COUNT(*) FROM '{config.parquet_path}'"
    ).fetchone()[0]
    print(f"Total rows in parquet file: {total_rows:,}")
    
    # If resuming, check if we're already done
    if start_batch * config.batch_size >= total_rows:
        print(f"Resuming from batch {start_batch}, which is after the end of the file. Skipping.")
        # Mark the file as processed since we are technically done with it
        try:
            registry.mark_file_processed(config.parquet_path)
            print(f"Marked file {Path(config.parquet_path).name} as processed in the registry")
        except Exception as e:
            print(f"Warning: Could not update file registry: {e}")
        return players_data

    # Calculate number of batches
    total_batches = (total_rows + config.batch_size - 1) // config.batch_size
    print(f"Will process in {total_batches} batches of size {config.batch_size:,} (resuming from batch {start_batch})")
    
    # Process in batches
    batch_num = start_batch
    
    while True:
        # Calculate offset for the current batch
        offset = batch_num * config.batch_size
        
        # Check if we've processed all rows
        if offset >= total_rows:
            print("Processed all rows. Finishing up.")
            break
        
        print(f"\nProcessing batch {batch_num + 1}/{total_batches} (offset {offset:,})")
        perf_tracker.start_batch()
        
        # Fetch a batch of data
        batch_query = f"""
        SELECT 
            Event, White, Black, Result, 
            WhiteTitle, BlackTitle, WhiteElo, BlackElo, 
            WhiteRatingDiff, BlackRatingDiff, ECO, Opening,
            Termination, TimeControl
        FROM '{config.parquet_path}'
        LIMIT {config.batch_size} OFFSET {offset}
        """
        
        # Execute the query and convert to DataFrame
        batch_df = con.execute(batch_query).df()
        
        if batch_df.empty:
            print("Loaded an empty batch. This might mean we are past the end of the file.")
            break

        print(f"Loaded batch with {len(batch_df):,} rows")
        
        # Process the batch, updating players_data in-place
        process_batch(
            batch_df, 
            players_data, 
            config, 
            log_frequency=log_frequency, 
            perf_tracker=perf_tracker, 
            file_context=file_context
        )
        
        # Record batch completion
        batch_time = perf_tracker.end_batch(len(batch_df))
        print(f"Processed batch in {batch_time:.2f} seconds")
        print(f"Current player count: {len(players_data):,}")
        
        # Log performance metrics
        perf_tracker.log_progress(force=True)
        
        # Save progress periodically
        batch_num += 1
        if batch_num % config.save_interval == 0:
            save_progress(players_data, batch_num, config, perf_tracker)
    
    # All batches are loaded, so release the DuckDB connection
    con.close()
    
    # Save final progress
    save_progress(players_data, batch_num, config, perf_tracker)
    
    # Mark the file as processed in the registry
    try:
        registry.mark_file_processed(config.parquet_path)
        print(f"Marked file {Path(config.parquet_path).name} as processed in the registry")
    except Exception as e:
        print(f"Warning: Could not update file registry: {e}")
    
    # Print final performance summary
    summary = perf_tracker.get_summary()
    print("\nPerformance Summary:")
    print(f"Total games processed: {summary['total_games']:,}")
    print(f"Total processing time: {summary['total_time_sec']:.2f} seconds")
    print(f"Overall processing rate: {summary['overall_rate_games_per_sec']:.2f} games/second")
    print(f"Average batch processing time: {summary['avg_batch_time_sec']:.2f} seconds")
    
    # Add filtering stats to the final summary
    print("\nFiltering Statistics:")
    print(f"Accepted games: {summary['accepted_games']:,}")
    print(f"Filtered out games: {summary['filtered_games']:,}")
    print(f"Acceptance rate: {summary['acceptance_rate_percent']}%")
    
    return players_data

System Information:
  cpu_count_physical: 6
  cpu_count_logical: 12
  memory_total_gb: 32.0
  memory_available_gb: 20.33

Recommended batch size based on memory: 6,400,000
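
The recommended batch size printed above can be reproduced by hand. The sketch below wraps the same arithmetic in a hypothetical helper, `recommend_batch_size`, keeping the notebook's assumptions of roughly 1 KB per row and 30% of available memory.

```python
def recommend_batch_size(available_gb: float, fraction: float = 0.3,
                         bytes_per_row: int = 1024, step: int = 10_000) -> int:
    """Rows that fit in `fraction` of available memory, rounded to the nearest `step`."""
    rows = int(available_gb * fraction * 1024**3 / bytes_per_row)
    # Never go below one step, even on a machine with very little free memory.
    return max(step, round(rows / step) * step)


# 20.33 GB available -> about 6.1 GB for the batch -> about 6.4 million 1 KB rows
from_logged_run = recommend_batch_size(20.33)
low_memory = recommend_batch_size(0.01)
print(f"{from_logged_run:,}", f"{low_memory:,}")
```

The 1 KB per row figure is a rough guess, so the result is an upper bound to tune from rather than a measured optimum.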


In [52]:
def process_multiple_parquet_files(
    file_paths: List[str],
    base_config: Optional[ProcessingConfig] = None,
    log_frequency: int = 5000,
) -> Dict[str, PlayerStats]:
    """
    Process multiple parquet files of raw game data, avoiding duplicates.
    This is the main entry point for processing a collection of files. It initializes
    a single 'all_players_data' dictionary that is shared and updated across all
    file processing jobs. It handles duplicate file detection and orchestrates
    the processing of each file by calling 'process_parquet_file'.
    
    Args:
        file_paths: List of paths to parquet files to process.
        base_config: Base configuration to use as a template for each file.
        log_frequency: How often to log progress within a batch.
        
    Returns:
        The final, combined player statistics from all processed files.
    """
    if not file_paths:
        print("No files provided for processing")
        return {}
    
    # Create a default config if none provided
    if base_config is None:
        base_config = ProcessingConfig(
            parquet_path="",  # Will be set for each file
            batch_size=100_000,
            save_interval=1,
            save_dir="../data/processed"
        )
    
    # Initialize file registry
    try:
        registry = FileRegistry()
    except NameError:
        print("FileRegistry not available, skipping duplicate detection")
        registry = None
    
    # Filter out already processed files
    new_files = []
    for file_path in file_paths:
        if registry and registry.is_file_processed(file_path):
            print(f"Skipping already processed file: {Path(file_path).name}")
            try:
                registry.mark_file_skipped(file_path)
            except Exception as e:
                print(f"Warning: Could not mark file as skipped: {e}")
            continue
        new_files.append(file_path)
    
    if not new_files:
        print("No new files to process.")
        return {}
    
    print(f"Found {len(new_files)} new files to process out of {len(file_paths)} total files.")
    
    # Estimate total rows for the ETA calculation.
    # For simplicity, use the first file's row count as the average and multiply.
    # new_files is guaranteed to be non-empty here because of the early return above.
    total_rows_estimate = 0
    avg_rows_per_file = 0
    try:
        con = duckdb.connect()
        try:
            first_file_rows = con.execute(f"SELECT COUNT(*) FROM '{new_files[0]}'").fetchone()[0]
        finally:
            con.close()  # Close the connection even if the count query fails
        avg_rows_per_file = first_file_rows
        total_rows_estimate = avg_rows_per_file * len(new_files)
        print(f"Estimating total of {total_rows_estimate:,} rows across {len(new_files)} files for ETA.")
    except Exception as e:
        print(f"Warning: Could not estimate total rows for ETA: {e}")

    # Load all existing player data and progress once at the beginning.
    # This single dictionary will be passed to and updated by all subsequent function calls.
    all_players_data, _ = load_progress(base_config)
    total_start_time = time.time()
    
    for i, file_path in enumerate(new_files):
        print(f"\nProcessing file {i+1}/{len(new_files)}: {Path(file_path).name}")
        
        # Create a config for this specific file
        file_config = ProcessingConfig(
            parquet_path=file_path,
            batch_size=base_config.batch_size,
            save_interval=base_config.save_interval,
            save_dir=base_config.save_dir,
            min_player_rating=base_config.min_player_rating,
            max_elo_difference_between_players=base_config.max_elo_difference_between_players,
            allowed_time_controls=base_config.allowed_time_controls
        )
        
        file_context = {
            "current_file_num": i + 1,
            "total_files": len(new_files),
            "total_rows_estimate": total_rows_estimate,
            "avg_rows_per_file": avg_rows_per_file,
            "total_start_time": total_start_time,
        }

        try:
            # Process the file, passing in the single, shared data dictionary
            # to be updated in-place.
            process_parquet_file(
                file_config, 
                all_players_data,
                log_frequency=log_frequency,
                file_context=file_context
            )
            
        except Exception as e:
            print(f"Error processing {file_path}: {str(e)}")
    
    return all_players_data

## Run Processing

Now let's run the processing with our multi-file processing utility. This will allow us to process multiple parquet files at once while handling duplicate detection.

In [None]:
# Import our multi-file processing utility
from notebooks.utils.file_processing.process_multiple_raw_files import process_multiple_files

# Get the processing configuration using the utility
# This will show a directory picker dialog and find all parquet files
processing_config = process_multiple_files(
    # Let the user select a directory via dialog
    directory=None,  
    # Determine batch size automatically based on memory
    batch_size=None,  
    # Use the same filtering parameters as before
    min_player_rating=1200,
    max_elo_difference=100,
    allowed_time_controls={"Blitz", "Rapid", "Classical"},
    save_dir="../data/processed"
)

print("\nProcessing Configuration:")
for key, value in processing_config.items():
    if key != "files_to_process":  # Don't print the full file paths
        print(f"  {key}: {value}")
    else:
        print(f"  {key}: {len(value)} files")
        # Print the first 3 file names as examples
        for file_path in value[:3]:
            print(f"    - {Path(file_path).name}")
        if len(value) > 3:
            print(f"    - ... and {len(value) - 3} more files")

files_to_process = processing_config.get("files_to_process", [])

if not files_to_process:
    print("No new files to process. Exiting.")
else:
    print(f"Starting processing of {len(files_to_process)} files...")
    
    base_config = ProcessingConfig(
        parquet_path="",  # Will be set for each file
        batch_size=processing_config["batch_size"],
        save_interval=1,  # Save after each batch
        save_dir=processing_config["save_dir"],
        min_player_rating=processing_config["min_player_rating"],
        max_elo_difference_between_players=processing_config["max_elo_difference"],
        allowed_time_controls=processing_config["allowed_time_controls"]
    )
    
    all_players_data = process_multiple_parquet_files(
        files_to_process,
        base_config=base_config,
        log_frequency=5000
    )
    
    print("\nFinal combined data statistics:")
    print(f"Total number of players: {len(all_players_data):,}")
    
    # Save the final merged data separately
    final_save_path = Path(processing_config["save_dir"]) / "all_players_stats_combined.pkl"
    with open(final_save_path, 'wb') as f:
        pickle.dump(all_players_data, f)
    
    print(f"Saved final merged data to: {final_save_path}")
    
    # Show an example player from the combined data
    if all_players_data:
        import random
        sample_player = random.choice(list(all_players_data.keys()))
        print(f"\nSample stats for player from combined data: {sample_player}")
        print(f"Rating: {all_players_data[sample_player]['rating']}")
        print(f"Total games: {all_players_data[sample_player]['num_games_total']}")
        
        print("\nTop White openings:")
        white_openings = all_players_data[sample_player]['white_games']
        if white_openings:
            # Sort by number of games
            sorted_openings = sorted(
                white_openings.items(), 
                key=lambda x: x[1]['results']['num_games'], 
                reverse=True
            )
            for eco, data in sorted_openings[:5]: 
                print(f"  {eco} - {data['opening_name']}: " +
                      f"{data['results']['score_percentage_with_opening']}% score in {data['results']['num_games']} games")
            if len(white_openings) > 5:
                print(f"  ... and {len(white_openings) - 5} more openings")
        else:
            print("  No white openings")

        print("\nTop Black openings:")
        black_openings = all_players_data[sample_player]['black_games']
        if black_openings:
            # Sort by number of games
            sorted_openings = sorted(
                black_openings.items(), 
                key=lambda x: x[1]['results']['num_games'], 
                reverse=True
            )
            for eco, data in sorted_openings[:5]: 
                print(f"  {eco} - {data['opening_name']}: " + 
                      f"{data['results']['score_percentage_with_opening']}% score in {data['results']['num_games']} games")
            if len(black_openings) > 5:
                print(f"  ... and {len(black_openings) - 5} more openings")
        else:
            print("  No black openings")

## Usage Instructions

To process multiple parquet files:

1. Run the cell above that calls `process_multiple_files()`
2. A directory picker dialog will appear - select the folder containing your parquet files
3. The utility will identify new files (not previously processed) and process them one by one
4. All player statistics will be merged into a combined dataset

You can keep adding new parquet files to the same directory. When you run this notebook again, it processes only the new ones, so the dataset can grow incrementally over time.