# Purpose

This notebook combines several steps that we previously ran separately.

Its purpose is to download massive amounts of raw chess game data, process it, and then delete the downloaded raw files.

## 1. Download raw data

This step downloads raw parquet files through the Hugging Face API.

Each parquet file is roughly 1 GB and contains a large set of Lichess games.

This notebook downloads all of the parquet files for a certain month (usually 60-70 files). They will be automatically deleted after they're processed.

## 2. Process the data

For each parquet file that is downloaded, this notebook will process the games in the file one by one, extracting useful data for training our model.

## 3. Delete the data

After each parquet file is done being processed, it will be deleted from the local disk.

These files are 1GB each and we will be downloading hundreds or thousands of them, so deleting them is a must.
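
The download, process, delete cycle described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual implementation: `process_then_delete` and the stand-in `process` callback are hypothetical names, and the demo uses tiny temporary files instead of real parquet shards.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def process_then_delete(files: List[Path], process: Callable[[Path], bool]) -> List[Path]:
    """Process each file and delete it only if processing succeeded.

    Returns the files that were kept because processing failed.
    """
    kept = []
    for path in files:
        if process(path):
            path.unlink()  # free disk space as soon as the file is done
        else:
            kept.append(path)  # keep failed files so they can be retried
    return kept


# Demo with small temporary files standing in for ~1 GB parquet shards.
tmp_dir = Path(tempfile.mkdtemp())
demo_files = []
for name in ["a.parquet", "b.parquet", "bad.parquet"]:
    p = tmp_dir / name
    p.write_bytes(b"placeholder")
    demo_files.append(p)

# Hypothetical processor: pretend any file whose name starts with "bad" fails.
remaining = process_then_delete(demo_files, lambda p: not p.name.startswith("bad"))
print([p.name for p in remaining])
```

Deleting only after a successful run means a failed file stays on disk and can be retried later.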


In [None]:
import sys
from pathlib import Path

# Determine paths to ensure imports work correctly
current_file = Path.cwd()
project_root = current_file.parent  # Move up to the project root

# Add both to path to ensure imports work regardless of structure
if str(current_file) not in sys.path:
    sys.path.append(str(current_file))
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

from notebooks.utils.file_processing.process_game_batch import (  # noqa: E402
    process_batch,
)  # noqa: E402

# Now try the import with the explicit path
from notebooks.utils.file_processing.types_and_classes import (  # noqa: E402
    PlayerStats,
    ProcessingConfig,
    PerformanceTracker,
)

from notebooks.utils.file_processing.save_and_load_progress import (  # noqa: E402
    save_progress,
    load_progress,
)


from notebooks.utils.file_processing.raw_data_file_dupe_checks import (  # noqa: E402
    FileRegistry,
)  # noqa: E402

# Import necessary libraries (sys and Path are already imported above)
import os  # noqa: E402  # used later by os.remove when deleting files
from typing import Dict, Optional  # noqa: E402
import duckdb  # noqa: E402
import time  # noqa: E402
import pickle  # noqa: E402

# 1. Auto-Download Parquet Files

## Wi-Fi Speed

My personal Wi-Fi is quite fast (350 Mbps). This is many gigabytes of data, but download speed isn't the bottleneck, because my system takes longer to process each file than to download the next one. If your download speed is slower, you may need to adjust.
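
As a rough back-of-the-envelope check, the sketch below estimates the download time for one file at a given link speed. It assumes a file size of exactly 1 GB (decimal) and ignores protocol overhead and server-side limits; `download_seconds` is a hypothetical helper, and the count of 66 files is taken from the shard names in this notebook's output.

```python
def download_seconds(file_size_gb: float, link_speed_mbps: float) -> float:
    """Estimate download time, ignoring protocol overhead and server limits."""
    file_size_megabits = file_size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 megabits
    return file_size_megabits / link_speed_mbps


per_file = download_seconds(1.0, 350)
print(f"{per_file:.1f} s per 1 GB file")
print(f"{per_file * 66 / 60:.1f} min for 66 files")
```

At 350 Mbps this comes to about 23 seconds per file. If processing a file takes longer than that on your machine, the download is not the limiting step.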

## Helper Functions for downloading raw data

Import helper functions that were defined elsewhere.

In [None]:
from utils.downloading_raw_parquet_data.api_interaction import (
    get_urls_from_hub_api,
    get_urls_from_dataset_viewer,
    filter_urls_for_month
)
from utils.downloading_raw_parquet_data.file_downloader import (
    probe_fallback_urls,
    download_file
)

## Processing and Deletion Configuration

Here we define configurations for processing the raw data that we will download.
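
To illustrate what the game filtering rules in this configuration mean, here is a minimal sketch of a filter predicate. It is an assumption about how the three rules combine; the real filtering happens inside `process_batch`, which is not shown in this notebook, and `passes_filters` is a hypothetical name. It also assumes each game carries a time-control label such as "Blitz".

```python
from typing import Set


def passes_filters(white_elo: int, black_elo: int, time_control: str,
                   min_player_rating: int, max_elo_difference: int,
                   allowed_time_controls: Set[str]) -> bool:
    """Return True if a game satisfies all three filtering rules."""
    # Both players must meet the minimum rating.
    if min(white_elo, black_elo) < min_player_rating:
        return False
    # The two players must be closely matched.
    if abs(white_elo - black_elo) > max_elo_difference:
        return False
    # Only the configured time controls are kept.
    return time_control in allowed_time_controls


rules = {
    "min_player_rating": 1200,
    "max_elo_difference": 100,
    "allowed_time_controls": {"Blitz", "Rapid", "Classical"},
}
print(passes_filters(1500, 1550, "Blitz", **rules))
print(passes_filters(1500, 1700, "Rapid", **rules))
```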


In [None]:

# CONFIG FOR PROCESSING
# ---------------------
# This dictionary holds all the settings for the data processing part of the notebook.

# --- Deletion Toggle ---
# Set this to True to automatically delete parquet files after they are successfully processed.
# Set it to False if you want to keep the raw files for repeated experiments.
is_delete_downloaded_parquet_files = True

# --- Processing Parameters ---
# These parameters are used to filter the chess games from the raw files.
# We only want to process games that are relevant and high-quality for our model.
processing_config = {
    # The directory where processed data and progress files will be saved.
    "save_dir": str(project_root / "data" / "processed"),
    
    # The number of games to process in a single batch. This is optimized based on available memory.
    "batch_size": 100_000,
    
    # How often to save progress (e.g., every 1 batch).
    "save_interval": 1,
    
    # --- Game Filtering Rules ---
    # Minimum rating for players in a game to be included.
    "min_player_rating": 1200,
    
    # Maximum ELO difference between the two players.
    "max_elo_difference": 100,
    
    # Allowed time controls (e.g., "Blitz", "Rapid"). Games with other time controls will be skipped.
    "allowed_time_controls": {"Blitz", "Rapid", "Classical"},
}

# Create the directory for processed data if it doesn't exist.
Path(processing_config["save_dir"]).mkdir(parents=True, exist_ok=True)


## Configuration for Downloading Raw Data

Here we will organize some config items for the downloading of raw data parquet files.
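
The probe patterns in this configuration are URL templates with named placeholders. The sketch below shows how one such template expands with `str.format`. How `probe_fallback_urls` fills the templates internally is an assumption, and `total=66` is simply the shard count seen in this notebook's output.

```python
pattern = (
    "https://huggingface.co/datasets/{repo}/resolve/main/"
    "data/year={year}/month={month}/train-{idx:05d}-of-{total:05d}.parquet"
)

month_padded = str("7").zfill(2)  # "7" -> "07"
url = pattern.format(
    repo="Lichess/standard-chess-games",
    year="2025",
    month=month_padded,
    idx=0,       # shard index, zero-padded to 5 digits by {idx:05d}
    total=66,    # assumed shard count for the month
)
print(url)
```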

In [1]:
from pathlib import Path
import sys

project_root = (
    Path(__file__).resolve().parent.parent
    if "__file__" in globals()
    else Path.cwd().parent
)
output_dir = str(project_root / "data" / "raw" / "auto_download_parquets")

# CONFIG
# I did my best to put all of the variables that we might need to adjust here.
# This is a proof of concept right now; later we may make this some sort of drop down picker for month, year etc
config = {
    "repo": "Lichess/standard-chess-games",  # Hugging Face repo id
    "year": "2025",  # 4-digit year (string or int)
    "month": "7",  # numeric month (e.g., "7" or "07")
    "max_parquets": 30,  # int or None to download all available
    # Download to /data/raw/auto_download_parquets relative to project root
    "output_dir": output_dir,
    "hf_token": None,  # set to your HF token string if you need to access gated datasets
    "probe_max_attempts": 1000,  # for fallback probing
    "probe_patterns": [  # tried in order if APIs gave no URLs
        # Pattern A: common "train-00000-of-00066.parquet" style
        "https://huggingface.co/datasets/{repo}/resolve/main/data/year={year}/month={month}/train-{idx:05d}-of-{total:05d}.parquet",
        # Pattern B: some datasets use plain shard names
        "https://huggingface.co/datasets/{repo}/resolve/main/data/year={year}/month={month}/train-{idx:05d}.parquet",
        # Pattern C: fall back to zero-padded 4-digit name (e.g. 0000.parquet, 0012.parquet)
        "https://huggingface.co/datasets/{repo}/resolve/main/data/year={year}/month={month}/{idx:04d}.parquet",
    ],
}

# Extract config variables for use in the rest of the notebook
repo = config["repo"]
year = str(config["year"])
month_raw = str(config["month"])
month_padded = month_raw.zfill(2)
max_parquets = config["max_parquets"]
out_dir = Path(config["output_dir"])
out_dir.mkdir(parents=True, exist_ok=True)
hf_headers = {"Authorization": f"Bearer {config['hf_token']}"} if config.get("hf_token") else {}

## Main Raw Data Downloading Execution Flow

The following cell contains the main logic for querying parquet URLs, filtering them, and downloading the files.
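
One detail of the download loop is how the local file name is derived from a URL and how already-downloaded files are skipped. A minimal sketch of that step, assuming a percent-encoded URL for illustration (the URL and `demo_dir` are made up for this example):

```python
import tempfile
import urllib.parse
from pathlib import Path

url = (
    "https://huggingface.co/datasets/Lichess/standard-chess-games/resolve/main/"
    "data/year%3D2025/month%3D07/train-00000-of-00066.parquet"
)

# Decode percent-escapes, then take the last path component as the file name.
filename = Path(urllib.parse.unquote(url)).name

demo_dir = Path(tempfile.mkdtemp())  # stands in for the real output directory
dest = demo_dir / filename
print(filename, "exists" if dest.exists() else "needs download")
```

Checking `dest.exists()` before downloading is what lets an interrupted run resume without fetching the same file again.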

In [None]:
# This cell downloads parquet files from a specified Hugging Face repository
# for a given month and year, processes them, and optionally deletes them after processing.

import urllib.parse  # plain `import urllib` does not guarantee that the `parse` submodule is loaded

# Main flow for downloading files
print("Querying for parquet URLs...")
urls = get_urls_from_hub_api(repo, hf_headers)
if not urls:
    urls = get_urls_from_dataset_viewer(repo, hf_headers)

filtered = filter_urls_for_month(urls, year, month_padded)

if not filtered:
    print("No URLs found via APIs, falling back to probing...")
    patterns = config.get("probe_patterns", [])
    filtered = probe_fallback_urls(repo, year, month_padded, config["probe_max_attempts"], patterns, hf_headers)

if not filtered:
    print("ERROR: No parquet URLs discovered. Aborting.")
    sys.exit(1)

if max_parquets is not None:
    filtered = filtered[:int(max_parquets)]

print(f"\nWill attempt to download {len(filtered)} file(s) into {out_dir.resolve()}\n")

success_count = 0
start_time = time.time()

for i, url in enumerate(filtered):
    filename = Path(urllib.parse.unquote(url)).name
    dest = out_dir / filename
    
    if dest.exists():
        print(f"[{i+1}/{len(filtered)}] SKIPPING (exists): {filename}")
        success_count += 1
        continue
    
    print(f"[{i+1}/{len(filtered)}] DOWNLOADING: {filename}")
    ok = download_file(url, dest, hf_headers)
    
    if ok:
        success_count += 1
        print(f"  SUCCESS: Downloaded {filename}")
    else:
        print(f"  FAILURE: Could not download {filename}")
        if not urls: # If we were probing, a failure means we stop.
            print("  Stopping probe-based download.")
            break

total_time = time.time() - start_time
print(f"\nDownload phase complete. {success_count} of {len(filtered)} files are ready for processing.")
print(f"Total download time: {total_time / 60:.2f} minutes.")

# --- Part 2: Process and Delete Files ---
print("\n--- Starting Processing and Deletion Phase ---")

# Set up the base configuration for processing.
# This will be used for each file that gets processed.
base_processing_config = ProcessingConfig(
    parquet_path="",  # This will be updated for each file.
    batch_size=processing_config["batch_size"],
    save_interval=processing_config["save_interval"],
    save_dir=processing_config["save_dir"],
    min_player_rating=processing_config["min_player_rating"],
    max_elo_difference_between_players=processing_config["max_elo_difference"],
    allowed_time_controls=processing_config["allowed_time_controls"]
)

# Call the main function to process all files in the directory.
process_and_delete_files(
    base_config=base_processing_config,
    delete_files=is_delete_downloaded_parquet_files
)

print("\n--- All Done! ---")

Querying for parquet URLs...
No URLs found via APIs, falling back to probing...

Will attempt to download 30 file(s) into /Users/a/Documents/personalprojects/chess-opening-recommender/data/raw/auto_download_parquets

[1/30] SKIPPING (exists): train-00000-of-00066.parquet
[2/30] SKIPPING (exists): train-00001-of-00066.parquet
[3/30] SKIPPING (exists): train-00002-of-00066.parquet
[4/30] SKIPPING (exists): train-00003-of-00066.parquet
[5/30] SKIPPING (exists): train-00004-of-00066.parquet
[6/30] SKIPPING (exists): train-00005-of-00066.parquet
[7/30] SKIPPING (exists): train-00006-of-00066.parquet
[8/30] SKIPPING (exists): train-00007-of-00066.parquet
[9/30] SKIPPING (exists): train-00008-of-00066.parquet
[10/30] SKIPPING (exists): train-00009-of-00066.parquet
[11/30] SKIPPING (exists): train-00010-of-00066.parquet
[12/30] SKIPPING (exists): train-00011-of-00066.parquet
[13/30] SKIPPING (exists): train-00012-of-00066.parquet
[14/30] SKIPPING (exists): train-00013-of-00066.parquet
[15/30] 

KeyboardInterrupt: 

## Processing Helper Functions

The following functions are taken from `03_parquet_performance.ipynb` and are responsible for processing the raw parquet files. They handle batching, duplicate checking, and performance tracking.
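
The batching used by these helpers comes down to a ceiling division and a running offset. A minimal sketch of that arithmetic, with `batch_offsets` as a hypothetical helper name:

```python
def batch_offsets(total_rows: int, batch_size: int, start_batch: int = 0):
    """Yield (batch_number, offset) pairs covering all rows from start_batch on."""
    total_batches = (total_rows + batch_size - 1) // batch_size  # ceiling division
    for batch_num in range(start_batch, total_batches):
        yield batch_num, batch_num * batch_size


print(list(batch_offsets(250_000, 100_000)))
# [(0, 0), (1, 100000), (2, 200000)]
```

Each offset can then be used in a `LIMIT ... OFFSET ...` query, and passing a non-zero `start_batch` resumes from a saved position.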


In [6]:
def process_parquet_file(config: ProcessingConfig, 
                         players_data: Dict[str, PlayerStats],
                         log_frequency: int = 5000, 
                         file_context: Optional[Dict] = None) -> bool:
    """
    Process a single parquet file in batches, updating a shared players_data dictionary.
    This function is a direct copy from 03_parquet_performance.ipynb.
    
    Args:
        config: Processing configuration for this specific file.
        players_data: The shared dictionary of player statistics to update.
        log_frequency: How often to log progress within a batch.
        file_context: Dictionary with context for multi-file processing.
        
    Returns:
        True if processing was successful, False otherwise.
    """
    try:
        # Initialize DuckDB connection
        con = duckdb.connect()
        
        # Reset progress file for a new run to avoid conflicts.
        progress_path = Path(config.save_dir) / "processing_progress_parquet.json"
        if progress_path.exists():
            progress_path.unlink()

        _, start_batch = load_progress(config)
        perf_tracker = PerformanceTracker()
        
        total_rows = con.execute(f"SELECT COUNT(*) FROM '{config.parquet_path}'").fetchone()[0]
        if total_rows == 0:
            print("File is empty, skipping.")
            return True # Return True to allow deletion of empty file

        total_batches = (total_rows + config.batch_size - 1) // config.batch_size
        print(f"Will process {total_rows:,} rows in {total_batches} batches of size {config.batch_size:,}")
        
        batch_num = start_batch
        while batch_num * config.batch_size < total_rows:
            offset = batch_num * config.batch_size
            print(f"\nProcessing batch {batch_num + 1}/{total_batches} (offset {offset:,})")
            perf_tracker.start_batch()
            
            batch_query = f"SELECT * FROM '{config.parquet_path}' LIMIT {config.batch_size} OFFSET {offset}"
            batch_df = con.execute(batch_query).df()
            
            if batch_df.empty:
                break

            process_batch(batch_df, players_data, config, log_frequency, perf_tracker, file_context)
            
            batch_time = perf_tracker.end_batch(len(batch_df))
            print(f"Processed batch in {batch_time:.2f} seconds. Current player count: {len(players_data):,}")
            perf_tracker.log_progress(force=True)
            
            batch_num += 1
            if batch_num % config.save_interval == 0:
                save_progress(players_data, batch_num, config, perf_tracker)
        
        save_progress(players_data, batch_num, config, perf_tracker)
        
        summary = perf_tracker.get_summary()
        print("\nFile Processing Summary:")
        for key, value in summary.items():
            print(f"  {key}: {value}")
            
        return True
    except Exception as e:
        print(f"An error occurred during file processing: {e}")
        return False
    finally:
        if 'con' in locals():
            con.close()


In [None]:
def process_and_delete_files(base_config: ProcessingConfig, delete_files: bool):
    """
    Processes all .parquet files in the download directory and optionally deletes them.
    This function scans the directory, checks for duplicates, processes new files,
    and then deletes them if the toggle is enabled.
    """
    # Load existing player data once at the beginning.
    # This dictionary will be updated by all file processing jobs.
    all_players_data, _ = load_progress(base_config)
    print(f"Loaded initial data. Total players so far: {len(all_players_data):,}")
    
    registry = FileRegistry()
    
    # Get all parquet files from the download directory, sorted for a deterministic order
    files_to_process = sorted(out_dir.glob("*.parquet"))
    
    if not files_to_process:
        print("No parquet files found in the download directory to process.")
        return

    print(f"Found {len(files_to_process)} parquet files to process.")

    for i, file_path in enumerate(files_to_process):
        print(f"\n--- Processing file {i+1}/{len(files_to_process)}: {file_path.name} ---")
        
        # Check if the file has already been processed.
        if registry.is_file_processed(str(file_path)):
            print("File has already been processed. Skipping.")
            if delete_files:
                try:
                    os.remove(file_path)
                    print(f"Deleted duplicate file: {file_path.name}")
                except OSError as e:
                    print(f"Error deleting duplicate file: {e}")
            continue

        # Create a specific configuration for this file.
        file_config = ProcessingConfig(
            parquet_path=str(file_path),
            batch_size=base_config.batch_size,
            save_interval=base_config.save_interval,
            save_dir=base_config.save_dir,
            min_player_rating=base_config.min_player_rating,
            max_elo_difference_between_players=base_config.max_elo_difference_between_players,
            allowed_time_controls=base_config.allowed_time_controls
        )
        
        # Process the file. The `all_players_data` dict is updated in-place.
        success = process_parquet_file(file_config, all_players_data)
        
        # If processing was successful, mark it in the registry and delete if enabled.
        if success:
            registry.mark_file_processed(str(file_path))
            print(f"Successfully processed and marked file: {file_path.name}")
            if delete_files:
                try:
                    os.remove(file_path)
                    print(f"Deleted processed file: {file_path.name}")
                except OSError as e:
                    print(f"Error deleting processed file: {e}")
        else:
            print(f"Processing failed for {file_path.name}. The file will not be deleted.")

    # Save the final, combined data from all files.
    final_save_path = Path(base_config.save_dir) / "all_players_stats_combined.pkl"
    with open(final_save_path, 'wb') as f:
        pickle.dump(all_players_data, f)
    
    print(f"\nProcessing complete. Final combined data for {len(all_players_data):,} players saved to: {final_save_path}")


# 2. Process Downloaded Files

The following cells contain the logic from notebook `03_parquet_performance.ipynb` to process the files we just downloaded.

This involves:
- Scanning the download directory for all `.parquet` files.
- Processing each file to extract player statistics.
- Storing the aggregated statistics in a pickle file.
- Keeping a registry of processed files to avoid duplicate work.
- Deleting the raw parquet file after it has been successfully processed.
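
The registry idea from the list above can be sketched as a small JSON-backed set of file names. This is an illustrative assumption, not the project's `FileRegistry` implementation; `SimpleFileRegistry` is a hypothetical class.

```python
import json
import tempfile
from pathlib import Path
from typing import Set


class SimpleFileRegistry:
    """Hypothetical JSON-backed set of processed file names."""

    def __init__(self, path: Path):
        self.path = Path(path)
        self.processed: Set[str] = set()
        if self.path.exists():
            self.processed = set(json.loads(self.path.read_text()))

    def contains(self, filename: str) -> bool:
        return filename in self.processed

    def add(self, filename: str) -> None:
        self.processed.add(filename)
        # Persist immediately so an interrupted run does not lose the record.
        self.path.write_text(json.dumps(sorted(self.processed)))


registry_path = Path(tempfile.mkdtemp()) / "file_registry.json"
registry = SimpleFileRegistry(registry_path)
registry.add("train-00000-of-00066.parquet")

# A fresh instance reloads the saved state from disk.
reloaded = SimpleFileRegistry(registry_path)
print(reloaded.contains("train-00000-of-00066.parquet"))
print(reloaded.contains("train-00001-of-00066.parquet"))
```

Writing the registry to disk on every `add` is slower than batching writes, but it keeps the record accurate if the notebook is interrupted partway through.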

In [None]:
import os
from dataclasses import dataclass
from typing import Dict, Set
import json
import time

# --- Processing Configuration ---

# Toggle deletion of parquet files after processing.
# True (recommended): delete each raw file once it has been processed successfully.
# False: keep all downloaded parquet files, e.g. for repeated experiments.
is_delete_downloaded_parquet_files = True  # <<< IMPORTANT TOGGLE


# Use the output_dir from the download config
proc_config = ProcessingConfig(
    input_dir=config["output_dir"],
    player_stats_output_path=str(project_root / "data" / "processed" / "player_stats_parquet.pkl"),
    file_registry_path=str(project_root / "data" / "processed" / "file_registry.json"),
)


In [None]:
# --- Helper Function for Processing a Single File ---

def process_parquet_file(filepath: str, config: ProcessingConfig, player_stats: Dict[str, PlayerStats], perf_tracker: PerformanceTracker):
    """
    Processes a single parquet file, updating player_stats dictionary.
    This is the core logic from 03_parquet_performance.ipynb.
    """
    con = duckdb.connect(database=':memory:', read_only=False)
    
    create_view_sql = f"""
    CREATE VIEW games AS SELECT * FROM read_parquet('{filepath}');
    """
    con.execute(create_view_sql)

    # Query to get player game counts
    query = f"""
    WITH players AS (
        SELECT White AS player, WhiteElo AS elo,
            CASE WHEN Result = '1-0' THEN 1 ELSE 0 END AS win,
            CASE WHEN Result = '0-1' THEN 1 ELSE 0 END AS loss,
            CASE WHEN Result = '1/2-1/2' THEN 1 ELSE 0 END AS draw
        FROM games
        WHERE WhiteElo >= {config.min_elo} AND WhiteElo <= {config.max_elo}
        UNION ALL
        SELECT Black AS player, BlackElo AS elo,
            CASE WHEN Result = '0-1' THEN 1 ELSE 0 END AS win,
            CASE WHEN Result = '1-0' THEN 1 ELSE 0 END AS loss,
            CASE WHEN Result = '1/2-1/2' THEN 1 ELSE 0 END AS draw
        FROM games
        WHERE BlackElo >= {config.min_elo} AND BlackElo <= {config.max_elo}
    )
    SELECT
        player,
        COUNT(*) as games_played,
        SUM(win) as wins,
        SUM(loss) as losses,
        SUM(draw) as draws
    FROM players
    GROUP BY player
    LIMIT {config.max_games_per_file};
    """
    
    results = con.execute(query).fetchall()
    con.close()

    games_in_file = 0
    for name, games, wins, losses, draws in results:
        if name not in player_stats:
            player_stats[name] = PlayerStats()
        
        player_stats[name].games_played += games
        player_stats[name].wins += wins
        player_stats[name].losses += losses
        player_stats[name].draws += draws
        games_in_file += games
    
    perf_tracker.record_game_batch(games_in_file)
    return len(results) > 0 # Return True if any data was processed


In [None]:
# --- Main Execution Flow for Processing ---

def process_and_delete_files():
    # 1. Load existing data
    if os.path.exists(proc_config.player_stats_output_path):
        with open(proc_config.player_stats_output_path, "rb") as f:
            player_stats: Dict[str, PlayerStats] = pickle.load(f)
        print(f"Loaded {len(player_stats)} players from existing stats file.")
    else:
        player_stats: Dict[str, PlayerStats] = {}
        print("No existing player stats file found. Starting fresh.")

    file_registry = FileRegistry(proc_config.file_registry_path)
    print(f"Loaded {len(file_registry.processed_files)} processed files from registry.")

    # 2. Find files to process
    try:
        all_files = [f for f in os.listdir(proc_config.input_dir) if f.endswith(".parquet")]
    except FileNotFoundError:
        print(f"ERROR: Input directory not found: {proc_config.input_dir}")
        return

    files_to_process = [f for f in all_files if not file_registry.contains(f)]
    
    if not files_to_process:
        print("\nNo new files to process. Everything is up to date.")
        return

    print(f"\nFound {len(files_to_process)} new parquet file(s) to process.")

    # 3. Process each file
    tracker = PerformanceTracker()
    total_files_to_process = len(files_to_process)

    for i, filename in enumerate(files_to_process):
        filepath = os.path.join(proc_config.input_dir, filename)
        print(f"[{i+1}/{total_files_to_process}] Processing: {filename}")
        
        try:
            file_start_time = time.time()
            processed_ok = process_parquet_file(filepath, proc_config, player_stats, tracker)
            
            if processed_ok:
                # Add to registry
                file_registry.add(filename)
                tracker.record_file()
                
                # Save cumulative stats
                with open(proc_config.player_stats_output_path, "wb") as f:
                    pickle.dump(player_stats, f)
                
                file_duration = time.time() - file_start_time
                print(f"  > Success ({file_duration:.1f}s). {tracker.summary(total_files_to_process)}")

                # Delete file if toggle is on
                if is_delete_downloaded_parquet_files:
                    try:
                        os.remove(filepath)
                        print(f"  > DELETED: {filename}")
                    except OSError as e:
                        print(f"  > ERROR: Failed to delete {filename}: {e}")
            else:
                print("  > SKIPPED: No relevant player data found in file.")

        except Exception as e:
            print(f"  > ERROR processing {filename}: {e}")
            print("    Skipping this file and continuing...")
            continue
            
    print("\nProcessing complete.")
    print(f"Total players with stats: {len(player_stats)}")
    print(f"Player stats saved to: {proc_config.player_stats_output_path}")
    print(f"File registry saved to: {proc_config.file_registry_path}")

# Run the processing and deletion flow
process_and_delete_files()


In [None]:
from dataclasses import dataclass
from typing import Dict, Set

# --- Processing Configuration ---
is_delete_downloaded_parquet_files = True  # <<< IMPORTANT TOGGLE

@dataclass
class PlayerStats:
    # Using a condensed version for brevity in this example
    games_played: int = 0
    wins: int = 0
    losses: int = 0
    draws: int = 0

@dataclass
class ProcessingConfig:
    input_dir: str
    player_stats_output_path: str
    file_registry_path: str
    min_elo: int = 2000
    max_elo: int = 4000 # No practical upper limit
    max_games_per_file: int = 1_000_000 # Effectively unlimited

# Use the output_dir from the download config
proc_config = ProcessingConfig(
    input_dir=config["output_dir"],
    player_stats_output_path=str(project_root / "data" / "processed" / "player_stats_parquet.pkl"),
    file_registry_path=str(project_root / "data" / "processed" / "file_registry.json"),
)

# --- Performance & Progress Tracking ---
class PerformanceTracker:
    def __init__(self):
        self.start_time = time.time()
        self.games_processed = 0
        self.files_processed = 0

    def record_game_batch(self, count):
        self.games_processed += count

    def record_file(self):
        self.files_processed += 1

    def summary(self, total_files):
        elapsed = time.time() - self.start_time
        games_per_sec = self.games_processed / elapsed if elapsed > 0 else 0
        files_remaining = total_files - self.files_processed
        time_per_file = elapsed / self.files_processed if self.files_processed > 0 else 0
        eta = files_remaining * time_per_file
        
        return (
            f"Processed {self.games_processed:,} games from {self.files_processed}/{total_files} files. "
            f"({games_per_sec:,.0f} games/sec). ETA: {eta/60:.1f} mins."
        )

print("Processing configuration loaded.")
if is_delete_downloaded_parquet_files:
    print("NOTE: Raw parquet files will be DELETED after successful processing.")
else:
    print("NOTE: Raw parquet files will be KEPT after processing.")


In [None]:
from dataclasses import dataclass
from typing import Dict
import time

# --- Processing Configuration ---
is_delete_downloaded_parquet_files = True  # <<< IMPORTANT TOGGLE

@dataclass
class PlayerStats:
    # ... (rest of the class definition)
    # Using a condensed version for brevity in this example
    games_played: int = 0
    wins: int = 0
    losses: int = 0
    draws: int = 0

@dataclass
class ProcessingConfig:
    input_dir: str
    player_stats_output_path: str
    file_registry_path: str
    min_elo: int = 2000
    max_elo: int = 4000  # No practical upper limit
    max_games_per_file: int = 1_000_000  # Effectively unlimited

# Use the output_dir from the download config
proc_config = ProcessingConfig(
    input_dir=config["output_dir"],
    player_stats_output_path=str(project_root / "data" / "processed" / "player_stats_parquet.pkl"),
    file_registry_path=str(project_root / "data" / "processed" / "file_registry.json"),
)

# --- Performance & Progress Tracking ---
class PerformanceTracker:
    def __init__(self):
        self.start_time = time.time()
        self.games_processed = 0
        self.files_processed = 0

    def record_game_batch(self, count):
        self.games_processed += count

    def record_file(self):
        self.files_processed += 1

    def summary(self, total_files):
        elapsed = time.time() - self.start_time
        games_per_sec = self.games_processed / elapsed if elapsed > 0 else 0
        files_remaining = total_files - self.files_processed
        time_per_file = elapsed / self.files_processed if self.files_processed > 0 else 0
        eta = files_remaining * time_per_file
        
        return (
            f"Processed {self.games_processed:,} games from {self.files_processed}/{total_files} files. "
            f"({games_per_sec:,.0f} games/sec). ETA: {eta/60:.1f} mins."
        )

class FileRegistry:  # noqa: F811
    def __init__(self, path):
        self.path = path
        self.processed_files: Set[str] = self._load()

    def _load(self) -> Set[str]:
        if os.path.exists(self.path):
            with open(self.path, "r") as f:
                return set(json.load(f))
        return set()

    def add(self, filename: str):
        self.processed_files.add(filename)
        self._save()

    def contains(self, filename: str) -> bool:
        return filename in self.processed_files

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(list(self.processed_files), f, indent=4)

print("Processing configuration loaded.")
if is_delete_downloaded_parquet_files:
    print("NOTE: Raw parquet files will be DELETED after successful processing.")
else:
    print("NOTE: Raw parquet files will be KEPT after processing.")

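The ETA reported by `PerformanceTracker.summary` is the average time per finished file multiplied by the number of files remaining. The sketch below isolates that arithmetic; `estimate_eta_seconds` is a hypothetical helper written for this illustration, and the numbers are made up.

```python
def estimate_eta_seconds(elapsed: float, files_done: int, total_files: int) -> float:
    """Estimate remaining time as (files remaining) * (average seconds per file)."""
    if files_done <= 0:
        # No file has finished yet, so there is no average to extrapolate from.
        return 0.0
    time_per_file = elapsed / files_done
    return (total_files - files_done) * time_per_file


# Illustrative numbers: 4 of 64 files finished after 120 seconds.
eta = estimate_eta_seconds(elapsed=120.0, files_done=4, total_files=64)
print(f"ETA: {eta / 60:.1f} mins")  # ETA: 30.0 mins
```

Because the estimate is a plain average, it reads 0 until the first file completes and can swing early on if file sizes vary.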

In [None]:
# --- Helper Function for Processing a Single File ---

def process_parquet_file(
    filepath: str,
    config: ProcessingConfig,
    player_stats: Dict[str, PlayerStats],
    perf_tracker: PerformanceTracker,
) -> bool:
    """
    Processes a single parquet file, updating player_stats dictionary.
    This is the core logic from 03_parquet_performance.ipynb.
    """
    con = duckdb.connect(database=':memory:', read_only=False)
    
    create_view_sql = f"""
    CREATE VIEW games AS SELECT * FROM read_parquet('{filepath}');
    """
    con.execute(create_view_sql)

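    # Aggregate per-player game counts and results for players inside the Elo range.
    # Note: the LIMIT below caps the number of player rows returned, not the number of games read.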
    query = f"""
    WITH players AS (
        SELECT White AS player, WhiteElo AS elo,
            CASE WHEN Result = '1-0' THEN 1 ELSE 0 END AS win,
            CASE WHEN Result = '0-1' THEN 1 ELSE 0 END AS loss,
            CASE WHEN Result = '1/2-1/2' THEN 1 ELSE 0 END AS draw
        FROM games
        WHERE WhiteElo >= {config.min_elo} AND WhiteElo <= {config.max_elo}
        UNION ALL
        SELECT Black AS player, BlackElo AS elo,
            CASE WHEN Result = '0-1' THEN 1 ELSE 0 END AS win,
            CASE WHEN Result = '1-0' THEN 1 ELSE 0 END AS loss,
            CASE WHEN Result = '1/2-1/2' THEN 1 ELSE 0 END AS draw
        FROM games
        WHERE BlackElo >= {config.min_elo} AND BlackElo <= {config.max_elo}
    )
    SELECT
        player,
        COUNT(*) as games_played,
        SUM(win) as wins,
        SUM(loss) as losses,
        SUM(draw) as draws
    FROM players
    GROUP BY player
    LIMIT {config.max_games_per_file};
    """
    
    # Close the connection even if the query raises.
    try:
        results = con.execute(query).fetchall()
    finally:
        con.close()

    games_in_file = 0
    for name, games, wins, losses, draws in results:
        if name not in player_stats:
            player_stats[name] = PlayerStats()
        
        player_stats[name].games_played += games
        player_stats[name].wins += wins
        player_stats[name].losses += losses
        player_stats[name].draws += draws
        # Counts player-game rows, so a game is counted twice when both players are in the Elo range.
        games_in_file += games

    perf_tracker.record_game_batch(games_in_file)
    return len(results) > 0  # True if the file yielded any player rows

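To see what the aggregation query produces, here is a small self-contained sketch of the same `UNION ALL` pattern on three made-up games. It deliberately swaps DuckDB for the standard-library `sqlite3` module so it runs without extra dependencies; the table contents and player names are invented for illustration, and the Elo bounds are passed as query parameters instead of being formatted into the SQL string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE games (White TEXT, Black TEXT, "
    "WhiteElo INTEGER, BlackElo INTEGER, Result TEXT)"
)
# Three invented games; carol is below the Elo floor and should be excluded.
con.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?, ?)",
    [
        ("alice", "bob", 2100, 2050, "1-0"),
        ("bob", "alice", 2060, 2110, "1/2-1/2"),
        ("carol", "alice", 1500, 2120, "0-1"),
    ],
)

# One row per (game, qualifying player), then grouped per player.
query = """
WITH players AS (
    SELECT White AS player,
        CASE WHEN Result = '1-0' THEN 1 ELSE 0 END AS win,
        CASE WHEN Result = '0-1' THEN 1 ELSE 0 END AS loss,
        CASE WHEN Result = '1/2-1/2' THEN 1 ELSE 0 END AS draw
    FROM games
    WHERE WhiteElo >= ? AND WhiteElo <= ?
    UNION ALL
    SELECT Black AS player,
        CASE WHEN Result = '0-1' THEN 1 ELSE 0 END AS win,
        CASE WHEN Result = '1-0' THEN 1 ELSE 0 END AS loss,
        CASE WHEN Result = '1/2-1/2' THEN 1 ELSE 0 END AS draw
    FROM games
    WHERE BlackElo >= ? AND BlackElo <= ?
)
SELECT player, COUNT(*), SUM(win), SUM(loss), SUM(draw)
FROM players
GROUP BY player
ORDER BY player
"""
try:
    rows = con.execute(query, (2000, 4000, 2000, 4000)).fetchall()
finally:
    con.close()

print(rows)  # [('alice', 3, 2, 0, 1), ('bob', 2, 0, 1, 1)]
```

The per-player game counts sum to 5 even though the table holds only 3 games, because each game contributes one row for every player inside the Elo range. The same applies to the `games_in_file` total in `process_parquet_file`.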

In [None]:
# --- Main Execution Flow for Processing ---

def process_and_delete_files():
    # 1. Load existing data
    player_stats: Dict[str, PlayerStats]
    if os.path.exists(proc_config.player_stats_output_path):
        with open(proc_config.player_stats_output_path, "rb") as f:
            player_stats = pickle.load(f)
        print(f"Loaded {len(player_stats)} players from existing stats file.")
    else:
        player_stats = {}
        print("No existing player stats file found. Starting fresh.")

    file_registry = FileRegistry(proc_config.file_registry_path)
    print(f"Loaded {len(file_registry.processed_files)} processed files from registry.")

    # 2. Find files to process
    try:
        all_files = [f for f in os.listdir(proc_config.input_dir) if f.endswith(".parquet")]
    except FileNotFoundError:
        print(f"ERROR: Input directory not found: {proc_config.input_dir}")
        return

    files_to_process = [f for f in all_files if not file_registry.contains(f)]
    
    if not files_to_process:
        print("\nNo new files to process. Everything is up to date.")
        return

    print(f"\nFound {len(files_to_process)} new parquet file(s) to process.")

    # 3. Process each file
    tracker = PerformanceTracker()
    total_files_to_process = len(files_to_process)

    for i, filename in enumerate(files_to_process):
        filepath = os.path.join(proc_config.input_dir, filename)
        print(f"[{i+1}/{total_files_to_process}] Processing: {filename}")
        
        try:
            file_start_time = time.time()
            processed_ok = process_parquet_file(filepath, proc_config, player_stats, tracker)
            
            if processed_ok:
                # Add to registry
                file_registry.add(filename)
                tracker.record_file()
                
                # Save cumulative stats
                with open(proc_config.player_stats_output_path, "wb") as f:
                    pickle.dump(player_stats, f)
                
                file_duration = time.time() - file_start_time
                print(f"  > Success ({file_duration:.1f}s). {tracker.summary(total_files_to_process)}")

                # Delete file if toggle is on
                if is_delete_downloaded_parquet_files:
                    try:
                        os.remove(filepath)
                        print(f"  > DELETED: {filename}")
                    except OSError as e:
                        print(f"  > ERROR: Failed to delete {filename}: {e}")
            else:
                print("  > SKIPPED: No relevant player data found in file.")

        except Exception as e:
            print(f"  > ERROR processing {filename}: {e}")
            print("    Skipping this file and continuing...")
            
    print("\nProcessing complete.")
    print(f"Total players with stats: {len(player_stats)}")
    print(f"Player stats saved to: {proc_config.player_stats_output_path}")
    print(f"File registry saved to: {proc_config.file_registry_path}")

# Run the processing and deletion flow
process_and_delete_files()
