# Downloading Experiments

This notebook is a proof of concept.

Right now, we have some flaws in our raw data download system. We want to get this right because we will be downloading and processing hundreds or thousands of gigabytes of raw Parquet files.

So, we'll be messing around here with some implementations, and if they work, we'll be replacing parts of our original code with this.

In [1]:
%pip install huggingface_hub --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Getting file names

We want to use file names and other metadata to check for duplicates against what we've already processed, and to see what we still need to download for a given month. Let's run this code, which gets all file names from the repo, to see what the data looks like.

TODO I wrote get_parquet_file_names for this; if we use this notebook again, convert this functionality to use that util.

In [None]:
# Here we get a list of raw data files for the given month/year

# TODO I wrote get_parquet_file_names for this; if we use this notebook again, convert this functionality to use that util.

# File names in the remote repo are structured like:
# data/year=2025/month=03/train-00001-of-00065.parquet
# The number of files varies by month, so the suffix won't always be -of-00065.parquet

from huggingface_hub import HfApi
from pathlib import Path

year = 2025
month = 3  # <-- type manually

api = HfApi()
files = api.list_repo_files(
    repo_id="Lichess/standard-chess-games",
    repo_type="dataset"
)

# Filter for that year/month
target_prefix = f"data/year={year}/month={month:02d}/"
all_file_names_in_month = [f for f in files if f.startswith(target_prefix)]

print(len(all_file_names_in_month), f"files found for {year}-{month}")
for f in all_file_names_in_month[:20]:  # preview first 20
    print(f)

69 files found for 2025-3
data/year=2025/month=03/train-00000-of-00069.parquet
data/year=2025/month=03/train-00001-of-00069.parquet
data/year=2025/month=03/train-00002-of-00069.parquet
data/year=2025/month=03/train-00003-of-00069.parquet
data/year=2025/month=03/train-00004-of-00069.parquet
data/year=2025/month=03/train-00005-of-00069.parquet
data/year=2025/month=03/train-00006-of-00069.parquet
data/year=2025/month=03/train-00007-of-00069.parquet
data/year=2025/month=03/train-00008-of-00069.parquet
data/year=2025/month=03/train-00009-of-00069.parquet
data/year=2025/month=03/train-00010-of-00069.parquet
data/year=2025/month=03/train-00011-of-00069.parquet
data/year=2025/month=03/train-00012-of-00069.parquet
data/year=2025/month=03/train-00013-of-00069.parquet
data/year=2025/month=03/train-00014-of-00069.parquet
data/year=2025/month=03/train-00015-of-00069.parquet
data/year=2025/month=03/train-00016-of-00069.parquet
data/year=2025/month=03/train-00017-of-00069.parquet
data/year=2025/month
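
Since the file names encode both a shard index and a shard total, they can also be used to sanity-check that a month's listing is complete. The following is a minimal sketch, assuming the five-digit, zero-based `train-XXXXX-of-YYYYY.parquet` pattern seen in the listing above holds for every month. `parse_shard_name` and `missing_shards` are hypothetical helper names, not the project's `get_parquet_file_names` util.

```python
import re
from typing import Dict, List, Optional

# Assumed pattern, based only on the listing above:
# data/year=2025/month=03/train-00001-of-00069.parquet
SHARD_RE = re.compile(
    r"^data/year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"train-(?P<index>\d{5})-of-(?P<total>\d{5})\.parquet$"
)


def parse_shard_name(name: str) -> Optional[Dict[str, int]]:
    """Return year, month, index and total as ints, or None if the name doesn't match."""
    match = SHARD_RE.match(name)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def missing_shards(names: List[str]) -> List[int]:
    """Return the zero-based shard indexes that are absent from the listing."""
    parsed = [p for p in map(parse_shard_name, names) if p is not None]
    if not parsed:
        return []
    total = parsed[0]["total"]  # assumes every file in the month reports the same total
    present = {p["index"] for p in parsed}
    return sorted(set(range(total)) - present)
```

Calling `missing_shards(all_file_names_in_month)` should return an empty list when every shard of the month is present in the repo listing.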

## Dupe checks

Now that we have the list of raw data file names for the given month and year, we'll perform our dupe checks. This parses through the list of file names to make sure we haven't already processed any of these files.

Each file is a 1 GB download, so we want to avoid downloading a file we've already processed.

In [3]:
# Parse the list of raw data file names to make sure we haven't already processed any of these files, and skip downloading any dupes

from huggingface_hub import get_hf_file_metadata, hf_hub_url

import sys

# Current working directory (should be project root)
project_root = Path.cwd()
sys.path.insert(0, str(project_root))

from utils.file_processing.raw_data_file_dupe_checks import FileRegistry  # noqa: E402

# Init registry
registry = FileRegistry()

# Remove files already processed
non_dupe_files = []
month_str = f"{year}-{month}"
for f in all_file_names_in_month:
    url = hf_hub_url(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        filename=f,
    )
    meta = get_hf_file_metadata(url=url)
    size = meta.size
    etag = meta.etag

    # This is the filename format that will be saved in the registry
    expected_filename_in_registry = f"{year}-{month:02d}-{Path(f).name}"

    if not registry.is_file_processed(
        month_str, expected_filename_in_registry, size, etag
    ):
        non_dupe_files.append(f)

print(len(non_dupe_files), "new files to download")

69 new files to download
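
The `FileRegistry` implementation itself isn't shown in this notebook. To make the idea concrete, here is a minimal sketch of how a registry keyed on month and file name, and validated by size and ETag, might work. `SimpleFileRegistry` and its JSON storage are assumptions for illustration only; the real class in `utils.file_processing.raw_data_file_dupe_checks` may work differently.

```python
import json
from pathlib import Path


class SimpleFileRegistry:
    """Hypothetical stand-in for the project's FileRegistry (illustration only)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously recorded entries, if the registry file already exists.
        self._entries = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    @staticmethod
    def _key(month, filename):
        return f"{month}/{filename}"

    def is_file_processed(self, month, filename, size, etag):
        """A file counts as processed only if its size and ETag both still match."""
        entry = self._entries.get(self._key(month, filename))
        return entry is not None and entry["size"] == size and entry["etag"] == etag

    def mark_file_processed(self, month, filename, size, etag):
        """Record the file and persist the registry to disk."""
        self._entries[self._key(month, filename)] = {"size": size, "etag": etag}
        self.path.write_text(json.dumps(self._entries, indent=2))
```

Comparing size and ETag as well as the name means a file that was re-uploaded with different content would be treated as new rather than skipped.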


## Config

Next, we configure how our downloading and processing pipeline operates.

In [None]:
# Config stuff
import importlib
from utils.database import db_utils
from utils.file_processing import process_game_batch
importlib.reload(db_utils)
importlib.reload(process_game_batch)

from utils.downloading_raw_parquet_data.raw_parquet_data_file_downloader import (
    download_single_parquet_file,
)
from utils.file_processing.process_parquet_file import (
    process_parquet_file,
)
from utils.file_processing.types_and_classes import ProcessingConfig
from utils.database.db_utils import get_db_connection, setup_database


# --- Configuration ---
year, month = 2025, 3
max_files_to_download = 5  # For testing; set to None to process all new files
local_dir = Path("../data/raw/better_downloading_experiments")
local_dir.mkdir(parents=True, exist_ok=True)

# Define the path for the DuckDB database file.
db_path = Path("../data/processed/chess_games.db")
db_path.parent.mkdir(parents=True, exist_ok=True)


# Base config for processing. This will be used for each file.
base_config = ProcessingConfig(
    parquet_path="",  # This will be set per-file
    db_path=db_path,
    batch_size=1_500_000,
    min_player_rating=1200,
    max_elo_difference_between_players=100,
    allowed_time_controls={"Blitz", "Rapid", "Classical"},
)

# --- Database Initialization ---
# Set up the database schema before starting any processing.
# This is idempotent; it will only create tables if they don't exist.
with get_db_connection(db_path) as con:
    setup_database(con)


# --- File Selection ---
# Use the list of non-duplicate files from the previous cell
files_to_download = non_dupe_files
if max_files_to_download is not None:
    files_to_download = non_dupe_files[:max_files_to_download]

print(f"Prepared to download and process {len(files_to_download)} files.")

Initializing database schema...
Database tables and partitioned stats tables are ready.
Prepared to download and process 10 files.
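
The base config above is meant to be copied once per file, with only `parquet_path` changed. The definition of `ProcessingConfig` isn't shown here, so the following is only a sketch of one way to get that behavior, using a frozen dataclass. `SketchConfig` is a hypothetical name and carries just a subset of the fields.

```python
import dataclasses
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical, trimmed-down stand-in for ProcessingConfig."""

    parquet_path: str
    db_path: Path
    batch_size: int = 1_500_000
    min_player_rating: int = 1200

    def replace(self, **changes):
        # Return a new config with the given fields changed; the original is untouched.
        return dataclasses.replace(self, **changes)


base = SketchConfig(parquet_path="", db_path=Path("chess_games.db"))
per_file = base.replace(parquet_path="2025-03-train-00000-of-00069.parquet")
```

Because the dataclass is frozen, the shared base config can't be mutated by accident while looping over files.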


## Downloading, processing, deleting

Now, we'll download, process, and delete our raw data files one by one.

The workflow is:

1. Download a file
2. Process that file, extracting the game data we want
3. Delete that file
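
The three steps above can be sketched generically as follows. This is an illustration of the pattern rather than the pipeline itself: `download_process_delete` is a hypothetical name, and the `download` and `process` callables stand in for the project's own utilities. The `try`/`finally` is one way to make sure the local file is removed even if processing raises.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def download_process_delete(
    remote_names: Iterable[str],
    download: Callable[[str], Optional[Path]],
    process: Callable[[Path], bool],
) -> Dict[str, bool]:
    """Handle files one at a time so at most one raw file is on disk at once."""
    results: Dict[str, bool] = {}
    for name in remote_names:
        local_path = download(name)  # 1. Download a file
        if local_path is None:
            results[name] = False  # download failed; nothing to clean up
            continue
        try:
            results[name] = process(local_path)  # 2. Process that file
        finally:
            local_path.unlink(missing_ok=True)  # 3. Delete that file
    return results
```

Working on one file at a time keeps peak disk usage near the size of a single file, at the cost of not overlapping download and processing.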

In [5]:
import time
import os
from collections import defaultdict
from statistics import mean
# --- Timing Stats ---
timing_stats = defaultdict(list)
total_start_time = time.time()

for i, file_to_download in enumerate(files_to_download):
    print(f"\n--- Downloading [{i+1}/{len(files_to_download)}] ---")
    step_start_time = time.time()

    # 1. Download the file
    downloaded_file_path = download_single_parquet_file(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        file_to_download=file_to_download,
        local_dir=local_dir,
        year=year,
        month=month,
    )

    timing_stats['download'].append(time.time() - step_start_time)
    if not downloaded_file_path:
        print(f"DOWNLOAD FAILED for {file_to_download}. Skipping.")
        continue

    print(f"Successfully downloaded: {downloaded_file_path.name}")

    # Get metadata for the downloaded file
    url = hf_hub_url(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        filename=file_to_download,
    )
    meta = get_hf_file_metadata(url=url)

    # 2. Process the file
    step_start_time = time.time()
    file_config = base_config.replace(parquet_path=str(downloaded_file_path))
    file_context = {
        "current_file_num": i + 1,
        "total_files": len(files_to_download),
        "total_start_time": total_start_time,
    }
    is_processing_successful = process_parquet_file(
        config=file_config,
        file_context=file_context,
    )
    timing_stats['process'].append(time.time() - step_start_time)

    # 3. Register and delete (failed files are registered too, to avoid re-downloading)
    step_start_time = time.time()
    status = "SUCCESSFUL" if is_processing_successful else "FAILED"
    print(f"PROCESSING {status} for {downloaded_file_path.name}")
    registry.mark_file_processed(
        month=f"{year}-{month}",
        filename=downloaded_file_path.name,
        size=meta.size,
        etag=meta.etag,
    )
    if is_processing_successful:
        print("Registered file as processed.")
    else:
        print("Registered file as processed to avoid re-downloading.")
    os.remove(downloaded_file_path)
    print(f"Deleted local file: {downloaded_file_path.name}")
    timing_stats['register_delete'].append(time.time() - step_start_time)

# --- Summary ---
print("\n--- Timing Summary ---")
for step, times in timing_stats.items():
    print(f"{step.capitalize()} - Avg: {mean(times):.2f}s, Max: {max(times):.2f}s, Min: {min(times):.2f}s")
total_elapsed_time = time.time() - total_start_time
print(f"Total elapsed time: {total_elapsed_time:.2f}s")


--- Downloading [1/10] ---
File saved to ../data/raw/better_downloading_experiments/2025-03-train-00000-of-00069.parquet
Successfully downloaded: 2025-03-train-00000-of-00069.parquet
Processing 1,410,475 rows in 1 batches of size 1,500,000

--- Starting Batch 1/1 ---

    Processed 620,769 games.
    Updated stats for 1,081,675 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.43s
extract_players_openings: 0.07s
insert_entities: 1.66s
aggregate_stats: 0.39s
bulk_upsert: 9.41s
    Batch processed in 14.87s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410475
  total_time_sec: 15.056173086166382
  avg_batch_time_sec: 14.867581844329834
  min_batch_time_sec: 14.867581844329834
  max_batch_time_sec: 14.867581844329834
  avg_batch_size: 1410475.0
  overall_rate_games_per_sec: 93680.84385905108
  memory_usage: [{'percent': 40.5, 'used_gb': 12.052871704101562, 'available_gb': 19.03714370727539}]
  accepted_games: 620769
  filtered_games: 789706
  acceptance_rate_percent: 44.0
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.19s


    Processed 618,647 games.
    Updated stats for 1,081,239 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.44s
extract_players_openings: 0.08s
insert_entities: 1.72s
aggregate_stats: 0.28s
bulk_upsert: 12.26s
    Batch processed in 17.47s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410372
  total_time_sec: 17.586232900619507
  avg_batch_time_sec: 17.47227191925049
  min_batch_time_sec: 17.47227191925049
  max_batch_time_sec: 17.47227191925049
  avg_batch_size: 1410372.0
  overall_rate_games_per_sec: 80197.50494435435
  memory_usage: [{'percent': 40.4, 'used_gb': 12.006332397460938, 'available_gb': 19.083538055419922}]
  accepted_games: 618647
  filtered_games: 791725
  acceptance_rate_percent: 43.9
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.11s

    Processed 623,441 games.
    Updated stats for 1,078,837 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.53s
extract_players_openings: 0.08s
insert_entities: 1.59s
aggregate_stats: 0.29s
bulk_upsert: 16.23s
    Batch processed in 21.48s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410528
  total_time_sec: 21.597620010375977
  avg_batch_time_sec: 21.484920740127563
  min_batch_time_sec: 21.484920740127563
  max_batch_time_sec: 21.484920740127563
  avg_batch_size: 1410528.0
  overall_rate_games_per_sec: 65309.41832120164
  memory_usage: [{'percent': 40.7, 'used_gb': 12.117416381835938, 'available_gb': 18.971492767333984}]
  accepted_games: 623441
  filtered_games: 787087
  acceptance_rate_percent: 44.2
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.11

    Processed 613,739 games.
    Updated stats for 1,076,537 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.43s
extract_players_openings: 0.09s
insert_entities: 1.58s
aggregate_stats: 0.31s
bulk_upsert: 17.95s
    Batch processed in 23.19s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409773
  total_time_sec: 23.320396900177002
  avg_batch_time_sec: 23.185829162597656
  min_batch_time_sec: 23.185829162597656
  max_batch_time_sec: 23.185829162597656
  avg_batch_size: 1409773.0
  overall_rate_games_per_sec: 60452.358767071404
  memory_usage: [{'percent': 41.0, 'used_gb': 12.221397399902344, 'available_gb': 18.868572235107422}]
  accepted_games: 613739
  filtered_games: 796034
  acceptance_rate_percent: 43.5
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.1

    Processed 619,869 games.
    Updated stats for 1,074,957 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.44s
extract_players_openings: 0.09s
insert_entities: 1.58s
aggregate_stats: 0.40s
bulk_upsert: 18.59s
    Batch processed in 23.84s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1411262
  total_time_sec: 23.958751916885376
  avg_batch_time_sec: 23.842582941055298
  min_batch_time_sec: 23.842582941055298
  max_batch_time_sec: 23.842582941055298
  avg_batch_size: 1411262.0
  overall_rate_games_per_sec: 58903.819568555526
  memory_usage: [{'percent': 41.3, 'used_gb': 12.30267333984375, 'available_gb': 18.78677749633789}]
  accepted_games: 619869
  filtered_games: 791393
  acceptance_rate_percent: 43.9
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.12s

    Processed 623,841 games.
    Updated stats for 1,094,477 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.41s
extract_players_openings: 0.09s
insert_entities: 1.51s
aggregate_stats: 0.40s
bulk_upsert: 20.05s
    Batch processed in 25.22s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410815
  total_time_sec: 25.352068185806274
  avg_batch_time_sec: 25.221807956695557
  min_batch_time_sec: 25.221807956695557
  max_batch_time_sec: 25.221807956695557
  avg_batch_size: 1410815.0
  overall_rate_games_per_sec: 55648.91154678518
  memory_usage: [{'percent': 41.6, 'used_gb': 12.405353546142578, 'available_gb': 18.683609008789062}]
  accepted_games: 623841
  filtered_games: 786974
  acceptance_rate_percent: 44.2
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.13

    Processed 613,428 games.
    Updated stats for 1,066,078 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.49s
extract_players_openings: 0.08s
insert_entities: 1.28s
aggregate_stats: 0.37s
bulk_upsert: 22.65s
    Batch processed in 27.44s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1411185
  total_time_sec: 27.549126863479614
  avg_batch_time_sec: 27.436647176742554
  min_batch_time_sec: 27.436647176742554
  max_batch_time_sec: 27.436647176742554
  avg_batch_size: 1411185.0
  overall_rate_games_per_sec: 51224.30946698102
  memory_usage: [{'percent': 41.8, 'used_gb': 12.464088439941406, 'available_gb': 18.62494659423828}]
  accepted_games: 613428
  filtered_games: 797757
  acceptance_rate_percent: 43.5
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.11s

    Processed 628,517 games.
    Updated stats for 1,094,917 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.44s
extract_players_openings: 0.09s
insert_entities: 1.48s
aggregate_stats: 0.38s
bulk_upsert: 22.80s
    Batch processed in 27.95s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410968
  total_time_sec: 28.073933124542236
  avg_batch_time_sec: 27.953838109970093
  min_batch_time_sec: 27.953838109970093
  max_batch_time_sec: 27.953838109970093
  avg_batch_size: 1410968.0
  overall_rate_games_per_sec: 50259.006949280345
  memory_usage: [{'percent': 41.7, 'used_gb': 12.417572021484375, 'available_gb': 18.671810150146484}]
  accepted_games: 628517
  filtered_games: 782451
  acceptance_rate_percent: 44.5
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.1

    Processed 614,920 games.
    Updated stats for 1,074,234 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.41s
extract_players_openings: 0.08s
insert_entities: 1.55s
aggregate_stats: 0.39s
bulk_upsert: 22.99s
    Batch processed in 27.89s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410743
  total_time_sec: 28.024264097213745
  avg_batch_time_sec: 27.89228320121765
  min_batch_time_sec: 27.89228320121765
  max_batch_time_sec: 27.89228320121765
  avg_batch_size: 1410743.0
  overall_rate_games_per_sec: 50340.05514315219
  memory_usage: [{'percent': 41.6, 'used_gb': 12.413520812988281, 'available_gb': 18.676170349121094}]
  accepted_games: 614920
  filtered_games: 795823
  acceptance_rate_percent: 43.6
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.13s

    Processed 625,192 games.
    Updated stats for 1,087,523 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.58s
extract_players_openings: 0.10s
insert_entities: 1.57s
aggregate_stats: 0.40s
bulk_upsert: 26.07s
    Batch processed in 31.08s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1410874
  total_time_sec: 31.19522523880005
  avg_batch_time_sec: 31.07607913017273
  min_batch_time_sec: 31.07607913017273
  max_batch_time_sec: 31.07607913017273
  avg_batch_size: 1410874.0
  overall_rate_games_per_sec: 45227.24196410612
  memory_usage: [{'percent': 42.5, 'used_gb': 12.686527252197266, 'available_gb': 18.403160095214844}]
  accepted_games: 625192
  filtered_games: 785682
  acceptance_rate_percent: 44.3
Running VACUUM on the database...
VACUUM complete. Elapsed time: 0.00 seconds.
VACUUM took 0.00 seconds for this file.
Database connection closed.

--- Detailed Timing Metrics ---
metadata_retrieval: 0.12s
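
The per-file logs above report a `bulk_upsert` time in each "Batch Timing Metrics" block. As a rough way to look at the trend, the sketch below copies those ten values by hand and summarizes them. It assumes the blocks appear in processing order, which the captured output suggests but does not guarantee.

```python
from statistics import mean

# bulk_upsert seconds, copied by hand from the ten "Batch Timing Metrics" blocks above
# (assumed to be in processing order)
bulk_upsert_sec = [9.41, 12.26, 16.23, 17.95, 18.59, 20.05, 22.65, 22.80, 22.99, 26.07]

avg = mean(bulk_upsert_sec)
slowdown = bulk_upsert_sec[-1] / bulk_upsert_sec[0]
# True if each file's upsert took at least as long as the previous one
is_monotonic = all(a <= b for a, b in zip(bulk_upsert_sec, bulk_upsert_sec[1:]))

print(f"avg: {avg:.2f}s, last/first: {slowdown:.2f}x, monotonic: {is_monotonic}")
```

If this pattern holds over more files, the upsert step may become the dominant cost as the database grows. That is a guess from ten data points, not a measured conclusion.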