# Downloading Experiments

This notebook is a proof of concept.

Right now, we have some flaws in our raw data download system. We want to get this right because we will be downloading and processing hundreds or thousands of GB of raw Parquet data.

So we'll be trying out some implementations here, and if they work, we'll replace parts of our original code with them.

In [1]:
%pip install huggingface_hub --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Getting file names

We want to use file names and other metadata to check for dupes against what we've already processed, and to see what we still need to download for a given month. Let's run this code that gets all file names from the repo, to see what the data looks like.

In [2]:
# Here we get a list of raw data files for the given month/year

# File names in the remote repo are structured like:
# data/year=2025/month=03/train-00001-of-00065.parquet
# The number of files varies by month, so the suffix won't always be -00065.parquet

from huggingface_hub import HfApi
from pathlib import Path

year = "2025"
month = "03"  # <-- type manually

api = HfApi()
files = api.list_repo_files(
    repo_id="Lichess/standard-chess-games",
    repo_type="dataset"
)

# Filter for that year/month
target_prefix = f"data/year={year}/month={month}/"
all_file_names_in_month = [f for f in files if f.startswith(target_prefix)]

print(len(all_file_names_in_month), f"files found for {year}-{month}")
for f in all_file_names_in_month[:20]:  # preview first 20
    print(f)



69 files found for 2025-03
data/year=2025/month=03/train-00000-of-00069.parquet
data/year=2025/month=03/train-00001-of-00069.parquet
data/year=2025/month=03/train-00002-of-00069.parquet
data/year=2025/month=03/train-00003-of-00069.parquet
data/year=2025/month=03/train-00004-of-00069.parquet
data/year=2025/month=03/train-00005-of-00069.parquet
data/year=2025/month=03/train-00006-of-00069.parquet
data/year=2025/month=03/train-00007-of-00069.parquet
data/year=2025/month=03/train-00008-of-00069.parquet
data/year=2025/month=03/train-00009-of-00069.parquet
data/year=2025/month=03/train-00010-of-00069.parquet
data/year=2025/month=03/train-00011-of-00069.parquet
data/year=2025/month=03/train-00012-of-00069.parquet
data/year=2025/month=03/train-00013-of-00069.parquet
data/year=2025/month=03/train-00014-of-00069.parquet
data/year=2025/month=03/train-00015-of-00069.parquet
data/year=2025/month=03/train-00016-of-00069.parquet
data/year=2025/month=03/train-00017-of-00069.parquet
data/year=2025/mont
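
Since the file names encode the year, month, shard index, and shard count, they can be parsed to check that a month's listing is complete before downloading anything. The following is a minimal sketch, assuming the five-digit `train-NNNNN-of-NNNNN` pattern seen in the output above holds for every month; `parse_shard_name` and `missing_shards` are hypothetical helpers, not part of our utils.

```python
import re

# Assumed pattern, based on names like
# data/year=2025/month=03/train-00001-of-00069.parquet
SHARD_PATTERN = re.compile(
    r"data/year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"train-(?P<index>\d{5})-of-(?P<total>\d{5})\.parquet"
)


def parse_shard_name(path):
    """Return (year, month, index, total) as ints, or None if the path does not match."""
    match = SHARD_PATTERN.fullmatch(path)
    if match is None:
        return None
    return tuple(int(match[key]) for key in ("year", "month", "index", "total"))


def missing_shards(paths):
    """Return shard indices that the names promise but the listing lacks."""
    parsed = [p for p in map(parse_shard_name, paths) if p is not None]
    if not parsed:
        return []
    total = parsed[0][3]  # Shard count taken from the first matching name
    seen = {index for _, _, index, _ in parsed}
    return sorted(set(range(total)) - seen)
```

Calling `missing_shards(all_file_names_in_month)` should return an empty list when every shard for the month is present in the listing.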

## Dupe checks

Now that we have the list of raw data file names for the given month and year, we'll perform our dupe checks. This parses through the list of file names to make sure we haven't already processed any of these files.

Each file is about a 1 GB download, so we don't want to download anything we've already processed.

In [3]:
# Parse the list of raw data file names to make sure we haven't already processed any of these files, and skip downloading any dupes

from huggingface_hub import get_hf_file_metadata, hf_hub_url

import sys

# Current working directory (should be project root)
project_root = Path.cwd()
sys.path.insert(0, str(project_root))

from utils.file_processing.raw_data_file_dupe_checks import FileRegistry  # noqa: E402

# Init registry
registry = FileRegistry()

# Remove files already processed
non_dupe_files = []
for f in all_file_names_in_month:
    url = hf_hub_url(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        filename=f,
    )
    meta = get_hf_file_metadata(url=url)
    size = meta.size
    etag = meta.etag
    if not registry.is_file_processed(f"{year}-{month}", Path(f).name, size, etag):
        non_dupe_files.append(f)

print(len(non_dupe_files), "new files to download")

69 new files to download
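
To make the dupe check concrete, here is a minimal sketch of what a registry like this could look like. It is a hypothetical stand-in, not the real `FileRegistry` from our utils: the class name `SketchRegistry`, the JSON file layout, and the key format are all assumptions. It treats a file as processed only when the month, name, size, and ETag all match, so a re-uploaded file with a new ETag counts as new.

```python
import json
import tempfile
from pathlib import Path


class SketchRegistry:
    """Hypothetical stand-in for FileRegistry, persisted as a single JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier entries if the registry file already exists
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(month, filename):
        return f"{month}/{filename}"

    def is_file_processed(self, month, filename, size, etag):
        entry = self.entries.get(self._key(month, filename))
        return entry == {"size": size, "etag": etag}

    def mark_file_processed(self, month, filename, size, etag):
        self.entries[self._key(month, filename)] = {"size": size, "etag": etag}
        self.path.write_text(json.dumps(self.entries, indent=2))


# Usage with made-up size and ETag values
with tempfile.TemporaryDirectory() as tmp:
    registry_path = Path(tmp) / "registry.json"
    name = "train-00000-of-00069.parquet"
    sketch = SketchRegistry(registry_path)
    before = sketch.is_file_processed("2025-03", name, 123, "abc")
    sketch.mark_file_processed("2025-03", name, 123, "abc")
    reloaded = SketchRegistry(registry_path)  # Entries survive a reload from disk
    after = reloaded.is_file_processed("2025-03", name, 123, "abc")
    changed = reloaded.is_file_processed("2025-03", name, 123, "xyz")
```

Note that the check only works if the same month string and file name are used when marking and when checking, for example `"2025-03"` and the bare `train-...` name in both places.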


## Config

Next, we configure how our downloading and processing pipeline operates.

In [6]:
# Config stuff

from utils.downloading_raw_parquet_data.raw_parquet_data_file_downloader import (
    download_single_parquet_file,
)  # noqa: F401
from utils.file_processing.process_parquet_file import (
    process_parquet_file,
)  # noqa: F401
from utils.file_processing.types_and_classes import ProcessingConfig
from utils.file_processing.save_and_load_progress import load_progress

# --- Configuration ---
year, month = 2025, 3  # ints here, unlike the "2025"/"03" strings used in the earlier cells
max_files_to_download = 1  # For testing; set to None to process all new files
local_dir = Path("../data/raw/better_downloading_experiments")
local_dir.mkdir(parents=True, exist_ok=True)

# Base config for processing. This will be used for each file.
base_config = ProcessingConfig(
    parquet_path="",  # This will be set per-file
    batch_size=100_000,
    save_interval=5,  # Save progress every 5 batches
    save_dir="../data/processed",
    min_player_rating=1200,
    max_elo_difference_between_players=100,
    allowed_time_controls={"Blitz", "Rapid", "Classical"},
)

# --- Initialization ---
# Load all existing player data once at the beginning.
# This dictionary will be passed to and updated by each file processing job.
all_players_data, _ = load_progress(base_config)
print(f"Loaded initial data with {len(all_players_data):,} players.")

# Use the list of non-duplicate files from the previous cell
# Note that we still need to actually download these files; that comes next
files_to_download = non_dupe_files
if max_files_to_download is not None:
    files_to_download = non_dupe_files[:max_files_to_download]

# for file_to_download in file_names_to_download:
#     downloaded_file_path = download_single_parquet_file(
#         repo_id="Lichess/standard-chess-games",
#         repo_type="dataset",
#         file_to_download=file_to_download,
#         local_dir=local_dir,
#         year=year,
#         month=month,
#     )
#     if downloaded_file_path:
#         # Here you would call your processing function, e.g.:
#         # process_and_delete_file(downloaded_file_path)
#         print(f"Successfully downloaded {downloaded_file_path.name}")

#     # Now to process the downloaded file, then delete it to save space
#     process_parquet_file(

#     )

Loaded initial data with 0 players.
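
The base config leaves `parquet_path` empty because it is meant to be filled in per file. One common way to derive a per-file copy without mutating the base is `dataclasses.replace`. The sketch below assumes a frozen dataclass; `SketchConfig` is a hypothetical, trimmed-down stand-in, and the real `ProcessingConfig` may be defined differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical, trimmed-down stand-in for ProcessingConfig."""

    parquet_path: str
    batch_size: int = 100_000
    min_player_rating: int = 1200


base = SketchConfig(parquet_path="")

# Derive a per-file copy; the base config stays untouched
per_file = replace(base, parquet_path="some/local/file.parquet")
```

Every field other than `parquet_path` carries over from `base`, so settings like the batch size stay consistent across files.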


## Downloading, processing, deleting

Now, we'll download, process, and delete our raw data files one by one.

The workflow is:

1. Download a file
2. Process that file, extracting the game data we want
3. Delete that file
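
The three steps above can be sketched as a small helper that always deletes the local file, even when processing raises. This is a minimal sketch with assumed names: `run_one_file`, `fake_download`, and `fake_process` are hypothetical and stand in for our real download and processing functions.

```python
import tempfile
from pathlib import Path


def run_one_file(name, download, process):
    """Download, process, and delete one file. Returns True if processing succeeded."""
    local_path = download(name)
    if local_path is None:
        return False  # Download failed, so there is nothing to process or delete
    try:
        return bool(process(local_path))
    finally:
        # Runs on success, on failure, and when process() raises
        Path(local_path).unlink(missing_ok=True)


# Usage with fake callables that stand in for the real pipeline steps
with tempfile.TemporaryDirectory() as tmp:

    def fake_download(name):
        path = Path(tmp) / name
        path.write_bytes(b"placeholder")
        return path

    def fake_process(path):
        return path.stat().st_size > 0

    ok = run_one_file("shard.parquet", fake_download, fake_process)
    still_there = (Path(tmp) / "shard.parquet").exists()
```

Putting the delete in a `finally` block keeps a crash in the processing step from leaving 1 GB files behind on disk.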

In [None]:
import time
import os
# --- Main Loop ---

for i, file_to_download in enumerate(files_to_download):
    print(f"\n--- Downloading [{i+1}/{len(files_to_download)}] ---")

    # 1. Download the file
    downloaded_file_path = download_single_parquet_file(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        file_to_download=file_to_download,
        local_dir=local_dir,
        year=year,
        month=month,
    )

    if not downloaded_file_path:
        print(f"DOWNLOAD FAILED for {file_to_download}. Skipping.")
        continue

    print(f"Successfully downloaded: {downloaded_file_path.name}")

    # 2. Process the file
    file_config = base_config.replace(parquet_path=str(downloaded_file_path))

    file_context = {
        "current_file_num": i + 1,
        "total_files": len(files_to_download),
        "total_start_time": time.time(),  # Used for the ETA within a single file run
        "avg_rows_per_file": 1_400_000,  # Rough average for ETA purposes
        "total_rows_estimate": len(files_to_download) * 1_400_000,
    }

    is_processing_successful = process_parquet_file(
        config=file_config,
        players_data=all_players_data,
        file_context=file_context,
    )

    print("is_processing_successful:", is_processing_successful)

    if is_processing_successful:
        print(f"PROCESSING SUCCESSFUL for {downloaded_file_path.name}")
    else:
        print(f"PROCESSING FAILED for {downloaded_file_path.name}")

    # Register the file as processed either way, so we don't re-download it in the future.
    # Failed files are registered too; we have plenty of other files to work with.
    # Fetch metadata for this specific file: the `meta` left over from the dupe-check
    # cell belongs to the last file in that loop, not to this one.
    file_meta = get_hf_file_metadata(
        url=hf_hub_url(
            repo_id="Lichess/standard-chess-games",
            repo_type="dataset",
            filename=file_to_download,
        )
    )
    registry.mark_file_processed(
        # Use the same key and name as the dupe check: "2025-03" and the remote file name
        month=f"{int(year)}-{int(month):02d}",
        filename=Path(file_to_download).name,
        size=file_meta.size,
        etag=file_meta.etag,
    )
    print("Registered file as processed.")

    # 3. Delete the file to save space
    os.remove(downloaded_file_path)
    print(f"Deleted local file: {downloaded_file_path.name}")


--- Downloading [1/1] ---
File saved to ../data/raw/better_downloading_experiments/2025-03-train-00000-of-00069.parquet
Successfully downloaded: 2025-03-train-00000-of-00069.parquet
Will process 1,413,223 rows in 15 batches of size 100,000

Processing batch 1/15 (offset 0)
Progress: 5,000/100,000 (5.0%) - Rate: 10632.6 games/sec - File 1/1 (0 left) - File ETA: 0.1 min - Total ETA: 4.5 min
Batch filtering: Accepted 2,232, Filtered 2,767 (Acceptance rate: 44.6%)
Progress: 10,000/100,000 (10.0%) - Rate: 10857.3 games/sec - File 1/1 (0 left) - File ETA: 0.1 min - Total ETA: 3.3 min
Batch filtering: Accepted 4,330, Filtered 5,669 (Acceptance rate: 43.3%)
Progress: 15,000/100,000 (15.0%) - Rate: 11873.0 games/sec - File 1/1 (0 left) - File ETA: 0.1 min - Total ETA: 2.7 min
Batch filtering: Accepted 6,411, Filtered 8,588 (Acceptance rate: 42.7%)
Progress: 20,000/100,000 (20.0%) - Rate: 12668.3 games/sec - File 1/1 (0 left) - File ETA: 0.1 min - Total ETA: 2.4 min
Batch filtering: Accepted 8,

TypeError: FileRegistry.mark_file_processed() got an unexpected keyword argument 'month_str'