# Downloading Experiments

This notebook is a proof of concept.

Right now, our raw data download system has some flaws. We want to get this right because we will be downloading and processing hundreds or thousands of gigabytes of raw Parquet data files.

So we'll try out some implementations here and, if they work, use them to replace parts of our original code.

In [5]:
%pip install huggingface_hub --quiet

Note: you may need to restart the kernel to use updated packages.


## Getting file names

We want to use file names and other metadata to run dupe checks against what we've already processed, and to see what we still need to download for a given month. Let's run code that lists every file name in the repo to see what the data looks like.

In [6]:
# Here we get a list of raw data files for the given month/year

# File names in the remote repo are structured like:
# data/year=2025/month=03/train-00001-of-00065.parquet
# The number of shards varies by month, so the suffix won't always be -of-00065.parquet

from huggingface_hub import HfApi
from pathlib import Path

year = "2025"
month = "03"  # <-- type manually

api = HfApi()
files = api.list_repo_files(
    repo_id="Lichess/standard-chess-games",
    repo_type="dataset"
)

# Filter for that year/month
target_prefix = f"data/year={year}/month={month}/"
all_file_names_in_month = [f for f in files if f.startswith(target_prefix)]

print(len(all_file_names_in_month), f"files found for {year}-{month}")
for f in all_file_names_in_month[:20]:  # preview first 20
    print(f)

69 files found for 2025-03
data/year=2025/month=03/train-00000-of-00069.parquet
data/year=2025/month=03/train-00001-of-00069.parquet
data/year=2025/month=03/train-00002-of-00069.parquet
data/year=2025/month=03/train-00003-of-00069.parquet
data/year=2025/month=03/train-00004-of-00069.parquet
data/year=2025/month=03/train-00005-of-00069.parquet
data/year=2025/month=03/train-00006-of-00069.parquet
data/year=2025/month=03/train-00007-of-00069.parquet
data/year=2025/month=03/train-00008-of-00069.parquet
data/year=2025/month=03/train-00009-of-00069.parquet
data/year=2025/month=03/train-00010-of-00069.parquet
data/year=2025/month=03/train-00011-of-00069.parquet
data/year=2025/month=03/train-00012-of-00069.parquet
data/year=2025/month=03/train-00013-of-00069.parquet
data/year=2025/month=03/train-00014-of-00069.parquet
data/year=2025/month=03/train-00015-of-00069.parquet
data/year=2025/month=03/train-00016-of-00069.parquet
data/year=2025/month=03/train-00017-of-00069.parquet
data/year=2025/mont
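
Since the names follow a predictable pattern, a small parser could pull out the shard index and the shard total, which would let us check whether a month's listing is complete. This is only a sketch: `ShardInfo`, `parse_shard_name`, and `missing_shard_indices` are hypothetical helpers, not part of our codebase or of `huggingface_hub`, and the regex assumes the `train-XXXXX-of-YYYYY.parquet` layout seen in the listing above.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout: data/year=YYYY/month=MM/train-<index>-of-<total>.parquet
SHARD_PATTERN = re.compile(
    r"^data/year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"train-(?P<index>\d+)-of-(?P<total>\d+)\.parquet$"
)


class ShardInfo(NamedTuple):
    """Hypothetical container for the pieces of a shard file name."""
    year: str
    month: str
    index: int
    total: int


def parse_shard_name(path: str) -> Optional[ShardInfo]:
    """Return the parsed parts of a shard path, or None if it doesn't match."""
    match = SHARD_PATTERN.match(path)
    if match is None:
        return None
    return ShardInfo(
        match["year"], match["month"], int(match["index"]), int(match["total"])
    )


def missing_shard_indices(paths: Iterable[str]) -> List[int]:
    """List shard indices absent from `paths`, based on the advertised total."""
    shards = [s for s in map(parse_shard_name, paths) if s is not None]
    if not shards:
        return []
    # Assumes every shard in one month advertises the same total.
    expected = set(range(shards[0].total))
    seen = {s.index for s in shards}
    return sorted(expected - seen)


# Usage: a listing with shards 0 and 2 of 3 is missing shard 1.
example_paths = [
    "data/year=2025/month=03/train-00000-of-00003.parquet",
    "data/year=2025/month=03/train-00002-of-00003.parquet",
]
print(missing_shard_indices(example_paths))  # -> [1]
```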

## Dupe checks

Now that we have the list of raw data file names for the given month and year, we'll perform our dupe checks. This parses through the list of file names to make sure we haven't already processed any of these files.

Each file is a 1 GB download, so it's in our best interest not to download a file we've already processed.
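
Conceptually, the check is a lookup keyed on the month and file name that compares the stored size and ETag with the remote values. The sketch below shows that idea only. `InMemoryFileRegistry` and `mark_file_processed` are hypothetical names, and this is not the actual `FileRegistry` from `utils`, whose storage and internals may differ. The `is_file_processed` argument order mirrors the call used in this notebook.

```python
from typing import Dict, Tuple


class InMemoryFileRegistry:
    """Hypothetical stand-in for a processed-file registry (no persistence)."""

    def __init__(self) -> None:
        # (month_key, file_name) -> (size_in_bytes, etag)
        self._processed: Dict[Tuple[str, str], Tuple[int, str]] = {}

    def mark_file_processed(
        self, month_key: str, file_name: str, size: int, etag: str
    ) -> None:
        """Record that this exact version of the file has been processed."""
        self._processed[(month_key, file_name)] = (size, etag)

    def is_file_processed(
        self, month_key: str, file_name: str, size: int, etag: str
    ) -> bool:
        """True only if the same name was processed with the same size and ETag."""
        return self._processed.get((month_key, file_name)) == (size, etag)


# Usage: a file counts as a dupe only when name, size, and ETag all match.
demo_registry = InMemoryFileRegistry()
demo_registry.mark_file_processed(
    "2025-03", "train-00000-of-00069.parquet", 1000, "abc"
)
print(demo_registry.is_file_processed(
    "2025-03", "train-00000-of-00069.parquet", 1000, "abc"
))  # -> True
```

Comparing size and ETag in addition to the name means a file that was re-uploaded under the same name would be treated as new rather than skipped.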

In [7]:
# Parse the list of raw data file names to make sure we haven't already processed any of these files, and skip downloading any dupes

from huggingface_hub import get_hf_file_metadata, hf_hub_url

import sys

# Current working directory (should be project root)
project_root = Path.cwd()
sys.path.insert(0, str(project_root))

from utils.file_processing.raw_data_file_dupe_checks import FileRegistry  # noqa: E402

# Init registry
registry = FileRegistry()

# Remove files already processed
non_dupe_files = []
for f in all_file_names_in_month:
    url = hf_hub_url(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        filename=f,
    )
    meta = get_hf_file_metadata(url=url)
    size = meta.size
    etag = meta.etag
    if not registry.is_file_processed(f"{year}-{month}", Path(f).name, size, etag):
        non_dupe_files.append(f)

print(len(non_dupe_files), "new files to download")

69 new files to download


In [None]:
from utils.downloading_raw_parquet_data.raw_parquet_data_file_downloader import (
    download_single_parquet_file,
)

max_files_to_download = 1  # for testing; set to None to download all

# Download only files that passed the dupe check above,
# so we never re-fetch a file we've already processed.
file_names_to_download = non_dupe_files[:max_files_to_download]

local_dir = Path("../data/raw/better_downloading_experiments")
local_dir.mkdir(parents=True, exist_ok=True)

for file_to_download in file_names_to_download:
    downloaded_file_path = download_single_parquet_file(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        file_to_download=file_to_download,
        local_dir=local_dir,
        # Pass ints without overwriting the string year/month used by earlier cells
        year=int(year),
        month=int(month),
    )
    if downloaded_file_path:
        # Here you would call your processing function, e.g.:
        # process_and_delete_file(downloaded_file_path)
        print(f"Successfully downloaded {downloaded_file_path.name}")

ModuleNotFoundError: No module named 'utils.downloading_raw_parquet_data.raw_parquet_data_file_downlaoder'