# Downloading Experiments

This notebook is a proof of concept.

Right now, our raw data download system has some flaws. We want to get this right because we will be downloading and processing hundreds or thousands of GB of raw Parquet files.

So we'll experiment with a few implementations here, and if they work, we'll replace parts of our original code with them.

In [8]:
%pip install huggingface_hub --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Getting file names

We want to use file names and other metadata to check for duplicates against what we've already processed, and to see what we still need to download for a given month. Let's run this code, which lists every file name in the repo, to see what the data looks like.

In [10]:
# File names in the remote repo are structured like:
# data/year=2025/month=03/train-00001-of-00065.parquet
# The number of shards varies by month, so the suffix won't always be -of-00065.parquet

from huggingface_hub import HfApi

year = "2025"
month = "03"  # <-- type manually

api = HfApi()
files = api.list_repo_files(
    repo_id="Lichess/standard-chess-games",
    repo_type="dataset"
)

# Filter for that year/month
# (despite its name, num_files_in_month holds the list of matching paths, not a count)
target_prefix = f"data/year={year}/month={month}/"
num_files_in_month = [f for f in files if f.startswith(target_prefix)]

print(len(num_files_in_month), f"files found for {year}-{month}")
for f in num_files_in_month[:20]:  # preview first 20
    print(f)



69 files found for 2025-03
data/year=2025/month=03/train-00000-of-00069.parquet
data/year=2025/month=03/train-00001-of-00069.parquet
data/year=2025/month=03/train-00002-of-00069.parquet
data/year=2025/month=03/train-00003-of-00069.parquet
data/year=2025/month=03/train-00004-of-00069.parquet
data/year=2025/month=03/train-00005-of-00069.parquet
data/year=2025/month=03/train-00006-of-00069.parquet
data/year=2025/month=03/train-00007-of-00069.parquet
data/year=2025/month=03/train-00008-of-00069.parquet
data/year=2025/month=03/train-00009-of-00069.parquet
data/year=2025/month=03/train-00010-of-00069.parquet
data/year=2025/month=03/train-00011-of-00069.parquet
data/year=2025/month=03/train-00012-of-00069.parquet
data/year=2025/month=03/train-00013-of-00069.parquet
data/year=2025/month=03/train-00014-of-00069.parquet
data/year=2025/month=03/train-00015-of-00069.parquet
data/year=2025/month=03/train-00016-of-00069.parquet
data/year=2025/month=03/train-00017-of-00069.parquet
data/year=2025/mont
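
Since the listing follows a regular pattern, here is a minimal sketch of how we might parse those paths and check that a month's shard set is complete. The regex is an assumption based only on the file names printed above, and `Shard`, `parse_shard`, and `missing_shard_indices` are hypothetical helpers, not part of `huggingface_hub`.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern assumed from the listing above:
# data/year=YYYY/month=MM/train-<index>-of-<total>.parquet
SHARD_RE = re.compile(
    r"^data/year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"train-(?P<index>\d+)-of-(?P<total>\d+)\.parquet$"
)


class Shard(NamedTuple):
    year: int
    month: int
    index: int  # zero-based shard number
    total: int  # number of shards announced in the file name


def parse_shard(path: str) -> Optional[Shard]:
    """Return the parsed fields, or None if the path does not match the pattern."""
    m = SHARD_RE.match(path)
    if m is None:
        return None
    return Shard(int(m["year"]), int(m["month"]), int(m["index"]), int(m["total"]))


def missing_shard_indices(paths: Iterable[str]) -> List[int]:
    """List shard indices that the file names announce but that are not present."""
    shards = [s for s in map(parse_shard, paths) if s is not None]
    if not shards:
        return []
    total = shards[0].total  # assumes every path belongs to the same month
    present = {s.index for s in shards}
    return sorted(set(range(total)) - present)
```

Calling `missing_shard_indices(num_files_in_month)` on the filtered list should return an empty list when every shard announced in the file names is actually in the repo.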

In [None]:
from huggingface_hub import hf_hub_download
from pathlib import Path
import shutil

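# Integers here (not the strings used above) so that {month:02d} works below
year, month = 2025, 3
file_to_download = num_files_in_month[0]  # first file of the month, as a test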

local_dir = Path("../data/raw/better_downloading_experiments")
local_dir.mkdir(parents=True, exist_ok=True)

# Download (HF handles caching)
downloaded_path = hf_hub_download(
    repo_id="Lichess/standard-chess-games",
    repo_type="dataset",
    filename=file_to_download,
)

# Rename with a year-month prefix and copy into a flat folder.
# hf_hub_download returns a path inside the HF cache that mirrors the repo's nested layout;
# we don't want that locally, so we reorganize it into the flat file structure we use.
target_filename = f"{year}-{month:02d}-{Path(file_to_download).name}"
target_path = local_dir / target_filename
shutil.copy(downloaded_path, target_path)  # or shutil.move() to remove the cached original
print(f"File saved to {target_path}")




Downloaded to ../data/raw/better_downloading_experiments/data/year=2025/month=03/train-00000-of-00069.parquet
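
To tie this back to the duplicate check, here is a rough sketch of how the single-file download could become a loop that skips files already in the flat folder. `flat_name`, `files_to_download`, `download_missing`, and the `fetch` parameter are hypothetical names introduced for illustration. The default fetcher uses the same `hf_hub_download` call as the cell above, and checking for an existing local file is only one possible definition of "already processed".

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def flat_name(repo_path: str, year: int, month: int) -> str:
    """Flat local name, same scheme as the cell above: YYYY-MM-<original file name>."""
    return f"{year}-{month:02d}-{Path(repo_path).name}"


def files_to_download(repo_paths: Iterable[str], local_dir, year: int, month: int) -> List[str]:
    """Keep only repo paths whose flat-named copy does not exist locally yet."""
    local_dir = Path(local_dir)
    return [p for p in repo_paths if not (local_dir / flat_name(p, year, month)).exists()]


def download_missing(
    repo_paths: Iterable[str],
    local_dir,
    year: int,
    month: int,
    fetch: Optional[Callable[[str], str]] = None,
) -> List[Path]:
    """Download each missing file and copy it into local_dir under its flat name."""
    if fetch is None:
        # Imported lazily so the helpers above work without the library installed.
        from huggingface_hub import hf_hub_download

        def fetch(repo_path: str) -> str:
            return hf_hub_download(
                repo_id="Lichess/standard-chess-games",
                repo_type="dataset",
                filename=repo_path,
            )

    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for repo_path in files_to_download(repo_paths, local_dir, year, month):
        cached_path = fetch(repo_path)  # path to the downloaded (cached) file
        target_path = local_dir / flat_name(repo_path, year, month)
        shutil.copy(cached_path, target_path)
        saved.append(target_path)
    return saved
```

With the variables from the cells above, `download_missing(num_files_in_month, local_dir, year, month)` would fetch only what is not yet in the folder. Passing a custom `fetch` makes it possible to try the skip logic without touching the network.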
