# Auto-Download Parquet Files

This notebook automatically downloads raw Parquet data.
In this case, the Parquet files are large collections of Lichess game data, about 1 GB each.
It downloads a month's worth of game data at a time, or up to whatever maximum number of files the user sets.
Right now (9.7.25) it just downloads the files, but soon I will have it process each file and auto-delete it when finished.

## Wifi Speed

My personal Wifi is quite fast (350 Mbps). This is a lot of data, but download speed isn't the bottleneck, because my system takes longer to process each file than to download the next one. If your connection is slower, you may need to adjust.
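
To get a feel for the numbers, here is a rough back-of-the-envelope estimate of download time at a given connection speed. It is only a sketch: `estimate_download_seconds` is a helper I made up for illustration, it treats 1 GB as 1024 MB (the same convention the speed calculation later in this notebook uses), and it assumes the connection sustains its rated speed, which real downloads rarely do.

```python
def estimate_download_seconds(size_gb: float, speed_mbps: float) -> float:
    """Seconds needed to transfer size_gb gigabytes at speed_mbps megabits per second."""
    size_megabits = size_gb * 1024 * 8  # GB -> MB -> megabits
    return size_megabits / speed_mbps


per_file = estimate_download_seconds(1.0, 350)
print(f"{per_file:.1f} s per file, {per_file * 30 / 60:.1f} min for 30 files")
# -> 23.4 s per file, 11.7 min for 30 files
```

If processing a file takes longer than that per-file figure on your machine, the download will not be what slows you down.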

In [2]:
from pathlib import Path
import sys

project_root = (
    Path(__file__).resolve().parent.parent
    if "__file__" in globals()
    else Path.cwd().parent
)
output_dir = str(project_root / "data" / "raw" / "auto_download_parquets")

# CONFIG
# I did my best to put all of the variables that we might need to adjust here.
# This is a proof of concept right now; later we may make this some sort of drop down picker for month, year etc
config = {
    "repo": "Lichess/standard-chess-games",  # Hugging Face repo id
    "year": "2025",  # 4-digit year (string or int)
    "month": "7",  # numeric month (e.g., "7" or "07")
    "max_parquets": 30,  # int or None to download all available
    # Download to data/raw/auto_download_parquets under the project root
    "output_dir": output_dir,
    "hf_token": None,  # set to your HF token string if you need to access gated datasets
    "probe_max_attempts": 1000,  # for fallback probing
    "probe_patterns": [  # tried in order if APIs gave no URLs
        # Pattern A: common "train-00000-of-00066.parquet" style
        "https://huggingface.co/datasets/{repo}/resolve/main/data/year={year}/month={month}/train-{idx:05d}-of-{total:05d}.parquet",
        # Pattern B: some datasets use plain shard names
        "https://huggingface.co/datasets/{repo}/resolve/main/data/year={year}/month={month}/train-{idx:05d}.parquet",
        # Pattern C: fall back to zero-padded 4-digit name
        "https://huggingface.co/datasets/{repo}/resolve/main/data/year={year}/month={month}/{idx:04d}.parquet",
    ],
}

# Extract config variables for use in the rest of the notebook
repo = config["repo"]
year = str(config["year"])
month_raw = str(config["month"])
month_padded = month_raw.zfill(2)
max_parquets = config["max_parquets"]
out_dir = Path(config["output_dir"])
out_dir.mkdir(parents=True, exist_ok=True)
hf_headers = {"Authorization": f"Bearer {config['hf_token']}"} if config.get("hf_token") else {}
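
The probe patterns in the config are ordinary `str.format` templates. The sketch below shows how Pattern A expands into a concrete URL. The assumption here is that the probing helper fills the placeholders by keyword (`repo`, `year`, `month`, `idx`, `total`); that helper is defined elsewhere, so this is an illustration rather than its actual code. The shard count of 66 is taken from the filenames in the output further down.

```python
# Pattern A from the config, expanded by hand for one shard.
pattern_a = (
    "https://huggingface.co/datasets/{repo}/resolve/main/data/"
    "year={year}/month={month}/train-{idx:05d}-of-{total:05d}.parquet"
)

url = pattern_a.format(
    repo="Lichess/standard-chess-games",
    year="2025",
    month="07",  # zero-padded, like month_padded above
    idx=3,       # shard index, formatted as 00003
    total=66,    # total number of shards, formatted as 00066
)
print(url)
```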

## Helper Functions

Importing helper functions that were defined elsewhere.

In [3]:
from utils.downloading_raw_parquet_data.api_interaction import (
    get_urls_from_hub_api,
    get_urls_from_dataset_viewer,
    filter_urls_for_month
)
from utils.downloading_raw_parquet_data.file_downloader import (
    probe_fallback_urls,
    download_file
)
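
Since the helpers live in another module, here is a rough idea of what a month filter could look like. This is a hypothetical stand-in named `filter_urls_for_month_sketch`, not the project's `filter_urls_for_month`. It assumes the URLs contain a `year=YYYY/month=MM/` path segment (as the probe patterns in the config suggest) and that some URLs may arrive percent-encoded, so it decodes before matching. The sample URLs are made up for the example.

```python
from urllib.parse import unquote


def filter_urls_for_month_sketch(urls, year, month):
    """Hypothetical filter: keep URLs whose decoded path contains year=YYYY/month=MM/."""
    needle = f"year={str(year)}/month={str(month).zfill(2)}/"
    matches = [u for u in (urls or []) if needle in unquote(u)]
    # Sort by the decoded form so encoded and plain URLs order consistently.
    return sorted(matches, key=unquote)


base = "https://huggingface.co/datasets/Lichess/standard-chess-games/resolve/main/data"
sample_urls = [
    f"{base}/year%3D2025/month%3D07/train-00001-of-00066.parquet",  # percent-encoded "="
    f"{base}/year=2025/month=07/train-00000-of-00066.parquet",
    f"{base}/year=2025/month=06/train-00000-of-00066.parquet",      # different month
]
july = filter_urls_for_month_sketch(sample_urls, 2025, "7")
print(len(july))
```

Accepting `None` for `urls` keeps the caller simple when an API call fails and returns nothing.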

## Main Execution Flow

The following cell contains the main logic for querying parquet URLs, filtering them, and downloading the files.
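
One thing to keep in mind about that flow: it skips any file that already exists in the output directory, so a partial file left behind by an interrupted run would be mistaken for a finished one. A downloader can guard against this by writing to a temporary name and renaming only on success. The sketch below shows that idea using only the standard library. `download_file_sketch` is a hypothetical stand-in, not the project's `download_file`, whose implementation isn't shown in this notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file_sketch(url, dest, headers=None, chunk_size=1024 * 1024):
    """Hypothetical downloader: stream to '<name>.part', then rename on success."""
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=60) as response, open(tmp, "wb") as fh:
            # Copy in chunks so a ~1 GB file is never held in memory at once.
            shutil.copyfileobj(response, fh, length=chunk_size)
        tmp.replace(dest)  # the final name appears only after a complete download
        return True
    except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
        print(f"  Download error: {exc}")
        tmp.unlink(missing_ok=True)  # remove any partial file
        return False
```

It returns `True` or `False` rather than raising, which matches how the result of `download_file` is checked in the loop.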

In [4]:
import time
import urllib.parse  # a bare `import urllib` does not reliably expose urllib.parse

# Main flow
print("1) Querying Hub API for parquet URLs...")
urls = get_urls_from_hub_api(repo, hf_headers)
if urls:
    print(f"  Hub API returned {len(urls)} total parquet URLs (unfiltered).")
else:
    print("  Hub API returned nothing (or failed).")

filtered = filter_urls_for_month(urls, year, month_padded)
if filtered:
    print(f"  Found {len(filtered)} parquet URLs for {year}/{month_padded} via Hub API.")
else:
    print("2) Trying dataset-viewer endpoint...")
    urls2 = get_urls_from_dataset_viewer(repo, hf_headers)
    if urls2:
        print(f"  dataset-viewer returned {len(urls2)} total parquet entries.")
        filtered = filter_urls_for_month(urls2, year, month_padded)
        if filtered:
            print(f"  Found {len(filtered)} parquet URLs for {year}/{month_padded} via dataset-viewer.")
if not filtered:
    print("3) No parquet URLs found via API; falling back to incremental probing (may be slower).")
    patterns = config.get("probe_patterns", [])
    found = probe_fallback_urls(repo, year, month_padded, config["probe_max_attempts"], patterns, hf_headers)
    filtered = found

if not filtered:
    print("ERROR: no parquet URLs discovered for that month/year by API or fallback probing. Aborting.")
    sys.exit(1)

if max_parquets is not None:
    filtered = filtered[:int(max_parquets)]

print(f"\nWill download {len(filtered)} file(s) into {out_dir.resolve()}\n")

success_count = 0
start_time = time.time()
total_downloaded_gb = 0.0
for i, url in enumerate(filtered):
    decoded = urllib.parse.unquote(url)
    filename = Path(decoded).name
    dest = out_dir / filename
    if dest.exists():
        print(f"[{i+1}/{len(filtered)}] Skipping (already exists): {filename}")
        success_count += 1
        continue
    print(f"[{i+1}/{len(filtered)}] Downloading: {filename}")
    file_start_time = time.time()
    ok = download_file(url, dest, hf_headers)
    if not ok:
        print(f"  Failed to download (skipping): {url}")
        if not urls:  # also covers the case where the Hub API call returned None
            print("  Probe-based download hit missing file — stopping probe downloads.")
            break
        else:
            continue
    # Time tracking and metrics
    file_end_time = time.time()
    elapsed_time = file_end_time - file_start_time
    total_elapsed_time = file_end_time - start_time
    file_size_gb = dest.stat().st_size / 1024**3  # Actual size on disk; files are about 1.01 GB, except the last one, which may be smaller
    total_downloaded_gb += file_size_gb
    download_speed_gbph = file_size_gb / (elapsed_time / 3600)  # GB per hour
    download_speed_mbps = (file_size_gb * 1024) / (elapsed_time / 8)  # Mbps
    avg_speed_gbph = total_downloaded_gb / (total_elapsed_time / 3600)  # Average GB per hour
    avg_speed_mbps = (total_downloaded_gb * 1024) / (total_elapsed_time / 8)  # Average Mbps
    remaining_files = len(filtered) - (i + 1)
    eta = (total_elapsed_time / (i + 1)) * remaining_files
    print(f"  Download completed in {elapsed_time:.2f} seconds.")
    print(f"  Current Speed: {download_speed_gbph:.2f} GB/hour ({download_speed_mbps:.2f} Mbps).")
    print(f"  Average Speed: {avg_speed_gbph:.2f} GB/hour ({avg_speed_mbps:.2f} Mbps).")
    print(f"  ETA for remaining files: {eta / 60:.2f} minutes.")
    print(f"  Total elapsed time: {total_elapsed_time / 60:.2f} minutes.")
    success_count += 1
    time.sleep(0.5)
    # Periodic summary after every fifth file
    if (i + 1) % 5 == 0:
        print(f"[Update] Total downloaded: {total_downloaded_gb:.2f} GB.")
        print(f"[Update] Average Speed: {avg_speed_gbph:.2f} GB/hour ({avg_speed_mbps:.2f} Mbps).")
        print(f"[Update] Total elapsed time: {total_elapsed_time / 60:.2f} minutes.")

print(f"\nDone. {success_count} file(s) downloaded or already present in: {out_dir.resolve()}")

1) Querying Hub API for parquet URLs...
  Hub API returned 26010 total parquet URLs (unfiltered).
2) Trying dataset-viewer endpoint...
  dataset-viewer returned 26010 total parquet entries.
3) No parquet URLs found via API; falling back to incremental probing (may be slower).

Will download 30 file(s) into /Users/a/Documents/personalprojects/chess-opening-recommender/data/raw/auto_download_parquets

[1/30] Skipping (already exists): train-00000-of-00066.parquet
[2/30] Skipping (already exists): train-00001-of-00066.parquet
[3/30] Skipping (already exists): train-00002-of-00066.parquet
[4/30] Skipping (already exists): train-00003-of-00066.parquet
[5/30] Skipping (already exists): train-00004-of-00066.parquet
[6/30] Skipping (already exists): train-00005-of-00066.parquet
[7/30] Skipping (already exists): train-00006-of-00066.parquet
[8/30] Skipping (already exists): train-00007-of-00066.parquet
[9/30] Skipping (already exists): train-00008-of-00066.parquet
[10/30] Skipping (already exis