# Downloading Experiments

This notebook is a proof of concept.

Right now, our raw data download system has some flaws. We want to get this right because we will be downloading and processing hundreds or thousands of GB of raw Parquet data.

So we'll experiment with some implementations here, and if they work, we'll replace parts of our original code with them.

In [1]:
%pip install huggingface_hub --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Getting file names

We want to use file names and other metadata to check for duplicates against what we've already processed, and to see what we still need to download for a given month. Let's run the code below, which gets all file names from the repo, to see what the data looks like.

TODO: I wrote `get_parquet_file_names` for this; if we use this notebook again, convert this functionality to use that util.
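
Since the remote paths encode the year, month, shard index and shard count, they can be parsed rather than only prefix-matched. The following is a minimal sketch of that idea, not the project's `get_parquet_file_names` util: the names `SHARD_PATH_RE`, `ShardInfo`, `parse_shard_path` and `missing_shard_indices` are hypothetical, and the pattern assumes every path follows the `data/year=YYYY/month=MM/train-NNNNN-of-NNNNN.parquet` layout seen in this repo.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout: data/year=2025/month=03/train-00001-of-00065.parquet
SHARD_PATH_RE = re.compile(
    r"^data/year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"train-(?P<index>\d+)-of-(?P<total>\d+)\.parquet$"
)


class ShardInfo(NamedTuple):
    year: int
    month: int
    index: int  # zero-based shard number
    total: int  # number of shards in that month


def parse_shard_path(path: str) -> Optional[ShardInfo]:
    """Return the parsed fields, or None if the path does not match the layout."""
    match = SHARD_PATH_RE.match(path)
    if match is None:
        return None
    return ShardInfo(
        year=int(match["year"]),
        month=int(match["month"]),
        index=int(match["index"]),
        total=int(match["total"]),
    )


def missing_shard_indices(paths: Iterable[str]) -> List[int]:
    """List shard indices that the '-of-NNNNN' count promises but the listing lacks."""
    infos = [info for info in map(parse_shard_path, paths) if info is not None]
    if not infos:
        return []
    expected = set(range(infos[0].total))
    seen = {info.index for info in infos}
    return sorted(expected - seen)
```

Calling `missing_shard_indices` on a month's listing gives a quick completeness check before any download starts.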

In [2]:
# Here we get a list of raw data files for the given month/year

# TODO I wrote get_parquet_file_names for this; if we use this notebook again, convert this functionality to use that util.

# File names in the remote repo are structured like:
# data/year=2025/month=03/train-00001-of-00065.parquet
# The number of shards varies by month, so the suffix won't always be -of-00065.parquet

from huggingface_hub import HfApi
from pathlib import Path

year = 2025
month = 4  # <-- type manually

api = HfApi()
files = api.list_repo_files(
    repo_id="Lichess/standard-chess-games",
    repo_type="dataset"
)

# Filter for that year/month
target_prefix = f"data/year={year}/month={month:02d}/"
all_file_names_in_month = [f for f in files if f.startswith(target_prefix)]

print(len(all_file_names_in_month), f"files found for {year}-{month}")
for f in all_file_names_in_month[:20]:  # preview first 20
    print(f)

65 files found for 2025-4
data/year=2025/month=04/train-00000-of-00065.parquet
data/year=2025/month=04/train-00001-of-00065.parquet
data/year=2025/month=04/train-00002-of-00065.parquet
data/year=2025/month=04/train-00003-of-00065.parquet
data/year=2025/month=04/train-00004-of-00065.parquet
data/year=2025/month=04/train-00005-of-00065.parquet
data/year=2025/month=04/train-00006-of-00065.parquet
data/year=2025/month=04/train-00007-of-00065.parquet
data/year=2025/month=04/train-00008-of-00065.parquet
data/year=2025/month=04/train-00009-of-00065.parquet
data/year=2025/month=04/train-00010-of-00065.parquet
data/year=2025/month=04/train-00011-of-00065.parquet
data/year=2025/month=04/train-00012-of-00065.parquet
data/year=2025/month=04/train-00013-of-00065.parquet
data/year=2025/month=04/train-00014-of-00065.parquet
data/year=2025/month=04/train-00015-of-00065.parquet
data/year=2025/month=04/train-00016-of-00065.parquet
data/year=2025/month=04/train-00017-of-00065.parquet
data/year=2025/month

## Dupe checks

Now that we have the list of raw data file names for the given month and year, we'll perform our dupe checks. We go through the list of file names and skip any file we have already processed.

Each file is a 1 GB download, so it's in our best interest not to download a file we've already processed.
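
To make the check concrete, here is a minimal in-memory sketch of a registry keyed by month and file name that treats a file as processed only when its size and ETag both match. `InMemoryFileRegistry` is a hypothetical stand-in written for illustration; it mirrors the two method names used in this notebook but is not the project's `FileRegistry`, whose storage and matching rules may differ.

```python
from typing import Dict, Tuple


class InMemoryFileRegistry:
    """Hypothetical stand-in for illustration; not the project's FileRegistry."""

    def __init__(self) -> None:
        # (month, filename) -> (size in bytes, etag)
        self._processed: Dict[Tuple[str, str], Tuple[int, str]] = {}

    def mark_file_processed(self, month: str, filename: str, size: int, etag: str) -> None:
        """Record that this exact version of the file has been handled."""
        self._processed[(month, filename)] = (size, etag)

    def is_file_processed(self, month: str, filename: str, size: int, etag: str) -> bool:
        """True only if the same file was recorded with the same size and etag."""
        return self._processed.get((month, filename)) == (size, etag)


# Made-up size and etag values, used only to exercise the sketch
registry_sketch = InMemoryFileRegistry()
registry_sketch.mark_file_processed(
    "2025-3", "2025-03-train-00000-of-00065.parquet", 1_000, "etag-a"
)
```

Comparing size and ETag as well as the name means a file that was re-uploaded with different contents counts as new. The month key and file name must be built the same way when registering and when checking, otherwise every lookup misses.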

In [3]:
# Parse the list of raw data file names to make sure we haven't already processed any of these files, and skip downloading any dupes

from huggingface_hub import get_hf_file_metadata, hf_hub_url

import sys

# Current working directory (should be project root)
project_root = Path.cwd()
sys.path.insert(0, str(project_root))

from utils.file_processing.raw_data_file_dupe_checks import FileRegistry  # noqa: E402

# Init registry
registry = FileRegistry()

# Remove files already processed
non_dupe_files = []
month_str = f"{year}-{month}"
for f in all_file_names_in_month:
    url = hf_hub_url(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        filename=f,
    )
    meta = get_hf_file_metadata(url=url)
    size = meta.size
    etag = meta.etag

    # This is the filename format that will be saved in the registry
    expected_filename_in_registry = f"{year}-{month:02d}-{Path(f).name}"

    if not registry.is_file_processed(
        month_str, expected_filename_in_registry, size, etag
    ):
        non_dupe_files.append(f)

print(len(non_dupe_files), "new files to download")

65 new files to download


## Config

Next, we configure how the download and processing pipeline operates.
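
The pipeline builds one base config and derives a per-file copy from it with `.replace(parquet_path=...)`. One way such a config could be written is sketched below as a frozen dataclass; `SketchProcessingConfig` is a hypothetical illustration and not the project's `ProcessingConfig`, although the field names and default values are taken from the config cell in this notebook.

```python
import dataclasses
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet


@dataclass(frozen=True)
class SketchProcessingConfig:
    """Hypothetical illustration; the real ProcessingConfig may be defined differently."""

    parquet_path: str
    db_path: Path
    batch_size: int = 1_500_000
    min_player_rating: int = 1200
    max_elo_difference_between_players: int = 100
    allowed_time_controls: FrozenSet[str] = frozenset({"Blitz", "Rapid", "Classical"})

    def replace(self, **changes) -> "SketchProcessingConfig":
        """Return a copy with the given fields changed; the original is untouched."""
        return dataclasses.replace(self, **changes)


sketch_base = SketchProcessingConfig(parquet_path="", db_path=Path("chess_games.db"))
sketch_file_config = sketch_base.replace(parquet_path="2025-03-train-00000-of-00065.parquet")
```

Keeping the config immutable means the shared base config cannot be modified by accident while looping over files.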

In [None]:
# Config stuff
import importlib
from utils.database import db_utils, player_game_counts_db_utils
from utils.file_processing import process_game_batch
importlib.reload(db_utils)
importlib.reload(process_game_batch)
importlib.reload(player_game_counts_db_utils)


from utils.downloading_raw_parquet_data.raw_parquet_data_file_downloader import (
    download_single_parquet_file,
)
from utils.file_processing.process_parquet_file import (
    process_parquet_file,
)
from utils.file_processing.types_and_classes import ProcessingConfig
from utils.database.db_utils import get_db_connection, setup_database
from utils.database.player_game_counts_db_utils import get_eligible_player_usernames


# --- Configuration ---
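# NOTE: these must match the year/month used to build non_dupe_files above;
# otherwise files are downloaded from one month but named and registered under another.
year, month = 2025, 3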
max_files_to_download = 40  # For testing; set to None to process all new files
local_dir = Path("../data/raw/better_downloading_experiments")
local_dir.mkdir(parents=True, exist_ok=True)

# Define the path for the DuckDB database file.
db_path = Path("../data/processed/chess_games.db")
db_path.parent.mkdir(parents=True, exist_ok=True)

# Path to the database containing our list of eligible players.
player_counts_db_path = Path(
    "../data/processed/find_most_active_players/player_game_counts.duckdb"
)


# Base config for processing. This will be used for each file.
base_config = ProcessingConfig(
    parquet_path="",  # This will be set per-file
    db_path=db_path,
    batch_size=1_500_000,
    min_player_rating=1200,
    max_elo_difference_between_players=100,
    allowed_time_controls={"Blitz", "Rapid", "Classical"},
)

# --- Database Initialization ---
# Set up the database schema before starting any processing.
# This is idempotent; it will only create tables if they don't exist.
with get_db_connection(db_path) as con:
    setup_database(con)

# --- Load Eligible Players ---
# Load the set of usernames for players we want to include in our analysis.
# Games will be filtered to only include those where at least one player is in this set.
with get_db_connection(player_counts_db_path) as con:
    eligible_players = get_eligible_player_usernames(con)


# --- File Selection ---
# Use the list of non-duplicate files from the previous cell
files_to_download = non_dupe_files
if max_files_to_download is not None:
    files_to_download = non_dupe_files[:max_files_to_download]

print(f"Prepared to download and process {len(files_to_download)} files.")

Initializing database schema...
Database tables and partitioned stats tables are ready.
Retrieved 50,000 eligible players from the database.
Prepared to download and process 10 files.


## Downloading, processing, deleting

Now, we'll download, process and delete our raw data files one by one.

The workflow is:

1. Download a file
2. Process that file, extracting the game data we want
3. Delete that file
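
The three steps above can be sketched as a small loop with the download and processing steps passed in as callables. This is an illustrative sketch only: `run_pipeline`, `fake_download` and `fake_process` are hypothetical names, the fakes write a few bytes instead of fetching real data, and unlike the notebook's cell the sketch does not touch a registry. It deletes the local file in a `finally` block, so disk space is freed even if processing raises.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def run_pipeline(
    files: Iterable[str],
    download: Callable[[str], Optional[Path]],
    process: Callable[[Path], bool],
) -> List[str]:
    """Download, process and delete each file in turn; return the names that succeeded."""
    succeeded: List[str] = []
    for name in files:
        local_path = download(name)  # 1. Download
        if local_path is None:
            continue  # download failed, nothing to clean up
        try:
            if process(local_path):  # 2. Process
                succeeded.append(name)
        finally:
            local_path.unlink(missing_ok=True)  # 3. Delete, even on failure
    return succeeded


# Small demonstration with stand-in download and process steps
with tempfile.TemporaryDirectory() as tmp:

    def fake_download(name: str) -> Optional[Path]:
        if name == "missing.parquet":
            return None  # simulate a failed download
        path = Path(tmp) / name
        path.write_bytes(b"stub")
        return path

    def fake_process(path: Path) -> bool:
        return path.name != "bad.parquet"  # simulate one processing failure

    done = run_pipeline(
        ["a.parquet", "missing.parquet", "bad.parquet"], fake_download, fake_process
    )
    leftovers = sorted(p.name for p in Path(tmp).iterdir())
```

Only one raw file is on disk at any time, which keeps peak disk usage at about one file's size regardless of how many files are queued.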

In [5]:
import time
import os
from collections import defaultdict
from statistics import mean
# --- Timing Stats ---
timing_stats = defaultdict(list)
total_start_time = time.time()

for i, file_to_download in enumerate(files_to_download):
    print(f"\n--- Downloading [{i+1}/{len(files_to_download)}] ---")
    step_start_time = time.time()

    # 1. Download the file
    downloaded_file_path = download_single_parquet_file(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        file_to_download=file_to_download,
        local_dir=local_dir,
        year=year,
        month=month,
    )

    timing_stats['download'].append(time.time() - step_start_time)
    if not downloaded_file_path:
        print(f"DOWNLOAD FAILED for {file_to_download}. Skipping.")
        continue

    print(f"Successfully downloaded: {downloaded_file_path.name}")

    # Get metadata for the downloaded file
    url = hf_hub_url(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        filename=file_to_download,
    )
    meta = get_hf_file_metadata(url=url)

    # 2. Process the file
    step_start_time = time.time()
    file_config = base_config.replace(parquet_path=str(downloaded_file_path))
    file_context = {
        "current_file_num": i + 1,
        "total_files": len(files_to_download),
        "total_start_time": total_start_time,
    }
    is_processing_successful = process_parquet_file(
        config=file_config,
        eligible_players=eligible_players,
        file_context=file_context,
    )
    timing_stats['process'].append(time.time() - step_start_time)

    # 3. Register and delete (on success or failure, so the file is never re-downloaded)
    step_start_time = time.time()
    if is_processing_successful:
        print(f"PROCESSING SUCCESSFUL for {downloaded_file_path.name}")
        registry.mark_file_processed(
            month=f"{year}-{month}",
            filename=downloaded_file_path.name,
            size=meta.size,
            etag=meta.etag,
        )
        print("Registered file as processed.")
        os.remove(downloaded_file_path)
        print(f"Deleted local file: {downloaded_file_path.name}")
    else:
        print(f"PROCESSING FAILED for {downloaded_file_path.name}")
        registry.mark_file_processed(
            month=f"{year}-{month}",
            filename=downloaded_file_path.name,
            size=meta.size,
            etag=meta.etag,
        )
        print("Registered file as processed to avoid re-downloading.")
        os.remove(downloaded_file_path)
        print(f"Deleted local file: {downloaded_file_path.name}")
    timing_stats['register_delete'].append(time.time() - step_start_time)

# --- Summary ---
print("\n--- Timing Summary ---")
for step, times in timing_stats.items():
    print(f"{step.capitalize()} - Avg: {mean(times):.2f}s, Max: {max(times):.2f}s, Min: {min(times):.2f}s")
total_elapsed_time = time.time() - total_start_time
print(f"Total elapsed time: {total_elapsed_time:.2f}s")


--- Downloading [1/10] ---
File saved to ../data/raw/better_downloading_experiments/2025-03-train-00000-of-00065.parquet
Successfully downloaded: 2025-03-train-00000-of-00065.parquet
Processing 1,409,443 rows in 1 batches of size 1,500,000

--- Starting Batch 1/1 ---


--- Partition Timing Metrics ---
Partition A: 2.92s
Partition B: 3.56s
Partition C: 3.75s
Partition D: 1.64s
Partition E: 0.17s
Partition other: 0.02s
    Processed 268,417 games.
    Updated stats for 253,814 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.32s
extract_players_openings: 0.06s
insert_entities: 0.12s
aggregate_stats: 0.16s
bulk_upsert: 12.05s
    Batch processed in 16.03s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409443
  total_time_sec: 16.168694019317627
  avg_batch_time_sec: 16.026267051696777
  min_batch_time_sec: 16.026267051696777
  max_batch_time_sec: 16.026267051696777
  avg_batch_size: 1409443.0
  overall_rate_games_per_sec: 87171.10969606209
  memory_usage: [{'percent': 44.9, 'used_gb': 14.276870727539062, 'available_gb': 17.639514923095703}]
  accepted_games: 268417
  filtered_games: 1141026
  acceptance_rate_percent: 19.0
Running VACUUM on the database...
VACUUM complete. El

--- Partition Timing Metrics ---
Partition A: 2.57s
Partition B: 3.14s
Partition C: 3.61s
Partition D: 1.50s
Partition E: 0.18s
Partition other: 0.02s
    Processed 252,297 games.
    Updated stats for 241,071 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.29s
extract_players_openings: 0.04s
insert_entities: 0.11s
aggregate_stats: 0.15s
bulk_upsert: 11.01s
    Batch processed in 14.34s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409012
  total_time_sec: 14.462533235549927
  avg_batch_time_sec: 14.339946985244751
  min_batch_time_sec: 14.339946985244751
  max_batch_time_sec: 14.339946985244751
  avg_batch_size: 1409012.0
  overall_rate_games_per_sec: 97424.9792240096
  memory_usage: [{'percent': 43.6, 'used_gb': 13.867965698242188, 'available_gb': 18.03232192993164}]
  accepted_games: 252297
  filtered_games: 1156715
  acceptance_rate_percent: 17.9
Running VACUUM on the database...
VACUUM complete. Elap

--- Partition Timing Metrics ---
Partition A: 2.57s
Partition B: 3.53s
Partition C: 3.59s
Partition D: 1.45s
Partition E: 0.14s
Partition other: 0.02s
    Processed 264,109 games.
    Updated stats for 249,461 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.29s
extract_players_openings: 0.04s
insert_entities: 0.10s
aggregate_stats: 0.15s
bulk_upsert: 11.30s
    Batch processed in 15.11s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409562
  total_time_sec: 15.235915899276733
  avg_batch_time_sec: 15.108738899230957
  min_batch_time_sec: 15.108738899230957
  max_batch_time_sec: 15.108738899230957
  avg_batch_size: 1409562.0
  overall_rate_games_per_sec: 92515.73776847335
  memory_usage: [{'percent': 44.3, 'used_gb': 14.077781677246094, 'available_gb': 17.822982788085938}]
  accepted_games: 264109
  filtered_games: 1145453
  acceptance_rate_percent: 18.7
Running VACUUM on the database...
VACUUM complete. El

--- Partition Timing Metrics ---
Partition A: 2.65s
Partition B: 3.25s
Partition C: 3.57s
Partition D: 1.47s
Partition E: 0.17s
Partition other: 0.02s
    Processed 252,300 games.
    Updated stats for 241,333 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.31s
extract_players_openings: 0.05s
insert_entities: 0.12s
aggregate_stats: 0.16s
bulk_upsert: 11.13s
    Batch processed in 14.89s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409030
  total_time_sec: 15.034995794296265
  avg_batch_time_sec: 14.891157865524292
  min_batch_time_sec: 14.891157865524292
  max_batch_time_sec: 14.891157865524292
  avg_batch_size: 1409030.0
  overall_rate_games_per_sec: 93716.68733918338
  memory_usage: [{'percent': 44.5, 'used_gb': 14.137603759765625, 'available_gb': 17.76314926147461}]
  accepted_games: 252300
  filtered_games: 1156730
  acceptance_rate_percent: 17.9
Running VACUUM on the database...
VACUUM complete. Ela

--- Partition Timing Metrics ---
Partition A: 2.66s
Partition B: 3.62s
Partition C: 3.52s
Partition D: 1.48s
Partition E: 0.16s
Partition other: 0.02s
    Processed 253,381 games.
    Updated stats for 239,358 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.29s
extract_players_openings: 0.04s
insert_entities: 0.11s
aggregate_stats: 0.24s
bulk_upsert: 11.47s
    Batch processed in 15.10s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409641
  total_time_sec: 15.242969036102295
  avg_batch_time_sec: 15.104774951934814
  min_batch_time_sec: 15.104774951934814
  max_batch_time_sec: 15.104774951934814
  avg_batch_size: 1409641.0
  overall_rate_games_per_sec: 92478.11214871118
  memory_usage: [{'percent': 46.3, 'used_gb': 14.702259063720703, 'available_gb': 17.18930435180664}]
  accepted_games: 253381
  filtered_games: 1156260
  acceptance_rate_percent: 18.0
Running VACUUM on the database...
VACUUM complete. Ela

--- Partition Timing Metrics ---
Partition A: 2.57s
Partition B: 3.15s
Partition C: 3.42s
Partition D: 1.43s
Partition E: 0.15s
Partition other: 0.02s
    Processed 246,038 games.
    Updated stats for 234,730 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.29s
extract_players_openings: 0.05s
insert_entities: 0.11s
aggregate_stats: 0.14s
bulk_upsert: 10.73s
    Batch processed in 14.28s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1408925
  total_time_sec: 14.41655707359314
  avg_batch_time_sec: 14.280955791473389
  min_batch_time_sec: 14.280955791473389
  max_batch_time_sec: 14.280955791473389
  avg_batch_size: 1408925.0
  overall_rate_games_per_sec: 97729.64465841383
  memory_usage: [{'percent': 46.8, 'used_gb': 14.8580322265625, 'available_gb': 17.024864196777344}]
  accepted_games: 246038
  filtered_games: 1162887
  acceptance_rate_percent: 17.5
Running VACUUM on the database...
VACUUM complete. Elaps

--- Partition Timing Metrics ---
Partition A: 2.59s
Partition B: 3.09s
Partition C: 3.21s
Partition D: 1.47s
Partition E: 0.16s
Partition other: 0.02s
    Processed 246,497 games.
    Updated stats for 232,263 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.28s
extract_players_openings: 0.05s
insert_entities: 0.11s
aggregate_stats: 0.15s
bulk_upsert: 10.54s
    Batch processed in 14.06s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1409351
  total_time_sec: 14.178462028503418
  avg_batch_time_sec: 14.056272268295288
  min_batch_time_sec: 14.056272268295288
  max_batch_time_sec: 14.056272268295288
  avg_batch_size: 1409351.0
  overall_rate_games_per_sec: 99400.8374932864
  memory_usage: [{'percent': 46.7, 'used_gb': 14.831741333007812, 'available_gb': 17.051433563232422}]
  accepted_games: 246497
  filtered_games: 1162854
  acceptance_rate_percent: 17.5
Running VACUUM on the database...
VACUUM complete. Ela

--- Partition Timing Metrics ---
Partition A: 2.58s
Partition B: 3.19s
Partition C: 3.45s
Partition D: 1.32s
Partition E: 0.15s
Partition other: 0.02s
    Processed 250,074 games.
    Updated stats for 237,272 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.28s
extract_players_openings: 0.05s
insert_entities: 0.11s
aggregate_stats: 0.15s
bulk_upsert: 10.71s
    Batch processed in 14.30s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1408968
  total_time_sec: 14.420851945877075
  avg_batch_time_sec: 14.299410820007324
  min_batch_time_sec: 14.299410820007324
  max_batch_time_sec: 14.299410820007324
  avg_batch_size: 1408968.0
  overall_rate_games_per_sec: 97703.52024193858
  memory_usage: [{'percent': 46.7, 'used_gb': 14.819355010986328, 'available_gb': 17.06417465209961}]
  accepted_games: 250074
  filtered_games: 1158894
  acceptance_rate_percent: 17.7
Running VACUUM on the database...
VACUUM complete. Ela

--- Partition Timing Metrics ---
Partition A: 2.52s
Partition B: 3.22s
Partition C: 3.26s
Partition D: 1.38s
Partition E: 0.16s
Partition other: 0.02s
    Processed 244,057 games.
    Updated stats for 228,455 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.36s
extract_players_openings: 0.08s
insert_entities: 0.11s
aggregate_stats: 0.15s
bulk_upsert: 10.57s
    Batch processed in 14.10s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1408988
  total_time_sec: 14.218663930892944
  avg_batch_time_sec: 14.095444917678833
  min_batch_time_sec: 14.095444917678833
  max_batch_time_sec: 14.095444917678833
  avg_batch_size: 1408988.0
  overall_rate_games_per_sec: 99094.26137702618
  memory_usage: [{'percent': 46.9, 'used_gb': 14.879318237304688, 'available_gb': 16.993030548095703}]
  accepted_games: 244057
  filtered_games: 1164931
  acceptance_rate_percent: 17.3
Running VACUUM on the database...
VACUUM complete. El

--- Partition Timing Metrics ---
Partition A: 2.59s
Partition B: 3.31s
Partition C: 3.56s
Partition D: 1.36s
Partition E: 0.15s
Partition other: 0.02s
    Processed 263,027 games.
    Updated stats for 248,906 player-opening combinations (partitioned by ECO letter).

--- Batch Timing Metrics ---
filter_valid_games: 0.28s
extract_players_openings: 0.04s
insert_entities: 0.10s
aggregate_stats: 0.15s
bulk_upsert: 10.99s
    Batch processed in 14.53s.
    File ETA: 00:00:00

File Processing Summary:
  total_games: 1408852
  total_time_sec: 14.679245948791504
  avg_batch_time_sec: 14.531150817871094
  min_batch_time_sec: 14.531150817871094
  max_batch_time_sec: 14.531150817871094
  avg_batch_size: 1408852.0
  overall_rate_games_per_sec: 95975.77456735687
  memory_usage: [{'percent': 47.2, 'used_gb': 14.964763641357422, 'available_gb': 16.880130767822266}]
  accepted_games: 263027
  filtered_games: 1145825
  acceptance_rate_percent: 18.7
Running VACUUM on the database...
VACUUM complete. El