# Purpose

This notebook will find the 50k most active Lichess players (in rapid, blitz, and classical games), measured by number of games played.

It does this by downloading and reading parquet files of raw game data.

Downloading all of the hundreds of 1GB parquet files isn't practical, so we will download a sample of them and extrapolate from there.

We don't need to know exactly who is the *most* active; we just need to find *very* active players.
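
As a rough sketch of the extrapolation idea, assuming the sampled files are representative of the whole month: a player's count in the sample can be scaled by the ratio of total files to sampled files. The function name and the figure of 65 files per month are illustrative assumptions, not part of the existing code.

```python
def estimate_monthly_games(games_in_sample: int, files_sampled: int, files_in_month: int) -> float:
    """Scale a player's game count in the sampled files up to a whole month."""
    if files_sampled <= 0:
        raise ValueError("files_sampled must be positive")
    return games_in_sample * files_in_month / files_sampled


# Hypothetical numbers: a player seen in 12 games across 3 sampled files,
# with roughly 65 files published for the month.
estimate = estimate_monthly_games(games_in_sample=12, files_sampled=3, files_in_month=65)
print(f"Estimated games this month: {estimate:.0f}")
```

Since we only need a ranking of very active players, the raw sample counts are usually enough; the scaling matters only if we want absolute numbers.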

Then, we will process their games for data to feed into our chess opening recommender AI model.

# Reason

Originally, we were just collecting data on millions of players without caring who was active or not.

This quickly grew impractical: the local database became massive, and game processing slowed down sharply because SQL queries against the local DuckDB database got slower as it grew.

So I've decided to focus on the 50k most active players instead.

# Method

1. Get names of parquet files
    - We will want to look at games data over the course of a year or so
    - So, we'll download some 1GB raw game data parquet files (number TBD) from each month (there are about 60-70 total per month)
    - To do this, we need the names of those files to get from the HuggingFace API
    - Luckily we already have functionality elsewhere to do this; I'll import it here and update slightly for our needs

2. Download parquet files
    - Once we have the names of the needed files, we'll download them
    - I already have functionality to do this elsewhere, just need to import it here

3. Save game counts
    - Once we have our parquet files downloaded, we'll count the number of games each username has
    - There will be some filters: only Rapid/Blitz/Classical games, and only games where both players are rated at least 1200
    - Need to decide on a data saving method; probably a CSV with username/num_games columns
    - Wherever it's saved, it needs to be persistent, as this processing will happen over multiple sessions

4. Get most active players
    - Finally, we'll analyze our CSV to retrieve the usernames of the 50,000 most active Lichess players
    - Profit


Note: all data for this will be saved in the `data/processed/find_most_active_players` directory.
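
Steps 3 and 4 above can be sketched as follows, assuming the CSV option with `username`/`num_games` columns. The helper names `save_counts` and `load_top_players` and the toy game list are hypothetical; the notebook itself may end up storing counts differently.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def save_counts(counts: Counter, csv_path: Path) -> None:
    """Write username/num_games rows to a CSV, most active players first."""
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["username", "num_games"])
        writer.writerows(counts.most_common())


def load_top_players(csv_path: Path, n: int) -> list[str]:
    """Read the CSV back and return the n usernames with the most games."""
    with csv_path.open(newline="") as f:
        rows = [(r["username"], int(r["num_games"])) for r in csv.DictReader(f)]
    rows.sort(key=lambda r: r[1], reverse=True)
    return [username for username, _ in rows[:n]]


# Toy data: each tuple is (white, black) for one game.
games = [("alice", "bob"), ("alice", "carol"), ("bob", "alice")]
counts = Counter()
for white, black in games:
    counts[white] += 1
    counts[black] += 1

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "player_game_counts.csv"
    save_counts(counts, csv_path)
    top_two = load_top_players(csv_path, n=2)

print(top_two)
```

In the real run, `n` would be 50,000 and the counts would accumulate across many parquet files.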

## Config

Below, we'll define some important variables that we want to use in this notebook.

I like to keep them in one place at the top.

In [None]:
from utils.downloading_raw_parquet_data.get_parquet_file_names import (
    get_parquet_file_names,
)

# The year and month we want to download data for

month_for_downloading = 6
year_for_downloading = 2025

max_files_to_download_per_month = 3

## 1. Get file names

Now, we'll get the file names for the month and year we want to download.
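
For illustration, here is a minimal sketch of how file names for one month could be picked out of a full repository listing. It assumes the dataset stores files under paths like `data/year=YYYY/month=MM/...parquet`; that layout, the helper name `filter_files_for_month`, and the sample paths are assumptions, and the imported `get_parquet_file_names` may work differently. A real listing could come from `huggingface_hub.list_repo_files(repo_id, repo_type="dataset")`, but a hard-coded list is used here so the sketch runs offline.

```python
def filter_files_for_month(all_paths: list[str], year: int, month: int) -> list[str]:
    """Keep only the parquet files under the assumed year/month directory."""
    prefix = f"data/year={year}/month={month:02d}/"
    return sorted(p for p in all_paths if p.startswith(prefix) and p.endswith(".parquet"))


# Hypothetical repository listing.
sample_paths = [
    "README.md",
    "data/year=2025/month=05/train-00000-of-00060.parquet",
    "data/year=2025/month=06/train-00001-of-00065.parquet",
    "data/year=2025/month=06/train-00000-of-00065.parquet",
]

june_files = filter_files_for_month(sample_paths, year=2025, month=6)
print("\n".join(june_files))
```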

In [None]:
file_names_for_month = get_parquet_file_names(year_for_downloading, month_for_downloading)
# print with line separations
print("\n".join(file_names_for_month[:max_files_to_download_per_month]))

## Database Setup

Now, we'll initialize our database.

This is a small database that tracks the number of games played by a Lichess username, as well as which files have been checked so there are no duplicates.
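
To make the idea concrete, here is a minimal sketch of what such a database might look like: one table of per-player counts and one table of already-processed files. The table names, column names, and helper functions are hypothetical, and the standard library's `sqlite3` is used only as a stand-in for DuckDB so the sketch runs without extra installs; the real schema lives in `setup_player_game_counts_table`.

```python
import sqlite3

# Stand-in for the DuckDB connection used by the notebook.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE player_game_counts ("
    "username TEXT PRIMARY KEY, num_games INTEGER NOT NULL)"
)
con.execute(
    "CREATE TABLE processed_files ("
    "year INTEGER, month INTEGER, file_name TEXT, "
    "PRIMARY KEY (year, month, file_name))"
)


def add_games(con, username, num_games):
    """Insert a new player or add to an existing player's running total."""
    con.execute(
        "INSERT INTO player_game_counts (username, num_games) VALUES (?, ?) "
        "ON CONFLICT(username) DO UPDATE "
        "SET num_games = num_games + excluded.num_games",
        (username, num_games),
    )


def mark_processed(con, year, month, file_name):
    """Record a file so that it is never counted twice."""
    con.execute(
        "INSERT OR IGNORE INTO processed_files VALUES (?, ?, ?)",
        (year, month, file_name),
    )


def is_processed(con, year, month, file_name):
    """Return True if this file has already been recorded."""
    row = con.execute(
        "SELECT 1 FROM processed_files WHERE year = ? AND month = ? AND file_name = ?",
        (year, month, file_name),
    ).fetchone()
    return row is not None


add_games(con, "alice", 5)
add_games(con, "alice", 3)
total = con.execute(
    "SELECT num_games FROM player_game_counts WHERE username = ?", ("alice",)
).fetchone()[0]

seen_before = is_processed(con, 2025, 6, "example.parquet")
mark_processed(con, 2025, 6, "example.parquet")
seen_after = is_processed(con, 2025, 6, "example.parquet")
print(total, seen_before, seen_after)
```

Keeping the processed-files table keyed on year, month, and file name is what lets the work resume safely across sessions.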

In [None]:
import os
from pathlib import Path
from utils.database.db_utils import get_db_connection
from utils.database.player_game_counts_db_utils import setup_player_game_counts_table

# Define the project root
project_root = Path.cwd().parent

# Initialize the database
DB_PATH = project_root / "data" / "processed" / "find_most_active_players" / "player_game_counts.duckdb"
DB_DIR = DB_PATH.parent

# Ensure the directory exists
if not DB_DIR.exists():
    DB_DIR.mkdir(parents=True)
    print(f"Created directory: {DB_DIR}")

# Ensure the database file exists
if not DB_PATH.exists():
    print(f"Database file {DB_PATH} does not exist. Initializing...")
    con = get_db_connection(str(DB_PATH))
    setup_player_game_counts_table(con)
    con.close()
    print("Database created and initialized.")
else:
    print(f"Database file {DB_PATH} already exists. Skipping initialization.")

## Main Pipeline
Now that we have the names of the files we need, we will do the following for each file name (until we've downloaded our predefined max number of files for the given month):

1. Check our local db to make sure that we haven't already downloaded the file in question
    - If we have, cycle through that month's list of files until we find one that hasn't been downloaded
2. Download the file in question
    - Note that HuggingFace's API is smart enough to not re-download files we already have on our local machine, which saves me a lot of headache when doing this repeatedly for testing.
3. Mark the file as having been processed in our local db to avoid duplicates
4. Process the file, recording each player's num_games in the local db
5. Delete the file
6. Move on to next file unless we've reached our maximum number of files

Goals:

- Thorough logging, especially for how long each step takes, including games per second while processing files
- Each file is about 1.4 million rows so that helps when providing a file ETA
- Also log which file number we're on out of the max total files to download
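
The throughput and ETA logging described in these goals can be sketched as a small pure function. The name `progress_message` and the sample numbers are illustrative assumptions; the total row count is treated as an estimate, so the remaining count is clamped at zero.

```python
def progress_message(rows_done: int, total_rows: int, elapsed_seconds: float) -> str:
    """Format a progress line with games/sec and a rough ETA."""
    if elapsed_seconds <= 0 or rows_done <= 0:
        return f"Processed {rows_done:,}/{total_rows:,} games."
    rate = rows_done / elapsed_seconds
    # total_rows is only an estimate, so never report a negative remainder.
    remaining = max(total_rows - rows_done, 0)
    eta_seconds = remaining / rate
    return (
        f"Processed {rows_done:,}/{total_rows:,} games "
        f"({rate:,.0f} games/sec, ETA: {eta_seconds:.0f}s)"
    )


# Halfway through a 1.4 million row file after 35 seconds.
message = progress_message(rows_done=700_000, total_rows=1_400_000, elapsed_seconds=35.0)
print(message)
```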

In [None]:
from utils.file_processing.types_and_classes import ProcessingConfig
import pandas as pd
from typing import Set
import re

# --- Processing Configuration ---
# Base config for processing. This will be used for each file.
base_config = ProcessingConfig(
    parquet_path="",  # This will be set per-file
    db_path=DB_PATH,
    min_player_rating=1200,
    max_elo_difference_between_players=100,
    allowed_time_controls={"Blitz", "Rapid", "Classical"},
)

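# Log progress every LOG_FREQUENCY rows (also used as the parquet batch size)
LOG_FREQUENCY = 100_000
# Approximate number of rows per file; used only for the ETA estimate
TOTAL_ROWS_IN_FILE = 1_400_000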

def is_valid_game(row: pd.Series, config: ProcessingConfig) -> bool:
    """
    Checks if a game is valid based on the provided processing configuration.
    """
    # Check for BOTs
    if (pd.notna(row["WhiteTitle"]) and "BOT" in row["WhiteTitle"]) or \
       (pd.notna(row["BlackTitle"]) and "BOT" in row["BlackTitle"]):
        return False

    # Check ratings
    if row["WhiteElo"] < config.min_player_rating or row["BlackElo"] < config.min_player_rating:
        return False

    # Check Elo difference
    if abs(row["WhiteElo"] - row["BlackElo"]) > config.max_elo_difference_between_players:
        return False

    # Check time control
    event_lower = str(row["Event"]).lower()
    if not any(re.search(r'\b' + re.escape(tc.lower()) + r'\b', event_lower) for tc in config.allowed_time_controls):
        return False
        
    # Check result
    if row["Result"] not in {"1-0", "0-1", "1/2-1/2"}:
        return False
        
    # Check for player names
    if not row['White'] or not row['Black']:
        return False

    return True

In [None]:
import time
from utils.database.player_game_counts_db_utils import is_file_already_downloaded
from utils.downloading_raw_parquet_data.raw_parquet_data_file_downloader import (
    download_single_parquet_file,
)

processed_file_count = 0

# Only the first file of the month is considered here; widen the slice to process more files
for file_name in file_names_for_month[:1]:
    if processed_file_count >= max_files_to_download_per_month:
        print("Reached the maximum number of files to process. Stopping.")
        break

    start_time = time.time()

    # 1. Check in the local db whether the file has already been processed - if so, move on to next file
    con = get_db_connection(str(DB_PATH))
    is_already_downloaded = is_file_already_downloaded(con, file_name, year_for_downloading, month_for_downloading)
    con.close()

    elapsed_time = time.time() - start_time
    print(f"Checked file {file_name} in {elapsed_time:.2f} seconds.")

    if is_already_downloaded:
        print(f"File {file_name} has already been processed. Skipping.")
        continue

    print(f"File {file_name} has not been processed yet. Processing now.")

    # 2. If it hasn't been processed, download the file
    start_time = time.time()
    downloaded_file_path = download_single_parquet_file(
        repo_id="Lichess/standard-chess-games",
        repo_type="dataset",
        file_to_download=file_name,
        local_dir=Path("data/raw"),
        year=year_for_downloading,
        month=month_for_downloading,
    )
    elapsed_time = time.time() - start_time
    if downloaded_file_path:
        print(f"Downloaded file {file_name} in {elapsed_time:.2f} seconds.")
        processed_file_count += 1
    else:
        print(f"Failed to download file {file_name}. Skipping.")
        continue

    # 3. Mark the file as processed in the local db. Doing this part first because downloading is cheap and quick, we can just move on to the next file if something goes wrong
    # TODO this would be inconvenient during testing; reinstate once we know the rest of the code works
    # start_time = time.time()
    # con = get_db_connection(str(DB_PATH))
    # from utils.database.player_game_counts_db_utils import mark_file_as_downloaded
    # mark_file_as_downloaded(con, file_name, year_for_downloading, month_for_downloading)
    # con.close()
    # elapsed_time = time.time() - start_time
    # print(f"Marked file {file_name} as downloaded in {elapsed_time:.2f} seconds.")

    # 4. Process the file, updating valid player game counts in the local db
    start_processing_time = time.time()
    games_processed = 0
    
    import pyarrow.parquet as pq
    from collections import Counter

    # Create a specific config for this file
    file_config = base_config.replace(parquet_path=str(downloaded_file_path))

    parquet_file = pq.ParquetFile(file_config.parquet_path)
    player_game_counts = Counter()

    print(f"Processing {downloaded_file_path.name}...")

    for batch in parquet_file.iter_batches(batch_size=LOG_FREQUENCY):
        df = batch.to_pandas()
        for index, row in df.iterrows():
            games_processed += 1
            if is_valid_game(row, file_config):
                player_game_counts[row['White']] += 1
                player_game_counts[row['Black']] += 1

            if games_processed % LOG_FREQUENCY == 0:
                elapsed_since_start = time.time() - start_processing_time
                if elapsed_since_start > 0:
                    games_per_second = games_processed / elapsed_since_start
                    # TOTAL_ROWS_IN_FILE is only an estimate, so clamp to avoid a negative ETA
                    rows_remaining = max(TOTAL_ROWS_IN_FILE - games_processed, 0)
                    eta_seconds = rows_remaining / games_per_second
                    print(f"  - Processed {games_processed}/{TOTAL_ROWS_IN_FILE} games. "
                          f"({games_per_second:.0f} games/sec, ETA: {eta_seconds:.2f}s)")

    # Update the database with the collected game counts
    con = get_db_connection(str(DB_PATH))
    from utils.database.player_game_counts_db_utils import update_player_game_count
    for username, num_games in player_game_counts.items():
        update_player_game_count(con, username, num_games)
    con.close()
    
    processing_elapsed_time = time.time() - start_processing_time
    print(f"Finished processing file. Total time: {processing_elapsed_time:.2f} seconds.")


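    # 5. Delete the file now that its counts are saved, to free up disk space
    Path(downloaded_file_path).unlink(missing_ok=True)
    print(f"Deleted file {file_name}.")

    # 6. Move on to the next file (the loop continues automatically)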