# Purpose

This notebook will find the 50k most active Lichess players (across rapid, blitz, and classical), measured by number of games played.

It does this by downloading and reading parquet files of raw game data.

It's not practical to download the hundreds of 1GB parquet files, so we will download a certain number and extrapolate from there.

We don't need to know exactly who is the *most* active; we just need to find *very* active players.

Then, we will process their games for data to feed into our chess opening recommender AI model.
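As a rough illustration of the extrapolation idea, here is a minimal sketch. It assumes games are spread fairly evenly across a month's files, which this notebook has not verified. The usernames and counts are made up, and 65 is only an assumed midpoint of the 60-70 files per month. The point is that scaling every count by the same factor leaves the ranking unchanged, which is why a sample is enough to find very active players.

```python
def estimate_total_games(sampled_count: int, files_sampled: int, files_total: int) -> float:
    """Scale a per-player game count from the sampled files up to the full month.

    Assumes each file holds a comparable share of the month's games.
    """
    if files_sampled <= 0 or files_sampled > files_total:
        raise ValueError("files_sampled must be between 1 and files_total")
    return sampled_count * files_total / files_sampled


# Hypothetical counts observed in 3 sampled files (made-up data).
sample_counts = {"alice": 120, "bob": 45, "carol": 300}

# Assumed totals: 3 files sampled out of roughly 65 for the month.
estimates = {
    name: estimate_total_games(count, files_sampled=3, files_total=65)
    for name, count in sample_counts.items()
}

# A constant scale factor does not change who ranks above whom.
ranking_from_sample = sorted(sample_counts, key=sample_counts.get, reverse=True)
ranking_from_estimates = sorted(estimates, key=estimates.get, reverse=True)
```

The estimated totals are only as good as the even-distribution assumption, but the ranking taken from the sample does not depend on the exact scale factor.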

# Reason

Originally, we were just collecting data on millions of players without caring who was active or not.

This grew impractical quickly, with a massive local DB. Game processing slowed down dramatically because SQL queries to the local DuckDB database became so slow as the database grew.

So I've decided to focus on the 50k most active players instead.

# Method

1. Get names of parquet files
    - We will want to look at games data over the course of a year or so
    - So, we'll download some 1GB raw game data parquet files (number TBD) from each month (there are about 60-70 total per month)
    - To do this, we need to get the names of those files from the HuggingFace API
    - Luckily we already have functionality elsewhere to do this; I'll import it here and update slightly for our needs

2. Download parquet files
    - Once we have the names of the needed files, we'll download them
    - I already have functionality to do this elsewhere; I just need to import it here

3. Save game counts
    - Once we have our parquet files downloaded, we'll count the number of games each username has
    - There will be some filters: only Rapid/Blitz/Classical games, and only players rated over 1200
    - We need to decide on a data saving method; probably a CSV with username/num_games
    - Wherever it's saved, it needs to be persistent, as this processing will happen over multiple sessions.

4. Get most active players
    - Finally, we'll analyze our CSV to retrieve the usernames of the 50,000 most active Lichess players
    - Profit
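Steps 3 and 4 could be sketched roughly as follows, using only the standard library. This is not the notebook's actual implementation. It assumes each game is available as a dict with `Event`, `White`, `Black`, `WhiteElo`, and `BlackElo` keys (mirroring PGN headers), that the time control can be recognised from the `Event` text, and that ratings are integers or `None`. The helper names are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path

ALLOWED_SPEEDS = ("rapid", "blitz", "classical")  # assumed to appear in the Event text
MIN_RATING = 1200  # only players rated over this are counted


def count_games(games, counts=None):
    """Add one game per qualifying player to a running Counter."""
    counts = Counter() if counts is None else counts
    for game in games:
        event = str(game.get("Event", "")).lower()
        if not any(speed in event for speed in ALLOWED_SPEEDS):
            continue  # skip bullet, correspondence, and so on
        for name_key, elo_key in (("White", "WhiteElo"), ("Black", "BlackElo")):
            elo = game.get(elo_key)
            if elo is not None and int(elo) > MIN_RATING:
                counts[game[name_key]] += 1
    return counts


def save_counts(counts, path):
    """Write username/num_games rows so progress survives between sessions."""
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["username", "num_games"])
        writer.writerows(counts.most_common())


def load_counts(path):
    """Load counts saved earlier, or start empty if the file does not exist yet."""
    path = Path(path)
    if not path.exists():
        return Counter()
    with path.open(newline="", encoding="utf-8") as f:
        return Counter({row["username"]: int(row["num_games"]) for row in csv.DictReader(f)})


def most_active(counts, n=50_000):
    """Return the usernames of the n players with the most counted games."""
    return [name for name, _ in counts.most_common(n)]
```

In a session, the flow would be `counts = load_counts(path)`, then `count_games(games, counts)` for each newly processed file, then `save_counts(counts, path)`. Because a `Counter` is updated in place, counts from earlier sessions accumulate rather than being overwritten.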

## Config

Below, we'll define some important variables that we want to use in this notebook.

I like to keep them in one place at the top.

In [None]:
from utils.downloading_raw_parquet_data.get_parquet_file_names import (
    get_parquet_file_names,
)

# The year and month we want to download data for
month_for_downloading = 6
year_for_downloading = 2025

# Maximum number of parquet files to download for each month
max_files_to_download_per_month = 3

## 1. Get file names

Now, we'll get the file names for the month and year we want to download.

In [None]:
file_names_for_month = get_parquet_file_names(year_for_downloading, month_for_downloading)
# Print one file name per line, limited to the per-month maximum
print("\n".join(file_names_for_month[:max_files_to_download_per_month]))

NameError: name 'max_files_to_download_per_month' is not defined
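Since the method calls for sampling files from each month over about a year, the single-month lookup could be repeated across months. Here is a minimal sketch, assuming a callable that takes `(year, month)` and returns a list of file names, as `get_parquet_file_names` is used above. `iter_months` and `select_files` are hypothetical helpers, not part of the existing utilities.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def iter_months(start_year: int, start_month: int, num_months: int) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) pairs for num_months consecutive months."""
    year, month = start_year, start_month
    for _ in range(num_months):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def select_files(
    get_names: Callable[[int, int], List[str]],
    start_year: int,
    start_month: int,
    num_months: int,
    max_per_month: int,
) -> Dict[Tuple[int, int], List[str]]:
    """Keep at most max_per_month file names for each month in the range."""
    selected = {}
    for year, month in iter_months(start_year, start_month, num_months):
        selected[(year, month)] = list(get_names(year, month))[:max_per_month]
    return selected
```

With the config values above, this could be called as `select_files(get_parquet_file_names, 2024, 7, 12, max_files_to_download_per_month)`, where the start month and 12-month span are illustrative choices. Run the Config cell first so the variables are defined.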