# Purpose

To filter out Lichess accounts that have been banned for cheating.

# Method

Lichess has an api/users endpoint that accepts a list of usernames. We call it with the usernames from our local DB to check that each account hasn't been banned for cheating.

Note that there is no way to tell whether a closed account was banned for cheating, so we will also make sure that the account is still open.
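
The check described above can be sketched as a small predicate. This is only an illustration: the helper name `is_account_eligible` is hypothetical, and the field names `disabled` and `tosViolation` are the ones this notebook reads from the API response further down. It assumes a missing field means the flag is not set.

```python
from typing import Optional


def is_account_eligible(user: Optional[dict]) -> bool:
    """Return True only for an existing, open account with no TOS mark.

    `user` is one entry from the api/users response, or None when the
    username was not found (hypothetical helper, for illustration only).
    """
    if not user:
        return False
    # Assumption: an absent field means the flag is not set.
    closed = bool(user.get("disabled", False))
    flagged = bool(user.get("tosViolation", False))
    return not closed and not flagged


# Toy records, not real API output
print(is_account_eligible({"username": "some_player"}))
print(is_account_eligible({"username": "closed_player", "disabled": True}))
```

Treating "closed" and "flagged" the same way matches the reasoning above: a closed account may or may not have been banned, so it is excluded to be safe.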

# Filtering

To reduce our load on the Lichess API and save time, we will examine player stats and eliminate players who won't have enough games for us to work with anyway.

# Notes

- We have compiled a list of the most active players in notebook 14
- This list is saved in a local SQL DB
- We will send these usernames to the Lichess API, 300 names at a time (the endpoint's limit), filtering out any accounts that are not open or not in good standing.
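
As a rough sketch of the "300 names at a time" step, a list of usernames can be split into batches like this. The helper name `chunked` and the toy usernames are made up for illustration; only the limit of 300 comes from the notes above.

```python
from typing import Iterator, List, Sequence


def chunked(usernames: Sequence[str], size: int = 300) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` usernames."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(usernames), size):
        yield list(usernames[start : start + size])


# Toy usernames, not real accounts
names = [f"player{i}" for i in range(650)]
batches = list(chunked(names))
print([len(b) for b in batches])  # [300, 300, 50]

# Each batch would be sent as one comma-separated request body
body = ",".join(batches[0])
```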

## Preparing local DB

Here we will add an is_eligible_account boolean column to our local DB for each user.

I should have done this when I was compiling this DB in the first place; just didn't think of it.

In [None]:
# Add an is_eligible_account column to our DB of active Lichess players
# Unnecessary to do this every time we run the notebook, but it's cheap to run and I'm too lazy to do the required refactoring

from pathlib import Path
from utils.database.db_utils import get_db_connection

DB_PATH = (
    Path.cwd().parent
    / "data"
    / "processed"
    / "find_most_active_players"
    / "player_game_counts.duckdb"
)

con = get_db_connection(str(DB_PATH))
try:
    con.execute(
        """
        ALTER TABLE player_game_counts
        ADD COLUMN IF NOT EXISTS is_eligible_account BOOLEAN;
    """
    )
    print(
        "Added 'is_eligible_account' column to player_game_counts table (if not already present)."
    )
finally:
    con.close()

## Config

Getting our Lichess API token and defining any other variables I think of later


In [None]:
# Config stuff

import os
%pip install python-dotenv requests
from dotenv import load_dotenv
import requests

load_dotenv()
lichess_api_token = os.getenv("LICHESS_TOKEN")

headers = {"Authorization": f"Bearer {lichess_api_token}"}

# Example bulk query
usernames = ["DrNykterstein", "MagnusCarlsen", "lichess"]
resp = requests.post(
    "https://lichess.org/api/users", data=",".join(usernames), headers=headers
)

print(resp.json())

## Player Game Count Stats

Before calling the Lichess API, we'll examine some stats from the local DB of players, in particular the mean, standard deviation, and percentiles of games played. This will help us eliminate players who don't have enough games for our AI model to work with anyway.

In [None]:
# Player game count stats

import pandas as pd
import numpy as np

# --- Database Connection ---
con = get_db_connection(str(DB_PATH))

try:
    # Fetch all game counts into a pandas DataFrame
    df = con.execute("SELECT num_games FROM player_game_counts").df()

    if df.empty:
        print("The 'player_game_counts' table is empty. No stats to display.")
    else:
        # --- Basic Statistics ---
        total_players = len(df)
        mean_games = df["num_games"].mean()
        std_dev = df["num_games"].std()
        median_games = df["num_games"].median()
        min_games = df["num_games"].min()
        max_games = df["num_games"].max()
        total_games = df["num_games"].sum()

        print("--- Player Game Count Statistics ---")
        print(f"{'Total Unique Players:':<25} {total_players:,.0f}")
        print(f"{'Total Games Recorded:':<25} {total_games:,.0f}")
        print("-" * 40)
        print(f"{'Mean (Average) Games:':<25} {mean_games:,.2f}")
        print(f"{'Standard Deviation:':<25} {std_dev:,.2f}")
        print(f"{'Median Games (50th %):':<25} {median_games:,.0f}")
        print(f"{'Minimum Games:':<25} {min_games:,.0f}")
        print(f"{'Maximum Games:':<25} {max_games:,.0f}")
        print("-" * 40)

        # --- Percentile Breakdown ---
        print("\n--- Percentile Breakdown ---")
        header = f"{'Percentile':<12} | {'Games Count':<15} | {'Players At/Below':<20} | {'Players Above':<20}"
        print(header)
        print("=" * len(header))

        # linspace avoids floating-point drift that could push the last value past 1.0
        percentiles_to_calc = np.linspace(0.05, 1.0, 20)

        for p in percentiles_to_calc:
            percentile_val = df["num_games"].quantile(p)

            # Count players at/below and above the percentile value
            players_at_or_below = (df["num_games"] <= percentile_val).sum()
            players_above = total_players - players_at_or_below

            p_str = f"{p*100:.0f}%"
            val_str = f"{int(percentile_val):,}"
            below_str = (
                f"{players_at_or_below:,} ({players_at_or_below/total_players:.1%})"
            )
            above_str = f"{players_above:,} ({players_above/total_players:.1%})"

            print(f"{p_str:<12} | {val_str:<15} | {below_str:<20} | {above_str:<20}")

finally:
    con.close()
    print("\nDatabase connection closed.")

## Notes

Notes from player game count stats:

- About 90% of players (roughly 283,000) have 96 or more games in our sample.
- These 283,000 players seem like a good number to check against the Lichess API; I'm hoping to feed the AI model about 50,000 games, so this gives us plenty of wiggle room.
- Can easily adjust this if I need more players later.
- Note that these counts come from a random subset of games from January 2023 through August 2025; those players played many more than 96 games during that time, but we only see the sampled ones.
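
To make the percentile reasoning above concrete: keeping the "top 90%" of players means using the 10th percentile of game counts as the floor. A minimal sketch on made-up numbers (the array below is toy data, not taken from our DB):

```python
import numpy as np

# Toy game counts for ten hypothetical players
game_counts = np.array([3, 10, 25, 40, 96, 120, 150, 300, 500, 900])

# The 10th percentile is the floor for the top 90% of players
floor = np.quantile(game_counts, 0.10)

# Everyone at or above the floor is kept
kept = game_counts[game_counts >= floor]
share_kept = len(kept) / len(game_counts)

print(f"floor={floor:.1f}, kept={len(kept)} of {len(game_counts)}")
```

Using the 90th percentile instead would keep only the top 10%, which is an easy mistake to make when writing the query.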

## Getting the most active players

Here, we will define a list of the most active players from our DB.

In this case, it'll be the top 90%, which is players who have played at least 96 games in our randomized game sample set.

This amounts to about 283,000 players, which is more than enough -- we only need about 50,000 in the end.

In [None]:
# Get players who are in the top 90% of game counts

con = get_db_connection(str(DB_PATH))
try:
    # First, find the 10th percentile value to identify the top 90% of players
    # We use the 10th percentile as the floor for the top 90%
    top_90_percentile_floor = con.execute(
        "SELECT quantile_cont(num_games, 0.10) FROM player_game_counts"
    ).fetchone()[0]

    print(f"Minimum games for top 90% of players: {top_90_percentile_floor:,.0f}")

    # Fetch usernames of players above the threshold who have not been checked yet
    top_players_df = con.execute(
        """
        SELECT username
        FROM player_game_counts 
        WHERE num_games >= ? AND is_eligible_account IS NULL
        ORDER BY num_games DESC
        """,
        (top_90_percentile_floor,),
    ).df()

    top_usernames_to_check = top_players_df["username"].tolist()
    
    print(f"Found {len(top_usernames_to_check):,} players to check.")

    print("Sample usernames to check:", top_usernames_to_check[:10])

finally:
    con.close()


## Example Lichess API call

Here we'll call the Lichess API for the first twenty users in our list, just to make sure it works. This cell can be deleted later.

In [None]:
# Here we'll call the Lichess API for the first twenty users in our list
sample_usernames = top_usernames_to_check[:20]
resp = requests.post(
    "https://lichess.org/api/users", data=",".join(sample_usernames), headers=headers
)

if resp.status_code == 200:
    users_data = resp.json()
    for user_data in users_data:
        # The API may return null for users that don't exist.
        if user_data:
            print(
                f"Username: {user_data.get('username')}, "
                f"Title: {user_data.get('title')}, "
                f"Disabled: {user_data.get('disabled')}, "
                f"TOS Violation: {user_data.get('tosViolation')}"
            )
else:
    print(
        f"Failed to fetch data. Status code: {resp.status_code}, Response: {resp.text}"
    )

## Main Pipeline

OK, we're ready to go. Now, we'll do the following:

1. Get all players from our local DB where:
    - is_eligible_account is null (meaning it hasn't been processed yet in this pipeline)
    - num_games is at least 96; we're not bothering with less active players

2. Call Lichess's API
    - maximum number of usernames per call is 300
    - So, we call it in chunks of 300
    - Make calls one at a time for simplicity; wait for the response on one batch before sending the next
        - If this takes too long we'll adjust

3. Adjust players in DB
    - If their account is disabled, or if TOS violation is true, they're ineligible
    - We'll do this in batches; updating one row at a time would be horribly inefficient
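
Before running the plan above, a quick back-of-the-envelope estimate of the number of API calls is useful. The sketch below assumes 100,000 players, batches of 300, and a 2 second pause between calls; `estimate_run` is a hypothetical helper, and the result is only a lower bound because it ignores request and DB time.

```python
import math
from typing import Tuple


def estimate_run(
    total_players: int, batch_size: int = 300, delay_seconds: float = 2.0
) -> Tuple[int, float]:
    """Return (number of batches, total seconds spent sleeping between them)."""
    batches = math.ceil(total_players / batch_size)
    # No sleep is needed after the final batch
    min_sleep_seconds = max(batches - 1, 0) * delay_seconds
    return batches, min_sleep_seconds


batches, min_sleep = estimate_run(100_000)
print(batches, round(min_sleep / 60, 1))  # 334 11.1
```

If the delay dominates the run time, lowering it is the first thing to try, as long as the API's rate limits are respected.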

In [None]:
import os
import random
import time
from pathlib import Path

import requests
from dotenv import load_dotenv

from utils.database.db_utils import get_db_connection

load_dotenv()
lichess_api_token = os.getenv("LICHESS_TOKEN")

headers = {"Authorization": f"Bearer {lichess_api_token}"}

# --- Main Pipeline Configuration ---
MIN_GAMES_THRESHOLD = 96
BATCH_SIZE = 300  # Lichess API limit per request
API_DELAY_SECONDS = 2  # Polite delay between API calls to respect rate limits
MAX_API_FAILURES = 2  # Stop once we hit this many consecutive API errors

# --- Database and API Configuration ---
DB_PATH = (
    Path.cwd().parent
    / "data"
    / "processed"
    / "find_most_active_players"
    / "player_game_counts.duckdb"
)
LICHESS_API_URL = "https://lichess.org/api/users"
LICHESS_PROFILE_URL = "https://lichess.org/@/"

# --- Main Processing Logic ---
total_start_time = time.time()
players_processed = 0
total_eligible_count = 0
total_ineligible_count = 0
api_failure_count = 0

# Use a try...finally block to ensure the DB connection is always closed
con = get_db_connection(str(DB_PATH))
try:
    # 1. Get all players to be processed from the DB (self-contained logic)
    print("Fetching list of players to check from the database...")

    usernames_to_check = (
        con.execute(
            """
        SELECT username FROM player_game_counts 
        WHERE num_games >= ? AND is_eligible_account IS NULL
        ORDER BY num_games DESC
        LIMIT 100_000;
        """,
            (MIN_GAMES_THRESHOLD,),
        )
        .df()["username"]
        .tolist()
    )

    total_players_to_check = len(usernames_to_check)

    if total_players_to_check == 0:
        print("No new players to process. All eligible players have been checked.")
    else:
        print("--- Starting Main Pipeline ---")
        print(f"Minimum Games Threshold: {MIN_GAMES_THRESHOLD}")
        print(f"Total players to process: {total_players_to_check:,}")
        print(f"Batch size: {BATCH_SIZE:,}")
        print("-" * 40)

        # 2. Process in batches
        for i in range(0, total_players_to_check, BATCH_SIZE):
            if api_failure_count >= MAX_API_FAILURES:
                print(
                    f"\nStopping due to {api_failure_count} consecutive API failures."
                )
                break

            batch_start_time = time.time()

            batch_usernames = usernames_to_check[i : i + BATCH_SIZE]
            current_batch_size = len(batch_usernames)

            batch_num = (i // BATCH_SIZE) + 1
            total_batches = (total_players_to_check + BATCH_SIZE - 1) // BATCH_SIZE

            print(
                f"\nProcessing Batch {batch_num}/{total_batches} ({current_batch_size} players)..."
            )

            # 3. Call Lichess API and categorize players
            eligible_players = []
            ineligible_players = []

            try:
                print("Calling Lichess API...")
                resp = requests.post(
                    LICHESS_API_URL,
                    data=",".join(batch_usernames),
                    headers=headers,
                    timeout=30,
                )

                if resp.status_code == 200:
                    api_failure_count = 0  # Reset failure count on success
                    users_data = resp.json()

                    found_usernames = set()
                    for user in users_data:
                        if user:
                            username = user["username"]
                            found_usernames.add(username.lower())
                            if not user.get("disabled") and not user.get(
                                "tosViolation"
                            ):
                                eligible_players.append(username)
                            else:
                                ineligible_players.append(username)

                    # Any username not returned by the API is considered non-existent/ineligible
                    for username in batch_usernames:
                        if username.lower() not in found_usernames:
                            ineligible_players.append(username)
                else:
                    print(
                        f"API Error! Status: {resp.status_code}. Skipping this batch; their is_eligible_account remains NULL."
                    )
                    api_failure_count += 1

            except requests.RequestException as e:
                print(f"Network Error: {e}. Skipping this batch; their is_eligible_account remains NULL.")
                # Do not mark the batch ineligible: leaving NULL lets a later run retry it
                api_failure_count += 1

            # 4. Update the database in bulk
            print("Updating database...")
            if eligible_players:
                con.execute(
                    "UPDATE player_game_counts SET is_eligible_account = TRUE WHERE username IN ?",
                    (eligible_players,),
                )
            if ineligible_players:
                con.execute(
                    "UPDATE player_game_counts SET is_eligible_account = FALSE WHERE username IN ?",
                    (ineligible_players,),
                )

            # 5. Report batch stats
            batch_eligible_count = len(eligible_players)
            batch_ineligible_count = len(ineligible_players)
            total_eligible_count += batch_eligible_count
            total_ineligible_count += batch_ineligible_count

            print(
                f"Batch Results: Eligible: {batch_eligible_count}, Ineligible: {batch_ineligible_count}"
            )
            if eligible_players:
                print(
                    f"  - Eligible spot check: {LICHESS_PROFILE_URL}{random.choice(eligible_players)}"
                )
            if ineligible_players:
                print(
                    f"  - Ineligible spot check: {LICHESS_PROFILE_URL}{random.choice(ineligible_players)}"
                )

            # 6. Calculate and report timing stats
            batch_duration = time.time() - batch_start_time
            players_processed += current_batch_size

            total_elapsed_time = time.time() - total_start_time
            cumulative_pps = (
                players_processed / total_elapsed_time if total_elapsed_time > 0 else 0
            )

            players_remaining = total_players_to_check - players_processed
            eta_seconds = (
                (players_remaining / cumulative_pps) if cumulative_pps > 0 else 0
            )

            print(f"Batch completed in {batch_duration:.2f}s.")
            print(
                f"--- Progress: {players_processed:,}/{total_players_to_check:,} ({players_processed/total_players_to_check:.1%}) ---"
            )
            print(f"Cumulative Players/Sec: {cumulative_pps:.2f}")
            print(f"Estimated Time Remaining: {eta_seconds / 60:.1f} minutes")

            # 7. Polite delay before the next API call
            if players_remaining > 0:
                time.sleep(API_DELAY_SECONDS)

finally:
    con.close()
    total_pipeline_duration = time.time() - total_start_time
    print("\n--- Pipeline Finished ---")
    if players_processed > 0:
        eligible_percent = (total_eligible_count / players_processed) * 100
        ineligible_percent = (total_ineligible_count / players_processed) * 100
        print(f"Total time: {total_pipeline_duration / 60:.2f} minutes.")
        print(f"Total players processed: {players_processed:,}")
        print(f"Total Eligible: {total_eligible_count:,} ({eligible_percent:.1f}%)")
        print(
            f"Total Ineligible: {total_ineligible_count:,} ({ineligible_percent:.1f}%)"
        )
    print("Database connection closed.")

Fetching list of players to check from the database...
--- Starting Main Pipeline ---
Minimum Games Threshold: 96
Total players to process: 100,000
Batch size: 300
----------------------------------------

Processing Batch 1/334 (300 players)...
Calling Lichess API...
Updating database...
Batch Results: Eligible: 296, Ineligible: 4
  - Eligible spot check: https://lichess.org/@/krest1977
  - Ineligible spot check: https://lichess.org/@/whittled_rook
Batch completed in 0.90s.
--- Progress: 300/100,000 (0.3%) ---
Cumulative Players/Sec: 178.93
Estimated Time Remaining: 9.3 minutes

Processing Batch 2/334 (300 players)...
Calling Lichess API...
Updating database...
Batch Results: Eligible: 298, Ineligible: 2
  - Eligible spot check: https://lichess.org/@/Kashyrin_M
  - Ineligible spot check: https://lichess.org/@/ruhsuzhayalet
Batch completed in 0.87s.
--- Progress: 600/100,000 (0.6%) ---
Cumulative Players/Sec: 131.78
Estimated Time Remaining: 12.6 minutes

Processing Batch 3/334 (300 pl