# Database Performance Analysis

### Purpose of this Notebook
Our data processing pipeline is experiencing a significant slowdown in write performance. After processing about 20 parquet files, the rate of inserting game statistics has dropped from ~140k games/sec to ~70k games/sec. This notebook aims to diagnose the potential causes of this degradation by thoroughly inspecting the state of our `chess_games.db` DuckDB database.

We will investigate:
- **Database Size**: How large is the database file?
- **Table Counts**: How many players, openings, and player-opening stats entries have we accumulated?
- **Partition Health**: How is the data distributed across our partitioned `player_opening_stats` tables? An imbalance could indicate a performance bottleneck.
- **Data Skew**: Are a few players or openings responsible for a disproportionate number of records? This could strain the primary key lookups during `UPSERT` operations.

By understanding the shape and size of our data, we can better identify whether the slowdown is a temporary issue that will level off or a systemic problem that requires architectural changes.

In [21]:
import duckdb
import pandas as pd
from pathlib import Path
import sys
import os

# Ensure the project root is in the system path to allow for absolute imports
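project_root = Path.cwd()
# Step up one level only when the notebook runs from the "notebooks" directory itself.
# (A substring test on the full path would also match unrelated parent folders.)
if project_root.name == "notebooks":
    project_root = project_root.parent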

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from notebooks.utils.database.db_utils import get_db_connection

# --- Configuration ---
# Define the path to the DuckDB database file.
db_path = project_root / "data" / "processed" / "chess_games.db"

# Set pandas display options for better readability
pd.set_option('display.float_format', '{:,.2f}'.format)

print(f"Database path: {db_path}")
print(f"Database exists: {db_path.exists()}")

Database path: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
Database exists: True


## 1. High-Level Database Statistics

First, let's get a high-level overview of the database. We'll check the file size and the total number of records in our main tables: `player`, `opening`, and the unified `player_opening_stats` view. This will give us a sense of the overall scale of the data.

In [22]:
try:
    db_size_bytes = os.path.getsize(db_path)
    db_size_mb = db_size_bytes / (1024 * 1024)
    print(f"Database file size: {db_size_mb:.2f} MB")
except FileNotFoundError:
    print("Database file not found.")
    db_size_mb = 0

if db_size_mb > 0:
    with get_db_connection(db_path) as con:
        num_players = con.execute("SELECT COUNT(*) FROM player").fetchone()[0]
        num_openings = con.execute("SELECT COUNT(*) FROM opening").fetchone()[0]
        num_player_opening_stats = con.execute("SELECT COUNT(*) FROM player_opening_stats").fetchone()[0]

        summary_data = {
            "Metric": ["Total Players", "Total Openings", "Total Player-Opening Stats"],
            "Count": [num_players, num_openings, num_player_opening_stats]
        }
        summary_df = pd.DataFrame(summary_data)
        summary_df["Count"] = summary_df["Count"].apply('{:,.0f}'.format)

        print("\n--- Database Record Counts ---")
        print(summary_df.to_string(index=False))

Database file size: 480.07 MB


IOException: IO Error: The file "/Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db" exists, but it is not a valid DuckDB database file!
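
The `IOException` above means DuckDB rejected the file before any query ran. One possible cause is that the file's header does not match what the installed DuckDB version expects. The sketch below reads the first bytes of a file to check this. It assumes DuckDB's main header layout (an 8-byte checksum, then the magic bytes `DUCK`, then a little-endian 64-bit storage version), and `inspect_duckdb_header` is a hypothetical helper, not part of `db_utils`.

```python
import struct


def inspect_duckdb_header(path):
    """Return (has_magic, storage_version) read from the start of a file.

    Assumed layout: 8-byte checksum, 4 magic bytes b"DUCK",
    then a little-endian uint64 storage version number.
    """
    with open(path, "rb") as f:
        header = f.read(20)
    # Too short, or magic bytes missing: not recognisable as a DuckDB file.
    if len(header) < 20 or header[8:12] != b"DUCK":
        return False, None
    (version,) = struct.unpack_from("<Q", header, 12)
    return True, version
```

Calling `inspect_duckdb_header(db_path)` before connecting would show whether the magic bytes are present and which storage version the file reports. If the magic is missing, the file may have been truncated or overwritten. If it is present, a mismatch between that version and the installed `duckdb` package is worth ruling out.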

## 2. Partition Analysis

Our `player_opening_stats` table is partitioned by the first letter of the ECO code (A, B, C, D, E, and 'other'). The `UPSERT` operations in our pipeline write directly to these partitioned tables. An uneven distribution of data could cause certain partitions to grow much larger than others, potentially slowing down writes to those specific tables.

Let's examine the row counts for each partition to see how the data is distributed.
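
For reference, the routing rule described above can be sketched as a small function. The pipeline's actual routing code is not shown in this notebook, so `partition_table_for` is a hypothetical name. Treating lowercase letters and missing codes as shown here is an assumption.

```python
def partition_table_for(eco_code):
    """Map an ECO code such as "B06" to its partition table name.

    Codes starting with A-E go to the matching partition;
    anything else (including empty or missing codes) goes to "other".
    """
    first = (eco_code or "").strip()[:1].upper()
    suffix = first if first in {"A", "B", "C", "D", "E"} else "other"
    return f"player_opening_stats_{suffix}"


for code in ["B06", "D00", "a00", "", None]:
    print(repr(code), "->", partition_table_for(code))
```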

In [None]:
if db_size_mb > 0:
    with get_db_connection(db_path) as con:
        partitions = list("ABCDE") + ["other"]
        partition_stats = []

        for p in partitions:
            table_name = f"player_opening_stats_{p}"
            try:
                count = con.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
                partition_stats.append({"Partition": table_name, "Row Count": count})
            except duckdb.CatalogException:
                partition_stats.append({"Partition": table_name, "Row Count": 0})

        partition_df = pd.DataFrame(partition_stats)
        
        # Calculate percentages
        total_rows = partition_df["Row Count"].sum()
        if total_rows > 0:
            partition_df["Percentage"] = (partition_df["Row Count"] / total_rows) * 100
        else:
            partition_df["Percentage"] = 0.0

        # Format for display
        partition_df["Row Count"] = partition_df["Row Count"].apply('{:,.0f}'.format)
        partition_df["Percentage"] = partition_df["Percentage"].apply('{:.2f}%'.format)


        print("\n--- Player-Opening Stats Partition Counts ---")
        print(partition_df.to_string(index=False))
        if total_rows > 0:
            print(f"\nTotal Rows: {total_rows:,.0f}")


--- Player-Opening Stats Partition Counts ---
                 Partition Row Count Percentage
    player_opening_stats_A 6,220,395     23.17%
    player_opening_stats_B 6,925,508     25.80%
    player_opening_stats_C 8,922,201     33.24%
    player_opening_stats_D 3,795,685     14.14%
    player_opening_stats_E   979,585      3.65%
player_opening_stats_other         0      0.00%

Total Rows: 26,843,374
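
To put a single number on the imbalance, one option is the ratio of the largest partition to the mean partition size. The sketch below uses the row counts copied from the output above. The choice of max/mean as the metric is ours and is only one of several reasonable measures.

```python
# Row counts copied from the partition output above.
partition_rows = {
    "A": 6_220_395,
    "B": 6_925_508,
    "C": 8_922_201,
    "D": 3_795_685,
    "E": 979_585,
    "other": 0,
}

total = sum(partition_rows.values())
mean_rows = total / len(partition_rows)
largest = max(partition_rows, key=partition_rows.get)

# Ratio of the largest partition to the mean partition size;
# 1.0 would mean perfectly even partitions.
imbalance_ratio = partition_rows[largest] / mean_rows

print(f"Largest partition: {largest}")
print(f"Imbalance ratio (max / mean): {imbalance_ratio:.2f}")
```

With these counts the largest partition (C) is about twice the mean size, partly because the empty `other` partition pulls the mean down.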


## 3. Data Skew Analysis

A common cause of `UPSERT` slowdowns is data skew, where a small number of keys are involved in a large number of operations. In our case, this could mean:
1.  A few highly active players have played a vast number of different openings.
2.  A few very common openings have been played by many different players.

When a new batch of games is processed, the database has to check for conflicts on `(player_id, opening_id, color)`. If the same players or openings appear frequently, their corresponding records in the stats tables are updated repeatedly. As the tables grow, finding these records to update takes longer.

Let's check for this skew by identifying the top players and openings with the most entries in the `player_opening_stats` table.
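
The conflict check described above can be illustrated on a toy table. This sketch uses Python's built-in `sqlite3` as a stand-in so that it runs without the project database. DuckDB also documents an `INSERT ... ON CONFLICT ... DO UPDATE` clause, but its exact behaviour should be checked against the DuckDB documentation. The table `stats_demo` and its schema are assumptions for illustration, not the pipeline's real tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE stats_demo (
        player_id  INTEGER,
        opening_id INTEGER,
        color      TEXT,
        num_wins   INTEGER,
        num_draws  INTEGER,
        PRIMARY KEY (player_id, opening_id, color)
    )
""")

# On a key collision, add the incoming counts to the stored row
# instead of inserting a duplicate.
upsert_sql = """
    INSERT INTO stats_demo VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (player_id, opening_id, color) DO UPDATE SET
        num_wins  = num_wins  + excluded.num_wins,
        num_draws = num_draws + excluded.num_draws
"""

batch_1 = [(1, 10, "w", 3, 1), (2, 10, "b", 1, 0)]
batch_2 = [(1, 10, "w", 2, 2), (3, 11, "w", 0, 1)]  # key (1, 10, "w") repeats
for batch in (batch_1, batch_2):
    con.executemany(upsert_sql, batch)

rows = con.execute("SELECT * FROM stats_demo ORDER BY player_id").fetchall()
print(rows)
```

Every incoming row requires a lookup on the three-column key. Rows whose key already exists, such as `(1, 10, "w")` in the second batch, become updates rather than inserts.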

In [None]:
if db_size_mb > 0:
    with get_db_connection(db_path) as con:
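        # NOTE: COUNT(*) counts stats rows, which are keyed on (player_id, opening_id, color),
        # so an opening played with both colors is counted twice here. The same applies to
        # the per-opening player counts below.
        print("\n--- Top 10 Players by Number of Unique Openings Played ---")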
        top_players_df = con.execute("""
            SELECT p.name, COUNT(*) AS opening_count
            FROM player_opening_stats pos
            JOIN player p ON pos.player_id = p.id
            GROUP BY p.name
            ORDER BY opening_count DESC
            LIMIT 10;
        """).fetchdf()
        top_players_df["opening_count"] = top_players_df["opening_count"].apply('{:,.0f}'.format)
        print(top_players_df.to_string(index=False))

        print("\n--- Top 10 Openings by Number of Unique Players ---")
        top_openings_df = con.execute("""
            SELECT o.name, o.eco, COUNT(*) AS player_count
            FROM player_opening_stats pos
            JOIN opening o ON pos.opening_id = o.id
            GROUP BY o.name, o.eco
            ORDER BY player_count DESC
            LIMIT 10;
        """).fetchdf()
        top_openings_df["player_count"] = top_openings_df["player_count"].apply('{:,.0f}'.format)
        print(top_openings_df.to_string(index=False))

        # Same thing but with least common openings
        print("\n--- Bottom 10 Openings by Number of Unique Players ---")
        bottom_openings_df = con.execute("""
            SELECT o.name, o.eco, COUNT(*) AS player_count
            FROM player_opening_stats pos
            JOIN opening o ON pos.opening_id = o.id
            GROUP BY o.name, o.eco
            ORDER BY player_count ASC
            LIMIT 10;
        """).fetchdf()
        bottom_openings_df["player_count"] = bottom_openings_df["player_count"].apply('{:,.0f}'.format)
        print(bottom_openings_df.to_string(index=False))

        # Count the openings that have been played by fewer than five distinct players
        rare_openings_count = con.execute("""
            SELECT COUNT(*)
            FROM (
                SELECT o.id
                FROM player_opening_stats pos
                JOIN opening o ON pos.opening_id = o.id
                GROUP BY o.id
                HAVING COUNT(DISTINCT pos.player_id) < 5
            ) AS rare_openings;
        """).fetchone()[0]
        print(f"\nTotal number of openings played by fewer than 5 players: {rare_openings_count:,}")

        # Distribution of total games per player, by decile
        print("\n--- Player Game Count Percentiles ---")

        # Percentiles from 10% to 100% in increments of 10
        percentiles = list(range(10, 101, 10))
        percentile_data = []

        for p in percentiles:
            percentile_str = "1.0" if p == 100 else f"0.{p:02d}"
            value = con.execute(
                f"""
                SELECT PERCENTILE_CONT({percentile_str}) WITHIN GROUP (ORDER BY total_games) AS percentile_value
                FROM (
                    SELECT 
                        player_id,
                        -- NOTE: wins + draws only; losses are not included in this total
                        SUM(num_wins + num_draws) AS total_games
                    FROM player_opening_stats
                    GROUP BY player_id
                ) AS player_game_counts;
                """
            ).fetchone()[0]
            percentile_data.append({"Percentile": f"{p}%", "Game Count": value})

        # Format and print the results
        percentile_df = pd.DataFrame(percentile_data)
        percentile_df["Game Count"] = percentile_df["Game Count"].apply("{:,.0f}".format)
        print(percentile_df.to_string(index=False))

        # Find the players with the most games (outliers)
        top_players_by_games = con.execute(
            """
            SELECT 
                p.name,
                -- NOTE: wins + draws only; losses are not included in this total
                SUM(num_wins + num_draws) AS total_games
            FROM player_opening_stats pos
            JOIN player p ON pos.player_id = p.id
            GROUP BY p.name
            ORDER BY total_games DESC
            LIMIT 10;
        """
        ).fetchdf()

        top_players_by_games["total_games"] = top_players_by_games["total_games"].apply(
            "{:,.0f}".format
        )
        print("\n--- Top 10 Players by Total Games ---")
        print(top_players_by_games.to_string(index=False))


--- Top 10 Players by Number of Unique Openings Played ---
               name opening_count
           sergej-v         2,155
              magho         2,068
    simplesimpson03         2,048
         cdplayer72         2,034
               ttch         2,014
       Mopsik357357         2,005
        DanielFlock         1,968
Christ-Ginting-KBPL         1,922
          EmirGamis         1,916
              EPPik         1,885

--- Top 10 Openings by Number of Unique Players ---
                                         name eco player_count
                            Queen's Pawn Game D00       86,723
                         Van't Kruijs Opening A00       86,429
                                 Pirc Defense B00       75,425
                            Caro-Kann Defense B10       73,344
                               Modern Defense B06       72,271
Scandinavian Defense: Mieses-Kotroc Variation B01       71,853
                               Mieses Opening A00       71,683
        Q
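
As a rough check on opening skew, the sketch below computes what share of all stats rows the listed top openings hold. It uses only the seven complete rows of the output above and the total from the partition analysis. `share_of_total` is a hypothetical helper.

```python
# Row counts copied from the seven complete rows of the top-openings output.
top_opening_rows = [86_723, 86_429, 75_425, 73_344, 72_271, 71_853, 71_683]
total_rows = 26_843_374  # total from the partition analysis


def share_of_total(counts, total):
    """Fraction of all rows held by the given keys."""
    return sum(counts) / total


top_share = share_of_total(top_opening_rows, total_rows)
largest_share = share_of_total(top_opening_rows[:1], total_rows)

print(f"Top 7 openings hold {top_share:.2%} of all stats rows")
print(f"Largest single opening holds {largest_share:.2%}")
```

Together these seven openings account for about 2% of the rows, and the largest one for about 0.3%. Measured by row count, the table is therefore not dominated by a handful of openings.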

## Initial Findings & Next Steps

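Based on the statistics above, we can draw some preliminary conclusions:

- **Scale**: The partitioned stats tables hold 26,843,374 rows in total. Our working assumption is that DuckDB's `UPSERT` performance degrades seriously only at hundreds of millions or billions of rows, which would put us well below that range, but this notebook does not verify that threshold. Section 1 could not report the file-level record counts because the connection failed with an `IOException`, so that cell needs to be rerun once the file opens cleanly.
- **Balance**: The partitions are uneven, but no single one takes most of the load. `player_opening_stats_C` is the largest at 33.24% of rows, `player_opening_stats_E` holds only 3.65%, and `player_opening_stats_other` is empty. If writes to the C partition turn out to be the slow ones, it might benefit from further sub-partitioning.
- **Skew**: The top players each have roughly 2,000 stats rows and the top openings roughly 70,000 to 87,000, so no individual player or opening accounts for more than about 0.3% of the table. Row counts do not show how often each row is rewritten, so constant updates to "hot" records could still contribute to the slowdown.

If hot records prove to be the cause, we might need to reconsider our processing strategy. For example, we could batch updates by player or opening to reduce contention, or explore alternative data structures. If the issue is purely scale, we may need to accept the performance curve or explore more heavy-duty database solutions.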
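
The batching idea mentioned above can be sketched as a pre-aggregation step: collapse each batch of games into one row per `(player_id, opening_id, color)` key before writing, so every key is upserted at most once per batch. The function name `aggregate_batch` and the input tuple shape are assumptions for illustration. The pipeline's real batch format is not shown in this notebook.

```python
from collections import defaultdict


def aggregate_batch(games):
    """Collapse per-game results into one (wins, draws) pair per key.

    `games` is an iterable of (player_id, opening_id, color, result)
    tuples, where result is "win", "draw", or anything else.
    """
    totals = defaultdict(lambda: [0, 0])
    for player_id, opening_id, color, result in games:
        entry = totals[(player_id, opening_id, color)]
        if result == "win":
            entry[0] += 1
        elif result == "draw":
            entry[1] += 1
    return {key: tuple(counts) for key, counts in totals.items()}


games = [
    (1, 10, "w", "win"),
    (1, 10, "w", "draw"),
    (1, 10, "w", "win"),
    (2, 10, "b", "loss"),
]
aggregated = aggregate_batch(games)
print(f"{len(games)} games -> {len(aggregated)} rows to upsert")
```

The more often a key repeats within a batch, the more this reduces the number of conflict checks. Whether it helps in practice depends on how much repetition real batches contain, which this notebook has not measured.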