# Verification of Shrunken Database

**Purpose:** To verify the data integrity of the shrunken database (`chess_games_shrunk.db`) by comparing it against the original database (`chess_games.db`).

**Methodology:**
1.  **Aggregate Count Comparison:** Check that the total number of records in the `player` and `opening` tables, and across all `player_opening_stats` partitions, is identical between the two databases.
2.  **ID Mapping Reconstruction:** Recreate the logic used to map old, non-sequential IDs to new, sequential IDs.
3.  **Random Spot Checks:**
    *   Select 200 random players and 200 random openings from the original database.
    *   Find their corresponding records in the new database using the reconstructed ID mappings.
    *   Verify that their core data (name, title, eco) is identical.
4.  **Stats Record Spot Check:**
    *   Select 200 random records from one of the `player_opening_stats` partitions in the original database.
    *   Find the corresponding records in the new database by mapping both `player_id` and `opening_id`.
    *   Verify that the game statistics (`num_wins`, `num_draws`, `num_losses`) and `color` are identical.

These checks provide strong evidence that the shrinking process was successful and lossless. The spot checks in steps 3 and 4 are samples, so they support that conclusion rather than prove it.

In [1]:
# Configuration
import pandas as pd
from pathlib import Path
from utils.database.db_utils import get_db_connection

# Define paths to the original and new database files
project_root = Path.cwd().parent if "notebooks" in str(Path.cwd()) else Path.cwd()
db_path_original = project_root / "data" / "processed" / "chess_games.db"
db_path_shrunk = project_root / "data" / "processed" / "chess_games_shrunk.db"

# Define partitions for player_opening_stats
partitions = list("ABCDE") + ["other"]
SAMPLE_SIZE = 200

# Print configuration details
print(f"Original DB path: {db_path_original}")
print(f"Shrunken DB path: {db_path_shrunk}")
print(f"Partitions to check: {partitions}")
print(f"Sample size for spot checks: {SAMPLE_SIZE}")

Original DB path: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
Shrunken DB path: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games_shrunk.db
Partitions to check: ['A', 'B', 'C', 'D', 'E', 'other']
Sample size for spot checks: 200


### 1. Aggregate Count Comparison
First, let's compare the total row counts for the main tables in both databases. They should be identical.

In [2]:
def get_db_summary(db_path, db_name):
    summary = {}
    with get_db_connection(db_path) as con:
        summary['player_count'] = con.execute("SELECT COUNT(*) FROM player").fetchone()[0]
        summary['opening_count'] = con.execute("SELECT COUNT(*) FROM opening").fetchone()[0]
        
        total_stats_count = 0
        for letter in partitions:
            total_stats_count += con.execute(f"SELECT COUNT(*) FROM player_opening_stats_{letter}").fetchone()[0]
        summary['stats_count'] = total_stats_count
        
    return pd.Series(summary, name=db_name)

summary_original = get_db_summary(db_path_original, "Original DB")
summary_shrunk = get_db_summary(db_path_shrunk, "Shrunken DB")

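# Transpose so each metric is a row; comparing the two DB columns then aligns on the
# row index. (Assigning `iloc[0] == iloc[1]` as a column misaligns and yields NaN.)
comparison_df = pd.DataFrame([summary_original, summary_shrunk]).T
comparison_df['counts_match'] = comparison_df['Original DB'] == comparison_df['Shrunken DB']
print(comparison_df)

               Original DB  Shrunken DB  counts_match
player_count         44459        44459          True
opening_count         3132         3132          True
stats_count       12867612     12867612          True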
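
Row counts alone cannot detect a value that changed while its row survived. One cheap extension, sketched below, is to compare column sums, which do not depend on how IDs were renumbered. The helper name `get_stats_checksums` is hypothetical, and the demo uses an in-memory `sqlite3` database with made-up rows as a stand-in so that it runs on its own. It assumes only a connection whose `execute(...)` result supports `fetchone()`, as the notebook's connection does.

```python
import sqlite3

STAT_COLUMNS = ("num_wins", "num_draws", "num_losses")


def get_stats_checksums(con, table_names, columns=STAT_COLUMNS):
    """Sum each stat column over all given tables; sums are unaffected by ID renumbering."""
    totals = dict.fromkeys(columns, 0)
    select_list = ", ".join(f"COALESCE(SUM({col}), 0)" for col in columns)
    for table in table_names:
        row = con.execute(f"SELECT {select_list} FROM {table}").fetchone()
        for col, value in zip(columns, row):
            totals[col] += value
    return totals


# Demo on a throwaway in-memory database (hypothetical data, not the real tables)
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE player_opening_stats_A "
    "(player_id, opening_id, color, num_wins, num_draws, num_losses)"
)
demo.executemany(
    "INSERT INTO player_opening_stats_A VALUES (?, ?, ?, ?, ?, ?)",
    [(60571, 1624, "w", 3, 1, 2), (6462, 228, "b", 5, 0, 4)],
)
checksums = get_stats_checksums(demo, ["player_opening_stats_A"])
print(checksums)
```

In the notebook this could be called once per database with `[f"player_opening_stats_{p}" for p in partitions]`, and the two resulting dictionaries compared for equality.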


### 2. Reconstruct ID Mappings
To compare individual records, we need to know how the old IDs map to the new ones. The new IDs were generated sequentially based on an alphabetical sort. We'll load these mappings into memory.

In [3]:
with get_db_connection(db_path_original) as con_orig:
    # Player ID mapping
    player_mapping_df = con_orig.execute("""
        SELECT 
            id as old_id, 
            name,
            title,
            ROW_NUMBER() OVER (ORDER BY name) as new_id
        FROM player
    """).df()

    # Opening ID mapping
    opening_mapping_df = con_orig.execute("""
        SELECT 
            id as old_id,
            eco,
            name,
            ROW_NUMBER() OVER (ORDER BY eco, name) as new_id
        FROM opening
    """).df()

print(f"Player mapping created with {len(player_mapping_df):,} records.")
print(player_mapping_df.head())
print(f"\nOpening mapping created with {len(opening_mapping_df):,} records.")
print(opening_mapping_df.head())

Player mapping created with 44,459 records.
   old_id        name title  new_id
0   60571   1001Moves  None       1
1    6462     2700172  None       2
2    4095       A-2-A  None       3
3   57770        A-HF  None       4
4   20850  A-Haimoura  None       5

Opening mapping created with 3,132 records.
   old_id  eco                        name  new_id
0  246513  A00                 Amar Gambit       1
1    1624  A00                Amar Opening       2
2   38952  A00  Amar Opening: Paris Gambit       3
3   53918  A00            Amsterdam Attack       4
4     228  A00         Anderssen's Opening       5
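
Reconstructing the mapping with `ROW_NUMBER()` is only reliable when the sort key is unique. If two players shared a `name`, or two openings shared an `(eco, name)` pair, their relative order would be arbitrary and the reconstructed IDs might not match the ones assigned during shrinking. The sketch below shows the same sort-then-number idea in plain Python with a guard against duplicate keys. `build_id_mapping` is a hypothetical helper, and Python's default string ordering is assumed to match the database's sort order, which holds for the five names shown above but is not guaranteed in general.

```python
def build_id_mapping(rows, key):
    """Map old IDs to 1-based sequential IDs assigned in sorted-key order.

    rows: iterable of (old_id, record) pairs; key: function extracting the sort key.
    Raises ValueError on duplicate keys, since their order would be ambiguous.
    """
    ordered = sorted(rows, key=lambda pair: key(pair[1]))
    mapping = {}
    previous_key = object()  # sentinel that never equals a real key
    for new_id, (old_id, record) in enumerate(ordered, start=1):
        current_key = key(record)
        if current_key == previous_key:
            raise ValueError(f"duplicate sort key: {current_key!r}")
        mapping[old_id] = new_id
        previous_key = current_key
    return mapping


# The first five players from the output above, in scrambled order
players = [
    (4095, {"name": "A-2-A"}),
    (60571, {"name": "1001Moves"}),
    (20850, {"name": "A-Haimoura"}),
    (6462, {"name": "2700172"}),
    (57770, {"name": "A-HF"}),
]
player_mapping = build_id_mapping(players, key=lambda rec: rec["name"])
print(player_mapping)
```

A quick equivalent on the real data would be to confirm that `player_mapping_df['name']` and the `(eco, name)` pairs in `opening_mapping_df` contain no duplicates before trusting `new_id`.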


### 3. Spot Check `player` and `opening` Tables
Now, we'll take a random sample from the original `player` and `opening` tables and ensure their data exists and is correct in the new database under the new ID.

In [None]:
def verify_spot_checks(sample_df, shrunk_con, table_name, id_col='new_id'):
    """
    Verifies a sample DataFrame against a table in the shrunken database.
    Returns a DataFrame with original and shrunken data for comparison.
    """
    shrunk_data = []
    for _, row in sample_df.iterrows():
        lookup_id = row[id_col]
        
        # Fetch the corresponding record from the shrunken DB
        shrunk_record = shrunk_con.execute(f"SELECT * FROM {table_name} WHERE id = ?", [lookup_id]).fetchone()
        
        if shrunk_record:
            shrunk_data.append({desc[0]: val for desc, val in zip(shrunk_con.description, shrunk_record)})
        else:
            # Append an empty record if not found; it becomes an all-NaN row.
            # (A bare None in the list would make pd.DataFrame fail.)
            shrunk_data.append({})

    # Prepare dataframes for comparison
    original_df = sample_df.reset_index(drop=True)
    shrunk_df = pd.DataFrame(shrunk_data).reset_index(drop=True)
    
    # Rename columns for clarity
    original_df.columns = [f"orig_{c}" for c in original_df.columns]
    shrunk_df.columns = [f"shrunk_{c}" for c in shrunk_df.columns]
    
    # Combine and display
    comparison_df = pd.concat([original_df, shrunk_df], axis=1)
    
    return comparison_df

with get_db_connection(db_path_shrunk) as con_shrunk:
    print("--- Verifying Player Table ---")
    player_sample = player_mapping_df.sample(n=SAMPLE_SIZE)
    player_comparison = verify_spot_checks(player_sample, con_shrunk, 'player')
    
    print(f"Displaying {SAMPLE_SIZE} random player records for comparison:")
    display(player_comparison)

    print("\n--- Verifying Opening Table ---")
    opening_sample = opening_mapping_df.sample(n=SAMPLE_SIZE)
    opening_comparison = verify_spot_checks(opening_sample, con_shrunk, 'opening')

    print(f"Displaying {SAMPLE_SIZE} random opening records for comparison:")
    display(opening_comparison)
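
Reading 200 rows by eye is error-prone, so it may help to reduce each comparison frame to the rows that disagree. The sketch below assumes the `orig_`/`shrunk_` column prefixes produced by `verify_spot_checks`. `find_mismatches` is a hypothetical helper, the demo data is made up, and two missing values are treated as equal so that players without a title do not show up as false mismatches.

```python
import pandas as pd


def find_mismatches(comparison_df, columns):
    """Return the rows where any orig_<col> differs from shrunk_<col>."""
    mismatch = pd.Series(False, index=comparison_df.index)
    for col in columns:
        orig = comparison_df[f"orig_{col}"]
        shrunk = comparison_df[f"shrunk_{col}"]
        both_missing = orig.isna() & shrunk.isna()
        mismatch |= ~((orig == shrunk) | both_missing)
    return comparison_df[mismatch]


# Tiny hypothetical frame: row 1 has a wrong name, row 2 matches (both titles missing)
demo_df = pd.DataFrame({
    "orig_name": ["alice", "bob", "carol"],
    "shrunk_name": ["alice", "bobby", "carol"],
    "orig_title": ["GM", None, None],
    "shrunk_title": ["GM", None, None],
})
bad_rows = find_mismatches(demo_df, ["name", "title"])
print(bad_rows)
```

For the player sample this could be called as `find_mismatches(player_comparison, ['name', 'title'])` and for openings as `find_mismatches(opening_comparison, ['eco', 'name'])`. An empty result means every sampled record matched.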

### 4. Spot Check `player_opening_stats` Table
This is the most critical check. We'll sample from a stats partition in the original DB, map both the `player_id` and `opening_id` to their new values, and verify that the full record (including win/draw/loss counts) is identical in the new DB.

In [None]:
# For simplicity, we'll check the 'A' partition
partition_to_check = 'A'
stats_table = f"player_opening_stats_{partition_to_check}"

with get_db_connection(db_path_original) as con_orig, get_db_connection(db_path_shrunk) as con_shrunk:
    print(f"--- Verifying {stats_table} ---")
    
    # Get a random sample from the original stats table
    original_stats_sample_df = con_orig.execute(f"SELECT * FROM {stats_table} ORDER BY RANDOM() LIMIT {SAMPLE_SIZE}").df()
    
    # Add new IDs to the sample dataframe for easy lookup
    stats_sample_merged = original_stats_sample_df.merge(
        player_mapping_df[['old_id', 'new_id']], left_on='player_id', right_on='old_id', suffixes=('', '_player')
    ).merge(
        opening_mapping_df[['old_id', 'new_id']], left_on='opening_id', right_on='old_id', suffixes=('', '_opening')
    )
    stats_sample_merged.rename(columns={'new_id': 'new_player_id', 'new_id_opening': 'new_opening_id'}, inplace=True)

    shrunk_data = []
    for _, row in stats_sample_merged.iterrows():
        # Query the new DB with the new composite primary key
        shrunk_record = con_shrunk.execute(
            f"SELECT * FROM {stats_table} WHERE player_id = ? AND opening_id = ? AND color = ?",
            [row['new_player_id'], row['new_opening_id'], row['color']]
        ).fetchone()

        if shrunk_record:
            shrunk_data.append({desc[0]: val for desc, val in zip(con_shrunk.description, shrunk_record)})
        else:
            shrunk_data.append({})  # empty record becomes an all-NaN row; a bare None would make pd.DataFrame fail

    # Prepare dataframes for comparison
    original_df = stats_sample_merged.reset_index(drop=True)
    shrunk_df = pd.DataFrame(shrunk_data).reset_index(drop=True)
    
    original_df.columns = [f"orig_{c}" for c in original_df.columns]
    shrunk_df.columns = [f"shrunk_{c}" for c in shrunk_df.columns]
    
    comparison_df = pd.concat([original_df, shrunk_df], axis=1)

    print(f"Displaying {SAMPLE_SIZE} random stats records from '{stats_table}' for comparison:")
    display(comparison_df)

### Conclusion

If all checks above passed, we can be highly confident that the database shrinking process was successful and did not result in data loss. The size reduction is primarily due to:
1.  **Rebuilding the file:** Exporting to Parquet and re-importing into a new file eliminates all historical data bloat and fragmentation from past transactions (the biggest factor).
2.  **Sequential Primary Keys:** Using sequential integers for IDs is slightly more space-efficient than using non-sequential ones, though this is a minor factor compared to rebuilding.
3.  **Data Type Optimization:** Changing the `color` column from `VARCHAR` to a 1-byte `ENUM` and `num_draws` from `INTEGER` to `SMALLINT` saves a few bytes per record, which adds up across the roughly 12.9 million stats rows.
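
As a rough sense of scale for the third point, the arithmetic below estimates the raw saving from narrowing `num_draws` alone, using the `stats_count` reported in the aggregate comparison. It assumes the usual 4-byte `INTEGER` and 2-byte `SMALLINT` widths and ignores any on-disk compression, so the real saving may differ.

```python
STATS_ROWS = 12_867_612      # stats_count from the aggregate comparison
INTEGER_BYTES = 4            # assumed width of INTEGER
SMALLINT_BYTES = 2           # assumed width of SMALLINT

saved_bytes = STATS_ROWS * (INTEGER_BYTES - SMALLINT_BYTES)
saved_mib = saved_bytes / 2**20
print(f"{saved_bytes:,} bytes (~{saved_mib:.1f} MiB) saved on num_draws before compression")
```

The narrowing is only safe if no single stats record has more than 32,767 draws, the largest value a signed 16-bit `SMALLINT` can hold.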