# Clean Database and Resequence IDs

This notebook cleans up the database by removing players with NaN ratings and prepares the data for Matrix Factorization by resequencing IDs.

## The Problem

After fetching ratings in notebook 26, we discovered 82 players with NaN ratings; the safety check in Step 1 of this notebook later found 94, and the configuration below uses that count. These are players who deleted their accounts between our data collection and the rating fetch. A deleted account could indicate cheating or other issues, so we remove these players entirely.

Additionally, Matrix Factorization needs contiguous integer IDs starting from 0 or 1, because implementations typically use each ID as a row index into a factor matrix. Our current IDs have gaps left by deletions, so they are not contiguous.
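
To make the gap problem concrete, here is a minimal sketch (not part of the pipeline). It assumes a dense factor table whose rows are indexed directly by raw ID; the helper names and the sample IDs are hypothetical.

```python
def rows_needed(ids):
    """Rows a dense, 0-indexed factor table needs when raw IDs are used as row indices."""
    return max(ids) + 1


def wasted_rows(ids):
    """Rows that are allocated but never correspond to a real ID."""
    return rows_needed(ids) - len(set(ids))


raw_ids = [1, 2, 4, 7]         # hypothetical IDs with gaps left by deletions
contiguous_ids = [0, 1, 2, 3]  # the same four entities after resequencing from 0

print(rows_needed(raw_ids), wasted_rows(raw_ids))                # 8 4
print(rows_needed(contiguous_ids), wasted_rows(contiguous_ids))  # 4 0
```

With only four real entities, the raw IDs force twice as many rows as the resequenced ones, and the unused rows still receive parameters that are never trained.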

## The Solution

### Part 1: Delete Players with NaN Ratings (Implemented in this notebook)

1. **Identify NaN players** - Fetch all players where rating IS NULL
   - Safety check: verify the count matches `EXPECTED_NAN_PLAYERS`
   - If count doesn't match, stop for manual review

2. **Delete their stats** - Remove all `player_opening_stats` entries for these players
   - We delete stats first to avoid foreign key constraint issues
   - Commit this transaction separately
   - This two-step approach prevents issues we encountered with bulk deletions

3. **Delete the players** - Remove the player records themselves
   - Done in a separate transaction after stats deletion
   - Commit to make changes permanent

4. **Verification** - Comprehensive checks to ensure data integrity
   - Verify the remaining player count matches `EXPECTED_REMAINING_PLAYERS` (50,000 minus the deleted players)
   - Confirm no NULL ratings exist
   - Spot-check that deleted usernames are truly gone
   - Display statistics about the cleaned database

### Part 2: Resequence IDs for Matrix Factorization (TODO - Not implemented yet)

These steps are documented here but will be implemented later:

5. **Resequence Player IDs**
   - Create a mapping of old player IDs to new sequential IDs (1, 2, 3, ...)
   - Update all `player_opening_stats` partitioned tables to use new player IDs
   - Update the `player` table with new IDs
   - Extensive verification with random spot checks

6. **Resequence Opening IDs**
   - Create a mapping of old opening IDs to new sequential IDs (1, 2, 3, ...)
   - Update all `player_opening_stats` partitioned tables to use new opening IDs
   - Update the `opening` table with new IDs
   - Extensive verification with random spot checks

7. **Note about player_opening_stats**
   - We do NOT need to resequence `player_opening_stats` entries themselves
   - These are composite key records (player_id, opening_id, color)
   - Only the player_id and opening_id columns need updating (done in steps 5-6)
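
The resequencing plan above can be sketched in plain Python before it is implemented against the database. This is a sketch under assumptions, not the eventual implementation: the function names are hypothetical, the sample rows are made up, and stats rows are assumed to be tuples that start with `(player_id, opening_id, color)`.

```python
import random


def build_id_mapping(old_ids, start=1):
    """Map each distinct old ID to a contiguous new ID, preserving sort order."""
    return {old: new for new, old in enumerate(sorted(set(old_ids)), start=start)}


def remap_stats(rows, player_map, opening_map):
    """Rewrite player_id and opening_id; all other columns pass through unchanged.

    Raises KeyError if a row references an ID missing from a mapping (an orphan).
    """
    return [(player_map[p], opening_map[o], *rest) for p, o, *rest in rows]


def spot_check(old_rows, new_rows, player_map, opening_map, k=3, seed=0):
    """Invert the mappings and confirm k random rows translate back exactly."""
    inv_player = {new: old for old, new in player_map.items()}
    inv_opening = {new: old for old, new in opening_map.items()}
    rng = random.Random(seed)
    for i in rng.sample(range(len(old_rows)), min(k, len(old_rows))):
        p, o, *rest = new_rows[i]
        if (inv_player[p], inv_opening[o], *rest) != tuple(old_rows[i]):
            return False
    return True


# Made-up sample data: (player_id, opening_id, color, num_games)
stats = [(2, 10, "white", 12), (5, 40, "black", 3), (9, 10, "white", 7)]
player_map = build_id_mapping(p for p, *_rest in stats)
opening_map = build_id_mapping(o for _p, o, *_rest in stats)
remapped = remap_stats(stats, player_map, opening_map)

print(player_map)   # {2: 1, 5: 2, 9: 3}
print(opening_map)  # {10: 1, 40: 2}
print(remapped)     # [(1, 1, 'white', 12), (2, 2, 'black', 3), (3, 1, 'white', 7)]
print(spot_check(stats, remapped, player_map, opening_map))  # True
```

Building both mappings first and only then rewriting the stats rows keeps the player-opening relationships intact: every row is translated through the same two dictionaries, and an unknown ID fails loudly instead of being silently dropped.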

## Why This Matters

- **Data Quality**: Players who deleted accounts may have been cheaters
- **Model Requirements**: MF algorithms expect sequential integer IDs with no gaps
- **Data Integrity**: Critical that player-opening relationships remain intact
- **Verification**: Extensive checks ensure we don't corrupt the dataset

## Setup and Configuration

Import required libraries and set up database connection.

In [None]:
# Setup and imports
from pathlib import Path
from typing import List, Set
import random
from utils.database.db_utils import get_db_connection

# Database path
DB_PATH = Path.cwd().parent / "data" / "processed" / "chess_games.db"

# Expected counts for safety checks
EXPECTED_NAN_PLAYERS = 94  # Number of players with NULL ratings
EXPECTED_REMAINING_PLAYERS = 49_906  # 50,000 - 94

print("✓ Configuration loaded")
print(f"  Database: {DB_PATH}")
print(f"  Expected NaN players: {EXPECTED_NAN_PLAYERS}")
print(f"  Expected remaining players: {EXPECTED_REMAINING_PLAYERS:,}")

✓ Configuration loaded
  Database: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
  Expected NaN players: 82
  Expected remaining players: 49,918


## Step 1: Identify Players with NaN Ratings

Fetch all players where rating is NULL. The count must equal `EXPECTED_NAN_PLAYERS`.
If it doesn't, we stop and investigate before proceeding.

In [2]:
# Identify players with NaN (NULL) ratings
print("=" * 60)
print("STEP 1: IDENTIFYING PLAYERS WITH NaN RATINGS")
print("=" * 60)

con = get_db_connection(str(DB_PATH))

try:
    # Fetch players with NULL ratings
    nan_players_df = con.execute("""
        SELECT id, name, title
        FROM player
        WHERE rating IS NULL
        ORDER BY id
    """).df()
    
    nan_count = len(nan_players_df)
    print(f"\n📊 Found {nan_count} players with NULL ratings")
    
    # Safety check: Verify expected count
    if nan_count != EXPECTED_NAN_PLAYERS:
        print(f"\n⚠️  WARNING: Expected {EXPECTED_NAN_PLAYERS} players, found {nan_count}")
        print("\n❌ Count mismatch! Stopping for manual review.")
        print("   Please investigate before proceeding with deletions.")
        
        # Show some examples for investigation
        print(f"\n   Sample of NaN players:")
        for idx, row in nan_players_df.head(10).iterrows():
            title = f" ({row['title']})" if row['title'] else ""
            print(f"   - {row['name']}{title} (ID: {row['id']})")
        
        # Don't proceed
        raise ValueError(f"Expected {EXPECTED_NAN_PLAYERS} NaN players but found {nan_count}")
    
    print(f"✓ Count matches expected value ({EXPECTED_NAN_PLAYERS})")
    
    # Store player IDs and names for deletion and verification
    nan_player_ids = nan_players_df['id'].tolist()
    nan_player_names = nan_players_df['name'].tolist()
    
    # Display sample of players to be deleted
    print("\n📋 Sample of players to be deleted:")
    sample_size = min(10, len(nan_players_df))
    for idx, row in nan_players_df.head(sample_size).iterrows():
        title = f" ({row['title']})" if row['title'] else ""
        profile_link = f"https://lichess.org/@/{row['name']}"
        print(f"  {idx+1}. {row['name']}{title} (ID: {row['id']}) - {profile_link}")
    
    if len(nan_players_df) > sample_size:
        print(f"  ... and {len(nan_players_df) - sample_size} more")
    
    print(f"\n✓ Ready to delete {nan_count} players and their stats")
    
finally:
    con.close()
    print("\n✓ Database connection closed")

STEP 1: IDENTIFYING PLAYERS WITH NaN RATINGS

📊 Found 94 players with NULL ratings

❌ Count mismatch! Stopping for manual review.
   Please investigate before proceeding with deletions.

   Sample of NaN players:
   - AboEIAD2021 (ID: 450)
   - Alex199408 (ID: 1012)
   - AliAbdulKareem (ID: 1260)
   - Alijourian (ID: 1293)
   - AnrilFurin (ID: 1819)
   - Artem_Lebedev-2007 (ID: 2135)
   - ArtisticPlatinum (ID: 2151)
   - BR_Prien (ID: 2554)
   - BobbyTupperbum (ID: 3346)
   - BoulFR (ID: 3490)

✓ Database connection closed

ValueError: Expected 82 NaN players but found 94

## Step 2a: Delete player_opening_stats Entries

First, we delete all `player_opening_stats` entries for players with NaN ratings.
This must be done before deleting the players themselves to avoid foreign key constraint violations.

**Why separate transactions?**
- DuckDB can have issues with large, complex deletion operations
- Separating stats deletion from player deletion prevents transaction failures
- If something goes wrong, we can investigate without corrupting data
- Each step is committed separately for safety
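
One way to make "all partitions or nothing" explicit is to wrap the per-partition deletes in a single transaction. The sketch below uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; it is not DuckDB, and the helper names (`delete_stats_atomically`, `make_demo_db`) and the two-column table layout are assumptions for illustration. The partition suffixes mirror the ones this notebook loops over.

```python
import sqlite3

PARTITIONS = ["A", "B", "C", "D", "E", "other"]  # suffixes used by the notebook's loop


def delete_stats_atomically(con, player_ids, partitions=PARTITIONS):
    """Delete stats rows for player_ids from every partition in ONE transaction."""
    ids = list(player_ids)
    if not ids:
        return 0
    placeholders = ",".join("?" for _ in ids)
    deleted = 0
    # `with con` commits if the block succeeds and rolls back if it raises,
    # so a failure on one partition undoes the deletes on the earlier ones.
    with con:
        for suffix in partitions:
            cur = con.execute(
                f"DELETE FROM player_opening_stats_{suffix} "
                f"WHERE player_id IN ({placeholders})",
                ids,
            )
            deleted += cur.rowcount
    return deleted


def make_demo_db(partitions=PARTITIONS):
    """Hypothetical in-memory stand-in: one tiny table per partition."""
    con = sqlite3.connect(":memory:")
    with con:
        for suffix in partitions:
            con.execute(
                f"CREATE TABLE player_opening_stats_{suffix} "
                "(player_id INTEGER, opening_id INTEGER)"
            )
            con.executemany(
                f"INSERT INTO player_opening_stats_{suffix} VALUES (?, ?)",
                [(450, 1), (1012, 2), (7, 3)],
            )
    return con


con = make_demo_db()
print(delete_stats_atomically(con, [450, 1012]))  # 12 (2 rows in each of 6 partitions)
```

If any partition's delete fails, the `with con` block rolls back the earlier partitions too, so the stats tables are never left half-cleaned. The player deletion can still run afterwards as its own, separate transaction.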

In [None]:
# Delete player_opening_stats entries for NaN players
print("=" * 60)
print("STEP 2a: DELETING PLAYER_OPENING_STATS ENTRIES")
print("=" * 60)

con = get_db_connection(str(DB_PATH))

try:
    # First, check how many stats entries will be deleted
    stats_count = con.execute("""
        SELECT COUNT(*) as count
        FROM player_opening_stats
        WHERE player_id IN ?
    """, (nan_player_ids,)).fetchone()[0]
    
    print(f"\n📊 Found {stats_count:,} player_opening_stats entries to delete")
    
    # Delete from each partitioned table
    # Note: We can't delete from the view, must delete from base tables
    total_deleted = 0
    for letter in list("ABCDE") + ["other"]:
        table = f"player_opening_stats_{letter}"
        
        # Check count before deletion
        count_before = con.execute(
            f"SELECT COUNT(*) FROM {table} WHERE player_id IN ?",
            (nan_player_ids,)
        ).fetchone()[0]
        
        if count_before > 0:
            # Delete entries
            con.execute(
                f"DELETE FROM {table} WHERE player_id IN ?",
                (nan_player_ids,)
            )
            total_deleted += count_before
            print(f"  ✓ Deleted {count_before:,} entries from {table}")
    
    # Commit the transaction
    con.commit()
    print(f"\n✓ Successfully deleted {total_deleted:,} stats entries")
    print("✓ Transaction committed")
    
    # Verify deletion
    remaining_stats = con.execute("""
        SELECT COUNT(*) as count
        FROM player_opening_stats
        WHERE player_id IN ?
    """, (nan_player_ids,)).fetchone()[0]
    
    if remaining_stats == 0:
        print("✓ Verification: No stats entries remain for NaN players")
    else:
        print(f"⚠️  WARNING: {remaining_stats} stats entries still exist!")
        raise ValueError("Stats deletion verification failed")
    
finally:
    con.close()
    print("\n✓ Database connection closed")

## Step 2b: Delete Player Entries

Now that all stats entries are deleted, we can safely delete the player records.
This is done in a separate transaction to ensure data integrity.

In [None]:
# Delete player entries
print("=" * 60)
print("STEP 2b: DELETING PLAYER ENTRIES")
print("=" * 60)

con = get_db_connection(str(DB_PATH))

try:
    # Double-check that these players have no stats before deleting
    stats_check = con.execute("""
        SELECT COUNT(*) as count
        FROM player_opening_stats
        WHERE player_id IN ?
    """, (nan_player_ids,)).fetchone()[0]
    
    if stats_check > 0:
        print(f"\n❌ ERROR: Found {stats_check} stats entries still exist!")
        print("   Cannot delete players with existing stats entries.")
        raise ValueError("Pre-deletion safety check failed")
    
    print("\n✓ Pre-deletion check passed: No stats entries exist")
    
    # Delete the players
    print(f"\n🗑️  Deleting {len(nan_player_ids)} players...")
    con.execute("""
        DELETE FROM player
        WHERE id IN ?
    """, (nan_player_ids,))
    
    # Commit the transaction
    con.commit()
    print(f"✓ Successfully deleted {len(nan_player_ids)} players")
    print("✓ Transaction committed")
    
    # Verify deletion
    remaining_players = con.execute("""
        SELECT COUNT(*) as count
        FROM player
        WHERE id IN ?
    """, (nan_player_ids,)).fetchone()[0]
    
    if remaining_players == 0:
        print("✓ Verification: No player entries remain for deleted IDs")
    else:
        print(f"⚠️  WARNING: {remaining_players} player entries still exist!")
        raise ValueError("Player deletion verification failed")
    
finally:
    con.close()
    print("\n✓ Database connection closed")

## Step 3: Comprehensive Verification

Perform thorough checks to ensure:
1. The player count equals `EXPECTED_REMAINING_PLAYERS` (50,000 minus the deleted players)
2. No NULL ratings exist in the database
3. Random sample of deleted usernames are truly gone
4. Database integrity is maintained

In [None]:
# Comprehensive verification checks
print("=" * 60)
print("STEP 3: COMPREHENSIVE VERIFICATION")
print("=" * 60)

con = get_db_connection(str(DB_PATH))

try:
    # Check 1: Total player count
    print("\n1️⃣  Checking total player count...")
    total_players = con.execute("SELECT COUNT(*) FROM player").fetchone()[0]
    print(f"   Total players: {total_players:,}")
    
    if total_players == EXPECTED_REMAINING_PLAYERS:
        print(f"   ✓ Matches expected count ({EXPECTED_REMAINING_PLAYERS:,})")
    else:
        print(f"   ❌ ERROR: Expected {EXPECTED_REMAINING_PLAYERS:,}, found {total_players:,}")
        print(f"   Difference: {total_players - EXPECTED_REMAINING_PLAYERS:+,}")
    
    # Check 2: NULL ratings
    print("\n2️⃣  Checking for NULL ratings...")
    null_ratings = con.execute(
        "SELECT COUNT(*) FROM player WHERE rating IS NULL"
    ).fetchone()[0]
    print(f"   Players with NULL ratings: {null_ratings}")
    
    if null_ratings == 0:
        print("   ✓ No NULL ratings found")
    else:
        print(f"   ❌ ERROR: Found {null_ratings} players with NULL ratings!")
    
    # Check 3: Verify deleted usernames are gone
    print("\n3️⃣  Spot-checking deleted usernames...")
    sample_size = min(10, len(nan_player_names))
    sample_names = random.sample(nan_player_names, sample_size)
    
    found_deleted = []
    for name in sample_names:
        exists = con.execute(
            "SELECT COUNT(*) FROM player WHERE name = ?",
            (name,)
        ).fetchone()[0]
        
        if exists > 0:
            found_deleted.append(name)
    
    print(f"   Checked {sample_size} random deleted usernames:")
    if len(found_deleted) == 0:
        print("   ✓ None found in database (correct)")
    else:
        print(f"   ❌ ERROR: Found {len(found_deleted)} that still exist:")
        for name in found_deleted:
            print(f"      - {name}")
    
    # Check 4: Verify no orphaned stats
    print("\n4️⃣  Checking for orphaned stats entries...")
    orphaned_stats = con.execute("""
        SELECT COUNT(*) FROM player_opening_stats pos
        WHERE NOT EXISTS (
            SELECT 1 FROM player p WHERE p.id = pos.player_id
        )
    """).fetchone()[0]
    print(f"   Orphaned stats entries: {orphaned_stats:,}")
    
    if orphaned_stats == 0:
        print("   ✓ No orphaned stats entries")
    else:
        print(f"   ⚠️  WARNING: Found {orphaned_stats:,} orphaned stats!")
    
    # Check 5: Database statistics
    print("\n5️⃣  Database statistics...")
    stats_summary = con.execute("""
        SELECT 
            COUNT(*) as total_players,
            COUNT(rating) as players_with_rating,
            MIN(rating) as min_rating,
            MAX(rating) as max_rating,
            AVG(rating) as avg_rating,
            MEDIAN(rating) as median_rating
        FROM player
    """).df()
    
    print(f"   Total players: {stats_summary['total_players'][0]:,}")
    print(f"   Players with rating: {stats_summary['players_with_rating'][0]:,}")
    print(f"   Rating range: {stats_summary['min_rating'][0]} - {stats_summary['max_rating'][0]}")
    print(f"   Average rating: {stats_summary['avg_rating'][0]:.1f}")
    print(f"   Median rating: {stats_summary['median_rating'][0]:.0f}")
    
    # Check 6: Sample remaining players
    print("\n6️⃣  Sample of remaining players...")
    sample_players = con.execute("""
        SELECT name, title, rating
        FROM player
        ORDER BY RANDOM()
        LIMIT 5
    """).df()
    
    for idx, row in sample_players.iterrows():
        title = f" ({row['title']})" if row['title'] else ""
        profile_link = f"https://lichess.org/@/{row['name']}"
        print(f"   • {row['name']}{title}: {row['rating']} - {profile_link}")
    
    # Final summary
    print("\n" + "=" * 60)
    print("VERIFICATION SUMMARY")
    print("=" * 60)
    
    all_checks_passed = (
        total_players == EXPECTED_REMAINING_PLAYERS and
        null_ratings == 0 and
        len(found_deleted) == 0 and
        orphaned_stats == 0
    )
    
    if all_checks_passed:
        print("\n✅ ALL VERIFICATION CHECKS PASSED")
        print(f"\n   • Deleted {EXPECTED_NAN_PLAYERS} players with NaN ratings")
        print(f"   • {EXPECTED_REMAINING_PLAYERS:,} players remain")
        print("   • All players have valid ratings")
        print("   • No orphaned data exists")
        print("\n✓ Database is clean and ready for ID resequencing")
    else:
        print("\n⚠️  SOME VERIFICATION CHECKS FAILED")
        print("   Please review the issues above before proceeding")
    
finally:
    con.close()
    print("\n✓ Database connection closed")
    print("=" * 60)

## Next Steps (TODO)

The following steps will be implemented in future cells:

1. **Resequence Player IDs** - Make player IDs sequential (1, 2, 3, ...)
2. **Resequence Opening IDs** - Make opening IDs sequential (1, 2, 3, ...)
3. **Final Verification** - Ensure all mappings are correct and data integrity is maintained

These operations are critical for Matrix Factorization and require careful implementation with extensive verification.
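
For the final verification, one cheap check is that a resequenced ID column is contiguous: it starts at the chosen base, ends at base + count - 1, and has no duplicates. A minimal sketch, using Python's built-in `sqlite3` as a stand-in for the real database; `ids_are_contiguous` is a hypothetical helper and the table contents are made up. The table and column names are interpolated into the SQL, so only trusted names should be passed.

```python
import sqlite3


def ids_are_contiguous(con, table, id_column="id", start=1):
    """True if id_column holds exactly start, start+1, ..., with no gaps or duplicates."""
    total, lo, hi, distinct = con.execute(
        f"SELECT COUNT(*), MIN({id_column}), MAX({id_column}), "
        f"COUNT(DISTINCT {id_column}) FROM {table}"
    ).fetchone()
    if total == 0:
        return True  # an empty table has no gaps
    return lo == start and hi == start + total - 1 and distinct == total


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE player (id INTEGER, name TEXT)")
con.executemany("INSERT INTO player VALUES (?, ?)", [(1, "a"), (2, "b"), (4, "c")])
print(ids_are_contiguous(con, "player"))  # False: ID 3 is missing

con.execute("UPDATE player SET id = 3 WHERE id = 4")
print(ids_are_contiguous(con, "player"))  # True
```

The same check would apply to the `opening` table and, with `id_column="player_id"` or `"opening_id"`, can confirm that the stats tables reference no ID outside the new range.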