# Database Recovery Notebook

This notebook recovers data from a corrupted database file (`chess_games_corrupted_copy.db`) whose contents are now CSV-like text rather than a readable database.

## Recovery Plan:
1. **Examine** the corrupted database to confirm its structure
2. **Create** a new recovery database with the proper schema
3. **Copy** the player and opening tables from the backup database
4. **Restore** the player_opening_stats data from the corrupted file
5. **Verify** the recovered database

## Files:
- **Corrupted DB**: `chess_games_corrupted_copy.db` (CSV-formatted player_opening_stats)
- **Backup DB**: `chess_games_backup_before_optimization.db` (has player & opening tables)
- **Recovery DB**: `recovery_chess_games.db` (NEW - will not modify existing DBs)

## Step 1: Import Libraries
Import necessary modules for database operations and progress tracking.

In [15]:
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import sys

# Add utils to path
sys.path.append(str(Path.cwd().parent))
from utils.database.db_utils import get_db_connection, setup_database

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## Step 2: Define File Paths
Set up paths for the corrupted database, backup database, and new recovery database.

In [16]:
# Define paths
data_dir = Path.cwd().parent / "data" / "processed"
corrupted_db_path = data_dir / "chess_games_corrupted_copy.db"
backup_db_path = data_dir / "chess_games_backup_before_optimization.db"
recovery_db_path = data_dir / "recovery_chess_games.db"

# Verify files exist
print(f"Corrupted DB exists: {corrupted_db_path.exists()}")
print(f"Backup DB exists: {backup_db_path.exists()}")
print(f"Recovery DB exists: {recovery_db_path.exists()} (should be False)")

print(f"\nPaths:")
print(f"  Corrupted: {corrupted_db_path}")
print(f"  Backup: {backup_db_path}")
print(f"  Recovery: {recovery_db_path}")

Corrupted DB exists: True
Backup DB exists: True
Recovery DB exists: True (should be False)

Paths:
  Corrupted: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games_corrupted_copy.db
  Backup: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games_backup_before_optimization.db
  Recovery: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/recovery_chess_games.db


## Step 3: Examine the Corrupted Database
Load the corrupted database to understand its structure.

### Expected Structure:
Based on the provided sample, we expect:
- **Format**: CSV-like with comma separators
- **Columns**: `player_id`, `opening_id`, `color`, `num_wins`, `num_draws`, `num_losses`
- **Data Types**: integers for IDs and stats, single character ('w' or 'b') for color
- **Content**: Only player_opening_stats data (no player or opening tables)
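
Before loading roughly 27 million rows, it can help to confirm cheaply that the file really is CSV text with the expected header. The following is a minimal sketch, not part of the original notebook: `read_csv_header` is a hypothetical helper, and the demo files are throwaway stand-ins created in a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

EXPECTED_COLUMNS = ["player_id", "opening_id", "color",
                    "num_wins", "num_draws", "num_losses"]


def read_csv_header(path):
    """Return the first CSV row of the file, or None if it is not UTF-8 CSV text."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            return next(csv.reader(handle), None)
    except (UnicodeDecodeError, csv.Error):
        # Binary content (for example a real database file) ends up here.
        return None


# Demo on throwaway files (hypothetical names, not the real databases).
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "csv_like.db"
    text_file.write_text(",".join(EXPECTED_COLUMNS) + "\n5770,457,w,54,7,48\n",
                         encoding="utf-8")
    binary_file = Path(tmp) / "binary.db"
    binary_file.write_bytes(b"\xff\xfe\x00\x01 not text")

    text_header = read_csv_header(text_file)
    binary_header = read_csv_header(binary_file)

print(text_header == EXPECTED_COLUMNS)
print(binary_header)
```

In the notebook, the same check could be pointed at `corrupted_db_path` before calling `pd.read_csv`.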

In [17]:
# Load corrupted data as CSV
print("Loading corrupted database as CSV...")
try:
    corrupted_data = pd.read_csv(corrupted_db_path)
    print(f"✓ Loaded {len(corrupted_data):,} rows\n")
    
    print("Column names:")
    print(corrupted_data.columns.tolist())
    
    print("\nData types:")
    print(corrupted_data.dtypes)
    
    print("\nFirst 10 rows:")
    print(corrupted_data.head(10))
    
except Exception as e:
    print(f"✗ Error loading corrupted data: {e}")

Loading corrupted database as CSV...
✓ Loaded 26,843,374 rows

Column names:
['player_id', 'opening_id', 'color', 'num_wins', 'num_draws', 'num_losses']

Data types:
player_id      int64
opening_id     int64
color         object
num_wins       int64
num_draws      int64
num_losses     int64
dtype: object

First 10 rows:
   player_id  opening_id color  num_wins  num_draws  num_losses
0       5770         457     w        54          7          48
1      16740         178     w         5          0           3
2      27190         382     w        23          4          36
3      15977         427     w         3          0           5
4      43758         207     w         2          0           4
5      38292          71     w         1          1           2
6       5415         417     w         2          0           3
7      10179         445     w         9          1           5
8      11333         376     w        49         10          70
9         61         252     w        

## Step 4: Validate Data Structure
Confirm the data matches our expectations.

In [18]:
# Check for expected columns
expected_columns = ['player_id', 'opening_id', 'color', 'num_wins', 'num_draws', 'num_losses']
actual_columns = corrupted_data.columns.tolist()

print("Column Validation:")
if actual_columns == expected_columns:
    print("✓ All expected columns present and in correct order")
else:
    print("✗ Column mismatch!")
    print(f"  Expected: {expected_columns}")
    print(f"  Actual: {actual_columns}")

# Check value ranges
print("\nValue Ranges:")
print(f"  player_id: {corrupted_data['player_id'].min()} to {corrupted_data['player_id'].max()}")
print(f"  opening_id: {corrupted_data['opening_id'].min()} to {corrupted_data['opening_id'].max()}")
print(f"  Unique players: {corrupted_data['player_id'].nunique():,}")
print(f"  Unique openings: {corrupted_data['opening_id'].nunique():,}")
print(f"  Colors: {corrupted_data['color'].unique()}")

# Check for nulls
print("\nNull Values:")
print(corrupted_data.isnull().sum())

Column Validation:
✓ All expected columns present and in correct order

Value Ranges:
  player_id: 1 to 50000
  opening_id: 1 to 3593
  Unique players: 50,000
  Unique openings: 3,361
  Colors: ['w' 'b']

Null Values:
player_id     0
opening_id    0
color         0
num_wins      0
num_draws     0
num_losses    0
dtype: int64
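
The checks above cover column names, ranges, and nulls, but not repeated keys or negative counts. The sketch below illustrates two further checks on a handful of made-up rows in plain Python. It assumes that `(player_id, opening_id, color)` is meant to be unique per row, which the source does not state, and both helper names are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows):
    """Return {key: count} for (player_id, opening_id, color) keys seen more than once."""
    counts = Counter((player_id, opening_id, color)
                     for player_id, opening_id, color, *_ in rows)
    return {key: n for key, n in counts.items() if n > 1}


def find_invalid_rows(rows):
    """Return rows with an unknown color or a negative win/draw/loss count."""
    return [row for row in rows
            if row[2] not in ("w", "b") or min(row[3:6]) < 0]


# Made-up rows for illustration; only the first two mirror the printed sample.
sample_rows = [
    (5770, 457, "w", 54, 7, 48),
    (16740, 178, "w", 5, 0, 3),
    (5770, 457, "w", 1, 0, 0),   # repeats the first key
    (1, 2, "x", 2, 0, 1),        # invalid color
]

duplicates = find_duplicate_keys(sample_rows)
invalid = find_invalid_rows(sample_rows)
print(duplicates)
print(invalid)
```

On the full DataFrame the same idea would more naturally be expressed with pandas operations; the plain-Python form is used here only to keep the sketch dependency-free.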


## Step 5: Create New Recovery Database
Initialize a new database with the proper schema using `setup_database()` from db_utils.py.

In [19]:
# Create new recovery database with proper schema
print("Creating new recovery database...")

# Delete if exists (fresh start)
if recovery_db_path.exists():
    recovery_db_path.unlink()
    print("Removed existing recovery database")

# Create and setup schema
with get_db_connection(recovery_db_path) as con:
    setup_database(con)

print(f"✓ Recovery database created at: {recovery_db_path}")

Creating new recovery database...
Removed existing recovery database
Initializing database schema...
Database tables and partitioned stats tables are ready.
✓ Recovery database created at: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/recovery_chess_games.db


## Step 6: Copy Player and Opening Tables from Backup
Before inserting player_opening_stats, we need the player and opening reference tables.
Copy these from the backup database first.

In [20]:
# Copy player table from backup
print("Copying player table from backup...")
with get_db_connection(backup_db_path) as backup_con:
    players_df = backup_con.execute("SELECT * FROM player").fetchdf()
    print(f"  Loaded {len(players_df):,} players from backup")

with get_db_connection(recovery_db_path) as recovery_con:
    # Insert using DuckDB's register + INSERT SELECT pattern for better performance
    recovery_con.register('players_temp', players_df)
    recovery_con.execute("INSERT INTO player SELECT * FROM players_temp")
    print(f"✓ Inserted {len(players_df):,} players into recovery database")

# Verify
with get_db_connection(recovery_db_path) as recovery_con:
    count = recovery_con.execute("SELECT COUNT(*) FROM player").fetchone()[0]
    print(f"  Verified: {count:,} players in recovery database")

Copying player table from backup...
  Loaded 50,000 players from backup
✓ Inserted 50,000 players into recovery database
  Verified: 50,000 players in recovery database


In [21]:
# Copy opening table from backup
print("Copying opening table from backup...")
with get_db_connection(backup_db_path) as backup_con:
    openings_df = backup_con.execute("SELECT * FROM opening").fetchdf()
    print(f"  Loaded {len(openings_df):,} openings from backup")

with get_db_connection(recovery_db_path) as recovery_con:
    # Insert using DuckDB's register + INSERT SELECT pattern
    recovery_con.register('openings_temp', openings_df)
    recovery_con.execute("INSERT INTO opening SELECT * FROM openings_temp")
    print(f"✓ Inserted {len(openings_df):,} openings into recovery database")

# Verify
with get_db_connection(recovery_db_path) as recovery_con:
    count = recovery_con.execute("SELECT COUNT(*) FROM opening").fetchone()[0]
    print(f"  Verified: {count:,} openings in recovery database")

Copying opening table from backup...
  Loaded 3,593 openings from backup
✓ Inserted 3,593 openings into recovery database
  Verified: 3,593 openings in recovery database
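
Since the stats rows refer to players and openings by ID, one useful check after copying the reference tables is that every ID in the stats data exists in them. A minimal sketch with made-up IDs follows; `missing_references` is a hypothetical helper. In the notebook the inputs would be columns such as `corrupted_data['opening_id']` and the `id` column of `openings_df`.

```python
def missing_references(stats_ids, reference_ids):
    """Return the sorted IDs that appear in the stats data but not in the reference table."""
    return sorted(set(stats_ids) - set(reference_ids))


# Made-up IDs for illustration only.
stats_opening_ids = [457, 178, 382, 457, 9999]
reference_opening_ids = [71, 178, 382, 457]

orphans = missing_references(stats_opening_ids, reference_opening_ids)
print(orphans)
```

An empty result means every stats row can be resolved against the reference table. Extra IDs in the reference table are not a problem for this check.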


## Step 6.5: Get Opening ECO Mapping
We need to map opening_ids to their ECO codes to determine which partition each row belongs to.

In [None]:
# Get opening ECO codes to determine partitioning
print("Loading opening ECO codes for partitioning...")
with get_db_connection(recovery_db_path) as con:
    openings_eco_df = con.execute("SELECT id, eco FROM opening").fetchdf()
    
# Create a mapping from opening_id to partition letter
def get_partition_letter(eco):
    """Determine which partition based on ECO code first letter."""
    if pd.isna(eco) or len(eco) == 0:
        return 'other'
    first_letter = eco[0].upper()
    if first_letter in 'ABCDE':
        return first_letter
    return 'other'

openings_eco_df['partition'] = openings_eco_df['eco'].apply(get_partition_letter)
opening_to_partition = dict(zip(openings_eco_df['id'], openings_eco_df['partition']))

print(f"✓ Loaded {len(opening_to_partition):,} opening-to-partition mappings")
print("\nPartition distribution:")
for letter in list("ABCDE") + ["other"]:
    count = sum(1 for p in opening_to_partition.values() if p == letter)
    print(f"  {letter}: {count:,} openings")
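
To make the partition rule concrete, here is a small standalone variant of `get_partition_letter` applied to a few ECO values. It is a sketch for illustration: it tests for `None` instead of using `pd.isna` so that it runs without pandas, and the sample codes were chosen here rather than taken from the database.

```python
def get_partition_letter(eco):
    """Map an ECO code to its partition: 'A' to 'E' by first letter, else 'other'."""
    if eco is None or len(eco) == 0:
        return "other"
    first_letter = eco[0].upper()
    return first_letter if first_letter in "ABCDE" else "other"


# Sample inputs chosen for illustration, including edge cases.
examples = ["B20", "e60", "", None, "Z99"]
results = {repr(eco): get_partition_letter(eco) for eco in examples}

for eco, partition in results.items():
    print(f"{eco:>6} -> {partition}")
```

Lower-case codes are normalized by `upper()`, and anything missing, empty, or outside A to E falls into the `other` partition.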

## Step 7: Test Insert - 1% of Data
Before inserting all data, test with 1% to verify the process works correctly.

In [None]:
# Calculate 1% subset
total_rows = len(corrupted_data)
test_size = int(total_rows * 0.01)
test_data = corrupted_data.head(test_size).copy()

print(f"Testing with {test_size:,} rows (1% of {total_rows:,} total)...")

# Add partition column to test data
test_data['partition'] = test_data['opening_id'].map(opening_to_partition)

# Insert test data into appropriate partitioned tables
with get_db_connection(recovery_db_path) as con:
    for letter in list("ABCDE") + ["other"]:
        partition_data = test_data[test_data['partition'] == letter].drop('partition', axis=1)
        if len(partition_data) > 0:
            con.register('test_temp', partition_data)
            con.execute(f"INSERT INTO player_opening_stats_{letter} SELECT * FROM test_temp")
            print(f"  Inserted {len(partition_data):,} rows into player_opening_stats_{letter}")

# Verify insertion
with get_db_connection(recovery_db_path) as con:
    count = con.execute("SELECT COUNT(*) FROM player_opening_stats").fetchone()[0]
    print(f"\n✓ Test insert successful: {count:,} rows in player_opening_stats")
    
    # Show sample of inserted data
    sample = con.execute("SELECT * FROM player_opening_stats LIMIT 5").fetchdf()
    print("\nSample of inserted data:")
    print(sample)

Testing with 268,433 rows (1% of 26,843,374 total)...


CatalogException: Catalog Error: player_opening_stats is not an table
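
One thing to watch in the partitioned insert: a row whose `opening_id` is missing from `opening_to_partition` gets no partition, so the per-partition filter skips it without any error. The sketch below shows a guard for that case on made-up data; `split_by_mapping` is a hypothetical helper and the partition letters in the demo mapping are invented.

```python
def split_by_mapping(opening_ids, opening_to_partition):
    """Split IDs into those with a known partition and those without one."""
    mapped, unmapped = [], []
    for opening_id in opening_ids:
        if opening_id in opening_to_partition:
            mapped.append(opening_id)
        else:
            unmapped.append(opening_id)
    return mapped, unmapped


# Made-up mapping and IDs for illustration only.
demo_mapping = {457: "C", 178: "B", 382: "A"}
demo_ids = [457, 178, 9999, 382]

mapped_ids, unmapped_ids = split_by_mapping(demo_ids, demo_mapping)
print(f"mapped: {len(mapped_ids)}, unmapped: {unmapped_ids}")
```

In the notebook, the equivalent check would be to count missing values in the `partition` column after the `map` call and stop if the count is not zero.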

## Step 8: Insert Full player_opening_stats Data
If the test succeeded, clear the test rows and insert the full dataset with progress tracking.

**Note**: This may take several minutes depending on data size. A progress bar shows the status.

In [None]:
# Clear test data first
with get_db_connection(recovery_db_path) as con:
    for letter in list("ABCDE") + ["other"]:
        con.execute(f"DELETE FROM player_opening_stats_{letter}")
    print("Cleared test data")

# Add partition column to full dataset
print("\nMapping rows to partitions...")
corrupted_data['partition'] = corrupted_data['opening_id'].map(opening_to_partition)
print("✓ Partition mapping complete")

# Insert full dataset in chunks for better performance and progress tracking
chunk_size = 100000  # Larger chunks for better performance
total_rows = len(corrupted_data)
num_chunks = (total_rows + chunk_size - 1) // chunk_size

print(f"\nInserting {total_rows:,} rows in {num_chunks:,} chunks of {chunk_size:,}...")

with get_db_connection(recovery_db_path) as con:
    for i in tqdm(range(0, total_rows, chunk_size), desc="Inserting data"):
        chunk = corrupted_data.iloc[i:i+chunk_size]
        
        # Insert each partition separately
        for letter in list("ABCDE") + ["other"]:
            partition_chunk = chunk[chunk['partition'] == letter].drop('partition', axis=1)
            if len(partition_chunk) > 0:
                con.register('chunk_temp', partition_chunk)
                con.execute(f"INSERT INTO player_opening_stats_{letter} SELECT * FROM chunk_temp")

print("\n✓ All data inserted successfully!")
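
The chunk arithmetic used above can be checked in isolation. This sketch reuses the row count reported in Step 3 and the chunk size from the cell; `chunk_bounds` is a hypothetical helper that yields the same `[start, stop)` slices the loop takes with `iloc`.

```python
def chunk_bounds(total_rows, chunk_size):
    """Yield (start, stop) index pairs covering range(total_rows) in order."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


total_rows = 26_843_374   # row count reported when loading the corrupted file
chunk_size = 100_000

# Ceiling division: number of chunks needed to cover all rows.
num_chunks = (total_rows + chunk_size - 1) // chunk_size
bounds = list(chunk_bounds(total_rows, chunk_size))

last_start, last_stop = bounds[-1]
print(f"chunks: {num_chunks}, last chunk rows: {last_stop - last_start}")
```

The final chunk is shorter than the rest, and the sizes of all chunks add up to `total_rows`, so no row is inserted twice or skipped by the slicing itself.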

## Step 9: Verify Recovery
Perform comprehensive verification of the recovered database.

In [None]:
print("=" * 60)
print("DATABASE RECOVERY VERIFICATION")
print("=" * 60)

with get_db_connection(recovery_db_path) as con:
    # Table counts
    player_count = con.execute("SELECT COUNT(*) FROM player").fetchone()[0]
    opening_count = con.execute("SELECT COUNT(*) FROM opening").fetchone()[0]
    stats_count = con.execute("SELECT COUNT(*) FROM player_opening_stats").fetchone()[0]
    
    print(f"\nTable Row Counts:")
    print(f"  player: {player_count:,}")
    print(f"  opening: {opening_count:,}")
    print(f"  player_opening_stats: {stats_count:,}")
    
    # Verify against original
    print(f"\nData Integrity:")
    if stats_count == len(corrupted_data):
        print(f"  ✓ Row count matches corrupted data ({len(corrupted_data):,})")
    else:
        print(f"  ✗ Row count mismatch! Expected {len(corrupted_data):,}, got {stats_count:,}")
    
    # Check partitioned tables
    print(f"\nPartitioned Table Breakdown:")
    for letter in list("ABCDE") + ["other"]:
        count = con.execute(f"SELECT COUNT(*) FROM player_opening_stats_{letter}").fetchone()[0]
        print(f"  player_opening_stats_{letter}: {count:,}")
    
    # Sample data
    print(f"\nSample Data from Recovery Database:")
    sample = con.execute("""
        SELECT * FROM player_opening_stats 
        ORDER BY RANDOM() 
        LIMIT 5
    """).fetchdf()
    print(sample)

print("\n" + "=" * 60)
print("RECOVERY COMPLETE!")
print("=" * 60)
print(f"\nRecovered database saved to: {recovery_db_path}")
print("\nNote: Some openings in the backup may be unused (we deleted some).")
print("This will be cleaned up in a future notebook.")
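
The row-count check relies on `player_opening_stats` reflecting all six partition tables. The schema built by `setup_database()` is not shown, so the sketch below assumes that `player_opening_stats` is a view that unions the partitions. It uses the standard library's `sqlite3` as a stand-in for DuckDB, with made-up rows, to show that the partition counts and the view count should agree.

```python
import sqlite3

PARTITIONS = ["A", "B", "C", "D", "E", "other"]
COLUMNS = ("player_id INTEGER, opening_id INTEGER, color TEXT, "
           "num_wins INTEGER, num_draws INTEGER, num_losses INTEGER")

con = sqlite3.connect(":memory:")

# One table per partition, plus a view over all of them (assumed layout).
for letter in PARTITIONS:
    con.execute(f"CREATE TABLE player_opening_stats_{letter} ({COLUMNS})")
union_sql = " UNION ALL ".join(
    f"SELECT * FROM player_opening_stats_{letter}" for letter in PARTITIONS
)
con.execute(f"CREATE VIEW player_opening_stats AS {union_sql}")

# Made-up rows for illustration only.
con.execute("INSERT INTO player_opening_stats_A VALUES (1, 10, 'w', 3, 1, 2)")
con.executemany(
    "INSERT INTO player_opening_stats_other VALUES (?, ?, ?, ?, ?, ?)",
    [(2, 20, "b", 0, 0, 1), (3, 30, "w", 4, 0, 0)],
)

partition_total = sum(
    con.execute(f"SELECT COUNT(*) FROM player_opening_stats_{letter}").fetchone()[0]
    for letter in PARTITIONS
)
view_total = con.execute("SELECT COUNT(*) FROM player_opening_stats").fetchone()[0]
con.close()

print(partition_total, view_total)
```

If the two totals ever differ in the real database, the combined relation is not covering every partition, and the comparison against `len(corrupted_data)` would be misleading.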