# Data Sanitization and Normalization

## Purpose
This notebook is designed to sanitize and normalize the chess database by removing inefficiencies, redundancies, and outliers that could negatively impact the quality of opening recommendations. The goal is to clean up the data while preserving the most valuable information for the chess opening recommender system.

## Key Areas of Focus
- **Redundant Opening Names**: Consolidate openings that share the same name but have different ECO codes
- **Data Inefficiencies**: Remove or consolidate records that provide minimal analytical value
- **Outliers**: Identify and handle extreme cases that might skew recommendations
- **Database Optimization**: Reduce storage footprint while maintaining data integrity

## Process Overview
1. **Baseline Analysis**: Establish current database statistics and size metrics
2. **Identify Redundancies**: Find opening names with multiple ECO codes
3. **Data Consolidation**: Merge redundant records while preserving statistical accuracy
4. **Quality Validation**: Verify that changes maintain data integrity
5. **Performance Optimization**: Measure improvements in database size and query performance

This systematic approach ensures that the database becomes more efficient and reliable for generating chess opening recommendations.

In [1]:
# Configuration and setup
import pandas as pd
import os
from pathlib import Path
from utils.database.db_utils import get_db_connection

# Define the path to the DuckDB database file
project_root = Path.cwd().parent if "notebooks" in str(Path.cwd()) else Path.cwd()
db_path = project_root / "data" / "processed" / "chess_games.db"

# Set pandas display options for better readability
pd.set_option('display.float_format', '{:,.2f}'.format)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print(f"Database path: {db_path}")
print(f"Database exists: {db_path.exists()}")

def log_database_statistics():
    """
    Log comprehensive database statistics including table sizes, record counts,
    file size, and key metrics. This function can be called repeatedly to track
    changes as we sanitize and normalize the data.
    """
    if not db_path.exists():
        print(f"Database file not found at {db_path}")
        return
    
    # Get database file size
    db_size_bytes = os.path.getsize(db_path)
    db_size_mb = db_size_bytes / (1024 * 1024)
    db_size_gb = db_size_mb / 1024
    
    with get_db_connection(db_path) as con:
        print("=" * 60)
        print("DATABASE STATISTICS SNAPSHOT")
        print("=" * 60)
        
        # File size information
        print(f"\n--- Database File Size ---")
        print(f"Size: {db_size_mb:,.1f} MB ({db_size_gb:.2f} GB)")
        print(f"Raw bytes: {db_size_bytes:,}")
        
        # Core table counts
        print(f"\n--- Core Tables ---")
        player_count = con.execute('SELECT COUNT(*) FROM player').fetchone()[0]
        opening_count = con.execute('SELECT COUNT(*) FROM opening').fetchone()[0]
        total_stats_records = con.execute('SELECT COUNT(*) FROM player_opening_stats').fetchone()[0]
        
        print(f"Players: {player_count:,}")
        print(f"Openings: {opening_count:,}")
        print(f"Player-Opening-Stats Records: {total_stats_records:,}")
        
        # Partition distribution
        print(f"\n--- Partition Distribution ---")
        partitions = ['A', 'B', 'C', 'D', 'E', 'other']
        total_partition_records = 0
        
        for partition in partitions:
            count = con.execute(f'SELECT COUNT(*) FROM player_opening_stats_{partition}').fetchone()[0]
            total_partition_records += count
            percentage = (count / total_stats_records * 100) if total_stats_records > 0 else 0
            print(f"  Partition {partition}: {count:,} ({percentage:.1f}%)")
        
        # Game statistics
        print(f"\n--- Game Statistics ---")
        total_games = con.execute("""
            SELECT SUM(num_wins + num_draws + num_losses) as total_games
            FROM player_opening_stats
        """).fetchone()[0]
        
        total_wins = con.execute('SELECT SUM(num_wins) FROM player_opening_stats').fetchone()[0]
        total_draws = con.execute('SELECT SUM(num_draws) FROM player_opening_stats').fetchone()[0]
        total_losses = con.execute('SELECT SUM(num_losses) FROM player_opening_stats').fetchone()[0]
        
        print(f"Total Games: {total_games:,}")
        print(f"  Wins: {total_wins:,} ({total_wins/total_games*100:.1f}%)")
        print(f"  Draws: {total_draws:,} ({total_draws/total_games*100:.1f}%)")
        print(f"  Losses: {total_losses:,} ({total_losses/total_games*100:.1f}%)")
        print(f"Average Games per Record: {total_games/total_stats_records:.1f}")
        
        # Color distribution
        print(f"\n--- Color Distribution ---")
        white_records = con.execute("SELECT COUNT(*) FROM player_opening_stats WHERE color = 'w'").fetchone()[0]
        black_records = con.execute("SELECT COUNT(*) FROM player_opening_stats WHERE color = 'b'").fetchone()[0]
        
        print(f"White Records: {white_records:,} ({white_records/total_stats_records*100:.1f}%)")
        print(f"Black Records: {black_records:,} ({black_records/total_stats_records*100:.1f}%)")
        
        # Opening name duplication check
        print(f"\n--- Opening Name Analysis ---")
        unique_names = con.execute('SELECT COUNT(DISTINCT name) FROM opening').fetchone()[0]
        duplicate_names_count = con.execute("""
            SELECT COUNT(*) FROM (
                SELECT name, COUNT(DISTINCT eco) as eco_count
                FROM opening
                GROUP BY name
                HAVING COUNT(DISTINCT eco) > 1
            ) duplicate_check
        """).fetchone()[0]
        
        print(f"Unique Opening Names: {unique_names:,}")
        print(f"Names with Multiple ECO Codes: {duplicate_names_count:,}")
        print(f"Name Duplication Rate: {duplicate_names_count/unique_names*100:.1f}%")
        
        # Storage efficiency metrics
        print(f"\n--- Storage Efficiency ---")
        bytes_per_record = db_size_bytes / total_stats_records if total_stats_records > 0 else 0
        bytes_per_game = db_size_bytes / total_games if total_games > 0 else 0
        
        print(f"Bytes per Stats Record: {bytes_per_record:.1f}")
        print(f"Bytes per Game: {bytes_per_game:.2f}")
        
        print("=" * 60)
        print("END STATISTICS SNAPSHOT")
        print("=" * 60)

# Log initial database state
log_database_statistics()

Database path: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
Database exists: True
DATABASE STATISTICS SNAPSHOT

--- Database File Size ---
Size: 2,765.0 MB (2.70 GB)
Raw bytes: 2,899,324,928

--- Core Tables ---
Players: 50,000
Openings: 3,593
Player-Opening-Stats Records: 28,229,204

--- Partition Distribution ---
  Partition A: 6,721,561 (23.8%)
  Partition B: 7,292,844 (25.8%)
  Partition C: 9,221,077 (32.7%)
  Partition D: 3,952,764 (14.0%)
  Partition E: 1,040,958 (3.7%)
  Partition other: 0 (0.0%)

--- Game Statistics ---
Total Games: 568,894,735
  Wins: 271,466,481 (47.7%)
  Draws: 25,892,380 (4.6%)
  Losses: 271,535,874 (47.7%)
Average Games per Record: 20.2

--- Color Distribution ---
White Records: 13,227,272 (46.9%)
Black Records: 15,001,932 (53.1%)

--- Opening Name Analysis ---
Unique Opening Names: 3,361
Names with Multiple ECO Codes: 184
Name Duplication Rate: 5.5%

--- Storage Efficiency ---
Bytes per Stats Record: 102.7
By

In [2]:
# Find all opening names with multiple ECO codes (duplicated names)
# This replicates the analysis from notebook 21 to identify consolidation opportunities

if db_path.exists():
    with get_db_connection(db_path) as con:
        print("=== IDENTIFYING OPENING NAMES WITH MULTIPLE ECO CODES ===")
        print("This analysis will help us understand which openings can be consolidated.\n")
        
        # Find opening names that appear with multiple ECO codes
        duplicate_names_query = """
            SELECT 
                name,
                COUNT(DISTINCT eco) as eco_count,
                STRING_AGG(DISTINCT eco, ', ' ORDER BY eco) as eco_codes,
                COUNT(*) as total_opening_records,
                SUM(total_games) as combined_games,
                SUM(unique_players) as combined_players,
                SUM(stats_records) as total_stats_records
            FROM (
                SELECT 
                    o.name,
                    o.eco,
                    o.id as opening_id,
                    SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                    COUNT(DISTINCT pos.player_id) as unique_players,
                    COUNT(*) as stats_records
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id, o.name, o.eco
            ) stats
            GROUP BY name
            HAVING COUNT(DISTINCT eco) > 1
            ORDER BY eco_count DESC, combined_games DESC
        """
        
        duplicate_names_df = con.execute(duplicate_names_query).fetchdf()
        
        if len(duplicate_names_df) > 0:
            print(f"Found {len(duplicate_names_df)} opening names with multiple ECO codes:\n")
            
            # Format the summary display
            summary_display = duplicate_names_df.copy()
            summary_display['total_opening_records'] = summary_display['total_opening_records'].apply('{:,}'.format)
            summary_display['combined_games'] = summary_display['combined_games'].apply('{:,}'.format)
            summary_display['combined_players'] = summary_display['combined_players'].apply('{:,}'.format)
            summary_display['total_stats_records'] = summary_display['total_stats_records'].apply('{:,}'.format)
            
            # Rename columns for better display
            summary_display.columns = [
                'Opening Name', 'ECO Count', 'ECO Codes', 'Opening Records', 
                'Total Games', 'Total Players', 'Stats Records'
            ]
            
            print("--- SUMMARY OF ALL DUPLICATED OPENING NAMES ---")
            print(summary_display.to_string(index=False))
            
            # Store the raw data for use in subsequent cells
            duplicate_openings_raw = duplicate_names_df.copy()
            
            # Show detailed breakdown for top 10 most complex cases
            print(f"\n=== DETAILED BREAKDOWN FOR TOP 10 MOST COMPLEX CASES ===")
            print("This shows exactly which ECO codes belong to each duplicated name.\n")
            
            top_complex_names = duplicate_names_df.head(10)['name'].tolist()
            
            detailed_breakdowns = {}
            
            for i, name in enumerate(top_complex_names, 1):
                print(f"{i}. '{name}'")
                print("-" * (len(name) + 10))
                
                detailed_query = """
                    SELECT 
                        o.eco,
                        o.id as opening_id,
                        SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                        COUNT(DISTINCT pos.player_id) as unique_players,
                        COUNT(*) as stats_records,
                        -- White performance
                        SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as white_games,
                        ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                              NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 1) as white_win_pct,
                        -- Black performance  
                        SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as black_games,
                        ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                              NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 1) as black_win_pct
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    WHERE o.name = ?
                    GROUP BY o.id, o.eco
                    ORDER BY total_games DESC
                """
                
                breakdown_df = con.execute(detailed_query, [name]).fetchdf()
                detailed_breakdowns[name] = breakdown_df.copy()
                
                # Format for display
                breakdown_display = breakdown_df.copy()
                breakdown_display['total_games'] = breakdown_display['total_games'].apply('{:,}'.format)
                breakdown_display['unique_players'] = breakdown_display['unique_players'].apply('{:,}'.format)
                breakdown_display['stats_records'] = breakdown_display['stats_records'].apply('{:,}'.format)
                breakdown_display['white_games'] = breakdown_display['white_games'].apply('{:,}'.format)
                breakdown_display['black_games'] = breakdown_display['black_games'].apply('{:,}'.format)
                
                # Remove opening_id from display (keep for internal use)
                display_cols = ['eco', 'total_games', 'unique_players', 'stats_records', 
                               'white_games', 'white_win_pct', 'black_games', 'black_win_pct']
                breakdown_display = breakdown_display[display_cols]
                breakdown_display.columns = ['ECO', 'Total Games', 'Players', 'Stats Records',
                                           'White Games', 'White Win%', 'Black Games', 'Black Win%']
                
                print(breakdown_display.to_string(index=False))
                print()  # Empty line for readability
            
            # Calculate and display potential consolidation impact
            print("=== CONSOLIDATION IMPACT SUMMARY ===")
            
            total_duplicated_names = len(duplicate_names_df)
            total_affected_games = duplicate_names_df['combined_games'].sum()
            total_affected_players = duplicate_names_df['combined_players'].sum()
            total_opening_records = duplicate_names_df['total_opening_records'].sum()
            total_stats_records_affected = duplicate_names_df['total_stats_records'].sum()
            
            # Get overall database stats for percentages
            total_openings = con.execute('SELECT COUNT(*) FROM opening').fetchone()[0]
            total_all_stats_records = con.execute('SELECT COUNT(*) FROM player_opening_stats').fetchone()[0]
            
            opening_reduction = total_opening_records - total_duplicated_names
            opening_reduction_pct = (opening_reduction / total_openings * 100) if total_openings > 0 else 0
            
            print(f"Names with multiple ECO codes: {total_duplicated_names:,}")
            print(f"Total affected games: {total_affected_games:,}")
            print(f"Total affected players: {total_affected_players:,}")
            print(f"Opening table records affected: {total_opening_records:,}")
            print(f"Stats records affected: {total_stats_records_affected:,} ({total_stats_records_affected/total_all_stats_records*100:.1f}% of total)")
            print(f"\nPotential opening table reduction: {opening_reduction:,} records ({opening_reduction_pct:.1f}%)")
            print(f"Remaining openings after consolidation: {total_openings - opening_reduction:,}")
            
            # Store data for next cells
            print(f"\n✓ Data prepared for consolidation analysis in next cells")
            print(f"✓ Found {total_duplicated_names} opening names ready for potential consolidation")
            print(f"✓ Detailed breakdowns available for top {len(top_complex_names)} most complex cases")
            
        else:
            print("No opening names found with multiple ECO codes.")
            print("All opening names have unique ECO code assignments.")
            duplicate_openings_raw = pd.DataFrame()
            detailed_breakdowns = {}
    
else:
    print(f"Database file not found at {db_path}")
    duplicate_openings_raw = pd.DataFrame()
    detailed_breakdowns = {}

=== IDENTIFYING OPENING NAMES WITH MULTIPLE ECO CODES ===
This analysis will help us understand which openings can be consolidated.



FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

Found 184 opening names with multiple ECO codes:

--- SUMMARY OF ALL DUPLICATED OPENING NAMES ---
                                                                    Opening Name  ECO Count                    ECO Codes Opening Records  Total Games Total Players Stats Records
                                         Nimzo-Indian Defense: Sämisch Variation          6 E24, E25, E26, E27, E28, E29               6     65,747.0      16,584.0      17,144.0
                                             Sicilian Defense: Najdorf Variation          5      B90, B94, B95, B96, B98               5    793,261.0      46,243.0      51,984.0
                              Ruy Lopez: Morphy Defense, Modern Steinitz Defense          5      C71, C72, C73, C74, C75               5    219,068.0      35,338.0      37,889.0
                                              Dutch Defense: Classical Variation          5      A84, A90, A91, A92, A96               5    164,971.0      22,233.0      23,397.0
            

## Opening Consolidation Process

The next step is to consolidate duplicated opening names by merging all ECO variants into the variant with the most total games. This will:

### Process Overview:
1. **Select Target Variant**: For each duplicated opening name, choose the ECO code with the most games as the consolidation target
2. **Merge Player Statistics**: Combine all player stats from source variants into the target variant
3. **Handle Player Overlaps**: If a player has stats for multiple ECO variants of the same opening+color, sum their wins/draws/losses
4. **Preserve Data Integrity**: Ensure no games are lost and all statistics remain accurate
5. **Clean Database**: Remove empty opening records and orphaned statistics

### Safety Measures:
- Validate total games before/after to ensure no data loss
- Use transactions to allow rollback if issues occur
- Log all consolidation actions for transparency

This consolidation will reduce database size while preserving all statistical information by combining related opening variants.

In [None]:
# Consolidate duplicated opening names into the ECO variant with the most games
# This is a complex operation that requires careful handling of player statistics

if len(duplicate_openings_raw) > 0 and db_path.exists():
    with get_db_connection(db_path) as con:
        print("=== STARTING OPENING CONSOLIDATION PROCESS ===")
        print("Consolidating duplicated opening names into their most-played ECO variants...\n")
        
        # Begin transaction for safety
        con.begin()
        
        try:
            consolidation_log = []
            total_games_before = con.execute('SELECT SUM(num_wins + num_draws + num_losses) FROM player_opening_stats').fetchone()[0]
            
            for idx, row in duplicate_openings_raw.iterrows():
                opening_name = row['name']
                print(f"Processing '{opening_name}' ({idx + 1}/{len(duplicate_openings_raw)})...")
                
                # Get detailed breakdown for this opening name to find the target variant
                target_query = """
                    SELECT 
                        o.id as opening_id,
                        o.eco,
                        SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                        COUNT(*) as stats_records
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    WHERE o.name = ?
                    GROUP BY o.id, o.eco
                    ORDER BY total_games DESC
                """
                
                variants = con.execute(target_query, [opening_name]).fetchall()
                
                if len(variants) <= 1:
                    print(f"  ⚠️  Skipping - only one variant found")
                    continue
                
                # Target is the variant with most games (first in DESC order)
                target_id = variants[0][0]
                target_eco = variants[0][1]
                target_games = variants[0][2]
                
                # Source variants are all the others
                source_variants = variants[1:]
                source_ids = [v[0] for v in source_variants]
                source_ecos = [v[1] for v in source_variants]
                
                print(f"  → Target: {target_eco} (ID: {target_id}, {target_games:,} games)")
                print(f"  → Sources: {', '.join(source_ecos)} (IDs: {', '.join(map(str, source_ids))})")
                
                games_consolidated = 0
                
                # Determine target partition table based on target ECO code
                target_eco_first_letter = target_eco[0].upper()
                if target_eco_first_letter in 'ABCDE':
                    target_partition = target_eco_first_letter
                else:
                    target_partition = 'other'
                target_table = f"player_opening_stats_{target_partition}"
                
                # For each source variant, merge its stats into the target
                for source_id, source_eco in zip(source_ids, source_ecos):
                    print(f"    Merging {source_eco} → {target_eco}...")
                    
                    # Get all stats records for this source variant from all partitions
                    source_stats = []
                    for partition in ['A', 'B', 'C', 'D', 'E', 'other']:
                        partition_stats = con.execute(f"""
                            SELECT player_id, color, num_wins, num_draws, num_losses
                            FROM player_opening_stats_{partition}
                            WHERE opening_id = ?
                        """, [source_id]).fetchall()
                        source_stats.extend(partition_stats)
                    
                    if not source_stats:
                        print(f"      ⚠️  No stats found for {source_eco} (ID: {source_id})")
                        continue
                    
                    records_moved = 0
                    for player_id, color, wins, draws, losses in source_stats:
                        games_consolidated += (wins + draws + losses)
                        records_moved += 1
                        
                        # Check if target already has a record for this player+color combination
                        existing_record = con.execute(f"""
                            SELECT num_wins, num_draws, num_losses 
                            FROM {target_table}
                            WHERE opening_id = ? AND player_id = ? AND color = ?
                        """, [target_id, player_id, color]).fetchone()
                        
                        if existing_record:
                            # Merge with existing record
                            new_wins = existing_record[0] + wins
                            new_draws = existing_record[1] + draws
                            new_losses = existing_record[2] + losses
                            
                            con.execute(f"""
                                UPDATE {target_table}
                                SET num_wins = ?, num_draws = ?, num_losses = ?
                                WHERE opening_id = ? AND player_id = ? AND color = ?
                            """, [new_wins, new_draws, new_losses, target_id, player_id, color])
                            
                        else:
                            # Create new record in target partition
                            con.execute(f"""
                                INSERT INTO {target_table} (opening_id, player_id, color, num_wins, num_draws, num_losses)
                                VALUES (?, ?, ?, ?, ?, ?)
                            """, [target_id, player_id, color, wins, draws, losses])
                    
                    # Now delete all stats records for this source variant from all partitions
                    deleted_count = 0
                    for partition in ['A', 'B', 'C', 'D', 'E', 'other']:
                        partition_deleted = con.execute(f"DELETE FROM player_opening_stats_{partition} WHERE opening_id = ?", [source_id]).rowcount
                        deleted_count += partition_deleted
                    
                    # Verify all records are deleted before proceeding
                    remaining_records = con.execute("SELECT COUNT(*) FROM player_opening_stats WHERE opening_id = ?", [source_id]).fetchone()[0]
                    if remaining_records > 0:
                        raise Exception(f"Failed to delete all records for opening_id {source_id}. {remaining_records} records remain.")
                    
                    print(f"      ✓ Moved {records_moved} records, deleted {deleted_count} source records")
                
                # Now that all foreign key references are removed, delete the source opening records
                for source_id in source_ids:
                    con.execute("DELETE FROM opening WHERE id = ?", [source_id])
                
                consolidation_entry = {
                    'opening_name': opening_name,
                    'target_eco': target_eco,
                    'target_id': target_id,
                    'source_ecos': source_ecos,
                    'source_ids': source_ids,
                    'games_consolidated': games_consolidated,
                    'variants_merged': len(source_variants)
                }
                consolidation_log.append(consolidation_entry)
                
                print(f"  ✓ Consolidated {games_consolidated:,} games from {len(source_variants)} variants into {target_eco}")
                print()
            
            # Validate data integrity
            total_games_after = con.execute('SELECT SUM(num_wins + num_draws + num_losses) FROM player_opening_stats').fetchone()[0]
            
            print("=== CONSOLIDATION VALIDATION ===")
            print(f"Total games before: {total_games_before:,}")
            print(f"Total games after: {total_games_after:,}")
            print(f"Difference: {total_games_after - total_games_before:,}")
            
            if total_games_before == total_games_after:
                print("✅ Data integrity verified - no games lost!")
                
                # Commit the transaction
                con.commit()
                
                # Log consolidation summary
                print(f"\n=== CONSOLIDATION SUMMARY ===")
                total_variants_merged = sum(entry['variants_merged'] for entry in consolidation_log)
                total_games_consolidated = sum(entry['games_consolidated'] for entry in consolidation_log)
                
                print(f"Openings consolidated: {len(consolidation_log):,}")
                print(f"Total variants merged: {total_variants_merged:,}")
                print(f"Total games consolidated: {total_games_consolidated:,}")
                
                # Show top 5 biggest consolidations
                consolidation_df = pd.DataFrame(consolidation_log)
                top_consolidations = consolidation_df.nlargest(5, 'games_consolidated')
                
                print(f"\nTop 5 Biggest Consolidations:")
                for _, row in top_consolidations.iterrows():
                    print(f"  {row['opening_name']}: {row['games_consolidated']:,} games → {row['target_eco']}")
                
                print(f"\n✅ CONSOLIDATION COMPLETED SUCCESSFULLY!")
                print(f"✅ All {len(duplicate_openings_raw)} duplicated opening names have been consolidated")
                
            else:
                print("❌ Data integrity check failed!")
                print("Rolling back transaction...")
                con.rollback()
                print("❌ Consolidation rolled back due to data integrity issues")
                
        except Exception as e:
            print(f"❌ Error during consolidation: {e}")
            print("Rolling back transaction...")
            con.rollback()
            print("❌ Consolidation rolled back due to error")
            raise
            
else:
    if len(duplicate_openings_raw) == 0:
        print("No duplicated opening names found - consolidation not needed")
    else:
        print("Database not accessible - consolidation skipped")

# Log database statistics after consolidation
print(f"\n" + "="*60)
print("DATABASE STATISTICS AFTER CONSOLIDATION")
log_database_statistics()

=== STARTING OPENING CONSOLIDATION PROCESS ===
Consolidating duplicated opening names into their most-played ECO variants...

Processing 'Nimzo-Indian Defense: Sämisch Variation' (1/184)...
  → Target: E27 (ID: 2964, 29,709 games)
  → Sources: E24, E26, E28, E25, E29 (IDs: 3533, 2962, 2965, 2959, 2966)
    Merging E24 → E27...
      ✓ Moved 3762 records, deleted -6 source records
    Merging E26 → E27...
      ✓ Moved 3762 records, deleted -6 source records
    Merging E26 → E27...
      ✓ Moved 2720 records, deleted -6 source records
    Merging E28 → E27...
      ✓ Moved 2720 records, deleted -6 source records
    Merging E28 → E27...
      ✓ Moved 2682 records, deleted -6 source records
    Merging E25 → E27...
      ✓ Moved 2682 records, deleted -6 source records
    Merging E25 → E27...
      ✓ Moved 747 records, deleted -6 source records
    Merging E29 → E27...
      ✓ Moved 747 records, deleted -6 source records
    Merging E29 → E27...
      ✓ Moved 664 records, deleted -6 sou

ConstraintException: Constraint Error: Violates foreign key constraint because key "opening_id: 3533" is still referenced by a foreign key in a different table. If this is an unexpected constraint violation, please refer to our foreign key limitations in the documentation