# Filtering Unneeded Openings

## Purpose
- To understand which openings in the DB may not be helpful for processing.

## Possible Filters
- Very large or small win rates - likely just a bad opening for one side, no need to recommend it
- Extremely rare openings - not enough data to give recommendations


In [13]:
# Configuration and setup
import pandas as pd
from pathlib import Path
from utils.database.db_utils import get_db_connection

# Define the path to the DuckDB database file
project_root = Path.cwd().parent if "notebooks" in str(Path.cwd()) else Path.cwd()
db_path = project_root / "data" / "processed" / "chess_games.db"

# Set pandas display options for better readability
pd.set_option('display.float_format', '{:,.2f}'.format)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print(f"Database path: {db_path}")
print(f"Database exists: {db_path.exists()}")

# Configuration: adjust these numbers to control how many results to show
TOP_N_LEAST_PLAYED_BY_GAMES = 100      # Show N least played openings by total games
TOP_N_LEAST_PLAYED_BY_PLAYERS = 200    # Show N least played openings by unique players  
TOP_N_HIGHEST_SCORING = 15             # Show N highest scoring openings (best win rates)
TOP_N_LOWEST_SCORING = 15              # Show N lowest scoring openings (worst win rates)

print(f"\nAnalysis Configuration:")
print(f"- Least played by games: {TOP_N_LEAST_PLAYED_BY_GAMES}")
print(f"- Least played by players: {TOP_N_LEAST_PLAYED_BY_PLAYERS}")
print(f"- Highest scoring openings: {TOP_N_HIGHEST_SCORING}")
print(f"- Lowest scoring openings: {TOP_N_LOWEST_SCORING}")

Database path: /Users/a/Documents/personalprojects/chess-opening-recommender/data/processed/chess_games.db
Database exists: True

Analysis Configuration:
- Least played by games: 100
- Least played by players: 200
- Highest scoring openings: 15
- Lowest scoring openings: 15


## 1. Database Overview

First, let's get a high-level overview of our database to understand the scale of data we're working with.

In [14]:
# Database overview - basic counts
if db_path.exists():
    with get_db_connection(db_path) as con:
        print("=== DATABASE OVERVIEW ===")
        
        # Count total records in each table
        player_count = con.execute('SELECT COUNT(*) FROM player').fetchone()[0]
        opening_count = con.execute('SELECT COUNT(*) FROM opening').fetchone()[0]
        
        print(f"Total Players: {player_count:,}")
        print(f"Total Openings: {opening_count:,}")
        
        # Count total games across all partitions
        total_games = con.execute("""
            SELECT SUM(num_wins + num_draws + num_losses) as total_games
            FROM player_opening_stats
        """).fetchone()[0]
        
        total_stats_records = con.execute('SELECT COUNT(*) FROM player_opening_stats').fetchone()[0]
        
        print(f"Total Games: {total_games:,}")
        print(f"Total Player-Opening-Color Records: {total_stats_records:,}")
        print(f"Average Games per Record: {total_games/total_stats_records:.1f}")
        
        # Show partition distribution
        print("\n--- Partition Distribution ---")
        partitions = ['A', 'B', 'C', 'D', 'E', 'other']
        partition_data = []
        
        for partition in partitions:
            count = con.execute(f'SELECT COUNT(*) FROM player_opening_stats_{partition}').fetchone()[0]
            partition_data.append({'Partition': partition, 'Records': count})
            
        partition_df = pd.DataFrame(partition_data)
        partition_df['Percentage'] = (partition_df['Records'] / partition_df['Records'].sum() * 100).round(2)
        partition_df['Records'] = partition_df['Records'].apply('{:,}'.format)
        partition_df['Percentage'] = partition_df['Percentage'].apply('{:.2f}%'.format)
        
        print(partition_df.to_string(index=False))
else:
    print(f"Database file not found at {db_path}")

=== DATABASE OVERVIEW ===
Total Players: 50,000
Total Openings: 3,593
Total Games: 568,894,735
Total Player-Opening-Color Records: 26,843,374
Average Games per Record: 21.2

--- Partition Distribution ---
Partition   Records Percentage
        A 6,220,395     23.17%
        B 6,925,508     25.80%
        C 8,922,201     33.24%
        D 3,795,685     14.14%
        E   979,585      3.65%
    other         0      0.00%


## 2. Least Played Openings by Total Games

These are openings that have very few total games played across all players. They might be too rare to provide meaningful recommendations.

In [15]:
# Least played openings by total number of games
if db_path.exists():
    with get_db_connection(db_path) as con:
        print(f"=== TOP {TOP_N_LEAST_PLAYED_BY_GAMES} LEAST PLAYED OPENINGS BY TOTAL GAMES ===")
        
        least_played_by_games = con.execute(f"""
            SELECT 
                o.eco,
                o.name,
                SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                COUNT(DISTINCT pos.player_id) as unique_players,
                COUNT(DISTINCT CASE WHEN pos.color = 'w' THEN pos.player_id END) as white_players,
                COUNT(DISTINCT CASE WHEN pos.color = 'b' THEN pos.player_id END) as black_players,
                -- White's performance when playing this opening
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) as white_wins,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END) as white_draws,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_losses ELSE 0 END) as white_losses,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as white_games,
                -- Black's performance when playing this opening  
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) as black_wins,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END) as black_draws,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_losses ELSE 0 END) as black_losses,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as black_games,
                -- Win percentages by color
                ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_win_pct,
                ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_win_pct
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            GROUP BY o.id, o.eco, o.name
            ORDER BY total_games ASC
            LIMIT {TOP_N_LEAST_PLAYED_BY_GAMES}
        """).fetchdf()
        
        # Format the display
        display_df = least_played_by_games.copy()
        display_df['total_games'] = display_df['total_games'].apply('{:,}'.format)
        display_df['unique_players'] = display_df['unique_players'].apply('{:,}'.format)
        display_df['white_players'] = display_df['white_players'].apply('{:,}'.format)
        display_df['black_players'] = display_df['black_players'].apply('{:,}'.format)
        display_df['white_games'] = display_df['white_games'].apply('{:,}'.format)
        display_df['black_games'] = display_df['black_games'].apply('{:,}'.format)
        
        # Rename columns for better display
        display_df.columns = ['ECO', 'Opening Name', 'Total Games', 'Unique Players', 
                             'White Players', 'Black Players', 'White Wins', 'White Draws', 'White Losses', 'White Games',
                             'Black Wins', 'Black Draws', 'Black Losses', 'Black Games', 'White Win %', 'Black Win %']
        
        print(display_df.to_string(index=False))
        
        # Summary stats
        total_games_in_results = least_played_by_games['total_games'].sum()
        avg_games_per_opening = least_played_by_games['total_games'].mean()
        median_games_per_opening = least_played_by_games['total_games'].median()
        
        print(f"\n--- Summary Statistics ---")
        print(f"Total games in these {TOP_N_LEAST_PLAYED_BY_GAMES} openings: {total_games_in_results:,}")
        print(f"Average games per opening: {avg_games_per_opening:.1f}")
        print(f"Median games per opening: {median_games_per_opening:.1f}")
        print(f"Min games: {least_played_by_games['total_games'].min():,}")
        print(f"Max games: {least_played_by_games['total_games'].max():,}")

=== TOP 100 LEAST PLAYED OPENINGS BY TOTAL GAMES ===


FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

ECO                                                                  Opening Name Total Games Unique Players White Players Black Players  White Wins  White Draws  White Losses White Games  Black Wins  Black Draws  Black Losses Black Games  White Win %  Black Win %
C78                                               Ruy Lopez: Rabinovich Variation         1.0              1             0             1        0.00         0.00          0.00         0.0        1.00         0.00          0.00         1.0          NaN       100.00
C39  King's Gambit Accepted: Kieseritzky Gambit, Brentano Defense, Caro Variation         1.0              1             0             1        0.00         0.00          0.00         0.0        1.00         0.00          0.00         1.0          NaN       100.00
D99                       Grünfeld Defense: Russian Variation, Yugoslav Variation         1.0              1             1             0        0.00         0.00          1.00         1.0        0.00      

## 3. Least Played Openings by Number of Players

These openings have been played by very few unique players, which might indicate they're too specialized or obscure for general recommendations.

In [16]:
# Least played openings by number of unique players
if db_path.exists():
    with get_db_connection(db_path) as con:
        print(f"=== TOP {TOP_N_LEAST_PLAYED_BY_PLAYERS} LEAST PLAYED OPENINGS BY UNIQUE PLAYERS ===")
        
        least_played_by_players = con.execute(f"""
            SELECT 
                o.eco,
                o.name,
                COUNT(DISTINCT pos.player_id) as unique_players,
                SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                COUNT(DISTINCT CASE WHEN pos.color = 'w' THEN pos.player_id END) as white_players,
                COUNT(DISTINCT CASE WHEN pos.color = 'b' THEN pos.player_id END) as black_players,
                ROUND(AVG(pos.num_wins + pos.num_draws + pos.num_losses), 1) as avg_games_per_player,
                -- White's performance when playing this opening
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as white_games,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as black_games,
                -- Win percentages by color
                ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_win_pct,
                ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_win_pct
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            GROUP BY o.id, o.eco, o.name
            ORDER BY unique_players ASC, total_games ASC
            LIMIT {TOP_N_LEAST_PLAYED_BY_PLAYERS}
        """).fetchdf()
        
        # Format the display
        display_df = least_played_by_players.copy()
        display_df['unique_players'] = display_df['unique_players'].apply('{:,}'.format)
        display_df['total_games'] = display_df['total_games'].apply('{:,}'.format)
        display_df['white_players'] = display_df['white_players'].apply('{:,}'.format)
        display_df['black_players'] = display_df['black_players'].apply('{:,}'.format)
        display_df['white_games'] = display_df['white_games'].apply('{:,}'.format)
        display_df['black_games'] = display_df['black_games'].apply('{:,}'.format)
        
        # Rename columns for better display  
        display_df.columns = ['ECO', 'Opening Name', 'Unique Players', 'Total Games',
                             'White Players', 'Black Players', 'Avg Games/Player', 'White Games', 'Black Games',
                             'White Win %', 'Black Win %']
        
        print(display_df.to_string(index=False))
        
        # Summary stats
        total_players_in_results = least_played_by_players['unique_players'].sum()
        total_games_in_results = least_played_by_players['total_games'].sum()
        avg_players_per_opening = least_played_by_players['unique_players'].mean()
        median_players_per_opening = least_played_by_players['unique_players'].median()
        
        print(f"\n--- Summary Statistics ---")
        print(f"Total unique players across these {TOP_N_LEAST_PLAYED_BY_PLAYERS} openings: {total_players_in_results:,}")
        print(f"Total games in these openings: {total_games_in_results:,}")
        print(f"Average players per opening: {avg_players_per_opening:.1f}")
        print(f"Median players per opening: {median_players_per_opening:.1f}")
        print(f"Min players: {least_played_by_players['unique_players'].min():,}")
        print(f"Max players: {least_played_by_players['unique_players'].max():,}")

=== TOP 200 LEAST PLAYED OPENINGS BY UNIQUE PLAYERS ===


FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

ECO                                                                       Opening Name Unique Players Total Games White Players Black Players  Avg Games/Player White Games Black Games  White Win %  Black Win %
C78                                                    Ruy Lopez: Rabinovich Variation              1         1.0             0             1              1.00         0.0         1.0          NaN       100.00
C80                                  Ruy Lopez: Open, Bernstein Variation, Luther Line              1         1.0             0             1              1.00         0.0         1.0          NaN         0.00
E08                                            Catalan Opening: Closed, Spassky Gambit              1         1.0             0             1              1.00         0.0         1.0          NaN         0.00
C51                              Italian Game: Evans Gambit Declined, Hicken Variation              1         1.0             0             1              1.00 

## 4. Highest Scoring Openings by Color (Best Win Rates)

These openings have the highest win rates for White or Black, which might indicate they're unbalanced or too situational to be useful for general recommendations. We analyze each color separately since openings perform very differently depending on which side plays them.

In [17]:
# Highest scoring openings for White (best win rates for White)
if db_path.exists():
    with get_db_connection(db_path) as con:
        print(f"=== TOP {TOP_N_HIGHEST_SCORING} HIGHEST SCORING OPENINGS FOR WHITE (MIN 50 GAMES AS WHITE) ===")
        
        highest_scoring_white = con.execute(f"""
            SELECT 
                o.eco,
                o.name,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as white_games,
                COUNT(DISTINCT CASE WHEN pos.color = 'w' THEN pos.player_id END) as white_players,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) as white_wins,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END) as white_draws,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_losses ELSE 0 END) as white_losses,
                ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_win_pct,
                ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_draw_pct,
                -- Score calculation: (wins + 0.5*draws) / total_games * 100
                ROUND((SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) + 0.5 * SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END)) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_score_pct
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'w'
            GROUP BY o.id, o.eco, o.name
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50  -- Min games as White
            ORDER BY white_score_pct DESC, white_games DESC
            LIMIT {TOP_N_HIGHEST_SCORING}
        """).fetchdf()
        
        # Format the display
        display_df = highest_scoring_white.copy()
        display_df['white_games'] = display_df['white_games'].apply('{:,}'.format)
        display_df['white_players'] = display_df['white_players'].apply('{:,}'.format)
        display_df['white_wins'] = display_df['white_wins'].apply('{:,}'.format)
        display_df['white_draws'] = display_df['white_draws'].apply('{:,}'.format)
        display_df['white_losses'] = display_df['white_losses'].apply('{:,}'.format)
        
        # Rename columns for better display
        display_df.columns = ['ECO', 'Opening Name', 'White Games', 'White Players', 'White Wins', 'White Draws', 'White Losses',
                             'White Win %', 'White Draw %', 'White Score %']
        
        print(display_df.to_string(index=False))
        
        # Summary stats
        avg_score = highest_scoring_white['white_score_pct'].mean()
        median_score = highest_scoring_white['white_score_pct'].median()
        min_score = highest_scoring_white['white_score_pct'].min()
        max_score = highest_scoring_white['white_score_pct'].max()
        
        print(f"\n--- White Performance Summary ---")
        print(f"Average White score: {avg_score:.2f}%")
        print(f"Median White score: {median_score:.2f}%")
        print(f"White score range: {min_score:.2f}% - {max_score:.2f}%")
        print(f"Note: Score = (Wins + 0.5*Draws) / Total Games * 100")
        
        print(f"\n=== TOP {TOP_N_HIGHEST_SCORING} HIGHEST SCORING OPENINGS FOR BLACK (MIN 50 GAMES AS BLACK) ===")
        
        highest_scoring_black = con.execute(f"""
            SELECT 
                o.eco,
                o.name,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as black_games,
                COUNT(DISTINCT CASE WHEN pos.color = 'b' THEN pos.player_id END) as black_players,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) as black_wins,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END) as black_draws,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_losses ELSE 0 END) as black_losses,
                ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_win_pct,
                ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_draw_pct,
                -- Score calculation: (wins + 0.5*draws) / total_games * 100
                ROUND((SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) + 0.5 * SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END)) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_score_pct
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'b'
            GROUP BY o.id, o.eco, o.name
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50  -- Min games as Black
            ORDER BY black_score_pct DESC, black_games DESC
            LIMIT {TOP_N_HIGHEST_SCORING}
        """).fetchdf()
        
        # Format the display
        display_df = highest_scoring_black.copy()
        display_df['black_games'] = display_df['black_games'].apply('{:,}'.format)
        display_df['black_players'] = display_df['black_players'].apply('{:,}'.format)
        display_df['black_wins'] = display_df['black_wins'].apply('{:,}'.format)
        display_df['black_draws'] = display_df['black_draws'].apply('{:,}'.format)
        display_df['black_losses'] = display_df['black_losses'].apply('{:,}'.format)
        
        # Rename columns for better display
        display_df.columns = ['ECO', 'Opening Name', 'Black Games', 'Black Players', 'Black Wins', 'Black Draws', 'Black Losses',
                             'Black Win %', 'Black Draw %', 'Black Score %']
        
        print(display_df.to_string(index=False))
        
        # Summary stats
        avg_score = highest_scoring_black['black_score_pct'].mean()
        median_score = highest_scoring_black['black_score_pct'].median()
        min_score = highest_scoring_black['black_score_pct'].min()
        max_score = highest_scoring_black['black_score_pct'].max()
        
        print(f"\n--- Black Performance Summary ---")
        print(f"Average Black score: {avg_score:.2f}%")
        print(f"Median Black score: {median_score:.2f}%")
        print(f"Black score range: {min_score:.2f}% - {max_score:.2f}%")
        print(f"Note: Score = (Wins + 0.5*Draws) / Total Games * 100")

=== TOP 15 HIGHEST SCORING OPENINGS FOR WHITE (MIN 50 GAMES AS WHITE) ===
ECO                                                                    Opening Name White Games White Players White Wins White Draws White Losses  White Win %  White Draw %  White Score %
C44                                                     Scotch Game: Sea-Cadet Mate       182.0           130      182.0         0.0          0.0       100.00          0.00         100.00
C44                                                     Scotch Game: Sea-cadet Mate        58.0            45       58.0         0.0          0.0       100.00          0.00         100.00
B12                                 Caro-Kann Defense: Mieses Attack, Landau Gambit       295.0           139      252.0         5.0         38.0        85.42          1.69          86.27
A06                                Zukertort Opening: Tennison Gambit, Brigg's Trap     4,885.0         1,122    3,832.0        94.0        959.0        78.44          1.92  

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

ECO                                                        Opening Name Black Games Black Players Black Wins Black Draws Black Losses  Black Win %  Black Draw %  Black Score %
C50                                          Blackburne Shilling Gambit     4,703.0           905    4,703.0         0.0          0.0       100.00          0.00         100.00
A00                                         Barnes Opening: Fool's Mate       283.0           271      283.0         0.0          0.0       100.00          0.00         100.00
C44                                                        Irish Gambit     2,004.0         1,755    1,497.0        53.0        454.0        74.70          2.64          76.02
C71                                          Ruy Lopez: Noah's Ark Trap     5,830.0           829    4,229.0       290.0      1,311.0        72.54          4.97          75.03
C50                   Italian Game: Giuoco Pianissimo, Dubois Variation     8,662.0           557    6,204.0       130.0

## 5. Lowest Scoring Openings by Color (Worst Win Rates)

These openings have the lowest win rates for White or Black, which might indicate they're fundamentally weak or poorly suited for general recommendations.

In [18]:
# Lowest scoring openings for White (worst win rates for White)
if db_path.exists():
    with get_db_connection(db_path) as con:
        print(f"=== TOP {TOP_N_LOWEST_SCORING} LOWEST SCORING OPENINGS FOR WHITE (MIN 50 GAMES AS WHITE) ===")
        
        lowest_scoring_white = con.execute(f"""
            SELECT 
                o.eco,
                o.name,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as white_games,
                COUNT(DISTINCT CASE WHEN pos.color = 'w' THEN pos.player_id END) as white_players,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) as white_wins,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END) as white_draws,
                SUM(CASE WHEN pos.color = 'w' THEN pos.num_losses ELSE 0 END) as white_losses,
                ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_win_pct,
                ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_draw_pct,
                -- Score calculation: (wins + 0.5*draws) / total_games * 100
                ROUND((SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) + 0.5 * SUM(CASE WHEN pos.color = 'w' THEN pos.num_draws ELSE 0 END)) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as white_score_pct
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'w'
            GROUP BY o.id, o.eco, o.name
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50  -- Min games as White
            ORDER BY white_score_pct ASC, white_games DESC
            LIMIT {TOP_N_LOWEST_SCORING}
        """).fetchdf()
        
        # Format the display
        display_df = lowest_scoring_white.copy()
        display_df['white_games'] = display_df['white_games'].apply('{:,}'.format)
        display_df['white_players'] = display_df['white_players'].apply('{:,}'.format)
        display_df['white_wins'] = display_df['white_wins'].apply('{:,}'.format)
        display_df['white_draws'] = display_df['white_draws'].apply('{:,}'.format)
        display_df['white_losses'] = display_df['white_losses'].apply('{:,}'.format)
        
        # Rename columns for better display
        display_df.columns = ['ECO', 'Opening Name', 'White Games', 'White Players', 'White Wins', 'White Draws', 'White Losses',
                             'White Win %', 'White Draw %', 'White Score %']
        
        print(display_df.to_string(index=False))
        
        # Summary stats
        avg_score = lowest_scoring_white['white_score_pct'].mean()
        median_score = lowest_scoring_white['white_score_pct'].median()
        min_score = lowest_scoring_white['white_score_pct'].min()
        max_score = lowest_scoring_white['white_score_pct'].max()
        
        print(f"\n--- White Performance Summary ---")
        print(f"Average White score: {avg_score:.2f}%")
        print(f"Median White score: {median_score:.2f}%")
        print(f"White score range: {min_score:.2f}% - {max_score:.2f}%")
        print(f"Note: Score = (Wins + 0.5*Draws) / Total Games * 100")
        
        print(f"\n=== TOP {TOP_N_LOWEST_SCORING} LOWEST SCORING OPENINGS FOR BLACK (MIN 50 GAMES AS BLACK) ===")
        
        lowest_scoring_black = con.execute(f"""
            SELECT 
                o.eco,
                o.name,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as black_games,
                COUNT(DISTINCT CASE WHEN pos.color = 'b' THEN pos.player_id END) as black_players,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) as black_wins,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END) as black_draws,
                SUM(CASE WHEN pos.color = 'b' THEN pos.num_losses ELSE 0 END) as black_losses,
                ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_win_pct,
                ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_draw_pct,
                -- Score calculation: (wins + 0.5*draws) / total_games * 100
                ROUND((SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) + 0.5 * SUM(CASE WHEN pos.color = 'b' THEN pos.num_draws ELSE 0 END)) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 2) as black_score_pct
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'b'
            GROUP BY o.id, o.eco, o.name
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50  -- Min games as Black
            ORDER BY black_score_pct ASC, black_games DESC
            LIMIT {TOP_N_LOWEST_SCORING}
        """).fetchdf()
        
        # Format the display
        display_df = lowest_scoring_black.copy()
        display_df['black_games'] = display_df['black_games'].apply('{:,}'.format)
        display_df['black_players'] = display_df['black_players'].apply('{:,}'.format)
        display_df['black_wins'] = display_df['black_wins'].apply('{:,}'.format)
        display_df['black_draws'] = display_df['black_draws'].apply('{:,}'.format)
        display_df['black_losses'] = display_df['black_losses'].apply('{:,}'.format)
        
        # Rename columns for better display
        display_df.columns = ['ECO', 'Opening Name', 'Black Games', 'Black Players', 'Black Wins', 'Black Draws', 'Black Losses',
                             'Black Win %', 'Black Draw %', 'Black Score %']
        
        print(display_df.to_string(index=False))
        
        # Summary stats
        avg_score = lowest_scoring_black['black_score_pct'].mean()
        median_score = lowest_scoring_black['black_score_pct'].median()
        min_score = lowest_scoring_black['black_score_pct'].min()
        max_score = lowest_scoring_black['black_score_pct'].max()
        
        print(f"\n--- Black Performance Summary ---")
        print(f"Average Black score: {avg_score:.2f}%")
        print(f"Median Black score: {median_score:.2f}%")
        print(f"Black score range: {min_score:.2f}% - {max_score:.2f}%")
        print(f"Note: Score = (Wins + 0.5*Draws) / Total Games * 100")

=== TOP 15 LOWEST SCORING OPENINGS FOR WHITE (MIN 50 GAMES AS WHITE) ===
ECO                                                            Opening Name White Games White Players White Wins White Draws White Losses  White Win %  White Draw %  White Score %
C50                                              Blackburne Shilling Gambit     3,561.0         2,327        0.0         0.0      3,561.0         0.00          0.00           0.00
A00                                             Barnes Opening: Fool's Mate       196.0           162        0.0         0.0        196.0         0.00          0.00           0.00
C44                                                            Irish Gambit     2,080.0         1,246      496.0        37.0      1,547.0        23.85          1.78          24.74
C50                       Italian Game: Giuoco Pianissimo, Dubois Variation     4,110.0         2,419      985.0        67.0      3,058.0        23.97          1.63          24.78
C71                        

## 6. Distribution Analysis

Let's look at the overall distribution of games and players to understand what thresholds might make sense for filtering.

In [19]:
# Distribution analysis - understand the spread of data to inform filtering decisions
if db_path.exists():
    with get_db_connection(db_path) as con:
        print("=== OPENING DISTRIBUTION ANALYSIS ===")
        
        # Get distribution statistics for all openings
        distribution_stats = con.execute("""
            SELECT 
                COUNT(*) as total_openings,
                MIN(total_games) as min_games,
                MAX(total_games) as max_games,
                ROUND(AVG(total_games), 1) as avg_games,
                ROUND(MEDIAN(total_games), 1) as median_games,
                ROUND(STDDEV(total_games), 1) as stddev_games,
                MIN(unique_players) as min_players,
                MAX(unique_players) as max_players,
                ROUND(AVG(unique_players), 1) as avg_players,
                ROUND(MEDIAN(unique_players), 1) as median_players,
                ROUND(STDDEV(unique_players), 1) as stddev_players
            FROM (
                SELECT 
                    o.id,
                    SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                    COUNT(DISTINCT pos.player_id) as unique_players
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id
            ) stats
        """).fetchone()
        
        print("--- All Openings Statistics ---")
        print(f"Total Openings: {distribution_stats[0]:,}")
        print(f"Games per Opening - Min: {distribution_stats[1]:,}, Max: {distribution_stats[2]:,}, Avg: {distribution_stats[3]:,}, Median: {distribution_stats[4]:,}")
        print(f"Players per Opening - Min: {distribution_stats[5]:,}, Max: {distribution_stats[6]:,}, Avg: {distribution_stats[7]:,}, Median: {distribution_stats[8]:,}")
        
        # Percentile analysis for games
        print("\n--- Games per Opening Percentiles ---")
        games_percentiles = con.execute("""
            SELECT 
                ROUND(PERCENTILE_CONT(0.1) WITHIN GROUP (ORDER BY total_games), 1) as p10,
                ROUND(PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_games), 1) as p25,
                ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_games), 1) as p50,
                ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_games), 1) as p75,
                ROUND(PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY total_games), 1) as p90,
                ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_games), 1) as p95,
                ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY total_games), 1) as p99
            FROM (
                SELECT SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id
            ) stats
        """).fetchone()
        
        percentile_labels = ['10th', '25th', '50th (Median)', '75th', '90th', '95th', '99th']
        for i, label in enumerate(percentile_labels):
            print(f"{label}: {games_percentiles[i]:,} games")
        
        # Percentile analysis for players
        print("\n--- Players per Opening Percentiles ---")
        players_percentiles = con.execute("""
            SELECT 
                ROUND(PERCENTILE_CONT(0.1) WITHIN GROUP (ORDER BY unique_players), 1) as p10,
                ROUND(PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY unique_players), 1) as p25,
                ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unique_players), 1) as p50,
                ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY unique_players), 1) as p75,
                ROUND(PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY unique_players), 1) as p90,
                ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY unique_players), 1) as p95,
                ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY unique_players), 1) as p99
            FROM (
                SELECT COUNT(DISTINCT pos.player_id) as unique_players
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id
            ) stats
        """).fetchone()
        
        for i, label in enumerate(percentile_labels):
            print(f"{label}: {players_percentiles[i]:,} players")
        
        # Count openings by game thresholds
        print("\n--- Openings by Game Count Thresholds ---")
        game_thresholds = [1, 10, 50, 100, 500, 1000, 5000, 10000]
        for threshold in game_thresholds:
            count = con.execute(f"""
                SELECT COUNT(*)
                FROM (
                    SELECT SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    GROUP BY o.id
                    HAVING total_games < {threshold}
                ) stats
            """).fetchone()[0]
            percentage = (count / distribution_stats[0]) * 100
            print(f"Openings with < {threshold:,} games: {count:,} ({percentage:.1f}%)")
        
        # Count openings by player thresholds
        print("\n--- Openings by Player Count Thresholds ---")
        player_thresholds = [1, 5, 10, 25, 50, 100, 500, 1000]
        for threshold in player_thresholds:
            count = con.execute(f"""
                SELECT COUNT(*)
                FROM (
                    SELECT COUNT(DISTINCT pos.player_id) as unique_players
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    GROUP BY o.id
                    HAVING unique_players < {threshold}
                ) stats
            """).fetchone()[0]
            percentage = (count / distribution_stats[0]) * 100
            print(f"Openings with < {threshold:,} players: {count:,} ({percentage:.1f}%)")

=== OPENING DISTRIBUTION ANALYSIS ===
--- All Openings Statistics ---
Total Openings: 3,361
Games per Opening - Min: 1, Max: 14,485,459, Avg: 169,263.5, Median: 9,985.0
Players per Opening - Min: 712,224.5, Max: 1, Avg: 49,996, Median: 6,841.6

--- Games per Opening Percentiles ---
10th: 147.0 games
25th: 1,197.0 games
50th (Median): 9,985.0 games
75th: 65,914.0 games
90th: 312,096.0 games
95th: 726,818.0 games
99th: 3,075,697.2 games

--- Players per Opening Percentiles ---
10th: 87.0 players
25th: 487.0 players
50th (Median): 2,629.0 players
75th: 9,413.0 players
90th: 20,843.0 players
95th: 28,348.0 players
99th: 40,225.4 players

--- Openings by Game Count Thresholds ---
Openings with < 1 games: 0 (0.0%)
Openings with < 10 games: 70 (2.1%)
Openings with < 50 games: 198 (5.9%)
Openings with < 100 games: 277 (8.2%)
Openings with < 500 games: 593 (17.6%)
Openings with < 1,000 games: 776 (23.1%)
Openings with < 5,000 games: 1,363 (40.6%)
Openings with < 10,000 games: 1,681 (50.0%)

---

RuntimeError: Query interrupted

## 7. Recommendations for Filtering

Based on the analysis above, here are some recommendations for filtering unneeded openings from the database.

In [None]:
# Generate filtering recommendations based on the analysis
if db_path.exists():
    with get_db_connection(db_path) as con:
        print("=== FILTERING RECOMMENDATIONS ===")
        
        # Calculate some threshold recommendations
        total_openings = con.execute('SELECT COUNT(*) FROM opening').fetchone()[0]
        
        # Recommendation 1: Filter by minimum games
        min_games_threshold = 50  # Adjust based on your needs
        openings_below_games_threshold = con.execute(f"""
            SELECT COUNT(*)
            FROM (
                SELECT SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id
                HAVING total_games < {min_games_threshold}
            ) stats
        """).fetchone()[0]
        
        games_filter_percentage = (openings_below_games_threshold / total_openings) * 100
        
        print(f"1. MINIMUM GAMES FILTER (< {min_games_threshold} games):")
        print(f"   - Would remove {openings_below_games_threshold:,} openings ({games_filter_percentage:.1f}% of total)")
        print(f"   - Rationale: Too few games to provide reliable recommendations")
        
        # Recommendation 2: Filter by minimum players
        min_players_threshold = 10  # Adjust based on your needs
        openings_below_players_threshold = con.execute(f"""
            SELECT COUNT(*)
            FROM (
                SELECT COUNT(DISTINCT pos.player_id) as unique_players
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id
                HAVING unique_players < {min_players_threshold}
            ) stats
        """).fetchone()[0]
        
        players_filter_percentage = (openings_below_players_threshold / total_openings) * 100
        
        print(f"\n2. MINIMUM PLAYERS FILTER (< {min_players_threshold} unique players):")
        print(f"   - Would remove {openings_below_players_threshold:,} openings ({players_filter_percentage:.1f}% of total)")
        print(f"   - Rationale: Too few players to generalize recommendations")
        
        # Recommendation 3: Filter by extreme win rates (by color)
        extreme_win_rate_threshold = 75.0  # Adjust based on your needs
        
        # Count openings with extreme win rates for White
        extreme_white_high = con.execute(f"""
            SELECT COUNT(DISTINCT o.id)
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'w'
            GROUP BY o.id
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50
            AND SUM(pos.num_wins) * 100.0 / NULLIF(SUM(pos.num_wins + pos.num_draws + pos.num_losses), 0) > {extreme_win_rate_threshold}
        """).fetchone()[0]
        
        extreme_white_low = con.execute(f"""
            SELECT COUNT(DISTINCT o.id)
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'w'
            GROUP BY o.id
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50
            AND SUM(pos.num_wins) * 100.0 / NULLIF(SUM(pos.num_wins + pos.num_draws + pos.num_losses), 0) < {100 - extreme_win_rate_threshold}
        """).fetchone()[0]
        
        # Count openings with extreme win rates for Black
        extreme_black_high = con.execute(f"""
            SELECT COUNT(DISTINCT o.id)
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'b'
            GROUP BY o.id
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50
            AND SUM(pos.num_wins) * 100.0 / NULLIF(SUM(pos.num_wins + pos.num_draws + pos.num_losses), 0) > {extreme_win_rate_threshold}
        """).fetchone()[0]
        
        extreme_black_low = con.execute(f"""
            SELECT COUNT(DISTINCT o.id)
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            WHERE pos.color = 'b'
            GROUP BY o.id
            HAVING SUM(pos.num_wins + pos.num_draws + pos.num_losses) >= 50
            AND SUM(pos.num_wins) * 100.0 / NULLIF(SUM(pos.num_wins + pos.num_draws + pos.num_losses), 0) < {100 - extreme_win_rate_threshold}
        """).fetchone()[0]
        
        # Count unique openings with extreme performance in either color
        extreme_either_color = con.execute(f"""
            SELECT COUNT(DISTINCT o.id)
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            GROUP BY o.id
            HAVING 
                (SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) >= 50
                 AND (SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) > {extreme_win_rate_threshold}
                      OR SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                         NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) < {100 - extreme_win_rate_threshold}))
                OR
                (SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) >= 50
                 AND (SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                      NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) > {extreme_win_rate_threshold}
                      OR SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                         NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) < {100 - extreme_win_rate_threshold}))
        """).fetchone()[0]
        
        extreme_filter_percentage = (extreme_either_color / total_openings) * 100
        
        print(f"\n3. EXTREME WIN RATE FILTER BY COLOR (>{extreme_win_rate_threshold}% or <{100-extreme_win_rate_threshold}% win rate, min 50 games per color):")
        print(f"   - Would remove {extreme_either_color:,} openings ({extreme_filter_percentage:.1f}% of total)")
        print(f"     * White extreme high (>{extreme_win_rate_threshold}%): {extreme_white_high:,}")
        print(f"     * White extreme low (<{100-extreme_win_rate_threshold}%): {extreme_white_low:,}")
        print(f"     * Black extreme high (>{extreme_win_rate_threshold}%): {extreme_black_high:,}")
        print(f"     * Black extreme low (<{100-extreme_win_rate_threshold}%): {extreme_black_low:,}")
        print(f"   - Rationale: Likely unbalanced or situational openings for specific colors")
        
        # Combined filter impact
        combined_filter = con.execute(f"""
            SELECT COUNT(DISTINCT o.id)
            FROM opening o
            JOIN player_opening_stats pos ON o.id = pos.opening_id
            GROUP BY o.id
            HAVING 
                SUM(pos.num_wins + pos.num_draws + pos.num_losses) < {min_games_threshold}
                OR COUNT(DISTINCT pos.player_id) < {min_players_threshold}
                OR (SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) >= 50
                    AND (SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                         NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) > {extreme_win_rate_threshold}
                         OR SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                            NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) < {100 - extreme_win_rate_threshold}))
                OR (SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) >= 50
                    AND (SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                         NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) > {extreme_win_rate_threshold}
                         OR SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                            NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0) < {100 - extreme_win_rate_threshold}))
        """).fetchone()[0]
        
        combined_percentage = (combined_filter / total_openings) * 100
        
        print(f"\n4. COMBINED FILTER IMPACT:")
        print(f"   - Would remove {combined_filter:,} openings ({combined_percentage:.1f}% of total)")
        print(f"   - Remaining openings: {total_openings - combined_filter:,} ({100 - combined_percentage:.1f}% of total)")
        
        print(f"\n=== SUGGESTED THRESHOLDS FOR YOUR USE CASE ===")
        print(f"Minimum Games: {min_games_threshold} (adjust based on desired data quality)")
        print(f"Minimum Players: {min_players_threshold} (adjust based on generalization needs)")
        print(f"Win Rate Bounds by Color: {100-extreme_win_rate_threshold}% - {extreme_win_rate_threshold}% (adjust based on balance requirements)")
        print(f"\nNote: You can modify the threshold variables at the top of this cell to experiment with different values.")

=== FILTERING RECOMMENDATIONS ===
1. MINIMUM GAMES FILTER (< 50 games):
   - Would remove 203 openings (5.6% of total)
   - Rationale: Too few games to provide reliable recommendations

2. MINIMUM PLAYERS FILTER (< 10 unique players):
   - Would remove 82 openings (2.3% of total)
   - Rationale: Too few players to generalize recommendations

3. EXTREME WIN RATE FILTER BY COLOR (>75.0% or <25.0% win rate, min 50 games per color):
   - Would remove 1 openings (0.0% of total)
     * White extreme high (>75.0%): 1
     * White extreme low (<25.0%): 1
     * Black extreme high (>75.0%): 1
     * Black extreme low (<25.0%): 1
   - Rationale: Likely unbalanced or situational openings for specific colors

4. COMBINED FILTER IMPACT:
   - Would remove 1 openings (0.0% of total)
   - Remaining openings: 3,592 (100.0% of total)

=== SUGGESTED THRESHOLDS FOR YOUR USE CASE ===
Minimum Games: 50 (adjust based on desired data quality)
Minimum Players: 10 (adjust based on generalization needs)
Win Rate

## 8. Openings with Same Name but Different ECO Codes

These are openings that share the same name but have different ECO codes. This could indicate opening variations, transpositions, or potential data quality issues worth investigating.

In [20]:
# Find openings with same name but different ECO codes
if db_path.exists():
    with get_db_connection(db_path) as con:
        print("=== OPENINGS WITH SAME NAME BUT DIFFERENT ECO CODES ===")
        
        # Find opening names that appear with multiple ECO codes
        duplicate_names = con.execute("""
            SELECT 
                name,
                COUNT(DISTINCT eco) as eco_count,
                STRING_AGG(DISTINCT eco, ', ' ORDER BY eco) as eco_codes,
                COUNT(*) as total_records,
                SUM(total_games) as combined_games,
                SUM(unique_players) as combined_players
            FROM (
                SELECT 
                    o.name,
                    o.eco,
                    SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                    COUNT(DISTINCT pos.player_id) as unique_players
                FROM opening o
                JOIN player_opening_stats pos ON o.id = pos.opening_id
                GROUP BY o.id, o.name, o.eco
            ) stats
            GROUP BY name
            HAVING COUNT(DISTINCT eco) > 1
            ORDER BY eco_count DESC, combined_games DESC
        """).fetchdf()
        
        if len(duplicate_names) > 0:
            print(f"Found {len(duplicate_names)} opening names that appear with multiple ECO codes:\n")
            
            # Format the display
            display_df = duplicate_names.copy()
            display_df['total_records'] = display_df['total_records'].apply('{:,}'.format)
            display_df['combined_games'] = display_df['combined_games'].apply('{:,}'.format)
            display_df['combined_players'] = display_df['combined_players'].apply('{:,}'.format)
            
            # Rename columns for better display
            display_df.columns = ['Opening Name', 'ECO Count', 'ECO Codes', 'Records', 'Total Games', 'Total Players']
            
            print(display_df.to_string(index=False))
            
            # Get detailed breakdown for the most complex cases
            print(f"\n=== DETAILED BREAKDOWN FOR TOP 10 MOST COMPLEX CASES ===")
            
            top_complex_names = duplicate_names.head(10)['name'].tolist()
            
            for name in top_complex_names:
                print(f"\n--- '{name}' ---")
                
                detailed_breakdown = con.execute(f"""
                    SELECT 
                        o.eco,
                        o.name,
                        SUM(pos.num_wins + pos.num_draws + pos.num_losses) as total_games,
                        COUNT(DISTINCT pos.player_id) as unique_players,
                        -- White performance
                        SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as white_games,
                        ROUND(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                              NULLIF(SUM(CASE WHEN pos.color = 'w' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 1) as white_win_pct,
                        -- Black performance  
                        SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END) as black_games,
                        ROUND(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins ELSE 0 END) * 100.0 / 
                              NULLIF(SUM(CASE WHEN pos.color = 'b' THEN pos.num_wins + pos.num_draws + pos.num_losses ELSE 0 END), 0), 1) as black_win_pct
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    WHERE o.name = ?
                    GROUP BY o.id, o.eco, o.name
                    ORDER BY total_games DESC
                """, [name]).fetchdf()
                
                # Format for display
                breakdown_display = detailed_breakdown.copy()
                breakdown_display['total_games'] = breakdown_display['total_games'].apply('{:,}'.format)
                breakdown_display['unique_players'] = breakdown_display['unique_players'].apply('{:,}'.format)
                breakdown_display['white_games'] = breakdown_display['white_games'].apply('{:,}'.format)
                breakdown_display['black_games'] = breakdown_display['black_games'].apply('{:,}'.format)
                
                # Drop the name column since it's redundant
                breakdown_display = breakdown_display.drop('name', axis=1)
                breakdown_display.columns = ['ECO', 'Total Games', 'Players', 'White Games', 'White Win%', 'Black Games', 'Black Win%']
                
                print(breakdown_display.to_string(index=False))
            
            # Calculate consolidation savings
            print(f"\n=== CONSOLIDATION SAVINGS ANALYSIS ===")
            
            # Get current table sizes for percentage calculations
            total_openings = con.execute('SELECT COUNT(*) FROM opening').fetchone()[0]
            total_stats_records = con.execute('SELECT COUNT(*) FROM player_opening_stats').fetchone()[0]
            
            # Calculate detailed savings for each duplicated name
            consolidation_savings = []
            
            for _, row in duplicate_names.iterrows():
                name = row['name']
                eco_count = row['eco_count']
                
                # Get stats records that would be affected by consolidating this name
                stats_records = con.execute("""
                    SELECT COUNT(*) as records_count
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    WHERE o.name = ?
                """, [name]).fetchone()[0]
                
                # Calculate potential reduction in stats records
                # If we consolidate N ECO variants into 1, we could potentially reduce records
                # by combining stats for the same player-color combinations
                unique_player_color_combos = con.execute("""
                    SELECT COUNT(DISTINCT pos.player_id || '_' || pos.color) as unique_combos
                    FROM opening o
                    JOIN player_opening_stats pos ON o.id = pos.opening_id
                    WHERE o.name = ?
                """, [name]).fetchone()[0]
                
                # Theoretical maximum reduction if all variants were perfectly consolidated
                max_records_after_consolidation = unique_player_color_combos
                potential_record_reduction = stats_records - max_records_after_consolidation
                
                consolidation_savings.append({
                    'name': name,
                    'eco_count': eco_count,
                    'current_records': stats_records,
                    'potential_min_records': max_records_after_consolidation,
                    'potential_reduction': potential_record_reduction,
                    'reduction_pct': (potential_record_reduction / stats_records * 100) if stats_records > 0 else 0
                })
            
            # Convert to DataFrame for analysis
            savings_df = pd.DataFrame(consolidation_savings)
            
            # Calculate total potential savings
            total_current_records = savings_df['current_records'].sum()
            total_potential_reduction = savings_df['potential_reduction'].sum()
            total_reduction_pct = (total_potential_reduction / total_current_records * 100) if total_current_records > 0 else 0
            
            # Show top savings opportunities
            top_savings = savings_df.nlargest(10, 'potential_reduction')
            
            print("--- Top 10 Consolidation Opportunities (by potential record reduction) ---")
            display_savings = top_savings.copy()
            display_savings['current_records'] = display_savings['current_records'].apply('{:,}'.format)
            display_savings['potential_min_records'] = display_savings['potential_min_records'].apply('{:,}'.format)
            display_savings['potential_reduction'] = display_savings['potential_reduction'].apply('{:,}'.format)
            display_savings['reduction_pct'] = display_savings['reduction_pct'].apply('{:.1f}%'.format)
            
            display_savings.columns = ['Opening Name', 'ECO Count', 'Current Records', 'Min Records After', 'Potential Reduction', 'Reduction %']
            print(display_savings.to_string(index=False))
            
            # Summary statistics
            total_duplicated_names = len(duplicate_names)
            max_eco_count = duplicate_names['eco_count'].max()
            avg_eco_count = duplicate_names['eco_count'].mean()
            total_affected_games = duplicate_names['combined_games'].sum()
            total_affected_players = duplicate_names['combined_players'].sum()
            
            # Database size impact
            opening_reduction = duplicate_names['total_records'].sum() - total_duplicated_names
            opening_reduction_pct = (opening_reduction / total_openings * 100) if total_openings > 0 else 0
            stats_reduction_pct = (total_potential_reduction / total_stats_records * 100) if total_stats_records > 0 else 0
            
            print(f"\n=== OVERALL CONSOLIDATION IMPACT ===")
            print(f"Opening names with multiple ECO codes: {total_duplicated_names:,}")
            print(f"Maximum ECO codes for single name: {max_eco_count}")
            print(f"Average ECO codes per duplicated name: {avg_eco_count:.1f}")
            print(f"Total games affected by name duplication: {total_affected_games:,}")
            print(f"Total players affected by name duplication: {total_affected_players:,}")
            
            print(f"\n--- Potential Database Size Reductions ---")
            print(f"Opening table:")
            print(f"  Current entries: {total_openings:,}")
            print(f"  Potential reduction: {opening_reduction:,} entries ({opening_reduction_pct:.2f}%)")
            print(f"  After consolidation: {total_openings - opening_reduction:,} entries")
            
            print(f"\nPlayer-Opening-Stats table:")
            print(f"  Current entries: {total_stats_records:,}")
            print(f"  Records for duplicated names: {total_current_records:,}")
            print(f"  Potential reduction: {total_potential_reduction:,} entries ({stats_reduction_pct:.2f}% of total)")
            print(f"  After consolidation: {total_stats_records - total_potential_reduction:,} entries")
            print(f"  Average reduction per duplicated name: {total_reduction_pct:.1f}%")
            
            # Potential issues to investigate
            print(f"\n=== CONSOLIDATION CONSIDERATIONS ===")
            print("1. Benefits of consolidation:")
            print(f"   - Reduce database size by {stats_reduction_pct:.2f}% in player-opening-stats")
            print(f"   - Reduce opening table by {opening_reduction_pct:.2f}%")
            print("   - Simplify opening recommendations")
            print("   - Combine statistics for better sample sizes")
            print("2. Potential challenges:")
            print("   - Loss of ECO classification granularity")
            print("   - May combine genuinely different opening variations")
            print("   - Need to decide which ECO code to keep or create mapping")
            print("3. Names with many ECO codes might indicate:")
            print("   - Opening transpositions (same position via different moves)")
            print("   - Opening variations classified separately")
            print("   - Potential data quality issues or inconsistent naming")
            print("4. Recommended approach:")
            print("   - Review high-impact consolidations manually")
            print("   - Consider keeping ECO variations as metadata")
            print("   - Create hierarchical opening relationships")
            
        else:
            print("No opening names found with multiple ECO codes.")
            print("All opening names have unique ECO code assignments.")

=== OPENINGS WITH SAME NAME BUT DIFFERENT ECO CODES ===


FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

No opening names found with multiple ECO codes.
All opening names have unique ECO code assignments.
