# Raw Player Statistics Viewer

This notebook loads and displays the raw player statistics data from the pickle file without any formatting or filtering beyond what's necessary to make it readable.

The goal is to examine the exact structure of the data as it's processed and saved.

In [17]:
# Install required packages
import sys
import subprocess
import importlib

def install_and_import(package):
    try:
        # Try importing the package
        importlib.import_module(package)
        print(f"{package} is already installed.")
    except ImportError:
        # Package not found, install it
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"{package} installed successfully!")

# Install required packages
required_packages = ['pandas', 'numpy']
for package in required_packages:
    install_and_import(package)

pandas is already installed.
numpy is already installed.


In [18]:
# Import required libraries
import pickle
from pathlib import Path
import pandas as pd
import json
from typing import Dict, List, TypedDict, Optional, Union, Literal
from collections import Counter
import numpy as np
from pprint import pprint

## Load Raw Player Statistics

Let's load the player statistics from the pickle file without any modifications.

In [19]:
# Path to the player statistics pickle file
data_path = Path("../data/processed/player_stats_parquet.pkl")

# Load the player statistics
with open(data_path, 'rb') as f:
    players_data = pickle.load(f)

print(f"Loaded player data with {len(players_data):,} players")

Loaded player data with 424,507 players
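
The loading step above assumes the file exists and unpickles to a dictionary. A minimal sketch of a more defensive loader follows; the name `load_player_stats` and the tiny demo dataset are hypothetical and only illustrate the idea. Because `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def load_player_stats(path: Path) -> dict:
    """Load a pickled player-stats mapping and check its top-level type."""
    if not path.is_file():
        raise FileNotFoundError(f"No pickle file at {path}")
    with open(path, "rb") as f:
        data = pickle.load(f)  # only unpickle files you trust
    if not isinstance(data, dict):
        raise TypeError(f"Expected a dict of players, got {type(data).__name__}")
    return data


# Demo with a made-up one-player dataset written to a temporary file.
demo = {
    "player_a": {
        "rating": 1500,
        "num_games_total": 3,
        "white_games": {},
        "black_games": {},
    }
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_stats.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump(demo, f)
    loaded = load_player_stats(demo_path)

print(f"Loaded player data with {len(loaded):,} players")
```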


## Filter for Active Players

We treat players with at least 200 games as active and keep only those.

In [20]:
# Filter for active players (at least 200 games)
active_players = {
    player: stats for player, stats in players_data.items()
    if stats["num_games_total"] >= 200
}

print(f"Found {len(active_players):,} active players with at least 200 games")

# Sort active players by the number of games (most active first)
sorted_active_players = sorted(
    active_players.items(),
    key=lambda item: item[1]["num_games_total"],
    reverse=True
)

# Take up to the first 100 active players for display
top_100_active_players = sorted_active_players[:100]
print(f"Selected the {len(top_100_active_players)} most active players for display (at most 100)")

Found 24 active players with at least 200 games
Selected the 24 most active players for display (at most 100)
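
The filter, sort and slice steps can be combined into one helper. The sketch below is illustrative only: `top_active_players` and the demo data are hypothetical names, not part of the notebook. Slicing with `[:limit]` returns fewer than `limit` entries when fewer players pass the filter, so the result's length should be checked rather than assumed.

```python
def top_active_players(players, min_games=200, limit=100):
    """Return (name, stats) pairs with at least min_games, most active first."""
    active = [
        (name, stats)
        for name, stats in players.items()
        if stats["num_games_total"] >= min_games
    ]
    active.sort(key=lambda item: item[1]["num_games_total"], reverse=True)
    return active[:limit]  # may contain fewer than `limit` entries


# Made-up data: only two of the three players reach the threshold.
demo_players = {
    "a": {"num_games_total": 324},
    "b": {"num_games_total": 150},
    "c": {"num_games_total": 202},
}
selected = top_active_players(demo_players, min_games=200, limit=100)
print(f"Selected {len(selected)} players (limit was 100)")
```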


## Examine the Top Level Structure

Let's first examine the structure of the entire dataset.

In [21]:
# Examine the dictionary structure
print("Data is stored as a dictionary:")
print(f"Number of players: {len(players_data)}")
print(f"Dictionary keys are player names/ids")
print(f"Dictionary values are player statistics")

# Show an example key (player name)
example_player = next(iter(players_data.keys()))
print(f"\nExample player name: '{example_player}'")

Data is stored as a dictionary:
Number of players: 424507
Dictionary keys are player names/ids
Dictionary values are player statistics

Example player name: 'Panchito0O'


## Examine Individual Player Data Structure

Now let's look at the exact structure of a player's data.

In [22]:
# Take the player with the most games as an example
player_name, player_data = top_100_active_players[0]

print(f"Raw data structure for player: {player_name}")
print("\nTop-level keys in player data:")
for key in player_data.keys():
    print(f"- {key}: {type(player_data[key])}")

print("\nRating:", player_data["rating"])
print("Total games:", player_data["num_games_total"])
print("Number of white openings:", len(player_data["white_games"]))
print("Number of black openings:", len(player_data["black_games"]))

Raw data structure for player: Alphatik

Top-level keys in player data:
- rating: <class 'int'>
- white_games: <class 'dict'>
- black_games: <class 'dict'>
- num_games_total: <class 'int'>

Rating: 2428
Total games: 324
Number of white openings: 45
Number of black openings: 28


## Display Raw White Games Data

Let's look at the exact structure of the white games data.

In [23]:
# Display white games structure
print("Structure of white_games dictionary:")
print(f"white_games is a dictionary with {len(player_data['white_games'])} entries")
print("Keys are ECO codes, values are opening results")

# Show the first white opening as an example
if player_data["white_games"]:
    eco_code = next(iter(player_data["white_games"].keys()))
    opening_data = player_data["white_games"][eco_code]
    
    print("\nExample white opening:")
    print(f"ECO Code: {eco_code}")
    print("Raw data structure:")
    pprint(opening_data)
    
    print("\nOpening name:", opening_data["opening_name"])
    print("Results dictionary keys:", opening_data["results"].keys())
    
    print("\nDetailed results:")
    for key, value in opening_data["results"].items():
        print(f"- {key}: {value} ({type(value).__name__})")

Structure of white_games dictionary:
white_games is a dictionary with 45 entries
Keys are ECO codes, values are opening results

Example white opening:
ECO Code: A40
Raw data structure:
{'opening_name': 'Englund Gambit',
 'results': {'num_draws': 2,
             'num_games': 36,
             'num_losses': 17,
             'num_wins': 17,
             'score_percentage_with_opening': 50.0}}

Opening name: Englund Gambit
Results dictionary keys: dict_keys(['num_games', 'num_wins', 'num_losses', 'num_draws', 'score_percentage_with_opening'])

Detailed results:
- num_games: 36 (int)
- num_wins: 17 (int)
- num_losses: 17 (int)
- num_draws: 2 (int)
- score_percentage_with_opening: 50.0 (float)


## Display Raw Black Games Data

Similarly, let's examine the black games structure.

In [24]:
# Display black games structure
print("Structure of black_games dictionary:")
print(f"black_games is a dictionary with {len(player_data['black_games'])} entries")
print("Keys are ECO codes, values are opening results")

# Show the first black opening as an example
if player_data["black_games"]:
    eco_code = next(iter(player_data["black_games"].keys()))
    opening_data = player_data["black_games"][eco_code]
    
    print("\nExample black opening:")
    print(f"ECO Code: {eco_code}")
    print("Raw data structure:")
    pprint(opening_data)
    
    print("\nOpening name:", opening_data["opening_name"])
    print("Results dictionary keys:", opening_data["results"].keys())
    
    print("\nDetailed results:")
    for key, value in opening_data["results"].items():
        print(f"- {key}: {value} ({type(value).__name__})")

Structure of black_games dictionary:
black_games is a dictionary with 28 entries
Keys are ECO codes, values are opening results

Example black opening:
ECO Code: B15
Raw data structure:
{'opening_name': 'Caro-Kann Defense: Main Line',
 'results': {'num_draws': 0,
             'num_games': 1,
             'num_losses': 1,
             'num_wins': 0,
             'score_percentage_with_opening': 0.0}}

Opening name: Caro-Kann Defense: Main Line
Results dictionary keys: dict_keys(['num_games', 'num_wins', 'num_losses', 'num_draws', 'score_percentage_with_opening'])

Detailed results:
- num_games: 1 (int)
- num_wins: 0 (int)
- num_losses: 1 (int)
- num_draws: 0 (int)
- score_percentage_with_opening: 0.0 (float)


## Raw Data for Multiple Players

Let's display the raw data for the first 5 active players to compare their structures.

In [25]:
# Display raw data for first 5 players
for i, (player_name, player_data) in enumerate(top_100_active_players[:5]):
    print(f"\n{'=' * 80}")
    print(f"Player {i+1}: {player_name}")
    print(f"{'=' * 80}")
    
    # Top-level structure
    print(f"Rating: {player_data['rating']}")
    print(f"Total games: {player_data['num_games_total']}")
    
    # White games summary
    print(f"\nWhite games: {len(player_data['white_games'])} different openings")
    
    # List a few white openings
    if player_data['white_games']:
        print("\nSome white openings:")
        for j, (eco, opening) in enumerate(list(player_data['white_games'].items())[:3]):
            print(f"  {eco}: {opening['opening_name']}")
            print(f"    Results: {opening['results']}")
            if j >= 2:
                remaining = len(player_data['white_games']) - 3
                if remaining > 0:
                    print(f"    ... and {remaining} more openings")
                break
    
    # Black games summary
    print(f"\nBlack games: {len(player_data['black_games'])} different openings")
    
    # List a few black openings
    if player_data['black_games']:
        print("\nSome black openings:")
        for j, (eco, opening) in enumerate(list(player_data['black_games'].items())[:3]):
            print(f"  {eco}: {opening['opening_name']}")
            print(f"    Results: {opening['results']}")
            if j >= 2:
                remaining = len(player_data['black_games']) - 3
                if remaining > 0:
                    print(f"    ... and {remaining} more openings")
                break


Player 1: Alphatik
Rating: 2428
Total games: 324

White games: 45 different openings

Some white openings:
  A40: Englund Gambit
    Results: {'num_games': 36, 'num_wins': 17, 'num_losses': 17, 'num_draws': 2, 'score_percentage_with_opening': 50.0}
  C42: Petrov's Defense: Italian Variation
    Results: {'num_games': 2, 'num_wins': 1, 'num_losses': 1, 'num_draws': 0, 'score_percentage_with_opening': 50.0}
  B13: Caro-Kann Defense: Exchange Variation
    Results: {'num_games': 1, 'num_wins': 0, 'num_losses': 1, 'num_draws': 0, 'score_percentage_with_opening': 0.0}
    ... and 42 more openings

Black games: 28 different openings

Some black openings:
  B15: Caro-Kann Defense: Main Line
    Results: {'num_games': 1, 'num_wins': 0, 'num_losses': 1, 'num_draws': 0, 'score_percentage_with_opening': 0.0}
  A04: Zukertort Opening: Pirc Invitation
    Results: {'num_games': 24, 'num_wins': 14, 'num_losses': 6, 'num_draws': 4, 'score_percentage_with_opening': 66.7}
  A00: Grob Opening
    Resul

## Detailed Examination of the Results Dictionary

Let's examine in detail the structure of the results dictionary for openings.

In [26]:
# Take the first player's first white opening for detailed examination
player_name, player_data = top_100_active_players[0]
if player_data['white_games']:
    eco_code = next(iter(player_data['white_games'].keys()))
    opening_data = player_data['white_games'][eco_code]
    
    print(f"Detailed examination of results for {player_name}'s white opening: {eco_code}")
    print(f"Opening name: {opening_data['opening_name']}")
    
    print("\nResults dictionary content:")
    results = opening_data['results']
    for key, value in sorted(results.items()):
        print(f"- {key}: {value} ({type(value).__name__})")
    
    print("\nSample calculation verification:")
    if 'num_wins' in results and 'num_losses' in results and 'num_draws' in results:
        total_games = results.get('num_games', 0)
        total_wins_losses_draws = (
            results.get('num_wins', 0) + 
            results.get('num_losses', 0) + 
            results.get('num_draws', 0)
        )
        print(f"Total games: {total_games}")
        print(f"Sum of wins, losses, draws: {total_wins_losses_draws}")
        print(f"Match: {total_games == total_wins_losses_draws}")
    
    if 'score_percentage_with_opening' in results:
        score = results['score_percentage_with_opening']
        wins = results.get('num_wins', 0)
        draws = results.get('num_draws', 0)
        total = results.get('num_games', 0)
        
        if total > 0:
            calculated_score = (wins + 0.5 * draws) / total * 100
            print(f"Stored score percentage: {score}")
            print(f"Calculated score: {calculated_score}")
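            # Stored scores appear to be rounded to one decimal place (e.g. 66.7),
            # so allow up to half a rounding step of difference.
            print(f"Match (allowing for rounding): {abs(score - calculated_score) <= 0.05 + 1e-9}")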

Detailed examination of results for Alphatik's white opening: A40
Opening name: Englund Gambit

Results dictionary content:
- num_draws: 2 (int)
- num_games: 36 (int)
- num_losses: 17 (int)
- num_wins: 17 (int)
- score_percentage_with_opening: 50.0 (float)

Sample calculation verification:
Total games: 36
Sum of wins, losses, draws: 36
Match: True
Stored score percentage: 50.0
Calculated score: 50.0
Match (allowing for rounding): True
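
The verification above uses the formula score = (wins + 0.5 × draws) / games × 100. The sketch below applies it to several openings shown elsewhere in this notebook. It assumes the stored percentage is rounded to one decimal place, which the outputs (66.7, 45.8, 42.9) suggest but the source does not confirm. The helper names are hypothetical.

```python
def score_percentage(wins: int, draws: int, games: int) -> float:
    """Score as a percentage: a win counts 1 point, a draw counts 0.5."""
    if games <= 0:
        raise ValueError("games must be positive")
    return (wins + 0.5 * draws) / games * 100


def matches_stored(stored: float, wins: int, draws: int, games: int,
                   decimals: int = 1) -> bool:
    """Compare against a stored value assumed to be rounded to `decimals` places."""
    tolerance = 0.5 * 10 ** -decimals + 1e-9  # half a rounding step
    return abs(stored - score_percentage(wins, draws, games)) <= tolerance


# (stored score, wins, draws, games) taken from the outputs in this notebook
samples = [
    (50.0, 17, 2, 36),  # A40 Englund Gambit
    (66.7, 14, 4, 24),  # A04 Zukertort Opening: Pirc Invitation
    (45.8, 5, 1, 12),   # C22 Center Game: Paulsen Attack Variation
    (42.9, 9, 0, 21),   # A00 Van't Kruijs Opening
]
for stored, wins, draws, games in samples:
    print(stored, matches_stored(stored, wins, draws, games))
```

With a tolerance of 0.01, the 66.7 entry would not match, because the unrounded value is about 66.667.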


## Display All Keys in Results Dictionary

Let's collect every key that appears in the results dictionaries of the 20 most active players.

In [27]:
# Collect all possible keys in results dictionaries
all_result_keys = set()

# Sample from multiple players to get a comprehensive list
for player_name, player_data in top_100_active_players[:20]:
    # Check white games
    for eco, opening in player_data['white_games'].items():
        all_result_keys.update(opening['results'].keys())
    
    # Check black games
    for eco, opening in player_data['black_games'].items():
        all_result_keys.update(opening['results'].keys())

print("All possible keys found in results dictionaries:")
for key in sorted(all_result_keys):
    print(f"- {key}")

All possible keys found in results dictionaries:
- num_draws
- num_games
- num_losses
- num_wins
- score_percentage_with_opening
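
A set shows which keys exist but not whether every opening carries all of them. As a possible extension, a `Counter` can record how often each key occurs. This is a sketch on made-up data; `count_result_keys` is a hypothetical helper that takes a dict of players, so the notebook's list of pairs would need `dict(top_100_active_players[:20])` first.

```python
from collections import Counter


def count_result_keys(players):
    """Count how often each results key occurs across all openings."""
    key_counts = Counter()
    n_openings = 0
    for stats in players.values():
        for color in ("white_games", "black_games"):
            for opening in stats[color].values():
                key_counts.update(opening["results"].keys())
                n_openings += 1
    return key_counts, n_openings


# Made-up data in which one opening lacks the num_wins key.
demo_players = {
    "p1": {
        "white_games": {
            "A40": {"opening_name": "x", "results": {"num_games": 1, "num_wins": 1}},
        },
        "black_games": {},
    },
    "p2": {
        "white_games": {},
        "black_games": {
            "B15": {"opening_name": "y", "results": {"num_games": 2}},
        },
    },
}
key_counts, n_openings = count_result_keys(demo_players)
incomplete = sorted(k for k, n in key_counts.items() if n < n_openings)
print(f"{n_openings} openings; keys missing somewhere: {incomplete}")
```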


## Visualize the Raw Data Structure

Let's create a visual representation of the raw data structure.

In [28]:
print("Raw Player Stats Data Structure:")
print("-------------------------------")
print("players_data (dict)")
print("│")
print("├── player_name_1 (key) → PlayerStats (dict)")
print("│   ├── rating (int)")
print("│   ├── num_games_total (int)")
print("│   ├── white_games (dict)")
print("│   │   ├── ECO_code_1 (key) → OpeningResults (dict)")
print("│   │   │   ├── opening_name (str)")
print("│   │   │   └── results (dict)")
print("│   │   │       ├── num_games (int)")
print("│   │   │       ├── num_wins (int)")
print("│   │   │       ├── num_losses (int)")
print("│   │   │       ├── num_draws (int)")
print("│   │   │       └── score_percentage_with_opening (float)")
print("│   │   └── ECO_code_2 (key) → ...")
print("│   │")
print("│   └── black_games (dict)")
print("│       ├── ECO_code_1 (key) → OpeningResults (dict)")
print("│       │   ├── opening_name (str)")
print("│       │   └── results (dict)")
print("│       │       ├── num_games (int)")
print("│       │       ├── num_wins (int)")
print("│       │       ├── num_losses (int)")
print("│       │       ├── num_draws (int)")
print("│       │       └── score_percentage_with_opening (float)")
print("│       └── ECO_code_2 (key) → ...")
print("│")
print("├── player_name_2 (key) → ...")
print("└── ...")

Raw Player Stats Data Structure:
-------------------------------
players_data (dict)
│
├── player_name_1 (key) → PlayerStats (dict)
│   ├── rating (int)
│   ├── num_games_total (int)
│   ├── white_games (dict)
│   │   ├── ECO_code_1 (key) → OpeningResults (dict)
│   │   │   ├── opening_name (str)
│   │   │   └── results (dict)
│   │   │       ├── num_games (int)
│   │   │       ├── num_wins (int)
│   │   │       ├── num_losses (int)
│   │   │       ├── num_draws (int)
│   │   │       └── score_percentage_with_opening (float)
│   │   └── ECO_code_2 (key) → ...
│   │
│   └── black_games (dict)
│       ├── ECO_code_1 (key) → OpeningResults (dict)
│       │   ├── opening_name (str)
│       │   └── results (dict)
│       │       ├── num_games (int)
│       │       ├── num_wins (int)
│       │       ├── num_losses (int)
│       │       ├── num_draws (int)
│       │       └── score_percentage_with_opening (float)
│       └── ECO_code_2 (key) → ...
│
├── player_name_2 (key) → ...
└── ...
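
The same structure can be written down as type definitions. The sketch below uses `TypedDict` with the names `PlayerStats` and `OpeningResults` from the tree; `ResultCounts` and `PlayersData` are hypothetical names introduced here. The example value is abbreviated to one opening per color, using figures from the outputs above.

```python
from typing import Dict, TypedDict


class ResultCounts(TypedDict):
    """Inner `results` dictionary of an opening (name is an assumption)."""
    num_games: int
    num_wins: int
    num_losses: int
    num_draws: int
    score_percentage_with_opening: float


class OpeningResults(TypedDict):
    opening_name: str
    results: ResultCounts


class PlayerStats(TypedDict):
    rating: int
    num_games_total: int
    white_games: Dict[str, OpeningResults]  # keyed by ECO code
    black_games: Dict[str, OpeningResults]  # keyed by ECO code


PlayersData = Dict[str, PlayerStats]  # keyed by player name

# Abbreviated example: the real entry has many more openings per color.
example: PlayerStats = {
    "rating": 2428,
    "num_games_total": 324,
    "white_games": {
        "A40": {
            "opening_name": "Englund Gambit",
            "results": {
                "num_games": 36, "num_wins": 17, "num_losses": 17,
                "num_draws": 2, "score_percentage_with_opening": 50.0,
            },
        },
    },
    "black_games": {
        "B15": {
            "opening_name": "Caro-Kann Defense: Main Line",
            "results": {
                "num_games": 1, "num_wins": 0, "num_losses": 1,
                "num_draws": 0, "score_percentage_with_opening": 0.0,
            },
        },
    },
}
players: PlayersData = {"Alphatik": example}
print(sorted(PlayerStats.__annotations__))
```

`TypedDict` is not enforced at runtime. A static type checker such as mypy would be needed to catch mismatches.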


## Show Complete Raw Data for One Player

Let's display the complete raw data for a single player to see everything together.

In [29]:
# Get a player with a moderate number of games for clearer output
sample_player_name, sample_player_data = top_100_active_players[5]  # Taking the 6th most active player

print(f"Complete raw data for player: {sample_player_name}")
print("\nTop-level player data:")
pprint({"rating": sample_player_data["rating"], "num_games_total": sample_player_data["num_games_total"]})

# Count and display summary
white_openings = len(sample_player_data["white_games"])
black_openings = len(sample_player_data["black_games"])
print(f"\nPlayer has data for {white_openings} white openings and {black_openings} black openings")

# Display a limited number of white openings to avoid excessive output
print("\nSample of white openings (first 3):")
for i, (eco, opening_data) in enumerate(list(sample_player_data["white_games"].items())[:3]):
    print(f"\n{i+1}. ECO: {eco}, Opening: {opening_data['opening_name']}")
    print("   Results:")
    pprint(opening_data["results"])

# Display a limited number of black openings to avoid excessive output
print("\nSample of black openings (first 3):")
for i, (eco, opening_data) in enumerate(list(sample_player_data["black_games"].items())[:3]):
    print(f"\n{i+1}. ECO: {eco}, Opening: {opening_data['opening_name']}")
    print("   Results:")
    pprint(opening_data["results"])

# Show complete raw data as JSON
print("\nComplete raw data structure (as JSON):")
print("Note: This may be very large depending on the player's activity")
user_input = input("Do you want to display the complete raw data for this player? (yes/no): ")
if user_input.lower() in ['yes', 'y']:
    print(json.dumps(sample_player_data, indent=2))
else:
    print("Complete raw data display skipped.")

Complete raw data for player: Preszaphodbeeblebrox

Top-level player data:
{'num_games_total': 289, 'rating': 1906}

Player has data for 24 white openings and 23 black openings

Sample of white openings (first 3):

1. ECO: C22, Opening: Center Game: Paulsen Attack Variation
   Results:
{'num_draws': 1,
 'num_games': 12,
 'num_losses': 6,
 'num_wins': 5,
 'score_percentage_with_opening': 45.8}

2. ECO: B06, Opening: Modern Defense: Two Knights Variation
   Results:
{'num_draws': 0,
 'num_games': 8,
 'num_losses': 4,
 'num_wins': 4,
 'score_percentage_with_opening': 50.0}

3. ECO: B03, Opening: Alekhine Defense: Balogh Variation
   Results:
{'num_draws': 0,
 'num_games': 1,
 'num_losses': 1,
 'num_wins': 0,
 'score_percentage_with_opening': 0.0}

Sample of black openings (first 3):

1. ECO: A00, Opening: Van't Kruijs Opening
   Results:
{'num_draws': 0,
 'num_games': 21,
 'num_losses': 12,
 'num_wins': 9,
 'score_percentage_with_opening': 42.9}

2. ECO: E67, Opening: King's Indian Defens

## Export Full Raw Data Sample

Let's provide a way to export raw data for further examination.

In [30]:
def export_raw_player_data(player_index=0, file_path=None):
    """
    Export raw player data to a JSON file.
    
    Args:
        player_index: Index of the player in the top_100_active_players list
        file_path: Path to save the JSON file (if None, a default path is used)
    """
    if not 0 <= player_index < len(top_100_active_players):
        print(f"Error: Player index {player_index} out of range (max: {len(top_100_active_players) - 1})")
        return
    
    player_name, player_data = top_100_active_players[player_index]
    
    if file_path is None:
        file_path = f"../data/processed/player_{player_index}_raw_data.json"
    
    with open(file_path, 'w') as f:
        json.dump(player_data, f, indent=2)
    
    print(f"Exported raw data for player {player_name} to {file_path}")

# Uncomment to export raw data for a specific player
# export_raw_player_data(player_index=0)
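
Since the structure contains only string keys and `int`, `float` and `str` values, a JSON export should reload to an equal object. The following sketch checks this round trip on an abbreviated, made-up player record written to a temporary directory. The file name mirrors the default pattern above but is otherwise arbitrary.

```python
import json
import tempfile
from pathlib import Path

# Abbreviated record in the same shape as one entry of players_data.
player_data = {
    "rating": 1906,
    "num_games_total": 289,
    "white_games": {
        "B06": {
            "opening_name": "Modern Defense: Two Knights Variation",
            "results": {
                "num_games": 8, "num_wins": 4, "num_losses": 4,
                "num_draws": 0, "score_percentage_with_opening": 50.0,
            },
        },
    },
    "black_games": {},
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "player_0_raw_data.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(player_data, f, indent=2)
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == player_data
print(f"Round trip preserved the data: {round_trip_ok}")
```

The check would fail for dictionaries with non-string keys, because JSON converts such keys to strings on export.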

## Raw Data Summary Statistics

Let's calculate some basic statistics about the raw data structure.

In [31]:
# Calculate statistics across all active players
total_players = len(active_players)
total_openings = 0
white_openings_counts = []
black_openings_counts = []
total_games_counts = []
ratings = []

for player_name, player_data in active_players.items():
    white_count = len(player_data["white_games"])
    black_count = len(player_data["black_games"])
    
    white_openings_counts.append(white_count)
    black_openings_counts.append(black_count)
    total_openings += white_count + black_count
    
    total_games_counts.append(player_data["num_games_total"])
    ratings.append(player_data["rating"])

# Display summary statistics
print("Raw Data Summary Statistics:")
print(f"Total active players (≥200 games): {total_players}")
print(f"Total opening entries across all players: {total_openings}")

print("\nGames per player:")
print(f"  Min: {min(total_games_counts)}")
print(f"  Max: {max(total_games_counts)}")
print(f"  Mean: {np.mean(total_games_counts):.1f}")
print(f"  Median: {np.median(total_games_counts):.1f}")

print("\nWhite openings per player:")
print(f"  Min: {min(white_openings_counts)}")
print(f"  Max: {max(white_openings_counts)}")
print(f"  Mean: {np.mean(white_openings_counts):.1f}")
print(f"  Median: {np.median(white_openings_counts):.1f}")

print("\nBlack openings per player:")
print(f"  Min: {min(black_openings_counts)}")
print(f"  Max: {max(black_openings_counts)}")
print(f"  Mean: {np.mean(black_openings_counts):.1f}")
print(f"  Median: {np.median(black_openings_counts):.1f}")

print("\nPlayer ratings:")
print(f"  Min: {min(ratings)}")
print(f"  Max: {max(ratings)}")
print(f"  Mean: {np.mean(ratings):.1f}")
print(f"  Median: {np.median(ratings):.1f}")

Raw Data Summary Statistics:
Total active players (≥200 games): 24
Total opening entries across all players: 948

Games per player:
  Min: 202
  Max: 324
  Mean: 244.2
  Median: 230.0

White openings per player:
  Min: 1
  Max: 51
  Mean: 18.7
  Median: 19.0

Black openings per player:
  Min: 6
  Max: 54
  Mean: 20.8
  Median: 18.0

Player ratings:
  Min: 1507
  Max: 2647
  Mean: 2119.4
  Median: 2059.0
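
The four statistics are computed the same way for each list, so they can be factored into one helper. This is a sketch using the standard-library `statistics` module instead of NumPy; `summarize` is a hypothetical name and the input values are made up. It returns `None` for an empty list, a case in which `min()` and `max()` would raise `ValueError`.

```python
import statistics


def summarize(values):
    """Return min, max, mean and median, or None for an empty sequence."""
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


demo_game_counts = [202, 230, 324]  # made-up values
summary = summarize(demo_game_counts)
print("Games per player:")
for name, value in summary.items():
    print(f"  {name.capitalize()}: {value:.1f}")
```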


## Sample Conversion to DataFrame

Let's show how this raw data structure could be converted to a DataFrame format if needed.

In [32]:
# Create a DataFrame for white openings across players
def create_openings_dataframe(color="white", max_players=10):
    """
    Create a DataFrame of opening statistics for multiple players.
    
    Args:
        color: 'white' or 'black'
        max_players: Maximum number of players to include
    
    Returns:
        DataFrame of opening statistics
    """
    data = []
    
    # Get data from top players
    for player_name, player_data in top_100_active_players[:max_players]:
        # Get openings for the specified color
        openings_dict = player_data[f"{color}_games"]
        
        # Add each opening as a row
        for eco, opening_data in openings_dict.items():
            row = {
                "player": player_name,
                "rating": player_data["rating"],
                "eco": eco,
                "opening_name": opening_data["opening_name"],
            }
            
            # Add all results data
            for key, value in opening_data["results"].items():
                row[key] = value
            
            data.append(row)
    
    # Create DataFrame
    df = pd.DataFrame(data)
    return df

# Create a sample DataFrame for white openings
white_df = create_openings_dataframe(color="white", max_players=5)
print(f"Sample DataFrame for white openings (first 5 players):")
print(f"Shape: {white_df.shape}")
print(f"Columns: {white_df.columns.tolist()}")
display(white_df.head())

# Create a sample DataFrame for black openings
black_df = create_openings_dataframe(color="black", max_players=5)
print(f"\nSample DataFrame for black openings (first 5 players):")
print(f"Shape: {black_df.shape}")
print(f"Columns: {black_df.columns.tolist()}")
display(black_df.head())

Sample DataFrame for white openings (first 5 players):
Shape: (124, 9)
Columns: ['player', 'rating', 'eco', 'opening_name', 'num_games', 'num_wins', 'num_losses', 'num_draws', 'score_percentage_with_opening']


| | player | rating | eco | opening_name | num_games | num_wins | num_losses | num_draws | score_percentage_with_opening |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alphatik | 2428 | A40 | Englund Gambit | 36 | 17 | 17 | 2 | 50.0 |
| 1 | Alphatik | 2428 | C42 | Petrov's Defense: Italian Variation | 2 | 1 | 1 | 0 | 50.0 |
| 2 | Alphatik | 2428 | B13 | Caro-Kann Defense: Exchange Variation | 1 | 0 | 1 | 0 | 0.0 |
| 3 | Alphatik | 2428 | B76 | Sicilian Defense: Dragon Variation, Yugoslav A... | 1 | 0 | 1 | 0 | 0.0 |
| 4 | Alphatik | 2428 | C47 | Four Knights Game: Italian Variation | 1 | 0 | 1 | 0 | 0.0 |



Sample DataFrame for black openings (first 5 players):
Shape: (125, 9)
Columns: ['player', 'rating', 'eco', 'opening_name', 'num_games', 'num_wins', 'num_losses', 'num_draws', 'score_percentage_with_opening']


| | player | rating | eco | opening_name | num_games | num_wins | num_losses | num_draws | score_percentage_with_opening |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alphatik | 2428 | B15 | Caro-Kann Defense: Main Line | 1 | 0 | 1 | 0 | 0.0 |
| 1 | Alphatik | 2428 | A04 | Zukertort Opening: Pirc Invitation | 24 | 14 | 6 | 4 | 66.7 |
| 2 | Alphatik | 2428 | A00 | Grob Opening | 34 | 21 | 12 | 1 | 63.2 |
| 3 | Alphatik | 2428 | B12 | Caro-Kann Defense: Advance Variation, Botvinni... | 6 | 3 | 3 | 0 | 50.0 |
| 4 | Alphatik | 2428 | B00 | Rat Defense: Antal Defense | 7 | 6 | 1 | 0 | 85.7 |
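
Each row of these tables is one (player, opening) pair, so the row count equals the total number of openings across the selected players. The sketch below builds the same rows as plain dictionaries without pandas and rejects an invalid `color` up front. `opening_rows` is a hypothetical helper and the data is made up. A list of such dictionaries can be passed directly to `pd.DataFrame`.

```python
def opening_rows(players, color):
    """Flatten (name, stats) pairs into one row per opening for a color."""
    if color not in ("white", "black"):
        raise ValueError(f"color must be 'white' or 'black', got {color!r}")
    rows = []
    for name, stats in players:
        for eco, opening in stats[f"{color}_games"].items():
            rows.append({
                "player": name,
                "rating": stats["rating"],
                "eco": eco,
                "opening_name": opening["opening_name"],
                **opening["results"],  # num_games, num_wins, ...
            })
    return rows


# Made-up players in the same shape as top_100_active_players.
demo_players = [
    ("p1", {"rating": 2000,
            "white_games": {
                "A40": {"opening_name": "x", "results": {"num_games": 3}},
                "C42": {"opening_name": "y", "results": {"num_games": 1}},
            },
            "black_games": {}}),
    ("p2", {"rating": 1800,
            "white_games": {
                "B13": {"opening_name": "z", "results": {"num_games": 2}},
            },
            "black_games": {}}),
]
white_rows = opening_rows(demo_players, "white")
expected = sum(len(stats["white_games"]) for _, stats in demo_players)
print(f"{len(white_rows)} rows, expected {expected}")
```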
