# Discovering Insights from Chess Games

**Primary Objective**: Perform exploratory analysis of chess game data to understand distributions of player skills, outcomes, openings, and basic play patterns before constructing an Elo prediction model.

**Raw Data**: Portable Game Notation (PGN) files containing moves from 25,000 expert chess games, player Elo ratings, openings, and Stockfish evaluations.

**Plan**:
* Compute summary statistics on games: lengths, results, openings, theory depth
* Visualize Elo distributions and correlations
* Sample games across the Elo spectrum and inspect move patterns
* Validate data integrity: correct notation, valid moves, matched results
* Identify anomalies: incomplete games, duplicate matches, odd evaluations

**Learnings**:
* Game length, result, and opening trends for different Elo levels
* Imbalances and biases in player populations or outcomes
* Adherence to opening theory by players of varying skill
* General play patterns and trajectories in games
* Data quality and completeness for modeling
* Peculiar games or moves that require investigation

First, we gotta check out how good the players are, what moves they tend to make, and how their games usually turn out. Once we've got a good sense of all that, we can dive deeper into the data and start figuring out what's really going on. This will help us spot any problems and establish a solid analysis foundation.

### 0. Libraries

In [1]:
from collections import Counter

import chess
import chess.pgn

import pandas as pd

### 1. Get PGN Chess Data

In [None]:
pgn = open("../data/raw/data.pgn")
game_count = 0
moves_per_game = []

while True:
    game = chess.pgn.read_game(pgn)
    if game is None:
        break
    game_count += 1
    moves_per_game.append(len(list(game.mainline_moves())))

pgn.close()


In [None]:
print(f"The number of games in the file is: {game_count:,d}")

### 2. Distribution of Total Moves Per Game

In [None]:
print(f"Average number of moves per game: {sum(moves_per_game)/len(moves_per_game):.2f}")

print(f"Min number of moves per game: {min(moves_per_game):,d}")
print(f"25th percentile of moves per game: {sorted(moves_per_game)[int(len(moves_per_game)*0.25)]:,d}")
print(f"50th percentile of moves per game: {sorted(moves_per_game)[int(len(moves_per_game)*0.50)]:,d}")
print(f"75th percentile of moves per game: {sorted(moves_per_game)[int(len(moves_per_game)*0.75)]:,d}")
print(f"Max number of moves per game: {max(moves_per_game):,d}")

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = [12, 6]
plt.style.use('ggplot')

# Calculate the mean and median of the distribution
mean_moves = np.mean(moves_per_game)
median_moves = np.median(moves_per_game)

# Create a histogram to visualize the distribution of moves per game
plt.hist(moves_per_game, bins=range(min(moves_per_game), max(moves_per_game) + 1))
plt.title("Distribution of Moves per Game")
plt.xlabel("Number of Moves")
plt.ylabel("Frequency")

# Add vertical lines to indicate the mean and median
plt.axvline(mean_moves, color='xkcd:sky blue', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(median_moves, color='navy', linestyle='dashed', linewidth=1, label='Median')

# Add a legend to the plot
plt.legend()

# Show the plot
plt.show()

In [None]:
print(f"Proportion of games shorter than average: {sum([1 for x in moves_per_game if x < sum(moves_per_game)/len(moves_per_game)]) / len(moves_per_game):.1%}")

It appears that the distribution of total moves per game has a peak around 80 moves. This suggests that most of the games in the PGN file have around 100 moves. The range of the x-axis indicates that there are games with as few as 1 moves and as many as 300 moves.

Some interesting insights that can be derived from this distribution include:

- [x] The average length of a game in terms of the number of moves.
- [x] The proportion of games that are shorter or longer than the average game length.
- [ ] The presence of any outliers or unusual patterns in the distribution.

#### 2. 1. Deep Dive into Game Movements Distribution

To investigate the data further, let's ask ourselves questions such as:

- [ ] Are there any patterns or trends in the distribution of moves per game? *To answer this question, you can use visualization techniques such as histograms and box plots to explore the distribution of moves per game. You can also use statistical tests such as the chi-squared test or the Kolmogorov-Smirnov test to determine if the distribution follows a known pattern or if there are any significant differences between groups.*

- [ ] Are there any differences in the distribution of moves per game for different players, openings, or other factors? *To answer this question, you can use regression analysis or analysis of variance (ANOVA) to investigate the relationship between the number of moves per game and other variables such as player skill level, opening choice, or time control. This will help you determine if there are any significant differences in the distribution of moves per game for different groups.*

- [ ] Are there any unusually short or long games, and if so, what might explain these outliers? *To answer this question, you can use descriptive statistics to identify any outliers in the data. You can then use techniques such as cluster analysis or factor analysis to investigate the characteristics of these outliers and determine what might explain their unusual length.*

In [2]:
# Functions to extract key data from each game
def get_player_ratings(game):
    """Get player Elo ratings from PGN headers"""
    white_rating = game.headers['WhiteElo']
    black_rating = game.headers['BlackElo']
    
    return {
        'wp_rating': white_rating,
        'bp_rating': black_rating
    }

In [3]:
def get_game_info(game):

    # Get total moves from perspective of white
    total_moves = len(list(game.mainline_moves())) 

    # Initialize variables
    white_moves = None
    black_moves = None
    white_won = None

    # Determine outcome
    result = game.headers['Result']
    if result == '1-0':
        white_won = True
    elif result == '0-1':
        white_won = False

    # Calculate white moves
    white_moves = (total_moves // 2) + (total_moves % 2)

    # Calculate black moves based on parity and outcome
    if white_won is None:
        if total_moves % 2 != 0:
            black_moves = white_moves - 1
        else:
            black_moves = white_moves
    elif white_won:
        black_moves = white_moves - 1
    else:
        black_moves = white_moves
        
    return {
        'result': result,
        'total_moves': total_moves,
        'wp_moves': white_moves,
        'bp_moves': black_moves
    }

In [4]:
def get_piece_moves(game):
    """Track number of moves by piece type and player"""
    
    # Initialize count for each piece
    cols = ['P', 'N', 'B', 'R', 'Q', 'K']
    wp_moves = {col: 0 for col in cols}
    bp_moves = {col: 0 for col in cols}

    # Tally moves by piece based on turn
    board = game.board()
    for move in game.mainline_moves():
        piece = board.piece_at(move.from_square).symbol().upper()
        if board.turn:
            wp_moves[piece] += 1
        else:
            bp_moves[piece] += 1
        board.push(move)

    # Build columns for dataframe  
    df_moves = {}
    for col in cols:
        df_moves[f'wp_{col}'] = wp_moves[col] 
        df_moves[f'bp_{col}'] = bp_moves[col]

    return df_moves

In [6]:
def get_checks(game):
    """Count checks performed by each player"""

    # Initialize counters 
    wp_checks = 0
    bp_checks = 0

    # Iterate through moves
    board = game.board()
    for move in game.mainline_moves():
        
        # Make move and continue
        board.push(move)

        # Check if move is a check
        if board.is_check():
            if board.turn:
                wp_checks += 1
            else:
                bp_checks += 1
                

    return {
        'wp_checks': wp_checks,
        'bp_checks': bp_checks
    }

In [11]:
def get_captures(game):
    
    # Initialize counters
    wp_captures = 0
    bp_captures = 0
    
    # Iterate through moves
    board = game.board()
    for move in game.mainline_moves():
        
        
        # Check if capture occurred
        if board.is_capture(move):
            if board.turn:
                wp_captures += 1
            else:
                bp_captures += 1
                
        board.push(move)
        
    return {
        'wp_captures': wp_captures,
        'bp_captures': bp_captures
    }

In [20]:
def pgn_to_dataframe(pgn_file):

    games = []
    
    # Load PGN
    pgn = open(pgn_file)

    # Iterate over games
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
            
        # Extract info 
        game_info = get_game_info(game)
        piece_moves = get_piece_moves(game)
        check_counts = get_checks(game)
        capture_counts = get_captures(game)
        player_ratings = get_player_ratings(game)
        
        # Compile game data
        games.append({**game_info, **piece_moves, **check_counts, **capture_counts, **player_ratings})

    # Build DataFrame
    return pd.DataFrame(games)

Converting the PGN to a `pandas.DataFrame` can enable lots of additional analysis on the chess games. A few other ideas for enhancements:

- [x] Add columns for additional info from the PGN headers like player names, ratings, dates, etc
- [x] Track the moves for each individual player rather than just total moves
- [x] Add columns for other metrics like number of captures, checks, etc
- [ ] Add columns for metrics related to openings

In [21]:
raw_df = pgn_to_dataframe("../data/raw/small_data.pgn")

In [22]:
raw_df

Unnamed: 0,result,total_moves,wp_moves,bp_moves,wp_P,bp_P,wp_N,bp_N,wp_B,bp_B,...,wp_Q,bp_Q,wp_K,bp_K,wp_checks,bp_checks,wp_captures,bp_captures,wp_rating,bp_rating
0,1/2-1/2,38,19,19,5,5,5,4,4,2,...,2,4,1,3,0,1,4,5,2354,2411
1,1/2-1/2,13,7,6,2,3,3,3,1,0,...,0,0,1,0,0,0,1,1,2523,2460
2,0-1,106,53,53,14,15,2,6,4,4,...,2,2,15,9,5,4,11,12,1915,1999
3,1-0,77,39,38,11,11,9,2,5,8,...,4,3,4,10,1,4,8,9,2446,2191
4,1-0,49,25,24,12,5,1,9,5,3,...,6,3,1,2,0,4,8,5,2168,2075
5,1/2-1/2,58,29,29,7,7,5,9,6,3,...,7,5,1,3,0,0,8,8,2437,2254
6,1-0,75,38,37,10,6,5,7,10,4,...,7,11,2,4,2,1,11,8,2449,2201
7,1/2-1/2,127,64,63,17,20,4,2,4,9,...,5,4,28,16,9,1,15,15,1813,1643
8,1-0,91,46,45,12,10,8,2,4,12,...,1,3,8,12,2,1,8,9,2553,2052
9,1-0,59,30,29,7,7,6,9,9,6,...,5,3,1,1,0,3,8,7,2611,2520


In [None]:
pgn = open("../data/raw/data.pgn")
board = chess.Board()
white_first_move = []

while True:
    board.reset()
    game = chess.pgn.read_game(pgn)
    
    if game is None: break
    
    for i, move in enumerate(game.mainline_moves()):
        if i > 0: break
        pgn_move = board.variation_san([move])
        if i == 0:
            white_first_move.append(pgn_move)
        board.push(move)

pgn.close()

In [None]:
Counter(white_first_move)

In [None]:
pgn = open("../data/raw/data.pgn")
black_first_move = []

while True:
    board.reset()
    game = chess.pgn.read_game(pgn)
    
    if game is None: break
    
    for i, move in enumerate(game.mainline_moves()):
        if i > 1: break
        pgn_move = board.variation_san([move])
        if i == 1:
            black_first_move.append(pgn_move)
        board.push(move)
        
Counter(black_first_move)