# Chess data

Analysis of my own games from chess.com. The data was obtained using the site's built-in download, which exports multiple games into a single text file.

## Part 1: data extraction

Define a function to take in a single file (with multiple games) and return the associated DataFrame

In [6]:
# Create a function to take a raw game string, and extract the metadata into a dictionary.
# First, it will split each game by new lines into their separate meta-tags.

def extract_game_info(game):

    # separate the list of metadata attributes from the list of moves
    metadata = [section for section in game.split("\n\n") if section != ""]

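    # each game should consist of exactly one tag section and one moves section
    assert len(metadata) == 2, "expected a tag section and a moves section"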

    # save them into separate variables
    columns_string, moves_string = metadata

    # make the metadata columns into a list
    columns = columns_string.split("\n")

    # the file was split on the Event tag, so the first line is only the
    # leftover of that tag (the incomplete event name): get rid of it
    del columns[0]
    
    # keep track of metadata in a dictionary
    game_dict = {}

    # loop through the tags
    for column in columns:
        # identify where the first space is
        first_space = column.index(" ")

        # what's to the left of the first space is the column name
        # (exclude the first square bracket)
        column_name = column[1:first_space]

        # to the right is the column value (exclude the final square bracket)
        column_value = column[first_space+1:-1]

        # get rid of double quotes
        column_value = column_value.replace('"', '')

        # add a value to the dictionary
        game_dict[column_name] = column_value

    # add the entire list of moves to the dictionary
    game_dict["Moves"] = moves_string
    
    return game_dict
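
To see what the slicing inside the tag loop does, here is a minimal sketch applied to a single tag line. The helper name `parse_tag_line` and the sample strings are illustrative assumptions, not part of the notebook; the logic mirrors the loop body above.

```python
def parse_tag_line(line):
    """Split one PGN tag line such as '[White "abcdave"]' into (name, value)."""
    # everything between the opening bracket and the first space is the tag name
    first_space = line.index(" ")
    name = line[1:first_space]

    # everything after the first space, minus the closing bracket, is the value
    value = line[first_space + 1:-1]

    # strip the double quotes that surround the value
    value = value.replace('"', "")
    return name, value


# hypothetical sample lines in the same shape as the PGN headers
print(parse_tag_line('[White "abcdave"]'))
print(parse_tag_line('[Termination "abcdave won on time"]'))
```

Splitting on the first space only is what lets values that contain spaces, such as the termination text, survive intact.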

In [7]:
import pandas as pd

def parse_pgn_file(filepath):
    
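    # read the whole file into a single string
    with open(filepath, "r") as f:
        contents = f.read()

    # Assume each record starts with `[Event "` and use that as a separator
    # (the trailing quote avoids matching other tags that begin with "[Event")

    # first one will be an empty string, ignore that
    raw_games = contents.split('[Event "')[1:]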
    
    game_dictionaries = [extract_game_info(game) for game in raw_games]

    df = pd.DataFrame(game_dictionaries)
    
    return df
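
The function above splits the file on the opening of each Event tag, which discards the tag itself. As an alternative sketch, a regular expression with a lookahead can split at the same places while keeping the tag. This swaps in `re.split` for the plain string split; `split_games`, `SAMPLE_PGN`, and the event name in it are made up for illustration.

```python
import re

# Hypothetical two-game export; the tag values are illustrative only.
SAMPLE_PGN = """[Event "Example event"]
[Site "Chess.com"]
[White "abcdave"]
[Black "utopa"]
[Result "1-0"]

1. e4 e5 2. Nf3 d6 3. Nc3 1-0

[Event "Example event"]
[Site "Chess.com"]
[White "agrazi"]
[Black "abcdave"]
[Result "0-1"]

1. e4 c5 2. d4 cxd4 0-1
"""


def split_games(text):
    """Split a multi-game PGN string into one string per game."""
    # split at the start of every line that begins with '[Event "';
    # the lookahead is zero-width, so the tag stays in the chunk
    chunks = re.split(r'(?m)^(?=\[Event ")', text)

    # drop the empty chunk before the first game and trim whitespace
    return [chunk.strip() for chunk in chunks if chunk.strip()]


games = split_games(SAMPLE_PGN)
print(len(games))
```

Because the Event tag is kept here, a parser consuming these chunks would no longer need to discard the first tag line.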

Run the function on all relevant files

In [8]:
import os

dataframes = []

# parse every file in the data subfolder (assumed to contain only PGN exports)
base_folder_path = "./data"

for pgn_file in os.listdir(base_folder_path):
    pgn_file_path = os.path.join(base_folder_path, pgn_file)

    # skip sub-directories such as .ipynb_checkpoints
    if not os.path.isfile(pgn_file_path):
        continue

    # for each PGN file get the associated DataFrame
    pgn_dataframe = parse_pgn_file(pgn_file_path)
    
    # save it into our list for concatenation
    dataframes.append(pgn_dataframe)

all_games = pd.concat(dataframes, ignore_index=True)
all_games.head()

| | Site | Date | Round | White | Black | Result | WhiteElo | BlackElo | TimeControl | EndDate | Termination | Moves | Variant | SetUp | FEN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chess.com | 2017.10.18 | - | abcdave | utopa | 1-0 | 1400 | 800 | 1/604800 | 2017.10.26 | abcdave won on time | 1. e4 e5 2. Nf3 d6 3. Nc3 1-0 | | | |
| 1 | Chess.com | 2017.10.18 | - | agrazi | abcdave | 0-1 | 1206 | 1494 | 1/432000 | 2017.12.03 | abcdave won by checkmate | 1. e4 c5 2. d4 cxd4 3. Qxd4 Nc6 4. Qc5 e5 5. Q... | | | |
| 2 | Chess.com | 2017.11.07 | - | abcdave | hetverschil | 0-1 | 1335 | 1384 | 1/432000 | 2017.12.17 | hetverschil won by resignation | 1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. O-O Nf6 5. d... | | | |
| 3 | Chess.com | 2017.12.04 | - | abcdave | agrazi | 1-0 | 1394 | 1191 | 1/172800 | 2017.12.27 | abcdave won by resignation | 1. e4 e5 2. f4 exf4 3. Nf3 d6 4. d4 Bd7 5. Bxf... | | | |
| 4 | Chess.com | 2017.12.28 | - | agrazi | abcdave | 0-1 | 1179 | 1431 | 1/172800 | 2018.01.08 | abcdave won by resignation | 1. e4 e5 2. d3 Nf6 3. Be3 c6 4. Nf3 d6 5. Be2 ... | | | |


## Part 2: Cleaning

My research question is "how do I perform within different opening systems?"

### But first: data cleaning!

The "Variant" column: nulls are regular chess games; non-null values mark other game modes, which we want to exclude.

In [10]:
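# only keep NULLs; take a copy so that the column assignments further down
# act on an independent DataFrame rather than on a slice of the original
all_games = all_games[all_games["Variant"].isnull()].copy()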

Any draws?

In [11]:
all_games["Result"].value_counts()

1-0        29
0-1        26
1/2-1/2     1
Name: Result, dtype: int64

Create a "did I win?" column based on the result + the player names

In [12]:
# either white won and I was white, or black won and I was black
# (draws and losses both end up as False)
all_games["did_I_win"] = (
    ((all_games["Result"] == "1-0") & (all_games["White"] == "abcdave"))
    | ((all_games["Result"] == "0-1") & (all_games["Black"] == "abcdave"))
)

all_games["did_I_win"].value_counts(dropna=False)

True     36
False    20
Name: did_I_win, dtype: int64
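
Since the boolean column cannot tell a draw from a loss, a three-way label may be useful later. The following is only a sketch in plain Python: the function name `game_outcome`, its `me` parameter, and the "unknown" fallback are assumptions made for illustration, while the result codes and the username come from the data above.

```python
def game_outcome(result, white, black, me="abcdave"):
    """Classify one game as 'win', 'loss', 'draw' or 'unknown' from my point of view."""
    if result == "1/2-1/2":
        return "draw"

    # I won if the winning side is the one I was playing
    if (result == "1-0" and white == me) or (result == "0-1" and black == me):
        return "win"

    # any other decisive result in a game I took part in is a loss
    if result in ("1-0", "0-1") and me in (white, black):
        return "loss"

    # unfinished games or games without me fall through to here
    return "unknown"


print(game_outcome("1-0", "abcdave", "utopa"))
print(game_outcome("0-1", "abcdave", "hetverschil"))
```

On the DataFrame this could be applied row-wise, for example with `all_games.apply(lambda row: game_outcome(row["Result"], row["White"], row["Black"]), axis=1)`.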

Create a "which colour was I" column

In [13]:
import numpy as np

all_games["my_colour"] = np.where(all_games["White"] == "abcdave", "White", "Black")
all_games["my_colour"].value_counts(dropna=False)

White    29
Black    27
Name: my_colour, dtype: int64

Remove games that ended due to timeout

In [14]:
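# keep only the games whose termination text does not mention "on time";
# na=False treats a missing termination as "not a timeout"
all_games = all_games[~all_games["Termination"].str.contains("on time", na=False)]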

all_games["did_I_win"].value_counts(dropna=False)

True     29
False    20
Name: did_I_win, dtype: int64
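
Before dropping rows it can help to see which kinds of endings are present. The sketch below does this in plain Python on a handful of termination strings copied from the table in Part 1; `termination_reason` is a hypothetical helper, and it assumes the text follows the "<player> won <reason>" pattern seen there.

```python
from collections import Counter

# termination strings taken from the first five games shown earlier
sample_terminations = [
    "abcdave won on time",
    "abcdave won by checkmate",
    "hetverschil won by resignation",
    "abcdave won by resignation",
    "abcdave won by resignation",
]


def termination_reason(text):
    """Return the part after ' won ', e.g. 'on time' or 'by resignation'."""
    _, _, reason = text.partition(" won ")
    # fall back to the full text if the pattern does not match
    return reason or text


# count how often each reason occurs
reason_counts = Counter(termination_reason(t) for t in sample_terminations)
print(reason_counts)

# the same substring test as the filter above, applied to a plain list
kept = [t for t in sample_terminations if "on time" not in t]
print(len(kept))
```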

Export

In [15]:
all_games.to_csv("pgn_games_parsed.csv", index=False)