Merge PGN games and opening data to annotate PGN games with the relevant opening system

In [1]:
import pandas as pd

games = pd.read_csv("pgn_games_parsed.csv")
openings = pd.read_csv("openings.csv")

In [2]:
games.head(2)

Unnamed: 0,Site,Date,Round,White,Black,Result,WhiteElo,BlackElo,TimeControl,EndDate,Termination,Moves,Variant,SetUp,FEN,did_I_win,my_colour
0,Chess.com,2017.10.18,-,agrazi,abcdave,0-1,1206,1494,1/432000,2017.12.03,abcdave won by checkmate,1. e4 c5 2. d4 cxd4 3. Qxd4 Nc6 4. Qc5 e5 5. Q...,,,,True,Black
1,Chess.com,2017.11.07,-,abcdave,hetverschil,0-1,1335,1384,1/432000,2017.12.17,hetverschil won by resignation,1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. O-O Nf6 5. d...,,,,False,White


In [5]:
openings.head(10)

Unnamed: 0,name,moves
0,A00 Polish (Sokolsky) opening,1. b4
1,"A00 Polish, Tuebingen variation",1. b4 Nh6
2,"A00 Polish, Outflank variation",1. b4 c6
3,A00 Benko's opening,1. g3
4,A00 Lasker simul special,1. g3 h5
5,"A00 Benko's opening, reversed Alekhine",1. g3 e5 2. Nf3
6,A00 Grob's attack,1. g4
7,"A00 Grob, spike attack",1. g4 d5 2. Bg2 c6 3. g5
8,"A00 Grob, Fritz gambit",1. g4 d5 2. Bg2 Bxg4 3. c4
9,"A00 Grob, Romford counter-gambit",1. g4 d5 2. Bg2 Bxg4 3. c4 d4


Joining these datasets is not straightforward: a pandas merge needs a key column with equal values on both sides, but a game matches an opening when its move list *starts with* the opening's moves. Instead we can:

1. Cross-merge the datasets (a Cartesian join, pairing every row in table A with every row in table B)
2. Keep only the rows that match our criterion (the game's move list starts with the opening moves in the same row)
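
The two steps above can be sketched on toy data. This is a minimal illustration, assuming pandas 1.2 or later, where `merge` accepts `how="cross"` and no helper join column is needed. The `toy_games` and `toy_openings` frames are made up for the example and are not the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
toy_games = pd.DataFrame({"Moves": ["1. e4 c5 2. d4", "1. b4 Nh6 2. Bb2"]})
toy_openings = pd.DataFrame({
    "name": ["A00 Polish (Sokolsky) opening", "B20-B99 Sicilian defence"],
    "moves": ["1. b4", "1. e4 c5"],
})

# Step 1: how="cross" pairs every game row with every opening row
toy_cross = toy_games.merge(toy_openings, how="cross")

# Step 2: keep only pairs where the game starts with the opening's moves
starts = [g.startswith(o) for g, o in zip(toy_cross["Moves"], toy_cross["moves"])]
toy_matches = toy_cross.loc[starts]

print(len(toy_cross), len(toy_matches))
```

The cross join grows as rows(A) × rows(B), which is fine for a few dozen games and a few hundred openings but would need a different approach for much larger tables.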

In [8]:
# create a fake join column in both datasets
games["join_field"] = 1
openings["join_field"] = 1

cross_joined = games.merge(openings, on="join_field")
print(len(games), len(openings), len(cross_joined))

49 318 15582


In [10]:
cross_joined.head(1)

Unnamed: 0,Site,Date,Round,White,Black,Result,WhiteElo,BlackElo,TimeControl,EndDate,Termination,Moves,Variant,SetUp,FEN,did_I_win,my_colour,join_field,name,moves
0,Chess.com,2017.10.18,-,agrazi,abcdave,0-1,1206,1494,1/432000,2017.12.03,abcdave won by checkmate,1. e4 c5 2. d4 cxd4 3. Qxd4 Nc6 4. Qc5 e5 5. Q...,,,,True,Black,1,A00 Polish (Sokolsky) opening,1. b4


Now work out whether each row contains a correct annotation.

In [12]:
def is_correct_annotation(row):
    # an annotation in our cross-joined data is correct
    # if the move list starts with the opening moves
    return row["Moves"].startswith(row["moves"])

cross_joined["is_correct_annotation"] = cross_joined.apply(is_correct_annotation, axis=1)
cross_joined.head(1)

Unnamed: 0,Site,Date,Round,White,Black,Result,WhiteElo,BlackElo,TimeControl,EndDate,...,Moves,Variant,SetUp,FEN,did_I_win,my_colour,join_field,name,moves,is_correct_annotation
0,Chess.com,2017.10.18,-,agrazi,abcdave,0-1,1206,1494,1/432000,2017.12.03,...,1. e4 c5 2. d4 cxd4 3. Qxd4 Nc6 4. Qc5 e5 5. Q...,,,,True,Black,1,A00 Polish (Sokolsky) opening,1. b4,False
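
One caveat with a raw `startswith` check is that it compares characters, not moves, so an opening ending in `O-O` would also match a game that continues `O-O-O`. A possible alternative, sketched below, compares whitespace-separated tokens instead. The helper name `starts_with_moves` and both move strings are hypothetical and chosen only to show the difference.

```python
def starts_with_moves(game_moves: str, opening_moves: str) -> bool:
    """Return True if the game begins with the opening, compared token by token."""
    game_tokens = game_moves.split()
    opening_tokens = opening_moves.split()
    return game_tokens[:len(opening_tokens)] == opening_tokens

# Hypothetical strings: the opening ends in "O-O", the game plays "O-O-O"
game = "1. d4 d5 2. Nc3 Nf6 3. Bf4 e6 4. Qd2 Be7 5. O-O-O"
opening = "1. d4 d5 2. Nc3 Nf6 3. Bf4 e6 4. Qd2 Be7 5. O-O"

print(game.startswith(opening))          # character-level check
print(starts_with_moves(game, opening))  # token-level check
```

Whether this matters depends on the openings file; with short opening lines the two checks will usually agree.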


Use the new boolean column to keep only the relevant rows.

In [13]:
annotated_games = cross_joined[cross_joined["is_correct_annotation"]].copy()
annotated_games.shape

(154, 21)

Now we need to de-duplicate our records because of the opening hierarchy (e.g. a game starting 1. e4 e5 matches both the 1. e4 opening and the 1. e4 e5 opening).

For our purposes, we want to go as *deep* as possible, so we will assume the **last** row for each game is what we want to keep. This relies on the openings file listing deeper lines after the shallower lines they extend.

Let's assume the two usernames + the date uniquely identify each game, but let's also test this assumption.

In [20]:
games.groupby(["White", "Black", "Date"]).size().sort_values(ascending=False)

White                 Black         Date      
komapc                abcdave       2020.12.29    1
abcdave               dtimaaar      2021.05.12    1
                      budapestcafe  2018.09.12    1
                      agrazi        2018.01.08    1
                                    2017.12.04    1
                      aerokoli5     2021.01.04    1
                      Tomi7771      2018.03.24    1
                                    2018.02.18    1
                      SpekmeisterT  2021.02.08    1
                      Salamonovi4   2021.07.14    1
                      OsoCloud      2021.02.14    1
                      Matheo_R      2021.04.04    1
                      Guschess123   2021.01.05    1
Xrv                   abcdave       2021.02.01    1
VilppuH               abcdave       2021.04.10    1
Tomi7771              abcdave       2018.02.28    1
                                    2018.01.20    1
Saryozek29108         abcdave       2021.02.15    1
NedalAlAgha      

Every group has a size of 1, so the combination of players + the starting date uniquely identifies a game, and we can use those 3 columns to deduplicate our annotated games.
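
The same uniqueness test can be written as a single check with `DataFrame.duplicated`. The sketch below uses a small made-up `toy_games` frame, not the real data. The key would stop being unique if the same two players started two games with the same colours on the same day.

```python
import pandas as pd

# Illustrative rows only; the real check would run on the games table
toy_games = pd.DataFrame({
    "White": ["abcdave", "abcdave", "agrazi"],
    "Black": ["agrazi", "agrazi", "abcdave"],
    "Date": ["2018.01.08", "2017.12.04", "2017.10.18"],
})

key = ["White", "Black", "Date"]

# duplicated() flags every row whose key already appeared in an earlier row
n_dupes = int(toy_games.duplicated(subset=key).sum())
print(n_dupes)
```

A count of zero means the three columns are safe to use as a deduplication key.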

In [21]:
final_games = annotated_games.drop_duplicates(subset=["White", "Black", "Date"],
                                              keep="last")
final_games.shape

(49, 21)
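
Keeping the last row works only if the openings are ordered so that deeper lines come later. If that ordering is in doubt, one option is to measure depth explicitly and keep the deepest match per game. The sketch below does this on made-up rows. The `depth` column is an assumption introduced here, and the rows are deliberately listed with the shallower opening last.

```python
import pandas as pd

# Illustrative matches for one game, with the shallower opening listed last
toy_annotated = pd.DataFrame({
    "White": ["agrazi", "agrazi"],
    "Black": ["abcdave", "abcdave"],
    "Date": ["2017.10.18", "2017.10.18"],
    "name": ["B20-B99 Sicilian defence", "B00 King's pawn opening"],
    "moves": ["1. e4 c5", "1. e4"],
})

key = ["White", "Black", "Date"]

# Depth = number of whitespace-separated tokens in the opening's move string
toy_annotated["depth"] = toy_annotated["moves"].str.split().str.len()

# Sort so the deepest opening is last within each game, then keep that row
deepest = (
    toy_annotated.sort_values("depth", kind="stable")
    .drop_duplicates(subset=key, keep="last")
)
print(deepest[["name", "moves"]])
```

Here `keep="last"` on the unsorted rows would have picked the 1. e4 row, while sorting by depth first selects the Sicilian line.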

In [26]:
final_games.head(1)

Unnamed: 0,Site,Date,Round,White,Black,Result,WhiteElo,BlackElo,TimeControl,EndDate,...,Moves,Variant,SetUp,FEN,did_I_win,my_colour,join_field,name,moves,is_correct_annotation
117,Chess.com,2017.10.18,-,agrazi,abcdave,0-1,1206,1494,1/432000,2017.12.03,...,1. e4 c5 2. d4 cxd4 3. Qxd4 Nc6 4. Qc5 e5 5. Q...,,,,True,Black,1,B20-B99 Sicilian defence,1. e4 c5,True


# Now for the actual analysis!

In [25]:
final_games["name"].value_counts().head()

B20-B99  Sicilian defence               11
C60-C99  Ruy Lopez (Spanish opening)     4
A45-A46  Queen's pawn game               4
C42-C43  Petrov's defence                3
C50  Giuoco Piano                        2
Name: name, dtype: int64

In [30]:
final_games.groupby("name")["did_I_win"].agg(["count", "mean"]).sort_values("mean", ascending=False)

Unnamed: 0_level_0,count,mean
name,Unnamed: 1_level_1,Unnamed: 2_level_1
C30-C39 King's gambit,2,1.0
"C20 KP, Indian opening",1,1.0
D20-D29 Queen's gambit accepted,1,1.0
"C46 Four knights, Gunsberg variation",1,1.0
"C41 Philidor, exchange variation",1,1.0
C40 King's knight opening,1,1.0
C23-C24 Bishop's opening,1,1.0
E60-E99 King's Indian defence,2,1.0
B10-B19 Caro-Kann defence,2,1.0
B01 Scandinavian (centre counter) defence,1,1.0


In [32]:
final_games.groupby(["name", "my_colour"])["did_I_win"].agg(["count", "mean"])

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean
name,my_colour,Unnamed: 2_level_1,Unnamed: 3_level_1
A10-A39 English opening,Black,1,1.0
A10-A39 English opening,White,1,0.0
A45-A46 Queen's pawn game,Black,4,0.5
"A48-A49 King's Indian, East Indian defence",Black,1,0.0
B00 King's pawn opening,White,1,1.0
B00 Owen defence,White,1,1.0
B01 Scandinavian (centre counter) defence,White,1,1.0
B10-B19 Caro-Kann defence,Black,1,1.0
B10-B19 Caro-Kann defence,White,1,1.0
B20-B99 Sicilian defence,Black,5,0.8
