# Description
This notebook builds a basic SGD Classifier and uses it to predict the winner of a chess game. The [dataset](https://www.kaggle.com/datasets/adpawel810/own-chess-games) was gathered by me and contains only my own online games. The main goal was to practice the methods from the labs once more and to work with data that is not easy to predict.

# Imports

In [44]:
import pandas as pd
import re
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Reading data
Opening names are shortened to their first two hyphen-separated parts, so that variations are grouped under their main opening and the number of columns is eventually reduced.

In [3]:
def clean_opening(opening):
    """Keep only the first two hyphen-separated parts of an opening name."""
    parts = opening.split("-")
    return "-".join(parts[:2])

# Reading
games = pd.read_json("chess_data.json")

# Apply function to 'opening' column
games["opening"] = games["opening"].apply(clean_opening)
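
To make the effect of `clean_opening` concrete, here is a small sketch on a few opening names. The input strings are illustrative assumptions about how names might look in the raw data, not values taken from the dataset.

```python
def clean_opening(opening):
    # Keep only the first two hyphen-separated parts of the name.
    parts = opening.split("-")
    return "-".join(parts[:2])

# Hypothetical raw names, used only for illustration.
examples = [
    "Sicilian-Defense-Najdorf-Variation",
    "Van-t-Kruijs-Opening",
    "Undefined",
]
cleaned = [clean_opening(name) for name in examples]
print(cleaned)
```

Names whose main opening is longer than two words are cut short by this rule, which would explain labels such as `Van-t` in the output.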


Check that the data has loaded properly.

In [4]:
games.head()

Unnamed: 0,rated,turns,victory_status,winner,time_class,white_id,white_rating,black_id,black_rating,opening
0,True,53,Time forfeit,White,blitz,Pablo_810,1162,ahmed8909,838,Pirc-Defense
1,True,67,Resign,White,blitz,MichaelMikeCorleone,1099,Pablo_810,987,Van-t
2,True,30,Resign,Black,blitz,POLCIE,997,Pablo_810,1095,Kings-Fianchetto
3,True,44,Resign,Black,blitz,Pablo_810,1033,contakto,1181,Bishops-Opening
4,True,69,Time forfeit,White,blitz,Fernando2017p,1088,Pablo_810,976,Queens-Pawn


In [5]:
games.dtypes

Unnamed: 0,0
rated,bool
turns,int64
victory_status,object
winner,object
time_class,object
white_id,object
white_rating,int64
black_id,object
black_rating,int64
opening,object


# Adjusting columns for learning
The goal is to end up with only float or int columns.

First, games are categorized by the color I played. On their own, white_id and black_id are redundant, because one of them is always my account, but together they can be merged into a single played_as column.

In [6]:
games["played_as"] = np.where(games["white_id"] == "Pablo_810", "White", "Black")
games.head()

Unnamed: 0,rated,turns,victory_status,winner,time_class,white_id,white_rating,black_id,black_rating,opening,played_as
0,True,53,Time forfeit,White,blitz,Pablo_810,1162,ahmed8909,838,Pirc-Defense,White
1,True,67,Resign,White,blitz,MichaelMikeCorleone,1099,Pablo_810,987,Van-t,Black
2,True,30,Resign,Black,blitz,POLCIE,997,Pablo_810,1095,Kings-Fianchetto,Black
3,True,44,Resign,Black,blitz,Pablo_810,1033,contakto,1181,Bishops-Opening,White
4,True,69,Time forfeit,White,blitz,Fernando2017p,1088,Pablo_810,976,Queens-Pawn,Black


Next, the categorical columns (object and bool) are converted to float columns by dummy encoding.

In [7]:
games = games.drop(['white_id', 'black_id'], axis=1)

victory_dummy = pd.get_dummies(data = games['victory_status'], drop_first = False, dtype = float)
games = pd.concat([games.drop('victory_status', axis = 1), victory_dummy], axis = 1)

rated_dummy = pd.get_dummies(data = games['rated'], drop_first = False, dtype = float)
games = pd.concat([games.drop('rated', axis = 1), rated_dummy], axis = 1)

time_dummy = pd.get_dummies(data = games['time_class'], drop_first = False, dtype = float)
games = pd.concat([games.drop('time_class', axis = 1), time_dummy], axis = 1)

played_as_dummy = pd.get_dummies(data = games['played_as'], drop_first = False, dtype = float)
games = pd.concat([games.drop('played_as', axis = 1), played_as_dummy], axis = 1)

games.head()

Unnamed: 0,turns,winner,white_rating,black_rating,opening,Draw,Mate,Resign,Time forfeit,False,True,blitz,bullet,daily,rapid,Black,White
0,53,White,1162,838,Pirc-Defense,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1,67,White,1099,987,Van-t,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
2,30,Black,997,1095,Kings-Fianchetto,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,44,Black,1033,1181,Bishops-Opening,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
4,69,White,1088,976,Queens-Pawn,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
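
The four encode-and-concatenate steps could also be written as a single call, since `pd.get_dummies` accepts a `columns` list. This is only a sketch on a toy frame with made-up rows, not the notebook's data. Note that with `columns=` pandas prefixes each new column with the source column name (for example `rated_True`), so the resulting names differ from the ones shown in the output.

```python
import pandas as pd

# Toy rows with the same categorical columns as the notebook (values are made up).
toy = pd.DataFrame({
    "turns": [53, 67, 30],
    "victory_status": ["Time forfeit", "Resign", "Resign"],
    "rated": [True, True, False],
    "time_class": ["blitz", "blitz", "rapid"],
    "played_as": ["White", "Black", "Black"],
})

# Encode all categorical columns in one call; other columns pass through unchanged.
encoded = pd.get_dummies(
    toy,
    columns=["victory_status", "rated", "time_class", "played_as"],
    dtype=float,
)
print(encoded.columns.tolist())
```

The prefixed names also avoid having columns literally called `False` and `True`, which can be confusing when reading the table.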


The two rating columns are replaced by a single column holding their difference (white rating minus black rating).

In [8]:
games['rating_diff'] = games['white_rating'] - games['black_rating']

In [9]:
games = games.drop(columns=['black_rating', 'white_rating'])
games.head()

Unnamed: 0,turns,winner,opening,Draw,Mate,Resign,Time forfeit,False,True,blitz,bullet,daily,rapid,Black,White,rating_diff
0,53,White,Pirc-Defense,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,324
1,67,White,Van-t,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,112
2,30,Black,Kings-Fianchetto,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,-98
3,44,Black,Bishops-Opening,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,-148
4,69,White,Queens-Pawn,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,112


In [10]:
opening_dummy = pd.get_dummies(data = games['opening'], dtype = float)
games = pd.concat([games.drop('opening', axis = 1), opening_dummy], axis = 1)
games.head()

Unnamed: 0,turns,winner,Draw,Mate,Resign,Time forfeit,False,True,blitz,bullet,...,Tarrasch-Defense,The-Wrongcloud,Three-Knights,Torre-Attack,Trompowsky-Attack,Undefined,Van-Geet,Van-t,Vienna-Game,Ware-Opening
0,53,White,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,67,White,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,30,Black,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,44,Black,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,69,White,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now the dataframe contains only float and int columns (apart from the target column 'winner'), so we can start learning.

In [11]:
games.dtypes

Unnamed: 0,0
turns,int64
winner,object
Draw,float64
Mate,float64
Resign,float64
...,...
Undefined,float64
Van-Geet,float64
Van-t,float64
Vienna-Game,float64


# Classification with SGD Classifier

First, we split the dataset into training and test sets.

Second, we use the [SGD Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) to fit a linear model with stochastic gradient descent.

In [58]:
X = games.drop('winner', axis=1).values
y = games['winner'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

sgd_clf = SGDClassifier()
sgd_clf.fit(X_train, y_train)
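
SGD-based linear models are sensitive to feature scale, and here `turns` and `rating_diff` span much larger ranges than the 0/1 dummy columns. One possible variation, which is an assumption of this sketch and not what the notebook runs, is to standardize the features in a pipeline and fix `random_state` so that results are repeatable. The data is synthetic (`make_classification`), so the printed accuracy says nothing about the chess dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in for the games features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)
# Blow up one feature's scale to mimic a column like rating_diff.
X[:, 0] *= 500

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then reused for the test split.
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
scaled_sgd.fit(X_train, y_train)
test_acc = scaled_sgd.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_acc:.4f}")
```

Whether scaling helps on the real games would have to be checked by rerunning the cells with such a pipeline in place of the plain `SGDClassifier()`.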

Next, we make predictions with this classifier on the training and test data.

In [55]:
y_train_pred = sgd_clf.predict(X_train)
y_test_pred = sgd_clf.predict(X_test)

# accuracy_score returns the fraction of correctly predicted labels
acc_train = accuracy_score(y_train, y_train_pred)
acc_test = accuracy_score(y_test, y_test_pred)

As we can see, the accuracy is not great.

In [56]:
print(f'Training set accuracy: {acc_train:.4f}, Test set accuracy: {acc_test:.4f}')

Training set accuracy: 0.6317, Test set accuracy: 0.6445


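The confusion matrices below (rows are true classes, columns are predicted classes, in the order Black, Draw, White) show two problems. A large share of White wins is predicted as Black, and Draw is predicted far too often: most games labeled as Draw were actually won by one side. The latter may be caused by the small number of draw samples.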

In [57]:
print(confusion_matrix(y_train, y_train_pred))
print(confusion_matrix(y_test, y_test_pred))

[[2865  161  311]
 [ 110  245   24]
 [1695  334 1410]]
[[706  34  87]
 [ 25  69   4]
 [411  75 378]]
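
To read the matrices more easily, per-class recall and precision can be derived from them directly. The sketch copies the test-set matrix from the output and assumes the label order Black, Draw, White, which is the sorted order scikit-learn uses for string labels.

```python
import numpy as np

# Test-set confusion matrix from the output (rows: true, columns: predicted).
cm = np.array([
    [706, 34, 87],
    [25, 69, 4],
    [411, 75, 378],
])
labels = ["Black", "Draw", "White"]  # assumed sorted label order

diag = np.diag(cm)
recall = diag / cm.sum(axis=1)     # share of each true class that was found
precision = diag / cm.sum(axis=0)  # share of each predicted class that was right

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

For this matrix, White recall comes out at about 0.44 and Draw precision at about 0.39, the two weakest values.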


# Classification with Decision Tree Classifier

In [63]:
X = games.drop('winner', axis=1).values
y = games['winner'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
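
Because draws are much rarer than decisive games, a plain random split can give the training and test sets slightly different class proportions. `train_test_split` has a `stratify` parameter that keeps them aligned. The notebook does not use it, so this is only a sketch on made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 45% Black, 10% Draw, 45% White.
y = np.array(["Black"] * 450 + ["Draw"] * 100 + ["White"] * 450)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With stratification the draw share is the same in both splits.
train_draw_share = np.mean(y_train == "Draw")
test_draw_share = np.mean(y_test == "Draw")
print(f"train: {train_draw_share:.3f}, test: {test_draw_share:.3f}")
```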

In [64]:
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

In [65]:
predictions = dtree.predict(X_test)

In [66]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

       Black       0.71      0.72      0.71      1249
        Draw       0.86      0.83      0.85       136
       White       0.73      0.72      0.72      1299

    accuracy                           0.72      2684
   macro avg       0.77      0.76      0.76      2684
weighted avg       0.72      0.72      0.72      2684



In [67]:
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy {accuracy:.4f}')

Accuracy 0.7243


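After repeating the same steps as for the SGD Classifier (here with a 30% test split), we see that the Decision Tree Classifier achieves better accuracy. According to the classification report, draws are now the best-recognized class (precision 0.86, recall 0.83), and most of the remaining errors are confusions between Black and White wins.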

In [68]:
print(confusion_matrix(y_test, predictions))

[[900   7 342]
 [ 12 113  11]
 [357  11 931]]
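
An unconstrained decision tree typically fits its training data almost perfectly, so comparing training and test accuracy, optionally with a `max_depth` limit, can indicate how much of the score is memorization. The notebook reports only test metrics. The sketch uses synthetic data and an arbitrary `max_depth=5`, both assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the games features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Compare a fully grown tree (max_depth=None) with a depth-limited one.
results = {}
for depth in (None, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.4f}, test={results[depth][1]:.4f}")
```

A large gap between the two numbers for the fully grown tree would suggest trying a depth limit on the chess data as well.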
