# <h1 align="center">The Chess Whisperers</h1>

## <h2 style="text-align: center;">Problem Definition: Predicting Draw or Defeat</h2>

### <h5>Background</h5>

Chess is a strategic board game played between two players, and understanding the factors that contribute to winning a game can provide valuable insight into player performance. A dataset of 6.25 million chess games played on lichess.org during July 2016, available on <a href="https://www.kaggle.com/datasets/arevel/chess-games">Kaggle</a>, presents an opportunity to develop a predictive model.



<h5>Problem Statement</h5>



The primary objective of this project is to predict the result of a game (a White win, a Black win, or a draw) from features such as player ratings, opening, and time control.

<h5>Dataset & Description </h5>

The dataset consists of 6.25 million chess games played on lichess.org in July 2016. It includes game attributes such as event type, player IDs, game result, date and time, player ratings, opening classification, time control, termination reason, and moves. Some games include Stockfish analysis evaluations. For more specific details, please refer to the original dataset source at lichess.org.


The dataset consists of the following attributes for each chess game:

Event: The type of game.\
White: White player's ID.\
Black: Black player's ID.\
Result: Game result (1-0 for a White win, 0-1 for a Black win, 1/2-1/2 for a draw).\
UTCDate: Date of the game in UTC.\
UTCTime: Time of the game in UTC.\
WhiteElo: White player's Elo rating.\
BlackElo: Black player's Elo rating.\
WhiteRatingDiff: Change in White player's rating after the game.\
BlackRatingDiff: Change in Black player's rating after the game.\
ECO: Opening in ECO encoding.\
Opening: Opening name.\
TimeControl: Time control in seconds, written as base time + increment (for example 300+5).\
Termination: Reason for the game's end.\
AN: Moves in Movetext format.



## PySpark Environment Setup and Version Information

This code snippet sets up the PySpark environment. It imports the necessary libraries, configures the SparkSession, and displays the Spark version.

In [1]:
# Suppress Hadoop INFO logging
!sed -i 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/' /usr/hadoop-3.3.2/etc/hadoop/log4j.properties

In [2]:
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pyspark.sql.functions import *
from itertools import chain
from pyspark.ml.feature import * 
from pyspark.ml.classification import *
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.types import ArrayType, StringType  # used by the UDFs later on

conf = pyspark.SparkConf().setAll([('spark.master', 'local[10]'),
                                   ('spark.app.name', 'PySpark DataFrame Demo'),
                                   ('spark.driver.memory', '8g'),
                                   ('spark.executor.memory', '8g'),
                                   ('spark.dynamicAllocation.enabled', 'true')])
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.version

2023-06-06 01:57:28,824 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'3.3.1'

## Reading a CSV File into a PySpark DataFrame

This code snippet demonstrates reading a CSV file into a PySpark DataFrame. It utilizes the `spark.read.csv()` method with specified parameters to read the file and infer the schema.


In [3]:
# Read the data file, using the first row as the header and inferring column types
data = spark.read.csv(path='file:///home/work/Chess Whisperers/chess_games.csv', header=True, inferSchema=True)

                                                                                

## Dropping Rows with Null Values from a PySpark DataFrame



In [4]:
# Drop null
data = data.dropna()

## Dropping Columns from a PySpark DataFrame



In [5]:
# Drop the columns
data = data.drop('White', 'Black', 'UTCDate', 'UTCTime')

## Filtering Rows based on Column Values in a PySpark DataFrame


In [6]:
data = data.filter(data.Termination!='Abandoned')
data = data.filter(data.Termination!='Rule Infraction')

## Modifying a Column in a PySpark DataFrame


In [7]:
df = data.withColumn('Event', trim(data.Event))

In [8]:
df.show()

+----------------+------+--------+--------+---------------+---------------+---+--------------------+-----------+------------+--------------------+
|           Event|Result|WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|             Opening|TimeControl| Termination|                  AN|
+----------------+------+--------+--------+---------------+---------------+---+--------------------+-----------+------------+--------------------+
|       Classical|   1-0|    1901|    1896|           11.0|          -11.0|D10|        Slav Defense|      300+5|Time forfeit|1. d4 d5 2. c4 c6...|
|           Blitz|   0-1|    1641|    1627|          -11.0|           12.0|C20|King's Pawn Openi...|      300+0|      Normal|1. e4 e5 2. b3 Nf...|
|Blitz tournament|   1-0|    1647|    1688|           13.0|          -13.0|B01|Scandinavian Defe...|      180+0|Time forfeit|1. e4 d5 2. exd5 ...|
|  Correspondence|   1-0|    1706|    1317|           27.0|          -25.0|A00|Van't Kruijs Opening|          -|      

## Predicting game outcomes

### Data Prep for analysis

In [9]:
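# Break down the 'AN' column (movetext) into one element per move pair
# Remove move numbers such as "1." so that only the moves remain
df = df.withColumn("Moves", regexp_replace(col("AN"), "\\d+\\.", ""))
# Removing the numbers leaves two spaces between consecutive move pairs
df = df.withColumn("Moves", split(col("Moves"), "  "))
# The movetext has no commas, so this wraps each pair in a single-element array
df = df.withColumn("MoveArray", transform(col("Moves"), lambda x: split(x, ",")))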
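To make the three transformations above concrete, here is a minimal pure-Python sketch that mimics them on one short movetext string. The sample game and the helper name `split_movetext` are illustrative assumptions and not part of the notebook. It suggests that every element of `MoveArray` is a one-item list holding a "white black" pair, and that the trailing result token stays attached to the last pair.

```python
import re

def split_movetext(an):
    """Mimic the regexp_replace / split / transform chain on one AN string."""
    no_numbers = re.sub(r"\d+\.", "", an)        # drop move numbers such as "1."
    pairs = no_numbers.split("  ")               # two spaces now separate move pairs
    return [pair.split(",") for pair in pairs]   # no commas, so each pair is wrapped in a list

sample = "1. d4 d5 2. c4 c6 3. Nf3 1-0"   # hypothetical game, White wins
move_array = split_movetext(sample)
print(move_array)
```

Because the result token survives this parsing, any feature built from the last element should be checked so that it does not leak the game outcome.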

### Using Mapping, String Indexer to convert String features to numeric

In [10]:
#String 'event' to integer
eventDict = {'Bullet':0,'Classical tournament':1,'Bullet tournament':2,'Blitz':3,'Classical':4,'Blitz tournament':5, 'Correspondence':6}
mapping_expr = create_map([lit(x) for x in chain(*eventDict.items())])
df = df.withColumn('Event', mapping_expr[df['Event']])

#String 'Result' to integer
resDict = {'1-0':0,'0-1':1,'1/2-1/2':2}
mapping_expr = create_map([lit(x) for x in chain(*resDict.items())])
df = df.withColumn('Result', mapping_expr[df['Result']])

#String 'termination' to integer
termDict = {'Normal':0,'Time forfeit':1}
mapping_expr = create_map([lit(x) for x in chain(*termDict.items())])
df = df.withColumn('Termination', mapping_expr[df['Termination']])

#Breaking down TimeControl into Time (base seconds) and Control (increment seconds) columns
df = df.withColumn("Time", split(col("TimeControl"), r"\+").getItem(0).cast("int"))
df = df.withColumn("Control", split(col("TimeControl"), r"\+").getItem(1).cast("int"))

#Using string indexer to convert openings to integers
opening_indexer = StringIndexer(inputCol="Opening", outputCol="openingIndex")
df = opening_indexer.fit(df).transform(df)

#Using string indexer to convert "ECO" to integers
eco_indexer = StringIndexer(inputCol="ECO", outputCol="ecoIndex")
df = eco_indexer.fit(df).transform(df)

#dropping nulls before vectorizing
df = df.dropna()
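The `TimeControl` split deserves a closer look, because the sample output earlier shows a Correspondence game whose `TimeControl` is `-`. The following pure-Python sketch (the helper names `to_int_or_none` and `parse_time_control` are hypothetical) mirrors the split-and-cast logic under the assumption that Spark runs with its default, non-ANSI casting, where a failed cast to `int` yields null rather than an error.

```python
def to_int_or_none(text):
    """Return int(text), or None when the text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None

def parse_time_control(tc):
    """Split 'base+increment' and convert both parts, like split(...).getItem(i).cast('int')."""
    parts = tc.split("+")
    base = to_int_or_none(parts[0])
    increment = to_int_or_none(parts[1]) if len(parts) > 1 else None
    return base, increment

for tc in ["300+5", "180+0", "-"]:
    print(tc, parse_time_control(tc))
```

Under that assumption, rows with a `-` time control end up with nulls in `Time` and `Control`, so the final `dropna()` in the cell above would remove them from the modelling data.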

                                                                                

### Creating a vectorized column using Vector Assembler

In [11]:
inputCols1 = ['Event','Termination','WhiteElo','BlackElo','openingIndex','Time','Control']
outputCol = 'features'
vec = VectorAssembler(inputCols = inputCols1, outputCol = outputCol)

### Scaling data in order for the classification model to interpret the features on the same scale

In [12]:
std = StandardScaler()
std.setInputCol("features")
std.setOutputCol("scaledFeatures")

StandardScaler_4b9e3473e7c3
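As a rough illustration of what this scaler does with its default settings (`withStd=True`, `withMean=False`), the sketch below divides a feature by its sample standard deviation without centering it. The ratings are made-up values and the function name `scale_to_unit_std` is an assumption for this example only.

```python
from statistics import stdev

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation, without subtracting the mean."""
    s = stdev(values)               # corrected (n - 1) sample standard deviation
    return [v / s for v in values]

white_elo = [1500, 1700, 1900]      # made-up ratings
scaled = scale_to_unit_std(white_elo)
print(scaled)
print(stdev(scaled))
```

Since the mean is not subtracted, the scaled values stay non-negative, which matters for models such as Naive Bayes that reject negative feature values.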

### Splitting data into Train and Test sets using sampleBy to do a stratified split

In [13]:
train = df.sampleBy("Result", fractions={0: 0.6, 1: 0.6, 2:0.6}, seed=30)
non_train = df.subtract(train)

val = non_train.sampleBy("Result", fractions={0: 0.5, 1: 0.5, 2:0.5}, seed=30)
test = non_train.subtract(val)
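The idea behind this split can be sketched in plain Python. The example assumes that `sampleBy` keeps each row independently with the probability assigned to its class, so the resulting proportions are approximate rather than exact. The row ids, labels, and the helper `sample_by` are made up for illustration.

```python
import random
from collections import Counter

def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[key(row)]]

# Made-up, imbalanced labels: 0 = White win, 1 = Black win, 2 = draw
labels = [0] * 500 + [1] * 450 + [2] * 50
rows = list(enumerate(labels))      # (row id, label) pairs

train = sample_by(rows, key=lambda r: r[1], fractions={0: 0.6, 1: 0.6, 2: 0.6}, seed=30)
train_ids = {r[0] for r in train}
rest = [r for r in rows if r[0] not in train_ids]   # complement by row id

print(Counter(label for _, label in train))
print(Counter(label for _, label in rest))
```

The sketch takes the complement by row id. `DataFrame.subtract` instead behaves like SQL `EXCEPT DISTINCT`, so fully identical rows are collapsed, which is worth keeping in mind when comparing set sizes.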

## Model Selection

**Model Selection using validation set**

In [14]:
#LogisticRegression
lr = LogisticRegression(featuresCol = 'scaledFeatures', labelCol = 'Result')
#DecisionTreeClassifier - maxBins = 3000 because the 'Opening' column has 2940 distinct values
dt = DecisionTreeClassifier(featuresCol = 'scaledFeatures', labelCol = 'Result')
dt.setMaxBins(3000)
#RandomForestClassifier - maxBins = 3000 because the 'Opening' column has 2940 distinct values
rf = RandomForestClassifier(featuresCol = 'scaledFeatures', labelCol = 'Result')
rf.setMaxBins(3000)
#NaiveBayes
nb = NaiveBayes(featuresCol = 'scaledFeatures', labelCol = 'Result')

#Building pipeline with stages - VectorAssembler, StandardScaler, Classification model
lr_pipeline = Pipeline(stages=[vec, std, lr])
lr_model = lr_pipeline.fit(train)

dt_pipeline = Pipeline(stages=[vec, std, dt])
dt_model = dt_pipeline.fit(train)

rf_pipeline = Pipeline(stages=[vec, std, rf])
rf_model = rf_pipeline.fit(train)

nb_pipeline = Pipeline(stages=[vec, std, nb])
nb_model = nb_pipeline.fit(train)

2023-06-06 01:58:46,422 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
2023-06-06 01:58:46,424 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
                                                                                

**Using Multiclass Classification Evaluation for determining validation error**


**Accuracy**

Each classifier is evaluated on the validation set with the multiclass classification evaluator, starting with the accuracy metric. Logistic regression scores highest, with an accuracy of about 63.1%.

In [15]:
evaluator = MulticlassClassificationEvaluator()
evaluator.setPredictionCol("prediction")
evaluator.setLabelCol("Result")

print("Logistic Regression:\t", evaluator.evaluate(lr_model.transform(val), {evaluator.metricName: "accuracy"}))

print("Decision Tree Classifier:\t", evaluator.evaluate(dt_model.transform(val), {evaluator.metricName: "accuracy"}))

print("Random Forest:\t", evaluator.evaluate(rf_model.transform(val), {evaluator.metricName: "accuracy"}))

print("Naive Bayes:\t",evaluator.evaluate(nb_model.transform(val), {evaluator.metricName: "accuracy"}))

                                                                                

Logistic Regression:	 0.6307109071905614


                                                                                

Decision Tree Classifier:	 0.6130811561200054


                                                                                

Random Forest:	 0.6191896466480604




Naive Bayes:	 0.5284891108466371


                                                                                

**Trying other metrics, since accuracy is not the right metric for imbalanced data (class 2, draws, is very rare)**

**F1 score**

In [16]:
print("Logistic Regression:\t", evaluator.evaluate(lr_model.transform(val), {evaluator.metricName: "f1"}))

print("Decision Tree Classifier:\t", evaluator.evaluate(dt_model.transform(val), {evaluator.metricName: "f1"}))

print("Random Forest:\t", evaluator.evaluate(rf_model.transform(val), {evaluator.metricName: "f1"}))

print("Naive Bayes:\t",evaluator.evaluate(nb_model.transform(val), {evaluator.metricName: "f1"}))

                                                                                

Logistic Regression:	 0.6178412483928153


                                                                                

Decision Tree Classifier:	 0.6005251128522024


                                                                                

Random Forest:	 0.6041533625348708


                                                                                

Naive Bayes:	 0.48960747631117685


**WeightedPrecision**

In [17]:
print("Logistic Regression:\t", evaluator.evaluate(lr_model.transform(val), {evaluator.metricName: "weightedPrecision"}))

print("Decision Tree Classifier:\t", evaluator.evaluate(dt_model.transform(val), {evaluator.metricName: "weightedPrecision"}))

print("Random Forest:\t", evaluator.evaluate(rf_model.transform(val), {evaluator.metricName: "weightedPrecision"}))

print("Naive Bayes:\t",evaluator.evaluate(nb_model.transform(val), {evaluator.metricName: "weightedPrecision"}))

                                                                                

Logistic Regression:	 0.6064023290336455


                                                                                

Decision Tree Classifier:	 0.5893189396889758


                                                                                

Random Forest:	 0.5972605250108711




Naive Bayes:	 0.5172610088691507
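To see why accuracy alone can flatter a model on imbalanced classes, the sketch below computes accuracy, support-weighted precision, and support-weighted F1 by hand. The labels and predictions are a made-up toy set in which the rare class 2 is never predicted, and `weighted_metrics` is a hypothetical helper rather than the evaluator's own code.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted F1), weighting classes by support."""
    n = len(y_true)
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(hits.values()) / n
    w_precision = 0.0
    w_f1 = 0.0
    for label, count in support.items():
        tp = hits[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        w_precision += count / n * precision
        w_f1 += count / n * f1
    return accuracy, w_precision, w_f1

# Toy data: five White wins, four Black wins, one draw that is never predicted
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
accuracy, w_precision, w_f1 = weighted_metrics(y_true, y_pred)
print(accuracy, w_precision, w_f1)
```

In this toy case both weighted scores fall below the accuracy because the missed draw class contributes zero precision and zero F1.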


                                                                                

## Improving the model with new features extracted from existing data

In [18]:
#Using UDFs to split each move pair into White's move and Black's move
split_white_udf = udf(lambda arr: [elem.split()[0] for [elem] in arr], ArrayType(StringType()))
split_black_udf = udf(lambda arr: [elem.split()[1] for [elem] in arr], ArrayType(StringType()))

df = df.withColumn("WhiteMoves", split_white_udf(df.MoveArray))
df = df.withColumn("BlackMoves", split_black_udf(df.MoveArray))

#Determining number of checks on each side
df = df.withColumn("WhiteChecks", transform(col("WhiteMoves"), lambda x: length(regexp_replace(x, "[^+]", ""))))
df = df.withColumn("NumOfWhiteChecks", aggregate("WhiteChecks", lit(0), lambda acc, x: acc + x))

df = df.withColumn("BlackChecks", transform(col("BlackMoves"), lambda x: length(regexp_replace(x, "[^+]", ""))))
df = df.withColumn("NumOfBlackChecks", aggregate("BlackChecks", lit(0), lambda acc, x: acc + x))

#Determining if castling happened on each side ('O-O' is kingside, 'O-O-O' is queenside)
df = df.withColumn("WhiteKingCastling", expr("array_contains(WhiteMoves, 'O-O')"))
df = df.withColumn("BlackKingCastling", expr("array_contains(BlackMoves, 'O-O')"))

df = df.withColumn("WhiteQueenCastling", expr("array_contains(WhiteMoves, 'O-O-O')"))
df = df.withColumn("BlackQueenCastling", expr("array_contains(BlackMoves, 'O-O-O')"))

#Encoding each flag as an integer: 0 if the side castled that way, 1 otherwise
df = df.withColumn("WhiteQueenCastling", when(col("WhiteQueenCastling"), 0).otherwise(1))
df = df.withColumn("BlackQueenCastling", when(col("BlackQueenCastling"), 0).otherwise(1))

df = df.withColumn("WhiteKingCastling", when(col("WhiteKingCastling"), 0).otherwise(1))
df = df.withColumn("BlackKingCastling", when(col("BlackKingCastling"), 0).otherwise(1))

#dropping nulls before vectorizing
df = df.dropna()
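The check and castling features can be illustrated on a single made-up move list. In standard algebraic notation `O-O` denotes kingside castling and `O-O-O` queenside castling. The helpers `count_checks` and `castled` are hypothetical names for this sketch.

```python
def count_checks(moves):
    """Count '+' symbols across one side's moves (a mating move marked '#' is not counted)."""
    return sum(move.count("+") for move in moves)

def castled(moves):
    """Return (kingside, queenside) flags using exact matches, as array_contains does."""
    return "O-O" in moves, "O-O-O" in moves

white_moves = ["e4", "Nf3", "Bc4", "O-O", "Bxf7+", "Qh5+"]   # made-up move list

print(count_checks(white_moves))
print(castled(white_moves))
print(castled(["e4", "O-O+"]))   # castling that gives check is not an exact match
```

The last call hints at a limitation of exact matching: a castling move annotated with `+` or `#` would go undetected unless the annotation is stripped first.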

In [19]:
inputCols1 = ['Event','Termination','openingIndex','Time','Control','WhiteElo','BlackElo','NumOfWhiteChecks','NumOfBlackChecks','WhiteQueenCastling','BlackQueenCastling','WhiteKingCastling','BlackKingCastling']
outputCol = 'new_features'
vec = VectorAssembler(inputCols = inputCols1, outputCol = outputCol)
df = vec.transform(df)

In [21]:
std = StandardScaler()
std.setInputCol("new_features")
std.setOutputCol("scalednewFeatures")
std_df = std.fit(df).transform(df)

                                                                                

In [22]:
train = std_df.sampleBy("Result", fractions={0: 0.7, 1: 0.7, 2:0.7}, seed=30)
test = std_df.subtract(train)

In [None]:
lr = LogisticRegression(featuresCol = 'scalednewFeatures', labelCol = 'Result')
lr_model = lr.fit(train)



In [None]:
evaluator = MulticlassClassificationEvaluator()
evaluator.setPredictionCol("prediction")
evaluator.setLabelCol("Result")
evaluator.evaluate(lr_model.transform(test), {evaluator.metricName: "accuracy"})

**Trying other metrics, since accuracy is not the right metric for imbalanced data (class 2, draws, is very rare)**

In [None]:
evaluator.evaluate(lr_model.transform(test), {evaluator.metricName: "f1"})

## Binary Classification Evaluation (Just to feed the curiosity)

The model works slightly better with just two classes, so we drop class 2 (draws) and evaluate it on White wins versus Black wins only.
In this evaluation, the logistic regression classifier achieved a ROC AUC of 70%.
ROC AUC quantifies the classifier's ability to rank positive instances above negative instances across all classification thresholds. A higher value indicates better discrimination between the classes, with 50% corresponding to random ranking.
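The ranking interpretation can be made concrete with a small pure-Python sketch: AUC is the share of (positive, negative) pairs in which the positive instance receives the higher score, with ties counting as half. The labels, scores, and the helper `roc_auc` are made up for illustration, and the sketch does not claim to reproduce the evaluator's exact numerics.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]              # made-up model scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]    # thresholded 0/1 predictions

auc_scores = roc_auc(labels, scores)
auc_hard = roc_auc(labels, hard)
print(auc_scores, auc_hard)
```

The two values differ because thresholded predictions discard the ranking within each predicted class. The same effect applies when an evaluator is given a hard `prediction` column instead of continuous scores.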

In [None]:
train = train.filter(train.Result!=2)
test = test.filter(test.Result!=2)

In [None]:
lr_model = lr.fit(train)
y_pred = lr_model.transform(test)

In [None]:
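# Note: 'prediction' holds hard 0/1 labels, so this AUC reflects a single threshold;
# passing rawPredictionCol='rawPrediction' would score the full ROC curve instead.
res = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='Result')
res.evaluate(y_pred)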

In [None]:
lr_model.coefficients

**The coefficients indicate the direction and relative strength of each standardized feature's effect. The newly added features NumOfWhiteChecks and NumOfBlackChecks carry noticeable weight.**
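One way to read a coefficient vector is to pair it with the feature names and sort by absolute value. The sketch below uses placeholder names and placeholder numbers that are NOT results from this notebook. In the notebook itself the inputs would presumably be `inputCols1` and `lr_model.coefficients.toArray()`.

```python
def rank_by_magnitude(names, coefficients):
    """Pair feature names with coefficients, largest absolute value first."""
    return sorted(zip(names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)

names = ["feat_a", "feat_b", "feat_c", "feat_d"]   # placeholder feature names
placeholder = [0.1, -0.4, 0.3, -0.2]               # placeholder values, not real results

ranked = rank_by_magnitude(names, placeholder)
for name, value in ranked:
    print(f"{name:>8}  {value:+.2f}")
```

Comparing magnitudes this way is only meaningful because the features were standardized before fitting.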

## Future Work

* Collect more descriptive data, analyse it, and redo model selection with hyperparameter tuning

* Incorporate player dynamics: analyse player ratings over time to capture performance trends or fluctuations that may affect game outcomes

* Analyse how the time invested in each move affects the result

* Analyse whether particular moves boost player ratings

* Determine whether blunders or brilliant moves correlate more strongly with the result
