# Lichess Project

_Large-scale data and processing_

Zoé Marquis & Charlotte Kruzic

TODO: present the project's objectives, the questions addressed, and the data used


In [3]:
!pip install kagglehub



In [4]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
import kagglehub

## Data preparation

Loading the data, exploratory analysis, and preprocessing.

In [5]:
path = kagglehub.dataset_download("noobiedatascientist/lichess-september-2020-data")
print("Path to the dataset files: ", path)

Path to the dataset files:  /root/.cache/kagglehub/datasets/noobiedatascientist/lichess-september-2020-data/versions/3


In [6]:
files = os.listdir(path)
print("Dataset files: ", files)

Dataset files:  ['Sept_20_analysis.RDS', 'Sept_20_analysis.csv', 'Column information.txt']


In [7]:
filename = f"{path}/Sept_20_analysis.csv"
print("File name: ", filename)

File name:  /root/.cache/kagglehub/datasets/noobiedatascientist/lichess-september-2020-data/versions/3/Sept_20_analysis.csv


In [9]:
# view the contents of the .txt file
filename_txt = f"{path}/Column information.txt"
with open(filename_txt, 'r') as f:
    print(f.read())

GAME: Game ID (not from lichess.org)

BlackElo: Elo rating of the player with the black pieces

BlackRatingDiff: Rating change (gain/loss) after game conclusion for the player with the black pieces

Date: Date the game was played

ECO: Game opening (ECO notation)

Event: Event where the game was played

Opening: Game opening

Result: Result of the game

	1-0 -- White victory
	0-1 -- Black victory
	1/2-1/2 -- Draw
	* -- Undecided
	
Site: URL of the game

Termination: Way the game terminated

	Time forfeit -- One of the players ran out of time
	Normal -- Game terminated with check mate
	Rules infraction -- Game terminated due to rule breaking
	Abandoned -- Game was abandoned
	
TimeControl: Timecontrol in seconds that was used for the game (Starting time: Increment) 

UTCTime: Time the game was played

WhiteElo: Elo rating of the player with the white pieces

WhiteRatingDiff: Rating change (gain/loss) after game conclusion for the player with the white pieces

Black_elo_category: ELO cate

## Spark

In [10]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
!tar xf spark-3.5.3-bin-hadoop3.tgz
!pip install -q findspark

In [11]:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"  # this path is specific to Colab


In [12]:
import findspark
from pyspark.sql import SparkSession

In [13]:
findspark.init()
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [21]:
sc = spark.sparkContext
df_spark = spark.read.csv(filename, header=True, inferSchema=True)

In [22]:
df_spark.printSchema()

root
 |-- GAME: integer (nullable = true)
 |-- BlackElo: integer (nullable = true)
 |-- BlackRatingDiff: integer (nullable = true)
 |-- Date: string (nullable = true)
 |-- ECO: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Opening: string (nullable = true)
 |-- Result: string (nullable = true)
 |-- Site: string (nullable = true)
 |-- Termination: string (nullable = true)
 |-- TimeControl: string (nullable = true)
 |-- UTCTime: timestamp (nullable = true)
 |-- WhiteElo: integer (nullable = true)
 |-- WhiteRatingDiff: integer (nullable = true)
 |-- Black_elo_category: string (nullable = true)
 |-- White_elo_category: string (nullable = true)
 |-- starting_time: integer (nullable = true)
 |-- increment: integer (nullable = true)
 |-- Game_type: string (nullable = true)
 |-- Total_moves: integer (nullable = true)
 |-- Black_blunders: integer (nullable = true)
 |-- White_blunders: integer (nullable = true)
 |-- Black_mistakes: integer (nullable = true)
 |-- White_mistak

In [23]:
df_spark.show(5)

+----+--------+---------------+----------+---+----------------+--------------------+------+--------------------+------------+-----------+-------------------+--------+---------------+------------------+------------------+-------------+---------+---------+-----------+--------------+--------------+--------------+--------------+------------------+------------------+--------------------+--------------------+--------------+--------------+-----------------+-----------------+-----------------+----------------+----------------+----------------+--------------------+--------------------+----------+-------------+
|GAME|BlackElo|BlackRatingDiff|      Date|ECO|           Event|             Opening|Result|                Site| Termination|TimeControl|            UTCTime|WhiteElo|WhiteRatingDiff|Black_elo_category|White_elo_category|starting_time|increment|Game_type|Total_moves|Black_blunders|White_blunders|Black_mistakes|White_mistakes|Black_inaccuracies|White_inaccuracies|Black_inferior_moves|White_

### Prepare the data as specified in the assignment

In [None]:
from pyspark.sql.functions import col, when

# Assign each player an ELO category using the bands from the assignment;
# ratings that match no band (or are missing) fall into "other".
df_spark = df_spark.withColumn("Black_ELO_category",
                               when((col("BlackElo") >= 1200) & (col("BlackElo") <= 1499), "occasional player")
                               .when((col("BlackElo") >= 1500) & (col("BlackElo") <= 1799), "good club player")
                               .when((col("BlackElo") >= 1800) & (col("BlackElo") <= 1999), "very good club player")
                               .when((col("BlackElo") >= 2000) & (col("BlackElo") <= 2399), "national and international level")
                               .when((col("BlackElo") >= 2400) & (col("BlackElo") <= 2800), "GMI, World Champions")
                               .otherwise("other")
                               )
df_spark = df_spark.withColumn("White_ELO_category",
                               when((col("WhiteElo") >= 1200) & (col("WhiteElo") <= 1499), "occasional player")
                               .when((col("WhiteElo") >= 1500) & (col("WhiteElo") <= 1799), "good club player")
                               .when((col("WhiteElo") >= 1800) & (col("WhiteElo") <= 1999), "very good club player")
                               .when((col("WhiteElo") >= 2000) & (col("WhiteElo") <= 2399), "national and international level")
                               .when((col("WhiteElo") >= 2400) & (col("WhiteElo") <= 2800), "GMI, World Champions")
                               .otherwise("other")
                               )

In [None]:
# check how many rows fall into "other"
df_spark.filter(col("Black_ELO_category") == "other").count()

In [None]:
df_spark.filter(col("White_ELO_category") == "other").count()
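
To see which ratings end up in "other", the same thresholds can be mirrored in plain Python. This is only a sketch for checking boundary values: `elo_category` and `ELO_BANDS` are hypothetical names that are not part of the notebook, and the bands are copied from the `when` chain above.

```python
# Hypothetical helper mirroring the ELO bands used in the Spark `when` chain.
ELO_BANDS = [
    (1200, 1499, "occasional player"),
    (1500, 1799, "good club player"),
    (1800, 1999, "very good club player"),
    (2000, 2399, "national and international level"),
    (2400, 2800, "GMI, World Champions"),
]


def elo_category(elo):
    """Return the category label for a rating, or "other" if no band matches."""
    if elo is None:
        return "other"  # a missing rating satisfies no band
    for low, high, label in ELO_BANDS:
        if low <= elo <= high:
            return label
    return "other"


# Probe the edges of the bands.
for rating in (1199, 1200, 1499, 1500, 2800, 2801, None):
    print(rating, "->", elo_category(rating))
```

With these bands, the only integer ratings counted as "other" are those below 1200 or above 2800, along with missing ratings.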

### Compute errors per move for each player

In [24]:
from pyspark.sql.functions import col

df_augmented = df_spark.withColumn("Black_total_errors", col("Black_blunders") + col("Black_mistakes") + col("Black_inaccuracies"))
df_augmented = df_augmented.withColumn("White_total_errors", col("White_blunders") + col("White_mistakes") + col("White_inaccuracies"))

In [25]:
df_augmented = df_augmented.withColumn("Black_errors_per_move", col("Black_total_errors") / col("Black_moves"))
df_augmented = df_augmented.withColumn("White_errors_per_move", col("White_total_errors") / col("White_moves"))

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `Black_moves` cannot be resolved. Did you mean one of the following? [`Black_ts_moves`, `BlackElo`, `Black_long_moves`, `Black_mistakes`, `Total_moves`].;
'Project [GAME#17, BlackElo#18, BlackRatingDiff#19, Date#20, ECO#21, Event#22, Opening#23, Result#24, Site#25, Termination#26, TimeControl#27, UTCTime#28, WhiteElo#29, WhiteRatingDiff#30, Black_elo_category#31, White_elo_category#32, starting_time#33, increment#34, Game_type#35, Total_moves#36, Black_blunders#37, White_blunders#38, Black_mistakes#39, White_mistakes#40, ... 19 more fields]
+- Project [GAME#17, BlackElo#18, BlackRatingDiff#19, Date#20, ECO#21, Event#22, Opening#23, Result#24, Site#25, Termination#26, TimeControl#27, UTCTime#28, WhiteElo#29, WhiteRatingDiff#30, Black_elo_category#31, White_elo_category#32, starting_time#33, increment#34, Game_type#35, Total_moves#36, Black_blunders#37, White_blunders#38, Black_mistakes#39, White_mistakes#40, ... 18 more fields]
   +- Project [GAME#17, BlackElo#18, BlackRatingDiff#19, Date#20, ECO#21, Event#22, Opening#23, Result#24, Site#25, Termination#26, TimeControl#27, UTCTime#28, WhiteElo#29, WhiteRatingDiff#30, Black_elo_category#31, White_elo_category#32, starting_time#33, increment#34, Game_type#35, Total_moves#36, Black_blunders#37, White_blunders#38, Black_mistakes#39, White_mistakes#40, ... 17 more fields]
      +- Relation [GAME#17,BlackElo#18,BlackRatingDiff#19,Date#20,ECO#21,Event#22,Opening#23,Result#24,Site#25,Termination#26,TimeControl#27,UTCTime#28,WhiteElo#29,WhiteRatingDiff#30,Black_elo_category#31,White_elo_category#32,starting_time#33,increment#34,Game_type#35,Total_moves#36,Black_blunders#37,White_blunders#38,Black_mistakes#39,White_mistakes#40,... 16 more fields] csv
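
The failure above comes from `Black_moves` and `White_moves`, which are not in the schema; among Spark's suggestions, `Total_moves` is the closest candidate. One possible workaround, sketched here in plain Python, derives per-player move counts from `Total_moves`. It assumes that `Total_moves` counts half-moves (plies) with White moving first, which should be checked against `Column information.txt` before use. `split_moves` and `errors_per_move` are hypothetical helpers, not part of the notebook.

```python
def split_moves(total_plies):
    """Split a ply count into (white_moves, black_moves).

    ASSUMPTION: `total_plies` counts half-moves and White plays first,
    so White gets the extra move when the count is odd.
    """
    white = (total_plies + 1) // 2
    black = total_plies // 2
    return white, black


def errors_per_move(total_errors, moves):
    """Return errors per move, or None when the player made no move."""
    if not moves:
        return None  # avoid dividing by zero for empty or aborted games
    return total_errors / moves


white_moves, black_moves = split_moves(61)
print(white_moves, black_moves)         # 31 30
print(errors_per_move(6, black_moves))  # 0.2
```

If the assumption holds, the same arithmetic could be expressed on the DataFrame with `floor(col("Total_moves") / 2)` for Black and `ceil(col("Total_moves") / 2)` for White, both available in `pyspark.sql.functions`.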


In [None]:
blitz_df = df_augmented.filter(col("Game_type") == "Blitz")
blitz_errors = blitz_df.groupBy("Black_elo_")

In [None]:
# display -> visual analysis by game type with matplotlib
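
Before plotting, the per-game error rates need to be averaged per ELO category. The sketch below shows that aggregation in plain Python on a few made-up rows (the values are illustrative only and are not taken from the dataset). `mean_by_category` is a hypothetical helper.

```python
from collections import defaultdict


def mean_by_category(rows):
    """Average the error rate per category from (category, rate) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, rate in rows:
        if rate is None:
            continue  # skip games without a usable rate
        totals[category] += rate
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Illustrative rows only: (ELO category, errors per move)
sample = [
    ("occasional player", 0.25),
    ("occasional player", 0.75),
    ("good club player", 0.5),
    ("good club player", None),
]
result = mean_by_category(sample)
print(result)
```

The resulting dictionary maps directly onto a bar chart, for example `plt.bar(list(result.keys()), list(result.values()))`.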

## Ideas for additional questions

- influence of time spent per move on errors
  - is there a correlation between the average time per move and the number of errors?
  - analysis by level and game type

- distribution of drawn games by opening and level
  - which openings are more likely to lead to a draw?
  - analysis of draw rates by opening and ELO category

- impact of ELO difference on game length
  - do games with a large ELO difference last fewer moves?

- most common mistakes by level
  - which types of errors are most frequent in each ELO category?
  - comparison between beginner and expert players

- optimal strategy for specific openings
  - for a given opening (to be selected manually), what is the optimal strategy depending on the players' level?
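
As a starting point for the draw-rate question, the sketch below counts draws per (opening, ELO category) pair in plain Python. It relies on the `Result` value `1/2-1/2` for draws, as documented in `Column information.txt`. The sample rows are made up for illustration and `draw_rates` is a hypothetical helper.

```python
from collections import Counter

DRAW = "1/2-1/2"  # draw marker documented in Column information.txt


def draw_rates(games):
    """Return {(opening, category): share of drawn games}."""
    played = Counter()
    drawn = Counter()
    for opening, category, result in games:
        key = (opening, category)
        played[key] += 1
        if result == DRAW:
            drawn[key] += 1
    return {key: drawn[key] / played[key] for key in played}


# Illustrative rows only: (opening, ELO category, result)
sample_games = [
    ("Sicilian Defense", "good club player", "1-0"),
    ("Sicilian Defense", "good club player", "1/2-1/2"),
    ("Italian Game", "occasional player", "0-1"),
    ("Italian Game", "occasional player", "1-0"),
]
rates = draw_rates(sample_games)
for key, rate in rates.items():
    print(key, rate)
```

Pairs with only a handful of games give noisy rates, so a minimum game count per pair would be a sensible filter before comparing openings.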