## 📊 NB03 - Data Analysis

In this notebook, we will be analyzing the FIDE, Chess.com, and Google Trends data to answer the following research questions:
1) How do the top ten chess players vary in performance across different game modes?
2) Does player performance online align with their FIDE rating?
3) How does their success correlate with chess interest in their home country? Are certain players more influential than others?

To do this, we first look at the standard deviation of each player's ratings, then investigate and plot between-game-mode differences, over-the-board vs. online performance, and Google Trends data for players in their countries.

In [202]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import plotly.express as px

from lets_plot import *

LetsPlot.setup_html()

In [203]:
# Define the database path
db = "../data/chess.db"  

# Connect to the database
conn = sqlite3.connect(db)

# Specify the FIDE table
table = "fide"  

# Read data from the database into pandas
df = pd.read_sql(f"SELECT * FROM {table}", conn)

# Close the database connection
conn.close()

# Sort the DataFrame by 'date' (still a "Mon YYYY" string here, so the order is alphabetical, not chronological)
df = df.sort_values(by="date")

# Display the first few rows to check
print(df.head())

      fide_id                name federation  world_rank_active_players  \
202   1503014      Magnus Carlsen         NO                          1   
1789  5000017   Viswanathan Anand         IN                         10   
605   2016192     Hikaru Nakamura         US                          3   
1411  4168119  Ian Nepomniachtchi         RU                          9   
1785  5000017   Viswanathan Anand         IN                         10   

          date  standard  rapid  blitz  
202   Apr 2001      2064    NaN    NaN  
1789  Apr 2001      2794    NaN    NaN  
605   Apr 2001      2428    NaN    NaN  
1411  Apr 2002      2280    NaN    NaN  
1785  Apr 2002      2752    NaN    NaN  
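
Because `date` is still stored as text such as `Apr 2001` at this point, sorting on it orders the rows alphabetically, which is why the preview above opens with several April entries. A minimal sketch of the difference, using only the standard library and a hypothetical handful of labels (the list and the helper `parse_rating_period` are illustrative, not part of the notebook):

```python
from datetime import datetime

# Hypothetical sample of rating-list labels in the "Mon YYYY" form shown above
labels = ["Apr 2002", "Jan 2001", "Oct 2001", "Apr 2001"]


def parse_rating_period(label: str) -> datetime:
    """Turn a label such as 'Apr 2001' into a datetime (first day of that month)."""
    return datetime.strptime(label, "%b %Y")


# Plain string sort: compares characters, so every "Apr" precedes every "Jan"
alphabetical = sorted(labels)

# Sort on the parsed date instead to get calendar order
chronological = sorted(labels, key=parse_rating_period)

print(alphabetical)
print(chronological)
```

In the notebook itself, converting the column with `pd.to_datetime(df['date'], format='%b %Y')` before sorting would have the same effect.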


### <center> **Question 1:** How do the top ten chess players vary in performance across different game modes?</center>

In [204]:
# Standard deviation of each player's FIDE ratings, per game mode
df.groupby('name')[['standard', 'rapid', 'blitz']].std()

| name | standard | rapid | blitz |
|---|---|---|---|
| Alireza Firouzja | 311.698406 | 272.447126 | 324.035403 |
| Erigaisi Arjun | 368.231351 | 209.188368 | 399.610189 |
| Fabiano Caruana | 183.171105 | 36.028575 | 54.349606 |
| Gukesh D | 453.20255 | 528.575106 | 415.488409 |
| Hikaru Nakamura | 94.373605 | 35.518926 | 27.67537 |
| Ian Nepomniachtchi | 111.627893 | 22.762101 | 38.968479 |
| Magnus Carlsen | 156.529242 | 25.547607 | 40.641887 |
| Nodirbek Abdusattorov | 277.047625 | 293.441806 | 287.994346 |
| Viswanathan Anand | 18.472723 | 32.382535 | 34.635731 |
| Yi Wei | 159.126699 | 63.675037 | 59.078145 |


To start the exploratory data analysis, we first looked at descriptive statistics. After examining the full table of statistics (since deleted), we decided to focus on the most informative ones rather than try to interpret every data point.

We chose standard deviation because it measures how much each player's rating has varied over the recorded period, which gives us an idea of how much their progression has differed between game modes. The table above shows the standard deviation of every player's ratings within each game mode.
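
For reference, pandas computes the sample standard deviation (dividing by n - 1) by default. The sketch below reproduces that calculation with the standard library on two made-up rating series (the numbers, variable names, and the helper `sample_std` are hypothetical) and illustrates that a steadily climbing series produces a large value just as an erratic one would:

```python
import math
import statistics

# Hypothetical rating histories, invented for illustration only
steady_player = [2790, 2795, 2788, 2792, 2791]   # established player, flat ratings
rising_player = [2300, 2450, 2600, 2700, 2760]   # player climbing quickly


def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


steady_sd = sample_std(steady_player)
rising_sd = sample_std(rising_player)

# statistics.stdev uses the same n - 1 definition
assert math.isclose(steady_sd, statistics.stdev(steady_player))

print(f"steady: {steady_sd:.1f}, rising: {rising_sd:.1f}")
```

Read this way, a large value in the table may reflect ratings that rose substantially over the recorded period rather than erratic results.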

However, just the game mode-specific standard deviation did not provide enough insight to answer Question 1, so we decided to draw some direct comparisons between the game modes.

In [205]:
# Add a new column for differences between modes
df['std_blitz_diff'] = df['standard'] - df['blitz']
df['std_rapid_diff'] = df['standard'] - df['rapid']
df['rapid_blitz_diff'] = df['rapid'] - df['blitz']

# Average differences by player
player_differences = df.groupby('name')[['std_blitz_diff', 'std_rapid_diff', 'rapid_blitz_diff']].mean()
print(player_differences)


                       std_blitz_diff  std_rapid_diff  rapid_blitz_diff
name                                                                   
Alireza Firouzja            -8.567376       96.460993       -105.028369
Erigaisi Arjun              85.979452      188.957627       -109.576271
Fabiano Caruana             51.065789       23.811594         26.608696
Gukesh D                   265.314516      523.391304       -310.395161
Hikaru Nakamura           -103.789474      -39.789116        -67.190476
Ian Nepomniachtchi         -46.756579      -31.592105        -15.164474
Magnus Carlsen             -36.703947       -7.434211        -29.269737
Nodirbek Abdusattorov       65.089041      144.993464        -86.239726
Viswanathan Anand           -2.609929        4.065359         -8.170213
Yi Wei                      43.638158       28.152174         25.224638


In [206]:
# Calculate the mean of differences for each player
player_differences = df.groupby('name')[['std_blitz_diff', 'std_rapid_diff', 'rapid_blitz_diff']].mean().reset_index()

# Reshape the data to long format for plotting
player_differences_long = player_differences.melt(id_vars=['name'], 
                                                   value_vars=['std_blitz_diff', 'std_rapid_diff', 'rapid_blitz_diff'], 
                                                   var_name='comparison', value_name='average_difference')

# Rename the comparison values for better clarity
player_differences_long['comparison'] = player_differences_long['comparison'].replace({
    'std_blitz_diff': 'std-blitz difference',
    'std_rapid_diff': 'std-rapid difference',
    'rapid_blitz_diff': 'rapid-blitz difference'
})

# Build the plot
rating_diff_plot = (ggplot(player_differences_long) +
     geom_bar(aes(x='name', y='average_difference', fill='comparison', group='comparison'), 
              stat='identity', position='dodge', width=0.7, color='black', size=0.3) + 
     ggtitle('Figure 1. What is the average difference in performance \n between game modes for top 10 chess players?') +
     ylab('Average Difference in Ratings') + 
     theme(axis_text_x=element_text(angle=90, hjust=1, size=11),  
           plot_title=element_text(hjust=0.5),  
           legend_title=element_blank(),
           legend_text=element_text(size=11), 
           legend_position='bottom',
           axis_title_x=element_blank()) + 
     scale_fill_manual(values=['#6ab6e7', '#cd5c44', '#addf1d']))

# Display the plot
rating_diff_plot


The plot above visualizes the performance comparisons between each pair of game modes for all of the top 10 chess players. In general, a positive value indicates a higher rating in the first game mode listed, and a negative value indicates a higher rating in the second (e.g., Alireza Firouzja's std-blitz difference is -8.57, indicating better performance in blitz; Erigaisi Arjun's value for the same comparison is 85.98, indicating stronger standard performance).
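
One detail to keep in mind when reading these averages: in pandas, subtracting two columns yields NaN wherever either rating is missing, and `mean()` skips NaN by default, so each average covers only the months in which both modes have a rating. That is why the three values for a player need not add up (for Gukesh, 265.31 - 523.39 is not -310.40). A dependency-free sketch with hypothetical rows, where `None` stands in for a missing rating and `mean_difference` is an illustrative helper:

```python
from statistics import mean

# Hypothetical monthly rows for one player; None marks a rating that was not recorded
rows = [
    {"standard": 2500, "rapid": None, "blitz": None},
    {"standard": 2600, "rapid": None, "blitz": 2450},
    {"standard": 2700, "rapid": 2300, "blitz": 2500},
    {"standard": 2750, "rapid": 2400, "blitz": 2600},
]


def mean_difference(rows, first, second):
    """Average of first - second over the rows where both ratings exist."""
    diffs = [
        row[first] - row[second]
        for row in rows
        if row[first] is not None and row[second] is not None
    ]
    return mean(diffs)


std_blitz = mean_difference(rows, "standard", "blitz")   # uses 3 rows
std_rapid = mean_difference(rows, "standard", "rapid")   # uses 2 rows
rapid_blitz = mean_difference(rows, "rapid", "blitz")    # uses 2 rows

# On a common set of rows, std_blitz - std_rapid would equal rapid_blitz;
# here the averages come from different subsets, so the two numbers differ.
print(std_blitz - std_rapid, rapid_blitz)
```

Restricting all three comparisons to the rows where every mode is rated would make them directly comparable, at the cost of discarding the earlier months.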

Looking at Fig. 1, Gukesh has the most pronounced differences between game modes: on average, his standard rating is about 265 points above his blitz rating and about 523 points above his rapid rating. Although less drastic, Arjun's and Abdusattorov's differences follow the same pattern, with the highest ratings in standard, then blitz, and the lowest in rapid.

Although players' performance patterns differ, rapid ratings tend to be the lowest of the three: eight of the ten players have a negative rapid-blitz difference, and seven have a positive std-rapid difference.

In [207]:
# Ensure datetime format
df['date'] = pd.to_datetime(df['date'], format='%b %Y', errors='coerce')

# Keep only dates from 2013 onward; earlier records are incomplete because some players' ratings were not recorded
df = df[df['date'] >= '2013-01-01']

# Drop rows with missing values in relevant columns
df = df.dropna(subset=['standard', 'rapid', 'blitz', 'date'])

# Aggregate ratings by date and mode
df_agg = df.melt(id_vars=['date'], value_vars=['standard', 'rapid', 'blitz'], 
                 var_name='mode', value_name='rating')

# Calculate the mean rating for each mode across all players for each date
df_agg_avg = df_agg.groupby(['date', 'mode']).agg({'rating': 'mean'}).reset_index()

# Build the plot
aggregate_plot = (ggplot(df_agg_avg) +
     geom_line(aes(x='date', y='rating', color='mode', group='mode')) +
     ggtitle('Figure 2. How does the average performance of top 10 chess \n players change, depending on game mode?') +
     xlab('Date') + 
     ylab('Average Rating (Elo)') +
     theme(axis_text_x=element_text(angle=45, hjust=1),
           plot_title=element_text(hjust=0.5)) +
     scale_x_datetime(format='%b %Y')  
    )

# Display the plot
aggregate_plot


To further observe the general trends in player performance per game mode, we plotted Figure 2, which shows the average rating across players for each rating date. As suggested previously, players almost always had their highest Elo ratings in the standard game mode, followed (and sometimes surpassed) by blitz, and then by rapid. Although they fluctuate before 2015, the average ratings seem to gradually increase over time, until plateauing around late 2022/early 2023.

The difference in game mode performance could signify a general trend of favouring and/or performing better under less time pressure and stress. 
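
When reading Figure 2, it may help to check how many players contribute to each date's average, since rows with any missing rating are dropped before averaging and the set of players can therefore change over time. A rough sketch of that check with hypothetical records (in the notebook, something like `df.groupby('date')['name'].nunique()` would give the same counts):

```python
from collections import defaultdict

# Hypothetical (date, player, standard, rapid, blitz) records; None = rating not recorded
records = [
    ("2013-01", "Player A", 2800, 2790, 2810),
    ("2013-01", "Player B", 2500, None, None),
    ("2020-01", "Player A", 2850, 2840, 2860),
    ("2020-01", "Player B", 2700, 2650, 2680),
]

players_per_date = defaultdict(set)
standard_sum = defaultdict(float)

for date, player, standard, rapid, blitz in records:
    # Mirror the notebook's dropna step: skip rows with any missing rating
    if None in (standard, rapid, blitz):
        continue
    players_per_date[date].add(player)
    standard_sum[date] += standard

# Number of players behind each date's average
coverage = {date: len(players) for date, players in players_per_date.items()}

# Average standard rating per date, over the players that remain
average_standard = {date: standard_sum[date] / coverage[date] for date in coverage}

print(coverage)
print(average_standard)
```

In this toy data the average moves between the two dates partly because a second player enters the pool, which is the kind of composition effect worth ruling out before interpreting a trend.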

### <center> **Question 2:** Does player performance online align with their FIDE rating?</center> 

As we were also curious about online vs. over-the-board performance, we wondered whether stress (whether from in-person play or from time constraints) could be one of the primary factors behind performance. To explore this idea, we set out to compare players' over-the-board FIDE ratings with their Chess.com ratings. Because the Chess.com data does not include dates, we could only investigate a single point in time: the present (i.e., February 2025).

In addition, standard mode games (i.e., longer-form games) are split into two non-FIDE-standard game modes on Chess.com: daily and rapid, where daily games can last over 1-2 days, and FIDE-standard long-form games (e.g., a few hours long) are classified as rapid. This is why most players are missing "daily" data, as they tend not to play it. Thus, for this comparison, we only picked rapid and blitz ratings from 1) the Chess.com data and 2) the February 2025 FIDE data.

In [208]:
# Connect to the database
conn = sqlite3.connect(db)

# Specify the Chess.com table
table = "chesscom"  

# Read data from the database into pandas
df_online = pd.read_sql(f"SELECT * FROM {table}", conn)

# Close the database connection
conn.close()
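
Both loading cells follow the same connect, read, close pattern. A slightly more defensive variant is sketched below against an in-memory database so that it can run on its own; the helper `read_table`, the `ALLOWED_TABLES` set, and the sample row are assumptions for illustration. Table names cannot be passed as SQL parameters, so the sketch checks the name against a whitelist before formatting it into the query, and `contextlib.closing` closes the connection even if the query raises.

```python
import sqlite3
from contextlib import closing

# Table names used in this notebook; anything else is rejected
ALLOWED_TABLES = {"fide", "chesscom"}


def read_table(conn, table):
    """Return all rows of a whitelisted table as a list of dicts."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Unexpected table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory stand-in for ../data/chess.db, with one hypothetical row
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE chesscom (name TEXT, current_rapid INTEGER)")
    conn.execute("INSERT INTO chesscom VALUES (?, ?)", ("Example_Player", 2800))
    rows = read_table(conn, "chesscom")

print(rows)
```

With pandas, the `pd.read_sql` call from the cells above can be placed inside the same `with` block.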


In [209]:
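# Keep the current Chess.com blitz and rapid ratings
df_online_selected = df_online[['name', 'current_blitz', 'current_rapid']].copy()
# Replace underscores in the Chess.com names with spaces
df_online_selected['name'] = df_online_selected['name'].str.replace('_', ' ')

# Select several columns with a list (double brackets); df['a', 'b'] looks up a single tuple key
df_select = df[['name', 'date', 'rapid', 'blitz']]
# Keep only the February 2025 FIDE ratings
df_select = df_select[(df_select['date'].dt.year == 2025) &  
                        (df_select['date'].dt.month == 2)]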

In [210]:
print(df.columns)


Index(['fide_id', 'name', 'federation', 'world_rank_active_players', 'date',
       'standard', 'rapid', 'blitz', 'std_blitz_diff', 'std_rapid_diff',
       'rapid_blitz_diff'],
      dtype='object')


One thing we noticed was that many players had either never played a "daily" mode game (signified by the null values in the database) or had uncharacteristically low ratings in it. This is especially interesting, as the over-the-board FIDE data showed that most players performed better in game modes with less of a time constraint, yet when playing online for leisure, they specifically chose shorter game modes. To us, this suggests that time might not be the sole determinant of performance: as players seek out and are successful in these game modes online, it could be that the stress of playing an in-person game (e.g., playing in a foreign environment, being face-to-face with their opponent, tournament streaming, etc.), combined with the time constraint, is affecting their performance. Further research could compare online vs. over-the-board tournament data to provide additional insight into this.

**TODO:** add a graph here comparing the FIDE rapid trend line above with the Chess.com rapid ratings.

In [None]:
# Filtering the FIDE data to only get February blitz and rapid data
# This is due to 1) standard mode data not being available on Chess.com and 2) Chess.com API only providing current data
df_filtered = df_agg_avg[
    (df_agg_avg['date'].dt.year == 2025) &  
    (df_agg_avg['date'].dt.month == 2) &   
    (df_agg_avg['mode'].isin(['rapid', 'blitz']))  
]

df_filtered
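
The per-player comparison that Question 2 asks for could then be built by joining the two sources on player name and taking the rating gap per mode. The sketch below uses plain dictionaries and entirely hypothetical names and ratings; the keys `current_rapid` and `current_blitz` mirror the Chess.com columns selected earlier, and `rating_gaps` is an illustrative helper. With the DataFrames in this notebook, a `merge` on `name` would play the same role.

```python
# Hypothetical February 2025 FIDE ratings, keyed by player name
fide = {
    "Player A": {"rapid": 2820, "blitz": 2880},
    "Player B": {"rapid": 2750, "blitz": 2790},
}

# Hypothetical current Chess.com ratings (underscores in names already replaced)
chesscom = {
    "Player A": {"current_rapid": 2900, "current_blitz": 3250},
    "Player C": {"current_rapid": 2600, "current_blitz": 3000},
}


def rating_gaps(fide, chesscom):
    """Chess.com minus FIDE rating per mode, for players present in both sources."""
    gaps = {}
    for name in sorted(fide.keys() & chesscom.keys()):
        gaps[name] = {
            "rapid_gap": chesscom[name]["current_rapid"] - fide[name]["rapid"],
            "blitz_gap": chesscom[name]["current_blitz"] - fide[name]["blitz"],
        }
    return gaps


gaps = rating_gaps(fide, chesscom)

# Names found in only one source, e.g. because of spelling differences
unmatched = sorted(fide.keys() ^ chesscom.keys())

print(gaps)
print(unmatched)
```

Listing the unmatched names first is a cheap way to catch spelling mismatches between the two sources. Because the two sites maintain separate rating pools, the gaps are best read as descriptive rather than as a like-for-like measure of strength.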

Click [here](../README.md#order-of-notebooks) to navigate back to the Order of Notebooks table!