In this notebook we download game data from the chess.com API and store it in a JSON file for further processing.

First, we need to install a [chess library](https://python-chess.readthedocs.io/en/latest/index.html) for Python.

In [None]:
%pip install chess

Collecting chess
  Downloading chess-1.11.2.tar.gz (6.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.1/6.1 MB 24.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: chess
  Building wheel for chess (setup.py) ... done
  Created wheel for chess: filename=chess-1.11.2-py3-none-any.whl size=147776 sha256=3f75fe13c434fe0335ab299ce68a46f99f64b2ccbd933b2a2b0cd60818ff93f4
  Stored in directory: /root/.cache/pip/wheels/fb/5d/5c/59a62d8a695285e59ec9c1f66add6f8a9ac4152499a2be0113
Successfully built chess
Installing collected packages: chess
Successfully installed chess-1.11.2


We import:

*   chess.pgn to handle data in [PGN](https://en.wikipedia.org/wiki/Portable_Game_Notation) format, which is commonly used for recording chess games.
*   io, which is used later to wrap text in a stream.
*   pandas, because pandas.DataFrame will hold the final dataset.



In [None]:
import chess.pgn
import io
import pandas as pd

Now we are ready to send HTTP requests to the chess.com API endpoint. (Useful article [here](https://www.chess.com/news/view/published-data-api#pubapi-endpoint-games)).

Here, I am extracting information about all my chess games on chess.com. (Note that we iterate over every month, because there is no official, easily accessible endpoint that returns all games at once.)

In [None]:
import requests
import datetime

def get_games(username, start_year, start_month):
    current_date = datetime.datetime.now()
    games = []

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    for year in range(start_year, current_date.year + 1):
        for month in range(1, 13):
            if year == start_year and month < start_month:
                continue
            if year == current_date.year and month > current_date.month:
                break

            url = f"https://api.chess.com/pub/player/{username}/games/{year}/{month:02d}"
            response = requests.get(url, headers=headers)

            if response.status_code == 200:
                data = response.json()
                if "games" in data and data["games"]:
                    games.extend(data["games"])
                else:
                    print(f"No games for {year}-{month:02d}")
            else:
                print(f"No data for {year}-{month:02d}, status: {response.status_code}")

    return games

username = "Pablo_810"
games = get_games(username, 2020, 3)

print(f"Downloaded {len(games)} games")


No games for 2023-12
No games for 2024-04
Downloaded 8966 games
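
The month-by-month iteration inside get_games can also be isolated into a small generator, which makes the range logic easy to check without calling the API. This is only a sketch: iter_months is a hypothetical helper that is not part of the notebook, and the end date is fixed here for reproducibility, whereas get_games uses the current date.

```python
import datetime

def iter_months(start_year, start_month, end_date):
    """Yield (year, month) pairs from the start month up to end_date's month, inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_date.year, end_date.month):
        yield year, month
        # Roll over to January of the next year after December.
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1

# Fixed end date so the example is reproducible (an assumption for illustration).
months = list(iter_months(2020, 11, datetime.date(2021, 2, 15)))
print(months)  # [(2020, 11), (2020, 12), (2021, 1), (2021, 2)]

# The same URL pattern that get_games requests, one URL per month.
urls = [
    f"https://api.chess.com/pub/player/Pablo_810/games/{y}/{m:02d}"
    for y, m in months
]
```

Each pair maps to exactly one monthly archive URL, so the request loop reduces to a single for loop over the generator.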


To extract only the most valuable information and standardize the data, I wrote two functions, which are defined in the next code cell.



*   analyzePGN(pgn_text): the JSON response from the API contains the PGN of each game as a string. To process it easily we need a parsed game object, so pgn_text is first wrapped in a text stream and then passed to the read_game function from the Python chess library. From the parsed game we derive the winner, the number of moves played (counted as half-moves, one per move by either side) and the way the game ended.
*   isMate(game): auxiliary function which checks whether the game ended in checkmate.



In [None]:
def isMate(pgn):
    """Return True if replaying the main line reaches a checkmate position."""
    board = pgn.board()
    for move in pgn.mainline_moves():
        board.push(move)
        if board.is_checkmate():
            return True
    return False

def analyzePGN(pgn_text):
    # read_game expects a file-like object, so wrap the PGN string in a text stream.
    pgn_stream = io.StringIO(pgn_text)
    pgn = chess.pgn.read_game(pgn_stream)

    result = pgn.headers['Result']
    if result == '1-0':
        winner = 'White'
    elif result == '0-1':
        winner = 'Black'
    else:
        winner = 'Draw'

    # Counts every move in the main line, i.e. half-moves by either side.
    turns = len(list(pgn.mainline_moves()))

    # Checks are ordered: timeouts first, then draws, then mate;
    # anything else is treated as a resignation.
    if 'time' in pgn.headers.get('Termination', ''):
        victory_status = 'Time forfeit'
    elif result == '1/2-1/2':
        victory_status = 'Draw'
    elif isMate(pgn):
        victory_status = 'Mate'
    else:
        victory_status = 'Resign'

    return turns, victory_status, winner
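
To make the header-based part of analyzePGN easier to follow, here is a minimal sketch that uses only the standard library. It is a stand-in, not the notebook's method: a regular expression replaces chess.pgn.read_game for reading the tag pairs, the checkmate test is passed in as a flag instead of being replayed on a board, and the sample PGN (player names, Termination wording) is made up for illustration.

```python
import re

# Made-up sample in the PGN tag-pair format; not a real game from the dataset.
SAMPLE_PGN = '''[Event "Live Chess"]
[White "player_a"]
[Black "player_b"]
[Result "0-1"]
[Termination "player_b won on time"]

1. e4 e5 2. Nf3 Nc6 0-1
'''

def read_headers(pgn_text):
    """Collect [Tag "value"] lines from the header section into a dict."""
    pairs = re.findall(r'^\[(\w+) "(.*)"\]$', pgn_text, flags=re.MULTILINE)
    return dict(pairs)

def classify(headers, ends_in_mate=False):
    """Mirror the decision order of analyzePGN using plain header values."""
    result = headers.get("Result", "*")
    winner = {"1-0": "White", "0-1": "Black"}.get(result, "Draw")
    if "time" in headers.get("Termination", ""):
        status = "Time forfeit"
    elif result == "1/2-1/2":
        status = "Draw"
    elif ends_in_mate:
        status = "Mate"
    else:
        status = "Resign"
    return status, winner

headers = read_headers(SAMPLE_PGN)
print(classify(headers))  # ('Time forfeit', 'Black')
```

The sketch shows why the order of the checks matters: any Termination text containing "time" is labelled as a time forfeit before the result is inspected.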

Next we can create a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with all desired information.

games is a list of dictionaries parsed from the JSON responses, one per game, so for each game we iterate over its keys and values and map them to a row of games_df.

In [None]:
games_df = pd.DataFrame({
    "rated": [],
    "turns": [],
    "victory_status": [],
    "winner": [],
    "time_class": [],
    "white_id": [],
    "white_rating": [],
    "black_id": [],
    "black_rating": [],
    "opening": []
})

for i, game in enumerate(games):
    rated = True
    turns = 0
    victory_status = ''
    winner = ''
    time_class = ''
    white_id = ''
    white_rating = 0
    black_id = ''
    black_rating = 0
    opening = ''

    for k, v in game.items():
        match k:
            case 'time_class':
                time_class = v
            case 'rated':
                rated = bool(v)
            case 'white':
                white_id = v.get('username', '')
                white_rating = int(v.get('rating', 0))
            case 'black':
                black_id = v.get('username', '')
                black_rating = int(v.get('rating', 0))
            case 'eco':
                opening = str(v).split('/')[-1]
            case 'pgn':
                turns, victory_status, winner = analyzePGN(v)

    # Adding new row
    games_df.loc[len(games_df)] = [
        rated, turns, victory_status, winner, time_class,
        white_id, white_rating, black_id, black_rating, opening
    ]
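
As an alternative to the match statement, the same field mapping can be written as a function that flattens one game into a row dictionary. This is a sketch under assumptions: game_to_row is a hypothetical helper, the sample record is made up to match the keys used above, and the PGN-derived columns (turns, victory_status, winner) are left out to keep it free of third-party imports.

```python
def game_to_row(game):
    """Flatten one game dict from the API response into a flat row dict."""
    white = game.get("white", {})
    black = game.get("black", {})
    return {
        "rated": bool(game.get("rated", True)),
        "time_class": game.get("time_class", ""),
        "white_id": white.get("username", ""),
        "white_rating": int(white.get("rating", 0)),
        "black_id": black.get("username", ""),
        "black_rating": int(black.get("rating", 0)),
        # Keep only the last path segment of the opening URL.
        "opening": str(game.get("eco", "")).split("/")[-1],
    }

# Made-up record shaped like the fields read in the loop above.
sample = {
    "rated": True,
    "time_class": "blitz",
    "white": {"username": "Pablo_810", "rating": 1033},
    "black": {"username": "contakto", "rating": 1181},
    "eco": "https://www.chess.com/openings/Bishops-Opening",
}

row = game_to_row(sample)
print(row["opening"])  # Bishops-Opening
```

A list of such rows, for example [game_to_row(g) for g in games], can be passed to pd.DataFrame in a single call, which avoids growing the frame one row at a time with .loc.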

A quick check that the dataset contains the expected information:

In [None]:
games_df.head()

Unnamed: 0,rated,turns,victory_status,winner,time_class,white_id,white_rating,black_id,black_rating,opening
0,True,53,Time forfeit,White,blitz,Pablo_810,1162,ahmed8909,838,Pirc-Defense-2.d4
1,True,67,Resign,White,blitz,MichaelMikeCorleone,1099,Pablo_810,987,Van-t-Kruijs-Opening-1...d5
2,True,30,Resign,Black,blitz,POLCIE,997,Pablo_810,1095,Kings-Fianchetto-Opening-1...d5-2.Bg2
3,True,44,Resign,Black,blitz,Pablo_810,1033,contakto,1181,Bishops-Opening
4,True,69,Time forfeit,White,blitz,Fernando2017p,1088,Pablo_810,976,Queens-Pawn-Opening-1...d5


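We can see that the columns that should be numeric (turns, white_rating, black_rating) are in fact stored as numbers and that the dataset includes 8966 rows. Note that the minimum of turns is 0, which points to some games without recorded moves.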

In [None]:
games_df.describe()

Unnamed: 0,turns,white_rating,black_rating
count,8966.0,8966.0,8966.0
mean,68.38646,1441.695963,1441.103279
std,32.468702,308.21245,308.367572
min,0.0,195.0,140.0
25%,45.0,1337.0,1337.0
50%,65.0,1531.0,1530.0
75%,90.0,1630.0,1630.0
max,240.0,2830.0,2659.0


There are no missing (null) values in any column.

In [None]:
games_df.isna().sum()

Unnamed: 0,0
rated,0
turns,0
victory_status,0
winner,0
time_class,0
white_id,0
white_rating,0
black_id,0
black_rating,0
opening,0


In [None]:
games_df['victory_status'].unique()

array(['Time forfeit', 'Resign', 'Draw', 'Mate', ''], dtype=object)

The empty string ('') among the unique values reveals some problematic, incomplete rows. They were not counted by isna() above, because an empty string is not a missing value. In this case we can drop them.

In [None]:
games_df[games_df['victory_status']=='']

Unnamed: 0,rated,turns,victory_status,winner,time_class,white_id,white_rating,black_id,black_rating,opening
5212,True,0,,,blitz,Pablo_810,1303,yikeschessishard,1290,English-Opening-Anglo-Lithuanian-Variation-2.Nc3
5213,True,0,,,blitz,yoaLt,1415,Pablo_810,1150,Sicilian-Defense-Open-Classical-Variation-6.Bb...
5214,True,0,,,blitz,Pablo_810,1186,ReitmannB,394,English-Opening-Reversed-Sicilian-Three-Knight...
5215,True,0,,,blitz,megalofia,1327,Pablo_810,1284,Sicilian-Defense...3.Bc4-Nf6-4.d3-e6
5216,True,0,,,blitz,aDropOfLove,1210,Pablo_810,1221,Sicilian-Defense-Canal-Attack-3...Nd7
5217,True,0,,,blitz,Pablo_810,1142,u_menya_net_idei,1239,English-Opening-Four-Knights-Kingside-Fianchet...
5307,True,0,,,blitz,Pablo_810,1077,enterthe,1275,English-Opening-Anglo-Scandinavian-Defense-2.c...
5308,True,0,,,blitz,MIRRIL,869,Pablo_810,1115,Sicilian-Defense-2.Nf3-d6
5309,True,0,,,blitz,Spagetixs,876,Pablo_810,1158,Indian-Game-2.Nc3-g6
5310,True,0,,,blitz,MIRRIL,859,Pablo_810,1158,Kings-Pawn-Opening


In [None]:
games_df = games_df[games_df['victory_status'] != '']

In [None]:
games_df.count()

Unnamed: 0,0
rated,8944
turns,8944
victory_status,8944
winner,8944
time_class,8944
white_id,8944
white_rating,8944
black_id,8944
black_rating,8944
opening,8944


Finally, we save the DataFrame to a JSON file so the dataset can be reused in later analysis.

In [None]:
games_df.to_json('chess_data.json', orient='records', indent=4)
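
With orient='records', pandas writes a JSON array containing one object per row, so the file can be read back with the standard library alone. The sketch below illustrates that layout on a small stand-in file: the rows and the file name chess_sample.json are made up, and the file is written with json.dump rather than to_json so the example runs without pandas.

```python
import json
import os
import tempfile

# Made-up rows with a subset of the dataset's columns, in the "records" layout.
records = [
    {"rated": True, "turns": 53, "victory_status": "Time forfeit", "winner": "White"},
    {"rated": True, "turns": 30, "victory_status": "Mate", "winner": "Black"},
]

# Write the stand-in file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "chess_sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4)

# Read it back as a plain list of dictionaries.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Simple filtering works directly on the loaded records.
mates = [r for r in loaded if r["victory_status"] == "Mate"]
print(len(loaded), len(mates))  # 2 1
```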