Data Collection and Processing Overview

See puzzle_journey_data_collection_processing.ipynb for details.

Contents

  1. Lichess puzzle database
  2. My puzzle activity
  3. My puzzle rating history

Lichess puzzle database

The Lichess puzzle database was downloaded from https://database.lichess.org/#puzzles on March 22, 2023. The download is a .csv file compressed with zstd. This file was decompressed on the command line and read into a pandas dataframe using .read_csv().

Screenshot 2023-05-15 at 12 06 50 PM

The data initially contained no headers, so we added headers in line with the database documentation.

Screenshot 2023-05-15 at 12 08 26 PM
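As a rough sketch of this step, the header-less file can be read with explicit column names. The underscore spellings below are an assumption modeled on the Opening_Tags, Moves, and Puzzle_Length names used in this overview, and the two rows (IDs, game URLs, ratings) are illustrative placeholders rather than real database entries.

```python
import io

import pandas as pd

# Assumed column names: underscore spellings of the fields listed in the
# database documentation, in the same order as the file's columns.
PUZZLE_COLUMNS = [
    "Puzzle_ID", "FEN", "Moves", "Rating", "Rating_Deviation",
    "Popularity", "Nb_Plays", "Themes", "Game_URL", "Opening_Tags",
]

# Two illustrative rows standing in for the decompressed, header-less file.
# The first row ends with an empty Opening_Tags field.
sample_csv = io.StringIO(
    "puzA1,r6k/pp2r2p/4Rp1Q/3p4/8/1N1P2R1/PqP2bPP/7K b - - 0 24,"
    "f2g3 e6e7 b2b1 b3c1 b1c1 h6c1,1900,75,95,5000,"
    "crushing hangingPiece long middlegame,"
    "https://lichess.org/example1/black#48,\n"
    "puzB2,r1bqk1nr/pppp1ppp/2n5/2b1p3/2B1P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 4 4,"
    "e1g1 c6d4 f3e5 d8g5,1200,80,90,300,"
    "opening short,"
    "https://lichess.org/example2#7,"
    "Italian_Game Italian_Game_Classical_Variation\n"
)

# header=None tells pandas the first line is data, not column names.
puzzles = pd.read_csv(sample_csv, header=None, names=PUZZLE_COLUMNS)
```

With the real file, the path to the decompressed .csv would take the place of `sample_csv`.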

Checking for missing or null values, we found quite a few in the Opening_Tags column.

Screenshot 2023-05-15 at 12 09 14 PM
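A check along these lines can be sketched with `isna()`. The small dataframe here is made up for illustration and is not taken from the database.

```python
import pandas as pd

# Illustrative stand-in for the puzzle dataframe.
puzzles = pd.DataFrame({
    "Moves": ["e2e4 e7e5", "d2d4 d7d5 c2c4 e7e6", "g1f3 g8f6"],
    "Opening_Tags": ["Kings_Pawn_Game", None, None],
})

# Number of missing values in each column.
null_counts = puzzles.isna().sum()

# Fraction of puzzles with no opening tags.
null_share = puzzles["Opening_Tags"].isna().mean()
```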

This was to be expected, though: opening tags are only set for puzzles occurring before move 20. A tactic occurring within the first 20 moves of a game likely has features strongly influenced by the opening played, whereas puzzles occurring later may not be as strongly influenced by the opening.

Finally, we added a column for Puzzle_Length that counts the number of moves in the puzzle from its starting position. The first move in the Moves column sets up the position to present to the player, so Puzzle_Length is 1 less than the number of moves in Moves.

Screenshot 2023-05-15 at 12 14 08 PM
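One way to compute such a column, assuming Moves holds space-separated moves (the sample strings below are illustrative):

```python
import pandas as pd

# Illustrative Moves values: space-separated moves, setup move first.
puzzles = pd.DataFrame({
    "Moves": ["f2g3 e6e7 b2b1 b3c1 b1c1 h6c1", "e1g1 c6d4"],
})

# Count the moves in each string, then subtract the setup move.
puzzles["Puzzle_Length"] = puzzles["Moves"].str.split().str.len() - 1
```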

Note that the number of moves the player must make is actually half of the number of moves in the Moves column, or

$$\dfrac{\text{Puzzle Length} + 1}{2}.$$
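As a quick worked check of the formula, here is a small helper. The function name is hypothetical and not part of the notebook.

```python
def player_moves(puzzle_length: int) -> int:
    """Number of moves the solver must play, given Puzzle_Length."""
    # Moves holds Puzzle_Length + 1 entries; the solver plays half of them.
    return (puzzle_length + 1) // 2


# A puzzle with 6 entries in Moves has Puzzle_Length 5,
# so the solver plays (5 + 1) / 2 = 3 moves.
solver_moves = player_moves(5)
```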

My puzzle activity

I used a personal token generated from https://lichess.org/account/oauth/token to access my puzzle activity from the Lichess API at https://lichess.org/api/puzzle/activity.

Screenshot 2023-05-15 at 12 21 36 PM Screenshot 2023-05-15 at 12 22 08 PM Screenshot 2023-05-15 at 12 22 31 PM
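A minimal sketch of the authenticated request, using only the standard library (the notebook may well use a different HTTP client). The `LICHESS_TOKEN` environment variable and the helper name are assumptions made for illustration. The request is only constructed here, not sent, since sending it needs a real token and network access.

```python
import os
import urllib.request
from urllib.parse import urlencode

ACTIVITY_URL = "https://lichess.org/api/puzzle/activity"


def build_activity_request(token, max_entries=None):
    """Build (but do not send) a request for the puzzle activity endpoint."""
    url = ACTIVITY_URL
    if max_entries is not None:
        # "max" limits how many activity entries are returned.
        url += "?" + urlencode({"max": max_entries})
    return urllib.request.Request(
        url,
        headers={
            # Personal tokens are sent as a Bearer token.
            "Authorization": f"Bearer {token}",
            "Accept": "application/x-ndjson",
        },
    )


# LICHESS_TOKEN is a hypothetical variable name; keep real tokens out of code.
request = build_activity_request(
    os.environ.get("LICHESS_TOKEN", "placeholder-token"), max_entries=50
)

# Sending the request would look like this:
# with urllib.request.urlopen(request) as response:
#     ndjson_text = response.read().decode("utf-8")
```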

The response is an .ndjson file, i.e. a newline-delimited JSON file with one JSON object per line. I had trouble getting pandas to read this directly, so I first split the response text into a list of lines and applied json.loads() to parse each line as a JSON object. Afterward, pandas.json_normalize() was able to convert this list of parsed objects into a dataframe.

Screenshot 2023-05-15 at 12 27 24 PM
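The parsing steps can be sketched as follows. The two sample lines and their field names are illustrative assumptions, not a copy of the actual API response.

```python
import json

import pandas as pd

# Illustrative response text: one JSON object per line.
ndjson_text = (
    '{"date": 1679443200000, "win": true, '
    '"puzzle": {"id": "puzA1", "rating": 1900}}\n'
    '{"date": 1679529600000, "win": false, '
    '"puzzle": {"id": "puzB2", "rating": 1200}}\n'
)

# Split into lines, skip blank ones, and parse each line on its own.
records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

# Flatten nested objects into dotted column names such as "puzzle.id".
activity = pd.json_normalize(records)
```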

Note that the date column holds 13-digit Unix timestamps, i.e. milliseconds since the epoch. I converted these to datetime.

Screenshot 2023-05-15 at 12 30 30 PM
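Assuming the values are millisecond Unix timestamps, the conversion can be sketched with `pd.to_datetime` and `unit="ms"`. The sample values are illustrative.

```python
import pandas as pd

# Illustrative 13-digit timestamps (milliseconds since 1970-01-01 UTC).
activity = pd.DataFrame({"date": [1679443200000, 1679529600000]})

# unit="ms" tells pandas the integers are milliseconds, not nanoseconds.
activity["date"] = pd.to_datetime(activity["date"], unit="ms")
```

The resulting values are timezone-naive and correspond to UTC.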

My puzzle rating history

My rating history was downloaded directly from https://lichess.org/api/user/tclark/rating-history as a .json file. My puzzle rating history was encoded as a list of points at index 13 of this file. This list was read into a dataframe as below.

Screenshot 2023-05-15 at 12 33 01 PM
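A sketch of this step on made-up data. It assumes the file is a list of objects with "name" and "points" keys and that the puzzle entry is named "Puzzles". Looking the entry up by name is a swapped-in alternative to indexing position 13, which depends on the order of entries.

```python
import pandas as pd

# Illustrative stand-in for the parsed rating-history JSON.
# Each point is assumed to be [year, month, day, rating].
rating_history = [
    {"name": "Blitz", "points": [[2023, 0, 10, 1500]]},
    {"name": "Puzzles", "points": [[2023, 0, 15, 1850], [2023, 2, 22, 1902]]},
]

# Select the puzzle entry by name rather than by position.
puzzle_points = next(
    entry["points"] for entry in rating_history if entry["name"] == "Puzzles"
)

puzzle_ratings = pd.DataFrame(
    puzzle_points, columns=["year", "month", "day", "rating"]
)
```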

The month column was zero-based, with January corresponding to 0, which I found strange, so I added 1 to each entry in month.

Screenshot 2023-05-15 at 12 33 46 PM
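The month adjustment can be sketched as below, on illustrative data. Building a single date column afterward is an optional extra step that is not described above.

```python
import pandas as pd

# Illustrative rating points with zero-based months (0 = January).
puzzle_ratings = pd.DataFrame({
    "year": [2023, 2023],
    "month": [0, 2],
    "day": [15, 22],
    "rating": [1850, 1902],
})

# Shift months to the usual 1-12 range.
puzzle_ratings["month"] = puzzle_ratings["month"] + 1

# Optional: assemble a datetime column from the year/month/day columns.
puzzle_ratings["date"] = pd.to_datetime(puzzle_ratings[["year", "month", "day"]])
```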