Lichess Puzzle Database Overview.md

Exploratory analysis on the Lichess puzzle database

See the Exploratory Analysis folder for more details.

Lichess Puzzle Database (1)
Interactive version available here

Contents

  1. Descriptive statistics
  2. Distributions
  3. Puzzle rating
  4. Rating deviation
  5. Popularity

Descriptive statistics

See lichess_db_puzzle_eda_descriptive.ipynb for more details.

We used pandas to read the lichess_db_puzzle_clean.csv file produced by puzzle_journey_data_collection_processing.ipynb into a dataframe called puzzles_df. We found that there are over 3 million puzzles in the database.

Screenshot 2023-05-16 at 8 43 48 AM

We then examined the descriptive statistics using puzzles_df.describe().

Screenshot 2023-05-16 at 8 45 15 AM
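The loading and summary steps can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the rows are made up, and the column name Number_of_Plays is an assumption (only Rating, Rating_Deviation, Popularity, and Puzzle_Length are named in this overview).

```python
import io

import pandas as pd

# Tiny made-up stand-in for lichess_db_puzzle_clean.csv.
# Number_of_Plays is a hypothetical column name used for illustration.
sample_csv = io.StringIO(
    "Rating,Rating_Deviation,Popularity,Number_of_Plays,Puzzle_Length\n"
    "1200,75,90,1500,3\n"
    "1514,80,85,300,3\n"
    "2100,110,60,0,7\n"
)

puzzles_df = pd.read_csv(sample_csv)

# describe() reports count, mean, std, min, quartiles, and max
# for every numeric column.
summary = puzzles_df.describe()

print(len(puzzles_df))  # number of puzzles (rows)
print(summary.loc[["min", "50%", "max"], ["Rating", "Puzzle_Length"]])
```

The "50%" row of the describe() output is the median, which is where figures such as the median rating and median puzzle length come from.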

Some interesting characteristics include the following.

  • The minimum puzzle rating is 545, while the maximum puzzle rating is 3,212.
  • The median puzzle rating is 1,514.
  • The median puzzle length is 3 moves (i.e. 2 moves made by the player).
  • The maximum puzzle length is 29 moves (i.e. 15 moves made by the player)!
  • There are puzzles in the database that have yet to be played.
  • Meanwhile, the maximum number of plays is over 1,000,000!
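The conversions quoted above (length 3 is 2 player moves, length 29 is 15 player moves) are consistent with the player making both the first and the last move of an odd-length sequence. Under that assumption, which is inferred from the figures above rather than taken from the notebook, the conversion is a rounded-up halving. The helper name player_moves is hypothetical.

```python
def player_moves(puzzle_length: int) -> int:
    """Moves made by the player, assuming the player moves first and last
    and the two sides alternate."""
    return (puzzle_length + 1) // 2


for length in (1, 3, 29):
    print(length, player_moves(length))
```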

We also used the chess package to view some puzzles. For example, below is the puzzle with the highest rating.

Screenshot 2023-05-16 at 8 47 43 AM

Here is the puzzle with the highest number of plays.

Screenshot 2023-05-16 at 8 48 14 AM

Finally, we examined the feature correlations.

Screenshot 2023-05-16 at 8 49 06 AM

  • Perhaps unsurprisingly, Puzzle_Length and Rating are moderately positively correlated (i.e. longer puzzles tend to be more difficult).
  • There is a weak negative correlation between Popularity and Rating_Deviation. This may indicate that popular puzzles tend to have lower rating deviation.

The second point seems less straightforward than the first. However, think of Rating_Deviation as measuring the predictability of a puzzle's difficulty: a puzzle with low rating deviation is performing at a relatively stable rating (i.e. difficulty), while a puzzle with high rating deviation has a relatively unstable rating. Meanwhile, Popularity essentially measures whether a puzzle is meeting users' expectations. Under this interpretation, it makes sense that a more popular puzzle is performing as expected and hence has a more predictable level of difficulty.
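A correlation matrix of this kind can be computed with DataFrame.corr(), which uses the Pearson coefficient by default. The sketch below uses made-up values chosen only to reproduce the signs of the two relationships discussed above. The magnitudes are not those of the real data, and whether the notebook used Pearson is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {
        "Rating": [900, 1200, 1500, 1800, 2400],
        "Puzzle_Length": [1, 3, 3, 5, 9],
        "Popularity": [95, 90, 88, 70, 40],
        "Rating_Deviation": [74, 76, 75, 90, 120],
    }
)

# Pairwise Pearson correlations between all numeric columns.
corr = toy_df.corr()

print(corr.loc["Puzzle_Length", "Rating"])        # positive in this toy data
print(corr.loc["Popularity", "Rating_Deviation"])  # negative in this toy data
```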

Distributions

See lichess_db_puzzle_eda_distributions.ipynb for more details.

We used ggplot to visualize the distributions of puzzle rating, rating deviation, popularity, number of plays, puzzle length, themes, and opening tags.

Puzzle rating

rating_histogram

The distribution of ratings is unimodal and fairly symmetric. Below is a boxplot.

rating_boxplot

Rating deviation

rating_deviation_histogram

The distribution of rating deviation is unimodal and right-skewed. Most puzzles have a rating deviation of less than 100. Below is a boxplot—note the potential outliers.

rating_deviation_boxplot
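The "potential outliers" in a boxplot are conventionally the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Assuming the plots here use that default whisker rule, the same points can be found numerically. The values below are made up for illustration.

```python
import pandas as pd

# Made-up rating deviations with one extreme value.
rating_deviation = pd.Series([74, 75, 76, 77, 78, 80, 82, 85, 90, 300])

q1 = rating_deviation.quantile(0.25)
q3 = rating_deviation.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; points outside them are potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rating_deviation[
    (rating_deviation < lower_fence) | (rating_deviation > upper_fence)
]

# In a right-skewed sample the mean is pulled above the median.
print(rating_deviation.mean(), rating_deviation.median())
print(outliers.tolist())
```

The same check applies to the other right-skewed features, such as number of plays, where a small interquartile range makes the upper fence easy to exceed.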

Popularity

popularity_histogram

The distribution of popularity is unimodal and left-skewed. Most puzzles are rather popular, suggesting there are few puzzles that users find to be inaccurate, poorly designed, or otherwise unfair. Below is a boxplot—note the potential outliers.

popularity_boxplot

Number of plays

number_of_plays_histogram

The distribution of number of plays is unimodal and right-skewed. There are comparatively few puzzles with more than 2,000 plays. Below is a boxplot—note the potential outliers and small interquartile range.

plays_boxplot

Puzzle length

length_bar

Puzzle length ranges from 1 to 29 (i.e. from 1 to 15 player moves), though the distribution is right-skewed. The most common puzzle length is 3, which corresponds to 2 moves made by the player. Below is a boxplot; note the potential outliers occurring around lengths 9 to 11.

length_boxplot

Themes

theme_bar

There are 60 distinct themes (not including the healthyMix and playerGames themes). The 5 most frequently occurring themes are as follows.

  • short
  • middlegame
  • crushing
  • endgame
  • advantage

It is interesting to recall here that these puzzles are generated from user games on lichess.org. So, be on the lookout for these sorts of tactics in your games!
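Because a puzzle can carry several themes, counting theme frequencies means splitting each puzzle's theme field before tallying. The sketch below assumes a column named Themes holding space-separated theme names. Both the column name and the rows are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Made-up puzzles; "Themes" is an assumed column of space-separated names.
toy_df = pd.DataFrame(
    {
        "Themes": [
            "crushing middlegame short",
            "advantage endgame short",
            "endgame mate mateIn1 oneMove",
            "crushing middlegame short",
        ]
    }
)

# Split each string into a list, give every theme its own row, then count.
theme_counts = toy_df["Themes"].str.split().explode().value_counts()

print(theme_counts.head(5))
print(theme_counts.size)  # number of distinct themes
```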

Opening tags

opening_bar

There are over 100 opening tags, not including variations! The most common opening in the puzzle database is the Sicilian_Defense—watch out for early tactics in your Sicilian games!

Puzzle rating

See lichess_db_puzzle_eda_rating.ipynb for more details.

The strongest feature correlation to puzzle rating was with puzzle length, visualized below.

length_vs_rating

The median puzzle rating increases as puzzle length increases.
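The trend can also be checked numerically by grouping on puzzle length and taking the median rating in each group. This is a sketch on made-up rows, not the notebook's code.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {
        "Puzzle_Length": [1, 1, 3, 3, 3, 5, 5, 7],
        "Rating": [800, 1000, 1300, 1500, 1600, 1700, 1900, 2200],
    }
)

# Median rating per puzzle length, indexed by length in ascending order.
median_by_length = toy_df.groupby("Puzzle_Length")["Rating"].median()

print(median_by_length)
print(median_by_length.is_monotonic_increasing)
```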

We investigated the relationship between puzzle theme and puzzle ratings, as well.

theme_vs_rating

The themes with median puzzle rating greater than 2,000 are as follows.

  • castling
  • quietMove
  • veryLong
  • underPromotion
  • mateIn5
  • enPassant
  • defensiveMove
  • zugzwang

These are all themes that involve either relatively long puzzles or moves that are subtle, rare, or otherwise non-routine.

Meanwhile, backRankMate has the lowest median rating; these puzzles are generally considered to be quite easy.

Finally, we looked at opening tag as it relates to puzzle rating.

opening_vs_rating

The median rating is pretty consistent across openings. The Zukertort_Defense and Amar_Gambit are the only openings with a median rating over 2,000, while the Borg_Opening is the only opening with a median rating under 1,000.

Rating deviation

See lichess_db_puzzle_eda_deviation.ipynb for more details.

Rating deviation is very consistent across most features in the database. Two things that stood out from our investigations here are the following.

  • The theme equality has a much higher median rating deviation than the other themes.

theme_vs_rating_deviation

  • The opening with the highest median rating deviation is the Norwegian_Defense.

opening_vs_rating_deviation

Popularity

See lichess_db_puzzle_eda_popularity.ipynb for more details.

Aside from the relationship between Puzzle_Length and Rating, the next strongest correlation was between Popularity and Rating_Deviation, visualized below.

popularity_vs_rating_deviation_boxplot

Observe that as popularity increases, the median rating deviation decreases.

The other features seemed to have little impact on popularity—notably, median popularity is fairly stable when compared across themes.

theme_vs_popularity_boxplot

Among openings, the Queens_Pawn_Mengarini_Attack has the lowest median popularity.

opening_vs_popularity_boxplot