In [1]:
import ibis
from ibis import _

ibis.options.interactive = True

In [2]:
moves = ibis.read_parquet("data/lichess/moves/*.parquet")
moves

The official count is 89,342,529 games, but this table excludes games with 0 moves. Among the first 100,000 games in the file, 305 had 0 moves (0.305%), so the overall dropout rate of 0.197% here seems reasonable. (I wouldn't be surprised if there are more "fake" games starting at midnight, but I haven't verified this.)

In [3]:
%%time
moves.site.nunique().to_pyarrow().as_py()

CPU times: user 2min 17s, sys: 10.4 s, total: 2min 28s
Wall time: 25.5 s


89166466
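
The dropout rate quoted above can be reproduced from the two counts. This is just a quick arithmetic check; the variable names are mine, and the numbers are copied from the text and the cell output.

```python
# Counts copied from the text and from the cell output above.
official_games = 89_342_529    # official count quoted in the text
games_with_moves = 89_166_466  # distinct `site` values in the moves table

missing_games = official_games - games_with_moves
dropout_rate = missing_games / official_games

# Rate observed in the first 100,000 games, for comparison.
sample_rate = 305 / 100_000

print(f"{missing_games:,} games missing ({dropout_rate:.3%})")
print(f"sample rate: {sample_rate:.3%}")
```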

The average game is shorter than I expected; I had assumed it would be [around 40 moves](https://chess.stackexchange.com/questions/2506/what-is-the-average-length-of-a-game-of-chess).

In [4]:
%%time
moves.group_by(moves.site).agg(
    move_count=_.count()
).move_count.mean().to_pyarrow().as_py() / 2

CPU times: user 2min 44s, sys: 11.3 s, total: 2min 55s
Wall time: 27.6 s


33.30006908090313
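
The division by 2 in the cell above presumably converts plies (half-moves) into full moves, assuming the table holds one row per ply. Here is a minimal plain-Python sketch of the same computation on made-up rows (the game IDs are hypothetical):

```python
from collections import Counter

# Hypothetical toy data: one (site, move_ply) pair per half-move.
rows = [
    ("game-a", 1), ("game-a", 2), ("game-a", 3), ("game-a", 4),
    ("game-b", 1), ("game-b", 2),
]

# Rows per game, i.e. the ply count of each game.
plies_per_game = Counter(site for site, _ply in rows)

# Mean ply count across games, halved to get full moves.
average_plies = sum(plies_per_game.values()) / len(plies_per_game)
average_moves = average_plies / 2

print(average_moves)
```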

My first hypothesis was that "abandoned" games were bringing down the average. (From experience, I know that games are aborted if either player doesn't make their first move in time.)

In [5]:
%%time
games = moves[moves.move_ply == 1]
games.termination.value_counts().preview()

CPU times: user 1min 21s, sys: 2.4 s, total: 1min 23s
Wall time: 12.4 s


While we should likely limit ourselves to games that ended normally (including those where a player ran out of time), we still haven't found our answer as to why games are shorter than expected.

In [6]:
%%time
moves.group_by(moves.site).agg(
    move_count=_.count(), termination=_.termination.arbitrary()
).group_by("termination").agg(average_move_count=_.move_count.mean() / 2).preview()

CPU times: user 3min 26s, sys: 16.5 s, total: 3min 43s
Wall time: 50.1 s


Next, let's explore the time control distribution. We adapt the [`parse_time_control`](https://python-chess.readthedocs.io/en/latest/_modules/chess/pgn.html#skip_game) logic to native Ibis code, as calling the function in a UDF would be prohibitively slow[^1].

In [7]:
%%time
import ibis.expr.types as ir


def parse_time_control(time_control: ir.StringColumn) -> dict[str, ir.Column]:
    index = time_control.find("+")
    base_time = time_control.substr(0, index).try_cast(int)
    increment = time_control.substr(index + 1).try_cast(int)

    time_control_type = (
        ibis.case()
        .when(time_control.isnull() | time_control.startswith("?"), "UNKNOWN")
        .when(time_control.startswith("-"), "UNLIMITED")
        .when(base_time + 60 * increment < 3 * 60, "BULLET")
        .when(base_time + 60 * increment < 15 * 60, "BLITZ")
        .when(base_time + 60 * increment < 60 * 60, "RAPID")
        .else_("STANDARD")
        .end()
    )

    return {
        "time_control_base_time": base_time,
        "time_control_increment": increment,
        "time_control_type": time_control_type,
    }


games_with_parsed_time_control = games.mutate(**parse_time_control(games.time_control))
games_with_parsed_time_control.time_control_type.value_counts().preview()

CPU times: user 2min 21s, sys: 2.85 s, total: 2min 23s
Wall time: 20.4 s


Note that this is not the logic Lichess uses to determine the time control type, so some labels differ (e.g. `10+0` is classified as Blitz here, but Lichess considers it Rapid).
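
To spot-check individual time controls, the classification rule can be mirrored in plain Python. This is a rough sketch: `classify_notebook_rule` follows the thresholds in the Ibis expression above, while `classify_lichess_estimate` is my assumption of how Lichess buckets games (base time plus 40 times the increment, with the cutoffs listed in the code), and is not taken from this notebook. Both helpers assume a `base+increment` string in seconds.

```python
def _parse(time_control):
    """Split a "base+increment" string (in seconds) into two ints."""
    base, _, increment = time_control.partition("+")
    return int(base), int(increment)


def classify_notebook_rule(time_control):
    """Plain-Python mirror of the thresholds used in the Ibis expression."""
    if time_control is None or time_control.startswith("?"):
        return "UNKNOWN"
    if time_control.startswith("-"):
        return "UNLIMITED"
    base, increment = _parse(time_control)
    total = base + 60 * increment
    if total < 3 * 60:
        return "BULLET"
    if total < 15 * 60:
        return "BLITZ"
    if total < 60 * 60:
        return "RAPID"
    return "STANDARD"


# ASSUMPTION: these cutoffs are my recollection of Lichess's speed
# categories, not something stated in this notebook.
LICHESS_LIMITS = [
    (30, "ULTRABULLET"),
    (180, "BULLET"),
    (480, "BLITZ"),
    (1500, "RAPID"),
]


def classify_lichess_estimate(time_control):
    """Assumed Lichess-style rule for timed games: base + 40 * increment."""
    base, increment = _parse(time_control)
    estimated = base + 40 * increment
    for limit, label in LICHESS_LIMITS:
        if estimated < limit:
            return label
    return "CLASSICAL"


for tc in ("60+0", "120+1", "180+0", "600+0"):
    print(tc, classify_notebook_rule(tc), classify_lichess_estimate(tc))
```

Under these assumptions, `120+1` and `600+0` are the two time controls in the loop where the labels disagree.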

As a sanity check, let's compare the null count for `time_control_base_time` and `time_control_increment` with the number of Unlimited games.

In [12]:
%%time
games_with_parsed_time_control[
    _.time_control_base_time.isnull()
].count().to_pyarrow().as_py()

CPU times: user 1min 27s, sys: 2.21 s, total: 1min 29s
Wall time: 11.8 s


79076

In [13]:
%%time
games_with_parsed_time_control[
    _.time_control_increment.isnull()
].count().to_pyarrow().as_py()

CPU times: user 1min 31s, sys: 2.33 s, total: 1min 33s
Wall time: 12.4 s


79076

It doesn't seem like longer time controls result in longer games; in fact, it's almost the opposite. Maybe this has something to do with the fact that stronger players generally play shorter time controls online.

In [17]:
%%time
games_with_parsed_time_control.join(
    moves.group_by(moves.site).agg(move_count=_.count()), "site"
).group_by(["time_control_type", "termination"]).agg(
    count=_.count(), average_move_count=_.move_count.mean() / 2
).order_by(
    ["time_control_type", "termination"]
).preview(
    max_rows=25
)

CPU times: user 6min 30s, sys: 7min 14s, total: 13min 44s
Wall time: 7min 29s


In [19]:
%%time
moves_with_parsed_time_control = moves.mutate(**parse_time_control(moves.time_control))
moves_with_parsed_time_control.group_by(moves_with_parsed_time_control.site).agg(
    move_count=_.count(),
    time_control_type=_.time_control_type.arbitrary(),
    termination=_.termination.arbitrary(),
).group_by(["time_control_type", "termination"]).agg(
    count=_.count(), average_move_count=_.move_count.mean() / 2
).order_by(
    ["time_control_type", "termination"]
).preview(
    max_rows=25
)

CPU times: user 23min 1s, sys: 30 s, total: 23min 31s
Wall time: 3min 51s


The most common value for both `white_elo` and `black_elo` is `1500`, because that's the starting Lichess rating. It's probably worth excluding it, or treating it separately, for most purposes.

In [39]:
%%time
games["white_elo", "black_elo"].describe().preview()

CPU times: user 2min 59s, sys: 5.75 s, total: 3min 5s
Wall time: 28.7 s


It also looks like Lichess has a minimum Elo rating of `400`; some online sleuthing shows that this bar may have been lowered over the years.

In [40]:
%%time
elo_columns = ["white_elo", "black_elo"]
games[elo_columns].cast(dict.fromkeys(elo_columns, int)).describe().preview()

CPU times: user 3min 12s, sys: 14.5 s, total: 3min 27s
Wall time: 41.6 s


It does seem that the average game gets progressively longer as we rise through the ranks. This may be the high-level answer we were looking for.

In [32]:
%%time
moves_with_elo_bracket = moves.cast(dict.fromkeys(elo_columns, int)).mutate(
    elo_bracket=ibis.case()
    .when(_.white_elo < 500, "-0499")
    .when((500 <= _.white_elo) & (_.white_elo < 1000), "0500-0999")
    .when((1000 <= _.white_elo) & (_.white_elo < 1500), "1000-1499")
    .when((1500 <= _.white_elo) & (_.white_elo < 2000), "1500-1999")
    .when((2000 <= _.white_elo) & (_.white_elo < 2500), "2000-2499")
    .when((2500 <= _.white_elo) & (_.white_elo < 3000), "2500-2999")
    .when((3000 <= _.white_elo) & (_.white_elo < 3500), "3000-3499")
    .when(3500 <= _.white_elo, "3500+")
    .end()
)
moves_with_elo_bracket[(_.white_elo != 1500) & (_.black_elo != 1500)].group_by(
    _.site
).agg(
    move_count=_.count(),
    elo_bracket=_.elo_bracket.arbitrary(),
    termination=_.termination.arbitrary(),
).group_by(
    ["elo_bracket", "termination"]
).agg(
    count=_.count(), average_move_count=_.move_count.mean() / 2
).order_by(
    ["elo_bracket", "termination"]
).preview(
    max_rows=40
)

CPU times: user 11min 39s, sys: 27.5 s, total: 12min 7s
Wall time: 2min 32s
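
The bracket labels are zero-padded so that sorting them as strings matches the numeric order. Here is a small plain-Python sketch of the same bucketing, handy for checking edge cases; the helper name `elo_bracket` is mine, and it uses integer division instead of the explicit chain of conditions.

```python
def elo_bracket(white_elo):
    """Return the 500-point bracket label for a rating, or None if missing."""
    if white_elo is None:
        return None
    if white_elo < 500:
        return "-0499"
    if white_elo >= 3500:
        return "3500+"
    lower = white_elo // 500 * 500
    # Zero-pad to four digits so labels sort lexicographically.
    return f"{lower:04d}-{lower + 499:04d}"


ratings = [400, 499, 500, 1499, 1500, 2999, 3499, 3500]
labels = [elo_bracket(rating) for rating in ratings]
print(labels)
```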


I was initially surprised that not all moves had clock annotations.

In [41]:
%%time
moves.move_comment.contains("[%clk").value_counts().preview()

CPU times: user 4min 5s, sys: 18.4 s, total: 4min 23s
Wall time: 43.2 s


A quick investigation shows that clock annotations are missing if and only if the time control is Unlimited, which makes sense: there is no clock to record.

In [38]:
%%time
moves_with_parsed_time_control.mutate(
    has_clock=_.move_comment.contains("[%clk")
).group_by(["has_clock", "time_control_type"]).count().order_by(
    ["has_clock", "time_control_type"]
).preview()

CPU times: user 23min 53s, sys: 23.9 s, total: 24min 17s
Wall time: 3min 33s
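
Beyond checking for its presence, the clock annotation can also be parsed. This is a minimal sketch that assumes comments carry the clock as `[%clk h:mm:ss]`; that format, the sample comments, and the helper name `clock_seconds` are my assumptions rather than something verified against this dataset.

```python
import re

# ASSUMPTION: clock annotations look like "[%clk 0:03:00]".
CLOCK_PATTERN = re.compile(r"\[%clk (\d+):(\d{2}):(\d{2})\]")


def clock_seconds(move_comment):
    """Return the remaining clock time in seconds, or None if absent."""
    if not move_comment:
        return None
    match = CLOCK_PATTERN.search(move_comment)
    if match is None:
        return None
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical comments for illustration.
for comment in ("[%eval 0.17] [%clk 0:03:00]", "[%eval 0.17]", None):
    print(repr(comment), clock_seconds(comment))
```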


In [42]:
%%time
moves.move_comment.contains("[%eval").value_counts().preview()

CPU times: user 4min 31s, sys: 18.1 s, total: 4min 50s
Wall time: 46.8 s


81,011,995 of the 89,342,529 games have no evaluations, so just over 9% of games have at least one. That is better than the [6% advertised](https://database.lichess.org/#notes), although I imagine the proportion of games with evaluations has increased over time.

Furthermore, 5,759,094 games (6.45%) have evaluations for every move.

In [45]:
%%time
moves.mutate(has_eval=_.move_comment.contains("[%eval")).group_by(_.site).agg(
    percent_has_eval=_.has_eval.mean()
).percent_has_eval.value_counts().order_by(
    ibis.desc("percent_has_eval_count")
).preview()

CPU times: user 6min 6s, sys: 42 s, total: 6min 48s
Wall time: 2min 11s
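
The per-game aggregation above takes the mean of a boolean flag, which gives the fraction of moves in each game that carry an evaluation. Here is a plain-Python sketch of the same idea on made-up rows (the game IDs and comments are hypothetical):

```python
from collections import defaultdict

# Hypothetical toy data: one (site, move_comment) pair per move.
rows = [
    ("game-a", "[%eval 0.17] [%clk 0:03:00]"),
    ("game-a", "[%eval 0.25] [%clk 0:02:58]"),
    ("game-b", "[%clk 0:03:00]"),
    ("game-b", "[%clk 0:02:59]"),
    ("game-c", "[%eval 0.10]"),
    ("game-c", None),
]

# Collect one has_eval flag per move, grouped by game.
flags = defaultdict(list)
for site, comment in rows:
    flags[site].append("[%eval" in (comment or ""))

# The mean of the boolean flags is the share of moves with an evaluation.
percent_has_eval = {site: sum(v) / len(v) for site, v in flags.items()}

no_eval = [s for s, share in percent_has_eval.items() if share == 0]
full_eval = [s for s, share in percent_has_eval.items() if share == 1]
print(percent_has_eval, no_eval, full_eval)
```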


# Appendix

In [7]:
# from functools import reduce

# import chess.pgn
# from chess.pgn import TimeControlType


# @ibis.udf.scalar.python
# def parse_time_control(time_control: str) -> dict[str, int]:
#     tc = chess.pgn.parse_time_control(time_control)
#     tc_dict = {"type": tc.type.value}

#     if tc.type in [TimeControlType.UNKNOW, TimeControlType.UNLIMITED]:
#         return tc_dict

#     if len(tc.parts) > 1:
#         raise ValueError("Lichess does not support multipart time controls.")

#     if tc.parts[0].delay:
#         raise ValueError("Lichess does not support delay.")

#     tc_dict["base_time"] = tc.parts[0].time
#     tc_dict["increment"] = int(tc.parts[0].increment)
#     return tc_dict


# games_with_parsed_time_control = (
#     moves[moves.move_ply == 1]
#     .mutate(parsed_time_control=parse_time_control(moves.time_control))
#     .mutate(
#         time_control_type=reduce(
#             lambda case_expr, member: case_expr.when(member.value, member.name),
#             TimeControlType,
#             _.parsed_time_control.get("type").case(),
#         ).end(),
#         base_time=_.parsed_time_control.get("base_time"),
#         increment=_.parsed_time_control.get("increment"),
#     )
#     .drop("move_ply", "parsed_time_control")
# )
# games_with_parsed_time_control

In [8]:
# %%time
# games_with_parsed_time_control[
#     games_with_parsed_time_control.termination.isin(["Normal", "Time forfeit"])
# ].time_control_type.value_counts().preview()

In [38]:
games_with_parsed_time_control.time_control_type.value_counts()

In [11]:
# games_with_parsed_time_control.time_control_type.value_counts()

In [12]:
# games_with_parsed_time_control[_.time_control_type.isnull()].time_control.contains("+").value_counts()

In [39]:
games_with_parsed_time_control.time_control_base_time.value_counts().order_by(ibis.desc("time_control_base_time_count"))

In [42]:
games_with_parsed_time_control[_.time_control_base_time.isnull()].count()

┌───────┐
│ 79076 │
└───────┘

In [40]:
games_with_parsed_time_control.time_control_increment.value_counts().order_by(ibis.desc("time_control_increment_count"))