In [1]:
import ibis
from ibis import _

ibis.options.interactive = True

# Create game-level features

In [2]:
game_level_features = []

In [None]:
games = ibis.read_parquet("data/games.parquet")
games

## `event`-based features

The `event` field includes interesting information, such as whether the game was rated or part of a tournament.

In the preview above, every visible `event` value starts with `"Rated "`. Is that true of the whole column?

In [None]:
games.event[: len("Rated ")].value_counts()

It looks like unrated games simply omit the prefix. Let's use this to create our first feature, `is_rated`.

In [None]:
is_rated = games.event.startswith("Rated ")
is_rated.value_counts()

We'll add each feature we define in this section to our list of game-level features. Spoiler alert: when we combine our features later, we'll see an interesting property of working with Ibis this way.

Don't forget to give each feature a meaningful name!

In [None]:
game_level_features.append(is_rated.name("is_rated"))
game_level_features

What else can we extract from the `event` field? For starters, let's examine the most popular `event` values.

In [None]:
games.event.value_counts().order_by(ibis.desc("event_count"))

Lichess categorizes games according to their "time control". If you're not familiar with chess, Classical games are the slowest, followed by Rapid, then Blitz. Bullet games are very fast, and UltraBullet games are, well, ultra-fast.

Correspondence games are essentially untimed. We'll exclude these games later, because we want to see how time modulates win likelihood.

Notice that we reuse the `is_rated` logic below when creating the time control feature.

In [None]:
event_with_rated_prefix_stripped = is_rated.ifelse(
    games.event[len("Rated ") :], games.event
)
lichess_time_control_type = event_with_rated_prefix_stripped.substr(
    0, event_with_rated_prefix_stripped.find(" ")
)
lichess_time_control_type.value_counts()
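If the string slicing above is hard to follow, here is a minimal plain-Python sketch of the same idea: drop an optional `"Rated "` prefix, then keep everything before the first space. The helper name `time_control_type` and the sample strings are hypothetical and only for illustration; the real feature is the Ibis expression above.

```python
def time_control_type(event: str) -> str:
    """Return the first word of `event` after dropping an optional "Rated " prefix."""
    prefix = "Rated "
    # Same branch as `is_rated.ifelse(...)`: strip the prefix only when present.
    stripped = event[len(prefix):] if event.startswith(prefix) else event
    # Keep the text before the first space. If there is no space,
    # this sketch simply returns the whole string.
    return stripped.split(" ", 1)[0]


# Hypothetical event strings, made up for this example.
for event in ["Rated Blitz game", "Bullet game"]:
    print(event, "->", time_control_type(event))
```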

In [9]:
game_level_features.append(lichess_time_control_type.name("lichess_time_control_type"))

### Exercise 1

The last `event`-based feature we want for now is whether the game was a tournament game. No need to overcomplicate things—just check whether the `event` field [contains](https://ibis-project.org/reference/expression-strings#ibis.expr.types.strings.StringValue.contains) the relevant text.

In [10]:
is_tournament = games

#### Solution

In [None]:
%load solutions/nb02_ex01.py

As usual, don't forget to add the feature you created to the list!

In [12]:
game_level_features.append(is_tournament.name("is_tournament"))

## Elo-based features

Elo ratings provide a comparative measure of skill across a pool of players and could be the basis for a number of meaningful features.

Let's start by adding features corresponding to the Elo rating for each player.

In [None]:
white_elo = games.white_elo.cast(int)
white_elo

In [14]:
game_level_features.append(white_elo.name("white_elo"))

In [None]:
black_elo = games.black_elo.cast(int)
black_elo

In [16]:
game_level_features.append(black_elo.name("black_elo"))

The difference in skill between the two players is another obvious inclusion.

In [None]:
elo_diff = white_elo - black_elo
elo_diff

In [18]:
game_level_features.append(elo_diff.name("elo_diff"))

For our final Elo-based feature, let's compute each player's rating change since their previous game.

Keep in mind that players have separate ratings for each time control, so we group by both the player and the `lichess_time_control_type` feature we created earlier. For our sort key, we can use the concatenation of the `utc_date` and `utc_time` columns; as long as both are zero-padded strings, sorting them as text also sorts them chronologically.
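Conceptually, this is a "lag" within each (player, time control) group. The plain-Python sketch below shows what we expect the result to look like; the rows, names, and timestamp format are hypothetical, and the function is an illustration of the idea rather than how Ibis or DuckDB computes it.

```python
# Hypothetical rows: (player, time_control, timestamp, elo)
rows = [
    ("alice", "Blitz", "2024.01.01 10:00:00", 1500),
    ("alice", "Bullet", "2024.01.01 10:05:00", 1400),
    ("alice", "Blitz", "2024.01.01 10:10:00", 1508),
    ("Alice", "Blitz", "2024.01.01 10:20:00", 1501),
]


def elo_gained_since_previous_game(rows):
    """Pair each row with its Elo change versus the same player's previous game."""
    previous = {}  # (lowercased player, time control) -> Elo in the last game seen
    result = []
    # Sort by the timestamp string, as the notebook does with its concatenated key.
    for player, time_control, timestamp, elo in sorted(rows, key=lambda r: r[2]):
        key = (player.lower(), time_control)
        prior = previous.get(key)
        # The first game in each group has no previous game, so the gain is None.
        gain = None if prior is None else elo - prior
        result.append((player, time_control, timestamp, gain))
        previous[key] = elo
    return result


gains = [row[3] for row in elo_gained_since_previous_game(rows)]
print(gains)
```

Note that the first game in each group has no predecessor, so its value is missing; the same should hold for the windowed expression.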

In [None]:
utc_timestamp = games.utc_date + " " + games.utc_time
utc_timestamp

In [None]:
white_elo_gained_since_previous_game = white_elo - white_elo.lag().over(
    ibis.window(
        group_by=[games.white.lower(), lichess_time_control_type],
        order_by=utc_timestamp,
    )
)
white_elo_gained_since_previous_game

Is that correct? We can sanity check our implementation by selecting the feature alongside the relevant columns.

In [None]:
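# Include the time control, since ratings are tracked separately for each one.
games.select(
    "white", lichess_time_control_type.name("time_control"), "utc_date", "utc_time",
    "white_elo", white_elo_gained_since_previous_game.name("elo_gained"),
).order_by("white", "time_control", "utc_date", "utc_time")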

Looks good to me! We can copy the logic to compute `black_elo_gained_since_previous_game` and add both features to our list.

In [None]:
black_elo_gained_since_previous_game = black_elo - black_elo.lag().over(
    ibis.window(
        group_by=[games.black.lower(), lichess_time_control_type],
        order_by=utc_timestamp,
    )
)
black_elo_gained_since_previous_game

In [23]:
game_level_features += [
    white_elo_gained_since_previous_game.name("white_elo_gained_since_previous_game"),
    black_elo_gained_since_previous_game.name("black_elo_gained_since_previous_game"),
]

## Title features

Last but not least, we can add features corresponding to the title of each player (if any).

In [None]:
white_title = games.white_title
white_title.value_counts()

In [25]:
game_level_features.append(white_title.name("white_title"))

In [None]:
black_title = games.black_title
black_title.value_counts()

In [27]:
game_level_features.append(black_title.name("black_title"))

# Combine game-level features

Earlier in this notebook, we hinted that collecting all of our features in a list would pay off when it came time to combine them.

Although Ibis's _interactive mode_ has been eagerly evaluating each feature for display, every feature is still just an Ibis expression derived from `games`. As a result, we can simply select all of our features from the original table in one step.

In [None]:
games.select("game_id", *game_level_features)

# Create move-level features

In [None]:
moves = ibis.read_parquet("data/moves/*.parquet")
moves

## Eval-based features

The games in our dataset all include move-by-move computer evaluations, always from white's point of view. For example, `[%eval 2.00]` indicates that white has a 200 centipawn advantage, which is the equivalent of having two extra pawns. `[%eval #-4]` means that white is getting mated in four moves.

Theoretically, the objective evaluation should be a good predictor of win probability. Other things being equal, the player with the better position should be more likely to win. Of course, the computer makes its evaluation assuming perfect play; realistically, minute advantages don't mean much until you reach the highest levels of play.
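To make the notation concrete, here is a small plain-Python sketch that reads the two example comments above. The pattern `SIMPLE_EVAL` and the helper `parse_eval` are simplified, hypothetical stand-ins written for this illustration; they are not the `EVAL_REGEX` from the `chess` library and ignore details such as search depth.

```python
import re

# Simplified pattern for illustration only: either "#N" (mate in N) or a pawn-unit score.
SIMPLE_EVAL = re.compile(
    r"\[%eval\s(?:#(?P<mate>[+-]?\d+)|(?P<pawns>[+-]?\d+(?:\.\d+)?))\]"
)


def parse_eval(comment):
    """Return ("mate", n) or ("centipawns", n) from white's point of view, else None."""
    match = SIMPLE_EVAL.search(comment)
    if match is None:
        return None
    if match["mate"] is not None:
        # Negative values mean white is the side getting mated.
        return ("mate", int(match["mate"]))
    # Scores are written in pawns; multiply by 100 to get centipawns.
    return ("centipawns", round(float(match["pawns"]) * 100))


print(parse_eval("[%eval 2.00]"))
print(parse_eval("[%eval #-4]"))
```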

While the logic to parse the evaluation from the `comment` field is a bit hairy, we can apply the [`EVAL_REGEX` from the Python `chess` library](https://python-chess.readthedocs.io/en/v1.11.1/_modules/chess/pgn.html). Since Ibis's [`re_extract()`](https://ibis-project.org/reference/expression-strings.html#ibis.expr.types.strings.StringValue.re_extract) method returns only a single capture group per call, and we want all of the named groups at once, we drop into native DuckDB SQL.

In [30]:
eval_based_features = []

In [None]:
import string

from chess.pgn import EVAL_REGEX

moves_with_parsed_eval = (
    moves.alias("moves")
    .sql(
        f"""
        SELECT
          *,
          REGEXP_EXTRACT(
            comment,
            '{EVAL_REGEX.pattern.translate(str.maketrans("", "", string.whitespace))}',
            ['prefix', 'mate', 'cp', 'depth', 'suffix']
          ) AS eval
        FROM moves
        """
    )
    .unpack("eval")
)
moves_with_parsed_eval

Let's also look at the end of a game to see an example of the `mate` field.

In [None]:
moves_with_parsed_eval.filter(_.game_id == "12IUQJ6b").to_pandas()

A lot of interesting features can be derived from the eval. However, for the purpose of this tutorial (and to avoid confusing non-chess players!), we'll add the `mate` and `cp` features and call it a day.

In [33]:
eval_based_features += [
    moves_with_parsed_eval.mate.name("mate"),
    moves_with_parsed_eval.cp.name("cp"),
]

In [None]:
moves_with_parsed_eval.select("game_id", "ply", *eval_based_features)

## Clock-based features

Quick backstory: When I first explored building a live win probability model for chess games, one of the factors I was most interested in was how ["time pressure"](https://en.wikipedia.org/wiki/Time_trouble) affects win likelihood. Take the extreme case: you can be ahead by all the material in the world on the board, but if you only have a couple of seconds left on the clock, you're unlikely to convert the advantage in time.

In [35]:
clock_based_features = []

### Exercise 2

In the same vein as what we did to parse eval information above, we can apply the [`CLOCK_REGEX` from the Python `chess` library](https://python-chess.readthedocs.io/en/v1.11.1/_modules/chess/pgn.html) to extract clock information for each move.

In [36]:
moves_with_parsed_clock = moves

#### Solution

In [None]:
%load solutions/nb02_ex02.py

### Exercise 3

We're not done! While it's nice that we've extracted clock components, a more meaningful feature would be the total number of seconds left on the clock. Compute this expression, and assign it to the `clock` variable.

The [`cast()`](https://ibis-project.org/reference/expression-generic#ibis.expr.types.generic.Value.cast) method will probably come in handy for this.

In [38]:
clock = moves_with_parsed_clock

#### Solution

In [None]:
%load solutions/nb02_ex03.py
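As a cross-check on the arithmetic, here is a plain-Python sketch of the conversion. It assumes the parsed clock components are strings for hours, minutes, and whole seconds; that layout and the helper name `clock_to_seconds` are assumptions made for this example, not something taken from the solution file.

```python
def clock_to_seconds(hours: str, minutes: str, seconds: str) -> int:
    """Convert string clock components into the total number of seconds remaining."""
    # Cast each component to an integer before combining, much like `cast()` in Ibis.
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


# Hypothetical clock reading of 0 hours, 3 minutes, 7 seconds.
print(clock_to_seconds("0", "03", "07"))
```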

As always, let's add our feature to the appropriate list!

In [40]:
clock_based_features.append(clock.name("clock"))

In [None]:
moves_with_parsed_clock.select("game_id", "ply", *clock_based_features)