In [None]:
import ibis

ibis.options.interactive = True

# Create game-level features

In [None]:
game_level_features = []

In [None]:
games = ibis.read_parquet("data/games.parquet")
games

## `event`-based features

The `event` field includes interesting information, such as whether the game was rated or part of a tournament.

The first thing we see above is that all of the displayed `event` values start with `"Rated "`; is this really the case for every game?

In [None]:
games.event[: len("Rated ")].value_counts()

It looks like unrated games simply exclude the prefix. Let's create our first feature, `is_rated`, given this information.

In [None]:
is_rated = games.event.startswith("Rated ")
is_rated.value_counts()

We'll add each feature we define in this section to our list of game-level features. Spoiler alert: when we combine our features later, we'll see an interesting property of working with Ibis this way.

Don't forget to give each feature a meaningful name!

In [None]:
game_level_features.append(is_rated.name("is_rated"))
game_level_features

What else can we extract from the `event` field? For starters, let's examine the most popular `event` values.

In [None]:
games.event.value_counts().order_by(ibis.desc("event_count"))

Lichess categorizes games according to their "time control". If you're not familiar with chess, Classical games are the slowest, followed by Rapid, then Blitz. Bullet games are very fast, and UltraBullet games are, well, ultra-fast.

Correspondence games are essentially untimed. We'll exclude these games later, because we want to see how time modulates win likelihood.

Notice that we reuse the `is_rated` logic below when creating the time control feature.

In [None]:
event_with_rated_prefix_stripped = is_rated.ifelse(
    games.event[len("Rated ") :], games.event
)
lichess_time_control_type = event_with_rated_prefix_stripped.substr(
    0, event_with_rated_prefix_stripped.find(" ")
)
lichess_time_control_type.value_counts()

In [None]:
game_level_features.append(lichess_time_control_type.name("lichess_time_control_type"))
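To make the string logic above easier to follow, here is a minimal plain-Python sketch of the same two steps applied to a single value. The `split_event` helper and the sample strings are hypothetical and only illustrate the idea; the Ibis expressions above do the real work on the whole column.

```python
RATED_PREFIX = "Rated "


def split_event(event: str) -> tuple[bool, str]:
    """Return (is_rated, time_control_type) for one `event` string."""
    is_rated = event.startswith(RATED_PREFIX)
    # Strip the prefix only when it is present, mirroring `is_rated.ifelse(...)`.
    stripped = event[len(RATED_PREFIX):] if is_rated else event
    # The time control type is assumed to be the first space-separated word.
    time_control_type = stripped.split(" ", 1)[0]
    return is_rated, time_control_type


# Made-up sample values in the style of the `event` column.
for sample in ["Rated Blitz game", "Bullet game", "Rated Rapid tournament"]:
    print(sample, "->", split_event(sample))
```

Because the prefix check is computed once and reused, the rated flag and the time control type cannot disagree about which rows carry the prefix.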

### Exercise 1

The last `event`-based feature we want for now is whether the game was a tournament game. No need to overcomplicate things—just check whether the `event` field [contains](https://ibis-project.org/reference/expression-strings#ibis.expr.types.strings.StringValue.contains) the relevant text.

In [None]:
is_tournament = games  # Complete this line of code

#### Solution

In [None]:
%load solutions/nb02_ex01.py

As usual, don't forget to add the feature you created to the list!

In [None]:
game_level_features.append(is_tournament.name("is_tournament"))

## Elo-based features

Elo ratings provide a comparative measure of skill across a pool of players and could be the basis for a number of meaningful features.

Let's start by adding features corresponding to the Elo rating for each player.

In [None]:
white_elo = games.white_elo.cast(int)
white_elo

In [None]:
game_level_features.append(white_elo.name("white_elo"))

In [None]:
black_elo = games.black_elo.cast(int)
black_elo

In [None]:
game_level_features.append(black_elo.name("black_elo"))

## Title features

We can add features corresponding to the title of each player (if any).

In [None]:
white_title = games.white_title
white_title.value_counts()

In [None]:
game_level_features.append(white_title.name("white_title"))

In [None]:
black_title = games.black_title
black_title.value_counts()

In [None]:
game_level_features.append(black_title.name("black_title"))

## `time_control`-based features

Last but not least, we can break the `time_control` column down into `base_time` (the number of seconds each player starts the game with) and `increment` (the number of seconds added to each player's clock after each move) components.

In [None]:
index = games.time_control.find("+")
base_time = games.time_control.substr(0, index).try_cast(int)
increment = games.time_control.substr(index + 1).try_cast(int)
games.select("time_control", base_time, increment).distinct()

In [None]:
game_level_features += [
    base_time.name("base_time"),
    increment.name("increment"),
]
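As a rough illustration of what the split does to a single value, the sketch below mirrors it in plain Python. The helper names are hypothetical, and `try_int` only approximates the null-on-failure behavior of `try_cast()`; we also assume that untimed games carry a `time_control` without a `+` (such as `-`), which ends up with both components missing.

```python
def try_int(text):
    """Return int(text), or None when the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_time_control(time_control):
    """Split a 'base+increment' string into (base_time, increment) in seconds."""
    base, separator, increment = time_control.partition("+")
    if not separator:
        # No "+" found: treat both components as missing.
        return None, None
    return try_int(base), try_int(increment)


print(parse_time_control("600+5"))
print(parse_time_control("-"))
```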

## Target variable

We can include the target variable calculation alongside our game-level features. While there are more complicated alternatives for defining the target variable, we'll simply map a win for white to `1.0`, a win for black to `0.0`, and a draw to `0.5`.

In [None]:
target = games.result.case().when("1-0", 1.0).when("1/2-1/2", 0.5).when("0-1", 0.0).end()
target.value_counts()

In [None]:
game_level_features.append(target.name("target"))

# Combine game-level features

Early on in this notebook, we mentioned that we could exploit a nice property of adding all of our features to a list when it came time to combine them.

While we have been eagerly evaluating all of the features above using Ibis's _interactive mode_, they are just Ibis expressions. As a result, we can simply select our features from the original table.

In [None]:
games.select("game_id", *game_level_features)

# Create move-level features

In [None]:
moves = ibis.read_parquet("data/moves/*.parquet")
moves

## Eval-based features

The games in our dataset all include move-by-move computer evaluations, always from white's point of view. For example, `[%eval 2.00]` indicates that white has an advantage which is approximately equivalent to having two extra pawns. `[%eval #-4]` means that white is getting mated in four moves (i.e. black has mate in 4).
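To make the two comment shapes concrete, here is a small sketch that pulls the mate and regular evaluation out of a comment string with Python's `re` module. `SIMPLE_EVAL_REGEX` is a simplified stand-in written for this illustration, not the pattern used for the actual parsing in this notebook, and the sample comments are made up.

```python
import re

# Simplified, hypothetical pattern: either "#<moves>" (mate) or a decimal eval.
SIMPLE_EVAL_REGEX = re.compile(
    r"\[%eval\s+(?:#(?P<mate>[+-]?\d+)|(?P<regular_eval>[+-]?\d+(?:\.\d+)?))\]"
)


def parse_eval(comment):
    """Return a dict with 'mate' and 'regular_eval' strings, or None if absent."""
    match = SIMPLE_EVAL_REGEX.search(comment)
    if match is None:
        return None
    # Exactly one of the two named groups is filled; the other stays None.
    return match.groupdict()


for comment in ["[%eval 2.00]", "[%eval #-4] [%clk 0:00:30]", "[%clk 0:05:00]"]:
    print(comment, "->", parse_eval(comment))
```

Both fields come back as strings (or `None`), so they still need to be cast to numbers before they can be used as features.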

Theoretically, the objective evaluation should be a good predictor of win probability. Other things being equal, the player with the better position should be more likely to win. Of course, the computer makes its evaluation assuming perfect play; realistically, minute advantages don't mean much until you reach the highest levels of play and you have enough time left on your clock to think—more on the second point in the next section.

While the logic to parse the evaluation from the `comment` field is a bit hairy, we can apply the [`EVAL_REGEX` from the Python `chess` library](https://python-chess.readthedocs.io/en/v1.11.1/_modules/chess/pgn.html). Since Ibis's [`re_extract()`](https://ibis-project.org/reference/expression-strings.html#ibis.expr.types.strings.StringValue.re_extract) method returns a single capture group per call rather than all of the named groups at once, we drop into native DuckDB SQL.

In [None]:
eval_based_features = []

In [None]:
import string

from chess.pgn import EVAL_REGEX

moves_with_parsed_eval = (
    moves.alias("moves")
    .sql(
        f"""
        SELECT
          *,
          REGEXP_EXTRACT(
            comment,
            '{EVAL_REGEX.pattern.translate(str.maketrans("", "", string.whitespace))}',
            ['prefix', 'mate', 'regular_eval', 'depth', 'suffix']
          ) AS eval
        FROM moves
        """
    )
    .unpack("eval")
)
moves_with_parsed_eval

Let's also look at the end of a game to see an example of the `mate` field.

In [None]:
moves_with_parsed_eval.filter(moves_with_parsed_eval.game_id == "grBk9gMA").to_pandas()

For mate, numbers closer to 0 indicate a more winning position—it's easier to find a mate in 1 than a mate in 33! What's the furthest-out mate in our dataset?

In [None]:
moves_with_parsed_eval.mate.try_cast(int).abs().max()

Mates in 121 moves are the furthest-out mates in this dataset. To be safe, [let's assume that we won't have mates longer than 1000 moves](https://chess.stackexchange.com/q/37246) and create a `mate_eval` feature that equals 999 for a white mate in 1, 998 for a white mate in 2, 967 for a white mate in 33, -998 for a black mate in 2, and -999 for a black mate in 1.

In [None]:
MATE_SCORE = 1_000  # Arbitrary large number greater than 121 (`max(abs(mate))`)

mate = moves_with_parsed_eval.mate.try_cast(int)
mate_eval = mate.sign() * MATE_SCORE - mate
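As a quick sanity check of the arithmetic, the sketch below applies the same formula, `sign(mate) * MATE_SCORE - mate`, to plain Python integers. The `sign` and `mate_eval_for` helpers are hypothetical names used only for this illustration.

```python
MATE_SCORE = 1_000


def sign(value):
    """Return 1, 0, or -1 depending on the sign of the value."""
    return (value > 0) - (value < 0)


def mate_eval_for(mate):
    """Map 'mate in N' (negative N means black mates) to a score near +/-1000."""
    return sign(mate) * MATE_SCORE - mate


for mate in [1, 2, 33, -2, -1]:
    print(f"mate in {mate:>3}: {mate_eval_for(mate)}")
```

Shorter mates land closer to plus or minus `MATE_SCORE`, so the feature increases with how favorable the position is for white.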

A lot of interesting features can be derived from the eval. However, for the purpose of this tutorial (and to avoid confusing non-chess players!), we'll add the `mate_eval` and `regular_eval` features and call it a day.

In [None]:
eval_based_features += [
    mate_eval.name("mate_eval"),
    moves_with_parsed_eval.regular_eval.try_cast("float").name("regular_eval"),
]

In [None]:
moves_with_parsed_eval.select("game_id", "ply", *eval_based_features)

In [None]:
moves_with_parsed_eval.filter(moves_with_parsed_eval.game_id == "grBk9gMA").select(
    "game_id", "ply", *eval_based_features
).to_pandas()

## Clock-based features

Quick backstory: When I first explored building a live win probability model for chess games, one of the factors I was most interested in looking into was how ["time pressure"](https://en.wikipedia.org/wiki/Time_trouble) affects win likelihood. Take the extreme case—you can be ahead by all the material in the world on the board, but, if you only have a couple seconds left on the clock, you're unlikely to convert the advantage in time.

In [None]:
clock_based_features = []

### Exercise 2

In the same vein as what we did to parse eval information above, we can apply the [`CLOCK_REGEX` from the Python `chess` library](https://python-chess.readthedocs.io/en/v1.11.1/_modules/chess/pgn.html) to extract clock information for each move.

In [None]:
moves_with_parsed_clock = moves  # Complete this line of code

#### Solution

In [None]:
%load solutions/nb02_ex02.py

### Exercise 3

We're not done! While it's nice that we've extracted clock components, a more meaningful feature would be the total number of seconds left on the clock. Compute this expression, and assign it to the `clock` variable.

The [`try_cast()`](https://ibis-project.org/reference/expression-generic#ibis.expr.types.generic.Value.try_cast) method will probably come in handy for this.

In [None]:
clock = moves_with_parsed_clock  # Complete this line of code

#### Solution

In [None]:
%load solutions/nb02_ex03.py

There's just one problem—the `clock` column contains the time left on the player's clock after they've made their move, so it alternates between the time on white's clock and the time on black's clock.

When `ply` is odd, `clock` represents the amount of time white has left, and when `ply` is even, `clock` represents the amount of time black has left. On any given move, the previous `clock` value represents the amount of time left on the other player's clock.

Let's compute `previous_clock` using a window function.

In [None]:
w = ibis.window(group_by="game_id", order_by="ply")
previous_clock = clock.lag().over(w)
moves_with_parsed_clock.select("game_id", "ply", clock, previous_clock).order_by(
    ["game_id", "ply"]
)

Now, we can define our white and black clock features based on these intermediate calculations.

In [None]:
white_clock = ibis.ifelse(moves_with_parsed_clock.ply % 2 == 1, clock, previous_clock)
black_clock = ibis.ifelse(moves_with_parsed_clock.ply % 2 == 0, clock, previous_clock)
moves_with_parsed_clock.select("game_id", "ply", white_clock, black_clock).order_by(
    ["game_id", "ply"]
)

Notice we're missing a `black_clock` value for the first ply. We can just use the `white_clock` value here, since, on move 1, they should both equal the initial base time. (On Lichess, unlike in over-the-board games, the clocks don't start until each player has made their first move.)

In [None]:
black_clock = black_clock.coalesce(white_clock)
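To see how the lag, the parity check, and the coalesce fit together, here is a minimal plain-Python sketch for a single game. The `split_clocks` helper and the clock values are made up for illustration; the window function above does the same thing per `game_id` across the whole table.

```python
def split_clocks(clocks):
    """Turn per-ply clock readings into (ply, white_clock, black_clock) rows."""
    rows = []
    previous = None  # Plays the role of `clock.lag()` within one game.
    for ply, clock in enumerate(clocks, start=1):
        if ply % 2 == 1:
            # White just moved: `clock` is white's time, `previous` is black's.
            white, black = clock, previous
        else:
            # Black just moved: `clock` is black's time, `previous` is white's.
            white, black = previous, clock
        if black is None:
            # First ply has no previous reading; fall back to white's clock.
            black = white
        rows.append((ply, white, black))
        previous = clock
    return rows


# Hypothetical readings in seconds for plies 1 through 4.
for row in split_clocks([300, 300, 297, 295]):
    print(row)
```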

As always, let's add our features to the appropriate list!

In [None]:
clock_based_features += [
    white_clock.name("white_clock"),
    black_clock.name("black_clock"),
]

In [None]:
moves_with_parsed_clock.select("game_id", "ply", *clock_based_features).order_by(
    ["game_id", "ply"]
)

# Create model input table

Time to put it all together! We can join all of the game- and move-level features we engineered above to build our model input table.

In [None]:
move_level_features = moves_with_parsed_eval.select(
    "game_id", "ply", *eval_based_features
).join(
    moves_with_parsed_clock.select("game_id", "ply", *clock_based_features),
    ["game_id", "ply"],
)
model_input_table = games.select("game_id", *game_level_features).join(
    move_level_features, "game_id"
)
model_input_table

One last thing—both the `mate_eval` and `regular_eval` are blank after the last ply. Let's fill in `mate_eval` on the last ply based on the target, in hopes that the model we'll train shortly correctly predicts the result of a game once it's over!

In [None]:
model_input_table_with_final_eval = model_input_table.mutate(
    mate_eval=model_input_table.mate_eval.coalesce(
        ibis.ifelse(
            model_input_table.regular_eval.isnull(),
            model_input_table.target.case()
            .when(1.0, MATE_SCORE)
            .when(0.0, -MATE_SCORE)
            .when(0.5, 0)
            .end(),
            None,
        )
    )
)
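The nested `coalesce` and `ifelse` above can be read row by row as the following plain-Python sketch. The `fill_final_mate_eval` helper is a hypothetical name for illustration, with `None` standing in for a null value.

```python
MATE_SCORE = 1_000

# Target value -> score assigned when no evaluation is available.
FINAL_SCORE_BY_TARGET = {1.0: MATE_SCORE, 0.0: -MATE_SCORE, 0.5: 0}


def fill_final_mate_eval(mate_eval, regular_eval, target):
    """Keep an existing mate_eval; otherwise derive one only if both evals are missing."""
    if mate_eval is not None:
        return mate_eval
    if regular_eval is None:
        return FINAL_SCORE_BY_TARGET[target]
    # A regular evaluation exists, so mate_eval stays null.
    return None


print(fill_final_mate_eval(None, None, 1.0))
print(fill_final_mate_eval(None, 0.35, 1.0))
print(fill_final_mate_eval(998, None, 0.0))
```

Only rows where both evaluations are missing are touched, so positions with a regular evaluation keep a null `mate_eval`.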

Finally, let's exclude unrated games, because people play differently when it "doesn't count." We'll also throw out untimed ("Correspondence") games, because we've mentioned that we're especially interested in the effect of time pressure on live win probability, and there is no real time pressure (or clock) when players have unlimited time.

In [None]:
filtered_model_input_table = model_input_table_with_final_eval.filter(
    (model_input_table_with_final_eval.is_rated)
    & (model_input_table_with_final_eval.lichess_time_control_type != "Correspondence")
)

Out of curiosity, what would doing all of this in SQL have looked like? We can use Ibis's `to_sql()` function to display the compiled SQL.

In [None]:
ibis.to_sql(filtered_model_input_table)

Before moving on, write the final result to disk so that we can use it in the next notebook. This may take a minute.

In [None]:
filtered_model_input_table.to_parquet("model_input_table.parquet")