In [1]:
import ibis
import ibis_ml as ml

ibis.options.interactive = True

Let's pick up where we left off by reloading our model input table.

In [2]:
model_input_table = ibis.read_parquet("model_input_table.parquet")
model_input_table

# Data splitting

To get started, let's split this single dataset into two: a _training_ set and a _testing_ set. We'll keep most of the rows in the original dataset (a randomly chosen subset, roughly three quarters) in the _training_ set. The _training_ set will be used to _fit_ the model, and the _testing_ set will be used to measure model performance.

Because the order of rows in an Ibis table is undefined, we need a unique key to split the data reproducibly. To ensure that moves from the same game aren't split across the _training_ and _testing_ sets, we'll split by `game_id` alone (rather than by the combination of `game_id` and `ply`).
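To make the idea concrete, here is a minimal pure-Python sketch of a key-based split. It is an illustration only, not IbisML's actual implementation: the `assign_split` helper, the use of SHA-256, and the way the seed is mixed into the key are all assumptions made for this example. The point is that the assignment depends only on the key and the seed, never on row order.

```python
import hashlib


def assign_split(key, seed=222, num_buckets=4, test_size=0.25):
    """Deterministically assign a key to "train" or "test".

    Hypothetical helper for illustration; not part of ibis_ml.
    """
    # Hash the key together with the seed so a different seed gives a different split
    digest = hashlib.sha256(f"{key}-{seed}".encode()).hexdigest()
    bucket = int(digest, 16) % num_buckets
    # With 4 buckets and test_size=0.25, buckets 0-2 are train and bucket 3 is test
    train_buckets = int((1 - test_size) * num_buckets)
    return "train" if bucket < train_buckets else "test"


# Made-up (game_id, ply) rows: several moves per game
rows = [("game_a", 1), ("game_a", 2), ("game_b", 1), ("game_b", 2), ("game_c", 1)]

# Record which set(s) each game's rows land in
splits = {}
for game_id, ply in rows:
    splits.setdefault(game_id, set()).add(assign_split(game_id))

# Because only game_id is hashed, every move of a game lands in the same set
assert all(len(sets) == 1 for sets in splits.values())
```

Had the key included `ply` as well, moves from one game could land in different sets. Note also that splitting by game means the share of _rows_ in each set is only approximately the requested fraction, since games differ in length.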

In [3]:
# Create Ibis tables for the two sets:
train_data, test_data = ml.train_test_split(
    model_input_table,
    unique_key="game_id",
    # Put 3/4 of the data into the training set
    test_size=0.25,
    num_buckets=4,
    # Fix the seed so that the random split is reproducible across runs
    random_seed=222,
)



In [4]:
# Check that no game appears in both the training and testing sets
assert not (
    set(train_data.distinct(on="game_id").game_id.to_pyarrow().to_pylist())
    & set(test_data.distinct(on="game_id").game_id.to_pyarrow().to_pylist())
)

In [5]:
# Check that, together, the two sets cover every game in the original table
assert (
    set(train_data.distinct(on="game_id").game_id.to_pyarrow().to_pylist())
    | set(test_data.distinct(on="game_id").game_id.to_pyarrow().to_pylist())
) == set(model_input_table.distinct(on="game_id").game_id.to_pyarrow().to_pylist())