# 2 Data Preprocessing
**Objective:** Learn how data needs to be represented for machine learning algorithms to be applied.

In [1]:
from chinese_checkers.game.ChineseCheckersGame import ChineseCheckersGame
from chinese_checkers.model.CentroidModel import CentroidModel
from chinese_checkers.simulation.GameSimulation import GameSimulation

from torch.utils.data import Dataset, DataLoader
from torch import tensor, zeros, save, stack

from typing import List, Tuple

## 2.1 State Representation

To represent the `ChineseCheckersGame` state for PyTorch consumption, we transform the board into a one-hot encoding. For each position on the board and each player, we store a binary value: 1 if the player has a piece at that position and 0 otherwise.

Given this, for a board with $P$ positions and $n$ players, the encoded state has $P \times n$ entries. Note that $P$ is the number of points on the board, not the `board_size` argument: a board of size 4 has 121 positions. In code we store the encoding as a tensor of shape $(n, P)$, one row per player, so $p_{i,j}$ below lives at index `[j, i]`.

For clarity, the one-hot encoded representation is defined as:

$$
p_{i,j} =
\begin{cases}
1 & \ \text{if player } j \text{ has a piece at position } i,\\
0 & \ \text{otherwise.}
\end{cases}
$$

We'll begin by focusing on a board of size 4 with 2 players. Later, we can extend our approach to accommodate different board sizes and player counts.
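
To make the definition concrete, here is a minimal sketch in plain Python on a made-up five-position board. The position labels and the `one_hot_rows` helper are hypothetical and not part of the `chinese_checkers` package; the point is only to show how $p_{i,j}$ turns into one row of 0s and 1s per player.

```python
from typing import Hashable, List, Sequence, Set


def one_hot_rows(all_positions: Sequence[Hashable],
                 player_positions: List[Set[Hashable]]) -> List[List[int]]:
    """Return one row per player; entry i is 1 if that player occupies all_positions[i]."""
    return [
        [1 if position in occupied else 0 for position in all_positions]
        for occupied in player_positions
    ]


# Hypothetical toy board with five positions labelled a-e.
toy_positions = ["a", "b", "c", "d", "e"]
# Player 0 holds a and b; player 1 holds e. Positions c and d are empty.
toy_players = [{"a", "b"}, {"e"}]

rows = one_hot_rows(toy_positions, toy_players)
print(rows)  # [[1, 1, 0, 0, 0], [0, 0, 0, 0, 1]]
```

Because a position can hold at most one piece, every column of the result contains at most one 1.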


In [2]:
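from torch import Tensor

def encode_game(game: ChineseCheckersGame) -> Tensor:
    # Column order follows game.board.hexagram_points
    all_positions = game.board.hexagram_points
    # One row per player, one column per board position
    encoded_state = tensor([
        [
            1 if position in player.positions else 0
            for position in all_positions
        ]
        for player in game.players
    ])

    return encoded_state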

game = ChineseCheckersGame.start_game(number_of_players=2, board_size=4)
game_state = encode_game(game)
print(game_state)

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
         0]])


In [3]:
def encoded_game_shape(game: ChineseCheckersGame) -> Tuple[int, int]:
    all_positions = game.board.hexagram_points
    player_count = len(game.players)
    return (player_count, len(all_positions))

display(game_state.size())
encoded_game_shape(game)

torch.Size([2, 121])

(2, 121)
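
The 121 columns can be checked by hand. Assuming `board_size` is the side length of each corner triangle of the star (an assumption about the library, though it matches both the 121 positions and the 10 pieces per player in the output above), the board is a central hexagon with `s + 1` points per side plus six triangles of `s(s + 1)/2` points each, giving $6s(s+1) + 1$ positions. The helper names below are hypothetical.

```python
def hexagram_point_count(board_size: int) -> int:
    """Number of points on a star board whose corner triangles have side `board_size`."""
    s = board_size
    central_hexagon = 3 * s * (s + 1) + 1       # hexagon with s + 1 points per side
    corner_triangles = 6 * (s * (s + 1) // 2)   # six triangles of s(s + 1)/2 points
    return central_hexagon + corner_triangles


def pieces_per_player(board_size: int) -> int:
    """Assumes each player starts by filling one corner triangle."""
    return board_size * (board_size + 1) // 2


print(hexagram_point_count(4))  # 121
print(pieces_per_player(4))     # 10
```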

## 2.2 Encoding Sequence of Game States

With the capability to encode individual game states, our next step is encoding the sequence of game states that makes up an entire game. To keep the shape consistent, we set a limit of 400 turns per game. Each game state occupies one slice along the first dimension of a tensor of shape (turns, players, positions).

When a game concludes before reaching the 400-turn limit, we pad the remaining slices with zeros. This ensures that every game, regardless of its duration, produces a tensor of the same size, so games can be stacked into batches and processed efficiently by PyTorch.
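
The padding scheme can be sketched without PyTorch. The helpers below are hypothetical and work on nested lists instead of tensors: `pad_states` appends all-zero states up to `max_turns`, and `count_real_turns` recovers the original game length. The latter relies on the assumption that a real state always has at least one piece on the board, so it is never all zeros.

```python
from typing import List

State = List[List[int]]  # one row per player, one column per position


def pad_states(states: List[State], max_turns: int) -> List[State]:
    """Pad a sequence of encoded states with all-zero states up to max_turns."""
    if not states:
        raise ValueError("need at least one state to infer the shape")
    if len(states) > max_turns:
        raise ValueError("sequence is longer than max_turns")
    players, positions = len(states[0]), len(states[0][0])
    padding = [
        [[0] * positions for _ in range(players)]
        for _ in range(max_turns - len(states))
    ]
    return states + padding


def count_real_turns(padded: List[State]) -> int:
    """Count the states that are not pure padding (they contain at least one 1)."""
    return sum(1 for state in padded if any(any(row) for row in state))


# Toy game: 3 turns, 2 players, 4 positions, padded to 5 turns.
toy_game = [
    [[1, 0, 0, 0], [0, 0, 0, 1]],
    [[0, 1, 0, 0], [0, 0, 0, 1]],
    [[0, 1, 0, 0], [0, 0, 1, 0]],
]
padded = pad_states(toy_game, max_turns=5)
print(len(padded), count_real_turns(padded))  # 5 3
```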

In [4]:
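def encode_game_states(game_sequence: List[ChineseCheckersGame], max_turns: int):
    game_shape = encoded_game_shape(game_sequence[0])
    if len(game_sequence) > max_turns:
        raise ValueError(f"Game has {len(game_sequence)} states, more than max_turns={max_turns}")

    # Start from all zeros so turns beyond the end of the game stay as padding
    encoded_states_matrix = zeros(max_turns, game_shape[0], game_shape[1])

    # Encode each game state and populate the matrix
    for i, game_state in enumerate(game_sequence):
        encoded_states_matrix[i] = encode_game(game_state)

    return encoded_states_matrix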

# Example usage:
simulation = GameSimulation([CentroidModel(), CentroidModel()], max_turns=400)
games = simulation.simulate_game()
print(f"No. of turns: {len(games)}")

encoded_matrix = encode_game_states(games, simulation.max_turns)
print(encoded_matrix.size())

No. of turns: 337
torch.Size([400, 2, 121])


## 2.3 Generating Data for Training
Run many simulations between two naive `CentroidModel` players and save the encoded games and their labels to files in batches. These files will be our training data.

In [5]:
# TODO: This is a bit of a mess. Clean it up. Also, we should refactor the game to return None if the game is a draw (instead of throwing an error).  We can then use this to filter out draws from our training data.

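simulated_games = []
labels = []
save_size = 10

for i in range(20):
    simulation = GameSimulation([CentroidModel(), CentroidModel()], max_turns=400)
    try:
        games = simulation.simulate_game()
        encoded_matrix = encode_game_states(games, simulation.max_turns)
        label = 0 if games[-1].get_current_player().player_id == 'Player 0' else 1
    except Exception as e:
        # Failed simulations are skipped and do not count towards a batch
        print(f"Game {i} failed: {e}")
        continue

    # Append both together so games and labels always stay aligned
    simulated_games.append(encoded_matrix)
    labels.append(label)

    # Save once a full batch of successful games has accumulated. Counting
    # collected games (rather than testing i % save_size) keeps every batch
    # the same size even when some simulations fail.
    # Note: games left over after the loop (fewer than save_size) are not saved.
    if len(simulated_games) == save_size:
        print(labels)
        game_states_tensor = stack(simulated_games)
        labels_tensor = tensor(labels)
        save(game_states_tensor, f"../TrainingData/game_states_{i}.pt")
        save(labels_tensor, f"../TrainingData/labels{i}.pt")
        print(f"Saved {len(simulated_games)} games")
        simulated_games = []
        labels = []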

Game 9 failed: No next move found
[0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
Saved 10 games
Game 13 failed: 'NoneType' object is not subscriptable
Game 18 failed: 'NoneType' object is not subscriptable
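
One way to tidy up the batching is to separate it from the simulation loop. The generator below is a hypothetical sketch in plain Python, not part of the notebook's code: it groups successful results into batches of `save_size` and also yields the final partial batch, so games simulated after the last full batch are not silently dropped. Failed simulations are represented here as `None`.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def batches(results: Iterable[Optional[T]], save_size: int) -> Iterator[List[T]]:
    """Yield lists of up to save_size successful results, skipping None entries."""
    if save_size <= 0:
        raise ValueError("save_size must be positive")
    batch: List[T] = []
    for result in results:
        if result is None:       # failed simulation: skip it
            continue
        batch.append(result)
        if len(batch) == save_size:
            yield batch
            batch = []
    if batch:                    # flush whatever is left at the end
        yield batch


# 20 toy "games"; entries 9, 13 and 18 stand in for failed simulations.
toy_results = [None if i in (9, 13, 18) else i for i in range(20)]
batch_sizes = [len(b) for b in batches(toy_results, save_size=10)]
print(batch_sizes)  # [10, 7]
```

In the notebook, each yielded batch would then be passed to `stack` and `save` as in the cell above.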
