# Tic-Tac-Toe

> Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game.
> [Wikipedia](https://en.wikipedia.org/wiki/Tic-tac-toe)

At a high level, we will need two data structures: 

- the __environment__ that manages the current state of the game (the grid)
- the __agent__ that learns the optimal way to play the game 

During an episode, two instances of the agent class interact with the same instance of the environment. The `play_game()` function runs the main loop of the game.

The grid is represented by a 3x3 NumPy array. Each cell holds one of three integers: 0 for empty, 1 for X and -1 for O. Choosing opposite values for the two players makes an alignment easy to detect: a row, column or diagonal holds three identical marks exactly when its sum is 3 or -3.
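
As a small illustration of the summing trick, here is a sketch on a made-up position (the board below is an assumption chosen for the example, not a state taken from the notebook):

```python
import numpy as np

X, O, _ = 1, -1, 0

# Hypothetical position: X holds the whole top row, O has two marks.
board = np.array([[X, X, X],
                  [O, O, _],
                  [_, _, _]])

row_sums = board.sum(axis=1)   # one sum per row:    3, -2, 0
col_sums = board.sum(axis=0)   # one sum per column: 0, 0, 1
diag_sums = [int(np.trace(board)), int(np.trace(np.fliplr(board)))]  # 0, 0

# A sum of 3*X can only come from three X marks on the same line.
print("X has a full row:", bool((row_sums == 3 * X).any()))
```

Only the top row reaches 3, so only that line counts as an alignment; a line mixing X, O and empty cells can never reach 3 or -3.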

In [1]:
import numpy as np


In [2]:
# Constants
Debug = 0
BoardWidth = 3
BoardHeight = 3
X = 1
O = -1
_ = 0


The function `find_winner` analyzes the provided grid and returns:

- `X` or `O` if the corresponding player has three aligned marks
- `_` if there is a draw
- `None` if there is no winner and it is not a terminal state

In [3]:
def find_winner(board):
    
    # return X or O if three of their symbols are aligned
    for i in range(3):
        s = np.sum(board[i,:]) 
        if s == 3*X: return X
        if s == 3*O: return O

    for i in range(3):
        s = np.sum(board[:,i]) 
        if s == 3*X: return X
        if s == 3*O: return O
        
    s = np.trace(board)
    if s == 3*X: return X
    if s == 3*O: return O

    s = np.trace(np.fliplr(board))
    if s == 3*X: return X
    if s == 3*O: return O

    # return None if the board is not full (at least one cell is empty)
    for row in board:
        for v in row:
            if v == _: return None
        
    # the board is full, so this is a draw
    return _

The grid has a finite number of configurations, $N = 3^{rows \times cols} = 3^9 = 19683$. This count includes configurations that can never occur in a real game (for example, a grid filled with X), but it is small enough that we can enumerate all of them and decide for each whether it is a terminal state, with a winner or a draw.
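
To get a feeling for how many of these configurations a real game can actually visit, the sketch below explores the game tree from the empty grid. It is not part of the original notebook: it assumes X always moves first and that play stops at the first win or full grid, it flattens the grid to a tuple of nine cells, and the names `LINES`, `has_line` and `reachable_states` are made up for this illustration.

```python
# Indices of the 8 winning lines on a grid flattened row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def has_line(cells, symbol):
    """True if `symbol` fills at least one winning line."""
    return any(all(cells[i] == symbol for i in line) for line in LINES)

def reachable_states():
    """Set of all grids reachable by legal play, X (1) moving first."""
    start = (0,) * 9
    seen = {start}
    frontier = [start]
    while frontier:
        cells = frontier.pop()
        # Terminal grids are counted but not expanded.
        if has_line(cells, 1) or has_line(cells, -1) or 0 not in cells:
            continue
        # X and O alternate: X moves when both have the same number of marks.
        symbol = 1 if sum(cells) == 0 else -1
        for i in range(9):
            if cells[i] == 0:
                nxt = cells[:i] + (symbol,) + cells[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen

print(len(reachable_states()), "reachable out of", 3 ** 9)
# 5478 reachable out of 19683
```

Enumerating all 19683 configurations is therefore a convenience rather than a necessity: the extra entries are simply never looked up during play.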

But first, we need a way to identify each state. The function `compute_state_hash` computes a hash value that takes into account the state of each cell of the board. We can see the grid as a sequence of nine cells that each take one of three values, just like the digits of a number written in base 3. Since the cell values are -1, 0 and 1 rather than 0, 1 and 2, the result is a balanced-ternary number between -9841 and 9841, and it is still unique for each grid.

$$\{ B_{0,0}, B_{0,1}, \ldots , B_{2,1}, B_{2,2} \}$$

$$H = 3^0 \cdot B_{0,0} + 3^1 \cdot B_{0,1} + \ldots + 3^7 \cdot B_{2,1} + 3^8 \cdot B_{2,2}$$

In [4]:
def compute_state_hash(board):
    # hash = B(0,0) + 3*(B(0,1) + 3*(B(0,2) + ... + 3*B(2,2)))
    v = board.reshape(-1)                # flatten row by row: B(0,0), B(0,1), ..., B(2,2)
    p = np.power(3, np.arange(len(v)))   # weights 3^0, 3^1, ..., 3^8
    return int(np.sum(v * p))
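
The following check is a sketch added for illustration. It copies `compute_state_hash` from the cell above, so that it runs on its own, and verifies that all 19683 grids receive distinct hashes. The helper name `board_with` is made up for the example. Note that the first cell, `B(0,0)`, carries the weight $3^0$ and the last one, `B(2,2)`, the weight $3^8$.

```python
import itertools
import numpy as np

X, O, _ = 1, -1, 0

def compute_state_hash(board):
    # Same computation as in the notebook cell above.
    v = board.reshape(-1)
    p = np.power(3, np.arange(len(v)))
    return int(np.sum(v * p))

def board_with(symbol, i, j):
    """Hypothetical helper: an empty grid with a single mark at (i, j)."""
    board = np.zeros((3, 3))
    board[i, j] = symbol
    return board

print(compute_state_hash(np.zeros((3, 3))))     # 0
print(compute_state_hash(board_with(X, 0, 0)))  # 1     (weight 3^0)
print(compute_state_hash(board_with(O, 0, 1)))  # -3    (weight 3^1, value -1)
print(compute_state_hash(board_with(X, 2, 2)))  # 6561  (weight 3^8)

# Every assignment of -1/0/1 to the nine cells yields a different hash.
hashes = {
    compute_state_hash(np.array(cells).reshape(3, 3))
    for cells in itertools.product((O, _, X), repeat=9)
}
print(len(hashes), min(hashes), max(hashes))    # 19683 -9841 9841
```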


`enumerate_states_and_winner` is a recursive function that visits every possible configuration of the grid. For each one, it precomputes whether it is a terminal state (one player wins or the game is a draw) or not.

In [5]:
def enumerate_states_and_winner(board = None, x = 0, y = 0):
    
    states = {}
    
    if board is None:
        board = np.empty((3,3))
        
    for symbol in (_, X, O):
        board[x,y] = symbol
        
        if y == 2:
            if x == 2:
                s = compute_state_hash(board)
                w = find_winner(board)
                states[s] = w
            
            else:
                states = {**states, **enumerate_states_and_winner(board, x+1, 0)}
        
        else:
            states = {**states, **enumerate_states_and_winner(board, x, y+1)}
    
    return states

The `Agent` class represents one player that can interact with the game and that will learn how to win.

The `value` attribute is a dictionary that maps each possible state of the grid to a number. This number estimates how good the state is for the agent: it takes into account the probability of all possible future rewards. Estimating this function is the key task in this problem, because the agent bases each decision on the values of the states it can choose.

At the beginning, the only states whose value we know are the terminal states: those in which the agent wins ($V(s) = 1$) and those in which it loses or nobody wins ($V(s) = 0$). All other states are initialized to 0.5 because their outcome is not yet known.
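
The initialization rule can be sketched on a toy winner map. The function name `initial_values` and the state identifiers 10 to 13 are placeholders invented for this example; in the notebook the map comes from the environment's enumerated states.

```python
X, O, _ = 1, -1, 0

def initial_values(states, symbol):
    """Map each state to its starting value for the player `symbol`."""
    values = {}
    for state, winner in states.items():
        if winner is None:
            values[state] = 0.5   # game still open: outcome unknown
        elif winner == symbol:
            values[state] = 1.0   # this player has won
        else:
            values[state] = 0.0   # loss or draw
    return values

# Toy stand-in for the state -> winner map: one state of each kind.
toy_states = {10: X, 11: O, 12: _, 13: None}

print(initial_values(toy_states, X))  # {10: 1.0, 11: 0.0, 12: 0.0, 13: 0.5}
print(initial_values(toy_states, O))  # {10: 0.0, 11: 1.0, 12: 0.0, 13: 0.5}
```

Each agent builds its own table, since a state that is worth 1 for X is worth 0 for O.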

After each episode, the agent updates this function by 'back-propagating' the value of the final state into the preceding ones with a learning rate $\alpha$: $V(s) \leftarrow V(s) + \alpha (V(s') - V(s))$, where $s'$ is the state that followed $s$ in the episode. To allow this, the agent needs to keep a history of each state it reaches during the game.
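
As a worked example of this update, consider a hypothetical three-state episode that ends in a win, with $\alpha = 0.3$. The state names `s0`, `s1`, `s2` and the function name `backup` are made up for the sketch.

```python
def backup(values, history, alpha):
    """Propagate the final state's value back through one episode."""
    next_state = history[-1]
    # Walk the history from the second-to-last state back to the first.
    for state in reversed(history[:-1]):
        values[state] += alpha * (values[next_state] - values[state])
        next_state = state
    return values

# s2 is a winning terminal state (value 1); s0 and s1 start at 0.5.
values = {"s0": 0.5, "s1": 0.5, "s2": 1.0}
backup(values, ["s0", "s1", "s2"], alpha=0.3)

# V(s1) = 0.5 + 0.3 * (1.0  - 0.5) = 0.65
# V(s0) = 0.5 + 0.3 * (0.65 - 0.5) = 0.545
print({s: round(v, 3) for s, v in values.items()})
```

The state closest to the win moves most, and earlier states move less: repeated episodes are needed before the information reaches the opening moves.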

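At each turn, the agent chooses between exploring with a random move and exploiting what it has learned so far. For this, we use the epsilon-greedy strategy: with probability $\epsilon$ the agent plays a random empty cell, otherwise it plays the move that leads to the state with the highest value.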
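
A minimal, standalone sketch of the epsilon-greedy choice is shown below. It uses Python's `random` module and a plain dictionary of candidate moves, which are simplifications chosen for the example; the function name `epsilon_greedy` and the sample values are made up.

```python
import random

def epsilon_greedy(candidates, epsilon, rng=random):
    """Pick a move from {move: value}.

    With probability epsilon the move is random (exploration),
    otherwise it is the move with the highest value (exploitation).
    """
    moves = list(candidates)
    if rng.random() < epsilon:
        return rng.choice(moves)
    # On ties, max() keeps the first move encountered.
    return max(moves, key=candidates.get)

# Hypothetical values of the states reached by three possible moves.
candidates = {(0, 0): 0.41, (1, 1): 0.50, (2, 2): 0.43}

print(epsilon_greedy(candidates, epsilon=0.0))  # always (1, 1)
print(epsilon_greedy(candidates, epsilon=1.0))  # any of the three moves
```

With a small $\epsilon$ such as 0.1, the agent mostly plays its best known move but still tries other cells often enough to correct poor early estimates.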

In [6]:
class Agent:
    
    def __init__(self, symbol, env, alpha, epsilon):
        self.symbol = symbol
        self.state_history = []
        self.value = {s: 1 if w==symbol else 0.5 if w is None else 0 for s,w in env.states.items()}
        self.alpha = alpha
        self.epsilon = epsilon
    
    
    def play(self, env):
        
        # find empty cells
        empty_cells = []
        for i in range(3):
            for j in range(3):
                if env.board[i,j] == _:
                    empty_cells.append((i,j))
        
        # take the decision to explore or to exploit
        p = np.random.random()
        if p < self.epsilon:
            # --- exploration ---
            # choose a random empty cell
            ij = empty_cells[np.random.choice(len(empty_cells))]
            env.board[ij] = self.symbol
            if Debug:
                print("Agent is playing random in", ij)

        
        else:
            # --- exploitation ---
            # choose the cell that has the highest value
            maxV = -1
            maxIJ = None
            
            # display cells values
            if Debug:
                for i in range(3):
                    for j in range(3):
                        if env.board[i,j] == _:
                            env.board[i,j] = self.symbol
                            v = self.value[compute_state_hash(env.board)]
                            env.board[i,j] = _
                            print(f'{v:0.2f} ', end='')
                        else:
                            print('  X  ' if env.board[i,j]==X else '  O  ', end='')
                    print()
                

            for ij in empty_cells:
                env.board[ij] = self.symbol
                v = self.value[compute_state_hash(env.board)]
                env.board[ij] = _
                if v > maxV:
                    maxV = v
                    maxIJ = ij

            env.board[maxIJ] = self.symbol
            if Debug:
                print("Agent is playing greedy in", maxIJ)
    
    
    def update(self, env):
        self.state_history.reverse()
        
        # back-propagation of the final state value
        next_state = self.state_history[0]
        for s in self.state_history[1:]:
            self.value[s] = self.value[s] + self.alpha * (self.value[next_state] - self.value[s])
            next_state = s
        
        # clear the history for next episode
        self.state_history = []
        
    
    def update_state_history(self, state):
        self.state_history.append(state)


The `Environment` class owns the board and can tell for each state if there is a winner.

In [7]:
class Environment:
    def __init__(self, w, h):
        # note: w and h are not used yet; the board is always 3x3
        self.board = None
        self.states = enumerate_states_and_winner()
    
    def new_game(self):
        self.board = np.ones((3,3)) * _
        
    def game_over(self):
        h = compute_state_hash(self.board)
        if self.states[h] is None: return False
        
        if Debug:
            if self.states[h] == X:
                print("  ###  X won the game  ###")
            elif self.states[h] == O:
                print("  ###  O won the game  ###")
            else:
                print("  ###  Nobody won  ###")
            
        return True
    
    def draw_board(self):
        print(" -------------")
        for i in range(3):
            for j in range(3):
                v = '  .  ' if self.board[i,j]==0 else '  X  ' if self.board[i,j]==1 else '  O  '
                print(v, end='')
            print()
        print(" -------------")


This is the main loop of an episode (a full game run). The loop stops as soon as the game is over, which means that either one player has aligned three marks or the board is full and nobody has won.

This is a turn-based game, so the agents take turns playing. Player 1 moves first.

When an agent takes an action, the environment reaches a new state and both agents update their internal history with this state.

Finally, when the game is over, each agent can update its value function.


In [8]:
def play_game(player1, player2, env, draw=False):
    
    current_player = None
    
    env.new_game()
    
    while not env.game_over():
        
        # change the current player
        if current_player == player1:
            current_player = player2
        else:
            current_player = player1
          
        # draw the board before each turn of the player whose symbol was passed as `draw`
        if draw == current_player.symbol:
            env.draw_board()
        
        # current player makes a move
        current_player.play(env)
    
        # update the history of each agent 
        state = compute_state_hash(env.board)
        player1.update_state_history(state)
        player2.update_state_history(state)
    
    if draw:
        env.draw_board()
        
    # update the value function
    player1.update(env)
    player2.update(env)


Everything has been defined. We can now instantiate the game and the agents and start the learning phase. We will play 10000 episodes.

In [9]:
env = Environment(BoardWidth, BoardHeight)
player1 = Agent(X, env, 0.3, 0.1)
player2 = Agent(O, env, 0.3, 0.1)

for episode in range(10000):
    if episode % 200 == 0: print(episode)
    play_game(player1, player2, env)

0
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
3200
3400
3600
3800
4000
4200
4400
4600
4800
5000
5200
5400
5600
5800
6000
6200
6400
6600
6800
7000
7200
7400
7600
7800
8000
8200
8400
8600
8800
9000
9200
9400
9600
9800


To test the agent, we now define a new interactive player. This player does not need to learn, so only the `play` method does real work; `update` and `update_state_history` are empty methods kept so that it can be passed to `play_game`.

In [10]:
class HumanPlayer:
       
    def __init__(self, symbol):
        self.symbol = symbol
    
    def play(self, env):
        
        # draw the board
        env.draw_board()

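        while True:
            move = input("Enter coordinates 'i j' for your next move (i,j=0..2): ")
            try:
                i, j = (int(t) for t in move.split())
            except ValueError:
                # not exactly two integers: ask again
                continue
            # accept only in-range coordinates that point at an empty cell
            if 0 <= i < 3 and 0 <= j < 3 and env.board[i,j] == _:
                env.board[i,j] = self.symbol
                break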
    
    def update(self, env):
        pass
        
    def update_state_history(self, state):
        pass

Now play the game!

In [11]:
human = HumanPlayer(X)

Debug = 1

stop = False
while not stop:
    play_game(human, player2, env)
    stop = input("Play again? [Y/n] ").strip().lower() == 'n'


 -------------
  .    .    .  
  .    .    .  
  .    .    .  
 -------------
Enter coordinates 'i j' for your next move (i,j=0..2): 0 0
  X  0.34 0.41 
0.41 0.50 0.33 
0.41 0.44 0.43 
Agent is playing greedy in (1, 1)
 -------------
  X    .    .  
  .    O    .  
  .    .    .  
 -------------
Enter coordinates 'i j' for your next move (i,j=0..2): 0 2
  X  0.40   X  
0.35   O  0.35 
0.35 0.24 0.35 
Agent is playing greedy in (0, 1)
 -------------
  X    O    X  
  .    O    .  
  .    .    .  
 -------------
Enter coordinates 'i j' for your next move (i,j=0..2): 2 1
  X    O    X  
0.35   O  0.32 
0.34   X  0.32 
Agent is playing greedy in (1, 0)
 -------------
  X    O    X  
  O    O    .  
  .    X    .  
 -------------
Enter coordinates 'i j' for your next move (i,j=0..2): 1 2
  X    O    X  
  O    O    X  
0.17   X  0.12 
Agent is playing greedy in (2, 0)
 -------------
  X    O    X  
  O    O    X  
  O    X    .  
 -------------
Enter coordinates 'i j' for your next move (i,