# Deep Reinforcement Learning using AlphaZero methodology

Adapted from https://applied-data.science/blog/how-to-build-your-own-alphazero-ai-using-python-and-keras/

In [7]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


Game controls all the mechanics of playing a game, and Agent emulates a player:

To display the board, we need to create a logger. Here we just print the board to standard output, to get a grasp of the current situation.

In [9]:
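class mylogger:
    """Minimal logger that prints whatever it receives to standard output."""

    @staticmethod
    def info(log):
        # log is a list of chars resembling a row of the board (or a separator string).
        # As a static method it works whether the class or an instance is passed.
        print(str(log) + "\n")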

## Playing a first game by hand

In [3]:
# Let's play a game by hand


What can we do in the game?

This is what the board looks like:

When we drop a token in from the top, it falls to the bottom of its column. The bottom row holds positions 35 to 41, so those are the only actions available on an empty board.

For instance, let's drop a token right in the middle: it falls to the middle position of the bottom row, which is position 38.
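The position numbers above suggest a board of 6 rows by 7 columns stored as a flat list, numbered row by row from the top left. The sketch below works under that assumption; `position` and `allowed_actions` are illustrative helpers, not functions from the game module.

```python
ROWS, COLS = 6, 7  # board shape assumed from the position numbers in the text


def position(row, col):
    """Flat index of a cell; row 0 is the top row."""
    return row * COLS + col


def allowed_actions(board):
    """For each column, the lowest empty cell (board is a flat list, 0 = empty)."""
    actions = []
    for col in range(COLS):
        for row in range(ROWS - 1, -1, -1):  # scan from the bottom row upwards
            if board[position(row, col)] == 0:
                actions.append(position(row, col))
                break
    return actions


board = [0] * (ROWS * COLS)
print(allowed_actions(board))   # [35, 36, 37, 38, 39, 40, 41]

board[38] = 1                   # player 1 drops a token in the middle column
print(allowed_actions(board))   # [35, 36, 37, 31, 39, 40, 41]
```

Once position 38 is taken, the only free cell in that column is the one right above it, 38 - 7 = 31.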

There are two players in this game, 1 and -1. The first player was 1, so the current player should be -1:

-1

Let's now see what this player can do:


Because position 38 is taken, player -1 can now put a token on top of it, that is, at position 31. Let's check it out:

Who's the next player?

How is the game going?

This is the count of games won by each of the players. Let's make player -1 win the game.

The second element of the tuple is the value. The value 0 means that nothing has happened yet.

If player 1 moves to position 37, then player 1 will win. But player 1 is dumb, so the next moves are:

To see the score of the game, we have to check who is the current player:

And then get the first value of this tuple. The winner of the game is the product of both values:
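Why does the product work? A plausible reading, which is an assumption here, is that the score is expressed from the point of view of the player whose turn it is. After a winning move the turn has already passed to the loser, who sees a score of -1. The helper `winner_of` below is only illustrative:

```python
def winner_of(current_player, score_for_current_player):
    """Return 1 or -1 for the winning player, or 0 if nobody has won yet."""
    return current_player * score_for_current_player


# Player 1 makes the winning move, so the turn passes to player -1,
# who sees a score of -1 (a loss) from their point of view.
print("And the winner is %d" % winner_of(-1, -1))   # And the winner is 1

# Player -1 makes the winning move: player 1 is to move and sees -1.
print("And the winner is %d" % winner_of(1, -1))    # And the winner is -1
```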

Let's keep playing. We first need to clear the board, because the goal of the game is to be the first to connect four tokens. Once that has happened, any later four-in-a-row does not contribute to the score:

In [30]:
game.reset()

<game.GameState at 0xb334f1240>

In [31]:
game.step(38)
game.step(31)
game.step(35)
game.step(24)
game.step(36)
game.step(17)

(<game.GameState at 0xb334f1860>, 0, 0, None)

In [32]:
game.gameState.render(mylogger)

['-', '-', '-', '-', '-', '-', '-']

['-', '-', '-', '-', '-', '-', '-']

['-', '-', '-', 'O', '-', '-', '-']

['-', '-', '-', 'O', '-', '-', '-']

['-', '-', '-', 'O', '-', '-', '-']

['X', 'X', '-', 'X', '-', '-', '-']

--------------



Now player 1 has learnt, and will do the right thing:

In [38]:
print("And the winner is %d" % (game.currentPlayer*game.gameState.score[0]))

And the winner is 1


To detect that a game has finished, we can monitor the score, or the value returned by each step. When it is different from 0, there has been a winning move.
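As a minimal sketch of that check, the loop below stops at the first non-zero value. It assumes that `step` returns a tuple shaped like the output shown earlier, `(state, value, done, info)`, where reading the third element as a "done" flag is a guess. `make_stub_step` is a made-up stand-in so the example runs without the game module.

```python
def play_until_finished(step, actions):
    """Apply actions in order; stop at the first move that ends the game.

    `step` is any callable returning (state, value, done, info).
    Returns (number_of_moves_played, last_value).
    """
    value = 0
    moves_played = 0
    for action in actions:
        _state, value, done, _info = step(action)
        moves_played += 1
        if value != 0 or done:
            break
    return moves_played, value


def make_stub_step(winning_move_number):
    """Hypothetical stand-in for game.step: reports a win on the n-th call."""
    calls = {"n": 0}

    def step(action):
        calls["n"] += 1
        finished = calls["n"] == winning_move_number
        # value is -1 for the player who is now to move, because they just lost
        return None, (-1 if finished else 0), int(finished), None

    return step


moves = [38, 31, 35, 24, 36, 17, 37, 10]
print(play_until_finished(make_stub_step(7), moves))   # (7, -1)
```

Checking the flag as well as the value also covers a game that ends with a value of 0, for instance a draw on a full board.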

## Playing the game with an agent

To train a neural network using the results of our games, we need to use an agent. The agent takes an untrained neural network as input.

For the neural network, we can use any Keras model. Here, we use the `Residual_CNN` class from the `model` module, which needs some configuration:

In [4]:
from model import Residual_CNN

Using TensorFlow backend.


In [5]:
REG_CONST = 0.0001
LEARNING_RATE = 0.1

# Six identical convolutional layers: 75 filters with a 4x4 kernel each
HIDDEN_CNN_LAYERS = [
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
]

In [6]:
# Create a neural network

In [43]:
NUM_OF_SIMULATIONS = 3   # number of simulations the agent will attempt to search for the best next movement
CPUCT = 1  # constant controlling the level of exploration
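To get a feeling for what the exploration constant does, here is a small sketch of the selection rule commonly used in AlphaZero-style tree search: each candidate move gets a score Q + U, where U grows with the prior probability of the move and shrinks with its visit count. This is the textbook formula, not necessarily what the agent in this notebook computes (it may add noise or other terms), and `puct_scores` is an illustrative name.

```python
import math


def puct_scores(q_values, priors, visit_counts, cpuct=1.0):
    """Q + U for every candidate move of one search node."""
    total_visits = sum(visit_counts)
    return [
        q + cpuct * p * math.sqrt(total_visits) / (1 + n)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]


# Two candidate moves: the first looks good and was visited 9 times,
# the second has never been tried.
q = [0.5, 0.0]
p = [0.5, 0.5]
n = [9, 0]

low = puct_scores(q, p, n, cpuct=0.1)
high = puct_scores(q, p, n, cpuct=2.0)
print(low.index(max(low)))    # 0: small constant, keep exploiting the known move
print(high.index(max(high)))  # 1: large constant, explore the untried move
```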

In [7]:
# Create an agent

Let's start from a blank state

In [8]:
state = game.reset()


In [9]:
state.render(mylogger)


Now the agent will decide what to do next (using 0 for a deterministic move):

Of all the positions on the board, `next_action` is the position with the maximum probability.

This is a vector with the probability of all the positions on the board. For instance, we can check that all positions with prob > 0 are in fact allowed actions:
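A minimal sketch of that check with NumPy. The probability vector `pi` and the list `allowed` are made up here for an empty board; in the notebook they would come from the agent and from the game state.

```python
import numpy as np

# Made-up probabilities for an empty board: only the bottom row is playable
allowed = [35, 36, 37, 38, 39, 40, 41]
pi = np.zeros(42)
pi[allowed] = [0.05, 0.10, 0.10, 0.50, 0.10, 0.10, 0.05]

next_action = int(np.argmax(pi))            # position with maximum probability
positive = np.flatnonzero(pi > 0).tolist()  # positions with prob > 0

print(next_action)                   # 38
print(set(positive) <= set(allowed)) # True
```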

In the `act` method, the second argument should be 0 for a deterministic movement, and 1 for a random movement:
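The difference between the two modes can be sketched as follows. `choose_action` is an illustrative function, and the real `act` method may be implemented differently: with 0 we always take the most probable position, and with 1 we sample a position according to the probabilities.

```python
import numpy as np


def choose_action(pi, tau, rng=None):
    """Pick a position from the probability vector pi.

    tau == 0: deterministic, take the position with maximum probability.
    tau == 1: random, sample a position with probability pi[position].
    """
    if tau == 0:
        return int(np.argmax(pi))
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(len(pi), p=pi))


# Made-up probabilities for an empty board
allowed = [35, 36, 37, 38, 39, 40, 41]
pi = np.zeros(42)
pi[allowed] = [0.05, 0.10, 0.10, 0.50, 0.10, 0.10, 0.05]

print(choose_action(pi, 0))              # 38
print(choose_action(pi, 1) in allowed)   # True
```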

In [10]:
# Now it is the turn of the second player (who plays randomly)


We can keep playing with this agent, which will try to find the best moves for the game:

## Exercise: a learning agent against a random player

Now that you know how to run a learning agent in a game, write a function that, given an agent, plays a full game and returns its outcome.

Don't worry about keeping the memory of the positions. We just want the final outcome of the game, from the learning agent's point of view: WIN, DRAW or LOSS.

The game will be randomly started either by the random player or the neural network.

We will later use this function to run several simulations.

Use this logger to keep track of:
* each new action suggested by the agent (both for the NN and for the random player)
* value after each movement
* a render of the board (you can use state.render(logger))
* if the movement is done by the NN, the values of the Monte Carlo tree search and of the neural network
* a big WARNING if the agent suggests a movement that is not allowed by the state of the board

The function will return a tuple with the result of the game and the number of movements of the NN.

In [51]:
!mkdir -p logs/

In [52]:
from utils import setup_logger

logger_simgame = setup_logger('logger_simgame', 'logs/logger_simgame.log')

In [53]:
# Student version cell
def simgame(game, agent, logger):
    """Sim a game and return the outcome of the game. 
    
    @param game a Game that will be played by the agent. This game will be reset
    @param agent an Agent with an associated neural network
    @param logger a logger to keep track of the internal statuses
    @return a tuple with the result of the game and the number of movements of the NN
    """
    logger.info("---------------------------------------")
    logger.info("NEW GAME")
    logger.info("---------------------------------------")
    
    state = game.reset()
    
    # 0 -> the neural network starts
    # 1 -> the random player starts
    who_starts = random.choice([0,1])
    
    # Tau is the parameter that controls the act method: 0 is the neural network (deterministic), 1 is the random player
    if who_starts == 0:
        tau = 0  # NN starts
        logger.info("Game started by neural network. NN will be the X")
        nn_symbol, rnd_symbol = "X", "O"
    else:
        tau = 1  # Random player starts
        logger.info("Game started by random player. NN will be the O")
        nn_symbol, rnd_symbol = "O", "X"
        
    game_is_ended = False
    winner = -2  # we init with an impossible value
    nn_movements = 0

    while not game_is_ended:

        # *** YOUR CODE SHOULD DECIDE THE NEXT ACTION ***
        next_action = None

        # *** YOU NEED TO UPDATE nn_movements ONLY WHEN THE NEURAL NETWORK PLAYS ***
        
        # *** YOU SHOULD LOG MORE INFO, FOR INSTANCE, THE STATE OF THE BOARD ***
        
        if tau == 0:
            logger.info("NN (%s) played, moved to %d" % (nn_symbol, next_action))
            tau = 1            
        else:
            tau = 0
            logger.info("Random (%s) played, moved to %d" % (rnd_symbol, next_action))

        logger.info("Game score: %d     MCTS: %.4f          NN: %.4f" % (score, MCTS_value, NN_value))
        if game_is_ended:
            # *** WHO HAS WON? WRITE YOUR CODE HERE ***
            winner = None
            # If random started, then the result of the game is the opposite
            if who_starts == 1:
                winner = winner*(-1)
            if winner == 1:
                logger.info(" **** The NN has WON! :D ****")
            elif winner == 0:
                logger.info(" **** It is a DRAW :S ****")
            else:
                logger.info(" **** The NN has LOST :'( ****")

    return winner, nn_movements

### How does the agent learn?

Let's try several times, and plot some stats about the number of wins, and the distribution of the number of movements.

In [4]:
NUM_OF_SIMULATIONS = 10   # number of simulations the agent will run to search for the best next movement
CPUCT = 1  # constant controlling the level of exploration

In [11]:
REG_CONST = 0.0001
LEARNING_RATE = 0.1

# Six identical convolutional layers: 75 filters with a 4x4 kernel each
HIDDEN_CNN_LAYERS = [
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
    {'filters': 75, 'kernel_size': (4, 4)},
]

# Create a game and agent

In [12]:
# Play N times, and keep track of the winning average of our agent
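As a hint for the bookkeeping part, here is one possible way to summarize the outcomes. It assumes each game yields a `(winner, nn_movements)` tuple, as described for `simgame`, with 1 meaning the NN won. The list `results` is made up for illustration and `running_stats` is just a helper name.

```python
def running_stats(results):
    """Summarize a list of (winner, nn_movements) tuples.

    Returns (games, wins, win_percentage, average_movements).
    """
    games = len(results)
    wins = sum(1 for winner, _ in results if winner == 1)
    avg_moves = sum(moves for _, moves in results) / games
    return games, wins, 100.0 * wins / games, avg_moves


# Made-up outcomes of five games: 1 = NN wins, 0 = draw, -1 = NN loses
results = [(1, 5), (-1, 7), (1, 6), (0, 8), (-1, 4)]
print("%d games played so far, %d wins (%.2f %%), %.2f movs avg" % running_stats(results))
```

Calling this every few games inside the loop gives a running view of how the agent is doing.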

### Exercise: Learning from this experience

So far, we are not learning from this experience. We are just playing with a neural network that is not trained.

We can add the movements to a _memory_ and record the outcome of the game too, and then train the neural network with this experience.

In [13]:
from memory import Memory

In [14]:
MEMORY_SIZE=30000

This Memory object has two kinds of memory:

* Short term, with the set of movements of a game
* Long term, with the full games and their outcomes. This long term memory is used to re-train the agent and gain experience in the game
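To make the two kinds of memory concrete, here is a toy version of the idea. `TwoTierMemory` is a hypothetical stand-in written for this explanation and is not the `Memory` class from the `memory` module; in particular, labelling every movement with the same final outcome is a simplification.

```python
from collections import deque


class TwoTierMemory:
    """Hypothetical stand-in, not the notebook's memory.Memory class."""

    def __init__(self, max_size):
        self.short_term = []                     # movements of the current game
        self.long_term = deque(maxlen=max_size)  # oldest entries are dropped when full

    def clear_short_term(self):
        """Prepare the memory for a new game."""
        self.short_term = []

    def add_move(self, state, action_probs):
        """Record one movement; the outcome is not known yet."""
        self.short_term.append({"state": state, "pi": action_probs, "value": None})

    def commit_game(self, outcome):
        """Label every movement of the game with its outcome and store it."""
        for move in self.short_term:
            move["value"] = outcome
            self.long_term.append(move)
        self.clear_short_term()


toy_memory = TwoTierMemory(max_size=3)
toy_memory.add_move("state after move 1", [0.5, 0.5])
toy_memory.add_move("state after move 2", [0.9, 0.1])
toy_memory.commit_game(outcome=1)
print(len(toy_memory.short_term), len(toy_memory.long_term))   # 0 2
```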

In [15]:
memory = Memory(MEMORY_SIZE)

In [16]:
# This prepares the memory for a new game
memory.clear_stmemory()
# This adds a movement to the memory
memory.commit_stmemory
# This adds a game to the long term (training) memory

<bound method Memory.commit_stmemory of <memory.Memory object at 0xb31480208>>

In [17]:
def simgame(game, agent, logger, memory = None):
    """Sim a game and return the outcome of the game. 
    
    @param game a Game that will be played by the agent. This game will be reset
    @param agent an Agent with an associated neural network
    @param logger a logger to keep track of the internal statuses
    @param memory a Memory object to record all the movements and outcome of the game
    @return a tuple with the result of the game, the number of movements of the NN and the updated memory
    """
    logger.info("---------------------------------------")
    logger.info("NEW GAME")
    logger.info("---------------------------------------")
    
    state = game.reset()
    
    # 0 -> the neural network starts
    # 1 -> the random player starts
    who_starts = random.choice([0,1])
    
    # Tau is the parameter that controls the act method: 0 is the neural network (deterministic), 1 is the random player
    if who_starts == 0:
        tau = 0  # NN starts
        logger.info("Game started by neural network. NN will be the X")
        nn_symbol, rnd_symbol = "X", "O"
    else:
        tau = 1  # Random player starts
        logger.info("Game started by random player. NN will be the O")
        nn_symbol, rnd_symbol = "O", "X"
        
    game_is_ended = False
    winner = -2  # we init with an impossible value
    nn_movements = 0

    while not game_is_ended:

        # *** YOUR CODE SHOULD DECIDE THE NEXT ACTION ***
        next_action = None

        # *** YOU NEED TO UPDATE nn_movements ONLY WHEN THE NEURAL NETWORK PLAYS ***
        
        # *** YOU SHOULD LOG MORE INFO, FOR INSTANCE, THE STATE OF THE BOARD ***
        
        # *** HOW SHOULD YOU UPDATE THE MEMORY OBJECT? ***
        
        if tau == 0:
            logger.info("NN (%s) played, moved to %d" % (nn_symbol, next_action))
            tau = 1            
        else:
            tau = 0
            logger.info("Random (%s) played, moved to %d" % (rnd_symbol, next_action))

        logger.info("Game score: %d     MCTS: %.4f          NN: %.4f" % (score, MCTS_value, NN_value))
        if game_is_ended:
            # *** WHO HAS WON? WRITE YOUR CODE HERE ***
            winner = None
            # If random started, then the result of the game is the opposite
            if who_starts == 1:
                winner = winner*(-1)
            if winner == 1:
                logger.info(" **** The NN has WON! :D ****")
            elif winner == 0:
                logger.info(" **** It is a DRAW :S ****")
            else:
                logger.info(" **** The NN has LOST :'( ****")

    return winner, nn_movements, memory

In [21]:
# Play N times, and keep track of the winning average of our agent

5 games played so far, 2 wins (40.00 %), 6.00 movs avg
10 games played so far, 4 wins (40.00 %), 6.25 movs avg


We can now make our agent learn from this experience:
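One ingredient of such a training step is turning the stored movements into arrays. The sketch below assumes each long-term entry is a dict with a board `state`, a probability vector `pi` and a game outcome `value`. The function name and the target names `value_head` and `policy_head` are assumptions made for illustration, so check them against the actual model and memory code.

```python
import random

import numpy as np


def build_training_batch(long_term, batch_size, seed=None):
    """Sample movements from the long-term memory and stack them into arrays."""
    rng = random.Random(seed)
    batch = rng.sample(list(long_term), min(batch_size, len(long_term)))
    states = np.array([m["state"] for m in batch])
    targets = {
        "value_head": np.array([m["value"] for m in batch]),   # who won the game
        "policy_head": np.array([m["pi"] for m in batch]),     # search probabilities
    }
    return states, targets


# Made-up memory with five identical movements on a 6x7 board
long_term = [
    {"state": np.zeros((6, 7)), "pi": np.full(42, 1 / 42), "value": 1}
    for _ in range(5)
]
states, targets = build_training_batch(long_term, batch_size=3, seed=0)
print(states.shape, targets["value_head"].shape, targets["policy_head"].shape)
# (3, 6, 7) (3,) (3, 42)
```

Arrays like these could then be passed to the `fit` method of a Keras model with two outputs, provided the dictionary keys match the names of its output layers.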

In [18]:
# How can we learn from this experience?

In [20]:
# Play N times, and keep track of the winning average of our agent
# Has the agent improved?

In [21]:
# Can you retrain after every game? (or after every 5-10 games, to save some time)

## Exercise: Using a custom model

The models that the agent trains are Keras models, created following the interface defined in `model.Gen_Model`.

Could you change the model and use a different architecture? For instance, an RNN-based model that tries to learn from the sequences of movements?

In [60]:
from importlib import reload
import model
reload(model)
from model import KSchool_Model  # <--- This is your custom model in model.py


In [22]:
# Create an agent with your network

In [23]:
# Play N times, and keep track of the winning average of our agent