# Alpha Zero Algorithm

In the preceding parts, we explained some of the key components of the AlphaZero algorithm, but in a somewhat disconnected way. To see how AlphaZero works, we need to connect those parts. Let's briefly review the major components and then put them together into a single algorithm.

## Neural Network

In the Residual Network Notebook, we showed what makes a ResNet different from a vanilla Convolutional Neural Network. We discussed the ResNet architecture in particular because it is at the core of the AlphaZero algorithm. The neural network $f_{\theta}$ is parametrized by a set of weights $\theta$. It takes as input the board representation of the game, $s$, and outputs two things: a continuous value $v_{\theta}(s)$ in the range $[-1,1]$ from the perspective of the current player, and a policy $p_{\theta}(s)$ that is a probability vector over all possible actions.

Let's look at the neural network architecture used in AlphaGo Zero for the game of Go.

### Input:
> - Size of the input = 19 x 19 x 17
> - 19 x 19 is the 2D board size with height and width = 19
> - 17 is the depth of the input. Each of these 17 layers is a binary feature plane. Eight binary feature planes $(X_{i})$ indicate the presence of the current player's stones at the current time step and the seven preceding ones. Another eight binary feature planes $(Y_{i})$ indicate the presence of the opponent's stones at the same time steps. The last feature plane $C$ indicates whose turn it currently is.
> - So input features = $[X_{i}, Y_{i}, X_{i-1}, Y_{i-1}, ... , X_{i-7}, Y_{i-7}, C]$
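
The stacking of feature planes described above can be sketched with NumPy. This is only an illustration: the function name `encode_input`, the layout of `history` (a list of `(own, opp)` board pairs, most recent first) and the convention that $C$ is all ones when black is to play are assumptions made for this example.

```python
import numpy as np

BOARD = 19      # board height and width
HISTORY = 8     # number of time steps kept per player

def encode_input(history, black_to_play):
    """Build the 19 x 19 x 17 input from a list of past positions.

    `history` is a list of (own, opp) pairs of 19 x 19 binary arrays,
    most recent first (an assumed layout for this sketch).
    """
    planes = np.zeros((BOARD, BOARD, 2 * HISTORY + 1), dtype=np.float32)
    for k, (own, opp) in enumerate(history[:HISTORY]):
        planes[:, :, 2 * k] = own        # X_{i-k}: current player's stones
        planes[:, :, 2 * k + 1] = opp    # Y_{i-k}: opponent's stones
    # Time steps before the start of the game stay all zeros.
    planes[:, :, -1] = 1.0 if black_to_play else 0.0   # C: whose turn it is
    return planes

# Usage: one stone each and a single position of history.
own = np.zeros((BOARD, BOARD))
own[3, 3] = 1
opp = np.zeros((BOARD, BOARD))
opp[15, 15] = 1
x = encode_input([(own, opp)], black_to_play=True)
print(x.shape)
```

The planes are interleaved in the same order as the feature list above, so plane `2k` holds $X_{i-k}$ and plane `2k + 1` holds $Y_{i-k}$.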

The input is then fed into a __Residual Tower__. A residual tower consists of:
> - a single convolution block
> - either 19 or 39 residual blocks

Let's discuss each of these two components of the residual tower.

### Residual Tower:

#### Convolution Block Operations:
> - a convolution of 256 filters of kernel size 3 x 3 with stride 1
> - Batch Normalization is then used
> - Then a rectified non-linearity (such as ReLU) is applied

#### Residual Block Operations:
The following operations happen sequentially; the output of each is fed as input to the next.
> - a convolution of 256 filters of kernel size 3 x 3 with stride 1
> - Batch Normalization applied
> - Rectified Non-Linearity applied
> - a convolution of 256 filters of kernel size 3 x 3 with stride 1
> - Batch Normalization applied
> - A __skip connection__ that adds input to the block
> - Rectified Non-Linearity applied
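
To make the order of operations concrete, here is a minimal NumPy sketch of one residual block. It is a simplification under stated assumptions: it processes a single input instead of a batch, the batch normalization has no learned scale or shift and normalizes over spatial positions only, and the example uses 4 channels instead of 256 to keep it small. The helper names are made up for this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, w):
    """3 x 3 convolution, stride 1, zero padding.

    x has shape (H, W, C_in); w has shape (3, 3, C_in, C_out).
    """
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    # windows has shape (H, W, C_in, 3, 3)
    windows = sliding_window_view(padded, (3, 3), axis=(0, 1))
    return np.einsum("hwcij,ijco->hwo", windows, w)

def batch_norm(x, eps=1e-5):
    """Simplified normalization: zero mean, unit variance per channel."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    out = relu(batch_norm(conv3x3(x, w1)))   # conv -> BN -> ReLU
    out = batch_norm(conv3x3(out, w2))       # conv -> BN
    return relu(out + x)                     # skip connection, then ReLU

# Usage with small random weights (4 channels instead of 256).
rng = np.random.default_rng(0)
channels = 4
x = rng.normal(size=(19, 19, channels))
w1 = rng.normal(size=(3, 3, channels, channels)) * 0.1
w2 = rng.normal(size=(3, 3, channels, channels)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)
```

Because the block adds its input back before the final non-linearity, the output keeps the shape of the input, which is what allows 19 or 39 such blocks to be stacked.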

The output of the residual tower goes into two __heads__: __policy head and value head__.

Let's now look at the operations of these two heads.

### Policy Head:

> - a convolution of 2 filters of kernel size 1 x 1 with stride 1
> - Batch Normalization applied
> - Rectified Non-Linearity
> - a fully connected linear layer that outputs a vector of size (# of moves)

### Value Head:

> - a convolution of 1 filter of kernel of size 1 x 1 with stride 1
> - Batch Normalization applied
> - Rectified Non-Linearity
> - a fully connected linear layer to a hidden layer of size 256
> - Rectified Non-Linearity
> - a fully connected linear layer to a scalar
> - a __tanh__ non-linearity outputting a scalar in range [-1,1]
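
The final step of each head can be illustrated as follows. This sketch assumes the policy head's linear layer produces unnormalized scores (logits) that are turned into a probability vector with a softmax, and that for Go the number of moves is $19 \times 19 + 1$ (every board point plus a pass). The random numbers are stand-ins for real network activations.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of scores into a probability vector."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.default_rng(0)
num_moves = 19 * 19 + 1                      # assumed: 361 points plus pass
policy_logits = rng.normal(size=num_moves)   # stand-in for the policy head output
value_preact = rng.normal()                  # stand-in for the value head scalar

p = softmax(policy_logits)   # probability vector over moves
v = np.tanh(value_preact)    # scalar in [-1, 1]
print(p.shape, p.sum(), v)
```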

A high level view of the AlphaZero Neural Network Architecture is as follows:

Input --> Residual Tower (containing Convolution Block and Residual Blocks) --> Policy Head and Value Head

<img src="images/alphneur.PNG" alt="Drawing" style="width: 700px;"/>


Let's now discuss the training process of this Neural Network.

### Training:

At the end of each game of self-play, the neural network is given training examples of the form $(s_{t}, \pi_{t}, z_{t})$. Here $\pi_{t}$ is an estimate of the policy from state $s_{t}$, and $z_{t}$ is the final outcome of the self-play game from the perspective of the current player at $s_{t}$. We will soon see how to obtain $\pi_{t}$. The network outputs $p_{\theta}(s_{t})$ and $v_{\theta}(s_{t})$, and the training example supplies the targets $\pi_{t}$ and $z_{t}$. To bring $p_{\theta}(s_{t})$ and $v_{\theta}(s_{t})$ close to $\pi_{t}$ and $z_{t}$, we minimize the following loss function:

<img src="images/alphloss.PNG" alt="Drawing" style="width: 500px;"/>

The main idea is that, over time, the neural network learns which states eventually lead to wins or losses. Learning the policy, in turn, gives a good estimate of the best action from a given state.
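
As a rough sketch, the loss for a single training example can be computed as below. This assumes the loss is the usual AlphaZero objective: the squared error between $v_{\theta}(s_{t})$ and $z_{t}$ plus the cross-entropy between $\pi_{t}$ and $p_{\theta}(s_{t})$. The original paper also adds an L2 penalty on the weights, which is left out here. The function name and the small constant `eps` are assumptions for this example.

```python
import numpy as np

def alphazero_loss(p, v, pi, z, eps=1e-12):
    """Squared value error plus policy cross-entropy for one example."""
    value_loss = (v - z) ** 2
    policy_loss = -np.sum(pi * np.log(p + eps))   # eps avoids log(0)
    return value_loss + policy_loss

# Usage: the value is predicted perfectly, the policy matches a uniform target.
pi = np.array([0.5, 0.5])
p = np.array([0.5, 0.5])
loss = alphazero_loss(p, v=1.0, pi=pi, z=1.0)
print(loss)
```

Here the value term is zero, so what remains is the cross-entropy of a uniform distribution over two moves, which equals $\ln 2$.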

We have not yet seen how to obtain $\pi_{t}$. To do so, we return to the algorithm discussed in the preceding notebook: __Monte Carlo Tree Search__. Essentially, MCTS is used to improve the policy output by the neural network. Let's now see how MCTS works together with the neural network.

## Monte-Carlo Tree Search for Policy Improvement

Remember that in a game search tree, each node represents a game state (board configuration). A directed edge $i \to j$ exists between two nodes if a valid action causes a transition from state $i$ to state $j$. We start with an empty search tree whose root node represents the empty-board state, and then expand the tree one node at a time. In traditional MCTS, we perform a rollout when we encounter a new node. But as we discussed, rollouts are time-consuming for games like Go. So in AlphaZero, instead of performing a rollout, the value of the new node is obtained from the neural network itself. The value is then propagated up the trajectory to update the values of the nodes along it.

Notice that the improvement over traditional MCTS is the use of the neural network, instead of a rollout, to obtain the value of a new node. This step is faster, and as the neural network learns, its predictions become more accurate than a random rollout.

Let's discuss this improved Monte-Carlo Tree Search in more detail. As we had discussed previously, we maintain the following attributes for the tree search:
> - $Q(s, a)$: expected value of taking action $a$ from state $s$
> - $N(s, a)$: number of times we took action $a$ from state $s$ across simulations
> - $P(s, \cdot)$: this is $p_{\theta}(s)$, the prior probability of taking each action from state $s$ according to the policy returned by the current neural network.

Remember, we use an Upper Confidence Bound to decide which node to select during tree expansion. Here we use a slightly modified version of the UCB in which the exploration term is weighted by the network's prior $P(s, a)$. The trade-off between exploitation and exploration works the same way as before.

> - $U(s, a) = Q(s, a) + c \cdot P(s, a) \cdot \frac{\sqrt{\sum_{b}N(s, b)}}{1 + N(s, a)}$
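
A small worked example may help to see how the two terms interact. The numbers below are made up, and `ucb_scores` is a hypothetical helper that evaluates the formula above for all actions of one state at once.

```python
import numpy as np

def ucb_scores(Q, N, P, c=1.0):
    """U(s, a) for every action a of one state, following the formula above."""
    return Q + c * P * np.sqrt(N.sum()) / (1.0 + N)

# Three actions: one visited often, one never visited, one visited once.
Q = np.array([0.5, 0.0, -0.2])   # mean action values
N = np.array([3.0, 0.0, 1.0])    # visit counts
P = np.array([0.2, 0.5, 0.3])    # priors from the network

U = ucb_scores(Q, N, P)
best = int(np.argmax(U))
print(U, best)
```

With these numbers $U = [0.6, 1.0, 0.1]$, so the unvisited action with the highest prior is selected even though the first action has the best $Q$ value so far.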

This is how MCTS works to improve the initial policy output by the neural network. We initialize our empty game search tree with $s$ as the root. A single MCTS simulation proceeds as follows:
> - Compute action $a$ that maximizes the UCB $U(s, a)$
> - If the next state $s'$ exists in our tree, we recursively call the search on $s'$. 
> - If it does not exist, we add the new state to our tree, initialize $P(s', \cdot) = p_{\theta}(s')$ and the value $v(s') = v_{\theta}(s')$ from the neural network (instead of performing a rollout), and initialize $Q(s', a)$ and $N(s', a)$ to 0 for all $a$.
> - Propagate the value $v(s')$ up along the search path seen in the current simulation and update all $Q(s, a)$ values.
> - If we encounter a terminal state, we propagate the actual value (win/draw/loss).

We perform multiple MCTS simulations (usually a fixed number, or for a fixed amount of time) to expand our game search tree. After doing so, the $N(s, a)$ values at the root provide a better approximation of the policy. The improved stochastic policy $\pi_{t}$ is simply the normalized visit counts, $\frac{N(s, \cdot)}{\sum_{b}N(s, b)}$. During self-play, we perform MCTS and pick a move by sampling from the improved policy $\pi_{t}$. This is how we obtain the $\pi_{t}$ that is used to train the neural network to improve $p_{\theta}(s_{t})$.
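
The normalization and sampling step can be sketched as follows. The visit counts are made up, and `improved_policy` is a hypothetical helper name.

```python
import numpy as np

def improved_policy(N_root):
    """Normalize the root visit counts N(s, .) into the policy pi."""
    N_root = np.asarray(N_root, dtype=float)
    return N_root / N_root.sum()

rng = np.random.default_rng(0)
pi = improved_policy([10, 30, 60])   # visit counts after 100 simulations
a = rng.choice(len(pi), p=pi)        # sample a move index from pi
print(pi, a)
```

Sampling from $\pi_{t}$ rather than always taking the most visited move keeps some exploration in the self-play games.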

<img src="images/nnmcts.PNG" alt="Drawing" style="width: 500px;"/>

In [1]:
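import numpy as np

# Search state shared across simulations. Each state s must be hashable.
c = 1.0               # exploration constant (the value is an assumption)
visited = set()       # states already added to the tree
P, Q, N = {}, {}, {}  # per-state priors, mean action values, visit counts

def search(s, game, NN):
    # Terminal state: return the actual outcome, negated because the
    # caller is the opponent of the player to move at s.
    if game.gameEnded(s):
        return -game.gameReward(s)

    # New leaf: evaluate it with the network instead of a rollout.
    if s not in visited:
        visited.add(s)
        P[s], v = NN.predict(s)
        Q[s] = np.zeros(len(P[s]))
        N[s] = np.zeros(len(P[s]))
        return -v

    # Pick the valid action with the highest upper confidence bound.
    # ValidActions(s) is assumed to return the valid action indices.
    max_ucb, best_action = -float("inf"), -1
    for a in game.ValidActions(s):
        u = Q[s][a] + c*P[s][a]*np.sqrt(sum(N[s]))/(1+N[s][a])
        if u > max_ucb:
            max_ucb = u
            best_action = a
    a = best_action

    s_next = game.nextState(s, a)
    v = search(s_next, game, NN)

    # Update the running mean Q(s, a) and the visit count N(s, a).
    Q[s][a] = (N[s][a]*Q[s][a] + v)/(N[s][a]+1)
    N[s][a] += 1
    return -v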


We have shown how the neural network connects with the Monte-Carlo Tree Search algorithm and how it is trained using examples obtained from Monte-Carlo Tree Search. The AlphaZero algorithm is essentially a Policy Iteration algorithm that improves via self-play. Let's see how this algorithm works at a higher level.

## Policy Iteration

Here is the complete algorithm:

> - Initialize the neural network $f_{\theta}$ with random weights $\theta_{0}$. This means we start with a random policy and value network, so the network cannot predict anything intelligent in the beginning.
> - In each iteration of this algorithm, play a fixed number of self-play games.
>> - In each game of self-play:
>>> - At each time step t (at each turn of the self-play game), we perform a fixed number of MCTS simulations (based on $f_{\theta_{i-1}}$) from the current state $s_{t}$. After that many simulations, we have a policy estimate at the root node. We then pick a move at time t by sampling from the improved policy $\pi_{t}$.
>>> - We record a training example $(s_{t}, \pi_{t}, z_{t})$, where $z_{t}$ is the final outcome of the self-play game from the perspective of the player to move at $s_{t}$. It is filled in once the game has ended.
>>> - Since the data for each time step t is stored as $(s_{t}, \pi_{t}, z_{t})$, at the end of the game, we have a list containing data for each time step in that game. This list serves as the training data.
>> - After a fixed number of games of self-play, we have the data collected from every single game of self-play.
> - At the end of each iteration, the neural network is trained with the obtained training examples. The old network, $f_{\theta_{i-1}}$, is then pitted against the new (trained) network, $f_{\theta_{i}}$. If the new network wins more than a set threshold fraction of games (55% in the DeepMind paper), the network is updated to the new network. Otherwise we keep the old network and run another iteration to obtain more training examples.


That's it. This is how we train our neural network. Each time a newly trained network passes the evaluation threshold, it replaces the old one, so the network gradually becomes a better player. Here is a high-level implementation of the policy iteration in the AlphaZero algorithm:

In [2]:
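import numpy as np

# noIter, noGames, noMCTSSimulation and threshold are hyperparameters, and
# init_NN, train_NN, play, MCTS and reward are helpers, all assumed to be
# defined elsewhere.

def policy_iteration(game):
    NN = init_NN()                       # network with random initial weights
    examples = []
    for i in range(noIter):
        for e in range(noGames):
            examples += playGame(game, NN)
        new_NN = train_NN(examples)
        # play is assumed to return the fraction of games won by new_NN.
        wins = play(new_NN, NN)
        if wins > threshold:
            NN = new_NN
    return NN

def playGame(game, NN):
    examples = []
    s = game.startState()
    mcts = MCTS()

    while True:
        for j in range(noMCTSSimulation):
            mcts.search(s, game, NN)
        pi = mcts.pi(s)
        examples.append([s, pi, None])   # z_t is filled in when the game ends
        a = np.random.choice(len(pi), p=pi)
        s = game.nextState(s, a)
        if game.gameEnded(s):
            # Assign the final outcome to every stored example.
            examples = reward(examples, game.gameReward(s))
            return examples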
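
The `reward` helper called in `playGame` is not defined in this notebook. One possible sketch is shown below. It assumes that the two players strictly alternate and that the final reward passed in is from the perspective of the player to move in the terminal state, which matches how the `search` function above treats terminal states.

```python
def reward(examples, final_reward):
    """Fill in z_t for every [s, pi, None] example of one finished game.

    Assumes strictly alternating players and that final_reward is given
    from the perspective of the player to move in the terminal state.
    """
    # The last example belongs to the player who made the final move,
    # i.e. the opponent of the player to move in the terminal state.
    z = -final_reward
    labelled = []
    for s, pi, _ in reversed(examples):
        labelled.append([s, pi, z])
        z = -z                      # the perspective flips every move
    labelled.reverse()
    return labelled

# Usage: a three-move game in which the player to move at the end has lost.
game_examples = [["s0", [1.0], None], ["s1", [1.0], None], ["s2", [1.0], None]]
labelled = reward(game_examples, final_reward=-1)
print([z for _, _, z in labelled])
```

In this example the first player moved at `s0` and `s2` and won, so those examples get $z = 1$ and the opponent's example at `s1` gets $z = -1$.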