In [1]:
import math
import random
import numpy as np

This notebook explains the __Monte Carlo Tree Search__ (MCTS) algorithm, which is a key ingredient in the training of the AlphaZero program. 

# Monte Carlo Tree Search

Monte Carlo Tree Search works on the types of games I described before in the Adversarial Search notebook: deterministic, zero-sum games of perfect information. When we say that a game is of perfect information, it means that every player can see the complete game state at all times. Combined with determinism, this lets us (in principle) sketch out the entire game search tree and look at all the possible game states and actions. 

Classic game-playing strategies, like MiniMax and Alpha-Beta, traverse this game tree and choose the move that they think is most likely to lead them to victory. As stated earlier, this technique is provably optimal when the tree can be searched in full, but its time complexity is prohibitive for games with a high branching factor (like Go). This is where Monte Carlo Tree Search comes in. 

In MCTS, we begin with an empty search tree and expand it by following the MCTS algorithm (discussed shortly), so the tree growth is asymmetric. This provides a huge advantage over MiniMax, which needs a full game tree: unlike MiniMax, MCTS can be configured to stop after a desired amount of time (or number of steps), and we can still select a reasonably good move based on the partially constructed search tree. 
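
As a rough sketch of what "stop after a desired amount of time" can look like, the loop below keeps calling a single search step until a time budget runs out. The names `run_with_budget` and `iterate` are hypothetical placeholders; `iterate` stands in for one iteration of whatever search is being run.

```python
import time

def run_with_budget(iterate, budget_seconds):
    """Call iterate() repeatedly until the time budget is spent.

    Returns the number of completed iterations. `iterate` is a stand-in
    for one search iteration (an assumption of this sketch).
    """
    deadline = time.monotonic() + budget_seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterate()
        iterations += 1
    return iterations

calls = []
done = run_with_budget(lambda: calls.append(1), budget_seconds=0.01)
print("iterations completed within the budget:", done)
```

Because the tree is useful after any number of iterations, the caller can trade search quality for response time simply by changing the budget.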

Let's now see how the Monte Carlo Tree Search algorithm works. 

The key idea is that in MCTS, we keep some stats for each node that help us decide which action to take at a given state. Remember, nodes in the tree represent game states and edges represent the possible actions. So an edge e from node A to node B means that taking action e from game state A results in game state B. Now what are the stats that we keep for the game states? They are:
> - $Q(s,a)$ = the win ratio for a given state (# simulated wins / # simulated games)
> - $N$ = total number of simulations that have occurred after the $i^{th}$ iteration
> - $n$ = total number of simulations for a given node after the $i^{th}$ iteration
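
A minimal sketch of how these statistics could be stored per node is shown below. The class name `NodeStats` and its fields are hypothetical, chosen only for illustration; $N$ is not stored here because it can be read from the parent (or the root). The example only uses wins (1) and losses (0).

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Hypothetical per-node record of the statistics described above."""
    wins: float = 0.0  # accumulated simulated results (win = 1, loss = 0)
    n: int = 0         # number of simulations that passed through this node

    @property
    def q(self) -> float:
        """Win ratio: simulated wins / simulated games (0 if unvisited)."""
        return self.wins / self.n if self.n else 0.0

    def record(self, result: float) -> None:
        """Fold one simulation result into the statistics."""
        self.wins += result
        self.n += 1

stats = NodeStats()
for result in (1, 0, 1, 1):  # three simulated wins and one loss
    stats.record(result)
print(stats.q, stats.n)      # prints: 0.75 4
```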

And how do we use these stats to help us navigate through the game search tree? This is the key to MCTS! These stats are used to construct a score called the __Upper Confidence Bound__ for each state. At a given state, if we have multiple child nodes (game states resulting from possible moves) and we have to select one, we select the one with the highest Upper Confidence Bound. 

Let's now see what this cryptic term is. 

### Upper Confidence Bound

The formula for UCB is as follows: 
>>> $UCB(s, a) = Q(s, a) + c \cdot \sqrt{\frac{\ln N}{n}}$

The first term in the equation above is the __exploitation term__ and the second term is the __exploration term__. Since the first term is simply the number of simulated games that led to a win divided by the total number of simulated games, it will be high for the states that lead to more wins during the simulations. So this term is essentially the mean value of a given state. The second term is more interesting: it creates an upper confidence bound on the mean value of each state and __measures the uncertainty in the estimate of the state's value__. Let's see how it works.

In $c \cdot \sqrt{\ln(N) / n}$, $c$ is the exploration coefficient. If $c = 1$, exploitation and exploration are given equal weight. If a given state $s_{0}$ (or the action leading to it) is chosen many times, the uncertainty goes down (the exploration term becomes small) as the $n$ in the denominator is incremented. On the other hand, if any other state is chosen, the uncertainty for the value of $s_{0}$ goes up, as the $N$ in the numerator is incremented while $n$ in the denominator stays constant. We use the natural logarithm so that the increase in uncertainty becomes smaller over time while remaining unbounded. For a more in-depth explanation, please refer to __Reinforcement Learning: An Introduction by Sutton and Barto__.
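
The behaviour described above can be checked numerically with a small sketch. The function name `ucb` and the convention of giving an unvisited node an infinite score (so that it is tried at least once) are assumptions of this example, not something fixed by the formula.

```python
import math

def ucb(q: float, n: int, N: int, c: float = 1.0) -> float:
    """Upper confidence bound: q + c * sqrt(ln(N) / n).

    An unvisited node (n == 0) gets an infinite score so that it is tried
    at least once; that convention is an assumption of this sketch.
    """
    if n == 0:
        return float("inf")
    return q + c * math.sqrt(math.log(N) / n)

# Same win ratio, different visit counts: the less-visited node scores higher.
print(ucb(q=0.5, n=10, N=100))
print(ucb(q=0.5, n=50, N=100))

# Visits elsewhere (N grows while n stays fixed) raise the bound, but slowly.
print(ucb(q=0.5, n=10, N=1000))
```

With $Q$ held equal, the node visited 10 times outscores the one visited 50 times, and a tenfold increase in $N$ raises the score only modestly because of the logarithm.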

The Upper Confidence Bound (UCB) is used in MCTS to determine which move to take. However, there are altogether 4 key steps in the MCTS algorithm: __Selection, Expansion, Rollout, and Update__. Let's see what each of these terms means.

> 1) __Selection__: In this part, we begin at the root node of the tree, __R__, and repeatedly select the child node with the highest UCB value until we reach a node with no children, __L__. 

<img src="images/sel.PNG" alt="Drawing" style="width: 500px;"/>

> 2) __Expansion__: In this part, we check whether __L__ is a terminal state. If it is not, we create all states reachable through the actions available at __L__ and add them to the tree. Then we select the first child node, __C__.

<img src="images/exp.PNG" alt="Drawing" style="width: 500px;"/>

> 3) __Rollout__: In this part, we generate a random playout from node __C__ until we reach the end of the game, where the result is either win, draw, or loss. 

<img src="images/roll.PNG" alt="Drawing" style="width: 500px;"/>

> 4) __Update__: In this part, we use the result obtained at the end of the rollout (win/draw/loss) and propagate the result back up from __C__ to __R__, hence updating the relevant values in that trajectory. 

<img src="images/upd.PNG" alt="Drawing" style="width: 500px;"/>
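
To make the four steps concrete, here is a small tree-based sketch on a toy game. Everything in it is an assumption made for illustration: the game (a pile of stones, each player takes 1 to 3, whoever takes the last stone wins), the `Node` class, the helper names, and the exploration constant `c=1.4`. Each node stores wins from the point of view of the player who moved into it, and the toy game has no draws.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones    # stones left for the player about to move
        self.parent = parent
        self.move = move        # move that led from parent to this node
        self.children = []
        self.wins = 0.0         # wins for the player who moved INTO this node
        self.n = 0              # simulations through this node

def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def ucb(node, c=1.4):
    if node.n == 0:
        return float("inf")     # unvisited children are tried first
    return node.wins / node.n + c * math.sqrt(math.log(node.parent.n) / node.n)

def select(node):
    # 1) Selection: follow the highest-UCB child until a node has no children.
    while node.children:
        node = max(node.children, key=ucb)
    return node

def expand(node):
    # 2) Expansion: if L is not terminal, add all its children, return the first.
    if node.stones == 0:
        return node
    node.children = [Node(node.stones - m, node, m) for m in legal_moves(node.stones)]
    return node.children[0]

def rollout(node):
    # 3) Rollout: random playout; 1.0 if the player who moved into `node` wins.
    stones, last_mover_is_ours = node.stones, True
    while stones > 0:
        stones -= random.choice(legal_moves(stones))
        last_mover_is_ours = not last_mover_is_ours
    return 1.0 if last_mover_is_ours else 0.0

def update(node, result):
    # 4) Update: propagate the result up to the root, flipping sides each level.
    while node is not None:
        node.n += 1
        node.wins += result
        result = 1.0 - result
        node = node.parent

def mcts(stones, iterations=2000):
    root = Node(stones)
    for _ in range(iterations):
        child = expand(select(root))
        update(child, rollout(child))
    return root

random.seed(0)
root = mcts(5)
best = max(root.children, key=lambda ch: ch.n)
print("visits per move:", {ch.move: ch.n for ch in root.children})
print("chosen move:", best.move)
```

In this toy game, leaving the opponent a multiple of 4 stones is a winning strategy, so from 5 stones the search should concentrate its visits on taking 1 stone.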

The four steps above make up one iteration. In the Monte Carlo Tree Search algorithm, we perform a fixed number of iterations (starting from the root node each time) and keep expanding and updating the tree, which grows asymmetrically. After the fixed number of iterations has elapsed, we select the action from the root node $s_{0}$ that has the highest UCB score (calculated using the updated stats in the tree). The resulting state (child node) then becomes the new root node, and we repeat the fixed number of iterations to pick the opponent's next move. 

__The beauty of MCTS(UCB) is that, due to its asymmetrical nature, the tree selection and growth gradually converge to better moves. At the end, you get the child node with the highest number of simulations and that’s your best move according to MCTS.__
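
The statistics gathered at the root can be turned into a final move in more than one way. The sketch below contrasts picking the most-simulated child, as described above, with picking the child with the highest raw win ratio. The numbers in `root_stats` are made up for illustration, and both function names are hypothetical.

```python
# Hypothetical root statistics after a search: move -> (simulated wins, simulations)
root_stats = {"a": (60, 100), "b": (3, 4), "c": (20, 50)}

def most_visited(stats):
    """Pick the move with the largest simulation count."""
    return max(stats, key=lambda m: stats[m][1])

def best_win_ratio(stats):
    """Pick the move with the largest win ratio, however few samples back it."""
    return max(stats, key=lambda m: stats[m][0] / stats[m][1])

print(most_visited(root_stats))    # a
print(best_win_ratio(root_stats))  # b
```

Move `b` has the best ratio but rests on only four simulations, which is why the visit count is often treated as the more trustworthy signal.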

Before we code the algorithm up, let's state the benefits and the drawbacks of MCTS.

__Benefits__: 
> - __Aheuristic__: MCTS does not require domain knowledge for any game beyond just its rules. This means that the MCTS code can be reused for many different games with little modification.
> - __Asymmetric__: MCTS performs asymmetric tree growth that adapts to the topology of the search space. The algorithm visits more interesting nodes more often, and focusses its search time in more relevant parts of the tree.
> - __Anytime__: The algorithm can be stopped at any iteration and will return the best policy found so far.

__Drawbacks__:
> - __Speed__: MCTS can take many iterations to converge to a good solution. We will see that this is one of the reasons we use an improved MCTS in AlphaZero (which uses neural networks instead of random rollouts).
> - __Playing Strength__: MCTS can sometimes fail to find reasonable moves, even for games of medium complexity, within a reasonable number of iterations.

Here's a high-level view of the MCTS algorithm:

In [2]:
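# Search statistics shared across calls to MCTS
visited = set()  # states already added to the tree (states must be hashable)
Q = {}           # Q[s][a]: mean simulated value of taking action a in state s
N = {}           # N[s][a]: number of simulations that took action a in state s
c = 1.0          # exploration constant

def MCTS(s, game):
    """Run one MCTS iteration starting from state s.

       Input: s ~ state
       game = specific game
       Returns the negated value of s, i.e. the value as seen by the
       player who moved into s."""

    if game.gameEnded(s):
        return -game.gameReward(s)

    # Expand
    if s not in visited:
        visited.add(s)
        Q[s] = {a: 0.0 for a in game.getValidActions(s)}
        N[s] = {a: 0 for a in game.getValidActions(s)}
        v = rollout(s, game)  # Rollout
        return -v

    max_ucb, best_action = -float("inf"), None
    total = sum(N[s].values())  # simulations that have passed through s

    # Select
    for a in game.getValidActions(s):
        if N[s][a] == 0:
            ucb = float("inf")  # try every action at least once
        else:
            ucb = Q[s][a] + c * math.sqrt(math.log(total) / N[s][a])

        if ucb > max_ucb:
            max_ucb = ucb
            best_action = a

    a = best_action

    s_next = game.nextState(s, a)

    v = MCTS(s_next, game)

    # Update: running mean of the values observed for (s, a)
    Q[s][a] = (N[s][a] * Q[s][a] + v) / (N[s][a] + 1)

    N[s][a] += 1

    return -v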

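def rollout(s, game):
    """Play uniformly random moves from s until the game ends.

       Returns the result as seen by the player to move at s, assuming
       game.gameReward(s) is the reward for the player to move at s."""
    sign = 1
    while True:
        if game.gameEnded(s):
            return sign * game.gameReward(s)
        val_act = game.getValidActions(s)
        a = np.random.randint(0, len(val_act))
        action = val_act[a]
        s = game.nextState(s, action)
        sign = -sign  # the turn passes to the other player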
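
The functions above rely on a `game` object with four methods: `gameEnded`, `gameReward`, `getValidActions` and `nextState`. As a hedged illustration of that interface, here is a hypothetical toy game (`NimGame` is not part of the original notebook). It assumes that `gameReward(s)` is the reward for the player about to move at `s`, and that a state is a hashable value.

```python
import random

class NimGame:
    """Hypothetical toy game: players alternately take 1-3 stones from a pile,
    and whoever takes the last stone wins. A state is the number of stones
    left, seen by the player about to move."""

    def getValidActions(self, s):
        return [a for a in (1, 2, 3) if a <= s]

    def nextState(self, s, a):
        return s - a

    def gameEnded(self, s):
        return s == 0

    def gameReward(self, s):
        # Reward for the player to move at s. With no stones left, that
        # player has lost because the opponent took the last stone.
        return -1 if s == 0 else 0

# One random playout through the interface, starting from 7 stones.
game = NimGame()
s, moves = 7, []
while not game.gameEnded(s):
    a = random.choice(game.getValidActions(s))
    moves.append(a)
    s = game.nextState(s, a)
print("moves played:", moves, "reward for player to move:", game.gameReward(s))
```

Because the state does not record whose turn it is, all statistics for this toy game are naturally expressed from the point of view of the player to move, which matches the sign flips in the search code.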