# AlphaGo Zero: Mastering the game of Go without human knowledge

In October 2015, DeepMind's AlphaGo Fan decisively beat the European Go champion Fan Hui, showing the world the potential of reinforcement learning (RL) algorithms to outperform humans in a highly complex game that requires strategy and thinking ahead. Since then, DeepMind has further refined the architecture behind AlphaGo, creating new versions with vastly improved performance. AlphaGo Zero is one such newer version. This blog post is meant to give the reader a basic understanding of the underlying architecture of AlphaGo Zero and to showcase the elegance of the interplay between the algorithms it uses.

## The game of Go

Go is an ancient Chinese board game played by two players who take alternating turns on a 19x19 grid. One side uses the black stones while the other uses the white stones. On each turn, the player either places a stone on an intersection or passes, with the goal of surrounding as much territory on the board as possible by the time the game concludes. Today the game is widely popular in East Asia and is regarded as very complex and very difficult, if not impossible, to master.

## The outstanding idea behind AlphaGo Zero

AlphaGo Zero utilizes two main components: a **deep neural network** and a **Monte Carlo tree search (MCTS) algorithm**. The deep neural network takes the raw representation of the current board state and of previous board states as input. It predicts a probability for each move available in the given board state, as well as the chance of victory for the current player.
Only the parameters of the deep neural network are trained; improving them is what improves the performance of AlphaGo Zero.
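To make the network's interface concrete, here is a minimal sketch in plain Python. It is not a real network: `toy_f_theta`, its `seed` parameter, and the random logits are hypothetical stand-ins. It only illustrates the shape of the output, assuming one probability per intersection plus one for passing, and a scalar value squashed into the range from -1 to 1.

```python
import math
import random

BOARD_SIZE = 19
NUM_MOVES = BOARD_SIZE * BOARD_SIZE + 1  # every intersection plus "pass"

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_f_theta(board_history, seed=0):
    """Hypothetical stand-in for (p, v) = f_theta(s).

    A trained network would compute its outputs from board_history;
    this toy ignores the input and uses random scores instead.
    """
    rng = random.Random(seed)
    policy_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_MOVES)]
    value_logit = rng.gauss(0.0, 1.0)
    p = softmax(policy_logits)   # one probability per move
    v = math.tanh(value_logit)   # scalar in [-1, 1]
    return p, v

p, v = toy_f_theta(board_history=None)
```

In a real implementation both outputs would come from the same trained network, so that a single forward pass yields the move probabilities and the value estimate together.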

When asked to choose its next move, the algorithm conducts an MCTS guided by the deep neural network. The search builds a tree in which each node is a board state and each edge is the combination of a board state and a move. Each edge also stores a counter of how often it was chosen during the search. In practice, this means:

1. Initialize the tree with a single node: the current board state.

2. Search the tree by following the most promising path of moves until a move is selected that leads to a board state not yet recorded in the tree. How promising a move is gets evaluated using the neural network, and this evaluation contains the exploration aspect of the RL algorithm.

3. Expand the tree by giving the new board state its own node, and increase the counter of each move that led to this new board state by one.

4. If computation time remains before the move has to be selected, jump to step 2. Otherwise, return a probability for each move proportional to the number of visits of the respective edge leading out from the root node. These probabilities act as a refinement of the probabilities returned by the deep neural network.

5. The next move can then be derived from these probabilities.
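The five steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the original implementation: Go is replaced by a hypothetical toy game (a pile of stones, each move removes 1 or 2, taking the last stone wins), the network is replaced by an untrained `dummy_network`, a fixed number of simulations stands in for the time budget, and the selection score (mean value plus a prior-weighted exploration bonus, with an assumed constant `c_puct`) is one common way to combine the network output with the visit counters.

```python
import math

class Edge:
    """A (board state, move) pair plus its search statistics."""
    def __init__(self, prior):
        self.prior = prior       # move probability from the network
        self.visits = 0          # the visit counter described in the text
        self.value_sum = 0.0     # sum of backed-up values
        self.child = None        # node reached by this move, once expanded

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

class Node:
    def __init__(self, state):
        self.state = state
        self.edges = {}          # move -> Edge; empty until expanded

# Hypothetical toy game: state = (stones_left, player); a move removes
# 1 or 2 stones; whoever takes the last stone wins.
def legal_moves(state):
    return [m for m in (1, 2) if m <= state[0]]

def next_state(state, move):
    return (state[0] - move, -state[1])

def is_terminal(state):
    return state[0] == 0

def dummy_network(state):
    """Untrained stand-in for the network: uniform priors, neutral value."""
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

def expand(node, network):
    priors, value = network(node.state)
    node.edges = {move: Edge(p) for move, p in priors.items()}
    return value

def select(node, c_puct):
    """Pick the edge with the best mean value plus exploration bonus."""
    total = sum(e.visits for e in node.edges.values())
    def score(item):
        edge = item[1]
        # The "+ 1" is an assumption so priors matter on the very first visit.
        bonus = c_puct * edge.prior * math.sqrt(total + 1) / (1 + edge.visits)
        return edge.q() + bonus
    return max(node.edges.items(), key=score)

def mcts(root_state, network, simulations=200, c_puct=1.5):
    """Return visit-count probabilities for a non-terminal root state."""
    root = Node(root_state)                  # step 1: tree with a single node
    expand(root, network)
    for _ in range(simulations):             # step 4: fixed budget, not a clock
        node, path = root, []
        while node.edges:                    # step 2: follow promising edges
            move, edge = select(node, c_puct)
            path.append(edge)
            if edge.child is None:           # step 3: new board state, new node
                edge.child = Node(next_state(node.state, move))
                node = edge.child
                break
            node = edge.child
        if is_terminal(node.state):
            value = -1.0                     # the player to move has lost
        else:
            value = expand(node, network)
        for edge in reversed(path):          # update every edge on the path
            value = -value                   # switch to the other player's view
            edge.visits += 1
            edge.value_sum += value
    total = sum(e.visits for e in root.edges.values())
    return {move: e.visits / total for move, e in root.edges.items()}

probs = mcts((4, 1), dummy_network)
```

With 4 stones on the pile, taking 1 stone leaves the opponent without a winning reply, so the search visits that move more often and `probs[1]` ends up larger than `probs[2]`. Step 5 then amounts to picking a move from `probs`, for example with `max(probs, key=probs.get)` or by sampling.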

Using this process, the algorithm can play against itself over and over again. After each self-play match has concluded, its history can be used to construct multiple triples, each containing a board state, the MCTS move probabilities for this board state, and the eventual outcome of the match from the perspective of the player whose turn it is in that board state (did that player win?).
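Building the triples from a finished match could look like the following sketch. The function name, the history layout, and the encoding of players as `+1` and `-1` are assumptions made for illustration; board states are represented by placeholder strings.

```python
def build_training_triples(history, winner):
    """Turn one self-play match into (board state, MCTS probabilities, outcome) triples.

    history: list of (state, mcts_probs, player_to_move), with players
             encoded as +1 or -1 (an assumed encoding).
    winner:  +1 or -1, the player who won the match.
    """
    triples = []
    for state, mcts_probs, player_to_move in history:
        # +1 if the player to move in this state went on to win, else -1.
        outcome = 1 if player_to_move == winner else -1
        triples.append((state, mcts_probs, outcome))
    return triples

# Three positions from a hypothetical match that player -1 won.
history = [
    ("s0", {"a": 0.7, "b": 0.3}, 1),
    ("s1", {"c": 0.4, "d": 0.6}, -1),
    ("s2", {"e": 1.0}, 1),
]
triples = build_training_triples(history, winner=-1)
```

Note that the same match outcome is seen from alternating perspectives: positions where the eventual winner was to move get `+1`, all others get `-1`.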

After several of these self-play matches have concluded, a batch of the available triples is chosen to train the deep neural network. The network is trained so that its output probabilities more accurately reflect the MCTS probabilities and so that it predicts the correct winner as dictated by the triple. Stochastic gradient descent can be employed for this.
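The two training goals can be combined into a single quantity to minimize. The sketch below assumes a squared error for the winner prediction and a cross-entropy between the MCTS probabilities and the network probabilities; the function name and the small `eps` guard are illustrative, and any weight regularization is left out.

```python
import math

def training_loss(p, v, pi, z, eps=1e-12):
    """Loss for one triple: value error plus policy cross-entropy.

    p:  move probabilities predicted by the network
    v:  winner prediction of the network, in [-1, 1]
    pi: MCTS move probabilities from the triple
    z:  actual outcome from the triple, +1 or -1
    """
    value_loss = (z - v) ** 2
    # eps avoids log(0) when the network assigns zero probability to a move.
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss

# A prediction that matches the targets scores lower than one that does not.
good = training_loss(p=[0.5, 0.5], v=1.0, pi=[0.5, 0.5], z=1.0)
bad = training_loss(p=[0.9, 0.1], v=-1.0, pi=[0.1, 0.9], z=1.0)
```

Stochastic gradient descent would then adjust the network parameters to reduce the average of this loss over a batch of triples.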

## Reinforcement Learning in AlphaGo Zero

The neural network is trained from games of self-play by a novel RL algorithm (self-play combined with MCTS). The MCTS, in turn, is guided by the neural network.

What is the goal of the neural network and how is it connected with MCTS?

Main parts of the learning process should somehow explain the 4

Illustration showing the whole process.

We should focus on this section here.

Notes:

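$(p, v) = f_\theta(s)$: given a board state $s$, the network $f$ with parameters $\theta$ returns the move probabilities $p$ and the predicted chance of victory $v$ for the current player.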

## Exemplary results

Show some exemplary results using the Python and Keras implementation.
Compare the performance of AlphaGo Zero to previous variants like AlphaGo Master and AlphaGo Lee.

## References

- [GitHub](https://github.com/AppliedDataSciencePartners/DeepReinforcementLearning)
- [AlphaZero Cheatsheet](https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0)
- [AlphaZero using Python and Keras](https://medium.com/applied-data-science/how-to-build-your-own-alphazero-ai-using-python-and-keras-7f664945c188)