# Understanding AlphaGo
AlphaGo is a computer program that plays the board game Go. It was developed by DeepMind, a London-based subsidiary of Alphabet Inc. AlphaGo has played a large number of games against itself to improve its play, and it has defeated some of the world's top professional Go players. Its algorithm combines machine learning with tree search, backed by extensive training on both human games and games of self-play.
 
## How does AlphaGo work?
AlphaGo uses Monte Carlo tree search to find its moves, guided by knowledge previously "learned" by deep artificial neural networks. The networks were trained extensively on both human games and games of self-play. Each network takes a description of the Go board as input and processes it through 12 different network layers containing millions of neuron-like connections. Using a large database of historical games as a reference, a policy network is first trained to predict the moves that human experts played; a value network is then trained on self-play games to predict the eventual winner of a position. After extensive training, the networks "know" a lot about how to play Go.

AlphaGo's tree search looks many moves ahead by playing out the remainder of the game in its "imagination". Candidate moves are evaluated by simulating the game to the very end (a "rollout"). AlphaGo selects its move based on the results of these rollouts combined with the neural network evaluations.
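To make the rollout idea concrete, here is a minimal sketch of Monte Carlo tree search on a toy game rather than Go. It assumes a simple Nim variant (take 1 to 3 stones per turn; whoever takes the last stone wins) and uses plain UCT selection with uniformly random rollouts. AlphaGo's actual search additionally uses policy network priors and value network evaluations, which are left out here. All names (`Node`, `rollout`, `mcts`) are hypothetical.

```python
import math
import random


class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left when it is this node's turn to move
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who made self.move


def rollout(stones, rng):
    """Play random moves to the end; True if the player to move at the start wins."""
    starter_to_move = True
    starter_wins = False          # with 0 stones the player to move has already lost
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            starter_wins = starter_to_move
        starter_to_move = not starter_to_move
    return starter_wins


def mcts(stones, iterations=1000, c=1.4, seed=0):
    """Return the most-visited move from a position with `stones` > 0 stones."""
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: follow the UCT rule while the node is fully expanded.
        while not node.untried and node.children:
            log_n = math.log(node.visits)
            node = max(
                node.children,
                key=lambda ch: ch.wins / ch.visits + c * math.sqrt(log_n / ch.visits),
            )
        # 2. Expansion: add one untried move as a new child.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Rollout: the player who moved into `node` wins if the player to move loses.
        mover_wins = not rollout(node.stones, rng)
        # 4. Backpropagation: flip the perspective at every level up the tree.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if mover_wins else 0.0
            mover_wins = not mover_wins
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move
```

For example, `mcts(3)` picks the move that takes all three stones, since that wins immediately. The four commented phases are the same ones a Go program would use; only the game rules and the evaluation at the leaves differ.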

The paper describing AlphaGo can be found [here](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf).

## AlphaGo's architecture
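AlphaGo receives the board position as a 19x19 image. The image is passed through a series of convolutional layers that build a representation of the position, followed by a policy head and a value head. (In the original AlphaGo these outputs come from separate policy and value networks; the later AlphaGo Zero merges them into a single network with two heads.) The policy head outputs a probability distribution over the next move, while the value head outputs a single number between -1 and 1, where -1 represents a certain loss and 1 a certain win for the player to move. Values near 0 indicate an evenly balanced position.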
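The two output types can be illustrated with a small sketch. It assumes the convolutional layers have already produced one raw score per board point for the policy and one raw scalar for the value; the function names are hypothetical, and details such as masking illegal moves are omitted. A softmax turns the scores into a probability distribution, and a tanh squashes the scalar into the range (-1, 1).

```python
import math

BOARD_SIZE = 19
NUM_POINTS = BOARD_SIZE * BOARD_SIZE  # 361 board points


def policy_output(logits):
    """Softmax: convert one raw score per board point into move probabilities."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def value_output(score):
    """tanh: squash an unbounded score into (-1, 1); -1 is a loss, +1 a win."""
    return math.tanh(score)


# Hypothetical raw outputs: every point scores 0 except one favoured point.
logits = [0.0] * NUM_POINTS
logits[180] = 5.0                        # index 180 is the centre point (row 9, column 9)
probs = policy_output(logits)
best_point = max(range(NUM_POINTS), key=lambda i: probs[i])
```

Here `best_point` is 180, and the probabilities in `probs` sum to 1 (up to floating-point error).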

## Training the networks
The networks are trained by supervised learning (SL) from human expert games and by reinforcement learning (RL) from games of self-play. In both cases, the input to the policy network is a representation of the board position, and the output is a probability distribution over moves. In SL, the weights are updated by gradient ascent to increase the likelihood of the move the human expert played. In RL, the same kind of update is weighted by the game outcome, which increases the probability of selecting moves that lead to a win.
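The policy update can be sketched on a toy model. The example below assumes a linear softmax policy over a handful of moves instead of a deep convolutional network, and all names (`move_probs`, `policy_step`, `scale`) are hypothetical. With `scale=1.0` and the expert's move it is a supervised step that raises the log-likelihood of that move. Passing the game outcome (+1 or -1) as `scale` gives a REINFORCE-style step that reinforces moves from won games and discourages moves from lost ones.

```python
import math


def move_probs(weights, features):
    """Softmax policy: weights[a][i] is the weight of feature i for move a."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_step(weights, features, move, scale=1.0, lr=0.1):
    """One gradient-ascent step on scale * log p(move | features), in place.

    Returns the probability of `move` before the update.
    """
    probs = move_probs(weights, features)
    for a, row in enumerate(weights):
        # d log p(move) / d logit_a = 1[a == move] - p_a
        grad_logit = (1.0 if a == move else 0.0) - probs[a]
        for i, x in enumerate(features):
            row[i] += lr * scale * grad_logit * x
    return probs[move]


# Toy usage: 3 possible moves, 2 input features, the expert played move 2.
weights = [[0.0, 0.0] for _ in range(3)]
features = [1.0, 0.5]
for _ in range(50):
    policy_step(weights, features, move=2)
```

After these steps, `move_probs(weights, features)[2]` is well above its initial value of 1/3.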

The SL training is based on 30 million positions from the KGS Go server and achieves 57% accuracy in predicting expert moves from a given position. The reinforcement learning (RL) training starts from the SL parameters and proceeds by playing games between the current policy network and randomly selected earlier versions of itself. According to the paper, the resulting RL policy network won more than 80% of its games against the SL policy network and 85% of its games against the open-source Go program Pachi; these figures are win rates, not prediction accuracies. RL training used the following reward: +1 for a win and -1 for a loss at the terminal step $T$, and 0 for all non-terminal steps $t<T$.
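As a small illustration of this reward scheme, the sketch below builds the reward sequence for one game and the outcome seen by the player to move at each step, which is the quantity used to weight the RL updates. It assumes Black moves at even steps $t = 0, 2, \ldots$ and White at odd steps; the function names are hypothetical.

```python
def rewards(num_moves, last_mover_won):
    """Reward per time step: 0 before the end, +1 or -1 at the terminal step."""
    terminal = 1 if last_mover_won else -1
    return [0] * (num_moves - 1) + [terminal]


def outcomes_per_step(num_moves, black_won):
    """Final outcome from the perspective of the player to move at each step t."""
    outcomes = []
    for t in range(num_moves):
        black_to_move = (t % 2 == 0)     # assumption: Black plays the even steps
        outcomes.append(1 if black_to_move == black_won else -1)
    return outcomes
```

For a four-move game won by Black, `outcomes_per_step(4, True)` returns `[1, -1, 1, -1]`: every position where Black was to move is labelled as a win, and every position where White was to move as a loss.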

Reinforcement learning of the value network focused on estimating the value of a position $s$ when both players select moves with a policy $\pi$:
\begin{equation}
V_\pi(s) = \mathbb{E}[z_t\mid s_t = s, a_{t\ldots T}\sim\pi]
\end{equation}
where $s_t$ is the state of the game at time step $t$ and $z_t$ is the final outcome of the game (+1 or -1) from the perspective of the player to move at time step $t$. The policy $\pi$ is a probability distribution over moves at each step. A value network $v_\theta$ approximates $V_\pi$. The training uses gradient descent to minimize the mean squared error (MSE) between the predicted value $v_\theta(s)$ and the actual outcome $z_t$:
\begin{equation}
\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial\theta}(z_t - v_\theta(s))
\end{equation}
where $\theta$ are the parameters of the value network.
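The update rule can be tried out on a toy value function. The sketch below assumes $v_\theta(s) = \tanh(\theta \cdot \phi(s))$ for a small hand-made feature vector $\phi(s)$, not a deep network, and the names `value` and `value_step` are hypothetical. For this model $\partial v_\theta / \partial\theta_i = (1 - v_\theta^2)\,\phi_i$, so each step moves $\theta$ in the direction of $\frac{\partial v_\theta(s)}{\partial\theta}(z_t - v_\theta(s))$.

```python
import math


def value(theta, features):
    """Toy value function: v = tanh(theta . features), always in (-1, 1)."""
    return math.tanh(sum(t * x for t, x in zip(theta, features)))


def value_step(theta, features, z, lr=0.1):
    """One step of theta += lr * dv/dtheta * (z - v), applied in place.

    Returns the prediction made before the update.
    """
    v = value(theta, features)
    dv_dscore = 1.0 - v * v              # derivative of tanh at the current score
    for i, x in enumerate(features):
        theta[i] += lr * dv_dscore * x * (z - v)
    return v


# Toy usage: one position with hypothetical features, observed outcome z = +1.
theta = [0.0, 0.0]
features = [1.0, -0.5]
for _ in range(200):
    value_step(theta, features, z=1.0)
```

Starting from a prediction of 0, repeated updates move `value(theta, features)` toward the observed outcome of +1, which shrinks the squared error.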