# Part 3: A simple binary problem

In [None]:
#hide
from graphviz import Digraph

Now that we understand the dynamics of the game and we know how to generate large amounts of data, we are ready to train our first model. To begin with, we will focus on a simple problem: determining whether a grid is complete or not, that is, whether or not at least one allowed move remains.

This sounds like a dumb problem to solve, and indeed it is: the information whether the game is over or not is already contained in the Python implementation of the game. But we will pretend that we don't know it and try to answer the question just by looking at an image of the grid. Of course it is also very easy to write a simple loop that runs over all lattice sites and checks whether there is an allowed move there. But this is not what we want to do. We want to train a neural network without having to tell it *how* to do it. The input will simply be a large number of grids with labels indicating whether the game is over or not, and the net will learn the rules of the game *by itself*!

Even though the problem is a dumb one, it is quite instructive as a starter. And in fact, when you think about it, this is not such a simple problem even for a human: when you play the game with pen and paper and end up with a grid that has lots of points and lines on it, it takes time to verify that you are not missing a move. Let us see if the neural net can do this more quickly.

### The network

As this is a very simple problem, the neural net does not need to be very deep. In fact, a single convolutional layer can do most of the work.

The drawback in this case is that the convolutional kernel must be large enough for a complete segment of 5 points and 4 lines to fit into its receptive field. In terms of [the representation presented in *Part 2*](/2022/01/05/Part_2_Data.html), this corresponds to a size of 13 x 13 pixels. The first layer of the network will therefore be a convolutional layer with a 13 x 13 kernel, stride 3, and no padding, followed by max pooling. The goal of this layer will be to discover features corresponding to allowed moves in the vertical, horizontal, and both diagonal directions. While in principle 4 channels are sufficient to do so, in practice the features have a lot of non-trivial structure that is not easy to detect in a stochastic process. We will therefore use many more output channels. Empirical testing shows that 40 is a good number. With fewer channels, the training tends to converge to a state in which only some of the features are detected (for instance, it misses allowed moves in one of the directions).
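
The kernel size can be checked with a line of arithmetic. The sketch below assumes that neighbouring lattice points sit 3 pixels apart in the image (an assumption suggested by the stride of 3, not something restated here); the function name is only illustrative.

```python
def segment_extent(points: int, pitch: int = 3) -> int:
    """Pixels covered, end to end, by `points` aligned lattice points.

    `pitch` is the assumed pixel distance between neighbouring points.
    """
    return (points - 1) * pitch + 1


print(segment_extent(5))  # 13: a full segment of 5 points and 4 lines
```

With a stride equal to the pitch, each kernel position is aligned with the lattice in the same way, which is presumably why a stride of 3 is a natural choice.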

Past this layer, the rest of the neural network only needs to implement a logical OR gate: if any allowed move is detected by the convolutional layer, the final answer is a *yes*; if no allowed move is detected, the final answer is a *no*.
A few linear layers with few channels are sufficient to do so. After some more experimenting, we find that the following network architecture works well for our problem:

In [None]:
#hide
# graph = Digraph('Democritus architecture', format='svg')
# graph.attr('node', fontsize='12')
# graph.attr('edge', fontsize='10', arrowsize='0.7')
# graph.node('0', label='Grid image')
# graph.node('1', label='Convolutional layer, 13 x 13 kernel, stride 3', shape='box', 
#            style='filled', fillcolor='lightpink')
# graph.node('2', label='Max pool', shape='invhouse', 
#            style='filled', fillcolor='gray90')
# graph.node('3', label='Linear layer + ReLu', shape='box', 
#            style='filled', fillcolor='aquamarine')
# graph.node('4', label='Linear layer + ReLu', shape='box', 
#            style='filled', fillcolor='aquamarine')
# graph.node('5', label='Linear layer', shape='box', 
#            style='filled', fillcolor='aquamarine')
# graph.node('6', label='Output', fontsize='12')
# graph.edge('0','1', label='  1 x  94 x 94')
# graph.edge('1','2', label='  40 x  27 x 27')
# graph.edge('2','3', label='  40')
# graph.edge('3','4', label='  4')
# graph.edge('4','5', label='  2')
# graph.edge('5','6', label='  1')
# graph.render('Part_3_Binary_problem_images/Democritus_architecture')
# graph

![svg](Part_3_Binary_problem_images/Democritus_architecture.svg 'Network architecture')

This is one convolutional layer followed by 3 linear layers, with non-linearities given by max pooling after the first layer and rectified linear units (ReLU) between the subsequent layers. The dimensions of each layer's input and output are indicated next to the arrows. Note that the large number of channels after the first layer (40) is immediately taken down to a very small number (4, then 2) in the linear layers.
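
For readers who prefer code to diagrams, here is a minimal PyTorch sketch of such a network. It is a reconstruction from the diagram rather than the original implementation: the class name `BinaryGridNet` is hypothetical, and the max pooling is assumed to be global (one value per channel), since the diagram goes from 40 feature maps straight to 40 numbers.

```python
import torch
from torch import nn


class BinaryGridNet(nn.Module):
    """Hypothetical name; layer sizes follow the diagram."""

    def __init__(self) -> None:
        super().__init__()
        # 1 input channel -> 40 feature maps, 13 x 13 kernel, stride 3, no padding
        self.conv = nn.Conv2d(1, 40, kernel_size=13, stride=3)
        # Small fully connected head: 40 -> 4 -> 2 -> 1
        self.head = nn.Sequential(
            nn.Linear(40, 4),
            nn.ReLU(),
            nn.Linear(4, 2),
            nn.ReLU(),
            nn.Linear(2, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.conv(x)                    # (batch, 40, H', W')
        pooled = torch.amax(features, dim=(2, 3))  # global max pool -> (batch, 40)
        return self.head(pooled).squeeze(1)        # one number per grid


model = BinaryGridNet()
batch = torch.zeros(8, 1, 94, 94)  # 8 blank grid images
print(model(batch).shape)          # torch.Size([8])
```

The global maximum is what makes the OR logic easy: a strong response of a channel anywhere on the grid survives the pooling, no matter where the allowed move is located.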

This model has 6977 trainable parameters, most of which (6800) are inside the first layer.
This is quite a large number for such a small network, and therefore a lot of data is required for training.
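
These numbers can be verified by hand, counting one weight per connection plus one bias per output channel. The helper names below are only illustrative.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution with a square kernel."""
    return out_channels * in_channels * kernel * kernel + out_channels


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


first_layer = conv_params(1, 40, 13)
linear_layers = linear_params(40, 4) + linear_params(4, 2) + linear_params(2, 1)

print(first_layer)                  # 6800
print(linear_layers)                # 177
print(first_layer + linear_layers)  # 6977
```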

### The data




For this problem, we make use of a static data set, in which all grids and labels are computed first, before the training begins. As we shall see in a minute, this is sufficient to achieve very high accuracy. Experimenting with dynamical data did not show significant differences.

We let the computer play 10,000 random games. In each case, we keep an image of the final grid (no allowed moves), and one image of an intermediate grid (at least one allowed move). In this way the data consists of 20,000 grids, and it is perfectly balanced: 10,000 grids are given the label "0", the other 10,000 the label "1". The intermediate stage of the game associated with the label "1" is chosen at random in 1/3 of the games, with uniform probability distribution. In the other 2/3 of the games, we use the next-to-last stage, just before the last move is played: in this way most of these grids have just *one* allowed move. If we do not use this trick, the neural net very quickly settles in a state in which it only verifies the presence of allowed moves in some of the directions, but not all 4 directions: this gives quite high accuracy for most of the games that have plenty of allowed moves, but it does not properly solve the problem.
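
The choice of the intermediate stage could be sketched as follows. This is an illustration under assumptions, not the code used to build the data set: the function name is hypothetical, the 1/3 versus 2/3 split is drawn at random for each game (it could equally be a fixed partition of the games), and the initial grid is counted as a valid intermediate stage.

```python
import random


def pick_intermediate_stage(n_moves: int, rng: random.Random) -> int:
    """Number of moves already played in the grid that gets the label "1".

    Stage 0 is the initial grid and stage `n_moves` is the final grid,
    so the valid intermediate stages are 0 .. n_moves - 1.
    """
    if rng.random() < 1 / 3:
        return rng.randrange(n_moves)  # uniform over all intermediate stages
    return n_moves - 1                 # next-to-last stage


rng = random.Random(0)
stages = [pick_intermediate_stage(60, rng) for _ in range(3000)]
late = sum(stage == 59 for stage in stages) / len(stages)
print(f"fraction of next-to-last grids: {late:.2f}")
```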

### Training and validation

The training follows a standard stochastic gradient descent (SGD) procedure. We use the *PyTorch* SGD optimizer, with learning rate 0.005 and momentum 0.9.
Stochasticity is introduced in the learning process through the splitting of data into 100 mini-batches of 200 grids each.
Each time a mini-batch is fed to the network, we apply a transformation (a 90-degree rotation or a mirror flip), hence artificially augmenting the data. In this way the neural net only sees the same data again after 8 epochs (an epoch corresponds to 100 mini-batches, i.e. all of the data passing through the net).
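
The number 8 comes from the symmetries of the square: 4 rotations, each with or without a mirror flip. A small sketch in plain Python, with grids stored as lists of rows and illustrative function names, makes this explicit:

```python
def rotate90(grid):
    """Rotate a square grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror(grid):
    """Flip a grid left to right."""
    return [row[::-1] for row in grid]


def symmetries(grid):
    """All 8 images of a grid under rotations and mirror flips."""
    images = []
    current = [list(row) for row in grid]
    for _ in range(4):
        images.append(current)
        images.append(mirror(current))
        current = rotate90(current)
    return images


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
distinct = {tuple(map(tuple, image)) for image in symmetries(grid)}
print(len(distinct))  # 8
```

One simple schedule, given here only as an assumption, is to use the image with index `epoch % 8`, so that every orientation is seen once before the cycle repeats.
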
We use the mean squared error as a loss function, keeping in mind that we eventually want to solve a regression problem.
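
A single optimization step with these settings might look like the sketch below. The model and the data are stand-ins chosen so that the snippet runs on its own (a single linear layer and random images), and `train_step` is a hypothetical helper; only the optimizer settings, the batch size, and the loss function are taken from the text.

```python
import torch
from torch import nn

# Stand-in model and mini-batch, NOT the network and data described in the text
model = nn.Sequential(nn.Flatten(), nn.Linear(94 * 94, 1))
grids = torch.rand(200, 1, 94, 94)            # one mini-batch of 200 grids
labels = torch.randint(0, 2, (200,)).float()  # labels 0 or 1

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
loss_fn = nn.MSELoss()


def train_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    """One SGD update on a mini-batch; returns the loss before the update."""
    optimizer.zero_grad()
    predictions = model(batch).squeeze(1)
    loss = loss_fn(predictions, targets)
    loss.backward()
    optimizer.step()
    return loss.item()


loss_value = train_step(grids, labels)
print(loss_value)
```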

This is the value of the loss function over 100 epochs:

![png](Part_3_Binary_problem_images/Democritus_loss.png 'text')

As expected, the loss decreases steadily, and even faster towards the end of the training cycle. I could have pushed the fitting procedure even further, but the accuracy has already reached a plateau at an impressive 99%:

![png](Part_3_Binary_problem_images/Democritus_accuracy.png 'text')

It looks like the neural net has essentially solved our problem!
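
Since the network is trained with a regression loss, its output is a real number rather than a class. One natural way to turn it into an accuracy, assumed here because the text does not specify it, is to threshold the output at 0.5:

```python
def accuracy(outputs, labels, threshold=0.5):
    """Fraction of grids whose thresholded output matches the 0/1 label."""
    hits = sum((out >= threshold) == (label == 1)
               for out, label in zip(outputs, labels))
    return hits / len(labels)


# Toy values for illustration only, not results from the trained network
print(accuracy([0.93, 0.08, 0.61, 0.40], [1, 0, 1, 1]))  # 0.75
```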

### How did it work?



Out of curiosity, let us look at what the convolutional layer has learned:

![png](Part_3_Binary_problem_images/Democritus_convlayer.png 'text')

Only 4 of the 40 channels are being used; everything else is zero.

Let us zoom into one of them:

![png](Part_3_Binary_problem_images/Democritus_convlayer_zoom.png 'text')

We have learned something here: it is sufficient to check for the presence of a segment at both extremities.

### Conclusions

The principle is working.

However, the model cannot yet be used to significantly improve our exploration.

