# Part 4: Going deeper with a counting problem

In [None]:
#hide
from graphviz import Digraph

To cover:
- regression problem instead of classification
- how to evaluate accuracy?
- how to generate well-normalized data? i.e. check the frequency of labels
- use deeper network but with smaller kernel size: better learning?

### A deeper neural network

In the [last part](/2022/01/07/Part_3_Binary_problem.html) we addressed the simple binary problem with a network that was not very deep. This was useful to understand what was going on, but to address more interesting problems we will need to go deeper.

The first layer is a convolution with a 3 x 3 kernel and stride 3.

It is followed by four more "nearest-neighbor" convolutional layers.

Then come max pooling and two linear layers that bring the output into the desired form
(fewer linear layers are needed because most of the computation is done in the convolutional layers).

network architecture:

In [None]:
# hide
# graph = Digraph('Descartes architecture', format='png')
# graph.attr('node', fontsize='12')
# graph.attr('edge', fontsize='10', arrowsize='0.7')
# graph.node('0', label='Grid image')
# graph.node('1', label='Convolutional layer, 3 x 3 kernel, stride 3 \n + ReLu', shape='box', 
#            style='filled', fillcolor='lightpink')
# graph.node('2', label='Convolutional layer, 2 x 2 kernel, stride 1 \n + ReLu', shape='box', 
#            style='filled', fillcolor='lightpink')
# graph.node('3', label='Convolutional layer, 2 x 2 kernel, stride 1 \n + ReLu', shape='box', 
#            style='filled', fillcolor='lightpink')
# graph.node('4', label='Convolutional layer, 2 x 2 kernel, stride 1 \n + ReLu', shape='box', 
#            style='filled', fillcolor='lightpink')
# graph.node('5', label='Convolutional layer, 2 x 2 kernel, stride 1 \n + ReLu', shape='box', 
#            style='filled', fillcolor='lightpink')
# graph.node('6', label='Average pool', shape='invhouse', 
#            style='filled', fillcolor='gray90')
# graph.node('7', label='Linear layer + ReLu', shape='box', 
#            style='filled', fillcolor='aquamarine')
# graph.node('8', label='Linear layer + Sigmoid', shape='box', 
#            style='filled', fillcolor='aquamarine')
# graph.node('9', label='Output')
# graph.edge('0','1', label='  1 x  94 x 94')
# graph.edge('1','2', label='  20 x  32 x 32')
# graph.edge('2','3', label='  40 x  31 x 31')
# graph.edge('3','4', label='  40 x  30 x 30')
# graph.edge('4','5', label='  40 x  29 x 29')
# graph.edge('5','6', label='  10 x  28 x 28')
# graph.edge('6','7', label='  10')
# graph.edge('7','8', label='  3')
# graph.edge('8','9', label='  1')
# graph.render('Part_4_Counting_images/Descartes_architecture')
# graph

![png](Part_4_Counting_images/Descartes_architecture.png 'Network architecture')
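The spatial sizes written on the edges of the diagram can be checked with the standard output-size formula for a convolution. This is only a sketch: the helper `conv_out` is a name introduced here, and reproducing the 32 x 32 output of the first layer from a 94 x 94 input requires assuming a padding of 1, which the diagram does not state.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# First layer: 3 x 3 kernel, stride 3.
# padding=1 is an assumption made to match the 32 x 32 size in the diagram.
size = conv_out(94, kernel=3, stride=3, padding=1)
sizes = [size]

# Four more layers: 2 x 2 kernel, stride 1, no padding.
for _ in range(4):
    size = conv_out(size, kernel=2)
    sizes.append(size)

print(sizes)  # [32, 31, 30, 29, 28]
```

Without padding, the same formula gives 31 for the first layer, so every later size would be one smaller than in the diagram.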

How many parameters does this network have, and how large is this number compared with the previous, shallow network?
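A rough count can be done by hand from the channel numbers and kernel sizes shown in the diagram. The sketch below assumes that every convolutional and linear layer has a bias term, which is not stated in the text; the helper names are introduced here for illustration.

```python
# (in_channels, out_channels, kernel_size), as read from the diagram
conv_layers = [
    (1, 20, 3),
    (20, 40, 2),
    (40, 40, 2),
    (40, 40, 2),
    (40, 10, 2),
]
# (in_features, out_features), as read from the diagram
linear_layers = [
    (10, 3),
    (3, 1),
]

def conv_params(c_in, c_out, k, bias=True):
    """Weights of a k x k convolution, plus one bias per output channel (assumed)."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def linear_params(n_in, n_out, bias=True):
    """Weights of a linear layer, plus one bias per output feature (assumed)."""
    return n_in * n_out + (n_out if bias else 0)

total = (sum(conv_params(*layer) for layer in conv_layers)
         + sum(linear_params(*layer) for layer in linear_layers))
print(total)  # 17967
```

The pooling layer has no parameters, so almost all of the count comes from the three 2 x 2 convolutions that have 40 channels on at least one side.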

### Preparing the data

We play a game at random; let's say the final score is $n$.
We then rewind by $i$ moves, where $i$ is a random number between $0$ and $n$, chosen with probability
$$
p_i = \frac{6}{n (n + 1) (2n + 1)} (n - i)^2
$$
This probability decreases quadratically as $i$ grows, from its maximum at $i = 0$ down to zero at $i = n$:
it gives a high chance of rewinding by only a few moves, and a much smaller chance of rewinding by many moves.
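As an illustration, the rewind distribution can be sampled with the standard library alone. This is a minimal sketch: the function names `rewind_probabilities` and `sample_rewind` are introduced here and are not part of the game implementation.

```python
import random

def rewind_probabilities(n):
    """Return [p_0, ..., p_n] with p_i = 6 (n - i)^2 / (n (n + 1) (2n + 1))."""
    norm = n * (n + 1) * (2 * n + 1) / 6  # equals the sum of k^2 for k = 0..n
    return [(n - i) ** 2 / norm for i in range(n + 1)]

def sample_rewind(n, rng=random):
    """Draw the number of moves to rewind for a game of final score n (n >= 1)."""
    # random.choices normalizes the weights itself, so unnormalized squares suffice.
    weights = [(n - i) ** 2 for i in range(n + 1)]
    return rng.choices(range(n + 1), weights=weights, k=1)[0]

probs = rewind_probabilities(20)
print(sum(probs))         # close to 1, up to floating-point rounding
print(sample_rewind(20))  # small values are the most likely
```

Note that $p_n = 0$, so with this distribution a game is never rewound all the way back to the initial grid.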

For each game obtained in this way, the label is the number of allowed moves available at that stage.
This number is already part of the game implementation, so no additional computation is needed.
This is the frequency of labels that we obtain:

![png](Part_4_Counting_images/labels_counting.png 'legend')

Computed on 20,000 games, this gives a fairly smooth distribution of labels.

Then we turn this integer number of possible moves (let's call it $N$) into a real number $y$ in the interval $[0, 1)$ using
$$
y = \frac{N}{N + 5}
$$
$y$ is zero if there are no possible moves, and it approaches 1 if there are many possible moves.

This choice gives a mean value $\mu(y) \approx 0.5$ and a standard deviation $\sigma(y) \approx 0.25$.

This is the value that the network will try to match.
From the output, we can recover the number of possible moves using the inverse relation
$$
N = \frac{5 y}{1 - y}
$$
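The two relations are easy to check numerically. In this sketch the function names and the `scale` parameter (set to the constant 5 used above) are introduced for illustration only.

```python
def moves_to_target(n_moves, scale=5):
    """Map a number of possible moves N >= 0 to y = N / (N + scale), with 0 <= y < 1."""
    return n_moves / (n_moves + scale)

def target_to_moves(y, scale=5):
    """Inverse map N = scale * y / (1 - y); only defined for y < 1."""
    return scale * y / (1 - y)

print(moves_to_target(0))  # 0.0
print(moves_to_target(5))  # 0.5
# A network output is a real number, so the recovered N is rounded to an integer.
print(round(target_to_moves(moves_to_target(12))))  # 12
```

Since $y = 0.5$ corresponds to $N = 5$, a mean of about $0.5$ means that a typical grid in the data set has around five possible moves.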

### Training the model

A deeper model *and* a regression problem require much more data for training, as well as more epochs.

The data is generated dynamically: we play new games as we train. Otherwise, it is easy to overfit the data with this deeper network and its many parameters.
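One way to organize such on-the-fly data is a generator that builds every batch from newly created samples, so that the network never sees the same grid twice. The sketch below is purely illustrative: `fresh_batches` is a hypothetical helper, and `dummy_sample` is a stand-in for "play a random game, rewind it, and return the grid with its label", not the actual game code.

```python
import itertools
import random

def fresh_batches(make_sample, batch_size):
    """Yield an endless stream of batches, each built from newly generated samples."""
    while True:
        yield [make_sample() for _ in range(batch_size)]

def dummy_sample():
    """Hypothetical stand-in for playing and rewinding a game."""
    n_moves = random.randint(0, 30)  # placeholder for the number of allowed moves
    return ("grid placeholder", n_moves / (n_moves + 5))

# Take three batches of four samples; a training step would use each batch once.
for step, batch in enumerate(itertools.islice(fresh_batches(dummy_sample, 4), 3)):
    labels = [y for _, y in batch]
    print(step, len(batch), min(labels), max(labels))
```

With this design an "epoch" is only a bookkeeping unit, since no fixed data set is ever iterated over.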

show loss functions

### Using the model to play Morpion Solitaire

Next, we play the game, this time not at random but using the output of the model.

To pick a move, we use the equivalent of temperature in statistical physics, with $\beta$ playing the role of an inverse temperature:

- if $\beta$ is very large, then choose only the best move at each step (show the output)
- if $\beta$ is close to zero, play at random

The optimal value lies somewhere in between: sufficiently random to explore many different possibilities, yet favoring the moves that give more options
$\beta \approx$
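A common way to implement such a temperature is to weight each candidate move by a Boltzmann factor $e^{\beta s}$, where $s$ is the model output for the grid reached by that move. Whether the final implementation uses exactly this form is an assumption; the function names below are introduced for illustration.

```python
import math
import random

def boltzmann_probabilities(scores, beta):
    """Probabilities proportional to exp(beta * score), one per candidate move."""
    top = max(scores)  # subtracting the maximum avoids overflow for large beta
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def choose_move(scores, beta, rng=random):
    """Pick the index of a move at random, according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(scores, beta)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

scores = [0.2, 0.5, 0.9]  # hypothetical model outputs for three candidate moves
print(boltzmann_probabilities(scores, beta=0.0))     # uniform: play at random
print(boltzmann_probabilities(scores, beta=1000.0))  # nearly all weight on the best move
print(choose_move(scores, beta=5.0))
```

The two limits reproduce the behavior described in the list above: $\beta = 0$ gives equal weights, and a very large $\beta$ concentrates the choice on the move with the highest score.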

### Looking forward

So far this problem does not need deep learning: computing the number of possible moves for a given grid is something that can easily be programmed by hand.
Already here it is nice that the network can evaluate many grids at once, which slightly improves efficiency, although the result is not quite as precise as an explicit computation.

But we are making big progress towards our goal:
all the ingredients are here! A deep neural network, a complicated distribution of labels, and a regression problem.

We are now ready to apply these methods to the actual problem.