# Neural network model creation

In this notebook, we create simple neural networks with the `deeposlandia` modules. We then explore some standard neural network architectures, for feature detection as well as for semantic segmentation.

## Introduction

Before we begin, we import some modules:

In [10]:
from keras.models import Model

In [20]:
from deeposlandia import feature_detection, semantic_segmentation

Additionally, we define some basic variables to parameterize the models:

In [None]:
IMG_SIZE = 128
NB_CHANNELS = 3
NB_LABELS = 13

## Seminal cases

First of all, we generate two basic models as an illustration: the former addresses feature detection problems, the latter semantic segmentation problems.

### Feature detection

The literature offers many neural network architectures that address the feature detection problem. Two kinds of layers are essential:
- convolutional layers, often coming with pooling layers
- fully-connected layers

The first proposed architecture uses these layers. It comes as follows:
- 1 convolutional layer (number of filters: `16`, kernel size: `7*7`), followed by a max-pooling layer (pool size: `2*2`);
- 1 convolutional layer (number of filters: `32`, kernel size: `5*5`), followed by a max-pooling layer (pool size: `2*2`);
- 1 convolutional layer (number of filters: `64`, kernel size: `3*3`), followed by a max-pooling layer (pool size: `2*2`);
- 1 fully-connected layer (depth: `512`);
- 1 fully-connected layer (depth: `nb_labels`).
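
To make the layer list above concrete, the following sketch traces the tensor shape through the network in plain Python. It assumes "same" padding (convolutions keep height and width, as the model summary further down suggests), `2*2` pooling that halves each spatial dimension, and a flattening step before the fully-connected layers. The helper name `trace_feature_detection_shapes` is hypothetical and not part of `deeposlandia`.

```python
def trace_feature_detection_shapes(image_size=128, nb_channels=3, nb_labels=13):
    """Return (layer name, output shape) pairs for the simple architecture.

    Assumptions: "same" padding for convolutions and 2*2 max-pooling,
    so only the pooling layers change the spatial size.
    """
    shapes = [("input", (image_size, image_size, nb_channels))]
    size = image_size
    nb_filters = nb_channels
    for index, nb_filters in enumerate([16, 32, 64], start=1):
        shapes.append((f"conv{index}", (size, size, nb_filters)))
        size //= 2  # 2*2 max-pooling halves height and width
        shapes.append((f"pool{index}", (size, size, nb_filters)))
    # Assumed flattening step before the fully-connected layers
    shapes.append(("flatten", (size * size * nb_filters,)))
    shapes.append(("fc1", (512,)))
    shapes.append(("fc2", (nb_labels,)))
    return shapes


for name, shape in trace_feature_detection_shapes():
    print(f"{name:8s} {shape}")
```

With the default values, the last pooling layer outputs a `16*16*64` tensor, which means 16384 values enter the first fully-connected layer.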

#### Network creation

In order to build such a network, an instance of `FeatureDetectionNetwork` is created. The output of this object is a tensor of shape `(BATCH_SIZE, NB_LABELS)`: each image in the batch is characterized by `NB_LABELS` boolean values (`1` if the label is present in the picture, `0` otherwise).

In [14]:
fdn = feature_detection.FeatureDetectionNetwork("featdet",
                                                image_size=IMG_SIZE,
                                                nb_channels=NB_CHANNELS,
                                                nb_labels=NB_LABELS)

In [15]:
fdn.Y.shape

TensorShape([Dimension(None), Dimension(13)])
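
The per-image output format described above can be illustrated with a small sketch: it encodes the set of labels present in a picture as a vector of `NB_LABELS` values, and turns a vector of scores back into label indices. Both helper names and the `0.5` threshold are assumptions made for this example; they are not part of `deeposlandia`.

```python
NB_LABELS = 13


def encode_labels(present_labels, nb_labels=NB_LABELS):
    """Build the target vector: 1 if the label is in the picture, 0 otherwise."""
    present = set(present_labels)
    return [1 if label in present else 0 for label in range(nb_labels)]


def decode_scores(scores, threshold=0.5):
    """Return the indices of the labels whose score reaches the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    return [label for label, score in enumerate(scores) if score >= threshold]


target = encode_labels([0, 4, 12])
print(target)

scores = [0.9, 0.1, 0.2, 0.05, 0.7] + [0.0] * 8
print(decode_scores(scores))  # [0, 4]
```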

#### Model building

Once the network is available, a `keras.models.Model` is instantiated with the network input and output layers. This is the object that will be fitted with a training generator, as defined in [another notebook](./2_generator-creation.ipynb).

In [23]:
fdm = Model(fdn.X, fdn.Y)

In [24]:
fdm.output_shape

(None, 13)

In [28]:
fdm.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 128, 128, 3)       0         
_________________________________________________________________
conv1_conv (Conv2D)          (None, 128, 128, 16)      2368      
_________________________________________________________________
conv1_bn (BatchNormalization (None, 128, 128, 16)      64        
_________________________________________________________________
conv1_activation (Activation (None, 128, 128, 16)      0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 64, 64, 16)        0         
_________________________________________________________________
conv2_conv (Conv2D)          (None, 64, 64, 32)        12832     
_________________________________________________________________
conv2_bn (BatchNormalization (None, 64, 64, 32)        128       
__________

As a remark, we take advantage of the [Keras functional API](https://keras.io/models/model/) here. The `summary` method provides the number of parameters in the model, as well as the model architecture. Here we have a model with more than 8.4 million parameters (mostly due to the first fully-connected layer).
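
The weight of the first fully-connected layer can be checked by hand. The sketch below counts the convolutional and fully-connected parameters of the simple architecture, assuming "same" padding, a flattening step before the first fully-connected layer, and one bias per filter or unit. Batch-normalization parameters are left out, so the total is an approximation of the figure reported by `summary`, not a reproduction of it.

```python
def conv2d_params(kernel_size, in_channels, nb_filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * nb_filters


def dense_params(nb_inputs, nb_units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return (nb_inputs + 1) * nb_units


IMG_SIZE, NB_CHANNELS, NB_LABELS = 128, 3, 13
conv_layers = [(7, 16), (5, 32), (3, 64)]  # (kernel size, number of filters)

counts = {}
in_channels, size = NB_CHANNELS, IMG_SIZE
for index, (kernel_size, nb_filters) in enumerate(conv_layers, start=1):
    counts[f"conv{index}"] = conv2d_params(kernel_size, in_channels, nb_filters)
    in_channels = nb_filters
    size //= 2  # each convolution is followed by a 2*2 max-pooling

flat_size = size * size * in_channels  # assumed flattening: 16 * 16 * 64
counts["fc1"] = dense_params(flat_size, 512)
counts["fc2"] = dense_params(512, NB_LABELS)

total = sum(counts.values())
for name, count in counts.items():
    print(f"{name:6s} {count:>10,d}  ({count / total:.2%})")
print(f"total  {total:>10,d}")
```

The two first convolution counts (2368 and 12832) match the `summary` output above, and the first fully-connected layer alone accounts for more than 99% of the counted parameters.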

### Semantic segmentation

In the case of semantic segmentation, the point is to provide an output image with the same size as the input image. Fully-connected layers are far less important here, in contrast to convolutional layers. The key is the design of a mechanism that handles the decoding process, just as standard convolutional layers (or pooling layers) handle the encoding process.

The basic network designed here for semantic segmentation uses ["transposed" convolution layers](http://www.matthewzeiler.com/wp-content/uploads/2017/07/cvpr2010.pdf). It is organized as follows:
- 1 convolutional layer (number of filters: `32`, kernel size: `3*3`), followed by a max-pooling layer (pool size: `2*2`);
- 1 convolutional layer (number of filters: `64`, kernel size: `3*3`), followed by a max-pooling layer (pool size: `2*2`);
- 1 convolutional layer (number of filters: `128`, kernel size: `3*3`), followed by a max-pooling layer (pool size: `2*2`);
- 1 transposed convolution layer (number of filters: `128`, strides: `2`, kernel size: `3*3`);
- 1 transposed convolution layer (number of filters: `64`, strides: `2`, kernel size: `3*3`);
- 1 transposed convolution layer (number of filters: `32`, strides: `2`, kernel size: `3*3`).
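
The encoding and decoding steps listed above mirror each other: each pooling layer halves the spatial size, and each transposed convolution with strides `2` doubles it. The sketch below traces that size through the network, assuming "same" padding so that a transposed convolution outputs exactly `input size * strides`. The function name is hypothetical.

```python
def trace_spatial_size(image_size=128, nb_poolings=3, nb_transposed=3, strides=2):
    """Return the height (= width) of the feature maps after each step."""
    sizes = [image_size]
    size = image_size
    for _ in range(nb_poolings):
        size //= 2  # encoding: 2*2 max-pooling
        sizes.append(size)
    for _ in range(nb_transposed):
        size *= strides  # decoding: transposed convolution, "same" padding assumed
        sizes.append(size)
    return sizes


print(trace_spatial_size())  # [128, 64, 32, 16, 32, 64, 128]
```

With three poolings and three transposed convolutions, the output recovers the input image size.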

#### Network creation

As previously, a network is created by instantiating the appropriate object:

In [29]:
ssn = semantic_segmentation.SemanticSegmentationNetwork("semseg",
                                                        image_size=IMG_SIZE,
                                                        nb_channels=NB_CHANNELS,
                                                        nb_labels=NB_LABELS)

In [30]:
ssn.Y.shape

TensorShape([Dimension(None), Dimension(None), Dimension(None), Dimension(13)])

Here the output layer shape is more complex, as one expects a classification answer for each pixel (the first three dimensions correspond to the batch, the image height and the image width). The classification information is located in the last dimension: an array of `NB_LABELS` boolean values indicating which label each pixel corresponds to.
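
As an illustration of this per-pixel encoding, the sketch below converts a small grid of label ids into the `height * width * nb_labels` layout and back. It assumes that each pixel belongs to exactly one label; the helper names are hypothetical and the nested lists stand in for real tensors.

```python
def one_hot_mask(label_mask, nb_labels):
    """Turn a height*width grid of label ids into a height*width*nb_labels grid."""
    return [
        [[1 if label == pixel else 0 for label in range(nb_labels)] for pixel in row]
        for row in label_mask
    ]


def label_mask_from_scores(scores):
    """Pick, for each pixel, the index of the highest value in the last dimension."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


mask = [[0, 2],
        [1, 2]]
encoded = one_hot_mask(mask, nb_labels=3)
print(encoded)
print(label_mask_from_scores(encoded) == mask)  # True
```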

#### Model building

As previously, a Keras model is built starting from the network architecture:

In [31]:
ssm = Model(ssn.X, ssn.Y)

In [32]:
ssm.output_shape

(None, 128, 128, 13)

In [33]:
ssm.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 128, 128, 3)       0         
_________________________________________________________________
conv1_conv (Conv2D)          (None, 128, 128, 32)      896       
_________________________________________________________________
conv1_bn (BatchNormalization (None, 128, 128, 32)      128       
_________________________________________________________________
conv1_activation (Activation (None, 128, 128, 32)      0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 64, 64, 32)        0         
_________________________________________________________________
conv2_conv (Conv2D)          (None, 64, 64, 64)        18496     
_________________________________________________________________
conv2_bn (BatchNormalization (None, 64, 64, 64)        256       
__________

The number of parameters here is much smaller (~330k), as there is no fully-connected layer in the network.
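
This order of magnitude can be recovered from the layer list of the basic semantic segmentation network. The sketch below assumes that a transposed convolution holds as many parameters as a standard convolution with the same kernel size, input depth and number of filters, and it ignores batch normalization and the final output layer, whose exact form is not described here.

```python
def conv_params(kernel_size, in_channels, nb_filters):
    """Kernel weights plus one bias per filter (standard or transposed)."""
    return (kernel_size * kernel_size * in_channels + 1) * nb_filters


NB_CHANNELS = 3
# (layer name, number of filters); every kernel is 3*3
layers = [
    ("conv1", 32), ("conv2", 64), ("conv3", 128),
    ("deconv1", 128), ("deconv2", 64), ("deconv3", 32),
]

counts = {}
in_channels = NB_CHANNELS
for name, nb_filters in layers:
    counts[name] = conv_params(3, in_channels, nb_filters)
    in_channels = nb_filters  # pooling does not change the depth

total = sum(counts.values())
for name, count in counts.items():
    print(f"{name:8s} {count:>8,d}")
print(f"total    {total:>8,d}")
```

The counted layers add up to 333,088 parameters, which is consistent with the ~330k figure.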

## Some more complex cases

The following sections are dedicated to alternative architectures that are considered in this framework.

For feature detection, we get:
- [VGG](https://arxiv.org/pdf/1409.1556.pdf)
- [Inception](https://arxiv.org/pdf/1512.00567v3.pdf)

For semantic segmentation, we get:
- [Unet](https://arxiv.org/pdf/1505.04597.pdf)
- [Dilated network](https://arxiv.org/abs/1511.07122)

### VGG network

In [37]:
vgg = feature_detection.FeatureDetectionNetwork("vgg",
                                                image_size=IMG_SIZE,
                                                nb_channels=NB_CHANNELS,
                                                nb_labels=NB_LABELS,
                                                architecture="vgg16")
vgg_model = Model(vgg.X, vgg.Y)
vgg_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 128, 128, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 128, 128, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 128, 128, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 64, 64, 64)        0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 64, 64, 128)       73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 64, 64, 128)       147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 32, 32, 128)       0         
__________

This architecture is one of the best-performing for feature detection; however, this comes at the cost of a high number of parameters. In this small example, we get more than 24 million parameters (~3 times more than the simple feature detection model).
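
Part of this parameter count can be derived from the layer layout. The sketch below counts the parameters of the convolutional blocks, assuming that `architecture="vgg16"` follows the standard VGG16 configuration from the paper (five blocks of `3*3` convolutions); this is an assumption about `deeposlandia`, although the first four values do match the `summary` output above.

```python
# (number of convolutions, number of filters) per block, standard VGG16 layout
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_conv_params(nb_channels=3, kernel_size=3):
    """Return the parameter count of each convolutional layer."""
    counts = {}
    in_channels = nb_channels
    for block, (nb_convs, nb_filters) in enumerate(VGG16_BLOCKS, start=1):
        for conv in range(1, nb_convs + 1):
            name = f"block{block}_conv{conv}"
            counts[name] = (kernel_size ** 2 * in_channels + 1) * nb_filters
            in_channels = nb_filters
    return counts


counts = vgg16_conv_params()
for name, count in counts.items():
    print(f"{name:14s} {count:>10,d}")
print(f"{'total':14s} {sum(counts.values()):>10,d}")
```

Under this assumption the convolutional part holds about 14.7 million parameters, so the rest of the total would come from the final fully-connected layers.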

### Inception network

In [38]:
inc = feature_detection.FeatureDetectionNetwork("inc",
                                                image_size=IMG_SIZE,
                                                nb_channels=NB_CHANNELS,
                                                nb_labels=NB_LABELS,
                                                architecture="inception")
inc_model = Model(inc.X, inc.Y)
inc_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 128, 128, 3)       0         
_________________________________________________________________
conv1_conv (Conv2D)          (None, 128, 128, 16)      2368      
_________________________________________________________________
conv1_bn (BatchNormalization (None, 128, 128, 16)      64        
_________________________________________________________________
conv1_activation (Activation (None, 128, 128, 16)      0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 64, 64, 16)        0         
_________________________________________________________________
conv2_conv (Conv2D)          (None, 64, 64, 32)        12832     
_________________________________________________________________
conv2_bn (BatchNormalization (None, 64, 64, 32)        128       
__________

Its advantage is that it can be less resource-consuming than the VGG network, even though recent developments have led to bigger Inception networks. This architecture is known as the best alternative in the feature detection state of the art.

### Unet network

In [39]:
unet = semantic_segmentation.SemanticSegmentationNetwork("unet",
                                                         image_size=IMG_SIZE,
                                                         nb_channels=NB_CHANNELS,
                                                         nb_labels=NB_LABELS,
                                                         architecture="unet")
unet_model = Model(unet.X, unet.Y)
unet_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 128, 128, 3)       0         
_________________________________________________________________
conv1_conv (Conv2D)          (None, 128, 128, 32)      896       
_________________________________________________________________
conv1_bn (BatchNormalization (None, 128, 128, 32)      128       
_________________________________________________________________
conv1_activation (Activation (None, 128, 128, 32)      0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 64, 64, 32)        0         
_________________________________________________________________
conv2_conv (Conv2D)          (None, 64, 64, 64)        18496     
_________________________________________________________________
conv2_bn (BatchNormalization (None, 64, 64, 64)        256       
__________

Unet networks are similar to the simple network, in the sense that they are composed of an encoding part and a decoding part. However, a major difference exists: encoding layers are linked to decoding layers at comparable steps. The decoding side is thus composed of concatenations between encoding layer outputs and transposed convolutions of previous layers.
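
The effect of these links on tensor shapes can be sketched in a few lines. The example below records the output of each encoding level, then walks back up the decoder, where each transposed convolution result is concatenated with the encoder output of the same spatial size. The filter counts `(32, 64, 128)` are hypothetical values chosen for illustration, not those of the `deeposlandia` implementation.

```python
def trace_unet(image_size=128, encoder_filters=(32, 64, 128)):
    """Trace spatial sizes and depths through a schematic Unet."""
    encoder = []
    size = image_size
    for nb_filters in encoder_filters:
        encoder.append((size, nb_filters))  # kept for the skip connection
        size //= 2  # 2*2 max-pooling

    decoder = []
    for skip_size, skip_filters in reversed(encoder):
        size *= 2  # transposed convolution with strides 2
        assert size == skip_size  # both tensors must share height and width
        decoder.append({
            "size": size,
            "upsampled": skip_filters,  # depth after the transposed convolution
            "skip": skip_filters,       # depth of the linked encoder output
            "concatenated": 2 * skip_filters,
        })
    return encoder, decoder


encoder, decoder = trace_unet()
for step in decoder:
    print(step)
```

The concatenation doubles the depth at each decoding step, which is why Unet decoders usually apply further convolutions after it.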

### Dilated network

In [40]:
dil = semantic_segmentation.SemanticSegmentationNetwork("dilated",
                                                        image_size=IMG_SIZE,
                                                        nb_channels=NB_CHANNELS,
                                                        nb_labels=NB_LABELS,
                                                        architecture="dilated")
dil_model = Model(dil.X, dil.Y)
dil_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 128, 128, 3)       0         
_________________________________________________________________
conv1_conv (Conv2D)          (None, 128, 128, 32)      896       
_________________________________________________________________
conv1_bn (BatchNormalization (None, 128, 128, 32)      128       
_________________________________________________________________
conv1_activation (Activation (None, 128, 128, 32)      0         
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 64, 64, 32)        0         
_________________________________________________________________
conv2_conv (Conv2D)          (None, 64, 64, 64)        18496     
_________________________________________________________________
conv2_bn (BatchNormalization (None, 64, 64, 64)        256       
__________

The last network makes extensive use of dilated convolutions. It is composed of three main phases:
- an encoding step largely inspired by the VGG network (without the final fully-connected layers);
- a context exploration step, which consists of a series of convolutional layers with different dilation rates, in order to look at pixels more or less close to the pixel of interest;
- a decoding step composed of classic transposed convolution layers, to go back to the input image size (this phase is absent from the original paper, and has been added here).
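
To illustrate how the dilation rate controls the distance between the pixels a kernel looks at, here is a minimal one-dimensional sketch in plain Python. It is a toy example without padding, not the layer used by `deeposlandia`; the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Span covered by a kernel whose taps are dilation_rate positions apart."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


def dilated_conv1d(signal, kernel, dilation_rate=1):
    """1-D dilated convolution without padding (cross-correlation form)."""
    span = effective_kernel_size(len(kernel), dilation_rate)
    return [
        sum(weight * signal[start + tap * dilation_rate]
            for tap, weight in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

print(effective_kernel_size(3, 1))        # 3
print(effective_kernel_size(3, 2))        # 5
print(dilated_conv1d(signal, kernel, 1))  # [6, 9, 12, 15, 18]
print(dilated_conv1d(signal, kernel, 2))  # [9, 12, 15]
```

With a dilation rate of `2`, the same three weights cover five positions, so the layer sees a wider context without any additional parameter.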