## 7.3 Getting the most out of your models

### 7.3.1 Advanced architecture patterns

Residual connections, one important design pattern, were covered earlier. Two more design patterns are worth knowing: normalization and depthwise separable convolution. They are especially relevant when building high-performance convnets, but they are not exclusive to them.

#### Batch normalization

Batch normalization applies the same idea as normalizing data before feeding it to a model, but it does so inside the network.

Normalized data has zero mean and unit standard deviation, but that is not necessarily still the case after the data passes through a network layer. The `BatchNormalization` [(ref.)](https://arxiv.org/abs/1502.03167) layer in Keras corrects for this drift by adaptively normalizing the data as its mean and variance change during training. Internally, it maintains an exponential moving average of the batch-wise mean and variance of the data seen during training.
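
The mechanism described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras implementation: the class name `SimpleBatchNorm` is hypothetical, the learned scale and offset parameters of the real layer are left out, and the `momentum` and `epsilon` values are assumptions chosen to mirror the Keras defaults.

```python
import numpy as np


class SimpleBatchNorm:
    """Toy batch normalization over the last axis (no learned scale/offset)."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        # Running statistics, updated only during training.
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Normalize with the statistics of the current batch.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Exponential moving average of the batch-wise statistics.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # At inference time, fall back on the running statistics.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)


rng = np.random.default_rng(0)
# Activations that have drifted away from zero mean and unit variance.
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

bn = SimpleBatchNorm(num_features=4)
normalized = bn(batch, training=True)
```

After the call, each column of `normalized` has a mean close to 0 and a standard deviation close to 1, while `bn.moving_mean` and `bn.moving_var` have moved slightly toward the batch statistics.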

Batch normalization helps with gradient propagation, which makes deeper networks possible. It is included in many advanced convnet architectures that come with Keras, such as ResNet50, Inception V3, and Xception.

This layer is usually applied after a convolutional or densely connected layer. 

```python
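from keras import layers
from keras.models import Sequential

# After a convolutional layer
conv_model = Sequential()
conv_model.add(layers.Conv2D(32, 3, activation='relu'))
conv_model.add(layers.BatchNormalization())

# After a densely connected layer
dense_model = Sequential()
dense_model.add(layers.Dense(32, activation='relu'))
dense_model.add(layers.BatchNormalization())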
```
The `BatchNormalization` layer takes an `axis` argument, which specifies the feature axis that should be normalized. It defaults to `-1`, the last axis, which is the right value when `data_format` is `'channels_last'`. With `data_format` set to `'channels_first'`, the channels sit on axis 1, so `axis` must be set to `1`.
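
To illustrate what the channel axis means, the NumPy sketch below computes per-channel means for the same data stored in both layouts. The helper `per_channel_mean` is a hypothetical name introduced for this example, and the tensor shapes are made up; it only shows which axes are reduced, not how Keras implements the layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 8 feature maps, 16x16 pixels, 3 channels (made-up shapes).
channels_last = rng.normal(size=(8, 16, 16, 3))              # (batch, height, width, channels)
channels_first = np.transpose(channels_last, (0, 3, 1, 2))   # (batch, channels, height, width)


def per_channel_mean(x, axis):
    """Mean of each channel: reduce over every axis except the channel axis."""
    axis = axis % x.ndim  # turn -1 into a positive index
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    return x.mean(axis=reduce_axes)


mean_last = per_channel_mean(channels_last, axis=-1)    # channels on the last axis
mean_first = per_channel_mean(channels_first, axis=1)   # channels on axis 1
```

Both calls return one value per channel, and the values agree, because the two arrays hold the same data with the channel axis in different positions. Passing the wrong axis would compute statistics per row or per column instead of per channel.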

#### New developments: 
- [Batch renormalization](https://arxiv.org/abs/1702.03275)
- [Self-normalizing neural networks](https://arxiv.org/abs/1706.02515)

#### Depthwise separable convolution

Keras also provides a depthwise separable convolution layer, `SeparableConv2D`. This layer performs a spatial convolution on each channel of its input, independently, before mixing the output channels via a pointwise (1 × 1) convolution. This is equivalent to separating the learning of spatial features from the learning of channel-wise features, which makes sense if you assume that spatial locations in the input are highly correlated but different channels are fairly independent.

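It requires significantly fewer parameters and fewer computations than a regular convolution, so it tends to give smaller, faster models that often perform better as well. This is especially important when training small models from scratch on limited data.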
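
As a rough illustration of the difference, the sketch below counts the weights of a regular convolution and of a depthwise separable one for the same layer shape. The function names are hypothetical, bias terms are ignored, and the separable layer is assumed to use one depthwise filter per input channel.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a regular 2D convolution (bias terms ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable 2D convolution (bias terms ignored)."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution mixing the channels
    return depthwise + pointwise


regular = conv2d_params(3, 64, 128)              # 3 * 3 * 64 * 128 = 73728
separable = separable_conv2d_params(3, 64, 128)  # 3 * 3 * 64 + 64 * 128 = 8768
```

For a 3 × 3 kernel mapping 64 channels to 128, the separable version needs roughly eight times fewer weights under these assumptions.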

Let's see an implementation. It is a lightweight, depthwise separable convnet for an image-classification task on a small dataset:

```python
from keras.models import Sequential
from keras import layers

height = 64
width = 64
channels = 3
num_classes = 10

model = Sequential()
model.add(layers.SeparableConv2D(32, 3, 
                                 activation='relu',
                                 input_shape=(height, width, channels, )))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.GlobalAveragePooling2D())

model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(num_classes, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```

These convolutions are the basis of the Xception architecture, a high-performance convnet that comes with Keras.

### 7.3.2 Hyperparameter optimization

The process of optimizing hyperparameters typically looks like this:
1. Choose a set of hyperparameters (automatically)
2. Build the corresponding model
3. Fit it and measure the final performance on validation data
4. Choose the next set of hyperparameters (automatically)
5. Repeat
6. Eventually, measure performance on the test data

There are many algorithms that use the history of validation performance to choose the next set of hyperparameters. Some techniques are: Bayesian optimization, genetic algorithms, simple random search, etc.
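
The loop above, with simple random search as the selection strategy, can be sketched as follows. Everything in this example is an assumption made for illustration: the search space is made up, and `build_and_evaluate` is a hypothetical stand-in that returns a toy score instead of training a real model.

```python
import random


def sample_hyperparameters(rng):
    """Draw one candidate configuration (the search space here is made up)."""
    return {
        "num_layers": rng.choice([1, 2, 3, 4]),
        "units": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }


def build_and_evaluate(hparams):
    """Hypothetical stand-in for: build the model, fit it, return validation loss.

    A real version would train a Keras model; this toy score only keeps the
    example runnable.
    """
    return (
        abs(hparams["num_layers"] - 2)
        + abs(hparams["units"] - 64) / 64
        + abs(hparams["dropout"] - 0.3)
    )


def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    history = []  # (validation loss, hyperparameters) for every trial
    for _ in range(num_trials):
        hparams = sample_hyperparameters(rng)
        val_loss = build_and_evaluate(hparams)
        history.append((val_loss, hparams))
    # Keep the configuration with the lowest validation loss.
    return min(history, key=lambda trial: trial[0]), history


(best_loss, best_hparams), history = random_search(num_trials=50)
```

Random search ignores `history` when picking the next candidate. A smarter strategy such as Bayesian optimization would replace `sample_hyperparameters` with a function that uses the history to propose promising configurations.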

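Updating hyperparameters is extremely challenging, and not as straightforward as training model parameters with the backpropagation algorithm:

- It is expensive: computing the feedback signal means building and training a new model from scratch for every set of hyperparameters.
- The hyperparameter space is typically made of discrete decisions, so it is neither continuous nor differentiable, and gradient descent cannot be applied.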

In practice, random search often turns out to be the best solution, despite being the most naive one. This field is still very young, but some libraries for hyperparameter optimization already exist, such as [Hyperopt](https://github.com/hyperopt/hyperopt) and [Hyperas](https://github.com/maxpumperla/hyperas), which integrates Hyperopt with Keras models.

### 7.3.3 Model ensembling

Ensembling consists of pooling together the predictions of a set of different models, to produce better predictions. 

It relies on the assumption that different good models trained independently are likely to be good for different reasons: each model looks at slightly different aspects of the data and so captures a different part of the "truth".

You can average the predictions of several classifiers:

```python
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)
final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)
```
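This works only if the classifiers are more or less equally good. If one of them is significantly worse than the others, the averaged predictions may be worse than those of the best single classifier.

A smarter way is to weight the classifiers according to their performance on the validation data, giving the better classifiers a higher weight. To search for a good set of ensembling weights, one can use random search or a simple optimization algorithm such as Nelder-Mead: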

```python
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)
# The weights (0.5, 0.25, 0.1, 0.15) are assumed to have been found empirically.
final_preds = 0.5 * preds_a + 0.25 * preds_b + 0.1 * preds_c + 0.15 * preds_d
```
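
One way to find such weights is a plain random search over weight vectors, sketched below with NumPy. The labels and the four sets of predictions are synthetic stand-ins generated for this example, and mean squared error is an assumed choice of loss. With real models, `preds` would hold their predictions on the validation data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for validation labels and the predictions of four
# binary classifiers with increasing amounts of noise (all made up).
y_val = rng.integers(0, 2, size=500).astype(float)
noise_levels = [0.2, 0.3, 0.5, 0.8]
preds = np.stack([
    np.clip(y_val + rng.normal(scale=s, size=y_val.shape), 0.0, 1.0)
    for s in noise_levels
])  # shape: (num_models, num_samples)


def ensemble_loss(weights):
    """Mean squared error of the weighted average of the predictions."""
    final_preds = weights @ preds
    return float(np.mean((final_preds - y_val) ** 2))


# Start from the uniform average, then try random weight vectors.
best_weights = np.full(len(preds), 1.0 / len(preds))
best_loss = ensemble_loss(best_weights)
uniform_loss = best_loss

for _ in range(2000):
    candidate = rng.dirichlet(np.ones(len(preds)))  # non-negative, sums to 1
    loss = ensemble_loss(candidate)
    if loss < best_loss:
        best_weights, best_loss = candidate, loss
```

Because the search starts from the uniform average, `best_loss` can never be worse than `uniform_loss`. The weights should be tuned on validation data only, so that the test data still gives an unbiased estimate of the ensemble's performance.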

You should ensemble models that are *as good as possible* while being *as different as possible*. This means using very different architectures or even different brands of machine-learning approaches. 


