# Optimization and Regularization in Keras

### Goals:
- Explore optimization and regularization in `Keras`


In [40]:
%matplotlib inline 
# display figures in the notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

In [41]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

data = np.asarray(digits.data, dtype='float32')
target = np.asarray(digits.target, dtype='int32')

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.15, random_state=37)

# mean = 0 ; standard deviation = 1.0
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# print(scaler.mean_)
# print(scaler.scale_)

In [42]:
import keras
from keras.utils import to_categorical

Y_train = to_categorical(y_train)

In [43]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras import optimizers

N = X_train.shape[1]
H = 100
K = 10

model = Sequential()
model.add(Dense(H, input_dim=N))
model.add(Activation("relu"))
model.add(Dense(H))
model.add(Activation("relu"))
model.add(Dense(K))
model.add(Activation("softmax"))


### Stochastic Gradient Descent
Stochastic gradient descent (SGD) is the basic optimization method. The Keras implementation also exposes a few add-ons, such as momentum and Nesterov momentum.

Explore the available options with:
`optimizers.SGD?`
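
To make the momentum options concrete, here is a minimal NumPy sketch of a single SGD update with (optionally Nesterov) momentum. The helper name `sgd_momentum_step` and the toy quadratic objective are illustrative assumptions, not part of Keras; in this notebook's Keras version the corresponding optimizer would be created with something like `optimizers.SGD(lr=0.1, momentum=0.9, nesterov=True)`.

```python
import numpy as np


def sgd_momentum_step(w, velocity, grad, lr=0.1, momentum=0.9, nesterov=False):
    """One SGD update with momentum (hypothetical helper, not a Keras API).

    Returns the updated weights and the updated velocity.
    """
    # The velocity accumulates an exponentially decaying sum of past gradients.
    velocity = momentum * velocity - lr * grad
    if nesterov:
        # Nesterov variant: look ahead along the freshly updated velocity.
        w = w + momentum * velocity - lr * grad
    else:
        w = w + velocity
    return w, velocity


# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
velocity = np.zeros_like(w)
for _ in range(200):
    w, velocity = sgd_momentum_step(w, velocity, grad=w)

print(w)  # should be very close to the minimum at [0, 0]
```

With `momentum=0.0` the update reduces to plain gradient descent, `w - lr * grad`.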


In [44]:
# optimizers.SGD?

In [45]:
sgd = optimizers.SGD(lr=0.1)

In [46]:
model.compile(optimizer=sgd,
              loss='categorical_crossentropy', metrics=['accuracy'])

In [47]:
model.fit(X_train, Y_train,  epochs=15, batch_size=32);

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


Keras provides other optimization algorithms as well. Explore them in the online documentation:

- Replace the SGD optimizer with the Adam optimizer from Keras and run it
  with the default parameters.

- Add another hidden layer and use the "Rectified Linear Unit" for each
  hidden layer. Can you still train the model with Adam with its default global
  learning rate?

- Bonus: try the Adadelta optimizer (no learning rate to set).

Hint: use `optimizers.<TAB>` to tab-complete the list of implemented optimizers in Keras.
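
As background for the Adam exercise, the following NumPy sketch shows one Adam update on a toy gradient. The helper `adam_step` is hypothetical, and the default values (`lr=0.001`, `beta_1=0.9`, `beta_2=0.999`, `eps=1e-7`) are assumed to match the Keras defaults; check `optimizers.Adam?` for your installed version.

```python
import numpy as np


def adam_step(w, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update at step t (t starts at 1). Hypothetical helper."""
    # Exponential moving averages of the gradient and of its square.
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

# Two gradient components that differ by six orders of magnitude.
grad = np.array([1000.0, -0.001])
w_new, m, v = adam_step(w, m, v, grad, t=1)

print(w - w_new)  # per-coordinate step sizes
```

In this sketch both coordinates move by roughly `lr` on the first step despite the very different gradient magnitudes. This per-parameter rescaling is one reason why a single global learning rate often works reasonably well with Adam.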

### Exercise: forward pass and generalization

- Compute predictions on the test set using `model.predict_classes(...)`
- Evaluate the model using `model.evaluate`
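
As a hint for this exercise, a class prediction is simply the index of the largest predicted probability. The sketch below uses made-up softmax outputs (the arrays `probs` and `y_true` are illustrative, not results from the model). In the notebook, `np.argmax(model.predict(X_test), axis=1)` should give the same labels as `predict_classes`; note also that `model.evaluate` expects one-hot targets here, e.g. `to_categorical(y_test)`.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes (made-up numbers).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
y_true = np.array([0, 1, 2, 1])

# Predicted class = index of the highest probability in each row.
y_pred = np.argmax(probs, axis=1)

# Accuracy = fraction of samples whose predicted class matches the label.
accuracy = np.mean(y_pred == y_true)

print(y_pred)    # [0 1 2 0]
print(accuracy)  # 0.75
```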

### Exercise: impact of initialization

Let us now study the impact of a bad initialization when training
a deep feed forward network.

By default Keras dense layers use the "Glorot Uniform" initialization
strategy to initialize the weight matrices:

- each weight coefficient is randomly sampled from [-scale, scale]
- scale is proportional to $\frac{1}{\sqrt{n_{in} + n_{out}}}$

This strategy is known to work well to initialize deep neural networks
with "tanh" or "relu" activation functions and then trained with
standard SGD.
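
The sampling rule can be sketched in NumPy as follows. The function `glorot_uniform` below is an illustrative re-implementation, assuming the usual limit $\sqrt{6 / (n_{in} + n_{out})}$, which gives the weights a variance of $2 / (n_{in} + n_{out})$; it is not the Keras code itself.

```python
import numpy as np


def glorot_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix uniformly from [-limit, limit]."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))


rng = np.random.RandomState(0)

# Same shape as the first Dense layer above: 64 inputs, 100 hidden units.
W = glorot_uniform(64, 100, rng)

print(W.shape)
print(np.abs(W).max(), np.sqrt(6.0 / (64 + 100)))  # samples stay within the limit
print(W.var(), 2.0 / (64 + 100))                   # empirical vs. theoretical variance
```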

To assess the impact of initialization, let us plug an alternative init
scheme into a network with 2 hidden layers and "tanh" activations.
For the sake of the example, let's use normally distributed weights
with a manually adjustable scale (standard deviation) and see the
impact of the scale value:

In [48]:
from keras import initializers

normal_init = initializers.RandomNormal(stddev=0.01)

model = Sequential()
model.add(Dense(H, input_dim=N, kernel_initializer=normal_init))
model.add(Activation("tanh"))
model.add(Dense(K, kernel_initializer=normal_init))
model.add(Activation("tanh"))
model.add(Dense(K, kernel_initializer=normal_init))
model.add(Activation("softmax"))

model.compile(optimizer=optimizers.SGD(lr=0.1),
              loss='categorical_crossentropy')

history = model.fit(X_train, Y_train,
                    epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Questions:

- Try the following initialization schemes and see whether
  the SGD algorithm can successfully train the network or
  not:
  
  - a very small scale, e.g. `stddev=1e-3`
  - a larger scale, e.g. `stddev=1` or `10`
  - initialize all weights to 0 (constant initialization)
  
- What do you observe? Can you find an explanation for those
  outcomes?

- Are better solvers such as SGD with momentum or Adam able
  to deal better with such bad initializations?
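
One way to start investigating these questions is to look at the forward pass alone. The NumPy sketch below pushes random standardized inputs through two "tanh" layers for several values of the standard deviation and reports how large the activations are. The random input `X` is only a stand-in for the digits data, and the helper `hidden_activations` is hypothetical. This only illustrates the activation magnitudes; it does not by itself show what happens to the gradients during training.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for standardized inputs: 256 samples with 64 features.
X = rng.randn(256, 64)


def hidden_activations(stddev, n_layers=2, width=100):
    """Forward pass through tanh layers with N(0, stddev^2) weights, no biases."""
    h = X
    for _ in range(n_layers):
        W = rng.normal(0.0, stddev, size=(h.shape[1], width))
        h = np.tanh(h @ W)
    return h


for stddev in [1e-3, 1e-1, 1.0, 10.0]:
    h = hidden_activations(stddev)
    # Units with |tanh| > 0.99 sit in the flat part of the activation.
    saturated = np.mean(np.abs(h) > 0.99)
    print("stddev=%g  mean|h|=%.5f  saturated=%.2f"
          % (stddev, np.abs(h).mean(), saturated))
```

Tiny scales shrink the activations towards 0 layer after layer, while large scales push most units into the saturated region of "tanh", where its derivative is close to 0.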

### Regularization
Keras implements several forms of regularization.
Most of them are implemented as layers: this is the case for dropout, noise injection and batch normalization.

One of the most used techniques in Deep Learning is Dropout. Dropout is implemented in Keras as an extra layer, which can be added after a normal layer, and works on its output (or on the input of the next layer).



In [49]:
from keras.layers import Dropout

Dropout?

```python
keras.layers.core.Dropout(rate, noise_shape=None, seed=None)
```

Applies Dropout to the input.

Dropout consists of randomly setting a fraction `rate` of the input units to 0 at each update during training time, which helps prevent overfitting.

Arguments

* rate: float between 0 and 1. Fraction of the input units to drop.
* noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape  (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features).
* seed: A Python integer to use as random seed.

**Note**: Keras automatically guarantees that this layer is **not** active in the **inference** (i.e. prediction) phase;
it is only applied during **training**, as it should be.

See the `keras.backend.in_train_phase` function.
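
The training/inference distinction can be illustrated with a small NumPy sketch of "inverted" dropout, in which the surviving units are rescaled by `1 / (1 - rate)` during training so that the expected activation is unchanged. The function `dropout` below is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        # Inference phase: the layer is the identity.
        return x
    keep_prob = 1.0 - rate
    # Binary mask: each unit is kept independently with probability keep_prob.
    mask = rng.uniform(size=x.shape) < keep_prob
    return x * mask / keep_prob


rng = np.random.RandomState(0)
x = np.ones((1000, 100))

train_out = dropout(x, rate=0.2, training=True, rng=rng)
test_out = dropout(x, rate=0.2, training=False, rng=rng)

print((train_out == 0).mean())      # roughly 0.2 of the units are dropped
print(train_out.mean())             # roughly 1.0 thanks to the rescaling
print(np.array_equal(test_out, x))  # True: nothing is dropped at inference
```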

### Exercise: dropout
Add dropout layers to the previous model (defining a new model). Use a dropout rate of 0.2, or explore some alternatives.

### Other regularization and normalization in Keras
Among the most used regularization layers, we have:
- `keras.layers.GaussianNoise(stddev)`, which applies additive zero-centered Gaussian noise to its input.
- `keras.layers.BatchNormalization`, which implements Batch Normalization. Check its options in the [Keras web page](https://keras.io/layers/normalization/)


There are also other regularizations that can be useful. Layers that have weights, such as the `Dense` layer, have options to add L1 or L2 penalties on the weights (`kernel_regularizer`) or on the activations (`activity_regularizer`). Possible values are the objects `keras.regularizers.l1(alpha)`, `keras.regularizers.l2(alpha)` and `keras.regularizers.l1_l2(l1=alpha1, l2=alpha2)`. You can also implement your own.


Gradient clipping can be set directly on the optimizer, using the options `clipnorm` (rescale a gradient whose norm exceeds the threshold) and `clipvalue` (clip each gradient component to a fixed range).
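
What the weight penalties and the two clipping modes compute can be sketched in plain NumPy. All function names below are hypothetical helpers for illustration, and the vector `w` is a made-up example. In Keras the penalty is added to the training loss, and clipping is applied to the gradients before the weight update.

```python
import numpy as np


def l1_penalty(w, alpha):
    """L1 penalty: alpha times the sum of absolute weights."""
    return alpha * np.sum(np.abs(w))


def l2_penalty(w, alpha):
    """L2 penalty: alpha times the sum of squared weights."""
    return alpha * np.sum(w ** 2)


def clip_by_norm(grad, clipnorm):
    """Rescale the gradient if its L2 norm exceeds clipnorm."""
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad


def clip_by_value(grad, clipvalue):
    """Clip each gradient component to [-clipvalue, clipvalue]."""
    return np.clip(grad, -clipvalue, clipvalue)


w = np.array([3.0, -4.0])  # made-up vector with L2 norm 5

print(l1_penalty(w, 0.01))    # 0.01 * (3 + 4)
print(l2_penalty(w, 0.01))    # 0.01 * (9 + 16)
print(clip_by_norm(w, 1.0))   # same direction, norm reduced to 1
print(clip_by_value(w, 0.5))  # each component clipped independently
```

Clipping by norm preserves the direction of the gradient, whereas clipping by value can change it.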

### Exercise  
Experiment with these different forms of regularization, one at a time, to better understand their effect (use the MNIST dataset if your computer allows).

### Early Stopping
Early stopping is one of the most widely used regularizers. But how do you use it?
The solution is Keras callbacks.


A callback is a set of functions to be applied at given stages of the training procedure. You can use callbacks to get a view on internal states and statistics of the model during training. You can pass a list of callbacks (as the keyword argument `callbacks`) to the `.fit()` method of the `Sequential` or `Model` classes. The relevant methods of the callbacks will then be called at each stage of the training.


There are some default callbacks available in Keras, which you can use. Check the [Keras documentation page](https://keras.io/callbacks/) for the full list:
- `ModelCheckpoint`: save the model after every epoch;
- `EarlyStopping`: stop training when a monitored quantity has stopped improving;
- `LearningRateScheduler`: allows changing the learning rate after each epoch.

`EarlyStopping` takes the following parameters:
- `monitor`: quantity to be monitored.
- `min_delta`: minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than `min_delta` will count as no improvement.
- `patience`: number of epochs with no improvement after which training will be stopped.
- `verbose`: verbosity mode.
- `mode`: one of {auto, min, max}. In min mode, training will stop when the quantity monitored has stopped decreasing; in max mode it will stop when the quantity monitored has stopped increasing; in auto mode, the direction is automatically inferred from the name of the monitored quantity.
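
How `patience` and `min_delta` interact can be sketched in plain Python. The function `early_stopping_epoch` and the list of validation losses below are illustrative assumptions (a simplified version of the "min" mode), not the Keras implementation and not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience=4, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Simplified 'min' mode: a loss counts as an improvement only if it is
    lower than the best loss seen so far by more than min_delta.
    """
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65]

print(early_stopping_epoch(val_losses, patience=4))   # 6
print(early_stopping_epoch(val_losses, patience=10))  # None: never stops early
```

Note that matching the best loss (epoch 5 above) does not reset the counter, because it is not a strict improvement.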

In [50]:
# preparing validation data for Early Stopping
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train)

In [51]:
# Early stopping example
from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 4 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=4, verbose=1)

model = Sequential()
# The digits data has N = 64 input features (784 would be MNIST)
model.add(Dense(512, activation='relu', input_shape=(N,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(K, activation='softmax'))

# task: improve the optimizer!
model.compile(loss='categorical_crossentropy', optimizer=optimizers.SGD(),
              metrics=['accuracy'])

# increase if your hardware allows that!
epochs = 10

# Monitor the validation split prepared above, not the test set
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=epochs,
          batch_size=128, verbose=True, callbacks=[early_stop])