In [3]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

%load_ext tensorboard

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(23)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Training Deep Neural Networks

Would you like to know how to avoid the common problems that come with training deep neural networks (DNNs)? In this post we test our knowledge of how to navigate around them.

Some common issues and things to keep in mind when training DNNs are:
* Vanishing/exploding gradient problems in the lower layers of a DNN
* Choosing optimizers that train efficiently
* Using regularization techniques to reduce the risk of overfitting
* Using unsupervised pretraining to tackle complex problems with little labeled data
* Reducing model training time
* Leveraging transfer learning, i.e. reusing a pretrained model's lower layers to build a DNN that accomplishes a similar task

<h1 style="color:#3A913F;">The Vanishing/ Exploding Gradients Problems</h1>

Describe the Vanishing Gradients problem. What about the Exploding Gradients problem?
<br><br>
The Vanishing Gradients problem occurs when the gradients used to update the connection weights (the gradient of the cost function with regard to each parameter in the network) become smaller and smaller as the backpropagation algorithm progresses down to the lower layers. As a result, the lower layers' connection weights are left virtually unchanged, and training never converges to a good solution. Conversely, in the Exploding Gradients problem the gradients grow bigger and bigger until layers get absurdly large weight updates and the algorithm diverges. Both issues show that DNNs suffer from naturally unstable gradients, meaning different layers may learn at widely different speeds.

With the sigmoid activation function and a weight initialization scheme of N(0, 1), the variance of the outputs of each layer is much greater than the variance of its inputs. The variance keeps growing from layer to layer until the activation function saturates in the top layers, where its derivative is close to 0. It does not help that the sigmoid activation function (AF) has a mean of 0.5, not 0.
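
To make the shrinking effect concrete, here is a rough sketch (not part of the original notebook) that looks only at the sigmoid derivative. It deliberately ignores the weight matrices, which is a simplifying assumption: the point is just that each sigmoid layer can scale the backpropagated gradient by at most 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0, where the sigmoid outputs 0.5.
max_grad = sigmoid_grad(0.0)

# Best case: every layer sits exactly at that peak. Backpropagating through
# n_layers sigmoid layers still multiplies the gradient by 0.25 each time.
n_layers = 10
best_case_factor = max_grad ** n_layers

# Far from zero the sigmoid saturates and the derivative is nearly 0.
saturated_grad = sigmoid_grad(10.0)

print(max_grad, best_case_factor, saturated_grad)
```

Even in this best case the gradient reaching the lowest of ten layers is roughly a million times smaller than at the top, and saturation makes it worse.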

<h2>Glorot and He Initialization</h2>

In a layer, what are fan-in and fan-out, and how are they involved in Glorot initialization, which alleviates the vanishing/exploding gradients problem (when using a sigmoid AF)?
<br><br>
Fan-in is the number of inputs to the layer and fan-out is the number of neurons (outputs) in the layer. Glorot initialization says that the <strong>connection weights in each layer must be initialized as:</strong>
<br><br>
$N(0, \sigma^{2})$ where $\sigma^{2} = \frac{1}{fan_{avg}}$
<br>
OR
<br>
$U(-r, r)$ where $r = \sqrt{\frac{3}{fan_{avg}}}$
<br>

where $fan_{avg} = (fan_{in} + fan_{out})/2$


What about the ReLU AF: which initialization strategy should you use to avoid the vanishing/exploding gradients problem?

<br><br> It is best to use He initialization, which differs from Glorot initialization only in the scale of the variance and in using $fan_{in}$ instead of $fan_{avg}$: $\sigma^{2} = \frac{2}{fan_{in}}$
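
As a quick worked example of these formulas, the sketch below computes the Glorot and He scales for a layer with 784 inputs and 300 neurons (the sizes of the first hidden layer used later in this post). The function names are made up for illustration; they are not Keras APIs.

```python
import math

def glorot_normal_std(fan_in, fan_out):
    # sigma = sqrt(1 / fan_avg)
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(1.0 / fan_avg)

def glorot_uniform_limit(fan_in, fan_out):
    # r = sqrt(3 / fan_avg)
    fan_avg = (fan_in + fan_out) / 2
    return math.sqrt(3.0 / fan_avg)

def he_normal_std(fan_in):
    # sigma = sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

fan_in, fan_out = 784, 300
sigma = glorot_normal_std(fan_in, fan_out)
r = glorot_uniform_limit(fan_in, fan_out)
he_sigma = he_normal_std(fan_in)

print(sigma, r, he_sigma)
```

The uniform limit is chosen so that both Glorot variants have the same variance: a uniform distribution on (-r, r) has variance r²/3, which equals 1/fan_avg here.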

By default, Keras uses Glorot initialization with a uniform distribution. When creating
a layer, you can change this to He initialization by setting kernel_initializer to "he_normal" (or "he_uniform"):
```python
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
```
If you want He initialization with a uniform distribution based on $fan_{avg}$ rather than $fan_{in}$, you can use the VarianceScaling initializer like this:

```python
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)
```

<h2>Nonsaturating Activation Functions</h2>

In general, in what order should activation functions be tried? Give a brief explanation of each AF, e.g. its characteristics and limitations.
<br><br>
SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic
<br><br>
SELU: Scaled ELU makes the network <em>self-normalize</em>: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1 during training. However, this only works for sequential network architectures, where the input features are standardized (mean 0 and standard deviation 1) and every hidden layer's weights are initialized with LeCun normal initialization.
<br><br>
ELU: Alleviates the vanishing gradients problem because it takes negative values when $z < 0$, which brings the average output closer to 0. It avoids the dead neurons problem because its gradient is nonzero when $z < 0$. When $\alpha = 1$ the function is smooth everywhere, including around $z = 0$, which helps speed up Gradient Descent. However, ELU is slower to compute than ReLU because it uses the exponential function.
<br>
$ ELU_\alpha(z) = \left\{
\begin{array}{ll}
      \alpha(\exp(z)-1) & z<0 \\
      z & z\ge0 \\
\end{array}
\right. $
<br>
Leaky ReLU: The hyperparameter α sets a small slope for negative inputs, ensuring that leaky ReLUs never die. A plain ReLU neuron, by contrast, can die: once its input is negative for every training instance it outputs only zero, and because the gradient of ReLU is zero for negative inputs, backpropagation through that neuron is blocked and Gradient Descent no longer updates it.
<br>
$ LeakyReLU_\alpha(z) = \max(\alpha z, z) = \left\{
\begin{array}{ll}
      z  & z\ge0 \\
      \alpha z & z<0 \\
\end{array}
\right. $
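
The two piecewise definitions translate almost directly into NumPy. This is a minimal sketch for illustration, with default values for α chosen as assumptions (1.0 for ELU, 0.01 for leaky ReLU); in practice you would use the implementations built into Keras.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # alpha * (exp(z) - 1) for z < 0, z otherwise.
    # np.minimum keeps exp() from overflowing on large positive inputs.
    negative_part = alpha * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return np.where(z < 0, negative_part, z)

def leaky_relu(z, alpha=0.01):
    z = np.asarray(z, dtype=float)
    # max(alpha * z, z): slope 1 for z >= 0, slope alpha for z < 0
    # (valid for 0 < alpha < 1).
    return np.maximum(alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(elu(z))
print(leaky_relu(z))
```

Both functions return small negative values for negative inputs instead of a flat zero, which is what keeps the gradient from vanishing entirely on that side.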

<h2>Batch Normalization</h2>

Batch Normalization (BN) helps reduce the vanishing/exploding gradients problem, which can otherwise appear during the training of a DNN. It effectively lets the model learn the optimal scale and mean of each layer's inputs.

Here is the math behind Batch Normalization: the goal is to zero-center and normalize the inputs, and the algorithm does this by estimating each input's mean and standard deviation over the current mini-batch.

<img src="../jupyter_images/Batch_Normalization_Algorithm.png">
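
The training-time computation can be sketched in NumPy as follows. This assumes the standard formulation of the algorithm (per-feature mini-batch mean and variance, a small smoothing term `eps`, then a learned scale `gamma` and offset `beta`); the function name and the `eps` value are illustrative choices, not taken from the figure.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    # Statistics are estimated per input feature over the mini-batch (axis 0).
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    # Zero-center and normalize; eps avoids division by zero.
    X_hat = (X - mu) / np.sqrt(var + eps)
    # Rescale and shift with the learned parameters.
    Z = gamma * X_hat + beta
    return Z, mu, var

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # mini-batch of 64 instances
gamma, beta = np.ones(4), np.zeros(4)

Z, mu, var = batch_norm_train(X, gamma, beta)
print(Z.mean(axis=0), Z.std(axis=0))
```

With `gamma` at 1 and `beta` at 0 the outputs have a mean very close to 0 and a standard deviation very close to 1 for every feature, whatever the scale of the inputs was.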

In [6]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300,activation='elu',kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100,activation="elu",kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10,activation="softmax")  
])

In [7]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 784)               3136      
_________________________________________________________________
dense_3 (Dense)              (None, 300)               235500    
_________________________________________________________________
batch_normalization_4 (Batch (None, 300)               1200      
_________________________________________________________________
dense_4 (Dense)              (None, 100)               30100     
_________________________________________________________________
batch_normalization_5 (Batch (None, 100)               400       
_________________________________________________________________
dense_5 (Dense)              (None, 10)               

Each BN layer adds four parameters per input: γ, β, μ, and σ. The last two
parameters, μ and σ, are the moving averages; they are not affected by backpropagation,
so Keras calls them “non-trainable”.

In [5]:
784*300

235200
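
The parameter counts in the summary above can be checked by hand. The helper names below are made up for this check: a dense layer has one weight per input-neuron pair plus one bias per neuron, and a BN layer has four parameters per input.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    # One weight per (input, neuron) pair, plus one bias per neuron.
    return n_inputs * n_units + (n_units if use_bias else 0)

def batch_norm_params(n_inputs):
    # gamma, beta, moving mean and moving standard deviation per input.
    return 4 * n_inputs

print(batch_norm_params(784))   # 3136
print(dense_params(784, 300))   # 235500
print(batch_norm_params(300))   # 1200
print(dense_params(300, 100))   # 30100
print(batch_norm_params(100))   # 400
```

So the 235,200 computed above is the weight count of the first dense layer, and the extra 300 in the summary are its bias terms.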

Batch Normalization hyperparameters

Momentum: used to update the exponential moving averages. Given a new value <strong>v</strong> (i.e., a new vector of input means or standard deviations computed over the current batch), the layer updates the running average $\hat{v}$ as $\hat{v} \leftarrow \hat{v} \times momentum + v \times (1 - momentum)$. A good momentum value is typically close to 1, for example 0.9, 0.99, or 0.999.
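
A small sketch of this update rule, using a scalar instead of a vector and a made-up constant batch statistic of 5.0, shows how the running average drifts toward the values it observes:

```python
def update_moving_average(v_hat, v, momentum=0.99):
    # v_hat <- v_hat * momentum + v * (1 - momentum)
    return v_hat * momentum + v * (1.0 - momentum)

# One update: the running average moves 10% of the way toward the new value.
first = update_moving_average(0.0, 5.0, momentum=0.9)

# Many updates with the same batch statistic converge to that statistic.
v_hat = 0.0
for _ in range(100):
    v_hat = update_moving_average(v_hat, 5.0, momentum=0.9)

print(first, v_hat)
```

A momentum closer to 1 makes the average smoother but slower to react, which is why larger values are usually paired with larger datasets and smaller mini-batches.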

Depending on the task, you may prefer adding the BN layers before the activation
functions, rather than after (as we just did). There is some debate about this, as
which is preferable seems to depend on the task—you can experiment with this too to
see which option works best on your dataset. To add the BN layers before the activation
functions, you must remove the activation function from the hidden layers and
add them as separate layers after the BN layers. Moreover, since a Batch Normalization
layer includes one offset parameter per input, you can remove the bias term from
the previous layer (just pass use_bias=False when creating it):

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

<h2>Gradient Clipping</h2>

Gradient Clipping is another technique to mitigate the exploding gradients problem: it clips the gradients during backpropagation so that they never exceed some threshold.

```python
# Option 1: clip every component of the gradient vector to the range [-1.0, 1.0].
# This can change the gradient's direction, e.g. [0.9, 100.0] becomes [0.9, 1.0].
optimizer = keras.optimizers.SGD(clipvalue=1.0)

# Option 2: rescale the whole gradient if its l2 norm exceeds the threshold.
# This preserves the gradient's direction but shrinks its scale:
# optimizer = keras.optimizers.SGD(clipnorm=1.0)

model.compile(loss="mse", optimizer=optimizer)
```
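
The difference between the two options is easier to see on a concrete gradient vector. The NumPy functions below are illustrative stand-ins written for this post, not the Keras internals:

```python
import numpy as np

def clip_by_value(grad, clip_value):
    # Clip each component independently to [-clip_value, clip_value].
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    # Rescale the whole vector only if its l2 norm exceeds clip_norm.
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        return grad * (clip_norm / norm)
    return grad

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

print(by_value)  # [0.9 1. ]
print(by_norm)   # roughly [0.009, 0.99996]
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal. Clipping by norm keeps the original direction but nearly eliminates the first component.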

<h1 style="color:#3A913F;">Reusing Pretrained Layers</h1>


Recall that transfer learning will work best when the inputs have similar low level features.

General Process for Transfer Learning:
<br>
- Freeze all the reused layers first (making their weights non-trainable so that Gradient Descent does not modify them) then train your model and see how it performs.
- Then unfreeze one or two of the top hidden layers to let backpropagation tweak them, and see if performance improves.
    - The more training data you have, the more hidden layers you can unfreeze.
    - Reduce the learning rate when you unfreeze reused layers, so that you preserve their finely tuned weights.

<h2>Transfer Learning with Keras</h2>



```python
# Clone model A's architecture, then copy its weights (clone_model does not copy them)
model_B = keras.models.clone_model(model_A)
model_B.set_weights(model_A.get_weights())

# Freeze every layer except the output layer
for layer in model_B.layers[:-1]:
    layer.trainable = False

# You must always compile after freezing or unfreezing layers
model_B.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

# Train only the output layer for a few epochs while the reused layers are frozen
model_B.fit(X_train, Y_train, epochs=6, validation_data=(X_valid, Y_valid))

# Unfreeze the reused layers
for layer in model_B.layers[:-1]:
    layer.trainable = True

# Use a reduced learning rate to preserve the reused layers' fine-tuned weights
optimizer = keras.optimizers.SGD(learning_rate=1e-4)  # the default is 1e-2
model_B.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])

history = model_B.fit(X_train, Y_train, epochs=16, validation_data=(X_valid, Y_valid))

```

One thing to keep in mind is that transfer learning does not work well with small dense networks: small networks learn few patterns, and dense networks learn very specific patterns. In either case those patterns are unlikely to be useful in other tasks. <em>One architecture that benefits from transfer learning is the deep CNN, because it tends to learn feature detectors that are general (especially in the lower layers).</em>

<h2>Unsupervised Pretraining</h2>

Aurélien Géron mentions that unsupervised pretraining used to rely on restricted Boltzmann machines (RBMs), a Deep Learning technique I am not currently familiar with. He also points out that autoencoders and GANs are now the techniques most commonly used for it.

<h2>Pretraining on an Auxiliary Task</h2>

Often used when you do not have much labeled training data for your specific task but you do have plenty of data for a related auxiliary task. You first train a network on the auxiliary task, then reuse its lower layers for your actual task, so that they can make the most of your limited training data.

<h3 style="color:#3A913F;">Summary: Reusing Pretrained Layers</h3>
There are multiple ways to reuse pretrained layers, each with a different use case: transfer learning for speed and similar tasks, unsupervised pretraining when labeling data is expensive, and pretraining on an auxiliary task to make use of readily available data.

<h1 style="color:#3A913F;">Faster Optimizers</h1>

<h2>Momentum Optimization</h2>

Regular Gradient Descent updates the weights by directly subtracting the gradient of the cost function with regard to the weights, multiplied by the learning rate. It does not care what the earlier gradients were, only what the current local gradient is. Momentum optimization, by contrast, cares a great deal about earlier gradients: at each iteration it scales a momentum vector m by a momentum hyperparameter β, subtracts the local gradient (multiplied by the learning rate) from it, and then updates the weights by adding m. The gradient therefore acts as an acceleration rather than as a speed.
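
For reference, the usual formulation of the momentum update is m ← βm − η∇J(θ) followed by θ ← θ + m. The sketch below applies it to a toy cost function J(θ) = θ², which is an assumption made purely for illustration, as are the hyperparameter values.

```python
def grad(theta):
    # Gradient of the toy cost function J(theta) = theta ** 2
    return 2.0 * theta

def momentum_step(theta, m, lr=0.1, beta=0.9):
    # The gradient changes the momentum vector; the momentum vector moves theta.
    m = beta * m - lr * grad(theta)
    return theta + m, m

first_theta, first_m = momentum_step(5.0, 0.0)

theta, m = 5.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m)

print(first_theta, first_m, theta)
```

Because earlier gradients keep contributing through m, the parameter overshoots the minimum and oscillates a little before settling, which is the price paid for moving faster across flat regions.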

<h2>Nesterov Accelerated Gradient</h2>

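Rather than measuring the gradient at the current position θ, Nesterov Accelerated Gradient (NAG) measures it slightly ahead in the direction of the momentum, at θ + βm. Since the momentum vector generally points toward the optimum, this look-ahead gradient is usually a little more accurate, and NAG tends to converge faster than regular momentum optimization.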
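
The only change from plain momentum is where the gradient is evaluated. A minimal sketch on a toy cost function J(θ) = θ² (chosen for illustration, along with the hyperparameter values) looks like this:

```python
def grad(theta):
    # Gradient of the toy cost function J(theta) = theta ** 2
    return 2.0 * theta

def nesterov_step(theta, m, lr=0.1, beta=0.9):
    # Measure the gradient at the look-ahead point theta + beta * m,
    # not at theta itself.
    lookahead = theta + beta * m
    m = beta * m - lr * grad(lookahead)
    return theta + m, m

theta, m = 5.0, 0.0
for _ in range(100):
    theta, m = nesterov_step(theta, m)

print(theta)
```

In Keras, this behavior is requested with `nesterov=True` when creating the SGD optimizer.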

<h2>Adam and Nadam Optimization</h2>

Adam <br>
Adam keeps track of an exponentially decaying average of past gradients, which acts like momentum: the most recent gradients have the most influence, and that influence decays for older iterations, somewhat like a rolling window. It also keeps track of an exponentially decaying average of past squared gradients, which is used to scale down the learning rate more along steep dimensions than along dimensions with gentler slopes, i.e. an adaptive learning rate.
<br><br>
Nadam <br>
This optimization technique combines Adam with the Nesterov trick (measuring the gradient slightly ahead in the direction of the momentum, at θ + βm).
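
To see the adaptive learning rate at work, here is a scalar sketch of one Adam step. It follows the commonly published form of the algorithm, with bias correction; the function name, default values, and example gradients are assumptions for illustration. Nadam's look-ahead variation is not shown.

```python
import math

def adam_step(theta, m, s, t, g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
    s = beta2 * s + (1 - beta2) * g * g    # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction, since m and s start at 0
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

# Same starting point, very different gradient magnitudes.
steep, _, _ = adam_step(theta=1.0, m=0.0, s=0.0, t=1, g=50.0)
gentle, _, _ = adam_step(theta=1.0, m=0.0, s=0.0, t=1, g=0.02)

print(1.0 - steep, 1.0 - gentle)
```

Both first steps have a size close to the learning rate, even though one gradient is 2,500 times larger than the other: dividing by the square root of the squared-gradient average cancels out the gradient's scale.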





<h2>Learning Rate Scheduling</h2>

In [5]:
(X_train_full, y_train_full), (X_test,y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
#Scaling the inputs to mean 0 and sd 1
pixel_means = X_train.mean(axis=0,keepdims=True)
pixel_stds = X_train.std(axis=0,keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

In [None]:
class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate):
        super().__init__()
        self.iterations, self.max_rate = iterations, max_rate

In [19]:
import tensorflow as tf
from tensorflow import keras

In [20]:
#Defining the Keras model to add callbacks to 
def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(1, input_dim =784))
    model.compile(
        optimizer = keras.optimizers.RMSprop(learning_rate=0.1),
        loss="mean_squared_error",
        metrics=["mean_absolute_error"]
    )
    return model
# Load example MNIST data and pre-process it
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Limit the data to 1000 samples
x_train = x_train[:1000]
y_train = y_train[:1000]
x_test = x_test[:1000]
y_test = y_test[:1000]

In [24]:
class CustomCallback(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        if logs is not None:
            print(logs.keys())
        # keys = list(logs.keys())
        # print("Starting training; got log keys: {}".format(keys))

    def on_train_batch_end(self, batch, logs=None):
        if logs is not None:
            keys = list(logs.keys())
            print("...Training: end of batch {}; got log keys: {}".format(batch, keys))

In [1]:
# model = get_model()
# model.fit(
#     x_train,
#     y_train,
#     batch_size=128,
#     epochs=1,
#     verbose=0,
#     validation_split=0.5,
#     callbacks=[CustomCallback()],
# )

# res = model.evaluate(
#     x_test, y_test, batch_size=128, verbose=0, callbacks=[CustomCallback()]
# )

# res = model.predict(x_test, batch_size=128, callbacks=[CustomCallback()])

<h3 style="color:#3A913F;">Summary: Faster Optimizers</h3>

Adaptive optimization methods (such as RMSProp, Adam, and Nadam) often converge quickly to a good solution. However, a 2017 paper by Ashia C. Wilson et al. showed that they can lead to solutions that generalize poorly on some datasets. So when you are disappointed by your model’s performance, try using plain Nesterov Accelerated Gradient instead: your dataset may just be allergic to adaptive gradients. In terms of learning rate scheduling, consider performance scheduling, in which you measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping. Or you can perform 1cycle scheduling, where the learning rate and the momentum trace two mountains that are inverses of each other: the learning rate rises and then falls over the course of training, while the momentum falls and then rises.
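
The "two mountains" picture of 1cycle scheduling can be sketched as a pure function of the training step. Everything specific here is an assumption made for illustration: the linear ramps, the split at the halfway point, the starting rate of one tenth of the maximum, and the 0.95/0.85 momentum bounds. Published variants also add a short final phase with a much lower learning rate, which is left out.

```python
def interpolate(step, step1, step2, value1, value2):
    # Linear interpolation between (step1, value1) and (step2, value2).
    return value1 + (value2 - value1) * (step - step1) / (step2 - step1)

def one_cycle(step, total_steps, max_rate, start_rate=None,
              max_momentum=0.95, min_momentum=0.85):
    """Return (learning_rate, momentum) for a given training step."""
    if start_rate is None:
        start_rate = max_rate / 10
    half = total_steps / 2
    if step <= half:
        # First half: learning rate climbs while momentum drops.
        rate = interpolate(step, 0, half, start_rate, max_rate)
        momentum = interpolate(step, 0, half, max_momentum, min_momentum)
    else:
        # Second half: learning rate drops while momentum climbs back.
        rate = interpolate(step, half, total_steps, max_rate, start_rate)
        momentum = interpolate(step, half, total_steps, min_momentum, max_momentum)
    return rate, momentum

for step in (0, 25, 50, 75, 100):
    print(step, one_cycle(step, total_steps=100, max_rate=0.05))
```

A Keras callback could call a function like this at the start of every batch and assign the results to the optimizer, which is the role the `OneCycleScheduler` class stub earlier in this post is meant to play.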

<h1 style="color:#3A913F;">Avoiding Overfitting Through Regularization</h1>

<h2>ℓ1 and ℓ2 Regularization</h2>

<h2>Dropout</h2>

<h2>Monte Carlo (MC) Dropout</h2>

<h2>Max-Norm Regularization</h2>

<h1 style="color:#3A913F;">Avoiding Overfitting Through Regularization Summary:</h1>

<h1 style="color:#3A913F;">Summary and Practical Guidelines</h1>