# Keras

## Typical Keras workflow
1. Specify Architecture (*layers, nodes, activation functions, etc.*)
2. Compile the model (*loss function, optimizer, etc.*)
3. Fit (actual cycle of forward and back propagation)
4. Predict

## Sequential Model
- One of the 2 ways of building models in Keras, and the easier of the two.
- Each layer has weights/connections only to the one layer that comes directly after it in the network diagram
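
To make "connections only to the next layer" concrete, here is a minimal pure-Python sketch of a forward pass through a stack of fully connected layers. It is not Keras code: the `dense` and `forward` helpers and all the weights are made up for illustration.

```python
def relu(x):
    """ReLU activation: pass positive values through, clip negatives to 0."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every node sees every value of the previous layer."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the output of each layer into the next layer, and only the next one."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


# Hypothetical network: 2 inputs -> 2 hidden nodes (ReLU) -> 1 output (identity).
layers = [
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0], relu),
    ([[2.0, -1.0]], [0.5], lambda x: x),
]
prediction = forward([3.0, 1.0], layers)
print(prediction)
```

In a sequential model the `layers` list is the whole architecture: there are no skip connections or branches, which is what keeps this way of building models simple.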


## Compilation Step

1. Specify the optimizer
    - It controls the learning rate
    - The learning rate can greatly affect how quickly good weights are found and how good they are
    - Many optimization algorithms tune the learning rate themselves.
    - There are many options, each with its own mathematical complexities.
    - So it is pragmatic to choose one optimization algorithm and use it for most problems.
    - 'Adam' is usually a good choice. It adjusts the learning rate as it does gradient descent


2. Loss function
    - "mean_squared_error" is a common choice for regression problems

## Fit the Model
- Apply backpropagation and gradient descent with your data to update the weights
- Scaling the data before fitting eases optimization further, because each feature then has values of similar size on average
    * One common technique is to subtract each feature's mean from it and then divide by its standard deviation
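
The scaling technique above can be sketched in plain Python as follows. The `standardize` helper and the example columns are hypothetical, and the sketch assumes that no column is constant (a constant column would have a standard deviation of 0).

```python
import statistics


def standardize(column):
    """Subtract the column mean and divide by the column standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(value - mean) / std for value in column]


# Hypothetical features on very different scales.
ages = [20.0, 30.0, 40.0, 50.0]
incomes = [20_000.0, 40_000.0, 60_000.0, 80_000.0]

scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither feature dominates the weight updates just because of its units.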

## Classification Model
- Here the loss function to be used is **categorical_crossentropy**
- Similar to logloss. The lower the better.
- In the compile step, we add metrics=['accuracy'] to see the accuracy of the model at the end of each epoch.
- The output layer will now have multiple nodes, each corresponding to a possible outcome and will use softmax activation.
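
As a rough illustration of what the softmax output layer and the cross-entropy loss compute, here is a hand-written sketch in plain Python. It is not the Keras implementation; the function names and the example scores are made up.

```python
import math


def softmax(scores):
    """Turn raw output-node scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


target = [0.0, 1.0, 0.0]               # one-hot target: the true class is index 1
good_probs = softmax([0.5, 3.0, 0.1])  # most probability on the true class
bad_probs = softmax([3.0, 0.5, 0.1])   # most probability on a wrong class

good_loss = categorical_crossentropy(target, good_probs)
bad_loss = categorical_crossentropy(target, bad_probs)
print(good_probs, good_loss)
print(bad_probs, bad_loss)
```

The prediction that puts more probability on the true class gets the lower loss, which is the sense in which "the lower the better" applies.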

## Using Your Model - Save, Load, Predict
- You can save the model by calling **.save()** method on model.
- Models are saved in **HDF5** format for which **.h5** is the common extension
- We can load the model by calling the **load_model()** function from keras.models.
- We can make predictions by calling the predict() method and passing it the feature values. It returns output in the same structure as the target we passed during training. For a classification model, that is the probability of each possible outcome.
- We can verify a model structure after loading by calling **.summary()** method on it.

## Model Optimization - Choosing the right architecture and optimization arguments
- Optimization - Hard problem
    * An optimal value of a weight depends on other weights, and we update many weights simultaneously
- We're simultaneously optimizing 1000s of parameters with complex relationships.
- Even if the slope tells us which weights to increase and which ones to decrease, our model may not improve meaningfully.
- A **small learning rate** makes such small updates to the weights that the model doesn't improve materially.
- A **large learning rate** may take us too far in the right direction, so that we overshoot the best weights.
- Adam is a smart optimizer, but there can still be optimization problems.
- To understand the effect of the learning rate, we can train with plain stochastic gradient descent (SGD), trying out different learning rates from a set.
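
The effect of the learning rate can be seen even on a toy problem. The sketch below is an assumption-laden stand-in for a real network: it runs plain gradient descent on the single-weight function f(w) = w² with three made-up learning rates.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize f(w) = w**2 (whose slope is 2*w) and return the final weight."""
    w = start
    for _ in range(steps):
        slope = 2.0 * w
        w -= learning_rate * slope
    return w


# Hypothetical set of learning rates to compare; the best weight is w = 0.
results = {lr: gradient_descent(lr) for lr in [0.0001, 0.1, 1.1]}
for lr, final_w in results.items():
    print(f"learning rate {lr}: final weight {final_w:.4f}")
```

The smallest rate barely moves the weight away from its starting value, the middle one gets close to 0, and the largest one overshoots further on every step and ends up far from the minimum.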

## The Dying Neuron Problem
- Occurs when *a neuron takes a value* less than 0.
- With the ReLU function, a node with negative input produces an output of 0, and the slope is also 0.
- As a result, the slopes for any weights flowing into that node are also zero, so those weights don't get updated.
- Once the node starts always getting negative inputs, it may continue getting negative inputs.
- Hence it doesn't really contribute anything to the model, which is why it is called a **"dead"** neuron
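
A small sketch of why a dead node stops learning, written in plain Python with hypothetical names and values. It applies the chain rule to a single ReLU node: the slope for each incoming weight is the upstream slope times the ReLU slope times the input value.

```python
def relu_slope(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def weight_gradients(inputs, weights, upstream_slope):
    """Slope of the loss with respect to each weight flowing into one ReLU node."""
    node_input = sum(w * x for w, x in zip(weights, inputs))
    return [upstream_slope * relu_slope(node_input) * x for x in inputs]


inputs = [1.0, 2.0]
live_grads = weight_gradients(inputs, weights=[0.5, 0.5], upstream_slope=1.0)
dead_grads = weight_gradients(inputs, weights=[-0.5, -0.5], upstream_slope=1.0)
print(live_grads)  # non-zero slopes: these weights can still be updated
print(dead_grads)  # all-zero slopes: these weights receive no update
```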

Should we then use an activation function whose slope never becomes exactly zero?

## Vanishing Gradients
- Occurs when many layers have very small slopes (e.g. due to being on the flat part of the tanh curve)
- Earlier networks used S-shaped activations like tanh, whose slope is small outside the middle of the S.
- In a deep network, repeated multiplication of small slopes causes the overall slope to get close to 0, and hence **the weight updates from backprop are close to 0**
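
The repeated multiplication can be sketched numerically. This simplified example assumes every layer sits at the same input value on the flat part of the tanh curve and ignores the weights themselves; the helper names are made up.

```python
import math


def tanh_slope(x):
    """Derivative of tanh: 1 - tanh(x)**2, which is small away from 0."""
    return 1.0 - math.tanh(x) ** 2


def backprop_factor(node_input, num_layers):
    """Product of the activation slopes across num_layers layers."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= tanh_slope(node_input)
    return factor


# Simplifying assumption: every layer has the same node input of 2.0.
shallow = backprop_factor(node_input=2.0, num_layers=2)
deep = backprop_factor(node_input=2.0, num_layers=10)
print(shallow, deep)
```

Each extra layer multiplies in another slope well below 1, so the factor reaching the early layers of the deeper network is many orders of magnitude smaller.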

**This is a phenomenon worth keeping in mind when trying to understand why your model isn't training better.  
Changing the activation function may be the solution.**

**NOTE**: Typically, a good model shows significantly improved loss in the first few epochs, and then the rate of improvement slows down.  
If a model doesn't show improved loss in the first few epochs, it could be due to:  
- Too small learning rate
- Too high learning rate
- Poor choice of activation function

## Validation in Deep Learning
- Instead of relying on model performance on the training data, we should validate its performance on held-out data.
- A single validation split is more commonly used than k-fold cross-validation.
- This is because deep learning typically involves large datasets, so the computational expense of running k-fold validation would be large.
- Here we trust the single validation score because it is based on a large amount of data, which makes it reliable.
- Keras makes it easy for us to use some of our data for validation, by specifying it using **validation_split** in the fit() method.
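
Conceptually, a validation split just sets aside part of the data that is never used for weight updates. The sketch below is a plain-Python illustration, not Keras's own code. It holds out the last fraction of the rows, which matches how the Keras documentation describes `validation_split` (the held-out samples are taken from the end of the data, before shuffling). The helper name and the rounding rule are assumptions.

```python
def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    num_validation = round(len(samples) * validation_split)
    split_at = len(samples) - num_validation
    return samples[:split_at], samples[split_at:]


rows = list(range(10))  # stand-in for 10 rows of data
train_rows, val_rows = split_for_validation(rows, validation_split=0.3)
print(train_rows)
print(val_rows)
```

Because the held-out rows come from the end, data that is sorted (for example by class or by date) should be shuffled before fitting, otherwise the validation set may not be representative.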

## Early Stopping: Optimizing the optimization
- We keep training the model as long as the validation score is improving. Once it stops improving, we stop training. This is Early Stopping.
- We can use **EarlyStopping** from **keras.callbacks** to create an early stopping monitor before calling the fit method. This monitor checks whether the validation score is still improving in subsequent epochs.

```python
from keras.callbacks import EarlyStopping

early_stopping_monitor = EarlyStopping(patience=2)
```


- **patience** argument is used to specify the **number of epochs the model can go without improving**.  
  **2 or 3** is a good choice. (Model may not improve after one epoch, but we should wait as it may improve in next epoch)
- We then pass this monitor to the fit method under the argument **callbacks**

```python
model.fit(X, y, validation_split=0.3, callbacks=[early_stopping_monitor], epochs=20)
```

(We can add more callbacks to this list later, as our skills advance!)
- Now that we have an early stopping callback, we can set a much higher maximum number of epochs in the **epochs** argument.
- The model will keep training until it stops improving or reaches that maximum, whichever comes first.  
  This is a smarter training logic than relying on a fixed number of epochs without looking at validation scores, which risks missing out on further possible improvement.
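
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not the actual Keras `EarlyStopping` implementation; the function name and the validation losses are made up.

```python
def epochs_run_with_early_stopping(val_losses, patience):
    """Return how many epochs run before training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` epochs in a row, or when the losses run out.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses for a run with a maximum of 8 epochs.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.58, 0.59, 0.61, 0.57]
stopped_after = epochs_run_with_early_stopping(val_losses, patience=2)
print(stopped_after)
```

With a patience of 2, the single bad epoch 4 is tolerated because epoch 5 improves again, but the two bad epochs 6 and 7 in a row end the run. In this made-up series the loss at epoch 8 would have been a new best, which shows the trade-off in choosing the patience value.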

## Experimentation
Building great models requires experimentation:  
- Experiment with different architectures
- More layers
- Fewer layers
- Layers with more nodes
- Layers with fewer nodes