# Keras

## Typical Keras workflow
1. Specify Architecture (*layers, nodes, activation functions, etc.*) **keras.models, model.add**
2. Compile the model (*loss function, optimizer, etc.*) **model.compile**
3. Fit (actual cycle of forward and back propagation) **model.fit**
4. Predict **model.predict**

## Sequential Model
- One of the two ways of building models in Keras, and the easier of the two.
- Each layer has weights/connections only to the layer that comes directly after it in the network diagram.


In [5]:
! pip install tensorflow

In [6]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()  # model with 2 hidden layers and 1 output layer
# input_shape=(50,) means 50 input features; the number of rows (data points) is left unspecified
model.add(Dense(100, activation='relu', input_shape=(50,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))

Using TensorFlow backend.


## Compilation Step

1. Specify the **optimizer** - *sgd, rmsprop, etc.*
    - It controls the learning rate.
    - The learning rate strongly affects how quickly good weights are found and how good they end up being.
    - Many optimization algorithms tune the learning rate themselves.
    - There are many options, each with its own mathematical complexities.
    - A pragmatic approach is to choose one optimization algorithm and use it for most problems.
    - **'Adam'** is usually a good choice: it adjusts the learning rate as it does gradient descent.


2. **Loss function** - *binary_crossentropy, categorical_crossentropy, etc.*
    - **"mean_squared_error"** is a common choice for regression problems.

In [7]:
model.compile(optimizer="adam", loss="mean_squared_error")

## Fit the Model
- Apply backpropagation and gradient descent with your data to update the weights.
- Scaling the data before fitting eases optimization, because every feature then has values of similar size on average.
    * One common technique is to subtract each feature's mean from it and divide by its standard deviation.
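
The scaling step above can be sketched in plain Python. This is only a minimal illustration on one made-up feature column (`ages` is a hypothetical name and the values are invented); on real data the same arithmetic would normally be applied to each column of the feature array.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return a copy of `column` rescaled to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    return [(value - mu) / sigma for value in column]

# Hypothetical feature column (made-up values, for illustration only)
ages = [20.0, 30.0, 40.0, 50.0, 60.0]
scaled_ages = standardize(ages)
print(scaled_ages)
```

Note that a constant column has a standard deviation of 0, so this sketch would divide by zero on it; such columns need separate handling.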

## Classification Model
- Here the loss function to use is **categorical_crossentropy**.
- It is similar to log loss: the lower the loss, the better.
- In the compile step, we add **metrics=['accuracy']** to see the model's accuracy at each epoch.
- The output layer now has multiple nodes, one per possible outcome, and uses softmax activation.
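
To make softmax and categorical cross-entropy a little more concrete, here is a rough plain-Python sketch of the arithmetic for a single sample. The raw scores are made up, and this is not how Keras implements it internally (Keras works on whole batches of tensors); it only shows the idea.

```python
import math

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

# Hypothetical raw scores for a 3-class problem (made-up values)
probs = softmax([2.0, 1.0, 0.1])
loss_if_class_0 = categorical_crossentropy([1, 0, 0], probs)
loss_if_class_2 = categorical_crossentropy([0, 0, 1], probs)
print(probs, loss_if_class_0, loss_if_class_2)
```

The loss is small when the true class received a high probability (class 0 here) and large when it received a low one (class 2).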

## Using Your Model - Save, Load, Predict
- You can save a model by calling its **.save()** method.
- Models are saved in the **HDF5** format, for which **.h5** is the common extension.
- We can load the model back with the **load_model()** function from keras.models.
- We can make predictions by calling the **predict()** method and passing it the feature values. The output has the same structure as the target we passed during training; for a classifier it lists the probability of each possible outcome.
- We can verify a model's structure after loading by calling its **.summary()** method.

In [None]:
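from keras.models import load_model
model.save('my_model.h5')           # write the trained model to an HDF5 file
model = load_model('my_model.h5')   # rebuild the model from that file
preds = model.predict(X_test)       # X_test: held-out feature values, assumed to be defined earlier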
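
Since a classifier's predictions come back as one probability per class, a common follow-up is to pick the most likely class for each row. A minimal sketch, using a made-up stand-in for the prediction output (`example_preds` and its values are hypothetical):

```python
def predicted_class(probabilities):
    """Index of the largest probability in one prediction row."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

# Hypothetical prediction output for two samples and three classes
example_preds = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
labels = [predicted_class(row) for row in example_preds]
print(labels)  # [1, 0]
```

With a NumPy array of predictions, `preds.argmax(axis=1)` does the same thing in one call.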

## Model Optimization - Choosing the right architecture and optimization arguments
- Optimization is a hard problem.
    * The optimal value of any one weight depends on the other weights, and we update many weights simultaneously.
- We're simultaneously optimizing thousands of parameters with complex relationships.
- Even if the slope tells us which weights to increase and which to decrease, our updates may not improve the model meaningfully.
- A **small learning rate** makes such small updates to the weights that the model doesn't improve materially.
- A **large learning rate** may take us too far in the right direction, overshooting the minimum.
- Adam is a smart optimizer, but optimization problems can still occur.
- To understand the effect of the learning rate, we can use **SGD** (Stochastic Gradient Descent) and try out several learning rates from a set.

In [None]:
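from keras.optimizers import SGD
lr_to_test = [0.000001, 0.01, 1]
for lr in lr_to_test:
    model = get_new_model()  # helper assumed to be defined elsewhere; returns a fresh, uncompiled model
    my_optimizer = SGD(lr=lr)  # newer Keras versions name this argument learning_rate
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')
    history = model.fit(predictors, targets)  # fit returns a History object, not predictions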
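
The effect of the three learning rates can also be seen without any neural network. The sketch below runs plain gradient descent on the toy function f(w) = w², chosen here purely for illustration (it is not part of the original exercise); the minimum is at w = 0.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(w) = w**2 (whose gradient is 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

for lr in [0.000001, 0.01, 1]:
    print(lr, gradient_descent(lr))
```

With the tiny rate, w barely moves from 5. With 0.01 it makes steady progress towards 0. With 1, every step jumps from w to -w, so w bounces between 5 and -5 and never gets closer to the minimum.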

## The Dying Neuron Problem
<img src="files/dying_neuron_relu.png">  
- In the ReLU function, a node with negative input produces an **output of 0**, and **the slope is also 0** (as shown in the figure above).
- As a result, the slopes for any **weights flowing into that node** are also zero, so those weights don't get updated.
- Once a node starts always getting negative inputs, it may continue getting negative inputs.
- It then contributes nothing to the model, hence the name **"dead"** neuron.
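
A small sketch of why the weights into such a node stop changing. All numbers (`learning_rate`, `upstream_gradient`, `input_value`) are made up for illustration, and the weight update is reduced to a single chain-rule product.

```python
def relu(x):
    """ReLU activation: pass positive inputs through, clamp negatives to 0."""
    return max(0.0, x)

def relu_slope(x):
    """Slope of ReLU as used in backpropagation (taken as 0 at x == 0 here)."""
    return 1.0 if x > 0 else 0.0

learning_rate = 0.01     # hypothetical value
upstream_gradient = 0.8  # hypothetical gradient arriving from the next layer
input_value = 1.5        # hypothetical value feeding the weight

for node_input in [-2.0, 3.0]:
    # Chain rule: the update is proportional to the slope of the activation
    weight_update = learning_rate * upstream_gradient * relu_slope(node_input) * input_value
    print(node_input, relu(node_input), relu_slope(node_input), weight_update)
```

For the negative node input, both the output and the weight update come out as 0; for the positive one, the update is non-zero.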

Should we then use an activation function whose slope never becomes exactly zero?

## Vanishing Gradients
- Occurs when many layers have very small slopes (e.g. due to being on the flat part of the tanh curve).
- Earlier, S-shaped activations like tanh were used, whose slope is small outside the middle of the S.
- In a deep network, repeated multiplication of small slopes causes the overall slope to get close to 0, and hence **the weight updates from backprop are close to 0**.

This is a phenomenon *worth keeping in mind* to understand why your model isn't training  better.  
**Changing the activation function may be the solution.**
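
The repeated multiplication can be illustrated numerically. This is a deliberately simplified sketch: it assumes every layer sits at the same pre-activation value (2.5, picked arbitrarily from the flat part of the tanh curve) and it ignores the weights entirely.

```python
import math

def tanh_slope(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

def chained_slope(pre_activation, n_layers):
    """Product of the activation slopes across n_layers (weights ignored)."""
    return tanh_slope(pre_activation) ** n_layers

for n_layers in [1, 5, 20]:
    print(n_layers, chained_slope(2.5, n_layers))
```

A single layer already has a slope below 0.03 at this point of the curve, and after 20 layers the product is vanishingly small.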

**NOTE**: Typically a good model shows significantly improved loss in the first few epochs, after which the rate of improvement slows down.  
If a model doesn't show improved loss in the first few epochs, it could be due to:  
- Too small learning rate
- Too high learning rate
- Poor choice of activation function

## Validation in Deep Learning
- Instead of relying on model performance on the training data, we should validate its performance on held-out data.
- A validation split is more commonly used than k-fold cross-validation.
- This is because deep learning is typically used on large datasets, so the computational expense of running k-fold validation would be large.
- We trust the single validation score because it is based on a large amount of data and is therefore reliable.
- Keras makes it easy to hold out some of our data for validation through the **validation_split** argument of the fit() method.
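
The Keras documentation describes `validation_split` as holding out the last fraction of the samples, before any shuffling. A rough plain-Python sketch of that idea (the function name and the 8-row stand-in data are made up; this is not the Keras source):

```python
def split_for_validation(samples, validation_split):
    """Hold out the last `validation_split` fraction of `samples`."""
    split_at = int(len(samples) * (1 - validation_split))
    return samples[:split_at], samples[split_at:]

rows = list(range(8))  # stand-in for 8 training rows
train_rows, val_rows = split_for_validation(rows, 0.25)
print(train_rows, val_rows)  # [0, 1, 2, 3, 4, 5] [6, 7]
```

Because the held-out rows come from the end, data that is sorted (for example by class or by date) should be shuffled before calling fit().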

## Early Stopping: Optimizing the optimization
- Stop optimization when it isn't helping any more.
- We keep training the model as long as the validation score is improving. Once it stops improving, we stop training. This is early stopping.
- We can use **EarlyStopping** from **keras.callbacks** to create an early stopping monitor before calling the fit method. This monitor checks whether the validation score is still improving in subsequent epochs.

In [2]:
from keras.callbacks import EarlyStopping
early_stopping_monitor = EarlyStopping(patience = 2)

Using TensorFlow backend.


- The **patience** argument specifies the **number of epochs the model can go without improving** before training is stopped.  
  **2 or 3** is a good choice. (The model may not improve in one epoch, but we should wait because it may improve in the next.)
- We then pass this monitor to the fit method in the **callbacks** argument.

In [None]:
model.fit(X, y, validation_split = 0.3, callbacks = [early_stopping_monitor], epochs=20)

(We may later specify more callbacks in the list, when we've advanced our skills!)
- Now that we have an early stopping callback, we can specify a much higher maximum number of epochs in the **epochs** argument.
- The model now keeps training until it stops improving or reaches that maximum, whichever comes first.  
  This is smarter than training for a fixed number of epochs without looking at validation scores, which risks missing out on further possible improvement.
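
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation (the real callback has more options, such as `min_delta`), and the validation losses are made up.

```python
def epoch_where_training_stops(val_losses, patience):
    """Return the 0-based index of the last epoch that runs under a patience rule."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop early
    return len(val_losses) - 1  # ran through every epoch

# Made-up validation losses for 7 epochs (illustration only)
val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]
print(epoch_where_training_stops(val_losses, patience=2))  # 4
print(epoch_where_training_stops(val_losses, patience=4))  # 6
```

With a patience of 2, training stops after two epochs without a new best loss and never sees the later improvement; with a patience of 4 it survives long enough to reach it.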

## Experimentation
Building great models requires experimentation:  
- Experiment with different architectures
- More layers
- Fewer layers
- Layers with more nodes
- Layers with fewer nodes

### Fine Tuning Keras model by adding layers

In [None]:
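import matplotlib.pyplot as plt  # needed for the plot at the end of this cell

# The input shape to use in the first hidden layer
# (n_cols is the number of feature columns, assumed to be defined earlier)
input_shape = (n_cols,)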

# Create the new model: model_2
model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(50, activation='relu', input_shape=input_shape))
model_2.add(Dense(50, activation='relu'))
model_2.add(Dense(50, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model 1 (the smaller baseline model, assumed to be built and compiled earlier)
model_1_training = model_1.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Plot the validation loss per epoch: model 1 in red, model 2 in blue
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()


## Model Capacity
- There is a little more art to finding good DL architectures than to fine-tuning other ML algorithms.
- 'Model capacity' should be considered when deciding which models to try.
- **Model capacity** is a model's ability to capture predictive patterns in your data. It is closely related to the concepts of overfitting and underfitting.
<img src="files/overfitting.png">  
- Overfitting is a model's ability to fit oddities in the training data (ones that are purely due to happenstance and won't be present in a new dataset).
- In underfitting, the model fails to find important predictive patterns in the training data.
- The more capacity our deep learning model has, the further right we will be on the above graph (i.e. a more complex model).
- Increasing the number of layers or the number of nodes per layer increases the model capacity.
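
One rough way to see capacity grow is to count trainable parameters. The sketch below does this for stacks of fully connected (Dense) layers; the helper name is made up, and the layer sizes are taken from the digit-recognition networks later in these notes. Parameter count is only a crude proxy for capacity, not a definition of it.

```python
def dense_parameter_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = n_inputs
    for nodes in layer_sizes:
        total += previous * nodes + nodes  # one weight per connection, one bias per node
        previous = nodes
    return total

two_hidden = dense_parameter_count(784, [50, 50, 10])
three_hidden = dense_parameter_count(784, [50, 50, 50, 10])
print(two_hidden, three_hidden)  # 42310 44860
```

These totals should match the "Total params" figure that `model.summary()` reports for the corresponding models.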


## Workflow for optimizing model capacity
1. Start with a simple network (baseline model)
2. Get the validation score
3. Keep increasing capacity as long as the validation score is improving
4. Once it stops improving, we can decrease the capacity slightly; the result is still near the ideal.

Here's a sequential experiment trying to optimize model capacity:  
<img src="files/capacity_experiment.png">  
Should we change capacity by **adding nodes to an existing layer** or by **adding another layer**?  
There is no universal answer to that. Keep experimenting.
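
The workflow above can be written as a small search loop. This is only a sketch: `search_capacity` is a hypothetical helper, each candidate is a tuple of hidden-layer sizes, and the validation losses are invented numbers standing in for real training runs.

```python
def search_capacity(candidates, validation_loss):
    """Try candidates from small to large; stop at the first one that fails to improve."""
    best = candidates[0]
    best_loss = validation_loss(best)
    for candidate in candidates[1:]:
        loss = validation_loss(candidate)
        if loss >= best_loss:
            break  # more capacity no longer helps
        best, best_loss = candidate, loss
    return best, best_loss

# Made-up validation losses per architecture, NOT real results
made_up_losses = {
    (50,): 0.40,
    (50, 50): 0.30,
    (50, 50, 50): 0.25,
    (50, 50, 50, 50): 0.27,
}
candidates = list(made_up_losses)
best, best_loss = search_capacity(candidates, made_up_losses.get)
print(best, best_loss)  # (50, 50, 50) 0.25
```

In practice, `validation_loss` would build, compile, and fit a model with the given layer sizes and return its best validation loss.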

## DL on Images

### Recognizing handwritten digits : MNIST
- Each image is a 28 x 28 pixel grid, flattened to 784 values
- Each value denotes the darkness of that pixel
- Create a DL model that takes those 784 features of each image as inputs and predicts one of the 10 possible digits as output

In [33]:
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

#### Load Data

In [43]:
from keras.datasets import mnist

In [53]:
(X_train,y_train), (X_test, y_test) = mnist.load_data()

In [54]:
X_train.shape, X_test.shape

((60000, 28, 28), (10000, 28, 28))

#### Flatten data

In [55]:
X_train = X_train.reshape((X_train.shape[0],-1)).astype('float32')/255
X_test = X_test.reshape((X_test.shape[0],-1)).astype('float32')/255

In [56]:
X_train.shape, X_test.shape

((60000, 784), (10000, 784))

#### Categorical encoding of output

In [58]:
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test,10)

In [59]:
print(y_train.shape, y_test.shape)

(60000, 10) (10000, 10)
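
What the categorical encoding does to each label can be sketched in plain Python. The helper name and the three example labels are made up; `to_categorical` itself returns a NumPy array rather than nested lists.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

digit_labels = [5, 0, 4]  # hypothetical digit labels
encoded = [one_hot(label, 10) for label in digit_labels]
for row in encoded:
    print(row)
```

Each label becomes a row of 10 values, which is why the target shapes above change from one number per image to 10 columns per image.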


#### Create Model, Compile and Fit the model

In [60]:
model = Sequential()
# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape = (784,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu') )

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Create early stopping callback (disabled for this run)
# early_stop_monitor = EarlyStopping(patience=2)

# Fit the model
model.fit(X_train, y_train, validation_split=0.3, epochs=10)

Train on 42000 samples, validate on 18000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ffb6c6ad128>

### Increase Capacity

In [61]:
model = Sequential()
# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape = (784,)))

# Add the second and third hidden layers
model.add(Dense(50, activation='relu') )
model.add(Dense(50, activation='relu') )

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Create early stopping callback
early_stop_monitor = EarlyStopping(patience=3)

# Fit the model
model.fit(X_train, y_train, validation_split=0.3, epochs=10, callbacks=[early_stop_monitor])

Train on 42000 samples, validate on 18000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ffb6c1f7860>

In [63]:
model.evaluate(X_test, y_test)



[0.13199304647133686, 0.9672]

The two values are the test loss (categorical cross-entropy) and the test accuracy, i.e. about 96.7% of the test digits are classified correctly.

## Next Steps
- Start with standard prediction problems on tables of numbers (pandas or NumPy arrays)
- Images (with convolutional neural networks) are a common next step (or text, or sound, or something else!)
- keras.io has excellent documentation
- A graphics processing unit (GPU) provides dramatic speedups in model training times
- This needs a CUDA-compatible GPU
- For training using **GPUs in the cloud**, look here: http://bit.ly/2mYQXQb
- Kaggle datasets and its forum
- [Wiki page](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research) on datasets
- Keras and TensorFlow repos on GitHub