# Keras Practice

## Definitions/Basics

For the whole 'Basics' section, this post was quite helpful: https://machinelearningmastery.com/5-step-life-cycle-neural-network-models-keras/

### Imports
`Sequential` lives in `keras.models`,
while `Dense` and the other layer classes live in `keras.layers`.

In [77]:
# Importing NN Architecture model
from keras.models import Sequential

# Importing different sorts of layers
from keras.layers import Dense
from keras.layers import Activation
from keras.layers import Dropout
from keras.layers import Conv2D
from keras.layers import Flatten
from keras.layers import Lambda
from keras.layers import RepeatVector
from keras.layers import Reshape
from keras.layers import ZeroPadding2D
from keras.layers import UpSampling2D
from keras.layers import Cropping2D
from keras.layers import MaxPooling2D
from keras.layers import GlobalMaxPooling2D

# Importing regularization functions
from keras import regularizers

# Importing constraints
from keras import constraints

from keras.models import model_from_json

### Defining Model

Everything starts with a Sequential() class instantiation. This class is the container for the architecture for a neural network model in Keras.

In [5]:
model = Sequential()
model.add(Dense(2))

ValueError: The first layer in a Sequential model must get an `input_shape` or `batch_input_shape` argument.

**As the error shows, the first layer in a Sequential model must be told its input shape!**
For a Dense layer the simplest way is the *input_dim* parameter: `input_dim=15` is equivalent to `input_shape=(15,)`.

In [12]:
model = Sequential()
# say 15 features
model.add(Dense(2,input_dim=15))

### Different sorts of Layers

#### Dense (FC) Layers

Definition in docs : 

`keras.layers.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)`

Understanding Parameters for Dense Layer : 

- **use_bias** : whether to include the bias (b) term or not.
- **activation** : which activation function to use. By default it is None, so you can add an activation layer afterwards. But if you already know which one you want, you can specify it right here.
- **kernel_initializer** : which initializer to use for the weights of the layer. You could initialize them randomly, but random with what distribution? By default it uses 'glorot_uniform', however there are many initializers [here](https://keras.io/initializers/). You could also use 'RandomUniform'.
- **bias_initializer** : how to initialize the biases. Zeros usually work, so 'zeros' is the default.
- **kernel_regularizer** : whether to use L1 or L2 regularization to penalize the weights. Values could be `regularizers.l2(0.01)` or `regularizers.l1(0.01)` or None. It is None by default.
- **bias_regularizer** : bias regularization is often not really required, so None works fine here. However, you do have the same options as above.
- **activity_regularizer** : regularization applied to the output (the activations) of the layer rather than to its weights. It penalizes large outputs, which can be used to encourage small or sparse activations.
- **kernel_constraint** : you can also put constraints on your weights, so that while they are optimized during training they always satisfy a certain criterion, like having norm = 1 or always being non-negative. [All constraints are here](https://keras.io/constraints/).
- **bias_constraint** : same thing as above, but for the bias values of this layer.

**Simple instantiation**

In [18]:
# units = 2 means 2 hidden nodes in this layer!
model.add(Dense(2))

In [24]:
model.add(Dense(5, activation='relu', kernel_initializer='RandomUniform', kernel_regularizer=regularizers.l2(0.01), kernel_constraint=constraints.non_neg()))
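
The `Param #` column that `summary()` prints for a Dense layer can be checked by hand: one weight per (input, unit) pair plus one bias per unit. Below is a minimal sketch of that arithmetic; `dense_param_count` is a hypothetical helper written for these notes, not a Keras function.

```python
def dense_param_count(input_dim, units, use_bias=True):
    """Number of trainable parameters in a fully connected layer."""
    weights = input_dim * units          # one weight per (input, unit) pair
    biases = units if use_bias else 0    # one bias per unit, if enabled
    return weights + biases

# Dense(5, input_dim=2) followed by Dense(1)
print(dense_param_count(2, 5))   # 15
print(dense_param_count(5, 1))   # 6
```

These are the same counts that appear in the `modelA` summary in the loss/optimizer section of these notes.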

#### Activation layer

If you prefer to keep the activation decoupled from the layer that computes the weighted sum, you can add it as a separate Activation layer.

Below are some common predictive modeling problem types and the structure and standard activation function that you can use in the output layer:

* Regression: Linear activation function or ‘linear’ and the number of neurons matching the number of outputs.
* Binary Classification (2 class): Logistic activation function or ‘sigmoid’ and one neuron in the output layer.
* Multiclass Classification (>2 class): Softmax activation function or ‘softmax’ and one output neuron per class value, assuming a one-hot encoded output pattern.

In [15]:
model.add(Activation('relu'))
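
To make the activation names above concrete, here is a small plain-Python sketch of what 'relu', 'sigmoid' and 'softmax' compute. It only illustrates the formulas and is not how Keras implements them internally.

```python
import math

def relu(x):
    # Pass positive values through, clamp negatives to zero
    return max(0.0, x)

def sigmoid(x):
    # Squash any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    # Subtract the max first for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(relu(-3.0))     # 0.0
print(sigmoid(0.0))   # 0.5
print(probs)          # three probabilities that sum to 1 (up to rounding)
```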

#### Dropout Layer

Just like the activation layer, you'll often want to add dropout as a layer of its own. That's what this layer is for.

In [27]:
model.add(Dropout(0.3))

Dropout definition : `keras.layers.Dropout(rate, noise_shape=None, seed=None)`

- rate : Fraction of the input units to drop. A float between 0 and 1.
- noise_shape : If you want the same dropout mask to be shared along a particular dimension, i.e. you want dropout to work in a structured way, you use this. [More details here.](https://keras.io/layers/core/)
- seed : Integer used as the random seed, so that the dropout mask is reproducible.
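
A rough sketch of what dropout does at training time: each unit is zeroed with probability `rate`, and the surviving units are scaled up by `1 / (1 - rate)` so the expected sum stays the same. At inference time the layer passes its input through unchanged. The `dropout` function below is a hypothetical plain-Python illustration, not the Keras implementation.

```python
import random

def dropout(values, rate, seed=None):
    """Training-time dropout on a flat list of activations."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - rate)
    # Zero each unit with probability `rate`; scale up the survivors
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, seed=42)
print(dropped)  # a mix of 0.0 and roughly 1.43 (= 1 / 0.7)
```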

#### Convolutional Layers

Definition : `keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)`

Some parameters are same as *Dense* layer and they have the same meaning. 
Let's talk about other params that are exclusive to Conv2D.

- **filters** : Number of filters to use. This will decide the depth of output volume
- **kernel_size** : 2d shape describing the dimensions of filter matrix. Like, (3,3).
- **strides** : Could be a single number like 1 or 2, or a tuple of two numbers (if you wish to have different strides vertically and horizontally)
- **padding** : 'valid' or 'same'.
- **data_format** : This is important. Image tensors can be laid out either as (height, width, channels) or as (channels, height, width), and this parameter says which layout the input uses: `channels_last` or `channels_first`. If left as None, it falls back to the `image_data_format` value in your Keras config file, which is `channels_last` unless you have changed it.
- **dilation_rate** : This is for dilated convolutions. Again a single number or tuple of two numbers. By default (1,1). [Check this link](https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d) to understand dilated convolutions. The image below should be enough to get an idea of it though : 
    
![dilated-convolution](https://cdn-images-1.medium.com/max/1200/1*SVkgHoFoiMZkjy54zM_SUw.gif)

^ Dilated Convolutions
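
One way to read `dilation_rate`: a dilated kernel covers the same area as a larger, mostly empty kernel. The sketch below computes that effective size along one axis; `effective_kernel_size` is a hypothetical helper name used only here.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Span covered by a dilated kernel along one axis."""
    # (dilation_rate - 1) gaps are inserted between each pair of kernel taps
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)

print(effective_kernel_size(3, 1))  # 3 (ordinary convolution)
print(effective_kernel_size(3, 2))  # 5
print(effective_kernel_size(3, 3))  # 7
```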

In [36]:
# We'll use Conv2D (usual). To check out what Conv1D, Conv3D, etc do, check the other notebook I've written for comparisons.

cnnModel = Sequential()
cnnModel.add(Conv2D(5,(3,3),activation='relu', input_shape=(32,32,3)))
cnnModel.add(Conv2D(10,(3,3),activation='relu'))   # input_shape only required in first layer
cnnModel.add(Conv2D(20,(3,3), padding='same', activation='relu'))

cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
Total params: 2,420
Trainable params: 2,420
Non-trainable params: 0
_________________________________________________________________
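
The output shapes and parameter counts in the summary above can be reproduced by hand. This is a minimal sketch assuming stride 1 and no dilation for the three layers of `cnnModel`; the helper names are made up for these notes.

```python
import math

def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one spatial axis."""
    if padding == "valid":
        return (n - kernel) // stride + 1
    if padding == "same":
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")

def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Each filter has kh * kw * in_channels weights plus one bias."""
    kh, kw = kernel_size
    return kh * kw * in_channels * filters + (filters if use_bias else 0)

# Walk through the three Conv2D layers of cnnModel
shape = (32, 32, 3)
layers = [(5, "valid"), (10, "valid"), (20, "same")]
report = []
for filters, padding in layers:
    rows = conv_output_size(shape[0], 3, padding=padding)
    cols = conv_output_size(shape[1], 3, padding=padding)
    params = conv2d_param_count((3, 3), shape[2], filters)
    shape = (rows, cols, filters)
    report.append((shape, params))
    print(shape, params)
# (30, 30, 5) 140
# (28, 28, 10) 460
# (28, 28, 20) 1820
```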


There are other types of convolutional layers in Keras as well, like SeparableConv and Cropping1D. You could check [this post](https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d) to learn about different types of convolutions. However, I'm skipping them here.

#### Flattening Layers

Flattens the input. Does not affect the batch size.

In [39]:
cnnModel.add(Flatten())
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
Total params: 2,420
Trainable params: 2,420
Non-trainable params: 0
_________________________________________________________________


#### Custom Function Layer (lambda)

Wraps an arbitrary expression as a Layer object.

In [41]:
# Square the input
cnnModel.add(Lambda(lambda x: x ** 2))

In [42]:
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 15680)             0         
Total params: 2,420
Trainable params: 2,420
Non-trainable params: 0
_________________________________________________________________


#### Repeat Layer

This repeats the input vector n times: an input of shape (batch, features) becomes (batch, n, features).

In [48]:
cnnModel.add(RepeatVector(3))
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 15680)             0         
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 3, 15680)          0         
Total params: 2,420
Trainable params: 2,420
Non-trainable params: 0
_________________________________________________________________


#### Reshape Layer

You can reshape the output into any shape that has the same total number of elements. For example, we have (3,15680) as the size above, and 3 x 15680 = 12 x 3920, so it could also be (12,3920).
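
A quick way to check whether a target shape is legal is to compare element counts (the batch dimension is left out). The sketch below uses a hypothetical `can_reshape` helper and the shapes from this notebook; it needs Python 3.8+ for `math.prod`.

```python
import math

def can_reshape(old_shape, new_shape):
    """A reshape is valid only if the total number of elements is unchanged."""
    return math.prod(old_shape) == math.prod(new_shape)

print(math.prod((28, 28, 20)))                 # 15680, the size Flatten produced
print(can_reshape((3, 15680), (12, 3920)))     # True
print(can_reshape((12, 3920), (3920, 4, 3)))   # True
print(can_reshape((12, 3920), (100, 100)))     # False
```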

In [51]:
cnnModel.add(Reshape((12,3920)))
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 15680)             0         
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 3, 15680)          0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 12, 3920)          0         
Total para

#### Zero-Padding 2D

This layer can add rows and columns of zeros at the top, bottom, left and right side of an image tensor.
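
As an illustration of the operation on a single channel, here is a plain-Python sketch with a hypothetical `zero_pad_2d` helper. With `padding=(2,2)` the layer adds 2 rows on top and bottom and 2 columns on left and right, so an input of (3920, 4, 3) would come out as (3924, 8, 3).

```python
def zero_pad_2d(image, pad_rows, pad_cols):
    """Pad a 2D list (rows x cols) with zeros on all four sides."""
    width = len(image[0]) + 2 * pad_cols
    padded = [[0] * width for _ in range(pad_rows)]            # top rows
    for row in image:
        padded.append([0] * pad_cols + list(row) + [0] * pad_cols)
    padded.extend([0] * width for _ in range(pad_rows))        # bottom rows
    return padded

image = [[1, 2],
         [3, 4]]
padded = zero_pad_2d(image, 1, 1)
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```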

In [55]:
# ZeroPadding2D expects (rows, cols, channels), so reshape to 3 dimensions first.
cnnModel.add(Reshape((3920,4,3)))
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 15680)             0         
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 3, 15680)          0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 12, 3920)          0         
__________

In [56]:
# Now, let's add zero padding
cnnModel.add(ZeroPadding2D(padding=(2,2)))
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 15680)             0         
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 3, 15680)          0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 12, 3920)          0         
__________

#### Upsampling 2D

As with convolutions, upsampling comes in 1D, 2D and 3D variants. Since we're working with images (two spatial dimensions plus a channel axis), we use the 2D version.
Upsampling repeats the rows and columns of the data by size[0] and size[1] respectively.

`keras.layers.UpSampling2D(size=(2, 2), data_format=None)`
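
The repetition is easiest to see on a tiny single-channel example. Below is a plain-Python sketch using a hypothetical `upsample_2d` helper (nearest-neighbour repetition, no interpolation).

```python
def upsample_2d(image, size=(2, 2)):
    """Repeat each row size[0] times and each column size[1] times."""
    row_factor, col_factor = size
    out = []
    for row in image:
        stretched = [v for v in row for _ in range(col_factor)]
        out.extend(list(stretched) for _ in range(row_factor))
    return out

image = [[1, 2],
         [3, 4]]
upsampled = upsample_2d(image)
for row in upsampled:
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```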

In [59]:
cnnModel.add(UpSampling2D(size=(2,2)))
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 30, 30, 5)         140       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 28, 28, 10)        460       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 28, 28, 20)        1820      
_________________________________________________________________
flatten_1 (Flatten)          (None, 15680)             0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 15680)             0         
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 3, 15680)          0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 12, 3920)          0         
__________

#### Cropping 2D

This crops the image. You can provide `cropping` as just an int, or tuple of 2 ints, or tuple of 2 tuples of 2 ints.
- If int: the same symmetric cropping is applied to width and height.
- If tuple of 2 ints: interpreted as two different symmetric cropping values for height and width: (symmetric_height_crop, symmetric_width_crop).
- If tuple of 2 tuples of 2 ints: interpreted as ((top_crop, bottom_crop), (left_crop, right_crop))

In [72]:
# We'll start a new model because previous model has been fiddled around with a lot and the dimensions are not at all good to explain Cropping.
newModel = Sequential()
newModel.add(Cropping2D(cropping=((2, 2), (4, 4)),
                     input_shape=(28, 28, 3)))
# now newModel.output_shape == (None, 24, 20, 3)
newModel.add(Conv2D(64, (3, 3), padding='same'))
newModel.add(Cropping2D(cropping=((2, 2), (2, 2))))
# now newModel.output_shape == (None, 20, 16, 64)
newModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
cropping2d_6 (Cropping2D)    (None, 24, 20, 3)         0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 24, 20, 64)        1792      
_________________________________________________________________
cropping2d_7 (Cropping2D)    (None, 20, 16, 64)        0         
Total params: 1,792
Trainable params: 1,792
Non-trainable params: 0
_________________________________________________________________


**How did this work?**

Look at the last two layers. The input to the second cropping layer was a 24x20 image.
You crop off 2 pixels from the top, 2 from the bottom, 2 from the left and 2 from the right, which leaves 20x16.
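
The three accepted forms of `cropping` and the resulting shapes can be sketched as follows. The helper names `normalize_cropping` and `cropped_shape` are made up for this illustration.

```python
def normalize_cropping(cropping):
    """Expand the three accepted forms into ((top, bottom), (left, right))."""
    if isinstance(cropping, int):
        return ((cropping, cropping), (cropping, cropping))
    first, second = cropping
    if isinstance(first, int):
        # (symmetric_height_crop, symmetric_width_crop)
        return ((first, first), (second, second))
    return (tuple(first), tuple(second))

def cropped_shape(height, width, cropping):
    (top, bottom), (left, right) = normalize_cropping(cropping)
    return (height - top - bottom, width - left - right)

print(cropped_shape(28, 28, ((2, 2), (4, 4))))  # (24, 20)
print(cropped_shape(24, 20, ((2, 2), (2, 2))))  # (20, 16)
print(cropped_shape(24, 20, 2))                 # (20, 16)
```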

#### MaxPooling 2D

Max pooling operation for spatial data (images). Btw, similarly we have AveragePooling2D, which I'm not writing up separately.

Definition : `keras.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)`

In [76]:
# Remember that max pooling with f=2, s=2 (pool size 2, stride 2) halves the output's rows and columns.
newModel.add(MaxPooling2D(pool_size=2, strides=2))
newModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
cropping2d_6 (Cropping2D)    (None, 24, 20, 3)         0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 24, 20, 64)        1792      
_________________________________________________________________
cropping2d_7 (Cropping2D)    (None, 20, 16, 64)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 10, 8, 64)         0         
Total params: 1,792
Trainable params: 1,792
Non-trainable params: 0
_________________________________________________________________
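
Here is a small plain-Python sketch of 'valid' max pooling on a single channel, to show why pool size 2 with stride 2 halves each spatial dimension (20x16 became 10x8 above). `max_pool_2d` is a hypothetical helper, not the Keras layer.

```python
def max_pool_2d(image, pool_size=2, stride=2):
    """'valid' max pooling over a 2D list (rows x cols)."""
    rows, cols = len(image), len(image[0])
    out_rows = (rows - pool_size) // stride + 1
    out_cols = (cols - pool_size) // stride + 1
    return [
        [
            # Max over the pool_size x pool_size window starting at (r*stride, c*stride)
            max(
                image[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

image = [[1, 3, 2, 0],
         [4, 2, 1, 5],
         [7, 0, 3, 2],
         [1, 8, 6, 4]]
pooled = max_pool_2d(image)
print(pooled)  # [[4, 5], [8, 6]]
```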


#### GlobalMaxPooling2D

Definition : `keras.layers.GlobalMaxPooling2D(data_format=None)`
    
Here, it collapses the rows and columns dimensions. So, while the input is (batchsize, rows, cols, channels), the output is (batchsize, channels).
For each channel, it gives the single max value over all (row, column) positions!

In [78]:
newModel.add(GlobalMaxPooling2D())
newModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
cropping2d_6 (Cropping2D)    (None, 24, 20, 3)         0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 24, 20, 64)        1792      
_________________________________________________________________
cropping2d_7 (Cropping2D)    (None, 20, 16, 64)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 10, 8, 64)         0         
_________________________________________________________________
global_max_pooling2d_1 (Glob (None, 64)                0         
Total params: 1,792
Trainable params: 1,792
Non-trainable params: 0
_________________________________________________________________
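
To see how (rows, cols, channels) collapses to (channels,), here is a plain-Python sketch for one sample in `channels_last` layout. `global_max_pool_2d` is a hypothetical helper written for illustration.

```python
def global_max_pool_2d(feature_map):
    """feature_map is rows x cols x channels (channels_last), as nested lists."""
    channels = len(feature_map[0][0])
    return [
        # For each channel, take the max over every (row, col) position
        max(pixel[ch] for row in feature_map for pixel in row)
        for ch in range(channels)
    ]

# A 2x2 feature map with 2 channels
feature_map = [
    [[1, 9], [2, 0]],
    [[5, 3], [4, 7]],
]
pooled_channels = global_max_pool_2d(feature_map)
print(pooled_channels)  # [5, 9]
```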


#### Locally Connected 2D Convolution

The LocallyConnected2D layer works similarly to the Conv2D layer, except that weights are unshared, that is, a different set of filters is applied at each different patch of the input.

Definition : `keras.layers.LocallyConnected2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)`

[Read more here](https://keras.io/layers/local/#locallyconnected2d)

**Not writing example code here because it has never been of use to me till now**



#### Embedding Layer

This layer turns words into embedding vectors. It requires its input to be integer word indices rather than raw text. Keras itself provides a Tokenizer API for this, or you could use one_hot followed by pad_sequences to get the data into integer form. Once you have that, it can be passed to the Embedding layer.

[Read about a great example here](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

[Another great explanation here](https://stackoverflow.com/questions/45649520/explain-with-example-how-embedding-layers-in-keras-works)

**I'm skipping code for this since haven't used it anywhere yet.**

#### LeakyReLU

While the common activations are all available through the 'Activation' layer, some special ones have their own classes in `keras.layers`.
These are :
- LeakyReLU
- Softmax
- PReLU
- ELU
- ThresholdedReLU

LeakyReLU allows a small gradient when the unit is not active (x < 0), which can help avoid the "dead" units that plain ReLU sometimes produces.

Definition : `keras.layers.LeakyReLU(alpha=0.3)`
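
The function itself is tiny. A plain-Python sketch with the same default `alpha` as in the definition above:

```python
def leaky_relu(x, alpha=0.3):
    """Identity for x >= 0, a small slope alpha for x < 0."""
    return x if x >= 0 else alpha * x

print(leaky_relu(2.0))    # 2.0
print(leaky_relu(-2.0))   # -0.6
```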

#### BatchNormalization

Definition : `keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)`

Normalize the activations of the previous layer at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

**Understanding Parameters**:
- **axis**: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization.
- **momentum**: Momentum for the moving mean and the moving variance.
- **epsilon**: Small float added to variance to avoid dividing by zero.
- **center**: If True, add offset of beta to normalized tensor. If False, beta is ignored.
- **scale**: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
- **beta_initializer**: Initializer for the beta weight.
- **gamma_initializer**: Initializer for the gamma weight.
- **moving_mean_initializer**: Initializer for the moving mean.
- **moving_variance_initializer**: Initializer for the moving variance.
- **beta_regularizer**: Optional regularizer for the beta weight.
- **gamma_regularizer**: Optional regularizer for the gamma weight.
- **beta_constraint**: Optional constraint for the beta weight.
- **gamma_constraint**: Optional constraint for the gamma weight.
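
To connect `epsilon`, `gamma`, `beta` and `momentum` to the actual arithmetic, here is a rough training-mode sketch for a single feature column. It is a plain-Python illustration under the usual batch-norm formulas, with hypothetical helper names; it is not the Keras implementation.

```python
import math

def batch_norm_train(column, gamma=1.0, beta=0.0, epsilon=0.001):
    """Normalize one feature over the batch, then scale and shift."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    normalized = [
        gamma * (x - mean) / math.sqrt(var + epsilon) + beta
        for x in column
    ]
    return normalized, mean, var

def update_moving(moving, batch_stat, momentum=0.99):
    """Exponential moving average kept for use at inference time."""
    return momentum * moving + (1.0 - momentum) * batch_stat

out, mean, var = batch_norm_train([1.0, 2.0, 3.0, 4.0])
moving_mean = update_moving(0.0, mean)
print(mean, var)      # 2.5 1.25
print(out)            # values centered on 0 with spread close to 1
print(moving_mean)    # approximately 0.025
```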

#### OTHER INTERESTING LAYERS

[GaussianNoise](https://keras.io/layers/noise/#gaussiannoise) : Adds zero-centered Gaussian noise to its input. It works as a form of random data augmentation and helps reduce overfitting. Only active during training.

[AlphaDropout](https://keras.io/layers/noise/#alphadropout) : A variant of Dropout that keeps the mean and variance of its inputs at their original values, so that the self-normalizing property is preserved even after dropout.
               
**Sequence Models Related Layers**
- [RNN, SimpleRNN, GRU, LSTM, ConvLSTM2D, SimpleRNNCell, GRUCell, LSTMCell, CuDNNGRU, CuDNNLSTM](https://keras.io/layers/recurrent/)
 

### Defining loss, optimizer and metric for Model

Once we have defined the architecture for the model, the next step is to choose a loss function and an optimizer for it. The optimizer is typically some variant of gradient descent, and the loss can be any one of the many losses Keras provides.
We use the `compile` method of Keras to attach these to our model architecture.

In [80]:
modelA = Sequential()
modelA.add(Dense(5, input_dim=2))
modelA.add(Activation('relu'))
modelA.add(Dense(1))
modelA.add(Activation('sigmoid'))
modelA.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 5)                 15        
_________________________________________________________________
activation_2 (Activation)    (None, 5)                 0         
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 6         
_________________________________________________________________
activation_3 (Activation)    (None, 1)                 0         
Total params: 21
Trainable params: 21
Non-trainable params: 0
_________________________________________________________________


Finally, you can also specify metrics to collect while fitting your model, in addition to the loss function. Generally, the most useful additional metric to collect is accuracy for classification problems. The metrics to collect are specified by name in a list.

In [83]:
# Let's use stochastic gradient descent with mean squared error as the loss. Note that Keras' 'sgd' updates the weights once per mini-batch (the batch_size passed to fit), not once per example. The most basic setup!
modelA.compile(optimizer='sgd', loss='mse', metrics=['accuracy'])
modelA.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 5)                 15        
_________________________________________________________________
activation_2 (Activation)    (None, 5)                 0         
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 6         
_________________________________________________________________
activation_3 (Activation)    (None, 1)                 0         
Total params: 21
Trainable params: 21
Non-trainable params: 0
_________________________________________________________________


Below are some standard loss functions for different predictive model types:

- **Regression**: Mean Squared Error or 'mse'.
- **Binary Classification (2 class)**: Logarithmic Loss, also called cross entropy or 'binary_crossentropy'.
- **Multiclass Classification (>2 class)**: Multiclass Logarithmic Loss or 'categorical_crossentropy'.

#### Different Loss functions 

Some common ones that I've heard of :

- mean_squared_error
- mean_absolute_error
- mean_absolute_percentage_error
- mean_squared_logarithmic_error
- logcosh
- categorical_crossentropy
- sparse_categorical_crossentropy
- binary_crossentropy
- cosine_proximity
- poisson

[All loss functions available here](https://keras.io/losses/)
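
For intuition, three of the losses listed above can be written in a few lines of plain Python. This is only a sketch of the formulas; the clipping constant `EPS` is an assumption chosen here to avoid `log(0)`, and the function names just mirror the Keras loss names.

```python
import math

EPS = 1e-7  # assumed clipping constant to keep log() finite

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(y_true, y_pred):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def categorical_crossentropy(one_hot, probs):
    # Loss for a single sample: minus the log-probability of the true class
    return -sum(t * math.log(max(p, EPS)) for t, p in zip(one_hot, probs))

print(mean_squared_error([1.0, 0.0], [0.5, 0.5]))            # 0.25
print(binary_crossentropy([1, 0], [0.9, 0.1]))               # about 0.105
print(categorical_crossentropy([0, 1, 0], [0.2, 0.7, 0.1]))  # about 0.357
```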

#### Different Optimizers 

Some common ones I've seen being used : 

- SGD
- RMSprop
- Adam
- Adagrad

[All Optimizers mentioned here](https://keras.io/optimizers/)

#### Different metrics

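Some common ones I've used : 
- "accuracy" (Keras resolves this to binary_accuracy, categorical_accuracy or sparse_categorical_accuracy depending on the loss and the output shape)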
- binary_accuracy
- categorical_accuracy

[All metrics here](https://keras.io/metrics/)
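
As a sketch of what categorical accuracy measures: the fraction of samples whose highest predicted probability lands on the true class. The helpers below are hypothetical plain-Python stand-ins, not Keras functions.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def categorical_accuracy(y_true_one_hot, y_pred_probs):
    hits = sum(
        argmax(t) == argmax(p)
        for t, p in zip(y_true_one_hot, y_pred_probs)
    )
    return hits / len(y_true_one_hot)

y_true = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred = [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
acc = categorical_accuracy(y_true, y_pred)
print(acc)  # 2 of the 3 predictions are correct
```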

### Training a model

Okay, now that we've defined layers and losses and optimizers for a model, we've basically defined the model completely. All that is left is to feed (fit) it some data and let it train!
So, let's do that now. 
We will be using `fit` method for this purpose.

Definition : `fit(self, x=None, y=None, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None)`

**Understanding the parameters**:
- x: Numpy array of training data.
- y: Numpy array of target (label) data.
- batch_size: Number of samples per gradient update (the mini-batch size). Any integer or None; if left as None it defaults to 32.
- epochs: Number of epochs (full passes over the training data) to train for.
- verbose: 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch.
- callbacks: A list of callback objects to apply during training. These are really useful, and I've talked about them in the callbacks section below.
- validation_split: Float between 0 and 1. If you give 0.3, for example, then 30% of your training data is held out and used as a validation (dev) set instead of for training. The validation data is selected from the last samples in the x and y data provided, before shuffling.
- validation_data: If you have a separate validation set, you can specify it here as a tuple (x_devset, y_devset). Providing this overrides validation_split.
- shuffle: Whether to shuffle the training data before each epoch (True/False). True by default.
- class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
- sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only).
- initial_epoch: Epoch at which to start training (useful for resuming a previous training run).
- steps_per_epoch: Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined.
- validation_steps: Only relevant if steps_per_epoch is specified.

Mostly, batch_size, epochs, verbose, callbacks, validation_split and validation_data seem to be important params here.
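
How `validation_split` and `batch_size` translate into sample counts and weight updates per epoch can be sketched with a bit of arithmetic. The helper names are hypothetical, and the split rule (training samples first, validation taken from the end) is my reading of the behaviour described above.

```python
import math

def train_val_split_sizes(n_samples, validation_split):
    """Sizes of the training part and the held-out tail."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

def updates_per_epoch(n_train, batch_size):
    """Number of mini-batches (weight updates) in one epoch."""
    return math.ceil(n_train / batch_size)

n_train, n_val = train_val_split_sizes(60000, 0.2)
print(n_train, n_val)                    # 48000 12000
print(updates_per_epoch(n_train, 128))   # 375
```

With MNIST's 60000 training images, a 0.2 split and a batch size of 128, this gives the "Train on 48000 samples, validate on 12000 samples" line that Keras prints in the training run in these notes.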

In [107]:
# MNIST using keras basic DNN model
from keras.datasets import mnist
from keras.utils import np_utils

# loading and fixing data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# flatten each 28x28 image into a 784-long vector (must happen before its shape is printed)
x_train_mod = x_train.reshape(x_train.shape[0],784)
print("x_train shape : ",x_train_mod.shape)
print("y_train shape : ",y_train.shape)
# one-hot encode the 10 digit classes
y_train_mod = np_utils.to_categorical(y_train, 10)
print("y_train_mod shape : ",y_train_mod.shape)

# Define a simple model which takes inputs of shape (None, 784)
modelB = Sequential()
modelB.add(Dense(100, input_dim=784, activation='relu', name='hid1'))
modelB.add(Dense(25, activation='relu', name='hid2'))
modelB.add(Dense(10, activation='softmax', name='hid3'))
print(modelB.summary())

# configure loss & optimizer
modelB.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])

# Feed data and train!
modelB.fit(x_train_mod, y_train_mod, batch_size=128, epochs=10, verbose=1, validation_split=0.2)

x_train shape :  (60000, 784)
y_train shape :  (60000,)
y_train_mod shape :  (60000, 10)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
hid1 (Dense)                 (None, 100)               78500     
_________________________________________________________________
hid2 (Dense)                 (None, 25)                2525      
_________________________________________________________________
hid3 (Dense)                 (None, 10)                260       
Total params: 81,285
Trainable params: 81,285
Non-trainable params: 0
_________________________________________________________________
None
Train on 48000 samples, validate on 12000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x11e799eb8>

#### Callbacks in training (fit)

While using `fit` function, one of the parameters we can pass is callbacks, which accepts a list of `keras.callbacks.Callback` instances. These callbacks can help us log things at different epochs, save models at checkpoints, stop training if a metric isn't improving and many other useful things...

Here is a list of useful callbacks : 
- TerminateOnNaN : Callback that terminates training when a NaN loss is encountered. Sometimes you've coded something incorrectly and the loss becomes NaN after a while, so you should stop training then!
- ModelCheckpoint : Saves the model after every epoch, or only when a monitored metric improves if you pass `save_best_only=True`. [More Info here](https://keras.io/callbacks/#modelcheckpoint)
- EarlyStopping : Stops training when a monitored quantity has stopped improving. For example, it'll stop if val_acc stops improving. [Read here](https://keras.io/callbacks/#earlystopping)
- RemoteMonitor : Streams your logs to a server by making POST calls.
- TensorBoard : Writes a log for TensorBoard, which allows you to visualize dynamic graphs of your training and test metrics, as well as activation histograms for the different layers in your model.
- ReduceLROnPlateau : Reduces the learning rate when a metric has stopped improving.
- LambdaCallback : Callback for creating simple, custom callbacks on-the-fly. [More info](https://keras.io/callbacks/#lambdacallback)
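
To give a feel for what EarlyStopping decides, here is a plain-Python sketch of a patience rule applied to a list of validation losses. It is a simplified illustration under my own assumptions (lower is better, a hypothetical `early_stopping_epoch` helper), not the Keras callback itself.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

history = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]
print(early_stopping_epoch(history, patience=2))  # 5
print(early_stopping_epoch(history, patience=3))  # None
```

A larger `patience` tolerates longer plateaus, which is why the second call never stops: the loss improves again at epoch 6 before three bad epochs accumulate.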
                                                                                              


### Evaluating your model

Now's the time to check how your model performs!

In [109]:
# fix the test data as well
x_test_mod = x_test.reshape(x_test.shape[0],784)
y_test_mod = np_utils.to_categorical(y_test, 10)

# evaluate it
loss, accuracy = modelB.evaluate(x_test_mod, y_test_mod)
print("loss:",loss," | accuracy:",accuracy)

# predictions for the dataset (if I wish to check predictions myself)
predictions = modelB.predict(x_test_mod)
print("prediction for 0th test data example:",predictions[0])

loss: 2.17363247015  | accuracy: 0.8622
prediction for 0th test data example: [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.]
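
Each row returned by `predict` is a vector of 10 class probabilities, so the predicted digit is the index of the largest entry. Below is a plain-Python sketch of that decoding step and of the one-hot encoding that `to_categorical` performs on a single label; both helpers are hypothetical stand-ins.

```python
def to_one_hot(label, num_classes):
    """One-hot encode a single integer label."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def predicted_class(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=lambda i: probs[i])

# The prediction printed above for the 0th test example
prediction = [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]
print(predicted_class(prediction))        # 7
print(to_one_hot(7, 10) == prediction)    # True
```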


**Lastly, [here are some more great methods](https://keras.io/models/sequential/) to use with a Sequential model.**

## Practice : CNN on MNIST