# Deep Learning
Deep learning uses algorithms inspired by the structure and function of the brain's neural networks.


## Artificial Neural Networks
Artificial neural networks are a type of deep learning model based on the structure of the brain's neural networks.

They consist of connected units called neurons. Each neuron receives a signal, processes it, and passes it on to the neurons in the next layer.

Neurons are organized in layers: input, hidden and output.

### Keras Sequential model
A Sequential model is a linear stack of layers.

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Activation

In [4]:
model = Sequential([
    Dense(5, input_shape=(3,), activation='relu'), 
    # 5 is the number of nodes and input_shape is the shape of the input to the layer
    # First layer of the network requires the input shape
    # 'relu' is the activation function
    Dense(2, activation='softmax'),
])

## Layers

<ul>
    <li>Dense (fully connected)</li>
    <li>Convolutional</li>
    <li>Pooling</li>
    <li>Recurrent</li>
    <li>Normalization</li>
</ul>

### Dense (fully connected) layer
In a dense layer, every node of the previous layer is connected to every node of the current layer.

Each connection transfers data between two nodes. The input to a node's activation function is the weighted sum of all the inputs it receives.

Each connection has its own weight. Weights are real numbers (they are not restricted to the range 0 to 1) and are usually initialized to small random values. They are optimized during training to get the best output.
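
The weighted sum described above can be sketched in plain Python. This is only an illustration: the helper name `weighted_sum` and the example numbers are made up for this note and are not part of Keras.

```python
def weighted_sum(inputs, weights):
    """Return the weighted sum that is fed to a node's activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


# Three inputs from the previous layer, one (hypothetical) weight per connection.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, 0.25, 0.5]

z = weighted_sum(inputs, weights)
print(z)  # 1.0*0.5 + 2.0*0.25 + 3.0*0.5 = 2.5
```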

## Activation Function
An activation function operates on the input of a node (the weighted sum) and produces the node's output.

<b>ReLU function: </b>Returns 0 if the input is <= 0 and the input itself otherwise.

In [5]:
# Another method of defining a model is:
model = Sequential()
model.add(Dense(5, input_shape=(3,)))
model.add(Activation('relu')) # adding an activation layer separately
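
The two activation functions used in these notes, relu and softmax, can be written out by hand as a rough sketch. The function names below are chosen for this note; Keras applies its own implementations internally.

```python
import math


def relu(x):
    """Return 0 for inputs <= 0 and the input itself otherwise."""
    return max(0.0, x)


def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    # Subtracting the maximum first keeps math.exp from overflowing.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.5))
print(softmax([1.0, 2.0, 3.0]))
```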

## Training

The output of a deep learning model is determined by its weights.

During training, the model is supplied with both the inputs and the expected outputs. The weights are initialized randomly, and the model predicts an output from them.

The loss is then calculated from the difference between the actual output and the predicted output, using a loss function such as the mean squared error (MSE).

Each weight is then updated in the direction that reduces the loss:

New weight w1 = w1 - (learning rate) * [d(loss)/d(w1)]

The learning rate is often chosen between 0.0001 and 0.001.
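
As a minimal sketch of the update rule, consider a toy model with a single weight, `prediction = w * x`, and a squared-error loss. The function name and the numbers are assumptions made for illustration; a real optimizer such as Adam does more than this plain gradient step.

```python
def train_single_weight(x, y, w, learning_rate, steps):
    """Repeat the update w = w - d(loss)/d(w) * learning_rate."""
    for _ in range(steps):
        prediction = w * x
        # loss = (prediction - y) ** 2, so d(loss)/d(w) = 2 * (prediction - y) * x
        gradient = 2.0 * (prediction - y) * x
        w = w - gradient * learning_rate
    return w


# With x = 2 and y = 4 the best weight is 2, since 2 * 2 = 4.
w_final = train_single_weight(x=2.0, y=4.0, w=0.5, learning_rate=0.01, steps=200)
print(w_final)
```

The weight starts at 0.5 and moves a little closer to 2 on every step; a smaller learning rate would need more steps to get equally close.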

In [6]:
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy

In [7]:
model = Sequential([
    Dense(16, input_shape=(1,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
])

In [9]:
model.compile(Adam(learning_rate=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Adam is the optimizer that updates the weights, loss is the loss function to minimize,
# and accuracy is the metric used to judge the predicted output

In [10]:
model.loss

'sparse_categorical_crossentropy'

In [None]:
model.fit(scaled_train_samples, train_labels, batch_size=10, epochs=20, shuffle=True, verbose=2)
# first two arguments: the input data and the corresponding labels
# batch_size: number of samples passed to the model at a time
# epochs: number of passes over the whole training set
# shuffle: the order of the training data is randomized before each epoch
# verbose: level of the output
#     0 = silent, 1 = progress bar, 2 = one line per epoch

## Learning rate

In [12]:
model.optimizer.lr = 0.01

In [13]:
model.optimizer.lr

<tf.Variable 'learning_rate:0' shape=() dtype=float32, numpy=0.01>

## Validation set
We split the given dataset into three sets: training, validation and testing.

The validation set is a labelled set that the model is evaluated on after each epoch to check its accuracy on data it is not trained on.

The test set is held out until the end of training (and validation) and is used to test the final model. It can be unlabelled when we only need predictions, but labels are required to measure accuracy on it.

In [None]:
model.fit(scaled_train_samples, train_labels, validation_split=0.20, batch_size=10, epochs=20, shuffle=True, verbose=2)
# the last 20% of the training set (taken before shuffling) is set aside as the validation set
# or pass validation_data=(val_samples, val_labels) instead of validation_split
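
As far as the Keras documentation describes it, `validation_split` takes the validation samples from the end of the data before any shuffling. The helper below is a hypothetical plain-Python sketch of that behaviour, not the Keras code itself.

```python
import math


def split_for_validation(samples, labels, validation_split):
    """Hold out the last fraction of the data as a validation set."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    train = (samples[:split_at], labels[:split_at])
    validation = (samples[split_at:], labels[split_at:])
    return train, validation


samples = list(range(10))      # stand-in for ten training samples
labels = [s % 2 for s in samples]

(train_x, train_y), (val_x, val_y) = split_for_validation(samples, labels, 0.20)
print(len(train_x), val_x)     # 8 training samples, the last 2 held out
```

Because the split is positional, data that is sorted by label should be shuffled before it is passed to `fit`, otherwise the validation set may contain only one class.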

### Predicting

In [None]:
predictions = model.predict(test_samples, batch_size=10, verbose=0)
for i in predictions:
    print(i)

### Overfitting
The model predicts data from the training set really well but is not able to predict data from the test set.

Ways to reduce this are:
<ul>
    <li>Adding more data</li>
    <li>Data augmentation: Cropping, flipping, rotating images</li>
    <li>Reducing complexity of the model</li>
    <li>Dropout: randomly dropping some of the nodes during training</li>
</ul>

### Underfitting
The model is not able to predict even the data of the training set.

Ways to reduce this are:
<ul>
    <li>Increasing the complexity</li>
    <li>Decreasing the dropout rate (if the accuracy is high on the validation set but not on the training set)</li>
</ul>


## Supervised Learning
Each sample in the data is given along with its label.

Example: a model to predict gender based on height and weight.

In [14]:
model = Sequential([
    Dense(16, activation='relu', input_shape=(2,)),
    Dense(32, activation='relu'),
    Dense(2, activation='sigmoid')
])

In [16]:
model.compile(Adam(learning_rate=0.0001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [18]:
train_samples = [[150, 67], [130, 60], [200, 65], [125, 52], [230, 72], [181, 70]]

In [19]:
train_labels = [1, 1, 0, 1, 0, 0]

In [20]:
model.fit(x=train_samples, y=train_labels, batch_size=3, epochs=10, shuffle=True, verbose=2)

Epoch 1/10
2/2 - 1s - loss: 13.0168 - accuracy: 0.5000 - 504ms/epoch - 252ms/step
Epoch 2/10
2/2 - 0s - loss: 12.7970 - accuracy: 0.5000 - 9ms/epoch - 4ms/step
Epoch 3/10
2/2 - 0s - loss: 12.5044 - accuracy: 0.5000 - 9ms/epoch - 5ms/step
Epoch 4/10
2/2 - 0s - loss: 12.2127 - accuracy: 0.5000 - 7ms/epoch - 4ms/step
Epoch 5/10
2/2 - 0s - loss: 12.0175 - accuracy: 0.5000 - 7ms/epoch - 3ms/step
Epoch 6/10
2/2 - 0s - loss: 11.8126 - accuracy: 0.5000 - 8ms/epoch - 4ms/step
Epoch 7/10
2/2 - 0s - loss: 11.5276 - accuracy: 0.5000 - 7ms/epoch - 3ms/step
Epoch 8/10
2/2 - 0s - loss: 11.2765 - accuracy: 0.5000 - 6ms/epoch - 3ms/step
Epoch 9/10
2/2 - 0s - loss: 11.0395 - accuracy: 0.5000 - 6ms/epoch - 3ms/step
Epoch 10/10
2/2 - 0s - loss: 10.8307 - accuracy: 0.5000 - 8ms/epoch - 4ms/step


<keras.callbacks.History at 0x13165a6b0>

### Semi-supervised learning
Suppose we have a dataset that is only partially labelled.

We take the labelled data and train our model on it.

We then use the trained model to predict labels for the unlabelled data (this is called pseudo-labelling), and now we have a fully labelled dataset to train our model on.

### One-hot encoding
Encodes categorical data as vectors of 0s and 1s.

The length of each vector is equal to the number of categories.

Each index in the vector corresponds to one category. In every vector exactly one entry is 1 (the entry for the sample's category) and the rest are 0.
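
A small hand-written sketch of the encoding follows. The `one_hot` helper and the example categories are made up for illustration.

```python
def one_hot(labels, categories):
    """Encode each label as a vector with a single 1 at its category's index."""
    index_of = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        vector = [0] * len(categories)
        vector[index_of[label]] = 1
        vectors.append(vector)
    return vectors


categories = ["cat", "dog", "lizard"]
print(one_hot(["cat", "dog", "lizard", "dog"], categories))
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```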

## Convolutional Neural Network
A convolutional neural network has convolutional hidden layers.

A convolution involves applying one or more filters to a given image.

Most of the concepts discussed above also apply here.

### Zero Padding

In [23]:
from keras.layers import Conv2D, Flatten

In [25]:
model_valid = Sequential([
    Dense(16, activation='relu', input_shape= (20, 20, 3)), 
    Conv2D(32, kernel_size=(3, 3), activation='relu', padding='valid'), # valid is default
    Conv2D (64, kernel_size=(5, 5), activation='relu', padding='valid'), 
    Conv2D(128, kernel_size=(7, 7), activation='relu', padding='valid'),
    Flatten(), 
    Dense(2, activation='softmax')
])

In [26]:
model_valid.summary() # at the end output becomes 8x8

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_15 (Dense)            (None, 20, 20, 16)        64        
                                                                 
 conv2d_6 (Conv2D)           (None, 18, 18, 32)        4640      
                                                                 
 conv2d_7 (Conv2D)           (None, 14, 14, 64)        51264     
                                                                 
 conv2d_8 (Conv2D)           (None, 8, 8, 128)         401536    
                                                                 
 flatten_1 (Flatten)         (None, 8192)              0         
                                                                 
 dense_16 (Dense)            (None, 2)                 16386     
                                                                 
Total params: 473,890
Trainable params: 473,890
Non-tr

In [27]:
model_same = Sequential([
    Dense(16, activation='relu', input_shape= (20, 20, 3)), 
    Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same'), 
    Conv2D (64, kernel_size=(5, 5), activation='relu', padding='same'), 
    Conv2D(128, kernel_size=(7, 7), activation='relu', padding='same'),
    Flatten(), 
    Dense(2, activation='softmax')
])

In [30]:
model_same.summary() # shape is maintained

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_17 (Dense)            (None, 20, 20, 16)        64        
                                                                 
 conv2d_9 (Conv2D)           (None, 20, 20, 32)        4640      
                                                                 
 conv2d_10 (Conv2D)          (None, 20, 20, 64)        51264     
                                                                 
 conv2d_11 (Conv2D)          (None, 20, 20, 128)       401536    
                                                                 
 flatten_2 (Flatten)         (None, 51200)             0         
                                                                 
 dense_18 (Dense)            (None, 2)                 102402    
                                                                 
Total params: 559,906
Trainable params: 559,906
Non-tr
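
The output shapes in the two summaries above can be reproduced with a little arithmetic. The sketch below assumes a stride of 1 (no stride is set in the models above), and the helper name is made up for this note.

```python
def conv_output_size(input_size, kernel_size, padding):
    """Spatial output size of a convolution with stride 1."""
    if padding == "valid":
        # No padding: the filter must fit entirely inside the image.
        return input_size - kernel_size + 1
    if padding == "same":
        # Zero padding is added so that the output keeps the input size.
        return input_size
    raise ValueError("padding must be 'valid' or 'same'")


for padding in ("valid", "same"):
    size = 20  # the input images are 20x20
    sizes = []
    for kernel_size in (3, 5, 7):
        size = conv_output_size(size, kernel_size, padding)
        sizes.append(size)
    print(padding, sizes)
# valid [18, 14, 8]
# same [20, 20, 20]
```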

## Max Pooling
Max pooling reduces the dimensions of the image.

This is done by sliding a filter over the image that returns the maximum pixel value among the pixels it is currently covering.

We also define a stride value that determines by how many pixels the filter moves after each operation.

<b>Uses: </b>Scaling down the image and thus reducing the computational load.

In [31]:
from keras.layers import MaxPooling2D

In [32]:
model = Sequential ([
    Dense(16, activation='relu', input_shape= (20,20, 3)),
    Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same'), 
    MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'),
    Conv2D(64, kernel_size=(5, 5), activation='relu', padding='same'),
    Flatten(),
    Dense(2, activation='softmax'),
])
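
To see what the pooling layer does to the pixel values, here is a plain-Python sketch of 2x2 max pooling with a stride of 2 on a single-channel image. The function name and the 4x4 example image are made up for illustration.

```python
def max_pool_2d(image, pool_size=2, stride=2):
    """Slide a pool_size x pool_size window over the image and keep each maximum."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for c in range(0, cols - pool_size + 1, stride):
            window = [
                image[r + i][c + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(max_pool_2d(image))  # [[6, 4], [8, 9]]
```

The 4x4 image becomes 2x2, which is the halving of height and width that `MaxPooling2D(pool_size=(2, 2), strides=2)` applies to each channel.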

## Backpropagation
To calculate the derivative of the loss function with respect to a certain weight, we need a function relating the weight to the loss.

Backpropagation is the process by which we obtain this relation: the chain rule is applied backwards from the output layer through each earlier layer.

#### What if the gradient is vanishing?
If the gradient is really small then the corresponding weight will be changed by a really small value. Thus the weight will not change considerably and will not get close to its optimal value even after many epochs.

#### Gradient explosion
If the gradient is really large then the weight will change by a very large value after each update, so it is unlikely to settle at an optimal value.
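
One common way these two problems arise: by the chain rule, the gradient for a weight in an early layer is a product of one factor per later layer. The sketch below is a simplification that assumes every layer contributes the same factor; the factors 0.25 (the largest value the sigmoid derivative can take) and 1.5 are chosen only for illustration.

```python
def chained_gradient(per_layer_factor, num_layers):
    """Multiply one factor per layer, as the chain rule does in backpropagation."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= per_layer_factor
    return gradient


learning_rate = 0.01

vanishing = chained_gradient(0.25, 10)   # factors below 1 shrink the product
exploding = chained_gradient(1.5, 10)    # factors above 1 blow it up

print(vanishing, learning_rate * vanishing)  # tiny gradient, tiny weight update
print(exploding, learning_rate * exploding)  # large gradient, large weight update
```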

### Weight Initialization
##### Xavier Initialization
If we initialize the weights randomly with mean 0 and variance 1, then the weighted sum of n inputs [n = number of nodes in the previous layer] has a variance of about n and an S.D. of sqrt(n), assuming inputs of roughly unit size.

If we pass this weighted sum to the activation function, its magnitude is very likely to be much greater than 1. With the sigmoid function, the derivative at such points is close to 0, so the gradient vanishes.

Thus we need to bring the variance of the weights down to 1/n, so that the variance of the weighted sum is close to 1. This is done by multiplying the randomized weights by 1/sqrt(n) for sigmoid and by sqrt(2/n) for relu.
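
The effect of the 1/sqrt(n) scaling can be checked with a small simulation. This is a sketch under simplifying assumptions: every input is fixed at 1.0, the weights are drawn from a standard normal distribution, and the function name is made up for this note.

```python
import math
import random
import statistics


def weighted_sum_variance(n, scale, trials=2000, seed=0):
    """Estimate the variance of a weighted sum of n inputs that are all 1.0."""
    rng = random.Random(seed)
    sums = []
    for _ in range(trials):
        # Each weight is a standard normal sample multiplied by `scale`.
        sums.append(sum(rng.gauss(0.0, 1.0) * scale for _ in range(n)))
    return statistics.pvariance(sums)


n = 100
print(weighted_sum_variance(n, scale=1.0))                  # roughly n
print(weighted_sum_variance(n, scale=1.0 / math.sqrt(n)))   # roughly 1
```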

In [34]:
model = Sequential ([
    Dense(16, input_shape=(1,5), activation='relu'),
    Dense(32, activation='relu', kernel_initializer='glorot_uniform'), # glorot_uniform is the default; glorot_normal can also be used
    Dense (2, activation='softmax')
])

### Bias
The bias is similar to a threshold value that determines whether the given neuron will fire or not.

Biases are learnable.

For relu without a bias, the threshold is 0: any weighted sum at or below 0 gives an output of 0. To shift this threshold, a bias is added to the weighted sum of the inputs before the activation function is applied, so a bias of b moves the threshold to -b.
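
A short sketch of how a bias changes whether a relu neuron fires, assuming the common convention that the bias is added to the weighted sum before the activation. The helper names and numbers are made up for illustration.

```python
def relu(x):
    return max(0.0, x)


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, plus the bias, passed through relu."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return relu(z + bias)


inputs = [1.0, 1.0]
weights = [-0.5, 0.25]                      # weighted sum is -0.25

print(neuron_output(inputs, weights, 0.0))  # 0.0, the neuron does not fire
print(neuron_output(inputs, weights, 1.0))  # 0.75, the bias lets it fire
```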


## Learnable Parameters
Learnable parameters are the values that are changed during the training of a model.

They include the weights and biases.

For any layer in a model, the number of learnable parameters is given by (inputs x outputs + biases).

For a dense layer, inputs is the number of inputs from the previous layer, outputs is the number of nodes, and there is one bias per node.

For a convolutional layer, inputs is the number of filters (channels) in the previous layer, outputs is the number of filters in the current layer multiplied by the size of each filter (height x width), and there is one bias per filter in the current layer.
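
These formulas can be checked against the `model_valid.summary()` output in the Zero Padding section. The two helper functions below are written for this note and are not Keras functions.

```python
def dense_params(num_inputs, num_nodes):
    """inputs x outputs weights plus one bias per node."""
    return num_inputs * num_nodes + num_nodes


def conv2d_params(in_channels, num_filters, kernel_height, kernel_width):
    """One kernel_height x kernel_width x in_channels filter per output, plus one bias each."""
    return kernel_height * kernel_width * in_channels * num_filters + num_filters


layer_params = [
    dense_params(3, 16),             # Dense(16) on 3 input channels
    conv2d_params(16, 32, 3, 3),     # Conv2D(32, 3x3)
    conv2d_params(32, 64, 5, 5),     # Conv2D(64, 5x5)
    conv2d_params(64, 128, 7, 7),    # Conv2D(128, 7x7)
    dense_params(8 * 8 * 128, 2),    # Dense(2) on the flattened 8x8x128 output
]
print(layer_params)       # [64, 4640, 51264, 401536, 16386]
print(sum(layer_params))  # 473890
```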



### L2 Regularization
We add a term x to the loss function that penalizes large weights. This reduces the complexity of our model and reduces overfitting.

x is the sum of the squared norms of the weight matrices multiplied by l/2m,

where l is the regularization parameter and m is the number of input samples.

The idea is that large weights make the loss high, so training drives the weights towards small values, to the point where the effect of some nodes or layers may be nearly cancelled.
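
The penalty term x can be computed by hand as follows. This sketch follows the formula in these notes, with `lam` standing for l; the function name and the weight values are made up. As far as I know, the Keras `regularizers.l2` penalty is the regularization parameter times the sum of squared weights, without the 1/2m factor, so the numbers will not match Keras exactly.

```python
def l2_penalty(weight_matrices, lam, m):
    """Sum of squared weights over all matrices, multiplied by lam / (2 * m)."""
    squared_norm_sum = sum(
        w * w
        for matrix in weight_matrices
        for row in matrix
        for w in row
    )
    return lam / (2 * m) * squared_norm_sum


# Two small, made-up weight matrices.
weights = [
    [[1.0, -2.0], [0.5, 0.0]],
    [[3.0]],
]
penalty = l2_penalty(weights, lam=0.01, m=10)
print(penalty)  # about 0.007125, which is added to the loss
```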

In [35]:
from keras import regularizers

In [36]:
model = Sequential([
    Dense(16, activation='relu', input_shape=(2,)),
    Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01)), # 0.01 is the reg. param.
    Dense(2, activation='sigmoid')
])

### Batch Size
The number of samples passed to the model at a time. A larger batch size generally allows faster training, but a very large batch size can exceed the available memory, and it may also reduce how well the model generalizes.


### Batch Normalization
Standardization: x = (x - mean)/s.d.

This brings the given data onto a common scale.

Non-normalized data can cause the exploding gradient problem, as the ranges of the different data series are not the same or even similar.

Batch norm. applies this to the output of a layer in the model, batch by batch, which keeps outlying values from dominating. This is done in the following way:

x = (x - mean)/s.d.

and then x*g + b

where g and b are learnable parameters rather than fixed constants. They set the new s.d. and mean of the normalized output, so these too are effectively learned during training.
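
The two steps can be sketched for one batch of values from a single node. The small `epsilon` that guards against division by zero is an assumption added here, and the sketch covers only the training-time calculation on one batch, not everything the Keras layer does.

```python
import math


def batch_norm(values, g=1.0, b=0.0, epsilon=1e-5):
    """Standardize a batch of values, then scale by g and shift by b."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [g * (v - mean) / math.sqrt(variance + epsilon) + b for v in values]


batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))                # mean close to 0, s.d. close to 1
print(batch_norm(batch, g=2.0, b=5.0))  # mean close to 5, s.d. close to 2
```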

In [37]:
from keras.layers import BatchNormalization

In [38]:
model = Sequential([
    Dense(16, activation='relu', input_shape=(2,)),
    Dense(32, activation='relu'),
    BatchNormalization(axis=1),
    Dense(2, activation='sigmoid')
])