## Dropout is a regularization technique that helps prevent overfitting. It randomly drops some nodes/neurons during training, which keeps the network from fitting the training samples too closely (overfitting). Dropout can also be viewed as an approximate form of ensemble methods, or bagging.

### What are Ensemble Methods or Bagging? Why do we call Dropout one of them?

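An ensemble method trains several different models independently of each other and combines their outputs, for example by voting or averaging, to choose the final prediction. Bagging (bootstrap aggregating) is the ensemble method in which each model is trained on its own resampled copy of the training set.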
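As a rough illustration of the voting step, the sketch below combines hard class predictions from several models by majority vote. The `predictions` array and the `majority_vote` helper are made up for this example and are not part of any library used in this notebook.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent class label per sample.

    predictions: integer array of shape (n_models, n_samples).
    Ties go to the smallest label (np.argmax picks the first maximum).
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # counts[c, s] = number of models that predicted class c for sample s
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

# Three hypothetical models, four samples each
predictions = np.array([
    [8, 3, 1, 0],
    [8, 5, 1, 0],
    [9, 3, 7, 0],
])
votes = majority_vote(predictions)
print(votes)  # [8 3 1 0]
```

For models that output class probabilities, averaging the probabilities and taking the arg-max is a common alternative to hard voting.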

### How do Ensemble Methods generalize to the test set?

Before answering this question, let's look at how the different models can be obtained:

1) Using different algorithms or different hyperparameters
2) Using different datasets (subsets) constructed from the original dataset

As per my exploration, point 2 provides better generalization, although I have not found a formal justification for it. The idea of point 2 is to build several datasets, each the same size as the original, by sampling from the original dataset with replacement. With high probability, each resampled dataset misses some examples from the original dataset and contains several duplicates.

A classic example is given by Ian Goodfellow in his Deep Learning book. Say we need to detect the number 8. Model 1, trained on the samples 8, 6, 8, learns that a circle/loop on top means the digit is an 8. Model 2, trained on the samples 9, 9, 8, learns that a circle/loop at the bottom means the digit is an 8. If we average the scores of the two models, the combined detector is confident only when both loops are present, which is the case for an 8. Since each model learns slightly different features, the ensemble tends to generalize better to the test set than either model alone.
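The resampling in point 2 can be sketched in a few lines of NumPy. This is only an illustration: the dataset is a stand-in array of indices, and the seed and size are arbitrary choices. It shows that a resample of the same size as the original contains only about 63% of the distinct original examples (the limit is 1 - 1/e), with the rest of the slots filled by duplicates.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n = 10_000
original = np.arange(n)  # stand-in for the indices of the training samples

# Bootstrap resample: draw n indices with replacement from the original n
bootstrap = rng.choice(original, size=n, replace=True)

unique_fraction = np.unique(bootstrap).size / n
print(f"distinct originals in the resample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e:          {1 - np.exp(-1):.3f}")
```

Training one model per resample and combining their predictions gives the bagged ensemble described above.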

### What is the Problem with the ensemble methods stated above? 
    
Simply put, they need more memory and computation, especially for larger networks, since multiple models have to be trained and evaluated for every prediction. What if we could approximate this process in a single training loop, at roughly the cost of training one model? That's where dropout comes in.

### What is Dropout and how can it be achieved?

Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Our objective is to drop some percentage of the neurons/nodes during forward propagation; that percentage is a hyperparameter to be configured. To achieve this, we create a mask, usually a vector of 0's and 1's drawn from a Bernoulli distribution (a binomial with a single trial), and multiply it element-wise with the layer output. The zeros in the mask randomly drop features/neurons from the given layer. In other words, dropout is a regularization technique where, during each iteration of gradient descent, we drop a set of neurons selected at random. By "drop" we mean that we act as if they do not exist for that iteration.
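The masking step can be sketched as follows. The names `keep_prob`, `activations` and `dropped` are illustrative assumptions, and the layer output is just random numbers standing in for real activations.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

keep_prob = 0.5                         # assumed probability of keeping a unit
activations = rng.normal(size=(4, 6))   # pretend layer output: 4 samples, 6 units

# One independent Bernoulli draw per activation: 1 = keep, 0 = drop
mask = rng.binomial(1, keep_prob, size=activations.shape)
dropped = activations * mask

print(mask)
print(dropped)
```

A fresh mask is drawn on every forward pass, so each training iteration effectively trains a different sub-network that shares its weights with all the others.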

Each neuron is dropped at random with some fixed probability 1-p and kept with probability p. The value of p may be different for each layer in the neural network. A drop probability (1-p) of 0.5 for the hidden layers and 0.0 for the input layer (no dropout) has been shown to work well on a wide range of tasks [1]. Note that the `Dropout` class in the code below takes the drop probability, not the keep probability, as its argument `p`.

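During evaluation (and prediction), we do not ignore any neurons, i.e. no dropout is applied. Instead, the output of each neuron is multiplied by p. This is done so that the input to the next layer has the same expected value as it had during training. An equivalent alternative, often called inverted dropout, divides the kept activations by p during training instead, so the evaluation pass needs no rescaling.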
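The statement about matching expected values can be checked numerically. In the sketch below, `p` is the keep probability as defined in the text, while the activations `x`, the number of sampled masks and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

p = 0.8                          # keep probability
x = np.array([1.0, 2.0, 3.0])    # fixed activations of one layer

# Training: average the masked activations over many random masks
masks = rng.binomial(1, p, size=(100_000, x.size))
train_mean = (masks * x).mean(axis=0)

# Evaluation: no mask, scale by p instead
eval_output = p * x

print(train_mean)    # approximately [0.8, 1.6, 2.4]
print(eval_output)
```

The two printed vectors agree up to sampling noise, which is why scaling by p at evaluation time keeps the next layer's input on the scale it saw during training.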

To give a real-world example from the Deep Learning book: the power of dropout arises from the fact that the masking noise is applied to hidden units. If the model learns a hidden unit h that detects a face by finding the nose, then dropping h corresponds to erasing the information that there is a nose in the image. The model must learn another hidden unit that either redundantly encodes the presence of a nose or detects the face by another feature, such as the mouth.

The book also notes that dropout is less effective when only extremely few labeled training samples are available.



In [37]:
from DeepLearnerBase import Layer 
import numpy as np

In [48]:
class Dropout(Layer):
    def __init__(self, p = 0.5):
        # p: probability of dropping a neuron from the layer
        self.p = p
        self.mask = None
        
    @property
    def shape(self):
        return (self.inputshape ,self.outputshape())
        
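    def forward(self, X, training=True):
        if not training:
            # evaluation: inverted dropout needs no rescaling here
            return X

        keep = 1 - self.p
        # inverted dropout: zero each unit with probability p and scale the
        # kept units by 1/keep so the expected activation stays unchanged
        self.mask = np.random.binomial(1, keep, size=X.shape) / keep
        return X * self.mask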
    
    def backward(self, grad):
        return grad * self.mask
    
    def outputshape(self):
        return self.inputshape  
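A dropout layer of this kind can be sanity-checked outside the framework with plain NumPy. The sketch below does not use DeepLearnerBase; `inverted_dropout_forward` and `inverted_dropout_backward` are hypothetical stand-alone helpers written for this check, under the assumption that the layer implements inverted dropout with a drop probability.

```python
import numpy as np

def inverted_dropout_forward(X, drop_prob, rng, training=True):
    """Return (output, mask); mask is None when no dropout is applied."""
    if not training or drop_prob == 0.0:
        return X, None
    keep_prob = 1.0 - drop_prob
    # Kept units are scaled by 1/keep_prob so the expected output equals X
    mask = rng.binomial(1, keep_prob, size=X.shape) / keep_prob
    return X * mask, mask

def inverted_dropout_backward(grad, mask):
    """Route the upstream gradient through the mask used in the forward pass."""
    return grad if mask is None else grad * mask

rng = np.random.default_rng(0)  # arbitrary seed
X = np.ones((1000, 100))

out, mask = inverted_dropout_forward(X, drop_prob=0.25, rng=rng)
print("mean activation in training:", out.mean())   # close to 1.0
print("fraction of zeros:", (out == 0).mean())      # close to 0.25

# Evaluation leaves the input untouched
eval_out, _ = inverted_dropout_forward(X, drop_prob=0.25, rng=rng, training=False)

# The gradient is zero exactly where the activation was dropped
grad_in = inverted_dropout_backward(np.ones_like(X), mask)
```

If the mean activation during training drifts far from the evaluation value, the scaling factor is the first thing to check.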

In [49]:
from scipy.io import loadmat
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from DeepLearnerBase import Sequential, Dense, Activation, CrossEntropyForSoftMax,relu,softmax,SGD
%matplotlib inline

In [50]:
data = loadmat("data/handwritten.mat")
print(data['X'].shape)
print(data['y'].shape)

(5000, 400)
(5000, 1)


In [51]:
X = data['X']
y =  data['y']

In [52]:
X_train, X_valid, y_train, y_valid = train_test_split(
            X,y, test_size=0.20, random_state=42)
print(X_train.shape)
print(X_valid.shape)

(4000, 400)
(1000, 400)


In [53]:
optimizer = SGD(learning_rate =  0.1,momentum=0.9)
loss = CrossEntropyForSoftMax()

In [54]:
model = Sequential([    
    Dense(100),
    Activation(relu), 
    Dropout(0.25),
    Dense(50),
    Activation(relu), 
    Dropout(0.25),
    Dense(10),
    Activation(softmax)    
],  optimizer, loss, X.shape[1])

In [55]:
model.summary()

+---------------+
| Model Summary |
+---------------+
Input Shape: 400
+------------+-------------+--------------+------------+
| Layer Name | Input Shape | Output Shape | Shape      |
+------------+-------------+--------------+------------+
| Dense      | 400         | 100          | (400, 100) |
| relu       | 100         | 100          | (100, 100) |
| Dropout    | 100         | 100          | (100, 100) |
| Dense      | 100         | 50           | (100, 50)  |
| relu       | 50          | 50           | (50, 50)   |
| Dropout    | 50          | 50           | (50, 50)   |
| Dense      | 50          | 10           | (50, 10)   |
| softmax    | 10          | 10           | (10, 10)   |
+------------+-------------+--------------+------------+


In [56]:
model.fit(X_train,y_train,X_valid,y_valid,epochs= 10000,batchsize= 1000)

  0% (3 of 10000) |                       | Elapsed Time: 0:00:00 ETA:  0:13:08

Epoch# 0 Training Loss:2.0665242729444424 Validation Loss: 1.9476114298103522 Training Accuracy:0.313 Validation Accuracy:0.364


 10% (1003 of 10000) |##                  | Elapsed Time: 0:01:05 ETA:  0:09:30

Epoch# 1000 Training Loss:0.08437067691005729 Validation Loss: 0.3621567978584943 Training Accuracy:0.996 Validation Accuracy:0.946


 20% (2003 of 10000) |####                | Elapsed Time: 0:02:10 ETA:  0:08:34

Epoch# 2000 Training Loss:0.08459729827381135 Validation Loss: 0.37444662687524183 Training Accuracy:0.997 Validation Accuracy:0.942


 30% (3004 of 10000) |######              | Elapsed Time: 0:03:15 ETA:  0:07:20

Epoch# 3000 Training Loss:0.08900893040216284 Validation Loss: 0.41145091629614694 Training Accuracy:0.999 Validation Accuracy:0.942


 40% (4001 of 10000) |########            | Elapsed Time: 0:04:34 ETA:  0:11:24

Epoch# 4000 Training Loss:0.09983547613265839 Validation Loss: 0.4762453704234197 Training Accuracy:0.996 Validation Accuracy:0.944


 50% (5003 of 10000) |##########          | Elapsed Time: 0:06:06 ETA:  0:06:21

Epoch# 5000 Training Loss:0.09533417242197209 Validation Loss: 0.4894437915492644 Training Accuracy:0.999 Validation Accuracy:0.938


 60% (6004 of 10000) |############        | Elapsed Time: 0:07:14 ETA:  0:04:35

Epoch# 6000 Training Loss:0.09643112176253135 Validation Loss: 0.5064418398568445 Training Accuracy:1.0 Validation Accuracy:0.945


 70% (7003 of 10000) |##############      | Elapsed Time: 0:08:21 ETA:  0:03:43

Epoch# 7000 Training Loss:0.09983487251005285 Validation Loss: 0.47545355851497206 Training Accuracy:1.0 Validation Accuracy:0.948


 80% (8003 of 10000) |################    | Elapsed Time: 0:09:32 ETA:  0:02:50

Epoch# 8000 Training Loss:0.10517273240045807 Validation Loss: 0.4959175795441569 Training Accuracy:0.998 Validation Accuracy:0.947


 90% (9002 of 10000) |##################  | Elapsed Time: 0:10:47 ETA:  0:01:28

Epoch# 9000 Training Loss:0.10428848722728072 Validation Loss: 0.5535066356849893 Training Accuracy:1.0 Validation Accuracy:0.945


100% (10000 of 10000) |###################| Elapsed Time: 0:12:01 Time: 0:12:01


## Dropout is now a very common component of neural networks, and it is especially effective with larger networks and bigger datasets.