# Deep Dive Into Deep Learning

# Recap

- ### Multi-layer Perceptron (MLP)
- ### Automatic feature extraction
- ### Prediction function, loss function
- ### Optimization algorithm, optimization method

# Goals Of This Class

- ### Advanced classification using softmax/cross-entropy
- ### Understanding of backpropagation algorithm
- ### Hyper-parameter tuning

### Dataset background

The Otto Group is one of the world’s biggest e-commerce companies. For the Otto Group, a consistent analysis of the performance of products is crucial. However, due to its diverse global infrastructure, many identical products get classified differently. For this problem, we have a dataset of more than 200,000 products with 93 features, and each product is classified into one of 9 categories.

The objective is to build a predictive model which is able to distinguish between 9 main product categories. 


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import math
import seaborn as sns
import keras 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder 
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation
import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


## Changes wrt the previous problem

- ### More features, larger dataset
- ### Multi-class classification

Binary vs Multiclass Classification

<img src="../images/binary-vs-multiclass-0.png" width="70%" />

In [2]:
def load_data(path, train=True):
    """Load data from a CSV File
    
    Parameters
    ----------
    path: str
        The path to the CSV file
        
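    train: bool (default True)
        Whether the file holds *training data*.
        If True, rows are shuffled and the last column is returned as labels.
        If False, the first column is returned as sample ids.
        
    Return
    ------
    X: numpy.ndarray 
        The features as a 2-D array of floats
    labels_or_ids: numpy.ndarray
        Class labels (train=True) or sample ids (train=False)
    df: pandas.DataFrame
        The raw data frame as read from the CSV file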
    """
    df = pd.read_csv(path)
    
    X = df.values.copy()
    if train:
        np.random.shuffle(X)
        X, labels = X[:, 1:-1].astype(np.float32), X[:, -1]
        return X, labels, df
    else:
        X, ids = X[:, 1:].astype(np.float32), X[:, 0].astype(str)
        return X, ids, df


In [3]:
def preprocess_data(X, scaler=None):
    """Standardise the input features by removing the mean 
    and scaling to unit variance"""
    if not scaler:
        scaler = StandardScaler()
        scaler.fit(X)
    X = scaler.transform(X)
    return X, scaler


def preprocess_labels(labels, encoder=None, categorical=True):
    """Encode labels as integers between 0 and `n_classes - 1`
    (one-hot encoded if `categorical` is True)"""
    if not encoder:
        encoder = LabelEncoder()
        encoder.fit(labels)
    y = encoder.transform(labels).astype(np.int32)
    if categorical:
        y = np_utils.to_categorical(y)
    return y, encoder

In [4]:
print("Loading data...")
X, labels, dataset = load_data('../data/train.csv', train=True)

dataset.describe()


Loading data...


Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_84,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93
count,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,...,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0,61878.0
mean,30939.5,0.38668,0.263066,0.901467,0.779081,0.071043,0.025696,0.193704,0.662433,1.011296,...,0.070752,0.532306,1.128576,0.393549,0.874915,0.457772,0.812421,0.264941,0.380119,0.126135
std,17862.784315,1.52533,1.252073,2.934818,2.788005,0.438902,0.215333,1.030102,2.25577,3.474822,...,1.15146,1.900438,2.681554,1.575455,2.115466,1.527385,4.597804,2.045646,0.982385,1.20172
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,15470.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,30939.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,46408.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,61878.0,61.0,51.0,64.0,70.0,19.0,10.0,38.0,76.0,43.0,...,76.0,55.0,65.0,67.0,30.0,61.0,130.0,52.0,19.0,87.0


In [5]:
# Quick look at the data
dataset.head()


Unnamed: 0,id,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,feat_9,...,feat_85,feat_86,feat_87,feat_88,feat_89,feat_90,feat_91,feat_92,feat_93,target
0,1,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,Class_1
1,2,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,Class_1
2,3,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,Class_1
3,4,1,0,0,1,6,1,5,0,0,...,0,1,2,0,0,0,0,0,0,Class_1
4,5,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,Class_1


In [None]:
# Plot of the distribution of each feature
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width,height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if dataset.dtypes[column] == object:  # np.object was removed in recent NumPy releases
            g = sns.countplot(y=column, data=dataset)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            g = sns.distplot(dataset[column])
            plt.xticks(rotation=25)
    
plot_distribution(dataset, cols=10, width=20, height=20, hspace=0.45, wspace=0.5)

In [7]:
print("Loading data...")
X, labels, dataset = load_data('../data/train.csv', train=True)
X, scaler = preprocess_data(X)
Y, encoder = preprocess_labels(labels)

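# Hold out the first 10,000 shuffled rows for validation *before* trimming
# the training arrays, so that the two sets do not overlap.
X_val = X[:10000]
Y_val = Y[:10000]
X = X[10000:]
Y = Y[10000:]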

dims = X.shape[1]
print(dims, 'dims')
nb_classes = Y.shape[1]
print(nb_classes, 'classes')

X_test, ids, _ = load_data('../data/test.csv', train=False)
X_test, ids = X_test[:1000], ids[:1000]
X_test, _ = preprocess_data(X_test, scaler)




Loading data...
93 dims
9 classes


# Let's create a simple one layer multi-class classification model.

In [8]:
# import tensorflow as tf
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

model = Sequential()
model.add(Dense(nb_classes, input_shape=(dims,)))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, Y, epochs=30, batch_size=32, validation_data=(X_val, Y_val))

Train on 51878 samples, validate on 10000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1a1fc400b8>

In [9]:
# Evaluate the trained model on the held-out validation set.
# `evaluate` returns [loss, accuracy].
results = model.evaluate(X_val, Y_val)
print(results)

[0.6424088631629944, 0.7632]


In [10]:
predictions = model.predict(X_val)
print("\n\noutput shape of first test sample", predictions[0].shape)
print("\n\nsum over predictions",np.sum(predictions[0]))
print("\n\nactual predictions\n",predictions[0])
print("\n\npredicted class",np.argmax(predictions[0]))




output shape of first test sample (9,)


sum over predictions 0.9999999


actual predictions
 [4.8122888e-06 6.1857444e-01 3.6134896e-01 1.8463681e-02 9.9824998e-04
 2.6941436e-06 5.9060508e-04 5.4218312e-06 1.1036260e-05]


predicted class 1


# Let's understand the multi-class classification model.

Binary vs Multiclass Classification

<img src="../images/binary-vs-multiclass.png" width="70%" />

Multi-class vs multi-label classification

<img src="../images/multiclass-vs-multilabel.png" width="70%" />

How does our current network look?

<img src="../images/softmax-cross-entropy.png" width="70%" />

<img src="../images/cross-entropy-1.png" width="70%" />

<img src="../images/Maths-Insight.png" alt="Math Alert" style="width: 100px;float:left; margin-right:15px"/>
## Let's understand the prediction function and loss function in some more details.
***
***


<img src="../images/softmax-cross-entropy-1.png" width="70%" />

# How does cross-entropy help in capturing the classification loss?

Two different predicted probability distributions can produce the same predicted class. Which of the two has less uncertainty?

<img src="../images/cross-entropy-2.png" width="70%" />

Cross-entropy measures the difference between two probability distributions. Relative to the target distribution, the second predicted distribution has less uncertainty than the first.

<img src="../images/cross-entropy-3.png" width="70%" />
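To make the comparison concrete, here is a minimal sketch in plain Python (no Keras), assuming the standard definitions: softmax $p_i = e^{z_i} / \sum_j e^{z_j}$ and cross-entropy $L = -\sum_i y_i \log p_i$ for a one-hot target $y$. The three-class target and the two predicted distributions are made up for illustration and are not taken from the figures.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical 3-class example: the true class is index 1.
target = [0.0, 1.0, 0.0]
unsure = [0.30, 0.40, 0.30]     # correct argmax, but low confidence
confident = [0.05, 0.90, 0.05]  # correct argmax, high confidence

loss_unsure = cross_entropy(target, unsure)
loss_confident = cross_entropy(target, confident)
print(loss_unsure, loss_confident)

# Softmax on arbitrary scores: the outputs form a probability distribution.
probs = softmax([1.0, 2.0, 0.5])
print(probs, sum(probs))
```

Both predictions pick the same class, yet the loss is roughly 0.92 for the unsure one and roughly 0.11 for the confident one, which is why cross-entropy is a more informative training signal than accuracy alone.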

<img src="../images/Recap.png" alt="Recap" style="width: 100px;float:left; margin-right:15px"/>
## Recap
***
***

- A multi-class classification problem
- Soft-max as prediction function
- Cross-entropy as loss function

### Soft-max plus cross-entropy is used across numerous applications for different media types such as image processing, text processing, genome processing etc.

In [11]:

Y_prediction = []
y_prediction = []
for prob_array in predictions:
    ind = np.argmax(prob_array)
    y_prediction.append(ind)
    new_prob_array = [0] * len(prob_array)
    new_prob_array[ind] = 1
    Y_prediction.append(new_prob_array)


y_truth = []
for prob_array in Y_val:
    y_truth.append(np.argmax(prob_array))

## Evaluation metrics

- ### Confusion matrix
- ### Precision, recall, accuracy
- ### F-score

In [12]:
from sklearn.metrics import confusion_matrix

print("confusion matrix")
cm = confusion_matrix(y_truth, y_prediction)
print(cm)

confusion matrix
[[ 104   18    0    0    0   33    9   52   68]
 [   4 2387  196   15   12    4   17   13    4]
 [   0  941  314   11    1    5   13    4    3]
 [   0  262   29  116    3   10    5    1    1]
 [   0   30    0    0  430    0    0    1    0]
 [  14   29    3    9    1 2090   25   56   30]
 [   6   84   22    5    3   41  266   42    1]
 [  20   22    5    0    1   31   14 1258   23]
 [  21   20    0    0    2   26    6   41  667]]


## Precision, Recall, Accuracy, F-score

- For class x:
  - True positive: diagonal position, cm(x, x).
  - False positive: sum of column x (without main diagonal), sum(cm(:, x))-cm(x, x).
  - False negative: sum of row x (without main diagonal), sum(cm(x, :))-cm(x, x).

- We can compute precision, recall and F1 score following the formulas mentioned below.
- Averaging over all classes (with or without weighting) gives values for the entire model.
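The definitions above can be sketched directly on the confusion matrix printed earlier. This is a plain-Python illustration, not the notebook's own implementation; the names `precision`, `recall`, `f1` and `macro_f1` are chosen here. Note that `macro_f1` averages the per-class F1 scores, which is one common convention and can differ from an F-score computed from averaged precision and recall.

```python
# Confusion matrix copied from the validation run above
# (rows = true class, columns = predicted class).
cm = [
    [104,   18,   0,   0,   0,   33,   9,   52,  68],
    [  4, 2387, 196,  15,  12,    4,  17,   13,   4],
    [  0,  941, 314,  11,   1,    5,  13,    4,   3],
    [  0,  262,  29, 116,   3,   10,   5,    1,   1],
    [  0,   30,   0,   0, 430,    0,   0,    1,   0],
    [ 14,   29,   3,   9,   1, 2090,  25,   56,  30],
    [  6,   84,  22,   5,   3,   41, 266,   42,   1],
    [ 20,   22,   5,   0,   1,   31,  14, 1258,  23],
    [ 21,   20,   0,   0,   2,   26,   6,   41, 667],
]

n = len(cm)
precision, recall, f1 = [], [], []
for k in range(n):
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(n)) - tp  # column k without the diagonal
    fn = sum(cm[k]) - tp                       # row k without the diagonal
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r) if p + r else 0.0)

# Accuracy: correctly classified samples over all samples.
accuracy = sum(cm[k][k] for k in range(n)) / sum(map(sum, cm))
macro_f1 = sum(f1) / n  # unweighted mean of the per-class F1 scores
print(accuracy, macro_f1)
```

The guards against empty rows or columns matter in practice: a class that is never predicted would otherwise cause a division by zero in its precision.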

Understand different evaluation metrics

<img src="../images/metrics.png" width="70%" />

In [13]:
true_pos = []
false_pos = []
false_neg = []
i = 0
for a_class in cm:
    true_pos.append(cm[i,i])
    false_pos.append(sum(cm[:,i]) - cm[i,i])
    false_neg.append(sum(cm[i,:]) - cm[i,i])
    i = i + 1

print("\n\n verify true pos")
print(true_pos)




 verify true pos
[104, 2387, 314, 116, 430, 2090, 266, 1258, 667]


In [14]:

precision = []
recall = []
total_precision = 0
total_recall = 0
total_true = 0
total_false = 0
accuracy = 0
f_score = 0
i = 0
for a_class in true_pos:
    precision.append(true_pos[i]/(true_pos[i]+false_pos[i]))
    recall.append(true_pos[i]/(true_pos[i]+false_neg[i]))
    total_true = total_true + true_pos[i]
    total_false = total_false + false_neg[i]
    total_precision = total_precision + precision[i]
    total_recall = total_recall + recall[i]
    i = i + 1


In [15]:

avg_precision = total_precision/len(true_pos)
avg_recall = total_recall/len(true_pos)    
accuracy = total_true/(total_false+total_true)
f_score = 2 * avg_precision * avg_recall/(avg_precision + avg_recall)
print("\n\n precision per class")
print(precision)
print("\n\n recall per class")
print(recall)
print("\n\n accuracy")
print(accuracy)
print("\n\n f_score")
print(f_score)



 precision per class
[0.6153846153846154, 0.6293171631953599, 0.5518453427065027, 0.7435897435897436, 0.9492273730684326, 0.9330357142857143, 0.7492957746478873, 0.8569482288828338, 0.8368883312421581]


 recall per class
[0.36619718309859156, 0.9000754147812972, 0.24303405572755418, 0.2716627634660422, 0.9327548806941431, 0.926007975188303, 0.5659574468085107, 0.9155749636098981, 0.8518518518518519]


 accuracy
0.7632


 f_score
0.7098120508275872


# Training the ANN with Backpropagation
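- Step 1: Randomly initialise the weights to small numbers close to 0 (but not 0).
- Step 2: Feed the first observation of the dataset into the input layer, one feature per input node (93 input nodes for this dataset).
- Step 3: Forward-propagate: each hidden layer applies its activation function (e.g. the rectifier) until the output layer produces the prediction (sigmoid is a good choice for a binary output, softmax for the multi-class case here).
- Step 4: Compare the predicted result to the actual result and measure the generated error.
- Step 5: Back-propagate the error to work out how much each weight contributed to it; the learning rate decides how much we update the weights.
- Step 6: Update the weights after each observation or after a batch of observations.
- Step 7: Repeat steps 2 to 6 over the whole training set, and train for several such passes (epochs).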

## Backpropagation computes the gradients that gradient descent uses to update the weights

<img src="../images/backprop-10-1.png" width="70%" />


## Intuition of backpropagation algorithm


<img src="../images/training-1.png" width="70%" />

<img src="../images/training-2.png" width="70%" />

<img src="../images/training-3.png" width="70%" />

<img src="../images/training-4.png" width="70%" />

<img src="../images/training-5.png" width="70%" />

<img src="../images/training-6.png" width="70%" />

- ### Training deep neural networks involves numerous iterations through the dataset to learn the weights.
- ### In each iteration, we pick a random instance or a set of instances and train on it.
- ### We may need even millions of iterations, depending on the complexity of the problem.
- ### Each backward pass of training strives to minimize the training error, thus learning a robust classification network.

## Mathematics of backpropagation algorithm


<img src="../images/backprop-1.png" width="70%" />

<img src="../images/backprop-2-1.png" width="70%" />

<img src="../images/backprop-3-1.png" width="70%" />

<img src="../images/backprop-4.png" width="70%" />


<img src="../images/backprop-5.png" width="70%" />

Let's continue with backward pass

<img src="../images/backprop-6-3.png" width="70%" align="middle" />

One more round of backward pass...almost there.

<img src="../images/backprop-8-1.png" width="70%" align="middle"/>

<img src="../images/backprop-9-1.png" width="70%" />

<img src="../images/Recap.png" alt="Recap" style="width: 100px;float:left; margin-right:15px"/>
# We now know how deep neural networks learn a function
***
***


- ### define prediction and loss functions
- ### do forward pass and compute loss for a sample (or cost over a batch)
- ### do backward pass to tune the weights, this is weight training
- ### repeat weight training until training loss converges
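The four steps above can be sketched as a tiny training loop in plain Python. This is an illustration under stated assumptions, not the Keras implementation: the data set is made up (2 features, 3 classes), the network is a single softmax layer, and the weights are updated with full-batch gradient descent. It relies on the standard result that, for softmax with cross-entropy, the gradient of the loss with respect to the score $z_k$ is $p_k - y_k$.

```python
import math
import random

random.seed(0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def label(x):
    """Made-up labelling rule for the toy data."""
    if x[0] > 0.5:
        return 0
    return 1 if x[1] > 0.5 else 2

X = [[random.random(), random.random()] for _ in range(60)]
y = [label(x) for x in X]
n_classes, n_features = 3, 2

# Small random initial weights; W[k] holds the weights of class k, b[k] its bias.
W = [[random.uniform(-0.01, 0.01) for _ in range(n_features)]
     for _ in range(n_classes)]
b = [0.0] * n_classes

def forward(x):
    """Prediction function: linear scores followed by softmax."""
    scores = [sum(w * xi for w, xi in zip(W[k], x)) + b[k]
              for k in range(n_classes)]
    return softmax(scores)

def mean_loss():
    """Loss function: average cross-entropy over the toy data set."""
    return -sum(math.log(forward(x)[t]) for x, t in zip(X, y)) / len(X)

learning_rate = 0.5
losses = [mean_loss()]
for epoch in range(200):
    grad_W = [[0.0] * n_features for _ in range(n_classes)]
    grad_b = [0.0] * n_classes
    for x, t in zip(X, y):
        p = forward(x)  # forward pass
        for k in range(n_classes):
            delta = p[k] - (1.0 if k == t else 0.0)  # dL/dz_k
            grad_b[k] += delta / len(X)
            for j in range(n_features):
                grad_W[k][j] += delta * x[j] / len(X)  # chain rule: dz_k/dW_kj = x_j
    for k in range(n_classes):  # gradient-descent weight update
        b[k] -= learning_rate * grad_b[k]
        for j in range(n_features):
            W[k][j] -= learning_rate * grad_W[k][j]
    losses.append(mean_loss())

print(losses[0], losses[-1])
```

With only one layer the backward pass is a single application of the chain rule; in a deeper network the same `delta` would be propagated further back through each hidden layer.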

# But training loss alone does not suffice

- ## We want our network to do well in production
- ## Typically data is divided between train, validation, and test.
- ## use train data to tune the weights
- ## use validation data to evaluate the model, and change hyper-parameters
- ## use test data, to quantify performance
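A three-way split can be sketched with index lists in plain Python. The function name `three_way_split` and the 15% fractions are assumptions chosen for illustration; in the notebook itself, scikit-learn's `train_test_split` plays this role.

```python
import random

def three_way_split(n_samples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Return disjoint index lists for train, validation and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

# 61878 is the number of rows in the training file loaded above.
train_idx, val_idx, test_idx = three_way_split(61878)
print(len(train_idx), len(val_idx), len(test_idx))
```

Slicing one shuffled index list guarantees that no sample ends up in two sets, which is the property that makes validation and test scores trustworthy.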




<img src="../images/under-over-fitting.png" width="70%" />

Rule of thumb when training deep neural networks

<img src="../images/training-nn.png" width="70%" />

# What are hyper-parameters

- ## Parameters are weights on the edges of the network
- ## Hyper-parameters concern the architecture and the training procedure
  - ## Depth of the network
  - ## Choice of activation function
  - ## Learning rate variations
  - ## Batch size




# batch_size vs iteration vs epoch

- ## stochastic gradient descent : sample a batch of `batch_size` examples, do a forward pass for the batch, compute the loss over the batch, then do a backward pass
- ## iteration : one such forward/backward pass over a single batch
- ## epoch : one pass over the entire training set, i.e. roughly (training set size / batch size) iterations
- ## rule of thumb : show each training sample at least 6 to 8 times
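As a quick worked example of how these quantities relate, the arithmetic below uses the numbers from the training runs in this notebook (51,878 training samples, `batch_size=32`, `epochs=30`). The variable names are chosen here for illustration.

```python
import math

n_train = 51878   # training samples reported by Keras above
batch_size = 32
epochs = 30

# One iteration processes one batch; the last batch of an epoch may be smaller.
iterations_per_epoch = math.ceil(n_train / batch_size)
total_iterations = iterations_per_epoch * epochs
print(iterations_per_epoch, total_iterations)  # 1622 48660
```

So 30 epochs at this batch size correspond to 48,660 weight updates, and every training sample is seen 30 times.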



# Let's change hyper parameters

- Go from a depth-1 to a depth-2 network
- Use ReLU as the activation function

In [16]:
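model_1 = Sequential()
# Hidden layer: 64 units with ReLU
model_1.add(Dense(64, input_shape=(dims,), activation='relu'))
# Output layer: one unit per class. Note that ReLU here clips the class scores
# to be non-negative before softmax; a linear output layer is the usual choice.
model_1.add(Dense(nb_classes, activation='relu'))
model_1.add(Activation('softmax'))
model_1.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
history = model_1.fit(X, Y, epochs=30, batch_size=32)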

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [17]:
# Evaluate the two-layer model on the held-out validation set.
# `evaluate` returns [loss, accuracy].
results = model_1.evaluate(X_val, Y_val)
print(results)

[0.5066415110111236, 0.8049]


## ReLU as activation function

In the past, nonlinear functions like tanh and sigmoid were used, but researchers found that **ReLU layers** often work better in practice: the network trains a lot faster (ReLU is cheap to compute) without a significant loss in accuracy.

ReLU also helps to alleviate the **vanishing gradient problem**, where the lower layers of the network train very slowly because the gradient shrinks exponentially as it is propagated back through the layers.

Vanishing gradient problem

<img src="../images/sigmoid-vanishing-gradient.png" width="70%" />

ReLU: a much simpler non-linearity. Its derivative is piecewise constant (0 for negative inputs, 1 for positive inputs), which makes training faster.

<img src="../images/relu-derivative.png" width="70%" />

The **ReLU** function is defined as $f(x) = \max(0, x)$ [2].

A smooth approximation to the rectifier is the *analytic function*: $f(x) = \ln(1 + e^x)$

which is called the **softplus** function.

The derivative of softplus is $f'(x) = e^x / (e^x + 1) = 1 / (1 + e^{-x})$, i.e. the **logistic function**.
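The relationship between softplus and the logistic function can be checked numerically. The sketch below uses plain Python; `numeric_derivative` is a helper defined here (a central finite difference), not a library function.

```python
import math

def relu(x):
    return max(0.0, x)

def softplus(x):
    return math.log1p(math.exp(x))  # ln(1 + e^x)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, relu(x), softplus(x), numeric_derivative(softplus, x), logistic(x))
```

The last two columns agree to several decimal places, and softplus stays slightly above ReLU everywhere while approaching it for large positive and large negative inputs.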

Activation functions

<img src="../images/activation-functions.png" width="70%" />

## Choice of activation functions

- ### Driven by the need to introduce non-linearity
- ### The gradient of the activation function should be smooth or (piecewise) constant
- ### Since DNNs learn a complex hierarchy of functions, the choice of a particular activation function is almost always driven by experimentation.
- ### Yes, the choice of activation function is a hyper-parameter

In [18]:
model_2 = Sequential()
model_2.add(Dense(16, input_shape=(dims,), activation='relu'))
model_2.add(Dense(nb_classes, activation='relu'))
model_2.add(Activation('softmax'))
model_2.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model_2.fit(X, Y, epochs=30, batch_size=32, validation_data=(X_val, Y_val))

Train on 51878 samples, validate on 10000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1a200264e0>

In [19]:
# Evaluate the smaller (16-unit) model on the held-out validation set.
# `evaluate` returns [loss, accuracy].
results = model_2.evaluate(X_val, Y_val)
print(results)

[0.5642689529418945, 0.7861]


In [20]:
from keras.layers import Dropout

model_4 = Sequential()
model_4.add(Dense(64, input_shape=(dims,), activation='relu'))
# Randomly zero 10% of the hidden activations during training
model_4.add(Dropout(rate = 0.1))
model_4.add(Dense(nb_classes, activation='relu'))
# Note: this second Dropout acts on the class scores themselves, which is
# unusual; dropout is normally applied to hidden layers only.
model_4.add(Dropout(rate = 0.1))
model_4.add(Activation('softmax'))
model_4.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy',])
model_4.fit(X, Y, epochs=30, batch_size=32, validation_data=(X_val, Y_val))

Train on 51878 samples, validate on 10000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1a202e8be0>

In [21]:
# Evaluate the dropout model on the held-out validation set.
# `evaluate` returns [loss, accuracy].
results = model_4.evaluate(X_val, Y_val)
print(results)

[0.5938321981430054, 0.7958]


In [22]:
import h5py
model_4.save('best_model.h5')

In [23]:
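# Hold out 15% of the training data to monitor val_loss.
# Note: this overwrites the Kaggle test features previously stored in X_test.
X, X_test, Y, Y_test = train_test_split(X, Y, test_size=0.15, random_state=42)

fBestModel = 'best_model.h5' 
# Stop when val_loss has not improved for 4 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=4, verbose=1) 
# Keep only the model from the epoch with the best val_loss
best_model = ModelCheckpoint(fBestModel, verbose=0, save_best_only=True)
# Note: this continues training `model`, the single-layer network from above.
# validation_split is dropped: Keras ignores it when validation_data is given.
model.fit(X, Y, validation_data=(X_test, Y_test), epochs=20, 
          batch_size=128, verbose=True, 
          callbacks=[best_model, early_stop]) 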

Train on 44096 samples, validate on 7782 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 00005: early stopping


<keras.callbacks.History at 0x10cade518>

## The problem of overfitting

One of the problems that occur during neural network training is called overfitting. The error on the training set is driven to a very small value, but when new data is presented to the network the error is large. The network has memorized the training examples, but it has not learned to generalize to new situations.


## Application of dropout to avoid overfitting

Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks.

At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods.
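The mechanism can be sketched in a few lines of plain Python. Note one deliberate difference from the description above: instead of shrinking the weights at test time, this sketch uses the "inverted dropout" variant, which rescales the surviving units by `1 / (1 - rate)` during training so that nothing needs to change at test time. The Keras `Dropout` layer documents the same rescaling. The function name `dropout` and its arguments are chosen here for illustration.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or rate == 0.0:
        return list(activations)  # at test time the layer is the identity
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000  # made-up hidden activations

train_out = dropout(hidden, rate=0.1, training=True, rng=rng)
test_out = dropout(hidden, rate=0.1, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
mean_activation = sum(train_out) / len(train_out)
print(dropped_fraction, mean_activation)
```

Roughly 10% of the units are zeroed on each call, a different subset every time, while the mean activation stays close to 1. Each call therefore trains a different "thinned" network.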

Application of dropout to avoid overfitting

<img src="../images/dropout.png" width="70%" />

Different neurons are set to zero at different points in time

<img src="../images/dropout-2.png" width="70%" />

<img src="../images/Recap.png" alt="Recap" style="width: 100px;float:left; margin-right:15px"/>
## Dropout: Summary
***
***


The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
It is a very efficient way of performing model averaging with neural networks.

Comparison of different networks

<img src="../images/comparision.png" width="70%" />

# Observations

- ## Deeper is not always better
- ## Set architecture based on complexity of the problem
- ## Use techniques such as 'dropout' to avoid overfitting on training set
- ## Always evaluate on test set before putting models into production

# Takeaways

- ## Multi-class classification using soft-max and cross-entropy
- ## Backpropagation algorithm
- ## Overfitting and dropout
- ## Experimentation over different depth DNNs