# Dropout Regularization in Deep Neural Networks

Dropout regularization is a popular technique used in deep neural networks (DNNs) to prevent overfitting and improve the generalization performance of the model. It was introduced by Geoffrey Hinton and his colleagues in their 2012 paper titled "Improving neural networks by preventing co-adaptation of feature detectors."

The core idea behind dropout is to randomly drop (i.e., set to zero) a proportion of neurons in the neural network during each training iteration. This means that certain neurons are temporarily removed from the network, along with all of their incoming and outgoing connections. As a result, the network becomes more robust and less reliant on any specific subset of neurons for making predictions.

Dropout regularization in practice:

1. **During Training**:
   - During each training iteration, dropout is applied to hidden units (neurons) in the network with a certain drop probability \( p \), typically between 0.2 and 0.5.
   - For the hidden units of a layer, a binary mask is generated in which each element is set to 0 (drop) with probability \( p \) and to 1 (keep) with probability \( 1 - p \).
   - The mask is then applied element-wise to the output of the hidden units. This effectively randomly sets some of the activations to zero.
   - The modified activations are passed to the next layer for further processing.
   - During backpropagation, only the weights associated with the active neurons (those not set to zero) are updated. This encourages the network to learn redundant representations and prevents co-adaptation of neurons.

2. **During Inference**:
   - During inference (i.e., when making predictions on new data), dropout is not applied. Instead, the full network is used for making predictions.
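   - However, all neurons are active at inference, whereas on average only a fraction \( 1 - p \) of them were active during training, so the outputs must be rescaled. In the original formulation, the weights are multiplied by the keep probability \( 1 - p \) at inference time. Most modern frameworks, including Keras, instead use *inverted dropout*: the surviving activations are divided by \( 1 - p \) during training, so no rescaling is needed at inference. Either way, the expected output remains the same across training and inference.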
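
The two phases above can be sketched in NumPy. This is a minimal illustration and not the code that Keras runs internally; the helper name `dropout_forward` and its arguments are assumptions made for this example. Here \( p \) is the probability of dropping a unit, and the sketch uses the inverted variant, which rescales the surviving activations by \( 1 / (1 - p) \) during training.

```python
import numpy as np

def dropout_forward(activations, drop_prob, training, rng):
    """Inverted dropout (hypothetical helper, for illustration only)."""
    if not training or drop_prob == 0.0:
        # Inference: every unit stays active and nothing is rescaled.
        return activations
    keep_prob = 1.0 - drop_prob
    # Binary mask: True (keep) with probability keep_prob, False (drop) otherwise.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((10000, 4))  # dummy activations, all equal to 1

train_out = dropout_forward(hidden, drop_prob=0.2, training=True, rng=rng)
infer_out = dropout_forward(hidden, drop_prob=0.2, training=False, rng=rng)

print("fraction of zeroed units:", np.mean(train_out == 0))
print("mean activation (train): ", train_out.mean())
print("mean activation (infer): ", infer_out.mean())
```

With a drop probability of 0.2, roughly one fifth of the activations should be zeroed during training, while the mean activation stays close to the inference-time value of 1.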

The benefits of dropout regularization include:
- **Reduction of Overfitting**: Dropout acts as a form of ensemble learning by training multiple subnetworks with shared parameters. This reduces the risk of overfitting by preventing the network from memorizing noise in the training data.
- **Improved Generalization**: By encouraging the network to learn more robust and diverse features, dropout helps improve the generalization performance of the model on unseen data.
- **Simplicity and Efficiency**: Dropout is a simple yet effective regularization technique that can be easily implemented in various deep learning frameworks.

Overall, dropout regularization is a powerful tool for training deep neural networks, especially when dealing with large datasets and complex models prone to overfitting. 

## Dataset
We want to work on the Sonar dataset. This is a **binary classification problem** that requires a model to differentiate rocks from metal cylinders.

[Dataset information](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks))

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv("/kaggle/input/sonar-dataset/sonar_dataset.csv", header=None)
df.sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
182,0.0095,0.0308,0.0539,0.0411,0.0613,0.1039,0.1016,0.1394,0.2592,0.3745,...,0.0181,0.0019,0.0102,0.0133,0.004,0.0042,0.003,0.0031,0.0033,M
7,0.0519,0.0548,0.0842,0.0319,0.1158,0.0922,0.1027,0.0613,0.1465,0.2838,...,0.0081,0.012,0.0045,0.0121,0.0097,0.0085,0.0047,0.0048,0.0053,R
55,0.0201,0.0116,0.0123,0.0245,0.0547,0.0208,0.0891,0.0836,0.1335,0.1199,...,0.0076,0.0045,0.0056,0.0075,0.0037,0.0045,0.0029,0.0008,0.0018,R
31,0.0084,0.0153,0.0291,0.0432,0.0951,0.0752,0.0414,0.0259,0.0692,0.1753,...,0.0236,0.0114,0.0136,0.0117,0.006,0.0058,0.0031,0.0072,0.0045,R
44,0.0257,0.0447,0.0388,0.0239,0.1315,0.1323,0.1608,0.2145,0.0847,0.0561,...,0.0096,0.0153,0.0096,0.0131,0.0198,0.0025,0.0199,0.0255,0.018,R


In [4]:
df.shape

(208, 61)

In [5]:
# check for nan values
df.isna().sum()

0     0
1     0
2     0
3     0
4     0
     ..
56    0
57    0
58    0
59    0
60    0
Length: 61, dtype: int64

In [6]:
df.columns

Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
       54, 55, 56, 57, 58, 59, 60],
      dtype='int64')

In the next cell we count the number of datapoints in each of the two categories.

In [7]:
df[60].value_counts() 

60
M    111
R     97
Name: count, dtype: int64

The output suggests that the labels are `not skewed`: the two classes are represented by a similar number of datapoints (111 `M` vs. 97 `R`).

A skewed label distribution means that the number of instances belonging to each class is not evenly distributed. Instead, one or more classes may dominate the dataset, while others are underrepresented. This can lead to imbalanced classes, where some classes have significantly fewer instances compared to others.

For example, consider a binary classification problem where you're predicting whether an email is spam or not spam. A skewed label distribution might occur if the dataset contains a much larger number of non-spam emails compared to spam emails.

Here, in our case, the distribution of classes in the dataset is relatively balanced. This means that each class is represented by a roughly equal number of instances, or at least that no class dominates the dataset to an extent that could bias the model's learning process.

Having a balanced label distribution is generally desirable because it helps prevent the model from being biased towards the majority class(es) and ensures that it learns to generalize well across all classes. However, in real-world datasets, imbalanced label distributions are common, and handling them appropriately is an important consideration in model training and evaluation. 

When the labels are skewed, techniques such as the following can mitigate the effect:
- class weighting,
- resampling, and
- evaluation metrics that account for class imbalance.
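
As a rough sketch of class weighting, the snippet below gives each class the weight `n_samples / (n_classes * class_count)`, which is the heuristic behind scikit-learn's `class_weight="balanced"` option. The helper `balanced_class_weights` and the spam/ham counts are assumptions made for illustration; the Sonar counts are taken from the `value_counts()` output above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Label counts taken from the value_counts() output above.
sonar_labels = ["M"] * 111 + ["R"] * 97
weights = balanced_class_weights(sonar_labels)
print(weights)

# A hypothetical, strongly skewed dataset for contrast.
skewed_labels = ["ham"] * 950 + ["spam"] * 50
print(balanced_class_weights(skewed_labels))
```

For the Sonar labels both weights stay close to 1, which is another way of seeing that the classes are nearly balanced. In Keras, a dictionary of this kind (keyed by integer class index) can be passed to `model.fit` through its `class_weight` argument.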

In [8]:
X = df.drop(60, axis=1)  # Because the "60" is the target feature
y = df[60]
y.head()

0    R
1    R
2    R
3    R
4    R
Name: 60, dtype: object

In [9]:
y = pd.get_dummies(y, drop_first=True) * 1 # Changing the target values to numerical values
y.sample(5) # R --> 1 and M --> 0

     R
27   1
142  0
118  0
50   1
168  0


In [10]:
y.value_counts()

R
0    111
1     97
Name: count, dtype: int64

In [11]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0033,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0241,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0156,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094


Now, let's split the dataset into training and test sets.

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
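
The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both subsets. The sketch below illustrates the idea with plain NumPy; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper written for this example.

```python
import numpy as np

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the test share.
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_size * len(cls_idx)))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

labels = np.array([0] * 111 + [1] * 97)  # same class counts as the Sonar labels
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=1)

print("test set size:", len(test_idx))
print("class-1 share, full:", labels.mean())
print("class-1 share, test:", labels[test_idx].mean())
```

With these counts the sketch places 28 and 24 samples of the two classes in the test set, which matches the `support` column of the classification reports later in the notebook.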

In [13]:
X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
157,0.0201,0.0178,0.0274,0.0232,0.0724,0.0833,0.1232,0.1298,0.2085,0.272,...,0.0253,0.0131,0.0049,0.0104,0.0102,0.0092,0.0083,0.002,0.0048,0.0036
160,0.0258,0.0433,0.0547,0.0681,0.0784,0.125,0.1296,0.1729,0.2794,0.2954,...,0.0121,0.0091,0.0062,0.0019,0.0045,0.0079,0.0031,0.0063,0.0048,0.005
155,0.0211,0.0128,0.0015,0.045,0.0711,0.1563,0.1518,0.1206,0.1666,0.1345,...,0.0174,0.0117,0.0023,0.0047,0.0049,0.0031,0.0024,0.0039,0.0051,0.0015
132,0.0968,0.0821,0.0629,0.0608,0.0617,0.1207,0.0944,0.4223,0.5744,0.5025,...,0.0206,0.0073,0.0081,0.0303,0.019,0.0212,0.0126,0.0201,0.021,0.0041
87,0.0856,0.0454,0.0382,0.0203,0.0385,0.0534,0.214,0.311,0.2837,0.2751,...,0.0128,0.0172,0.0138,0.0079,0.0037,0.0051,0.0258,0.0102,0.0037,0.0037


# Deep Learning Model
Let's start creating the model.

## Model without Dropout Layer
First, we create a model without dropout; then we add dropout so that we can compare the results of the two models.

In [14]:
import tensorflow as tf
from tensorflow import keras



In [15]:
model = keras.Sequential([
    keras.layers.Dense(60, input_dim=60, activation='relu'),
    keras.layers.Dense(30, activation='relu'),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100, batch_size=8)

Epoch 1/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.5254 - loss: 0.6856
Epoch 2/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6385 - loss: 0.6430
Epoch 3/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6937 - loss: 0.6097
Epoch 4/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7965 - loss: 0.5365
Epoch 5/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7934 - loss: 0.5270
Epoch 6/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8545 - loss: 0.4832
Epoch 7/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8087 - loss: 0.4431
Epoch 8/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8559 - loss: 0.3954
Epoch 9/100
20/20 ━━━━━━━━━━

<keras.src.callbacks.history.History at 0x79354fe59e40>

In [16]:
model.evaluate(X_test, y_test)

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7524 - loss: 0.9834


[0.9080442190170288, 0.7692307829856873]

Compare the accuracy on the `Training` data with the accuracy on the `Test` data. On the training data the model reaches an accuracy of 1.00, meaning that it correctly predicts every datapoint it was trained on. On new data, however, it performs noticeably worse: the test accuracy is only about 0.77 (the second value returned by `evaluate`), and the test loss is about 0.91.

This gap suggests that the model has overfitted the training data, so on a real-world problem, where it has not seen the data before, it is likely to perform poorly.

Training Accuracy --- Test Accuracy

In [17]:
y_pred = model.predict(X_test).reshape(-1)
print(y_pred[:10])

# np.round acts as a 0.5 threshold: probabilities become class labels 0 or 1
y_pred = np.round(y_pred)
print(y_pred[:10])

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
[1.2375977e-09 9.8164099e-01 5.2204810e-04 9.4711063e-03 9.5884579e-01
 7.0451832e-01 5.1892766e-06 9.9151140e-01 9.9992484e-01 2.3863416e-07]
[0. 1. 0. 0. 1. 1. 0. 1. 1. 0.]
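
Rounding works here because the sigmoid output is a probability and `np.round` effectively cuts it at 0.5. An explicit comparison makes the threshold visible and adjustable. The sketch below reuses the ten probabilities printed above; the helper `to_labels` and the alternative threshold of 0.8 are assumptions made for illustration.

```python
import numpy as np

# First ten predicted probabilities, copied from the output above.
probs = np.array([1.2375977e-09, 9.8164099e-01, 5.2204810e-04, 9.4711063e-03,
                  9.5884579e-01, 7.0451832e-01, 5.1892766e-06, 9.9151140e-01,
                  9.9992484e-01, 2.3863416e-07])

def to_labels(probabilities, threshold=0.5):
    """Map probabilities to class labels 0/1 using an explicit threshold."""
    return (probabilities >= threshold).astype(int)

print(to_labels(probs))                 # default threshold of 0.5
print(to_labels(probs, threshold=0.8))  # a stricter, hypothetical threshold
```

One small difference: `np.round` rounds a value of exactly 0.5 down to 0 (round-half-to-even), while `>= 0.5` maps it to 1. For the probabilities shown here the two approaches agree.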


In [18]:
y_test[:10]

     R
147  0
169  0
7    1
20   1
168  0
106  0
176  0
81   1
10   1
203  0


In [19]:
from sklearn.metrics import confusion_matrix, classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.71      0.77        28
           1       0.71      0.83      0.77        24

    accuracy                           0.77        52
   macro avg       0.77      0.77      0.77        52
weighted avg       0.78      0.77      0.77        52
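
To see where the numbers in the report come from, the sketch below recomputes them from a confusion matrix. The counts are an assumption: they are the only integer counts consistent with the rounded recall values and the supports (28 and 24) in the report, but they were reconstructed from it rather than printed by the notebook.

```python
import numpy as np

# Rows are true classes (0, 1), columns are predicted classes (0, 1).
# Counts reconstructed from the rounded report above (an assumption),
# not taken from an actual confusion_matrix() call.
cm = np.array([[20, 8],
               [4, 20]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_pos / cm.sum(axis=1)     # correct / all actually in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for cls in (0, 1):
    print(cls, round(precision[cls], 2), round(recall[cls], 2), round(f1[cls], 2))
print("accuracy", round(accuracy, 2))
```

In the notebook itself, the real matrix could be obtained with the already imported `confusion_matrix(y_test, y_pred)`.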



## Model with Dropout Layer
Now we add dropout layers to the same architecture to see how they affect the model.

In [20]:
modeld = keras.Sequential([
    keras.layers.Dense(60, input_dim=60, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(30, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid')
])

modeld.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

modeld.fit(X_train, y_train, epochs=100, batch_size=8)

Epoch 1/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.5617 - loss: 0.6854
Epoch 2/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5350 - loss: 0.6754
Epoch 3/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5196 - loss: 0.6882
Epoch 4/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5761 - loss: 0.6714
Epoch 5/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6690 - loss: 0.6337
Epoch 6/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7032 - loss: 0.5899
Epoch 7/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7128 - loss: 0.6013
Epoch 8/100
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6143 - loss: 0.6388
Epoch 9/100
20/20 ━━━━━━━

<keras.src.callbacks.history.History at 0x79353ef871f0>

In [21]:
modeld.evaluate(X_test, y_test)

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7524 - loss: 0.8112


[0.7840891480445862, 0.7692307829856873]

Comparing the two outputs, the training accuracy of the dropout model is somewhat lower, which is expected because units are randomly dropped during training. On the test set, the accuracy in this run is the same for both models (about 0.77), but the test loss drops from about 0.91 to about 0.78, so the dropout model makes less overconfident errors on unseen data. This suggests that dropout can help a model generalize better, although on a dataset this small the effect on accuracy may vary from run to run.

Note that in practice the accuracy on unseen data matters more, because deployed models mostly encounter data they have not seen before. So, alongside the training accuracy, we must pay attention to the performance on the test set, which the model has never seen.
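
The `evaluate` calls above return a `[loss, accuracy]` pair, and the two numbers can move independently: binary cross-entropy also penalizes how confident a wrong prediction is, while accuracy only counts whether the thresholded label is right. The sketch below illustrates this with made-up labels and probabilities; none of the values come from the notebook, and both helper functions are written for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    return float(np.mean((y_prob >= threshold).astype(int) == y_true))

# Made-up labels and predictions, for illustration only.
y_true = np.array([1, 0, 1, 0])
overconfident = np.array([0.99, 0.01, 0.01, 0.99])  # two very confident mistakes
cautious = np.array([0.80, 0.20, 0.40, 0.60])       # same mistakes, less confident

for name, probs in [("overconfident", overconfident), ("cautious", cautious)]:
    print(name, "accuracy:", accuracy(y_true, probs),
          "loss:", round(binary_cross_entropy(y_true, probs), 3))
```

Both sets of predictions get half of the labels right, yet the overconfident one has a much higher loss. A lower test loss at equal accuracy is therefore still a sign of better-calibrated predictions.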

In [22]:
y_pred = modeld.predict(X_test).reshape(-1)
print(y_pred[:10])

# np.round acts as a 0.5 threshold: probabilities become class labels 0 or 1
y_pred = np.round(y_pred)
print(y_pred[:10])

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step
[7.7407589e-07 9.5081764e-01 6.0690957e-04 5.0262291e-02 9.5935118e-01
 8.2204401e-01 2.0483922e-05 9.4835418e-01 9.9996525e-01 6.9435111e-05]
[0. 1. 0. 0. 1. 1. 0. 1. 1. 0.]


In [23]:
from sklearn.metrics import confusion_matrix , classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.71      0.77        28
           1       0.71      0.83      0.77        24

    accuracy                           0.77        52
   macro avg       0.77      0.77      0.77        52
weighted avg       0.78      0.77      0.77        52

