## Dropout details

Last lecture, I said that during evaluation, the weights were multiplied by $(1-p)$ so that their expectation is the same as during training. That is, in training, we have
\begin{align*}
E[w_{jk} \times e_{jk}] &= 0 \times p \times w_{jk} + (1 - p) \times w_{jk}, \quad e_{jk}\sim\text{Bernoulli}(1-p) \\
&= (1-p)w_{jk}
\end{align*}
So to keep the expectation the same at test time, we need to multiply all the weights by $(1-p)$.

Alternatively, we can scale weights during training (instead of during evaluation). This is what PyTorch does.
The weights that are not dropped out are scaled by $1/(1-p)$. Why?
The expecation is:
\begin{align*}
E[w_{jk} \times e_{jk} / (1-p)] &= 0 \times p/(1-p)\times w_{jk} + (1-p)/(1-p)\times w_{jk} \\
&= w_{jk}
\end{align*}

Let's look at an illustration of this.


In [10]:
import torch
import torch.nn as nn
import torch.optim as optim

# Create a dropout layer with 50% dropout rate
dropout = nn.Dropout(p=0.5)

# Input tensor
x = torch.ones(5, 10) # 5 samples, each of dimension 10

# Training mode (default)
output_train = dropout(x)
print("Training output:")
print(output_train)
# outputs are scaled by 1/(1-0.5) = 2

# Evaluation mode
dropout.eval()
output_eval = dropout(x)
print("\nEvaluation output:")
print(output_eval)

Training output:
tensor([[0., 2., 2., 2., 0., 2., 0., 2., 2., 2.],
        [2., 0., 2., 2., 0., 0., 2., 0., 0., 0.],
        [2., 0., 0., 2., 0., 0., 0., 2., 0., 2.],
        [0., 0., 0., 0., 0., 0., 0., 2., 0., 2.],
        [0., 0., 0., 0., 2., 0., 2., 0., 2., 2.]])

Evaluation output:
tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])


Now let's see what happens when we use dropout during training. (We manually code dropout so we can visualize the masks).

In [11]:
class DropoutVisualization(nn.Module):
    def __init__(self, p=0.5):
        super(DropoutVisualization, self).__init__()
        self.p = p
        self.mask = None

    def forward(self, x):
        if self.training:
            self.mask = torch.bernoulli(torch.full_like(x, 1 - self.p)) / (1 - self.p)
            return x * self.mask
        return x

We define our neural network.

In [12]:
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_prob):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.dropout = DropoutVisualization(p=dropout_prob) # or nn.Dropout(dropout_prob)
        self.fc2 = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x


In [13]:
# Network parameters
input_size = 5
hidden_size = 10
output_size = 1
dropout_prob = 0.5

# Create the network and optimizer
model = SimpleNet(input_size, hidden_size, output_size, dropout_prob)

We set up our optimizer and compute the loss and the gradients.

In [20]:
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Create a mini-batch of samples
batch_size = 15
x = torch.randn(batch_size, input_size)
y = torch.randn(batch_size, output_size)

# Forward pass
model.train()  # Set the model to training mode
output = model(x)

# Compute loss
loss = nn.MSELoss()(output, y)

# Backward pass
loss.backward()

We can see which weights were dropped out for which samples:

In [21]:
# Visualize dropout mask for each sample in the batch
print("Dropout masks for each sample in the batch:")
dropout_mask = model.dropout.mask
for i in range(batch_size):
    print(f"Sample {i + 1}:")
    print(dropout_mask[i])
    print()

# Count dropped units for each sample
dropped_units = (dropout_mask == 0).sum(dim=1)
for i in range(batch_size):
    print(f"Sample {i + 1}: {dropped_units[i].item()} out of {hidden_size} units dropped")

Dropout masks for each sample in the batch:
Sample 1:
tensor([2., 2., 2., 2., 0., 0., 0., 0., 2., 2.])

Sample 2:
tensor([0., 2., 0., 2., 0., 2., 0., 0., 0., 0.])

Sample 3:
tensor([2., 0., 2., 0., 2., 2., 2., 0., 2., 0.])

Sample 4:
tensor([0., 0., 2., 2., 2., 2., 2., 2., 0., 0.])

Sample 5:
tensor([2., 0., 2., 0., 0., 0., 0., 2., 0., 2.])

Sample 6:
tensor([0., 2., 0., 0., 2., 2., 2., 0., 0., 2.])

Sample 7:
tensor([0., 2., 0., 0., 0., 2., 2., 2., 0., 0.])

Sample 8:
tensor([0., 0., 2., 2., 0., 2., 2., 2., 2., 0.])

Sample 9:
tensor([2., 2., 2., 0., 2., 0., 0., 0., 0., 2.])

Sample 10:
tensor([2., 0., 0., 0., 0., 2., 2., 2., 0., 2.])

Sample 11:
tensor([2., 0., 0., 2., 0., 2., 0., 0., 2., 0.])

Sample 12:
tensor([0., 0., 0., 0., 0., 0., 0., 2., 2., 0.])

Sample 13:
tensor([2., 2., 2., 0., 0., 2., 2., 0., 0., 0.])

Sample 14:
tensor([2., 2., 2., 2., 2., 0., 0., 2., 2., 2.])

Sample 15:
tensor([0., 2., 0., 0., 0., 0., 0., 2., 2., 2.])

Sample 1: 4 out of 10 units dropped
Sample 2: 7 ou

We can see how this impacts the gradients: weights which were dropped out for all samples have zero gradients.

In [17]:
dropout_mask.sum(axis=0)

tensor([2., 4., 4., 4., 2., 6., 4., 4., 2., 2.])

In [22]:
# Print gradients before the optimization step
print("Gradients before optimization step:")
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name} grad:")
        print(param.grad)
        print()

Gradients before optimization step:
fc1.weight grad:
tensor([[-0.0746, -0.0586, -0.0095, -0.0553, -0.0109],
        [-0.1191, -0.1163, -0.0267, -0.4393,  0.3315],
        [-0.0185,  0.0294, -0.0818,  0.0658, -0.1012],
        [-0.0302, -0.0059, -0.0357, -0.0163, -0.0188],
        [ 0.2595, -0.1405, -0.3040,  0.1821, -0.4055],
        [-0.0009, -0.0383,  0.0364, -0.1232,  0.0276],
        [-0.0552, -0.0530,  0.0269, -0.2311,  0.1850],
        [-0.7471, -0.1551,  0.3146, -0.4045,  0.4915],
        [ 0.0190, -0.0561,  0.0690,  0.0069,  0.1228],
        [ 0.0482,  0.1049, -0.1045,  0.2355, -0.1809]])

fc1.bias grad:
tensor([ 0.0420,  0.2904,  0.0239, -0.0202, -0.0982, -0.0267,  0.0480,  0.1704,
        -0.0929,  0.1276])

fc2.weight grad:
tensor([[-0.7781,  0.0268,  0.0181, -0.0936,  0.2109, -0.0328, -0.2389,  0.4150,
         -0.5130,  0.0824]])

fc2.bias grad:
tensor([-1.0284])



We can then update our parameters based on the gradients. We see that the weights which were dropped out for all samples were not updated. 

In [19]:
# Perform optimization step
optimizer.step()

# Print weights before and after the update
print("\nWeights before and after update:")
for name, param in model.named_parameters():
    if 'weight' in name:  # Only print weights, not biases
        print(f"{name}:")
        print("Before:", param.data)
        print("After: ", param.data - 0.01 * param.grad.data)  # lr * grad
        print()


Weights before and after update:
fc1.weight:
Before: tensor([[-0.0434, -0.2658, -0.3721, -0.0767, -0.3105],
        [ 0.0178,  0.1556, -0.3232,  0.1080,  0.3442],
        [ 0.2958,  0.2077, -0.3135,  0.4426,  0.4466],
        [-0.1365, -0.0235,  0.2594,  0.4355,  0.1102],
        [ 0.0728, -0.0578,  0.2075, -0.2832, -0.1104],
        [ 0.3351, -0.3084, -0.1341,  0.0132, -0.3680],
        [ 0.3294,  0.0711,  0.3388, -0.3490, -0.3195],
        [ 0.0751,  0.2023,  0.0721, -0.0433,  0.2243],
        [ 0.1800,  0.3127, -0.3854, -0.3738, -0.3710],
        [ 0.1973, -0.1108, -0.2598,  0.1036,  0.0622]])
After:  tensor([[-0.0427, -0.2652, -0.3721, -0.0759, -0.3104],
        [ 0.0189,  0.1558, -0.3228,  0.1126,  0.3411],
        [ 0.2956,  0.2073, -0.3130,  0.4417,  0.4474],
        [-0.1365, -0.0235,  0.2594,  0.4355,  0.1102],
        [ 0.0697, -0.0562,  0.2099, -0.2851, -0.1059],
        [ 0.3352, -0.3081, -0.1345,  0.0144, -0.3683],
        [ 0.3300,  0.0715,  0.3385, -0.3465, -0.3212],
  

### Derivatives of fc1 Weights in Matrix Notation

#### Notation (with dimensions):

- $\mathbf{W}_1$: Weight matrix of fc1 (dimensions: hidden_size × input_size)
- $\mathbf{b}_1$: Bias vector of fc1 (dimensions: hidden_size × 1)
- $\mathbf{W}_2$: Weight matrix of fc2 (dimensions: output_size × hidden_size)
- $\mathbf{b}_2$: Bias vector of fc2 (dimensions: output_size × 1)
- $\mathbf{x}$: Input vector (dimensions: input_size × 1)
- $\mathbf{y}$: Target output vector (dimensions: output_size × 1)
- $\mathbf{h}$: Output of the hidden layer (dimensions: hidden_size × 1)
- $\mathbf{\hat{y}}$: Predicted output vector (dimensions: output_size × 1)
- $L$: Loss function (scalar)
- $f$: ReLU activation function

#### Derivative Calculation:

We want to compute $\frac{\partial L}{\partial \mathbf{W}_1}$. 

1. $\frac{\partial L}{\partial \mathbf{\hat{y}}} = (\mathbf{\hat{y}} - \mathbf{y})^T$ (dimensions: 1 × output_size)

2. $\frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{h}} = \mathbf{W}_2$ (dimensions: output_size × hidden_size)

3. $\frac{\partial \mathbf{h}}{\partial (\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)} = \text{diag}(f'(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1))$ 
   This is a diagonal matrix $\mathbf{F}'$ (dimensions: hidden_size × hidden_size)

4. $\frac{\partial (\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)}{\partial \mathbf{W}_1} = \mathbf{x}^T$ 
   This is a tensor product, not a simple matrix multiplication.

Now, let's combine these:

$\frac{\partial L}{\partial \mathbf{W}_1} = ((\mathbf{\hat{y}} - \mathbf{y})^T \mathbf{W}_2 \mathbf{F}') \otimes \mathbf{x}^T$

Where $\otimes$ denotes the outer product (https://en.wikipedia.org/wiki/Outer_product)

Let's break down the dimensions:

- $(\mathbf{\hat{y}} - \mathbf{y})^T$: 1 × output_size
- $\mathbf{W}_2$: output_size × hidden_size
- $\mathbf{F}'$: hidden_size × hidden_size
- Result of $(\mathbf{\hat{y}} - \mathbf{y})^T \mathbf{W}_2 \mathbf{F}'$: 1 x hidden_size
- $\mathbf{x}^T$: 1 x input_size 

The outer product of (1 × hidden_size) and (1 × input_size) results in a matrix of size (hidden_size × input_size), which correctly matches the dimensions of $\mathbf{W}_1$.

#### Interpretation:

- $(\mathbf{\hat{y}} - \mathbf{y})^T \mathbf{W}_2$ computes how the error at the output layer affects the hidden layer.
- Multiplying by $\mathbf{F}'$ applies the ReLU derivative, effectively letting gradients flow only through active neurons.
- The outer product with $\mathbf{x}^T$ distributes this gradient to each weight based on the corresponding input value.