In [1]:
import torch, math, copy
import numpy as np
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F



# From Shallow to Deep Neural Networks

The main goal of this assignment is to develop a better understanding of how the depth of a network interacts with its trainability and performance.

In the previous assignment you likely observed difficulties in training sigmoid and ReLU networks with more than ~8 layers, a problem typically associated with 'vanishing' or 'exploding' gradients. As you will see, some of the biggest achievements in deep learning have been techniques that enable deeper networks to be trained successfully; without them, deep networks are notoriously difficult to train.

You will be working with the MNIST dataset, which will be downloaded and loaded in the cell below.

In [2]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

test_dataset = datasets.MNIST("data", train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=256, shuffle=False)

Fill in the missing code below. In both train_epoch and test, total_correct should be the total number of correctly classified samples, while total_samples should be the total number of samples that have been iterated over.

In [3]:
def train(epochs, model, criterion, optimizer, train_loader, test_loader):
    for epoch in range(epochs):
        train_err = train_epoch(model, criterion, optimizer, train_loader)
        test_err = test(model, test_loader)
        print('Epoch {:03d}/{:03d}, Train Error {:.2f}% || Test Error {:.2f}%'.format(epoch, epochs, train_err*100, test_err*100))
    return train_err, test_err

def train_epoch(model, criterion, optimizer, loader):
    total_correct = 0.
    total_samples = 0.

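    # test() leaves the model in eval mode, so switch back to training mode
    # at the start of every epoch (this matters for batch norm layers)
    model.train()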
    for batch_idx, (data, target) in enumerate(loader):
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()

        # insert code to feed the data to the model and collect its output
        output = model(data)

        # insert code to compute the loss from output and the true target
        loss = criterion(output, target)

        # insert code to update total_correct and total_samples
        # total_correct: total number of correctly classified samples
        # total_samples: total number of samples seen so far
        total_correct += (output.argmax(dim=1) == target).sum().item()
        total_samples += target.size(0)

        # insert code to update the parameters using optimizer
        # be careful in this part as an incorrect implementation will affect
        # all your experiments and have a significant impact on your grade!
        # in particular, note that pytorch does --not-- automatically
        # clear the parameter's gradients: check tutorials to see
        # how this can be done with a single method call.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return 1 - total_correct/total_samples

def test(model, loader):
    total_correct = 0.
    total_samples = 0.
    model.eval()

    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()

            # insert code to feed the data to the model and collect its output
            output = model(data)

            # insert code to update total_correct and total_samples
            # total_correct: total number of correctly classified samples
            # total_samples: total number of samples seen so far
            total_correct += (output.argmax(dim=1) == target).sum().item()
            total_samples += target.size(0)

    return 1 - total_correct/total_samples

### CNN with Tanh activations

Next, you should implement a baseline model so you can check how increasing the number of layers can make a network considerably harder to train, given that no additional methods such as residual connections and normalization layers are adopted.

Finish the implementation of CNNtanh below, carefully following the specifications:

The model should have exactly k convolutional layers, followed by a linear (fully-connected) layer that outputs the logits for each of the 10 MNIST classes.

The network should consist of 3 stages, each with k/3 convolutional layers (you can assume k is divisible by 3). Each conv layer should have a 3x3 kernel, a stride of 1 and a padding of 1 pixel (such that the output of the convolution has the same height and width as its input).

It should also have an average pooling layer at the end of each stage, with a 2x2 window (hence halving the spatial dimensions), and the number of channels should double from one stage to the other (starting with 4 in the first stage). Moreover, a Tanh activation should follow each convolution layer.

When k=3, for example, the network should be:

1. Stage 1 (1x28x28 input, 4x14x14 output):
    1. Conv layer with 1 input channel and 4 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
2. Stage 2 (4x14x14 input, 8x7x7 output):
    1. Conv layer with 4 input channels and 8 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
3. Stage 3 (8x7x7 input, 16x3x3 output):
    1. Conv layer with 8 input channels and 16 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
4. Fully-connected layer with 16 * 3 * 3=144 input dimension and 10 output dimension

Note that the model should not have any activation after the fully-connected layer: the PyTorch loss module that will be adopted takes logits as input and not class probabilities.

In contrast to the network exemplified above with k=3, when k=6 it should have two conv layers per stage instead of one (each one with a tanh activation following it).
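The specification above can be sanity-checked with a little shape arithmetic before writing any PyTorch code. The sketch below is plain Python; the helper name `layer_plan` and its arguments are made up for this illustration and are not part of the assignment. It lists the (input channels, output channels) pair of every conv layer and the input dimension of the final fully-connected layer.

```python
def layer_plan(k, in_shape=(1, 28, 28), first_channels=4, num_stages=3):
    """Return (conv_specs, fc_in_features) for a k-conv-layer, 3-stage network."""
    assert k % num_stages == 0, "k is assumed to be divisible by the number of stages"
    channels_in, height, width = in_shape
    channels_out = first_channels
    conv_specs = []
    for _ in range(num_stages):
        for _ in range(k // num_stages):
            # a 3x3 conv with stride=padding=1 keeps height and width unchanged
            conv_specs.append((channels_in, channels_out))
            channels_in = channels_out
        # 2x2 average pooling with stride 2 halves each spatial dimension (rounding down)
        height, width = height // 2, width // 2
        # the next stage doubles the number of channels
        channels_out *= 2
    return conv_specs, channels_in * height * width


print(layer_plan(3))  # one conv per stage
print(layer_plan(6))  # two convs per stage
```

For k=3 this reproduces the channel pairs and the 16 * 3 * 3 input dimension listed above; for k=6 only the first conv of each stage changes the number of channels.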

Lastly, do not change the code block with a for loop at the end of `__init__`: its purpose is to randomly initialize the parameters of the conv layers by sampling from a Gaussian with zero mean and a standard deviation of 0.05.

In [4]:
class CNNtanh(nn.Module):
    def __init__(self, k):
        super(CNNtanh, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                layers_list.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1))
                layers_list.append(nn.Tanh())
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

The line below instantiates the PyTorch cross-entropy loss, whose inputs should be logits: this is why the CNN should not have an activation after its last (fully-connected) layer.
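To make the logits requirement concrete, here is a small plain-Python sketch of what a cross-entropy loss on logits computes, and what goes wrong if probabilities are passed in instead. The helper names `softmax` and `cross_entropy_from_logits` are illustrative only; the actual PyTorch module applies the log-softmax internally.

```python
import math


def softmax(logits):
    # subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    # -log(softmax(logits)[target]), computed with the log-sum-exp trick
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [4.0, 0.0, 0.0]  # a confident, correct prediction for class 0
loss_on_logits = cross_entropy_from_logits(logits, 0)
# mistake: applying softmax first means it is effectively applied twice
loss_on_probs = cross_entropy_from_logits(softmax(logits), 0)
print(loss_on_logits, loss_on_probs)
```

Feeding probabilities squashes the inputs into [0, 1], so even a confident correct prediction keeps a noticeably larger loss than it should.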

In [5]:
criterion = torch.nn.CrossEntropyLoss()

Now, you should train CNNtanh with different values for k: your goal is to find the largest value for k such that the network achieves less than 20% error (either train or test) in 3 epochs. You should also choose an appropriate learning rate (but do not change the optimizer or the momentum settings!).

Note that CNNs can easily achieve under 2% test error on MNIST, but we're choosing 20% as a threshold since you will be training each network for only 3 epochs.

Remember to use values for k that are divisible by 3. When submitted, your notebook should have the training log of a network with two consecutive values for k (for example, 6 and 9) such that the network is 'trainable' with the smaller one but not 'trainable' with the larger one. It is fine for the training log to include runs with more than two values of k.
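The search for the boundary can be organized as a simple scan over multiples of 3. The sketch below is one possible way to do it, not a required part of the notebook: `find_trainability_boundary` and the callback `final_error_for_k` are hypothetical names. In practice the callback would build a model, call `train(3, ...)`, and return the lower of the final train and test errors as a fraction. Here a stand-in function with made-up numbers is used so the snippet runs on its own.

```python
def find_trainability_boundary(final_error_for_k, candidate_ks, threshold=0.20):
    """Scan candidate_ks in order; return (largest trainable k, first k that failed)."""
    last_trainable = None
    for k in candidate_ks:
        assert k % 3 == 0, "k is assumed to be divisible by 3"
        if final_error_for_k(k) < threshold:
            last_trainable = k
        else:
            return last_trainable, k
    # every candidate was trainable, so no boundary was found in this range
    return last_trainable, None


def pretend_error(k):
    # stand-in for a real training run; these numbers are invented for illustration
    return 0.05 if k <= 6 else 0.90


print(find_trainability_boundary(pretend_error, range(3, 31, 3)))
```

Note that this scan stops at the first failure. Trainability depends on the random seed and need not be strictly monotonic in k, so it is worth re-running the two values around the boundary.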

In [6]:
lr = 0.05
k = 6
print("\033[33m\nTraining Tanh CNN with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNtanh(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN with 6 layers, learning rate 0.05
Epoch 000/003, Train Error 73.45% || Test Error 21.00%
Epoch 001/003, Train Error 11.39% || Test Error 6.36%
Epoch 002/003, Train Error 5.18% || Test Error 3.62%


In [7]:
lr = 0.05
k = 9
print("\033[33m\nTraining Tanh CNN with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNtanh(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN with 9 layers, learning rate 0.05
Epoch 000/003, Train Error 88.80% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 88.85% || Test Error 88.65%


### Better Initialization

Next, we will change the initialization of the conv layers and see how it affects the trainability of deep networks. Instead of sampling from a Gaussian with a standard deviation of 0.05, you should sample from a Gaussian with a standard deviation $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{out}}}$ or $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{in}}}$, where $k$ here denotes the kernel size ($k=3$ for 3x3 convolutions; not to be confused with the number of layers k), $C_{in}$ is the number of input channels, and $C_{out}$ the number of output channels.

The model below should be exactly like CNNtanh except for the standard deviation of the normal distribution used to initialize the conv layers.

The paper 'Understanding the difficulty of training deep feedforward neural networks' by Glorot and Bengio provides some intuition behind such a choice for $\sigma$.
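A rough way to see why this choice helps is to look at how a single conv layer scales the variance of its input. The sketch below uses a deliberately simplified model that is an assumption of this example, not something stated in the assignment: weights are i.i.d. with zero mean, inputs are independent with zero mean, and the Tanh nonlinearity and border effects from padding are ignored. Under that model the output variance is the input variance times $k^2 \cdot C_{in} \cdot \sigma^2$. The helper names are illustrative.

```python
import math


def sigma_new_init(kernel_size, channels):
    # standard deviation sqrt(1 / (k^2 * C)), with C being either C_in or C_out
    return math.sqrt(1.0 / (kernel_size ** 2 * channels))


def linear_variance_gain(sigma, kernel_size, c_in):
    # Var(output) / Var(input) for a linear conv layer under the simplified model
    return kernel_size ** 2 * c_in * sigma ** 2


for channels in (4, 8, 16):
    naive = linear_variance_gain(0.05, 3, channels)
    scaled = linear_variance_gain(sigma_new_init(3, channels), 3, channels)
    print("C_in={:2d}: sigma=0.05 gives gain {:.2f}, new init gives gain {:.2f}".format(
        channels, naive, scaled))
```

With a fixed deviation of 0.05 the per-layer gain is well below 1 for these channel counts, so under this model the signal shrinks geometrically with depth, while the rescaled initialization keeps the gain at 1. Within a stage all but the first conv have $C_{in} = C_{out}$, so the two variants of $\sigma$ coincide for most layers.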

In [8]:
class CNNtanh_newinit(nn.Module):
    def __init__(self, k):
        super(CNNtanh_newinit, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                layers_list.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1))
                layers_list.append(nn.Tanh())
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = np.sqrt(1 / (m.kernel_size[0] * m.kernel_size[1] * m.out_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanh_newinit.

In [16]:
lr = 0.05
k = 48
print("\033[33m\nTraining Tanh CNN + new init with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNtanh_newinit(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN + new init with 48 layers, learning rate 0.05
Epoch 000/003, Train Error 52.93% || Test Error 12.13%
Epoch 001/003, Train Error 7.00% || Test Error 5.01%
Epoch 002/003, Train Error 4.31% || Test Error 3.48%


In [14]:
lr = 0.05
k = 51
print("\033[33m\nTraining Tanh CNN + new init with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNtanh_newinit(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN + new init with 51 layers, learning rate 0.05
Epoch 000/003, Train Error 89.22% || Test Error 88.65%
Epoch 001/003, Train Error 89.06% || Test Error 88.65%
Epoch 002/003, Train Error 88.94% || Test Error 88.65%


### CNN with ELU activations

In this section you should replace the Tanh activations of the previous network with Exponential Linear Units (ELUs). Complete CNNelu below, which should be exactly like CNNtanh_newinit except for ELU activations instead of Tanh (ELUs are readily available in PyTorch; check its documentation for more details).

In [17]:
class CNNelu(nn.Module):
    def __init__(self, k):
        super(CNNelu, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                layers_list.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1))
                layers_list.append(nn.ELU())
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = np.sqrt(1 / (m.kernel_size[0] * m.kernel_size[1] * m.out_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu.

In [28]:
lr = 0.05
k = 39
print("\033[33m\nTraining ELU CNN, with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNelu(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ELU CNN, with 39 layers, learning rate 0.05
Epoch 000/003, Train Error 41.42% || Test Error 6.39%
Epoch 001/003, Train Error 4.58% || Test Error 3.49%
Epoch 002/003, Train Error 3.09% || Test Error 2.31%


In [29]:
lr = 0.05
k = 42
print("\033[33m\nTraining ELU CNN, with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNelu(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ELU CNN, with 42 layers, learning rate 0.05
Epoch 000/003, Train Error 88.56% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%


### CNN with Batch Normalization

Next, you will check how batch normalization can make deep networks easier to train. Implement the network below, which should be exactly like CNNelu except for additional BatchNorm2d layers after each convolution (before the ELU activation).

Note that BatchNorm2d modules require the number of channels as argument -- see the PyTorch documentation for more details.

In [30]:
class CNNeluBN(nn.Module):
    def __init__(self, k):
        super(CNNeluBN, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                layers_list.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1))
                layers_list.append(nn.BatchNorm2d(channel_out))
                layers_list.append(nn.ELU())
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = np.sqrt(1 / (m.kernel_size[0] * m.kernel_size[1] * m.out_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNeluBN.

In [43]:
lr = 0.05
k = 63
print("\033[33m\nTraining ELU CNN + BN with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNeluBN(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ELU CNN + BN with 63 layers, learning rate 0.05
Epoch 000/003, Train Error 14.75% || Test Error 4.09%
Epoch 001/003, Train Error 4.10% || Test Error 3.69%
Epoch 002/003, Train Error 2.84% || Test Error 1.95%


In [54]:
lr = 0.05
k = 66
print("\033[33m\nTraining ELU CNN + BN with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNeluBN(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ELU CNN + BN with 66 layers, learning rate 0.05
Epoch 000/003, Train Error 25.83% || Test Error 8.61%
Epoch 001/003, Train Error 89.09% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%


### Residual Networks

Finally, you will experiment with adding residual connections to a CNN.

To implement the model below, you should add a 'skip connection' to 'Conv->BatchNorm->ELU' blocks whenever the shapes of the block's input and output are the same: this will be the case for every such block except for the first one in each stage, as it changes the number of channels.

More specifically, you should change $u = ELU(BatchNorm(Conv(x)))$ to $u = ELU(BatchNorm(Conv(x))) + x$, where $x$ and $u$ denote the block's input and output, respectively.
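One informal way to see why the skip connection helps gradients reach early layers is to differentiate the block. Writing $F(x) = ELU(BatchNorm(Conv(x)))$ as shorthand (the symbol $F$ and the block indices $l$ are introduced here only for this illustration):

$$u = F(x) + x \quad \Longrightarrow \quad \frac{\partial u}{\partial x} = \frac{\partial F}{\partial x} + I$$

For $L$ stacked residual blocks with $u_l = F_l(u_{l-1}) + u_{l-1}$, the chain rule gives

$$\frac{\partial u_L}{\partial u_0} = \prod_{l=1}^{L} \left( \frac{\partial F_l}{\partial u_{l-1}} + I \right)$$

Expanding this product yields a sum that always contains the identity term $I$, so part of the gradient reaches $u_0$ without passing through any convolution. In the plain network the corresponding product has only the $\partial F_l / \partial u_{l-1}$ factors, which can shrink or grow geometrically with depth.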

You should take your CNNeluBN implementation and add skip-connections as described above.

Note that there are key differences between the resulting model and the actual ResNet proposed by He et al. in 'Deep Residual Learning for Image Recognition', for example the use of ELU activations instead of ReLU and the exact position of skip-connections.

In [55]:
class ResidualBlock(nn.Module):
    def __init__(self, channel_in, channel_out):
        super(ResidualBlock, self).__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channel_out),
            nn.ELU()
        )

    def forward(self, x):
        return self.block(x) + x

class ResNet(nn.Module):
    def __init__(self, k):
        super(ResNet, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                if channel_in == channel_out:
                    layers_list.append(self.residual_block(channel_in, channel_out))
                else:
                    layers_list.append(self.basic_block(channel_in, channel_out))
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = np.sqrt(1 / (m.kernel_size[0] * m.kernel_size[1] * m.out_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def basic_block(self, channel_in, channel_out):
        return nn.Sequential(
            nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channel_out),
            nn.ELU()
        )

    def residual_block(self, channel_in, channel_out):
        return ResidualBlock(channel_in, channel_out)

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with the 'ResNet' model.

In [59]:
lr = 0.05
k = 102
print("\033[33m\nTraining ResNet with {} layers, learning rate {}\033[0m".format(k, lr))
model = ResNet(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ResNet with 102 layers, learning rate 0.05
Epoch 000/003, Train Error 22.50% || Test Error 7.70%
Epoch 001/003, Train Error 8.24% || Test Error 6.71%
Epoch 002/003, Train Error 6.60% || Test Error 6.13%


In [57]:
lr = 0.05
k = 105
print("\033[33m\nTraining ResNet with {} layers, learning rate {}\033[0m".format(k, lr))
model = ResNet(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ResNet with 105 layers, learning rate 0.05
Epoch 000/003, Train Error 39.61% || Test Error 12.69%
Epoch 001/003, Train Error 86.88% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%


### Interactions: Batch Norm and Initialization

Intuitively, batch norm should make the model more robust to changes in the magnitude of the network's weights: informally, scaling up all the elements of a conv layer's filters by a factor of 10 would not affect the network's output as long as there is a batch norm layer following such convolution, as the normalization would undo the scaling.
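The scaling argument can be checked numerically on a handful of values. The sketch below is plain Python and only mimics the normalization step of batch norm for a single channel. The function `batch_normalize` is an illustrative stand-in, and it assumes zero conv bias (so that scaling the filters by 10 scales the pre-activations by 10) and no learned affine parameters.

```python
import math


def batch_normalize(values, eps=0.0):
    # normalize to zero mean and unit variance, using the biased (population) variance
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


pre_activations = [0.3, -1.2, 0.8, 2.1]          # conv outputs for one channel
scaled = [10.0 * v for v in pre_activations]     # same conv with filters scaled by 10

normalized = batch_normalize(pre_activations)
normalized_scaled = batch_normalize(scaled)
print(normalized)
print(normalized_scaled)
```

With `eps=0.0` the two results agree up to floating-point rounding. With a small positive `eps`, as used in practice for numerical stability, the invariance is only approximate. The invariance also concerns the forward pass only: the gradients with respect to the scaled weights are not the same, so training dynamics can still differ.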

To check how this intuition translates to practical settings, you should change the original 'CNNtanh' model so that it incorporates batch norm layers (like you have done when modifying 'CNNelu' into 'CNNeluBN').

The model below should adopt the naive initialization procedure of sampling from a Gaussian with a standard deviation of 0.05, not the more sophisticated one that you implemented previously.

In [60]:
class CNNtanhBN_oldinit(nn.Module):
    def __init__(self, k):
        super(CNNtanhBN_oldinit, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                layers_list.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1))
                layers_list.append(nn.BatchNorm2d(channel_out))
                layers_list.append(nn.Tanh())
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanhBN_oldinit.

In [63]:
lr = 0.05
k = 48
print("\033[33m\nTraining Tanh CNN + BN + naive init with {} layers, learning rate {}\033[0m".format(k,lr))
model = CNNtanhBN_oldinit(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN + BN + naive init with 48 layers, learning rate 0.05
Epoch 000/003, Train Error 26.44% || Test Error 11.50%
Epoch 001/003, Train Error 9.55% || Test Error 6.41%
Epoch 002/003, Train Error 4.53% || Test Error 3.53%


In [64]:
lr = 0.05
k = 51
print("\033[33m\nTraining Tanh CNN + BN + naive init with {} layers, learning rate {}\033[0m".format(k,lr))
model = CNNtanhBN_oldinit(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN + BN + naive init with 51 layers, learning rate 0.05
Epoch 000/003, Train Error 20.38% || Test Error 19.12%
Epoch 001/003, Train Error 89.45% || Test Error 88.65%
Epoch 002/003, Train Error 89.76% || Test Error 88.65%


### Interactions: Batch Norm and Residual Connections

Lastly, implement and train a CNN with residual connections but without batch normalization layers -- the goal here is to check how residuals interact with normalization.

The model below should be exactly like ResNet, except that it should not have batch norm layers.

In [65]:
class ResidualBlock_noBN(nn.Module):
    def __init__(self, channel_in, channel_out):
        super(ResidualBlock_noBN, self).__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1),
            nn.ELU()
        )

    def forward(self, x):
        return self.block(x) + x

class ResNet_noBN(nn.Module):
    def __init__(self, k):
        super(ResNet_noBN, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                if channel_in == channel_out:
                    layers_list.append(self.residual_block(channel_in, channel_out))
                else:
                    layers_list.append(self.basic_block(channel_in, channel_out))
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc = nn.Linear(channel_in*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = np.sqrt(1 / (m.kernel_size[0] * m.kernel_size[1] * m.out_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def basic_block(self, channel_in, channel_out):
        return nn.Sequential(
            nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1),
            nn.ELU()
        )

    def residual_block(self, channel_in, channel_out):
        return ResidualBlock_noBN(channel_in, channel_out)

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        for stage in self.conv_stages:
            x = stage(x)
        u = self.fc(x.view(x.size(0), -1))

        return u

In [67]:
k = 21
lr = 0.005
print("\033[33m\nTraining ResNet w/o BN with {} layers, learning rate {}\033[0m".format(k, lr))
model = ResNet_noBN(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ResNet w/o BN with 21 layers, learning rate 0.005
Epoch 000/003, Train Error 12.41% || Test Error 2.78%
Epoch 001/003, Train Error 2.50% || Test Error 1.87%
Epoch 002/003, Train Error 1.88% || Test Error 1.67%


In [66]:
k = 24
lr = 0.005
print("\033[33m\nTraining ResNet w/o BN with {} layers, learning rate {}\033[0m".format(k, lr))
model = ResNet_noBN(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training ResNet w/o BN with 24 layers, learning rate 0.005
Epoch 000/003, Train Error 90.11% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%


Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with ResNet_noBN.
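
The search itself can be sketched as a simple scan over depths. This is a minimal sketch under stated assumptions, not the assignment's prescribed method: `max_trainable_depth` and the `is_trainable` callback are hypothetical names, and the callback is assumed to build a model with `k` layers, train it for a few epochs, and report whether the error dropped clearly below chance (90% on MNIST). The step of 3 reflects `layers_per_stage = k // number_of_stages`, so depths between multiples of 3 build the same network.

```python
from typing import Callable, Optional


def max_trainable_depth(is_trainable: Callable[[int], bool],
                        start: int = 3, stop: int = 60,
                        step: int = 3) -> Optional[int]:
    """Return the largest depth k in range(start, stop + 1, step) that trains.

    `is_trainable` is a hypothetical callback supplied by the caller.
    Every depth is tried (no early exit) because trainability is not
    guaranteed to be monotonic in depth.
    """
    best = None
    for k in range(start, stop + 1, step):
        if is_trainable(k):
            best = k
    return best


# Stand-in predicate that mimics the two runs above
# (k = 21 trained, k = 24 stayed at roughly 90% error).
print(max_trainable_depth(lambda k: k <= 21))
```

In practice the predicate would wrap a call to `train(...)` and compare the returned test error against a threshold of your choosing; the threshold is a judgment call and is not specified by the assignment.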

### (Optional) Multiple Loss Heads

In this optional section, your goal is to incorporate the idea of having multiple loss heads throughout the network, distributed across its depth.

For the CNNelu_multihead model below, you should take the CNNelu model that you implemented previously and add two additional classification heads, connected to the outputs of stages 1 and 2.

More specifically, the outputs of stages 1 and 2, with shapes 4x14x14 and 8x7x7, should be connected to new fully-connected layers that map them to a 10-dimensional vector (logits for the 10 MNIST classes). The network should output three logit vectors (the original one at the end of the network plus the two new ones) instead of just one, and the loss should be computed as the average of the cross entropies between the true target and each of the three predictions.
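
The shapes and the averaged loss described above can be checked on dummy data. The sketch below assumes one conv layer per stage and that the criterion is the standard cross entropy (`F.cross_entropy`); `multihead_loss` is a hypothetical helper name, not part of the assignment code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shape check: each stage ends in a 2x2 average pool, so 28 -> 14 -> 7 -> 3.
x = torch.zeros(2, 1, 28, 28)  # dummy batch of two MNIST-sized images
shapes = []
channel_in, channel_out = 1, 4
for _ in range(3):
    stage = nn.Sequential(
        nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1),
        nn.ELU(),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )
    x = stage(x)
    shapes.append(tuple(x.shape[1:]))  # drop the batch dimension
    channel_in, channel_out = channel_out, channel_out * 2
print(shapes)


def multihead_loss(logits_per_head, target):
    """Average the cross entropy of every head against the same target."""
    losses = [F.cross_entropy(logits, target) for logits in logits_per_head]
    return torch.stack(losses).mean()


# Dummy logits for three heads and a batch of two samples.
target = torch.tensor([3, 7])
heads = [torch.zeros(2, 10) for _ in range(3)]
loss = multihead_loss(heads, target)
print(loss.item())
```

With all-zero logits every head predicts a uniform distribution over the 10 classes, so each cross entropy, and therefore their average, equals ln(10).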

Note that you will likely have to change the implementation of train_epoch() and test() to accommodate the fact that this model will output three logit vectors instead of one.
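
One possible way to handle the tuple output when counting correct predictions is to compute one count per head. This is only a sketch: `count_correct_per_head` is a hypothetical helper, and whether to report the average over heads or the final head alone is a choice the text above leaves open.

```python
import torch


def count_correct_per_head(outputs, target):
    """Return one correct-prediction count per head.

    `outputs` is a tuple of logit tensors, each of shape (batch, classes).
    """
    return [(logits.argmax(dim=1) == target).sum().item() for logits in outputs]


target = torch.tensor([0, 1, 2])
head_a = torch.eye(3)          # predicts classes 0, 1, 2: all correct
head_b = torch.eye(3).flip(0)  # predicts classes 2, 1, 0: one correct
print(count_correct_per_head((head_a, head_b), target))  # [3, 1]
```

Averaging the list gives a mean-over-heads accuracy, while taking its last entry gives the accuracy of the final head only.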

In [68]:
class CNNelu_multihead(nn.Module):
    def __init__(self, k):
        super(CNNelu_multihead, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        number_of_stages = 3
        layers_per_stage = k // number_of_stages

        self.conv_stages = nn.ModuleList()

        channel_in = 1
        channel_out = 4

        for _ in range(number_of_stages):
            layers_list = []
            for _ in range(layers_per_stage):
                layers_list.append(nn.Conv2d(channel_in, channel_out, kernel_size=3, stride=1, padding=1))
                layers_list.append(nn.ELU())
                channel_in = channel_out
            layers_list.append(nn.AvgPool2d(kernel_size=2, stride=2))
            channel_out *= 2
            self.conv_stages.append(nn.Sequential(*layers_list))

        self.fc1 = nn.Linear(4*14*14, 10)
        self.fc2 = nn.Linear(8*7*7, 10)
        self.fc3 = nn.Linear(16*3*3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = np.sqrt(1 / (m.kernel_size[0] * m.kernel_size[1] * m.out_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        x = input
        u1, u2, u3 = None, None, None
        for idx, stage in enumerate(self.conv_stages):
            x = stage(x)
            if idx == 0:
                u1 = self.fc1(x.view(x.size(0), -1))
            elif idx == 1:
                u2 = self.fc2(x.view(x.size(0), -1))
            else:
                u3 = self.fc3(x.view(x.size(0), -1))

        return u1, u2, u3

In [69]:
def train_multihead(epochs, model, criterion, optimizer, train_loader, test_loader):
    for epoch in range(epochs):
        train_err = train_epoch_multihead(model, criterion, optimizer, train_loader)
        test_err = test_multihead(model, test_loader)
        print('Epoch {:03d}/{:03d}, Train Error {:.2f}% || Test Error {:.2f}%'.format(epoch, epochs, train_err*100, test_err*100))
    return train_err, test_err

def train_epoch_multihead(model, criterion, optimizer, loader):
    total_correct = 0.
    total_samples = 0.

    model.train()
    for batch_idx, (data, target) in enumerate(loader):
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()

        # insert code to feed the data to the model and collect its output
        output = model(data)

        # insert code to compute the loss from output and the true target
        loss1 = criterion(output[0], target)
        loss2 = criterion(output[1], target)
        loss3 = criterion(output[2], target)

        loss = (loss1 + loss2 + loss3) / 3

        # insert code to update total_correct and total_samples
        # total_correct: total number of correctly classified samples
        # total_samples: total number of samples seen so far
        # the correct counts are averaged over the heads, so the reported
        # error is the mean error of the three predictions
        total_correct += sum(
            (head.argmax(dim=1) == target).sum().item() for head in output
        ) / len(output)
        total_samples += target.size(0)

        # insert code to update the parameters using optimizer
        # be careful in this part as an incorrect implementation will affect
        # all your experiments and have a significant impact on your grade!
        # in particular, note that pytorch does --not-- automatically
        # clear the parameter's gradients: check tutorials to see
        # how this can be done with a single method call.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return 1 - total_correct/total_samples

def test_multihead(model, loader):
    total_correct = 0.
    total_samples = 0.
    model.eval()

    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()

            # insert code to feed the data to the model and collect its output
            output = model(data)

            # insert code to update total_correct and total_samples
            # total_correct: total number of correctly classified samples
            # total_samples: total number of samples seen so far
            # the correct counts are averaged over the heads, so the reported
            # error is the mean error of the three predictions
            total_correct += sum(
                (head.argmax(dim=1) == target).sum().item() for head in output
            ) / len(output)
            total_samples += target.size(0)

    return 1 - total_correct/total_samples

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu_multihead.

In [85]:
lr = 0.005  # learning rate reported in the output below
k = 139
print("\033[33m\nTraining ELU CNN + multiloss with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNelu_multihead(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train_multihead(3, model, criterion, optimizer, train_loader, test_loader)

Training ELU CNN + multiloss with 139 layers, learning rate 0.005
Epoch 000/003, Train Error 81.57% || Test Error 60.38%
Epoch 001/003, Train Error 39.57% || Test Error 29.86%
Epoch 002/003, Train Error 16.11% || Test Error 7.04%


In [86]:
lr = 0.005  # learning rate reported in the output below
k = 142
print("\033[33m\nTraining ELU CNN + multiloss with {} layers, learning rate {}\033[0m".format(k, lr))
model = CNNelu_multihead(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train_multihead(3, model, criterion, optimizer, train_loader, test_loader)

Training ELU CNN + multiloss with 142 layers, learning rate 0.005
Epoch 000/003, Train Error 88.81% || Test Error 86.13%
Epoch 001/003, Train Error 55.47% || Test Error 36.37%
Epoch 002/003, Train Error 29.81% || Test Error 21.37%
