## Stacks of Convolution 

Welcome to Day 16! My name is Tahiya Chowdhury, and I am a Postdoctoral Research Fellow at the Davis Institute for AI at Colby.

We met earlier this semester when we talked about the multi-layer perceptron. Later, we learned about a more advanced and powerful family of neural networks called Convolutional Neural Networks (CNNs).

CNNs have been the state of the art for decades, particularly in **computer vision** tasks dealing with images, but they are equally effective for audio, text, and time series such as EEG or sensor signals.

The first CNN, called **LeNet**, was developed for handwritten digit recognition; we already saw it on Day 13. Over the years, a wide variety of CNN architectures emerged from different design choices in search of:
* Practical and feasible ways of solving vision tasks (padding, pooling, regularization)
* Extensions of existing networks to more complex, specialized tasks
    * segmentation ([U-Net](https://arxiv.org/pdf/1505.04597.pdf))
    * detection ([Mask R-CNN](https://arxiv.org/abs/1703.06870))
    * tracking (you probably have heard of object tracking, how about [particle tracking](https://www.pnas.org/doi/10.1073/pnas.1804420115)?)

Many of the earlier architectures and techniques are still useful when designing modern networks:
* U-Net, developed for biomedical images, is often used in diffusion models (yes, [Stable Diffusion](https://replicate.com/stability-ai/stable-diffusion) is one such diffusion model)
* Padding is a useful technique for Transformers, a state-of-the-art architecture in many machine learning tasks

The motivation for Day 16 is to understand how the classic convolution operation, when repeated across many **layers** of **stacked convolutions**, becomes powerful enough to learn complex features and to build deeper models for classification and other tasks.

The textbook discusses several deep convolutional neural networks that gradually replaced the classical computer vision pipeline, which relied on pre-calculated features instead of learning the features (representation) during classifier training.


### Can we think of a limitation of models built on pre-calculated hand-crafted features?

## Today we will look at AlexNet and ResNet

### [AlexNet](https://papers.nips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) (Krizhevsky et al.)



**Quick facts about AlexNet**

Input: 224 X 224 X 3 channel image (the code below resizes images to 227 X 227)

++
* Data: 1 million images from ImageNet
* Data augmentation used to increase the training data size
* ReLU non-linear activation to avoid vanishing gradients (gradients shrinking in deep networks because of small derivatives in backpropagation)

--
* Large number of learnable parameters to optimize (about 62 million, compared with ~60,000 for LeNet)
* Long convergence time, partly due to the normalization and dropout used to avoid overfitting
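
To make the vanishing-gradient point above concrete, here is a minimal pure-Python sketch (it is not taken from the AlexNet paper). It applies the same activation 20 times in a row and multiplies the local derivatives, as backpropagation would. The depth of 20 and the starting input of 1.0 are arbitrary values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_gradient(act, act_grad, x, depth):
    """Derivative of act(act(...act(x))) with respect to x, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        grad *= act_grad(x)  # local derivative at this layer's input
        x = act(x)           # value passed on to the next layer
    return grad

sigmoid_chain = chained_gradient(sigmoid, sigmoid_grad, 1.0, depth=20)
relu_chain = chained_gradient(relu, relu_grad, 1.0, depth=20)
print("sigmoid chain gradient:", sigmoid_chain)
print("ReLU chain gradient:   ", relu_chain)
```

Each sigmoid derivative is at most 0.25, so the product shrinks quickly with depth, while ReLU passes a derivative of exactly 1 for positive inputs.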

We will use [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset available from PyTorch for this example.

This example is based on the [PaperSpace](https://blog.paperspace.com) blog series on building DL models from scratch.

## Import packages

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler


### Let's set device configuration to use GPU if available (because AlexNet has 62 million trainable parameters!)

In [2]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## We define a train-validation loader and test loader for batch loading the data

In [3]:
def get_train_valid_loader(data_dir,
                           batch_size,
                           augment,
                           random_seed,
                           valid_size=0.1,
                           shuffle=True):

    # normalizing function using mean and standard deviation; it is applied to each channel separately
    normalize = transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010],)

    # define transforms
    valid_transform = transforms.Compose([
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
    ])

    # If augmentation is used to increase the training data,
    # two augmentations are applied: random crop and horizontal flip.
    # The original AlexNet uses a third augmentation, which is omitted here.
    if augment:
        train_transform = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            # resize after augmenting so the input still matches the 227 X 227 size AlexNet expects
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
        ])

    else:
        train_transform = transforms.Compose([
            transforms.Resize((227,227)),
            transforms.ToTensor(),
            normalize,
        ])

    # load the dataset: train
    train_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=train_transform,
    )

    # load the dataset: valid
    valid_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=valid_transform,
    )

    # train and valid split
    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    # random shuffling for generalization
    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    #batch loading data using random sampler
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler)
 
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler)

    return (train_loader, valid_loader)
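
`transforms.Normalize` subtracts a per-channel mean and divides by a per-channel standard deviation. The sketch below reproduces that arithmetic in plain Python for a single RGB pixel, using the same statistics as the loader above. The pixel values are made up for illustration, and they are assumed to be already scaled to [0, 1], which is what `transforms.ToTensor()` does.

```python
# Per-channel statistics used in the loader above (R, G, B)
mean = [0.4914, 0.4822, 0.4465]
std = [0.2023, 0.1994, 0.2010]

def normalize_pixel(pixel, mean, std):
    """Apply (value - mean) / std to each channel independently."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]

# Hypothetical pixel: red at the dataset mean, bright green, no blue
pixel = [0.4914, 0.9, 0.0]
normalized = normalize_pixel(pixel, mean, std)
print(normalized)
```

A channel value equal to the mean maps to 0, brighter values become positive, and darker values become negative, so every channel ends up on a comparable scale.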





In [None]:
#test loader
def get_test_loader(data_dir,
                    batch_size,
                    shuffle=True):
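    # use the same CIFAR10 statistics as the train/validation loader,
    # so that test images are normalized the same way as training images
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010],
    )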

    # define transform
    transform = transforms.Compose([
        transforms.Resize((227,227)),
        transforms.ToTensor(),
        normalize,
    ])

    dataset = datasets.CIFAR10(
        root=data_dir, train=False,
        download=True, transform=transform,
    )

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle
    )

    return data_loader

# CIFAR10 dataset loading
train_loader, valid_loader = get_train_valid_loader(data_dir = './data', batch_size = 64,
                       augment = False, random_seed = 1)

test_loader = get_test_loader(data_dir = './data',
                              batch_size = 64)

### Let's check the split of the data

In [11]:
# len(loader) is the number of batches, so these are upper bounds (the last batch is partial)
print("Number of train images:", len(train_loader)*64)       # ~45000 images
print("Number of validation images:", len(valid_loader)*64)  # ~5000 images
print("Number of test images:", len(test_loader)*64)         # ~10000 images

Number of train images: 45056
Number of validation images: 5056
Number of test images: 10048
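
The counts printed above are slightly higher than the real split sizes because `len(loader)` returns the number of batches, and the last batch is usually only partly filled. The short sketch below redoes that arithmetic, assuming the 45,000 / 5,000 / 10,000 split noted in the code comments.

```python
import math

def batches_and_upper_bound(num_images, batch_size):
    """Number of batches, and the image count you get by assuming every batch is full."""
    num_batches = math.ceil(num_images / batch_size)
    return num_batches, num_batches * batch_size

split_sizes = {"train": 45000, "validation": 5000, "test": 10000}
for name, num_images in split_sizes.items():
    num_batches, upper_bound = batches_and_upper_bound(num_images, batch_size=64)
    print(name, "batches:", num_batches, "batches * 64:", upper_bound)
```

If exact counts are needed, `len(train_loader.sampler)` and `len(test_loader.dataset)` are likely the quantities to look at, since they count images and not batches.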


## Questions

* Why is `shuffle` used in the loader function?
* The input dimension is 227 X 227. Can we use inputs of other sizes here?
* For what purpose is augmentation used here?
* How does normalization help in training?

### Now we will implement the main architecture of AlexNet

In [14]:
#AlexNet
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),  # 3 input channels: CIFAR10 images are RGB
            nn.BatchNorm2d(96),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.layer3 = nn.Sequential(
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU())
        self.layer4 = nn.Sequential(
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU())
        self.layer5 = nn.Sequential(
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 3, stride = 2))
        self.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(9216, 4096),
            nn.ReLU())
        self.fc1 = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU())
        self.fc2= nn.Sequential(
            nn.Linear(4096, num_classes))

    #putting all the layers together   
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        out = self.fc1(out)
        out = self.fc2(out)
        return out
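
The `9216` in the first fully connected layer comes from the spatial size left after the five convolutional blocks. The sketch below traces that size by hand using the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The helper names `conv_out` and `alexnet_feature_size` are made up for this illustration and are not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def alexnet_feature_size(size):
    """Spatial size after layer1..layer5 of the AlexNet defined above."""
    size = conv_out(size, kernel=11, stride=4)            # layer1 conv
    size = conv_out(size, kernel=3, stride=2)             # layer1 max pool
    size = conv_out(size, kernel=5, stride=1, padding=2)  # layer2 conv
    size = conv_out(size, kernel=3, stride=2)             # layer2 max pool
    size = conv_out(size, kernel=3, stride=1, padding=1)  # layer3 conv
    size = conv_out(size, kernel=3, stride=1, padding=1)  # layer4 conv
    size = conv_out(size, kernel=3, stride=1, padding=1)  # layer5 conv
    size = conv_out(size, kernel=3, stride=2)             # layer5 max pool
    return size

for input_size in (227, 224):
    side = alexnet_feature_size(input_size)
    print(input_size, "->", side, "x", side, "x 256 =", 256 * side * side, "features")
```

With a 227 X 227 input the final feature map is 6 X 6 X 256 = 9216 values, which matches `nn.Linear(9216, 4096)`. A 224 X 224 input gives 5 X 5 X 256 = 6400, so the first linear layer would need to change for that input size.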

## Training AlexNet

In [6]:
num_classes = 10
num_epochs = 20
batch_size = 64
learning_rate = 0.005

model = AlexNet(num_classes).to(device)


# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay = 0.005, momentum = 0.9)  


# Train the model
total_step = len(train_loader)
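
As a rough check on the model size, the parameters of the layers defined above can be counted by hand. This is a sketch under stated assumptions: 3 input channels, the layer sizes from the `AlexNet` class in this notebook (which uses batch normalization and no grouped convolutions, unlike the original paper), and made-up helper names.

```python
def conv_params(in_channels, out_channels, kernel):
    return in_channels * out_channels * kernel * kernel + out_channels  # weights + biases

def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + biases

def batchnorm_params(channels):
    return 2 * channels  # learnable scale and shift per channel

def alexnet_param_count(num_classes, in_channels=3):
    # (in_channels, out_channels, kernel_size) for layer1..layer5
    convs = [(in_channels, 96, 11), (96, 256, 5), (256, 384, 3),
             (384, 384, 3), (384, 256, 3)]
    total = sum(conv_params(i, o, k) + batchnorm_params(o) for i, o, k in convs)
    total += linear_params(9216, 4096)         # fc
    total += linear_params(4096, 4096)         # fc1
    total += linear_params(4096, num_classes)  # fc2
    return total

print("10 classes:  ", alexnet_param_count(10))
print("1000 classes:", alexnet_param_count(1000))
```

Under these assumptions the 10-class model has roughly 58 million parameters, and the same layers with a 1000-class output have roughly 62 million. Most of them sit in the first fully connected layer. On the actual model, `sum(p.numel() for p in model.parameters())` should give the corresponding count.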

In [7]:
total_step = len(train_loader)

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
            
    # Validation: switch to evaluation mode so that dropout is disabled
    # and batch normalization uses its running statistics
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in valid_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            del images, labels, outputs

        print('Accuracy of the network on the {} validation images: {} %'.format(total, 100 * correct / total))
    model.train()  # back to training mode for the next epoch

Epoch [1/20], Step [704/704], Loss: 1.5783
Accuracy of the network on the 5000 validation images: 58.58 %
Epoch [2/20], Step [704/704], Loss: 0.6471
Accuracy of the network on the 5000 validation images: 69.16 %
Epoch [3/20], Step [704/704], Loss: 1.4001
Accuracy of the network on the 5000 validation images: 70.24 %
Epoch [4/20], Step [704/704], Loss: 0.9505
Accuracy of the network on the 5000 validation images: 74.06 %
Epoch [5/20], Step [704/704], Loss: 0.9067
Accuracy of the network on the 5000 validation images: 77.64 %
Epoch [6/20], Step [704/704], Loss: 1.4402
Accuracy of the network on the 5000 validation images: 77.3 %
Epoch [7/20], Step [704/704], Loss: 0.4090
Accuracy of the network on the 5000 validation images: 78.46 %
Epoch [8/20], Step [704/704], Loss: 0.3330
Accuracy of the network on the 5000 validation images: 79.26 %
Epoch [9/20], Step [704/704], Loss: 0.2193
Accuracy of the network on the 5000 validation images: 80.34 %
Epoch [10/20], Step [704/704], Loss: 1.3074
Acc

## Testing on Unseen data

In [8]:
model.eval()  # evaluation mode: disables dropout and uses batch norm running statistics
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        del images, labels, outputs

    print('Accuracy of the network on the {} test images: {} %'.format(10000, 100 * correct / total))

Accuracy of the network on the 10000 test images: 79.36 %


## Questions

* Why is augmentation only used on the training data?

* There are lots of choices about the parameters here! Feel free to experiment with one of them.
  * Try changing the kernel size. Do you observe any change?

  * What about changing the stride?

  * Change the optimizer from SGD to Adam.

* What is weight decay? 
  * It is a regularization technique that helps avoid overfitting by penalizing large weights, scaled by a weighting factor

* What is momentum?

* Observe the loss and accuracy for the 20 epochs. What can you see? How do you explain it?

* Suppose you want to see visually how loss and accuracy change per epoch. How would you record these values to plot them?

* For calculating classification performance, this example used accuracy. Are there other metrics? Is accuracy the correct metric to use in this case? 
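
As a starting point for the weight decay and momentum questions above, here is a minimal sketch of one SGD update on a single scalar weight. It follows the update rule described in the PyTorch documentation for `torch.optim.SGD` with default dampening and no Nesterov momentum. The function name `sgd_step` and the constant gradient of 0.5 are made up for illustration.

```python
def sgd_step(weight, grad, velocity, lr, momentum, weight_decay):
    """One SGD update with weight decay and momentum for a single scalar weight."""
    grad = grad + weight_decay * weight    # weight decay: extra pull of the weight toward zero
    velocity = momentum * velocity + grad  # momentum: accumulate past gradients
    weight = weight - lr * velocity        # step against the accumulated direction
    return weight, velocity

# Same hyperparameters as the AlexNet optimizer above
weight, velocity = 1.0, 0.0
for step in range(3):
    weight, velocity = sgd_step(weight, grad=0.5, velocity=velocity,
                                lr=0.005, momentum=0.9, weight_decay=0.005)
    print(step, weight, velocity)
```

With a constant gradient the velocity keeps growing from step to step, so later updates are larger than the first one. That is the effect momentum is meant to have when successive gradients agree.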

## [Optional] Let's look at [ResNet](https://arxiv.org/pdf/1512.03385.pdf) (He et al.)

**Quick facts about ResNet**

Input: 224 X 224 X 3 channel image

++
* Data: 1 million images from ImageNet
* Bigger network with many layers, but fewer trainable parameters
* Skip connections to ease the training of very deep networks and avoid vanishing gradients


We will use [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset available from PyTorch for this example.

## Defining data loader

In [16]:
def data_loader(data_dir,
                batch_size,
                random_seed=42,
                valid_size=0.1,
                shuffle=True,
                test=False):
  
    normalize = transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010],
    )

    # define transforms
    transform = transforms.Compose([
            transforms.Resize((224,224)),
            transforms.ToTensor(),
            normalize,
    ])

    if test:
        dataset = datasets.CIFAR10(
          root=data_dir, train=False,
          download=True, transform=transform,
        )

        data_loader = torch.utils.data.DataLoader(
            dataset, batch_size=batch_size, shuffle=shuffle
        )

        return data_loader

    # load the dataset
    train_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=transform,
    )

    valid_dataset = datasets.CIFAR10(
        root=data_dir, train=True,
        download=True, transform=transform,
    )

    num_train = len(train_dataset)
    indices = list(range(num_train))
    split = int(np.floor(valid_size * num_train))

    if shuffle:
        np.random.seed(random_seed)
        np.random.shuffle(indices)

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler)
 
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler)

    return (train_loader, valid_loader)


# CIFAR10 dataset 
train_loader, valid_loader = data_loader(data_dir='./data',
                                         batch_size=64)

test_loader = data_loader(data_dir='./data',
                              batch_size=64,
                              test=True)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


## Defining the Residual Block

In [17]:
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride = 1, downsample = None):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Sequential(
                        nn.Conv2d(in_channels, out_channels, kernel_size = 3, stride = stride, padding = 1),
                        nn.BatchNorm2d(out_channels),
                        nn.ReLU())
        self.conv2 = nn.Sequential(
                        nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = 1, padding = 1),
                        nn.BatchNorm2d(out_channels))
        self.downsample = downsample
        self.relu = nn.ReLU()
        self.out_channels = out_channels
        
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.conv2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
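
The `out += residual` line in `forward` is the skip connection. A rough way to see why it helps gradients: if a block computes y = F(x) + x, its local derivative is F'(x) + 1 instead of F'(x). The sketch below multiplies these local derivatives over 16 blocks, the total number of blocks in a [3, 4, 6, 3] configuration. The value 0.1 for F'(x) is a made-up number for illustration, and the final ReLU and the downsample path are ignored.

```python
def chained_derivative(local_derivative, num_blocks, skip):
    """Product of per-block derivatives, with or without the identity skip path."""
    total = 1.0
    for _ in range(num_blocks):
        total *= local_derivative + (1.0 if skip else 0.0)
    return total

plain = chained_derivative(0.1, num_blocks=16, skip=False)
residual = chained_derivative(0.1, num_blocks=16, skip=True)
print("without skip connections:", plain)
print("with skip connections:   ", residual)
```

Without the skip path the product collapses toward zero, while the identity term keeps each factor at 1 or above in this toy setting.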

## ResNet architecture

In [18]:
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes = 10):
        super(ResNet, self).__init__()
        self.inplanes = 64
        self.conv1 = nn.Sequential(
                        nn.Conv2d(3, 64, kernel_size = 7, stride = 2, padding = 3),
                        nn.BatchNorm2d(64),
                        nn.ReLU())
        self.maxpool = nn.MaxPool2d(kernel_size = 3, stride = 2, padding = 1)
        self.layer0 = self._make_layer(block, 64, layers[0], stride = 1)
        self.layer1 = self._make_layer(block, 128, layers[1], stride = 2)
        self.layer2 = self._make_layer(block, 256, layers[2], stride = 2)
        self.layer3 = self._make_layer(block, 512, layers[3], stride = 2)
        self.avgpool = nn.AvgPool2d(7, stride=1)
        self.fc = nn.Linear(512, num_classes)
        
    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes:
            
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes, kernel_size=1, stride=stride),
                nn.BatchNorm2d(planes),
            )
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)
    
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(x)
        x = self.layer0(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x
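
To see why `nn.AvgPool2d(7, stride=1)` and `nn.Linear(512, num_classes)` fit together, the spatial size can be traced through the network by hand. This sketch assumes a 224 X 224 input and a `[3, 4, 6, 3]` block configuration. The helper names are made up, and the depth count ignores the 1 X 1 downsample convolutions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet_trace(size, layers=(3, 4, 6, 3)):
    """Spatial size after each stage, plus the number of weighted layers."""
    sizes = [size]
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    sizes.append(size)
    size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
    sizes.append(size)
    for stride in (1, 2, 2, 2):                           # layer0..layer3
        # only the first block of each stage uses the stage stride;
        # the remaining 3x3 convolutions keep the size unchanged
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        sizes.append(size)
    size = conv_out(size, kernel=7, stride=1)             # avgpool
    sizes.append(size)
    depth = 1 + 2 * sum(layers) + 1  # conv1 + two convs per block + fc
    return sizes, depth

sizes, depth = resnet_trace(224)
print("spatial sizes:", sizes)
print("weighted layers:", depth)
```

The last stage leaves a 7 X 7 X 512 feature map, the average pool reduces it to 1 X 1 X 512, and flattening gives the 512 inputs of the final linear layer. With this block configuration the count comes to 34 weighted layers.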

## Defining Hyperparameters

In [19]:
num_classes = 10
num_epochs = 20
batch_size = 16
learning_rate = 0.01

model = ResNet(ResidualBlock, [3, 4, 6, 3]).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay = 0.001, momentum = 0.9)  

# Train the model
total_step = len(train_loader)

## Training ResNet

In [20]:
import gc
total_step = len(train_loader)

for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        del images, labels, outputs
        torch.cuda.empty_cache()
        gc.collect()

    print ('Epoch [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, loss.item()))
            
    # Validation: switch to evaluation mode so that batch normalization
    # uses its running statistics instead of per-batch statistics
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in valid_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            del images, labels, outputs

        print('Accuracy of the network on the {} validation images: {} %'.format(total, 100 * correct / total))
    model.train()  # back to training mode for the next epoch

Epoch [1/20], Loss: 1.9688
Accuracy of the network on the 5000 validation images: 55.36 %
Epoch [2/20], Loss: 1.3953
Accuracy of the network on the 5000 validation images: 71.12 %
Epoch [3/20], Loss: 1.0717
Accuracy of the network on the 5000 validation images: 77.2 %
Epoch [4/20], Loss: 1.1225
Accuracy of the network on the 5000 validation images: 78.96 %
Epoch [5/20], Loss: 0.2865
Accuracy of the network on the 5000 validation images: 79.26 %
Epoch [6/20], Loss: 0.5070
Accuracy of the network on the 5000 validation images: 81.4 %
Epoch [7/20], Loss: 0.9793
Accuracy of the network on the 5000 validation images: 81.68 %
Epoch [8/20], Loss: 0.2860
Accuracy of the network on the 5000 validation images: 83.48 %
Epoch [9/20], Loss: 0.2861
Accuracy of the network on the 5000 validation images: 82.06 %
Epoch [10/20], Loss: 0.2897
Accuracy of the network on the 5000 validation images: 83.12 %
Epoch [11/20], Loss: 0.2443
Accuracy of the network on the 5000 validation images: 82.32 %
Epoch [12/

## Test on unseen test images

In [None]:
model.eval()  # evaluation mode: batch norm uses its running statistics
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        del images, labels, outputs

    print('Accuracy of the network on the {} test images: {} %'.format(10000, 100 * correct / total))   

## Questions

* What characteristics do you observe from the two architectures?
  * similarities
  * differences