

If running with Colab:

We commented out the cells containing `%tensorboard --logdir runs`, since the widget sometimes got stuck. When running the notebook cell by cell, we uncommented the cell, viewed the results, then commented it out again and reran it.

Another option is to run the whole notebook at once and then open the TensorBoard window.


# Image Classification - Tensorboard, Batch Norm and Custom Loss Functions
In this exercise, you'll continue to work with our neural network for classifying Israeli Politicians.  
We will use tensorboard to monitor the training process and model performance.  

For the questions below, please use the network architecture you suggested in Q8 of HW1.  
This time, we provide you with a clean dataset of Israeli Politicians, that doesn't include multiple politicians in the same image, in the folder `data/israeli_politicians_cleaned.zip`.

## Tensorboard
TensorBoard provides visualization and tooling for machine learning experimentation:
- Tracking and visualizing metrics such as loss and accuracy
- Visualizing the model graph (ops and layers)
- Viewing histograms of weights, biases, or other tensors as they change over time
- Projecting embeddings to a lower dimensional space
- Displaying images, text, and audio data
- Profiling programs

TensorBoard was originally built for TensorFlow but can now be used with PyTorch as well.  
You can embed a tensorboard widget in a Jupyter Notebook, although if you're not using Google Colab we recommend that you open tensorboard separately.

To get started with Tensorboard, please read the following pages:

PyTorch related:
1. https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html
1. https://becominghuman.ai/logging-in-tensorboard-with-pytorch-or-any-other-library-c549163dee9e
1. https://towardsdatascience.com/https-medium-com-dinber19-take-a-deeper-look-at-your-pytorch-model-with-the-new-tensorboard-built-in-513969cf6a72
1. https://pytorch.org/docs/stable/tensorboard.html
1. https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/04-utils/tensorboard

Tensorflow related:
1. https://itnext.io/how-to-use-tensorboard-5d82f8654496
1. https://www.datacamp.com/community/tutorials/tensorboard-tutorial
1. https://medium.com/@anthony_sarkis/tensorboard-quick-start-in-5-minutes-e3ec69f673af
1. https://www.guru99.com/tensorboard-tutorial.html
1. https://www.youtube.com/watch?time_continue=1&v=s-lHP8v9qzY&feature=emb_logo
1. https://www.youtube.com/watch?v=pSexXMdruFM


### Starting Tensorboard
Jupyter Notebook has extensions for displaying TensorBoard inside the notebook. Still, I recommend that you run it separately, as it tends to get stuck in notebooks.

The syntax to load TensorBoard in a notebook is this:
```python
# Load the TensorBoard notebook extension
%load_ext tensorboard
%tensorboard --logdir ./logs
```

In the shell, you can instead run:
```
tensorboard --logdir ./logs
```

In [1]:
%load_ext tensorboard

In [2]:
import warnings
warnings.filterwarnings("ignore")

### Show images using TensorBoard

In [3]:
from torchvision import datasets, models, transforms
import os
import torch
import torchvision
import matplotlib.pyplot as plt
import numpy as np
from torch.utils.tensorboard import SummaryWriter

In [4]:
# Create a folder for our data (-p also creates the parent folder and does nothing if it already exists)
!mkdir -p data/israeli_politicians_cleaned


In [5]:
# Download our dataset and extract it
import requests
from zipfile import ZipFile

url = 'https://github.com/omriallouche/ydata_deep_learning_2021/blob/main/data/israeli_politicians_cleaned.zip?raw=true'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed
with open('./data/israeli_politicians_cleaned.zip', 'wb') as f:
    f.write(r.content)

with ZipFile('./data/israeli_politicians_cleaned.zip', 'r') as zipObj:
   # Extract all the contents of the zip file into the data folder
   zipObj.extractall(path='./data/israeli_politicians_cleaned/')

In [6]:
# Create transformers
means = [0.485, 0.456, 0.406]
stds = [0.229, 0.224, 0.225]

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize(means, stds)
    ]),
    'val': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize(means, stds)
    ]),
}

In [7]:
# Load data
data_dir = r'./data/israeli_politicians_cleaned/'

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {
    'train': torch.utils.data.DataLoader(image_datasets['train'], batch_size=16,
                                             shuffle=True, num_workers=4),
    'val': torch.utils.data.DataLoader(image_datasets['val'], batch_size=16,
                                          shuffle=False, num_workers=4)
  }
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
print('dataset_sizes: ', dataset_sizes)

class_names = image_datasets['train'].classes
print('class_names:', class_names)

trainloader = dataloaders['train']

dataset_sizes:  {'train': 812, 'val': 202}
class_names: ['ayelet_shaked', 'benjamin_netanyahu', 'benny_gantz', 'danny_danon', 'gideon_saar', 'kostya_kilimnik', 'naftali_bennett', 'ofir_akunis', 'yair_lapid']


In [8]:
# Writer for TensorBoard

# clear logs
!rm -rf runs

# default `log_dir` is "runs"
writer = SummaryWriter()

In [9]:
# Undo normalization to show the original images on Tensorboard
def denormalize(image):
  inp = image.numpy().transpose((1, 2, 0))
  mean = np.array(means)
  std = np.array(stds)
  inp = std * inp + mean
  inp = np.clip(inp, 0, 1)
  inp = inp.transpose((2, 0, 1))
  return torch.tensor(inp)

In [10]:
# Show images using TensorBoard

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)  # iterators no longer expose .next() in recent PyTorch versions
images_to_show = []

for i in range(len(images)):
  images_to_show.append(denormalize(images[i]))

# create grid of images
img_grid = torchvision.utils.make_grid(images_to_show)

# write to tensorboard
writer.add_image('israeli_politicians_cleaned', img_grid)

writer.flush()

In [11]:
#%tensorboard --logdir runs

### Inspect the model graph
You can print a network object to find useful information about it:

In [12]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler

In [13]:
# Our network from assignment 1

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 3 input image channels, 64 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv2 = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv3 = nn.Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv4 = nn.Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv5 = nn.Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(in_features=128*8*8, out_features=1024, bias=True)
        self.fc2 = nn.Linear(in_features=1024, out_features=1024, bias=True)
        self.fc3 = nn.Linear(in_features=1024, out_features=9, bias=True)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = F.max_pool2d(F.relu(self.conv2(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = F.max_pool2d(F.relu(self.conv3(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = F.max_pool2d(F.relu(self.conv4(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = (F.max_pool2d(F.relu(self.conv5(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False))
        x = x.view(-1, self.num_flat_features(x))
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc2(x))
        x = self.fc3(x)
        
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()

In [14]:
print(net)

Net(
  (conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=8192, out_features=1024, bias=True)
  (fc2): Linear(in_features=1024, out_features=1024, bias=True)
  (fc3): Linear(in_features=1024, out_features=9, bias=True)
)
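
The `in_features=128*8*8` of `fc1` follows from the input size. As a rough sketch, assuming the 256x256 inputs produced by the transforms above, the spatial size after each conv + max-pool stage can be worked out with the standard output-size formula (the helper name `conv_out` is ours, not part of PyTorch):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 256  # input height/width after transforms.Resize((256, 256))
sizes = [size]
for _ in range(5):  # five conv + max-pool stages
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv with padding 1 keeps the size
    size = conv_out(size, kernel=2, stride=2)              # 2x2 max-pool with stride 2 halves it
    sizes.append(size)

flat_features = 128 * size * size  # conv5 outputs 128 channels

print(sizes)          # [256, 128, 64, 32, 16, 8]
print(flat_features)  # 8192
```

If you change the input resolution or the number of pooling stages, `fc1` has to be resized accordingly.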


TensorBoard can help visualize the network graph. It takes practice to read these.  

Write the graph to TensorBoard and review it.

In [15]:
writer.add_graph(net, images)
writer.flush()

In [16]:
#%tensorboard --logdir runs

You can also use the `torchsummary` package for a fuller summary of the model:

In [17]:
!pip install torchsummary



In [18]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = net.to(device)

In [19]:
channels=3; H=256; W=256
from torchsummary import summary
summary(net, input_size=(channels, H, W))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 256, 256]           1,792
            Conv2d-2         [-1, 64, 128, 128]          36,928
            Conv2d-3          [-1, 128, 64, 64]          73,856
            Conv2d-4          [-1, 128, 32, 32]         147,584
            Conv2d-5          [-1, 128, 16, 16]         147,584
            Linear-6                 [-1, 1024]       8,389,632
            Linear-7                 [-1, 1024]       1,049,600
            Linear-8                    [-1, 9]           9,225
Total params: 9,856,201
Trainable params: 9,856,201
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.75
Forward/backward pass size (MB): 45.27
Params size (MB): 37.60
Estimated Total Size (MB): 83.61
----------------------------------------------------------------
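
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer shapes, assuming every layer has a bias term (as in `Net`) and 4 bytes per float32 parameter; the helper names are ours:

```python
def conv2d_params(in_ch, out_ch, k):
    """Weights (k*k*in_ch per output channel) plus one bias per output channel."""
    return (k * k * in_ch + 1) * out_ch

def linear_params(in_f, out_f):
    """Weights (in_f per output unit) plus one bias per output unit."""
    return (in_f + 1) * out_f

counts = {
    'conv1': conv2d_params(3, 64, 3),
    'conv2': conv2d_params(64, 64, 3),
    'conv3': conv2d_params(64, 128, 3),
    'conv4': conv2d_params(128, 128, 3),
    'conv5': conv2d_params(128, 128, 3),
    'fc1': linear_params(128 * 8 * 8, 1024),
    'fc2': linear_params(1024, 1024),
    'fc3': linear_params(1024, 9),
}
total = sum(counts.values())
size_mb = total * 4 / 1024 ** 2  # 4 bytes per float32 parameter

for name, n in counts.items():
    print(f'{name}: {n:,}')
print(f'total: {total:,}')          # total: 9,856,201
print(f'size (MB): {size_mb:.2f}')  # size (MB): 37.60
```

Note that `fc1` alone accounts for about 85% of the parameters.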


## Train the network
Next, we'll train the network. In the training loop, log relevant metrics that would allow you to plot in TensorBoard:

1. The network loss
1. Train and test error
1. Average weight in the first layer
1. Histogram of weights in the first layer

In [20]:
import time
import copy

In [21]:
# This version writes logs for TensorBoard
def train_model(model, dataloaders, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    # Init variables that will save info about the best model
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                # Set model to training mode. 
                model.train()  
            else:
                # Set model to evaluate mode. This only changes the behavior of layers such as dropout and batch norm; gradient tracking is switched off separately by torch.set_grad_enabled below
                model.eval()   

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                # Prepare the inputs for GPU/CPU
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # ===== forward pass ======
                with torch.set_grad_enabled(phase=='train'):
                    # If we're in train mode, we'll track the gradients to allow back-propagation
                    outputs = model(inputs) # apply the model to the inputs. The output holds the raw class scores (logits), not probabilities
                    _, preds = torch.max(outputs, 1) # index of the highest score = predicted class
                    loss = criterion(outputs, labels)

                    # ==== backward pass + optimizer step ====
                    # This runs only in the training phase
                    if phase == 'train':
                        loss.backward() # Compute the gradients of the loss w.r.t. the parameters
                        optimizer.step() # Update the parameters using those gradients

                # Collect statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                
            if phase == 'train':
                # Adjust the learning rate based on the scheduler
                scheduler.step()  

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # log for tensorBoard
            writer.add_scalar(f'Loss/{phase}', epoch_loss, epoch) # loss
            writer.add_scalar(f'Error/{phase}', 1 - epoch_acc, epoch) # error
            if phase == 'train':
              writer.add_histogram("conv1.weight", model.conv1.weight, epoch) # layer 1 weights histogram
              writer.add_scalar('conv1.weight.avg', torch.mean(model.conv1.weight), epoch) # layer 1 weights average

            # Keep the results of the best model so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                # deepcopy the model
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print(f'Training complete in {(time_elapsed // 60):.0f}m {(time_elapsed % 60):.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [22]:
# train the network using the same parameters as in assignment 1 which gave the best performance for our network

optimizer_ft = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)
criterion = nn.CrossEntropyLoss()

dataloaders = {
    'train': torch.utils.data.DataLoader(image_datasets['train'], batch_size=2,
                                             shuffle=True, num_workers=4),
    'val': torch.utils.data.DataLoader(image_datasets['val'], batch_size=2,
                                          shuffle=False, num_workers=4)
  }

net = train_model(net, 
                    dataloaders,
                       criterion, 
                       optimizer_ft, 
                       exp_lr_scheduler,
                       num_epochs=20)
writer.flush()

Epoch 0/19
----------
train Loss: 2.1015 Acc: 0.2672
val Loss: 2.0875 Acc: 0.2376

Epoch 1/19
----------
train Loss: 2.0375 Acc: 0.2820
val Loss: 2.0910 Acc: 0.2376

Epoch 2/19
----------
train Loss: 2.0096 Acc: 0.2820
val Loss: 2.0806 Acc: 0.2574

Epoch 3/19
----------
train Loss: 1.9712 Acc: 0.2796
val Loss: 2.0372 Acc: 0.2376

Epoch 4/19
----------
train Loss: 1.9063 Acc: 0.3153
val Loss: 1.9549 Acc: 0.2723

Epoch 5/19
----------
train Loss: 1.8531 Acc: 0.3510
val Loss: 1.9496 Acc: 0.3267

Epoch 6/19
----------
train Loss: 1.7982 Acc: 0.3559
val Loss: 1.9383 Acc: 0.3020

Epoch 7/19
----------
train Loss: 1.6191 Acc: 0.4138
val Loss: 1.8283 Acc: 0.3465

Epoch 8/19
----------
train Loss: 1.5445 Acc: 0.4594
val Loss: 1.8148 Acc: 0.3515

Epoch 9/19
----------
train Loss: 1.4845 Acc: 0.4828
val Loss: 1.8660 Acc: 0.3663

Epoch 10/19
----------
train Loss: 1.4152 Acc: 0.5000
val Loss: 1.7650 Acc: 0.4010

Epoch 11/19
----------
train Loss: 1.3249 Acc: 0.5456
val Loss: 2.0607 Acc: 0.3614

Ep

In [23]:
#%tensorboard --logdir runs
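
The drop in training loss at epoch 7 of the run above coincides with the first learning-rate decay. As a small sketch, assuming `scheduler.step()` is called once per epoch as in `train_model`, the `StepLR(step_size=7, gamma=0.1)` schedule reduces to the closed form below (the function `step_lr` is our own illustration, not a PyTorch API):

```python
def step_lr(base_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate used during `epoch` when the rate is multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.001, epoch) for epoch in range(20)]

for epoch, lr in enumerate(schedule):
    print(f'epoch {epoch:2d}: lr = {lr:.0e}')
```

With these settings, epochs 0-6 train at 1e-3, epochs 7-13 at 1e-4, and epochs 14-19 at 1e-5.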

### Precision-Recall Curve
Use TensorBoard to plot the precision-recall curve:

In [24]:
# Returns all predictions, labels and class probabilities
def get_pred_with_probs(model, data_loader):
    model.eval()
    predictions = []
    real_values = []
    probabilities = []
    sm = torch.nn.Softmax(dim=1)  # normalize over the class dimension

    with torch.no_grad():

        for inputs, labels in data_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = sm(model(inputs))
            probs, preds = torch.max(outputs, 1)
            # Move each batch to the CPU so the results can be used with NumPy
            predictions.append(preds.cpu())
            real_values.append(labels.cpu())
            probabilities.append(outputs.cpu())

    # Concatenate the per-batch tensors along the sample dimension
    predictions = torch.cat(predictions)
    real_values = torch.cat(real_values)
    probabilities = torch.cat(probabilities)
    return predictions, real_values, probabilities

In [25]:
# Write one PR curve per class (one-vs-rest)

y_pred, y_test, probs = get_pred_with_probs(net, dataloaders['val'])
classes = range(len(class_names))

for i in classes:
  labels_i = y_test == i
  preds_i = probs[:, i]
  writer.add_pr_curve(class_names[i], labels_i, preds_i, global_step=0)

writer.flush()

In [26]:
#%tensorboard --logdir runs
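
Each curve written above treats one class as positive and all others as negative. The sketch below shows, on made-up labels and scores, how a single point of such a curve is obtained for a given threshold. It is a plain-Python illustration of the definition; TensorBoard's own thresholding and binning may differ in detail.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when predicting positive for every score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and not y)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [True, False, True, False, True]  # hypothetical one-vs-rest ground truth
scores = [0.9, 0.8, 0.6, 0.3, 0.2]          # hypothetical predicted probabilities for the class

curve = [precision_recall(labels, scores, t) for t in (0.1, 0.5, 0.85)]
for (precision, recall), t in zip(curve, (0.1, 0.5, 0.85)):
    print(f'threshold {t}: precision={precision:.2f} recall={recall:.2f}')
```

Raising the threshold trades recall for precision, which is the shape you should see in the TensorBoard PR curves.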

### Display Model Errors
A valuable practice is to review errors made by the model in the test set. These might reveal cases of bad preprocessing or suggest improvements to your original model.

Show 12 images of errors made by the model. For each, display the true and predicted classes, and the model confidence in its answer.

In [27]:
errors = np.array(y_pred != y_test)
err_probs = errors / np.count_nonzero(errors)
# Sample 12 misclassified validation images (correct predictions get probability 0)
indices = np.random.choice(len(errors),size=12,replace=False,p=err_probs)
max_probs, _ = probs.max(axis=1)

for i in indices:
    label = f'True label: {class_names[y_test[i]]}  Predicted label: {class_names[y_pred[i]]}  Confidence: {max_probs[i].item():.2f}'
    image = denormalize(dataloaders['val'].dataset[i][0])
    writer.add_image(label, image)

writer.flush()

In [28]:
#%tensorboard --logdir runs
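
The "confidence" shown for each error is the largest softmax probability of the network output. As a minimal sketch with made-up logits (plain Python, independent of the model above):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() numerically stable."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
class_probs = softmax(logits)
confidence = max(class_probs)
predicted = class_probs.index(confidence)

print(f'predicted class {predicted} with confidence {confidence:.2f}')
```

A high confidence on a wrong prediction is often the most informative kind of error to inspect.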

## Batch Normalization
In this section, we'll add a Batch Norm layer to your network.  
Use TensorBoard to compare the network's convergence (train and validation loss) with and without Batch Normalization.
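
Before adding the layers, it can help to see what a batch norm layer computes. The sketch below, on random data of our own choosing, checks that `nn.BatchNorm2d` in training mode normalizes each channel over the batch and spatial dimensions to roughly zero mean and unit variance (with the default affine parameters, scale 1 and shift 0):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8) * 3.0 + 5.0  # hypothetical activations with mean ~5 and std ~3

bn = nn.BatchNorm2d(3)
bn.train()  # use the statistics of the current batch
y = bn(x)

# Manual computation: per-channel mean and (biased) variance over batch, height and width
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(y.mean(dim=(0, 2, 3)))                 # close to 0 for every channel
print(y.var(dim=(0, 2, 3), unbiased=False))  # close to 1 for every channel
print(torch.allclose(y, manual, atol=1e-4))
```

In `eval()` mode the layer uses its running statistics instead of the batch statistics, which is why calling `model.train()` and `model.eval()` at the right time matters for networks with batch norm.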

In [29]:
# Added batch norm after each conv layer and each hidden fc layer, which improved the performance
class NetNorm(nn.Module):

    def __init__(self):
        super(NetNorm, self).__init__()
        # 3 input image channels, 64 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv1_bn=nn.BatchNorm2d(64)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv2_bn=nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv3_bn=nn.BatchNorm2d(128)
        self.conv4 = nn.Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv4_bn=nn.BatchNorm2d(128)
        self.conv5 = nn.Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        self.conv5_bn=nn.BatchNorm2d(128)
        
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(in_features=128*8*8, out_features=1024, bias=True)
        self.fc1_bn=nn.BatchNorm1d(1024)
        self.fc2 = nn.Linear(in_features=1024, out_features=1024, bias=True)
        self.fc2_bn=nn.BatchNorm1d(1024)
        self.fc3 = nn.Linear(in_features=1024, out_features=9, bias=True)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = self.conv1(x)
        x = F.max_pool2d(F.relu(self.conv1_bn(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = self.conv2(x)
        x = F.max_pool2d(F.relu(self.conv2_bn(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = self.conv3(x)
        x = F.max_pool2d(F.relu(self.conv3_bn(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = self.conv4(x)
        x = F.max_pool2d(F.relu(self.conv4_bn(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        x = self.conv5(x)
        x = (F.max_pool2d(F.relu(self.conv5_bn(x)), kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False))
        x = x.view(-1, self.num_flat_features(x))
        x = self.fc1(x)
        x = F.relu(self.fc1_bn(x))
        x = self.fc2(x)
        x = F.relu(self.fc2_bn(x))
        x = self.fc3(x)
        
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

In [30]:
# Training loop that logs the loss under a tag indicating whether the model uses batch normalization
def train_model(model, dataloaders, criterion, optimizer, scheduler, num_epochs=25, norm="NoNorm"):
    since = time.time()

    # Init variables that will save info about the best model
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                # Set model to training mode. 
                model.train()  
            else:
                # Set model to evaluate mode. This only changes the behavior of layers such as dropout and batch norm; gradient tracking is switched off separately by torch.set_grad_enabled below
                model.eval()   

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                # Prepare the inputs for GPU/CPU
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # ===== forward pass ======
                with torch.set_grad_enabled(phase=='train'):
                    # If we're in train mode, we'll track the gradients to allow back-propagation
                    outputs = model(inputs) # apply the model to the inputs. The output holds the raw class scores (logits), not probabilities
                    _, preds = torch.max(outputs, 1) # index of the highest score = predicted class
                    loss = criterion(outputs, labels)

                    # ==== backward pass + optimizer step ====
                    # This runs only in the training phase
                    if phase == 'train':
                        loss.backward() # Compute the gradients of the loss w.r.t. the parameters
                        optimizer.step() # Update the parameters using those gradients

                # Collect statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                
            if phase == 'train':
                # Adjust the learning rate based on the scheduler
                scheduler.step()  

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # log for tensorBoard
            writer.add_scalar(f'{norm}/Loss/{phase}', epoch_loss, epoch) # loss

            # Keep the results of the best model so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                # deepcopy the model
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print(f'Training complete in {(time_elapsed // 60):.0f}m {(time_elapsed % 60):.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model

In [31]:
# To compare performance with batch norm we set batch_size=16; without normalization the net converges very slowly
dataloaders = {
    'train': torch.utils.data.DataLoader(image_datasets['train'], batch_size=16,
                                             shuffle=True, num_workers=4),
    'val': torch.utils.data.DataLoader(image_datasets['val'], batch_size=16,
                                          shuffle=False, num_workers=4)
  }

net = Net()
net = net.to(device)
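optimizer_ft = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
# Note: exp_lr_scheduler is still bound to the optimizer created for the first training run,
# so the learning rate of this new optimizer stays at 0.001 for all epochs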

net = train_model(net, 
                    dataloaders,
                       criterion, 
                       optimizer_ft, 
                       exp_lr_scheduler,
                       num_epochs=20)
writer.flush()

Epoch 0/19
----------
train Loss: 2.1932 Acc: 0.1798
val Loss: 2.1806 Acc: 0.2376

Epoch 1/19
----------
train Loss: 2.1603 Acc: 0.2820
val Loss: 2.1549 Acc: 0.2376

Epoch 2/19
----------
train Loss: 2.1186 Acc: 0.2820
val Loss: 2.1209 Acc: 0.2376

Epoch 3/19
----------
train Loss: 2.0573 Acc: 0.2820
val Loss: 2.0985 Acc: 0.2376

Epoch 4/19
----------
train Loss: 2.0261 Acc: 0.2820
val Loss: 2.0916 Acc: 0.2376

Epoch 5/19
----------
train Loss: 2.0180 Acc: 0.2820
val Loss: 2.0936 Acc: 0.2376

Epoch 6/19
----------
train Loss: 2.0115 Acc: 0.2820
val Loss: 2.0766 Acc: 0.2376

Epoch 7/19
----------
train Loss: 2.0108 Acc: 0.2820
val Loss: 2.0756 Acc: 0.2376

Epoch 8/19
----------
train Loss: 2.0022 Acc: 0.2820
val Loss: 2.0696 Acc: 0.2376

Epoch 9/19
----------
train Loss: 1.9913 Acc: 0.2820
val Loss: 2.0518 Acc: 0.2525

Epoch 10/19
----------
train Loss: 1.9797 Acc: 0.2919
val Loss: 2.0599 Acc: 0.2426

Epoch 11/19
----------
train Loss: 1.9569 Acc: 0.2759
val Loss: 2.0297 Acc: 0.2475

Ep

In [32]:
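# Train the model that uses batch norm, logging its loss
netNorm = NetNorm()
netNorm = netNorm.to(device)
optimizer_ft = optim.SGD(netNorm.parameters(), lr=0.001, momentum=0.9)
# Note: exp_lr_scheduler is still bound to the optimizer created for the first training run,
# so the learning rate of this new optimizer stays at 0.001 for all epochs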

netNorm = train_model(netNorm, 
                    dataloaders,
                       criterion, 
                       optimizer_ft, 
                       exp_lr_scheduler,
                       num_epochs=20,
                      norm="WithNorm")
writer.flush()

Epoch 0/19
----------
train Loss: 1.8953 Acc: 0.3473
val Loss: 1.7155 Acc: 0.4406

Epoch 1/19
----------
train Loss: 1.0391 Acc: 0.6909
val Loss: 1.3776 Acc: 0.5842

Epoch 2/19
----------
train Loss: 0.4818 Acc: 0.9101
val Loss: 1.1415 Acc: 0.6485

Epoch 3/19
----------
train Loss: 0.2178 Acc: 0.9889
val Loss: 1.0435 Acc: 0.6485

Epoch 4/19
----------
train Loss: 0.0983 Acc: 0.9988
val Loss: 1.0168 Acc: 0.6683

Epoch 5/19
----------
train Loss: 0.0561 Acc: 1.0000
val Loss: 1.0104 Acc: 0.6782

Epoch 6/19
----------
train Loss: 0.0361 Acc: 1.0000
val Loss: 1.0005 Acc: 0.6485

Epoch 7/19
----------
train Loss: 0.0384 Acc: 1.0000
val Loss: 0.9753 Acc: 0.6584

Epoch 8/19
----------
train Loss: 0.0257 Acc: 1.0000
val Loss: 0.9985 Acc: 0.6634

Epoch 9/19
----------
train Loss: 0.0200 Acc: 1.0000
val Loss: 1.0227 Acc: 0.6832

Epoch 10/19
----------
train Loss: 0.0197 Acc: 1.0000
val Loss: 1.0711 Acc: 0.6683

Epoch 11/19
----------
train Loss: 0.0159 Acc: 1.0000
val Loss: 1.0583 Acc: 0.6683

Ep

In [33]:
#%tensorboard --logdir runs

Use TensorBoard to plot the distribution of activations with and without Batch Normalization.

In [34]:
activation = {}

def get_activation(name):

    def hook(model, input, output):
        activation[name] = output.detach()

    return hook

In [35]:
# add activation histogram of each fc layer
models = {'No_Norm': Net(),
          'Norm': NetNorm()}

images = images.to(device)

for model_name, model in models.items():
    model = model.to(device)
    
    for name, layer in model.named_modules():
      if '_bn' not in name and 'fc' in name:
        layer.register_forward_hook(get_activation(name))
    
    outputs = model(images)

    for name, layer in model.named_modules():
      if '_bn' not in name and 'fc' in name:
        act = activation[name].squeeze()
        writer.add_histogram(f'{name}/{model_name}', act)
    
    writer.flush()

In [36]:
#%tensorboard --logdir runs
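
The activation histograms above rely on forward hooks. As a minimal standalone sketch on a hypothetical toy model (the names `captured`, `save_output` and `toy` are ours), this is how a hook captures a layer's output and how the handle returned by `register_forward_hook` removes it again:

```python
import torch
import torch.nn as nn

captured = {}

def save_output(name):
    # The hook receives the module, its inputs and its output on every forward pass
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))  # hypothetical toy model
handle = toy[0].register_forward_hook(save_output('fc1'))

toy(torch.randn(5, 4))
first_shape = tuple(captured['fc1'].shape)
print(first_shape)  # (5, 8)

# Remove the hook once it is no longer needed, so later forward passes are not recorded
handle.remove()
captured.clear()
toy(torch.randn(5, 4))
print('fc1' in captured)  # False
```

Removing hooks matters mostly when the same model object is reused; in the cell above each model is created fresh, so leftover hooks do no harm.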

## Custom Loss Function
Manually labeled datasets often contain labeling errors. These can have a large effect on the trained model.  
In this task we’ll work on a highly noisy dataset. Take our cleaned Israeli Politicians dataset and randomly replace 10% of the true labels.
Compare the performance of the original model to a similar model trained on the noisy labels. 

Suggest a loss function that might help with noisy labels. Following this guide, implement your own custom loss function in PyTorch and compare the model performance using it:  
https://discuss.pytorch.org/t/solved-what-is-the-correct-way-to-implement-custom-loss-function/3568/9
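
As one possible candidate, and not necessarily the loss used later in this notebook, here is a sketch of the generalized cross entropy (GCE) loss from the noisy-labels literature, L_q = (1 - p_y^q) / q, where p_y is the softmax probability of the labeled class. For q close to 0 it behaves like cross entropy, and for q = 1 it becomes 1 - p_y, which penalizes confidently "wrong" (possibly mislabeled) samples less harshly. The class name and the default `q=0.7` are our own choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralizedCrossEntropy(nn.Module):
    """Sketch of the GCE loss: mean over the batch of (1 - p_y ** q) / q."""

    def __init__(self, q=0.7):
        super().__init__()
        self.q = q  # 0 < q <= 1; smaller q is closer to standard cross entropy

    def forward(self, logits, targets):
        probs = F.softmax(logits, dim=1)
        # Probability assigned to the labeled class of each sample
        p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        loss = (1.0 - p_true.clamp_min(1e-7) ** self.q) / self.q
        return loss.mean()

# Usage on hypothetical logits for a batch of 2 samples and 3 classes
example_logits = torch.tensor([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], requires_grad=True)
example_targets = torch.tensor([0, 1])
gce = GeneralizedCrossEntropy(q=0.7)
example_loss = gce(example_logits, example_targets)
example_loss.backward()  # gradients flow through the custom loss like any built-in criterion
print(example_loss.item())
```

Because it takes logits and integer class labels, such a module can be passed as `criterion` to `train_model` in place of `nn.CrossEntropyLoss()`.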


In [37]:
# Training loop that also returns the best validation accuracy
def train_model(model, dataloaders, criterion, optimizer, scheduler, num_epochs=25, norm="NoNorm"):
    since = time.time()

    # Init variables that will save info about the best model
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                # Set model to training mode. 
                model.train()  
            else:
                # Set model to evaluate mode. This only changes the behavior of layers such as dropout and batch norm; gradient tracking is switched off separately by torch.set_grad_enabled below
                model.eval()   

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                # Prepare the inputs for GPU/CPU
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # ===== forward pass ======
                with torch.set_grad_enabled(phase=='train'):
                    # If we're in train mode, we'll track the gradients to allow back-propagation
                    outputs = model(inputs) # apply the model to the inputs. The output holds the raw class scores (logits), not probabilities
                    _, preds = torch.max(outputs, 1) # index of the highest score = predicted class
                    loss = criterion(outputs, labels)

                    # ==== backward pass + optimizer step ====
                    # This runs only in the training phase
                    if phase == 'train':
                        loss.backward() # Compute the gradients of the loss w.r.t. the parameters
                        optimizer.step() # Update the parameters using those gradients

                # Collect statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                
            if phase == 'train':
                # Adjust the learning rate based on the scheduler
                scheduler.step()  

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # Keep the results of the best model so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                # deepcopy the model
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print(f'Training complete in {(time_elapsed // 60):.0f}m {(time_elapsed % 60):.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

    # load best model weights
    model.load_state_dict(best_model_wts)
    return (model, best_acc)

In [38]:
# NetNorm trained on clean data
netNorm = NetNorm()
netNorm = netNorm.to(device)
optimizer_ft = optim.SGD(netNorm.parameters(), lr=0.001, momentum=0.9)

netNorm, net_acc = train_model(netNorm, 
                    dataloaders,
                       criterion, 
                       optimizer_ft, 
                       exp_lr_scheduler,
                       num_epochs=20,
                      norm="WithNorm")

Epoch 0/19
----------
train Loss: 1.8771 Acc: 0.3399
val Loss: 1.6735 Acc: 0.4604

Epoch 1/19
----------
train Loss: 1.0551 Acc: 0.6921
val Loss: 1.4022 Acc: 0.5743

Epoch 2/19
----------
train Loss: 0.4780 Acc: 0.9113
val Loss: 1.1471 Acc: 0.6733

Epoch 3/19
----------
train Loss: 0.2122 Acc: 0.9828
val Loss: 1.0862 Acc: 0.6485

Epoch 4/19
----------
train Loss: 0.0961 Acc: 0.9975
val Loss: 1.0213 Acc: 0.6733

Epoch 5/19
----------
train Loss: 0.0613 Acc: 1.0000
val Loss: 1.0372 Acc: 0.6683

Epoch 6/19
----------
train Loss: 0.0351 Acc: 1.0000
val Loss: 1.0389 Acc: 0.6535

Epoch 7/19
----------
train Loss: 0.0286 Acc: 1.0000
val Loss: 1.0541 Acc: 0.6584

Epoch 8/19
----------
train Loss: 0.0228 Acc: 1.0000
val Loss: 1.0500 Acc: 0.6683

Epoch 9/19
----------
train Loss: 0.0193 Acc: 1.0000
val Loss: 1.0690 Acc: 0.6634

Epoch 10/19
----------
train Loss: 0.0165 Acc: 1.0000
val Loss: 1.0510 Acc: 0.6782

Epoch 11/19
----------
train Loss: 0.0147 Acc: 1.0000
val Loss: 1.0670 Acc: 0.6634

Ep

In [39]:
# Corrupt 10% of the training labels, then train a NetNorm model on the noisy data
num_images = len(dataloaders['train'].dataset)
indices = np.random.choice(range(num_images),size=int(num_images/10),replace=False)

train_samples = dataloaders['train'].dataset.samples
for i in indices:
  path, label = train_samples[i]
  # Uniform distribution over the 8 other classes (the true class gets probability 0)
  probs = np.full(9, 1/8)
  probs[label] = 0
  train_samples[i] = (path, np.random.choice(range(9), p=probs))

netNoise = NetNorm()
netNoise = netNoise.to(device)
optimizer_ft = optim.SGD(netNoise.parameters(), lr=0.001, momentum=0.9)

netNoise, noise_acc = train_model(netNoise, 
                    dataloaders,
                       criterion, 
                       optimizer_ft, 
                       exp_lr_scheduler,
                       num_epochs=20,
                      norm="WithNorm")

Epoch 0/19
----------
train Loss: 1.9554 Acc: 0.3091
val Loss: 1.7676 Acc: 0.4010

Epoch 1/19
----------
train Loss: 1.2125 Acc: 0.6281
val Loss: 1.4841 Acc: 0.5644

Epoch 2/19
----------
train Loss: 0.6669 Acc: 0.8362
val Loss: 1.2729 Acc: 0.5891

Epoch 3/19
----------
train Loss: 0.3833 Acc: 0.9273
val Loss: 1.2399 Acc: 0.5941

Epoch 4/19
----------
train Loss: 0.2314 Acc: 0.9581
val Loss: 1.2516 Acc: 0.5990

Epoch 5/19
----------
train Loss: 0.1794 Acc: 0.9544
val Loss: 1.3061 Acc: 0.6089

Epoch 6/19
----------
train Loss: 0.1537 Acc: 0.9631
val Loss: 1.2716 Acc: 0.5990

Epoch 7/19
----------
train Loss: 0.1596 Acc: 0.9557
val Loss: 1.3358 Acc: 0.5941

Epoch 8/19
----------
train Loss: 0.1377 Acc: 0.9569
val Loss: 1.3053 Acc: 0.6040

Epoch 9/19
----------
train Loss: 0.1258 Acc: 0.9618
val Loss: 1.3096 Acc: 0.6139

Epoch 10/19
----------
train Loss: 0.1169 Acc: 0.9631
val Loss: 1.3079 Acc: 0.5941

Epoch 11/19
----------
train Loss: 0.0987 Acc: 0.9667
val Loss: 1.3255 Acc: 0.5990

Ep

In [40]:
print("Val accuracy without noise:", net_acc.item())
print("Val accuracy with noise:", noise_acc.item())

Val accuracy without noise: 0.6831683168316832
Val accuracy with noise: 0.6188118811881188


Validation accuracy drops from about 0.683 to about 0.619 when 10% of the training labels are corrupted.
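
The label corruption step can also be looked at in isolation. Below is a minimal, framework-free sketch, assuming integer class labels in `range(num_classes)`. The function name `flip_labels` and its parameters are hypothetical and are not part of the notebook; it only illustrates the idea of replacing a fixed fraction of labels with a different class chosen uniformly at random.

```python
import random

def flip_labels(labels, num_classes, fraction=0.1, seed=0):
    """Return a copy of `labels` where `fraction` of the entries get a different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(len(noisy) * fraction)
    # Sample distinct positions, so exactly n_flip labels are changed
    for i in rng.sample(range(len(noisy)), n_flip):
        # Choose uniformly among the classes other than the current one
        others = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(others)
    return noisy

clean = [i % 9 for i in range(90)]
noisy = flip_labels(clean, num_classes=9, fraction=0.1)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed)  # 9
```

Because the new class is always drawn from the other classes, every selected label really changes, so the effective noise rate equals the requested fraction.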

In [41]:
# Combine cross entropy with reverse cross entropy; ce_weight and rce_weight set the weight of each term
class CustomLoss(nn.Module):
    def __init__(self, ce_weight, rce_weight, num_classes=9):
        super(CustomLoss, self).__init__()
        self.device = device
        self.ce_weight = ce_weight
        self.rce_weight = rce_weight
        self.num_classes = num_classes
        self.cross_entropy = nn.CrossEntropyLoss()

    def forward(self, pred, labels):
        # Cross entropy loss
        ce = self.cross_entropy(pred, labels)

        # Reversed cross entropy
        pred = F.softmax(pred, dim=1)
        pred = torch.clamp(pred, min=1e-7, max=1.0)
        label_one_hot = torch.nn.functional.one_hot(labels, self.num_classes).float().to(self.device)
        label_one_hot = torch.clamp(label_one_hot, min=1e-4, max=1.0)
        rce = (-1*torch.sum(pred * torch.log(label_one_hot), dim=1))

        # Loss
        loss = self.ce_weight * ce + self.rce_weight * rce.mean()
        return loss

In [42]:
# train a model with the new loss function
criterion = CustomLoss(0.7,0.3)
netNoise = NetNorm()
netNoise = netNoise.to(device)
optimizer_ft = optim.SGD(netNoise.parameters(), lr=0.001, momentum=0.9)

netNoise, new_loss_acc = train_model(netNoise, 
                    dataloaders,
                       criterion, 
                       optimizer_ft, 
                       exp_lr_scheduler,
                       num_epochs=20,
                      norm="WithNorm")

Epoch 0/19
----------
train Loss: 3.5343 Acc: 0.3264
val Loss: 3.2422 Acc: 0.3317

Epoch 1/19
----------
train Loss: 2.4710 Acc: 0.5628
val Loss: 2.6907 Acc: 0.5000

Epoch 2/19
----------
train Loss: 1.5092 Acc: 0.7956
val Loss: 2.3215 Acc: 0.6188

Epoch 3/19
----------
train Loss: 0.8726 Acc: 0.8916
val Loss: 2.3398 Acc: 0.5396

Epoch 4/19
----------
train Loss: 0.5275 Acc: 0.9372
val Loss: 2.3180 Acc: 0.5792

Epoch 5/19
----------
train Loss: 0.3765 Acc: 0.9520
val Loss: 2.1745 Acc: 0.5990

Epoch 6/19
----------
train Loss: 0.2877 Acc: 0.9631
val Loss: 2.1656 Acc: 0.6089

Epoch 7/19
----------
train Loss: 0.2811 Acc: 0.9569
val Loss: 2.1557 Acc: 0.5990

Epoch 8/19
----------
train Loss: 0.2539 Acc: 0.9606
val Loss: 2.1907 Acc: 0.6238

Epoch 9/19
----------
train Loss: 0.2490 Acc: 0.9631
val Loss: 2.1496 Acc: 0.6287

Epoch 10/19
----------
train Loss: 0.2295 Acc: 0.9594
val Loss: 2.1839 Acc: 0.6089

Epoch 11/19
----------
train Loss: 0.1959 Acc: 0.9643
val Loss: 2.1456 Acc: 0.5990

Ep

In [43]:
print("Val accuracy with nn.CrossEntropyLoss():", noise_acc.item())
print("Val accuracy with the custom loss:", new_loss_acc.item())

Val accuracy with nn.CrossEntropyLoss(): 0.6188118811881188
Val accuracy with the custom loss: 0.6336633663366337


In most of our runs, the custom loss function improved validation accuracy on the noisy data (here about 0.634 versus 0.619 with `nn.CrossEntropyLoss()`). It still did not reach the accuracy of the model trained on clean labels (about 0.683).
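
To make the reverse cross entropy term more concrete, the sketch below evaluates both terms for a single sample in plain Python. It reuses the clamping constants from `CustomLoss` (1e-7 for the predictions, 1e-4 for the one-hot labels) and the 0.7/0.3 weights from the run above. The helper names `softmax` and `ce_and_rce` and the example logits are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_and_rce(logits, label, pred_min=1e-7, label_min=1e-4):
    """Return (cross entropy, reverse cross entropy) for one sample."""
    probs = softmax(logits)
    # Cross entropy: -log of the probability assigned to the given label
    ce = -math.log(probs[label])
    # Reverse cross entropy: swap the roles of prediction and label.
    # The one-hot zeros are clamped to label_min so that log() stays finite.
    clamped = [min(max(p, pred_min), 1.0) for p in probs]
    one_hot = [1.0 if k == label else label_min for k in range(len(logits))]
    rce = -sum(p * math.log(q) for p, q in zip(clamped, one_hot))
    return ce, rce

ce, rce = ce_and_rce([2.0, 0.5, -1.0], label=0)
loss = 0.7 * ce + 0.3 * rce
print(f"ce={ce:.4f} rce={rce:.4f} loss={loss:.4f}")
```

Since log(1) = 0 for the labeled class, the reverse term reduces to -log(1e-4) (about 9.21) times the probability mass placed on the other classes. It is therefore bounded per sample, unlike the plain cross entropy, and it is one likely reason the loss values reported with the custom loss are on a different scale than those reported with `nn.CrossEntropyLoss()`.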