---
# <center> Visualization of CNN: Grad-CAM

<center> Eya Ghamgui $~~$ eya.ghamgui@telecom-paris.fr
<center> Siwar Mhadhbi $~~$ siwar.mhadhbi@telecom-paris.fr
<center> Saifeddine Barkia $~~$ saifeddine.barkia@telecom-paris.fr
<center> February 02, 2022

---

* **Objective**: Convolutional Neural Networks (CNNs) are widely used in computer vision and are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we use Grad-CAM, which highlights the regions of the input image that were important for the network's prediction.

* **To be submitted within 2 weeks**: this notebook, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!

* NB: if `PIL` is not installed, try `conda install pillow`.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms

import pickle
import urllib.request

import numpy as np
from PIL import Image
import cv2
from skimage import exposure
import matplotlib.pyplot as plt

import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

### Download the Model
We provide a pretrained `ResNet-34` model for the `ImageNet` classification dataset.
* **ImageNet**: A large dataset of photographs with 1 000 classes.
* **ResNet-34**: A deep architecture for image classification.

In [None]:
resnet34 = models.resnet34(pretrained=True)
resnet34.eval()  # set the model to evaluation mode

In [None]:
classes = pickle.load(
    urllib.request.urlopen(
        "https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl"
    )
)

![ResNet34](https://miro.medium.com/max/1050/1*Y-u7dH4WC-dXyn9jOG4w0w.png)

### Input Images
We provide 20 images from ImageNet (the download link is on the course webpage, or you can download them directly with the command line below).<br>
To use the pretrained resnet34 model, the input image should be resized to `(224, 224)` and normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`.
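
As a minimal sketch of what this normalization does per channel, the snippet below applies `(x - mean) / std` with NumPy and inverts it again, which is handy when a preprocessed image has to be displayed. The helper names `normalize` and `denormalize` and the 2x2 gray image are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

# ImageNet channel statistics quoted above (RGB order)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize(img_hwc):
    """Normalize an HxWx3 float image with values in [0, 1]."""
    return (img_hwc - MEAN) / STD


def denormalize(img_hwc):
    """Invert `normalize`, e.g. to display a preprocessed image."""
    return img_hwc * STD + MEAN


# Hypothetical 2x2 mid-gray image, used only for illustration
gray = np.full((2, 2, 3), 0.5, dtype=np.float32)
normalized = normalize(gray)
restored = denormalize(normalized)
```

Note that `transforms.Normalize` works on channel-first tensors `(3, H, W)`, whereas this sketch uses the channel-last layout that `matplotlib` expects.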

In [None]:
def preprocess_image(dir_path):
    normalize = transforms.Normalize(
        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
    )

    dataset = datasets.ImageFolder(
        dir_path,
        transforms.Compose(
            [
                transforms.Resize(256),  # resize the shorter side to 256
                transforms.CenterCrop(224),  # crop the central 224x224 region
                transforms.ToTensor(),  # convert the PIL image to a tensor in [0, 1]
                normalize,  # normalize the tensor
            ]
        ),
    )

    return dataset

In [None]:
# The images should be in a *sub*-folder of "data/" (ex: data/TP2_images/images.jpg) and *not* directly in "data/"!
# otherwise the function won't find them

import os

# Create the folders; no error if they already exist (e.g. when re-running the cell)
os.makedirs("data/TP2_images", exist_ok=True)
!cd data/TP2_images && wget "https://www.lri.fr/~gcharpia/deeppractice/2022/TP2/TP2_images.zip" && unzip TP2_images.zip
dir_path = "data/"
dataset = preprocess_image(dir_path)

In [None]:
# show the original image
index = 5
input_image = Image.open(dataset.imgs[index][0]).convert("RGB")
plt.imshow(input_image)

In [None]:
output = resnet34(dataset[index][0].view(1, 3, 224, 224))
values, indices = torch.topk(output, 3)
print("Top 3-classes:", indices[0].numpy(), [classes[x] for x in indices[0].numpy()])
print("Raw class scores:", values[0].detach().numpy())
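
The raw class scores printed above are logits, not probabilities. If probabilities are easier to read, a softmax over the full score vector converts them. The sketch below uses a hypothetical 4-class score vector rather than the real 1000-class output, and the `softmax` helper is an illustrative assumption.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of raw scores."""
    shifted = scores - np.max(scores)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical raw scores for 4 classes (illustration only)
raw_scores = np.array([12.0, 9.5, 7.0, 0.0])
probs = softmax(raw_scores)
```

Softmax preserves the ranking of the classes, so the top-3 labels are the same whether they are selected from raw scores or from probabilities.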

### Grad-CAM 
* **Overview:** Given an image, and a category (‘tiger cat’) as input, we forward-propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, where we can compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the given images. For each image, choose the top-3 possible labels as the desired classes. Compare the heatmaps of the three classes, and conclude. 


* **Hints**: 
 + We need to record the output and grad_output of the feature maps to achieve Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the tutorial on [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) carefully. 
 + The pretrained resnet34 model has no activation function after its last layer, so its output is the `raw class scores`; you can use them directly. 
 + The size of the feature maps is 7x7, so your heatmap will have the same size. You need to project the heatmap onto the resized image (224x224, not the original one, before the normalization) to get a better view. The function [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=interpolate#torch.nn.functional.interpolate) may help.  
 + Here is the link to the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
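
In the notation of the Grad-CAM paper, the overview above can be summarized by two equations. Here $y^c$ is the raw score of class $c$, $A^k$ is the $k$-th feature map of the chosen layer, and $Z$ is the number of spatial positions in a feature map ($7 \times 7 = 49$ for the layer used here):

$$
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_{k} \alpha_k^c \, A^k \right)
$$

The weight $\alpha_k^c$ is the spatial average of the gradient for feature map $k$, and the ReLU keeps only the locations that contribute positively to class $c$.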

![Grad-CAM](https://da2so.github.io/assets/post_img/2020-08-10-GradCAM/2.png)
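
As a rough sketch of the hook mechanism mentioned in the hints, the snippet below records the activations and their gradients on a tiny, hypothetical CNN (`toy_model`, standing in for ResNet-34), then computes and upsamples a Grad-CAM map. It uses a forward hook plus a tensor hook (`Tensor.register_hook`) rather than a module backward hook; this is one possible design under these assumptions, not the only way to do it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical tiny CNN standing in for ResNet-34 (illustration only)
toy_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),  # index 0: layer we inspect
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 5),  # 5 fake classes
)
toy_model.eval()

features, gradients = [], []


def save_features_and_grads(module, inputs, output):
    # Forward hook: keep the activation and attach a tensor hook that
    # fires during backward with d(score)/d(activation).
    features.append(output.detach())
    output.register_hook(lambda grad: gradients.append(grad.detach()))


handle = toy_model[0].register_forward_hook(save_features_and_grads)

x = torch.randn(1, 3, 8, 8)
scores = toy_model(x)    # raw class scores, shape (1, 5)
scores[0, 2].backward()  # backpropagate the score of class 2
handle.remove()          # detach the hook once we are done

# Grad-CAM: average the gradients spatially, weight the maps, apply ReLU
weights = gradients[0].mean(dim=(2, 3), keepdim=True)           # (1, 4, 1, 1)
cam = F.relu((weights * features[0]).sum(dim=1, keepdim=True))  # (1, 1, 8, 8)

# Upsample the coarse map (here by a factor of 4) for overlaying on the image
cam_up = F.interpolate(cam, scale_factor=4, mode="bilinear", align_corners=False)
```

Backpropagating `scores[0, 2]` is equivalent to multiplying the scores by a one-hot vector for class 2 and summing, which is the "signal" described in the overview.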

In [None]:
################
## Grad Cam
################
def Grad_Cam(image, category):
    # NOTE: this function uses the module-level `model`, which must be
    # defined before it is called.
    # Useful functions to extract gradients and features
    def _backward_hook(model, grad_input, grad_output):
        gradients.append(grad_output[0])

    def _forward_hook(model, input, output):
        features.append(output.data)

    features = []
    gradients = []

    # Hooks for the gradients and features
    hook1 = model.layer4[2].bn2.register_backward_hook(_backward_hook)
    hook2 = model.layer4[2].bn2.register_forward_hook(_forward_hook)

    # Extract last predicted layer
    output = model(image)

    # Create signal
    signal = np.zeros((1, output.size()[-1]), dtype=np.float32)
    signal[0][category] = 1
    signal = torch.from_numpy(signal).requires_grad_(True)
    signal = torch.sum(signal * output)

    # Backpropagate signal
    model.zero_grad()
    signal.backward(retain_graph=True)

    # Extract gradients and features
    gradients = gradients[0][-1].numpy()
    features = features[0][-1].numpy()

    # Compute weights using gradients
    Weights = np.mean(gradients, axis=(1, 2))

    # Initiate a heatmap
    heatmap = np.zeros(features.shape[1:])

    # Remove Hooks
    hook1.remove()
    hook2.remove()

    # Compute heatmap
    for i in range(Weights.shape[0]):
        heatmap += Weights[i] * features[i, :, :]

    # ReLU on top of the heatmap
    heatmap = np.maximum(heatmap, 0)

    # Interpolate values
    heatmap = torch.from_numpy(heatmap.reshape(1, 1, 7, 7))
    heatmap = F.interpolate(heatmap, scale_factor=32, mode="bilinear")
    heatmap = heatmap.numpy()[0, 0, :, :]

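    # Normalize the heatmap to [0, 1] (skipped if it is all zeros, to avoid dividing by zero)
    max_value = np.max(heatmap)
    if max_value > 0:
        heatmap = heatmap / max_value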

    return heatmap


################
# Combine an image with its heatmap
################
def apply_heatmap(img, map):
    # Build a color map from the heatmap (OpenCV returns BGR, so convert to RGB)
    heatmap = cv2.applyColorMap(np.uint8(255 * map), cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)
    heatmap = np.float32(heatmap) / 255

    # Blend the RGB image with its RGB heatmap
    merged_image = cv2.addWeighted(heatmap, 0.5, img, 0.5, 0.0)
    merged_image = np.uint8(255 * merged_image)

    return merged_image

In [None]:
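# Load the model once; Grad_Cam reads this module-level `model`
model = models.resnet34(pretrained=True)
model.eval()

for i in range(20):
    # Apply the same resize + center crop as the network input so that the
    # heatmap lines up with the displayed image
    pil_img = Image.open(dataset.imgs[i][0]).convert("RGB")
    pil_img = transforms.CenterCrop(224)(transforms.Resize(256)(pil_img))
    img = np.float32(np.array(pil_img)) / 255
    input = dataset[i][0].view(1, 3, 224, 224)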

    output = model(input)
    values, indices = torch.topk(output, 3)

    f, ax = plt.subplots(1, 4, figsize=(20, 5))
    ax[0].imshow(img)
    ax[0].set_title("Sample " + str(i + 1))
    for j in range(1, 4):
        category = indices[0].numpy()[j - 1]
        heatmap = Grad_Cam(input, category)
        merged_img = apply_heatmap(img, heatmap)
        ax[j].imshow(merged_img)
        names = classes[category].split(",")
        ax[j].set_title(names[0])

    plt.show()

**Comment:** 

In the paper [Grad-CAM 2019, Selvaraju et al.], the authors apply this method to the last convolutional layer. For the ResNet34 architecture, we interpreted the last convolutional block as the last convolutional layer and applied the method to its last batch normalization layer (`layer4[2].bn2`). Indeed, when we compared the heatmaps obtained from the last bn2 layer with those from the last conv2 layer, the bn2 results were much better. 

**Interpretations:**

* From the previous results, we notice that in almost all samples, the first prediction, associated with the first heatmap, seems to be the most correct and relevant to the original image. However, in some particular cases, the second or third prediction outweighs the first one.

More precisely:

* In the case of samples 1, (3, 5), 6, (9, 18), (12, 7), 15, and (17, 20), all three heatmaps show that the network looks at almost exactly the same region for each prediction. For each of these samples, the predicted animals belong to the same family of mammals and therefore share many characteristics: elephants, dogs, cats, mustelids, foxes, felines, and monkeys, respectively.
* In the case of sample 2, the network looks at almost exactly the same region in the three heatmaps. It predicts “porcupine” and “marmoset”, which are relevant predictions for the original image, whereas “sloth bear” is very different, which explains its lower score.
* In the case of sample 4, the three heatmaps show that the network does not focus on the same regions for the different predictions. To predict “Norwegian elkhound” or “German shepherd”, the network focuses on the front region. However, to predict “Cardigan”, it focuses on the back region of the animal.
* In the case of sample 8, the heatmaps reveal that the first and second predictions are based on the dog in the foreground, while the third prediction is based on the dog in the background.
* In the case of sample 10, the first heatmap reveals that the network focuses primarily on the horns, which is why it detects "ibex". In the second and especially the third heatmap, the network sees both the horns and the body, which leads it to detect sheep breeds rather than goats.
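
Statements such as "almost exactly the same region" are visual judgments. One hypothetical way to quantify them is the intersection over union (IoU) of two thresholded heatmaps. The helper `heatmap_iou`, the threshold of 0.5, and the synthetic 4x4 maps below are illustrative assumptions and were not used to produce the interpretations above.

```python
import numpy as np


def heatmap_iou(map_a, map_b, threshold=0.5):
    """IoU of the regions where two normalized heatmaps reach `threshold`."""
    mask_a = map_a >= threshold
    mask_b = map_b >= threshold
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0  # both maps are empty at this threshold
    return float(np.logical_and(mask_a, mask_b).sum() / union)


# Synthetic 4x4 heatmaps (illustration only)
a = np.zeros((4, 4))
a[:2, :2] = 1.0   # hot region in the top-left corner
b = np.zeros((4, 4))
b[:2, 1:3] = 1.0  # same region shifted by one column
c = np.zeros((4, 4))
c[2:, 2:] = 1.0   # hot region in the bottom-right corner

iou_same = heatmap_iou(a, a)
iou_shifted = heatmap_iou(a, b)
iou_disjoint = heatmap_iou(a, c)
```

With the 224x224 maps returned by `Grad_Cam`, a value close to 1 would indicate that two class heatmaps highlight nearly the same region, and a value close to 0 that they highlight different regions.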

A few failure cases were noticed:

* In the case of sample 11, the network correctly predicts a horse when it focuses on the entire region of the animal. However, it gives incorrect predictions, such as ox or dog breeds, when it focuses only on the neck region.
* In the case of sample 13, the network fails to detect the animal correctly. This is also visible in the heatmaps: for all three predictions, the network focuses more on the border regions outside the animal's region.
* In the case of sample 14, the first and second heatmaps reveal that the network looks at almost exactly the same region in both cases and gives relevant predictions. However, the third heatmap reveals that the network sees the whole body but wrongly predicts the animal species.
* In the case of sample 16, when the network sees the entire animal region, it correctly detects a “sea lion”. When it sees only a portion of the animal region, it steers the prediction toward unrelated objects, such as a cowboy boot or a balance beam.





**Conclusion:**

* To conclude, Grad-CAM helps us better understand what is going on inside the network and what it looks at when making a prediction. It also helps us understand how differences between predictions relate to the specific parts of the input image that the network focuses on.
* Specifically in our task, Grad-CAM helped us understand how the network differentiates between animal species, and even between breeds within the same family, based on the characteristics it focuses on during prediction. 
