# Object localization
One last time, recall our two guiding questions for this week:

- _What_ is in an image (e.g. debris, buildings, etc.)?
- _Where_ are these things located _in 3D space_ ?

Let's take stock of everything we've learned this week. The first two days, we learned about structure from motion. We saw how, so long as you have two images with some overlap, you can reconstruct a scene up to scale in an arbitrary reference frame. If you have some additional information (e.g. GPS location) and can afford to make some additional assumptions (e.g. the ground is flat), you can have a coarse georegistration of every image that is part of the structure from motion sequence. 

We spent the next two days discussing deep learning. We saw that machine learning can pick up patterns in the data that can aid in the general classification problem. We also saw that, using convolutional neural networks, we were able to extend this approach to the classification of images. 

Are we done?

## xBD and Image Segmentation
*Image segmentation* is a related but distinct problem from image classification. Rather than asking "is there flooding in the image?", it asks "where is there flooding in an image?". This is a problem that has seen interest in recent years, as it is no longer enough to just say whether there is a human in an image (think self-driving cars). One of the largest image segmentation datasets is called xBD. Here's what a sample pair of images looks like: 

<img src="notebook_images/hurricane-harvey_00000006_post_disaster.png" width="250"  />

<img src="notebook_images/targets.png" width="250"  />

Just as the output of our classification CNN was a single label, the output of a segmentation CNN is an entirely new image! 
In a sense, image segmentation is just pixel-wise classification. This means that image segmentation is a much more challenging task than just image classification. Accordingly, neural networks trained for image segmentation are much more complex than most classification CNN's. 

## Unet
One of the more popular architectures for image segmentation is called unet. It was originally invented to do image segmentation for biomedical purposes. The structure of the unet is shown below:

<img src="notebook_images/unet.png" width="500"  />

Unet features a contracting path in the first half that provides discriminative feature extraction. The second half is an expanding path that does the actual localization of these features in the image. 

The dataset creation for image segmentation is much more complex than for image classification: absent any other data sources, someone has to literally sit down and click on polygons that contain the target region. For georeferenced satellite imagery this might not be too hard, since you could just overlay OpenStreetMap data on top of the image, but for CAP imagery like ours a single image could take upwards to half an hour. 

## Class Activation Maps
The following approach is adapted from the following paper by Zhou et al: https://arxiv.org/pdf/1512.04150.pdf. 

Zhou et al found that neural networks that were trained for classification still retain a good amount of localization capability. Therefore, by doing some clever manipulations to our neural network, we can recover some of that localization capability.

In [1]:
import io
import requests
from PIL import Image
import torch
from torchvision import models, transforms
from torch.autograd import Variable
from torch.nn import functional as F
import numpy as np
from matplotlib import pyplot as plt
try:
    import cv2
except:
    !pip install opencv-python
    import cv2
import pdb
from cnn_finetune import make_model

net = make_model('resnet18', num_classes=2, pretrained=False)
finalconv_name = '_features'
PATH = "flood_checkpoint_epoch0.pth"
net.load_state_dict(torch.load(PATH))

<All keys matched successfully>

In [2]:
# hook the feature extractor
features_blobs = []
def hook_feature(module, input, output):
    features_blobs.append(output.data.cpu().numpy())

net._modules.get(finalconv_name).register_forward_hook(hook_feature)

# get the softmax weight
params = list(net.parameters())
weight_softmax = np.squeeze(params[-2].data.numpy())

def returnCAM(feature_conv, weight_softmax, class_idx):
    # generate the class activation maps upsample to 256x256
    size_upsample = (512, 512)
    bz, nc, h, w = feature_conv.shape
    output_cam = []
    for idx in class_idx:
        cam = weight_softmax[idx].dot(feature_conv.reshape((nc, h*w)))
        cam = cam.reshape(h, w)
        cam = cam - np.min(cam)
        cam_img = cam / np.max(cam)
        cam_img = np.uint8(255 * cam_img)
        output_cam.append(cv2.resize(cam_img, size_upsample))
    return output_cam


normalize = transforms.Normalize(
   mean=[0.485, 0.456, 0.406],
   std=[0.229, 0.224, 0.225]
)
preprocess = transforms.Compose([
   transforms.Resize(768),
   transforms.RandomCrop(512),
   transforms.ToTensor(),
])

In [3]:
img_path = "test_images/010_1282_99809b0c-fc64-46e5-957f-7ff8e7547d8c.jpg"
img_pil = Image.open(img_path)

img_tensor = preprocess(img_pil)
img_array = cv2.cvtColor(np.round(255*img_tensor.permute(1, 2, 0).numpy()), cv2.COLOR_BGR2RGB)
cv2.imwrite('test_image.jpg', img_array)

img_tensor = normalize(img_tensor)
img_variable = Variable(img_tensor.unsqueeze(0))
logit = net(img_variable)

In [4]:
# download the imagenet category list
classes = {0:"no flood", 1:"flood/water"}

h_x = F.softmax(logit, dim=1).data.squeeze()
probs, idx = h_x.sort(0, True)
probs = probs.numpy()
idx = idx.numpy()

# output the prediction
for i in range(2):
    print('{:.3f} -> {}'.format(probs[i], classes[idx[i]]))

# generate class activation mapping for the top1 prediction
CAMs = returnCAM(features_blobs[0], weight_softmax, [idx[0]])

# render the CAM and output
print('output CAM.jpg for the top1 prediction: %s'%classes[idx[0]])
img = cv2.imread('test_image.jpg')
height, width, _ = img.shape
heatmap = cv2.applyColorMap(cv2.resize(CAMs[0],(width, height)), cv2.COLORMAP_JET)
result = heatmap * 0.3 + img * 0.5
cv2.imwrite('CAM_noflood.jpg', result)

# generate class activation mapping for the flooding
CAMs = returnCAM(features_blobs[0], weight_softmax, [idx[1]])

# render the CAM and output
print('output CAM.jpg for the top1 prediction: %s'%classes[idx[1]])
img = cv2.imread('test_image.jpg')
height, width, _ = img.shape
heatmap = cv2.applyColorMap(cv2.resize(CAMs[0],(width, height)), cv2.COLORMAP_JET)
result = heatmap * 0.3 + img * 0.5
cv2.imwrite('CAM_flood.jpg', result)

0.591 -> no flood
0.409 -> flood/water
output CAM.jpg for the top1 prediction: no flood
output CAM.jpg for the top1 prediction: flood/water


True

How does this work? It turns out that, if you were to open up a CNN and examine the weights of the different layers, you would find that they tend to record key features of identifiers of specific objects:

<img src="notebook_images/cnn.png" width="750"  />

If your neural network has a specific architecture, you can essentially reconstruct a heatmap of where it is that those features are "activated" in an image. The architecture that Zhou et al used took advantage of a final global average pooling layer, which is also present in our ResNet architecture. This is why we chose to make our network following that architecture.

Think about how amazing that is! We never once provided the network the ability to localize objects in an image. This is something that the network learns entirely on its own. 

### Exercise
Look for a set of 2-3 images that you believe are emblematic of the class that you are aiming for. Apply the Class Activation Mapping procedure to them. Is the neural network able to localize key features of the class? Why or why not?