
# Neuromatch Academy: Week 2, Day 2, Tutorial 1
# Modern Convnets

__Content creators:__ Laura Pede, Richard Vogg, Marissa Weis, Timo Lüddecke, Alexander Ecker (based on an initial version by Ben Heil)

__Content reviewers:__ Arush Tagade, Polina Turishcheva, Yu-Fang Yang. 

__Content editors:__ Roberto Guidotti, Spiros Chavlis

__Production editors:__ Anoop Kulkarni, Roberto Guidotti, Cary Murray, Spiros Chavlis.  



---
# Setup

In [None]:
#@title Install facenet - a model used to do facial recognition
!pip -q install facenet-pytorch

In [None]:
# Import libraries
import copy
import glob
import random
import time
import os

import ipywidgets as widgets
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import sklearn.decomposition
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import tqdm
import urllib
from facenet_pytorch import MTCNN, InceptionResnetV1
from matplotlib.colors import ListedColormap
from IPython import display
from PIL import Image
from torchvision.datasets import ImageFolder
from torchvision.utils import make_grid
from torchvision import transforms

import requests
from io import BytesIO

In [None]:
#@title Setup GPU device
seed = 522
random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Download and prepare the data

In [None]:
# @title Download Data
!git clone --quiet https://github.com/ben-heil/cis_522_data.git
!tar -xzf cis_522_data/archive.tar.gz
!tar -xzf cis_522_data/faces.tar.gz

---
# Section 9: Face Recognition

In [None]:
#@title Video 9: Face Recognition using CNNs
from IPython.display import YouTubeVideo
video = YouTubeVideo(id="3q4fKmimZm8", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)

video

One application of large CNNs is **facial recognition**. The problem formulation in facial recognition is a little different from the image classification we've seen so far. In facial recognition we don't want to have a fixed number of individuals that the model can learn. If that were the case then to learn a new person it would be necessary to modify the output portion of the architecture and retrain to account for the new person.

Instead, we train a model to learn an **embedding** where images from the same individual are close to each other in an embedded space, and images corresponding to different people are far apart. When the model is trained, it takes as input an image and outputs an embedding vector corresponding to the image. 

To achieve this, facial recognition models typically use a triplet loss that compares two images from the same individual (the "anchor" and "positive" images) with an image from a different individual (the "negative" image). The loss requires the distance between the anchor and negative points to be greater than the distance between the anchor and positive points plus a margin $\alpha$.
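The triplet condition described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the loss FaceNet was actually trained with: the function name `triplet_loss`, the margin value of 0.2, and the toy 2-D embeddings are all assumptions made for the example.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances, averaged over the batch.

    `margin` plays the role of alpha in the text; 0.2 is an assumed value.
    """
    d_pos = torch.linalg.norm(anchor - positive, dim=1)  # anchor <-> positive
    d_neg = torch.linalg.norm(anchor - negative, dim=1)  # anchor <-> negative
    # The loss is zero once d_neg exceeds d_pos by at least the margin
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy 2-D embeddings, made up for illustration
anchor = torch.tensor([[0.0, 0.0]])
positive = torch.tensor([[0.0, 0.1]])
near_negative = torch.tensor([[0.0, 0.2]])  # too close: violates the margin
far_negative = torch.tensor([[3.0, 4.0]])   # far away: satisfies the margin

loss_near = triplet_loss(anchor, positive, near_negative)
loss_far = triplet_loss(anchor, positive, far_negative)
print(loss_near.item(), loss_far.item())
```

The near negative produces a positive loss, which would push it away during training, while the far negative contributes nothing. PyTorch also ships a built-in `torch.nn.TripletMarginLoss` that could be used instead of a hand-written version.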

### View and transform the data

A well-trained facial recognition system should be able to map different images of the same individual relatively close together. We will load 15 images each of three individuals (maybe you know them, in which case you can see that your brain is quite good at facial recognition).

After viewing the images, we will transform them: MTCNN detects the face and crops the image around the face. Then we stack all the images together in a tensor.
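The stacking step can be sketched as follows. The helper `stack_faces` is hypothetical and not part of the notebook. It also assumes that a face detector may hand back `None` for an image in which it finds no face (to our knowledge this is how `MTCNN` in facenet-pytorch behaves, but treat it as an assumption), since `torch.stack` would fail on such an entry.

```python
import torch

def stack_faces(cropped_faces):
    """Stack per-image crops (each C x H x W) into one N x C x H x W tensor.

    Entries that are None are skipped (assumed to mean 'no face found').
    """
    kept = [face for face in cropped_faces if face is not None]
    if not kept:
        raise ValueError("no usable face crops to stack")
    return torch.stack(kept)

# Stand-ins for detector output: two crops and one failed detection
crops = [torch.zeros(3, 256, 256), None, torch.ones(3, 256, 256)]
batch = stack_faces(crops)
print(batch.shape)
```

All crops must share the same size for stacking to work, which is why the detector is asked for a fixed output size.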

In [None]:
#@title Display Images
#@markdown Here are the source images of Bruce Lee, Neil Patrick Harris, and Pam Grier
train_transform = transforms.Compose((transforms.Resize((256, 256)),
                                     transforms.ToTensor()))

face_dataset = ImageFolder('faces', transform=train_transform)

image_count = len(face_dataset)

face_loader = torch.utils.data.DataLoader(face_dataset,
                                          batch_size=45,
                                          shuffle=False)

dataiter = iter(face_loader)
images, labels = next(dataiter)

# show images
plt.figure(figsize=(15,15))
plt.imshow(make_grid(images, nrow=15).permute(1,2,0))

In [None]:
# @title Image Preprocessing Function
def process_images(image_dir: str, size: int = 256):
    """
    This function returns two tensors for the given image dir: one usable for inputting into the
    facenet model, and one that is [0,1] scaled for visualizing

    Parameters:
        image_dir: A glob pattern matching the image files to load (e.g. 'faces/bruce/*.jpg')
        size: Side length in pixels of the square face crop returned by MTCNN

    Returns:
        model_tensor: An image_count x channels x height x width tensor scaled to between -1 and 1,
                      with the faces detected and cropped to the center using mtcnn
        display_tensor: A transformed version of the model tensor scaled to between 0 and 1
    """
    mtcnn = MTCNN(image_size=size, margin=32)
    images = []
    for img_path in glob.glob(image_dir):
        img = Image.open(img_path)
        # Normalize and crop image
        img_cropped = mtcnn(img)
        images.append(img_cropped)

    model_tensor = torch.stack(images)
    display_tensor = model_tensor / (model_tensor.max() * 2)
    display_tensor += .5

    return model_tensor, display_tensor

Now that we have our images loaded, we need to preprocess them. To make the images easier for the network to learn, we crop them to include just faces.

In [None]:
bruce_tensor, bruce_display = process_images('faces/bruce/*.jpg')
neil_tensor, neil_display = process_images('faces/neil/*.jpg')
pam_tensor, pam_display = process_images('faces/pam/*.jpg')


display_tensor = torch.cat((bruce_display, neil_display, pam_display))

plt.figure(figsize=(15,15))
plt.imshow(make_grid(display_tensor, nrow=15).permute(1, 2, 0))

## Embedding with a pretrained network 

We load a pretrained facial recognition model called [FaceNet](https://github.com/timesler/facenet-pytorch). It was trained on the [VGGFace2](https://github.com/ox-vgg/vgg_face2) dataset which contains 3.31 million images of 9131 individuals.

We use the pretrained model to calculate embeddings for all of our input images.

In [None]:
resnet = InceptionResnetV1(pretrained='vggface2').eval().to(device)

In [None]:
# Calculate embedding
resnet.classify = False
bruce_embeddings = resnet(bruce_tensor.to(device))
neil_embeddings = resnet(neil_tensor.to(device))
pam_embeddings = resnet(pam_tensor.to(device))

## Think!

We want to understand what happens when the model receives an image and returns the corresponding embedding vector.

- What are the height, width and number of channels of one input image?
- What are the dimensions of one stack of images (e.g. bruce_tensor)?
- What are the dimensions of the corresponding embedding (e.g. bruce_embeddings)?
- What would be the dimensions of the embedding of one input image?


Hints: 
- You can double click on a variable name and hover over it to see the dimensions of tensors.
- You do not have to answer the questions in the order they are asked.

In [None]:
one_image = '' #@param {type:"string"}

In [None]:
stack_of_images = '' #@param {type:"string"}

In [None]:
stack_embedding = '' #@param {type:"string"}

In [None]:
one_embedding = '' #@param {type:"string"}

In [None]:
# to_remove explanation

"""
1. height: 256, width: 256, channels: 3 (RGB)
2. 15x3x256x256
3. 15x512
4. 1x512 or just 512
"""

We cannot visualize 512-dimensional vectors directly, but using **Principal Component Analysis (PCA)** we can project the 512 dimensions onto a 2-dimensional space while preserving as much of the variance in the data as possible. This is just a visual aid to help us understand the concept. Any actual calculation, such as the distance between two images, would be done with the full 512-dimensional embedding vectors.
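To make the projection less of a black box, here is a minimal NumPy sketch of what a 2-component PCA does: center the data, find the directions of largest variance with an SVD, and project onto the first two. The function `pca_project` and the random stand-in embeddings are assumptions for illustration; the notebook itself uses `sklearn.decomposition.PCA`, whose output may differ from this sketch by the sign of each component.

```python
import numpy as np

def pca_project(x, n_components=2):
    """Project the rows of x onto the top principal components via SVD."""
    x_centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(45, 512))  # random stand-in for 45 embeddings
projected = pca_project(fake_embeddings)
print(projected.shape)
```

Each row of `projected` is the 2-D coordinate of one image, which is what gets drawn in a scatter plot.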

In [None]:
embedding_tensor = torch.cat((bruce_embeddings, neil_embeddings, pam_embeddings)).to(device = 'cpu')
pca = sklearn.decomposition.PCA(n_components=2)
pca_tensor = pca.fit_transform(embedding_tensor.detach().numpy())

In [None]:
colors = ['blue'] * 15 + ['orange'] * 15 + ['magenta'] * 15

plt.scatter(pca_tensor[:,0], pca_tensor[:,1], c=colors, marker = 'x')
blue_patch = mpatches.Patch(color='blue', label='Bruce Lee')
orange_patch = mpatches.Patch(color='orange', label='Neil Patrick Harris')
magenta_patch = mpatches.Patch(color='magenta', label='Pam Grier')

plt.title('PCA Representation of the Image Embeddings')
plt.legend(handles=[blue_patch, orange_patch, magenta_patch])

Great! The images corresponding to each individual are separated from each other in the embedding space!

If Neil Patrick Harris wants to unlock his phone with facial recognition, the phone takes the image from the camera, calculates the embedding and checks if it is close to the registered embeddings corresponding to Neil Patrick Harris.
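The unlocking check described above can be sketched as a nearest-neighbor test against a distance threshold. Everything here is illustrative: the function name `is_same_person`, the threshold of 1.0, and the 2-D toy embeddings are assumptions. A real system would use the full 512-dimensional embeddings and a threshold tuned on validation data.

```python
import torch

def is_same_person(probe, enrolled, threshold=1.0):
    """Return True if the probe embedding lies within `threshold` (Euclidean
    distance) of at least one enrolled embedding."""
    distances = torch.cdist(probe.unsqueeze(0), enrolled).squeeze(0)
    return bool(distances.min() <= threshold)

# Toy 2-D embeddings, made up for illustration
enrolled = torch.tensor([[1.0, 0.0], [0.9, 0.1]])  # registered owner images
owner_probe = torch.tensor([0.95, 0.05])           # new image of the owner
stranger_probe = torch.tensor([-1.0, 0.5])         # image of someone else

print(is_same_person(owner_probe, enrolled))
print(is_same_person(stranger_probe, enrolled))
```

The choice of threshold trades off false rejections of the owner against false acceptances of other people.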

---
# Section 10: Ethics – bias/discrimination due to pre-training datasets
Popular facial recognition datasets like VGGFace2 and CASIA-WebFace consist primarily of Caucasian faces. 
As a result, even state-of-the-art facial recognition models [substantially underperform](https://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_Racial_Faces_in_the_Wild_Reducing_Racial_Bias_by_Information_ICCV_2019_paper.pdf) when attempting to recognize faces of other races.

Given the implications that poor model performance can have in fields like security and criminal justice, it's very important to be aware of these limitations if you're going to be building facial recognition systems.

In this example we will work with a small subset of the [UTKFace](https://susanqq.github.io/UTKFace/) dataset with 49 pictures of black women and 49 pictures of white women. We will use the same pretrained model as in Section 9, then observe and discuss the consequences of the model being trained on an imbalanced dataset.

In [None]:
#@title Video 9: Ethical aspects
from IPython.display import YouTubeVideo
video = YouTubeVideo(id="9i8fQwd5fak", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)

video

In [None]:
# @title Download Data
!git clone --quiet https://github.com/richardvogg/face_sample.git
!unzip -q face_sample/face_sample2.zip

### Load, view and transform the data

In [None]:
black_female_tensor, black_female_display = process_images('face_sample2/??_1_1_*.jpg', size = 200)
white_female_tensor, white_female_display = process_images('face_sample2/??_1_0_*.jpg', size = 200)

We can check the dimensions of these tensors and see that for each group we have images of size 200x200 and three channels (RGB) of 49 individuals.

In [None]:
print(white_female_tensor.shape)
print(black_female_tensor.shape)

In [None]:
# @title Visualize some example faces
display_tensor = torch.cat((white_female_display[:15], black_female_display[:15]))

plt.figure(figsize=(12,12))
plt.imshow(make_grid(display_tensor, nrow=15).permute(1, 2, 0))

### Calculate embeddings

We use the same pretrained facial recognition network as in Section 9 to calculate embeddings. If you run into memory issues in this part, go to Edit > Notebook settings and check that GPU is selected as the hardware accelerator. If this does not help, restart the notebook via Runtime > Restart runtime.
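Another way to reduce memory pressure, sketched here under assumptions, is to compute the embeddings in small batches with gradient tracking switched off. The helper `embed_in_batches` and the tiny `toy_encoder` are hypothetical stand-ins so that the example runs without downloading any weights. In the notebook, the pretrained `resnet` and the image tensors would take their place.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def embed_in_batches(model, images, batch_size=16, device="cpu"):
    """Run `model` over `images` in chunks without tracking gradients."""
    model.eval()
    outputs = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size].to(device)
        outputs.append(model(batch).cpu())  # move results back to the CPU
    return torch.cat(outputs)

# Hypothetical stand-in encoder: flattens 3x8x8 images into 4-D "embeddings"
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 4))
toy_images = torch.randn(10, 3, 8, 8)
toy_embeddings = embed_in_batches(toy_encoder, toy_images, batch_size=4)
print(toy_embeddings.shape, toy_embeddings.requires_grad)
```

Because no computation graph is stored, the returned tensor does not require gradients and can be converted to NumPy without calling `.detach()` first.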

In [None]:
resnet.classify = False
black_female_embeddings = resnet(black_female_tensor.to(device))
white_female_embeddings = resnet(white_female_tensor.to(device))

We will use the embeddings to show that the model was trained on an imbalanced dataset. For this, we are going to calculate a distance matrix of all combinations of images, like in this small example with n = 3 (in our case n = 98).

<img height=500 src=https://raw.githubusercontent.com/richardvogg/face_sample/main/04_DistanceMatrix.png>

Calculate the distance between each pair of image embeddings in our tensor and visualize all the distances. Remember that the embeddings are vectors, and here we measure the distance between two vectors with the Euclidean distance.
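As a small worked example with n = 3, the sketch below builds the distance matrix by hand using broadcasting and compares it with `torch.cdist`. The three 2-D points are made up for illustration and stand in for the 512-dimensional embeddings.

```python
import torch

# Three toy "embeddings" in 2-D, made up for illustration
points = torch.tensor([[0.0, 0.0],
                       [3.0, 4.0],
                       [6.0, 8.0]])

# Broadcasting: differences[i, j] = points[i] - points[j]
differences = points.unsqueeze(1) - points.unsqueeze(0)
manual = differences.pow(2).sum(dim=-1).sqrt()

# Built-in pairwise Euclidean distances
builtin = torch.cdist(points, points)

print(manual)
print(torch.allclose(manual, builtin, atol=1e-4))
```

The matrix is symmetric with zeros on the diagonal, because the distance from i to j equals the distance from j to i and every point has distance zero to itself.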

In [None]:
#@title Function to calculate pairwise distances
def calculate_pairwise_distances(embedding_tensor: torch.Tensor):
    """
    This function calculates the distance between each pair of image embeddings in a tensor

    Parameters:
        embedding_tensor: A num_images x embedding_dimension tensor

    Returns:
        distances: A num_images x num_images tensor containing the pairwise distances between each
                   image embedding
    """

    distances = torch.cdist(embedding_tensor, embedding_tensor)

    return distances

In [None]:
#@title Visualize the distances

embedding_tensor = torch.cat((black_female_embeddings, white_female_embeddings)).to(device = 'cpu')

distances = calculate_pairwise_distances(embedding_tensor)

plt.figure(figsize=(8,8))
plt.imshow(distances.detach().numpy())
plt.annotate('Black female', (2,-0.5), fontsize=24, va='bottom')
plt.annotate('White female', (52,-0.5), fontsize=24, va='bottom')
plt.annotate('Black female', (-0.5, 45), fontsize=24, rotation=90, ha='right')
plt.annotate('White female', (-0.5, 90), fontsize=24, rotation=90, ha='right')
plt.colorbar()
plt.axis('off')

## Exercise 10.1

What do you observe? According to the face recognition model, the faces of which group are more similar to each other?

In [None]:
observation = '' #@param {type:"string"}

In [None]:
# to_remove explanation

"""
The distances between black female embeddings are generally lower than the distances between white female embeddings, i.e. for the face recognition model, faces of black females are more similar to each other.
"""

## Exercise 10.2
- What does it mean in real life applications that the distance is smaller between the embeddings of one group?
- Can you come up with example situations/applications where this has a negative impact?
- What could you do to avoid these problems?

In [None]:
ethics_discussion = '' #@param {type:"string"}

In [None]:
# to_remove explanation

"""
1. Algorithms will have difficulty distinguishing between people of this group if the distances between their embeddings are smaller.
2. Many examples are possible:
  - Unlocking a smartphone with a face recognition system might not work or might not be secure.
  - Surveillance cameras might confuse individuals.
  - Social network automated tagging might confuse individuals.
3. Train the model on a balanced dataset with enough samples for each minority group. If you use pre-trained models, obtain information about the datasets they were pretrained on.
"""

Lastly, to see how much the pretraining dataset matters, look at how much of the embedding space white men and women occupy for models pretrained on different datasets. FairFace is a dataset that was specifically created with completely balanced classes. The blue dots in all visualizations correspond to white males and white females.

<img src=https://i.imgur.com/hCdCBOa.png>

[Image Source](https://arxiv.org/abs/1908.04913)

# Bonus (optional): Within Sum of Squares

We can try to put this observation in numbers. For this we work with the embeddings.
We want to calculate the centroid of each group, which is the average of the 49 embeddings of the group. As each embedding vector has a dimension of 512, the centroid will also have this dimension.

Now we can calculate how far away the observations of each group $S_i$ are from the centroid $\mu_i$. This concept is known as Within Sum of Squares (WSS) from cluster analysis.

$\text{WSS} = \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$

where $\lVert \cdot \rVert$ is the Euclidean norm.

If the WSS is small, all elements of a group are close to each other. If WSS is larger, they are further away from each other.
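A tiny worked example may help with the formula. The two toy groups below are made up for illustration, and the helper name `within_sum_of_squares` is hypothetical. Each group has two points in 2-D, so the centroid and the squared distances can be checked by hand.

```python
import torch

def within_sum_of_squares(group):
    """Sum of squared Euclidean distances of each row to the group centroid."""
    centroid = group.mean(dim=0, keepdim=True)  # shape 1 x K
    return ((group - centroid) ** 2).sum()

# Tight group: centroid (0, 1), squared distances 1 + 1
tight = torch.tensor([[0.0, 0.0], [0.0, 2.0]])
# Spread-out group: centroid (0, 5), squared distances 25 + 25
spread = torch.tensor([[0.0, 0.0], [0.0, 10.0]])

wss_tight = within_sum_of_squares(tight)
wss_spread = within_sum_of_squares(spread)
print(wss_tight.item(), wss_spread.item())
```

The spread-out group has the larger value, matching the interpretation above. Note that WSS grows with the number of points, so comparisons are most meaningful between groups of equal size, as is the case for the two groups of 49 images here.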


In [None]:
# @title Function to calculate WSS

def wss(group):
    """
    This function returns the sum of squared distances of the N vectors of a
    group tensor (N x K) to its centroid (1 x K).

    Parameters:
        group: An image_count x embedding_size tensor

    Returns:
        sum_sq: A scalar (0-dimensional) tensor with the sum of squared distances.

    Hints:
        - to calculate the centroid, torch.mean() will be of use.
        - We need the mean of the N=49 observations. If our input tensor is of size
          N x K, we expect the centroid to be of dimensions 1 x K.
          Use the dim argument within torch.mean
    """
    # Centroid of the group: mean over the N observations, kept as 1 x K for broadcasting
    centroid = torch.mean(group, dim=0, keepdim=True)
    # Euclidean distance of each observation to the centroid (length-N vector)
    distance = torch.linalg.norm(group - centroid, dim=1)
    sum_sq = torch.sum(distance**2)
    return sum_sq

In [None]:
# @markdown Let's calculate the WSS for the two groups of our example.

print("Black female embedding WSS: " + str(round(wss(black_female_embeddings).item(), 2)))
print("White female embedding WSS: " + str(round(wss(white_female_embeddings).item(), 2)))

# Summary

In [None]:
#@title Video 10: Summary and Outlook
from IPython.display import YouTubeVideo
video = YouTubeVideo(id="MdD6DzqLrLY", width=854, height=480, fs=1)
print("Video available at https://youtube.com/watch?v=" + video.id)

video