<a href="https://colab.research.google.com/github/aamini/introtodeeplearning_labs/blob/2019/Lab2_Part2_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Laboratory 2: Computer Vision

# Part 2: De-Biasing Facial Recognition Systems

In the second portion of this lab, we will explore two prominent aspects of applied deep learning: facial recognition systems and algorithmic bias. 

Deploying fair, unbiased AI systems is critical for the long-term acceptance of these approaches. Consider the task of facial recognition: given an image, is it an image of a face? [Recent work from the MIT Media Lab](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf) showed that this seemingly simple, but extremely important, task is subject to substantial algorithmic bias against certain demographics. [Another report](https://ieeexplore.ieee.org/document/6327355) analyzed the face detection system used by US law enforcement and found that it had significantly lower accuracy for darker-skinned women between the ages of 18 and 30. These results are especially concerning since these facial recognition systems are almost never deployed in isolation.

We'll investigate one approach to addressing this problem by building a facial recognition model that learns the underlying *latent variables* in a dataset and uses them to adaptively re-sample the training data, with the aim of mitigating bias. This lab is based on a very recent paper in which this approach was originally proposed.
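To make the re-sampling idea concrete, here is a minimal sketch in one latent dimension. It assumes we already have a latent value for every training image (the values below are made up), estimates their density with a histogram, and gives images in sparsely populated bins a higher sampling probability. The function name `resampling_probabilities`, the bin count, and the smoothing constant `alpha` are illustrative choices, not the implementation used later in the lab.

```python
from collections import Counter


def resampling_probabilities(latents, num_bins=5, alpha=0.01):
    """Return one sampling probability per example, favoring rare latent values.

    latents: list of floats, one (hypothetical) latent value per training image.
    """
    lo, hi = min(latents), max(latents)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal

    # Assign each latent value to a histogram bin.
    bins = [min(int((z - lo) / width), num_bins - 1) for z in latents]

    # Estimate the density of each bin as the fraction of examples in it.
    counts = Counter(bins)
    density = {b: c / len(latents) for b, c in counts.items()}

    # Weight each example inversely to the density of its bin, then normalize.
    weights = [1.0 / (density[b] + alpha) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up latent values: eight examples cluster near 0, one outlier sits at 1.
latents = [0.0, 0.02, 0.05, 0.03, 0.01, 0.04, 0.02, 0.03, 1.0]
probs = resampling_probabilities(latents)
```

In this toy example the single outlier receives a much larger sampling probability than any of the clustered examples, so under-represented regions of the latent space would be seen more often during training.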

Let's get started.

TODO: cite Face paper

First let's install the relevant dependencies:

In [0]:
# TODO: install dependencies

## 2.1 Datasets

We'll be using three datasets in this lab. In order to train our facial recognition models, we'll need a dataset of positive examples (i.e., of faces) and a dataset of negative examples (i.e., of things that are not faces). Finally, we'll need a test dataset of face images. Since we're concerned about the potential *bias* of our learned models against certain demographics, it's important that the test dataset we use has equal representation across the demographics or features of interest. We'll specifically be looking at skin tone and gender. 


1. Positive training data: [CelebA Dataset](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). A large-scale dataset (over 200K images) of celebrity faces.
2. Negative training data: [ImageNet](http://www.image-net.org/). Many images across many different categories. We'll take negative examples from a variety of non-human categories.
3. Test data: [Pilot Parliaments Benchmark](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf) (PPB). Images of parliamentarians from three African countries and three European countries, selected for parity across gender and skin tone.
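
As a rough sketch of how the two training sources fit together, the snippet below labels face examples 1 and non-face examples 0, then shuffles them into a single training set. The file names are placeholders standing in for real CelebA and ImageNet images, and `build_training_set` is a hypothetical helper, not the lab's import function.

```python
import random


def build_training_set(faces, non_faces, seed=0):
    """Combine positive and negative examples into shuffled (example, label) pairs."""
    labeled = [(x, 1) for x in faces] + [(x, 0) for x in non_faces]
    random.Random(seed).shuffle(labeled)  # fixed seed keeps the order reproducible
    return labeled


# Placeholder identifiers; in practice these would be image arrays.
faces = ["celeba_0.jpg", "celeba_1.jpg", "celeba_2.jpg"]
non_faces = ["imagenet_0.jpg", "imagenet_1.jpg", "imagenet_2.jpg"]

train = build_training_set(faces, non_faces)
examples, labels = zip(*train)
```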



Let's begin by importing these three datasets. We've written a function that does a bit of data pre-processing and imports these data for you :)

In [0]:
# TODO function to import the datasets in the appropriate format

To get a better sense of what's in each of these datasets, we can display some randomly selected images from each.

In [0]:
# TODO display images from the three datasets

As you can see, the PPB dataset is quite balanced in terms of the skin tone and gender across the set of displayed images. Do you notice any trends or patterns in the images from the CelebA dataset? Do you anticipate any potential issues in terms of classification performance for models trained on CelebA and then tested on a dataset like PPB?

### Thinking about bias

Remember that we'll be training facial detection classifiers on the large, well-curated CelebA dataset (and ImageNet), and then evaluating the *accuracy* and *bias* of our models across different demographics by testing them on the PPB dataset. Our goal is to build a model that trains on CelebA *and* achieves high classification accuracy on PPB across all demographics, and to show that this model does not suffer from algorithmic bias.
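
One way to make "accuracy across demographics" concrete is to compute accuracy separately for each group and look at the spread. The sketch below does this for made-up predictions. The group names, the labels, and the `accuracy_by_group` helper are assumptions for illustration, and the gap between the best and worst group is just one simple summary, not a metric the lab prescribes.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions, and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Made-up test results: every image is a face (label 1), as in a face-only test set.
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]
groups = ["group_a"] * 4 + ["group_b"] * 4

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
```

Here the overall accuracy is 6/8 = 0.75, yet one group is classified perfectly while the other is right only half the time. A single aggregate number can hide exactly the kind of disparity we care about.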

What exactly do we mean if we say a classifier is biased? In order to formalize this, we'll need to think about [*latent variables*](https://en.wikipedia.org/wiki/Latent_variable), variables that define a dataset but are not strictly observed, which were introduced during the generative modeling lecture. We can think of a classifier as *biased* if its classification decision changes after it sees some additional latent features. This notion of bias will be helpful to keep in mind throughout the rest of the lab.
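
One way to write this notion down, using notation introduced here purely for illustration: let $f_\theta$ be a classifier with parameters $\theta$, $x$ an input image, and $z$ a set of latent features. We can then call the classifier fair with respect to $z$ if exposing it to those features leaves its decision unchanged,

$$
f_\theta(x) = f_\theta(x, z),
$$

and biased with respect to $z$ otherwise.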

## 2.2 CNN for facial detection 

First, we'll define and train a CNN on the facial classification task, and evaluate its accuracy on the PPB dataset. Later, we'll evaluate the performance of our debiased models against this baseline CNN. The model architecture is shown here:

![CNN model](img/mnist_model.png "CNN Architecture for MNIST Classification")
TODO: update the architecture figure