# Notebook 02: Analysis of Faces in U.S. Movie Posters (1980-2019)

In this notebook we will work through a study of faces in U.S. movie posters, applying
the Distant Viewing approach with the Distant Viewing Toolkit (DVT).

To start, run the following cells to load some standard Python libraries, define a
helper function, and set up plotting in the Jupyter notebook.

In [None]:
%pylab inline

import numpy as np
import scipy as sp
import pandas as pd
import json

import statsmodels.api as sm
import statsmodels.formula.api as smf

import os
from os.path import join, basename

In [None]:
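def conf_int(vals, ndigits=1):
    # Normal-approximation 95% confidence interval for the mean of vals.
    # half_width is 1.96 times the standard error of the mean.
    half_width = 1.96 * np.sqrt(np.var(vals) / len(vals))
    mu = np.mean(vals)
    return [round(mu - half_width, ndigits=ndigits), round(mu + half_width, ndigits=ndigits)]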
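To see what the `conf_int` helper computes, here is a minimal standalone sketch of the same normal-approximation interval applied to made-up face counts. The name `conf_int_sketch` and the values in `toy_counts` are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def conf_int_sketch(vals, ndigits=1):
    """Normal-approximation 95% interval for the mean (illustrative copy)."""
    vals = np.asarray(vals, dtype=float)
    # 1.96 standard errors on either side of the sample mean
    half_width = 1.96 * np.sqrt(np.var(vals) / len(vals))
    mu = np.mean(vals)
    return [round(float(mu - half_width), ndigits),
            round(float(mu + half_width), ndigits)]

# Made-up face counts for nine hypothetical posters
toy_counts = [0, 1, 1, 2, 2, 2, 3, 3, 4]
ci = conf_int_sketch(toy_counts)
print(ci)
```

The interval narrows as the number of posters grows, because the variance is divided by `len(vals)` before taking the square root.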

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.rcParams["figure.figsize"] = (8,8)

## 1. Create a Code System

The first step in DV is to construct a code system. Let's begin by 
loading a sample movie poster from the dataset.

In [None]:
img_paths = [join("images", "posters", x) for x in os.listdir(join("images", "posters"))]

In [None]:
img = imread(img_paths[4])
plt.imshow(img)
_ = plt.axis("off")

We are going to use a code system that consists of identifying all of the faces present
in a movie poster. We will record the number of faces, along with the location and size
of each one, and then explore how these features vary across time and genre.
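As a rough illustration of what "location" and "size" could mean for a single face, the sketch below turns a bounding box into a relative center position and a relative area. The `top`, `bottom`, `left`, and `right` keys match the face records used later in this notebook; the function name `face_summary` and the image dimensions are assumptions made up for the example.

```python
def face_summary(face, img_height, img_width):
    """Summarize one bounding box relative to the image size (illustrative)."""
    width = face["right"] - face["left"]
    height = face["bottom"] - face["top"]
    return {
        # Center of the box as a fraction of the image width and height
        "x_center": (face["left"] + face["right"]) / 2 / img_width,
        "y_center": (face["top"] + face["bottom"]) / 2 / img_height,
        # Share of the poster's pixels covered by the box
        "area_frac": (width * height) / (img_height * img_width),
    }

# A made-up face on a hypothetical 1000 x 500 pixel poster
toy_face = {"top": 100, "bottom": 300, "left": 50, "right": 150}
summary = face_summary(toy_face, img_height=1000, img_width=500)
print(summary)
```

Dividing by the image dimensions makes posters of different resolutions comparable.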

## 2. Annotate the Corpus


### 2a. Extract information from the visual material

In order to extract information about faces in the corpus, we will make use 
of the DVT. Here, we need an `ImageInput` object, as well as an annotator
specifically designed for detecting faces.

In [None]:
from dvt.core import DataExtraction, ImageInput
from dvt.annotate.face import FaceAnnotator, FaceDetectMtcnn

To run this, we use an approach similar to notebook 01. It will take a while
to run the code over the entire dataset; to test, we start by just running over
the first 10 images.

In [None]:
input_obj = ImageInput(input_paths=img_paths[:10])
dextra = DataExtraction(input_obj)
dextra.run_annotators([
    FaceAnnotator(detector=FaceDetectMtcnn())
])

The output data contains metadata about each image, as well as the detected faces.

In [None]:
dt = dextra.get_json()

dt.keys()

Each item in the face data gives information about one detected face.

In [None]:
dt['face']

For the purpose of visualizing the output, it is helpful to collect all of the
information about each poster in one place:

In [None]:
all_faces = [{'path': x['paths'], 'face_count': 0, 'faces': []} for x in dt['meta']]
for i, face in enumerate(dt['face']):
    all_faces[face['frame']]['face_count'] += 1
    all_faces[face['frame']]['faces'].append(face)
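The loop above relies on each face record carrying a `frame` index that points back to a row of the image metadata. The following sketch repeats the same grouping on hand-written records so the result is easy to inspect; the file names and coordinates in `toy_meta` and `toy_face` are made up.

```python
# Made-up records shaped like the 'meta' and 'face' entries used above
toy_meta = [{"paths": "a.jpg"}, {"paths": "b.jpg"}, {"paths": "c.jpg"}]
toy_face = [
    {"frame": 0, "top": 10, "bottom": 50, "left": 5, "right": 40},
    {"frame": 2, "top": 20, "bottom": 60, "left": 15, "right": 45},
    {"frame": 2, "top": 70, "bottom": 90, "left": 50, "right": 65},
]

# One entry per image, starting with no faces
grouped = [{"path": m["paths"], "face_count": 0, "faces": []} for m in toy_meta]

# Attach every face to the image it was detected in
for face in toy_face:
    grouped[face["frame"]]["face_count"] += 1
    grouped[face["frame"]]["faces"].append(face)

counts = [g["face_count"] for g in grouped]
print(counts)
```

Images with no detected faces (here `b.jpg`) keep a count of zero and an empty list, so they are not dropped from later analysis.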

Here, for example, is all of the extracted data about the poster at index 3:

In [None]:
all_faces[3]

To understand how the algorithm is working, let's plot the detected faces.

In [None]:
this_img = all_faces[3]
img = imread(this_img['path'])

fig, ax = plt.subplots(1)
ax.imshow(img)
_ = plt.axis("off")

for this_face in this_img['faces']:
    x_center = (this_face['left'] + this_face['right']) / 2
    y_center = (this_face['top'] + this_face['bottom']) / 2
    height = (this_face['bottom'] - this_face['top']) 
    width = (this_face['right'] - this_face['left'])

    rect = patches.Rectangle(
        (this_face['left'],this_face['top']),
        width,
        height,
        linewidth=3,
        edgecolor='orange',
        facecolor='none'
    )
    ax.add_patch(rect)

Not too bad! It finds all three faces in the image and places them reasonably well
on the poster. 

### 2b. Aggregate information across the corpus

Instead of waiting for everyone to run the annotator over the entire collection (which
can take an hour or more), we will now load the cached data about the entire collection
of images:

In [None]:
with open(join('cache', 'movie_face.json'), 'r') as json_file:
    dt = json.load(json_file)

As with the smaller collection, we will aggregate the information for
each image in one place.

In [None]:
all_faces = [{'path': x['paths'], 'face_count': 0, 'faces': []} for x in dt['meta']]
for i, face in enumerate(dt['face']):
    all_faces[face['frame']]['face_count'] += 1
    all_faces[face['frame']]['faces'].append(face)

It will be helpful to further collapse all of the information into a 
rectangular data frame object that we can join with metadata about the
posters.

In [None]:
face_info = pd.DataFrame({
    'img': [basename(x['path']) for x in all_faces],
    'face_count': [x['face_count'] for x in all_faces],
    'faces': [x['faces'] for x in all_faces]
})
face_info

## 3. Combine with Metadata

We have several metadata fields about each of the movie posters. Here are the
available fields:

In [None]:
df = pd.read_csv(join("meta", "poster_metadata.csv"))
df

Many interesting research questions can be addressed by combining the metadata with
our extracted features. Let's merge the metadata with our information about faces
in the images.

In [None]:
df = df.join(face_info.set_index('img'), on='img')
df
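To make the behavior of this join concrete, here is a small sketch on made-up tables. It mirrors the `join(..., on='img')` call above; the file names, genres, and counts are illustrative assumptions.

```python
import pandas as pd

# Made-up metadata: one row per poster
toy_meta = pd.DataFrame({
    "img": ["a.jpg", "b.jpg", "c.jpg"],
    "genre": ["Horror", "Comedy", "Drama"],
})

# Made-up face table that is missing an entry for b.jpg
toy_faces = pd.DataFrame({
    "img": ["a.jpg", "c.jpg"],
    "face_count": [2, 0],
})

# Left join: every metadata row is kept, matched on the 'img' column
merged = toy_meta.join(toy_faces.set_index("img"), on="img")
n_missing = int(merged["face_count"].isna().sum())
print(merged)
print(n_missing)
```

Because this is a left join, a poster without a matching face record keeps its metadata row but gets `NaN` in the face columns, so it is worth checking for missing values before aggregating.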

Now, we can proceed to an exploratory analysis of the collection.

## 4. Exploratory Analysis

### 4a. Detected faces

Before doing any aggregative analysis, let's start by looking at some of the
detected faces. Here are the 36 movie posters corresponding to the movies 
with the highest number of ratings on IMDb:

In [None]:
plt.figure(figsize=(12, 14))

nx = 6
ny = 6

df = df.sort_values('rating_count', ascending=False)

pnum = 1
for j, row in df[:36].iterrows():
    img = imread(join('images', 'posters', row['img']))
    ax = plt.subplot(ny, nx, pnum)
    plt.imshow(img)
    plt.axis("off")
        
    for this_face in row['faces']:
        x_center = (this_face['left'] + this_face['right']) / 2
        y_center = (this_face['top'] + this_face['bottom']) / 2
        height = (this_face['bottom'] - this_face['top']) 
        width = (this_face['right'] - this_face['left'])

        rect = patches.Rectangle(
            (this_face['left'],this_face['top']),
            width,
            height,
            linewidth=3,
            edgecolor='orange',
            facecolor='none'
        )
        ax.add_patch(rect)

    pnum += 1
    
plt.tight_layout(pad=0)

The algorithm misses some small faces and some masked faces, but generally does very well.
We can also take a look at the posters with the highest number of detected faces.

In [None]:
plt.figure(figsize=(12, 14))

nx = 6
ny = 6

df = df.sort_values('face_count', ascending=False)

pnum = 1
for j, row in df[:36].iterrows():
    img = imread(join('images', 'posters', row['img']))
    ax = plt.subplot(ny, nx, pnum)
    plt.imshow(img)
    plt.axis("off")
        
    for this_face in row['faces']:
        x_center = (this_face['left'] + this_face['right']) / 2
        y_center = (this_face['top'] + this_face['bottom']) / 2
        height = (this_face['bottom'] - this_face['top']) 
        width = (this_face['right'] - this_face['left'])

        rect = patches.Rectangle(
            (this_face['left'],this_face['top']),
            width,
            height,
            linewidth=3,
            edgecolor='orange',
            facecolor='none'
        )
        ax.add_patch(rect)

    pnum += 1
    
plt.tight_layout(pad=0)

Given the small size of the faces on some of these posters, the algorithm again does
reasonably well at detecting them.

### 4b. Faces and genre

Now, let's use confidence intervals to estimate the average number of faces that
are detected in a typical movie poster from a specific genre:

In [None]:
x = ['Horror', 'Comedy', 'Drama', 'Action', 'Adventure']
y = []
esize = []

for genre in x:
    ci = conf_int(df[df['genre'].str.contains(genre)].face_count.values)
    y.append((ci[0] + ci[1]) / 2)
    esize.append((ci[1] - ci[0]) / 2)  # yerr is the half-width of the interval
    
plt.errorbar(x, y, yerr=esize, fmt='none')

How would you describe the pattern here? Does it confirm your assumptions about
the genres, or contradict them? 
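When reading the plot, keep in mind that `str.contains` counts a poster toward every genre whose name appears in its `genre` string, so the groups can overlap. The sketch below shows this on a made-up table; the comma-separated genre format, the helper name `mean_and_half_width`, and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up posters; some list more than one genre
toy = pd.DataFrame({
    "genre": ["Horror", "Comedy, Drama", "Drama", "Horror, Comedy", "Comedy"],
    "face_count": [1, 4, 2, 1, 3],
})

def mean_and_half_width(vals):
    """Mean and 95% normal-approximation half-width (illustrative helper)."""
    vals = np.asarray(vals, dtype=float)
    half = 1.96 * np.sqrt(np.var(vals) / len(vals))
    return float(np.mean(vals)), float(half)

results = {}
group_sizes = {}
for genre in ["Horror", "Comedy", "Drama"]:
    # True for every poster whose genre string mentions this genre
    mask = toy["genre"].str.contains(genre)
    group_sizes[genre] = int(mask.sum())
    results[genre] = mean_and_half_width(toy.loc[mask, "face_count"].values)

print(group_sizes)
print(results)
```

The group sizes add up to more than the number of posters, which means the per-genre intervals are not based on independent samples.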

## 5. Communication

As this is a workshop, the "communication" of these results is largely the workshop
itself. Can you think of other ways to communicate these results to a larger
public?