# Notebook 03: Shot Duration in All the President's Men

We now return to the question of shot duration in the film *All the President's Men*.
As before, we start by loading some libraries and setting up the notebook:

In [None]:
%pylab inline

import numpy as np
import scipy as sp
import pandas as pd
import json

import statsmodels.api as sm
import statsmodels.formula.api as smf

import os
from os.path import join, basename

In [None]:
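def conf_int(vals, ndigits=1):
    """Return a normal-approximation 95% confidence interval for the mean of vals."""
    # 1.96 standard errors on either side of the mean (np.var uses ddof=0 by default)
    margin = 1.96 * np.sqrt(np.var(vals) / len(vals))
    mu = np.mean(vals)
    return [round(mu - margin, ndigits), round(mu + margin, ndigits)]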

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.rcParams["figure.figsize"] = (8,8)

## 01: Construct a Code System

Our code system consists of three steps. First, we detect shot breaks and
thereby determine the duration of each shot. Then, for the frame in the
middle of each shot, we reuse the code system from notebook02 to detect
the number and location of faces. Finally, we aggregate the information
about each shot to assign it a *shot length*, such as a medium-close shot
or a very-long shot. Note that shot length here describes how the subjects
are framed, whereas shot duration is measured in seconds.
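
The final aggregation step can be pictured with a minimal sketch. It assumes that shot length is decided from the height of the largest detected face relative to the frame height. The function name `classify_shot_length`, the label set, and every cutoff value are hypothetical and chosen only for illustration; they are not the values used to produce the cached annotations.

```python
# Hypothetical cutoffs: minimum face height (as a fraction of frame height)
# for each label, ordered from the tightest framing to the widest.
CUTOFFS = [
    (0.50, "close-up"),
    (0.30, "medium-close"),
    (0.15, "medium"),
    (0.05, "long"),
    (0.00, "very-long"),
]


def classify_shot_length(face_heights):
    """Return a shot-length label from a list of relative face heights.

    Returns None when no faces were detected, since this sketch has
    nothing to measure in that case.
    """
    if not face_heights:
        return None
    largest = max(face_heights)
    for cutoff, label in CUTOFFS:
        if largest >= cutoff:
            return label
    return None


# One face filling 35% of the frame height, plus a smaller background face.
print(classify_shot_length([0.35, 0.10]))
# A frame with no detected faces.
print(classify_shot_length([]))
```

Using only the largest face is one possible design choice: a small background face then does not change the label of a shot that is framed tightly on a foreground character.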

## 02: Annotate

The code system that we want to apply is provided by the `VideoCsvPipeline` object.
Because the original video file is difficult to distribute, and processing can take
a significant amount of time, we instead load cached versions of the output files
supplied with the notebooks.

In [None]:
if False: # this stops the code from running, which requires the full video file
    from dvt.pipeline.csv import VideoCsvPipeline
    VideoCsvPipeline(finput=join("videos", "all-presidents-men.mp4"), dirout="dvt-csv").run()
    
meta = pd.read_csv(join('cache', 'all-presidents-men', 'meta.csv'))
cut = pd.read_csv(join('cache', 'all-presidents-men', 'cut.csv'))
length = pd.read_csv(join('cache', 'all-presidents-men', 'length.csv'))

As in the first notebook, we take the frames per second from the metadata
and use it to compute the duration of each shot in seconds.

In [None]:
cut['length_sec'] = ((cut['frame_end'] - cut['frame_start']) / meta['fps'].values)
cut
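
To make the arithmetic concrete, here is a small sketch of the same calculation on toy data. The tables `meta_toy` and `cut_toy` and all of their values are made up for illustration; only the column names follow the cached files.

```python
import pandas as pd

# Toy stand-ins for the cached tables (values are invented).
meta_toy = pd.DataFrame({"fps": [24.0]})
cut_toy = pd.DataFrame({
    "frame_start": [0, 48, 120],
    "frame_end": [47, 119, 155],
})

# Pull the frame rate out as a scalar, then convert frame counts to seconds.
fps = meta_toy["fps"].iloc[0]
cut_toy["length_sec"] = (cut_toy["frame_end"] - cut_toy["frame_start"]) / fps
print(cut_toy)
```

Extracting `fps` as a scalar with `.iloc[0]` makes the intent explicit; dividing by the length-one array `meta['fps'].values`, as the cell above does, gives the same result through broadcasting.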

Then, we join the shot data with the shot-length data derived from the detected faces, matching each shot's midpoint frame (`mpoint`) to the `frame` column.

In [None]:
df = cut.join(length.set_index('frame'), on='mpoint')
df

Notice that now each shot is associated with a code indicating the shot length.
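
The join can be illustrated on toy tables. The sketch below assumes the same column names as the cached files (`mpoint`, `frame`, `num_people`, `shot_length`), but the rows are invented. The third shot deliberately has no matching row, to show what a left join does in that case; whether such gaps occur in the cached data is not something this sketch can tell us.

```python
import pandas as pd

cut_toy = pd.DataFrame({
    "frame_start": [0, 48, 120],
    "frame_end": [47, 119, 155],
    "mpoint": [23, 83, 137],
})
length_toy = pd.DataFrame({
    "frame": [23, 83],
    "num_people": [1, 2],
    "shot_length": ["medium-close", "long"],
})

# DataFrame.join is a left join by default: every shot in cut_toy is kept,
# and shots whose midpoint frame has no annotation receive NaN.
df_toy = cut_toy.join(length_toy.set_index("frame"), on="mpoint")
print(df_toy)
```

Because unmatched rows are filled with NaN, it can be worth checking `df['shot_length'].isna().sum()` on the real table before summarizing by shot length.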

## 03: Combine with Metadata

Ideally we would do this analysis with a collection of films, and in that case we would have
rich metadata about each of them. In this workshop we have only one film, so there is little
to be gained from adding metadata.

## 04: Exploratory Analysis

We can compute the median duration, in seconds, of each shot type:

In [None]:
x = np.sort(df.shot_length.unique())
y = [np.median(df[df['shot_length'] == stype].length_sec.values) for stype in x]
    
plt.plot(x, y)

What pattern(s) do you notice here? Which shot types last the longest and
which last the shortest?
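
A table can make these comparisons easier to read than a line plot. The sketch below uses invented data and the `conf_int` helper defined at the top of this notebook (repeated here so the block runs on its own) to report the count, median, and mean duration per shot type, along with a rough 95% interval for the mean. The labels and durations are made up, and the normal approximation is crude when a category contains only a few shots.

```python
import numpy as np
import pandas as pd


def conf_int(vals, ndigits=1):
    # Same helper as at the top of the notebook: mean +/- 1.96 standard errors.
    margin = 1.96 * np.sqrt(np.var(vals) / len(vals))
    mu = np.mean(vals)
    return [round(mu - margin, ndigits), round(mu + margin, ndigits)]


# Invented durations (in seconds) for two shot types.
df_toy = pd.DataFrame({
    "shot_length": ["close", "close", "close", "long", "long", "long"],
    "length_sec": [2.0, 4.0, 6.0, 10.0, 12.0, 20.0],
})

grouped = df_toy.groupby("shot_length")["length_sec"]
summary = grouped.agg(["count", "median", "mean"])
ci = {name: conf_int(group.values) for name, group in grouped}

print(summary)
print(ci)
```

Reporting the count next to each median helps flag categories where the summary rests on very few shots.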

Similarly, we can investigate the relationship between the number of
people onscreen and shot duration:

In [None]:
x = [0, 1, 2, 3, 4]
y = [np.median(df[df['num_people'] == n].length_sec.values) for n in x]
    
plt.plot(x, y)

How does this plot compare to the shot length plot? Does it
tell the same story or a different one? It may be helpful to see
a cross-tabulation of the number of people and the shot length
across the film:

In [None]:
pd.crosstab(df['num_people'], df['shot_length'])
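
Raw counts can be hard to compare when some rows contain far more shots than others. As a sketch on invented data, `pd.crosstab` accepts `normalize="index"`, which converts each row to proportions so that rows sum to one.

```python
import pandas as pd

# Invented shots: number of people on screen and a shot-length label.
df_toy = pd.DataFrame({
    "num_people": [0, 1, 1, 2, 2, 2],
    "shot_length": ["long", "close", "close", "medium", "long", "long"],
})

counts = pd.crosstab(df_toy["num_people"], df_toy["shot_length"])
# Share of each shot length within every num_people row.
shares = pd.crosstab(df_toy["num_people"], df_toy["shot_length"], normalize="index")

print(counts)
print(shares.round(2))
```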

A lot of research has been done comparing shot lengths across films, directors, and genres.
What features might shot length be a proxy for?

## 05: Communicate

We published an article featuring an analysis of shot lengths and shot
duration in network sitcoms (available open access at https://culturalanalytics.org/article/11045)
and are working on a larger study of New Hollywood films. If you have any thoughts
or ideas, feedback is very welcome!