# Interactive Data Visualization by Example

## Table of Contents

* [Introduction](#Introduction)
* [Prologue: Examining the Data](#Prologue:-Examining-The-Data)
* [Plot 1: Average Bechdel Score over Time](#Plot-1:-Average-Bechdel-Score-over-Time)
    * [Recreating Plot 1](#Recreating-Plot-1)
    * [Manipulating Plot 1: Or, The Treachery of Ordinal Data ](#Manipulating-Plot-1:-Or,-The-Treachery-of-Ordinal-Data)

## Introduction

On April 1, 2014, Walt Hickey wrote an article titled [*The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women*](https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/), which looked at how Hollywood blockbusters have fared on the [Bechdel Test](https://en.wikipedia.org/wiki/Bechdel_test) over the last fifty years. He concluded that films that pass the Bechdel test are not only better representations of gender equality, but also tend to have a higher return on investment (ROI). Because of this, it is in Hollywood's interest to produce films with a more diverse female cast.

Several weeks later, on April 21, 2014, Brian Keegan wrote a response titled [*The Need for Openness in Data Journalism*](https://nbviewer.jupyter.org/github/brianckeegan/Bechdel/blob/master/Bechdel_test.ipynb). Using the original article as a case study, Brian's piece is as much about female representation in film (on which he mostly validated Walt Hickey's conclusions, with some caveats) as it is about the importance of making the source code and data for this kind of analysis available. FiveThirtyEight themselves took notice of the response and [linked to it from their site](https://fivethirtyeight.com/features/the-bechdel-test-checking-our-work/), and later made their articles available for replication on their [GitHub page](https://github.com/fivethirtyeight/data/).

This document follows a similar vein; I will use the Bechdel Test data as a worked example of how to create visually appealing plots, and also as a tutorial for creating *interactive* data visualizations. I was spurred to write this document because of a Communication Computational Concepts course I taught at Occidental College in Spring 2018, when I couldn't find good introductions for basic data literacy. My goal here is to not only demonstrate what a good visualization might look like, but also to explain the *thought process* in creating it. Finally, a side effect of writing this document is to illustrate how I use/organize a Jupyter notebook.

The overall structure of this document is to take the four main plots in Brian Keegan's document, and for each one:

1. recreate the plot with Bokeh
2. improve the plot by making it more readable and/or information dense
3. create an interactive version that would allow reader exploration

Although this document can be viewed statically with the [Jupyter Notebook Viewer](http://nbviewer.jupyter.org/), the interactive visualizations require a running Jupyter notebook with Python 3. Those interested can check out the [GitHub repository](https://github.com/justinnhli/blog-code/tree/master/2018/05/bechdel) to run it locally.

Note: I am assuming that the reader is passingly familiar with pandas DataFrames (and Python, of course). DataFrame operations will mostly not be explained, but feel free to look up the functions in the [DataFrame API reference](https://pandas.pydata.org/pandas-docs/stable/api.html) as we go along. If you need a refresher on pandas, I recommend working through [First Python Notebook](http://www.firstpythonnotebook.org/).

## Prologue: Examining The Data

In [None]:
import math

import numpy as np
import pandas as pd
from bokeh.plotting import figure, ColumnDataSource, show, output_notebook

output_notebook()

Before we dive into making pretty graphs, we need to understand the data that we are using to create those graphs. The data I am using here is the same data that Brian Keegan collected in April 2014. Specifically, he used data from four different sources:

* [BechdelTest.com](http://bechdeltest.com/) for how films scored on the Bechdel test
* [The-Numbers.com](https://www.the-numbers.com/) for the budget and revenue data
* [The Bureau of Labor Statistics](https://www.bls.gov/) for historical inflation data
* [The Open Movie Database (OMDb)](https://www.omdbapi.com/) for rating data

The following code is copied almost verbatim from Brian Keegan.

Two pedagogical notes about this code:

1. I tend to write one-off functions that encapsulate most of the processing, then call that function and save the results into a variable. This keeps the main namespace relatively clean.

2. I have found that it is a good idea to read the data in one cell, then manipulate copies of it in subsequent cells. This allows for quicker iteration of code, since you don't have to re-read the data if you screw up. Working on explicit copies also helps avoid the `a value is trying to be set on a copy of a slice from a DataFrame` warning. What I tend to do is write the code outside a function and test it as I go, then wrap it in a function when I get it working.

In [None]:
def read_data():

    def read_revenues():
        return pd.read_csv('data/revenue.csv', encoding='utf8', index_col=0)

    def read_inflations():
        inflation_df = pd.read_csv('data/cpi.csv', index_col='Year')
        inflation_df = inflation_df['Annual']
        return dict(inflation_df.loc[2014] / inflation_df)

    def read_bechdel():
        bechdel_df = pd.read_json('data/bechdel.json')
        bechdel_df['imdbid'] = bechdel_df['imdbid'].dropna().apply(int)
        bechdel_df = bechdel_df.set_index('imdbid')
        bechdel_df.dropna(subset=['title'], inplace=True)
        bechdel_df = bechdel_df[['rating', 'title']]
        return bechdel_df

    def read_imdb():

        def runtime_to_minutes(runtime):
            if 'h' in runtime:
                hours, minutes = runtime.split('h')
                return str(int(hours) * 60 + int(minutes))
            else:
                return runtime

        # Read the data in
        imdb_df = pd.read_json('data/imdb_data.json')

        # Drop non-movies
        imdb_df = imdb_df[imdb_df['Type'] == 'movie']

        # Drop movies with unknown release dates
        imdb_df = imdb_df[imdb_df['Released'] != 'N/A']

        # Convert to datetime objects
        imdb_df['Released'] = pd.to_datetime(imdb_df['Released'], format="%d %b %Y")

        # Drop errant identifying characters in the ID field
        imdb_df['imdbID'] = imdb_df['imdbID'].str.slice(start=2)

        # Remove the " min" at the end of Runtime entries so we can convert to ints
        imdb_df['Runtime'] = imdb_df['Runtime'].str.slice(stop=-4).replace('', np.nan)

        # Convert errant runtimes to minutes
        imdb_df['Runtime'] = imdb_df['Runtime'].astype(str).apply(runtime_to_minutes)

        # Blank out non-MPAA or minor ratings (NC-17, X)
        imdb_df['Rated'] = imdb_df['Rated'].replace(
            to_replace=[
                'N/A',
                'Not Rated',
                'Approved',
                'Unrated',
                'TV-PG',
                'TV-G',
                'TV-14',
                'TV-MA',
                'NC-17',
                'X',
            ],
            value=np.nan,
        )

        # Convert Release datetime into new columns for year, month, and week
        imdb_df['Year'] = imdb_df['Released'].apply(lambda date: date.year)
        imdb_df['Month'] = imdb_df['Released'].apply(lambda date: date.month)
        imdb_df['Week'] = imdb_df['Released'].apply(lambda date: date.week)

        # Convert the series to float
        imdb_df['Runtime'] = imdb_df['Runtime'].apply(float)

        # Convert the imdbVotes strings into float
        imdb_df['imdbVotes'] = imdb_df['imdbVotes'].replace('N/A', np.nan).dropna().apply(
            lambda s: float(s.replace(',', ''))
        )

        # Take the Metascore formatted as string containing "N/A", convert to float
        # Also divide by 10 to make effect sizes more comparable
        imdb_df['Metascore'] = imdb_df['Metascore'].dropna().replace('N/A', np.nan).dropna().apply(float) / 10

        # Take the imdbRating formatted as string containing "N/A", convert to float
        imdb_df['imdbRating'] = imdb_df['imdbRating'].dropna().replace('N/A', np.nan).dropna().apply(float)

        # Create a dummy variable for English language
        imdb_df['English'] = (imdb_df['Language'] == u'English').astype(int)
        imdb_df['USA'] = (imdb_df['Country'] == u'USA').astype(int)

        # Convert imdb_ID to int, set it as the index
        imdb_df['imdbID'] = imdb_df['imdbID'].dropna().apply(int)
        imdb_df = imdb_df.set_index('imdbID')

        return imdb_df

    def get_combined_data():
        revenue_df = read_revenues()
        inflation_dict = read_inflations()
        bechdel_df = read_bechdel()
        imdb_df = read_imdb()
        df = imdb_df.join(bechdel_df, how='inner').reset_index()
        df = pd.merge(df, revenue_df, left_on=['Title', 'Year'], right_on=['Movie', 'Year'])
        df['Year'] = df['Released_x'].apply(lambda date: date.year)
        df['Adj_Revenue'] = df.apply(lambda row: row['Revenue'] * inflation_dict[row['Year']], axis=1)
        df['Adj_Budget'] = df.apply(lambda row: row['Budget'] * inflation_dict[row['Year']], axis=1)
        return df

    return get_combined_data()


raw_df = read_data()
raw_df.tail(2).T
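
The second pedagogical note above (read once, then work on copies) can be sketched with a toy DataFrame. The names `toy_raw_df` and `keep_passing_films` are hypothetical and exist only for this illustration; the toy data stands in for the result of `read_data()`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; in the notebook this would be
# the (slow) read_data() call, run once in its own cell.
toy_raw_df = pd.DataFrame({
    'Year': [2012, 2013, 2013],
    'rating': [3, 1, 3],
})


def keep_passing_films(df):
    # Work on an explicit copy so the caller's DataFrame is never modified
    result = df.copy()
    result['passes'] = result['rating'] == 3
    return result[result['passes']]


passing_df = keep_passing_films(toy_raw_df)

# The original keeps its two columns and all three rows
print(list(toy_raw_df.columns))
print(len(toy_raw_df), len(passing_df))
```

Because the function only ever touches its own copy, the cell that defines it can be re-run as often as needed without re-reading the data.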

A commonly-overlooked part of creating data visualizations is *understanding the data*. There are at least two things worth noting here:

First, since I am using Brian's original 2014 data, all dollar amounts will be in 2014 dollars.
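
As a rough illustration of how `read_inflations()` turns a price index into 2014-dollar multipliers, here is a sketch with made-up index values. The CPI numbers and the film revenue below are invented for easy arithmetic and are not real BLS figures.

```python
import pandas as pd

# Made-up CPI values chosen for easy arithmetic; NOT real BLS figures
toy_cpi = pd.Series({1990: 100.0, 2002: 150.0, 2014: 200.0})

# Same idea as read_inflations(): ratio of the 2014 index to each year's index
toy_multipliers = dict(toy_cpi.loc[2014] / toy_cpi)

# A hypothetical film that earned 10 million nominal dollars in 1990
nominal_revenue = 10_000_000
adjusted_revenue = nominal_revenue * toy_multipliers[1990]
print(adjusted_revenue)
```

With these toy values the 1990 multiplier is 2, so the film's revenue doubles when expressed in 2014 dollars, while the multiplier for 2014 itself is 1.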

Second, and more importantly, it's worth understanding what Brian called the "Bechdel score" or the "Bechdel dimension". This is the `rating` column (a row in the transposed sample printed by `raw_df.tail(2).T`), which holds the rating from bechdeltest.com. As Brian quoted from their API:

> The actual score. Number from 0 to 3.
> * 0.0 means no two women,
> * 1.0 means no talking, 
> * 2.0 means talking about a man, 
> * 3.0 means it passes the test.

Note that *the data is **ordinal** and not numeric*. The only semantic information available is that a film without two women is "worse" than a film in which the women do not talk to each other, which is in turn worse than a film in which they only talk about a man. The mapping to 0, 1, 2, and 3 is *arbitrary* - we could just as well map it to -7, 0.5, 0.6, 1000; all that matters is that the numbers are increasing. As the [Wikipedia page on ordinal data](https://en.wikipedia.org/wiki/Ordinal_data#General) says, "the use of means and standard deviations ... [is] not appropriate".
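
To make the arbitrariness concrete, here is a small sketch with made-up ratings for two hypothetical groups of films (none of this comes from the actual dataset). It computes group means under the usual 0 to 3 encoding and under the alternative increasing encoding mentioned above.

```python
import pandas as pd

# Made-up ratings for two hypothetical groups of films (not from the dataset)
toy_ratings = pd.DataFrame({
    'group': ['A'] * 4 + ['B'] * 4,
    'rating': [2, 2, 2, 2, 0, 1, 3, 3],
})

# An alternative, equally order-preserving encoding of the four categories
alt_encoding = {0: -7, 1: 0.5, 2: 0.6, 3: 1000}
toy_ratings['alt_rating'] = toy_ratings['rating'].map(alt_encoding)

# Mean of each encoding, per group
toy_means = toy_ratings.groupby('group')[['rating', 'alt_rating']].mean()
print(toy_means)
```

Under the 0 to 3 encoding, group A has the higher mean (2.0 versus 1.75). Under the alternative encoding, group B does (498.375 versus 0.6). Both encodings preserve the order of the categories, so the comparison of means tells us about the encoding rather than about the films.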

This leads directly to Brian's first data plot...

## Plot 1: Average Bechdel Score over Time

Brian has a plot that compares the data he collected with FiveThirtyEight's. Although we can now get the original data from FiveThirtyEight's repository, we will ignore that plot for the sake of sticking with the information Brian had at the time of his writing.

The next plot that Brian presents is the average Bechdel score over time:

![Original Average Bechdel Test over Time Plot](original-plots/1a.png)

Remember that ordinal data stuff from the last section, and how we shouldn't use means and standard deviations? *This plot violates that rule.* As we will see, the average Bechdel score (and so the plot itself) is almost meaningless, and its appearance can be changed drastically with a very different mapping of Bechdel categories to numbers.

But first, let's recreate this plot in Bokeh.

### Recreating Plot 1

The following function takes the raw DataFrame from before and extracts only the year and rating information:

In [None]:
def prepare_plot_1_data():
    
    # create a DataFrame that will contain the data to be plotted
    plot_df = raw_df.copy()

    # select only the columns we will need
    plot_df = plot_df[['Year', 'rating']]

    # find the mean rating for each year
    plot_df = plot_df.groupby(by='Year').mean()

    # reset the index so Bokeh can understand the data
    plot_df = plot_df.reset_index()
    
    return plot_df

prepare_plot_1_data().head()
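
The grouping step can be checked on a toy DataFrame. The sketch below uses made-up rows standing in for `raw_df[['Year', 'rating']]`, and also shows that a single `agg` call can report how many films went into each mean, a quantity that matters later in this section when the line width comes up. The names `toy_df` and `toy_summary` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for raw_df[['Year', 'rating']]
toy_df = pd.DataFrame({
    'Year': [2012, 2012, 2013, 2013, 2013],
    'rating': [3, 1, 3, 3, 0],
})

# One pass over the groups, producing both the mean and the group size
toy_summary = (
    toy_df
    .groupby(by='Year')['rating']
    .agg(['mean', 'count'])
    .reset_index()
)
print(toy_summary)
```

The result has one row per year with `Year`, `mean`, and `count` columns. Both toy years average 2.0, but from two and three films respectively, which the mean alone hides.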

To plot the data, we will use the [Bokeh](https://bokeh.pydata.org/) library. Although there are a lot of plotting libraries to choose from, including [Matplotlib](https://matplotlib.org/) (which is the library that pandas uses for its [plotting functions](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)), [Seaborn](https://seaborn.pydata.org/), [Plotly](https://plot.ly/), and more recently [ggplot](http://ggplot.yhathq.com/) and [Altair](https://altair-viz.github.io/), I personally like Bokeh's approach of layering glyphs on top of each other, which makes it easy to add additional visual components.

Creating a Bokeh plot involves three steps (after wrangling the data into shape):

1. Creating the Figure object. This is simply a call to the [`figure()`](https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.figure) factory function. Some plot-level attributes such as size, title, axes, etc. can be set at this point.

2. Creating glyphs that represent the data. This involves calling at least one of the many glyph methods [in the Figure class](https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure). Each glyph takes different arguments; a [`circle()` glyph](https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.circle), for example, takes `x` and `y`, while a [`vbar()` glyph](https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.vbar) (a vertical rectangle) takes `x`, `width`, `top`, and `bottom`. Central to this API is that each glyph takes a `source` keyword argument, which tells the glyph where to get the data from. Most of the time, if you are using Bokeh with pandas, this source will be a [ColumnDataSource](https://bokeh.pydata.org/en/latest/docs/reference/models/sources.html#bokeh.models.sources.ColumnDataSource) created from a pandas DataFrame.

3. Showing the figure. In a Jupyter notebook, this means two things. First, to tell Bokeh that you are working in a Jupyter notebook, the [`output_notebook()`](https://bokeh.pydata.org/en/latest/docs/reference/io.html#bokeh.io.output_notebook) function must be called. I tend to do this at the start right after I import the necessary libraries. Once we have that, we call the [`show()`](https://bokeh.pydata.org/en/latest/docs/reference/io.html#bokeh.io.show) function for each figure we want to display.

We follow these three steps in the following function:

In [None]:
def show_plot_1():
    
    # get the data for the plot
    plot_df = prepare_plot_1_data()
    
    # convert it to a ColumnDataSource
    data_source = ColumnDataSource(plot_df)
    
    # create the figure (step 1)
    fig = figure()
    
    # create the glyph, in this case, a line (step 2)
    fig.line(
        # use the 'Year' column as the x value
        x='Year',
        # use the 'rating' column as the y value
        y='rating',
        # use the converted ColumnDataSource as the source of columns
        source=data_source,
    )
    
    # show the plot (step 3)
    show(fig)

show_plot_1()

One nice thing about Bokeh is that the plots are semi-interactive by default - you can drag the plot to pan around, use the tools on the right to zoom in, or reset the plot to the original view.

We used the Bokeh defaults for this first attempt, but we can style it a bit more. In the code below, we will change the plot size, set the domain and range, and add axis labels.

In [None]:
def show_plot_1():
    
    # get the data for the plot
    plot_df = prepare_plot_1_data()
    
    # convert it to a ColumnDataSource
    data_source = ColumnDataSource(plot_df)
    
    # create the figure (step 1)
    fig = figure(
        # set the size of the plot
        width=600, height=400,
        # set the domain and range
        x_range=[1910, 2020],
        y_range=[0, 3],
        # set axis labels
        x_axis_label='Year',
        y_axis_label='Avg. Bechdel Test',
    )
    
    # create the glyph, in this case, a line (step 2)
    fig.line(
        # use the 'Year' column as the x value
        x='Year',
        # use the 'rating' column as the y value
        y='rating',
        # use the converted ColumnDataSource as the source of columns
        source=data_source,
    )
    
    # show the plot (step 3)
    show(fig)

show_plot_1()

Looking back at the original plot, we can see that we've recreated most of the visual elements:

![Original Average Bechdel Test over Time Plot](original-plots/1a.png)

The one element we *didn't* get is the line width - if you look at the original closely, you will see that the line gets thicker towards the right. Brian never explains what this means in his text, but if you look at his code, you will see this in cell 37:

    lines = LineCollection(lines, linewidths=dict(num_movies).values())

It seems like the line width represents how many movies were averaged over. Unfortunately, Bokeh does not offer an easy way to vary the width along a single line. What we have to do instead is draw each line segment separately by looping over the rows:

In [None]:
def plot_1():
    
    # prepare the data
    plot_df = raw_df.copy()
    plot_df = plot_df[['Year', 'rating']]
    # calculate the mean and number of ratings separately then combine them
    plot_df = pd.concat(
        [
            plot_df.groupby(by='Year').mean()['rating'],
            plot_df.groupby(by='Year').count()['rating'],
        ],
        axis=1,
    )
    plot_df = plot_df.reset_index()
    # rename the columns appropriately
    plot_df.columns = ['Year', 'rating', 'count']
    
    fig = figure(
        width=600, height=400,
        x_range=[1910, 2020],
        y_range=[0, 3],
        x_axis_label='Year',
        y_axis_label='Avg. Bechdel Test',
    )
    
    # convert the DataFrame into a list of rows
    # take each pair of rows and plot them as a line
    rows_list = list(plot_df.itertuples())
    for start, end in zip(rows_list[:-1], rows_list[1:]):
        fig.line(
            x=[start[1], end[1]],
            y=[start[2], end[2]],
            line_width=start[3],
        )
        
        
    show(fig)

plot_1()

This is almost there, except that the lines on the right side are too thick; however, merely dividing the width by a number makes the other lines too thin. After some experimentation, it turns out using the log of the count gives the appropriate appearance:

In [None]:
def plot_1():
    
    # prepare the data
    plot_df = raw_df.copy()
    plot_df = plot_df[['Year', 'rating']]
    # calculate the mean and number of ratings separately then combine them
    plot_df = pd.concat(
        [
            plot_df.groupby(by='Year').mean()['rating'],
            plot_df.groupby(by='Year').count()['rating'],
        ],
        axis=1,
    )
    plot_df = plot_df.reset_index()
    # rename the columns appropriately
    plot_df.columns = ['Year', 'rating', 'count']
    
    fig = figure(
        width=600, height=400,
        x_range=[1910, 2020],
        y_range=[0, 3],
        x_axis_label='Year',
        y_axis_label='Avg. Bechdel Test',
    )
    
    rows_list = list(plot_df.itertuples())
    for start, end in zip(rows_list[:-1], rows_list[1:]):
        fig.line(
            x=[start[1], end[1]],
            y=[start[2], end[2]],
            line_width=math.log(start[3]),
        )
        
        
    show(fig)

plot_1()
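
To see why the logarithm behaves better than division here, consider a few hypothetical per-year film counts (the numbers below are invented, not taken from the data). Dividing by a constant preserves the ratio between the smallest and largest widths, while the logarithm compresses it.

```python
import math

# Hypothetical per-year film counts, from sparse early years to dense recent ones
toy_counts = [2, 10, 100]

# Dividing by a constant keeps the 50x spread between smallest and largest
divided_widths = [count / 10 for count in toy_counts]

# Taking the natural log compresses that spread to under 7x
log_widths = [math.log(count) for count in toy_counts]

print(divided_widths)
print([round(width, 2) for width in log_widths])
```

One caveat: `math.log(1)` is 0, so a year with a single film would get a line width of 0. If that matters for a given dataset, adding 1 to the count before taking the log (for example with `math.log1p`) is one possible workaround; that adjustment is a suggestion here and not part of the original plot.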

And there we have it: the original Bechdel test over time plot recreated in Bokeh.
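
As an aside, the loop is not the only option. Bokeh's `segment` glyph accepts a column name for `line_width`, so the same picture can be sketched with a single glyph call. This is an alternative to the approach above, not a recreation of it: the helper `build_segment_figure` and the toy data are hypothetical, and the joins between segments may look slightly different from those of the loop version.

```python
import math

import pandas as pd
from bokeh.plotting import figure, ColumnDataSource


def build_segment_figure(summary_df):
    """Build (but do not show) a figure with one segment per pair of consecutive years.

    Assumes summary_df has 'Year', 'rating', and 'count' columns, sorted by year.
    """
    segments_df = pd.DataFrame({
        'x0': summary_df['Year'].iloc[:-1].to_numpy(),
        'y0': summary_df['rating'].iloc[:-1].to_numpy(),
        'x1': summary_df['Year'].iloc[1:].to_numpy(),
        'y1': summary_df['rating'].iloc[1:].to_numpy(),
        # width comes from the starting year's count, as in the loop version
        'width': [math.log(count) for count in summary_df['count'].iloc[:-1]],
    })
    fig = figure(
        width=600, height=400,
        x_axis_label='Year',
        y_axis_label='Avg. Bechdel Test',
    )
    # every argument names a column in the source, including line_width
    fig.segment(
        x0='x0', y0='y0', x1='x1', y1='y1',
        line_width='width',
        source=ColumnDataSource(segments_df),
    )
    return fig, segments_df


# Made-up yearly summary, standing in for the plot_df built in plot_1()
toy_summary_df = pd.DataFrame({
    'Year': [2011, 2012, 2013],
    'rating': [1.5, 2.0, 2.5],
    'count': [10, 20, 30],
})
toy_fig, toy_segments_df = build_segment_figure(toy_summary_df)
```

Calling `show(toy_fig)` in a notebook (after `output_notebook()`) would display the result. Keeping the data in a ColumnDataSource also means the widths live in one place, which may be convenient if the plot is later made interactive.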

### Manipulating Plot 1: Or, The Treachery of Ordinal Data 

<table>
    <tr>
        <td>
            ![Original Average Bechdel Test over Time Plot](original-plots/1b.png)
        </td>
        <td>
            ![Original Average Bechdel Test over Time Plot](original-plots/1c.png)
        </td>
    </tr>
</table>

## Plot 2: Bechdel Score Distribution over Time

![](original-plots/2.png)

## Plot 3: Film Budget by Bechdel Score

![](original-plots/3.png)

## Plot 4: Other Metrics by Bechdel Score

<table>
    <tr>
        <td>
            ![Original Average Bechdel Test over Time Plot](original-plots/4a.png)
        </td>
        <td>
            ![Original Average Bechdel Test over Time Plot](original-plots/4b.png)
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>
            ![Original Average Bechdel Test over Time Plot](original-plots/4c.png)
        </td>
        <td>
            ![Original Average Bechdel Test over Time Plot](original-plots/4d.png)
        </td>
    </tr>
</table>