# Finding stories in data using Python and Jupyter notebooks
[Journocoders London](https://www.meetup.com/Journocoders/), April 13, 2017  

David Blood/[@davidcblood](https://twitter.com/davidcblood)/[first] dot [last] at ft.com

## Introduction
The [Jupyter](http://jupyter.org/) notebook provides an intuitive, flexible and shareable way to work with code. It's a potentially invaluable tool for journalists who need to analyse data quickly and reproducibly, particularly as part of a graphics-oriented workflow.

The aim of this tutorial is to help you become familiar with the notebook and its role in a [Python](https://www.python.org/) data analysis toolkit. We'll start with a demographic dataset and explore and analyse it visually in the notebook to see what it can tell us about people who voted ‘leave’ in the UK's EU referendum. To finish, we'll output a production-quality graphic using [Bokeh](http://bokeh.pydata.org/en/latest/).

You'll need access to an empty Python 3 Jupyter notebook, ideally running on your local machine, although a cloud-based Jupyter environment is fine too.

You're ready to start the tutorial when you're looking at this screen:

![Empty Python 3 notebook](https://github.com/davidbjourno/finding-stories-in-data/raw/master/images/fsid-empty-notebook.png)

## 1. Bring your data into the notebook

In Python-world, people often use the [pandas](http://pandas.pydata.org/) module for working with data. You don't have to (there are other modules that do similar things), but it's the best-known and most comprehensive (probably).

Let's [import](https://docs.python.org/3/reference/import.html) pandas into our project and assign it to the variable `pd`, because that's easier to type than `pandas`. While we're at it, let's import all the other modules we'll need for this tutorial and also let [Matplotlib](http://matplotlib.org/) know that we want it to plot charts here in the notebook rather than in a separate window. Enter the following code into the first cell in your notebook and hit shift-return to run the code block:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import bokeh.plotting as bokeh
from bokeh.io import output_notebook

%matplotlib inline

If you encounter any error messages at this point, it's most likely because you don't have one or more of these modules installed on your system. Running
```
pip3 install pandas matplotlib numpy seaborn bokeh
```
from the command line should take care of that. If not, holler and I'll try to help you.
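
If you'd rather find out which modules are missing before installing anything, here's a minimal sketch using the standard library's `importlib`. The `required` and `missing` names are just for illustration and aren't used elsewhere in the tutorial:

```python
import importlib.util

# The top-level packages this tutorial imports
required = ['pandas', 'matplotlib', 'numpy', 'seaborn', 'bokeh']

# find_spec returns None when a package can't be found on this system
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print('Not installed:', ', '.join(missing))
else:
    print('All modules found')
```

Anything it lists is what you'd pass to `pip3 install`.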

As well as running your code, hitting shift-return in your first cell should have automatically created an empty cell below it. In that cell, we're going to use the `read_csv` method provided by pandas to, um, read our CSV.

When pandas reads data from a CSV file, it automagically puts it into something called a _dataframe_. It's not important at this point to understand what a dataframe is or how it differs from other [Python data structures](https://docs.python.org/3/tutorial/datastructures.html?highlight=dictionary#data-structures). All you need to know for now is that it's an object containing structured data.

We'll also assign our new dataframe to another variable—`df`—so we can do things with it down the line.

We do all of this in one go, like so (remember to hit shift-return):

In [None]:
df = pd.read_csv('data/leave-demographics.csv') # Passing in the path to our local CSV file

# If you're using an online notebook, pass in a URL instead of a local path:
# df = pd.read_csv('https://raw.githubusercontent.com/davidbjourno/finding-stories-in-data/master/data/leave-demographics.csv')

See how easy that was? Now let's check that `df` is in fact a dataframe. Using the `.head(n=[number])` method on any dataframe will return the first `[number]` rows in that dataframe. Let's take a look at the first ten:

In [None]:
df.head(n=10)

Looks good!
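
If you want a bit more reassurance than `.head()` gives you, a dataframe's `shape` and `dtypes` attributes tell you how big it is and what type of data pandas thinks each column holds. A minimal sketch, using a tiny made-up CSV read from a string rather than the real file (the values are invented, and `sample` is a stand-in name for `df`):

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real file (values are invented)
csv_text = """region_name,electorate,var1,leave
London,120000,31.5,39.8
North East,95000,22.1,61.3
Wales,88000,24.7,54.0
"""

# read_csv accepts a file-like object as well as a path or URL
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # The data type pandas inferred for each column
```

Running `df.shape` and `df.dtypes` on the real dataframe should work the same way.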

By now, you may have noticed that some of the column headers in this CSV aren't particularly descriptive (`var1`, `var2` etc.). This is the game: over the course of this tutorial, you're going to identify the variables that correlate most strongly with the percentage of ‘leave’ votes (the `leave` column), i.e. which factors were the most predictive of people voting ‘leave’. At the end of the meetup, before we all go down the pub, you can tell me which variables you think were the most predictive and I'll tell you what each of them is 😁

## 2. Explore the data

The main advantage of the workflow we're using here is that it enables us to inspect a dataset visually, which can often be the quickest way to identify patterns or trends in data. A common first step in this process is to use scatter plots to visualise the relationship between two variables, if any. So let's use Matplotlib to create a first, super basic [scatter plot](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter):

In [None]:
# Create a Matplotlib figure measuring 8x8 inches at a resolution of 72 dots
# per inch
plt.figure(figsize=(8, 8), dpi=72)

# Plot the data as a scatter plot
g = plt.scatter(
    df['var1'], # The values we want to plot along the x axis
    df['leave'], # The values we want to plot along the y axis
    50, # The size…
    '#0571b0', # …colour…
    alpha=0.5 # …and opacity we want the data point markers to be
)

Yikes, not much of a relationship there. Let's try a different variable:

In [None]:
plt.figure(figsize=(8, 8), dpi=72)

g = plt.scatter(
    df['var2'], # Plotting var2 along the x axis this time
    df['leave'],
    50,
    '#0571b0',
    alpha=0.5
)

Hmm, the distribution looks better—there's a stronger (negative) correlation here—but it's still a little unclear what we're looking at. Let's add some context.
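
As an aside, you can put a number on what a scatter plot suggests by asking pandas for a correlation coefficient between two columns. A minimal sketch with invented values (not the tutorial's dataset); by default `Series.corr` computes Pearson's r:

```python
import pandas as pd

# Invented values for illustration only; not the tutorial's dataset
sample = pd.DataFrame({
    'var2': [10.0, 15.0, 20.0, 30.0, 45.0],
    'leave': [65.0, 60.0, 58.0, 50.0, 35.0],
})

# Pearson's r: -1 is a perfect negative correlation, +1 a perfect positive one
r = sample['var2'].corr(sample['leave'])
print(round(r, 2))
```

On the real dataframe the equivalent would be `df['var2'].corr(df['leave'])`. A value close to zero would match the ‘not much of a relationship’ impression from the first plot.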

We know from our provisional data-munging (that we didn't do) that many of the boroughs of London were among the strongest ‘remain’ areas in the country. We can add an additional field called `is_london` to our dataframe and set the values of that field to either `True` or `False` depending on whether the row's `region_name` field is equal to `'London'`:

In [None]:
df['is_london'] = np.where(df['region_name'] == 'London', True, False)

# Print all the rows in the dataframe in which the is_london field is equal to
# True
df[df['is_london'] == True]

Those names should look familiar. That's NumPy's [`where`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html) function coming in handy there to help us generate a new column of data based on the values of another column (in this case, `region_name`).
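
To see what `np.where` is doing row by row, here's a minimal sketch on a four-row dataframe with invented values. The `label` column is my own addition, just to show that the two output values needn't be Booleans:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration
sample = pd.DataFrame({'region_name': ['London', 'Wales', 'London', 'Scotland']})

# np.where(condition, value_if_true, value_if_false), evaluated row by row
sample['is_london'] = np.where(sample['region_name'] == 'London', True, False)

# The two values don't have to be Booleans; any pair of labels works
sample['label'] = np.where(sample['is_london'], 'London', 'Rest of UK')

print(sample)
```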

At this point, we're going to abandon Matplotlib like merciless narcissists and turn our attention to the younger, hotter [Seaborn](https://seaborn.pydata.org/). Though it sounds like one of the factions from _Game of Thrones_, it's actually another plotting module that includes some handy analytical shortcuts and statistical methods. One of those analytical shortcuts is the [`FacetGrid`](http://seaborn.pydata.org/generated/seaborn.FacetGrid.html).

If you've ever used [OpenRefine](http://openrefine.org/), you're probably familiar with the concept of faceting. I'll fumblingly describe it here as a method whereby data is apportioned into distinct matrices according to the values of a single field. You get the idea. Right now, we're going to facet on the `is_london` column so that we can distinguish the London boroughs from the rest of the UK:

In [None]:
# Set the chart background colour (completely unnecessary, I just don't like the
# default)
sns.set_style('darkgrid', { 'axes.facecolor': '#efefef' })

# Tell Seaborn that what we want from it is a FacetGrid, and assign this to the
# variable ‘fg’
fg = sns.FacetGrid(
    df, # Use our dataframe as the input data
    hue='is_london', # Highlight the data points for which is_london == True
    palette=['#0571b0', '#ca0020'], # Define a tasteful blue/red colour combo
    height=7 # Make each facet 7 inches tall (older Seaborn versions call this 'size')
)

# Tell Seaborn that what we want to do with our FacetGrid (fg) is visualise it
# as a scatter plot
fg.map(
    plt.scatter,
    'var2', # Values to plot along the x axis
    'leave', # Values to plot along the y axis
    alpha=0.5 # Data point marker opacity
)

Now we're cooking with gas! We can see a slight negative correlation in the distribution of the data points _and_ we can see how London compares to all the other regions of the country. Whatever `var2` is, we now know that the London boroughs generally have higher values in this field than most of the rest of the UK.
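
A claim like ‘London boroughs generally have higher values’ can be checked numerically by grouping on the same field we faceted on. A minimal sketch with invented numbers, using the median as one reasonable summary (the mean would do too):

```python
import pandas as pd

# Invented values for illustration only
sample = pd.DataFrame({
    'is_london': [True, True, False, False, False],
    'var2': [48.0, 52.0, 21.0, 25.0, 29.0],
    'leave': [30.0, 34.0, 58.0, 61.0, 55.0],
})

# Split the rows by is_london, then take the median of each column per group
summary = sample.groupby('is_london')[['var2', 'leave']].median()
print(summary)
```

In effect this is faceting without the chart: one row of summary figures per facet.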

So what's to stop you faceting on `is_london` but with a different variable plotted along the x axis? The answer is: nothing! Try doing that exact thing right now:

In [None]:
# Plot the chart above with a different variable along the x axis.

What's more, faceting isn't limited to just highlighting specific data points. We can also pass `FacetGrid` a `col` (column) argument with the name of a column that we'd like to use to further segment our data. So let's create another `True`/`False` ([Boolean](https://en.wikipedia.org/wiki/Boolean_data_type)) column to flag the areas with the largest populations—the ones with electorates of 100,000 people or more—and plot a new facet grid:

In [None]:
df['is_largest'] = np.where(df['electorate'] >= 100000, True, False)

g = sns.FacetGrid(
    df,
    hue='is_london', 
    col='is_largest',
    palette=['#0571b0', '#ca0020'],
    height=7 # Older Seaborn versions call this 'size'
)

g.map(
    plt.scatter,
    'var2',
    'leave',
    alpha=0.62
)

Now we're able to make the following statements based solely on a visual inspection of this facet grid:

* Most of the less populous areas (electorate < 100,000) voted ‘leave’
* Most of the less populous areas had `var2` values lower than 35. Only two – both London boroughs – had values higher than 35
* There is a stronger correlation between the strength of the ‘leave’ vote and the value of `var2` among the more populous areas

So you see how faceting can come in handy when you come to a dataset cold and need to start to understand it quickly.
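
Visual impressions like these are worth cross-checking with a quick count. A minimal sketch using `pd.crosstab` on invented data; the `voted_leave` column is my own addition and assumes `leave` is a percentage, so anything above 50 counts as a ‘leave’ majority:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
sample = pd.DataFrame({
    'electorate': [60000, 85000, 95000, 120000, 150000, 210000],
    'leave': [62.0, 57.5, 48.0, 51.0, 38.5, 44.0],
})

sample['is_largest'] = np.where(sample['electorate'] >= 100000, True, False)
# Assumption: leave is a percentage, so above 50 means a 'leave' majority
sample['voted_leave'] = sample['leave'] > 50

# Count the areas in each combination of the two True/False columns
counts = pd.crosstab(sample['is_largest'], sample['voted_leave'])
print(counts)
```

Each cell of the resulting table is the number of areas in that combination, which makes a statement like ‘most of the less populous areas voted leave’ easy to confirm or refute.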

As yet, we still don't have much of a story, just a few observations—not exactly Pulitzer material. The next and most important step is to narrow down which of the variables in the dataset had the strongest influence on the likelihood of a ‘leave’ vote. The good news is that we don't have to repeat the facet grid steps above for every variable, because Seaborn provides another useful analytical shortcut called a [`PairGrid`](https://seaborn.pydata.org/generated/seaborn.PairGrid.html#seaborn.PairGrid).

## 3. Optimise for efficiency

Apparently there's an equivalent to the pair grid in R called a correlogram or something (I wouldn't know). But the pair grid is super sweet because it allows us to check for correlations among a number of variables at once. By passing the `PairGrid` function an array of specific variables in our dataset, we can plot each of those variables against every other variable in one amazing ultra-grid:

In [None]:
# Just adding the first four variables, plus leave, to start with—you'll see why
grid_columns = [
    'var1',
    'var2',
    'var3',
    'var4',
    'leave'
]

g = sns.PairGrid(
    df[grid_columns],
    palette='#0571b0'
)

g.map_offdiag(plt.scatter);

Using the pair grid, you should be able to identify which of the variables in the dataset correlate most strongly with the percentage of ‘leave’ votes and whether the correlations are positive or negative.
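
If you'd like to check your reading of the grid against some numbers, pandas can compute every pairwise correlation in one go. A minimal sketch with invented values, ranking variables by the strength of their correlation with `leave` regardless of sign (`correlations` and `ranked` are illustrative names):

```python
import pandas as pd

# Invented values for illustration only; not the tutorial's dataset
sample = pd.DataFrame({
    'var1': [5.0, 3.0, 6.0, 4.0, 5.5],
    'var2': [10.0, 15.0, 20.0, 30.0, 45.0],
    'var3': [40.0, 33.0, 35.0, 28.0, 30.0],
    'leave': [65.0, 60.0, 58.0, 50.0, 35.0],
})

# Correlate every column with leave and drop leave's correlation with itself
correlations = sample.corr()['leave'].drop('leave')

# Order by absolute value so strong negative correlations rank highly too
order = correlations.abs().sort_values(ascending=False).index
ranked = correlations.reindex(order)
print(ranked)
```

On the real dataframe you'd probably want to select the numeric columns first, e.g. `df[grid_columns].corr()`, since text columns such as `region_name` can't be correlated.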

## 4. Make a viz and get it out of the notebook

In [None]:
output_notebook()

In [None]:
# Instantiate our plot
p = bokeh.figure(
    width=600, # Named plot_width in Bokeh releases before 3.0
    height=422, # Named plot_height in Bokeh releases before 3.0
    background_fill_color='#d3d3d3',
    title='Leave demographics'
)

# Add a circle renderer to the plot
p.circle(
    df['var5'], # Values to plot along the x axis
    df['leave'], # Values to plot along the y axis
    # Size the markers according to the size of the electorate (scaled down)
    size=df['electorate'] / 20000,
    color='#ca0020',
    line_width=1,
    line_color='#ca0020',
    alpha=0.5
)

# Configure the plot's x axis
p.xaxis.axis_label = 'var5'
p.xgrid.grid_line_color = None

# Configure the plot's y axis
p.yaxis.axis_label = 'Percentage voting leave'
p.ygrid.grid_line_color = '#999999'
p.ygrid.grid_line_alpha = 1
p.ygrid.grid_line_dash = [6, 4]

# Show the plot
bokeh.show(p)

Now that's starting to look like something we could publish. Refer to the [Bokeh docs](http://bokeh.pydata.org/en/latest/docs/user_guide.html) for more customisation options, and when you're happy with how your plot looks, click the ‘Save’ button on the toolbar at the right of the plot to export it as a PNG image. If you want, you can even paste it into the [hackpad](https://journocoders.hackpad.com/Journocoders-April-2017-BLRJmLthZLk) with your name—coolest-looking one wins a drink!

## 5. Baller-level challenges
* Using pandas, identify the top ten ‘leave’-voting areas in the country, both by vote percentage and per capita of the electorate
* Recreate your Bokeh visualisation in D3 (in the notebook). [Hint](https://github.com/PyGoogle/PyD3)
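
For the first challenge, one possible starting point (a hint rather than a full answer) is pandas' `nlargest` method. A minimal sketch on invented rows; `area_name` is a hypothetical column name, and the top three are shown to keep the output short:

```python
import pandas as pd

# Invented rows; 'area_name' is a hypothetical column name
sample = pd.DataFrame({
    'area_name': ['A', 'B', 'C', 'D', 'E'],
    'leave': [61.2, 75.6, 48.9, 69.3, 52.0],
})

# nlargest sorts by the named column and keeps the top n rows
top = sample.nlargest(3, 'leave')
print(top)
```

On the real dataframe, `df.nlargest(10, 'leave')` should give the top ten by vote percentage.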