# Intro to Plotting

### Sneak peek:

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10
sns.set(style='ticks', context='talk')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
df = pd.read_csv('data/beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = [c for c in df.columns if c.startswith('review')]
df.head()

In [None]:
fig, ax = plt.subplots(figsize=(5, 10))
sns.countplot(hue='kind', y='stars', data=(df[review_cols]
                                           .stack()
                                           .rename_axis(['record', 'kind'])
                                           .rename('stars')
                                           .reset_index()),
              ax=ax, order=np.arange(0, 5.5, .5))
sns.despine()

## Matplotlib

- Tons of features
- "Low-level" library

Check out [the tutorials](http://matplotlib.org/users/beginner.html)

In [None]:
from IPython import display
display.HTML('<iframe src="http://matplotlib.org/users/beginner.html" height=500 width=1024></iframe>')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

A single sequence passed to `plot` is interpreted as the y values, so x is just the index (0, 1, 2, ...).
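
A minimal sketch of that behaviour, using arbitrary illustrative numbers:

```python
import matplotlib.pyplot as plt

# Only y values are given, so matplotlib uses 0, 1, 2, 3 as the x values.
lines = plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
```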

Every x, y pair of arguments can be followed by an optional third argument: a format string that sets the color and line type of the plot.
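
As a hedged illustration (the data are made up), three x, y pairs can each carry their own format string in a single call:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0., 5., 0.2)

# 'r--' is a red dashed line, 'bs' blue squares, 'g^' green triangles.
lines = plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
```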

To work on plots in more detail, it's useful to store the `Axes` object that the plot is drawn on.
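
One common way to get hold of that object is `plt.subplots()`; the sketch below uses arbitrary data and labels:

```python
import matplotlib.pyplot as plt

# plt.subplots() returns the Figure and the object we draw on; keep both.
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro')

# Later tweaks go through the stored object instead of the global plt state.
ax.set_xlim(0, 6)
ax.set_ylim(0, 20)
ax.set_xlabel('x')
ax.set_title('Working with a stored plot object')
```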

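Plotting methods accept lots of keyword properties (`color`, `linewidth`, `label`, ...) that control the appearance of what they draw.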
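
A small sketch of a few such keywords on a line plot; the particular values are just for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)
fig, ax = plt.subplots()

# Appearance is set through keyword arguments on the plotting call.
(line,) = ax.plot(x, np.sin(x), color='purple', linewidth=3,
                  linestyle='--', marker='o', markersize=4, alpha=0.7,
                  label='sin(x)')
ax.legend(loc='upper right')
```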

#### Overlaying plots
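
Calling several plotting methods on the same axes draws each result on top of the previous ones. A minimal sketch with made-up curves:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# Three calls on the same axes: everything lands in one plot.
ax.plot(x, np.sin(x), label='sin')
ax.plot(x, np.cos(x), label='cos')
ax.scatter(x[::10], np.sin(x[::10]), color='k', label='samples')
ax.legend()
```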

#### Multiple plots
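
`plt.subplots` can also lay out a grid of axes in one figure. The sketch below assumes a 1 x 2 layout with a shared y-axis; the curves are arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# One row, two columns; the returned array holds one axes object per panel.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharey=True)
axes[0].plot(x, np.sin(x))
axes[0].set_title('sin')
axes[1].plot(x, np.cos(x))
axes[1].set_title('cos')
fig.tight_layout()
```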

#### Types of axes
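
Two examples of non-default axes, offered as a sketch rather than a complete tour: a logarithmic y scale and a polar projection.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 50)
theta = np.linspace(0, 2 * np.pi, 100)

fig = plt.figure(figsize=(10, 4))

# Left panel: exponential growth is a straight line on a log-scaled y-axis.
ax_log = fig.add_subplot(1, 2, 1)
ax_log.plot(x, np.exp(x))
ax_log.set_yscale('log')

# Right panel: the same plot() call, but on polar axes (angle, radius).
ax_polar = fig.add_subplot(1, 2, 2, projection='polar')
ax_polar.plot(theta, 1 + np.cos(theta))
```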

The best way to learn is [the gallery](http://matplotlib.org/gallery.html)

In [None]:
display.HTML('<iframe src="http://matplotlib.org/gallery.html" height=500 width=1024></iframe>')

### A handful of examples

Scatter plots and "bubble charts"

In [None]:
n = 20
x = np.random.normal(size=n)
y = np.random.normal(size=n)
c = np.random.uniform(size=n)
s = np.random.randint(100, size=n)
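
One way these arrays might be drawn as a bubble chart: `c` sets each point's color and `s` its marker area. The setup is repeated so the sketch runs on its own; the colormap and transparency are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 20
x = np.random.normal(size=n)
y = np.random.normal(size=n)
c = np.random.uniform(size=n)        # one color value per point
s = np.random.randint(100, size=n)   # one marker area per point

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=c, s=s, cmap='viridis', alpha=0.7)
fig.colorbar(points, ax=ax, label='c')
```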

#### Bar charts

In [None]:
people = ['Annie', 'Brian', 'Chelsea', 'Derek', 'Elise']
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))
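
A sketch of a horizontal bar chart for these values, with `error` drawn as error bars. The setup is repeated so the block is self-contained; the axis label is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

people = ['Annie', 'Brian', 'Chelsea', 'Derek', 'Elise']
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))

y_pos = np.arange(len(people))
fig, ax = plt.subplots()

# One bar per person; xerr adds a horizontal error bar to each.
bars = ax.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.set_xlabel('Performance')
```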

# Plotting with Pandas

matplotlib is a relatively *low-level* plotting package. By design, it makes very few assumptions about what constitutes good layout, but it has a lot of flexibility to allow the user to completely customize the look of the output.

On the other hand, Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.

In [None]:
normals = pd.Series(np.random.normal(size=10))
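
For a Series, a bare call to `plot` is enough to get a line plot of the values against the index. A minimal sketch:

```python
import numpy as np
import pandas as pd

normals = pd.Series(np.random.normal(size=10))

# No figure setup needed: pandas picks sensible defaults.
ax = normals.plot()
```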

Similarly, for a DataFrame:

In [None]:
variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                          'gamma': np.random.gamma(1, size=100), 
                          'poisson': np.random.poisson(size=100)})
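
Calling `plot` on the frame draws one line per column and adds a legend. The sketch below plots the running sums, which is just one illustrative choice:

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# One line per column, labelled automatically from the column names.
ax = variables.cumsum().plot()
```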

Pandas plotting methods return matplotlib `Axes` objects (an array of them when the call draws several subplots):
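
Because the return value is a regular matplotlib object, it can be customized with the usual methods. A small sketch; the labels are made up:

```python
import numpy as np
import pandas as pd

normals = pd.Series(np.random.normal(size=10))

# Keep the return value and refine the plot with matplotlib methods.
ax = normals.cumsum().plot()
ax.set_ylabel('cumulative sum')
ax.set_title('Random walk')
```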

As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for `plot`:
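
The argument in question is `subplots=True`. A sketch using the same kind of random frame as above (the figure size is arbitrary):

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# subplots=True gives each column its own panel; an array of axes comes back.
axes = variables.cumsum().plot(subplots=True, figsize=(8, 6))
```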

Or, we could use a secondary y-axis:
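
A sketch with the `secondary_y` argument, which moves the named column to a second scale on the right. The axis labels are illustrative additions.

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# 'normal' is drawn against the right-hand axis, the rest against the left.
ax = variables.cumsum().plot(secondary_y='normal')
ax.set_ylabel('gamma, poisson')
ax.right_ax.set_ylabel('normal')
```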

(Note that ["friends don't let friends use two y-axes"](https://kieranhealy.org/blog/archives/2016/01/16/two-y-axes/), but we're just showing some examples here...)

If we would like a little more control, we can use matplotlib's `subplots` function directly, and manually assign plots to its axes:
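
A sketch of that approach: create the grid first, then hand each axes object to a pandas `plot` call through the `ax` argument. The layout and labels are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

# Assign each column to a specific panel by passing its axes explicitly.
for ax, var in zip(axes, ['normal', 'gamma', 'poisson']):
    variables[var].cumsum().plot(ax=ax, title=var)
axes[0].set_ylabel('cumulative sum')
```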

### Bar plots

Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the `plot` method with a `kind='bar'` argument.

For this series of examples, let's load up the Titanic dataset:

In [None]:
titanic = pd.read_excel("data/titanic.xls", sheet_name="titanic")
titanic.head()
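
A typical first bar plot is the number of survivors per passenger class. The spreadsheet is not bundled here, so the sketch below uses a tiny made-up stand-in frame; the column names `pclass` and `survived` are assumptions about the dataset's layout.

```python
import pandas as pd

# Made-up stand-in rows; the notebook itself reads data/titanic.xls.
# Column names `pclass` and `survived` are assumed, not confirmed.
toy = pd.DataFrame({'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
                    'survived': [1, 1, 1, 0, 0, 0, 1, 0]})

# Count survivors in each class, then draw one bar per class.
survivors = toy.groupby('pclass')['survived'].sum()
ax = survivors.plot(kind='bar')
ax.set_ylabel('survivors')
```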

Or if we wanted to see survival _rate_ instead:
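
Since `survived` is coded 0/1, the group mean is the survival rate. The sketch again uses a made-up stand-in frame with assumed column names `pclass` and `survived`:

```python
import pandas as pd

# Made-up stand-in rows; column names are assumptions about the real data.
toy = pd.DataFrame({'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
                    'survived': [1, 1, 1, 0, 0, 0, 1, 0]})

# The mean of a 0/1 column is the fraction of ones, i.e. the rate.
rate = toy.groupby('pclass')['survived'].mean()
ax = rate.plot(kind='bar')
ax.set_ylabel('survival rate')
```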

### Histograms

Frequently it is useful to look at the *distribution* of data before you analyze it. A histogram is a kind of bar graph that displays the frequencies of data values, so the y-axis is always some measure of frequency: either raw counts or scaled proportions.

For instance, fare distributions aboard the Titanic:
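
The sketch below shows both y-axis variants side by side. The fares are synthetic stand-in values (not the real data), and the bin count is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed stand-in for the fare column.
rng = np.random.default_rng(0)
fares = pd.Series(rng.lognormal(mean=3, sigma=1, size=200), name='fare')

fig, (ax_count, ax_density) = plt.subplots(1, 2, figsize=(10, 4))

# Left: raw counts per bin. Right: heights scaled so the bar areas sum to 1.
fares.plot(kind='hist', bins=20, ax=ax_count, title='raw counts')
fares.plot(kind='hist', bins=20, density=True, ax=ax_density, title='density')
```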

### Boxplots

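A different way of visualizing the distribution of data is the boxplot, which displays common quantiles: the box spans the quartiles, a line marks the median, and whiskers show the spread of the tails. Whisker conventions vary; matplotlib's default extends them to the most extreme points within 1.5 times the interquartile range rather than to fixed percentiles.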

One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.
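
A sketch of a grouped boxplot with the raw points overlaid. The data are synthetic stand-ins, and the column names `pclass` and `age` are assumptions; a little horizontal jitter keeps the points from hiding each other.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in: 30 "ages" for each of three passenger classes.
toy = pd.DataFrame({'pclass': np.repeat([1, 2, 3], 30),
                    'age': rng.normal(loc=30, scale=10, size=90)})

fig, ax = plt.subplots()
toy.boxplot(column='age', by='pclass', ax=ax, grid=False)

# Boxes sit at x = 1, 2, 3; scatter each group's points around its box.
for position, (pclass, group) in enumerate(toy.groupby('pclass'), start=1):
    x = rng.normal(loc=position, scale=0.04, size=len(group))
    ax.plot(x, group['age'], 'r.', alpha=0.4)
```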

### Scatter plots

In [None]:
df.head()

In [None]:
jittered_df = df[review_cols] + (np.random.rand(*df[review_cols].shape) - 0.5)
jittered_df.head()
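
The jitter spreads identical ratings apart so that a scatter plot shows density instead of a sparse grid. The sketch below applies the same idea to synthetic half-star ratings; the column names `review_aroma` and `review_taste` are assumptions about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in ratings from 1.0 to 5.0 in half-star steps.
reviews = pd.DataFrame({
    'review_aroma': rng.integers(2, 11, size=500) / 2,
    'review_taste': rng.integers(2, 11, size=500) / 2,
})

# Uniform noise in [-0.5, 0.5), as in the cell above.
jittered = reviews + (rng.random(reviews.shape) - 0.5)

ax = jittered.plot(kind='scatter', x='review_aroma', y='review_taste',
                   alpha=0.3)
```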

### Lots more info on Pandas plotting in [the docs](http://pandas.pydata.org/pandas-docs/stable/visualization.html)

## [Seaborn](http://seaborn.pydata.org/)

High-level interface for `matplotlib`

Seaborn's axes-level functions (such as `countplot`) also return `matplotlib` `Axes` objects...
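
A sketch of what that allows: draw with seaborn, then refine with matplotlib methods. The data are synthetic, and the column names `pclass` and `age` are placeholders.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Synthetic stand-in data with placeholder column names.
toy = pd.DataFrame({'pclass': np.repeat([1, 2, 3], 30),
                    'age': rng.normal(loc=30, scale=10, size=90)})

# seaborn draws the boxplot; the returned object takes matplotlib calls.
ax = sns.boxplot(x='pclass', y='age', data=toy)
ax.set_title('Age by class (synthetic data)')
sns.despine()
```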

## [ggplot](http://ggplot.yhathq.com/)

Another high-level `matplotlib` library, but this time mimicking R's `ggplot2`

In [None]:
from ggplot import *
(ggplot(diamonds, aes(x='carat', y='price', color='cut')) +
    geom_point() +
    scale_color_brewer(type='diverging', palette=4) +
    xlab("Carats") + ylab("Price") + ggtitle("Diamonds"))

## [Bokeh](http://bokeh.pydata.org/)

In [2]:
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.palettes import brewer
output_notebook()

N = 20
categories = ['y' + str(x) for x in range(10)]
data = {}
data['x'] = np.arange(N)
for cat in categories:
    data[cat] = np.random.randint(10, 100, size=N)

df = pd.DataFrame(data)
df = df.set_index(['x'])

def stacked(df, categories):
    areas = dict()
    last = np.zeros(len(df[categories[0]]))
    for cat in categories:
        upper = last + df[cat]  # top edge of this band (avoids shadowing built-in next)
        areas[cat] = np.hstack((last[::-1], upper))
        last = upper
    return areas

areas = stacked(df, categories)

colors = brewer["Spectral"][len(areas)]

x2 = np.hstack((data['x'][::-1], data['x']))

p = figure(x_range=(0, 19), y_range=(0, 800))
p.grid.minor_grid_line_color = '#eeeeee'

p.patches([x2] * len(areas), [areas[cat] for cat in categories],
          color=colors, alpha=0.8, line_color=None)

show(p, notebook_handle=True)
push_notebook()

## So many plotting libraries!

In [None]:
display.HTML('<iframe src="https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/" width=1024 height=500></iframe>')

## Exercise 6 - "Choose your own adventure" workshop

1. Grab the data of your choice
    - Can't think of anything? [GHDx](http://ghdx.healthdata.org/)
2. Load it into a Pandas `DataFrame`
3. Compute some summary statistics
4. Create some cool plots

## References

Slide materials inspired by and adapted from [Chris Fonnesbeck](https://github.com/fonnesbeck/statistical-analysis-python-tutorial) and [Tom Augspurger](https://github.com/TomAugspurger/pydata-chi-h2t).