In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import json  # the standard-library module is sufficient here

# get GOOG data
with open('small_data/goog.json') as raw_f:
    raw_data = raw_f.read()
    json_data = json.loads(raw_data)

goog = pd.DataFrame(json_data['data'], columns=json_data['column_names'])
goog['Day'] = goog.index.values
goog.set_index(pd.DatetimeIndex(goog['Date']), inplace=True)

# Layout and Design

<!-- requirement: small_data/fha_by_tract.csv -->
<!-- requirement: small_data/goog.json -->
<!-- requirement: small_data/cal_house.json.gz -->
<!-- requirement: images/sample-charts/pie-vs-bar.png -->
<!-- requirement: images/sample-charts/pie-vs-line.jpg -->
<!-- requirement: images/sample-charts/minard_pie.png -->
<!-- requirement: images/sample-charts/sparklines.png -->

A graphic's design can introduce distractions or obscure important relationships in the data. Let's examine some case studies and apply good design principles to see how they might be improved.

## Design Elements & Principles

By considering the design elements of a figure, we can easily identify possible transformations of a graph that may make it more legible.

- Axes: scale, limits, units
- Marks: symbols for data representation
- Labels: titles, axis labels, tick marks, grids, callouts, data labels, legend
- Layout: alignment and positioning of axes, marks, and labels
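
As a rough illustration of how these four elements map onto plotting code, the sketch below builds one small Matplotlib figure and touches each element in turn. The data is synthetic, and the limits, labels, and units are arbitrary choices made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, invented purely for illustration
x = np.linspace(1, 10, 50)
y = 3 * x ** 2

fig, ax = plt.subplots(figsize=(5, 3))

# Marks: the symbols that represent the data (here, circular markers)
ax.plot(x, y, marker='o', linestyle='none', label='$y = 3x^2$')

# Axes: scale, limits, and units
ax.set_yscale('log')
ax.set_xlim(0, 11)

# Labels: title, axis labels (with units), and legend
ax.set_title('Synthetic quadratic data')
ax.set_xlabel('Time (s)')
ax.set_ylabel('Distance (m)')
ax.legend(loc='upper left')

# Layout: positioning of axes and labels within the figure
fig.tight_layout()
```

Changing any one of these calls (for example, switching the y-axis back to a linear scale) is one of the "possible transformations" mentioned above.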

Many of these design elements correspond to elements of the [Grammar of Graphics, which formed the basis of ggplot](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf) and has influenced [Plotly](https://en.wikipedia.org/wiki/Plotly), [Bokeh](https://bokeh.pydata.org/en/latest/), and [Vega](http://vega.github.io/), among other plotting libraries.

Some additional useful principles identified by [Edward Tufte](https://en.wikipedia.org/wiki/Edward_Tufte) that extend beyond these design elements are the "data-ink ratio" and "data density."

## Examples (mostly bad, sometimes good)


It is often easier to learn by examining poor examples, since bad choices tend to be glaringly obvious. With that in mind, examine some visualizations from http://viz.wtf/.  Pay particular attention to the design elements of each figure and the visual cues it activates.

Here are a few samples:
1. http://viz.wtf/image/182223114674

1. http://viz.wtf/image/161589799318

1. http://viz.wtf/post/159271251475/pick-up-sticks-and-tiddlywinks

1. http://viz.wtf/post/162284459410/circleception

1. https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png

## Axes (use them!)


Axes form the coordinate system for our graph, meaning they are the basis for locating and comparing data points. With rare exceptions, all the data on the graph should share a common set of axes. We also want to choose limits and a scale for axes that will clearly display details in the data that we want to highlight. Generally we want to choose axes that make the data look neither sparse nor compressed.

### Avoid Pie Charts


Pie charts essentially have no axes, making it extremely difficult to compare the data elements/proportions (technically they use a polar coordinate system, but polar coordinates are a poor choice for linear/aperiodic data).  A bar chart is almost always the right answer when representing proportions.

<img src="images/sample-charts/pie-vs-bar.png" alt="Pie versus Bar graph" style="width:80%;">
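
To try the comparison yourself, the sketch below draws the same five proportions as a pie chart and as a bar chart. The category names and shares are hypothetical values chosen to be close together, which is where a pie chart is hardest to read.

```python
import matplotlib.pyplot as plt

# Hypothetical shares, invented for illustration; they sum to 100
labels = ['A', 'B', 'C', 'D', 'E']
shares = [23, 21, 20, 19, 17]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Pie: values are encoded as angles, so similar shares are hard to rank
ax_pie.pie(shares, labels=labels)
ax_pie.set_title('Pie chart')

# Bar: values are encoded as lengths along a common axis
ax_bar.bar(labels, shares)
ax_bar.set_ylabel('Share (%)')
ax_bar.set_title('Bar chart')

fig.tight_layout()
```

In the bar chart the ordering of the categories can be read directly off the shared y-axis.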

Furthermore, since they have no axes, we have no ability to transform an axis to a different variable/unit or scale. For example, pie charts are a poor choice for showing trends in time.
<img src="images/sample-charts/pie-vs-line.jpg" alt="Pie versus Line Graph" style="width:80%;">

The one place where pie charts excel is that they scale to small sizes very well. When showing one to three proportions in a small space, a pie chart may be the only good option.
<img src="images/sample-charts/minard_pie.png" alt="Pie Chart small multiple" style="width:80%;">

### Choosing the right axes

Choosing suitable axes is often a matter of informed experimentation and knowledge of your intended audience. For example, highly skewed data may appear sparse on a linear axis but is more evenly distributed on a log axis. 

In [None]:
names = ["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL",
        "NUM_FHA", "PCT_NUM_FHA", "AMT_ALL", "AMT_FHA", "PCT_AMT_FHA"]
# Loading a CSV file, without a header (so we have to provide field names)
df = pd.read_csv('small_data/fha_by_tract.csv', names=names)

plt.figure(figsize=[6, 5])
plt.subplot(311)
df['AMT_ALL'].hist(bins=50)
plt.xticks([0, 4*10**5, 8*10**5, 12*10**5, 16*10**5])
plt.subplot(312)
df['AMT_ALL'].hist(bins=1000)
plt.xlim([0, 10**4])
plt.subplot(313)
df['AMT_ALL'].apply(np.log10).hist(bins=50)

plt.tight_layout()
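
The third panel above log-transforms the values themselves, so its tick labels are exponents rather than amounts. As an alternative sketch, using synthetic log-normal data as a stand-in for the loan amounts, the axis can be log-scaled instead, with logarithmically spaced bins, so the ticks stay in the original units.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, highly skewed data (log-normal); a stand-in, not the FHA data
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=8, sigma=1.5, size=5000)

# Logarithmically spaced bin edges spanning the data
log_bins = np.geomspace(amounts.min(), amounts.max(), 50)

fig, (ax_lin, ax_log) = plt.subplots(2, 1, figsize=(6, 4))

# Linear axis: most values pile up in the first few bins
ax_lin.hist(amounts, bins=50)
ax_lin.set_title('Linear axis')

# Log axis: same data, but tick labels remain in the original units
ax_log.hist(amounts, bins=log_bins)
ax_log.set_xscale('log')
ax_log.set_title('Log axis')

fig.tight_layout()
```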

Nonetheless, a log axis may be a poor choice if your audience has a difficult time interpreting logarithmic scales or is unfamiliar with how relationships between variables appear on a linear-log or log-log scale.

In [None]:
x = np.linspace(0, 3, 100)

plt.figure(figsize=[8, 3])
for subplot in range(3):
    plt.subplot(131 + subplot)
    plt.plot(x, x, '.', label='linear')
    plt.plot(x, x**2, '.', label='power')
    plt.plot(x, np.exp(x), '.', label='exponential')
    if subplot > 0:
        plt.yscale('log')
    if subplot > 1:
        plt.xscale('log')
    plt.legend()

plt.tight_layout()
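
The shapes in the panels above can be checked numerically. On a linear-log scale an exponential becomes a straight line, and on a log-log scale a power law becomes a straight line whose slope is the exponent. The sketch below fits straight lines in the transformed coordinates. It starts `x` slightly above zero (an assumption of this example) because zero has no logarithm.

```python
import numpy as np

# Start above zero: log(0) is undefined
x = np.linspace(0.1, 3, 100)

# Log-log: y = x**2 becomes log10(y) = 2 * log10(x), a line with slope 2
power_slope, _ = np.polyfit(np.log10(x), np.log10(x ** 2), 1)

# Linear-log: y = exp(x) becomes ln(y) = x, a line with slope 1
exp_slope, _ = np.polyfit(x, np.log(np.exp(x)), 1)

print(f"log-log slope of x**2:      {power_slope:.3f}")
print(f"linear-log slope of exp(x): {exp_slope:.3f}")
```

A reader who knows these two rules can identify the functional form of a relationship at a glance. A reader who does not may misread a straight line on a log scale as linear growth.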

## Choosing the right mark

Most (2D) plots will use one of a few marks: bars, scattered points/symbols, lines, or areas. In most cases it should be fairly obvious what to use, but sometimes we can leverage the implications of a mark to imply relationships that might otherwise be difficult to represent.

Lines imply continuity through time or space. They are useful for representing trends over time or distributions over a variable. 

In [None]:
plt.figure(figsize=[6,2])
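# histplot replaces distplot, which is deprecated in recent Seaborn releases
sns.histplot(df['AMT_ALL'].apply(np.log), kde=True, stat='density')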

In [None]:
import gzip
import json
from scipy.stats import linregress

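# Open the gzipped file in text mode and close it once it has been parsed
with gzip.open('small_data/cal_house.json.gz', 'rt') as cal_f:
    cali_data = json.load(cal_f)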
X = pd.DataFrame(cali_data['data'], columns=cali_data['feature_names'])
y = cali_data['target']

slope, intercept, _, _, _ = linregress(X['MedInc'], y)

plt.figure(figsize=[6, 2])
plt.plot(X['MedInc'], y, 'b.', alpha=.2)
plt.plot(X['MedInc'].sort_values(), slope * X['MedInc'].sort_values() + intercept, 'k--')
plt.ylim([0, 5.1])
plt.xlim([0, 10])

This implied continuity can be leveraged even if the continuous variable is not in the axes of the plot.

<img src="images/sample-charts/driving.jpg" alt="Gas mileage graph" style="width:80%;">
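
A connected scatterplot is one way to do this: two variables go on the axes, and a line joins the points in time order so that time is implied by the path. The sketch below uses made-up yearly values (not the data behind the image above) and labels a few points with their year.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up yearly values for two quantities; for illustration only
years = np.arange(2000, 2010)
miles_driven = np.array([9.0, 9.2, 9.5, 9.7, 9.9, 10.0, 10.1, 10.0, 9.8, 9.6])
fuel_price = np.array([1.5, 1.4, 1.4, 1.6, 1.9, 2.3, 2.6, 2.8, 3.3, 2.4])

fig, ax = plt.subplots(figsize=(5, 4))

# The line joins consecutive years, so the path itself encodes time
ax.plot(miles_driven, fuel_price, '-o', color='k', ms=4)

# Label every third point so the direction of travel is readable
for year, x_val, y_val in zip(years[::3], miles_driven[::3], fuel_price[::3]):
    ax.annotate(str(year), (x_val, y_val),
                textcoords='offset points', xytext=(5, 5))

ax.set_xlabel('Miles driven (thousands per person)')
ax.set_ylabel('Fuel price (dollars per gallon)')
fig.tight_layout()
```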

## Data-Ink Ratio

When creating labels (including legends) and organizing the layout of the graphic, we should aim to keep the fluff to a minimum and focus attention on the data. One useful way to think about this focus is the "data-ink ratio."
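
Tufte defines the ratio as the share of a graphic's ink that is devoted to the non-redundant display of data:

$$
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
$$

Equivalently, it is 1 minus the proportion of the graphic that can be erased without losing information about the data. The ratio is a guide for thinking rather than something usually measured, so treat it as a heuristic.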

<img src="http://www.randalolson.com/wp-content/uploads/data-ink.gif" alt="data-ink" class="aligncenter size-full wp-image-4883">

Throughout this notebook we've been using Seaborn's visual defaults, which include a pale blue background with white gridlines. This is a nice way to provide gridlines without being distracting, but we should still practice questioning the visual defaults of our plotting software and looking for opportunities to minimize ink spent on graphical elements other than data.
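
As one possible way to question the defaults, the sketch below plots the same data twice: once with whatever style is currently active, and once with the background tint, gridlines, and enclosing box removed. It uses plain Matplotlib calls, which also work on Seaborn-styled axes. Which elements to remove is a judgment call, not a fixed rule.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 3, 100)

fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(8, 3))

# Before: whatever the current style provides (box, grid, background)
ax_before.plot(x, x ** 2, '.')
ax_before.set_title('Current defaults')

# After: strip elements that carry no data
ax_after.plot(x, x ** 2, '.')
ax_after.set_facecolor('white')                # no tinted background
ax_after.grid(False)                           # no gridlines
for side in ('top', 'right'):
    ax_after.spines[side].set_visible(False)   # no enclosing box
for side in ('left', 'bottom'):
    ax_after.spines[side].set_color('0.3')     # keep a thin frame of reference
ax_after.set_title('Reduced non-data ink')

fig.tight_layout()
```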

A corollary of maintaining high data-ink ratio is that graphics should have high "data density." One way to improve data density is to simplify and shrink visual elements. Since often we are looking for gross trends or patterns in the data, we may be able to sacrifice a great amount of detail without sacrificing much meaning, freeing up visual space for more data. We'll see this applied in the layout of [small multiples](#Small-Multiples).

Finally, avoid obscuring data with legends whenever possible. Place legends outside of the axes if necessary.

In [None]:
x = np.linspace(0, 3, 100)

plt.figure(figsize=[8, 4])
plt.subplot(121)
plt.plot(x, x, '.', label='linear')
plt.plot(x, x**2, '.', label='power')
plt.plot(x, np.exp(x), '.', label='exponential')
plt.legend(loc=3)
plt.title('Horrible legend placement')

ax = plt.subplot(122)
plt.plot(x, x, '.', label='linear')
plt.plot(x, x**2, '.', label='power')
plt.plot(x, np.exp(x), '.', label='exponential')
# Shrink current axis's height by 10% on the bottom
box = ax.get_position()
ax.set_position([box.x0, box.y0 + box.height * 0.1,
                 box.width, box.height * 0.9])

# Put a legend below current axis
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1),
          fancybox=True, shadow=True, ncol=5)
plt.title('Legend outside axes')

plt.tight_layout()

## Dealing with multiple scales


Often you might want to display several different values on the same plot.  For example, here we wish to display both closing price and volume from a series of Google stock data.  We can combine our design elements in several ways to do this.

First, we start by creating two separate graphs.  This helps us read the values on both series easily, but it's a bit difficult to compare the two series at a glance.

In [None]:
goog_recent = goog[-100:]

fig, axes = plt.subplots(2, 1)
goog_recent.plot(kind='scatter', x='Day', y='Close', ax=axes[0])
goog_recent.plot(kind='scatter', x='Day', y='Volume', ax=axes[1])

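We could also combine them into a single graph with two y-axes, but if both series use the same kind of mark, this can also be confusing. The first plot below draws both series as lines; the second switches volume to bars so the two are easier to tell apart.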

In [None]:
goog_recent = goog[-100:]

fig, ax1 = plt.subplots(figsize=[8, 4])

ln1 = ax1.plot(goog_recent['Day'], goog_recent['Close'], 'k', label='closing price')
ax1.set_ylabel('Closing price')
ax1.grid(False)

ax2 = ax1.twinx()
ln2 = ax2.plot(goog_recent['Day'], goog_recent['Volume'], 'r--', alpha=.3, label='volume')
ax2.set_ylabel('Volume')
ax2.grid(False)

lines = ln1 + ln2
labels = [line.get_label() for line in lines]
plt.legend(lines, labels, loc=3)

In [None]:
goog_recent = goog[-100:]

fig, ax1 = plt.subplots(figsize=[8, 4])

ln1 = ax1.plot(goog_recent['Day'], goog_recent['Close'], 'k', label='closing price')
ax1.set_ylabel('Closing price')
ax1.grid(False)

ax2 = ax1.twinx()
ln2 = ax2.bar(goog_recent['Day'], goog_recent['Volume'], color='r', alpha=.3, label='volume')
ax2.set_ylabel('Volume')
ax2.grid(False)

lines = ln1 + [ln2]
labels = [line.get_label() for line in lines]
plt.legend(lines, labels, loc=3)

Size is the next-best cue after position, so let's try that.  It's easier to see everything at a glance, but the varying sizes are a bit distracting and make it harder to judge the position (both in time and price). Furthermore, it is difficult to represent the scale for volume.

In [None]:
goog_recent.plot(kind='scatter', x='Day', y='Close', s=goog_recent['Volume']/50000)
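
One partial remedy for the missing volume scale is a size legend. The sketch below uses synthetic data in place of the stock series and relies on `legend_elements`, a method of the collection returned by `ax.scatter` (available in Matplotlib 3.1 and later). The `scale` divisor mirrors the 50000 used above; everything else is an assumption made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the stock data: day number, price, and volume
rng = np.random.default_rng(0)
day = np.arange(100)
close = 500 + np.cumsum(rng.normal(0, 5, size=100))
volume = rng.integers(1_000_000, 6_000_000, size=100)

scale = 50_000  # marker area = volume / scale
fig, ax = plt.subplots(figsize=(8, 4))
points = ax.scatter(day, close, s=volume / scale, alpha=0.5)

# Build legend entries from the marker sizes; func maps a size back to volume
handles, labels = points.legend_elements(
    prop='sizes', num=4, alpha=0.5, func=lambda s: s * scale)
ax.legend(handles, labels, title='Volume',
          loc='upper left', bbox_to_anchor=(1.02, 1))

ax.set_xlabel('Day')
ax.set_ylabel('Close')
fig.tight_layout()
```

Even with a legend, readers compare marker areas far less accurately than positions, so this eases the problem without removing it.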

We can also use intensity or color.  These plots are easier to read, although it's hard to see small variations in the volume, and we place a heavy emphasis on high-volume days (which may be what we want to do!).

In [None]:
fig, axes = plt.subplots(2, 1, figsize=[8, 4])

goog_recent.plot(kind='scatter', x='Day', y='Close', c=goog_recent['Volume'], 
                 cmap=plt.cm.binary, ax=axes[0]);
goog_recent.plot(kind='scatter', x='Day', y='Close', c=goog_recent['Volume'], 
                 cmap=plt.cm.viridis, ax=axes[1]);

But do we actually need to see small variations in the volume?  The plots above seem to show that volume was larger than usual during the recent growth in price.  Instead of coloring by the volume directly, we'll color based on whether the volume is greater or less than average.

In [None]:
large_volume = goog_recent['Volume'] > goog_recent['Volume'].mean()
plt.plot(goog_recent['Day'][large_volume], goog_recent['Close'][large_volume],
         '.', label='Volume > Average', ms=10, mfc=sns.color_palette()[4])
plt.plot(goog_recent['Day'][~large_volume], goog_recent['Close'][~large_volume],
         '.', label='Volume < Average', ms=10, mfc=sns.color_palette()[0])
plt.legend()
plt.xlabel('Day')
plt.ylabel('Close');

Sure enough, it was.  Sometimes it can be better for your story to reduce the amount of information on a plot.


## Small Multiples

Sometimes we want to view many data series at once. Plotted together, they may clutter a single graph or cover very different ranges on the axes. As long as the scale and units of the axes remain the same, it can be more useful to create many small figures, which makes it possible to detect gross trends among many series. This is the basis for Tufte's ["sparklines"](https://en.wikipedia.org/wiki/Sparkline), which are commonly used to represent stock price data.

![sparklines](images/sample-charts/sparklines.png)
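
A minimal sketch of sparkline-style small multiples in Matplotlib follows. The ticker names and random-walk "prices" are synthetic, and the figure size and styling are arbitrary choices. The shared axes are what keep the panels comparable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic random-walk "prices" for five hypothetical tickers
rng = np.random.default_rng(1)
tickers = ['AAA', 'BBB', 'CCC', 'DDD', 'EEE']
series = {name: 100 + np.cumsum(rng.normal(0, 1, size=250)) for name in tickers}

# One short, wide panel per series; shared axes keep scales comparable
fig, axes = plt.subplots(len(tickers), 1, figsize=(4, 3),
                         sharex=True, sharey=True)

for ax, (name, values) in zip(axes, series.items()):
    ax.plot(values, color='k', linewidth=0.8)
    ax.plot(len(values) - 1, values[-1], 'r.')  # mark the latest value
    ax.axis('off')                               # drop ticks, spines, grid
    ax.text(-0.02, 0.5, name, transform=ax.transAxes,
            ha='right', va='center', fontsize=8)

fig.subplots_adjust(left=0.2, hspace=0.1)
```

Dropping `sharey=True` would let each panel fill its own vertical range, which shows each series' shape more clearly but makes magnitudes no longer comparable across panels.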

This concept of small multiples can even be leveraged within a single graph to create new types of plots. For example, a [violin plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html) essentially applies the principle of small multiples to histograms.

In [None]:
tips = sns.load_dataset("tips")

plt.figure(figsize=[8, 6])
plt.subplot(211)
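# Overlay one distribution per day, using the same variable as the violin plot
# below so the two panels are comparable (histplot replaces the deprecated distplot)
for day in tips['day'].unique():
    sns.histplot(tips[tips['day'] == day]['total_bill'], kde=True,
                 stat='density', alpha=.4)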
    
plt.subplot(212)
sns.violinplot(x="day", y="total_bill", data=tips)

plt.tight_layout()

*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*