# A Very, Very Brief Introduction to Data Visualization (Part II)

This document, together with [part1](part1.ipynb), constitutes a short two-part course covering the basics of data visualization.

# About Me
<br>
<div style="font-size: larger;">
Chandrasekhar (Sekhar) Ramakrishnan<br>
<a href="https://twitter.com/ciyer">@ciyer</a><br>
<br>
<a href="https://datascience.ch">Swiss Data Science Center</a> and freelance data scientist; teach data viz at <a href="https://propulsion.academy">Propulsion Academy</a>

<a href="https://illposed.com"><img alt="illposed logo" src="images/illposed-logo.svg" width="300px"/></a>
</div>

# Part 1: Thinking about visualizations

See [part1](part1.ipynb)

In the first part, we developed a framework for thinking about visualizations and looked at some ideas and tools that can help us build on a solid foundation.

# Part 2: Visualizations for reasoning about data

* The importance of **context**
* Show multiple variables using **layering**
* Use **highlighting** to draw attention to particular items
* **Faceting** lets you show more data

Part 2 continues on this foundation to look at how to make visualizations for reasoning about data.

# The Importance of **Context**

## Edward Tufte *Envisioning Information* (1990)

At the heart of quantitative reasoning is a single question:

**Compared to what?**

## Enabling Comparison

How can we support making comparisons?

<div style="font-size: larger; margin: 20px;">
    To make comparisons possible, give <b>context</b> by <b>layering information</b> and increasing <b>data density</b>. Highlight to <b>draw attention</b> to important features.
</div>

Quantitative reasoning is about *comparison*, and to make comparisons, you need to see *data in context*. This is done through layering of information, and increasing the density of data. There is some tension between density and clarity, though. As we have more data, it is easier to lose the forest for the trees. Highlighting is a way to present high density data and still give signposts to guide the viewer to relevant information. We will look at how to use color, transparency, and text to achieve this.

# **Providing Context** using color and text

## **Color** to layer information

![FRED](https://fred.stlouisfed.org/graph/fredgraph.png?g=EaVv) https://fred.stlouisfed.org/series/PCE#0

Plots of economic data from FRED always shade in periods of recession, since many economic variables behave differently when the economy is contracting. This is a plot of *personal consumption expenditures*, and an economy in recession is one obvious reason why people may be spending less (in aggregate).

On the face of it, this looks like a plot of two variables, YoY change in PCE (y-axis) and time (x-axis). But a third variable is in fact needed to make this chart: YoY change in (nominal) GDP. We do not see it in full fidelity, and we do not need to. We only care whether this variable is below zero, in which case we mark the region in gray (a recession that has finished) or yellow (a recession that is ongoing).
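
As a rough sketch of the shading idea in matplotlib: the numbers below are synthetic and made up for illustration (they are not FRED data), and `contiguous_spans` is a hypothetical helper, not part of any library. The third variable is reduced to a boolean ("below zero or not") and drawn as background bands with `axvspan`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np


def contiguous_spans(x, mask):
    """Return (start, end) pairs of x values for each contiguous run where mask is True.

    A run ends at the first x value where the mask is False again
    (or at the last x value if the run reaches the end of the data).
    """
    spans = []
    start = None
    for xi, flag in zip(x, mask):
        if flag and start is None:
            start = xi
        elif not flag and start is not None:
            spans.append((start, xi))
            start = None
    if start is not None:
        spans.append((start, x[-1]))
    return spans


# Synthetic yearly values, invented for this example only
years = np.arange(2000, 2012)
spending_change = np.array([5.0, 4.2, 3.8, 4.5, 5.1, 5.3, 4.9, 3.0, -1.2, -0.5, 2.8, 3.6])
growth = np.array([4.0, 1.0, 1.8, 2.9, 3.8, 3.5, 2.8, 1.9, -0.1, -2.5, 2.6, 1.6])

# Only the sign of the third variable matters for the shading
recession_spans = contiguous_spans(years, growth < 0)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(years, spending_change, color="C0")
for start, end in recession_spans:
    # Light gray band drawn behind the data line
    ax.axvspan(start, end, color="0.85", zorder=0)
ax.set_xlabel("Year")
ax.set_ylabel("YoY change (%)")
fig.tight_layout()
```

The band carries one bit of information per time step, which is why it can sit behind the main series without competing with it.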

## **Text** to layer information

![Three-point Scatter](https://fivethirtyeight.com/wp-content/uploads/2015/12/morris-stephcurry-1.png?w=600)
http://fivethirtyeight.com/features/stephen-curry-is-the-revolution/

Using text is another way of layering information. This plot is taken from an article about basketball and illustrates how the teams in the semifinals of the 2014 – 2015 season playoffs all took advantage of the 3-point shot. This is a plot of three variables: two quantitative ones and a nominal variable – team name, but the nominal variable is not shown for all the teams, just a selection of them necessary for contextualizing the information.
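
A minimal sketch of selective labeling with matplotlib's `annotate`, assuming random stand-in data: the team names, values, and the choice of which teams to label are all invented here and have nothing to do with the figures in the article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teams with two made-up quantitative variables
names = [f"Team {i:02d}" for i in range(30)]
x = rng.uniform(15, 35, size=30)
y = rng.uniform(95, 112, size=30)

# Only a small selection of points gets a text label
to_label = {"Team 03", "Team 17", "Team 28"}

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y, color="0.6")
for name, xi, yi in zip(names, x, y):
    if name in to_label:
        # Offset the label a few points from the marker so it does not cover it
        ax.annotate(name, (xi, yi), xytext=(5, 5), textcoords="offset points")
ax.set_xlabel("Variable A (made up)")
ax.set_ylabel("Variable B (made up)")
fig.tight_layout()
```

Labeling every point would turn the nominal variable into clutter; labeling a handful keeps it as context.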

## **Clarity in density** with color and text

![Curry Scatter](https://fivethirtyeight.com/wp-content/uploads/2015/12/morris-stephcurry-21.png?w=600)
http://fivethirtyeight.com/features/stephen-curry-is-the-revolution/

Here is another plot from the same article, about a basketball player named Stephen Curry. To explain Curry’s skill, this plot shows how he compares to (1) the league in general and (2) the best players of the league. It does this by (1) achieving high data density, (2) layering multiple variables, and (3) highlighting and drawing your attention to the relevant data.
To achieve the high density, transparency is used to allow many dots to overlap but still be distinct. Text labels and color are used to highlight a small number of players to examine in detail.
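
The combination of transparency and highlighting could be sketched roughly as follows. The data is randomly generated, and the player names and the choice of highlighted points are placeholders, not values from the article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: many background points plus a few to highlight
n = 2000
volume = rng.gamma(2.0, 2.0, size=n)
accuracy = rng.normal(35, 6, size=n)
highlight = np.zeros(n, dtype=bool)
highlight[:3] = True  # arbitrary choice of three points to call out
labels = ["Player A", "Player B", "Player C"]  # hypothetical names

fig, ax = plt.subplots(figsize=(7, 7))

# Background layer: low alpha lets overlapping dots accumulate into density
ax.scatter(volume[~highlight], accuracy[~highlight],
           s=15, color="0.4", alpha=0.15, linewidths=0)

# Highlight layer: opaque, saturated color, drawn on top
ax.scatter(volume[highlight], accuracy[highlight], s=40, color="C3", zorder=3)
for label, xi, yi in zip(labels, volume[highlight], accuracy[highlight]):
    ax.annotate(label, (xi, yi), xytext=(6, 6), textcoords="offset points", color="C3")

ax.set_xlabel("Attempts per game (made up)")
ax.set_ylabel("Accuracy % (made up)")
fig.tight_layout()
```

Drawing the two layers as separate `scatter` calls keeps the background and the highlights independently styled.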

## **Model** to aid interpretation

![Messi Scatter](https://fivethirtyeight.com/wp-content/uploads/2014/06/morris-feature-messi-2.png?w=600) https://fivethirtyeight.com/features/lionel-messi-is-impossible/

The data that is layered over a basic plot can come from the same data source, a secondary data source, or be synthetic — computed using the data. 

Here is an example of layering used to show data with a model of it. In this case, we see a regression of shooting efficiency vs. shooting volume of a large number of football players, making it possible to compare two extraordinary players: Messi and Ronaldo. 
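
One simple way to layer a model over a scatterplot is a least-squares line fitted with `np.polyfit`. This is only a sketch on synthetic data; the article's own model and data are not reproduced here, and the true slope and intercept below are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with a known linear trend plus noise
x = rng.uniform(0, 10, size=200)
y = 0.5 * x + 1.0 + rng.normal(0, 1.0, size=200)

# Degree-1 polynomial fit; coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line over the observed x range
xs = np.linspace(x.min(), x.max(), 100)

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(x, y, s=15, color="0.5", alpha=0.5)       # data layer
ax.plot(xs, slope * xs + intercept, color="C1", lw=2)  # model layer
ax.set_xlabel("x (made up)")
ax.set_ylabel("y (made up)")
fig.tight_layout()
```

With the line in place, individual points can be read as above or below what the model expects, which is the comparison the plot is meant to support.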

# **Providing Context** Data Density

Viewing as much data as possible is key to proper interpretation and robust understanding. Cherry-picking a subset of the data is a recipe for misleading (intentionally or unintentionally).

## **Phillips Curve** (after Tufte)

Following an example presented by Tufte in *The Visual Display of Quantitative Information* and revisited on his [website](https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00041w), consider the Phillips Curve, a concept from economics which states that there is an inverse relationship between the unemployment rate and the inflation rate: as the unemployment rate declines, the inflation rate should increase.

If we look at data from 1961 to 1969, this seems plausible.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import introviz

introviz.set_style()

In [None]:
phillips_df = introviz.phillips.read_data("data/phillips-ue-cpi.csv")
fig, ax = plt.subplots(figsize=(8, 8))
introviz.phillips.xy_plot(ax, phillips_df.loc[(slice(None), slice("1961", "1969")), :], "USA")
ax.set_xlabel("Annual unemployment rate (%)")
ax.set_ylabel("Inflation rate (annual % change in CPI)")
ax.xaxis.set_label_coords(0.3, -0.05)
introviz.phillips.cite_source(ax, "OECD (data.oecd.org)")
fig.tight_layout()

As a consumer of visualizations, you should be skeptical when you see small amounts of data displayed: it is easy to cherry-pick data to support an argument, hiding data that does not. As a creator of visualizations, you should aim to show as much data as possible. Not only is it honest, it is also more convincing. If the conclusion still holds when looking at a lot of data, it is more robust.

If we expand our view to 60 years of data covering the period 1960 to 2020, a more complex picture emerges, revealing periods in the US of high inflation and high unemployment (1970s/early 1980s) and periods of decreasing unemployment and low inflation (2014-2019).

In [None]:
phillips_df = introviz.phillips.read_data("data/phillips-ue-cpi.csv")
fig, ax = plt.subplots(figsize=(8, 8))
introviz.phillips.xy_plot(ax, phillips_df, "USA")
ax.set_xlabel("Annual unemployment rate (%)")
ax.set_ylabel("Inflation rate (annual % change in CPI)")
ax.xaxis.set_label_coords(0.3, -0.05)
introviz.phillips.cite_source(ax, "OECD (data.oecd.org)")
fig.tight_layout()

## **Phillips Curve** multiple countries

William Phillips identified the Phillips curve while studying unemployment and inflation in the United Kingdom. Maybe the US and UK behave differently in this regard. How can we look at a larger array of countries?

One way would be to overlay multiple plots in one figure. As you can see, that does not work very well when data density becomes high. It is very difficult to distinguish the two series from one another.

In [None]:
palette = sns.color_palette()
fig, ax = plt.subplots(figsize=(8, 8))
introviz.phillips.xy_plot(ax, phillips_df, "USA")
introviz.phillips.xy_plot(ax, phillips_df, "GBR", s_color=palette[2], r_color=palette[3])
ax.set_xlabel("Annual unemployment rate (%)")
ax.set_ylabel("Inflation rate (annual % change in CPI)")
ax.legend()
ax.xaxis.set_label_coords(0.3, -0.05)
introviz.phillips.cite_source(ax, "OECD (data.oecd.org)")
fig.tight_layout()

# **Faceting**

## **Edward Tufte** *Envisioning Information* (1990)

<div style="font-size: larger; margin: 20px;">
    At the heart of quantitative reasoning is a single question:
    <p>
    <b>Compared to what?</b>
    </p>
    <p>Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.</p>
</div>

Tufte (Envisioning Information, p. 67)

To compare the relationship between unemployment and inflation in the USA and UK, we can do a small-multiples plot. Each plot is a complex scatterplot, showing the movement of the series through time, layering a model and orienting text, yet they remain readable (and could be improved with a little intervention in Illustrator).

Notice how the x- and y-axes cover the same range in both plots. This is important to make the plots comparable and should only be violated with good reason.

In [None]:
palette = sns.color_palette()
fig, axs = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(16, 8))
introviz.phillips.xy_plot(axs[0], phillips_df, "USA")
introviz.phillips.xy_plot(axs[1], phillips_df, "GBR", text_offset=(0, -0.5))
axs[0].set_xlabel("Annual unemployment rate (%)")
axs[0].set_ylabel("Inflation rate (annual % change in CPI)")
axs[0].xaxis.set_label_coords(0.3, -0.05)
axs[0].set_title("USA (1960 - 2020)")
axs[1].set_title("GBR (1984 - 2020)")
introviz.phillips.cite_source(axs[1], "OECD (data.oecd.org)")
fig.tight_layout()

## **Phillips Curve** nine countries

The faceted plot approach can be extended to a large number of multiples as well. Here we see the Phillips Curve plotted for nine countries, giving a richer and more nuanced view of the relationship between unemployment and inflation. The relationship broadly holds for two or three countries (Sweden, Japan, and, if we want to be generous, Italy), but is not a good description of what we see in the other six. And in the UK, the slope is opposite to what is expected.

In [None]:
tdf = phillips_df.loc[(["CAN", "USA", "JPN", "GBR", "FRA", "DEU", "NLD", "SWE", "ITA"]), :]
ue_cpi_r2_df = introviz.phillips.r2_df(tdf).sort_values("R2", ascending=False)
g = sns.FacetGrid(tdf.reset_index(), col="LOCATION", col_wrap=3, col_order=ue_cpi_r2_df['LOCATION'], height=4, aspect=1.2)
g.map_dataframe(introviz.phillips.facet_xy_plot, "UE", "c_cpi")
g.set_xlabels("Unemployment rate")
g.set_ylabels("Inflation rate")

for l in ue_cpi_r2_df['LOCATION'].values:
    ax = g.axes_dict[l]
    ax.set_title(introviz.phillips.facet_xy_plot_label(tdf, ue_cpi_r2_df, l))
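
The helper `introviz.phillips.r2_df` is not shown in this document. A rough stand-in for the idea of ordering facets by goodness of fit might look like the sketch below. It assumes a flat table with the column names used above (`LOCATION`, `UE`, `c_cpi`) and a simple linear fit, for which R² equals the squared Pearson correlation. The function name `r2_by_location` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def r2_by_location(df, x="UE", y="c_cpi", group="LOCATION"):
    """Compute R^2 of a simple linear fit of y on x per group, sorted descending.

    For a one-variable linear regression, R^2 is the squared Pearson correlation.
    """
    rows = []
    for location, sub in df.groupby(group):
        r = np.corrcoef(sub[x], sub[y])[0, 1]
        rows.append({group: location, "R2": r ** 2})
    return pd.DataFrame(rows).sort_values("R2", ascending=False).reset_index(drop=True)


# Toy data, invented for illustration: "AAA" is perfectly linear, "BBB" is not
toy = pd.DataFrame({
    "LOCATION": ["AAA"] * 4 + ["BBB"] * 4,
    "UE": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "c_cpi": [4.0, 3.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
})

r2 = r2_by_location(toy)
print(r2)
```

Passing the resulting `LOCATION` column as `col_order` to `sns.FacetGrid` places the best-fitting panels first, which makes the grid easier to scan.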

# **Case Study** Mortality in France 2000–2021

[Baptiste Coulmont](https://coulmont.com), a sociologist in Paris, produced some stunning visualizations of all-cause mortality in France over the period 2000–2021. The first one appeared on his [blog](https://coulmont.com/blog/2020/04/24/2020-une-mortalite-specifique/), and he posts updated versions on his [twitter feed](https://twitter.com/coulmont/status/1377966826517372928).
As a case study for applying these contextualization techniques, we will use the same data and explore the design space.

# References

<div style="display: flex; flex-direction: row;  justify-content: space-around">

<div>

<h2>Edward Tufte</h2>
<ul>
<li><a href="https://www.amazon.com/Visual-Display-Quantitative-Information/dp/0961392142/">Visual Display of Quantitative Information</a></li>
<li><a href="https://www.amazon.com/Envisioning-Information-Edward-R-Tufte/dp/0961392118/">Envisioning Information</a></li>
<li><a href="https://www.amazon.com/Visual-Explanations-Quantities-Evidence-Narrative/dp/0961392126/">Visual Explanations</a></li>
</ul>

</div>

<div>

<h2>Online</h2>
<ul>
<li><a href="https://magrawala.github.io/cs448b-fa17/">Maneesh Agrawala’s Visualization Course</a></li>
<li><a href="https://courses.cs.washington.edu/courses/cse442/17au/">Jeffrey Heer’s Visualization Course</a></li>
<li><a href="https://www.tableau.com/sites/default/files/media/designing-great-visualizations.pdf">Jock Mackinlay’s Designing Great Visualizations</a></li>
</ul>

</div>
</div>



<!-- * Müller-Brockmann -->