(communicate-plots)=
# Graphics for Communication

## Introduction

In this chapter, you'll learn about using visualisation to communicate.

In {ref}`exploratory-data-analysis`, you learned how to use plots as tools for *exploration*.
When you make exploratory plots, you know—even before looking—which variables the plot will display.
You made each plot for a purpose, quickly looked at it, and then moved on to the next plot.
In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.

Now that you understand your data, you need to *communicate* your understanding to others.
Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you'll learn some of the tools that **lets-plot** provides to help you make charts that tell a story.

### Prerequisites

As ever, there are a plethora of options (and packages) for data visualisation using code. We're focusing on the declarative, "grammar of graphics" approach using **lets-plot** here, but advanced users looking for more complex graphics might wish to use an imperative library such as the excellent **matplotlib**. You should have both **lets-plot** and **pandas** installed. Once you have them installed, import them like so:

In [None]:
# remove-cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

In [None]:
from lets_plot import *
from lets_plot.mapping import as_discrete
import pandas as pd
import numpy as np

LetsPlot.setup_html()

## Labels, titles, and other contextual information

The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. Let's look at an example using the MPG (miles per gallon) data, which covers the fuel economy for 38 popular models of cars from 1999 to 2008.

In [None]:
# load the data
mpg = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv", index_col=0
)

We want to show how fuel efficiency on the highway changes with engine displacement, in litres. The most basic chart we can make with these variables is:

In [None]:
(ggplot(mpg, aes(x="displ", y="hwy")) + geom_point())

Now we're going to add lots of extra useful information that will make the chart better. The purpose of a plot title is to summarize the main finding.
Avoid titles that just describe what the plot is, e.g., "A scatterplot of engine displacement vs. fuel economy".

We're going to:

- add a title that summarises the main finding you'd like the viewer to take away (as opposed to one just describing the obvious!)
- add a subtitle that provides more info on the y-axis, and make the x-label more understandable
- remove the y-axis label that is at an awkward viewing angle
- add a caption with the source of the data

Putting this all in, we get:

In [None]:
(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="class"))
    + geom_smooth(se=False, method="loess", size=1)
    + labs(
        title="Fuel efficiency generally decreases with engine size",
        subtitle="Highway fuel efficiency (miles per gallon)",
        caption="Source: fueleconomy.gov",
        y="",
        x="Engine displacement (litres)",
    )
)

This is much clearer. It's easier to read, we know where the data come from, and we can see *why* we're being shown it too.

But maybe we want a different message? You can flex depending on your needs, and some people prefer to have a rotated y-axis so that the subtitle can provide even more context:

In [None]:
(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="class"))
    + geom_smooth(se=False, method="loess", size=1)
    + labs(
        x="Engine displacement (L)",
        y="Highway fuel economy (mpg)",
        color="Car type",
        title="Fuel efficiency generally decreases with engine size",
        subtitle="Two seaters (sports cars) are an exception because of their light weight",
        caption="Source: fueleconomy.gov",
    )
)

### Exercises

1.  Create one plot on the fuel economy data with customized `title`, `subtitle`, `caption`, `x`, `y`, and `color` labels.

2.  Recreate the following plot using the fuel economy data.
    Note that both the colors and shapes of points vary by type of drive train.

In [None]:
(
    ggplot(mpg, aes(x="cty", y="hwy", color="drv", shape="drv"))
    + geom_point()
    + labs(
        x="City MPG",
        y="Highway MPG",
        shape="Type of\ndrive train",
        color="Type of\ndrive train",
    )
)

3.  Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.

## Annotations

In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations.
The first tool you have at your disposal is `geom_text()`.
`geom_text()` is similar to `geom_point()`, but it has an additional aesthetic: `label`.
This makes it possible to add textual labels to your plots.

There are two possible sources of labels: ones that are part of the data, which we'll add with `geom_text`; and ones that we add directly and manually as annotations using `geom_label`.

In the first case, you might have a dataframe that contains labels.
In the following plot, we compute the mean highway mileage (`hwy`) and the mean engine displacement (`displ`) for each drive type (`drv`) and save the result as a new data frame called `label_info`. These group means give the positions of the labels, but we could use any aggregation that we feel would work well on the chart.

In [None]:
mapping = {
    "4": "4-wheel drive",
    "f": "front-wheel drive",
    "r": "rear-wheel drive",
}
label_info = (
    mpg.groupby("drv")
    .agg({"hwy": "mean", "displ": "mean"})
    .reset_index()
    .assign(drive_type=lambda x: x["drv"].map(mapping))
    .round(2)
)
label_info
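
Any aggregation that returns one row per group can supply label positions. As a rough sketch, using a small made-up data frame (the `cars` data below is hypothetical, not the real `mpg` data), here are two alternatives to the group mean: the group median, and the observation with the largest engine in each group.

```python
import pandas as pd

# Hypothetical stand-in for the mpg data: a handful of made-up rows.
cars = pd.DataFrame(
    {
        "drv": ["4", "4", "f", "f", "r", "r"],
        "displ": [2.0, 4.7, 1.8, 3.0, 5.4, 6.2],
        "hwy": [26, 17, 29, 24, 20, 25],
    }
)

# Option 1: place each label at the median position of its group.
median_labels = cars.groupby("drv")[["displ", "hwy"]].median().reset_index()

# Option 2: place each label at the observation with the largest engine.
largest_engine = cars.loc[cars.groupby("drv")["displ"].idxmax()].reset_index(drop=True)

print(median_labels)
print(largest_engine)
```

Either result has the same shape as `label_info` (one row per drive type), so it could be passed to a text layer in the same way. The rest of this section continues with `label_info`.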

Then, we use this new data frame to label the three groups directly on the plot, replacing the legend. The `fontface` and `size` arguments customise the look of the text labels: they're larger than the rest of the text on the plot, and bold. (`theme(legend_position="none")` turns all the legends off; we'll talk about it more shortly.)

In [None]:
(
    ggplot(mpg, aes(x="displ", y="hwy", color="drv"))
    + geom_point(alpha=0.5)
    + geom_smooth(se=False, method="loess")
    + geom_text(
        aes(x="displ", y="hwy", label="drive_type"),
        data=label_info,
        fontface="bold",
        size=8,
        hjust="left",
        vjust="bottom",
    )
    + theme(legend_position="none")
)

Note the use of `hjust` (horizontal justification) and `vjust` (vertical justification) to control the alignment of the label.


The second of the two methods we're looking at is `geom_label`. This has two modes: in the first, it works like `geom_text` but with a box around the text, like so:

In [None]:
potential_outliers = mpg.query("hwy > 40 | (hwy > 20 & displ > 5)")
(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(color="black")
    + geom_smooth(se=False, method="loess", color="black")
    + geom_point(
        data=potential_outliers,
        color="red",
    )
    + geom_label(
        aes(label="model"),
        data=potential_outliers,
        color="red",
        position=position_jitter(),
        fontface="bold",
        size=5,
        hjust="left",
        vjust="bottom",
    )
    + theme(legend_position="none")
)

The second mode is generally useful for adding one or several free-standing annotations to a plot, with the position and text passed directly as arguments rather than mapped from data, like so:

In [None]:
import textwrap

# wrap the text so it is over multiple lines:
trend_text = textwrap.fill("Larger engine sizes tend to have lower fuel economy.", 30)
trend_text
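
`textwrap.fill()` breaks a string at word boundaries so that no line is longer than the given width, and joins the lines with newline characters. A short sketch of how the width changes the result, using only the standard library (the widths 20 and 60 are arbitrary choices for comparison):

```python
import textwrap

message = "Larger engine sizes tend to have lower fuel economy."

# fill() returns a single string with "\n" inserted at word boundaries.
wrapped = textwrap.fill(message, width=30)
print(repr(wrapped))

# wrap() returns the same lines as a list, which makes the line lengths easy to check.
for width in (20, 30, 60):
    lines = textwrap.wrap(message, width=width)
    print(width, [len(line) for line in lines])
```

A narrower width gives more, shorter lines, which suits annotations squeezed into a small empty region of a chart.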

In [None]:
(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point()
    + geom_label(x=3.5, y=38, label=trend_text, hjust="left", color="red")
    + geom_segment(x=2, y=40, xend=5, yend=25, arrow=arrow(type="closed"), color="red")
)

Annotation is a powerful tool for communicating main takeaways and interesting features of your visualisations. The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!

Remember, in addition to `geom_text()` and `geom_label()`, you have many other geoms in **lets-plot** available to help annotate your plot.
A couple of ideas:

-   Use `geom_hline()` and `geom_vline()` to add reference lines.
    We often make them thick (`size=2`) and grey (`color="gray"`), and draw them underneath the primary data layer.
    That makes them easy to see, without drawing attention away from the data.

-   Use `geom_rect()` to draw a rectangle around points of interest.
    The boundaries of the rectangle are defined by aesthetics `xmin`, `xmax`, `ymin`, `ymax`.

-   You already saw the use of `geom_segment()` with the `arrow` argument to draw attention to a point with an arrow.
    Use aesthetics `x` and `y` to define the starting location, and `xend` and `yend` to define the end location.


### Exercises

1.  Use `geom_text()` with infinite positions to place text at the four corners of the plot.

2.  Use `geom_label()` to add a point geom in the middle of your last plot without having to create a dataframe.
    Customise the shape, size, or color of the point.

3.  How do labels with `geom_text()` interact with faceting?
    How can you add a label to a single facet?
    How can you put a different label in each facet?
    (Hint: Think about the dataset that is being passed to `geom_text()`.)

4.  What arguments to `geom_label()` control the appearance of the background box?

5.  What are the four arguments to `arrow()`?
    How do they work?
    Create a series of plots that demonstrate the most important options.


## Scales

Another way you can make your plot better for communication is to adjust the scales.
Scales control how the aesthetic mappings manifest visually.

### Default scales

Normally, **lets-plot** automatically adds scales for you and you don't need to worry about them. For example, when you type:

```python
(
    ggplot(mpg, aes(x="displ", y="hwy")) +
    geom_point(aes(color="class"))
)
```

**lets-plot** is automatically doing this behind the scenes:

```python
(
    ggplot(mpg, aes(x="displ", y="hwy")) +
    geom_point(aes(color="class")) +
    scale_x_continuous() +
    scale_y_continuous() +
    scale_color_discrete()
)
```

Note the naming scheme for scales: `scale_` followed by the name of the aesthetic, then `_`, then the name of the scale.
The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date.
`scale_x_continuous()` puts the numeric values from `displ` on a continuous number line on the x-axis, `scale_color_discrete()` chooses a color for each `class` of car, etc.
There are lots of non-default scales which you'll learn about below.

The default scales have been carefully chosen to do a good job for a wide range of inputs.
Nevertheless, you might want to override the defaults for two reasons:

-   You might want to tweak some of the parameters of the default scale.
    This allows you to do things like change the breaks on the axes, or the key labels on the legend.

-   You might want to replace the scale altogether, and use a completely different algorithm.
    Often you can do better than the default because you know more about the data.


### Axis ticks and legend keys

Collectively axes and legends get the somewhat confusing name **guides** in **lets-plot**. Axes are used for x and y aesthetics; legends are used for everything else.

There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: `breaks` and `labels`.
Breaks controls the position of the ticks, or the values associated with the keys. If you like, the breaks *are* the ticks.
Labels controls the text label associated with each tick/key. We might more accurately call these *tick labels*.
The most common use of `breaks` is to override the default choice:


In [None]:
(
    ggplot(mpg, aes(x="displ", y="hwy", color="drv"))
    + geom_point()
    + scale_y_continuous(breaks=np.arange(15, 40, step=5))
)

You can use `labels` in the same way (i.e., pass in an array or list of strings the same length as `breaks`). To remove the tick labels altogether, though, you would have to use a theme, a topic we'll return to later.
You can also use `breaks` and `labels` to control the appearance of legends.
For discrete scales for categorical variables, `labels` can be a list of the desired labels or, in recent versions of **lets-plot**, a dictionary that maps the existing level names to the desired labels.


In [None]:
(
    ggplot(mpg, aes(x="displ", y="hwy", color="drv"))
    + geom_point()
    + scale_color_discrete(labels=["4-wheel", "front", "rear"])
)

To change the formatting of the tick labels, use the `format=` keyword argument. This is useful to render currencies, percentages, and so on—though it's often easier for the reader to just see this symbol once in the axis label.

In the example below, we read in the `diamonds` dataset and then format the price axis with `format="$.2s"`; let's break this down:

- the dollar sign says put a dollar sign in front of every number
- the .2 says use two significant digits
- the s says use Système International (SI) prefixes, such as k for thousands

There are a wealth of alternative options for formatting—it's best to use the [helpful page on formatting](https://lets-plot.org/pages/formats.html) in the documentation of **lets-plot** to find out more.

In [None]:
diamonds = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv"
)
diamonds["cut"] = diamonds["cut"].astype(
    pd.CategoricalDtype(
        categories=["Fair", "Good", "Very Good", "Premium", "Ideal"], ordered=True
    )
)
diamonds["color"] = diamonds["color"].astype(
    pd.CategoricalDtype(categories=["D", "E", "F", "G", "H", "I", "J"], ordered=True)
)

In [None]:
(
    ggplot(diamonds, aes(x="cut", y="price"))
    + geom_boxplot()
    + coord_flip()
    + scale_y_continuous(format="$.2s", breaks=np.arange(0, 19000, step=6000))
)

To double check this works, let's use the terminal. We'll try the command `ls`, which lists everything in the current directory, and `grep "\.svg$"` to pull out any files that end in `.svg` from what is returned by `ls`. The pattern is quoted so that the shell passes it to `grep` unchanged instead of expanding it into file names. These are strung together as commands by a `|`. (Note that the leading exclamation mark below just tells the software that builds this book to use the terminal.)

In [None]:
!ls | grep "\.svg$"

In [None]:
# remove-cell
import os

os.remove("output_chart.svg")