# ggplot2 - Scales And Coordinates

## Scale

The first thing we’re going to look at is scaling our data.  Here’s a plot showing the relationship between GDP per capita and mean life expectancy:

In [None]:
if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}
 
if(!require(gapminder)) {
    install.packages("gapminder", repos = "http://cran.us.r-project.org")
    library(gapminder)
}

In [None]:
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE)

One problem with this visual is that we are making our users think a little too much.  People have trouble thinking in logarithmic terms.  If I tell you that the natural log of a value is 8.29 (R’s `log()` is base e by default), you probably won’t know that the value is 3983.83 without busting out a calculator.  But that’s what I’m making people do with this chart.  So let’s fix that with a scale.
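
Here is a quick check of that arithmetic.  It is only a sketch:  the variable names are mine, and it assumes the plot's `log(gdpPercap)` call, where R's `log()` uses base e unless a base is supplied.

```r
# The value a reader would see on the X axis of the first plot.
log_value <- 8.29

# Undo the natural log to get back to GDP per capita in dollars.
original_value <- exp(log_value)
original_value            # a little under 4,000

# The same amount expressed on a base-10 log scale.
log10(original_value)     # about 3.6
```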

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10()

By adding one line of code, we changed the scale on the X axis from linear to logarithmic in base 10.  That gives us numbers on the X axis that we can immediately understand:  1e+04, or \$10,000.  But, uh, maybe I want to see \$10,000 instead of 1e+04?  Fortunately, there is a `labels` parameter on the scale (R’s partial argument matching also accepts `label`) that lets us control how the axis labels are formatted.  The [scales](https://cran.r-project.org/web/packages/scales/scales.pdf) package in R (which ggplot2 itself depends on) gives us a set of pre-packaged label formatters, including USD and other currency formats.  This is what the call looks like:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_log10(labels = scales::dollar)

And now we have a version where my users don’t have to think hard about what those values mean.
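
The formatter handed to the scale is an ordinary function, so you can try it on a plain vector before putting it on an axis.  A small sketch, assuming the scales package is installed; `gdp_values` and `dollar_labels` are illustrative names.

```r
library(scales)

# A few axis values of the sort the log10 scale produces.
gdp_values <- c(1000, 10000, 100000)

# Scientific notation, similar to the default labels on the log scale.
format(gdp_values, scientific = TRUE)

# The same values run through the dollar formatter used in the plot.
dollar_labels <- dollar(gdp_values)
dollar_labels   # "$1,000" "$10,000" "$100,000"

# Another pre-packaged formatter: thousands separators without a currency.
comma(gdp_values)
```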

## Going Deeper With Scale
To understand the ggplot scale better, let’s take a look at [what functions are available to us](http://ggplot2.tidyverse.org/reference/#section-scales).

The quick summary is that there are two parts of most scale functions.  The first part describes **what** we want to scale, and the second part describes **how** we want to scale it.

First, the whats:

1. alpha — Using alpha transparency levels to differentiate categories
2. color — Using a color scale as a way to differentiate categories
3. fill — Using a color fill as a way of describing a variable
4. linetype — Using the line type (e.g., solid line, dotted line, dashed line) to differentiate categories
5. shape — Using a shape (e.g., circle, triangle, square) to differentiate categories
6. size — Using the size of a shape to differentiate categories
7. x — Change the scale of the X axis
8. y — Change the scale of the Y axis

Next, the hows, which I’ll break up into two categories.  The first category is the “differentiation” hows, which handle alpha, color, fill, linetype, shape, and size:

1. Continuous
2. Discrete
3. Brewer
4. Distiller
5. Gradient / Gradient2 / Gradientn
6. Grey
7. Hue
8. Identity
9. Manual

And here are the X-Y hows:

1. Continuous
2. Discrete
3. Log10
4. Reverse
5. Sqrt
6. Date / Time / Datetime

There are a few scale functions which don’t fit this pattern (scale_radius) and a couple which have “default” how values (scale_alpha, scale_size, scale_shape).  Also, not all whats intersect with hows:  for example, there is no scale_size_hue, and scale_shape_continuous exists only to raise an error, because those combinations don’t make sense.
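
One way to explore the what/how naming pattern is to list the scale functions that ggplot2 exports and split their names apart.  This is a rough sketch, assuming ggplot2 is installed; `scale_fns`, `whats`, and `hows` are my own variable names, and the exact list depends on your ggplot2 version.

```r
library(ggplot2)

# Every exported ggplot2 function whose name starts with "scale_".
scale_fns <- grep("^scale_", getNamespaceExports("ggplot2"), value = TRUE)

# Split each name into pieces: "scale", the what, and (usually) the how.
parts <- strsplit(scale_fns, "_", fixed = TRUE)
whats <- vapply(parts, function(p) p[2], character(1))
hows  <- vapply(parts, function(p) paste(p[-(1:2)], collapse = "_"), character(1))

# The hows available for the X axis in the installed version.
sort(hows[whats == "x"])

# Check whether particular what/how combinations exist.
c("scale_x_log10", "scale_size_hue") %in% scale_fns
```

Functions with a "default" how, such as scale_alpha, show up here with an empty how.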

Now let’s dig into these a bit more and see what we can find.

## X and Y Scales Galore
We’ve already seen scale_x_log10(), which converts the X axis to a base-10 logarithmic scale.  It turns out that this is just [scale_x_continuous() with a transformation applied](http://ggplot2.tidyverse.org/reference/scale_continuous.html).  So we can re-write it as:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE) +
    # In ggplot2 3.5.0 and later, `trans` is deprecated in favor of `transform`.
    scale_x_continuous(trans = "log10", labels = scales::dollar)

There are approximately 15 transformations and you can build your own if you’d like.  For the most part, however, you’re probably going to use the base scale or one of the most common transformations, which have their own functions:  scale_x_log10, scale_x_sqrt, and scale_x_reverse.
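
To give a feel for what a transformation is, here is a sketch using the scales package directly.  A transformation object bundles a forward function with its inverse.  The cube-root transformation is a hypothetical example of my own, not something from ggplot2, and I am assuming the long-standing `log10_trans()` and `trans_new()` helpers, which recent scales releases also expose under `transform_log10()` and `new_transform()`.

```r
library(scales)

# A built-in transformation: forward is log10, inverse is 10^x.
log10_tr <- log10_trans()
log10_tr$transform(1000)   # about 3
log10_tr$inverse(3)        # about 1000

# A hypothetical custom transformation (illustrative only): cube root.
cube_root_tr <- trans_new(
  name      = "cube_root",
  transform = function(x) sign(x) * abs(x)^(1 / 3),
  inverse   = function(x) x^3
)

# Going forward and then back should return the starting value.
round_trip <- cube_root_tr$inverse(cube_root_tr$transform(27))
round_trip
```

An object like `cube_root_tr` can then be passed to the transformation argument of a continuous scale, in the same spot where `"log10"` appears above.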

We can also handle dates and times in ggplot2, but I don’t cover them directly in this talk.  You can see an example in the [original blog post](https://36chambers.wordpress.com/2018/02/01/ggplot-basics-scales-and-coordinates/).

### Brewing Colors
Another area of intrigue is coloration.  ggplot2 will give you some colors by default, but you may not want to use them.  You can specify your own colors if you’d like, or you can ask [Color Brewer](http://colorbrewer2.org/) for help.  For example, suppose I want to segment my gapminder data by continent, displaying each continent as a different color.  I can use the scale_color_brewer function to generate an appropriate set of colors for me, and it adds just one more line of code.

This function has two important parameters:  the type of data and the palette you wish to use.  By default, ggplot2 assumes that you’re sending sequential data.  You can also tell it that you are graphing divergent data (commonly seen in two-party electoral maps as the percentage margin of victory for each candidate) or that you have qualitative data (typically unordered categorical data).  In this case, I’m going to show all three even though the data doesn’t really fit two of them.
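
Before the plots, here is a quick way to check which named palettes belong to each type.  This sketch assumes the RColorBrewer package, which gets installed alongside ggplot2 as a dependency of scales; `palettes_by_type` and `dark2_colors` are my own names.  One caveat from the ggplot2 documentation as I read it:  when `palette` is given by name, the name picks the palette, and `type` only matters when `palette` is a number.

```r
library(RColorBrewer)

# brewer.pal.info has one row per named palette and a `category` column
# holding "seq", "div", or "qual".
palettes_by_type <- split(rownames(brewer.pal.info), brewer.pal.info$category)
palettes_by_type

# Five hex colors from a qualitative palette, one per continent.
dark2_colors <- brewer.pal(n = 5, name = "Dark2")
dark2_colors
```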

First up, here’s a sequential palette of various greens, starting from the lightest green and going darker based on continent name.

In [None]:
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(mapping = aes(color = continent)) +
    scale_color_brewer(type = "seq", palette = "Greens") +
    geom_smooth(method = "lm", se = FALSE)

Next up, here’s what it looks like if you use a divergent color scheme, where names closer to A have shades of orange and names closer to Z have shades of purple.

In [None]:
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(mapping = aes(color = continent)) +
    scale_color_brewer(type = "div", palette = "PuOr") +
    geom_smooth(method = "lm", se = FALSE)

Finally, we have a qualitative color scheme, which actually matches our data.  Continent is an unordered category rather than a continuous or ordered variable, so we’d want five different and unique colors to show our results.  Note that I’ve re-introduced alpha values here because these are solid colors and I want to be able to see some amount of interplay:

In [None]:
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
    geom_point(alpha = 0.5, mapping = aes(color = continent)) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    geom_smooth(method = "lm", se = FALSE)

Note that all of this above uses the `scale_color_brewer` function because we’re colorizing points.  If you want to colorize a bar graph or some other 2D structure, you’ll want to use `scale_fill_brewer` to colorize the filled-in portion and `scale_color_brewer` if for some reason you’d like the outline to be a different color.

For example, here is a column chart of life expectancy by continent in 1952.  I’m setting the color to continent and have set the overall fill to white so you can see the coloration.

In [None]:
lifeExp_by_continent_1952 <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp)) %>%
  select(continent, avg_lifeExp)
 
ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
  geom_col(mapping = aes(color = continent), fill = "White") +
  scale_color_brewer(type = "seq", palette = "Greens")

That doesn’t look like what we intended.  This is more like it:

In [None]:
ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) +
  geom_col(mapping = aes(fill = reorder(continent, desc(avg_lifeExp)))) +
  scale_fill_brewer(type = "seq", palette = "Greens", direction = -1)

Note that I re-used the continent order to define colors in terms of mean life expectancy rather than alphabetically.  I also set the direction parameter on `scale_fill_brewer` to -1, which means to reverse colors.  By default, color brewed results go from light to dark, but here I want them to go from dark to light, so I reversed the direction.
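
This section opened by mentioning that you can specify your own colors.  Here is a minimal sketch of that using `scale_color_manual`.  The color choices and the names `my_continent_colors` and `p_manual` are hypothetical, picked only for illustration.

```r
library(ggplot2)
library(gapminder)

# Hypothetical hand-picked colors, named by continent so the mapping
# does not depend on the order of the factor levels.
my_continent_colors <- c(
  Africa   = "steelblue",
  Americas = "firebrick",
  Asia     = "darkorange",
  Europe   = "forestgreen",
  Oceania  = "purple"
)

p_manual <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(color = continent)) +
  scale_color_manual(values = my_continent_colors) +
  scale_x_log10(labels = scales::dollar)

p_manual
```

Naming the vector elements is a design choice:  an unnamed vector also works, but then the colors are assigned in factor-level order.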

### Shapes And Sizes
You can plot data according to shape and size as well.  As far as shape goes, there are only a few options.  By default, we can have our scatter plot points show continents as different shapes using the following code:

In [None]:
ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(shape = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

We can also choose whether to use solid or hollow shapes with the solid flag on scale_shape:

In [None]:
ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(shape = continent)) +
  scale_shape(solid = FALSE) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

You can also set the size of a point using the `size` attribute, and can use `scale_size` to control the size.  Here’s an example where we increase in size based on continent:

In [None]:
ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5, mapping = aes(size = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

Shapes tend to show up in black-and-white graphs of categorical data—like our continents—and sizes tend to show up with continuous variables.  In fact, when you try to run the code above, you get a warning:  “Using size for a discrete variable is not advised.”  It’s a bad practice, but it is something that you can do if you really want to.
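
For contrast, here is a sketch of size mapped to a continuous variable, using the `pop` column that ships with gapminder.  The `range` values are arbitrary choices of mine, and `gapminder_1997` and `p_size` are illustrative names.

```r
library(ggplot2)
library(gapminder)

# Same 1997 slice as the plots above.
gapminder_1997 <- subset(gapminder, year == 1997)

p_size <- ggplot(data = gapminder_1997, mapping = aes(x = gdpPercap, y = lifeExp)) +
  # Population is continuous, so no warning about discrete sizes here.
  geom_point(alpha = 0.5, mapping = aes(size = pop)) +
  # range sets the smallest and largest point sizes in the plot.
  scale_size_continuous(range = c(1, 10), labels = scales::comma) +
  scale_x_log10(labels = scales::dollar)

p_size
```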

## Coordinates
The other thing I want to cover today is coordinate systems.  [The ggplot2 documentation](http://ggplot2.tidyverse.org/reference/#section-coordinate-systems) shows seven coordinate functions.  There are good reasons to use each, but I’m only going to demonstrate one.  By default, we use the Cartesian coordinate system and ggplot2 sets the viewing space.  This viewing space covers the fullness of your data set and generally is reasonable, though you can change the viewing area using the `xlim` and `ylim` parameters of `coord_cartesian`.
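
As a brief aside on that last point, here is a sketch of changing the viewing area.  The limits are arbitrary, and `base_plot`, `p_zoom`, and `p_clip` are my own names.  The difference worth knowing is that `coord_cartesian` only zooms, while limits set on a scale drop out-of-range data before the trend line is fit.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)

# Zoom the viewing window; geom_smooth still sees every row.
p_zoom <- base_plot + coord_cartesian(ylim = c(50, 85))

# Limits on the scale instead: rows outside 50 to 85 are removed before
# the fit, and ggplot2 warns about them when the plot is drawn.
p_clip <- base_plot + scale_y_continuous(limits = c(50, 85))

p_zoom
```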

The special coordinate system I want to point out is `coord_flip`, which flips the X and Y axes.  This allows us, for example, to turn a column chart into a bar chart.  Taking our life expectancy by continent data, I can create a bar chart, whereas before we’ve been looking at column charts.  I can use `coord_flip` to switch the x and y axes:

In [None]:
ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_col() +
  coord_flip()

Now we have a bar chart.  With `coord_flip()`, we can easily create bar charts like above or Cleveland dot plots, like below.

In [None]:
ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, avg_lifeExp), y = avg_lifeExp)) +
  geom_point(size = 4) +
  coord_flip()
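
In more recent versions of ggplot2 (3.3.0 and later, as far as I know), many geoms work out their orientation from the aesthetics, so a horizontal bar chart can also be had by swapping `x` and `y` directly.  This is a sketch of that alternative, not a replacement for `coord_flip`; the summary table is rebuilt here so the block runs on its own, and `p_horizontal` is my own name.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Same summary as earlier: mean life expectancy by continent in 1952.
lifeExp_by_continent_1952 <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

# The continuous value goes on x and the category on y, with no coord_flip().
p_horizontal <- ggplot(
    data = lifeExp_by_continent_1952,
    mapping = aes(x = avg_lifeExp, y = reorder(continent, avg_lifeExp))
  ) +
  geom_col()

p_horizontal
```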

## Conclusion
In this notebook, we looked at some of the more common scale and coordinate functions in ggplot2.  There are quite a few that I did not cover, but I think this gives us a pretty fair idea of what we can create with this library.  In the next notebook, we will look at labels and annotations.