# STA 141A Fundamentals of Statistical Data Science

### Lecture 8, 26/10/23, Plotting with `ggplot2` 2/2

### Announcements

- 

### Today's topics

- Facets
- Coordinate systems
- Annotations
- Saving
- Maps


### Facets

We have learned to plot several geometric objects on the same panel by adding several `geom_` functions to the same `ggplot` call. Another approach, particularly useful for categorical variables that do not have many levels, is to split the plot into subplots, *facets*, each of which displays one subset of the data.
`facet_grid` forms a matrix of panels defined by row and column faceting variables. 
It is most useful when you have two discrete variables, and all combinations of the variables exist in the data. If you have only one variable with many levels, you can use `facet_wrap`.

In [None]:
require(ggplot2)
library(repr); options(repr.plot.width = 12, repr.plot.height = 12) # only necessary for jupyter, not R Studio

In [None]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
    geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

In [None]:
options(repr.plot.width = 16) # only necessary for jupyter, not R Studio
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
    geom_smooth(mapping = aes(x = displ, y = hwy, color = drv)) + 
    facet_wrap(~ drv)

`facet_grid` creates a matrix of panels, some of which may be empty.
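
Which panels will be empty can be checked before plotting. The following is a minimal sketch that cross-tabulates the two faceting variables of `mpg`; the names `combos` and `n_empty` are only illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Cross-tabulate the two faceting variables: rows are drv, columns are class.
combos <- table(mpg$drv, mpg$class)
combos

# A zero count means the corresponding facet_grid(drv ~ class) panel is empty.
n_empty <- sum(combos == 0)
n_empty
```

If many cells are zero, `facet_wrap` is often the more compact choice because it only draws panels for combinations that occur in the data.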

In [None]:
require(dplyr)
mpg %>% head()

In [None]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ class)

In [None]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(. ~ class)

In [None]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~ class, nrow = 1)

`facet_wrap` can also display all non-empty combinations of `drv` and `class`:

In [None]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(drv ~ class, nrow = 3)
options(repr.plot.width = 6) # only necessary for jupyter, not R Studio

### Coordinate systems 

We have already seen how to restrict the plotted range with `xlim` and `ylim`. The `coord_` functions control the coordinate system itself, for example the relative length of the axes.

In [None]:
g <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
    geom_point()
g

In [None]:
g + coord_fixed() # the units on both axes are of same length / not useful here

In [None]:
g + coord_flip() # flip x and y axes

In [None]:
g + xlim(0, 10) + ylim(10, 50)

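If we reduce the plot so that only a subset of observations is shown, we have to be more careful. `xlim`, `ylim` and the `limits` argument of the scale functions remove observations outside the range (with a warning) before any statistics are computed, whereas `coord_cartesian` keeps all data and only zooms in on the panel.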

In [None]:
g + xlim(3, 6) + ylim(10, 40) # throws warnings

In [None]:
g + coord_cartesian(xlim = c(3, 6), ylim = c(10, 40)) # does not throw warnings

In [None]:
g + scale_x_continuous(limits = c(3,6)) + 
    scale_y_continuous(limits = c(10, 40)) # does throw warnings
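
The difference matters as soon as a statistic is computed from the data. The sketch below is an illustration, not part of the lecture code: it counts the observations outside the window used above and compares a linear fit on the retained points with a fit on all points. The names `outside`, `p_limits`, `p_zoom`, `fit_subset` and `fit_all` are made up for this example.

```r
library(ggplot2)

# Points that fall outside the window displ in [3, 6], hwy in [10, 40].
outside <- with(mpg, displ < 3 | displ > 6 | hwy < 10 | hwy > 40)
sum(outside)  # number of observations that xlim()/ylim() would remove

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

# Scale limits: the regression line is fitted to the retained points only.
p_limits <- p + xlim(3, 6) + ylim(10, 40)

# Coordinate limits: the line is fitted to all points, then the view is zoomed.
p_zoom <- p + coord_cartesian(xlim = c(3, 6), ylim = c(10, 40))

# The same two fits computed directly; they use different subsets of the data.
fit_subset <- coef(lm(hwy ~ displ, data = mpg[!outside, ]))
fit_all <- coef(lm(hwy ~ displ, data = mpg))
fit_subset
fit_all
```

Printing `p_limits` and `p_zoom` side by side shows two different regression lines over the same window.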

### Annotations

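Plots can be annotated in several ways: `labs` sets the title, subtitle, caption, axis labels and legend title, and text geoms such as `geom_text` label individual observations.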

In [None]:
options(repr.plot.width = 16) # only necessary for jupyter, not R Studio
g <- ggplot(mpg, aes(displ, hwy)) +
    geom_point(aes(color = class), size = 4, alpha = 0.5) +
    geom_smooth(se = FALSE, method = 'lm', linewidth = 2) + # linewidth replaces size for lines since ggplot2 3.4.0
    labs(title = "Fuel efficiency generally decreases with engine size",
         subtitle = "Two seaters (sports cars) are an exception because of their light weight",
         x = "Engine displacement (in litres)",
         y = "Highway miles per gallon",
         caption = "Data from fueleconomy.gov",
         color = "Car type")
g

In [None]:
library("dplyr")
best_in_class <- mpg %>%
    group_by(class) %>%
    filter(row_number(desc(hwy)) == 1)
best_in_class

In [None]:
g + geom_text(aes(label = model), data = best_in_class)

In [None]:
library("ggrepel") # install.packages("ggrepel")

In [None]:
g + ggrepel::geom_label_repel(aes(label = model), data = best_in_class) +
    geom_point(data = best_in_class, size = 5, color = 'red')

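The `theme` function controls the non-data elements of the plot, such as the background, grid lines and fonts. The default grey background should be replaced, for instance by one of the complete themes below, if the analysis is to be printed.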

In [None]:
g + theme_minimal()

In [None]:
g + theme_bw()

In [None]:
g + theme_void()

In [None]:
g + theme_classic()

The legend can be modified in several ways, for example by changing its position through `theme`.

In [None]:
g + theme(legend.position = "bottom")
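
Two further options are sketched below on a simplified version of the plot: removing the legend and changing the layout of its keys with `guides` and `guide_legend`. The object names `p`, `p_none` and `p_bottom` are chosen only for this example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
    geom_point(aes(color = class), alpha = 0.5)

# Remove the legend entirely.
p_none <- p + theme(legend.position = "none")

# Put the legend at the bottom, in a single row, with larger and opaque keys.
p_bottom <- p +
    theme(legend.position = "bottom") +
    guides(color = guide_legend(nrow = 1,
                                override.aes = list(size = 4, alpha = 1)))
p_bottom
```

`override.aes` changes only how the legend keys are drawn; the points in the panel keep their original size and transparency.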

### Saving 

Plots can be exported to <kbd>pdf</kbd> or <kbd>png</kbd> via `ggsave`, which by default saves the last plot displayed and infers the file type from the file extension.

In [None]:
ggsave("../source/09-scatterplot.pdf", width = 6, height = 6, scale = 1)

In [None]:
ggsave("../source/09-scatterplot.png", width = 6, height = 6, scale = 1)
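
Relying on the last displayed plot can be fragile in longer scripts. As a minimal sketch, the plot can instead be passed explicitly through the `plot` argument; the temporary output path and the file name below are assumptions made for this example only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output location: a temporary file instead of a project path.
out_file <- file.path(tempdir(), "scatterplot-sketch.pdf")

# Pass the plot explicitly; width and height are in inches by default.
ggsave(out_file, plot = p, width = 6, height = 6)

file.exists(out_file)
```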

### Maps

We will now see how to create and color maps to display spatial data. First, load the `raster` package.

In [None]:
#install.packages("raster") # takes some time
library("raster")
library("dplyr")

In [None]:
USA = getData("GADM", country = "USA", level = 2) # downloads the level 2 (county) boundaries from gadm.org

`USA` is an `S4` object; its slots can be accessed with `@`:

In [None]:
str(USA@data)

In [None]:
USA@data

The polygons used to draw the boundaries are stored in the `@polygons` slot:

In [None]:
str(USA@polygons)

To plot the map with `geom_polygon`, we first transform the object into a <kbd>data.frame</kbd> using `fortify`. Before doing so, we determine the row ids of the Californian counties so that the result can be filtered.

In [None]:
californiaCountiesID = USA@data$NAME_1 == "California"

In [None]:
californiaCountiesID

In [None]:
californiaCountiesID = as.numeric(rownames(USA@data)[californiaCountiesID])

In [None]:
californiaCountiesID

In [None]:
pop <- read.table("../data/population.txt", sep = ";", header = TRUE)
pop$Population <- log(pop$Population)
pop$id <- as.character(pop$id)

In [None]:
California <- fortify(USA) %>% 
  filter(id %in% californiaCountiesID) %>% 
  left_join(pop, by = "id") # join on the shared id column explicitly
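
To see what `geom_polygon` needs without downloading any data, the sketch below builds a toy data frame in the shape that `fortify` returns for polygons (one row per vertex, with `long`, `lat`, `order`, `id` and `group`) and attaches a value per region. All coordinates, ids and values are made up, and base `merge` is used here in place of `left_join` to keep the example free of extra packages.

```r
library(ggplot2)

# Toy stand-in for the output of fortify(): one row per polygon vertex.
# The coordinates and ids are made up for illustration.
shapes <- data.frame(
    long  = c(0, 1, 1, 0,  1, 2, 2, 1),
    lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
    order = 1:8,
    id    = rep(c("1", "2"), each = 4),
    group = rep(c("1.1", "2.1"), each = 4)
)

# Toy stand-in for the population table: one row per region id.
values <- data.frame(id = c("1", "2"), Population = c(10.5, 12.1))

# Join by the shared key so that every vertex carries its region's value.
toy_map <- merge(shapes, values, by = "id")
toy_map <- toy_map[order(toy_map$order), ]  # restore the vertex order

p_map <- ggplot(toy_map, aes(long, lat, group = group, fill = Population)) +
    geom_polygon() +
    coord_equal()
p_map
```

The `group` aesthetic tells `geom_polygon` which vertices belong to the same polygon, and the vertex order determines how the outline is drawn.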

In [None]:
require(viridis)
g <- ggplot(California, aes(long, lat, group = group, fill = Population)) +
    geom_polygon() + scale_fill_viridis() + 
    theme_void() + coord_equal() + theme(legend.position = "none") + 
    labs(title = "Population in California",
         subtitle = "Log-population per county",
         caption = "Data from gadm.org and california-demographics.com")
g

### Exercise 

On the plot of California, add the names of the three counties with the largest populations and the three counties with the smallest populations.