# ggplot2

The `ggplot2` package (created by Hadley Wickham) offers a powerful graphics language for creating elegant and complex plots. 

`ggplot2` lets you create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner. Grouping can be represented by color, symbol, size, and transparency. Moreover, the creation of trellis plots (i.e., conditioning) is relatively simple. 

To install a suite of useful packages that includes `ggplot2`, run the following in R: `install.packages("tidyverse")`

Other sources:
+ https://ggplot2.tidyverse.org
+ https://github.com/tidyverse/ggplot2/wiki

## Why ggplot2?

<b>Advantages of ggplot2:</b>
+ consistent underlying Grammar of Graphics (http://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448)
+ plot specification at a high level of abstraction
+ very flexible
+ theme system for polishing plot appearance
+ mature and complete graphics system
+ many users, active mailing list

<b>Compared to base graphics, ggplot2:</b>
+ is more verbose for simple / canned graphics
+ is less verbose for complex / custom graphics
+ does not have methods (data should always be in a data.frame)
+ uses a different system for adding plot elements

## What Is The Grammar Of Graphics?

The basic idea is to independently specify plot building blocks and combine them to create just about any kind of graphical display you want. 

Building blocks of a graph include:
+ data
+ aesthetic mapping
+ geometric object
+ statistical transformations
+ scales
+ coordinate system
+ position adjustments
+ faceting
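
As a rough sketch of how these building blocks combine, the snippet below assembles a single plot from the built-in `mtcars` data and names each block in a comment. The variable choices, axis label, and the object names `cars_raw` and `blocks` are illustrative assumptions, not part of the original material.

```r
library(ggplot2)

# Work on a pristine copy of the built-in data set.
cars_raw <- datasets::mtcars

blocks <- ggplot(data = cars_raw,                      # data
                 mapping = aes(x = wt, y = mpg)) +     # aesthetic mapping
  geom_point(position = "jitter") +                    # geometric object + position adjustment
  stat_smooth(method = "lm", formula = y ~ x) +        # statistical transformation
  scale_x_continuous(name = "Weight (1000 lbs)") +     # scale
  coord_cartesian(ylim = c(10, 35)) +                  # coordinate system
  facet_wrap(~ am)                                     # faceting

blocks
```

Each block is a separate call joined with `+`, so any one of them can be swapped out without touching the others.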

In [None]:
# install.packages("tidyverse")
# install.packages("readr")
library(tidyverse)
library(readr)

## `qplot()`

The `qplot()` function (qplot stands for *quick plot*) can be used to create the most common graph types. While it does not expose the full power of `ggplot2`, it can create a very wide range of useful plots. Note that `qplot()` has been deprecated since ggplot2 3.4.0, so prefer `ggplot()` in new code. The format is:

`qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)`

<table style="width:80%">
    <tr>
        <th style="text-align: left; width:30%">Parameter</th><th style="text-align: left; width:70%">Description</th>
    </tr>
    <tr>
        <td style="text-align: left">alpha</td><td style="text-align: left">Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity).</td>
    </tr>
    <tr>
        <td style="text-align: left">color, shape, size, fill</td><td style="text-align: left">Associates the levels of a variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically.</td>
    </tr>
    <tr>
        <td style="text-align: left">data</td><td style="text-align: left">Specifies a data frame.</td>
    </tr>
    <tr>
        <td style="text-align: left">facets</td><td style="text-align: left">Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use rowvar~. or .~colvar.</td>
    </tr>
    <tr>
        <td style="text-align: left">geom</td><td style="text-align: left">Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".</td>
    </tr>
    <tr>
        <td style="text-align: left">main, sub</td><td style="text-align: left">Character vectors specifying the title and subtitle.</td>
    </tr>
    <tr>
        <td style="text-align: left">method, formula</td><td style="text-align: left">If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit.<br />For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables.<br />For method="gam", be sure to load the mgcv package. For method="rlm", load the MASS package.</td>
    </tr>
    <tr>
        <td style="text-align: left">x, y</td><td style="text-align: left">Specifies the variables placed on the horizontal and vertical axis. For univariate plots (for example, histograms), omit y.</td>
    </tr>
    <tr>
        <td style="text-align: left">xlab, ylab</td><td style="text-align: left">Character vectors specifying horizontal and vertical axis labels.</td>
    </tr>
    <tr>
        <td style="text-align: left">xlim, ylim</td><td style="text-align: left">Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively.</td>
    </tr>
</table>

In [None]:
data(mtcars)
print(head(mtcars))

In [None]:
# Create factors with value labels.
mtcars$gear <- factor(mtcars$gear, levels = c(3,4,5), labels = c("3-gears","4-gears","5-gears")) 
mtcars$am <- factor(mtcars$am, levels = c(0,1), labels = c("Automatic","Manual")) 
mtcars$cyl <- factor(mtcars$cyl, levels = c(4,6,8), labels = c("4cyl","6cyl","8cyl")) 

In [None]:
# Kernel density plots for mpg grouped by number of gears (indicated by color).
qplot(mpg, 
      data = mtcars, 
      geom = "density",
      fill = gear, 
      alpha = I(.5),  # I() sets a constant; a bare .5 would be mapped and get its own legend
      main = "Distribution of Gas Mileage", 
      xlab = "Miles Per Gallon", 
      ylab = "Density")

In [None]:
# Scatterplot of mpg vs. hp with smoothed line.
# The option smooth is used to add a smoothed line with its standard error.
qplot(x = hp, 
      y = mpg, 
      data = mtcars, 
      geom = c("point", "smooth"),
      xlab = "Horsepower",
      ylab = "Miles Per Gallon")

In [None]:
# Scatterplot of mpg vs. hp, with points and lines colored by number of cylinders.
# The argument color is used to tell R that we want to color the points by groups.
qplot(x = hp, 
      y = mpg, 
      data = mtcars, 
      color = factor(cyl),
      geom = c("point", "line"),
      xlab = "Horsepower",
      ylab = "Miles Per Gallon")

In [None]:
# Scatterplot of mpg vs. hp for each combination of gears and cylinders.
# In each facet, transmission type is represented by shape and color.
qplot(x = hp,
      y = mpg, 
      data = mtcars, 
      shape = am, 
      color = am, 
      facets = gear~cyl, 
      size = I(3),
      xlab = "Horsepower", 
      ylab="Miles per Gallon")

In [None]:
# Boxplots of mpg by number of gears (observations (points) are overlaid and jittered).
qplot(x = gear, 
      y = mpg, 
      data = mtcars, 
      geom = c("boxplot", "jitter"),
      fill = gear, 
      main = "Mileage by Gear Number",
      xlab = "Gear Number", 
      ylab = "Miles per Gallon")
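
Because `qplot()` is a convenience wrapper, the plots above can also be written with `ggplot()` and explicit layers. The sketch below redraws the horsepower scatterplot with a quadratic regression fit in place of the default smoother. It works on a fresh copy of `mtcars` (called `cars_raw` here purely for illustration), so the factor conversions made earlier do not affect it.

```r
library(ggplot2)

# Pristine copy of the built-in data, independent of earlier modifications.
cars_raw <- datasets::mtcars

smooth_plot <- ggplot(cars_raw, aes(x = hp, y = mpg)) +
  geom_point() +
  # method and formula belong to the smooth layer only; the formula
  # refers to the aesthetics x and y, not to the column names hp and mpg.
  geom_smooth(method = "lm", formula = y ~ poly(x, 2)) +
  labs(x = "Horsepower", y = "Miles Per Gallon")

smooth_plot
```

Writing the layers out makes it clear which arguments apply to which geom, something the single `qplot()` call hides.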

## `aes()`

Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. Aesthetic mappings can be set in `ggplot()` and in individual layers.

`aes()` is a quoting function. This means that its inputs are quoted to be evaluated in the context of the data. This makes it easy to work with variables from the data frame because you can name those directly.

Examples include:
+ position (i.e., on the x and y axes)
+ color (“outside” color)
+ fill (“inside” color)
+ shape (of points)
+ linetype
+ size
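
Mapping an aesthetic to a variable inside `aes()` is different from setting it to a constant outside `aes()`, and the two are easy to confuse. A minimal sketch using the built-in `mtcars` data (the variable choices and object names are illustrative):

```r
library(ggplot2)

cars_raw <- datasets::mtcars

# Mapped: color varies with the data, and a legend is drawn.
mapped <- ggplot(cars_raw, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# Set: every point gets the same fixed color, and no legend is drawn.
constant <- ggplot(cars_raw, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue")

mapped
constant
```

Because `aes()` quotes its inputs, `cyl` is looked up in `cars_raw` at draw time, while `"steelblue"` is passed to the geom as an ordinary argument.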

## Geometric Objects (`geom`)

Geometric objects are the actual marks we put on a plot. Examples include:
+ points (`geom_point`, for scatter plots, dot plots, etc)
+ lines (`geom_line`, for time series, trend lines, etc)
+ boxplot (`geom_boxplot`, for boxplots)

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the `+` operator.

Each type of geom accepts only a subset of all aesthetics–refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the `aes()` function.
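
A short sketch of both points, using the built-in `mtcars` data: two geoms are stacked with `+`, each with an aesthetic that only it understands. The last two lines read the accepted aesthetics from `GeomPoint` and `GeomLine`, the ggproto objects behind `geom_point()` and `geom_line()`. This inspection is only an illustration; the help pages remain the authoritative list.

```r
library(ggplot2)

cars_raw <- datasets::mtcars

# Start with data and default mappings, then add one geom per `+`.
p_layers <- ggplot(cars_raw, aes(x = wt, y = mpg)) +
  geom_point(aes(shape = factor(am))) +   # shape is understood by points
  geom_line(aes(linetype = factor(am)))   # linetype is understood by lines

p_layers

# Aesthetics each geom understands, read from the underlying ggproto objects.
GeomPoint$aesthetics()
GeomLine$aesthetics()
```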

In [None]:
# You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")

In [None]:
# Let’s look at housing prices.
housing <- read_csv("data/landdata-states.csv")
head(housing[1:5])

In [None]:
# Base graphics histogram example.
hist(housing$Home.Value, breaks = 40)

In [None]:
# ggplot2 histogram example.
library(ggplot2)
ggplot(housing, 
       aes(x = Home.Value)) +
geom_histogram(bins = 40)

In [None]:
# Base graphics colored scatter plot example.
plot(Home.Value ~ Date,
     col = factor(State),
     data = filter(housing, State %in% c("MA", "TX")))
legend("topleft",
       legend = c("MA", "TX"),
       col = c("black", "red"),
       pch = 1)

In [None]:
# ggplot2 colored scatter plot example.
ggplot(filter(housing,
              State %in% c("MA", "TX")),
       aes(x = Date,
           y = Home.Value,
           color = State)) +
geom_point()

In [None]:
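# Points (scatterplot) with log transformation.
# hp2001Q1 is not defined earlier in this notebook; this assumes it is the
# subset of housing for the first quarter of 2001, encoded as Date == 2001.25.
hp2001Q1 <- filter(housing, Date == 2001.25)
ggplot(hp2001Q1,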
       aes(y = Structure.Cost, 
           x = log(Land.Value))) +
geom_point()

## Prediction Line

A plot constructed with `ggplot` can have more than one geom. In that case the mappings established in the `ggplot()` call are plot defaults that can be added to or overridden.
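
The inheritance rule can be sketched on the built-in `mtcars` data before applying it to the housing data. The object names and the `pred` column are illustrative assumptions.

```r
library(ggplot2)

cars_raw <- datasets::mtcars
# Fitted values from a simple linear model, stored as a new column.
cars_raw$pred <- predict(lm(mpg ~ wt, data = cars_raw))

# x and y are plot-level defaults inherited by every layer.
base_plot <- ggplot(cars_raw, aes(x = wt, y = mpg))

layered <- base_plot +
  geom_point(aes(color = hp)) +   # adds a color mapping for this layer only
  geom_line(aes(y = pred))        # overrides the default y for this layer only

layered
```

The point layer still uses `mpg` for `y`, while the line layer draws the fitted values instead.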

In [None]:
# Our plot could improve if we add a regression line.
head(hp2001Q1)
hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))

p <- ggplot(hp2001Q1,
            aes(x = log(Land.Value),
                y = Structure.Cost))

p + 
    geom_point(aes(color = Home.Value)) + 
    geom_line(aes(y = pred.SC))

In [None]:
# Not all geometric objects are simple shapes–the smooth geom includes a line and a ribbon.
p +
    geom_point(aes(color = Home.Value)) +
    geom_smooth()

In [None]:
# Other aesthetics are mapped in the same way as x and y in the previous example.
p +
    geom_point(aes(color = Home.Value, shape = region))

In [None]:
p +
    geom_point(aes(color = Home.Value, shape = region)) + 
    geom_smooth(method = "lm",
                formula = y ~ x + log(x), 
                se = TRUE,
                color = "red")

In [None]:
# The scale_shape_identity scale can be used to pass through any legal shape value (its mapping is the 
# identity function, and thus it does not change anything).

# Let's take a look at all 25 symbols.
df <- data.frame(x = 1:5 , y = 1:25, z = 1:25)
ggplot(df,
       aes(x = x, 
           y = y)) +
geom_point(aes(shape = z), 
           size = 4) + 
scale_shape_identity()

## Scales: Controlling Aesthetic Mapping

Aesthetic mapping `aes()` only says that a variable should be mapped to an aesthetic. It doesn't say *how* that should happen. For example, when mapping a variable to *shape* with `aes(shape = x)` you don't say *what* shapes should be used. Similarly, `aes(color = z)` doesn't say *what* colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding *scale*. In `ggplot2`, scales include:

-   position
-   color and fill
-   size
-   shape
-   line type

Scales are modified with a series of functions using a `scale_<aesthetic>_<type>` naming scheme. Try typing `scale_<tab>` to see a list of scale modification functions.
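
Outside an editor with tab completion, the same list can be produced programmatically. The sketch below searches the attached `ggplot2` namespace for exported names that start with `scale_`; the variable name `scale_funs` is illustrative.

```r
library(ggplot2)

# All exported ggplot2 objects whose names start with "scale_".
scale_funs <- ls("package:ggplot2", pattern = "^scale_")
head(scale_funs)

# Narrow the list to one aesthetic, e.g. the color scales.
grep("^scale_color_", scale_funs, value = TRUE)
```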

## Common Scale Arguments

The following arguments are common to most scales in ggplot2:

+ name: the first argument gives the axis or legend title
+ limits: the minimum and maximum of the scale
+ breaks: the points along the scale where labels should appear
+ labels: the labels that appear at each break

Specific scale functions may have additional arguments; for example, the `scale_color_continuous` function has arguments `low` and `high` for setting the colors at the low and high end of the scale.
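
As a small illustration of the four common arguments, the sketch below builds a position scale for the built-in `mtcars` data and adds it to a plot. The title, limits, breaks, and label text are arbitrary choices made for the example.

```r
library(ggplot2)

cars_raw <- datasets::mtcars

weight_scale <- scale_x_continuous(
  name   = "Weight",                         # axis title
  limits = c(1, 6),                          # minimum and maximum of the scale
  breaks = c(2, 4, 6),                       # where tick labels appear
  labels = c("2k lbs", "4k lbs", "6k lbs")   # text shown at each break
)

ggplot(cars_raw, aes(x = wt, y = mpg)) +
  geom_point() +
  weight_scale
```

`breaks` and `labels` must have the same length, since each label is attached to the break at the same position.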

In [None]:
p <- ggplot(housing,
            aes(x = State,
                y = Home.Price.Index)) + 
geom_point(aes(color = Date),
           alpha = 0.5,
           size = 1.5,
           # width: degree of jitter in x direction. Defaults to 40% of the resolution of the data.
           # height: degree of jitter in y direction. Defaults to 40% of the resolution of the data.
           position = position_jitter(width = 0.25, height = 0)) +
scale_x_discrete(name = "State Abbreviation") + 
scale_color_continuous(name = "Date",
                       breaks = c(1976, 1994, 2013),
                       labels = c("'76", "'94", "'13")) +
theme(legend.position = "top",
      axis.text = element_text(size = 6))
p

## Faceting

+ Faceting is `ggplot2` parlance for **small multiples**
+ The idea is to create separate graphs for subsets of data
+ `ggplot2` offers two functions for creating small multiples:
    + `facet_wrap()`: define subsets as the levels of a single grouping variable
    + `facet_grid()`: define subsets as the crossing of two grouping variables
+ Facilitates comparison among plots, not just of geoms within a plot
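
The two functions can be compared on the `mpg` data set that ships with `ggplot2`. This is a sketch: the choice of `class`, `drv`, and `cyl` as grouping variables is an illustrative assumption.

```r
library(ggplot2)

# Fuel economy data bundled with ggplot2.
base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# One panel per level of a single variable, wrapped into 4 columns.
wrapped <- base_plot + facet_wrap(~ class, ncol = 4)

# Rows by drive train, columns by number of cylinders.
gridded <- base_plot + facet_grid(drv ~ cyl)

wrapped
gridded
```

`facet_grid()` keeps a panel for every row and column combination, even an empty one, whereas `facet_wrap()` only draws panels for levels that occur in the data.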

In [None]:
## What is the trend in housing prices in each state?
p <- ggplot(housing, 
            aes(x = Date, 
                y = Home.Value)) + 
geom_line(aes(color = State))
p

In [None]:
# p already contains the colored line layer, so only the facets are added here.
p <- p + 
facet_wrap(~State, ncol = 10)
p

## Themes

The `ggplot2` theme system handles non-data plot elements such as:
+ Axis labels
+ Plot background
+ Facet label background
+ Legend appearance

Built-in themes include:
+ `theme_gray()` (default)
+ `theme_bw()`
+ `theme_classic()`
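
A built-in theme can be used as a starting point and then adjusted with `theme()`. The sketch below, on the built-in `mtcars` data, stores such a combination in an object so it can be reused across plots. The name `my_theme` and the specific element overrides are illustrative.

```r
library(ggplot2)

cars_raw <- datasets::mtcars

# Start from a complete built-in theme and override individual elements.
my_theme <- theme_bw(base_size = 12) +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))

themed <- ggplot(cars_raw, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Weight vs. mileage", color = "Cylinders") +
  my_theme

themed
```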

In [None]:
p + theme_linedraw()

In [None]:
p + theme_light()

In [None]:
p + theme_minimal() +
theme(text = element_text(color = "turquoise"))