# STA 141A Fundamentals of Statistical Data Science

### Lecture 7, 10/24/23, Plotting with `ggplot2` 1/2

### Announcements

- 

### Last week's topics

- Base R plotting
- Midterm


### Today's topics


- Final Project
- Structure of `ggplot`
- `geom_`s
- Global and local mappings
- Coloring the graphic

### Final Project

You should work alone or with up to two partners. The purpose of the project is to perform a statistical analysis on real data. 
This includes:

* Acquiring a data set (e.g., from [Kaggle](https://www.kaggle.com/datasets) or [FiveThirtyEight](https://github.com/fivethirtyeight/data))
* Posing questions and developing hypotheses
* Processing the data 
* Exploring and visualizing the data
* Performing a statistical analysis (modelling, clustering)
* Presenting your findings through writing

#### Getting Started

To narrow down your project to just one topic, think
about:

*   What questions does your topic address or what problems does your topic
    solve? Why and to whom are these meaningful?
*   What questions can be answered with your data and methods? 
*   Are there credible, **public** datasets available to explore the topic?
*   Is a 6-week project long enough to explore the topic reasonably well?

As inspiration and an example of what can be done with public datasets, see [I
Quant NY][NY]. 

[NY]: http://iquantny.tumblr.com/post/144197004989/the-nypd-was-systematically-ticketing-legally

#### Proposal

Your group should submit a 1-2 page project proposal by
__November 4th 11:59pm__. Your proposal should address:

*   What's the topic of your project? What question(s) will you attempt to
    answer or what problems will you attempt to solve? Why and to whom are
    these meaningful?

*   What data source(s) will your team use? Briefly describe each data source
    and explain how you think you will use it. Provide a link for each data
    source. This is a check to make sure that there is actually data available
    for your topic. If you ultimately decide not to use some of the data
    sources, or find additional data sources later, that's okay.

*   What makes your project challenging? Consider that you will have about 6
    weeks to work on the project. 

The proposal is your best opportunity to get feedback on your project. Make
sure it's clear and addresses the questions above. You can also use the
proposal to tell us about any other comments or concerns you have about your
project topic. You do not need to present any data analyses in the proposal.

The proposal will be graded satisfactory/unsatisfactory and will not count toward your final grade. 
Your priority should be working on the project itself; don't spend more than a few hours working on
the proposal. 
The proposal is to be submitted to Canvas. 

#### Grading criteria

The final report is due on __December 15__. The report should be 8-10 pages
including writing and visualizations, but excluding code. 

We will score your report according to:

* Reporting (30%): Are there clear research questions that you asked, and did you
    address these in an orderly fashion? Did you make well justified
    conclusions? Is your project sensible and easy to read?

* Data Processing (10%): How much work was necessary to obtain your data and bring it into a format that is useful for further analysis?

* Visualization and Methodology (40%): 
    Do your visualizations follow best practices? Do they support the hypothesis? 
    Is your methodology appropriate? 
    Does this give insight to your project? Are the methods tailored to your
    specific topic and data (not generic or off-the-shelf)?

* Code (10%): Is your code well-organized and easy to read? Is your code
    reproducible? Is your code documented? Is your code reasonably efficient?
    Did you use appropriate data structures and algorithms?

Grading scales:

Grade            | Points
------------     | -------
Good             | 10
Satisfactory     | 8
Poor             | 6
Partial Work     | 4
No Work          | 0

### Structure of `ggplot` function

`ggplot2` is a package that allows for more flexibility in plotting compared to what is provided with base R. Furthermore, the package harmonizes with the `tidyverse` packages.  

In [None]:
# install.packages("ggplot2")
library("ggplot2")

The function `ggplot` creates a coordinate system that you can add layers to. The first argument is the dataset to use in the plot; it must be a <kbd>data.frame</kbd> or <kbd>tibble</kbd>. Consequently, 
`ggplot(data = mpg)` creates an empty graph. Here, `mpg` from the `ggplot2` package is similar to `mtcars`. 

In [None]:
str(mpg)

In [None]:
table(mpg$drv)

In [None]:
library(repr); options(repr.plot.width = 12, repr.plot.height = 12) # only necessary in Jupyter, not in RStudio

In [None]:
ggplot(data = mpg)

We complete the graphic by adding one or more layers to `ggplot`: 
`geom_`-type functions add a geometrical object to the plot (e.g. `geom_point` creates a scatterplot).

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    geom_point()

Each geom function in `ggplot2` takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. In the case above, the mapping was provided within the `ggplot` call as `aes(...)` and passed on to `geom_point`. 
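
To see the layered structure in isolation, the sketch below stores the empty plot in a variable, adds a layer, and then inspects the data that the layer will draw. The object names `p_empty`, `p_points`, and `drawn` are illustrative; `layer_data()` is a `ggplot2` helper that returns the data frame behind a given layer.

```r
library(ggplot2)

# Step 1: data and global mapping only; no layers, so nothing is drawn yet
p_empty <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 2: adding a geom creates the first layer
p_points <- p_empty + geom_point()

# Step 3: inspect what the layer will draw; the x and y columns
# hold the values of displ and hwy, as specified in aes()
drawn <- layer_data(p_points, 1)
head(drawn[, c("x", "y")])
```

Because the mapping lives in `p_empty`, any further geom added to it starts from the same `x` and `y`.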

### `geom_`s

Above, `geom_point` requires two aesthetics, namely the x- and y-values. Other geometrical objects require more or fewer aesthetics. 

In [None]:
ggplot(data = mpg, mapping = aes(displ))+
  geom_histogram()

In [None]:
ggplot(data = mpg, mapping = aes(displ))+
  geom_histogram(binwidth = 0.5)

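Note that the histogram is not scaled to a density but displays frequencies. We can change this with `y = ..density..` (since `ggplot2` 3.4.0, `after_stat(density)` is the preferred spelling and `..density..` is deprecated). Furthermore, the `geom_histogram` function takes more arguments than just `mapping`: 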

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = ..density..)) +
  geom_histogram(bins = 15, fill = 2) # plot on the density scale
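
Whether the histogram is really on the density scale can be checked numerically: the bar areas should sum to one. The sketch below assumes `ggplot2` 3.3.0 or later, where `after_stat(density)` is available as the newer spelling of `..density..`; the names `p_dens`, `bars`, and `total_area` are illustrative.

```r
library(ggplot2)

p_dens <- ggplot(data = mpg, mapping = aes(x = displ, y = after_stat(density))) +
  geom_histogram(bins = 15)

# One row per bar, with bin limits (xmin, xmax) and the computed density
bars <- layer_data(p_dens, 1)

# Area of each bar = height * width; the areas add up to 1
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```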

Another geom that takes a one-dimensional mapping is `geom_bar`: 

In [None]:
ggplot(data = mpg, mapping = aes(drv)) +
  geom_bar(fill = 3)
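
`geom_bar` counts the rows in each category before drawing, so the bar heights should agree with `table(mpg$drv)` from above. A minimal check, with `p_bar` and `bars` as illustrative names:

```r
library(ggplot2)

p_bar <- ggplot(data = mpg, mapping = aes(drv)) +
  geom_bar()

# Heights computed by geom_bar (column `count`), one row per bar
bars <- layer_data(p_bar, 1)
bars$count

# The same frequencies computed directly
table(mpg$drv)
```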

One graphic can display multiple geometrical objects: with `method = "lm"`, `geom_smooth` adds the regression line of a linear model regressing `y` on `x`. The `se` argument controls whether confidence bands are drawn; they are switched off below. 

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) 
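
With `method = "lm"`, the line drawn by `geom_smooth` should coincide with an ordinary least-squares fit of `hwy` on `displ`. The sketch below fits that model with `lm()` and overlays it as a dashed red line via `geom_abline`. The names `fit`, `b`, and `p_fit` are illustrative, and the overlay is only a visual check, not part of the original plot.

```r
library(ggplot2)

# Least-squares fit of highway mileage on engine displacement
fit <- lm(hwy ~ displ, data = mpg)
b <- unname(coef(fit))  # b[1] = intercept, b[2] = slope

p_fit <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Same line drawn from the lm() coefficients; it should lie on top of the smooth
  geom_abline(intercept = b[1], slope = b[2],
              color = "red", linetype = "dashed")
p_fit
```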

### Global and local mappings


All `geom_`-objects require a mapping for their required aesthetics, either inherited from `ggplot` or provided directly. If it is missing, as in the second command below, an error is raised. 

In [None]:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

In [None]:
ggplot(data = mpg) + 
  geom_boxplot()
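
The error is raised when the plot is built for display, not when the layers are combined. A small sketch that captures the message instead of stopping the session follows. The names `p_missing` and `res` are illustrative, and `ggplot_build()` is used here on the assumption that it performs the same computations as printing the plot.

```r
library(ggplot2)

# Combining the pieces works, although no x or y aesthetic is mapped
p_missing <- ggplot(data = mpg) +
  geom_boxplot()

# Building the plot triggers the check for required aesthetics
res <- tryCatch(ggplot_build(p_missing), error = function(e) e)
inherits(res, "error")
conditionMessage(res)
```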

Both `ggplot` and `geom_`-functions can take a mapping argument returned by `aes`. Up until now, we have provided the mapping within `ggplot`. All `geom_`-functions inherit the mapping argument provided in `ggplot`. 

Alternatively, the argument can be provided at `geom_`-level. 

In [None]:
ggplot(mpg) + 
  geom_boxplot(aes(x = class, y = hwy))

The global mapping can be overridden at the local `geom_`-level.

In [None]:
ggplot(data = mpg, mapping = aes(x = displ)) + 
  geom_boxplot(mapping = aes(x = class, y = hwy))

Multiple mappings are important if different geometrical objects are to be plotted. 

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = 'lm', se = FALSE)

Here, the `geom_point` function accesses both the globally provided `x` and `y` arguments and the locally provided additional `color` aesthetic. The `geom_smooth` function only accesses `x` and `y`. If it were provided with `color` as well, the result would be a different plot with one regression line per color group. 

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
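
The effect of moving `color` from the local to the global mapping can be made explicit by counting the groups that `geom_smooth` fits. This sketch uses `class` in both variants so that only the placement of `color` differs; `p_local`, `p_global`, and `n_groups` are illustrative names.

```r
library(ggplot2)

# color only known to geom_point: geom_smooth fits a single line
p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)

# color in the global mapping: geom_smooth inherits it and fits one line per class
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Number of separately fitted groups in the smooth layer (layer 2)
n_groups <- function(p) length(unique(layer_data(p, 2)$group))
n_groups(p_local)
n_groups(p_global)
```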

### Coloring the graphic

As we have seen before, `color` can be given both as an optional aesthetic within `aes` and as a direct argument of a `geom_`-function. If both are given, the latter overrides the former for that layer. 

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = 2) + 
  geom_smooth(method = 'lm', se = FALSE)

Providing `color` within `aes` makes the color depend on the values of the mapped variable. If the color does not convey information about a variable, but only changes the appearance of the plot, it should be set outside `aes`. 
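
A common stumbling block follows from this distinction: a constant placed inside `aes` is treated as data, not as a color. The sketch below compares the two placements; `p_outside` and `p_inside` are illustrative names.

```r
library(ggplot2)

# Outside aes(): every point is drawn in blue, no legend
p_outside <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Inside aes(): "blue" is a one-level variable, so the points get a
# color from the default palette and a legend entry labelled "blue"
p_inside <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

# Colors actually assigned to the points in each plot
unique(layer_data(p_outside, 1)$colour)
unique(layer_data(p_inside, 1)$colour)
```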

In [None]:
library(dplyr)
mpg %>% 
    select(displ, hwy, drv) %>%
    head()

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() 

In this case, `drv` contains additional information about each object. `drv` is (internally transformed to) a factor variable with three levels. Consequently, three colors are automatically assigned to the plot. These colors are evenly spaced around an HCL (hue, chroma, luminance) color circle.

In [None]:
printColor <- function(n) {
  df <- tibble::tibble(
    x = 1:n, 
    y = 1, 
    col = factor(x))
  require(ggplot2)
  ggplot(df) + 
    theme_void() + 
    theme(legend.position = "none") + 
    geom_point(aes(x, y, color = col), 
               size = 20, 
               shape = "square")
}

In [None]:
printColor(100) # set values up to, e.g., 200
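
The default colors can also be retrieved as hex codes. The sketch assumes the `scales` package, which is installed together with `ggplot2`, and its `hue_pal()` generator; `default_cols` and `p_drv` are illustrative names.

```r
library(ggplot2)

# hue_pal() returns a function; calling it with n gives n evenly spaced hues
default_cols <- scales::hue_pal()(3)
default_cols

# The colors actually used when drv is mapped to color
p_drv <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point()
unique(layer_data(p_drv, 1)$colour)
```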

Although evenly spaced colors around the HCL color circle are a sensible default, the resulting colors are not colorblind-friendly. Custom palettes can be created via HEX color codes. 

In [None]:
myPalette <- c("#E69F00", "#56B4E9", "#009E73")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
    scale_color_manual(values = myPalette) + 
    geom_point() 

If, instead of `color`, the `fill` argument of `aes` is used, the manual color command is `scale_fill_manual`. 
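
A minimal sketch of the `fill` variant, reusing the palette from above on a bar chart; `p_fill` is an illustrative name.

```r
library(ggplot2)

myPalette <- c("#E69F00", "#56B4E9", "#009E73")

# drv is mapped to fill (the interior of the bars), so the manual
# palette is attached with scale_fill_manual rather than scale_color_manual
p_fill <- ggplot(data = mpg, mapping = aes(x = drv, fill = drv)) +
  geom_bar() +
  scale_fill_manual(values = myPalette)
p_fill
```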

Just as when we set `n = 200` in `printColor` above, a discrete color scheme is not appropriate if the variable passed to `aes` is not discrete; a continuous color scheme is used instead. 

In [None]:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = hwy)) +
  geom_point() 

There are a variety of good continuous color schemes. We'll use `viridis` in connection with `geom_hex`. 

In [None]:
library("hexbin")
library("viridis")
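
Before turning to `geom_hex`, the scatterplot from above gives a quick impression of a viridis scale on a continuous variable. The sketch assumes `ggplot2` 3.0.0 or later, which provides `scale_color_viridis_c()` itself; the `viridis` package offers `scale_color_viridis()` for the same purpose. `p_cont` is an illustrative name.

```r
library(ggplot2)

# hwy is numeric, so the color scale is a continuous gradient
p_cont <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = hwy)) +
  geom_point() +
  scale_color_viridis_c()
p_cont
```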

Consider the synthetic data frame generated in the last lecture. 

In [None]:
# synthetic data: 5000 draws from a correlated bivariate normal distribution
df <- data.frame(matrix(rnorm(10000, 0, 1), ncol = 2) %*% matrix(c(1, 0.5, 0.5, 1), 2, 2))
ggplot(df) +
    coord_fixed() + scale_fill_viridis() + 
    geom_hex(mapping = aes(x = X1, y = X2)) 
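
No `fill` was mapped explicitly in the plot above, yet the hexagons are colored. The sketch below spells out the default by mapping `fill` to the computed `count` variable with `after_stat()` (assuming `ggplot2` 3.3.0 or later) and checks that the counts add up to the number of observations. The seed and the names `p_hex` and `hexes` are arbitrary choices for illustration.

```r
library(ggplot2)
library(hexbin)  # required by geom_hex

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(matrix(rnorm(10000, 0, 1), ncol = 2) %*% matrix(c(1, 0.5, 0.5, 1), 2, 2))

# Explicit version of the default: fill is taken from the computed variable `count`
p_hex <- ggplot(df) +
  coord_fixed() +
  geom_hex(mapping = aes(x = X1, y = X2, fill = after_stat(count)))

# One row per hexagon; the counts add up to the number of observations
hexes <- layer_data(p_hex, 1)
sum(hexes$count)
nrow(df)
```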

Here, the `fill` aesthetic is automatically supplied by the internally computed `count` variable. A proper two-dimensional density plot can be created using `stat_density_2d`. 

In [None]:
ggplot(faithful, aes(eruptions, waiting)) + 
  scale_fill_viridis() + 
  theme(legend.position = "none") + 
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) 

The `..density..` notation has been seen before in the histogram. Here, it colors each raster cell by the estimated density. A `stat_` function is used instead of a `geom_` function because we do not simply plot a geometrical object but first compute a statistical transformation (here: density estimation). 
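
What the statistical transformation produces can be inspected directly: the layer no longer holds the original observations but a grid of points with an estimated density at each. The sketch assumes `ggplot2` 3.3.0 or later for `after_stat()` and uses the built-in `scale_fill_viridis_c()`; `p_dens2d` and `dens_grid` are illustrative names.

```r
library(ggplot2)

p_dens2d <- ggplot(faithful, aes(eruptions, waiting)) +
  stat_density_2d(aes(fill = after_stat(density)), geom = "raster", contour = FALSE) +
  scale_fill_viridis_c() +
  theme(legend.position = "none")

# The layer data are grid points with an estimated density,
# not the original observations
dens_grid <- layer_data(p_dens2d, 1)
nrow(faithful)
nrow(dens_grid)
range(dens_grid$density)
```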

### Exercises

Consider the following data set, where you want to create boxplots of `z` for each value of `x`, additionally split by `y`. You thus expect two boxplots for A (one corresponding to y=1 and one to y=2) and two boxplots for B (one for y=1 and one for y=2).
Why does this code generate only two boxplots?

In [None]:
set.seed(2022)
df = data.frame(x = sample(c("A","B"), 40, replace = T),
                y = sample(1:2, 40, replace = T),
                z = rnorm(80))

ggplot(data = df, aes(x = x, y = z, fill = x)) +
    geom_boxplot()