In this notebook, we review material from previous lectures:

* [ggplot](#ggplot)
    * aesthetics
    * geometries
    * faceting
    * statistical transformations
    * position adjustments
    * coordinate transformations
* [dplyr verbs](#dplyr-verbs)
    * filter
    * arrange
    * select
    * rename
    * mutate
    * transmute
    * group_by
    * summarize
* [pipes](#pipes)
* [EDA](#EDA)
    * visualizing distributions
    * typical values
    * unusual values
    * missing values
    * covariation

# ggplot

In [None]:
#options(repr.plot.width=6, repr.plot.height=4)
library(tidyverse)
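# install nycflights13 only if it is not already available
if (!requireNamespace("nycflights13", quietly = TRUE)) install.packages("nycflights13")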

What's wrong with this code to produce a scatter plot of `hwy` vs `displ` with the `color` aesthetic mapped to `drv`?

In [None]:
tryCatch({
  ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = drv) # attempt to create a scatter plot
},
error = function(err) {
 print(err)
})
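
The error arises because `color = drv` sits outside `aes()`, so R looks for an object named `drv` in the calling environment rather than a column of `mpg`. A minimal corrected sketch, moving the mapping inside `aes()` (the name `p` is only for illustration):

```r
library(ggplot2)

# color is mapped to the drv column, so it belongs inside aes()
p <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = drv))
p
```

Arguments placed outside `aes()` are meant for constants, for example `color = "blue"`.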


What does `se = FALSE` do in the code below?

In [None]:
ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)
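
`se` controls whether `geom_smooth()` draws the confidence band around the fitted curve; with `se = FALSE` only the line is drawn. One way to see the difference is to draw both versions (a sketch; the variable names are only for illustration):

```r
library(ggplot2)

# default (se = TRUE): smooth line plus a shaded confidence band
with_band <- ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))

# se = FALSE: smooth line only
without_band <- ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)

with_band
without_band
```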

Write the command to produce the following plot.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot1.png)

In [None]:
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
    geom_point() +
    geom_smooth()

Write the command to produce the following plot.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot2.png)

In [None]:
ggplot(data=mpg, mapping=aes(x=displ,y=hwy,color=drv)) +
    geom_point() +
    geom_smooth()

Match the geometries below with their statistical transformations.

| Geometry       | Stat     |
|----------------|----------|
| geom_point     | bin      |
| geom_histogram | count    |
| geom_bar       | identity |

Answer:

| Geometry       | Stat     |
|----------------|----------|
| geom_point     | identity |
| geom_histogram | bin      |
| geom_bar       | count    |
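
The default stat of a geometry can also be checked directly. The sketch below relies on each `geom_*()` call returning a layer object whose `stat` field holds its default stat. It should report `StatIdentity`, `StatBin`, and `StatCount`, in that order.

```r
library(ggplot2)

# the first class name of the stat field identifies the default stat
point_stat <- class(geom_point()$stat)[1]
histogram_stat <- class(geom_histogram()$stat)[1]
bar_stat <- class(geom_bar()$stat)[1]

c(point_stat, histogram_stat, bar_stat)
```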

Write the command to produce the following plot.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot3.png)

In [None]:
ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class,fill=drv))

Write the command to produce the following plot.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot4.png)

In [None]:
ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class,fill=drv)) +
    coord_flip()

# dplyr verbs

What's wrong in the code fragments below?

In [None]:
tryCatch({
  filter(mpg, drv == f) # find vehicles with front wheel drive
},
error = function(err) {
 print(err)
})

In [None]:
# f should be in quotes since drv is of chr type
filter(mpg, drv == "f")

In [None]:
tryCatch({
  filter(mpg, 'drv' != '4') # find vehicles that do not have 4-wheel drive
},
error = function(err) {
 print(err)
})

In [None]:
# variable name shouldn't be in quotes
filter(mpg, drv != '4')
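
Note that the quoted version does not raise an error. `'drv' != '4'` compares two string constants, which is always `TRUE`, so `filter()` silently keeps every row. A small check, as a sketch (the variable names are only for illustration):

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# comparing two string constants yields a single TRUE, which keeps all rows
all_rows <- filter(mpg, 'drv' != '4')
nrow(all_rows) == nrow(mpg)

# comparing the drv column drops the 4-wheel-drive vehicles
no_4wd <- filter(mpg, drv != '4')
nrow(no_4wd) < nrow(mpg)
```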

In [None]:
tryCatch({
  filter(mpg, manufacturer == toyota & class == suv) # find all suvs made by toyota
},
error = function(err) {
 print(err)
})


In [None]:
# toyota and suv should be in quotes
filter(mpg, manufacturer == "toyota" & class == "suv")

In [None]:
tryCatch({
  n(filter(mpg, cyl == 4)) # find the number of vehicles with 4 cylinders
},
error = function(err) {
 print(err)
})

In [None]:
# nrow() returns the number of rows in a tibble (ncol() returns the number of columns);
# n() can only be used inside dplyr verbs such as summarize()
nrow(filter(mpg, cyl == 4))
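
For comparison, here is a sketch of how `n()` is meant to be used, inside `summarize()`. `count()` is a dplyr shortcut that does the grouping and counting in one step. The name `four_cyl` is only for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# n() counts the rows of the current group inside summarize()
four_cyl <- filter(mpg, cyl == 4) %>%
    summarize(count = n())
four_cyl

# count() gives the number of rows for every value of cyl at once
count(mpg, cyl)
```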

# pipes

What's wrong in the code fragments below?

In [None]:
tryCatch({
  select(mpg, hwy) %>% # show only the highway mileage of suvs sorted in descending order of the mileage
    filter(class == 'suv') %>%
    arrange(hwy)
},
error = function(err) {
 print(err)
})

In [None]:
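# filter first: after select(hwy) the class column is no longer available;
# desc() sorts in descending order (arrange() sorts ascending by default)
filter(mpg, class == 'suv') %>%
    select(hwy) %>%
    arrange(desc(hwy))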

In [None]:
# depth variable in diamonds is supposed to be the ratio (as a percentage) between z and mean of x,y
# add a new column new_depth where we compute it ourselves
# assign it to a variable called new_diamonds
mutate(diamonds, new_depth <- 100*2*z/(x+y)) %>%
    new_diamonds

In [None]:
# name the new column with = inside mutate() and assign the result with <-
new_diamonds <- mutate(diamonds, new_depth = 100*2*z/(x+y))

In [None]:
# check if depth and new_depth are close to each other within machine precision
filter(new_diamonds, depth == new_depth)

In [None]:
filter(new_diamonds, near(depth, new_depth))
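
The reason `==` is unreliable here is that floating-point arithmetic introduces tiny rounding errors, while `near()` compares within a small tolerance. A standard illustration, as a sketch:

```r
library(dplyr)

# floating-point rounding makes the exact comparison fail
exact <- sqrt(2)^2 == 2
exact

# near() compares within a small tolerance
approx <- near(sqrt(2)^2, 2)
approx
```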

In [None]:
# Note that distance is in miles and air_time is in minutes
#
# add a speed variable in m.p.h. obtained by dividing distance by air_time, then
# select only speed and distance, then
# plot a scatter of speed (y axis) vs distance (x axis)
library(nycflights13)
mutate(flights, speed = 60*distance/air_time) %>%
    select(flights, distance, air_time) %>%
    ggplot(mapping = aes(x = distance, y = air_time)) +
        geom_point()

In [None]:
# after a pipe, select() takes no data argument; keep speed and put it on the y axis
mutate(flights, speed = 60*distance/air_time) %>%
    select(speed, distance) %>%
    ggplot(mapping = aes(x = distance, y = speed)) +
        geom_point()

In [None]:
# show a bar chart of average highway mileage of vehicles produced by each manufacturer
# manufacturer names are long so make sure to flip the coordinate axes in the bar chart
mpg %>%
    group(manufacturer) %>%
    summarize(average_hwy = mean(hwy)) %>%
    ggplot() %>%
        geom_bar(mapping = aes(x = manufacturer, y = hwy)) +
        coord_flip()

In [None]:
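# the verb is group_by(); ggplot layers are added with +, not %>%;
# y must be the summarized column, drawn as-is with stat = "identity"
mpg %>%
    group_by(manufacturer) %>%
    summarize(average_hwy = mean(hwy)) %>%
    ggplot() +
        geom_bar(mapping = aes(x = manufacturer, y = average_hwy), stat = "identity") +
        coord_flip()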
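
An equivalent and somewhat shorter way to draw bars whose heights are already computed is `geom_col()`, which uses the `y` values as given. The sketch below should produce the same chart (the name `p_col` is only for illustration):

```r
library(dplyr)
library(ggplot2)

# geom_col() uses the y values as given, so no stat argument is needed
p_col <- mpg %>%
    group_by(manufacturer) %>%
    summarize(average_hwy = mean(hwy)) %>%
    ggplot() +
        geom_col(mapping = aes(x = manufacturer, y = average_hwy)) +
        coord_flip()
p_col
```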

# EDA

Write the command to produce the following plot.

Note that the `speed` (in m.p.h.) variable has been computed using `distance` (in miles) and `air_time` (in minutes) variables. `binwidth` was 10 m.p.h.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot5.png)

In [None]:
mutate(flights, speed = 60*distance/air_time) %>%
    ggplot() +
        geom_histogram(aes(x=speed), binwidth=10)
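
Some rows of `flights` have a missing `air_time`, so `speed` is `NA` for them and `geom_histogram()` warns that it removed those rows. One option, sketched below, is to count and drop the missing values explicitly before plotting (the variable names are only for illustration):

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# how many flights have no recorded air_time (and hence no speed)?
n_missing <- sum(is.na(flights$air_time))
n_missing

# drop those rows explicitly before plotting
p_speed <- flights %>%
    filter(!is.na(air_time)) %>%
    mutate(speed = 60*distance/air_time) %>%
    ggplot() +
        geom_histogram(aes(x = speed), binwidth = 10)
p_speed
```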

Write the command to produce the following plot using the `mpg` data set.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot6.png)

In [None]:
ggplot(mpg) +
    geom_bar(aes(x=manufacturer,fill=class)) +
    coord_flip()

Write the command to produce the following plot using the `mpg` data set.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot7.png)

In [None]:
ggplot(mpg) +
    geom_boxplot(aes(x=manufacturer,y=hwy)) +
    coord_flip()

Write the command to produce the following plot.

Note that the `speed` (in m.p.h.) variable has been computed using the `distance` (in miles) and `air_time` (in minutes) variables.

![plot](http://dept.stat.lsa.umich.edu/~tewaria/teaching/STATS306-Fall2017/Rplot8.png)

In [None]:
mutate(flights, speed = 60*distance/air_time) %>%
    select(speed, distance) %>%
    ggplot(mapping = aes(x = distance, y = speed)) +
        geom_hex()