
# Lab 2: ggplot and dplyr

### January 18th, 2022

# 1. Logistics

- Please join Slack and the Lab 3 lab channel in particular if you haven't already
- This week, I will be moving my Tuesday office hour to Wednesday 4:30-6pm
- Starting next week, I will split my OH into 3 chunks to increase coverage
- The purpose of labs...
- We are probably meeting in-person next week

In [None]:
library(tidyverse) # attaches ggplot2 and dplyr (among other packages)

# 2. Back to ggplot

Often, we're only interested in a subset of data points. Working with a subset can speed up plotting and make expensive analyses cheaper to run.

## 2.1 Pipe Syntax (Functional Programming)

In [None]:
# randomly sampling a subset from dataset
# setting a random seed ensures replicability
set.seed(108)
# this is "piping" notation 
dm <- diamonds %>% sample_n(1000)
names(dm)
summary(dm)
dim(dm)

In [None]:
# optional, look at the documentation
# ?sample_n
# standard syntax
dm <- sample_n(diamonds, 1000)
# functional programming syntax (w/ pipe)
dm <- diamonds %>% sample_n(1000)
# typically, we will use pipes to apply a function to a tibble
# does this work?
dm %>% summary()

## 2.2 Review of ggplot

Every ggplot2 plot has three key components:

- Data,

- A set of aesthetic mappings between variables in the data and visual properties, and

- At least one layer which describes how to render each observation. Layers are usually created with a geom function.

In [None]:
p1 = ggplot(dm) +
    geom_point(aes(x, price, color = cut)) + 
    facet_wrap(vars(clarity), ncol=4) + 
    theme_bw() # optional: add a theme layer
print(p1)

# some available themes: theme_bw, theme_classic, theme_void...
# feel free to explore!

## this code does the same thing!
# p1 = ggplot(dm, aes(x, price, color = cut)) +
#     geom_point() +
#     facet_wrap(vars(clarity), ncol=4) 
# print(p1)

## 2.3 Layering Geometric Objects
Suppose we are interested in identifying trends in our data. We can plot a smooth line of best fit as follows:

In [None]:
p2 = ggplot(dm) +
    geom_point(aes(x, price)) + # we can specify the aes. mapping
    geom_smooth(aes(x, price))  # for EACH geom layer
print(p2)

### Exercise 1
In the above fit, use a locally weighted scatterplot smoother (loess) instead of a generalized additive model (GAM).

How do you go about checking the documentation?

In [None]:
# your code here
p3 = ggplot(dm)

### Exercise 2.1 
Same as before, but try fitting a linear best fit.

In [None]:
# your code here
p4 = ggplot(dm)

### Exercise 2.2

Fit a straight line, but with both variables log-scaled.

(You do not need to do anything if your previous code is correct)

In [None]:
p4 + scale_x_continuous(trans='log10') + 
     scale_y_continuous(trans='log10')

### Why would a log-log plot make sense?

A log-log plot describes the relationship $Price = c \cdot Carat^k$ for some constants $c, k > 0$.

* Going from 0.1 carats to 0.2 carats may not be worth much, but increasing from 1.9 carats to 2.0 carats would result in a significant increase in price.
* Let's look at the following relationship:
$$ Price = c \cdot Carat^k \Rightarrow \log Price = \log c + k \log Carat$$
* Once we take the log of both variables, we have a linear relationship between $\log Price$ and $\log Carat$, with slope $k$ and intercept $\log c$.
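
As a rough check of this derivation, the sketch below simulates data from a power law and fits a straight line on the log scale. The values of `c_true` and `k_true`, the noise level, and the simulated `carat` and `price` vectors are assumptions made up for illustration; they are not taken from the diamonds dataset.

```r
# Simulated data (hypothetical values, NOT the diamonds dataset)
set.seed(1)
c_true <- 5000   # assumed constant c
k_true <- 1.7    # assumed exponent k
carat <- runif(200, min = 0.2, max = 3)
# multiplicative noise keeps the relationship linear on the log scale
price <- c_true * carat^k_true * exp(rnorm(200, sd = 0.1))

# fit a straight line to the log-transformed variables
fit <- lm(log(price) ~ log(carat))
k_hat <- unname(coef(fit)[2])       # the slope estimates k
c_hat <- unname(exp(coef(fit)[1]))  # exp(intercept) estimates c
print(c(c_hat = c_hat, k_hat = k_hat))
```

The estimates should land close to the assumed constants, which is what the linear relationship between the logged variables predicts.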

### Exercise 2.3
Fit a separate straight line for each category of the clarity variable. *Hint: give each line a different color based on clarity.*

In [None]:
# optional: think about how we could disable the confidence interval
# your code here
p4 = ggplot(dm)

### Exercise 3
Can we rewrite the code above to reduce duplication? (Both geom objects have the same aesthetic mapping...)

In [None]:
# your code here
p5 = ggplot(dm)
# optional: how do we hide the legend?
# two options: 
# 1. "show.legend" argument in a geom_layer
# 2. add a theme layer: theme(legend.position = "none")

### Self-Study
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?


In [None]:
# Boxplot

# Histogram

# Area Chart

# 3. Customizing ggplot

## 3.1 Statistical Transformations
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.

You can learn which stat a geom uses by inspecting the default value for the 'stat' argument. For example, `?geom_bar` shows that the default value for stat is 'count', which means that `geom_bar()` uses `stat_count()`.

`stat_count()` is documented on the same page as geom_bar(), and if you scroll down you can find a section called "computed variables." That describes how it computes two new variables: count and prop.
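
To get a feel for these two computed variables, here is a minimal by-hand sketch using dplyr. It is only an approximation of what `stat_count()` does internally: `count` is the number of rows per category, and `prop` is computed here over the whole dataset (an assumption about how the bars are grouped). The name `cut_summary` is made up for this example.

```r
library(dplyr)
library(ggplot2) # only needed here for the diamonds dataset

# one row per cut level: number of rows, and share of all rows
cut_summary <- diamonds %>%
  count(cut, name = "count") %>%
  mutate(prop = count / sum(count))
print(cut_summary)
```

If you want to inspect the values ggplot2 itself computed for a plot, `layer_data()` applied to the plot object returns them as a data frame.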


In [None]:
?geom_bar

In [None]:
popn <- tribble(
~city, ~population,
"Istanbul", 15029231,
"Moscow", 12615279,
"Saint Petersburg", 9126366,
"Berlin", 5383890,
"Madrid", 3748148
)

![We want a plot that looks like this.](https://raw.githubusercontent.com/enesdilber/stats306_labs/master/lab2/graph5.png)

### Bar Plots in ggplot

How do we reproduce the above plot?

In [None]:
# this doesn't look right
ggplot(data = popn, aes(city)) + 
  geom_bar()

In [None]:
# this will raise an error
ggplot(data = popn, aes(city, population)) + 
  geom_bar()

In [None]:
# we can override the default stat
# and supply a variable mapped to the y-axis!
ggplot(data = popn, aes(city, population)) + 
  geom_bar(stat="identity")

### Exercise 4
Use `geom_col` to reproduce a bar plot of cities by population. (No need to override the default `stat`). Remember to supply a title!

In [None]:
# your code here
ggplot(data = popn, aes(city, population)) + 
    geom_col() + ggtitle("Most Populated Cities in Europe")

### Proportion Bar Plot

In [None]:
ggplot(data = dm) + 
    geom_bar(mapping = aes(x=cut, y=after_stat(prop), group=1)) # after_stat(prop) replaces the deprecated ..prop..

### Summary Plots

Other times, you want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary() which summarises the y values for each unique x value, to draw attention to the summary that you're computing:

In [None]:
ggplot(data = dm, aes(cut, depth)) + 
    stat_summary(fun = mean, fun.min = min, fun.max = max, size=0.3)

  # try playing around with different options for "fun"

### Self-Study
1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
2. What does geom_col() do? How is it different to geom_bar()?
3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
4. What variables does stat_smooth() compute? What parameters control its behaviour?
5. In our proportion bar chart, we need to set group = 1. Why?

## 3.2 Position adjustments

### geom_bar

In [None]:
# Difference between color and fill
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

Remember that we can "color by" a different variable - in this case, clarity. By default, it stacks the bars for each clarity level. This is done using the positional adjustment specified by the position argument of geom_bar. If you don't want a stacked bar chart, you can use one of three other options: "identity", "dodge", or "fill".

In [None]:
ggplot(data = dm, aes(x = cut)) + 
  geom_bar(aes(fill = clarity))

In [None]:
# position "identity"
# overlaps bars
# not particularly useful, imo
ggplot(data = dm) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")

In [None]:
# position "fill"
# each stacked bar is same height
# useful for comparing proportions
ggplot(data = dm) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

### Exercise 5
Implement a bar plot with `position = 'dodge'`.

This places overlapping objects directly beside one another, which makes it easier to compare individual values.

![Caption for the picture.](https://raw.githubusercontent.com/enesdilber/stats306_labs/master/lab2/graph10.png)

In [None]:
# your code here

### Jitter
A position adjustment that is very useful for scatterplots with overlapping points is `position = "jitter"`, which adds a small amount of random noise to each point. *However, you have to be careful not to interpret the random positioning of points on the x-axis as meaningful.*

In [None]:
ggplot(data = dm) + 
  geom_point(mapping = aes(x = cut, y = price))

ggplot(data = dm) + 
  geom_point(mapping = aes(x = cut, y = price), position = "jitter")

### Self Study
1. What parameters to geom_jitter() control the amount of jittering?
2. Compare and contrast geom_jitter() with geom_count().
3. What's the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset and demonstrate it. 
Make sure you go through coordinate systems.



# 4. dplyr for Data Manipulation

In [None]:
dim(dm)
head(dm)

Simply run 'dm' after declaring the dm variable above. Can you guess what 'dbl', 'ord', and 'int' are?

Notice how the levels below follow an order. Indeed, we expect Fair < Good < Very Good < Premium < Ideal.

In [None]:
print(levels(dm$cut))
print(levels(dm$color))
print(levels(dm$clarity))

In [None]:
sizes = c("M", "S", "S", "M", "XL", "XXL", "XL", "S", "M", "L")
levels(as.factor(sizes)) # factors are just categorical variables in R

In [None]:
sizes = ordered(sizes, levels = c("S", "M", "L", "XL", "XXL"))
levels(sizes)

There are five fundamental functions in dplyr: `filter`, `arrange`, `select`, `mutate` and `summarise`. All of them have the following properties:
1. The first argument is a data frame.
2. The subsequent arguments indicate what to do with the data frame, using the variable names (without quotes).
3. The result is a new data frame.
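
These three properties can be seen on a tiny example. The tibble `toy` and its columns `id` and `size` are hypothetical, made up for illustration, with `filter` standing in for any of the five verbs.

```r
library(dplyr)

# small hypothetical tibble for illustration
toy <- tibble(id = 1:4, size = c(2.5, 1.0, 3.2, 1.8))

# 1. the first argument is the data frame
# 2. the column `size` is referred to without quotes
big <- filter(toy, size > 1.5)

# 3. the result is a new data frame; `toy` itself is unchanged
print(big)
print(toy)

# because the data frame comes first, the same call works with a pipe
big_piped <- toy %>% filter(size > 1.5)
print(all.equal(big, big_piped))
```

Because every verb takes a data frame first and returns a data frame, calls can be chained with `%>%` without storing intermediate results.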

## 4.1 Filter
Used if you want to view or store a new dataset containing a subset of the full dataset.

In [None]:
filter(dm, cut == 'Fair', color == 'J')

Usually you want to store the newly subsetted data in memory. 

In [None]:
worst_diamonds = filter(dm, cut == 'Fair', color == 'J')

Make sure to use '==' instead of '='. The former tests equality, while the latter is used for assignment (and for naming function arguments).

###  Exercise 6

Practice filtering on multiple conditions.

In [None]:
# filter for rows that have color J or a fair cut
a = filter(dm) 

# filter for rows that have color J and a fair cut
b = filter(dm)

# filter for rows that don't have a fair cut
c = filter(dm)

# filter for rows that have either color J or a fair cut (and not both!)
d = filter(dm)  # hint: look up the xor() function

# filter for rows that have a carat less than 1.0
e = filter(dm)

In R, if you want to find if a variable's value is missing, use the is.na() function. In particular, do not check for equality with NA:

In [None]:
x = 4
x == NA
is.na(x)

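Similarly, never put an equality condition with NA in your dplyr filter() statements. Note that filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped: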

In [None]:
# create a dataframe
df = tibble(x = c(1, NA, 3))
print(df)

In [None]:
filter(df, x>1)

In [None]:
filter(df, is.na(x) | x > 1)
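
To make the warning about NA concrete, the sketch below compares an equality test against NA with `is.na()` inside `filter()`. The names `df_na`, `eq_na`, and `only_na` are made up for this example.

```r
library(dplyr)

df_na <- tibble(x = c(1, NA, 3))

# x == NA evaluates to NA for every row, and filter() keeps only rows
# where the condition is TRUE, so no rows come back
eq_na <- filter(df_na, x == NA)
print(nrow(eq_na))

# is.na() returns TRUE or FALSE, so this finds the missing row
only_na <- filter(df_na, is.na(x))
print(nrow(only_na))
```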

### Self Study
1. Write code using filter that will allow you to output diamonds with colors D or E and cuts Good or Very Good
2. Write code using filter that will allow you to output diamonds with even-numbered prices

## 4.2 Arrange
Useful for ordering rows!

In [None]:
arrange(dm, clarity, color)[1:20,] 
# this sorts first by clarity, and then by color

Missing values are always sorted at the end:

In [None]:
df = tibble(x = c(5, NA, 2))
arrange(df, x)

In [None]:
arrange(df, desc(x))

### Exercise 7
Use arrange to sort the dm dataset in descending order of the product of the x, y, and z variables. Output the first 20 rows of the new dataset.

Next week, we'll look at `select`, `mutate`, and `summarise`!