
# Lab 6: Pivoting for Longer and Wider Tables
## March 8th, 2022

Welcome back!


In [None]:
library(tidyverse)
install.packages("dslabs")
library(dslabs)

# 1. Preliminaries

In [None]:
# create a tibble from a dataframe
irisTibble <- as_tibble(iris)

In [None]:
# initialize tibble from vectors

tib1 <- tibble(
  rank = 1:6,
  names = c("mercury", "venus", "terra", "mars", "jupiter", "saturn"),
  inhabited = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
tib1

In [None]:
# tibble from data entry
tib2 <- tribble(
  ~index, ~language, ~phrase,
  1, "swedish", "hej, världen!",
  2, "czech", "ahoj, světe!",
  3, "irish", "dia duit, domhan!",
  4, "portuguese", "olá, mundo!" 
)
tib2

In [None]:
# non-syntactic names
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
annoying

In [None]:
annoying[[2]]; annoying %>% select(`2`)
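
Non-syntactic names need backticks whenever they are used as symbols. The sketch below uses a small made-up tibble (`toy` and its values are illustrative, not part of the lab data) to show a few ways of reaching such a column, and how `rename()` can replace the awkward names.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data with non-syntactic column names
toy <- tibble(`1` = 1:3, `2` = c(10, 20, 30))

# Backticks are required when the name is used as a symbol
by_dollar <- toy$`2`

# A quoted string works without backticks
by_name <- toy[["2"]]

# rename() swaps in syntactic names so later code can drop the backticks
renamed <- toy %>% rename(x = `1`, y = `2`)
names(renamed)
```

Renaming early is usually less error-prone than carrying backticks through a whole analysis.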

## Tidy Data

"Tidy datasets are all alike, but every messy dataset is messy in its own way." (Hadley Wickham)

Rules of tidy data:

1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell

Real-world data is usually messy, because the people who enter or curate it often don't have the data analyst in mind.

Pivoting helps with the following problems:
- One variable spread across multiple columns
- One observation scattered across multiple rows

# 2. Longer Tables

`pivot_longer` takes 4 main arguments:
* `data`: data frame to pivot
* `cols`: columns to lengthen (the remaining columns are kept as they are, with their values repeated in each new row)
* `names_to`: name of the new variable to which old column names (`cols`) get moved
* `values_to`: name of the variable to which pivoted column values get moved

In [None]:
# example: column names here are really values of a variable, "year"
grades_wide <- tribble(
  ~name, ~Sex, ~`2015`, ~`2016`, ~`2017`,
     'Wu',  'M', 83,      89,      93,
  'Alice',  'F', 92,      90,      93,
 'Jordan',   NA, 80,      87,      99,
 'Gilberto','M', 67,      90,      92)
grades_wide

In [None]:
grades_long <- grades_wide %>% 
  pivot_longer(
    -c(name, Sex),
    names_to = "year",
    values_to = "grades")
grades_long
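
Because the new `year` values come from column names, `pivot_longer` stores them as character strings. A minimal sketch, assuming tidyr 1.1.0 or later, uses the `names_transform` argument to convert them while pivoting. The tibble `toy_wide` and the column name `grade` are made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical two-student version of the grades table
toy_wide <- tribble(
  ~name,   ~`2015`, ~`2016`,
  "Wu",    83,      89,
  "Alice", 92,      90
)

# All four main arguments are spelled out; names_transform turns the
# former column names into integers instead of character strings
toy_long <- pivot_longer(
  data = toy_wide,
  cols = c(`2015`, `2016`),
  names_to = "year",
  values_to = "grade",
  names_transform = list(year = as.integer)
)
toy_long
```

Without `names_transform`, `year` would be a character column, and it would need converting before being treated as a number in plots or arithmetic.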

# 3. Wider Tables

`pivot_wider` is roughly the inverse of `pivot_longer` and takes three main arguments:

* `data`: data frame to pivot
* `names_from`: column from which new column names are extracted 
* `values_from`: column from which new values are extracted

In [None]:
# example: the "type" column holds the names of two variables, "cases" and "population"
long_countries <- tribble(
  ~country,      ~year, ~type,           ~count,
  #---------------------------------------------
  "Afghanistan",  1999, "cases",            745,
  "Afghanistan",  1999, "population",  19987071,
  "Afghanistan",  2000, "cases",           2666,
  "Afghanistan",  2000, "population",  20595360,
  "Brazil",       1999, "cases",          37737,
  "Brazil",       1999, "population", 172006362
)

wide_countries <- long_countries %>% pivot_wider(
  names_from = type,
  values_from = count)
wide_countries
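
One reason to widen a table like this is that the two variables end up side by side and can be combined in a single `mutate`. The following sketch uses made-up numbers (`toy_long` and the derived `rate` column are illustrative assumptions) to compute cases per 10,000 people.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Hypothetical long table with the same layout as long_countries
toy_long <- tribble(
  ~country, ~year, ~type,        ~count,
  "A",      1999,  "cases",          10,
  "A",      1999,  "population",   1000,
  "B",      1999,  "cases",          30,
  "B",      1999,  "population",   2000
)

# After widening, cases and population are ordinary columns
toy_rates <- toy_long %>%
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate = cases / population * 10000)
toy_rates
```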

In [None]:
# note: pivot_longer and pivot_wider aren't *exactly* symmetrical:
# after the round trip, year is a character column and the column order changes

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
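
The sketch below makes the asymmetry explicit by comparing column types before and after the round trip, then restoring the original layout by hand. The names `toy_stocks`, `round_trip`, and `restored` are illustrative.

```r
library(tidyr)
library(dplyr)
library(tibble)

toy_stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

round_trip <- toy_stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

class(toy_stocks$year)  # "numeric"
class(round_trip$year)  # "character"

# Undo the differences: convert year back, then restore column and row order
restored <- round_trip %>%
  mutate(year = as.numeric(year)) %>%
  select(year, half, return) %>%
  arrange(year, half)
restored
```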

# 4. Exercises on MLB dataset

In [None]:
mlb <- read_csv('https://raw.githubusercontent.com/enesdilber/stats306_labs/master/lab5/mlb.csv')
mlb %>% head

#### Exercise 1:
Calculate the total Home Run to Fly Ball rate (`HR/FB`) for each team and year, that is, $HR\_FB = \frac{\sum HR_i}{\sum FB_i}$. Keep `division` in the final dataset, so that its columns are `division`, `team`, `year`, and `HR_FB`.

#### Exercise 2:
Convert this to a wide dataset whose variables are `division`, `team`, and one column per year from `2015` through `2018`, with the `HR/FB` rate as the values. Again, make sure `division` is still in the dataset.

#### Exercise 3:
Create a logical variable called `increased` that is `TRUE` when a team's `HR/FB` rate was higher in 2018 than in 2015.

#### Exercise 4:
Calculate the correlation between each year's `HR/FB` rates and the following year's, that is, $\rho_{2015, 2016}, \rho_{2016, 2017}, \rho_{2017, 2018}$.

(Hint: use the `cor` function along with `summarise`.)
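
As a rough illustration of the hint, here is how `cor` can be called inside `summarise` on a wide table. The tibble `toy_wide` and its values are made up and are not the MLB data, so this only shows the pattern, not the answer.

```r
library(dplyr)
library(tibble)

# Hypothetical wide table with one column per year
toy_wide <- tibble(
  team   = c("a", "b", "c", "d"),
  `2015` = c(1, 2, 3, 4),
  `2016` = c(2, 4, 6, 8),
  `2017` = c(4, 3, 2, 1)
)

# Each cor() call compares two year columns across all teams
toy_cors <- toy_wide %>%
  summarise(
    rho_2015_2016 = cor(`2015`, `2016`),
    rho_2016_2017 = cor(`2016`, `2017`)
  )
toy_cors
```

The year columns need backticks here because their names are non-syntactic.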

#### Exercise 5:
Turn `df_wide` (the wide dataset from Exercises 2 and 3, including `increased`) back into a "long" dataset called `df_long`.

#### Exercise 6:
Using `df_long`, create a faceted line plot of the `HR/FB` rate against `year`. Color the lines by `team`, facet by `division`, and set the linetype according to the `increased` variable.