# 6. Merging and Reshaping Data

In this chapter, we continue with some of the ways to manipulate data using the tidyverse packages. In particular, we will look at reshaping and merging data frames in order to get the data in the format we want. When reshaping data, we can convert our data between wide form (more columns, fewer rows) and long form (fewer columns, more rows). For example, we can use these data pivots to put our data into a tidy form. When merging data, we are combining information from multiple data frames into a single data frame. The key idea when merging data is to think about what the common information is between the data frames and which values we want to keep. 

For this chapter, we will use three data sets. The first data set is `covidcases`, which contains the daily case and death counts by county in the United States for 2020. The second is `mobility`, which contains daily mobility estimates by state in 2020. The last is `lockdowndates`, which contains the start and end dates for statewide stay-at-home orders. Take a look at the first few rows of each data frame below and read the documentation for the column descriptions.

In [80]:
library(tidyverse)
library(RforHDSdata)
library(lubridate)
data(covidcases)
data(lockdowndates)
data(mobility)

In [81]:
covidcases <- read.csv("data/covidcases.csv")
head(covidcases)

,date,state,county,daily_deaths,daily_cases
,<chr>,<chr>,<chr>,<int>,<int>
1,2020-03-01,California,Marin,0,1
2,2020-03-01,California,Orange,0,1
3,2020-03-01,Florida,Manatee,0,1
4,2020-03-01,California,Napa,0,1
5,2020-03-01,Washington,Spokane,0,4
6,2020-03-01,California,San Francisco,0,3


In [82]:
head(mobility)

state,date,samples,m50,m50_index
<chr>,<chr>,<int>,<dbl>,<dbl>
Alabama,2020-03-01,267652,10.87194,76.92647
Alabama,2020-03-02,287264,14.34513,98.57353
Alabama,2020-03-03,292018,14.2446,98.25
Alabama,2020-03-04,298704,13.08301,89.69118
Alabama,2020-03-05,288218,14.81503,102.38235
Alabama,2020-03-06,282982,17.94399,126.22059


In [83]:
head(lockdowndates)

State,Lockdown_Start,Lockdown_End
<chr>,<chr>,<chr>
Alabama,2020-04-04,2020-04-30
Alaska,2020-03-28,2020-04-24
Arizona,2020-03-31,2020-05-15
Arkansas,,
California,2020-03-19,2020-08-28
Colorado,2020-03-26,2020-04-26


For each of these data frames, there is a date column. Right now, these columns are stored as character values. We can use the `as.Date()` function to tell R to treat them as dates. We specify the date format through the `format` argument of this function. In our format `%Y-%m-%d`, the `%Y` stands for the full four-digit year, `%m` is a two-digit month (e.g. January is coded "01" vs "1"), and `%d` stands for the two-digit day (e.g. the third day is coded "03" vs "3").

In [84]:
# Convert the character date columns to Date objects (empty strings become NA)
mobility$date <- as.Date(mobility$date, format="%Y-%m-%d")
covidcases$date <- as.Date(covidcases$date, format="%Y-%m-%d")
lockdowndates$Lockdown_Start <- as.Date(lockdowndates$Lockdown_Start, format="%Y-%m-%d")
lockdowndates$Lockdown_End <- as.Date(lockdowndates$Lockdown_End, format="%Y-%m-%d")
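
To see how the format codes behave, here is a minimal base R sketch. The date strings are made up for illustration and are not taken from the chapter's data sets.

```r
# Hypothetical date strings in two different layouts (not from the chapter's data)
iso_dates <- c("2020-03-01", "2020-12-25")
us_dates  <- c("03/01/2020", "12/25/2020")

# %Y = four-digit year, %m = two-digit month, %d = two-digit day
parsed_iso <- as.Date(iso_dates, format = "%Y-%m-%d")
parsed_us  <- as.Date(us_dates,  format = "%m/%d/%Y")

# Both layouts describe the same calendar days
all(parsed_iso == parsed_us)

# A string that does not match the format becomes NA rather than raising an error
as.Date("", format = "%Y-%m-%d")
```

The last line is relevant to `lockdowndates`, where some states have empty strings in the date columns.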

By coding these columns as dates, we can access information such as the day, month, year, or week. The `month()` and `week()` functions used below come from the `lubridate` package, which we loaded above and which has many more capabilities for working with dates and times. Base R offers similar helpers such as `months()`, `weekdays()`, and `format()`.

In [85]:
month(mobility$date[1])
week(mobility$date[1])
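
As a small illustration, the sketch below extracts date components from a single date, first with `lubridate` helpers and then with base R's `format()`. The variable name `d` is only for this example.

```r
library(lubridate)

d <- as.Date("2020-03-01")

# lubridate helpers return numbers
month(d)  # month of the year
year(d)   # four-digit year
week(d)   # complete seven-day periods since January 1, plus one

# Base R alternative: format() returns character strings,
# so wrap it in as.integer() when a number is needed
format(d, "%Y")
as.integer(format(d, "%m"))
```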

## Reshaping Data

The mobility and COVID case data are in tidy form: each observation corresponds to a single row and every column is a single variable. We might consider the lockdown dates to not be in the same form, since each row holds two events. Another way to represent this data would be to have each observation be the start or end of a stay-at-home order.

To reshape our data, we use the `pivot_longer()` function to change the data from what is called **wide form** to **long form**. This kind of pivot involves taking a subset of columns that we will *gather* into a single column while increasing the number of rows in the data set. Before pivoting, we have to think about which columns we are transforming. In our case, we want to take the lockdown start and end columns and create two columns: one column will be whether this is the start or end of a lockdown and the other will be the date. These are called the key and value columns, respectively. The key column will get its values from the names of the columns we are transforming (or the keys) whereas the value column will get its values from the entries in those columns (or the values). 

TODO: image

The `pivot_longer()` function takes in a data table, the columns `cols` that we are pivoting to longer form, the column name `names_to` that will store the data from the previous column names, and the column name `values_to` for the column that will store the information from the columns gathered. In our case, the first column we will name `Lockdown_Event` since it will contain whether each date is the start or end of a lockdown and the second column we will name `Date`. Take a look at the result below.

In [86]:
lockdown_long <- lockdowndates %>%
  pivot_longer(cols=c("Lockdown_Start", "Lockdown_End"), names_to="Lockdown_Event", values_to="Date") %>%
  mutate(Date = as.Date(Date, format="%Y-%m-%d"),
         Lockdown_Event = ifelse(Lockdown_Event=="Lockdown_Start", "Start", "End")) %>%
  na.omit()
head(lockdown_long)

State,Lockdown_Event,Date
<chr>,<chr>,<date>
Alabama,Start,2020-04-04
Alabama,End,2020-04-30
Alaska,Start,2020-03-28
Alaska,End,2020-04-24
Arizona,Start,2020-03-31
Arizona,End,2020-05-15
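
The same reshaping can be sketched on a two-row toy table. This version is one possible variation, not the approach used above: it relies on the `names_prefix` and `values_drop_na` arguments of `pivot_longer()` in place of the `ifelse()` and `na.omit()` steps. The object names `toy_lockdowns` and `toy_long` are made up for this example.

```r
library(tidyr)

# Two rows copied from the lockdown table shown earlier; Arkansas has no dates
toy_lockdowns <- data.frame(
  State = c("Alabama", "Arkansas"),
  Lockdown_Start = as.Date(c("2020-04-04", NA)),
  Lockdown_End = as.Date(c("2020-04-30", NA))
)

toy_long <- pivot_longer(
  toy_lockdowns,
  cols = starts_with("Lockdown_"),  # the columns to gather
  names_to = "Lockdown_Event",      # key column: filled from the old column names
  names_prefix = "Lockdown_",       # strip this prefix, leaving "Start" and "End"
  values_to = "Date",               # value column: filled from the old cell entries
  values_drop_na = TRUE             # drop rows whose value is NA
)
toy_long
```

Each state with dates contributes two rows, one per event, while the Arkansas rows are dropped because their values are missing.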


We can also transform our data in the opposite direction. The function `pivot_wider()` converts data in long form to wide form. This function again takes in the data frame, but now we specify the arguments `names_from` and `values_from`. The former is the column to get the new column names from and the latter is the column the cell values will be taken from. To pivot our lockdown data back to wider form, we specify that `names_from` is the lockdown event and `values_from` is the date itself. Now we are back to the same layout as before, except that the date columns are named `Start` and `End` and the states without lockdown dates were dropped by `na.omit()`.

In [87]:
lockdown_wide <- pivot_wider(lockdown_long, names_from=Lockdown_Event, values_from=Date)
head(lockdown_wide)

State,Start,End
<chr>,<date>,<date>
Alabama,2020-04-04,2020-04-30
Alaska,2020-03-28,2020-04-24
Arizona,2020-03-31,2020-05-15
California,2020-03-19,2020-08-28
Colorado,2020-03-26,2020-04-26
Connecticut,2020-03-23,2020-05-20
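
One detail worth seeing on a small case is what `pivot_wider()` does when a combination is missing from the long data. The sketch below uses hypothetical states "A" and "B" with made-up dates, where "B" has no end date recorded.

```r
library(tidyr)

# Hypothetical long table: state "B" has a start date but no recorded end date
toy_long <- data.frame(
  State = c("A", "A", "B"),
  Lockdown_Event = c("Start", "End", "Start"),
  Date = as.Date(c("2020-04-04", "2020-04-30", "2020-03-28"))
)

# One row per State, one column per distinct value of Lockdown_Event
toy_wide <- pivot_wider(toy_long, names_from = Lockdown_Event, values_from = Date)
toy_wide
```

The missing combination shows up as `NA` in the `End` column for state "B".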


Let's show another example. Suppose that I wanted to create a data frame where the columns corresponded to the number of cases for each state in New England and the rows corresponded to the months. First, I need to filter my data to New England and then summarize my data to find the number of cases per month. I use the `month()` function to be able to group by month and state. Additionally, you can see that I add an `ungroup()` at the end since the summarized output will still be grouped by state (as noted in the message printed below the code).

In [88]:
ne_cases <- covidcases %>% 
  filter(state %in% c("Maine", "Vermont", "New Hampshire", "Connecticut", "Rhode Island",
                                  "Massachusetts")) %>%
  mutate(month = month(date)) %>%
  group_by(state, month) %>%
  summarize(total_cases = sum(daily_cases)) %>%
  ungroup()
head(ne_cases)

[1m[22m`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.


state,month,total_cases
<chr>,<dbl>,<int>
Connecticut,3,3051
Connecticut,4,24187
Connecticut,5,14729
Connecticut,6,4343
Connecticut,7,3257
Connecticut,8,3202
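
The message above mentions the `.groups` argument. As a sketch of that alternative, setting `.groups = "drop"` inside `summarize()` returns an ungrouped result without a separate `ungroup()` call. The toy counts below are made up for illustration.

```r
library(dplyr)

# Hypothetical daily counts for two states over two months
toy_cases <- data.frame(
  state = c("Maine", "Maine", "Maine", "Vermont", "Vermont"),
  month = c(3, 3, 4, 3, 4),
  daily_cases = c(1, 2, 5, 4, 3)
)

toy_monthly <- toy_cases %>%
  group_by(state, month) %>%
  # .groups = "drop" removes all grouping from the summarized output
  summarize(total_cases = sum(daily_cases), .groups = "drop")

toy_monthly
is_grouped_df(toy_monthly)  # check whether any grouping remains
```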


Now, I need to focus on converting this data to wide format. I want a column for each state. This tells me that my `names_from` argument will be `state`. Next, I want each row to have the case values for each state. This tells me that my `values_from` argument will be `total_cases`. The format of this data may not be tidy, but it allows me to quickly compare cases across states.

In [89]:
pivot_wider(ne_cases, names_from=state, values_from=total_cases)

month,Connecticut,Maine,Massachusetts,New Hampshire,Rhode Island,Vermont
<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
3,3051,297,6285,367,416,282
4,24187,797,55505,1776,7118,581
5,14729,1229,34878,2502,5537,113
6,4343,928,11926,1136,2158,226
7,3257,659,8711,801,2015,205
8,3202,616,9642,691,2565,209


## Merging Data with Joins

Above, we saw how to manipulate our current data into a new format. Now, we will see how we can combine our multiple data sources. Merging two data frames is called joining and the function we will use depends on how we want to match between the data frames. The image below shows an overview of the different joins and the video talks through each join type.

**Types of Joins**:
* `left_join(table1,table2,by)`: Joins each row of table1 with all matches in table2 
* `right_join(table1,table2,by)`: Joins each row of table2 with all matches in table1 (the opposite of a left join) 
* `inner_join(table1,table2,by)`: Looks for all matches between rows in table1 and table2. Rows that do not find a match are dropped.  
* `full_join(table1,table2,by)`: Keeps all rows from both tables and joins those that match. Rows that do not find a match will have NA values filled in.   
* `semi_join(table1,table2,by)`: Keeps all rows in table1 that have a match in table2 but does not join to any information from table2.  
* `anti_join(table1,table2,by)`: Keeps all rows in table1 that *do not* have a match in table2 but does not join to any information from table2. The opposite of a semi join.  
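
The join types listed above can be compared on a pair of toy tables. This is a minimal sketch with made-up data: `table1` has ids 1, 2, 3 and `table2` has ids 2, 3, 4, so only ids 2 and 3 match.

```r
library(dplyr)

# Hypothetical tables that share ids 2 and 3
table1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
table2 <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

# Count the rows each join returns
join_rows <- c(
  left  = nrow(left_join(table1, table2, by = "id")),   # ids 1, 2, 3
  right = nrow(right_join(table1, table2, by = "id")),  # ids 2, 3, 4
  inner = nrow(inner_join(table1, table2, by = "id")),  # ids 2, 3
  full  = nrow(full_join(table1, table2, by = "id")),   # ids 1, 2, 3, 4
  semi  = nrow(semi_join(table1, table2, by = "id")),   # ids 2, 3
  anti  = nrow(anti_join(table1, table2, by = "id"))    # id 1
)
join_rows

# Filtering joins keep only the columns of table1
names(semi_join(table1, table2, by = "id"))
```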

TODO: image and video

We will first demonstrate a left join using the `left_join()` function. This function takes in two data tables (table1 and table2) and the columns to match rows by. In a left join, for every row of table1, we look for all matching rows in table2 and add any columns not used to do the matching. Thus, every row in table1 corresponds to at least one entry in the resulting table but possibly more if there are multiple matches. We will use a left join to add the lockdown information to our case data. In this case, the first table will be `covidcases` and we will match by `state`. Since the state column is named differently in the two data frames, we specify that `state` is equivalent to `State` in the `by` argument.

In [90]:
covidcases_full <- left_join(covidcases, lockdowndates, by=c("state"="State"))
head(covidcases_full)

,date,state,county,daily_deaths,daily_cases,Lockdown_Start,Lockdown_End
,<date>,<chr>,<chr>,<int>,<int>,<date>,<date>
1,2020-03-01,California,Marin,0,1,2020-03-19,2020-08-28
2,2020-03-01,California,Orange,0,1,2020-03-19,2020-08-28
3,2020-03-01,Florida,Manatee,0,1,2020-04-02,2020-05-04
4,2020-03-01,California,Napa,0,1,2020-03-19,2020-08-28
5,2020-03-01,Washington,Spokane,0,4,2020-03-24,2020-05-04
6,2020-03-01,California,San Francisco,0,3,2020-03-19,2020-08-28
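
To illustrate the point that a left join can return more rows than table1 when there are multiple matches, here is a small sketch. The second table `state_notes` and its contents are hypothetical and exist only for this example.

```r
library(dplyr)

# One row per county, taken from the first rows of the case data
county_rows <- data.frame(
  state = c("California", "Florida"),
  county = c("Marin", "Manatee")
)

# Hypothetical second table: two rows for California and none for Florida
state_notes <- data.frame(
  State = c("California", "California"),
  note = c("first note", "second note")
)

# The key is called `state` in the first table and `State` in the second
joined <- left_join(county_rows, state_notes, by = c("state" = "State"))
joined
```

The California row appears twice, once per match, and the Florida row is kept with `NA` in the `note` column. The key column keeps the name from the first table.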


These two new columns will allow us to determine whether a given day was during a lockdown. We use the `between()` function to create a new column `lockdown` before dropping the two date columns. For states without a stay-at-home order, the lockdown dates are missing and so `lockdown` will be `NA`. We can check that this column worked as expected by choosing a single county to look at.

In [91]:
covidcases_full <- covidcases_full %>%
  mutate(lockdown = between(date, Lockdown_Start, Lockdown_End)) %>%
  select(-c(Lockdown_Start, Lockdown_End)) 
covidcases_full %>%
  filter(state == "Alabama", county == "Jefferson", date <= as.Date("2020-05-10"))

date,state,county,daily_deaths,daily_cases,lockdown
<date>,<chr>,<chr>,<int>,<int>,<lgl>
2020-03-13,Alabama,Jefferson,0,2,False
2020-03-14,Alabama,Jefferson,0,4,False
2020-03-15,Alabama,Jefferson,0,7,False
2020-03-16,Alabama,Jefferson,0,4,False
2020-03-17,Alabama,Jefferson,0,4,False
2020-03-18,Alabama,Jefferson,0,4,False
2020-03-19,Alabama,Jefferson,0,9,False
2020-03-20,Alabama,Jefferson,0,16,False
2020-03-21,Alabama,Jefferson,0,11,False
2020-03-22,Alabama,Jefferson,0,10,False
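
The behavior of `between()` at the boundaries and with missing dates can be checked on a toy table. This sketch assumes dplyr 1.1.0 or later, where `between()` accepts vector bounds; the third row, with no lockdown dates, is hypothetical.

```r
library(dplyr)

# The first two rows use Alabama's lockdown dates from above;
# the third row is a hypothetical state with no lockdown dates
toy <- data.frame(
  date = as.Date(c("2020-04-03", "2020-04-04", "2020-04-10")),
  Lockdown_Start = as.Date(c("2020-04-04", "2020-04-04", NA)),
  Lockdown_End = as.Date(c("2020-04-30", "2020-04-30", NA))
)

toy <- toy %>%
  mutate(
    lockdown = between(date, Lockdown_Start, Lockdown_End),
    # the same test written with base comparison operators
    lockdown_base = date >= Lockdown_Start & date <= Lockdown_End
  )
toy
```

Both bounds are inclusive, so the start date itself counts as being in lockdown, and rows with missing bounds give `NA`.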


We now want to add in the mobility data. In the last case, we wanted to keep every observation in `covidcases` regardless of whether its state appeared in the `lockdowndates` data frame. Therefore, we used a left join. In this case, we only want to keep observations that have mobility data for that state on that date. This indicates that we want to use an *inner join*. The function `inner_join()` takes in two data tables (table1 and table2) and the columns to match rows by. The function only keeps rows in table1 that match a row in table2. Again, the columns in table2 not used for matching are added to the result. In this case, we match by state and date.

In [92]:
covidcases_full <- inner_join(covidcases_full, mobility, by = c("state", "date")) %>%
  select(-c(samples, m50_index))
head(covidcases_full)

,date,state,county,daily_deaths,daily_cases,lockdown,m50
,<date>,<chr>,<chr>,<int>,<int>,<lgl>,<dbl>
1,2020-03-01,California,Marin,0,1,False,9.7228
2,2020-03-01,California,Orange,0,1,False,9.7228
3,2020-03-01,Florida,Manatee,0,1,False,7.386687
4,2020-03-01,California,Napa,0,1,False,9.7228
5,2020-03-01,Washington,Spokane,0,4,False,2.5035
6,2020-03-01,California,San Francisco,0,3,False,9.7228
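
Because an inner join silently drops unmatched rows, it can help to check which rows would be lost. The sketch below pairs `inner_join()` with `anti_join()` on toy tables; the case counts are made up, and the toy mobility table deliberately leaves out Alaska.

```r
library(dplyr)

# Hypothetical case rows for two states
toy_cases <- data.frame(
  state = c("Alabama", "Alabama", "Alaska"),
  date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01")),
  daily_cases = c(1, 2, 3)
)

# m50 values for Alabama copied from the mobility table shown earlier
toy_mobility <- data.frame(
  state = c("Alabama", "Alabama"),
  date = as.Date(c("2020-03-01", "2020-03-02")),
  m50 = c(10.87194, 14.34513)
)

# Rows that match on both state and date are kept
kept <- inner_join(toy_cases, toy_mobility, by = c("state", "date"))

# Rows of toy_cases with no match, which the inner join drops
dropped <- anti_join(toy_cases, toy_mobility, by = c("state", "date"))

kept
dropped
```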


## Recap Video

TODO: might skip since we have a video up above

## Exercises

1. First, create a data frame `sub_cases` where the columns correspond to the number of cases for California, Michigan, Connecticut, Rhode Island, Ohio, New York, and Massachusetts and the rows correspond to the months. Then, manipulate the mobility data to calculate the total `m50` for each month and merge it with `sub_cases` using the `right_join()` function.

In [93]:
sub_cases <- covidcases %>% 
  filter(state %in% c("California", "Michigan", "Connecticut", "Rhode Island",
                      "Ohio", "New York", "Massachusetts")) %>%
  mutate(month = month(date)) %>%
  group_by(state, month) %>%
  summarize(total_cases = sum(daily_cases)) %>%
  ungroup()
head(sub_cases)

[1m[22m`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.


state,month,total_cases
<chr>,<dbl>,<int>
California,3,8583
California,4,41887
California,5,62644
California,6,119039
California,7,270120
California,8,210270


In [101]:
mob_reshape <- mobility %>% 
                 group_by(state) %>%
                 mutate(month = month(date), total_m50 = sum(m50)) %>%
                 slice(1) %>%
                 filter(state %in% c("California", "Michigan", "Connecticut", "Rhode Island",
                      "Ohio", "New York", "Massachusetts"))

mob_reshape

state,date,samples,m50,m50_index,month,total_m50
<chr>,<date>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
California,2020-03-01,902406,9.7228,312.14545,3,556.0362
Connecticut,2020-03-01,97328,5.006111,59.88889,3,697.8056
Massachusetts,2020-03-01,169972,3.455786,55.92857,3,463.2189
Michigan,2020-03-01,329371,5.409658,71.65789,3,1025.6544
New York,2020-03-01,518922,3.435403,52.14516,3,610.4611
Ohio,2020-03-01,422079,6.32217,62.14773,3,1244.3168
Rhode Island,2020-03-01,28916,4.795833,73.83333,3,579.077


In [95]:
covid_mob_sub <- right_join(sub_cases, mob_reshape, by = c("month", "state"))
head(covid_mob_sub)

state,month,total_cases,date,samples,m50,m50_index,total_m50
<chr>,<dbl>,<int>,<date>,<int>,<dbl>,<dbl>,<dbl>
California,3,8583,2020-03-01,902406,9.7228,312.14545,556.0362
Connecticut,3,3051,2020-03-01,97328,5.006111,59.88889,697.8056
Massachusetts,3,6285,2020-03-01,169972,3.455786,55.92857,463.2189
Michigan,3,7536,2020-03-01,329371,5.409658,71.65789,1025.6544
New York,3,76211,2020-03-01,518922,3.435403,52.14516,610.4611
Ohio,3,2199,2020-03-01,422079,6.32217,62.14773,1244.3168
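
The solution above sums `m50` over all dates within each state, so every state ends up with a single row labeled as month 3. If a total for each month is wanted instead, one possible approach is to group by both state and month before summarizing. The sketch below uses made-up toy data, and the names `toy_mobility`, `toy_cases`, and `mob_monthly` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical mobility rows spanning two months for two states
toy_mobility <- data.frame(
  state = c("Ohio", "Ohio", "Ohio", "Michigan", "Michigan"),
  date = as.Date(c("2020-03-01", "2020-03-02", "2020-04-01",
                   "2020-03-01", "2020-04-01")),
  m50 = c(6, 7, 5, 5, 4)
)

# Hypothetical monthly case totals; Michigan in April is missing on purpose
toy_cases <- data.frame(
  state = c("Ohio", "Ohio", "Michigan"),
  month = c(3, 4, 3),
  total_cases = c(10, 20, 30)
)

# One row per state and month, holding the sum of m50 within that month
mob_monthly <- toy_mobility %>%
  mutate(month = month(date)) %>%
  group_by(state, month) %>%
  summarize(total_m50 = sum(m50), .groups = "drop")

# right_join keeps every row of mob_monthly, matched or not
toy_joined <- right_join(toy_cases, mob_monthly, by = c("state", "month"))
toy_joined
```

Since a right join keeps all rows of the second table, the Michigan row for April appears with `NA` in `total_cases`.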


2. Convert the `sub_cases` and `mob_reshape` data frames to wide format so that there is a column of total cases for each state and a column of total `m50` for each of the selected states in the `mob_reshape` data frame. Use the `left_join()` function to merge them together.

In [96]:
pivot_wider(sub_cases, names_from=state, values_from=total_cases)

month,California,Connecticut,Massachusetts,Michigan,New York,Ohio,Rhode Island
<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
3,8583,3051,6285,7536,76211,2199,416
4,41887,24187,55505,33724,233485,15828,7118
5,62644,14729,34878,15776,65879,17486,5537
6,119039,4343,11926,13650,22567,16276,2158
7,270120,3257,8711,20036,21581,39370,2015
8,210270,3202,9642,21999,19757,31998,2565


In [98]:
mob_reshape %>%
  group_by(month) %>%
  summarize(tot = sum(total_m50))
pivot_wider(mob_reshape, names_from=state, values_from=total_m50)

month,tot
<dbl>,<dbl>
3,5176.57


date,samples,m50,m50_index,month,California,Connecticut,Massachusetts,Michigan,New York,Ohio,Rhode Island
<date>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2020-03-01,902406,9.7228,312.14545,3,556.0362,,,,,,
2020-03-01,97328,5.006111,59.88889,3,,697.8056,,,,,
2020-03-01,169972,3.455786,55.92857,3,,,463.2189,,,,
2020-03-01,329371,5.409658,71.65789,3,,,,1025.654,,,
2020-03-01,518922,3.435403,52.14516,3,,,,,610.4611,,
2020-03-01,422079,6.32217,62.14773,3,,,,,,1244.317,
2020-03-01,28916,4.795833,73.83333,3,,,,,,,579.077
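
The wide table above is sparse because the extra columns (`date`, `samples`, `m50`, `m50_index`) are treated as identifiers, so each state stays on its own row. One way to avoid this, sketched below on made-up data, is to keep only the month, state, and value columns before pivoting, and to use the `names_prefix` argument of `pivot_wider()` so the two sets of state columns do not clash when joined. All object names here are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical monthly summaries in long form, one row per state and month
toy_cases <- data.frame(
  state = c("Ohio", "Ohio", "Michigan", "Michigan"),
  month = c(3, 4, 3, 4),
  total_cases = c(10, 20, 30, 40)
)
toy_m50 <- data.frame(
  state = c("Ohio", "Ohio", "Michigan", "Michigan"),
  month = c(3, 4, 3, 4),
  total_m50 = c(13, 5, 5, 4)
)

# One column per state, prefixed so the source of each column stays clear
cases_wide <- pivot_wider(toy_cases, names_from = state,
                          values_from = total_cases, names_prefix = "cases_")
m50_wide <- pivot_wider(toy_m50, names_from = state,
                        values_from = total_m50, names_prefix = "m50_")

# Both wide tables have one row per month, so month is the join key
wide_joined <- left_join(cases_wide, m50_wide, by = "month")
wide_joined
```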
