# Tidyverse and Data Reshaping

The [tidyverse](https://www.tidyverse.org/) is a collection of R packages that share an underlying design philosophy, grammar, and data structures. Conducting an analysis within the tidyverse can make the process seamless and smooth, as long as you understand a bit of the underlying philosophy that makes it all work together.  

<img src='hex.jpg' height='35%' width='35%'>

One analogy for this shared philosophy is currency. Many countries in the world have their own currency. In the US, for example, as long as you have dollars, you can go from store to store purchasing all kinds of things without any problem. But if you take a trip to Europe and try to use dollars, you won't get very far.  

In the same way, as long as your data is represented in a particular way, you'll cruise through different functions in the tidyverse with no problem. But if your data is in the wrong format, all kinds of problems will come your way.  

We'll first go over some of the basic language of the tidyverse and introduce some of the basic functions. Then
we'll talk about making sure your data is in the right shape so that going between all these functions is seamless.

## Tidyverse Functions

The first step to using functions in the tidyverse is to load it: `library(tidyverse)`. Tidyverse is kind of like a meta package: it's a package that contains many packages of code. So, by loading tidyverse, we automatically load libraries like `ggplot2`.
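To see what actually gets loaded, here is a minimal sketch, assuming the tidyverse package is already installed. `tidyverse_packages()` lists the packages that belong to the tidyverse, and `search()` shows what is attached to the current session. The variable names `pkgs` and `attached` are just for illustration.

```r
# Assumes the tidyverse is installed, e.g. via install.packages("tidyverse")
library(tidyverse)

# Names of all the packages that are part of the tidyverse
pkgs <- tidyverse_packages()
"ggplot2" %in% pkgs

# After library(tidyverse), core packages such as ggplot2 and dplyr are attached
attached <- c("package:ggplot2", "package:dplyr") %in% search()
attached
```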

Within the tidyverse there are many different functions that are useful for all different stages of an analysis. Below are some of the most common ones (note: these functions all happen to come from the `dplyr` package):  

   * `filter()`: Used to subset a data frame, retaining all rows that satisfy your conditions.  
   * `select()`: Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name.
   * `mutate()`: Adds new variables and preserves existing ones.  
   * `group_by()`: Takes an existing data.frame and converts it into a grouped data.frame where operations are performed "by group".  
   * `summarize()`: Creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
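Before moving to the real dataset, here is a small sketch of all five functions on a made-up data frame. The data frame `toy` and its values are hypothetical and are only there to make the behavior easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy <- data.frame(
  patient = c(1, 2, 3, 4),
  condition = c("control", "control", "treatment", "treatment"),
  num_days_sick = c(10, 30, 5, 15)
)

# filter(): keep only the rows where the condition is TRUE
kept <- filter(toy, num_days_sick > 8)

# select(): keep only the named columns
cols <- select(toy, patient, num_days_sick)

# mutate(): keep every existing column and add a new one
wide <- mutate(toy, weeks_sick = num_days_sick / 7)

# group_by() + summarize(): one row per group, one column per summary
by_cond <- summarize(group_by(toy, condition),
                     avg_days_sick = mean(num_days_sick))
by_cond
```

Here `kept` has three rows, `cols` has two columns, `wide` has four columns, and `by_cond` has one row for each of the two conditions.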

### Read in data

In [38]:
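## stringsAsFactors = TRUE keeps text columns as factors (the <fct> columns below);
## since R 4.0 the default is FALSE, which would give <chr> columns instead
d <- read.csv('days_sick_long.csv', stringsAsFactors = TRUE)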
head(d)
summary(d)

| | patient | condition | visit_number | num_days_sick |
|---|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` | `<int>` |
| 1 | 1 | treatment | Visit 1 | 65 |
| 2 | 2 | treatment | Visit 1 | 0 |
| 3 | 3 | treatment | Visit 1 | 34 |
| 4 | 4 | treatment | Visit 1 | 33 |
| 5 | 5 | treatment | Visit 1 | 10 |
| 6 | 6 | treatment | Visit 1 | 39 |


    patient           condition     visit_number  num_days_sick   
 Min.   :   1.0   control  :3000   Visit 1:2000   Min.   :  0.00  
 1st Qu.: 250.8   treatment:3000   Visit 2:2000   1st Qu.:  0.00  
 Median : 500.5                    Visit 3:2000   Median : 23.00  
 Mean   : 500.5                                   Mean   : 29.53  
 3rd Qu.: 750.2                                   3rd Qu.: 49.00  
 Max.   :1000.0                                   Max.   :140.00  

### Use tidyverse functions

**Filter**  

Let's say I want to filter this dataset to keep only those patients who were in the control condition. By now you know how to do it without tidyverse functions (also called the 'base R' way), but let's try doing it with the `filter()` function.

In [46]:
## base r way
new_d = d[d$condition=='control',]
head(new_d)

| | patient | condition | visit_number | num_days_sick |
|---|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` | `<int>` |
| 1001 | 1 | control | Visit 1 | 14 |
| 1002 | 2 | control | Visit 1 | 47 |
| 1003 | 3 | control | Visit 1 | 0 |
| 1004 | 4 | control | Visit 1 | 67 |
| 1005 | 5 | control | Visit 1 | 53 |
| 1006 | 6 | control | Visit 1 | 63 |


In [41]:
## tidyverse way
new_d = filter(d, condition == 'control')
head(new_d)

| | patient | condition | visit_number | num_days_sick |
|---|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` | `<int>` |
| 1 | 1 | control | Visit 1 | 14 |
| 2 | 2 | control | Visit 1 | 47 |
| 3 | 3 | control | Visit 1 | 0 |
| 4 | 4 | control | Visit 1 | 67 |
| 5 | 5 | control | Visit 1 | 53 |
| 6 | 6 | control | Visit 1 | 63 |


One advantage of the tidyverse functions is that they make your code much more *readable*. Quickly looking at the two code blocks above, the word `filter` makes it immediately obvious what's going on in that operation. The same is not true for `d[d$condition=='control',]`. This difference in clarity becomes more extreme as the operations get more complex.
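To make the readability point concrete, here is a sketch with three conditions at once. The data frame `toy` is hypothetical; it only borrows the column names of `d`. In `filter()`, conditions separated by commas are combined with AND.

```r
library(dplyr)

# Hypothetical toy data with the same column names as d
toy <- data.frame(
  patient = 1:6,
  condition = rep(c("control", "treatment"), each = 3),
  visit_number = rep(c("Visit 1", "Visit 2", "Visit 3"), times = 2),
  num_days_sick = c(14, 47, 0, 65, 0, 34)
)

# Base R: every column needs the toy$ prefix, and the trailing comma is easy to miss
base_way <- toy[toy$condition == "control" & toy$num_days_sick > 10 &
                  toy$visit_number != "Visit 3", ]

# dplyr: bare column names, with comma-separated conditions all required to hold
tidy_way <- filter(toy, condition == "control", num_days_sick > 10,
                   visit_number != "Visit 3")

tidy_way
```

Both versions keep the same two rows (patients 1 and 2), but the second one reads much closer to the sentence describing it.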

**Select** 

The `select` function is very handy for keeping or dropping columns. Let's say I only want to have a dataset with columns `patient` and `condition`:

In [42]:
new_d = select(d, patient, condition)
head(new_d)

| | patient | condition |
|---|---|---|
| | `<int>` | `<fct>` |
| 1 | 1 | treatment |
| 2 | 2 | treatment |
| 3 | 3 | treatment |
| 4 | 4 | treatment |
| 5 | 5 | treatment |
| 6 | 6 | treatment |


Or maybe I want to keep the first three columns. The `select` function supports handy selecting syntax, like:

In [43]:
new_d = select(d, patient:visit_number)
head(new_d)

| | patient | condition | visit_number |
|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` |
| 1 | 1 | treatment | Visit 1 |
| 2 | 2 | treatment | Visit 1 |
| 3 | 3 | treatment | Visit 1 |
| 4 | 4 | treatment | Visit 1 |
| 5 | 5 | treatment | Visit 1 |
| 6 | 6 | treatment | Visit 1 |
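The `select` mini-language has a few more tricks worth knowing. Here is a short sketch on a hypothetical data frame `toy` (not the course data) showing selection by position, by name prefix with `starts_with()`, and selecting and renaming in one step.

```r
library(dplyr)

# Hypothetical toy data with the same column names as d
toy <- data.frame(
  patient = 1:3,
  condition = c("control", "treatment", "control"),
  visit_number = c("Visit 1", "Visit 1", "Visit 2"),
  num_days_sick = c(14, 65, 47)
)

# By position: the first three columns
first_three <- select(toy, 1:3)

# By name prefix: every column whose name starts with "num"
num_cols <- select(toy, starts_with("num"))

# Select and rename at once: new_name = old_name
renamed <- select(toy, id = patient, condition)

names(first_three)
names(num_cols)
names(renamed)
```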


It's also easy to drop columns that we don't need. Let's say I want to drop the `condition` variable using the `-` operator:

In [44]:
new_d = select(d, -condition)
head(new_d)

| | patient | visit_number | num_days_sick |
|---|---|---|---|
| | `<int>` | `<fct>` | `<int>` |
| 1 | 1 | Visit 1 | 65 |
| 2 | 2 | Visit 1 | 0 |
| 3 | 3 | Visit 1 | 34 |
| 4 | 4 | Visit 1 | 33 |
| 5 | 5 | Visit 1 | 10 |
| 6 | 6 | Visit 1 | 39 |


**Mutate and ifelse**  
Let's look at creating new columns and add in some conditionality. First, the `mutate()` function is an easy way to create a new column. Let's say I want to add a new column to the data that takes `num_days_sick` and adds on five days for each visit for each patient:

In [45]:
new_d = mutate(d, num_days_sick_five = num_days_sick + 5)
head(new_d)

| | patient | condition | visit_number | num_days_sick | num_days_sick_five |
|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` | `<int>` | `<dbl>` |
| 1 | 1 | treatment | Visit 1 | 65 | 70 |
| 2 | 2 | treatment | Visit 1 | 0 | 5 |
| 3 | 3 | treatment | Visit 1 | 34 | 39 |
| 4 | 4 | treatment | Visit 1 | 33 | 38 |
| 5 | 5 | treatment | Visit 1 | 10 | 15 |
| 6 | 6 | treatment | Visit 1 | 39 | 44 |


I give the name of the new column, then an equals sign, then the operation that computes it. Note that I don't need quotes around any of the column names.  

Let's look at a more practical example. Let's say I want to create a new variable that assigns the value 'high' to a patient at a given visit if that patient's number of days sick is greater than the average number of days sick across the whole dataset, and assigns the value 'low' otherwise.  

For this we'll use the `ifelse` function. The first argument to `ifelse` is a logical expression, the second argument is the value to return where the first argument is true, and the third argument is the value to return where it is false. `ifelse` is powerful because it is *vectorized*, which means it can be applied to an entire logical vector and returns a vector of values equal in length to the logical vector. Let's look at a small example:

In [48]:
v = c(5,2,6,3)
ifelse(v > 3, 'yes', 'no')
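The cell above returns one value for each element of `v`. Here is the same call again as a small sketch, storing the result in a hypothetical variable `result` so its length can be checked too:

```r
v <- c(5, 2, 6, 3)

# v > 3 is the logical vector TRUE FALSE TRUE FALSE
result <- ifelse(v > 3, "yes", "no")
result
# [1] "yes" "no"  "yes" "no"

# The output has the same length as the input
length(result) == length(v)
```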

This is a powerful tool for creating new columns where the values of the new column depend on some values in another column:

In [47]:
avg_days_sick = mean(d$num_days_sick)
new_d = mutate(d, sick_group = ifelse(num_days_sick > avg_days_sick, 'high', 'low'))
head(new_d)

| | patient | condition | visit_number | num_days_sick | sick_group |
|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<fct>` | `<int>` | `<chr>` |
| 1 | 1 | treatment | Visit 1 | 65 | high |
| 2 | 2 | treatment | Visit 1 | 0 | low |
| 3 | 3 | treatment | Visit 1 | 34 | high |
| 4 | 4 | treatment | Visit 1 | 33 | high |
| 5 | 5 | treatment | Visit 1 | 10 | low |
| 6 | 6 | treatment | Visit 1 | 39 | high |


**Operation piping, group_by, and summarize**  

Before we get into the `group_by` and `summarize` functions, I want to introduce a powerful aspect of tidyverse syntax, and that's operation piping. Often in an analysis you'll want to apply a combination of filtering, selecting, making new variables etc. The piping syntax allows you to go from one to the next seamlessly, and it leaves your code quite readable and organized.

As an example, let's say I want to only keep rows where the number of days sick is less than $50$ and then only keep columns `patient` and `num_days_sick`. To do this, I'll need to use the `filter` and `select` functions, and I can chain them together like this:

In [51]:
d %>% ## take the whole dataset d and give it as an input to filter
  filter(num_days_sick < 50) %>% ## filter the dataset and give the filtered dataset as an input to select
  select(patient, num_days_sick) %>% ## select only two columns, and give this condensed dataset to the head function
  head() ## take the output of all the above lines and run it through the head function

| | patient | num_days_sick |
|---|---|---|
| | `<int>` | `<int>` |
| 1 | 2 | 0 |
| 2 | 3 | 34 |
| 3 | 4 | 33 |
| 4 | 5 | 10 |
| 5 | 6 | 39 |
| 6 | 7 | 0 |


This code is read from top to bottom, where the pipe operator (`%>%`) takes the output from one line and feeds it as the first input to the function on the next line. If I want to save the output of all these computations, I assign it to a variable on the first line:

In [52]:
new_d = d %>% ## 'new_d = ' here saves the output of everything that happens below to a variable called new_d
filter(num_days_sick < 50) %>% 
select(patient, num_days_sick) %>% 
head() 

new_d

| | patient | num_days_sick |
|---|---|---|
| | `<int>` | `<int>` |
| 1 | 2 | 0 |
| 2 | 3 | 34 |
| 3 | 4 | 33 |
| 4 | 5 | 10 |
| 5 | 6 | 39 |
| 6 | 7 | 0 |
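To see what the pipe is doing under the hood, it helps to write the same computation without it. The sketch below uses a hypothetical data frame `toy` (not the course data): `x %>% f(y)` is just another way of writing `f(x, y)`, so the piped chain and the nested call give the same result.

```r
library(dplyr)

# Hypothetical toy data with the same column names as d
toy <- data.frame(
  patient = 1:5,
  condition = c("treatment", "treatment", "treatment", "control", "control"),
  num_days_sick = c(65, 0, 34, 80, 10)
)

# Without the pipe: nested calls, read from the inside out
nested <- head(select(filter(toy, num_days_sick < 50), patient, num_days_sick))

# With the pipe: the same steps, read from top to bottom
piped <- toy %>%
  filter(num_days_sick < 50) %>%
  select(patient, num_days_sick) %>%
  head()

# Check that both approaches give the same data frame
identical(nested, piped)
```

The nested version gets harder to read with every extra step, which is the main reason to prefer the pipe.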


Operation piping combined with the `group_by` and `summarize` functions gives us a powerful and intuitive way to compute summaries across groups, like we learned a few weeks ago. Let's say I want to get the average number of days sick for each condition:

In [54]:
d %>% 
group_by(condition) %>% 
summarize(avg_days_sick = mean(num_days_sick))

| condition | avg_days_sick |
|---|---|
| `<fct>` | `<dbl>` |
| control | 46.81033 |
| treatment | 12.241 |


More grouping variables can easily be added to the `group_by` function:

In [55]:
d %>% 
group_by(condition, visit_number) %>% 
summarize(avg_days_sick = mean(num_days_sick))

`summarise()` has grouped output by 'condition'. You can override using the `.groups` argument.


| condition | visit_number | avg_days_sick |
|---|---|---|
| `<fct>` | `<fct>` | `<dbl>` |
| control | Visit 1 | 28.542 |
| control | Visit 2 | 45.204 |
| control | Visit 3 | 66.685 |
| treatment | Visit 1 | 11.147 |
| treatment | Visit 2 | 13.01 |
| treatment | Visit 3 | 12.566 |
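The message printed above is not an error. With two grouping variables, `summarize` drops only the last one, so the result is still grouped by `condition`. Here is a sketch, on a hypothetical data frame `toy`, of using the `.groups` argument to get back an ungrouped data frame, which also silences the message.

```r
library(dplyr)

# Hypothetical toy data with the same column names as d
toy <- data.frame(
  condition = rep(c("control", "treatment"), each = 4),
  visit_number = rep(c("Visit 1", "Visit 2"), times = 4),
  num_days_sick = c(20, 40, 30, 50, 10, 12, 14, 16)
)

out <- toy %>%
  group_by(condition, visit_number) %>%
  summarize(avg_days_sick = mean(num_days_sick), .groups = "drop")

out

# FALSE here: the result is no longer grouped
is_grouped_df(out)
```

Dropping the groups is a sensible default when the summary is the end of the pipeline; leave the grouping in place only if the next step should also run by group.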


That covers the basic tidyverse functions; next up is reshaping, which is where we'll spend most of our time. For a quick reference to everything above, see the data wrangling cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf