# Data Wrangling and Tidyverse  

In this notebook you'll learn principles behind data wrangling and management, including tidying and transforming data to answer questions you might want to ask. 

## Some useful notes

In a Jupyter notebook you can pull up a function's documentation in a popup, just as you can in RStudio. Navigate to a cell or start a new one and enter `?function` as you normally would. A popup will appear.

You should see an Insert dropdown menu and a Run button at the top, which let you add cells and run code or render Markdown in them. These keyboard shortcuts do the same things and are very useful:

- Shift+Enter: Run code or render Markdown in the current cell you're on
- Esc+a: Add a cell above
- Esc+b: Add a cell below
- Esc+dd: Delete a cell

## Package prerequisites 

This workshop requires the following packages: tidyverse (which includes ggplot2, dplyr, purrr, and others); gridExtra, which helps with arranging plots next to each other; ggrepel, which helps with plotting labels; and maps, which provides map data. The cell below also loads pillar, the package that formats how tibbles are printed.

In [None]:
library(tidyverse)
library(gridExtra)
library(ggrepel)
library(maps)
library(pillar)

# Data Frames

- A data frame is another way to organize a collection of rows and columns.
- Internally, it is a list of equal-length vectors, one per column.
- It is similar to a matrix, except data frames allow different data types in different columns.
- We can use the `data.frame()` function to create a data frame from vectors using the following format:

```
dataframe <- data.frame(column_1, column_2, column_3)
```
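
To make the comparison with matrices concrete, here is a minimal sketch (the vectors `ids` and `counts` are made up for illustration). Binding a character vector and a numeric vector into a matrix coerces everything to one type, while a data frame keeps one type per column:

```r
# Hypothetical vectors for illustration
ids <- c("a", "b", "c")
counts <- c(1, 3, 5)

# A matrix holds a single type, so the numbers are coerced to character
m <- cbind(ids, counts)
typeof(m)

# A data frame keeps a separate type for each column
d <- data.frame(ids, counts)
sapply(d, class)
```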

In [None]:
example_df <- data.frame(
    c('a','b','c'), 
    c(1, 3, 5), 
    c(TRUE, TRUE, FALSE))

print(example_df)

Use `names()` or `colnames()` to name columns,  `rownames()` to name rows, or `dimnames()` to assign both column and row names to the data frame:

In [None]:
colnames(example_df) <- c('letters', 'numbers', 'boolean')
rownames(example_df) <- c('first', 'second', '')
print(example_df)

In [None]:
names(example_df) <- c('_letters_', '_numbers_', '_boolean_')
print(example_df)

In [None]:
dimnames(example_df) <- list(c('__first', '__second', '__third'), c('__letters', '__numbers', '__boolean'))
print(example_df)

We can use the `attributes()` and `str()` functions to get some information about our data frame:

In [None]:
attributes(example_df)

In [None]:
str(example_df)

Data frames can be classified into two broad categories: wide format and long format. All data frames shown so far have been presented in wide format. A wide format data frame has each row describe a sample and each column describe a feature. Here is a short example of a data frame in wide format, tabulating counts for three genes in three patients:

In [None]:
wide_df <- data.frame(c("A", "B", "C"), c(1, 1, 2), c(5, 6, 7), c(0, 1, 0))
colnames(wide_df) <- c("id", "gene.1", "gene.2", "gene.3")
wide_df

Long format stacks features on top of one another; each row is the combination of a sample and a feature. One column denotes the feature in question, and another column holds that feature's value:

In [None]:
long_df <- data.frame(c("A", "A", "A", "B", "B", "B", "C", "C", "C"), c("gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3"), c(1, 5, 0, 1, 6, 1, 2, 7, 0))
colnames(long_df) <- c("id", "gene", "count")
long_df

These formats both contain the exact same data but represent it in different ways. Various functions exist to convert between wide and long format and we'll get into this a bit more shortly when discussing the `tidyr` package.  

# Adding columns to a data frame

Let's make a new example dataframe to work with:

In [None]:
patients_1 <- data.frame(
    c('Boo','Rex','Chuckles'), 
    c(1, 3, 5), 
    c('dog', 'dog', 'dog'))
print(patients_1)

As before, we can name the columns with `names()` or `colnames()`, the rows with `rownames()`, or both with `dimnames()`.
Here we will use `names()` to name the columns:

In [None]:
names(patients_1) <- c('name', 'number_of_visits', 'type')
print(patients_1)

We can use the column names to extract a single column using the notation `dataframe$column`, e.g.:

In [None]:
print(patients_1$name)

The `cbind()` function can be used to add more columns to a dataframe:

In [None]:
column_4 <- c(4, 2, 6)
patients_1 <- cbind(patients_1, column_4)
print(patients_1)

We can also rename individual columns of the dataframe using index notation. Let's rename the 4th column we just added:

In [None]:
colnames(patients_1)[4] <- 'age_in_years'
print(patients_1)

We can also use the `dataframe$column` notation to add a new column and name it at the same time:

In [None]:
patients_1$weight_in_pounds <- c(35, 75, 15)
print(patients_1)

Let's use `str()` and `attributes()` functions to look at the structure and attributes of this data frame:

In [None]:
str(patients_1)

In [None]:
attributes(patients_1)

# Data frame merging
- Data is often spread across more than one file. Reading each file into R will result in more than one data frame.
- If the data frames have some common identifying column, we can use that common ID to combine the data frames. 

For example:

In [None]:
print(patients_1)

Let's make another data frame:

In [None]:
patients_2 <- data.frame(
    c('Fluffy', 'Smokey', 'Kitty'), 
    c(1, 1, 2), 
    c('cat', 'dog', 'cat'),
    c(1, 3, 5))
colnames(patients_2) <- c('name', 'number_of_visits', 'type', 'age_in_years')
print(patients_2)

We can use the `merge()` function to combine them:

In [None]:
patients_df <- merge(patients_1, patients_2, all = TRUE)
print(patients_df)

- Using `all = TRUE` keeps every row from both data frames and fills in `NA` where a value is missing (for example, the weight of any of the animals in `patients_2`).
- Using the `all.x = TRUE` argument will return all values in the `patients_1` dataframe, as well as any entries with the same ID column(s) from `patients_2`.

In [None]:
patients_df <- merge(patients_1, patients_2, all.x = TRUE)
print(patients_df)

- Using the `all.y = TRUE` argument will return all values in the `patients_2` dataframe, as well as any entries with the same ID column(s) from `patients_1`.

In [None]:
patients_df <- merge(patients_1, patients_2, all.y = TRUE)
print(patients_df)

You can also specify which columns to join on:

In [None]:
patients_df <- merge(patients_1, patients_2, by = c('name', 'type', 'number_of_visits', 'age_in_years'), all = TRUE)
print(patients_df)
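
As a small sketch of what changes when you join on fewer columns, the example below uses two made-up data frames (`first_clinic` and `second_clinic` are hypothetical, not part of the workshop data). When a shared column is left out of `by`, `merge()` keeps both copies and tells them apart with the default suffixes `.x` and `.y`:

```r
# Hypothetical data frames sharing the columns 'name' and 'visits'
first_clinic  <- data.frame(name = c("Boo", "Rex"), visits = c(1, 3))
second_clinic <- data.frame(name = c("Rex", "Kitty"), visits = c(2, 1))

# Join on 'name' only; 'visits' is not a key, so it appears once per input
# as visits.x (from first_clinic) and visits.y (from second_clinic)
by_name <- merge(first_clinic, second_clinic, by = "name", all = TRUE)
print(by_name)
```

Rows that exist in only one of the inputs get `NA` in the columns that come from the other.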

# Tidying Data

Most datasets are data frames made up of rows and columns. However, describing a data frame only in terms of its rows and columns is not enough; we also need some vocabulary for what they contain.

 * **Variable:** A quantity, quality, or property that can be measured.
 * **Value:** The state of a variable when it is measured.
 * **Observation:** A set of measurements made under similar conditions.
 * **Tabular data:** A set of values, each associated with a variable and an observation.

Tidy data:
 * Each variable is its own column
 * Each observation is its own row
 * Each value is in a single cell
 
Benefits:
 * Easy to manipulate
 * Easy to model
 * Easy to visualize
 * Has a specific and consistent structure
 * Structure makes it easy to tidy other data
 
Cons:
 * Data frame is not as easy to look at

Consider the following tables:

In [None]:
table1 <- data.frame(makemodel=c("audi a4","audi a4","chevrolet corvette","chevrolet corvette","honda civic","honda civic"),
                    year=rep(c(1999,2008),3),
                    cty=c(18,21,15,15,24,25),
                    hwy=c(29,30,23,25,32,36))
table1

This is tidy data, because each variable is a column, each observation is a row, and each value is in a single cell.

Next we will look at some non-tidy data and operations from the **tidyr** package (part of **tidyverse**) to make the data tidy. Note that many of you might be more used to using operations from **reshape2**, like melting and casting. It's a very useful package with more functionality including aggregating data, but syntax with **tidyr** commands is simpler and more intuitive for the purposes of tidying data.

## pivot_wider

We can use the `pivot_wider` function to get our data into wide format (fewer rows, more columns). It accepts the following arguments:
`id_cols` - a set of columns that uniquely identifies each observation; defaults to all columns in your data except those you specify in `names_from` and `values_from`. `names_from` and `values_from` describe which column(s) will be used to name the output columns (`names_from`) and which column(s) will be used to populate the cell values (`values_from`). To see the rest of the optional arguments, use `?pivot_wider`.

In [None]:
?tidyr::pivot_wider

In [None]:
table2 <- tidyr::pivot_wider(table1,
            names_from = year,
            values_from = c(cty, hwy))
table2

## pivot_longer

`pivot_longer` is the inverse transformation of `pivot_wider`. It makes your data longer (more rows, fewer columns). It accepts the following arguments: `cols`, the columns to pivot into longer format; `names_to`, a string specifying the name of the column to create from the data stored in the column names of the input data; and `values_to`, a string specifying the name of the column to create from the data stored in the cell values.

In [None]:
?tidyr::pivot_longer

In [None]:
table3 <- tidyr::pivot_longer(table2,
            cols = !makemodel,
            names_to = 'mpg_year',
            values_to = 'value')
table3
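
To illustrate the "inverse transformation" idea, here is a minimal round-trip sketch on a made-up gene count table (the data frame `wide` and its values are hypothetical). Lengthening and then widening again should give back the original columns and values:

```r
library(tidyr)

# Hypothetical wide table: one row per patient, one column per gene
wide <- data.frame(id = c("A", "B"), gene.1 = c(1, 1), gene.2 = c(5, 6))

# Lengthen: one row per patient/gene combination
long <- pivot_longer(wide, cols = !id, names_to = "gene", values_to = "count")
long

# Widen again: the gene names become column names once more
round_trip <- pivot_wider(long, names_from = gene, values_from = count)

# Compare with the original (pivot_wider returns a tibble, so convert first)
all.equal(as.data.frame(round_trip), wide)
```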


## Separating

`table3` has a `mpg_year` column that actually contains two variables, which we can separate into two columns.

Parameters:
 * table and column/variable that needs to be separated.
 * `into`: columns to split into
 * `sep`: separator value. Can be regexp or positions to split at. If not provided then splits at non-alphanumeric characters.

We can split the new `mpg_year` column we just made using `separate`.

In [None]:
?tidyr::separate

In [None]:
table4 <- tidyr::separate(table3, mpg_year, into = c("mpg", "year"), sep="_")
table4
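
One detail worth knowing: the columns that `separate` creates are character columns, even when they hold numbers such as a year. The sketch below uses a made-up table (`measurements` is hypothetical) to show the `convert` argument of `separate`, which asks for type conversion on the new columns:

```r
library(tidyr)

# Hypothetical long table whose key column mixes a measure and a year
measurements <- data.frame(mpg_year = c("cty_1999", "hwy_2008"), value = c(18, 30))

# Without convert, the new 'year' column stays character
as_text <- separate(measurements, mpg_year, into = c("mpg", "year"), sep = "_")
class(as_text$year)

# With convert = TRUE, 'year' is converted to integer
as_number <- separate(measurements, mpg_year, into = c("mpg", "year"),
                      sep = "_", convert = TRUE)
class(as_number$year)
```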

Use `pivot_wider` to get things tidy again:

In [None]:
table5 <- tidyr::pivot_wider(table4,
            names_from = mpg,
            values_from = value)
table5

## Uniting

Let's say we want to unite `cty` and `hwy` to be one column. We can do this using the `unite` function.

`unite` accepts the following arguments: `col`, the name of the new column; `sep`, the separator to place between values; and the columns to unite, which are indicated using `:` or by listing them out as below.


In [None]:
?tidyr::unite

In [None]:
table6 <- tidyr::unite(table5, col = cty_hwy, cty, hwy, sep='_')
table6

## Piping

The pipe (`%>%`), which comes from magrittr and is made available by dplyr and the rest of tidyverse, lets you chain multiple operations by passing the output of one function directly as the first argument of the next. This avoids naming intermediate results and makes code easier to read. You can think of it this way: `x %>% f(y)` becomes `f(x, y)`, and `x %>% f(y) %>% g(z)` becomes `g(f(x, y), z)`, etc.
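
As a quick sketch of that equivalence, the example below writes the same three steps once as nested calls and once with the pipe. The data frame `small_cars` is made up for illustration:

```r
library(dplyr)

# Hypothetical data frame for illustration
small_cars <- data.frame(model = c("a4", "civic", "corvette"), hwy = c(29, 32, 23))

# Nested calls: read from the inside out
nested <- head(arrange(filter(small_cars, hwy > 25), desc(hwy)), 1)

# Piped calls: read top to bottom; each result becomes the
# first argument of the next function
piped <- small_cars %>%
    filter(hwy > 25) %>%
    arrange(desc(hwy)) %>%
    head(1)

# Both forms run the same operations in the same order
identical(nested, piped)
```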

In [None]:
table_1999 <- tidyr::unite(table5, col = cty_hwy, cty, hwy, sep='_') %>% filter(year == 1999)
table_1999

table_2008 <- tidyr::unite(table5, col = cty_hwy, cty, hwy, sep='_') %>% filter(year == 2008)
table_2008

We can also merge tables using dplyr's join functions (https://dplyr.tidyverse.org/reference/mutate-joins.html):

 * `inner_join()`: keeps only the rows in x that have a match in y.
 * `left_join()`: keeps all rows in x.
 * `right_join()`: keeps all rows in y.
 * `full_join()`: keeps all rows in x or y.

This will combine dataframes based on what column you specify in the `by` argument.

In [None]:
dplyr::left_join(table_1999, table_2008, by = 'makemodel')
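
The difference between the four joins is easiest to see by counting rows. In this sketch the tables `x` and `y` are made up: key `b` appears in both, `a` only in `x`, and `c` only in `y`.

```r
library(dplyr)

# Hypothetical tables for illustration
x <- data.frame(key = c("a", "b"), x_val = c(1, 2))
y <- data.frame(key = c("b", "c"), y_val = c(10, 20))

nrow(inner_join(x, y, by = "key"))  # 1: only 'b' has a match in both tables
nrow(left_join(x, y, by = "key"))   # 2: every row of x
nrow(right_join(x, y, by = "key"))  # 2: every row of y
nrow(full_join(x, y, by = "key"))   # 3: 'a', 'b' and 'c'
```

Rows without a match in the other table get `NA` in the columns that come from that table.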

**join() vs merge()**

1. dplyr's join functions are generally faster than `merge()`, particularly if the data is large.

2. The join functions preserve the original row order of x, whereas `merge()` by default sorts the rows by the column(s) used for the join (`sort = TRUE`).

In [None]:
?dplyr::join

In [None]:
?merge
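
The row-order difference can be checked with a small sketch. The tables `left` and `right` are made up, and the keys in `left` are deliberately not in alphabetical order:

```r
library(dplyr)

# Hypothetical tables for illustration
left  <- data.frame(key = c("c", "a", "b"), left_val = 1:3)
right <- data.frame(key = c("a", "b", "c"), right_val = c(10, 20, 30))

# merge() sorts the result by the key column by default
merge(left, right, by = "key")$key

# left_join() keeps the rows in the order they have in 'left'
left_join(left, right, by = "key")$key
```

If you need `merge()` to skip the sorting step, it has a `sort = FALSE` argument, although the resulting row order is then unspecified.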

## Not all data should be tidy

Some data are better kept in other structures, for example matrices and phylogenetic trees (although `ggtree` and `treeio` have tidy representations that help with annotating trees).

# Transforming (Tidy) Data

Now we know how to get tidy data. At this point we can already start visualizing our data. However in many cases we will need to further transform our data to narrow down variables and observations we are really interested in or to create new variables that are functions of our existing variables and data. This is known as **transforming** data.

 * `filter()` to pick observations (rows) by their values
 * `arrange()` to reorder rows, default is by ascending value
 * `select()` to pick variables (columns) by their names
 * `mutate()` to create new variables with functions of existing variables
 * `summarise()` to collapse many values down to a single summary
 * `group_by()` to set up functions to operate on groups rather than the whole data set
 * `%>%` propagates the output from a function as input to another. eg: x %>% f(y) becomes f(x,y), and x %>% f(y) %>% g(z) becomes g(f(x,y),z).
 
All functions have similar structure:
 1. First argument is data frame
 2. Next arguments describe what to do with data frame using variable names
 3. Result is new data frame
 
We will be working with the **mpg** data frame for the rest of the workshop. It comes with **ggplot2**, which is loaded as part of **tidyverse**.

In [None]:
data(mpg)
head(mpg)

## `filter()` rows/observations

As the name suggests, `filter()` filters rows: it keeps the rows for which every condition is true. The first argument is the name of the data frame, and the next arguments are the conditions to filter by.

In [None]:
?dplyr::filter

In [None]:
# filter out 2 seater cars
table(mpg$class)
no_2seaters <- dplyr::filter(mpg, class != "2seater")
head(no_2seaters)
table(no_2seaters$class)

In [None]:
# filter out audis, chevys, and hondas
mpg %>% dplyr::filter(!manufacturer %in% c("audi","chevrolet","honda")) %>% head
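
When several conditions are passed to `filter()`, they are combined with AND. The sketch below shows this on a made-up data frame (`small_cars` is hypothetical), along with `|` for OR:

```r
library(dplyr)

# Hypothetical data frame for illustration
small_cars <- data.frame(
    model = c("a4", "civic", "corvette", "tahoe"),
    class = c("compact", "subcompact", "2seater", "suv"),
    hwy   = c(29, 32, 23, 17))

# Conditions separated by commas must all be true (AND)
not_2seater_efficient <- filter(small_cars, class != "2seater", hwy > 25)
not_2seater_efficient

# Use | when either condition is enough (OR)
two_seater_or_thirsty <- filter(small_cars, class == "2seater" | hwy < 20)
two_seater_or_thirsty
```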

## `arrange()` rows/observations

`arrange()` changes the order of rows. The first argument is the name of the data frame, and the next arguments are column names (or more complicated expressions) to order by. Columns are sorted in ascending order by default; wrap a column in `desc()` to sort in descending order. Missing values are sorted to the end regardless of which ordering is chosen.

In [None]:
?dplyr::arrange

In [None]:
# arrange/reorder mpg by class
dplyr::arrange(mpg, class) %>% head

In [None]:
# arrange/reorder data frame with 2seaters filtered out by class
# 2seaters does not appear which is as it should be
dplyr::arrange(no_2seaters, class) %>% head

What kinds of cars have the best highway and city gas mileage?

In [None]:
# arrange mpg so that first hwy mileage is by descending order, then cty mileage is by descending order
dplyr::arrange(mpg, desc(hwy), desc(cty)) %>% head

Example of missing data getting placed at bottom.

In [None]:
df <- data.frame(x=c(5,2,NA,6))
df

In [None]:
# arrange df by ascending order, NA will be at bottom
dplyr::arrange(df, x)

In [None]:
# arrange df by descending order, NA will be at bottom
dplyr::arrange(df, desc(x))

In [None]:
# !is.na(x) is FALSE for NA and FALSE sorts before TRUE, so NA comes first;
# rest of the values are unsorted because they are all T for !is.na(x)
dplyr::arrange(df,!is.na(x))

In [None]:
# can also arrange by x to sort the remaining values, here in descending order
dplyr::arrange(df,!is.na(x),desc(x))

## `select()` columns/variables

`select()` picks columns, which is useful for narrowing hundreds or thousands of variables down to the ones you're actually interested in. The first argument is the name of the data frame, and subsequent arguments are the columns to select. You can use `a:b` to select all columns from `a` through `b`, or `-a` to select all columns *except* `a`.

In [None]:
?dplyr::select

In [None]:
# select manufacturer, model, year, cty, hwy
dplyr::select(mpg, manufacturer, model, year, cty, hwy) %>% head

In [None]:
# select all columns model thru hwy
dplyr::select(mpg, model:hwy) %>% head
head(mpg)

In [None]:
# select all columns except cyl thru drv and class
dplyr::select(mpg, -(cyl:drv), -class) %>% head

## `mutate()` to add new variables or `transmute()` to keep only new variables

Adds new columns that are functions of existing columns. First argument is name of data frame, next arguments are of the form `new_column_name = f(existing columns)`.

In [None]:
?dplyr::mutate

In [None]:
# add a new column that takes average mileage between city and highway
dplyr::mutate(mpg, avg_mileage = (cty+hwy)/2) %>% head

In [None]:
# transmute keeps only the columns listed: cty and the new average mileage
dplyr::transmute(mpg,cty,avg_mileage=(cty+hwy)/2) %>% head
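
A column created in `mutate()` can be used by later arguments in the same call. Here is a minimal sketch on a made-up data frame (`small_cars` and the column `hwy_above_avg` are hypothetical):

```r
library(dplyr)

# Hypothetical data frame for illustration
small_cars <- data.frame(cty = c(18, 24), hwy = c(29, 32))

# avg_mileage is created first and then reused to compute hwy_above_avg
with_avg <- mutate(small_cars,
                   avg_mileage = (cty + hwy) / 2,
                   hwy_above_avg = hwy - avg_mileage)
with_avg
```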

## `summarise()` and `group_by()` for grouped summaries

`summarise()` collapses a data frame into a single row of summary values, and `group_by()` changes the unit of analysis from the entire data frame to individual groups, so that `summarise()` returns one row per group.

In [None]:
?dplyr::summarise

In [None]:
?dplyr::group_by

In [None]:
# get average mileage grouped by engine cylinder
m <- dplyr::mutate(mpg, avg_mileage=(cty+hwy)/2)
# the printed output differs between R/RStudio and notebooks (see the note below)
m %>% dplyr::group_by(cyl) %>%
    dplyr::summarise(avg=mean(avg_mileage)) %>%
    head

**Note:** If you look at the output of `group_by` in R/RStudio you will actually be able to see what your groupings are as well as how many of them you have. For example if we did `group_by(mpg, cyl)` the output would include `cyl [4]` which shows that our grouping is by `cyl` and there are 4 groups. Jupyter notebook doesn't display this for reasons having to do with [how data frames are outputted](https://github.com/IRkernel/repr/issues/113). Some other differences exist between how certain objects from **tidyverse** are displayed as well.

In [None]:
dplyr::group_by(m, drv) %>%
    dplyr::summarise(avg=mean(avg_mileage))
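
A single `summarise()` call can compute several summaries at once. The sketch below uses a made-up data frame (`small_cars` is hypothetical) and dplyr's `n()`, which counts the rows in each group:

```r
library(dplyr)

# Hypothetical data frame for illustration
small_cars <- data.frame(
    cyl = c(4, 4, 6, 6, 8),
    hwy = c(30, 32, 26, 24, 17))

# One output row per value of cyl, with three summary columns
by_cyl <- small_cars %>%
    group_by(cyl) %>%
    summarise(n_cars = n(), mean_hwy = mean(hwy), best_hwy = max(hwy))
by_cyl
```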

In [None]:
# df after group_by would show that we have 9 groups
drv_cyl <- dplyr::group_by(m, drv, cyl) %>%
    dplyr::summarise(avg=mean(avg_mileage)) %>%
    dplyr::arrange(desc(avg))
drv_cyl

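After summarising by two grouping variables, the result is still grouped by the first one (`drv`), so a further `summarise()` returns one row per `drv`. You can run `ungroup` to remove the grouping and summarise over all rows instead.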

In [None]:
drv_cyl %>% dplyr::summarise(max=max(avg))

In [None]:
ungroup(drv_cyl) %>% dplyr::summarise(max=max(avg))
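
To check which grouping is still in place, dplyr provides `group_vars()`. The following sketch uses a made-up data frame (`small_cars` is hypothetical) to show that summarising drops only the last grouping variable by default, and that `ungroup()` removes the rest:

```r
library(dplyr)

# Hypothetical data frame for illustration
small_cars <- data.frame(
    drv = c("f", "f", "r", "r"),
    cyl = c(4, 6, 6, 8),
    hwy = c(30, 26, 24, 17))

# Group by two variables, then summarise
by_drv_cyl <- small_cars %>%
    group_by(drv, cyl) %>%
    summarise(avg = mean(hwy))

group_vars(by_drv_cyl)           # still grouped by drv
group_vars(ungroup(by_drv_cyl))  # no grouping variables left
```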