# Data Wrangling and Tidyverse  

In this notebook you'll learn principles behind data wrangling and management, including tidying and transforming data to answer questions you might want to ask. 

## Some useful notes

With Jupyter Notebook you can get a nice popup of function definitions just like you can in RStudio. Simply navigate to a cell or start a new one, and enter `?function_name` as you normally would. A popup will appear.

You should see an Insert dropdown menu and a Run button at the top, which let you add cells and run code or render Markdown in them. The following keyboard shortcuts do the same things:

- Shift+Enter: Run code or render Markdown in the current cell you're on
- Esc+a: Add a cell above
- Esc+b: Add a cell below
- Esc+dd: Delete a cell

## Package prerequisites 

This workshop requires tidyverse (which includes ggplot2, dplyr, purrr, and others), gridExtra, which helps with arranging plots next to each other, ggrepel, which helps with plotting labels, and maps, which provides map data. The cell below also loads pillar, which tibble uses to format printed output.

In [1]:
library(tidyverse)
library(gridExtra)
library(ggrepel)
library(maps)
library(pillar)

“running command 'timedatectl' had status 1”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflic

# Data Frames

- A data frame is another way to organize a collection of rows and columns.
- Under the hood, it is a list of equal-length vectors, one per column.
- It is similar to a matrix, except data frames allow different data types in different columns.
- We can use the `data.frame()` function to create a data frame from vectors using the following format:

```
dataframe <- data.frame(column_1, column_2, column_3)
```

In [2]:
example_df <- data.frame(
    c('a','b','c'), 
    c(1, 3, 5), 
    c(TRUE, TRUE, FALSE))

print(example_df)

  c..a....b....c.. c.1..3..5. c.TRUE..TRUE..FALSE.
1                a          1                 TRUE
2                b          3                 TRUE
3                c          5                FALSE


Use `names()` or `colnames()` to name columns,  `rownames()` to name rows, or `dimnames()` to assign both column and row names to the data frame:

In [3]:
colnames(example_df) <- c('letters', 'numbers', 'boolean')
rownames(example_df) <- c('first', 'second', '')
print(example_df)

       letters numbers boolean
first        a       1    TRUE
second       b       3    TRUE
             c       5   FALSE


In [4]:
names(example_df) <- c('_letters_', '_numbers_', '_boolean_')
print(example_df)

       _letters_ _numbers_ _boolean_
first          a         1      TRUE
second         b         3      TRUE
               c         5     FALSE


In [5]:
dimnames(example_df) <- list(c('__first', '__second', '__third'), c('__letters', '__numbers', '__boolean'))
print(example_df)

         __letters __numbers __boolean
__first          a         1      TRUE
__second         b         3      TRUE
__third          c         5     FALSE
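Column names can also be supplied when the data frame is created, which avoids the auto-generated names seen in the first example. A minimal sketch, where `named_df` is a name introduced here purely for illustration:

```r
# Name the columns at creation time instead of renaming afterwards.
# `named_df` is a hypothetical object, not used elsewhere in this notebook.
named_df <- data.frame(
    letters = c("a", "b", "c"),
    numbers = c(1, 3, 5),
    boolean = c(TRUE, TRUE, FALSE)
)

# Inspect the resulting column names and dimensions
colnames(named_df)
dim(named_df)
```

Naming columns up front tends to be less error-prone than assigning names by position later.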


We can use the `attributes()` and `str()` functions to get some information about our data frame:

In [6]:
attributes(example_df)

In [7]:
str(example_df)

'data.frame':	3 obs. of  3 variables:
 $ __letters: chr  "a" "b" "c"
 $ __numbers: num  1 3 5
 $ __boolean: logi  TRUE TRUE FALSE


Data frames can be classified into two broad categories: wide format and long format. All data frames shown so far have been presented in wide format. A wide format data frame has each row describe a sample and each column describe a feature. Here is a short example of a data frame in wide format, tabulating counts for three genes in three patients:

In [8]:
wide_df <- data.frame(c("A", "B", "C"), c(1, 1, 2), c(5, 6, 7), c(0, 1, 0))
colnames(wide_df) <- c("id", "gene.1", "gene.2", "gene.3")
wide_df

id,gene.1,gene.2,gene.3
<chr>,<dbl>,<dbl>,<dbl>
A,1,5,0
B,1,6,1
C,2,7,0


Long format stacks features on top of one another; each row is the combination of a sample and a feature. One column exists to denote the feature in question, and another column exists to denote that feature's value:

In [9]:
long_df <- data.frame(c("A", "A", "A", "B", "B", "B", "C", "C", "C"), c("gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3"), c(1, 5, 0, 1, 6, 1, 2, 7, 0))
colnames(long_df) <- c("id", "gene", "count")
long_df

id,gene,count
<chr>,<chr>,<dbl>
A,gene.1,1
A,gene.2,5
A,gene.3,0
B,gene.1,1
B,gene.2,6
B,gene.3,1
C,gene.1,2
C,gene.2,7
C,gene.3,0


These formats both contain the exact same data but represent it in different ways. Various functions exist to convert between wide and long format and we'll get into this a bit more shortly when discussing the `tidyr` package.  
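As a preview, here is a sketch of that round trip using `pivot_longer()` and `pivot_wider()` from `tidyr`, both covered later in this notebook. The wide data frame is rebuilt inside the block so it runs on its own; `long_again` and `wide_again` are names chosen for illustration.

```r
library(tidyr)

# Same data as wide_df above, rebuilt so this block is self-contained
wide_df <- data.frame(
    id = c("A", "B", "C"),
    gene.1 = c(1, 1, 2),
    gene.2 = c(5, 6, 7),
    gene.3 = c(0, 1, 0)
)

# Wide -> long: one row per (id, gene) combination
long_again <- pivot_longer(wide_df,
                           cols = starts_with("gene"),
                           names_to = "gene",
                           values_to = "count")

# Long -> wide: one row per id, one column per gene
wide_again <- pivot_wider(long_again,
                          names_from = gene,
                          values_from = count)

long_again
wide_again
```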

# Adding columns to a data frame

Let's make a new example dataframe to work with:

In [10]:
patients_1 <- data.frame(
    c('Boo','Rex','Chuckles'), 
    c(1, 3, 5), 
    c('dog', 'dog', 'dog'))
print(patients_1)

  c..Boo....Rex....Chuckles.. c.1..3..5. c..dog....dog....dog..
1                         Boo          1                    dog
2                         Rex          3                    dog
3                    Chuckles          5                    dog


Use `names()` or `colnames()` to name columns,  `rownames()` to name rows, or `dimnames()` to assign both column and row names to the data frame.
Here we will use `names()` to name the columns:

In [11]:
names(patients_1) <- c('name', 'number_of_visits', 'type')
print(patients_1)

      name number_of_visits type
1      Boo                1  dog
2      Rex                3  dog
3 Chuckles                5  dog


We can use the column names to extract a single column using the notation `dataframe$column`, e.g.:

In [12]:
print(patients_1$name)

[1] "Boo"      "Rex"      "Chuckles"


The `cbind()` function can be used to add more columns to a dataframe:

In [13]:
column_4 <- c(4, 2, 6)
patients_1 <- cbind(patients_1, column_4)
print(patients_1)

      name number_of_visits type column_4
1      Boo                1  dog        4
2      Rex                3  dog        2
3 Chuckles                5  dog        6


We can also rename individual columns of the dataframe using index notation. Let's rename the 4th column we just added:

In [14]:
colnames(patients_1)[4] <- 'age_in_years'
print(patients_1)

      name number_of_visits type age_in_years
1      Boo                1  dog            4
2      Rex                3  dog            2
3 Chuckles                5  dog            6


We can also use the `dataframe$column` notation to add a new column and name it at the same time:

In [15]:
patients_1$weight_in_pounds <- c(35, 75, 15)
print(patients_1)

      name number_of_visits type age_in_years weight_in_pounds
1      Boo                1  dog            4               35
2      Rex                3  dog            2               75
3 Chuckles                5  dog            6               15
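The same `$` notation works for columns computed from existing ones. The sketch below rebuilds a small version of the data frame so it runs on its own; the `visits_per_year` column is a hypothetical example, not part of the workshop data.

```r
# Small stand-in for patients_1, rebuilt so this block is self-contained
patients <- data.frame(
    name = c("Boo", "Rex", "Chuckles"),
    number_of_visits = c(1, 3, 5),
    age_in_years = c(4, 2, 6)
)

# Derive a new column from two existing columns (vectorised, row by row)
patients$visits_per_year <- patients$number_of_visits / patients$age_in_years
print(patients)
```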


Let's use the `str()` and `attributes()` functions to look at the structure and attributes of this data frame:

In [16]:
str(patients_1)

'data.frame':	3 obs. of  5 variables:
 $ name            : chr  "Boo" "Rex" "Chuckles"
 $ number_of_visits: num  1 3 5
 $ type            : chr  "dog" "dog" "dog"
 $ age_in_years    : num  4 2 6
 $ weight_in_pounds: num  35 75 15


In [17]:
attributes(patients_1)

# Data frame merging
- Data is often spread across more than one file, and reading each file into R will result in more than one data frame.
- If the data frames have some common identifying column, we can use that common ID to combine the data frames. 

For example:

In [18]:
print(patients_1)

      name number_of_visits type age_in_years weight_in_pounds
1      Boo                1  dog            4               35
2      Rex                3  dog            2               75
3 Chuckles                5  dog            6               15


Let's make another data frame:

In [19]:
patients_2 <- data.frame(
    c('Fluffy', 'Smokey', 'Kitty'), 
    c(1, 1, 2), 
    c('cat', 'dog', 'cat'),
    c(1, 3, 5))
colnames(patients_2) <- c('name', 'number_of_visits', 'type', 'age_in_years')
print(patients_2)

    name number_of_visits type age_in_years
1 Fluffy                1  cat            1
2 Smokey                1  dog            3
3  Kitty                2  cat            5


We can use the `merge()` function to combine them:

In [20]:
patients_df <- merge(patients_1, patients_2, all = TRUE)
print(patients_df)

      name number_of_visits type age_in_years weight_in_pounds
1      Boo                1  dog            4               35
2 Chuckles                5  dog            6               15
3   Fluffy                1  cat            1               NA
4    Kitty                2  cat            5               NA
5      Rex                3  dog            2               75
6   Smokey                1  dog            3               NA


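- Using `all = TRUE` keeps every row from both data frames and fills in `NA` where a value is missing (for example, the weight of any of the animals in `patients_2`).
- Using the `all.x = TRUE` argument will return all rows in the `patients_1` data frame, plus any matching values from `patients_2` for rows with the same ID column(s). When `by` is not given, `merge()` matches on every column name the two data frames share.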

In [21]:
patients_df <- merge(patients_1, patients_2, all.x = TRUE)
print(patients_df)

      name number_of_visits type age_in_years weight_in_pounds
1      Boo                1  dog            4               35
2 Chuckles                5  dog            6               15
3      Rex                3  dog            2               75


- Using the `all.y = TRUE` argument will return all rows in the `patients_2` data frame, plus any matching values from `patients_1` for rows with the same ID column(s).

In [22]:
patients_df <- merge(patients_1, patients_2, all.y = TRUE)
print(patients_df)

    name number_of_visits type age_in_years weight_in_pounds
1 Fluffy                1  cat            1               NA
2  Kitty                2  cat            5               NA
3 Smokey                1  dog            3               NA


You can also specify which columns to join on:

In [23]:
patients_df <- merge(patients_1, patients_2, by = c('name', 'type', 'number_of_visits', 'age_in_years'), all = TRUE)
print(patients_df)

      name type number_of_visits age_in_years weight_in_pounds
1      Boo  dog                1            4               35
2 Chuckles  dog                5            6               15
3   Fluffy  cat                1            1               NA
4    Kitty  cat                2            5               NA
5      Rex  dog                3            2               75
6   Smokey  dog                1            3               NA
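In the examples above no animal appears in both data frames, so the difference between the `all` settings is easy to miss. The sketch below uses two small hypothetical tables (`visits` and `weights`, with made-up values) that share only some names, joined on a single key column:

```r
# Hypothetical tables that overlap on some, but not all, names
visits <- data.frame(
    name = c("Boo", "Rex", "Chuckles"),
    number_of_visits = c(1, 3, 5)
)
weights <- data.frame(
    name = c("Rex", "Boo", "Kitty"),
    weight_in_pounds = c(75, 35, 8)
)

# Default merge: keeps only names present in both tables
inner <- merge(visits, weights, by = "name")

# all = TRUE: keeps every name and fills the gaps with NA
outer <- merge(visits, weights, by = "name", all = TRUE)

print(inner)
print(outer)
```

With the default settings only Boo and Rex survive, while `all = TRUE` also keeps Chuckles and Kitty with `NA` in the column they lack.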


# Tidying Data

Most datasets are data frames made up of rows and columns. However, describing a data frame just in terms of its rows and columns is not enough.

 * **Variable:** quantity, quality, property that can be measured.
 * **Value:** State of variable when measured.
 * **Observation:** Set of measurements made under similar conditions
 * **Tabular data:** Set of values, each associated with a variable and an observation.

Tidy data:
 * Each variable is its own column
 * Each observation is its own row
 * Each value is in a single cell
 
Benefits:
 * Easy to manipulate
 * Easy to model
 * Easy to visualize
 * Has a specific and consistent structure
 * Structure makes it easy to tidy other data
 
Cons:
 * Data frame is not as easy to look at

Consider the following tables:

In [24]:
table1 <- data.frame(makemodel=c("audi a4","audi a4","chevrolet corvette","chevrolet corvette","honda civic","honda civic"),
                    year=rep(c(1999,2008),3),
                    cty=c(18,21,15,15,24,25),
                    hwy=c(29,30,23,25,32,36))
table1

makemodel,year,cty,hwy
<chr>,<dbl>,<dbl>,<dbl>
audi a4,1999,18,29
audi a4,2008,21,30
chevrolet corvette,1999,15,23
chevrolet corvette,2008,15,25
honda civic,1999,24,32
honda civic,2008,25,36


This is tidy data, because each variable is a column, each observation is a row, and each value is in a single cell.

Next we will look at some non-tidy data and operations from the **tidyr** package (part of **tidyverse**) to make the data tidy. Note that many of you might be more used to using operations from **reshape2**, like melting and casting. It's a very useful package with more functionality including aggregating data, but syntax with **tidyr** commands is simpler and more intuitive for the purposes of tidying data.

## pivot_wider

We can use the `pivot_wider` function to get our data in wide format (fewer rows, more columns). It accepts the following arguments:
`id_cols` - a set of columns that uniquely identifies each observation; defaults to all columns in your data except those you specify in `names_from` and `values_from`. `names_from` and `values_from` describe which column(s) will be used to name the output columns (`names_from`) and which column(s) will be used to populate the cell values (`values_from`). To see the rest of the optional arguments, use `?pivot_wider`.

In [65]:
?tidyr::pivot_wider

In [25]:
table2 <- tidyr::pivot_wider(table1,
            names_from = year,
            values_from = c(cty, hwy))
table2

makemodel,cty_1999,cty_2008,hwy_1999,hwy_2008
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
audi a4,18,21,29,30
chevrolet corvette,15,15,23,25
honda civic,24,25,32,36


## pivot_longer

`pivot_longer` is the inverse transformation of `pivot_wider`. It'll make your data longer (more rows, fewer columns). It accepts the following arguments: `cols` are the columns to pivot into longer format, `names_to` is a string specifying the name of the column to create from the data stored in the column names of the input data, and `values_to` is a string specifying the name of the column to create from the data stored in the cell values.

In [None]:
?tidyr::pivot_longer

In [26]:
table3 <- tidyr::pivot_longer(table2,
            cols = !makemodel,
            names_to = 'mpg_year',
            values_to = 'value')
table3


makemodel,mpg_year,value
<chr>,<chr>,<dbl>
audi a4,cty_1999,18
audi a4,cty_2008,21
audi a4,hwy_1999,29
audi a4,hwy_2008,30
chevrolet corvette,cty_1999,15
chevrolet corvette,cty_2008,15
chevrolet corvette,hwy_1999,23
chevrolet corvette,hwy_2008,25
honda civic,cty_1999,24
honda civic,cty_2008,25
honda civic,hwy_1999,32
honda civic,hwy_2008,36


## Separating

`table3` has a `mpg_year` column that actually contains two variables, which we can separate into two columns.

Parameters:
 * table and column/variable that needs to be separated.
 * `into`: columns to split into
 * `sep`: separator value. Can be a regexp or positions to split at. If not provided, it splits at non-alphanumeric characters.

We can split the new `mpg_year` column we just made using `separate`.

In [67]:
?tidyr::separate

In [27]:
table4 <- tidyr::separate(table3, mpg_year, into = c("mpg", "year"), sep="_")
table4

makemodel,mpg,year,value
<chr>,<chr>,<chr>,<dbl>
audi a4,cty,1999,18
audi a4,cty,2008,21
audi a4,hwy,1999,29
audi a4,hwy,2008,30
chevrolet corvette,cty,1999,15
chevrolet corvette,cty,2008,15
chevrolet corvette,hwy,1999,23
chevrolet corvette,hwy,2008,25
honda civic,cty,1999,24
honda civic,cty,2008,25
honda civic,hwy,1999,32
honda civic,hwy,2008,36
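The pivot and the split can also be done in one step: `pivot_longer()` accepts several names in `names_to` together with a `names_sep` separator. The sketch below rebuilds the wide table so it runs on its own; `long_split` is a name chosen for illustration.

```r
library(tidyr)

# Same values as table2, rebuilt so this block is self-contained
table2 <- data.frame(
    makemodel = c("audi a4", "chevrolet corvette", "honda civic"),
    cty_1999 = c(18, 15, 24),
    cty_2008 = c(21, 15, 25),
    hwy_1999 = c(29, 23, 32),
    hwy_2008 = c(30, 25, 36)
)

# Split each column name at "_" into two new columns while pivoting
long_split <- pivot_longer(table2,
                           cols = !makemodel,
                           names_to = c("mpg", "year"),
                           names_sep = "_",
                           values_to = "value")
long_split
```

Note that `year` is a character column here, just as it is after `separate()`.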


Use `pivot_wider` to get things tidy again:

In [28]:
table5 <- tidyr::pivot_wider(table4,
            names_from = mpg,
            values_from = value)
table5

makemodel,year,cty,hwy
<chr>,<chr>,<dbl>,<dbl>
audi a4,1999,18,29
audi a4,2008,21,30
chevrolet corvette,1999,15,23
chevrolet corvette,2008,15,25
honda civic,1999,24,32
honda civic,2008,25,36


## Uniting

Let's say we want to unite `cty` and `hwy` to be one column. We can do this using the `unite` function.

`unite` accepts the following arguments: `col` is the name of the new column, `sep` is the separator placed between values, and the columns to unite are given either as a range with `:` or by listing them out as below.


In [69]:
?tidyr::unite

In [29]:
table6 <- tidyr::unite(table5, col = cty_hwy, cty, hwy, sep='_')
table6

makemodel,year,cty_hwy
<chr>,<chr>,<chr>
audi a4,1999,18_29
audi a4,2008,21_30
chevrolet corvette,1999,15_23
chevrolet corvette,2008,15_25
honda civic,1999,24_32
honda civic,2008,25_36
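Since `unite` produces a character column, undoing it with `separate` gives character columns too unless you ask for type conversion. A small sketch using the `convert` argument of `separate`, on a hypothetical two-row table (`united` and `split_back` are illustrative names):

```r
library(tidyr)

# Hypothetical united table, rebuilt so this block is self-contained
united <- data.frame(
    makemodel = c("audi a4", "honda civic"),
    cty_hwy = c("18_29", "24_32")
)

# convert = TRUE asks separate() to turn the new columns into suitable types
split_back <- separate(united, cty_hwy,
                       into = c("cty", "hwy"),
                       sep = "_",
                       convert = TRUE)
str(split_back)
```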


## Piping

The tidyverse provides the 'pipe' (`%>%`), which comes from the magrittr package and is re-exported by dplyr. It lets you chain multiple operations, passing the output of one function directly as the first argument of the next. This can save intermediate variables as well as make code easier to read. You can think of it this way: `x %>% f(y)` becomes `f(x, y)`, and `x %>% f(y) %>% g(z)` becomes `g(f(x, y), z)`, etc.

In [30]:
table_1999 <- tidyr::unite(table5, col = cty_hwy, cty, hwy, sep='_') %>% filter(year == 1999)
table_1999

table_2008 <- tidyr::unite(table5, col = cty_hwy, cty, hwy, sep='_') %>% filter(year == 2008)
table_2008

makemodel,year,cty_hwy
<chr>,<chr>,<chr>
audi a4,1999,18_29
chevrolet corvette,1999,15_23
honda civic,1999,24_32


makemodel,year,cty_hwy
<chr>,<chr>,<chr>
audi a4,2008,21_30
chevrolet corvette,2008,15_25
honda civic,2008,25_36
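To make the rewriting rule for the pipe concrete, the sketch below checks that a piped expression and its nested equivalent give the same result. The vector `x` and the result names are arbitrary illustrations.

```r
library(dplyr)  # re-exports %>% from magrittr

x <- c(4, 9, 16)

# Piped form: read left to right
piped <- x %>% sqrt() %>% sum()

# Nested form: read inside out
nested <- sum(sqrt(x))

# Extra arguments go after the piped value: x %>% f(y) is f(x, y)
rounded_piped <- 3.14159 %>% round(2)
rounded_nested <- round(3.14159, 2)

identical(piped, nested)
identical(rounded_piped, rounded_nested)
```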


We can also merge tables using the dplyr join functions (https://dplyr.tidyverse.org/reference/mutate-joins.html):

 * `inner_join()`: keeps only rows whose key appears in both x and y.
 * `left_join()`: includes all rows in x.
 * `right_join()`: includes all rows in y.
 * `full_join()`: includes all rows in x or y.

These combine data frames based on the column(s) you specify in the `by` argument.

In [31]:
dplyr::left_join(table_1999, table_2008, by = 'makemodel')

makemodel,year.x,cty_hwy.x,year.y,cty_hwy.y
<chr>,<chr>,<chr>,<chr>,<chr>
audi a4,1999,18_29,2008,21_30
chevrolet corvette,1999,15_23,2008,15_25
honda civic,1999,24_32,2008,25_36
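In the example above both tables contain the same three cars, so every join type would return the same rows. The sketch below uses two hypothetical tables with only partial overlap (the `toyota corolla` and `ford mustang` rows and their values are made up) to show how the four joins differ:

```r
library(dplyr)

# Hypothetical tables sharing two of their three keys
x <- data.frame(
    makemodel = c("audi a4", "honda civic", "toyota corolla"),
    cty = c(18, 24, 28)
)
y <- data.frame(
    makemodel = c("honda civic", "audi a4", "ford mustang"),
    hwy = c(32, 29, 25)
)

inner <- inner_join(x, y, by = "makemodel")  # keys present in both
left  <- left_join(x, y, by = "makemodel")   # every row of x
right <- right_join(x, y, by = "makemodel")  # every row of y
full  <- full_join(x, y, by = "makemodel")   # every row of x or y

# Compare how many rows each join keeps
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```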


**join() vs merge()**

1. The dplyr `*_join()` functions are generally faster than `merge()`, particularly if the data is large.

2. A dplyr join such as `left_join()` preserves the original row order of `x`, whereas `merge()` by default (`sort = TRUE`) sorts the result on the column(s) used to perform the join.

In [72]:
?dplyr::join

In [71]:
?merge
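The row-order difference can be seen on a tiny example. This is a sketch with hypothetical data (`owners` and `visits`, including the owner names, are invented for illustration):

```r
library(dplyr)

# Hypothetical tables; note that owners is not in alphabetical order
owners <- data.frame(
    name = c("Rex", "Boo", "Chuckles"),
    owner = c("Sam", "Alex", "Jo")
)
visits <- data.frame(
    name = c("Chuckles", "Rex", "Boo"),
    number_of_visits = c(5, 3, 1)
)

# left_join keeps the row order of its first argument
joined <- left_join(owners, visits, by = "name")

# merge sorts on the key column by default (sort = TRUE)
merged <- merge(owners, visits, by = "name")

joined$name
merged$name
```

If row order matters downstream, it is safer to sort explicitly than to rely on either default.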

## Not all data should be tidy

Some data are better kept in other structures, such as matrices and phylogenetic trees (although `ggtree` and `treeio` have tidy representations that help with annotating trees).

# Transforming (Tidy) Data

Now we know how to get tidy data. At this point we can already start visualizing our data. However in many cases we will need to further transform our data to narrow down variables and observations we are really interested in or to create new variables that are functions of our existing variables and data. This is known as **transforming** data.

 * `filter()` to pick observations (rows) by their values
 * `arrange()` to reorder rows, default is by ascending value
 * `select()` to pick variables (columns) by their names
 * `mutate()` to create new variables with functions of existing variables
 * `summarise()` to collapse many values down to a single summary
 * `group_by()` to set up functions to operate on groups rather than the whole data set
 * `%>%` propagates the output from a function as input to another, e.g. `x %>% f(y)` becomes `f(x, y)`, and `x %>% f(y) %>% g(z)` becomes `g(f(x, y), z)`.
 
All functions have similar structure:
 1. First argument is data frame
 2. Next arguments describe what to do with data frame using variable names
 3. Result is new data frame
 
We will be working with the **mpg** data frame for the rest of the workshop. It ships with **ggplot2**, which is loaded as part of the **tidyverse** library.

In [32]:
data(mpg)
head(mpg)

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact


## `filter()` rows/observations

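As the name suggests, `filter()` filters rows: it keeps the rows for which every condition is `TRUE` and drops the rest. The first argument is the name of the data frame, and the next arguments are expressions that filter the data frame.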

In [58]:
?dplyr::filter

In [33]:
# filter out 2 seater cars
table(mpg$class)
no_2seaters <- dplyr::filter(mpg, class != "2seater")
head(no_2seaters)
table(no_2seaters$class)


   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact



   compact    midsize    minivan     pickup subcompact        suv 
        47         41         11         33         35         62 

In [34]:
# filter out audis, chevys, and hondas
mpg %>% dplyr::filter(!manufacturer %in% c("audi","chevrolet","honda")) %>% head

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan
dodge,caravan 2wd,3.0,1999,6,auto(l4),f,17,24,r,minivan
dodge,caravan 2wd,3.3,1999,6,auto(l4),f,16,22,r,minivan
dodge,caravan 2wd,3.3,1999,6,auto(l4),f,16,22,r,minivan
dodge,caravan 2wd,3.3,2008,6,auto(l4),f,17,24,r,minivan
dodge,caravan 2wd,3.3,2008,6,auto(l4),f,17,24,r,minivan
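Several conditions can be combined in one call. As a sketch: conditions separated by commas must all hold (logical AND), while `|` expresses OR. The result names `audi_2008` and `audi_or_honda` are chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

# Comma-separated conditions are combined with AND
audi_2008 <- dplyr::filter(mpg, manufacturer == "audi", year == 2008)

# Use | when either condition may hold
audi_or_honda <- dplyr::filter(mpg, manufacturer == "audi" | manufacturer == "honda")

head(audi_2008)
table(audi_or_honda$manufacturer)
```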


## `arrange()` rows/observations

`arrange()` changes the order of rows. The first argument is the name of the data frame, and the next arguments are column names (or more complicated expressions) to order by. Columns are sorted in ascending order by default; use `desc()` for descending order. Missing values are sorted to the end regardless of which ordering is chosen.

In [73]:
?dplyr::arrange

In [35]:
# arrange/reorder mpg by class
dplyr::arrange(mpg, class) %>% head

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
chevrolet,corvette,5.7,1999,8,manual(m6),r,16,26,p,2seater
chevrolet,corvette,5.7,1999,8,auto(l4),r,15,23,p,2seater
chevrolet,corvette,6.2,2008,8,manual(m6),r,16,26,p,2seater
chevrolet,corvette,6.2,2008,8,auto(s6),r,15,25,p,2seater
chevrolet,corvette,7.0,2008,8,manual(m6),r,15,24,p,2seater
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact


In [36]:
# arrange/reorder data frame with 2seaters filtered out by class
# 2seaters does not appear which is as it should be

dplyr::arrange(no_2seaters, class) %>% head

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact


What kinds of cars have the best highway and city gas mileage?

In [37]:
# arrange mpg so that first hwy mileage is by descending order, then cty mileage is by descending order
dplyr::arrange(mpg, desc(hwy), desc(cty)) %>% head

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
volkswagen,new beetle,1.9,1999,4,manual(m5),f,35,44,d,subcompact
volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact
volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact
toyota,corolla,1.8,2008,4,manual(m5),f,28,37,r,compact
honda,civic,1.8,2008,4,auto(l5),f,25,36,r,subcompact
honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact


Here is an example of missing data getting placed at the bottom:

In [38]:
df <- data.frame(x=c(5,2,NA,6))
df

x
<dbl>
5.0
2.0
NA
6.0


In [39]:
# arrange df by ascending order, NA will be at bottom
dplyr::arrange(df, x)

x
<dbl>
2.0
5.0
6.0
NA


In [40]:
# arrange df by descending order, NA will be at bottom
dplyr::arrange(df, desc(x))

x
<dbl>
6.0
5.0
2.0
NA


In [41]:
# rest of the values are unsorted because they are all T for !is.na(x)
dplyr::arrange(df,!is.na(x))

x
<dbl>
NA
5.0
2.0
6.0


In [42]:
# can additionally arrange by desc(x) to get the remaining values in descending order
dplyr::arrange(df,!is.na(x),desc(x))

x
<dbl>
NA
6.0
5.0
2.0


## `select()` columns/variables

`select()` picks columns, which is useful for narrowing hundreds or thousands of variables down to the ones you're actually interested in. The first argument is the name of the data frame, and subsequent arguments are the columns to select. You can use `a:b` to select all columns between `a` and `b`, or `-a` to select all columns *except* `a`.

In [None]:
?dplyr::select

In [43]:
# select manufacturer, model, year, cty, hwy
dplyr::select(mpg, manufacturer, model, year, cty, hwy) %>% head

manufacturer,model,year,cty,hwy
<chr>,<chr>,<int>,<int>,<int>
audi,a4,1999,18,29
audi,a4,1999,21,29
audi,a4,2008,20,31
audi,a4,2008,21,30
audi,a4,1999,16,26
audi,a4,1999,18,26


In [44]:
# select all columns model thru hwy
dplyr::select(mpg, model:hwy) %>% head
head(mpg)

model,displ,year,cyl,trans,drv,cty,hwy
<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>
a4,1.8,1999,4,auto(l5),f,18,29
a4,1.8,1999,4,manual(m5),f,21,29
a4,2.0,2008,4,manual(m6),f,20,31
a4,2.0,2008,4,auto(av),f,21,30
a4,2.8,1999,6,auto(l5),f,16,26
a4,2.8,1999,6,manual(m5),f,18,26


manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact


In [45]:
# select all columns except cyl thru drv and class
dplyr::select(mpg, -(cyl:drv), -class) %>% head

manufacturer,model,displ,year,cty,hwy,fl
<chr>,<chr>,<dbl>,<int>,<int>,<int>,<chr>
audi,a4,1.8,1999,18,29,p
audi,a4,1.8,1999,21,29,p
audi,a4,2.0,2008,20,31,p
audi,a4,2.0,2008,21,30,p
audi,a4,2.8,1999,16,26,p
audi,a4,2.8,1999,18,26,p
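Beyond names and ranges, `select()` also understands selection helpers. The sketch below shows two of them, `starts_with()` and `where()`; the result names are chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

# Columns whose names start with "c"
c_cols <- dplyr::select(mpg, starts_with("c"))

# manufacturer plus every numeric column
numeric_cols <- dplyr::select(mpg, manufacturer, where(is.numeric))

names(c_cols)
names(numeric_cols)
```

Helpers like these are handy when the columns you want share a naming pattern or a type rather than a position.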


## `mutate()` to add new variables or `transmute()` to keep only new variables

Adds new columns that are functions of existing columns. First argument is name of data frame, next arguments are of the form `new_column_name = f(existing columns)`.

In [74]:
?dplyr::mutate

In [46]:
# add a new column that takes average mileage between city and highway
dplyr::mutate(mpg, avg_mileage = (cty+hwy)/2) %>% head

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,avg_mileage
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<dbl>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,25.0
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,25.5
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,25.5
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,21.0
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact,22.0


In [47]:
# keep only cty and the new average mileage column
dplyr::transmute(mpg,cty,avg_mileage=(cty+hwy)/2) %>% head

cty,avg_mileage
<int>,<dbl>
18,23.5
21,25.0
20,25.5
21,25.5
16,21.0
18,22.0
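Within a single `mutate()` call, later expressions can refer to columns created earlier in the same call. A minimal sketch on a hypothetical two-row data frame (`cars_small` and `hwy_minus_avg` are illustrative names):

```r
library(dplyr)

# Hypothetical two-row data frame so the arithmetic is easy to follow
cars_small <- data.frame(cty = c(18, 21), hwy = c(29, 30))

# avg_mileage is created first and then reused in the same call
out <- dplyr::mutate(cars_small,
                     avg_mileage = (cty + hwy) / 2,
                     hwy_minus_avg = hwy - avg_mileage)
out
```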


## `summarise()` and `group_by()` for grouped summaries

`summarise()` collapses a data frame into a single row (or one row per group), and `group_by()` changes the unit of analysis from the entire data frame to individual groups.

In [76]:
?dplyr::summarise

In [75]:
?dplyr::group_by

In [48]:
# get average mileage grouped by engine cylinder
m <- dplyr::mutate(mpg, avg_mileage=(cty+hwy)/2)
# behavior is actually different in R/RStudio compared to notebooks
m %>% dplyr::group_by(cyl) %>%
    dplyr::summarise(avg=mean(avg_mileage)) %>%
    head

cyl,avg
<int>,<dbl>
4,24.90741
5,24.625
6,19.51899
8,15.1


**Note:** If you look at the output of `group_by` in R/RStudio you will actually be able to see what your groupings are as well as how many of them you have. For example if we did `group_by(mpg, cyl)` the output would include `cyl [4]` which shows that our grouping is by `cyl` and there are 4 groups. Jupyter notebook doesn't display this for reasons having to do with [how data frames are outputted](https://github.com/IRkernel/repr/issues/113). Some other differences exist between how certain objects from **tidyverse** are displayed as well.

In [49]:
dplyr::group_by(m, drv) %>%
    dplyr::summarise(avg=mean(avg_mileage))

drv,avg
<chr>,<dbl>
4,16.75243
f,24.06604
r,17.54


In [50]:
# df after group_by would show that we have 9 groups
drv_cyl <- dplyr::group_by(m, drv, cyl) %>%
    dplyr::summarise(avg=mean(avg_mileage)) %>%
    dplyr::arrange(desc(avg))
drv_cyl

[1m[22m`summarise()` has grouped output by 'drv'. You can override using the `.groups` argument.


drv,cyl,avg
<chr>,<int>,<dbl>
f,4,26.26724
f,5,24.625
4,4,21.47826
r,6,21.25
f,6,21.12791
f,8,20.5
4,6,17.14062
r,8,16.83333
4,8,14.22917


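You can also run `ungroup()` to remove the grouping. The result of `summarise()` above is still grouped by `drv`, so summarising it again gives one row per `drv`; after `ungroup()` the summary covers the whole data frame.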

In [51]:
drv_cyl %>% dplyr::summarise(max=max(avg))

drv,max
<chr>,<dbl>
4,21.47826
f,26.26724
r,21.25


In [52]:
ungroup(drv_cyl) %>% dplyr::summarise(max=max(avg))

max
<dbl>
26.26724
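Instead of calling `ungroup()` afterwards, the `.groups` argument mentioned in the message above can drop the grouping inside `summarise()` itself. A sketch that also computes more than one summary per group (`by_drv_cyl`, `n`, and `avg_hwy` are names chosen for illustration):

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

by_drv_cyl <- mpg %>%
    dplyr::group_by(drv, cyl) %>%
    dplyr::summarise(n = dplyr::n(),        # number of rows in each group
                     avg_hwy = mean(hwy),   # mean highway mileage per group
                     .groups = "drop")      # return an ungrouped result

by_drv_cyl
```

With `.groups = "drop"` the result is a plain ungrouped table, so a later `summarise()` operates on all rows at once.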
