## `dplyr` and `magrittr`

**Author: Ahmed Hasan**

Made for UofT Coders - to be delivered 19/07/2017

<hr>

Together, the R package `dplyr` and the bash-style pipes provided by `magrittr` give us a powerful toolset for data wrangling and manipulation.

`dplyr` is built around five verbs, which cover most of the data manipulation you tend to do. You might need to:

- `select` certain columns of data.
- `filter` your data to select specific rows.
- `arrange` the rows of your data into an order.
- `mutate` your data frame to contain new columns/variables.
- `summarise` chunks of your data in some way.

Lesson adapted from [Stat545](http://stat545.com/block009_dplyr-intro.html) and Hadley Wickham's own [`dplyr` workshop](http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/).
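
As a preview, the five verbs listed above can be chained into a single pipeline. This is only a sketch: it uses the `%>%` pipe that `dplyr` re-exports from `magrittr`, adds `group_by` (not one of the five verbs) so that `summarise` works per species, and the name `verbs_demo` is made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# One pipeline that touches each of the five verbs, using the built-in iris data
verbs_demo <- iris %>%
    select(Species, Sepal.Length, Sepal.Width) %>%       # keep three columns
    filter(Sepal.Length >= 5.0) %>%                      # keep rows meeting a condition
    mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%  # add a new column
    group_by(Species) %>%                                # summarise per species
    summarise(Mean.Sepal.Area = mean(Sepal.Area)) %>%    # collapse to one row per group
    arrange(desc(Mean.Sepal.Area))                       # order the result

verbs_demo
```

Each verb is covered on its own in the sections that follow.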

In [None]:
# install.packages('dplyr')
# install.packages('magrittr')

In [None]:
library(dplyr, warn.conflicts = FALSE)

In [None]:
str(iris)

In [None]:
head(iris)

<center><h2> `iris` is a _tidy_ dataset </h2></center>

<center><h2> `dplyr` is optimized for work with tidy data </h2></center>

<center><h1> `filter` - subsetting rows </h1></center>

In [None]:
filter(iris, Sepal.Length >= 6.5)

In [None]:
filter(iris, Sepal.Length >= 6.5, Species == 'virginica')

In [None]:
filter(iris, Sepal.Length < 5.0, Species %in% c('versicolor', 'setosa'))
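
The calls above separate conditions with commas, which `filter` combines with AND. A short sketch of what that implies, with made-up object names (`with_comma`, `with_and`, `either`) and arbitrary thresholds in the last call: the comma form should match an explicit `&`, while an OR has to be spelled out with `|`.

```r
library(dplyr, warn.conflicts = FALSE)

# Comma-separated conditions are combined with AND ...
with_comma <- filter(iris, Sepal.Length >= 6.5, Species == 'virginica')
with_and   <- filter(iris, Sepal.Length >= 6.5 & Species == 'virginica')
all.equal(with_comma, with_and)

# ... so an OR has to be written explicitly with |
either <- filter(iris, Sepal.Length >= 7.5 | Petal.Length < 1.2)
head(either)
```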

<center><h1> Before we go any further - meet the pipe </h1></center>
<br>
<center>
```
f(x) == x %>% f()

f(x, y) == x %>% f(y)

f(x, y) == y %>% f(x, .)

```
</center>    
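
The three equivalences above can be checked directly. This is a small sketch using `sqrt` and `round` as stand-ins for `f`; the names `eq1` to `eq3` are only for illustration.

```r
library(magrittr)

# f(x) vs x %>% f()
eq1 <- identical(sqrt(16), 16 %>% sqrt())

# f(x, y) vs x %>% f(y)
eq2 <- identical(round(3.14159, 2), 3.14159 %>% round(2))

# f(x, y) vs y %>% f(x, .)  -- the dot marks where the left-hand side goes
eq3 <- identical(round(3.14159, 2), 2 %>% round(3.14159, .))

c(eq1, eq2, eq3)
```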

In [None]:
library(magrittr)

In [None]:
head(iris)

In [None]:
iris %>% head()

In [None]:
head(iris, 3)

In [None]:
iris %>% head(3)

In [None]:
3 %>% head(iris, .)

<center><h1> `select` - subsetting columns </h1></center>

In [None]:
select(iris, Sepal.Length)

In [None]:
select(iris, Sepal.Length, Species) %>% head()

In [None]:
select(iris, contains('Sepal'), Species) %>% head()

In [None]:
select(iris, starts_with('Sepal'), ends_with('Length')) %>% head()

In [None]:
# renaming with select

select(iris, SL = Sepal.Length, SW = Sepal.Width) %>% head()
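
Renaming inside `select` also drops every column that was not selected. If we want to keep the rest, `dplyr` has a separate `rename` function (not used elsewhere in this lesson). A sketch comparing the two, with illustrative object names:

```r
library(dplyr, warn.conflicts = FALSE)

renamed_subset <- select(iris, SL = Sepal.Length, SW = Sepal.Width)
renamed_all    <- rename(iris, SL = Sepal.Length, SW = Sepal.Width)

names(renamed_subset)  # only the two selected columns
names(renamed_all)     # all five columns, two of them renamed
```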

Combining `select` and `filter` using the pipe:

In [None]:
class(select(iris, Sepal.Length)) # is a data frame

In [None]:
select(iris, Sepal.Length) %>%
    filter(Sepal.Length > 7.0) # notice lack of first argument here

In [None]:
# could be restructured as -
iris %>%
    select(Sepal.Length) %>%
    filter(Sepal.Length > 7.0)

# almost reads like a recipe!

<center><h1> `arrange` - ordering rows </h1></center>

In [None]:
iris %>%
    select(Sepal.Length) %>%
    filter(Sepal.Length > 7.0) %>%
    arrange(Sepal.Length)

In [None]:
iris %>%
    select(Sepal.Length) %>%
    filter(Sepal.Length > 7.0) %>%
    arrange(desc(Sepal.Length))

In [None]:
head(iris)

In [None]:
# arranging by x *then* y

iris %>%
    select(Petal.Length:Species) %>%
    arrange(Petal.Length, Petal.Width)
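
The tie-breaking behaviour is easier to see on a tiny data frame. The `toy` data below is hypothetical and exists only for this sketch: rows are ordered by `x` first, and `y` only decides the order among rows that share the same `x`.

```r
library(dplyr, warn.conflicts = FALSE)

# A small made-up data frame with ties in x
toy <- data.frame(x = c(2, 1, 2, 1), y = c(1, 4, 3, 2))

by_x_then_y      <- toy %>% arrange(x, y)        # ties in x are ordered by y
by_x_then_desc_y <- toy %>% arrange(x, desc(y))  # y descending within each x

by_x_then_y
by_x_then_desc_y
```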

<center><h1> `mutate` - create new columns </h1></center>

In [None]:
iris %>%
    mutate(Sepal.Area = Sepal.Width * Sepal.Length,
           Petal.Area = Petal.Width * Petal.Length) %>%
    head()

In [None]:
iris %>%
    filter(Species == 'setosa') %>%
    mutate(Length.Diff = Sepal.Length - Petal.Length) %>%
    select(Length.Diff) %>%
    head()
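
Within a single `mutate` call, later expressions can use columns created earlier in the same call. A sketch building on the area columns above; `Area.Ratio` and `areas` are names invented for this example.

```r
library(dplyr, warn.conflicts = FALSE)

areas <- iris %>%
    mutate(Sepal.Area = Sepal.Width * Sepal.Length,
           Petal.Area = Petal.Width * Petal.Length,
           Area.Ratio = Sepal.Area / Petal.Area) %>%  # uses the two new columns
    select(Species, Sepal.Area, Petal.Area, Area.Ratio)

head(areas)
```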

<center><h1> `group_by` and `mutate` - create new variables in a grouped fashion</h1></center>

In [None]:
iris %>%
    group_by(Species) %>%
    mutate(Sepal.Length.Deviation = mean(Sepal.Length) - Sepal.Length) %>%
    head()

In [None]:
iris %>%
    mutate(Sepal.Area = Sepal.Width * Sepal.Length,
           Petal.Area = Petal.Width * Petal.Length) %>%
    group_by(Species) %>%
    mutate(Grouped.Area.Ratio = Sepal.Area / mean(Sepal.Area)) %>%
    head()
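
To see what grouping changes, we can compute the same mean before and after `group_by`. In this sketch (column and object names are illustrative), the ungrouped mean is a single value repeated on every row, while the grouped mean takes one value per species.

```r
library(dplyr, warn.conflicts = FALSE)

compare_means <- iris %>%
    mutate(Overall.Mean = mean(Sepal.Length)) %>%  # mean over all 150 rows
    group_by(Species) %>%
    mutate(Species.Mean = mean(Sepal.Length)) %>%  # mean within each species
    ungroup() %>%
    select(Species, Sepal.Length, Overall.Mean, Species.Mean)

# Count the distinct values in each column
n_distinct(compare_means$Overall.Mean)
n_distinct(compare_means$Species.Mean)
```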

<center><h1> `summarise` - summarise your data</h1></center>
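
Without any grouping, `summarise` collapses the whole data frame into a single row. A minimal sketch, with summary column names chosen just for this example:

```r
library(dplyr, warn.conflicts = FALSE)

overall <- iris %>%
    summarise(Mean.Sepal.Length = mean(Sepal.Length),
              Max.Petal.Length = max(Petal.Length),
              n = n())

overall  # a data frame with exactly one row
```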

<center><h1> `group_by` and `summarise` - summarise your data in a grouped fashion</h1></center>

In [None]:
iris %>%
    group_by(Species) %>%
    summarise(n = n())

In [None]:
iris %>%
    count(Species)
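
`count` is shorthand for the `group_by` plus `summarise(n = n())` pattern. The sketch below checks that the two agree once both results are converted to plain data frames, and shows the `sort` argument of `count`. Object names and the `Sepal.Length > 6` filter are arbitrary choices for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

by_hand  <- iris %>% group_by(Species) %>% summarise(n = n())
by_count <- iris %>% count(Species)

# Compare as plain data frames, ignoring the tibble vs data.frame class difference
all.equal(as.data.frame(by_hand), as.data.frame(by_count))

# count() can also order groups from largest to smallest
sorted_counts <- iris %>%
    filter(Sepal.Length > 6) %>%
    count(Species, sort = TRUE)

sorted_counts
```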

In [None]:
iris %>%
    group_by(Species) %>%
    head()

In [None]:
iris %>%
    group_by(Species) %>%
    summarise(Mean.Sepal.Length = mean(Sepal.Length),
              Mean.Sepal.Width = mean(Sepal.Width))

In [None]:
# group_by adds grouping info 'under the hood'

str(iris)

In [None]:
iris %>%
    group_by(Species) %>%
    str()

In [None]:
# this can be removed if necessary
iris %>%
    group_by(Species) %>%
    ungroup() %>%
    str()
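
Reading `str()` output to find the grouping can be fiddly. As an alternative sketch, `dplyr` also exports `group_vars()` and `is_grouped_df()`, which report the grouping directly. The object names here are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

grouped   <- iris %>% group_by(Species)
ungrouped <- grouped %>% ungroup()

group_vars(grouped)       # names of the grouping columns
is_grouped_df(grouped)    # is grouping info attached?
is_grouped_df(ungrouped)  # and after ungroup()?
```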

<center><h1> other `magrittr` tricks</h1></center>

In [None]:
# tee operator
# passes LHS to RHS for its side effect, but returns LHS

iris.summary <-
    iris %>%
    group_by(Species) %>%
    summarise(Mean.Sepal.Length = mean(Sepal.Length),
              Mean.Sepal.Width = mean(Sepal.Width)) %T>% # tee returns LHS, not the result of plot()
    plot()

In [None]:
# notice that iris.summary holds the summarised data, not the return value of plot()

iris.summary
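
The difference between `%>%` and `%T>%` shows up clearly with a function that is called only for its side effect. In this sketch `cat` stands in for `plot` (both return `NULL`), and `piped` and `teed` are made-up names.

```r
library(magrittr)

# With %>%, we get back whatever the last function returns: cat() returns NULL
piped <- c(1, 4, 9) %>% sqrt() %>% cat("\n")

# With %T>%, cat() still prints, but the left-hand side is what comes back
teed <- c(1, 4, 9) %>% sqrt() %T>% cat("\n")

is.null(piped)
teed
```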

In [None]:
# other trick - compound assignment

sepals.only <- select(iris, contains("Sepal"), Species)
head(sepals.only)

In [None]:
sepals.only <-
    sepals.only %>%
    mutate(Sepal.Area = Sepal.Length * Sepal.Width)

head(sepals.only)

In [None]:
# or we could use the compound assignment operator

sepals.only <- select(iris, contains("Sepal"), Species)

sepals.only %<>%
    mutate(Sepal.Area = Sepal.Length * Sepal.Width)

head(sepals.only)
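
`%<>%` is not specific to data frames. A minimal sketch on a plain numeric vector, where `values` is a hypothetical name:

```r
library(magrittr)

values <- c(1, 4, 9)

# shorthand for: values <- values %>% sqrt()
values %<>% sqrt()

values
```

Because `%<>%` overwrites its left-hand side, it is worth using only when we really do want to replace the original object.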