-
Notifications
You must be signed in to change notification settings - Fork 1.9k
/
dplyr_tutorial.Rmd
257 lines (174 loc) · 8.57 KB
/
dplyr_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
---
layout: page
title: dplyr tutorial
---
```{r options, echo=FALSE}
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```
## What is dplyr?
dplyr is a powerful R-package to transform and summarize tabular data with rows and columns. For another explanation of dplyr see the dplyr package vignette: [Introduction to dplyr](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html)
## Why Is It Useful?
The package contains a set of functions (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.
In addition, dplyr contains a useful function to perform another common task which is the "split-apply-combine" concept. We will discuss that in a little bit.
## How Does It Compare To Using Base Functions R?
If you are familiar with R, you are probably familiar with base R functions such as split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate(). Compared to base functions in R, the functions in dplyr are easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames, instead of just vectors.
## How Do I Get dplyr?
To install dplyr:
```{r, eval=FALSE}
install.packages("dplyr")
```
To load dplyr:
```{r, message=FALSE}
library(dplyr)
```
# Data: Mammals Sleep
The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on github. This data set contains 83 rows and 11 variables.
Download the msleep data set in CSV format from [here](https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv), and then load into R:
```{r}
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
msleep <- read.csv(url)
head(msleep)
```
The columns (in order) correspond to the following:
column name | Description
--- | ---
name | common name
genus | taxonomic rank
vore | carnivore, omnivore or herbivore?
order | taxonomic rank
conservation | the conservation status of the mammal
sleep\_total | total amount of sleep, in hours
sleep\_rem | rem sleep, in hours
sleep\_cycle | length of sleep cycle, in hours
awake | amount of time spent awake, in hours
brainwt | brain weight in kilograms
bodywt | body weight in kilograms
# Important dplyr Verbs To Remember
dplyr verbs | Description
--- | ---
`select()` | select columns
`filter()` | filter rows
`arrange()` | re-order or arrange rows
`mutate()` | create new columns
`summarise()` | summarise values
`group_by()` | allows for group operations in the "split-apply-combine" concept
# dplyr Verbs In Action
The two most basic functions are `select()` and `filter()`, which selects columns and filters rows respectively.
## Selecting Columns Using `select()`
Select a set of columns: the name and the sleep\_total columns.
```{r}
sleepData <- select(msleep, name, sleep_total)
head(sleepData)
```
To select all the columns *except* a specific column, use the "-" (subtraction) operator (also known as negative indexing):
```{r}
head(select(msleep, -name))
```
To select a range of columns by name, use the ":" (colon) operator:
```{r}
head(select(msleep, name:order))
```
To select all columns that start with the character string "sl", use the function `starts_with()`:
```{r}
head(select(msleep, starts_with("sl")))
```
Some additional options to select columns based on a specific criteria include:
1. `ends_with()` = Select columns that end with a character string
2. `contains()` = Select columns that contain a character string
3. `matches()` = Select columns that match a regular expression
4. `one_of()` = Select column names that are from a group of names
## Selecting Rows Using `filter()`
Filter the rows for mammals that sleep a total of more than 16 hours.
```{r}
filter(msleep, sleep_total >= 16)
```
Filter the rows for mammals that sleep a total of more than 16 hours *and* have a body weight of greater than 1 kilogram.
```{r}
filter(msleep, sleep_total >= 16, bodywt >= 1)
```
Filter the rows for mammals in the Perissodactyla and Primates taxonomic order
```{r}
filter(msleep, order %in% c("Perissodactyla", "Primates"))
```
You can use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.
# Pipe Operator: %>%
Before we go any further, let's introduce the pipe operator: %>%. dplyr imports this operator from another package (magrittr).This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
Here's an example you have seen:
```{r}
head(select(msleep, name, sleep_total))
```
Now in this case, we will pipe the msleep data frame to the function that will select two columns (name and sleep\_total) and then pipe the new data frame to the function `head()`, which will return the head of the new data frame.
```{r}
msleep %>%
select(name, sleep_total) %>%
head
```
You will soon see how useful the pipe operator is when we start to combine many functions.
# Back To dplyr Verbs In Action
Now that you know about the pipe operator (%>%), we will use it throughout the rest of this tutorial.
## Arrange Or Re-order Rows Using `arrange()`
To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:
```{r}
msleep %>% arrange(order) %>% head
```
Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by sleep\_total. Finally, show the head of the final data frame:
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, sleep_total) %>%
head
```
Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, sleep_total) %>%
filter(sleep_total >= 16)
```
Something slightly more complicated: same as above, except arrange the rows in the sleep\_total column in a descending order. For this, use the function `desc()`
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, desc(sleep_total)) %>%
filter(sleep_total >= 16)
```
## Create New Columns Using `mutate()`
The `mutate()` function will add new columns to the data frame. Create a new column called rem_proportion, which is the ratio of rem sleep to total amount of sleep.
```{r}
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total) %>%
head
```
You can many new columns using mutate (separated by commas). Here we add a second column called bodywt_grams which is the bodywt column in grams.
```{r}
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total,
bodywt_grams = bodywt * 1000) %>%
head
```
## Create summaries of the data frame using `summarise()`
The `summarise()` function will create summary statistics for a given column in the data frame such as finding the mean. For example, to compute the average number of hours of sleep, apply the `mean()` function to the column sleep\_total and call the summary value avg\_sleep.
```{r}
msleep %>%
summarise(avg_sleep = mean(sleep_total))
```
There are many other summary statistics you could consider such `sd()`, `min()`, `max()`, `median()`, `sum()`, `n()` (returns the length of vector), `first()` (returns first value in vector), `last()` (returns last value in vector) and `n_distinct()` (number of distinct values in vector).
```{r}
msleep %>%
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
```
## Group operations using `group_by()`
The `group_by()` verb is an important function in dplyr. As we mentioned before it's related to concept of "split-apply-combine". We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.
Let's do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.
```{r}
msleep %>%
group_by(order) %>%
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
```