Skip to content

Latest commit

 

History

History
1151 lines (963 loc) · 41.4 KB

hands_on_with_dplyr.md

File metadata and controls

1151 lines (963 loc) · 41.4 KB

Hands-on with dplyr

Dmitry Grapov

Introduction

This is meant to be an introduction to dplyr which covers dplyr basics, gets into a little bit of programming with dplyr and ends with brief mention of some gotchas and a benchmarking comparison to base for the split-apply strategy. You might also find Rstudio's Data Wrangling Cheat Sheet featuring dplyr useful (this is also where I borrowed some of the images used in this tutorial).

The dplyr package from Hadley Wickham is plain awesome. It features consistent and succinct syntax, is computationally fast and getting better with every release. The dplyr package has replaced many common more verbose R idioms which I had to previously rely upon for most common data analysis tasks.

For example, many data analysis tasks involve the procedure of splitting the data set based on a grouping variable and then applying a function to each of the groups (split-apply). Lets say I want to calculate the median values for a few parameters for cars with different numbers of cylinders using the mtcars data set.

Set up the data for the example.

#some data prep
data(mtcars)
data<-mtcars
data$cyl<-factor(data$cyl)

Split-lapply-apply in base:

#select some variable of interest
vars<-c("mpg","wt","qsec")
tmp.data<-data[,colnames(data)%in%vars]

#split the data on the number of cylinders
big.l<-split(tmp.data,data$cyl)

#apply some function of interest to all columns
results<-lapply(big.l, function(x) apply(x,2,median))

#bind results and add splitting info
data.frame(cyl=names(results),do.call("rbind",results))
##   cyl  mpg    wt   qsec
## 4   4 26.0 2.200 18.900
## 6   6 19.7 3.215 18.300
## 8   8 15.2 3.755 17.175

Now the same process using dplyr:

suppressPackageStartupMessages(library(dplyr))

#variables of interest
vars<-c("mpg","wt","qsec")

data %>% group_by(cyl) %>% select(one_of(vars)) %>% summarise_each(funs(median(.)))
## Source: local data frame [3 x 4]
## 
##   cyl  mpg    wt   qsec
## 1   4 26.0 2.200 18.900
## 2   6 19.7 3.215 18.300
## 3   8 15.2 3.755 17.175

Switching from base to dplyr for data manipulation feels a little like this:

base

dplyr

Each of the individual dplyr verbs are discussed in more detail below, but the use of %>% or the pipe operator is worth mentioning now. The %>% operator is imported from magrittr and for the purpose of this tutorial we can simply think of it as then. From the cheatsheet referenced above:


Overview of common dplyr functions and verbs

I highly recommend that you take a look at the dplyr vignetts for more detailed description of all of this packages capabilities. One immediate addition in dplyr you might notice is tbl_df which is a local data frame and mostly behaves like the classical data.frame but is more convenient for working with large data.

tbl_df(mtcars)
## Source: local data frame [32 x 11]
## 
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## ..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...
# control the number of rows
print(tbl_df(mtcars),n=5)
## Source: local data frame [32 x 11]
## 
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3  22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4  21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5  18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## ..  ... ...  ... ...  ...   ...   ... .. ..  ...  ...

You can can make sure all columns are output to the screen using options(dplyr.width = Inf). glimpse is another useful function which is an analogue of str but tries to show you more of the data.

str(mtcars)
## 'data.frame':	32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
glimpse(mtcars)
## Observations: 32
## Variables:
## $ mpg  (dbl) 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  (dbl) 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp (dbl) 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   (dbl) 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat (dbl) 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   (dbl) 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec (dbl) 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   (dbl) 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   (dbl) 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear (dbl) 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb (dbl) 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

For the purpose of this tutorial we will be mostly working with data.frames, however it should be noted that dplyr syntax abstracts away the need to specify the kind of object being manipulated and most everything we will cover can also be applied to interact with a variety of database objects.

The most common dplyr functions also referred to as verbs are as follows (see more in the introduction vignette):

  • filter() and slice()
  • arrange()
  • select() and rename()
  • mutate() and transmute()
  • summarise()
  • group_by()

The following commands will be demonstrated using the hflights data set.

suppressPackageStartupMessages(library(hflights))
## Warning: package 'hflights' was built under R version 3.1.3
(flights <- tbl_df(hflights))
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Which contains 227496 records for 21 variables for flights departing Houston airport for 2011.

Filter

Use filter to keep or select rows matching some criteria or condition(s).

base

flights[flights$Month == 1 & flights$DayofMonth == 1, ]
## Source: local data frame [552 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          1         6     728     840            AA       460
## 3  2011     1          1         6    1631    1736            AA      1121
## 4  2011     1          1         6    1756    2112            AA      1294
## 5  2011     1          1         6    1012    1347            AA      1700
## 6  2011     1          1         6    1211    1325            AA      1820
## 7  2011     1          1         6     557     906            AA      1994
## 8  2011     1          1         6    1824    2106            AS       731
## 9  2011     1          1         6     654    1124            B6       620
## 10 2011     1          1         6    1639    2110            B6       622
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
#could have also used subset
subset(flights, Month == 1 & DayofMonth == 1)
## Source: local data frame [552 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          1         6     728     840            AA       460
## 3  2011     1          1         6    1631    1736            AA      1121
## 4  2011     1          1         6    1756    2112            AA      1294
## 5  2011     1          1         6    1012    1347            AA      1700
## 6  2011     1          1         6    1211    1325            AA      1820
## 7  2011     1          1         6     557     906            AA      1994
## 8  2011     1          1         6    1824    2106            AS       731
## 9  2011     1          1         6     654    1124            B6       620
## 10 2011     1          1         6    1639    2110            B6       622
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
#be wary of using subset programmatically: http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset

dplyr

#comma is the same as an ampersand (&)
filter(flights, Month == 1, DayofMonth == 1)
## Source: local data frame [552 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          1         6     728     840            AA       460
## 3  2011     1          1         6    1631    1736            AA      1121
## 4  2011     1          1         6    1756    2112            AA      1294
## 5  2011     1          1         6    1012    1347            AA      1700
## 6  2011     1          1         6    1211    1325            AA      1820
## 7  2011     1          1         6     557     906            AA      1994
## 8  2011     1          1         6    1824    2106            AS       731
## 9  2011     1          1         6     654    1124            B6       620
## 10 2011     1          1         6    1639    2110            B6       622
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
# use pipe(|) for or
filter(flights, Month == 1 | DayofMonth == 1)
## Source: local data frame [25,769 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

We can also include any of the following operators in filter.

Slice is a variant of filter used to extract rows based on position.

base

flights[1:10,]
## Source: local data frame [10 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

dplyr

slice(flights, 1:10)
## Source: local data frame [10 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Arrange

Order data based on specified columns.

base

flights[order(flights$Month),]
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

dplyr

arrange(flights,Month)
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
#decreasing order
arrange(flights,desc(Month))
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011    12         15         4    2113    2217            AA       426
## 2  2011    12         16         5    2004    2128            AA       426
## 3  2011    12         18         7    2007    2113            AA       426
## 4  2011    12         19         1    2108    2223            AA       426
## 5  2011    12         20         2    2008    2107            AA       426
## 6  2011    12         21         3    2025    2124            AA       426
## 7  2011    12         22         4    2021    2118            AA       426
## 8  2011    12         23         5    2015    2118            AA       426
## 9  2011    12         26         1    2013    2118            AA       426
## 10 2011    12         27         2    2007    2123            AA       426
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
#break ties using more columns
arrange(flights,desc(Month),DayOfWeek)
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011    12         19         1    2108    2223            AA       426
## 2  2011    12         26         1    2013    2118            AA       426
## 3  2011    12          5         1     558     926            AA       466
## 4  2011    12         12         1     609     921            AA       466
## 5  2011    12         19         1     603     913            AA       466
## 6  2011    12         26         1     558     912            AA       466
## 7  2011    12          5         1    1206    1311            AA       865
## 8  2011    12         12         1    1339    1436            AA       865
## 9  2011    12         19         1    1203    1314            AA       865
## 10 2011    12         26         1    1200    1318            AA       865
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Select

Select columns from the data.

base

flights[,colnames(flights)%in%c("Month","DayOfWeek")]
## Source: local data frame [227,496 x 2]
## 
##    Month DayOfWeek
## 1      1         6
## 2      1         7
## 3      1         1
## 4      1         2
## 5      1         3
## 6      1         4
## 7      1         5
## 8      1         6
## 9      1         7
## 10     1         1
## ..   ...       ...

dplyr

select(flights,Month,DayOfWeek)
## Source: local data frame [227,496 x 2]
## 
##    Month DayOfWeek
## 1      1         6
## 2      1         7
## 3      1         1
## 4      1         2
## 5      1         3
## 6      1         4
## 7      1         5
## 8      1         6
## 9      1         7
## 10     1         1
## ..   ...       ...
#select using a dynamic variable
variables<-c("Month","DayOfWeek")
select(flights,one_of(variables))
## Source: local data frame [227,496 x 2]
## 
##    Month DayOfWeek
## 1      1         6
## 2      1         7
## 3      1         1
## 4      1         2
## 5      1         3
## 6      1         4
## 7      1         5
## 8      1         6
## 9      1         7
## 10     1         1
## ..   ...       ...
#remove variables
select(flights,one_of(variables),-Month)
## Source: local data frame [227,496 x 1]
## 
##    DayOfWeek
## 1          6
## 2          7
## 3          1
## 4          2
## 5          3
## 6          4
## 7          5
## 8          6
## 9          7
## 10         1
## ..       ...

Select also provides many regular expression wrappers.

Use rename to change column names.

rename(flights,diverted=Diverted)
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), diverted (int)

Mutate and Transmute

Use theses verbs to create a new column variable which in the case of mutate will be added to the or created as a stand-alone variables transmute.

Lets calculate the wait time based on the difference between ArrTime and DepTime.

base

#transmute like
head(wait<-flights$ArrTime - flights$DepTime)
## [1] 100 100 150 110 102 144
#mutate like
head(flights2<-cbind(flights,wait))
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted wait
## 1      7      13         0                         0  100
## 2      6       9         0                         0  100
## 3      5      17         0                         0  150
## 4      9      22         0                         0  110
## 5      9       9         0                         0  102
## 6      6      13         0                         0  144

dplyr

#stand alone
transmute(flights,diff = ArrTime - DepTime)
## Source: local data frame [227,496 x 1]
## 
##    diff
## 1   100
## 2   100
## 3   150
## 4   110
## 5   102
## 6   144
## 7   150
## 8    99
## 9   111
## 10  110
## ..  ...
#added to data
mutate(flights,diff = ArrTime - DepTime)
## Source: local data frame [227,496 x 22]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int), diff (int)

Many dplyr functions will let you use newly create variables in the same function which is creating the variable in the first place.

transmute(flights,diff = ArrTime - DepTime, ratio = ArrTime/diff, ratio2 = diff/ratio)
## Source: local data frame [227,496 x 3]
## 
##    diff    ratio    ratio2
## 1   100 15.00000  6.666667
## 2   100 15.01000  6.662225
## 3   150 10.01333 14.980027
## 4   110 13.75455  7.997356
## 5   102 14.77451  6.903782
## 6   144 10.43750 13.796407
## 7   150 10.06000 14.910537
## 8    99 14.68687  6.740715
## 9   111 14.00000  7.928571
## 10  110 14.11818  7.791372
## ..  ...      ...       ...

The mutate_each function can be used to apply a function to every column in the dataframe. Lets bin each column into quartiles using the ntile function.

glimpse(mutate_each(flights,funs(ntile(.,n=4))))
## Observations: 227496
## Variables:
## $ Year              (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Month             (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        (int) 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,...
## $ DayOfWeek         (int) 3, 4, 1, 1, 2, 2, 3, 3, 4, 1, 1, 2, 2, 3, 3,...
## $ DepTime           (int) 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2,...
## $ ArrTime           (int) 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ UniqueCarrier     (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ FlightNum         (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ TailNum           (int) 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ ActualElapsedTime (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ AirTime           (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ArrDelay          (int) 1, 1, 1, 3, 2, 2, 2, 1, 4, 4, 4, 3, 1, 2, 1,...
## $ DepDelay          (int) 2, 3, 1, 3, 3, 2, 2, 1, 4, 4, 4, 4, 2, 1, 2,...
## $ Origin            (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Dest              (int) 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Distance          (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ TaxiIn            (int) 3, 3, 2, 4, 4, 3, 4, 3, 4, 3, 4, 1, 3, 2, 3,...
## $ TaxiOut           (int) 2, 1, 3, 4, 1, 2, 3, 2, 4, 4, 4, 2, 2, 3, 1,...
## $ Cancelled         (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CancellationCode  (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Diverted          (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

Here is a list of some other dplyr convenience functions.


Summarise

Carry out a function on the data frame returning a single value.

Lets calculate the range for wait times and use this as an opportunity to involve %>%.

transmute(flights,diff = ArrTime - DepTime) %>% summarise(.,min=min(diff,na.rm=TRUE),max=max(diff,na.rm=TRUE))
## Source: local data frame [1 x 2]
## 
##     min  max
## 1 -2322 1350

Either we just discovered time travel or we should have been referencing the day and time in our calculation of the difference between arrival and departure. Lets count how many calculations we may have screwed up.

(bad<-transmute(flights,diff = ArrTime - DepTime) %>% filter(diff<0) %>% count(.))
## Source: local data frame [1 x 1]
## 
##      n
## 1 2718

An error rate of 1.2 %may or may not be acceptable.

We can use summarise_each to apply a function to every column in the data set. Lets calculate the median and median absolute deviation for every numeric variable.

vars<-colnames(flights)[sapply(flights,is.numeric)]
flights %>% select(one_of(vars)) %>% summarise_each(.,funs(median=median(.,na.rm=TRUE),mad=mad(.,na.rm=TRUE)))
## Source: local data frame [1 x 32]
## 
##   Year_median Month_median DayofMonth_median DayOfWeek_median
## 1        2011            7                16                4
## Variables not shown: DepTime_median (int), ArrTime_median (dbl),
##   FlightNum_median (dbl), ActualElapsedTime_median (dbl), AirTime_median
##   (dbl), ArrDelay_median (dbl), DepDelay_median (int), Distance_median
##   (dbl), TaxiIn_median (dbl), TaxiOut_median (int), Cancelled_median
##   (dbl), Diverted_median (dbl), Year_mad (dbl), Month_mad (dbl),
##   DayofMonth_mad (dbl), DayOfWeek_mad (dbl), DepTime_mad (dbl),
##   ArrTime_mad (dbl), FlightNum_mad (dbl), ActualElapsedTime_mad (dbl),
##   AirTime_mad (dbl), ArrDelay_mad (dbl), DepDelay_mad (dbl), Distance_mad
##   (dbl), TaxiIn_mad (dbl), TaxiOut_mad (dbl), Cancelled_mad (dbl),
##   Diverted_mad (dbl)

Here are some additional functions which can be used with summarise.


Group_by

Break the data sets into groups of rows.

group_by adds the final piece of the puzzle we need to execute the split-apply strategy to our hearts content. This function becomes very powerful when combined with the previously discussed dplyr verbs. For example lets calculate which day of the week has the most cancellations.

flights %>% group_by(DayOfWeek) %>% 
  select(Cancelled)  %>% summarise_each(funs(canceled=sum(.,na.rm=TRUE),
                                             total=n(),
                                             percent_cancelled=round(canceled/total*100,1)))
## Source: local data frame [7 x 4]
## 
##   DayOfWeek canceled total percent_cancelled
## 1         1      344 34360               1.0
## 2         2      369 31649               1.2
## 3         3      396 31926               1.2
## 4         4      616 34902               1.8
## 5         5      663 34972               1.9
## 6         6      272 27629               1.0
## 7         7      313 32058               1.0

We can also use group_by to generate groups using more than one variable. For example lets calculate the median AirTime times by Distance and TailNum.

(slowest<-flights %>% group_by(Distance,TailNum) %>% 
  select(AirTime)  %>% summarise_each(funs(mean(.,na.rm=TRUE))))
## Source: local data frame [44,698 x 3]
## Groups: Distance
## 
##    Distance TailNum  AirTime
## 1        79  N14940 30.00000
## 2        79  N14943 29.00000
## 3        79  N17928 22.00000
## 4       127              NaN
## 5       127  N11106 29.00000
## 6       127  N11107 24.00000
## 7       127  N11155 26.00000
## 8       127  N11187 25.00000
## 9       127  N11189 26.00000
## 10      127  N11535 27.33333
## ..      ...     ...      ...

Lets identify the 3 slowest and fastest planes based on AirTime for some arbitrary Distance.

#choose arbitrary distance
tmp<-slowest %>% na.omit(.) %>% 
  filter(.,Distance==781) %>%
  arrange(AirTime)

#not clear why, but the results can't be bound directly
tmp %>% head(.,3)
## Source: local data frame [3 x 3]
## Groups: Distance
## 
##   Distance TailNum AirTime
## 1      781  N275WN      85
## 2      781  N453WN      87
## 3      781  N725SW      88
tmp %>% tail(.,3) %>% arrange(desc(.))
## Source: local data frame [3 x 3]
## Groups: Distance
## 
##   Distance TailNum AirTime
## 1      781  N474WN     155
## 2      781  N16646     127
## 3      781  N14653     139

A common data analysis task might be to carry out some group-wise normalization or adjustments of the data. For example we may want to calculate the day of the week with the slowest flights, but also adjust for differences between individual planes. To do this we will start by calculating the average speed for each plane.

#calculate speed
flights<-flights %>% mutate(hrs=AirTime/60, speed=Distance/hrs)

(averages<-flights %>%
  group_by(TailNum) %>% 
  select(.,speed) %>%
  summarise_each(funs(mean(.,na.rm=TRUE))) %>% 
  rename(.,mean_speed=speed))
## Source: local data frame [3,320 x 2]
## 
##    TailNum mean_speed
## 1                 NaN
## 2   N0EGMQ   462.9232
## 3   N10156   441.9362
## 4   N10575   413.5374
## 5   N11106   436.8909
## 6   N11107   437.8388
## 7   N11109   434.9298
## 8   N11113   438.9109
## 9   N11119   441.4825
## 10  N11121   436.0297
## ..     ...        ...

Next lets express the overall speed for each plane as ratio to the mean plane speed. To do this we will use one of the powerful join capabilities in dplyr.

We will join with the original data set based on TailNum and calculate the plane-adjusted measure of speed.

right_join(flights,averages,by="TailNum") %>%
  mutate(norm_speed = speed / mean_speed) %>% 
  group_by(DayOfWeek) %>% 
  select(contains("speed")) %>%
  summarise_each(funs(mean(.,na.rm=TRUE)))
## Source: local data frame [7 x 4]
## 
##   DayOfWeek    speed mean_speed norm_speed
## 1         1 419.2984   420.5830  0.9965149
## 2         2 418.2220   420.4509  0.9942130
## 3         3 419.2926   420.4236  0.9968347
## 4         4 420.3597   420.3783  0.9995784
## 5         5 419.9916   420.4688  0.9985305
## 6         6 427.3059   422.7704  1.0107539
## 7         7 423.7673   421.4198  1.0053550

So if things worked out like we expected it looks like Saturday flights are fastest and Tuesday the slowest.

Using dplyr programmatically

Most of the examples up to this point featured using dplyr in interactive mode. However there are variants of nearly every verb which are best suited for use inside other functions. To see what theses are take a look at verb_ versions of each function (e.g. summarise_).

Benchmarking split-apply in base and dplyr

Finally I will wrap with a relatively non-sophisticated benchmarking head-to-head comparison of base and dplyr speed for the almighty split-apply strategy.

Lets set up the data.

rows<-10000
cols<-100
groups<-100
samples<-rows/groups
tmp.data<-data.frame(matrix(rnorm(rows),rows,cols))
tmp.data$group<-rep(1:groups,each=samples)

This data set has 10^{4} rows, 100 columns and 100 groups with 100 samples each.

base

ptm <- Sys.time()
#split the data on the number of cylinders
big.l<-split(tmp.data,tmp.data$group)

#apply some function of interest to all columns
results<-lapply(big.l, function(x) apply(x,2,median))

#bind results and add splitting info
results<-data.frame(group=names(results),do.call("rbind",results))

#elapsed time
(bd<-Sys.time()-ptm )
## Time difference of 0.5802019 secs

dplyr

ptm <- Sys.time()

results<-tmp.data %>% group_by(group) %>% summarise_each(funs(median(.)))

#elapsed time
(ad<-Sys.time()-ptm )
## Time difference of 0.3492 secs

Wow I just saved 0.2310019 seconds of my life!

Create a benchmark visualization comparing base to dplyr for differing number of groups, rows and columns. Uncomment the code in the appendix below and modify as needed to re-run the benchmark.

load(file="benchmark results")

#create a plot
library(reshape2)
library(ggplot2)
tmp.data<-melt(res,id.vars=c("rows","columns","groups","samples")) %>% mutate(seconds=value*60)

ggplot(tmp.data, aes(y=seconds,x=groups,group=variable,color=variable)) + geom_line() +geom_point()+ facet_grid(rows ~ columns) +scale_y_log10()

The plot above shows the calculation time for 10 replications in seconds (y-axis) for calculating the median of varying number of groups (x-axis), rows (y-facet) and columns (x-facet).


Gotchas

  1. In dplyr rownames are a second class citizen and are not stored.
  2. Even though nearly every dplyr tutorial features pipes (%>%) it may be easier to learn both dplyr and %>% separately.
  3. Most dplyr functions only work on objects coercible to ~ data.frames. Lots of my debugging sessions start with trying to understand the data structure of objects I am passing to dplyr.

Appendix

# 
# #set up functions to time
# base_fun<-function(data){
#       #split the data on the number of cylinders
#     big.l<-split(data,data$group)
#     
#     #apply some function of interest to all columns
#     results<-lapply(big.l, function(x) apply(x,2,median))
#     
#     #bind results and add splitting info
#     data.frame(group=names(results),do.call("rbind",results))
# }
# 
# dplyr_fun<-function(data){
#   
#   data %>% group_by(group) %>% summarise_each(funs(median(.)))
# }
# 
# #benchmark function
# benchmark_fun<-function(rows,cols,groups){
#   
#   #set up data
#   samples<-floor(rows/groups)
#   tmp.data<-data.frame(matrix(rnorm(rows),rows,cols))
#   tmp.data$group<-rep(1:groups,length.out=rows)
#   
#   #base
#   base.time<-system.time(replicate(10,base_fun(tmp.data)))
#   
#   #dplyr
#   dplyr.time<-system.time(replicate(10,dplyr_fun(tmp.data)))
#   
#   data.frame(rows=rows,columns=cols,groups=groups,samples=samples,base=signif(base.time["elapsed"]/10,3),dplyr=signif(dplyr.time["elapsed"]/10,3))
#   
# }
# 
# #run benchmarks
# len<-5
# groups<-seq(5, 100,length.out=len) %>% signif(.,0)
# rows<-seq(100, 10000,length.out=len) %>% signif(.,0)
# cols<-seq(10, 100,length.out=len) %>% signif(.,0)
# 
# #benchmarks
# results<-list()
# counter<-1
# for(i in 1:length(groups)){ 
#   .group<-groups[i]
#   for(j in 1:length(rows)){
#     .row<-rows[j]
#     for(k in 1:length(cols)){
#       .col<-cols[k]
#        results[[counter]]<- benchmark_fun(.row,.col,.group)
#       counter<-counter+1
#     }
#   }  
# } 
# 
# 
# res<-do.call("rbind",results)
# save(res,file="benchmark results")
# #create a plot
# library(reshape2)
# library(ggplot2)
# tmp.data<-melt(res,id.vars=c("rows","columns","groups","samples")) %>% mutate(seconds=value*60)
# 
# ggplot(tmp.data, aes(y=seconds,x=groups,group=variable,color=variable)) + geom_line() +geom_point()+ facet_grid(columns ~ rows) +scale_y_log10()