## Intro to `dplyr`

This lesson will cover some basic functions that can be used to manipulate data in R.
Again, we will be using the gapminder data set, which includes country information on GDP, population, etc.

This material is based on a Software Carpentry lesson, available on their [website](http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/index.html).

There are five main functions we'll be talking about today, each allowing us to manipulate data frames. These five functions are:

* `select()`  --  Choose columns (variables or attributes) from our data frame
* `filter()`  --  Choose rows (samples or observations) from our data frame
* `mutate()`  --  Create new columns, based on existing ones
* `group_by()`  --  Group rows based on a particular column/value within that column 
* `summarize()` (also spelled `summarise()`, as in the code later in this lesson)  --  Calculate a summary value (e.g. a mean) for each group


If you haven't already, make sure you have the `dplyr` and `gapminder` packages installed and loaded with the following commands (uncomment the `install.packages()` line if they are not yet installed):


In [16]:
# Download the packages
#install.packages(c("dplyr", "gapminder"))

# Load the packages for use
library(dplyr)
library(gapminder)

Let's take a quick look at our data frame to remind ourselves of its structure. We do this using the `head()` command, which will display the first 10 rows (given by `n = 10`) of our data frame. 

In [17]:
head(gapminder, n = 10)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134
Afghanistan,Asia,1982,39.854,12881816,978.0114
Afghanistan,Asia,1987,40.822,13867957,852.3959
Afghanistan,Asia,1992,41.674,16317921,649.3414
Afghanistan,Asia,1997,41.763,22227415,635.3414


### Choose Columns: select()

The first function we'll be using is `select()`. This function lets us pick columns from our data frame, either by name (e.g. `year`) or by index (e.g. 3).

![](https://swcarpentry.github.io/r-novice-gapminder/fig/13-dplyr-fig1.png)

Let's try using `select()` to pick out a few columns: "country", "year", "lifeExp", and "pop". We'll be assigning these columns to a new data frame, `gapminder_select`. Then we'll use `head()` to see if it worked.

In [18]:
# select() code here:
gapminder_select <- select(gapminder, country, year, lifeExp, pop)

# Check the data frame:
head(gapminder_select, n = 10)

country,year,lifeExp,pop
Afghanistan,1952,28.801,8425333
Afghanistan,1957,30.332,9240934
Afghanistan,1962,31.997,10267083
Afghanistan,1967,34.02,11537966
Afghanistan,1972,36.088,13079460
Afghanistan,1977,38.438,14880372
Afghanistan,1982,39.854,12881816
Afghanistan,1987,40.822,13867957
Afghanistan,1992,41.674,16317921
Afghanistan,1997,41.763,22227415


As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the `select()` command. 
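
Selecting by index is mentioned above but not demonstrated, so here is a minimal sketch. It uses a small hand-built data frame, `toy` (a hypothetical stand-in holding two rows copied from the `head()` output earlier), so that it runs without the `gapminder` package. The minus-sign form for dropping columns is an addition that the lesson itself does not cover.

```r
library(dplyr)

# Hypothetical stand-in for gapminder: two rows copied from the output above
toy <- data.frame(
  country   = c("Afghanistan", "Afghanistan"),
  continent = c("Asia", "Asia"),
  year      = c(1952, 1957),
  lifeExp   = c(28.801, 30.332),
  pop       = c(8425333, 9240934),
  gdpPercap = c(779.4453, 820.853)
)

# Select columns by name
by_name <- select(toy, country, year, lifeExp, pop)

# Select the same columns by position
# (country = 1, year = 3, lifeExp = 4, pop = 5)
by_index <- select(toy, 1, 3, 4, 5)

# Drop the unwanted columns instead of listing the ones to keep
by_drop <- select(toy, -continent, -gdpPercap)

# All three approaches keep the same columns
names(by_name)
names(by_index)
names(by_drop)
```

Selecting by name is usually easier to read and keeps working if the column order changes, which is likely why the rest of this lesson uses names.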

***

Here we'll also introduce another great feature that comes with `dplyr`: the pipe (  **%>%** ). The pipe is provided by the `magrittr` package and becomes available when `dplyr` is loaded. This symbol sends, or pipes, an object (e.g. a data frame like `gapminder`) INTO a function (e.g. `select()`). 
So, the above `select()` command can be rewritten as follows (NOTE: the "." is a placeholder, which represents the object being piped). Again, we can check our result using `head()`.

In [19]:
# select() using pipe syntax
gapminder_pipe <- gapminder %>% select(., country, year, lifeExp, pop)

head(gapminder_pipe, n = 10)

country,year,lifeExp,pop
Afghanistan,1952,28.801,8425333
Afghanistan,1957,30.332,9240934
Afghanistan,1962,31.997,10267083
Afghanistan,1967,34.02,11537966
Afghanistan,1972,36.088,13079460
Afghanistan,1977,38.438,14880372
Afghanistan,1982,39.854,12881816
Afghanistan,1987,40.822,13867957
Afghanistan,1992,41.674,16317921
Afghanistan,1997,41.763,22227415


We can simplify the above command further. By default, the pipe passes the object on its left as the first argument of the function on its right, so you don't need to include the "." placeholder, as shown below.

In [20]:
# select() using pipe syntax w/out a placeholder
gapminder_pipe2 <- gapminder %>% select(country, year, lifeExp, pop)

head(gapminder_pipe2, n = 10)

country,year,lifeExp,pop
Afghanistan,1952,28.801,8425333
Afghanistan,1957,30.332,9240934
Afghanistan,1962,31.997,10267083
Afghanistan,1967,34.02,11537966
Afghanistan,1972,36.088,13079460
Afghanistan,1977,38.438,14880372
Afghanistan,1982,39.854,12881816
Afghanistan,1987,40.822,13867957
Afghanistan,1992,41.674,16317921
Afghanistan,1997,41.763,22227415
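
The three forms of the `select()` call seen so far are interchangeable, because `%>%` passes its left-hand side as the first argument of the function on its right. The sketch below checks this on a small hand-built data frame, `toy` (a hypothetical stand-in with values copied from the outputs in this lesson, so the block runs without the `gapminder` package).

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- data.frame(
  country = c("Canada", "Albania"),
  year    = c(1952, 1952),
  lifeExp = c(68.75, 55.23),
  pop     = c(14785584, 1282697)
)

# 1. Ordinary function call: the data frame is the first argument
nested <- select(toy, country, lifeExp)

# 2. Pipe with an explicit "." placeholder
placeholder <- toy %>% select(., country, lifeExp)

# 3. Pipe without a placeholder: toy is still passed as the first argument
implicit <- toy %>% select(country, lifeExp)

# Compare the results
all.equal(nested, placeholder)
all.equal(nested, implicit)

# The pipe is not limited to dplyr functions; it works with base R too
n_rows <- toy %>% nrow()
n_rows
```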


#### Challenge 1
Using the `select()` command and pipe (` %>% `) notation, pick the following columns from the `gapminder` data frame, assign them to a new variable (we'll use **x**), and display the results using `head(x, n = 10)`. Columns to choose are:

* continent
* GDP per capita
* life expectancy
* year

In [21]:
# Answer here:
# x <- select()

### Choose Rows: filter()

So we've covered selecting columns, but what about rows? This is where `filter()` comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30. 

NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (`==`).

![](https://jules32.github.io/2016-07-12-Oxford/dplyr_tidyr/img/rstudio-cheatsheet-filter.png)

Let's go through a couple examples. 

In [22]:
# Filter rows where country is Canada
gapminder_canada <- gapminder %>% filter(country == "Canada")

head(gapminder_canada, n = 10)

country,continent,year,lifeExp,pop,gdpPercap
Canada,Americas,1952,68.75,14785584,11367.16
Canada,Americas,1957,69.96,17010154,12489.95
Canada,Americas,1962,71.3,18985849,13462.49
Canada,Americas,1967,72.13,20819767,16076.59
Canada,Americas,1972,72.88,22284500,18970.57
Canada,Americas,1977,74.21,23796400,22090.88
Canada,Americas,1982,75.76,25201900,22898.79
Canada,Americas,1987,76.86,26549700,26626.52
Canada,Americas,1992,77.95,28523502,26342.88
Canada,Americas,1997,78.61,30305843,28954.93


Let's try another one, this time filtering on life expectancy above a certain threshold:

In [23]:
# Filter for rows where life expectancy is greater than 50
gapminder_LE <- gapminder %>% filter(lifeExp > 50)

head(gapminder_LE, n = 10)

country,continent,year,lifeExp,pop,gdpPercap
Albania,Europe,1952,55.23,1282697,1601.056
Albania,Europe,1957,59.28,1476505,1942.284
Albania,Europe,1962,64.82,1728137,2312.889
Albania,Europe,1967,66.22,1984060,2760.197
Albania,Europe,1972,67.69,2263554,3313.422
Albania,Europe,1977,68.93,2509048,3533.004
Albania,Europe,1982,70.42,2780097,3630.881
Albania,Europe,1987,72.0,3075321,3738.933
Albania,Europe,1992,71.581,3326498,2497.438
Albania,Europe,1997,72.95,3428038,3193.055


***

We can also filter on multiple conditions, each separated by a comma. A row is kept only if it meets all of the conditions:

In [24]:
# filter() for Canada and life expectancy greater than 80
gapminder_C_LE <- gapminder %>% filter(country == "Canada", lifeExp > 80)

head(gapminder_C_LE, n = 10)

country,continent,year,lifeExp,pop,gdpPercap
Canada,Americas,2007,80.653,33390141,36319.24
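
Conditions separated by commas are combined with a logical AND, which is the same as joining them with `&`. The sketch below shows this on a small hand-built data frame, `toy` (a hypothetical stand-in with values copied from the outputs above). It also shows `|` for a logical OR, which is an addition not covered in the lesson text.

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- data.frame(
  country = c("Canada", "Canada", "Albania"),
  year    = c(1997, 2007, 1997),
  lifeExp = c(78.61, 80.653, 72.95)
)

# Comma-separated conditions: a row must satisfy both
both_comma <- toy %>% filter(country == "Canada", lifeExp > 80)

# The same filter written with an explicit AND
both_amp <- toy %>% filter(country == "Canada" & lifeExp > 80)

# OR: a row is kept if it satisfies at least one condition
either <- toy %>% filter(country == "Canada" | lifeExp > 80)

both_comma
either
```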


*** 

#### Challenge 2
Use `filter()` to choose data for African countries, from the year 1980 and onwards. 

In [25]:
# Challenge 2 code here:
# x <- filter()

### Create New Columns: mutate()

Let's say we now want to calculate the GDP in billions, which is done by multiplying the GDP per capita by the population, then dividing by 1 billion (1 * 10^9). `mutate()` applies this calculation to every row in the data frame and stores the result in a new column. The expression below calculates the GDP in billions:

* `gdpPercap * pop / 10^9`


In [26]:
# Use mutate() to calculate GDP in billions
gapminder_gdpBil <- gapminder %>% mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_gdpBil, n = 10)

country,continent,year,lifeExp,pop,gdpPercap,gdp_billion
Afghanistan,Asia,1952,28.801,8425333,779.4453,6.567086
Afghanistan,Asia,1957,30.332,9240934,820.853,7.585449
Afghanistan,Asia,1962,31.997,10267083,853.1007,8.758856
Afghanistan,Asia,1967,34.02,11537966,836.1971,9.648014
Afghanistan,Asia,1972,36.088,13079460,739.9811,9.678553
Afghanistan,Asia,1977,38.438,14880372,786.1134,11.697659
Afghanistan,Asia,1982,39.854,12881816,978.0114,12.598563
Afghanistan,Asia,1987,40.822,13867957,852.3959,11.82099
Afghanistan,Asia,1992,41.674,16317921,649.3414,10.595902
Afghanistan,Asia,1997,41.763,22227415,635.3414,14.121996
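
As a quick check of the arithmetic, the sketch below recomputes the first two values by hand. It uses a small hand-built data frame, `toy` (a hypothetical stand-in holding the first two Afghanistan rows shown above, with `gdpPercap` rounded as displayed). The expression inside `mutate()` is ordinary R arithmetic on whole columns, so it gives the same numbers as the base R calculation.

```r
library(dplyr)

# Hypothetical stand-in: first two rows of the output above
toy <- data.frame(
  country   = "Afghanistan",
  year      = c(1952, 1957),
  pop       = c(8425333, 9240934),
  gdpPercap = c(779.4453, 820.853)
)

# Add the new column with mutate()
with_gdp <- toy %>% mutate(gdp_billion = gdpPercap * pop / 10^9)

# The same calculation done directly on the columns in base R
by_hand <- toy$gdpPercap * toy$pop / 10^9

with_gdp
all.equal(with_gdp$gdp_billion, by_hand)
```

The first value comes out at roughly 6.567, matching the `gdp_billion` column in the output above.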


### Combine Functions with Pipes
We've seen that pipes ( **%>%** ) can be used to send an object such as a data frame into a function, such as `select()` or `filter()`. But they can also be used to send the output of one function into another function. This allows us to chain together multiple commands, without the need for intermediate variables.

Let's take a look at this in an example. 

In [27]:
# select() the five columns, and filter() for Canada
gapminder_multi <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada")

head(gapminder_multi, n = 10)

country,year,lifeExp,pop,gdpPercap
Canada,1952,68.75,14785584,11367.16
Canada,1957,69.96,17010154,12489.95
Canada,1962,71.3,18985849,13462.49
Canada,1967,72.13,20819767,16076.59
Canada,1972,72.88,22284500,18970.57
Canada,1977,74.21,23796400,22090.88
Canada,1982,75.76,25201900,22898.79
Canada,1987,76.86,26549700,26626.52
Canada,1992,77.95,28523502,26342.88
Canada,1997,78.61,30305843,28954.93


We can further expand on this by incorporating our `mutate()` command from earlier, linking multiple functions into a single command. Be sure to indent (`TAB` key) when moving to a new line after a pipe. 

In [28]:
# select() the five columns, filter() for Canada, and calculate GDP in billions
gapminder_multi_2 <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada") %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_multi_2, n = 10)

country,year,lifeExp,pop,gdpPercap,gdp_billion
Canada,1952,68.75,14785584,11367.16,168.0701
Canada,1957,69.96,17010154,12489.95,212.456
Canada,1962,71.3,18985849,13462.49,255.5967
Canada,1967,72.13,20819767,16076.59,334.7108
Canada,1972,72.88,22284500,18970.57,422.7497
Canada,1977,74.21,23796400,22090.88,525.6835
Canada,1982,75.76,25201900,22898.79,577.0931
Canada,1987,76.86,26549700,26626.52,706.926
Canada,1992,77.95,28523502,26342.88,751.3913
Canada,1997,78.61,30305843,28954.93,877.5034
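
To see what the pipe saves us, the sketch below writes the same three steps in three ways: with intermediate variables, as nested function calls, and as a piped chain. It runs on a small hand-built data frame, `toy` (a hypothetical stand-in with two rows copied from the outputs in this lesson); the variable names `step1` and `step2` are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- data.frame(
  country   = c("Canada", "Albania"),
  year      = c(1952, 1952),
  pop       = c(14785584, 1282697),
  gdpPercap = c(11367.16, 1601.056)
)

# Version 1: intermediate variables
step1 <- select(toy, country, year, pop, gdpPercap)
step2 <- filter(step1, country == "Canada")
with_steps <- mutate(step2, gdp_billion = gdpPercap * pop / 10^9)

# Version 2: nested calls, read from the inside out
nested <- mutate(
  filter(select(toy, country, year, pop, gdpPercap), country == "Canada"),
  gdp_billion = gdpPercap * pop / 10^9
)

# Version 3: piped chain, read from top to bottom
piped <- toy %>%
    select(country, year, pop, gdpPercap) %>%
    filter(country == "Canada") %>%
    mutate(gdp_billion = gdpPercap * pop / 10^9)

all.equal(with_steps, piped)
all.equal(nested, piped)
piped
```

All three give the same result; the piped version simply lists the steps in the order they happen.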


### Group and Summarise: group_by() and summarise()

These functions allow us to work on our data in specific groups. For example, we can use `group_by()` to group observations by country, then calculate the average life expectancy for each country. 

![](https://swcarpentry.github.io/r-novice-gapminder/fig/13-dplyr-fig3.png)


In [29]:
# group_by() country, calculate average life expectancy
gapminder_grp <- gapminder %>% 
    group_by(country) %>% 
    summarise(mean(lifeExp))

head(gapminder_grp, n = 10)

country,mean(lifeExp)
Afghanistan,37.47883
Albania,68.43292
Algeria,59.03017
Angola,37.8835
Argentina,69.06042
Australia,74.66292
Austria,73.10325
Bahrain,65.60567
Bangladesh,49.83408
Belgium,73.64175


Let's do another example, again grouping by country. This time, we'll calculate the mean and standard deviation of the GDP per capita. We'll also specify the column names inside of the `summarise()` command.

In [30]:
gapminder_mean_sd <- gapminder %>% 
    group_by(country) %>% 
    summarise(mean_gdp = mean(gdpPercap), sd_gdp = sd(gdpPercap))

head(gapminder_mean_sd, n = 10)

country,mean_gdp,sd_gdp
Afghanistan,802.6746,108.2029
Albania,3255.3666,1192.3515
Algeria,4426.026,1310.3377
Angola,3607.1005,1165.9003
Argentina,8955.5538,1862.5832
Australia,19980.5956,7815.4052
Austria,20411.9163,9655.2815
Bahrain,18077.6639,5415.4134
Bangladesh,817.5588,235.0796
Belgium,19900.7581,8391.1863
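
The sketch below repeats the grouping idea on a data set small enough to check by hand. `toy` is a hypothetical stand-in with two rows each for Canada and Albania, copied from the outputs earlier in this lesson. It also shows why naming the summary columns helps: an unnamed summary gets the expression itself as its column name. The `n()` helper, which counts the rows in each group, is an addition not covered in the lesson text.

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- data.frame(
  country = c("Canada", "Canada", "Albania", "Albania"),
  year    = c(1952, 1957, 1952, 1957),
  lifeExp = c(68.75, 69.96, 55.23, 59.28)
)

# Unnamed summary: the column is called "mean(lifeExp)"
unnamed <- toy %>%
    group_by(country) %>%
    summarise(mean(lifeExp))

# Named summaries, plus the number of rows in each group
named <- toy %>%
    group_by(country) %>%
    summarise(mean_lifeExp = mean(lifeExp), n_obs = n())

names(unnamed)
named
```

By hand, the Canada mean is (68.75 + 69.96) / 2 = 69.355 and the Albania mean is (55.23 + 59.28) / 2 = 57.255, which is what `named` should contain.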


### Tying it all Together

Now let's use all the commands we've covered and combine them with pipes into a single statement. 

Let's say we want to calculate the mean and SD of the GDP (in billions) for each country, but only considering data from 1980 and onwards. We can accomplish this all in one step as follows. 

In [31]:
# select() columns, filter() by year, calculate GDP in billions, mean() and sd() of GDP in billions
gapminder_final <- gapminder %>% 
    select(country, year, pop, gdpPercap) %>% 
    filter(year >= 1980) %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9) %>% 
    group_by(country) %>% 
    summarise(mean_gdpBillion = mean(gdp_billion), sd_gdpBillion = sd(gdp_billion))

head(gapminder_final, n = 10)

country,mean_gdpBillion,sd_gdpBillion
Afghanistan,16.43003,7.663241
Albania,13.06277,4.837832
Algeria,148.61314,33.154735
Angola,28.94037,15.531401
Argentina,353.0717,91.45521
Australia,477.63932,154.360112
Austria,225.38878,50.320096
Bahrain,12.39705,5.145348
Bangladesh,119.9549,54.34125
Belgium,271.94451,54.647537
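
One detail worth noticing in the pipeline above: each step only sees the columns and rows produced by the step before it, which is why `select()` keeps `year` for `filter()` to use. The sketch below illustrates this on a small hand-built data frame, `toy` (a hypothetical stand-in with two Canada rows copied from earlier outputs). The use of `tryCatch()` to capture the error is only for demonstration.

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- data.frame(
  country   = c("Canada", "Canada"),
  year      = c(1977, 1982),
  pop       = c(23796400, 25201900),
  gdpPercap = c(22090.88, 22898.79)
)

# Works: year is still present when filter() runs
kept <- toy %>%
    select(country, year, pop) %>%
    filter(year >= 1980)

# Fails: select() has already dropped year, so filter() cannot find it
dropped <- tryCatch(
  toy %>%
      select(country, pop) %>%
      filter(year >= 1980),
  error = function(e) "filter() failed: year was dropped by select()"
)

# Also works: filter first, then drop the column
reordered <- toy %>%
    filter(year >= 1980) %>%
    select(country, pop)

kept
dropped
reordered
```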
