## Intro to `dplyr`

This lesson will cover some basic functions that can be used to manipulate data in R.
Again, we will be using the gapminder data set, which records each country's life expectancy, population, and GDP per capita over time.

This material is based on a Software Carpentry lesson, available on their [website](http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/index.html).

There are five main functions we'll be talking about today, each allowing us to manipulate data frames. These five functions are:

* `select()`  --  Choose columns (variables or attributes) from our data frame
* `filter()`  --  Choose rows (samples or observations) from our data frame
* `mutate()`  --  Create new columns, based on existing ones
* `group_by()`  --  Group rows based on a particular column/value within that column 
* `summarize()`  --  Perform some function on the grouped data


If you haven't already, make sure you have the `dplyr` and `gapminder` packages installed and loaded with the following commands:


In [1]:
# Download the packages
#install.packages(c("dplyr", "gapminder"))

# Load the packages for use
library(dplyr)
library(gapminder)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Let's take a quick look at our data frame to remind ourselves of its structure. We do this using the `head()` command, which will display the first 10 rows (given by `n = 10`) of our data frame. 

In [2]:
head(gapminder, n = 10)

| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
| Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 |
| Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 |
| Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |


### Choose Columns: select()

The first function we'll be using is `select()`. This function lets us pick columns from our data frame, based on name (e.g. `year`) or by index (e.g. 3).

![](https://swcarpentry.github.io/r-novice-gapminder/fig/13-dplyr-fig1.png)

Let's try using `select()` to pick out a few columns: "country", "year", "lifeExp", and "pop". We'll be assigning these columns to a new data frame, `gapminder_select`. Then we'll use `head()` to see if it worked.
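As a rough sketch of what this could look like (assuming `dplyr` and `gapminder` are installed), the call below passes the data frame first, followed by the unquoted column names to keep:

```r
library(dplyr)
library(gapminder)

# Keep only the four named columns; every row is retained
gapminder_select <- select(gapminder, country, year, lifeExp, pop)

# Look at the first few rows to confirm which columns are present
head(gapminder_select)
```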

In [3]:
# select() code here:


# Check the data frame:


As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the `select()` command. 

***

Here we'll also introduce another great feature available through `dplyr`: the pipe ( **%>%** ), which comes from the `magrittr` package and is made available when `dplyr` is loaded. This symbol sends, or pipes, an object (e.g. a data frame like `gapminder`) into a function (e.g. `select()`).
So, the above `select()` command can be rewritten as follows (NOTE: the "." is a placeholder, which represents the object being piped). Again, we can check our result using `head()`.
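A minimal sketch of the piped form with an explicit placeholder, assuming the same four columns as before:

```r
library(dplyr)
library(gapminder)

# The "." stands for whatever is on the left of the pipe (here, gapminder)
gapminder_select <- gapminder %>% select(., country, year, lifeExp, pop)

head(gapminder_select)
```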

In [4]:
# select() using pipe syntax:


# Check the result with head():


We can actually simplify the above command further. By default, the pipe passes the object on its left as the first argument of the function on its right, so with functions such as `select()` you don't need to include the "." placeholder, as shown below.
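One way the shortened form might be written, again assuming the same four columns:

```r
library(dplyr)
library(gapminder)

# No placeholder: gapminder is passed to select() as its first argument
gapminder_select <- gapminder %>% select(country, year, lifeExp, pop)

head(gapminder_select)
```

Both piped forms and the plain `select(gapminder, ...)` call should produce the same columns.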

In [5]:
# select() using pipe syntax w/out a placeholder:


# Check the results:


#### Challenge 1
Using the `select()` command and pipe (` %>% `) notation, pick the following columns from the `gapminder` data frame, assign them to a new variable (we'll use **x**), and display the results using `head(x, n = 10)`. Columns to choose are:

* continent
* gdpPercap
* lifeExp
* year

In [6]:
# Answer here:


### Choose Rows: filter()

So we've covered selecting columns, but what about rows? This is where `filter()` comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30. 

NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (`==`).

![](https://jules32.github.io/2016-07-12-Oxford/dplyr_tidyr/img/rstudio-cheatsheet-filter.png)

Let's go through a couple of examples.
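As a sketch of the first kind of filter, the code below keeps only the Canadian rows. The name `gapminder_canada` is just an illustrative choice, not something defined elsewhere in this lesson.

```r
library(dplyr)
library(gapminder)

# Keep rows where the country column equals "Canada" (note the double ==)
# gapminder_canada is a hypothetical variable name
gapminder_canada <- gapminder %>% filter(country == "Canada")

head(gapminder_canada)
```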

In [7]:
# Filter rows where country is Canada:

# Check the result:


Let's try another one, this time filtering on life expectancy above a certain threshold:
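A possible version of a numeric filter, using a threshold of 50 years and the illustrative name `gapminder_lifeExp`:

```r
library(dplyr)
library(gapminder)

# Keep rows where life expectancy is strictly greater than 50
# gapminder_lifeExp is a hypothetical variable name
gapminder_lifeExp <- gapminder %>% filter(lifeExp > 50)

head(gapminder_lifeExp)
```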

In [8]:
# Filter for rows where life expectancy is greater than 50:


# Check the result:


***

We can also filter on multiple conditions, each separated by a comma. A row is kept only if it meets all of the conditions.
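A short sketch of a two-condition filter. The variable names here are illustrative only; the second call uses `&` to show that comma-separated conditions behave like a logical AND.

```r
library(dplyr)
library(gapminder)

# Two conditions separated by a comma: both must be TRUE for a row to be kept
# canada_80 and canada_80_and are hypothetical variable names
canada_80 <- gapminder %>% filter(country == "Canada", lifeExp > 80)

# Equivalent form using an explicit logical AND
canada_80_and <- gapminder %>% filter(country == "Canada" & lifeExp > 80)

canada_80
```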

In [9]:
# filter() for Canada and life expectancy greater than 80:


# Check the result:


*** 

#### Challenge 2
Use `filter()` to choose data for African countries from 1980 onwards.

In [10]:
# Challenge 2 code here:
# x <- filter()

### Create New Columns: mutate()

Let's say we now want to calculate the total GDP in billions, which is done by multiplying the GDP per capita by the population, then dividing by 1 billion (1 * 10^9). `mutate()` applies this calculation to entire columns at once, producing one new value for each row in the data frame. The code below will calculate the GDP in billions:

* `gdpPercap * pop / 10^9`
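A minimal sketch of how the formula above could be used inside `mutate()`. The new column name `gdp_billion` and the variable name `gapminder_gdp` are assumptions made for illustration.

```r
library(dplyr)
library(gapminder)

# Add a new column; the existing columns are kept unchanged
# gdp_billion and gapminder_gdp are hypothetical names
gapminder_gdp <- gapminder %>%
    mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_gdp)
```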


In [11]:
# Use mutate() to calculate GDP in billions, using formula/code above:


# Check the result:


### Combine Functions with Pipes
We've seen that pipes ( **%>%** ) can be used to send an object such as a data frame into a function, such as `select()` or `filter()`. But they can also be used to send the output of one function into another function. This allows us to chain together multiple commands, without the need for intermediate variables.

Let's take a look at this in an example, selecting the country, year, lifeExp, pop, and gdpPercap columns, and filtering for Canadian entries.
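One way this chain might be written, with `gapminder_canada` as an illustrative variable name:

```r
library(dplyr)
library(gapminder)

# The output of select() is piped straight into filter()
# gapminder_canada is a hypothetical variable name
gapminder_canada <- gapminder %>%
    select(country, year, lifeExp, pop, gdpPercap) %>%
    filter(country == "Canada")

head(gapminder_canada)
```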

In [1]:
# select() the five columns, and filter() for Canada


# Check the result


We can further expand on this by incorporating our `mutate()` command from earlier, linking multiple functions into a single command. For readability, indent (`TAB` key) each new line that follows a pipe. R does not require this, but it makes the chain easier to follow.
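A sketch of the three-step chain, assuming the same five columns as the previous example and the illustrative names `gapminder_canada_gdp` and `gdp_billion`:

```r
library(dplyr)
library(gapminder)

# select() -> filter() -> mutate(), each step receiving the previous result
# gapminder_canada_gdp and gdp_billion are hypothetical names
gapminder_canada_gdp <- gapminder %>%
    select(country, year, lifeExp, pop, gdpPercap) %>%
    filter(country == "Canada") %>%
    mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_canada_gdp)
```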

In [None]:
# select() the five columns, filter() for Canada, and calculate GDP in billions:


# Check the result:


### group_by() and summarise()

These functions allow us to work on our data in specific groups. For example, we can use `group_by()` to group observations by country, then use `summarise()` (also spelled `summarize()`) to calculate the average life expectancy for each country.

![](https://swcarpentry.github.io/r-novice-gapminder/fig/13-dplyr-fig3.png)
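A minimal sketch of grouping by country and averaging life expectancy. The variable name `lifeExp_by_country` is illustrative; because the summary is not given a name here, the resulting column is simply labelled with the expression itself.

```r
library(dplyr)
library(gapminder)

# One output row per country, holding that country's mean life expectancy
# lifeExp_by_country is a hypothetical variable name
lifeExp_by_country <- gapminder %>%
    group_by(country) %>%
    summarise(mean(lifeExp))

head(lifeExp_by_country)
```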


In [2]:
# group_by() country, calculate average life expectancy


# Check the results:


Let's do another example, again grouping by country. This time, we'll calculate the mean and standard deviation of the GDP per capita. We'll also specify the column names inside of the `summarise()` command.
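A possible version with named summary columns. The names `gdp_by_country`, `mean_gdpPercap`, and `sd_gdpPercap` are assumptions chosen for illustration.

```r
library(dplyr)
library(gapminder)

# Each argument to summarise() is written as new_column_name = expression
# gdp_by_country, mean_gdpPercap and sd_gdpPercap are hypothetical names
gdp_by_country <- gapminder %>%
    group_by(country) %>%
    summarise(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap))

head(gdp_by_country)
```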

In [None]:
# Use group_by() and summarise to calculate mean and SD of gdpPercap


# Check the result:


### Tying it all Together

Now let's use all the commands we've covered and combine them with pipes into a single statement. 

Let's say we want to calculate the mean and SD of the GDP (in billions) for each country, but only considering data from 1980 onwards. We can accomplish this all in one step as follows.
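One way the full chain could be sketched, reusing the GDP formula from the `mutate()` section. All new column and variable names below are illustrative assumptions.

```r
library(dplyr)
library(gapminder)

# gdp_summary, gdp_billion, mean_gdp_billion and sd_gdp_billion are hypothetical names
gdp_summary <- gapminder %>%
    select(country, year, pop, gdpPercap) %>%          # keep only the columns needed
    filter(year >= 1980) %>%                           # 1980 and onwards
    mutate(gdp_billion = gdpPercap * pop / 10^9) %>%   # total GDP in billions
    group_by(country) %>%                              # one group per country
    summarise(mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))

head(gdp_summary)
```

The order of the steps matters here: `filter()` needs the `year` column, and `mutate()` needs `pop` and `gdpPercap`, so all three must survive the initial `select()`.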

In [None]:
# select() columns, filter() by year, calculate GDP in billions, mean() and sd() of GDP in billions
