## Introduction: why dplyr?

There are a _lot_ of amazing packages in the [Tidyverse](https://www.tidyverse.org/packages/), but `dplyr` is hands-down my favorite. I use `dplyr` when I'm cleaning and exploring my dataset, and what I particularly love is that after I get a good handle on my dataset with `dplyr`, I can feed the various manipulations I've created into the `ggplot2` package for visualization.

This tutorial is for anyone interested in learning the basics of the `dplyr` package. We'll be focusing on data exploration and manipulation, building off of the examples in the `dplyr` package documentation using the [Palmer Penguins](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) dataset.   

**By the end of this notebook, you'll be able to:**  
* Demonstrate what each of the five main `dplyr` functions does
* Use the pipe operator `%>%` to chain together multiple `dplyr` functions

### My analytical workflow

We won't be covering _all_ of the steps in my workflow in this tutorial, but in general I follow these steps: 

1. Set up the programming environment by loading packages  
2. Import my data  
3. Check out my data 
4. Explore my data 
5. Model my data
6. Communicate what I've learned

## Set up our environment

In [None]:
# we have a couple of options here - we can load the entire tidyverse or we can just load the 
# tidyverse packages that we're interested in using. I'm going to load the tidyverse, but alternatively you
# could run the following instead:

#library(readr)
#library(dplyr)

# load the tidyverse
library(tidyverse)

### A quick note on Conflicts

After running `library(tidyverse)` you might have noticed that the printout told us which packages were attached successfully (all of them, as evidenced by the green check marks), and where we have conflicts (the red x's).

Conflicts aren't necessarily a bad thing! Because R is an open source language and anyone can create a package, it's common for different packages to use the same name for different functions. In our conflicts we see that the `filter()` function from the `dplyr` package masks the `filter()` function from the `stats` package. We know this because the package name comes before the double colon and the function name comes after, like this:

> `package::function()`

What if we want to use the `filter()` function from the stats package? All is not lost! What we can do in our code is use the full `package::function()` syntax and R will know to use the `filter()` function from the `stats` package instead of the `dplyr` package.
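
To make the difference concrete, here is a minimal sketch that calls both versions of `filter()` with the full `package::function()` syntax. The `toy` tibble and its values are made up for illustration and are not part of the penguins data.

```r
library(dplyr)

# toy data with made-up values, standing in for the real penguins dataset
toy <- tibble(
  island = c("Dream", "Torgersen", "Dream"),
  body_mass_g = c(3500, 3800, 4100)
)

# dplyr's filter() keeps the rows that match a condition
dream_only <- toy %>%
  dplyr::filter(island == "Dream")

# stats' filter() is a different tool: it applies a linear filter to a
# numeric series (here, a three-point moving average)
moving_avg <- stats::filter(toy$body_mass_g, rep(1 / 3, 3))

dream_only
moving_avg
```

Because each call names its package explicitly, it doesn't matter which `filter()` is currently masking the other.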

## Import our data

In [None]:
penguins <- read_csv('../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv')

### Parsing the parsing statement

One thing that took me a while to get used to was that just because the text is in BRIGHT RED doesn't mean that something bad has happened, or that I've made a mistake. That's exactly what we're seeing here!

What the parsing statement does is tell us how R parsed each of the columns in our dataframe. By default, the `read_csv()` function looks at the first thousand rows of a dataset and makes an educated guess as to the data type of each column. We can override this if we need to, either by telling R to use more rows to guess using the `guess_max` argument, or by explicitly telling R what type of data is in each column using the `col_types` argument.
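
Here is a small sketch of both overrides. To keep it self-contained it writes a tiny, made-up CSV to a temporary file instead of using the real penguins file; the `guess_max` value of 5000 is an arbitrary choice for illustration.

```r
library(readr)

# write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "species,body_mass_g",
    "Adelie,3750",
    "Gentoo,NA",
    "Chinstrap,3800"
  ),
  csv_path
)

# option 1: let read_csv() guess, but allow it to look at more rows
guessed <- read_csv(csv_path, guess_max = 5000)

# option 2: state the type of each column explicitly
typed <- read_csv(
  csv_path,
  col_types = cols(
    species = col_character(),
    body_mass_g = col_double()
  )
)

typed
```

Spelling out the column types is more typing, but it means the import won't quietly change if the contents of the file change later.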

## Check out our data

Here's where I like to get a handle on what I'm working with. I'll use various functions to make sure my data imported correctly, and start to get an understanding of the data structure and data types. The functions I commonly use to accomplish this are:  

* `glimpse()`
* `head()` and `tail()`
* `summary()`

### glimpse() is grrrreat!

`glimpse()` gives you just about everything you could want, all wrapped up in a single function. The printout starts with the dimensions of our dataframe, telling us that in our `penguins` dataset we have 344 rows (or observations) and 7 columns (or variables).

We also see each of the variables listed out by name, followed by the data type `<datatype>`, and then a look at the first few values of each variable.

In [None]:
glimpse(penguins)

### The heck is a culmen?

> I didn't know either, but [Allison Horst](https://twitter.com/allison_horst) has an amazing illustration explaining it!

![](https://pbs.twimg.com/media/EaAXQn8U4AAoKUj.jpg:small)

### head()

Before reading further (or running the code) take a second to think about what the `head()` function might return.  

If you guessed the "head" of the dataframe, or the first few rows, you'd be correct! I use `head()` to check a couple of things. First, I want to see if my data imported correctly. It's not uncommon to have the first few rows of a `.csv` file be blank, or contain information that I don't want in my final dataset. Second, `head()` prints out a nicely-formatted table that lets me take a quick look and see if the data is formatted consistently.  

Using `head()` and seeing that your data is formatted consistently isn't a guarantee that you won't run into problems later, but it's a great first check.  

In [None]:
head(penguins)

### summary()

`summary()` might be one of the first functions I remember using and going "ooooh, this is pretty cool!" Like with the `head()` function, the name tells you what it does - any data that we pass to `summary()` will return a set of summary statistics appropriate for that datatype.  

We can send individual variables to `summary()`, or an entire dataframe, and get a quick idea of our data types, the spread of our data, and how much missing data we'll be dealing with.

In [None]:
summary(penguins)
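
As a quick sketch of the "individual variable" case, the example below runs `summary()` on a single numeric vector. The values are made up; with the real data you could pass a column such as `penguins$body_mass_g` instead.

```r
# a made-up numeric vector with one missing value
body_mass_g <- c(3750, 3800, NA, 4200)

# summary() on a single numeric vector reports the min, quartiles, mean, max,
# and a count of NA values
mass_summary <- summary(body_mass_g)
mass_summary
```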

### A note on names()

I have a really hard time remembering what the names of my variables are, and because R is case-sensitive, how the names are formatted. We could fix this by converting all of our variable names to the same case, but for now just know that if you ever need a refresher on the names of the variables in your dataset (and how they're formatted!) you can run `names()`, like this:

In [None]:
names(penguins)
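
If you did want to convert all of your variable names to the same case, one possible approach is `rename_with()` from `dplyr`, which applies a function to every column name. The `messy` tibble below is hypothetical, with deliberately inconsistent names.

```r
library(dplyr)

# toy data with deliberately inconsistent, made-up column names
messy <- tibble(
  Species = c("Adelie", "Gentoo"),
  Body_Mass_G = c(3750, 5000)
)

# rename_with() applies a function (here, tolower) to every column name
tidy_names <- messy %>%
  rename_with(tolower)

names(tidy_names)
```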

## Exploring our data with dplyr

**Main functions we'll use**
* `arrange()`
* `filter()`
* `select()`
* `mutate()`
* `summarise()` (you can also use `summarize()`)

**Reading and writing R code**  
One thing that I really enjoy about working in R is that I can write out what I want to do in a sentence, and then translate that into code. For example, if I say:

> Take the penguins dataset and then filter for all penguins that live on Torgersen island

* _Take the penguins dataset_ **translates to** `penguins`
* _and then_ **translates to** `%>%`
* _filter for all penguins that live on Torgersen island_ **translates to** `filter(island == "Torgersen")`

We can then take these three lines and put them together to get the following:

In [None]:
penguins %>%
  filter(island == "Torgersen")

### Wait what the heck is %>%?

`%>%` is the pipe operator, and it allows us to push our data through sequential functions in R. Much like we use the words "and then" to describe instructions or steps on how to do something, `%>%` acts like an "and then" statement between functions.  

We can take the code we wrote above _and then_ add a function we've already used, `head()`, to print out a much shorter table, like this:

In [None]:
penguins %>%
  filter(island == "Torgersen") %>%
  head()
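
One way to see what the pipe is doing is to write the same steps without it. In this sketch (on a small made-up tibble), the nested version has to be read from the inside out, while the piped version reads from top to bottom.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island = c("Torgersen", "Dream", "Biscoe")
)

# nested: the innermost function runs first, so we read from the inside out
nested <- head(filter(toy, island == "Torgersen"))

# piped: each step is handed to the next, so we read from top to bottom
piped <- toy %>%
  filter(island == "Torgersen") %>%
  head()

# check whether the two approaches return the same table
identical(nested, piped)
```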

So let's get to it! In this section we'll go through a couple of examples with each of the individual `dplyr` functions, and then start combining them to do some powerful data manipulations!

## Applying arrange()

`arrange()` "arranges," or organizes, our data in _ascending_ order, starting from the lowest value and running to the highest (or in the case of character data, in alphabetical order).  

We can provide a single argument to the `arrange()` function, such as `culmen_length_mm` (double) or `species` (character).

In [None]:
# numeric data 
# I've added the head() function to the end of the function chain to reduce the length of the table that's printed out
# you can remove it in your version!

penguins %>%
  arrange(culmen_length_mm) %>%
  head()

In [None]:
# character data

penguins %>%
  arrange(species)

### Creating a subset

It's a little hard to see what's going on in the above table, so I'm going to create a smaller subset of the `penguins` dataset to make the results easier to read. You can run the code on the subset of the data, or replace `penguins_subset` with `penguins` to see what happens on the full dataset!

In [None]:
# creating a random subset of the penguins dataset
set.seed(406)

penguins_subset <- penguins %>%
  sample_n(12)  # another dplyr function! (newer dplyr versions also offer slice_sample(n = 12))

penguins_subset

In [None]:
# let's re-run the arrange() function on character data in the penguins_subset data

penguins_subset %>%
  arrange(species)

### Nesting desc() inside arrange()

What if we don't want our data in ascending order? Then we can nest the `desc()` function, which stands for _descending_, within the `arrange()` function. This will then order our numeric data from highest to lowest, and our character data in reverse alphabetical order.

In [None]:
# numeric data arranged in descending order

penguins_subset %>%
  arrange(desc(culmen_length_mm))

In [None]:
# character data arranged in descending - reverse alphabetical - order

penguins_subset %>%
  arrange(desc(species))
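
`arrange()` also accepts more than one column, and `desc()` can be applied to just some of them. The sketch below, on a made-up tibble, sorts alphabetically by species and then, within each species, from the longest culmen to the shortest.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  species = c("Gentoo", "Adelie", "Gentoo", "Adelie"),
  culmen_length_mm = c(46.1, 39.1, 50.0, 40.3)
)

# sort by species (ascending), breaking ties by culmen length (descending)
sorted <- toy %>%
  arrange(species, desc(culmen_length_mm))

sorted
```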

## Fun with filter()

`filter()` is probably one of my most used functions, because it allows me to look at subsets quickly and easily. What's nice about `filter()` is its flexibility - we can use it on a single condition or multiple conditions.

In [None]:
# filter with a single numeric condition

penguins_subset %>%
  filter(culmen_depth_mm > 16.2)

In [None]:
# filter with a single character condition

penguins_subset %>%
  filter(island == "Dream")

In [None]:
# filter with a single numeric condition between two values

penguins_subset %>%
  filter(between(culmen_depth_mm, 16.2, 18.1))
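
The cells above each use a single condition, so here is a minimal sketch of the multiple-condition case on a made-up tibble. Separating conditions with a comma means "and" (a row must meet every condition), while `|` means "or" (a row must meet at least one).

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  island = c("Dream", "Dream", "Biscoe", "Torgersen"),
  culmen_depth_mm = c(18.5, 16.0, 15.0, 19.0)
)

# AND: penguins on Dream island that also have a culmen depth above 16.2
both_conditions <- toy %>%
  filter(island == "Dream", culmen_depth_mm > 16.2)

# OR: penguins that are on Dream island or have a culmen depth above 16.2
either_condition <- toy %>%
  filter(island == "Dream" | culmen_depth_mm > 16.2)

both_conditions
either_condition
```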

## Starting with select()

`select()` allows us to pick which columns (variables) we want to look at, and we can use it to pull a subset of variables, or even rearrange the order of our variables within a dataframe.

In [None]:
# selecting species, flipper_length_mm, and sex columns

penguins_subset %>%
  select(species, flipper_length_mm, sex)

In [None]:
# selecting all character data

penguins_subset %>%
  select(where(is.character))

In [None]:
# selecting all numeric data

penguins_subset %>%
  select(where(is.numeric))

In [None]:
# selecting all non-numeric data (here, the character columns) by negating where(is.numeric)

penguins_subset %>%
  select(!where(is.numeric))
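
Two more `select()` patterns are worth a quick sketch: rearranging columns by listing them in a new order, and dropping a column by putting a minus sign in front of its name. The tibble below is made up for illustration.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  island = c("Dream", "Biscoe"),
  sex = c("FEMALE", "MALE")
)

# rearrange: columns come back in the order we list them
reordered <- toy %>%
  select(sex, species, island)

# drop: a minus sign removes a column and keeps everything else
without_sex <- toy %>%
  select(-sex)

names(reordered)
names(without_sex)
```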

## Math with mutate()

What's not to love about a function that lets us create new columns (variables)?! For these examples we'll work strictly with `mutate()`, but when you work on extending this notebook, try using `group_by()` and _then_ `mutate()`! (We'll talk about `group_by()` in the next section.)

In [None]:
# converting grams to pounds
# notice how the order of our columns stays the same, and the new column, body_weight_pounds, gets placed at the 
# far right of the dataframe. what function could we use to change this order?

penguins_subset %>%
  mutate(body_weight_pounds = body_mass_g / 453.59237)

In [None]:
# OK I wanted to show you how to combine select and mutate
# what do you think the everything() function might do? confirm your guess by looking at the documentation (linked at 
# the end of the notebook).

penguins_subset %>%
  mutate(body_weight_pounds = body_mass_g / 453.59237) %>%
  select(species, body_mass_g, body_weight_pounds, everything())
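
As a starting point for the `group_by()` and _then_ `mutate()` idea, here is one possible sketch on a made-up tibble. The new column names are my own invention. Unlike `summarise()`, a grouped `mutate()` keeps every row and fills in the group-level value for each one.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3500, 3700, 5000, 5400)
)

with_diff <- toy %>%
  group_by(species) %>%
  mutate(
    # the mean is calculated separately for each species
    species_avg_mass = mean(body_mass_g),
    # how far each penguin is from its own species' average
    diff_from_species_avg = body_mass_g - species_avg_mass
  ) %>%
  ungroup()  # remove the grouping so later steps work on the whole table

with_diff
```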

## Summaries with summarise(), with help from group_by()

You can use either `summarise()` or `summarize()` to get a summary, or overview, of your data. What's more, once we introduce `group_by()` you can get summary data for _subsets_ of your data.

In [None]:
# summarising the average body mass of penguins, in grams

penguins_subset %>%
  summarise(avg_body_mass = mean(body_mass_g))

In [None]:
# since we're now summarising our data we can go ahead and use the full dataframe, since the printout will be reasonably-sized

penguins %>%
  summarise(avg_body_mass = mean(body_mass_g))

### The NAs!

If we don't handle our `NA` values we're going to be in for a bad time. There are multiple ways of dealing with `NA` values in R - for now we're going to use `na.rm = TRUE`, but you could use `filter()` from the `dplyr` package or `drop_na()` from the `tidyr` package as well!  

`na.rm` is like asking the question, "Should we remove `NA`s from our calculation?" where `na` stands for `NA` values, and `rm` stands for remove. So when we set `na.rm = TRUE` we're saying "Yes, please remove `NA` values from my calculations." Likewise if we use `na.rm = FALSE` (the default for `mean()`) we're telling R that we want to include `NA` values in our calculations, and a single `NA` is enough to make the result `NA`.

And if you're not sure, `NA` stands for "Not Available," meaning data that is missing.

In [None]:
# summarising body mass on the entire penguins dataset while removing NA values from the calculation

penguins %>%
  summarise(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
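
For comparison, here is a sketch of the two alternatives mentioned above, `filter()` and `drop_na()`, on a made-up tibble. Both remove the rows with a missing body mass before the mean is calculated, so `na.rm` isn't needed.

```r
library(dplyr)
library(tidyr)

# toy data with made-up values, including one missing body mass
toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, NA, 3800)
)

# option 1: keep only the rows where body_mass_g is not NA
via_filter <- toy %>%
  filter(!is.na(body_mass_g)) %>%
  summarise(avg_body_mass = mean(body_mass_g))

# option 2: drop_na() removes rows with an NA in the named column
via_drop_na <- toy %>%
  drop_na(body_mass_g) %>%
  summarise(avg_body_mass = mean(body_mass_g))

via_filter
via_drop_na
```

One difference to keep in mind: these approaches remove whole rows, while `na.rm = TRUE` only skips the missing values inside that one calculation.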

In [None]:
# now let's use the grouping function, group_by(), to look at the average body mass of penguins, in grams,
# by species

penguins %>%
  group_by(species) %>%
  summarise(avg_species_body_mass = mean(body_mass_g, na.rm = TRUE)) 

In [None]:
# now let's calculate the average body mass by species AND island

penguins %>%
  group_by(species, island) %>%
  summarise(avg_species_body_mass = mean(body_mass_g, na.rm = TRUE)) 
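
A grouped `summarise()` can return more than one summary column at a time. This sketch, on a made-up tibble, adds a count of penguins per group with `n()`. It also sets `.groups = "drop"`, which is an optional choice on my part so that the result comes back ungrouped.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  island = c("Dream", "Dream", "Torgersen", "Biscoe"),
  body_mass_g = c(3500, NA, 3800, 5000)
)

group_summary <- toy %>%
  group_by(species, island) %>%
  summarise(
    n_penguins = n(),                                # rows in each group
    avg_body_mass = mean(body_mass_g, na.rm = TRUE), # average, ignoring NAs
    .groups = "drop"                                 # return an ungrouped table
  )

group_summary
```

Note that `n()` counts every row in the group, including rows where `body_mass_g` is `NA`.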

## Where to next?

What we've done here only scratches the surface of what you can accomplish with `dplyr`. `dplyr` is a powerful package in its own right, but even more so once you dive into column-wise operations, like `across()`, as well as combine it with other packages in the Tidyverse, such as `purrr` and `ggplot2`.  
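
To give a taste of column-wise operations, here is a minimal sketch of `across()` on a made-up tibble. It calculates the mean of every numeric column for each species in one step, rather than writing out one `mean()` call per column.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  culmen_length_mm = c(39.0, 41.0, NA, 46.0),
  body_mass_g = c(3500, 3700, 5000, 5400)
)

# apply the same summary to every numeric column
numeric_means <- toy %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

numeric_means
```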

What I recommend is making a copy of this notebook, running the cells to ensure you understand what's happening with each function, and then building out additional chains of `dplyr` functions to see what you can discover and learn! Play around and don't worry about making mistakes - it's all part of learning!

These are some helpful resources to get you started:  

* [`dplyr` documentation - so many functions!](https://dplyr.tidyverse.org/reference/index.html)
* [R for Data Science text](https://r4ds.had.co.nz/transform.html)
* [STAT545](https://stat545.com/)
* [More on column-wise operations](https://dplyr.tidyverse.org/articles/colwise.html)