<a href="https://colab.research.google.com/github/bitanb1999/RBasics/blob/main/02_datamanipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome

![](https://socialdatascience.network/courses/poster/r.png)

**Welcome to the Intro to R Programming Workshop!**

This is Part II. For a link to Part I click [here](https://colab.research.google.com/drive/1dLsdGbkvgn1JbWgsy9Z-pFmPd_2MG4Xu?usp=sharing).

Link to Slides: https://favstats.github.io/ds3_r_intro/





# R Packages

Packages are at the heart of R: 

* R packages are basically a collection of functions that you load into your working environment.

* They contain code that other R users have prepared for the community.

* It's good to know your packages; they can really make your life easier.

* I suggest keeping track of package developments either on Twitter via #rstats or on [postsyoumighthavemissed.com](https://postsyoumighthavemissed.com/posts/).




You can install packages in R with the `install.packages` function:

In [None]:
install.packages("janitor")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘snakecase’




However, installing is not enough. You also need to load the package via `library`.

In [None]:
library(janitor)

“running command 'timedatectl' had status 1”

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test




Think of `install.packages` as buying a set of tools (for free!) and `library` as pulling out the tools each time you want to work with them.

# The Tidyverse



![](https://predictivehacks.com/wp-content/uploads/2020/11/tidyverse-default.png)


## What is the `tidyverse`?

The tidyverse describes itself:

> The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

<center>
<img src="https://rstudio-education.github.io/tidyverse-cookbook/images/data-science-workflow.png" style="width: 60%" />
</center>


## Core principle: tidy data

* Every column is a variable.
* Every row is an observation.
* Every cell is a single value.

We have already seen tidy data:

| Animal | Maximum Lifespan | Animal/Human Years Ratio  |
|  --- | ---  |  --- |  
| Domestic dog | 24.0 | 5.10 |
| Domestic cat | 30.0 | 4.08 |
| American alligator | 77.0 | 1.59 | 
| Golden hamster | 3.9 | 31.41 |
| King penguin | 26.0 |  4.71 |


### Untidy data I


| Animal | Type | Value  |
|  --- | ---  |  --- |  
| Domestic dog | lifespan | 24.0 |
| Domestic dog | ratio | 5.10 |
| Domestic cat | lifespan | 30.0 |
| Domestic cat | ratio | 4.08 |
| American alligator | lifespan | 77.0 | 
| American alligator | ratio | 1.59 |
| Golden hamster | lifespan | 3.9 |
| Golden hamster | ratio | 31.41 |
| King penguin | lifespan |  26.0 |
| King penguin | ratio |  4.71 |


The data above has multiple rows with the same observation (animal).

= not tidy



### Untidy data II

| Animal | Lifespan/Ratio  |
|  --- | ---  | 
| Domestic dog | 24.0 / 5.10 |
| Domestic cat | 30.0 / 4.08 |
| American alligator | 77.0 / 1.59 | 
| Golden hamster | 3.9 / 31.41 |
| King penguin | 26.0 /  4.71 |

The data above has multiple variables per column.

= not tidy


### Core principle: tidy data

<center>
<img src="https://www.openscapes.org/img/blog/tidydata/tidydata_2.jpg" style="width: 80%" />
</center>

Artist: [Allison Horst](https://github.com/allisonhorst)


Tidy data has two decisive advantages:

* Consistently prepared data is easier to read, process, load and save.

* Many procedures (or the associated functions) in R require this type of data.

<center>
<img src="https://www.openscapes.org/img/blog/tidydata/tidydata_4.jpg" style="width: 40%" />
</center>

Artist: [Allison Horst](https://github.com/allisonhorst)


## Installing and loading the tidyverse

First we install the packages of the tidyverse like this:

In [None]:
 install.packages("tidyverse")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Then we load them:

In [None]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
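The conflicts message means that two packages export a function with the same name and the one loaded last wins. A minimal sketch of how to be explicit about which one you mean, using the `package::function` notation (the objects `kept` and `smoothed` are made up for illustration):

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that pass a test.
kept <- dplyr::filter(data.frame(x = 1:5), x > 3)
kept

# stats::filter() is something else entirely: a linear filter on a series.
# Here it computes a moving average over a window of three values.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the package name in front is optional for the function that "won" the conflict, but it makes the code easier to read when both meanings appear in one script.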



## A new dataset appears..

We are going to work with a new dataset from here on out.

No worries, we will stay within the animal kingdom but we need a dataset that is a little more complex than what we have seen already.

**Meet the Palmer Station penguins!**

Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/).




<center>
<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png" style="width: 80%" />
</center>

<center>
<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png" style="width: 80%" />
</center>

Artist: [Allison Horst](https://github.com/allisonhorst)





### Palmer Penguins

We could install the R package `palmerpenguins` and then access the data. 

However, we are going to use a different method: directly load a .csv file (comma-separated values) into R from the internet.

We can use the `readr` package which provides many convenient functions to load data into R. Here we need `read_csv`:

In [None]:
penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv")



── Column specification ────────────────────────────────────────────────────────
cols(
  studyName = col_character(),
  `Sample Number` = col_double(),
  Species = col_character(),
  Region = col_character(),
  Island = col_character(),
  Stage = col_character(),
  `Individual ID` = col_character(),
  `Clutch Completion` = col_character(),
  `Date Egg` = col_date(format = ""),
  `Culmen Length (mm)` = col_double(),
  `Culmen Depth (mm)` = col_double(),
  `Flipper Length (mm)` = col_double(),
  `Body Mass (g)` = col_double(),
  Sex = col_character(),
  `Delta 15 N (o/oo)` = col_double(),
  `Delta 13 C (o/oo)` = col_double(),
  Comments = col_character()
)
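The column specification printed above is `readr`'s guess at the type of each column. If you would rather state the types yourself, a minimal sketch could look like this. It reads a tiny inline CSV instead of the penguin file (the text is wrapped in `I()`, which assumes a reasonably recent version of `readr`), and the object name `animals_csv` is made up for illustration.

```r
library(readr)

# Inline CSV text stands in for a file or URL in this sketch.
animals_csv <- read_csv(
  I("Animal,Type,Value\nDomestic dog,lifespan,24.0\nDomestic dog,ratio,5.10\n"),
  # Declare the column types explicitly instead of letting readr guess.
  col_types = cols(
    Animal = col_character(),
    Type = col_character(),
    Value = col_double()
  )
)

animals_csv
```

Stating the types up front also means `read_csv` has nothing to guess, so the column specification message is not printed.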




In [None]:
penguins_raw

studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195,3250,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,
PAL0708,7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.46060,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A1,Yes,2007-11-09,34.1,18.1,193,3475,,,,No blood sample obtained.
PAL0708,10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,2007-11-09,42.0,20.2,190,4250,,9.13362,-25.09368,No blood sample obtained for sexing.


### Take a `glimpse`

We can also take a look at the dataset using the `glimpse` function from `dplyr`.

In [None]:
glimpse(penguins_raw)

Rows: 344
Columns: 17
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9



## Initial data cleaning using `janitor`

<center>
<img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" style="width: 20%" />
</center>





`janitor` is not officially part of the tidyverse package collection, but in my view it is incredibly important to know.

It provides some convenient functions for basic cleaning of the data.

Just like any tidyverse-style package, it fulfills the following criterion for its functions:

> The data is always the first argument.

This helps us to match by position.


In [None]:
install.packages("janitor")

library(janitor)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



### `clean_names()`

One annoyance with the `penguins_raw` data is that it has spaces in the variable names. Urgh! 

We have to wrap variable names that contain spaces in backticks:

In [None]:
penguins_raw$`Delta 15 N (o/oo)`
penguins_raw$`Flipper Length (mm)`


`janitor` can help with that, using a function called `clean_names()`.




`clean_names()` just magically turns all our messy column names into readable lower-case snake case:

In [None]:

penguins_clean <- clean_names(penguins_raw) 


This is what the variables look like now:

In [None]:
glimpse(penguins_clean)


Rows: 344
Columns: 17
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39
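To see what `clean_names()` does without the full penguin data, here is a small sketch on a hand-made data frame. The object names `messy` and `tidy_names` are assumptions made for this example; the two column names are borrowed from `penguins_raw`.

```r
library(janitor)

# check.names = FALSE keeps the spaces and brackets in the column names.
messy <- data.frame(
  `Sample Number` = 1:2,
  `Culmen Length (mm)` = c(39.1, 39.5),
  check.names = FALSE
)
names(messy)

# clean_names() returns the same data with snake_case column names.
tidy_names <- clean_names(messy)
names(tidy_names)
```

After cleaning, the columns can be accessed without backticks, for example `tidy_names$culmen_length_mm`.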

### `remove_constant()`

Now we have another problem. Not all variables in the `penguins_clean` data set are that useful. 

Some of them are the same across all observations. We don't need those variables, like `region`.

In [None]:
table(penguins_clean$region)


Anvers 
   344 

We can use the base R function `table` to quickly get some tabulations of our variable.




Here to help get rid of these *constant* columns is the function `remove_constant()`.

In [None]:
penguins_clean <- remove_constant(penguins_clean, quiet = FALSE)

Removing 2 constant columns of 17 columns total (Removed: region, stage).



When we set `quiet = FALSE` we even get some output telling us what exactly was removed. Neat!

Another useful function in `janitor` is `remove_empty()`, which removes all rows or columns that consist only of missing values (i.e. `NA`).
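A minimal sketch of `remove_empty()` on a toy data frame (the objects `with_gaps` and `no_gaps` are invented for this example): column `b` and the second row contain nothing but `NA`, so both should disappear.

```r
library(janitor)

with_gaps <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),   # a column that is entirely missing
  c = c("x", NA, "z")  # together with `a` and `b`, row 2 is entirely missing
)

# `which` controls whether empty rows, empty columns, or both are dropped.
no_gaps <- remove_empty(with_gaps, which = c("rows", "cols"))
no_gaps
```

Rows or columns that are only partly missing are kept; here rows 1 and 3 survive even though `b` was `NA` for them.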




## Data cleaning using `tidyr`

![](https://tidyr.tidyverse.org/logo.png)

Now we are already fairly advanced in our tidying.

But our dataset is still not entirely tidy yet.

Consider the `species` variable:

In [None]:
table(penguins_clean$species)


      Adelie Penguin (Pygoscelis adeliae) 
                                      152 
Chinstrap penguin (Pygoscelis antarctica) 
                                       68 
        Gentoo penguin (Pygoscelis papua) 
                                      124 

This variable violates the tidy rule that each cell should include a single value. 

`species` holds both the *common name* and the *Latin name* of the penguin.

### `separate()`

We can use a `tidyr` function called `separate()` to turn this into two variables.

Two arguments are important for that:

+ `sep`: specifies by which character the value should be split
+ `into`: a vector which specifies the resulting new variable names

In our case we want to split at a space followed by an opening parenthesis (` \\(`) and name the new variables `species` and `latin_name`:

In [None]:
penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name"))

In [None]:
penguins_clean

study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A1,Yes,2007-11-16,40.3,18.0,195,3250,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,
PAL0708,7,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.46060,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A1,Yes,2007-11-09,34.1,18.1,193,3475,,,,No blood sample obtained.
PAL0708,10,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A2,Yes,2007-11-09,42.0,20.2,190,4250,,9.13362,-25.09368,No blood sample obtained for sexing.


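Note that the closing parenthesis is still attached to `latin_name`, because we only split on the opening one.

There is also a function called `unite()` which works in the opposite direction.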
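Here is a small sketch of `separate()` and `unite()` on a toy tibble. It also shows one possible way to strip the leftover closing parenthesis, using `str_remove()` from `stringr`; that extra step and the names `toy`, `toy_split` and `toy_united` are assumptions of this example, not something the workshop code does.

```r
library(tidyr)
library(stringr)
library(tibble)

toy <- tibble(species = c("Adelie Penguin (Pygoscelis adeliae)",
                          "Gentoo penguin (Pygoscelis papua)"))

# Split at " (" into common name and Latin name.
toy_split <- separate(toy, species, sep = " \\(",
                      into = c("species", "latin_name"))

# Drop the closing parenthesis at the end of the Latin name.
toy_split$latin_name <- str_remove(toy_split$latin_name, "\\)$")
toy_split

# unite() goes the other way and pastes two columns into one.
toy_united <- unite(toy_split, "species_full", species, latin_name, sep = " / ")
toy_united
```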

Now our data is in tidy format!

We were lucky: the data already came with one observation per row.

But what if that is not the case?



### `pivot_wider()` and `pivot_longer()`

`tidyr` also comes equipped to deal with data that spreads a single observation across more than one row.

The function to use here is called `pivot_wider`.

Now our `penguins_clean` data is already tidy.

But we can just read in a dataset that isn't:


In [None]:
untidy_animals <- read_csv("https://github.com/favstats/ds3_r_intro/blob/main/data/untidy_animals.csv?raw=true")
untidy_animals


── Column specification ────────────────────────────────────────────────────────
cols(
  Animal = col_character(),
  Type = col_character(),
  Value = col_double()
)




Animal,Type,Value
<chr>,<chr>,<dbl>
Domestic dog,lifespan,24.0
Domestic dog,ratio,5.1
Domestic cat,lifespan,30.0
Domestic cat,ratio,4.08
American alligator,lifespan,77.0
American alligator,ratio,1.59
Golden hamster,lifespan,3.9
Golden hamster,ratio,31.41
King penguin,lifespan,26.0
King penguin,ratio,4.71


You may recognize this data from the subsection *Untidy data I*.

Now let's use `pivot_wider` to make every row an observation.

We need two main arguments for that:

1. `names_from`: tells the function where the new column names come from
2. `values_from`: tells the function where the values should come from



In [None]:
tidy_animals <- pivot_wider(untidy_animals,  names_from = Type, values_from = Value)
tidy_animals

Animal,lifespan,ratio
<chr>,<dbl>,<dbl>
Domestic dog,24.0,5.1
Domestic cat,30.0,4.08
American alligator,77.0,1.59
Golden hamster,3.9,31.41
King penguin,26.0,4.71


`pivot_longer` can untidy our data again.

The argument `cols = ` tells the function which variables to turn into long format:

In [None]:
pivot_longer(tidy_animals,  cols = c(lifespan, ratio))

Animal,name,value
<chr>,<chr>,<dbl>
Domestic dog,lifespan,24.0
Domestic dog,ratio,5.1
Domestic cat,lifespan,30.0
Domestic cat,ratio,4.08
American alligator,lifespan,77.0
American alligator,ratio,1.59
Golden hamster,lifespan,3.9
Golden hamster,ratio,31.41
King penguin,lifespan,26.0
King penguin,ratio,4.71
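By default `pivot_longer` calls the new columns `name` and `value`, as in the output above. A short sketch of how the `names_to` and `values_to` arguments could be used to get the original `Type` and `Value` column names back (the toy objects `toy_wide` and `toy_long` are made up for this example):

```r
library(tidyr)
library(tibble)

toy_wide <- tibble(
  Animal = c("Domestic dog", "Domestic cat"),
  lifespan = c(24.0, 30.0),
  ratio = c(5.10, 4.08)
)

# names_to: column that receives the old column names
# values_to: column that receives the cell values
toy_long <- pivot_longer(toy_wide, cols = c(lifespan, ratio),
                         names_to = "Type", values_to = "Value")
toy_long
```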




## Data manipulation using `dplyr`

<center>
<img src="https://github.com/allisonhorst/stats-illustrations/blob/master/rstats-artwork/dplyr_wrangling.png?raw=true" style="width: 62%" />
</center>

Artist: [Allison Horst](https://github.com/allisonhorst)




### `select()`

helps you select variables







![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/select.png)

`select()` is part of the `dplyr` package and helps you select variables.

Remember: with tidyverse-style functions, **data is always the first argument**.


#### Select variables

Here we only keep `individual_id`, `sex` and `species`.

In [None]:
select(penguins_clean, individual_id, sex, species)

individual_id,sex,species
<chr>,<chr>,<chr>
N1A1,MALE,Adelie Penguin
N1A2,FEMALE,Adelie Penguin
N2A1,FEMALE,Adelie Penguin
N2A2,,Adelie Penguin
N3A1,FEMALE,Adelie Penguin
N3A2,MALE,Adelie Penguin
N4A1,FEMALE,Adelie Penguin
N4A2,MALE,Adelie Penguin
N5A1,,Adelie Penguin
N5A2,,Adelie Penguin


But `select()` is more powerful than that.

#### Remove variables

We can also **remove** variables with a **`-`** (minus).

Here we remove `individual_id`, `sex` and `species`.

In [None]:
select(penguins_clean, -individual_id, -sex, -species)

study_name,sample_number,latin_name,island,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
PAL0708,1,Pygoscelis adeliae),Torgersen,Yes,2007-11-11,39.1,18.7,181,3750,,,Not enough blood for isotopes.
PAL0708,2,Pygoscelis adeliae),Torgersen,Yes,2007-11-11,39.5,17.4,186,3800,8.94956,-24.69454,
PAL0708,3,Pygoscelis adeliae),Torgersen,Yes,2007-11-16,40.3,18.0,195,3250,8.36821,-25.33302,
PAL0708,4,Pygoscelis adeliae),Torgersen,Yes,2007-11-16,,,,,,,Adult not sampled.
PAL0708,5,Pygoscelis adeliae),Torgersen,Yes,2007-11-16,36.7,19.3,193,3450,8.76651,-25.32426,
PAL0708,6,Pygoscelis adeliae),Torgersen,Yes,2007-11-16,39.3,20.6,190,3650,8.66496,-25.29805,
PAL0708,7,Pygoscelis adeliae),Torgersen,No,2007-11-15,38.9,17.8,181,3625,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Pygoscelis adeliae),Torgersen,No,2007-11-15,39.2,19.6,195,4675,9.46060,-24.89958,Nest never observed with full clutch.
PAL0708,9,Pygoscelis adeliae),Torgersen,Yes,2007-11-09,34.1,18.1,193,3475,,,No blood sample obtained.
PAL0708,10,Pygoscelis adeliae),Torgersen,Yes,2007-11-09,42.0,20.2,190,4250,9.13362,-25.09368,No blood sample obtained for sexing.


#### Selection helpers

These *selection helpers* match variables according to a given pattern.

`starts_with()`: Starts with a prefix.

`ends_with()`: Ends with a suffix.

`contains()`: Contains a literal string.

`matches()`: Matches a regular expression.


For example: let's keep all variables that start with `s`:


In [None]:
select(penguins_clean, starts_with("s"))

study_name,sample_number,species,sex
<chr>,<dbl>,<chr>,<chr>
PAL0708,1,Adelie Penguin,MALE
PAL0708,2,Adelie Penguin,FEMALE
PAL0708,3,Adelie Penguin,FEMALE
PAL0708,4,Adelie Penguin,
PAL0708,5,Adelie Penguin,FEMALE
PAL0708,6,Adelie Penguin,MALE
PAL0708,7,Adelie Penguin,FEMALE
PAL0708,8,Adelie Penguin,MALE
PAL0708,9,Adelie Penguin,
PAL0708,10,Adelie Penguin,
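The other helpers work the same way. Here is a brief sketch on a toy tibble with a few of the penguin column names (the object `toy_penguins` and the result names are invented for illustration):

```r
library(dplyr)
library(tibble)

toy_penguins <- tibble(
  individual_id = c("N1A1", "N1A2"),
  culmen_length_mm = c(39.1, 39.5),
  culmen_depth_mm = c(18.7, 17.4),
  flipper_length_mm = c(181, 186),
  body_mass_g = c(3750, 3800)
)

# All measurements recorded in millimetres.
mm_cols <- select(toy_penguins, ends_with("_mm"))

# Everything that mentions the culmen.
culmen_cols <- select(toy_penguins, contains("culmen"))

# A regular expression: names containing "length" or "mass".
length_or_mass <- select(toy_penguins, matches("length|mass"))

names(mm_cols)
names(culmen_cols)
names(length_or_mass)
```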


#### Even more ways to select

Select the first 5 variables:

In [None]:
select(penguins_clean, 1:5)

study_name,sample_number,species,latin_name,island
<chr>,<dbl>,<chr>,<chr>,<chr>
PAL0708,1,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,2,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,3,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,4,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,5,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,6,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,7,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,8,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,9,Adelie Penguin,Pygoscelis adeliae),Torgersen
PAL0708,10,Adelie Penguin,Pygoscelis adeliae),Torgersen


Select everything from `individual_id` to `flipper_length_mm`.




In [None]:
select(penguins_clean, individual_id:flipper_length_mm)

individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm
<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>
N1A1,Yes,2007-11-11,39.1,18.7,181
N1A2,Yes,2007-11-11,39.5,17.4,186
N2A1,Yes,2007-11-16,40.3,18.0,195
N2A2,Yes,2007-11-16,,,
N3A1,Yes,2007-11-16,36.7,19.3,193
N3A2,Yes,2007-11-16,39.3,20.6,190
N4A1,No,2007-11-15,38.9,17.8,181
N4A2,No,2007-11-15,39.2,19.6,195
N5A1,Yes,2007-11-09,34.1,18.1,193
N5A2,Yes,2007-11-09,42.0,20.2,190




### `filter()`

helps you filter rows





![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/filter.png)

Here we only keep penguins from the island `Dream`.

In [None]:
filter(penguins_clean, island == "Dream")

study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,31,Adelie Penguin,Pygoscelis adeliae),Dream,N21A1,Yes,2007-11-09,39.5,16.7,178,3250,FEMALE,9.69756,-25.11223,
PAL0708,32,Adelie Penguin,Pygoscelis adeliae),Dream,N21A2,Yes,2007-11-09,37.2,18.1,178,3900,MALE,9.72764,-25.01020,
PAL0708,33,Adelie Penguin,Pygoscelis adeliae),Dream,N22A1,Yes,2007-11-09,39.5,17.8,188,3300,FEMALE,9.66523,-25.06020,
PAL0708,34,Adelie Penguin,Pygoscelis adeliae),Dream,N22A2,Yes,2007-11-09,40.9,18.9,184,3900,MALE,8.79665,-25.14591,
PAL0708,35,Adelie Penguin,Pygoscelis adeliae),Dream,N23A1,Yes,2007-11-16,36.4,17.0,195,3325,FEMALE,9.17847,-25.23061,
PAL0708,36,Adelie Penguin,Pygoscelis adeliae),Dream,N23A2,Yes,2007-11-16,39.2,21.1,196,4150,MALE,9.15308,-25.03469,
PAL0708,37,Adelie Penguin,Pygoscelis adeliae),Dream,N24A1,Yes,2007-11-16,38.8,20.0,190,3950,MALE,9.18985,-25.12255,
PAL0708,38,Adelie Penguin,Pygoscelis adeliae),Dream,N24A2,Yes,2007-11-16,42.2,18.5,180,3550,FEMALE,8.04787,-25.49523,
PAL0708,39,Adelie Penguin,Pygoscelis adeliae),Dream,N25A1,No,2007-11-13,37.6,19.3,181,3300,FEMALE,9.41131,-25.04169,Nest never observed with full clutch.
PAL0708,40,Adelie Penguin,Pygoscelis adeliae),Dream,N25A2,No,2007-11-13,39.8,19.1,184,4650,MALE,,,Nest never observed with full clutch. Not enough blood for isotopes.


#### `%in%`


Here the **`%in%`** operator can come in handy again if we want to filter more than one island:

In [None]:
islands_to_keep <- c("Dream", "Biscoe")

filter(penguins_clean, island %in% islands_to_keep)

study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,21,Adelie Penguin,Pygoscelis adeliae),Biscoe,N11A1,Yes,2007-11-12,37.8,18.3,174,3400,FEMALE,8.73762,-25.09383,
PAL0708,22,Adelie Penguin,Pygoscelis adeliae),Biscoe,N11A2,Yes,2007-11-12,37.7,18.7,180,3600,MALE,8.66271,-25.06390,
PAL0708,23,Adelie Penguin,Pygoscelis adeliae),Biscoe,N12A1,Yes,2007-11-12,35.9,19.2,189,3800,FEMALE,9.22286,-25.03474,
PAL0708,24,Adelie Penguin,Pygoscelis adeliae),Biscoe,N12A2,Yes,2007-11-12,38.2,18.1,185,3950,MALE,8.43423,-25.22664,
PAL0708,25,Adelie Penguin,Pygoscelis adeliae),Biscoe,N13A1,Yes,2007-11-10,38.8,17.2,180,3800,MALE,9.63954,-25.29856,
PAL0708,26,Adelie Penguin,Pygoscelis adeliae),Biscoe,N13A2,Yes,2007-11-10,35.3,18.9,187,3800,FEMALE,9.21292,-24.36130,
PAL0708,27,Adelie Penguin,Pygoscelis adeliae),Biscoe,N17A1,Yes,2007-11-12,40.6,18.6,183,3550,MALE,8.93997,-25.36288,
PAL0708,28,Adelie Penguin,Pygoscelis adeliae),Biscoe,N17A2,Yes,2007-11-12,40.5,17.9,187,3200,FEMALE,8.08138,-25.49448,
PAL0708,29,Adelie Penguin,Pygoscelis adeliae),Biscoe,N18A1,No,2007-11-10,37.9,18.6,172,3150,FEMALE,8.38404,-25.19837,Nest never observed with full clutch.
PAL0708,30,Adelie Penguin,Pygoscelis adeliae),Biscoe,N18A2,No,2007-11-10,40.5,18.9,180,3950,MALE,8.90027,-25.11609,Nest never observed with full clutch.
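`filter()` also accepts several conditions at once. The following is a small sketch on made-up data (the tibble `toy` and the thresholds are assumptions for this example): conditions separated by a comma must all be true, while `|` means "or".

```r
library(dplyr)
library(tibble)

toy <- tibble(
  island = c("Dream", "Biscoe", "Dream", "Torgersen"),
  body_mass_g = c(3250, 3400, 4650, NA)
)

# Comma = AND: penguins from Dream that are heavier than 4000 g.
heavy_dream <- filter(toy, island == "Dream", body_mass_g > 4000)
heavy_dream

# | = OR: penguins from Dream, or heavier than 3300 g.
# The Torgersen row has a missing body mass, so its test is NA and it is dropped.
dream_or_heavy <- filter(toy, island == "Dream" | body_mass_g > 3300)
dream_or_heavy
```

Keep in mind that `filter()` only keeps rows where the test is `TRUE`, so rows where the test evaluates to `NA` are removed as well.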




### `mutate()`

helps you create variables





![](https://favstats.shinyapps.io/r_intro/_w_dfe6b732/images/mutate.png)

`mutate` will take a statement like this:

`variable_name = some_calculation`

and attach `variable_name` at the *end of the dataset*.



Let's say we want to calculate penguin body mass in kilograms rather than grams.

We take the variable `body_mass_g` and divide it by `1000`.

In [None]:
pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g/1000)

In [None]:
select(pg_new, bodymass_kg, body_mass_g)

bodymass_kg,body_mass_g
<dbl>,<dbl>
3.750,3750
3.800,3800
3.250,3250
,
3.450,3450
3.650,3650
3.625,3625
4.675,4675
3.475,3475
4.250,4250
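A single `mutate()` call can create several variables, and later ones can use the ones defined just before. A minimal sketch with toy data (the tibble `toy`, the variable `bodymass_lb` and the rounded conversion factor are assumptions for this example):

```r
library(dplyr)
library(tibble)

toy <- tibble(body_mass_g = c(3750, 3800, NA))

toy_new <- mutate(
  toy,
  bodymass_kg = body_mass_g / 1000,
  # bodymass_kg was created one line above and can be used right away.
  bodymass_lb = bodymass_kg * 2.20462
)
toy_new
```

Missing values are carried along: the third row stays `NA` in both new columns.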


#### Recoding with `ifelse`

`ifelse()` is a very useful function that lets you easily recode variables based on logical tests.

Its basic functionality looks like this:

$$\color{red}{\text{ifelse}}(\color{orange}{\text{logical test}},\color{blue}{\text{what should happen if TRUE}}, \color{green}{\text{what should happen if FALSE}})$$

Here is a very basic example:


In [None]:
ifelse(1 == 1, "Pick me if test is TRUE", "Pick me if test is FALSE")

In [None]:
ifelse(1 != 1, "Pick me if test is TRUE", "Pick me if test is FALSE")

Let's use `ifelse` in combination with `mutate`.

Let's create the variable `sex_short` which has a shorter label for sex:

In [None]:
mutate(penguins_clean, sex_short = ifelse(sex == "MALE", "m", "f"))

study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments,sex_short
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<chr>
PAL0708,1,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,,,Not enough blood for isotopes.,m
PAL0708,2,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,,f
PAL0708,3,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A1,Yes,2007-11-16,40.3,18.0,195,3250,FEMALE,8.36821,-25.33302,,f
PAL0708,4,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.,
PAL0708,5,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,,f
PAL0708,6,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,,m
PAL0708,7,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.,f
PAL0708,8,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.46060,-24.89958,Nest never observed with full clutch.,m
PAL0708,9,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A1,Yes,2007-11-09,34.1,18.1,193,3475,,,,No blood sample obtained.,
PAL0708,10,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A2,Yes,2007-11-09,42.0,20.2,190,4250,,9.13362,-25.09368,No blood sample obtained for sexing.,
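Notice that penguins with a missing `sex` also get a missing `sex_short`. That is because a test on `NA` is itself `NA`, and `ifelse()` passes it through. A small sketch on a plain vector (the label `"unknown"` is just an illustrative choice):

```r
sex <- c("MALE", "FEMALE", NA)

# The test is NA for the third element, so the result is NA as well.
short <- ifelse(sex == "MALE", "m", "f")
short

# To give missing values their own label, test for NA first.
short_labelled <- ifelse(is.na(sex), "unknown",
                         ifelse(sex == "MALE", "m", "f"))
short_labelled
```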


#### Recoding with `case_when`

`case_when` (from the `dplyr` package) is like `ifelse` but allows for much more complex combinations.

The basic setup for a `case_when` call looks like this:


case_when(


> $\color{orange}{\text{logical test}}$ ~ $\color{blue}{\text{what should happen if TRUE}}$,

> $\color{orange}{\text{logical test}}$ ~ $\color{blue}{\text{what should happen if TRUE}}$,


> $\color{orange}{\text{logical test}}$ ~ $\color{blue}{\text{what should happen if TRUE}}$,

> $TRUE$ ~ $\color{green}{\text{what should happen with everything else}}$,

)

The following code recodes a numeric vector (1 through 50) into three categories:


In [None]:
x <- 1:50

case_when(
  x %in% 1:10 ~ "1 through 10",
  x %in% 11:30 ~ "11 through 30",
  TRUE ~ "above 30"
)

Let's use `case_when` in combination with `mutate`.

Let's create the variable `island_short` which has a shorter label for `island`:




In [None]:
mutate(penguins_clean, 
        island_short = case_when(
          island == "Torgersen" ~ "T",
          island == "Biscoe" ~ "B",
          island == "Dream" ~ "D"
        ))

study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments,island_short
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<chr>
PAL0708,1,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,,,Not enough blood for isotopes.,T
PAL0708,2,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,,T
PAL0708,3,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A1,Yes,2007-11-16,40.3,18.0,195,3250,FEMALE,8.36821,-25.33302,,T
PAL0708,4,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.,T
PAL0708,5,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,,T
PAL0708,6,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,,T
PAL0708,7,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.,T
PAL0708,8,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.46060,-24.89958,Nest never observed with full clutch.,T
PAL0708,9,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A1,Yes,2007-11-09,34.1,18.1,193,3475,,,,No blood sample obtained.,T
PAL0708,10,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A2,Yes,2007-11-09,42.0,20.2,190,4250,,9.13362,-25.09368,No blood sample obtained for sexing.,T


With `case_when` you can also combine conditions on different variables, which makes it a very powerful tool!
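
As a minimal sketch of mixing variables, the example below builds a small made-up tibble (`toy_penguins` and its values are hypothetical, not taken from `penguins_clean`) and uses both `island` and `body_mass_g` in a single `case_when` call. Conditions are checked from top to bottom and the first one that is true wins, which is why the missing-value check comes first.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy_penguins <- tibble(
  island = c("Dream", "Dream", "Biscoe", "Torgersen"),
  body_mass_g = c(3250, 4650, 5700, NA)
)

toy_labeled <- mutate(toy_penguins,
  size_island = case_when(
    is.na(body_mass_g) ~ "unknown",                          # handle missing values first
    island == "Dream" & body_mass_g >= 4000 ~ "heavy Dream penguin",
    island == "Dream" ~ "light Dream penguin",               # remaining Dream penguins
    TRUE ~ "other island"                                    # everything else
  ))

toy_labeled
```

The 4000 g cut-off is an arbitrary choice for the sketch, not a value used elsewhere in the workshop.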



### `rename()`



`rename()` changes only the variable name, using the pattern `new_name = old_name`, and leaves everything else intact:


In [None]:
rename(penguins_clean, sample = sample_number)

study_name,sample,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,1,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,,,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin,Pygoscelis adeliae),Torgersen,N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,
PAL0708,3,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A1,Yes,2007-11-16,40.3,18.0,195,3250,FEMALE,8.36821,-25.33302,
PAL0708,4,Adelie Penguin,Pygoscelis adeliae),Torgersen,N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
PAL0708,5,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,
PAL0708,6,Adelie Penguin,Pygoscelis adeliae),Torgersen,N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,
PAL0708,7,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin,Pygoscelis adeliae),Torgersen,N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.46060,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A1,Yes,2007-11-09,34.1,18.1,193,3475,,,,No blood sample obtained.
PAL0708,10,Adelie Penguin,Pygoscelis adeliae),Torgersen,N5A2,Yes,2007-11-09,42.0,20.2,190,4250,,9.13362,-25.09368,No blood sample obtained for sexing.



### `arrange()`

You can order your data to show the highest or lowest value first.

Let's order by `flipper_length_mm`.

Lowest first:

In [None]:
arrange(penguins_clean, flipper_length_mm)


study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0708,29,Adelie Penguin,Pygoscelis adeliae),Biscoe,N18A1,No,2007-11-10,37.9,18.6,172,3150,FEMALE,8.38404,-25.19837,Nest never observed with full clutch.
PAL0708,21,Adelie Penguin,Pygoscelis adeliae),Biscoe,N11A1,Yes,2007-11-12,37.8,18.3,174,3400,FEMALE,8.73762,-25.09383,
PAL0910,123,Adelie Penguin,Pygoscelis adeliae),Torgersen,N67A1,Yes,2009-11-16,40.2,17.0,176,3450,FEMALE,9.30722,-25.61039,
PAL0708,31,Adelie Penguin,Pygoscelis adeliae),Dream,N21A1,Yes,2007-11-09,39.5,16.7,178,3250,FEMALE,9.69756,-25.11223,
PAL0708,32,Adelie Penguin,Pygoscelis adeliae),Dream,N21A2,Yes,2007-11-09,37.2,18.1,178,3900,MALE,9.72764,-25.01020,
PAL0809,99,Adelie Penguin,Pygoscelis adeliae),Dream,N50A1,Yes,2008-11-10,33.1,16.1,178,2900,FEMALE,9.04218,-26.15775,
PAL0708,7,Chinstrap penguin,Pygoscelis antarctica),Dream,N66A1,Yes,2007-11-28,46.1,18.2,178,3250,FEMALE,8.85664,-24.55644,
PAL0708,48,Adelie Penguin,Pygoscelis adeliae),Dream,N29A2,Yes,2007-11-13,37.5,18.9,179,2975,,,,Sexing primers did not amplify. Not enough blood for isotopes.
PAL0708,12,Adelie Penguin,Pygoscelis adeliae),Torgersen,N6A2,Yes,2007-11-09,37.8,17.3,180,3700,,,,No blood sample obtained.
PAL0708,22,Adelie Penguin,Pygoscelis adeliae),Biscoe,N11A2,Yes,2007-11-12,37.7,18.7,180,3600,MALE,8.66271,-25.06390,



Highest first using `desc()` (for descending):


In [None]:
arrange(penguins_clean, desc(flipper_length_mm))

study_name,sample_number,species,latin_name,island,individual_id,clutch_completion,date_egg,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,delta_15_n_o_oo,delta_13_c_o_oo,comments
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>
PAL0809,64,Gentoo penguin,Pygoscelis papua),Biscoe,N19A2,Yes,2008-11-13,54.3,15.7,231,5650,MALE,8.49662,-26.84166,
PAL0708,2,Gentoo penguin,Pygoscelis papua),Biscoe,N31A2,Yes,2007-11-27,50.0,16.3,230,5700,MALE,8.14756,-25.39369,
PAL0708,34,Gentoo penguin,Pygoscelis papua),Biscoe,N56A2,Yes,2007-12-03,59.6,17.0,230,6050,MALE,7.76843,-25.68210,
PAL0809,66,Gentoo penguin,Pygoscelis papua),Biscoe,N20A2,Yes,2008-11-04,49.8,16.8,230,5700,MALE,8.47067,-26.69166,
PAL0809,76,Gentoo penguin,Pygoscelis papua),Biscoe,N56A2,Yes,2008-11-06,48.6,16.0,230,5800,MALE,8.59640,-26.71199,
PAL0910,90,Gentoo penguin,Pygoscelis papua),Biscoe,N14A2,Yes,2009-11-25,52.1,17.0,230,5550,MALE,8.27595,-26.11657,
PAL0910,114,Gentoo penguin,Pygoscelis papua),Biscoe,N34A2,Yes,2009-11-27,51.5,16.3,230,5500,MALE,8.78557,-25.76147,
PAL0910,116,Gentoo penguin,Pygoscelis papua),Biscoe,N35A2,Yes,2009-11-25,55.1,16.0,230,5850,MALE,8.08354,-26.18161,
PAL0809,68,Gentoo penguin,Pygoscelis papua),Biscoe,N51A2,Yes,2008-11-09,49.5,16.2,229,5800,MALE,8.49854,-26.74809,
PAL0910,112,Gentoo penguin,Pygoscelis papua),Biscoe,N32A2,Yes,2009-11-20,49.8,15.9,229,5950,MALE,8.29226,-26.21019,
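
The output above contains many ties at 230 mm. One way to break ties is to pass a second column to `arrange()`. The sketch below assumes a small made-up tibble (`toy_penguins` is hypothetical) and sorts by flipper length, highest first, and then by body mass, lowest first.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy_penguins <- tibble(
  flipper_length_mm = c(230, 178, 230, 178),
  body_mass_g = c(5700, 3900, 5500, 3250)
)

# Sort by flipper length (descending); ties are ordered by body mass (ascending)
toy_sorted <- arrange(toy_penguins, desc(flipper_length_mm), body_mass_g)

toy_sorted
```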




### `group_by()` and `summarize()`

Use these when you want to aggregate your data (by groups).

Sometimes we want to calculate **group statistics**.

In other languages this is often a pain. 

With `dplyr` this is fairly easy **and** readable.

<img src="https://learn.r-journalism.com/wrangling/dplyr/images/groupby.png" style="width: 80%" />





Let's calculate the average `culmen_length_mm` for each sex.

*First* we group `penguins_clean` by `sex`.

In [None]:
grouped_by_sex <- group_by(penguins_clean, sex)

`summarize` (also spelled `summarise`; both spellings work) works in a similar way to `mutate`:

`variable_name = some_calculation`

In [None]:
summarise(grouped_by_sex, avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE))

sex,avg_culmen_length_mm
<chr>,<dbl>
FEMALE,42.09697
MALE,45.85476
,41.3
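
`summarise` can also compute several statistics in one call, and `n()` returns the number of rows in each group. The following is a minimal sketch on made-up data (`toy_penguins` and its values are hypothetical), not a result from `penguins_clean`.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy_penguins <- tibble(
  sex = c("FEMALE", "FEMALE", "MALE", "MALE", NA),
  culmen_length_mm = c(39.5, 40.3, 39.1, NA, 34.1)
)

toy_grouped <- group_by(toy_penguins, sex)

# One row per group, with two summary columns
toy_summary <- summarise(toy_grouped,
  n_penguins = n(),                                          # rows per group
  avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE) # ignore missing values
)

toy_summary
```

Without `na.rm = TRUE`, the mean for any group that contains a missing value would itself be `NA`.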


### `count()`

Now this is a function that I use all the time.

This function helps you count how often each value occurs within one or more variables.

Simply specify which variable you want to count. 

Let's count how often the species occur.


In [None]:
count(penguins_clean, species, sort = T)

species,n
<chr>,<int>
Adelie Penguin,152
Gentoo penguin,124
Chinstrap penguin,68


The `sort = T` argument tells the function to sort the counts in descending order, so the most frequent value comes first.
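
`count()` also accepts more than one variable, in which case it counts every combination. As a sketch on made-up data (`toy_penguins` is hypothetical), the example below also shows that `count()` gives the same numbers as grouping and then summarising with `n()`.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie", "Gentoo"),
  island  = c("Dream", "Dream", "Biscoe", "Biscoe", "Biscoe")
)

# Count every combination of species and island, most frequent first
toy_counts <- count(toy_penguins, species, island, sort = TRUE)

# The same counts written out with group_by() and summarise()
toy_counts_long <- summarise(
  group_by(toy_penguins, species, island),
  n = n(),
  .groups = "drop"  # return an ungrouped result
)

toy_counts
```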



### The `%>%` operator

<center>
<img src="https://rpodcast.github.io/officer-advrmarkdown/img/magrittr.png" style="width: 62%" />
</center>





The point of the pipe is to help you write code in a way that is easier to read and understand. 

Let's consider an example with some data manipulation we have done so far:

In [None]:
## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)

## then I filter to only Dream island
pg <- filter(pg, island == "Dream")

## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g/1000)

## rename individual id to simply id
pg <- rename(pg, id = individual_id)






Now this works, but there is a problem: we have to write a lot of code that repeats itself!

In [None]:
pg

id,island,body_mass_g,bodymass_kg
<chr>,<chr>,<dbl>,<dbl>
N21A1,Dream,3250,3.250
N21A2,Dream,3900,3.900
N22A1,Dream,3300,3.300
N22A2,Dream,3900,3.900
N23A1,Dream,3325,3.325
N23A2,Dream,4150,4.150
N24A1,Dream,3950,3.950
N24A2,Dream,3550,3.550
N25A1,Dream,3300,3.300
N25A2,Dream,4650,4.650




Another alternative is to *nest all the functions*, but the result has to be read from the inside out:

In [None]:
rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g/1000), id = individual_id)


id,island,body_mass_g,bodymass_kg
<chr>,<chr>,<dbl>,<dbl>
N21A1,Dream,3250,3.250
N21A2,Dream,3900,3.900
N22A1,Dream,3300,3.300
N22A2,Dream,3900,3.900
N23A1,Dream,3325,3.325
N23A2,Dream,4150,4.150
N24A1,Dream,3950,3.950
N24A2,Dream,3550,3.550
N25A1,Dream,3300,3.300
N25A2,Dream,4650,4.650



*The piping style*: 

Read from top to bottom and from left to right, and read the `%>%` as "and then".

> Data first, data once

In [None]:
penguins_clean %>% 
  select(individual_id, island, body_mass_g) %>% 
  filter(island == "Dream") %>% 
  mutate(bodymass_kg = body_mass_g/1000) %>% 
  rename(id = individual_id)


id,island,body_mass_g,bodymass_kg
<chr>,<chr>,<dbl>,<dbl>
N21A1,Dream,3250,3.250
N21A2,Dream,3900,3.900
N22A1,Dream,3300,3.300
N22A2,Dream,3900,3.900
N23A1,Dream,3325,3.325
N23A2,Dream,4150,4.150
N24A1,Dream,3950,3.950
N24A2,Dream,3550,3.550
N25A1,Dream,3300,3.300
N25A2,Dream,4650,4.650
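
A piped chain only prints its result. To keep the result, assign the whole chain to a name. The sketch below assumes a small made-up tibble (`toy_penguins` and `pg_dream` are hypothetical names) and stores the outcome of the same kind of pipeline.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy_penguins <- tibble(
  individual_id = c("N21A1", "N1A1", "N21A2"),
  island = c("Dream", "Torgersen", "Dream"),
  body_mass_g = c(3250, 3750, 3900)
)

# The assignment applies to the result of the whole chain
pg_dream <- toy_penguins %>%
  filter(island == "Dream") %>%
  mutate(bodymass_kg = body_mass_g / 1000) %>%
  rename(id = individual_id)

pg_dream
```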


#### `group_by()` again

Grouping also becomes easier using pipes.

Let's try again to calculate the average `culmen_length_mm` for each sex but this time with pipes.


In [None]:
penguins_clean %>% 
  group_by(sex) %>% 
  summarise(avg_culmen_length = mean(culmen_length_mm, na.rm = TRUE))

sex,avg_culmen_length
<chr>,<dbl>
FEMALE,42.09697
MALE,45.85476
,41.3


#### Small Note on the Pipe

Since R version 4.1.0, base R also provides its own pipe.

It looks like this:

`|>`

While it shares many similarities with `%>%`, there are also some differences.

Going over them is beyond the scope of this workshop, so for the sake of simplicity we will stick with the `magrittr` pipe.
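
For orientation only, here is a minimal sketch of the base pipe in its simplest use. It assumes R 4.1.0 or newer and a made-up vector (`masses_g` is hypothetical), and needs no packages.

```r
# Assumes R >= 4.1.0; masses_g is a made-up vector for illustration
masses_g <- c(3250, 3900, 3300)

# Nested call: read from the inside out
round(mean(masses_g / 1000), 2)

# Same computation with the base pipe: read from left to right
avg_kg <- (masses_g / 1000) |> mean() |> round(2)

avg_kg
```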

# Exercises II

The following includes a list of exercises that you can complete on your own. 

We are going to use the `palmerpenguins` dataset for the tasks ahead!



![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png)

![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png)

## Functions reference list

For reference, here is a list of some useful functions.

If you have trouble with any of these functions, try reading the documentation with `?function_name`

Remember: all these functions take the **data** first.

* `filter()`

  * Subset rows using column values

* `mutate()`

  * Create, modify, and delete columns

* `rename()`

  * Rename columns

* `select()`

  * Subset columns using their names and types

* `summarise()`; `summarize()`

  * Summarise each group to fewer rows

* `group_by()`; `ungroup()`

  * Group by one or more variables

* `arrange()`

  * Arrange rows by column values

* `count()`; `tally()` 

  * Count observations by group

* `distinct()`

  * Subset distinct/unique rows

* `pull()`

  * Extract a single column

* `ifelse()`

  * useful for coding of binary variables

* `case_when()`

  * useful for recoding (when `ifelse` is not enough)

* `separate()`

  * separate one column into several columns at some separator

* `pivot_wider()`

  * turn data into wide format

* `pivot_longer()`

  * turn data into long format


## Task 1

Load the `tidyverse` and `janitor` packages.

If `janitor` is not installed yet (R will report that there is no package called `janitor`), install it.

## Task 2

Read in the already cleaned `palmerpenguins` dataset using 

* `read_csv`
* the following url: https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv

Assign the resulting data to `penguins`.

Then take a look at it using `glimpse`.

What kind of variables can you recognize?

## Task 3

Only keep the variables: `species`, `island` and `sex`.

Only keep variables 2 to 4.

Remove the column year.

Only include columns that contain "mm" in the variable name.

## Task 4

Rename `island` to `location`.

## Task 5

Filter the data so that `species` only includes `Chinstrap`.

Filter the data so that `species` only includes `Chinstrap` or `Gentoo`.




Filter the data so it includes only penguins that are `male` *and* of the species `Adelie`.

## Task 6

Create three new variables that convert `bill_length_mm`, `bill_depth_mm` and `flipper_length_mm` from millimeters to centimeters.
  
Tip: divide the length value by 10.


Create a new variable called `bill_depth_cat` which has two values:

* Everything above a bill depth of 18mm and 18mm itself is "high"
* Everything below a bill depth of 18mm is "low"

Create a new variable called `species_short`.

* `Adelie` should become `A`
* `Chinstrap` should become `C`
* `Gentoo` should become `G`

## Task 7

Calculate the average `body_mass_g` per `island`.



If you haven't done so already, try using the `%>%` operator to do this.

## Task 8

Use the pipe operator (`%>%`) to do all the operations below.



1. Filter the `penguins` data so that it only includes `Chinstrap` or `Adelie`.
2. Rename `sex` to `observed_sex`
3. Only keep the variables `species`, `observed_sex`, `bill_length_mm` and `bill_depth_mm`
4. Calculate the ratio between `bill_length_mm` and `bill_depth_mm`
5. Sort the data by the highest ratio

Try to create the pipe step by step and execute code as you go to see if it works.

Once you are done, assign the data to `new_penguins`.


Calculate the average ratio by `species` and sex (now called `observed_sex`), again using pipes.

## Task 9

Count the number of penguins by island and species.

## Task 10

Below is a dataset that needs some cleaning.

Use the skills that you have learned so far to turn the data into a tidy dataset.

In [None]:
my_animals <- tibble(
  Names = c("Francis", "Catniss", "Theodor", "Eugenia"),
  TheAnimals = c("Dog", "Cat", "Hamster", "Rabbit"),
  Sex = c("m", "f", "m", "f"),
  oWnerr = c("me", "me", "me", "me"),
  `Age/Adopted/Condition` = c("8/2020/Very Good", "13/2019/Wild", "1/2021/Fair", "2/2020/Good")    
) 

Start here:

Once you are done, turn the final data into long format.