
## R Packages

Packages are at the heart of R: 

* R packages are basically a collection of functions that you load into your working environment.

* They contain code that other R users have prepared for the community.


* It's good to know your packages: they can really make your life easier.

* I suggest keeping track of package developments either on Twitter via #rstats or on [postsyoumighthavemissed.com](https://postsyoumighthavemissed.com/posts/).


## R Packages

You can install packages in R using the `install.packages` function:

In [None]:
install.packages("janitor")

However, installing is not enough. You also need to load the package via `library`.

In [None]:
library(janitor)

Think of `install.packages` as buying a set of tools (for free!) and `library` as pulling out the tools each time you want to work with them.
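A common pattern, sketched here as a suggestion rather than something these slides prescribe, is to install a package only when it is missing. `requireNamespace()` is a base R function that reports whether a package can be loaded without attaching it; the helper name `is_installed` is made up for this example.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check works without installing anything.
is_installed("stats")

# Install janitor only if it is missing, then load it for this session.
if (!is_installed("janitor")) {
  install.packages("janitor")
}
library(janitor)
```

In the toolbox picture, this means you only buy the tools once but still pull them out in every new session.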

class: center, middle, inverse

## Break?


class: center, middle, inverse

![](https://predictivehacks.com/wp-content/uploads/2020/11/tidyverse-default.png)


## What is the `tidyverse`?

The tidyverse describes itself:

> The tidyverse is an opinionated **collection of R packages** designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

<center>
<img src="https://rstudio-education.github.io/tidyverse-cookbook/images/data-science-workflow.png" style="width: 60%" />
</center>


## Core principle: tidy data

* Each variable is in its own column
* Each observation / case is on its own row
* Each value is in a cell

We have already seen tidy data:

| Animal | Maximum Lifespan | Animal/Human Years Ratio  |
| --- | --- | --- |
| Domestic dog | 24.0 | 5.10 |
| Domestic cat | 30.0 | 4.08 |
| American alligator | 77.0 | 1.59 | 
| Golden hamster | 3.9 | 31.41 |
| King penguin | 26.0 |  4.71 |


## Untidy data I

.leftcol[
| Animal | Type | Value  |
| --- | --- | --- |
| Domestic dog | lifespan | 24.0 |
| Domestic dog | ratio | 5.10 |
| Domestic cat | lifespan | 30.0 |
| Domestic cat | ratio | 4.08 |
| American alligator | lifespan | 77.0 | 
| American alligator | ratio | 1.59 |
| Golden hamster | lifespan | 3.9 |
| Golden hamster | ratio | 31.41 |
| King penguin | lifespan |  26.0 |
| King penguin | ratio |  4.71 |
]

.rightcol[

<br>

<br>

The data on the left has multiple rows for the same observation (animal).

= not tidy

]
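One way to tidy a table of this shape is `pivot_wider()` from the `tidyr` package, which belongs to the tidyverse. The sketch below assumes `tidyr` is already installed and retypes only the first two animals by hand; the object names `untidy` and `tidy` are made up for illustration.

```r
library(tidyr)

# First two animals from the untidy table, typed in by hand.
untidy <- data.frame(
  Animal = c("Domestic dog", "Domestic dog", "Domestic cat", "Domestic cat"),
  Type   = c("lifespan", "ratio", "lifespan", "ratio"),
  Value  = c(24.0, 5.10, 30.0, 4.08)
)

# Spread the Type column into one column per variable,
# so that each animal ends up on a single row.
tidy <- pivot_wider(untidy, names_from = Type, values_from = Value)
tidy
```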


## Untidy data II

| Animal | Lifespan/Ratio  |
| --- | --- |
| Domestic dog | 24.0 / 5.10 |
| Domestic cat | 30.0 / 4.08 |
| American alligator | 77.0 / 1.59 | 
| Golden hamster | 3.9 / 31.41 |
| King penguin | 26.0 /  4.71 |

The data above has multiple variables per column.

= not tidy
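A column that holds two variables can be split with `separate()` from `tidyr`. This is a minimal sketch assuming `tidyr` is installed; recent `tidyr` versions mark `separate()` as superseded, but it still works. The data frame retypes two rows of the table by hand, and the names `untidy` and `tidy` are made up.

```r
library(tidyr)

# Two rows of the untidy table; check.names = FALSE keeps the "/" in the name.
untidy <- data.frame(
  Animal = c("Domestic dog", "Domestic cat"),
  `Lifespan/Ratio` = c("24.0 / 5.10", "30.0 / 4.08"),
  check.names = FALSE
)

# Split the combined column at " / " into two new columns.
# convert = TRUE turns the resulting text into numbers.
tidy <- separate(
  untidy,
  `Lifespan/Ratio`,
  into = c("lifespan", "ratio"),
  sep = " / ",
  convert = TRUE
)
tidy
```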


## Core principle: tidy data

<center>
<img src="https://www.openscapes.org/img/blog/tidydata/tidydata_2.jpg" style="width: 80%" />
</center>

.fifty[Artist: [Allison Horst](https://github.com/allisonhorst)]

## Core principle: tidy data

Tidy data has two decisive advantages:

* Consistently prepared data is easier to read, process, load and save.

* Many procedures (or the associated functions) in R require this type of data.

<center>
<img src="https://www.openscapes.org/img/blog/tidydata/tidydata_4.jpg" style="width: 40%" />
</center>

.fifty[Artist: [Allison Horst](https://github.com/allisonhorst)]


## Installing and loading the tidyverse

First we install the packages of the tidyverse like this:

In [None]:
install.packages("tidyverse")

Then we load them:

In [None]:
library(tidyverse)

## A new dataset appears...

Before we dive into some functions of the tidyverse, we are going to work with a new dataset and clean it a bit.

No worries, we will stay within the animal kingdom, but we need a dataset that is a little more complex than what we have seen already.



Meet the Palmer penguins! Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/).


.leftcol[
<center>
<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png" style="width: 80%" />
</center>
]

.rightcol[
<center>
<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png" style="width: 80%" />
</center>
.right[
.fifty[Artist: [Allison Horst](https://github.com/allisonhorst)]]
]




## Palmer Penguins

We could install the R package `palmerpenguins` and then access the data. 

However, we are going to use a different method: directly load a .csv file (comma-separated values) into R from the internet.

We can use the `readr` package, which provides many convenient functions to load data into R. Here we need `read_csv`:

In [None]:
penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv")


## Palmer Penguins

In [None]:
penguins_raw

## Palmer Penguins

We can also take a look at the data set using the `glimpse` function from `dplyr`.

In [None]:
glimpse(penguins_raw)

class: center, middle, inverse

## initial data cleaning

### using `janitor`

<center>
<img src="https://github.com/sfirke/janitor/raw/main/man/figures/logo_small.png" style="width: 20%" />
</center>



## cleaning with `janitor`

`janitor` is not officially part of the tidyverse package collection, but in my view it is incredibly important to know.

It provides some convenient functions for basic data cleaning.

Just like any tidyverse-style package, its functions fulfill the following criterion:

> The data is always the first argument.

This lets us pass the data by position, without having to name the argument.


## cleaning with `janitor`

One annoyance with the `penguins_raw` data is that it has spaces in the variable names. Urgh! 

We have to wrap variable names that contain spaces in backticks:

In [None]:
penguins_raw$`Delta 15 N (o/oo)`
penguins_raw$`Flipper Length (mm)`


`janitor` can help with that, using a function called `clean_names()`.



## cleaning with `janitor`

`clean_names()` just magically turns all our messy column names into readable lower-case snake case:

In [None]:
library(janitor)

penguins_clean <- clean_names(penguins_raw) 


This is what the variables look like now:

In [None]:
penguins_clean$delta_15_n_o_oo
penguins_clean$flipper_length_mm
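To see the renaming in isolation, here is a small sketch that does not need the download. The data frame `messy` is a made-up stand-in that reuses three column names from `penguins_raw`; its single row of values is only there for illustration.

```r
library(janitor)

# Toy data frame with three column names borrowed from penguins_raw.
# check.names = FALSE stops data.frame() from altering the messy names.
messy <- data.frame(
  studyName = "PAL0708",
  `Flipper Length (mm)` = 181,
  `Body Mass (g)` = 3750,
  check.names = FALSE
)

# Column names before and after cleaning.
names(messy)
cleaned <- clean_names(messy)
names(cleaned)
```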

## cleaning with `janitor`


In [None]:
glimpse(penguins_clean)


## cleaning with `janitor`

Now we have another problem. Not all variables in the `penguins_clean` data set are that useful. 

Some of them are the same across all observations. We don't need those variables, like `region`.

In [None]:
table(penguins_clean$region)

We can use the base R function `table` to quickly get some tabulations of our variable.


## cleaning with `janitor`

Here to help get rid of these *constant* columns is the function `remove_constant()`.

In [None]:
penguins_clean <- remove_constant(penguins_clean, quiet = FALSE)

In [None]:
message("Removing 2 constant columns of 17 columns total (Removed: region, stage).")

When we set `quiet = FALSE` we even get some feedback on what exactly was removed. Neat!

Another useful function in `janitor` is `remove_empty()`, which removes all rows or columns that consist only of missing values (i.e. `NA`).
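A minimal sketch of `remove_empty()` on a made-up data frame (`toy` and its values are invented for this example). The `which` argument controls whether empty rows, empty columns, or both are dropped.

```r
library(janitor)

# Toy data: the third row and the notes column contain only NA.
toy <- data.frame(
  id    = c(1, 2, NA),
  score = c(10, 20, NA),
  notes = c(NA, NA, NA)
)

# Drop rows and columns that are entirely missing.
cleaned <- remove_empty(toy, which = c("rows", "cols"))
cleaned
```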




class: center, middle, inverse

## Data manipulation using `dplyr`

<center>
<img src="https://github.com/allisonhorst/stats-illustrations/blob/master/rstats-artwork/dplyr_wrangling.png?raw=true" style="width: 62%" />
</center>

.fifty[Artist: [Allison Horst](https://github.com/allisonhorst)]


class: center, middle, inverse

## `select()`

helps you select variables





## `select()`

![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/select.png)

`select()` is part of the `dplyr` package and helps you select variables.

Remember: with tidyverse-style functions, **data is always the first argument**.


## `select()`

![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/select.png)

Here we only keep `individual_id`, `sex` and `species`.

In [None]:
select(penguins_clean, individual_id, sex, species)
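`select()` can also drop columns or pick them by name pattern. The sketch below uses a made-up stand-in for `penguins_clean` (the data frame `toy` and its values are invented) so that it runs on its own.

```r
library(dplyr)

# Toy stand-in for penguins_clean; the values are invented for illustration.
toy <- data.frame(
  individual_id    = c("N1A1", "N1A2", "N2A1"),
  sex              = c("MALE", "FEMALE", NA),
  culmen_length_mm = c(39.1, 39.5, 40.3),
  culmen_depth_mm  = c(18.7, 17.4, 18.0)
)

# Drop a single column with a minus sign.
without_sex <- select(toy, -sex)

# Keep every column whose name starts with "culmen".
culmen_only <- select(toy, starts_with("culmen"))
```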

class: center, middle, inverse

## `filter()`

helps you filter rows



## `filter()`

helps you filter rows

![](https://favstats.shinyapps.io/r_intro/_w_bfa1a45e/images/filter.png)

Here we only keep penguins from the island `Dream`.

In [None]:
filter(penguins_clean, island == "Dream")

## `filter()`

Here the **`%in%`** operator can come in handy again if we want to filter more than one island:

In [None]:
islands_to_keep <- c("Dream", "Biscoe")

filter(penguins_clean, island %in% islands_to_keep)
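Several conditions can be combined in one `filter()` call. This sketch uses a made-up data frame (`toy`, with invented values) in place of `penguins_clean` so that it runs without the download.

```r
library(dplyr)

# Toy stand-in for penguins_clean; the values are invented for illustration.
toy <- data.frame(
  island      = c("Dream", "Dream", "Biscoe", "Torgersen"),
  sex         = c("MALE", NA, "FEMALE", "MALE"),
  body_mass_g = c(3900, 3300, 4500, 3700)
)

# Conditions separated by commas must all be TRUE (logical AND).
heavy_dream <- filter(toy, island == "Dream", body_mass_g > 3500)

# Keep only rows where sex is not missing.
known_sex <- filter(toy, !is.na(sex))
```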

class: center, middle, inverse

## `mutate()`

helps you create variables



## `mutate()`

![](https://favstats.shinyapps.io/r_intro/_w_dfe6b732/images/mutate.png)

`mutate` will take a statement like this:

`variable_name = some_calculation`

and attach `variable_name` at the *end of the dataset*.


## `mutate()`

![](https://favstats.shinyapps.io/r_intro/_w_dfe6b732/images/mutate.png)

Let's say we want to calculate penguin body mass in kilograms rather than grams.

In [None]:
pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g/1000)

In [None]:
select(pg_new, bodymass_kg, body_mass_g)

class: center, middle, inverse

## `rename()`

helps you rename variables



## `rename()`

Just changes the variable name but leaves all else intact:

In [None]:
rename(penguins_clean, sample = sample_number)

class: center, middle, inverse

## `group_by()` and `summarize()`

when you want to aggregate your data (by groups)




## `group_by()` and `summarize()`

Sometimes we want to calculate group statistics.

In other languages this is often a pain. With `dplyr` this is fairly easy **and** readable.

<img src="https://learn.r-journalism.com/wrangling/dplyr/images/groupby.png" style="width: 80%" />




## `group_by()` and `summarize()`

First we group `penguins_clean` by `sex`.

In [None]:
grouped_by_sex <- group_by(penguins_clean, sex)

`summarize` works in a similar way to `mutate`:

`variable_name = some_calculation`

In [None]:
summarize(grouped_by_sex, avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE))
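`summarize()` accepts several `variable_name = some_calculation` statements at once. The sketch below adds a group size with `n()` next to the mean; the data frame `toy` and its values are invented so the example runs on its own.

```r
library(dplyr)

# Toy stand-in for penguins_clean; the values are invented for illustration.
toy <- data.frame(
  sex              = c("MALE", "MALE", "FEMALE", "FEMALE"),
  culmen_length_mm = c(40, 42, 36, NA)
)

by_sex <- group_by(toy, sex)

# One output row per group, one output column per statement.
result <- summarize(
  by_sex,
  n_penguins           = n(),  # number of rows in the group, NA included
  avg_culmen_length_mm = mean(culmen_length_mm, na.rm = TRUE)
)
result
```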

class: center, middle, inverse

# **`%>%`**

## The pipe operator

<center>
<img src="https://rpodcast.github.io/officer-advrmarkdown/img/magrittr.png" style="width: 62%" />
</center>



## The `%>%` operator

The point of the pipe is to help you write code in a way that is easier to read and understand. 

Let's consider an example with the data manipulation we have done so far:

In [None]:
## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)

## then I filter to only Dream island
pg <- filter(pg, island == "Dream")

## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g/1000)

## rename individual id to simply id
pg <- rename(pg, id = individual_id)


## The `%>%` operator

Now this works but the problem is: we have to write a lot of code that repeats itself!

In [None]:
pg

## The `%>%` operator

An alternative is to *nest all the functions*:

In [None]:
rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g/1000), id = individual_id)


## The `%>%` operator

*The piping style*: 

Read the code from top to bottom and from left to right, and read `%>%` as "and then".

In [None]:
penguins_clean %>% 
  select(individual_id, island, body_mass_g) %>% 
  filter(island == "Dream") %>% 
  mutate(bodymass_kg = body_mass_g/1000) %>% 
  rename(id = individual_id)
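The same style works for the grouping functions from earlier. A minimal sketch with a made-up data frame (`toy` and its values are invented): take the data, and then group by island, and then compute the average body mass in kg.

```r
library(dplyr)

# Toy stand-in for penguins_clean; the values are invented for illustration.
toy <- data.frame(
  island      = c("Dream", "Dream", "Biscoe", "Biscoe"),
  body_mass_g = c(3000, 4000, 5000, NA)
)

avg_mass <- toy %>%
  group_by(island) %>%
  summarize(avg_body_mass_kg = mean(body_mass_g, na.rm = TRUE) / 1000)
avg_mass
```

No intermediate object such as `grouped_by_sex` is needed, because each step hands its result straight to the next one.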


class: center, middle, inverse

# Exercises

### It's time to type some R code



<center>
<img src="https://media1.tenor.com/images/72bf7922ac0b07b2f7f8f630e4ae01d2/tenor.gif?itemid=11364811" style="width: 50%" />
</center>

Google Colab: [tinyurl.com/hackr4321](https://tinyurl.com/hackr4321)

