# Data Manipulation with Tidyverse, Part I

In this lecture and the next, we're going to go through a lot of different concepts. Remember, you can use the Complete Notebook as a reference for everything we cover in this lecture.

The goals of this lecture and the next are:

1) Learn about packages and the tidyverse
2) Learn how to manipulate and clean data with the tidyverse

# Why is data manipulation important?

Presumably, we are all planning to do Environmental Data Science:

- We will have a hypothesis of how the world works
- We will want to construct a model that approximates that hypothesis
- We will need data from the real world to build that model
- The data we have may not be set up to be plugged into the model we'd like to run
- However, it could be *manipulated* so that we can use the data we have with the model we want

My personal example: American Time Use Survey and Travel Cost Model, Cell Photo Data and green space

# What are packages and how can I get them?

**What are they:**

- A package contains a bunch of pre-built functions 
- Anyone can load and use them
- Saves you a ton of time because someone already figured out how to do it

**Tidyverse**

- Collection of R packages 
- All are meant for data science
- Have shared syntax
- Makes it easier to import, tidy, transform, visualize, and model data in R
- Shout out to Hadley Wickham and co 

**Installing a package vs. loading a package**

- You only need to install a package on your local computer once (your lab has all the packages pre-installed)
- You then "load" that package in a script whenever you want to use it, with the `library()` command

For reference, the code to install a package on your local computer is below. We do not need to run this code on the lab machines, because the packages are already installed there.

```r
# installing packages that are a part of the tidyverse using r code
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
```

You can load packages after you've installed them with the `library()` function.

In [1]:
# load dplyr, the package we will focus on today
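
A minimal sketch of what this cell could contain, assuming dplyr is already installed (as it is on the lab machines):

```r
# load dplyr, the package we will focus on today
# (assumes install.packages("dplyr") has already been run once)
library(dplyr)
```

Loading has to be repeated in every new R session; installing does not.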


# Manipulating and cleaning with dplyr 

- dplyr is my most used package for data cleaning and manipulation 
- Going to introduce the primary functions in the dplyr package
    - `mutate()`
    - `if_else()`
    - `filter()`
    - `select()`
    - `group_by()`
    - `summarise()`
- There are a million ways to implement these functions 
- Important for your solution strategy to know that these are the functions you can build around 
- I would say 90% of my data cleaning is different combinations of these functions
- Why use dplyr instead of base R?
    - It's often faster and more memory efficient than row-by-row loops (good for large data sets)
    - It's easier to read
- Let's go through some examples of what we did in the Base R lectures, but redo them with dplyr code

# `mutate()`
Last lecture, everyone moved 1 mile further from the park (bummer). We used a loop, like this:

In [2]:
# Build our trusty dataset, but let's call this one myBase


# create a second dataset that is identical to the first

# check to make sure they're the same


In [3]:
# Base R



Let's perform the same command with the `mutate()` function from the `dplyr` package.

In [4]:
# add mile to park dist using dplyr


# use a boolean operator to show the two dfs are the same 


**New things introduced** 

- Pipes (`%>%`): a pipe takes its input (our data frame) and passes it on to the next function. You can chain pipes together.
- `mutate()`: a function from the dplyr package
- A data frame is piped to `mutate()`, which carries out the operation you gave it (`park_dist + 1`) and returns the data frame with the updated column
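
The loop and the `mutate()` version described above can be sketched side by side. This is only an illustration: the dataset from last lecture is not reproduced here, so the values below, the `id` column, and the name `myDplyr` are made up. Only the column names `park_dist` and `male` and the name `myBase` come from the lecture.

```r
library(dplyr)

# Hypothetical stand-in for the dataset from last lecture
myBase <- data.frame(
    id = 1:4,
    park_dist = c(0.5, 1.2, 2.0, 3.1),
    male = c(TRUE, FALSE, TRUE, FALSE)
)

# create a second dataset that is identical to the first
myDplyr <- myBase

# Base R: loop over the rows and add 1 mile to each distance
for (i in seq_len(nrow(myBase))) {
    myBase$park_dist[i] <- myBase$park_dist[i] + 1
}

# dplyr: pipe the data frame into mutate() and overwrite park_dist
myDplyr <- myDplyr %>%
    mutate(park_dist = park_dist + 1)

# use a boolean operator to check that the two data frames match
all(myBase == myDplyr)
```

The dplyr version names the column once and never touches a row index, which is most of what makes it easier to read.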

In [5]:
# create a new column with mutate (rather than just changing original)


**New things introduced** 

- `mutate()` can also construct a new column
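
As a sketch of this second use, giving `mutate()` a column name that does not exist yet adds a column instead of overwriting one. The data values and the name `park_dist_new` are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the lecture dataset
myDplyr <- data.frame(
    park_dist = c(0.5, 1.2, 2.0, 3.1),
    male = c(TRUE, FALSE, TRUE, FALSE)
)

# park_dist_new is a made-up name; the original park_dist is left unchanged
myDplyr <- myDplyr %>%
    mutate(park_dist_new = park_dist + 1)

myDplyr
```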

# `if_else()`

Last lecture, we said that women and non-binary people had their distances from parks recorded wrong. Non-male people were a quarter mile closer to the park than originally recorded.

In [6]:
# goes through each row and changes distance if someone is not male


`dplyr` has an `if_else()` function 

In [7]:
# correct distance using if_else



**New things introduced** 

- `if_else()` combined with `mutate()`
    - the first part is what you're evaluating to check if it's true (`male == FALSE`)
    - Next entry is what to return if true (`park_dist - 0.25`)
    - Final part is what to return if false (`park_dist`)
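
The three parts listed above fit together as in the sketch below. The data values are hypothetical; the condition and the two return values are the ones named in the list.

```r
library(dplyr)

# Hypothetical stand-in for the lecture dataset
myDplyr <- data.frame(
    park_dist = c(0.5, 1.2, 2.0, 3.1),
    male = c(TRUE, FALSE, TRUE, FALSE)
)

# if_else(condition, value if TRUE, value if FALSE), evaluated row by row
myDplyr <- myDplyr %>%
    mutate(park_dist = if_else(male == FALSE, park_dist - 0.25, park_dist))

myDplyr
```

Unlike the loop, there is no explicit iteration: `if_else()` checks the condition for every row at once.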

# `filter()`

We haven't seen something like a filter command yet. The filter command is used when you have a data set and want to subset it to only the observations that meet some condition.

## Example One: Filter for a characteristic
Let's say we have observations of different ecosystems

In [8]:
# ecosystem dataset
df_env_data <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", "Grassland", "Urban"),
    species_richness = c(120, 45, 80, 60, 30),
    pollution_level = c("Low", "High", "Medium", "Low", "High")
)

# display 


# filter to include only locations with low pollution levels


# display the filtered dataset
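
One way the filter step could look is sketched here. The data frame is repeated so the example runs on its own, and the name `df_low_pollution` is a made-up choice.

```r
library(dplyr)

# ecosystem dataset (same as above)
df_env_data <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", "Grassland", "Urban"),
    species_richness = c(120, 45, 80, 60, 30),
    pollution_level = c("Low", "High", "Medium", "Low", "High")
)

# keep only the rows where the condition is TRUE
df_low_pollution <- df_env_data %>%
    filter(pollution_level == "Low")

# display the filtered dataset
df_low_pollution
```

Only the Forest and Grassland rows satisfy the condition, so the result has two rows.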


## Example Two: Drop NAs
You will frequently have datasets that have missing data (*i.e.,* the cell has an `NA` value). Many functions and models won't work with NAs, so they have to be cleaned out of the data set.

Let's imagine we're missing the ecosystem label for one of our observations.

In [9]:
# ecosystem dataset but with NAs
df_env_data_na <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", NA, "Urban"),
    species_richness = c(120, 45, 80, 60, 30),
    pollution_level = c("Low", "High", "Medium", "Low", "High")
)

# display the dataset with NAs


# drop rows with NAs in the 'ecosystem' column using filter


# display the cleaned dataset
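
A sketch of dropping the incomplete row with `filter()`, using base R's `is.na()` to detect missing values. The data frame is repeated so the example runs on its own, and `df_env_data_clean` is a made-up name.

```r
library(dplyr)

# ecosystem dataset but with NAs (same as above)
df_env_data_na <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", NA, "Urban"),
    species_richness = c(120, 45, 80, 60, 30),
    pollution_level = c("Low", "High", "Medium", "Low", "High")
)

# is.na() is TRUE for missing values, so !is.na() keeps the complete rows
df_env_data_clean <- df_env_data_na %>%
    filter(!is.na(ecosystem))

# display the cleaned dataset
df_env_data_clean
```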


# `select()`

You have a data set with more columns than you need. Sometimes you want to only work with some of the variables, and it is space efficient to get rid of the rest. 

For example, let's say you're only interested in the pollution level of each ecosystem, not the species richness.

In [10]:
# select only the ecosystem and pollution_level columns


# display the selected columns
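
A sketch of the column selection, with the data frame repeated so the example runs on its own. The name `df_pollution` is a made-up choice.

```r
library(dplyr)

# ecosystem dataset (same as in the filter example)
df_env_data <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", "Grassland", "Urban"),
    species_richness = c(120, 45, 80, 60, 30),
    pollution_level = c("Low", "High", "Medium", "Low", "High")
)

# list the columns to keep; everything else is dropped
df_pollution <- df_env_data %>%
    select(ecosystem, pollution_level)

# display the selected columns
df_pollution
```

Where `filter()` chooses rows, `select()` chooses columns: the result still has all five rows.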


# `group_by()`

`group_by()` is useful when you want to aggregate information up to a higher level. For example, you might need a variable that holds the average species richness by ecosystem rather than the species richness at individual sites.

Let's work with a longer version of our ecosystem dataset.

In [11]:
# create an environmental dataset 
df_env_long <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", "Grassland", "Urban", "Forest", "Desert", "Wetland", "Grassland", "Urban"),
    species_richness = c(120, 45, 80, 60, 30, 110, 50, 85, 65, 35),
    pollution_level = c("Low", "High", "Medium", "Low", "High", "Low", "High", "Medium", "Low", "High")
)

# display the dataset


# group by ecosystem and calculate the mean species richness


# display the updated dataset with mean species richness
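
One possible version of this cell pairs `group_by()` with `mutate()`, so every row keeps its own richness value and gains the mean for its ecosystem. The names `df_env_means` and `mean_richness` are made up, and the data frame is repeated so the sketch runs on its own.

```r
library(dplyr)

# longer ecosystem dataset (same as above)
df_env_long <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", "Grassland", "Urban",
                  "Forest", "Desert", "Wetland", "Grassland", "Urban"),
    species_richness = c(120, 45, 80, 60, 30, 110, 50, 85, 65, 35),
    pollution_level = c("Low", "High", "Medium", "Low", "High",
                        "Low", "High", "Medium", "Low", "High")
)

# group by ecosystem, then compute the mean within each group
df_env_means <- df_env_long %>%
    group_by(ecosystem) %>%
    mutate(mean_richness = mean(species_richness)) %>%
    ungroup()  # drop the grouping so later steps act on the whole table

# display the updated dataset with mean species richness
df_env_means
```

The result still has ten rows; both Forest rows, for instance, carry the same group mean.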


# `summarise()` 
Finally, the `summarise()` function can help you create summary tables. These are useful for getting a quick snapshot of the data. Using them is particularly important when you have a dataset too large to look through and glean insight from directly (which, for me, happens if the dataset is longer than 5 observations).

In [12]:
# group by ecosystem and calculate the mean and total species richness


# display the summary table
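
A sketch of a summary table built from the longer ecosystem dataset. The names `df_summary`, `mean_richness`, and `total_richness` are made up, and the data frame is repeated so the example runs on its own.

```r
library(dplyr)

# longer ecosystem dataset (same as in the group_by() section)
df_env_long <- data.frame(
    ecosystem = c("Forest", "Desert", "Wetland", "Grassland", "Urban",
                  "Forest", "Desert", "Wetland", "Grassland", "Urban"),
    species_richness = c(120, 45, 80, 60, 30, 110, 50, 85, 65, 35),
    pollution_level = c("Low", "High", "Medium", "Low", "High",
                        "Low", "High", "Medium", "Low", "High")
)

# summarise() collapses each group down to a single row
df_summary <- df_env_long %>%
    group_by(ecosystem) %>%
    summarise(
        mean_richness = mean(species_richness),
        total_richness = sum(species_richness)
    )

# display the summary table
df_summary
```

Compared with `mutate()` on grouped data, which keeps all ten rows, `summarise()` returns one row per ecosystem, five in total.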
