# MATH 3345 Supplemental Notes

## Basic Data Transformation

These notes supplement Chapter 3 of _**R for Data Science (2nd Ed.)**_.

### Set Up 


In [None]:
#Remove the comment symbol on the line below, run the line ONE time, then replace the comment symbol
#install.packages("dplyr")

In [None]:
#Remove the comment symbol on the line below, run the line ONE time, then replace the comment symbol
#install.packages("ggplot2")

In [None]:
library(dplyr)
library(ggplot2)

In [None]:
#Run the line below only ONE time (remove/replace comment symbol)
#install.packages("nycflights13")

In [None]:
library(nycflights13)

In [None]:
head(flights,3)

## The 'PIPE' - What It Is & How It Works

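**NOTE:** The textbook shows the symbols **```|>```** for the pipe operator. This is the native pipe built into _R_ (version 4.1.0 and later). These notes use the older **```%>%```** pipe, which comes from the **magrittr** package and becomes available when **dplyr** is loaded. For the examples in these notes, the two pipes work the same way.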
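
As a minimal sketch, the cell below runs the same ```filter``` call once with each pipe symbol to check that they give the same result here. The ```toy_flights``` dataframe and its values are made up for illustration (they are not taken from **nycflights13**), and the ```|>``` line assumes _R_ 4.1 or later.

```r
library(dplyr)  # attaching dplyr also makes %>% available

# Hypothetical toy data, invented for this illustration
toy_flights <- data.frame(
  dest      = c("IAH", "ORD", "IAH", "MIA"),
  arr_delay = c(12, -3, 25, 7)
)

# The same filter written with each pipe
with_native   <- toy_flights |> filter(dest == "IAH")   # native pipe (R >= 4.1)
with_magrittr <- toy_flights %>% filter(dest == "IAH")  # magrittr pipe

# TRUE when both pipes produced the same dataframe
identical(with_native, with_magrittr)
```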

Below we examine the first example that uses the pipe to see what is really happening.

### Example in Section 3.1.3

Here is the example given in Section 3.1.3 using the pipe operator (shown here as ```%>%```).

```
flights %>%
  filter(dest == "IAH") %>% 
  group_by(year, month, day) %>% 
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE)
  )
```
The pipe operator is used to combine several things:

* The **flights** dataframe 
* The **filter** function, which selects only certain rows in a given dataframe (a subset)
* The **group_by** function, which combines multiple rows of a dataframe into related groups
* The **summarize** function, which takes grouped data and computes a given result (in this case, the mean of a given column)

We will break the above command down for more insight into what is happening.

Notice that we are starting with the entire **flights** dataframe; below we look again at its structure and first few rows.

In [None]:
glimpse(flights)

In [None]:
head(flights)

### Step 1: Applying ```filter```

The ```filter``` function (in the **dplyr** package) creates a _subset_ of a given dataframe containing only the rows that match a given condition. This step is shown below in the usual format of the function (without using the pipe).

In [None]:
step_1 <- filter(flights,dest == "IAH")

#### Result of the Filtered Dataframe

The result of this instruction was stored in the **step_1** variable so we can have a better look.

In [None]:
glimpse(step_1)
head(step_1)

##### Observations

Notice that the result is another dataframe, but it is much smaller than the original ```flights``` dataframe: it has only 7,198 rows instead of the original 336,776. As the output above shows, it now contains only the rows where the **dest** variable is 'IAH'.
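
The same kind of check can be made in code instead of by reading the printed output. The sketch below uses a small, made-up dataframe (```toy_flights``` and its values are hypothetical) to confirm that ```filter``` reduces the number of rows, keeps only the matching destination, and leaves the columns unchanged.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_flights <- data.frame(
  flight    = c(101, 102, 103, 104, 105),
  dest      = c("IAH", "ORD", "IAH", "MIA", "ORD"),
  arr_delay = c(12, -3, NA, 7, 30)
)

toy_iah <- filter(toy_flights, dest == "IAH")

nrow(toy_flights)      # rows before filtering
nrow(toy_iah)          # rows after filtering
unique(toy_iah$dest)   # the only destination left in the subset
names(toy_iah)         # same columns as the original dataframe
```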

#### Same Function, Different Format

The same ```filter``` function can be invoked in a different way using the pipe operator, as shown below.

In [None]:
step_one <- flights %>% filter(dest == "IAH")

##### View Results

Compare the results of the instruction we just ran with the results from the previous code.

In [None]:
glimpse(step_one)
head(step_one)

##### What Changed?

The results of this example are identical to the previous results. But the code is different. What's going on?

The first _'argument'_ in the ```filter``` function is the dataframe we want to filter; the second argument specifies _how_ that dataframe should be filtered. In our original example, we used:
```
filter(flights,dest == "IAH")
```

When we used the pipe operator (```%>%```), we wrote the first argument (the dataframe) to the left of the pipe, and the pipe 'fed' it to the ```filter``` function as its first argument:
```
flights %>% filter(dest == "IAH")
```

Then we only needed to provide the second argument in the parentheses.
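
The same rule applies to any function, not just ```filter```. The sketch below uses only base _R_, so it relies on the native pipe ```|>``` (which assumes _R_ 4.1 or later); ```%>%``` handles this simple call the same way. The vector ```x``` is made up for illustration.

```r
# Made-up numbers for illustration
x <- c(16, 4, 9, 25)

# Usual form: the data is written as the first argument
usual <- head(x, 2)

# Piped form: the left-hand side becomes the first argument of head(),
# so only the second argument appears in the parentheses
piped <- x |> head(2)

# TRUE when both forms return the same values
identical(usual, piped)
```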

#### Why Use The Pipe?

The benefit of using the pipe is not evident if you only need to make one function call. But once we start _**nesting**_ function calls, the advantage of the pipe will become more obvious.

### Step 2: Applying ```group_by```

The ```group_by``` function (in the **dplyr** package) groups the rows of a given dataframe according to the values of one or more specified variables. This step is shown below in the usual format of the function (without using the pipe).

In [None]:
step_2 <- group_by(step_1,year,month,day)

#### Result: A Grouped Dataframe

Here we started with the filtered dataframe (in **step_1**) and transformed it into a _**grouped**_ dataframe. Look at the results (in **step_2**) below. The number of rows has not changed; what evidence do you see that the result is a grouped dataframe?

In [None]:
glimpse(step_2)
head(step_2)
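
Besides reading the printed output, the grouping can also be inspected in code. The sketch below is one way to do that on a small, made-up dataframe (```toy_flights``` is hypothetical), using the **dplyr** helpers ```is_grouped_df```, ```group_vars```, and ```n_groups```.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_flights <- data.frame(
  month     = c(1, 1, 1, 2, 2),
  day       = c(1, 1, 2, 1, 1),
  arr_delay = c(12, -3, 25, 7, 30)
)

toy_grouped <- group_by(toy_flights, month, day)

is_grouped_df(toy_flights)   # FALSE: plain dataframe
is_grouped_df(toy_grouped)   # TRUE: grouping information was added
group_vars(toy_grouped)      # names of the grouping variables
n_groups(toy_grouped)        # number of distinct (month, day) combinations

# Grouping does not add or remove rows
nrow(toy_grouped) == nrow(toy_flights)
```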

#### Nesting Functions

Recall that **step_1** was already a result from a previous function call using ```filter```. We stored those results so we could use them to create the grouped dataframe in **step_2**. Below we show how we could have done this in one step, without using the **step_1** variable.

In [None]:
step_2 <- group_by(filter(flights,dest=="IAH"),year,month,day)
glimpse(step_2)

#### Same Process, Different Format

The first argument in the ```group_by``` function was the result created using the ```filter``` function. This is known as _nesting_ functions.  It can be confusing to read. Below we show how we could accomplish the same thing using pipes.

In [None]:
step_two <- flights %>% filter(dest=="IAH") %>% group_by(year,month,day)
glimpse(step_two)

#### How Pipes Help

The pipes help us see the order in which the steps happen, so the code is easier to read than the original nested version.

* We start with the ```flights``` dataframe
* We filter it to keep only the rows with the "IAH" destination
* We group the filtered result by year, month, and day

The result of each step is 'piped' to the next function to carry out the next step.
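
The contrast between the two styles shows up even with simple arithmetic. The sketch below uses only base _R_ and the native pipe ```|>``` (assuming _R_ 4.1 or later); the vector ```x``` is made up for illustration.

```r
# Made-up numbers for illustration
x <- c(16, 4, 9, 25)

# Nested: read from the inside out (sum, then sqrt, then round)
nested <- round(sqrt(sum(x)), 1)

# Piped: read top to bottom, in the order the steps run
piped <- x |>
  sum() |>
  sqrt() |>
  round(1)

# TRUE when both versions compute the same value
identical(nested, piped)
```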

### Step 3: Report Group Values Using ```summarize```

Once records are grouped with the ```group_by``` function, we can use the ```summarize``` function to calculate values for each group. Summary values could include minimum, maximum, mean, median, standard deviation, and so on.

In this example, we take the grouped results from Step 2 and compute the mean arrival delay for _each group_. Since the dataframe is grouped by day, month, and year, there will be one mean for each different date.

In [None]:
# Every flight in this dataset departed in 2013, so year takes only one value
summary(flights$year)

In [None]:
step_3 <- summarize(step_2, arr_delay = mean(arr_delay, na.rm = TRUE))
glimpse(step_3)
head(step_3)
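
To see the per-group calculation on numbers small enough to check by hand, here is a sketch on a made-up dataframe (```toy_flights``` is hypothetical). It also shows what ```na.rm = TRUE``` does: missing values are dropped before the mean is taken, and a group with no non-missing values gives ```NaN```.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_flights <- data.frame(
  month     = c(1, 1, 1, 2, 2),
  day       = c(1, 1, 2, 1, 1),
  arr_delay = c(10, 20, NA, 5, NA)
)

toy_summary <- toy_flights %>%
  group_by(month, day) %>%
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop_last"   # the default behavior, written out explicitly
  )

# One row per (month, day) group:
#   month 1, day 1: mean of 10 and 20
#   month 1, day 2: only NA values, so the mean is NaN
#   month 2, day 1: mean of 5 (the NA is ignored)
toy_summary
```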

#### The Same Task Using Nested Functions 

The process above again relies on intermediate dataframes **step_1** and **step_2**. Below we look at how the same task is performed relying solely on nested functions.

In [None]:
step_3 <- summarize(group_by(filter(flights,dest=="IAH"),year,month,day), arr_delay = mean(arr_delay, na.rm = TRUE))
glimpse(step_3)
head(step_3)

#### Same Process with Pipes Instead of Nesting

To make sense of the nesting in the above example, you need to navigate a lot of parentheses and work your way from the 'inside' to the 'outside' of the function calls. As you may have noticed, this gets increasingly confusing with every level of nesting we introduce.  

The same process is shown below with pipes. Notice that: 
- Reading the code in the order it is written gives us an accurate picture of the order in which things are happening.
- To make the steps more readable, each time there is a result to be 'piped' to the next step, we end one step with a pipe and go to the next line to begin the next step. **_R_** does not require this, but it benefits _us_ when reading the code.

In [None]:
final_result <- flights %>%
                filter(dest == "IAH") %>% 
                group_by(year, month, day) %>% 
                summarize(
                    arr_delay = mean(arr_delay, na.rm = TRUE)
                )

glimpse(final_result)
head(final_result)

### Other Notes

We will use the 'intermediate' dataframes that we created to revisit some other details of what is happening.  Recall:

- **step_1** is filtered so that it only contains the flights arriving at IAH
- **step_2** is the **step_1** dataframe with grouping information added (year, month, day)
- **step_3** is the summarized dataframe showing the mean arrival delay of flights for each group in the **step_2** dataframe

It will be helpful to keep this in mind as we review these additional notes.

#### Note 1.
To get the summary data by date, we had to first group the records by date.  If we had skipped this step, the entire data set would be treated as one big group, and only one mean would be reported, as shown below.

In [None]:
skipped_step <- summarize(step_1, arr_delay = mean(arr_delay, na.rm = TRUE))
glimpse(skipped_step)
head(skipped_step)
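
The difference between the grouped and ungrouped cases is easy to verify on a small, made-up dataframe (```toy_flights``` is hypothetical). The sketch below summarizes the same data twice, once without ```group_by``` and once with it.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_flights <- data.frame(
  month     = c(1, 1, 2, 2),
  arr_delay = c(10, 20, 30, 60)
)

# No grouping: the whole dataframe is one group, so one row comes back
overall <- toy_flights %>%
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))

# Grouped by month: one row per month
by_month <- toy_flights %>%
  group_by(month) %>%
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))

nrow(overall)    # 1
nrow(by_month)   # 2
```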

#### Note 2. 
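When we ran ```summarize``` on the grouped dataframe in Step 3, _R_ displayed a message about the grouping of the output. The message is telling us that the grouping of the _**result**_ only contains year and month. All three grouping variables (year, month, day) were still used to _produce_ the result.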

The best way to see what's going on is to run ```summarize``` again on the **step_3** data frame, as shown below. (Note that first we look at the structure of **step_3** and verify that it only has **year** and **month** as groups.)

In [None]:
glimpse(step_3)
head(step_3)

In [None]:
one_more_step <- summarize(step_3, arr_delay=mean(arr_delay))
glimpse(one_more_step)
one_more_step

##### How ```summarize``` Behaves with Groups

- The ```summarize``` function creates a new dataframe containing only the grouping variables and the summarized variables from the dataframe originally given. The **step_2** dataframe had 3 grouping variables, plus several other variables, including **arr_delay**. The grouping variables and **arr_delay** (summarized as a mean) are the only variables we see in **step_3**.


- By default, the _new_ dataframe created by the ```summarize``` function has _**one fewer**_ grouping variable than its 'parent' dataframe; the last one gets dropped. The **step_2** dataframe was grouped by **year**, **month**, and **day**. Then after ```summarize``` was applied, the result in **step_3** was only grouped by **year** and **month**. Also notice that when we summarized **step_3**, another grouping variable (**month**) was dropped, leaving only **year**.

    (_Think about this and see if you can determine why this is a reasonable default behavior._)

     
- As the message suggests, you can change this default behavior using the ```.groups``` argument when invoking the ```summarize``` function. Below, we avoid the message by adding the ```.groups``` argument and explicitly indicating that we want the default behavior. (Recall that you can learn more about all the available options by bringing up the help/documentation; just type ```?summarize```.)

In [None]:
step_3 <- summarize(step_2, arr_delay = mean(arr_delay, na.rm = TRUE), .groups = "drop_last")
glimpse(step_3)
head(step_3)
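
To compare some of the available ```.groups``` options side by side, here is a sketch on a small, made-up grouped dataframe (```toy_grouped``` is hypothetical). It summarizes the same data with ```"drop_last"```, ```"drop"```, and ```"keep"```, then asks which grouping variables remain in each result.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_grouped <- data.frame(
  year      = c(2013, 2013, 2013, 2013),
  month     = c(1, 1, 2, 2),
  day       = c(1, 2, 1, 1),
  arr_delay = c(10, 20, 30, 60)
) %>%
  group_by(year, month, day)

drop_last <- summarize(toy_grouped, arr_delay = mean(arr_delay), .groups = "drop_last")
drop_all  <- summarize(toy_grouped, arr_delay = mean(arr_delay), .groups = "drop")
keep_all  <- summarize(toy_grouped, arr_delay = mean(arr_delay), .groups = "keep")

group_vars(drop_last)   # "year" "month": the last grouping variable is dropped
group_vars(drop_all)    # character(0): no grouping remains
group_vars(keep_all)    # "year" "month" "day": grouping is unchanged
```

The summarized values are the same in all three results; only the grouping attached to the output differs.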

#### Note 3.

In the above examples, we summarized a variable called **arr_delay** (arrival delay) and used a variable of the same name to hold the summarized result. This is not required, and there may be some benefit to _renaming_ the summarized variable to indicate that the value shown is a summary value. This is demonstrated below. The summarized column has been named **mean_arr_delay** instead of the original **arr_delay**. The example also sets ```.groups = "drop"```, which removes all grouping from the result.

In [None]:
final_result <- flights %>%
                filter(dest == "IAH") %>% 
                group_by(year, month, day) %>% 
                summarize(
                    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
                    .groups = "drop"
                )

glimpse(final_result)
head(final_result)

#### Note 4.

The ```summarize``` function allows you to compute more than one summarized value, as shown in the following examples. The first example counts the flights in each group (using ```n()```) and computes the mean departure and arrival delays; the second example computes several statistics for the arrival delay.

In [None]:
final_result <- flights %>%
                filter(dest == "IAH") %>% 
                group_by(year, month, day) %>%
                summarize(
                    flights = n(),
                    mean_dep_delay = mean(dep_delay, na.rm = TRUE),
                    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
                    .groups = "drop"
                )

glimpse(final_result)
head(final_result)

In [None]:
final_result <- flights %>%
                filter(dest == "IAH") %>% 
                group_by(year, month, day) %>%
                summarize(
                    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
                    std_arr_delay = sd(arr_delay, na.rm = TRUE),
                    min_arr_delay = min(arr_delay, na.rm = TRUE),
                    median_arr_delay = median(arr_delay, na.rm = TRUE),
                    max_arr_delay = max(arr_delay, na.rm = TRUE),
                    .groups = "drop"
                )

glimpse(final_result)
head(final_result)
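
When several summaries are computed together, it helps to check them against numbers small enough to work out by hand. The sketch below does this on a made-up dataframe (```toy_flights``` and the column names ```n_flights``` and ```n_missing``` are hypothetical). It also illustrates one detail worth keeping in mind: ```n()``` counts every row in the group, including rows where **arr_delay** is missing, while the statistics computed with ```na.rm = TRUE``` ignore those rows.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy_flights <- data.frame(
  month     = c(1, 1, 1, 2, 2),
  arr_delay = c(10, 20, NA, 5, 15)
)

toy_stats <- toy_flights %>%
  group_by(month) %>%
  summarize(
    n_flights      = n(),                       # all rows, including the NA
    n_missing      = sum(is.na(arr_delay)),     # rows with a missing delay
    mean_arr_delay = mean(arr_delay, na.rm = TRUE),
    max_arr_delay  = max(arr_delay, na.rm = TRUE),
    .groups = "drop"
  )

# Month 1: 3 flights, 1 missing, mean 15, max 20
# Month 2: 2 flights, 0 missing, mean 10, max 15
toy_stats
```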