# Summarise and the pipe operator

## Section 7 - Last but not least: summarise

### summarise

* summarise(tbl, new column name = expression) # summarize() also works
* Syntax example
```
summarise(tbl, sum = sum(A),
               avg = mean(B),
               var = var(B))
```

* The functions used in summarise() should take a vector as input and return a single value.
    - base:: and stats:: min, mean, sum, var, sd, length, max, median, IQR
    - dplyr:: first, last, nth, n, n_distinct
    - etc.
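
As a minimal sketch of the rule above, the following uses a small made-up data frame (`toy`, with columns `A` and `B`, is hypothetical and not part of hflights) to show that each expression collapses a whole column into one value, so the result has a single row.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(A = c(1, 2, 3, 4), B = c(10, 20, 30, 40))

# Each expression receives a full column (a vector) and returns one value
res <- summarise(toy,
                 total    = sum(A),
                 avg      = mean(B),
                 variance = var(B))

print(res)
nrow(res)  # one row, whatever the number of input rows
```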

### The syntax of summarise

```
library(dplyr)
library(hflights)

# hflights and dplyr are loaded in the workspace

# Print out a summary with variables min_dist and max_dist
summarise(hflights, min_dist = min(Distance), max_dist = max(Distance))

# Print out a summary with variable max_div
summarise(filter(hflights, Diverted == 1), max_div = max(Distance))
```

| min_dist | max_dist |
| --- | --- |
| 79 | 3904 |

| max_div |
| --- |
| 3904 |


### Aggregate functions

* min(x) - minimum value of vector x.
* max(x) - maximum value of vector x.
* mean(x) - mean value of vector x.
* median(x) - median value of vector x.
* quantile(x, p) - pth quantile of vector x.
* sd(x) - standard deviation of vector x.
* var(x) - variance of vector x.
* IQR(x) - interquartile range (IQR) of vector x.
* diff(range(x)) - total range of vector x.
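
The functions listed above can be tried on a plain vector before using them inside summarise(). This is a small sketch with made-up numbers; `x` and `stats` are illustrative names only.

```r
# Hypothetical values, chosen so the results are easy to check by hand
x <- c(2, 4, 4, 5, 7, 9, 11)

stats <- c(
  min    = min(x),
  max    = max(x),
  mean   = mean(x),
  median = median(x),
  q25    = unname(quantile(x, 0.25)),  # quantile() returns a named vector
  sd     = sd(x),
  var    = var(x),
  iqr    = IQR(x),
  range  = diff(range(x))              # max(x) - min(x)
)

print(stats)
```

Each call reduces the vector to a single number, which is what makes these functions suitable for summarise().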

```
# hflights is available

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarise(temp1, earliest = min(ArrDelay),
                 average = mean(ArrDelay),
                 latest = max(ArrDelay),
                 sd = sd(ArrDelay))

# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn) & !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarise()
summarise(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
```

| earliest | average | latest | sd |
| --- | --- | --- | --- |
| -70 | 7.094334 | 978 | 30.70852 |

| max_taxi_diff |
| --- |
| 160 |
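
The cell above removes NA rows with filter() before summarising. As a hedged sketch on made-up data (`toy_flights` is hypothetical), the same result can usually be obtained by passing `na.rm = TRUE` to the aggregate function, and the last call shows why one of the two is needed.

```r
library(dplyr)

# Hypothetical toy delays with one missing value
toy_flights <- data.frame(ArrDelay = c(-5, 12, NA, 30))

# Option 1: drop the NA rows first, then summarise
by_filter <- summarise(filter(toy_flights, !is.na(ArrDelay)),
                       average = mean(ArrDelay))

# Option 2: keep every row and let mean() skip the NA
by_na_rm <- summarise(toy_flights,
                      average = mean(ArrDelay, na.rm = TRUE))

# Without either, the NA propagates into the summary
with_na <- summarise(toy_flights, average = mean(ArrDelay))

print(by_filter)
print(by_na_rm)
print(with_na)
```

Filtering first is convenient when several summaries should share the same set of rows; `na.rm = TRUE` is shorter when only one column has missing values.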


### dplyr aggregate functions

* first(x) - The first element of vector x.
* last(x) - The last element of vector x.
* nth(x, n) - The nth element of vector x.
* n() - The number of rows in the data.frame or group of observations that summarise() describes.
* n_distinct(x) - The number of unique values in vector x.
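
A short sketch of these helpers on a made-up character vector (the airport codes below are illustrative values, not taken from hflights). Note that n() takes no arguments and is only meaningful inside a dplyr verb such as summarise().

```r
library(dplyr)

# Hypothetical vector of origin codes
x <- c("IAH", "HOU", "IAH", "DFW", "HOU")

first(x)       # first element
last(x)        # last element
nth(x, 2)      # second element
n_distinct(x)  # number of unique values

# n() counts the rows that summarise() is describing
counts <- summarise(data.frame(origin = x),
                    n_obs    = n(),
                    n_origin = n_distinct(origin))
print(counts)
```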

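```
# This exercise assumes UniqueCarrier has been recoded to full carrier names

# Generate summarizing statistics for hflights
summarise(hflights,
          n_obs = n(),
          n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest))

# All American Airline flights
# (matches rows only if UniqueCarrier holds full names such as "American")
aa <- filter(hflights, UniqueCarrier == "American")

# Generate summarizing statistics for aa
summarise(aa, n_flights = n(),
              n_canc = sum(Cancelled == 1),
              avg_delay = mean(ArrDelay, na.rm = TRUE))
```

| n_obs | n_carrier | n_dest |
| --- | --- | --- |
| 227496 | 15 | 116 |

| n_flights | n_canc | avg_delay |
| --- | --- | --- |
| 0 | 0 | |

The zero counts and the empty avg_delay in the second summary show that the filter matched no rows. The hflights data as loaded from the package stores two-letter carrier codes such as "AA" in UniqueCarrier, so `UniqueCarrier == "American"` only matches after the codes have been recoded to full names. On the raw data, filter on `UniqueCarrier == "AA"` instead.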


## Section 8 - Chaining your functions: the pipe operator

### pipes

* Without pipes

```
# Save to other objects after applying one function at a time
# This may slow down the process
a1 <- select(a, X, Y, Z)
a2 <- filter(a1, X > Y)
a3 <- mutate(a2, Q = X + Y + Z)
a4 <- summarise(a3, all = sum(Q))

# Chain all functions together
# This code is difficult to read - Dagwood sandwich problem
summarise(
    mutate(
        filter(
            select(a, X, Y, Z),
            X > Y),
        Q = X + Y + Z),
    all = sum(Q)
)
```

* With the pipe operator: %>% comes from the magrittr package by Stefan Bache

    - object %>% function(____, arg2, arg3, ...)
        - The pipe takes the object on its left and passes it to the function on its right as the first argument of the function

```
a %>%
    select(X, Y, Z) %>%
    filter(X > Y) %>%
    mutate(Q = X + Y + Z) %>%
    summarise(all = sum(Q))

# It makes it much clearer if you translate %>% as 'then'
```
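
To make the first-argument rule concrete, here is a minimal sketch on a plain numeric vector (the values are made up). It checks that `x %>% f(y)` gives the same result as `f(x, y)`, and that chained calls read left to right.

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

# Hypothetical values
x <- c(1.234, 5.678)

nested <- round(x, 1)
piped  <- x %>% round(1)   # x becomes the first argument of round()
identical(nested, piped)

# Chaining: take x, then sum it, then round the total
result <- x %>% sum() %>% round(1)   # same as round(sum(x), 1)
print(result)
```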

### Overview of syntax

Use dplyr functions and the pipe operator to transform the following English sentences into R code:

* Take the hflights data set and then ...
* Add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then ...
* Pick all of the rows whose diff value does not equal NA, and then ...
* Summarise the data set with a value named avg that is the mean diff value.

```
library(dplyr)
library(hflights)

# Write the 'piped' version of the English sentences.
hflights %>%
    mutate(diff = TaxiOut - TaxiIn) %>%
    filter(!is.na(diff)) %>%
    summarise(avg = mean(diff))
```

| avg |
| --- |
| 8.992064 |


### Drive or fly? Part 1 of 2

* mutate() the hflights dataset and add two variables:
    - RealTime: the actual elapsed time plus 100 minutes (for the overhead that flying involves) and
    - mph: calculated as Distance / RealTime * 60, then
* filter() to keep observations that have an mph that is not NA and that is below 70, finally
* summarise() the result by creating four summary variables:
    - n_less, the number of observations,
    - n_dest, the number of destinations,
    - min_dist, the minimum distance and
    - max_dist, the maximum distance.

```
hflights %>%
    mutate(RealTime = ActualElapsedTime + 100,
           mph = Distance / RealTime * 60) %>%
    filter(!is.na(mph) & mph < 70) %>%
    summarise(n_less = n(),
              n_dest = n_distinct(Dest),
              min_dist = min(Distance),
              max_dist = max(Distance))
```

| n_less | n_dest | min_dist | max_dist |
| --- | --- | --- | --- |
| 6726 | 13 | 79 | 305 |


### Drive or fly? Part 2 of 2

* filter() the result of mutate to:
    - keep observations that have an mph under 105 or for which Cancelled equals 1 or for which Diverted equals 1.
* summarise() the result by creating four summary variables:
    - n_non, the number of observations,
    - n_dest, the number of destinations,
    - min_dist, the minimum distance and
    - max_dist, the maximum distance.

```
# Finish the command with a filter() and summarise() call
hflights %>%
    mutate(RealTime = ActualElapsedTime + 100,
           mph = Distance / RealTime * 60) %>%
    filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
    summarise(n_non = n(),
              n_dest = n_distinct(Dest),
              min_dist = min(Distance),
              max_dist = max(Distance))
```

| n_non | n_dest | min_dist | max_dist |
| --- | --- | --- | --- |
| 42400 | 113 | 79 | 3904 |
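
One detail of the filter() call above is easy to miss: rows with a missing ActualElapsedTime (which is what one would expect for cancelled flights) have an NA mph, yet they are still kept. The sketch below, on a made-up data frame (`toy` is hypothetical), illustrates why: filter() keeps only rows where the condition is TRUE, and in R `NA | TRUE` is TRUE while `NA | FALSE` stays NA.

```r
library(dplyr)

NA | TRUE    # TRUE: one TRUE operand is enough
NA | FALSE   # NA: the result depends on the unknown value

# Hypothetical rows: slow, fast, cancelled with NA mph, NA mph but not cancelled
toy <- data.frame(mph       = c(50, 200, NA, NA),
                  Cancelled = c(0, 0, 1, 0))

# Rows where the condition is FALSE or NA are dropped
kept <- filter(toy, mph < 105 | Cancelled == 1)
print(kept)
```

Here the first and third rows are kept; the last row is dropped because its condition evaluates to NA rather than TRUE.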


### Advanced piping exercise

* filter() the hflights tbl to keep only observations whose DepTime is not NA, whose ArrTime is not NA and for which DepTime exceeds ArrTime.
* Pipe the result into a summarise() call to create a single summary variable:
    - num, that simply counts the number of observations.

```
# hflights and dplyr are loaded

# Count the number of overnight flights
hflights %>%
    filter(!is.na(DepTime) & !is.na(ArrTime) & DepTime > ArrTime) %>%
    summarise(num = n())
```

| num |
| --- |
| 2718 |
