# Data Transformation with dplyr in R
By: Fauzi Bajuri for DataScience SG Youth Wing

## Why use dplyr?

- Intuitive to write and easy to read, especially when using "chaining" syntax

### Useful shortcuts and tips for using Jupyter Notebook
https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed

### Resources:

- R for Data Science http://r4ds.had.co.nz/transform.html
- Hands-on video tutorial https://www.youtube.com/watch?v=jWjqLW-u3hc and http://rpubs.com/justmarkham/dplyr-tutorial

## About dplyr

- Five basic verbs: filter, select, arrange, mutate, summarise (plus 'group_by')
- Joins (not covered)
- dplyr masks a few base R functions (for example, stats::filter() and stats::lag())
- Its predecessor is the plyr package
- The dplyr approach is simpler to write and read

- **Command structure (for all dplyr verbs)**:
    - first argument is a data frame
    - return value is a data frame
    - nothing is modified in place
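
The command structure above can be illustrated with a small sketch. The data frame `toy` and its values are made up for illustration only and are not part of file.csv.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(month = c(1, 11, 12), carrier = c("AA", "UA", "DL"))

# First argument is a data frame; the return value is a new data frame
late_year <- filter(toy, month >= 11)

# The input is not modified in place
nrow(toy)        # toy still has all of its rows
nrow(late_year)  # only the rows that matched the condition

# To keep a result, assign it to a name
toy_filtered <- filter(toy, carrier != "DL")
```

If the result is not assigned, it is only printed and then discarded.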

## Comparison with Microsoft Excel
- This training compares the verbs with functions and features in Microsoft Excel to help participants understand how they work!
- Participants are encouraged to open file.csv in Microsoft Excel during the session!

In [None]:
#install & load packages
#install.packages("dplyr")
library(dplyr)

In [None]:
flights <- read.csv("file.csv")
flights <- as_tibble(flights) #tbl_df() is deprecated; as_tibble() is the current equivalent

In [None]:
str(flights)

## About Dataset: New York City Flights 13

This dataset contains information on all flights that departed from NYC in 2013. The variables in this dataset are:

- **year, month, day** - Date of departure.
- **dep_time, arr_time** - Actual departure and arrival times (HHMM format).
- **sched_dep_time, sched_arr_time** - Scheduled departure and arrival times (HHMM format).
- **dep_delay, arr_delay** - Departure and arrival delays, in minutes.
- **hour, minute** - Time of scheduled departure.
- **carrier** - Carrier abbreviation (See: https://www.census.gov/foreign-trade/reference/codes/aircarrier/acname.txt)
- **tailnum** - Tail number of plane.
- **flight** - Flight number.
- **origin, dest** - Origin and destination.
- **air_time** - Time spent in air.
- **distance** - Distance flown.
- **time_hour** - Scheduled date and hour of flight.

Source: http://statseducation.com/Introduction-to-R/modules/graphics/ggplot2/

## Verb 1 - filter() using Relational (>, >=, <, <=, !=, ==) and Logical (&, |) Operators
filter() is similar to the Filter feature in Microsoft Excel: it keeps only the rows that meet your conditions.

In [None]:
#Base R example

#head(flights[flights$month == 11 | flights$month == 12, ]) #not modified in place

#The dplyr method is easier to read

#OR (|) operator

#filter to view data in Nov and Dec

filter(flights, month == 11 | month == 12) #not modified in place

#head(filter(flights, month == 11 | 12)) #wrong! read as (month == 11) | 12, which is TRUE for every row
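
A minimal sketch of why the last (commented-out) line is wrong, using a made-up data frame `toy` rather than the flights data. `%in%` is a base R operator, shown here as one possible shorthand for several OR conditions on the same column.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(month = c(1, 6, 11, 12))

# Wrong: parsed as (month == 11) | 12, and any non-zero number counts as TRUE,
# so every row passes the filter
n_wrong <- nrow(filter(toy, month == 11 | 12))

# Right: spell out both comparisons
n_or <- nrow(filter(toy, month == 11 | month == 12))

# Also right: %in% tests membership in a set of values
n_in <- nrow(filter(toy, month %in% c(11, 12)))

c(wrong = n_wrong, or = n_or, `in` = n_in)
```

The first count equals the total number of rows, while the other two agree with each other.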

In [None]:
#filter on string/non-numeric columns
#filter for flights in Nov or Dec operated by carrier AA or UA
filter(flights, month == 11 | month == 12, carrier == "AA" | carrier == "UA") %>% summarize(n = n())

#use nrow() or str() to check the no. of observations/rows after the filter is applied
#or pipe into summarize to count the rows (%>% summarize(n = n()))


In [None]:
#AND conditions (&)
#filter to view data on 1st of November

filter(flights, month == 11 & day == 1) %>% summarize(n = n())



## Exercise 1 - Filter

In [1]:
#1) Filter the dataframe for flights on 1st March that departed earlier than scheduled from John F. Kennedy International Airport (176)

filter(flights, month == 3, day == 1, dep_delay < 0, origin == "JFK") %>% summarize(no._obs= n())

#2) Filter the dataframe for flights in September by either United Airlines or American Airlines that were scheduled to arrive at Los Angeles International Airport between 12 noon and 6PM (288)

filter(flights, month == 9, carrier == "UA" | carrier == "AA", dest == "LAX", sched_arr_time >= 1200, sched_arr_time <= 1800) %>% summarize(no._obs= n())



## Verb 2 - arrange()
Arrange is similar to sorting a table in Microsoft Excel

In [None]:
head(arrange(flights, dep_delay)) #by default, sorts in ascending order (smallest to largest value)

In [None]:
head(arrange(flights, desc(dep_delay))) #wrap the column in desc() to sort in descending order (largest to smallest value)
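
One detail that differs from sorting in Excel is how missing values are treated. The sketch below uses a made-up data frame `toy` to show that arrange() places NA values at the end in both ascending and descending order.

```r
library(dplyr)

# Hypothetical toy data with one missing delay, for illustration only
toy <- data.frame(flight = c(101, 102, 103), dep_delay = c(5, NA, -3))

asc_order  <- arrange(toy, dep_delay)        # smallest to largest, NA last
desc_order <- arrange(toy, desc(dep_delay))  # largest to smallest, NA still last

asc_order$dep_delay
desc_order$dep_delay
```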

## Exercise 2 - Arrange

In [None]:
#Arrange by arr_time and then by month (latest months first)

head(arrange(flights, arr_time, desc(month)))

In [None]:
#sort in descending order by month and day, and in ascending order by dep_delay

head(arrange(flights, desc(month), desc(day), dep_delay))

## Verb 3 - select()
select() is similar to SELECT in SQL, or to deleting the columns you do not need in Microsoft Excel. It lets you rapidly zoom in on a useful subset of columns using operations based on the names of the variables.

In [None]:
select(flights, year, month, day) %>% head(.,2)

In [None]:
select(flights, year:day) %>% head(.,2) # use : to select columns from:to

In [None]:
select(flights, -(year:day)) %>% head(.,2) #use - to select all except the column names provided in argument

#### Good to know!

There are a number of helper functions you can use within select():

- **starts_with("abc")**: matches names that begin with “abc”.

- **ends_with("xyz")**: matches names that end with “xyz”.

- **contains("ijk")**: matches names that contain “ijk”.

- **matches("(.)\\1")**: selects variables that match a regular expression. This one matches any variables that contain repeated characters.

- **num_range("x", 1:3)**: matches x1, x2 and x3.
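
As a rough sketch of the helpers listed above, the example below applies a few of them to a made-up one-row data frame `toy`. The column names x1 to x3 are hypothetical and exist only to demonstrate num_range().

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(x1 = 1, x2 = 2, x3 = 3,
                  dep_time = 517, arr_time = 830, tailnum = "N14228")

# contains(): column names that contain "time"
time_cols <- names(select(toy, contains("time")))

# num_range(): x1 and x2, built from a prefix and a numeric range
x_cols <- names(select(toy, num_range("x", 1:2)))

# matches(): column names that match a regular expression
regex_cols <- names(select(toy, matches("^(dep|arr)_")))

time_cols
x_cols
regex_cols
```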

In [None]:
select(flights, starts_with("dep")) %>% head(.,2)

In [None]:
select(flights, ends_with("time")) %>% head(.,2)

## Verb 4 - mutate()
mutate() adds new columns that are functions of existing columns. Useful operators include +, -, *, /, ^, %/% (integer division) and %% (remainder).
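
Integer division and remainder are handy for splitting a clock time into its parts. The sketch below assumes times are stored as HHMM numbers (for example, 517 for 5:17 AM); the data frame `toy` is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; assumes dep_time is stored as HHMM
toy <- data.frame(dep_time = c(517, 1345, 2359))

toy_split <- mutate(toy,
  dep_hour   = dep_time %/% 100,  # integer division keeps the hour part
  dep_minute = dep_time %% 100    # remainder keeps the minute part
)

toy_split
```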

In [None]:
mutate(flights,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60) #%>% head(.,2)
#speed = distance / time; air_time is in minutes, so multiplying by 60 gives speed per hour

In [None]:
#If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
) %>% head(.,2)

## Verb 5 -  summarise() with group_by
The last key verb is summarise(). On its own, it collapses a data frame to a single row.

In [None]:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) #computes mean for dep_delay for entire dataframe (remove NA)
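
A small sketch of why `na.rm = TRUE` matters here, using a made-up data frame `toy` with one missing delay.

```r
library(dplyr)

# Hypothetical toy data with one missing value, for illustration only
toy <- data.frame(dep_delay = c(2, NA, 10))

# Without na.rm, a single NA makes the whole mean NA
with_na <- summarise(toy, delay = mean(dep_delay))

# With na.rm = TRUE, the NA is dropped before the mean is computed
without_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

with_na$delay
without_na$delay
```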

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, let's take a closer look at a powerful idea that we have already been using: the pipe (%>%).

A good way to pronounce %>% when reading code is “then”.
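
To make the "then" reading concrete, the sketch below writes the same two steps once as nested calls and once with the pipe. The data frame `toy` is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(carrier = c("AA", "UA", "AA"), dep_delay = c(5, -2, 11))

# Nested calls: read from the inside out
nested <- arrange(filter(toy, carrier == "AA"), desc(dep_delay))

# Piped calls: read top to bottom as "take toy, then filter, then arrange"
piped <- toy %>%
  filter(carrier == "AA") %>%
  arrange(desc(dep_delay))

# Both versions give the same delays in the same order
identical(nested$dep_delay, piped$dep_delay)
```

The pipe passes the result on its left as the first argument of the function on its right, which is why every dplyr verb taking a data frame first makes chaining possible.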

In [None]:
flights %>%
    group_by(carrier) %>% #set how you want to group the dataframe
    summarize(count = n()) #n() is used to count no. of rows/observations

How to read the above:

"Transform the *flights* dataframe by (1) grouping the rows by carrier and *then* (2) summarizing each group by count (no. of flights per carrier)"

**Note how this is similar to using =COUNTIF or a Pivot Table in Microsoft Excel**
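
As an aside that is not part of the original session, dplyr also provides count(), which can be used as a shorthand for this particular group-and-count pattern. The sketch below uses a made-up data frame `toy` to show that both approaches give the same counts.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(carrier = c("AA", "UA", "AA", "DL", "UA", "AA"))

# Long form: group, then count the rows in each group
by_group <- toy %>%
  group_by(carrier) %>%
  summarize(n = n())

# Shorthand: count() groups and counts in one step
by_count <- count(toy, carrier)

by_group$n
by_count$n
```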

We can add on more transformations by using %>% 

In [None]:
flights %>%
    group_by(carrier) %>% #set how you want to group the dataframe
    summarize(count = n(), #n() is used to count no. of rows/observations
              mean_dist = mean(distance, na.rm = TRUE), #compute mean distance per carrier
              median_arr_delay = median(arr_delay, na.rm = TRUE) #compute median arrival delay per carrier
             ) %>%
    filter(count >= 1000) %>% #keep only carriers with at least 1000 flights
    arrange(median_arr_delay) #arrange/sort in ascending order based on median_arr_delay

How to read the above:

"Transform the flights dataframe by...
- **(1)** **grouping** the rows by carrier and **then**
- **(2)** **summarizing** each group by count (no. of flights), mean distance and median arrival delay and **then**
- **(3)** **filtering** only for carriers with >= 1000 flights and **then**
- **(4)** **arranging** the result in ascending order of median arrival delay"

## Exercise 3 - Wrap up dplyr

Create a dataframe to show (1) the mean distance in km (1 mile = 1.60934 km) and (2) the no. of flights for each pair of Origin & Destination (OD), (3) excluding HNL (i.e. I do not want HNL as either origin or destination in my dataframe).

After that, I want the result sorted (4) in alphabetical order by origin first, then by no. of flights (in descending order).



In [None]:
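#1 mile = 1.60934 km

flights %>%
    group_by(origin, dest) %>% #one group per origin-destination pair
    summarize(count = n(), #no. of flights per pair
              mean_dist_km = round(mean(distance, na.rm = TRUE)*1.60934) #mean distance converted to km
             ) %>%
    arrange(origin, desc(count)) %>% #origin A to Z, then busiest routes first
    filter(dest != "HNL", origin !="HNL") #drop any pair that involves HNL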