In this notebook, we will cover:

* [Adding New Variables](#Adding-New-Variables)

Let us load up the `tidyverse` and `nycflights13` packages.

In [2]:
install.packages("nycflights13")
library(tidyverse)
library(nycflights13)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# Adding New Variables

Let us zoom in on a few variables of interest.

In [27]:
my_flights <- select(flights, year:day, dep_time, arr_time, air_time, origin, dest)
head(my_flights)

year,month,day,dep_time,arr_time,air_time,origin,dest
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>
2013,1,1,517,830,227,EWR,IAH
2013,1,1,533,850,227,LGA,IAH
2013,1,1,542,923,160,JFK,MIA
2013,1,1,544,1004,183,JFK,BQN
2013,1,1,554,812,116,LGA,ATL
2013,1,1,554,740,150,EWR,ORD


Additional variables can be added using the `mutate()` function. We already have an `air_time` variable. Let us compute the total time for the flight by subtracting the time of departure `dep_time` from the time of arrival `arr_time`.

We notice something odd though. When we subtract 5h 17m from 8h 30m we should get 3h 13m, i.e. 193 minutes. But instead we get 313 minutes below.

In [28]:
mutate(my_flights, total_time = arr_time - dep_time) %>%
    head()

year,month,day,dep_time,arr_time,air_time,origin,dest,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<int>
2013,1,1,517,830,227,EWR,IAH,313
2013,1,1,533,850,227,LGA,IAH,317
2013,1,1,542,923,160,JFK,MIA,381
2013,1,1,544,1004,183,JFK,BQN,460
2013,1,1,554,812,116,LGA,ATL,258
2013,1,1,554,740,150,EWR,ORD,186


The issue is that `dep_time` and `arr_time` are in the hour-minute notation which is great to look at but not very useful for computations. We should first convert these times into the number of minutes elapsed since midnight.
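To make the discrepancy concrete, here is a small base R check for the first flight in the table (departure 517, arrival 830). The variable names `naive` and `correct` are chosen only for this illustration.

```r
# First flight in the table: departs 5:17, arrives 8:30 (clock times)
dep <- 517
arr <- 830

# Treating HHMM values as plain integers mixes base-100 and base-60 arithmetic
naive <- arr - dep

# Converting by hand to minutes since midnight before subtracting
correct <- (8 * 60 + 30) - (5 * 60 + 17)

naive    # 313
correct  # 193
```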

We want to add two new variables, `new_dep` and `new_arr`, but first we need to write a function that does the conversion.

In [5]:
hourmin2min <- function(hourmin) {
    min <- hourmin %% 100 # remainder after division by 100
    hour <- (hourmin - min) %/% 100 # quotient after division by 100
    return(60*hour + min)
} 

Let us test the function on 530. That's 5h 30min, i.e., 330 minutes since midnight.

In [6]:
hourmin2min(530)

The `hourmin2min` function is **vectorized**: given a vector, it outputs a vector.

In [7]:
hourmin2min(c(430,530,630,730))
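As a quick sanity check, the sketch below repeats the definition so that it runs on its own and compares it against a one-line variant. The name `hourmin2min_short` is hypothetical and introduced only here. Since `hourmin %/% 100` already discards the minutes, subtracting them first is not strictly necessary.

```r
# Same definition as in the notebook, repeated so this block is self-contained
hourmin2min <- function(hourmin) {
    min <- hourmin %% 100
    hour <- (hourmin - min) %/% 100
    return(60*hour + min)
}

# Hypothetical shorter variant: integer division alone extracts the hour
hourmin2min_short <- function(hourmin) {
    60 * (hourmin %/% 100) + hourmin %% 100
}

# A few edge cases: midnight, under one hour, exactly 1:00, late evening, missing
times <- c(0, 59, 100, 517, 830, 2359, NA)

hourmin2min(times)   # 0 59 60 317 510 1439 NA
identical(hourmin2min(times), hourmin2min_short(times))
```

Missing values pass through as `NA`, which matters because some flights in `flights` have no recorded departure or arrival time.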

R provides several built-in vectorized functions that can be combined to create more complicated functions. These include:

* **Arithmetic operators** `+, -, *, /, ^`
* **Modular arithmetic operators** `%/%` and `%%` 
* **Logarithms** `log()`, `log10()`, `log2()`
* **Offsets** `lag()` and `lead()` (provided by `dplyr`)

In [8]:
5 / 3   # regular division
5 %/% 3 # integer division

In [9]:
1:20 %% 5  # shorter argument 5 is extended to match length of longer argument
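The extension of the shorter argument is usually called recycling. A minimal base R sketch of how it behaves for arguments of length one and two:

```r
x <- 1:6

# A single value is reused for every element of x
x %% 2          # 1 0 1 0 1 0

# A vector of length 2 is repeated three times to match length 6
x + c(10, 20)   # 11 22 13 24 15 26
```

R issues a warning when the longer length is not a multiple of the shorter one, so it is safest to rely on recycling only for single values.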

In [10]:
near(1:10, exp(log(1:10))) # log to base e

In [11]:
near(1:10, 10^log10(1:10)) # log to base 10

In [12]:
log2(2^(1:10))  # log to base 2

In [13]:
(x <- 1:10)
lag(x)
lead(x)
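One common use of offsets is computing the change between consecutive values. The following is a small sketch using `dplyr`'s `lag()` and `lead()`; the vector `x` is made up for illustration.

```r
library(dplyr)

x <- c(3, 7, 12, 20)

lag(x)       # NA  3  7 12  (each value is replaced by the previous one)
lead(x)      #  7 12 20 NA  (each value is replaced by the next one)
x - lag(x)   # NA  4  5  8  (change from the previous element)
```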

We also have:

* **Logical comparisons** `==, !=, <, <=, >, >=`
* **Cumulative aggregates** `cumsum(), cumprod(), cummin(), cummax()` (`dplyr` also provides `cummean()`)

In [14]:
1:10 < 11:20
1:10 < 5
21 < 11:20
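Logical vectors are treated as 0 and 1 in arithmetic, which makes comparisons handy for counting and for masking values. A minimal base R sketch:

```r
x <- 1:10

x < 5          # logical vector: TRUE for 1 to 4, FALSE afterwards
sum(x < 5)     # 4, the number of elements below 5
mean(x < 5)    # 0.4, the fraction of elements below 5
(x < 5) * x    # 1 2 3 4 0 0 0 0 0 0, elements failing the test become 0
```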

In [15]:
(factorials <- cumprod(1:10))
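For comparison, here is a short base R sketch of the other cumulative aggregates on made-up vectors. The last line computes a running mean by hand rather than calling `cummean()`, so that the block does not depend on `dplyr`.

```r
x <- c(2, 4, 6, 8)

cumsum(x)                  # 2 6 12 20
cumprod(x)                 # 2 8 48 384
cummax(c(1, 3, 2, 5, 4))   # 1 3 3 5 5
cumsum(x) / seq_along(x)   # 2 3 4 5  (running mean computed by hand)
```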

Finally, we can use these **ranking** functions:

* `min_rank()`
* `row_number()`
* `dense_rank()`
* `percent_rank()`
* `cume_dist()`
* `ntile()`

In [16]:
(x <- sample(c(11, 12, 12, 14, 14, 14, 17, 21, 26, NA))) # returns a random permutation of the input
min_rank(x) # ranks with smallest value as rank 1
min_rank(desc(x)) # ranks with largest value as rank 1

In [17]:
dense_rank(x) # don't create gaps in ranks

In [18]:
row_number(x) # just return the position number in sorted order (ties get different ranks here)

In [19]:
percent_rank(x) # min_rank values are scaled to [0,1]

In [20]:
cume_dist(x) # fraction of entries less than or equal to a given number

In [21]:
ntile(x, 4) # rough ranks based on using just 4 buckets
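Because `sample()` shuffles the input, the outputs above change from run to run. The sketch below uses a small fixed vector (made up for illustration) so that the treatment of ties and of `NA` is easy to compare across three of the ranking functions.

```r
library(dplyr)

x <- c(11, 12, 12, 14, NA)

min_rank(x)     # 1 2 2 4 NA  (ties share the smallest rank, then a gap)
dense_rank(x)   # 1 2 2 3 NA  (ties share a rank, no gap afterwards)
row_number(x)   # 1 2 3 4 NA  (ties are broken by position)
```

In all three cases the missing value keeps a rank of `NA`.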

Let us now create two new variables obtained from `arr_time` and `dep_time` by converting them into minutes since midnight.

In [22]:
my_flights_new <- mutate(my_flights, new_arr = hourmin2min(arr_time), new_dep = hourmin2min(dep_time))
head(my_flights_new)

year,month,day,dep_time,arr_time,air_time,origin,dest,new_arr,new_dep
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
2013,1,1,517,830,227,EWR,IAH,510,317
2013,1,1,533,850,227,LGA,IAH,530,333
2013,1,1,542,923,160,JFK,MIA,563,342
2013,1,1,544,1004,183,JFK,BQN,604,344
2013,1,1,554,812,116,LGA,ATL,492,354
2013,1,1,554,740,150,EWR,ORD,460,354


Now we can subtract the departure time `new_dep` from the arrival time `new_arr` to get a new variable `total_time`.

In [23]:
my_flights_total <- mutate(my_flights_new, total_time = new_arr - new_dep)
head(my_flights_total)

year,month,day,dep_time,arr_time,air_time,origin,dest,new_arr,new_dep,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
2013,1,1,517,830,227,EWR,IAH,510,317,193
2013,1,1,533,850,227,LGA,IAH,530,333,197
2013,1,1,542,923,160,JFK,MIA,563,342,221
2013,1,1,544,1004,183,JFK,BQN,604,344,260
2013,1,1,554,812,116,LGA,ATL,492,354,138
2013,1,1,554,740,150,EWR,ORD,460,354,106


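How is it that the total time is less than the time in the air for some flights? The cause is time zones: `dep_time` and `arr_time` are recorded in local time, so a westbound flight that crosses time zones appears shorter than it really is.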
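As a rough illustration for the first row (EWR to IAH), assume Houston clocks run one hour behind New York clocks. The offset below is typed in by hand as an assumption; it is not a column of `my_flights`, and the variable names are chosen only for this sketch.

```r
new_dep  <- 317   # 5:17 at the origin, in minutes since midnight
new_arr  <- 510   # 8:30 at the destination, in minutes since midnight
air_time <- 227   # minutes in the air, from the table

tz_offset <- 60   # assumed: destination clocks are 60 minutes behind

naive_total    <- new_arr - new_dep         # 193, shorter than air_time
adjusted_total <- naive_total + tz_offset   # 253, now longer than air_time

naive_total < air_time       # TRUE
adjusted_total >= air_time   # TRUE
```

A full fix would need a time zone offset for every destination airport, which this sketch does not attempt.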

In [24]:
filter(my_flights_total, total_time < air_time) %>%
    head()

year,month,day,dep_time,arr_time,air_time,origin,dest,new_arr,new_dep,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
2013,1,1,517,830,227,EWR,IAH,510,317,193
2013,1,1,533,850,227,LGA,IAH,530,333,197
2013,1,1,554,740,150,EWR,ORD,460,354,106
2013,1,1,558,753,138,LGA,ORD,473,358,115
2013,1,1,558,924,345,JFK,LAX,564,358,206
2013,1,1,558,923,361,EWR,SFO,563,358,205


We also have some negative values of total time, for flights that departed late in the day and arrived early in the morning of the next day.

In [25]:
filter(my_flights_total, total_time < 0) %>%
    head()

year,month,day,dep_time,arr_time,air_time,origin,dest,new_arr,new_dep,total_time
<int>,<int>,<int>,<int>,<int>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
2013,1,1,1929,3,192.0,EWR,BQN,3,1169,-1166
2013,1,1,1939,29,,JFK,DFW,29,1179,-1150
2013,1,1,2058,8,159.0,EWR,TPA,8,1258,-1250
2013,1,1,2102,146,199.0,EWR,SJU,106,1262,-1156
2013,1,1,2108,25,354.0,EWR,SFO,25,1268,-1243
2013,1,1,2120,16,160.0,LGA,FLL,16,1280,-1264


We can fix the negative values by adding 24\*60 to them (we keep the positive values as is).

Note that `transmute()` keeps only the variables named in the call: here the existing `arr_time` and `dep_time`, plus the newly created `new_total_time`.
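The difference between `mutate()` and `transmute()` is easiest to see on a tiny table. The tibble `toy` and its columns are hypothetical and exist only for this sketch.

```r
library(dplyr)

toy <- tibble(a = 1:3, b = c(10, 20, 30))   # hypothetical toy data

names(mutate(toy, s = a + b))         # "a" "b" "s"  (all columns kept)
names(transmute(toy, s = a + b))      # "s"          (only the new column)
names(transmute(toy, a, s = a + b))   # "a" "s"      (named columns are kept too)
```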

In [26]:
no_negatives <- transmute(my_flights_total, arr_time, dep_time,
          new_total_time = (total_time < 0)*(total_time + 24*60) + (total_time >= 0)*total_time)
filter(no_negatives, new_total_time < 0)

arr_time,dep_time,new_total_time
<int>,<int>,<dbl>


So that took care of the negative total time issue.
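An alternative way to get the same correction, sketched here in base R, is the modulo operator: in R, `%%` with a positive modulus returns a non-negative result, so negative differences wrap around to the next day. The vector below copies a few `total_time` values from the tables above, and `by_masking` reuses the formula from the `transmute()` call for comparison. Like that formula, this assumes no flight lasts 24 hours or more, and it does not address the time zone issue.

```r
total_time <- c(193, -1166, -1250, 106)   # a few values from the tables above

# Wrap negative differences around midnight
fixed <- total_time %% (24 * 60)
fixed   # 193 274 190 106

# Same result as the logical-mask formula used with transmute()
by_masking <- (total_time < 0) * (total_time + 24 * 60) +
    (total_time >= 0) * total_time
all(fixed == by_masking)   # TRUE
```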