In this notebook, we will cover:

* [Filter Rows](#Filter-Rows)
* [Arrange Rows](#Arrange-Rows)
* [Select Columns](#Select-Columns)
* [Slice Rows](#Slice-Rows)
* [Operations with Missing Values](#Operations-with-Missing-Values)

# Filter Rows

We will be using the `dplyr` package that is part of `tidyverse`. Let us also load a data set about flights departing from the New York City area in 2013.

In [1]:
# might have to install nycflights13
# install.packages('nycflights13')
library(tidyverse)
library(nycflights13)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
print(flights)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1    

Notice the types of the variables above. They include:

* **int** integers
* **dbl** double precision floating point numbers
* **chr** character vectors, or strings
* **dttm** date-time (a date along with a time)

Other types available in R but not represented above include:

* **lgl** logical (either `TRUE` or `FALSE`)
* **fct** factor (categorical variable with a fixed set of possible values)
* **date** date
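
To connect these abbreviations to R's own class names, here is a minimal sketch that builds a one-row tibble with one column per type. The tibble `toy` and its column names are made up for illustration and are not part of `flights`.

```r
library(dplyr)  # re-exports tibble()

# Hypothetical one-row tibble: one column per type listed above
toy <- tibble(
  n_int  = 1L,                                             # int
  x_dbl  = 1.5,                                            # dbl
  s_chr  = "JFK",                                          # chr
  t_dttm = as.POSIXct("2013-01-01 05:00:00", tz = "UTC"),  # dttm
  b_lgl  = TRUE,                                           # lgl
  f_fct  = factor("JFK", levels = c("EWR", "JFK", "LGA")), # factor
  d_date = as.Date("2013-01-01")                           # date
)

# First class of each column, as reported by base R
col_classes <- sapply(toy, function(col) class(col)[1])
print(col_classes)
```

Printing `toy` itself shows the abbreviated type under each column name, in the same style as the `flights` output.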

In [3]:
filter(flights, month == 12 & day == 31) # all flights that departed on December 31

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,31,13,2359,14,439,437,2,B6,839,N566JB,JFK,BQN,189,1576,23,59,2013-12-31 23:00:00
2013,12,31,18,2359,19,449,444,5,DL,412,N713TW,JFK,SJU,192,1598,23,59,2013-12-31 23:00:00
2013,12,31,26,2245,101,129,2353,96,B6,108,N374JB,JFK,PWM,50,273,22,45,2013-12-31 22:00:00
2013,12,31,459,500,-1,655,651,4,US,1895,N557UW,EWR,CLT,95,529,5,0,2013-12-31 05:00:00
2013,12,31,514,515,-1,814,812,2,UA,700,N470UA,EWR,IAH,223,1400,5,15,2013-12-31 05:00:00
2013,12,31,549,551,-2,925,900,25,UA,274,N577UA,EWR,LAX,346,2454,5,51,2013-12-31 05:00:00
2013,12,31,550,600,-10,725,745,-20,AA,301,N3CXAA,LGA,ORD,127,733,6,0,2013-12-31 06:00:00
2013,12,31,552,600,-8,811,826,-15,EV,3825,N14916,EWR,IND,118,645,6,0,2013-12-31 06:00:00
2013,12,31,553,600,-7,741,754,-13,DL,731,N333NB,LGA,DTW,86,502,6,0,2013-12-31 06:00:00
2013,12,31,554,550,4,1024,1027,-3,B6,939,N552JB,JFK,BQN,195,1576,5,50,2013-12-31 05:00:00


In [4]:
filter(flights, month == 12, day == 31) # multiple arguments are equivalent to `&`

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,31,13,2359,14,439,437,2,B6,839,N566JB,JFK,BQN,189,1576,23,59,2013-12-31 23:00:00
2013,12,31,18,2359,19,449,444,5,DL,412,N713TW,JFK,SJU,192,1598,23,59,2013-12-31 23:00:00
2013,12,31,26,2245,101,129,2353,96,B6,108,N374JB,JFK,PWM,50,273,22,45,2013-12-31 22:00:00
2013,12,31,459,500,-1,655,651,4,US,1895,N557UW,EWR,CLT,95,529,5,0,2013-12-31 05:00:00
2013,12,31,514,515,-1,814,812,2,UA,700,N470UA,EWR,IAH,223,1400,5,15,2013-12-31 05:00:00
2013,12,31,549,551,-2,925,900,25,UA,274,N577UA,EWR,LAX,346,2454,5,51,2013-12-31 05:00:00
2013,12,31,550,600,-10,725,745,-20,AA,301,N3CXAA,LGA,ORD,127,733,6,0,2013-12-31 06:00:00
2013,12,31,552,600,-8,811,826,-15,EV,3825,N14916,EWR,IND,118,645,6,0,2013-12-31 06:00:00
2013,12,31,553,600,-7,741,754,-13,DL,731,N333NB,LGA,DTW,86,502,6,0,2013-12-31 06:00:00
2013,12,31,554,550,4,1024,1027,-3,B6,939,N552JB,JFK,BQN,195,1576,5,50,2013-12-31 05:00:00
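
The equivalence of the two forms can be checked on a small example. The sketch below uses a hypothetical four-row tibble `toy` (not part of `nycflights13`) so that the result is easy to verify by eye.

```r
library(dplyr)

# Hypothetical data: rows 1 and 4 are on December 31
toy <- tibble(
  id    = 1:4,
  month = c(12, 12, 11, 12),
  day   = c(31, 30, 31, 31)
)

with_and   <- filter(toy, month == 12 & day == 31)  # single condition joined by &
with_comma <- filter(toy, month == 12, day == 31)   # two separate conditions

with_and$id                           # 1 4
identical(with_and$id, with_comma$id) # TRUE
```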


The above code just displayed the filtered rows. What if we want to store the results for later use?

In [5]:
dec31 <- filter(flights, month == 12 & day == 31)

If you want to assign as well as print, enclose the command in parentheses.

In [6]:
(dec31 <- filter(flights, month == 12 & day == 31))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,31,13,2359,14,439,437,2,B6,839,N566JB,JFK,BQN,189,1576,23,59,2013-12-31 23:00:00
2013,12,31,18,2359,19,449,444,5,DL,412,N713TW,JFK,SJU,192,1598,23,59,2013-12-31 23:00:00
2013,12,31,26,2245,101,129,2353,96,B6,108,N374JB,JFK,PWM,50,273,22,45,2013-12-31 22:00:00
2013,12,31,459,500,-1,655,651,4,US,1895,N557UW,EWR,CLT,95,529,5,0,2013-12-31 05:00:00
2013,12,31,514,515,-1,814,812,2,UA,700,N470UA,EWR,IAH,223,1400,5,15,2013-12-31 05:00:00
2013,12,31,549,551,-2,925,900,25,UA,274,N577UA,EWR,LAX,346,2454,5,51,2013-12-31 05:00:00
2013,12,31,550,600,-10,725,745,-20,AA,301,N3CXAA,LGA,ORD,127,733,6,0,2013-12-31 06:00:00
2013,12,31,552,600,-8,811,826,-15,EV,3825,N14916,EWR,IND,118,645,6,0,2013-12-31 06:00:00
2013,12,31,553,600,-7,741,754,-13,DL,731,N333NB,LGA,DTW,86,502,6,0,2013-12-31 06:00:00
2013,12,31,554,550,4,1024,1027,-3,B6,939,N552JB,JFK,BQN,195,1576,5,50,2013-12-31 05:00:00
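
The reason parentheses make a difference is that an assignment returns its value invisibly, while `( )` returns the same value visibly. A small base R sketch using `withVisible()` shows this; the variable names are arbitrary.

```r
# Plain assignment: the value is returned invisibly, so nothing is printed
plain <- withVisible(x <- c(1, 2, 3))

# Parenthesized assignment: the value is visible, so it would be printed
wrapped <- withVisible((y <- c(1, 2, 3)))

plain$visible    # FALSE
wrapped$visible  # TRUE
identical(x, y)  # TRUE: both assignments still happened
```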


Note the following **comparison operators** in R:

* `==` equal to (warning: do not use `=` for testing equality)
* `!=` not equal to
* `<`, `<=` less than, less than or equal to
* `>`, `>=` greater than, greater than or equal to

Finally, remember that floating point calculations are performed with limited precision, so mathematically equal quantities may not be equal according to `==`. Use `near()` to compare such values up to a small tolerance.

In [7]:
1/98 * 98 == 1

In [8]:
near(1/98 * 98, 1)
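
To see why a tolerance helps, here is a rough sketch of the idea behind `near()`: compare the absolute difference against a small threshold. The function `near_sketch` is a hypothetical stand-in written for illustration, not the package's own code.

```r
# Hypothetical re-implementation of a tolerance-based comparison
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

sqrt(2)^2 == 2                   # FALSE: rounding error in sqrt(2)
near_sketch(sqrt(2)^2, 2)        # TRUE: equal up to the tolerance
isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE: base R alternative
```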

Conditions can be combined using **logical operators**. These are:

* `!x` `TRUE` iff x is `FALSE`
* `x & y` `TRUE` iff both x, y are `TRUE`
* `x | y` `TRUE` iff either x or y is `TRUE`
* `xor(x, y)` `TRUE` iff exactly one of x and y is `TRUE`

Additionally, there is a short-hand for testing whether `x` is one of the values in `y`:

* `x %in% y`
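
The operators above work element by element on vectors. The following base R sketch, with made-up vectors, runs through every combination of `TRUE` and `FALSE` and shows that `%in%` is a compact way of writing a chain of `|` comparisons.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

not_x   <- !x         # FALSE FALSE  TRUE  TRUE
x_and_y <- x & y      #  TRUE FALSE FALSE FALSE
x_or_y  <- x | y      #  TRUE  TRUE  TRUE FALSE
x_xor_y <- xor(x, y)  # FALSE  TRUE  TRUE FALSE

# %in% versus a chain of == joined by |
month  <- c(1, 10, 11, 5, 12)
via_in <- month %in% 10:12
via_or <- month == 10 | month == 11 | month == 12
identical(via_in, via_or)  # TRUE
```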

In [9]:
filter(flights, month %in% 10:12) # flights from October through December

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,10,1,447,500,-13,614,648,-34,US,1877,N538UW,EWR,CLT,69,529,5,0,2013-10-01 05:00:00
2013,10,1,522,517,5,735,757,-22,UA,252,N556UA,EWR,IAH,174,1400,5,17,2013-10-01 05:00:00
2013,10,1,536,545,-9,809,855,-46,AA,2243,N630AA,JFK,MIA,132,1089,5,45,2013-10-01 05:00:00
2013,10,1,539,545,-6,801,827,-26,UA,1714,N37252,LGA,IAH,172,1416,5,45,2013-10-01 05:00:00
2013,10,1,539,545,-6,917,933,-16,B6,1403,N789JB,JFK,SJU,186,1598,5,45,2013-10-01 05:00:00
2013,10,1,544,550,-6,912,932,-20,B6,939,N593JB,JFK,BQN,191,1576,5,50,2013-10-01 05:00:00
2013,10,1,549,600,-11,653,716,-23,EV,5716,N830AS,JFK,IAD,46,228,6,0,2013-10-01 06:00:00
2013,10,1,550,600,-10,648,700,-12,US,1909,N949UW,LGA,PHL,38,96,6,0,2013-10-01 06:00:00
2013,10,1,550,600,-10,649,659,-10,US,2167,N749US,LGA,DCA,39,214,6,0,2013-10-01 06:00:00
2013,10,1,551,600,-9,727,730,-3,UA,279,N415UA,EWR,ORD,117,719,6,0,2013-10-01 06:00:00


In [10]:
nrow(filter(flights, is.na(dep_time))) # number of flights with a missing departure time; nrow() counts rows

In [11]:
nrow(filter(flights, between(month, 1, 3))) # number of flights departing from January through March (inclusive)
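
As a quick check of what `between()` and `is.na()` compute, the sketch below applies them to a short made-up vector. `between(x, left, right)` is inclusive at both ends and matches `x >= left & x <= right`.

```r
library(dplyr)

month <- c(1, 3, 4, 12, NA)  # hypothetical values, including one missing

via_between <- between(month, 1, 3)
via_compare <- month >= 1 & month <= 3

via_between                          # TRUE TRUE FALSE FALSE NA
identical(via_between, via_compare)  # TRUE
sum(is.na(month))                    # 1 missing value
```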

# Arrange Rows

`arrange` can order rows of a data frame using a variable name (or a more complicated expression). If you provide multiple expressions to order by, it uses the second one to break ties in the first one, third one to break ties in the second one, and so on.

In [12]:
arrange(flights, month, day)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
2013,1,1,555,600,-5,913,854,19,B6,507,N516JB,EWR,FLL,158,1065,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,709,723,-14,EV,5708,N829AS,LGA,IAD,53,229,6,0,2013-01-01 06:00:00
2013,1,1,557,600,-3,838,846,-8,B6,79,N593JB,JFK,MCO,140,944,6,0,2013-01-01 06:00:00
2013,1,1,558,600,-2,753,745,8,AA,301,N3ALAA,LGA,ORD,138,733,6,0,2013-01-01 06:00:00
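
Tie-breaking is easier to see on a handful of rows. In this sketch the tibble `toy` and its `id` labels are invented for illustration; `month` has ties, which `day` then resolves.

```r
library(dplyr)

toy <- tibble(
  id    = c("a", "b", "c", "d"),
  month = c(2, 1, 2, 1),
  day   = c(3, 9, 1, 4)
)

# Sort by month first; within the same month, sort by day
by_month_day <- arrange(toy, month, day)
by_month_day$id  # "d" "b" "c" "a"
```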


Wrap a variable in `desc()` to sort by it in descending order.

In [13]:
arrange(flights, desc(month))

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,12,1,13,2359,14,446,445,1,B6,745,N715JB,JFK,PSE,195,1617,23,59,2013-12-01 23:00:00
2013,12,1,17,2359,18,443,437,6,B6,839,N593JB,JFK,BQN,186,1576,23,59,2013-12-01 23:00:00
2013,12,1,453,500,-7,636,651,-15,US,1895,N197UW,EWR,CLT,86,529,5,0,2013-12-01 05:00:00
2013,12,1,520,515,5,749,808,-19,UA,1487,N69804,EWR,IAH,193,1400,5,15,2013-12-01 05:00:00
2013,12,1,536,540,-4,845,850,-5,AA,2243,N634AA,JFK,MIA,144,1089,5,40,2013-12-01 05:00:00
2013,12,1,540,550,-10,1005,1027,-22,B6,939,N821JB,JFK,BQN,189,1576,5,50,2013-12-01 05:00:00
2013,12,1,541,545,-4,734,755,-21,EV,3819,N13968,EWR,CVG,95,569,5,45,2013-12-01 05:00:00
2013,12,1,546,545,1,826,835,-9,UA,1441,N23708,LGA,IAH,204,1416,5,45,2013-12-01 05:00:00
2013,12,1,549,600,-11,648,659,-11,US,2167,N945UW,LGA,DCA,42,214,6,0,2013-12-01 06:00:00
2013,12,1,550,600,-10,825,854,-29,B6,605,N706JB,EWR,FLL,140,1065,6,0,2013-12-01 06:00:00


`arrange` always places missing values at the end, even when sorting with `desc()`. In contrast, `filter` drops rows for which the condition evaluates to `NA` unless you explicitly ask for them using `is.na()`.

In [14]:
arrange(flights, desc(is.na(dep_delay)), dep_delay) # put all NA values first

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416,16,30,2013-01-01 16:00:00
2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389,19,35,2013-01-01 19:00:00
2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096,15,0,2013-01-01 15:00:00
2013,1,1,,600,,,901,,B6,125,N618JB,JFK,FLL,,1069,6,0,2013-01-01 06:00:00
2013,1,2,,1540,,,1747,,EV,4352,N10575,EWR,CVG,,569,15,40,2013-01-02 15:00:00
2013,1,2,,1620,,,1746,,EV,4406,N13949,EWR,PIT,,319,16,20,2013-01-02 16:00:00
2013,1,2,,1355,,,1459,,EV,4434,N10575,EWR,MHT,,209,13,55,2013-01-02 13:00:00
2013,1,2,,1420,,,1644,,EV,4935,N759EV,EWR,ATL,,746,14,20,2013-01-02 14:00:00
2013,1,2,,1321,,,1536,,EV,3849,N13550,EWR,IND,,645,13,21,2013-01-02 13:00:00
2013,1,2,,1545,,,1910,,AA,133,,JFK,LAX,,2475,15,45,2013-01-02 15:00:00
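
The different treatment of missing values by `arrange` and `filter` can be sketched on a tiny example. The tibble `toy` below is hypothetical, with one missing `delay`.

```r
library(dplyr)

toy <- tibble(id = 1:4, delay = c(5, NA, -2, 10))

# arrange(): the NA row (id 2) comes last in both directions
asc_ids  <- arrange(toy, delay)$id        # 3 1 4 2
desc_ids <- arrange(toy, desc(delay))$id  # 4 1 3 2

# filter(): rows where the condition is NA are dropped ...
pos_ids <- filter(toy, delay > 0)$id                        # 1 4
# ... unless they are requested explicitly
pos_or_na_ids <- filter(toy, is.na(delay) | delay > 0)$id   # 1 2 4
```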


# Select Columns

`select` is used to keep only a few variables of interest to the current analysis. It is most useful when working with data frames involving a large number of variables.

In [15]:
select(flights, year, month, day, dep_time, arr_time) %>%
    head()

year,month,day,dep_time,arr_time
<int>,<int>,<int>,<int>,<int>
2013,1,1,517,830
2013,1,1,533,850
2013,1,1,542,923
2013,1,1,544,1004
2013,1,1,554,812
2013,1,1,554,740


You can change the name of the variables when selecting them.

In [16]:
select(flights, year, month, day, departure_time = dep_time, arrival_time = arr_time) %>%
    head()

year,month,day,departure_time,arrival_time
<int>,<int>,<int>,<int>,<int>
2013,1,1,517,830
2013,1,1,533,850
2013,1,1,542,923
2013,1,1,544,1004
2013,1,1,554,812
2013,1,1,554,740


Note that `select` drops any variables not explicitly mentioned. To just rename some variables while keeping all others, use `rename`.

In [17]:
rename(flights, departure_time = dep_time, arrival_time = arr_time) %>%
    head()

year,month,day,departure_time,sched_dep_time,dep_delay,arrival_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
2013,1,1,554,558,-4,740,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
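
The contrast between `select` and `rename` is clearest when looking only at the resulting column names. This sketch uses a made-up one-row tibble `toy`.

```r
library(dplyr)

toy <- tibble(year = 2013, dep_time = 517, arr_time = 830)

# select() keeps only the columns it is given, renaming on the way
selected_names <- names(select(toy, departure_time = dep_time))
# rename() keeps every column and changes only the names mentioned
renamed_names  <- names(rename(toy, departure_time = dep_time))

selected_names  # "departure_time"
renamed_names   # "year" "departure_time" "arr_time"
names(toy)      # unchanged: neither call modifies toy in place
```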


In [18]:
print(flights)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1    

You can select ranges using `:` and exclude variables using `-`.

In [19]:
select(flights, year:day) %>%
    head()

year,month,day
<int>,<int>,<int>
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1


In [20]:
select(flights, -(dep_time:time_hour)) %>%
    head()

year,month,day
<int>,<int>,<int>
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1
2013,1,1
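
Ranges and exclusions can be tried on a narrow table where every column fits on screen. The tibble `toy` here is a hypothetical five-column stand-in for `flights`.

```r
library(dplyr)

toy <- tibble(year = 2013, month = 1, day = 1, dep_time = 517, arr_time = 830)

range_names   <- names(select(toy, year:day))              # inclusive range
exclude_range <- names(select(toy, -(dep_time:arr_time)))  # drop a range
exclude_one   <- names(select(toy, -month))                # drop one column

range_names    # "year" "month" "day"
exclude_range  # "year" "month" "day"
exclude_one    # "year" "day" "dep_time" "arr_time"
```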


If you want to move a few variables to the front, you can use `everything()` to refer to all the remaining variables.

In [21]:
select(flights, dep_time, arr_time, day, month, year, everything()) %>%
    head()

dep_time,arr_time,day,month,year,sched_dep_time,dep_delay,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
517,830,1,1,2013,515,2,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
533,850,1,1,2013,529,4,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
542,923,1,1,2013,540,2,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
544,1004,1,1,2013,545,-1,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
554,812,1,1,2013,600,-6,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00
554,740,1,1,2013,558,-4,728,12,UA,1696,N39463,EWR,ORD,150,719,5,58,2013-01-01 05:00:00
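
A short sketch on a made-up tibble shows that `everything()` fills in the columns not already named, in their original order, without duplicating any column.

```r
library(dplyr)

toy <- tibble(year = 2013, month = 1, day = 1, dep_time = 517, arr_time = 830)

# Move arr_time to the front; the rest keep their relative order
reordered <- names(select(toy, arr_time, everything()))
reordered  # "arr_time" "year" "month" "day" "dep_time"
```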


In addition, there are some helper functions intended for use inside `select()`.

* `starts_with()`, `ends_with()`, `contains()`
* `matches()`
* `num_range()`

You can consult the documentation or type `?select` at the prompt to learn more about these. Here's just one example of their use.

In [22]:
select(flights, contains("time")) %>%
    head()

dep_time,sched_dep_time,arr_time,sched_arr_time,air_time,time_hour
<int>,<int>,<int>,<int>,<dbl>,<dttm>
517,515,830,819,227,2013-01-01 05:00:00
533,529,850,830,227,2013-01-01 05:00:00
542,540,923,850,160,2013-01-01 05:00:00
544,545,1004,1022,183,2013-01-01 05:00:00
554,600,812,837,116,2013-01-01 06:00:00
554,558,740,728,150,2013-01-01 05:00:00
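
The other helpers follow the same pattern as `contains()`. The sketch below applies each of them to a hypothetical tibble `toy` whose column names were chosen so that every helper matches something.

```r
library(dplyr)

toy <- tibble(x1 = 1, x2 = 2, x3 = 3, dep_time = 4, arr_time = 5, dep_delay = 6)

starts  <- names(select(toy, starts_with("dep")))       # prefix match
ends    <- names(select(toy, ends_with("time")))        # suffix match
regex   <- names(select(toy, matches("^(dep|arr)_")))   # regular expression
numbers <- names(select(toy, num_range("x", 1:2)))      # x1, x2

starts   # "dep_time" "dep_delay"
ends     # "dep_time" "arr_time"
regex    # "dep_time" "arr_time" "dep_delay"
numbers  # "x1" "x2"
```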


# Slice Rows

In [23]:
slice(flights, 1:5)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00
2013,1,1,544,545,-1,1004,1022,-18,B6,725,N804JB,JFK,BQN,183,1576,5,45,2013-01-01 05:00:00
2013,1,1,554,600,-6,812,837,-25,DL,461,N668DN,LGA,ATL,116,762,6,0,2013-01-01 06:00:00


In [24]:
slice(flights, n()) # last row

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,9,30,,840,,,1020,,MQ,3531,N839MQ,LGA,RDU,,431,8,40,2013-09-30 08:00:00


In [25]:
slice(flights, (n()-4):n()) # last 5 rows

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213,14,55,2013-09-30 14:00:00
2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198,22,0,2013-09-30 22:00:00
2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764,12,10,2013-09-30 12:00:00
2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419,11,59,2013-09-30 11:00:00
2013,9,30,,840,,,1020,,MQ,3531,N839MQ,LGA,RDU,,431,8,40,2013-09-30 08:00:00


# Operations with Missing Values

In [26]:
NA > 0

In [27]:
NA == 1

In [28]:
2 <= NA

In [29]:
NA == NA # this might be a bit confusing but it actually makes sense

In [30]:
slice(flights, (n()-4):n()) %>%
    filter(dep_time == arr_time)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<int>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>


In [31]:
is.na(NA)
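
One way to read these results is to treat `NA` as "unknown". The base R sketch below uses two made-up variables to show why comparing two unknowns gives an unknown, and why `is.na()` rather than `== NA` is the right test for missingness.

```r
# Two ages we do not know (hypothetical example)
mary_age <- NA
john_age <- NA

# Are they the same age? We cannot tell, so the answer is NA
same_age <- mary_age == john_age
is.na(same_age)  # TRUE

x <- c(1, NA, 3)
eq_na <- x == NA    # NA NA NA: comparing with NA never yields TRUE or FALSE
missing <- is.na(x) # FALSE TRUE FALSE: the reliable test
```

This is also why the `filter(dep_time == arr_time)` call above returns no rows: every comparison in those five rows evaluates to `NA`, and `filter` keeps only rows where the condition is `TRUE`.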