Skip to content
An R package providing access to medium airline flight delay data
R
Branch: master
Clone or download
Pull request Compare This branch is 82 commits ahead, 68 commits behind hadley:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
R
data-raw removed etl_cleanup method, implemented smart_download Jun 9, 2016
data
inst/sql added new partitions Mar 14, 2018
man fix for new URL; close #59 Mar 12, 2019
tests
vignettes
.Rbuildignore updated .Rbuildignore May 10, 2016
.gitignore
.travis.yml updated Travis Jul 3, 2016
DESCRIPTION fix for new URL; close #59 Mar 12, 2019
NAMESPACE switched to downloader::download Jul 22, 2016
NEWS.md Ungroup weather data. Include all 3 stations. Apr 16, 2015
README.Rmd updated README Aug 24, 2016
README.md updated README Aug 24, 2016
airlines.Rproj

README.md

airlines

Travis-CI Build Status

The airlines package provides a user-friendly interface to create and maintain an SQL database of flight information from the U.S. Bureau of Transportation Statistics Airline On-Time Performance data. The user of the airlines package only needs a valid place to store the data -- no sophisticated SQL administration skills are necessary.

Several existing R package could be considered subsets of these data:

  1. nycflights13: all outgoing flights from the three New York City airports (LGA, JFK, and EWR) during 2013
  2. hflights: all outgoing flights from the three New York City airports (IAH and HOU) during 2011

This airlines package will allow you to download data for over 165 million flights from 1987 to present, from all domestic airports.

Install

The etl package (on CRAN) provides the generic framework for the airlines package. Since the airlines package currently lives on GitHub and not on CRAN, you have to install it using devtools:

install.packages("devtools")
devtools::install_github("beanumber/airlines")

To begin, load the airlines package. Note that this loads etl, which in turn loads dplyr.

library(airlines)

Populate

Any etl-derived package can make use of the SQL backends supported by dplyr. Here, we illustrate how to set up a local MySQL database to store the flight data. This approach uses a MySQL options file located at ~/.my.cnf.

system("mysql -e 'CREATE DATABASE IF NOT EXISTS airlines;'")
db <- src_mysql_cnf(dbname = "airlines")

Once we have a database connection, we create an etl object, initialize the database, and then populate it with data. Please note that to update the database with all 30 years worth of flights may take a few hours.

ontime <- etl("airlines", db = db, dir = "~/dumps/airlines")
ontime %>%
  etl_init() %>%
  etl_update(years = 1987:2016)

Verify

There are over 300 months worth of files to download, and they will occupy more than 21 GB in their zipped and unzipped states.

summary(ontime)
## files:
##     n      size                              path
## 1 349  6.504 GB  /home/bbaumer/dumps/airlines/raw
## 2 345 18.725 GB /home/bbaumer/dumps/airlines/load

##       Length Class           Mode       
## con   1      MySQLConnection S4         
## info  8      -none-          list       
## disco 3      -none-          environment

The full flights table should contain about 169 million flights from October 1987 to June 2016.

ontime %>%
  tbl(from = "flights") %>%
  summarise(numFlights = n())
## Source:   query [?? x 1]
## Database: mysql 5.7.13-0ubuntu0.16.04.2 [bbaumer@localhost:/airlines]
## 
##   numFlights
##        <dbl>
## 1  169405490

Analyze

The number of flights per year seems to have peaked in 2007.

ontime %>%
  tbl(from = "flights") %>%
  group_by(year) %>%
  summarise(numMonths = n_distinct(month), numFlights = n()) %>%
  print(n = 40)
## Source:   query [?? x 3]
## Database: mysql 5.7.13-0ubuntu0.16.04.2 [bbaumer@localhost:/airlines]
## 
##     year numMonths numFlights
##    <int>     <dbl>      <dbl>
## 1   1987         3    1311826
## 2   1988        12    5202096
## 3   1989        12    5041200
## 4   1990        12    5270893
## 5   1991        12    5076925
## 6   1992        12    5092157
## 7   1993        12    5070501
## 8   1994        12    5180048
## 9   1995        12    5327435
## 10  1996        12    5351983
## 11  1997        12    5411843
## 12  1998        12    5384721
## 13  1999        12    5527884
## 14  2000        12    5683047
## 15  2001        12    5967780
## 16  2002        12    5271359
## 17  2003        12    6488540
## 18  2004        12    7129270
## 19  2005        12    7140596
## 20  2006        12    7141922
## 21  2007        12    7455458
## 22  2008        12    7009726
## 23  2009        12    6450285
## 24  2010        12    6450117
## 25  2011        12    6085281
## 26  2012        12    6096762
## 27  2013        12    6369482
## 28  2014        12    5819811
## 29  2015        12    5819079
## 30  2016         6    2777463

Please see the vignette for more detail about how to use this package.

You can’t perform that action at this time.