# Collecting domestic flight data 2018-2020

**Allison Merritt, last updated Nov 2020**

If we want to know whether flights that leave earlier in the day are less likely to be delayed -- or any other flight-related information -- the first thing we need to do is grab some data!

For this project, we'll use publicly available data from the Bureau of Transportation Statistics, which stores descriptions of flights in the US. The data can be downloaded by hand from [the BTS website](https://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time); however, it's more fun to use this as an excuse to practice web scraping.

[Selenium](https://www.selenium.dev/) lets us scrape data from the web via a WebDriver (i.e., we can automate a browser to carry out tasks such as navigating to web pages, checking boxes, and clicking download buttons). It also plays nicely with JavaScript, which the BTS website uses.

In the cell below:
* First, the `initiate_driver()` function will set up the ChromeDriver, navigate to the BTS website, and proceed to the downloads page. Then it will check each of the boxes specified by the `check_boxes` list using their `XPATH` expressions in order to get the fields we need for each flight.

* The `grab_file()` function will then use the driver to select a time window (month, year) and download a `.zip` file containing the data.

* Finally, we need to locate the recently downloaded file (`find_recent_flight_file()`), unzip it, rename it to something more reasonable, and store it in our preferred directory (`move_flight_file()`).

In [1]:
from selenium import webdriver

import zipfile
import shutil
import glob
import time
import os

data_folder = '/save/data/here'
downloads = '/Downloads/are/here'

def initiate_driver():
    """
    Initiates ChromeDriver, navigates to the downloads page from the BTS website,
    and checks boxes for fields to download.
    
    Returns
    -------
    The ChromeDriver
    """
    print('Initiating driver ...')
    driver = webdriver.Chrome()

    # starting URL
    bts_base = 'https://www.transtats.bts.gov/Tables.asp?'
    bts_url = bts_base + 'DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time'
    driver.get(bts_url)

    # from inspection, this is the element for the link that takes you to
    #   transtats.bts.gov/DL_SelectFields.asp, which breaks if you try to access it directly.
    selector_copy = '#form1 > table:nth-child(6) > tbody > tr:nth-child(7) > td.dataTDRight > a:nth-child(3)'

    print('Navigating to the data download page ...')
    driver.find_element_by_css_selector(selector_copy).click()

    check_boxes = ["FlightDate", "Reporting_Airline", "CRSDepTime", "DepTime", "DepDelay",
                   "CRSArrTime", "ArrTime", "ArrDelay", "Cancelled", "CancellationCode",
                   "CRSElapsedTime", "ActualElapsedTime", "AirTime", "Flights", "Distance",
                   "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay"]

    print('Checking relevant boxes ... ')
    for checkbox in check_boxes:
        driver.find_element_by_xpath(f'//input[@title="{checkbox}"]').click()
        
    return driver

def grab_file(driver, month, year):
    """
    Downloads files from the BTS website
    
    Parameters
    ----------
    driver : object
        The Selenium ChromeDriver returned by initiate_driver()
        
    month : str
        The name of a month to download
        
    year : int
        The year to download

    """
    print(f'Downloading data from {month} {year} ... ')

    driver.find_element_by_xpath('//*[@id="XYEAR"]').send_keys(str(year))
    driver.find_element_by_xpath('//*[@id="FREQUENCY"]').send_keys(month)

    # download
    dwnld_xpath = '//*[@id="content"]/table[1]/tbody/tr/td[2]/table[3]/tbody/tr/td[2]/button[1]'
    driver.find_element_by_xpath(dwnld_xpath).click()
    
def find_recent_flight_file():
    """
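    Finds the most recently downloaded flights file
    
    Returns
    -------
    The path to the zipfile, or None if no matching file is found
    """
    flist = glob.glob(os.path.join(downloads, '*_ONTIME_REPORTING.zip'))
    if len(flist) == 0:
        return None
    # glob order is arbitrary, so pick the newest match by modification time
    return max(flist, key=os.path.getmtime)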

def move_flight_file(month,year):
    """
    Checks for the existence of a flights file (if the file is not there yet, waits 30 seconds and checks again),
    then unzips it, renames it to specify the year and month, and moves it from Downloads to wherever you
    want to store the data.
    
    Parameters
    ----------
    month : str
        The name of a month to download
        
    year : int
        The year to download    

    """
    print('Checking for file...')
    found_file = False
    while found_file is False:
        zfn = find_recent_flight_file()
        if zfn is not None:
            found_file = True
        else:
            print('  [File not found yet - waiting another 30 seconds ...] ')
            time.sleep(30)
        
    print('  File found! Unzipping and moving to data directory.')
    with zipfile.ZipFile(zfn) as zip_ref:
        ufn = zip_ref.namelist()[0]
        zip_ref.extractall(downloads)
        
    outname_short = f'{year}_{month}_'+'_'.join(os.path.basename(zfn).split('_')[-2:])
    shutil.move(os.path.join(downloads, ufn), os.path.join(data_folder,outname_short.replace('.zip','.csv')))
    os.remove(zfn)
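
One caveat about the wait loop in `move_flight_file()`: it polls forever if a download never starts. Below is a minimal sketch of a variant that gives up after a timeout. The helper name `wait_for_download` and its parameters are hypothetical, and the `.crdownload` check assumes Chrome's usual naming for in-progress downloads; treat it as an extra precaution rather than something the BTS workflow requires.

```python
import glob
import os
import time


def wait_for_download(folder, pattern='*_ONTIME_REPORTING.zip',
                      timeout=600, poll=5):
    """
    Hypothetical helper: waits until a finished file matching `pattern`
    appears in `folder`, or returns None once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        matches = glob.glob(os.path.join(folder, pattern))
        # assumption: Chrome marks in-progress downloads with '.crdownload'
        partial = glob.glob(os.path.join(folder, '*.crdownload'))
        if matches and not partial:
            # newest match by modification time
            return max(matches, key=os.path.getmtime)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

A caller could treat a `None` return as a failed month and either retry `grab_file()` or skip ahead, instead of blocking the whole run.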

Now that that's all set up, we just need to choose which years and months to download. The site serves one month at a time, so we'll need a loop:

In [2]:
import calendar 

years = [2018,2019,2020]
months = [calendar.month_name[m] for m in range(1,13)]
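
As a quick sanity check before a long run, it can help to list the (year, month) pairs the loop will visit. This sketch uses `itertools.product`, which yields the same order as a nested for-loop; the name `jobs` is only illustrative, and the English month names assume the default locale.

```python
import calendar
from itertools import product

years = [2018, 2019, 2020]
months = [calendar.month_name[m] for m in range(1, 13)]

# every (year, month) pair, in the same order as a nested for-loop
jobs = list(product(years, months))

print(len(jobs))   # 36
print(jobs[0])     # (2018, 'January')
print(jobs[-1])    # (2020, 'December')
```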

Now we run everything:

In [3]:
driver = initiate_driver()

for year in years:
    for month in months:
        grab_file(driver, month, year)
        move_flight_file(month,year)
        
driver.quit()

Initiating driver ...
Navigating to the data download page ...
Checking relevant boxes ... 
Downloading data from January 2018 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from February 2018 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from March 2018 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from April 2018 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from May 2018 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  [File not found yet - waiting another 30 seconds ...] 

Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from July 2020 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from August 2020 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from September 2020 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading data from October 2020 ... 
Checking for file...
  [File not found yet - waiting another 30 seconds ...] 
  File found! Unzipping and moving to data directory.
Downloading

That's it! The data are now stored and ready to use :)
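
Since the scrape takes a while and can be interrupted, a small check of which months actually made it to disk may be useful. This is a sketch with a hypothetical helper, `missing_months`, and it assumes the files are named `<year>_<month>_ONTIME_REPORTING.csv`, which is what the renaming step in `move_flight_file()` produces for zip files ending in `_ONTIME_REPORTING.zip`.

```python
import calendar
import os


def missing_months(folder, years, suffix='ONTIME_REPORTING.csv'):
    """
    Hypothetical helper: lists (year, month) pairs with no stored file,
    assuming files are named '<year>_<month>_ONTIME_REPORTING.csv'.
    """
    month_names = [calendar.month_name[m] for m in range(1, 13)]
    missing = []
    for year in years:
        for month in month_names:
            path = os.path.join(folder, f'{year}_{month}_{suffix}')
            if not os.path.exists(path):
                missing.append((year, month))
    return missing
```

Calling `missing_months(data_folder, years)` after the run returns an empty list if every month was stored; anything else is a list of months to re-download.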

For an exploratory data analysis, check out [this notebook](https://nbviewer.jupyter.org/github/atmerritt/delay-expectations/blob/main/flight_delays_data_explore.ipynb?flush_cache=true).