# Flight Delay Web Scraping
## Goal
This notebook pulls data from the US government Bureau of Transportation Statistics (BTS) website https://www.transtats.bts.gov/ontime/departures.aspx to catalogue flight delays from the 33 busiest airports in the US. 

## Notes About the Data
As a US agency, BTS (unsurprisingly, not the K-pop band) only catalogues delays at US airports for US-based airlines. This means that major international carriers like Lufthansa, KLM, Air Canada, etc. are not included in this database. I have also chosen to pull data from only the 33 busiest airports rather than all airports, largely due to time constraints: even these 33 took several hours to run, never mind the full list. (As a side note, I chose 33 because the 33rd busiest airport is PDX and I live in Portland.) I do not expect these caveats to have a major impact on my analysis, although flight delay data from all airports could be helpful, since a delay in a flight leaving one airport could certainly correlate with delays in that same flight arriving at and departing from the next airport.

## Brief Overview
Here, web scraping is done in two main parts:
1. The full list of airports and airlines is gathered from the dropdowns.
2. We cycle through the busiest airports and check all airlines for each of those airports.
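
The second step amounts to a Cartesian product of airports and airlines. The sketch below is only an illustration: the two short lists are placeholders, not the full dropdown contents that the notebook reads from the site.

```python
import itertools

# Shortened, illustrative lists; the notebook uses 33 airports and the
# airline codes read from the site's dropdown.
airports = ["ATL", "DFW", "PDX"]
airlines = ["AS", "DL", "UA"]

# Every (airport, airline) pair in a fixed order, so a run can resume by index.
combos = list(itertools.product(airports, airlines))

print(len(combos))  # 9
print(combos[0])    # ('ATL', 'AS')
print(combos[3])    # ('DFW', 'AS')
```

Because the order is deterministic, a single integer index is enough to record how far a run has progressed.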

## Detailed Overview
The selenium package is the main mover and shaker in this notebook. I use it to visit the BTS website and examine the airport and airline dropdowns to see what data is available. Note that the website lists every airline for every airport, even if an airline does not serve a particular airport. This only becomes clear when you request the data and no flight delay information comes back for that combination of airline and airport.
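
One way to detect an empty combination is to look for the site's "No data found for the above selection" message in the returned page. The sketch below runs that check on a plain string. `has_delay_data` is a hypothetical helper name, and the HTML fragments are made up for illustration; in the notebook the string would come from `driver.page_source`.

```python
NO_DATA_MESSAGE = "No data found for the above selection"

def has_delay_data(page_source: str) -> bool:
    """Return True when the results page does not carry the no-data message."""
    return NO_DATA_MESSAGE not in page_source

# Illustrative fragments, not copied from the real site.
empty_page = "<span>No data found for the above selection</span>"
full_page = "<table><tr><td>ATL</td><td>AS</td></tr></table>"

print(has_delay_data(empty_page))  # False
print(has_delay_data(full_page))   # True
```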

Once I identified the airports that had data, I consolidated the 33 busiest ones (per Wikipedia https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States) into a list, created combinations of each of these with the available airlines, and then cycled through each combination. For each combination, I selected the checkboxes for all the data from January 2021 through September 2025, set the website's dropdowns to the correct airline and airport, and requested the delay data. After waiting for the table to load, I downloaded the CSV and waited up to 300 seconds for it to be fully present on my local machine. This extended wait is necessary because of internet slowdowns and the huge files for certain combinations (e.g. Delta at ATL is over 100 MB). I then saved the delay information under the airline and airport names before moving on to the next combination.
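
The download wait described above can be sketched as a polling loop. This is a minimal version under stated assumptions: `wait_for_download` is a hypothetical helper name, and the sketch treats any `*.crdownload` file in the folder as an unfinished Chrome download, which is slightly stricter than matching on the file name prefix.

```python
import time
from pathlib import Path

def wait_for_download(directory, filename, timeout=300.0, poll_interval=2.0):
    """Poll until `filename` exists in `directory` and no partial Chrome
    download (*.crdownload) remains. Return the Path, or None on timeout."""
    directory = Path(directory)
    target = directory / filename
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumption: any .crdownload file means a download is still running.
        partial = list(directory.glob("*.crdownload"))
        if target.exists() and not partial:
            return target
        time.sleep(poll_interval)
    return None
```

Returning `None` on timeout lets the caller decide whether to retry or skip the combination, rather than attempting to rename a file that may not exist.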

## Future Improvements
This web scraping tool already handles common events such as a file not being fully downloaded and page loading delays, but more time could be put into resolving the full range of exceptions in a more automated way. It would also be nice to cut the waiting time when it isn't necessary. For instance, a lot of time is spent waiting on combinations of airlines and airports that simply do not exist. Creating a list of all existing combinations of airlines and airports could save much of this waiting.
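
One possible way to avoid repeated waits is to record combinations that returned no data and skip them on later runs. The sketch below is an assumption about how that could look, not part of the current notebook; the function names are hypothetical.

```python
import json
import os
import tempfile

def load_empty_combos(path):
    """Read the set of (airport, airline) pairs already known to have no data."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return {tuple(pair) for pair in json.load(f)}

def save_empty_combos(path, combos):
    """Write the set atomically: dump to a temp file in the same folder,
    then swap it into place with os.replace."""
    folder = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", delete=False, dir=folder, encoding="utf-8"
    ) as tf:
        json.dump(sorted(combos), tf)
        temp_name = tf.name
    os.replace(temp_name, path)
```

In the main loop, a pair such as `("ATL", "TZ")` would be added to the set when the site reports no data, and any pair already in the set would be skipped before the form is submitted. The set would be stored in a file of your choosing, for example a hypothetical `empty_combos.json`.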

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
import time
import os
from selenium.webdriver.chrome.options import Options

# Setup
options = Options()
options.add_experimental_option("detach", True)

# Disable "Save As" dialog and set default download directory
download_dir = 'C:/Users/dloso/Downloads'
prefs = {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True
}
# Attach the prefs to the same options object that is passed to the driver;
# previously they were set on a second, unused ChromeOptions instance
options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(options=options)
url = 'https://www.transtats.bts.gov/ontime/departures.aspx'
driver.get(url)

# Get list of all airports and airlines
try:
    dropdown_element = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.NAME, "cboAirport")))
    select_airport = Select(dropdown_element)
    dropdown_element2 = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.NAME, "cboAirline")))
    select_airline = Select(dropdown_element2)
    airport_options = [option.get_attribute('value') for option in select_airport.options]
    airline_options = [option.get_attribute('value') for option in select_airline.options]
except Exception as e:
    print(f'An error has occurred when getting the airport/airline lists: {e}')

finally:
    driver.quit()

In [2]:
import itertools
import numpy as np
import glob
from pathlib import Path
import json
import tempfile
import random

# Pre-create combos with airlines so that you can easily pick up from where you left off
busiest_airports = ['ATL', 'DFW', 'DEN', 'ORD', 'LAX', 'JFK', 'CLT', 'LAS', 'MCO',
                    'MIA', 'PHX', 'SEA', 'SFO', 'EWR', 'IAH', 'BOS', 'MSP', 'FLL',
                    'LGA', 'DTW', 'PHL', 'SLC', 'BWI', 'IAD', 'SAN', 'DCA', 'TPA',
                    'BNA', 'AUS', 'HNL', 'MDW', 'DAL', 'PDX']
airport_airline_combos = list(itertools.product(busiest_airports, airline_options))
num_combos = len(busiest_airports) * len(airline_options)

# See if the program has crashed before and start from the crashed index. Otherwise start at 0
PROGRESS_FILE = 'checkpoint.json'
if os.path.exists(PROGRESS_FILE):
    with open(PROGRESS_FILE, 'r') as f:
        combo_index = json.load(f).get('combo_index', 0)
else:
    combo_index = 0

while combo_index < num_combos:
    # Setup: start a fresh browser for each (re)start of the loop
    options = Options()
    options.add_experimental_option("detach", True)
    
    # Disable "Save As" dialog and set default download directory
    prefs = {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    }
    # Attach the prefs to the same options object that is passed to the driver
    options.add_experimental_option("prefs", prefs)
    
    driver = webdriver.Chrome(options=options)
    url = 'https://www.transtats.bts.gov/ontime/departures.aspx'
    driver.get(url)

    try:
        # Select all statistics, months, and days
        for name in ("chkAllStatistics", "chkAllMonths", "chkAllDays"):
            checkbox = driver.find_element(By.NAME, name)
            driver.execute_script("arguments[0].click();", checkbox)
        # Select years 2021-2025 (checkbox IDs chkYears_34 through chkYears_38)
        for year_idx in range(34, 39):
            checkbox = driver.find_element(By.ID, f"chkYears_{year_idx}")
            driver.execute_script("arguments[0].click();", checkbox)

        # for each airport-airline combination, starting from the saved index
        for i in range(combo_index, num_combos):
            # select the current airport and airline
            cur_airport = airport_airline_combos[i][0]
            cur_airline = airport_airline_combos[i][1]
            dropdown_element = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.NAME, "cboAirport")))
            select_airport = Select(dropdown_element)
            select_airport.select_by_value(cur_airport)
            dropdown_element2 = WebDriverWait(driver, 2).until(EC.element_to_be_clickable((By.NAME, "cboAirline")))
            select_airline = Select(dropdown_element2)
            select_airline.select_by_value(cur_airline)

            print(f"Airport: {cur_airport}\tAirline: {cur_airline}")

            # Click the button to generate the csv
            submit_button = driver.find_element(By.NAME, "btnSubmit")
            submit_button.click()

            # wait for the page to reload
            time.sleep(random.uniform(3, 7))

            try:
                # Check if there is data for this combination of airport and airline.
                # The site reports empty combinations with a fixed message on the page.
                if 'No data found for the above selection' not in driver.page_source:
                    # click the button to download the csv
                    download_button = driver.find_element(By.ID, "DL_CSV")
                    driver.execute_script("arguments[0].click();", download_button)
                
                    # Wait for the download to complete
                    current_file_name = "C:/Users/dloso/Downloads/Detailed_Statistics_Departures.csv"
                    # new_file_name = 'C:/Users/dloso/Documents/Data Science/flight-delay-forecasting/data/delays/' \
                    #     + cur_airport + '_' + cur_airline + '_Delays.csv'
                    new_file_name = 'C:/Users/dloso/Downloads/test/' \
                        + cur_airport + '_' + cur_airline + '_Delays.csv'
                    
                    download_path = 'C:/Users/dloso/Downloads/Detailed_Statistics_Departures'
                    start_time = time.time()
                    timeout = 300
                    poll_interval = 2
                    print(f"Waiting for download in: {download_path}")
                    
                    while time.time() - start_time < timeout:
                        # Check for any temporary/partial file extensions
                        # Common ones: .crdownload (Chrome), .part (Firefox/others), .tmp
                        temp_files = glob.glob(download_path + '*.crdownload') + \
                                     glob.glob(download_path + '*.part') + \
                                     glob.glob(download_path + '*.tmp')
                        
                        # Check for the expected final file(s) using the pattern
                        completed_files = glob.glob('C:/Users/dloso/Downloads/Detailed_Statistics_Departures.csv')
                
                        # If there are completed files and no temporary files, the download is likely done
                        if completed_files and not temp_files:
                            print('Download complete.')
                            break
                        
                        print(".", end="", flush=True) # Print a dot to show activity
                        time.sleep(poll_interval)
                    else:
                        # The loop ended without a break: the download did not finish in time
                        print(f'\nTimed out after {timeout} seconds waiting for the download.')
                    
                    try:
                        # Rename the file (os.rename is synchronous, so no extra wait is needed)
                        os.rename(current_file_name, new_file_name)
                        
                        print(f"File '{current_file_name}' renamed to '{new_file_name}' successfully.")

                        # Update the json file with the next index. Write to a temporary file in the same
                        # directory first, then swap it in, to mitigate corruption in the event of a crash
                        checkpoint_dir = os.path.dirname(os.path.abspath(PROGRESS_FILE))
                        with tempfile.NamedTemporaryFile('w', delete=False, dir=checkpoint_dir) as tf:
                            json.dump({'combo_index': i + 1}, tf)
                            temp_name = tf.name
                        os.replace(temp_name, PROGRESS_FILE)
                        
                    except FileNotFoundError:
                        print(f"Error: The file '{current_file_name}' was not found.")
                    except FileExistsError:
                        print(f"Error: The destination file '{new_file_name}' already exists.")
                        if cur_airline == 'MQ':
                            print('MQ airline--appending a 2 to the file.')
                            # new_file_name = 'C:/Users/dloso/Documents/Data Science/flight-delay-forecasting/data/delays/' \
                                # + cur_airport + '_' + cur_airline + '_2_Delays.csv'
                            new_file_name = 'C:/Users/dloso/Downloads/test/' \
                                + cur_airport + '_' + cur_airline + '_2_Delays.csv'
                            os.rename(current_file_name, new_file_name)
                    except Exception as e:
                        print(f"An unexpected error occurred during renaming: {e}")

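                else:
                    print('No data.')
            except Exception as e:
                # Report real failures separately instead of labelling them as missing data
                print(f'Error while downloading {cur_airport}/{cur_airline}: {e}')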

            combo_index += 1
    except Exception as e:
        print(f'Program errored out--will restart at current airport-airline combo {combo_index}. {e}')
        # Remove partial downloads and any leftover csv so the next attempt starts clean
        directory_path = 'C:/Users/dloso/Downloads/'
        for file_pattern in ('*.crdownload', 'Detailed_Statistics_Departures*'):
            # Construct the full pattern to search for
            search_pattern = os.path.join(directory_path, file_pattern)
            files_to_remove = glob.glob(search_pattern)
            
            print(f"Files found to remove: {files_to_remove}")
            
            for file_path in files_to_remove:
                try:
                    os.remove(file_path)
                    print(f"Removed: {file_path}")
                except OSError as err:
                    # Handle potential errors like permission issues, or if it's a directory
                    print(f"Error removing {file_path}: {err.strerror}")
    finally:
        driver.quit()

in while loop
Airport: ATL	Airline: TZ
No data.
Airport: ATL	Airline: FL
No data.
Airport: ATL	Airline: AS
Waiting for download in: C:/Users/dloso/Downloads/Detailed_Statistics_Departures
Download complete.
File 'C:/Users/dloso/Downloads/Detailed_Statistics_Departures.csv' renamed to 'C:/Users/dloso/Downloads/test/ATL_AS_Delays.csv' successfully.
Airport: ATL	Airline: G4


KeyboardInterrupt: 