## Data Acquisition and Cleaning

This project relies on two sources of data: the United States Bureau of Transportation Statistics and the National Climatic Data Center. The first can only be accessed through its website, while the second provides an API with an extensive guide to using it.

The Bureau of Transportation Statistics publishes extensive statistics on various forms of transportation, including airline performance data covering departure and arrival delays. Data for any year back to 1987 can be accessed in monthly chunks. However, because no API is provided, the relevant data has to be downloaded through a menu system by selecting all the required features.

The National Climatic Data Center offers historical weather data for a host of weather stations inside and outside the United States. Its API works through a request link that can be constructed for any station and time period, subject to availability. Available features include hourly weather conditions, liquid precipitation, air temperature and many more.

Both sources provide the requested data as CSV files.

In [1]:
#importing the necessary python libraries
import pandas as pd
import numpy as np
import glob
import requests as req
import io
from datetime import datetime
from meteostat import Hourly
from sklearn.impute import KNNImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')

#### Flight Data
As mentioned above, the Bureau of Transportation Statistics provides a menu system for data selection and download. The monthly files for the period from 1st January 2017 to 31st December 2019 were downloaded through it and saved to disk.

In [2]:
#creating a list containing all the csv files and sorting them alphabetically
csv_list = glob.glob('../data-sources/capstone-project/flight-data/[0-9][0-9].csv')
csv_list.sort()

This data reports several categories of delay. One of them, National Aviation System (NAS) delay, is an aggregate that includes delays caused by less-than-severe weather events. The only information available about the weather portion of this aggregate is a monthly percentage given at the national level.

In [3]:
#creating a dictionary containing the monthly weather share of delays labeled as National Aviation System delays
weather_share_nas = {'2017-01': 0.6617,
                     '2017-02': 0.6814,
                     '2017-03': 0.6475,
                     '2017-04': 0.5684,
                     '2017-05': 0.5320,
                     '2017-06': 0.6302,
                     '2017-07': 0.7192,
                     '2017-08': 0.7191,
                     '2017-09': 0.6112,
                     '2017-10': 0.5916,
                     '2017-11': 0.6567,
                     '2017-12': 0.5707,
                     '2018-01': 0.6056,
                     '2018-02': 0.6057,
                     '2018-03': 0.5891,
                     '2018-04': 0.6408,
                     '2018-05': 0.7559,
                     '2018-06': 0.6445,
                     '2018-07': 0.7273,
                     '2018-08': 0.7956,
                     '2018-09': 0.6346,
                     '2018-10': 0.6169,
                     '2018-11': 0.6599,
                     '2018-12': 0.6391,
                     '2019-01': 0.6986,
                     '2019-02': 0.6886,
                     '2019-03': 0.6188,
                     '2019-04': 0.7336,
                     '2019-05': 0.7774,
                     '2019-06': 0.7497,
                     '2019-07': 0.7903,
                     '2019-08': 0.7390,
                     '2019-09': 0.5391,
                     '2019-10': 0.5939,
                     '2019-11': 0.6203,
                     '2019-12': 0.6742}

In [4]:
#loading the first csv file, changing the column names to lower case, renaming the weather_delay to
#extreme_weather_delay for greater accuracy and extracting the weather share of NAS delays
flight_set = pd.read_csv(csv_list[0])
flight_set.columns = flight_set.columns.str.lower()
flight_set.rename(columns={'weather_delay': 'extreme_weather_delay'}, inplace=True)
flight_set['weather_delay'] = flight_set['nas_delay'] * weather_share_nas[list(weather_share_nas.keys())[0]]

In [5]:
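#setting up a loop to load the remaining files; the counter starts at 1 so that each file is matched
#with its own month in weather_share_nas (the first file has already used index 0)
for i, file in enumerate(csv_list[1:], start=1):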
    #loading the remaining files and concatenating the contents one by one into a single dataframe
    next_flight_set = pd.read_csv(file)
    next_flight_set.columns = next_flight_set.columns.str.lower()
    next_flight_set.rename(columns={'weather_delay': 'extreme_weather_delay'}, inplace=True)
    next_flight_set['weather_delay'] = next_flight_set['nas_delay'] * \
                                       weather_share_nas[list(weather_share_nas.keys())[i]]
    flight_set = pd.concat([flight_set, next_flight_set])
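The weather share can also be looked up from each flight's own month instead of from the order in which the files are read. The following is a minimal sketch under assumptions: the two-row `sample_flights` frame is invented for illustration, `fl_date` is assumed to hold ISO-formatted date strings, and only two entries of the share dictionary are repeated in `weather_share_sample`.

```python
import pandas as pd

# two entries copied from the monthly weather-share dictionary
weather_share_sample = {'2017-01': 0.6617, '2017-02': 0.6814}

# hypothetical flights; the values are invented for illustration
sample_flights = pd.DataFrame({'fl_date': ['2017-01-15', '2017-02-03'],
                               'nas_delay': [30.0, 10.0]})

# deriving a 'YYYY-MM' key from each flight date and looking up that month's share
month_key = pd.to_datetime(sample_flights['fl_date']).dt.strftime('%Y-%m')
sample_flights['weather_delay'] = sample_flights['nas_delay'] * month_key.map(weather_share_sample)
```

With this approach a 30-minute NAS delay in January 2017 is attributed 30 × 0.6617 minutes of weather delay, regardless of which file the row came from.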

In [6]:
#finding the top twenty airports of origin and forming a new dataframe containing the data related to
#these airports only (copied so that later column assignments do not act on a slice of flight_set)
top_twenty_origins = list(flight_set['origin'].value_counts().sort_values(ascending=False).head(20).index)
top_twenty_airports = flight_set[flight_set['origin'].isin(top_twenty_origins)].copy()

In [7]:
#converting the scheduled departure time into a string
top_twenty_airports['crs_dep_time'] = top_twenty_airports['crs_dep_time'].astype(str)

In [8]:
#defining a function to transform the scheduled departure time to a consistent 4-digit format
def correct_time(cell):
    #left-padding with zeros, e.g. '5' becomes '0005' and '930' becomes '0930'
    return cell.zfill(4)

In [9]:
#using the above function to correct the formatting
top_twenty_airports['crs_dep_time'] = top_twenty_airports['crs_dep_time'].apply(correct_time)
#converting the scheduled departure time to hh:mm:ss format
top_twenty_airports['crs_dep_time'] = top_twenty_airports['crs_dep_time'].apply(lambda x: x[0: 2]+':'+x[2:]+':00')
#combining the scheduled departure time with the departure date in order to create a single timestamp
top_twenty_airports['dep_timestamp'] = pd.to_datetime(top_twenty_airports['fl_date'] \
                                                      +' '+top_twenty_airports['crs_dep_time'])
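The padding and timestamp steps above can also be expressed with vectorised string methods. This is a minimal sketch, not the notebook's own code: the `sample` frame is invented, and `fl_date` is assumed to be an ISO-formatted date string.

```python
import pandas as pd

# hypothetical rows: 5 means 00:05 and 1430 means 14:30
sample = pd.DataFrame({'fl_date': ['2017-01-01', '2017-01-01'],
                       'crs_dep_time': [5, 1430]})

# left-padding the scheduled departure time to four digits
padded = sample['crs_dep_time'].astype(str).str.zfill(4)

# parsing date and time together with an explicit format
sample['dep_timestamp'] = pd.to_datetime(sample['fl_date'] + ' ' + padded,
                                         format='%Y-%m-%d %H%M')
```

Passing an explicit `format` avoids relying on format inference and skips the intermediate hh:mm:ss string.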

In [10]:
#dropping the columns that are no longer required and sorting the dataframe by origin
top_twenty_airports.drop(['unnamed: 25', 'fl_date', 'crs_dep_time', 'dep_time'], axis=1, inplace=True)
top_twenty_airports.sort_values('origin', inplace=True)

In [11]:
#correcting the order in which the columns appear in the dataframe
top_twenty_airports = top_twenty_airports[['dep_timestamp', 'op_unique_carrier', 'tail_num', 'origin',
                                           'origin_city_name', 'dest', 'dest_city_name', 'dep_delay_new',
                                           'wheels_off', 'wheels_on', 'taxi_in', 'arr_delay_new', 'cancelled',
                                           'cancellation_code', 'diverted', 'air_time', 'distance', 'carrier_delay',
                                           'extreme_weather_delay', 'weather_delay', 'nas_delay', 'security_delay',
                                           'late_aircraft_delay']]

In [12]:
#replacing the missing values in the various delay columns with zero
delay_columns = ['carrier_delay', 'extreme_weather_delay', 'weather_delay', 'nas_delay', 'security_delay',
                 'late_aircraft_delay']
top_twenty_airports[delay_columns] = top_twenty_airports[delay_columns].fillna(0)

#writing the dataframe into a csv file
top_twenty_airports.to_csv('../data-sources/capstone-project/flight-data/top-twenty-airports.csv', index=False)

#### Weather Data
Accessing the API provided by the National Climatic Data Center is possible through a request link which should contain the name of the dataset, the target weather station, the start and end dates, the required features and so on. The instructions available on the website explain how to build it. Although this data is extensive, there are some missing values, especially in the weather condition information. To fill this gap, I decided to use the Meteostat Python library as a secondary source of historical weather data.

In [13]:
#setting up the url template for NCDC's API
url_template = 'https://www.ncei.noaa.gov/access/services/data/v1?dataset=global-hourly&stations={}&includeStationName=1&includeStationLocation=1&startDate=2016-12-31&endDate=2020-01-01&dataTypes=TMP,VIS,WND,CIG,DEW,SLP,AA1&units=metric&format=csv'
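The request link can also be assembled from named parameters, which makes it easier to change the date range or the list of features. The sketch below is an illustration only: `build_ncdc_url` is a hypothetical helper, and the parameter names and values are simply those that appear in the template above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ncei.noaa.gov/access/services/data/v1'

def build_ncdc_url(station, start_date='2016-12-31', end_date='2020-01-01'):
    """Hypothetical helper that rebuilds the request link from its parts."""
    params = {'dataset': 'global-hourly',
              'stations': station,
              'includeStationName': '1',
              'includeStationLocation': '1',
              'startDate': start_date,
              'endDate': end_date,
              'dataTypes': 'TMP,VIS,WND,CIG,DEW,SLP,AA1',
              'units': 'metric',
              'format': 'csv'}
    # keeping commas unescaped so the data type list stays readable
    return BASE_URL + '?' + urlencode(params, safe=',')

dfw_url = build_ncdc_url('72259003927')
```

For the DFW station code this produces the same link as formatting the template string.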

In [14]:
#setting up the start and end parameters to be used by meteostat module
start = datetime(2016, 12, 31)
end = datetime(2020, 1, 1, 23, 59)

In [15]:
#creating a dictionary containing the station codes related to the top twenty airports
station_dict = {'DFW': '72259003927',
                'DCA': '72405013743',
                'ORD': '72530094846',
                'SFO': '72494023234',
                'PHX': '72278023183',
                'LAS': '72386023169',
                'JFK': '74486094789',
                'LGA': '72503014732',
                'MSP': '72658014922',
                'IAH': '72243012960',
                'SEA': '72793024233',
                'SLC': '72572024127',
                'LAX': '72295023174',
                'DEN': '72565003017',
                'ATL': '72219013874',
                'EWR': '72502014734',
                'MCO': '72205012815',
                'BOS': '72509014739',
                'DTW': '72537094847',
                'CLT': '72314013881'}

In [16]:
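#defining a function to extract feature values from the raw data
def value_extractor(cell, position):
    try:
        return int(cell.split(',')[position])
    #missing cells (NaN) and malformed fields yield None, which pandas treats as a missing value
    except (AttributeError, IndexError, ValueError):
        return None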
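To illustrate how the extractor works on the raw comma-separated fields, here is a small sketch. `extract_field` is a hypothetical stand-alone copy of the function, the dew point string is the pattern quoted later in this notebook, and the wind string is an invented example that assumes the field order direction, quality code, type code, speed, quality code.

```python
def extract_field(cell, position):
    """Hypothetical stand-alone version of the extractor, for illustration only."""
    try:
        return int(cell.split(',')[position])
    except (AttributeError, IndexError, ValueError):
        # non-string cells (such as NaN) and malformed fields are treated as missing
        return None

dew_raw = '+0167,1'          # dew point in tenths of a degree, followed by a quality code
wnd_raw = '170,1,N,0046,1'   # assumed layout, invented values

dew_celsius = extract_field(dew_raw, 0) / 10   # tenths of a degree to degrees Celsius
wind_speed = extract_field(wnd_raw, 3) / 10    # tenths of m/s to m/s
missing = extract_field(float('nan'), 0)       # a NaN cell cannot be split
```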

Wind direction data is given in degrees. Because this scale is circular, numerically distant values can describe almost the same direction (359 and 1 degrees, for example), so the raw figure cannot be treated as an ordinary magnitude. Therefore, I decided to encode the 360-degree scale into eight different categories for future use.

In [17]:
#defining a function to encode wind direction figures
def wind_encoder(cell):
    if cell >= 22.5 and cell < 67.5:
        return 'north-east'
    elif cell >= 67.5 and cell < 112.5:
        return 'east'
    elif cell >= 112.5 and cell < 157.5:
        return 'south-east'
    elif cell >= 157.5 and cell < 202.5:
        return 'south'
    elif cell >= 202.5 and cell < 247.5:
        return 'south-west'
    elif cell >= 247.5 and cell < 292.5:
        return 'west'
    elif cell >= 292.5 and cell < 337.5:
        return 'north-west'
    else:
        return 'north'
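The same eight-way encoding can be computed arithmetically by shifting the scale by half a sector and dividing by the 45-degree sector width. This is a sketch of an alternative, not the function used in this notebook; `wind_sector` and `SECTORS` are hypothetical names. Unlike the chain of conditions, it returns `None` for a missing direction instead of falling through to 'north'.

```python
import math

SECTORS = ['north', 'north-east', 'east', 'south-east',
           'south', 'south-west', 'west', 'north-west']

def wind_sector(degrees):
    """Hypothetical encoder: maps a direction in degrees to one of eight sectors."""
    if degrees is None or math.isnan(degrees):
        return None
    # shifting by 22.5 puts the start of the 'north' sector at zero
    return SECTORS[int(((degrees + 22.5) % 360) // 45)]

examples = [wind_sector(d) for d in (0, 22.5, 100, 292.5, 337.5, 350)]
```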

In [18]:
#setting up a loop to download and clean the data related to each station one by one
for key in list(station_dict.keys()):
    #sending the request via the API url, reading the downloaded csv file into a pandas dataframe and changing
    #column names to lower case
    weather_file = req.get(url_template.format(station_dict[key]))
    weather = pd.read_csv(io.StringIO(weather_file.content.decode('utf-8')))
    weather.columns = weather.columns.str.lower()

    #converting the time column to pandas datetime and setting that column as dataframe index
    weather['date'] = pd.to_datetime(weather['date'])
    weather.set_index('date', drop=True, inplace=True)

    #dropping the unnecessary columns
    weather.drop(['source', 'report_type', 'call_sign', 'quality_control'], axis=1, inplace=True)

    #replacing the code column with the current key and re-ordering the columns
    weather['code'] = key
    weather = weather[['station', 'name', 'code', 'latitude', 'longitude', 'elevation', 'aa1', 'cig', 'dew',
                                    'slp', 'tmp', 'vis', 'wnd']]

    #setting up the conditions for the correct discovery of the hourly observations which depends on the station
    if key != 'DTW':
        weather.drop(weather[weather.index.minute != weather.index.minute[3]].index, inplace=True)
    else:
        weather.drop(weather[weather.index.minute != weather.index.minute[4]].index, inplace=True)

    #resetting the index to a numeric value
    weather.reset_index(inplace=True)

    #dropping repeat observations from EWR station which has unique characteristics
    if key == 'EWR':
        weather.drop(weather[(weather['slp'].str.contains('99999')) &
                                        (weather['dew'] == '+0167,1')].index, inplace=True)

    #dropping repeat observations with a specific pattern
    weather.drop(weather[(weather['cig'].str.contains('99999')) & (weather['dew'].str.contains('9999')) &
                                        (weather['slp'].str.contains('99999')) &
                                        (weather['tmp'].str.contains('9999')) &
                                        (weather['vis'].str.contains('999999'))].index, inplace=True)

    #creating a list of missing timestamps
    missing_hours = list(set(pd.date_range(list(weather['date'])[0], list(weather['date'])[-1], 
                                                               periods=26328)) - set(weather['date']))

    #building the missing observations in a single dataframe and setting their features to known or
    #missing values depending on the specific feature
    if missing_hours:
        new_hours = pd.DataFrame({'date': missing_hours,
                                  'station': weather['station'].iloc[0],
                                  'name': weather['name'].iloc[0],
                                  'code': key,
                                  'latitude': weather['latitude'].iloc[0],
                                  'longitude': weather['longitude'].iloc[0],
                                  'elevation': weather['elevation'].iloc[0],
                                  'aa1': np.nan,
                                  'cig': np.nan,
                                  'dew': np.nan,
                                  'slp': np.nan,
                                  'tmp': np.nan,
                                  'vis': np.nan,
                                  'wnd': np.nan,
                                  'wnd_dir': np.nan})

        #adding the missing observations to the dataframe (DataFrame.append was removed in pandas 2.0)
        weather = pd.concat([weather, new_hours], ignore_index=True)

    #sorting the dataframe according to time and resetting the index
    weather.sort_values('date', inplace=True)
    weather.set_index('date', drop=True, inplace=True)

    #using the meteostat module to download further data related to weather condition
    add_weather = Hourly(station_dict[key][:5], start, end)
    add_weather = add_weather.normalize()
    add_weather = add_weather.fetch()
    #converting observations with code zero to missing values and adding them to the data
    #downloaded through NCDC's API
    add_weather.loc[add_weather[add_weather['coco'] == 0.0].index, 'coco'] = np.nan
    add_weather.set_index(weather.index, drop=True, inplace=True)
    weather['au1'] = add_weather['coco']

    #extracting precipitation figures, converting to millimeters, and turning missing data into NaNs
    weather['aa1'] = weather['aa1'].apply(value_extractor, args=(1,))
    weather['aa1'] = weather['aa1'].apply(lambda x: x/10)
    weather.loc[weather[weather['aa1'] == 999.9].index, 'aa1'] = np.nan

    #extracting cloud altitude figures and converting missing data into NaNs
    weather['cig'] = weather['cig'].apply(value_extractor, args=(0,))
    weather.loc[weather[weather['cig'] == 99999].index, 'cig'] = np.nan

    #extracting dew point figures, converting to degrees celsius, and turning missing data into NaNs
    weather['dew'] = weather['dew'].apply(value_extractor, args=(0,))
    weather['dew'] = weather['dew'].apply(lambda x: x/10)
    weather.loc[weather[weather['dew'] == 999.9].index, 'dew'] = np.nan

    #extracting pressure figures and turning missing data into NaNs
    weather['slp'] = weather['slp'].apply(value_extractor, args=(0,))
    weather.loc[weather[weather['slp'] == 99999].index, 'slp'] = np.nan

    #extracting air temperature figures, converting to degrees celsius, and turning missing data into NaNs
    weather['tmp'] = weather['tmp'].apply(value_extractor, args=(0,))
    weather['tmp'] = weather['tmp'].apply(lambda x: x/10)
    weather.loc[weather[weather['tmp'] == 999.9].index, 'tmp'] = np.nan

    #extracting visibility figures and turning missing data into NaNs
    weather['vis'] = weather['vis'].apply(value_extractor, args=(0,))
    weather.loc[weather[weather['vis'] == 999999].index, 'vis'] = np.nan

    #extracting wind direction figures and replacing missing data with NaNs
    weather['wnd_dir'] = weather['wnd'].apply(value_extractor, args=(0,))
    weather.loc[weather[weather['wnd_dir'] == 999].index, 'wnd_dir'] = np.nan

    #extracting wind speed figures, converting into meters per second, and turning missing data into NaNs
    weather['wnd'] = weather['wnd'].apply(value_extractor, args=(3,))
    weather['wnd'] = weather['wnd'].apply(lambda x: x/10)
    weather.loc[weather[weather['wnd'] == 999.9].index, 'wnd'] = np.nan

    #resetting dataframe index to numerical values
    weather.reset_index(inplace=True)

    #creating a new dataframe containing only the continuous features with index equal to the primary dataframe
    weather_partial = weather[['aa1', 'cig', 'dew', 'slp', 'tmp', 'vis', 'wnd', 'wnd_dir']]
    weather_partial.set_index(weather.index, inplace=True)

    #instantiating a StandardScaler and standardising this partial data
    scaler = StandardScaler()
    weather_partial_std = pd.DataFrame(scaler.fit_transform(weather_partial), index=weather.index,
                                                                columns=weather_partial.columns)

    #instantiating a KNNImputer in order to impute the missing values in continuous features
    imputer = KNNImputer(n_neighbors=5)
    weather_partial_std = pd.DataFrame(imputer.fit_transform(weather_partial_std),
                                       columns=weather_partial_std.columns)

    #inverse transforming the standardised data in order to go back to original units
    weather_partial = pd.DataFrame(scaler.inverse_transform(weather_partial_std), index=weather.index,
                                                                columns=weather_partial_std.columns)

    #replacing the continuous features in the primary dataframe with this newly imputed data
    weather[['aa1', 'cig', 'dew', 'slp', 'tmp', 'vis', 'wnd', 'wnd_dir']] = \
                                    np.round(weather_partial[['aa1', 'cig', 'dew', 'slp', 'tmp',
                                                                'vis', 'wnd', 'wnd_dir']], 1)

    #creating another subset containing the continuous features plus the categorical weather condition data
    features = weather[['aa1', 'cig', 'dew', 'slp', 'tmp', 'vis', 'wnd', 'wnd_dir', 'au1']]

    #dividing this subset based on available and missing data in weather condition column
    X = features[features['au1'].notna()]
    X_predict = features[features['au1'].isnull()]

    #separating the predictor and target variables
    y = X.pop('au1')
    X_predict.pop('au1')

    #dividing the data with no missing weather condition figures into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    #standardising the train, test and the predict sets
    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    X_predict = pd.DataFrame(scaler.transform(X_predict), index=X_predict.index, columns=X_predict.columns)

    #instantiating a KNeighborsClassifier model in order to predict the missing weather condition data
    model_knn = KNeighborsClassifier(n_jobs=-2, n_neighbors=6, metric='manhattan', weights='distance')
    model_knn.fit(X_train, y_train)

    #inserting the predicted weather condition figures into the primary dataframe
    weather.loc[X_predict.index, 'au1'] = pd.Series(model_knn.predict(X_predict), index=X_predict.index)
    
    #applying the function to the wind direction figures
    weather['wnd_dir'] = weather['wnd_dir'].apply(wind_encoder)

    #turning the date column into the dataframe index
    weather.set_index('date', drop=True, inplace=True)

    #writing the weather data to csv files
    weather.to_csv(f'../data-sources/capstone-project/weather-data/{key}-weather.csv', index=True)
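The imputation of the continuous features in the loop above follows a scale, impute, inverse-scale pattern. The sketch below shows that pattern on a tiny array; the readings are invented and `n_neighbors=2` is chosen only because the example has five rows. It relies on `StandardScaler` ignoring NaNs when fitting and passing them through when transforming.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# invented readings: air temperature and sea level pressure; NaN marks a missing value
readings = np.array([[10.0, 10150.0],
                     [11.0, np.nan],
                     [12.0, 10170.0],
                     [np.nan, 10180.0],
                     [14.0, 10190.0]])

# standardising first so that both features contribute equally to the neighbour distances
scaler = StandardScaler()
scaled = scaler.fit_transform(readings)

# filling each gap with the mean of the nearest neighbours in standardised space
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# returning to the original units
imputed = scaler.inverse_transform(imputed_scaled)
```

Without the scaling step, the feature with the largest numeric range (pressure here) would dominate the distance calculation.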

#### Joining Flight and Weather Data
Once the two sets of data are collated and cleaned, they have to be joined into a single set. This is done separately for each departure airport, and the results are then combined.

In [19]:
#loading the flight data into a dataframe
flights = pd.read_csv('../data-sources/capstone-project/flight-data/top-twenty-airports.csv')

In [20]:
#converting the dep_timestamp column into pandas datetime, renaming it and sorting
#the dataframe according to this column
flights['dep_timestamp'] = pd.to_datetime(flights['dep_timestamp'])
flights.rename(columns={'dep_timestamp': 'time'}, inplace=True)
flights.sort_values('time', inplace=True)

In [26]:
#creating a list of csv files containing weather data for all origin airports
weather_csv_list = glob.glob('../data-sources/capstone-project/weather-data/[A-Z][A-Z][A-Z]-weather.csv')
weather_csv_list.sort()

In [27]:
#setting up a loop to join the flight data and the weather data from different origins
for file in weather_csv_list:

    #loading the weather csv, converting the date column into pandas datetime and renaming it
    weather = pd.read_csv(file)
    weather['date'] = pd.to_datetime(weather['date'])
    weather.rename(columns={'date': 'time'}, inplace=True)

    #subsetting the flight data using the origin airport from the weather data
    airport = flights[flights['origin'] == weather.loc[weather.index[0], 'code']]

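    #joining the two dataframes on time, matching each flight with the most recent weather observation at or
    #before its scheduled departure (the default backward direction of merge_asof), and resetting the index to time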
    airport_weather = pd.merge_asof(airport, weather, on='time')
    airport_weather.set_index('time', drop=True, inplace=True)

    #writing the combined dataframe to a csv file
    airport_weather.to_csv(f"../data-sources/capstone-project/agg-data/{weather.loc[weather.index[0], 'code']}-airport.csv")
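By default `pd.merge_asof` matches each left row with the last right row whose key is at or before it; a true nearest match requires `direction='nearest'`. The following sketch contrasts the two on invented data (the frames and values are hypothetical).

```python
import pandas as pd

# one hypothetical flight scheduled two minutes before a weather observation
sample_flights = pd.DataFrame({'time': pd.to_datetime(['2017-01-01 10:50']),
                               'origin': ['ATL']})
sample_weather = pd.DataFrame({'time': pd.to_datetime(['2017-01-01 09:52',
                                                       '2017-01-01 10:52']),
                               'tmp': [3.0, 5.0]})

# default behaviour: the most recent observation at or before the flight time
backward = pd.merge_asof(sample_flights, sample_weather, on='time')

# nearest observation in either direction
nearest = pd.merge_asof(sample_flights, sample_weather, on='time', direction='nearest')
```

Here the backward match picks the 09:52 observation while the nearest match picks the 10:52 one. Both inputs must be sorted by the key column.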

In [28]:
#creating a list of all the csv files containing the combined flight and weather data for each origin airport
airport_csv_list = glob.glob('../data-sources/capstone-project/agg-data/[A-Z][A-Z][A-Z]-airport.csv')
airport_csv_list.sort()

In [29]:
#loading the first file into a dataframe and parsing the time index as datetimes
airport_weather_agg = pd.read_csv(airport_csv_list[0], parse_dates=True, index_col='time')

In [30]:
#looping over the remaining files, loading them into a dataframe and concatenating them
#with the dataframe loaded above
for file in airport_csv_list[1:]:
    next_airport_weather_agg = pd.read_csv(file, parse_dates=True, index_col='time')
    airport_weather_agg = pd.concat([airport_weather_agg, next_airport_weather_agg])
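Loading a set of files and combining them can also be done with a single `pd.concat` call over a list of dataframes, which avoids copying the growing dataframe on every iteration. This is a minimal sketch: `load_all` is a hypothetical helper, and the two small files are written to a temporary directory purely to make the example self-contained.

```python
import glob
import os
import tempfile

import pandas as pd

def load_all(pattern):
    """Hypothetical helper: reads every matching csv and concatenates them once."""
    files = sorted(glob.glob(pattern))
    frames = [pd.read_csv(f, index_col='time', parse_dates=True) for f in files]
    return pd.concat(frames).sort_index()

with tempfile.TemporaryDirectory() as tmp:
    # writing two invented single-row files that mimic the per-airport layout
    for code, stamp in [('ATL', '2017-01-02 08:00'), ('BOS', '2017-01-01 09:00')]:
        pd.DataFrame({'time': [stamp], 'origin': [code]}).to_csv(
            os.path.join(tmp, f'{code}-airport.csv'), index=False)

    combined = load_all(os.path.join(tmp, '[A-Z][A-Z][A-Z]-airport.csv'))
```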

In [31]:
#dropping all the unwanted features
airport_weather_agg.drop(['station', 'origin_city_name', 'dest_city_name', 'name', 'code', 'taxi_in',
                                    'wheels_on', 'wheels_off'], axis=1, inplace=True)

In [32]:
#renaming some of the columns to more usable names
airport_weather_agg.rename(columns={'op_unique_carrier': 'carrier',
                                    'dest': 'destination',
                                    'dep_delay_new': 'dep_delay',
                                    'arr_delay_new': 'arr_delay',
                                    'aa1': 'precipitation',
                                    'au1': 'condition',
                                    'cig': 'cloud_base',
                                    'dew': 'dew_temp',
                                    'slp': 'pressure',
                                    'tmp': 'air_temp',
                                    'vis': 'visibility',
                                    'wnd': 'wind_speed',
                                    'wnd_dir': 'wind_direction'}, inplace=True)

In [33]:
#reordering the columns
airport_weather_agg = airport_weather_agg[['origin', 'latitude', 'longitude', 'elevation', 'carrier', 'tail_num',
                                           'precipitation', 'condition', 'cloud_base', 'dew_temp', 'pressure',
                                           'air_temp', 'visibility', 'wind_speed', 'wind_direction', 'dep_delay',
                                           'carrier_delay', 'extreme_weather_delay', 'weather_delay', 'nas_delay',
                                           'security_delay', 'late_aircraft_delay', 'cancelled', 'cancellation_code',
                                           'diverted', 'destination', 'arr_delay', 'air_time', 'distance']]

In [34]:
#sorting the dataframe by its datetime index
airport_weather_agg.sort_index(inplace=True)

In [35]:
#creating different combinations of the data
geographical_features = airport_weather_agg[['origin', 'latitude', 'longitude', 'elevation', 'destination']]
carrier_features = airport_weather_agg[['carrier', 'tail_num', 'diverted', 'air_time', 'distance']]
core_features = airport_weather_agg[['origin', 'precipitation', 'condition', 'cloud_base', 'dew_temp', 'pressure',
                                     'air_temp', 'visibility', 'wind_speed', 'wind_direction', 'dep_delay',
                                     'carrier_delay', 'extreme_weather_delay', 'weather_delay', 'nas_delay',
                                     'security_delay', 'late_aircraft_delay', 'cancelled', 'cancellation_code',
                                     'arr_delay']]

In [36]:
#writing the different combinations to csv files
airport_weather_agg.to_csv('../data-sources/capstone-project/agg-data/airport-weather-agg.csv')
geographical_features.to_csv('../data-sources/capstone-project/agg-data/geographical_features.csv')
carrier_features.to_csv('../data-sources/capstone-project/agg-data/carrier_features.csv')
core_features.to_csv('../data-sources/capstone-project/agg-data/core_features.csv')