This notebook performs the final steps needed to assemble the complete dataset.

In [148]:
import pandas as pd
import numpy as np
import pickle

from sklearn import linear_model as lm
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing

import math
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier

plt.style.use('fivethirtyeight')

Load the combined flight schedule/weather data, then the business passenger ratios for each origin-destination pair (flight route), then the hourly scheduled volume data for each airport and year.

In [149]:
with open('aotp_all_data.pkl', 'rb') as picks:
    aotp_nyc_data = pickle.load(picks)

In [150]:
aotp_nyc_data = aotp_nyc_data.reset_index()

In [151]:
with open('bus_pass_ratio.pkl', 'rb') as picks2:
    bus_pass_ratio = pickle.load(picks2)

In [152]:
bus_pass_ratio1516 = bus_pass_ratio[bus_pass_ratio['Year'] < 2017]
bus_pass_ratio17 = bus_pass_ratio[bus_pass_ratio['Year'] == 2017]

In [153]:
with open('volhr_15.pkl', 'rb') as picks3:
    volhr_15 = pickle.load(picks3)

with open('volhr_16.pkl', 'rb') as picks4:
    volhr_16 = pickle.load(picks4)
    
with open('volhr_17.pkl', 'rb') as picks5:
    volhr_17 = pickle.load(picks5)

Merge the combined flight schedule/weather data with the business passenger ratios, matching on year, quarter, and O/D pair.

In [154]:
aotp_nyc_data = aotp_nyc_data.merge(bus_pass_ratio, on=['Year', 'Quarter', 'Origin', 'Dest'])
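Because `merge` defaults to an inner join, any flight whose (`Year`, `Quarter`, `Origin`, `Dest`) key has no business passenger ratio is dropped without warning. The following is a minimal sketch of one way to count such rows before committing to the inner join. The frames `flights` and `ratios` and all their values are made-up stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for the real frames; every value here is illustrative.
flights = pd.DataFrame({
    'Year': [2015, 2015, 2016],
    'Quarter': [1, 1, 2],
    'Origin': ['JFK', 'LGA', 'JFK'],
    'Dest': ['LAX', 'ORD', 'SFO'],
})
ratios = pd.DataFrame({
    'Year': [2015, 2015],
    'Quarter': [1, 1],
    'Origin': ['JFK', 'LGA'],
    'Dest': ['LAX', 'ORD'],
    'BusPassRatio': [0.04, 0.10],
})

keys = ['Year', 'Quarter', 'Origin', 'Dest']

# A left join keeps every flight. indicator=True adds a '_merge' column
# that reads 'left_only' for flights with no matching ratio.
# validate='many_to_one' raises if a key is duplicated in `ratios`.
checked = flights.merge(ratios, on=keys, how='left',
                        indicator=True, validate='many_to_one')
n_unmatched = int((checked['_merge'] == 'left_only').sum())
print(n_unmatched)
```

If the count is zero, the inner join used in the cell above loses nothing; otherwise the unmatched rows can be inspected before deciding how to handle them.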

In [155]:
aotp_nyc_data['ArrDel15'] = aotp_nyc_data['ArrDel15'].astype('float')

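Convert the weather measurements from their original units: temperature from tenths of degrees Celsius to degrees Fahrenheit, precipitation from tenths of millimeters to inches, and wind speed from tenths of meters per second to miles per hour. The original units are given [here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt) in Section III.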

In [156]:
# Average temperature: tenths of degrees Celsius -> degrees Fahrenheit
aotp_nyc_data['ATemp'] = (9/5)*(aotp_nyc_data['ATemp'] / 10) + 32
aotp_nyc_data['ATempDest'] = (9/5)*(aotp_nyc_data['ATempDest'] / 10) + 32

In [157]:
# Precipitation: tenths of millimeters -> inches (25.4 mm per inch)
aotp_nyc_data['Precip'] = (aotp_nyc_data['Precip'] / 10) / 25.4
aotp_nyc_data['PrecipDest'] = (aotp_nyc_data['PrecipDest'] / 10) / 25.4

In [158]:
# Average wind speed: tenths of meters per second -> miles per hour
aotp_nyc_data['AWind'] = (aotp_nyc_data['AWind'] / 10) * 2.23694
aotp_nyc_data['AWindDest'] = (aotp_nyc_data['AWindDest'] / 10) * 2.23694
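The three conversions above can also be written as small named functions, which makes each formula easy to check on a single value. This is only a sketch: the function names are hypothetical and do not appear elsewhere in the notebook.

```python
def tenths_celsius_to_fahrenheit(value):
    """Convert tenths of degrees Celsius to degrees Fahrenheit."""
    return (9 / 5) * (value / 10) + 32


def tenths_mm_to_inches(value):
    """Convert tenths of millimeters to inches (25.4 mm per inch)."""
    return (value / 10) / 25.4


def tenths_mps_to_mph(value):
    """Convert tenths of meters per second to miles per hour."""
    return (value / 10) * 2.23694


# A raw reading of 3 (0.3 degrees C) is about 32.54 degrees F.
print(round(tenths_celsius_to_fahrenheit(3), 2))
# A raw reading of 254 (25.4 mm) is one inch.
print(round(tenths_mm_to_inches(254), 2))
# A raw reading of 10 (1 m/s) is about 2.24 mph.
print(round(tenths_mps_to_mph(10), 2))
```

The functions work unchanged on a pandas column, for example `tenths_celsius_to_fahrenheit(aotp_nyc_data['ATemp'])`, because they use only arithmetic operators.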

Here we compute, for each airport and each hour, the scheduled hourly volume as a fraction of capacity (`VolHr / MaxVolHr`). We store this metric in a column `AirportPerf`, which is attached to the main dataframe further below.

In [159]:
# Stack the three yearly volume tables into one
volhr = pd.concat([volhr_15, volhr_16, volhr_17], axis=0)

In [160]:
volhr['AirportPerf'] = volhr['VolHr'] / volhr['MaxVolHr']

In [161]:
volhr = volhr[['FlightDate', 'TimeBlk', 'Airport', 'AirportPerf']]

For each hour and airport, put the fraction of capacity (for scheduled volume), `AirportPerf`, into a dictionary keyed by (`FlightDate`, `TimeBlk`, `Airport`). Then apply two lookup functions, written specifically for this dataset, to each row of the main dataframe: one retrieves the `AirportPerf` value for the flight's origin and the other for its destination, stored as `OrigPerf` and `DestPerf` respectively.

In [162]:
volhrdict = volhr.set_index(['FlightDate', 'TimeBlk', 'Airport']).to_dict()

In [163]:
def get_orig_perf(row):
    # Look up AirportPerf for the origin airport in the departure time block
    tup = (row['FlightDate'], row['DepTimeBlk'], row['Origin'])
    return volhrdict['AirportPerf'][tup]

In [164]:
aotp_nyc_data['OrigPerf'] = aotp_nyc_data.apply(get_orig_perf, axis=1)

In [165]:
def get_dest_perf(row):
    # Look up AirportPerf for the destination airport in the arrival time block
    tup = (row['FlightDate'], row['ArrTimeBlk'], row['Dest'])
    return volhrdict['AirportPerf'][tup]

In [166]:
aotp_nyc_data['DestPerf'] = aotp_nyc_data.apply(get_dest_perf, axis=1)
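The row-wise `apply` above calls a Python function once per flight, which can be slow on a large dataframe. A possible alternative, sketched here on made-up toy frames, is to attach `AirportPerf` with two left merges. The helper `add_perf`, the time block labels, and all values are assumptions for illustration only. One behavioral difference: a missing key yields `NaN` here, whereas the dictionary lookup raises a `KeyError`.

```python
import pandas as pd

# Toy stand-ins; every value here is illustrative.
volhr = pd.DataFrame({
    'FlightDate': ['2015-01-01', '2015-01-01'],
    'TimeBlk': ['0800-0859', '1100-1159'],
    'Airport': ['JFK', 'LAX'],
    'AirportPerf': [0.65, 0.76],
})
flights = pd.DataFrame({
    'FlightDate': ['2015-01-01', '2015-01-02'],
    'DepTimeBlk': ['0800-0859', '0800-0859'],
    'ArrTimeBlk': ['1100-1159', '1100-1159'],
    'Origin': ['JFK', 'JFK'],
    'Dest': ['LAX', 'LAX'],
})


def add_perf(df, time_col, airport_col, out_col):
    """Left-join AirportPerf onto df and store it under out_col."""
    # Rename the lookup columns so they line up with the flight columns.
    lookup = volhr.rename(columns={'TimeBlk': time_col,
                                   'Airport': airport_col,
                                   'AirportPerf': out_col})
    return df.merge(lookup, on=['FlightDate', time_col, airport_col],
                    how='left')


flights = add_perf(flights, 'DepTimeBlk', 'Origin', 'OrigPerf')
flights = add_perf(flights, 'ArrTimeBlk', 'Dest', 'DestPerf')
print(flights[['FlightDate', 'OrigPerf', 'DestPerf']])
```

The second toy flight has no volume record for its date, so both of its performance columns come back as `NaN` instead of stopping the run.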

In [167]:
aotp_nyc_data.head()

Unnamed: 0,level_0,index,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,UniqueCarrier,AirlineID,...,Unnamed: 109,ATemp,Precip,AWind,ATempDest,PrecipDest,AWindDest,BusPassRatio,OrigPerf,DestPerf
0,0,0,2015,1,1,1,4,2015-01-01,AA,19805,...,,32.54,0.0,17.2244,47.12,0,5.14496,0.038462,0.645833,0.757895
1,1,1,2015,1,1,2,5,2015-01-02,AA,19805,...,,39.38,0.0,13.6453,48.38,0,4.25019,0.038462,0.729167,0.789474
2,2,2,2015,1,1,3,6,2015-01-03,AA,19805,...,,36.5,0.940945,9.17145,50.18,0,5.14496,0.038462,0.645833,0.736842
3,3,3,2015,1,1,4,7,2015-01-04,AA,19805,...,,47.3,0.468504,10.0662,52.7,0,4.69757,0.038462,0.708333,0.757895
4,4,4,2015,1,1,5,1,2015-01-05,AA,19805,...,,42.98,0.0,19.6851,60.62,0,5.59235,0.038462,0.729167,0.810526


Now save the training and holdout dataframes (`aotp_nyc_data1516` and `aotp_nyc_data17`) for future use in hyperparameter tuning.

In [206]:
with open('aotp_nyc_data_all1516.pkl', 'wb') as pickle_main1:
    pickle.dump(aotp_nyc_data1516, pickle_main1)

In [207]:
with open('aotp_nyc_data_all17.pkl', 'wb') as pickle_main2:
    pickle.dump(aotp_nyc_data17, pickle_main2)