We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict which type of delay it will be.
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

# Machine Learning

This file contains instructions on how to approach the challenge.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor  # needed by the random forest cell below
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, GridSearchCV, RepeatedKFold
import pickle
import warnings
warnings.filterwarnings("ignore")

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it: in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the flight we are predicting.

Likewise, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
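The idea of using delays from earlier days without leaking the outcome of the flight being predicted can be sketched as follows. This is a minimal illustration on made-up toy data, not part of the original notebook. The feature name `origin_delay_lag7` is hypothetical, and the 7-day lag is an assumption chosen to match the one-week-ahead evaluation.

```python
import pandas as pd

# Toy data (made up for illustration): one origin, ten consecutive days.
flights = pd.DataFrame({
    "fl_date": pd.date_range("2019-01-01", periods=10, freq="D"),
    "origin": ["DEN"] * 10,
    "arr_delay": [5.0, 10.0, 0.0, -5.0, 20.0, 15.0, 0.0, 30.0, 10.0, 5.0],
})

# Mean arrival delay per origin and day.
daily = (
    flights.groupby(["origin", "fl_date"], as_index=False)["arr_delay"]
    .mean()
    .rename(columns={"arr_delay": "origin_delay_lag7"})  # hypothetical feature name
)

# Move each daily mean 7 days forward, so a flight only sees a value
# that was already known one week before its own date.
daily["fl_date"] = daily["fl_date"] + pd.Timedelta(days=7)

features = flights.merge(daily, on=["origin", "fl_date"], how="left")
print(features[["fl_date", "arr_delay", "origin_delay_lag7"]])
```

Flights in the first week have no history and end up with a missing value, which would need to be imputed or dropped before modeling.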

In [45]:
#loading data and selecting columns

message = pd.read_csv('./data/flights.csv')
message = message[(message['arr_delay'] > -120) & (message['arr_delay'] < 120)]
message.reset_index(inplace=True)
message = message[['arr_delay', 'origin_airport_id', 'dest_airport_id', 'origin', 'dest', 
                    'crs_dep_time', 'distance', 'crs_elapsed_time', 'fl_date', 
                    'mkt_carrier', 'tail_num', 'dep_delay', 'crs_arr_time']]

#message.to_csv('./data/model_features.csv',index=False)



Unnamed: 0,arr_delay,origin_airport_id,dest_airport_id,origin,dest,crs_dep_time,distance,crs_elapsed_time,fl_date,mkt_carrier,tail_num,dep_delay,crs_arr_time,total_flights_dest
0,16.0,11292,11057,DEN,CLT,1011,1337,195.0,2018-02-11,AA,N537UW,-4.0,1526,600
1,12.0,11298,12217,DFW,HSV,1322,603,107.0,2019-03-01,AA,N738SK,0.0,1509,30
2,1.0,14908,14107,SNA,PHX,810,338,70.0,2018-12-14,WN,N901WN,-3.0,1020,506
3,-15.0,13230,11433,MDT,DTW,1226,371,103.0,2019-12-27,DL,N8783E,-1.0,1409,480
4,-10.0,13061,12266,LRD,IAH,525,301,80.0,2019-01-22,UA,N11548,7.0,645,477


### Feature Engineering

Feature engineering will play a crucial role in these problems. We have only a few attributes, so we need to create features with some predictive power.

- weather: we can use some weather API to look for the weather in time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
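As one possible reading of the "time of the day" and "airport traffic" items above, the sketch below derives a scheduled departure hour and two traffic counts from schedule information alone. The data is made up, and the column names `origin_flights_day` and `origin_flights_hour` are hypothetical. Integer arithmetic on the HHMM value is used here as an alternative to parsing the time as a string.

```python
import pandas as pd

# Toy schedule (made up for illustration).
sched = pd.DataFrame({
    "fl_date": ["2019-01-01"] * 4 + ["2019-01-02"] * 2,
    "origin": ["DEN", "DEN", "DEN", "SNA", "DEN", "SNA"],
    "crs_dep_time": [600, 615, 1730, 810, 905, 2359],
})

# Scheduled departure hour from the HHMM integer; % 24 maps 2400 to hour 0.
sched["dep_hour"] = (sched["crs_dep_time"] // 100) % 24

# Hypothetical traffic features: scheduled departures per origin per day,
# and per origin per day per hour. Both are known before the flight leaves.
sched["origin_flights_day"] = (
    sched.groupby(["origin", "fl_date"])["origin"].transform("size")
)
sched["origin_flights_hour"] = (
    sched.groupby(["origin", "fl_date", "dep_hour"])["origin"].transform("size")
)
print(sched)
```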

In [86]:
#formatting features
message = pd.read_csv('./data/model_features.csv')
message['month'] = pd.DatetimeIndex(message['fl_date']).month
message['month_day'] = pd.DatetimeIndex(message['fl_date']).day
message['week_day'] = pd.DatetimeIndex(message['fl_date']).weekday
message['year_day'] = pd.DatetimeIndex(message['fl_date']).dayofyear
# scheduled times are HHMM values; anything that fails to parse becomes NaN
message['dep_hour'] = pd.to_datetime(message['crs_dep_time'], format='%H%M', errors='coerce').dt.hour
message['arr_hour'] = pd.to_datetime(message['crs_arr_time'], format='%H%M', errors='coerce').dt.hour

# unparseable times are set to hour 0
message['arr_hour'] = message['arr_hour'].fillna(0)
message['dep_hour'] = message['dep_hour'].fillna(0)

message

Unnamed: 0,arr_delay,total_flights_origin,total_flights_dest,distance,taxi_in,taxi_out,mkt_carrier,arr_hour,dep_hour,month,week_day
0,27.0,318,31,461,4.0,19.0,DL,12.0,10.0,1,0
1,5.0,400,3,1179,6.0,17.0,B6,3.0,23.0,1,0
3,64.0,156,486,621,6.0,9.0,WN,14.0,11.0,1,0
4,4.0,235,23,401,3.0,20.0,UA,18.0,17.0,1,0
5,19.0,426,59,986,3.0,9.0,WN,23.0,19.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
1732378,4.0,380,75,1524,5.0,10.0,WN,23.0,18.0,12,1
1732379,13.0,106,240,337,6.0,8.0,WN,23.0,21.0,12,1
1732380,11.0,380,62,345,18.0,12.0,WN,13.0,11.0,12,1
1732381,28.0,55,891,760,5.0,20.0,UA,8.0,7.0,12,1


### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA components?
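One way to answer the question above is to cross-validate the same model on the original features and on a reduced set of principal components. The sketch below uses synthetic data as a stand-in for the engineered flight features. The number of components (3) and the choice of linear regression are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered features (not the flight data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

original = make_pipeline(StandardScaler(), LinearRegression())
reduced = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())

# Same folds and metric for both pipelines, so the scores are comparable.
score_original = cross_val_score(original, X, y, cv=5, scoring="r2").mean()
score_reduced = cross_val_score(reduced, X, y, cv=5, scoring="r2").mean()
print(f"original features R^2: {score_original:.3f}")
print(f"3 PCA components R^2:  {score_reduced:.3f}")
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, so the validation folds do not influence the components.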

### Modeling

Try different ML techniques on each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

In [24]:
message = pd.read_csv('./data/model_features.csv')
message = message[['arr_delay', 'origin', 'month', 'week_day']]
message = message.reset_index(drop=True)

message

Unnamed: 0,arr_delay,origin,month,week_day
0,16.0,DEN,2,6
1,12.0,DFW,3,4
2,1.0,SNA,12,4
3,-15.0,MDT,12,4
4,-10.0,LRD,1,1
...,...,...,...,...
4811395,-22.0,BDL,3,5
4811396,-17.0,SEA,8,3
4811397,-18.0,MDW,1,1
4811398,-24.0,SYR,3,0


In [26]:
dummies = ['origin', 'month', 'week_day']    

for i in dummies:
    message = pd.concat([message, pd.get_dummies(message[i], prefix=i)], axis=1)
    message = message.drop([i], axis=1)

message

Unnamed: 0,arr_delay,origin_ABE,origin_ABI,origin_ABQ,origin_ABR,origin_ABY,origin_ACK,origin_ACT,origin_ACV,origin_ACY,...,month_10,month_11,month_12,week_day_0,week_day_1,week_day_2,week_day_3,week_day_4,week_day_5,week_day_6
0,16.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,12.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1.0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,-15.0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
4,-10.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4811395,-22.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4811396,-17.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4811397,-18.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4811398,-24.0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [27]:
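# features are every column except the target, which sits in the first column
X, y = message.iloc[:, 1:], message['arr_delay']

# random 70/30 split into training and test sets
train_ratio = 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True,
                                                    train_size=train_ratio,
                                                    random_state=42)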

print(f'{len(X_train)} training samples and {len(X_test)} test samples')



3367980 training samples and 1443420 test samples
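The split above shuffles flights from all dates together. Because the final evaluation is on future dates, a chronological split may give a more realistic estimate of performance. The sketch below is one such alternative on a made-up frame. The cutoff date is a hypothetical choice, not something the original notebook uses.

```python
import pandas as pd

# Toy frame (made up): one flight per day across 2018-2019.
df = pd.DataFrame({"fl_date": pd.date_range("2018-01-01", "2019-12-31", freq="D")})
df["arr_delay"] = 0.0  # placeholder target

# Hypothetical cutoff: train on everything before it, validate on the rest.
cutoff = pd.Timestamp("2019-10-01")
train = df[df["fl_date"] < cutoff]
valid = df[df["fl_date"] >= cutoff]

print(f"{len(train)} training rows up to {train['fl_date'].max().date()}")
print(f"{len(valid)} validation rows from {valid['fl_date'].min().date()}")
```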


In [6]:
# linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


In [75]:
#random forest model
model = RandomForestRegressor(n_estimators=30, max_depth=10, random_state=0)
model.fit(X_train, y_train)


In [28]:
#Decision Tree model
model = DecisionTreeRegressor(random_state=44)
model.fit(X_train, y_train)


In [33]:
#saving the model
filename = './model/DecisionTreeRegressor(month-origin-week_day).sav'
with open(filename, 'wb') as f:  # the file is closed automatically after writing
    pickle.dump(model, f)

In [68]:
# load the model from disk
filename = './model/DecisionTreeRegressor(month-origin-week_day).sav'
with open(filename, 'rb') as f:  # the file is closed automatically after reading
    model = pickle.load(f)


### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**

In [29]:
#calculate predictions from model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)


In [30]:
#calculate R^2
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f'Train R^2:\t{r2_train}\nTest R^2:\t{r2_test}')

Train R^2:	0.03297124787823835
Test R^2:	0.014873631693913203


In [31]:
#calculate model score (for regressors, score() returns R^2, so this matches the test R^2 above)
model.score(X_test, y_test)

0.014873631693913203

In [34]:
#calculate mean squared error
MSE_train = mean_squared_error(y_train, y_train_pred)
MSE_test = mean_squared_error(y_test, y_test_pred)

print(f'Train MSE:\t{MSE_train}\nTest MSE:\t{MSE_test}')


Train MSE:	598.2479327858555
Test MSE:	610.4971829627124


In [32]:
#calculate mean absolute error
MAE_train = mean_absolute_error(y_train, y_train_pred)
MAE_test = mean_absolute_error(y_test, y_test_pred)

print(f'Train MAE:\t{MAE_train}\nTest MAE:\t{MAE_test}')


Train MAE:	16.89432765950239
Test MAE:	17.06523087118386


In [79]:
#getting training data
test_temp = pd.read_csv('./data/model_features.csv')
test_temp = test_temp[['fl_date', 'mkt_carrier', 'tail_num', 'origin', 'dest', 'month','week_day']]

test_temp

Unnamed: 0,fl_date,mkt_carrier,tail_num,origin,dest,month,week_day
0,2018-02-11,AA,N537UW,DEN,CLT,2,6
1,2019-03-01,AA,N738SK,DFW,HSV,3,4
2,2018-12-14,WN,N901WN,SNA,PHX,12,4
3,2019-12-27,DL,N8783E,MDT,DTW,12,4
4,2019-01-22,UA,N11548,LRD,IAH,1,1
...,...,...,...,...,...,...,...
4811395,2019-03-30,B6,N584JB,BDL,RSW,3,5
4811396,2019-08-08,AA,N922US,SEA,PHL,8,3
4811397,2018-01-30,WN,N406WN,MDW,PIT,1,1
4811398,2019-03-18,DL,N300PQ,SYR,DTW,3,0


In [80]:
#getting new data
test = pd.read_csv('./data/model_test.csv')
test = test[['fl_date', 'mkt_carrier', 'tail_num', 'origin', 'dest', 'month','week_day']]
test

Unnamed: 0,fl_date,mkt_carrier,tail_num,origin,dest,month,week_day
0,2020-01-01 00:00:00,WN,N951WN,ONT,SFO,1,2
1,2020-01-01 00:00:00,WN,N467WN,ONT,SFO,1,2
2,2020-01-01 00:00:00,WN,N7885A,ONT,SJC,1,2
3,2020-01-01 00:00:00,WN,N551WN,ONT,SJC,1,2
4,2020-01-01 00:00:00,WN,N968WN,ONT,SJC,1,2
...,...,...,...,...,...,...,...
660551,2020-01-31 00:00:00,DL,N926XJ,DCA,CVG,1,4
660552,2020-01-31 00:00:00,DL,N309PQ,DCA,CVG,1,4
660553,2020-01-31 00:00:00,DL,N324PQ,JFK,BTV,1,4
660554,2020-01-31 00:00:00,DL,N132EV,ORD,JFK,1,4


In [88]:
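#concatenating data
test = pd.concat([test_temp, test], axis=0)
# drop origins RIW and SHR (presumably absent from the training data, so their dummy columns would be unknown to the model)
test = test[(test.origin != 'RIW') & (test.origin != 'SHR')]
test = test.reset_index(drop=True)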
test

Unnamed: 0,fl_date,mkt_carrier,tail_num,origin,dest,month,week_day,pred_delay
0,2018-02-11,AA,N537UW,DEN,CLT,2,6,
1,2019-03-01,AA,N738SK,DFW,HSV,3,4,
2,2018-12-14,WN,N901WN,SNA,PHX,12,4,
3,2019-12-27,DL,N8783E,MDT,DTW,12,4,
4,2019-01-22,UA,N11548,LRD,IAH,1,1,
...,...,...,...,...,...,...,...,...
5471863,2020-01-31 00:00:00,DL,N926XJ,DCA,CVG,1,4,0.889561
5471864,2020-01-31 00:00:00,DL,N309PQ,DCA,CVG,1,4,0.889561
5471865,2020-01-31 00:00:00,DL,N324PQ,JFK,BTV,1,4,-3.928709
5471866,2020-01-31 00:00:00,DL,N132EV,ORD,JFK,1,4,3.101110


In [82]:
#extracting features
test_use = test[['dest', 'origin', 'month', 'week_day']]

test_use

Unnamed: 0,dest,origin,month,week_day
0,CLT,DEN,2,6
1,HSV,DFW,3,4
2,PHX,SNA,12,4
3,DTW,MDT,12,4
4,IAH,LRD,1,1
...,...,...,...,...
5471863,CVG,DCA,1,4
5471864,CVG,DCA,1,4
5471865,BTV,JFK,1,4
5471866,JFK,ORD,1,4


In [83]:
#creating dummies
dummies = ['origin', 'month', 'week_day']    

for i in dummies:
    test_use = pd.concat([test_use, pd.get_dummies(test_use[i], prefix=i)], axis=1)
    test_use = test_use.drop([i], axis=1)
test_use = test_use.drop(['dest'], axis=1)

test_use

Unnamed: 0,origin_ABE,origin_ABI,origin_ABQ,origin_ABR,origin_ABY,origin_ACK,origin_ACT,origin_ACV,origin_ACY,origin_ADK,...,month_10,month_11,month_12,week_day_0,week_day_1,week_day_2,week_day_3,week_day_4,week_day_5,week_day_6
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5471863,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5471864,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5471865,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5471866,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
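The cells above stack the training rows on top of the new rows so that both produce the same dummy columns. An alternative, sketched below on made-up data, is to encode the new rows on their own and then reindex them to the training column layout. This is a hedged illustration rather than the notebook's method. Note that a row with an unseen origin is kept with all origin dummies set to 0 instead of being dropped.

```python
import pandas as pd

# Toy data (made up): the new data contains an origin unseen in training
# and lacks some categories that were seen.
train_raw = pd.DataFrame({"origin": ["DEN", "DFW", "SNA"], "month": [2, 3, 12]})
new_raw = pd.DataFrame({"origin": ["DEN", "RIW"], "month": [1, 1]})

cols = ["origin", "month"]
train_X = pd.get_dummies(train_raw, columns=cols)

# Encode the new rows on their own, then force the training column layout:
# missing dummies are filled with 0, dummies for unseen categories are dropped.
new_X = pd.get_dummies(new_raw, columns=cols).reindex(
    columns=train_X.columns, fill_value=0
)

print(list(new_X.columns))
```

This avoids loading and re-encoding the full training set every time new flights need to be scored, provided the training column list is saved alongside the model.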


In [84]:
#collecting predictions
y_pred = model.predict(test_use)

In [89]:
#extracting new data predictions
test['pred_delay'] = y_pred
test = test[test['fl_date'] >= '2020-01-01']
test = test.reset_index(drop=True)
test

Unnamed: 0,fl_date,mkt_carrier,tail_num,origin,dest,month,week_day,pred_delay
0,2020-01-01 00:00:00,WN,N951WN,ONT,SFO,1,2,-3.303279
1,2020-01-01 00:00:00,WN,N467WN,ONT,SFO,1,2,-3.303279
2,2020-01-01 00:00:00,WN,N7885A,ONT,SJC,1,2,-3.303279
3,2020-01-01 00:00:00,WN,N551WN,ONT,SJC,1,2,-3.303279
4,2020-01-01 00:00:00,WN,N968WN,ONT,SJC,1,2,-3.303279
...,...,...,...,...,...,...,...,...
660463,2020-01-31 00:00:00,DL,N926XJ,DCA,CVG,1,4,0.889561
660464,2020-01-31 00:00:00,DL,N309PQ,DCA,CVG,1,4,0.889561
660465,2020-01-31 00:00:00,DL,N324PQ,JFK,BTV,1,4,-3.928709
660466,2020-01-31 00:00:00,DL,N132EV,ORD,JFK,1,4,3.101110


In [90]:
#saving results
test.to_csv('./data/delay_prediction.csv',index=False)

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, exactly one of these variables should be 1 and the others 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
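The transformation described above can be sketched as follows. The rows are made up, and the lowercase column names are an assumption that follows the casing of the other columns in this notebook.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Toy rows (made up for illustration).
delays = pd.DataFrame(
    [[10.0, 0.0, 25.0, 0.0, 0.0],
     [0.0, 40.0, 0.0, 0.0, 5.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=delay_cols,
)

# Keep only flights with at least one positive delay component.
delayed = delays[(delays[delay_cols] > 0).any(axis=1)]

# The column with the most minutes becomes the class label
# (ties go to the first column in delay_cols).
delay_type = delayed[delay_cols].idxmax(axis=1)

# Optional one-hot form: exactly one 1 per delayed flight.
one_hot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)
print(delay_type)
print(one_hot)
```

Most classifiers accept the single label column `delay_type` directly, so the one-hot form is only needed if a model requires it.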

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: only a small fraction of all flights are cancelled. It is important to sample the data appropriately before training and to choose suitable evaluation metrics.
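A minimal sketch of handling the imbalance is shown below, using synthetic data with roughly 2% positives rather than the flight data. It uses a stratified split and class weighting as one alternative to resampling. Logistic regression is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (not the flight data): roughly 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=5000) > 2.3).astype(int)

# stratify=y keeps the positive/negative ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Accuracy is misleading here; look at per-class precision and recall.
print(classification_report(y_te, clf.predict(X_te), digits=3))
print("Average precision:",
      average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

With so few positives, a model that always predicts "not cancelled" would already reach very high accuracy, which is why precision, recall, and average precision are reported instead.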