In this notebook, we build a machine learning model that predicts potential flight claims caused by delays or cancellations.

In [18]:
# Import libraries
import pandas as pd, numpy as np, time
import lightgbm as lgb

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
def print_metrics(to_check_y, to_check_pred):
    # Score the predictions in two ways: confusion matrix and accuracy
    # Confusion matrix (rows: actual class, columns: predicted class)
    cm = confusion_matrix(to_check_y, to_check_pred)
    # Overall accuracy
    accuracy = accuracy_score(to_check_y, to_check_pred)

    print(cm)
    print("Accuracy:", accuracy)
    # Share of actual delay/cancel flights predicted correctly (recall of class 1)
    print("Delay/Cancel case accuracy:", cm[1][1] / sum(cm[1]))
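
The last figure printed by `print_metrics` is the recall of the delay/cancel class: of all flights that actually had a claim, the share the model flagged. The following is a minimal pure-Python sketch on made-up labels (the `toy_*` names and values are illustrative, not taken from the dataset) showing how that figure can diverge from overall accuracy:

```python
# Toy labels, invented for illustration: 8 normal flights, 2 delayed/cancelled
toy_actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Build a 2x2 confusion matrix: rows are actual classes, columns are predicted
toy_cm = [[0, 0], [0, 0]]
for actual, pred in zip(toy_actual, toy_pred):
    toy_cm[actual][pred] += 1

toy_accuracy = (toy_cm[0][0] + toy_cm[1][1]) / len(toy_actual)
toy_recall = toy_cm[1][1] / sum(toy_cm[1])

print(toy_cm)
print("Accuracy:", toy_accuracy)
print("Delay/Cancel case accuracy:", toy_recall)
```

Here nine of ten predictions are correct, yet only one of the two delayed flights is caught.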

# Naive approach

For the most straightforward approach, only the per-record features are used for prediction. These are a mixture of categorical and numerical values.

Strictly speaking, this is a regression problem: the goal is to predict the expected monetary claim for each flight. However, a previous study showed that is_claim takes only two values, so the model is first trained as a classifier, which also makes accuracy easier to measure.

In [33]:
# Load the dataset
data_df = pd.read_csv('../datasets/flight_delays_data.csv')

# Check data columns
data_df.columns

Index(['flight_id', 'flight_no', 'Week', 'Departure', 'Arrival', 'Airline',
       'std_hour', 'delay_time', 'flight_date', 'is_claim'],
      dtype='object')

We first modify the columns for model training:

In [34]:
# Transform categorical features to numerical labels such that the ML model can utilize those features properly
cat_feats = ['flight_no', 'Week', 'Departure','Arrival','Airline','std_hour']
for cat_col in cat_feats:
    data_df[cat_col] = data_df[cat_col].astype("category").cat.codes + 1
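
As a rough illustration of what this encoding does, assuming no missing values: pandas assigns each distinct value a code according to its position in sorted order, and the `+ 1` shifts the codes so they start at 1. The airline codes below are made up for the example.

```python
# Hypothetical raw values for one categorical column
raw_airline = ["CX", "KA", "CX", "UO", "KA"]

# Map each distinct value to its position in sorted order, then shift by 1
code_of = {value: i + 1 for i, value in enumerate(sorted(set(raw_airline)))}
encoded = [code_of[value] for value in raw_airline]

print(code_of)
print(encoded)
```

In pandas itself a missing value receives code -1, so after the shift it would be encoded as 0.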

In [35]:
# For target is_claim, since we know there are only two values, the plan is to transform it into boolean variable
# It is easier for measuring accuracy in this way. In the actual model, we can ignore this part and do regression on is_claim amount instead.
def set_is_claim(val):
    if val == 0:
        return 0
    else:
        return 1
data_df['is_claim_bool'] = data_df["is_claim"].apply(set_is_claim)

In [36]:
# Check transformed result
data_df.sample(10)

Unnamed: 0,flight_id,flight_no,Week,Departure,Arrival,Airline,std_hour,delay_time,flight_date,is_claim,is_claim_bool
349730,1088539,990,51,1,44,34,1,0.2,2013-12-21,0,0
246503,766660,1569,48,1,76,54,12,0.0,2013-11-30,0,0
72857,225815,487,45,1,126,28,16,0.3,2015-11-11,0,0
369759,1150540,601,52,1,142,28,11,0.3,2015-12-24,0,0
378187,1176719,879,7,1,44,28,17,0.1,2014-02-18,0,0
713672,2218003,365,47,1,56,26,9,0.1,2014-11-25,0,0
406855,1266082,1486,27,1,41,54,13,0.1,2015-07-02,0,0
892688,2773301,695,39,1,118,28,9,1.5,2015-09-29,0,0
562224,1749551,206,42,1,93,16,21,-0.1,2013-10-21,0,0
97219,301570,1083,51,1,21,40,18,0.6,2014-12-18,0,0


In [37]:
# Get x/y columns
x_all = data_df[cat_feats]
y_all = data_df['is_claim_bool']

# Split training/testing set
train, test, y_train, y_test = train_test_split(x_all, y_all,
                                                random_state=10, test_size=0.3)

Several machine learning models can tackle this problem. We use an ensemble method, specifically gradient-boosted decision trees (GBDT), because it is relatively easy to tune (in contrast to neural networks, which also require tuning the network structure), and the underlying decision trees make feature importance easier to interpret.

In [38]:
# For gradient boosting, LightGBM is used
# The parameters are fixed at this stage
# Note: no "objective" is set, so lgb.train falls back to its default (regression);
# the predictions are rounded to 0/1 in the cells below
d_train = lgb.Dataset(train, label=y_train)
params = {"max_depth": 50, "learning_rate" : 0.1, "num_leaves": 900,  "n_estimators": 300}

# Train model with categorical features
model = lgb.train(params, d_train, categorical_feature = cat_feats)



In [40]:
# Training set metrics
to_check_train, to_check_y = train, y_train
to_check_pred = model.predict(to_check_train).round()
print_metrics(to_check_y, to_check_pred)

[[600414   1604]
 [ 20202   7159]]
Accuracy: 0.965353149691998
Delay/Cancel case accuracy: 0.2616497935016995


In [41]:
# Testing set metrics
to_check_train, to_check_y = test, y_test
to_check_pred = model.predict(to_check_train).round()
print_metrics(to_check_y, to_check_pred)

[[256214   1469]
 [  9879   2173]]
Accuracy: 0.9579290785400486
Delay/Cancel case accuracy: 0.18030202456023897


Although the model reaches about 96% accuracy on the test set, the confusion matrix shows 9879 flights that should have been claimed but were misclassified as normal flights. The share of delay/cancel flights classified correctly is only 2173 / (9879 + 2173) ~= 18%.
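
To put the accuracy figure in context, the sketch below recomputes the metrics from the test-set confusion matrix printed above and compares them with a trivial baseline that predicts "no claim" for every flight. The baseline is an illustration added here, not part of the original analysis.

```python
# Test-set confusion matrix printed above (rows: actual, columns: predicted)
cm_test = [[256214, 1469],
           [9879, 2173]]

total = sum(sum(row) for row in cm_test)
accuracy = (cm_test[0][0] + cm_test[1][1]) / total
recall = cm_test[1][1] / sum(cm_test[1])

# Hypothetical baseline: always predict "no claim" (class 0)
baseline_accuracy = sum(cm_test[0]) / total

print("Model accuracy:     ", round(accuracy, 4))
print("Baseline accuracy:  ", round(baseline_accuracy, 4))
print("Delay/cancel recall:", round(recall, 4))
```

Because claims are rare, the baseline already scores about 95.5%, so accuracy alone says little about how well claims are detected.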

In [19]:
# Feature order: ['flight_no', 'Week', 'Departure', 'Arrival', 'Airline', 'std_hour']
print("Per split importance: ", model.feature_importance(importance_type='split'))
print("Per gain importance: ", model.feature_importance(importance_type='gain'))

Per split importance:  [83796 92416     0 25838 11937 55713]
Per gain importance:  [27308.00026919 15130.38073316     0.          7905.98834031
  1551.28059162  8093.69478381]
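
The raw gain values are easier to compare as shares of the total. The sketch below assumes the importance arrays follow the order of `cat_feats` (the columns the model was trained on) and uses the gain values above, rounded to two decimals.

```python
# Feature order is assumed to match cat_feats, the columns used for training
feature_names = ['flight_no', 'Week', 'Departure', 'Arrival', 'Airline', 'std_hour']
# Gain importances copied (rounded) from the output above
gain = [27308.00, 15130.38, 0.0, 7905.99, 1551.28, 8093.69]

total_gain = sum(gain)
gain_share = {name: g / total_gain for name, g in zip(feature_names, gain)}

# Print features from most to least important by gain
for name, share in sorted(gain_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {share:.1%}")
```

Under that ordering, flight_no accounts for a little under half of the total gain, while Departure is never used for a split. In the sampled rows above Departure is always 1, which would be consistent with the column being constant in this dataset.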


# Feature engineering approach

In [28]:
# Load the dataset
data_df = pd.read_csv('../datasets/flight_delays_data_transformed.csv')

# Check data columns
data_df.columns

Index(['Unnamed: 0', 'flight_id', 'flight_no', 'Week', 'Departure', 'Arrival',
       'Airline', 'std_hour', 'delay_time', 'flight_date', 'is_claim',
       'flight_date_dt', 'flight_dt', 'flight_year', 'flight_month',
       'flight_day', 'flight_2_hour_bin', 'flight_4_hour_bin', 'flight_ts',
       'flight_2_hour_ts', 'flight_4_hour_ts', 'flight_day_ts', 'flight_wk_ts',
       'dep_hr_delay_time', 'arr_hr_delay_time', 'air_hr_delay_time',
       'arr_air_hr_delay_time', 'dep_2_hr_delay_time', 'arr_2_hr_delay_time',
       'air_2_hr_delay_time', 'arr_air_2_hr_delay_time', 'dep_4_hr_delay_time',
       'arr_4_hr_delay_time', 'air_4_hr_delay_time', 'arr_air_4_hr_delay_time',
       'dep_day_delay_time', 'arr_day_delay_time', 'air_day_delay_time',
       'arr_air_day_delay_time', 'dep_wk_delay_time', 'arr_wk_delay_time',
       'air_wk_delay_time', 'arr_air_wk_delay_time', 'dep_hr_delay_count',
       'arr_hr_delay_count', 'air_hr_delay_count', 'arr_air_hr_delay_count',
       'dep_2_hr_

In [29]:
# Feature list for training: the original per-record columns plus the engineered delay/cancel aggregates
# The categorical columns are then transformed to numerical labels, as in the naive approach
all_feats = ['flight_no', 'Week', 'Departure','Arrival','Airline','std_hour', 
               'dep_hr_delay_time', 'arr_hr_delay_time', 'air_hr_delay_time',
               'arr_air_hr_delay_time', 'dep_2_hr_delay_time', 'arr_2_hr_delay_time',
               'air_2_hr_delay_time', 'arr_air_2_hr_delay_time', 'dep_4_hr_delay_time',
               'arr_4_hr_delay_time', 'air_4_hr_delay_time', 'arr_air_4_hr_delay_time',
               'dep_day_delay_time', 'arr_day_delay_time', 'air_day_delay_time',
               'arr_air_day_delay_time', 'dep_wk_delay_time', 'arr_wk_delay_time',
               'air_wk_delay_time', 'arr_air_wk_delay_time', 'dep_hr_delay_count',
               'arr_hr_delay_count', 'air_hr_delay_count', 'arr_air_hr_delay_count',
               'dep_2_hr_delay_count', 'arr_2_hr_delay_count', 'air_2_hr_delay_count',
               'arr_air_2_hr_delay_count', 'dep_4_hr_delay_count',
               'arr_4_hr_delay_count', 'air_4_hr_delay_count',
               'arr_air_4_hr_delay_count', 'dep_hr_cancel_count',
               'arr_hr_cancel_count', 'air_hr_cancel_count', 'arr_air_hr_cancel_count',
               'dep_2_hr_cancel_count', 'arr_2_hr_cancel_count',
               'air_2_hr_cancel_count', 'arr_air_2_hr_cancel_count',
               'dep_4_hr_cancel_count', 'arr_4_hr_cancel_count',
               'air_4_hr_cancel_count', 'arr_air_4_hr_cancel_count'
            ]
cat_feats = ['flight_no', 'Week', 'Departure','Arrival','Airline','std_hour']
for cat_col in cat_feats:
    data_df[cat_col] = data_df[cat_col].astype("category").cat.codes + 1

In [30]:
# For target is_claim, since we know there are only two values, the plan is to transform it into boolean variable
# It is easier for measuring accuracy in this way. In the actual model, we can ignore this part and do regression on is_claim amount instead.
def set_is_claim(val):
    if val == 0:
        return 0
    else:
        return 1
data_df['is_claim_bool'] = data_df["is_claim"].apply(set_is_claim)

In [31]:
# Check transformed result
data_df.sample(10)

Unnamed: 0.1,Unnamed: 0,flight_id,flight_no,Week,Departure,Arrival,Airline,std_hour,delay_time,flight_date,...,arr_air_hr_cancel_count,dep_2_hr_cancel_count,arr_2_hr_cancel_count,air_2_hr_cancel_count,arr_air_2_hr_cancel_count,dep_4_hr_cancel_count,arr_4_hr_cancel_count,air_4_hr_cancel_count,arr_air_4_hr_cancel_count,is_claim_bool
529413,529413,1647469,473,52,1,3,28,16,0.0,2013-12-31,...,0.0,0.0,0.0,0.0,0.0,8.0,0.0,1.0,0.0,0
290408,290408,903201,2011,52,1,126,94,17,-0.1,2013-12-27,...,0.0,3.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0
487915,487915,1517836,1576,18,1,118,54,11,-0.1,2014-05-06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
577321,577321,1796323,2097,21,1,129,104,21,0.4,2016-05-25,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
618782,618782,1924410,1411,42,1,143,54,13,1.1,2014-10-21,...,0.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0
828875,828875,2574873,1410,13,1,112,54,13,0.1,2015-03-31,...,0.0,5.0,0.0,1.0,0.0,7.0,0.0,2.0,0.0,0
617852,617852,1921634,1169,47,1,127,47,12,0.9,2015-11-22,...,0.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
745680,745680,2316908,426,12,1,142,27,14,0.0,2014-03-25,...,0.0,5.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0
877053,877053,2724776,1189,12,1,21,47,10,0.5,2016-03-20,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
287207,287207,893220,756,38,1,113,28,9,-0.1,2013-09-17,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [46]:
# Get x/y columns
x_all = data_df
y_all = data_df['is_claim_bool']

# Split training/testing set
train, test, y_train, y_test = train_test_split(x_all, y_all,
                                                random_state=10, test_size=0.1)
train = train[all_feats]
test_all = test.copy()
test = test[all_feats]

In [47]:
# For gradient boosting, LightGBM is used
# The parameters are fixed at this stage
# Note: no "objective" is set, so lgb.train falls back to its default (regression);
# the predictions are rounded to 0/1 in the cells below
d_train = lgb.Dataset(train, label=y_train)
params = {
    "max_depth": 50,
    "learning_rate" : 0.1, 
    #"num_leaves": 1200,
    "num_leaves": 900, 
    #"num_leaves": 300,
    #"num_leaves": 3000,
    "n_estimators": 300
}

# Train model with categorical features
model = lgb.train(params, d_train, categorical_feature = cat_feats)



In [48]:
# Training set metrics
to_check_train, to_check_y = train, y_train
to_check_pred = model.predict(to_check_train).round()
print_metrics(to_check_y, to_check_pred)

[[773756     23]
 [  4451  30972]]
Accuracy: 0.994471096215778
Delay/Cancel case accuracy: 0.8743471755638992


In [49]:
# Testing set metrics
to_check_train, to_check_y = test, y_test
to_check_pred = abs(model.predict(to_check_train).round())
print_metrics(to_check_y, to_check_pred)

[[85675   247]
 [ 2233  1757]]
Accuracy: 0.9724174748643117
Delay/Cancel case accuracy: 0.44035087719298244


In [50]:
pd.DataFrame({'feature': all_feats, 
              'Importance by split': model.feature_importance(importance_type='split'), 
              'Importance by gain': model.feature_importance(importance_type='gain')}).sort_values('Importance by gain', ascending=False)

Unnamed: 0,Importance by gain,Importance by split,feature
0,28032.594174,19632,flight_no
1,12009.404098,19589,Week
25,7333.989736,8641,arr_air_wk_delay_time
11,7175.389461,7956,arr_2_hr_delay_time
6,7048.610924,12056,dep_hr_delay_time
21,6770.806631,11742,arr_air_day_delay_time
3,5404.483432,4002,Arrival
7,5156.194396,7089,arr_hr_delay_time
19,5099.327376,13005,arr_day_delay_time
14,4334.059138,10700,dep_4_hr_delay_time


In [51]:
pd.DataFrame({'feature': all_feats, 
              'Importance by split': model.feature_importance(importance_type='split'), 
              'Importance by gain': model.feature_importance(importance_type='gain')}).sort_values('Importance by split', ascending=False)

Unnamed: 0,Importance by gain,Importance by split,feature
0,28032.594174,19632,flight_no
1,12009.404098,19589,Week
19,5099.327376,13005,arr_day_delay_time
6,7048.610924,12056,dep_hr_delay_time
21,6770.806631,11742,arr_air_day_delay_time
14,4334.059138,10700,dep_4_hr_delay_time
18,3624.293635,10623,dep_day_delay_time
10,4039.761297,10164,dep_2_hr_delay_time
20,2667.135057,9805,air_day_delay_time
15,4046.955368,9279,arr_4_hr_delay_time


In [13]:
# Try again with 10 random 70/30 splits (repeated hold-out rather than true 10-fold cross-validation)
for i in range(10):
    # Select the model features explicitly so no other column reaches the model
    train, test, y_train, y_test = train_test_split(x_all[all_feats], y_all, test_size=0.3)
    
    d_train = lgb.Dataset(train, label=y_train)
    params = {"max_depth": 50, "learning_rate" : 0.1, "num_leaves": 900,  "n_estimators": 300}
    model = lgb.train(params, d_train, categorical_feature = cat_feats)
    
    print("Round ", i, "---------------------")
    # Testing set metrics
    to_check_train, to_check_y = test, y_test
    to_check_pred = model.predict(to_check_train).round()
    print_metrics(to_check_y, to_check_pred)



Round  0 ---------------------
[[256985    863]
 [  6956   4931]]
Accuracy: 0.9710122898400282
Delay/Cancel case accuracy: 0.41482291579035924
Round  1 ---------------------
[[256918    869]
 [  6961   4987]]
Accuracy: 0.9709715090737205
Delay/Cancel case accuracy: 0.4173920321392702
Round  2 ---------------------
[[256894    883]
 [  6854   5104]]
Accuracy: 0.9713162919161399
Delay/Cancel case accuracy: 0.42682722863355077
Round  3 ---------------------
[[257021    879]
 [  6917   4918]]
Accuracy: 0.9710975587150351
Delay/Cancel case accuracy: 0.4155471060414026
Round  4 ---------------------
[[257009    898]
 [  6864   4964]]
Accuracy: 0.9712236083563498
Delay/Cancel case accuracy: 0.41968211024687185
Round  5 ---------------------
[[256916    909]
 [  6962   4948]]
Accuracy: 0.9708195080356646
Delay/Cancel case accuracy: 0.41544920235096555
Round  6 ---------------------
[[257206    932]
 [  6674   4923]]
Accuracy: 0.9718019537694403
Delay/Cancel case accuracy: 0.4245063378459947
Ro
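
To summarize the repeated runs, the sketch below averages the delay/cancel recall over rounds 0 to 6, the only rounds visible in the output above (the remaining rounds are cut off). It uses only the Python standard library.

```python
from statistics import mean, stdev

# Delay/cancel recall for rounds 0-6, copied from the output above
recalls = [0.41482291579035924, 0.4173920321392702, 0.42682722863355077,
           0.4155471060414026, 0.41968211024687185, 0.41544920235096555,
           0.4245063378459947]

recall_mean = mean(recalls)
recall_std = stdev(recalls)  # sample standard deviation across splits

print("Mean recall:", round(recall_mean, 4))
print("Std dev:    ", round(recall_std, 4))
```

Across the visible rounds the recall stays between roughly 0.41 and 0.43, so the result does not depend much on the particular random split.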

In [47]:
test_all.columns

Index(['Unnamed: 0', 'flight_id', 'flight_no', 'Week', 'Departure', 'Arrival',
       'Airline', 'std_hour', 'delay_time', 'flight_date', 'is_claim',
       'flight_date_dt', 'flight_dt', 'flight_year', 'flight_month',
       'flight_day', 'flight_2_hour_bin', 'flight_4_hour_bin', 'flight_ts',
       'flight_2_hour_ts', 'flight_4_hour_ts', 'flight_day_ts', 'flight_wk_ts',
       'dep_hr_delay_time', 'arr_hr_delay_time', 'air_hr_delay_time',
       'dep_2_hr_delay_time', 'arr_2_hr_delay_time', 'air_2_hr_delay_time',
       'dep_4_hr_delay_time', 'arr_4_hr_delay_time', 'air_4_hr_delay_time',
       'dep_day_delay_time', 'arr_day_delay_time', 'air_day_delay_time',
       'dep_wk_delay_time', 'arr_wk_delay_time', 'air_wk_delay_time',
       'dep_hr_delay_count', 'arr_hr_delay_count', 'air_hr_delay_count',
       'dep_2_hr_delay_count', 'arr_2_hr_delay_count', 'air_2_hr_delay_count',
       'dep_4_hr_delay_count', 'arr_4_hr_delay_count', 'air_4_hr_delay_count',
       'dep_hr_cancel_count',

In [48]:
test_all['is_claim_actual'] = to_check_y
test_all['is_claim_pred'] = to_check_pred

In [49]:
test_all.to_csv('../ml_model_test_result.csv')

In [37]:
# Get whole set result (training rows are included, so these scores are optimistic)
y_all_pred = model.predict(x_all[all_feats]).round()

In [38]:
# Score the whole-set predictions in two ways
# Confusion matrix
cm = confusion_matrix(y_all, y_all_pred)

# Accuracy
accuracy = accuracy_score(y_all, y_all_pred)

print(cm)
print(accuracy)

# Earlier results with other feature sets, kept for comparison:
# With cancel count only:
# [[859411    290]
#  [  8524  30889]]
# 0.990197016174
#
# With cancel count + delay count:
# [[859405    296]
#  [  8621  30792]]
# 0.990082458954

[[858571   1130]
 [  9317  30096]]
0.9883807837493355

In [39]:
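# Caution: these are not the standard precision/recall of the delay/cancel class.
# cm[0][0] / sum(cm[0]) is the true-negative rate, and the second ratio is TN / (TN + TP),
# so the value below is not a conventional F1 score.
precision = cm[0][0] / sum(cm[0])
recall = cm[0][0] / (cm[0][0] + cm[1][1])
2 * (precision * recall) / (precision + recall) # harmonic mean of the two ratios above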

0.9821399156241707
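
For reference, the standard precision, recall and F1 score of the delay/cancel class can be derived from the same whole-set confusion matrix. The following is a minimal sketch using the usual definitions; the variable names are chosen here for illustration, and the figures include training rows, so they are optimistic.

```python
# Whole-set confusion matrix printed above (rows: actual, columns: predicted)
cm_all = [[858571, 1130],
          [9317, 30096]]

tn, fp = cm_all[0]
fn, tp = cm_all[1]

claim_precision = tp / (tp + fp)  # of flights flagged as claims, share that were claims
claim_recall = tp / (tp + fn)     # of actual claims, share that were flagged
claim_f1 = 2 * claim_precision * claim_recall / (claim_precision + claim_recall)

print("Precision:", round(claim_precision, 4))
print("Recall:   ", round(claim_recall, 4))
print("F1:       ", round(claim_f1, 4))
```

Computed this way, the F1 score of the delay/cancel class is about 0.85, with the recall matching the 0.7636 figure reported in the next cell.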

In [40]:
# Recall of the delay/cancel class on the whole set
# With cancel count only: 0.7837261817166925
# With cancel count + delay count: 0.78126506482632629
cm[1][1] / sum(cm[1])

0.7636059168294725