In this notebook, we will try to build machine learning model that provides prediction for potential flight claims by delay/cancel.

In [1]:
# Import libraries
import pandas as pd, numpy as np, time
import lightgbm as lgb

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Naive approach

For the most straightforward approach, only the per-record features are used for the prediction. In this case, it consists of a mixture of categorical and numerical values.

The prediction is supposed to be a regression problem, as to predict expected monetary claim of all flights. Previous study shows that there are only two is_claim values, thus the model training would be treated as classification problem first (making accuracy measurement easier).

In [116]:
# Load the dataset
data_df = pd.read_csv('../datasets/flight_delays_data.csv')

# Check data columns
data_df.columns

Index(['flight_id', 'flight_no', 'Week', 'Departure', 'Arrival', 'Airline',
       'std_hour', 'delay_time', 'flight_date', 'is_claim'],
      dtype='object')

We first modify the columns for model training:

In [117]:
# Transform categorical features to numerical labels such that the ML model can utilize those features properly
cat_feats = ['flight_no', 'Week', 'Departure','Arrival','Airline','std_hour']
for cat_col in cat_feats:
    data_df[cat_col] = data_df[cat_col].astype("category").cat.codes + 1

In [118]:
# For target is_claim, since we know there are only two values, the plan is to transform it into boolean variable
# It is easier for measuring accuracy in this way. In the actual model, we can ignore this part and do regression on is_claim amount instead.
def set_is_claim(val):
    if val == 0:
        return 0
    else:
        return 1
data_df['is_claim_bool'] = data_df["is_claim"].apply(set_is_claim)

In [119]:
# Check transformed result
data_df.sample(10)

Unnamed: 0,flight_id,flight_no,Week,Departure,Arrival,Airline,std_hour,delay_time,flight_date,is_claim,is_claim_bool
55396,171544,CX418,23,1,62,28,15,0.1,2014-06-08,0,0
83262,257911,CX412,8,1,62,28,1,0.0,2015-02-23,0,0
658483,2046981,CA108,21,1,112,26,11,0.0,2014-05-25,0,0
600359,1867275,CX472,39,1,142,28,16,0.7,2015-09-29,0,0
719645,2236392,HX1892,41,1,142,47,10,0.1,2014-10-11,0,0
250180,777909,KA5494,23,1,142,54,11,0.3,2016-06-08,0,0
405789,1262636,B65301,16,1,129,18,9,0.7,2016-04-16,0,0
132868,412117,MH73,28,1,81,68,15,0.0,2015-07-12,0,0
428202,1331853,KA5564,17,1,142,54,14,0.4,2015-04-29,0,0
729515,2266900,CX532,37,1,103,28,17,0.5,2015-09-15,0,0


In [120]:
# Get x/y columns
x_all = data_df[cat_feats]
y_all = data_df['is_claim_bool']

# Split training/testing set
train, test, y_train, y_test = train_test_split(x_all, y_all,
                                                random_state=10, test_size=0.1)

There are several machine learning models that can tackle this problem. It is decided to use ensemble method (i.e. GBDT in particular) as it is relatively easier to tune for (in contrast to something like NN models that requires tuning network structure), and the underlying decision tree makes it easier to interpret for feature importance.

In [121]:
# For gradient boosting, lightGBM is used
# The parameters are fixed in this stage
lg = lgb.LGBMClassifier(silent=False)
d_train = lgb.Dataset(train, label=y_train)
params = {"max_depth": 50, "learning_rate" : 0.1, "num_leaves": 900,  "n_estimators": 300}

# Train model with catgeorical features
model = lgb.train(params, d_train, categorical_feature = cat_feats)



In [122]:
# Get testing set result
y_pred = model.predict(test).round()

In [123]:
# We check the score of testing set result in two ways
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Accuracy
accuracy=accuracy_score(y_pred,y_test)

In [124]:
cm

array([[85623,   299],
       [ 3437,   553]], dtype=int64)

In [125]:
accuracy

0.95844826052139875

This shows a classic pitfall for a prediction model. Although the model shows a 95% accuracy, the confusion matrix shows that there are 3274 cases of actual needed-to-be-claimed flights, but mis-classified as normal flights. The correct classified ratio of delay/cancel flights in this case would be 553 / (3274 + 553) ~= 14%.

In [126]:
# Get whole set result
y_all_pred = model.predict(x_all).round()

In [127]:
# We check the score of testing set result in two ways
# Confusion matrix
cm = confusion_matrix(y_all, y_all_pred)

# Accuracy
accuracy=accuracy_score(y_all_pred,y_all)

print(cm)
print(accuracy)

[[857805   1896]
 [ 32473   6940]]
0.961774591431


In [131]:
# ['Week', 'Departure','Arrival','Airline','std_hour']
print("Per split importance: ", model.feature_importance(importance_type='split'))
print("Per gain importance: ", model.feature_importance(importance_type='gain'))

Per split importance:  [95997     0 64434 47466 61803]
Per gain importance:  [ 15049.09899168      0.          15557.58651388   8390.33865381
  11378.6986177 ]


# Feature engineering approach

In [28]:
# Load the dataset
data_df = pd.read_csv('../datasets/flight_delays_data_transformed.csv')

# Check data size
data_df.columns

Index(['Unnamed: 0', 'flight_id', 'flight_no', 'Week', 'Departure', 'Arrival',
       'Airline', 'std_hour', 'delay_time', 'flight_date', 'is_claim',
       'flight_date_dt', 'flight_dt', 'flight_year', 'flight_month',
       'flight_day', 'flight_hour_bin', 'flight_ts', 'flight_day_ts',
       'flight_wk_ts', 'dep_hr_delay_time', 'arr_hr_delay_time',
       'air_hr_delay_time', 'dep_day_delay_time', 'arr_day_delay_time',
       'air_day_delay_time', 'dep_wk_delay_time', 'arr_wk_delay_time',
       'air_wk_delay_time', 'dep_hr_delay_count', 'arr_hr_delay_count',
       'air_hr_delay_count', 'dep_hr_cancel_count', 'arr_hr_cancel_count',
       'air_hr_cancel_count'],
      dtype='object')

In [74]:
# Transform categorical features to numerical labels such that the ML model can utilize those features properly
all_feats = ['flight_no', 'Week', 'Departure','Arrival','Airline','std_hour', 
             'dep_hr_delay_time', 'arr_hr_delay_time', 'air_hr_delay_time', 
             'dep_day_delay_time', 'arr_day_delay_time', 'air_day_delay_time', 
             'dep_wk_delay_time', 'arr_wk_delay_time', 'air_wk_delay_time', 
             #'dep_hr_delay_count', 'arr_hr_delay_count', 'air_hr_delay_count', 
             'dep_hr_cancel_count', 'arr_hr_cancel_count', 'air_hr_cancel_count'
            ]
cat_feats = ['flight_no', 'Week', 'Departure','Arrival','Airline','std_hour']
for cat_col in cat_feats:
    data_df[cat_col] = data_df[cat_col].astype("category").cat.codes + 1

In [75]:
# For target is_claim, since we know there are only two values, the plan is to transform it into boolean variable
# It is easier for measuring accuracy in this way. In the actual model, we can ignore this part and do regression on is_claim amount instead.
def set_is_claim(val):
    if val == 0:
        return 0
    else:
        return 1
data_df['is_claim_bool'] = data_df["is_claim"].apply(set_is_claim)

In [76]:
# Check transformed result
data_df.sample(10)

Unnamed: 0.1,Unnamed: 0,flight_id,flight_no,Week,Departure,Arrival,Airline,std_hour,delay_time,flight_date,...,dep_wk_delay_time,arr_wk_delay_time,air_wk_delay_time,dep_hr_delay_count,arr_hr_delay_count,air_hr_delay_count,dep_hr_cancel_count,arr_hr_cancel_count,air_hr_cancel_count,is_claim_bool
396787,396787,1234700,1947,23,1,95,87,14,0.6,2014-06-06,...,0.277003,0.337963,0.247619,51.0,,,2.0,,,0
898398,898398,2791285,1100,40,1,8,45,19,0.4,2015-10-01,...,0.669462,0.473469,0.810638,57.0,,,1.0,,,0
888118,888118,2759031,942,52,1,112,30,14,-0.1,2013-12-31,...,0.308407,0.334571,0.233333,44.0,6.0,1.0,,,,0
714589,714589,2220934,5,41,1,129,3,21,0.0,2013-10-12,...,0.230985,0.142657,0.22381,47.0,2.0,,1.0,,,0
575617,575617,1791131,288,10,1,16,22,15,0.0,2015-03-09,...,0.308513,-0.007143,-0.014286,43.0,,,,,,0
504099,504099,1568294,1744,6,1,10,69,17,0.6,2016-02-05,...,0.855728,0.588571,0.725926,66.0,,1.0,,,,0
591136,591136,1838690,1497,19,1,72,55,12,1.0,2016-05-10,...,0.567523,0.452459,0.569295,43.0,1.0,5.0,3.0,,,0
265388,265388,825385,1184,45,1,142,48,22,0.7,2014-11-09,...,0.18643,0.208031,0.271986,40.0,2.0,,,,,0
456041,456041,1418759,415,45,1,142,28,11,1.6,2014-11-07,...,0.18643,0.208031,0.64,37.0,7.0,,,,,0
866517,866517,2692353,1371,42,1,74,51,11,0.1,2014-10-15,...,0.179385,0.235338,0.092611,46.0,1.0,3.0,3.0,,,0


In [77]:
# Get x/y columns
x_all = data_df[all_feats]
y_all = data_df['is_claim_bool']

# Split training/testing set
train, test, y_train, y_test = train_test_split(x_all, y_all,
                                                random_state=10, test_size=0.1)

In [78]:
# For gradient boosting, lightGBM is used
# The parameters are fixed in this stage
lg = lgb.LGBMClassifier(silent=False)
d_train = lgb.Dataset(train, label=y_train)
params = {"max_depth": 50, "learning_rate" : 0.1, "num_leaves": 900,  "n_estimators": 300}

# Train model with catgeorical features
model = lgb.train(params, d_train, categorical_feature = cat_feats)



In [79]:
# Get testing set result
y_pred = model.predict(test).round()

In [80]:
# We check the score of testing set result in two ways
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Accuracy
accuracy=accuracy_score(y_pred,y_test)

In [81]:
cm

array([[85658,   264],
       [ 2346,  1644]], dtype=int64)

In [82]:
accuracy

0.97097161669187648

In [83]:
# Get whole set result
y_all_pred = model.predict(x_all).round()

In [84]:
# We check the score of testing set result in two ways
# Confusion matrix
cm = confusion_matrix(y_all, y_all_pred)

# Accuracy
accuracy=accuracy_score(y_all_pred,y_all)

print(cm)
print(accuracy)

"""
# With cancel count only:
[[859411    290]
 [  8524  30889]]
0.990197016174

# With cancel count + delay count:
[[859405    296]
 [  8621  30792]]
0.990082458954
"""

[[859405    296]
 [  8621  30792]]
0.990082458954


'\n# With cancel count only:\n[[859411    290]\n [  8524  30889]]\n0.990197016174\n'

In [85]:
precision = cm[0][0] / sum(cm[0])
recall = cm[0][0] / (cm[0][0] + cm[1][1])
2 * (precision * recall) / (precision + recall) # F1 score

0.98223439309034022

In [86]:
# With cancel count only: 0.7837261817166925
# With cancel count + delay count: 0.78126506482632629
cm[1][1] / sum(cm[1])

0.78126506482632629

In [73]:
pd.DataFrame({'feature': all_feats, 
              'Importance by split': model.feature_importance(importance_type='split'), 
              'Importance by gain': model.feature_importance(importance_type='gain')}).sort_values('Importance by split', ascending=False)

Unnamed: 0,Importance by gain,Importance by split,feature
10,12994.3509,32043,arr_day_delay_time
6,14163.500926,29855,dep_hr_delay_time
9,8572.25058,24219,dep_day_delay_time
11,7108.787043,23539,air_day_delay_time
8,7231.753137,22146,air_hr_delay_time
1,14309.766921,21764,Week
0,28470.661889,21183,flight_no
13,11007.559638,20655,arr_wk_delay_time
7,12311.636142,18592,arr_hr_delay_time
14,6256.941488,15592,air_wk_delay_time


In [159]:
# Try again with 10-fold
for i in range(10):
    train, test, y_train, y_test = train_test_split(x_all, y_all, test_size=0.1)
    
    lg = lgb.LGBMClassifier(silent=False)
    d_train = lgb.Dataset(train, label=y_train)
    params = {"max_depth": 50, "learning_rate" : 0.1, "num_leaves": 900,  "n_estimators": 300}
    model = lgb.train(params, d_train, categorical_feature = cat_feats)
    
    y_pred = model.predict(test).round()
    print("Round ", i, "---------------------")
    print("Test:")
    cm = confusion_matrix(y_test, y_pred)
    accuracy=accuracy_score(y_pred,y_test)
    print(cm)
    print(accuracy)
    
    y_all_pred = model.predict(x_all).round()
    print("Whole:")
    cm = confusion_matrix(y_all, y_all_pred)
    accuracy=accuracy_score(y_all_pred,y_all)
    print(cm)
    print(accuracy)



Round  0 ---------------------
Test:
[[85873   159]
 [  718  3162]]
0.990246018329
Whole:
[[859532    169]
 [  1956  37457]]
0.997636562216
Round  1 ---------------------
Test:
[[85882   144]
 [  742  3144]]
0.990145920456
Whole:
[[859549    152]
 [  2033  37380]]
0.997569829855
Round  2 ---------------------
Test:
[[85808   126]
 [  751  3227]]
0.990246018329
Whole:
[[859567    134]
 [  2031  37382]]
0.997592073975
Round  3 ---------------------
Test:
[[85909   156]
 [  708  3139]]
0.990390604146
Whole:
[[859537    164]
 [  1999  37414]]
0.997594298387
Round  4 ---------------------
Test:
[[85890   120]
 [  798  3104]]
0.989790016905
Whole:
[[859566    135]
 [  2065  37348]]
0.997553146764
Round  5 ---------------------
Test:
[[85842   134]
 [  791  3145]]
0.989712163004
Whole:
[[859557    144]
 [  2075  37338]]
0.99753201485
Round  6 ---------------------
Test:
[[85879   149]
 [  778  3106]]
0.989689919032
Whole:
[[859545    156]
 [  2043  37370]]
0.99755425897
Round  7 -------------

In [71]:
abnormal_df = data_df[data_df['is_claim_bool'] == 1]
abnormal_df.shape

(39413, 11)

In [58]:
x_all_abnormal = abnormal_df[cat_feats]
y_all_abnormal = abnormal_df['is_claim_bool']

In [59]:
y_pred_abnormal = model.predict(x_all_abnormal).round()

#Confusion matrix
from sklearn.metrics import confusion_matrix
abnormal_cm = confusion_matrix(y_all_abnormal, y_pred_abnormal)

#Accuracy
from sklearn.metrics import accuracy_score
abnormal_accuracy=accuracy_score(y_pred_abnormal,y_all_abnormal)

In [60]:
abnormal_cm

array([[    0,     0],
       [30178,  9235]], dtype=int64)

In [61]:
abnormal_accuracy

0.23431355136630047

In [68]:
abnormal_accuracy * len(y_pred_abnormal) * 800

7388000.0

In [66]:
y_pred_abnormal

array([ 0.,  0.,  0., ...,  1.,  0.,  0.])

In [32]:
import catboost as cb
from sklearn.model_selection import GridSearchCV

cat_features_index = [0,2,3,4,5]

"""
def auc(m, train, test): 
    return (metrics.roc_auc_score(y_train,m.predict_proba(train)[:,1]),
                            metrics.roc_auc_score(y_test,m.predict_proba(test)[:,1]))

params = {'depth': [4, 7, 10],
          'learning_rate' : [0.03, 0.1, 0.15],
         'l2_leaf_reg': [1,4,9],
         'iterations': [300]}
cb = cb.CatBoostClassifier()
cb_model = GridSearchCV(cb, params, scoring="accuracy", cv = 3)
cb_model.fit(train, y_train)
"""

def show_train_test_accuracy(m, train, test, y_train, y_test):
    

clf = cb.CatBoostClassifier(eval_metric="AUC",one_hot_max_size=31, \
                            depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(train,y_train, cat_features=cat_features_index)
show_train_test_accuracy(clf, train, test)

"""
With Categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC", depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(train,y_train)
auc(clf, train, test)

With Categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC",one_hot_max_size=31, \
                            depth=10, iterations= 500, l2_leaf_reg= 9, learning_rate= 0.15)
clf.fit(train,y_train, cat_features=cat_features_index)
auc(clf, train, test)
"""

0:	total: 891ms	remaining: 7m 24s
1:	total: 1.76s	remaining: 7m 18s
2:	total: 2.62s	remaining: 7m 14s
3:	total: 3.44s	remaining: 7m 6s
4:	total: 4.22s	remaining: 6m 58s
5:	total: 4.91s	remaining: 6m 44s
6:	total: 5.68s	remaining: 6m 39s
7:	total: 6.41s	remaining: 6m 34s
8:	total: 7.04s	remaining: 6m 24s
9:	total: 7.64s	remaining: 6m 14s
10:	total: 8.25s	remaining: 6m 6s
11:	total: 8.9s	remaining: 6m 2s
12:	total: 9.49s	remaining: 5m 55s
13:	total: 10.2s	remaining: 5m 54s
14:	total: 10.9s	remaining: 5m 52s
15:	total: 11.5s	remaining: 5m 48s
16:	total: 12.3s	remaining: 5m 48s
17:	total: 13.1s	remaining: 5m 49s
18:	total: 13.8s	remaining: 5m 49s
19:	total: 14.5s	remaining: 5m 48s
20:	total: 15.2s	remaining: 5m 46s
21:	total: 15.9s	remaining: 5m 46s
22:	total: 16.7s	remaining: 5m 45s
23:	total: 17.4s	remaining: 5m 45s
24:	total: 18.2s	remaining: 5m 46s
25:	total: 19s	remaining: 5m 46s
26:	total: 19.8s	remaining: 5m 46s
27:	total: 20.5s	remaining: 5m 45s
28:	total: 21.1s	remaining: 5m 42s
2

229:	total: 3m 1s	remaining: 3m 32s
230:	total: 3m 2s	remaining: 3m 31s
231:	total: 3m 2s	remaining: 3m 31s
232:	total: 3m 3s	remaining: 3m 30s
233:	total: 3m 4s	remaining: 3m 29s
234:	total: 3m 5s	remaining: 3m 28s
235:	total: 3m 6s	remaining: 3m 28s
236:	total: 3m 6s	remaining: 3m 27s
237:	total: 3m 7s	remaining: 3m 26s
238:	total: 3m 8s	remaining: 3m 26s
239:	total: 3m 9s	remaining: 3m 25s
240:	total: 3m 10s	remaining: 3m 24s
241:	total: 3m 11s	remaining: 3m 23s
242:	total: 3m 12s	remaining: 3m 23s
243:	total: 3m 12s	remaining: 3m 22s
244:	total: 3m 13s	remaining: 3m 21s
245:	total: 3m 14s	remaining: 3m 20s
246:	total: 3m 15s	remaining: 3m 19s
247:	total: 3m 15s	remaining: 3m 19s
248:	total: 3m 16s	remaining: 3m 18s
249:	total: 3m 17s	remaining: 3m 17s
250:	total: 3m 18s	remaining: 3m 16s
251:	total: 3m 19s	remaining: 3m 15s
252:	total: 3m 19s	remaining: 3m 15s
253:	total: 3m 20s	remaining: 3m 14s
254:	total: 3m 21s	remaining: 3m 13s
255:	total: 3m 22s	remaining: 3m 12s
256:	total: 

455:	total: 6m 4s	remaining: 35.2s
456:	total: 6m 5s	remaining: 34.4s
457:	total: 6m 6s	remaining: 33.6s
458:	total: 6m 7s	remaining: 32.8s
459:	total: 6m 8s	remaining: 32s
460:	total: 6m 8s	remaining: 31.2s
461:	total: 6m 9s	remaining: 30.4s
462:	total: 6m 10s	remaining: 29.6s
463:	total: 6m 11s	remaining: 28.8s
464:	total: 6m 12s	remaining: 28s
465:	total: 6m 13s	remaining: 27.2s
466:	total: 6m 13s	remaining: 26.4s
467:	total: 6m 14s	remaining: 25.6s
468:	total: 6m 15s	remaining: 24.8s
469:	total: 6m 16s	remaining: 24s
470:	total: 6m 16s	remaining: 23.2s
471:	total: 6m 17s	remaining: 22.4s
472:	total: 6m 18s	remaining: 21.6s
473:	total: 6m 19s	remaining: 20.8s
474:	total: 6m 20s	remaining: 20s
475:	total: 6m 20s	remaining: 19.2s
476:	total: 6m 21s	remaining: 18.4s
477:	total: 6m 22s	remaining: 17.6s
478:	total: 6m 23s	remaining: 16.8s
479:	total: 6m 23s	remaining: 16s
480:	total: 6m 24s	remaining: 15.2s
481:	total: 6m 25s	remaining: 14.4s
482:	total: 6m 26s	remaining: 13.6s
483:	tota

NameError: name 'metrics' is not defined

In [33]:
import numpy as np

np_arr = np.array([[ 0.43348191, 0.56651809],
       [ 0.84401838, 0.15598162],
       [ 0.13147498, 0.86852502]])

In [35]:
np_arr[:,1]

array([ 0.56651809,  0.15598162,  0.86852502])

In [40]:
clf.predict_proba(test)

array([[ 0.99248242,  0.00751758],
       [ 0.98422051,  0.01577949],
       [ 0.98582767,  0.01417233],
       ..., 
       [ 0.98536377,  0.01463623],
       [ 0.99129739,  0.00870261],
       [ 0.97489411,  0.02510589]])

In [41]:
y_test

632830    0
249400    0
694264    0
717623    0
458571    0
287005    1
771526    0
807558    0
814023    1
56020     0
269614    0
500462    0
468659    0
94278     0
744185    0
289684    0
288029    0
889335    0
814548    0
191188    1
699195    0
23806     0
123362    0
163442    0
786351    0
142949    0
839632    0
642767    0
716690    0
430449    0
         ..
865945    0
683866    0
890432    0
672260    0
491212    0
185754    0
387622    0
760889    0
691997    0
286642    0
722248    0
268224    0
809047    0
763369    0
148719    0
348449    0
734266    1
504225    1
549270    0
780675    0
191978    0
728449    0
514447    0
637866    0
121222    0
413276    0
420683    0
560263    0
497622    0
870260    0
Name: is_claim_bool, Length: 89912, dtype: int64