Based on the data study, this notebook creates new columns from the flight delay claim dataset that can be used as features in the prediction model.

In [2]:
# Used libraries
import pandas as pd
from datetime import datetime, timedelta

In [3]:
# Load the dataset
data_df = pd.read_csv('../datasets/flight_delays_data.csv')

# Check data size
data_df.shape

(899114, 10)

In [4]:
# Show some sample data
data_df.head()

Unnamed: 0,flight_id,flight_no,Week,Departure,Arrival,Airline,std_hour,delay_time,flight_date,is_claim
0,1582499,UO686,27,HKG,KIX,UO,10,0.4,2016-07-01,0
1,1582501,CI7868,17,HKG,TNN,CI,11,0.5,2015-04-23,0
2,1582504,PR301,14,HKG,MNL,PR,11,0.0,2014-04-08,0
3,1582508,LD327,37,HKG,SIN,LD,3,0.1,2013-09-15,0
4,1582509,KA5390,40,HKG,PEK,KA,9,0.5,2015-10-05,0


Based on the study, we create several types of statistics to help with the prediction:

# Delay hours statistics

We create statistics of delay hours from the following perspectives:
- Departure
- Arrival
- Airline
- Departure + Arrival
- Departure + Airline (approximately equivalent to Airline alone in this dataset)
- Arrival + Airline
- Departure + Arrival + Airline

For each perspective, the average delay_time per time window (4 hours, 1 day, 1 week) is computed. The deviation between the most recent consecutive averages is then computed.
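The averaging step can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual implementation. It assumes that "deviation" means the difference between a window's average and the previous window's average within the same group, and that non-numeric `delay_time` entries such as `Cancelled` are left out of the average. The names `toy_df`, `delay_hours`, `avg_delay`, `avg_delay_dev` and `avg_delay_with_deviation` are hypothetical.

```python
import pandas as pd

# Toy data standing in for data_df; the values are illustrative only.
toy_df = pd.DataFrame({
    'Arrival': ['KIX', 'KIX', 'KIX', 'KIX', 'TPE', 'TPE'],
    'flight_date': ['2016-07-01', '2016-07-01', '2016-07-02',
                    '2016-07-02', '2016-07-01', '2016-07-02'],
    'delay_time': ['0.4', '0.6', 'Cancelled', '0.9', '0.2', '1.0'],
})
toy_df['flight_date_dt'] = pd.to_datetime(toy_df['flight_date'])

# Assumption: non-numeric entries such as 'Cancelled' become NaN,
# and mean() then skips them.
toy_df['delay_hours'] = pd.to_numeric(toy_df['delay_time'], errors='coerce')


def avg_delay_with_deviation(df, keys, freq):
    """Average delay per key group and time window, plus the change
    from the previous window of the same group (hypothetical helper)."""
    grouped = (
        df.groupby(keys + [pd.Grouper(key='flight_date_dt', freq=freq)])['delay_hours']
          .mean()
          .rename('avg_delay')
          .reset_index()
    )
    # Difference between consecutive window averages within each group.
    grouped['avg_delay_dev'] = grouped.groupby(keys)['avg_delay'].diff()
    return grouped


# Daily averages from the "Arrival" perspective.
daily_arrival = avg_delay_with_deviation(toy_df, ['Arrival'], 'D')
print(daily_arrival)
```

Other perspectives would pass a different key list (for example `['Arrival', 'Airline']`), and the weekly window would pass `'W'` as the frequency. The 4-hour window could instead group on the flight date together with a 4-hour bin column.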

In [6]:
# Get flight datetime-related columns
# Parse the date string, then add the scheduled hour to get the full timestamp
data_df['flight_date_dt'] = pd.to_datetime(data_df['flight_date'], format='%Y-%m-%d')
data_df['flight_dt'] = data_df['flight_date_dt'] + pd.to_timedelta(data_df['std_hour'], unit='h')
data_df['flight_year'] = data_df['flight_dt'].dt.year
data_df['flight_month'] = data_df['flight_dt'].dt.month
data_df['flight_day'] = data_df['flight_dt'].dt.day
# 4-hour bin of the scheduled hour (0 to 5)
data_df['flight_hour_bin'] = data_df['std_hour'] // 4

In [7]:
data_df.sample(10)

Unnamed: 0,flight_id,flight_no,Week,Departure,Arrival,Airline,std_hour,delay_time,flight_date,is_claim,flight_date_dt,flight_dt,flight_year,flight_month,flight_day,flight_hour_bin
605220,1882382,CI602,34,HKG,TPE,CI,10,0.9,2015-08-23,0,2015-08-23,2015-08-23 10:00:00,2015,8,23,2
540457,1681990,MF8634,49,HKG,WUS,MF,20,Cancelled,2015-12-07,800,2015-12-07,2015-12-07 20:00:00,2015,12,7,5
628115,1953205,UA9726,37,HKG,NRT,UA,9,-0.1,2013-09-14,0,2013-09-14,2013-09-14 09:00:00,2013,9,14,2
731552,2273173,EY7121,20,HKG,CGK,EY,17,0.0,2014-05-15,0,2014-05-15,2014-05-15 17:00:00,2014,5,15,4
311995,971511,BA4551,39,HKG,AKL,BA,21,0.7,2015-09-28,0,2015-09-28,2015-09-28 21:00:00,2015,9,28,5
465558,1448326,AK238,38,HKG,BKI,AK,19,0.0,2014-09-18,0,2014-09-18,2014-09-18 19:00:00,2014,9,18,4
213478,662507,CX502,3,HKG,KIX,CX,16,0.1,2015-01-16,0,2015-01-16,2015-01-16 16:00:00,2015,1,16,4
670352,2084066,CX905,10,HKG,MNL,CX,22,0.2,2015-03-07,0,2015-03-07,2015-03-07 22:00:00,2015,3,7,5
89046,276203,CX530,27,HKG,TPE,CX,9,-0.1,2015-07-03,0,2015-07-03,2015-07-03 09:00:00,2015,7,3,2
337698,1050670,CX913,52,HKG,MNL,CX,20,0.0,2013-12-27,0,2013-12-27,2013-12-27 20:00:00,2013,12,27,5


In addition, for each flight record, the rolling average over the last 4 hours, 1 day, and 1 week is computed.
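One way to sketch this per-record rolling average is with a time-based window in pandas. This is an illustration on toy data under stated assumptions, not the notebook's actual method. It assumes that the window should exclude the current flight (`closed='left'`) so that a record's own delay does not enter its feature, and that delays are already numeric. The names `toy_df`, `delay_hours` and `avg_delay_last_4h` are hypothetical.

```python
import pandas as pd

# Toy data standing in for data_df; the values are illustrative only.
toy_df = pd.DataFrame({
    'Airline': ['CX', 'CX', 'CX', 'UO'],
    'flight_dt': pd.to_datetime([
        '2015-01-16 08:00', '2015-01-16 10:00',
        '2015-01-16 16:00', '2015-01-16 10:00',
    ]),
    'delay_hours': [0.2, 0.6, 1.0, 0.4],
})

# Time-based rolling windows need timestamps in increasing order per group.
toy_df = toy_df.sort_values(['Airline', 'flight_dt']).reset_index(drop=True)

# For each record, average the delays of the same airline's flights in the
# 4 hours before it. closed='left' keeps the current flight out of its window.
rolled = (
    toy_df.set_index('flight_dt')
          .groupby('Airline')['delay_hours']
          .rolling(pd.Timedelta(hours=4), closed='left')
          .mean()
)

# The grouped result follows the same (Airline, flight_dt) order as the
# sorted frame, so the values can be assigned back by position.
toy_df['avg_delay_last_4h'] = rolled.to_numpy()
print(toy_df)
```

Records with no earlier flight inside the window get `NaN`. The 1-day and 1-week variants would use `pd.Timedelta(days=1)` and `pd.Timedelta(weeks=1)` as the window.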

# Delay status statistics

# Cancel status statistics