# US Flight Delay Analysis MVP
This notebook contains the analysis of the "On-Time : Reporting Carrier On-Time Performance" dataset, for January 2015.
In terms of technology, we will use pandas for the data manipulation and analysis.

Each phase of the process followed is indicated by cronological order.

## Imports:

In [1]:
import pandas as pd
import numpy as np
import os

# Cleaning the Dataset

## Data Loading

I use the os module, in order to get the directory of this notebook, and be totally sure of the data path to insert.

In [2]:
cwd = os.getcwd()
cwd

'C:\\Users\\GerardEspejo\\Desktop\\TFM\\TFM'

The dataset is loaded from a local folder "Data", which contains all the used datasets.

In [3]:
df_original = pd.read_csv('C:\\Users\\GerardEspejo\\Desktop\\TFM\\Data\\Jan_2015.csv')
df_original.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,...,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM,Unnamed: 109
0,2015,1,1,2,5,2015-01-02,NK,20416,NK,N521NK,...,,,,,,,,,,
1,2015,1,1,3,6,2015-01-03,NK,20416,NK,N512NK,...,,,,,,,,,,
2,2015,1,1,4,7,2015-01-04,NK,20416,NK,N528NK,...,,,,,,,,,,
3,2015,1,1,5,1,2015-01-05,NK,20416,NK,N523NK,...,,,,,,,,,,
4,2015,1,1,6,2,2015-01-06,NK,20416,NK,N534NK,...,,,,,,,,,,


## Initial Data Examination
In this section, I am trying to have quick glance of the orginal dataset. The examination is focused on the size and data types.

In [4]:
df_original.shape

(469968, 110)

In [5]:
df_original.dtypes

YEAR                       int64
QUARTER                    int64
MONTH                      int64
DAY_OF_MONTH               int64
DAY_OF_WEEK                int64
FL_DATE                   object
OP_UNIQUE_CARRIER         object
OP_CARRIER_AIRLINE_ID      int64
OP_CARRIER                object
TAIL_NUM                  object
OP_CARRIER_FL_NUM          int64
ORIGIN_AIRPORT_ID          int64
ORIGIN_AIRPORT_SEQ_ID      int64
ORIGIN_CITY_MARKET_ID      int64
ORIGIN                    object
ORIGIN_CITY_NAME          object
ORIGIN_STATE_ABR          object
ORIGIN_STATE_FIPS          int64
ORIGIN_STATE_NM           object
ORIGIN_WAC                 int64
DEST_AIRPORT_ID            int64
DEST_AIRPORT_SEQ_ID        int64
DEST_CITY_MARKET_ID        int64
DEST                      object
DEST_CITY_NAME            object
DEST_STATE_ABR            object
DEST_STATE_FIPS            int64
DEST_STATE_NM             object
DEST_WAC                   int64
CRS_DEP_TIME               int64
          

In [6]:
df_original.columns

Index(['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE',
       'OP_UNIQUE_CARRIER', 'OP_CARRIER_AIRLINE_ID', 'OP_CARRIER', 'TAIL_NUM',
       ...
       'DIV4_TAIL_NUM', 'DIV5_AIRPORT', 'DIV5_AIRPORT_ID',
       'DIV5_AIRPORT_SEQ_ID', 'DIV5_WHEELS_ON', 'DIV5_TOTAL_GTIME',
       'DIV5_LONGEST_GTIME', 'DIV5_WHEELS_OFF', 'DIV5_TAIL_NUM',
       'Unnamed: 109'],
      dtype='object', length=110)

### Conclusions
It appears that with the full dataset, we have 110 columns, and some with mixed types. In order to have a more flexible working environment, and because I won't be using all the data available, I will do a pre-filter of the columns needed once generating the dataset. This will help me reduce time and efforts. Therefore, I am creating a more specific dataset to work with. 

The selection of "interesting" columns is done during the data downloading process, in the BTS website: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

The new dataset won't have information (columns) related to: Gate Return Information at Origin Airport and Diverted Airport Information. Because we are focusing this analysis only on Flight Delays, I won't be using flight divertions informations, for now.

## Loading the new dataset 

In [7]:
df = pd.read_csv('C:\\Users\\GerardEspejo\\Desktop\\TFM\\Data\\Jan_2015_v2.csv')
df.head()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,...,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 50
0,2015,1,1,22,4,2015-01-22,DL,DL,N969DL,1485,...,134.0,1.0,950.0,4,,,,,,
1,2015,1,1,22,4,2015-01-22,DL,DL,N912DL,1486,...,90.0,1.0,762.0,4,,,,,,
2,2015,1,1,22,4,2015-01-22,DL,DL,N359NW,1487,...,240.0,1.0,1956.0,8,,,,,,
3,2015,1,1,22,4,2015-01-22,DL,DL,N957AT,1488,...,29.0,1.0,143.0,1,,,,,,
4,2015,1,1,22,4,2015-01-22,DL,DL,N985DL,1489,...,123.0,1.0,689.0,3,,,,,,


## Data Examination

In [8]:
df.shape

(469968, 51)

In [9]:
df.dtypes

YEAR                       int64
QUARTER                    int64
MONTH                      int64
DAY_OF_MONTH               int64
DAY_OF_WEEK                int64
FL_DATE                   object
OP_UNIQUE_CARRIER         object
OP_CARRIER                object
TAIL_NUM                  object
OP_CARRIER_FL_NUM          int64
ORIGIN_AIRPORT_ID          int64
ORIGIN_CITY_MARKET_ID      int64
ORIGIN                    object
ORIGIN_CITY_NAME          object
ORIGIN_STATE_ABR          object
ORIGIN_STATE_NM           object
ORIGIN_WAC                 int64
DEST_AIRPORT_ID            int64
DEST_CITY_MARKET_ID        int64
DEST                      object
DEST_CITY_NAME            object
DEST_STATE_ABR            object
DEST_STATE_NM             object
DEST_WAC                   int64
CRS_DEP_TIME               int64
DEP_TIME                 float64
DEP_DELAY                float64
DEP_DELAY_NEW            float64
DEP_DEL15                float64
DEP_DELAY_GROUP          float64
CRS_ARR_TI

### Conclusions

It appears we have reduced the number of columns by 59, and we have solved the issue of having mixed types in various columns. In result, we now have a more "user-friendly" dataset.

## Trimming the Data

### Columns Rename
In order to maximize the user friendliness of the dataset, I choose to rename the columns to a more suitable name.

I will use the rename pandas fuction.

In [10]:
df.columns

Index(['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE',
       'OP_UNIQUE_CARRIER', 'OP_CARRIER', 'TAIL_NUM', 'OP_CARRIER_FL_NUM',
       'ORIGIN_AIRPORT_ID', 'ORIGIN_CITY_MARKET_ID', 'ORIGIN',
       'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM', 'ORIGIN_WAC',
       'DEST_AIRPORT_ID', 'DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME',
       'DEST_STATE_ABR', 'DEST_STATE_NM', 'DEST_WAC', 'CRS_DEP_TIME',
       'DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15',
       'DEP_DELAY_GROUP', 'CRS_ARR_TIME', 'ARR_TIME', 'ARR_DELAY',
       'ARR_DELAY_NEW', 'ARR_DEL15', 'ARR_DELAY_GROUP', 'CANCELLED',
       'CANCELLATION_CODE', 'DIVERTED', 'CRS_ELAPSED_TIME',
       'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
       'DISTANCE_GROUP', 'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY',
       'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY', 'Unnamed: 50'],
      dtype='object')

In [11]:
df.rename(columns={'YEAR':'Year',
                  'QUARTER':'Quarter',
                  'MONTH':'Month',
                  'DAY_OF_MONTH':'DayOfMonth',
                  'DAY_OF_WEEK':'DayOfWeek',
                  'FL_DATE':'FlightDate',
                  'OP_UNIQUE_CARRIER':'UniqueCarrier',
                  'OP_CARRIER':'Carrier',
                  'TAIL_NUM':'RegistrationNum',
                  'OP_CARRIER_FL_NUM':'FlightNum'}
          , inplace=True)

I use the iloc indexer to create little datasets, with selected columns. In that way, I am investigating the content and parameters of specific columns, and I can provide and ideal name. The philosophy is to use a name that will provide minimum valuable information.

In [12]:
#Origin related columns
df_org = df.iloc[:,[11,12,13,14,15,16,17]]
df_org.head()

Unnamed: 0,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID
0,31703,LGA,"New York, NY",NY,New York,22,13204
1,30397,ATL,"Atlanta, GA",GA,Georgia,34,12953
2,33570,SAN,"San Diego, CA",CA,California,91,11433
3,30397,ATL,"Atlanta, GA",GA,Georgia,34,10208
4,30397,ATL,"Atlanta, GA",GA,Georgia,34,12266


In [13]:
df.rename(columns={'ORIGIN_AIRPORT_ID':'OriginAirport_IDNum', 
                  'ORIGIN_CITY_MARKET_ID':'OriginCityMarket_IDNum', 
                  'ORIGIN':'Origin_IATA',
                  'ORIGIN_CITY_NAME':'OriginCityName',
                  'ORIGIN_STATE_ABR':'OriginState_ID',
                  'ORIGIN_STATE_NM':'OriginStateName',
                  'ORIGIN_WAC':'OriginWAC'}
          , inplace=True)

In [14]:
#Destination related columns
df_dest = df.iloc[:,[17,18,19,20,21,22,23]]
df_dest.head()

Unnamed: 0,DEST_AIRPORT_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_WAC
0,13204,31454,MCO,"Orlando, FL",FL,Florida,33
1,12953,31703,LGA,"New York, NY",NY,New York,22
2,11433,31295,DTW,"Detroit, MI",MI,Michigan,43
3,10208,30208,AGS,"Augusta, GA",GA,Georgia,34
4,12266,31453,IAH,"Houston, TX",TX,Texas,74


In [15]:
df.rename(columns={'DEST_AIRPORT_ID':'DestAirport_IDNum',
                  'DEST_CITY_MARKET_ID':'DestCityMarket_IDNum',
                  'DEST':'Dest_IATA',
                  'DEST_CITY_NAME':'DestCityName',
                  'DEST_STATE_ABR':'DestState_ID',
                  'DEST_STATE_NM':'DestStateName',
                  'DEST_WAC':'DestWAC'}
          , inplace=True)

In [16]:
#Delay related columns
df_delay = df.iloc[:,[26,27,28,29,31,32,33,34]]
df_delay.head()

Unnamed: 0,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15
0,-5.0,0.0,0.0,-1.0,8.0,14.0,14.0,0.0
1,26.0,26.0,1.0,1.0,1456.0,-3.0,0.0,0.0
2,-6.0,0.0,0.0,-1.0,1907.0,-7.0,0.0,0.0
3,-1.0,0.0,0.0,-1.0,1954.0,-7.0,0.0,0.0
4,-8.0,0.0,0.0,-1.0,1454.0,-7.0,0.0,0.0


In [17]:
df.rename(columns={'CRS_DEP_TIME':'CRSDepTime',
                  'DEP_TIME':'DepTime',
                  'DEP_DELAY':'DepDelayMin',
                  'DEP_DELAY_NEW':'DepDelayMin0',
                  'DEP_DEL15':'DepDelay_Ind15',
                  'DEP_DELAY_GROUP':'DepDelayGroup_Int15',
                  'CRS_ARR_TIME':'CRSArrTime',
                  'ARR_TIME':'ArrTime',
                  'ARR_DELAY':'ArrDelayMin',
                  'ARR_DELAY_NEW':'DepDelayMin0',
                  'ARR_DEL15':'ArrDelay_Ind15',
                  'ARR_DELAY_GROUP':'ArrDelayGroup_Int15'}
          , inplace=True)

In [18]:
#Cancellations related columns
df_delay = df.iloc[:,[36,37,38]]
df_delay.head()

Unnamed: 0,CANCELLED,CANCELLATION_CODE,DIVERTED
0,0.0,,0.0
1,0.0,,0.0
2,0.0,,0.0
3,0.0,,0.0
4,0.0,,0.0


In [19]:
df.rename(columns={'CANCELLED':'Cancelled',
                  'CANCELLATION_CODE':'CancellationCode',
                  'DIVERTED':'Diverted',
                  'CRS_ELAPSED_TIME':'CRSElapsedTimeMin',
                  'ACTUAL_ELAPSED_TIME':'ActualElapsedTimeMin',
                  'AIR_TIME':'FlightTimeMin',
                  'FLIGHTS':'NumberOfFlights',
                  'DISTANCE':'DistanceMil',
                  'DISTANCE_GROUP':'Distance_Int250Mil',
                  'CARRIER_DELAY':'CarrierDelayMin',
                  'WEATHER_DELAY':'WeatherDelayMin',
                  'NAS_DELAY':'NASDelayMin',
                  'SECURITY_DELAY':'SecurityDelayMin',
                  'LATE_AIRCRAFT_DELAY':'LateAircraftDelay'}
          , inplace=True)

In [20]:
df.columns

Index(['Year', 'Quarter', 'Month', 'DayOfMonth', 'DayOfWeek', 'FlightDate',
       'UniqueCarrier', 'Carrier', 'RegistrationNum', 'FlightNum',
       'OriginAirport_IDNum', 'OriginCityMarket_IDNum', 'Origin_IATA',
       'OriginCityName', 'OriginState_ID', 'OriginStateName', 'OriginWAC',
       'DestAirport_IDNum', 'DestCityMarket_IDNum', 'Dest_IATA',
       'DestCityName', 'DestState_ID', 'DestStateName', 'DestWAC',
       'CRSDepTime', 'DepTime', 'DepDelayMin', 'DepDelayMin0',
       'DepDelay_Ind15', 'DepDelayGroup_Int15', 'CRSArrTime', 'ArrTime',
       'ArrDelayMin', 'DepDelayMin0', 'ArrDelay_Ind15', 'ArrDelayGroup_Int15',
       'Cancelled', 'CancellationCode', 'Diverted', 'CRSElapsedTimeMin',
       'ActualElapsedTimeMin', 'FlightTimeMin', 'NumberOfFlights',
       'DistanceMil', 'Distance_Int250Mil', 'CarrierDelayMin',
       'WeatherDelayMin', 'NASDelayMin', 'SecurityDelayMin',
       'LateAircraftDelay', 'Unnamed: 50'],
      dtype='object')

### Treating Null Values
First, let's take a quick preview of the data with sample of 15 rows and check the content of each column.


In [25]:
df.sample(15)

Unnamed: 0,Year,Quarter,Month,DayOfMonth,DayOfWeek,FlightDate,UniqueCarrier,Carrier,RegistrationNum,FlightNum,...,FlightTimeMin,NumberOfFlights,DistanceMil,Distance_Int250Mil,CarrierDelayMin,WeatherDelayMin,NASDelayMin,SecurityDelayMin,LateAircraftDelay,Unnamed: 50
382572,2015,1,1,6,2,2015-01-06,WN,WN,N954WN,4899,...,54.0,1.0,281.0,2,,,,,,
44866,2015,1,1,21,3,2015-01-21,EV,EV,N11194,4735,...,57.0,1.0,328.0,2,,,,,,
414927,2015,1,1,16,5,2015-01-16,WN,WN,N796SW,1837,...,115.0,1.0,711.0,3,,,,,,
118402,2015,1,1,6,2,2015-01-06,AA,AA,N785AA,33,...,355.0,1.0,2475.0,10,0.0,0.0,32.0,0.0,0.0,
370265,2015,1,1,2,5,2015-01-02,WN,WN,N906WN,1743,...,136.0,1.0,1067.0,5,39.0,0.0,0.0,0.0,2.0,
13928,2015,1,1,27,2,2015-01-27,DL,DL,N972DL,1062,...,118.0,1.0,859.0,4,,,,,,
274796,2015,1,1,13,2,2015-01-13,OO,OO,N702SK,5581,...,204.0,1.0,1535.0,7,,,,,,
150269,2015,1,1,10,6,2015-01-10,AA,AA,N3APAA,1637,...,217.0,1.0,1471.0,6,,,,,,
305293,2015,1,1,20,2,2015-01-20,UA,UA,N39450,1037,...,132.0,1.0,867.0,4,,,,,,
40915,2015,1,1,27,2,2015-01-27,EV,EV,N12126,4670,...,60.0,1.0,403.0,2,,,,,,


In [124]:
df.describe()

Unnamed: 0,Year,Quarter,Month,DayOfMonth,DayOfWeek,FlightNum,OriginAirport_IDNum,OriginCityMarket_IDNum,OriginWAC,DestAirport_IDNum,...,FlightTimeMin,NumberOfFlights,DistanceMil,Distance_Int250Mil,CarrierDelayMin,WeatherDelayMin,NASDelayMin,SecurityDelayMin,LateAircraftDelay,Unnamed: 50
count,469968.0,469968.0,469968.0,469968.0,469968.0,469968.0,469968.0,469968.0,469968.0,469968.0,...,457013.0,469968.0,469968.0,469968.0,95951.0,95951.0,95951.0,95951.0,95951.0,0.0
mean,2015.0,1.0,1.0,15.853001,4.025559,2266.351901,12669.40424,31714.181421,55.632303,12669.437149,...,112.544096,1.0,803.261279,3.684057,17.802368,2.741889,13.319872,0.069827,22.760211,
std,0.0,0.0,0.0,8.952803,1.933772,1804.269617,1516.976267,1284.310179,26.491024,1516.825219,...,71.682848,0.0,596.249383,2.345556,45.334536,18.510443,24.723111,2.068116,40.939751,
min,2015.0,1.0,1.0,1.0,1.0,1.0,10135.0,30070.0,1.0,10135.0,...,8.0,1.0,31.0,1.0,0.0,0.0,0.0,0.0,0.0,
25%,2015.0,1.0,1.0,8.0,2.0,760.0,11292.0,30627.0,34.0,11292.0,...,60.0,1.0,366.0,2.0,0.0,0.0,0.0,0.0,0.0,
50%,2015.0,1.0,1.0,16.0,4.0,1735.0,12889.0,31453.0,52.0,12889.0,...,94.0,1.0,641.0,3.0,2.0,0.0,4.0,0.0,4.0,
75%,2015.0,1.0,1.0,24.0,6.0,3488.0,13930.0,32467.0,82.0,13930.0,...,144.0,1.0,1046.0,5.0,18.0,0.0,18.0,0.0,29.0,
max,2015.0,1.0,1.0,31.0,7.0,9793.0,16218.0,35991.0,93.0,16218.0,...,676.0,1.0,4983.0,11.0,1971.0,938.0,830.0,241.0,948.0,


In [125]:
df.dtypes

Year                        int64
Quarter                     int64
Month                       int64
DayOfMonth                  int64
DayOfWeek                   int64
FlightDate                 object
UniqueCarrier              object
Carrier                    object
RegistrationNum            object
FlightNum                   int64
OriginAirport_IDNum         int64
OriginCityMarket_IDNum      int64
Origin_IATA                object
OriginCityName             object
OriginState_ID             object
OriginStateName            object
OriginWAC                   int64
DestAirport_IDNum           int64
DestCityMarket_IDNum        int64
Dest_IATA                  object
DestCityName               object
DestState_ID               object
DestStateName              object
DestWAC                     int64
CRSDepTime                  int64
DepTime                   float64
DepDelayMin               float64
DepDelayMin0              float64
DepDelay_Ind15            float64
DepDelayGroup_

In [126]:
df_flights_list = df[['YEAR', 'FL_DATE', 'ORIGIN_AIRPORT_ID', 'ORIGIN', 'DEST_AIRPORT_ID', 'DEST', 'DEP_DELAY', 'ARR_DELAY', 'CANCELLED']]
df_flights_list.head(10)

KeyError: "['YEAR' 'FL_DATE' 'ORIGIN_AIRPORT_ID' 'ORIGIN' 'DEST_AIRPORT_ID' 'DEST'\n 'DEP_DELAY' 'ARR_DELAY' 'CANCELLED'] not in index"

In [127]:
df_flights_list.corr

NameError: name 'df_flights_list' is not defined

In [128]:
max_delay_mins = df_flights_list[['DEP_DELAY']].max()
max_delay_hours = max_delay_mins/60
max_delay_hours

NameError: name 'df_flights_list' is not defined

In [129]:
df_flights_list.sort_values(by=['DEP_DELAY'])

NameError: name 'df_flights_list' is not defined

In [130]:
#Check Dataframe dimensions: number of rows
len(df)

469968

In [35]:
#NaN Treatment
df_clean = df.dropna()
df_clean.head()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,...,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 48


In [37]:
df_flights_list_clean = df_flights_list.dropna()
df_flights_list_clean.head()

Unnamed: 0,YEAR,FL_DATE,ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DEP_DELAY,ARR_DELAY,CANCELLED
0,2018,2018-01-27,11697,FLL,12266,IAH,-13.0,-12.0,0.0
1,2018,2018-01-27,14747,SEA,14771,SFO,-4.0,-18.0,0.0
2,2018,2018-01-27,11278,DCA,12266,IAH,-2.0,1.0,0.0
3,2018,2018-01-27,12892,LAX,13930,ORD,-9.0,-8.0,0.0
4,2018,2018-01-27,12451,JAX,11618,EWR,-14.0,-24.0,0.0
