# US Flight Delay Analysis - Winter Season (2018-2019)
Data: "On-Time : Reporting Carrier On-Time Performance" dataset, for: December 2018, January 2019 and February 2019.

This notebook sets up the dataset prior to the analysis phase. It covers loading the data, merging the three monthly datasets, trimming the data and, finally, treating null values.

In terms of technology, we will use pandas for the data manipulation and analysis.

Each phase of the process is presented in chronological order.

### Imports:

In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import os

# 01. Data Loading & Merging
We will load the three datasets, each containing the data of one month. Then, before merging, we will take a quick look at the data. Finally, we will create the final dataset by merging the three, which is the one we will use for the analysis.

In [2]:
#I use the "os" module to get the directory of this notebook, so I can be sure of the data path to use.
cwd = os.getcwd()
cwd

'C:\\Users\\GerardEspejo\\Desktop\\TFM\\TFM'
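A hard-coded absolute path only works on one machine. Below is a minimal sketch of deriving the data path from the notebook directory instead, assuming (as the paths in this notebook suggest) that the `Data` folder sits next to the notebook folder. The helper name `month_csv_path` is hypothetical, and `PureWindowsPath` is used only so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

def month_csv_path(notebook_dir, filename):
    """Return the path of a monthly CSV inside the sibling 'Data' folder."""
    # Assumption: 'Data' lives one level above the notebook directory.
    return PureWindowsPath(notebook_dir).parent / "Data" / filename

notebook_dir = r"C:\Users\GerardEspejo\Desktop\TFM\TFM"
dec_path = month_csv_path(notebook_dir, "Dec_2018.csv")
print(dec_path)
```

In the notebook itself, `pathlib.Path(os.getcwd())` could take the place of the hard-coded `notebook_dir`.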

### Dataset Loading and initial exploratory evaluation
The datasets are loaded from a local folder "Data", which contains all the datasets.

In [3]:
#December 2018
df_dec18 = pd.read_csv('C:\\Users\\GerardEspejo\\Desktop\\TFM\\Data\\Dec_2018.csv')



In [4]:
#Let's take a sample of 5 flights, as a quick preview. 
#Mainly I will check the dates of the flights (Column: 'FL_DATE'), and see if the month coincides with the dataset downloaded.
df_dec18.sample(5)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,...,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 50
196581,2018,4,12,28,5,2018-12-28,OO,OO,N171SY,3437,...,260.0,1.0,1670.0,7,,,,,,
356648,2018,4,12,22,6,2018-12-22,NK,NK,N526NK,473,...,160.0,1.0,1167.0,5,,,,,,
577271,2018,4,12,1,6,2018-12-01,B6,B6,N183JB,918,...,32.0,1.0,187.0,1,,,,,,
239501,2018,4,12,26,3,2018-12-26,WN,WN,N8329B,5202,...,133.0,1.0,843.0,4,13.0,0.0,4.0,0.0,47.0,
313318,2018,4,12,10,1,2018-12-10,DL,DL,N387DA,1841,...,163.0,1.0,1175.0,5,,,,,,
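A random sample of 5 rows can miss a stray date. As a complement, here is a small sketch that lists every (year, month) pair present in the date column, shown on toy data. The helper name `months_in` is hypothetical, and the `%Y-%m-%d` format is assumed from the preview above.

```python
import pandas as pd

def months_in(df, date_col="FL_DATE"):
    """Return the sorted (year, month) pairs found in a date column."""
    dates = pd.to_datetime(df[date_col], format="%Y-%m-%d")
    # Convert to plain ints so the result is easy to compare and print.
    return sorted({(int(y), int(m)) for y, m in zip(dates.dt.year, dates.dt.month)})

# Toy frame using three dates taken from the sample above.
toy_dec = pd.DataFrame({"FL_DATE": ["2018-12-28", "2018-12-22", "2018-12-01"]})
print(months_in(toy_dec))
```

For the December file, a single pair `(2018, 12)` would confirm that every row belongs to the expected month.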


In [5]:
#January 2019
df_jan19 = pd.read_csv('C:\\Users\\GerardEspejo\\Desktop\\TFM\\Data\\Jan_2019.csv')

In [6]:
df_jan19.sample(5)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,...,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 50
160355,2019,1,1,21,1,2019-01-21,WN,WN,N253WN,1351,...,66.0,1.0,414.0,2,4.0,0.0,0.0,0.0,28.0,
43904,2019,1,1,27,7,2019-01-27,AA,AA,N355PU,2325,...,144.0,1.0,1205.0,5,,,,,,
49492,2019,1,1,14,1,2019-01-14,MQ,MQ,N843AE,3770,...,59.0,1.0,335.0,2,0.0,0.0,8.0,0.0,84.0,
166458,2019,1,1,29,2,2019-01-29,WN,WN,N265WN,1133,...,,1.0,405.0,2,,,,,,
456861,2019,1,1,24,4,2019-01-24,B6,B6,N309JB,1895,...,163.0,1.0,972.0,4,0.0,0.0,18.0,0.0,0.0,


In [7]:
#February 2019
df_feb19 = pd.read_csv('C:\\Users\\GerardEspejo\\Desktop\\TFM\\Data\\Feb_2019.csv')

In [8]:
df_feb19.sample(5)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,...,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 50
423861,2019,1,2,17,7,2019-02-17,EV,EV,N15572,4250,...,56.0,1.0,427.0,2,0.0,0.0,0.0,0.0,47.0,
45895,2019,1,2,11,1,2019-02-11,WN,WN,N8605E,2244,...,160.0,1.0,1036.0,5,,,,,,
142818,2019,1,2,14,4,2019-02-14,WN,WN,N8670A,2173,...,213.0,1.0,1452.0,6,,,,,,
243227,2019,1,2,24,7,2019-02-24,DL,DL,N381DN,1974,...,59.0,1.0,352.0,2,,,,,,
336387,2019,1,2,4,1,2019-02-04,AA,AA,N537UW,428,...,245.0,1.0,1773.0,8,6.0,0.0,0.0,0.0,12.0,


In [9]:
#For the initial exploration, we will check the shapes of the three datasets
Shapes = {'Dec 18': [df_dec18.shape], 'Jan 19': [df_jan19.shape], 'Feb 19': [df_feb19.shape]}
df_shapes = DataFrame (Shapes, columns = ['Dec 18','Jan 19', 'Feb 19'])
df_shapes

Unnamed: 0,Dec 18,Jan 19,Feb 19
0,"(593842, 51)","(583985, 51)","(533175, 51)"


### Merging the three datasets
As we observed in the initial exploration, the three datasets hold the same kind of values. Their shapes are also very similar, and they have the same number of columns (51). This means we can join them into a single dataset.

In [10]:
#We will use the concat() function, to merge the three datasets
#We will set the argument 'ignore_index' to True, so the rows of the merged dataset are relabelled consecutively from 0
data = [df_dec18, df_jan19, df_feb19]
df_winter = pd.concat(data, ignore_index=True)

In [11]:
#The number of rows of the new dataset should be equal to the sum of rows of all three datasets.
calculated_total_rows = len(df_dec18) + len(df_jan19) + len(df_feb19)
calculated_total_rows

1711002

In [12]:
#Let's check the shape of the resultant dataset.
df_winter.shape

(1711002, 51)

The merge was successful!
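The two checks used in this section (same columns, row count equal to the sum of the parts) can also be wrapped into one step. The following is only a sketch on toy data; the function name `concat_checked` is hypothetical.

```python
import pandas as pd

def concat_checked(frames):
    """Concatenate frames row-wise after confirming they share the same columns."""
    first_cols = list(frames[0].columns)
    for i, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != first_cols:
            raise ValueError(f"frame {i} has different columns")
    merged = pd.concat(frames, ignore_index=True)
    # The merged row count must equal the sum of the individual row counts.
    assert len(merged) == sum(len(f) for f in frames)
    return merged

# Toy monthly frames standing in for the real datasets.
a = pd.DataFrame({"MONTH": [12, 12], "DEP_DELAY": [10.0, -4.0]})
b = pd.DataFrame({"MONTH": [1], "DEP_DELAY": [3.0]})
merged = concat_checked([a, b])
print(merged.shape)
```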

# 02. Data Preparation
In the previous notebook ('01. First Approach') we did an initial examination of the dataset, which led us to pre-select the most "interesting" columns when downloading the data from the BTS website: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

We applied that same pre-selection before the Data Loading step, which reduces the amount of data we have to handle.

For reference, the columns left out relate to Gate Return Information at Origin Airport and Diverted Airport Information. Because this analysis focuses only on flight delays, we won't be using the flight diversion information.

### Initial Data Examination

In [13]:
df_winter.shape

(1711002, 51)

In [14]:
#Let's check the columns and the type of values
df_winter.dtypes

YEAR                       int64
QUARTER                    int64
MONTH                      int64
DAY_OF_MONTH               int64
DAY_OF_WEEK                int64
FL_DATE                   object
OP_UNIQUE_CARRIER         object
OP_CARRIER                object
TAIL_NUM                  object
OP_CARRIER_FL_NUM          int64
ORIGIN_AIRPORT_ID          int64
ORIGIN_CITY_MARKET_ID      int64
ORIGIN                    object
ORIGIN_CITY_NAME          object
ORIGIN_STATE_ABR          object
ORIGIN_STATE_NM           object
ORIGIN_WAC                 int64
DEST_AIRPORT_ID            int64
DEST_CITY_MARKET_ID        int64
DEST                      object
DEST_CITY_NAME            object
DEST_STATE_ABR            object
DEST_STATE_NM             object
DEST_WAC                   int64
CRS_DEP_TIME               int64
DEP_TIME                 float64
DEP_DELAY                float64
DEP_DELAY_NEW            float64
DEP_DEL15                float64
DEP_DELAY_GROUP          float64
CRS_ARR_TI

In [15]:
#Which columns have the most null values?
df_winter.isna().sum().sort_values(ascending=False)

Unnamed: 50              1711002
CANCELLATION_CODE        1672269
SECURITY_DELAY           1377621
NAS_DELAY                1377621
WEATHER_DELAY            1377621
CARRIER_DELAY            1377621
LATE_AIRCRAFT_DELAY      1377621
ACTUAL_ELAPSED_TIME        42988
ARR_DEL15                  42988
ARR_DELAY_GROUP            42988
AIR_TIME                   42988
ARR_DELAY                  42988
ARR_DELAY_NEW              42988
ARR_TIME                   39791
DEP_DEL15                  37708
DEP_DELAY_GROUP            37708
DEP_DELAY_NEW              37708
DEP_DELAY                  37708
DEP_TIME                   37701
TAIL_NUM                    5034
CRS_ELAPSED_TIME             134
CANCELLED                      0
ORIGIN_STATE_ABR               0
QUARTER                        0
MONTH                          0
DAY_OF_MONTH                   0
DAY_OF_WEEK                    0
FL_DATE                        0
OP_UNIQUE_CARRIER              0
OP_CARRIER                     0
OP_CARRIER

In [16]:
#How many airlines are we dealing with?
df_winter['OP_CARRIER'].nunique()

17

In [17]:
# ...and airports?
df_winter['ORIGIN'].nunique()

346

### Trimming the data

##### Columns Rename

The default column names of the dataset are hard to work with. To make it easier to call specific columns and to understand what each one contains, I chose to rename the columns to more suitable names. We will use the pandas 'rename' function.

In [18]:
#Current columns names
df_winter.columns

Index(['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE',
       'OP_UNIQUE_CARRIER', 'OP_CARRIER', 'TAIL_NUM', 'OP_CARRIER_FL_NUM',
       'ORIGIN_AIRPORT_ID', 'ORIGIN_CITY_MARKET_ID', 'ORIGIN',
       'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM', 'ORIGIN_WAC',
       'DEST_AIRPORT_ID', 'DEST_CITY_MARKET_ID', 'DEST', 'DEST_CITY_NAME',
       'DEST_STATE_ABR', 'DEST_STATE_NM', 'DEST_WAC', 'CRS_DEP_TIME',
       'DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15',
       'DEP_DELAY_GROUP', 'CRS_ARR_TIME', 'ARR_TIME', 'ARR_DELAY',
       'ARR_DELAY_NEW', 'ARR_DEL15', 'ARR_DELAY_GROUP', 'CANCELLED',
       'CANCELLATION_CODE', 'DIVERTED', 'CRS_ELAPSED_TIME',
       'ACTUAL_ELAPSED_TIME', 'AIR_TIME', 'FLIGHTS', 'DISTANCE',
       'DISTANCE_GROUP', 'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY',
       'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY', 'Unnamed: 50'],
      dtype='object')

In [19]:
df_winter.rename(columns={'YEAR':'Year',
                  'QUARTER':'Quarter',
                  'MONTH':'Month',
                  'DAY_OF_MONTH':'DayOfMonth',
                  'DAY_OF_WEEK':'DayOfWeek',
                  'FL_DATE':'FlightDate',
                  'OP_UNIQUE_CARRIER':'UniqueCarrier',
                  'OP_CARRIER':'Carrier',
                  'TAIL_NUM':'RegistrationNum',
                  'OP_CARRIER_FL_NUM':'FlightNum'}
          , inplace=True)

We will use the 'iloc' indexer to create small datasets with the selected columns. That way, we get a quick view of the content and parameters of the specific columns, and we can choose an ideal name. The philosophy is to use a short name that still conveys the essential information.

In [20]:
#Process: 1-Locate the columns to rename using 'iloc', and take a quick sample.
#Origin related columns
df_origin = df_winter.iloc[:,[11,12,13,14,15,16,17]]
df_origin.head(3)

Unnamed: 0,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID
0,32575,SNA,"Santa Ana, CA",CA,California,91,10397
1,30423,AUS,"Austin, TX",TX,Texas,74,10397
2,31703,JFK,"New York, NY",NY,New York,22,11697
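Positional selection with 'iloc' is easy to get off by one: the preview above starts at ORIGIN_CITY_MARKET_ID and ends with DEST_AIRPORT_ID, so ORIGIN_AIRPORT_ID is not shown. A sketch of selecting the group by name prefix instead, on a toy frame (the ORIGIN_AIRPORT_ID value is a placeholder, not real data):

```python
import pandas as pd

# Toy frame with a handful of the original column names.
toy = pd.DataFrame({
    "OP_CARRIER_FL_NUM": [3437],
    "ORIGIN_AIRPORT_ID": [0],          # placeholder value for illustration
    "ORIGIN_CITY_MARKET_ID": [32575],
    "ORIGIN": ["SNA"],
    "ORIGIN_WAC": [91],
    "DEST_AIRPORT_ID": [10397],
})

# Select every column whose name starts with 'ORIGIN', whatever its position.
origin_cols = [c for c in toy.columns if c.startswith("ORIGIN")]
df_origin_by_name = toy[origin_cols]
print(origin_cols)
```

Selecting by name keeps working if columns are added or reordered, which positional indexes do not guarantee.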


In [21]:
#Process: 2-Rename the located columns using 'rename'
df_winter.rename(columns={'ORIGIN_AIRPORT_ID':'OriginAirport_IDNum', 
                  'ORIGIN_CITY_MARKET_ID':'OriginCityMarket_IDNum', 
                  'ORIGIN':'Origin_IATA',
                  'ORIGIN_CITY_NAME':'OriginCityName',
                  'ORIGIN_STATE_ABR':'OriginState_ID',
                  'ORIGIN_STATE_NM':'OriginStateName',
                  'ORIGIN_WAC':'OriginWAC'}
          , inplace=True)

In [22]:
#Destination related columns
df_destination = df_winter.iloc[:,[17,18,19,20,21,22,23]]
df_destination.head(3)

Unnamed: 0,DEST_AIRPORT_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_NM,DEST_WAC
0,10397,30397,ATL,"Atlanta, GA",GA,Georgia,34
1,10397,30397,ATL,"Atlanta, GA",GA,Georgia,34
2,11697,32467,FLL,"Fort Lauderdale, FL",FL,Florida,33


In [23]:
df_winter.rename(columns={'DEST_AIRPORT_ID':'DestAirport_IDNum',
                  'DEST_CITY_MARKET_ID':'DestCityMarket_IDNum',
                  'DEST':'Dest_IATA',
                  'DEST_CITY_NAME':'DestCityName',
                  'DEST_STATE_ABR':'DestState_ID',
                  'DEST_STATE_NM':'DestStateName',
                  'DEST_WAC':'DestWAC'}
          , inplace=True)

In [24]:
#Departure and arrival related columns
df_dep_arr = df_winter.iloc[:,[24,25,26,27,28,29,31,32,33,34,35]]
df_dep_arr.head(3)

Unnamed: 0,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP
0,645,655.0,10.0,10.0,0.0,0.0,1339.0,-17.0,0.0,0.0,-2.0
1,700,656.0,-4.0,0.0,0.0,-1.0,951.0,-17.0,0.0,0.0,-2.0
2,1133,1129.0,-4.0,0.0,0.0,-1.0,1429.0,-20.0,0.0,0.0,-2.0


In [25]:
df_winter.rename(columns={'CRS_DEP_TIME':'CRSDepTime',
                  'DEP_TIME':'DepTime',
                  'DEP_DELAY':'DepDelayMin',
                  'DEP_DELAY_NEW':'DepDelayMin0',
                  'DEP_DEL15':'DepDelay_Ind15',
                  'DEP_DELAY_GROUP':'DepDelayGroup_Int15',
                  'CRS_ARR_TIME':'CRSArrTime',
                  'ARR_TIME':'ArrTime',
                  'ARR_DELAY':'ArrDelayMin',
                  'ARR_DELAY_NEW':'ArrDelayMin0',
                  'ARR_DEL15':'ArrDelay_Ind15',
                  'ARR_DELAY_GROUP':'ArrDelayGroup_Int15'}
          , inplace=True)

In [26]:
#Cancellations related columns
df_cancel = df_winter.iloc[:,[36,37,38]]
df_cancel.head(3)

Unnamed: 0,CANCELLED,CANCELLATION_CODE,DIVERTED
0,0.0,,0.0
1,0.0,,0.0
2,0.0,,0.0


In [27]:
df_winter.rename(columns={'CANCELLED':'Cancelled',
                  'CANCELLATION_CODE':'CancellationCode',
                  'DIVERTED':'Diverted'}
          , inplace=True)

In [28]:
#Other Flight & Delay related columns
df_other = df_winter.iloc[:,[39,40,41,42,43,44,45,46,47,48,49,50]]
df_other.head(3)

Unnamed: 0,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 50
0,251.0,224.0,200.0,1.0,1919.0,8,,,,,,
1,128.0,115.0,97.0,1.0,813.0,4,,,,,,
2,196.0,180.0,152.0,1.0,1069.0,5,,,,,,


In [29]:
df_winter.rename(columns={'CRS_ELAPSED_TIME':'CRSElapsedTimeMin',
                  'ACTUAL_ELAPSED_TIME':'ActualElapsedTimeMin',
                  'AIR_TIME':'FlightTimeMin',
                  'FLIGHTS':'NumberOfFlights',
                  'DISTANCE':'DistanceMil',
                  'DISTANCE_GROUP':'Distance_Int250Mil',
                  'CARRIER_DELAY':'CarrierDelayMin',
                  'WEATHER_DELAY':'WeatherDelayMin',
                  'NAS_DELAY':'NASDelayMin',
                  'SECURITY_DELAY':'SecurityDelayMin',
                  'LATE_AIRCRAFT_DELAY':'LateAircraftDelay'}
          , inplace=True)

In [30]:
#New columns names
df_winter.columns

Index(['Year', 'Quarter', 'Month', 'DayOfMonth', 'DayOfWeek', 'FlightDate',
       'UniqueCarrier', 'Carrier', 'RegistrationNum', 'FlightNum',
       'OriginAirport_IDNum', 'OriginCityMarket_IDNum', 'Origin_IATA',
       'OriginCityName', 'OriginState_ID', 'OriginStateName', 'OriginWAC',
       'DestAirport_IDNum', 'DestCityMarket_IDNum', 'Dest_IATA',
       'DestCityName', 'DestState_ID', 'DestStateName', 'DestWAC',
       'CRSDepTime', 'DepTime', 'DepDelayMin', 'DepDelayMin0',
       'DepDelay_Ind15', 'DepDelayGroup_Int15', 'CRSArrTime', 'ArrTime',
       'ArrDelayMin', 'ArrDelayMin0', 'ArrDelay_Ind15', 'ArrDelayGroup_Int15',
       'Cancelled', 'CancellationCode', 'Diverted', 'CRSElapsedTimeMin',
       'ActualElapsedTimeMin', 'FlightTimeMin', 'NumberOfFlights',
       'DistanceMil', 'Distance_Int250Mil', 'CarrierDelayMin',
       'WeatherDelayMin', 'NASDelayMin', 'SecurityDelayMin',
       'LateAircraftDelay', 'Unnamed: 50'],
      dtype='object')

##### Changing Date Format: DepTime and ArrTime

In [31]:
df_winter[['DepTime', 'ArrTime', 'FlightDate']].dtypes

DepTime       float64
ArrTime       float64
FlightDate     object
dtype: object

In [32]:
df_time = df_winter[['DepTime', 'ArrTime', 'FlightDate']]
df_time.head(5)

Unnamed: 0,DepTime,ArrTime,FlightDate
0,655.0,1339.0,2018-12-06
1,656.0,951.0,2018-12-06
2,1129.0,1429.0,2018-12-06
3,724.0,944.0,2018-12-06
4,1034.0,1300.0,2018-12-06


As we can see, both DepTime and ArrTime are floats, and FlightDate is a string. We need to combine each time with the date and convert the result into a datetime (Timestamp) format.

In [33]:
#We need functions that convert the float times (hhmm) into a readable format ("HH:MM").

def Deptime_to_String(deptime):
   
    #int() keeps only the integer part of the division, which is the hour
    #The modulus (% 24) turns an hour of 24 into 0, so 2400 is returned as '00:00'
    dephour = int(deptime / 100) % 24
    depmin = int(deptime % 100)

    return '%02d:%02d' % (dephour, depmin)

def Arrtime_to_String(arrtime):
    
    #int() keeps only the integer part of the division, which is the hour
    #The modulus (% 24) turns an hour of 24 into 0, so 2400 is returned as '00:00'
    arrhour = int(arrtime / 100) % 24 
    arrmin = int(arrtime % 100)

    return '%02d:%02d' % (arrhour, arrmin)

In [34]:
#We test the functions with the first flight shown above
Deptime_to_String(655.0), Arrtime_to_String(1339.0)

('06:55', '13:39')

In [35]:
#On previous tries I got this error: "ValueError: cannot convert float NaN to integer"
#Therefore, I chose to skip the missing values with dropna before applying the functions
deptime = df_winter['DepTime'].dropna().apply(Deptime_to_String)
arrtime = df_winter['ArrTime'].dropna().apply(Arrtime_to_String)
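Since the departure and arrival functions do the same work, a single NaN-aware helper is one possible alternative. This is a sketch, not the notebook's method; the name `hhmm_to_string` is hypothetical.

```python
import math

def hhmm_to_string(hhmm):
    """Convert a numeric hhmm time such as 655.0 into 'HH:MM'; return None for NaN."""
    if hhmm is None or math.isnan(hhmm):
        return None
    hours = int(hhmm) // 100 % 24   # an hour of 24 wraps around to 0
    minutes = int(hhmm) % 100
    return f"{hours:02d}:{minutes:02d}"

print(hhmm_to_string(655.0), hhmm_to_string(2400.0), hhmm_to_string(float("nan")))
# prints: 06:55 00:00 None
```

Because missing values are handled inside the function, it could in principle be applied to a column without calling dropna first.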

In [36]:
#Let's create 2 new columns with the time in the correct format
df_winter['DepTime2'] = deptime
df_winter['ArrTime2'] = arrtime

In [37]:
#Test
df_winter['DepTime2'].sample()

650869    18:09
Name: DepTime2, dtype: object

In [38]:
#Test
df_winter['ArrTime2'].sample()

1403985    08:22
Name: ArrTime2, dtype: object

Now we have two new columns, DepTime2 and ArrTime2, containing the departure and arrival times in the correct format. Next, we will generate a new column for each of them in Timestamp format ('YYYY-MM-DD HH:MM:SS').

With 'to_datetime' we can combine two columns, for example 'FlightDate' (date) and 'DepTime2' (time), into a single Timestamp stored in a new column.

In [39]:
#We put the date before the time so that to_datetime can parse the combined string
dep_datetime = pd.to_datetime(df_winter['FlightDate']+' '+df_winter['DepTime2'])
df_winter['DepDateTime'] = dep_datetime
df_winter['DepDateTime'].sample(5) #Test

972382    2019-01-25 18:09:00
1449252   2019-02-26 15:45:00
10722     2018-12-30 18:54:00
798597    2019-01-15 14:33:00
1082227   2019-01-27 22:03:00
Name: DepDateTime, dtype: datetime64[ns]

In [40]:
arr_datetime = pd.to_datetime(df_winter['FlightDate']+' '+df_winter['ArrTime2'])
df_winter['ArrDateTime'] = arr_datetime
df_winter['ArrDateTime'].sample(5) #Test

593934    2019-01-16 06:59:00
331185    2018-12-21 12:53:00
326651    2018-12-26 08:37:00
1417803   2019-02-20 12:32:00
996251    2019-01-13 22:53:00
Name: ArrDateTime, dtype: datetime64[ns]
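The following toy example sketches what happens to flights without a recorded time in this step: the concatenated string is missing, and 'to_datetime' returns NaT for it. The explicit `format` argument is an addition of this sketch, not something the cells above use.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FlightDate": ["2018-12-06", "2018-12-06"],
    "DepTime2": ["06:55", np.nan],   # the second flight has no departure time
})

# Date first, then time; a missing time makes the whole string missing.
combined = toy["FlightDate"] + " " + toy["DepTime2"]
toy["DepDateTime"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")
print(toy["DepDateTime"])
```

This is consistent with the null counts later in the notebook, where DepDateTime has as many nulls as DepTime.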

After having created both columns, let's drop the redundant columns: DepTime, DepTime2, ArrTime, ArrTime2 and FlightDate.

In [41]:
df = df_winter.drop(['DepTime', 'DepTime2', 'ArrTime', 'ArrTime2', 'FlightDate'], axis=1)

In [42]:
#Last check
df_time = df_winter[['DepDateTime', 'ArrDateTime']]
df_time.head()

Unnamed: 0,DepDateTime,ArrDateTime
0,2018-12-06 06:55:00,2018-12-06 13:39:00
1,2018-12-06 06:56:00,2018-12-06 09:51:00
2,2018-12-06 11:29:00,2018-12-06 14:29:00
3,2018-12-06 07:24:00,2018-12-06 09:44:00
4,2018-12-06 10:34:00,2018-12-06 13:00:00


In [43]:
df_winter[['DepDateTime', 'ArrDateTime']].dtypes

DepDateTime    datetime64[ns]
ArrDateTime    datetime64[ns]
dtype: object

Excellent! We now have both departure and arrival times in datetime format.

### Treating Null Values

In [44]:
#Which columns have the most null values?
null_values = df_winter.isna().sum().sort_values(ascending=False)
null_values

Unnamed: 50               1711002
CancellationCode          1672269
CarrierDelayMin           1377621
LateAircraftDelay         1377621
SecurityDelayMin          1377621
NASDelayMin               1377621
WeatherDelayMin           1377621
FlightTimeMin               42988
ArrDelayMin0                42988
ArrDelayMin                 42988
ActualElapsedTimeMin        42988
ArrDelay_Ind15              42988
ArrDelayGroup_Int15         42988
ArrTime                     39791
ArrTime2                    39791
ArrDateTime                 39791
DepDelayGroup_Int15         37708
DepDelay_Ind15              37708
DepDelayMin                 37708
DepDelayMin0                37708
DepTime                     37701
DepDateTime                 37701
DepTime2                    37701
RegistrationNum              5034
CRSElapsedTimeMin             134
UniqueCarrier                   0
Origin_IATA                     0
OriginCityMarket_IDNum          0
OriginAirport_IDNum             0
FlightNum     

In [45]:
Cancellation_NullRate = (1672269/1711002)*100
DelayReason_NullRate = (1377621/1711002)*100
DepDelay_NullRate = (37708/1711002)*100
ArrDelay_NullRate = (42988/1711002)*100
Cancellation_NullRate, DelayReason_NullRate, DepDelay_NullRate, ArrDelay_NullRate

(97.73623876535503, 80.51545234897446, 2.2038548172357486, 2.512445923499797)

Leaving aside the completely empty 'Unnamed: 50' column, the 'CancellationCode' column has the highest number of null values: 1672269, or 97.74% of the rows. The reason is probably that a flight that was not cancelled (whether it was early, on time or delayed) has no cancellation code.

The delay causes ('CarrierDelayMin', 'WeatherDelayMin', 'NASDelayMin', 'SecurityDelayMin', 'LateAircraftDelay') also have a high number of null values, 1377621 each, or 80.52%.

Let's keep in mind that the dataframe has 1711002 rows, and that the most interesting columns for us (the ones specifically related to delays in departures and arrivals) have between 37708 ('DepDelayMin') and 42988 ('ArrDelayMin') null values, that is, between 2.20% and 2.51%.
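The percentages above are computed from counts typed in by hand. As a sketch of an alternative, the null rate of every column can be derived directly from the data, since the mean of a boolean mask is the fraction of True values. It is shown here on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame: 4 flights, with made-up values for illustration only.
toy = pd.DataFrame({
    "DepDelayMin": [10.0, -4.0, np.nan, 0.0],
    "CancellationCode": [np.nan, np.nan, "B", np.nan],
    "Carrier": ["OO", "NK", "B6", "WN"],
})

# isna() gives True/False per cell; mean() per column is the share of nulls.
null_rate_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(null_rate_pct)
```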

In [57]:
five_percent = (1711002*5)/100
Threshold = 1711002 - five_percent
five_percent, Threshold

(85550.1, 1625451.9)

Therefore, we will drop the columns with more than 5% of null values, which is equivalent to more than 85550.1 nulls. This threshold removes the columns with more than 80% of null values, which won't contribute to the analysis, while keeping the delay columns (at most 2.51% of nulls).  
Threshold = 1711002 - 85550.1 = 1625451.9, rounded up to 1625452

In [58]:
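#In dropna(), 'thresh' is the minimum number of non-null values a column needs in order to be kept
#Note: this starts again from df_winter, so the time columns dropped into 'df' earlier are still present here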
df = df_winter.dropna(axis=1, thresh=1625452)

In [59]:
df.shape

(1711002, 48)
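The threshold arithmetic can be sketched as a small helper that also reports which columns were removed. The name `drop_sparse_columns` and the toy data are hypothetical. With the notebook's numbers (1711002 rows, 5%), the same arithmetic gives 1711002 - 85550 = 1625452.

```python
import math
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_null_fraction=0.05):
    """Drop columns whose share of nulls exceeds max_null_fraction."""
    max_nulls = math.floor(len(df) * max_null_fraction)
    # dropna's thresh is the minimum number of non-null values a column must have.
    min_non_null = len(df) - max_nulls
    kept = df.dropna(axis=1, thresh=min_non_null)
    dropped = [c for c in df.columns if c not in kept.columns]
    return kept, dropped

# Toy frame with 8 rows: A has no nulls, B has 2 (25%), C has 6 (75%).
toy = pd.DataFrame({
    "A": range(8),
    "B": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0],
    "C": [np.nan] * 6 + [1.0, 2.0],
})
kept, dropped = drop_sparse_columns(toy, max_null_fraction=0.25)
print(dropped)
```

Returning the dropped names makes it easy to document exactly what was removed.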

In [49]:
df.isna().sum().sort_values(ascending=False)

FlightTimeMin             42988
ArrDelayMin               42988
ArrDelayMin0              42988
ArrDelay_Ind15            42988
ArrDelayGroup_Int15       42988
ActualElapsedTimeMin      42988
ArrDateTime               39791
ArrTime                   39791
ArrTime2                  39791
DepDelayMin0              37708
DepDelayGroup_Int15       37708
DepDelay_Ind15            37708
DepDelayMin               37708
DepDateTime               37701
DepTime2                  37701
DepTime                   37701
RegistrationNum            5034
CRSElapsedTimeMin           134
OriginCityName                0
Origin_IATA                   0
OriginCityMarket_IDNum        0
OriginAirport_IDNum           0
FlightNum                     0
DayOfMonth                    0
Carrier                       0
UniqueCarrier                 0
OriginStateName               0
Quarter                       0
Month                         0
FlightDate                    0
DayOfWeek                     0
OriginSt

Dropped columns: 'Unnamed: 50', 'CancellationCode', 'CarrierDelayMin', 'LateAircraftDelay', 'SecurityDelayMin', 'NASDelayMin', 'WeatherDelayMin'

# 03. Exporting the created Dataset
We will export the created dataset to a CSV file, stored in the Data folder, so that it can be reused later; in my case, I will be using it for visualizations in Tableau.

In [51]:
#We will use the 'to_csv' function
df.to_csv(r'C:\Users\GerardEspejo\Desktop\TFM\Data\Winter_Season2.csv')
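As a sketch of how the exported file could be read back later, the example below round-trips a toy frame through an in-memory buffer that stands in for the CSV file. Passing `index=False` and `parse_dates` is a choice of this sketch, not something the export cell above does.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Carrier": ["OO", "NK"],
    "DepDateTime": pd.to_datetime(["2018-12-28 06:55", "2018-12-22 18:09"]),
})

buffer = io.StringIO()            # stands in for the CSV file on disk
toy.to_csv(buffer, index=False)   # index=False avoids writing an extra unnamed index column
buffer.seek(0)

# CSV stores datetimes as text, so they have to be parsed again when reading.
restored = pd.read_csv(buffer, parse_dates=["DepDateTime"])
print(restored.dtypes)
```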