General Instructions

Please submit the description of the execution plan, the code, the output, and any comments.

You can use any tool, programming language, or library you wish. Please include any dependencies and instructions required to recreate the output.

Data

The datasets that you will work with are located at
https://www.kaggle.com/usdot/flight-delays
Flights.csv contains flight data regarding 2015 US flights. Each row can be identified by (YEAR, MONTH, DAY, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, SCHEDULED_DEPARTURE) 

For the purposes of this exercise, we will assume that all times are in the same time zone.

Tasks

Task 1: We would like you to left join flights with airlines and airports using their respective IATA code. Please describe the resulting dataset ‘flights_extended’: Number of rows, null values if any. Also, please describe any cleaning processes you may find useful or necessary.

Task 2: We would like to perform an analysis in the top 10 airports in terms of departure delay. Please create a metric to rank each airport according to the average number of aircraft that departed from that airport having a DEPARTURE_DELAY > 15 mins. Please describe if such a metric would be efficient to compare airports and include any suggestion to improve such a comparison.

Task 3: We would like to find the association, if any, between these top 10 airports and the aircraft that had no previous arrival delay (ARRIVAL_DELAY < 15) on a given day but they had arrival delay > 15 mins as soon as they departed from these airports. Please create any metrics and plots and use any technique you deem necessary to indicate the potential existence of such a phenomenon.
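
For Task 1, the left joins could be sketched roughly as follows. This is a minimal illustration on tiny in-memory stand-ins for flights.csv, airlines.csv and airports.csv, assuming the lookup files expose an `IATA_CODE` column; `AIRLINE_NAME` and the unmatched code `XXX` are hypothetical names introduced only for this example. The `ORIG_`/`DEST_` prefixes mirror the column names that appear in the cleaned data further down.

```python
import pandas as pd

# Tiny stand-ins for the three CSV files (a handful of rows only).
flights = pd.DataFrame({
    "AIRLINE": ["AS", "AA", "US"],
    "ORIGIN_AIRPORT": ["ANC", "LAX", "XXX"],  # "XXX" is a made-up code with no match
    "DESTINATION_AIRPORT": ["SEA", "PBI", "CLT"],
})
airlines = pd.DataFrame({
    "IATA_CODE": ["AS", "AA", "US"],
    "AIRLINE": ["Alaska Airlines Inc.", "American Airlines Inc.", "US Airways Inc."],
})
airports = pd.DataFrame({
    "IATA_CODE": ["ANC", "LAX", "SEA", "PBI", "CLT"],
    "AIRPORT": [
        "Ted Stevens Anchorage International Airport",
        "Los Angeles International Airport",
        "Seattle-Tacoma International Airport",
        "Palm Beach International Airport",
        "Charlotte Douglas International Airport",
    ],
    "CITY": ["Anchorage", "Los Angeles", "Seattle", "West Palm Beach", "Charlotte"],
})

# Rename so the airline code is the join key and the full name gets its own column.
airlines = airlines.rename(columns={"IATA_CODE": "AIRLINE", "AIRLINE": "AIRLINE_NAME"})
flights_extended = flights.merge(airlines, on="AIRLINE", how="left", validate="many_to_one")

# Join the airport lookup twice: once for the origin, once for the destination.
for prefix, key in [("ORIG_", "ORIGIN_AIRPORT"), ("DEST_", "DESTINATION_AIRPORT")]:
    lookup = airports.add_prefix(prefix)
    flights_extended = flights_extended.merge(
        lookup, left_on=key, right_on=prefix + "IATA_CODE",
        how="left", validate="many_to_one",
    ).drop(columns=prefix + "IATA_CODE")

# Describe the result: row count and columns that contain nulls.
n_rows = len(flights_extended)
null_counts = flights_extended.isna().sum()
print(n_rows)
print(null_counts[null_counts > 0])
```

With `validate="many_to_one"` pandas raises an error if a lookup key is duplicated, so a left join cannot silently add rows to the flights table. Codes without a match show up as nulls in the joined columns.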

In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import datetime

In [7]:
df=pd.read_csv('cleaned data.zip', compression='zip',low_memory=False)

Task 2: We would like to perform an analysis in the top 10 airports in terms of departure delay. 

Create a metric to rank each airport according to the average number of aircraft that departed from that airport having a DEPARTURE_DELAY > 15 mins. 

Please describe if such a metric would be efficient to compare airports and include any suggestion to improve such a comparison.

In [8]:
# Flag flights that departed more than 15 minutes late (1 = delayed, 0 = otherwise).
df['DELAY15'] = (df['DEPARTURE_DELAY'] > 15).astype(int)

# Number of delayed departures per airport, sorted from most to fewest.
aggregated = df.groupby(['ORIGIN_AIRPORT', 'ORIG_AIRPORT'])['DELAY15'].sum().reset_index(name='count_delayed').sort_values('count_delayed', ascending=False)
# Total departures per airport.
a = df.groupby('ORIGIN_AIRPORT')['ORIGIN_AIRPORT'].count().reset_index(name='total_flights')
# Share of departures delayed by more than 15 minutes.
b = df.groupby('ORIGIN_AIRPORT')['DELAY15'].mean().reset_index(name='delayed_flights_15m%')
aggregated = aggregated.merge(a, on='ORIGIN_AIRPORT')
aggregated = aggregated.merge(b, on='ORIGIN_AIRPORT')

#Ranking airports by the absolute number of delayed flights would be biased towards the busiest airports

In [10]:
aggregated.head(10)

Unnamed: 0,ORIGIN_AIRPORT,ORIG_AIRPORT,count_delayed,total_flights,delayed_flights_15m%
0,ORD,Chicago O'Hare International Airport,65125,276554,0.235487
1,ATL,Hartsfield-Jackson Atlanta International Airport,58902,343506,0.171473
2,DFW,Dallas/Fort Worth International Airport,49384,232647,0.21227
3,DEN,Denver International Airport,42432,193402,0.219398
4,LAX,Los Angeles International Airport,39099,192003,0.203637
5,IAH,George Bush Intercontinental Airport,29472,144019,0.20464
6,SFO,San Francisco International Airport,29322,145491,0.201538
7,LAS,McCarran International Airport,27453,131937,0.208077
8,PHX,Phoenix Sky Harbor International Airport,26186,145552,0.179908
9,EWR,Newark Liberty International Airport,22605,98341,0.229863


#On the other hand, ranking by the share of delayed flights is biased towards small airports, which have low volume and probably a secondary role in the 

In [12]:
aggregated.sort_values(['delayed_flights_15m%'], ascending=False).head(10)

Unnamed: 0,ORIGIN_AIRPORT,ORIG_AIRPORT,count_delayed,total_flights,delayed_flights_15m%
303,ADK,Adak Airport,36,88,0.409091
310,GST,Gustavus Airport,29,76,0.381579
305,ILG,Wilmington Airport,35,95,0.368421
312,STC,St. Cloud Regional Airport,24,77,0.311688
285,MVY,Martha's Vineyard Airport,63,205,0.307317
300,UST,Northeast Florida Regional Airport (St. August...,44,144,0.305556
275,OTH,Southwest Oregon Regional Airport (North Bend ...,74,264,0.280303
293,CEC,Del Norte County Airport (Jack McNamara Field),48,173,0.277457
271,PBG,Plattsburgh International Airport,76,278,0.273381
103,ASE,Aspen-Pitkin County Airport,882,3263,0.270303
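
One possible way to soften both biases is to rank by the delayed share but only among airports with a minimum number of departures. The sketch below uses rows copied from the two tables above; the cut-off `MIN_FLIGHTS = 1000` and the column name `delayed_share` are hypothetical choices made for illustration, not values taken from the analysis.

```python
import pandas as pd

# Rows copied from the two tables above (delayed and total departures per airport).
sample = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ORD", "ATL", "EWR", "ADK", "GST", "ASE"],
    "count_delayed": [65125, 58902, 22605, 36, 29, 882],
    "total_flights": [276554, 343506, 98341, 88, 76, 3263],
})
sample["delayed_share"] = sample["count_delayed"] / sample["total_flights"]

MIN_FLIGHTS = 1000  # hypothetical cut-off, an assumption for this sketch

# Keep only airports with enough departures, then rank by delayed share.
ranked = (
    sample[sample["total_flights"] >= MIN_FLIGHTS]
    .sort_values("delayed_share", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```

With this cut-off ADK and GST drop out because they have fewer than 100 departures each, and the remaining airports are compared on their rate rather than their raw count. The threshold itself is a judgment call and would need to be justified against the distribution of `total_flights`.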


#We will use the median of DEPARTURE_DELAY over flights with an actual delay (the variable is negative when the aircraft left early). In turn we will multiply with the average 

In [14]:
c=df[df['DEPARTURE_DELAY']>0].groupby(['ORIGIN_AIRPORT'])['DEPARTURE_DELAY'].median().reset_index(name='median_actual_delay')
aggregated=aggregated.merge(c,on='ORIGIN_AIRPORT')


In [15]:
aggregated.sort_values(['median_actual_delay','delayed_flights_15m%'], ascending=False).head(10)

Unnamed: 0,ORIGIN_AIRPORT,ORIG_AIRPORT,count_delayed,total_flights,delayed_flights_15m%,median_actual_delay
312,STC,St. Cloud Regional Airport,24,77,0.311688,52.0
265,PLN,Pellston Regional Airport of Emmet County,85,730,0.116438,48.0
293,CEC,Del Norte County Airport (Jack McNamara Field),48,173,0.277457,38.5
281,APN,Alpena County Regional Airport,69,546,0.126374,38.0
289,ESC,Delta County Airport,53,554,0.095668,37.0
301,MMH,Mammoth Yosemite Airport,37,140,0.264286,36.0
175,ACV,Arcata Airport,280,1268,0.22082,36.0
287,INL,Falls International Airport,53,556,0.095324,34.0
275,OTH,Southwest Oregon Regional Airport (North Bend ...,74,264,0.280303,31.0
302,PUB,Pueblo Memorial Airport,36,253,0.142292,31.0
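
The table above sorts by median delay first and uses the delayed share only to break ties. As a rough sketch of folding the two columns into a single score instead, the example below multiplies them. It assumes, without confirmation from the analysis, that the intended product is `median_actual_delay` times `delayed_flights_15m%`; the column name `delay_score` is hypothetical. The rows are copied from the table above.

```python
import pandas as pd

# Rows copied from the table above.
sample = pd.DataFrame({
    "ORIGIN_AIRPORT": ["STC", "PLN", "CEC", "ACV"],
    "delayed_flights_15m%": [0.311688, 0.116438, 0.277457, 0.22082],
    "median_actual_delay": [52.0, 48.0, 38.5, 36.0],
})

# Assumed composite: share of flights delayed > 15 min times median delay (minutes).
sample["delay_score"] = sample["delayed_flights_15m%"] * sample["median_actual_delay"]

ranked = sample.sort_values("delay_score", ascending=False).reset_index(drop=True)
print(ranked[["ORIGIN_AIRPORT", "delay_score"]])
```

Under this assumed score PLN falls from second place to last among the four airports, because its high median delay applies to a comparatively small share of its flights.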


In [16]:
df

Unnamed: 0.1,Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIRLINE,ORIG_AIRPORT,ORIG_CITY,ORIG_STATE,ORIG_COUNTRY,ORIG_LATITUDE,ORIG_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE,delay,Div/Canc,DATE,DELAY15
0,0,2015,1,1,4,98,N407AS,ANC,SEA,00:05:00,23:54:00,-11.0,21.0,00:15:00,205.0,194.0,169.0,1448,04:04:00,4.0,04:30:00,04:08:00,-22.0,0,0,0,Alaska Airlines Inc.,Ted Stevens Anchorage International Airport,Anchorage,AK,USA,61.17432,-149.99619,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,0,0,2015-01-01,0
1,1,2015,1,1,4,2336,N3KUAA,LAX,PBI,00:10:00,00:02:00,-8.0,12.0,00:14:00,280.0,279.0,263.0,2330,07:37:00,4.0,07:50:00,07:41:00,-9.0,0,0,0,American Airlines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Palm Beach International Airport,West Palm Beach,FL,USA,26.68316,-80.09559,0,0,2015-01-01,0
2,2,2015,1,1,4,840,N171US,SFO,CLT,00:20:00,00:18:00,-2.0,16.0,00:34:00,286.0,293.0,266.0,2296,08:00:00,11.0,08:06:00,08:11:00,5.0,0,0,0,US Airways Inc.,San Francisco International Airport,San Francisco,CA,USA,37.61900,-122.37484,Charlotte Douglas International Airport,Charlotte,NC,USA,35.21401,-80.94313,1,0,2015-01-01,0
3,3,2015,1,1,4,258,N3HYAA,LAX,MIA,00:20:00,00:15:00,-5.0,15.0,00:30:00,285.0,281.0,258.0,2342,07:48:00,8.0,08:05:00,07:56:00,-9.0,0,0,0,American Airlines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Miami International Airport,Miami,FL,USA,25.79325,-80.29056,0,0,2015-01-01,0
4,4,2015,1,1,4,135,N527AS,SEA,ANC,00:25:00,00:24:00,-1.0,11.0,00:35:00,235.0,215.0,199.0,1448,02:54:00,5.0,03:20:00,02:59:00,-21.0,0,0,0,Alaska Airlines Inc.,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,Ted Stevens Anchorage International Airport,Anchorage,AK,USA,61.17432,-149.99619,0,0,2015-01-01,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5714003,5819074,2015,12,31,4,688,N657JB,LAX,BOS,23:59:00,23:55:00,-4.0,22.0,00:17:00,320.0,298.0,272.0,2611,07:49:00,4.0,08:19:00,07:53:00,-26.0,0,0,0,JetBlue Airways,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Gen. Edward Lawrence Logan International Airport,Boston,MA,USA,42.36435,-71.00518,0,0,2015-12-31,0
5714004,5819075,2015,12,31,4,745,N828JB,JFK,PSE,23:59:00,23:55:00,-4.0,17.0,00:12:00,227.0,215.0,195.0,1617,04:27:00,3.0,04:46:00,04:30:00,-16.0,0,0,0,JetBlue Airways,John F. Kennedy International Airport (New Yor...,New York,NY,USA,40.63975,-73.77893,Mercedita Airport,Ponce,PR,USA,18.00830,-66.56301,0,0,2015-12-31,0
5714005,5819076,2015,12,31,4,1503,N913JB,JFK,SJU,23:59:00,23:50:00,-9.0,17.0,00:07:00,221.0,222.0,197.0,1598,04:24:00,8.0,04:40:00,04:32:00,-8.0,0,0,0,JetBlue Airways,John F. Kennedy International Airport (New Yor...,New York,NY,USA,40.63975,-73.77893,Luis Muñoz Marín International Airport,San Juan,PR,USA,18.43942,-66.00183,0,0,2015-12-31,0
5714006,5819077,2015,12,31,4,333,N527JB,MCO,SJU,23:59:00,23:53:00,-6.0,10.0,00:03:00,161.0,157.0,144.0,1189,03:27:00,3.0,03:40:00,03:30:00,-10.0,0,0,0,JetBlue Airways,Orlando International Airport,Orlando,FL,USA,28.42889,-81.31603,Luis Muñoz Marín International Airport,San Juan,PR,USA,18.43942,-66.00183,0,0,2015-12-31,0
