General Instructions

Please submit the description of the execution plan, the code, the output, and any comments.

You can use any tool, programming language, or library you wish. Please include any dependencies and instructions required to recreate the output.

Data

The datasets that you will work with are located at
https://www.kaggle.com/usdot/flight-delays
Flights.csv contains data on 2015 US flights. Each row can be identified by (YEAR, MONTH, DAY, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, SCHEDULED_DEPARTURE).

For the purposes of this exercise, we will assume that all times are in the same time zone.

Tasks

Task 1: We would like you to left join flights with airlines and airports using their respective IATA code. Please describe the resulting dataset ‘flights_extended’: Number of rows, null values if any. Also, please describe any cleaning processes you may find useful or necessary.

Task 2: We would like to perform an analysis of the top 10 airports in terms of departure delay. Please create a metric to rank each airport according to the average number of aircraft that departed from that airport with DEPARTURE_DELAY > 15 mins. Please describe whether such a metric would be effective for comparing airports and include any suggestions to improve such a comparison.

Task 3: We would like to find the association, if any, between these top 10 airports and the aircraft that had no previous arrival delay (ARRIVAL_DELAY < 15) on a given day but had an arrival delay > 15 mins as soon as they departed from these airports. Please create any metrics and plots and use any technique you deem necessary to indicate the potential existence of such a phenomenon.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import datetime

In [2]:
#Functions
def get_dataframe_info(df):
    """
    input
       DataFrame
    output
       DataFrame info/description, sorted by non-null count (descending)
    Warning: the table must contain at least 1 numeric column, and dtype must not be set to unicode during import
    """

    df_types = pd.DataFrame(df.dtypes)
    df_non_null = df.count()  # count() returns the number of non-null values per column
    df_described = df.describe().T  # numeric columns only; other columns get NaN after the concat
    df_overview = pd.concat([df_types, df_non_null, df_described], axis=1).reset_index()

    # Reassign column names
    col_names = ['features', 'types', 'non_null_counts','count','mean','std','min','25%','50%','75%','max']
    df_overview.columns = col_names
    df_overview['%nulls'] = (df.shape[0] - df_overview['non_null_counts']) * 100 / df.shape[0]

    # Sort by completeness and drop the statistics that are not needed
    df_overview = df_overview.sort_values(by=["non_null_counts"], ascending=False).drop(columns=['count','25%','50%','75%'])

    return df_overview
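
As the warning in the docstring notes, the helper relies on `describe()` finding at least one numeric column. A minimal sketch of a null summary that does not have that restriction is shown below; `null_summary` and the toy frame are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Per-column dtype, non-null count and percentage of nulls (hypothetical helper)."""
    summary = pd.DataFrame({
        "types": df.dtypes.astype(str),
        "non_null_counts": df.count(),
        "%nulls": df.isna().mean() * 100,
    })
    # Most complete columns first, mirroring the sort in get_dataframe_info
    return summary.sort_values("non_null_counts", ascending=False)

# Toy data, not taken from flights.csv
toy = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "UA", "AA"],
    "DEPARTURE_DELAY": [5.0, np.nan, 20.0, np.nan],
})
toy_summary = null_summary(toy)
print(toy_summary)
```

This version works on frames with only object columns, at the cost of dropping the mean/std/min/max statistics.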

In [3]:
airports=pd.read_csv('airports.csv')
flights=pd.read_csv('flights.csv')
airlines=pd.read_csv('airlines.csv')

In [4]:
#get_dataframe_info(airports) # 2 airports have missing latitude/longitude
#airlines.info() #no missing values
get_dataframe_info(flights) # missing values on multiple columns

Unnamed: 0,features,types,non_null_counts,mean,std,min,max,%nulls
0,YEAR,int64,5819079,2015.0,0.0,2015.0,2015.0,0.0
1,MONTH,int64,5819079,6.524085,3.405137,1.0,12.0,0.0
2,DAY,int64,5819079,15.704594,8.783425,1.0,31.0,0.0
3,DAY_OF_WEEK,int64,5819079,3.926941,1.988845,1.0,7.0,0.0
4,AIRLINE,object,5819079,,,,,0.0
5,FLIGHT_NUMBER,int64,5819079,2173.092742,1757.063999,1.0,9855.0,0.0
7,ORIGIN_AIRPORT,object,5819079,,,,,0.0
8,DESTINATION_AIRPORT,object,5819079,,,,,0.0
9,SCHEDULED_DEPARTURE,int64,5819079,1329.60247,483.751821,1.0,2359.0,0.0
20,SCHEDULED_ARRIVAL,int64,5819079,1493.808249,507.164696,1.0,2400.0,0.0


In [5]:
'''MERGING DATASETS'''
flights_airlines_temp=flights.merge(airlines,how='left',left_on='AIRLINE',right_on='IATA_CODE').drop(columns=['AIRLINE_x','IATA_CODE']).rename(columns={"AIRLINE_y": "AIRLINE"})

flights_extended_temp=flights_airlines_temp.merge(airports,how='left',left_on='ORIGIN_AIRPORT',right_on='IATA_CODE').drop(columns=['IATA_CODE']).rename(columns={"AIRPORT": "ORIG_AIRPORT",'CITY':'ORIG_CITY','STATE':'ORIG_STATE','COUNTRY':'ORIG_COUNTRY','LATITUDE':'ORIG_LATITUDE','LONGITUDE':'ORIG_LONGITUDE'})

flights_extended=flights_extended_temp.merge(airports,how='left',left_on='DESTINATION_AIRPORT',right_on='IATA_CODE').drop(columns=['IATA_CODE']).rename(columns={"AIRPORT": "DEST_AIRPORT",'CITY':'DEST_CITY','STATE':'DEST_STATE','COUNTRY':'DEST_COUNTRY','LATITUDE':'DEST_LATITUDE','LONGITUDE':'DEST_LONGITUDE'})

del flights_airlines_temp,flights_extended_temp

get_dataframe_info(flights_extended)


Unnamed: 0,features,types,non_null_counts,mean,std,min,max,%nulls
0,YEAR,int64,5819079,2015.0,0.0,2015.0,2015.0,0.0
8,SCHEDULED_DEPARTURE,int64,5819079,1329.60247,483.751821,1.0,2359.0,0.0
19,SCHEDULED_ARRIVAL,int64,5819079,1493.808249,507.164696,1.0,2400.0,0.0
16,DISTANCE,int64,5819079,822.356495,607.784287,21.0,4983.0,0.0
23,CANCELLED,int64,5819079,0.015446,0.12332,0.0,1.0,0.0
30,AIRLINE,object,5819079,,,,,0.0
22,DIVERTED,int64,5819079,0.00261,0.05102,0.0,1.0,0.0
1,MONTH,int64,5819079,6.524085,3.405137,1.0,12.0,0.0
7,DESTINATION_AIRPORT,object,5819079,,,,,0.0
6,ORIGIN_AIRPORT,object,5819079,,,,,0.0


----------

NULL:

Total rows=5819079
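
A left join should leave the row count of the left table unchanged, provided the key on the right side is unique. One way to check this is to let pandas validate the join and flag unmatched keys. The sketch below uses toy frames (`toy_flights` and `toy_airports` are hypothetical names with made-up rows) rather than the real data; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for flights.csv and airports.csv
toy_flights = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA"],
    "ORIGIN_AIRPORT": ["LAX", "10397", "SFO"],
})
toy_airports = pd.DataFrame({
    "IATA_CODE": ["LAX", "SFO"],
    "AIRPORT": ["Los Angeles International Airport",
                "San Francisco International Airport"],
})

merged = toy_flights.merge(
    toy_airports,
    how="left",
    left_on="ORIGIN_AIRPORT",
    right_on="IATA_CODE",
    validate="many_to_one",  # raises MergeError if IATA_CODE is duplicated
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# The left join must not add or remove rows
assert len(merged) == len(toy_flights)

# Rows whose airport code found no match in the lookup table
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched)
```

Counting the `left_only` rows gives the number of flights whose airport columns will be null after the join.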

A) CANCELLATION_REASON has 98.45% nulls. It makes sense that a cancelled flight has nulls in its arrival or departure fields; the same applies to diverted flights.

In [6]:
temp=flights_extended.copy()
a=get_dataframe_info(temp[temp['DEPARTURE_DELAY'].isnull()])
a.set_index('features', inplace=True)
a.loc[['CANCELLATION_REASON','DEPARTURE_DELAY','ARRIVAL_DELAY','DIVERTED']]

features,types,non_null_counts,mean,std,min,max,%nulls
CANCELLATION_REASON,object,86153,,,,,0.0
DEPARTURE_DELAY,float64,0,,,,,100.0
ARRIVAL_DELAY,float64,0,,,,,100.0
DIVERTED,int64,86153,0.0,0.0,0.0,0.0,0.0


In [7]:
print('For all the cases where "departure delay" is null, the cancellation reason has actual values:')
print(temp[temp['DEPARTURE_DELAY'].isnull()]['CANCELLATION_REASON'].unique())
print('There are ' + str(temp['CANCELLATION_REASON'].count()) + ' non-null values for cancellation. Thus we expect the remaining cancellation values to correspond to nulls in "arrival delay"')

For all the cases where "departure delay" is null, the cancellation reason has actual values:
['A' 'B' 'C' 'D']
There are 89884 non-null values for cancellation. Thus we expect the remaining cancellation values to correspond to nulls in "arrival delay"


In other words, all 'departure delay' nulls (86153 rows) are explained by cancellations. We can thus impute 'departure delay' as cancelled (or 0) and control for those cases via the relevant column. On the other hand, there are still 3731 non-null cancellation values (89884 - 86153) that are not matched by a null departure delay. We will check whether the 'arrival delay' nulls are related to cancellation or diversion.
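
The same check can be expressed as a cross-tabulation of "departure delay is null" against the cancellation flag. The following is only a sketch on made-up rows; it assumes the `CANCELLED` 0/1 column from flights.csv is used as the flag, whereas the cells in this notebook rely on `CANCELLATION_REASON`.

```python
import numpy as np
import pandas as pd

# Toy rows, not taken from flights.csv
toy = pd.DataFrame({
    "DEPARTURE_DELAY": [3.0, np.nan, np.nan, 12.0, -2.0],
    "CANCELLED": [0, 1, 1, 1, 0],
})

# Label each row by whether its departure delay is missing
dep_state = (
    toy["DEPARTURE_DELAY"].isna()
    .map({True: "null", False: "present"})
    .rename("dep_delay")
)
dep_vs_cancelled = pd.crosstab(dep_state, toy["CANCELLED"])
print(dep_vs_cancelled)

# Null departure delays on flights that were not cancelled would need another explanation
unexplained = int((toy["DEPARTURE_DELAY"].isna() & (toy["CANCELLED"] == 0)).sum())
print(unexplained)
```

In the toy data one cancelled flight still has a departure delay, which mirrors the 3731 cancellations that only show up as null arrival delays.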

In [8]:
temp.loc[(temp['DEPARTURE_DELAY'].isnull()) , 
       'DEPARTURE_DELAY'] = 'cancelled' 
temp.loc[(temp['DEPARTURE_DELAY']== 'cancelled' ) , 
       'DEPARTURE_TIME'] = 'cancelled'

In [9]:
a=get_dataframe_info(temp[temp['ARRIVAL_DELAY'].isnull()])
a.set_index('features', inplace=True)
a.loc[['CANCELLATION_REASON','DEPARTURE_DELAY','ARRIVAL_DELAY','DIVERTED']]

features,types,non_null_counts,mean,std,min,max,%nulls
CANCELLATION_REASON,object,89884,,,,,14.454036
DEPARTURE_DELAY,object,105071,,,,,0.0
ARRIVAL_DELAY,float64,0,,,,,100.0
DIVERTED,int64,105071,0.14454,0.351638,0.0,1.0,0.0


In [10]:
print('Where "arrival delay" is null, DIVERTED takes the values: '+str(temp[temp['ARRIVAL_DELAY'].isnull()]['DIVERTED'].unique()))
print('Thus we have to control for the cases where DIVERTED = 0 to examine whether the missing delay values are explained by cancellations/diversions')

Where "arrival delay" is null, DIVERTED takes the values: [0 1]
Thus we have to control for the cases where DIVERTED = 0 to examine whether the missing delay values are explained by cancellations/diversions


In [11]:
a=get_dataframe_info(temp[(temp['ARRIVAL_DELAY'].isnull()) & (temp['DIVERTED']==0) & (temp['CANCELLATION_REASON'].notnull())])
a.set_index('features', inplace=True)
a.loc[['CANCELLATION_REASON','DEPARTURE_DELAY','ARRIVAL_DELAY','DIVERTED']]

features,types,non_null_counts,mean,std,min,max,%nulls
CANCELLATION_REASON,object,89884,,,,,0.0
DEPARTURE_DELAY,object,89884,,,,,0.0
ARRIVAL_DELAY,float64,0,,,,,100.0
DIVERTED,int64,89884,0.0,0.0,0.0,0.0,0.0


In [12]:
a=get_dataframe_info(temp[(temp['ARRIVAL_DELAY'].isnull()) & (temp['DIVERTED']==0) & (temp['CANCELLATION_REASON'].notnull())& (temp['DEPARTURE_DELAY']!='cancelled')])
a.set_index('features', inplace=True)
a.loc[['CANCELLATION_REASON','DEPARTURE_DELAY','ARRIVAL_DELAY','DIVERTED']]

features,types,non_null_counts,mean,std,min,max,%nulls
CANCELLATION_REASON,object,3731,,,,,0.0
DEPARTURE_DELAY,object,3731,,,,,0.0
ARRIVAL_DELAY,float64,0,,,,,100.0
DIVERTED,int64,3731,0.0,0.0,0.0,0.0,0.0


It is thus verified that the 3731 non-null cancellation values that did not correspond to null departure delays do correspond to null arrival delays.

In [13]:
print('For 89884 flights where "arrival delay" is null, cause is cancellation. The cancellation reason can be identified as:'\
              +str(temp[(temp['ARRIVAL_DELAY'].isnull()) & (temp['DIVERTED']==0) & \
                        (temp['CANCELLATION_REASON'].notnull())]\
               ['CANCELLATION_REASON'].unique()))

For 89884 flights where "arrival delay" is null, cause is cancellation. The cancellation reason can be identified as:['A' 'B' 'C' 'D']


In [14]:
a=get_dataframe_info(temp[(temp['ARRIVAL_DELAY'].isnull()) & (temp['DIVERTED']==1) & (temp['CANCELLATION_REASON'].isnull())])
a.set_index('features', inplace=True)
a.loc[['CANCELLATION_REASON','DEPARTURE_DELAY','ARRIVAL_DELAY','DIVERTED']]

features,types,non_null_counts,mean,std,min,max,%nulls
CANCELLATION_REASON,object,0,,,,,100.0
DEPARTURE_DELAY,object,15187,,,,,0.0
ARRIVAL_DELAY,float64,0,,,,,100.0
DIVERTED,int64,15187,1.0,0.0,1.0,1.0,0.0


OVERALL: the 105071 nulls in 'arrival delay' are explained by 15187 diverted flights and 89884 cancelled flights.
We can impute the missing values and control for them via the relevant columns.
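
The decomposition above can also be recorded in a separate status column, which keeps the delay columns numeric. This is a sketch of one possible approach on toy rows; the `STATUS` column and its labels are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows, not taken from flights.csv
toy = pd.DataFrame({
    "ARRIVAL_DELAY": [4.0, np.nan, np.nan, 30.0, np.nan],
    "DIVERTED": [0, 1, 0, 0, 0],
    "CANCELLATION_REASON": [np.nan, np.nan, "B", np.nan, np.nan],
})

# Conditions are evaluated in order; the first match wins
conditions = [
    toy["CANCELLATION_REASON"].notna(),
    toy["DIVERTED"] == 1,
    toy["ARRIVAL_DELAY"].isna(),
]
labels = ["cancelled", "diverted", "unexplained_null"]
toy["STATUS"] = np.select(conditions, labels, default="completed")

status_counts = toy["STATUS"].value_counts().to_dict()
print(status_counts)
```

If the explanation is complete, the `unexplained_null` count on the real data should be zero; the toy frame deliberately contains one such row to show what a leftover would look like.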

Should we wish to continue with the copy of our dataset:
temp.loc[(temp['DIVERTED']==1) , 
       'ARRIVAL_DELAY'] = 'diverted' 
temp.loc[(temp['DIVERTED']==1) , 
       'ARRIVAL_TIME'] = 'diverted'

temp.loc[(temp['ARRIVAL_DELAY'].isnull() ) , 
       'ARRIVAL_DELAY'] = 'cancelled'
temp.loc[(temp['ARRIVAL_DELAY']== 'cancelled' ) , 
       'ARRIVAL_TIME'] = 'cancelled' 
temp[['CANCELLATION_REASON', 'DIVERTED']] = temp[['CANCELLATION_REASON', 'DIVERTED']].fillna(value=0)

Having explained the aforementioned missing values in our temp dataframe, we can now impute our original dataset as follows:

-Fill NA in all departure/arrival delays and times

-Fill NA in diverted and cancellation reasons

-Create a column that flags whether the flight was diverted or cancelled

-Lastly, create a column that summarizes whether there was any delay, either arrival or departure

In [15]:
flights_extended['delay'] = 0
flights_extended.loc[(flights_extended['DEPARTURE_DELAY']>0) | (flights_extended['ARRIVAL_DELAY']>0), 
       'delay'] = 1 

flights_extended['Div/Canc'] = 0
flights_extended.loc[(flights_extended['CANCELLATION_REASON'].notnull()) |
                       (flights_extended['DIVERTED']==1) , 
       'Div/Canc'] = 1  


flights_extended[['CANCELLATION_REASON', 'DIVERTED','DEPARTURE_DELAY','DEPARTURE_TIME','ARRIVAL_DELAY','ARRIVAL_TIME']] = flights_extended[['CANCELLATION_REASON', 'DIVERTED','DEPARTURE_DELAY','DEPARTURE_TIME','ARRIVAL_DELAY','ARRIVAL_TIME']].fillna(value=0)


B) The delay reason columns have 81.72% nulls

In [16]:
temp=flights_extended.copy()

temp['reasons'] = np.nan
temp.loc[(temp['WEATHER_DELAY'] >0) |
         (temp['LATE_AIRCRAFT_DELAY'] >0) |
         (temp['AIRLINE_DELAY'] >0) |
         (temp['SECURITY_DELAY'] >0) |
         (temp['AIR_SYSTEM_DELAY'] >0), 
       'reasons'] = 1 
a=get_dataframe_info(temp[temp['delay']>0])
a.set_index('features', inplace=True)
a.loc[['WEATHER_DELAY', 'LATE_AIRCRAFT_DELAY','AIRLINE_DELAY','SECURITY_DELAY','AIR_SYSTEM_DELAY','DEPARTURE_DELAY','ARRIVAL_DELAY','reasons']]

features,types,non_null_counts,mean,std,min,max,%nulls
WEATHER_DELAY,float64,1063439,2.91529,20.433336,0.0,1211.0,60.676972
LATE_AIRCRAFT_DELAY,float64,1063439,23.472838,43.197018,0.0,1331.0,60.676972
AIRLINE_DELAY,float64,1063439,18.969547,48.161642,0.0,1971.0,60.676972
SECURITY_DELAY,float64,1063439,0.076154,2.14346,0.0,573.0,60.676972
AIR_SYSTEM_DELAY,float64,1063439,13.480568,28.003679,0.0,1134.0,60.676972
DEPARTURE_DELAY,float64,2704367,24.927057,49.445211,-42.0,1988.0,0.0
ARRIVAL_DELAY,float64,2704367,23.694547,49.745699,-81.0,1971.0,0.0
reasons,float64,1063439,1.0,0.0,1.0,1.0,60.676972


In [17]:
del temp
print( 'For 60% of the flights where there was an actual delay, we are missing the reason for the delay. \
Should we wish to examine the reason for the delay as part of our parameters, only about 40% of the delayed flights would be available')

For 60% of the flights where there was an actual delay, we are missing the reason for the delay. Should we wish to examine the reason for the delay as part of our parameters, only about 40% of the delayed flights would be available


We will drop those columns, as the reason for a delay is not part of our objective.
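
Before dropping the columns, it may be worth checking whether the missing reasons follow a pattern. One hypothesis, stated here as an assumption and not verified in this notebook, is that the reason columns are only filled once the arrival delay reaches a threshold such as 15 minutes. The sketch below shows how that could be tested; the rows are made up and only two of the five reason columns are included.

```python
import numpy as np
import pandas as pd

# Toy rows constructed to illustrate the check, not real data
toy = pd.DataFrame({
    "ARRIVAL_DELAY": [5.0, 14.0, 15.0, 40.0, 90.0, 2.0],
    "WEATHER_DELAY": [np.nan, np.nan, 0.0, 10.0, 0.0, np.nan],
    "AIRLINE_DELAY": [np.nan, np.nan, 15.0, 30.0, 90.0, np.nan],
})
reason_cols = ["WEATHER_DELAY", "AIRLINE_DELAY"]

# True when every reason column is null for that flight
toy["reason_missing"] = toy[reason_cols].isna().all(axis=1)

# Group flights by the assumed 15-minute threshold
toy["delay_band"] = np.where(toy["ARRIVAL_DELAY"] >= 15, "15+ min", "under 15 min")

missing_share = toy.groupby("delay_band")["reason_missing"].mean().to_dict()
print(missing_share)
```

If the share of missing reasons on the real data is close to 1 below the threshold and close to 0 above it, the nulls are structural rather than random.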

In [18]:
flights_extended.drop(['WEATHER_DELAY', 'LATE_AIRCRAFT_DELAY','AIRLINE_DELAY','SECURITY_DELAY','AIR_SYSTEM_DELAY']\
                      , axis=1,inplace=True)

C) ORIG/DEST_AIRPORT and ORIG/DEST_LATITUDE have 8.522716% nulls

In [19]:
flights_extended[flights_extended['DEST_AIRPORT'].isnull()]['DESTINATION_AIRPORT'].unique()

array(['11298', '13487', '13303', '11057', '13930', '10693', '14747',
       '12266', '12478', '14057', '10397', '13198', '12173', '11618',
       '13204', '12402', '12758', '14771', '11292', '13830', '14107',
       '12982', '11697', '12519', '13577', '14869', '12339', '13232',
       '11259', '15016', '12892', '12889', '10140', '10721', '15304',
       '14679', '14100', '10732', '10821', '14831', '14843', '14570',
       '12191', '11433', '12264', '14893', '13796', '12953', '14122',
       '13495', '12451', '11042', '11278', '14908', '13342', '13891',
       '10299', '12523', '12954', '14520', '12129', '10713', '10431',
       '11066', '10994', '10423', '10792', '14683', '10800', '10551',
       '11193', '14307', '11540', '14635', '13851', '15376', '11884',
       '14492', '14027', '13970', '13184', '13296', '10599', '11986',
       '14698', '11423', '13244', '14574', '12992', '13241', '12217',
       '13871', '12197', '11003', '15919', '14524', '14689', '10529',
       '14576', '119

In [20]:
flights_extended[flights_extended['DEST_AIRPORT'].isnull()]['DESTINATION_AIRPORT'].nunique()
flights_extended[flights_extended['DEST_AIRPORT'].notnull()]['DESTINATION_AIRPORT'].nunique()

322

We identified the reason for the missing values:
Some airport codes are numeric. These do not match our airports dataset (IATA codes are alphanumeric).
We first suspected ISO 3166-1 numeric codes, but those are three-digit country codes, whereas the values here have five digits; they are more likely a separate numeric airport identifier that would need its own lookup table.
If we need to create a map in our analysis these rows will be ignored, but otherwise they do not significantly affect our analysis:
The nulls are 9% of our dataset and refer to 608 airports.
The valid 91% of our dataset is composed of 322 known airports.
We expect that the important domestic airports will not be affected.
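
To see whether the numeric codes are spread evenly or concentrated in part of the year, the share of numeric codes can be computed per month. The sketch below runs on made-up rows; the concentration in a single month is a property of the toy data only and is not a finding of this notebook.

```python
import pandas as pd

# Toy rows: months and codes are illustrative only
toy = pd.DataFrame({
    "MONTH": [9, 9, 10, 10, 10, 11],
    "ORIGIN_AIRPORT": ["LAX", "SFO", "12892", "14771", "10397", "ATL"],
})

# True where the airport code consists of digits only
is_numeric_code = toy["ORIGIN_AIRPORT"].astype(str).str.isdigit()

# Fraction of flights per month that carry a numeric code
numeric_share_by_month = is_numeric_code.groupby(toy["MONTH"]).mean()
print(numeric_share_by_month)
```

If the numeric codes turn out to be clustered in time on the real data, the claim that important domestic airports are unaffected would need to be revisited for that period.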

D) 'TAXI_IN','WHEELS_ON','WHEELS_OFF','SCHEDULED_TIME','TAXI_OUT','TAIL_NUMBER','ELAPSED_TIME' nulls

In [21]:
a=get_dataframe_info(flights_extended[flights_extended['Div/Canc']!=1])
a.set_index('features', inplace=True)
a.loc[['TAXI_IN','WHEELS_ON','WHEELS_OFF','SCHEDULED_TIME','TAXI_OUT','TAIL_NUMBER','ELAPSED_TIME']]

features,types,non_null_counts,mean,std,min,max,%nulls
TAXI_IN,float64,5714008,7.429063,5.618951,1.0,248.0,0.0
WHEELS_ON,float64,5714008,1471.319332,521.86824,1.0,2400.0,0.0
WHEELS_OFF,float64,5714008,1357.099048,498.023745,1.0,2400.0,0.0
SCHEDULED_TIME,float64,5714008,141.893974,75.313998,18.0,718.0,0.0
TAXI_OUT,float64,5714008,16.065498,8.882449,1.0,225.0,0.0
TAIL_NUMBER,object,5714008,,,,,0.0
ELAPSED_TIME,float64,5714008,137.006189,74.211072,14.0,766.0,0.0


Missing values in the aforementioned columns are due to cancelled or diverted flights. Since our analysis examines how delays arise, not cancellations or diversions, all these cases will be excluded.

In [22]:
flights_extended=flights_extended[flights_extended['Div/Canc']!=1].copy()  # copy so that later column assignments do not raise SettingWithCopyWarning

----------------

DATA cleaning

Now that we have handled the missing values, we must correct the date and time information:

In [23]:
flights_extended['DATE'] = pd.to_datetime(flights_extended[['YEAR','MONTH', 'DAY']])
#flights_extended.loc[:, ['DATE','YEAR','MONTH', 'DAY']]

In [24]:
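def format_time(myst):
    '''Convert a numeric HHMM value (e.g. 5, 2354, 2400) to datetime.time; nulls are returned as NaN.'''
    if pd.isnull(myst):
        return np.nan
    # 2400 denotes midnight, which datetime.time represents as 00:00
    if myst == 2400:
        myst = 0
    # Zero-pad to four digits so that e.g. 5 becomes '0005'
    myst = "{0:04d}".format(int(myst))
    return datetime.time(int(myst[0:2]), int(myst[2:4]))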


In [25]:
flights_extended['WHEELS_OFF'] = flights_extended['WHEELS_OFF'].apply(format_time)
flights_extended['WHEELS_ON'] = flights_extended['WHEELS_ON'].apply(format_time)
flights_extended['SCHEDULED_ARRIVAL'] = flights_extended['SCHEDULED_ARRIVAL'].apply(format_time)
flights_extended['ARRIVAL_TIME'] = flights_extended['ARRIVAL_TIME'].apply(format_time)

In [26]:
flights_extended['DEPARTURE_TIME'] = flights_extended['DEPARTURE_TIME'].apply(format_time) 

In [27]:
flights_extended['SCHEDULED_DEPARTURE'] = flights_extended['SCHEDULED_DEPARTURE'].apply(format_time)
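
Applying `format_time` row by row over several million rows can be slow. A vectorized alternative is sketched below; `hhmm_to_time` is a hypothetical helper that is not part of this notebook, and it assumes the column holds numeric HHMM values like the ones above.

```python
import datetime

import numpy as np
import pandas as pd

def hhmm_to_time(col):
    """Convert numeric HHMM values (e.g. 5, 2354, 2400) to datetime.time; nulls stay missing."""
    hhmm = pd.to_numeric(col, errors="coerce").replace(2400, 0)
    # Split HHMM into hours and minutes and express it as minutes after midnight
    minutes = (hhmm // 100) * 60 + hhmm % 100
    # Anchor on an arbitrary date so that .dt.time can extract the clock time
    stamps = pd.to_timedelta(minutes, unit="m") + pd.Timestamp("1970-01-01")
    return stamps.dt.time

times = hhmm_to_time(pd.Series([5.0, 2354.0, 2400.0, np.nan]))
print(times.iloc[1] == datetime.time(23, 54))
```

The result is an object column of `datetime.time` values, the same type that `format_time` produces, so downstream cells would not need to change.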

In [29]:
flights_extended.head(3)

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIRLINE,ORIG_AIRPORT,ORIG_CITY,ORIG_STATE,ORIG_COUNTRY,ORIG_LATITUDE,ORIG_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE,delay,Div/Canc,DATE
0,2015,1,1,4,98,N407AS,ANC,SEA,00:05:00,23:54:00,-11.0,21.0,00:15:00,205.0,194.0,169.0,1448,04:04:00,4.0,04:30:00,04:08:00,-22.0,0,0,0,Alaska Airlines Inc.,Ted Stevens Anchorage International Airport,Anchorage,AK,USA,61.17432,-149.99619,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,0,0,2015-01-01
1,2015,1,1,4,2336,N3KUAA,LAX,PBI,00:10:00,00:02:00,-8.0,12.0,00:14:00,280.0,279.0,263.0,2330,07:37:00,4.0,07:50:00,07:41:00,-9.0,0,0,0,American Airlines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Palm Beach International Airport,West Palm Beach,FL,USA,26.68316,-80.09559,0,0,2015-01-01
2,2015,1,1,4,840,N171US,SFO,CLT,00:20:00,00:18:00,-2.0,16.0,00:34:00,286.0,293.0,266.0,2296,08:00:00,11.0,08:06:00,08:11:00,5.0,0,0,0,US Airways Inc.,San Francisco International Airport,San Francisco,CA,USA,37.619,-122.37484,Charlotte Douglas International Airport,Charlotte,NC,USA,35.21401,-80.94313,1,0,2015-01-01


In [31]:
flights_extended.to_csv('cleaned data.zip')

-----------