# Airline Arrivals

Use the [data](http://stat-computing.org/dataexpo/2009/the-data.html) provided to predict how late flights will be. A flight counts as late only if its delay exceeds 30 minutes.

- Year	1987-2008
- Month	1-12
- DayofMonth	1-31
- DayOfWeek	1 (Monday) - 7 (Sunday)
- DepTime	actual departure time (local, hhmm)
- CRSDepTime	scheduled departure time (local, hhmm)
- ArrTime	actual arrival time (local, hhmm)
- CRSArrTime	scheduled arrival time (local, hhmm)
- UniqueCarrier	unique carrier code
- FlightNum	flight number
- TailNum	plane tail number
- ActualElapsedTime	in minutes
- CRSElapsedTime	in minutes
- AirTime	in minutes
- ArrDelay	arrival delay, in minutes
- DepDelay	departure delay, in minutes
- Origin	origin IATA airport code
- Dest	destination IATA airport code
- Distance	in miles
- TaxiIn	taxi in time, in minutes
- TaxiOut	taxi out time, in minutes
- Cancelled	was the flight cancelled?
- CancellationCode	reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
- Diverted	1 = yes, 0 = no
- CarrierDelay	in minutes
- WeatherDelay	in minutes
- NASDelay	in minutes
- SecurityDelay	in minutes
- LateAircraftDelay	in minutes
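
The four time columns store clock times as `hhmm` numbers (1419 means 14:19), so subtracting them directly does not give a duration. A minimal sketch of one way to decode them into minutes after midnight is shown here; the helper name `hhmm_to_minutes` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def hhmm_to_minutes(col):
    """Convert hhmm-encoded clock times (e.g. 1419 for 14:19) to minutes after midnight."""
    col = pd.to_numeric(col, errors='coerce')  # missing or unparsable values become NaN
    hours = np.floor(col / 100)                # leading digits are the hour
    minutes = col % 100                        # last two digits are the minute
    return hours * 60 + minutes

# Toy values only, not rows from the real file
sample = pd.Series([1419.0, 1230, 5, np.nan])
result = hhmm_to_minutes(sample)
print(result)
```

Missing times stay as NaN, so the conversion can be applied before deciding how to handle cancelled flights.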

In [1]:
import pandas as pd
import numpy as np

raw_data = pd.read_csv('./data/flights_1989.csv')
raw_data.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,1989,1,23,1,1419.0,1230,1742.0,1552,UA,183,...,,,0,,0,,,,,
1,1989,1,24,2,1255.0,1230,1612.0,1552,UA,183,...,,,0,,0,,,,,
2,1989,1,25,3,1230.0,1230,1533.0,1552,UA,183,...,,,0,,0,,,,,
3,1989,1,26,4,1230.0,1230,1523.0,1552,UA,183,...,,,0,,0,,,,,
4,1989,1,27,5,1232.0,1230,1513.0,1552,UA,183,...,,,0,,0,,,,,


In [2]:
cols_with_one_val = []
cols_many_nans = []

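def get_col_descriptions(df):
    """Print the unique-value count and NaN count for each column of df.

    Side effect: appends to the module-level lists cols_with_one_val and
    cols_many_nans, which persist across calls.
    """
    for idx, col in enumerate(df.columns):
        # unique() counts NaN as a value, so an all-NaN column has 1 unique value
        num_uniq = len(df[col].unique())
        formatted_msg = '{}. {} – {} uniq vals'.format(idx + 1, col, num_uniq)

        if num_uniq == 1 and col not in cols_with_one_val:
            cols_with_one_val.append(col)

        num_nans = df[col].isnull().sum()
        if num_nans > 0:
            percent_nans = round(num_nans / df.shape[0] * 100, 2)
            print(formatted_msg + '; {} NaNs ({}%)'.format(num_nans, percent_nans))
            if percent_nans > 50 and col not in cols_many_nans:
                cols_many_nans.append(col)
        else:
            print(formatted_msg)
    # Reports every column recorded so far, including any dropped since an earlier call
    print('\n{} columns with 50+% NaNs: {}'.format(len(cols_many_nans), cols_many_nans))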

get_col_descriptions(raw_data)

1. Year – 1 uniq vals
2. Month – 12 uniq vals
3. DayofMonth – 31 uniq vals
4. DayOfWeek – 7 uniq vals
5. DepTime – 1441 uniq vals; 74165 NaNs (1.47%)
6. CRSDepTime – 1199 uniq vals
7. ArrTime – 1441 uniq vals; 89004 NaNs (1.77%)
8. CRSArrTime – 1348 uniq vals
9. UniqueCarrier – 13 uniq vals
10. FlightNum – 2699 uniq vals
11. TailNum – 1 uniq vals; 5041200 NaNs (100.0%)
12. ActualElapsedTime – 629 uniq vals; 89004 NaNs (1.77%)
13. CRSElapsedTime – 488 uniq vals
14. AirTime – 1 uniq vals; 5041200 NaNs (100.0%)
15. ArrDelay – 721 uniq vals; 89004 NaNs (1.77%)
16. DepDelay – 789 uniq vals; 74165 NaNs (1.47%)
17. Origin – 237 uniq vals
18. Dest – 237 uniq vals
19. Distance – 1063 uniq vals; 26988 NaNs (0.54%)
20. TaxiIn – 1 uniq vals; 5041200 NaNs (100.0%)
21. TaxiOut – 1 uniq vals; 5041200 NaNs (100.0%)
22. Cancelled – 2 uniq vals
23. CancellationCode – 1 uniq vals; 5041200 NaNs (100.0%)
24. Diverted – 2 uniq vals
25. CarrierDelay – 1 uniq vals; 5041200 NaNs (100.0%)
26. WeatherDelay – 1 u

In [3]:
raw_data = raw_data.drop(cols_with_one_val, axis=1)
get_col_descriptions(raw_data)

1. Month – 12 uniq vals
2. DayofMonth – 31 uniq vals
3. DayOfWeek – 7 uniq vals
4. DepTime – 1441 uniq vals; 74165 NaNs (1.47%)
5. CRSDepTime – 1199 uniq vals
6. ArrTime – 1441 uniq vals; 89004 NaNs (1.77%)
7. CRSArrTime – 1348 uniq vals
8. UniqueCarrier – 13 uniq vals
9. FlightNum – 2699 uniq vals
10. ActualElapsedTime – 629 uniq vals; 89004 NaNs (1.77%)
11. CRSElapsedTime – 488 uniq vals
12. ArrDelay – 721 uniq vals; 89004 NaNs (1.77%)
13. DepDelay – 789 uniq vals; 74165 NaNs (1.47%)
14. Origin – 237 uniq vals
15. Dest – 237 uniq vals
16. Distance – 1063 uniq vals; 26988 NaNs (0.54%)
17. Cancelled – 2 uniq vals
18. Diverted – 2 uniq vals

10 columns with 50+% NaNs: ['TailNum', 'AirTime', 'TaxiIn', 'TaxiOut', 'CancellationCode', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']


In [4]:
# pd.get_dummies(raw_data, columns=['Month', 'DayOfWeek'], dummy_na=True)
raw_data.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,CRSElapsedTime,ArrDelay,DepDelay,Origin,Dest,Distance,Cancelled,Diverted
0,1,23,1,1419.0,1230,1742.0,1552,UA,183,323.0,322,110.0,109.0,SFO,HNL,2398.0,0,0
1,1,24,2,1255.0,1230,1612.0,1552,UA,183,317.0,322,20.0,25.0,SFO,HNL,2398.0,0,0
2,1,25,3,1230.0,1230,1533.0,1552,UA,183,303.0,322,-19.0,0.0,SFO,HNL,2398.0,0,0
3,1,26,4,1230.0,1230,1523.0,1552,UA,183,293.0,322,-29.0,0.0,SFO,HNL,2398.0,0,0
4,1,27,5,1232.0,1230,1513.0,1552,UA,183,281.0,322,-39.0,2.0,SFO,HNL,2398.0,0,0


In [8]:
df = raw_data.copy()
# Late means more than 30 minutes; NaN delays compare as False and are labelled 0
df['IsArrDelayed'] = (df['ArrDelay'] > 30).astype(int)
df['IsDepDelayed'] = (df['DepDelay'] > 30).astype(int)
df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,CRSElapsedTime,ArrDelay,DepDelay,Origin,Dest,Distance,Cancelled,Diverted,IsArrDelayed,IsDepDelayed
0,1,23,1,1419.0,1230,1742.0,1552,UA,183,323.0,322,110.0,109.0,SFO,HNL,2398.0,0,0,1,1
1,1,24,2,1255.0,1230,1612.0,1552,UA,183,317.0,322,20.0,25.0,SFO,HNL,2398.0,0,0,0,0
2,1,25,3,1230.0,1230,1533.0,1552,UA,183,303.0,322,-19.0,0.0,SFO,HNL,2398.0,0,0,0,0
3,1,26,4,1230.0,1230,1523.0,1552,UA,183,293.0,322,-29.0,0.0,SFO,HNL,2398.0,0,0,0,0
4,1,27,5,1232.0,1230,1513.0,1552,UA,183,281.0,322,-39.0,2.0,SFO,HNL,2398.0,0,0,0,0
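
Rows with a missing `ArrDelay` (about 1.77% of the file, per the column summary) compare as `False` against any threshold, so they are labelled 0 alongside flights that really were on time. One possible way to avoid that, sketched on a small made-up frame instead of the real data, is to drop those rows before labelling. The `label_late` helper and its `threshold` parameter are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def label_late(frame, threshold=30):
    """Drop rows without an arrival delay, then flag flights more than `threshold` minutes late."""
    labelled = frame.dropna(subset=['ArrDelay']).copy()
    labelled['IsArrDelayed'] = (labelled['ArrDelay'] > threshold).astype(int)
    return labelled

# Toy delays, including one missing value and one exactly at the threshold
toy = pd.DataFrame({'ArrDelay': [110.0, 20.0, 30.0, np.nan, -19.0]})
labelled = label_late(toy)
print(labelled)
```

Whether dropping is appropriate depends on the goal; cancelled flights could instead be kept as a separate class.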


In [None]:
# season
# weekend
# dep AM
# dep PM
# arr AM
# arr PM
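
The features listed above could be derived along the following lines. This is only a sketch on a toy frame: the column names `Season`, `IsWeekend`, `DepAM`, and `ArrAM`, the meteorological season mapping, and the choice of scheduled (CRS) times over actual times are all assumptions, not decisions made in this notebook.

```python
import pandas as pd

# Assumed meteorological seasons: Dec-Feb, Mar-May, Jun-Aug, Sep-Nov
SEASON_BY_MONTH = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

def add_calendar_features(frame):
    """Return a copy of frame with season, weekend, and AM/PM indicator columns."""
    out = frame.copy()
    out['Season'] = out['Month'].map(SEASON_BY_MONTH)
    # DayOfWeek runs from 1 (Monday) to 7 (Sunday)
    out['IsWeekend'] = out['DayOfWeek'].isin([6, 7]).astype(int)
    # hhmm values below 1200 fall before noon; PM is the complement
    out['DepAM'] = (out['CRSDepTime'] < 1200).astype(int)
    out['ArrAM'] = (out['CRSArrTime'] < 1200).astype(int)
    return out

# Toy rows only, not taken from the real file
toy = pd.DataFrame({
    'Month': [1, 7],
    'DayOfWeek': [1, 6],
    'CRSDepTime': [1230, 615],
    'CRSArrTime': [1552, 1140],
})
features = add_calendar_features(toy)
print(features)
```

Scheduled times have no missing values in this file, which is why the sketch uses them; a single AM flag per time column also avoids carrying a redundant PM column.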