## Statistics by day
---
Since our weather data is daily, it may not be precise enough to predict the delay of a specific flight, but it might be usable to predict average delays at an airport on a specific day. Since the origin weather applies to every flight leaving an airport, and origin weather appears to have a stronger link to delays, I will focus on the origin airport.
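As a rough illustration of the airport-day idea, the sketch below collapses a few per-flight rows into one row per date and origin airport. The values are made up for the example; only the column names follow the data used in this notebook.

```python
import pandas as pd

# Toy per-flight rows (hypothetical values): weather is identical for every
# flight sharing a date and origin airport, while the delay varies by flight.
flights = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-01"] * 3 + ["2015-01-02"] * 2),
    "ORIGIN_AIRPORT": ["ANC", "ANC", "SEA", "ANC", "ANC"],
    "WEATHER_DELAY": [0.0, 30.0, 0.0, 0.0, 0.0],
    "OR_PRCP": [0.09, 0.09, 0.00, 0.00, 0.00],
})

# One row per (date, origin airport): mean delay plus that day's weather.
airport_days = flights.groupby(
    ["DATE", "ORIGIN_AIRPORT"], as_index=False
).mean(numeric_only=True)
print(airport_days)
```

Because the weather columns are constant within each group, taking the mean leaves them unchanged and only the delay is actually averaged.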

### Load Data File

In [1]:
%matplotlib inline
import math
import matplotlib.pyplot as plt
from decimal import *
import numpy as np # linear algebra
import pandas as pd # read_csv and such
from io import StringIO # wrap strings in a file-like text buffer
import os # for listing files in directory
import seaborn as sns
import scipy.stats as stats

pd.options.display.max_columns = 99
pd.options.display.max_rows = 99

In [2]:
flights_weather_path = '../data/flight_delays_2015/flights_weather.csv'
dtypes = {
    'ORIGIN_AIRPORT': 'str', 
    'DESTINATION_AIRPORT': 'str', 
    'IATA_CODE_x': 'str', 
    'origin_weather_station': 'str', 
    'IATA_CODE_y': 'str', 
    'destination_weather_station': 'str', 
    'OR_MAX': 'str', 
    'OR_MIN': 'str', 
    'OR_PRCP': 'str', 
    'DES_MAX': 'str', 
    'DES_MIN': 'str', 
    'DES_PRCP': 'str', 
    'OR_FRSHTT': 'str', 
    'DES_FRSHTT': 'str'
}
fw_df = pd.read_csv(flights_weather_path, dtype=dtypes, parse_dates=['DATE'])
fw_df = fw_df[['DATE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'WEATHER_DELAY', 'OR_TEMP', 'OR_VISIB',
               'OR_WDSP', 'OR_MXSPD', 'OR_SNDP', 'OR_PRCP', 'OR_GUST', 'OR_MAX', 'OR_MIN', 'OR_FOG',
               'OR_RAIN_DRIZZLE', 'OR_SNOW_ICE_PELLETS', 'OR_HAIL', 'OR_THUNDER', 'OR_TORNADO_FUNNEL_CLOUD',
               'DES_TEMP', 'DES_VISIB', 'DES_WDSP', 'DES_MXSPD', 'DES_SNDP', 'DES_PRCP', 'DES_GUST', 'DES_MAX',
               'DES_MIN', 'DES_FOG', 'DES_RAIN_DRIZZLE', 'DES_SNOW_ICE_PELLETS', 'DES_HAIL', 'DES_THUNDER',
               'DES_TORNADO_FUNNEL_CLOUD']]
fw_df.head()

Unnamed: 0,DATE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,WEATHER_DELAY,OR_TEMP,OR_VISIB,OR_WDSP,OR_MXSPD,OR_SNDP,OR_PRCP,OR_GUST,OR_MAX,OR_MIN,OR_FOG,OR_RAIN_DRIZZLE,OR_SNOW_ICE_PELLETS,OR_HAIL,OR_THUNDER,OR_TORNADO_FUNNEL_CLOUD,DES_TEMP,DES_VISIB,DES_WDSP,DES_MXSPD,DES_SNDP,DES_PRCP,DES_GUST,DES_MAX,DES_MIN,DES_FOG,DES_RAIN_DRIZZLE,DES_SNOW_ICE_PELLETS,DES_HAIL,DES_THUNDER,DES_TORNADO_FUNNEL_CLOUD
0,2015-01-01,ANC,SEA,,35.1,7.5,3.6,6.0,999.9,0.09G,999.9,43.0,32.0,True,True,False,False,False,False,32.9,10.0,4.6,11.1,999.9,0.00G,999.9,42.1,26.1,False,False,False,False,False,False
1,2015-01-01,ANC,SEA,,35.1,7.5,3.6,6.0,999.9,0.09G,999.9,43.0,32.0,True,True,False,False,False,False,32.9,10.0,4.6,11.1,999.9,0.00G,999.9,42.1,26.1,False,False,False,False,False,False
2,2015-01-01,ANC,SEA,,35.1,7.5,3.6,6.0,999.9,0.09G,999.9,43.0,32.0,True,True,False,False,False,False,32.9,10.0,4.6,11.1,999.9,0.00G,999.9,42.1,26.1,False,False,False,False,False,False
3,2015-01-01,ANC,SEA,,35.1,7.5,3.6,6.0,999.9,0.09G,999.9,43.0,32.0,True,True,False,False,False,False,32.9,10.0,4.6,11.1,999.9,0.00G,999.9,42.1,26.1,False,False,False,False,False,False
4,2015-01-01,ANC,SEA,,35.1,7.5,3.6,6.0,999.9,0.09G,999.9,43.0,32.0,True,True,False,False,False,False,32.9,10.0,4.6,11.1,999.9,0.00G,999.9,42.1,26.1,False,False,False,False,False,False


In [3]:
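# Strip the flag letter (A-I) from precipitation and the '*' flag from max/min temperature.
# regex is set explicitly because the default for str.replace differs between pandas versions.
fw_df['OR_PRCP'] = pd.to_numeric(fw_df['OR_PRCP'].str.replace('A|B|C|D|E|F|G|H|I', '', regex=True))
fw_df['OR_MAX'] = pd.to_numeric(fw_df['OR_MAX'].str.replace('*', '', regex=False))
fw_df['OR_MIN'] = pd.to_numeric(fw_df['OR_MIN'].str.replace('*', '', regex=False))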
fw_df.loc[fw_df['OR_TEMP'] == 9999.9, 'OR_TEMP'] = fw_df.OR_TEMP.mean()
fw_df.loc[fw_df['OR_WDSP'] == 999.9, 'OR_WDSP'] = fw_df.OR_WDSP.mean()
fw_df.loc[fw_df['OR_PRCP'] == 99.99, 'OR_PRCP'] = 0
fw_df.loc[fw_df['OR_VISIB'] == 999.9, 'OR_VISIB'] = fw_df.OR_VISIB.mean()
fw_df.loc[fw_df['OR_GUST'] == 999.9, 'OR_GUST'] = fw_df.OR_GUST.mean()
fw_df.loc[fw_df['OR_MAX'] == 9999.9, 'OR_MAX'] = fw_df.OR_MAX.mean()
fw_df.loc[fw_df['OR_MIN'] == 9999.9, 'OR_MIN'] = fw_df.OR_MIN.mean()
fw_df.loc[fw_df['OR_MXSPD'] == 999.9, 'OR_MXSPD'] = fw_df.OR_MXSPD.mean()
fw_df.loc[fw_df['OR_SNDP'] == 999.9, 'OR_SNDP'] = 0
fw_df.loc[fw_df['OR_FOG'] == True, 'OR_FOGV'] = 1
fw_df.loc[fw_df['OR_FOG'] == False, 'OR_FOGV'] = 0
fw_df.loc[fw_df['OR_RAIN_DRIZZLE'] == True, 'OR_RAIN_DRIZZLEV'] = 1
fw_df.loc[fw_df['OR_RAIN_DRIZZLE'] == False, 'OR_RAIN_DRIZZLEV'] = 0
fw_df.loc[fw_df['OR_SNOW_ICE_PELLETS'] == True, 'OR_SNOW_ICE_PELLETSV'] = 1
fw_df.loc[fw_df['OR_SNOW_ICE_PELLETS'] == False, 'OR_SNOW_ICE_PELLETSV'] = 0
fw_df.loc[fw_df['OR_HAIL'] == True, 'OR_HAILV'] = 1
fw_df.loc[fw_df['OR_HAIL'] == False, 'OR_HAILV'] = 0
fw_df.loc[fw_df['OR_THUNDER'] == True, 'OR_THUNDERV'] = 1
fw_df.loc[fw_df['OR_THUNDER'] == False, 'OR_THUNDERV'] = 0
fw_df.loc[fw_df['OR_TORNADO_FUNNEL_CLOUD'] == True, 'OR_TORNADO_FUNNEL_CLOUDV'] = 1
fw_df.loc[fw_df['OR_TORNADO_FUNNEL_CLOUD'] == False, 'OR_TORNADO_FUNNEL_CLOUDV'] = 0
fw_df['DES_PRCP'] = pd.to_numeric(fw_df['DES_PRCP'].str.replace('A|B|C|D|E|F|G|H|I', '', regex=True))
fw_df['DES_MAX'] = pd.to_numeric(fw_df['DES_MAX'].str.replace('*', '', regex=False))
fw_df['DES_MIN'] = pd.to_numeric(fw_df['DES_MIN'].str.replace('*', '', regex=False))
fw_df.loc[fw_df['DES_TEMP'] == 9999.9, 'DES_TEMP'] = fw_df.DES_TEMP.mean()
fw_df.loc[fw_df['DES_WDSP'] == 999.9, 'DES_WDSP'] = fw_df.DES_WDSP.mean()
fw_df.loc[fw_df['DES_PRCP'] == 99.99, 'DES_PRCP'] = 0
fw_df.loc[fw_df['DES_VISIB'] == 999.9, 'DES_VISIB'] = fw_df.DES_VISIB.mean()
fw_df.loc[fw_df['DES_GUST'] == 999.9, 'DES_GUST'] = fw_df.DES_GUST.mean()
fw_df.loc[fw_df['DES_MAX'] == 9999.9, 'DES_MAX'] = fw_df.DES_MAX.mean()
fw_df.loc[fw_df['DES_MIN'] == 9999.9, 'DES_MIN'] = fw_df.DES_MIN.mean()
fw_df.loc[fw_df['DES_MXSPD'] == 999.9, 'DES_MXSPD'] = fw_df.DES_MXSPD.mean()
fw_df.loc[fw_df['DES_SNDP'] == 999.9, 'DES_SNDP'] = 0
fw_df.loc[fw_df['DES_FOG'] == True, 'DES_FOGV'] = 1
fw_df.loc[fw_df['DES_FOG'] == False, 'DES_FOGV'] = 0
fw_df.loc[fw_df['DES_RAIN_DRIZZLE'] == True, 'DES_RAIN_DRIZZLEV'] = 1
fw_df.loc[fw_df['DES_RAIN_DRIZZLE'] == False, 'DES_RAIN_DRIZZLEV'] = 0
fw_df.loc[fw_df['DES_SNOW_ICE_PELLETS'] == True, 'DES_SNOW_ICE_PELLETSV'] = 1
fw_df.loc[fw_df['DES_SNOW_ICE_PELLETS'] == False, 'DES_SNOW_ICE_PELLETSV'] = 0
fw_df.loc[fw_df['DES_HAIL'] == True, 'DES_HAILV'] = 1
fw_df.loc[fw_df['DES_HAIL'] == False, 'DES_HAILV'] = 0
fw_df.loc[fw_df['DES_THUNDER'] == True, 'DES_THUNDERV'] = 1
fw_df.loc[fw_df['DES_THUNDER'] == False, 'DES_THUNDERV'] = 0
fw_df.loc[fw_df['DES_TORNADO_FUNNEL_CLOUD'] == True, 'DES_TORNADO_FUNNEL_CLOUDV'] = 1
fw_df.loc[fw_df['DES_TORNADO_FUNNEL_CLOUD'] == False, 'DES_TORNADO_FUNNEL_CLOUDV'] = 0
fw_df.loc[fw_df['WEATHER_DELAY'].isnull(), 'WEATHER_DELAY'] = 0


In [4]:
fw_df = fw_df[['DATE', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'WEATHER_DELAY', 'OR_TEMP', 'OR_VISIB', 
               'OR_WDSP', 'OR_MXSPD', 'OR_SNDP', 'OR_PRCP', 'OR_MAX', 'OR_MIN', 'OR_FOGV',
               'OR_RAIN_DRIZZLEV', 'OR_SNOW_ICE_PELLETSV', 'OR_HAILV', 'OR_THUNDERV', 'OR_TORNADO_FUNNEL_CLOUDV',
               'DES_TEMP', 'DES_VISIB', 'DES_WDSP', 'DES_MXSPD', 'DES_SNDP', 'DES_PRCP', 'DES_MAX',
               'DES_MIN', 'DES_FOGV', 'DES_RAIN_DRIZZLEV', 'DES_SNOW_ICE_PELLETSV', 'DES_HAILV', 'DES_THUNDERV',
               'DES_TORNADO_FUNNEL_CLOUDV']]

In [5]:
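# Average every numeric column per (date, origin airport); numeric_only skips the string DESTINATION_AIRPORT column
fw_airports = fw_df.groupby(['DATE', 'ORIGIN_AIRPORT'], as_index=False).mean(numeric_only=True)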

### Hypothesis Tests

Our null hypothesis is that the mean weather conditions on airport-days with weather delays are the same as on airport-days without them. We can run this test for each of our weather variables, treating an airport-day with a mean weather delay of at least 1 minute as delayed:
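Passing `equal_var=False` to `stats.ttest_ind`, as the cells in this section do, runs Welch's t-test, which does not assume the two groups share a variance. The sketch below uses synthetic samples (not the notebook's data) to show that the reported statistic is the difference in group means divided by the unpooled standard error; the helper `welch_t` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for one weather feature on delayed vs. non-delayed airport-days
delayed = rng.normal(loc=0.30, scale=0.40, size=200)
not_delayed = rng.normal(loc=0.10, scale=0.20, size=2000)

def welch_t(a, b):
    """Welch's t statistic: difference in means over the unpooled standard error."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

result = stats.ttest_ind(a=delayed, b=not_delayed, equal_var=False)
manual_t = welch_t(delayed, not_delayed)
print(result.statistic, manual_t)  # the two values should agree
```

A positive statistic means the feature's mean is higher in the first (delayed) group, and a negative one means it is lower, which is how the signs in the results below can be read.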

#### Origin Precipitation

In [7]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_PRCP'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_PRCP'],
                 equal_var=False)

Ttest_indResult(statistic=15.99094642281525, pvalue=2.5416701109319971e-56)

#### Origin Temp

In [8]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_TEMP'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_TEMP'],
                 equal_var=False)

Ttest_indResult(statistic=-1.8035889272382468, pvalue=0.071347019670119569)

#### Origin Max Temp

In [9]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_MAX'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_MAX'],
                 equal_var=False)

Ttest_indResult(statistic=-3.6849325886630098, pvalue=0.00023082118171332882)

#### Origin Min Temp

In [10]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_MIN'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_MIN'],
                 equal_var=False)

Ttest_indResult(statistic=1.8727486869702372, pvalue=0.061152636194478062)

#### Origin Windspeed

In [11]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_WDSP'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_WDSP'],
                 equal_var=False)

Ttest_indResult(statistic=12.933958534665889, pvalue=9.3226561557838483e-38)

#### Origin Visibility

In [13]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_VISIB'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_VISIB'],
                 equal_var=False)

Ttest_indResult(statistic=-30.196614722237278, pvalue=6.0863055705511551e-186)

#### Origin Fog

In [14]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_FOGV'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_FOGV'],
                 equal_var=False)

Ttest_indResult(statistic=22.079318738348551, pvalue=1.0120639994741724e-103)

#### Origin Rain or Drizzle

In [15]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_RAIN_DRIZZLEV'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_RAIN_DRIZZLEV'],
                 equal_var=False)

Ttest_indResult(statistic=31.697929270366714, pvalue=7.0849939979477167e-204)

#### Origin Snow/Ice/Pellets

In [16]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_SNOW_ICE_PELLETSV'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_SNOW_ICE_PELLETSV'],
                 equal_var=False)

Ttest_indResult(statistic=22.880676565216234, pvalue=6.6138171767096105e-111)

#### Origin Hail

In [17]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_HAILV'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_HAILV'],
                 equal_var=False)

Ttest_indResult(statistic=6.1551675155749521, pvalue=8.035377833819259e-10)

#### Origin Thunder

In [18]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_THUNDERV'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_THUNDERV'],
                 equal_var=False)

Ttest_indResult(statistic=33.18813420581224, pvalue=9.8700875221811105e-221)

#### Origin Tornado/Funnel Cloud

In [19]:
stats.ttest_ind(a=fw_airports.loc[fw_airports['WEATHER_DELAY'] >= 1, 'OR_TORNADO_FUNNEL_CLOUDV'], 
                b=fw_airports.loc[fw_airports['WEATHER_DELAY'] < 1, 'OR_TORNADO_FUNNEL_CLOUDV'],
                 equal_var=False)

Ttest_indResult(statistic=2.0525178871003678, pvalue=0.040166572972179307)

Using a p-value threshold of 0.05, we can reject the null hypothesis for every feature except average temperature and minimum temperature.
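To make the comparison against the threshold explicit, the sketch below collects the p-values printed in the test outputs above (rounded to three significant figures) and lists the features for which the null hypothesis is not rejected. The dictionary is typed in by hand for illustration and is not produced by the notebook.

```python
ALPHA = 0.05

# p-values copied from the test outputs above, rounded for readability
p_values = {
    "OR_PRCP": 2.54e-56,
    "OR_TEMP": 0.0713,
    "OR_MAX": 2.31e-4,
    "OR_MIN": 0.0612,
    "OR_WDSP": 9.32e-38,
    "OR_VISIB": 6.09e-186,
    "OR_FOGV": 1.01e-103,
    "OR_RAIN_DRIZZLEV": 7.08e-204,
    "OR_SNOW_ICE_PELLETSV": 6.61e-111,
    "OR_HAILV": 8.04e-10,
    "OR_THUNDERV": 9.87e-221,
    "OR_TORNADO_FUNNEL_CLOUDV": 0.0402,
}

# Features where we fail to reject the null hypothesis at the chosen threshold
not_rejected = sorted(name for name, p in p_values.items() if p >= ALPHA)
print(not_rejected)  # ['OR_MIN', 'OR_TEMP']
```

Note that the tornado/funnel cloud feature passes only narrowly, with a p-value of about 0.040.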

### Linear Regression

We can perform a linear regression to see if we can predict average delay at an airport based on weather.

In [20]:
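# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split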



In [22]:
feature_cols = ['OR_TEMP', 'OR_VISIB', 'OR_WDSP', 'OR_MXSPD', 'OR_SNDP', 'OR_PRCP', 'OR_MAX', 'OR_MIN',
                'OR_FOGV', 'OR_RAIN_DRIZZLEV', 'OR_SNOW_ICE_PELLETSV', 'OR_HAILV', 'OR_THUNDERV',
                'OR_TORNADO_FUNNEL_CLOUDV']
# No random_state is set, so the split (and the numbers below) will vary between runs
X_train, X_test, y_train, y_test = train_test_split(fw_airports[feature_cols], fw_airports[['WEATHER_DELAY']])


In [23]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

# print intercept and coefficients (coefficients are in the same order as the feature columns)
print(lm.intercept_)
print(lm.coef_)

[ 2.19620548]
[[  2.63121964e-03  -2.37702853e-01  -4.46934677e-03   3.02874326e-02
   -1.83841830e-02   1.24018674e-01  -5.68676963e-04  -3.52535595e-03
    6.91214366e-01  -1.19728416e-01   8.34651895e-01   2.65182671e+00
    7.93816354e-01  -1.08522756e-01]]


In [24]:
lm.score(X_test, y_test)

0.014309590387573401

This score is slightly better than the per-flight regression score, but the model still explains only about 1.4% of the variance in mean weather delay, which is barely better than predicting the mean for all inputs.
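For a `LinearRegression`, `lm.score` returns the coefficient of determination, R², which measures improvement over always predicting the mean of the target. The sketch below computes R² by hand on a few made-up delay values (not the notebook's data) to show the two reference points: 0 for the mean-only baseline and 1 for perfect predictions. The function name `r_squared` is introduced here for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical mean weather delays (minutes) for five airport-days
y = np.array([0.0, 0.0, 1.0, 4.0, 10.0])

baseline = np.full_like(y, y.mean())  # always predict the mean
print(r_squared(y, baseline))         # 0.0
print(r_squared(y, y))                # 1.0
```

On this scale, the score of roughly 0.014 reported above sits very close to the mean-only baseline.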