# Airline Arrivals

Use the given [data](http://stat-computing.org/dataexpo/2009/the-data.html) to predict how late flights will be. A flight counts as late only if it is more than 30 minutes behind schedule.

- `Year`: 1987-2008
- `Month`: 1-12
- `DayofMonth`: 1-31
- `DayOfWeek`: 1 (Monday) - 7 (Sunday)
- `DepTime`: actual departure time (local, hhmm)
- `CRSDepTime`: scheduled departure time (local, hhmm)
- `ArrTime`: actual arrival time (local, hhmm)
- `CRSArrTime`: scheduled arrival time (local, hhmm)
- `UniqueCarrier`: unique carrier code
- `FlightNum`: flight number
- `TailNum`: plane tail number
- `ActualElapsedTime`: in minutes
- `CRSElapsedTime`: in minutes
- `AirTime`: in minutes
- `ArrDelay`: arrival delay, in minutes
- `DepDelay`: departure delay, in minutes
- `Origin`: origin IATA airport code
- `Dest`: destination IATA airport code
- `Distance`: in miles
- `TaxiIn`: taxi in time, in minutes
- `TaxiOut`: taxi out time, in minutes
- `Cancelled`: was the flight cancelled?
- `CancellationCode`: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
- `Diverted`: 1 = yes, 0 = no
- `CarrierDelay`: in minutes
- `WeatherDelay`: in minutes
- `NASDelay`: in minutes
- `SecurityDelay`: in minutes
- `LateAircraftDelay`: in minutes
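
The four time columns store clock times as `hhmm` integers, so 1230 means 12:30, and the numeric gap between 1259 and 1300 is 41 even though the times are one minute apart. A minimal sketch of converting such a value to minutes since midnight is shown here. The helper name `hhmm_to_minutes` is hypothetical and is not used elsewhere in this notebook.

```python
def hhmm_to_minutes(hhmm):
    """Convert a clock time encoded as hhmm (e.g. 1230) to minutes since midnight."""
    hours, minutes = divmod(int(hhmm), 100)
    return hours * 60 + minutes


print(hhmm_to_minutes(1230))  # 12:30 -> 750
print(hhmm_to_minutes(1300) - hhmm_to_minutes(1259))  # 1
```

Values converted this way can be subtracted directly, which the raw `hhmm` encoding does not allow.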

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import ensemble
from sklearn.naive_bayes import BernoulliNB
from imblearn.over_sampling import SMOTE

raw_data = pd.read_csv('./data/flights_1989.csv')
raw_data.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,1989,1,23,1,1419.0,1230,1742.0,1552,UA,183,...,,,0,,0,,,,,
1,1989,1,24,2,1255.0,1230,1612.0,1552,UA,183,...,,,0,,0,,,,,
2,1989,1,25,3,1230.0,1230,1533.0,1552,UA,183,...,,,0,,0,,,,,
3,1989,1,26,4,1230.0,1230,1523.0,1552,UA,183,...,,,0,,0,,,,,
4,1989,1,27,5,1232.0,1230,1513.0,1552,UA,183,...,,,0,,0,,,,,


In [2]:
cols_with_one_val = []
cols_many_nans = []

def get_col_descriptions(df):
    for idx, col in enumerate(df.columns):
        num_uniq = len(df[col].unique())
        formatted_msg = '{}. {} – {} uniq vals'.format(idx + 1, col, num_uniq)
        
        if num_uniq == 1 and col not in cols_with_one_val:
            cols_with_one_val.append(col)
        
        if df[col].isnull().sum() > 0:
            num_nans = df[col].isnull().sum()
            percent_nans = round(num_nans / df.shape[0] * 100, 2)
            print(formatted_msg + '; {} NaNs ({}%)'.format(num_nans, percent_nans))
            if percent_nans > 50 and col not in cols_many_nans:
                cols_many_nans.append(col)
        else:
            print(formatted_msg)
    print('\n{} columns with 50+% NaNs: {}'.format(len(cols_many_nans), cols_many_nans))

get_col_descriptions(raw_data)

1. Year – 1 uniq vals
2. Month – 12 uniq vals
3. DayofMonth – 31 uniq vals
4. DayOfWeek – 7 uniq vals
5. DepTime – 1441 uniq vals; 74165 NaNs (1.47%)
6. CRSDepTime – 1199 uniq vals
7. ArrTime – 1441 uniq vals; 89004 NaNs (1.77%)
8. CRSArrTime – 1348 uniq vals
9. UniqueCarrier – 13 uniq vals
10. FlightNum – 2699 uniq vals
11. TailNum – 1 uniq vals; 5041200 NaNs (100.0%)
12. ActualElapsedTime – 629 uniq vals; 89004 NaNs (1.77%)
13. CRSElapsedTime – 488 uniq vals
14. AirTime – 1 uniq vals; 5041200 NaNs (100.0%)
15. ArrDelay – 721 uniq vals; 89004 NaNs (1.77%)
16. DepDelay – 789 uniq vals; 74165 NaNs (1.47%)
17. Origin – 237 uniq vals
18. Dest – 237 uniq vals
19. Distance – 1063 uniq vals; 26988 NaNs (0.54%)
20. TaxiIn – 1 uniq vals; 5041200 NaNs (100.0%)
21. TaxiOut – 1 uniq vals; 5041200 NaNs (100.0%)
22. Cancelled – 2 uniq vals
23. CancellationCode – 1 uniq vals; 5041200 NaNs (100.0%)
24. Diverted – 2 uniq vals
25. CarrierDelay – 1 uniq vals; 5041200 NaNs (100.0%)
26. WeatherDelay – 1 u

In [3]:
raw_data = raw_data.drop(cols_with_one_val, axis=1)
get_col_descriptions(raw_data)
print(raw_data.shape[0], 'rows')

1. Month – 12 uniq vals
2. DayofMonth – 31 uniq vals
3. DayOfWeek – 7 uniq vals
4. DepTime – 1441 uniq vals; 74165 NaNs (1.47%)
5. CRSDepTime – 1199 uniq vals
6. ArrTime – 1441 uniq vals; 89004 NaNs (1.77%)
7. CRSArrTime – 1348 uniq vals
8. UniqueCarrier – 13 uniq vals
9. FlightNum – 2699 uniq vals
10. ActualElapsedTime – 629 uniq vals; 89004 NaNs (1.77%)
11. CRSElapsedTime – 488 uniq vals
12. ArrDelay – 721 uniq vals; 89004 NaNs (1.77%)
13. DepDelay – 789 uniq vals; 74165 NaNs (1.47%)
14. Origin – 237 uniq vals
15. Dest – 237 uniq vals
16. Distance – 1063 uniq vals; 26988 NaNs (0.54%)
17. Cancelled – 2 uniq vals
18. Diverted – 2 uniq vals

10 columns with 50+% NaNs: ['TailNum', 'AirTime', 'TaxiIn', 'TaxiOut', 'CancellationCode', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
5041200 rows


In [4]:
raw_data.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,CRSElapsedTime,ArrDelay,DepDelay,Origin,Dest,Distance,Cancelled,Diverted
0,1,23,1,1419.0,1230,1742.0,1552,UA,183,323.0,322,110.0,109.0,SFO,HNL,2398.0,0,0
1,1,24,2,1255.0,1230,1612.0,1552,UA,183,317.0,322,20.0,25.0,SFO,HNL,2398.0,0,0
2,1,25,3,1230.0,1230,1533.0,1552,UA,183,303.0,322,-19.0,0.0,SFO,HNL,2398.0,0,0
3,1,26,4,1230.0,1230,1523.0,1552,UA,183,293.0,322,-29.0,0.0,SFO,HNL,2398.0,0,0
4,1,27,5,1232.0,1230,1513.0,1552,UA,183,281.0,322,-39.0,2.0,SFO,HNL,2398.0,0,0


In [5]:
df = raw_data.copy()
# Flag delays of 30 minutes or more; NaN delays compare as False and become 0
df['IsArrDelayed'] = (df['ArrDelay'] >= 30).astype(int)
df['IsDepDelayed'] = (df['DepDelay'] >= 30).astype(int)
df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,CRSElapsedTime,ArrDelay,DepDelay,Origin,Dest,Distance,Cancelled,Diverted,IsArrDelayed,IsDepDelayed
0,1,23,1,1419.0,1230,1742.0,1552,UA,183,323.0,322,110.0,109.0,SFO,HNL,2398.0,0,0,1,1
1,1,24,2,1255.0,1230,1612.0,1552,UA,183,317.0,322,20.0,25.0,SFO,HNL,2398.0,0,0,0,0
2,1,25,3,1230.0,1230,1533.0,1552,UA,183,303.0,322,-19.0,0.0,SFO,HNL,2398.0,0,0,0,0
3,1,26,4,1230.0,1230,1523.0,1552,UA,183,293.0,322,-29.0,0.0,SFO,HNL,2398.0,0,0,0,0
4,1,27,5,1232.0,1230,1513.0,1552,UA,183,281.0,322,-39.0,2.0,SFO,HNL,2398.0,0,0,0,0


In [6]:
def get_season(month_num):
    if month_num in [12, 1, 2]:
        return 'winter'
    elif month_num in [3, 4, 5]:
        return 'spring'
    elif month_num in [6, 7, 8]:
        return 'summer'
    else: 
        return 'fall'

top_50_origins = list(df['Origin'].value_counts().nlargest(50).index)
top_50_dests = list(df['Dest'].value_counts().nlargest(50).index)

def is_weekend(day_num):
    return 1 if day_num in [6, 7] else 0

def get_time_of_day(time):
    return 'AM' if time < 1200 else 'PM'

dur_avg = df['CRSElapsedTime'].mean()
def get_dur_vs_avg(duration):
    return 'Longer' if duration > dur_avg else 'Shorter'

dist_avg = df['Distance'].mean()
def get_dist_vs_avg(distance):
    return 'Farther' if distance > dist_avg else 'Closer'

df['Season'] = df['Month'].apply(get_season)
df['BegOfMonth'] = df['DayofMonth'] <= 15
# Series.isin is vectorized and gives the same booleans as the per-row membership test
df['IsInHolidaySeason'] = df['Month'].isin([11, 12, 1])
df['StartsInBusyAirport'] = df['Origin'].isin(top_50_origins)
df['EndsInBusyAirport'] = df['Dest'].isin(top_50_dests)
df['IsWeekend'] = df['DayOfWeek'].apply(is_weekend)
df['ScheduledDep'] = df['CRSDepTime'].apply(get_time_of_day)
df['ScheduledArr'] = df['CRSArrTime'].apply(get_time_of_day)
df['DurationVsAvg'] = df['CRSElapsedTime'].apply(get_dur_vs_avg)
df['DistanceVsAvg'] = df['Distance'].apply(get_dist_vs_avg)
df['IsDelayed'] = (df['IsArrDelayed'] + df['IsDepDelayed']) > 0
df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,...,BegOfMonth,IsInHolidaySeason,StartsInBusyAirport,EndsInBusyAirport,IsWeekend,ScheduledDep,ScheduledArr,DurationVsAvg,DistanceVsAvg,IsDelayed
0,1,23,1,1419.0,1230,1742.0,1552,UA,183,323.0,...,False,True,True,False,0,PM,PM,Longer,Farther,True
1,1,24,2,1255.0,1230,1612.0,1552,UA,183,317.0,...,False,True,True,False,0,PM,PM,Longer,Farther,False
2,1,25,3,1230.0,1230,1533.0,1552,UA,183,303.0,...,False,True,True,False,0,PM,PM,Longer,Farther,False
3,1,26,4,1230.0,1230,1523.0,1552,UA,183,293.0,...,False,True,True,False,0,PM,PM,Longer,Farther,False
4,1,27,5,1232.0,1230,1513.0,1552,UA,183,281.0,...,False,True,True,False,0,PM,PM,Longer,Farther,False


In [7]:
num_delays = df[df['IsDelayed'] == 1].shape[0]
num_okays = df[df['IsDelayed'] == 0].shape[0]
perc_delays = round(num_delays / (num_delays + num_okays) * 100, 2)

print(num_delays, 'Delays Out of', num_delays + num_okays, 'Flights')
print(perc_delays, '% of Flights Delayed')
print('Baseline:', round(100 - perc_delays, 2), '%')

572376 Delays Out of 5041200 Flights
11.35 % of Flights Delayed
Baseline: 88.65 %


In [8]:
print('Columns:', df.columns)
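# Columns that reveal the outcome directly, and raw columns already summarized by engineered features.
# Note: ArrDelay and DepDelay are not dropped here, so they stay in the feature
# matrix even though IsDelayed is derived from them.
giveaway_cols = ['IsArrDelayed', 'IsDepDelayed', 'DepTime', 'ArrTime', 'ActualElapsedTime']
non_info_cols = ['Month', 'DayofMonth', 'DayOfWeek', 'FlightNum', 'Origin', 'Dest']
df = df.drop(giveaway_cols + non_info_cols, axis=1)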
df.head()

Columns: Index(['Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime',
       'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
       'Cancelled', 'Diverted', 'IsArrDelayed', 'IsDepDelayed', 'Season',
       'BegOfMonth', 'IsInHolidaySeason', 'StartsInBusyAirport',
       'EndsInBusyAirport', 'IsWeekend', 'ScheduledDep', 'ScheduledArr',
       'DurationVsAvg', 'DistanceVsAvg', 'IsDelayed'],
      dtype='object')


Unnamed: 0,CRSDepTime,CRSArrTime,UniqueCarrier,CRSElapsedTime,ArrDelay,DepDelay,Distance,Cancelled,Diverted,Season,BegOfMonth,IsInHolidaySeason,StartsInBusyAirport,EndsInBusyAirport,IsWeekend,ScheduledDep,ScheduledArr,DurationVsAvg,DistanceVsAvg,IsDelayed
0,1230,1552,UA,322,110.0,109.0,2398.0,0,0,winter,False,True,True,False,0,PM,PM,Longer,Farther,True
1,1230,1552,UA,322,20.0,25.0,2398.0,0,0,winter,False,True,True,False,0,PM,PM,Longer,Farther,False
2,1230,1552,UA,322,-19.0,0.0,2398.0,0,0,winter,False,True,True,False,0,PM,PM,Longer,Farther,False
3,1230,1552,UA,322,-29.0,0.0,2398.0,0,0,winter,False,True,True,False,0,PM,PM,Longer,Farther,False
4,1230,1552,UA,322,-39.0,2.0,2398.0,0,0,winter,False,True,True,False,0,PM,PM,Longer,Farther,False
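
The label `IsDelayed` was built from `ArrDelay` and `DepDelay`, and both columns are still present in the frame above. As a minimal sketch, in plain Python rather than pandas, the rule that links them can be written as a single function. The names `is_delayed` and `DELAY_THRESHOLD_MIN` are hypothetical; the sample pairs are copied from the first rows of the output above.

```python
import math

DELAY_THRESHOLD_MIN = 30  # same cutoff as the notebook's `>= 30` comparisons


def is_delayed(arr_delay, dep_delay):
    """Return True if either delay reaches the threshold; NaN compares as False."""
    return arr_delay >= DELAY_THRESHOLD_MIN or dep_delay >= DELAY_THRESHOLD_MIN


# (ArrDelay, DepDelay) pairs from the first rows shown above
rows = [(110.0, 109.0), (20.0, 25.0), (-19.0, 0.0)]
print([is_delayed(a, d) for a, d in rows])  # [True, False, False]
print(is_delayed(math.nan, math.nan))       # False
```

Since the target can be recomputed exactly from these two columns, any model that receives them as inputs only has to recover the threshold.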


In [9]:
df = pd.get_dummies(df)
print(list(df.columns))
df.head()

['CRSDepTime', 'CRSArrTime', 'CRSElapsedTime', 'ArrDelay', 'DepDelay', 'Distance', 'Cancelled', 'Diverted', 'BegOfMonth', 'IsInHolidaySeason', 'StartsInBusyAirport', 'EndsInBusyAirport', 'IsWeekend', 'IsDelayed', 'UniqueCarrier_AA', 'UniqueCarrier_AS', 'UniqueCarrier_CO', 'UniqueCarrier_DL', 'UniqueCarrier_EA', 'UniqueCarrier_HP', 'UniqueCarrier_NW', 'UniqueCarrier_PA (1)', 'UniqueCarrier_PI', 'UniqueCarrier_TW', 'UniqueCarrier_UA', 'UniqueCarrier_US', 'UniqueCarrier_WN', 'Season_fall', 'Season_spring', 'Season_summer', 'Season_winter', 'ScheduledDep_AM', 'ScheduledDep_PM', 'ScheduledArr_AM', 'ScheduledArr_PM', 'DurationVsAvg_Longer', 'DurationVsAvg_Shorter', 'DistanceVsAvg_Closer', 'DistanceVsAvg_Farther']


Unnamed: 0,CRSDepTime,CRSArrTime,CRSElapsedTime,ArrDelay,DepDelay,Distance,Cancelled,Diverted,BegOfMonth,IsInHolidaySeason,...,Season_summer,Season_winter,ScheduledDep_AM,ScheduledDep_PM,ScheduledArr_AM,ScheduledArr_PM,DurationVsAvg_Longer,DurationVsAvg_Shorter,DistanceVsAvg_Closer,DistanceVsAvg_Farther
0,1230,1552,322,110.0,109.0,2398.0,0,0,False,True,...,0,1,0,1,0,1,1,0,0,1
1,1230,1552,322,20.0,25.0,2398.0,0,0,False,True,...,0,1,0,1,0,1,1,0,0,1
2,1230,1552,322,-19.0,0.0,2398.0,0,0,False,True,...,0,1,0,1,0,1,1,0,0,1
3,1230,1552,322,-29.0,0.0,2398.0,0,0,False,True,...,0,1,0,1,0,1,1,0,0,1
4,1230,1552,322,-39.0,2.0,2398.0,0,0,False,True,...,0,1,0,1,0,1,1,0,0,1


In [10]:
print('Before dropped NaNs:', df.shape[0], 'rows')
df = df.dropna()
print('After dropped NaNs:', df.shape[0], 'rows')

Y = df['IsDelayed']
X = np.array(df.drop(labels=['IsDelayed'], axis=1))
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

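# Oversample the minority class in the training split only.
# Recent imbalanced-learn releases use `sampling_strategy` and `fit_resample`
# in place of the older `ratio` and `fit_sample`.
sm = SMOTE(random_state=12, sampling_strategy=1.0)
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train)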

Before dropped NaNs: 5041200 rows
After dropped NaNs: 4925482 rows
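
The resampling step balances the classes in the training split only; the test split keeps its original class ratio. The sketch below illustrates the balancing idea with plain random oversampling, which duplicates minority rows. This is a simpler stand-in and not the SMOTE algorithm itself, which synthesizes new points by interpolating between neighboring minority samples. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    out_features, out_labels = list(features), list(labels)
    for label, count in counts.items():
        if label == majority_label:
            continue
        # Rows belonging to this minority class
        pool = [f for f, l in zip(features, labels) if l == label]
        for _ in range(majority_count - count):
            out_features.append(rng.choice(pool))
            out_labels.append(label)
    return out_features, out_labels


# Toy training split: 8 on-time flights (0) and 2 delayed flights (1)
toy_X = [[i] for i in range(10)]
toy_y = [0] * 8 + [1] * 2
res_X, res_y = random_oversample(toy_X, toy_y)
print(Counter(toy_y), Counter(res_y))
```

After resampling, both classes in the toy split have eight rows, while the original lists are left unchanged.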




In [11]:
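def fit_and_train(model, fit_X_train, fit_Y_train, X_train, Y_train):
    # Fit on the resampled data, then score on the original train and test splits.
    # For scikit-learn classifiers, .score() returns mean accuracy, not R²;
    # the printed labels are kept as they appear in the recorded output.
    model_fit = model.fit(fit_X_train, fit_Y_train)
    model_score_train = model.score(X_train, Y_train)
    print('R² for train:', model_score_train)

    # X_test and Y_test come from the enclosing notebook scope
    model_score_test = model.score(X_test, Y_test)
    print('\nR² for test:', model_score_test)

    if hasattr(model_fit, 'coef_'):
        print('\nCoefficients:', model_fit.coef_)

    if hasattr(model_fit, 'intercept_'):
        print('\nIntercept:', model_fit.intercept_)

# The L1 penalty needs a solver that supports it; liblinear does.
lasso = linear_model.LogisticRegression(penalty='l1', C=100, solver='liblinear')
fit_and_train(lasso, X_train_res, Y_train_res, X_train, Y_train)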

R² for train: 0.9773612665292633

R² for test: 0.9773956155564455

Coefficients: [[-3.22287797e-05  4.26226773e-05  3.02839002e-02  3.82477327e-01
   1.69381452e-01 -2.72942843e-03  0.00000000e+00  0.00000000e+00
   1.77943659e-02 -5.88905649e-02 -6.65891924e-02  3.08052250e-01
   6.43544515e-02 -2.37384241e+00 -2.51891518e+00 -1.72613777e+00
  -2.88332818e+00 -1.92784362e+00 -3.15808600e+00 -2.17450688e+00
  -1.92446461e+00 -3.00089931e+00 -2.27972066e+00 -2.60565935e+00
  -2.85050347e+00 -2.93750511e+00 -2.42168499e+00 -2.44390918e+00
  -2.51812968e+00 -2.33021561e+00 -2.09047907e+00 -2.02926581e+00
  -1.76105251e+00 -2.04238139e+00 -1.73852428e+00 -1.88281458e+00
  -1.73403186e+00 -1.89538242e+00]]

Intercept: [-1.1094158]


In [12]:
def evaluate_model_printout(model):
    Y_train_vals = Y_train.values
    Y_test_vals = Y_test.values

    # Predict class 0 only when its probability exceeds 0.998; otherwise predict class 1.
    # np.int was removed from NumPy, so the built-in int is used instead.
    predict_train = (model.predict_proba(X_train)[:, 0] <= .998).astype(int)

    predict_test = (model.predict_proba(X_test)[:, 0] <= .998).astype(int)
    
    crosstab_labels = [0, 1, 'All']
    table_train = pd.crosstab(Y_train_vals, predict_train, rownames=['actual'], colnames=['predicted'], margins=True)
    table_train = table_train.reindex(index=crosstab_labels,columns=crosstab_labels, fill_value=0)

    print('TRAIN:')
    print(table_train, '\n')

    train_tI_errors = table_train.loc[0,1] / table_train.loc['All','All']
    train_tII_errors = table_train.loc[1,0] / table_train.loc['All','All']
    print(('Accuracy:\n% Type I errors: {}\n% Type II errors: {}\n').format(train_tI_errors, train_tII_errors))

    train_precision = table_train.loc[1,1] / table_train.loc['All', 1] # correctly predicted positives / all predicted positives
    train_recall = table_train.loc[1,1] / table_train.loc[1,'All'] # true positives / (true positives + false negatives)
    print('Precision:', train_precision)
    print('Recall:', train_recall, '\n\n----------\n')

    table_test = pd.crosstab(Y_test_vals, predict_test, rownames=['actual'], colnames=['predicted'], margins=True)
    table_test = table_test.reindex(index=crosstab_labels,columns=crosstab_labels, fill_value=0)
    
    print('TEST:')
    print(table_test, '\n')

    test_tI_errors = table_test.loc[0,1]/table_test.loc['All','All']
    test_tII_errors = table_test.loc[1,0]/table_test.loc['All','All']
    print(('Accuracy:\n% Type I errors: {}\n% Type II errors: {}\n').format(test_tI_errors, test_tII_errors))

    test_precision = table_test.loc[1,1] / table_test.loc['All', 1] # correctly predicted positives / all predicted positives
    test_recall = table_test.loc[1,1] / table_test.loc[1,'All'] # true positives / (true positives + false negatives)
    print('Precision:', test_precision)
    print('Recall:', test_recall)

In [13]:
evaluate_model_printout(lasso)

TRAIN:
predicted        0       1      All
actual                             
0          2179271  436516  2615787
1              211  339291   339502
All        2179482  775807  2955289 

Accuracy:
% Type I errors: 0.1477067048264992
% Type II errors: 7.139741663167291e-05

Precision: 0.43733944138168385
Recall: 0.9993785014521269 

----------

TEST:
predicted        0       1      All
actual                             
0          1454601  290538  1745139
1              134  224920   225054
All        1454735  515458  1970193 

Accuracy:
% Type I errors: 0.14746677102192526
% Type II errors: 6.801364130316167e-05

Precision: 0.43634980929581074
Recall: 0.999404587343482


In [14]:
ridge = linear_model.LogisticRegression(penalty='l2', C=100, fit_intercept=False)
fit_and_train(ridge, X_train_res, Y_train_res, X_train, Y_train)

R² for train: 0.9772079820281536

R² for test: 0.9772347176139596

Coefficients: [[-1.21201662e-04 -3.42407101e-05  2.87645373e-02  3.75239934e-01
   1.66551470e-01 -2.64948562e-03  0.00000000e+00  0.00000000e+00
   2.92183300e-03 -7.12302256e-02 -1.42038358e-01  2.08205529e-01
   3.99134240e-02 -3.17258694e-01 -4.60666978e-01  2.94995261e-01
  -8.43536847e-01  9.91214846e-02 -1.01108039e+00 -1.36793352e-01
   2.74080803e-01 -9.35426434e-01 -2.28969235e-01 -5.65387271e-01
  -8.02492802e-01 -8.91773753e-01 -1.36968628e+00 -1.40457199e+00
  -1.47677765e+00 -1.27415229e+00 -2.84704715e+00 -2.67814106e+00
  -2.64393010e+00 -2.88125811e+00 -2.63671434e+00 -2.88847387e+00
  -2.69793056e+00 -2.82725765e+00]]

Intercept: 0.0


In [15]:
evaluate_model_printout(ridge)

TRAIN:
predicted        0       1      All
actual                             
0          2165376  450411  2615787
1              201  339301   339502
All        2165577  789712  2955289 

Accuracy:
% Type I errors: 0.15240844465634326
% Type II errors: 6.80136528102666e-05

Precision: 0.4296515691796503
Recall: 0.9994079563596091 

----------

TEST:
predicted        0       1      All
actual                             
0          1445309  299830  1745139
1              127  224927   225054
All        1445436  524757  1970193 

Accuracy:
% Type I errors: 0.15218306023826092
% Type II errors: 6.446068989180247e-05

Precision: 0.4286307757685938
Recall: 0.9994356909897181


In [16]:
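# The default loss is the binomial deviance (log loss). Recent scikit-learn
# releases no longer accept the name 'deviance', so the default is used.
gbm = ensemble.GradientBoostingClassifier(n_estimators=500, max_depth=2)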
fit_and_train(gbm, X_train_res, Y_train_res, X_train, Y_train)

R² for train: 1.0

R² for test: 1.0


In [17]:
evaluate_model_printout(gbm)

TRAIN:
predicted        0       1      All
actual                             
0          2590833   24954  2615787
1                0  339502   339502
All        2590833  364456  2955289 

Accuracy:
% Type I errors: 0.008443844239937279
% Type II errors: 0.0

Precision: 0.9315308295102839
Recall: 1.0 

----------

TEST:
predicted        0       1      All
actual                             
0          1728596   16543  1745139
1                0  225054   225054
All        1728596  241597  1970193 

Accuracy:
% Type I errors: 0.008396639314016444
% Type II errors: 0.0

Precision: 0.9315264676299788
Recall: 1.0


In [18]:
bnb = BernoulliNB()
fit_and_train(bnb, X_train_res, Y_train_res, X_train, Y_train)

R² for train: 0.784408902141212

R² for test: 0.7845322767870965

Coefficients: [[-3.82293905e-07 -3.82293905e-07 -7.64587956e-07 -6.71533397e-04
  -3.74906184e-02 -3.82293905e-07 -1.47770763e+01 -1.47770763e+01
  -3.56831861e-01 -8.73901109e-01 -1.16698865e-01 -1.53009038e-01
  -1.00181293e+00 -1.79004024e+00 -4.12675866e+00 -2.27011422e+00
  -1.86839144e+00 -3.23030998e+00 -3.54326971e+00 -2.29739518e+00
  -3.63762324e+00 -2.17805660e+00 -2.59845197e+00 -1.59040758e+00
  -1.41644279e+00 -2.73827507e+00 -1.06530205e+00 -1.05536010e+00
  -8.64434525e-01 -8.34309528e-01 -1.26195356e+00 -3.32516144e-01
  -1.59047885e+00 -2.27701403e-01 -7.78326304e-01 -5.85443657e-01
  -5.70296957e-01 -8.27926886e-01]]

Intercept: [-0.69314718]


In [19]:
evaluate_model_printout(bnb)

TRAIN:
predicted        0        1      All
actual                              
0          1110473  1505314  2615787
1              257   339245   339502
All        1110730  1844559  2955289 

Accuracy:
% Type I errors: 0.5093627053056402
% Type II errors: 8.696273021014189e-05

Precision: 0.18391658927689492
Recall: 0.9992430088777091 

----------

TEST:
predicted       0        1      All
actual                             
0          741032  1004107  1745139
1             185   224869   225054
All        741217  1228976  1970193 

Accuracy:
% Type I errors: 0.5096490546865206
% Type II errors: 9.389943015735006e-05

Precision: 0.18297265365637735
Recall: 0.9991779750637625


## Conclusion

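The gradient boosting model performed best. It had the highest score of the four models (printed above as R², although `.score()` returns mean accuracy for classifiers), the highest precision (about 0.93 on the test set), and a recall of 1.0. Note that `ArrDelay` and `DepDelay`, from which `IsDelayed` is derived, were left in the feature set, so these scores largely reflect the models recovering the 30-minute threshold rather than predicting delays from schedule information alone.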
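
As a worked check of the precision and recall figures, the sketch below recomputes them from the gradient boosting test-set crosstab (224,054 is not used; the counts are 225054 true positives, 16543 false positives, and 0 false negatives, as recorded above). The helper name `precision_recall` is hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Test-set counts from the gradient boosting crosstab above
gbm_precision, gbm_recall = precision_recall(true_pos=225054, false_pos=16543, false_neg=0)
print(round(gbm_precision, 4), gbm_recall)  # 0.9315 1.0
```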