### Let's establish a baseline accuracy to work from, so that during feature engineering we can tell whether a given feature helps or hurts performance

In [1]:
import pandas as pd 
import numpy as np 
import os 
os.chdir("../")

In [2]:
### Load in the data 
df = pd.read_csv("data/aggregated.csv")

In [3]:
df.head()

Unnamed: 0,MONTH,DAY_OF_WEEK,FL_DATE,UNIQUE_CARRIER,FL_NUM,ORIGIN,ORIGIN_CITY_NAME,DEST,DEST_CITY_NAME,CRS_DEP_TIME,ARR_DEL15,CRS_ELAPSED_TIME,DISTANCE,Unnamed: 13
0,2.0,6.0,2017-02-25,B6,28.0,MCO,"Orlando, FL",EWR,"Newark, NJ",1000.0,0.0,156.0,937.0,
1,2.0,7.0,2017-02-26,B6,28.0,MCO,"Orlando, FL",EWR,"Newark, NJ",739.0,0.0,153.0,937.0,
2,2.0,1.0,2017-02-27,B6,28.0,MCO,"Orlando, FL",EWR,"Newark, NJ",1028.0,0.0,158.0,937.0,
3,2.0,2.0,2017-02-28,B6,28.0,MCO,"Orlando, FL",EWR,"Newark, NJ",739.0,0.0,153.0,937.0,
4,2.0,3.0,2017-02-01,B6,33.0,BTV,"Burlington, VT",JFK,"New York, NY",1907.0,0.0,90.0,266.0,


### Right off the bat, Unnamed: 13 looks like it should be dropped, since it is null in every row shown above
### ORIGIN_CITY_NAME carries essentially the same information as ORIGIN, only less detailed: it names just the city, and a single city can have multiple airports. It wouldn't be too informative as a feature for determining whether a flight in fact gets delayed, so it's dropped. The same goes for DEST_CITY_NAME
### FL_NUM should also be dropped, as it seems cryptic and hard to interpret as a feature

In [4]:
df.isna().sum()
### Hmmm, 5,129,354 null values in Unnamed: 13, so it should most certainly be dropped

MONTH                     0
DAY_OF_WEEK               0
FL_DATE                   0
UNIQUE_CARRIER            0
FL_NUM                    0
ORIGIN                    0
ORIGIN_CITY_NAME          0
DEST                      0
DEST_CITY_NAME            0
CRS_DEP_TIME              0
ARR_DEL15             71020
CRS_ELAPSED_TIME         10
DISTANCE                  0
Unnamed: 13         5129354
dtype: int64
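
### Raw null counts are easier to judge next to the share of rows they represent. Below is a minimal sketch of such a report; `null_report` and the toy frame are made up for illustration and are not part of this project's scripts

```python
import numpy as np
import pandas as pd


def null_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null share per column, worst column first."""
    report = pd.DataFrame({
        "null_count": frame.isna().sum(),   # same numbers as df.isna().sum()
        "null_share": frame.isna().mean(),  # fraction of rows that are null
    })
    return report.sort_values("null_share", ascending=False)


# Tiny made-up frame standing in for the real data (values are illustrative)
toy = pd.DataFrame({
    "DISTANCE": [937.0, 937.0, 266.0, 266.0],
    "ARR_DEL15": [0.0, np.nan, 1.0, 0.0],
    "Unnamed: 13": [np.nan] * 4,
})

report = null_report(toy)
print(report)
```

### On the real data the call would be `null_report(df)`; a column whose null share is at or near 1.0 carries no usable signal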

In [6]:
## Helper functions for dropping irrelevant columns and encoding categorical values 
## For function details, see `/scripts/preprocessing.py`
from scripts.preprocessing import drop_values
from scripts.preprocessing import encode_cats
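
### The real helpers live in `/scripts/preprocessing.py` and are not shown here. For orientation, here is a rough sketch of what they might look like. Everything in it is an assumption: the `_sketch` function names, the list of dropped columns (taken from the notes above), the list of categorical columns, and the choice of integer category codes as the encoding

```python
import pandas as pd

# Assumed from the notes above; the real script may drop a different set
ASSUMED_DROP = ["Unnamed: 13", "ORIGIN_CITY_NAME", "DEST_CITY_NAME", "FL_NUM"]
# Assumed categorical columns; the real script may treat others as categorical
ASSUMED_CATS = ["UNIQUE_CARRIER", "ORIGIN", "DEST"]


def drop_values_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the columns judged irrelevant, ignoring any that are absent."""
    return frame.drop(columns=ASSUMED_DROP, errors="ignore")


def encode_cats_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace each categorical column with integer category codes."""
    out = frame.copy()
    for col in ASSUMED_CATS:
        if col in out.columns:
            out[col] = out[col].astype("category").cat.codes
    return out


# Tiny made-up frame to exercise the two helpers
toy = pd.DataFrame({
    "UNIQUE_CARRIER": ["B6", "AA", "B6"],
    "ORIGIN": ["MCO", "BTV", "MCO"],
    "ORIGIN_CITY_NAME": ["Orlando, FL", "Burlington, VT", "Orlando, FL"],
    "FL_NUM": [28.0, 33.0, 28.0],
    "DISTANCE": [937.0, 266.0, 937.0],
})

encoded = encode_cats_sketch(drop_values_sketch(toy))
print(encoded)
```

### Integer codes impose an arbitrary ordering on carriers and airports. Tree-based models such as a random forest tend to tolerate that, which is why the sketch uses them, but one-hot encoding would be the other obvious choice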


In [7]:
cleaned_df = drop_values(df)
cleaned_df = encode_cats(cleaned_df)
cleaned_df = cleaned_df.drop(columns=['FL_DATE'])
cleaned_df = cleaned_df.dropna()

In [8]:
## Train, test, split, predict 
## Establishing baseline accuracy 
from sklearn.ensemble import RandomForestClassifier
from scripts.model_helpers import split_train_predict
pred, test_y = split_train_predict(cleaned_df, RandomForestClassifier())
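
### `split_train_predict` is defined in `scripts/model_helpers.py` and is not shown here. A minimal sketch of a helper with the same call shape follows. The `_sketch` name, the `ARR_DEL15` target, the 25% test split, the stratification, and the fixed seed are all assumptions rather than the project's actual settings

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def split_train_predict_sketch(frame, model, target="ARR_DEL15",
                               test_size=0.25, seed=0):
    """Split into train/test, fit the model, and return (predictions, true labels)."""
    X = frame.drop(columns=[target])
    y = frame[target]
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model.fit(train_X, train_y)
    return model.predict(test_X), test_y


# Made-up data: a flight counts as "delayed" when it departs after 18:00
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DISTANCE": rng.integers(100, 3000, size=200),
    "CRS_DEP_TIME": rng.integers(0, 2400, size=200),
})
toy["ARR_DEL15"] = (toy["CRS_DEP_TIME"] > 1800).astype(float)

toy_pred, toy_test_y = split_train_predict_sketch(
    toy, RandomForestClassifier(n_estimators=20, random_state=0)
)
print(len(toy_pred), len(toy_test_y))
```

### Fixing the seed for both the split and the model keeps the baseline reproducible, which matters when later feature changes are compared against it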

In [9]:
from sklearn.metrics import accuracy_score
print("-"*6)
print("Baseline accuracy ---> ", accuracy_score(test_y, pred))  # scikit-learn's argument order is (y_true, y_pred)
print("-"*6)

------
Baseline accuracy --->  0.7761746155257381
------
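
### An accuracy figure is only meaningful relative to the class balance, which this notebook does not show. One quick sanity check is the accuracy of always predicting the most common class. Below is a minimal sketch; `majority_class_accuracy` and the toy labels are made up for illustration

```python
import pandas as pd


def majority_class_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return float(labels.value_counts(normalize=True).max())


# Made-up labels: 8 on-time flights and 2 delayed ones
toy_labels = pd.Series([0.0] * 8 + [1.0] * 2)

floor = majority_class_accuracy(toy_labels)
print("Majority-class accuracy ---> ", floor)
```

### On the real data the call would be `majority_class_accuracy(test_y)`. If that number lands close to the model's accuracy, the model is adding little beyond guessing "not delayed" every time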


### Lovely, so 77.6% is now our baseline. Let's perform some augmentations step by step and see if we can improve on it