# Machine Learning

This file describes how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (or will be).
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be an almost perfect predictor, but we can't use it: in a real-life scenario, we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.
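
As an illustration of a transformation built only from earlier values, the sketch below computes the mean arrival delay per origin airport for each day and then shifts it, so that a flight only sees the mean of the previous observed day. The toy data and the column name `origin_prev_day_delay` are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

# Toy data; column names follow the flights table, values are made up.
toy = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-03"]
    ),
    "origin": ["JFK", "JFK", "JFK", "JFK", "JFK"],
    "arr_delay": [10.0, 30.0, 0.0, 20.0, 50.0],
})

# Mean delay per origin and day, sorted so that shifting follows time order.
daily = (
    toy.groupby(["origin", "fl_date"], as_index=False)["arr_delay"]
    .mean()
    .sort_values(["origin", "fl_date"])
)

# Shift by one row within each origin: a day gets the mean of the
# previous observed day, never its own delays.
daily["origin_prev_day_delay"] = daily.groupby("origin")["arr_delay"].shift(1)

# Attach the lagged feature back to the individual flights.
toy = toy.merge(
    daily[["origin", "fl_date", "origin_prev_day_delay"]],
    on=["origin", "fl_date"],
    how="left",
)
```

The first day has no earlier data, so its feature is missing and would need to be imputed or dropped. Note that `shift(1)` refers to the previous observed day rather than the previous calendar day. Because the models are evaluated one week ahead, a longer lag is probably required in practice.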

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
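
One way to mimic this evaluation locally is to hold out the last seven days of the training data as a validation set instead of splitting at random. The helper `split_last_week` below is a hypothetical sketch and not something the notebook defines.

```python
import pandas as pd

def split_last_week(df, date_col="fl_date", days=7):
    """Hold out the final `days` days of `df` as a validation set."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(days=days)
    train = df[dates <= cutoff]
    valid = df[dates > cutoff]
    return train, valid

# Toy frame covering 14 consecutive days, one flight per day.
toy = pd.DataFrame({
    "fl_date": pd.date_range("2019-01-01", periods=14).strftime("%Y-%m-%d"),
    "arr_delay": list(range(14)),
})
train, valid = split_last_week(toy)
```

Here `train` covers the first seven days and `valid` the last seven, so no validation flight is earlier than any training flight.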

### Init Functions

In [5]:
import numpy as np
import pandas as pd
import re

In [17]:
def get_city_name(df):
    df['dest_city_name']=df.dest_city_name.str[:-4] # changed from flights.dest...
    df['origin_city_name']=df.origin_city_name.str[:-4] # changed from flights.origin...
    return df

In [7]:
def get_dep_arr_hr(df):
    df['dep_hr']=(df.crs_dep_time//100)*100
    df['arr_hr']=(df.crs_arr_time//100)*100
    return df

In [8]:
def time_machine(df):
    df['months']=pd.to_datetime(df.fl_date).dt.strftime("%b")
    df['weekday']=pd.to_datetime(df.fl_date).dt.strftime("%A")

In [9]:
def traffic(df):
    # Daily flight counts per origin and per destination airport.
    # Select 'flights' before summing so only that column is aggregated.
    origin_traffic=df.groupby(['fl_date','origin'])['flights'].sum().reset_index().rename({'flights': 'origin_traffic'}, axis='columns')
    dest_traffic=df.groupby(['fl_date','dest'])['flights'].sum().reset_index().rename({'flights': 'dest_traffic'}, axis='columns')
    df=pd.merge(df, origin_traffic, on=['fl_date','origin'])
    df=pd.merge(df, dest_traffic, on=['fl_date','dest'])
    return df
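
The `traffic` function builds the daily counts with a groupby followed by a merge. A possible alternative, shown here only as a sketch on made-up data, is `groupby(...).transform('sum')`, which writes the group total directly onto each row and leaves the row order unchanged.

```python
import pandas as pd

# Toy data; every row is one flight, as in the 'flights' column of the dataset.
toy = pd.DataFrame({
    "fl_date": ["2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02"],
    "origin": ["JFK", "JFK", "LAX", "JFK"],
    "flights": [1, 1, 1, 1],
})

# Number of departures from the same origin on the same day.
toy["origin_traffic"] = (
    toy.groupby(["fl_date", "origin"])["flights"].transform("sum")
)
```

JFK has two departures on the first day, so both of those rows get `origin_traffic` equal to 2.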

In [10]:
def time_to_mins(df):
    # Convert HHMM clock times to minutes after midnight: hours*60 + minutes (modifies df in place)
    df['crs_dep_time(mins)']=((df.crs_dep_time//100)*60)+df.crs_dep_time%100
    df['crs_arr_time(mins)']=((df.crs_arr_time//100)*60)+df.crs_arr_time%100
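
The arithmetic in `time_to_mins` reduces to hours × 60 + minutes, with the scheduled time stored as an HHMM integer. A small worked example with made-up times:

```python
import pandas as pd

# HHMM integers: 00:05, 09:45, 15:30, 23:59
hhmm = pd.Series([5, 945, 1530, 2359])

hours = hhmm // 100    # 0, 9, 15, 23
minutes = hhmm % 100   # 5, 45, 30, 59
mins = hours * 60 + minutes
```

For instance, 1530 becomes 15 × 60 + 30 = 930 minutes after midnight.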

In [11]:
def taxi_Med(df):
    # Note: the medians come from the global training frame `flights`, not from `df`,
    # so `flights` must be loaded before calling this (flights_test has no taxi columns).
    origin_taxi=flights.groupby('origin').taxi_out.median().to_frame().reset_index().rename({'taxi_out': 'origin_taxi'},axis='columns')
    dest_taxi=flights.groupby('dest').taxi_in.median().to_frame().reset_index().rename({'taxi_in': 'dest_taxi'},axis='columns')
    df=pd.merge(df, origin_taxi, on='origin')
    df=pd.merge(df, dest_taxi, on='dest')
    return df
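
`pd.merge` defaults to an inner join, so any row whose `origin` or `dest` has no median in the lookup table is dropped. The sketch below uses made-up frames to show the effect, along with one possible workaround: a left join followed by filling the gaps with the overall training median. The fallback value is an assumption for illustration, not a choice made in this notebook.

```python
import pandas as pd

train = pd.DataFrame({
    "origin": ["JFK", "JFK", "LAX"],
    "taxi_out": [20.0, 30.0, 15.0],
})
test = pd.DataFrame({"origin": ["JFK", "LAX", "XNA"]})  # XNA is absent from train

# Median taxi-out time per origin, learned from the training data only.
origin_taxi = (
    train.groupby("origin")["taxi_out"].median()
    .rename("origin_taxi")
    .reset_index()
)

inner = test.merge(origin_taxi, on="origin")             # default how="inner"
left = test.merge(origin_taxi, on="origin", how="left")  # keeps every test row

# Assumed fallback: overall training median for airports never seen in training.
left["origin_taxi"] = left["origin_taxi"].fillna(train["taxi_out"].median())
```

With the inner join the XNA row disappears, which would leave that flight without a prediction. The left join keeps all three rows.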

In [12]:
def one_hot1(df):
    df = df.join(pd.get_dummies(df.mkt_carrier))
    return df

In [13]:
def one_hot2(df):
    df = df.join(pd.get_dummies(df.weekday))
    return df

In [14]:
def one_hot3(df):
    df = df.join(pd.get_dummies(df.months))
    return df
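
`pd.get_dummies` only creates columns for the categories present in the frame it receives, so the training and test sets can end up with different dummy columns (for example, a test set that covers a single month). A minimal sketch of aligning them with `reindex`, using made-up frames:

```python
import pandas as pd

train = pd.DataFrame({"months": ["Jan", "Feb", "Mar"]})
test = pd.DataFrame({"months": ["Jan", "Jan"]})

train_hot = pd.get_dummies(train["months"])

# Give the test dummies the training columns; categories missing from
# the test set become all-zero columns.
test_hot = pd.get_dummies(test["months"]).reindex(
    columns=train_hot.columns, fill_value=0
)
```

After this step both frames have the same columns in the same order, which is what a fitted model expects.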

### Transform Flights (training)

In [2]:
flights = pd.read_csv('data/flights.csv', low_memory=False)

In [4]:
flights=get_city_name(flights)

In [6]:
flights=get_dep_arr_hr(flights)

In [10]:
time_machine(flights)

In [12]:
# Checkpoint
flights.to_csv('data/flights_tm.csv', index=False)

In [1]:
#%reset -f
#import numpy as np
#import pandas as pd

In [25]:
flights = pd.read_csv('data/flights_tm.csv', low_memory=False)

In [26]:
flights=traffic(flights)

In [27]:
time_to_mins(flights)

In [8]:
flights=taxi_Med(flights)

In [9]:
# Checkpoint
flights.to_csv('data/flights_taxi.csv', index=False)

In [1]:
#%reset -f
#import numpy as np
#import pandas as pd

In [2]:
#flights = pd.read_csv('data/flights_taxi.csv', low_memory=False)

In [13]:
flights=one_hot1(flights)

In [14]:
flights=one_hot2(flights)

In [15]:
flights=one_hot3(flights)

In [16]:
# Checkpoint
flights.to_csv('data/flights_hot.csv', index=False)

In [None]:
#%reset -f
#import numpy as np
#import pandas as pd

In [None]:
#flights = pd.read_csv('data/flights_taxi.csv', low_memory=False)

### Transform Flights_Test
Kernel restarted to clear variables in RAM

In [15]:
flights_test = pd.read_csv('data/flights_test.csv', low_memory=False)

In [18]:
flights_test=get_city_name(flights_test)

In [19]:
flights_test=get_dep_arr_hr(flights_test)

In [20]:
time_machine(flights_test)

In [21]:
flights_test=traffic(flights_test)

In [22]:
time_to_mins(flights_test)

In [24]:
flights_test.columns

Index(['Unnamed: 0', 'fl_date', 'mkt_unique_carrier', 'branded_code_share',
       'mkt_carrier', 'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time',
       'crs_arr_time', 'dup', 'crs_elapsed_time', 'flights', 'distance',
       'dep_hr', 'arr_hr', 'months', 'weekday', 'origin_traffic',
       'dest_traffic', 'crs_dep_time(mins)', 'crs_arr_time(mins)'],
      dtype='object')

In [None]:
flights_test.to_csv('data/flights_pre-taxi.csv', index=False)

In [28]:
flights_test=taxi_Med(flights_test)

In [29]:
del flights

In [30]:
flights_test=one_hot1(flights_test)

In [31]:
flights_test=one_hot2(flights_test)

In [32]:
flights_test=one_hot3(flights_test)

## Feature Selection

In [18]:
pd.options.display.max_rows = None

In [22]:
flights.dtypes

Unnamed: 0              object
fl_date                 object
mkt_unique_carrier      object
branded_code_share      object
mkt_carrier             object
mkt_carrier_fl_num       int64
op_unique_carrier       object
tail_num                object
op_carrier_fl_num        int64
origin_airport_id        int64
origin                  object
origin_city_name        object
dest_airport_id          int64
dest                    object
dest_city_name          object
crs_dep_time             int64
dep_time               float64
dep_delay              float64
taxi_out               float64
wheels_off             float64
wheels_on              float64
taxi_in                float64
crs_arr_time             int64
arr_time               float64
arr_delay              float64
cancelled              float64
cancellation_code       object
diverted               float64
dup                     object
crs_elapsed_time       float64
actual_elapsed_time    float64
air_time               float64
flights 

In [45]:
features = ['fl_date','mkt_carrier','mkt_carrier_fl_num','origin','dest'] # _sub + predicted_delay

In [46]:
features += ['arr_delay'] # add y .iloc[:,:5]

In [47]:
features += ['crs_dep_time(mins)','crs_arr_time(mins)','crs_elapsed_time','distance','origin_traffic','dest_traffic','origin_taxi','dest_taxi']+list(flights_test.mkt_carrier.unique())+list(flights_test.weekday.unique())

In [48]:
features += ['Jan']

In [49]:
features += ['origin_city_name', 'dest_city_name', 'dep_hr', 'arr_hr'] # for appending weather data, !!!! drop after merge

In [50]:
[i for i in features]

['fl_date',
 'mkt_carrier',
 'mkt_carrier_fl_num',
 'origin',
 'dest',
 'arr_delay',
 'crs_dep_time(mins)',
 'crs_arr_time(mins)',
 'crs_elapsed_time',
 'distance',
 'origin_traffic',
 'dest_traffic',
 'origin_taxi',
 'dest_taxi',
 'WN',
 'UA',
 'AS',
 'AA',
 'DL',
 'B6',
 'F9',
 'HA',
 'NK',
 'G4',
 'Wednesday',
 'Thursday',
 'Friday',
 'Saturday',
 'Sunday',
 'Monday',
 'Tuesday',
 'Jan',
 'origin_city_name',
 'dest_city_name',
 'dep_hr',
 'arr_hr']

## Export Features for Datasets

In [60]:
flights[features].to_csv('data/flights_merge.csv', index=False)

In [None]:
# for weather: ['fl_date', 'origin_city_name', 'dest_city_name', 'crs_dep_time', 'crs_arr_time']

In [51]:
features.remove('arr_delay') # drop the target variable for the flights_test set

In [53]:
flights_test[features].to_csv('data/flights_test_merge.csv', index=False)