# Machine Learning

In this file, instructions how to approach the challenge can be found.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (will be).
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option("display.max_columns", None)

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful which columns to use and which don't. For example, DEP_DELAY is going to be the perfect predictor, but we can't use it because in real-life scenario, we want to predict the delay before the flight takes of --> We can use average delay from earlier days but not the one from the actual flight we predict.  

For example, variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors as well. However, we can create various transformations from earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

In [3]:
# import csv file with the proper dtypes
dtype_dict = {
 'branded_code_share': 'string',
 'mkt_carrier': 'string',
 'mkt_carrier_fl_num': 'string',
 'op_unique_carrier': 'string',
 'tail_num': 'string',
 'op_carrier_fl_num': 'string',
 'origin_airport_id': 'string',
 'origin': 'string',
 'origin_city_name': 'string',
 'dest_airport_id': 'string',
 'dest': 'string',
 'dest_city_name': 'string',
 'crs_dep_time': 'int64',
 'dep_time': 'int64',
 'dep_delay': 'int64',
 'taxi_out': 'int64',
 'wheels_off': 'int64',
 'wheels_on': 'int64',
 'taxi_in': 'int64',
 'crs_arr_time': 'int64',
 'arr_time': 'int64',
 'arr_delay': 'int64',
 'crs_elapsed_time': 'int64',
 'actual_elapsed_time': 'int64',
 'air_time': 'int64',
 'distance': 'int64'}
df = pd.read_csv("data/ransmpl_clean.csv", parse_dates=[0], dtype=dtype_dict)

In [4]:
df.head()

Unnamed: 0,fl_date,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,crs_elapsed_time,actual_elapsed_time,air_time,distance
0,2018-03-07,AA,AA,465,AA,N200UU,465,14107,PHX,"Phoenix, AZ",14679,SAN,"San Diego, CA",835,833,-2,13,846,838,2,851,840,-11,76,67,52,304
1,2018-03-07,AA,AA,591,AA,N833AW,591,11057,CLT,"Charlotte, NC",11278,DCA,"Washington, DC",1431,1537,66,16,1553,1648,3,1559,1651,52,88,74,55,331
2,2018-03-07,AA,AA,600,AA,N151UW,600,11697,FLL,"Fort Lauderdale, FL",11057,CLT,"Charlotte, NC",603,557,-6,18,615,746,19,809,805,-4,126,128,91,632
3,2018-03-07,AA,AA,1805,AA,N924US,1805,11057,CLT,"Charlotte, NC",10721,BOS,"Boston, MA",1135,1129,-6,11,1140,1312,12,1352,1324,-28,137,115,92,728
4,2018-03-07,AA,AA,2615,AA,N945NN,2615,11057,CLT,"Charlotte, NC",15370,TUL,"Tulsa, OK",1820,1812,-8,11,1823,1936,6,2002,1942,-20,162,150,133,842


### Feature Engineering

Feature engineering will play a crucial role in this problems. We have only very little attributes so we need to create some features that will have some predictive power.

- weather: we can use some weather API to look for the weather in time of the scheduled departure and scheduled arrival.
- statistics (avg, mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.

**Creating Feature: Avg Delay for carrier**

In [5]:
# calculate avg delay by carrier
avg_delay_df = df[["op_unique_carrier", "arr_delay"]].groupby("op_unique_carrier").mean().reset_index()
# round values and create a dictionary mapping each carrier to their avg delay
avg_delay_map = dict(avg_delay_df.values)
rounded_delay_vals = [round(val, 2) for val in avg_delay_map.values()]
avg_delay_map = dict(zip(avg_delay_map.keys(), rounded_delay_vals))
# creating new column based on the mapping
df["avg_delay_for_carrier"] = df["op_unique_carrier"].map(avg_delay_map)

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA conponents?

In [6]:
df.head()

Unnamed: 0,fl_date,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,crs_elapsed_time,actual_elapsed_time,air_time,distance,avg_delay_for_carrier
0,2018-03-07,AA,AA,465,AA,N200UU,465,14107,PHX,"Phoenix, AZ",14679,SAN,"San Diego, CA",835,833,-2,13,846,838,2,851,840,-11,76,67,52,304,6.07
1,2018-03-07,AA,AA,591,AA,N833AW,591,11057,CLT,"Charlotte, NC",11278,DCA,"Washington, DC",1431,1537,66,16,1553,1648,3,1559,1651,52,88,74,55,331,6.07
2,2018-03-07,AA,AA,600,AA,N151UW,600,11697,FLL,"Fort Lauderdale, FL",11057,CLT,"Charlotte, NC",603,557,-6,18,615,746,19,809,805,-4,126,128,91,632,6.07
3,2018-03-07,AA,AA,1805,AA,N924US,1805,11057,CLT,"Charlotte, NC",10721,BOS,"Boston, MA",1135,1129,-6,11,1140,1312,12,1352,1324,-28,137,115,92,728,6.07
4,2018-03-07,AA,AA,2615,AA,N945NN,2615,11057,CLT,"Charlotte, NC",15370,TUL,"Tulsa, OK",1820,1812,-8,11,1823,1936,6,2002,1942,-20,162,150,133,842,6.07


**Features Used**

In [7]:
X = df[["avg_delay_for_carrier", "distance"]]
X2 = df[["distance", "air_time"]]

In [8]:
y = df["arr_delay"]

In [9]:
X.head()

Unnamed: 0,avg_delay_for_carrier,distance
0,6.07,304
1,6.07,331
2,6.07,632
3,6.07,728
4,6.07,842


In [10]:
y.head()

0   -11
1    52
2    -4
3   -28
4   -20
Name: arr_delay, dtype: int64

### Modeling

Use different ML techniques to predict each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

**Trying a simple linear regression**

In [13]:
linreg = LinearRegression(normalize=False)

In [14]:
linreg.fit(X_train, y_train)

LinearRegression()

In [15]:
delay_pred = linreg.predict(X_test)

In [16]:
results_df = pd.DataFrame([y_test.values, delay_pred])
results_df = results_df.T
results_df.columns = ["actual_delay", "predicted_delay"]
results_df["error"] = results_df["actual_delay"] - results_df["predicted_delay"]
results_df

Unnamed: 0,actual_delay,predicted_delay,error
0,-13.0,3.943127,-16.943127
1,10.0,3.796876,6.203124
2,1.0,7.886636,-6.886636
3,15.0,7.710259,7.289741
4,-2.0,6.243250,-8.243250
...,...,...,...
78007,-1.0,6.556450,-7.556450
78008,-13.0,0.756786,-13.756786
78009,4.0,7.016193,-3.016193
78010,-17.0,9.677741,-26.677741


In [17]:
# Calculating r2 manually

residuals = results_df["actual_delay"] - results_df["predicted_delay"]
residuals_sq = residuals ** 2
sum_of_residuals_sq = residuals_sq.sum()

var_mean = results_df["actual_delay"] - (results_df["actual_delay"].mean()) 
var_mean_sq = var_mean ** 2
sum_of_var_mean_sq = var_mean_sq.sum()

r2 = 1 - (sum_of_residuals_sq / sum_of_var_mean_sq)
r2

0.005338642610873134

In [18]:
# r2 train score
linreg.score(X_train, y_train)

0.0052377243517642835

In [19]:
# r2 test score
linreg.score(X_test, y_test)

0.005338642610873134

**Analysis**   
r2 score was awful.   
using flight distance and avg delay by carrier in a linear regression was not effective at all

**Making a linear regression function**

In [20]:
from sklearn.preprocessing import StandardScaler

In [21]:
def run_linear_regression(X, y):
    """
    Run a linear regression.
    
    Parameters
    ----------
    X : Pandas DataFrame or numpy array of feature variables
    
    y : Pandas Series or numpy array of target variable
    
    Returns
    -------
    R**2 score, numpy array of predictions for the target variable 
    """
    X_test, X_train, y_test, y_train = train_test_split(X, y, train_size=0.8)
    model = LinearRegression()
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = model.score(X_test, y_test)
    return r2, y_pred

In [22]:
r2, y_pred = run_linear_regression(X2, y)

In [23]:
r2

0.009609537085245812

**Making a Polynomial regression function**

In [24]:
from sklearn.preprocessing import PolynomialFeatures

In [25]:
def run_polynomial_regression(X, y, degree):
    poly = PolynomialFeatures(degree)
    X = poly.fit_transform(X)
    X_test, X_train, y_test, y_train = train_test_split(X, y, train_size=0.8)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print("---Results---")
    print(f"degree = {degree}")
    print(f"Train score = {train_score}")
    print(f"Test score = {test_score}")
    return y_pred, train_score, test_score

In [26]:
def scale_data(X, y):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    y_scaled = scaler.fit_transform(y.values.reshape(-1,1))
    return X_scaled, y_scaled

In [27]:
X_scaled, y_scaled = scale_data(X2, y)

In [28]:
for i in range(2,15):
    y_pred, train_score, test_score = run_polynomial_regression(X_scaled, y_scaled, i)
    print()

---Results---
degree = 2
Train score = 0.013187670026152731
Test score = 0.012751652577797978

---Results---
degree = 3
Train score = 0.014516064217224178
Test score = 0.013837045685025307

---Results---
degree = 4
Train score = 0.014584996879753787
Test score = 0.015434600613914662

---Results---
degree = 5
Train score = 0.014064492790081484
Test score = -0.00021435644406553322

---Results---
degree = 6
Train score = 0.015268809261489658
Test score = -0.07007273430847949

---Results---
degree = 7
Train score = 0.01746468771594112
Test score = -4.121554516528459

---Results---
degree = 8
Train score = 0.01687281122581552
Test score = -23.30208413017531

---Results---
degree = 9
Train score = 0.016104961946556196
Test score = -313.04060823044995

---Results---
degree = 10
Train score = 0.014699152097938928
Test score = -1321.8661013625256

---Results---
degree = 11
Train score = 0.016359582199424683
Test score = -63045.89816250164

---Results---
degree = 12
Train score = 0.0163366489292

**Analysis**.  
Polynomial regression wasn't any better with the selected features

**Random Forest**

In [29]:
from sklearn.ensemble import RandomForestRegressor

In [30]:
def run_random_forest(X, y):
    model = RandomForestRegressor(n_estimators=50, max_depth=5)
    X_test, X_train, y_test, y_train = train_test_split(X, y, train_size=0.8)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"train score = {train_score}")
    print(f"test score = {test_score}")
    return model

In [31]:
rf = run_random_forest(X2, y)

train score = 0.01781721019111826
test score = 0.006171617003033147


In [32]:
rf = RandomForestRegressor(n_estimators=100, max_depth=5)

In [33]:
X_test, X_train, y_test, y_train = train_test_split(X_scaled, y_scaled, train_size=0.8)

In [34]:
rf.fit(X_train, y_train.flatten())

RandomForestRegressor(max_depth=5)

In [35]:
y_pred = rf.predict(X_test)

In [36]:
rf.score(X_train, y_train)

0.022994024969808846

In [37]:
rf.score(X_test, y_test)

0.004973748351373541

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. Sample submission can be found in the file **_sample_submission.csv_**

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuos. For each flight that was delayed, we need to have one of these variables as 1 and others 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be huge class imbalance. We have only very little cancelled flights with comparison to all flights. It is important to do the right sampling before training and to choose correct evaluation metrics.