# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If a flight is delayed, we predict which type of delay it is (or will be).
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

## Main Task: Regression Problem

In [86]:
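# Base imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the training set and drop the leftover index column and the tail number
dataset = pd.read_csv('../data/flights/flightFinalTrain.csv')
dataset = dataset.drop(['Unnamed: 0','tail_num'], axis=1)
# Work on a reproducible random sample of 100,000 rows
data = dataset.sample(n=100000, axis=0, random_state=42)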


In [87]:
data2 = data.copy()

In [88]:
data2.drop(['op_carrier_fl_num','origin_airport_id','dest_airport_id'],axis=1,inplace=True)

In [89]:
data2.drop(['mkt_carrier'],axis=1,inplace=True)

In [91]:
data3 = data[['month_delay','carrier_delay','carrier_perc_delay','distance','arr_delay']]

In [92]:
data3.head()

| | month_delay | carrier_delay | carrier_perc_delay | distance | arr_delay |
|---|---|---|---|---|---|
| 525813 | 3.454152 | 7.880984 | 21.18 | 666 | -20.0 |
| 813587 | 3.10105 | 7.880984 | 21.18 | 327 | -5.0 |
| 3160639 | 4.221989 | 7.982953 | 20.75 | 1437 | 2.0 |
| 1825114 | 7.783456 | 7.880984 | 21.18 | 223 | 4.0 |
| 4992657 | 3.10105 | 11.061624 | 27.15 | 641 | 17.0 |


In [62]:
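# One-hot encode the marketing carrier.
# Note: data3 (used to build X below) does not select any of these dummy columns.
data = pd.get_dummies(data,prefix=['mkt_carrier'], columns = ['mkt_carrier'], drop_first=True)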

In [94]:
X = data3.drop(['arr_delay'], axis=1)
y = data3['arr_delay']

In [95]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [66]:
# For classification
#from sklearn.preprocessing import StandardScaler
#sc = StandardScaler()
#X_train = sc.fit_transform(X_train)
#X_test = sc.transform(X_test)


### Regression

In [101]:
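# Naive Bayes
# Note: GaussianNB is a classifier. Fitted on the continuous arr_delay target,
# it treats every distinct delay value as its own class, so the confusion
# matrix and accuracy below are not meaningful regression metrics.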

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


0.0094

In [102]:
from sklearn.metrics import mean_squared_error as mse

### Decision Tree

In [96]:
# Decision Tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as mse
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

In [97]:
mse(y_test,y_pred)

2410.498115520996

### Random Forest

In [98]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor.fit(X_train, y_train)

RandomForestRegressor(random_state=0)

In [99]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)

In [100]:
from sklearn.metrics import r2_score
mse(y_test, y_pred)

2163.382877827062
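MSE is expressed in squared units of the target, so its square root (RMSE) is usually easier to read. A minimal sketch using the two MSE values reported above, assuming `arr_delay` is measured in minutes:

```python
import math

# MSE values copied from the decision tree and random forest cells above
mse_by_model = {
    "decision_tree": 2410.498115520996,
    "random_forest": 2163.382877827062,
}

# RMSE is in the same unit as arr_delay (assumed here to be minutes)
rmse_by_model = {name: math.sqrt(value) for name, value in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.1f}")
```

Both models are off by roughly 45 to 50 units on a typical flight, which suggests the current features carry little signal.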

### Multiple Linear Regression

In [74]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [75]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)

In [76]:
from sklearn.metrics import r2_score
mse(y_test, y_pred)
# Only the last expression in a cell is displayed, so the output below is R², not MSE
r2_score(y_test, y_pred)

0.007515845984801484
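An R² this close to zero means the model explains almost none of the variance in the target, so it is barely better than always predicting the mean delay. One way to make that comparison explicit is a mean-predicting baseline. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (hypothetical data):
# a weak linear signal buried in large noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=40.0, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

baseline_pred = baseline.predict(X_te)
linear_pred = linear.predict(X_te)

for name, pred in [("mean baseline", baseline_pred), ("linear regression", linear_pred)]:
    print(f"{name}: MSE={mean_squared_error(y_te, pred):.1f}, R2={r2_score(y_te, pred):.4f}")
```

Any model worth keeping should beat the baseline MSE by a clear margin on held-out data.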

### Ensemble Methods

In [54]:
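# XGBoost
# Note: XGBClassifier and accuracy_score are classification tools, while y here is
# the continuous arr_delay target. This cell was interrupted before it finished
# (see the KeyboardInterrupt below).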
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))





KeyboardInterrupt: 
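The cell above fits a classifier to a continuous target, which may explain why it did not finish. For the regression task, a boosted regressor scored with MSE is the closer fit. The sketch below is an assumption-laden illustration: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost's regressor, runs on small synthetic data (`X_demo`, `y_demo` are hypothetical), and uses 3 folds to keep the runtime short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=2.0, size=500)

# Gradient boosting regressor, used here in place of XGBoost
booster = GradientBoostingRegressor(n_estimators=50, random_state=0)

# scikit-learn maximizes scores, so MSE is exposed as its negative
neg_mse = cross_val_score(booster, X_demo, y_demo, cv=3, scoring="neg_mean_squared_error")
cv_mse = -neg_mse

print("CV MSE: {:.2f} (+/- {:.2f})".format(cv_mse.mean(), cv_mse.std()))
```

On the full 100,000-row sample, starting with fewer folds or a subsample is a cheap way to estimate runtime before committing to 10-fold cross-validation.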

### Feature Engineering

Feature engineering will play a crucial role in these problems. We have only a few attributes, so we need to create features that have some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics.
- airports encoding: we need to think about what to do with the airports and other categorical variables.
- time of the day: the delay probably depends on the airport traffic, which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what else we could do to improve the model.
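Two of the ideas above, time of day and historical delay statistics, could be sketched as follows. This is a toy example: `crs_dep_time` (an HHMM integer) and `origin` are hypothetical column names, and the six rows are made up for illustration.

```python
import pandas as pd

# Toy frame; `crs_dep_time` and `origin` are hypothetical column names
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "LAX", "LAX", "JFK", "LAX"],
    "crs_dep_time": [615, 1745, 930, 2210, 1200, 45],
    "arr_delay": [-5.0, 30.0, 2.0, 12.0, 8.0, -3.0],
})

# Time of day: scheduled departure hour, then a coarse bucket
flights["dep_hour"] = flights["crs_dep_time"] // 100
flights["dep_period"] = pd.cut(
    flights["dep_hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
)

# Historical delay statistics per origin airport. In practice these should be
# computed on training rows only and then merged onto validation/test rows,
# otherwise the target leaks into the features.
origin_stats = (
    flights.groupby("origin")["arr_delay"]
    .agg(["mean", "median", "std"])
    .add_prefix("origin_delay_")
    .reset_index()
)
flights = flights.merge(origin_stats, on="origin", how="left")

print(flights)
```

The same group-and-merge pattern should extend to carriers, routes, or months, as long as the statistics come from past data only.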

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one works best for our problems.

- Original features vs. PCA components?
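One way to answer the question above is to cross-validate the same model on the original features and on a reduced set of principal components. The sketch below uses synthetic correlated features; the data, the mixing matrix, and the choice of two components are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))

# Six correlated features built from two latent factors (synthetic data)
mixing = np.array([
    [1.0, 0.0, 1.0, 1.0, 0.5, -1.0],
    [0.0, 1.0, 1.0, -1.0, 0.5, 1.0],
])
X_demo = latent @ mixing + rng.normal(scale=0.1, size=(1000, 6))
y_demo = 3.0 * latent[:, 0] - latent[:, 1] + rng.normal(scale=0.5, size=1000)

# Scaling and PCA live inside the pipeline so they are refit on each training fold
candidates = {
    "original features": make_pipeline(StandardScaler(), LinearRegression()),
    "2 PCA components": make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression()),
}

scores = {}
for name, pipeline in candidates.items():
    cv_r2 = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
    scores[name] = cv_r2.mean()
    print(f"{name}: mean CV R2 = {scores[name]:.3f}")
```

If the PCA variant scores about the same with fewer inputs, the reduced representation is probably sufficient; if it scores clearly worse, the discarded components carried signal.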

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
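For the regression task, several of the techniques above can be compared in one loop on the same split. The sketch below covers only regressors and runs on synthetic data (`X_demo` and `y_demo` are hypothetical); the hyperparameters are defaults, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with one linear and one nonlinear effect
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 4))
y_demo = 4.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=1.0, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVM (SVR)": SVR(),
}

# Fit every model on the same training split and score it on the same test split
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, model.predict(X_te))

# Print models from best (lowest MSE) to worst
for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: test MSE = {score:.2f}")
```

Keeping the split fixed across models makes the MSE values directly comparable.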

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
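Because the final predictions are for dates after the training period, a time-based validation split may mimic the real task better than a random split. The sketch below trains on 2018 and validates on 2019, reporting several regression metrics. It is only an illustration: `fl_date` is a hypothetical column name and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# One synthetic "flight" per day across 2018 and 2019
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
frame = pd.DataFrame({
    "fl_date": dates,  # hypothetical column name
    "distance": rng.integers(100, 2500, size=len(dates)),
})
frame["arr_delay"] = 0.01 * frame["distance"] + rng.normal(scale=20.0, size=len(frame))

# Train on the earlier year, validate on the later one
train = frame[frame["fl_date"].dt.year == 2018]
valid = frame[frame["fl_date"].dt.year == 2019]

model = LinearRegression().fit(train[["distance"]], train["arr_delay"])
pred = model.predict(valid[["distance"]])

metrics = {
    "MAE": mean_absolute_error(valid["arr_delay"], pred),
    "RMSE": np.sqrt(mean_squared_error(valid["arr_delay"], pred)),
    "R2": r2_score(valid["arr_delay"], pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are both in the target's units; RMSE penalizes large misses more heavily, so a wide gap between the two points to a few very large errors.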

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, exactly one of these variables must be 1 and the others 0.

It can happen that two types of delay have more than 0 minutes. In this case, set the larger one to 1 and the others to 0.
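The transformation described above can be sketched with `idxmax` across the delay columns. The four rows are made-up values; the sketch assumes missing delay minutes have already been filled with 0, and ties go to whichever column is listed first.

```python
import pandas as pd

delay_cols = [
    "CARRIER_DELAY",
    "WEATHER_DELAY",
    "NAS_DELAY",
    "SECURITY_DELAY",
    "LATE_AIRCRAFT_DELAY",
]

# Toy data: minutes attributed to each delay cause (illustrative values)
delays = pd.DataFrame({
    "CARRIER_DELAY": [15.0, 0.0, 0.0, 5.0],
    "WEATHER_DELAY": [0.0, 0.0, 40.0, 0.0],
    "NAS_DELAY": [20.0, 0.0, 0.0, 0.0],
    "SECURITY_DELAY": [0.0, 0.0, 0.0, 0.0],
    "LATE_AIRCRAFT_DELAY": [0.0, 0.0, 10.0, 30.0],
})

# Keep only flights with at least one positive delay cause
delayed = delays[(delays[delay_cols] > 0).any(axis=1)]

# Class label: the cause with the most minutes
delay_type = delayed[delay_cols].idxmax(axis=1)

# One-hot form: exactly one column is 1 per delayed flight
one_hot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)

print(delay_type)
print(one_hot)
```

Most classifiers accept the single label column `delay_type` directly, so the one-hot form is only needed if a specific model or metric requires it.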

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: very few flights are cancelled compared to all flights. It is important to do the right sampling before training and to choose the correct evaluation metrics.
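A minimal sketch of why accuracy misleads on imbalanced data and how the imbalance can be handled. It uses a stratified split and, as a stand-in for resampling, class weighting in `LogisticRegression`. The data is synthetic and the cancellation rate is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data where only a small minority of flights are cancelled
rng = np.random.default_rng(0)
n_flights = 5000
X_demo = rng.normal(size=(n_flights, 3))
prob_cancelled = 1.0 / (1.0 + np.exp(-(-4.5 + 2.0 * X_demo[:, 0])))
cancelled = (rng.random(n_flights) < prob_cancelled).astype(int)

# Stratify so both splits keep the same share of cancelled flights
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, cancelled, test_size=0.25, stratify=cancelled, random_state=0
)

# Predicting "not cancelled" for every flight already gives high accuracy
always_zero_accuracy = accuracy_score(y_te, np.zeros_like(y_te))

plain = LogisticRegression().fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall on the cancelled class shows what accuracy hides
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print(f"share cancelled: {cancelled.mean():.3f}")
print(f"accuracy of always predicting 0: {always_zero_accuracy:.3f}")
print(f"recall, unweighted model: {recall_plain:.3f}")
print(f"recall, class_weight='balanced': {recall_balanced:.3f}")
```

Class weighting usually trades precision for recall, so precision, F1, or a precision-recall curve should be checked alongside recall.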