# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If the flight is delayed, we will predict the type of delay.
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use as predictors. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the flight we are predicting.  

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.
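One way to use earlier delays without leaking the current flight's outcome is to shift the history before averaging it. The sketch below is only an illustration on toy data: the lowercase column names follow the ones used later in this notebook, and the feature name `prev_2d_avg_arr_delay` and the 2-day window are assumptions made for the example.

```python
import pandas as pd

# Toy data; column names mirror the lowercase names used in flights.csv.
flights = pd.DataFrame({
    'fl_date': pd.to_datetime(['2019-01-01', '2019-01-02',
                               '2019-01-03', '2019-01-04'] * 2),
    'mkt_unique_carrier': ['AA'] * 4 + ['UA'] * 4,
    'arr_delay': [10.0, 20.0, 30.0, 40.0, 0.0, 5.0, 10.0, 15.0],
})

# Mean delay per carrier and day, ordered in time within each carrier.
daily = (flights
         .groupby(['mkt_unique_carrier', 'fl_date'], as_index=False)['arr_delay']
         .mean()
         .sort_values(['mkt_unique_carrier', 'fl_date']))

# shift(1) removes the current day, so each row only sees strictly earlier
# days; the rolling window then averages up to the 2 previous days.
daily['prev_2d_avg_arr_delay'] = (
    daily.groupby('mkt_unique_carrier')['arr_delay']
         .transform(lambda s: s.shift(1).rolling(window=2, min_periods=1).mean())
)

print(daily)
```

Because predictions are made one week in advance, a real feature would probably need a larger shift (for example 7 days) so that it only uses data available at prediction time.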

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

### Feature Engineering

Feature engineering will play a crucial role in this problem. We have only a few attributes, so we need to create features with some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics.
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
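As an illustration of the time-of-day idea in the list above, here is a minimal sketch. It assumes `crs_dep_time` is stored as an integer in HHMM form (e.g. 1830), which should be checked against the data; the names `dep_hour_sin` and `dep_hour_cos` are made up for the example.

```python
import numpy as np
import pandas as pd

# Assumption: scheduled departure time is an integer in HHMM form.
df = pd.DataFrame({'crs_dep_time': [5, 630, 1200, 1830, 2359]})

# Integer division by 100 keeps the hour part (e.g. 1830 -> 18).
df['crs_dep_hour'] = df['crs_dep_time'] // 100

# Cyclical encoding: hour 23 and hour 0 end up close together,
# which a plain 0-23 integer does not express.
angle = 2 * np.pi * df['crs_dep_hour'] / 24
df['dep_hour_sin'] = np.sin(angle)
df['dep_hour_cos'] = np.cos(angle)

print(df)
```

Tree-based models can usually work with the raw hour, while linear models tend to benefit more from the sine/cosine pair.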

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def load_data_set(path='./data', file='flights.csv', test=False):
    '''
    Load the train or test data set.

    Parameters
    ----------
    path : str, default='./data'
        Directory containing the source file.

    file : str, default='flights.csv'
        Name of the file, including the extension.

    test : bool, default=False
        True loads test data. False loads train data.

    Returns
    -------
    X : pandas DataFrame
        DataFrame containing training or test data.

    y : pandas Series (only for training data)
        If test is False, the target variable arr_delay.
    '''
    # Load train or test csv
    X = pd.read_csv(f'{path}/{file}')
    if test:
        return X
    # Take the target variable out of the flights data set
    y = X['arr_delay']
    X = X.drop('arr_delay', axis=1)
    return X, y

In [None]:
def load_agg_data(X, path= './data', test=False):
    '''
    Add aggregated variables as new features to Train or Test data set. 

    Parameters
    ----------
    X : pandas DataFrame
        Test or Train dataset.
    
    path : str, default='./data'
        Directory containing the aggregate CSV files.

    test : bool, default=False
        True for test data, False for train data. Not used in the function body.
           
    Returns
    -------
    X : pandas DataFrame
        Pandas DataFrame containing Train or Test data and additional features. 

    '''

    # Load flights aggregate data (read from `path` instead of a hard-coded directory)
    flight_delay_aggregate_mth = pd.read_csv(f'{path}/flight_delay_aggregate_monthly.csv')
    flight_delay_aggregate_dow = pd.read_csv(f'{path}/flight_delay_aggregate_day_of_week.csv')
    flight_delay_aggregate_arrive_hour = pd.read_csv(f'{path}/flight_delay_aggregate_arrive_hour.csv')
    flight_airport_traffic = pd.read_csv(f'{path}/flight_airport_traffic.csv')

    # Load passengers aggregate data
    passengers_flight_montly_aggregate = pd.read_csv(f'{path}/passengers_flight_montly_aggregate.csv')
    passengers_carrier_monthly_aggregate = pd.read_csv(f'{path}/passengers_carrier_monthly_aggregate.csv')
    passengers_airport_monthly_aggregate = pd.read_csv(f'{path}/passengers_airport_monthly_aggregate.csv')
    # Load fuel consumption data
    fuel_comsumption_monthyl_aggregate = pd.read_csv(f'{path}/fuel_comsumption_monthyl_aggregate.csv')

    # join tables data from origin
    flights = pd.merge(X, flight_delay_aggregate_mth, how='left', on=['mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',  'month'])
    flights = pd.merge(flights, flight_delay_aggregate_dow, how='left', on=['mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',  'day_of_week'])
    flights = pd.merge(flights, flight_delay_aggregate_arrive_hour, how='left', on=['mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id', 'crs_arr_hour'])

    # Join Airport traffic
    orig = flight_airport_traffic[['airport_id', 'month', 'total_flights']].rename(
        columns={'airport_id': 'origin_airport_id', 'total_flights': 'origin_total_flights'})
    dest = flight_airport_traffic[['airport_id', 'month', 'total_flights']].rename(
        columns={'airport_id': 'dest_airport_id', 'total_flights': 'dest_total_flights'})
    flights = pd.merge(flights, orig, how='left', on=['origin_airport_id','month'])
    flights = pd.merge(flights, dest, how='left', on=['dest_airport_id','month'])

    
    # Join Passengers data
    flights = pd.merge(flights, passengers_flight_montly_aggregate, how='left', on=['mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id','month'])
    flights = pd.merge(flights, passengers_carrier_monthly_aggregate, how='left', on=['mkt_unique_carrier', 'month'])
    flights = pd.merge(flights, fuel_comsumption_monthyl_aggregate, how='left', on=['mkt_unique_carrier', 'month'])
    # Each flight has an origin and a destination airport, so we join the airport table once for each.
    orig_pass = passengers_airport_monthly_aggregate[['airport_id', 'month', 'airport_month_flight_seats', 'airport_month_passengers']].rename(
        columns={'airport_id': 'origin_airport_id',
                 'airport_month_flight_seats': 'orig_airport_month_flight_seats',
                 'airport_month_passengers': 'orig_airport_month_passengers'})
    dest_pass = passengers_airport_monthly_aggregate[['airport_id', 'month', 'airport_month_flight_seats', 'airport_month_passengers']].rename(
        columns={'airport_id': 'dest_airport_id',
                 'airport_month_flight_seats': 'dest_airport_month_flight_seats',
                 'airport_month_passengers': 'dest_airport_month_passengers'})
    flights = pd.merge(flights, orig_pass, how='left', on=['origin_airport_id','month'])
    flights = pd.merge(flights, dest_pass, how='left', on=['dest_airport_id','month'])
    
    return flights
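
The function above relies on a chain of left merges, which silently multiplies rows if an aggregate table has duplicate keys. A possible safeguard, sketched here on toy data with a hypothetical `carrier_monthly` table, is the `validate` argument of `pd.merge` together with a row-count check and a look at how many rows found no match.

```python
import pandas as pd

flights = pd.DataFrame({
    'mkt_unique_carrier': ['AA', 'AA', 'UA'],
    'month': [1, 2, 1],
})

# Hypothetical aggregate table with one row per (carrier, month).
carrier_monthly = pd.DataFrame({
    'mkt_unique_carrier': ['AA', 'UA'],
    'month': [1, 1],
    'carrier_month_passengers': [1000, 800],
})

# validate='many_to_one' raises pandas.errors.MergeError if the join keys
# are not unique in the right-hand table.
merged = pd.merge(flights, carrier_monthly, how='left',
                  on=['mkt_unique_carrier', 'month'],
                  validate='many_to_one')

# A left merge on unique keys must keep the number of rows unchanged.
assert len(merged) == len(flights)

# Rows without a match get NaN in the aggregate columns.
n_missing = merged['carrier_month_passengers'].isna().sum()
print(n_missing)
```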

In [None]:
# Load data
X, y = load_data_set()

In [None]:
print(X.shape)
print(y.shape)

In [None]:
X.head()

In [None]:
train = load_agg_data(X)

In [None]:
train.shape

In [None]:
# Save data in local disk
train.to_csv('./data/train.csv', index=False)
y.to_csv('./data/target.csv', index=False)

In [None]:
train.dtypes

In [11]:
sorted(list(train.columns))

['arr_hour_avg_air_time',
 'arr_hour_avg_arr_delay',
 'arr_hour_avg_carrier_delay',
 'arr_hour_avg_late_aircraft_delay',
 'arr_hour_avg_nas_delay',
 'arr_hour_avg_security_delay',
 'arr_hour_avg_weather_delay',
 'branded_code_share',
 'carrier_month_avg_passengers',
 'carrier_month_avg_seats',
 'carrier_month_passengers',
 'carrier_month_seats',
 'crs_arr_hour',
 'crs_arr_time',
 'crs_dep_hour',
 'crs_dep_time',
 'crs_elapsed_time',
 'day',
 'day_of_week',
 'day_of_week_avg_air_time',
 'day_of_week_avg_arr_delay',
 'day_of_week_avg_carrier_delay',
 'day_of_week_avg_late_aircraft_delay',
 'day_of_week_avg_nas_delay',
 'day_of_week_avg_security_delay',
 'day_of_week_avg_weather_delay',
 'dest',
 'dest_airport_id',
 'dest_airport_month_flight_seats',
 'dest_airport_month_passengers',
 'dest_city_name',
 'dest_total_flights',
 'distance',
 'dup',
 'fl_date',
 'flights',
 'mkt_carrier',
 'mkt_carrier_fl_num',
 'mkt_unique_carrier',
 'month',
 'month_avg_air_time',
 'month_avg_arr_delay',
 '

In [12]:
# missing data
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(30)

| | Total | Percent |
|---|---|---|
| month_flight_avg_passengers | 1915 | 0.195249 |
| month_flight_avg_seats | 1915 | 0.195249 |
| month_flight_passengers | 1915 | 0.195249 |
| month_flight_seats | 1915 | 0.195249 |
| tail_num | 32 | 0.003263 |
| arr_hour_avg_late_aircraft_delay | 29 | 0.002957 |
| arr_hour_avg_carrier_delay | 29 | 0.002957 |
| arr_hour_avg_weather_delay | 29 | 0.002957 |
| arr_hour_avg_nas_delay | 29 | 0.002957 |
| arr_hour_avg_security_delay | 29 | 0.002957 |


In [13]:
train.dtypes[train.dtypes == 'object']

fl_date               object
mkt_unique_carrier    object
branded_code_share    object
mkt_carrier           object
op_unique_carrier     object
tail_num              object
origin                object
origin_city_name      object
dest                  object
dest_city_name        object
dup                   object
dtype: object

In [None]:
# fl_date numeric
#train['fl_date'] = train['fl_date'].replace('-', '', regex=True).astype(int)
# mkt_unique_carrier - hot-encode

In [27]:
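def one_hot_encode(X):
    # Select categorical columns from the frame passed in, not from the global train
    cat_feats = X.dtypes[X.dtypes == 'object'].index.tolist()
    df_dummy = pd.get_dummies(X[cat_feats])
    return df_dummy

def label_encode():
    pass

def date_numeric(s):
    # Strip the dashes from date strings and cast to int (e.g. '2019-01-01' -> 20190101)
    s = s.replace('-', '', regex=True).astype(int)
    return s

def print_cat_describe(df):
    # Describe every object-typed column of the frame passed in
    for col in df.dtypes[df.dtypes == 'object'].index:
        print("Variable: ", col)
        print(df[col].describe())
        print("Unique values: ", df[col].unique())
        print('')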

In [22]:
train['fl_date'] = date_numeric(train['fl_date'])

TypeError: astype() got an unexpected keyword argument 'type'

In [23]:
train.dtypes[train.dtypes == 'object']

mkt_unique_carrier    object
branded_code_share    object
mkt_carrier           object
op_unique_carrier     object
tail_num              object
origin                object
origin_city_name      object
dest                  object
dest_city_name        object
dup                   object
dtype: object

In [None]:
#drop
# mkt_carrier
# tail_num - do we need it?
# origin
# origin_city_name
# dest
# dest_city_name
# dup - it contains only N - we do not need it
to_drop = ['mkt_carrier', 'tail_num', 'origin', 'origin_city_name', 'dest', 'dest_city_name', 'dup']
train = train.drop(to_drop, axis=1)

In [73]:
train.dtypes[train.dtypes == 'object']

mkt_unique_carrier    object
branded_code_share    object
op_unique_carrier     object
dtype: object

In [28]:
print_cat_describe(train)

Variable:  mkt_unique_carrier
count     9808
unique      11
top         AA
freq      2516
Name: mkt_unique_carrier, dtype: object
Unique values:  ['UA' 'F9' 'AA' 'AS' 'WN' 'DL' 'B6' 'NK' 'HA' 'G4' 'VX']

Variable:  branded_code_share
count     9808
unique      16
top         WN
freq      1655
Name: branded_code_share, dtype: object
Unique values:  ['UA_CODESHARE' 'F9' 'AA' 'UA' 'AS' 'WN' 'AA_CODESHARE' 'DL_CODESHARE'
 'DL' 'B6' 'AS_CODESHARE' 'NK' 'HA' 'G4' 'HA_CODESHARE' 'VX']

Variable:  mkt_carrier
count     9808
unique      11
top         AA
freq      2516
Name: mkt_carrier, dtype: object
Unique values:  ['UA' 'F9' 'AA' 'AS' 'WN' 'DL' 'B6' 'NK' 'HA' 'G4' 'VX']

Variable:  op_unique_carrier
count     9808
unique      27
top         WN
freq      1655
Name: op_unique_carrier, dtype: object
Unique values:  ['EV' 'F9' 'AA' 'UA' 'AS' 'WN' 'OH' '9E' 'DL' 'B6' 'OO' 'PT' 'G7' 'ZW'
 'MQ' 'YV' 'NK' 'QX' 'CP' 'YX' 'HA' 'C5' 'AX' 'G4' 'EM' 'VX' 'KS']

Variable:  tail_num
count       9776
unique

In [72]:
train.dtypes[train.dtypes == 'object']

mkt_unique_carrier    object
branded_code_share    object
op_unique_carrier     object
dtype: object

In [86]:
df_dummy = one_hot_encode(train)

In [87]:
df_dummy

Unnamed: 0,mkt_unique_carrier_AA,mkt_unique_carrier_AS,mkt_unique_carrier_B6,mkt_unique_carrier_DL,mkt_unique_carrier_F9,mkt_unique_carrier_G4,mkt_unique_carrier_HA,mkt_unique_carrier_NK,mkt_unique_carrier_UA,mkt_unique_carrier_VX,...,op_unique_carrier_OH,op_unique_carrier_OO,op_unique_carrier_PT,op_unique_carrier_QX,op_unique_carrier_UA,op_unique_carrier_VX,op_unique_carrier_WN,op_unique_carrier_YV,op_unique_carrier_YX,op_unique_carrier_ZW
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9803,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9804,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9805,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9806,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original features vs. PCA components?
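
To make the PCA option concrete, here is a minimal sketch using scikit-learn on synthetic data. The feature matrix is made up (three independent columns plus two near-duplicates, standing in for correlated aggregate features), and the 95% variance threshold is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric features: 3 independent columns
# plus 2 columns that nearly duplicate the first two.
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1] + 0.01 * rng.normal(size=200),
])

# Standardize first, since PCA is sensitive to feature scale.
# A float n_components keeps enough components to explain that
# share of the variance.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X)

n_components = pca.named_steps['pca'].n_components_
explained = pca.named_steps['pca'].explained_variance_ratio_.sum()
print(X.shape, '->', X_reduced.shape, f'explained variance: {explained:.3f}')
```

Whether the components beat the original features is an empirical question; both variants can be fed to the same model and compared on held-out data.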

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
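
A simple way to compare several of the candidates listed above is to run them through the same cross-validation loop. The sketch below uses scikit-learn with a synthetic regression target; the data, the two models chosen, and the hyperparameters are placeholders rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the engineered features and arr_delay.
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

models = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MAE is reported negated.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_mean_absolute_error')
    results[name] = -scores.mean()

for name, mae in results.items():
    print(f'{name}: MAE = {mae:.3f}')
```

For flight data ordered in time, shuffled or plain K-fold splits can let information from later dates leak into training, so a time-aware split may be a better fit than the default used here.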

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
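
Since the final predictions are for dates after the training period, it can help to evaluate on a holdout that is also later in time than the training data. The sketch below shows the idea on toy data with scikit-learn metrics; the `predicted` column and the cutoff date are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.DataFrame({
    'fl_date': pd.to_datetime(['2019-12-20', '2019-12-24',
                               '2019-12-26', '2019-12-30']),
    'arr_delay': [10.0, -5.0, 20.0, 0.0],
    'predicted': [8.0, 0.0, 14.0, 2.0],  # hypothetical model output
})

# Everything on or after the cutoff is treated as the holdout period.
cutoff = pd.Timestamp('2019-12-25')
holdout = df[df['fl_date'] >= cutoff]

mae = mean_absolute_error(holdout['arr_delay'], holdout['predicted'])
rmse = np.sqrt(mean_squared_error(holdout['arr_delay'], holdout['predicted']))
r2 = r2_score(holdout['arr_delay'], holdout['predicted'])

print(f'MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}')
```

MAE is easier to read in minutes of delay, while RMSE penalizes large misses more heavily, so reporting both gives a fuller picture.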

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, exactly one of these variables should be 1 and the others 0.

It can happen that two or more delay types have more than 0 minutes. In this case, set the largest one to 1 and the others to 0.
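
The transformation described above can be sketched with pandas as follows. The lowercase column names are an assumption based on the naming used elsewhere in this notebook, and the data is made up. Ties are resolved in favour of the first column in `delay_cols`, which is a choice of this sketch rather than something the task specifies.

```python
import pandas as pd

delay_cols = ['carrier_delay', 'weather_delay', 'nas_delay',
              'security_delay', 'late_aircraft_delay']

df = pd.DataFrame({
    'carrier_delay':       [15.0, 0.0, 5.0, 0.0],
    'weather_delay':       [0.0, 30.0, 0.0, 0.0],
    'nas_delay':           [0.0, 10.0, 0.0, 0.0],
    'security_delay':      [0.0, 0.0, 0.0, 0.0],
    'late_aircraft_delay': [0.0, 0.0, 20.0, 0.0],
})

# Keep only delayed flights: at least one delay type above 0 minutes.
delays = df[delay_cols].fillna(0)
delayed = delays[(delays > 0).any(axis=1)]

# idxmax returns, per row, the name of the column with the largest value.
delay_type = delayed.idxmax(axis=1)

# One-hot version: exactly one 1 per delayed flight.
one_hot = (pd.get_dummies(delay_type)
             .reindex(columns=delay_cols, fill_value=0)
             .astype(int))

print(delay_type)
print(one_hot)
```

The `delay_type` series can be used directly as a multiclass label, while `one_hot` matches the "one variable is 1, the others 0" layout described above.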

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance. We have very few cancelled flights compared to all flights. It is important to use an appropriate sampling strategy before training and to choose suitable evaluation metrics.
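
The sketch below illustrates the imbalance issue on synthetic data with scikit-learn. It uses a stratified split and class weighting instead of explicit resampling, which is one alternative among several, and it reports recall and average precision because accuracy alone is misleading when positives are rare. The data and the model are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic features and a rare positive class standing in for cancelled.
X = rng.normal(size=(n, 3))
logits = 2.0 * X[:, 0] - 4.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# stratify=y keeps the share of positives equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
recall = recall_score(y_test, clf.predict(X_test))
ap = average_precision_score(y_test, proba)

# A model that always predicts "not cancelled" already scores this accuracy.
accuracy_all_negative = 1 - y_test.mean()

print(f'positives in test: {y_test.mean():.3f}')
print(f'recall={recall:.3f}  average precision={ap:.3f}')
print(f'accuracy of always predicting 0: {accuracy_all_negative:.3f}')
```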