# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane is delayed, we will predict which type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns to use and which to leave out. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from earlier values.
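One way to build a feature "from earlier values" is sketched below, assuming a per-carrier running mean is what is wanted. The toy data and the column name `carrier_prior_avg_dep_delay` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-01", "2019-01-02"]
    ),
    "mkt_unique_carrier": ["DL", "DL", "DL", "UA", "UA"],
    "dep_delay": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Sort chronologically so "earlier" is well defined within each carrier.
toy = toy.sort_values(["mkt_unique_carrier", "fl_date"]).reset_index(drop=True)

# shift(1) drops the current flight; expanding().mean() averages all prior rows.
toy["carrier_prior_avg_dep_delay"] = (
    toy.groupby("mkt_unique_carrier")["dep_delay"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
print(toy)
```

The first flight of each carrier has no history, so its value is NaN and needs a fallback (for example a global mean). With several flights per day, or when predicting a week ahead, the shift would likely need to be defined in days rather than rows.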

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

In [2]:
df_flights = pd.read_csv('data/df_sample.csv')

In [3]:
df_flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,1,296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,1,588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,1,234,1.0,0.0,13.0,0.0,52.0,0.0,0.0,0.0
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,1,284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,1,358,0.0,0.0,0.0,0.0,21.0,0.0,0.0,0.0


## Feature Engineering

### Adding mean taxi-out time per hour as a feature

In [4]:

# Convert 'dep_time' to datetime format
df_flights['dep_time'] = pd.to_datetime(df_flights['dep_time'], format='%H%M', errors='coerce')

# Calculate mean taxi time per hour
df_flights['taxi_mean_time'] = df_flights.groupby(df_flights['dep_time'].dt.hour)['taxi_out'].transform('mean')

In [5]:
df_flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,taxi_mean_time
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.637988
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.351852
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,234,1.0,0.0,13.0,0.0,52.0,0.0,0.0,0.0,17.582934
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.582934
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,358,0.0,0.0,0.0,0.0,21.0,0.0,0.0,0.0,16.862433


### Exploring date and time - extract year, month, day of month, day of week

In [6]:
# Exploring date and time - extract year, month, day of month, day of week, and scheduled departure hour

df_flights['fl_date'] = pd.to_datetime(df_flights['fl_date'], errors='coerce')
df_flights['year'] = df_flights['fl_date'].dt.year
df_flights['month'] = df_flights['fl_date'].dt.month
df_flights['day_of_month'] = df_flights['fl_date'].dt.day
df_flights['day_of_week'] = df_flights['fl_date'].dt.dayofweek
df_flights['dep_hour'] = df_flights['crs_dep_time'] // 100
df_flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,taxi_mean_time,year,month,day_of_month,day_of_week,dep_hour
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,0.0,0.0,0.0,0.0,16.637988,2019,1,8,1,13
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,0.0,0.0,0.0,0.0,19.351852,2018,5,4,4,8
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,52.0,0.0,0.0,0.0,17.582934,2019,2,11,0,15
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,0.0,0.0,0.0,0.0,17.582934,2018,1,16,1,15
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,21.0,0.0,0.0,0.0,16.862433,2019,6,27,3,11


In [7]:
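# Create a new column for the time of day
time_bins = [0, 6, 12, 18, 24]
time_labels = ['night', 'morning', 'afternoon', 'evening']
# right=False gives the bins [0, 6), [6, 12), [12, 18), [18, 24), so hour 0 is labelled 'night' instead of becoming NaN
df_flights['time_of_day'] = pd.cut(df_flights['dep_hour'], bins=time_bins, labels=time_labels, right=False)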


In [8]:
df_flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,first_dep_time,total_add_gtime,longest_add_gtime,taxi_mean_time,year,month,day_of_month,day_of_week,dep_hour,time_of_day
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,0.0,0.0,0.0,16.637988,2019,1,8,1,13,afternoon
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,0.0,0.0,0.0,19.351852,2018,5,4,4,8,morning
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,0.0,0.0,0.0,17.582934,2019,2,11,0,15,afternoon
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,0.0,0.0,0.0,17.582934,2018,1,16,1,15,afternoon
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,0.0,0.0,0.0,16.862433,2019,6,27,3,11,morning


### Calculating the departure traffic for each origin airport per day

In [9]:
df_flights['dep_traffic_per_day'] = df_flights.groupby(["origin", "fl_date"])["flights"].transform('sum')


In [10]:
df_flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,total_add_gtime,longest_add_gtime,taxi_mean_time,year,month,day_of_month,day_of_week,dep_hour,time_of_day,dep_traffic_per_day
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,0.0,0.0,16.637988,2019,1,8,1,13,afternoon,3
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,0.0,0.0,19.351852,2018,5,4,4,8,morning,47
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,0.0,0.0,17.582934,2019,2,11,0,15,afternoon,7
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,0.0,0.0,17.582934,2018,1,16,1,15,afternoon,1
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,0.0,0.0,16.862433,2019,6,27,3,11,morning,15


### Creating feature average delay per airline

In [11]:
# Calculate average departure delay per airline
average_delay_per_airline = df_flights.groupby('mkt_unique_carrier')['dep_delay'].mean()

# Map the average delay values to the corresponding airlines
df_flights['avg_delay_per_airline'] = df_flights['mkt_unique_carrier'].map(average_delay_per_airline)


In [12]:
df_flights.head(5)

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,longest_add_gtime,taxi_mean_time,year,month,day_of_month,day_of_week,dep_hour,time_of_day,dep_traffic_per_day,avg_delay_per_airline
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,0.0,16.637988,2019,1,8,1,13,afternoon,3,9.246834
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,0.0,19.351852,2018,5,4,4,8,morning,47,12.678995
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,0.0,17.582934,2019,2,11,0,15,afternoon,7,10.222446
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,0.0,17.582934,2018,1,16,1,15,afternoon,1,9.246834
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,0.0,16.862433,2019,6,27,3,11,morning,15,10.619946


### Creating feature average delay per route

In [13]:
df_flights['route'] = df_flights['origin'] + ' to ' + df_flights['dest']
route_avg_delay = df_flights.groupby('route')['dep_delay'].mean()
df_flights['avg_dep_delay_per_route'] = df_flights['route'].map(route_avg_delay)

### Creating feature - average monthly passengers

In [14]:
# Importing the helper function from data_cleaning

from data_cleaning import avg_passengers


In [15]:
# loading passengers csv into dataframe
passengers = pd.read_csv('data/passengers.csv')

In [16]:
# Calling function with flights and passengers dataframe

df_flights = avg_passengers(df_flights,passengers)

In [17]:
df_flights.head(5)

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,year,month,day_of_month,day_of_week,dep_hour,time_of_day,dep_traffic_per_day,avg_delay_per_airline,avg_dep_delay_per_route,monthly_avg_passengers
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,2019,1,8,1,13,afternoon,3,9.246834,17.236196,773.0
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,2018,5,4,4,8,morning,47,12.678995,13.836224,2469.0
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,2019,2,11,0,15,afternoon,7,10.222446,19.122066,1208.0
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,2018,1,16,1,15,afternoon,1,9.246834,18.308511,1088.0
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,2019,6,27,3,11,morning,15,10.619946,7.820859,10533.0


### Creating feature - average fuel consumption

In [18]:
# loading fuel consumption csv into dataframe
fuel_df = pd.read_csv('data/fuel_comsumption.csv')

In [19]:
# Importing the helper function from data_cleaning

from data_cleaning import avg_fuel_use

In [20]:
# Calling function with flights and fuel dataframe

df_flights = avg_fuel_use(df_flights,fuel_df)

In [21]:
df_flights.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,...,day_of_month,day_of_week,dep_hour,time_of_day,dep_traffic_per_day,avg_delay_per_airline,avg_dep_delay_per_route,monthly_avg_passengers,avg_monthly_fuel_gallons,avg_monthly_fuel_cost
0,2019-01-08,DL,DL_CODESHARE,4179,OO,N477CA,4179,10408,ATW,"Appleton, WI",...,8,1,13,afternoon,3,9.246834,17.236196,773.0,262825545.0,507716787.0
1,2018-05-04,UA,UA,374,UA,N61882,374,12264,IAD,"Washington, DC",...,4,4,8,morning,47,12.678995,13.836224,2469.0,291973202.0,532297103.0
2,2019-02-11,AA,AA_CODESHARE,5254,OH,N207PS,5254,12197,HPN,"White Plains, NY",...,11,0,15,afternoon,7,10.222446,19.122066,1208.0,244408174.0,405192822.0
3,2018-01-16,DL,DL_CODESHARE,7409,OO,N675BR,7409,11013,CIU,"Sault Ste. Marie, MI",...,16,1,15,afternoon,1,9.246834,18.308511,1088.0,262825545.0,507716787.0
4,2019-06-27,WN,WN,2360,WN,N7812G,2360,10800,BUR,"Burbank, CA",...,27,3,11,morning,15,10.619946,7.820859,10533.0,178747802.0,342753536.0


In [22]:
df_flights.columns

Index(['fl_date', 'mkt_unique_carrier', 'branded_code_share',
       'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time', 'dep_time',
       'dep_delay', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in',
       'crs_arr_time', 'arr_time', 'arr_delay', 'cancelled',
       'crs_elapsed_time', 'actual_elapsed_time', 'air_time', 'flights',
       'distance', 'carrier_delay', 'weather_delay', 'nas_delay',
       'security_delay', 'late_aircraft_delay', 'first_dep_time',
       'total_add_gtime', 'longest_add_gtime', 'taxi_mean_time', 'year',
       'month', 'day_of_month', 'day_of_week', 'dep_hour', 'time_of_day',
       'dep_traffic_per_day', 'avg_delay_per_airline',
       'avg_dep_delay_per_route', 'monthly_avg_passengers',
       'avg_monthly_fuel_gallons', 'avg_monthly_fuel_cost'],
      dtype='object')

In [23]:
# Define the list of unnecessary features to drop
unnecessary_features = ['dep_delay','branded_code_share', 'tail_num','op_carrier_fl_num','cancelled','carrier_delay','weather_delay','nas_delay','security_delay','late_aircraft_delay','total_add_gtime', 'longest_add_gtime','origin_city_name',
                        'dest_city_name','origin_airport_id','dest_airport_id']


In [24]:
# Drop the unnecessary features from df_flights
df_flights.drop(columns=unnecessary_features, inplace=True)

In [25]:
# Rename the 'mkt_unique_carrier' column to 'mkt_carrier'
df_flights.rename(columns={'mkt_unique_carrier': 'mkt_carrier'}, inplace=True)


In [26]:
df_flights.columns

Index(['fl_date', 'mkt_carrier', 'mkt_carrier_fl_num', 'op_unique_carrier',
       'origin', 'dest', 'crs_dep_time', 'dep_time', 'taxi_out', 'wheels_off',
       'wheels_on', 'taxi_in', 'crs_arr_time', 'arr_time', 'arr_delay',
       'crs_elapsed_time', 'actual_elapsed_time', 'air_time', 'flights',
       'distance', 'first_dep_time', 'taxi_mean_time', 'year', 'month',
       'day_of_month', 'day_of_week', 'dep_hour', 'time_of_day',
       'dep_traffic_per_day', 'avg_delay_per_airline',
       'avg_dep_delay_per_route', 'monthly_avg_passengers',
       'avg_monthly_fuel_gallons', 'avg_monthly_fuel_cost'],
      dtype='object')

In [27]:
df_flights.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,taxi_out,wheels_off,...,day_of_month,day_of_week,dep_hour,time_of_day,dep_traffic_per_day,avg_delay_per_airline,avg_dep_delay_per_route,monthly_avg_passengers,avg_monthly_fuel_gallons,avg_monthly_fuel_cost
0,2019-01-08,DL,4179,OO,ATW,DTW,1334,1900-01-01 13:21:00,14.0,1335.0,...,8,1,13,afternoon,3,9.246834,17.236196,773.0,262825545.0,507716787.0
1,2018-05-04,UA,374,UA,IAD,ORD,806,1900-01-01 08:02:00,15.0,817.0,...,4,4,8,morning,47,12.678995,13.836224,2469.0,291973202.0,532297103.0
2,2019-02-11,AA,5254,OH,HPN,DCA,1500,1900-01-01 15:53:00,24.0,1617.0,...,11,0,15,afternoon,7,10.222446,19.122066,1208.0,244408174.0,405192822.0
3,2018-01-16,DL,7409,OO,CIU,DTW,1539,1900-01-01 15:37:00,18.0,1555.0,...,16,1,15,afternoon,1,9.246834,18.308511,1088.0,262825545.0,507716787.0
4,2019-06-27,WN,2360,WN,BUR,SMF,1125,1900-01-01 12:01:00,9.0,1210.0,...,27,3,11,morning,15,10.619946,7.820859,10533.0,178747802.0,342753536.0


### Separating the arr_delay which we want to predict

In [60]:
# Separate the predicted variable (arr_delay)
y = df_flights['arr_delay']

# Remove the predicted variable from the dataframe
df_flights = df_flights.drop('arr_delay', axis=1)

In [61]:
df_flights.dtypes

fl_date                     datetime64[ns]
mkt_carrier                         object
mkt_carrier_fl_num                   int64
op_unique_carrier                   object
origin                              object
dest                                object
crs_dep_time                         int64
dep_time                    datetime64[ns]
taxi_out                           float64
wheels_off                         float64
wheels_on                          float64
taxi_in                            float64
crs_arr_time                         int64
arr_time                           float64
crs_elapsed_time                   float64
actual_elapsed_time                float64
air_time                           float64
flights                              int64
distance                             int64
first_dep_time                     float64
taxi_mean_time                     float64
year                                 int64
month                                int64
day_of_mont

### Train test split

### Separating categorical and numerical variables

In [62]:
categorical_vars = df_flights.select_dtypes(include='object')
numeric_vars = df_flights.select_dtypes(include=['int64', 'float64'])

df_flights_cat = df_flights[categorical_vars.columns]
df_flights_num = df_flights[numeric_vars.columns]


In [63]:
df_flights_cat.head(2)

Unnamed: 0,mkt_carrier,op_unique_carrier,origin,dest
0,DL,OO,ATW,DTW
1,UA,UA,IAD,ORD


In [64]:
df_flights_num.head(2)

Unnamed: 0,mkt_carrier_fl_num,crs_dep_time,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,crs_elapsed_time,actual_elapsed_time,...,month,day_of_month,day_of_week,dep_hour,dep_traffic_per_day,avg_delay_per_airline,avg_dep_delay_per_route,monthly_avg_passengers,avg_monthly_fuel_gallons,avg_monthly_fuel_cost
0,4179,1334,14.0,1335.0,1522.0,9.0,1605,1531.0,91.0,70.0,...,1,8,1,13,3,9.246834,17.236196,773.0,262825545.0,507716787.0
1,374,806,15.0,817.0,849.0,12.0,915,901.0,129.0,119.0,...,5,4,4,8,47,12.678995,13.836224,2469.0,291973202.0,532297103.0
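Note that `time_of_day` (created with `pd.cut`, so it has the `category` dtype) and the datetime columns do not appear in either subset above. The following sketch on a toy frame, with made-up values, shows how `select_dtypes` treats these dtypes. Whether those columns should be kept is a modelling decision the notebook leaves open.

```python
import pandas as pd

# Toy frame with the same dtype mix as df_flights (values are made up).
toy = pd.DataFrame({
    "origin": pd.Series(["ATW", "IAD"], dtype=object),
    "dep_hour": [13, 8],
    "fl_date": pd.to_datetime(["2019-01-08", "2018-05-04"]),
    "time_of_day": pd.Categorical(["afternoon", "morning"]),
})

# 'object' alone skips categorical columns.
only_object = toy.select_dtypes(include="object").columns.tolist()

# Listing 'category' as well picks up time_of_day.
object_and_category = toy.select_dtypes(
    include=["object", "category"]
).columns.tolist()

# Columns that are neither text, categorical, nor numeric (here: the datetime).
leftover = toy.columns.difference(
    toy.select_dtypes(include=["object", "category", "number"]).columns
).tolist()

print(only_object, object_and_category, leftover)
```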


### Feature Engineering

Feature engineering will play a crucial role in these problems. We have very few attributes, so we need to create features that have some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
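For the airport-encoding question in the list above, one option besides one-hot encoding is frequency encoding, which replaces each airport code with its share of rows and so adds a single column. This is a minimal sketch on made-up data; `origin_freq` is a hypothetical column name.

```python
import pandas as pd

toy = pd.DataFrame({"origin": ["ATL", "ATL", "ATL", "ORD", "CIU"]})

# Share of rows per airport; in practice this would be computed on training rows only.
origin_freq = toy["origin"].value_counts(normalize=True)

# Map each airport code to its share.
toy["origin_freq"] = toy["origin"].map(origin_freq)
print(toy)
```

Airports that never appear in the training rows map to NaN and need a default such as 0.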

### One Hot encoding for categorical variables

In [42]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Categorical variable preprocessing
encoder = OneHotEncoder()
encoded_categorical = encoder.fit_transform(df_flights_cat)

In [47]:
df_flights_cat_encoded = pd.DataFrame(encoded_categorical.toarray())

In [50]:
df_flights_cat_encoded.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,781,782,783,784,785,786,787,788,789,790
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Scaling for numerical variables

In [52]:
scaler = StandardScaler()
df_flights_numeric_scaled = scaler.fit_transform(df_flights_num)
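The cell above fits the scaler on the full dataset. If a held-out set is used for evaluation, a common precaution is to learn the scaling statistics from the training rows only and reuse them on the rest. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
X_train = np.array([[100.0], [200.0], [300.0]])
X_test = np.array([[400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same mean and std reused, no refit

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```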

### Combining cat and num dataframes

In [53]:
df_scaled = pd.concat([df_flights_cat_encoded, pd.DataFrame(df_flights_numeric_scaled)], axis=1)


In [54]:
df_scaled.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.644981,-0.885323,-0.972119,-0.002215,-1.159671,-0.455075,1.18165,-1.090323,0.348969,0.509203
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,-0.468374,-1.341781,0.530681,-1.026423,-0.197864,0.834074,0.588894,-0.466187,0.654697,0.650325


### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original features vs. PCA components?
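To explore the PCA option, a quick check is how many components are needed to retain most of the variance. The sketch below uses synthetic data with deliberate redundancy; the 95% threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))

# Five columns built from two underlying signals plus small noise (synthetic data).
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + base[:, 1],
    2 * base[:, 0],
    base[:, 1] - base[:, 0],
]) + 0.01 * rng.normal(size=(200, 5))

X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to reach that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape[1], "->", X_reduced.shape[1])
print(pca.explained_variance_ratio_.round(3))
```

The reduced matrix can then be fed to the same models as the original features to compare validation scores.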

### Modeling

Use different ML techniques to predict each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
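A possible way to compare several of the listed techniques on equal footing is cross-validation with a shared metric. The sketch below uses synthetic regression data and only two of the models; the choice of mean absolute error is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))                          # synthetic features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)   # synthetic target

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign back.
    cv = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    scores[name] = -cv.mean()
    print(f"{name}: MAE = {scores[name]:.2f}")
```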

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**.
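Since the final predictions cover a future week, a validation scheme that mimics this is to hold out the last 7 days of the available data instead of a random sample. The following sketch uses synthetic data and a single made-up relationship between `distance` and `arr_delay`; the metrics shown are common choices for regression, not ones prescribed by the challenge.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
dates = pd.date_range("2019-12-01", "2019-12-31", freq="D")

# Ten synthetic flights per day.
toy = pd.DataFrame({
    "fl_date": dates.repeat(10),
    "distance": rng.uniform(100, 2000, size=len(dates) * 10),
})
toy["arr_delay"] = 0.01 * toy["distance"] + rng.normal(scale=5, size=len(toy))

# Hold out the final 7 days to mimic forecasting one week ahead.
cutoff = toy["fl_date"].max() - pd.Timedelta(days=7)
train = toy[toy["fl_date"] <= cutoff]
valid = toy[toy["fl_date"] > cutoff]

model = LinearRegression().fit(train[["distance"]], train["arr_delay"])
pred = model.predict(valid[["distance"]])

mae = mean_absolute_error(valid["arr_delay"], pred)
rmse = np.sqrt(mean_squared_error(valid["arr_delay"], pred))
r2 = r2_score(valid["arr_delay"], pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```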

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, we need to set one of these variables to 1 and the others to 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
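The transformation described above can be sketched with `idxmax`, which picks the column holding the largest value in each row. The toy rows are illustrative, and `delay_type` is a hypothetical column name.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Two delayed flights and one flight with no recorded delay cause (toy values).
toy = pd.DataFrame(
    [[1.0, 0.0, 13.0, 0.0, 52.0],
     [0.0, 0.0, 0.0, 0.0, 21.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=delay_cols,
)

# Keep only flights with at least one positive delay component.
delayed = toy[toy[delay_cols].gt(0).any(axis=1)].copy()

# idxmax returns the column with the largest value (the first one on ties).
delayed["delay_type"] = delayed[delay_cols].idxmax(axis=1)

# One-hot version: 1 for the dominant cause, 0 elsewhere.
one_hot = (
    pd.get_dummies(delayed["delay_type"])
      .reindex(columns=delay_cols, fill_value=0)
      .astype(int)
)
print(delayed["delay_type"].tolist())
print(one_hot)
```

Either the single `delay_type` label or the one-hot frame can serve as the multiclass target, depending on what the chosen model expects.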

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: only a small fraction of all flights are cancelled. It is important to apply an appropriate sampling strategy before training and to choose suitable evaluation metrics.
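One possible way to handle the imbalance is sketched below on synthetic data with roughly 5% positives. It uses a stratified split so both sets keep the same class ratio, class weighting as a simple alternative to resampling, and per-class precision and recall instead of accuracy. All of these are illustrative choices, not requirements of the challenge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))  # synthetic features

# Rare positive class, loosely tied to the first feature (synthetic labels).
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 1.85).astype(int)

# stratify=y keeps the positive rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" up-weights the rare class during fitting.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
pred = clf.predict(X_test)

recall = recall_score(y_test, pred)
print(classification_report(y_test, pred, digits=3))
```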