# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (or will be).
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import calendar


In [2]:
flights = pd.read_csv('flights.csv', low_memory=False)

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns to use and which to leave out. For example, DEP_DELAY would be the perfect predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors either. However, we can create various transformations from their earlier values.
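
One possible way to build such a transformation is to aggregate delays per origin and day, then shift the result so that each flight only sees values from earlier days. The sketch below runs on a tiny made-up frame; the column name `origin_prev_day_delay` and the one-step lag are illustrative assumptions, and because the evaluation is one week ahead a larger lag would likely be needed in practice.

```python
import pandas as pd

# Toy data standing in for the real flights table (values are made up).
toy = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-03"]
    ),
    "origin": ["BUR", "BUR", "BUR", "BUR", "BUR"],
    "arr_delay": [10.0, 20.0, -4.0, 0.0, 7.0],
})

# Mean arrival delay per origin and day.
daily = (
    toy.groupby(["origin", "fl_date"], as_index=False)["arr_delay"]
    .mean()
    .sort_values(["origin", "fl_date"])
)

# Shift within each origin so a day only carries the mean of the previous
# observed day (not necessarily the previous calendar day).
daily["origin_prev_day_delay"] = daily.groupby("origin")["arr_delay"].shift(1)

# Attach the lagged value to every flight of that origin and day.
toy = toy.merge(
    daily[["origin", "fl_date", "origin_prev_day_delay"]],
    on=["origin", "fl_date"],
    how="left",
)
```

The first day has no history, so its lagged value is missing and has to be imputed or dropped before modeling.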

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

In [3]:
flights = flights.drop(['diverted','cancelled'],axis=1)

In [4]:
# drop columns with too much missing data
flights = flights.drop(['carrier_delay','weather_delay','nas_delay','security_delay','late_aircraft_delay'],axis=1)

In [5]:

flights = flights.drop(['no_name','dup','cancellation_code',
                              'first_dep_time', 'total_add_gtime',
                              'longest_add_gtime','tail_num'],axis=1)



In [6]:
flights.mkt_carrier.unique()

array(['WN', 'UA', 'DL', 'AA', 'AS', 'NK', 'G4', 'HA', 'B6', 'F9', 'VX'],
      dtype=object)

In [7]:
# replace carrier codes with full carrier names
# (assign the result back instead of calling replace(..., inplace=True) on a column)
flights['mkt_carrier'] = flights['mkt_carrier'].replace({
    'UA':'United Airlines',
    'AS':'Alaska Airlines',
    'B6':'JetBlue Airways',
    'F9':'Frontier Airlines',
    'G4':'Allegiant Air',
    'HA':'Hawaiian Airlines',
    'NK':'Spirit Airlines',
    'VX':'Virgin America',
    'WN':'Southwest Airlines',
    'AA':'American Airlines',
    'DL':'Delta Airlines',
})

In [9]:
flights.mkt_carrier.unique()

array(['Southwest Airlines', 'United Airlines', 'Delta Airlines',
       'American Airlines', 'Alaska Airlines', 'Spirit Airlines',
       'Allegiant Air', 'Hawaiian Airlines', 'JetBlue Airways',
       'Frontier Airlines', 'Virgin America'], dtype=object)

In [10]:
flights.isnull().sum()

fl_date                     0
mkt_unique_carrier          0
branded_code_share          0
mkt_carrier                 0
mkt_carrier_fl_num          0
op_unique_carrier           0
op_carrier_fl_num           0
origin_airport_id           0
origin                      0
origin_city_name            0
dest_airport_id             0
dest                        0
dest_city_name              0
crs_dep_time                0
dep_time               258814
dep_delay              263754
taxi_out               273274
wheels_off             273264
wheels_on              281162
taxi_in                281172
crs_arr_time                0
arr_time               275079
arr_delay              311744
crs_elapsed_time           20
actual_elapsed_time    309157
air_time               315221
flights                     0
distance                    0
dtype: int64

In [11]:
flights = flights.dropna()

In [12]:
flights.isnull().sum()

fl_date                0
mkt_unique_carrier     0
branded_code_share     0
mkt_carrier            0
mkt_carrier_fl_num     0
op_unique_carrier      0
op_carrier_fl_num      0
origin_airport_id      0
origin                 0
origin_city_name       0
dest_airport_id        0
dest                   0
dest_city_name         0
crs_dep_time           0
dep_time               0
dep_delay              0
taxi_out               0
wheels_off             0
wheels_on              0
taxi_in                0
crs_arr_time           0
arr_time               0
arr_delay              0
crs_elapsed_time       0
actual_elapsed_time    0
air_time               0
flights                0
distance               0
dtype: int64

In [13]:
# drop columns that duplicate information held in other columns
flights = flights.drop(['mkt_unique_carrier','branded_code_share','op_unique_carrier','mkt_carrier_fl_num'],axis=1)
flights = flights.drop(['flights','origin_city_name','dest_city_name'],axis=1)

In [14]:
flights['day'] = pd.DatetimeIndex(flights['fl_date']).day
flights['month'] = pd.DatetimeIndex(flights['fl_date']).month
flights['fl_date'] = pd.to_datetime(flights['fl_date'])
flights['weekday'] = flights['fl_date'].dt.dayofweek


flights = flights.drop(['dest_airport_id','origin_airport_id'],axis=1)
flights = flights.drop(['taxi_out', 'wheels_off', 'wheels_on', 'taxi_in'],axis=1)

In [15]:
flights['dep_time'] = pd.to_datetime(flights['dep_time'])
flights['dep_time'] = flights['dep_time'].dt.hour
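
If `dep_time` holds hhmm-style numbers (as `arr_time` does in the `flights.head()` output further down, e.g. `922.0`), `pd.to_datetime` interprets them as nanoseconds since the epoch, so `.dt.hour` comes out as 0 for every row. A minimal sketch of extracting the hour by integer division instead, on made-up values; clamping 2400 to hour 23 is an assumption that mirrors the `23:59` convention used in `time_row`:

```python
import pandas as pd

# Made-up hhmm values in the same style as the raw time columns.
hhmm = pd.Series([5.0, 829.0, 1650.0, 2400.0])

# Integer division by 100 drops the minutes; 2400 is clamped to hour 23.
hour = (hhmm.astype("int64") // 100).clip(upper=23)

# For comparison: parsing the raw numbers as timestamps gives hour 0 throughout.
naive_hour = pd.to_datetime(hhmm.astype("int64")).dt.hour
```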

In [16]:
# convert an integer hhmm time (e.g. 5, 45, 830, 1650) to an 'HH:MM' string
def time_row(row):
    # zero-pad to four digits so that 5 -> '0005' and 830 -> '0830'
    row_str = str(int(row)).zfill(4)
    hour, minute = row_str[:2], row_str[2:]
    # 24:00 is not a valid clock time, so map it to the last minute of the day
    if row_str == '2400':
        return '23:59'
    return hour + ':' + minute




In [17]:
# apply the function and convert the result to datetime.time objects
flights['dep_time'] = flights['dep_time'].apply(time_row)

flights['dep_time'] = pd.to_datetime(flights['dep_time'], format='%H:%M').dt.time

In [18]:
def hr_func(ts):
    return ts.hour

flights['dep_hour'] = flights['dep_time'].apply(hr_func)

In [19]:
flights["arr_delay"] = flights["arr_delay"].fillna(0)

In [25]:
# Binning: encode the arrival delay as an ordinal category based on fixed intervals (in minutes)

def time_cat(flights, col):
    '''Bin the delay in `col` into categories 0-4 and store them in flights['time_cat'].'''
    length = []
    for i in flights[col]:
        if (i >= -85) and (i <= -15):
            length.append(0)  # 0 = very early (15 to 85 minutes ahead of schedule)
        elif (i > -15) and (i <= 0):
            length.append(1)  # 1 = early or on time
        elif (i > 0) and (i <= 8):
            length.append(2)  # 2 = short delay
        elif (i > 8) and (i <= 350):
            length.append(3)  # 3 = medium to long delay
        else:
            length.append(4)  # 4 = outside the ranges above (or NaN)

    flights['time_cat'] = length

In [26]:
flights.head()

Unnamed: 0,fl_date,mkt_carrier,op_carrier_fl_num,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_time,arr_delay,crs_elapsed_time,actual_elapsed_time,air_time,distance,day,month,weekday,dep_hour
0,2019-08-11,Southwest Airlines,2779,BUR,SJC,830,00:00:00,-1.0,935,922.0,-13.0,65.0,53.0,47.0,296,11,8,6,0
1,2019-08-11,Southwest Airlines,3413,BUR,SJC,2050,00:00:00,-1.0,2155,2153.0,-2.0,65.0,64.0,48.0,296,11,8,6,0
2,2019-08-11,Southwest Airlines,4131,BUR,SJC,1020,00:00:00,-1.0,1130,1121.0,-9.0,70.0,62.0,46.0,296,11,8,6,0
3,2019-08-11,Southwest Airlines,4159,BUR,SJC,1325,00:00:00,-1.0,1430,1431.0,1.0,65.0,67.0,45.0,296,11,8,6,0
4,2019-08-11,Southwest Airlines,4254,BUR,SJC,1650,00:00:00,0.0,1805,1751.0,-14.0,75.0,61.0,45.0,296,11,8,6,0


In [None]:
len(flights_df)

In [22]:
# DataFrame.copy() makes a deep copy by default
flights_df = flights.copy()

In [23]:
#flights_df['arr_delay'] = pd.series

In [28]:
time_cat(flights,'arr_delay')

In [29]:
# mean arrival delay within each time_cat bin; this is derived from the target
# itself, so it should not be fed to the model as a predictor
for cat in range(5):
    in_bin = flights['time_cat'] == cat
    flights.loc[in_bin, 'avg_delay'] = flights.loc[in_bin, 'arr_delay'].mean()

In [None]:
flights.describe()

In [30]:

flights = flights.drop(['op_carrier_fl_num'],axis=1)
flights = flights.drop(['fl_date'],axis=1)

In [31]:
for col in ['mkt_carrier','origin','dest','month','weekday']:
    flights[col]=flights[col].astype('category')

In [32]:
flights["mkt_carrier"] = flights["mkt_carrier"].cat.codes
flights["origin"] = flights["origin"].cat.codes
flights["dest"] = flights["dest"].cat.codes

In [33]:
flights = flights.drop(['crs_arr_time','dep_time'],axis=1)
flights = flights.drop(['crs_dep_time'],axis=1)
flights = flights.drop(['arr_time'],axis=1)

In [35]:
flights.head(10)

Unnamed: 0,mkt_carrier,origin,dest,dep_delay,arr_delay,crs_elapsed_time,actual_elapsed_time,air_time,distance,day,month,weekday,dep_hour,time_cat,avg_delay
0,7,58,329,-1.0,-13.0,65.0,53.0,47.0,296,11,8,6,0,1,-7.518593
1,7,58,329,-1.0,-2.0,65.0,64.0,48.0,296,11,8,6,0,1,-7.518593
2,7,58,329,-1.0,-9.0,70.0,62.0,46.0,296,11,8,6,0,1,-7.518593
3,7,58,329,-1.0,1.0,65.0,67.0,45.0,296,11,8,6,0,2,4.117733
4,7,58,329,0.0,-14.0,75.0,61.0,45.0,296,11,8,6,0,1,-7.518593
5,7,58,329,-6.0,-23.0,75.0,58.0,46.0,296,11,8,6,0,0,-21.850241
6,7,58,329,-3.0,-8.0,65.0,60.0,48.0,296,11,8,6,0,1,-7.518593
7,7,58,329,66.0,68.0,70.0,72.0,47.0,296,11,8,6,0,3,50.585052
8,7,58,329,-1.0,-8.0,70.0,63.0,46.0,296,11,8,6,0,1,-7.518593
9,7,58,329,-8.0,-12.0,70.0,66.0,46.0,296,11,8,6,0,1,-7.518593


In [39]:
codes, cats = pd.factorize(flights.mkt_carrier)

In [40]:
print(cats,codes, cats[codes], sep='\n\n')

Int64Index([7, 9, 3, 2, 0, 8, 1, 5, 6, 4, 10], dtype='int64')

[0 0 0 ... 0 0 0]

Int64Index([7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
            ...
            7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
           dtype='int64', length=15605076)


In [36]:
flights.to_csv(r'flights_final.csv')

### Feature Engineering

Feature engineering will play a crucial role in these problems. We have only a few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
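
Two of the ideas above, time of day and a crude airport-traffic proxy, can be sketched as follows. The frame is made up, the bin edges are arbitrary, and the column names `crs_dep_hour`, `dep_period`, and `origin_share` are illustrative assumptions rather than part of the dataset.

```python
import pandas as pd

# Toy data: origin airport and scheduled departure time as an hhmm integer.
toy = pd.DataFrame({
    "origin": ["BUR", "BUR", "SJC", "LAX", "BUR", "LAX"],
    "crs_dep_time": [830, 2050, 1020, 1325, 1650, 45],
})

# Scheduled departure hour from the hhmm integer.
toy["crs_dep_hour"] = toy["crs_dep_time"] // 100

# Coarse time-of-day bins: [0, 6), [6, 12), [12, 18), [18, 24).
toy["dep_period"] = pd.cut(
    toy["crs_dep_hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# Frequency encoding: share of all flights that leave from each origin.
origin_share = toy["origin"].value_counts(normalize=True)
toy["origin_share"] = toy["origin"].map(origin_share)
```

Frequency encoding keeps a single numeric column per airport, which avoids the very wide matrix that one-hot encoding several hundred airports would produce.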

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA components?
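
To make the PCA option concrete, here is a minimal sketch written in plain NumPy so that the mechanics are visible; in practice a library implementation such as scikit-learn's `PCA` class would be the usual choice. The data is synthetic and the 95% variance threshold is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: the third column is nearly a copy of the first.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)

# Standardise each column, then take the SVD of the standardised matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

# Share of the total variance explained by each principal component.
explained = S**2 / np.sum(S**2)

# Keep the smallest number of components that covers 95% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
X_reduced = Xs @ Vt[:k].T
```

Because one column is redundant, two components are enough here; comparing a model trained on `X_reduced` with one trained on the original columns answers the question posed above.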

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
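
Since the final predictions are for a week that lies after the training data, a time-based holdout is a closer rehearsal than a random split. The sketch below holds out the last 7 days of a made-up frame and scores a mean-value baseline with MAE and RMSE; the baseline and the choice of metrics are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one flight per day for 28 days (values are made up).
toy = pd.DataFrame({
    "fl_date": pd.date_range("2019-12-01", periods=28, freq="D"),
    "arr_delay": np.arange(28, dtype=float),
})

# Hold out the final 7 days, mirroring the one-week-ahead evaluation.
cutoff = toy["fl_date"].max() - pd.Timedelta(days=7)
train = toy[toy["fl_date"] <= cutoff]
valid = toy[toy["fl_date"] > cutoff]

# Baseline: predict the training mean for every validation flight.
pred = np.full(len(valid), train["arr_delay"].mean())

# Mean absolute error and root mean squared error on the holdout week.
err = valid["arr_delay"].to_numpy() - pred
mae = np.mean(np.abs(err))
rmse = np.sqrt(np.mean(err**2))
```

Any real model should beat this baseline on the same holdout before it is worth submitting.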

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, we need to have one of these variables as 1 and the others as 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
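
The rule above can be sketched with `idxmax` across the five delay columns. The rows are made up, and `delay_type` is a hypothetical column name introduced for the example.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Made-up delay minutes for three delayed flights.
toy = pd.DataFrame(
    [[20.0, 0.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 45.0],
     [0.0, 30.0, 12.0, 0.0, 0.0]],
    columns=delay_cols,
)

# Name of the column with the largest delay in each row
# (on a tie, the first column in delay_cols wins).
toy["delay_type"] = toy[delay_cols].idxmax(axis=1)

# One-hot version: 1 for the dominant delay type, 0 for the others.
one_hot = (
    pd.get_dummies(toy["delay_type"])
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)
```

For most classifiers the single `delay_type` label column is sufficient as the target; the one-hot frame is only needed if a model expects that layout.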

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: cancelled flights make up only a very small share of all flights. It is important to do the right sampling before training and to choose the correct evaluation metrics.
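
A small made-up example shows why accuracy alone is misleading here and what a simple resampling step could look like. The 2% cancellation rate and the use of random undersampling are assumptions for illustration; other options such as oversampling or class weights are equally valid.

```python
import numpy as np
import pandas as pd

# Toy labels: 2 cancelled flights out of 100 (made-up proportions).
toy = pd.DataFrame({"cancelled": [1] * 2 + [0] * 98})

# Predicting "not cancelled" everywhere looks accurate but finds nothing.
pred = np.zeros(len(toy), dtype=int)
accuracy = (pred == toy["cancelled"]).mean()
recall = ((pred == 1) & (toy["cancelled"] == 1)).sum() / (toy["cancelled"] == 1).sum()

# Random undersampling of the majority class (apply to training data only).
minority = toy[toy["cancelled"] == 1]
majority = toy[toy["cancelled"] == 0].sample(n=len(minority), random_state=0)
balanced = pd.concat([minority, majority])
```

The all-negative predictor reaches 98% accuracy with zero recall, which is why metrics such as recall, precision, or F1 on the cancelled class are more informative for this task.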