# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (or will be).
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly
from plotly.subplots import make_subplots  # required by plot_int_hist below


### *Failed attempts at importing custom modules*

In [73]:
import sys
sys.path.append("C:/Users/silvh/OneDrive/lighthouse/custom python")
import silvhua

In [72]:
import os
# Note: PATH only affects how the OS finds executables; Python resolves imports through sys.path.
os.environ["PATH"] += os.pathsep + 'C:/Users/silvh/OneDrive/lighthouse/custom python'
import silvhua

### *Custom functions*

In [52]:
# Create function to load CSV files.
def load_csv(filepath,filename,column1_as_index=False):
    """
    Load a csv file as a dataframe using specified file path copied from windows file explorer.
    Back slashes in file path will be converted to forward slashes.
    Arguments:
    - filepath (raw string): Use the format r'<path>'.
    - filename (string).
    - column1_as_index (bool): If true, take the first column as the index. 
        Useful when importing CSV files from previously exported dataframes.

    Returns: dataframe object.

    Required import: pandas
    """
    filename = f'{filepath}/'.replace('\\','/')+filename
    df = pd.read_csv(filename)
    if column1_as_index:
        df.set_index(df.columns[0], inplace=True)
        df.index.name = None
    return df

In [76]:
def compare_id(df1, df1_column, df2, df2_column,print_common=False,print_difference=True):
    """
    Print the number of common values and unique values between two dataframe columns.
    
    """
    df1_values = df1[df1_column].unique()
    df2_values = df2[df2_column].unique()
    common_values = set(df1_values) & set(df2_values)
    if len(df1_values) > len(df2_values):
        different_values = set(df1_values) - set(df2_values)
        print(f'Proper subset = {set(df2_values) < set(df1_values)}')
    else:
        different_values = set(df2_values) - set(df1_values)
        print(f'Proper subset = {set(df1_values) < set(df2_values)}')
    print('Unique values in df1:',len(df1_values))
    print('Unique values in df2:',len(df2_values))
    print('Number of common values between df1 and df2:',len(common_values))
    print('Number of different values between df1 and df2:',len(different_values))
    if print_common == True:
        print('Values in common:',common_values)
    if print_difference == True:
        print('Different values:',different_values)

# function that prints null values
def explore(df,id=0,print_n_unique=True, printValues=False):
    """
    Explore dataframe data and print missing values.
    Parameters:
    - df: Dataframe.
    - id: Column number or name with the primary IDs. Default is zero.
    - print_n_unique (bool): If the number of unique values in the first column doesn't match 
        the number of rows in the df, print the number of unique values in each column to see if 
        there's another column that might serve as a unique id.
    """
    if id is False:
        # Pass id=False to skip the unique-ID check.
        pass
    elif isinstance(id,int):
        # An integer is treated as a column position.
        print(f'Unique IDs: {len(set(df.iloc[:,id]))}. # of rows: {df.shape[0]}. Match: {len(set(df.iloc[:,id]))==df.shape[0]}')
    else:
        print(f'Unique IDs: {len(set(df[id]))}. # of rows: {df.shape[0]}. Match: {len(set(df[id]))==df.shape[0]}')
    
    # if the number of unique values in the first column doesn't match the number of rows in the df,
    # print the number of unique values in each column to see if there's another column that might
    # serve as a unique id.
    if (print_n_unique==True):
        if len(set(df.iloc[:,0])) !=df.shape[0]: 
            for column in df.columns:
                print(len(df[column].value_counts()),'\t', column)
    
    # count amount of missing values in each column
    total = df.isnull().sum().sort_values(ascending=False) 
    # % of rows with missing data from each column
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False) 

    # create a table that lists total and % of missing values starting with the highest
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent']) 

    if (printValues == True):
        # extract the names of columns with missing values
        cols_with_missing = missing_data[missing_data.Percent > 0].index.tolist()
        print(df.dtypes[cols_with_missing])

    print()
    return missing_data

# Function to plot multiple histograms using Plotly. Show different colours based on classification.
def plot_int_hist(df, columns=None, color=None):
    """
    Use Plotly to plot multiple histograms using the specified columns of a dataframe.
    Arguments:
    - df: Dataframe.
    - columns (optional): Columns of dataframe on which to create the histogram. If blank, all numeric data will be plotted.
    - color (optional): Provide name of column containing binary classification values 0 and 1. 
        Data points classified as 1 will be in red.
    
    Make sure to do the following imports:
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
    """
    if columns is None:
        columns = df.dtypes[df.dtypes != 'object'].index.tolist()
    fig = make_subplots(rows=round((len(columns)+.5)/2), cols=2,subplot_titles=columns)
    for i, feature in enumerate(columns):
        if color:
            bins = dict(
                start = min(df[feature]),
                end =  max(df[feature]),
                # size=
            )
            zero = df[df[color]==0]
            one = df[df[color] != 0]
            fig.add_trace(go.Histogram(x=zero[feature],
                marker_color='#330C73',
                opacity=0.5,
                xbins=bins), 
                row=i//2+1, col=i % 2 + 1
                )
            fig.add_trace(go.Histogram(x=one[feature],
                marker_color='red',
                opacity=0.5,
                xbins=bins),
                row=i//2+1, col=i % 2 + 1)
        else:
            fig.add_trace(go.Histogram(x=df[feature]), 
            row=i//2+1, col=i % 2 + 1)
    fig.update_layout(height=300*round((len(columns)+.5)/2), 
        showlegend=False,barmode='overlay')
    fig.show()


    
def correlation(df):
    """
    Plot the correlation matrix.
    Returns the dataframe with the correlation values.
    """

    # Compute the correlation matrix once, restricted to numeric columns.
    corr = df.corr(numeric_only=True)

    # Create a mask to exclude the redundant cells that make up half of the graph.
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Create the heatmap with the mask and with annotation
    sns.heatmap(data=corr,mask=mask,annot=True)
    return corr

# Function to plot multiple histograms
def plot_hist(df, columns=None):
    """
    Plot multiple histograms using the specified columns of a dataframe.
    Arguments:
    df: Dataframe.
    columns (optional): Columns of dataframe on which to create the histogram. If blank, all numeric data will be plotted.
    
    Make sure to `import seaborn as sns` and `import matplotlib.pyplot as plt`.
    """
    if columns is None:
        columns = df.dtypes[df.dtypes != 'object'].index.tolist()
    # squeeze=False keeps `ax` two-dimensional even when there is only one row of plots.
    fig, ax = plt.subplots(nrows=round((len(columns)+.5)/2), ncols=2, figsize=(10,18), squeeze=False)
    for i, feature in enumerate(columns):
        sns.histplot(data=df,x=feature,ax=ax[i//2, i % 2])
    plt.tight_layout()

### *Load files*

In [55]:

fuel = load_csv(r"C:\Users\silvh\OneDrive\lighthouse\projects\midterm_shared\data\experimentation/",'fuel_features_selected_10-23.csv',column1_as_index=True)
fuel.head()


Unnamed: 0,month,airline_id,unique_carrier,carrier_group_new,sdomt_gallons,sint_gallons,ts_gallons,tdomt_gallons,tint_gallons,total_gallons,year
0,1,,,1,0.0,0.0,0.0,3000.0,0.0,3000.0,2016
1,1,21352.0,0WQ,1,0.0,0.0,0.0,163052.0,47060.0,210112.0,2016
2,1,21645.0,23Q,1,0.0,0.0,0.0,0.0,0.0,0.0,2016
3,1,21652.0,27Q,1,0.0,0.0,0.0,0.0,0.0,0.0,2016
4,1,20408.0,5V,1,260848.0,0.0,260848.0,284362.0,0.0,284362.0,2016


In [59]:
passengers = load_csv(r'C:\Users\silvh\OneDrive\lighthouse\projects\midterm_shared\data\experimentation','passengers_features_selected_10-23.csv',column1_as_index=True)
passengers

Unnamed: 0,airline_id,unique_carrier,departures_performed,payload,seats,passengers,freight,mail,distance,air_time,...,dest_airport_id,dest_city_market_id,dest_city_name,dest_country_name,aircraft_group,aircraft_type,aircraft_config,year,month,class
0,21342,VB,34.0,1225224.0,6120.0,6069.0,0.0,0.0,1307.0,0.0,...,11874,31874,"Guadalajara, Mexico",Mexico,6,694,1,2019,6,F
1,21342,VB,5.0,180180.0,900.0,852.0,0.0,0.0,1341.0,0.0,...,11032,31032,"Cancun, Mexico",Mexico,6,694,1,2019,6,L
2,21342,VB,1.0,36036.0,180.0,48.0,0.0,0.0,1330.0,0.0,...,10397,30397,"Atlanta, GA",United States,6,694,1,2019,6,L
3,21342,VB,30.0,1081080.0,5400.0,4140.0,0.0,0.0,2090.0,0.0,...,12478,31703,"New York, NY",United States,6,694,1,2019,6,F
4,21342,VB,30.0,1081080.0,5400.0,4604.0,0.0,0.0,1507.0,0.0,...,12889,32211,"Las Vegas, NV",United States,6,694,1,2019,6,F
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,19977,UA,1.0,40142.0,179.0,167.0,3225.0,0.0,269.0,41.0,...,12896,32896,"Lubbock, TX",United States,6,888,1,2019,9,L
99996,19977,UA,1.0,45919.0,179.0,75.0,0.0,0.0,163.0,29.0,...,11775,31775,"Sioux Falls, SD",United States,6,888,1,2019,9,F
99997,19977,UA,1.0,40605.0,166.0,77.0,0.0,0.0,416.0,61.0,...,13930,30977,"Chicago, IL",United States,6,614,1,2019,9,F
99998,20336,H6,3.0,10200.0,27.0,22.0,35.0,0.0,126.0,158.0,...,14485,34485,"Red Dog, AK",United States,4,416,3,2019,9,L


### *How to merge data from passengers and fuel tables*

In [81]:
# See how passengers and fuel tables can be merged
compare_id(passengers, 'carrier_group_new',fuel,'carrier_group_new')
print()
compare_id(passengers, 'airline_id',fuel,'airline_id',print_difference=False)

Proper subset = True
Unique values in df1: 8
Unique values in df2: 3
Number of common values between df1 and df2: 3
Number of different values between df1 and df2: 5
Different values: {0, 4, 5, 6, 9}

Proper subset = False
Unique values in df1: 297
Unique values in df2: 63
Number of common values between df1 and df2: 44
Number of different values between df1 and df2: 253
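
The comparison above suggests that `carrier_group_new` is too coarse to serve as a join key, while `airline_id` overlaps for 44 carriers. A minimal sketch of one possible merge follows, assuming the join keys are `airline_id`, `year` and `month` (all three appear in both printed headers) and that the fuel table has at most one row per key, which has not been verified here. The `*_demo` frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are made up for illustration.
passengers_demo = pd.DataFrame({
    "airline_id": [19977, 19977, 21342],
    "year": [2019, 2019, 2019],
    "month": [9, 10, 6],
    "passengers": [167.0, 75.0, 6069.0],
})
fuel_demo = pd.DataFrame({
    "airline_id": [19977.0, 19977.0, None],
    "year": [2019, 2019, 2019],
    "month": [9, 10, 9],
    "total_gallons": [1000.0, 1200.0, 3000.0],
})

# airline_id is a float in the fuel table because it contains missing values,
# so drop the missing IDs and cast to int before joining.
fuel_keyed = fuel_demo.dropna(subset=["airline_id"]).astype({"airline_id": "int64"})

# A left join keeps every passengers row. `indicator=True` adds a `_merge`
# column flagging rows without fuel data; `validate` raises if the assumed
# one-row-per-key structure of the fuel table does not hold.
merged = passengers_demo.merge(
    fuel_keyed,
    on=["airline_id", "year", "month"],
    how="left",
    indicator=True,
    validate="many_to_one",
)
print(merged["_merge"].value_counts())
```

Rows flagged `left_only` are passengers records whose carrier has no fuel data for that month; whether to drop or impute them is a separate modeling decision.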


In [79]:
fuel['carrier_group_new'].value_counts()

2    1052
3     999
1     984
Name: carrier_group_new, dtype: int64

In [92]:
fuel.groupby(['year','month']).mean(numeric_only=True).filter(regex='gallons')



year,month,sdomt_gallons,sint_gallons,ts_gallons,tdomt_gallons,tint_gallons,total_gallons
2015,1,14763510.0,8466170.0,23229680.0,14998100.0,9026629.0,24024730.0
2015,2,13482880.0,7495644.0,20978520.0,13741120.0,8097401.0,21838520.0
2015,3,16179900.0,8803838.0,24983730.0,16520290.0,9473513.0,25993800.0
2015,4,16074120.0,8915791.0,24989910.0,16332030.0,9610475.0,25942500.0
2015,5,16541190.0,9504175.0,26045360.0,16807600.0,10163460.0,26971060.0
2015,6,17072400.0,9796820.0,26869220.0,17332800.0,10426130.0,27758930.0
2015,7,17939570.0,10335570.0,28275140.0,18172020.0,10926400.0,29098420.0
2015,8,17403700.0,10126920.0,27530620.0,17680570.0,10827380.0,28507950.0
2015,9,15583720.0,9133133.0,24716850.0,15907490.0,9982277.0,25889760.0
2015,10,16695480.0,9108164.0,25803640.0,16997100.0,9801550.0,26798650.0


## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use as predictors. For example, DEP_DELAY would be a perfect predictor, but we can't use it: in a real-life scenario, we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can derive features from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
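
One way to use earlier delays without leaking information is to lag them by at least the forecast horizon. The sketch below assumes hypothetical column names (`fl_date`, `op_unique_carrier`, `arr_delay`) and one observation per carrier per day; it averages the three daily means that end seven days before each date.

```python
import pandas as pd

# Hypothetical daily data for one carrier; column names are assumptions.
flights = pd.DataFrame({
    "fl_date": pd.date_range("2019-01-01", periods=10, freq="D"),
    "op_unique_carrier": ["UA"] * 10,
    "arr_delay": [5.0, 10.0, 0.0, 20.0, 15.0, -5.0, 30.0, 10.0, 0.0, 25.0],
})

# Average arrival delay per carrier and day.
daily = (
    flights.groupby(["op_unique_carrier", "fl_date"], as_index=False)["arr_delay"]
    .mean()
    .sort_values(["op_unique_carrier", "fl_date"])
)

# Shift by 7 rows (7 days here, because every day is present) so each row only
# sees delays known at least one week earlier, then average up to 3 such days.
daily["delay_lag7_mean3"] = daily.groupby("op_unique_carrier")["arr_delay"].transform(
    lambda s: s.shift(7).rolling(window=3, min_periods=1).mean()
)
print(daily[["fl_date", "arr_delay", "delay_lag7_mean3"]])
```

The row-based `shift(7)` is only equivalent to a seven-day lag when no dates are missing; with gaps in the data, reindexing to a full daily calendar first would be needed.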

In [58]:
from datetime import datetime
# Convert the day of the year to a cyclical (sine/cosine) feature
# current_date = "2022-01-01 00:00:00"
# cdate = datetime.strptime(current_date, '%Y-%m-%d %H:%M:%S')
# day_sin = np.sin(2 * np.pi * cdate.timetuple().tm_yday/365.0)
# day_cos = np.cos(2 * np.pi * cdate.timetuple().tm_yday/365.0)
# day_cos
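
The commented-out idea above can be applied to a whole column at once. This is a minimal sketch on made-up dates; in the project the dates would come from the flight-date column, whose name is not shown here.

```python
import numpy as np
import pandas as pd

# Illustrative dates only.
dates = pd.to_datetime(pd.Series(["2019-01-01", "2019-07-02", "2019-12-31"]))

day_of_year = dates.dt.dayofyear
# Map the day of the year onto a circle so that 31 December sits next to 1 January.
day_sin = np.sin(2 * np.pi * day_of_year / 365.0)
day_cos = np.cos(2 * np.pi * day_of_year / 365.0)
print(pd.DataFrame({"date": dates, "day_sin": day_sin, "day_cos": day_cos}))
```

Both components are needed: the sine alone cannot distinguish, for example, early spring from early autumn.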

### Feature Engineering

Feature engineering will play a crucial role in these problems. We have very few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can look at previous delays and compute descriptive statistics.
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
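
Two of the ideas in the list, time of day and airport encoding, can be sketched as follows. The column names (`origin`, `crs_dep_time`) and the `hhmm` integer time format are assumptions, and frequency encoding is only one of several possible encodings for airports.

```python
import pandas as pd

# Hypothetical sample; column names and the hhmm time format are assumptions.
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "LAX", "JFK", "ORD"],
    "crs_dep_time": [545, 1230, 1805, 2359, 900],
})

# Scheduled departure hour from an hhmm integer (e.g. 1805 -> 18).
flights["dep_hour"] = flights["crs_dep_time"] // 100

# Bucket hours into coarse parts of the day; bins are left-closed, e.g. [6, 12).
flights["dep_period"] = pd.cut(
    flights["dep_hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# Frequency encoding: replace each airport with its share of flights,
# which doubles as a rough airport-traffic feature.
flights["origin_freq"] = flights["origin"].map(
    flights["origin"].value_counts(normalize=True)
)
print(flights)
```

In practice the airport shares should be computed on the training period only and then mapped onto later data, so that the encoding does not use information from the period being predicted.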

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA components?
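
The question of original features versus PCA components can be checked empirically with cross-validation. The sketch below uses scikit-learn on synthetic data; the data, the choice of linear regression and `n_components=3` are illustrative assumptions, not the project's actual setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the engineered feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] * 3.0 + X[:, 1] + rng.normal(scale=0.5, size=200)

# Two pipelines that differ only in the PCA step.
candidates = {
    "original": make_pipeline(StandardScaler(), LinearRegression()),
    "pca": make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()),
}

results = {}
for name, model in candidates.items():
    # scikit-learn reports negative MAE, so flip the sign to get the error.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()
    print(name, round(results[name], 3))
```

Keeping the scaler and PCA inside the pipeline means they are refit on each training fold, so the validation folds do not influence the components.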

### Modeling

Use different ML techniques to model each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.
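
Because the final predictions are for a future week, a time-based split is a reasonable way to validate. The sketch below trains on 2018 and validates on 2019 using synthetic data, with hypothetical column names (`fl_date`, `arr_delay`), and scores a constant baseline with MAE and RMSE.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flights table; column names are assumptions.
rng = np.random.default_rng(42)
flights = pd.DataFrame({
    "fl_date": pd.date_range("2018-01-01", "2019-12-31", freq="D"),
})
flights["arr_delay"] = rng.normal(loc=5.0, scale=20.0, size=len(flights))

# Split by time rather than at random, so validation mimics predicting the future.
train = flights[flights["fl_date"].dt.year == 2018]
valid = flights[flights["fl_date"].dt.year == 2019]

# Baseline: always predict the training mean. Any real model should beat this.
baseline_pred = np.full(len(valid), train["arr_delay"].mean())
errors = valid["arr_delay"].to_numpy() - baseline_pred

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print(f"Baseline MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

RMSE penalizes large misses more heavily than MAE, so reporting both shows whether a model's errors are dominated by a few extreme delays.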

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, we need to have one of these variables as 1 and the others as 0.

It can happen that two or more types of delay have more than 0 minutes. In this case, set the largest one to 1 and the others to 0.
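
The transformation described above can be sketched with `idxmax`, which picks the column holding the largest value in each row. The lowercase column names and the delay values are assumptions for illustration.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Made-up delay minutes for three flights.
delays = pd.DataFrame(
    [[15.0, 0.0, 0.0, 0.0, 0.0],
     [10.0, 0.0, 25.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 40.0]],
    columns=delay_cols,
)

# Keep only flights with at least one positive delay component.
delayed = delays[(delays[delay_cols].fillna(0) > 0).any(axis=1)]

# idxmax returns the column with the largest value in each row; ties go to
# the first column listed in delay_cols.
delay_type = delayed[delay_cols].fillna(0).idxmax(axis=1)

# One-hot version: exactly one column is 1 per flight.
one_hot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)
print(delay_type.tolist())
```

For most classifiers the single `delay_type` label column is the more convenient target; the one-hot form is shown because the text describes the target that way.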

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: cancelled flights make up only a small fraction of all flights. It is important to do the right sampling before training and to choose appropriate evaluation metrics.
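
A minimal sketch of handling the imbalance follows, on synthetic data with a few percent positives. It uses a stratified split and class weighting (`class_weight="balanced"`) as an alternative to resampling, and reports precision, recall and average precision instead of accuracy. The data and the choice of logistic regression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: only a few percent of rows are positive (cancelled).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
logits = -4.5 + 1.5 * X[:, 0]
y = (rng.random(5000) < 1 / (1 + np.exp(-logits))).astype(int)

# Stratify so both splits keep the same share of cancellations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights the classes instead of resampling the data.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
avg_precision = average_precision_score(y_test, proba)
print(classification_report(y_test, model.predict(X_test), digits=3))
print("Average precision:", round(avg_precision, 3))
```

Accuracy is misleading here because predicting "not cancelled" for every flight already scores very high; average precision should instead be compared against the share of positives in the test set.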