# Using Random Forest Regression and Functions

## About me

My name is Gabriel Scott, and I am a senior majoring in Mathematics. Though I mostly study pure mathematics, I am very interested in operations research and transportation research. I first learned of these fields through my work in STMATH 381, Discrete Mathematical Modeling. To further my understanding of coding in python and real world modeling, I chose to expand on what was learned in 381 with real world travel data modeling. 

### Background
Since part of my project is an analysis of how route predicts delay. It seems possible that a random forest regression could be constructed to predict Arrival Delay from the destination features. Destination is considered initially so the origin can be controlled. Rather than considering the whole network of routes, just routes departing a selected airport will be considered. This should provide less interaction and less noisy data for the Algorithm to Train and Test on.(Xu et al., 2008) The reach goal of this model is to be able to predict delay, but it would be useful to just know what features contribute to delay as well.

Unfortunately, I was unable to create a model that is better than the baseline generated from getting the mean delay of the sample. However, this project still generates a modular way to find the destinations that yield the most affect on delay. This could lead to a meta-analysis of the geographic location of the airports to see if there is delay associated with location. Even a temporal analysis could be considered by including time features, as it has been found that time based trends affect air travel.(Tu et al., 2008).

In this adaptation of the Random Forest exercise completed in class, the features and labels can be semi-modularly selected. Though the features and labels are hardcoded into some of the functions, it is very possible to change a couple lines to allow for more features or more labels given a certain originating airport. Additionally, the importance factors were ommitted as I could not reliably recreate this code without heavily copying the exercise project. In previous studies it was found that certain airports, due to climate or other factors, always have more delay,(Chatterji and Sridhar, 2005) and this could be validated if there were particular destinations that contribute more to arrival delay for various origins considered. I.e. We see a reporting destination occur high in the importance factors list for multiple origins.



In [3]:
import statsmodels.formula.api as smf
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
import pylab
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
import pydot


data_path = "airline_2m.csv"

#CSV Can be directly downloaded as a tar file here: 

#https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/airline_2m.tar.gz?_ga=2.241493383.460169643.1645306071-17791737.1643504108

# Credit to IBM's Airline Dataset for the data and encoding code.


data = pd.read_csv(data_path, encoding = "ISO-8859-1",
                 dtype={'Div1Airport': str, 'Div1TailNum': str, 'Div2Airport': str, 'Div2TailNum': str})
#Code Provided By IBMs Airline Reporting Carrier on-time performance set. 
#This helps python better interpret the data frame because it is not encoded in UTF-8 
#An alternative to this could be to actually change the file to be encoded in UTF-8
#but this was difficult to do with how big the file was.

### originSampler
This function samples the data and automatically restricts the frame to the data for the flights regarding the airport we want. A sample is also generated in the cell below for ease of example.

In [4]:
sample = data.sample(10000, random_state = 420)

In [5]:
def originSampler(df, orig):
    """
    origin(Dataframe, Origin Airport IATA code)
    Changes the dataframe to only include flights
    from the origin code.
    Example:
    orginSampler(sample,'SEA') 
    sample = dataframe, orig = 'SEA'
    """
    originSubset = df[df['Origin'] == orig]
    return originSubset

### oneHot
This function one hot encodes the dataset and returns the labels and features. The arrival delay has NaN values for no delay, this is rectified by adding 0s in. Furthermore, it drops the rest of the rows that have remaining NaN values. Notice that the only features considered are destination and delay. Origin is already accounted for by originSampler, and more features can be considered by modifying the column names of the df call.

In [35]:
def oneHot(df):
    """
    This function takes in a dataframe and onehot
    encodes it with the label set to be the Arrival Delay
    and the Feature set to be Destination.
    It returns the labels and features
    
    """
    #Selecting only the wanted features from the sample
    df = df[['Dest','ArrDelay']]
    #Filling the NaN entries with 0, this is standard for the Dataset, as NaN represents
    #no delay for the sake of this particular dataset.
    df['ArrDelay'].fillna(0)
    sample_OH = df.dropna(axis = 0)
    #Print-check of the selected features.
    print(df.columns)
    #The actual hot-encoding step, adds the encoding for the
    #categorical data in the sample.
    print(df.head())
    
    features = pd.get_dummies(sample_OH)
    #Print-check for successful one-hot encoding.
    print(features.head())

    # The chosen label is ArrDelay. i.e. We are Predicting Arrival Delay
    labels = np.array(features['ArrDelay'])
    # Remove labels from the features
    features = features.drop('ArrDelay', axis = 1)
    # Convert to numpy array
    features = np.array(features)
    #return the labels and features
    return labels, features

In [33]:
#Example Call
labels, features = oneHot(originSampler(sample, 'SEA'))

Index(['Dest', 'ArrDelay'], dtype='object')
        Dest  ArrDelay
458182   ORD       6.0
1108738  SAN      -6.0
383793   SJC      -2.0
633893   LAX     -14.0
1422016  ORD      -9.0
         ArrDelay  Dest_ANC  Dest_ATL  Dest_AUS  Dest_BNA  Dest_BOI  Dest_BOS  \
458182        6.0         0         0         0         0         0         0   
1108738      -6.0         0         0         0         0         0         0   
383793       -2.0         0         0         0         0         0         0   
633893      -14.0         0         0         0         0         0         0   
1422016      -9.0         0         0         0         0         0         0   

         Dest_BUR  Dest_BWI  Dest_CLE  ...  Dest_RDU  Dest_SAN  Dest_SBA  \
458182          0         0         0  ...         0         0         0   
1108738         0         0         0  ...         0         1         0   
383793          0         0         0  ...         0         0         0   
633893          0         0

### Train and Split

This is the same code used in the exercise. I was going to change the nomenclature used, but it became difficult to keep track of these variable names and the dummy parameter names.

In [40]:
# Split the data into training and testing sets
# This was taken from the Random Forest Exercise
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 420)

In [41]:
print(train_features)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


### baseLine 
This function takes in the split up training features and test features to create an array that is populated by the average delay that is the length of the test features.

In [50]:
def baseLine(trainFeatures, testFeatures):
    """
    This function takes in the trained features and tested features
    and makes a list of the mean of the trained features that
    is the length of the test features
    
    """
    #go down the train features row by row and pullout delay values 
    #This line is from an article I found on creating
    #Zero Algorithm for classification.
    #https://machinelearningmastery.com/implement-baseline-machine-learning-algorithms-scratch-python/
    values = [row[-1] for row in trainFeatures]
    #Calculates the mean delay value
    mean = sum(values) / float(len(values))
    #Populate a list called predicted that is the length of the test features
    predicted = [mean for i in range(len(testFeatures))]
    return predicted

In [51]:
#Generate the RF regressor
rf = RandomForestRegressor(n_estimators = 100, random_state = 420)
# Train the model on training data
rf.fit(train_features, train_labels);
# Use the forest's predict method on the test data

### baselineChecker
This function generates baseline_errors from baseLine and prints them out to be compared with predictCheck.

In [64]:
def baselineChecker(baseLine):
    # The baseline predictions were generated in baseLine()
    baseline_pred = np.array(baseLine)
    # Baseline errors, and display average baseline error
    baseline_errors = abs(baseline_pred - test_labels)
    print('Average baseline error: ', np.mean(baseline_errors))

In [65]:
#Example Call
baselineChecker(baseLine(train_features, test_features))

Average baseline error:  18.118815738215815


### predictCheck
This function generates the errors of the predictions that the model generated. Notice that the error is greater than the baseline error. This signifies that the model does a worse job than average, and this model needs to include more or be paramterized in some manner.

In [61]:
def predictCheck(testFeatures, testLabels):
    # Use the forest's predict method on the test data
    predictions = rf.predict(test_features)
    # Calculate the absolute errors
    errors = abs(predictions - test_labels)
    # Print out the mean absolute error (mae)
    print('Mean Absolute Error:', np.mean(errors))

In [62]:
#Example Call
predictCheck(test_features, test_labels)

Mean Absolute Error: 22.41106941529259


## Further Readings and Citations


- Zero Rule Algorithm Regression Reading:
 https://machinelearningmastery.com/implement-baseline-machine-learning-algorithms-scratch-python/
 
- Enhanced Random Forest Methods:












### Citations

Chatterji, G., Sridhar, B., 2005. National Airspace System Delay Estimation Using Weather 
Weighted Traffic Counts, in: AIAA Guidance, Navigation, and Control Conference and 
Exhibit. Presented at the AIAA Guidance, Navigation, and Control Conference and 
Exhibit, American Institute of Aeronautics and Astronautics, San Francisco, California. 
https://doi.org/10.2514/6.2005-6278

Tu, Y., Ball, M.O., Jank, W.S., 2008. Estimating Flight Departure Delay Distributions—A 
Statistical Approach With Long-Term Trend and Short-Term Pattern. J. Am. Stat. Assoc. 
103, 112–125. https://doi.org/10.1198/016214507000000257

Xu, N., Sherry, L., Laskey, K.B., 2008. Multifactor Model for Predicting Delays at U.S. 
Airports. Transp. Res. Rec. 2052, 62–71. https://doi.org/10.3141/2052-08
