In [23]:
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import brier_score_loss, log_loss
from transformers import DateTimeTransformer, AirportLatLongTransformer

# Predictive Analysis

This predictive analysis aims to answer questions like, "What is the probability that this flight will be on time?" or "What is the chance that this flight will experience a major delay?"

Various machine learning strategies will be used to identify patterns in basic flight schedule data to determine the probability of delay. This predictive model can then be used by airline schedule planners, travel agencies, and airline customers for planning trips.

## Preprocessing

To begin developing a predictive model, the data is first imported from the `2019_prepared.csv` file, which is generated by running the scripts in the [Data Preparation](./data_preparation.ipynb) notebook.

The data is then split into training and test sets (80% and 20%) so that the model can be assessed on flights it has not seen during training. The numeric features are standardized with a `StandardScaler`, while the airline carrier (e.g., `DL` for Delta, `UA` for United) is treated as a categorical feature and encoded with a `OneHotEncoder`.

In [None]:

df = pd.read_csv('data/2019_prepared.csv')

y = df['DELAY_CATEGORY']
X = df.drop(columns = 'DELAY_CATEGORY')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numerical_features = ['FL_DAY', 'DEP_MINUTES', 'DAY_OF_WEEK', 'ORIGIN_LAT', 'ORIGIN_LON', 'DEST_LAT', 'DEST_LON']
categorical_features = ['OP_UNIQUE_CARRIER']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numerical_features),
    ]
)

## Basic Model Training

Three models have been chosen initially, based on their ability to predict class probabilities rather than only class labels. In this case, we want to predict the probability of no delay, minor delay, major delay, and severe delay, given basic flight information.

Logistic Regression is the first candidate. It is trained by minimizing log loss, so its predicted probabilities tend to be reasonably well calibrated without further adjustment.

In [None]:
clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

clf.fit(X_train, y_train)

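To gauge the quality of the predicted probabilities, two metrics are used: the Brier score and log loss. For both, lower values are better. The Brier score is computed separately for each class in a one-vs-rest fashion and then averaged. These calculations are encapsulated in a function for reuse with the other models.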
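To make the two metrics concrete, the following sketch computes them by hand with NumPy for three made-up predictions. The probabilities and labels are invented for illustration; only the class names match the notebook.

```python
import numpy as np

# Class order mirrors the alphabetical order scikit-learn uses for classes_.
classes = np.array(["MAJOR_DELAY", "MINOR_DELAY", "NO_DELAY", "SEVERE_DELAY"])

# Hypothetical true labels and predicted probabilities (each row sums to 1).
y_true = np.array(["NO_DELAY", "MINOR_DELAY", "NO_DELAY"])
proba = np.array([
    [0.05, 0.15, 0.75, 0.05],
    [0.10, 0.50, 0.35, 0.05],
    [0.20, 0.30, 0.40, 0.10],
])

# One-hot matrix: 1.0 where the row's true label equals the column's class.
onehot = (y_true[:, None] == classes[None, :]).astype(float)

# One-vs-rest Brier score per class: mean squared gap between the
# predicted probability and the 0/1 outcome for that class.
brier_per_class = ((proba - onehot) ** 2).mean(axis=0)
average_brier = brier_per_class.mean()

# Multiclass log loss: mean negative log of the probability that the
# model assigned to the class that actually occurred.
true_class_proba = (proba * onehot).sum(axis=1)
log_loss_value = -np.log(true_class_proba).mean()

for label, score in zip(classes, brier_per_class):
    print(f"Brier score for class {label}: {score:.4f}")
print(f"Average Brier score: {average_brier:.4f}")
print(f"Log loss: {log_loss_value:.4f}")
```

The Brier score is bounded between 0 and 1 per class, whereas log loss is unbounded and grows sharply when a small probability is assigned to the class that actually occurs.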

In [19]:
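def gauge_performance(clf, X_test, y_test):
    """Print one-vs-rest Brier scores per class and the multiclass log loss."""
    y_pred_proba = clf.predict_proba(X_test)

    # Columns of y_pred_proba follow the order of the classifier's classes_.
    classes = clf.named_steps['classifier'].classes_
    brier_scores = []

    for i, class_label in enumerate(classes):
        # Treat each class as a binary "is this class or not" problem.
        brier_score = brier_score_loss(y_test == class_label, y_pred_proba[:, i])
        brier_scores.append(brier_score)
        print(f'Brier score for class {class_label}: {brier_score}')

    average_brier_score = np.mean(brier_scores)
    print(f'Average Brier score: {average_brier_score}')

    print("\nLog loss (smaller is better):")
    # Pass labels explicitly so probability columns are matched to the right classes.
    print(log_loss(y_test, y_pred_proba, labels=classes))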

gauge_performance(clf, X_test, y_test)

Brier score for class MAJOR_DELAY: 0.05432723560244027
Brier score for class MINOR_DELAY: 0.08993720026190655
Brier score for class NO_DELAY: 0.137430744334431
Brier score for class SEVERE_DELAY: 0.024702888392350065
Average Brier score: 0.07659951714778197

Log loss (smaller is better):
0.6183724514261334


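The results from Logistic Regression show an average Brier score of about 0.077 and a log loss of about 0.618. These values serve as the baseline for the models that follow.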

The next model to assess is the RandomForestClassifier. 

In [20]:
clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)

Brier score for class MAJOR_DELAY: 0.055227218874476555
Brier score for class MINOR_DELAY: 0.09278204984169135
Brier score for class NO_DELAY: 0.13849656827698908
Brier score for class SEVERE_DELAY: 0.02501327239301399
Average Brier score: 0.07787977734654275

Log loss (smaller is better):
0.7810242551774128


The Brier scores are comparable to those of LogisticRegression (an average of about 0.078 versus 0.077), but the log loss is noticeably higher (about 0.781 versus 0.618).

The third and final model to check is the DecisionTreeClassifier.

In [21]:
clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)


Brier score for class MAJOR_DELAY: 0.11112245940149117
Brier score for class MINOR_DELAY: 0.1832805637830661
Brier score for class NO_DELAY: 0.2700439178837708
Brier score for class SEVERE_DELAY: 0.04682872025329384
Average Brier score: 0.15281891533040548

Log loss (smaller is better):
11.016304030939951


These results are much worse than those of either Logistic Regression or Random Forests. The Brier score for each category roughly doubles, and the log loss rises from below 1 to about 11.0.
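One likely explanation, not verified on the flight data itself, is that a fully grown decision tree ends in pure leaves and therefore predicts probabilities of exactly 0 or 1. Every misclassified flight then receives a probability of 0 for its true class, which log loss penalizes very heavily. The sketch below illustrates this on synthetic data; the dataset and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with some label noise (not the flight data).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    n_classes=3, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Default settings grow the tree until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
proba = tree.predict_proba(X_te)

# With pure leaves, every predicted probability is exactly 0 or 1.
only_hard = bool(np.isin(proba, [0.0, 1.0]).all())
print("Only 0/1 probabilities:", only_hard)

# Manual log loss with clipping, so that log(0) stays finite.
eps = np.finfo(float).eps
p_true = np.clip(proba[np.arange(len(y_te)), y_te], eps, 1.0)
manual_log_loss = -np.log(p_true).mean()

# Each wrong prediction contributes -log(eps); each correct one contributes 0.
error_rate = (tree.predict(X_te) != y_te).mean()
print("Error rate:", error_rate)
print("Log loss:", manual_log_loss)
print("Error rate * -log(eps):", error_rate * -np.log(eps))
```

Under this reading, the log loss of an unpruned tree is essentially its error rate multiplied by the clipping penalty, so it says little about probability quality. Limiting `max_depth` or raising `min_samples_leaf` would give leaves with mixed classes and softer probabilities.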

## Calibrated Classifier

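To improve upon the results of the LogisticRegression and RandomForestClassifier models, each can be wrapped in a CalibratedClassifierCV, which uses cross-validated predictions to fit a mapping from the model's raw scores to better-calibrated probabilities.

With this method, the RandomForestClassifier model shows the most improvement: its log loss falls from about 0.781 to about 0.618, essentially matching the uncalibrated LogisticRegression.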
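The effect of calibration can be checked side by side on a small synthetic problem, as in the sketch below. The data and variable names are assumptions for illustration, and the direction and size of any change will differ from dataset to dataset.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic three-class data with label noise (not the flight data).
X_toy, y_toy = make_classification(
    n_samples=1500, n_features=8, n_informative=4,
    n_classes=3, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Uncalibrated forest: probabilities are averaged tree votes.
raw_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Calibrated forest: isotonic regression is fitted on held-out folds
# and applied on top of the forest's probabilities.
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5,
    method="isotonic",
).fit(X_tr, y_tr)

raw_proba = raw_rf.predict_proba(X_te)
calibrated_proba = calibrated_rf.predict_proba(X_te)

print("Raw log loss:       ", log_loss(y_te, raw_proba, labels=raw_rf.classes_))
print("Calibrated log loss:", log_loss(y_te, calibrated_proba, labels=calibrated_rf.classes_))
```

Isotonic calibration is non-parametric and generally needs a fair amount of data; `method="sigmoid"` is the alternative that scikit-learn offers for smaller datasets.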

In [25]:
calibrated_clf = CalibratedClassifierCV(RandomForestClassifier(), cv = 5, method = "isotonic")

clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', calibrated_clf)
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)

Brier score for class MAJOR_DELAY: 0.05431621532454616
Brier score for class MINOR_DELAY: 0.0899500578447694
Brier score for class NO_DELAY: 0.13739717362360984
Brier score for class SEVERE_DELAY: 0.0247119492695515
Average Brier score: 0.07659384901561923

Log loss (smaller is better):
0.6183979922824974


To take this a step further, a GridSearchCV is used to tune the hyperparameters of the RandomForestClassifier inside the calibrated pipeline, selecting the combination with the lowest cross-validated log loss.

In [28]:
param_grid = {
    'classifier__estimator__n_estimators': [100, 200, 300],
    'classifier__estimator__max_depth': [None, 10, 20, 30],
    'classifier__estimator__min_samples_split': [2, 5, 10],
    'classifier__estimator__min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(clf, param_grid, cv = 5, scoring = 'neg_log_loss', n_jobs = -1, verbose = 3)
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
print("Best log loss: ", -grid_search.best_score_)

best_rf_classifier = grid_search.best_estimator_



Best parameters found:  {'classifier__estimator__max_depth': 30, 'classifier__estimator__min_samples_leaf': 4, 'classifier__estimator__min_samples_split': 10, 'classifier__estimator__n_estimators': 300}
Best log loss:  0.6159168354320423


In [29]:
# Rebuild the calibrated random forest with the best parameters found by the grid search.
calibrated_clf = CalibratedClassifierCV(
    RandomForestClassifier(max_depth=30, min_samples_leaf=4, min_samples_split=10, n_estimators=300),
    cv=5,
    method="isotonic",
)

clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', calibrated_clf)
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)


Brier score for class MAJOR_DELAY: 0.054011500165916035
Brier score for class MINOR_DELAY: 0.0893660609969095
Brier score for class NO_DELAY: 0.13592823302957732
Brier score for class SEVERE_DELAY: 0.02434474730637218
Average Brier score: 0.07591263537469377

Log loss (smaller is better):
0.6102507036542857
