In [1]:
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import brier_score_loss, log_loss
from transformers import DateTimeTransformer, AirportLatLongTransformer

# Predictive Analysis

This predictive analysis aims to answer questions like, "What is the probability that this flight will be on time?" or "What is the chance that this flight will experience a major delay?"

Various machine learning strategies will be used to identify patterns in basic flight schedule data to determine the probability of delay. This predictive model can then be used by airline schedule planners, travel agencies, and airline customers for planning trips.

## Preprocessing

To begin developing a predictive model, the data must first be imported from the `2019_prepared.csv` file, which is generated by running the scripts in the [Data Preparation](./data_preparation.ipynb) notebook.

The data is then split into training (80%) and test (20%) sets so the model can be assessed on flights it did not see during training. The numeric features are standardized with a `StandardScaler`, while the airline carrier (e.g., DL for Delta, UA for United) is treated as a categorical feature and encoded with a `OneHotEncoder`.

In [2]:

df = pd.read_csv('data/2019_prepared.csv')

y = df['DELAY_CATEGORY']
X = df.drop(columns = 'DELAY_CATEGORY')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numerical_features = ['FL_DAY', 'DEP_MINUTES', 'DAY_OF_WEEK', 'ORIGIN_LAT', 'ORIGIN_LON', 'DEST_LAT', 'DEST_LON']
categorical_features = ['OP_UNIQUE_CARRIER']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numerical_features),
    ]
)

## Basic Model Training

Three models are evaluated initially, each of which can output class probabilities: Logistic Regression, Random Forest, and Decision Tree. In this case, we want to predict the probability of no delay, minor delay, major delay, and severe delay, given basic flight information.

Logistic Regression is a natural starting point because it is trained by minimizing log loss, one of the two metrics used below, so its predicted probabilities tend to be reasonably well calibrated.

In [3]:
clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

clf.fit(X_train, y_train)

To gauge the quality of the predicted probabilities, two metrics are used: the Brier score and log loss. Lower is better for both. The Brier score is computed one-vs-rest for each delay class and then averaged. These calculations are encapsulated in a function for reuse with the other models.

In [4]:
def gauge_performance(clf, X_test, y_test):
    """Print one-vs-rest Brier scores per class, their mean, and the multiclass log loss."""
    y_pred_proba = clf.predict_proba(X_test)

    # Column i of y_pred_proba holds the predicted probability of classes[i]
    classes = clf.named_steps['classifier'].classes_
    brier_scores = []

    for i, class_label in enumerate(classes):
        # Treat each class as a binary outcome: this class vs. all others
        brier_score = brier_score_loss(y_test == class_label, y_pred_proba[:, i])
        brier_scores.append(brier_score)
        print(f'Brier score for class {class_label}: {brier_score}')

    average_brier_score = np.mean(brier_scores)
    print(f'Average Brier score: {average_brier_score}')

    print("\nLog loss (smaller is better):")
    print(log_loss(y_test, y_pred_proba))

gauge_performance(clf, X_test, y_test)

Brier score for class MAJOR_DELAY: 0.055630738753677536
Brier score for class MINOR_DELAY: 0.09045398913166455
Brier score for class NO_DELAY: 0.14699749108416515
Brier score for class SEVERE_DELAY: 0.026120114474321773
Average Brier score: 0.07980058336095726

Log loss (smaller is better):
0.6465235547585348


Logistic Regression achieves an average Brier score of 0.080 and a log loss of 0.647.
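
As a reference for reading these numbers, the per-class Brier score is the mean squared difference between the predicted probability and the 0/1 outcome. The short sketch below reproduces it by hand on made-up values (the arrays are hypothetical, not taken from the flight data) and checks the result against `brier_score_loss`.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical outcomes (1 = flight was NO_DELAY) and predicted probabilities
is_no_delay = np.array([1, 0, 1, 1, 0])
p_no_delay = np.array([0.9, 0.2, 0.6, 0.8, 0.4])

# Brier score: mean of (predicted probability - actual outcome) squared
manual_brier = np.mean((p_no_delay - is_no_delay) ** 2)
sklearn_brier = brier_score_loss(is_no_delay, p_no_delay)

print(manual_brier, sklearn_brier)  # both 0.082, up to floating-point rounding
```

A forecast that always says 0.5 scores 0.25, and a perfect forecast scores 0, which gives a rough scale for the per-class values above.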

The next model to assess is the RandomForestClassifier. 

In [5]:
clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)

Brier score for class MAJOR_DELAY: 0.05548908691655602
Brier score for class MINOR_DELAY: 0.09274071596364007
Brier score for class NO_DELAY: 0.1382112654478603
Brier score for class SEVERE_DELAY: 0.024979726279236034
Average Brier score: 0.07785519865182311

Log loss (smaller is better):
0.7938824585332832


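The average Brier score is slightly better than Logistic Regression's (0.078 vs. 0.080), but the log loss is noticeably worse (0.794 vs. 0.647), which suggests the forest sometimes assigns a very low probability to the class that actually occurs.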

The third and final model to check is the DecisionTreeClassifier.

In [6]:
clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)


Brier score for class MAJOR_DELAY: 0.11229700745582678
Brier score for class MINOR_DELAY: 0.1834337657031968
Brier score for class NO_DELAY: 0.27137166785823713
Brier score for class SEVERE_DELAY: 0.04611377795935043
Average Brier score: 0.1533040547441528

Log loss (smaller is better):
11.051276424688968


These results are much worse than either Logistic Regression or Random Forest. The Brier score for every class is roughly double, and the log loss jumps from below 0.8 to 11.05.
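
A likely reason for the very large log loss, assuming the default `DecisionTreeClassifier` settings (no depth limit), is that a fully grown tree tends to output probabilities of exactly 0 or 1, and log loss penalizes a confident wrong answer very heavily. The sketch below illustrates this with a hand-written log loss on made-up probabilities; the `mean_log_loss` helper and the clipping value `1e-15` are assumptions for illustration, and scikit-learn's own clipping threshold may differ by version.

```python
import numpy as np

def mean_log_loss(true_class_probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    # Clip so that a probability of exactly 0 gives a large but finite penalty
    clipped = np.clip(true_class_probs, eps, 1.0)
    return float(-np.mean(np.log(clipped)))

# Hypothetical probabilities assigned to the true class for ten flights.
# The "soft" model hedges; the "hard" model only ever says 0 or 1.
soft_probs = np.array([0.7, 0.6, 0.8, 0.3, 0.7, 0.6, 0.2, 0.8, 0.7, 0.6])
hard_probs = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])

soft_loss = mean_log_loss(soft_probs)
hard_loss = mean_log_loss(hard_probs)

print(f'soft: {soft_loss:.3f}, hard: {hard_loss:.3f}')
```

In this toy case both models favor the correct class for eight of the ten flights, yet the two probability-0 predictions alone push the hard model's mean log loss to about 6.9.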

## Calibrated Classifier

To further improve on the LogisticRegression and RandomForestClassifier results, each model is wrapped in a `CalibratedClassifierCV`, which adjusts the predicted probabilities so that they better match the observed class frequencies.

With this method, the RandomForestClassifier shows the most improvement and the best results.

In [7]:
calibrated_clf_lr = CalibratedClassifierCV(LogisticRegression(), cv = 5, method = "isotonic")
calibrated_clf_rf = CalibratedClassifierCV(RandomForestClassifier(), cv = 5, method = "isotonic")

clf_lr = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', calibrated_clf_lr)
])

clf_rf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', calibrated_clf_rf)
])

clf_lr.fit(X_train, y_train)
clf_rf.fit(X_train, y_train)

print("LogisticRegression:")
gauge_performance(clf_lr, X_test, y_test)

print("Random Forest:")
gauge_performance(clf_rf, X_test, y_test)

LogisticRegression:
Brier score for class MAJOR_DELAY: 0.05564248717636367
Brier score for class MINOR_DELAY: 0.09113594422672872
Brier score for class NO_DELAY: 0.14761786831008653
Brier score for class SEVERE_DELAY: 0.02608799818413125
Average Brier score: 0.08012107447432755

Log loss (smaller is better):
0.6504081489286346
Random Forest:
Brier score for class MAJOR_DELAY: 0.054360947050645106
Brier score for class MINOR_DELAY: 0.08992594401845104
Brier score for class NO_DELAY: 0.1375354094632746
Brier score for class SEVERE_DELAY: 0.024708584301051405
Average Brier score: 0.07663272120835554

Log loss (smaller is better):
0.6190415264368613


With a CalibratedClassifierCV, the LogisticRegression results got slightly worse (log loss 0.650 vs. 0.647), while the RandomForestClassifier results improved substantially (log loss 0.619 vs. 0.794) and now beat the best LogisticRegression result. Therefore, moving forward, the RandomForestClassifier will be used.
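
The calibration step can be tried in isolation with the minimal sketch below. It uses synthetic data from `make_classification` as a stand-in for the flight data, so the `*_demo` and `Xd_*` names and all resulting numbers are illustrative only and say nothing about the results above. As I understand scikit-learn's default behaviour, with `cv = 5` the wrapper fits the forest on four folds, learns an isotonic mapping from raw to calibrated probabilities on the held-out fold, and averages the five calibrated models at prediction time.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data (not the flight data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same model family, with and without the calibration wrapper
raw_rf = RandomForestClassifier(n_estimators=50, random_state=0)
raw_rf.fit(Xd_train, yd_train)

calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0), cv=5, method='isotonic'
)
calibrated_rf.fit(Xd_train, yd_train)

calibrated_proba = calibrated_rf.predict_proba(Xd_test)
row_sums = calibrated_proba.sum(axis=1)  # each row is still a probability distribution

raw_loss = log_loss(yd_test, raw_rf.predict_proba(Xd_test))
calibrated_loss = log_loss(yd_test, calibrated_proba)
print(f'raw: {raw_loss:.3f}, calibrated: {calibrated_loss:.3f}')
```

Whether calibration helps depends on the model and the data, which is consistent with LogisticRegression and RandomForestClassifier responding differently here.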

## Hyperparameter Tuning

To take this result a step further, a `GridSearchCV` is used to search for the RandomForestClassifier hyperparameters that give the lowest log loss. The search repeatedly trains and evaluates the model with different parameter combinations and can take a very long time, so the call is commented out here; the output shown below the cell is from an earlier run.

In [28]:
def rf_grid_search():
    param_grid = {
        'classifier__estimator__n_estimators': [100, 200, 300],
        'classifier__estimator__max_depth': [None, 10, 20, 30],
        'classifier__estimator__min_samples_split': [2, 5, 10],
        'classifier__estimator__min_samples_leaf': [1, 2, 4]
    }

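    # Search over the calibrated random forest pipeline (clf_rf); the parameter names above
    # are routed through the pipeline's 'classifier' step to its wrapped estimator
    grid_search = GridSearchCV(clf_rf, param_grid, cv = 5, scoring = 'neg_log_loss', n_jobs = 4, verbose = 3)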
    grid_search.fit(X_train, y_train)

    print("Best parameters found: ", grid_search.best_params_)
    print("Best log loss: ", -grid_search.best_score_)

    return grid_search.best_estimator_

# Disabled due to extensive time requirement - can take up to an hour to run
# best_rf_classifier = rf_grid_search()



Best parameters found:  {'classifier__estimator__max_depth': 30, 'classifier__estimator__min_samples_leaf': 4, 'classifier__estimator__min_samples_split': 10, 'classifier__estimator__n_estimators': 300}
Best log loss:  0.6159168354320423
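
To see why the search is so slow, the number of model fits can be counted directly from the grid in `rf_grid_search`. The sketch below uses `ParameterGrid` for the count; the figure for forest fits assumes that each `CalibratedClassifierCV` fit trains one forest per calibration fold, and it ignores the final refit on the full training set.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in rf_grid_search above
param_grid = {
    'classifier__estimator__n_estimators': [100, 200, 300],
    'classifier__estimator__max_depth': [None, 10, 20, 30],
    'classifier__estimator__min_samples_split': [2, 5, 10],
    'classifier__estimator__min_samples_leaf': [1, 2, 4]
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 combinations
search_folds = 5        # cv = 5 in GridSearchCV
calibration_folds = 5   # cv = 5 in CalibratedClassifierCV (assumed one forest per fold)

pipeline_fits = n_candidates * search_folds
forest_fits = pipeline_fits * calibration_folds

print(n_candidates, pipeline_fits, forest_fits)  # 108 540 2700
```

Narrowing the grid or switching to a randomized search would reduce this count, at the cost of exploring fewer combinations.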


The best configuration is a RandomForestClassifier with a `max_depth` of 30, `min_samples_leaf` of 4, `min_samples_split` of 10, and `n_estimators` of 300. Retraining with these settings gives a very slight improvement over the defaults, with a log loss of 0.611 and an average Brier score of 0.076 on the test set.

In [10]:
calibrated_clf = CalibratedClassifierCV(RandomForestClassifier(max_depth = 30, min_samples_leaf = 4, min_samples_split = 10, n_estimators = 300), cv = 5, method = "isotonic")

clf = Pipeline([
    ('datetime_transformer', DateTimeTransformer()),
    ('airportlatlongtransformer', AirportLatLongTransformer()),
    ('preprocessor', preprocessor),
    ('classifier', calibrated_clf)
])

clf.fit(X_train, y_train)

gauge_performance(clf, X_test, y_test)

Brier score for class MAJOR_DELAY: 0.05398856660297161
Brier score for class MINOR_DELAY: 0.08940184812655995
Brier score for class NO_DELAY: 0.13611589977837152
Brier score for class SEVERE_DELAY: 0.02433004713386892
Average Brier score: 0.075959090410443

Log loss (smaller is better):
0.6110212131141943


This is the best performance found for predicting flight delay probabilities, so this model will be used as the primary predictive model for the user-facing REPL interface in `main.py`.