# DSC 630 Final: Aviation Delays
## Arbaz Khan

### 1: Data Prep

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

# Suppress all warnings (this already covers FutureWarning)
warnings.simplefilter(action='ignore')

# This dataset includes data about all flights from the past 10 years that
# involved a delay of some sort. It includes the total delay as well as
# the location and the factors that influenced each delay.
df = pd.read_csv("delay.csv", encoding_errors='ignore', low_memory=False)

df['carrier_name'].value_counts()

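# Drop the airline with <1000 flights and flights during the height
# of COVID.
# Note: if `year` was parsed as an integer column, the string comparisons
# below match nothing, which is consistent with the 2023 rows in the preview.
df = df[df.carrier_name != "AirTran"]
df = df[df.year != "2022"]
df = df[df.year != "2023"]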

# Drop redundant columns
df.drop(columns=['arr_flights', 'arr_cancelled', 'arr_diverted', 'arr_del15',
                 'carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct'],
        inplace=True)
        
df.head(5)

|   | year | month | carrier | carrier_name | airport | airport_name | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 8 | 9E | Endeavor Air Inc. | ABE | Allentown/Bethlehem/Easton, PA: Lehigh Valley ... | 1375.0 | 71.0 | 761.0 | 118.0 | 0.0 | 425.0 |
| 1 | 2023 | 8 | 9E | Endeavor Air Inc. | ABY | Albany, GA: Southwest Georgia Regional | 799.0 | 218.0 | 1.0 | 62.0 | 0.0 | 518.0 |
| 2 | 2023 | 8 | 9E | Endeavor Air Inc. | AEX | Alexandria, LA: Alexandria International | 766.0 | 56.0 | 188.0 | 78.0 | 0.0 | 444.0 |
| 3 | 2023 | 8 | 9E | Endeavor Air Inc. | AGS | Augusta, GA: Augusta Regional at Bush Field | 1397.0 | 471.0 | 320.0 | 388.0 | 0.0 | 218.0 |
| 4 | 2023 | 8 | 9E | Endeavor Air Inc. | ALB | Albany, NY: Albany International | 1530.0 | 628.0 | 0.0 | 134.0 | 0.0 | 768.0 |
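
The preview still lists rows with `year` equal to 2023, which suggests the year filter removed nothing. A minimal sketch on a toy frame, assuming `year` was parsed as an integer column (the toy values and the names `toy`, `kept_str`, and `kept_int` are illustrative, not taken from delay.csv):

```python
import pandas as pd

# Toy frame; values are made up for illustration
toy = pd.DataFrame({"year": [2021, 2022, 2023], "arr_delay": [10.0, 20.0, 30.0]})

# An integer column never equals a string, so this filter keeps every row
kept_str = toy[toy.year != "2022"]

# Comparing against integers removes the intended rows
kept_int = toy[~toy.year.isin([2022, 2023])]

print(len(kept_str), len(kept_int))
```

Checking `df.year.dtype` before filtering would show which form of comparison applies to the real file.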


In [3]:
# Convert non-numeric values to numeric ones. to_numeric with errors="coerce" turns
# values that cannot be parsed into NaN, and dropna(subset=...) then removes those
# rows from the DataFrame. (Calling dropna(inplace=True) on df['arr_delay'] alone
# would leave df unchanged.)
df["arr_delay"] = pd.to_numeric(df["arr_delay"], errors="coerce")
df = df.dropna(subset=["arr_delay"])

df['arr_delay']

arrDelay = df['arr_delay']
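
As a small illustration of how coercion and NaN removal interact, here is a sketch on a toy frame (the frame `toy` and its values are hypothetical). It contrasts dropping NaN from a selected column with dropping the affected rows from the frame:

```python
import pandas as pd

# Toy frame with one value that cannot be parsed as a number
toy = pd.DataFrame({"arr_delay": ["12", "n/a", "7.5"], "month": [1, 2, 3]})
toy["arr_delay"] = pd.to_numeric(toy["arr_delay"], errors="coerce")

# dropna on the selected column returns a new Series; the frame keeps its NaN row
series_only = toy["arr_delay"].dropna()

# dropna with subset removes the whole row from the frame
cleaned = toy.dropna(subset=["arr_delay"])

print(len(toy), len(series_only), len(cleaned))
```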

In addition to cleaning our target, the column 'arr_delay' (the total delay time for one row; as the preview above shows, each row is identified by year, month, carrier, and airport), we must convert the remaining non-numerical columns to numerical values so that the whole dataset can be passed to a random forest classifier. We use pd.get_dummies to create dummy variables for these categorical columns.

In [4]:
col = df.columns[df.dtypes=='object']

# Use get_dummies to get numerical variables from categorical columns
df = pd.get_dummies(df, columns = col) 
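
To show what the encoding produces, here is a minimal sketch on a toy frame (the frame `toy` is hypothetical; the carrier codes and delay values are borrowed from the preview only as sample values):

```python
import pandas as pd

# Toy frame with one categorical column and one numeric column
toy = pd.DataFrame({"carrier": ["9E", "DL", "9E"],
                    "arr_delay": [1375.0, 799.0, 766.0]})

# Each distinct category becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["carrier"])

print(encoded.columns.tolist())
```

Columns with many distinct values, such as airport names, each add one indicator column per value, so the encoded frame can become very wide.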

### 2: Building and evaluating model

In [5]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing

y = df['arr_delay']
x = df.drop(['arr_delay'], axis=1)

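# LabelEncoder maps every distinct arr_delay value to its own integer class,
# because a classifier needs discrete labels rather than continuous values
lab_enc = preprocessing.LabelEncoder()
y = lab_enc.fit_transform(y)

# test_size=0.98 holds out 98% of the rows, so the model trains on only 2%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.98)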
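
The effect of label-encoding a continuous target can be seen on a handful of values. This is a sketch with made-up delays (the array `delays` is hypothetical), not the notebook's data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Five toy delay values, one of them repeated
delays = np.array([0.0, 799.0, 1375.0, 0.0, 766.0])

enc = LabelEncoder()
labels = enc.fit_transform(delays)

# Every distinct delay value becomes a separate class
print(labels)
print(len(enc.classes_))
```

With thousands of distinct delay values, the classifier has to pick the exact class for each row, which makes a high accuracy score hard to reach.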

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Use a random forest classifier
rf = RandomForestClassifier(n_estimators=20)

rf = rf.fit(x_train, y_train)

y_pred = rf.predict(x_test)

# Accuracy of our model
metrics.accuracy_score(y_test, y_pred)

0.04070545017921573

From our analysis of the model, we can see that we have achieved an accuracy of ~4% (0.0407). For this model, the specifications of my computer did not allow me to increase the number of estimators much further (the run above uses n_estimators = 20), which was a hurdle I did not anticipate when formulating my plan. As a result, this model has extremely low accuracy and is not particularly reliable for drawing conclusions.
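
One possible alternative, which is not what this notebook does, is to treat the delay as a continuous target and fit a regressor instead of a classifier. The sketch below uses synthetic data (all names prefixed with `toy_` and the generated values are assumptions for illustration) and swaps in `RandomForestRegressor`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: three numeric features and a target equal to their sum
rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 100, size=(500, 3))
toy_y = toy_x.sum(axis=1)

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0)

# Same number of trees as the classifier above, but predicting a number
reg = RandomForestRegressor(n_estimators=20, random_state=0)
reg.fit(toy_x_train, toy_y_train)

toy_r2 = r2_score(toy_y_test, reg.predict(toy_x_test))
print(round(toy_r2, 3))
```

A regressor is scored by how close its predictions are rather than by exact matches, so metrics such as R² or RMSE apply directly.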

### Conclusion

Our random forest model has given us some level of insight into the factors that cause delays. From this model, we can see that arr_delay (the total delay time) is affected most notably by carrier_delay, nas_delay, and late_aircraft_delay, as the feature importances below show.

In [7]:
results = pd.DataFrame({'col_name': rf.feature_importances_}, index=x.columns).sort_values(by='col_name', ascending=False)
results.head(20)

|   | col_name |
|---|---|
| carrier_delay | 0.083809 |
| nas_delay | 0.075658 |
| late_aircraft_delay | 0.072792 |
| month | 0.063806 |
| year | 0.061689 |
| weather_delay | 0.051206 |
| security_delay | 0.017367 |
| carrier_name_SkyWest Airlines Inc. | 0.009766 |
| carrier_name_Delta Air Lines Inc. | 0.009219 |
| carrier_DL | 0.008766 |


From our results, the most important feature was carrier delay, or delay related to carrier scheduling. The column arr_del15, which covers flights that were delayed by over 15 minutes, was dropped during data prep and therefore does not appear in this ranking. The next most important features were the delay due to the National Airspace System (NAS), which is a governmental system in the United States that governs airspace usage and timing, and the late-aircraft delay. This suggests that delays in this data are most strongly associated with the carrier and with NAS limitations, which is likely due to schedule conflicts with other flights, weather, and, much less commonly, issues of national emergency or security. Because every importance score is below 0.1 and the model's accuracy is very low, this ranking should be read with caution. As a result, it may improve this model to focus solely on causes such as weather, security, and maintenance specifically, which could improve both accuracy and our understanding of what causes delays.
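
One reason the delay-cause columns rank highest may be that they add up to the target. The sketch below checks this on the five preview rows from section 1 only; whether the relation holds for the full file is not verified here (the names `preview`, `parts`, and `matches` are illustrative):

```python
import pandas as pd

# The five rows shown by df.head(5) in section 1
preview = pd.DataFrame({
    "arr_delay": [1375.0, 799.0, 766.0, 1397.0, 1530.0],
    "carrier_delay": [71.0, 218.0, 56.0, 471.0, 628.0],
    "weather_delay": [761.0, 1.0, 188.0, 320.0, 0.0],
    "nas_delay": [118.0, 62.0, 78.0, 388.0, 134.0],
    "security_delay": [0.0, 0.0, 0.0, 0.0, 0.0],
    "late_aircraft_delay": [425.0, 518.0, 444.0, 218.0, 768.0],
})

parts = ["carrier_delay", "weather_delay", "nas_delay",
         "security_delay", "late_aircraft_delay"]

# Compare the sum of the delay causes with the total delay, row by row
component_sum = preview[parts].sum(axis=1)
matches = (component_sum == preview["arr_delay"]).all()

print(matches)
```

If the same relation holds across the dataset, the cause columns describe the target by construction, so their importance says little about what drives delays.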

In [22]:
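from sklearn import metrics

# y_test holds LabelEncoder class indices, so this is the mean class index,
# not the mean of the original arr_delay values
y_test.mean()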

2764.9041983439633

In [23]:
# RMSE computed on the encoded class indices in y_test and y_pred
rmError = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
rmError

2311.8532888505424
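
Because the labels are encoder indices, an error measured on them is not in the units of the original delay values. A sketch of mapping indices back with `inverse_transform` before computing RMSE, using made-up values (all variable names here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit an encoder on four toy delay values; indices are 0..3 in sorted order
delays = np.array([0.0, 10.0, 500.0, 2000.0])
enc = LabelEncoder()
enc.fit(delays)

# Toy true and predicted class indices
true_codes = np.array([3, 2])
pred_codes = np.array([2, 2])

# RMSE on the indices themselves
rmse_codes = np.sqrt(np.mean((true_codes - pred_codes) ** 2))

# RMSE after mapping the indices back to the original delay values
true_vals = enc.inverse_transform(true_codes)
pred_vals = enc.inverse_transform(pred_codes)
rmse_values = np.sqrt(np.mean((true_vals - pred_vals) ** 2))

print(rmse_codes, rmse_values)
```

In the notebook, the same mapping would be `lab_enc.inverse_transform(y_test)` and `lab_enc.inverse_transform(y_pred)`.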

In [28]:
from sklearn.metrics import classification_report

def limited_print(data, max_lines):
    """Print only the first max_lines lines of the string form of data."""
    for line in str(data).splitlines()[:max_lines]:
        print(line)

# Show only the top of the report, since there is one row per class
limited_print(classification_report(y_test, y_pred), 10)


              precision    recall  f1-score   support

           0       0.61      1.00      0.76      6652
           1       0.04      0.03      0.04       283
           2       0.03      0.02      0.02       250
           3       0.02      0.03      0.02       258
           4       0.02      0.01      0.02       227
           5       0.05      0.04      0.05       197
           6       0.02      0.02      0.02       189
           7       0.02      0.01      0.01       157
