## Data Preprocessing

To identify which predictive model fits the data best, several models are assessed using an 80%/20% split of the data for training and testing. Before training, the features are preprocessed with tools from the `sklearn` library: numeric columns are standardized with a `StandardScaler`, and non-numeric columns are transformed into a numeric format the model can use with a `OneHotEncoder`.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.read_csv('data/sampled_data.csv')

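# Separate the target (delay category) from the features
y = df['DELAY_CATEGORY']
X = df.drop(['DELAY_CATEGORY'], axis=1)

# Hold out 20% of the rows for testing; a fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)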

numerical_features = ['FL_MONTH', 'DEP_HOUR']
categorical_features = ['ORIGIN', 'DEST', 'OP_UNIQUE_CARRIER']

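# Standardize numeric columns to zero mean and unit variance
numeric_transformer = StandardScaler()
# One-hot encode categorical columns; ignore categories unseen during fit
# so that transforming the test set does not raise an error
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Apply each transformer to its own columns; columns not listed are dropped by default
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])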
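
The preprocessor defined above is only configured at this point; it still has to be fitted. A minimal sketch of that step follows, assuming the transformer is fitted on the training rows only and then reused on the test rows so that no test-set statistics leak into the scaling or encoding. The toy rows, the airport and carrier codes, and the names `toy_train`, `toy_test`, and `demo_preprocessor` are hypothetical stand-ins for `data/sampled_data.csv`, not values from the dataset, and `handle_unknown='ignore'` is an assumption made here so that an unseen category does not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows using the same column names as the notebook
toy_train = pd.DataFrame({
    'FL_MONTH': [1, 2, 3, 4],
    'DEP_HOUR': [6, 12, 18, 23],
    'ORIGIN': ['JFK', 'LAX', 'JFK', 'ORD'],
    'DEST': ['LAX', 'JFK', 'ORD', 'JFK'],
    'OP_UNIQUE_CARRIER': ['AA', 'DL', 'AA', 'UA'],
})
toy_test = pd.DataFrame({
    'FL_MONTH': [5],
    'DEP_HOUR': [9],
    'ORIGIN': ['SEA'],  # category that never appears in toy_train
    'DEST': ['JFK'],
    'OP_UNIQUE_CARRIER': ['DL'],
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['FL_MONTH', 'DEP_HOUR']),
        # Assumption: unseen categories are encoded as all zeros instead of raising
        ('cat', OneHotEncoder(handle_unknown='ignore'),
         ['ORIGIN', 'DEST', 'OP_UNIQUE_CARRIER']),
    ])

# Learn scaling statistics and category vocabularies from the training rows only
train_encoded = demo_preprocessor.fit_transform(toy_train)
# Reuse the fitted statistics and vocabularies on the held-out rows
test_encoded = demo_preprocessor.transform(toy_test)

print(train_encoded.shape, test_encoded.shape)
```

Both outputs have the same number of columns: two scaled numeric columns plus one indicator column per category seen during fitting. The unseen `ORIGIN` value in the test row contributes zeros to every `ORIGIN` indicator column. In the notebook itself, the equivalent calls would presumably be `preprocessor.fit_transform(X_train)` followed by `preprocessor.transform(X_test)`, or the preprocessor could be placed as the first step of a `Pipeline` together with each candidate model.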