# **Flight Delay Prediction using Scikit-Learn Pipeline**

## **Overview**
This project demonstrates how to build a **machine learning pipeline** using scikit-learn to predict flight delays. The pipeline integrates data preprocessing with model training, ensuring efficient handling of both numerical and categorical data.

---

## **Objectives**
- Preprocess numerical and categorical data using `ColumnTransformer`.
- Automate the machine learning workflow using `Pipeline`.
- Train a **Random Forest Classifier** to predict flight delays.
- Optimize the model using **GridSearchCV** for hyperparameter tuning.

---

## **Data Overview**
- **Dataset**: Contains flight details such as:
  - **Year**, **Month**, **Day**
  - **Airline code**, **Origin airport code**, **Destination airport code**
  - **Departure delay** (target: delayed or not)

- **Target Variable**:  
  - `1` if the flight was delayed  
  - `0` if the flight was on time

---

## **Steps Involved**

### 1. **Data Loading and Exploration**
- Load the flight dataset and inspect its structure and missing values.

### 2. **Feature Engineering**
- **Numerical Features**:
  - `YEAR`, `MONTH`, `DAY`
- **Categorical Features**:
  - `AIRLINE__CODE`, `ORIGIN_AIRPORT_CODE`, `DESTINATION_AIRPORT_CODE`

### 3. **Preprocessing with `ColumnTransformer`**
- **Numerical Data**:
  - Impute missing values with the **mean**.
  - Standardize values using **`StandardScaler`**.
  
- **Categorical Data**:
  - Impute missing values with `'missing'`.
  - Encode using **`OneHotEncoder`**.
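
The preprocessing described above could be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the column names come from the feature lists in this README, while the `toy` DataFrame and its values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_features = ["YEAR", "MONTH", "DAY"]
categorical_features = ["AIRLINE__CODE", "ORIGIN_AIRPORT_CODE", "DESTINATION_AIRPORT_CODE"]

# Hypothetical rows, with a few gaps so the imputers have work to do
toy = pd.DataFrame({
    "YEAR": [2015, 2015, 2015, 2015],
    "MONTH": [1, 2, np.nan, 4],
    "DAY": [10, 11, 12, 13],
    "AIRLINE__CODE": ["WN", "DL", np.nan, "WN"],
    "ORIGIN_AIRPORT_CODE": ["ATL", "DAL", "ATL", "DAL"],
    "DESTINATION_AIRPORT_CODE": ["LAX", "LAX", "SFO", np.nan],
})

# Numerical branch: mean imputation, then standardization
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical branch: fill gaps with the literal 'missing', then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_features),
    ("cat", categorical_transformer, categorical_features),
])

X = preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density
X = X.toarray() if hasattr(X, "toarray") else X
print(X.shape)
```

Because `'missing'` is treated as its own category, rows with an unknown airline or airport are kept rather than dropped.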

### 4. **Pipeline Setup**
- Use a **scikit-learn Pipeline** to link preprocessing and model training.
- Integrate a **Random Forest Classifier** within the pipeline.

### 5. **Model Training and Evaluation**
- Split the data into **train (70%)** and **test (30%)** sets.
- Evaluate the model using a **classification report** with metrics like:
  - **Precision**, **Recall**, **F1-score**
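
As a rough sketch of the split-and-evaluate step, assuming a small synthetic dataset in place of the real flight table (the labels below are generated from `MONTH` purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the flight data (not real flights)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "MONTH": rng.integers(1, 13, size=n),
    "DAY": rng.integers(1, 29, size=n),
    "AIRLINE__CODE": rng.choice(["WN", "DL", "AA"], size=n),
})
y = (X["MONTH"] > 6).astype(int)  # illustrative label only

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["MONTH", "DAY"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["AIRLINE__CODE"]),
])
clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# 70% train / 30% test, stratified so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf.fit(X_train, y_train)

report = classification_report(y_test, clf.predict(X_test))
print(report)
```

Fitting the whole pipeline on the training split means the scaler and encoder never see the test rows.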

### 6. **Hyperparameter Tuning with GridSearchCV**
- Tune hyperparameters of the Random Forest model:
  - Number of estimators (`n_estimators`)
  - Maximum tree depth (`max_depth`)
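
A minimal sketch of this tuning step is shown below. The grid values, the `classifier` step name, and the synthetic data are assumptions chosen for the example, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "classifier__n_estimators": [25, 50],
    "classifier__max_depth": [3, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the best parameter combination.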

### 7. **Model Persistence**
- Save the trained model using **`joblib`** for later use.
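
Persistence with `joblib` might look like the sketch below. The model here is a small stand-in trained on synthetic data, and the file is written to a temporary directory so the example leaves nothing behind. Only the file name is taken from this README.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "flight_delay_classifier.pkl")
    joblib.dump(model, path)      # serialize the fitted model to disk
    restored = joblib.load(path)  # load it back for later predictions

# The restored model should reproduce the original predictions
same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same_predictions)
```

Saving the full fitted pipeline, rather than only the classifier, keeps the preprocessing and the model together at load time.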

---

## **Technologies Used**
- **Python**: Programming language
- **Pandas**: Data manipulation and cleaning
- **Scikit-Learn**: Machine learning, preprocessing, and model evaluation
- **Joblib**: Model persistence
- **Jupyter Notebook**: Interactive development environment

---

## **Expected Output**
- A **trained Random Forest model** to predict flight delays.
- **Performance metrics** (accuracy, precision, recall) from the classification report.
- A **saved model** (`flight_delay_classifier.pkl`) for deployment.

---

## **Conclusion**
This project demonstrates how to create an automated **machine learning workflow** using scikit-learn's `Pipeline` and `ColumnTransformer`. Bundling preprocessing with the model ensures that the same transformations are applied during both training and testing. **Hyperparameter tuning** then selects the best-performing configuration from the searched grid, giving a reasonable starting point for flight delay prediction.


# Upgrade pip and install all required packages

In [None]:
!pip install --upgrade pip

# Install Snowflake connectors, pandas integration, and essential libraries
!pip install "snowflake-connector-python[pandas]" \
             "snowflake-snowpark-python==1.12.0,<2,>=1.11.1" \
             python-dateutil tqdm holidays faker
!pip install numpy pandas matplotlib scikit-learn xgboost seaborn

# Ensure Snowpark Python is up-to-date (quoted so the shell does not treat < and > as redirections)
!pip install --upgrade -q "snowflake-snowpark-python==1.12.0,<2,>=1.11.1"


# Fix potential urllib3 version conflicts
!pip uninstall urllib3 -y
!pip install urllib3==1.26.15

# Additional installations for your project
!pip install fosforml==1.1.6
!pip install scipy
!pip install cloudpickle==2.2.1
!pip install basemap


# Importing necessary libraries and settings

In [1]:

# Standard libraries for dates, configuration, and warnings
import configparser
import datetime
import warnings

# Scientific and data manipulation libraries
import scipy
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

# Scikit-learn modules for preprocessing, modeling, and evaluation
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import (classification_report, mean_squared_error,
                             r2_score, roc_auc_score)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeClassifier

# Display options and warning filters
pd.options.display.max_columns = 50
warnings.filterwarnings("ignore")

# FosforML package for the Snowflake session and model registration
from fosforml.model_manager.snowflakesession import get_session
from fosforml import register_model


In [2]:
# Set Matplotlib's default font family to 'DejaVu Serif' for a consistent font style across plots
plt.rcParams['font.family'] = 'DejaVu Serif'

# Establishing a Snowflake session


In [None]:
my_session = get_session()

# Defining the table name to fetch data from
# table_name = 'FLIGHTS'  # Initial option for table
table_name = 'FLIGHTS_FULL'  # Final table to be used

# Querying the data from the specified Snowflake table
sf_df = my_session.sql("SELECT * FROM {}".format(table_name))

# Converting the Snowflake DataFrame to a pandas DataFrame for local processing
df = sf_df.to_pandas()

df

# Filtering data for specific airlines

In [None]:
# Defining the list of airlines to include in the filtered DataFrame
#options = ['Southwest Airlines Co.', 'Delta Air Lines Inc.']
options = ['Southwest Airlines Co.']
#df.replace({'AIRLINE':'Southwest Airlines Co.'}, {'AIRLINE': 'Southwest airlines'}, regex=True)
#df.replace({'AIRLINE':'Delta Air Lines Inc.'}, {'AIRLINE': 'Delta airlines'}, regex=True)

# Selecting rows where the 'AIRLINE' column matches one of the specified airlines
flights = df.loc[df['AIRLINE'].isin(options)]
flights

In [None]:
# Drop rows with no recorded arrival delay
flights = flights.dropna(subset=['ARRIVAL_DELAY'])

# Creating a copy of the filtered flights data

In [None]:
# This ensures that any modifications made to 'flights_needed_data' do not affect the original 'flights' DataFrame
flights_needed_data = flights.copy()

In [None]:
flights_needed_data.shape
#(2137736, 45)

In [None]:
flights_needed_data.info()

In [None]:
flights_needed_data.head()

# Function to categorize scheduled arrival times into time segments

In [None]:
def categorize_time(SCHEDULED_ARRIVAL):
    # Categorize based on scheduled arrival time in 24-hour format
    if 500 <= SCHEDULED_ARRIVAL < 800:
        return 'Early morning'
    elif 800 <= SCHEDULED_ARRIVAL < 1100:
        return 'Late morning'
    elif 1100 <= SCHEDULED_ARRIVAL < 1400:
        return 'Around noon'
    elif 1400 <= SCHEDULED_ARRIVAL < 1700:
        return 'Afternoon'
    elif 1700 <= SCHEDULED_ARRIVAL < 2000:
        return 'Evening'
    elif 2000 <= SCHEDULED_ARRIVAL < 2300:
        return 'Night'
    elif SCHEDULED_ARRIVAL >= 2300 or SCHEDULED_ARRIVAL < 200:
        return 'Late night'
    elif 200 <= SCHEDULED_ARRIVAL < 500:
        return 'Dawn'

# Apply categorize_time function to the 'SCHEDULED_ARRIVAL' column to create 'ARRIVAL_TIME_SEGMENT'
flights_needed_data['ARRIVAL_TIME_SEGMENT'] = flights_needed_data['SCHEDULED_ARRIVAL'].apply(categorize_time)


In [None]:
flights_needed_data

In [None]:
flights['AIRLINE__CODE'].unique()

In [None]:
flights_needed_data.value_counts('DIVERTED')

In [None]:
flights_needed_data['FLIGHT_NUMBER'] = flights_needed_data['FLIGHT_NUMBER'].astype(str)

In [None]:
flights_needed_data['MONTH'] = flights_needed_data['MONTH'].astype(str)
flights_needed_data['DAY'] = flights_needed_data['DAY'].astype(str)
flights_needed_data['DAY_OF_WEEK'] = flights_needed_data['DAY_OF_WEEK'].astype(str)
flights_needed_data['DIVERTED'] = flights_needed_data['DIVERTED'].astype(str)
flights_needed_data['CANCELLED'] = flights_needed_data['CANCELLED'].astype(str)

In [None]:
flights_needed_data.info()

In [None]:
flights_needed_data.columns

# identifying the categorical and numerical variables

In [None]:
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
print("Numerical columns:", numerical_cols)
print('\n\n')
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", categorical_cols)

# Quantifying missing values

In [None]:
flights_needed_data.isnull().mean().plot.bar(figsize=(12, 8))

# Highlighting the Cardinality

In [None]:
flights_needed_data[categorical_cols].nunique()

In [None]:
flights_needed_data[categorical_cols].nunique().plot.bar(figsize=(12, 8))
plt.xlabel("Categorical Variables")
plt.ylabel("Count of unique values")
plt.show()

# Correlations matrix


In [None]:
from scipy.stats import chi2_contingency

threshold = 0.05 * len(flights_needed_data)
for col in categorical_cols:
    counts = flights_needed_data[col].value_counts()
    rare_labels = counts[counts < threshold].index
    flights_needed_data[col] = flights_needed_data[col].replace(rare_labels, 'Other')

categorical_cols = [col for col in categorical_cols if flights_needed_data[col].nunique() < 50]

if 'FLY_DATE' in flights_needed_data.columns:
    flights_needed_data['FLY_MONTH'] = pd.to_datetime(flights_needed_data['FLY_DATE']).dt.month
    flights_needed_data['FLY_DAY_OF_WEEK'] = pd.to_datetime(flights_needed_data['FLY_DATE']).dt.dayofweek
    
    categorical_cols += ['FLY_MONTH', 'FLY_DAY_OF_WEEK']
    
    if 'FLY_DATE' in categorical_cols:
        categorical_cols.remove('FLY_DATE')

flights_needed_data = flights_needed_data.dropna(subset=categorical_cols + ['ARRIVAL_DELAY'])

# ARRIVAL_DELAY is continuous, so test each feature against a binary delayed flag
delayed = (flights_needed_data['ARRIVAL_DELAY'] > 0).astype(int)

for col in categorical_cols:
    contingency_table = pd.crosstab(flights_needed_data[col], delayed)
    
    if contingency_table.size > 0:
        chi2, p, _, _ = chi2_contingency(contingency_table)
        print(f"Chi-square test for {col}: p-value = {p}")
    else:
        print(f"Skipping {col}: Contingency table is empty or too sparse")


In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

for col in categorical_cols:
    flights_needed_data[col] = label_encoder.fit_transform(flights_needed_data[col].astype(str))

categorical_data = flights_needed_data[categorical_cols]
target = flights_needed_data['ARRIVAL_DELAY'].apply(lambda x: 1 if x > 0 else 0)  # Binary target: 1 if the flight arrived late, else 0

from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(categorical_data, target, discrete_features=True)

# Print mutual information scores
for col, score in zip(categorical_cols, mutual_info):
    print(f"Mutual Information for {col}: {score}")


In [None]:

corr_matrix = flights_needed_data.select_dtypes(include=['int', 'float']).corr()
arrival_delay_corr = corr_matrix['ARRIVAL_DELAY'].drop('ARRIVAL_DELAY').sort_values(ascending=False)
plt.figure(figsize=(10, 6))
arrival_delay_corr.plot(kind='bar', color='skyblue')
plt.title("Correlation with Target Variable 'ARRIVAL_DELAY'")
plt.xlabel("Features")
plt.ylabel("Correlation Coefficient")
plt.show()


In [None]:
arrival_delay_corr[arrival_delay_corr > 0.2].index.tolist()

In [None]:
##numerical_cols = ['MONTH', 'DAY', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY',
 #                 'DISTANCE', 'SCHEDULED_ARRIVAL', 'DIVERTED', 'CANCELLED', 'AIR_SYSTEM_DELAY',
 #                 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY']
##categorical_cols = ['AIRLINE', 'ARRIVAL_TIME_SEGMENT']

numerical_cols = ['DEPARTURE_DELAY',
 'LATE_AIRCRAFT_DELAY',
 'AIRLINE_DELAY',
 'DEPARTURE_TIME',
 'WHEELS_OFF',
 'WEATHER_DELAY']           
#categorical_cols = ['MONTH', 'DAY', 'DAY_OF_WEEK', 'FLIGHT_NUMBER','TAIL_NUMBER',
#                    'AIRLINE', 'ORIGIN_AIRPORT', 'ORIGIN_CITY', 'DEST_AIRPORT', 'DEST_CITY','ARRIVAL_TIME_SEGMENT']
categorical_cols = [    'ORIGIN_AIRPORT',   
    'ORIGIN_CITY',      
    'DEST_CITY',        
    'DEST_STATE',       
    'FLY_MONTH',        
    'FLY_DAY_OF_WEEK',  
    'AIRLINE',          
    'DEST_AIRPORT',  
    'DEST_COUNTRY',    
    'CANCELLATION_REASON' ]

# Define columns by data type

# Creating the target column

In [None]:
# result = []
# for row in flights_needed_data['ARRIVAL_DELAY']:
#   if row > 5:
#     result.append(1)
#   else:
#     result.append(0) 

# flights_needed_data['delay_flag'] = result
# flights_needed_data.value_counts('delay_flag')

In [None]:
flights_needed_data['FLY_DATE'] = pd.to_datetime(flights_needed_data['FLY_DATE'])

flights_needed_data['FLY_MONTH'] = flights_needed_data['FLY_DATE'].dt.month
flights_needed_data['FLY_DAY_OF_WEEK'] = flights_needed_data['FLY_DATE'].dt.dayofweek

In [None]:
flights_needed_data['MONTH'] = flights_needed_data['MONTH'].astype(int)

In [None]:
flights_needed_data = flights_needed_data.loc[:, ~flights_needed_data.columns.duplicated()]


In [None]:
test_data = flights_needed_data[flights_needed_data['MONTH'] > 6][numerical_cols+categorical_cols + ['ARRIVAL_DELAY']]
train_data = flights_needed_data[flights_needed_data['MONTH'] <= 6][numerical_cols+categorical_cols + ['ARRIVAL_DELAY']]


In [None]:
train_data = train_data.loc[:, ~train_data.columns.duplicated()]
test_data = test_data.loc[:, ~test_data.columns.duplicated()]

In [None]:
train_data.shape, test_data.shape

In [None]:
test_data.info()

In [None]:
# Separate the features from the target column (ARRIVAL_DELAY) for the training set
X_train = train_data.drop(columns=['ARRIVAL_DELAY'])
y_train = train_data['ARRIVAL_DELAY']

# Do the same for the test set
X_test = test_data.drop(columns=['ARRIVAL_DELAY'])
y_test = test_data['ARRIVAL_DELAY']

In [None]:
X_train.columns

In [None]:
# numerical_cols = [ 'DISTANCE','AIR_TIME']                
# categorical_cols = ['MONTH', 'DAY', 'DAY_OF_WEEK', 'FLIGHT_NUMBER','TAIL_NUMBER',
#                     'AIRLINE', 'ORIGIN_AIRPORT', 'ORIGIN_CITY', 'DEST_AIRPORT', 'DEST_CITY','ARRIVAL_TIME_SEGMENT']
# ## should add flytime if rerunning again

# Define transformations for numerical columns: imputing, scaling, and PCA

In [None]:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95))  # keep enough components to explain 95% of the variance
])

# Define transformations for categorical columns: imputing and one-hot encoding


In [None]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both transformations in a ColumnTransformer


In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

In [None]:
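from sklearn.ensemble import RandomForestClassifier

# ARRIVAL_DELAY is continuous (minutes), so derive the binary label the
# classifier needs: 1 if the flight arrived late, 0 otherwise
y_train_flag = (y_train > 0).astype(int)

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_jobs=-1))
])

model.fit(X_train, y_train_flag)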

In [None]:
X_train.columns

In [None]:
regressors = {
    'RandomForest': RandomForestRegressor(),
    'XGBoost': XGBRegressor(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(),
    'KNeighbors': KNeighborsRegressor(),
    'SVR': SVR(),
    'GradientBoosting': GradientBoostingRegressor()
}

param_grid = {
    'RandomForest': {
        'regressor__n_estimators': [100, 200],
        'regressor__max_depth': [5, 10]
    },
    'XGBoost': {
        'regressor__n_estimators': [100, 200],
        'regressor__learning_rate': [0.1, 0.3],
        'regressor__max_depth': [3, 10],
        'regressor__subsample': [0.5, 1.0]
    },
    'Ridge': {
        'regressor__alpha': [1.0, 10.0, 100.0]
    },
    'Lasso': {
        'regressor__alpha': [0.1, 1.0, 10.0]
    },
    'ElasticNet': {
        'regressor__alpha': [0.1, 1.0, 10.0],
        'regressor__l1_ratio': [0.2, 0.5, 0.8]
    },
    'LinearRegression': {},  # No hyperparameters to tune
    'DecisionTree': {
        'regressor__max_depth': [5, 10, 15]
    },
    'KNeighbors': {
        'regressor__n_neighbors': [3, 5, 7]
    },
    'SVR': {
        'regressor__C': [0.1, 1.0, 10.0],
        'regressor__kernel': ['linear', 'rbf']
    },
    'GradientBoosting': {
        'regressor__n_estimators': [100, 200],
        'regressor__learning_rate': [0.1, 0.3],
        'regressor__max_depth': [3, 5]
    }
}

# Initialize variables to store the best model and score
best_model = None
best_score = float('-inf')
best_params = None

In [None]:
for name, regressor in regressors.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', regressor)
    ])
    
    # Perform grid search
    grid_search = GridSearchCV(pipeline, param_grid[name], cv=5, scoring='r2', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    # Check if this model is the best so far
    if grid_search.best_score_ > best_score:
        best_score = grid_search.best_score_
        best_model = grid_search.best_estimator_
        best_params = grid_search.best_params_
    
    print(f"Model: {name}")
    print(f"Best R² score from cross-validation: {grid_search.best_score_}")
    print(f"Best parameters: {grid_search.best_params_}")
    print("")

# Output the best model and its parameters
print("Best model overall:")
print(best_model)
print(f"Best cross-validation R² score: {best_score}")
print(f"Best parameters: {best_params}")

# Evaluate on the test set
y_pred = best_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

print(f"Test set MSE: {test_mse}")
print(f"Test set R² score: {test_r2}")

In [None]:
# from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# pipeline.fit(X_train, y_train)

# y_train_pred = pipeline.predict(X_train)

# train_mse = mean_squared_error(y_train, y_train_pred)
# train_mae = mean_absolute_error(y_train, y_train_pred)
# train_r2 = r2_score(y_train, y_train_pred)

# print(f"Training MSE: {train_mse}")
# print(f"Training MAE: {train_mae}")
# print(f"Training R²: {train_r2}")


# y_test_pred = pipeline.predict(X_test)

# test_mse = mean_squared_error(y_test, y_test_pred)
# test_mae = mean_absolute_error(y_test, y_test_pred)
# test_r2 = r2_score(y_test, y_test_pred)

# print(f"Test MSE: {test_mse}")
# print(f"Test MAE: {test_mae}")
# print(f"Test R²: {test_r2}")


# mse_tolerance = 0.2 

# if train_mse < test_mse * (1 - mse_tolerance) and train_r2 > test_r2:
#     print("The model is likely overfitting.")
# elif train_mse > test_mse * (1 + mse_tolerance):
#     print("The model is likely underfitting.")
# else:
#     print("The model is likely generalizing well.")