# **Flight Delay Prediction using Scikit-Learn Pipeline**

## **Overview**
This project demonstrates how to build a **machine learning pipeline** using scikit-learn to predict flight delays. The pipeline integrates data preprocessing with model training, ensuring efficient handling of both numerical and categorical data.

---

## **Objectives**
- Preprocess numerical and categorical data using `ColumnTransformer`.
- Automate the machine learning workflow using `Pipeline`.
- Train a **Random Forest Classifier** to predict flight delays.
- Optimize the model using **GridSearchCV** for hyperparameter tuning.

---

## **Data Overview**
- **Dataset**: Contains flight details such as:
  - **Year**, **Month**, **Day**
  - **Airline code**, **Origin airport code**, **Destination airport code**
  - **Departure delay** (target: delayed or not)

- **Target Variable**:  
  - `1` if the flight was delayed  
  - `0` if the flight was on time

---

## **Steps Involved**

### 1. **Data Loading and Exploration**
- Load the flight dataset and inspect its structure and missing values.

### 2. **Feature Engineering**
- **Numerical Features**:
  - `YEAR`, `MONTH`, `DAY`
- **Categorical Features**:
  - `AIRLINE__CODE`, `ORIGIN_AIRPORT_CODE`, `DESTINATION_AIRPORT_CODE`

### 3. **Preprocessing with `ColumnTransformer`**
- **Numerical Data**:
  - Impute missing values with the **mean**.
  - Standardize values using **`StandardScaler`**.
  
- **Categorical Data**:
  - Impute missing values with `'missing'`.
  - Encode using **`OneHotEncoder`**.

### 4. **Pipeline Setup**
- Use a **scikit-learn Pipeline** to link preprocessing and model training.
- Integrate a **Random Forest Classifier** within the pipeline.

### 5. **Model Training and Evaluation**
- Split the data into **train (70%)** and **test (30%)** sets.
- Evaluate the model using a **classification report** with metrics like:
  - **Precision**, **Recall**, **F1-score**

### 6. **Hyperparameter Tuning with GridSearchCV**
- Tune hyperparameters of the Random Forest model:
  - Number of estimators (`n_estimators`)
  - Maximum tree depth (`max_depth`)

### 7. **Model Persistence**
- Save the trained model using **`joblib`** for later use.

---

## **Technologies Used**
- **Python**: Programming language
- **Pandas**: Data manipulation and cleaning
- **Scikit-Learn**: Machine learning, preprocessing, and model evaluation
- **Joblib**: Model persistence
- **Jupyter Notebook**: Interactive development environment

---

## **Expected Output**
- A **trained Random Forest model** to predict flight delays.
- **Performance metrics** (accuracy, precision, recall) from the classification report.
- A **saved model** (`flight_delay_classifier.pkl`) for deployment.

---

## **Conclusion**
This project demonstrates how to create an automated **machine learning workflow** using scikit-learn’s `Pipeline` and `ColumnTransformer`. The streamlined preprocessing ensures consistency during both training and testing. With **hyperparameter tuning**, the model's performance is further optimized, making it reliable for real-world flight delay predictions.


# Upgrade pip and install all required packages

In [None]:
!pip install --upgrade pip

# Install Snowflake connectors, pandas integration, and essential libraries
!pip install "snowflake-connector-python[pandas]" \
             snowflake-snowpark-python==1.9.0 \
             numpy pandas matplotlib scikit-learn xgboost seaborn \
             python-dateutil tqdm holidays faker

# Ensure Snowpark Python is up-to-date
!pip install --upgrade -q snowflake-snowpark-python==1.9.0

# Fix potential urllib3 version conflicts
!pip uninstall urllib3 -y
!pip install urllib3==1.26.15

# Additional installations for your project
!pip install fosforml==1.1.6
!pip install python-scipy
!pip install basemap


# Importing necessary libraries and settings

In [1]:

# Standard libraries for date and warnings
import datetime
import warnings

# Scientific and Data Manipulation Libraries
import scipy
import pandas as pd
import numpy as np

# Data Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb

# Sklearn Modules for Data Preprocessing, Modeling, and Evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder  # Encoding categorical variables
from sklearn.preprocessing import StandardScaler  # Scaling numerical data
from sklearn.tree import DecisionTreeClassifier  # Decision Tree model
from sklearn.metrics import roc_auc_score, classification_report  # Evaluation metrics

# Configuring display options and warning filters
pd.options.display.max_columns = 50
warnings.filterwarnings("ignore")

# Custom FosforML package for Snowflake session and model registration
from fosforml.model_manager.snowflakesession import get_session
from fosforml import register_model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer


In [2]:
# Set Matplotlib's default font family to 'DeJavu Serif' to ensure a consistent font style across plots
plt.rcParams['font.family'] = 'DeJavu Serif'

# Establishing a Snowflake session


In [3]:
my_session = get_session()

# Defining the table name to fetch data from
# table_name = 'FLIGHTS'  # Initial option for table
table_name = 'FLIGHTS_FULL'  # Final table to be used

# Querying the data from the specified Snowflake table
sf_df = my_session.sql("SELECT * FROM {}".format(table_name))

# Converting the Snowflake DataFrame to a pandas DataFrame for local processing
df = sf_df.to_pandas()

df

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE__CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT_CODE,DESTINATION_AIRPORT_CODE,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,FLY_DATE,AIRLINE,ORIGIN_AIRPORT,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_LATITUDE,ORIGIN_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE
0,2024,6,2,2,MQ,3288,N500MQ,ORD,DSM,1048,1048.0,0.0,10.0,1058.0,78.0,60.0,45.0,299,1143.0,5.0,1206,1148.0,-18.0,0,0,,,,,,,2024-06-02,American Eagle Airlines Inc.,Chicago O'Hare International Airport,Chicago,IL,USA,41.97960,-87.90446,Des Moines International Airport,Des Moines,IA,USA,41.53493,-93.66068
1,2024,6,2,2,MQ,3319,N902MQ,LFT,DFW,1048,1040.0,-8.0,8.0,1048.0,86.0,85.0,60.0,351,1148.0,17.0,1214,1205.0,-9.0,0,0,,,,,,,2024-06-02,American Eagle Airlines Inc.,Lafayette Regional Airport,Lafayette,LA,USA,30.20528,-91.98766,Dallas/Fort Worth International Airport,Dallas-Fort Worth,TX,USA,32.89595,-97.03720
2,2024,6,2,2,NK,762,N533NK,ATL,ORD,1048,1101.0,13.0,25.0,1126.0,120.0,132.0,81.0,606,1147.0,26.0,1148,1213.0,25.0,0,0,,12.0,0.0,13.0,0.0,0.0,2024-06-02,Spirit Air Lines,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Chicago O'Hare International Airport,Chicago,IL,USA,41.97960,-87.90446
3,2024,6,2,2,AA,2484,N3ENAA,DFW,IAH,1049,1051.0,2.0,20.0,1111.0,77.0,71.0,41.0,224,1152.0,10.0,1206,1202.0,-4.0,0,0,,,,,,,2024-06-02,American Airlines Inc.,Dallas/Fort Worth International Airport,Dallas-Fort Worth,TX,USA,32.89595,-97.03720,George Bush Intercontinental Airport,Houston,TX,USA,29.98047,-95.33972
4,2024,6,2,2,B6,842,N623JB,SAV,JFK,1049,1057.0,8.0,11.0,1108.0,131.0,107.0,92.0,718,1240.0,4.0,1300,1244.0,-16.0,0,0,,,,,,,2024-06-02,JetBlue Airways,Savannah/Hilton Head International Airport,Savannah,GA,USA,32.12758,-81.20214,John F. Kennedy International Airport (New Yor...,New York,NY,USA,40.63975,-73.77893
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819074,2024,4,23,4,UA,829,N417UA,EWR,DEN,1529,1528.0,-1.0,17.0,1545.0,273.0,258.0,234.0,1605,1739.0,7.0,1802,1746.0,-16.0,0,0,,,,,,,2024-04-23,United Air Lines Inc.,Newark Liberty International Airport,Newark,NJ,USA,40.69250,-74.16866,Denver International Airport,Denver,CO,USA,39.85841,-104.66700
5819075,2024,4,23,4,UA,550,N854UA,PDX,DEN,1529,1539.0,10.0,11.0,1550.0,153.0,139.0,123.0,991,1853.0,5.0,1902,1858.0,-4.0,0,0,,,,,,,2024-04-23,United Air Lines Inc.,Portland International Airport,Portland,OR,USA,45.58872,-122.59750,Denver International Airport,Denver,CO,USA,39.85841,-104.66700
5819076,2024,4,23,4,UA,1572,N73259,EWR,MIA,1529,1528.0,-1.0,16.0,1544.0,186.0,195.0,165.0,1085,1829.0,14.0,1835,1843.0,8.0,0,0,,,,,,,2024-04-23,United Air Lines Inc.,Newark Liberty International Airport,Newark,NJ,USA,40.69250,-74.16866,Miami International Airport,Miami,FL,USA,25.79325,-80.29056
5819077,2024,4,23,4,US,765,N762US,DTW,CLT,1529,1532.0,3.0,17.0,1549.0,110.0,107.0,82.0,500,1711.0,8.0,1719,1719.0,0.0,0,0,,,,,,,2024-04-23,US Airways Inc.,Detroit Metropolitan Airport,Detroit,MI,USA,42.21206,-83.34884,Charlotte Douglas International Airport,Charlotte,NC,USA,35.21401,-80.94313


# Filtering data for specific airlines

In [4]:
# Defining the list of airlines to include in the filtered DataFrame
options = ['Southwest Airlines Co.', 'Delta Air Lines Inc.']

# Selecting rows where the 'AIRLINE' column matches one of the specified airlines
flights = df.loc[df['AIRLINE'].isin(options)]
flights

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE__CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT_CODE,DESTINATION_AIRPORT_CODE,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,FLY_DATE,AIRLINE,ORIGIN_AIRPORT,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_LATITUDE,ORIGIN_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE
5,2024,6,2,2,DL,1448,N895AT,ATL,DAL,1049,1046.0,-3.0,13.0,1059.0,137.0,119.0,101.0,721,1140.0,5.0,1206,1145.0,-21.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Dallas Love Field,Dallas,TX,USA,32.84711,-96.85177
18,2024,6,2,2,DL,1294,N942DN,RDU,ATL,1050,1045.0,-5.0,14.0,1059.0,81.0,74.0,54.0,356,1153.0,6.0,1211,1159.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694
19,2024,6,2,2,DL,653,N3740C,LAX,SEA,1050,1047.0,-3.0,17.0,1104.0,164.0,148.0,125.0,954,1309.0,6.0,1334,1315.0,-19.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931
20,2024,6,2,2,DL,748,N908DL,ATL,DTW,1050,1051.0,1.0,20.0,1111.0,122.0,105.0,78.0,594,1229.0,7.0,1252,1236.0,-16.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Detroit Metropolitan Airport,Detroit,MI,USA,42.21206,-83.34884
21,2024,6,2,2,DL,783,N334NW,ATL,MSP,1050,1048.0,-2.0,13.0,1101.0,156.0,146.0,128.0,907,1209.0,5.0,1226,1214.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819053,2024,4,23,4,DL,2232,N921DL,MSP,MKE,1525,1736.0,131.0,18.0,1754.0,62.0,67.0,45.0,297,1839.0,4.0,1627,1843.0,136.0,0,0,,5.0,0.0,131.0,0.0,0.0,2024-04-23,Delta Air Lines Inc.,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,General Mitchell International Airport,Milwaukee,WI,USA,42.94722,-87.89658
5819054,2024,4,23,4,DL,2382,N989AT,TLH,ATL,1525,1523.0,-2.0,13.0,1536.0,76.0,65.0,46.0,223,1622.0,6.0,1641,1628.0,-13.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Tallahassee International Airport,Tallahassee,FL,USA,30.39653,-84.35033,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694
5819055,2024,4,23,4,DL,2465,N950DL,BDL,ATL,1525,1528.0,3.0,12.0,1540.0,158.0,144.0,126.0,859,1746.0,6.0,1803,1752.0,-11.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Bradley International Airport,Windsor Locks,CT,USA,41.93887,-72.68323,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694
5819056,2024,4,23,4,DL,2658,N938DL,MIA,LGA,1525,1552.0,27.0,82.0,1714.0,184.0,272.0,159.0,1096,1953.0,31.0,1829,2024.0,115.0,0,0,,88.0,0.0,0.0,27.0,0.0,2024-04-23,Delta Air Lines Inc.,Miami International Airport,Miami,FL,USA,25.79325,-80.29056,LaGuardia Airport (Marine Air Terminal),New York,NY,USA,40.77724,-73.87261


# Creating a copy of the filtered flights data

In [5]:
# This ensures that any modifications made to 'flights_needed_data' do not affect the original 'flights' DataFrame
flights_needed_data = flights.copy()

In [6]:
flights_needed_data.shape

(2137736, 45)

In [7]:
flights_needed_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2137736 entries, 5 to 5819070
Data columns (total 45 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   YEAR                      int16  
 1   MONTH                     int8   
 2   DAY                       int8   
 3   DAY_OF_WEEK               int8   
 4   AIRLINE__CODE             object 
 5   FLIGHT_NUMBER             int16  
 6   TAIL_NUMBER               object 
 7   ORIGIN_AIRPORT_CODE       object 
 8   DESTINATION_AIRPORT_CODE  object 
 9   SCHEDULED_DEPARTURE       int16  
 10  DEPARTURE_TIME            float64
 11  DEPARTURE_DELAY           float64
 12  TAXI_OUT                  float64
 13  WHEELS_OFF                float64
 14  SCHEDULED_TIME            float64
 15  ELAPSED_TIME              float64
 16  AIR_TIME                  float64
 17  DISTANCE                  int16  
 18  WHEELS_ON                 float64
 19  TAXI_IN                   float64
 20  SCHEDULED_ARRIVAL         int

In [8]:
flights_needed_data.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE__CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT_CODE,DESTINATION_AIRPORT_CODE,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,FLY_DATE,AIRLINE,ORIGIN_AIRPORT,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_LATITUDE,ORIGIN_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE
5,2024,6,2,2,DL,1448,N895AT,ATL,DAL,1049,1046.0,-3.0,13.0,1059.0,137.0,119.0,101.0,721,1140.0,5.0,1206,1145.0,-21.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Dallas Love Field,Dallas,TX,USA,32.84711,-96.85177
18,2024,6,2,2,DL,1294,N942DN,RDU,ATL,1050,1045.0,-5.0,14.0,1059.0,81.0,74.0,54.0,356,1153.0,6.0,1211,1159.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694
19,2024,6,2,2,DL,653,N3740C,LAX,SEA,1050,1047.0,-3.0,17.0,1104.0,164.0,148.0,125.0,954,1309.0,6.0,1334,1315.0,-19.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931
20,2024,6,2,2,DL,748,N908DL,ATL,DTW,1050,1051.0,1.0,20.0,1111.0,122.0,105.0,78.0,594,1229.0,7.0,1252,1236.0,-16.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Detroit Metropolitan Airport,Detroit,MI,USA,42.21206,-83.34884
21,2024,6,2,2,DL,783,N334NW,ATL,MSP,1050,1048.0,-2.0,13.0,1101.0,156.0,146.0,128.0,907,1209.0,5.0,1226,1214.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692


# Function to categorize scheduled arrival times into time segments

In [9]:
def categorize_time(SCHEDULED_ARRIVAL):
    # Categorize based on scheduled arrival time in 24-hour format
    if 500 <= SCHEDULED_ARRIVAL < 800:
        return 'Early morning'
    elif 800 <= SCHEDULED_ARRIVAL < 1100:
        return 'Late morning'
    elif 1100 <= SCHEDULED_ARRIVAL < 1400:
        return 'Around noon'
    elif 1400 <= SCHEDULED_ARRIVAL < 1700:
        return 'Afternoon'
    elif 1700 <= SCHEDULED_ARRIVAL < 2000:
        return 'Evening'
    elif 2000 <= SCHEDULED_ARRIVAL < 2300:
        return 'Night'
    elif SCHEDULED_ARRIVAL >= 2300 or SCHEDULED_ARRIVAL < 200:
        return 'Late night'
    elif 200 <= SCHEDULED_ARRIVAL < 500:
        return 'Dawn'

# Apply categorize_time function to the 'SCHEDULED_ARRIVAL' column to create 'ARRIVAL_TIME_SEGMENT'
flights_needed_data['ARRIVAL_TIME_SEGMENT'] = flights_needed_data['SCHEDULED_ARRIVAL'].apply(categorize_time)


In [10]:
flights_needed_data

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE__CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT_CODE,DESTINATION_AIRPORT_CODE,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,FLY_DATE,AIRLINE,ORIGIN_AIRPORT,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_LATITUDE,ORIGIN_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE,ARRIVAL_TIME_SEGMENT
5,2024,6,2,2,DL,1448,N895AT,ATL,DAL,1049,1046.0,-3.0,13.0,1059.0,137.0,119.0,101.0,721,1140.0,5.0,1206,1145.0,-21.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Dallas Love Field,Dallas,TX,USA,32.84711,-96.85177,Around noon
18,2024,6,2,2,DL,1294,N942DN,RDU,ATL,1050,1045.0,-5.0,14.0,1059.0,81.0,74.0,54.0,356,1153.0,6.0,1211,1159.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Around noon
19,2024,6,2,2,DL,653,N3740C,LAX,SEA,1050,1047.0,-3.0,17.0,1104.0,164.0,148.0,125.0,954,1309.0,6.0,1334,1315.0,-19.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,Around noon
20,2024,6,2,2,DL,748,N908DL,ATL,DTW,1050,1051.0,1.0,20.0,1111.0,122.0,105.0,78.0,594,1229.0,7.0,1252,1236.0,-16.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Detroit Metropolitan Airport,Detroit,MI,USA,42.21206,-83.34884,Around noon
21,2024,6,2,2,DL,783,N334NW,ATL,MSP,1050,1048.0,-2.0,13.0,1101.0,156.0,146.0,128.0,907,1209.0,5.0,1226,1214.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,Around noon
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819053,2024,4,23,4,DL,2232,N921DL,MSP,MKE,1525,1736.0,131.0,18.0,1754.0,62.0,67.0,45.0,297,1839.0,4.0,1627,1843.0,136.0,0,0,,5.0,0.0,131.0,0.0,0.0,2024-04-23,Delta Air Lines Inc.,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,General Mitchell International Airport,Milwaukee,WI,USA,42.94722,-87.89658,Afternoon
5819054,2024,4,23,4,DL,2382,N989AT,TLH,ATL,1525,1523.0,-2.0,13.0,1536.0,76.0,65.0,46.0,223,1622.0,6.0,1641,1628.0,-13.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Tallahassee International Airport,Tallahassee,FL,USA,30.39653,-84.35033,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Afternoon
5819055,2024,4,23,4,DL,2465,N950DL,BDL,ATL,1525,1528.0,3.0,12.0,1540.0,158.0,144.0,126.0,859,1746.0,6.0,1803,1752.0,-11.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Bradley International Airport,Windsor Locks,CT,USA,41.93887,-72.68323,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Evening
5819056,2024,4,23,4,DL,2658,N938DL,MIA,LGA,1525,1552.0,27.0,82.0,1714.0,184.0,272.0,159.0,1096,1953.0,31.0,1829,2024.0,115.0,0,0,,88.0,0.0,0.0,27.0,0.0,2024-04-23,Delta Air Lines Inc.,Miami International Airport,Miami,FL,USA,25.79325,-80.29056,LaGuardia Airport (Marine Air Terminal),New York,NY,USA,40.77724,-73.87261,Evening


In [11]:
flights['AIRLINE__CODE'].unique()

array(['DL', 'WN'], dtype=object)

In [12]:
flights_needed_data.value_counts('DIVERTED')

DIVERTED
0    2132545
1       5191
Name: count, dtype: int64

In [13]:
flights_needed_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2137736 entries, 5 to 5819070
Data columns (total 46 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   YEAR                      int16  
 1   MONTH                     int8   
 2   DAY                       int8   
 3   DAY_OF_WEEK               int8   
 4   AIRLINE__CODE             object 
 5   FLIGHT_NUMBER             int16  
 6   TAIL_NUMBER               object 
 7   ORIGIN_AIRPORT_CODE       object 
 8   DESTINATION_AIRPORT_CODE  object 
 9   SCHEDULED_DEPARTURE       int16  
 10  DEPARTURE_TIME            float64
 11  DEPARTURE_DELAY           float64
 12  TAXI_OUT                  float64
 13  WHEELS_OFF                float64
 14  SCHEDULED_TIME            float64
 15  ELAPSED_TIME              float64
 16  AIR_TIME                  float64
 17  DISTANCE                  int16  
 18  WHEELS_ON                 float64
 19  TAXI_IN                   float64
 20  SCHEDULED_ARRIVAL         int

# Define columns by data type

In [14]:
numerical_cols = ['MONTH', 'DAY', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY',
                  'DISTANCE', 'SCHEDULED_ARRIVAL', 'DIVERTED', 'CANCELLED', 'AIR_SYSTEM_DELAY',
                  'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY']
categorical_cols = ['AIRLINE', 'ARRIVAL_TIME_SEGMENT']


# Define transformations for numerical columns: imputing and scaling

In [15]:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define transformations for categorical columns: imputing and one-hot encoding


In [16]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both transformations in a ColumnTransformer


In [17]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

# Create a full pipeline; add your model at the end (e.g., DecisionTreeClassifier)

In [18]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])
pipeline

# Creating the target column

In [19]:
result = []
for row in flights_needed_data['ARRIVAL_DELAY']:
  if row > 5:
    result.append(1)
  else:
    result.append(0) 

flights_needed_data['result'] = result
flights_needed_data.value_counts('result')

result
0    1595430
1     542306
Name: count, dtype: int64

In [21]:
test_data = flights_needed_data[flights_needed_data['MONTH'] >= 11] 
#test=test.drop(['FLY_DATE'], axis=1)
train_data = flights_needed_data[flights_needed_data['MONTH'] < 11]


In [22]:
# Replace 'target_column_name' with the actual name of your target column
X_train = train_data.drop(columns=['result'])
y_train = train_data['result']

In [23]:
# Replace 'target_column_name' with the actual name of your target column
X_test = test_data.drop(columns=['result'])
y_test = test_data['result']

In [24]:
pipeline.fit(X_train,y_train)

In [25]:
y_prob = pipeline.predict_proba(X_test)[:,1]

In [27]:
y_pred = pipeline.predict(X_test)

In [28]:
auc_score = roc_auc_score(y_test, y_pred)
auc_score

0.8408040265277601

In [29]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.91      0.92    263912
           1       0.75      0.77      0.76     90476

    accuracy                           0.88    354388
   macro avg       0.84      0.84      0.84    354388
weighted avg       0.88      0.88      0.88    354388



In [31]:
y_pred = pipeline.predict(flights_needed_data)

In [32]:
flights['ACTUAL_DELAY'] = flights_needed_data['result']
flights

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE__CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT_CODE,DESTINATION_AIRPORT_CODE,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,FLY_DATE,AIRLINE,ORIGIN_AIRPORT,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_LATITUDE,ORIGIN_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE,ACTUAL_DELAY
5,2024,6,2,2,DL,1448,N895AT,ATL,DAL,1049,1046.0,-3.0,13.0,1059.0,137.0,119.0,101.0,721,1140.0,5.0,1206,1145.0,-21.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Dallas Love Field,Dallas,TX,USA,32.84711,-96.85177,0
18,2024,6,2,2,DL,1294,N942DN,RDU,ATL,1050,1045.0,-5.0,14.0,1059.0,81.0,74.0,54.0,356,1153.0,6.0,1211,1159.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,0
19,2024,6,2,2,DL,653,N3740C,LAX,SEA,1050,1047.0,-3.0,17.0,1104.0,164.0,148.0,125.0,954,1309.0,6.0,1334,1315.0,-19.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,0
20,2024,6,2,2,DL,748,N908DL,ATL,DTW,1050,1051.0,1.0,20.0,1111.0,122.0,105.0,78.0,594,1229.0,7.0,1252,1236.0,-16.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Detroit Metropolitan Airport,Detroit,MI,USA,42.21206,-83.34884,0
21,2024,6,2,2,DL,783,N334NW,ATL,MSP,1050,1048.0,-2.0,13.0,1101.0,156.0,146.0,128.0,907,1209.0,5.0,1226,1214.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819053,2024,4,23,4,DL,2232,N921DL,MSP,MKE,1525,1736.0,131.0,18.0,1754.0,62.0,67.0,45.0,297,1839.0,4.0,1627,1843.0,136.0,0,0,,5.0,0.0,131.0,0.0,0.0,2024-04-23,Delta Air Lines Inc.,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,General Mitchell International Airport,Milwaukee,WI,USA,42.94722,-87.89658,1
5819054,2024,4,23,4,DL,2382,N989AT,TLH,ATL,1525,1523.0,-2.0,13.0,1536.0,76.0,65.0,46.0,223,1622.0,6.0,1641,1628.0,-13.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Tallahassee International Airport,Tallahassee,FL,USA,30.39653,-84.35033,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,0
5819055,2024,4,23,4,DL,2465,N950DL,BDL,ATL,1525,1528.0,3.0,12.0,1540.0,158.0,144.0,126.0,859,1746.0,6.0,1803,1752.0,-11.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Bradley International Airport,Windsor Locks,CT,USA,41.93887,-72.68323,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,0
5819056,2024,4,23,4,DL,2658,N938DL,MIA,LGA,1525,1552.0,27.0,82.0,1714.0,184.0,272.0,159.0,1096,1953.0,31.0,1829,2024.0,115.0,0,0,,88.0,0.0,0.0,27.0,0.0,2024-04-23,Delta Air Lines Inc.,Miami International Airport,Miami,FL,USA,25.79325,-80.29056,LaGuardia Airport (Marine Air Terminal),New York,NY,USA,40.77724,-73.87261,1


In [33]:
flights['PREDICTED_DELAY'] = y_pred

In [34]:
flights

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE__CODE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT_CODE,DESTINATION_AIRPORT_CODE,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,FLY_DATE,AIRLINE,ORIGIN_AIRPORT,ORIGIN_CITY,ORIGIN_STATE,ORIGIN_COUNTRY,ORIGIN_LATITUDE,ORIGIN_LONGITUDE,DEST_AIRPORT,DEST_CITY,DEST_STATE,DEST_COUNTRY,DEST_LATITUDE,DEST_LONGITUDE,ACTUAL_DELAY,PREDICTED_DELAY
5,2024,6,2,2,DL,1448,N895AT,ATL,DAL,1049,1046.0,-3.0,13.0,1059.0,137.0,119.0,101.0,721,1140.0,5.0,1206,1145.0,-21.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Dallas Love Field,Dallas,TX,USA,32.84711,-96.85177,0,0
18,2024,6,2,2,DL,1294,N942DN,RDU,ATL,1050,1045.0,-5.0,14.0,1059.0,81.0,74.0,54.0,356,1153.0,6.0,1211,1159.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Raleigh-Durham International Airport,Raleigh,NC,USA,35.87764,-78.78747,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,0,0
19,2024,6,2,2,DL,653,N3740C,LAX,SEA,1050,1047.0,-3.0,17.0,1104.0,164.0,148.0,125.0,954,1309.0,6.0,1334,1315.0,-19.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Los Angeles International Airport,Los Angeles,CA,USA,33.94254,-118.40807,Seattle-Tacoma International Airport,Seattle,WA,USA,47.44898,-122.30931,0,0
20,2024,6,2,2,DL,748,N908DL,ATL,DTW,1050,1051.0,1.0,20.0,1111.0,122.0,105.0,78.0,594,1229.0,7.0,1252,1236.0,-16.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Detroit Metropolitan Airport,Detroit,MI,USA,42.21206,-83.34884,0,0
21,2024,6,2,2,DL,783,N334NW,ATL,MSP,1050,1048.0,-2.0,13.0,1101.0,156.0,146.0,128.0,907,1209.0,5.0,1226,1214.0,-12.0,0,0,,,,,,,2024-06-02,Delta Air Lines Inc.,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819053,2024,4,23,4,DL,2232,N921DL,MSP,MKE,1525,1736.0,131.0,18.0,1754.0,62.0,67.0,45.0,297,1839.0,4.0,1627,1843.0,136.0,0,0,,5.0,0.0,131.0,0.0,0.0,2024-04-23,Delta Air Lines Inc.,Minneapolis-Saint Paul International Airport,Minneapolis,MN,USA,44.88055,-93.21692,General Mitchell International Airport,Milwaukee,WI,USA,42.94722,-87.89658,1,1
5819054,2024,4,23,4,DL,2382,N989AT,TLH,ATL,1525,1523.0,-2.0,13.0,1536.0,76.0,65.0,46.0,223,1622.0,6.0,1641,1628.0,-13.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Tallahassee International Airport,Tallahassee,FL,USA,30.39653,-84.35033,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,0,0
5819055,2024,4,23,4,DL,2465,N950DL,BDL,ATL,1525,1528.0,3.0,12.0,1540.0,158.0,144.0,126.0,859,1746.0,6.0,1803,1752.0,-11.0,0,0,,,,,,,2024-04-23,Delta Air Lines Inc.,Bradley International Airport,Windsor Locks,CT,USA,41.93887,-72.68323,Hartsfield-Jackson Atlanta International Airport,Atlanta,GA,USA,33.64044,-84.42694,0,0
5819056,2024,4,23,4,DL,2658,N938DL,MIA,LGA,1525,1552.0,27.0,82.0,1714.0,184.0,272.0,159.0,1096,1953.0,31.0,1829,2024.0,115.0,0,0,,88.0,0.0,0.0,27.0,0.0,2024-04-23,Delta Air Lines Inc.,Miami International Airport,Miami,FL,USA,25.79325,-80.29056,LaGuardia Airport (Marine Air Terminal),New York,NY,USA,40.77724,-73.87261,1,1


In [36]:
flights = flights.where(pd.notnull(flights), None)

In [37]:
flights.isna().sum()

YEAR                              0
MONTH                             0
DAY                               0
DAY_OF_WEEK                       0
AIRLINE__CODE                     0
FLIGHT_NUMBER                     0
TAIL_NUMBER                    1431
ORIGIN_AIRPORT_CODE               0
DESTINATION_AIRPORT_CODE          0
SCHEDULED_DEPARTURE               0
DEPARTURE_TIME                19430
DEPARTURE_DELAY               19430
TAXI_OUT                      19744
WHEELS_OFF                    19744
SCHEDULED_TIME                    0
ELAPSED_TIME                  25058
AIR_TIME                      25058
DISTANCE                          0
WHEELS_ON                     20737
TAXI_IN                       20737
SCHEDULED_ARRIVAL                 0
ARRIVAL_TIME                  20737
ARRIVAL_DELAY                 25058
DIVERTED                          0
CANCELLED                         0
CANCELLATION_REASON         2117869
AIR_SYSTEM_DELAY            1783087
SECURITY_DELAY              

In [40]:
chunk_size = 100000
chunks = [flights[i:i + chunk_size] for i in range(0, len(flights), chunk_size)]

for chunk in chunks:
    # Ensure no NaN values remain by explicitly replacing NaN with None
    chunk = chunk.where(pd.notnull(chunk), None)
    
    # # Explicitly set the data type of each column to object
    # for col in chunk.columns:
    #     chunk[col] = chunk[col].astype(object)

    # Convert the DataFrame to a Snowflake-compatible DataFrame
    ins_train_sf = my_session.createDataFrame(
        chunk.values.tolist(),
        schema=chunk.columns.tolist()
    )
    
    # Write to Snowflake
    ins_train_sf.write.mode("append").save_as_table("TTH_DB.TTH_AIRLINE_SCHEMA.DELAY_CLASSIFIER_OUTPUT_2510")

SnowparkSQLException: (1304): 000904 (42000): SQL compilation error: error line 1 at position 971
invalid identifier 'NAN'