<a href="https://colab.research.google.com/github/florianaewing/CSB430SWIWinter2026/blob/main/OptimumDeparture_by_Time_and_LocationPredictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Florian Ewing, Margaret Keau, John Farnandez, Lauren Connely
# Professor Ix Procopios
# Software Design and Implementation
# 2.11.2026

# Optimum Departure Date by Time and Location Predictor

This project analyzes large-scale U.S. flight data from 2023 and January 2024 to build a machine learning model that predicts whether a flight will depart on time or experience a significant delay.

The model uses a binary classification target:

- **Class 0** → Flight departed early or less than 15 minutes after scheduled departure
- **Class 1** → Flight departed 15 minutes or more after scheduled departure

Using this prediction capability, the system evaluates airport and airline performance for a given date and recommends the airport and airline combination with the lowest predicted probability of delay.
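
The recommendation step described above can be sketched as a simple minimum over predicted probabilities. This is only an illustration: the `predicted_delay_prob` mapping, its airport and airline entries, the probability values, and the `recommend_option` helper are all hypothetical and are not taken from the trained model.

```python
# Hypothetical predicted delay probabilities for one travel date.
# Keys are (departure airport, airline); the values are made up for illustration.
predicted_delay_prob = {
    ("SEA", "Alaska Airlines"): 0.18,
    ("SEA", "Delta Air Lines"): 0.22,
    ("PDX", "Alaska Airlines"): 0.15,
    ("PDX", "Southwest Airlines"): 0.27,
}


def recommend_option(probabilities):
    """Return the (airport, airline) pair with the lowest delay probability."""
    best_pair = min(probabilities, key=probabilities.get)
    return best_pair, probabilities[best_pair]


best_pair, best_prob = recommend_option(predicted_delay_prob)
print(best_pair, best_prob)
```

In a full pipeline the mapping would presumably be filled with Class 1 probabilities from the fitted classifier for each candidate airport and airline on the chosen date.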

# Data Loading and Preprocessing
The dataset contains multiple CSV files that provide complementary information about US civil flights. The main file, US_flights_2023.csv, records over 6.7 million flights with details on departure and arrival times, delays, aircraft information, and distances.

Cancelled_Diverted_2023.csv focuses on flight cancellations and diversions, with just over 100,000 entries.

maj us flight - january 2024.csv contains approximately half a million flights from January 2024, including aircraft manufacturer and age.

airports_geolocation.csv provides latitude and longitude for 364 airports, along with city and state information.

Finally, weather_meteo_by_airport.csv contains daily weather measurements for airports, including temperature, precipitation, wind, and pressure.

In [6]:
import os
import pandas as pd
import numpy as np
import kagglehub

# ==== Load dataset ====
dataset_path = kagglehub.dataset_download(
    "bordanova/2023-us-civil-flights-delay-meteo-and-aircraft"
)

flights_2023 = pd.read_csv(os.path.join(dataset_path, "US_flights_2023.csv"))
flights_jan2024 = pd.read_csv(
    os.path.join(dataset_path, "maj us flight - january 2024.csv")
)
cancelled_diverted = pd.read_csv(
    os.path.join(dataset_path, "Cancelled_Diverted_2023.csv")
)
airports = pd.read_csv(os.path.join(dataset_path, "airports_geolocation.csv"))
weather = pd.read_csv(os.path.join(dataset_path, "weather_meteo_by_airport.csv"))

# ==== Fix types and typos ====
for df in [flights_2023, flights_jan2024, cancelled_diverted, weather]:
    if "FlightDate" in df.columns:
        df["FlightDate"] = pd.to_datetime(df["FlightDate"])
    if "time" in df.columns:
        df["time"] = pd.to_datetime(df["time"])

flights_2023.rename(columns={"Aicraft_age": "Aircraft_age"}, inplace=True)
flights_jan2024.rename(columns={"Aicraft_age": "Aircraft_age"}, inplace=True)

# ==== Combine flights ====
flights = pd.concat([flights_2023, flights_jan2024], ignore_index=True)

# ==== Merge airport geolocation ====
airports_geo = airports[["IATA_CODE", "LATITUDE", "LONGITUDE"]]

# Departure
flights = flights.merge(airports_geo, left_on="Dep_Airport",
                        right_on="IATA_CODE", how="left")
flights.rename(columns={"LATITUDE": "Dep_LAT", "LONGITUDE": "Dep_LON"},
               inplace=True)
flights.drop(columns=["IATA_CODE"], inplace=True)

# Arrival
flights = flights.merge(airports_geo, left_on="Arr_Airport",
                        right_on="IATA_CODE", how="left")
flights.rename(columns={"LATITUDE": "Arr_LAT", "LONGITUDE": "Arr_LON"},
               inplace=True)
flights.drop(columns=["IATA_CODE"], inplace=True)

# ==== Merge weather ====
def merge_weather(df, w_df, prefix, airport_col):
    w = w_df.copy()
    w.rename(columns={"time": "FlightDate", "airport_id": airport_col},
             inplace=True)
    rename_cols = {c: f"{prefix}_{c}" for c in w.columns
                   if c not in ["FlightDate", airport_col]}
    w.rename(columns=rename_cols, inplace=True)
    return df.merge(w, on=["FlightDate", airport_col], how="left")

flights = merge_weather(flights, weather, "DepWeather", "Dep_Airport")
flights = merge_weather(flights, weather, "ArrWeather", "Arr_Airport")

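# ==== Merge cancellations ====
# The join key is the aircraft tail number; it is copied into a column
# named "FlightNumber" only so the right-hand key has a distinct name.
cancelled_diverted["FlightNumber"] = cancelled_diverted["Tail_Number"]
# Missing "Cancelled" values are treated as not cancelled (0).
cancelled_diverted["Cancelled_Class"] = cancelled_diverted["Cancelled"].fillna(0)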

flights = flights.merge(
    cancelled_diverted[['FlightDate','Airline','FlightNumber',
                        'Dep_Airport','Cancelled_Class']],
    left_on=['FlightDate','Airline','Tail_Number','Dep_Airport'],
    right_on=['FlightDate','Airline','FlightNumber','Dep_Airport'],
    how='left'
)
flights['Cancelled_Class'] = flights['Cancelled_Class'].fillna(0)

# ==== Create delay class ====
flights = flights.dropna(subset=["Dep_Delay"])
flights["Delay_Class"] = np.where(flights["Dep_Delay"] >= 15, 1, 0)

# ==== Summary stats ====
print("Delay Class Counts:")
print(flights["Delay_Class"].value_counts())
print("\nDelay Class %:")
print((flights["Delay_Class"].value_counts(normalize=True)*100).round(2))

# Cancellation summary
print("\nCancellation Counts:")
print(flights["Cancelled_Class"].value_counts())
print("\nCancellation %:")
print((flights["Cancelled_Class"].value_counts(normalize=True)*100).round(2))


Using Colab cache for faster access to the '2023-us-civil-flights-delay-meteo-and-aircraft' dataset.
Delay Class Counts:
Delay_Class
0    5780639
1    1492770
Name: count, dtype: int64

Delay Class %:
Delay_Class
0    79.48
1    20.52
Name: proportion, dtype: float64

Cancellation Counts:
Cancelled_Class
0.0    7231140
1.0      42269
Name: count, dtype: int64

Cancellation %:
Cancelled_Class
0.0    99.42
1.0     0.58
Name: proportion, dtype: float64


# Key Metrics from each File

airports_geolocation.csv has 364 complete rows with location data and no duplicates.

Cancelled_Diverted_2023.csv has 104,488 rows, no missing values, 945 duplicates, mostly cancelled flights, skewed delay distributions, 15 airlines, and ~345 airports.

maj us flight - january 2024.csv contains 527,197 rows, no missing values, three duplicates, skewed delay metrics, and aircraft details (manufacturer, model, age ~14 years).

US_flights_2023.csv has 6,743,404 rows, no missing values, 31 duplicates, similar delay distributions, and consistent categorical data.

weather_meteo_by_airport.csv has 132,860 rows, no missing values, and complete weather metrics per airport.
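
A quick consistency check on these row counts is worth sketching. The two flight files hold 6,743,404 + 527,197 rows, while the delay class counts printed by the loading cell sum to a slightly larger number. A left merge can only add rows when the right-hand key repeats, so at least that many rows were duplicated by one of the merges. Which merge is responsible is not established here; the cancellation merge is only a plausible candidate because that file contains duplicates. The toy frames and their values below are invented for illustration.

```python
import pandas as pd

# Row counts reported for the two flight files and for the merged frame.
rows_2023 = 6_743_404
rows_jan2024 = 527_197
rows_merged = 5_780_639 + 1_492_770  # Delay_Class counts from the cell output

extra_rows = rows_merged - (rows_2023 + rows_jan2024)
print(extra_rows)

# Toy frames (made-up values) showing how a repeated right-hand key
# duplicates left-hand rows during a left merge.
left = pd.DataFrame({"Tail_Number": ["N1", "N2"], "Dep_Delay": [5, 40]})
right = pd.DataFrame({"Tail_Number": ["N1", "N1"], "Cancelled_Class": [1, 1]})

inflated = left.merge(right, on="Tail_Number", how="left")
print(len(left), len(inflated))

# `validate="many_to_one"` raises MergeError when the right key is not unique.
try:
    left.merge(right, on="Tail_Number", how="left", validate="many_to_one")
    key_is_unique = True
except pd.errors.MergeError:
    key_is_unique = False

# One possible remedy: deduplicate the right-hand key before merging.
deduplicated = right.drop_duplicates(subset="Tail_Number")
safe = left.merge(deduplicated, on="Tail_Number", how="left",
                  validate="many_to_one")
print(key_is_unique, len(safe))
```

Passing `validate` to each merge in the loading cell would surface the duplicated key early instead of silently changing the row count.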

In the flight datasets (Cancelled_Diverted_2023.csv, maj us flight - january 2024.csv, US_flights_2023.csv), median departure and arrival delays are close to zero (or negative in January 2024), while maximum delays reach thousands of minutes (e.g., 3,024–4,413 minutes). This suggests that most flights have minimal delay, but a small number of flights experience extreme delays.

Delay components broken down by cause (carrier, weather, NAS, security, and last aircraft) show similar patterns: mostly zero or very low values, with occasional extreme spikes.

# Data Processing and Visualization for Information Specific to Flight Delays

This code loads and merges U.S. civil flight, airport, and weather datasets from 2023 and January 2024, standardizes column names, and converts date fields to datetime objects. It adds latitude and longitude for departure and arrival airports and attaches corresponding weather data by date and airport. The code then calculates descriptive statistics and skewness for key delay metrics, computes median departure delays by airline and airport, and evaluates correlations between departure delays and weather variables. The result is a cleaned, merged dataset with initial analyses that summarize flight delays and their potential relationship to weather.
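
The analysis steps named in this paragraph (descriptive statistics, skewness, median delay by airline and airport, and correlation with weather) can be sketched on a tiny made-up sample. `Dep_Delay`, `Airline`, and `Dep_Airport` appear in the loading cell; the weather column name `DepWeather_prcp` is an assumption about how the prefixed weather columns are named, and all values are invented.

```python
import pandas as pd

# Tiny made-up sample; the real merged frame has millions of rows.
sample = pd.DataFrame({
    "Airline": ["A", "A", "B", "B", "B", "A"],
    "Dep_Airport": ["SEA", "SEA", "PDX", "PDX", "SEA", "PDX"],
    "Dep_Delay": [-3, 0, 2, 5, 600, -1],
    "DepWeather_prcp": [0.0, 0.0, 1.2, 0.4, 8.5, 0.0],  # hypothetical name
})

# Descriptive statistics and skewness of departure delay.
delay_summary = sample["Dep_Delay"].describe()
delay_skew = sample["Dep_Delay"].skew()

# Median departure delay per airline and per departure airport.
median_by_airline = sample.groupby("Airline")["Dep_Delay"].median()
median_by_airport = sample.groupby("Dep_Airport")["Dep_Delay"].median()

# Pearson correlation between departure delay and one weather variable.
delay_weather_corr = sample["Dep_Delay"].corr(sample["DepWeather_prcp"])

print(delay_summary)
print(delay_skew)
print(median_by_airline)
print(median_by_airport)
print(delay_weather_corr)
```

The single 600-minute value pulls the mean far above the median and produces positive skew, which mirrors the pattern reported for the full dataset.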

# US Flights 2023–Jan 2024: Delay Analysis

## Dataset
- 7,273,409 flights with 46 features including flight, weather, and
  airport data.

## Delay Statistics
- **Median departure delay:** ~0 minutes for most airlines;
  Southwest Airlines has a median of 1 minute.
- **Distribution:** Highly skewed with extreme delays (max ~4,400 min).
- **Airport hotspots:** PHF (10.5 min), SCK (9 min) have the highest
  median delays.

## Weather Impact
- Weak correlations with departure delays (highest: precipitation
  r≈0.064).
- Weather alone explains very little of the variation in delays.
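
To put the weak correlation in perspective, here is a small worked calculation. For a simple linear relationship with one predictor, the share of variance explained equals the squared correlation coefficient. The value 0.064 is the precipitation correlation reported above; the variable names are illustrative.

```python
# Strongest weather correlation reported above (precipitation).
r_precipitation = 0.064

# Share of delay variance a one-variable linear fit could explain.
variance_explained = r_precipitation ** 2

print(f"{variance_explained:.4f}")        # 0.0041
print(f"{variance_explained * 100:.2f}%") # 0.41%
```

This covers only linear association with a single variable, so it does not rule out nonlinear or combined weather effects.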

## Conclusion
- Most flights depart on time or early.
- A few extreme delays drive high skewness in the data.
- The merged dataset, which now includes cancellation information,
  is ready for feature selection and predictive modeling.


In [8]:
import numpy as np

# ===== Define target variables =====

# Drop rows with missing departure delays
flights = flights.dropna(subset=["Dep_Delay"])

# Binary target for departure delay
# 0 -> On time / < 15 min delay
# 1 -> Delayed >= 15 min
flights["Delay_Class"] = np.where(
    flights["Dep_Delay"] >= 15, 1, 0
)

# Delay class distribution
delay_counts = flights["Delay_Class"].value_counts()
delay_percent = flights["Delay_Class"].value_counts(normalize=True) * 100

print("Delay Class Counts:")
print(delay_counts)
print("\nDelay Class Percentage:")
print(delay_percent.round(2))

# Compute class weights for imbalance
total = len(flights)
weight_0 = total / (2 * delay_counts[0])
weight_1 = total / (2 * delay_counts[1])

class_weights = {0: weight_0, 1: weight_1}
print("\nSuggested Delay Class Weights:")
print(class_weights)

# ===== Cancellation target summary =====
# Only if cancellations exist
if "Cancelled_Class" in flights.columns:
    cancel_counts = flights["Cancelled_Class"].value_counts()
    cancel_percent = flights["Cancelled_Class"].value_counts(
        normalize=True
    ) * 100

    print("\nCancellation Class Counts:")
    print(cancel_counts)
    print("\nCancellation Class Percentage:")
    print(cancel_percent.round(2))


Delay Class Counts:
Delay_Class
0    5780639
1    1492770
Name: count, dtype: int64

Delay Class Percentage:
Delay_Class
0    79.48
1    20.52
Name: proportion, dtype: float64

Suggested Delay Class Weights:
{0: np.float64(0.6291180784684877), 1: np.float64(2.4362122095165364)}

Cancellation Class Counts:
Cancelled_Class
0.0    7231140
1.0      42269
Name: count, dtype: int64

Cancellation Class Percentage:
Cancelled_Class
0.0    99.42
1.0     0.58
Name: proportion, dtype: float64


# Flight Delay and Cancellation Summary (2023–Jan 2024)

## Departure Delay

- **Class 0 (On time / < 15 min delay):** 5,780,639 flights (79.48%)
- **Class 1 (Delayed ≥ 15 min):** 1,492,770 flights (20.52%)

**Suggested class weights for modeling (to handle imbalance):**
- 0 → 0.63
- 1 → 2.44

> About four in five flights depart on time or less than 15 minutes late; roughly one in five is delayed by 15 minutes or more.
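
The suggested weights can be reproduced directly from the class counts. The sketch below uses the same formula as the notebook cell, `total / (2 * count)`, written in the general form `n_samples / (n_classes * count)`. This is the same formula scikit-learn applies for `class_weight="balanced"`. Whether the final model uses these weights is not stated here, so treat this as a check of the arithmetic only.

```python
# Delay class counts from the cell output above.
counts = {0: 5_780_639, 1: 1_492_770}

n_samples = sum(counts.values())
n_classes = len(counts)

# Balanced weighting: each class weight is inversely proportional to its count.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

# With these weights, both classes contribute the same total weight.
weighted_totals = {
    label: class_weights[label] * counts[label] for label in counts
}

print({label: round(w, 2) for label, w in class_weights.items()})
print(weighted_totals)
```

The equal weighted totals show the intent of the scheme: each delayed flight counts roughly 3.9 times as much as an on-time flight, so neither class dominates the loss.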

## Cancellations

- **Class 0 (Not cancelled):** 7,231,140 flights (99.42%)
- **Class 1 (Cancelled):** 42,269 flights (0.58%)

> Cancellations are rare (under 1% of flights), so the cancellation target is highly imbalanced for modeling.
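
One way to see why this imbalance matters is to compute the accuracy of a trivial baseline that always predicts the majority class. The sketch below uses the counts reported above; the helper name `majority_baseline_accuracy` is hypothetical.

```python
# Class counts reported above.
delay_counts = {0: 5_780_639, 1: 1_492_770}
cancel_counts = {0: 7_231_140, 1: 42_269}


def majority_baseline_accuracy(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


delay_baseline = majority_baseline_accuracy(delay_counts)
cancel_baseline = majority_baseline_accuracy(cancel_counts)

print(f"Delay baseline accuracy:        {delay_baseline:.2%}")
print(f"Cancellation baseline accuracy: {cancel_baseline:.2%}")
```

A model that never predicts a cancellation would already score above 99% accuracy while catching no cancellations at all, so metrics such as recall, precision, or PR-AUC are likely more informative than accuracy for both targets.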

