# Flight Delays Prediction Project

## Introduction
This project aims to predict flight delays in minutes using flight data. The analysis leverages data sourced from Kaggle and applies machine learning models, specifically LightGBM and Random Forest, to make predictions.

## Data Source
The dataset used in this project is obtained from Kaggle:
[Flight Delays Dataset](https://www.kaggle.com/datasets/usdot/flight-delays)

## Notebooks
The project consists of two main Jupyter notebooks:

### 1. Data Wrangling and Exploratory Data Analysis (EDA)
This notebook covers the initial data wrangling and exploratory data analysis to understand the dataset and preprocess it for modeling.
- [Data Wrangling and EDA Notebook](https://github.com/azadehansari/Capstone3_Flight_Delays_Prediction/blob/main/notebooks/01.%20Data_Wrangling-EDA.ipynb)

### 2. Preprocessing and Modeling
This notebook includes the preprocessing steps required before modeling and the implementation of the LightGBM and Random Forest models.
- [Preprocessing and Modeling Notebook](https://github.com/azadehansari/Capstone3_Flight_Delays_Prediction/blob/main/notebooks/02.PreProcess%26Modeling.ipynb)

## Methodology
### Data Wrangling and EDA
- **Data Cleaning**: Handling missing values, outliers, and irrelevant features.
- **Exploratory Data Analysis**: Visualizing distributions, correlations, and trends in the data.

### Preprocessing
- **Feature Engineering**: Creating new features that can help improve model performance.
- **Data Splitting**: Dividing the dataset into training and testing sets.
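
The two preprocessing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's actual code: the toy rows, the HHMM time encoding, and the helper names `hhmm_to_hour` and `train_test_split_rows` are all hypothetical. In practice a library routine such as scikit-learn's `train_test_split` would likely do the splitting.

```python
import random

def hhmm_to_hour(hhmm):
    """Convert a clock time stored as an HHMM integer (e.g. 1745) to its hour (17)."""
    return (hhmm // 100) % 24

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then split off a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy rows: (scheduled departure as HHMM, day of week, delay in minutes).
flights = [
    (545, 1, -3.0), (1745, 5, 42.0), (2359, 7, 12.0), (900, 2, 0.0), (1310, 3, 7.0),
    (2015, 4, 95.0), (630, 6, -8.0), (1120, 1, 3.0), (1600, 5, 30.0), (2400, 7, 15.0),
]

# Feature engineering: replace the raw HHMM value with the departure hour.
features = [(hhmm_to_hour(dep), dow, delay) for dep, dow, delay in flights]

# Data splitting: hold out 20% of the rows for testing.
train_rows, test_rows = train_test_split_rows(features, test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed keeps the split reproducible, so different models are compared on the same held-out rows.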

### Modeling
- **LightGBM**: A gradient boosting framework that uses tree-based learning algorithms.
- **Random Forest**: An ensemble learning method that operates by constructing multiple decision trees.

## Results
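Model performance can be evaluated with regression metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²).

The table below reports the Root Mean Squared Error (RMSE, the square root of MSE) of each model. RMSE is expressed in the same unit as the target, minutes of delay, so lower values mean more accurate predictions. The table also lists each model's run time.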

| Model           | RMSE (minutes) | Run time (minutes) |
|-----------------|----------------|--------------------|
| LightGBM        | 102.44         | 28                 |
| Random Forest   | 104.59         | 263                |
| Neural Networks | -              | Did not run        |
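
For reference, the regression metrics named in this section can be computed from their definitions as sketched below. The helper `regression_metrics` and the four sample values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this project.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R² compares the residual sum of squares with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical delays in minutes: actual vs. predicted.
actual = [10.0, 0.0, 30.0, 20.0]
predicted = [12.0, 2.0, 26.0, 20.0]

metrics = regression_metrics(actual, predicted)
print(metrics["MAE"], metrics["MSE"])  # 2.0 6.0
```

Because RMSE squares the errors before averaging, a few large misses raise it more than they raise MAE.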

## Conclusion
This project demonstrates the process of predicting flight delays using machine learning techniques: EDA and preprocessing, followed by the application of LightGBM and Random Forest models. Of the two models, **LightGBM** was both faster (28 vs. 263 minutes) and slightly more accurate (RMSE of 102.44 vs. 104.59).

For more details, please refer to the linked notebooks above.