# 🧭 **Exploratory Data Analysis (EDA): Flights Dataset**

This notebook marks the **Exploratory Data Analysis (EDA)** phase within our **Travel MLOps Project**. Our focus is the **Flights dataset**, which we will dissect to uncover critical insights, identify data quality issues, and lay a robust groundwork for developing our **Flight Price Prediction Model**.

🎯 **Primary Goal of this EDA**:
To gather the necessary understanding and insights to construct a high-performing **Regression Model** capable of accurately predicting flight prices. This model will leverage key features including distance, time, flight class, agency, and travel route.

---

## 📦 **Dataset At a Glance: Flights**

The **Flights dataset** provides structured records of user flight bookings. It encompasses details on travel routes, flight characteristics, pricing (our target variable), and booking agencies.

| Column       | Description                                     |
|--------------|-------------------------------------------------|
| `travelCode` | Unique identifier for each travel itinerary     |
| `userCode`   | Identifier for the user (links to Users table)  |
| `from`       | Origin city/airport of the flight               |
| `to`         | Destination city/airport of the flight          |
| `flightType` | Service class of the flight (e.g., economy, first class; stored as codes such as `firstClass`) |
| `price`      | **Target Variable**: Cost of the flight (USD)   |
| `time`       | Flight duration (in hours)                      |
| `distance`   | Flight distance (e.g., in kilometers or miles)  |
| `agency`     | Airline or travel agency facilitating the booking |
| `date`       | Date of the flight departure                    |

---

## 🎯 **Key Objectives of this Analysis**

This EDA serves several critical functions in our model development lifecycle:

-   **Deep Dive into Data Structure:** We will thoroughly examine the dataset's schema, data types, and the distributions of categorical and numerical features.
-   **Pinpoint Data Quality Concerns:** Our investigation will focus on identifying and quantifying issues such as missing values, outliers, anomalies, and duplicate records.
-   **Evaluate Feature Relationships and Importance:** We aim to understand how each feature correlates with the target variable (`price`) and with other features, highlighting potential predictors.
-   **Inform Preprocessing and Feature Engineering:** The insights gained will directly guide the strategies for data cleaning, transformation, feature selection, and the creation of new, impactful features for the **Flight Price Prediction Model**.
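
The structure and quality checks listed above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual procedure: the helper names `quality_summary` and `iqr_outlier_mask`, the toy values, and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the Flights data (illustrative values, not the real file).
toy = pd.DataFrame(
    {
        "from": ["Recife (PE)", "Florianopolis (SC)", "Brasilia (DF)", "Recife (PE)", "Aracaju (SE)"],
        "flightType": ["firstClass"] * 5,
        "price": [1434.38, 1292.29, 1487.52, None, 9999.0],
        "distance": [676.53, 676.53, 637.56, 676.53, 830.86],
    }
)


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; missing values are never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


summary = quality_summary(toy)
price_outliers = iqr_outlier_mask(toy["price"])

print(summary)
print("Price outliers flagged:", int(price_outliers.sum()))
```

On the real data the same two helpers could be called on `flights` directly; the IQR rule is only one of several reasonable outlier definitions.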

---

## 🔍 **Analytical Approach**

Our exploration of the Flights dataset will involve a comprehensive analysis using descriptive statistics, various data visualizations (histograms, scatter plots, box plots), and correlation studies.

The findings from this process are crucial, as they will directly inform our **feature selection criteria, data transformation techniques, and the overall design of our model training pipeline**, all aimed at achieving accurate flight price predictions.

> Given the manageable number of features in this dataset, we will employ a thorough visualization strategy. This includes generating pair plots for numerical features, frequency plots for categorical features, and comparative plots (like box plots) across numerical and categorical feature combinations. This ensures a comprehensive understanding of inter-feature relationships and patterns.
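
As a rough sketch of the three plot families mentioned above, the example below uses pandas' built-in plotting on top of matplotlib so that it stays dependency-light; `sns.pairplot`, `sns.countplot`, and `sns.boxplot` are the seaborn counterparts. The toy values (including the `economy` label) are illustrative assumptions, not rows from the real file.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Toy stand-in for the Flights data (illustrative values only).
toy = pd.DataFrame(
    {
        "flightType": ["firstClass", "firstClass", "economy", "economy", "firstClass", "economy"],
        "agency": ["FlyingDrops", "CloudFy", "CloudFy", "FlyingDrops", "CloudFy", "CloudFy"],
        "price": [1434.38, 1487.52, 610.0, 580.0, 1684.05, 720.0],
        "time": [1.76, 1.66, 1.76, 1.66, 2.16, 2.16],
        "distance": [676.53, 637.56, 676.53, 637.56, 830.86, 830.86],
    }
)

numeric_cols = ["price", "time", "distance"]

# 1) Pair plot of the numerical features (histograms on the diagonal).
axes_grid = scatter_matrix(toy[numeric_cols], figsize=(6, 6), diagonal="hist")

fig, (ax_freq, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# 2) Frequency plot for one categorical feature.
toy["agency"].value_counts().plot.bar(ax=ax_freq, title="Bookings per agency")

# 3) Box plot of a numerical feature split by a categorical one.
toy.boxplot(column="price", by="flightType", ax=ax_box)
ax_box.set_title("Price by flight type")
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
fig.tight_layout()
```

In a notebook with `%matplotlib inline`, the `matplotlib.use("Agg")` line can be omitted.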

Let's dive into the data!

### **Import Necessary Libraries and Packages**

In [1]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from ydata_profiling import ProfileReport

# Render matplotlib figures inline in the notebook
%matplotlib inline


In [None]:
import sys

# Add the parent directory to the import path so the local `utils` package can be imported
sys.path.append("../")
from utils.data_utils import check_duplicates

### **Load The Dataset**

In [2]:
flights_path = "../data/flights.csv"
# Load the dataset
flights = pd.read_csv(flights_path)

In [3]:
flights.head()

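|   | travelCode | userCode | from               | to                 | flightType | price   | time | distance | agency      | date       |
|---|------------|----------|--------------------|--------------------|------------|---------|------|----------|-------------|------------|
| 0 | 0          | 0        | Recife (PE)        | Florianopolis (SC) | firstClass | 1434.38 | 1.76 | 676.53   | FlyingDrops | 09/26/2019 |
| 1 | 0          | 0        | Florianopolis (SC) | Recife (PE)        | firstClass | 1292.29 | 1.76 | 676.53   | FlyingDrops | 09/30/2019 |
| 2 | 1          | 0        | Brasilia (DF)      | Florianopolis (SC) | firstClass | 1487.52 | 1.66 | 637.56   | CloudFy     | 10/03/2019 |
| 3 | 1          | 0        | Florianopolis (SC) | Brasilia (DF)      | firstClass | 1127.36 | 1.66 | 637.56   | CloudFy     | 10/04/2019 |
| 4 | 2          | 0        | Aracaju (SE)       | Salvador (BH)      | firstClass | 1684.05 | 2.16 | 830.86   | CloudFy     | 10/10/2019 |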


### **Checking Duplicates**

In [15]:
check_duplicates(flights)

Percentage of rows involved in duplication: 0.00%
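
The `check_duplicates` helper is imported from `utils.data_utils`, and its implementation is not shown in this notebook. Purely as an assumption, a helper that produces the message above might look like the sketch below: it counts every row that has at least one identical copy (`keep=False`) and reports that share. The name `check_duplicates_sketch` is hypothetical and is used to avoid implying this is the project's actual code.

```python
import pandas as pd


def check_duplicates_sketch(df: pd.DataFrame) -> float:
    """Hypothetical stand-in for utils.data_utils.check_duplicates.

    Assumption: "rows involved in duplication" means every row that has at
    least one identical copy elsewhere in the frame, first occurrences included.
    """
    involved = df.duplicated(keep=False)  # True for all members of a duplicate group
    pct = 100.0 * involved.mean() if len(df) else 0.0
    print(f"Percentage of rows involved in duplication: {pct:.2f}%")
    return pct


# Two of the four rows are identical, so half of the rows are "involved".
toy = pd.DataFrame({"a": [1, 1, 2, 3], "b": ["x", "x", "y", "z"]})
pct = check_duplicates_sketch(toy)
```

If the real helper instead counts only the extra copies (`keep="first"`), the reported percentage would be lower for the same data.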


### **Relevance of `travelCode` and `userCode` in EDA**

The `travelCode` and `userCode` columns primarily serve as identifiers.

*   **`travelCode`**: This links related flight segments, such as an outbound and its corresponding return trip. For example:

    | travelCode | userCode | from               | to                 | date       |
    |------------|----------|--------------------|--------------------|------------|
    | 0          | 0        | Recife (PE)        | Florianopolis (SC) | 09/26/2019 |
    | 0          | 0        | Florianopolis (SC) | Recife (PE)        | 09/30/2019 |

*   **`userCode`**: This links flights to specific users, enabling connections to the `users.csv` data.

*   Different `travelCode` or `userCode` values can point to flights with identical details (route, price, time, distance, agency), for example when several users book the same flight. Because the identifiers make each row look distinct, they hide these repeats. Left unaggregated in an analysis of flight attributes, the repeated rows over-represent frequently booked flight configurations and could **bias** a model toward them.
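
A small illustration of the last point, using made-up bookings rather than rows from the real file: with the identifier columns present no row is an exact duplicate, but once they are dropped the repeated flight configuration becomes visible.

```python
import pandas as pd

# Three hypothetical bookings; the first two describe the same flight offer.
bookings = pd.DataFrame(
    {
        "travelCode": [0, 1, 2],
        "userCode": [0, 1, 2],
        "from": ["Recife (PE)"] * 3,
        "to": ["Florianopolis (SC)"] * 3,
        "flightType": ["firstClass"] * 3,
        "price": [1434.38, 1434.38, 1292.29],
        "agency": ["FlyingDrops"] * 3,
    }
)

# Exact duplicates while the identifiers are still present.
with_ids = int(bookings.duplicated().sum())

# Exact duplicates after removing the identifiers.
flight_attrs = bookings.drop(columns=["travelCode", "userCode"])
without_ids = int(flight_attrs.duplicated().sum())

print("Duplicates with identifiers:   ", with_ids)
print("Duplicates without identifiers:", without_ids)
```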

In [16]:
# Drop the identifier columns; they do not describe the flight itself
flights.drop(columns=["travelCode", "userCode"], inplace=True)

In [17]:
flights.head()

|   | from               | to                 | flightType | price   | time | distance | agency      | date       |
|---|--------------------|--------------------|------------|---------|------|----------|-------------|------------|
| 0 | Recife (PE)        | Florianopolis (SC) | firstClass | 1434.38 | 1.76 | 676.53   | FlyingDrops | 09/26/2019 |
| 1 | Florianopolis (SC) | Recife (PE)        | firstClass | 1292.29 | 1.76 | 676.53   | FlyingDrops | 09/30/2019 |
| 2 | Brasilia (DF)      | Florianopolis (SC) | firstClass | 1487.52 | 1.66 | 637.56   | CloudFy     | 10/03/2019 |
| 3 | Florianopolis (SC) | Brasilia (DF)      | firstClass | 1127.36 | 1.66 | 637.56   | CloudFy     | 10/04/2019 |
| 4 | Aracaju (SE)       | Salvador (BH)      | firstClass | 1684.05 | 2.16 | 830.86   | CloudFy     | 10/10/2019 |


### **Checking and Handling Duplicates After Removing the Identifiers (`travelCode` and `userCode`)**


In [19]:
check_duplicates(flights)


Percentage of rows involved in duplication: 45.02%


- This suggests that many duplicates were hidden by the identifier columns.

In [20]:
# Remove duplicate rows, keeping the first occurrence of each
flights.drop_duplicates(inplace=True)
check_duplicates(flights)


Percentage of rows involved in duplication: 0.00%


### ✈️ **Addressing Duplicate Flight Records**

Our analysis revealed a significant number of duplicate entries within the dataset:

-   **Substantial Duplicate Presence:**
    Once the identifier columns were dropped, **45.02%** of rows were reported as involved in duplication. This high share arises because many distinct `travelCode` and `userCode` values map to identical flight details (route, timing, price, etc.), most likely separate bookings of the same underlying flight service.

-   **Critical Preprocessing Step: Duplicate Removal:**
    🚮 We **remove these duplicate records** before any further analysis or model training.
    -   **Impact on Analysis:** Retaining duplicates weights descriptive statistics and visualizations by booking frequency, so popular flight configurations dominate the picture of routes and prices.
    -   **Impact on Modeling:** Identical rows can end up in both the training and evaluation splits, so evaluation scores partly reward memorization and overstate how well the model generalizes to new flight queries. Removing duplicates yields a more trustworthy performance estimate.
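
The modeling concern can be made concrete with a toy experiment. The sketch below is an illustration under stated assumptions, not part of the original analysis: it builds a frame in which each flight configuration is repeated ten times, holds out a random test split, and measures what share of test rows also appear verbatim in the training split. The helper name `leaked_share` and all values are hypothetical.

```python
import pandas as pd

# Four hypothetical flight configurations, each booked ten times.
configs = pd.DataFrame(
    {
        "from": ["Recife (PE)", "Brasilia (DF)", "Aracaju (SE)", "Florianopolis (SC)"],
        "to": ["Florianopolis (SC)", "Florianopolis (SC)", "Salvador (BH)", "Recife (PE)"],
        "price": [1434.38, 1487.52, 1684.05, 1292.29],
    }
)
with_duplicates = pd.concat([configs] * 10, ignore_index=True)  # 40 rows
deduplicated = with_duplicates.drop_duplicates()  # 4 rows


def leaked_share(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0) -> float:
    """Share of held-out rows that also occur, value for value, in the training rows."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(index=test.index)
    matched = test.merge(
        train.drop_duplicates(), on=list(df.columns), how="left", indicator=True
    )
    return float((matched["_merge"] == "both").mean())


share_before = leaked_share(with_duplicates)
share_after = leaked_share(deduplicated)

print(f"Test rows also present in train (with duplicates): {share_before:.0%}")
print(f"Test rows also present in train (deduplicated):    {share_after:.0%}")
```

With duplicates retained, every held-out row has an exact copy in the training split, so a model could score well on the test set simply by memorizing rows; after deduplication the overlap disappears.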

---