

Flight Delays and Cancellations – Exploratory Data Analysis

This notebook begins the analysis of flight data from 2024. We'll explore the structure of the dataset, clean missing or inconsistent values, and generate initial visualizations to understand flight patterns, delays, and cancellations.

### Goals:
- Load and inspect the dataset
- Identify missing values and data types
- Clean and preprocess the data
- Visualize basic distributions and patterns


### Loading the dataset
We import the flight data from a CSV file and display the first few rows. This verifies that the dataset loads correctly and gives us a quick look at its structure.


In [None]:

import pandas as pd

# Load the dataset
df = pd.read_csv("flight_data_2024.csv")  

# Show the first 5 rows
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Flight_Delays_and_Cancellations_analysis_3/data/flight_data_2024.csv'
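
A `FileNotFoundError` like the one above usually means the path did not resolve relative to the notebook's working directory. Below is a minimal sketch of a more defensive load. The helper name `load_flights` is hypothetical, and parsing `fl_date` as a date is an assumption about how the column will be used. The demo writes a tiny made-up CSV to a temporary directory so the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_flights(csv_path):
    """Read the flight CSV, failing early with a readable message (hypothetical helper)."""
    path = Path(csv_path)
    if not path.is_file():
        # Report the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Flight data not found at {path.resolve()}")
    # parse_dates turns fl_date from object (string) into datetime64
    return pd.read_csv(path, parse_dates=["fl_date"])


# Demo on a tiny synthetic file (made-up rows, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "flight_data_demo.csv"
    demo_path.write_text(
        "fl_date,origin,dep_time,cancelled\n"
        "2024-01-01,JFK,1252.0,0\n"
        "2024-01-02,LAX,,1\n"
    )
    demo_df = load_flights(demo_path)

print(demo_df.dtypes)
```

With the real file, the call would be `df = load_flights("flight_data_2024.csv")`, adjusted to wherever the CSV actually lives.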

### Dataset structure and data types
We use df.info() to examine the number of entries, column names, data types, and missing values. This helps us identify which columns may need cleaning or type conversion.


In [None]:
# Overview of columns, data types, and missing values
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 18 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   year                 1048575 non-null  int64  
 1   month                1048575 non-null  int64  
 2   day_of_month         1048575 non-null  int64  
 3   day_of_week          1048575 non-null  int64  
 4   fl_date              1048575 non-null  object 
 5   origin               1048575 non-null  object 
 6   origin_city_name     1048575 non-null  object 
 7   origin_state_nm      1048575 non-null  object 
 8   dep_time             1026022 non-null  float64
 9   taxi_out             1025450 non-null  float64
 10  wheels_off           1025450 non-null  float64
 11  wheels_on            1024898 non-null  float64
 12  taxi_in              1024898 non-null  float64
 13  cancelled            1048575 non-null  int64  
 14  air_time             1022824 non-null  float64
 15
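
The `object` dtype of fl_date and the 0/1 encoding of cancelled suggest a few type conversions. Here is a sketch on a small made-up frame that mirrors some of the column names above. The target dtypes are assumptions for illustration, not choices the notebook prescribes.

```python
import pandas as pd

# Small synthetic frame mirroring a few columns from the info() output (values are made up)
sample = pd.DataFrame({
    "fl_date": ["2024-01-01", "2024-01-02", "2024-02-15"],
    "origin": ["JFK", "LAX", "JFK"],
    "taxi_out": [31.0, None, 12.0],
    "cancelled": [0, 1, 0],
})

converted = sample.assign(
    # object -> datetime64 so date arithmetic and resampling work
    fl_date=pd.to_datetime(sample["fl_date"], format="%Y-%m-%d"),
    # repeated airport codes are stored compactly as a categorical
    origin=sample["origin"].astype("category"),
    # whole-minute counts with gaps fit pandas' nullable integer type
    taxi_out=sample["taxi_out"].astype("Int64"),
    # 0/1 flag -> boolean
    cancelled=sample["cancelled"].astype(bool),
)

print(converted.dtypes)
```

The nullable `Int64` dtype keeps missing entries as `<NA>` instead of forcing the whole column to float.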

### Summary statistics
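df.describe() provides basic statistics for the numerical columns: count, mean, standard deviation, minimum, maximum, and quartiles. This helps us understand the distribution and scale of metrics such as taxi_out, air_time, weather_delay, and late_aircraft_delay. Note that month takes only the values 1 and 2 in the output, so the loaded rows cover January and February 2024.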


In [None]:
# Summary statistics for numerical columns
df.describe()



| | year | month | day_of_month | day_of_week | dep_time | taxi_out | wheels_off | wheels_on | taxi_in | cancelled | air_time | distance | weather_delay | late_aircraft_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1048575.0 | 1048575.0 | 1048575.0 | 1048575.0 | 1026022.0 | 1025450.0 | 1025450.0 | 1024898.0 | 1024898.0 | 1048575.0 | 1022824.0 | 1048575.0 | 1048575.0 | 1048575.0 |
| mean | 2024.0 | 1.478081 | 15.30512 | 3.893483 | 1325.074 | 18.25012 | 1349.996 | 1476.156 | 8.082517 | 0.02222635 | 116.227 | 834.5389 | 1.194321 | 5.32666 |
| std | 0.0 | 0.4995196 | 8.585503 | 2.010038 | 497.299 | 10.44025 | 498.0426 | 519.8682 | 6.512591 | 0.147419 | 70.91204 | 592.3104 | 20.05819 | 29.75676 |
| min | 2024.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 8.0 | 31.0 | 0.0 | 0.0 |
| 25% | 2024.0 | 1.0 | 8.0 | 2.0 | 911.0 | 12.0 | 929.0 | 1058.0 | 4.0 | 0.0 | 64.0 | 402.0 | 0.0 | 0.0 |
| 50% | 2024.0 | 1.0 | 15.0 | 4.0 | 1323.0 | 15.0 | 1337.0 | 1510.0 | 6.0 | 0.0 | 100.0 | 692.0 | 0.0 | 0.0 |
| 75% | 2024.0 | 2.0 | 23.0 | 6.0 | 1736.0 | 21.0 | 1750.0 | 1914.0 | 9.0 | 0.0 | 147.0 | 1069.0 | 0.0 | 0.0 |
| max | 2024.0 | 2.0 | 31.0 | 7.0 | 2400.0 | 213.0 | 2400.0 | 2400.0 | 444.0 | 1.0 | 723.0 | 5095.0 | 1804.0 | 2100.0 |
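
The dep_time, wheels_off, and wheels_on columns range from 1 to 2400, which suggests clock times encoded as hhmm rather than durations. If that reading is right (it is an assumption based on the value range), their mean and standard deviation are not directly meaningful. A sketch of converting such values to minutes after midnight, using a hypothetical helper `hhmm_to_minutes` and made-up inputs:

```python
import pandas as pd


def hhmm_to_minutes(hhmm):
    """Convert hhmm-coded clock times (e.g. 1325.0 for 13:25) to minutes after midnight."""
    hours = hhmm // 100
    minutes = hhmm % 100
    return hours * 60 + minutes


# Made-up values, including a missing entry and the 2400 edge case
dep_time = pd.Series([5.0, 911.0, 1325.0, 2400.0, None])
dep_minutes = hhmm_to_minutes(dep_time)

print(dep_minutes.tolist())
```

Missing values stay missing, and 2400 maps to 1440 (midnight at the end of the day). Whether to wrap that to 0 depends on the analysis.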


### Missing values per column
We count the number of missing values in each column to identify which features need cleaning or imputation.


In [None]:
# Count missing values per column
df.isnull().sum()



year                       0
month                      0
day_of_month               0
day_of_week                0
fl_date                    0
origin                     0
origin_city_name           0
origin_state_nm            0
dep_time               22553
taxi_out               23125
wheels_off             23125
wheels_on              23677
taxi_in                23677
cancelled                  0
air_time               25751
distance                   0
weather_delay              0
late_aircraft_delay        0
dtype: int64
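
The gaps above affect roughly 2.2 to 2.5 percent of the 1,048,575 rows, which is close to the cancelled mean of about 0.022 in the summary statistics. That suggests, but does not prove, that many of the missing operational values belong to cancelled flights. A sketch of how one might check this, shown on made-up rows rather than the real data:

```python
import pandas as pd

# Made-up rows: cancelled flights have no departure or air-time records
sample = pd.DataFrame({
    "dep_time": [1252.0, None, 905.0, None, 1730.0],
    "air_time": [100.0, None, 64.0, None, None],
    "cancelled": [0, 1, 0, 1, 0],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean().mul(100).round(1)
print(missing_pct)

# Missing-value counts split by cancellation status
missing_by_status = (
    sample.drop(columns="cancelled")
    .isnull()
    .groupby(sample["cancelled"])
    .sum()
)
print(missing_by_status)
```

Any missing values that remain in the `cancelled == 0` row would need a different explanation (diverted flights are one possibility) before deciding between dropping and imputing.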