# Flight Delay Prediction: Data Generation, Wrangling, and EDA Workflow

This notebook provides a comprehensive workflow for the Flight Delay Prediction project, covering:
1. **Synthetic Data Generation**: Creating realistic flight and weather datasets.
2. **Data Wrangling and Cleaning**: Processing raw data, handling missing values, and feature engineering.
3. **Exploratory Data Analysis (EDA)**: Visualizing data to uncover patterns, trends, and insights related to flight delays.

## 1. Synthetic Data Generation

This section generates synthetic airline on-time performance data and synthetic weather data. Because historical data could not be retrieved from the source APIs in this environment, we use synthetic data that mimics the structure and characteristics of real-world datasets from the Bureau of Transportation Statistics (BTS) and the National Oceanic and Atmospheric Administration (NOAA).
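
To make the idea concrete, here is a minimal sketch of the kind of sampling used for departure delays: most flights leave on time, and a fraction draws a delay from an exponential distribution. The helper name `sample_departure_delays` and the use of `np.random.default_rng` are illustrative assumptions; the notebook's own generator uses the legacy `np.random.seed` interface and builds flights one at a time.

```python
import numpy as np


def sample_departure_delays(n_flights, delay_prob=0.3, mean_delay=15.0, seed=0):
    """Return n_flights departure delays in minutes (0 means on time).

    Hypothetical helper: delay_prob and mean_delay mirror the values used
    in the notebook, but this vectorized form is only a sketch.
    """
    rng = np.random.default_rng(seed)
    is_delayed = rng.random(n_flights) < delay_prob       # which flights are delayed
    delays = rng.exponential(mean_delay, size=n_flights)  # candidate delay lengths
    return np.where(is_delayed, delays, 0.0)


delays = sample_departure_delays(10_000)
print(f"Share of delayed flights: {(delays > 0).mean():.2f}")
print(f"Mean delay among delayed flights: {delays[delays > 0].mean():.1f} min")
```

Using a seeded generator object rather than global state keeps the sample reproducible without affecting other random draws in the session.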

### 1.1 Setup and Imports for Data Generation

In [None]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time
import glob

# Ensure necessary directories exist
base_dir = '.'
os.makedirs(os.path.join(base_dir, 'data/raw/airline'), exist_ok=True)
os.makedirs(os.path.join(base_dir, 'data/raw/weather'), exist_ok=True)
os.makedirs(os.path.join(base_dir, 'data/processed'), exist_ok=True)
os.makedirs(os.path.join(base_dir, 'visualizations/general'), exist_ok=True)
os.makedirs(os.path.join(base_dir, 'visualizations/seasonal'), exist_ok=True)
os.makedirs(os.path.join(base_dir, 'visualizations/network'), exist_ok=True)
os.makedirs(os.path.join(base_dir, 'results'), exist_ok=True)

### 1.2 Helper Functions for Data Generation

In [None]:
"""
Data Acquisition Script for Flight Delay Analysis Project (Modified)
The original script downloaded airline on-time performance data from the Bureau of Transportation Statistics (BTS)
and weather data from the National Oceanic and Atmospheric Administration (NOAA).
This version generates synthetic data instead because of API access limitations.
"""

import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time

# Create data directories if they don't exist
os.makedirs('data/raw/airline', exist_ok=True)
os.makedirs('data/raw/weather', exist_ok=True)
os.makedirs('data/processed', exist_ok=True)

def generate_synthetic_airline_data(year, month, airports_df, airlines_df):
    """
    Generate synthetic airline on-time performance data for a specific year and month.
    
    Parameters:
    -----------
    year : int
        Year of data to generate
    month : int
        Month of data to generate
    airports_df : pandas.DataFrame
        DataFrame containing airport information
    airlines_df : pandas.DataFrame
        DataFrame containing airline information
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing the synthetic airline data
    """
    print(f"Generating synthetic airline data for {year}-{month:02d}...")
    
    # Get airport codes
    airport_codes = airports_df['IATA'].tolist()
    
    # Get airline codes
    airline_codes = airlines_df['Code'].tolist()
    
    # Create date range for the month
    start_date = datetime(year, month, 1)
    if month == 12:
        end_date = datetime(year + 1, 1, 1) - timedelta(days=1)
    else:
        end_date = datetime(year, month + 1, 1) - timedelta(days=1)
    
    date_range = pd.date_range(start=start_date, end=end_date)
    
    # Generate synthetic flight data
    all_flights = []
    
    # Set seed for reproducibility
    np.random.seed(year * 100 + month)
    
    # Generate exactly 100 flights per day
    for date in date_range:
        for _ in range(100):
            # Randomly select origin and destination airports (ensure they're different)
            origin_idx = np.random.randint(0, len(airport_codes))
            dest_idx = (origin_idx + np.random.randint(1, len(airport_codes))) % len(airport_codes)
            
            origin = airport_codes[origin_idx]
            dest = airport_codes[dest_idx]
            
            # Randomly select airline
            airline = airline_codes[np.random.randint(0, len(airline_codes))]
            
            # Generate scheduled departure and arrival times
            dep_hour = np.random.randint(6, 22)  # Between 6 AM and 10 PM
            dep_minute = np.random.randint(0, 60)
            scheduled_dep_time = f"{dep_hour:02d}{dep_minute:02d}"
            
            flight_duration = np.random.normal(120, 30)  # Mean 2 hours, std 30 minutes
            flight_duration = max(30, flight_duration)  # Minimum 30 minutes
            
            arr_datetime = date + timedelta(hours=dep_hour, minutes=dep_minute) + timedelta(minutes=flight_duration)
            scheduled_arr_time = f"{arr_datetime.hour:02d}{arr_datetime.minute:02d}"
            
            # Generate delay information
            dep_delay = max(0, np.random.exponential(15) if np.random.random() < 0.3 else 0)
            arr_delay = dep_delay + np.random.normal(0, 10)
            
            # Determine if flight was cancelled or diverted
            cancelled = 1 if np.random.random() < 0.02 else 0  # 2% cancellation rate
            diverted = 1 if not cancelled and np.random.random() < 0.01 else 0  # 1% diversion rate
            
            # Determine delay causes (only if delayed)
            if dep_delay > 15:
                carrier_delay = np.random.exponential(20) if np.random.random() < 0.3 else 0
                weather_delay = np.random.exponential(30) if np.random.random() < 0.2 else 0
                nas_delay = np.random.exponential(15) if np.random.random() < 0.4 else 0
                security_delay = np.random.exponential(10) if np.random.random() < 0.05 else 0
                late_aircraft_delay = np.random.exponential(25) if np.random.random() < 0.3 else 0
            else:
                carrier_delay = weather_delay = nas_delay = security_delay = late_aircraft_delay = 0
            
            # Create flight record
            flight = {
                'FlightDate': date.strftime('%Y-%m-%d'),
                'Reporting_Airline': airline,
                'Origin': origin,
                'Dest': dest,
                'CRSDepTime': scheduled_dep_time,
                'CRSArrTime': scheduled_arr_time,
                'DepDelay': round(dep_delay, 1),
                'ArrDelay': round(arr_delay, 1),
                'Cancelled': cancelled,
                'Diverted': diverted,
                'CarrierDelay': round(carrier_delay) if not cancelled and not diverted else None,
                'WeatherDelay': round(weather_delay) if not cancelled and not diverted else None,
                'NASDelay': round(nas_delay) if not cancelled and not diverted else None,
                'SecurityDelay': round(security_delay) if not cancelled and not diverted else None,
                'LateAircraftDelay': round(late_aircraft_delay) if not cancelled and not diverted else None,
                'Distance': max(50, round(np.random.normal(800, 300)))  # Mean 800 miles, std 300 miles; floored at 50 (arbitrary floor) so the normal draw cannot yield a non-positive distance
            }
            
            all_flights.append(flight)
    
    # Convert to DataFrame
    df = pd.DataFrame(all_flights)
    
    # Save raw data
    output_file = f"data/raw/airline/airline_data_{year}_{month:02d}.csv"
    df.to_csv(output_file, index=False)
    print(f"Saved synthetic airline data to {output_file}")
    
    return df

def generate_synthetic_weather_data(station_ids, start_date, end_date):
    """
    Generate synthetic weather data for specific weather stations and date range.
    
    Parameters:
    -----------
    station_ids : list
        List of NOAA weather station IDs (typically near major airports)
    start_date : str
        Start date in 'YYYY-MM-DD' format
    end_date : str
        End date in 'YYYY-MM-DD' format
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing the synthetic weather data
    """
    print(f"Generating synthetic weather data from {start_date} to {end_date}...")
    
    # Create a date range
    date_range = pd.date_range(start=start_date, end=end_date)
    
    # Create simulated weather data for each station
    all_weather_data = []
    
    for station_id in station_ids:
        # Use part of station ID as seed for reproducibility
        np.random.seed(int(station_id[-3:]))
        
        for date in date_range:
            # Determine season for more realistic weather patterns
            month = date.month
            if month in [12, 1, 2]:
                season = 'winter'
            elif month in [3, 4, 5]:
                season = 'spring'
            elif month in [6, 7, 8]:
                season = 'summer'
            else:
                season = 'fall'
            
            # Adjust temperature based on season
            if season == 'winter':
                temp_mean = 35
                temp_std = 15
            elif season == 'spring':
                temp_mean = 60
                temp_std = 12
            elif season == 'summer':
                temp_mean = 80
                temp_std = 10
            else:  # fall
                temp_mean = 55
                temp_std = 15
            
            # Simulate temperature, precipitation, wind speed, etc.
            data = {
                'STATION': station_id,
                'DATE': date.strftime('%Y-%m-%d'),
                'TEMP': round(np.random.normal(temp_mean, temp_std), 1),  # Temperature in F
                'PRCP': max(0, round(np.random.exponential(0.1), 2)),  # Precipitation in inches
                'WIND': round(np.random.gamma(2, 5), 1),  # Wind speed in mph
                'VISIBILITY': min(10, max(0, round(np.random.normal(8, 3), 1))),  # Visibility in miles
                'CEILING': max(0, round(np.random.normal(20000, 10000))),  # Ceiling height in feet, clamped at 0 so it is never negative
            }
            
            # Adjust precipitation and visibility based on season
            if season == 'winter':
                data['PRCP'] = round(data['PRCP'] * 1.5, 2)  # More precipitation in winter
                data['VISIBILITY'] = round(data['VISIBILITY'] * 0.8, 1)  # Lower visibility in winter
            elif season == 'summer':
                data['PRCP'] = round(data['PRCP'] * 1.2, 2)  # More precipitation in summer (thunderstorms)
            
            all_weather_data.append(data)
    
    # Convert to DataFrame
    weather_df = pd.DataFrame(all_weather_data)
    
    # Save raw data
    output_file = f"data/raw/weather/weather_data_{start_date}_to_{end_date}.csv"
    weather_df.to_csv(output_file, index=False)
    print(f"Saved synthetic weather data to {output_file}")
    
    return weather_df

def get_major_airports():
    """
    Get a list of major US airports.
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing major airport information
    """
    # List of major US airports with their IATA codes, names, and nearby weather station IDs
    major_airports = [
        {'IATA': 'ATL', 'Name': 'Hartsfield-Jackson Atlanta International Airport', 'City': 'Atlanta', 'State': 'GA', 'Weather_Station': 'USW00013874'},
        {'IATA': 'LAX', 'Name': 'Los Angeles International Airport', 'City': 'Los Angeles', 'State': 'CA', 'Weather_Station': 'USW00023174'},
        {'IATA': 'ORD', 'Name': 'O\'Hare International Airport', 'City': 'Chicago', 'State': 'IL', 'Weather_Station': 'USW00094846'},
        {'IATA': 'DFW', 'Name': 'Dallas/Fort Worth International Airport', 'City': 'Dallas', 'State': 'TX', 'Weather_Station': 'USW00003927'},
        {'IATA': 'DEN', 'Name': 'Denver International Airport', 'City': 'Denver', 'State': 'CO', 'Weather_Station': 'USW00023062'},
        {'IATA': 'JFK', 'Name': 'John F. Kennedy International Airport', 'City': 'New York', 'State': 'NY', 'Weather_Station': 'USW00094789'},
        {'IATA': 'SFO', 'Name': 'San Francisco International Airport', 'City': 'San Francisco', 'State': 'CA', 'Weather_Station': 'USW00023234'},
        {'IATA': 'SEA', 'Name': 'Seattle-Tacoma International Airport', 'City': 'Seattle', 'State': 'WA', 'Weather_Station': 'USW00024233'},
        {'IATA': 'LAS', 'Name': 'Harry Reid International Airport', 'City': 'Las Vegas', 'State': 'NV', 'Weather_Station': 'USW00023169'},
        {'IATA': 'MCO', 'Name': 'Orlando International Airport', 'City': 'Orlando', 'State': 'FL', 'Weather_Station': 'USW00012815'},
        {'IATA': 'MIA', 'Name': 'Miami International Airport', 'City': 'Miami', 'State': 'FL', 'Weather_Station': 'USW00012839'},
        {'IATA': 'CLT', 'Name': 'Charlotte Douglas International Airport', 'City': 'Charlotte', 'State': 'NC', 'Weather_Station': 'USW00013881'},
        {'IATA': 'PHX', 'Name': 'Phoenix Sky Harbor International Airport', 'City': 'Phoenix', 'State': 'AZ', 'Weather_Station': 'USW00023183'},
        {'IATA': 'IAH', 'Name': 'George Bush Intercontinental Airport', 'City': 'Houston', 'State': 'TX', 'Weather_Station': 'USW00012960'},
        {'IATA': 'BOS', 'Name': 'Boston Logan International Airport', 'City': 'Boston', 'State': 'MA', 'Weather_Station': 'USW00014739'}
    ]
    
    # Convert to DataFrame
    airports_df = pd.DataFrame(major_airports)
    
    # Save to CSV
    airports_df.to_csv("data/raw/major_airports.csv", index=False)
    print("Saved major airports data to data/raw/major_airports.csv")
    
    return airports_df

def get_major_airlines():
    """
    Get a list of major US airlines.
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing major airline information
    """
    # List of major US airlines with their codes and names
    major_airlines = [
        {'Code': 'AA', 'Name': 'American Airlines'},
        {'Code': 'DL', 'Name': 'Delta Air Lines'},
        {'Code': 'UA', 'Name': 'United Airlines'},
        {'Code': 'WN', 'Name': 'Southwest Airlines'},
        {'Code': 'B6', 'Name': 'JetBlue Airways'},
        {'Code': 'AS', 'Name': 'Alaska Airlines'},
        {'Code': 'NK', 'Name': 'Spirit Airlines'},
        {'Code': 'F9', 'Name': 'Frontier Airlines'},
        {'Code': 'G4', 'Name': 'Allegiant Air'},
        {'Code': 'HA', 'Name': 'Hawaiian Airlines'}
    ]
    
    # Convert to DataFrame
    airlines_df = pd.DataFrame(major_airlines)
    
    # Save to CSV
    airlines_df.to_csv("data/raw/major_airlines.csv", index=False)
    print("Saved major airlines data to data/raw/major_airlines.csv")
    
    return airlines_df

def main():
    """Main function to orchestrate data acquisition."""
    print("Starting data acquisition process...")
    
    # Get major airports and airlines
    airports_df = get_major_airports()
    airlines_df = get_major_airlines()
    
    # Define seasons for recent years
    seasons = [
        {'name': 'Winter_2023', 'year': 2023, 'months': [1, 2, 12]},
        {'name': 'Spring_2023', 'year': 2023, 'months': [3, 4, 5]},
        {'name': 'Summer_2023', 'year': 2023, 'months': [6, 7, 8]},
        {'name': 'Fall_2023', 'year': 2023, 'months': [9, 10, 11]},
        {'name': 'Winter_2024', 'year': 2024, 'months': [1, 2]}
    ]
    
    # Generate synthetic airline data for each season
    for season in seasons:
        season_name = season['name']
        year = season['year']
        months = season['months']
        
        print(f"\nProcessing {season_name}...")
        
        # Create a directory for this season
        # (currently unused: monthly files are written directly to data/raw/airline/)
        os.makedirs(f"data/raw/airline/{season_name}", exist_ok=True)
        
        # Generate data for each month in the season
        for month in months:
            # Adjust year for December in winter seasons
            actual_year = year - 1 if month == 12 and 'Winter' in season_name else year
            
            # Generate the data
            df = generate_synthetic_airline_data(actual_year, month, airports_df, airlines_df)
            
            # Add a small delay to simulate processing time
            time.sleep(0.5)
    
    # Get weather station IDs from airports DataFrame
    weather_stations = airports_df['Weather_Station'].tolist()
    
    # Generate synthetic weather data for each season
    for season in seasons:
        season_name = season['name']
        year = season['year']
        months = season['months']
        
        # Define date ranges for each season
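        if 'Winter' in season_name:
            # The last day of February depends on the year (2024 is a leap year),
            # so derive it instead of hard-coding the 28th
            feb_end = (pd.Timestamp(year=year, month=3, day=1) - pd.Timedelta(days=1)).strftime('%Y-%m-%d')
            if 12 in months:  # Winter spans across years
                start_date = f"{year-1}-12-01"
                end_date = feb_end
            else:  # Just January and February
                start_date = f"{year}-01-01"
                end_date = feb_end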
        elif 'Spring' in season_name:
            start_date = f"{year}-03-01"
            end_date = f"{year}-05-31"
        elif 'Summer' in season_name:
            start_date = f"{year}-06-01"
            end_date = f"{year}-08-31"
        elif 'Fall' in season_name:
            start_date = f"{year}-09-01"
            end_date = f"{year}-11-30"
        
        # Generate weather data
        weather_df = generate_synthetic_weather_data(weather_stations, start_date, end_date)
    
    print("\nData acquisition process completed.")



### 1.3 Execute Synthetic Data Generation

The following cell executes the main data generation process. It writes monthly airline CSV files to `data/raw/airline`, seasonal weather CSV files to `data/raw/weather`, and the airport and airline reference tables to `data/raw`.

In [None]:

main()
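
As a quick sanity check on the generation step, the sketch below computes how many flight rows the monthly files should contain, assuming the fixed 100 flights per day used by `generate_synthetic_airline_data`. The helper name `expected_flight_rows` is hypothetical and not part of the notebook.

```python
import calendar


def expected_flight_rows(year, month, flights_per_day=100):
    """Rows expected in one monthly airline file (hypothetical helper)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * flights_per_day


# Months generated by main(): December 2022 through February 2024,
# except December 2023, which no season in the list covers.
months = [(2022, 12)] + [(2023, m) for m in range(1, 12)] + [(2024, 1), (2024, 2)]
total_rows = sum(expected_flight_rows(y, m) for y, m in months)
print(f"{len(months)} monthly files, {total_rows} flight rows expected")
```

Comparing this total with the record count reported when the raw files are loaded is a cheap way to catch a missing or truncated file.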


## 2. Data Wrangling and Cleaning

This section focuses on loading the raw synthetic data, cleaning it, handling missing values, and performing initial feature engineering to prepare the datasets for analysis.
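
One of the engineered features is a delay category derived from the arrival delay. The sketch below shows a vectorized way to express the same thresholds with `pd.cut`, as an alternative to applying a Python function row by row. The toy series is made up for illustration, and filling missing delays with `'On Time'` is an assumption chosen to match the notebook's row-wise helper.

```python
import numpy as np
import pandas as pd

# Toy arrival delays in minutes, including a missing value
arr_delay = pd.Series([-5.0, 15.0, 15.1, 30.0, 45.0, 60.0, 61.0, np.nan])

# Right-closed bins: (-inf, 15], (15, 30], (30, 60], (60, inf)
bins = [-np.inf, 15, 30, 60, np.inf]
labels = ['On Time', 'Minor Delay', 'Moderate Delay', 'Severe Delay']

delay_category = pd.cut(arr_delay, bins=bins, labels=labels)
# Treat a missing delay as on time, matching the row-wise helper in the notebook
delay_category = delay_category.fillna('On Time')

print(delay_category.tolist())
```

Because the bins are closed on the right, a delay of exactly 15 minutes counts as on time, which is consistent with the `> 15` test used for the binary delay indicator.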

### 2.1 Imports for Data Wrangling

In [None]:
# Imports are already covered in the first code cell, but good to reiterate for section clarity
import pandas as pd
import numpy as np
from datetime import datetime
import glob

### 2.2 Helper Functions for Data Wrangling

In [None]:
"""
Data Wrangling and Cleaning Script for Flight Delay Analysis Project
This script processes the synthetic airline and weather data, cleans it,
and prepares it for exploratory data analysis and network analysis.
"""

import os
import pandas as pd
import numpy as np
from datetime import datetime
import glob

def load_airline_data():
    """
    Load and combine all airline data files.
    
    Returns:
    --------
    pandas.DataFrame
        Combined DataFrame containing all airline data
    """
    print("Loading airline data...")
    
    # Get all airline data files
    airline_files = glob.glob('data/raw/airline/airline_data_*.csv')
    
    if not airline_files:
        raise FileNotFoundError("No airline data files found.")
    
    # Load and combine all files
    dfs = []
    for file in airline_files:
        df = pd.read_csv(file)
        dfs.append(df)
    
    # Combine all DataFrames
    combined_df = pd.concat(dfs, ignore_index=True)
    
    print(f"Loaded {len(combined_df)} flight records from {len(airline_files)} files.")
    
    return combined_df

def load_weather_data():
    """
    Load and combine all weather data files.
    
    Returns:
    --------
    pandas.DataFrame
        Combined DataFrame containing all weather data
    """
    print("Loading weather data...")
    
    # Get all weather data files
    weather_files = glob.glob('data/raw/weather/weather_data_*.csv')
    
    if not weather_files:
        raise FileNotFoundError("No weather data files found.")
    
    # Load and combine all files
    dfs = []
    for file in weather_files:
        df = pd.read_csv(file)
        dfs.append(df)
    
    # Combine all DataFrames
    combined_df = pd.concat(dfs, ignore_index=True)
    
    print(f"Loaded {len(combined_df)} weather records from {len(weather_files)} files.")
    
    return combined_df

def load_airports_data():
    """
    Load airport reference data.
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing airport information
    """
    print("Loading airport data...")
    
    # Load airport data
    airports_df = pd.read_csv('data/raw/major_airports.csv')
    
    print(f"Loaded information for {len(airports_df)} airports.")
    
    return airports_df

def load_airlines_data():
    """
    Load airline reference data.
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame containing airline information
    """
    print("Loading airline reference data...")
    
    # Load airline data
    airlines_df = pd.read_csv('data/raw/major_airlines.csv')
    
    print(f"Loaded information for {len(airlines_df)} airlines.")
    
    return airlines_df

def clean_airline_data(df):
    """
    Clean and preprocess airline data.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Raw airline data
    
    Returns:
    --------
    pandas.DataFrame
        Cleaned airline data
    """
    print("Cleaning airline data...")
    
    # Make a copy to avoid modifying the original
    df_clean = df.copy()
    
    # Convert date to datetime
    df_clean['FlightDate'] = pd.to_datetime(df_clean['FlightDate'])
    
    # Extract year, month, day, day of week
    df_clean['Year'] = df_clean['FlightDate'].dt.year
    df_clean['Month'] = df_clean['FlightDate'].dt.month
    df_clean['Day'] = df_clean['FlightDate'].dt.day
    df_clean['DayOfWeek'] = df_clean['FlightDate'].dt.dayofweek + 1  # 1=Monday, 7=Sunday
    
    # Determine season
    season_map = {
        1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring',
        6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall',
        11: 'Fall', 12: 'Winter'
    }
    df_clean['Season'] = df_clean['Month'].map(season_map)
    
    # Convert scheduled times to proper time format
    def format_time(time_str):
        if pd.isna(time_str):
            return np.nan
        time_str = str(int(time_str)).zfill(4)
        hour = int(time_str[:-2])
        minute = int(time_str[-2:])
        return f"{hour:02d}:{minute:02d}"
    
    df_clean['CRSDepTime_Formatted'] = df_clean['CRSDepTime'].apply(format_time)
    df_clean['CRSArrTime_Formatted'] = df_clean['CRSArrTime'].apply(format_time)
    
    # Create a binary delay indicator (1 if arrival delay > 15 minutes)
    df_clean['IsDelayed'] = (df_clean['ArrDelay'] > 15).astype(int)
    
    # Create delay categories
    def categorize_delay(minutes):
        if pd.isna(minutes) or minutes <= 15:
            return 'On Time'
        elif minutes <= 30:
            return 'Minor Delay'
        elif minutes <= 60:
            return 'Moderate Delay'
        else:
            return 'Severe Delay'
    
    df_clean['DelayCategory'] = df_clean['ArrDelay'].apply(categorize_delay)
    
    # Fill missing delay cause values with 0
    delay_causes = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
    df_clean[delay_causes] = df_clean[delay_causes].fillna(0)
    
    # Create a total delay minutes column
    df_clean['TotalDelayMinutes'] = df_clean[delay_causes].sum(axis=1)
    
    # Create a primary delay cause column
    def get_primary_delay_cause(row):
        causes = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
        max_val = 0
        max_cause = 'None'
        
        for cause in causes:
            if row[cause] > max_val:
                max_val = row[cause]
                max_cause = cause.replace('Delay', '')
        
        return max_cause
    
    df_clean['PrimaryDelayCause'] = df_clean.apply(get_primary_delay_cause, axis=1)
    
    print(f"Airline data cleaned. Shape: {df_clean.shape}")
    
    return df_clean

def clean_weather_data(df):
    """
    Clean and preprocess weather data.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Raw weather data
    
    Returns:
    --------
    pandas.DataFrame
        Cleaned weather data
    """
    print("Cleaning weather data...")
    
    # Make a copy to avoid modifying the original
    df_clean = df.copy()
    
    # Convert date to datetime
    df_clean['DATE'] = pd.to_datetime(df_clean['DATE'])
    
    # Extract year, month, day
    df_clean['Year'] = df_clean['DATE'].dt.year
    df_clean['Month'] = df_clean['DATE'].dt.month
    df_clean['Day'] = df_clean['DATE'].dt.day
    
    # Determine season
    season_map = {
        1: 'Winter', 2: 'Winter', 3: 'Spring', 4: 'Spring', 5: 'Spring',
        6: 'Summer', 7: 'Summer', 8: 'Summer', 9: 'Fall', 10: 'Fall',
        11: 'Fall', 12: 'Winter'
    }
    df_clean['Season'] = df_clean['Month'].map(season_map)
    
    # Handle missing values
    df_clean['TEMP'] = df_clean['TEMP'].fillna(df_clean['TEMP'].mean())
    df_clean['PRCP'] = df_clean['PRCP'].fillna(0)  # Assume no precipitation if missing
    df_clean['WIND'] = df_clean['WIND'].fillna(df_clean['WIND'].mean())
    df_clean['VISIBILITY'] = df_clean['VISIBILITY'].fillna(df_clean['VISIBILITY'].mean())
    df_clean['CEILING'] = df_clean['CEILING'].fillna(df_clean['CEILING'].mean())
    
    # Create weather condition categories
    def categorize_weather(row):
        if row['PRCP'] > 0.5:  # Heavy precipitation
            return 'Severe'
        elif row['PRCP'] > 0.1 or row['VISIBILITY'] < 3 or row['WIND'] > 20:
            return 'Moderate'
        elif row['PRCP'] > 0 or row['VISIBILITY'] < 7 or row['WIND'] > 10:
            return 'Mild'
        else:
            return 'Clear'
    
    df_clean['WeatherCondition'] = df_clean.apply(categorize_weather, axis=1)
    
    print(f"Weather data cleaned. Shape: {df_clean.shape}")
    
    return df_clean

def merge_airport_weather_data(weather_df, airports_df):
    """
    Merge weather data with airport information.
    
    Parameters:
    -----------
    weather_df : pandas.DataFrame
        Cleaned weather data
    airports_df : pandas.DataFrame
        Airport reference data
    
    Returns:
    --------
    pandas.DataFrame
        Weather data with airport information
    """
    print("Merging weather data with airport information...")
    
    # Create a mapping from weather station to airport code
    station_to_airport = dict(zip(airports_df['Weather_Station'], airports_df['IATA']))
    
    # Add airport code to weather data (on a copy, so the caller's DataFrame is not modified)
    weather_df = weather_df.copy()
    weather_df['Airport'] = weather_df['STATION'].map(station_to_airport)
    
    # Merge with airport information
    weather_airport_df = weather_df.merge(
        airports_df[['IATA', 'Name', 'City', 'State']],
        left_on='Airport',
        right_on='IATA',
        how='left'
    )
    
    print(f"Weather data merged with airport information. Shape: {weather_airport_df.shape}")
    
    return weather_airport_df

def prepare_data_for_network_analysis(airline_df, airports_df):
    """
    Prepare data specifically for network analysis.
    
    Parameters:
    -----------
    airline_df : pandas.DataFrame
        Cleaned airline data
    airports_df : pandas.DataFrame
        Airport reference data
    
    Returns:
    --------
    tuple
        (airport_network_df, airline_network_df) - DataFrames prepared for network analysis
    """
    print("Preparing data for network analysis...")
    
    # Create airport-to-airport network data
    # Count flights between each airport pair
    airport_network = airline_df.groupby(['Origin', 'Dest']).size().reset_index(name='Flights')
    
    # Add average delay between airport pairs
    delay_by_route = airline_df.groupby(['Origin', 'Dest'])['ArrDelay'].mean().reset_index(name='AvgDelay')
    airport_network = airport_network.merge(delay_by_route, on=['Origin', 'Dest'])
    
    # Add distance
    distance_by_route = airline_df.groupby(['Origin', 'Dest'])['Distance'].mean().reset_index(name='Distance')
    airport_network = airport_network.merge(distance_by_route, on=['Origin', 'Dest'])
    
    # Create airline route network data
    # Count flights by airline and route
    airline_network = airline_df.groupby(['Reporting_Airline', 'Origin', 'Dest']).size().reset_index(name='Flights')
    
    # Add average delay by airline and route
    delay_by_airline_route = airline_df.groupby(['Reporting_Airline', 'Origin', 'Dest'])['ArrDelay'].mean().reset_index(name='AvgDelay')
    airline_network = airline_network.merge(delay_by_airline_route, on=['Reporting_Airline', 'Origin', 'Dest'])
    
    # Add distance
    airline_network = airline_network.merge(distance_by_route, on=['Origin', 'Dest'])
    
    print(f"Network analysis data prepared. Airport network shape: {airport_network.shape}, Airline network shape: {airline_network.shape}")
    
    return airport_network, airline_network

def main():
    """Main function to orchestrate data wrangling and cleaning."""
    print("Starting data wrangling and cleaning process...")
    
    # Create processed data directory if it doesn't exist
    os.makedirs('data/processed', exist_ok=True)
    
    # Load data
    airline_df = load_airline_data()
    weather_df = load_weather_data()
    airports_df = load_airports_data()
    airlines_df = load_airlines_data()
    
    # Clean data
    airline_df_clean = clean_airline_data(airline_df)
    weather_df_clean = clean_weather_data(weather_df)
    
    # Merge weather data with airport information
    weather_airport_df = merge_airport_weather_data(weather_df_clean, airports_df)
    
    # Prepare data for network analysis
    airport_network_df, airline_network_df = prepare_data_for_network_analysis(airline_df_clean, airports_df)
    
    # Save processed data
    airline_df_clean.to_csv('data/processed/airline_data_clean.csv', index=False)
    weather_airport_df.to_csv('data/processed/weather_data_clean.csv', index=False)
    airport_network_df.to_csv('data/processed/airport_network.csv', index=False)
    airline_network_df.to_csv('data/processed/airline_network.csv', index=False)
    
    print("Data wrangling and cleaning process completed.")
    print("Saved processed airline data to data/processed/airline_data_clean.csv")
    print("Saved processed weather data to data/processed/weather_data_clean.csv")
    print("Saved airport network data to data/processed/airport_network.csv")
    print("Saved airline network data to data/processed/airline_network.csv")



### 2.3 Execute Data Wrangling

This cell runs the data wrangling pipeline. It loads the raw data generated in the previous step, cleans it, and saves the processed files to the `data/processed` directory.

In [None]:

main()
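
The weather-to-airport join in this pipeline relies on each weather station ID appearing only once in the airport table. The sketch below shows one way to make that assumption explicit with the `validate` argument of `DataFrame.merge`. The two small frames are toy stand-ins; the station IDs come from the airport list above, while the temperature values are made up.

```python
import pandas as pd

# Toy stand-ins for the airport reference table and the cleaned weather table
airports = pd.DataFrame({
    'IATA': ['ATL', 'LAX'],
    'City': ['Atlanta', 'Los Angeles'],
    'Weather_Station': ['USW00013874', 'USW00023174'],
})
weather = pd.DataFrame({
    'STATION': ['USW00013874', 'USW00013874', 'USW00023174'],
    'DATE': ['2023-01-01', '2023-01-02', '2023-01-01'],
    'TEMP': [41.2, 38.5, 57.0],
})

# validate='many_to_one' raises pandas.errors.MergeError if a station ID
# appears more than once in the airport table
merged = weather.merge(
    airports,
    left_on='STATION',
    right_on='Weather_Station',
    how='left',
    validate='many_to_one',
)
print(merged[['DATE', 'IATA', 'City', 'TEMP']])
```

With a left join, any weather row whose station has no matching airport would show up with a missing `IATA` value, so checking `merged['IATA'].isna().sum()` afterwards is a useful follow-up.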


## 3. Exploratory Data Analysis (EDA)

In this section, we perform EDA on the cleaned and processed data. This involves creating various visualizations to understand distributions, relationships, seasonal patterns, and network characteristics related to flight delays.
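
Before plotting, it often helps to look at the aggregated numbers a chart would show. The sketch below computes the flight count, delay rate, and mean arrival delay per season on a toy frame shaped like the cleaned airline table. The column names follow the wrangling step, while the values are made up for illustration.

```python
import pandas as pd

# Toy slice shaped like the cleaned airline table (values are made up)
flights = pd.DataFrame({
    'Season': ['Winter', 'Winter', 'Summer', 'Summer', 'Summer', 'Fall'],
    'IsDelayed': [1, 0, 0, 0, 1, 0],
    'ArrDelay': [42.0, 3.5, -2.0, 0.0, 75.0, 10.0],
})

# One row per season: number of flights, share delayed, mean arrival delay
summary = (
    flights.groupby('Season')
    .agg(
        flights=('IsDelayed', 'size'),
        delay_rate=('IsDelayed', 'mean'),
        avg_arr_delay=('ArrDelay', 'mean'),
    )
    .sort_values('delay_rate', ascending=False)
)
print(summary)
```

Since `IsDelayed` is a 0/1 indicator, its mean is the share of delayed flights in each group.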

### 3.1 Imports for EDA

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
import networkx as nx

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")

### 3.2 Helper Functions for EDA

In [None]:
"""
Exploratory Data Analysis (EDA) Script for Flight Delay Analysis Project
This script performs exploratory data analysis on the processed airline and weather data,
creating visualizations to understand patterns and relationships in the data.
"""

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
import networkx as nx
from datetime import datetime

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")

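# Create directories for visualizations and summary tables
os.makedirs('visualizations/general', exist_ok=True)
os.makedirs('visualizations/seasonal', exist_ok=True)
os.makedirs('visualizations/network', exist_ok=True)
os.makedirs('results', exist_ok=True)  # general_statistics() writes its CSV summaries here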

def load_processed_data():
    """
    Load all processed data files.
    
    Returns:
    --------
    tuple
        (airline_df, weather_df, airport_network_df, airline_network_df, airports_df, airlines_df)
    """
    print("Loading processed data...")
    
    # Load processed data
    airline_df = pd.read_csv('data/processed/airline_data_clean.csv')
    weather_df = pd.read_csv('data/processed/weather_data_clean.csv')
    airport_network_df = pd.read_csv('data/processed/airport_network.csv')
    airline_network_df = pd.read_csv('data/processed/airline_network.csv')
    
    # Load reference data
    airports_df = pd.read_csv('data/raw/major_airports.csv')
    airlines_df = pd.read_csv('data/raw/major_airlines.csv')
    
    # Convert dates to datetime
    airline_df['FlightDate'] = pd.to_datetime(airline_df['FlightDate'])
    weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])
    
    print("Data loaded successfully.")
    
    return airline_df, weather_df, airport_network_df, airline_network_df, airports_df, airlines_df

def general_statistics(airline_df, weather_df):
    """
    Generate general statistics and visualizations about the dataset.
    
    Parameters:
    -----------
    airline_df : pandas.DataFrame
        Processed airline data
    weather_df : pandas.DataFrame
        Processed weather data
    """
    print("Generating general statistics and visualizations...")
    
    # Summary statistics for airline data
    airline_stats = airline_df.describe(include='all')
    airline_stats.to_csv('results/airline_summary_statistics.csv')
    
    # Summary statistics for weather data
    weather_stats = weather_df.describe(include='all')
    weather_stats.to_csv('results/weather_summary_statistics.csv')
    
    # Flight counts by airline
    plt.figure(figsize=(12, 6))
    airline_counts = airline_df['Reporting_Airline'].value_counts()
    sns.barplot(x=airline_counts.index, y=airline_counts.values)
    plt.title('Number of Flights by Airline', fontsize=16)
    plt.xlabel('Airline', fontsize=14)
    plt.ylabel('Number of Flights', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('visualizations/general/flights_by_airline.png', dpi=300)
    plt.close()
    
    # Flight counts by airport (origin)
    plt.figure(figsize=(12, 6))
    origin_counts = airline_df['Origin'].value_counts()
    sns.barplot(x=origin_counts.index, y=origin_counts.values)
    plt.title('Number of Departing Flights by Airport', fontsize=16)
    plt.xlabel('Airport', fontsize=14)
    plt.ylabel('Number of Flights', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('visualizations/general/flights_by_origin.png', dpi=300)
    plt.close()
    
    # Delay distribution
    plt.figure(figsize=(12, 6))
    sns.histplot(airline_df['ArrDelay'].dropna(), bins=50, kde=True)
    plt.title('Distribution of Arrival Delays', fontsize=16)
    plt.xlabel('Arrival Delay (minutes)', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    plt.axvline(x=15, color='red', linestyle='--', label='15-minute threshold')
    plt.legend()
    plt.tight_layout()
    plt.savefig('visualizations/general/delay_distribution.png', dpi=300)
    plt.close()
    
    # Delay categories
    plt.figure(figsize=(10, 6))
    delay_cats = airline_df['DelayCategory'].value_counts()
    sns.barplot(x=delay_cats.index, y=delay_cats.values)
    plt.title('Flight Delay Categories', fontsize=16)
    plt.xlabel('Delay Category', fontsize=14)
    plt.ylabel('Number of Flights', fontsize=14)
    plt.tight_layout()
    plt.savefig('visualizations/general/delay_categories.png', dpi=300)
    plt.close()
    
    # Primary delay causes
    plt.figure(figsize=(10, 6))
    delay_causes = airline_df['PrimaryDelayCause'].value_counts()
    sns.barplot(x=delay_causes.index, y=delay_causes.values)
    plt.title('Primary Causes of Delay', fontsize=16)
    plt.xlabel('Delay Cause', fontsize=14)
    plt.ylabel('Number of Flights', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('visualizations/general/delay_causes.png', dpi=300)
    plt.close()
    
    # Weather condition distribution
    plt.figure(figsize=(10, 6))
    weather_conds = weather_df['WeatherCondition'].value_counts()
    sns.barplot(x=weather_conds.index, y=weather_conds.values)
    plt.title('Distribution of Weather Conditions', fontsize=16)
    plt.xlabel('Weather Condition', fontsize=14)
    plt.ylabel('Number of Days', fontsize=14)
    plt.tight_layout()
    plt.savefig('visualizations/general/weather_conditions.png', dpi=300)
    plt.close()
    
    print("General statistics and visualizations generated.")

def seasonal_analysis(airline_df, weather_df):
    """
    Perform seasonal analysis with visualizations.
    
    Parameters:
    -----------
    airline_df : pandas.DataFrame
        Processed airline data
    weather_df : pandas.DataFrame
        Processed weather data
    """
    print("Performing seasonal analysis...")
    
    # Average delay by season
    plt.figure(figsize=(10, 6))
    season_delay = airline_df.groupby('Season')['ArrDelay'].mean().reindex(['Winter', 'Spring', 'Summer', 'Fall'])
    sns.barplot(x=season_delay.index, y=season_delay.values)
    plt.title('Average Arrival Delay by Season', fontsize=16)
    plt.xlabel('Season', fontsize=14)
    plt.ylabel('Average Delay (minutes)', fontsize=14)
    plt.tight_layout()
    plt.savefig('visualizations/seasonal/avg_delay_by_season.png', dpi=300)
    plt.close()
    
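    # Delay categories by season
    season_delay_cat = pd.crosstab(airline_df['Season'], airline_df['DelayCategory'])
    season_delay_cat = season_delay_cat.reindex(['Winter', 'Spring', 'Summer', 'Fall'])
    # Pass an explicit Axes: DataFrame.plot otherwise opens its own figure,
    # ignoring the requested size and leaving an empty figure open.
    fig, ax = plt.subplots(figsize=(12, 6))
    season_delay_cat.plot(kind='bar', stacked=True, ax=ax)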
    plt.title('Delay Categories by Season', fontsize=16)
    plt.xlabel('Season', fontsize=14)
    plt.ylabel('Number of Flights', fontsize=14)
    plt.legend(title='Delay Category')
    plt.tight_layout()
    plt.savefig('visualizations/seasonal/delay_categories_by_season.png', dpi=300)
    plt.close()
    
    # Primary delay causes by season
    season_cause = pd.crosstab(airline_df['Season'], airline_df['PrimaryDelayCause'])
    season_cause = season_cause.reindex(['Winter', 'Spring', 'Summer', 'Fall'])
    # Plot on an explicit Axes so the figure size applies and no empty figure is left open
    fig, ax = plt.subplots(figsize=(14, 8))
    season_cause.plot(kind='bar', stacked=True, ax=ax)
    plt.title('Primary Delay Causes by Season', fontsize=16)
    plt.xlabel('Season', fontsize=14)
    plt.ylabel('Number of Flights', fontsize=14)
    plt.legend(title='Delay Cause')
    plt.tight_layout()
    plt.savefig('visualizations/seasonal/delay_causes_by_season.png', dpi=300)
    plt.close()
    
    # Weather conditions by season
    weather_season = pd.crosstab(weather_df['Season'], weather_df['WeatherCondition'])
    weather_season = weather_season.reindex(['Winter', 'Spring', 'Summer', 'Fall'])
    # Plot on an explicit Axes so the figure size applies and no empty figure is left open
    fig, ax = plt.subplots(figsize=(12, 6))
    weather_season.plot(kind='bar', stacked=True, ax=ax)
    plt.title('Weather Conditions by Season', fontsize=16)
    plt.xlabel('Season', fontsize=14)
    plt.ylabel('Number of Days', fontsize=14)
    plt.legend(title='Weather Condition')
    plt.tight_layout()
    plt.savefig('visualizations/seasonal/weather_by_season.png', dpi=300)
    plt.close()
    
    # Average temperature by season
    plt.figure(figsize=(10, 6))
    season_temp = weather_df.groupby('Season')['TEMP'].mean().reindex(['Winter', 'Spring', 'Summer', 'Fall'])
    sns.barplot(x=season_temp.index, y=season_temp.values)
    plt.title('Average Temperature by Season', fontsize=16)
    plt.xlabel('Season', fontsize=14)
    plt.ylabel('Average Temperature (°F)', fontsize=14)
    plt.tight_layout()
    plt.savefig('visualizations/seasonal/avg_temp_by_season.png', dpi=300)
    plt.close()
    
    # Heatmap of average delay by month and day of week
    monthly_dow_delay = airline_df.pivot_table(
        values='ArrDelay', 
        index='DayOfWeek', 
        columns='Month', 
        aggfunc='mean'
    )
    
    plt.figure(figsize=(12, 8))
    sns.heatmap(monthly_dow_delay, cmap='YlOrRd', annot=True, fmt='.1f')
    plt.title('Average Arrival Delay by Month and Day of Week', fontsize=16)
    plt.xlabel('Month', fontsize=14)
    plt.ylabel('Day of Week (1=Monday, 7=Sunday)', fontsize=14)
    plt.tight_layout()
    plt.savefig('visualizations/seasonal/delay_heatmap_month_dow.png', dpi=300)
    plt.close()
    
    print("Seasonal analysis completed.")

def network_analysis_visualization(airport_network_df, airline_network_df, airports_df, airlines_df):
    """
    Perform initial network analysis and visualizations.
    
    Parameters:
    -----------
    airport_network_df : pandas.DataFrame
        Airport network data
    airline_network_df : pandas.DataFrame
        Airline network data
    airports_df : pandas.DataFrame
        Airport reference data
    airlines_df : pandas.DataFrame
        Airline reference data
    """
    print("Performing network analysis and visualizations...")
    
    # Create a graph for airport network
    G_airport = nx.DiGraph()
    
    # Add nodes (airports)
    for _, airport in airports_df.iterrows():
        G_airport.add_node(airport['IATA'], name=airport['Name'], city=airport['City'], state=airport['State'])
    
    # Add edges (routes)
    for _, route in airport_network_df.iterrows():
        G_airport.add_edge(
            route['Origin'], 
            route['Dest'], 
            weight=route['Flights'],
            delay=route['AvgDelay'],
            distance=route['Distance']
        )
    
    # Calculate network metrics for airports
    airport_metrics = {
        'Airport': [],
        'Degree': [],
        'In_Degree': [],
        'Out_Degree': [],
        'Betweenness': [],
        'Eigenvector': []
    }
    
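    # Note: NetworkX interprets `weight` in betweenness centrality as a path cost,
    # so routes with more flights are treated as longer paths here.
    betweenness = nx.betweenness_centrality(G_airport, weight='weight')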
    eigenvector = nx.eigenvector_centrality(G_airport, weight='weight', max_iter=1000)
    
    for node in G_airport.nodes():
        airport_metrics['Airport'].append(node)
        airport_metrics['Degree'].append(G_airport.degree(node, weight='weight'))
        airport_metrics['In_Degree'].append(G_airport.in_degree(node, weight='weight'))
        airport_metrics['Out_Degree'].append(G_airport.out_degree(node, weight='weight'))
        airport_metrics['Betweenness'].append(betweenness[node])
        airport_metrics['Eigenvector'].append(eigenvector[node])
    
    airport_metrics_df = pd.DataFrame(airport_metrics)
    airport_metrics_df = airport_metrics_df.sort_values('Degree', ascending=False)
    airport_metrics_df.to_csv('results/airport_network_metrics.csv', index=False)
    
    # Visualize airport network
    plt.figure(figsize=(12, 10))
    pos = nx.spring_layout(G_airport, seed=42)
    
    # Draw nodes with size based on degree
    node_sizes = [G_airport.degree(node, weight='weight') * 10 for node in G_airport.nodes()]
    nx.draw_networkx_nodes(G_airport, pos, node_size=node_sizes, node_color='skyblue', alpha=0.8)
    
    # Draw edges with width based on number of flights
    edge_weights = [G_airport[u][v]['weight'] / 50 for u, v in G_airport.edges()]
    nx.draw_networkx_edges(G_airport, pos, width=edge_weights, alpha=0.5, edge_color='gray', arrows=True, arrowsize=10)
    
    # Draw labels
    nx.draw_networkx_labels(G_airport, pos, font_size=10, font_family='sans-serif')
    
    plt.title('Airport Network: Connections Between Major Airports', fontsize=16)
    plt.axis('off')
    plt.tight_layout()
    plt.savefig('visualizations/network/airport_network.png', dpi=300)
    plt.close()
    
    # Create a bar chart of airport centrality metrics
    plt.figure(figsize=(12, 6))
    sns.barplot(x='Airport', y='Degree', data=airport_metrics_df)
    plt.title('Airport Network: Total Degree Centrality', fontsize=16)
    plt.xlabel('Airport', fontsize=14)
    plt.ylabel('Degree (weighted by flights)', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('visualizations/network/airport_degree_centrality.png', dpi=300)
    plt.close()
    
    # Create a bar chart of betweenness centrality
    plt.figure(figsize=(12, 6))
    betweenness_df = airport_metrics_df.sort_values('Betweenness', ascending=False)
    sns.barplot(x='Airport', y='Betweenness', data=betweenness_df)
    plt.title('Airport Network: Betweenness Centrality', fontsize=16)
    plt.xlabel('Airport', fontsize=14)
    plt.ylabel('Betweenness Centrality', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('visualizations/network/airport_betweenness_centrality.png', dpi=300)
    plt.close()
    
    # Analyze airline networks
    # Get top 3 airlines by number of flights
    top_airlines = airline_network_df.groupby('Reporting_Airline')['Flights'].sum().nlargest(3).index.tolist()
    
    for airline_code in top_airlines:
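        # Look up the airline name; fall back to the carrier code if it is
        # missing from the reference table instead of raising an IndexError
        name_match = airlines_df.loc[airlines_df['Code'] == airline_code, 'Name']
        airline_name = name_match.iloc[0] if not name_match.empty else airline_code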
        
        # Filter network data for this airline
        airline_routes = airline_network_df[airline_network_df['Reporting_Airline'] == airline_code]
        
        # Create a graph for this airline's network
        G_airline = nx.DiGraph()
        
        # Add nodes (airports)
        for _, airport in airports_df.iterrows():
            G_airline.add_node(airport['IATA'], name=airport['Name'], city=airport['City'], state=airport['State'])
        
        # Add edges (routes)
        for _, route in airline_routes.iterrows():
            G_airline.add_edge(
                route['Origin'], 
                route['Dest'], 
                weight=route['Flights'],
                delay=route['AvgDelay'],
                distance=route['Distance']
            )
        
        # Visualize airline network
        plt.figure(figsize=(12, 10))
        pos = nx.spring_layout(G_airline, seed=42)
        
        # Draw nodes with size based on degree
        node_sizes = [G_airline.degree(node, weight='weight') * 20 for node in G_airline.nodes()]
        nx.draw_networkx_nodes(G_airline, pos, node_size=node_sizes, node_color='lightgreen', alpha=0.8)
        
        # Draw edges with width based on number of flights
        edge_weights = [G_airline[u][v]['weight'] / 10 for u, v in G_airline.edges()]
        nx.draw_networkx_edges(G_airline, pos, width=edge_weights, alpha=0.5, edge_color='gray', arrows=True, arrowsize=10)
        
        # Draw labels
        nx.draw_networkx_labels(G_airline, pos, font_size=10, font_family='sans-serif')
        
        plt.title(f'{airline_name} ({airline_code}) Network', fontsize=16)
        plt.axis('off')
        plt.tight_layout()
        plt.savefig(f'visualizations/network/airline_network_{airline_code}.png', dpi=300)
        plt.close()
    
    print("Network analysis and visualizations completed.")

def main():
    """Main function to orchestrate exploratory data analysis."""
    print("Starting exploratory data analysis...")
    
    # Create results directory if it doesn't exist
    os.makedirs('results', exist_ok=True)
    
    # Load processed data
    airline_df, weather_df, airport_network_df, airline_network_df, airports_df, airlines_df = load_processed_data()
    
    # Generate general statistics and visualizations
    general_statistics(airline_df, weather_df)
    
    # Perform seasonal analysis
    seasonal_analysis(airline_df, weather_df)
    
    # Perform network analysis and visualizations
    network_analysis_visualization(airport_network_df, airline_network_df, airports_df, airlines_df)
    
    print("Exploratory data analysis completed.")



### 3.3 Execute Exploratory Data Analysis

The following cell runs the EDA pipeline. It loads the processed data and generates various plots and summary statistics, saving visualizations to the `visualizations` directory and results to the `results` directory.

In [None]:

# Run the EDA pipeline; this main() replaces the wrangling main() defined earlier
main()
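
After the EDA cell finishes, a quick listing of what was written can confirm that every plot was saved. This is a minimal sketch, assuming the three subfolders created earlier (`general`, `seasonal`, `network`); the helper name `list_saved_figures` is hypothetical.

```python
import glob
import os

def list_saved_figures(base_dir='visualizations'):
    """Map each visualization subfolder to the sorted PNG file names it contains."""
    figures = {}
    for subfolder in ('general', 'seasonal', 'network'):
        pattern = os.path.join(base_dir, subfolder, '*.png')
        figures[subfolder] = sorted(os.path.basename(p) for p in glob.glob(pattern))
    return figures

for subfolder, names in list_saved_figures().items():
    print(f"{subfolder}: {len(names)} figure(s)")
    for name in names:
        print(f"  {name}")
```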


## End of Notebook

This notebook has demonstrated the end-to-end workflow for flight delay analysis, starting from synthetic data generation, proceeding through data wrangling and cleaning, and concluding with exploratory data analysis. The generated data, processed files, and visualizations are saved in their respective directories.