## Instructions 

| **Feature**           | **Type**           | **Description**                                         | **Notes / Use in Model**                     |
| --------------------- | ------------------ | ------------------------------------------------------- | -------------------------------------------- |
| `DEP_DEL15`           | Target (Binary)    | 1 if departure delay ≥15 minutes, else 0                | Main prediction target                       |
| `DEP_DELAY`           | Numeric            | Departure delay in minutes (negative = early departure) | Used for validation and correlation          |
| `CRS_DEP_HOUR`        | Numeric (0–23)     | Scheduled departure hour extracted from `CRS_DEP_TIME`  | Captures daily delay pattern                 |
| `DAY_OF_WEEK`         | Categorical (1–7)  | Day of the week (1 = Monday, 7 = Sunday)                | Reflects weekday vs. weekend trends          |
| `MONTH`               | Categorical (1–12) | Month of the year                                       | Captures seasonal variation                  |
| `OP_UNIQUE_CARRIER`   | Categorical        | Airline carrier code (e.g., AA, DL, UA)                 | Delay rates vary by airline                  |
| `ORIGIN`              | Categorical        | Origin airport IATA code                                | Captures airport-level delay patterns        |
| `DEST`                | Categorical        | Destination airport IATA code                           | May add contextual variation                 |
| `DISTANCE`            | Numeric            | Flight distance in miles                                | Longer flights less affected by short delays |
| `TAXI_OUT`            | Numeric            | Taxi-out time in minutes                                | Indicator of airport congestion              |
| `CRS_ELAPSED_TIME`    | Numeric            | Scheduled flight duration in minutes                    | Useful for normalization                     |
| `HourlyPrecipitation` | Numeric            | Precipitation at departure station (inches/hour)        | Proxy for adverse weather                    |
| `HourlyVisibility`    | Numeric            | Visibility at departure station (miles)                 | Lower values may increase delay risk         |
| `HourlyWindSpeed`     | Numeric            | Wind speed at departure station (mph)                   | Captures storm or runway condition impact    |
| `CANCELLED`           | Binary             | 1 if flight was canceled                                | May need to exclude or treat separately      |
| `DIVERTED`            | Binary             | 1 if flight diverted to another airport                 | Usually excluded for modeling                |
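
The two derived fields in the table can be illustrated with a minimal sketch. It assumes `CRS_DEP_TIME` arrives as an HHMM integer (for example `1430`) and `DEP_DELAY` as minutes; the helper names `crs_dep_hour` and `dep_del15` are hypothetical and only mirror the column names.

```python
from typing import Optional


def crs_dep_hour(crs_dep_time: int) -> int:
    """Scheduled departure hour (0-23) from an HHMM integer such as 1430."""
    # Integer division drops the minutes; the modulo maps a 2400 value to hour 0.
    return (crs_dep_time // 100) % 24


def dep_del15(dep_delay: Optional[float]) -> Optional[int]:
    """1 if the departure delay is at least 15 minutes, else 0; None if unknown."""
    if dep_delay is None:  # assumption: a missing delay stays missing rather than becoming 0
        return None
    return 1 if dep_delay >= 15 else 0


print(crs_dep_hour(1430), dep_del15(22.0), dep_del15(-3.0))
```

Keeping missing delays as `None` instead of coercing them to 0 is a design choice here, so that rows without a recorded delay are not silently labeled as on time.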

| Column                | Description                                                  | Source | Notes                                       |
| :-------------------- | :----------------------------------------------------------- | :----- | :------------------------------------------ |
| QUARTER               | Calendar quarter of the year (1–4)                           | Flight | Captures seasonality; delay rates vary by season |
| MONTH                 | Month of flight date (1–12)                                  | Flight | Use for monthly trends                      |
| DAY_OF_MONTH          | Day of the month (1–31)                                      | Flight | Temporal feature                            |
| DAY_OF_WEEK           | Day of the week (1=Mon, 7=Sun)                               | Flight | Delays vary by weekday                      |
| FL_DATE               | Flight date                                                  | Flight | Combine with time fields for timestamp      |
| OP_UNIQUE_CARRIER     | Unique airline carrier code (e.g., AA, DL)                   | Flight | Key categorical variable                    |
| OP_CARRIER_AIRLINE_ID | Airline numeric ID from BTS                                  | Flight | Alternate carrier ID                        |
| OP_CARRIER            | Carrier abbreviation                                         | Flight | Duplicate of OP_UNIQUE_CARRIER              |
| TAIL_NUM              | Aircraft tail number                                         | Flight | Often missing or reused                     |
| OP_CARRIER_FL_NUM     | Flight number                                                | Flight | Combine with carrier for unique flight ID   |
| ORIGIN_AIRPORT_ID     | Unique numeric ID for origin airport                         | Flight | Key for joins                               |
| ORIGIN_AIRPORT_SEQ_ID | Unique ID per airport sequence                               | Flight | Not needed for modeling                     |
| ORIGIN_CITY_MARKET_ID | City market ID                                               | Flight | Identifies metro area                       |
| ORIGIN                | Origin airport code (IATA)                                   | Flight | Major key feature                           |
| ORIGIN_CITY_NAME      | Full city name of origin                                     | Flight | Redundant with ORIGIN                       |
| ORIGIN_STATE_ABR      | Origin state abbreviation                                    | Flight | Useful for mapping                          |
| ORIGIN_STATE_FIPS     | State FIPS code                                              | Flight | Redundant geographic ID                     |
| ORIGIN_STATE_NM       | Full state name                                              | Flight | Informational only                          |
| ORIGIN_WAC            | World Area Code for origin                                   | Flight | May be dropped                              |
| DEST_AIRPORT_ID       | Unique numeric ID for destination airport                    | Flight | Key for joins                               |
| DEST_AIRPORT_SEQ_ID   | Destination sequence ID                                      | Flight | Often redundant                             |
| DEST_CITY_MARKET_ID   | Destination city market ID                                   | Flight | Identifies metro area                       |
| DEST                  | Destination airport code (IATA)                              | Flight | Key feature                                 |
| DEST_CITY_NAME        | Full city name of destination                                | Flight | Informational                               |
| DEST_STATE_ABR        | Destination state abbreviation                               | Flight | Useful for mapping                          |
| DEST_STATE_FIPS       | Destination state FIPS code                                  | Flight | Redundant                                   |
| DEST_STATE_NM         | Full destination state name                                  | Flight | Informational only                          |
| DEST_WAC              | World Area Code for destination                              | Flight | May be dropped                              |
| CRS_DEP_TIME          | Scheduled departure time (HHMM local)                        | Flight | Convert to hour for modeling                |
| DEP_TIME              | Actual departure time (HHMM local)                           | Flight | Post-departure → leakage                    |
| DEP_DELAY             | Departure delay in minutes                                   | Flight | Leakage (after event)                       |
| DEP_DELAY_NEW         | Departure delay, no negatives                                | Flight | Leakage (after event)                       |
| DEP_DEL15             | 1 if departure delay ≥15 min                                 | Flight | Post-departure indicator                    |
| DEP_DELAY_GROUP       | Categorical group of departure delay                         | Flight | Leakage variable                            |
| DEP_TIME_BLK          | Scheduled departure block (time interval)                    | Flight | Keep for modeling                           |
| TAXI_OUT              | Taxi-out time in minutes                                     | Flight | Leakage; occurs after departure             |
| WHEELS_OFF            | Time wheels left ground (HHMM)                               | Flight | Leakage                                     |
| WHEELS_ON             | Time wheels touched down (HHMM)                              | Flight | Leakage                                     |
| TAXI_IN               | Taxi-in time (minutes)                                       | Flight | Leakage                                     |
| CRS_ARR_TIME          | Scheduled arrival time (HHMM)                                | Flight | Keep; pre-scheduled info                    |
| ARR_TIME              | Actual arrival time (HHMM)                                   | Flight | Leakage                                     |
| ARR_DELAY             | Arrival delay (minutes)                                      | Flight | Target-related; drop                        |
| ARR_DELAY_NEW         | Non-negative arrival delay                                   | Flight | Redundant                                   |
| ARR_DEL15             | 1 if arrival delay ≥15 min                                   | Flight | Binary label                                |
| ARR_DELAY_GROUP       | Grouped arrival delay                                        | Flight | Redundant with ARR_DEL15                    |
| ARR_TIME_BLK          | Scheduled arrival block                                      | Flight | Pre-scheduled; usable                       |
| CANCELLED             | 1 if flight was cancelled                                    | Flight | Keep for classification                     |
| CANCELLATION_CODE     | Code for reason of cancellation (A=Carrier, B=Weather, etc.) | Flight | Important categorical for cancelled flights |
| DIVERTED              | 1 if flight diverted to another airport                      | Flight | Keep; rare event                            |
| CRS_ELAPSED_TIME      | Scheduled elapsed flight time (min)                          | Flight | Useful duration variable                    |
| ACTUAL_ELAPSED_TIME   | Actual total flight time (min)                               | Flight | Leakage                                     |
| AIR_TIME              | In-air flight time (min)                                     | Flight | Leakage                                     |
| FLIGHTS               | Number of flights (usually 1)                                | Flight | Constant; drop                              |
| DISTANCE              | Great circle distance (miles)                                | Flight | Key continuous variable                     |
| DISTANCE_GROUP        | Distance category (1=short haul, etc.)                       | Flight | Categorical; keep                           |
| CARRIER_DELAY         | Delay due to airline (min)                                   | Flight | Post-event; leakage                         |
| WEATHER_DELAY         | Delay due to weather (min)                                   | Flight | Leakage                                     |
| NAS_DELAY             | Delay due to air traffic control (min)                       | Flight | Leakage                                     |
| SECURITY_DELAY        | Delay due to security (min)                                  | Flight | Leakage                                     |
| LATE_AIRCRAFT_DELAY   | Delay due to late incoming aircraft (min)                    | Flight | Leakage                                     |
| FIRST_DEP_TIME        | First departure attempt (for multi-leg flights)              | Flight | Leakage                                     |
| TOTAL_ADD_GTIME       | Total gate time added                                        | Flight | Leakage                                     |
| LONGEST_ADD_GTIME     | Longest gate time added                                      | Flight | Leakage                                     |

| Category                           | Columns                                                                                                                                                                                               | Action                       | Notes                                                                             |
| :--------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------- | :-------------------------------------------------------------------------------- |
| **Target**                         | `ARR_DEL15`                                                                                                                                                                                           | Keep                         | Binary classification label (1 = delay ≥15 min)                                   |
| **Temporal Features**              | `MONTH`, `DAY_OF_WEEK`, `CRS_DEP_TIME`, `CRS_ARR_TIME`, `DEP_TIME_BLK`, `ARR_TIME_BLK`                                                                                                                | Keep (or engineer)           | Convert times to hour bins or features; captures time-of-day and seasonal effects |
| **Flight Ops / Routing**           | `OP_UNIQUE_CARRIER`, `ORIGIN`, `DEST`, `DISTANCE`, `DISTANCE_GROUP`, `CANCELLED`, `DIVERTED`                                                                                                          | Keep                         | Core predictive variables; encode categorical features                            |
| **Weather (Hourly)**               | `HourlyPrecipitation`, `HourlyVisibility`, `HourlyWindSpeed`                                                                                                                                          | Keep                         | Key real-time weather indicators                                                  |
| **Weather (Hourly – Derived)**     | `HourlyPrecipitation_D`, `HourlyVisibility_D`, `HourlyWindSpeed_D`                                                                                                                                    | Keep                         | Change/delta features showing recent shifts                                       |
| **Weather (Daily)**                | `DailyPrecipitation`, `DailyMaximumDryBulbTemperature`, `DailyMinimumDryBulbTemperature`, `DailyPeakWindSpeed`                                                                                        | Review                       | Optional aggregates for extended weather context                                  |
| **Geographic / Station Info**      | `LATITUDE`, `LONGITUDE`, `ELEVATION`, `origin_station_dis`                                                                                                                                            | Review                       | Keep if you plan spatial or distance-based analysis                               |
| **Carrier ID / Meta**              | `OP_CARRIER_AIRLINE_ID`, `OP_CARRIER`                                                                                                                                                                 | Drop                         | Redundant with `OP_UNIQUE_CARRIER`                                                |
| **Delay & Post-Arrival Variables** | `DEP_DELAY`, `ARR_DELAY`, `AIR_TIME`, `ACTUAL_ELAPSED_TIME`, `WHEELS_OFF`, `WHEELS_ON`, `TAXI_OUT`, `TAXI_IN`, `CARRIER_DELAY`, `WEATHER_DELAY`, `NAS_DELAY`, `SECURITY_DELAY`, `LATE_AIRCRAFT_DELAY` | Drop (leakage)               | All occur after or depend on delay outcome                                        |
| **Monthly / Climate Features**     | All `Monthly...` fields (`MonthlyAverageRH`, `MonthlyGreatestPrecip`, etc.)                                                                                                                           | Drop                         | More than 90% null and not relevant for single-flight prediction                  |
| **High-Null / Empty Monthly & Short-Duration** | `MonthlyAverageRH`, `MonthlyDaysWithGT001Precip`, `MonthlyDaysWithGT010Precip`, `MonthlyDaysWithGT32Temp`, `MonthlyDaysWithGT90Temp`, `MonthlyDaysWithLT0Temp`, `MonthlyDaysWithLT32Temp`, `MonthlyDepartureFromNormalAverageTemperature`, `MonthlyDepartureFromNormalCoolingDegreeDays`, `MonthlyDepartureFromNormalHeatingDegreeDays`, `MonthlyDepartureFromNormalMaximumTemperature`, `MonthlyDepartureFromNormalMinimumTemperature`, `MonthlyDepartureFromNormalPrecipitation`, `MonthlyDewpointTemperature`, `MonthlyGreatestPrecip`, `MonthlyGreatestPrecipDate`, `MonthlyGreatestSnowDepth`, `MonthlyGreatestSnowDepthDate`, `MonthlyGreatestSnowfall`, `MonthlyGreatestSnowfallDate`, `MonthlyMaxSeaLevelPressureValue`, `MonthlyMaxSeaLevelPressureValueDate`, `MonthlyMaxSeaLevelPressureValueTime`, `MonthlyMaximumTemperature`, `MonthlyMeanTemperature`, `MonthlyMinSeaLevelPressureValue`, `MonthlyMinSeaLevelPressureValueDate`, `MonthlyMinSeaLevelPressureValueTime`, `MonthlyMinimumTemperature`, `MonthlySeaLevelPressure`, `MonthlyStationPressure`, `MonthlyTotalLiquidPrecipitation`, `MonthlyTotalSnowfall`, `MonthlyWetBulb`, `AWND`, `CDSD`, `CLDD`, `DSNW`, `HDSD`, `HTDD`, `NormalsCoolingDegreeDay`, `NormalsHeatingDegreeDay`, `ShortDurationEndDate005`, `ShortDurationEndDate010`, `ShortDurationEndDate015`, `ShortDurationEndDate020`, `ShortDurationEndDate030`, `ShortDurationEndDate045`, `ShortDurationEndDate060`, `ShortDurationEndDate080`, `ShortDurationEndDate100`, `ShortDurationEndDate120`, `ShortDurationEndDate150`, `ShortDurationEndDate180`, `ShortDurationPrecipitationValue005`, `ShortDurationPrecipitationValue010`, `ShortDurationPrecipitationValue015`, `ShortDurationPrecipitationValue020`, `ShortDurationPrecipitationValue030`, `ShortDurationPrecipitationValue045`, `ShortDurationPrecipitationValue060`, `ShortDurationPrecipitationValue080`, `ShortDurationPrecipitationValue100`, `ShortDurationPrecipitationValue120`, `ShortDurationPrecipitationValue150`, `ShortDurationPrecipitationValue180` | Drop | All null in OTPW; safe to drop to simplify schema |
| **Backup / Metadata Fields**       | All `Backup...`, `ShortDuration...`, `_row_desc`, `REM`                                                                                                                                               | Drop                         | Join metadata and reference only                                                  |
| **Text Fields**                    | `NAME`, `REPORT_TYPE`, `DailyWeather`, `HourlySkyConditions`, `HourlyPresentWeatherType`                                                                                                              | Optional (drop for baseline) | Free text; may require NLP or feature extraction later                            |
| **Station Linking**                | `STATION`, `DATE`, `SOURCE`                                                                                                                                                                           | Keep for validation only     | Needed for join checks; not a model input                                         |
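
The Keep/Drop actions in the table above could be applied with a small column filter. The sketch below is one possible encoding, not the project's actual pipeline: the constant names and `columns_to_keep` are hypothetical, and the prefix rule is an assumption that treats every `Monthly...`, `ShortDuration...`, and `Backup...` column as droppable.

```python
# Leakage columns copied from the "Delay & Post-Arrival Variables" row.
LEAKAGE_COLUMNS = {
    "DEP_DELAY", "ARR_DELAY", "AIR_TIME", "ACTUAL_ELAPSED_TIME",
    "WHEELS_OFF", "WHEELS_ON", "TAXI_OUT", "TAXI_IN",
    "CARRIER_DELAY", "WEATHER_DELAY", "NAS_DELAY",
    "SECURITY_DELAY", "LATE_AIRCRAFT_DELAY",
}
# Carrier IDs that duplicate OP_UNIQUE_CARRIER.
REDUNDANT_COLUMNS = {"OP_CARRIER_AIRLINE_ID", "OP_CARRIER"}
# Assumed prefixes for the high-null monthly, short-duration, and backup fields.
DROP_PREFIXES = ("Monthly", "ShortDuration", "Backup")


def columns_to_keep(columns):
    """Return the input column names minus leakage, redundant, and high-null fields."""
    kept = []
    for name in columns:
        if name in LEAKAGE_COLUMNS or name in REDUNDANT_COLUMNS:
            continue
        if name.startswith(DROP_PREFIXES):
            continue
        kept.append(name)
    return kept


sample = ["ORIGIN", "DEP_DELAY", "MonthlyMeanTemperature", "OP_CARRIER", "ARR_DEL15"]
print(columns_to_keep(sample))
```

With a Spark DataFrame, the returned list could be passed to `df.select(...)`; exact-name matching keeps columns such as `OP_CARRIER_FL_NUM` from being dropped by accident.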

#### iv. Airport–Weather Integration Plan (Data Joins, Issues & Rationale)

Our exploratory review of the four source tables (flights, airport codes, weather, and station metadata) showed that the **biggest blockers are not in the flights themselves but in the lookup tables we need to join to**. The flights table already carries clean IATA airport codes (`ORIGIN`, `DEST`) and consistent date fields (`FL_DATE`, `YEAR`, `MONTH`, …), so it is a good factual backbone. However, our main airport codes file does **not** include time zones and sometimes only stores geolocation as a single `coordinates` string (`"lon, lat"`). At the same time, the external GitHub airport list **does** provide `timezone`, and often better lat/lon, but it doesn’t perfectly align 1:1 with our current codes file. This makes a direct “flights → airports → weather” join fragile, because we would be mixing two partially-overlapping airport catalogs. To fix this, we will first build **one master airport dimension** by joining the GitHub timezones and geolocation into the codes we already use in flights, and we will coalesce coordinates so that every IATA that appears as an origin/destination ends up with: **(a)** a timezone, **(b)** a lat/lon pair, and **(c)** a human-readable name.

The second major issue is on the **weather** side: NOAA data is hourly and station-based (`STATION`, `DATE`, 100+ weather features), not airport-based. That means there is no native key to connect “ATL, SFO, ORD…” directly to a weather row. Stations and airports also don’t share the same IDs: stations use a different identifier, which we sometimes need to normalize with the `stations.csv` file. On top of that, **weather is in UTC** while our flights are in **local airport time**, so if we don’t add airport timezones first, we can’t reliably pick “the weather hour that corresponds to this departure.” To solve this, we will: (1) pair each airport with its 1 to 3 nearest stations using the unified lat/lon we just created, (2) store those pairs in a small bridge table (`airport_weather_station`), and (3) when we enrich flights, convert flight times to UTC using the airport’s timezone and then pick the matching hourly weather from the correct station. This approach gives us a repeatable pattern we can scale from the 3-month sample up to 2015–2021.

**Key data problems we identified:**
- Our original airport codes file **lacks time zones**, so flight local times cannot be aligned to UTC weather.
- Airport geolocation is sometimes packed as a single text field (`coordinates`), so we must **parse and standardize lat/lon**.
- Weather rows are **station-based, not airport-based**, so we must create an extra **airport → station** bridge.
- Stations may come from **two slightly different sources** (`weather.csv` vs `stations.csv`), so we need **ID normalization**.
- Not all flights’ origin/dest codes are guaranteed to appear in the GitHub airport list, so we will **fall back to the original codes file** to avoid losing rows.

#### v. Entity–Relationship Blueprint for Flights ↔ Airports ↔ Weather

This diagram summarizes the core entities we will use to enrich flight records with meteorological data.


In [0]:
mermaid_diagram_joins = """
<div class="mermaid">
erDiagram
    %% LEGEND
    %% PK_...  = primary key (or business key)
    %% FK_...  = foreign key to another table
    %% extra_* = columns omitted for readability

    FLIGHTS {
        string PK_flight_row            
        date   FL_DATE
        string FK_origin_iata_code      
        string FK_dest_iata_code        
        string OP_UNIQUE_CARRIER
        int    OP_CARRIER_FL_NUM
        int    CRS_DEP_TIME
        int    CRS_ARR_TIME
        int    YEAR
        int    MONTH
        int    DAY_OF_MONTH
        string extra_flight_columns
    }

    MASTER_AIRPORTS {
        string PK_iata_code
        string ident
        string name
        string municipality
        string iso_country
        string iso_region
        string airport_timezone
        float  lat
        float  lon
        string extra_airport_columns
    }

    WEATHER {
        string PK_station_id            
        datetime PK_obs_datetime       
        float   LATITUDE
        float   LONGITUDE
        string  NAME
        string  extra_weather_columns
    }

    NOAA_STATIONS {
        string PK_station_id_norm
        float  lat
        float  lon
        string neighbor_id
        float  distance_to_neighbor
        string extra_station_columns
    }

    AIRPORT_WEATHER_STATION {
        string PK_iata_code            
        string PK_station_id            
        int    PK_rank                  
        float  dist_km
    }

    %% RELATIONSHIPS
    FLIGHTS }o--|| MASTER_AIRPORTS : "origin (FK_origin_iata_code)"
    FLIGHTS }o--|| MASTER_AIRPORTS : "destination (FK_dest_iata_code)"

    MASTER_AIRPORTS ||--o{ AIRPORT_WEATHER_STATION : "airport → nearest stations"
    WEATHER  ||--o{ AIRPORT_WEATHER_STATION : "station in bridge"

    WEATHER }o--|| NOAA_STATIONS : "normalize/enrich station"
</div>
<script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
<script>mermaid.initialize({startOnLoad:true});</script>
"""
displayHTML(mermaid_diagram_joins)

#### vi. To-Do List for Airport–Weather Join Pipeline

**Phase 0 – Staging (raw → stg)**  
- [ ] Load **flights** (3m / 6m / 1y sample) into `stg_flights`  
- [ ] Load **airport codes** (current `codes.csv`) into `stg_airport_codes`  
- [ ] Load **GitHub airports with timezone** into `stg_airport_tz`  
- [ ] Load **weather** (NOAA hourly) into `stg_weather_hourly`  
- [ ] Load **station metadata** (`stations.csv`) into `stg_noaa_stations`

**Phase 1 – Unified airport dimension**  
- [ ] Uppercase and trim IATA codes in **both** airport sources  
- [ ] Parse `coordinates` from `stg_airport_codes` → (`codes_lat`, `codes_lon`)  
- [ ] Select from `stg_airport_tz` only the needed fields: (`iata_code`, `airport_timezone`, `gh_lat`, `gh_lon`)  
- [ ] Left-join timezone + better lat/lon into `stg_airport_codes` and **coalesce** → `dim_airport (master_airports)`  
- [ ] Compare `dim_airport` to distinct `ORIGIN` and `DEST` from flights to find **missing airports**
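
The parse-and-coalesce steps of Phase 1 can be sketched in plain Python before porting them to Spark. This assumes the `coordinates` field is a `"lon, lat"` string as described earlier; `parse_coordinates` and `coalesce` are hypothetical helper names, and the sample values are illustrative only.

```python
from typing import Optional, Tuple


def parse_coordinates(raw: Optional[str]) -> Tuple[Optional[float], Optional[float]]:
    """Split a "lon, lat" string into (lat, lon); return (None, None) if unusable."""
    if not raw:
        return None, None
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 2:
        return None, None
    try:
        lon, lat = float(parts[0]), float(parts[1])
    except ValueError:
        return None, None
    # Reject values outside valid geographic ranges (often a sign of swapped fields).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None, None
    return lat, lon


def coalesce(*values):
    """Return the first value that is not None (mirrors SQL COALESCE)."""
    for value in values:
        if value is not None:
            return value
    return None


codes_lat, codes_lon = parse_coordinates("-84.4281, 33.6367")  # illustrative values
gh_lat, gh_lon = None, None  # pretend the GitHub list has no entry for this airport
print(coalesce(gh_lat, codes_lat), coalesce(gh_lon, codes_lon))
```

Preferring the GitHub coordinates and falling back to the parsed `codes` values matches the coalesce order described in the plan; the range check is an added safeguard, not something the source specifies.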

**Phase 2 – Weather station dimension**  
- [ ] Build `dim_weather_station` = distinct (`STATION`, `LATITUDE`, `LONGITUDE`, `NAME`) from `stg_weather_hourly`  
- [ ] Left-join to `stg_noaa_stations` to fill missing coordinates / IDs  
- [ ] Validate that all stations used in weather have lat/lon

**Phase 3 – Airport ↔ station bridge (nearest K)**  
- [ ] Broadcast `dim_airport` (only airports with lat/lon)  
- [ ] Cross-join with `dim_weather_station`  
- [ ] Compute **Haversine** distance → `dist_km`  
- [ ] Window/partition by airport and keep top **K=3** nearest stations  
- [ ] Save as `airport_weather_station (iata_code, STATION, dist_km, rank)`
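
The nearest-K logic of Phase 3 can be checked on a handful of points with a plain-Python sketch before running the cross-join in Spark. It uses the same mean Earth radius as the notebook's Haversine helper but takes degrees as input; `haversine_km`, `nearest_stations`, and the station IDs are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, same constant as the Spark helper


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def nearest_stations(airport, stations, k=3):
    """Build bridge rows (iata_code, station_id, dist_km, rank) for one airport.

    airport: (iata_code, lat, lon); stations: iterable of (station_id, lat, lon).
    """
    iata, alat, alon = airport
    # Sort by distance first, then by station ID to break ties deterministically.
    ranked = sorted((haversine_km(alat, alon, slat, slon), sid) for sid, slat, slon in stations)
    return [(iata, sid, round(dist, 3), rank)
            for rank, (dist, sid) in enumerate(ranked[:k], start=1)]


stations = [("S1", 0.0, 1.0), ("S2", 0.0, 2.0), ("S3", 0.0, 0.5), ("S4", 10.0, 10.0)]
for row in nearest_stations(("AAA", 0.0, 0.0), stations):
    print(row)
```

In Spark, the `sorted(...)[:k]` step corresponds to a `row_number()` window partitioned by airport and ordered by `dist_km`.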

**Phase 4 – Time alignment**  
- [ ] From flights, build `dep_ts_local` = `FL_DATE` + `CRS_DEP_TIME`  
- [ ] Convert `dep_ts_local` to UTC using `dim_airport.airport_timezone` (origin)  
- [ ] Repeat for arrival using destination airport  
- [ ] (Optional) Materialize `dim_date` / `dim_time` for reporting and easier joins
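
The local-to-UTC conversion in Phase 4 can be sketched with the standard library's `zoneinfo`. This assumes `CRS_DEP_TIME` is an HHMM integer and that `airport_timezone` holds an IANA name such as `America/New_York`; both function names are hypothetical, and truncating to the hour is only one possible reading of "matching UTC hour".

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def scheduled_local_to_utc(fl_date: date, hhmm: int, tz_name: str) -> datetime:
    """Combine FL_DATE with an HHMM local time and return an aware UTC datetime."""
    hours, minutes = divmod(hhmm, 100)
    # Start from midnight and add an offset so a 2400 value rolls into the next day.
    local_naive = datetime(fl_date.year, fl_date.month, fl_date.day) + timedelta(
        hours=hours, minutes=minutes
    )
    # Attach the airport's timezone; ambiguous DST times resolve to the first occurrence.
    local = local_naive.replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


def weather_hour_key(ts_utc: datetime) -> datetime:
    """Truncate a UTC timestamp to the hour to use as a weather join key."""
    return ts_utc.replace(minute=0, second=0, microsecond=0)


dep_utc = scheduled_local_to_utc(date(2015, 1, 15), 1430, "America/New_York")
print(dep_utc.isoformat(), weather_hour_key(dep_utc).isoformat())
```

Using named timezones rather than fixed offsets lets daylight saving time be handled per date, which matters when the pipeline is scaled from the 3-month sample to multiple years.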

**Phase 5 – Final enrichment views + QA**  
- [ ] Create `v_flights_with_origin_weather`:
  - join flights → origin airport → bridge (rank=1) → weather on matching UTC hour  
- [ ] Create `v_flights_with_dest_weather`:
  - join flights → destination airport → bridge (rank=1) → weather on matching UTC hour  
- [ ] Coverage report:
  - % flights with origin weather  
  - % flights with destination weather  
  - airports with `dist_km > 300` (flag for manual review)
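
The coverage report in Phase 5 reduces to a few aggregates. The sketch below shows them on in-memory rows, assuming each enriched flight carries an `origin_weather` and a `dest_weather` value that is `None` when the join found nothing; the field names and `coverage_report` are hypothetical.

```python
def pct(flags):
    """Percentage of True values in an iterable of booleans (0.0 for empty input)."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def coverage_report(flights, bridge_rows, max_dist_km=300.0):
    """Summarize weather coverage and flag airports whose nearest station is far away.

    flights: list of dicts with 'origin_weather' and 'dest_weather' (None if missing).
    bridge_rows: iterable of (iata_code, station_id, dist_km, rank).
    """
    far_airports = sorted(
        {iata for iata, _station, dist_km, rank in bridge_rows
         if rank == 1 and dist_km > max_dist_km}
    )
    return {
        "pct_origin_weather": pct(f["origin_weather"] is not None for f in flights),
        "pct_dest_weather": pct(f["dest_weather"] is not None for f in flights),
        "airports_to_review": far_airports,
    }


flights = [
    {"origin_weather": 1.2, "dest_weather": 0.0},
    {"origin_weather": 0.4, "dest_weather": None},
    {"origin_weather": None, "dest_weather": 3.1},
    {"origin_weather": 0.0, "dest_weather": None},
]
bridge = [("ATL", "S1", 12.0, 1), ("XYZ", "S9", 450.0, 1), ("XYZ", "S8", 500.0, 2)]
print(coverage_report(flights, bridge))
```

Only rank-1 bridge rows are checked against the 300 km threshold, since the enrichment views join on `rank=1`.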

## Library imports

In [0]:
import pandas as pd
import urllib.request
import pyspark.sql.functions as sf
from pyspark.sql import Window as W

## Load databases

In [0]:
# Helper function to pretty-print Spark DataFrames
def show_df(df, n=5):
    """Pretty print the first `n` rows of a Spark DataFrame using Databricks display."""
    display(df.limit(n))

# Helper function to display columns of a Spark DataFrame
def show_columns(df):
    """Display the column names, data types, and % of null values of a Spark DataFrame."""
    total_rows = df.count()
    null_counts = df.select([sf.count(sf.when(sf.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    percent_null = {c: (null_counts[c] / total_rows * 100) if total_rows > 0 else None for c in df.columns}
    col_info = pd.DataFrame({
        "Column": df.columns,
        "Type": [t for _, t in df.dtypes],
        "% Null": [percent_null[c] for c in df.columns]
    })
    display(col_info)
    print(f"Total rows: {total_rows}")

# Helper function haversine calculation
def haversine_km_expr(lat1, lon1, lat2, lon2):
    """
    Great-circle distance on a sphere (WGS84 mean Earth radius).
    All arguments are Column[double] in radians.
    Returns Column[double] in kilometers.
    """
    dlat = (lat2 - lat1)
    dlon = (lon2 - lon1)
    a = sf.pow(sf.sin(dlat / 2), 2) + sf.cos(lat1) * sf.cos(lat2) * sf.pow(sf.sin(dlon / 2), 2)
    c = 2 * sf.atan2(sf.sqrt(a), sf.sqrt(1 - a))
    return sf.lit(6371.0088) * c  # mean Earth radius in km

In [0]:
# Flights data

"""
Flights data

    This is a subset of the passenger flights' on-time performance data taken from the
    TranStats data collection available from the U.S. Department of Transportation (DOT):

        https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ

    The flight dataset was downloaded from the US Department of Transportation and contains
    flight information from 2015 to 2021.
    (Note: flight data for the period [2015-2019] has dimensionality 31,746,841 x 109.)
    A data dictionary for this dataset is located here:

        https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ
"""

# Load parquet to dataframe
df_flights = spark.read.parquet(f"dbfs:/mnt/mids-w261/datasets_final_project_2022/parquet_airlines_data_3m/")

# Drop exact duplicates
n_raw = df_flights.count()
df_flights = df_flights.dropDuplicates()
n_distinct = df_flights.count()
print(f"[Flights] exact duplicates removed: {n_raw - n_distinct:,}")

# Display Results
# show_df(df_flights, 5)
# show_columns(df_flights)


[Flights] exact duplicates removed: 1,403,471


In [0]:
# Weather data

"""
Weather table

    As frequent flyers, we know that flight departures (and arrivals) are often affected by weather conditions, so it makes sense to collect and process weather data corresponding to the origin and destination airports at the time of departure and arrival, respectively, and to build features based upon this data.
    The weather dataset was downloaded from the National Oceanic and Atmospheric Administration repository and contains weather information from 2015 to 2021.

    The dimensionality of the weather data for the period [2015-2019] is 630,904,436 x 177.

Data dictionary (subset):

    Please refer to pages 8-12: https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf

    A better version of the data dictionary can be read here:

        https://www.ncei.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf

    A superset of the features is described here:

        https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf

    A subset of the features is shared here:

        https://docs.google.com/spreadsheets/d/1v0P34NlQKrvXGCACKDeqgxpDwmj3HxaleUiTrY7VRn0/edit#gid=0
"""

# Load parquet to dataframe
df_weather = spark.read.parquet(f"dbfs:/mnt/mids-w261/datasets_final_project_2022/parquet_weather_data_3m")


# Allowed hourly report types (no QCLCD daily/monthly summaries)
ALLOWED_RPT = ["FM-15", "FM-16", "FM-12"]  # METAR, SPECI, SYNOP

weather_hourly = (df_weather
                  .filter(sf.col("REPORT_TYPE").isin(ALLOWED_RPT))
                  .withColumn("obs_utc", sf.col("DATE").cast("timestamp"))
                  # Hour bucket used below to keep one report per station-hour
                  .withColumn("obs_hour_utc", sf.date_trunc("hour", sf.col("obs_utc"))))

# Preference: METAR > SPECI > SYNOP, then latest timestamp within the hour
weather_ranked = weather_hourly.withColumn(
    "report_type_rank",
    sf.when(sf.col("REPORT_TYPE") == "FM-15", 1)
     .when(sf.col("REPORT_TYPE") == "FM-16", 2)
     .when(sf.col("REPORT_TYPE") == "FM-12", 3)
     .otherwise(99)
)

# Partition by station and hour (not the exact timestamp) so the ranking can choose among reports
win_st_hr = (W.partitionBy("STATION", "obs_hour_utc")
              .orderBy(sf.col("report_type_rank").asc(), sf.col("obs_utc").desc()))
weather_best = (weather_ranked
                .withColumn("rn", sf.row_number().over(win_st_hr))
                .filter(sf.col("rn") == 1)
                .drop("rn", "report_type_rank"))

# Display results
# show_df(df_weather, 5)
# show_columns(df_weather)
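
The report-type preference above (METAR over SPECI over SYNOP for the same station and timestamp) can be illustrated without Spark. The following is a minimal plain-Python sketch of that rule, not the notebook's implementation; the `best_reports` helper and the sample rows (including the station id `S1` and the `SOD` report type) are hypothetical and exist only for illustration.

```python
from datetime import datetime

# Lower number = preferred report type (same order as the Spark `when` chain).
REPORT_TYPE_RANK = {"FM-15": 1, "FM-16": 2, "FM-12": 3}


def best_reports(rows):
    """Keep one row per (STATION, obs_utc), preferring FM-15 > FM-16 > FM-12.

    `rows` is an iterable of dicts with keys STATION, obs_utc, REPORT_TYPE.
    Rows with any other report type are dropped, mirroring the isin() filter.
    """
    best = {}
    for row in rows:
        rank = REPORT_TYPE_RANK.get(row["REPORT_TYPE"])
        if rank is None:
            continue  # not an allowed hourly report type
        key = (row["STATION"], row["obs_utc"])
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]


# Hypothetical sample rows (made-up station id and report mix).
sample = [
    {"STATION": "S1", "obs_utc": datetime(2015, 3, 1, 12, 53), "REPORT_TYPE": "FM-12"},
    {"STATION": "S1", "obs_utc": datetime(2015, 3, 1, 12, 53), "REPORT_TYPE": "FM-15"},
    {"STATION": "S1", "obs_utc": datetime(2015, 3, 1, 13, 10), "REPORT_TYPE": "FM-16"},
    {"STATION": "S1", "obs_utc": datetime(2015, 3, 1, 13, 10), "REPORT_TYPE": "SOD"},
]
deduped = best_reports(sample)
```

In this sketch the 12:53 slot keeps the FM-15 row, the 13:10 slot keeps the FM-16 row, and the `SOD` row is discarded.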


In [0]:
# Weather station data

"""
Airport dataset
    Overall the airport dataset provides some metadata about each airport.
    The airport dataset was downloaded from the US Department of Transportation and has the following dimensionality: 18,097 x 10.
    It is located here:
        dbfs:/mnt/mids-w261/datasets_final_project_2022/stations_data

"""
df_stations = spark.read.parquet(f"dbfs:/mnt/mids-w261/datasets_final_project_2022/stations_data/stations_with_neighbors.parquet")

# Display results
# show_df(df_stations, 5)
# show_columns(df_stations)

In [0]:
# Airport codes
"""
Airport codes Table

    Airport codes may refer to either:
        the IATA airport code, a three-letter code used in passenger reservation, ticketing, and baggage-handling systems,
        or the ICAO airport code, a four-letter code used by ATC systems and for airports that do not have an IATA airport code (from Wikipedia).
    Here you will need to import an external airport code conversion set (source: https://datahub.io/core/airport-codes) and join the airport codes to the airline's flights table on the IATA code (the 3-letter code used by passengers).
"""

# Download and load airport codes CSV to Spark DataFrame
url = "https://datahub.io/core/airport-codes/_r/-/data/airport-codes.csv"
local_path = "/tmp/airport-codes.csv"
urllib.request.urlretrieve(url, local_path)
df_codes = spark.read.format("csv").option("header", True).load(local_path)

# Display results
# show_df(df_codes, 5)
# show_columns(df_codes)

In [0]:
"""
Airport Timezone & Geolocation Table

    This external table augments airport identifiers with the fields we need
    for time alignment and spatial joins. Each record includes:
        • IATA (3-letter passenger code) and ICAO (4-letter ATC code)
        • Airport name and locality metadata
        • Latitude and longitude (decimal degrees)
        • Time zone in IANA format (e.g., "America/Los_Angeles")

    Source:
        https://raw.githubusercontent.com/lxndrblz/Airports/main/airports.csv
        (Project page: https://github.com/lxndrblz/Airports)

    Usage:
        Import this CSV and join to the flights table on IATA (3-letter code used
        in `ORIGIN` / `DEST`). Use the `time_zone` column to convert local flight
        times (e.g., FL_DATE + CRS_DEP_TIME) to UTC before weather joins, and use
        `latitude`/`longitude` to compute nearest NOAA weather stations.

    Notes:
        • Normalize IATA to uppercase and drop duplicates before joining.
        • Prefer this table’s lat/lon and timezone; if an airport is missing,
          fall back to the original codes file.
        • The timezone field follows IANA; ensure your Spark build supports
          IANA names when calling `to_utc_timestamp`.
        • /tmp is ephemeral and may be cleared when the cluster shuts down; use DBFS for persistent storage.
"""

dbutils.fs.mkdirs("dbfs:/student-groups/Group_4_4")
local_path = "/dbfs/student-groups/Group_4_4/airport-zones.csv"
url = "https://raw.githubusercontent.com/lxndrblz/Airports/main/airports.csv"
urllib.request.urlretrieve(url, local_path)
df_zones = spark.read.format("csv").option("header", True).load("dbfs:/student-groups/Group_4_4/airport-zones.csv")

# Display results
# show_df(df_zones)
# show_columns(df_zones)

# Raw Dataframes Joins

## Airports

In [0]:
# Build MASTER_AIRPORTS with timezone, latitude, and longitude
codes = (
    df_codes
    .withColumn("iata_code", sf.upper("iata_code"))
    .withColumn("_coords", sf.split(sf.regexp_replace(sf.col("coordinates"), "\\s+", ""), ","))
    .withColumn("codes_lon", sf.col("_coords").getItem(0).cast("double"))
    .withColumn("codes_lat", sf.col("_coords").getItem(1).cast("double"))
    .drop("_coords")
)

# Display results
# show_df(codes, 5)



In [0]:
tz = (
    df_zones
    .withColumn("iata_code", sf.upper(sf.col("code")))
    .select(
        sf.col("iata_code"),
        sf.col("time_zone").alias("airport_timezone"),
        sf.col("latitude").cast("double").alias("gh_lat"),
        sf.col("longitude").cast("double").alias("gh_lon"),
        sf.col("name").alias("gh_name")
    )
    .dropna(subset=["iata_code"])
    .dropDuplicates(["iata_code"])
)

# Display results
# show_df(tz, 5)

In [0]:
df_airports = (
    codes.alias("c")
    .join(tz.alias("g"), on="iata_code", how="left")
    .withColumn("lat", sf.coalesce("g.gh_lat", "c.codes_lat"))   # prefer the timezone table's coordinates
    .withColumn("lon", sf.coalesce("g.gh_lon", "c.codes_lon"))   # fall back to the codes file
    .select("iata_code", "ident", sf.col("c.type").alias("airport_type"), 
            "name", "municipality", "iso_country", "iso_region", "airport_timezone", "lat", "lon")
    .dropna(subset=["iata_code"])
    .dropDuplicates(["iata_code"])
)

# Display results
# show_df(df_airports, 5)
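
The coordinate handling in the airports join can be pictured in plain Python. This is a small sketch under two assumptions taken from the code above: the `coordinates` string is ordered longitude first, then latitude, and the timezone table's values win whenever they are present. The helper names `parse_lon_lat` and `coalesce` and all sample values are hypothetical.

```python
def parse_lon_lat(coordinates):
    """Split a 'lon, lat' string into floats; (None, None) if it cannot be parsed."""
    if not coordinates:
        return (None, None)
    parts = "".join(coordinates.split()).split(",")  # strip whitespace, then split
    try:
        return (float(parts[0]), float(parts[1]))
    except (IndexError, ValueError):
        return (None, None)


def coalesce(*values):
    """Return the first value that is not None, like Spark's coalesce()."""
    return next((v for v in values if v is not None), None)


# Hypothetical airport present in both tables: the timezone table's value wins.
codes_lon, codes_lat = parse_lon_lat("-83.35, 42.21")
lat_both = coalesce(42.2057, codes_lat)

# Hypothetical airport missing from the timezone table: fall back to the codes file.
lat_fallback = coalesce(None, codes_lat)
```

Unparseable strings yield `None` here, which loosely mirrors how a failed `cast("double")` yields null in Spark.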

## Weather Stations

In [0]:
# Cleanup Weather stations
df_weather_station = (
    df_stations
    .select(
        sf.col("station_id").cast("string").alias("station_id"),
        sf.col("lat").cast("double").alias("lat"),
        sf.col("lon").cast("double").alias("lon")
    )
    .dropna(subset=["station_id","lat","lon"])
    .dropDuplicates(["station_id"])
)

# Display results
# show_df(df_weather_station, 5)
# show_columns(df_weather_station)


In [0]:
# Prepare airport coordinates in radians
airport_radians = (
    df_airports
    .dropna(subset=["lat", "lon"])
    .withColumn("lat_rad", sf.radians("lat"))
    .withColumn("lon_rad", sf.radians("lon"))
    .select("iata_code", "lat", "lon", "lat_rad", "lon_rad")  
)

# Display results
# show_df(airport_radians, 5)
# show_columns(airport_radians)

In [0]:
# Prepare stations coordinates in radians
stations_radians = (
    df_weather_station
    .dropna(subset=["lat", "lon"])
    .withColumn("st_lat_rad", sf.radians("lat"))
    .withColumn("st_lon_rad", sf.radians("lon"))
    .select(sf.col("station_id").alias("station_id"), "lat", "lon", "st_lat_rad", "st_lon_rad")
)
# Display results
# show_df(stations_radians, 5)
# show_columns(stations_radians)

In [0]:
# Cross join (broadcast the small side), compute distance, rank, keep top-3 per airport
airports_stations_cross = (
    sf.broadcast(airport_radians).crossJoin(stations_radians)
    .withColumn("dist_km", haversine_km_expr(sf.col("lat_rad"), sf.col("lon_rad"),
                                             sf.col("st_lat_rad"), sf.col("st_lon_rad")))
    .select(
        sf.col("iata_code"),
        sf.col("station_id").alias("STATION"),
        sf.col("dist_km")
    )
)
station_rank = W.partitionBy("iata_code").orderBy(sf.col("dist_km").asc())
airport_weather_station = (
    airports_stations_cross
    .withColumn("rank", sf.row_number().over(station_rank))
    .filter(sf.col("rank") <= 3)
    .select("iata_code", "STATION", "dist_km", "rank")
)
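
The helper `haversine_km_expr` is defined outside this section, so its exact form is not shown here. As a rough illustration of what the cross join computes, the sketch below assumes it implements the standard haversine great-circle formula on radian inputs, and then ranks stations per airport. The function names and the Earth-radius constant are assumptions; the coordinates and station ids are taken from the sample output rows at the end of this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1_rad, lon1_rad, lat2_rad, lon2_rad):
    """Great-circle distance in km between two points given in radians."""
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_stations(airport_deg, stations_deg, k=3):
    """Return the k closest (station_id, dist_km, rank) tuples for one airport."""
    lat, lon = (math.radians(v) for v in airport_deg)
    dists = [
        (station_id, haversine_km(lat, lon, math.radians(s_lat), math.radians(s_lon)))
        for station_id, (s_lat, s_lon) in stations_deg.items()
    ]
    dists.sort(key=lambda pair: pair[1])
    return [(sid, d, rank) for rank, (sid, d) in enumerate(dists[:k], start=1)]


# ANC airport and three stations, using coordinates from the sample output.
anc_airport = (61.18103525, -149.99789189730433)
stations = {
    "70272526491": (61.178, -149.966),
    "70454025704": (51.883, -176.65),
    "72572024127": (40.778, -111.969),
}
ranked = nearest_stations(anc_airport, stations, k=2)
```

With these inputs the closest station to ANC is `70272526491` at roughly 1.74 km, which is in line with the `origin_station_dis` value shown for the ANC row in the output.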

In [0]:
# Prediction timestamp (origin): build local time from schedule (no leakage), then T–2h, then local→UTC
fl = df_flights
crs = (
    fl
    .withColumn("CRS_DEP_TIME_str", sf.lpad(sf.col("CRS_DEP_TIME").cast("string"), 4, "0"))
    .withColumn("dep_hh", sf.col("CRS_DEP_TIME_str").substr(1, 2).cast("int"))
    .withColumn("dep_mm", sf.col("CRS_DEP_TIME_str").substr(3, 2).cast("int"))
    .withColumn("FL_DATE_str", sf.col("FL_DATE").cast("string"))
    .withColumn(
        "dep_local_ts",
        sf.to_timestamp(
            sf.concat_ws(" ", "FL_DATE_str", sf.format_string("%02d:%02d:00", sf.col("dep_hh"), sf.col("dep_mm"))),
            "yyyy-MM-dd HH:mm:ss"
        )
    )
)

# Display results
# show_df(crs, 5)
# show_columns(crs)


In [0]:
fl_origin = (
    crs.alias("f")
    .join(df_airports.alias("a"), sf.col("f.ORIGIN")==sf.col("a.iata_code"), "left")
    .withColumn("airport_timezone", sf.col("a.airport_timezone"))                     # expose TZ as a plain column
    .withColumn("prediction_local_ts", sf.expr("dep_local_ts - INTERVAL 2 HOURS"))    # (equiv to T–2h)
    .withColumn(
        "prediction_utc",
        sf.expr("to_utc_timestamp(prediction_local_ts, airport_timezone)")            # use expr() so TZ column works
    )
    .withColumn(
        "flight_id",
        sf.concat_ws("|",
            sf.col("FL_DATE").cast("string"),
            sf.col("OP_UNIQUE_CARRIER"),
            sf.col("OP_CARRIER_FL_NUM"),
            sf.col("ORIGIN"),
            sf.col("DEST")
        )
    )
    .select("flight_id","FL_DATE","ORIGIN","DEST","prediction_local_ts","prediction_utc")
)

# Display results
# show_df(fl_origin, 5)
# show_columns(fl_origin)
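
To make the timestamp logic concrete, here is a plain-Python sketch of the same three steps: pad `CRS_DEP_TIME` to `HHMM`, subtract two hours in local time, then convert to UTC. The function name `prediction_utc_for` is hypothetical, and the demo uses fixed UTC offsets purely to stay self-contained; the notebook itself relies on IANA zone names via `to_utc_timestamp`.

```python
from datetime import datetime, timedelta, timezone


def prediction_utc_for(fl_date, crs_dep_time, tz, lead=timedelta(hours=2)):
    """Scheduled local departure minus `lead`, expressed in UTC.

    fl_date: 'YYYY-MM-DD' string; crs_dep_time: int such as 945 or 2145;
    tz: a tzinfo object describing the origin airport's local time.
    """
    hhmm = str(crs_dep_time).zfill(4)  # 945 -> '0945'
    local = datetime.strptime(f"{fl_date} {hhmm[:2]}:{hhmm[2:]}", "%Y-%m-%d %H:%M")
    cutoff_local = (local - lead).replace(tzinfo=tz)  # subtract first, then localize
    return cutoff_local.astimezone(timezone.utc)


# Fixed offsets assumed for the demo: UTC-5 for Detroit in February,
# UTC-7 for Las Vegas at the end of March.
est = timezone(timedelta(hours=-5))
pdt = timezone(timedelta(hours=-7))

dtw = prediction_utc_for("2015-02-26", 2145, est)
las = prediction_utc_for("2015-03-31", 1400, pdt)
```

Both results line up with the `prediction_utc` values of the DTW and LAS rows in the sample output (`2015-02-27T00:45:00Z` and `2015-03-31T19:00:00Z`). Passing a `zoneinfo.ZoneInfo("America/Detroit")` object instead of a fixed offset would handle daylight saving transitions, provided time zone data is installed.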

In [0]:
# As-of weather (origin), step 1: attach the top-3 candidate stations to each flight.
# The time filter (obs_utc <= prediction_utc, within 6h) and the station selection happen in later cells.
origin_candidates = (
    fl_origin.alias("f")
    .join(airport_weather_station.alias("b"), sf.col("f.ORIGIN") == sf.col("b.iata_code"), how="left")
    .select("f.*", sf.col("b.STATION").alias("cand_station"), sf.col("b.rank").alias("station_rank"))
)

# Display results
# show_df(origin_candidates)
# show_columns(origin_candidates)



In [0]:
# Restrict to weather rows 'as-of' prediction_utc (no future) and within a bounded lookback window 
weather_required = [
    "HourlyDryBulbTemperature","HourlyDewPointTemperature","HourlyWetBulbTemperature",
    "HourlyPrecipitation","HourlyWindSpeed","HourlyWindDirection","HourlyWindGustSpeed",
    "HourlyVisibility","HourlyRelativeHumidity","HourlyStationPressure","HourlySeaLevelPressure",
    "HourlyAltimeterSetting","HourlySkyConditions","HourlyPresentWeatherType"
]
wx_present = [c for c in weather_required if c in weather_best.columns]
weather = weather_best.select(
    "STATION",
    sf.col("obs_utc"),
    *[sf.col(c) for c in wx_present]
)

# weather_cols = ["STATION", sf.col("DATE").cast("timestamp").alias("obs_utc")] + [sf.col(c) for c in wx_present]
# weather = df_weather.select(*weather_cols)

# In origin_candidates, also carry the distance so we can publish origin_station_dis
origin_candidates = (
    fl_origin.alias("f")
    .join(airport_weather_station.alias("b"), sf.col("f.ORIGIN") == sf.col("b.iata_code"), how="left")
    .select("f.*",
            sf.col("b.STATION").alias("cand_station"),
            sf.col("b.rank").alias("station_rank"),
            sf.col("b.dist_km").alias("cand_station_dis_km"))
)

# 6-hour lookback 
weather_join = (
    origin_candidates.alias("x")
    .join(
        weather.alias("w"),
        on=[
            sf.col("w.STATION") == sf.col("x.cand_station"),
            sf.col("w.obs_utc") <= sf.col("x.prediction_utc"),
            sf.col("w.obs_utc") >= sf.expr("timestampadd(HOUR, -6, x.prediction_utc)")
        ],
        how="left"
    )
)

# Display results
# show_df(weather_join, 5)
# show_columns(weather_join)

In [0]:
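# Station selection window: prefer rows that actually have an observation, then the lower
# station_rank (1, then 2, then 3), then the latest obs_utc within the lookback window.
# Without the null check, the left join's empty row for a rank-1 station with no
# observation would always sort first, so the rank-2/3 fallback would never be used.
window = W.partitionBy("flight_id").orderBy(
    sf.col("obs_utc").isNull().asc(),
    sf.col("station_rank").asc(),
    sf.col("obs_utc").desc(),
)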
origin_asof = (
    weather_join
    .withColumn("rn", sf.row_number().over(window))
    .filter(sf.col("rn") == 1)
    .withColumn("asof_minutes", sf.floor((sf.unix_timestamp("prediction_utc") - sf.unix_timestamp("obs_utc"))/60.0))
    .select(
        "flight_id","ORIGIN","prediction_utc",
        sf.col("cand_station").alias("origin_station_id"),
        sf.col("cand_station_dis_km").alias("origin_station_dis"),
        sf.col("obs_utc").alias("origin_obs_utc"),
        "asof_minutes",
        *wx_present,
        "station_rank"
    )
)
# Display results
# show_df(origin_asof, 5)
# show_columns(origin_asof)
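
The as-of selection can be sketched for a single flight in plain Python. This is an illustrative reading of the rule, not the notebook's code: only observations inside the 6-hour lookback are eligible, the nearest station with an eligible observation wins, and within that station the latest observation is used. Skipping stations with no eligible observation is an assumption about the intended fallback. The `pick_asof` helper and the station id `HYPOTHETICAL_2` are made up; the SLC station id and timestamps come from the sample output.

```python
from datetime import datetime, timedelta


def pick_asof(prediction_utc, candidates, lookback=timedelta(hours=6)):
    """Choose one observation for a flight.

    candidates: list of (station_rank, station_id, obs_utc) tuples.
    Returns (station_id, obs_utc, asof_minutes), or None if nothing is eligible.
    """
    earliest = prediction_utc - lookback
    eligible = [c for c in candidates if earliest <= c[2] <= prediction_utc]
    if not eligible:
        return None
    # Lowest station rank first, then the smallest age (the latest observation).
    rank, station, obs = min(eligible, key=lambda c: (c[0], prediction_utc - c[2]))
    asof_minutes = int((prediction_utc - obs).total_seconds() // 60)
    return (station, obs, asof_minutes)


pred = datetime(2015, 4, 1, 1, 50)
slc = pick_asof(pred, [
    (1, "72572024127", datetime(2015, 3, 31, 22, 53)),
    (1, "72572024127", datetime(2015, 3, 31, 23, 53)),
    (1, "72572024127", datetime(2015, 4, 1, 1, 53)),   # after the cutoff: ignored
    (2, "HYPOTHETICAL_2", datetime(2015, 4, 1, 1, 45)),
])

# Fallback case: the rank-1 station only has a stale report, so rank 2 is used.
fallback = pick_asof(pred, [
    (1, "72572024127", datetime(2015, 3, 31, 18, 0)),
    (2, "HYPOTHETICAL_2", datetime(2015, 4, 1, 1, 40)),
])
```

The first call reproduces the `asof_minutes` of 117 shown for the SLC row in the sample output; the second shows the fallback to the next-nearest station.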

In [0]:
# Origin station lat/lon
origin_asof_enriched = (
    origin_asof.alias("o")
    .join(
        df_weather_station.select(
            sf.col("station_id").alias("STATION"),
            sf.col("lat").alias("origin_station_lat"),
            sf.col("lon").alias("origin_station_lon")
        ).alias("s"),
        sf.col("o.origin_station_id")==sf.col("s.STATION"),
        "left"
    )
)

need_from_w = ["flight_id","prediction_utc","origin_obs_utc","asof_minutes",
               "origin_station_id","origin_station_dis","origin_station_lat","origin_station_lon"] + wx_present
origin_asof_enriched = origin_asof_enriched.select(*[c for c in need_from_w if c in origin_asof_enriched.columns])


# Airport lat/lon (origin & dest)
air_min = df_airports.select(
    "iata_code",
    sf.col("lat").alias("airport_lat"),
    sf.col("lon").alias("airport_lon"),
    sf.col("airport_type")
)
origin_air_geo = air_min.select(
    sf.col("iata_code").alias("ORIGIN"),
    sf.col("airport_lat").alias("origin_airport_lat"),
    sf.col("airport_lon").alias("origin_airport_lon"),
    sf.col("airport_type").alias("origin_type")   
)
dest_air_geo = air_min.select(
    sf.col("iata_code").alias("DEST"),
    sf.col("airport_lat").alias("dest_airport_lat"),
    sf.col("airport_lon").alias("dest_airport_lon"),
    sf.col("airport_type").alias("dest_type")
)

# Dest station (rank-1) for location helpers (no dest weather to avoid leakage)
dest_rank1 = (
    airport_weather_station
    .filter(sf.col("rank")==1)
    .select(
        sf.col("iata_code").alias("DEST"),
        sf.col("STATION").alias("dest_station_id"),
        sf.col("dist_km").alias("dest_station_dis")
    )
)
dest_station_geo = (
    dest_rank1.alias("d")
    .join(
        df_weather_station.select(
            sf.col("station_id").alias("STATION"),
            sf.col("lat").alias("dest_station_lat"),
            sf.col("lon").alias("dest_station_lon")
        ).alias("s"),
        sf.col("d.dest_station_id")==sf.col("s.STATION"),
        "left"
    )
    .select("DEST","dest_station_id","dest_station_dis","dest_station_lat","dest_station_lon")
)

In [0]:
# Rebuild a stable flight_id on the raw flights for the final join
flights_keyed = (
    df_flights
    .withColumn(
        "flight_id",
        sf.concat_ws("|",
            sf.col("FL_DATE").cast("string"),
            sf.col("OP_UNIQUE_CARRIER"),
            sf.col("OP_CARRIER_FL_NUM"),
            sf.col("ORIGIN"),
            sf.col("DEST")
        )
    )
)

# Display results
# show_df(flights_keyed, 5)
# show_columns(flights_keyed)

In [0]:
# Assemble final dataset: one row per flight
final_joined = (
    flights_keyed.alias("f")
    .join(origin_asof_enriched.alias("w"), "flight_id", "left")
    .join(origin_air_geo, ["ORIGIN"], "left")
    .join(dest_air_geo,   ["DEST"],   "left")
    .join(dest_station_geo, ["DEST"], "left")
)

# Column groups
model_inputs = [
    # Flight & schedule
    "FL_DATE","YEAR","QUARTER","MONTH","DAY_OF_MONTH","DAY_OF_WEEK",
    "OP_UNIQUE_CARRIER","OP_CARRIER","OP_CARRIER_FL_NUM","TAIL_NUM",
    "CRS_DEP_TIME","CRS_ARR_TIME","CRS_ELAPSED_TIME",
    "ORIGIN","ORIGIN_AIRPORT_ID","ORIGIN_CITY_NAME","ORIGIN_STATE_ABR",
    "DEST","DEST_AIRPORT_ID","DEST_CITY_NAME","DEST_STATE_ABR",
    "DISTANCE","DISTANCE_GROUP",
    # Origin weather (as-of T–2h)
] + wx_present + [
    # Location helpers
    "origin_station_lat","origin_station_lon","origin_airport_lat","origin_airport_lon",
    "dest_station_lat","dest_station_lon","dest_airport_lat","dest_airport_lon",
    "origin_station_dis","dest_station_dis", "origin_type","dest_type"
]

labels_eval = ["DEP_DEL15","DEP_DELAY","ARR_DEL15","ARR_DELAY"]
post_flight = [
    "CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY",
    "DEP_TIME","ARR_TIME","TAXI_OUT","TAXI_IN","WHEELS_OFF","WHEELS_ON","ACTUAL_ELAPSED_TIME","AIR_TIME"
]
flags = ["CANCELLED","CANCELLATION_CODE","DIVERTED"]

provenance = ["flight_id","prediction_utc","origin_obs_utc","asof_minutes","origin_station_id","dest_station_id"]

# Keep only columns that exist (different weather slices may miss some)
def present(cols, df_cols):
    """Return the entries of cols that exist in df_cols, preserving order."""
    available = set(df_cols)
    return [c for c in cols if c in available]

keep = provenance \
     + present(model_inputs, final_joined.columns) \
     + present(labels_eval, final_joined.columns) \
     + present(post_flight, final_joined.columns) \
     + present(flags, final_joined.columns)

final_curated = final_joined.select(*keep)

# Persist for the team
out_path = "dbfs:/student-groups/Group_4_4/JOINED_3M_2015.parquet"
(final_curated
 .write
 .format("parquet")
 .mode("overwrite")
 .save(out_path))

In [0]:
df_check = spark.read.parquet(out_path)
display(df_check.limit(10))

flight_id,prediction_utc,origin_obs_utc,asof_minutes,origin_station_id,dest_station_id,FL_DATE,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,OP_CARRIER,OP_CARRIER_FL_NUM,TAIL_NUM,CRS_DEP_TIME,CRS_ARR_TIME,CRS_ELAPSED_TIME,ORIGIN,ORIGIN_AIRPORT_ID,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,DEST,DEST_AIRPORT_ID,DEST_CITY_NAME,DEST_STATE_ABR,DISTANCE,DISTANCE_GROUP,HourlyDryBulbTemperature,HourlyDewPointTemperature,HourlyWetBulbTemperature,HourlyPrecipitation,HourlyWindSpeed,HourlyWindDirection,HourlyWindGustSpeed,HourlyVisibility,HourlyRelativeHumidity,HourlyStationPressure,HourlySeaLevelPressure,HourlyAltimeterSetting,HourlySkyConditions,HourlyPresentWeatherType,origin_station_lat,origin_station_lon,origin_airport_lat,origin_airport_lon,dest_station_lat,dest_station_lon,dest_airport_lat,dest_airport_lon,origin_station_dis,dest_station_dis,origin_type,dest_type,DEP_DEL15,DEP_DELAY,ARR_DEL15,ARR_DELAY,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,DEP_TIME,ARR_TIME,TAXI_OUT,TAXI_IN,WHEELS_OFF,WHEELS_ON,ACTUAL_ELAPSED_TIME,AIR_TIME,CANCELLED,CANCELLATION_CODE,DIVERTED
2015-03-26|AS|164|ANC|ADK,2015-03-26T20:45:00Z,2015-03-26T19:53:00Z,52,70272526491,70454025704,2015-03-26,2015,1,3,26,4,AS,AS,164,N764AS,1445,1650,185.0,ANC,10299,"Anchorage, AK",AK,ADK,10165,"Adak Island, AK",AK,1192.0,5,40,25,34,0.00,0,0,,10.0,55,29.3,29.44,29.44,OVC:08 110,,61.178,-149.966,61.18103525,-149.99789189730433,51.883,-176.65,51.88278,-176.64473,1.7425141362466383,0.3625457905065297,large_airport,medium_airport,1.0,20.0,0.0,12.0,,,,,,1505,1702,11.0,4.0,1516,1658,177.0,162.0,0.0,,0.0
2015-02-26|EV|5239|DTW|BGM,2015-02-27T00:45:00Z,2015-02-26T23:53:00Z,52,72537094847,72515004725,2015-02-26,2015,1,2,26,4,EV,EV,5239,N861AS,2145,2311,86.0,DTW,11433,"Detroit, MI",MI,BGM,10577,"Binghamton, NY",NY,378.0,2,7,-1,6,0.00,0,0,,10.0,70,29.64,30.4,30.36,FEW:02 180,,42.231,-83.331,42.2056991,-83.35297537621658,42.207,-75.98,42.2106484,-75.9815509339948,3.345110277884152,0.425319579540484,large_airport,medium_airport,0.0,0.0,0.0,-2.0,,,,,,2145,2309,25.0,4.0,2210,2305,84.0,55.0,0.0,,0.0
2015-03-31|OO|7378|SLC|BTM,2015-04-01T01:50:00Z,2015-03-31T23:53:00Z,117,72572024127,72774024135,2015-03-31,2015,1,3,31,2,OO,OO,7378,N447SW,2150,2309,79.0,SLC,14869,"Salt Lake City, UT",UT,BTM,10779,"Butte, MT",MT,358.0,2,55,34,45,0.00,24,330,32.0,6.0,45,25.58,29.75,29.87,FEW:02 80 BKN:07 110,HZ:7 |FU |HZ,40.778,-111.969,40.7900661,-111.97989846185592,45.965,-112.501,45.95111,-112.49389,1.6254551471677825,1.639375335532778,large_airport,medium_airport,0.0,-4.0,0.0,-13.0,,,,,,2146,2256,15.0,4.0,2201,2252,70.0,51.0,0.0,,0.0
2015-03-31|WN|825|LAS|BUR,2015-03-31T19:00:00Z,2015-03-31T18:56:00Z,4,72386023169,72288023152,2015-03-31,2015,1,3,31,2,WN,WN,825,N288WN,1400,1505,65.0,LAS,12889,"Las Vegas, NV",NV,BUR,10800,"Burbank, CA",CA,223.0,1,80,13,51,0.00,16,210,21.0,10.0,8,27.46,29.67,29.73,FEW:02 250,,36.072,-115.163,36.0861034,-115.16111989849917,34.201,-118.358,34.2031703,-118.35917241746948,1.577304450332636,0.2643180285586571,large_airport,medium_airport,0.0,-1.0,0.0,-9.0,,,,,,1359,1456,9.0,2.0,1408,1454,57.0,46.0,0.0,,0.0
2015-03-31|WN|2524|RSW|CAK,2015-03-31T14:25:00Z,2015-03-31T13:53:00Z,32,72210812894,72521014895,2015-03-31,2015,1,3,31,2,WN,WN,2524,N716SW,1225,1505,160.0,RSW,14635,"Fort Myers, FL",FL,CAK,10874,"Akron, OH",OH,991.0,4,80,55,65,0.00,3,100,,10.0,42,30.08,30.11,30.11,CLR:00,,26.536,-81.755,26.53294975,-81.75880562612211,40.918,-81.444,40.9152061,-81.43992419328089,0.5083018634543099,0.4623884371833724,large_airport,medium_airport,0.0,0.0,0.0,-11.0,,,,,,1225,1454,10.0,4.0,1235,1450,149.0,135.0,0.0,,0.0
2015-03-30|AS|61|YAK|CDV,2015-03-30T17:41:00Z,2015-03-30T16:53:00Z,48,70361025339,70296026410,2015-03-30,2015,1,3,30,1,AS,AS,61,N765AS,1141,1229,48.0,YAK,15991,"Yakutat, AK",AK,CDV,10926,"Cordova, AK",AK,213.0,1,46,35,41,0.00,0,0,,10.0,66,29.47,29.5,29.5,CLR:00,,59.512,-139.671,59.50081355,-139.64408083356284,60.489,-145.451,60.4916501,-145.47370526084313,1.9632465587288668,1.2780354018245668,medium_airport,medium_airport,0.0,-23.0,0.0,-26.0,,,,,,1118,1203,8.0,2.0,1126,1201,45.0,35.0,0.0,,0.0
2015-03-15|OO|5317|SFO|CEC,2015-03-15T15:45:00Z,2015-03-15T14:56:00Z,49,72494023234,72594624286,2015-03-15,2015,1,3,15,7,OO,OO,5317,N237SW,1045,1222,97.0,SFO,14771,"San Francisco, CA",CA,CEC,10930,"Crescent City, CA",CA,304.0,2,67,54,60,0.00,13,320,,10.0,63,30.04,30.06,30.06,FEW:02 10 OVC:08 200,,37.62,-122.365,37.622452,-122.38407160781362,41.78,-124.237,41.7814506,-124.23807108485695,1.7016890748660345,0.1841336510959366,large_airport,medium_airport,0.0,-8.0,0.0,-17.0,,,,,,1037,1205,14.0,4.0,1051,1201,88.0,70.0,0.0,,0.0
2015-03-31|OO|2838|ORD|CHO,2015-03-31T21:50:00Z,2015-03-31T20:51:00Z,59,72530094846,72401693736,2015-03-31,2015,1,3,31,2,OO,OO,2838,N494CA,1850,2153,123.0,ORD,13930,"Chicago, IL",IL,CHO,10990,"Charlottesville, VA",VA,567.0,3,44,35,40,0.00,5,170,,10.0,71,29.22,29.95,29.94,CLR:00,,41.995,-87.934,41.97795725,-87.90917584851792,38.137,-78.455,38.1410056,-78.45246606356999,2.7930256931701187,0.497488264524073,large_airport,medium_airport,0.0,-3.0,0.0,-30.0,,,,,,1847,2123,16.0,3.0,1903,2120,96.0,77.0,0.0,,0.0
2015-03-31|MQ|3418|DFW|CID,2015-03-31T20:24:00Z,2015-03-31T19:53:00Z,31,72259003927,72545014990,2015-03-31,2015,1,3,31,2,MQ,MQ,3418,N546MQ,1724,1929,125.0,DFW,11298,"Dallas/Fort Worth, TX",TX,CID,11003,"Cedar Rapids/Iowa City, IA",IA,685.0,3,74,60,65,0.00,15,170,,10.0,62,29.24,29.86,29.88,OVC:08 250,,32.898,-97.019,32.89651945,-97.0465220537124,41.883,-91.717,41.889423,-91.7003,2.57485082431794,1.5560424351128543,large_airport,medium_airport,1.0,111.0,1.0,98.0,98.0,0.0,0.0,0.0,0.0,1915,2107,18.0,3.0,1933,2104,112.0,91.0,0.0,,0.0
2015-03-29|OO|4717|DTW|CIU,2015-03-29T21:55:00Z,2015-03-29T21:53:00Z,2,72537094847,72734014847,2015-03-29,2015,1,3,29,7,OO,OO,4717,N8903A,1955,2111,76.0,DTW,11433,"Detroit, MI",MI,CIU,11013,"Sault Ste. Marie, MI",MI,284.0,2,42,30,37,T,23,210,43.0,10.0,62,29.07,29.79,29.78,BKN:07 30 OVC:08 45,-DZ:01 |DZ |,42.231,-83.331,42.2056991,-83.35297537621658,46.479,-84.357,46.5,-84.35,3.345110277884152,2.3958004754349385,large_airport,medium_airport,0.0,11.0,0.0,14.0,,,,,,2006,2125,21.0,6.0,2027,2119,79.0,52.0,0.0,,0.0
