# Time-Based Cross-Validation Split Generation

**Author:** Daniel Costa  
**Project:** Flight Delay Prediction (W261 Final Project)  
**Dataset:** 5-Year Flight Data (2015-2019)

---

## Overview

This notebook generates **time-based cross-validation splits** for training neural network models on flight delay prediction. Splitting by time is critical for time-series data because it prevents data leakage: every validation window lies after its training window, so the model is never trained on information from the future.

### Methodology

| Parameter | Value | Description |
|-----------|-------|-------------|
| Training Window | 30 days × 4 × 5 | 14,400 hours of training data per fold |
| Gap Size | 2 hours | Prevents label leakage at split boundary |
| Validation Window | 7 days × 4 × 5 | 3,360 hours of validation data per fold |
| Step Size | 85 hours × 4 × 5 | Rolling window advancement (1,700 hours) |
| **Total Folds** | **10** | Sliding window cross-validation |

### Time-Based Split Visualization

```
Fold 1: [====== TRAIN (30d) ======][GAP][=== VAL (7d) ===]
Fold 2:      [====== TRAIN (30d) ======][GAP][=== VAL (7d) ===]
Fold 3:           [====== TRAIN (30d) ======][GAP][=== VAL (7d) ===]
...
Fold 10:                                    [====== TRAIN (30d) ======][GAP][=== VAL (7d) ===]
```
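The layout above can be sketched in plain Python. This is a minimal illustration rather than the notebook's Spark implementation: the helper `fold_bounds` and the `FoldBounds` tuple are hypothetical names, and the sketch assumes a 1-based `time_idx` with inclusive window bounds.

```python
from typing import NamedTuple


class FoldBounds(NamedTuple):
    train_start: int
    train_end: int
    val_start: int
    val_end: int


def fold_bounds(fold: int, train_hours: int, gap_hours: int,
                val_hours: int, step_hours: int) -> FoldBounds:
    """Return inclusive time_idx bounds for a 0-based fold number.

    Hypothetical helper: assumes time_idx starts at 1.
    """
    train_start = 1 + fold * step_hours
    train_end = train_start + train_hours - 1
    # Skip gap_hours index values between the last training hour and validation.
    val_start = train_end + gap_hours + 1
    val_end = val_start + val_hours - 1
    return FoldBounds(train_start, train_end, val_start, val_end)


# Small illustrative numbers, not the notebook's configuration.
for k in range(3):
    print(k + 1, fold_bounds(k, train_hours=30, gap_hours=2, val_hours=7, step_hours=5))
```

Because `val_start` is always greater than `train_end + gap_hours`, no validation hour can precede or touch the training window of the same fold.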

---

## Table of Contents

1. [Setup & Configuration](#setup)
2. [Data Loading](#data-loading)
3. [Time Index Creation](#time-index)
4. [Cross-Validation Split Generation](#cv-splits)
5. [Export Splits](#export)

In [0]:
# =============================================================================
# Section 1: Setup & Configuration
# =============================================================================

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# -----------------------------------------------------------------------------
# Cross-Validation Configuration
# -----------------------------------------------------------------------------
# Each base window is scaled by 4 * 5 = 20 so that folds cover the
# multi-year dataset (2015-2019).
# This creates overlapping folds that span similar time periods across years

CV_CONFIG = {
    # Training window: 30 days = 720 hours, scaled by 20 -> 14,400 hours
    "train_hours": 720 * 4 * 5,

    # Gap between train and validation to prevent label leakage
    "gap_hours": 2,

    # Validation window: 7 days = 168 hours, scaled by 20 -> 3,360 hours
    "val_hours": 168 * 4 * 5,

    # Step size for sliding window (determines fold overlap): 85 * 20 = 1,700 hours
    "step_hours": 85 * 4 * 5,
}

# Data paths
DATA_PATHS = {
    "train": "dbfs:/student-groups/Group_2_2/5_year_custom_joined/fe_graph_and_holiday_nnfeat/training_splits/train_w_preds.parquet/",
    "val": "dbfs:/student-groups/Group_2_2/5_year_custom_joined/fe_graph_and_holiday_nnfeat/training_splits/val_w_preds.parquet/",
    "test": "dbfs:/student-groups/Group_2_2/5_year_custom_joined/fe_graph_and_holiday_nnfeat/training_splits/test_w_preds.parquet/",
    "output": "dbfs:/student-groups/Group_2_2/5_year_custom_joined/fe_graph_and_holiday_nnfeat/cv_splits2",
}

# Flag to control whether to save (prevents accidental overwrites)
SAVE_SPLITS = False  # Set to True to enable saving

In [0]:
# =============================================================================
# Section 2: Data Loading
# =============================================================================

# Load feature-engineered datasets with graph features and holiday indicators
train_df = spark.read.parquet(DATA_PATHS["train"])
val_df = spark.read.parquet(DATA_PATHS["val"])
test_df = spark.read.parquet(DATA_PATHS["test"])

print(f"Training set:   {train_df.count():,} records")
print(f"Validation set: {val_df.count():,} records")
print(f"Test set:       {test_df.count():,} records")

In [0]:
# =============================================================================
# Section 3: Time Index Creation
# =============================================================================

# Combine train and validation sets for cross-validation
# (Test set is held out for final evaluation)
cv_df = train_df.unionByName(val_df)

# Calculate dataset start time for time indexing
min_timestamp = cv_df.agg(F.min("utc_timestamp")).collect()[0][0]
print(f"Dataset start time: {min_timestamp}")

# Create initial time index (whole hours since dataset start).
# Section 4 later replaces this column with a dense-rank index.
cv_df = cv_df.withColumn(
    "time_idx", 
    ((F.col("utc_timestamp").cast("long") - F.lit(min_timestamp).cast("long")) / 3600).cast("long")
)

Dataset start time: 2015-01-01 03:30:00
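The hours-since-start calculation can be checked by hand in plain Python. This is a small sketch, not part of the pipeline: `hours_since` is a hypothetical helper, and the sample departure time is chosen only for illustration. It uses floor division, which matches casting a non-negative quotient to `long`.

```python
from datetime import datetime


def hours_since(start: datetime, ts: datetime) -> int:
    """Whole hours elapsed between start and ts (floor division)."""
    return int((ts - start).total_seconds()) // 3600


dataset_start = datetime(2015, 1, 1, 3, 30)   # dataset start time printed above
flight_time = datetime(2015, 1, 13, 13, 35)   # illustrative departure timestamp
print(hours_since(dataset_start, flight_time))  # 298
```

Hour buckets under this scheme are counted from 03:30, the dataset start, rather than from clock-hour boundaries.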


In [0]:
# Filter out cancelled flights (only predict delays for flights that departed)
cv_df = cv_df.filter(F.col("CANCELLED") != 1)

total_records = cv_df.count()
print(f"Total records after filtering cancelled flights: {total_records:,}")

# Preview the data
display(cv_df.limit(10))

Total records after filtering cancelled flights: 22,451,084


Preview of `cv_df` (10 rows). Selected columns are shown; the full column list follows the table.

| flight_uid | FL_DATE | ORIGIN | DEST | CRS_DEP_TIME | DEP_DELAY | DEP_DEL15 | utc_timestamp | time_idx |
|------------|---------|--------|------|--------------|-----------|-----------|---------------|----------|
| ORD-2015-01-13-20398-3423-N671MQ-1335 | 2015-01-13 | ORD | ALO | 1335 | 92.0 | 1.0 | 2015-01-13T13:35:00Z | 298 |
| ORD-2015-03-24-20398-3039-N621MQ-2010 | 2015-03-24 | ORD | ALO | 2010 | -5.0 | 0.0 | 2015-03-24T20:10:00Z | 1984 |
| ATL-2015-03-28-19790-1087-N936DN-1205 | 2015-03-28 | ATL | DAB | 1205 | -6.0 | 0.0 | 2015-03-28T12:05:00Z | 2072 |
| ATL-2015-05-06-19790-2587-N522US-2204 | 2015-05-06 | ATL | DAB | 2204 | 21.0 | 1.0 | 2015-05-06T22:04:00Z | 3018 |
| ATL-2015-06-01-19790-1087-N989DL-1205 | 2015-06-01 | ATL | DAB | 1205 | -2.0 | 0.0 | 2015-06-01T12:05:00Z | 3632 |
| ATL-2015-06-04-19790-2439-N944DL-1534 | 2015-06-04 | ATL | DAB | 1534 | 5.0 | 0.0 | 2015-06-04T15:34:00Z | 3708 |
| ATL-2015-08-23-19790-1087-N684DA-1204 | 2015-08-23 | ATL | DAB | 1204 | 8.0 | 0.0 | 2015-08-23T12:04:00Z | 5624 |
| ATL-2015-12-26-19790-2387-N918DE-935 | 2015-12-26 | ATL | DAB | 935 | -2.0 | 0.0 | 2015-12-26T09:35:00Z | 8622 |
| ATL-2016-03-03-19790-2453-N999DN-1956 | 2016-03-03 | ATL | DAB | 1956 | -3.0 | 0.0 | 2016-03-03T19:56:00Z | 10264 |
| JFK-2016-03-19-20409-393-N627JB-1249 | 2016-03-19 | JFK | DAB | 1249 | 32.0 | 1.0 | 2016-03-19T12:49:00Z | 10641 |

Columns: flight_uid,page_rank,out_degree,in_degree,weighted_out_degree,weighted_in_degree,N_RUNWAYS,betweenness_unweighted,closeness,betweenness,avg_origin_dep_delay,avg_dest_arr_delay,avg_daily_route_flights,avg_route_delay,avg_hourly_flights,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,FIRST_DEP_TIME,TOTAL_ADD_GTIME,LONGEST_ADD_GTIME,DIV_AIRPORT_LANDINGS,DIV_REACHED_DEST,DIV_ACTUAL_ELAPSED_TIME,DIV_ARR_DELAY,DIV_DISTANCE,DIV1_AIRPORT,DIV1_AIRPORT_ID,DIV1_AIRPORT_SEQ_ID,DIV1_WHEELS_ON,DIV1_TOTAL_GTIME,DIV1_LONGEST_GTIME,DIV1_WHEELS_OFF,DIV1_TAIL_NUM,DIV2_AIRPORT,DIV2_AIRPORT_ID,DIV2_AIRPORT_SEQ_ID,DIV2_WHEELS_ON,DIV2_TOTAL_GTIME,DIV2_LONGEST_GTIME,DIV2_WHEELS_OFF,DIV2_TAIL_NUM,DIV3_AIRPORT,DIV3_AIRPORT_ID,DIV3_AIRPORT_SEQ_ID,DIV3_WHEELS_ON,DIV3_TOTAL_GTIME,DIV3_LONGEST_GTIME,DIV3_WHEELS_OFF,DIV3_TAIL_NUM,DIV4_AIRPORT,DIV4_AIRPORT_ID,DIV4_AIRPORT_SEQ_ID,DIV4_WHEELS_ON,DIV4_TOTAL_GTIME,DIV4_LONGEST_GTIME,DIV4_WHEELS_OFF,DIV4_TAIL_NUM,DIV5_AIRPORT,DIV5_AIRPORT_ID,DIV5_AIRPORT_SEQ_ID,DIV5_WHEELS_ON,DIV5_TOTAL_GTIME,DIV5_LONGEST_GTIME,DIV5_WHEELS_OFF,DIV5_TAIL_NUM,year,HourlyDryBulbTemperature,HourlyDewPointTemperature,HourlyRelativeHumidity,HourlyAltimeterSetting,HourlyVisibility,HourlyStationPressure,HourlyWetBulbTemperature,HourlyPrecipitation,HourlyCloudCoverage,HourlyCloudElevation,HourlyWindSpeed,utc_timestamp,CRS_DEP_MINUTES,origin_delays_4h,prev_flight_delay_in_minutes,prev_flight_delay,delay_origin_7d,delay_origin_carrier_7d,route,delay_route_7d,flight_count_24h,LANDING_TIME_DIFF_MINUTES,AVG_ARR_DELAY_ORIGIN,AVG_TAXI_OUT_ORIGIN,IS_HOLIDAY,IS_HOLIDAY_WINDOW,AIRPORT_HUB_CLASS,RATING,AIRLINE_CATEGORY,dep_hour,day_of_year,dep_hour_sin,dep_hour_cos,dow_sin,dow_cos,doy_sin,doy_cos,HourlyVisibility_3h_change,HourlyStationPressure_3h_change,HourlyDryBulbTemperature_3h_change,HourlyWindSpeed_3h_change,HourlyPrecipitation_3h_change,utc_ts_sec,ground_flights_last_hour,arrivals_last_hour,xgb_predicted_delay,time_idx


In [0]:
# Reconstruct UTC timestamp from flight date and scheduled departure time
# CRS_DEP_TIME is in HHMM format (e.g., 1430 for 2:30 PM)
cv_df = cv_df.withColumn(
    "utc_timestamp",
    F.to_timestamp(
        F.concat(
            F.col("FL_DATE"),
            F.lit(" "),
            F.lpad(F.col("CRS_DEP_TIME").cast("string"), 4, "0")
        ),
        "yyyy-MM-dd HHmm"
    )
)
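The zero-padding step matters because early-morning departures are stored as short integers (for example, 330 for 03:30). The plain-Python sketch below mirrors the same idea outside Spark; `scheduled_departure` is a hypothetical helper and the `strptime` pattern is the Python counterpart of `yyyy-MM-dd HHmm`, not the Spark pattern itself.

```python
from datetime import datetime


def scheduled_departure(fl_date: str, crs_dep_time: int) -> datetime:
    """Combine a yyyy-MM-dd date with an HHMM integer, zero-padded to 4 digits."""
    hhmm = str(crs_dep_time).zfill(4)  # 330 -> "0330", like lpad(..., 4, "0")
    return datetime.strptime(f"{fl_date} {hhmm}", "%Y-%m-%d %H%M")


print(scheduled_departure("2015-01-01", 330))   # 2015-01-01 03:30:00
print(scheduled_departure("2015-01-13", 1335))  # 2015-01-13 13:35:00
```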

## Section 4: Cross-Validation Split Generation

The time index is created using `dense_rank()` over hourly-truncated timestamps. This ensures:
- Contiguous time indices (no gaps)
- Consistent ordering across the dataset
- Proper sliding window behavior for CV folds
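The behavior of a dense rank over hour buckets can be illustrated without Spark. The sketch below is plain Python with a hypothetical helper, `dense_hour_index`, and made-up sample timestamps; it shows that hours with no flights do not leave holes in the index.

```python
from datetime import datetime


def dense_hour_index(timestamps):
    """Map each timestamp to the 1-based dense rank of its hour bucket."""
    hours = [ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps]
    rank_of = {h: i for i, h in enumerate(sorted(set(hours)), start=1)}
    return [rank_of[h] for h in hours]


sample = [
    datetime(2015, 1, 1, 3, 30),
    datetime(2015, 1, 1, 4, 38),
    datetime(2015, 1, 1, 4, 0),
    datetime(2015, 1, 1, 7, 15),  # hours 05:00 and 06:00 have no flights
]
print(dense_hour_index(sample))  # [1, 2, 2, 3]
```

One consequence of this design is that a window of N index values covers N hours that contain flights, which can be longer than N clock hours if some hours are empty.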

In [0]:
# Truncate timestamp to hour level for consistent time bucketing
df_indexed = cv_df.withColumn(
    "hour", 
    F.date_trunc("hour", F.col("utc_timestamp"))
)

# Create dense time index based on unique hours
# dense_rank() ensures no gaps in the sequence
# Note: a window with no partitionBy moves all rows to a single partition,
# which can be slow on a dataset of this size
window_spec = Window.orderBy("hour")
df_indexed = df_indexed.withColumn(
    "time_idx",
    F.dense_rank().over(window_spec)
)

print("Created time index with dense ranking")
display(df_indexed.select("FL_DATE", "CRS_DEP_TIME", "hour", "time_idx").limit(10))

| FL_DATE | CRS_DEP_TIME | hour | time_idx |
|---------|--------------|------|----------|
| 2015-01-01 | 330 | 2015-01-01T03:00:00Z | 1 |
| 2015-01-01 | 438 | 2015-01-01T04:00:00Z | 2 |
| 2015-01-01 | 400 | 2015-01-01T04:00:00Z | 2 |
| 2015-01-01 | 510 | 2015-01-01T05:00:00Z | 3 |
| 2015-01-01 | 545 | 2015-01-01T05:00:00Z | 3 |
| 2015-01-01 | 550 | 2015-01-01T05:00:00Z | 3 |
| 2015-01-01 | 545 | 2015-01-01T05:00:00Z | 3 |
| 2015-01-01 | 520 | 2015-01-01T05:00:00Z | 3 |
| 2015-01-01 | 540 | 2015-01-01T05:00:00Z | 3 |
| 2015-01-01 | 559 | 2015-01-01T05:00:00Z | 3 |


### Calculate Number of Folds

Based on the configuration, we calculate how many complete folds fit within the dataset timespan.
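
The fold-count arithmetic can be sketched in plain Python. This is a minimal illustration, assuming `time_idx` starts at 1 and that a fold only counts when its full train + gap + validation window fits in the data. The helper names `count_folds` and `last_index_of_fold` and the toy sizes are hypothetical; they are not the notebook's configuration.

```python
def count_folds(max_time_idx: int, train: int, gap: int, val: int, step: int) -> int:
    """Number of complete sliding-window folds, assuming time_idx starts at 1."""
    window = train + gap + val
    if max_time_idx < window:
        return 0  # not even one complete fold fits
    return (max_time_idx - window) // step + 1


def last_index_of_fold(fold_id: int, train: int, gap: int, val: int, step: int) -> int:
    """Largest time index touched by a fold (inclusive)."""
    fold_start = 1 + (fold_id - 1) * step
    return fold_start + train + gap + val - 1


# Toy values for illustration only (hypothetical, not the real CV_CONFIG)
sizes = dict(train=30, gap=2, val=7, step=10)
n = count_folds(max_time_idx=100, **sizes)

print(n)                                   # 7
print(last_index_of_fold(n, **sizes))      # 99  -> still inside the data
print(last_index_of_fold(n + 1, **sizes))  # 109 -> would run past index 100
```

The last fold ends at or before the maximum index, and one more fold would overrun it, which is the property the floor division is meant to guarantee.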

In [0]:
# Get the maximum time index to determine dataset span
max_time_idx = df_indexed.agg(F.max("time_idx")).collect()[0][0]
print(f"Dataset spans {max_time_idx:,} unique hours")

# Extract configuration values
train_size = CV_CONFIG["train_hours"]
gap_size = CV_CONFIG["gap_hours"]
val_size = CV_CONFIG["val_hours"]
step_size = CV_CONFIG["step_hours"]

# Calculate total window size and number of folds
fold_window_size = train_size + gap_size + val_size
n_folds = (max_time_idx - fold_window_size) // step_size + 1

print("\nCross-Validation Configuration:")
print(f"  Training window:   {train_size:,} hours ({train_size // 24:,} days)")
print(f"  Gap size:          {gap_size} hours")
print(f"  Validation window: {val_size:,} hours ({val_size // 24:,} days)")
print(f"  Step size:         {step_size:,} hours")
print(f"  Total folds:       {n_folds}")

  Max time index: 33777
Step 2: Calculated 10 folds


In [0]:
# Generate fold assignments for each time index
# Each time index can belong to multiple folds (as train, gap, or validation)
fold_mapping = []

for fold_id in range(1, n_folds + 1):
    fold_start = 1 + (fold_id - 1) * step_size
    
    # Training period
    for t in range(fold_start, fold_start + train_size):
        fold_mapping.append((t, fold_id, "train"))
    
    # Gap period (excluded from both train and validation)
    for t in range(fold_start + train_size, fold_start + train_size + gap_size):
        fold_mapping.append((t, fold_id, "gap"))
    
    # Validation period
    for t in range(fold_start + train_size + gap_size, fold_start + train_size + gap_size + val_size):
        fold_mapping.append((t, fold_id, "validation"))

print(f"Generated {len(fold_mapping):,} fold assignments")

# Create DataFrame from fold mapping
fold_df = spark.createDataFrame(fold_mapping, ["time_idx", "fold_id", "split_type"])

# Join fold assignments with flight data
# Using broadcast join since fold_df is relatively small
result = df_indexed.join(
    F.broadcast(fold_df),
    on="time_idx",
    how="inner"
)

# Display fold statistics
print("\nRecords per fold and split type:")
result.groupBy("fold_id", "split_type").count().orderBy("fold_id", "split_type").display()
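
The boundaries produced by the loop above can be checked without Spark. The sketch below recomputes each fold's train, gap, and validation ranges with the same start formula and asserts that validation begins exactly `gap_size` indices after training ends. `FoldBounds`, `fold_bounds`, and `check_no_leakage` are hypothetical helper names, and the sizes are toy values rather than the notebook's configuration.

```python
from typing import NamedTuple


class FoldBounds(NamedTuple):
    train: range
    gap: range
    validation: range


def fold_bounds(fold_id: int, train_size: int, gap_size: int,
                val_size: int, step_size: int) -> FoldBounds:
    """Time-index ranges for one fold, mirroring fold_start = 1 + (fold_id - 1) * step."""
    start = 1 + (fold_id - 1) * step_size
    train_end = start + train_size   # exclusive end of the training range
    gap_end = train_end + gap_size   # exclusive end of the gap range
    return FoldBounds(
        train=range(start, train_end),
        gap=range(train_end, gap_end),
        validation=range(gap_end, gap_end + val_size),
    )


def check_no_leakage(bounds: FoldBounds, gap_size: int) -> None:
    """Raise AssertionError if training overlaps or touches validation."""
    last_train = bounds.train[-1]
    first_val = bounds.validation[0]
    assert last_train < first_val
    assert first_val - last_train - 1 == gap_size


# Toy sizes for illustration only (hypothetical)
for fold_id in (1, 2):
    b = fold_bounds(fold_id, train_size=30, gap_size=2, val_size=7, step_size=10)
    check_no_leakage(b, gap_size=2)
    print(fold_id, b.train[0], b.train[-1], b.validation[0], b.validation[-1])
# 1 1 30 33 39
# 2 11 40 43 49
```

Running a check like this over every `fold_id` before the join is a cheap way to catch an off-by-one in the window arithmetic.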


In [0]:
# =============================================================================
# Section 5: Export Splits
# =============================================================================

if SAVE_SPLITS:
    print(f"Saving CV splits to: {DATA_PATHS['output']}")
    
    # Write partitioned by fold_id and split_type for efficient loading
    result.write \
        .partitionBy("fold_id", "split_type") \
        .mode("overwrite") \
        .parquet(DATA_PATHS["output"])
    
    print("✓ CV splits saved successfully!")
else:
    print("SAVE_SPLITS is False. Set to True in configuration to enable saving.")
    print(f"Output path would be: {DATA_PATHS['output']}")


---

## Summary

This notebook generates time-based cross-validation splits for the flight delay prediction model. Key design decisions:

1. **Time-Based Splitting**: Prevents data leakage by ensuring training data always precedes validation data chronologically.

2. **Gap Period**: A 2-hour gap between the training and validation windows guards against label leakage at the split boundary, where a delayed flight can cascade into delays for flights departing shortly afterward.

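3. **Sliding Window**: Each fold starts one step size after the previous one. Because the step is shorter than the fold window, consecutive folds overlap, which reuses more of the data while keeping every fold in chronological order.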

4. **Partitioned Output**: Data is partitioned by `fold_id` and `split_type` for efficient loading during model training.
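
As a rough sketch of how the partitioned output might be consumed, the helper below builds the directory for one fold and split type. It assumes Spark's default Hive-style layout for `partitionBy` (`column=value` subdirectories). The function name `partition_path` and the base path `/tmp/cv_splits` are hypothetical stand-ins for `DATA_PATHS["output"]`.

```python
VALID_SPLIT_TYPES = {"train", "gap", "validation"}


def partition_path(base_path: str, fold_id: int, split_type: str) -> str:
    """Directory holding one (fold_id, split_type) partition.

    Assumes the Hive-style column=value layout that partitionBy writes by default.
    """
    if split_type not in VALID_SPLIT_TYPES:
        raise ValueError(f"unknown split_type: {split_type!r}")
    return f"{base_path.rstrip('/')}/fold_id={fold_id}/split_type={split_type}"


# Hypothetical base path, for illustration only
path = partition_path("/tmp/cv_splits", fold_id=3, split_type="validation")
print(path)  # /tmp/cv_splits/fold_id=3/split_type=validation

# With an active SparkSession named `spark`, the partition could then be read as:
#   spark.read.option("basePath", "/tmp/cv_splits").parquet(path)
# Setting basePath keeps fold_id and split_type as columns in the result.
```

Reading the base path and filtering on `fold_id` and `split_type` is an equivalent alternative, since Spark can prune partitions from the filter.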

### Output Schema

The exported Parquet files contain all original features plus:
- `time_idx`: Dense time index (hourly granularity)
- `fold_id`: Cross-validation fold identifier (1-10)
- `split_type`: "train", "gap", or "validation"