# Peru SWAN Pipeline: End-to-End Run
This notebook runs each major pipeline script in a separate cell.

**Set your desired `RUN_PATH` before running, either in `config.py` or through the `%env RUN_PATH=...` line at the top of each cell.**

---

In [5]:
# 1a. Data Preparation (Initial Processing)
%env PYTHONPATH=.
%env RUN_PATH=run_g10
%env MIN_DAYS=1
!python3 scripts/data_preparation_0.py

env: PYTHONPATH=.
env: RUN_PATH=run_g10
env: MIN_DAYS=1
✅ Using centralized configuration
🔧 DATA_PREPARATION_0.PY - Initial Processing
🚀 DATA PREPARATION STEP 0 - INITIAL PROCESSING
Run: run_g10
Reference port: DPA_VILA_VILA
Coordinates: (-18.11, -70.726)

📊 STEP 1: Loading port closure data...
✅ Loaded 23073 port closure records
Years: [np.int64(2013), np.int64(2014), np.int64(2015), np.int64(2016), np.int64(2017), np.int64(2018), np.int64(2019), np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024), np.int64(2025)]
Ports: 150 unique ports

🌊 STEP 2: Loading SWAN wave data...
Loading SWAN wave data from CSV...
Loaded 36,017 wave data records
Date range: 2013-01-01 to 2025-04-30
SWH range: 0.730 to 4.057 m

📈 STEP 3: Processing SWAN data to daily aggregates...
Processing SWAN data for DPA_VILA_VILA...
60th percentile threshold from 2023: 1.75m
✅ SWAN daily processing complete: (2557, 30)

📡 STEP 4: Loading and processing WAVERYS data...
Loading WAVERYS data for
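
The log above mentions daily aggregates and a "60th percentile threshold from 2023". The script itself is not shown here, so the following is only a minimal sketch of how such aggregates and a reference-year percentile could be computed. The helper names `daily_aggregates` and `percentile_threshold` are hypothetical, the toy records are made up, and taking the percentile over raw records (rather than daily values) is an assumption.

```python
from collections import defaultdict
from datetime import datetime

import numpy as np


def daily_aggregates(records):
    """Group (timestamp, swh) records by calendar day and summarize each day."""
    by_day = defaultdict(list)
    for ts, swh in records:
        by_day[ts.date()].append(swh)
    return {
        day: {
            "swh_mean": float(np.mean(vals)),
            "swh_max": float(np.max(vals)),
            "swh_min": float(np.min(vals)),
            "swh_median": float(np.median(vals)),
            "swh_p80": float(np.percentile(vals, 80)),
        }
        for day, vals in sorted(by_day.items())
    }


def percentile_threshold(records, year, q=60):
    """q-th percentile of SWH over all records from one reference year."""
    vals = [swh for ts, swh in records if ts.year == year]
    return float(np.percentile(vals, q))


# Toy records (made-up values, three per day), not pipeline data.
records = [
    (datetime(2023, 1, 1, 0), 1.0),
    (datetime(2023, 1, 1, 8), 2.0),
    (datetime(2023, 1, 1, 16), 3.0),
    (datetime(2023, 1, 2, 0), 1.5),
    (datetime(2023, 1, 2, 8), 1.5),
    (datetime(2023, 1, 2, 16), 3.0),
]
daily = daily_aggregates(records)
threshold = percentile_threshold(records, year=2023, q=60)
```

The feature names mirror the `swh_*` columns listed in the later log output; the real script produces 30 columns, most of which are not reproduced here.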

In [8]:
# 1b. Data Preparation (Enhanced Processing)
%env PYTHONPATH=.
%env RUN_PATH=run_g10
%env MIN_DAYS=1
!python3 scripts/data_preparation_1.py

env: PYTHONPATH=.
env: RUN_PATH=run_g10
env: MIN_DAYS=1
✅ Using centralized configuration

🔧 DATA_PREPARATION_1.PY - Enhanced Processing

STEP 1: Loading and validating daily data
✓ Loaded SWAN daily data: (2557, 30)
✓ Loaded WAVERYS data for DPA_VILA_VILA: (2578, 47)
SWAN data date range: 2018-01-01 to 2024-12-31
WAVERYS data date range: 2018-01-01 to 2024-12-31
WAVERYS ports: ['DPA_VILA_VILA']

STEP 2: Detrending and deseasonalizing reference point data
Input data shape: (2557, 30)
Apply detrending: True
Apply deseasonalizing: True
Processing 21 wave features...
Example features: ['swh_mean', 'swh_max', 'swh_min', 'swh_median', 'swh_p80']
Vectorized processing of 21 features...

STEP 3: Enhanced feature engineering on reference point data
Input shape: (2557, 73)
Use processed features: True
Creating enhanced features from 64 processed features...
  Creating persistence features (memory-efficient)...
  Creating trend features (memory-efficient)...
    Computing slopes for window 3...
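
STEP 2 in the log detrends and deseasonalizes each wave feature. The exact method is not visible in the output, so this is a sketch of one common approach, under stated assumptions: a linear trend is removed with a least-squares fit, and the seasonal cycle is removed by subtracting the mean for each day of the year. The function name `detrend_deseasonalize` is hypothetical and the toy series is synthetic.

```python
import numpy as np


def detrend_deseasonalize(values, day_of_year):
    """Remove a linear trend, then subtract the per-day-of-year mean."""
    values = np.asarray(values, dtype=float)
    day_of_year = np.asarray(day_of_year)

    # Linear trend by least squares over the time index.
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, deg=1)
    detrended = values - (slope * t + intercept)

    # Seasonal climatology: mean of the detrended series for each day of year.
    anomalies = detrended.copy()
    for doy in np.unique(day_of_year):
        mask = day_of_year == doy
        anomalies[mask] -= detrended[mask].mean()
    return anomalies


# Synthetic example: a 4-step "season" repeated twice plus a linear trend.
doy = np.tile(np.arange(4), 2)
values = 0.1 * np.arange(8) + np.tile([1.0, 2.0, 3.0, 2.0], 2)
anoms = detrend_deseasonalize(values, doy)
```

With real daily data the per-day climatology is usually smoothed (for example with a rolling window) because each calendar day has only a handful of samples; that refinement is omitted here.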


In [16]:
# 2. Rule Evaluation (CV, ML, Thresholds)
%env PYTHONPATH=.
%env RUN_PATH=run_g10
%env MIN_DAYS=1
%env EVENT_DUMMY_TARGET=event_dummy_5
!python3 scripts/rule_evaluation.py

env: PYTHONPATH=.
env: RUN_PATH=run_g10
env: MIN_DAYS=1
env: EVENT_DUMMY_TARGET=event_dummy_5

🔧 RULE_EVALUATION.PY - CV Pipeline
Run: run_g10
Reference port: DPA_VILA_VILA
🔎 Looking for merged features at: /Users/ageidv/suyana/peru_swan/wave_analysis_pipeline/data/processed/run_g10/df_swan_waverys_merged.csv
✅ Loaded merged features: /Users/ageidv/suyana/peru_swan/wave_analysis_pipeline/data/processed/run_g10/df_swan_waverys_merged.csv ((2578, 13111))

🚦 Running enhanced rule evaluation pipeline ...
🚀 FAST ENHANCED CV PIPELINE
📊 Dataset: 2578 samples, 292 events (11.3%)

CREATING CROSS-VALIDATION SPLITS
📅 Using TimeSeriesSplit with 6 folds
🔍 FEATURE COMPARISON DEBUG:
Total columns in dataset: 13111
Excluded columns: ['event_dummy_1', 'event_dummy_2', 'event_dummy_3', 'event_dummy_4', 'event_dummy_5', 'event_dummy_6', 'event_dummy_7']
Features being used: 13102
First 20 features:
   1. swh_mean_swan
   2. swh_max_swan
   3. swh_min_swan
   4. swh_median_swan
   5. swh_p80_swan
   6. sw
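
The log reports `TimeSeriesSplit` with 6 folds on 2578 samples. As an illustration of what expanding-window splits look like, here is a small pure-Python sketch that follows what I understand to be scikit-learn's default sizing rule (test size `n_samples // (n_splits + 1)`, no gap). The helper name `expanding_window_splits` is hypothetical, and the pipeline may pass other parameters that are not shown in the output.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, test) index ranges where training always precedes testing."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        # Training window grows with each fold; test window has fixed size.
        yield range(0, start), range(start, start + test_size)


splits = list(expanding_window_splits(n_samples=2578, n_splits=6))
for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train)} test={len(test)}")
```

Because every test window lies strictly after its training window, no future observations leak into model fitting, which is the reason for using this scheme on daily wave data.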

In [5]:
# 3. AEP Calculation (Rule & ML)
%env PYTHONPATH=.
%env RUN_PATH=run_g3
%env MIN_DAYS=3
!python3 scripts/aep_calculation.py

env: PYTHONPATH=.
env: RUN_PATH=run_g3
env: MIN_DAYS=3
[AEP] Aggregated N_PARAM (fishermen): 14424.0
[AEP] Aggregated W_PARAM (wage): 11.269387755102041

🔧 AEP_CALCULATION.PY - Final AEP Analysis
Run: run_g3
Reference port: CALETA_TIERRA_COLORADA
✅ Best single rule: anom_swh_min_swan > 0.225

🚀 Running speed-optimized AEP simulation...
🚀 SPEED-OPTIMIZED UNIFIED AEP ANALYSIS
  Data: 2341 observations
  Trigger: anom_swh_min_swan > 0.22494295115257257
  Port: 14424.0 fishermen × $11.269387755102041/day
  Min event: 3 days
  Block length: 4 days
  Simulations: 4000
  Using 2341 days for simulation.
  Observed events: 50 out of 2341 days
  Pre-computing trigger values...
  Generating block bootstrap samples...
  Pre-computing valid block positions...
  Generating all simulation indices...
  Processing 4000 simulations using 12 parallel workers...
  Using 12 threads for parallel processing...
Processing batches: 100%|███████████████████████| 13/13 [00:00<00:00, 50.01it/s]
  Completed 4000 s
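
The output above describes a block-bootstrap simulation (block length 4 days, minimum event 3 days, 4000 simulations). The script's internals are not shown, so the sketch below only illustrates the general technique under stated assumptions: synthetic years are built from randomly chosen blocks of consecutive days, an event is a run of at least `min_days` consecutive triggered days, and loss is event days × fishermen × daily wage. All function names are hypothetical and the trigger series is synthetic.

```python
import numpy as np


def event_days(triggered, min_days):
    """Count days that belong to runs of >= min_days consecutive triggered days."""
    total, run = 0, 0
    for flag in triggered:
        if flag:
            run += 1
        else:
            if run >= min_days:
                total += run
            run = 0
    if run >= min_days:
        total += run
    return total


def block_bootstrap_losses(triggered, n_fishermen, wage, min_days=3,
                           block_length=4, n_sims=4000,
                           days_per_year=365, seed=0):
    """Simulate annual losses by resampling blocks of consecutive days."""
    rng = np.random.default_rng(seed)
    triggered = np.asarray(triggered, dtype=bool)
    n_blocks = -(-days_per_year // block_length)  # ceiling division
    max_start = len(triggered) - block_length
    offsets = np.arange(block_length)
    losses = np.empty(n_sims)
    for i in range(n_sims):
        starts = rng.integers(0, max_start + 1, size=n_blocks)
        # Concatenate the blocks and trim to one synthetic year.
        idx = (starts[:, None] + offsets).ravel()[:days_per_year]
        losses[i] = event_days(triggered[idx], min_days) * n_fishermen * wage
    return losses


def exceedance_probability(losses, level):
    """Fraction of simulated years whose loss exceeds `level`."""
    return float(np.mean(np.asarray(losses) > level))


# Synthetic trigger series: a 5-day run every 50 days (not pipeline data).
triggered = (np.arange(2000) % 50) < 5
losses = block_bootstrap_losses(triggered, n_fishermen=100, wage=10.0, n_sims=200)
p_exceed = exceedance_probability(losses, level=float(np.median(losses)))
```

Resampling blocks rather than single days preserves short-range persistence in the trigger series, which matters when events are defined by consecutive days.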

In [6]:
# 4. AEP Calculation (ML)
%env PYTHONPATH=.
%env RUN_PATH=run_g3
%env MIN_DAYS=3
!python3 scripts/aep_ml_calculation.py

env: PYTHONPATH=.
env: RUN_PATH=run_g3
env: MIN_DAYS=3
[ML-AEP] N_PARAM (fishermen): 1482.0
[ML-AEP] W_PARAM (wage): 50.2

🎲 ENHANCED ML AEP CALCULATOR - Independent Daily Draws
This uses the THEORETICALLY CORRECT approach for ML probabilities:
• Each day's probability = P(port closes | features)
• Independent random draws (no block bootstrap)
• Temporal correlation already captured in ML features
• Enhanced output matching other simulation scripts
Run: run_g3
Reference port: CALETA_TIERRA_COLORADA
✅ Using ML probabilities: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g10/ML_probs_2024.csv
✅ Using optimal threshold: 0.3136
✅ Loaded ML probabilities: (366, 9)
✅ Merged data: (150, 8497)
   Date range: 2024-01-01 00:00:00 to 2024-05-29 00:00:00
✅ Found observed events: 8 out of 150 days

🎲 Running ENHANCED ML AEP simulation...
🎲 CORRECTED ML AEP ANALYSIS (Independent Daily Draws)
  Daily probabilities: 150 days
  Probability range: 0.0063 to 0.8859
  Mean probab
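
The log describes the ML approach as independent daily draws: each day's closure is a Bernoulli trial with probability P(port closes | features). A minimal sketch of that idea follows. The function name `simulate_ml_losses` is hypothetical, the daily probabilities are toy values, and the loss formula (closed days × fishermen × wage) is an assumption; the fishermen and wage values are the ones printed in the log. The role of `MIN_DAYS` and of the optimal threshold in the real script is not shown in the output and is left out here.

```python
import numpy as np


def simulate_ml_losses(daily_probs, n_fishermen, wage, n_sims=4000, seed=0):
    """Simulate losses from independent Bernoulli draws, one per day."""
    rng = np.random.default_rng(seed)
    p = np.asarray(daily_probs, dtype=float)
    # Each row is one simulation; a day is "closed" when the draw falls below p.
    closed = rng.random((n_sims, p.size)) < p
    closed_days = closed.sum(axis=1)
    return closed_days * n_fishermen * wage


# Toy probabilities (made up); N and W taken from the log above.
probs = [0.0, 1.0, 0.5, 0.5]
ml_losses = simulate_ml_losses(probs, n_fishermen=1482, wage=50.2)
mean_closed_days = ml_losses.mean() / (1482 * 50.2)
```

With these toy probabilities the expected number of closed days is 2, so `mean_closed_days` should land close to that value across 4000 simulations.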

In [2]:
# 5. AEP calculation for multiple conditions

%env PYTHONPATH=.
%env RUN_PATH=run_g8
%env MIN_DAYS=3
!python3 scripts/aep_calculation_experiment.py

env: PYTHONPATH=.
env: RUN_PATH=run_g8
env: MIN_DAYS=3
[AEP] Aggregated N_PARAM (fishermen): 183.0
[AEP] Aggregated W_PARAM (wage): 12.785714285714286

🚀 ENHANCED MULTI-RULE AEP ANALYSIS (Keeping Working Parts)
Run: run_g8
Reference port: CALETA_NAZCA

📋 Generating rule combinations...
  Top features: ['swh_max_swan', 'swh_p80_swan', 'anom_swh_max_waverys']
  Testing 15 double rule combinations...

--- 1/15: double_AND ---
    F1: 0.633
  🚀 Enhanced multi-rule AEP: 2 features, 4000 sims
📊 Observed events: 620 out of 2557 days
📊 Rule predictions: 844 out of 2557 days (33.0%)
📊 Observed confusion matrix: TP=463, FP=381, TN=1556, FN=157
🔄 Running enhanced simulation with detailed tracking...
✅ Completed 4000 simulations
    ✅ Mean loss: $209,378, P99 loss: $306,535

--- 2/15: double_OR ---
    F1: 0.630
  🚀 Enhanced multi-rule AEP: 2 features, 4000 sims
📊 Observed events: 620 out of 2557 days
📊 Rule predictions: 895 out of 2557 days (35.0%)
📊 Observed confusion matrix: TP=477, FP=418, TN=
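
The F1 values in this output can be checked directly from the confusion-matrix counts. The sketch below combines two boolean rule outputs with AND or OR, tallies the confusion matrix, and recomputes F1 for combination 1/15 from the logged counts. The helper names are hypothetical; the counts are copied from the log.

```python
def combine_rules(a, b, how):
    """Combine two boolean prediction sequences with AND or OR."""
    if how == "AND":
        return [x and y for x, y in zip(a, b)]
    if how == "OR":
        return [x or y for x, y in zip(a, b)]
    raise ValueError(f"unknown combination: {how}")


def confusion_counts(predicted, observed):
    """Return (TP, FP, TN, FN) for two boolean sequences of equal length."""
    tp = sum(p and o for p, o in zip(predicted, observed))
    fp = sum(p and not o for p, o in zip(predicted, observed))
    tn = sum(not p and not o for p, o in zip(predicted, observed))
    fn = sum(not p and o for p, o in zip(predicted, observed))
    return tp, fp, tn, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Counts reported in the log for combination 1/15 (double_AND).
tp, fp, tn, fn = 463, 381, 1556, 157
print(round(f1_from_counts(tp, fp, fn), 3))  # 0.633, matching the logged F1
```

The same counts also reproduce the other logged figures: TP + FP = 844 predicted days, TP + FN = 620 observed events, and all four cells sum to 2557 days.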

In [None]:
# 6. Plot Pipeline Summary
%env PYTHONPATH=.
%env RUN_PATH=run_g8
%env MIN_DAYS=1
!python3 scripts/plot_pipeline_summary.py


env: PYTHONPATH=.
env: RUN_PATH=run_g3
env: MIN_DAYS=1
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/swh_max_swan_vs_waverys_events.png
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/densities_swan_vs_waverys.png
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/anom_swh_max_swan_vs_waverys_events.png
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/densities_anom_swh_max_swan_vs_waverys.png
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/scatter_swh_max_swan_vs_waverys.png
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/scatter_anom_swh_max_swan_vs_waverys.png
DEBUG: observed losses loaded:
   year  observed_loss
0  2018   1.137848e+06
1  2019   1.625496e+05
2  2020   4.876489e+05
3  2021   2.275695e+06
4  2022   3.250993e+05
5  2023   2.438245e+06
6  2024   2.600794e+06
✅ Saved plot: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g3/aep_with

In [19]:
# 7. Observed Loss Distribution
%env PYTHONPATH=.
%env RUN_PATH=run_g7
%env MIN_DAYS=1
!python3 scripts/plot_observed_loss_distribution.py

env: PYTHONPATH=.
env: RUN_PATH=run_g7
env: MIN_DAYS=1
✅ Using centralized configuration
🔍 OBSERVED LOSS DISTRIBUTION ANALYSIS (SIMPLIFIED)
📍 Analyzing: CALETA_SAN_ANDRES (run_g7)
📁 Results directory: /Users/ageidv/suyana/peru_swan/results/cv_results/run_g7
📅 Found observed losses file: observed_yearly_losses_p75_20250709_235447.csv
✅ Loaded 7 years of observed losses
   Year range: 2018 - 2024
   Loss range: $23,799 - $259,140

OBSERVED LOSS DISTRIBUTION ANALYSIS
Port: CALETA_SAN_ANDRES
Run: run_g7
Analysis Date: 2025-07-21 13:47:01
--------------------------------------------------------------------------------

📊 BASIC STATISTICS:
   Total Years Analyzed: 7
   Total Economic Loss: $922,856
   Average Loss per Year: $131,837
   Median Loss: $108,416
   Standard Deviation: $85,690
   Coefficient of Variation: 0.65

📈 PERCENTILE ANALYSIS:
   Minimum Loss: $23,799
   25th Percentile: $71,396
   75th Percentile: $194,355
   90th Percentile: $227,409
   95th Percentile: $243,274
   99th P
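
The basic statistics and percentiles above can be reproduced from a list of yearly losses with a few NumPy calls. This is a sketch only: the function name `summarize_losses` is hypothetical, the input values are toy numbers rather than the run_g7 losses, and the use of the sample standard deviation (`ddof=1`) and linear-interpolation percentiles is an assumption about what the script does.

```python
import numpy as np


def summarize_losses(losses):
    """Basic statistics and percentiles for a sequence of yearly losses."""
    x = np.asarray(losses, dtype=float)
    mean = float(x.mean())
    std = float(x.std(ddof=1))  # sample standard deviation (assumed)
    p25, p75, p90, p95, p99 = np.percentile(x, [25, 75, 90, 95, 99])
    return {
        "n_years": int(x.size),
        "total": float(x.sum()),
        "mean": mean,
        "median": float(np.median(x)),
        "std": std,
        "cv": std / mean if mean else float("nan"),
        "min": float(x.min()),
        "p25": float(p25),
        "p75": float(p75),
        "p90": float(p90),
        "p95": float(p95),
        "p99": float(p99),
    }


# Toy yearly losses (made up), not the values behind the report above.
summary = summarize_losses([10.0, 20.0, 30.0, 40.0, 50.0])
```

As a consistency check on the logged report itself: $922,856 over 7 years gives an average of about $131,837, and $85,690 / $131,837 gives the reported coefficient of variation of 0.65.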