# Compare how a shot is processed into subsequences: past vs dataset_original

This notebook compares **how one shot is turned into training subsequences** in two setups:

- **Past (current soen_fusion_zero pipeline)**: the segment follows **meta-style** logic: it starts at the baseline (e.g. 40 ms) and ends at the end time taken from the meta data or from the file length. **Tiling** uses a fixed **stride** (like `ECEiTCNDataset._build_subsequences`), so windows can be partial at the end of the segment.
- **Now (dataset_original.py)**: the segment comes from the **shot list** with flattop: it starts at `t_flat_start` and ends at `tend = max(tdisrupt, min(tlast, t_flat_stop))`. **Tiling** is **overlap**-based (`shots2seqs`: step = `nsub - nrecept + 1`) and produces only fixed, full-length windows.

For a selection of shots, we show the segment bounds, the number of subsequences, and a few example windows (start, stop, has_disrupt) for **past** vs **now**. Run the notebook from the **soen_fusion_zero** project root.
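
To make the two tiling schemes concrete, here is a minimal sketch on toy numbers. The functions `stride_windows` and `overlap_windows` are hypothetical stand-ins, not the project's `_build_subsequences` or `shots2seqs`. In particular, the handling of the segment tail (clipping the last window in the stride case, dropping windows that do not fit in the overlap case) is an assumption made for illustration.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, stop), both inclusive sample indices


def stride_windows(start: int, stop: int, nsub: int, stride: int) -> List[Window]:
    """Fixed-stride tiling; the final window is clipped to `stop` and may be partial."""
    windows = []
    s = start
    while s <= stop:
        windows.append((s, min(s + nsub - 1, stop)))
        s += stride
    return windows


def overlap_windows(start: int, stop: int, nsub: int, nrecept: int) -> List[Window]:
    """Overlap tiling with step = nsub - nrecept + 1; only full-length windows are kept."""
    step = nsub - nrecept + 1
    windows = []
    s = start
    while s + nsub - 1 <= stop:
        windows.append((s, s + nsub - 1))
        s += step
    return windows


# Toy numbers, chosen only to keep the output readable.
print(stride_windows(0, 23, nsub=10, stride=6))
print(overlap_windows(0, 23, nsub=10, nrecept=5))
```

With these toy values both schemes advance by 6 samples, but the stride version emits a shorter fourth window at the end of the segment, while the overlap version stops after three full-length windows.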

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd

from disruptcnn.dataset_original import (
    segment_info_for_comparison,
    subsequences_original_tiling,
    subsequences_past_tiling,
)

SHOTS_DIR = Path("disruptcnn/shots")
if not SHOTS_DIR.exists():
    SHOTS_DIR = Path("shots")
DISRUPT_LIST = SHOTS_DIR / "d3d_disrupt_ecei.final.txt"
assert DISRUPT_LIST.exists(), f"Shot list not found: {DISRUPT_LIST}"

## Parameters (match ECEiTCNDataset and dataset_original)

- **Past**: baseline 40 ms, segment end = `tlast` from the shot list; `nsub` and `stride` are given in samples at 1 MHz (dt = 0.001 ms).
- **Now**: flattop segment; `nsub`, `nrecept`, and `data_step` as in dataset_original.
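
As a quick sanity check on these parameters, the following sketch works out how far consecutive windows advance and how much they overlap in each setup. The numbers are the parameter values used in this notebook. Applying dt = 0.001 ms to the "now" windows assumes `data_step = 1`.

```python
# Parameter values used in this notebook.
DT_MS = 0.001  # sample spacing from the shot list (1 MHz)
NSUB_PAST, STRIDE_PAST = 781_250, 481_090
NSUB_ORIG, NRECEPT_ORIG = 78_125, 30_000

# Past: consecutive windows share nsub - stride samples.
overlap_past = NSUB_PAST - STRIDE_PAST

# Now: step = nsub - nrecept + 1, so consecutive windows share nrecept - 1 samples.
step_now = NSUB_ORIG - NRECEPT_ORIG + 1
overlap_now = NSUB_ORIG - step_now

print(f"past: window {NSUB_PAST * DT_MS:.3f} ms, stride {STRIDE_PAST}, overlap {overlap_past} samples")
print(f"now:  window {NSUB_ORIG * DT_MS:.3f} ms, step {step_now}, overlap {overlap_now} samples")
```

The "now" windows are ten times shorter than the "past" ones, which by itself accounts for a large difference in subsequence counts.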

In [None]:
# Past pipeline (ECEiTCNDataset-style): 1 MHz sample space, dt from shot list = 0.001 ms
BASELINE_MS = 40.0
TWARN_MS = 300.0
NSUB_PAST = 781_250   # ~781 ms at 1 MHz
STRIDE_PAST = 481_090  # past pipeline stride

# Now (dataset_original) tiling
NSUB_ORIG = 78125
NRECEPT_ORIG = 30000
DATA_STEP_ORIG = 1

## Load shot list and build segment + subsequences for both

For **now** we use the original (flattop) segment from the shot list. For **past** we simulate the segment from the same shot list: start = baseline (40 ms, converted to samples), end = `tlast`, and `disrupt_idx = (tdisrupt - Twarn - tstart) / dt` with Twarn = 300 ms.
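
The time-to-index conversion for the simulated past segment can be checked on a single made-up shot. The timing values below are hypothetical and are not taken from the shot list. The rounding (truncation for the start, floor for the stop, ceil for the disruption label) mirrors the conversion described above.

```python
import numpy as np

# Hypothetical timing for one shot, in ms (not taken from the shot list).
tstart, tlast, tdisrupt = 0.0, 2000.0, 1500.0
dt_ms = 0.001       # 1 MHz sampling
baseline_ms = 40.0  # segment starts 40 ms after tstart
twarn_ms = 300.0    # label point is Twarn before the disruption

start_idx = int(baseline_ms / dt_ms)
stop_idx = int(np.floor((tlast - tstart) / dt_ms))
disrupt_idx = int(np.ceil((tdisrupt - twarn_ms - tstart) / dt_ms))

print(start_idx, stop_idx, disrupt_idx)
```

Because `dt_ms` is a floating-point value, the divisions can land a hair above or below a whole number, so the floor and ceil results may differ by one sample from the exact ratio.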

In [None]:
# Original (flattop) segments and their subsequences (dataset_original tiling)
original_segments = segment_info_for_comparison(str(DISRUPT_LIST), flattop_only=True, snr_min_threshold=None)

def past_segment_from_row(seg: dict) -> dict:
    """Simulate past pipeline segment: start = 40 ms, end = tlast, disrupt_idx from Twarn."""
    dt_ms = seg["dt"]
    tstart = seg["tstart"]
    tlast = seg["tlast"]
    tdisrupt = seg["tdisrupt"]
    # Segment: start = baseline (40 ms from tstart), end = tlast; sample 0 at tstart.
    start_idx = int(BASELINE_MS / dt_ms)  # 40 ms in samples
    stop_idx = int(np.floor((tlast - tstart) / dt_ms))
    disrupt_idx = int(np.ceil((tdisrupt - TWARN_MS - tstart) / dt_ms))
    t_dis_samples = int(np.ceil((tdisrupt - tstart) / dt_ms))  # absolute disruption time in segment samples
    return {
        "shot": seg["shot"],
        "start_idx": start_idx,
        "stop_idx": stop_idx,
        "disrupt_idx": disrupt_idx,
        "t_dis_samples": t_dis_samples,
        "dt": dt_ms,
    }

N_COMPARE = 10
shots_to_compare = [original_segments[i]["shot"] for i in range(min(N_COMPARE, len(original_segments)))]

## Per-shot: segment bounds and number of subsequences

In [None]:
rows = []
for seg in original_segments:
    if seg["shot"] not in shots_to_compare:
        continue
    past_seg = past_segment_from_row(seg)
    subq_past = subsequences_past_tiling(
        past_seg["start_idx"], past_seg["stop_idx"], past_seg["disrupt_idx"],
        nsub=NSUB_PAST, stride=STRIDE_PAST,
        stop_at_last_window_containing_disrupt=True,
        t_dis_samples=past_seg["t_dis_samples"],
    )
    subq_now = subsequences_original_tiling(seg, nsub=NSUB_ORIG, nrecept=NRECEPT_ORIG, data_step=DATA_STEP_ORIG)
    rows.append({
        "shot": seg["shot"],
        "past_start": past_seg["start_idx"],
        "past_stop": past_seg["stop_idx"],
        "past_disrupt_idx": past_seg["disrupt_idx"],
        "past_n_subseq": len(subq_past),
        "now_start": seg["start_idx"],
        "now_stop": seg["stop_idx"],
        "now_disrupt_idx": seg["disrupt_idx"],
        "now_n_subseq": len(subq_now),
        "past_seg_len": past_seg["stop_idx"] - past_seg["start_idx"] + 1,
        "now_seg_len": seg["segment_length_samples"],
    })

df = pd.DataFrame(rows)
df

## Example: one shot — subsequence windows (past vs now)

For the first shot, list the first and last few windows: start, stop, length, has_disrupt.
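
For reading the `has_disrupt` and `disrupt_local` columns, one plausible interpretation is sketched here. It is an assumption for illustration only: `label_window` is a hypothetical helper, and the tiling functions in `dataset_original` may define these fields differently.

```python
from typing import Optional, Tuple


def label_window(start: int, stop: int, disrupt_idx: int) -> Tuple[bool, Optional[int]]:
    """Return (has_disrupt, disrupt_local) for an inclusive window [start, stop]."""
    has_disrupt = start <= disrupt_idx <= stop
    # Position of the label point relative to the window start, if it lies inside.
    disrupt_local = disrupt_idx - start if has_disrupt else None
    return has_disrupt, disrupt_local


print(label_window(0, 9, disrupt_idx=14))   # window ends before the label point
print(label_window(6, 15, disrupt_idx=14))  # label point falls inside the window
```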

In [None]:
example_shot = shots_to_compare[0]
seg_now = next(s for s in original_segments if s["shot"] == example_shot)
past_seg = past_segment_from_row(seg_now)
subq_past = subsequences_past_tiling(
    past_seg["start_idx"], past_seg["stop_idx"], past_seg["disrupt_idx"],
    nsub=NSUB_PAST, stride=STRIDE_PAST,
    stop_at_last_window_containing_disrupt=True,
    t_dis_samples=past_seg["t_dis_samples"],
)
subq_now = subsequences_original_tiling(seg_now, nsub=NSUB_ORIG, nrecept=NRECEPT_ORIG, data_step=DATA_STEP_ORIG)

print(f"Shot {example_shot}")
print("--- Past (stride-based) ---")
print(f"Segment: [{past_seg['start_idx']}, {past_seg['stop_idx']}], disrupt_idx={past_seg['disrupt_idx']}, n_subseq={len(subq_past)}")
# Show the first three windows and up to two trailing ones, never printing a window twice.
head, tail = subq_past[:3], subq_past[3:][-2:]
for w in head:
    print(f"  seq {w['seq_idx']}: start={w['start']}, stop={w['stop']}, len={w['length']}, has_disrupt={w['has_disrupt']}, disrupt_local={w['disrupt_local']}")
if len(subq_past) > 5:
    print("  ...")
for w in tail:
    print(f"  seq {w['seq_idx']}: start={w['start']}, stop={w['stop']}, len={w['length']}, has_disrupt={w['has_disrupt']}, disrupt_local={w['disrupt_local']}")

print("\n--- Now (dataset_original overlap tiling) ---")
print(f"Segment: [{seg_now['start_idx']}, {seg_now['stop_idx']}], disrupt_idx={seg_now['disrupt_idx']}, n_subseq={len(subq_now)}")
# Show the first three windows and up to two trailing ones, never printing a window twice.
head, tail = subq_now[:3], subq_now[3:][-2:]
for w in head:
    print(f"  seq {w['seq_idx']}: start={w['start']}, stop={w['stop']}, len={w['length']}, has_disrupt={w['has_disrupt']}, disrupt_local={w['disrupt_local']}")
if len(subq_now) > 5:
    print("  ...")
for w in tail:
    print(f"  seq {w['seq_idx']}: start={w['start']}, stop={w['stop']}, len={w['length']}, has_disrupt={w['has_disrupt']}, disrupt_local={w['disrupt_local']}")

## Summary: difference in number of subsequences and segment length

Past uses a longer segment (baseline to `tlast`) and stride-based tiling; now uses the flattop segment and overlap tiling, so subsequence counts and window positions differ.
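
A rough way to anticipate the counts is to compute them in closed form from the segment length. The sketch below is an approximation under stated assumptions: the stride scheme keeps a partial last window, the overlap scheme keeps only full-length windows, and the early stop at the last window containing the disruption is ignored. `n_stride_windows` and `n_overlap_windows` are hypothetical helpers, and the segment lengths are made up.

```python
def n_stride_windows(seg_len: int, stride: int) -> int:
    """Windows starting every `stride` samples; the last one may be partial."""
    if seg_len <= 0:
        return 0
    return -(-seg_len // stride)  # ceiling division


def n_overlap_windows(seg_len: int, nsub: int, nrecept: int) -> int:
    """Full-length windows only, advancing by step = nsub - nrecept + 1."""
    if seg_len < nsub:
        return 0
    step = nsub - nrecept + 1
    return (seg_len - nsub) // step + 1


# Hypothetical segment lengths in samples (not from the shot list).
print(n_stride_windows(2_000_000, stride=481_090))
print(n_overlap_windows(1_000_000, nsub=78_125, nrecept=30_000))
```

Comparing such estimates with `past_n_subseq` and `now_n_subseq` in the table can help tell apart differences caused by segment length from differences caused by the tiling rule itself.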

In [None]:
df["delta_n_subseq"] = df["now_n_subseq"] - df["past_n_subseq"]
df["delta_seg_len"] = df["now_seg_len"] - df["past_seg_len"]
summary = df.agg({
    "past_n_subseq": ["min", "max", "mean"],
    "now_n_subseq": ["min", "max", "mean"],
    "delta_n_subseq": ["mean", "sum"],
    "past_seg_len": ["min", "max", "mean"],
    "now_seg_len": ["min", "max", "mean"],
    "delta_seg_len": ["mean", "sum"],
}).round(1)
summary