# 🔭 SpectraMind V50 — 01_data_exploration.ipynb

Mission‑grade data exploration for **FGS1** (long time‑series) and **AIRS** (spectral bins).

**Standards**
- Notebooks are *thin orchestration*: **CLI → Hydra configs → DVC artifacts**. No ad‑hoc pipeline logic.
- Read inputs solely from `data/` (DVC) or `outputs/` (produced by CLI).
- Write all figures/summaries to `outputs/exploration/` (DVC‑tracked as appropriate).
- Record environment/CLI state for reproducibility in `logs/` and local cell output.

**What this notebook does**
1. Environment & repo sanity checks.
2. Discover raw files (HDF5 / NPZ) under `data/`.
3. FGS1 quick looks: light curve segment, rolling stats, basic noise proxy.
4. AIRS quick looks: wavelength grid, example spectra, per‑bin variance, molecular band overlays.
5. Persist artifacts (PNGs + JSON summary) into `outputs/exploration/`.

> Tip: For calibration and training, use the dedicated notebooks (02/03/…) or the CLI directly.


## 0) Setup & folders
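Resolve project paths relative to the repo root. The cell uses the current directory if it contains `data/`, and its parent otherwise (the usual case when the notebook runs from `notebooks/`). It also creates `outputs/exploration/` and `logs/` if they are missing.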
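
The one-level fallback in the cell below can be generalized. As a minimal sketch, assuming `data/` is a reliable marker of the repo root, the hypothetical helper `find_repo_root` (not part of the project) walks up the ancestors until it finds that marker:

```python
from pathlib import Path
import tempfile

def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Falls back to `start` itself when no ancestor has the marker, so the
    caller always gets a usable path.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start

# Demo on a throwaway tree: <tmp>/repo/{data,notebooks}
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    (repo / "data").mkdir(parents=True)
    (repo / "notebooks").mkdir()
    found_from_nb = find_repo_root(repo / "notebooks") == repo.resolve()
    found_from_root = find_repo_root(repo) == repo.resolve()

print(found_from_nb, found_from_root)
```

This variant also works when the notebook is launched from a nested folder, at the cost of possibly matching an unrelated `data/` directory higher up the tree.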

In [None]:
import os, sys, json, shutil, textwrap, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook'); sns.set_style('whitegrid')

# Resolve paths: use cwd if it contains data/, else assume cwd is notebooks/ and go up one level
NB_DIR = Path.cwd()
ROOT = NB_DIR if (NB_DIR / 'data').exists() else NB_DIR.parents[0]
DATA = ROOT / 'data'
OUT = ROOT / 'outputs'
EXPOUT = OUT / 'exploration'
LOGS = ROOT / 'logs'
EXPOUT.mkdir(parents=True, exist_ok=True)
LOGS.mkdir(parents=True, exist_ok=True)

print('ROOT:', ROOT)
print('DATA:', DATA)
print('OUT :', OUT)
print('EXPO:', EXPOUT)


## 1) Environment & CLI snapshot (best‑effort)
These calls are optional and robust to missing tools; they help trace the environment used for exploration.
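
The same best-effort idea extends to library versions. The sketch below is an assumption about what is worth recording, not the project's audit format. It uses the standard-library `importlib.metadata` and stores `None` for any package that is not installed; the helper name `package_versions` and the package list are illustrative.

```python
import json
import platform
import sys
from importlib import metadata

def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

snapshot = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    # Illustrative list: the libraries this notebook imports
    "packages": package_versions(["numpy", "pandas", "h5py", "matplotlib"]),
}
print(json.dumps(snapshot, indent=2))
```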

In [None]:
def _run(cmd: str, cwd: Path | None = None):
    print(f"\n$ {cmd}")
    try:
        p = subprocess.run(cmd, shell=True, text=True, capture_output=True, cwd=str(cwd or ROOT))
        out = (p.stdout or '')[-2000:]
        err = (p.stderr or '')[-2000:]
        print(out.strip())
        if p.returncode != 0 and err.strip():
            print('[stderr]', err.strip())
    except Exception as e:
        print('[skip]', e)

_run(f'"{sys.executable}" --version')  # the interpreter running this kernel
_run('spectramind --version')          # unified CLI (if available)
_run('dvc --version')                   # DVC presence
_run('git rev-parse --short HEAD')      # commit id
_run('git status -s')                   # dirty state hint

# Persist a small env snapshot for audit
env_snapshot = {
    'python': sys.version.split()[0],
    'cwd': str(Path.cwd()),
}
with open(EXPOUT / 'env_snapshot.json', 'w') as f:
    json.dump(env_snapshot, f, indent=2)
print('Saved', EXPOUT / 'env_snapshot.json')


## 2) Discover candidate raw files (HDF5 / NPZ)
We only *read* from `data/`. You can plug calibrated/derived artifacts later from `outputs/` if needed.

In [None]:
from pathlib import Path
raw_h5 = sorted(DATA.glob('**/*.h5'))
raw_npz = sorted(DATA.glob('**/*.npz'))
print(f"Found {len(raw_h5)} HDF5 and {len(raw_npz)} NPZ files under data/ (show up to 10):")
for p in (raw_h5[:10] + raw_npz[:10]):
    print(' -', p.relative_to(ROOT))

# Choose a sample file for quick looks
sample_h5 = raw_h5[0] if raw_h5 else None
sample_npz = raw_npz[0] if raw_npz else None
print('\nSample HDF5:', sample_h5)
print('Sample  NPZ:', sample_npz)


## 3) HDF5 structure peek (FGS1 + AIRS groups)
We expect datasets such as `FGS1/time` plus `FGS1/raw` or `FGS1/cal`, and `AIRS/wavelength` plus `AIRS/raw` or `AIRS/cal`. This is a *non‑failing* peek: read errors are printed and skipped.
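
A comparable non-failing peek can be sketched for NPZ archives. The notebook does not define an NPZ schema, so the array names in the demo (`fgs1_time`, `airs_wavelength`) and the helper `peek_npz` are hypothetical; only `np.load` and `np.savez` are real NumPy calls.

```python
import tempfile
from pathlib import Path

import numpy as np

def peek_npz(npz_path: Path) -> dict:
    """Return {array name: (shape, dtype)} for an NPZ archive, or {} if unreadable."""
    try:
        with np.load(npz_path, allow_pickle=False) as archive:
            info = {}
            for name in archive.files:
                arr = archive[name]  # each array is loaded once, on access
                info[name] = (arr.shape, str(arr.dtype))
            return info
    except Exception as exc:
        print("[skip] npz peek:", exc)
        return {}

# Demo on a throwaway archive with hypothetical array names
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.npz"
    np.savez(
        demo_path,
        fgs1_time=np.linspace(0.0, 1.0, 100),
        airs_wavelength=np.linspace(1.0, 5.0, 32),
    )
    info = peek_npz(demo_path)
    missing = peek_npz(Path(tmp) / "missing.npz")  # prints a [skip] line, returns {}

print(info)
```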

In [None]:
def peek_h5(h5path: Path, max_keys: int = 15):
    try:
        import h5py  # imported inside the try so a missing h5py is reported, not raised
        with h5py.File(h5path, 'r') as f:
            print('Groups:', list(f.keys()))
            for g in f.keys():
                try:
                    keys = list(f[g].keys())
                    more = '...' if len(keys) > max_keys else ''
                    print(f'  /{g}: {keys[:max_keys]}{more}')
                except Exception as e:
                    # Top-level datasets have no .keys(); report and continue
                    print(f'  /{g}: <unreadable> ({e})')
    except Exception as e:
        print('[skip] h5 peek:', e)

if sample_h5:
    peek_h5(sample_h5)
else:
    print('No HDF5 found. You can still proceed with NPZ if schema matches.')


## 4) FGS1 exploration — light curve segment & rolling stats
Plot a segment of the FGS1 time series and compute a simple rolling mean/std to visualize noise structure. Handles both `raw` and `cal` if present.
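
The rolling window used in the cell below follows a small rule: about 1% of the segment, never fewer than 51 samples, and always odd so that a centered window is symmetric. A worked sketch of that arithmetic (the function name `odd_window` is illustrative):

```python
def odd_window(n_samples: int, frac: float = 0.01, floor: int = 51) -> int:
    """Window of about `frac` of the series, at least `floor`, forced odd."""
    # `| 1` sets the lowest bit, which turns an even number into the next odd one
    return max(floor, int(frac * n_samples)) | 1

for n in (1_000, 20_000):
    print(n, odd_window(n))
```

For the 20,000-sample segment plotted here this gives a 201-sample window; for short series the 51-sample floor takes over.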

> **Note**: This is for *exploration only*; do not perform pipeline detrending here. Use CLI calibration/training notebooks for that.
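
Besides the rolling standard deviation, one possible scalar noise proxy is the scaled median absolute deviation (MAD) of first differences. This is an assumption about what a "basic noise proxy" could look like, not the project's definition. Differencing suppresses slow trends without fitting them, and dividing by √2 accounts for the variance doubling when two independent samples are subtracted.

```python
import numpy as np

def diff_mad_sigma(flux) -> float:
    """Estimate white-noise sigma from first differences using the MAD."""
    d = np.diff(np.asarray(flux, dtype=float))
    d = d[np.isfinite(d)]
    if d.size == 0:
        return float("nan")
    mad = np.median(np.abs(d - np.median(d)))
    # 1.4826 converts MAD to sigma for Gaussian noise; sqrt(2) undoes the differencing
    return float(1.4826 * mad / np.sqrt(2.0))

# Synthetic check: slow linear trend plus Gaussian noise with sigma = 0.01
rng = np.random.default_rng(0)
n = 5000
flux_demo = 1.0 + 0.05 * np.arange(n) / n + rng.normal(0.0, 0.01, n)
sigma_hat = diff_mad_sigma(flux_demo)
print(f"estimated sigma: {sigma_hat:.4f} (true value 0.01)")
```

The estimate assumes noise that is uncorrelated from sample to sample; correlated noise will bias it low.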

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def load_fgs1(h5path: Path, n: int | None = 20000):
    import h5py
    with h5py.File(h5path, 'r') as f:
        t = None
        y = None
        if 'FGS1' in f:
            grp = f['FGS1']
            if 'time' in grp: t = grp['time'][:]
            # prefer calibrated if exists, else raw
            if 'cal' in grp: y = grp['cal'][:]
            elif 'raw' in grp: y = grp['raw'][:]
        if t is None or y is None:
            return None, None
        if n is not None and len(t) > n:
            t = t[:n]; y = y[:n]
        return t, y

if sample_h5:
    t, y = load_fgs1(sample_h5, n=20000)
else:
    t, y = None, None

if t is not None and y is not None:
    fig, ax = plt.subplots(figsize=(12,4))
    ax.plot(t, y, lw=0.5)
    ax.set_title('FGS1 light curve (first ~20k samples)')
    ax.set_xlabel('time [arb]')
    ax.set_ylabel('flux [arb]')
    fig.tight_layout()
    fig.savefig(EXPOUT / 'fgs1_lightcurve_segment.png', dpi=150)
    plt.show()

    # Rolling statistics: window is ~1% of the segment, at least 51 samples, forced odd
    df = pd.DataFrame({'time': t, 'flux': y})
    win = max(51, int(0.01 * len(df))) | 1  # `| 1` sets the low bit, making the window odd
    df['roll_mean'] = df['flux'].rolling(win, center=True, min_periods=win//2).mean()
    df['roll_std']  = df['flux'].rolling(win, center=True, min_periods=win//2).std()

    fig, ax = plt.subplots(2,1,figsize=(12,6), sharex=True)
    ax[0].plot(df['time'], df['flux'], lw=0.4, label='flux')
    ax[0].plot(df['time'], df['roll_mean'], lw=1.0, label=f'rolling mean (w={win})')
    ax[0].legend(loc='best'); ax[0].set_ylabel('flux')
    ax[1].plot(df['time'], df['roll_std'], lw=0.8, color='tab:orange')
    ax[1].set_ylabel('rolling std'); ax[1].set_xlabel('time')
    fig.suptitle('FGS1 rolling stats (exploration)')
    fig.tight_layout()
    fig.savefig(EXPOUT / 'fgs1_rolling_stats.png', dpi=150)
    plt.show()
else:
    print('FGS1 not present in sample file.')


## 5) AIRS exploration — wavelength grid & example spectra
We visualize the wavelength axis, a few spectra, and basic per‑bin variance. If both `raw` and `cal` exist, we prefer `cal` for visualization only (no pipeline operations here).
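
The loader below reshapes whatever it finds into `(N, W)`. A standalone sketch of that normalization, with an added check that the second axis still matches the wavelength grid, is shown here. The helper `to_spectra_2d` is illustrative; the check matters because flattening a `(T, W, H)` array yields `W*H` columns, which can no longer be plotted against `W` wavelengths.

```python
import numpy as np

def to_spectra_2d(data, n_wavelengths=None):
    """Normalize an array to shape (N, W) and report whether W matches the grid."""
    arr = np.asarray(data)
    if arr.ndim == 0:
        raise ValueError("expected at least a 1D array")
    if arr.ndim == 1:
        arr = arr[None, :]                    # single spectrum -> (1, W)
    elif arr.ndim > 2:
        arr = arr.reshape(arr.shape[0], -1)   # keep axis 0, flatten the rest
    matches = n_wavelengths is None or arr.shape[1] == n_wavelengths
    return arr, matches

one, ok_one = to_spectra_2d(np.zeros(8), n_wavelengths=8)
two, ok_two = to_spectra_2d(np.zeros((4, 8)), n_wavelengths=8)
cube, ok_cube = to_spectra_2d(np.zeros((4, 8, 3)), n_wavelengths=8)
print(one.shape, ok_one, two.shape, ok_two, cube.shape, ok_cube)
```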

In [None]:
def load_airs(h5path: Path, k: int = 3):
    import h5py
    with h5py.File(h5path, 'r') as f:
        if 'AIRS' not in f:
            return None, None
        wl = None; cube = None
        grp = f['AIRS']
        if 'wavelength' in grp:
            wl = grp['wavelength'][:]
        # Prefer 'cal' when present, else 'raw'; the expected layout is 2D [time/idx, wavelength]
        key = 'cal' if 'cal' in grp else ('raw' if 'raw' in grp else None)
        if key is None:
            return wl, None
        data = grp[key][:]
        # Normalize shape to (N, W)
        if data.ndim == 1:
            data = data[None, :]
        elif data.ndim == 2:
            pass
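        else:
            # For (T, W, H)-like shapes, keep axis 0 as the index and flatten the rest
            data = data.reshape(data.shape[0], -1)
        # Plotting needs one value per wavelength bin; skip spectra that do not match
        if wl is not None and data.shape[1] != len(wl):
            print(f'[skip] AIRS {key} has {data.shape[1]} columns but wavelength has {len(wl)} bins')
            return wl, None
        # Limit to first k spectra for quick looks
        data_k = data[:max(1, k)]
        return wl, data_k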

wl, spectra = (None, None)
if sample_h5:
    wl, spectra = load_airs(sample_h5, k=5)

if wl is not None:
    fig, ax = plt.subplots(figsize=(10,3))
    ax.hist(wl, bins=min(60, len(wl)//2 + 1), alpha=0.8)
    ax.set_title('AIRS wavelength distribution')
    ax.set_xlabel('wavelength [μm]'); ax.set_ylabel('count')
    fig.tight_layout(); fig.savefig(EXPOUT / 'airs_wavelength_hist.png', dpi=150)
    plt.show()
else:
    print('No AIRS wavelength axis found in sample file.')

if wl is not None and spectra is not None:
    fig, ax = plt.subplots(figsize=(12,4))
    for i, s in enumerate(spectra):
        ax.plot(wl, s, lw=0.9, label=f'spectrum #{i}')
    ax.set_title('AIRS example spectra (exploration)')
    ax.set_xlabel('wavelength [μm]'); ax.set_ylabel('flux [arb]')
    ax.legend(loc='best')
    fig.tight_layout(); fig.savefig(EXPOUT / 'airs_example_spectra.png', dpi=150)
    plt.show()

    # Per‑bin variance (across the displayed sample spectra)
    vb = np.nanvar(spectra, axis=0)
    fig, ax = plt.subplots(figsize=(12,3))
    ax.plot(wl, vb, lw=1.0, color='tab:orange')
    ax.set_title('AIRS per‑bin variance (sample)')
    ax.set_xlabel('wavelength [μm]'); ax.set_ylabel('var')
    fig.tight_layout(); fig.savefig(EXPOUT / 'airs_perbin_variance.png', dpi=150)
    plt.show()
else:
    print('AIRS spectra not available for plotting.')


## 6) Molecular band overlays (educational)
For visualization only, shade canonical bands (H₂O, CO₂, CH₄) over the plotted spectrum(s). This is not a physics pipeline step; it helps eyeball correspondence between features and known bands.
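
Beyond shading, the same band ranges can be turned into boolean masks to see how many wavelength bins fall inside each band. This is a sketch only: the ranges are the illustrative ones used in this section, and the demo grid is hypothetical rather than the real AIRS axis.

```python
import numpy as np

# Same illustrative ranges (in μm) as the overlay cell in this section
molecular_bands = {
    'H2O': [(1.30, 1.50), (1.80, 2.00)],
    'CO2': [(2.00, 2.10), (4.20, 4.40)],
    'CH4': [(3.20, 3.40)],
}

def band_masks(wl, bands):
    """Return {molecule: boolean mask over wl} marking bins inside any of its bands."""
    wl = np.asarray(wl, dtype=float)
    masks = {}
    for mol, ranges in bands.items():
        mask = np.zeros(wl.shape, dtype=bool)
        for lo, hi in ranges:
            mask |= (wl >= lo) & (wl <= hi)
        masks[mol] = mask
    return masks

# Hypothetical grid: 40 bins from 1.05 to 4.95 μm in 0.1 μm steps
wl_demo = np.linspace(1.05, 4.95, 40)
masks = band_masks(wl_demo, molecular_bands)
counts = {mol: int(m.sum()) for mol, m in masks.items()}
print(counts)
```

A band with zero bins on the real grid would tell you the overlay for that molecule lies outside the instrument's coverage.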

In [None]:
molecular_bands = {
    'H2O': [(1.30, 1.50), (1.80, 2.00)],
    'CO2': [(2.00, 2.10), (4.20, 4.40)],
    'CH4': [(3.20, 3.40)],
}

def overlay_bands(wl, s, title='AIRS spectrum with molecular bands'):
    fig, ax = plt.subplots(figsize=(12,4))
    ax.plot(wl, s, lw=0.9, label='spectrum')
    for mol, bands in molecular_bands.items():
        for (lo, hi) in bands:
            ax.axvspan(lo, hi, color='gray', alpha=0.18)
            ax.text((lo+hi)/2, np.nanmax(s)*0.98, mol, ha='center', va='top', fontsize=8, alpha=0.8)
    ax.set_title(title)
    ax.set_xlabel('wavelength [μm]'); ax.set_ylabel('flux [arb]')
    fig.tight_layout()
    return fig

if wl is not None and spectra is not None and len(spectra):
    fig = overlay_bands(wl, spectra[0])
    fig.savefig(EXPOUT / 'airs_spectrum_with_bands.png', dpi=150)
    plt.show()
else:
    print('Skip band overlay (no AIRS spectrum available).')


## 7) Persist an exploration summary (JSON)
We stash a structured summary of what was found/visualized so downstream diagnostics can pick it up if needed.
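
One way to keep the summary trustworthy is to list only artifacts that exist on disk and to confirm that the JSON round-trips. The sketch below runs in a temporary folder; the helper `existing_artifacts` and the reduced summary layout are illustrative, not the project's schema.

```python
import json
import tempfile
from pathlib import Path

def existing_artifacts(root: Path, candidates) -> list:
    """Return POSIX-style paths relative to `root` for candidates that exist."""
    return [p.relative_to(root).as_posix() for p in candidates if p.exists()]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    out_dir = root / "outputs" / "exploration"
    out_dir.mkdir(parents=True)
    (out_dir / "airs_wavelength_hist.png").touch()  # pretend this plot was written
    candidates = [
        out_dir / "airs_wavelength_hist.png",
        out_dir / "airs_example_spectra.png",       # never written, so it is dropped
    ]

    summary = {"airs": {"present": True, "artifacts": existing_artifacts(root, candidates)}}
    summary_path = out_dir / "exploration_summary.json"
    summary_path.write_text(json.dumps(summary, indent=2))

    # Read the file back the way a downstream consumer might
    loaded = json.loads(summary_path.read_text())

print(loaded["airs"]["artifacts"])
```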

In [None]:
def _existing(*names):
    # List only artifacts that exist on disk, as paths relative to the repo root
    return [str((EXPOUT / n).relative_to(ROOT)) for n in names if (EXPOUT / n).exists()]

summary = {
    'fgs1': {
        'present': t is not None and y is not None,
        'n_points': int(len(t)) if t is not None else 0,
        'artifacts': _existing('fgs1_lightcurve_segment.png', 'fgs1_rolling_stats.png'),
    },
    'airs': {
        'present': wl is not None,
        'wavelength_bins': int(len(wl)) if wl is not None else 0,
        'example_spectra': int(spectra.shape[0]) if isinstance(spectra, np.ndarray) else 0,
        'artifacts': _existing(
            'airs_wavelength_hist.png',
            'airs_example_spectra.png',
            'airs_perbin_variance.png',
            'airs_spectrum_with_bands.png',
        ),
    },
}
with open(EXPOUT / 'exploration_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)
print('Saved', EXPOUT / 'exploration_summary.json')


## 8) (Optional) CLI stubs you may run later
These demonstrate the *correct* way to perform data calibration or diagnostics — **always via CLI/Hydra**, not ad‑hoc code. Keep commented unless you intend to run them here.

```bash
# Minimal self‑test (fast, safe):
# spectramind test --fast

# Sample calibration (writes to outputs/calibrated):
# spectramind calibrate --sample 3 --outdir outputs/calibrated --fast

# Diagnostics (HTML report saved under outputs/diagnostics):
# spectramind diagnose dashboard --no-umap=false --no-tsne=false --out outputs/diagnostics/report.html
```
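
If you do trigger a CLI command from a cell, a guard that checks for the executable first keeps the notebook non-failing. This is a minimal sketch: `run_if_available` is a hypothetical helper, and the demo runs the Python interpreter rather than `spectramind` so that nothing is executed by accident.

```python
import shutil
import subprocess
import sys

def run_if_available(argv, timeout=600):
    """Run `argv` only if its executable can be found; return CompletedProcess or None."""
    if shutil.which(argv[0]) is None:
        print(f"[skip] {argv[0]} not found on PATH")
        return None
    return subprocess.run(argv, text=True, capture_output=True, timeout=timeout)

# Harmless demo commands; swap in e.g. ["spectramind", "test", "--fast"] when intended
ok = run_if_available([sys.executable, "--version"])
missing = run_if_available(["definitely-not-a-real-cli-xyz"])
print(ok.returncode if ok else None, missing)
```

Passing the command as a list avoids `shell=True` and the quoting issues that come with it.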

> For full calibration/training flows, see `02_calibration_walkthrough.ipynb` and `03_train_v50_demo.ipynb`.