# 🧪 SpectraMind V50 — 02_calibration_walkthrough.ipynb

Mission‑grade walkthrough of the **calibration kill chain**:
raw instrument data → science‑ready lightcurves/spectra, using the **CLI + Hydra configs**.

**Standards**
- Notebooks are *thin orchestration*: **CLI → Hydra → DVC artifacts**. No ad‑hoc pipeline code.
- Inputs come from `data/` (DVC); outputs are written to `outputs/calibrated/` and tracked via DVC as appropriate.
- All figures/JSON summaries go under `outputs/calibration/`.
- This notebook only runs a small sample (`--sample 3 --fast`), so it is safe to execute end to end; scale up with the CLI once the sample validates.

**What you’ll do**
1) Env & repo sanity checks (best‑effort).
2) Discover raw files in `data/`.
3) Run a **sample calibration** via the CLI (fast path): `spectramind calibrate --sample 3 --outdir outputs/calibrated --fast`.
4) Verify artifacts (e.g., `outputs/calibrated/lightcurves.h5`).
5) Visualize calibrated FGS1/AIRS quick‑looks.
6) **Raw vs Calibrated FFT** comparison (if raw series available) to illustrate systematics suppression.
7) Persist a small calibration report JSON for downstream tools.

## 0) Setup & paths
Assumes the notebook is run from `notebooks/` or from the repo root; the root is whichever of the two contains `data/`. Creates output folders for artifacts and logs.


In [None]:
import os, sys, json, subprocess, textwrap
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook'); sns.set_style('whitegrid')

NB_DIR = Path.cwd()
ROOT = NB_DIR if (NB_DIR / 'data').exists() else NB_DIR.parents[0]
DATA = ROOT / 'data'
OUT = ROOT / 'outputs'
CAL = OUT / 'calibrated'
CAL.mkdir(parents=True, exist_ok=True)
CALPLOT = OUT / 'calibration'
CALPLOT.mkdir(parents=True, exist_ok=True)
LOGS = ROOT / 'logs'
LOGS.mkdir(parents=True, exist_ok=True)

print('ROOT:', ROOT)
print('DATA:', DATA)
print('OUT :', OUT)
print('CAL :', CAL)
print('CALPLOT:', CALPLOT)
print('LOGS:', LOGS)
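
The one-level fallback above covers a launch from the repo root or from `notebooks/`. If the notebook may be started from a deeper directory, a marker-based search is one option. This is a minimal sketch, assuming that a `data/` directory or a `.git` entry marks the repo root; `find_repo_root` and its `markers` parameter are hypothetical helpers, not part of the SpectraMind CLI.

```python
from pathlib import Path

def find_repo_root(start, markers=("data", ".git")):
    """Walk upward from `start` and return the first directory holding a marker.

    Falls back to `start` itself when no marker is found (an assumed policy).
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if any((candidate / m).exists() for m in markers):
            return candidate
    return start
```

Usage would be `ROOT = find_repo_root(Path.cwd())`; the nearest matching ancestor wins, so a nested checkout resolves to its own root.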


## 1) Environment & CLI snapshot (best‑effort)
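Prints tool versions and the Git commit state for auditability, then saves a small JSON snapshot (Python version and working directory). A missing tool does not stop the notebook: `_run` prints the shell's error and returns a non-zero exit code.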


In [None]:
def _run(cmd: str, cwd: Path | None = None):
    print(f"\n$ {cmd}")
    try:
        p = subprocess.run(cmd, shell=True, text=True, capture_output=True, cwd=str(cwd or ROOT))
        out = (p.stdout or '')[-2000:]
        err = (p.stderr or '')[-2000:]
        print(out.strip())
        if p.returncode != 0 and err.strip():
            print('[stderr]', err.strip())
        return p.returncode
    except Exception as e:
        print('[skip]', e)
        return -1

_run('python --version')
_run('spectramind --version')              # unified CLI (if available)
_run('dvc --version')                       # DVC
_run('git rev-parse --short HEAD')          # commit id
_run('git status -s')                       # working tree state

# persist a small snapshot
env_snapshot = {
    'python': sys.version.split()[0],
    'cwd': str(Path.cwd()),
}
with open(CALPLOT / 'env_snapshot.json', 'w') as f:
    json.dump(env_snapshot, f, indent=2)
print('Saved', CALPLOT / 'env_snapshot.json')
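
The snapshot written above stores only the Python version and the working directory; the tool versions and commit id are printed but not saved. If they should be persisted too, one approach is to capture each command's first output line. This is a sketch under assumptions: `tool_version` and `snapshot_extra` are hypothetical names, and the commands are the same ones the cell above already runs.

```python
import shutil
import subprocess
import sys

def tool_version(argv, timeout=10.0):
    """Return the first line a command prints, or None if it is missing or fails."""
    if shutil.which(argv[0]) is None:
        return None
    try:
        p = subprocess.run(argv, text=True, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if p.returncode != 0:
        return None
    lines = (p.stdout or p.stderr).strip().splitlines()
    return lines[0] if lines else None

# Values are None for tools that are not installed or commands that fail.
snapshot_extra = {
    "python": tool_version([sys.executable, "--version"]),
    "spectramind": tool_version(["spectramind", "--version"]),
    "dvc": tool_version(["dvc", "--version"]),
    "git_commit": tool_version(["git", "rev-parse", "--short", "HEAD"]),
}
```

Merging `snapshot_extra` into `env_snapshot` before the `json.dump` call would record these fields alongside the existing ones.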


## 2) Discover raw inputs under `data/`
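We pick the first HDF5 file found (in sorted order) as the raw sample for later comparisons. If none is found, the sample calibration step still runs (the CLI may locate defaults in its configs), but the raw-vs-calibrated comparison is skipped.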

In [None]:
raw_h5 = sorted(DATA.glob('**/*.h5'))
raw_npz = sorted(DATA.glob('**/*.npz'))
print(f"Found {len(raw_h5)} HDF5 and {len(raw_npz)} NPZ files under data/ (show up to 10):")
for p in (raw_h5[:10] + raw_npz[:10]):
    print(' -', p.relative_to(ROOT))
sample_raw_h5 = raw_h5[0] if raw_h5 else None
print('\nSample raw HDF5:', sample_raw_h5)
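
The FFT comparison later in this notebook needs `FGS1/time` and `FGS1/raw` in the raw file, so it can help to check the layout up front. This is a minimal sketch: `missing_paths` is a hypothetical helper, and the nested dict stands in for an open HDF5 file, since `h5py.File` and `h5py.Group` offer the same `in` / `[]` interface.

```python
def missing_paths(container, required):
    """Return the entries of `required` (e.g. 'FGS1/time') that are absent.

    Works on any nested mapping, so an open h5py file can be passed directly.
    """
    missing = []
    for path in required:
        node = container
        for part in path.split("/"):
            try:
                present = part in node
            except TypeError:  # reached a leaf that does not support membership tests
                present = False
            if not present:
                missing.append(path)
                break
            node = node[part]
    return missing

# Stand-in for an open HDF5 file (plain dicts; the values are placeholders).
fake_raw = {"FGS1": {"time": [0, 1, 2], "raw": [1.0, 1.0, 1.0]}}
print(missing_paths(fake_raw, ["FGS1/time", "FGS1/raw", "AIRS/wavelength"]))
```

With a real file the call would sit inside `with h5py.File(sample_raw_h5, 'r') as f:`; h5py also accepts path-style membership tests such as `'FGS1/time' in f` directly.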


## 3) Run a **sample calibration** via the CLI (fast path)
This step performs a *minimal* calibration to validate the pipeline end‑to‑end. The exact settings are controlled by Hydra configs invoked by the CLI.

**Commands**
- Optional quick self‑test: `spectramind test --fast`
- Sample calibration: `spectramind calibrate --sample 3 --outdir outputs/calibrated --fast`

If the CLI is not available in this environment, `_run` prints the shell's error and returns a non-zero exit code instead of raising; run the commands externally in that case.


In [None]:
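ret_test = _run('spectramind test --fast')   # optional self-test; kept separate so it is not overwritten
ret = _run('spectramind calibrate --sample 3 --outdir outputs/calibrated --fast')
print('Self-test exit code:', ret_test)
print('Calibration command exit code:', ret)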

# Expected artifact(s)
cal_h5 = CAL / 'lightcurves.h5'
print('Calibrated file exists?', cal_h5.exists(), cal_h5)
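
An existence check alone also passes when `lightcurves.h5` is left over from an earlier run. One way to tighten it is to compare the file's modification time with the time the command started. This is a sketch under assumptions: `artifact_status`, `min_bytes`, and `slack` are hypothetical names, and the two-second slack is an arbitrary allowance for coarse filesystem timestamps.

```python
import time
from pathlib import Path

def artifact_status(path, started_at, min_bytes=1, slack=2.0):
    """Summarise whether `path` looks like a fresh, non-empty output file."""
    path = Path(path)
    if not path.is_file():
        return {"exists": False, "fresh": False, "non_empty": False}
    st = path.stat()
    return {
        "exists": True,
        # modified after the run began (minus a small timestamp allowance)
        "fresh": st.st_mtime >= started_at - slack,
        "non_empty": st.st_size >= min_bytes,
    }
```

In this notebook that would mean recording `started = time.time()` just before the `_run('spectramind calibrate ...')` call and inspecting `artifact_status(cal_h5, started)` afterwards.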


## 4) Visualize calibrated FGS1 and AIRS quick‑looks
Reads from `outputs/calibrated/lightcurves.h5` and saves PNGs under `outputs/calibration/`.

> This exploration stage should not perform any pipeline detrending; it only plots results.


In [None]:
def _plot_fgs1_cal(h5path: Path, n: int = 20000):
    import h5py
    try:
        with h5py.File(h5path, 'r') as f:
            if 'FGS1' not in f: return False
            g = f['FGS1']
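            t = g['time'][:] if 'time' in g else None
            key = 'cal' if 'cal' in g else ('raw' if 'raw' in g else None)
            if t is None or key is None: return False
            y = g[key][:]
            if n and len(t) > n:
                t, y = t[:n], y[:n]
            fig, ax = plt.subplots(figsize=(12,4))
            ax.plot(t, y, lw=0.5)
            # label the plot by the dataset actually read ('raw' is only a fallback)
            label = 'calibrated' if key == 'cal' else 'raw fallback'
            ax.set_title(f'FGS1 {label} (sample)')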
            ax.set_xlabel('time [arb]'); ax.set_ylabel('flux [arb]')
            fig.tight_layout(); fig.savefig(CALPLOT / 'fgs1_calibrated_segment.png', dpi=150)
            plt.show()
            return True
    except Exception as e:
        print('[skip FGS1 cal plot]', e)
        return False

def _plot_airs_cal(h5path: Path, k: int = 3):
    import h5py
    try:
        with h5py.File(h5path, 'r') as f:
            if 'AIRS' not in f: return False
            g = f['AIRS']
            wl = g['wavelength'][:] if 'wavelength' in g else None
            if wl is None: return False
            key = 'cal' if 'cal' in g else ('raw' if 'raw' in g else None)
            if key is None: return False
            data = g[key][:]
            if data.ndim == 1: data = data[None, :]
            data = data[:max(1, k)]

            fig, ax = plt.subplots(figsize=(12,4))
            for i, s in enumerate(data):
                ax.plot(wl, s, lw=0.9, label=f'spectrum #{i}')
            ax.set_title('AIRS calibrated spectra (sample)')
            ax.set_xlabel('wavelength [μm]'); ax.set_ylabel('flux [arb]')
            ax.legend(loc='best')
            fig.tight_layout(); fig.savefig(CALPLOT / 'airs_calibrated_spectra.png', dpi=150)
            plt.show()
            return True
    except Exception as e:
        print('[skip AIRS cal plot]', e)
        return False

if cal_h5.exists():
    ok_fgs = _plot_fgs1_cal(cal_h5)
    ok_airs = _plot_airs_cal(cal_h5)
else:
    print('Calibrated artifact not found; ensure the CLI step finished successfully.')


## 5) **Raw vs Calibrated** FFT power comparison (FGS1)
If the raw HDF5 contains `FGS1/time` + `FGS1/raw`, we compare its power spectrum to the calibrated one to illustrate suppression of instrument/systematic bands. If not available, this section will skip gracefully.


In [None]:
def _power_spectrum(x: np.ndarray):
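    x = np.asarray(x, dtype=float)
    # remove the mean, then zero-fill NaNs so a single bad sample does not turn the whole FFT into NaN
    x = np.nan_to_num(x - np.nanmean(x))
    ps = np.abs(np.fft.rfft(x))**2
    f = np.fft.rfftfreq(len(x), d=1.0)  # unit sampling step: frequencies are in cycles per sample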
    return f, ps

def _load_fgs1_series(h5path: Path, key: str, n: int = 20000):
    import h5py
    with h5py.File(h5path, 'r') as f:
        if 'FGS1' not in f: return None, None
        g = f['FGS1']
        t = g['time'][:] if 'time' in g else None
        if key not in g: return None, None
        y = g[key][:]
        if t is None or y is None: return None, None
        if n and len(t) > n:
            t, y = t[:n], y[:n]
        return t, y

raw_f, raw_ps = None, None
cal_f, cal_ps = None, None

# Try raw from the discovered sample, and cal from outputs
if sample_raw_h5 is not None:
    t_raw, y_raw = _load_fgs1_series(sample_raw_h5, 'raw', n=20000)
    if t_raw is not None and y_raw is not None and len(y_raw) > 32:
        raw_f, raw_ps = _power_spectrum(y_raw)

if cal_h5.exists():
    t_cal, y_cal = _load_fgs1_series(cal_h5, 'cal', n=20000)
    if t_cal is not None and y_cal is not None and len(y_cal) > 32:
        cal_f, cal_ps = _power_spectrum(y_cal)

if raw_ps is not None and cal_ps is not None:
    fig, ax = plt.subplots(figsize=(12,4))
    # ignore DC (index 0)
    ax.semilogy(raw_f[1:], raw_ps[1:], label='raw', alpha=0.9)
    ax.semilogy(cal_f[1:], cal_ps[1:], label='calibrated', alpha=0.9)
    ax.set_title('FGS1 FFT power — raw vs calibrated (sample)')
    ax.set_xlabel('frequency [arb]'); ax.set_ylabel('power')
    ax.legend(loc='best')
    fig.tight_layout(); fig.savefig(CALPLOT / 'fgs1_fft_raw_vs_cal.png', dpi=150)
    plt.show()
else:
    print('FFT comparison skipped (raw or cal series unavailable).')
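
The plot above uses a unit sampling step, so its frequency axis is in cycles per sample. When `FGS1/time` is uniformly sampled, the step can be estimated from the time axis, and the suppression can be condensed into a single band-power ratio. This is a minimal sketch on synthetic data; `sampling_step`, `band_power`, and the band limits are illustrative assumptions, not pipeline settings.

```python
import numpy as np

def sampling_step(t):
    """Median spacing of a time axis; assumes near-uniform sampling."""
    return float(np.median(np.diff(np.asarray(t, dtype=float))))

def band_power(y, dt, f_lo, f_hi):
    """Sum of FFT power between f_lo and f_hi (same units as 1/dt)."""
    y = np.asarray(y, dtype=float)
    y = np.nan_to_num(y - np.nanmean(y))      # remove mean, zero-fill NaNs
    power = np.abs(np.fft.rfft(y)) ** 2
    freq = np.fft.rfftfreq(len(y), d=dt)
    mask = (freq >= f_lo) & (freq <= f_hi)
    return float(power[mask].sum())

# Synthetic stand-ins: a systematic at 0.25 cycles per time unit,
# ten times smaller in amplitude after "calibration".
t = np.arange(1024) * 0.5
raw = 1.0 + 0.010 * np.sin(2 * np.pi * 0.25 * t)
cal = 1.0 + 0.001 * np.sin(2 * np.pi * 0.25 * t)

dt = sampling_step(t)
ratio = band_power(cal, dt, 0.2, 0.3) / band_power(raw, dt, 0.2, 0.3)
print(f"dt = {dt}, band-power ratio (cal/raw) = {ratio:.4f}")
```

Because power scales with amplitude squared, a tenfold amplitude reduction shows up as a ratio of about 0.01. On real data the same two calls would take `t_raw, y_raw` and `t_cal, y_cal`, with a band chosen around the systematic of interest.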


## 6) Logs & DVC status (optional)
Shows the last 60 lines of the CLI journal (`logs/v50_debug_log.md`), if it exists, then runs `dvc status` to help confirm that tracked artifacts are up to date.


In [None]:
log_md = LOGS / 'v50_debug_log.md'
print('CLI journal exists?', log_md.exists(), log_md)
if log_md.exists():
    try:
        tail = '\n'.join(log_md.read_text(errors='ignore').splitlines()[-60:])
        print('\n--- tail logs/v50_debug_log.md ---\n' + tail)
    except Exception as e:
        print('[skip log tail]', e)

_run('dvc status')


## 7) Persist a **calibration report** JSON
Stores a lightweight summary for downstream diagnostics/dashboards (paths relative to repo root).


In [None]:
plot_names = [
    'fgs1_calibrated_segment.png',
    'airs_calibrated_spectra.png',
    'fgs1_fft_raw_vs_cal.png',
]
plot_paths = [CALPLOT / name for name in plot_names]
report = {
    'calibrated_artifact': str(cal_h5.relative_to(ROOT)) if cal_h5.exists() else None,
    # one entry per expected plot; None marks a plot that was not produced
    'plots': [str(p.relative_to(ROOT)) if p.exists() else None for p in plot_paths],
}
with open(CALPLOT / 'calibration_report.json', 'w') as f:
    json.dump(report, f, indent=2)
print('Saved', CALPLOT / 'calibration_report.json')
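
A downstream tool that reads this report later may want to confirm that the listed files are still on disk. This is a minimal sketch of such a check, assuming the report keeps the two keys written above (`calibrated_artifact` and `plots`, with `null` for missing items); `missing_report_files` is a hypothetical helper.

```python
import json
from pathlib import Path

def missing_report_files(report_path, root):
    """Return report-listed paths (relative to `root`) that do not exist on disk."""
    report = json.loads(Path(report_path).read_text())
    listed = [report.get("calibrated_artifact"), *report.get("plots", [])]
    # null entries mean "not produced" and are skipped rather than reported
    return [p for p in listed if p is not None and not (Path(root) / p).exists()]
```

From this notebook the call would be `missing_report_files(CALPLOT / 'calibration_report.json', ROOT)`; an empty list means every recorded path resolves.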


## 8) Next steps / reference CLI commands
Keep heavy operations in the CLI. Uncomment to run in this environment if appropriate.

```bash
# Quick self‑test (fast):
# spectramind test --fast

# Sample calibration (safe):
# spectramind calibrate --sample 3 --outdir outputs/calibrated --fast

# Full calibration (example — adjust according to configs/datasets):
# spectramind calibrate --outdir outputs/calibrated

# Diagnostics dashboard (HTML under outputs/diagnostics/):
# spectramind diagnose dashboard --no-umap=false --no-tsne=false --out outputs/diagnostics/report.html
```

**See also**: `03_train_v50_demo.ipynb` for a minimal training run and `05_diagnostics_suite.ipynb` for unified diagnostics.
