# 🧪 SpectraMind V50 — 01_pipeline_calibrate_train_predict

Tiny **calibrate → train → predict** pipeline to sanity-check the stack with fast settings.

This runs entirely via the **CLI** (Typer + Hydra), keeping the workflow reproducible and config-driven.

## 0) Runtime Helper — Resolve `spectramind`

Resolves in order:
1. `spectramind` (PATH)
2. `poetry run spectramind`
3. `python -m spectramind`

Exposes helpers: `sm(cmd)`, `sm_print(cmd)`.
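
Note that `sm` runs with `check=False` by default, so a failing step returns normally instead of raising. A minimal sketch of a status check on the returned `subprocess.CompletedProcess` follows; the name `describe_result` is hypothetical and not part of this notebook's helpers.

```python
import subprocess

def describe_result(res: subprocess.CompletedProcess, tail: int = 5) -> str:
    """Return a one-line status plus the last `tail` lines of captured output."""
    status = "OK" if res.returncode == 0 else f"FAILED (exit code {res.returncode})"
    # stdout is None unless the process was run with output capture enabled.
    lines = (res.stdout or "").splitlines()[-tail:]
    return "\n".join([status, *lines])

# Demo on a hand-built result, so no external command is needed.
demo = subprocess.CompletedProcess(
    args=["spectramind", "train"],
    returncode=2,
    stdout="step 1\nstep 2\nError: bad config\n",
)
print(describe_result(demo, tail=1))
```

In the cells that follow, something like `print(describe_result(sm("calibrate --sample 5 --fast", capture=True)))` would surface a failed step immediately.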

In [None]:
import os, shlex, shutil, subprocess, sys, glob, json, textwrap
from pathlib import Path

def _resolve_spectramind_cmd():
    if shutil.which("spectramind"):
        return ["spectramind"]
    if shutil.which("poetry"):
        try:
            out = subprocess.run(["poetry", "run", "spectramind", "--version"],
                                 stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
            if out.returncode == 0:
                return ["poetry", "run", "spectramind"]
        except Exception:
            pass
    return [sys.executable, "-m", "spectramind"]

SM = _resolve_spectramind_cmd()
print("Resolved spectramind launcher:", " ".join(shlex.quote(p) for p in SM))

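def sm(cmd: str, check=False, capture=False):
    """Run a spectramind subcommand and return the CompletedProcess.

    With capture=True, stdout and stderr are merged into `res.stdout`.
    With check=False (the default), a non-zero exit code does not raise.
    """
    args = shlex.split(cmd)
    res = subprocess.run(SM + args,
                         check=check,
                         stdout=subprocess.PIPE if capture else None,
                         stderr=subprocess.STDOUT if capture else None,
                         text=True)
    return res

def sm_print(cmd: str):
    """Echo the full command line, shell-quoted, without running it."""
    print("$", " ".join(shlex.quote(p) for p in SM + shlex.split(cmd)))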

Path("outputs").mkdir(exist_ok=True, parents=True)
Path("outputs/preds_quick").mkdir(exist_ok=True, parents=True)
Path("outputs/diagnostics_quick").mkdir(exist_ok=True, parents=True)

## 1) Sanity Checks
Print the Python, pip, Poetry, Git, and DVC versions, an optional PyTorch/CUDA snapshot, and the CLI's `--version` and `--help` output.

In [None]:
!python --version
!pip --version
!poetry --version || echo '⚠️ Poetry not found (ok if not using Poetry)'
!git --version
!dvc --version || echo '⚠️ DVC not found (ok for quick run)'

try:
    import torch
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device:", torch.cuda.get_device_name(0))
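except Exception as exc:
    # Covers a missing package as well as a broken install (e.g. a failed library load).
    print(f"PyTorch unavailable ({exc!r}); continuing")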

sm_print("--version")
out = sm("--version", capture=True)
print(out.stdout if out.stdout else "(no output)")

sm_print("--help")
out = sm("--help", capture=True)
print("\n".join(out.stdout.splitlines()[:20]) if out.stdout else "(no output)")

## 2) (Optional) DVC Pull
If your repo uses DVC for data/artifacts, pull the latest versions. The step is skipped if DVC is not installed, and a failed pull is non-fatal.

In [None]:
%%bash
set -euo pipefail
if command -v dvc >/dev/null 2>&1; then
  echo "DVC detected — pulling data (if remote configured)..."
  dvc pull || echo '⚠️ dvc pull non-fatal'
else
  echo "DVC not installed — skipping"
fi

## 3) Calibrate (fast, sampled)
Run calibration on a small sample for speed. Adjust `--sample` as needed.

In [None]:
sm_print("calibrate --sample 5 --fast")
cal = sm("calibrate --sample 5 --fast", capture=True)
print(cal.stdout or "(no output)")

# List key outputs if your pipeline writes calibrated artifacts
!ls -lah outputs || true
!ls -lah outputs/* 2>/dev/null || true
!find outputs -maxdepth 2 -type f | head -n 20 || true

## 4) Train (tiny)
One quick epoch / fast-dev-run for smoke validation. Override Hydra config inline if desired.
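
Inline Hydra overrides are plain `key=value` tokens appended to the command. As a minimal sketch, they can be rendered from a dict; the helper name `hydra_overrides` and the `training.lr` key are hypothetical, used here only for illustration.

```python
def hydra_overrides(params: dict) -> str:
    """Render a flat dict as space-separated Hydra `key=value` overrides."""
    def render(value):
        # Hydra expects lowercase booleans and `null` for None.
        if isinstance(value, bool):
            return "true" if value else "false"
        if value is None:
            return "null"
        return str(value)
    return " ".join(f"{key}={render(value)}" for key, value in params.items())

# `training.lr` is a hypothetical key; substitute keys from your own config.
overrides = hydra_overrides({"training.fast_dev_run": True, "training.lr": 0.001})
train_cmd = f"train --epochs 1 {overrides}"
print(train_cmd)
```

This sketch does not quote values, so values containing spaces or shell-special characters would need extra handling before being passed to `sm`.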

In [None]:
train_cmd = "train --epochs 1 training.fast_dev_run=true"
sm_print(train_cmd)
tr = sm(train_cmd, capture=True)
print(tr.stdout or "(no output)")

# Peek at any saved checkpoints/metrics
!find outputs -maxdepth 3 -type f \( -name "*.pt" -o -name "*.ckpt" -o -name "metrics*.json" -o -name "*log*.json" \) | head -n 20 || true
!tail -n 50 logs/v50_debug_log.md 2>/dev/null || true
!tail -n 50 outputs/metrics.json 2>/dev/null || true
!tail -n 50 outputs/train/metrics.json 2>/dev/null || true
!find outputs -maxdepth 3 -type f -name "*.json" | head -n 10 || true
!find outputs -maxdepth 3 -type f -name "*.yaml" | head -n 10 || true

## 5) Predict
Generate quick predictions. The `predict` command writes to `outputs/preds_quick/`; with the `submit --dry-run` fallback, the output location depends on your CLI.

- If your CLI exposes `predict`, use that.
- If predictions are created by `submit --dry-run`, run that and parse the produced CSV/ZIP accordingly.
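
Detecting predictions by globbing for CSV files can be fooled by files left over from an earlier run. One way to guard against that, sketched here under the assumption that file modification times are reliable, is to keep only files modified after the command started. The helper name `csvs_modified_since` is hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path

def csvs_modified_since(root: Path, pattern: str, since: float) -> list:
    """Return files under `root` matching `pattern` with mtime >= `since`, sorted by path."""
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.stat().st_mtime >= since
    )

# Demo in a temporary directory: one stale file and one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    started = time.time()
    stale = root / "submission_old.csv"
    fresh = root / "submission_new.csv"
    stale.write_text("id,mu\n")
    fresh.write_text("id,mu\n")
    # Set mtimes explicitly so the demo does not depend on filesystem timestamp granularity.
    os.utime(stale, (started - 3600, started - 3600))
    os.utime(fresh, (started + 1, started + 1))
    found = [p.name for p in csvs_modified_since(root, "**/submission*.csv", started)]
    print(found)
```

In the notebook, `started = time.time()` would be recorded just before calling `sm(cmd)`; subtracting a second or two gives some margin on filesystems with coarse timestamps.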

In [None]:
pred_dir = Path("outputs/preds_quick")

# Try `predict` first
predict_cmds = [
    "predict --outdir outputs/preds_quick --fast",
    # Fallback: some repos produce predictions as part of submit dry-run
    "submit --dry-run"
]

ok = False
for cmd in predict_cmds:
    sm_print(cmd)
    out = sm(cmd, capture=True)
    print(out.stdout or "(no output)")
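    # Heuristic: the command must exit cleanly and leave a CSV in preds_quick
    # or a submission file somewhere under the working directory.
    # Note: files left over from earlier runs also match these globs.
    csvs = list(pred_dir.glob("**/*.csv"))
    subs = list(Path(".").glob("**/submission*.csv"))
    if out.returncode == 0 and (csvs or subs):
        ok = True
        break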

if not ok:
    print("⚠️ Could not detect predictions in expected locations. Inspect CLI outputs above.")

print("\nDiscovered prediction files:")
for p in pred_dir.glob("**/*.csv"):
    print("-", p)
for p in Path(".").glob("**/submission*.csv"):
    print("-", p)

!ls -lah outputs/preds_quick 2>/dev/null || true
!find outputs -maxdepth 2 -type f -name "*.csv" | head -n 20 || true
!find . -maxdepth 3 -type f -name "submission*.csv" | head -n 10 || true

## 6) Inspect Predictions (preview & quick plot)
This section attempts to open a CSV of predictions and visualize a spectrum for a quick sanity check (optional).
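
The quick plot treats every numeric column as one spectrum, so a numeric ID column or per-bin uncertainty columns would be mixed into the curve. If the CSV names its columns by prefix, the groups can be separated first. The sketch below assumes `mu_` and `sigma_` prefixes and a `planet_id` column; all three names are hypothetical and should be checked against the actual CSV header.

```python
def split_spectrum_columns(columns, mu_prefix="mu_", sigma_prefix="sigma_"):
    """Split column names into (mu, sigma, other) groups by name prefix."""
    mu = [c for c in columns if c.startswith(mu_prefix)]
    sigma = [c for c in columns if c.startswith(sigma_prefix)]
    other = [c for c in columns if c not in mu and c not in sigma]
    return mu, sigma, other

# Hypothetical header: an ID column followed by two mu bins and two sigma bins.
header = ["planet_id", "mu_0", "mu_1", "sigma_0", "sigma_1"]
mu_cols, sigma_cols, other_cols = split_spectrum_columns(header)
print(mu_cols, sigma_cols, other_cols)
```

With a DataFrame `df`, `split_spectrum_columns(list(df.columns))` returns the names to pass to `df[mu_cols]` before plotting.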

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def _find_pred_csv():
    # Prefer the explicit preds directory, else any submission*.csv.
    # Glob order is arbitrary, so pick the most recently modified match.
    for root, pattern in ((Path("outputs/preds_quick"), "**/*.csv"),
                          (Path("."), "**/submission*.csv")):
        cands = list(root.glob(pattern))
        if cands:
            return max(cands, key=lambda p: p.stat().st_mtime)
    return None

csv_path = _find_pred_csv()
if csv_path and csv_path.is_file():
    print("Preview:", csv_path)
    df = pd.read_csv(csv_path)
    display(df.head(10))
    # Try naive plot: look for columns that look like wavelength bins
    num_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    # If a single row contains a full spectrum across columns, plot the first row
    if len(num_cols) > 10:
        y = df[num_cols].iloc[0].values
        x = list(range(len(y)))
        plt.figure(figsize=(8,3))
        plt.plot(x, y, lw=1)
        plt.title(f"Quick Spectrum Preview — {csv_path.name}")
        plt.xlabel("Bin index")
        plt.ylabel("Predicted μ")
        plt.tight_layout()
        plt.show()
    else:
        print("(Skipping plot — could not infer wide spectrum columns)")
else:
    print("No prediction CSV found to preview.")

## 7) (Optional) Diagnostics Snapshot
Run a minimal dashboard build to confirm plots render (UMAP/t-SNE disabled for speed). Outputs in `outputs/diagnostics_quick/`.

In [None]:
diag_cmd = "diagnose dashboard --no-umap --no-tsne --outdir outputs/diagnostics_quick"
sm_print(diag_cmd)
dg = sm(diag_cmd, capture=True)
print(dg.stdout or "(no output)")

!find outputs/diagnostics_quick -maxdepth 2 -type f | head -n 20 || true
!ls -lah outputs/diagnostics_quick 2>/dev/null || true
!find outputs -maxdepth 3 -type f -name "*report*.html" | head -n 10 || true

## 8) Summary & Next Steps
- If every step above succeeded, you now have calibrated data, a quick-trained model, and a prediction artifact.
- For real experiments, increase `--sample`, drop `--fast` and the `training.fast_dev_run=true` override, and raise `--epochs`.
- Consider running `submit` without `--dry-run` to package a full submission when ready.
- Explore diagnostics HTML under `outputs/diagnostics_quick/`.