# SpectraMind V50 — Kaggle **Training** Notebook

**Purpose:** train the SpectraMind V50 model **inside Kaggle** (no internet).  
Attach the Ariel competition dataset and your SpectraMind V50 code dataset (and optionally an **artifacts** dataset with checkpoints).  

> Keep this notebook lightweight; heavy preprocessing/training should live in library code & DVC stages. This scaffold ensures safe defaults and reproducible snapshots.

## 0) Environment & Inputs

In [None]:
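import os, sys, json, platform, random, shutil, time
from pathlib import Path
import numpy as np
import pandas as pd

# Determinism: seed the Python and NumPy RNGs here; seed torch separately if it is used
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Display
pd.set_option('display.max_columns', 200)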

IS_KAGGLE = Path('/kaggle/input').exists()
COMP_DIR = Path('/kaggle/input/ariel-data-challenge-2025') if IS_KAGGLE else Path('./data/kaggle-mock')
CODE_DS  = Path('/kaggle/input/spectramind-v50') if IS_KAGGLE else Path('./')  # attached code dataset

print("Env:", "Kaggle" if IS_KAGGLE else "Local", "| Python:", sys.version.split()[0])
print("Competition data:", COMP_DIR.exists(), str(COMP_DIR))

# Outputs / Artifacts
OUT = Path('outputs'); OUT.mkdir(parents=True, exist_ok=True)
ART = Path('artifacts'); ART.mkdir(parents=True, exist_ok=True)

# Add spectramind src path if code dataset is attached
if IS_KAGGLE and (CODE_DS/'src').exists():
    sys.path.insert(0, str(CODE_DS/'src'))
    print("Added code src path:", CODE_DS/'src')

### (Optional) Symlink Kaggle inputs into repo layout (runtime mounts)

In [None]:
# %%bash
# # Remove one level of "# " from every line in this cell to map Kaggle input data
# # into a repo-style data layout (zero-copy). `%%bash` must stay the first line of the cell.
# REPO=/kaggle/working/spectramind-v50
# mkdir -p "$REPO/data/raw" "$REPO/data/interim" "$REPO/data/processed" "$REPO/data/external" "$REPO/artifacts" "$REPO/models"
# ln -sfn /kaggle/input/ariel-data-challenge-2025  "$REPO/data/raw/adc2025"
# echo "Symlinked Kaggle inputs under $REPO/data/raw/adc2025"
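
The same mapping can be done from Python without a shell cell. The following is a minimal sketch, assuming the repo-style layout used in the commented cell above; `link_inputs` is a hypothetical helper, not part of SpectraMind.

```python
from pathlib import Path

def link_inputs(src: Path, repo: Path, name: str = "adc2025") -> Path:
    """Symlink `src` to `<repo>/data/raw/<name>` (hypothetical helper)."""
    raw = repo / "data" / "raw"
    raw.mkdir(parents=True, exist_ok=True)
    link = raw / name
    # Drop a stale link first so reruns are idempotent (similar to `ln -sfn`)
    if link.is_symlink():
        link.unlink()
    link.symlink_to(src, target_is_directory=True)
    return link
```

On Kaggle this would be called as `link_inputs(Path('/kaggle/input/ariel-data-challenge-2025'), Path('/kaggle/working/spectramind-v50'))`. If a real directory already sits at the link location, `symlink_to` raises `FileExistsError` rather than overwriting it.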

## 1) GPU / Torch Check (guarded)

In [None]:
try:
    import torch
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
except Exception as e:
    print("Torch not available / skipped:", e)
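
The config snapshot in the next section carries a `precision` field that should only move off `"fp32"` when the hardware supports it. One way to derive a suggestion from the same guarded check is sketched below; `pick_precision` is a hypothetical helper, and the mapping from device capability to a precision string is an assumption, not SpectraMind behavior.

```python
def pick_precision() -> str:
    """Return "bf16", "fp16", or "fp32" based on what the runtime reports."""
    try:
        import torch
    except Exception:
        return "fp32"  # no usable torch: stay on the safe default
    if not torch.cuda.is_available():
        return "fp32"
    # is_bf16_supported() reports whether the current CUDA device can run bfloat16
    if torch.cuda.is_bf16_supported():
        return "bf16"
    return "fp16"

print("Suggested precision:", pick_precision())
```

Treat the result as a suggestion: depending on the PyTorch version, bf16 may be reported as supported on GPUs that only emulate it, so it is worth timing a short run before committing to it.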

## 2) Minimal Config Snapshot (Hydra-like)

In [None]:
config = {
    "env": "kaggle" if IS_KAGGLE else "local",
    "data": {
        "competition_dir": str(COMP_DIR),
        "train_csv": str(COMP_DIR/'train.csv'),
        "train_star_info": str(COMP_DIR/'train_star_info.csv'),
        "axis_info": str(COMP_DIR/'axis_info.parquet')
    },
    "training": {
        "seed": SEED,
        "epochs": 5,            # keep light for Kaggle; increase in real runs
        "batch_size": 32,
        "lr": 1e-3,
        "precision": "fp32",     # set "bf16"/"fp16" only if supported
        "grad_accum": 1,
        "save_dir": "artifacts"
    },
    "model": {
        "name": "v50",
        "fgs1_encoder": "mamba_ssm-lite",
        "airs_encoder": "cnn-lite",
        "decoder": "heteroscedastic-head"
    }
}
with open(OUT/'config_snapshot.json', 'w') as f:
    json.dump(config, f, indent=2)
print("Wrote", OUT/'config_snapshot.json')
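
To tie a checkpoint back to the exact settings that produced it, the snapshot can be reduced to a short fingerprint. This is a minimal sketch, assuming the config stays JSON-serializable as it is above; `config_fingerprint` is a hypothetical helper name.

```python
import hashlib
import json

def config_fingerprint(cfg: dict) -> str:
    """Short SHA-256 digest of a JSON-serializable config (hypothetical helper)."""
    # sort_keys plus fixed separators give a canonical string, so key order does not matter
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Calling `config_fingerprint(config)` gives a 12-character hex string that could be stored alongside the checkpoint name or in a manifest. Note that the fingerprint changes with the environment, because `config` embeds absolute data paths.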

## 3) Fast Data Sanity (guarded)

In [None]:
def exists(p):
    try: return Path(p).exists()
    except Exception: return False

issues, notes = [], []   # issues block the preview; notes are informational
if not exists(config["data"]["train_csv"]):
    issues.append("Missing train.csv")
if not exists(config["data"]["axis_info"]):
    notes.append("Missing axis_info.parquet (optional)")
print("Issues:", issues if issues else "None")
print("Notes:", notes if notes else "None")

# Preview only the first rows to keep runs light; a missing optional file does not block this
if not issues:
    try:
        df = pd.read_csv(config["data"]["train_csv"], nrows=5)
        print("train.csv (first rows):", df.shape)
        display(df.head(3))
    except Exception as e:
        print("train.csv read error:", e)
else:
    print("train.csv not available; continuing (hooks may load differently)")

## 4) Import SpectraMind hooks (if available)

In [None]:
try:
    # Expected: returns (checkpoint_path, metrics_dict)
    from spectramind.cli_hooks import notebook_train
    HAVE_SM = True
    print("SpectraMind hooks available.")
except Exception as e:
    HAVE_SM = False
    print("SpectraMind hooks NOT available:", e)
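
The import above assumes that `notebook_train(config=...)` returns `(checkpoint_path, metrics_dict)`. While the real hook is still under development, a stand-in with that shape might look like the sketch below. The name `notebook_train_stub` and its body are hypothetical and only illustrate the contract; this is not SpectraMind's implementation.

```python
from pathlib import Path
from typing import Any, Dict, Tuple

def notebook_train_stub(config: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Stand-in with the expected hook shape: (checkpoint_path, metrics_dict)."""
    save_dir = Path(config["training"]["save_dir"])
    save_dir.mkdir(parents=True, exist_ok=True)
    ckpt = save_dir / "stub.ckpt"
    ckpt.write_bytes(b"")  # empty placeholder, not a real checkpoint
    # A real hook would report measured losses; the stub only records that nothing ran
    metrics = {"epochs_run": 0}
    return str(ckpt), metrics
```

Returning the path as a plain string and keeping the metrics JSON-serializable means the values can be written to a manifest without conversion.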

## 5) Train (hook or demo)

In [None]:
ckpt_path = None
metrics = {}

if HAVE_SM:
    # Preferred path: delegate to package hook (should do all preprocessing + training)
    t0 = time.time()
    ckpt_path, metrics = notebook_train(config=config)
    elapsed = time.time() - t0
    print(f"Hook training finished in {elapsed:.1f}s")
    print("Checkpoint:", ckpt_path)
    print("Metrics:", metrics)
else:
    print("Falling back to a tiny demo trainer (placeholder)")
    # --- Demo: simulate a training artifact ---
    t0 = time.time()
    time.sleep(1.0)
    ckpt_path = str((Path(config["training"]["save_dir"]) / 'model_v50_demo.ckpt').resolve())
    Path(config["training"]["save_dir"]).mkdir(parents=True, exist_ok=True)
    with open(ckpt_path, 'wb') as f:
        f.write(os.urandom(256))
    metrics = {"train_loss": 0.123, "val_loss": 0.234, "elapsed_s": round(time.time()-t0, 2)}
    with open(OUT/'train_metrics.json', 'w') as f:
        json.dump(metrics, f, indent=2)
    print("Saved demo checkpoint:", ckpt_path)
    print("Metrics:", metrics)

## 6) Register Artifacts (manifest)

In [None]:
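manifest = {
    # str() keeps the manifest JSON-serializable if the hook returns a Path
    "checkpoint": str(ckpt_path) if ckpt_path is not None else None,
    "metrics": metrics,
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
}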
with open(OUT/'train_manifest.json', 'w') as f:
    json.dump(manifest, f, indent=2)
print("Wrote", OUT/'train_manifest.json')
print("\nTraining complete. Next: use this checkpoint in your prediction notebook to build submission.csv")
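
Because the checkpoint is consumed later by a separate prediction notebook, recording a checksum lets that notebook confirm it loaded the same file. A minimal sketch, assuming the checkpoint is a single file on disk; `file_sha256` is a hypothetical helper and the `checkpoint_sha256` key is not part of any existing manifest schema.

```python
import hashlib

def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large checkpoints do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

One way to use it is to set `manifest["checkpoint_sha256"] = file_sha256(ckpt_path)` before the manifest is written, then recompute and compare the digest in the prediction notebook.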

## Notes
- **Zero internet**: all deps must come from the Kaggle base image or attached datasets.
- **Reproducibility**: the config snapshot is written to `outputs/config_snapshot.json`; metrics are written to `outputs/train_metrics.json` by the demo trainer, or returned by your hook and recorded in `outputs/train_manifest.json`.
- **Data**: competition files live under `/kaggle/input/ariel-data-challenge-2025/`. For repo-style paths, symlink them under `/kaggle/working/.../data/...`; the links are runtime-only mounts.
- **Runtime**: keep epochs small in notebooks; long runs belong in offline/CLI pipelines. Set seeds explicitly (this notebook uses `SEED = 42`), and pin dependencies via a Kaggle-ready `requirements` dataset if you need custom packages.
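
For the seed control mentioned under **Runtime**, a single helper can cover the usual RNG sources. This is a minimal sketch under the assumption that NumPy and PyTorch may or may not be installed; `seed_everything` is a hypothetical name, and seeding alone does not guarantee bit-identical GPU results.

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Seed the RNGs that are present; skip libraries that are not installed."""
    # Only affects subprocesses started later; this process's hash seed is already fixed
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except Exception:
        pass  # torch missing or not importable in this environment
```

Calling `seed_everything(SEED)` at the top of the notebook, and again inside any training entry point, keeps the two in agreement.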