# SpectraMind V50 — Kaggle Notebook Template

**Purpose**: a safe, reproducible scaffold for running the NeurIPS 2025 Ariel Data Challenge workflows on Kaggle **without internet access**.

This template supports:
- Environment detection (Kaggle vs local)
- Read-only data access at `/kaggle/input`
- Optional import of the SpectraMind V50 package if it is available as a Kaggle dataset (`/kaggle/input/spectramind-v50`)
- Strictly pinned dependencies where possible (see `requirements-kaggle.txt` in the repo)
- Reproducible config snapshot embedded in the notebook
- Optional submission packaging (zip)

> Keep heavy work in library/DVC stages. This template is intentionally light, deterministic, and zero-internet.
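
Because nothing can be installed at run time, it can help to compare the pins against what the Kaggle image already provides. The following is a minimal sketch, not part of the SpectraMind package: `check_pins` is a hypothetical helper, and it assumes the requirements file contains plain `name==version` lines (no extras, URLs, or includes).

```python
from importlib import metadata


def check_pins(requirement_lines, get_version=metadata.version):
    """Compare `name==version` pins against installed packages.

    Returns a dict mapping package name to (pinned, installed), where
    installed is None if the package is missing. Matching pins are omitted.
    """
    problems = {}
    for raw in requirement_lines:
        line = raw.split('#', 1)[0]            # drop trailing comments
        line = line.split(';', 1)[0].strip()   # drop environment markers
        if not line or '==' not in line:
            continue                           # only exact pins are checked
        name, pinned = (part.strip() for part in line.split('==', 1))
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            problems[name] = (pinned, installed)
    return problems
```

To use it, pass the lines of the requirements file, for example `check_pins(Path(...).read_text().splitlines())` with the path to wherever `requirements-kaggle.txt` is attached. An empty result means every exact pin matches the installed version.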


## 0) Environment & Paths

In [None]:
import os, sys, json, platform, random
from pathlib import Path
import numpy as np
import pandas as pd

# Determinism: seed the Python and NumPy global RNGs
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Display
pd.set_option('display.max_columns', 200)

# Environment detection
IS_KAGGLE = Path('/kaggle/input').exists()
COMP_DIR = Path('/kaggle/input/ariel-data-challenge-2025') if IS_KAGGLE else Path('./data')
WORK_DIR = Path('/kaggle/working') if IS_KAGGLE else Path('.')
print("Env:", "Kaggle" if IS_KAGGLE else "Local", "| Python:", sys.version.split()[0])

# Optional: SpectraMind source attached as a Kaggle dataset (no internet needed)
SPECTRAMIND_DS = Path('/kaggle/input/spectramind-v50')
SPECTRAMIND_SRC = SPECTRAMIND_DS / 'src'
if IS_KAGGLE and SPECTRAMIND_SRC.is_dir():
    sys.path.insert(0, str(SPECTRAMIND_SRC))
    print("SpectraMind source path added:", SPECTRAMIND_SRC)

# Outputs folder
OUT = WORK_DIR / 'outputs'
OUT.mkdir(parents=True, exist_ok=True)
print("Outputs:", OUT)

### (Optional) Symlink Kaggle Inputs into a Repo Layout
If you mount your code under `/kaggle/working/spectramind-v50/`, you can symlink Kaggle inputs into `data/raw/...` (zero-copy).

Uncomment and adjust paths if you use this workflow (see also the symlink note in your docs).
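
If you would rather stay in Python than use a `%%bash` cell, the same zero-copy link can be sketched with `pathlib`. `link_input` is a hypothetical helper, not part of the repo. It is only a rough equivalent of `ln -sfn`: it replaces an existing symlink but refuses to touch a real file or directory at the destination.

```python
from pathlib import Path


def link_input(src: Path, dst: Path) -> Path:
    """Point `dst` at `src` with a symlink, replacing an existing link."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.is_symlink():
        dst.unlink()                      # replace a stale or outdated link
    elif dst.exists():
        raise FileExistsError(f"{dst} exists and is not a symlink")
    dst.symlink_to(src, target_is_directory=True)
    return dst
```

For the layout described above, the call would be `link_input(Path('/kaggle/input/ariel-data-challenge-2025'), Path('/kaggle/working/spectramind-v50/data/raw/adc2025'))`.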

In [None]:
# %%bash
# REPO=/kaggle/working/spectramind-v50
# mkdir -p "$REPO/data/raw" "$REPO/data/interim" "$REPO/data/processed" "$REPO/data/external" "$REPO/artifacts" "$REPO/models"
# # Competition input
# ln -sfn /kaggle/input/ariel-data-challenge-2025 "$REPO/data/raw/adc2025"
# echo "Symlinked Kaggle inputs under $REPO/data/raw/adc2025" 

## 1) Config Snapshot (embed minimal JSON for provenance)
We write a small config snapshot to `outputs/config_snapshot.json` so that each run keeps a record of its basic settings.
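
One optional way to make snapshots easy to compare across runs is to hash the config in a canonical form. This is a sketch only: `config_fingerprint` is a hypothetical helper, and it assumes the config holds JSON-serializable values.

```python
import hashlib
import json


def config_fingerprint(cfg: dict) -> str:
    """Return a SHA-256 hex digest of the config in canonical JSON form."""
    # Sorted keys and fixed separators make the digest independent of key order.
    canonical = json.dumps(cfg, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()
```

Two runs with the same settings then produce the same digest, which could be printed or stored next to the snapshot.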

In [None]:
config = {
    "pipeline": ["calibrate", "predict"],   # example only
    "model": {
        "fgs1_encoder": "mamba_ssm-lite",
        "airs_encoder": "cnn-lite",
        "decoder": "heteroscedastic-head"
    },
    "data": {
        "competition": str(COMP_DIR),
        "bins": 283
    },
    "runtime": {
        "env": "kaggle" if IS_KAGGLE else "local",
        "python": platform.python_version(),
        "seed": SEED
    }
}
with open(OUT/'config_snapshot.json', 'w') as f:
    json.dump(config, f, indent=2)
print("Wrote:", OUT/'config_snapshot.json')

## 2) Data Access (read competition files if present)
List a subset of competition files for quick discovery (guarded for portability).
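
For a quick overview of what a listing contains, the paths can be tallied by file type. A small sketch using only the standard library; `count_by_suffix` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(paths):
    """Count paths per lower-cased file suffix (e.g. '.csv')."""
    return Counter(Path(p).suffix.lower() for p in paths)
```

Calling it on the list of discovered paths gives a mapping from suffix to file count.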

In [None]:
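def list_files(base: Path, patterns=('.csv', '.parquet', '.json'), limit=50):
    """Return up to `limit` sorted file paths under `base` whose suffix is in `patterns`."""
    if not base.exists():
        return []
    out = []
    for p in base.rglob('*'):
        try:
            if p.is_file() and p.suffix.lower() in patterns:
                out.append(str(p))
        except OSError:
            # Skip unreadable entries (e.g. permission errors)
            pass
    return sorted(out)[:limit]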

inventory = list_files(COMP_DIR)
print("Sample files:", len(inventory))
for p in inventory[:10]:
    print("-", p)

## 3) Minimal EDA (guarded)
Attempt to read small tables like `train.csv`/`test.csv` if present. All steps are guarded to avoid failures in other environments.

In [None]:
from IPython.display import display

def safe_read_csv(p: Path, n=5):
    try:
        df = pd.read_csv(p)
        print(p.name, df.shape)
        display(df.head(n))
        return df
    except Exception as e:
        print("Failed reading", p, "->", e)
        return None

train_csv = COMP_DIR/'train.csv'
test_csv  = COMP_DIR/'test.csv'
train_df = safe_read_csv(train_csv) if train_csv.exists() else None
test_df  = safe_read_csv(test_csv) if test_csv.exists() else None

## 4) Optional: Import SpectraMind and Run Inference Hooks
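If you published your SpectraMind repo as a Kaggle Dataset containing `src/spectramind`, this cell imports it automatically, with no internet access required.
A simple `notebook_predict` hook is then called on a few test ids as a sanity check. The cell writes whatever the hook returns to `outputs/submission.csv`, so when only the five sample ids are passed, that file is not a complete submission.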

In [None]:
try:
    from spectramind.cli_hooks import notebook_predict  # your repo should provide this hook
    HAVE_SM = True
    print("SpectraMind hooks available.")
except Exception as e:
    HAVE_SM = False
    print("SpectraMind hooks not available:", e)

# Optional sample inference
sample_ids = None
if 'test_df' in globals() and test_df is not None and 'id' in test_df.columns:
    sample_ids = test_df['id'].head(5).tolist()

if HAVE_SM:
    preds = notebook_predict(
        comp_dir=str(COMP_DIR),
        config=config,
        ids=sample_ids
    )
    display(preds.head())
    out_csv = OUT/'submission.csv'
    preds.to_csv(out_csv, index=False)
    print("Wrote submission to", out_csv)
else:
    print("No spectramind package — keeping template minimal. "
          "Attach your repo as a dataset and expose a 'notebook_predict' to enable this.")

## 5) Submission Packaging Helper (zip)
Creates `submission.zip` in the working directory if `outputs/submission.csv` exists.
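
After packaging, the archive can be checked before it is uploaded. The sketch below uses only the standard `zipfile` module; `verify_submission_zip` is a hypothetical helper, and the single-member layout is an assumption based on how this template builds the zip.

```python
import zipfile
from pathlib import Path


def verify_submission_zip(zip_path: Path, expected_name: str = 'submission.csv') -> bool:
    """Return True if the archive is intact and holds only `expected_name`."""
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() returns the first corrupt member's name, or None if all pass.
        if zf.testzip() is not None:
            return False
        return zf.namelist() == [expected_name]
```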

In [None]:
import zipfile

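def zip_submission(csv_path: Path, zip_path: Path):
    """Write `csv_path` into a new deflate-compressed archive at `zip_path`."""
    # Raise explicitly: assert statements are skipped when Python runs with -O
    if not csv_path.exists():
        raise FileNotFoundError(f"CSV not found: {csv_path}")
    with zipfile.ZipFile(zip_path, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname=csv_path.name)
    print("Created:", zip_path)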

sub = OUT/'submission.csv'
if sub.exists():
    zip_submission(sub, WORK_DIR/'submission.zip')
else:
    print("No submission.csv found — skip zipping.")

## Notes
- **No internet**: dependencies must come from the Kaggle base image or attached datasets.
- **Reproducibility**: minimal config snapshot is written to `outputs/config_snapshot.json`.
- **Mounting data**: use symlinks inside `/kaggle/working` (see optional cell above) to map Kaggle inputs to your repo layout.
- **Schema**: ensure your submission matches the challenge schema (an `id` column plus 283 μ and 283 σ columns, one pair per bin).
- **Runtime**: keep per-cell runtimes modest; prefer tested library code + DVC for heavy lifting.
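
The schema note above can be turned into a quick pre-submission check. This is a minimal sketch under stated assumptions: `check_submission` is a hypothetical helper, `id` is assumed to be the first column, and only the column count and numeric validity are checked, since the exact column names are not specified here. Consult the competition's sample submission for the authoritative layout.

```python
import csv
import math
from pathlib import Path

N_BINS = 283  # matches the "bins" value in the config snapshot


def check_submission(csv_path: Path, n_bins: int = N_BINS) -> list:
    """Return a list of problems found in the CSV (empty if none)."""
    problems = []
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header:
            return ["file is empty"]
        expected = 1 + 2 * n_bins          # id + mu columns + sigma columns
        if len(header) != expected:
            problems.append(f"expected {expected} columns, found {len(header)}")
        if header[0] != 'id':
            problems.append(f"first column is {header[0]!r}, expected 'id'")
        for row_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(f"row {row_no}: {len(row)} fields")
                continue
            try:
                values = [float(v) for v in row[1:]]
            except ValueError:
                problems.append(f"row {row_no}: non-numeric value")
                continue
            if not all(math.isfinite(v) for v in values):
                problems.append(f"row {row_no}: NaN or infinite value")
    return problems
```

An empty list means the file passed these basic checks. It does not confirm that column names or ordering match what the competition expects.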
