# 10 · Kaggle Submission Pipeline (SpectraMind V50)

This mission‑grade **handoff notebook** produces a Kaggle‑ready submission bundle from the official **CLI + Hydra** artifacts only (no ad‑hoc pipeline code).

### What this does
1) Detects environment (local vs Kaggle kernels) and sets safe, offline defaults.
2) Locates model predictions under `outputs/` and converts them to the competition’s expected schema.
3) Validates the file (best‑effort + optional CLI validator) and **packages** a `submission.zip`.
4) Exports a tiny **manifest** + README for audit, and optionally registers the artifacts with DVC.

### Contract
- **CLI‑first**: we only *read* artifacts that your CLI wrote; any regeneration (predict/validate) is invoked via the CLI cells.
- **Offline‑friendly**: No internet dependence (works in Kaggle where internet is often disabled).
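- **Deterministic output paths**: Everything goes to `outputs/notebooks/10_kaggle_submission/`; on Kaggle, `submission.csv` and `submission.zip` are written to `/kaggle/working/` first and then copied there for archiving.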


In [None]:
import os, sys, json, shutil, zipfile, platform, subprocess, textwrap
from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np

ROOT = Path.cwd().resolve()
NB_OUT = ROOT / 'outputs' / 'notebooks' / '10_kaggle_submission'
NB_OUT.mkdir(parents=True, exist_ok=True)

# Detect Kaggle notebook environment
IS_KAGGLE = Path('/kaggle/working').exists()
WORK_DIR = Path('/kaggle/working') if IS_KAGGLE else NB_OUT

print('ROOT      :', ROOT)
print('NB_OUT    :', NB_OUT)
print('IS_KAGGLE :', IS_KAGGLE)
print('WORK_DIR  :', WORK_DIR)

# Try to locate CLI (optional; only used for validation or to regenerate predictions)
CLI = shutil.which('spectramind') or (f"{sys.executable} {ROOT/'spectramind.py'}" if (ROOT/'spectramind.py').exists() else f"{sys.executable} -m spectramind")
print('CLI       :', CLI)

ENV_SNAPSHOT = {
    'python'   : platform.python_version(),
    'platform' : platform.platform(),
    'kaggle'   : bool(IS_KAGGLE),
}
(NB_OUT/'env_snapshot.json').write_text(json.dumps(ENV_SNAPSHOT, indent=2))
print('Saved env snapshot.')

## Parameters
Edit these if you need a custom source filename or competition slug (for local testing). On Kaggle, we **do not submit** from this notebook; we only produce `submission.csv`/`submission.zip` in `/kaggle/working`.

In [None]:
# A descriptive name for this submission artifact
SUBMISSION_NAME = f"spectramind_v50_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}"

# Competition slug (used only for local Kaggle CLI workflows; not required in Kaggle Kernels)
COMPETITION_SLUG = "ariel-data-challenge-2025"  # adjust to actual

# Candidate prediction locations: files are used as-is, directories are scanned recursively for *.csv
PRED_SOURCE_HINTS = [
    ROOT/'outputs'/'predictions'/'predictions.csv',
    ROOT/'outputs'/'predictions.csv',
    ROOT/'outputs'/'runs',  # will scan timestamped folders
]

print('SUBMISSION_NAME   :', SUBMISSION_NAME)
print('COMPETITION_SLUG  :', COMPETITION_SLUG)
print('PRED_SOURCE_HINTS :', [str(p) for p in PRED_SOURCE_HINTS])

## 1) Locate predictions
We look for a predictions table under `outputs/`. Flexible schema handling:
- **Wide**: columns like `mu_000..mu_282`.
- **Long**: `planet_id,wavelength_index,mu` → we pivot to wide if the competition requires a single row per ID.

If your competition expects a very specific format (e.g., `row_id,prediction`), adapt the mapping cell below accordingly or invoke the CLI validator cell.
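
As a rough illustration of the two layouts, the sketch below classifies a table by its column names alone. `detect_layout` is a hypothetical helper (not part of the SpectraMind CLI), and the toy frames only assume the column names listed above.

```python
import pandas as pd

def detect_layout(df: pd.DataFrame) -> str:
    """Classify a predictions table as 'wide', 'long', or 'unknown' (hypothetical helper)."""
    lower = {str(c).lower() for c in df.columns}
    if any(name.startswith("mu_") for name in lower):
        return "wide"
    if {"planet_id", "wavelength_index", "mu"} <= lower:
        return "long"
    return "unknown"

# Toy data: two planets, two wavelength bins, in both layouts.
wide_df = pd.DataFrame({"planet_id": [1, 2], "mu_000": [0.1, 0.2], "mu_001": [0.3, 0.4]})
long_df = pd.DataFrame({
    "planet_id": [1, 1, 2, 2],
    "wavelength_index": [0, 1, 0, 1],
    "mu": [0.1, 0.3, 0.2, 0.4],
})

print(detect_layout(wide_df))  # wide
print(detect_layout(long_df))  # long
```

Checking column names before transforming makes an unexpected export fail early with a readable label rather than deep inside a pivot.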

In [None]:
def find_predictions(hints):
    # if a hint is a file, take it; if a dir, scan inside for *.csv (prefer newest)
    candidates = []
    for h in hints:
        if not h.exists():
            continue
        if h.is_file() and h.suffix.lower() == '.csv':
            candidates.append(h)
        elif h.is_dir():
            candidates += sorted(h.rglob('*.csv'))
    # Prefer files with 'pred' in the name, then the most recently modified.
    # With reverse=True the tuple sorts True before False and newer before older.
    candidates = sorted(candidates, key=lambda p: ('pred' in p.name.lower(), p.stat().st_mtime), reverse=True)
    return candidates[0] if candidates else None

PRED_FILE = find_predictions(PRED_SOURCE_HINTS)
print('Selected prediction file:', PRED_FILE)
if PRED_FILE is None:
    raise FileNotFoundError('No predictions CSV found under outputs/. Generate predictions first (e.g., 04_predict_v50_demo).')

pred_df = pd.read_csv(PRED_FILE)
print('pred_df shape:', pred_df.shape)
pred_df.head(3)

## 2) Map to competition schema
Adjust this transform to match the **official submission format**. Typical cases:
- **Already matches**: your file already has `Id,Prediction` or the required wide columns → just rename.
- **Long → wide**: `planet_id,wavelength_index,mu` → pivot to one row per planet (or row_id), with ordered columns `mu_000..mu_282`.

Below we implement a mapper with a few heuristics. If it cannot infer the schema, it raises a `ValueError` that lists the accepted layouts.
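
To make the long → wide case concrete, here is a minimal sketch on toy data. The planet IDs and values are made up for illustration, and only three wavelength bins are used instead of the full range.

```python
import pandas as pd

long_df = pd.DataFrame({
    "planet_id": [7, 7, 7, 9, 9, 9],
    "wavelength_index": [0, 1, 2, 0, 1, 2],
    "mu": [0.10, 0.11, 0.12, 0.20, 0.21, 0.22],
})

# Zero-padded names keep lexicographic order equal to numeric order.
long_df["mu_col"] = long_df["wavelength_index"].map(lambda i: f"mu_{i:03d}")

wide_df = (
    long_df.pivot(index="planet_id", columns="mu_col", values="mu")
    .reset_index()
    .rename(columns={"planet_id": "Id"})
)
wide_df.columns.name = None  # drop the leftover 'mu_col' axis label

print(wide_df.columns.tolist())  # ['Id', 'mu_000', 'mu_001', 'mu_002']
print(wide_df.shape)             # (2, 4)
```

Note that `pivot` raises on duplicate `(planet_id, wavelength_index)` pairs, whereas `pivot_table(..., aggfunc='mean')` averages them silently. Which behavior is preferable depends on whether duplicates are expected in your predictions.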

In [None]:
def to_competition_schema(df: pd.DataFrame) -> pd.DataFrame:
    cols = {c.lower(): c for c in df.columns}

    # CASE 0: Already in Kaggle shape (common names); just standardize column names
    if {'id','prediction'}.issubset(set(cols)):
        out = df.rename(columns={cols['id']:'Id', cols['prediction']:'Prediction'})[['Id','Prediction']]
        return out

    # CASE 1: Wide mu_000.. columns present
    mu_cols = [c for c in df.columns if str(c).startswith('mu_')]
    if mu_cols:
        # Use the first ID-ish column as Id
        id_col = None
        for k in ('id','row_id','planet_id','sample_id'):
            if k in cols:
                id_col = cols[k]; break
        if id_col is None:
            df = df.copy()
            df.insert(0,'Id', np.arange(len(df)))
        else:
            df = df.rename(columns={id_col:'Id'})
        # Wide format expected by some challenges → rename to canonical
        # If competition expects just one column 'Prediction', replace this block accordingly
        ordered = ['Id'] + sorted(mu_cols)
        return df[ordered]

    # CASE 2: Long → Pivot (planet_id, wavelength_index, mu)
    if {'planet_id','wavelength_index','mu'}.issubset(set(cols)):
        g = df.rename(columns={cols['planet_id']:'planet_id', cols['wavelength_index']:'wavelength_index', cols['mu']:'mu'})
        # Create zero-padded column names mu_000..mu_XXX
        W = int(g['wavelength_index'].max())+1
        g['mu_col'] = g['wavelength_index'].astype(int).map(lambda i: f'mu_{i:03d}')
        wide = g.pivot_table(index='planet_id', columns='mu_col', values='mu', aggfunc='mean')
        wide = wide.reset_index().rename(columns={'planet_id':'Id'})
        # Ensure all mu_000..mu_(W-1) exist
        all_mu = [f'mu_{i:03d}' for i in range(W)]
        for c in all_mu:
            if c not in wide.columns:
                wide[c] = np.nan
        return wide[['Id'] + all_mu]

    raise ValueError('Unrecognized predictions schema. Expected either (Id,Prediction), wide mu_000.., or long (planet_id,wavelength_index,mu).')

SUB_DF = to_competition_schema(pred_df)
print('Submission shape:', SUB_DF.shape)
SUB_DF.head(3)

## 3) File‑level sanity checks
We ensure required columns exist, check for NaNs/inf, and impose minimal ordering constraints. For strict validation, use the CLI validator cell below.
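
For a quick look at what such checks catch, the sketch below counts defects on a deliberately broken toy frame. `sanity_report` is a hypothetical helper and the data is invented.

```python
import numpy as np
import pandas as pd

def sanity_report(df: pd.DataFrame, id_col: str = "Id") -> dict:
    """Count common submission defects (hypothetical helper)."""
    values = (
        df.drop(columns=[id_col])
        .select_dtypes(include=[np.number])
        .to_numpy(dtype=float)
    )
    return {
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "nan_cells": int(np.isnan(values).sum()),
        "inf_cells": int(np.isinf(values).sum()),
    }

# One duplicated Id, one NaN, one inf.
toy = pd.DataFrame({
    "Id": [1, 2, 2],
    "mu_000": [0.1, np.nan, 0.3],
    "mu_001": [np.inf, 0.5, 0.6],
})
print(sanity_report(toy))  # {'duplicate_ids': 1, 'nan_cells': 1, 'inf_cells': 1}
```

Returning counts instead of raising immediately lets you see every problem in one pass before deciding which ones should block packaging.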

In [None]:
def basic_checks(df: pd.DataFrame):
    # Has an Id column
    if 'Id' not in df.columns:
        raise AssertionError('Submission must contain Id column.')
    # No duplicates
    if df['Id'].duplicated().any():
        raise AssertionError('Duplicate Id values found.')
    # No infs. NaNs only trigger a warning here; raise instead if the competition disallows them
    if np.isinf(df.select_dtypes(include=[np.number]).to_numpy()).any():
        raise AssertionError('Found inf values in numeric columns.')
    if df.select_dtypes(include=[np.number]).isna().any().any():
        print('WARNING: Found NaNs; if the competition disallows NaN, fill or impute before packaging.')
    # Column order: keep Id first
    cols = df.columns.tolist()
    if cols[0] != 'Id':
        df = df[['Id'] + [c for c in cols if c != 'Id']]
    return df

SUB_DF = basic_checks(SUB_DF)
print('Post‑check shape:', SUB_DF.shape)

### (Optional) CLI validation
If your repo provides a validator (e.g., `spectramind submit validate --bundle` or a `validate_submission.py`), invoke it here.
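
A non‑blocking validator call can be sketched as below. `run_validator` is a hypothetical helper, and the two commands are stand‑ins that use the current Python interpreter; a real run would pass your repo's validator command instead.

```python
import subprocess
import sys
from typing import Sequence

def run_validator(argv: Sequence[str]) -> bool:
    """Run a validator command; return True on exit code 0 (hypothetical helper)."""
    try:
        result = subprocess.run(list(argv), capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        return False
    if result.returncode != 0:
        print("Validator reported a problem:", result.stderr.strip())
    return result.returncode == 0

# Stand-in commands that succeed and fail, respectively.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit('schema mismatch')"]

print(run_validator(ok_cmd))   # True
print(run_validator(bad_cmd))  # prints the stderr message, then False
```

Passing the command as a list means each element becomes exactly one argument, so an interpreter path and a script path must be separate elements rather than one space‑joined string.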

In [None]:
RUN_CLI_VALIDATOR = False  # set True to enable
if RUN_CLI_VALIDATOR:
    TMP_CSV = NB_OUT / f"{SUBMISSION_NAME}_submission.csv"
    SUB_DF.to_csv(TMP_CSV, index=False)
    print('Wrote candidate CSV:', TMP_CSV)
    try:
        import shlex  # CLI may be "python path/to/spectramind.py", so split it into argv tokens
        cmd = shlex.split(CLI) + ['submit', 'validate', f'submission={TMP_CSV}']
        print('Running:', ' '.join(cmd))
        subprocess.run(cmd, check=True)
    except Exception as e:
        print('Validator failed (non‑blocking):', e)
else:
    print('CLI validation disabled. Set RUN_CLI_VALIDATOR=True to run your repo validator.')

## 4) Write `submission.csv` and `submission.zip`
On Kaggle kernels, we place these under `/kaggle/working/` so the “Submit to Competition” button can find them.
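
Before uploading, it can be worth confirming that the archive round‑trips. The sketch below builds a toy bundle in a temporary directory and reads it back; `check_zip` is a hypothetical helper, and the single‑member layout is an assumption based on how this notebook packages the file.

```python
import io
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

def check_zip(zip_path: Path, member: str = "submission.csv") -> pd.DataFrame:
    """Verify the archive and return the CSV stored inside it (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        if zf.testzip() is not None:
            raise ValueError("Corrupt member in archive")
        if zf.namelist() != [member]:
            raise ValueError(f"Expected only {member!r}, found {zf.namelist()}")
        return pd.read_csv(io.BytesIO(zf.read(member)))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "submission.csv"
    zip_path = Path(tmp) / "submission.zip"
    original = pd.DataFrame({"Id": [1, 2], "mu_000": [0.25, 0.5]})
    original.to_csv(csv_path, index=False)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname="submission.csv")
    restored = check_zip(zip_path)

print(restored.equals(original))  # True
```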

In [None]:
SUBMISSION_CSV = WORK_DIR / 'submission.csv'
SUBMISSION_ZIP = WORK_DIR / 'submission.zip'

SUB_DF.to_csv(SUBMISSION_CSV, index=False)
print('Saved:', SUBMISSION_CSV, f'({SUBMISSION_CSV.stat().st_size/1024:.1f} KiB)')

with zipfile.ZipFile(SUBMISSION_ZIP, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write(SUBMISSION_CSV, arcname='submission.csv')
print('Saved:', SUBMISSION_ZIP, f'({SUBMISSION_ZIP.stat().st_size/1024:.1f} KiB)')

# Also copy artifacts into NB_OUT for versioning outside Kaggle env
if IS_KAGGLE:
    shutil.copy2(SUBMISSION_CSV, NB_OUT/SUBMISSION_CSV.name)
    shutil.copy2(SUBMISSION_ZIP, NB_OUT/SUBMISSION_ZIP.name)
    print('Copied artifacts to NB_OUT for archiving.')

## 5) Submission manifest & README
We store a tiny manifest + README that explains how this bundle was produced (config‑as‑data).
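
If you want the manifest to prove which bytes were submitted, one option is to record a checksum per artifact. This is a sketch of an assumed extension, not something the notebook's manifest contains; `sha256_of` and the `entry` fields are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "submission.csv"
    artifact.write_bytes(b"Id,mu_000\n1,0.25\n")  # toy content
    entry = {
        "file": artifact.name,
        "bytes": artifact.stat().st_size,
        "sha256": sha256_of(artifact),  # assumed extra field for provenance
    }

print(json.dumps(entry, indent=2))
```

Reading in chunks keeps memory use flat for large files, and the digest lets a reviewer confirm later that an archived CSV matches the one that was uploaded.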

In [None]:
MANIFEST = {
    'submission_name' : SUBMISSION_NAME,
    'created_utc'     : datetime.utcnow().isoformat(timespec='seconds') + 'Z',
    'kaggle_env'      : IS_KAGGLE,
    'source_pred_file': str(PRED_FILE.relative_to(ROOT)) if PRED_FILE.exists() else None,
    'output_csv'      : str(SUBMISSION_CSV),
    'output_zip'      : str(SUBMISSION_ZIP)
}
(NB_OUT/'submission_manifest.json').write_text(json.dumps(MANIFEST, indent=2))

README_TXT = f"""
# SpectraMind V50 — Kaggle Submission Bundle

Submission: {SUBMISSION_NAME}

Artifacts:
- submission.csv  → main file
- submission.zip  → zipped submission.csv
- submission_manifest.json → provenance info

Notes:
- Generated offline using the SpectraMind V50 CLI artifacts under outputs/.
- If running on Kaggle, the files also reside in /kaggle/working/ for the 'Submit' button.
- To re‑generate predictions, use the 04_predict_v50_demo notebook or the CLI predict command.
"""
(NB_OUT/'README_submission.txt').write_text(README_TXT.strip())
print('Wrote manifest and README to', NB_OUT)

### (Optional) DVC add
If your repository tracks notebook outputs with DVC, register the artifacts below (non‑blocking if DVC is absent).

In [None]:
if shutil.which('dvc'):
    try:
        subprocess.run(['dvc','add', str(NB_OUT)], check=False)
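        # dvc add writes the .dvc file and updates .gitignore next to the tracked directory
        subprocess.run(['git','add', f'{NB_OUT}.dvc', str(NB_OUT.parent/'.gitignore')], check=False)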
        subprocess.run(['dvc','status'], check=False)
        print('DVC add done (non‑blocking).')
    except Exception as e:
        print('DVC step failed (non‑blocking):', e)
else:
    print('DVC not found; skipping.')

---
## 6) (Optional) Local Kaggle CLI submit
For **local** runs (not Kaggle kernels), you can submit with the Kaggle CLI if you’ve configured API credentials (`~/.kaggle/kaggle.json`). This is **disabled** by default and **not needed** inside Kaggle kernels.

**Warning:** Do **not** execute this on kernels; use the UI’s _Submit to Competition_ button.


In [None]:
RUN_LOCAL_KAGGLE_SUBMIT = False  # set True for local-only
if RUN_LOCAL_KAGGLE_SUBMIT and not IS_KAGGLE:
    try:
        # Choose CSV or ZIP depending on competition rules
        bundle = SUBMISSION_ZIP if SUBMISSION_ZIP.exists() else SUBMISSION_CSV
        cmd = [
            'kaggle','competitions','submit','-c', COMPETITION_SLUG,
            '-f', str(bundle), '-m', SUBMISSION_NAME
        ]
        print('Running:', ' '.join(cmd))
        subprocess.run(cmd, check=True)
    except Exception as e:
        print('Local Kaggle submit failed (non‑blocking):', e)
else:
    print('Local Kaggle submit disabled or running inside Kaggle; skipping.')