
# SpectraMind V50 — Submission Checker (Ariel 2025)

This notebook validates a **Kaggle submission** for the NeurIPS 2025 Ariel Data Challenge.

It performs:
- **Schema checks**: required columns, count (283 μ, 283 σ), dtypes, missing/NaN, order (optional).
- **Value checks**: finiteness, σ ≥ 0, configurable bounds on μ and σ, outlier and NaN counts.
- **File checks**: duplicate and null ids, row and column counts.
- **Visual QA**: random spectrum plots with μ±σ, histograms and percentiles.
- **Report**: machine-readable summary (CSV/JSON) with pass/fail and metrics.

> You can set a path to `submission.csv` or point to a `.zip` containing it. Defaults to `artifacts/submission.csv`.


## 0) Parameters

In [None]:

from pathlib import Path
import os

# Path to CSV or ZIP
SUBMISSION_PATH = Path('artifacts/submission.csv')  # change if needed

# Expected schema
N_BINS = 283
ID_COL = 'id'
MU_PREFIX = 'mu_'
SIGMA_PREFIX = 'sigma_'

# Soft bounds (challenge-specific; keep permissive by default)
MU_ABS_MAX = 1.0           # transit depth magnitude (fraction). Set None to disable.
SIGMA_ABS_MAX = 1.0        # Set None to disable.

# Random seed for plots
SEED = 42

SUBMISSION_PATH


## 1) Load submission (CSV or ZIP)

In [None]:

import pandas as pd, numpy as np, zipfile, io, json, math, time
from pathlib import Path

def load_submission(path: Path) -> pd.DataFrame:
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f'Not found: {path}')
    if path.suffix.lower() == '.zip':
        with zipfile.ZipFile(path, 'r') as zf:
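            # Prefer a file named submission.csv; otherwise fall back to the first CSV found
            names = zf.namelist()
            csvs = [nm for nm in names if nm.lower().endswith('.csv')]
            preferred = [nm for nm in csvs if Path(nm).name.lower() == 'submission.csv']
            cand = (preferred or csvs or [None])[0]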
            if cand is None:
                raise ValueError(f'No CSV file found inside zip ({len(names)} entries).')
            with zf.open(cand) as f:
                return pd.read_csv(f)
    else:
        return pd.read_csv(path)

df = load_submission(SUBMISSION_PATH)
print('Loaded:', df.shape)
display(df.head())


## 2) Build expected column list

In [None]:

def expected_columns(n_bins=283, id_col='id', mu_prefix='mu_', sigma_prefix='sigma_'):
    mu_cols = [f"{mu_prefix}{i:03d}" for i in range(n_bins)]
    sg_cols = [f"{sigma_prefix}{i:03d}" for i in range(n_bins)]
    return [id_col] + mu_cols + sg_cols

exp_cols = expected_columns(N_BINS, ID_COL, MU_PREFIX, SIGMA_PREFIX)
print('Expected column count:', len(exp_cols))
print('First 6:', exp_cols[:6], '...', 'Last 6:', exp_cols[-6:])
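
To smoke-test the checker without real model output, a synthetic frame with the expected schema can be built and written to `artifacts/submission.csv`. The sketch below is only an illustration: `make_dummy_submission`, the integer ids, and the value ranges are assumptions, not challenge data.

```python
import numpy as np
import pandas as pd

def make_dummy_submission(n_rows=4, n_bins=283, seed=0):
    """Build a synthetic submission frame (hypothetical helper, illustrative values only)."""
    rng = np.random.default_rng(seed)
    mu_cols = [f"mu_{i:03d}" for i in range(n_bins)]
    sg_cols = [f"sigma_{i:03d}" for i in range(n_bins)]
    # Arbitrary ranges chosen only so that the checks pass; not physically motivated
    mu = rng.uniform(0.0, 0.05, size=(n_rows, n_bins))
    sigma = rng.uniform(1e-5, 1e-3, size=(n_rows, n_bins))
    parts = [
        pd.DataFrame({"id": np.arange(n_rows)}),
        pd.DataFrame(mu, columns=mu_cols),
        pd.DataFrame(sigma, columns=sg_cols),
    ]
    return pd.concat(parts, axis=1)

dummy = make_dummy_submission()
print(dummy.shape)  # 1 id column + 283 mu + 283 sigma columns
```

Saving it with `dummy.to_csv(path, index=False)` gives a file that should pass every check in this notebook, which makes deliberate corruptions (a negative σ, a dropped column) easy to try.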


## 3) Schema validation

In [None]:

schema_report = {}

# Column presence (set equality, order optional)
have_cols = list(df.columns)
missing = [c for c in exp_cols if c not in df.columns]
extra = [c for c in df.columns if c not in exp_cols]

schema_report['missing_columns'] = missing
schema_report['extra_columns'] = extra
schema_report['has_all_required'] = len(missing) == 0
schema_report['column_count'] = len(df.columns)
schema_report['row_count'] = len(df)

# Optional: warn if order differs
schema_report['order_matches'] = (have_cols == exp_cols)

# dtypes check for numeric columns
num_cols = [c for c in exp_cols if c != ID_COL]
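# Skip columns that are absent so a missing column is reported instead of raising KeyError
non_numeric = [c for c in num_cols if c in df.columns and not pd.api.types.is_numeric_dtype(df[c])]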
schema_report['non_numeric_value_columns'] = non_numeric

# id uniqueness/non-null
id_null = int(df[ID_COL].isna().sum()) if ID_COL in df.columns else -1
id_dups = int(df[ID_COL].duplicated().sum()) if ID_COL in df.columns else -1
schema_report['id_null_count'] = id_null
schema_report['id_duplicate_count'] = id_dups

schema_report
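
If `order_matches` is false while no columns are missing, the frame can be put back into the expected order before re-saving. This is a minimal sketch under the assumption that column order matters to the consumer of the file; `reorder_to_expected` is a hypothetical helper, and note that it also drops any extra columns.

```python
import pandas as pd

def reorder_to_expected(frame, expected):
    """Return frame with columns in the expected order (hypothetical helper).

    Raises KeyError if any expected column is missing; extra columns are dropped.
    """
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        raise KeyError(f"Cannot reorder, missing columns: {missing[:5]}")
    return frame.loc[:, list(expected)]

# Toy frame with shuffled columns and one extra column
toy = pd.DataFrame({"sigma_000": [0.1], "id": [7], "extra": [1], "mu_000": [0.02]})
fixed = reorder_to_expected(toy, ["id", "mu_000", "sigma_000"])
print(list(fixed.columns))
```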


## 4) Value checks (NaN, finite, bounds, σ≥0)

In [None]:

val_report = {}
issues = []

# NaN / infinite
# Use the expected names (in bin order) when the schema is complete;
# otherwise fall back to any columns that carry the prefixes.
if schema_report.get('has_all_required', False):
    mu_cols = [c for c in exp_cols if c.startswith(MU_PREFIX)]
    sg_cols = [c for c in exp_cols if c.startswith(SIGMA_PREFIX)]
else:
    mu_cols = [c for c in df.columns if c.startswith(MU_PREFIX)]
    sg_cols = [c for c in df.columns if c.startswith(SIGMA_PREFIX)]

def count_nonfinite(series):
    s = series.to_numpy()
    return int(np.sum(~np.isfinite(s)))

val_report['mu_nan_count'] = int(df[mu_cols].isna().sum().sum()) if mu_cols else -1
val_report['sigma_nan_count'] = int(df[sg_cols].isna().sum().sum()) if sg_cols else -1
val_report['mu_nonfinite_count'] = sum(count_nonfinite(df[c]) for c in mu_cols) if mu_cols else -1
val_report['sigma_nonfinite_count'] = sum(count_nonfinite(df[c]) for c in sg_cols) if sg_cols else -1

# Sigma >= 0
if sg_cols:
    sigma_neg = int((df[sg_cols] < 0).sum().sum())
else:
    sigma_neg = -1
val_report['sigma_negative_count'] = sigma_neg
if sigma_neg > 0:
    issues.append(f"Found {sigma_neg} negative sigma values.")

# Soft bounds
def count_out_of_bounds(frame, abs_max):
    if abs_max is None: return 0
    arr = np.abs(frame.to_numpy())
    return int((arr > abs_max).sum())

val_report['mu_out_of_bounds'] = count_out_of_bounds(df[mu_cols], MU_ABS_MAX) if mu_cols else -1
val_report['sigma_out_of_bounds'] = count_out_of_bounds(df[sg_cols], SIGMA_ABS_MAX) if sg_cols else -1

val_report['issues'] = issues
val_report


## 5) Aggregates & percentiles

In [None]:

agg = {}
if 'mu_cols' in locals() and mu_cols and sg_cols:
    mu_vals = df[mu_cols].to_numpy().ravel()
    sg_vals = df[sg_cols].to_numpy().ravel()
    for name, arr in [('mu', mu_vals), ('sigma', sg_vals)]:
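        clean = arr[np.isfinite(arr)]
        if clean.size == 0:
            continue  # no finite values to summarize; np.quantile would raise on an empty array
        q = np.quantile(clean, [0, .001, .01, .05, .5, .95, .99, .999, 1.0])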
        agg[f'{name}_min'] = float(q[0])
        agg[f'{name}_p001'] = float(q[1])
        agg[f'{name}_p01'] = float(q[2])
        agg[f'{name}_p05'] = float(q[3])
        agg[f'{name}_median'] = float(q[4])
        agg[f'{name}_p95'] = float(q[5])
        agg[f'{name}_p99'] = float(q[6])
        agg[f'{name}_p999'] = float(q[7])
        agg[f'{name}_max'] = float(q[8])
agg


## 6) Visual QA

In [None]:

import matplotlib.pyplot as plt
import numpy as np
import random

random.seed(SEED)
np.random.seed(SEED)

def plot_random_spectra(df, n=3):
    if not (mu_cols and sg_cols):
        print('Missing expected mu/sigma columns; skip plots.')
        return
    idxs = random.sample(range(len(df)), min(n, len(df)))
    xs = np.arange(len(mu_cols))
    for i in idxs:
        row = df.iloc[i]
        mu = row[mu_cols].to_numpy(dtype=float)
        sg = row[sg_cols].to_numpy(dtype=float)
        plt.figure(figsize=(9,3))
        plt.plot(xs, mu, label='mu')
        plt.fill_between(xs, mu - sg, mu + sg, alpha=0.2, label='mu ± sigma')
        plt.title(f'id={row[ID_COL]} (row {i})')
        plt.xlabel('bin')
        plt.ylabel('transit depth')
        plt.legend()
        plt.tight_layout()
        plt.show()

def plot_hist(arr, title, bins=100, xlim=None):
    plt.figure(figsize=(6,3))
    plt.hist(arr[np.isfinite(arr)], bins=bins, alpha=0.8)
    if xlim is not None:
        plt.xlim(*xlim)
    plt.title(title)
    plt.tight_layout()
    plt.show()

if mu_cols and sg_cols:
    plot_random_spectra(df, n=3)
    mu_vals = df[mu_cols].to_numpy().ravel()
    sg_vals = df[sg_cols].to_numpy().ravel()
    plot_hist(mu_vals, 'Histogram: mu (all bins)',
              xlim=(-MU_ABS_MAX, MU_ABS_MAX) if MU_ABS_MAX else None)
    plot_hist(sg_vals, 'Histogram: sigma (all bins)',
              xlim=(0, SIGMA_ABS_MAX) if SIGMA_ABS_MAX else None)
else:
    print('Skipping plots: mu/sigma columns not found.')


## 7) Pass/Fail & report export

In [None]:

from datetime import datetime, timezone
import json

REPORT_DIR = Path('artifacts/submission_checks')
REPORT_DIR.mkdir(parents=True, exist_ok=True)

def make_pass_fail(schema_report, val_report):
    ok = True
    msgs = []
    # Required schema
    if not schema_report.get('has_all_required', False):
        ok = False; msgs.append('Missing required columns.')
    if schema_report.get('id_null_count', 0) > 0:
        ok = False; msgs.append('Null IDs present.')
    if schema_report.get('id_duplicate_count', 0) > 0:
        ok = False; msgs.append('Duplicate IDs present.')
    if schema_report.get('non_numeric_value_columns'):
        ok = False; msgs.append('Non-numeric value columns detected.')
    # Values: non-finite entries (NaN or inf) and negative sigma are hard failures
    if val_report.get('mu_nonfinite_count', 0) > 0 or val_report.get('sigma_nonfinite_count', 0) > 0:
        ok = False; msgs.append('Non-finite (NaN/inf) values present in mu or sigma.')
    if val_report.get('sigma_negative_count', 0) > 0:
        ok = False; msgs.append('Found negative sigma values.')
    # Optional soft bounds (warnings only; they do not change pass/fail)
    if MU_ABS_MAX is not None and val_report.get('mu_out_of_bounds', 0) > 0:
        msgs.append('mu values exceed soft abs bound.')
    if SIGMA_ABS_MAX is not None and val_report.get('sigma_out_of_bounds', 0) > 0:
        msgs.append('sigma values exceed soft abs bound.')
    return ok, msgs

ok, messages = make_pass_fail(schema_report, val_report)
summary = {
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp instead
    'timestamp': datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z'),
    'submission_path': str(SUBMISSION_PATH),
    'row_count': schema_report.get('row_count'),
    'column_count': schema_report.get('column_count'),
    'pass': ok,
    'messages': messages,
    'schema_report': schema_report,
    'value_report': val_report,
    'aggregates': agg
}

# Save JSON and CSV-friendly flat summary
json_path = REPORT_DIR / 'submission_check.json'
with open(json_path, 'w') as f:
    json.dump(summary, f, indent=2)

flat = {
    'timestamp': summary['timestamp'],
    'path': summary['submission_path'],
    'rows': summary['row_count'],
    'cols': summary['column_count'],
    'pass': summary['pass'],
    'missing_cols': len(schema_report.get('missing_columns', [])),
    'extra_cols': len(schema_report.get('extra_columns', [])),
    'id_null': schema_report.get('id_null_count', -1),
    'id_dups': schema_report.get('id_duplicate_count', -1),
    'mu_nan': val_report.get('mu_nan_count', -1),
    'sigma_nan': val_report.get('sigma_nan_count', -1),
    'mu_nonfinite': val_report.get('mu_nonfinite_count', -1),
    'sigma_nonfinite': val_report.get('sigma_nonfinite_count', -1),
    'sigma_negative': val_report.get('sigma_negative_count', -1),
    'mu_oob': val_report.get('mu_out_of_bounds', -1),
    'sigma_oob': val_report.get('sigma_out_of_bounds', -1),
}
pd.DataFrame([flat]).to_csv(REPORT_DIR / 'submission_check.csv', index=False)

print('PASS' if ok else 'FAIL', '|', '; '.join(messages) if messages else 'OK')
print('Saved report to:', json_path, 'and CSV twin in same folder.')
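
The JSON report can drive an automated gate. The sketch below reads the `pass` and `messages` fields written above and turns them into a process exit code; `exit_code_for` and `gate_exit_code` are hypothetical helper names, and how the exit code is consumed depends on the CI system in use.

```python
import json
import sys
from pathlib import Path

def exit_code_for(report):
    """Map a parsed report dict to an exit code: 0 on pass, 1 otherwise (hypothetical helper)."""
    if report.get("pass") is True:
        return 0
    for msg in report.get("messages", []):
        print("check failed:", msg, file=sys.stderr)
    return 1

def gate_exit_code(report_path):
    """Load the JSON report from disk and compute the exit code (hypothetical helper)."""
    report = json.loads(Path(report_path).read_text())
    return exit_code_for(report)

# In a CI step, after the notebook has run:
#     sys.exit(gate_exit_code("artifacts/submission_checks/submission_check.json"))
```

Keying on `pass` rather than on `messages` matters because soft-bound warnings populate `messages` without failing the submission.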



---

### Appendix & Tips

- **Expected columns**: `id`, followed by `mu_000..mu_282` and `sigma_000..sigma_282` (total 1+283+283 = 567 columns).
- **σ must be non-negative** and finite; **μ** finite. You can relax/enforce soft bounds via `MU_ABS_MAX`/`SIGMA_ABS_MAX`.
- If you package as `submission.zip`, ensure it **contains submission.csv at top level**.
- Keep IDs unique; Kaggle evaluators may join on `id` and expect no duplicates.
- For speed, replace pandas with polars if desired. For very large files, you can chunk-read and validate iteratively.
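- Integrate this notebook in CI to gate releases: fail the job when `pass` is `false` in `submission_check.json`. A non-empty `messages` list alone is not a failure, since soft-bound warnings also land there.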
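
The chunk-reading tip can be sketched with the `chunksize` argument of `pandas.read_csv`, which yields the file as a sequence of smaller frames. The helper name `validate_in_chunks` and the choice of counters are assumptions for illustration; the sketch also assumes the value columns parse as numbers, and it keeps every id in memory to detect duplicates.

```python
import io
import numpy as np
import pandas as pd

def validate_in_chunks(source, chunksize=1000, id_col="id",
                       mu_prefix="mu_", sigma_prefix="sigma_"):
    """Accumulate simple validation counters chunk by chunk (hypothetical helper)."""
    totals = {"rows": 0, "nonfinite": 0, "sigma_negative": 0, "id_duplicates": 0}
    seen_ids = set()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        mu_cols = [c for c in chunk.columns if c.startswith(mu_prefix)]
        sg_cols = [c for c in chunk.columns if c.startswith(sigma_prefix)]
        mu = chunk[mu_cols].to_numpy(dtype=float)
        sg = chunk[sg_cols].to_numpy(dtype=float)
        totals["rows"] += len(chunk)
        totals["nonfinite"] += int((~np.isfinite(mu)).sum() + (~np.isfinite(sg)).sum())
        totals["sigma_negative"] += int((sg < 0).sum())
        # Track ids across chunks so duplicates spanning chunk boundaries are caught
        for value in chunk[id_col]:
            if value in seen_ids:
                totals["id_duplicates"] += 1
            seen_ids.add(value)
    return totals

# Tiny in-memory example: one negative sigma, one missing mu, one repeated id
csv_text = "id,mu_000,sigma_000\n1,0.01,0.001\n2,0.02,-0.5\n2,,0.002\n"
result = validate_in_chunks(io.StringIO(csv_text), chunksize=2)
print(result)
```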

**Common pitfalls**
- Off-by-one in bin indexing (ensure zero-based, 3-digit padded).
- Accidentally swapping μ and σ column groups.
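- Writing values as non-numeric strings (for example with units or brackets), so pandas loads the column as `object` instead of a numeric dtype; the schema check flags these columns. Plain scientific notation such as `1e-4` parses as numeric.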
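
The indexing pitfall can be caught from the header alone. Below is a small sketch that checks each value column for a zero-based, 3-digit suffix and lists the indices that never appear; `find_malformed_columns` is a hypothetical helper and uses only the standard library.

```python
import re

def find_malformed_columns(columns, n_bins=283, prefixes=("mu_", "sigma_")):
    """Report value columns whose suffix is not a zero-based, 3-digit index (hypothetical helper)."""
    bad_format, out_of_range = [], []
    seen = {p: set() for p in prefixes}
    for col in columns:
        prefix = next((p for p in prefixes if col.startswith(p)), None)
        if prefix is None:
            continue  # id or unrelated column
        suffix = col[len(prefix):]
        if not re.fullmatch(r"\d{3}", suffix):
            bad_format.append(col)       # e.g. "mu_1" or "mu_0001"
        elif int(suffix) >= n_bins:
            out_of_range.append(col)     # e.g. "mu_283" when n_bins is 283
        else:
            seen[prefix].add(int(suffix))
    missing = {p: sorted(set(range(n_bins)) - seen[p]) for p in prefixes}
    return {"bad_format": bad_format, "out_of_range": out_of_range, "missing": missing}

# One-based names: unpadded for mu, padded for sigma
cols = ["id"] + [f"mu_{i}" for i in range(1, 4)] + [f"sigma_{i:03d}" for i in range(1, 4)]
report = find_malformed_columns(cols, n_bins=3)
print(report)
```

A one-based export shows up as index `0` in `missing` together with an entry in `out_of_range`, which distinguishes it from a simple padding mistake.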

