# 06: Symbolic Overlays (Neuro-Symbolic Rules for Spectra)

This notebook shows how SpectraMind V50 overlays **symbolic, physics-aware rules** on top of data-driven predictions to improve scientific trustworthiness. We:

1) run or load predictions via the CLI/Hydra pipeline,
2) compute rule violations (non-negativity, physical upper bound, smoothness, and band co-occurrence) with tunable weights,
3) visualize per-wavelength penalties and a heatmap,
4) export a JSON/CSV diagnostics bundle under `outputs/notebooks/06_symbolic_overlays/`.

Symbolic regularization can also be added to training as:

$$L_{\text{total}} = L_{\text{data}} + \sum_i \lambda_i\,L_{\text{symbolic},i}$$

(see the V50 plan).

_Refs: V50 technical plan (symbolic loss overlays and composite loss), radiative/physical constraints, and pipeline integration recommendations._

**Repro tips**: This notebook only *reads* artifacts written by the CLI; configs and seeds live in Hydra YAML. To regenerate artifacts, run the CLI cell below.

### Why symbolic overlays?
* Encourage **physically plausible** spectra (e.g., non-negative transit depth; no values above stellar baseline).
* Promote **spectral smoothness** consistent with instrument resolution.
* Encode **domain relations** (e.g., if “H₂O” is strong in one band, related bands shouldn’t be contradictory).

These ideas follow the V50 design, where neural predictions are constrained and guided by physics and logic overlays (neuro-symbolic) to improve generalization and credibility.

In [None]:
%%bash
# (Optional) Regenerate predictions for a small sample using the SpectraMind CLI.
# Requires the environment to be set up per repo README (Typer CLI + Hydra configs).
# This uses a fast path (1 planet) to create a fresh artifact we can overlay.
# If you already have predictions in outputs/predictions/, you can skip this cell.

set -euo pipefail
ROOT="${ROOT:-$(git rev-parse --show-toplevel 2>/dev/null || pwd)}"
cd "$ROOT"

spectramind predict \
  data.sample=1 \
  +inference.save=true \
  +inference.tag=symbolic_demo \
  hydra.run.dir=outputs/runs/demo_symbolic_overlay \
  -q

echo
echo "Artifacts written under outputs/ (predictions CSVs, logs)."

### Configure symbolic rules and weights
We define four penalties:

- **Non-negativity**: penalize any $y(\lambda) < 0$.
- **Upper bound**: penalize $y(\lambda) > y_{\max}$ (e.g., stellar baseline or domain cap).
- **Smoothness (TV)**: total variation $\sum_k |y(\lambda_k) - y(\lambda_{k-1})|$ over adjacent wavelength bins, to suppress implausible jaggedness beyond instrument resolution.
- **Band co-occurrence**: penalize disagreement between the mean depths of related molecule bands (here, simple water bands around ~1.35–1.45 µm and ~1.8–2.0 µm as placeholders; adjust to your wavelength grid).

These map to symbolic loss terms $L_{\text{symbolic},i}$ used at train time or as post-hoc overlays for diagnostics.

In [None]:
from __future__ import annotations
from pathlib import Path
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 120

# Notebook output root
NB_OUT = Path("outputs/notebooks/06_symbolic_overlays").resolve()
NB_OUT.mkdir(parents=True, exist_ok=True)

# Symbolic weights (adjust as needed)
LAMBDA = {
    "nonneg": 1.0,
    "upper": 1.0,
    "smooth_tv": 0.2,
    "band_consistency": 0.5,
}

# Physical caps (simple illustrative defaults)
Y_MAX = 0.02   # e.g., 2% transit depth cap for this demo dataset

# Example molecule bands (half-open index ranges [start, stop) on the wavelength axis)
BANDS = {
    "H2O_1": (120, 150),  # demo indices for ~1.35–1.45 µm region on the challenge grid (adjust)
    "H2O_2": (180, 220),  # demo indices for ~1.8–2.0 µm
}
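The index ranges in `BANDS` are placeholders. If your prediction file carries a wavelength column such as `wavelength_um`, the ranges could be derived from wavelength limits rather than typed by hand. The following is a minimal sketch under that assumption; `band_to_index_range` and the toy grid are hypothetical names used for illustration and are not part of the SpectraMind codebase. It also assumes the grid is sorted in ascending order.

```python
import numpy as np

def band_to_index_range(wl_um, lo_um, hi_um):
    """Return the half-open index range [a, b) with lo_um <= wl_um[a:b] <= hi_um.

    Assumes wl_um is sorted ascending (an assumption; verify on your grid).
    """
    wl_um = np.asarray(wl_um, dtype=float)
    a = int(np.searchsorted(wl_um, lo_um, side="left"))
    b = int(np.searchsorted(wl_um, hi_um, side="right"))
    return a, b

# Toy grid: 11 points from 1.0 to 2.0 µm (illustrative only, not the challenge grid).
toy_wl = np.linspace(1.0, 2.0, 11)
a, b = band_to_index_range(toy_wl, 1.35, 1.65)
print(a, b, toy_wl[a:b])
```

The returned pair uses the same half-open convention as the slices `y[a:b]` in the penalty code, so it can be dropped into `BANDS` directly.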

In [None]:
def find_latest_prediction_csv(root: Path = Path("outputs/predictions")) -> Path:
    """Return the most recently modified CSV under `root` (searched recursively)."""
    candidates = list(root.rglob("*.csv"))
    if not candidates:
        raise FileNotFoundError("No prediction CSVs found under outputs/predictions/. Run the CLI cell above or point to an existing file.")
    # Pick by modification time; a plain name sort would return the alphabetically last file.
    return max(candidates, key=lambda p: p.stat().st_mtime)

pred_csv = find_latest_prediction_csv()
pred_df = pd.read_csv(pred_csv)
pred_df.head(3)

In [None]:
# Expect columns: ['planet_id', 'wavelength_index', 'wavelength_um', 'y_pred'] (adapt as per your pipeline).
req_cols = {"planet_id", "wavelength_index", "y_pred"}
if not req_cols.issubset(set(pred_df.columns)):
    raise ValueError(f"Prediction file missing required columns {req_cols}. Columns present: {list(pred_df.columns)}")

# Select one planet for demo
pid = pred_df["planet_id"].iloc[0]
spec = pred_df[pred_df["planet_id"] == pid].sort_values("wavelength_index").reset_index(drop=True)
y = spec["y_pred"].to_numpy().astype(float)
wl_idx = spec["wavelength_index"].to_numpy().astype(int)
n = y.size

def penalty_nonneg(y):
    # sum of magnitudes below zero; mask marks the offending bins
    mask = y < 0.0
    return float(np.abs(y[mask]).sum()), mask

def penalty_upper(y, y_max):
    # sum of excess above the cap; mask marks the offending bins
    mask = y > y_max
    return float((y[mask] - y_max).sum()), mask

def penalty_smooth_tv(y):
    diffs = np.abs(np.diff(y))
    # mark a mask where large changes occur (top 10% by magnitude)
    thr = np.quantile(diffs, 0.90) if diffs.size else 0.0
    mask = np.zeros_like(y, dtype=bool)
    # require a non-zero jump so a perfectly flat spectrum is not flagged everywhere
    mask[1:][(diffs >= thr) & (diffs > 0)] = True
    return float(diffs.sum()), mask

def penalty_band_consistency(y, bands: dict[str, tuple[int, int]]):
    # simple: encourage the mean depth to be similar between bands
    # penalty = sum of pairwise |mean_i - mean_j|; mark entire band spans for visualization
    means = []
    band_masks = {}
    for k, (a, b) in bands.items():
        a, b = max(0, a), min(y.size, b)  # clip to this spectrum's length, not a global
        band_masks[k] = np.zeros_like(y, dtype=bool)
        band_masks[k][a:b] = True
        means.append(y[a:b].mean() if b > a else 0.0)
    pen = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            pen += abs(means[i] - means[j])
    # combined visualization mask: highlight all bands
    vis_mask = np.any(np.stack(list(band_masks.values())), axis=0) if bands else np.zeros_like(y, bool)
    return float(pen), vis_mask

p_nonneg, m_nonneg = penalty_nonneg(y)
p_upper,  m_upper  = penalty_upper(y, Y_MAX)
p_tv,     m_tv     = penalty_smooth_tv(y)
p_band,   m_band   = penalty_band_consistency(y, BANDS)

scores = {
    "nonneg": LAMBDA["nonneg"] * p_nonneg,
    "upper": LAMBDA["upper"] * p_upper,
    "smooth_tv": LAMBDA["smooth_tv"] * p_tv,
    "band_consistency": LAMBDA["band_consistency"] * p_band,
}
scores["total_overlay"] = sum(scores.values())
scores
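The penalty functions above return one scalar per rule plus a boolean mask. If you want per-wavelength penalty magnitudes (for example to plot them against wavelength index), the same hinge and total-variation terms can be kept as vectors before summing. This is a minimal sketch on a made-up four-point spectrum; `per_wavelength_penalties` is a hypothetical helper, and attributing each jump to the right-hand bin is one convention among several.

```python
import numpy as np

def per_wavelength_penalties(y, y_max):
    """Split the hinge and total-variation penalties into per-bin contributions."""
    y = np.asarray(y, dtype=float)
    nonneg = np.maximum(-y, 0.0)        # magnitude below zero, else 0
    upper = np.maximum(y - y_max, 0.0)  # excess above the cap, else 0
    tv = np.zeros_like(y)
    tv[1:] = np.abs(np.diff(y))         # jump attributed to the right-hand bin
    return {"nonneg": nonneg, "upper": upper, "smooth_tv": tv}

# Made-up spectrum: one negative bin and one bin above a 0.02 cap.
toy_y = np.array([0.010, -0.002, 0.025, 0.012])
parts = per_wavelength_penalties(toy_y, y_max=0.02)
totals = {name: float(vec.sum()) for name, vec in parts.items()}
print(totals)
```

Summing each vector gives the same unweighted scalars as `penalty_nonneg`, `penalty_upper`, and `penalty_smooth_tv`, so the two views stay consistent.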

In [None]:
def overlay_plot(y, masks: dict[str, np.ndarray], title: str):
    x = np.arange(y.size)
    plt.figure(figsize=(10, 3.2))
    plt.plot(x, y, lw=1.4, label='predicted spectrum')
    plt.axhline(0, color='k', lw=0.7, alpha=0.6)
    plt.axhline(Y_MAX, color='tab:gray', lw=0.7, ls='--', alpha=0.7, label='Y_MAX cap')
    # shaded masks
    for name, mask in masks.items():
        if mask.any():
            plt.fill_between(x, y.min(), y.max(), where=mask, alpha=0.15, label=f"{name} region")
    plt.title(title)
    plt.xlabel('wavelength index')
    plt.ylabel('transit depth (arb.)')
    plt.legend(loc='best', ncol=3, fontsize=8)
    plt.tight_layout()

masks = {
    'negatives': m_nonneg,
    'over_cap': m_upper,
    'TV_high': m_tv,
    'H2O_bands': m_band,
}
overlay_plot(y, masks, title=f"Planet {pid}: symbolic overlays (total={scores['total_overlay']:.3g})")

In [None]:
# Heatmap of penalty magnitudes per rule (min-max normalized)
vals = np.array([scores['nonneg'], scores['upper'], scores['smooth_tv'], scores['band_consistency']], dtype=float)
labels = ['nonneg', 'upper', 'smooth_tv', 'band_consistency']
# np.ptp(vals) instead of vals.ptp(): the ndarray method was removed in NumPy 2.0
norm = (vals - vals.min()) / (np.ptp(vals) + 1e-9)
plt.figure(figsize=(5, 1.8))
plt.imshow(norm[None, :], cmap='magma', aspect='auto')
plt.yticks([]); plt.xticks(range(len(labels)), labels, rotation=0)
plt.colorbar(label='normalized penalty')
plt.title('Symbolic penalty summary')
plt.tight_layout()
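Min-max scaling maps the smallest weighted penalty to 0 and the largest to 1, so a rule with a small but non-zero violation is drawn the same as a rule with none. One alternative, shown here only as a sketch, is to plot each rule's share of the summed weighted penalty. `penalty_shares` and the toy scores are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

def penalty_shares(weighted_scores):
    """Return each rule's fraction of the summed weighted penalty (all zeros if the sum is 0)."""
    vals = np.asarray(weighted_scores, dtype=float)
    total = vals.sum()
    if total <= 0.0:
        return np.zeros_like(vals)
    return vals / total

# Made-up weighted penalties in the order: nonneg, upper, smooth_tv, band_consistency.
toy_scores = [0.002, 0.005, 0.0104, 0.0026]
shares = penalty_shares(toy_scores)
print(np.round(shares, 3))
```

Shares sum to 1 whenever any rule is violated, which makes heatmaps comparable across planets. The trade-off is that the absolute size of the penalties is no longer visible.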

In [None]:
bundle = {
    "planet_id": str(pid),
    "weights": LAMBDA,
    "caps": {"Y_MAX": Y_MAX},
    "bands": BANDS,  # tuples are serialized as JSON lists
    "scores": scores,
    "source_prediction_csv": str(pred_csv),
}
(NB_OUT / "symbolic_overlay.json").write_text(json.dumps(bundle, indent=2))
# Parentheses are required for a method chain that continues on the next line.
(
    spec.assign(nonneg_mask=m_nonneg, overcap_mask=m_upper, tv_mask=m_tv, h2o_mask=m_band)
    .to_csv(NB_OUT / "symbolic_overlay_detail.csv", index=False)
)
print("Wrote:", NB_OUT / "symbolic_overlay.json")
print("Wrote:", NB_OUT / "symbolic_overlay_detail.csv")

### Add overlays during training (optional)
You can plug these penalties into the training loss (with Hydra config switches) so the model learns to avoid violations, using the composite objective $L_{\text{total}} = L_{\text{data}} + \sum_i \lambda_i\,L_{\text{symbolic},i}$. This aligns with the V50 plan’s approach of adding physics/logic penalties alongside the data likelihood.

**Practice tips**
* Start with a small $\lambda$ on smoothness so you don’t over-smooth real lines.
* Keep an eye on evaluation GLL and constraint heatmaps together. If the predicted uncertainties are too tight (overconfident), consider uncertainty calibration or ensembling.
* For band rules, replace the demo ranges with exact instrument grid bands when available (AIRS/FGS lines).
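To make the arithmetic of the composite objective concrete, here is a minimal NumPy sketch on a made-up four-point spectrum. It is not the V50 training code: mean squared error stands in for $L_{\text{data}}$ as an assumption, the band term is omitted for brevity, and `composite_loss` is a hypothetical name. In an actual training loop the same terms would be written with the framework's tensor operations so that gradients flow through them.

```python
import numpy as np

def composite_loss(y_pred, y_true, lambdas, y_max):
    """L_total = L_data + sum_i lambda_i * L_symbolic_i (MSE stands in for L_data)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    l_data = float(np.mean((y_pred - y_true) ** 2))
    symbolic = {
        "nonneg": float(np.maximum(-y_pred, 0.0).sum()),        # depth below zero
        "upper": float(np.maximum(y_pred - y_max, 0.0).sum()),  # depth above the cap
        "smooth_tv": float(np.abs(np.diff(y_pred)).sum()),      # total variation
    }
    l_sym = sum(lambdas[name] * value for name, value in symbolic.items())
    return l_data + l_sym, l_data, symbolic

# Made-up values for illustration only.
toy_pred = np.array([0.010, -0.002, 0.025, 0.012])
toy_true = np.array([0.010, 0.011, 0.012, 0.012])
toy_lambdas = {"nonneg": 1.0, "upper": 1.0, "smooth_tv": 0.2}
total, l_data, terms = composite_loss(toy_pred, toy_true, toy_lambdas, y_max=0.02)
print(f"L_data={l_data:.3g}  L_total={total:.3g}  terms={terms}")
```

With these toy numbers the symbolic terms dominate the data term by roughly two orders of magnitude. On real data, a gap like that suggests the weights should be rescaled before training.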

### Provenance & references
- Neuro-symbolic overlays & composite loss used in SpectraMind V50.
- Pipeline integration (CLI/Hydra; diagnostics heatmaps) and uncertainty calibration guidance.
- Physical constraints motivation (transit spectroscopy physics, plausibility of spectra).