# 🤝 SpectraMind V50 — Hugging Face Integration & Transfer Learning (Notebook 07)

**Goal.** Integrate **Hugging Face** models into the SpectraMind V50 pipeline and demonstrate **parameter‑efficient fine‑tuning (PEFT/LoRA)** in a CLI‑first, Hydra‑safe flow. This notebook follows the physics‑informed work in 06 and extends the pipeline with pretrained backbones and transfer learning.

**What you’ll do**
1. Pre‑flight & environment capture (CLI presence, run IDs)
2. HF environment check (Transformers / Accelerate / PEFT) and graceful fallbacks
3. Compose Hydra overrides to select a **HF model** (e.g., ViT/TimeSformer) and **LoRA** knobs
4. Train via `spectramind train` with HF + PEFT overrides
5. Diagnose & compare against the custom SSM+GNN baseline
6. Artifact tree, Mermaid sketch, and next steps

> The notebook **degrades gracefully**: if `spectramind` or HF libs are not installed, it runs **DRY‑RUN** and still produces configs/logs/placeholder artifacts.


In [None]:
# ░░ Pre-flight: env, run IDs, paths, CLI detection ░░
import os, sys, json, platform, shutil, subprocess, datetime, pathlib

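# Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
RUN_TS = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")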
RUN_ID = f"huggingface_transfer_{RUN_TS}"
ROOT_OUT = "/mnt/data/hf_transfer"
ARTIFACTS = os.path.join(ROOT_OUT, RUN_ID)
LOGS = os.path.join(ARTIFACTS, "logs")
CFG_OUT = os.path.join(ARTIFACTS, "configs")
DIAG_OUT = os.path.join(ARTIFACTS, "diagnostics")
for p in (ROOT_OUT, ARTIFACTS, LOGS, CFG_OUT, DIAG_OUT):
    os.makedirs(p, exist_ok=True)

def which(cmd: str) -> bool:
    return shutil.which(cmd) is not None

CLI_PRESENT = which("spectramind")
DRY_RUN = not CLI_PRESENT

def git_cmd(args):
    try:
        out = subprocess.check_output(["git", *args], stderr=subprocess.STDOUT, timeout=5).decode().strip()
        return out
    except Exception:
        return None

env = {
    "python": sys.version.replace("\n", " "),
    "platform": platform.platform(),
    "cli_present": CLI_PRESENT,
    "dry_run": DRY_RUN,
    "run_id": RUN_ID,
    "paths": {"artifacts": ARTIFACTS, "logs": LOGS, "configs": CFG_OUT, "diagnostics": DIAG_OUT},
    "git": {
        "commit": git_cmd(["rev-parse", "HEAD"]),
        "branch": git_cmd(["rev-parse", "--abbrev-ref", "HEAD"]),
        "status": git_cmd(["status", "--porcelain"]),
    },
}
with open(os.path.join(ARTIFACTS, "env.json"), "w") as f:
    json.dump(env, f, indent=2)

print("=== SpectraMind V50 — Notebook 07 ===")
print(json.dumps(env, indent=2))


## Hugging Face environment check (Transformers / Accelerate / PEFT)

We try importing `transformers`, `accelerate`, and `peft`. If any are missing, we **don’t install** them here (to keep the notebook reproducible/air‑gapped) but proceed in **DRY‑RUN**.
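For reproducibility it can also help to record which versions are present, not only whether the import works. The sketch below probes the three packages without importing them, using only the standard library. The helper name `probe_packages` is hypothetical and not part of SpectraMind.

```python
from importlib import metadata
from importlib.util import find_spec


def probe_packages(names):
    """Map each package name to its installed version, or None if it is absent.

    Nothing is imported, so a broken install cannot crash the notebook here.
    """
    report = {}
    for name in names:
        if find_spec(name) is None:
            report[name] = None  # not importable in this environment
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata (e.g., a vendored copy).
            report[name] = "unknown"
    return report


hf_report = probe_packages(["transformers", "accelerate", "peft"])
hf_ready = all(version is not None for version in hf_report.values())
print(hf_report, "| ready:", hf_ready)
```

A dictionary like `hf_report` could be written next to `env.json` so that each run records its library versions.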


In [None]:
missing = []
try:
    import transformers  # type: ignore
    tr_ok = True
except Exception:
    tr_ok = False
    missing.append("transformers")
try:
    import accelerate  # type: ignore
    acc_ok = True
except Exception:
    acc_ok = False
    missing.append("accelerate")
try:
    import peft  # type: ignore
    peft_ok = True
except Exception:
    peft_ok = False
    missing.append("peft")

print("HF libs — transformers:", tr_ok, "| accelerate:", acc_ok, "| peft:", peft_ok)
if missing:
    # Match the stated behaviour: without the HF stack, later CLI cells only log their commands.
    DRY_RUN = True
    print("[NOTE] Missing libs:", ", ".join(missing), "; continuing in DRY‑RUN.")


## Compose Hydra overrides: select HF backbone + LoRA

Common toggles:
- `model=hf_vit` or `model=hf_timesformer` (assumes your repo defines these options in its `model` config group)
- `peft.lora.enable=true`, with rank/alpha/dropout as tunables
- Batch/epochs kept conservative by default; use ablations later for sweeps
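To give a feel for what the rank and alpha knobs control, here is a small worked calculation. In standard LoRA, a frozen weight of shape `d_out × d_in` receives a low-rank update `B @ A` scaled by `alpha / r`, which adds `r * (d_in + d_out)` trainable parameters. The hidden width of 768, the 12 layers (as in ViT-Base), and the choice to adapt only the query and value projections are assumptions for illustration, not values taken from this repo's configs.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return alpha / r


HIDDEN = 768           # assumption: ViT-Base hidden width
LAYERS = 12            # assumption: ViT-Base depth
ADAPTED_PER_LAYER = 2  # assumption: query and value projections only
R, ALPHA = 16, 32      # the defaults used in this notebook

per_matrix = lora_param_count(HIDDEN, HIDDEN, R)
full_matrix = HIDDEN * HIDDEN
total = per_matrix * LAYERS * ADAPTED_PER_LAYER

print("LoRA params per adapted matrix:", per_matrix)       # 24576
print("Full params per matrix:", full_matrix)              # 589824
print("Fraction trained per matrix:", per_matrix / full_matrix)
print("Total LoRA params:", total)                         # 589824
print("Update scaling alpha/r:", lora_scaling(ALPHA, R))   # 2.0
```

Under these assumptions each adapted matrix trains about 4% of the parameters a full fine-tune would touch. Doubling the rank doubles that count and, with alpha held fixed, halves the scaling factor.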


In [None]:
import json, os

# Sensible defaults (adjust to your config structure)
overrides = {
    # Switch to an HF model group (example names; match to your repo’s configs/)
    "model": "hf_vit",                 # or "hf_timesformer"
    # Enable PEFT/LoRA adapters
    "peft.lora.enable": "true",
    "peft.lora.r": "16",
    "peft.lora.alpha": "32",
    "peft.lora.dropout": "0.05",
    # Data and training
    "data": "ariel_nominal",
    "training.max_epochs": "6",
    "training.batch_size": "16",
    "training.seed": "1337",
    # Optional: mixed precision / accelerate flags if your code supports these hydra keys
    "training.mixed_precision": "fp16",
}

cfg_file = os.path.join(CFG_OUT, "hf_peft_overrides.json")
with open(cfg_file, "w") as f:
    json.dump(overrides, f, indent=2)
print("Saved overrides ->", cfg_file)
print(json.dumps(overrides, indent=2))


## Helper: robust CLI runner (DRY‑RUN when CLI not present)

In [None]:
import shlex, time

def run_cli(cmd_list, log_name="run"):
    log_path = os.path.join(LOGS, f"{log_name}.log")
    err_path = os.path.join(LOGS, f"{log_name}.err")
    start = time.time()
    result = {"cmd": cmd_list, "dry_run": DRY_RUN, "returncode": 0, "stdout": "", "stderr": ""}
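    if DRY_RUN:
        msg = f"[DRY‑RUN] Would execute: {' '.join(shlex.quote(c) for c in cmd_list)}\n"
        result["stdout"] = msg
        # Explicit UTF-8: the message contains a non-ASCII hyphen.
        with open(log_path, "w", encoding="utf-8") as f:
            f.write(msg)
        with open(err_path, "w", encoding="utf-8") as f:
            f.write("")
        # Append to a placeholder so the artifact tree is non-empty in DRY-RUN.
        place = os.path.join(ARTIFACTS, "dry_run_placeholder.txt")
        with open(place, "a", encoding="utf-8") as f:
            f.write(msg)
        result["elapsed_sec"] = round(time.time() - start, 3)
        return result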

    with open(log_path, "wb") as out, open(err_path, "wb") as err:
        try:
            proc = subprocess.Popen(cmd_list, stdout=out, stderr=err, env=os.environ.copy())
            proc.wait()
            result["returncode"] = proc.returncode
        except Exception as e:
            result["returncode"] = 99
            with open(err_path, "ab") as errf:
                errf.write(str(e).encode())

    # Read the logs back with context managers; tolerate undecodable subprocess output.
    for key, path in (("stdout", log_path), ("stderr", err_path)):
        try:
            with open(path, "r", errors="replace") as fh:
                result[key] = fh.read()
        except OSError:
            pass
    result["elapsed_sec"] = round(time.time() - start, 3)
    print(f"[rc={result['returncode']}] logs: {log_path}")
    return result


## Train with Hugging Face + PEFT/LoRA (CLI‑first)

We pass the overrides to `spectramind train` and let Hydra compose the full config. This keeps the run CLI‑first: every setting that differs from the defaults appears on the command line, so the logged command documents the run.
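Hydra treats override prefixes differently: plain `key=value` overrides a key or config group that already exists, `+key=value` adds a key that is absent and fails if it is already present, and `++key=value` adds or overrides a config value. The sketch below builds the argument list with a per-key prefix. The helper `to_hydra_args` is hypothetical, and which keys count as new depends on your `config_v50.yaml`.

```python
def to_hydra_args(overrides, new_keys=frozenset()):
    """Turn a flat {key: value} dict into Hydra override strings.

    Keys listed in new_keys get a "+" prefix (add); all others are plain overrides.
    """
    args = []
    for key, value in overrides.items():
        prefix = "+" if key in new_keys else ""
        args.append(f"{prefix}{key}={value}")
    return args


example = {"model": "hf_vit", "peft.lora.r": "16", "training.max_epochs": "6"}
# Assumption for illustration: only the LoRA rank is missing from the base config.
args = to_hydra_args(example, new_keys={"peft.lora.r"})
print(args)
# → ['model=hf_vit', '+peft.lora.r=16', 'training.max_epochs=6']
```

Separating "add" from "override" per key avoids the composition error Hydra raises when `+` is applied to a key that the base config already defines.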


In [None]:
cmd = [
    "spectramind", "train",
    "--config-name", "config_v50.yaml",
    "+outputs.root_dir=" + ARTIFACTS,
]
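# Hydra prefix semantics: "+key=value" adds a key that is absent from the composed config
# and errors if the key already exists; plain "key=value" overrides an existing key or group.
# Drop the "+" for any key your base config already defines (e.g., "model" if it is in the defaults list).
for k, v in overrides.items():
    cmd.append(f"+{k}={v}")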
# Optional fast mode, if supported by your code:
cmd += ["+training.fast_mode=true"]

res_train = run_cli(cmd, log_name="01_train_hf_peft")
print(res_train["stdout"][:500])
if res_train["returncode"] not in (0, None):
    print("Training non-zero return code:", res_train["returncode"])


## Diagnostics & comparison vs custom SSM+GNN

We run diagnostics and (optionally) a comparison routine (e.g., validation metrics table or overlay plots) to quantify transfer benefits.
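One simple way to quantify the difference is the relative change of each metric that both models report. The sketch below assumes the metrics are available as flat dictionaries. The metric names and numbers are placeholders for illustration only; they are not results from SpectraMind, and the real schema of the comparison output may differ.

```python
def relative_change(baseline, candidate):
    """Relative change (candidate - baseline) / baseline for metrics present in both.

    Metrics whose baseline value is zero are skipped to avoid division by zero.
    """
    shared = sorted(set(baseline) & set(candidate))
    return {
        name: (candidate[name] - baseline[name]) / baseline[name]
        for name in shared
        if baseline[name] != 0
    }


# Placeholder values, NOT measured results.
baseline_metrics = {"val_loss": 2.0, "calibration_error": 0.5}
hf_metrics = {"val_loss": 1.5, "calibration_error": 0.5, "extra_metric": 1.0}

delta = relative_change(baseline_metrics, hf_metrics)
for name, change in delta.items():
    print(f"{name}: {change:+.1%}")
```

For lower-is-better metrics a negative value favours the HF model. Metrics reported by only one model are dropped rather than guessed.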


In [None]:
# Symbolic / general diagnostics (adjust to your CLI):
cmd_diag = ["spectramind", "diagnose", "dashboard",
            "--out", os.path.join(DIAG_OUT, "diagnostic_report_hf_v1.html")]
res_diag = run_cli(cmd_diag, log_name="02_diagnose_dashboard")
print(res_diag["stdout"][:500])

# (Optional) If you have a comparison subcommand:
cmd_cmp = ["spectramind", "diagnose", "compare",
           "--models", "baseline_v50,hf_vit",
           "--out", os.path.join(DIAG_OUT, "compare_baseline_vs_hf.json")]
res_cmp = run_cli(cmd_cmp, log_name="03_diagnose_compare")
print(res_cmp["stdout"][:500])


## Browse produced artifacts

In [None]:
import os

def tree(path, prefix=""):
    items = sorted(os.listdir(path))
    lines = []
    for i, name in enumerate(items):
        full = os.path.join(path, name)
        connector = "└── " if i == len(items)-1 else "├── "
        lines.append(prefix + connector + name)
        if os.path.isdir(full):
            extension = "    " if i == len(items)-1 else "│   "
            lines.extend(tree(full, prefix + extension))
    return lines

print("ARTIFACTS TREE:", ARTIFACTS)
print("\n".join(tree(ARTIFACTS)))
dash_path = os.path.join(DIAG_OUT, "diagnostic_report_hf_v1.html")
print("\nDashboard:", dash_path if os.path.exists(dash_path) else "(not found)")


## Pipeline sketch (Mermaid)

```mermaid
flowchart LR
  A[Pretrained HF backbone] --> B["PEFT/LoRA fine‑tune (Hydra)"]
  B --> C[Predict μ, σ]
  C --> D[Diagnostics & overlays]
  D --> E[Compare vs SSM+GNN baseline]
  E --> F[Report / CI]
```

## Next steps
- Try **TimeSformer** or additional HF backbones; tune LoRA rank/alpha/dropout via `spectramind ablate`.
- Enable **Accelerate** multi‑GPU or mixed precision for faster runs if supported by your config.
- Publish the fine‑tuned model to a private registry (or Hugging Face Hub) for controlled reuse.
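
As a starting point for a LoRA sweep, the grid of override sets can be generated up front and handed to whatever ablation entry point you use. This is a minimal sketch: the helper `lora_grid` is hypothetical, the candidate values are arbitrary examples, and the flags of `spectramind ablate` are not shown because they are not documented here.

```python
from itertools import product


def lora_grid(ranks, alphas, dropouts):
    """Cartesian product of LoRA settings as Hydra-style override dicts."""
    grid = []
    for r, alpha, dropout in product(ranks, alphas, dropouts):
        grid.append({
            "peft.lora.r": str(r),
            "peft.lora.alpha": str(alpha),
            "peft.lora.dropout": str(dropout),
        })
    return grid


# Example candidate values; adjust to your compute budget.
grid = lora_grid(ranks=[8, 16], alphas=[16, 32], dropouts=[0.0, 0.05])
print(len(grid), "configurations")  # 2 * 2 * 2 = 8
print(grid[0])
```

Each dictionary uses the same keys as the `overrides` cell above, so one entry can be merged into it to define a single run.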
