# Ontology Alignment — Full Pipeline (Local Notebook)

This notebook runs the **entire pipeline locally**:

1. **Dataset construction** (build the source/target CSVs, the training dataset, and the train/val/test splits)
2. **Training** (cross-encoder fine-tuning)
3. **Offline bundle building** (ontology_internal.csv + offline_bundle.pkl, with optional semantic index)
4. **Inference** (retrieval + cross-encoder scoring → predictions.csv)

It is designed to be:
- **reproducible**: every run writes to `outputs/<RUN_ID>/...`
- **modular**: you can run only the stages you need via flags
- **terminal-free**: commands are launched via notebook cells and logged to disk

---

## 0) Setup

Run this section once per environment.

Two typical workflows:

- **You are already inside the repo**  
  Just install dependencies (once) and continue.

- **You want the notebook to clone the repo**  
  Clone, `cd` into it, install dependencies.

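Notes:
- In Jupyter, `%pip install ...` installs into the environment of the running kernel. `!pip install ...` uses whichever `pip` is first on `PATH`, which may belong to a different environment.
- Make sure the notebook kernel runs in the environment you intend to use.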
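
The pipeline cells in this notebook launch scripts with the bare command `python`, so it can help to confirm that `python` on `PATH` belongs to the same environment as the kernel. The following is a minimal sketch of such a check. The helper name `path_python_matches_kernel` is hypothetical, and comparing the two interpreters' parent directories is only a heuristic.

```python
import shutil
import sys
from pathlib import Path


def path_python_matches_kernel() -> bool:
    """Heuristic: True if `python` on PATH lives in the same directory as the kernel interpreter."""
    on_path = shutil.which("python")
    if on_path is None:
        # No `python` executable on PATH at all.
        return False
    # Compare directories without resolving symlinks: virtual environments often
    # symlink to a shared base interpreter, so resolved paths could match by accident.
    return Path(on_path).absolute().parent == Path(sys.executable).absolute().parent


print("Kernel interpreter:", sys.executable)
print("`python` on PATH  :", shutil.which("python"))
print("Same environment  :", path_python_matches_kernel())
```

If the last line prints `False`, subprocesses started with `python` may not see the packages installed in the kernel environment.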

In [7]:
import sys
from pathlib import Path

# --- OPTIONAL: clone repo ---
# !git clone https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git
# %cd 2025-P13-Ontology-Alignment

# --- OPTIONAL: install deps ---
# %pip installs into the environment of the running kernel.
# If you have requirements.txt:
# %pip install -r requirements.txt

# Fallback minimal (only if needed):
# %pip install -U sentence-transformers transformers torch pandas numpy scikit-learn

REPO_ROOT = Path(".").resolve()
print("Python:", sys.version)
print("REPO_ROOT:", REPO_ROOT)

Python: 3.13.5 (main, Jun 11 2025, 15:36:57) [Clang 17.0.0 (clang-1700.0.13.3)]
REPO_ROOT: /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject


---

## 1) Helpers (run_cmd + logs) + robust W&B disable

We run the pipeline scripts (`training.py`, `build_ontology_bundle.py`, `run_inference.py`) as subprocesses.
Each command writes a log file. If the command fails, the notebook prints the tail of the log.

Important:
- Some training stacks try to use Weights & Biases (wandb).  
  In local setups this can fail (missing API key).  
  We force-disable wandb inside subprocess environments.
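
As a quick illustration of why setting variables on a copied environment is enough, the sketch below launches a child Python process and reads back what it sees. The helper name `child_env_value` is hypothetical; `WANDB_MODE=disabled` is the same setting the helper cell applies.

```python
import os
import subprocess
import sys


def child_env_value(name: str, overrides: dict) -> str:
    """Start a child Python process with `overrides` applied and return its value of `name`."""
    env = os.environ.copy()  # copy, so the notebook's own environment is untouched
    env.update(overrides)
    code = f"import os; print(os.environ.get({name!r}, '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_env_value("WANDB_MODE", {"WANDB_MODE": "disabled"}))  # prints "disabled"
```

Because the override lives only in the copied mapping, it affects the launched script and nothing else in the notebook session.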

In [8]:
import subprocess
from pathlib import Path
import os

def print_tail(path: Path, n=120):
    p = Path(path)
    if not p.exists():
        print(f"[tail] log not found: {p}")
        return
    lines = p.read_text(errors="replace").splitlines()
    print("\n".join(lines[-n:]))

def run_cmd(cmd, log_path: Path, cwd: Path):
    cmd = [str(x) for x in cmd]
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)

    print("\nRunning command:\n", " ".join(cmd))
    print("CWD:", Path(cwd).resolve())
    print("Log:", log_path.resolve())

    env = os.environ.copy()
    env["WANDB_MODE"] = "disabled"
    env["WANDB_SILENT"] = "true"
    env["PYTHONPATH"] = str(Path(cwd).resolve()) + os.pathsep + env.get("PYTHONPATH", "")

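    with open(log_path, "w") as f:
        f.write("CMD: " + " ".join(cmd) + "\n")
        f.write("CWD: " + str(Path(cwd).resolve()) + "\n\n")
        # Flush the header before the child process writes to the same file descriptor;
        # otherwise the buffered header ends up after the subprocess output.
        f.flush()
        proc = subprocess.run(
            cmd,
            stdout=f,
            stderr=subprocess.STDOUT,
            cwd=str(cwd),
            env=env,
        )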

    print("Return code:", proc.returncode)
    if proc.returncode != 0:
        print("!!! Error occurred. Last lines of log:")
        print_tail(log_path, n=120)
        raise RuntimeError(f"Command failed with return code {proc.returncode}. See log: {log_path}")
    return proc.returncode

print("Helpers OK.")

Helpers OK.


---

## 2) Run mode flags (choose what to run today)

This notebook supports two styles:

- **Full pipeline**: Training → Offline → Inference (one run)
- **Stage-by-stage**: run only the stages you need

You can also “restore” artifacts (point to existing files) and skip rebuilding.

Key dependencies:
- Inference requires offline artifacts (bundle + ontology CSV).
- Inference requires a cross-encoder model id/path.

In [10]:
# ============================================
# RUN MODE FLAGS (choose what to run today)
# ============================================

# Main toggles: what stages to execute
DO_TRAINING  = True
DO_OFFLINE   = True
DO_INFERENCE = True

# Restore toggles: if True, skip building that stage and load artifacts instead
RESTORE_MODEL   = False   # restores cross-encoder (+ optionally custom inference input CSV/schema)
RESTORE_OFFLINE = False   # restores offline bundle + ontology CSV

# If you restore artifacts, you skip rebuilding them.
if RESTORE_MODEL and DO_TRAINING:
    DO_TRAINING = False
    print("RESTORE_MODEL=True => forcing DO_TRAINING=False (using restored model).")

if RESTORE_OFFLINE and DO_OFFLINE:
    DO_OFFLINE = False
    print("RESTORE_OFFLINE=True => forcing DO_OFFLINE=False (using restored offline artifacts).")

# Soft reminder (no hard stop)
if DO_INFERENCE and not (DO_TRAINING or RESTORE_MODEL):
    print("Note: inference requires CROSS_ENCODER_MODEL_ID (from training or restore).")
if DO_INFERENCE and not (DO_OFFLINE or RESTORE_OFFLINE):
    print("Note: inference requires offline artifacts (from offline stage, restore, or existing paths on disk).")

print("Flags OK.")

Flags OK.


---

## 3) Configuration (always run)

This cell defines:
- the **run directory** (`outputs/<RUN_ID>/...`) used by all stages
- all **inputs** (ontologies, alignments)
- all **model choices** (cross-encoder + bi-encoder + tokenizer)
- all **canonical artifact paths** (dataset CSVs, model dir, offline bundle, inference outputs)

Rule of thumb:
- Run this cell **every time** you open the notebook.
- You can later override specific parameters (e.g., inference top-k, custom input CSV) without changing this cell.
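
The configuration cell derives the split and query file names from the dataset path with `Path.with_suffix`. The sketch below shows what that derivation yields. The helper name `derived_paths` and the example path are hypothetical; the suffixes are the ones used in this notebook.

```python
from pathlib import Path


def derived_paths(dataset_csv: str) -> dict:
    """Map each artifact kind to the path obtained by swapping the final suffix of `dataset_csv`."""
    base = Path(dataset_csv)
    suffixes = {
        "train": ".train.csv",
        "val": ".val.csv",
        "test": ".test.csv",
        "test_queries": ".test.queries.csv",
        "test_gold": ".test.gold.csv",
    }
    # with_suffix replaces only the last suffix (".csv" here), so the stem is kept intact.
    return {kind: base.with_suffix(suffix) for kind, suffix in suffixes.items()}


for kind, path in derived_paths("outputs/demo_run/training/training_dataset.csv").items():
    print(f"{kind:12s} -> {path}")
```

One consequence of this design: a dataset name that already contains dots, such as `data.v2.csv`, keeps everything before the final `.csv`, giving `data.v2.train.csv`.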

In [11]:
# ============================================
# CONFIGURATION (unified training -> offline -> inference)
# ============================================
from pathlib import Path
from datetime import datetime


assert "REPO_ROOT" in globals(), "REPO_ROOT not set. Run the Setup cell first."
REPO_ROOT = Path(REPO_ROOT).resolve()
print("REPO_ROOT:", REPO_ROOT)

# -----------------------------
# Run id / output layout
# -----------------------------
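RUN_ID = f"unified_run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
# OUT_DIR is a relative path. The pipeline scripts run with cwd=REPO_ROOT and receive the
# same relative paths, so the notebook's working directory must also be REPO_ROOT.
OUT_DIR = Path("outputs") / RUN_ID
OUT_DIR.mkdir(parents=True, exist_ok=True)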

TRAIN_DIR = OUT_DIR / "training"
OFFLINE_DIR = OUT_DIR / "offline"
INFER_DIR = OUT_DIR / "inference"
TRAIN_DIR.mkdir(parents=True, exist_ok=True)
OFFLINE_DIR.mkdir(parents=True, exist_ok=True)
INFER_DIR.mkdir(parents=True, exist_ok=True)

print("RUN_ID:", RUN_ID)
print("OUT_DIR:", OUT_DIR)

# -----------------------------
# Training mode and model
# -----------------------------
RUN_MODE = "full"  # "full" | "build-dataset" | "train-only"
MODEL_TYPE = "cross-encoder"  # keep this if you want inference at the end
MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_EPOCHS = 10

HYPERPARAMETER_TUNING = False
N_TRIALS = 5

USE_FIXED_HYPERPARAMS = True
LEARNING_RATE = 3e-5
BATCH_SIZE = 16
WEIGHT_DECAY = 0.01

SPLIT_RATIOS = "0.75,0.15,0.10"

# -----------------------------
# Inputs for dataset building
# -----------------------------
SRC_PATH = "datasets/sweet.owl"
TGT_PATH = "datasets/envo.owl"
ALIGN_PATH = "datasets/envo-sweet.rdf"

SRC_PREFIX = None
TGT_PREFIX = "http://purl.obolibrary.org/obo/ENVO_"  # e.g. "http://purl.obolibrary.org/obo/ENVO_"

USE_DESCRIPTION = True
USE_SYNONYMS = True
USE_PARENTS = True
USE_EQUIVALENT = True
USE_DISJOINT = True

VISUALIZE = False

# -----------------------------
# Canonical outputs of STEP 1
# -----------------------------
OUT_SRC_CSV = str(TRAIN_DIR / "source_ontology.csv")
OUT_TGT_CSV = str(TRAIN_DIR / "target_ontology.csv")
OUT_DATASET_CSV = str(TRAIN_DIR / "training_dataset.csv")

TRAIN_SPLIT_CSV = str(Path(OUT_DATASET_CSV).with_suffix(".train.csv"))
VAL_SPLIT_CSV   = str(Path(OUT_DATASET_CSV).with_suffix(".val.csv"))
TEST_SPLIT_CSV  = str(Path(OUT_DATASET_CSV).with_suffix(".test.csv"))

# train-only mode
DATASET_CSV = TRAIN_SPLIT_CSV  # default, used only if RUN_MODE=="train-only"
# Tip: if RUN_MODE=="train-only", set DATASET_CSV to an existing dataset path (train split or full dataset)

# model outputs
MODEL_OUT_DIR = str(TRAIN_DIR / "models" / f"{MODEL_TYPE}_custom")
FINAL_CROSS_ENCODER_DIR = str(Path(MODEL_OUT_DIR) / "final_cross_encoder_model")

# -----------------------------
# Offline bundle builder
# -----------------------------
OFFLINE_EXPORT_CSV = None
OFFLINE_ONT_PATH = TGT_PATH
OFFLINE_PREFIX = TGT_PREFIX

# Tokenizer used by cross-encoder scoring (keep aligned with cross-encoder)
CROSS_TOKENIZER_NAME = MODEL_NAME

# Bi-encoder used ONLY for semantic embeddings in offline bundle / semantic retrieval
BI_ENCODER_MODEL_ID = "allenai/scibert_scivocab_uncased"
OFFLINE_SEMANTIC_BATCH_SIZE = 64
OFFLINE_SEMANTIC_MAX_LENGTH = 256
OFFLINE_NO_SEMANTIC_NORMALIZE = False

ONTOLOGY_INTERNAL_CSV = str(OFFLINE_DIR / "ontology_internal.csv")
OFFLINE_BUNDLE_PKL = str(OFFLINE_DIR / "offline_bundle.pkl")

# -----------------------------
# Inference
# -----------------------------
# In full pipeline: default to the trained model location.
# In restore mode: this will be overwritten by the restore cell.
CROSS_ENCODER_MODEL_ID = FINAL_CROSS_ENCODER_DIR

INFER_INPUT_CSV = str(Path(OUT_DATASET_CSV).with_suffix(".test.queries.csv"))  # by default, use test split queries
INFER_OUT_CSV = str(INFER_DIR / "predictions.csv")

RETRIEVAL_COL = "source_label"
SCORING_COL = "source_text"
ID_COL = "source_iri"

INFER_MODE = "hybrid"
RETRIEVAL_LEXICAL_TOP_K = 100
RETRIEVAL_SEMANTIC_TOP_K = 100
RETRIEVAL_MERGED_TOP_K = 150
HYBRID_RATIO_SEMANTIC = 0.2
SEMANTIC_BATCH_SIZE = 64

CROSS_TOP_K = 20
CROSS_BATCH_SIZE = 32
CROSS_MAX_LENGTH = 256

KEEP_TOP_N = 20

print("Config OK.")

REPO_ROOT: /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject
RUN_ID: unified_run_20260109_102452
OUT_DIR: outputs/unified_run_20260109_102452
Config OK.


---

## 4) Local restore (optional)

If you already have artifacts from a previous run (local disk), you can skip rebuilding:

- **Restore offline artifacts**:
  - `offline_bundle.pkl`
  - `ontology_internal.csv`

- **Restore model artifacts**:
  - a folder containing a saved SentenceTransformers CrossEncoder (must contain `config.json`)

These cells simply **override** the paths used downstream.
If you are running the full pipeline today, you can skip them.

In [12]:
# ============================================
# RESTORE OFFLINE ARTIFACTS (local)
# ============================================
from pathlib import Path

if not RESTORE_OFFLINE:
    print("Skipping offline restore (RESTORE_OFFLINE=False).")
else:
    # Point to a directory that contains offline_bundle.pkl and ontology_internal.csv
    # Example:
    # OFFLINE_RESTORE_SRC = "outputs/unified_run_20260101_120000/offline"
    OFFLINE_RESTORE_SRC = None  # <-- set me

    if not OFFLINE_RESTORE_SRC:
        raise ValueError("Set OFFLINE_RESTORE_SRC to a folder containing offline_bundle.pkl and ontology_internal.csv")

    restore_root = Path(OFFLINE_RESTORE_SRC).expanduser().resolve()

    if not restore_root.exists():
        raise FileNotFoundError(f"OFFLINE_RESTORE_SRC not found: {restore_root}")

    bundle_pkl = restore_root / "offline_bundle.pkl"
    onto_csv   = restore_root / "ontology_internal.csv"

    if not bundle_pkl.exists() or not onto_csv.exists():
        # fallback: search recursively (but deterministic: sort)
        pkls = sorted(restore_root.rglob("offline_bundle.pkl"))
        csvs = sorted(restore_root.rglob("ontology_internal.csv"))
        bundle_pkl = pkls[0] if pkls else None
        onto_csv = csvs[0] if csvs else None

    if bundle_pkl is None or onto_csv is None:
        raise FileNotFoundError(
            f"Could not find offline_bundle.pkl and/or ontology_internal.csv under: {restore_root}"
        )

    OFFLINE_BUNDLE_PKL = str(bundle_pkl)
    ONTOLOGY_INTERNAL_CSV = str(onto_csv)

    print("Restored offline artifacts:")
    print("   OFFLINE_BUNDLE_PKL    =", OFFLINE_BUNDLE_PKL)
    print("   ONTOLOGY_INTERNAL_CSV =", ONTOLOGY_INTERNAL_CSV)

Skipping offline restore (RESTORE_OFFLINE=False).


In [13]:
# ============================================
# RESTORE MODEL ARTIFACTS (local)
# ============================================
from pathlib import Path

if not RESTORE_MODEL:
    print("Skipping model restore (RESTORE_MODEL=False).")
else:
    # Point to a directory that is the saved CrossEncoder folder (contains config.json),
    # OR a parent directory that contains such a folder.
    # Example:
    # MODEL_RESTORE_SRC = "outputs/unified_run_20260101_120000/training/models/cross-encoder_custom/final_cross_encoder_model"
    MODEL_RESTORE_SRC = None  # <-- set me

    if not MODEL_RESTORE_SRC:
        raise ValueError("Set MODEL_RESTORE_SRC to a saved CrossEncoder folder (or a parent folder containing it).")

    restore_root = Path(MODEL_RESTORE_SRC).expanduser().resolve()
    if not restore_root.exists():
        raise FileNotFoundError(f"MODEL_RESTORE_SRC not found: {restore_root}")

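    def _is_model_dir(d: Path) -> bool:
        # A usable model folder has config.json plus a weights file.
        return (d / "config.json").exists() and ((d / "pytorch_model.bin").exists() or (d / "model.safetensors").exists())

    def _find_cross_encoder_dir(root: Path) -> Path:
        if (root / "config.json").exists():
            return root
        # Sort for a deterministic choice, and prefer folders that also contain weights.
        candidates = sorted(p.parent for p in root.rglob("config.json"))
        if not candidates:
            raise FileNotFoundError(f"Could not find config.json under: {root}")
        return next((d for d in candidates if _is_model_dir(d)), candidates[0])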

    cross_dir = _find_cross_encoder_dir(restore_root)
    flag = _is_model_dir(cross_dir)
    if not flag:
        raise FileNotFoundError(f"Restored cross-encoder model dir is invalid (missing model files): {cross_dir}")
    CROSS_ENCODER_MODEL_ID = str(cross_dir)

    print("Restored cross-encoder model dir:")
    print("   CROSS_ENCODER_MODEL_ID =", CROSS_ENCODER_MODEL_ID)

Skipping model restore (RESTORE_MODEL=False).


---

## 5) Run pipeline (Training → Offline → Inference)

This cell executes the stages selected by `DO_TRAINING`, `DO_OFFLINE`, `DO_INFERENCE`.

- Each stage writes a log file under the current run folder.
- If a stage fails, the notebook prints the last lines of the log and stops.
- Artifact paths come from **Configuration**, unless overridden by **Restore** cells.

In [15]:
# ============================================
# RUN PIPELINE (training -> offline -> inference)
# ============================================
from pathlib import Path

# Guardrails (hard)
if RUN_MODE == "full" and MODEL_TYPE != "cross-encoder":
    raise ValueError("RUN_MODE='full' ends with inference => needs MODEL_TYPE='cross-encoder'.")
if HYPERPARAMETER_TUNING and RUN_MODE != "full":
    raise ValueError("--tune only allowed in RUN_MODE='full'.")

# -----------------------------
# STAGE 1) TRAINING (+dataset)
# -----------------------------
if not DO_TRAINING:
    print("Skipping training (DO_TRAINING=False).")
else:
    Path(MODEL_OUT_DIR).mkdir(parents=True, exist_ok=True)

    train_log = TRAIN_DIR / "training.log"
    train_cmd = ["python", "training.py", "--mode", RUN_MODE]

    if RUN_MODE in {"full", "build-dataset"}:
        train_cmd += ["--src", SRC_PATH, "--tgt", TGT_PATH, "--align", ALIGN_PATH]
        train_cmd += ["--out-src", OUT_SRC_CSV, "--out-tgt", OUT_TGT_CSV, "--out-dataset", OUT_DATASET_CSV]
        train_cmd += ["--split-ratios", SPLIT_RATIOS]

        if SRC_PREFIX:
            train_cmd += ["--src-prefix", SRC_PREFIX]
        if TGT_PREFIX:
            train_cmd += ["--tgt-prefix", TGT_PREFIX]

        if USE_DESCRIPTION: train_cmd.append("--src-use-description")
        if USE_SYNONYMS: train_cmd.append("--src-use-synonyms")
        if USE_PARENTS: train_cmd.append("--src-use-parents")
        if USE_EQUIVALENT: train_cmd.append("--src-use-equivalent")
        if USE_DISJOINT: train_cmd.append("--src-use-disjoint")
        if VISUALIZE: train_cmd.append("--visualize-alignments")

    if RUN_MODE in {"full", "train-only"}:
        train_cmd += ["--model-type", MODEL_TYPE, "--model-name", MODEL_NAME, "--model-output-dir", MODEL_OUT_DIR]
        train_cmd += ["--num-epochs", str(NUM_EPOCHS)]

        if HYPERPARAMETER_TUNING:
            train_cmd += ["--tune", "--n-trials", str(N_TRIALS)]
        elif USE_FIXED_HYPERPARAMS:
            train_cmd += ["--learning-rate", str(LEARNING_RATE)]
            train_cmd += ["--batch-size", str(BATCH_SIZE)]
            train_cmd += ["--weight-decay", str(WEIGHT_DECAY)]

    if RUN_MODE == "train-only":
        train_cmd += ["--dataset-csv", DATASET_CSV]

    run_cmd(train_cmd, train_log, cwd=REPO_ROOT)

    print("\nTraining completed.")
    print("Dataset CSV:", OUT_DATASET_CSV)
    print("Train split:", TRAIN_SPLIT_CSV)
    print("Val split:", VAL_SPLIT_CSV)
    print("Test split:", TEST_SPLIT_CSV)
    print("Cross-encoder dir:", FINAL_CROSS_ENCODER_DIR)

    # In full pipeline, inference uses the newly trained model (unless overwritten later by restore)
    CROSS_ENCODER_MODEL_ID = FINAL_CROSS_ENCODER_DIR


# -----------------------------
# STAGE 2) OFFLINE BUNDLE
# -----------------------------
if not DO_OFFLINE:
    print("Skipping offline bundle (DO_OFFLINE=False).")
else:
    offline_log = OFFLINE_DIR / "offline_bundle.log"
    offline_cmd = [
        "python", "build_ontology_bundle.py",
        "--out-csv", ONTOLOGY_INTERNAL_CSV,
        "--out-bundle", OFFLINE_BUNDLE_PKL,
        "--tokenizer-name", CROSS_TOKENIZER_NAME,
        "--bi-encoder-model-id", BI_ENCODER_MODEL_ID,
        "--semantic-batch-size", str(OFFLINE_SEMANTIC_BATCH_SIZE),
        "--semantic-max-length", str(OFFLINE_SEMANTIC_MAX_LENGTH),
    ]
    if OFFLINE_NO_SEMANTIC_NORMALIZE:
        offline_cmd.append("--no-semantic-normalize")

    if OFFLINE_EXPORT_CSV:
        offline_cmd += ["--export-csv", OFFLINE_EXPORT_CSV]
    else:
        offline_cmd += ["--ont-path", OFFLINE_ONT_PATH]
        if OFFLINE_PREFIX:
            offline_cmd += ["--prefix", OFFLINE_PREFIX]

    run_cmd(offline_cmd, offline_log, cwd=REPO_ROOT)

    print("\nOffline bundle completed.")
    print("Ontology internal CSV:", ONTOLOGY_INTERNAL_CSV)
    print("Offline bundle PKL:", OFFLINE_BUNDLE_PKL)


# -----------------------------
# STAGE 3) INFERENCE
# -----------------------------
if not DO_INFERENCE:
    print("Skipping inference (DO_INFERENCE=False).")
else:
    # Final sanity checks (runtime)
    if "CROSS_ENCODER_MODEL_ID" not in globals() or CROSS_ENCODER_MODEL_ID is None:
        raise ValueError(
            "CROSS_ENCODER_MODEL_ID is not set. "
            "Run training (DO_TRAINING=True) or restore model (RESTORE_MODEL=True)."
        )

    if not Path(OFFLINE_BUNDLE_PKL).exists():
        raise FileNotFoundError(f"OFFLINE_BUNDLE_PKL not found: {OFFLINE_BUNDLE_PKL}")
    if not Path(ONTOLOGY_INTERNAL_CSV).exists():
        raise FileNotFoundError(f"ONTOLOGY_INTERNAL_CSV not found: {ONTOLOGY_INTERNAL_CSV}")
    if not Path(INFER_INPUT_CSV).exists():
        raise FileNotFoundError(
            f"INFER_INPUT_CSV not found: {INFER_INPUT_CSV}\n"
            "In full/build-dataset mode, training should generate *.test.queries.csv. "
            "Otherwise set INFER_INPUT_CSV to your custom file."
        )

    infer_log = INFER_DIR / "inference.log"
    infer_cmd = [
        "python", "run_inference.py",
        "--bundle", OFFLINE_BUNDLE_PKL,
        "--ontology-csv", ONTOLOGY_INTERNAL_CSV,
        "--input-csv", INFER_INPUT_CSV,
        "--out-csv", INFER_OUT_CSV,
        "--mode", INFER_MODE,
        "--cross-tokenizer-name", CROSS_TOKENIZER_NAME,
        "--cross-encoder-model-id", CROSS_ENCODER_MODEL_ID,
        "--retrieval-col", RETRIEVAL_COL,
        "--retrieval-lexical-top-k", str(RETRIEVAL_LEXICAL_TOP_K),
        "--retrieval-semantic-top-k", str(RETRIEVAL_SEMANTIC_TOP_K),
        "--retrieval-merged-top-k", str(RETRIEVAL_MERGED_TOP_K),
        "--hybrid-ratio-semantic", str(HYBRID_RATIO_SEMANTIC),
        "--semantic-batch-size", str(SEMANTIC_BATCH_SIZE),
        "--cross-top-k", str(CROSS_TOP_K),
        "--cross-batch-size", str(CROSS_BATCH_SIZE),
        "--cross-max-length", str(CROSS_MAX_LENGTH),
        "--keep-top-n", str(KEEP_TOP_N),
    ]
    if SCORING_COL:
        infer_cmd += ["--scoring-col", SCORING_COL]
    if ID_COL:
        infer_cmd += ["--id-col", ID_COL]

    run_cmd(infer_cmd, infer_log, cwd=REPO_ROOT)

    print("\nInference completed.")
    print("Predictions CSV:", INFER_OUT_CSV)

print("\nPipeline cell finished.")
print("Run folder:", OUT_DIR)


Running command:
 python training.py --mode full --src datasets/sweet.owl --tgt datasets/envo.owl --align datasets/envo-sweet.rdf --out-src outputs/unified_run_20260109_102452/training/source_ontology.csv --out-tgt outputs/unified_run_20260109_102452/training/target_ontology.csv --out-dataset outputs/unified_run_20260109_102452/training/training_dataset.csv --split-ratios 0.75,0.15,0.10 --tgt-prefix http://purl.obolibrary.org/obo/ENVO_ --src-use-description --src-use-synonyms --src-use-parents --src-use-equivalent --src-use-disjoint --model-type cross-encoder --model-name allenai/scibert_scivocab_uncased --model-output-dir outputs/unified_run_20260109_102452/training/models/cross-encoder_custom --num-epochs 10 --learning-rate 3e-05 --batch-size 16 --weight-decay 0.01
CWD: /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject
Log: /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject/outputs/unified_run_20260109_102452

RuntimeError: Command failed with return code 1. See log: outputs/unified_run_20260109_102452/training/training.log

---

## 6) Export helpers (ALWAYS RUN)

All export cells use the same design:

- Every ZIP contains the requested artifacts **plus** a `config.txt` snapshot.
- `config.txt` is generated at export time.

> ZIPs are created on disk, and each path is printed.
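
The export helper stores every file under its base name, so the archive is flat. The sketch below shows that behaviour on throwaway files and adds a guard against two inputs sharing a base name. The helper `zip_flat`, the guard, and the file contents are hypothetical and not part of the notebook's helpers.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_flat(zip_path: Path, files: list) -> list:
    """Write `files` into a ZIP under their base names and return the archive's member names."""
    names = [Path(p).name for p in files]
    if len(set(names)) != len(names):
        # Two inputs with the same base name would become duplicate members.
        raise ValueError(f"Duplicate base names would collide in the archive: {names}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for p in files:
            z.write(p, arcname=Path(p).name)
    with zipfile.ZipFile(zip_path) as z:
        return z.namelist()


with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    (tmp / "inference").mkdir()
    pred = tmp / "inference" / "predictions.csv"
    pred.write_text("placeholder\n")
    cfg = tmp / "config.txt"
    cfg.write_text("RUN_ID = demo\n")
    print(zip_flat(tmp / "demo.zip", [pred, cfg]))  # ['predictions.csv', 'config.txt']
```

Flat archives are convenient to share, but the folder structure of the run is lost; the full-run export keeps it.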

In [None]:
# ============================================
# EXPORT HELPERS (config.txt + zip) — LOCAL
# ============================================

from pathlib import Path
from datetime import datetime
from typing import List, Optional
import zipfile

def write_config_txt(config_dir: Path) -> Path:
    """
    Writes a config.txt snapshot of the current effective configuration.
    This is intended to be called right before exporting ZIP artifacts.
    """
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)

    lines = [
        "# Ontology Alignment – Run Configuration",
        f"# Generated on: {datetime.now().isoformat()}",
        "",
        "[Run]",
        f"RUN_ID = {globals().get('RUN_ID', None)}",
        f"OUT_DIR = {globals().get('OUT_DIR', None)}",
        f"RUN_MODE = {globals().get('RUN_MODE', None)}",
        "",
        "[Model]",
        f"MODEL_TYPE = {globals().get('MODEL_TYPE', None)}",
        f"MODEL_NAME = {globals().get('MODEL_NAME', None)}",
        f"CROSS_ENCODER_MODEL_ID = {globals().get('CROSS_ENCODER_MODEL_ID', None)}",
        f"BI_ENCODER_MODEL_ID = {globals().get('BI_ENCODER_MODEL_ID', None)}",
        f"CROSS_TOKENIZER_NAME = {globals().get('CROSS_TOKENIZER_NAME', None)}",
        "",
        "[Training]",
        f"NUM_EPOCHS = {globals().get('NUM_EPOCHS', None)}",
        f"LEARNING_RATE = {globals().get('LEARNING_RATE', None)}",
        f"BATCH_SIZE = {globals().get('BATCH_SIZE', None)}",
        f"WEIGHT_DECAY = {globals().get('WEIGHT_DECAY', None)}",
        f"SPLIT_RATIOS = {globals().get('SPLIT_RATIOS', None)}",
        "",
        "[Offline]",
        f"OFFLINE_ONT_PATH = {globals().get('OFFLINE_ONT_PATH', None)}",
        f"OFFLINE_PREFIX = {globals().get('OFFLINE_PREFIX', None)}",
        f"OFFLINE_SEMANTIC_BATCH_SIZE = {globals().get('OFFLINE_SEMANTIC_BATCH_SIZE', None)}",
        f"OFFLINE_SEMANTIC_MAX_LENGTH = {globals().get('OFFLINE_SEMANTIC_MAX_LENGTH', None)}",
        f"OFFLINE_NO_SEMANTIC_NORMALIZE = {globals().get('OFFLINE_NO_SEMANTIC_NORMALIZE', None)}",
        f"OFFLINE_BUNDLE_PKL = {globals().get('OFFLINE_BUNDLE_PKL', None)}",
        f"ONTOLOGY_INTERNAL_CSV = {globals().get('ONTOLOGY_INTERNAL_CSV', None)}",
        "",
        "[Inference]",
        f"INFER_MODE = {globals().get('INFER_MODE', None)}",
        f"INFER_INPUT_CSV = {globals().get('INFER_INPUT_CSV', None)}",
        f"INFER_OUT_CSV = {globals().get('INFER_OUT_CSV', None)}",
        f"RETRIEVAL_COL = {globals().get('RETRIEVAL_COL', None)}",
        f"SCORING_COL = {globals().get('SCORING_COL', None)}",
        f"ID_COL = {globals().get('ID_COL', None)}",
        f"RETRIEVAL_LEXICAL_TOP_K = {globals().get('RETRIEVAL_LEXICAL_TOP_K', None)}",
        f"RETRIEVAL_SEMANTIC_TOP_K = {globals().get('RETRIEVAL_SEMANTIC_TOP_K', None)}",
        f"RETRIEVAL_MERGED_TOP_K = {globals().get('RETRIEVAL_MERGED_TOP_K', None)}",
        f"HYBRID_RATIO_SEMANTIC = {globals().get('HYBRID_RATIO_SEMANTIC', None)}",
        f"SEMANTIC_BATCH_SIZE = {globals().get('SEMANTIC_BATCH_SIZE', None)}",
        f"CROSS_TOP_K = {globals().get('CROSS_TOP_K', None)}",
        f"CROSS_BATCH_SIZE = {globals().get('CROSS_BATCH_SIZE', None)}",
        f"CROSS_MAX_LENGTH = {globals().get('CROSS_MAX_LENGTH', None)}",
        f"KEEP_TOP_N = {globals().get('KEEP_TOP_N', None)}",
    ]

    config_path = config_dir / "config.txt"
    config_path.write_text("\n".join(lines), encoding="utf-8")
    return config_path


def make_zip_with_config(
    zip_path: Path,
    files_to_include: List[Path],
    config_dir: Optional[Path] = None,
) -> Path:
    """
    Creates a ZIP containing existing artifacts + a config.txt snapshot.
    Returns the created zip path.
    """
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)

    if config_dir is None:
        config_dir = zip_path.parent
    config_path = write_config_txt(Path(config_dir))

    existing = [Path(p) for p in files_to_include if Path(p).exists()]
    if not existing:
        raise FileNotFoundError("None of the requested files exist. Nothing to zip.")

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for p in existing:
            z.write(p, arcname=p.name)
        z.write(config_path, arcname="config.txt")

    print("Created ZIP:", zip_path)
    print("Included:", [p.name for p in existing], "+ config.txt")
    return zip_path


print("Export helpers ready.")

---

## 7.1) Export inference outputs + gold (full pipeline)

This is a **result extraction** utility intended for the **full pipeline** case:

- `predictions.csv` must exist (produced by inference)
- `*.test.gold.csv` must exist (produced during dataset build/split)

It creates a single ZIP containing:
- `predictions.csv`
- `*.test.gold.csv`
- `config.txt` (effective run config snapshot)

In [None]:
# ============================================
# EXPORT ARTIFACTS (predictions + gold)
# ============================================

from pathlib import Path

pred_path = Path(INFER_OUT_CSV)
gold_path = Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv")

missing = [str(p) for p in [pred_path, gold_path] if not p.exists()]
if missing:
    raise FileNotFoundError(
        "Missing required file(s):\n"
        + "\n".join(f" - {m}" for m in missing)
        + "\n\nNotes:\n"
        " - predictions.csv is produced by the inference stage.\n"
        " - *.test.gold.csv is produced during dataset construction/splitting.\n"
    )

zip_path = pred_path.parent / "predictions_and_gold.zip"
make_zip_with_config(zip_path, [pred_path, gold_path], config_dir=pred_path.parent)

## 7.2) Export predictions + optional gold

This export cell works in both scenarios:

- Full pipeline: gold is auto-detected (`*.test.gold.csv`)
- Inference-only: gold may be missing (that's fine), or you can provide a custom gold path

It creates a ZIP containing:
- `predictions.csv`
- (optional) gold CSV if available
- `config.txt` (effective run config snapshot)

In [None]:
# ============================================
# EXPORT ARTIFACTS (predictions + optional gold)
# ============================================

from pathlib import Path

# Optional: if you have an external gold file, set it here
GOLD_PATH_OVERRIDE = None  # e.g. "my_eval/gold_truth.csv"

pred_path = Path(INFER_OUT_CSV)
if not pred_path.exists():
    raise FileNotFoundError(
        f"predictions file not found:\n - {pred_path}\n\n"
        "Run inference first or set INFER_OUT_CSV correctly."
    )

gold_candidates = []
if GOLD_PATH_OVERRIDE is not None:
    gold_candidates.append(Path(GOLD_PATH_OVERRIDE))

gold_candidates.append(Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv"))

gold_path = next((p for p in gold_candidates if p.exists()), None)

files_to_zip = [pred_path]
zip_name = "predictions_only.zip"

if gold_path is not None:
    files_to_zip.append(gold_path)
    zip_name = "predictions_and_gold.zip"
    print("Found gold:", gold_path)
else:
    print("Gold not found (OK). Tried:")
    for p in gold_candidates:
        print(" -", p)

zip_path = pred_path.parent / zip_name
make_zip_with_config(zip_path, files_to_zip, config_dir=pred_path.parent)

## 7.3) Export the entire run folder (everything)

This creates a ZIP archive of the whole `OUT_DIR` folder (training logs, models, offline bundle, inference outputs, etc.)
and also drops a `config.txt` snapshot inside the run folder before zipping.

Use this when you want to share or archive the entire run in one shot.
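
To make the archive layout concrete, here is a small sketch of `shutil.make_archive` on a temporary folder. The helper name `archive_dir`, the folder name `demo_run`, and the file contents are hypothetical. The archive is written next to the folder rather than inside it, so it never tries to include itself.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_dir(run_dir: Path) -> Path:
    """Zip the contents of `run_dir` into a sibling file named `<run_dir>.zip`."""
    run_dir = Path(run_dir)
    # base_name is passed without an extension; make_archive appends ".zip"
    # and returns the path of the archive it created.
    created = shutil.make_archive(str(run_dir), "zip", root_dir=str(run_dir))
    return Path(created)


with tempfile.TemporaryDirectory() as tmp_name:
    run_dir = Path(tmp_name) / "demo_run"
    (run_dir / "inference").mkdir(parents=True)
    (run_dir / "config.txt").write_text("RUN_ID = demo_run\n")
    (run_dir / "inference" / "predictions.csv").write_text("placeholder\n")

    zip_path = archive_dir(run_dir)
    with zipfile.ZipFile(zip_path) as z:
        # Skip directory entries and list only the files.
        print(sorted(n for n in z.namelist() if not n.endswith("/")))
```

Member names are relative to `root_dir`, so unpacking the archive recreates the contents of the run folder without an extra top-level directory.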

In [None]:
# ============================================
# EXPORT FULL RUN DIR (OUT_DIR) — LOCAL
# ============================================

from pathlib import Path
import shutil

out_dir = Path(OUT_DIR)
if not out_dir.exists():
    raise FileNotFoundError(f"OUT_DIR not found: {out_dir}")

# Ensure config.txt exists inside OUT_DIR before zipping
_ = write_config_txt(out_dir)

zip_base = str(out_dir)  # shutil.make_archive wants a string base path (without .zip)
zip_path = zip_base + ".zip"

print("Zipping:", out_dir, "->", zip_path)
shutil.make_archive(zip_base, "zip", root_dir=str(out_dir))
print("Created:", zip_path)
print("Note: config.txt sits at the root of the archive (the ZIP holds the contents of OUT_DIR).")

---

## 8) Quick sanity checks and pointers

These cells are lightweight helpers that run **after** the pipeline.

They do not recompute anything:
- they check that key artifacts exist
- they print the most important paths (model, offline bundle, predictions)
- they optionally preview a few rows of the output CSV

In [None]:
# ============================================
# POST-RUN SUMMARY (paths + existence)
# ============================================

from pathlib import Path

def _exists(p: str | Path) -> bool:
    return Path(p).exists()

print("\n=== RUN SUMMARY ===")
print("RUN_ID:", RUN_ID)
print("OUT_DIR:", OUT_DIR)

print("\n--- Training ---")
print("OUT_DATASET_CSV:", OUT_DATASET_CSV, "| exists =", _exists(OUT_DATASET_CSV))
print("TRAIN_SPLIT_CSV:", TRAIN_SPLIT_CSV, "| exists =", _exists(TRAIN_SPLIT_CSV))
print("VAL_SPLIT_CSV  :", VAL_SPLIT_CSV,   "| exists =", _exists(VAL_SPLIT_CSV))
print("TEST_SPLIT_CSV :", TEST_SPLIT_CSV,  "| exists =", _exists(TEST_SPLIT_CSV))
print("CROSS_ENCODER_MODEL_ID:", CROSS_ENCODER_MODEL_ID, "| exists =", _exists(CROSS_ENCODER_MODEL_ID))

print("\n--- Offline ---")
print("OFFLINE_BUNDLE_PKL   :", OFFLINE_BUNDLE_PKL, "| exists =", _exists(OFFLINE_BUNDLE_PKL))
print("ONTOLOGY_INTERNAL_CSV:", ONTOLOGY_INTERNAL_CSV, "| exists =", _exists(ONTOLOGY_INTERNAL_CSV))

print("\n--- Inference ---")
print("INFER_INPUT_CSV:", INFER_INPUT_CSV, "| exists =", _exists(INFER_INPUT_CSV))
print("INFER_OUT_CSV  :", INFER_OUT_CSV,   "| exists =", _exists(INFER_OUT_CSV))

print("\n--- Exports ---")
# The export cells write ZIPs next to predictions.csv and, for the full-run archive, next to OUT_DIR.
zips = sorted(str(p) for p in Path(OUT_DIR).rglob("*.zip"))
full_run_zip = Path(str(OUT_DIR) + ".zip")
if full_run_zip.exists():
    zips.append(str(full_run_zip))
print("ZIPs:", zips if zips else "(none yet)")

### 8.1) Preview `predictions.csv` (optional)

This is a convenience cell to quickly inspect a few rows of the inference output locally.

In [None]:
# ============================================
# PREVIEW PREDICTIONS (optional)
# ============================================

from pathlib import Path

try:
    import pandas as pd
except ImportError:
    pd = None

pred_path = Path(INFER_OUT_CSV)

if not pred_path.exists():
    print("Predictions not found:", pred_path)
elif pd is None:
    print("pandas is not installed. Install with: pip install pandas")
else:
    df = pd.read_csv(pred_path)
    print("Predictions shape:", df.shape)
    display(df.head(10))

---


## A) Stage-by-stage execution (local)

This section runs the pipeline **stage by stage**, controlled by the flags in the
**RUN MODE FLAGS** cell.

The execution logic is:

- **Training stage**
  - Runs only if `DO_TRAINING=True`
  - Produces:
    - dataset CSVs and splits
    - a trained cross-encoder model
  - If training runs, it sets `CROSS_ENCODER_MODEL_ID` automatically

- **Offline bundle stage**
  - Runs only if `DO_OFFLINE=True`
  - Produces:
    - `ontology_internal.csv`
    - `offline_bundle.pkl`

- **Inference stage**
  - Runs only if `DO_INFERENCE=True`
  - Requires:
    - a valid `CROSS_ENCODER_MODEL_ID`
      (from training or restore)
    - valid offline artifacts
      (from offline stage, restore, or existing paths)

Restore cells (if enabled) simply **override paths** used by these stages.
They do not execute any computation.
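
The gating rules above can be condensed into a small, self-contained sketch. The function `plan_stages` and its arguments are hypothetical and not part of the notebook; the sketch also assumes that a training run actually produces a model (that is, `RUN_MODE` is not `build-dataset`).

```python
from typing import List, Optional


def plan_stages(
    do_training: bool,
    do_offline: bool,
    do_inference: bool,
    restored_model_id: Optional[str] = None,
    offline_artifacts_available: bool = False,
) -> List[str]:
    """Return the stages that would run, or raise if inference lacks its inputs.

    Hypothetical helper, for illustration only.
    """
    stages: List[str] = []
    if do_training:
        stages.append("training")
    if do_offline:
        stages.append("offline")

    if do_inference:
        # Model: produced by training in this session, or supplied by a restore cell
        has_model = do_training or restored_model_id is not None
        # Offline artifacts: built now, restored, or already present on disk
        has_offline = do_offline or offline_artifacts_available
        if not has_model:
            raise ValueError("Inference needs CROSS_ENCODER_MODEL_ID (train or restore).")
        if not has_offline:
            raise FileNotFoundError("Inference needs the offline bundle and ontology CSV.")
        stages.append("inference")
    return stages


print(plan_stages(True, True, True))
print(plan_stages(False, False, True,
                  restored_model_id="models/ce",  # made-up path
                  offline_artifacts_available=True))
```

The actual cells below perform equivalent checks inline, against the real paths, rather than through a planner function.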

In [None]:
# ============================================
# STAGE 1 — TRAINING (+ dataset construction)
# ============================================

from pathlib import Path

if not DO_TRAINING:
    print("Skipping training stage (DO_TRAINING=False).")
else:
    train_log = TRAIN_DIR / "training.log"
    train_cmd = ["python", "training.py", "--mode", RUN_MODE]

    # -----------------------------
    # Dataset construction
    # -----------------------------
    if RUN_MODE in {"full", "build-dataset"}:
        train_cmd += [
            "--src", SRC_PATH,
            "--tgt", TGT_PATH,
            "--align", ALIGN_PATH,
            "--out-src", OUT_SRC_CSV,
            "--out-tgt", OUT_TGT_CSV,
            "--out-dataset", OUT_DATASET_CSV,
            "--split-ratios", SPLIT_RATIOS,
        ]

        if SRC_PREFIX:
            train_cmd += ["--src-prefix", SRC_PREFIX]
        if TGT_PREFIX:
            train_cmd += ["--tgt-prefix", TGT_PREFIX]

        if USE_DESCRIPTION: train_cmd.append("--src-use-description")
        if USE_SYNONYMS: train_cmd.append("--src-use-synonyms")
        if USE_PARENTS: train_cmd.append("--src-use-parents")
        if USE_EQUIVALENT: train_cmd.append("--src-use-equivalent")
        if USE_DISJOINT: train_cmd.append("--src-use-disjoint")
        if VISUALIZE: train_cmd.append("--visualize-alignments")

    # -----------------------------
    # Model training
    # -----------------------------
    if RUN_MODE in {"full", "train-only"}:
        Path(MODEL_OUT_DIR).mkdir(parents=True, exist_ok=True)

        train_cmd += [
            "--model-type", MODEL_TYPE,
            "--model-name", MODEL_NAME,
            "--model-output-dir", MODEL_OUT_DIR,
            "--num-epochs", str(NUM_EPOCHS),
        ]

        if HYPERPARAMETER_TUNING:
            train_cmd += ["--tune", "--n-trials", str(N_TRIALS)]
        elif USE_FIXED_HYPERPARAMS:
            train_cmd += [
                "--learning-rate", str(LEARNING_RATE),
                "--batch-size", str(BATCH_SIZE),
                "--weight-decay", str(WEIGHT_DECAY),
            ]

    if RUN_MODE == "train-only":
        train_cmd += ["--dataset-csv", DATASET_CSV]

    run_cmd(train_cmd, train_log, cwd=REPO_ROOT)

    print("\nTraining stage completed.")
    print("Dataset CSV:", OUT_DATASET_CSV)

    # Only modes that actually train produce a model. After those, the freshly
    # trained model becomes the default for inference.
    if RUN_MODE in {"full", "train-only"}:
        print("Cross-encoder output dir:", FINAL_CROSS_ENCODER_DIR)
        CROSS_ENCODER_MODEL_ID = FINAL_CROSS_ENCODER_DIR

In [None]:
# ============================================
# STAGE 2 — OFFLINE BUNDLE
# ============================================

if not DO_OFFLINE:
    print("Skipping offline bundle stage (DO_OFFLINE=False).")
else:
    offline_log = OFFLINE_DIR / "offline_bundle.log"

    offline_cmd = [
        "python", "build_ontology_bundle.py",
        "--out-csv", ONTOLOGY_INTERNAL_CSV,
        "--out-bundle", OFFLINE_BUNDLE_PKL,
        "--tokenizer-name", CROSS_TOKENIZER_NAME,
        "--bi-encoder-model-id", BI_ENCODER_MODEL_ID,
        "--semantic-batch-size", str(OFFLINE_SEMANTIC_BATCH_SIZE),
        "--semantic-max-length", str(OFFLINE_SEMANTIC_MAX_LENGTH),
    ]

    if OFFLINE_NO_SEMANTIC_NORMALIZE:
        offline_cmd.append("--no-semantic-normalize")

    if OFFLINE_EXPORT_CSV:
        offline_cmd += ["--export-csv", OFFLINE_EXPORT_CSV]
    else:
        offline_cmd += ["--ont-path", OFFLINE_ONT_PATH]
        if OFFLINE_PREFIX:
            offline_cmd += ["--prefix", OFFLINE_PREFIX]

    run_cmd(offline_cmd, offline_log, cwd=REPO_ROOT)

    print("\nOffline bundle stage completed.")
    print("Ontology internal CSV:", ONTOLOGY_INTERNAL_CSV)
    print("Offline bundle PKL:", OFFLINE_BUNDLE_PKL)

In [None]:
# ============================================
# STAGE 3 — INFERENCE
# ============================================

from pathlib import Path

if not DO_INFERENCE:
    print("Skipping inference stage (DO_INFERENCE=False).")
else:
    # -----------------------------
    # Runtime sanity checks
    # -----------------------------
    if "CROSS_ENCODER_MODEL_ID" not in globals() or CROSS_ENCODER_MODEL_ID is None:
        raise ValueError(
            "CROSS_ENCODER_MODEL_ID is not set. "
            "Run training or restore a model before inference."
        )

    if not Path(OFFLINE_BUNDLE_PKL).exists():
        raise FileNotFoundError(f"OFFLINE_BUNDLE_PKL not found: {OFFLINE_BUNDLE_PKL}")

    if not Path(ONTOLOGY_INTERNAL_CSV).exists():
        raise FileNotFoundError(f"ONTOLOGY_INTERNAL_CSV not found: {ONTOLOGY_INTERNAL_CSV}")

    if not Path(INFER_INPUT_CSV).exists():
        raise FileNotFoundError(
            f"INFER_INPUT_CSV not found: {INFER_INPUT_CSV}\n"
            "Provide a custom input CSV or run dataset construction first."
        )

    infer_log = INFER_DIR / "inference.log"

    infer_cmd = [
        "python", "run_inference.py",
        "--bundle", OFFLINE_BUNDLE_PKL,
        "--ontology-csv", ONTOLOGY_INTERNAL_CSV,
        "--input-csv", INFER_INPUT_CSV,
        "--out-csv", INFER_OUT_CSV,
        "--mode", INFER_MODE,
        "--cross-tokenizer-name", CROSS_TOKENIZER_NAME,
        "--cross-encoder-model-id", CROSS_ENCODER_MODEL_ID,
        "--retrieval-col", RETRIEVAL_COL,
        "--retrieval-lexical-top-k", str(RETRIEVAL_LEXICAL_TOP_K),
        "--retrieval-semantic-top-k", str(RETRIEVAL_SEMANTIC_TOP_K),
        "--retrieval-merged-top-k", str(RETRIEVAL_MERGED_TOP_K),
        "--hybrid-ratio-semantic", str(HYBRID_RATIO_SEMANTIC),
        "--semantic-batch-size", str(SEMANTIC_BATCH_SIZE),
        "--cross-top-k", str(CROSS_TOP_K),
        "--cross-batch-size", str(CROSS_BATCH_SIZE),
        "--cross-max-length", str(CROSS_MAX_LENGTH),
        "--keep-top-n", str(KEEP_TOP_N),
    ]

    if SCORING_COL:
        infer_cmd += ["--scoring-col", SCORING_COL]
    if ID_COL:
        infer_cmd += ["--id-col", ID_COL]

    run_cmd(infer_cmd, infer_log, cwd=REPO_ROOT)

    print("\nInference stage completed.")
    print("Predictions CSV:", INFER_OUT_CSV)

---

## B) Export inference outputs (predictions + optional gold)

This export creates one ZIP containing:
- `predictions.csv`
- (optional) gold CSV if available
- `config.txt`
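
As a rough illustration of the archive layout, the sketch below builds such a ZIP with the standard `zipfile` module. The function `zip_with_config_sketch` is hypothetical: the notebook's own `make_zip_with_config` helper is defined in an earlier cell and may behave differently. File contents and the `RUN_ID=demo` line are dummy values.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable


def zip_with_config_sketch(zip_path: Path, files: Iterable[Path], config_text: str) -> Path:
    """Write `files` plus a config.txt snapshot into a flat ZIP archive."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # arcname keeps only the file name, so every entry sits at the archive root
            zf.write(f, arcname=Path(f).name)
        zf.writestr("config.txt", config_text)
    return zip_path


# Usage with throwaway files (dummy content)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pred = tmp_dir / "predictions.csv"
    pred.write_text("source_iri,predicted_iri\n", encoding="utf-8")
    gold = tmp_dir / "training_dataset.test.gold.csv"  # not created here, so it is skipped

    files = [pred] + ([gold] if gold.exists() else [])
    name = "predictions_and_gold.zip" if gold.exists() else "predictions_only.zip"
    out = zip_with_config_sketch(tmp_dir / name, files, config_text="RUN_ID=demo\n")
    with zipfile.ZipFile(out) as zf:
        archive_names = sorted(zf.namelist())
    print(out.name, archive_names)
```

The archive name changes with the presence of the gold file, which mirrors the `predictions_only.zip` / `predictions_and_gold.zip` naming used in the cell below.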

In [None]:
# ============================================
# EXPORT ARTIFACTS (predictions + optional gold)
# ============================================

from pathlib import Path

# Optional: external gold file (if you have one)
GOLD_PATH_OVERRIDE = None  # e.g. "my_eval/gold_truth.csv"

pred_path = Path(INFER_OUT_CSV)
if not pred_path.exists():
    raise FileNotFoundError(
        f"predictions file not found:\n - {pred_path}\n\n"
        "Run inference first or set INFER_OUT_CSV correctly."
    )

gold_candidates = []
if GOLD_PATH_OVERRIDE is not None:
    gold_candidates.append(Path(GOLD_PATH_OVERRIDE))

# Default gold path from dataset build/split (full pipeline)
gold_candidates.append(Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv"))

gold_path = next((p for p in gold_candidates if p.exists()), None)

files_to_zip = [pred_path]
zip_name = "predictions_only.zip"

if gold_path is not None:
    files_to_zip.append(gold_path)
    zip_name = "predictions_and_gold.zip"
    print("Found gold:", gold_path)
else:
    print("Gold not found (OK). Tried:")
    for p in gold_candidates:
        print(" -", p)

zip_path = pred_path.parent / zip_name
make_zip_with_config(zip_path, files_to_zip, config_dir=pred_path.parent)

### B.1) Export the entire run folder (`OUT_DIR`)

This creates a ZIP archive of the whole `OUT_DIR` folder:
- training logs + model artifacts
- offline bundle artifacts
- inference outputs

A `config.txt` snapshot is written **inside `OUT_DIR`** before zipping.

In [None]:
# ============================================
# EXPORT FULL RUN DIR (OUT_DIR) — LOCAL
# ============================================

from pathlib import Path
import shutil

out_dir = Path(OUT_DIR)
if not out_dir.exists():
    raise FileNotFoundError(f"OUT_DIR not found: {out_dir}")

# Ensure config.txt exists inside OUT_DIR before zipping
_ = write_config_txt(out_dir)

zip_base = str(out_dir)  # shutil.make_archive wants base path (without .zip)
zip_path = zip_base + ".zip"

print("Zipping:", out_dir, "->", zip_path)
shutil.make_archive(zip_base, "zip", root_dir=str(out_dir))
print("Created:", zip_path)
print("Note: ZIP includes config.txt at the root of OUT_DIR.")

---

## C) Quick sanity checks and pointers (post-run)

These cells are lightweight helpers that run **after** the staged pipeline.

They do not recompute anything:
- they check which key artifacts exist **given the current effective paths**
- they print the most important locations (model, offline bundle, predictions)
- they list any created ZIP exports under `OUT_DIR/exports`
- they optionally preview a few rows of `predictions.csv`

In [None]:
# ============================================
# POST-RUN SUMMARY (stage-aware: paths + existence)
# ============================================

from pathlib import Path

def _exists(p: str | Path | None) -> bool:
    if p is None:
        return False
    try:
        return Path(p).exists()
    except Exception:
        return False

print("\n=== RUN SUMMARY (stage-aware) ===")
print("RUN_ID :", globals().get("RUN_ID", None))
print("OUT_DIR:", globals().get("OUT_DIR", None))

print("\nFlags:")
print("  DO_TRAINING   =", globals().get("DO_TRAINING", None))
print("  DO_OFFLINE    =", globals().get("DO_OFFLINE", None))
print("  DO_INFERENCE  =", globals().get("DO_INFERENCE", None))
print("  RESTORE_MODEL =", globals().get("RESTORE_MODEL", None))
print("  RESTORE_OFFLINE =", globals().get("RESTORE_OFFLINE", None))

# -----------------------------
# Training artifacts (optional)
# -----------------------------
print("\n--- Training artifacts (may be absent in stage runs) ---")
print("OUT_DATASET_CSV:", globals().get("OUT_DATASET_CSV", None), "| exists =", _exists(globals().get("OUT_DATASET_CSV", None)))
print("TRAIN_SPLIT_CSV:", globals().get("TRAIN_SPLIT_CSV", None), "| exists =", _exists(globals().get("TRAIN_SPLIT_CSV", None)))
print("VAL_SPLIT_CSV  :", globals().get("VAL_SPLIT_CSV", None),   "| exists =", _exists(globals().get("VAL_SPLIT_CSV", None)))
print("TEST_SPLIT_CSV :", globals().get("TEST_SPLIT_CSV", None),  "| exists =", _exists(globals().get("TEST_SPLIT_CSV", None)))

# Gold/queries are only guaranteed in full/build-dataset mode
gold_path = None
queries_path = None
if "OUT_DATASET_CSV" in globals():
    gold_path = str(Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv"))
    queries_path = str(Path(OUT_DATASET_CSV).with_suffix(".test.queries.csv"))

print("TEST gold CSV  :", gold_path,   "| exists =", _exists(gold_path))
print("TEST queries CSV:", queries_path, "| exists =", _exists(queries_path))

# Model
print("CROSS_ENCODER_MODEL_ID:", globals().get("CROSS_ENCODER_MODEL_ID", None),
      "| exists =", _exists(globals().get("CROSS_ENCODER_MODEL_ID", None)))

# -----------------------------
# Offline artifacts (required for inference, but may be restored)
# -----------------------------
print("\n--- Offline artifacts ---")
print("OFFLINE_BUNDLE_PKL   :", globals().get("OFFLINE_BUNDLE_PKL", None),
      "| exists =", _exists(globals().get("OFFLINE_BUNDLE_PKL", None)))
print("ONTOLOGY_INTERNAL_CSV:", globals().get("ONTOLOGY_INTERNAL_CSV", None),
      "| exists =", _exists(globals().get("ONTOLOGY_INTERNAL_CSV", None)))

# -----------------------------
# Inference artifacts (optional)
# -----------------------------
print("\n--- Inference artifacts ---")
print("INFER_INPUT_CSV:", globals().get("INFER_INPUT_CSV", None),
      "| exists =", _exists(globals().get("INFER_INPUT_CSV", None)))
print("INFER_OUT_CSV  :", globals().get("INFER_OUT_CSV", None),
      "| exists =", _exists(globals().get("INFER_OUT_CSV", None)))

# -----------------------------
# Export ZIPs (local design)
# -----------------------------
print("\n--- Exports ---")
exports_dir = Path(globals().get("OUT_DIR", ".")) / "exports"
print("Exports dir:", exports_dir, "| exists =", exports_dir.exists())
if exports_dir.exists():
    zips = sorted([p.name for p in exports_dir.glob("*.zip")])
    print("ZIPs:", zips if zips else "(none yet)")

print("\nTip:")
print(" - If something is missing, check the stage logs under:")
print("   ", Path(globals().get("OUT_DIR", ".")) / "training" / "training.log")
print("   ", Path(globals().get("OUT_DIR", ".")) / "offline" / "offline_bundle.log")
print("   ", Path(globals().get("OUT_DIR", ".")) / "inference" / "inference.log")
print("=== END SUMMARY ===\n")

### C.1) Preview `predictions.csv` (optional)

This is a convenience cell to quickly inspect a few rows of the inference output locally.
It is safe to run even if inference was skipped: it will just print a message.

In [None]:
# ============================================
# PREVIEW PREDICTIONS (optional) — stage-aware
# ============================================

from pathlib import Path

try:
    import pandas as pd
except ImportError:
    pd = None

pred_path = Path(globals().get("INFER_OUT_CSV", "predictions.csv"))

if not pred_path.exists():
    print("Predictions not found:", pred_path)
elif pd is None:
    print("pandas is not installed. Install with: pip install pandas")
else:
    df = pd.read_csv(pred_path)
    print("Predictions path :", pred_path)
    print("Predictions shape:", df.shape)
    display(df.head(10))

---


# Evaluation

This section evaluates the inference output (`predictions.csv`) against the ground-truth gold file
produced by the dataset builder (`training_dataset.test.gold.csv`) using the **external evaluation script**.

## Setup and path resolution

This cell configures and resolves all inputs required to run the evaluation script.

In particular, it:
- sets the evaluation hyperparameter `K`
- resolves the paths to:
  - `predictions.csv`
  - the gold test split (`*.test.gold.csv`)
  - the evaluation script
- optionally allows overriding prediction and gold paths for custom evaluation
- optionally detects and passes a `config.txt` file **as input metadata** (it is never generated here)
- configures the export of a **metrics CSV** (the only persistent evaluation artifact)

Important notes:
- evaluation output is **printed directly to the notebook**
- metrics are saved as a CSV for aggregation and comparison across runs
- no `evaluation.log` is produced anymore
- saving a merged prediction–gold CSV is optional and intended only for debugging

All path logic is kept explicit and separate so that the evaluation script itself
remains independent from notebook-specific filesystem concerns.
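
The gold path resolution can be illustrated in isolation. This is a minimal sketch: `resolve_gold_path` is a hypothetical name and the example paths are made up. The real cell additionally calls `.resolve()` and raises if the file is missing.

```python
from pathlib import Path
from typing import Optional


def resolve_gold_path(dataset_csv: str, override: Optional[str] = None) -> Path:
    """Use the override if given, else derive <dataset stem>.test.gold.csv."""
    if override is not None:
        return Path(override).expanduser()
    # with_suffix replaces only the final ".csv" with the multi-part suffix
    return Path(dataset_csv).with_suffix(".test.gold.csv")


derived = resolve_gold_path("outputs/run_demo/training/training_dataset.csv")
custom = resolve_gold_path("ignored.csv", override="my_gold/test.gold.csv")
print(derived.name)
print(custom.name)
```

Because the gold file name is derived from `OUT_DATASET_CSV`, evaluating files from another run requires either setting that variable or using the explicit override.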

In [None]:
from __future__ import annotations

import sys
from pathlib import Path

# ============================================
# EVALUATION CONFIG (notebook-side)
# ============================================
K = int(CROSS_TOP_K) if "CROSS_TOP_K" in globals() else 10

# If you want to evaluate files not produced by this run, set these:
PRED_PATH_OVERRIDE = None  # e.g. "my_runs/run_20260109/predictions.csv"
GOLD_PATH_OVERRIDE = None  # e.g. "my_gold/test.gold.csv"

# Optional outputs
SAVE_MERGED = False  # if True -> pass --out-merged
OUT_MERGED_OVERRIDE = None  # if None -> defaults next to predictions as merged_eval.csv

# Metrics CSV export (script feature)
SAVE_METRICS = True  # if True -> pass --out-metrics
OUT_METRICS_OVERRIDE = None  # if None -> defaults next to predictions as metrics_eval.csv
APPEND_METRICS = True  # if True -> pass --append-metrics

# Optional config.txt input for metadata (script feature)
# IMPORTANT: we DO NOT generate config.txt here. We only optionally consume one.
CONFIG_TXT_OVERRIDE = None  # e.g. "outputs/.../config.txt" (optional)

# Optional join/key overrides (rare)
GT_ID_COL = None
PRED_ID_COL = None

# Column name overrides (defaults match your script)
GT_GOLD_COL = "gold_target_iris"
GT_MATCH_COL = "match"
GT_TEXT_COL = "source_text"
PRED_TEXT_COL = "attribute_text"

# ============================================
# Resolve paths
# ============================================
# predictions
if PRED_PATH_OVERRIDE is not None:
    pred_path = Path(PRED_PATH_OVERRIDE).expanduser().resolve()
else:
    if "INFER_OUT_CSV" not in globals():
        raise ValueError("INFER_OUT_CSV is not set. Run inference or set PRED_PATH_OVERRIDE.")
    pred_path = Path(INFER_OUT_CSV).expanduser().resolve()

if not pred_path.exists():
    raise FileNotFoundError(f"Predictions file not found: {pred_path}")

# gold
if GOLD_PATH_OVERRIDE is not None:
    gold_path = Path(GOLD_PATH_OVERRIDE).expanduser().resolve()
else:
    if "OUT_DATASET_CSV" not in globals():
        raise ValueError("OUT_DATASET_CSV is not set. Build dataset or set GOLD_PATH_OVERRIDE.")
    gold_path = Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv").expanduser().resolve()

if not gold_path.exists():
    raise FileNotFoundError(f"Gold file not found: {gold_path}")

# script path
EVAL_SCRIPT = Path("testing/new_evaluate_inference.py").resolve()
if not EVAL_SCRIPT.exists():
    raise FileNotFoundError(f"Evaluation script not found: {EVAL_SCRIPT}")

# outputs (all default next to predictions)
EVAL_OUT_DIR = pred_path.parent
# Legacy path, kept for reference only: the evaluation no longer writes evaluation.log
EVAL_LOG = EVAL_OUT_DIR / "evaluation.log"

if OUT_MERGED_OVERRIDE is None:
    merged_out_path = EVAL_OUT_DIR / "merged_eval.csv"
else:
    merged_out_path = Path(OUT_MERGED_OVERRIDE).expanduser().resolve()

if OUT_METRICS_OVERRIDE is None:
    metrics_out_path = EVAL_OUT_DIR / "metrics_eval.csv"
else:
    metrics_out_path = Path(OUT_METRICS_OVERRIDE).expanduser().resolve()

# config.txt (optional input)
config_txt_path = None
if CONFIG_TXT_OVERRIDE is not None:
    config_txt_path = Path(CONFIG_TXT_OVERRIDE).expanduser().resolve()
    if not config_txt_path.exists():
        raise FileNotFoundError(f"CONFIG_TXT_OVERRIDE not found: {config_txt_path}")
if config_txt_path is None:
    candidate = pred_path.parent / "config.txt"
    if candidate.exists():
        config_txt_path = candidate

print("Evaluation setup OK.")
print("  pred_path:", pred_path)
print("  gold_path:", gold_path)
print("  eval_log :", EVAL_LOG, "(legacy path; no log file is written)")
print("  save_merged :", SAVE_MERGED, "| merged_out_path:", merged_out_path)
print("  save_metrics:", SAVE_METRICS, "| metrics_out_path:", metrics_out_path)
print("  config_txt:", config_txt_path)

Evaluation setup OK.
  pred_path: runs_toEvaluate/predictions_and_gold/predictions.csv
  gold_path: runs_toEvaluate/predictions_and_gold/training_dataset.test.gold.csv
  eval_log : runs_toEvaluate/predictions_and_gold/evaluation.log
  merged   : runs_toEvaluate/predictions_and_gold/merged_eval.csv


### Helper: run_cmd

Utility function to execute framework scripts as subprocesses and **stream their stdout/stderr live to the notebook**.

Key properties:
- output is printed line by line as the subprocess emits it (line-buffered pipe)
- subprocesses run with Weights & Biases explicitly disabled
- execution fails fast if the command exits with a non-zero return code

In [None]:
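import os
import subprocess
from pathlib import Path

def run_cmd(cmd, cwd: Path):
    """Run a command, stream its combined stdout/stderr to the notebook, and fail fast on errors."""
    cmd = [str(x) for x in cmd]

    print("\nRunning command:\n", " ".join(cmd))
    print("CWD:", Path(cwd).resolve())

    # Disable Weights & Biases in the child process
    env = os.environ.copy()
    env["WANDB_MODE"] = "disabled"
    env["WANDB_SILENT"] = "true"
    # Ask Python child processes not to buffer their output, so lines appear as they are produced
    env["PYTHONUNBUFFERED"] = "1"

    proc = subprocess.Popen(
        cmd,
        cwd=str(cwd),
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,                 # decode bytes to str (universal_newlines is a redundant alias)
        bufsize=1,                 # line-buffered
    )

    assert proc.stdout is not None
    for line in proc.stdout:
        print(line, end="")

    proc.wait()

    print("\nReturn code:", proc.returncode)
    if proc.returncode != 0:
        raise RuntimeError(f"Command failed with return code {proc.returncode}")

    return proc.returncode

print("run_cmd ready.")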

run_cmd ready.


## Evaluation execution and metrics export

This cell runs the **framework evaluation script** (`new_evaluate_inference.py`) as a subprocess,
delegating all evaluation logic to the framework itself.

What this step does:
- executes the evaluation on `predictions.csv` against the gold test split
- applies the framework’s join strategy (explicit IDs → `source_iri` → `row_id` → safe text fallback)
- computes coverage and ranking metrics **on positive examples only**:
  - Precision@1
  - Hits@K
  - MRR@K
- prints a detailed evaluation report directly in the notebook output
- optionally exports:
  - a merged predictions–gold CSV (for debugging/inspection)
  - a metrics CSV (append-friendly, suitable for cross-run comparison)

Notes:
- metrics persistence replaces the need for an `evaluation.log`
- run metadata can be attached to the metrics CSV via an optional `config.txt`
- all output is streamed live to the notebook for transparency and debuggability

The notebook acts purely as a launcher: evaluation semantics are fully defined and versioned
inside the framework script.
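
For orientation, the three ranking metrics can be sketched using their standard definitions. This is an assumption about what the framework script computes: the exact handling of ties, multiple gold IRIs, and missing predictions is defined by `new_evaluate_inference.py`, not by this snippet. The function name, the data layout, and the `t:*` IRIs are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def ranking_metrics(
    ranked: Sequence[List[str]],
    gold: Sequence[Set[str]],
    k: int,
) -> Dict[str, float]:
    """Precision@1, Hits@K and MRR@K over queries that have at least one gold IRI."""
    p1 = hits = mrr = 0.0
    n = 0
    for preds, golds in zip(ranked, gold):
        if not golds:
            continue  # negatives (no gold target) are excluded
        n += 1
        top_k = preds[:k]
        # 1-based rank of the first correct candidate within the top K, if any
        rank = next((i for i, iri in enumerate(top_k, start=1) if iri in golds), None)
        if rank is not None:
            hits += 1.0
            mrr += 1.0 / rank
            if rank == 1:
                p1 += 1.0
    if n == 0:
        return {"precision@1": 0.0, "hits@k": 0.0, "mrr@k": 0.0}
    return {"precision@1": p1 / n, "hits@k": hits / n, "mrr@k": mrr / n}


# Toy data: three positive queries and one negative (empty gold set)
ranked = [["t:A", "t:B"], ["t:C", "t:D"], ["t:E", "t:F"], ["t:X"]]
gold = [{"t:A"}, {"t:D"}, {"t:Z"}, set()]
metrics = ranking_metrics(ranked, gold, k=2)
print(metrics)
```

In this toy example one of three positives is correct at rank 1 and another at rank 2, so Precision@1 is 1/3, Hits@2 is 2/3, and MRR@2 is (1 + 1/2 + 0) / 3 = 0.5.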

In [None]:
from __future__ import annotations
import sys

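eval_cmd = [
    sys.executable, str(EVAL_SCRIPT),
    "--test-split", str(gold_path),
    "--predictions", str(pred_path),
    "--k", str(int(K)),
]

# Pass column overrides only when they are set: str(None) would otherwise reach
# the script as the literal string "None"
for flag, value in [
    ("--gt-gold-col", GT_GOLD_COL),
    ("--gt-match-col", GT_MATCH_COL),
    ("--gt-text-col", GT_TEXT_COL),
    ("--pred-text-col", PRED_TEXT_COL),
]:
    if value is not None:
        eval_cmd += [flag, str(value)]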

if GT_ID_COL is not None:
    eval_cmd += ["--gt-id-col", str(GT_ID_COL)]
if PRED_ID_COL is not None:
    eval_cmd += ["--pred-id-col", str(PRED_ID_COL)]

if SAVE_MERGED:
    eval_cmd += ["--out-merged", str(merged_out_path)]

if SAVE_METRICS:
    eval_cmd += ["--out-metrics", str(metrics_out_path)]
    if APPEND_METRICS:
        eval_cmd += ["--append-metrics"]
    if config_txt_path is not None:
        eval_cmd += ["--config", str(config_txt_path)]

run_cmd(eval_cmd, cwd=REPO_ROOT)

print("\nEvaluation completed.")
print("Metrics CSV:", metrics_out_path.resolve() if SAVE_METRICS else "(not saved)")


Running command:
 /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject/OAvenv/bin/python testing/new_evaluate_inference.py --test-split runs_toEvaluate/predictions_and_gold/training_dataset.test.gold.csv --predictions runs_toEvaluate/predictions_and_gold/predictions.csv --k 20 --gt-gold-col gold_target_iris --gt-match-col None --gt-text-col None --pred-text-col attribute_text --gt-id-col source_iri --pred-id-col source_iri --out-merged runs_toEvaluate/predictions_and_gold/merged_eval.csv
CWD: /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject
Log: /Users/usermastro/Desktop/Primo_Semestre_2526/ADSP/Ontology Alignment Project/OAProject/runs_toEvaluate/predictions_and_gold/evaluation.log

=== Evaluation Report ===
Join method: id_col | pred[source_iri] == gt[source_iri]
Pred rows: 67
Coverage (predicted_iri != null): 1.0000
GT attach rate (gold present after join): 1.0000

Retrieval source distribution:
     exact: 0