# Unified Ontology Alignment Pipeline  
### Training · Offline Indexing · Inference (Modular & Restorable)

This notebook provides a **unified, modular pipeline** for ontology alignment based on transformer models (cross-encoder + bi-encoder), designed to be **fully reproducible**, **restartable**, and **flexible** across different execution scenarios.

The pipeline supports:
- dataset construction and model training,
- offline preprocessing for efficient inference,
- large-scale inference with configurable retrieval and scoring,
- restoration of previously trained models and offline artifacts.

The notebook is designed for **Colab-first usage**, but the logic mirrors a production-ready CLI pipeline.

---

## What you can do with this notebook

Depending on your needs, you can use this notebook to:

- **Run the full pipeline end-to-end**
  - build the training dataset,
  - train a cross-encoder,
  - build the offline semantic index,
  - run inference on test or custom queries.

- **Run only specific stages**
  - dataset construction + training only,
  - offline preprocessing only,
  - inference only (using previously generated artifacts).

- **Restore and reuse artifacts**
  - load a previously trained cross-encoder,
  - load a precomputed offline bundle,
  - run inference without re-training or re-indexing.

This avoids unnecessary recomputation and makes the notebook suitable for:
- long Colab sessions,
- interrupted runs,
- collaborative workflows,
- reproducible experiments.

---

## Pipeline stages (conceptual)

The pipeline is logically divided into **three independent stages**:

1. **Training & Dataset Construction**
   - Loads source/target ontologies and gold alignments.
   - Builds a labeled dataset with positives, hard negatives, and random negatives.
   - Trains a cross-encoder model for fine-grained scoring.

2. **Offline Preprocessing**
   - Builds an internal ontology representation.
   - Computes and stores semantic embeddings using a bi-encoder.
   - Produces an offline bundle optimized for fast retrieval during inference.

3. **Inference**
   - Retrieves candidate matches using lexical and/or semantic search.
   - Scores candidates using the trained cross-encoder.
   - Outputs ranked predictions for each input query.

Each stage can be executed **independently** or **skipped and restored**.

---

## Execution modes

The notebook supports multiple execution modes through explicit flags.

Typical usage patterns include:

### Full pipeline (from scratch)
- Training ✔
- Offline preprocessing ✔
- Inference ✔

### Training only
- Training ✔
- Offline preprocessing ✘
- Inference ✘

### Offline preprocessing only
- Training ✘
- Offline preprocessing ✔
- Inference ✘

### Inference only (recommended for experimentation)
- Training ✘ (model restored)
- Offline preprocessing ✘ (bundle restored)
- Inference ✔

This makes it easy to:
- train once,
- reuse models and offline artifacts many times,
- experiment with different inference parameters or input queries.
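
The four usage patterns above map onto the run flags defined later in the notebook (`DO_TRAINING`, `DO_OFFLINE`, `DO_INFERENCE`, `RESTORE_MODEL`, `RESTORE_OFFLINE`). As a minimal sketch, the mapping could be kept in a single table. The `PRESETS` dictionary and the `flags_for` helper are hypothetical and not part of the notebook.

```python
# Hypothetical presets: each usage pattern expressed as run-flag values.
PRESETS = {
    "full": {
        "DO_TRAINING": True, "DO_OFFLINE": True, "DO_INFERENCE": True,
        "RESTORE_MODEL": False, "RESTORE_OFFLINE": False,
    },
    "training-only": {
        "DO_TRAINING": True, "DO_OFFLINE": False, "DO_INFERENCE": False,
        "RESTORE_MODEL": False, "RESTORE_OFFLINE": False,
    },
    "offline-only": {
        "DO_TRAINING": False, "DO_OFFLINE": True, "DO_INFERENCE": False,
        "RESTORE_MODEL": False, "RESTORE_OFFLINE": False,
    },
    "inference-only": {
        "DO_TRAINING": False, "DO_OFFLINE": False, "DO_INFERENCE": True,
        "RESTORE_MODEL": True, "RESTORE_OFFLINE": True,
    },
}


def flags_for(mode: str) -> dict:
    """Return a copy of the flag values for one usage pattern."""
    if mode not in PRESETS:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(PRESETS)}")
    return dict(PRESETS[mode])


flags = flags_for("inference-only")
print(flags)
```

In a notebook cell, `globals().update(flags_for("inference-only"))` would set all five flags at once.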

---

## Artifact restoration philosophy

Instead of hardcoding paths or forcing recomputation, this notebook allows you to:

- **restore model artifacts** (trained cross-encoder),
- **restore offline artifacts** (semantic bundle + ontology CSV),
- **override inference inputs** (custom query files, different schemas).

All overrides are performed **inside dedicated cells**, keeping:
- configuration centralized,
- logic explicit,
- side effects local and controlled.

---

## Notebook structure (high level)

1. **Setup**
   - Clone repository and install dependencies.

2. **Configuration**
   - Centralized definition of paths, models, hyperparameters, and defaults.

3. **Run Mode Flags**
   - Choose which stages to execute and which artifacts to restore.

4. **Optional Restore Cells**
   - Restore offline inputs (for offline preprocessing).
   - Restore offline artifacts (skip preprocessing).
   - Restore model + inference inputs (skip training).

5. **Execution Cells**
   - Training stage.
   - Offline preprocessing stage.
   - Inference stage.

6. **Export & Download**
   - Zip and download results (predictions, gold alignments, artifacts).

---

## Design goals

This notebook is designed with the following principles in mind:

- **Modularity** – each stage is independent and reusable.
- **Reproducibility** – all artifacts are explicit and versionable.
- **Efficiency** – avoid recomputation whenever possible.
- **Clarity** – configuration, execution, and restoration are clearly separated.

If you follow the intended flow, you should never need to modify the core execution cells—only the configuration and run flags.

---

*You are now ready to configure and run the pipeline according to your needs.*

---

## Setup — Repository and Dependencies

This cell prepares the execution environment and makes the notebook **fully reproducible** on platforms such as Google Colab or similar hosted environments.

Specifically, the cell performs the following steps:

1. **Clones the project repository** (or reuses it if it is already present).
2. **Sets the working directory** to the repository root, so that the core scripts
   (`training.py`, `build_ontology_bundle.py`, `run_inference.py`) can be executed
   directly from the notebook.
3. **Installs Python dependencies** from `requirements.txt`, if the file is present.
4. **Runs sanity checks** to ensure that the main framework scripts are available
   in the expected locations.

After this cell completes successfully, the notebook can safely execute the full
pipeline: **training → offline bundle → inference**, in a single, unified workflow.

In [None]:
# ============================================
# SETUP (clone repo + install deps)
# ============================================
from pathlib import Path
import os
import subprocess

REPO_URL = "https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git"  # <-- change if needed
REPO_DIR = Path("repo").resolve()

def sh(cmd: str):
    print("\n$", cmd)
    subprocess.check_call(cmd, shell=True)

# 1) Clone (or reuse)
if not REPO_DIR.exists():
    sh(f'git clone {REPO_URL} "{REPO_DIR}"')  # quote the path in case it contains spaces
else:
    print("Repo already present at:", REPO_DIR)

# 2) Move into repo
os.chdir(REPO_DIR)
print("CWD:", Path(".").resolve())

# 3) Install requirements (best-effort)
if Path("requirements.txt").exists():
    sh("pip -q install -r requirements.txt")
else:
    print("No requirements.txt found. Skipping pip install.")

# 4) Sanity checks: scripts must exist
for p in ["training.py", "build_ontology_bundle.py", "run_inference.py"]:
    if not Path(p).exists():
        raise FileNotFoundError(f"Missing {p} in repo root: {Path('.').resolve()}")

print("Setup OK: repo + scripts found.")

---

## Configuration (always run this cell)

This cell defines the **canonical configuration** of the pipeline.

**This cell must always be executed**, regardless of which stages you plan to run
(training, offline preprocessing, inference, or any combination of them).

### What this cell does

- Defines a **unique run directory** under `outputs/` (`RUN_ID`)
- Sets **default paths**, **model choices**, and **hyperparameters**
- Establishes the **canonical variable names** used by all subsequent cells

All downstream cells (training, offline preprocessing, inference, restore) **depend on these variables being defined first**.

### Important design principle: defaults + overrides

This configuration cell provides **sane defaults** assuming a **full pipeline run**:

- dataset construction + training
- offline preprocessing
- inference on the test split

However, **not all variables defined here are always used as-is**.

Depending on the selected run mode, some variables will be **intentionally overridden** later:

- If you **restore a trained model**,  
  `CROSS_ENCODER_MODEL_ID` will be overwritten in the *Restore Model* cell.

- If you **restore offline artifacts**,  
  `OFFLINE_BUNDLE_PKL` and `ONTOLOGY_INTERNAL_CSV` will be overwritten in the *Restore Offline Artifacts* cell.

- If you **run inference on a custom query file**,  
  `INFER_INPUT_CSV`, `RETRIEVAL_COL`, `SCORING_COL`, and `ID_COL` may be overridden in the *Inference Restore / Override* cell.

This is **by design**:
- defaults live here,
- overrides live **only** in the cells that also load the corresponding artifacts,
- no hidden or implicit state changes.
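
To make the "defaults + overrides" idea concrete, here is a minimal sketch of an override helper that only touches variables that already exist and reports every change. The `apply_overrides` function and the paths are illustrative assumptions; the notebook's restore cells may simply reassign the variables directly.

```python
def apply_overrides(config: dict, **overrides) -> dict:
    """Overwrite existing keys only and report each change as (old, new)."""
    changes = {}
    for name, new_value in overrides.items():
        if name not in config:
            # Refuse to create new variables: overrides must target known defaults.
            raise KeyError(f"{name} is not defined in the configuration")
        changes[name] = (config[name], new_value)
        config[name] = new_value
        print(f"{name}: {changes[name][0]!r} -> {new_value!r}")
    return changes


# Illustrative defaults (paths are placeholders, not real run outputs).
config = {
    "CROSS_ENCODER_MODEL_ID": "outputs/example_run/training/final_cross_encoder_model",
    "INFER_INPUT_CSV": "outputs/example_run/training/training_dataset.test.queries.csv",
}

changes = apply_overrides(
    config,
    CROSS_ENCODER_MODEL_ID="restored/final_cross_encoder_model",
)
```

Inside the notebook, passing `globals()` as `config` would apply the same pattern to the canonical variables.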

### What you should (and should not) change here

**You should edit this cell to**:
- change ontology paths,
- change model names,
- tune hyperparameters,
- adjust retrieval/scoring defaults,
- control output layout.

**You should NOT edit this cell to**:
- point to restored models or offline bundles,
- change inference inputs for a specific run.

Those actions belong to the dedicated *restore / override* cells below.

### Mental model

Think of this cell as:

> “The baseline configuration of the experiment.”

All other cells either:
- **use it as-is**, or
- **explicitly override parts of it**, in a visible and controlled way.

If something goes wrong later, this cell is always the first place to look.

---

When this cell finishes executing, the notebook is in a **well-defined initial state**, ready for:
- full execution,
- partial execution,
- or artifact restoration.

In [None]:
# ============================================
# CONFIGURATION (unified training -> offline -> inference)
# ============================================
from pathlib import Path
from datetime import datetime

REPO_ROOT = Path(".").resolve()
print("REPO_ROOT:", REPO_ROOT)

# -----------------------------
# Run id / output layout
# -----------------------------
RUN_ID = f"unified_run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
OUT_DIR = Path("outputs") / RUN_ID
OUT_DIR.mkdir(parents=True, exist_ok=True)

TRAIN_DIR = OUT_DIR / "training"
OFFLINE_DIR = OUT_DIR / "offline"
INFER_DIR = OUT_DIR / "inference"
TRAIN_DIR.mkdir(parents=True, exist_ok=True)
OFFLINE_DIR.mkdir(parents=True, exist_ok=True)
INFER_DIR.mkdir(parents=True, exist_ok=True)

print("RUN_ID:", RUN_ID)
print("OUT_DIR:", OUT_DIR)

# -----------------------------
# Training mode and model
# -----------------------------
RUN_MODE = "full"  # "full" | "build-dataset" | "train-only"
MODEL_TYPE = "cross-encoder"  # keep this if you want inference at the end
# Base checkpoint passed to training.py via --model-name; also reused as CROSS_TOKENIZER_NAME.
# The bi-encoder used for offline retrieval is configured separately (BI_ENCODER_MODEL_ID).
MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_EPOCHS = 10

HYPERPARAMETER_TUNING = False
N_TRIALS = 5

USE_FIXED_HYPERPARAMS = True
LEARNING_RATE = 3e-5
BATCH_SIZE = 16
WEIGHT_DECAY = 0.01

SPLIT_RATIOS = "0.75,0.15,0.10"

# -----------------------------
# Inputs for dataset building
# -----------------------------
SRC_PATH = "data/sweet.owl"
TGT_PATH = "data/envo.owl"
ALIGN_PATH = "data/envo-sweet.rdf"

SRC_PREFIX = None
TGT_PREFIX = None  # e.g. "http://purl.obolibrary.org/obo/ENVO_"

USE_DESCRIPTION = True
USE_SYNONYMS = True
USE_PARENTS = True
USE_EQUIVALENT = True
USE_DISJOINT = True

VISUALIZE = False

# -----------------------------
# Canonical outputs of STEP 1
# -----------------------------
OUT_SRC_CSV = str(TRAIN_DIR / "source_ontology.csv")
OUT_TGT_CSV = str(TRAIN_DIR / "target_ontology.csv")
OUT_DATASET_CSV = str(TRAIN_DIR / "training_dataset.csv")

TRAIN_SPLIT_CSV = str(Path(OUT_DATASET_CSV).with_suffix(".train.csv"))
VAL_SPLIT_CSV   = str(Path(OUT_DATASET_CSV).with_suffix(".val.csv"))
TEST_SPLIT_CSV  = str(Path(OUT_DATASET_CSV).with_suffix(".test.csv"))

# train-only mode
DATASET_CSV = TRAIN_SPLIT_CSV

# model outputs
MODEL_OUT_DIR = str(TRAIN_DIR / "models" / f"{MODEL_TYPE}_custom")
FINAL_CROSS_ENCODER_DIR = str(Path(MODEL_OUT_DIR) / "final_cross_encoder_model")

# -----------------------------
# Offline bundle builder
# -----------------------------
OFFLINE_EXPORT_CSV = None
OFFLINE_ONT_PATH = TGT_PATH
OFFLINE_PREFIX = TGT_PREFIX

CROSS_TOKENIZER_NAME = MODEL_NAME

BI_ENCODER_MODEL_ID = "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb" # or "allenai/scibert_scivocab_uncased"
OFFLINE_SEMANTIC_BATCH_SIZE = 64
OFFLINE_SEMANTIC_MAX_LENGTH = 256
OFFLINE_NO_SEMANTIC_NORMALIZE = False

ONTOLOGY_INTERNAL_CSV = str(OFFLINE_DIR / "ontology_internal.csv")
OFFLINE_BUNDLE_PKL = str(OFFLINE_DIR / "offline_bundle.pkl")

# -----------------------------
# Inference
# -----------------------------
CROSS_ENCODER_MODEL_ID = FINAL_CROSS_ENCODER_DIR

INFER_INPUT_CSV = str(Path(OUT_DATASET_CSV).with_suffix(".test.queries.csv"))
INFER_OUT_CSV = str(INFER_DIR / "predictions.csv")

RETRIEVAL_COL = "source_label"
SCORING_COL = "source_text"
ID_COL = "source_iri"

INFER_MODE = "hybrid"
RETRIEVAL_LEXICAL_TOP_K = 100
RETRIEVAL_SEMANTIC_TOP_K = 100
RETRIEVAL_MERGED_TOP_K = 150
HYBRID_RATIO_SEMANTIC = 0.5
SEMANTIC_BATCH_SIZE = 64

CROSS_TOP_K = 20
CROSS_BATCH_SIZE = 32
CROSS_MAX_LENGTH = 256

KEEP_TOP_N = 0

print("Config OK.")

---

## Run Mode Flags (what to execute in this session)

This cell defines **which stages of the pipeline will be executed in the current session**  
and whether previously generated artifacts should be **restored instead of rebuilt**.

Think of it as the **execution control panel** of the notebook.

### Main execution toggles

These flags control **which pipeline stages are executed**:

- **`DO_TRAINING`**  
  Build the dataset (if needed) and train the model.

- **`DO_OFFLINE`**  
  Build the offline artifacts used for retrieval  
  (ontology internal CSV + offline bundle).

- **`DO_INFERENCE`**  
  Run inference on a query CSV using:
  - a trained or restored cross-encoder
  - offline retrieval artifacts

You can run:
- the **full pipeline** (all `True`),
- **only a subset** of stages,
- or **only inference** on restored artifacts.

### Restore toggles (skip building, load artifacts)

These flags allow you to **reuse artifacts from a previous run** (or another machine)
instead of rebuilding them.

- **`RESTORE_MODEL`**  
  Skip training and **load a previously trained cross-encoder**  
  (and optionally override inference input/schema).

- **`RESTORE_OFFLINE`**  
  Skip offline preprocessing and **load an existing offline bundle**
  (ontology CSV + semantic index).

Each restore cell:
- loads the required files,
- **overwrites the corresponding configuration variables**, and
- makes the notebook ready for inference without rerunning earlier stages.

### Guardrails and coherence checks

The checks below are **advisory**: they print notes about inconsistent flag combinations but do not stop execution.

- Inference **requires offline artifacts**  
  → either build them (`DO_OFFLINE=True`) or restore them (`RESTORE_OFFLINE=True`).  
  If neither flag is set, the artifacts are assumed to already exist on disk.

- Inference **requires a model**  
  → either train it (`DO_TRAINING=True`) or restore it (`RESTORE_MODEL=True`).

A note is also printed when you enable both *build* and *restore* for the same stage  
(the **last executed cell wins**, by design).

In [None]:
# ============================================
# RUN MODE FLAGS (choose what to run today)
# ============================================

# Main toggles: what stages to execute
DO_TRAINING  = True
DO_OFFLINE   = True
DO_INFERENCE = True

# Restore toggles: if True, skip building that stage and load artifacts instead
RESTORE_MODEL   = False   # restores cross-encoder + (optionally) custom inference input CSV/schema
RESTORE_OFFLINE = False   # restores offline bundle + ontology CSV

# -----------------------------
# Guardrails (keep logic coherent)
# -----------------------------
if DO_INFERENCE and not (DO_OFFLINE or RESTORE_OFFLINE):
    print(
        "Note: DO_INFERENCE=True without DO_OFFLINE/RESTORE_OFFLINE.\n"
        "Assuming offline artifacts already exist on disk "
        "(from a previous run or manual configuration)."
    )

if DO_INFERENCE and not (DO_TRAINING or RESTORE_MODEL):
    print(
        "Note: inference requires CROSS_ENCODER_MODEL_ID to be set "
        "(either via training or restore)."
    )

# Coherence hints
if DO_TRAINING and RESTORE_MODEL:
    print("Note: DO_TRAINING=True but RESTORE_MODEL=True. Training will run; restore can overwrite the model id if executed after training.")
if DO_OFFLINE and RESTORE_OFFLINE:
    print("Note: DO_OFFLINE=True but RESTORE_OFFLINE=True. Offline will run; restore can overwrite the offline paths if executed after offline.")
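
The cell above only prints notes. If you prefer to fail fast, a stricter variant could collect the problems and raise before any stage runs. This is a sketch of that alternative; the `check_run_flags` function is hypothetical and not part of the notebook.

```python
def check_run_flags(do_training: bool, do_offline: bool, do_inference: bool,
                    restore_model: bool, restore_offline: bool) -> list[str]:
    """Return a list of human-readable problems (empty if the flags are coherent)."""
    problems = []
    if do_inference and not (do_offline or restore_offline):
        problems.append("Inference needs offline artifacts: set DO_OFFLINE or RESTORE_OFFLINE.")
    if do_inference and not (do_training or restore_model):
        problems.append("Inference needs a model: set DO_TRAINING or RESTORE_MODEL.")
    return problems


# Example: inference with a restored model but no offline artifacts.
problems = check_run_flags(
    do_training=False, do_offline=False, do_inference=True,
    restore_model=True, restore_offline=False,
)
print(problems)
```

A strict cell would then end with `if problems: raise ValueError("\n".join(problems))`.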

---

### Download helpers (config.txt + ZIP packaging)

This cell defines **utility functions** used by all download steps in the notebook.

What these helpers do:

- **`config.txt` generation**
  - Captures the *effective configuration of the run* (after Configuration, Restore, and Overrides)
  - Includes model choices, training settings, offline bundle parameters, and inference knobs
  - Records the main settings needed to reproduce the run outside Colab (dataset inputs such as `SRC_PATH` and the `USE_*` flags are not captured)

- **ZIP creation**
  - Packages selected output artifacts (e.g. predictions, gold, bundles)
  - Automatically includes the generated `config.txt`
  - Produces a single, shareable archive for analysis or collaboration

Design notes:

- Helpers are defined **once** and reused by all download cells
- `config.txt` always reflects the **final state of global variables**
- No computation is triggered here — this cell only defines functions

This keeps the notebook clean, modular, and reproducible.

In [None]:
# ============================================
# DOWNLOAD HELPERS (config.txt + zip)
# ============================================

from pathlib import Path
from datetime import datetime
import zipfile

def write_config_txt(config_dir: Path) -> Path:
    """
    Creates a config.txt in config_dir describing the current run configuration.
    Returns the path to the created file.
    """
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)

    config_lines = [
        "# Ontology Alignment – Run Configuration",
        f"# Generated on: {datetime.now().isoformat()}",
        "",
        "[Run]",
        f"RUN_ID = {globals().get('RUN_ID', None)}",
        f"OUT_DIR = {globals().get('OUT_DIR', None)}",
        f"RUN_MODE = {globals().get('RUN_MODE', None)}",
        "",
        "[Model]",
        f"MODEL_TYPE = {globals().get('MODEL_TYPE', None)}",
        f"MODEL_NAME = {globals().get('MODEL_NAME', None)}",
        f"CROSS_ENCODER_MODEL_ID = {globals().get('CROSS_ENCODER_MODEL_ID', None)}",
        f"BI_ENCODER_MODEL_ID = {globals().get('BI_ENCODER_MODEL_ID', None)}",
        f"CROSS_TOKENIZER_NAME = {globals().get('CROSS_TOKENIZER_NAME', None)}",
        "",
        "[Training]",
        f"NUM_EPOCHS = {globals().get('NUM_EPOCHS', None)}",
        f"LEARNING_RATE = {globals().get('LEARNING_RATE', None)}",
        f"BATCH_SIZE = {globals().get('BATCH_SIZE', None)}",
        f"WEIGHT_DECAY = {globals().get('WEIGHT_DECAY', None)}",
        f"SPLIT_RATIOS = {globals().get('SPLIT_RATIOS', None)}",
        "",
        "[Offline]",
        f"OFFLINE_ONT_PATH = {globals().get('OFFLINE_ONT_PATH', None)}",
        f"OFFLINE_PREFIX = {globals().get('OFFLINE_PREFIX', None)}",
        f"OFFLINE_SEMANTIC_BATCH_SIZE = {globals().get('OFFLINE_SEMANTIC_BATCH_SIZE', None)}",
        f"OFFLINE_SEMANTIC_MAX_LENGTH = {globals().get('OFFLINE_SEMANTIC_MAX_LENGTH', None)}",
        f"OFFLINE_BUNDLE_PKL = {globals().get('OFFLINE_BUNDLE_PKL', None)}",
        f"ONTOLOGY_INTERNAL_CSV = {globals().get('ONTOLOGY_INTERNAL_CSV', None)}",
        "",
        "[Inference]",
        f"INFER_MODE = {globals().get('INFER_MODE', None)}",
        f"INFER_INPUT_CSV = {globals().get('INFER_INPUT_CSV', None)}",
        f"INFER_OUT_CSV = {globals().get('INFER_OUT_CSV', None)}",
        f"RETRIEVAL_COL = {globals().get('RETRIEVAL_COL', None)}",
        f"SCORING_COL = {globals().get('SCORING_COL', None)}",
        f"ID_COL = {globals().get('ID_COL', None)}",
        f"RETRIEVAL_LEXICAL_TOP_K = {globals().get('RETRIEVAL_LEXICAL_TOP_K', None)}",
        f"RETRIEVAL_SEMANTIC_TOP_K = {globals().get('RETRIEVAL_SEMANTIC_TOP_K', None)}",
        f"RETRIEVAL_MERGED_TOP_K = {globals().get('RETRIEVAL_MERGED_TOP_K', None)}",
        f"HYBRID_RATIO_SEMANTIC = {globals().get('HYBRID_RATIO_SEMANTIC', None)}",
        f"CROSS_TOP_K = {globals().get('CROSS_TOP_K', None)}",
        f"CROSS_BATCH_SIZE = {globals().get('CROSS_BATCH_SIZE', None)}",
        f"CROSS_MAX_LENGTH = {globals().get('CROSS_MAX_LENGTH', None)}",
        f"KEEP_TOP_N = {globals().get('KEEP_TOP_N', None)}",
    ]

    config_path = config_dir / "config.txt"
    config_path.write_text("\n".join(config_lines), encoding="utf-8")  # header contains a non-ASCII dash
    return config_path


def make_zip_with_config(
    zip_path: Path,
    files_to_include: list[Path],
    config_dir: Path | None = None,
    download: bool = True,
) -> Path:
    """
    Creates a zip that includes:
      - all files in files_to_include that exist
      - a config.txt written into config_dir (defaults to zip_path.parent)
    Returns the zip path.
    """
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)

    if config_dir is None:
        config_dir = zip_path.parent
    config_path = write_config_txt(Path(config_dir))

    existing_files = []
    for p in files_to_include:
        p = Path(p)
        if p.exists():
            existing_files.append(p)

    if not existing_files:
        raise FileNotFoundError("None of the requested files exist. Nothing to zip.")

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        # artifacts
        for p in existing_files:
            z.write(p, arcname=p.name)
        # config
        z.write(config_path, arcname="config.txt")

    print("Created ZIP:", zip_path)
    print("Included files:", [p.name for p in existing_files], "+ config.txt")

    if download:
        try:
            from google.colab import files as colab_files
        except ImportError as e:
            raise RuntimeError("files.download() is Colab-only.") from e
        colab_files.download(str(zip_path))

    return zip_path


print("Download helpers OK.")
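
Because `config.txt` uses `[Section]` headers and `KEY = value` lines, it can be read back with the standard-library `configparser`, for example to compare two runs. The sketch below parses a shortened sample; the sample values are illustrative, and every value comes back as a string (unset variables appear as the string `"None"`).

```python
import configparser

# Shortened sample in the same layout that write_config_txt produces
# (values are illustrative, not taken from a real run).
sample = """\
# Ontology Alignment - Run Configuration
# Generated on: 2025-01-01T00:00:00

[Run]
RUN_ID = unified_run_20250101_000000

[Inference]
INFER_MODE = hybrid
CROSS_TOP_K = 20
KEEP_TOP_N = 0
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep keys upper-case instead of lower-casing them
parser.read_string(sample)

settings = {section: dict(parser[section]) for section in parser.sections()}
print(settings["Inference"])
```

For a real archive, `parser.read("config.txt", encoding="utf-8")` would replace the `read_string` call.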

---

## 1) Run full pipeline (Dataset Construction & Training -> Offline Preprocessing -> Inference)

This section executes:
1) `training.py`
2) `build_ontology_bundle.py`
3) `run_inference.py`

Logs are written under the run directory.
The notebook stops immediately if any step fails.

In [None]:
# ============================================
# RUN PIPELINE (training -> offline -> inference)
# ============================================
from pathlib import Path
import subprocess  # imported here too so the cell also works after a kernel restart

def print_tail(path: Path, n=120):
    p = Path(path)
    if not p.exists():
        print(f"[tail] log not found: {p}")
        return
    lines = p.read_text(errors="replace").splitlines()
    print("\n".join(lines[-n:]))

def run_cmd(cmd, log_path: Path, cwd: Path):
    print("\nRunning command:\n", " ".join(cmd))
    print("CWD:", cwd)
    print("Log:", log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)

    with open(log_path, "w") as f:
        proc = subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, cwd=str(cwd))

    print("Return code:", proc.returncode)
    if proc.returncode != 0:
        print("!!! Error occurred. Last lines of log:")
        print_tail(log_path, n=120)
        raise RuntimeError(f"Command failed with return code {proc.returncode}. See log: {log_path}")
    return proc.returncode

# Guardrails
if RUN_MODE == "full" and MODEL_TYPE != "cross-encoder":
    raise ValueError("RUN_MODE='full' ends with inference => needs MODEL_TYPE='cross-encoder'.")

if HYPERPARAMETER_TUNING and RUN_MODE != "full":
    raise ValueError("--tune only allowed in RUN_MODE='full'.")

Path(MODEL_OUT_DIR).mkdir(parents=True, exist_ok=True)

# -----------------------------
# STEP 1) TRAINING
# -----------------------------
train_log = TRAIN_DIR / "training.log"

train_cmd = ["python", "training.py", "--mode", RUN_MODE]

if RUN_MODE in {"full", "build-dataset"}:
    train_cmd += ["--src", SRC_PATH, "--tgt", TGT_PATH, "--align", ALIGN_PATH]
    train_cmd += ["--out-src", OUT_SRC_CSV, "--out-tgt", OUT_TGT_CSV, "--out-dataset", OUT_DATASET_CSV]
    train_cmd += ["--split-ratios", SPLIT_RATIOS]

    if SRC_PREFIX:
        train_cmd += ["--src-prefix", SRC_PREFIX]
    if TGT_PREFIX:
        train_cmd += ["--tgt-prefix", TGT_PREFIX]

    if USE_DESCRIPTION: train_cmd.append("--src-use-description")
    if USE_SYNONYMS: train_cmd.append("--src-use-synonyms")
    if USE_PARENTS: train_cmd.append("--src-use-parents")
    if USE_EQUIVALENT: train_cmd.append("--src-use-equivalent")
    if USE_DISJOINT: train_cmd.append("--src-use-disjoint")
    if VISUALIZE: train_cmd.append("--visualize-alignments")

if RUN_MODE in {"full", "train-only"}:
    train_cmd += ["--model-type", MODEL_TYPE, "--model-name", MODEL_NAME, "--model-output-dir", MODEL_OUT_DIR]
    train_cmd += ["--num-epochs", str(NUM_EPOCHS)]

    if HYPERPARAMETER_TUNING:
        train_cmd += ["--tune", "--n-trials", str(N_TRIALS)]
    elif USE_FIXED_HYPERPARAMS:
        train_cmd += ["--learning-rate", str(LEARNING_RATE)]
        train_cmd += ["--batch-size", str(BATCH_SIZE)]
        train_cmd += ["--weight-decay", str(WEIGHT_DECAY)]

if RUN_MODE == "train-only":
    train_cmd += ["--dataset-csv", DATASET_CSV]

run_cmd(train_cmd, train_log, cwd=REPO_ROOT)

print("\nTraining completed.")
print("Dataset CSV:", OUT_DATASET_CSV)
print("Train split:", TRAIN_SPLIT_CSV)
print("Val split:", VAL_SPLIT_CSV)
print("Test split:", TEST_SPLIT_CSV)
print("Cross-encoder dir:", FINAL_CROSS_ENCODER_DIR)

CROSS_ENCODER_MODEL_ID = FINAL_CROSS_ENCODER_DIR

# -----------------------------
# STEP 2) OFFLINE BUNDLE
# -----------------------------
offline_log = OFFLINE_DIR / "offline_bundle.log"

offline_cmd = [
    "python", "build_ontology_bundle.py",
    "--out-csv", ONTOLOGY_INTERNAL_CSV,
    "--out-bundle", OFFLINE_BUNDLE_PKL,
    "--tokenizer-name", CROSS_TOKENIZER_NAME,
    "--bi-encoder-model-id", BI_ENCODER_MODEL_ID,
    "--semantic-batch-size", str(OFFLINE_SEMANTIC_BATCH_SIZE),
    "--semantic-max-length", str(OFFLINE_SEMANTIC_MAX_LENGTH),
]
if OFFLINE_NO_SEMANTIC_NORMALIZE:
    offline_cmd.append("--no-semantic-normalize")

if OFFLINE_EXPORT_CSV:
    offline_cmd += ["--export-csv", OFFLINE_EXPORT_CSV]
else:
    offline_cmd += ["--ont-path", OFFLINE_ONT_PATH]
    if OFFLINE_PREFIX:
        offline_cmd += ["--prefix", OFFLINE_PREFIX]

run_cmd(offline_cmd, offline_log, cwd=REPO_ROOT)

print("\nOffline bundle completed.")
print("Ontology internal CSV:", ONTOLOGY_INTERNAL_CSV)
print("Offline bundle PKL:", OFFLINE_BUNDLE_PKL)

# -----------------------------
# STEP 3) INFERENCE
# -----------------------------
infer_log = INFER_DIR / "inference.log"

if not Path(INFER_INPUT_CSV).exists():
    raise FileNotFoundError(
        f"INFER_INPUT_CSV not found: {INFER_INPUT_CSV}\n"
        "In full/build-dataset mode, training should generate *.test.queries.csv. "
        "If you want a custom query file, set INFER_INPUT_CSV to its path."
    )

infer_cmd = [
    "python", "run_inference.py",
    "--bundle", OFFLINE_BUNDLE_PKL,
    "--ontology-csv", ONTOLOGY_INTERNAL_CSV,
    "--input-csv", INFER_INPUT_CSV,
    "--out-csv", INFER_OUT_CSV,
    "--mode", INFER_MODE,
    "--cross-tokenizer-name", CROSS_TOKENIZER_NAME,
    "--cross-encoder-model-id", CROSS_ENCODER_MODEL_ID,
    "--retrieval-col", RETRIEVAL_COL,
    "--retrieval-lexical-top-k", str(RETRIEVAL_LEXICAL_TOP_K),
    "--retrieval-semantic-top-k", str(RETRIEVAL_SEMANTIC_TOP_K),
    "--retrieval-merged-top-k", str(RETRIEVAL_MERGED_TOP_K),
    "--hybrid-ratio-semantic", str(HYBRID_RATIO_SEMANTIC),
    "--semantic-batch-size", str(SEMANTIC_BATCH_SIZE),
    "--cross-top-k", str(CROSS_TOP_K),
    "--cross-batch-size", str(CROSS_BATCH_SIZE),
    "--cross-max-length", str(CROSS_MAX_LENGTH),
    "--keep-top-n", str(KEEP_TOP_N),
]
if SCORING_COL:
    infer_cmd += ["--scoring-col", SCORING_COL]
if ID_COL:
    infer_cmd += ["--id-col", ID_COL]

run_cmd(infer_cmd, infer_log, cwd=REPO_ROOT)

print("\nUnified pipeline completed successfully.")
print("Outputs:")
print(" - Training:", TRAIN_DIR)
print(" - Offline bundle:", OFFLINE_DIR)
print(" - Inference:", INFER_DIR)
print("Predictions CSV:", INFER_OUT_CSV)

---

### 1.1) Download inference outputs, gold standard, and configuration

This cell is a **utility step for result extraction**.  
It is designed to be used **after a full pipeline run**, where both predictions and gold labels are available.

What the cell does:

- Locates the inference output file:
  - `predictions.csv` generated by `run_inference.py`
- Locates the corresponding gold standard file:
  - `*.test.gold.csv` generated during dataset construction in `training.py`
- Verifies that both files exist and stops with a clear error if any is missing
- Creates a single ZIP archive (`predictions_and_gold.zip`) that contains:
  - `predictions.csv`
  - `*.test.gold.csv`
  - `config.txt`, a snapshot of the **effective configuration** used for the run (models, parameters, paths)

Why this is useful:

- Enables quick local evaluation and debugging without rerunning training or offline preprocessing
- Makes it easy to share results with collaborators in a **self-contained and reproducible bundle**
- Ensures that predictions and ground truth are always paired with the exact configuration that produced them

This cell does **not** modify any artifacts and does **not** trigger computation.  
It only packages files that were already produced by previous steps.

In [None]:
# ============================================
# DOWNLOAD ARTIFACTS (predictions + gold + config.txt)
# ============================================

from pathlib import Path

# --- Expected paths from configuration ---
pred_path = Path(INFER_OUT_CSV)  # outputs/.../inference/predictions.csv
gold_path = Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv")
# outputs/.../training_dataset.test.gold.csv

# --- Sanity checks ---
missing = [str(p) for p in [pred_path, gold_path] if not p.exists()]
if missing:
    raise FileNotFoundError(
        "Missing required file(s):\n"
        + "\n".join(f" - {m}" for m in missing)
        + "\n\nNotes:\n"
        " - predictions.csv is produced by the inference stage.\n"
        " - *.test.gold.csv is produced in full/build-dataset training runs.\n"
    )

print("Found predictions:", pred_path)
print("Found gold:", gold_path)

# --- Create ZIP (artifacts + config.txt) ---
zip_path = pred_path.parent / "predictions_and_gold.zip"

make_zip_with_config(
    zip_path=zip_path,
    files_to_include=[pred_path, gold_path],
    config_dir=pred_path.parent,
    download=True,
)

---

## 2) Stage-by-stage execution (Training → Offline Bundle → Inference)

This notebook supports two ways of running the pipeline:

1) **Full pipeline (one-click)**: runs Training, then Offline Bundle building, then Inference.
2) **Stage-by-stage (3 separate cells)**: runs the same steps, but split into independent stages so you can re-run only what you need.

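The pipeline has a natural dependency chain: inference consumes the cross-encoder produced by training and the bundle produced by offline preprocessing.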

Each stage can either:
- **produce artifacts**, or
- **consume restored artifacts**, depending on the run flags.

### Stage 1 — Training / Dataset Construction

**Purpose**
- Build the training dataset and splits.
- Optionally train a cross-encoder model.

**Executed when**
- `DO_TRAINING = True`

**What it does**
- If `RUN_MODE` is `full` or `build-dataset`:
  - builds `training_dataset.csv`
  - generates `*.train.csv`, `*.val.csv`, `*.test.csv`
  - generates `*.test.queries.csv` (default inference input)
- If `RUN_MODE` is `full` or `train-only`:
  - trains a cross-encoder model

**Artifacts produced**
- `training_dataset.csv`
- `training_dataset.{train,val,test}.csv`
- `training_dataset.test.queries.csv`
- `final_cross_encoder_model/`
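
The split and query file names are derived from the dataset path with `Path.with_suffix`, as in the Configuration cell. The short example below shows the resulting names; the run directory is an illustrative placeholder.

```python
from pathlib import Path

# Illustrative dataset path; a real run uses outputs/<RUN_ID>/training/.
dataset_csv = Path("outputs/example_run/training/training_dataset.csv")

# with_suffix replaces the final ".csv" with the given multi-part suffix.
tags = ("train", "val", "test", "test.queries", "test.gold")
derived = {tag: dataset_csv.with_suffix(f".{tag}.csv").name for tag in tags}

for tag, name in derived.items():
    print(f"{tag:>12}: {name}")
```
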

**Alternative**
- If `RESTORE_MODEL = True`, this stage can be skipped and the model loaded from disk instead.

### Stage 2 — Offline Bundle Builder

**Purpose**
- Build the retrieval infrastructure for the target ontology:
  - lexical index
  - optional semantic embeddings

**Executed when**
- `DO_OFFLINE = True`

**What it does**
- Processes the target ontology
- Builds the offline retrieval bundle

**Artifacts produced**
- `ontology_internal.csv`
- `offline_bundle.pkl`

**Notes**
- This stage is independent of training once the ontology is fixed.
- It is typically run once per ontology configuration.

**Alternative**
- If `RESTORE_OFFLINE = True`, this stage can be skipped and the offline artifacts loaded instead.

### Stage 3 — Inference

**Purpose**
- Run retrieval + scoring for each query in `INFER_INPUT_CSV`.

**Executed when**
- `DO_INFERENCE = True`

**What it does**
- Loads:
  - a trained or restored cross-encoder
  - a built or restored offline bundle
- Runs:
  - lexical or hybrid retrieval
  - cross-encoder scoring
- Produces predictions

**Artifacts produced**
- `predictions.csv`
- optional top-N candidate columns (if enabled)

**This is the stage you re-run most often**, because it’s where you typically change:
- `INFER_MODE` (`lexical` vs `hybrid`)
- retrieval top-k values and hybrid ratio
- scoring batch size / max length
- input CSV and column schema
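
The notebook does not show how `run_inference.py` combines lexical and semantic candidates, so the following is only one plausible reading of the retrieval knobs. It assumes the hybrid ratio is the share of the merged candidate list reserved for semantic hits. `merge_candidates` is a hypothetical function written for this sketch, not the pipeline's implementation.

```python
def merge_candidates(lexical, semantic, merged_top_k, ratio_semantic):
    """Merge two ranked candidate lists (assumed semantics, see lead-in)."""
    if not 0.0 <= ratio_semantic <= 1.0:
        raise ValueError("ratio_semantic must be in [0, 1]")

    semantic_quota = int(merged_top_k * ratio_semantic)
    merged, seen = [], set()

    def take(source, budget):
        # Append unseen candidates from `source`, in rank order, up to `budget`.
        for candidate in source:
            if budget <= 0:
                break
            if candidate not in seen:
                seen.add(candidate)
                merged.append(candidate)
                budget -= 1

    take(semantic, semantic_quota)                # reserved semantic share
    take(lexical, merged_top_k - len(merged))     # fill the rest lexically
    take(semantic, merged_top_k - len(merged))    # backfill if lexical ran short
    return merged


lexical_hits = ["a", "b", "c", "d"]
semantic_hits = ["c", "e", "f", "g"]
merged_hits = merge_candidates(lexical_hits, semantic_hits, merged_top_k=4, ratio_semantic=0.5)
print(merged_hits)
```

Under this reading, a ratio of `0.0` gives purely lexical candidates and `1.0` purely semantic ones. Check the script itself before relying on these semantics.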

---

### Practical workflows

**Full pipeline (from scratch)**
- Enable all stages
- No restore flags
- One run produces all artifacts

**Inference-only iteration**
- Disable Training and Offline
- Enable Inference
- Restore model + offline artifacts
- Change inference parameters freely

**Offline preprocessing once, inference many times**
- Run Offline once
- Zip artifacts
- Restore them in future sessions
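
As a rough sketch, the workflows above correspond to flag combinations like the following. The flag names come from this notebook; the `WORKFLOW_FLAGS` table and the `apply_workflow` helper are hypothetical conveniences, not part of the pipeline.

```python
# Flag presets for the workflows described above (illustrative).
WORKFLOW_FLAGS = {
    "full_pipeline": {
        "DO_TRAINING": True, "DO_OFFLINE": True, "DO_INFERENCE": True,
        "RESTORE_MODEL": False, "RESTORE_OFFLINE": False,
    },
    "inference_only": {
        "DO_TRAINING": False, "DO_OFFLINE": False, "DO_INFERENCE": True,
        "RESTORE_MODEL": True, "RESTORE_OFFLINE": True,
    },
    "offline_only": {
        "DO_TRAINING": False, "DO_OFFLINE": True, "DO_INFERENCE": False,
        "RESTORE_MODEL": False, "RESTORE_OFFLINE": False,
    },
}


def apply_workflow(name: str, namespace: dict) -> None:
    """Copy one preset into a namespace (hypothetical helper)."""
    namespace.update(WORKFLOW_FLAGS[name])


flags = {}
apply_workflow("inference_only", flags)
print(flags)
```

In a notebook, `apply_workflow("inference_only", globals())` would set the flags for the cells that follow.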

---

### Key design principle

- The **Configuration cell** defines defaults.
- The **Run Mode Flags** decide *what to execute*.
- The **Restore cells** decide *where artifacts come from*.
- The **Inference stage** always consumes the *currently active* artifacts, whether built or restored.

This design keeps the notebook:
- reproducible,
- modular,
- and efficient for Colab-based experimentation.
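
One way to reason about which artifacts the inference stage will see is sketched below. It assumes a restore takes precedence over a build in the same session, because the restore cells run after the build stages and reassign the same variables. `artifact_source` and `inference_plan` are hypothetical helpers, and the `"on-disk"` label stands for artifacts left over from an earlier run.

```python
def artifact_source(built_this_run: bool, restore: bool) -> str:
    """Describe where an artifact comes from (assumed precedence, see lead-in)."""
    if restore:
        return "restored"
    if built_this_run:
        return "built"
    return "on-disk"


def inference_plan(flags: dict) -> dict:
    """Map the run flags to the origin of each artifact used by inference."""
    return {
        "cross_encoder": artifact_source(flags["DO_TRAINING"], flags["RESTORE_MODEL"]),
        "offline_bundle": artifact_source(flags["DO_OFFLINE"], flags["RESTORE_OFFLINE"]),
    }


plan = inference_plan({
    "DO_TRAINING": False, "RESTORE_MODEL": True,
    "DO_OFFLINE": True, "RESTORE_OFFLINE": False,
})
print(plan)
```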

---

## Helpers and execution environment

This cell defines small helper utilities used by all execution stages:

- `run_cmd(...)`  
  A thin wrapper around `subprocess.run` that:
  - executes CLI scripts (`training.py`, `build_ontology_bundle.py`, `run_inference.py`)
  - redirects stdout/stderr to a log file
  - stops the notebook immediately if a command fails

- `print_tail(...)`  
  Convenience helper to print the last lines of a log file when an error occurs,
  making debugging easier in Colab sessions.

In addition, this cell disables **Weights & Biases (wandb)** by default, which:
- avoids interactive login prompts in Colab
- keeps training runs from logging in or syncing to an external service

> This cell should be run **once** before executing any pipeline stage.
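
Setting the variables on `os.environ` is sufficient because child processes started with `subprocess.run` inherit the parent's environment unless an explicit `env` is passed. A quick way to confirm this, using a throwaway child interpreter (`sys.executable` is used here only to keep the sketch portable):

```python
import os
import subprocess
import sys

# Set the variable in the parent (notebook) process.
os.environ["WANDB_MODE"] = "disabled"

# The child prints the value it sees; no `env=` is passed, so it inherits ours.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('WANDB_MODE'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print("Child process sees WANDB_MODE =", child_value)
```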

In [None]:
# ============================================
# HELPERS (run_cmd + logs) + ENV
# ============================================
import subprocess
from pathlib import Path
import os

def print_tail(path: Path, n=120):
    p = Path(path)
    if not p.exists():
        print(f"[tail] log not found: {p}")
        return
    lines = p.read_text(errors="replace").splitlines()
    print("\n".join(lines[-n:]))

def run_cmd(cmd, log_path: Path, cwd: Path):
    # str() each part so Path arguments do not break the join.
    print("\nRunning command:\n", " ".join(map(str, cmd)))
    print("CWD:", cwd)
    print("Log:", log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)

    with open(log_path, "w") as f:
        proc = subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, cwd=str(cwd))

    print("Return code:", proc.returncode)
    if proc.returncode != 0:
        print("!!! Error occurred. Last lines of log:")
        print_tail(log_path, n=120)
        raise RuntimeError(f"Command failed with return code {proc.returncode}. See log: {log_path}")
    return proc.returncode

# Disable wandb in Colab unless you explicitly want it
os.environ["WANDB_MODE"] = "disabled"
os.environ["WANDB_SILENT"] = "true"

print("Helpers OK.")

---

## Stage 1 - Training

In [None]:
# ============================================
# RUN TRAINING STAGE
# ============================================

import subprocess
from pathlib import Path
import os

def print_tail(path: Path, n=120):
    p = Path(path)
    if not p.exists():
        print(f"[tail] log not found: {p}")
        return
    lines = p.read_text(errors="replace").splitlines()
    print("\n".join(lines[-n:]))

def run_cmd(cmd, log_path: Path, cwd: Path):
    # str() each part so Path arguments do not break the join.
    print("\nRunning command:\n", " ".join(map(str, cmd)))
    print("CWD:", cwd)
    print("Log:", log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)

    # --- Force-disable W&B for the subprocess (robust) ---
    env = os.environ.copy()
    env["WANDB_MODE"] = "disabled"     # wandb will not try to log in / sync
    env["WANDB_SILENT"] = "true"

    with open(log_path, "w") as f:
        proc = subprocess.run(
            cmd,
            stdout=f,
            stderr=subprocess.STDOUT,
            cwd=str(cwd),
            env=env,
        )

    print("Return code:", proc.returncode)
    if proc.returncode != 0:
        print("!!! Error occurred. Last lines of log:")
        print_tail(log_path, n=120)
        raise RuntimeError(f"Command failed with return code {proc.returncode}. See log: {log_path}")
    return proc.returncode

if not DO_TRAINING:
    print("Skipping training (DO_TRAINING=False).")
else:
    # Disable wandb in Colab unless you explicitly want it
    os.environ["WANDB_MODE"] = "disabled"
    os.environ["WANDB_SILENT"] = "true"

    Path(MODEL_OUT_DIR).mkdir(parents=True, exist_ok=True)

    train_log = TRAIN_DIR / "training.log"
    train_cmd = ["python", "training.py", "--mode", RUN_MODE]

    if RUN_MODE in {"full", "build-dataset"}:
        train_cmd += ["--src", SRC_PATH, "--tgt", TGT_PATH, "--align", ALIGN_PATH]
        train_cmd += ["--out-src", OUT_SRC_CSV, "--out-tgt", OUT_TGT_CSV, "--out-dataset", OUT_DATASET_CSV]
        train_cmd += ["--split-ratios", SPLIT_RATIOS]

        if SRC_PREFIX:
            train_cmd += ["--src-prefix", SRC_PREFIX]
        if TGT_PREFIX:
            train_cmd += ["--tgt-prefix", TGT_PREFIX]

        if USE_DESCRIPTION: train_cmd.append("--src-use-description")
        if USE_SYNONYMS: train_cmd.append("--src-use-synonyms")
        if USE_PARENTS: train_cmd.append("--src-use-parents")
        if USE_EQUIVALENT: train_cmd.append("--src-use-equivalent")
        if USE_DISJOINT: train_cmd.append("--src-use-disjoint")
        if VISUALIZE: train_cmd.append("--visualize-alignments")

    if RUN_MODE in {"full", "train-only"}:
        train_cmd += ["--model-type", MODEL_TYPE, "--model-name", MODEL_NAME, "--model-output-dir", MODEL_OUT_DIR]
        train_cmd += ["--num-epochs", str(NUM_EPOCHS)]

        if HYPERPARAMETER_TUNING:
            train_cmd += ["--tune", "--n-trials", str(N_TRIALS)]
        elif USE_FIXED_HYPERPARAMS:
            train_cmd += ["--learning-rate", str(LEARNING_RATE)]
            train_cmd += ["--batch-size", str(BATCH_SIZE)]
            train_cmd += ["--weight-decay", str(WEIGHT_DECAY)]

    if RUN_MODE == "train-only":
        train_cmd += ["--dataset-csv", DATASET_CSV]

    run_cmd(train_cmd, train_log, cwd=REPO_ROOT)

    print("\nTraining completed.")
    print("Dataset CSV:", OUT_DATASET_CSV)
    print("Train split:", TRAIN_SPLIT_CSV)
    print("Val split:", VAL_SPLIT_CSV)
    print("Test split:", TEST_SPLIT_CSV)
    print("Model out dir:", MODEL_OUT_DIR)

    # Default scorer for inference (can be overridden by restore cell)
    CROSS_ENCODER_MODEL_ID = FINAL_CROSS_ENCODER_DIR
    print("CROSS_ENCODER_MODEL_ID set to:", CROSS_ENCODER_MODEL_ID)

---

## Stage 2 - Offline Preprocessing

---

### RESTORE OFFLINE INPUTS (Ontology or Export CSV)
Use this cell only if you want to run offline preprocessing (`DO_OFFLINE=True`) but your ontology/export files are not yet present in the session.

In [None]:
from pathlib import Path
import shutil
import zipfile

# --- IF guard ---
if not DO_OFFLINE:
    print("DO_OFFLINE=False -> skipping offline input restore cell.")
else:
    # Choose what you want to restore:
    #   - "ontology"  => upload OWL/RDF files and set OFFLINE_ONT_PATH
    #   - "export_csv"=> upload an already-exported ontology_internal.csv and set OFFLINE_EXPORT_CSV
    OFFLINE_INPUT_MODE = "ontology"  # "ontology" | "export_csv"

    # Optional: if you prefer to point to an existing path instead of uploading
    OFFLINE_INPUT_SRC = None  # e.g. "/content/myfiles/sweet.owl" OR None to upload

    RESTORE_ROOT = Path(OUT_DIR) / "restored"
    RESTORED_OFFLINE_INPUT_DIR = RESTORE_ROOT / "offline_inputs"
    RESTORED_OFFLINE_INPUT_DIR.mkdir(parents=True, exist_ok=True)

    def _extract_zip_to(zip_path: Path, dest_dir: Path) -> None:
        dest_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(dest_dir)

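    def _find_first(root: Path, patterns: list[str]) -> Path:
        # OFFLINE_INPUT_SRC may point directly at a file; rglob() on a file finds nothing.
        if root.is_file():
            if any(root.match(pat) for pat in patterns):
                return root
            raise FileNotFoundError(f"{root} does not match any of {patterns}")
        for pat in patterns:
            # Sort so the choice is deterministic when several files match.
            hits = sorted(root.rglob(pat))
            if hits:
                return hits[0]
        raise FileNotFoundError(f"Could not find any of {patterns} under: {root}")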

    # ---- Acquire input files ----
    if OFFLINE_INPUT_SRC is None:
        from google.colab import files
        uploaded = files.upload()  # upload either: a single file OR a zip containing the files
        name = next(iter(uploaded.keys()))
        p = Path(name).resolve()

        # Clean previous restore
        if RESTORED_OFFLINE_INPUT_DIR.exists():
            shutil.rmtree(RESTORED_OFFLINE_INPUT_DIR)
        RESTORED_OFFLINE_INPUT_DIR.mkdir(parents=True, exist_ok=True)

        if p.suffix.lower() == ".zip":
            print("Extracting:", p, "->", RESTORED_OFFLINE_INPUT_DIR)
            _extract_zip_to(p, RESTORED_OFFLINE_INPUT_DIR)
            restore_root = RESTORED_OFFLINE_INPUT_DIR
        else:
            # single file uploaded
            shutil.move(str(p), str(RESTORED_OFFLINE_INPUT_DIR / p.name))
            restore_root = RESTORED_OFFLINE_INPUT_DIR
    else:
        restore_root = Path(OFFLINE_INPUT_SRC).expanduser().resolve()
        if not restore_root.exists():
            raise FileNotFoundError(f"OFFLINE_INPUT_SRC not found: {restore_root}")

    # ---- Override configuration vars used by offline stage ----
    if OFFLINE_INPUT_MODE == "ontology":
        ont_path = _find_first(restore_root, ["*.owl", "*.rdf", "*.ttl", "*.obo"])
        OFFLINE_ONT_PATH = str(ont_path)
        OFFLINE_EXPORT_CSV = None  # ensure ontology mode
        print("Restored offline INPUT (ontology):")
        print("   OFFLINE_ONT_PATH   =", OFFLINE_ONT_PATH)
        print("   OFFLINE_EXPORT_CSV =", OFFLINE_EXPORT_CSV)

    elif OFFLINE_INPUT_MODE == "export_csv":
        export_csv = _find_first(restore_root, ["*.csv"])
        OFFLINE_EXPORT_CSV = str(export_csv)
        # When using export CSV, OFFLINE_ONT_PATH is not needed
        print("Restored offline INPUT (export CSV):")
        print("   OFFLINE_EXPORT_CSV =", OFFLINE_EXPORT_CSV)

    else:
        raise ValueError("OFFLINE_INPUT_MODE must be 'ontology' or 'export_csv'.")

---

In [None]:
# ============================================
# RUN OFFLINE BUNDLE STAGE
# ============================================

import subprocess
from pathlib import Path

if not DO_OFFLINE:
    print("Skipping offline bundle build (DO_OFFLINE=False).")
else:
    offline_log = OFFLINE_DIR / "offline_bundle.log"

    offline_cmd = [
        "python", "build_ontology_bundle.py",
        "--out-csv", ONTOLOGY_INTERNAL_CSV,
        "--out-bundle", OFFLINE_BUNDLE_PKL,
        "--tokenizer-name", CROSS_TOKENIZER_NAME,
        "--bi-encoder-model-id", BI_ENCODER_MODEL_ID,
        "--semantic-batch-size", str(OFFLINE_SEMANTIC_BATCH_SIZE),
        "--semantic-max-length", str(OFFLINE_SEMANTIC_MAX_LENGTH),
    ]

    if OFFLINE_NO_SEMANTIC_NORMALIZE:
        offline_cmd.append("--no-semantic-normalize")

    if OFFLINE_EXPORT_CSV:
        offline_cmd += ["--export-csv", OFFLINE_EXPORT_CSV]
    else:
        offline_cmd += ["--ont-path", OFFLINE_ONT_PATH]
        if OFFLINE_PREFIX:
            offline_cmd += ["--prefix", OFFLINE_PREFIX]

    run_cmd(offline_cmd, offline_log, cwd=REPO_ROOT)

    print("\nOffline bundle completed.")
    print("Ontology internal CSV:", ONTOLOGY_INTERNAL_CSV)
    print("Offline bundle PKL:", OFFLINE_BUNDLE_PKL)

---

## Stage 3 - Inference

---

### Optional: Restore artifacts for inference-only runs

The following cells are **optional** and are only needed when you want to run **Inference without rebuilding artifacts in the current session**.

They allow you to restore previously generated outputs (e.g. from another Colab session, a teammate, or a downloaded ZIP) and **override the corresponding configuration variables locally**, keeping everything self-contained.

You typically use these cells when:
- you already trained a model elsewhere and just want to run inference
- you already built the offline bundle and want to reuse it
- you want to run inference on a custom input CSV without re-running training or offline preprocessing

If you are running the **full pipeline in this notebook session**, you can safely skip these cells.

The restore logic is split into two parts:
- **Offline artifacts restore** (ontology CSV + offline bundle)
- **Model + inference input restore** (cross-encoder + optional custom query CSV)

Both cells automatically override the variables used by the inference stage.
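
The restore pattern used by both cells (extract a ZIP, then locate the required files anywhere inside it) can be tried outside Colab with a small round trip. This sketch is self-contained and uses placeholder file contents. `restore_from_zip` is a hypothetical helper, and the archive layout `run/offline/...` is invented for the demonstration.

```python
import tempfile
import zipfile
from pathlib import Path


def restore_from_zip(zip_path: Path, dest_dir: Path, required: list[str]) -> dict[str, Path]:
    """Extract `zip_path` into `dest_dir` and locate each required file name."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)

    found = {}
    for name in required:
        hits = sorted(dest_dir.rglob(name))  # search recursively, deterministic order
        if not hits:
            raise FileNotFoundError(f"Could not find '{name}' under: {dest_dir}")
        found[name] = hits[0]
    return found


with tempfile.TemporaryDirectory() as tmp_name:
    tmp_dir = Path(tmp_name)

    # Build a tiny archive that mimics a downloaded offline bundle.
    zip_path = tmp_dir / "offline.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("run/offline/offline_bundle.pkl", b"placeholder")
        archive.writestr("run/offline/ontology_internal.csv", "iri,label\n")

    restored = restore_from_zip(
        zip_path, tmp_dir / "restored", ["offline_bundle.pkl", "ontology_internal.csv"]
    )
    restored_names = {
        name: path.relative_to(tmp_dir / "restored").as_posix()
        for name, path in restored.items()
    }
    print(restored_names)
```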

In [None]:
# ============================================
# RESTORE OFFLINE ARTIFACTS (offline bundle + ontology CSV)
# ============================================

from pathlib import Path
import shutil
import zipfile

if not (DO_INFERENCE and RESTORE_OFFLINE):
    print("Skipping: RESTORE_OFFLINE is False (or DO_INFERENCE is False).")
else:
    # If you want to skip upload and point to an existing folder, set this:
    # Example: OFFLINE_RESTORE_SRC = "/content/my_run/offline"
    OFFLINE_RESTORE_SRC = None  # str path OR None to upload a .zip

    RESTORE_ROOT = Path(OUT_DIR) / "restored"
    RESTORED_OFFLINE_DIR = RESTORE_ROOT / "offline"

    def _extract_zip_to(zip_path: Path, dest_dir: Path) -> None:
        dest_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(dest_dir)

    def _find_first(root: Path, pattern: str) -> Path:
        hits = list(root.rglob(pattern))
        if not hits:
            raise FileNotFoundError(f"Could not find '{pattern}' under: {root}")
        return hits[0]

    # ---- Acquire artifacts ----
    if OFFLINE_RESTORE_SRC is None:
        from google.colab import files
        uploaded = files.upload()  # choose a zip that contains offline_bundle.pkl and ontology_internal.csv
        zip_name = next(iter(uploaded.keys()))
        zip_path = Path(zip_name).resolve()

        # Clean previous restore to avoid stale artifacts
        if RESTORED_OFFLINE_DIR.exists():
            shutil.rmtree(RESTORED_OFFLINE_DIR)
        RESTORED_OFFLINE_DIR.mkdir(parents=True, exist_ok=True)

        print("Extracting:", zip_path, "->", RESTORED_OFFLINE_DIR)
        _extract_zip_to(zip_path, RESTORED_OFFLINE_DIR)
        restore_root = RESTORED_OFFLINE_DIR
    else:
        restore_root = Path(OFFLINE_RESTORE_SRC).expanduser().resolve()
        if not restore_root.exists():
            raise FileNotFoundError(f"OFFLINE_RESTORE_SRC not found: {restore_root}")

    # ---- Locate required files ----
    bundle_pkl = _find_first(restore_root, "offline_bundle.pkl")
    onto_csv = _find_first(restore_root, "ontology_internal.csv")

    # ---- Override variables used by inference stage ----
    OFFLINE_BUNDLE_PKL = str(bundle_pkl)
    ONTOLOGY_INTERNAL_CSV = str(onto_csv)

    print("Restored offline artifacts:")
    print("   OFFLINE_BUNDLE_PKL    =", OFFLINE_BUNDLE_PKL)
    print("   ONTOLOGY_INTERNAL_CSV =", ONTOLOGY_INTERNAL_CSV)
    print("   Bundle exists:", Path(OFFLINE_BUNDLE_PKL).exists())
    print("   CSV exists:", Path(ONTOLOGY_INTERNAL_CSV).exists())

In [None]:
# ============================================
# RESTORE MODEL + INFERENCE INPUT (for inference-only runs)
# ============================================

from pathlib import Path
import shutil
import zipfile

if not (DO_INFERENCE and RESTORE_MODEL):
    print("Skipping: RESTORE_MODEL is False (or DO_INFERENCE is False).")
else:
    # -----------------------------
    # A) RESTORE MODEL (Cross-Encoder)
    # -----------------------------

    # If you want to skip upload and point to an existing folder, set this:
    # Example: MODEL_RESTORE_SRC = "/content/my_model/final_cross_encoder_model"
    MODEL_RESTORE_SRC = None  # str path OR None to upload a .zip

    RESTORE_ROOT = Path(OUT_DIR) / "restored"
    RESTORED_MODEL_DIR = RESTORE_ROOT / "model"

    def _extract_zip_to(zip_path: Path, dest_dir: Path) -> None:
        dest_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(dest_dir)

    def _find_cross_encoder_dir(root: Path) -> Path:
        """
        Locate a SentenceTransformers CrossEncoder saved folder.
        We accept a directory containing config.json (minimal proxy).
        """
        if (root / "config.json").exists():
            return root

        candidates = list(root.rglob("config.json"))
        if not candidates:
            raise FileNotFoundError(
                f"Could not find config.json under: {root}\n"
                "Your zip/folder should include the saved model directory (SentenceTransformers CrossEncoder)."
            )
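        # Prefer the shallowest match so a nested checkpoint folder is not picked by accident.
        return min(candidates, key=lambda c: len(c.parts)).parent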

    # Acquire model artifacts
    if MODEL_RESTORE_SRC is None:
        from google.colab import files
        uploaded = files.upload()  # zip containing final_cross_encoder_model/...
        zip_name = next(iter(uploaded.keys()))
        zip_path = Path(zip_name).resolve()

        if RESTORED_MODEL_DIR.exists():
            shutil.rmtree(RESTORED_MODEL_DIR)
        RESTORED_MODEL_DIR.mkdir(parents=True, exist_ok=True)

        print("Extracting model zip:", zip_path, "->", RESTORED_MODEL_DIR)
        _extract_zip_to(zip_path, RESTORED_MODEL_DIR)
        model_restore_root = RESTORED_MODEL_DIR
    else:
        model_restore_root = Path(MODEL_RESTORE_SRC).expanduser().resolve()
        if not model_restore_root.exists():
            raise FileNotFoundError(f"MODEL_RESTORE_SRC not found: {model_restore_root}")

    cross_dir = _find_cross_encoder_dir(model_restore_root)

    # Override variable used by inference stage
    CROSS_ENCODER_MODEL_ID = str(cross_dir)

    print("Restored cross-encoder model dir:")
    print("   CROSS_ENCODER_MODEL_ID =", CROSS_ENCODER_MODEL_ID)
    print("   Contains config.json:", (Path(CROSS_ENCODER_MODEL_ID) / "config.json").exists())

    # Optional: keep tokenizer aligned with restored model
    # If you know the tokenizer id used, set it here; otherwise keep existing CROSS_TOKENIZER_NAME from config.
    # CROSS_TOKENIZER_NAME = "allenai/scibert_scivocab_uncased"

    # -----------------------------
    # B) (OPTIONAL) RESTORE / OVERRIDE INFERENCE INPUT + SCHEMA
    # -----------------------------
    # If you want to run inference on a custom CSV, enable this and upload it.
    RESTORE_INFER_INPUT = False  # set True to upload your own input CSV (queries/attributes)

    if not RESTORE_INFER_INPUT:
        print("Inference input not restored (RESTORE_INFER_INPUT=False). Using INFER_INPUT_CSV from Configuration:")
        print("   INFER_INPUT_CSV =", INFER_INPUT_CSV)
        print("   RETRIEVAL_COL =", RETRIEVAL_COL, "| SCORING_COL =", SCORING_COL, "| ID_COL =", ID_COL)
    else:
        from google.colab import files
        uploaded = files.upload()  # upload your custom input CSV
        csv_name = next(iter(uploaded.keys()))
        csv_path = Path(csv_name).resolve()

        # Override input path
        INFER_INPUT_CSV = str(csv_path)

        # Override schema to match your custom CSV
        # IMPORTANT: set these to columns that actually exist in your file.
        # - RETRIEVAL_COL: used for exact+lexical retrieval
        # - SCORING_COL: used as query text for semantic retrieval + cross-encoder scoring
        # - ID_COL: carried through as identifier
        RETRIEVAL_COL = "source_label"
        SCORING_COL = "source_text"
        ID_COL = "source_iri"

        print("Restored/overrode inference input:")
        print("   INFER_INPUT_CSV =", INFER_INPUT_CSV)
        print("   RETRIEVAL_COL =", RETRIEVAL_COL, "| SCORING_COL =", SCORING_COL, "| ID_COL =", ID_COL)

---

### Inference parameter overrides (no restore required)

This cell allows you to **change inference parameters even if you did NOT restore any artifacts**.

If you already have valid artifacts on disk (e.g. from a previous full pipeline run in the same session), you can:

- modify `INFER_INPUT_CSV` to point to a different query file
- change the column schema (`RETRIEVAL_COL`, `SCORING_COL`, `ID_COL`)
- tune retrieval and scoring parameters (top-k values, hybrid ratio, batch sizes, max length)
- change the inference output path to avoid overwriting previous results

No files are rebuilt and no artifacts are reloaded here:  
this cell **only overrides variables** that will be consumed by the inference stage.

If you skip both restore cells and run this override cell, inference will use:
- the existing offline bundle
- the existing trained model
- the updated parameters defined here
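
Because the column schema must match the input file, it can help to check the CSV header before launching inference. The sketch below uses only the standard library. `missing_columns` is a hypothetical helper, and the sample file is created only for the demonstration.

```python
import csv
import tempfile
from pathlib import Path


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    # Empty or None entries are skipped, mirroring optional SCORING_COL / ID_COL.
    return [col for col in required if col and col not in header]


with tempfile.TemporaryDirectory() as tmp_name:
    sample_csv = Path(tmp_name) / "queries.csv"
    sample_csv.write_text(
        "source_iri,source_label\nhttp://example.org/a,Alpha\n", encoding="utf-8"
    )

    missing = missing_columns(sample_csv, ["source_label", "source_text", "source_iri"])
    print("Missing columns:", missing)
```

In the notebook, the call would be `missing_columns(INFER_INPUT_CSV, [RETRIEVAL_COL, SCORING_COL, ID_COL])`, run before the inference cell.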

In [None]:
# ============================================
# INFERENCE OVERRIDES (quick tuning)
# ============================================

# Toggle: set True if you want to override defaults from Configuration
OVERRIDE_INFERENCE_PARAMS = True

if not OVERRIDE_INFERENCE_PARAMS:
    print("Skipping inference overrides (OVERRIDE_INFERENCE_PARAMS=False).")
else:
    # --- Input/Output ---
    # If you want to run on a custom CSV, set it here (must exist on disk)
    # INFER_INPUT_CSV = "data/my_custom_queries.csv"
    # INFER_OUT_CSV   = str(INFER_DIR / "predictions_custom.csv")

    # --- Column schema (must match your INFER_INPUT_CSV) ---
    # Used for exact/lexical retrieval (typically a short label)
    RETRIEVAL_COL = "source_label"
    # Used for semantic retrieval + cross-encoder scoring (typically richer text)
    SCORING_COL = "source_text"
    # Carried through as identifier
    ID_COL = "source_iri"

    # --- Retrieval / scoring knobs ---
    INFER_MODE = "hybrid"  # "lexical" | "hybrid"

    RETRIEVAL_LEXICAL_TOP_K  = 100
    RETRIEVAL_SEMANTIC_TOP_K = 100
    RETRIEVAL_MERGED_TOP_K   = 150
    HYBRID_RATIO_SEMANTIC    = 0.5  # 0..1

    SEMANTIC_BATCH_SIZE = 64

    CROSS_TOP_K       = 20
    CROSS_BATCH_SIZE  = 32
    CROSS_MAX_LENGTH  = 256

    KEEP_TOP_N = 0  # 0 = only best prediction, >0 keeps extra columns

    # --- Print final effective config (sanity) ---
    print("Inference parameters set:")
    print("  INFER_INPUT_CSV =", INFER_INPUT_CSV)
    print("  INFER_OUT_CSV   =", INFER_OUT_CSV)
    print("  MODE            =", INFER_MODE)
    print("  RETRIEVAL_COL   =", RETRIEVAL_COL)
    print("  SCORING_COL     =", SCORING_COL)
    print("  ID_COL          =", ID_COL)
    print("  Lex/Sem/Merged  =", RETRIEVAL_LEXICAL_TOP_K, RETRIEVAL_SEMANTIC_TOP_K, RETRIEVAL_MERGED_TOP_K)
    print("  Hybrid ratio    =", HYBRID_RATIO_SEMANTIC)
    print("  Cross top-k     =", CROSS_TOP_K)

### Run inference

This cell executes the **inference stage only**, using whatever artifacts are currently available in the session.

Important notes:

- This cell **does not build or restore artifacts**.
  It only consumes what already exists on disk or was restored earlier.
- You can freely change inference parameters (input file, columns, top-k values, batch sizes, etc.)
  before running this cell, without retraining or rebuilding the offline bundle.
- This is the cell you typically re-run multiple times during experimentation.

If `DO_INFERENCE=False`, the cell is skipped safely without side effects.
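
When re-running this cell many times, a simple way to avoid overwriting earlier predictions is to derive a fresh output path for each run. `next_free_path` is a hypothetical helper, and the `_1`, `_2`, ... suffix scheme is just one possible convention.

```python
import tempfile
from pathlib import Path


def next_free_path(path) -> Path:
    """Return `path` if unused, else the first `<stem>_<n><suffix>` that is free."""
    path = Path(path)
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


with tempfile.TemporaryDirectory() as tmp_name:
    out_dir = Path(tmp_name)
    first = next_free_path(out_dir / "predictions.csv")
    first.write_text("placeholder\n", encoding="utf-8")   # simulate a finished run
    second = next_free_path(out_dir / "predictions.csv")
    print(first.name, second.name)
```

In the notebook this could be used as `INFER_OUT_CSV = str(next_free_path(INFER_DIR / "predictions.csv"))` before running inference.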

In [None]:
# ============================================
# RUN INFERENCE STAGE
# ============================================

from pathlib import Path

if not DO_INFERENCE:
    print("Skipping inference (DO_INFERENCE=False).")
else:
    # Final sanity checks
    if "CROSS_ENCODER_MODEL_ID" not in globals() or CROSS_ENCODER_MODEL_ID is None:
        raise ValueError(
            "CROSS_ENCODER_MODEL_ID is not set. "
            "Run training (DO_TRAINING=True) or restore model (RESTORE_MODEL=True)."
        )

    if not Path(OFFLINE_BUNDLE_PKL).exists():
        raise FileNotFoundError(f"OFFLINE_BUNDLE_PKL not found: {OFFLINE_BUNDLE_PKL}")
    if not Path(ONTOLOGY_INTERNAL_CSV).exists():
        raise FileNotFoundError(f"ONTOLOGY_INTERNAL_CSV not found: {ONTOLOGY_INTERNAL_CSV}")
    if not Path(INFER_INPUT_CSV).exists():
        raise FileNotFoundError(
            f"INFER_INPUT_CSV not found: {INFER_INPUT_CSV}\n"
            "Set INFER_INPUT_CSV to your custom file, or enable RESTORE_INFER_INPUT inside the restore cell."
        )

    infer_log = INFER_DIR / "inference.log"

    infer_cmd = [
        "python", "run_inference.py",
        "--bundle", OFFLINE_BUNDLE_PKL,
        "--ontology-csv", ONTOLOGY_INTERNAL_CSV,
        "--input-csv", INFER_INPUT_CSV,
        "--out-csv", INFER_OUT_CSV,
        "--mode", INFER_MODE,
        "--cross-tokenizer-name", CROSS_TOKENIZER_NAME,
        "--cross-encoder-model-id", CROSS_ENCODER_MODEL_ID,
        "--retrieval-col", RETRIEVAL_COL,
        "--retrieval-lexical-top-k", str(RETRIEVAL_LEXICAL_TOP_K),
        "--retrieval-semantic-top-k", str(RETRIEVAL_SEMANTIC_TOP_K),
        "--retrieval-merged-top-k", str(RETRIEVAL_MERGED_TOP_K),
        "--hybrid-ratio-semantic", str(HYBRID_RATIO_SEMANTIC),
        "--semantic-batch-size", str(SEMANTIC_BATCH_SIZE),
        "--cross-top-k", str(CROSS_TOP_K),
        "--cross-batch-size", str(CROSS_BATCH_SIZE),
        "--cross-max-length", str(CROSS_MAX_LENGTH),
        "--keep-top-n", str(KEEP_TOP_N),
    ]

    if SCORING_COL:
        infer_cmd += ["--scoring-col", SCORING_COL]
    if ID_COL:
        infer_cmd += ["--id-col", ID_COL]

    run_cmd(infer_cmd, infer_log, cwd=REPO_ROOT)

    print("\nInference completed.")
    print("Predictions CSV:", INFER_OUT_CSV)

---

### 2.1) Download predictions and (optional) gold standard

This cell is a **results export utility**.

It allows you to download the outputs of the inference stage in a **single, self-contained ZIP archive** that also includes the effective run configuration.

**What this cell does:**

- Always includes:
  - `predictions.csv` produced by the inference stage
- Optionally includes a gold standard file:
  - automatically, if `*.test.gold.csv` exists (full / build-dataset pipeline), or
  - manually, if you provide a custom gold CSV path in the cell
- Always generates:
  - a ZIP archive containing:
    - `predictions.csv`
    - the gold CSV (if available)
    - `config.txt`, capturing the **exact configuration** used for the run

**How to use it:**

- **Full pipeline**  
  Just run the cell: the gold file is detected automatically.
- **Inference-only / external evaluation**  
  Set `GOLD_PATH_OVERRIDE` to the path of your own gold CSV (or leave it as `None`).
- **No gold available**  
  Leave the gold path as `None`: the ZIP will still be created with predictions + config.

This cell does **not** run any computation.  
It only packages artifacts already produced by previous stages, together with their configuration.
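
`make_zip_with_config` is defined elsewhere in the notebook and is not reproduced here. For readers who want to see the general idea, the sketch below shows one way such packaging could work. The function name `zip_with_config_sketch`, the `key = value` layout of `config.txt`, and the flat archive layout are all assumptions made for this example.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_with_config_sketch(zip_path, files, config: dict) -> Path:
    """Write `files` plus a generated config.txt into a ZIP (illustrative only)."""
    zip_path = Path(zip_path)
    config_text = "\n".join(f"{key} = {value!r}" for key, value in sorted(config.items()))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            # Store files flat; files sharing a base name would collide in this sketch.
            archive.write(file, arcname=Path(file).name)
        archive.writestr("config.txt", config_text + "\n")
    return zip_path


with tempfile.TemporaryDirectory() as tmp_name:
    tmp_dir = Path(tmp_name)
    predictions = tmp_dir / "predictions.csv"
    predictions.write_text("source_iri,predicted_iri\n", encoding="utf-8")

    archive_path = zip_with_config_sketch(
        tmp_dir / "predictions_only.zip",
        [predictions],
        {"INFER_MODE": "hybrid", "CROSS_TOP_K": 20},
    )
    with zipfile.ZipFile(archive_path) as archive:
        archive_members = sorted(archive.namelist())
        config_contents = archive.read("config.txt").decode("utf-8")
    print(archive_members)
```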

In [None]:
# ============================================
# DOWNLOAD ARTIFACTS (predictions + optional gold + config)
# ============================================
from pathlib import Path

# Colab-only helper
try:
    from google.colab import files
except ImportError as e:
    raise RuntimeError("This cell is intended for Google Colab (google.colab not available).") from e

# =====================================================
# USER OVERRIDE (optional)
# =====================================================
# If you have a gold file produced elsewhere, set it here.
# Example:
# GOLD_PATH_OVERRIDE = "/content/my_eval/gold_truth.csv"
GOLD_PATH_OVERRIDE = None  # str path or None

# =====================================================
# REQUIRED: predictions
# =====================================================
pred_path = Path(INFER_OUT_CSV)

if not pred_path.exists():
    raise FileNotFoundError(
        f"predictions file not found:\n - {pred_path}\n\n"
        "Notes:\n"
        " - predictions.csv is produced by the inference stage.\n"
        " - Run Stage 3 (Inference) first, or set INFER_OUT_CSV correctly."
    )

print("Found predictions:", pred_path)

# =====================================================
# OPTIONAL: gold
# =====================================================
gold_candidates = []

# 1) Explicit user-provided gold
if GOLD_PATH_OVERRIDE is not None:
    gold_candidates.append(Path(GOLD_PATH_OVERRIDE))

# 2) Default full-pipeline gold location
gold_candidates.append(Path(OUT_DATASET_CSV).with_suffix(".test.gold.csv"))

gold_path = next((p for p in gold_candidates if p.exists()), None)

if gold_path is None:
    print(
        "\nGold file NOT found (this is OK for inference-only runs).\n"
        "Tried the following paths:\n"
        + "\n".join(f" - {p}" for p in gold_candidates)
    )
else:
    print("Found gold:", gold_path)

# =====================================================
# CREATE ZIP (with config.txt)
# =====================================================
files_to_zip = [pred_path]
if gold_path is not None:
    files_to_zip.append(gold_path)

zip_name = "predictions_and_gold.zip" if gold_path else "predictions_only.zip"
zip_path = pred_path.parent / zip_name

make_zip_with_config(
    zip_path=zip_path,
    files_to_include=files_to_zip,
    config_dir=pred_path.parent,   # config.txt lives next to outputs
    download=True,
)

---

## Optional: package outputs for download (run only at the end)

This is only needed if you want to export artifacts locally.
Skip it if you are iterating quickly.

In [None]:
# ============================================
# DOWNLOAD FULL RUN (outputs + config)
# ============================================
from pathlib import Path

# Colab-only helper
try:
    from google.colab import files
except ImportError as e:
    raise RuntimeError("This cell is intended for Google Colab (google.colab not available).") from e

# -------------------------------------------------
# Zip the entire run directory + config.txt
# -------------------------------------------------
run_dir = Path(OUT_DIR)
if not run_dir.exists():
    raise FileNotFoundError(f"OUT_DIR not found: {run_dir}")

zip_path = run_dir.with_suffix(".zip")

print("Creating full run archive:")
print("  Source:", run_dir)
print("  Target:", zip_path)

# Collect all files under OUT_DIR (recursive)
files_to_zip = [p for p in run_dir.rglob("*") if p.is_file()]

make_zip_with_config(
    zip_path=zip_path,
    files_to_include=files_to_zip,
    config_dir=run_dir,   # config.txt written inside the run folder
    download=True,
)