# Inference Launcher

## Purpose of this notebook
This notebook is a script-first, notebook-as-launcher interface for running ontology-attribute alignment inference on Colab.
It will:
- clone the repository,
- install dependencies,
- load the required inference artifacts (offline bundle + ontology CSV + input attributes CSV),
- run inference via the CLI script run_inference.py,
- package outputs into a zip file for download.

## What is produced by this notebook

Running this notebook produces the complete outputs of an ontology–attribute alignment inference run.

Specifically, the notebook generates:
- a predictions CSV containing one row per input attribute, including:
    - the predicted ontology class IRI,
    - the associated confidence score,
    - metadata about the retrieval and scoring process,
    - optionally, the top-N scored candidate classes per attribute;
- a full execution log capturing:
    - the exact command used to run run_inference.py,
    - retrieval and scoring progress,
    - potential warnings or errors;
- a self-contained run directory, uniquely identified by a timestamp-based run ID, which includes:
    - predictions.csv,
    - run_inference.log,
    - any additional artifacts produced during inference.

All outputs are saved in a single run directory under outputs/ and are automatically packaged into a ZIP file, which can be downloaded to the local machine at the end of the notebook.

---

### Clone repository

This cell clones the project repository into the Colab VM.
Optionally, you can pin the checkout to a specific commit hash so that future runs use exactly the same code.

In [None]:
%%bash
set -e

REPO_URL="https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git"
REPO_DIR="2025-P13-Ontology-Alignment"

# Optional: lock to a specific commit for full reproducibility
COMMIT=""   # e.g. "4ffd790"

rm -rf "$REPO_DIR"
git clone "$REPO_URL" "$REPO_DIR"
cd "$REPO_DIR"

if [ -n "$COMMIT" ]; then
  git checkout "$COMMIT"
fi

echo "Checked out commit:"
git rev-parse HEAD

### Enter the repo directory

Each %%bash cell runs in its own subshell, so a cd issued there does not persist into later cells.
The %cd magic changes the notebook's own working directory, so all following cells run from inside the cloned repository.

In [None]:
%cd 2025-P13-Ontology-Alignment

### Install dependencies
This cell upgrades pip and installs all required Python packages from requirements.txt.
This ensures the notebook can run in a clean Colab environment without relying on local state.

In [None]:
%%bash
set -e
python -m pip install --upgrade pip
pip install -r requirements.txt

---

### Check runtime (CPU/GPU)
Inference can run on CPU, but it is typically much faster on GPU, especially for:
- semantic retrieval (bi-encoder query encoding),
- cross-encoder scoring (the expensive step).

This cell confirms whether CUDA is available and prints the GPU name.

In [None]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

---

### Provide required input artifacts
run_inference.py needs three inputs:
1. Offline bundle: offline_bundle.pkl.
2. Ontology CSV: internal_ontology.csv, containing at least the columns iri and text.
3. Input CSV: a dataset containing the attribute strings to align.
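
Before launching a run, it can help to confirm that the ontology CSV really has the two required columns. The following is a minimal sketch using only the standard library. The helper name `missing_ontology_columns` is hypothetical (it is not part of the repository), and it assumes the CSV has a header row.

```python
import csv

REQUIRED_ONTOLOGY_COLUMNS = ("iri", "text")  # the two columns named in the list above


def missing_ontology_columns(csv_path, required=REQUIRED_ONTOLOGY_COLUMNS):
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    # utf-8-sig also reads plain UTF-8 and drops a leading BOM if present.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

An empty list from `missing_ontology_columns("data/internal_ontology.csv")` means both columns are present; otherwise the list names what is missing.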

#### Option A: Upload files manually (recommended for quick tests)

This cell opens a file picker and lets you upload local files into the Colab VM.

Important note about semantic embeddings:
If your offline bundle contains a semantic index whose embeddings were saved separately (e.g., .npy), you must upload those files too and preserve the same relative paths; otherwise semantic retrieval will not be available.

In [None]:
from google.colab import files
import os

os.makedirs("data", exist_ok=True)

uploaded = files.upload()  # select offline_bundle.pkl, internal_ontology.csv, input.csv, and any embedding files if needed
for fn, content in uploaded.items():
    out_path = os.path.join("data", fn)
    with open(out_path, "wb") as f:
        f.write(content)
    print("Saved:", out_path)

#### Option B: Upload a zipped offline-bundle output directory

If your offline-bundle launcher produced a full run directory (with offline_bundle.pkl, embeddings, logs, and internal_ontology.csv), it is safest to upload it as a .zip and unzip it here.
This avoids path mismatches for embedding files.

In [None]:
from google.colab import files
import os, zipfile

os.makedirs("data", exist_ok=True)

uploaded = files.upload()  # choose something like offline_bundle_run_YYYYMMDD_HHMMSS.zip
zip_name = next(iter(uploaded.keys()))

with open(zip_name, "wb") as f:
    f.write(uploaded[zip_name])

with zipfile.ZipFile(zip_name, "r") as z:
    z.extractall("data")

print("Extracted zip into: data/")
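
Depending on how the archive was created, its contents may end up in a subfolder of data/ rather than directly in data/, in which case the default paths in the configuration cell would not match. A minimal sketch for locating an extracted file by name follows. `find_artifact` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_artifact(root, filename):
    """Return the first path (in sorted order) under `root` named `filename`, or None.

    Hypothetical helper for illustration; searches all subdirectories.
    """
    matches = sorted(Path(root).rglob(filename))
    return matches[0] if matches else None
```

For example, `find_artifact("data", "offline_bundle.pkl")` returns the location of the bundle, which can then be assigned to BUNDLE_PATH as a string. Pointing the path at the extracted location, instead of moving the file, keeps it next to any embedding files stored alongside it.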

---

### Configure output directory for this inference run

This creates a unique output directory using a timestamp-based run ID, so multiple runs do not overwrite each other.
The path of the latest output directory is also written to outputs/LAST_INFERENCE_RUN_DIR.txt, so later cells can read it instead of requiring manual edits.

In [None]:
%%bash
set -e

RUN_ID="inference_run_$(date +%Y%m%d_%H%M%S)"
OUT_DIR="outputs/${RUN_ID}"
mkdir -p "$OUT_DIR"

echo "$OUT_DIR" > outputs/LAST_INFERENCE_RUN_DIR.txt

echo "OUT_DIR=$OUT_DIR"
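
The same run-directory convention can be written in Python, for example when the launcher is driven from a script rather than a bash cell. This is a sketch only: `create_run_dir` is a hypothetical helper that mirrors the naming and the pointer file used by the bash cell.

```python
from datetime import datetime
from pathlib import Path


def create_run_dir(base="outputs", prefix="inference_run_", now=None):
    """Create base/<prefix><timestamp>/ and record its path in the pointer file.

    Hypothetical helper for illustration; `now` can be injected for testing.
    """
    now = now or datetime.now()
    run_id = prefix + now.strftime("%Y%m%d_%H%M%S")
    out_dir = Path(base) / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    # Same pointer file that the later cells read.
    pointer = Path(base) / "LAST_INFERENCE_RUN_DIR.txt"
    pointer.write_text(out_dir.as_posix() + "\n")
    return out_dir
```

The timestamp has one-second resolution, so two runs started within the same second would share a directory.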

---

### Configure inference parameters

This cell defines the parameters that will be passed to run_inference.py, including:
- the paths to input artifacts,
- which column contains attribute text,
- retrieval mode (lexical or hybrid),
- model IDs,
- batch sizes and top-k values.

You can edit these values without touching the script.

In [None]:
# -----------------------
# Paths inside Colab VM
# -----------------------
BUNDLE_PATH = "data/offline_bundle.pkl"
ONTOLOGY_CSV = "data/internal_ontology.csv"
INPUT_CSV = "data/input.csv"

# -----------------------
# Input schema
# -----------------------
ATTR_COL = "attribute"
ID_COL = None  # set to something like "attribute_id" if you want to carry an ID through

# -----------------------
# Retrieval mode
# -----------------------
MODE = "hybrid"  # "lexical" or "hybrid"

# -----------------------
# Models (HF model ids)
# -----------------------
CROSS_TOKENIZER_NAME = "dmis-lab/biobert-base-cased-v1.1"
CROSS_ENCODER_MODEL_ID = "YOUR_CROSS_ENCODER_HF_ID"  # REQUIRED

# -----------------------
# Device
# -----------------------
DEVICE = None  # None = auto, or "cuda", or "cpu"

# -----------------------
# Retrieval params
# -----------------------
RETR_LEX_TOPK = 120
RETR_SEM_TOPK = 120
RETR_MERGED_TOPK = 200
HYBRID_RATIO_SEM = 0.5
SEM_BATCH_SIZE = 64

# -----------------------
# Cross-encoder scoring params
# -----------------------
CROSS_TOPK = 20
CROSS_BATCH_SIZE = 32
CROSS_MAX_LEN = 256

# -----------------------
# Output detail
# -----------------------
KEEP_TOP_N = 5
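
Because a full run can take a while, a quick check of these settings before launching may save time. The sketch below is illustrative only. `config_problems` is a hypothetical helper, and the numeric constraints (ratio within [0, 1], positive integers) are assumptions rather than rules taken from run_inference.py. The allowed modes and the placeholder model ID come from the cell above.

```python
def config_problems(mode, cross_encoder_model_id, hybrid_ratio_sem, int_params):
    """Return a list of human-readable problems found in the launcher settings.

    Hypothetical helper; the numeric constraints are assumptions for
    illustration, not rules taken from run_inference.py.
    """
    problems = []
    if mode not in ("lexical", "hybrid"):
        problems.append(f"MODE must be 'lexical' or 'hybrid', got {mode!r}")
    if not cross_encoder_model_id or cross_encoder_model_id == "YOUR_CROSS_ENCODER_HF_ID":
        problems.append("CROSS_ENCODER_MODEL_ID is still the placeholder value")
    if not 0.0 <= hybrid_ratio_sem <= 1.0:  # assumed valid range
        problems.append(f"HYBRID_RATIO_SEM should be in [0, 1], got {hybrid_ratio_sem}")
    for name, value in int_params.items():
        if not isinstance(value, int) or value <= 0:  # assumed: positive integers
            problems.append(f"{name} should be a positive integer, got {value!r}")
    return problems


# Example with the defaults from the configuration cell: only the
# placeholder model ID is reported.
problems = config_problems(
    mode="hybrid",
    cross_encoder_model_id="YOUR_CROSS_ENCODER_HF_ID",
    hybrid_ratio_sem=0.5,
    int_params={"CROSS_TOPK": 20, "CROSS_BATCH_SIZE": 32},
)
print(problems)
```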

---

### Run inference via the CLI script

This cell runs the full inference pipeline by invoking the script run_inference.py with CLI arguments.

What happens inside the script:
- Stage A retrieval (exact + lexical + semantic depending on mode)
- Stage B cross-encoder scoring for selected candidate pairs
- A final predictions CSV is written to the run output directory

All stdout/stderr is captured into a log file, so runs are traceable and debuggable.

In [None]:
import subprocess

with open("outputs/LAST_INFERENCE_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

pred_csv = f"{out_dir}/predictions.csv"
log_path = f"{out_dir}/run_inference.log"

cmd = [
    "python", "run_inference.py",
    "--bundle", BUNDLE_PATH,
    "--ontology-csv", ONTOLOGY_CSV,
    "--input-csv", INPUT_CSV,
    "--out-csv", pred_csv,
    "--attr-col", ATTR_COL,
    "--mode", MODE,
    "--cross-tokenizer-name", CROSS_TOKENIZER_NAME,
    "--cross-encoder-model-id", CROSS_ENCODER_MODEL_ID,
    "--retrieval-lexical-top-k", str(RETR_LEX_TOPK),
    "--retrieval-semantic-top-k", str(RETR_SEM_TOPK),
    "--retrieval-merged-top-k", str(RETR_MERGED_TOPK),
    "--hybrid-ratio-semantic", str(HYBRID_RATIO_SEM),
    "--semantic-batch-size", str(SEM_BATCH_SIZE),
    "--cross-top-k", str(CROSS_TOPK),
    "--cross-batch-size", str(CROSS_BATCH_SIZE),
    "--cross-max-length", str(CROSS_MAX_LEN),
    "--keep-top-n", str(KEEP_TOP_N),
]

if ID_COL is not None:
    cmd += ["--id-col", ID_COL]

if DEVICE is not None:
    cmd += ["--device", DEVICE]

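import shlex  # quotes arguments so the printed command can be pasted into a shell

cmd_str = shlex.join(cmd)
print("Running command:\n", cmd_str)
with open(log_path, "w") as log_file:
    # Record the exact command at the top of the log, then flush so it
    # appears before the subprocess output.
    log_file.write(f"$ {cmd_str}\n")
    log_file.flush()
    proc = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)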

print("Return code:", proc.returncode)
print("Predictions:", pred_csv)
print("Log:", log_path)
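
Since all script output goes to the log file, a failed run shows nothing in the notebook apart from a non-zero return code. A small sketch for surfacing the end of the log in that case is shown below. `tail_lines` is a hypothetical helper, not part of the repository.

```python
from collections import deque


def tail_lines(path, n=20):
    """Return the last `n` lines of a text file as a list of strings."""
    with open(path, encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# In the notebook, after subprocess.run(...):
# if proc.returncode != 0:
#     print("\n".join(tail_lines(log_path)))
```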

---

### Inspect the output

This cell loads the generated predictions.csv and shows a preview.
It is a sanity check to confirm the run produced results in the expected format.

In [None]:
import pandas as pd

with open("outputs/LAST_INFERENCE_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

pred_csv = f"{out_dir}/predictions.csv"
df = pd.read_csv(pred_csv)
df.head()
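
Since the predictions file is described as having one row per input attribute, comparing row counts is a cheap additional check. The following sketch uses only the standard library. `count_data_rows` is a hypothetical helper, and the comparison assumes the script neither drops nor duplicates input rows, which the notebook does not state explicitly.

```python
import csv


def count_data_rows(csv_path):
    """Count non-empty rows after the header, honoring quoted multi-line fields."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)


# In the notebook:
# assert count_data_rows(pred_csv) == count_data_rows(INPUT_CSV)
```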

---

### Package outputs for download

The inference run produces multiple useful files:
- predictions.csv
- run_inference.log

To keep runs reproducible and shareable, we package the entire run directory into a zip file.

In [None]:
%%bash
set -e
OUT_DIR=$(cat outputs/LAST_INFERENCE_RUN_DIR.txt)
ZIP_PATH="${OUT_DIR}.zip"
zip -r "$ZIP_PATH" "$OUT_DIR"
echo "Created zip: $ZIP_PATH"

This final cell downloads the zip file from the Colab VM to your local machine.

In [None]:
from google.colab import files

with open("outputs/LAST_INFERENCE_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

files.download(f"{out_dir}.zip")

---

## Notes
- If semantic retrieval does not work, it’s usually because embeddings were saved separately and were not uploaded/unzipped with the correct paths. Best practice: upload/unzip the entire offline-bundle run directory.
- If you hit GPU out-of-memory:
    - reduce CROSS_BATCH_SIZE (first knob),
    - reduce SEM_BATCH_SIZE,
    - reduce CROSS_TOPK (reduces the number of scored pairs),
    - optionally reduce RETR_*_TOPK.