# Offline Bundle Builder Launcher


## Purpose of this notebook

This notebook is a script-first, notebook-as-launcher interface for building the offline ontology bundle used by the inference pipeline.

This notebook:
- sets up a clean execution environment (Colab),
- clones the repository,
- installs dependencies,
- acquires the ontology (from URL or local upload),
- launches the official CLI script `build_ontology_bundle.py`,
- collects logs and outputs in a reproducible way.

## What is produced by this notebook

Running this notebook produces:
- an internal ontology CSV (iri, label, text, …),
- an offline bundle (`offline_bundle.pkl`) containing:
   - exact match structures,
   - lexical retrieval structures,
   - a semantic index,
- execution logs and the command trace.

All outputs are saved in a single run directory and can be downloaded as a ZIP.

---

### Runtime check

In [None]:
%%bash
nvidia-smi -L || echo "No GPU detected (CPU runtime)"

Building the semantic index is computationally expensive. GPU runtime is strongly recommended.
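
The same check can be made from Python, for example to decide whether to lower the batch size before launching the build. The sketch below is a minimal illustration: `detect_gpus` is a hypothetical helper name, and it only wraps the `nvidia-smi -L` call used above.

```python
import shutil
import subprocess


def detect_gpus() -> list[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # no NVIDIA tools on PATH: treat as a CPU runtime
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = detect_gpus()
print(gpus if gpus else "No GPU detected (CPU runtime)")
```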

### Clone repository (reproducible setup)
This notebook only runs the official scripts from the repository.

In [None]:
%%bash
set -e

REPO_URL="https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git"
REPO_DIR="2025-P13-Ontology-Alignment"

# Optional: lock to a specific commit for full reproducibility
COMMIT=""   # e.g. "4ffd790"

rm -rf "$REPO_DIR"
git clone "$REPO_URL" "$REPO_DIR"
cd "$REPO_DIR"

if [ -n "$COMMIT" ]; then
  git checkout "$COMMIT"
fi

echo "Checked out commit:"
git rev-parse HEAD

### Enter the repo directory

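Colab runs each `%%bash` cell in its own subshell, so a `cd` inside one cell does not carry over to later cells.
To keep the notebook state consistent, we move into the cloned repository using `%cd`, which changes the working directory of the notebook kernel itself.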

In [None]:
%cd 2025-P13-Ontology-Alignment
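
To see why a plain `cd` inside a `%%bash` cell is not enough, the following sketch (an illustration, not part of the repository) changes directory inside a child shell and shows that the parent Python process keeps its own working directory. `os.chdir` stands in here for what `%cd` does to the kernel process; a POSIX shell is assumed.

```python
import os
import shlex
import subprocess
import tempfile

start = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    # The child shell changes directory, but only for itself.
    child = subprocess.run(
        f"cd {shlex.quote(tmp)} && pwd",
        shell=True, capture_output=True, text=True, check=True,
    )
    child_cwd = child.stdout.strip()
    parent_cwd_after_child = os.getcwd()  # unchanged by the child's cd

    # os.chdir changes the directory of this (parent) process.
    os.chdir(tmp)
    parent_cwd_after_chdir = os.getcwd()
    os.chdir(start)  # restore before the temporary directory is removed

print("child shell cwd:    ", child_cwd)
print("parent after child: ", parent_cwd_after_child)
print("parent after chdir: ", parent_cwd_after_chdir)
```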

### Install dependencies

In [None]:
%%bash
set -e
python -m pip install --upgrade pip
pip install -r requirements.txt

---

### Ontology input: choose ONE option
The ontology can be provided either as:
- a remote URL, or
- a local file upload.

Only one option is needed.

In [None]:
%%bash
set -euo pipefail
mkdir -p outputs
ENV_FILE="outputs/oa_env.sh"
: > "$ENV_FILE"
echo "[INFO] Env file reset at $ENV_FILE"
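
Because each `%%bash` cell starts from a fresh shell, the notebook shares state through `outputs/oa_env.sh`, a file of `export NAME="value"` lines that later cells `source`. Python cells can read the same file. The sketch below is one way to do that, assuming every line has the simple `export NAME="value"` form written by this notebook (no variable expansion is performed); `read_env_file` is a hypothetical helper, not a function from the repository.

```python
import shlex
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse `export NAME="value"` lines into a dict; later lines win."""
    env: dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        tokens = shlex.split(line)  # resolves the shell quoting
        if len(tokens) == 2 and tokens[0] == "export" and "=" in tokens[1]:
            name, value = tokens[1].split("=", 1)
            env[name] = value
    return env


# Only read the file if an earlier cell has already created it.
if Path("outputs/oa_env.sh").exists():
    print(read_env_file("outputs/oa_env.sh"))
```

Letting later lines override earlier ones matches what `source` does when the same variable is exported twice.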

#### Option A — Ontology from URL

In [None]:
%%bash
set -euo pipefail

ENV_FILE="outputs/oa_env.sh"
mkdir -p "$(dirname "$ENV_FILE")"

# --- Set ONTO_URL to download, or leave it empty to use an existing file ---
# Remote source (set a real URL)
ONTO_URL=""   # e.g. "https://raw.githubusercontent.com/.../envo.owl"

# Download destination, or path of a file already present in the Colab VM
ONTO_LOCAL_PATH="/content/2025-P13-Ontology-Alignment/datasets/envo.owl"

# --- Logic ---
if [ -n "${ONTO_URL}" ]; then
  echo "[INFO] Downloading ontology from URL..."
  mkdir -p "$(dirname "$ONTO_LOCAL_PATH")"
  wget -q --show-progress -O "$ONTO_LOCAL_PATH" "$ONTO_URL"
else
  echo "[INFO] ONTO_URL empty -> using local file."
fi

# --- Sanity checks ---
if [ ! -f "$ONTO_LOCAL_PATH" ]; then
  echo "[ERROR] Ontology file not found at: $ONTO_LOCAL_PATH"
  echo "        Either set ONTO_URL to a valid URL OR set ONTO_LOCAL_PATH to an existing file."
  exit 1
fi

ls -lh "$ONTO_LOCAL_PATH"

# Save to env file (overwrite the variable line cleanly)
# (avoid duplicating exports every run)
grep -v '^export ONTO_LOCAL_PATH=' "$ENV_FILE" 2>/dev/null > "${ENV_FILE}.tmp" || true
mv "${ENV_FILE}.tmp" "$ENV_FILE"
echo "export ONTO_LOCAL_PATH=\"$ONTO_LOCAL_PATH\"" >> "$ENV_FILE"

echo "[INFO] Saved ONTO_LOCAL_PATH to $ENV_FILE"

#### Option B — Upload ontology file manually

In [None]:
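from google.colab import files
import os

ENV_FILE = "outputs/oa_env.sh"

uploaded = files.upload()
if not uploaded:
    raise RuntimeError("No file was uploaded.")
fname = next(iter(uploaded))  # use the first uploaded file

# Store the ontology under data/ inside the repository
os.makedirs("data", exist_ok=True)
onto_path = os.path.join("data", fname)
with open(onto_path, "wb") as f:
    f.write(uploaded[fname])

# Append the path to the shared env file; when sourced, the last export wins
os.makedirs(os.path.dirname(ENV_FILE), exist_ok=True)
with open(ENV_FILE, "a") as f:
    f.write(f'export ONTO_LOCAL_PATH="{onto_path}"\n')

print("Saved:", onto_path)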
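
Option A rewrites the `ONTO_LOCAL_PATH` line in place, while the upload cell above appends, so running both leaves two exports and the last one wins when the file is sourced. If a single definition per variable is preferred, a small helper can drop any earlier line before appending. This is a sketch under that assumption; `set_env_var` is a hypothetical name, and `shlex.quote` is used instead of hand-written double quotes so that unusual filenames stay safe to `source`.

```python
import shlex
from pathlib import Path


def set_env_var(env_file: str, name: str, value: str) -> None:
    """Write `export NAME=value`, replacing any earlier definition of NAME."""
    path = Path(env_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = path.read_text().splitlines() if path.exists() else []
    prefix = f"export {name}="
    kept = [line for line in lines if not line.startswith(prefix)]
    kept.append(f"{prefix}{shlex.quote(value)}")
    path.write_text("\n".join(kept) + "\n")
```

Calling `set_env_var("outputs/oa_env.sh", "ONTO_LOCAL_PATH", onto_path)` would then take the place of the final `open(..., "a")` block.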

### Sanity check
The following cell verifies that the ontology path has been correctly set.

In [None]:
%%bash
set -euo pipefail

ENV_FILE="outputs/oa_env.sh"
test -f "$ENV_FILE"
source "$ENV_FILE"

test -n "${ONTO_LOCAL_PATH:-}"
test -f "$ONTO_LOCAL_PATH"
ls -lh "$ONTO_LOCAL_PATH"

---

### Configure output directory
Each execution produces a self-contained run folder.

In [None]:
%%bash
set -euo pipefail

ENV_FILE="outputs/oa_env.sh"

RUN_ID="offline_bundle_run_$(date +%Y%m%d_%H%M%S)"
OUT_DIR="outputs/${RUN_ID}"
mkdir -p "$OUT_DIR"

OUT_CSV="${OUT_DIR}/internal_ontology.csv"
OUT_BUNDLE="${OUT_DIR}/offline_bundle.pkl"
OUT_LOG="${OUT_DIR}/build_bundle.log"

echo "export OUT_DIR=\"$OUT_DIR\""       >> "$ENV_FILE"
echo "export OUT_CSV=\"$OUT_CSV\""       >> "$ENV_FILE"
echo "export OUT_BUNDLE=\"$OUT_BUNDLE\"" >> "$ENV_FILE"
echo "export OUT_LOG=\"$OUT_LOG\""       >> "$ENV_FILE"

echo "$OUT_DIR" > outputs/LAST_RUN_DIR.txt
echo "[INFO] OUT_DIR=$OUT_DIR"
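
The same layout can be created from Python if the launcher logic is ever moved out of bash. The sketch below mirrors the naming scheme of the cell above (`offline_bundle_run_<timestamp>` and the three output filenames); `make_run_dir` is a hypothetical helper, not part of the repository.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(root: str = "outputs") -> dict[str, Path]:
    """Create a timestamped run folder and return the paths used by the build."""
    run_id = "offline_bundle_run_" + datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = Path(root) / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    # Remember the most recent run, as the bash cell does.
    (Path(root) / "LAST_RUN_DIR.txt").write_text(f"{out_dir}\n")
    return {
        "OUT_DIR": out_dir,
        "OUT_CSV": out_dir / "internal_ontology.csv",
        "OUT_BUNDLE": out_dir / "offline_bundle.pkl",
        "OUT_LOG": out_dir / "build_bundle.log",
    }
```

`make_run_dir()` with the default root writes under `outputs/`, matching the bash cell.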

---

### Configure model and preprocessing parameters
These parameters are passed directly to the CLI script.

- `TOKENIZER_NAME` must match the tokenizer used by the cross-encoder at inference time.
- `BI_ENCODER_MODEL_ID` is used ONLY to build the semantic index (offline).
- These choices do NOT affect training here, only retrieval quality at inference.

In [None]:
%%bash
set -euo pipefail
ENV_FILE="outputs/oa_env.sh"

TOKENIZER_NAME="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
BI_ENCODER_MODEL_ID="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

SEMANTIC_BATCH_SIZE=64
SEMANTIC_MAX_LENGTH=256
NO_SEMANTIC_NORMALIZE=0

PREFIX="http://purl.obolibrary.org/obo/ENVO_"

echo "export TOKENIZER_NAME=\"$TOKENIZER_NAME\""           >> "$ENV_FILE"
echo "export BI_ENCODER_MODEL_ID=\"$BI_ENCODER_MODEL_ID\"" >> "$ENV_FILE"
echo "export SEMANTIC_BATCH_SIZE=\"$SEMANTIC_BATCH_SIZE\"" >> "$ENV_FILE"
echo "export SEMANTIC_MAX_LENGTH=\"$SEMANTIC_MAX_LENGTH\"" >> "$ENV_FILE"
echo "export NO_SEMANTIC_NORMALIZE=\"$NO_SEMANTIC_NORMALIZE\"" >> "$ENV_FILE"
echo "export PREFIX=\"$PREFIX\""                           >> "$ENV_FILE"

echo "[INFO] Model config saved."

### Practical Notes
If the ontology is large and building the semantic index is too heavy for the runtime (memory or time):
- reduce `--semantic-batch-size` (e.g. 16 or 32),
- reduce `--semantic-max-length` (e.g. 128 or 192) if `RICH_TEXT` is long.
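
As a rough guide to why these two knobs matter: the number of tokens in one forward pass is at most `batch_size × max_length`, and the self-attention score matrices grow with the square of the sequence length. The size of the stored index, by contrast, depends only on the number of ontology entries and the embedding width. The sketch below does this arithmetic. The 768-dimensional float32 embedding is an assumption (the hidden size of BERT-base models), and the 10,000-entry ontology is a made-up example, not a measurement.

```python
def tokens_per_batch(batch_size: int, max_length: int) -> int:
    """Upper bound on tokens per forward pass (inputs padded to max_length)."""
    return batch_size * max_length


def attention_scores_per_batch(batch_size: int, max_length: int) -> int:
    """Entries in the score matrices of one attention head, per batch."""
    return batch_size * max_length * max_length


def index_size_bytes(num_terms: int, dim: int = 768, bytes_per_value: int = 4) -> int:
    """Size of a dense embedding matrix; dim and dtype width are assumptions."""
    return num_terms * dim * bytes_per_value


default = tokens_per_batch(64, 256)  # notebook defaults
reduced = tokens_per_batch(16, 128)  # smallest values suggested above
print(default, reduced, default // reduced)  # 16384 2048 8
print(attention_scores_per_batch(64, 256) // attention_scores_per_batch(16, 128))  # 16
print(index_size_bytes(10_000) / 1e6, "MB")  # 30.72 MB
```

Lowering the batch size and maximum length reduces peak memory during encoding, but it does not shrink the final index.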

---

## Launch offline bundle construction
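This is the only cell that performs the actual build. The exact command is saved to `command.txt` and the combined stdout/stderr to `build_bundle.log`, both inside the run directory, so the run can be reproduced.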

In [None]:
%%bash
set -euo pipefail

source outputs/oa_env.sh

mkdir -p "$OUT_DIR"

CMD="python build_ontology_bundle.py \
  --ont-path \"$ONTO_LOCAL_PATH\" \
  --out-csv \"$OUT_CSV\" \
  --out-bundle \"$OUT_BUNDLE\" \
  --tokenizer-name \"$TOKENIZER_NAME\" \
  --bi-encoder-model-id \"$BI_ENCODER_MODEL_ID\" \
  --semantic-batch-size \"$SEMANTIC_BATCH_SIZE\" \
  --semantic-max-length \"$SEMANTIC_MAX_LENGTH\""

if [ -n "${PREFIX:-}" ]; then
  CMD="$CMD --prefix \"$PREFIX\""
fi

if [ "${NO_SEMANTIC_NORMALIZE:-0}" = "1" ]; then
  CMD="$CMD --no-semantic-normalize"
fi

echo "$CMD" | tee "$OUT_DIR/command.txt"
bash -lc "$CMD" 2>&1 | tee "$OUT_LOG"
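
Building the command as a single string works, but it relies on the nested quoting being re-parsed correctly by the inner shell. An alternative, sketched below, is to build an argument list so that each value is passed as one argument regardless of spaces or special characters. The flags are the ones used in the cell above; `build_command` is a hypothetical helper, the run folder name in the example is a placeholder, and the sketch only constructs and prints the command without running the build.

```python
import shlex


def build_command(cfg: dict[str, str]) -> list[str]:
    """Assemble the CLI call as an argument list (no shell quoting needed)."""
    cmd = [
        "python", "build_ontology_bundle.py",
        "--ont-path", cfg["ONTO_LOCAL_PATH"],
        "--out-csv", cfg["OUT_CSV"],
        "--out-bundle", cfg["OUT_BUNDLE"],
        "--tokenizer-name", cfg["TOKENIZER_NAME"],
        "--bi-encoder-model-id", cfg["BI_ENCODER_MODEL_ID"],
        "--semantic-batch-size", cfg["SEMANTIC_BATCH_SIZE"],
        "--semantic-max-length", cfg["SEMANTIC_MAX_LENGTH"],
    ]
    if cfg.get("PREFIX"):
        cmd += ["--prefix", cfg["PREFIX"]]
    if cfg.get("NO_SEMANTIC_NORMALIZE", "0") == "1":
        cmd.append("--no-semantic-normalize")
    return cmd


model = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
run = "outputs/offline_bundle_run_EXAMPLE"  # placeholder run folder
cfg = {
    "ONTO_LOCAL_PATH": "datasets/envo.owl",
    "OUT_CSV": f"{run}/internal_ontology.csv",
    "OUT_BUNDLE": f"{run}/offline_bundle.pkl",
    "TOKENIZER_NAME": model,
    "BI_ENCODER_MODEL_ID": model,
    "SEMANTIC_BATCH_SIZE": "64",
    "SEMANTIC_MAX_LENGTH": "256",
    "NO_SEMANTIC_NORMALIZE": "0",
    "PREFIX": "http://purl.obolibrary.org/obo/ENVO_",
}
cmd = build_command(cfg)
print(shlex.join(cmd))  # a copy-pasteable trace, suitable for command.txt
```

To execute it, the list can be passed to `subprocess.run(cmd, check=True)`.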

---

### Sanity check (no inference)
This cell does not score anything.
It only verifies that the bundle loads correctly.

In [None]:
from pathlib import Path
from ontologies.offline_preprocessing import load_offline_bundle

out_dir = Path(Path("outputs/LAST_RUN_DIR.txt").read_text().strip())
bundle_path = out_dir / "offline_bundle.pkl"

bundle = load_offline_bundle(bundle_path, load_semantic_embeddings_=True, mmap=True)
print("Bundle keys:", list(bundle.keys()))
sem = bundle.get("semantic_index", {})
print("Semantic keys:", list(sem.keys()))
print("Embeddings shape:", getattr(sem.get("embeddings", None), "shape", None))
print("#IRIs:", len(sem.get("iris", [])))
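
A slightly stricter check is to confirm that the index holds one embedding row per IRI. The sketch below assumes, as the cell above does, that the semantic index is a mapping with `embeddings` (an object with a `shape`) and `iris` (a sequence). It further assumes the embeddings are two-dimensional. `check_semantic_index` is a hypothetical helper, and the stand-in object at the end exists only so the example runs without a real bundle.

```python
from types import SimpleNamespace


def check_semantic_index(sem: dict) -> tuple[int, int]:
    """Return (rows, dim) if the embeddings line up with the IRI list."""
    emb = sem.get("embeddings")
    iris = sem.get("iris", [])
    shape = getattr(emb, "shape", None)
    if shape is None or len(shape) != 2:
        raise ValueError(f"expected a 2-D embeddings array, got shape={shape}")
    if shape[0] != len(iris):
        raise ValueError(f"{shape[0]} embedding rows but {len(iris)} IRIs")
    return shape[0], shape[1]


# Stand-in for bundle["semantic_index"]; a real run would pass `sem` instead.
fake = {"embeddings": SimpleNamespace(shape=(3, 768)), "iris": ["a", "b", "c"]}
print(check_semantic_index(fake))  # (3, 768)
```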

---

## Package outputs for download
Package the outputs of the current run into a single archive that can be downloaded and reused locally or in downstream pipelines.

In [None]:
%%bash
set -euo pipefail
OUT_DIR=$(cat outputs/LAST_RUN_DIR.txt)
ZIP_PATH="${OUT_DIR}.zip"

zip -r "$ZIP_PATH" "$OUT_DIR"
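
If the `zip` binary is not available, the standard library can build a comparable archive. The sketch below is an alternative to the cell above, not what the notebook runs; `zip_run_dir` is a hypothetical helper. One difference to be aware of: paths inside this archive start at the run folder name rather than at `outputs/`.

```python
import shutil
from pathlib import Path


def zip_run_dir(out_dir: str) -> Path:
    """Create <out_dir>.zip containing the run folder itself."""
    run = Path(out_dir).resolve()
    archive = shutil.make_archive(
        base_name=str(run),        # ".zip" is appended automatically
        format="zip",
        root_dir=str(run.parent),  # archive paths are relative to this
        base_dir=run.name,         # include the folder, not just its contents
    )
    return Path(archive)
```

With the notebook's layout, `zip_run_dir(Path("outputs/LAST_RUN_DIR.txt").read_text().strip())` would produce the same `<run folder>.zip` file that the download cell expects.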

Download the generated archive from the Colab environment to the local machine.

In [None]:
from google.colab import files

with open("outputs/LAST_RUN_DIR.txt") as f:
    OUT_DIR = f.read().strip()

files.download(f"{OUT_DIR}.zip")