# Offline Bundle Builder Launcher


## Purpose of this notebook

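This notebook is a thin launcher for building the offline ontology bundle used by the inference pipeline. It follows a script-first design: the real work is done by the repository's CLI script, and the notebook only prepares the environment and invokes it.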

This notebook:
- sets up a clean execution environment (Colab),
- clones the repository,
- installs dependencies,
- acquires the ontology (from URL or local upload),
- launches the official CLI script build_ontology_bundle.py,
- collects logs and outputs in a reproducible way.

## What is produced by this notebook

Running this notebook produces:
- an internal ontology CSV (iri, label, text, …),
- an offline bundle (offline_bundle.pkl) containing:
   - exact match structures,
   - lexical retrieval structures,
   - a semantic index,
- execution logs and a command trace.

All outputs are saved in a single run directory and can be downloaded as a ZIP.

---

### Runtime check

In [None]:
%%bash
nvidia-smi -L || echo "No GPU detected (CPU runtime)"

Building the semantic index is computationally expensive, so a GPU runtime is strongly recommended.
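
The same check can be done from Python when a later cell needs to branch on the result. The following is a minimal sketch that relies only on `nvidia-smi` being on the PATH when a GPU is attached; the helper name `list_gpus` is hypothetical and not part of the repository.

```python
import shutil
import subprocess


def list_gpus() -> list[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none."""
    # On CPU-only runtimes the nvidia-smi binary is usually absent.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    # A non-zero exit code means the driver could not list any device.
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected (CPU runtime)")
```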

### Clone repository (reproducible setup)
This notebook only runs the official scripts from the repository.

In [None]:
%%bash

REPO_URL="https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git"
REPO_DIR="2025-P13-Ontology-Alignment"

# Optional: lock to a specific commit for full reproducibility
COMMIT=""   # e.g. "4ffd790"

rm -rf "$REPO_DIR"
git clone "$REPO_URL" "$REPO_DIR"
cd "$REPO_DIR"

if [ -n "$COMMIT" ]; then
  git checkout "$COMMIT"
fi

git rev-parse HEAD

### Enter the repo directory

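Colab runs each bash cell in its own subshell, so the cd performed in the clone cell does not persist.
To keep the notebook state consistent, we move into the cloned repository using %cd, which changes the working directory of the notebook itself.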

In [None]:
%cd 2025-P13-Ontology-Alignment
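
The subshell behaviour also applies to variables, not only to the working directory. The sketch below illustrates this with plain `subprocess` calls, assuming `bash` is available; the variable name `BUNDLE_DEMO_VAR` is hypothetical and used only for the demonstration.

```python
import os
import subprocess

VAR = "BUNDLE_DEMO_VAR"  # hypothetical name, used only for this demonstration
os.environ.pop(VAR, None)


def bash_sees(name: str) -> str:
    """Return the value a fresh bash process sees for `name` ('' if unset)."""
    result = subprocess.run(
        ["bash", "-c", f'printf "%s" "${{{name}:-}}"'],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# 1) An assignment made inside one bash process is gone in the next one.
subprocess.run(["bash", "-c", f"{VAR}=hello"], check=True)
before = bash_sees(VAR)

# 2) A value placed in os.environ is inherited by every later child process.
os.environ[VAR] = "hello"
after = bash_sees(VAR)

print(repr(before), repr(after))  # '' 'hello'
```

Values that must survive across cells therefore need to live in the notebook process (for example in `os.environ`) or in a file.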

### Install dependencies

In [None]:
%%bash

pip -q install --upgrade pip
pip -q install -r requirements.txt

---

### Ontology input: choose ONE option
The ontology can be provided either as:
- a remote URL, or
- a local file upload.

Only one option is needed.

#### Option A — Ontology from URL

In [None]:
%%bash

ONTO_URL="https://example.com/your_ontology.owl"   # ← change this
ONTO_LOCAL_PATH="data/input_ontology.owl"

mkdir -p data
wget -O "$ONTO_LOCAL_PATH" "$ONTO_URL"
ls -lh "$ONTO_LOCAL_PATH"

#### Option B — Upload ontology file manually

In [None]:
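from google.colab import files
import os

# Opens the Colab upload widget; returns a dict {filename: file bytes}.
uploaded = files.upload()
fname = next(iter(uploaded.keys()))

os.makedirs("data", exist_ok=True)
ONTO_LOCAL_PATH = f"data/{fname}"

with open(ONTO_LOCAL_PATH, "wb") as f:
    f.write(uploaded[fname])

# Export the path so that later %%bash cells can read ${ONTO_LOCAL_PATH}.
os.environ["ONTO_LOCAL_PATH"] = ONTO_LOCAL_PATH

print("Ontology saved to:", ONTO_LOCAL_PATH)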
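
Whichever option is used, it can help to confirm that the file on disk looks like an ontology before starting a long build. The following is a minimal sketch, not part of the repository: the helper `check_ontology_file` is hypothetical, and the HTML test is only a heuristic for a download that returned an error page instead of the ontology.

```python
from pathlib import Path


def check_ontology_file(path: str) -> int:
    """Return the file size in bytes, or raise if the file looks unusable."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Ontology file not found: {p}")
    size = p.stat().st_size
    if size == 0:
        raise ValueError(f"Ontology file is empty: {p}")
    # Read only the first bytes; ontologies can be very large.
    with p.open("rb") as f:
        head = f.read(512).lstrip().lower()
    # Heuristic: a failed download often saves an HTML error page.
    if head.startswith((b"<!doctype html", b"<html")):
        raise ValueError(f"File looks like an HTML page, not an ontology: {p}")
    return size
```

For Option A this would be called as `check_ontology_file("data/input_ontology.owl")`; for Option B, with the path printed by the upload cell.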

---

### Configure output directory
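Each execution produces a self-contained run folder. Its path is written to outputs/LAST_RUN_DIR.txt so that later cells can recover it, since shell variables do not outlive the cell that sets them.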

In [None]:
%%bash

RUN_ID="offline_bundle_run_$(date +%Y%m%d_%H%M%S)"
OUT_DIR="outputs/${RUN_ID}"

mkdir -p "$OUT_DIR"
echo "$OUT_DIR" > outputs/LAST_RUN_DIR.txt

OUT_CSV="${OUT_DIR}/internal_ontology.csv"
OUT_BUNDLE="${OUT_DIR}/offline_bundle.pkl"
OUT_LOG="${OUT_DIR}/build_bundle.log"
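
The run-folder convention above can also be expressed in Python, which is convenient when a Python cell needs the same paths. This is a sketch that mirrors the naming used in the bash cell; the function names `create_run_dir` and `last_run_dir` are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def create_run_dir(root: str = "outputs") -> Path:
    """Create a timestamped run folder and record it in LAST_RUN_DIR.txt."""
    run_id = "offline_bundle_run_" + datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = Path(root) / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    # The pointer file is how later cells find the current run.
    (Path(root) / "LAST_RUN_DIR.txt").write_text(f"{out_dir}\n")
    return out_dir


def last_run_dir(root: str = "outputs") -> Path:
    """Read back the run folder recorded by the most recent run."""
    return Path((Path(root) / "LAST_RUN_DIR.txt").read_text().strip())
```

Calling `last_run_dir()` in any later cell returns the folder of the most recent run, from which `internal_ontology.csv`, `offline_bundle.pkl` and `build_bundle.log` can be derived.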

---

### Configure model and preprocessing parameters
These parameters are passed directly to the CLI script.

In [None]:
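import os

TOKENIZER_NAME = "dmis-lab/biobert-base-cased-v1.1"

# Bi-encoder used to build the semantic index
BI_ENCODER_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

SEMANTIC_BATCH_SIZE = 64
SEMANTIC_MAX_LENGTH = 256

NO_SEMANTIC_NORMALIZE = 0  # Set to 1 to disable normalization of semantic embeddings

# Optional: restrict ontology classes by IRI prefix
PREFIX = ""   # e.g. "http://purl.obolibrary.org/obo/ENVO_"

# %%bash cells cannot see Python variables, so export them as environment variables.
for _name in ("TOKENIZER_NAME", "BI_ENCODER_MODEL_ID", "SEMANTIC_BATCH_SIZE",
              "SEMANTIC_MAX_LENGTH", "NO_SEMANTIC_NORMALIZE", "PREFIX"):
    os.environ[_name] = str(globals()[_name])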

### Practical Notes
If the ontology is large and building the semantic index is too heavy for the runtime:
- reduce --semantic-batch-size (e.g. 16 or 32),
- reduce --semantic-max-length (e.g. 128 or 192) if the “RICH_TEXT” field is long.
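
A rough way to see why these two parameters matter: the number of token positions the encoder processes at once is bounded by batch size times maximum length. The sketch below is back-of-the-envelope arithmetic only. It assumes embeddings are stored as 32-bit floats, and the class count of 100,000 is a hypothetical example; the 384 dimensions are the output size of all-MiniLM-L6-v2.

```python
def tokens_per_batch(batch_size: int, max_length: int) -> int:
    """Upper bound on token positions in one fully padded batch."""
    return batch_size * max_length


def embedding_matrix_bytes(num_classes: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense embedding matrix (float32 by default, an assumption)."""
    return num_classes * dim * bytes_per_value


print(tokens_per_batch(64, 256))   # 16384 (the defaults above)
print(tokens_per_batch(16, 128))   # 2048  (8x fewer positions per batch)

# Hypothetical ontology with 100,000 classes and 384-dimensional embeddings.
print(embedding_matrix_bytes(100_000, 384))  # 153600000 bytes, about 146 MiB
```

The first quantity drives peak memory while encoding; the second is the size of the stored index and does not depend on batch size or maximum length.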

---

## Launch offline bundle construction
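This is the only cell that performs real computation. The exact command is saved to command.txt and all output is written to build_bundle.log in the run folder, so the run can be inspected and repeated.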

In [None]:
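%%bash
set -o pipefail   # make the cell fail if the build fails, despite the pipe to tee

# Shell variables from earlier %%bash cells are not visible here: recover the run folder.
OUT_DIR="${OUT_DIR:-$(cat outputs/LAST_RUN_DIR.txt)}"
OUT_CSV="${OUT_CSV:-${OUT_DIR}/internal_ontology.csv}"
OUT_BUNDLE="${OUT_BUNDLE:-${OUT_DIR}/offline_bundle.pkl}"
OUT_LOG="${OUT_LOG:-${OUT_DIR}/build_bundle.log}"
# Default matches Option A; with Option B, ONTO_LOCAL_PATH must be set in the environment.
ONTO_LOCAL_PATH="${ONTO_LOCAL_PATH:-data/input_ontology.owl}"

# Stop early with a clear message if a parameter did not reach this cell.
for v in TOKENIZER_NAME BI_ENCODER_MODEL_ID SEMANTIC_BATCH_SIZE SEMANTIC_MAX_LENGTH NO_SEMANTIC_NORMALIZE; do
  [ -n "${!v}" ] || { echo "Missing $v: export it as an environment variable first" >&2; exit 1; }
done

CMD="python build_ontology_bundle.py \
  --ont-path ${ONTO_LOCAL_PATH} \
  --out-csv ${OUT_CSV} \
  --out-bundle ${OUT_BUNDLE} \
  --tokenizer-name ${TOKENIZER_NAME} \
  --bi-encoder-model-id ${BI_ENCODER_MODEL_ID} \
  --semantic-batch-size ${SEMANTIC_BATCH_SIZE} \
  --semantic-max-length ${SEMANTIC_MAX_LENGTH} \
  --no-semantic-normalize ${NO_SEMANTIC_NORMALIZE}"

# Add the prefix filter only when PREFIX is non-empty.
if [ -n "$PREFIX" ]; then
  CMD="$CMD --prefix ${PREFIX}"
fi

echo "$CMD" | tee "${OUT_DIR}/command.txt"
bash -lc "$CMD" 2>&1 | tee "$OUT_LOG"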
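
Assembling the command as a single string is fragile when a path contains spaces. As an alternative sketch, the same command can be built as an argument list in Python and rendered with `shlex.join` for the command trace. The flags are the ones used in the cell above; the function name `build_command` and the example run folder are hypothetical.

```python
import shlex


def build_command(ont_path, out_csv, out_bundle, tokenizer_name,
                  bi_encoder_model_id, semantic_batch_size,
                  semantic_max_length, no_semantic_normalize, prefix=""):
    """Return the CLI invocation as a list of arguments."""
    cmd = [
        "python", "build_ontology_bundle.py",
        "--ont-path", str(ont_path),
        "--out-csv", str(out_csv),
        "--out-bundle", str(out_bundle),
        "--tokenizer-name", tokenizer_name,
        "--bi-encoder-model-id", bi_encoder_model_id,
        "--semantic-batch-size", str(semantic_batch_size),
        "--semantic-max-length", str(semantic_max_length),
        "--no-semantic-normalize", str(no_semantic_normalize),
    ]
    # The prefix filter is optional and only added when non-empty.
    if prefix:
        cmd += ["--prefix", prefix]
    return cmd


run_dir = "outputs/offline_bundle_run_example"  # hypothetical run folder
cmd = build_command(
    ont_path="data/input_ontology.owl",
    out_csv=f"{run_dir}/internal_ontology.csv",
    out_bundle=f"{run_dir}/offline_bundle.pkl",
    tokenizer_name="dmis-lab/biobert-base-cased-v1.1",
    bi_encoder_model_id="sentence-transformers/all-MiniLM-L6-v2",
    semantic_batch_size=64,
    semantic_max_length=256,
    no_semantic_normalize=0,
)
# shlex.join quotes each argument so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids a second round of shell parsing.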

---

### Sanity check (no inference)
This cell does not score anything.
It only verifies that the bundle loads correctly.

In [None]:
from ontologies.offline_preprocessing import load_offline_bundle
from pathlib import Path

with open("outputs/LAST_RUN_DIR.txt") as f:
    OUT_DIR = f.read().strip()

OUT_BUNDLE = Path(OUT_DIR) / "offline_bundle.pkl"

bundle = load_offline_bundle(
    OUT_BUNDLE,
    load_semantic_embeddings=True,
    mmap=True,
)

print("Bundle keys:", list(bundle.keys()))

if "semantic_index" in bundle:
    sem = bundle["semantic_index"]
    print("Semantic index fields:", list(sem.keys()))
    print("Number of classes:", len(sem.get("iris", [])))
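
One further check that needs no model is to look for repeated IRIs in the semantic index. The sketch below assumes only the structure read by the cell above (a dict with a `semantic_index` entry holding an `iris` list); whether duplicates are actually an error for this bundle is an assumption, and the helper name `duplicate_iris` and the toy bundle are hypothetical.

```python
from collections import Counter


def duplicate_iris(bundle: dict) -> list[str]:
    """Return the IRIs that appear more than once in the semantic index."""
    sem = bundle.get("semantic_index", {})
    counts = Counter(sem.get("iris", []))
    return sorted(iri for iri, n in counts.items() if n > 1)


# Toy stand-in for a loaded bundle, for illustration only.
toy_bundle = {
    "semantic_index": {
        "iris": [
            "http://purl.obolibrary.org/obo/ENVO_1",
            "http://purl.obolibrary.org/obo/ENVO_2",
            "http://purl.obolibrary.org/obo/ENVO_1",
        ]
    }
}
print(duplicate_iris(toy_bundle))  # ['http://purl.obolibrary.org/obo/ENVO_1']
```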

---

## Package outputs for download
Package the outputs of the current run into a single archive that can be downloaded and reused locally or in downstream pipelines.

In [None]:
%%bash
OUT_DIR=$(cat outputs/LAST_RUN_DIR.txt)
ZIP_PATH="${OUT_DIR}.zip"

zip -r "$ZIP_PATH" "$OUT_DIR"
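
When the archive is reused in downstream pipelines, recording a checksum next to it makes it possible to confirm that a copy is intact. The following is a sketch using only the standard library; the function names are hypothetical. Unlike the `zip -r` call above, entries in this archive start at the run folder name rather than at `outputs/`.

```python
import hashlib
import shutil
from pathlib import Path


def package_run(out_dir: str) -> Path:
    """Zip a run folder next to itself and return the archive path."""
    out = Path(out_dir)
    archive = shutil.make_archive(
        base_name=str(out),          # produces "<out_dir>.zip"
        format="zip",
        root_dir=str(out.parent),    # archive paths are relative to this folder
        base_dir=out.name,           # so entries start with the run folder name
    )
    return Path(archive)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical use would be `zip_path = package_run(out_dir)` followed by `print(sha256_of(zip_path))`, with `out_dir` read from outputs/LAST_RUN_DIR.txt.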

Download the generated archive from the Colab environment to the local machine.

In [None]:
from google.colab import files

with open("outputs/LAST_RUN_DIR.txt") as f:
    OUT_DIR = f.read().strip()

files.download(f"{OUT_DIR}.zip")