# Training Launcher

## Purpose of this notebook
This notebook is a **script-first, notebook-as-launcher** interface for running the **end-to-end ontology alignment training pipeline** on Colab.

It will:
- clone the repository,
- install dependencies,
- load the required input artifacts (OWL/RDF),
- run the pipeline via `training.py` in one of three modes,
- package outputs (CSVs, logs, models) into a ZIP for download.

## What is produced by this notebook
Depending on the selected mode, the notebook produces:

### Mode = `full` (build dataset + train)
- Source ontology CSV (textual features)
- Target ontology CSV (textual features)
- Training dataset CSV (pairs + label)
- Trained model directory (bi-encoder or cross-encoder)
- Execution logs

### Mode = `build-dataset` (build dataset only, no training)
- Source ontology CSV
- Target ontology CSV
- Training dataset CSV
- Execution logs

### Mode = `train-only` (only train from CSV)
- Trained model directory
- Execution logs

All outputs are saved under a unique timestamped directory in `outputs/`.

---

### Clone the repository

In [None]:
%%bash
set -e

REPO_URL="https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git"
REPO_DIR="2025-P13-Ontology-Alignment"

# Optional: lock to a specific commit for full reproducibility
COMMIT=""   # e.g. "4ffd790"

rm -rf "$REPO_DIR"
git clone "$REPO_URL" "$REPO_DIR"
cd "$REPO_DIR"

if [ -n "$COMMIT" ]; then
  git checkout "$COMMIT"
fi

echo "Checked out commit:"
git rev-parse HEAD

### Enter the repository directory

Colab runs each `%%bash` cell in its own subshell, so directory changes do not persist across cells.
We use `%cd` to move into the cloned repository for all subsequent Python cells.

In [None]:
%cd 2025-P13-Ontology-Alignment

### Install dependencies

In [None]:
%%bash
set -e
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

### Check runtime

In [None]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

---

## Choose run mode

Select exactly one mode:

- `full`: build dataset + train  
- `build-dataset`: only build the dataset CSVs (no training)  
- `train-only`: train from an existing dataset CSV (no ontology loading)

The notebook will create a unique output directory under `outputs/` for this run.

In [None]:
# Choose one: "full", "build-dataset", "train-only"
RUN_MODE = "full"

# W&B logging (optional)
USE_WANDB = False

# Number of epochs (used for full/train-only)
NUM_EPOCHS = 10

# Model choice (used for full/train-only)
MODEL_TYPE = "bi-encoder"  # "bi-encoder" or "cross-encoder"
MODEL_NAME = "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"

# Fixed hyperparameters (used for full/train-only).
# Passed to training.py only when USE_FIXED_HYPERPARAMS is True and HYPERPARAMETER_TUNING is False.
USE_FIXED_HYPERPARAMS = False
LEARNING_RATE = 3e-5
BATCH_SIZE = 16
WEIGHT_DECAY = 0.01

# Hyperparameter tuning (used for full/train-only); takes precedence over the fixed values above
HYPERPARAMETER_TUNING = True
N_TRIALS = 20  # Number of trials for hyperparameter tuning
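
A minimal sketch of a sanity check for the settings above, mirroring the `if`/`elif` logic of the launcher cell further down (tuning first, then fixed values, otherwise no hyperparameter flags). The helper name `check_config` and its return labels are hypothetical and not part of the repository.

```python
VALID_MODES = {"full", "build-dataset", "train-only"}
VALID_MODEL_TYPES = {"bi-encoder", "cross-encoder"}


def check_config(run_mode, model_type, tuning, fixed, n_trials, num_epochs):
    """Validate launcher settings and report which hyperparameter path applies.

    Hypothetical helper: the returned labels are illustrative only.
    """
    if run_mode not in VALID_MODES:
        raise ValueError(f"RUN_MODE must be one of {sorted(VALID_MODES)}, got {run_mode!r}")
    if run_mode == "build-dataset":
        # No training happens, so model and hyperparameter settings are unused.
        return "no-training"
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"MODEL_TYPE must be one of {sorted(VALID_MODEL_TYPES)}, got {model_type!r}")
    if num_epochs < 1:
        raise ValueError("NUM_EPOCHS must be at least 1")
    if tuning:
        if n_trials < 1:
            raise ValueError("N_TRIALS must be at least 1 when tuning is enabled")
        return "tuning"
    # Without tuning, fixed values are passed only if explicitly requested.
    return "fixed" if fixed else "script-defaults"


# Same values as the defaults in the cell above.
print(check_config("full", "bi-encoder", True, False, 20, 10))  # tuning
```

In the notebook, the call would take the variables directly: `check_config(RUN_MODE, MODEL_TYPE, HYPERPARAMETER_TUNING, USE_FIXED_HYPERPARAMS, N_TRIALS, NUM_EPOCHS)`.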

---

## Provide required input artifacts

### For `full` and `build-dataset`
You must provide:
1. **Source Ontology** (`.owl`)
2. **Target Ontology** (`.owl`)
3. **Reference Alignment** (`.rdf`, OAEI-like format)

Upload them manually (recommended for Colab).

### For `train-only`
You must provide:
1. **Training dataset CSV** with columns: `source_text`, `target_text`, `match`
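
A minimal sketch of a header check for the `train-only` CSV, using only the standard library. The helper name `missing_columns` is hypothetical; the required column names are the ones listed above.

```python
import csv

REQUIRED_COLUMNS = ("source_text", "target_text", "match")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    # utf-8-sig transparently drops a byte-order mark if the file has one.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

An empty list means the header is complete, for example when calling `missing_columns("data/training_dataset.csv")` after the upload.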

In [None]:
from google.colab import files
import os

os.makedirs("data", exist_ok=True)

print("Upload required files for the selected mode.")
print("- full/build-dataset: source ontology (.owl), target ontology (.owl), reference alignment (.rdf)")
print("- train-only: training dataset (.csv)")
uploaded = files.upload()

for fn, content in uploaded.items():
    out_path = os.path.join("data", fn)
    with open(out_path, "wb") as f:
        f.write(content)
    print("Saved:", out_path)

---

## Set input file paths

Set the variables below to the paths of the files you uploaded into `data/`.
These paths will be passed to `training.py` as CLI arguments.

In [None]:
# ---- For full/build-dataset ----
SRC_PATH = "data/sweet.owl"
TGT_PATH = "data/envo.owl"
ALIGN_PATH = "data/envo-sweet.rdf"

# ---- For train-only ----
DATASET_CSV = "data/training_dataset.csv"  # must exist if RUN_MODE="train-only"
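
A wrong filename otherwise surfaces only later, as a failed `training.py` run, so a pre-flight check can help. A minimal sketch, assuming the variable names from the cell above; `missing_inputs` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def missing_inputs(run_mode, src_path, tgt_path, align_path, dataset_csv):
    """Return the input files required by run_mode that do not exist on disk."""
    if run_mode in {"full", "build-dataset"}:
        required = [src_path, tgt_path, align_path]
    elif run_mode == "train-only":
        required = [dataset_csv]
    else:
        raise ValueError(f"Unknown run mode: {run_mode!r}")
    return [p for p in required if not Path(p).is_file()]
```

In the notebook this could be called as `missing_inputs(RUN_MODE, SRC_PATH, TGT_PATH, ALIGN_PATH, DATASET_CSV)`; an empty list means every required file is present.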

---

## Configure dataset-building options (for `full` and `build-dataset`)

### Ontology filters (IRI prefixes)
Optionally filter classes by IRI prefix. Use `None` to disable filtering.

### Source text configuration
These flags control which fields are included in the **source** textual representation (SHORT_TEXT):
- description
- synonyms
- parents
- equivalent classes
- disjoint classes

The target text (RICH_TEXT) is built by the repository pipeline.

In [None]:
# -----------------------
# Ontology Filters (Prefixes)
# -----------------------
SRC_PREFIX = None  # e.g. "http://sweetontology.net/"
TGT_PREFIX = None  # e.g. "http://purl.obolibrary.org/obo/ENVO_"

# -----------------------
# Source Text Configuration
# -----------------------
USE_DESCRIPTION = False
USE_SYNONYMS = False
USE_PARENTS = False
USE_EQUIVALENT = False
USE_DISJOINT = False

# -----------------------
# Visualization (optional)
# -----------------------
VISUALIZE = False  # True -> create alignment visualization (if supported)

---

## Configure output directory for this run

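We create a unique run directory using a timestamp-based run ID and record its path in `outputs/LAST_TRAINING_RUN_DIR.txt`, which later cells read to locate the run.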

In [None]:
%%bash
set -e

RUN_ID="training_run_$(date +%Y%m%d_%H%M%S)"
OUT_DIR="outputs/${RUN_ID}"
mkdir -p "$OUT_DIR"

echo "$OUT_DIR" > outputs/LAST_TRAINING_RUN_DIR.txt
echo "OUT_DIR=$OUT_DIR"

---

## Run the pipeline via `training.py`

This cell builds a CLI command and runs `training.py` in the selected mode.

It writes logs to `training.log` inside the run directory.

- In `full` mode: builds dataset + trains a model  
- In `build-dataset` mode: builds dataset and exits  
- In `train-only` mode: trains from an existing dataset CSV

In [None]:
import os
import subprocess
from pathlib import Path

# Optional W&B behavior
if not USE_WANDB:
    os.environ["WANDB_MODE"] = "disabled"
else:
    import wandb
    wandb.login()

# Read output directory
with open("outputs/LAST_TRAINING_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

# Define output paths (always under OUT_DIR)
out_src_csv = f"{out_dir}/source_ontology.csv"
out_tgt_csv = f"{out_dir}/target_ontology.csv"
out_dataset_csv = f"{out_dir}/training_dataset.csv"
model_out_dir = f"{out_dir}/models/{MODEL_TYPE}_custom/"
log_path = f"{out_dir}/training.log"

cmd = ["python", "training.py", "--mode", RUN_MODE]

# Mode: full or build-dataset => requires ontologies + alignment + output CSV paths
if RUN_MODE in {"full", "build-dataset"}:
    cmd += ["--src", SRC_PATH, "--tgt", TGT_PATH, "--align", ALIGN_PATH]
    cmd += ["--out-src", out_src_csv, "--out-tgt", out_tgt_csv, "--out-dataset", out_dataset_csv]

    if SRC_PREFIX:
        cmd += ["--src-prefix", SRC_PREFIX]
    if TGT_PREFIX:
        cmd += ["--tgt-prefix", TGT_PREFIX]

    if USE_DESCRIPTION: cmd.append("--src-use-description")
    if USE_SYNONYMS: cmd.append("--src-use-synonyms")
    if USE_PARENTS: cmd.append("--src-use-parents")
    if USE_EQUIVALENT: cmd.append("--src-use-equivalent")
    if USE_DISJOINT: cmd.append("--src-use-disjoint")
    if VISUALIZE: cmd.append("--visualize-alignments")

# Mode: full or train-only => requires model args
if RUN_MODE in {"full", "train-only"}:
    cmd += ["--model-type", MODEL_TYPE, "--model-name", MODEL_NAME, "--model-output-dir", model_out_dir]
    cmd += ["--num-epochs", str(NUM_EPOCHS)]
    
    if HYPERPARAMETER_TUNING:
        cmd += ["--tune", "--n-trials", str(N_TRIALS)]
    elif USE_FIXED_HYPERPARAMS:
        cmd += ["--learning-rate", str(LEARNING_RATE)]
        cmd += ["--batch-size", str(BATCH_SIZE)]
        cmd += ["--weight-decay", str(WEIGHT_DECAY)]

# Mode: train-only => requires dataset CSV
if RUN_MODE == "train-only":
    cmd += ["--dataset-csv", DATASET_CSV]

print("Running command:\n", " ".join(cmd))
print(f"\nLogs: {log_path}")

# Create dirs
Path(out_dir).mkdir(parents=True, exist_ok=True)
Path(model_out_dir).mkdir(parents=True, exist_ok=True)

# Run and capture logs
with open(log_path, "w") as log_file:
    proc = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)

print("Return code:", proc.returncode)

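if proc.returncode != 0:
    print("!!! Error occurred. Showing last 60 lines of log:")
    # Read the log in Python: output of os.system may not appear in the notebook cell
    print("\n".join(Path(log_path).read_text(errors="replace").splitlines()[-60:]))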
else:
    print(f"Run completed successfully. Outputs are under: {out_dir}")
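
The cell above sends all output to the log file, so nothing appears in the notebook until the run ends. For long training runs it can be useful to watch progress as well. A minimal sketch, assuming it is acceptable to swap `subprocess.run` for `subprocess.Popen`; `run_and_tee` is a hypothetical helper and not part of the repository.

```python
import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, copying each output line to both log_path and stdout.

    Returns the exit code of the process.
    """
    with open(log_path, "w", encoding="utf-8") as log_file:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout, as in the cell above
            text=True,
            errors="replace",          # do not fail on undecodable bytes
            bufsize=1,                 # line-buffered reads on the parent side
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log_file.write(line)
        return proc.wait()
```

With this helper, `proc = subprocess.run(...)` would become `returncode = run_and_tee(cmd, log_path)`. Output from a Python child process may still arrive in chunks unless it runs unbuffered, for example with `python -u` or `PYTHONUNBUFFERED=1`.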

---

## Inspect generated dataset (optional)

This section is meaningful only if the run produced `training_dataset.csv`
(i.e., mode = `full` or `build-dataset`).

In [None]:
import pandas as pd
import os

with open("outputs/LAST_TRAINING_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

dataset_csv = f"{out_dir}/training_dataset.csv"

if os.path.exists(dataset_csv):
    df = pd.read_csv(dataset_csv)
    print("Training dataset shape:", df.shape)
    display(df.head(10))
    print("\nLabel distribution:")
    display(df["match"].value_counts())
else:
    print("No training_dataset.csv found for this run (expected in train-only mode).")

---

## Package outputs for download

This zips the entire run directory (logs, CSVs, and model artifacts) into a single file
and downloads it to your local machine.

In [None]:
%%bash
set -e
OUT_DIR=$(cat outputs/LAST_TRAINING_RUN_DIR.txt)
ZIP_PATH="${OUT_DIR}.zip"

echo "Zipping $OUT_DIR ..."
zip -r -q "$ZIP_PATH" "$OUT_DIR"
echo "Created zip: $ZIP_PATH"
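
If the `zip` binary is not available in the runtime, a similar archive can be produced from Python. A minimal sketch using `shutil.make_archive`; `zip_run_dir` is a hypothetical helper and not part of the repository. Unlike the bash cell, paths inside this archive start at the run directory's own name rather than at `outputs/`.

```python
import shutil
from pathlib import Path


def zip_run_dir(out_dir):
    """Create <out_dir>.zip next to out_dir and return the archive's absolute path."""
    out_dir = Path(out_dir).resolve()
    return shutil.make_archive(
        base_name=str(out_dir),        # ".zip" is appended automatically
        format="zip",
        root_dir=str(out_dir.parent),  # archive paths are relative to this directory
        base_dir=out_dir.name,         # only the run directory is included
    )
```

The archive lands at the same `<run directory>.zip` location that the download cell expects.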

In [None]:
from google.colab import files

with open("outputs/LAST_TRAINING_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

files.download(f"{out_dir}.zip")