# Training Launcher

## Purpose of this notebook
This notebook is a thin launcher for the **End-to-End Training Pipeline** on Colab: the pipeline logic lives in `training.py`, and the notebook only prepares the environment and invokes the script.
It will:
- clone the repository,
- install dependencies,
- load the required raw ontology files (Source OWL, Target OWL, Reference Alignment RDF),
- run the pipeline via `training.py`,
- package the processed datasets and trained model into a zip file for download.

## What is produced by this notebook

Running this notebook executes the `training.py` script, which performs two main stages:

1. **Dataset Generation**:
   - Parses raw `.owl` files into textual representations (CSVs).
   - Merges them with the reference alignment to create positive/negative training pairs.
2. **Model Training** (Bi-Encoder or Cross-Encoder):
   - Fine-tunes a BERT-based model on the generated dataset.
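
To illustrate the idea behind stage 1, here is a minimal sketch of turning a reference alignment into labelled pairs. It is not the sampling strategy used by `training.py`, which this notebook does not show. The function name `build_pairs`, the `negatives_per_positive` parameter, the example IDs, and the use of uniform random negative sampling are all assumptions made for illustration.

```python
import random


def build_pairs(alignment, target_ids, negatives_per_positive=1, seed=0):
    """Return (source_id, target_id, label) triples.

    alignment: list of (source_id, target_id) reference matches (label 1).
    target_ids: all candidate target IDs to draw negatives (label 0) from.
    Hypothetical helper: uniform random negatives, not the repository's method.
    """
    rng = random.Random(seed)
    positives = set(alignment)
    pairs = []
    for src, tgt in alignment:
        pairs.append((src, tgt, 1))
        # Candidates are targets that are NOT a reference match for this source.
        candidates = [t for t in target_ids if (src, t) not in positives]
        k = min(negatives_per_positive, len(candidates))
        for neg in rng.sample(candidates, k):
            pairs.append((src, neg, 0))
    return pairs


# Made-up IDs, only to show the shape of the result.
example = build_pairs(
    alignment=[("s:River", "t:river"), ("s:Lake", "t:lake")],
    target_ids=["t:river", "t:lake", "t:ocean"],
    negatives_per_positive=1,
)
for row in example:
    print(row)
```

Each reference match yields one positive row plus up to `negatives_per_positive` negative rows, so the ratio between the two classes is controlled by a single parameter.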

**Outputs:**
- A **processed dataset CSV** used for training.
- **Source/Target CSVs** containing the parsed textual features.
- A **Trained Model** directory (saved in HuggingFace format).
- An **Alignment Visualization** (optional; generated only when `VISUALIZE = True`).

All outputs are saved in a unique timestamped directory under `outputs/`.

---

### Clone repository

This cell clones the project repository into the Colab VM.

In [None]:
%%bash
set -e

REPO_URL="https://github.com/adsp-polito/2025-P13-Ontology-Alignment.git"
REPO_DIR="2025-P13-Ontology-Alignment"

# Optional: lock to a specific commit for reproducibility
COMMIT=""

rm -rf "$REPO_DIR"
git clone "$REPO_URL" "$REPO_DIR"
cd "$REPO_DIR"

if [ -n "$COMMIT" ]; then
  git checkout "$COMMIT"
fi

echo "Checked out commit:"
git rev-parse HEAD

### Enter the repo directory

In [None]:
%cd 2025-P13-Ontology-Alignment

### Install dependencies

In [None]:
%%bash
set -e
python -m pip install --upgrade pip
pip install -r requirements.txt

---

### Check runtime (CPU/GPU)
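Training is impractically slow on CPU, so select a GPU runtime (Runtime > Change runtime type) before continuing. The cell below reports whether CUDA is visible to PyTorch.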

In [None]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

---

### Provide required input artifacts
`training.py` needs three specific files to build the dataset and train:
1. **Source Ontology** (e.g., `sweet.owl`)
2. **Target Ontology** (e.g., `envo.owl`)
3. **Reference Alignment** in OAEI RDF format (e.g., `envo-sweet.rdf`)

In [None]:
from google.colab import files
import os

os.makedirs("data", exist_ok=True)

print("Please upload: Source Ontology (.owl), Target Ontology (.owl), and Alignment (.rdf)")
uploaded = files.upload()

for fn, content in uploaded.items():
    out_path = os.path.join("data", fn)
    with open(out_path, "wb") as f:
        f.write(content)
    print("Saved:", out_path)
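
Before training, it can help to confirm that the uploaded alignment actually contains correspondences. The sketch below reads `<Cell>` elements from an OAEI-style alignment using only the standard library. It matches elements by local name because the alignment namespace is written slightly differently across files. The helper names and the sample IRIs are made up for illustration, and this is not the parser used by `training.py`.

```python
import xml.etree.ElementTree as ET

RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def read_alignment_cells(xml_text):
    """Return (entity1, entity2) IRI pairs from OAEI-style <Cell> elements."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():
        if local_name(elem.tag) != "Cell":
            continue
        iris = {}
        for child in elem:
            name = local_name(child.tag)
            if name in ("entity1", "entity2"):
                iris[name] = child.get(RDF_RESOURCE)
        # Keep only cells where both sides carry an rdf:resource IRI.
        if iris.get("entity1") and iris.get("entity2"):
            pairs.append((iris["entity1"], iris["entity2"]))
    return pairs


# Tiny hand-written sample with made-up IRIs.
SAMPLE = """
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Alignment>
    <map>
      <Cell>
        <entity1 rdf:resource="http://example.org/a#River"/>
        <entity2 rdf:resource="http://example.org/b#river"/>
        <measure>1.0</measure>
        <relation>=</relation>
      </Cell>
    </map>
  </Alignment>
</rdf:RDF>
""".strip()

pairs = read_alignment_cells(SAMPLE)
print(len(pairs), "correspondence(s):", pairs)
```

To run it on an uploaded file, read the file as text first (for example `data/envo-sweet.rdf`) and pass the contents to `read_alignment_cells`. Which of `entity1` and `entity2` corresponds to the source ontology depends on the alignment file.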

---

### Configure output directory for this training run
Creates a unique output directory using a timestamp-based run ID.

In [None]:
%%bash
set -e

RUN_ID="training_run_$(date +%Y%m%d_%H%M%S)"
OUT_DIR="outputs/${RUN_ID}"
mkdir -p "$OUT_DIR"

# Save dir path for Python access
echo "$OUT_DIR" > outputs/LAST_TRAINING_RUN_DIR.txt

echo "OUT_DIR=$OUT_DIR"
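
Later cells recover this directory by reading `outputs/LAST_TRAINING_RUN_DIR.txt`. The same hand-off can be sketched in Python as follows. The helper names `create_run_dir` and `read_last_run_dir` are illustrative and do not exist in the repository.

```python
import tempfile
from datetime import datetime
from pathlib import Path

POINTER_NAME = "LAST_TRAINING_RUN_DIR.txt"


def create_run_dir(outputs_root="outputs"):
    """Create <outputs_root>/training_run_<timestamp>/ and record its path."""
    outputs_root = Path(outputs_root)
    run_id = "training_run_" + datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = outputs_root / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    # The pointer file lets later cells find the directory without globals.
    (outputs_root / POINTER_NAME).write_text(str(out_dir) + "\n")
    return out_dir


def read_last_run_dir(outputs_root="outputs"):
    """Return the most recently recorded run directory."""
    pointer = Path(outputs_root) / POINTER_NAME
    return Path(pointer.read_text().strip())


# Demo in a throwaway directory so the notebook's real outputs/ is untouched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "outputs"
    created = create_run_dir(root)
    recovered = read_last_run_dir(root)
    same_dir = created == recovered
    print(created.name)
    print("Pointer matches:", same_dir)
```

Because the run ID has one-second resolution, two runs started within the same second would share a directory. The pointer file always refers to the most recent run only.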

### Configure Training Parameters

Define the arguments passed to `training.py` in the cell below.
**Note:** Ensure `SRC_PATH`, `TGT_PATH`, and `ALIGN_PATH` match the filenames you uploaded above.

In [None]:
# -----------------------
# Input Paths (Must match uploaded files)
# -----------------------
SRC_PATH = "data/sweet.owl"   # Update filename if needed
TGT_PATH = "data/envo.owl"    # Update filename if needed
ALIGN_PATH = "data/envo-sweet.rdf" # Update filename if needed

# -----------------------
# Ontology Filters (Prefixes)
# -----------------------
# Optional: Use if you want to filter classes by IRI prefix
SRC_PREFIX = None # "http://sweetontology.net/"
TGT_PREFIX = None # "http://purl.obolibrary.org/obo/ENVO_"

# -----------------------
# Text Generation Configuration
# -----------------------
# These flags control what text is added to the training examples.
# Note: the run cell below passes them to training.py as `--src-use-*` options only.
USE_DESCRIPTION = False
USE_SYNONYMS = False
USE_PARENTS = False
USE_EQUIVALENT = False
USE_DISJOINT = False

# -----------------------
# Model Configuration
# -----------------------
# options: "bi-encoder" or "cross-encoder"
MODEL_TYPE = "bi-encoder" 

# HuggingFace Model ID
MODEL_NAME = "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"

# -----------------------
# Visualization
# -----------------------
VISUALIZE = False  # Set to True to generate a graph visualization of alignments
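
Since a mismatched filename or a mistyped model type only surfaces once `training.py` starts, a quick pre-flight check can save a failed run. The sketch below is one possible way to do it. `check_config` and `ALLOWED_MODEL_TYPES` are hypothetical names, and the allowed values are taken from the comment in the configuration cell.

```python
import os

ALLOWED_MODEL_TYPES = {"bi-encoder", "cross-encoder"}


def check_config(model_type, paths):
    """Return a list of human-readable problems; an empty list means OK.

    paths: mapping of variable name -> file path, e.g. {"SRC_PATH": SRC_PATH}.
    """
    problems = []
    if model_type not in ALLOWED_MODEL_TYPES:
        problems.append(
            f"MODEL_TYPE must be one of {sorted(ALLOWED_MODEL_TYPES)}, got {model_type!r}"
        )
    for name, path in paths.items():
        if not os.path.isfile(path):
            problems.append(f"{name} not found: {path}")
    return problems


# Demo with deliberately wrong values (hypothetical path and model type).
demo_problems = check_config(
    "bi_encoder",
    {"SRC_PATH": "data/does_not_exist.owl"},
)
for problem in demo_problems:
    print("-", problem)
```

In the notebook, the call would be `check_config(MODEL_TYPE, {"SRC_PATH": SRC_PATH, "TGT_PATH": TGT_PATH, "ALIGN_PATH": ALIGN_PATH})`, and an empty result means the run cell can proceed.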

### Run Training Pipeline

This cell builds the CLI command and runs `training.py`, which then performs:
1. **Ontology Loading**: Parsing OWL files.
2. **Text Generation**: Applying the flags (descriptions, synonyms, etc.).
3. **Dataset Building**: Merging with alignments.
4. **Training**: Fine-tuning the specified model.

In [None]:
import subprocess
import os
import wandb

# Weights & Biases logging (optional).
# To skip it, comment out wandb.login() and uncomment the WANDB_MODE line.
wandb.login()  # Follow the prompt to log in
# os.environ["WANDB_MODE"] = "disabled"

# Read output directory from previous step
with open("outputs/LAST_TRAINING_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

# Define Output Paths
out_src_csv = f"{out_dir}/source_ontology.csv"
out_tgt_csv = f"{out_dir}/target_ontology.csv"
out_dataset_csv = f"{out_dir}/training_dataset.csv"
model_out_dir = f"{out_dir}/models/{MODEL_TYPE}_custom/"
log_path = f"{out_dir}/training.log"

# Build Command
cmd = [
    "python", "training.py",
    "--src", SRC_PATH,
    "--tgt", TGT_PATH,
    "--align", ALIGN_PATH,
    "--out-src", out_src_csv,
    "--out-tgt", out_tgt_csv,
    "--out-dataset", out_dataset_csv,
    "--model-type", MODEL_TYPE,
    "--model-name", MODEL_NAME,
    "--model-output-dir", model_out_dir
]

# Add Optional Arguments
if SRC_PREFIX:
    cmd += ["--src-prefix", SRC_PREFIX]
if TGT_PREFIX:
    cmd += ["--tgt-prefix", TGT_PREFIX]

# Add Boolean Flags (each is appended only when enabled in the config cell)
bool_flags = {
    "--src-use-description": USE_DESCRIPTION,
    "--src-use-synonyms": USE_SYNONYMS,
    "--src-use-parents": USE_PARENTS,
    "--src-use-equivalent": USE_EQUIVALENT,
    "--src-use-disjoint": USE_DISJOINT,
    "--visualize-alignments": VISUALIZE,
}
cmd += [flag for flag, enabled in bool_flags.items() if enabled]

print("Running command:\n", " ".join(cmd))
print(f"\nLogs are being written to: {log_path} ...")

# Run execution
with open(log_path, "w") as log_file:
    proc = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)

print("Process finished with return code:", proc.returncode)

if proc.returncode != 0:
    print("!!! Error occurred. Printing last 20 lines of log:")
    # Pass arguments as a list so the log path is never interpreted by a shell
    subprocess.run(["tail", "-n", "20", log_path])
else:
    print(f"Training complete. Model saved to: {model_out_dir}")
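
The failure branch above prints the end of the log by calling the external `tail` command. A portable alternative is to read the last lines in Python. The sketch below shows one way to do that; `tail_lines` is a hypothetical helper, not something provided by the repository.

```python
import os
import tempfile
from collections import deque


def tail_lines(path, n=20):
    """Return the last n lines of a text file, keeping at most n lines in memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A bounded deque discards older lines as the file is streamed.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Demo on a temporary 50-line log file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(f"line {i}" for i in range(1, 51)) + "\n")
    demo_path = tmp.name

last = tail_lines(demo_path, n=3)
print(last)
os.remove(demo_path)
```

In the run cell this would be used as `print("\n".join(tail_lines(log_path)))`, which behaves the same on any platform and with any characters in the path.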

---

### Inspect Generated Training Data
Inspect the dataset that was built and used for training. Use it to check that the text fields enabled in the configuration (for example synonyms and descriptions) were merged as expected.

In [None]:
import os

import pandas as pd

with open("outputs/LAST_TRAINING_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

dataset_csv = f"{out_dir}/training_dataset.csv"
if os.path.exists(dataset_csv):
    df = pd.read_csv(dataset_csv)
    print(f"Training Dataset Shape: {df.shape}")
    display(df.head())
else:
    print("Dataset CSV not found. Did the run fail?")
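
Beyond looking at the first rows, it is often worth checking how many positive and negative pairs the dataset contains. The sketch below counts values in a label column using only the standard library. The column name `label` and the demo columns are guesses, since this notebook does not show the schema written by `training.py`; check `df.columns` and adjust `label_column` accordingly.

```python
import csv
import os
import tempfile
from collections import Counter


def label_counts(csv_path, label_column="label"):
    """Count rows per label value; raises KeyError if the column is absent."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(f"{label_column!r} not in columns: {reader.fieldnames}")
        return Counter(row[label_column] for row in reader)


# Demo on a tiny made-up CSV (hypothetical column names and values).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["source_text", "target_text", "label"])
    writer.writerow(["river", "river", "1"])
    writer.writerow(["river", "ocean", "0"])
    writer.writerow(["lake", "desert", "0"])
    demo_csv = tmp.name

counts = label_counts(demo_csv)
print(dict(counts))
os.remove(demo_csv)
```

A heavily skewed count is a hint to revisit how negative pairs are generated before spending GPU time on training.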

---

### Package and Download Results
This zips the entire output directory (logs, CSVs, and the trained model) for download.

In [None]:
%%bash
set -e
OUT_DIR=$(cat outputs/LAST_TRAINING_RUN_DIR.txt)
ZIP_PATH="${OUT_DIR}.zip"

echo "Zipping $OUT_DIR ..."
zip -r -q "$ZIP_PATH" "$OUT_DIR"
echo "Created zip: $ZIP_PATH"

In [None]:
from google.colab import files

with open("outputs/LAST_TRAINING_RUN_DIR.txt") as f:
    out_dir = f.read().strip()

files.download(f"{out_dir}.zip")
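
The packaging step relies on the `zip` command being available in the VM. As an alternative, the archive can be built from Python with `shutil.make_archive`. The sketch below assumes a hypothetical helper name, `zip_run_dir`. Unlike the bash cell, it stores the run folder itself as the top-level entry of the archive rather than `outputs/<run>`.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_run_dir(out_dir):
    """Create <out_dir>.zip next to out_dir and return the archive path."""
    out_dir = Path(out_dir).resolve()
    archive = shutil.make_archive(
        base_name=str(out_dir),        # ".zip" is appended automatically
        format="zip",
        root_dir=str(out_dir.parent),  # paths in the archive are relative to this
        base_dir=out_dir.name,         # only this folder is included
    )
    return Path(archive)


# Demo on a throwaway run directory containing a single log file.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "outputs" / "training_run_demo"
    run_dir.mkdir(parents=True)
    (run_dir / "training.log").write_text("done\n")

    zip_path = zip_run_dir(run_dir)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    zip_name = zip_path.name
    print(zip_name, names)
```

In the notebook, `zip_run_dir(out_dir)` would produce the same `<out_dir>.zip` path that the download cell expects.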