A toolkit for generating vector embeddings from OWL ontologies using Graph Neural Networks (GNNs), with HuggingFace Sentence Transformers integration and MTEB benchmarking.
pip install on2vec
Create production-ready Sentence Transformers models with ontology knowledge in one command:
# Complete end-to-end workflow
on2vec hf biomedical.owl my-biomedical-model
Use like any sentence transformer:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('./hf_models/my-biomedical-model')
embeddings = model.encode(['heart disease', 'cardiovascular problems'])
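To check that two phrases land close together in the embedding space, one common option is cosine similarity. The sketch below is illustrative and not part of on2vec: `cosine_similarity` is a hypothetical helper written with the standard library only, and the two short vectors are toy stand-ins for rows of `embeddings`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): real inputs would be embeddings[0] and embeddings[1].
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]
score = cosine_similarity(vec_a, vec_b)
print(f"cosine similarity: {score:.3f}")
```

With real model output, `cosine_similarity(embeddings[0], embeddings[1])` would work the same way; `sentence_transformers.util.cos_sim` offers a batched equivalent.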
- Quick Start
- Installation
- HuggingFace Integration
- MTEB Benchmarking
- Core on2vec Usage
- Architecture
- Documentation
# Basic installation
pip install on2vec
# With MTEB benchmarking support
pip install "on2vec[benchmark]"
# With all optional dependencies
pip install "on2vec[all]"
git clone <repository-url>
cd on2vec
pip install -e .
- Python >= 3.10
- PyTorch + torch-geometric
- owlready2, sentence-transformers
- polars, matplotlib, umap-learn
# Create complete model with auto-generated documentation
on2vec hf ontology.owl model-name
# With custom settings
on2vec hf ontology.owl model-name \
  --base-model all-mpnet-base-v2 \
  --fusion gated \
  --epochs 200
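When the same command has to be launched from a Python script, it can help to assemble the argument list programmatically. This is a minimal sketch, assuming only the flags shown above (`--base-model`, `--fusion`, `--epochs`); `build_hf_command` is a hypothetical helper and not an on2vec API.

```python
import shlex
from typing import List, Optional


def build_hf_command(
    owl_file: str,
    model_name: str,
    base_model: Optional[str] = None,
    fusion: Optional[str] = None,
    epochs: Optional[int] = None,
) -> List[str]:
    """Assemble an `on2vec hf` argument list from the options shown above."""
    cmd = ["on2vec", "hf", owl_file, model_name]
    if base_model is not None:
        cmd += ["--base-model", base_model]
    if fusion is not None:
        cmd += ["--fusion", fusion]
    if epochs is not None:
        cmd += ["--epochs", str(epochs)]
    return cmd


cmd = build_hf_command(
    "ontology.owl",
    "model-name",
    base_model="all-mpnet-base-v2",
    fusion="gated",
    epochs=200,
)
# Print the shell-quoted command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` avoids shell-quoting problems with file names that contain spaces.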
# 1. Train ontology embeddings
on2vec hf-train ontology.owl --output embeddings.parquet
# 2. Create HuggingFace model (auto-detects base model)
on2vec hf-create embeddings.parquet model-name
# 3. Test model functionality
on2vec hf-test ./hf_models/model-name
# 4. Inspect model details
on2vec inspect ./hf_models/model-name
# Process multiple ontologies
on2vec hf-batch owl_files/ ./output \
  --base-models all-MiniLM-L6-v2 all-mpnet-base-v2 \
  --fusion-methods concat gated \
  --max-workers 4
- Auto-generated model cards with comprehensive metadata
- Smart base model detection from embeddings
- Upload instructions and HuggingFace Hub preparation
- Domain detection and appropriate tagging
- Multiple fusion methods: concat, attention, gated, weighted_avg
- Batch processing for multiple ontologies
Evaluate your models against the Massive Text Embedding Benchmark:
# Fast evaluation on subset of tasks
on2vec benchmark ./hf_models/my-model --quick
# Focus on specific task types
on2vec benchmark ./hf_models/my-model --task-types STS Classification
# Full MTEB benchmark
on2vec benchmark ./hf_models/my-model
# Benchmark vanilla baseline
on2vec benchmark sentence-transformers/all-MiniLM-L6-v2 \
  --model-name vanilla-baseline --quick
# Compare ontology vs vanilla models
on2vec compare ./hf_models/my-model --detailed
- Full MTEB integration with 58+ evaluation tasks
- Task filtering by category (STS, Classification, Clustering, etc.)
- Automated reporting with JSON summaries and markdown reports
- Resource management with configurable batch sizes
- Comparison tools for baseline evaluation
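For a quick look at where an ontology-augmented model gains or loses against a baseline, per-task score differences are often enough. The sketch below is an assumption-heavy illustration: it does not read on2vec's JSON summaries (their layout is not described here), `score_deltas` is a hypothetical helper, and the task names and numbers are placeholders rather than measured results.

```python
from typing import Dict


def score_deltas(
    ontology_scores: Dict[str, float],
    baseline_scores: Dict[str, float],
) -> Dict[str, float]:
    """Return ontology-minus-baseline score for every task present in both inputs."""
    shared_tasks = sorted(set(ontology_scores) & set(baseline_scores))
    return {task: ontology_scores[task] - baseline_scores[task] for task in shared_tasks}


# Placeholder values only (assumption): these are NOT benchmark results.
ontology = {"task_a": 0.50, "task_b": 0.25}
baseline = {"task_a": 0.25, "task_b": 0.50, "task_c": 0.75}

deltas = score_deltas(ontology, baseline)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f}")
```

Tasks that appear in only one of the two inputs are skipped, so a partial `--quick` run can still be compared against a full baseline run.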
# Train GCN model
on2vec train ontology.owl --output model.pt --model-type gcn --epochs 100
# Train with text features (for HuggingFace integration)
on2vec train ontology.owl --output embeddings.parquet --use-text-features
# Multi-relation models with all ObjectProperties
on2vec train ontology.owl --output model.pt --use-multi-relation --model-type rgcn
# Generate embeddings from trained model
on2vec embed model.pt ontology.owl --output embeddings.parquet
# Create UMAP visualization
on2vec visualize embeddings.parquet --output visualization.png
from sentence_transformers import SentenceTransformer
from on2vec import train_ontology_embeddings, embed_ontology_with_model
# Train model
result = train_ontology_embeddings(
    owl_file="ontology.owl",
    model_output="model.pt",
    model_type="gcn",
    hidden_dim=128,
    out_dim=64,
)
# Generate embeddings
embeddings = embed_ontology_with_model(
    model_path="model.pt",
    owl_file="ontology.owl",
    output_file="embeddings.parquet",
)
# Use HuggingFace model
model = SentenceTransformer('./hf_models/my-model')
vectors = model.encode(['concept 1', 'concept 2'])
- Graph Construction: Converts OWL ontologies to graph representations
- GNN Training: Supports GCN, GAT, RGCN, and heterogeneous architectures
- Text Integration: Combines structural and semantic features using sentence transformers
- Fusion Methods: Multiple approaches to combine text + structural embeddings
- HuggingFace Bridge: Creates sentence-transformers compatible models
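The graph construction step can be pictured as turning subclass relations into an indexed edge list. This is a minimal sketch under stated assumptions, not on2vec's actual code: `build_graph` is a hypothetical helper, the class names are invented for illustration, and the `(child, parent)` pairs are given by hand (with owlready2 they could be collected from each class's `is_a` list).

```python
from typing import Dict, Iterable, List, Tuple


def build_graph(
    subclass_pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, int], List[Tuple[int, int]]]:
    """Map class names to integer node ids and return (child, parent) edges."""
    node_index: Dict[str, int] = {}
    edges: List[Tuple[int, int]] = []
    for child, parent in subclass_pairs:
        for name in (child, parent):
            if name not in node_index:
                # Assign ids in order of first appearance.
                node_index[name] = len(node_index)
        edges.append((node_index[child], node_index[parent]))
    return node_index, edges


# Hypothetical subclass relations (assumption), written as (child, parent).
pairs = [
    ("HeartDisease", "Disease"),
    ("LungDisease", "Disease"),
    ("Myocarditis", "HeartDisease"),
]
node_index, edges = build_graph(pairs)
print(node_index)
print(edges)
```

An edge list with integer ids is the form most GNN libraries expect, which is why the names are mapped to indices up front.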
OWL Ontology → Graph → GNN Training → Structural Embeddings
↓
Text Features → Sentence Transformer → Text Embeddings
↓
Fusion Layer → Final Model
↓
HuggingFace Model + Model Card
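The fusion layer in the diagram combines a text embedding with a structural embedding. The sketch below illustrates three of the listed strategies (concat, weighted_avg, gated) in plain Python. It is a simplified assumption, not on2vec's implementation: the function names are hypothetical, both inputs are assumed to share one dimensionality for the averaging variants, and the gate logits are passed in directly, whereas a trained model would typically produce them with a learned layer.

```python
import math
from typing import List, Sequence


def fuse_concat(text: Sequence[float], struct: Sequence[float]) -> List[float]:
    """Concatenate the two embeddings; output length is the sum of both."""
    return list(text) + list(struct)


def fuse_weighted_avg(
    text: Sequence[float], struct: Sequence[float], text_weight: float = 0.5
) -> List[float]:
    """Blend the embeddings with one scalar weight shared by all dimensions."""
    if len(text) != len(struct):
        raise ValueError("weighted averaging needs equal-length embeddings")
    return [text_weight * t + (1.0 - text_weight) * s for t, s in zip(text, struct)]


def fuse_gated(
    text: Sequence[float], struct: Sequence[float], gate_logits: Sequence[float]
) -> List[float]:
    """Blend per dimension with a sigmoid gate: g * text + (1 - g) * struct."""
    if not (len(text) == len(struct) == len(gate_logits)):
        raise ValueError("gated fusion needs equal-length inputs")
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * t + (1.0 - g) * s for g, t, s in zip(gates, text, struct)]


text_vec = [1.0, 3.0]    # toy text embedding
struct_vec = [3.0, 1.0]  # toy structural embedding

print(fuse_concat(text_vec, struct_vec))
print(fuse_weighted_avg(text_vec, struct_vec, text_weight=0.5))
print(fuse_gated(text_vec, struct_vec, gate_logits=[0.0, 0.0]))
```

Concatenation keeps all information but grows the output dimension, while the averaging and gated variants keep the dimension fixed at the cost of requiring aligned embedding spaces.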
- GCN: Graph Convolutional Networks
- GAT: Graph Attention Networks
- RGCN: Relational GCN for multi-relation graphs
- Heterogeneous: Relation-specific layers with attention
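As a reference point for the GCN entry, the standard graph convolution propagates features as ReLU(D^-1/2 (A + I) D^-1/2 H W). The following is a dependency-free sketch of one such layer on a three-node toy graph. It is only meant to show the arithmetic: on2vec builds on PyTorch and torch-geometric, and the helper names, the toy graph, and the weights here are all assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def normalized_adjacency(num_nodes: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Build D^-1/2 (A + I) D^-1/2 for an undirected graph with self-loops."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = 1.0  # self-loop
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    degree = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(a_hat: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One GCN layer: aggregate neighbours, project, then apply ReLU."""
    hidden = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, x) for x in row] for row in hidden]


# Toy path graph 0 - 1 - 2 with one-hot node features (assumption).
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(3, edges)
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = [[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]]  # 3 inputs -> 2 outputs

out = gcn_layer(a_hat, features, weights)
for row in out:
    print(row)
```

GAT replaces the fixed normalization with learned attention weights, and RGCN keeps a separate weight matrix per relation type; neither is shown here.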
- CLI Quick Reference - All commands and examples
- HuggingFace Integration - Complete workflow guide
- MTEB Benchmarking - Evaluation framework
- Project Instructions - Development guidelines
- HuggingFace Ready: One-command model creation with professional documentation
- MTEB Integration: Comprehensive benchmarking against standard tasks
- Rich Metadata: Auto-generated model cards with complete technical details
- Smart Automation: Auto-detects base models, domains, and configurations
- Batch Processing: Handle multiple ontologies efficiently
- Multiple Fusion Methods: Flexible combination of text and structural features
- Comprehensive Evaluation: Built-in comparison and testing tools
# 1. Install on2vec
pip install "on2vec[benchmark]"
# 2. Create a model from biomedical ontology
on2vec hf EDAM.owl edam-biomedical
# 3. Quick benchmark evaluation
on2vec benchmark ./hf_models/edam-biomedical --quick
# 4. Compare with vanilla models
on2vec compare ./hf_models/edam-biomedical --detailed
# 5. Inspect model details
on2vec inspect ./hf_models/edam-biomedical
# 6. Upload to HuggingFace Hub (instructions auto-generated)
# See ./hf_models/edam-biomedical/UPLOAD_INSTRUCTIONS.md
The resulting model works as a drop-in replacement for any Sentence Transformers model, with ontology-derived domain knowledge built in.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Batch training - Process multiple ontologies efficiently in parallel
- Model finetuning - Allow incremental training and adaptation of existing models
- Provenance tracking - Keep metadata about source OWL files for traceability
- Embedding injection - Store embeddings directly in OWL files using base64 encoding
- Parameter optimization - Automated hyperparameter tuning and optimization
- Use case examples - Comprehensive documentation and examples for different scenarios
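The embedding injection item above is a planned feature, so no storage format is defined yet. As one possible shape for it, the sketch below packs a vector as little-endian float32 values and base64-encodes the bytes so the result could be stored as a text annotation. The function names and the float32 layout are assumptions, not a committed design.

```python
import base64
import struct
from typing import List, Sequence


def encode_embedding(vector: Sequence[float]) -> str:
    """Pack floats as little-endian float32 and return a base64 ASCII string."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(raw).decode("ascii")


def decode_embedding(text: str) -> List[float]:
    """Inverse of encode_embedding: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))


# Values chosen to be exactly representable in float32, so the round trip is lossless.
vector = [0.5, -1.25, 2.0]
encoded = encode_embedding(vector)
restored = decode_embedding(encoded)
print(encoded)
print(restored)
```

Arbitrary Python floats are 64-bit, so values that are not exactly representable in float32 would come back slightly rounded with this layout.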
If you use on2vec in your research, please cite:
@software{on2vec2025,
  title={on2vec: Ontology Embeddings with Graph Neural Networks},
  author={David Steinberg},
  year={2025},
  url={https://github.com/david4096/on2vec}
}