Skip to content

david4096/on2vec

Repository files navigation

on2vec

A toolkit for generating vector embeddings from OWL ontologies using Graph Neural Networks (GNNs), with HuggingFace Sentence Transformers integration and MTEB benchmarking.

πŸš€ Quick Start

Installation

pip install on2vec

Create production-ready Sentence Transformers models with ontology knowledge in one command:

# Complete end-to-end workflow
on2vec hf biomedical.owl my-biomedical-model

Use like any sentence transformer:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('./hf_models/my-biomedical-model')
embeddings = model.encode(['heart disease', 'cardiovascular problems'])

πŸ“‹ Table of Contents

πŸ“₯ Installation

From PyPI (Recommended)

# Basic installation
pip install on2vec

# With MTEB benchmarking support
pip install on2vec[benchmark]

# With all optional dependencies
pip install on2vec[all]

From Source

git clone <repository-url>
cd on2vec
pip install -e .

Dependencies

  • Python >= 3.10
  • PyTorch + torch-geometric
  • owlready2, sentence-transformers
  • polars, matplotlib, umap-learn

πŸ€— HuggingFace Integration

One-Command Model Creation

# Create complete model with auto-generated documentation
on2vec hf ontology.owl model-name

# With custom settings
on2vec hf ontology.owl model-name \
  --base-model all-mpnet-base-v2 \
  --fusion gated \
  --epochs 200

Step-by-Step Workflow

# 1. Train ontology embeddings
on2vec hf-train ontology.owl --output embeddings.parquet

# 2. Create HuggingFace model (auto-detects base model)
on2vec hf-create embeddings.parquet model-name

# 3. Test model functionality
on2vec hf-test ./hf_models/model-name

# 4. Inspect model details
on2vec inspect ./hf_models/model-name

Batch Processing

# Process multiple ontologies
on2vec hf-batch owl_files/ ./output \
  --base-models all-MiniLM-L6-v2 all-mpnet-base-v2 \
  --fusion-methods concat gated \
  --max-workers 4

Features

  • βœ… Auto-generated model cards with comprehensive metadata
  • βœ… Smart base model detection from embeddings
  • βœ… Upload instructions and HuggingFace Hub preparation
  • βœ… Domain detection and appropriate tagging
  • βœ… Multiple fusion methods: concat, attention, gated, weighted_avg
  • βœ… Batch processing for multiple ontologies

πŸ§ͺ MTEB Benchmarking

Evaluate your models against the Massive Text Embedding Benchmark:

Quick Benchmark

# Fast evaluation on subset of tasks
on2vec benchmark ./hf_models/my-model --quick

# Focus on specific task types
on2vec benchmark ./hf_models/my-model --task-types STS Classification

# Full MTEB benchmark
on2vec benchmark ./hf_models/my-model

Compare Models

# Benchmark vanilla baseline
on2vec benchmark sentence-transformers/all-MiniLM-L6-v2 \
  --model-name vanilla-baseline --quick

# Compare ontology vs vanilla models
on2vec compare ./hf_models/my-model --detailed

Features

  • βœ… Full MTEB integration with 58+ evaluation tasks
  • βœ… Task filtering by category (STS, Classification, Clustering, etc.)
  • βœ… Automated reporting with JSON summaries and markdown reports
  • βœ… Resource management with configurable batch sizes
  • βœ… Comparison tools for baseline evaluation

πŸ’» Core on2vec Usage

Basic Training

# Train GCN model
on2vec train ontology.owl --output model.pt --model-type gcn --epochs 100

# Train with text features (for HuggingFace integration)
on2vec train ontology.owl --output embeddings.parquet --use-text-features

# Multi-relation models with all ObjectProperties
on2vec train ontology.owl --output model.pt --use-multi-relation --model-type rgcn

Generate Embeddings

# Generate embeddings from trained model
on2vec embed model.pt ontology.owl --output embeddings.parquet

Visualization

# Create UMAP visualization
on2vec visualize embeddings.parquet --output visualization.png

Python API

from sentence_transformers import SentenceTransformer
from on2vec import train_ontology_embeddings, embed_ontology_with_model

# Train model
result = train_ontology_embeddings(
    owl_file="ontology.owl",
    model_output="model.pt",
    model_type="gcn",
    hidden_dim=128,
    out_dim=64
)

# Generate embeddings
embeddings = embed_ontology_with_model(
    model_path="model.pt",
    owl_file="ontology.owl",
    output_file="embeddings.parquet"
)

# Use HuggingFace model
model = SentenceTransformer('./hf_models/my-model')
vectors = model.encode(['concept 1', 'concept 2'])

πŸ—οΈ Architecture

Core Components

  • Graph Construction: Converts OWL ontologies to graph representations
  • GNN Training: Supports GCN, GAT, RGCN, and heterogeneous architectures
  • Text Integration: Combines structural and semantic features using sentence transformers
  • Fusion Methods: Multiple approaches to combine text + structural embeddings
  • HuggingFace Bridge: Creates sentence-transformers compatible models

Model Pipeline

OWL Ontology β†’ Graph β†’ GNN Training β†’ Structural Embeddings
                                            ↓
Text Features β†’ Sentence Transformer β†’ Text Embeddings
                                            ↓
                              Fusion Layer β†’ Final Model
                                            ↓
                              HuggingFace Model + Model Card

Supported Architectures

  • GCN: Graph Convolutional Networks
  • GAT: Graph Attention Networks
  • RGCN: Relational GCN for multi-relation graphs
  • Heterogeneous: Relation-specific layers with attention

πŸ“š Documentation

🎯 Key Features

  • πŸ€— HuggingFace Ready: One-command model creation with professional documentation
  • πŸ§ͺ MTEB Integration: Comprehensive benchmarking against standard tasks
  • πŸ“Š Rich Metadata: Auto-generated model cards with complete technical details
  • πŸ”§ Smart Automation: Auto-detects base models, domains, and configurations
  • ⚑ Batch Processing: Handle multiple ontologies efficiently
  • 🎨 Multiple Fusion Methods: Flexible combination of text and structural features
  • πŸ“ˆ Comprehensive Evaluation: Built-in comparison and testing tools

πŸš€ Example Workflow

# 1. Install on2vec
pip install on2vec[benchmark]

# 2. Create a model from biomedical ontology
on2vec hf EDAM.owl edam-biomedical

# 3. Quick benchmark evaluation
on2vec benchmark ./hf_models/edam-biomedical --quick

# 4. Compare with vanilla models
on2vec compare ./hf_models/edam-biomedical --detailed

# 5. Inspect model details
on2vec inspect ./hf_models/edam-biomedical

# 6. Upload to HuggingFace Hub (instructions auto-generated)
# See ./hf_models/edam-biomedical/UPLOAD_INSTRUCTIONS.md

The model is immediately usable as a drop-in replacement for any sentence-transformer, with the added benefit of ontological domain knowledge!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

TODOs!

  1. Batch training - Process multiple ontologies efficiently in parallel
  2. Model finetuning - Allow incremental training and adaptation of existing models
  3. Provenance tracking - Keep metadata about source OWL files for traceability
  4. Embedding injection - Store embeddings directly in OWL files using base64 encoding
  5. Parameter optimization - Automated hyperparameter tuning and optimization
  6. Use case examples - Comprehensive documentation and examples for different scenarios

Citation

If you use on2vec in your research, please cite:

@software{on2vec2025,
  title={on2vec: Ontology Embeddings with Graph Neural Networks},
  author={David Steinberg},
  year={2025},
  url={https://github.com/david4096/on2vec}
}

About

A tool for generating embeddings of classes organized into an an ontology

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •