GENEB is a multi-task benchmark for DNA sequence encoders: 100 classification tasks in 13 functional categories, evaluated with a linear probe on precomputed embeddings under full-, 10-shot, and 1-shot regimes (reported metrics: MCC, accuracy, macro-F1).
- Paper: https://arxiv.org/abs/2606.04525
- Source code: https://github.com/darlednik/geneb
- Leaderboard: https://huggingface.co/spaces/darlednik/geneb-leaderboard
- Task data: https://huggingface.co/datasets/darlednik/geneb-tasks
- Submit a model: CONTRIBUTING.md
This repository contains three coordinated parts:
- Evaluation harness (harness/): reference code to compute embeddings, train the protocol-defined logistic-regression probe, and emit a submission file.
- Benchmark definition (benchmark/): task list, category map, probe settings, and model registry metadata.
- Leaderboard artifacts (leaderboard/): static site and JSON tables built from reviewed submissions (regenerated by CI; not edited manually).
Baseline and contributed results are stored as submissions/<model_id>.json. To add a new
model, contributors provide a small extractor module that loads their encoder and produces
sequence embeddings; the repository stores this code and the resulting metrics, but not
third-party model weights.
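As a rough illustration of what such an extractor module might look like, here is a minimal sketch. The `Extractor` protocol, the `embed` method name, and the `KmerFrequencyExtractor` class are all hypothetical; the actual interface is defined in harness/extractors/base.py and may differ. The k-mer frequency vector is only a stand-in for a real encoder's output.

```python
from collections import Counter
from itertools import product
from typing import List, Protocol, Sequence


class Extractor(Protocol):
    """Hypothetical interface; the real one lives in harness/extractors/base.py."""

    def embed(self, sequences: Sequence[str]) -> List[List[float]]:
        ...


class KmerFrequencyExtractor:
    """Toy stand-in for a real encoder: normalized k-mer counts over A/C/G/T."""

    def __init__(self, k: int = 2) -> None:
        self.k = k
        # Fixed vocabulary order so every sequence maps to the same dimensions.
        self.vocab = ["".join(p) for p in product("ACGT", repeat=k)]

    def embed(self, sequences: Sequence[str]) -> List[List[float]]:
        embeddings = []
        for seq in sequences:
            seq = seq.upper()
            counts = Counter(
                seq[i:i + self.k] for i in range(len(seq) - self.k + 1)
            )
            # Only k-mers made of A/C/G/T contribute; others (e.g. N) are ignored.
            total = sum(counts[kmer] for kmer in self.vocab)
            embeddings.append(
                [counts[kmer] / total if total else 0.0 for kmer in self.vocab]
            )
        return embeddings
```

A real submission would replace the body of `embed` with a call into the contributor's encoder, returning one fixed-length vector per input sequence.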
    benchmark/
        benchmark_spec.json       # task list, categories, metrics, probe protocol, dataset pin
        model_meta.json           # display names, parameter counts, links, provenance labels
    submissions/
        <model_id>.json           # per-model results (reviewed before merge)
    leaderboard/
        index.html                # Hugging Face Space UI
        leaderboard.json          # macro scores by functional category (generated)
        leaderboard_tasks.json    # per-task scores (generated)
        README.md                 # Space metadata
    tools/
        validate_submission.py    # schema and completeness checks (CI)
        build_leaderboard.py      # aggregate submissions into leaderboard JSON
        pack_submission.py        # convert legacy per-task JSON logs to submission format
        sync_geneb_dataset.py     # download or publish pinned task CSVs on Hugging Face
    model_cards/
        <model_id>.md             # optional training-data and disclosure notes
    harness/
        run_GENEB.py              # end-to-end local evaluation → submission JSON
        extractors/
            base.py               # extractor interface
            <module>.py           # model-specific embedding module (per submission)
    .github/workflows/
        validate.yml              # PR checks on submissions and spec
        build-and-sync.yml        # rebuild leaderboard and push to the Space (on merge to main)
To add a model, a contributor:
- implements an embedding extractor under harness/extractors/;
- runs harness/run_GENEB.py locally on the pinned GENEB task data;
- opens a pull request with:
  - submissions/<model_id>.json,
  - the extractor module,
  - an optional model card.
Maintainers and CI then:
- validate the submission schema and completeness;
- review the extractor and metadata for plausibility;
- merge the pull request;
- rebuild leaderboard/*.json;
- sync the updated leaderboard to the Hugging Face Space.
The benchmark definition lives in benchmark/benchmark_spec.json: it specifies the task
list, task categories, metrics, probe settings, random seeds, and the dataset revision used
for evaluation. Files under leaderboard/ are generated from reviewed submissions and
should not be edited manually.
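To make the aggregation step concrete, the sketch below shows one way per-task scores could be rolled up into per-category macro scores. It assumes the macro score is the unweighted mean of task scores within a category; the function name `macro_by_category` and the dictionary shapes are hypothetical and are not taken from tools/build_leaderboard.py or the submission schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict


def macro_by_category(
    task_scores: Dict[str, float],
    task_to_category: Dict[str, str],
) -> Dict[str, float]:
    """Average per-task scores within each functional category.

    Assumption: every task counts equally inside its category.
    """
    grouped = defaultdict(list)
    for task, score in task_scores.items():
        # A KeyError here means a task is missing from the category map.
        grouped[task_to_category[task]].append(score)
    return {category: mean(scores) for category, scores in sorted(grouped.items())}
```

Under this assumption, a category with many tasks carries the same weight in the final table as a category with few, which keeps large task families from dominating the leaderboard.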
GENEB does not provide a central re-scoring service. Contributors run evaluation locally
using the published harness and the dataset revision specified in benchmark_spec.json.
Maintainers review pull requests for schema compliance, completeness, and plausibility, but
do not re-run every model on maintainer infrastructure.
| Mechanism | Role |
|---|---|
| Dataset revision in benchmark_spec.json | Fixes the exact benchmark data used for evaluation |
| Fixed probe protocol | Keeps downstream evaluation comparable across models |
| Submission schema + CI validation | Checks completeness, metric ranges, and file consistency |
| Extractor code in the PR | Shows how embeddings were produced and helps others reproduce the run |
| Model card | Documents architecture, training data, and possible benchmark overlap |
| Provenance metadata | Marks externally submitted runs as self-reported |
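The completeness and metric-range checks in the table can be sketched as follows. This is an illustrative approximation, not the logic of tools/validate_submission.py: the function name `check_submission`, the metric keys (`mcc`, `accuracy`, `macro_f1`), and the nested-dictionary layout are assumptions. The ranges themselves follow from the metric definitions: MCC lies in [-1, 1], while accuracy and macro-F1 lie in [0, 1].

```python
from typing import Dict, Iterable, List

# Valid ranges by definition of each metric; the key names are hypothetical.
METRIC_RANGES = {
    "mcc": (-1.0, 1.0),
    "accuracy": (0.0, 1.0),
    "macro_f1": (0.0, 1.0),
}


def check_submission(
    results: Dict[str, Dict[str, float]],
    expected_tasks: Iterable[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    expected = list(expected_tasks)
    errors: List[str] = []
    for task in expected:
        if task not in results:
            errors.append(f"missing task: {task}")
            continue
        for metric, (low, high) in METRIC_RANGES.items():
            value = results[task].get(metric)
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{task}: {metric} missing or non-numeric")
            elif not low <= value <= high:
                # NaN also lands here because every comparison with NaN is False.
                errors.append(f"{task}: {metric}={value} outside [{low}, {high}]")
    for task in results:
        if task not in expected:
            errors.append(f"unknown task: {task}")
    return errors
```

Checks of this kind catch incomplete or malformed files, but they cannot confirm that the reported numbers came from the pinned data, which is why the extractor code and provenance labels are listed alongside them.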
Community submissions are marked as self-reported in metadata. A run is considered
reproducible when the submitted extractor, the GENEB harness version, and the pinned dataset
revision are sufficient for another user to repeat the evaluation.
@misc{ledneva2026genebgenomicmodelshard,
title = {GENEB: Why Genomic Models Are Hard to Compare},
author = {Daria Ledneva and Mikhail Nuridinov and Denis Kuznetsov},
year = {2026},
eprint = {2606.04525},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2606.04525}
}

Repository and leaderboard contact: Daria Ledneva.
Questions, suggestions, feedback, and model submissions are welcome.