darlednik/GENEB

GENEB — Genomic Embedding Benchmark

GENEB is a multi-task benchmark for DNA sequence encoders. It comprises 100 classification tasks in 13 functional categories. Each model is evaluated with a linear probe on precomputed embeddings under full-shot, 10-shot, and 1-shot regimes, and the reported metrics are MCC, accuracy, and macro-F1.
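For readers unfamiliar with the three reported metrics, the following is a minimal sketch of their standard definitions in plain Python. It is not taken from the GENEB harness, which may compute them differently (for example through a library); the function name `classification_metrics` and the returned key names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def classification_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[str, float]:
    """Accuracy, macro-F1, and multiclass MCC from label sequences.

    Illustrative only: the function and key names are hypothetical,
    not the harness API.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    true_counts = Counter(y_true)   # TP + FN per class
    pred_counts = Counter(y_pred)   # TP + FP per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # TP per class
    correct = sum(hits.values())

    accuracy = correct / n

    # Per-class F1 = 2*TP / (2*TP + FP + FN); macro-F1 is the unweighted mean.
    f1_scores = [2 * hits[c] / (true_counts[c] + pred_counts[c]) for c in classes]
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Multiclass MCC (the R_K statistic); defined as 0.0 when the denominator is 0.
    cov = correct * n - sum(true_counts[c] * pred_counts[c] for c in classes)
    var_pred = n * n - sum(pred_counts[c] ** 2 for c in classes)
    var_true = n * n - sum(true_counts[c] ** 2 for c in classes)
    denom = math.sqrt(var_pred * var_true)
    mcc = cov / denom if denom else 0.0

    return {"mcc": mcc, "accuracy": accuracy, "macro_f1": macro_f1}
```

Unlike accuracy, MCC stays near zero for a probe that always predicts the majority class, which is one reason it is commonly reported alongside accuracy on imbalanced tasks.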


Repository overview

This repository contains three coordinated parts:

  1. Evaluation harness (harness/) — reference code to compute embeddings, train the protocol-defined logistic-regression probe, and emit a submission file.
  2. Benchmark definition (benchmark/) — task list, category map, probe settings, and model registry metadata.
  3. Leaderboard artifacts (leaderboard/) — static site and JSON tables built from reviewed submissions (regenerated by CI; not edited manually).

Baseline and contributed results are stored as submissions/<model_id>.json. To add a new model, contributors provide a small extractor module that loads their encoder and produces sequence embeddings; the repository stores this code and the resulting metrics, but not third-party model weights.
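As a rough illustration of what such an extractor module might look like, here is a minimal sketch. The actual interface is defined in harness/extractors/base.py and is not reproduced in this README, so the class names `EmbeddingExtractor` and `NucleotideFrequencyExtractor` and the `embed` method signature are all assumptions. The toy extractor uses nucleotide frequencies in place of a real encoder so that the example runs without model weights.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class EmbeddingExtractor(ABC):
    """Hypothetical extractor interface (the real one lives in
    harness/extractors/base.py and may differ)."""

    @abstractmethod
    def embed(self, sequences: Sequence[str]) -> List[List[float]]:
        """Return one fixed-length embedding per input DNA sequence."""


class NucleotideFrequencyExtractor(EmbeddingExtractor):
    """Toy stand-in for a real encoder: a 4-dimensional A/C/G/T frequency vector."""

    ALPHABET = "ACGT"

    def embed(self, sequences: Sequence[str]) -> List[List[float]]:
        embeddings = []
        for seq in sequences:
            seq = seq.upper()
            counts = [seq.count(base) for base in self.ALPHABET]
            # Ambiguous bases such as N are ignored; avoid division by zero.
            total = sum(counts) or 1
            embeddings.append([count / total for count in counts])
        return embeddings
```

A real submission would replace the frequency computation with code that loads the contributor's encoder and pools its hidden states into one vector per sequence.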


Repository layout

benchmark/
  benchmark_spec.json       # task list, categories, metrics, probe protocol, dataset pin
  model_meta.json           # display names, parameter counts, links, provenance labels
submissions/
  <model_id>.json           # per-model results (reviewed before merge)
leaderboard/
  index.html                # Hugging Face Space UI
  leaderboard.json          # macro scores by functional category (generated)
  leaderboard_tasks.json    # per-task scores (generated)
  README.md                 # Space metadata
tools/
  validate_submission.py    # schema and completeness checks (CI)
  build_leaderboard.py      # aggregate submissions into leaderboard JSON
  pack_submission.py        # convert legacy per-task JSON logs to submission format
  sync_geneb_dataset.py     # download or publish pinned task CSVs on Hugging Face
model_cards/
  <model_id>.md             # optional training-data and disclosure notes
harness/
  run_GENEB.py              # end-to-end local evaluation → submission JSON
  extractors/
    base.py                 # extractor interface
    <module>.py             # model-specific embedding module (per submission)
.github/workflows/
  validate.yml              # PR checks on submissions and spec
  build-and-sync.yml        # rebuild leaderboard and push to the Space (on merge to main)

From model to leaderboard row

To add a model, a contributor:

  1. implements an embedding extractor under harness/extractors/;
  2. runs harness/run_GENEB.py locally on the pinned GENEB task data;
  3. opens a pull request with:
    • submissions/<model_id>.json,
    • the extractor module,
    • an optional model card.

Maintainers and CI then:

  1. validate the submission schema and completeness;
  2. review the extractor and metadata for plausibility;
  3. merge the pull request;
  4. rebuild leaderboard/*.json;
  5. sync the updated leaderboard to the Hugging Face Space.

The benchmark definition lives in benchmark/benchmark_spec.json: it specifies the task list, task categories, metrics, probe settings, random seeds, and the dataset revision used for evaluation. Files under leaderboard/ are generated from reviewed submissions and should not be edited manually.
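The aggregation from per-task scores to per-category macro scores can be sketched as below. This assumes that a macro score is the unweighted mean of one metric over the tasks in a category, and that the category map is a flat task-to-category dictionary; neither the actual logic of tools/build_leaderboard.py nor the real layout of benchmark_spec.json is shown in this README, and the function name `macro_by_category` and the task names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def macro_by_category(
    task_scores: Dict[str, float], category_map: Dict[str, str]
) -> Dict[str, float]:
    """Average one metric over the tasks of each functional category.

    task_scores:  task name -> score (e.g. MCC) for a single model
    category_map: task name -> category name (assumed layout)
    """
    missing = sorted(set(category_map) - set(task_scores))
    if missing:
        raise ValueError(f"missing scores for tasks: {missing}")

    grouped = defaultdict(list)
    for task, category in category_map.items():
        grouped[category].append(task_scores[task])

    # Unweighted mean per category, so every task counts equally.
    return {cat: sum(vals) / len(vals) for cat, vals in sorted(grouped.items())}


# Hypothetical task and category names, for illustration only.
example_map = {"task_a": "promoters", "task_b": "promoters", "task_c": "enhancers"}
example_scores = {"task_a": 0.5, "task_b": 0.7, "task_c": 0.4}
category_scores = macro_by_category(example_scores, example_map)
```

Refusing to aggregate when a task score is missing keeps an incomplete submission from receiving an inflated category average.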


Evaluation and reproducibility

GENEB does not provide a central re-scoring service. Contributors run evaluation locally using the published harness and the dataset revision specified in benchmark_spec.json. Maintainers review pull requests for schema compliance, completeness, and plausibility, but do not re-run every model on maintainer infrastructure.

Mechanism                                  Role
-----------------------------------------  ------------------------------------------------------------------
Dataset revision in benchmark_spec.json    Fixes the exact benchmark data used for evaluation
Fixed probe protocol                       Keeps downstream evaluation comparable across models
Submission schema + CI validation          Checks completeness, metric ranges, and file consistency
Extractor code in the PR                   Shows how embeddings were produced and helps others reproduce the run
Model card                                 Documents architecture, training data, and possible benchmark overlap
Provenance metadata                        Marks externally submitted runs as self-reported

Community submissions are marked as self-reported in metadata. A run is considered reproducible when the submitted extractor, the GENEB harness version, and the pinned dataset revision are sufficient for another user to repeat the evaluation.
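The completeness and metric-range checks mentioned above might look roughly like the following sketch. The real checks live in tools/validate_submission.py and the real submission schema is not reproduced here, so the nested `results -> task -> regime -> metric` layout, the regime keys, the metric key names, and the function name `validate_submission` are all assumptions.

```python
from typing import Any, Dict, List, Sequence

# Assumed key names; the actual schema may use different ones.
REGIMES = ("full", "10-shot", "1-shot")
METRIC_RANGES = {"mcc": (-1.0, 1.0), "accuracy": (0.0, 1.0), "macro_f1": (0.0, 1.0)}


def validate_submission(
    submission: Dict[str, Any], expected_tasks: Sequence[str]
) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    results = submission.get("results", {})

    for task in expected_tasks:
        if task not in results:
            errors.append(f"{task}: missing task")
            continue
        for regime in REGIMES:
            metrics = results[task].get(regime)
            if metrics is None:
                errors.append(f"{task}/{regime}: missing regime")
                continue
            for name, (low, high) in METRIC_RANGES.items():
                value = metrics.get(name)
                is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
                if not is_number:
                    errors.append(f"{task}/{regime}/{name}: missing or non-numeric")
                elif not low <= value <= high:  # also rejects NaN
                    errors.append(f"{task}/{regime}/{name}: {value} outside [{low}, {high}]")
    return errors
```

Collecting every problem instead of stopping at the first one lets a contributor fix a submission in a single pass.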


Citation

@misc{ledneva2026genebgenomicmodelshard,
  title         = {GENEB: Why Genomic Models Are Hard to Compare},
  author        = {Daria Ledneva and Mikhail Nuridinov and Denis Kuznetsov},
  year          = {2026},
  eprint        = {2606.04525},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.04525}
}

Contact

Repository and leaderboard contact: Daria Ledneva.

Questions, suggestions, feedback, and model submissions are welcome.

About

GENEB: ICML 2026 benchmark for genomic foundation models across 100 tasks and 13 functional categories.
