Add VA-Spec Annotation Dumps to Public Data Export #762

@bencap

Summary

The public data export (export_public_data.py) currently produces CSV files for scores, counts, and VEP/gnomAD/ClinGen annotations. It does not include VA-Spec annotations. This issue tracks adding a per-score-set VA-Spec dump to the ZIP archive, where each variant is annotated at the highest level it individually supports.

Problem

The existing streaming endpoints (GET /score-sets/{urn}/annotations/pathogenicity and GET /score-sets/{urn}/annotations/functional) allow per-score-set access to VA-Spec annotations as NDJSON, but this data is absent from the bulk data export used for distribution and archival.

Additionally, emitting all three annotation levels (StudyResult, FunctionalStatement, PathogenicityStatement) per variant would produce large amounts of redundant data: the higher-level objects already embed the lower-level evidence structures. The dump should instead emit only the highest level each variant can support.

Proposed Behavior

For each published, CC0 score set included in the data export ZIP:

  1. For each variant, determine the highest annotation level it individually supports:

    • PathogenicityStatement — variant has a non-null score, has been successfully mapped to VRS coordinates, and the score set has calibrations with acmg_classification defined (checked via can_annotate_variant_for_pathogenicity_evidence())
    • FunctionalStatement — variant has a non-null score, has been mapped, and the score set has calibrations with functional_classifications defined (checked via can_annotate_variant_for_functional_statement())
    • StudyResult — variant has been successfully mapped but does not meet the score/calibration requirements above
    • null — variant is unmapped
  2. Emit one annotation per variant at its highest supported level as NDJSON (one JSON object per line), with null for variants that cannot be annotated.

  3. Add the file to the ZIP alongside the existing CSV files using the naming convention [URN].va-spec.ndjson. Emit the file for all score sets that have at least one mapped variant.

  4. Update the manifest (main.json) to include a max_va_spec_annotation_level field on each score set entry ("study_result", "functional_statement", "pathogenicity_statement", or null), reflecting the ceiling determined by the score set's calibration configuration. Consumers can use this field to know the best-case level before downloading.
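The per-variant decision in step 1 can be sketched as a small pure function. This is only an illustration: the function name and its boolean parameters are hypothetical stand-ins for the real checks (`can_annotate_variant_for_pathogenicity_evidence()` and `can_annotate_variant_for_functional_statement()`), whose actual signatures are not shown in this issue. The returned strings reuse the values proposed for `max_va_spec_annotation_level`.

```python
from typing import Optional


def highest_supported_level(
    has_score: bool,
    is_mapped: bool,
    has_acmg_calibrations: bool,
    has_functional_calibrations: bool,
) -> Optional[str]:
    """Return the highest annotation level a single variant supports.

    All four flags are hypothetical inputs; in the real code they would be
    derived from the variant and its score set's calibrations.
    """
    if not is_mapped:
        # Unmapped variants cannot be annotated at any level.
        return None
    if has_score and has_acmg_calibrations:
        return "pathogenicity_statement"
    if has_score and has_functional_calibrations:
        return "functional_statement"
    # Mapped, but missing a score or the required calibrations.
    return "study_result"
```

Checking the levels from highest to lowest keeps the function a plain cascade, so a variant never receives a lower level than the one it qualifies for.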

Acceptance Criteria

  • The export ZIP contains a [URN].va-spec.ndjson file for every published, CC0 score set in the export that has at least one mapped variant.
  • Each line is the highest-level VA-Spec annotation that variant individually supports, or null if it cannot be annotated (unmapped).
  • Within a score set, variants with null scores that are otherwise mapped produce a StudyResult rather than a higher-level statement, even if other variants in the same score set produce PathogenicityStatement or FunctionalStatement objects.
  • Within a score set, variants with non-null scores but no successful VRS mapping produce null.
  • Score sets with no mapped variants at all are skipped (no .va-spec.ndjson file; max_va_spec_annotation_level is null in main.json).
  • Each line of the NDJSON file is either a valid serialized VA-Spec object or null (matching the pattern already used by _stream_generated_annotations()).
  • The script runs to completion without errors on the full production dataset.
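The "object or null" criterion could be spot-checked with a helper along these lines. It is a sketch with a hypothetical name, and it verifies only the JSON shape of each line (object or null), not conformance to the VA-Spec schema, which would need a separate validator.

```python
import json
from typing import Iterable, List


def invalid_ndjson_lines(lines: Iterable[str]) -> List[int]:
    """Return 1-based numbers of lines that are neither a JSON object nor null."""
    bad: List[int] = []
    for number, line in enumerate(lines, start=1):
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            # Not parseable as JSON at all (includes empty lines).
            bad.append(number)
            continue
        if value is not None and not isinstance(value, dict):
            # Valid JSON, but not an object or null (e.g. a list or number).
            bad.append(number)
    return bad
```

An empty result means every line passed the shape check; because the helper takes any iterable of strings, it can be fed an open file handle directly without loading the file into memory.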

Implementation Notes

Per-variant level determination

The eligibility logic already exists in src/mavedb/lib/annotation/util.py (can_annotate_variant_for_functional_statement(), can_annotate_variant_for_pathogenicity_evidence()). These functions check both per-variant conditions (non-null score, successful mapping) and score-set-level conditions (calibration configuration). They already operate at the variant level, so no new abstraction is needed — the existing checks drive the branching logic per variant.

The score set's calibration configuration is the ceiling: if a score set has no acmg_classification calibrations, no variant in it can ever produce a PathogenicityStatement. But variants can still fall below that ceiling individually due to null scores or mapping failures.

Serialization

The streaming endpoints in src/mavedb/routers/score_sets.py (_stream_generated_annotations()) already implement the per-variant NDJSON loop using variant_study_result(), variant_functional_impact_statement(), and variant_pathogenicity_statement() from src/mavedb/lib/annotation/annotate.py. However, note that the streaming endpoints emit a fixed level for all variants in a request (either all pathogenicity or all functional). The dump logic will need to branch per-variant to select the highest supported level, rather than applying a single function uniformly.
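One way to express the per-variant branching is a lookup from level name to builder function. The sketch below is an assumption-laden illustration: `annotation_line` and the `builders` mapping are hypothetical, and the builders are assumed to return plain JSON-serializable dicts, which may not match what `variant_study_result()` and the related functions actually return.

```python
import json
from typing import Any, Callable, Mapping, Optional

# Hypothetical builder type: takes a variant, returns a JSON-serializable dict.
Builder = Callable[[Any], Mapping[str, Any]]


def annotation_line(
    variant: Any,
    level: Optional[str],
    builders: Mapping[str, Builder],
) -> str:
    """Serialize one variant's annotation as a single NDJSON line.

    `level` is the highest level the variant supports, or None if it cannot
    be annotated; json.dumps(None) yields the literal `null`.
    """
    annotation = builders[level](variant) if level is not None else None
    return json.dumps(annotation) + "\n"
```

In the real export, the mapping would presumably point the three level names at the existing functions in src/mavedb/lib/annotation/annotate.py, so the branch is chosen per variant rather than per request.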

File placement and naming

Current ZIP structure:

export.zip
├── main.json
├── LICENSE.txt
└── csv/
    ├── {urn}.scores.csv
    ├── {urn}.counts.csv
    └── {urn}.annotations.csv ← only when mapped

Proposed addition:

export.zip
├── main.json
├── LICENSE.txt
└── csv/
    ├── {urn}.scores.csv
    ├── {urn}.counts.csv
    ├── {urn}.annotations.csv
    └── {urn}.va-spec.ndjson ← only when mapped

Because the new file is NDJSON rather than CSV, the directory name csv/ becomes a misnomer. It could be renamed data/ as part of this change, but that would break the existing archive structure, so the decision should be made explicitly.

Memory and streaming

The existing export script streams CSVs row-by-row to avoid loading entire score sets into memory. The same approach should be applied to the VA-Spec files — write each NDJSON line as it is generated rather than collecting all annotations in memory first.
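A minimal sketch of line-by-line writing, assuming the archive is built with Python's standard `zipfile` module (this issue does not say how the export script actually assembles the ZIP). The helper name and the member path used with it are hypothetical.

```python
import json
import zipfile
from typing import Iterable, Optional


def write_ndjson_member(
    archive: zipfile.ZipFile,
    member_name: str,
    annotations: Iterable[Optional[dict]],
) -> int:
    """Stream annotations into one ZIP member, one JSON value per line.

    `annotations` may be a generator, so only one annotation needs to be
    held in memory at a time. Returns the number of lines written.
    """
    count = 0
    # ZipFile.open(..., mode="w") returns a writable binary handle.
    with archive.open(member_name, mode="w") as member:
        for annotation in annotations:
            member.write((json.dumps(annotation) + "\n").encode("utf-8"))
            count += 1
    return count
```

Two caveats apply to this approach. While a writable member handle is open, `zipfile` does not allow reading or writing other members of the same archive, so files have to be written one after another. For members that may exceed 2 GiB, `force_zip64=True` should be passed to `archive.open()`.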

main.json schema change

Add a max_va_spec_annotation_level field to the per-score-set entry in main.json. Valid values: "study_result", "functional_statement", "pathogenicity_statement", null. For score sets with at least one mapped variant, the value reflects the score-set ceiling (calibration-determined), not the level of any individual variant; null means the score set has no mapped variants and therefore no .va-spec.ndjson file. Consumers should expect that some variants in the file may produce lower-level annotations or null even for score sets with a high ceiling value.
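The manifest value could be derived roughly as follows. The function name, its parameters, and the `urn` key are hypothetical; the other keys already present in each main.json entry are not shown in this issue and are omitted here.

```python
from typing import Any, Dict, Optional


def manifest_level_entry(
    urn: str,
    has_mapped_variants: bool,
    has_acmg_calibrations: bool,
    has_functional_calibrations: bool,
) -> Dict[str, Any]:
    """Build the (partial) manifest entry carrying the score-set ceiling."""
    level: Optional[str]
    if not has_mapped_variants:
        # No mapped variants: no .va-spec.ndjson file is emitted.
        level = None
    elif has_acmg_calibrations:
        level = "pathogenicity_statement"
    elif has_functional_calibrations:
        level = "functional_statement"
    else:
        level = "study_result"
    return {"urn": urn, "max_va_spec_annotation_level": level}
```

Unlike the per-variant decision, this ignores individual scores entirely: it depends only on whether anything was mapped and on which calibrations the score set defines.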

Labels

app: backend (Task implementation touches the backend)
type: enhancement (Enhancement to an existing feature)
