Releases: dial481/allelix

v1.5.0

09 Jun 19:39
ac79998

What's new

Version tag consolidation across all six annotators. Local processing
stamps (iv:, pv:, cv:, sv:) are now stored in a dedicated
local_version_tag column instead of being appended to remote_signal.
Eliminates the suffix-parsing pattern that caused the SNPedia re-download
loop. All six annotators use the same dual-version mechanism. Existing
caches self-heal on upgrade — no re-download required.
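
As a rough illustration of the dual-version idea, the sketch below keeps the upstream signal and the local processing stamp in separate columns, so each can be compared on its own without parsing a suffix. The cache_meta table, its source column, and both helper functions are hypothetical names chosen for this example; only remote_signal, local_version_tag, and the sv: prefix come from the notes above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical metadata table: one row per annotator cache.
conn.execute(
    "CREATE TABLE cache_meta ("
    " source TEXT PRIMARY KEY,"
    " remote_signal TEXT,"       # upstream MD5/ETag, stored untouched
    " local_version_tag TEXT)"   # local processing stamp, e.g. 'sv:1'
)
conn.execute("INSERT INTO cache_meta VALUES (?, ?, ?)", ("snpedia", "etag-abc", "sv:1"))


def needs_download(conn, source, upstream_signal):
    """True when the upstream source changed (or the cache is unknown)."""
    row = conn.execute(
        "SELECT remote_signal FROM cache_meta WHERE source = ?", (source,)
    ).fetchone()
    return row is None or row[0] != upstream_signal


def needs_reprocess(conn, source, current_tag):
    """True when only the local processing version moved on."""
    row = conn.execute(
        "SELECT local_version_tag FROM cache_meta WHERE source = ?", (source,)
    ).fetchone()
    return row is None or row[0] != current_tag
```

With this layout, a bump of the local stamp (say from sv:1 to sv:2) makes needs_reprocess return True while needs_download stays False, which is the separation that avoids a re-download loop.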

Multi-allelic enrichment fix (#25). gnomAD and AlphaMissense now use
exact alt-allele matching instead of MAX() aggregation at multi-allelic
sites.
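
To see why the aggregation mattered, here is a minimal sketch with two alt alleles at one site sharing an rsID. The gnomad_frequencies table and its (chrom, pos, ref, alt) key are named in the v1.3.0 notes; the rsid and af column names, the sample rows, and both lookup functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gnomad_frequencies ("
    " chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, rsid TEXT, af REAL,"
    " PRIMARY KEY (chrom, pos, ref, alt))"
)
# One multi-allelic site: a common A>G and a rare A>T under the same rsID.
conn.executemany(
    "INSERT INTO gnomad_frequencies VALUES (?, ?, ?, ?, ?, ?)",
    [("1", 1000, "A", "G", "rs1", 0.30), ("1", 1000, "A", "T", "rs1", 0.0001)],
)


def af_max(conn, rsid):
    """Old behaviour: highest frequency among all alts for the rsID."""
    return conn.execute(
        "SELECT MAX(af) FROM gnomad_frequencies WHERE rsid = ?", (rsid,)
    ).fetchone()[0]


def af_exact(conn, rsid, alt):
    """New behaviour: frequency of the specific alt allele, or None."""
    row = conn.execute(
        "SELECT af FROM gnomad_frequencies WHERE rsid = ? AND alt = ?", (rsid, alt)
    ).fetchone()
    return row[0] if row else None
```

For a carrier of the rare T allele, af_max reports 0.30 while af_exact(conn, "rs1", "T") reports 0.0001, so the MAX() form can make a rare variant look common.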

SNPedia auto-download (#30). db update downloads the SNPedia cache
from HuggingFace automatically.

AlphaMissense gnomAD version stamping (#28). Runtime warning when the
installed gnomAD version doesn't match what was used to build the AM cache.
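
A check of this kind could look roughly like the following. The function name, its arguments, and the warning text are hypothetical; how Allelix actually stores and reads the stamp is not described here.

```python
import warnings


def check_gnomad_stamp(am_built_against: str, installed_gnomad: str) -> bool:
    """Warn (without failing) when the AM cache was built against another gnomAD version."""
    if am_built_against != installed_gnomad:
        warnings.warn(
            f"AlphaMissense cache was built against gnomAD {am_built_against}, "
            f"but gnomAD {installed_gnomad} is installed; rsID joins may be stale.",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True
```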

See CHANGELOG.md for the full list of fixes and additions.

Migration

Existing caches self-heal on first run. No action required beyond
pip install --upgrade allelix and running any allelix command.

Stats

  • 1066 tests, 92% coverage
  • ADR-0028: local_version_tag convention

v1.4.0

09 Jun 10:32
f3b5efd

Added

  • AlphaMissense variant pathogenicity enrichment. New AlphaMissenseAnnotator enriches annotations with missense variant pathogenicity scores from DeepMind's AlphaMissense (71M variants, CC BY 4.0). Pre-built SQLite cache downloaded from HuggingFace via db update. AM Score column in terminal, HTML, and JSON reports. PharmGKB rows show AM scores as neutral with caveat (protein structure impact only — tooltip in HTML, dimmed * footnote in terminal, am_caveat field in JSON). --no-alphamissense flag to skip.
  • Config file system. config.toml with per-source on/off toggles and license.commercial = true safety switch that auto-disables non-commercial sources (SNPedia). allelix config show/set/reset CLI commands. CLI flags override config per-invocation.
  • scripts/build_alphamissense_cache.py — AlphaMissense cache build script with Zenodo HTTPS streaming (default) and local TSV modes. Joins against gnomAD cache for coordinate-to-rsID mapping.
  • AlphaMissense CC BY 4.0 attribution in HTML and JSON reports.
  • Magnitude scoring legend in HTML report (collapsible, per-source scoring tables for ClinVar, PharmGKB, GWAS, SNPedia).
  • Source floor note in HTML report when per-source magnitude minimums are active.
  • Repute row background tints in HTML report (red for pathogenic/risk, green for protective/benign) derived from existing significance field.
  • Sortable columns in HTML report (magnitude, gene, source, AM score) via inline JavaScript.
  • ADR-0027 documenting the AlphaMissense enrichment cache architecture.

Fixed

  • HTML report table overflows viewport, columns clipped on left (#20). Added overflow-x: auto container, sticky rsID column, max-width on description cells, refs collapsed into <details> toggle, conditional Review Status column (hidden when all empty), stat card flex-wrap.
  • AlphaMissense build script has zero unit-test coverage (#24). Added 25 tests covering TSV parsing, gnomAD rsID join, chr prefix normalization, --no-gnomad NULL-rsid path, multi-allelic composite PK, batched insert, and end-to-end integration.
  • Download integrity: Content-Length check after downloads catches truncated files.
  • Disk space preflight before decompressing .sqlite.gz caches uses 5x gz size (accounts for gz + decompressed tmp on disk simultaneously).
  • _connection() guards on gnomAD and AlphaMissense annotators raise FileNotFoundError with actionable message when cache is missing.
  • Dead cache_exists() removed from gnomAD and AlphaMissense loaders.
  • Legacy caches are stamped with the remote signal instead of being re-downloaded on db update.
  • README database sizes updated to match actual on-disk measurements.

Changed

  • db update display includes gnomAD and AlphaMissense in "Analyzing against" annotator list.
  • Both build scripts (build_gnomad_cache.py, build_alphamissense_cache.py) run VACUUM for smaller output files.

v1.3.1

08 Jun 12:37

Fixed

  • Test suite downloaded real ~6 GB gnomAD cache on every run, filling CI
    runner disk. All db update tests now use a 792-byte mock fixture via
    file:// URL — same pattern as ClinVar, PharmGKB, and GWAS. No
    production code changes.

Changed

  • CI: job timeout (20 min), pytest step timeout (15 min),
    workflow_dispatch trigger, verbose output (pytest -v --tb=short)
  • Ship tooling: scripts/tag-release.sh derives tag from pyproject.toml
    (single source of truth)
  • Git hooks: raw .githooks/pre-push replaces pre-commit framework shim,
    blocks tag pushes where version doesn't match
  • CONTRIBUTING.md: corrected slow-test documentation (CI skips slow tests
    rather than running them), added "Run the full suite locally" section
  • Documentation: fixed stale hook instructions, added missing changelog
    comparison links and ADR index entries
  • Removed dead code: scripts/check_version_tag.sh
  • Removed version-tag-match entry from .pre-commit-config.yaml

v1.3.0: gnomAD Exome Frequency Cache

08 Jun 05:12

Added

  • gnomAD population allele frequencies. New GnomadAnnotator enriches
    report annotations with population frequency context from gnomAD v4.1 exomes
    (~16M variants, 730K individuals). Pre-built cache downloaded from HuggingFace
    via db update. Frequency column in terminal, HTML, and JSON reports.
    --no-gnomad flag to skip.
  • CPIC fallback for PharmGKB. db update succeeds when CPIC API is
    unreachable — reuses cached allele function data. Recovery auto-triggers on
    next successful check.
  • Graceful db update. Individual annotator download failures print an error
    and continue to remaining annotators instead of aborting the entire update.
  • scripts/build_gnomad_cache.py — streaming VCF build script. Downloads ~120GB
    gnomAD exome VCFs over HTTPS (or reads local files with --local-dir), never
    saves VCFs to disk, outputs ~6GB SQLite (~3GB gzipped).
  • JSON report schema_version bumped to "2" (added allele_frequency field).
    Diff engine accepts both v1 and v2 baselines.
  • gnomAD ODbL v1.0 attribution in HTML and JSON reports.
  • CI workflow (.github/workflows/ci.yml) — lint + test on push/PR to main.

Fixed

  • Offline claim in README corrected: analysis runs offline by default with opt-out
    freshness check, not opt-in network access.
  • __del__ partial-init crash on PharmGKB constructor failure.
  • .gitignore updated for GWAS Catalog test data.

Changed

  • Pre-push hook reduced to version-tag check only (CI runs the full suite).

Technical

  • Composite primary key (chrom, pos, ref, alt) on gnomad_frequencies
    preserves multi-allelic sites (rsID-only PK silently dropped ~20% of records).
  • Coordinate columns indexed for future AlphaMissense/CADD integration.
  • MAX(af) GROUP BY rsid in lookup queries handles multiple rows per rsID.
  • 951 tests, 93%+ coverage.
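
The primary-key point can be reproduced in a few lines. In the sketch below, two alt alleles share one rsID; a table keyed on rsID alone keeps only one of them, while the composite key keeps both. The table names by_rsid and by_site are invented for the comparison, and using INSERT OR REPLACE to model the silent drop is an assumption about how the loss occurred.

```python
import sqlite3

rows = [
    ("1", 1000, "A", "G", "rs1", 0.30),
    ("1", 1000, "A", "T", "rs1", 0.0001),  # second alt at the same site
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE by_rsid (rsid TEXT PRIMARY KEY, alt TEXT, af REAL)")
conn.execute(
    "CREATE TABLE by_site ("
    " chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, rsid TEXT, af REAL,"
    " PRIMARY KEY (chrom, pos, ref, alt))"
)

for chrom, pos, ref, alt, rsid, af in rows:
    # With an rsID-only key the second alt overwrites the first.
    conn.execute("INSERT OR REPLACE INTO by_rsid VALUES (?, ?, ?)", (rsid, alt, af))
    # With the composite key each alt allele gets its own row.
    conn.execute(
        "INSERT OR REPLACE INTO by_site VALUES (?, ?, ?, ?, ?, ?)",
        (chrom, pos, ref, alt, rsid, af),
    )

kept_by_rsid = conn.execute("SELECT COUNT(*) FROM by_rsid").fetchone()[0]
kept_by_site = conn.execute("SELECT COUNT(*) FROM by_site").fetchone()[0]
```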

Allelix v1.2.0 — 2026-06-07

07 Jun 05:32

Fixed

  • GRCh36 fallback bug. Non-confident GRCh36 detection (e.g., 3 of 4 probe SNPs matched) was falling back to GRCh37 as the effective build, silently bypassing the ClinVar safety guard and annotating GRCh36 positions against GRCh37 coordinates. Fixed in both the end-of-stream flush() path and the buffer-limit path (large FTDNA files where probe SNPs appear past the 100K-variant buffer cap). ClinVar is now correctly skipped for GRCh36 data.
  • pyproject.toml version corrected (was 1.1.0 on the v1.1.1 release).
  • Changelog dates corrected to match actual release timestamps.

Added

  • Database auto-refresh. The analysis pipeline checks database file ages before running. Databases older than 7 days with a changed remote signal (MD5/ETag) are refreshed automatically. Use --no-update to skip. Network failures warn and continue with stale caches.
  • Version-tag drift guard. Pre-push hook (scripts/check_version_tag.sh) asserts any pushed v* tag matches pyproject.toml version, preventing the class of bug where a release ships with a stale version string.
  • Corpas family exome VCF attribution with paper DOI (Corpas et al., BMC Genomics 2015, doi:10.1186/s12864-015-1973-7) and CC-BY/CC0 license. Every genotype fixture in the repo now has documented provenance.
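
The auto-refresh rule described above combines an age threshold, a remote-signal comparison, an opt-out, and a soft failure mode. A minimal sketch of that decision follows; the function name and parameters are hypothetical, the caller is assumed to supply the cache age, and treating OSError as the network-failure signal is an assumption.

```python
import warnings

MAX_AGE_DAYS = 7  # threshold quoted in the notes


def should_refresh(age_days, cached_signal, fetch_remote_signal, no_update=False):
    """Decide whether a database cache should be re-downloaded before analysis."""
    if no_update:                     # mirrors the --no-update flag
        return False
    if age_days <= MAX_AGE_DAYS:      # fresh enough: skip the network entirely
        return False
    try:
        remote_signal = fetch_remote_signal()  # e.g. an MD5 or ETag lookup
    except OSError as exc:
        # Network failures warn and continue with the stale cache.
        warnings.warn(f"freshness check failed, using stale cache: {exc}")
        return False
    return remote_signal != cached_signal
```

Passing the signal lookup as a callable keeps the network call out of the decision logic, so the rule can be exercised offline with a lambda.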

Allelix v1.1.1 — 2026-06-06

06 Jun 20:46

Housekeeping release -- zero behavior change.

Changed

  • Relocated real genotype test data (test_data/real/ and test_data/transcoded/) to GitHub release assets. Fresh clone size reduced from ~650 MB to ~150 MB. Tests skip gracefully when data is absent; scripts/fetch_testdata.sh restores it.
  • Clarified .gitignore and test_data/README.md: the "never commit" rule applies to private genetic data, not CC0 public-domain openSNP fixtures hosted as release assets.

Fixed

  • Orphaned [Unreleased] changelog sections assigned proper version numbers ([0.7.2] and [0.8.0]) matching their chronological position in the development history.
  • Duplicate [0.7.1] changelog header consolidated into a single entry.
  • Dead compare links for internal pre-release versions removed (0.x tags were never pushed to the public repository).

Test data

Download test_data.tar.gz below and extract it in the repo root, or run:

./scripts/fetch_testdata.sh

Allelix v1.1.0 — 2026-06-06

06 Jun 15:21

Parser hardening, build detection, compare command, license compliance.

New

  • allelix compare — compare two genotype files with strand-aware concordance classification. Detects concordant matches, strand-flip matches, discordant calls, and strand-ambiguous (palindromic A/T, C/G) SNPs. Per-chromosome breakdown. Build detection on both files.
  • High-value SNP no-call flagging — 12 clinically important SNPs (APOE, BRCA1/2, MTHFR, CYP2D6, etc.) are tracked. No-calls on these variants surface warnings in stats, analyze, and all report formats. Cluster-incomplete detection (e.g., "APOE genotype cannot be determined").
  • ClinVar review status — CLNREVSTAT now displayed in terminal, HTML, and JSON reports. Distinguishes expert-panel-reviewed from single-submitter pathogenic calls.
  • GRCh36 build detection — all 11 probe SNPs now have GRCh36 positions. 3-way voting across GRCh36/37/38. Headerless files with GRCh36 coordinates detected correctly.
  • License attributions in reports — HTML footer and JSON output include CC BY-SA 4.0 attribution for PharmGKB and CC BY-NC-SA 3.0 US for SNPedia when those annotators are used. Public-domain sources produce no extra attribution.
  • CONTRIBUTING.md — "How to add a parser" and "How to add an annotator" tutorials.
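
The four concordance classes used by allelix compare can be illustrated with a simplified classifier. This is not the project's implementation: the function names and return labels are invented, no-calls are not handled, and a pair is flagged strand-ambiguous whenever the alleles seen across both files form exactly a palindromic pair (A/T or C/G), which is a deliberate simplification.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PALINDROMIC = ({"A", "T"}, {"C", "G"})


def normalize(genotype: str) -> str:
    """Make allele order irrelevant: 'GA' and 'AG' both become 'AG'."""
    return "".join(sorted(genotype.upper()))


def classify(gt_a: str, gt_b: str) -> str:
    """Classify one SNP's genotype calls from two files."""
    a, b = normalize(gt_a), normalize(gt_b)
    if set(a) | set(b) in PALINDROMIC:
        # A/T and C/G SNPs read the same on both strands, so a flip
        # cannot be distinguished from a real difference.
        return "strand_ambiguous"
    if a == b:
        return "concordant"
    if normalize(a.translate(COMPLEMENT)) == b:
        return "strand_flip"      # same call reported on the opposite strand
    return "discordant"
```

For example, classify("AG", "CT") is a strand flip because complementing AG gives TC, while classify("AA", "TT") is reported as ambiguous rather than as either a flip or a discordance.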

Changed

  • 23andMe parser detection tightened to canonical first-line header. Files that merely mention "23andMe" in comments no longer match.

Fixed

  • is_must_include internal field no longer leaks into JSON output.
  • Compare command uses detect_build() instead of parser default build.

Allelix v1.0.0

06 Jun 12:17

Open-source genotype analysis toolkit. Takes raw DNA files from consumer
testing services and runs them against public variant databases, producing
source-attributed research reports. The open-source Promethease replacement.

Supported input formats

  • 23andMe
  • AncestryDNA
  • Family Tree DNA (FTDNA)
  • MyHeritage DNA
  • Living DNA
  • MyHappyGenes (Tempus)

Auto-detection identifies the format from file structure. --format overrides
when needed.

Annotation sources

Database      License          Lookup method
ClinVar       Public domain    Position-based, dual-build (GRCh37 + GRCh38)
PharmGKB      CC BY-SA 4.0     rsID-based, CPIC allele function filtering
GWAS Catalog  Public domain    rsID-based, p-value + effect size scoring
SNPedia       CC BY-NC-SA 3.0  rsID-based (--exclude-snpedia for commercial use)

All databases are downloaded by the user via allelix db update and cached
locally. Analysis runs offline with zero network access.

Key features

  • Build auto-detection. Detects genome build (GRCh37/GRCh38) from position
    data, not file headers. Warns on header/data mismatch.
  • Offline-first. No telemetry, no uploads, no analytics. Genotype data
    never touches a network.
  • Streaming parsers. 17 MB files never load fully into memory.
  • Three report formats. Terminal (Rich), self-contained HTML, machine-readable
    JSON (schema v1).
  • Report diff. --diff previous.json compares the current run against a
    prior report and surfaces new, removed, and changed annotations.
  • Focused reports. allelix methylation and allelix pharmacogenomics for
    targeted analysis.
  • Freshness detection. db update checks remote signals (MD5, ETag) and
    only re-downloads when the upstream source has changed.
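
The streaming-parser feature amounts to yielding one variant at a time instead of reading the whole file. The generator below is a minimal sketch for a 23andMe-style raw file (tab-separated rsid, chromosome, position, genotype, with # comment lines); the Variant type and the stream_variants name are hypothetical and are not Allelix's parser API.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Variant(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    genotype: str


def stream_variants(handle: TextIO) -> Iterator[Variant]:
    """Yield variants line by line so memory use stays flat for large files."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chrom, pos, genotype = line.rstrip("\r\n").split("\t")[:4]
        yield Variant(rsid, chrom, int(pos), genotype)


SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t82154\tAA\n"
    "rs3094315\t1\t752566\tAG\n"
)
variants = list(stream_variants(io.StringIO(SAMPLE)))
```

In real use the handle would be an open file object, and the caller would iterate over the generator directly rather than materialising a list.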

Install

pip install allelix
allelix db update
allelix analyze your_file.txt

Requires Python 3.11+.

Known limitations

  • GRCh36 (hg18). Allelix has no GRCh36 ClinVar cache, so ClinVar
    annotations are skipped entirely for GRCh36 files. PharmGKB, GWAS Catalog,
    and SNPedia use rsID-only lookups and are unaffected.
  • Star alleles. CYP2D6, CYP2C19, and other genes annotated by haplotype
    in PharmGKB are underserved — only SNV-level annotations are matched.
  • VCF. Not yet supported. Planned for v2.0.

License

AGPL-3.0-or-later. Third-party databases retain their original licenses on the
user's machine.

Test suite

794 tests, 94% coverage.