Releases: dial481/allelix
v1.5.0
What's new
Version tag consolidation across all six annotators. Local processing
stamps (iv:, pv:, cv:, sv:) are now stored in a dedicated
local_version_tag column instead of being appended to remote_signal.
Eliminates the suffix-parsing pattern that caused the SNPedia re-download
loop. All six annotators use the same dual-version mechanism. Existing
caches self-heal on upgrade — no re-download required.
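As a rough illustration of why separating the two values helps, the sketch below keeps the upstream signal and the local processing stamp in separate fields, so each comparison is a plain equality check with no suffix parsing. The `CacheMeta` class, both helper functions, and the sample values are hypothetical; only the `remote_signal` and `local_version_tag` names and the `sv:` prefix come from the notes above.

```python
from dataclasses import dataclass


@dataclass
class CacheMeta:
    """Hypothetical in-memory view of one annotator's cache metadata row."""
    remote_signal: str      # upstream MD5/ETag, stored exactly as received
    local_version_tag: str  # local processing stamp, e.g. "sv:3"


def needs_download(meta: CacheMeta, upstream_signal: str) -> bool:
    # The upstream signal is compared verbatim, so a local stamp can never
    # make an unchanged upstream file look different.
    return meta.remote_signal != upstream_signal


def needs_reprocess(meta: CacheMeta, current_tag: str) -> bool:
    # A bumped local stamp triggers local reprocessing only, not a download.
    return meta.local_version_tag != current_tag
```

With this layout, bumping the local stamp changes the answer from `needs_reprocess` while `needs_download` stays false for an unchanged upstream file.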
Multi-allelic enrichment fix (#25). gnomAD and AlphaMissense now use
exact alt-allele matching instead of MAX() aggregation at multi-allelic
sites.
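The difference between the two lookup styles can be sketched with an in-memory SQLite table. The schema follows the composite key described for gnomad_frequencies in the v1.3.0 notes, but the column list, the helper names, and the sample rows are assumptions made for illustration, not the project's actual queries.

```python
import sqlite3

# Hypothetical table: one row per (chrom, pos, ref, alt).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gnomad_frequencies ("
    " rsid TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL,"
    " PRIMARY KEY (chrom, pos, ref, alt))"
)
conn.executemany(
    "INSERT INTO gnomad_frequencies VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("rs0000001", "1", 12345, "A", "G", 0.31),    # common alt allele
        ("rs0000001", "1", 12345, "A", "T", 0.0002),  # rare alt, same site
    ],
)


def af_by_max(rsid):
    """Aggregating lookup: collapses every alt allele to the largest AF."""
    row = conn.execute(
        "SELECT MAX(af) FROM gnomad_frequencies WHERE rsid = ?", (rsid,)
    ).fetchone()
    return row[0]


def af_by_alt(rsid, alt):
    """Exact lookup: AF of the requested alt allele, or None if absent."""
    row = conn.execute(
        "SELECT af FROM gnomad_frequencies WHERE rsid = ? AND alt = ?",
        (rsid, alt),
    ).fetchone()
    return row[0] if row else None
```

In this toy data, the aggregating lookup reports the common allele's frequency even when the variant of interest is the rare `T` allele, which is the kind of mismatch exact matching avoids.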
SNPedia auto-download (#30). db update downloads the SNPedia cache
from HuggingFace automatically.
AlphaMissense gnomAD version stamping (#28). Runtime warning when the
installed gnomAD version doesn't match what was used to build the AM cache.
See CHANGELOG.md for the full list of fixes and additions.
Migration
Existing caches self-heal on first run. No action required beyond
pip install --upgrade allelix and running any allelix command.
Stats
- 1066 tests, 92% coverage
- ADR-0028: local_version_tag convention
v1.4.0
Added
- AlphaMissense variant pathogenicity enrichment. New AlphaMissenseAnnotator enriches annotations with missense variant pathogenicity scores from DeepMind's AlphaMissense (71M variants, CC BY 4.0). Pre-built SQLite cache downloaded from HuggingFace via db update. AM Score column in terminal, HTML, and JSON reports. PharmGKB rows show AM scores as neutral with a caveat (protein structure impact only: tooltip in HTML, dimmed * footnote in terminal, am_caveat field in JSON). --no-alphamissense flag to skip.
- Config file system. config.toml with per-source on/off toggles and a license.commercial = true safety switch that auto-disables non-commercial sources (SNPedia). allelix config show/set/reset CLI commands. CLI flags override config per-invocation.
- scripts/build_alphamissense_cache.py: AlphaMissense cache build script with Zenodo HTTPS streaming (default) and local TSV modes. Joins against gnomAD cache for coordinate-to-rsID mapping.
- AlphaMissense CC BY 4.0 attribution in HTML and JSON reports.
- Magnitude scoring legend in HTML report (collapsible, per-source scoring tables for ClinVar, PharmGKB, GWAS, SNPedia).
- Source floor note in HTML report when per-source magnitude minimums are active.
- Repute row background tints in HTML report (red for pathogenic/risk, green for protective/benign) derived from existing significance field.
- Sortable columns in HTML report (magnitude, gene, source, AM score) via inline JavaScript.
- ADR-0027 documenting the AlphaMissense enrichment cache architecture.
Fixed
- HTML report table overflows viewport, columns clipped on left (#20). Added overflow-x: auto container, sticky rsID column, max-width on description cells, refs collapsed into <details> toggle, conditional Review Status column (hidden when all empty), stat card flex-wrap.
- AlphaMissense build script has zero unit-test coverage (#24). Added 25 tests covering TSV parsing, gnomAD rsID join, chr prefix normalization, --no-gnomad NULL-rsid path, multi-allelic composite PK, batched insert, and end-to-end integration.
- Download integrity: Content-Length check after downloads catches truncated files.
- Disk space preflight before decompressing .sqlite.gz caches uses 5x gz size (accounts for gz + decompressed tmp on disk simultaneously).
- _connection() guards on gnomAD and AlphaMissense annotators raise FileNotFoundError with actionable message when cache is missing.
- Dead cache_exists() removed from gnomAD and AlphaMissense loaders.
- Legacy caches stamp remote signal instead of re-downloading on db update.
- README database sizes updated to match actual on-disk measurements.
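The two download safeguards in the list above (the Content-Length check and the 5x disk space preflight) could look roughly like this. Both function names and the `free_bytes` override are hypothetical; only the 5x factor and the use of Content-Length come from the notes.

```python
import shutil
from pathlib import Path


def download_is_complete(path, content_length):
    """True when the bytes on disk match the server's Content-Length."""
    return Path(path).stat().st_size == content_length


def has_room_to_decompress(gz_path, free_bytes=None, factor=5):
    """True when free space is at least `factor` times the .gz size.

    The margin covers the .gz file and the decompressed temporary file
    sitting on disk at the same time.
    """
    gz_path = Path(gz_path).resolve()
    if free_bytes is None:
        # Ask the filesystem that holds the archive.
        free_bytes = shutil.disk_usage(gz_path.parent).free
    return free_bytes >= factor * gz_path.stat().st_size
```

Passing `free_bytes` explicitly makes the threshold easy to exercise without needing a nearly full disk.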
Changed
- db update display includes gnomAD and AlphaMissense in "Analyzing against" annotator list.
- Both build scripts (build_gnomad_cache.py, build_alphamissense_cache.py) run VACUUM for smaller output files.
v1.3.1
Fixed
- Test suite downloaded real ~6 GB gnomAD cache on every run, filling CI runner disk. All db update tests now use a 792-byte mock fixture via file:// URL (same pattern as ClinVar, PharmGKB, and GWAS). No production code changes.
Changed
- CI: job timeout (20 min), pytest step timeout (15 min), workflow_dispatch trigger, verbose output (pytest -v --tb=short)
- Ship tooling: scripts/tag-release.sh derives tag from pyproject.toml (single source of truth)
- Git hooks: raw .githooks/pre-push replaces pre-commit framework shim, blocks tag pushes where version doesn't match
- CONTRIBUTING.md: corrected slow-test documentation (CI skips them, not runs them), added "Run the full suite locally" section
- Documentation: fixed stale hook instructions, added missing changelog comparison links and ADR index entries
- Removed dead code: scripts/check_version_tag.sh
- Removed version-tag-match entry from .pre-commit-config.yaml
v1.3.0: gnomAD Exome Frequency Cache
Added
- gnomAD population allele frequencies. New GnomadAnnotator enriches report annotations with population frequency context from gnomAD v4.1 exomes (~16M variants, 730K individuals). Pre-built cache downloaded from HuggingFace via db update. Frequency column in terminal, HTML, and JSON reports. --no-gnomad flag to skip.
- CPIC fallback for PharmGKB. db update succeeds when CPIC API is unreachable: it reuses cached allele function data. Recovery auto-triggers on next successful check.
- Graceful db update. Individual annotator download failures print an error and continue to remaining annotators instead of aborting the entire update.
- scripts/build_gnomad_cache.py: streaming VCF build script. Downloads ~120GB gnomAD exome VCFs over HTTPS (or reads local files with --local-dir), never saves VCFs to disk, outputs ~6GB SQLite (~3GB gzipped).
- JSON report schema_version bumped to "2" (added allele_frequency field). Diff engine accepts both v1 and v2 baselines.
- gnomAD ODbL v1.0 attribution in HTML and JSON reports.
- CI workflow (.github/workflows/ci.yml): lint + test on push/PR to main.
Fixed
- Offline claim in README corrected: analysis runs offline by default with opt-out freshness check, not opt-in network access.
- __del__ partial-init crash on PharmGKB constructor failure.
- .gitignore updated for GWAS Catalog test data.
Changed
- Pre-push hook reduced to version-tag check only (CI runs the full suite).
Technical
- Composite primary key (chrom, pos, ref, alt) on gnomad_frequencies: preserves multi-allelic sites (rsID-only PK silently dropped ~20% of records).
- Coordinate columns indexed for future AlphaMissense/CADD integration.
- MAX(af) GROUP BY rsid in lookup queries handles multiple rows per rsID.
- 951 tests, 93%+ coverage.
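A small experiment shows how the choice of primary key affects multi-allelic sites. It assumes rows are written with `INSERT OR REPLACE`, which is one plausible way an rsID-only key could lose records silently; the actual build script's insert strategy is not stated in these notes, and the column list and sample rows are made up.

```python
import sqlite3

# Two alt alleles share one rsID at the same position (a multi-allelic site).
ROWS = [
    ("rs0000002", "2", 500, "C", "T", 0.12),
    ("rs0000002", "2", 500, "C", "G", 0.01),
    ("rs0000003", "3", 900, "G", "A", 0.40),
]


def rows_kept(primary_key):
    """Load ROWS into a fresh table and count how many rows survive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gnomad_frequencies ("
        " rsid TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL,"
        f" PRIMARY KEY ({primary_key}))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO gnomad_frequencies VALUES (?, ?, ?, ?, ?, ?)",
        ROWS,
    )
    count = conn.execute("SELECT COUNT(*) FROM gnomad_frequencies").fetchone()[0]
    conn.close()
    return count
```

With the rsID-only key the second allele overwrites the first without any error, while the composite key keeps both rows.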
Allelix v1.2.0 — 2026-06-07
Fixed
- GRCh36 fallback bug. Non-confident GRCh36 detection (e.g., 3 of 4 probe SNPs matched) was falling back to GRCh37 as the effective build, silently bypassing the ClinVar safety guard and annotating GRCh36 positions against GRCh37 coordinates. Fixed in both the end-of-stream flush() path and the buffer-limit path (large FTDNA files where probe SNPs appear past the 100K-variant buffer cap). ClinVar is now correctly skipped for GRCh36 data.
- pyproject.toml version corrected (was 1.1.0 on the v1.1.1 release).
- Changelog dates corrected to match actual release timestamps.
Added
- Database auto-refresh. The analysis pipeline checks database file ages before running. Databases older than 7 days with a changed remote signal (MD5/ETag) are refreshed automatically. Use --no-update to skip. Network failures warn and continue with stale caches.
- Version-tag drift guard. Pre-push hook (scripts/check_version_tag.sh) asserts any pushed v* tag matches pyproject.toml version, preventing the class of bug where a release ships with a stale version string.
- Corpas family exome VCF attribution with paper DOI (Corpas et al., BMC Genomics 2015, doi:10.1186/s12864-015-1973-7) and CC-BY/CC0 license. Every genotype fixture in the repo now has documented provenance.
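The auto-refresh decision described in the first bullet can be written as a small pure function. The function name, its parameters, and the callable used to fetch the remote signal are hypothetical; the 7-day threshold, the signal comparison, the --no-update opt-out, and the warn-and-continue behaviour on network failure come from the notes.

```python
import warnings
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)


def should_refresh(last_updated, now, cached_signal, fetch_remote_signal,
                   no_update=False):
    """Decide whether a cached database should be re-downloaded."""
    if no_update:
        return False  # user opted out, as with --no-update
    if now - last_updated <= MAX_AGE:
        return False  # young enough; skip the network round trip
    try:
        remote_signal = fetch_remote_signal()
    except OSError as exc:
        # Network trouble: warn and keep using the stale cache.
        warnings.warn(f"freshness check failed, using stale cache: {exc}")
        return False
    return remote_signal != cached_signal
```

Checking the age first means a recent cache never triggers a network request at all.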
Allelix v1.1.1 — 2026-06-06
Housekeeping release -- zero behavior change.
Changed
- Relocated real genotype test data (test_data/real/ and test_data/transcoded/) to GitHub release assets. Fresh clone size reduced from ~650 MB to ~150 MB. Tests skip gracefully when data is absent; scripts/fetch_testdata.sh restores it.
- Clarified .gitignore and test_data/README.md: the "never commit" rule applies to private genetic data, not CC0 public-domain openSNP fixtures hosted as release assets.
Fixed
- Orphaned [Unreleased] changelog sections assigned proper version numbers ([0.7.2] and [0.8.0]) matching their chronological position in the development history.
- Duplicate [0.7.1] changelog header consolidated into a single entry.
- Dead compare links for internal pre-release versions removed (0.x tags were never pushed to the public repository).
Test data
Download test_data.tar.gz below and extract it in the repo root, or run:
./scripts/fetch_testdata.sh
Allelix v1.1.0 — 2026-06-06
Parser hardening, build detection, compare command, license compliance.
New
- allelix compare: compare two genotype files with strand-aware concordance classification. Detects concordant matches, strand-flip matches, discordant calls, and strand-ambiguous (palindromic A/T, C/G) SNPs. Per-chromosome breakdown. Build detection on both files.
- High-value SNP no-call flagging: 12 clinically important SNPs (APOE, BRCA1/2, MTHFR, CYP2D6, etc.) are tracked. No-calls on these variants surface warnings in stats, analyze, and all report formats. Cluster-incomplete detection (e.g., "APOE genotype cannot be determined").
- ClinVar review status: CLNREVSTAT now displayed in terminal, HTML, and JSON reports. Distinguishes expert-panel-reviewed from single-submitter pathogenic calls.
- GRCh36 build detection: all 11 probe SNPs now have GRCh36 positions. 3-way voting across GRCh36/37/38. Headerless files with GRCh36 coordinates detected correctly.
- License attributions in reports: HTML footer and JSON output include CC BY-SA 4.0 attribution for PharmGKB and CC BY-NC-SA 3.0 US for SNPedia when those annotators are used. Public-domain sources produce no extra attribution.
- CONTRIBUTING.md: "How to add a parser" and "How to add an annotator" tutorials.
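The four concordance classes named for allelix compare can be illustrated with a minimal classifier over two genotype calls. This is a sketch under simplifying assumptions, not the command's actual logic: it sees only the two calls (no reference alleles), so it flags a SNP as strand-ambiguous only when the alleles observed across both calls form a palindromic pair. The `classify` function and its return labels are hypothetical.

```python
# Map each base to its complement on the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Allele pairs that read the same on both strands.
PALINDROMIC = ({"A", "T"}, {"C", "G"})


def classify(call_a, call_b):
    """Classify two genotype calls for the same SNP, e.g. 'AG' vs 'CT'."""
    a = sorted(call_a.upper())
    b = sorted(call_b.upper())
    if set(a) | set(b) in PALINDROMIC:
        # A/T or C/G: a strand flip cannot be told apart from a mismatch.
        return "strand-ambiguous"
    if a == b:
        return "concordant"
    if sorted(call_a.upper().translate(COMPLEMENT)) == b:
        return "strand-flip"
    return "discordant"
```

Sorting the alleles makes the comparison independent of allele order, so `AG` and `GA` count as the same call.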
Changed
- 23andMe parser detection tightened to canonical first-line header. Files that merely mention "23andMe" in comments no longer match.
Fixed
- is_must_include internal field no longer leaks into JSON output.
- Compare command uses detect_build() instead of parser default build.
Allelix v1.0.0
Open-source genotype analysis toolkit. Takes raw DNA files from consumer
testing services and runs them against public variant databases, producing
source-attributed research reports. The open-source Promethease replacement.
Supported input formats
- 23andMe
- AncestryDNA
- Family Tree DNA (FTDNA)
- MyHeritage DNA
- Living DNA
- MyHappyGenes (Tempus)
Auto-detection identifies the format from file structure. --format overrides
when needed.
Annotation sources
| Database | License | Lookup method |
|---|---|---|
| ClinVar | Public domain | Position-based, dual-build (GRCh37 + GRCh38) |
| PharmGKB | CC BY-SA 4.0 | rsID-based, CPIC allele function filtering |
| GWAS Catalog | Public domain | rsID-based, p-value + effect size scoring |
| SNPedia | CC BY-NC-SA 3.0 | rsID-based (--exclude-snpedia for commercial use) |
All databases are downloaded by the user via allelix db update and cached
locally. Analysis runs offline with zero network access.
Key features
- Build auto-detection. Detects genome build (GRCh37/GRCh38) from position data, not file headers. Warns on header/data mismatch.
- Offline-first. No telemetry, no uploads, no analytics. Genotype data never touches a network.
- Streaming parsers. 17 MB files never load fully into memory.
- Three report formats. Terminal (Rich), self-contained HTML, machine-readable JSON (schema v1).
- Report diff. --diff previous.json compares the current run against a prior report and surfaces new, removed, and changed annotations.
- Focused reports. allelix methylation and allelix pharmacogenomics for targeted analysis.
- Freshness detection. db update checks remote signals (MD5, ETag) and only re-downloads when the upstream source has changed.
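The report diff feature can be pictured as a keyed comparison of two annotation lists. The sketch assumes each annotation is a dict identified by an `rsid` and a `source` field; those field names, the `magnitude` field in the sample data, and the `diff_annotations` helper are hypothetical and not taken from the real JSON schema.

```python
def diff_annotations(previous, current):
    """Group annotation keys into new, removed, and changed."""

    def index(rows):
        # Hypothetical identity of an annotation: (rsid, source).
        return {(row["rsid"], row["source"]): row for row in rows}

    old, new = index(previous), index(current)
    return {
        "new": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            key for key in old.keys() & new.keys() if old[key] != new[key]
        ),
    }


PREVIOUS = [
    {"rsid": "rs1", "source": "ClinVar", "magnitude": 3},
    {"rsid": "rs2", "source": "GWAS", "magnitude": 1},
]
CURRENT = [
    {"rsid": "rs1", "source": "ClinVar", "magnitude": 4},   # changed
    {"rsid": "rs3", "source": "PharmGKB", "magnitude": 2},  # new
]
```

In practice the two lists would come from a prior JSON report and the current run; here they are inline so the sketch stays self-contained.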
Install
pip install allelix
allelix db update
allelix analyze your_file.txt
Requires Python 3.11+.
Known limitations
- GRCh36 (hg18). Allelix has no GRCh36 ClinVar cache, so ClinVar annotations are skipped entirely for GRCh36 files. PharmGKB, GWAS Catalog, and SNPedia use rsID-only lookups and are unaffected.
- Star alleles. CYP2D6, CYP2C19, and other genes annotated by haplotype in PharmGKB are underserved: only SNV-level annotations are matched.
- VCF. Not yet supported. Planned for v2.0.
License
AGPL-3.0-or-later. Third-party databases retain their original licenses on the
user's machine.
Test suite
794 tests, 94% coverage.