feat: N_NO_COVERAGE field and coverage-evidence filters #31
Distinguish between true hom-ref and uncertain coverage for non-carrier WES samples. Two opt-in mechanisms (combinable, fully backward-compatible) move samples from N_HOM_REF to a new N_NO_COVERAGE field while keeping them in eligible/AN.

Phase 1 (query-time, no schema change):
- --min-pass K: per-WES-tech gate on PASS carriers (het|hom)
- --min-observed K: per-WES-tech gate on any-VCF entries (het|hom|fail)

Phase 2 (build-time, schema_version 3.0):
- --min-dp / --min-gq / --min-qual: carrier quality thresholds at create-db
- --min-covered K: minimum quality carriers per WES tech
- --min-quality-evidence K: query-time companion (errors on legacy DBs)

Stores two new Parquet columns (filtered_bitmap, quality_pass_bitmap) and the chosen thresholds under coverage_filter in manifest.json. update-db recomputes filtered_bitmap on add-samples; compact preserves both columns. ingest reads FORMAT/DP, FORMAT/GQ, and QUAL (None when absent).

New invariant: N_HET + N_HOM_ALT + N_HOM_REF + N_FAIL + N_NO_COVERAGE = n_eligible

WGS samples and carriers (het/hom/fail) are never reclassified.

Affected commands: query, variant-info (genotype='no_coverage'), annotate (AFQUERY_N_NO_COVERAGE INFO), dump (N_NO_COVERAGE column).

resources/normalize_vcf.sh now preserves FORMAT/DP and FORMAT/GQ.
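As a rough illustration of the query-time gate and the invariant described above, here is a minimal sketch. It is not the PR's implementation: it uses plain Python sets where the project uses bitmaps, and `GenotypeCounts`, `count_genotypes`, `tech_of`, and `wes_techs` are hypothetical names. It assumes het, hom, and fail are disjoint and that a WES tech whose carrier count falls below either threshold has all of its non-carriers moved to N_NO_COVERAGE.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCounts:
    n_het: int
    n_hom_alt: int
    n_hom_ref: int
    n_fail: int
    n_no_coverage: int

    def total(self) -> int:
        # Left-hand side of the invariant; should equal n_eligible.
        return (self.n_het + self.n_hom_alt + self.n_hom_ref
                + self.n_fail + self.n_no_coverage)


def count_genotypes(eligible, het, hom, fail, tech_of, wes_techs,
                    min_pass=0, min_observed=0):
    """Hypothetical per-position counting with a per-WES-tech evidence gate."""
    het, hom, fail = het & eligible, hom & eligible, fail & eligible
    carriers = het | hom | fail  # carriers are never reclassified
    no_coverage = set()
    for tech in wes_techs:  # WGS techs are not listed, so they are never gated
        members = {s for s in eligible if tech_of[s] == tech}
        n_pass = len(members & (het | hom))    # PASS carriers
        n_observed = len(members & carriers)   # any VCF entry
        if n_pass < min_pass or n_observed < min_observed:
            no_coverage |= members - carriers
    hom_ref = eligible - carriers - no_coverage  # residual
    return GenotypeCounts(len(het), len(hom), len(hom_ref),
                          len(fail), len(no_coverage))


# Toy cohort: two WGS samples and two WES kits with three samples each.
eligible = set(range(1, 9))
tech_of = {1: "wgs", 2: "wgs", 3: "wes_a", 4: "wes_a", 5: "wes_a",
           6: "wes_b", 7: "wes_b", 8: "wes_b"}
het, hom, fail = {3}, set(), {6}

default = count_genotypes(eligible, het, hom, fail, tech_of, ["wes_a", "wes_b"])
gated = count_genotypes(eligible, het, hom, fail, tech_of, ["wes_a", "wes_b"],
                        min_pass=1)
```

In this toy cohort, `wes_b` has a failed call but no PASS carrier, so with `min_pass=1` its two non-carriers leave N_HOM_REF while the total stays equal to the number of eligible samples.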
- docs/advanced/coverage-evidence.md (new): conceptual guide covering both phases, threshold-selection guidance, and the new genotype invariant.
- docs/guides/query.md: new section "Coverage-Evidence Filters" plus N_NO_COVERAGE in text/tsv/json examples.
- docs/guides/create-database.md: new section explaining --min-dp/--min-gq/--min-qual/--min-covered and the schema_version 3.0 bump.
- docs/guides/update-database.md: explain filtered_bitmap recomputation on add-samples for Phase 2 databases.
- docs/getting-started/preprocessing.md: note FORMAT/DP and FORMAT/GQ preservation for Phase 2 quality thresholds.
- docs/reference/cli.md: add Phase 1/2 flags to query, variant-info, annotate, dump, and create-db.
- docs/reference/python-api.md: add N_NO_COVERAGE to QueryResult, 'no_coverage' genotype to SampleCarrier, and the three new SampleFilter fields.
- mkdocs.yml: link the new advanced page.
- Replace lexicographic schema_version >= "3.0" with tuple-based
_parse_schema_version (lex would break at "10.0" / "3.10").
- Rename _select_cols -> _bitmap_cols returning list[str]; drop the
brittle .split(",") aliasing in _query_batch_inner.
- Unify duplicated het|hom|fail bitmap union in _compute_no_coverage_bm.
- Widen SampleCarrier.filter_pass to bool | None; emit None for
no_coverage rows (rendered as null / empty / '-' across json/tsv/text)
so PASS/FAIL no longer misrepresents samples that had no call.
- Drop duplicate `from pyroaring import BitMap` in compact.py.
- Pass PARQUET_SCHEMA to pa.table in _make_phase2_db so test fixtures
match production large_binary types.
- Add test_no_coverage_filter_pass_is_none assertion.
- Update Python/CLI reference docs for the filter_pass type change.
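To illustrate why the first item in the list above replaces the string comparison, here is a minimal sketch of tuple-based version parsing. The commit names `_parse_schema_version` but does not show its body, so the `parse_schema_version` function below is a hypothetical stand-in that assumes purely numeric, dot-separated versions.

```python
def parse_schema_version(version: str) -> tuple[int, ...]:
    """Hypothetical helper: turn "3.10" into (3, 10) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


# Lexicographic string comparison gets multi-digit components wrong:
lex_major = "10.0" >= "3.0"   # compares "1" with "3" first
lex_minor = "3.10" >= "3.9"   # compares "1" with "9" first

# Tuple comparison orders each component numerically:
tup_major = parse_schema_version("10.0") >= parse_schema_version("3.0")
tup_minor = parse_schema_version("3.10") >= parse_schema_version("3.9")
```

Both string comparisons evaluate to `False` and both tuple comparisons to `True`, which matches the "10.0" / "3.10" cases called out in the commit message.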
> exactly right: every covered sample without a variant call is hom-ref. For WES
> samples it is a *best-effort* assumption: the BED capture region tells us a
> position *could* be sequenced, but not that it *was* sequenced at adequate
> depth in this particular sample. Standard variant-only VCFs do not contain
Start the paragraph with an introductory sentence, such as: "Standard variant-only VCFs do not contain hom-ref calls, so AFQuery cannot distinguish 'true hom-ref' from 'no coverage' for non-carrier WES samples."
> # Coverage Evidence
>
> `N_HOM_REF` is computed as a residual:
> `len(eligible) − N_HET − N_HOM_ALT − N_FAIL`. For WGS samples that residual is
Replace "WGS" with "fully covered genomes".
> for non-carrier WES samples.
>
> Two opt-in mechanisms let users tighten that assumption. Together they expose
> a new field, **`N_NO_COVERAGE`**, that holds samples whose hom-ref status is
Remove the mentions of "new" functionality: the tool is still under development, so there is no need to mention old features.
> ---
>
> ## Phase 1 — Query-time, evidence-counting
Remove the mentions of Phase 1 and Phase 2. This documentation must focus on how to use the tool and when to use the different parameters.
> If the tech falls below either threshold, *all of its non-carrier samples* at
> that position move from `N_HOM_REF` to `N_NO_COVERAGE`. When both flags are
> set, both must hold (AND). Default `0` ⇒ no filtering, identical to legacy
Remove the mention of legacy behaviour.
Rewrites docs/advanced/coverage-evidence.md as user-facing documentation: drops the Phase 1 / Phase 2 / schema_version / "best-effort" / "legacy" framing, adopts "fully-covered" / "partially-covered" terminology, adds a worked before/after query example, and structures the page around when to reach for each flag (--min-pass, --min-observed, --min-dp, --min-gq, --min-qual, --min-covered, --min-quality-evidence). Cleans up the same Phase 1/2 and schema_version 3.0 wording in create-database.md, update-database.md, query.md, cli.md, python-api.md, and preprocessing.md. Adds the previously-undocumented N_NO_COVERAGE / no_coverage / AFQUERY_N_NO_COVERAGE rows to the field tables in glossary.md, understanding-output.md, annotate-vcf.md, dump-export.md, and variant-info.md, where the field already showed up in output examples.
Closes #19, #17.

The previous CI used `mkdocs gh-deploy --force`, which ignored the `mike` provider declared in `mkdocs.yml` and overwrote `gh-pages` flat on every push, so the version selector had nothing to render. The print-site plugin had `add_to_navigation: false`, so the generated `/print_page/` was unreachable from the UI.

- `docs.yml`: replace `mkdocs gh-deploy` with `mike deploy --push --update-aliases dev`; configure git identity; fetch `gh-pages`. Add a `bootstrap` workflow_dispatch input that runs `mike delete --all` first, for the one-time migration from the prior flat deploy.
- `release.yml`: new `docs` job for non-`rc` tags. Runs `mike deploy --push --update-aliases <version> latest` and `mike set-default --push latest`, so the site root redirects to the most recent tag.
- `mkdocs.yml`: add `site_url` (required by mike for cross-version links); flip `print-site.add_to_navigation` to `true`; register `autorefs` explicitly before `mkdocstrings` so it does not get auto-inserted after `print-site` (silences the false-positive "print-site should be last" warning under `--strict`).
- `CONTRIBUTING.md`: document the docs deployment and release workflow, local preview, and the one-time bootstrap.
Summary
- `N_NO_COVERAGE` field tracking samples whose technology does not cover a variant's region, so AN/AF can be interpreted in light of capture differences across WGS/WES kits.
- `docs/advanced/coverage-evidence.md` plus updates to the create-db, update-db, query, CLI, and Python API guides.

Test plan
- `pytest --tb=short -q` (includes new `tests/test_no_coverage.py`, 284 lines)
- `pytest tests/test_no_coverage.py -v`
- `afquery query` and `afquery annotate` on a sample DB to confirm `N_NO_COVERAGE` appears and filters behave as documented
- `mkdocs serve` renders the new coverage-evidence page