
feat(ptm): stratified AF-depletion analysis (hvantk ptm constraint)#89

Merged
enriquea merged 8 commits into dev from ptm-constraint
Apr 14, 2026

Conversation

@enriquea
Collaborator

Summary

  • Adds hvantk ptm constraint — a stratified allele-frequency depletion analysis that compares gnomAD AF distributions between PTM-proximal and non-PTM variants, grouped by tissue / cell-type / custom metadata fields. Turns the multi-dataset EDA prototyped in Notebooks E–I (GTEx RNA, GTEx protein, HCA adult, Farah mid-fetal, Asp early fetal) into a reusable CLI.
  • New modules under hvantk/ptm/: constraint.py (orchestrator + 5 statistical tests), constraint_expression.py (Hail MT / AnnData / tabular backend adapter), constraint_plots.py (4-panel figure renderer), constraint_report.py (self-contained HTML report). τ computation is delegated to the maintained tspex package via a thin wrapper in hvantk/utils/tissue_specificity.py.
  • Follows the library-preference policy from the design spec (prefer established ecosystem packages over reimplementing τ / aggregation / differential expression).

Test plan

  • pytest hvantk/tests/test_ptm_constraint.py — 6 smoke tests pass (tspex-vs-Yanai reference, tabular / AnnData adapter contracts, config validation, gene-feature computation, CLI help).
  • Full fast test suite: 289 tests pass.
  • hvantk ptm --help and hvantk ptm constraint --help render cleanly.
  • End-to-end synthetic run (no Hail): 500 variants × 30 genes × 6 tissues → 5 TSVs, 4 PNGs, 1 HTML report produced.
  • End-to-end run on a real PTM-annotated variant HT + GTEx / AnnData source (pending a larger fixture; not gated on this PR).

Refs: local/planning/2026-04-13-hvantk-ptm-constraint-implementation-plan.md, local/planning/2026-04-13-ptm-constraint-design.md.

🤖 Generated with Claude Code

enriquea and others added 6 commits April 13, 2026 23:48
The file was rewritten for AnnData during scverse migration but kept the
Hail-era filename. The leftover Hail histogram function had no live callers
(only @pytest.mark.hail tests against removed Hail builders) and a broken
``"ad.AnnData"`` annotation that triggered an F821 lint error.

- New hvantk/visualization/expression/anndata.py exposes
  visualize_expression_distribution(adata, ...) operating on adata.X.
- Old hail.py removed; visualization facade and subpackage doc updated.
- Test import switched to the new module.
…ze_expression_ad

Completes the cleanup planned in the scverse-integration plan (Task 10
step 5). The Hail-based functions in matrix_utils.py were superseded by
their _ad equivalents when expression I/O migrated to AnnData; their only
remaining callers were two skipped @pytest.mark.hail test files.

- Remove annotate_column_summary, describe_expression_mt, summarize_expression,
  summarize_matrix, filter_by_metadata, filter_by_gene_list, filter_by_expression,
  get_top_expressed_genes (and the optional hail import).
- Delete tests/test_matrix_utils.py and tests/test_summarize_expression.py.
- summarize_expression_ad: stop mutating adata.obs (was writing _group_label
  into the caller's object) and replace the per-row index.get_loc lookup
  with groupby indices for O(n_cells) splitting.
- gene_sets.py: refresh extract_marker_gene_sets docstring to point at the
  AnnData summarizer and note its long-format output requires pivot or
  scanpy rank_genes_groups for direct use.
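The groupby-indices splitting described above can be sketched as follows (toy data; the column name and shapes are illustrative, not the project's actual code). `groupby(...).indices` yields positional row arrays per group in one pass, avoiding a per-row `index.get_loc` lookup and leaving the caller's `obs` untouched:

```python
import numpy as np
import pandas as pd

# Toy cells-x-genes expression matrix and per-cell metadata
obs = pd.DataFrame({"tissue": ["liver", "brain", "liver", "brain", "liver"]})
X = np.arange(10, dtype=float).reshape(5, 2)

# dict mapping each group label to an ndarray of row positions,
# computed in a single pass over the column
group_idx = obs.groupby("tissue").indices

# O(n_cells) split: slice the matrix once per group
means = {label: X[rows].mean(axis=0) for label, rows in group_idx.items()}
```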
…hail_context

- converters.anndata_to_hail_mt previously built the long-format DataFrame
  with a Python double loop over n_obs * n_var entries (200M dict allocs
  for a modest 10k x 20k matrix). Replace with np.repeat / np.tile /
  X.ravel — single pass, no Python loop.
- The np.bool monkey-patch (Hail 0.2.x compat with NumPy >= 1.24) lived
  as a side-effect at the top of converters.py. Move it into hail_context.py
  so it runs before any Hail import regardless of which module pulls Hail
  in first. converters.py now imports hl via hail_context to guarantee the
  shim is applied.
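The vectorized long-format construction (and the alias shim's guard) can be sketched like this; the variable names are illustrative, not the converter's actual code:

```python
import numpy as np
import pandas as pd

# Hail 0.2.x compat shim: NumPy 1.24 removed the np.bool alias.
# Must run before any Hail import to take effect.
if not hasattr(np, "bool"):
    np.bool = bool

obs_names = np.array(["cell_a", "cell_b"])
var_names = np.array(["g1", "g2", "g3"])
X = np.arange(6, dtype=float).reshape(2, 3)

# One row per (obs, var) entry, built without a Python double loop:
long_df = pd.DataFrame({
    "obs": np.repeat(obs_names, len(var_names)),  # each cell repeated n_var times
    "var": np.tile(var_names, len(obs_names)),    # gene list tiled n_obs times
    "value": X.ravel(),                           # row-major flatten matches repeat/tile order
})
```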
test_dataset_download_returns_intermediate_tsv mocked
urllib.request.urlretrieve, but the production code calls requests.get.
The mock was a no-op, so the test attempted a real network download and
hung pytest -q indefinitely. Mock requests.get with a context-manager
fake response that streams the bundled fixture zip.
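The fake-response pattern the fix describes can be sketched with stdlib pieces only. In the real test the fake is installed with `unittest.mock.patch` on the downloader module's `requests.get` attribute (target path elided here); the class below just shows the context-manager plus streamed-read contract:

```python
import io
import zipfile

def make_fixture_zip() -> bytes:
    # Build a tiny in-memory zip standing in for the bundled fixture
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.tsv", "gene\tvalue\nTP53\t1\n")
    return buf.getvalue()

class FakeResponse:
    status_code = 200

    def __init__(self, payload: bytes):
        self._payload = payload

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def raise_for_status(self):
        return None

    def iter_content(self, chunk_size=8192):
        # Stream the fixture bytes in chunks, like a real streamed response
        for i in range(0, len(self._payload), chunk_size):
            yield self._payload[i : i + chunk_size]

with FakeResponse(make_fixture_zip()) as resp:
    payload = b"".join(resp.iter_content())
names = zipfile.ZipFile(io.BytesIO(payload)).namelist()
```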
- matrix_builders.py, cptac.py, expression_atlas.py: remove inner
  ``import anndata as ad`` repeats inside builder functions; module-level
  import already in scope.
- test_matrix_utils_anndata.py: replace the importlib.util workaround
  (which sidestepped hvantk.utils.__init__ over a non-existent Python
  3.10 syntax issue) with a plain import. Project requires Python >= 3.10
  where PEP 604 unions are valid.
Implements the `hvantk ptm constraint` CLI and `hvantk.ptm.constraint`
module from the 2026-04-13 implementation plan. Compares gnomAD allele-
frequency distributions between PTM-proximal and non-PTM variants,
stratified by tissue, cell type, or any categorical metadata field.

Modules added:
- hvantk/ptm/constraint.py — orchestrator, PTMConstraintConfig/Result,
  five statistical tests (per-group ranking, τ quartile, τ × LOEUF
  factorial, PTM category × group heatmap, within-gene Wilcoxon)
- hvantk/ptm/constraint_expression.py — unified backend adapter for
  Hail MatrixTable / AnnData / tabular expression sources
- hvantk/ptm/constraint_plots.py — four-panel figure renderer
- hvantk/ptm/constraint_report.py — self-contained HTML report
- hvantk/utils/tissue_specificity.py — thin tspex wrapper for τ / TSI
  / Gini / Shannon metrics

Wiring:
- hvantk/commands/ptm_cli.py — `ptm constraint` subcommand
- hvantk/ptm/__init__.py — lazy exports for the new API
- pyproject.toml — `constraint` extra (tspex, matplotlib, seaborn)

Docs:
- docs_site/tools/ptm-constraint.md
- docs_site/guide/usage.md (10-line recipe)
- docs_site/architecture.md (module index)

Tests: 6 smoke tests in hvantk/tests/test_ptm_constraint.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
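For reference, the Yanai τ metric that `tissue_specificity.py` delegates to tspex can be computed directly; a minimal numpy sketch (not the wrapper's code):

```python
import numpy as np

def tau(expression) -> float:
    # Yanai et al. (2005) tissue-specificity index:
    #   tau = sum(1 - x_i / max(x)) / (n - 1)
    # 0 = uniformly expressed, 1 = expressed in a single tissue
    x = np.asarray(expression, dtype=float)
    if x.max() == 0:
        return 0.0  # gene not expressed anywhere
    x_hat = x / x.max()
    return float((1.0 - x_hat).sum() / (len(x) - 1))
```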
@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f23d0862-d1b5-4a31-a1b8-413d71859eb7


@codacy-production

codacy-production bot commented Apr 13, 2026

Not up to standards ⛔

🔴 Issues: 24 high · 19 medium · 31 minor

Alerts:
⚠ 74 new issues (quality gate allows ≤ 0 issues of at least minor severity)

Results:
74 new issues

Category results:
  • UnusedCode: 2 medium
  • Documentation: 30 minor
  • ErrorProne: 5 high
  • Security: 3 medium · 19 high
  • CodeStyle: 1 minor
  • Complexity: 14 medium

View in Codacy

🟢 Metrics: 186 complexity · 5 duplication

Metric results:
  • Complexity: 186
  • Duplication: 5

View in Codacy


…ield

Code-review follow-ups to the initial constraint implementation:

- _anndata_value_column unconditionally returned "mean", so
  --expression-metric median and median_nonzero were silent no-ops on
  the AnnData backend. Replaced the value-column helper with a direct
  numpy aggregation (_aggregate_anndata_direct) that runs per-group
  np.median / np.nanmedian (for nonzero) over adata.X when the user
  requests a median metric.

- _load_variants silently skipped label filtering when the configured
  label_field was absent from the variants HT, leading to results that
  claim to be filtered to TN/TP but actually include every label. Now
  raises ValueError with a clear remediation hint unless the user
  explicitly passes --label-filter all.

- pyproject.toml + requirements.txt: added tspex so CI (which installs
  from requirements.txt) can import the wrapper and the constraint
  smoke tests do not fail with ImportError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
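The direct aggregation described in the first bullet amounts to a per-group (nan)median over the expression matrix. An illustrative sketch (function name and signature hypothetical, mirroring what `_aggregate_anndata_direct` is described as doing):

```python
import numpy as np

def aggregate_median(X, group_labels, nonzero=False):
    # Per-group median over a dense cells-x-genes matrix.
    # nonzero=True masks zeros as NaN so nanmedian ignores them
    # (the "median_nonzero" metric).
    X = np.asarray(X, dtype=float)
    labels = np.asarray(group_labels)
    out = {}
    for label in np.unique(labels):
        block = X[labels == label]
        if nonzero:
            block = np.where(block == 0, np.nan, block)
            out[label] = np.nanmedian(block, axis=0)
        else:
            out[label] = np.median(block, axis=0)
    return out

X = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])
groups = np.array(["a", "a", "b"])
medians = aggregate_median(X, groups)
medians_nz = aggregate_median(X, groups, nonzero=True)
```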
@enriquea
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance against the constraint feature files (hvantk/ptm/constraint*.py, hvantk/utils/tissue_specificity.py, hvantk/commands/ptm_cli.py, hvantk/ptm/__init__.py, pyproject.toml, added docs).

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

The Conda CI workflow has been failing since the AnnData migration
landed on dev because environment.yml never gained anndata/scanpy. The
constraint work also needs tspex; add it alongside so `python-package-conda.yml` can import every test module.

Added: anndata, scanpy, pysam, PyYAML, psutil, scikit-learn, tspex.
@enriquea enriquea merged commit e18b4ae into dev Apr 14, 2026
2 of 3 checks passed
@enriquea enriquea deleted the ptm-constraint branch April 14, 2026 07:54