
Add new PTM (Post-Translational Modification) analysis module#77

Merged
enriquea merged 23 commits into main from dev
Mar 25, 2026
Conversation

Collaborator

@enriquea enriquea commented Mar 24, 2026

This pull request introduces comprehensive documentation and CLI support for the new PTM (Post-Translational Modification) analysis module in the hvantk toolkit. The changes include updates to the architecture overview, usage guides, data source documentation, and a detailed PTM tool reference. The PTM workflow is now fully documented with example commands, Python API usage, and integration with existing variant analysis pipelines.

PTM Module Documentation and CLI Integration

Documentation and Architecture:

  • Added a new section for the ptm/ module in the architecture overview, detailing its structure and purpose for PTM variant analysis. [1] [2]
  • Introduced ptm_cli.py in the CLI commands list and described associated test files. [1] [2]
  • Added a dedicated PTM example in the workflow index, including a quick start guide and links to further documentation. [1] [2]

Tool and Usage Documentation:

  • Created a comprehensive docs_site/tools/ptm.md reference describing the PTM pipeline, command-line interface, Python API, workflow phases, annotation fields, output files, and module structure.
  • Added a detailed example walkthrough (docs_site/examples/ptm.md) covering all steps from data preparation to report generation, with expected outputs and code snippets.
  • Updated the usage guide to include PTM site table creation and clarified automatic vs. manual data download options.
  • Documented UniProt PTM and Ensembl GTF data sources, including download instructions and integration with the PTM build command.

CLI and Core Functionality Enhancements

Command-line Integration:

  • Registered the UniProt PTM downloader in the download_cli command group, enabling automated acquisition of PTM data. [1] [2]
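
For readers unfamiliar with the pattern, the wiring amounts to registering a named download command in a group. A minimal plain-Python sketch of such a registry — illustrative only; the real CLI is built on command groups, and none of these names (`DOWNLOADERS`, `register_downloader`, `run_download`) are hvantk's actual API:

```python
# Hypothetical sketch of a downloader registry; the real hvantk CLI likely
# uses Click command groups, and these names are illustrative, not its API.
from typing import Callable, Dict

DOWNLOADERS: Dict[str, Callable[[], str]] = {}

def register_downloader(name: str):
    """Decorator that registers a download command under `name`."""
    def wrap(fn: Callable[[], str]) -> Callable[[], str]:
        DOWNLOADERS[name] = fn
        return fn
    return wrap

@register_downloader("uniprot-ptm")
def download_uniprot_ptm() -> str:
    # The real module pages through the UniProt REST API and writes a
    # normalized TSV; here we just return a placeholder path.
    return "data/uniprot_ptm.tsv"

def run_download(name: str) -> str:
    """Dispatch to a registered downloader by name."""
    if name not in DOWNLOADERS:
        raise SystemExit(f"unknown downloader: {name}")
    return DOWNLOADERS[name]()
```

The decorator-based registration keeps the download CLI open to new data sources without editing a central dispatch table.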

Code Consistency:

  • Updated ClinVar annotation logic to use centralized pathogenic/benign label sets from hvantk.core.constants for consistency across modules.
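
Centralizing the label sets means every module classifies ClinVar significance strings identically. A minimal sketch of the idea — the helper and the example label values here are illustrative assumptions; the actual lists live in hvantk.core.constants and may differ:

```python
# Illustrative sketch of centralized ClinVar label sets; the real lists live
# in hvantk.core.constants and these example values are assumptions.
CLINVAR_PATHOGENIC_LABELS = frozenset({
    "Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic",
})
CLINVAR_BENIGN_LABELS = frozenset({
    "Benign", "Likely_benign", "Benign/Likely_benign",
})

def classify_clinvar(clnsig: list) -> str:
    """Classify a variant from its CLNSIG values using the shared sets."""
    if any(s in CLINVAR_PATHOGENIC_LABELS for s in clnsig):
        return "pathogenic"
    if any(s in CLINVAR_BENIGN_LABELS for s in clnsig):
        return "benign"
    return "uncertain"
```

With a single source of truth, annotation, PSROC, and PTM analyses cannot drift apart on which labels count as pathogenic or benign.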

Summary by CodeRabbit

  • New Features

    • Full PTM analysis workflow added: new top-level hvantk ptm commands (build, annotate, landscape, export-strata, population, report), PTM table builder, plotting utilities, and HTML report generation.
    • UniProt PTM downloader and a new mktable ptm-sites command; CLI root now exposes the ptm group.
  • Documentation

    • Extensive PTM docs, examples, quick-start CLI commands, and data-source guidance.
  • Tests

    • New end-to-end PTM test suite.

enriquea added 19 commits March 24, 2026 00:01
@enriquea enriquea requested a review from Copilot March 24, 2026 22:06
@enriquea enriquea self-assigned this Mar 24, 2026
@enriquea enriquea added documentation Improvements or additions to documentation enhancement New feature or request labels Mar 24, 2026
Contributor

coderabbitai bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

Adds a new PTM domain: UniProt PTM download, Ensembl GTF parsing, protein→genome mapping, PTM site Hail table builder, variant annotation, landscape/population analyses, plotting, HTML report generation, CLI entrypoints, utilities (sanitization/renaming), and tests; plus documentation updates.
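
The build pipeline is driven by a config dataclass with lightweight validation, as the test snippets later in this review suggest. A rough sketch of that shape — the field names `output_dir`/`output_ht` and the validate-returns-errors contract come from the test snippets; everything else is an assumption:

```python
# Sketch of the PTM build-pipeline config. Only output_dir/output_ht and the
# validate() contract are taken from this PR's test snippets; the rest is
# assumed for illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class PTMBuildConfig:
    output_dir: str = ""
    output_ht: str = ""

    def validate(self) -> List[str]:
        """Return human-readable errors; an empty list means valid."""
        errors = []
        if not self.output_dir:
            errors.append("output_dir is required")
        if not self.output_ht:
            errors.append("output_ht is required")
        return errors
```

Returning a list of errors (rather than raising on the first problem) lets the CLI report every missing field in one pass.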

Changes

Cohort / File(s) Summary
Documentation
docs_site/architecture.md, docs_site/examples/index.md, docs_site/examples/ptm.md, docs_site/guide/data-sources.md, docs_site/guide/usage.md, docs_site/tools/ptm.md
Added PTM architecture, examples, data-source notes, CLI quickstarts, Python API examples, and tool reference pages.
PTM package & constants
hvantk/ptm/__init__.py, hvantk/ptm/constants.py
New public ptm package surface and configuration constants (UniProt/Ensembl params, PTM categories, output columns, defaults).
Mapping & pipeline orchestration
hvantk/ptm/mapper.py, hvantk/ptm/pipeline.py
GTF parser, codon-level residue→genome mapping, transcript resolution, PTM build pipeline (downloads, mapping, mapped TSV → Hail HT) with config/result dataclasses.
Annotation & analysis
hvantk/ptm/annotate.py, hvantk/ptm/analysis.py
Variant annotation producing is_ptm/is_proximal/distance, export strata, ptm_landscape (ClinVar enrichment) and ptm_population (AF summaries) result types.
Plotting & reporting
hvantk/ptm/plot.py, hvantk/ptm/report.py
Matplotlib plot helpers (landscape, overlap, distance, AF), figure encoding, and HTML report generation with embedded or referenced images.
Datasets & downloaders
hvantk/datasets/uniprot_ptm_datasets.py, hvantk/commands/uniprot_ptm_downloader.py, hvantk/commands/download_cli.py
New UniProt PTM dataset module (pagination, TSV output), CLI downloader uniprot-ptm, and registration in download CLI.
Table builders & CLI wiring
hvantk/tables/table_builders.py, hvantk/tables/registry.py, hvantk/commands/make_table_cli.py, hvantk/hvantk.py
Added create_ptm_sites_tb builder, registered "ptm-sites" in registry, new mktable ptm-sites CLI command, and registered ptm command group in root CLI.
Utilities & sanitization
hvantk/data/file_utils.py, hvantk/datasets/gencc_datasets.py, hvantk/utils/table_utils.py
Added sanitize_tsv to clean TSVs and applied to GenCC downloads; added field-name normalization and str_to_bool for table imports and rename-map building.
Annotation refactor & label centralization
hvantk/annotation/annotate.py, hvantk/core/constants.py, hvantk/psroc/pipeline.py
Centralized ClinVar pathogenic/benign label lists in core.constants; updated annotation and psroc pipeline to import shared constants.
Testing
hvantk/tests/test_ptm.py
New tests for GTF mapping, transcript resolution, map_ptm_sites integration, CLI help wiring, and PTM build config validation.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as "hvantk ptm"
    participant Pipeline as "ptm_build_pipeline"
    participant UniProt as "UniProtPTMDataset"
    participant GTF as "parse_ensembl_gtf"
    participant Mapper as "map_ptm_sites"
    participant Table as "create_ptm_sites_tb"
    participant Annotate as "annotate_variants_with_ptm"
    participant Analysis as "ptm_landscape/ptm_population"
    participant Plot as "plot_*"
    participant Report as "generate_report"

    User->>CLI: ptm build (config)
    CLI->>Pipeline: ptm_build_pipeline(config)
    Pipeline->>UniProt: download_uniprot_ptm() (if needed)
    Pipeline->>GTF: download_ensembl_gtf()/parse_ensembl_gtf(gtf_path)
    GTF-->>Pipeline: GTFData
    Pipeline->>Mapper: map_ptm_sites(ptm_tsv, gtf_data)
    Mapper-->>Pipeline: mapped_tsv, PTMBuildResult
    Pipeline->>Table: create_ptm_sites_tb(mapped_tsv -> HT)
    Table-->>CLI: output HT created

    User->>CLI: ptm annotate
    CLI->>Annotate: annotate_variants_with_ptm(variants_ht, ptm_ht)
    Annotate-->>CLI: annotated HT

    User->>CLI: ptm landscape / population
    CLI->>Analysis: ptm_landscape(...) or ptm_population(...)
    Analysis-->>CLI: Result objects

    CLI->>Plot: plot_* (result)
    Plot-->>CLI: PNGs

    User->>CLI: ptm report
    CLI->>Report: generate_report(landscape_result, population_result)
    Report-->>CLI: HTML report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

Review effort 4/5

Suggested reviewers

  • ypriverol

Poem

🐇 I hopped through UniProt rows with delight,

mapped residues to codons by moonlight,
tables, plots, and a report in a sweep—
now PTM sites dance where variants sleep.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.22%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changeset: introducing a new PTM (Post-Translational Modification) analysis module. This accurately reflects the comprehensive additions of documentation, CLI integration, pipeline code, and API utilities across the changeset. |


Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new PTM (Post-Translational Modification) analysis module to hvantk, including a build pipeline (UniProt PTM + Ensembl GTF → mapped TSV → Hail Table), variant annotation utilities, downstream analyses (landscape/population), plotting/reporting, plus CLI and documentation to integrate it into existing workflows.

Changes:

  • Introduces hvantk.ptm (mapping, annotation, analyses, plotting, HTML reporting) and wires it into the main CLI (hvantk ptm ...).
  • Adds UniProt PTM dataset download support (hvantk download uniprot-ptm) and a PTM sites table builder registered in the table registry / mktable.
  • Improves consistency/utilities: centralized ClinVar label constants, flexible header renaming + string→bool coercion, and a TSV sanitizer used for GenCC downloads.

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
hvantk/utils/table_utils.py Adds flexible header rename mapping and a shared string→bool coercion helper.
hvantk/tests/test_ptm.py Adds unit/integration-style tests for PTM mapping, resolution, pipeline round-trip, and CLI help.
hvantk/tables/table_builders.py Uses new rename/bool helpers for COSMIC CGC; adds create_ptm_sites_tb to build a PTM sites HT.
hvantk/tables/registry.py Registers ptm-sites table adapter.
hvantk/ptm/report.py Adds static HTML report generator combining PTM analysis results and plots.
hvantk/ptm/plot.py Adds plotting utilities for PTM landscape/population analyses and figure encoding.
hvantk/ptm/pipeline.py Adds PTM build pipeline orchestration, download helpers, and mapping TSV writer.
hvantk/ptm/mapper.py Adds local protein→genome coordinate mapper and Ensembl GTF parser.
hvantk/ptm/constants.py Adds PTM module constants (URLs, categories, output columns).
hvantk/ptm/annotate.py Adds Hail-based variant annotation against PTM sites/proximity.
hvantk/ptm/analysis.py Adds Q1/Q3 analyses and export of PTM/non-PTM strata; uses centralized ClinVar labels.
hvantk/ptm/__init__.py Exposes PTM public API.
hvantk/psroc/pipeline.py Switches PSROC ClinVar label constants to centralized core.constants (backward compatible aliases).
hvantk/hvantk.py Registers the new ptm CLI group with the main hvantk CLI.
hvantk/datasets/uniprot_ptm_datasets.py Implements UniProt PTM REST API download to a normalized TSV.
hvantk/datasets/gencc_datasets.py Sanitizes GenCC TSV post-download for Hail import compatibility.
hvantk/data/file_utils.py Adds sanitize_tsv() helper for cleaning TSVs prior to Hail import.
hvantk/core/constants.py Adds centralized ClinVar pathogenic/benign label lists.
hvantk/commands/uniprot_ptm_downloader.py Adds CLI command to download UniProt PTM data.
hvantk/commands/ptm_cli.py Adds hvantk ptm command group and subcommands for build/annotate/analyses/report.
hvantk/commands/make_table_cli.py Adds hvantk mktable ptm-sites builder command.
hvantk/commands/download_cli.py Registers hvantk download uniprot-ptm.
hvantk/annotation/annotate.py Updates ClinVar label logic to use centralized constants.
docs_site/tools/ptm.md Adds PTM tool reference documentation.
docs_site/guide/usage.md Documents PTM build usage in the usage guide.
docs_site/guide/data-sources.md Documents UniProt PTM + Ensembl GTF as data sources.
docs_site/examples/ptm.md Adds an end-to-end PTM example walkthrough.
docs_site/examples/index.md Links PTM example from the examples index.
docs_site/architecture.md Updates architecture overview to include the new ptm/ module and tests.


Comment thread hvantk/data/file_utils.py Outdated
Comment on lines +112 to +118
```python
for row in rows:
    # Flatten any embedded newlines within cells
    cleaned = [
        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
        for cell in row
    ]
    fh.write(delimiter.join(cleaned) + "\n")
```

Copilot AI Mar 24, 2026


When rewriting the file, sanitize_tsv writes delimiter.join(cleaned) without CSV quoting/escaping and does not validate that each row has the same column count as the header. If any cell contains the delimiter/quotes, this can corrupt the TSV structure for downstream parsing. Consider using csv.writer with the same delimiter (and optionally enforcing/repairing a fixed column count) so the output remains a valid delimited file.

Suggested change

```diff
-for row in rows:
-    # Flatten any embedded newlines within cells
-    cleaned = [
-        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
-        for cell in row
-    ]
-    fh.write(delimiter.join(cleaned) + "\n")
+writer = csv.writer(
+    fh,
+    delimiter=delimiter,
+    quotechar='"',
+    lineterminator="\n",
+)
+for row_index, row in enumerate(rows):
+    # Ensure consistent column count with header
+    if len(row) < n_cols:
+        logger.warning(
+            "sanitize_tsv: row %d has %d columns, expected %d; padding with empty fields",
+            row_index, len(row), n_cols,
+        )
+        row = row + [""] * (n_cols - len(row))
+    elif len(row) > n_cols:
+        logger.warning(
+            "sanitize_tsv: row %d has %d columns, expected %d; truncating extra columns",
+            row_index, len(row), n_cols,
+        )
+        row = row[:n_cols]
+    # Flatten any embedded newlines within cells and remove delimiters
+    cleaned = [
+        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
+        for cell in row
+    ]
+    writer.writerow(cleaned)
```

Comment on lines +59 to +76
```python
target_names = set(field_map.values())
rename = {}
for canonical_raw, target in field_map.items():
    norm = _normalize_field_name(canonical_raw)
    actual = norm_to_actual.get(norm)
    if actual is not None and actual != target:
        # Skip if actual field name already equals the target or
        # if the target name is already taken by another field
        if actual not in target_names:
            rename[actual] = target
    # Also try matching by the target name's normalized form
    # (handles cases where actual headers already use the target naming)
    if actual is None:
        norm_target = _normalize_field_name(target)
        actual_by_target = norm_to_actual.get(norm_target)
        if actual_by_target is not None and actual_by_target != target:
            if actual_by_target not in target_names:
                rename[actual_by_target] = target
```

Copilot AI Mar 24, 2026


build_rename_map() attempts to avoid rename collisions, but the current check (if actual not in target_names) does not prevent renaming a field to a target name that already exists in actual_fields (e.g., both "gene_symbol" and "Gene Symbol" present). hl.Table.rename() will fail with duplicate fields in that case. Consider skipping a rename when target in actual_fields (or when target is already produced by another rename) rather than checking actual against target_names.

Suggested change

```diff
-target_names = set(field_map.values())
-rename = {}
-for canonical_raw, target in field_map.items():
-    norm = _normalize_field_name(canonical_raw)
-    actual = norm_to_actual.get(norm)
-    if actual is not None and actual != target:
-        # Skip if actual field name already equals the target or
-        # if the target name is already taken by another field
-        if actual not in target_names:
-            rename[actual] = target
-    # Also try matching by the target name's normalized form
-    # (handles cases where actual headers already use the target naming)
-    if actual is None:
-        norm_target = _normalize_field_name(target)
-        actual_by_target = norm_to_actual.get(norm_target)
-        if actual_by_target is not None and actual_by_target != target:
-            if actual_by_target not in target_names:
-                rename[actual_by_target] = target
+# Track all names that are or will be used as targets to avoid collisions
+used_targets = set(actual_fields)
+rename = {}
+for canonical_raw, target in field_map.items():
+    norm = _normalize_field_name(canonical_raw)
+    actual = norm_to_actual.get(norm)
+    if actual is not None and actual != target:
+        # Skip if actual field name already equals the target or
+        # if the target name is already taken by an existing or renamed field
+        if target not in used_targets:
+            rename[actual] = target
+            used_targets.add(target)
+    # Also try matching by the target name's normalized form
+    # (handles cases where actual headers already use the target naming)
+    if actual is None:
+        norm_target = _normalize_field_name(target)
+        actual_by_target = norm_to_actual.get(norm_target)
+        if actual_by_target is not None and actual_by_target != target:
+            if target not in used_targets:
+                rename[actual_by_target] = target
+                used_targets.add(target)
```

Comment thread hvantk/commands/ptm_cli.py Outdated
- build: Download PTM data, map coordinates, build Hail Table
- annotate: Annotate a variant table with PTM site information
- landscape: PTM-variant overlap analysis (Q1)
- evaluate: Predictor performance at PTM sites (Q2)

Copilot AI Mar 24, 2026


Module docstring lists an evaluate subcommand, but this CLI defines export-strata for Q2 and does not implement evaluate. Please update the docstring to reflect the actual subcommands to avoid confusing users.

Suggested change

```diff
-    - evaluate: Predictor performance at PTM sites (Q2)
+    - export-strata: Export PTM/non-PTM variant strata for predictor evaluation (Q2)
```

Comment thread hvantk/commands/ptm_cli.py Outdated
Comment on lines +233 to +239
```python
from hvantk.ptm.plot import (
    plot_landscape_summary,
    plot_overlap_by_category,
    plot_distance_distribution,
)
import matplotlib
matplotlib.use("Agg")
```

Copilot AI Mar 24, 2026


matplotlib.use("Agg") is called after importing hvantk.ptm.plot, but that module imports matplotlib.pyplot at import time. In headless environments this can fail or select an interactive backend before Agg is set. To reliably force a non-interactive backend, call matplotlib.use("Agg") before importing any module that imports pyplot (i.e., move the import matplotlib; matplotlib.use("Agg") above the from hvantk.ptm.plot import ...).

Suggested change

```diff
-from hvantk.ptm.plot import (
-    plot_landscape_summary,
-    plot_overlap_by_category,
-    plot_distance_distribution,
-)
-import matplotlib
-matplotlib.use("Agg")
+import matplotlib
+matplotlib.use("Agg")
+from hvantk.ptm.plot import (
+    plot_landscape_summary,
+    plot_overlap_by_category,
+    plot_distance_distribution,
+)
```

Comment thread hvantk/commands/ptm_cli.py Outdated
Comment on lines +333 to +335
```python
from hvantk.ptm.plot import plot_population_af
import matplotlib
matplotlib.use("Agg")
```

Copilot AI Mar 24, 2026


Same backend-order issue here: matplotlib.use("Agg") is invoked after importing plot_population_af from hvantk.ptm.plot (which imports pyplot at module import). Move the backend selection before importing plot functions to avoid failures on systems without a display.

Suggested change

```diff
-from hvantk.ptm.plot import plot_population_af
-import matplotlib
-matplotlib.use("Agg")
+import matplotlib
+matplotlib.use("Agg")
+from hvantk.ptm.plot import plot_population_af
```

Comment thread hvantk/annotation/annotate.py Outdated
Comment on lines 35 to 42
```diff
 # Pre-calculate conditions for better readability and performance
 is_pathogenic = t.clinvar_clnsig.any(
-    lambda x: hl.set(pathogenic_label_clinvar).contains(x)
+    lambda x: hl.set(CLINVAR_PATHOGENIC_LABELS).contains(x)
 )
-is_benign = t.clinvar_clnsig.any(lambda x: hl.set(benign_label_clinvar).contains(x))
+is_benign = t.clinvar_clnsig.any(lambda x: hl.set(CLINVAR_BENIGN_LABELS).contains(x))

 t = t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG)
 t = t.annotate(
```

Copilot AI Mar 24, 2026


annotate_clinvar_clnsig computes is_pathogenic / is_benign from t.clinvar_clnsig before clinvar_clnsig is annotated from clinvar_ht. For inputs that don’t already have a clinvar_clnsig field (e.g., commands/annotate_features.py reads a variant HT then calls this), this will raise a field-not-found error and/or classify based on stale data. Compute the conditions from the joined ClinVar CLNSIG expression (or move the t = t.annotate(clinvar_clnsig=...) line before the .any(...) checks).

Comment thread hvantk/tables/table_builders.py Outdated
Comment on lines +1404 to +1406
The table is keyed by a genomic interval covering the PTM codon,
enabling interval-based joins with variant tables.


Copilot AI Mar 24, 2026


Docstring says the PTM sites table is "keyed by a genomic interval covering the PTM codon", but the implementation keys by a single locus (codon_start) and keeps flanking_interval as a non-key field. Please update the docstring to match the actual keying strategy (or change the key to an interval if that’s what downstream joins expect).

Comment thread hvantk/tables/table_builders.py Outdated
hl.Table
The checkpointed Hail Table keyed by locus interval.
"""
from hvantk.ptm.constants import PTM_TYPE_CATEGORIES

Copilot AI Mar 24, 2026


PTM_TYPE_CATEGORIES is imported but never used in this function. Dropping the unused import will avoid lint noise and prevents unnecessary imports when building tables.

Suggested change

```diff
-    from hvantk.ptm.constants import PTM_TYPE_CATEGORIES
```

Comment thread hvantk/ptm/mapper.py
Comment on lines +152 to +158
```python
codon_pos = positions[idx:idx + 3]
return CodonMapping(
    chrom=chrom,
    codon_start=min(codon_pos),
    codon_end=max(codon_pos),
    strand=strand,
)
```

Copilot AI Mar 24, 2026


map_residue_to_genomic returns codon_start=min(codon_pos) and codon_end=max(codon_pos). For split codons spanning an exon boundary, those three genomic positions are not contiguous; using min/max creates an interval that includes the intron and later code works around this by truncating the end. This can cause missed annotations for variants that hit the codon base(s) in the other exon. Consider returning/storing the exact 3 genomic positions (e.g., codon_positions: tuple[int,int,int]) or separate locus fields so downstream annotation can join on the true codon bases.
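
The concern is concrete for codons split across an exon junction: the three coding bases are not contiguous in genomic coordinates, so a min/max interval silently spans the intron. A small self-contained sketch of the suggested alternative, carrying the exact positions — `codon_positions` is the review's suggestion, while the `covers` helper and example coordinates are hypothetical:

```python
# Sketch of the reviewer's suggestion: store the exact genomic base positions
# of the codon instead of a min/max interval. The covers() helper and the
# example coordinates are illustrative assumptions, not hvantk code.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CodonMapping:
    chrom: str
    codon_positions: Tuple[int, int, int]
    strand: str

    def covers(self, pos: int) -> bool:
        # Exact membership test: true only for the three coding bases,
        # never for intronic positions between them.
        return pos in self.codon_positions

# Codon split across an exon boundary: two bases end one exon at
# 1000-1001, the third starts the next exon at 2000.
split = CodonMapping("chr1", (1000, 1001, 2000), "+")
```

A min/max interval for this codon would span 1000-2000 and incorrectly include every intronic position in between, which is exactly the failure mode the review describes.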

Comment thread hvantk/ptm/report.py Outdated
plt.close(fig)
return f"<img src='{src}' alt='{output_path.stem}' class='embedded-image'/>"
except Exception:
logger.warning("Failed to generate plot %s, skipping.", output_path.stem)

Copilot AI Mar 24, 2026


The exception handler in _render_plot logs only a generic warning, which makes debugging plot failures difficult. Logging the exception details (e.g., logger.exception(..., exc_info=True)) would preserve the stack trace while still skipping the plot.

Suggested change

```diff
-        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
+        logger.exception("Failed to generate plot %s, skipping.", output_path.stem)
```

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (12)
hvantk/tests/test_ptm.py (1)

218-221: Use tmp_path fixture instead of hardcoded /tmp paths.

The test uses hardcoded /tmp/test paths which are flagged as potentially insecure. Using pytest's tmp_path fixture is more portable and secure.

♻️ Proposed fix
-    config2 = PTMBuildConfig(
-        output_dir="/tmp/test", output_ht="/tmp/test/out.ht"
-    )
+def test_ptm_build_config_validation(tmp_path):
+    """PTMBuildConfig.validate catches missing required fields."""
+    config = PTMBuildConfig()
+    errors = config.validate()
+    assert len(errors) == 2
+    assert any("output_dir" in e for e in errors)
+    assert any("output_ht" in e for e in errors)
+
+    config2 = PTMBuildConfig(
+        output_dir=str(tmp_path), output_ht=str(tmp_path / "out.ht")
+    )
     assert config2.validate() == []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm.py` around lines 218 - 221, The test uses hardcoded
paths; update the test to accept pytest's tmp_path fixture and construct
PTMBuildConfig using tmp_path (e.g., tmp_path / "test" for output_dir and
output_ht) instead of "/tmp/test" and "/tmp/test/out.ht", then call
config2.validate() as before; ensure the test function signature includes
tmp_path so tmp directories are used and removed automatically.
hvantk/datasets/uniprot_ptm_datasets.py (1)

226-231: Use raise ... from None or from err for exception chaining.

The re-raised ValueError should use proper exception chaining to clarify the error origin.

♻️ Proposed fix
         try:
             datetime.strptime(version_date, "%Y-%m-%d")
         except ValueError:
             raise ValueError(
                 f"Invalid version_date format: {version_date}. Expected YYYY-MM-DD"
-            )
+            ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 226 - 231, The except
block that re-raises a ValueError for invalid version_date should preserve
exception chaining; change "except ValueError:" to "except ValueError as err:"
and re-raise with "raise ValueError(f'Invalid version_date format:
{version_date}. Expected YYYY-MM-DD') from err" so the original parsing error is
chained (refer to the try/except around datetime.strptime and the variable
version_date).
hvantk/ptm/report.py (1)

258-260: Log the exception details for debugging.

The broad except Exception is acceptable here to prevent plot failures from crashing report generation. However, logging only the plot name loses valuable diagnostic information.

♻️ Proposed fix to include exception details
     except Exception:
-        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
+        logger.warning(
+            "Failed to generate plot %s, skipping.", output_path.stem, exc_info=True
+        )
         return ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/report.py` around lines 258 - 260, The except block that currently
catches Exception and calls logger.warning("Failed to generate plot %s,
skipping.", output_path.stem) should be changed to log the actual exception
details; either capture the exception (except Exception as e) and include it in
the log message or call logger.exception/ logger.warning(..., exc_info=True) so
the stack trace is recorded while still returning ""; update the except block
surrounding the plot generation (the block using output_path.stem and returning
"") to emit the exception information for debugging.
docs_site/examples/ptm.md (1)

43-49: Add language specifier to fenced code blocks for output examples.

The output examples lack language specifiers. Use text or console to satisfy linting and improve syntax highlighting consistency.

📝 Proposed fix
 **Expected output:**
-```
+```text
 Mapping complete: 55422/56120 mapped (98.7%), 698 failed
 Resolution: {'ensembl_canonical': 48231, 'longest_cds': 7191}
 Mapped TSV: data/ptm/ptm_mapped.tsv
 Hail Table: data/ptm/ptm_sites.ht

Apply the same fix to the output blocks at lines 63, 83, and 150.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs_site/examples/ptm.md around lines 43 - 49, Add a language specifier
(text or console) to the fenced code blocks that show output examples —
specifically update the block immediately following the "Expected output:"
header (the block containing "Mapping complete: 55422/56120 mapped...") and
apply the same change to the other output blocks referenced (the ones at the
locations corresponding to lines 63, 83, and 150) so each opening
triple-backtick becomes ```text; keep the block contents unchanged.



hvantk/ptm/analysis.py (3)

42-44: Remove extraneous f-string prefix.

Line 44 uses an f-string but contains no placeholders.

📝 Proposed fix

```diff
     def summary(self) -> str:
         lines = [
-            f"PTM-Variant Landscape:",
+            "PTM-Variant Landscape:",
             f"  Variants: {self.n_variants:,} total, "
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 42 - 44, The f-string prefix in the
summary method is unnecessary because there are no placeholders; update the
summary function (def summary(self) -> str:) to use a normal string literal for
the line "PTM-Variant Landscape:" instead of an f-string (locate the list
assigned to the variable lines inside summary) so remove the leading f from that
string.
```


---

77-79: Remove extraneous f-string prefix.

Line 79 uses an f-string but contains no placeholders.

📝 Proposed fix

```diff
     def summary(self) -> str:
         lines = [
-            f"PTM Population Analysis:",
+            "PTM Population Analysis:",
             f"  Total variants: {self.n_variants:,}",
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 77 - 79, The string in the summary()
method is using an unnecessary f-string prefix for a literal with no
placeholders; update the list entry in PTM Population Analysis within the
summary() function to use a normal string literal (remove the leading f) so the
line reads a plain quoted string instead of an f-string.
```


---

369-369: Consider validating af_field existence before use.

The dynamic field access `annotated[af_field]` will fail at query execution time if the field doesn't exist. An early validation could provide a clearer error message.

<details>
<summary>📝 Optional validation</summary>

```diff
+    if af_field not in annotated.row:
+        raise ValueError(f"AF field '{af_field}' not found in gnomAD table. "
+                        f"Available fields: {list(annotated.row)}")
     af = annotated[af_field]
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` at line 369, Before indexing annotated with af_field,
validate that the key/property exists to avoid a runtime KeyError; in the
function where af = annotated[af_field] is used, check presence (e.g., "if
af_field not in annotated" or use hasattr for objects) and raise a clear
ValueError/TypeError describing the missing field and listing available keys (or
provide a sensible default/fallback) so callers get a helpful error instead of a
cryptic crash.
```

</details>
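The same guard pattern can be sketched outside Hail, with a plain dict standing in for the annotated table row (field names here are illustrative, not the module's actual schema):

```python
def get_af(annotated_fields: dict, af_field: str) -> float:
    # Validate up front so callers get a clear error naming the missing field,
    # instead of a KeyError deep inside query execution.
    if af_field not in annotated_fields:
        raise ValueError(
            f"AF field '{af_field}' not found in gnomAD table. "
            f"Available fields: {list(annotated_fields)}"
        )
    return annotated_fields[af_field]
```

The error message lists the available fields, which is usually enough for a user to spot a typo in a `--af-field` flag without reading the source.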

</blockquote></details>
<details>
<summary>hvantk/ptm/plot.py (2)</summary><blockquote>

`43-50`: **Consider renaming `format` parameter to avoid shadowing builtin.**

While functional, shadowing the `format` builtin is flagged by linters. Common alternatives: `fmt`, `output_format`, or `file_format`.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 43 - 50, The parameter name format in
function plot_landscape_summary shadows the builtin; rename it to a
non-conflicting identifier such as output_format (or fmt/file_format) in the
function signature and update every internal reference within
plot_landscape_summary (and its type hint/default) to use output_format, plus
update all call sites, docstrings, and tests that pass format to use the new
name so behavior is preserved.
```

</details>
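A minimal illustration of the rename (hypothetical function, not the actual `plot_landscape_summary` signature): with the parameter named `output_format`, the builtin `format()` remains usable inside the body.

```python
def export_plot(basename: str, output_format: str = "png") -> str:
    # Because the parameter is not named `format`, the builtin stays accessible:
    stamp = format(42, "04d")  # zero-padded counter, "0042"
    return f"{basename}_{stamp}.{output_format}"
```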

---

`253-258`: **Add `strict=True` to `zip()` for defensive programming.**

While `bars` and `n_values` are guaranteed to have the same length here, adding `strict=True` catches any future refactoring errors that might cause misalignment.

<details>
<summary>📝 Proposed fix</summary>

```diff
         # Annotate counts on bars
-        for bar, n in zip(bars, n_values):
+        for bar, n in zip(bars, n_values, strict=True):
             ax.text(
                 bar.get_x() + bar.get_width() / 2, bar.get_height(),
                 f"n={n:,}", ha="center", va="bottom", fontsize=9,
             )
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 253 - 258, The loop annotating bar counts
should use zip(..., strict=True) to guard against mismatched lengths; update the
zip call that pairs bars and n_values in hvantk/ptm/plot.py so it reads
zip(bars, n_values, strict=True) and leave the rest of the ax.text(...) logic
unchanged to ensure a ValueError is raised if future refactors cause
misalignment.
```

</details>

</blockquote></details>
<details>
<summary>hvantk/commands/ptm_cli.py (3)</summary><blockquote>

`119-123`: **Add context to validation error output.**

The validation errors are printed without a preceding message indicating what failed.

<details>
<summary>📝 Proposed fix</summary>

```diff
         errors = config.validate()
         if errors:
+            click.echo("Configuration validation failed:", err=True)
             for error in errors:
                 click.echo(f"  - {error}", err=True)
             ctx.exit(1)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` around lines 119 - 123, When config.validate()
returns errors, prepend a clear context message before listing them: call
click.echo with a message like "Configuration validation failed:" (using
err=True) just prior to the for-loop that echoes each error, then proceed to
loop over errors and call ctx.exit(1); update the block that references errors,
config.validate(), click.echo, and ctx.exit to include that preceding
explanatory line so the output clearly indicates what failed.
```

</details>

---

`181-187`: **Consider aggregating counts in a single pass.**

Three separate `count()` calls trigger three table scans. For large variant tables, this could be optimized with a single aggregation.

<details>
<summary>⚡ Proposed optimization</summary>

```diff
-        n_total = result.count()
-        n_ptm_site = result.filter(result.is_ptm_site).count()
-        n_proximal = result.filter(result.is_ptm_proximal).count()
+        counts = result.aggregate(hl.struct(
+            n_total=hl.agg.count(),
+            n_ptm_site=hl.agg.filter(result.is_ptm_site, hl.agg.count()),
+            n_proximal=hl.agg.filter(result.is_ptm_proximal, hl.agg.count()),
+        ))
+        n_total = counts.n_total
+        n_ptm_site = counts.n_ptm_site
+        n_proximal = counts.n_proximal
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` around lines 181 - 187, The three separate
count() calls (n_total = result.count(), n_ptm_site =
result.filter(result.is_ptm_site).count(), n_proximal =
result.filter(result.is_ptm_proximal).count()) cause multiple table scans;
replace them with a single aggregation over result that computes total, PTM site
and proximal counts in one pass (e.g., a single query using conditional
sums/CASE or SQLAlchemy func.sum on expressions for is_ptm_site and
is_ptm_proximal, or iterate once and accumulate into n_total, n_ptm_site,
n_proximal) so the code still sets n_total, n_ptm_site and n_proximal but with
one database scan.
```

</details>
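The effect of the single-pass aggregation can be sketched in plain Python, with a list of dicts standing in for the Hail table (field names borrowed from the snippet above):

```python
rows = [
    {"is_ptm_site": True,  "is_ptm_proximal": False},
    {"is_ptm_site": False, "is_ptm_proximal": True},
    {"is_ptm_site": True,  "is_ptm_proximal": True},
]

# One pass over the data computes all three counts at once,
# instead of three separate scans.
n_total = n_ptm_site = n_proximal = 0
for row in rows:
    n_total += 1
    n_ptm_site += row["is_ptm_site"]      # bools count as 0/1
    n_proximal += row["is_ptm_proximal"]
```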

---

`1-11`: **Docstring mentions "evaluate" command that doesn't exist.**

The module docstring at line 8 lists `evaluate: Predictor performance at PTM sites (Q2)`, but this command is not implemented in the file. The actual Q2 functionality is handled via `export-strata` combined with external `psroc` commands as documented in the `ptm_group` docstring.

Consider updating the module docstring to reflect the actual command set:

<details>
<summary>📝 Proposed fix</summary>

```diff
 """
 PTM CLI Commands - Post-Translational Modification Analysis
 
 This module provides CLI commands for the PTM variant classification pipeline:
 - build: Download PTM data, map coordinates, build Hail Table
 - annotate: Annotate a variant table with PTM site information
 - landscape: PTM-variant overlap analysis (Q1)
-- evaluate: Predictor performance at PTM sites (Q2)
+- export-strata: Export PTM-stratified variant lists for external predictor evaluation (Q2)
 - population: Population-level PTM-variant burden (Q3)
 - report: Generate summary report
 """
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` around lines 1 - 11, Update the module docstring
in ptm_cli.py to remove or replace the incorrect "evaluate: Predictor
performance at PTM sites (Q2)" entry and instead document the actual commands
implemented (build, annotate, landscape, population, report) and note that Q2
functionality is provided via export-strata plus external psroc usage as
described in the ptm_group docstring; edit the top-level triple-quoted string
(module docstring) to reflect the accurate command set and cross-reference the
ptm_group explanation so readers are not misled about an evaluate command that
does not exist.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs_site/examples/index.md:

  • Around line 97-99: The quick-start commands have a PTM HT path mismatch: the
    build command (hvantk ptm build) writes data/ptm/ptm_sites.ht but the landscape
    command (hvantk ptm landscape) expects --ptm-ht ptm_sites.ht; update the
    landscape invocation to use --ptm-ht data/ptm/ptm_sites.ht (or alternatively
    change the build --output-ht to produce ptm_sites.ht in the current dir) so the
    two commands reference the same file path.

In @docs_site/guide/data-sources.md:

  • Around line 250-257: The docs currently list placeholder comments instead of
    actionable commands for pre-downloading data; replace the commented notes in the
    "Download (optional, for offline use)" block with runnable commands such as
    using the hvantk CLI (hvantk download uniprot-ptm --output-dir data/ptm) to
    fetch the UniProt PTM TSV and a wget call to fetch the Ensembl GTF (e.g.,
    download Homo_sapiens.GRCh38.*.gtf.gz into data/ref) so the document contains
    concrete, executable pre-download steps.

In @hvantk/annotation/annotate.py:

  • Around line 30-41: The predicates is_pathogenic and is_benign currently
    reference t.clinvar_clnsig before it is assigned; either compute clinvar_clnsig
    first or build predicates directly from the joined ClinVar expression
    clinvar_ht[t.key].info.CLNSIG. Update the code so that t =
    t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG) is executed before
    creating is_pathogenic/is_benign, or replace usages of t.clinvar_clnsig in the
    is_pathogenic and is_benign expressions with clinvar_ht[t.key].info.CLNSIG
    (using CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS with .any(...) as
    before) to avoid referencing an unassigned field.

In @hvantk/commands/ptm_cli.py:

  • Around line 332-338: Move the matplotlib backend selection before importing
    the plotting function: call matplotlib.use("Agg") prior to the from
    hvantk.ptm.plot import plot_population_af statement in the save_plots block so
    the backend is set before any plotting modules are imported; update the
    save_plots branch that currently imports plot_population_af and then calls
    matplotlib.use to instead import matplotlib, set matplotlib.use("Agg"), then
    import plot_population_af and call plot_population_af(result,
    os.path.join(output, "population_af.png")).
  • Around line 232-244: The call to matplotlib.use("Agg") occurs after importing
    hvantk.ptm.plot (which imports matplotlib.pyplot), so move the backend selection
    before that import: when save_plots is true, call matplotlib.use("Agg") first,
    then import plot_landscape_summary, plot_overlap_by_category, and
    plot_distance_distribution from hvantk.ptm.plot and proceed to call those
    functions (or alternatively set the backend inside the hvantk.ptm.plot module
    itself by adding matplotlib.use("Agg") before any pyplot import).

In @hvantk/commands/uniprot_ptm_downloader.py:

  • Around line 23-26: The CLI help for the "--date" argument incorrectly says the
    default is "today"; update the help text to reflect that when no date is
    provided the code uses the latest UniProt snapshot
    (UniProtPTMDataset.from_latest()), not the current date. Locate the argument
    definition for "--date" (the argparse add_argument call) and change the help
    string to mention "default: latest UniProt snapshot (via
    UniProtPTMDataset.from_latest())" and make the same change for the duplicate
    help text near the UniProtPTMDataset.from_latest() usage so both descriptions
    match the actual behavior.

In @hvantk/data/file_utils.py:

  • Around line 98-118: The sanitizer currently corrupts fields by replacing
    delimiter chars and buffers the whole file; in the sanitize_tsv logic use
    csv.writer to stream-output rows to tmp_path instead of manual delimiter.join
    and remove the cell.replace(delimiter, " ") step so you only normalize embedded
    newlines (replace "\n" and "\r" with spaces), iterate reader rows one-by-one (or
    filter blank rows) and writer.writerow(cleaned_row) for each, and preserve
    existing quoting/escaping semantics provided by csv.writer; ensure you still
    warn/return on empty input and atomically replace the original file with the
    tmp_path after successful write.

In @hvantk/datasets/gencc_datasets.py:

  • Around line 115-119: The call to sanitize_tsv(output_path) can return 0
    indicating failure, but the code always logs success and returns output_path;
    update the surrounding function (the one that calls sanitize_tsv) to capture the
    return value (e.g., result = sanitize_tsv(output_path)), check if result == 0,
    and on failure log an error via logger.error and either raise an exception or
    return None/False (consistent with the function's error handling convention)
    instead of returning output_path; ensure you reference sanitize_tsv and the
    caller function name so the check replaces the unconditional logger.info and
    return when sanitization fails.

In @hvantk/ptm/constants.py:

  • Around line 32-35: The mapping currently places the generic "Glycyl lysine
    isopeptide (Lys-Gly)" entry before the SUMO-specific keys, so
    hvantk/ptm/pipeline.py's first-match loop (which uses desc.startswith(prefix))
    classifies SUMO entries as "ubiquitination"; reorder the dictionary so the two
    SUMO-specific keys ("Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter
    in SUMO)" and "... SUMO2") appear before the generic "Glycyl lysine isopeptide
    (Lys-Gly)" key to ensure those descriptions match "sumoylation" first.

In @hvantk/ptm/report.py:

  • Around line 204-212: The guard only checks result.ccr_mean_ptm but then
    formats result.ccr_mean_non_ptm unconditionally; update the logic in the report
    generation (the block that builds ccr_html using result.ccr_mean_ptm and
    result.ccr_mean_non_ptm) to verify each value before formatting: only include
    the PTM line if result.ccr_mean_ptm is not None and only include the non-PTM
    line if result.ccr_mean_non_ptm is not None (or provide a fallback like "N/A");
    you can either add a combined guard (if result.ccr_mean_ptm is not None or
    result.ccr_mean_non_ptm is not None) and then conditionally append each
    f"..." line, or check both individually, so neither f-string attempts to
    format a None value.

In @hvantk/tables/table_builders.py:

  • Around line 1390-1393: The code currently allows negative flanking_codons
    which leads to negative flank_bp and inverted/shrunk intervals; add an explicit
    validation at the start of create_ptm_sites_tb (and any public entry that
    forwards the parameter, e.g., the mktable ptm-sites command handler) to raise a
    clear exception (ValueError) when flanking_codons < 0, and avoid computing
    flank_bp for negative values (the existing flank_bp calculation block around
    where flank_bp is computed/used, referenced in the diff lines ~1460-1469, should
    only run for non-negative flanking_codons). Ensure the check references the
    function parameter name flanking_codons and returns/raises before any
    interval-building logic runs.

In @hvantk/utils/table_utils.py:

  • Around line 59-76: The rename-building logic in the loop over field_map (using
    _normalize_field_name, norm_to_actual, target_names and rename) must avoid
    producing a mapping whose destination name already exists in the table; update
    the checks so before setting rename[actual]=target (or
    rename[actual_by_target]=target) you verify that target is not present in the
    current actual fields (actual_fields) and not already a destination in
    rename.values(); additionally detect and raise/return a clear error when
    multiple different source headers collapse to the same normalized key so we fail
    fast instead of producing duplicate destinations.

Nitpick comments:
In @docs_site/examples/ptm.md:

  • Around line 43-49: Add a language specifier (text or console) to the
    fenced code blocks that show output examples — specifically update the block
    immediately following the "Expected output:" header (the block containing
    "Mapping complete: 55422/56120 mapped...") and apply the same change to the
    other output blocks referenced (the ones at the locations corresponding to lines
    63, 83, and 150) so each opening triple-backtick becomes ```text; keep the block
    contents unchanged.

In @hvantk/commands/ptm_cli.py:

  • Around line 119-123: When config.validate() returns errors, prepend a clear
    context message before listing them: call click.echo with a message like
    "Configuration validation failed:" (using err=True) just prior to the for-loop
    that echoes each error, then proceed to loop over errors and call ctx.exit(1);
    update the block that references errors, config.validate(), click.echo, and
    ctx.exit to include that preceding explanatory line so the output clearly
    indicates what failed.
  • Around line 181-187: The three separate count() calls (n_total =
    result.count(), n_ptm_site = result.filter(result.is_ptm_site).count(),
    n_proximal = result.filter(result.is_ptm_proximal).count()) cause multiple table
    scans; replace them with a single aggregation over result that computes total,
    PTM site and proximal counts in one pass (e.g., a single query using conditional
    sums/CASE or SQLAlchemy func.sum on expressions for is_ptm_site and
    is_ptm_proximal, or iterate once and accumulate into n_total, n_ptm_site,
    n_proximal) so the code still sets n_total, n_ptm_site and n_proximal but with
    one database scan.
  • Around line 1-11: Update the module docstring in ptm_cli.py to remove or
    replace the incorrect "evaluate: Predictor performance at PTM sites (Q2)" entry
    and instead document the actual commands implemented (build, annotate,
    landscape, population, report) and note that Q2 functionality is provided via
    export-strata plus external psroc usage as described in the ptm_group docstring;
    edit the top-level triple-quoted string (module docstring) to reflect the
    accurate command set and cross-reference the ptm_group explanation so readers
    are not misled about an evaluate command that does not exist.

In @hvantk/datasets/uniprot_ptm_datasets.py:

  • Around line 226-231: The except block that re-raises a ValueError for invalid
    version_date should preserve exception chaining; change "except ValueError:" to
    "except ValueError as err:" and re-raise with "raise ValueError(f'Invalid
    version_date format: {version_date}. Expected YYYY-MM-DD') from err" so the
    original parsing error is chained (refer to the try/except around
    datetime.strptime and the variable version_date).

In @hvantk/ptm/analysis.py:

  • Around line 42-44: The f-string prefix in the summary method is unnecessary
    because there are no placeholders; update the summary function (def
    summary(self) -> str:) to use a normal string literal for the line "PTM-Variant
    Landscape:" instead of an f-string (locate the list assigned to the variable
    lines inside summary) so remove the leading f from that string.
  • Around line 77-79: The string in the summary() method is using an unnecessary
    f-string prefix for a literal with no placeholders; update the list entry in PTM
    Population Analysis within the summary() function to use a normal string literal
    (remove the leading f) so the line reads a plain quoted string instead of an
    f-string.
  • Line 369: Before indexing annotated with af_field, validate that the
    key/property exists to avoid a runtime KeyError; in the function where af =
    annotated[af_field] is used, check presence (e.g., "if af_field not in
    annotated" or use hasattr for objects) and raise a clear ValueError/TypeError
    describing the missing field and listing available keys (or provide a sensible
    default/fallback) so callers get a helpful error instead of a cryptic crash.

In @hvantk/ptm/plot.py:

  • Around line 43-50: The parameter name format in function
    plot_landscape_summary shadows the builtin; rename it to a non-conflicting
    identifier such as output_format (or fmt/file_format) in the function signature
    and update every internal reference within plot_landscape_summary (and its type
    hint/default) to use output_format, plus update all call sites, docstrings, and
    tests that pass format to use the new name so behavior is preserved.
  • Around line 253-258: The loop annotating bar counts should use zip(...,
    strict=True) to guard against mismatched lengths; update the zip call that pairs
    bars and n_values in hvantk/ptm/plot.py so it reads zip(bars, n_values,
    strict=True) and leave the rest of the ax.text(...) logic unchanged to ensure a
    ValueError is raised if future refactors cause misalignment.

In @hvantk/ptm/report.py:

  • Around line 258-260: The except block that currently catches Exception and
    calls logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
    should be changed to log the actual exception details; either capture the
    exception (except Exception as e) and include it in the log message or call
    logger.exception/ logger.warning(..., exc_info=True) so the stack trace is
    recorded while still returning ""; update the except block surrounding the plot
    generation (the block using output_path.stem and returning "") to emit the
    exception information for debugging.

In @hvantk/tests/test_ptm.py:

  • Around line 218-221: The test uses hardcoded paths; update the test to accept
    pytest's tmp_path fixture and construct PTMBuildConfig using tmp_path (e.g.,
    tmp_path / "test" for output_dir and output_ht) instead of "/tmp/test" and
    "/tmp/test/out.ht", then call config2.validate() as before; ensure the test
    function signature includes tmp_path so tmp directories are used and removed
    automatically.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `0caf719e-a360-40a8-adf6-12992f1bcb5f`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 3fa0ef5b95a8b6edaefcb4b5a2c6653bcc5945b4 and 857ce337560ad0b28dcb87d8a5c34d2ed48f7b9f.

</details>

<details>
<summary>⛔ Files ignored due to path filters (1)</summary>

* `docs_site/images/hvantk-ptm-workflow.svg` is excluded by `!**/*.svg`

</details>

<details>
<summary>📒 Files selected for processing (29)</summary>

* `docs_site/architecture.md`
* `docs_site/examples/index.md`
* `docs_site/examples/ptm.md`
* `docs_site/guide/data-sources.md`
* `docs_site/guide/usage.md`
* `docs_site/tools/ptm.md`
* `hvantk/annotation/annotate.py`
* `hvantk/commands/download_cli.py`
* `hvantk/commands/make_table_cli.py`
* `hvantk/commands/ptm_cli.py`
* `hvantk/commands/uniprot_ptm_downloader.py`
* `hvantk/core/constants.py`
* `hvantk/data/file_utils.py`
* `hvantk/datasets/gencc_datasets.py`
* `hvantk/datasets/uniprot_ptm_datasets.py`
* `hvantk/hvantk.py`
* `hvantk/psroc/pipeline.py`
* `hvantk/ptm/__init__.py`
* `hvantk/ptm/analysis.py`
* `hvantk/ptm/annotate.py`
* `hvantk/ptm/constants.py`
* `hvantk/ptm/mapper.py`
* `hvantk/ptm/pipeline.py`
* `hvantk/ptm/plot.py`
* `hvantk/ptm/report.py`
* `hvantk/tables/registry.py`
* `hvantk/tables/table_builders.py`
* `hvantk/tests/test_ptm.py`
* `hvantk/utils/table_utils.py`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread docs_site/examples/index.md
Comment on lines +250 to +257
**Download** (optional, for offline use):

```bash
# The build command handles this automatically, but you can pre-download:
# 1. UniProt PTM TSV via REST API query
# 2. Ensembl GTF for coordinate mapping
```


⚠️ Potential issue | 🟡 Minor

Replace placeholder comments with an executable pre-download command.

Line 253–256 currently documents intent but gives no runnable command. Add hvantk download uniprot-ptm to make offline prep actually actionable.

<details>
<summary>Suggested doc fix</summary>

````diff
 **Download** (optional, for offline use):

 ```bash
-# The build command handles this automatically, but you can pre-download:
-# 1. UniProt PTM TSV via REST API query
-# 2. Ensembl GTF for coordinate mapping
+# UniProt PTM TSV (optional pre-download)
+hvantk download uniprot-ptm --output-dir data/ptm
+
+# Ensembl GTF (optional pre-download for mapping)
+wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz -P data/ref
 ```
````
</details>

<!-- suggestion_start -->

<details>
<summary>📝 Committable suggestion</summary>

> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

````suggestion
**Download** (optional, for offline use):

```bash
# UniProt PTM TSV (optional pre-download)
hvantk download uniprot-ptm --output-dir data/ptm

# Ensembl GTF (optional pre-download for mapping)
wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz -P data/ref
```
````

</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@docs_site/guide/data-sources.md` around lines 250 - 257, The docs currently
list placeholder comments instead of actionable commands for pre-downloading
data; replace the commented notes in the "Download (optional, for offline use)"
block with runnable commands such as using the hvantk CLI (hvantk download
uniprot-ptm --output-dir data/ptm) to fetch the UniProt PTM TSV and a wget call
to fetch the Ensembl GTF (e.g., download Homo_sapiens.GRCh38.*.gtf.gz into
data/ref) so the document contains concrete, executable pre-download steps.
```

</details>

Comment thread hvantk/annotation/annotate.py Outdated
Comment thread hvantk/commands/ptm_cli.py
Comment thread hvantk/commands/ptm_cli.py
Comment thread hvantk/datasets/gencc_datasets.py
Comment thread hvantk/ptm/constants.py Outdated
Comment thread hvantk/ptm/report.py
Comment thread hvantk/tables/table_builders.py
Comment thread hvantk/utils/table_utils.py
- Fix clinvar_clnsig predicate ordering bug (referenced before annotation)
- Move matplotlib.use("Agg") before plot imports in PTM CLI commands
- Reorder PTM_TYPE_CATEGORIES so SUMO entries match before generic Lys-Gly
- Guard ccr_mean_non_ptm for None before formatting in report
- Check sanitize_tsv return value in GenCC downloader
- Rewrite sanitize_tsv to use csv.writer, stream rows, validate column count
- Detect rename collisions and ambiguous headers in build_rename_map
- Fix docstrings (keyed by locus not interval, evaluate→export-strata)
- Remove unused PTM_TYPE_CATEGORIES import, validate flanking_codons >= 0
- Use single-pass Hail aggregation for PTM annotation counts
- Fix docs: path mismatch, placeholder commands, code fence specifiers
- Minor: exception chaining, extraneous f-strings, exc_info logging, strict zip
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (8)
hvantk/annotation/annotate.py (1)

`30-49`: **LGTM! The predicate ordering bug has been correctly fixed.**

The previous issue where is_pathogenic and is_benign referenced t.clinvar_clnsig before assignment is now resolved—the field is annotated first (line 36) before the predicates are computed (lines 39-42). Using centralized constants from hvantk.core.constants improves maintainability.

One optional refinement for clarity and marginal performance: hoist the `hl.set()` calls outside the lambdas.

<details>
<summary>♻️ Optional: Hoist set creation outside lambdas</summary>

```diff
     # First annotate clinvar_clnsig from the ClinVar table
     t = t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG)

     # Now compute conditions using the annotated field
+    pathogenic_labels = hl.set(CLINVAR_PATHOGENIC_LABELS)
+    benign_labels = hl.set(CLINVAR_BENIGN_LABELS)
     is_pathogenic = t.clinvar_clnsig.any(
-        lambda x: hl.set(CLINVAR_PATHOGENIC_LABELS).contains(x)
+        lambda x: pathogenic_labels.contains(x)
     )
-    is_benign = t.clinvar_clnsig.any(lambda x: hl.set(CLINVAR_BENIGN_LABELS).contains(x))
+    is_benign = t.clinvar_clnsig.any(lambda x: benign_labels.contains(x))
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/annotation/annotate.py` around lines 30 - 49, Hoist the hl.set(...)
creations out of the lambdas to avoid rebuilding them repeatedly: create local
Hail sets (e.g., pathogenic_set = hl.set(CLINVAR_PATHOGENIC_LABELS) and
benign_set = hl.set(CLINVAR_BENIGN_LABELS)) before computing is_pathogenic and
is_benign, then use those variables inside the t.clinvar_clnsig.any(...)
lambdas; update references in the is_pathogenic and is_benign predicates and
leave the rest of the t.annotate logic unchanged.
```

</details>
hvantk/commands/uniprot_ptm_downloader.py (1)

`40-55`: **Solid CLI implementation with appropriate error handling.**

The lazy import pattern and error handling with ctx.exit(1) follow good CLI practices.

One minor note: logger.exception() already includes the exception information, so the f-string f"Download failed: {e}" duplicates it slightly. You could simplify to logger.exception("Download failed").

<details>
<summary>✨ Simplify exception logging</summary>

```diff
     except Exception as e:
-        logger.exception(f"Download failed: {e}")
+        logger.exception("Download failed")
         click.echo(f"Error: {e}", err=True)
         ctx.exit(1)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/uniprot_ptm_downloader.py` around lines 40 - 55, The
exception log duplicates the exception text; in the except block for the UniProt
PTM download (the try that imports UniProtPTMDataset and calls
UniProtPTMDataset.from_date/from_latest, download, etc.) replace
logger.exception(f"Download failed: {e}") with a simpler
logger.exception("Download failed") so the stack/exception is still recorded
without duplicating the exception string, keeping the existing
click.echo(f"Error: {e}", err=True) and ctx.exit(1) behavior unchanged.
```

</details>
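The difference is easy to see with a logger wired to an in-memory stream: `logger.exception` already appends the traceback, so the message itself can stay static.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(stream))

try:
    raise RuntimeError("boom")
except Exception:
    # The traceback, including "RuntimeError: boom", is appended automatically,
    # so embedding str(e) in the message would duplicate it.
    logger.exception("Download failed")

output = stream.getvalue()
```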
hvantk/datasets/uniprot_ptm_datasets.py (2)

`257-325`: **Well-implemented download with proper cleanup.**

The download method correctly:

  • Creates output directory if needed
  • Checks for existing files before overwrite
  • Streams results page-by-page to avoid memory issues
  • Cleans up partial files on failure

One minor style improvement per static analysis:

<details>
<summary>✨ Use implicit string conversion in f-string</summary>

```diff
-            raise RuntimeError(
-                f"Failed to download UniProt PTM data: {str(e)}"
-            ) from e
+            raise RuntimeError(
+                f"Failed to download UniProt PTM data: {e}"
+            ) from e
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 257 - 325, In download,
the exception message uses explicit str conversion (f"{str(e)}"); change it to
use implicit string conversion (e.g., f"{e}") when constructing the RuntimeError
message in the except block so the code is cleaner and satisfies the static
analysis rule in the download function's error handling.
```

</details>

`157-183`: **Consider validating URL scheme for defense-in-depth.**

The static analysis flags that urllib.request.urlopen accepts arbitrary URL schemes including file://. While UNIPROT_API_URL is presumably an HTTPS constant, the pagination URLs from _get_next_link come from server response headers. A compromised or malicious server could theoretically redirect to a file:// URL.

For defense-in-depth, consider validating the scheme:

🛡️ Optional: Add scheme validation
 def _fetch_json(url: str) -> tuple:
     """
     Fetch a JSON page from UniProt using *requests* (preferred) or *urllib*.
     ...
     """
+    from urllib.parse import urlparse
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme}")
+
     try:
         import requests as _requests
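As a standalone illustration of the proposed check (the helper name `check_scheme` is hypothetical):

```python
from urllib.parse import urlparse

def check_scheme(url: str) -> str:
    """Raise unless the URL uses http(s); apply to every pagination link too."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")
    return url
```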
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 157 - 183, The fetch
helper _fetch_json currently opens arbitrary URL schemes (via
urllib.request.urlopen or requests.get); validate the URL scheme before issuing
the request by parsing the URL (e.g., urllib.parse.urlparse) and only allowing
"http" and "https" (reject or raise an error for "file", "ftp", etc.), and apply
the same check to any pagination links produced by _get_next_link (ensure those
links are validated before being passed into _fetch_json); this defends against
malicious server-provided pagination URLs while preserving normal
UNIPROT_API_URL usage.
hvantk/commands/ptm_cli.py (1)

182-182: Remove duplicate hail import.

hail as hl is already imported at line 171. This duplicate import is unnecessary.

♻️ Remove duplicate import
         result = result.checkpoint(output_ht, overwrite=overwrite)

-        import hail as hl
         counts = result.aggregate(hl.struct(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` at line 182, Remove the duplicate import of "hail
as hl" found at the second import line; keep the original "import hail as hl"
that already exists earlier (around line with other imports) and delete the
redundant "import hail as hl" occurrence to avoid duplicate imports in the
module (check the top-level imports near the PTM CLI definitions and functions
that use hl).
hvantk/ptm/report.py (1)

258-260: Consider narrowing the exception type for plot failures.

The broad Exception catch (flagged by Ruff BLE001) is acceptable here for fault tolerance, but could be narrowed to catch more specific exceptions like (ValueError, IOError, RuntimeError) to avoid masking unexpected errors.

♻️ Optional: narrow exception scope
-    except Exception:
+    except (ValueError, IOError, RuntimeError, TypeError):
         logger.warning("Failed to generate plot %s, skipping.", output_path.stem, exc_info=True)
         return ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/report.py` around lines 258 - 260, Replace the broad "except
Exception" in the plot-generation failure handler with a narrowed tuple of
expected exceptions (for example except (ValueError, IOError, RuntimeError):) so
you don't mask unrelated errors; keep the existing logger.warning call (using
output_path.stem and exc_info=True) and the return "" behavior unchanged; update
the except clause where the code logs "Failed to generate plot %s, skipping."
(the block that references output_path.stem and logger.warning) to use the
narrower exception tuple to address Ruff BLE001.
hvantk/ptm/plot.py (1)

49-49: Parameter format shadows Python builtin.

The format parameter appears in multiple functions (lines 49, 126, 171, 218, 277, 287, 298). While common in plotting code, consider renaming to fmt or image_format to avoid shadowing.

♻️ Rename `format` parameter across all functions
 def plot_landscape_summary(
     result: PTMLandscapeResult,
     output_path: str,
     figsize: Tuple[int, int] = (8, 5),
     dpi: int = 300,
     title: Optional[str] = None,
-    format: str = "png",
+    fmt: str = "png",
 ) -> plt.Figure:

Apply similar changes to other functions and update internal usages of format to fmt.
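A small demonstration of why the shadowing matters: once the parameter is bound, the builtin `format()` is unreachable inside the function. Both function names below are illustrative, not the module's real plotting functions:

```python
def plot_with_shadowing(format: str = "png"):
    # `format` is now the string "png"; the builtin of the same name is
    # shadowed, so this call raises TypeError: 'str' object is not callable.
    return format(3.14159, ".2f")

def plot_without_shadowing(fmt: str = "png"):
    # With the parameter renamed, the builtin is still available.
    return f"{format(3.14159, '.2f')}.{fmt}"
```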

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` at line 49, Rename the parameter named format to fmt (or
image_format) across all plotting functions in the hvantk.ptm.plot module that
declare "format: str = 'png'" (those function signatures flagged in the review)
and update every internal usage of format to the new name, plus adjust any local
variables or return values that reference it; ensure function signatures (and
docstrings if present) and all call sites within the repository are updated
consistently so nothing still references the old format parameter name.
hvantk/utils/table_utils.py (1)

65-87: Consider detecting duplicate rename destinations for robustness.

The target not in actual_fields check now prevents renaming into an existing field. For additional safety against malformed field_map inputs where multiple entries could resolve to the same actual field, you could also check target not in rename.values() before adding:

             if actual not in target_names and target not in actual_fields:
+                if target in rename.values():
+                    logger.warning(
+                        "build_rename_map: target %r already assigned, skipping %r",
+                        target, actual,
+                    )
+                    continue
                 rename[actual] = target

Current usage with COSMIC_CGC_FIELDS has no collisions, so this is purely defensive.
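A simplified, hypothetical sketch of the defensive guard — the real helper also normalizes names via `_normalize_field_name`; only the duplicate-destination check is shown here:

```python
import logging

logger = logging.getLogger(__name__)

def build_rename_map(actual_fields, field_map):
    """Simplified rename-map builder with a duplicate-target guard."""
    target_names = set(field_map.values())
    rename = {}
    for actual, target in field_map.items():
        if actual not in actual_fields:
            continue                      # source column absent
        if actual in target_names or target in actual_fields:
            continue                      # existing checks: avoid clobbering
        if target in rename.values():     # proposed guard: two sources, one target
            logger.warning(
                "build_rename_map: target %r already assigned, skipping %r",
                target, actual,
            )
            continue
        rename[actual] = target
    return rename
```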

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/utils/table_utils.py` around lines 65 - 87, The current header-rename
logic may allow multiple source fields to map to the same target; update the
blocks that assign to rename (the branches that set rename[actual] = target and
rename[actual_by_target] = target) to also check that the proposed target is not
already used in rename.values() before adding; use the same normalization/path
lookup logic around _normalize_field_name, norm_to_actual, field_map,
actual_fields and target_names to prevent duplicate rename destinations and keep
the existing checks (actual not in target_names and target not in
actual_fields).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/ptm/analysis.py`:
- Around line 86-91: The current guard only checks self.ccr_mean_ptm but still
formats self.ccr_mean_non_ptm which can be None; update the block that emits the
summary (the lines using self.ccr_mean_ptm and self.ccr_mean_non_ptm, produced
by ptm_population()/the summary method) to check each mean independently and
only format the non-None values. Concretely, replace the single-if that formats
both values with logic that builds the string parts conditionally (e.g., if
self.ccr_mean_ptm is not None append "PTM=...", and if self.ccr_mean_non_ptm is
not None append "non-PTM=...") and then join/append that to lines so no f-string
is called with a None value.
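
The conditional formatting described above can be sketched as follows; the `ccr_summary` helper is illustrative, not the class's actual summary method:

```python
def ccr_summary(ccr_mean_ptm, ccr_mean_non_ptm):
    """Build the summary line from whichever means are present,
    so no f-string format spec is ever applied to None."""
    parts = []
    if ccr_mean_ptm is not None:
        parts.append(f"PTM={ccr_mean_ptm:.2f}")
    if ccr_mean_non_ptm is not None:
        parts.append(f"non-PTM={ccr_mean_non_ptm:.2f}")
    return "CCR mean: " + ", ".join(parts) if parts else "CCR mean: n/a"
```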

In `@hvantk/tables/table_builders.py`:
- Around line 1142-1146: The fallback sets kwargs["force_bgz"]=True whenever
resolved_path.endswith(".gz"), which overrides resolve_compression's explicit
False for standard gzip; change the code to unpack the boolean from
resolve_compression (e.g., resolved_path, is_bgz = resolve_compression(...)) and
only set kwargs["force_bgz"]=True when either force_bgz was explicitly requested
or is_bgz is True (do not set it solely based on resolved_path.endswith(".gz")).
Ensure you update references to resolved_path and the function
resolve_compression accordingly and remove the blunt .endswith(".gz") check.
- Around line 1461-1471: The flanking_interval end can exceed chromosome bounds
(ht.codon_end + flank_bp); clamp it to the contig length before calling
hl.locus_interval. Retrieve the contig length for ht._contig from the
reference_genome (the same approach used in the PTM annotation pattern), compute
end_pos = hl.min(contig_length, ht.codon_end + flank_bp), and use that end_pos
in hl.locus_interval (keeping hl.max(1, ht.codon_start - flank_bp) for the
start); update the flanking_interval annotation accordingly (symbols:
flanking_interval, ht.codon_end, flank_bp, ht._contig, reference_genome,
hl.locus_interval).
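
The clamping logic amounts to the following (plain-Python sketch; the real fix would use `hl.min`/`hl.max` on Hail expressions):

```python
def flanking_interval(codon_start, codon_end, flank_bp, contig_length):
    """Clamp a flanked codon interval to [1, contig_length]."""
    start = max(1, codon_start - flank_bp)
    end = min(contig_length, codon_end + flank_bp)
    return start, end
```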

---

Nitpick comments:
In `@hvantk/annotation/annotate.py`:
- Around line 30-49: Hoist the hl.set(...) creations out of the lambdas to avoid
rebuilding them repeatedly: create local Hail sets (e.g., pathogenic_set =
hl.set(CLINVAR_PATHOGENIC_LABELS) and benign_set =
hl.set(CLINVAR_BENIGN_LABELS)) before computing is_pathogenic and is_benign,
then use those variables inside the t.clinvar_clnsig.any(...) lambdas; update
references in the is_pathogenic and is_benign predicates and leave the rest of
the t.annotate logic unchanged.
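
A plain-Python analogue of the hoisting (the label tuple is an illustrative subset; in Hail the hoisted object would be an `hl.set(...)` expression built before `t.annotate`):

```python
CLINVAR_PATHOGENIC_LABELS = ("Pathogenic", "Likely_pathogenic")  # illustrative subset

# Built once, outside any per-row predicate, instead of inside each lambda.
pathogenic_set = set(CLINVAR_PATHOGENIC_LABELS)

def is_pathogenic(clnsig_values):
    return any(label in pathogenic_set for label in clnsig_values)
```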

In `@hvantk/commands/ptm_cli.py`:
- Line 182: Remove the duplicate import of "hail as hl" found at the second
import line; keep the original "import hail as hl" that already exists earlier
(around line with other imports) and delete the redundant "import hail as hl"
occurrence to avoid duplicate imports in the module (check the top-level imports
near the PTM CLI definitions and functions that use hl).

In `@hvantk/commands/uniprot_ptm_downloader.py`:
- Around line 40-55: The exception log duplicates the exception text; in the
except block for the UniProt PTM download (the try that imports
UniProtPTMDataset and calls UniProtPTMDataset.from_date/from_latest, download,
etc.) replace logger.exception(f"Download failed: {e}") with a simpler
logger.exception("Download failed") so the stack/exception is still recorded
without duplicating the exception string, keeping the existing
click.echo(f"Error: {e}", err=True) and ctx.exit(1) behavior unchanged.

In `@hvantk/datasets/uniprot_ptm_datasets.py`:
- Around line 257-325: In download, the exception message uses explicit str
conversion (f"{str(e)}"); change it to use implicit string conversion (e.g.,
f"{e}") when constructing the RuntimeError message in the except block so the
code is cleaner and satisfies the static analysis rule in the download
function's error handling.
- Around line 157-183: The fetch helper _fetch_json currently opens arbitrary
URL schemes (via urllib.request.urlopen or requests.get); validate the URL
scheme before issuing the request by parsing the URL (e.g.,
urllib.parse.urlparse) and only allowing "http" and "https" (reject or raise an
error for "file", "ftp", etc.), and apply the same check to any pagination links
produced by _get_next_link (ensure those links are validated before being passed
into _fetch_json); this defends against malicious server-provided pagination
URLs while preserving normal UNIPROT_API_URL usage.

In `@hvantk/ptm/plot.py`:
- Line 49: Rename the parameter named format to fmt (or image_format) across all
plotting functions in the hvantk.ptm.plot module that declare "format: str =
'png'" (those function signatures flagged in the review) and update every
internal usage of format to the new name, plus adjust any local variables or
return values that reference it; ensure function signatures (and docstrings if
present) and all call sites within the repository are updated consistently so
nothing still references the old format parameter name.

In `@hvantk/ptm/report.py`:
- Around line 258-260: Replace the broad "except Exception" in the
plot-generation failure handler with a narrowed tuple of expected exceptions
(for example except (ValueError, IOError, RuntimeError):) so you don't mask
unrelated errors; keep the existing logger.warning call (using output_path.stem
and exc_info=True) and the return "" behavior unchanged; update the except
clause where the code logs "Failed to generate plot %s, skipping." (the block
that references output_path.stem and logger.warning) to use the narrower
exception tuple to address Ruff BLE001.

In `@hvantk/utils/table_utils.py`:
- Around line 65-87: The current header-rename logic may allow multiple source
fields to map to the same target; update the blocks that assign to rename (the
branches that set rename[actual] = target and rename[actual_by_target] = target)
to also check that the proposed target is not already used in rename.values()
before adding; use the same normalization/path lookup logic around
_normalize_field_name, norm_to_actual, field_map, actual_fields and target_names
to prevent duplicate rename destinations and keep the existing checks (actual
not in target_names and target not in actual_fields).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a83341fb-2b11-4f02-87df-be7d021b8452

📥 Commits

Reviewing files that changed from the base of the PR and between 857ce33 and a58516f.

📒 Files selected for processing (15)
  • docs_site/examples/index.md
  • docs_site/examples/ptm.md
  • docs_site/guide/data-sources.md
  • hvantk/annotation/annotate.py
  • hvantk/commands/ptm_cli.py
  • hvantk/commands/uniprot_ptm_downloader.py
  • hvantk/data/file_utils.py
  • hvantk/datasets/gencc_datasets.py
  • hvantk/datasets/uniprot_ptm_datasets.py
  • hvantk/ptm/analysis.py
  • hvantk/ptm/constants.py
  • hvantk/ptm/plot.py
  • hvantk/ptm/report.py
  • hvantk/tables/table_builders.py
  • hvantk/utils/table_utils.py
✅ Files skipped from review due to trivial changes (4)
  • docs_site/guide/data-sources.md
  • docs_site/examples/index.md
  • hvantk/ptm/constants.py
  • docs_site/examples/ptm.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • hvantk/datasets/gencc_datasets.py
  • hvantk/data/file_utils.py

Comment thread hvantk/ptm/analysis.py Outdated
Comment on lines +1142 to +1146
if force_bgz:
    kwargs["force_bgz"] = True
elif resolved_path.endswith(".gz"):
    kwargs["force_bgz"] = True
return hl.import_table(**kwargs)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Hail import_table force_bgz parameter gzip vs bgzip

💡 Result:

In Hail, hl.import_table imports delimited text files (TSV/CSV) into a Hail Table, and its force_bgz parameter controls how compressed .gz files are handled during import. GZIP is standard gzip compression, which decompresses serially (single-threaded, no parallelism). BGZIP (blocked gzip, BGZF) splits the file into independent blocks, enabling parallel random access and multi-threaded decompression.

By default, Hail treats .gz files as standard gzip (serial). Setting force_bgz=True tells Hail to treat .gz files as block gzip, allowing parallel reading across Spark executors even when the extension isn't .bgz. From the Hail docs: "If True, load .gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not '.bgz', but the file is blocked gzip, so that the file can be read in parallel and not on a single node." Use force_bgz=True only for genuinely BGZF-compressed files (common in genomics); otherwise leave standard gzip handling in place.

Citations:


🏁 Script executed:

# Find the resolve_compression function definition
rg -n "def resolve_compression" --type python

Repository: bigbio/hvantk

Length of output: 87


🏁 Script executed:

# Also check if resolve_compression is imported from elsewhere
rg -n "resolve_compression" hvantk/tables/table_builders.py | head -20

Repository: bigbio/hvantk

Length of output: 235


🏁 Script executed:

# Find and read the resolve_compression function
fd -t f file_utils.py -path "*/data/*"

Repository: bigbio/hvantk

Length of output: 225


🏁 Script executed:

# Search for resolve_compression definition
rg -n "def resolve_compression" --type-list | head -5

Repository: bigbio/hvantk

Length of output: 446


🏁 Script executed:

# Try a simpler search for the function
find . -name "file_utils.py" -type f

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

# Read the file_utils.py to find resolve_compression
cat -n ./hvantk/data/file_utils.py

Repository: bigbio/hvantk

Length of output: 32113


The fallback .gz check incorrectly overrides resolve_compression's explicit False return for standard gzip files.

When resolve_compression detects a standard gzip file, it explicitly returns (filepath, False) to indicate the file cannot be read in parallel. However, the fallback check at line 1144 ignores this signal and sets force_bgz=True anyway because the path ends with .gz.

According to Hail documentation and the learnings from this codebase, force_bgz=True tells Hail the file is block-gzipped (BGZF) and safe for parallel reading. Standard gzip files should NOT have this flag set—Hail will read them single-threaded without it.
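
Whether a .gz file is genuinely BGZF can be sniffed from its gzip header: BGZF sets the FEXTRA flag and carries a 'BC' extra subfield. A heuristic sketch (assuming the 'BC' subfield comes first, as files written by standard BGZF writers do):

```python
import struct

def is_bgzf(path: str) -> bool:
    """Heuristic BGZF sniff: gzip magic + FEXTRA flag + a leading 'BC' subfield.

    Standard gzip files fail this check, which is exactly why force_bgz
    must not be set for them.
    """
    with open(path, "rb") as fh:
        header = fh.read(18)
    if len(header) < 18 or header[:2] != b"\x1f\x8b":
        return False              # not gzip at all
    if not header[3] & 0x04:      # FEXTRA bit clear -> plain gzip, serial read only
        return False
    xlen = struct.unpack("<H", header[10:12])[0]
    # BGZF stores its block size in an extra subfield tagged 'B','C'
    return xlen >= 6 and header[12:14] == b"BC"
```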

🔧 Suggested fix
         if force_bgz:
             kwargs["force_bgz"] = True
-        elif resolved_path.endswith(".gz"):
-            kwargs["force_bgz"] = True
+        # Don't set force_bgz for standard gzip — resolve_compression already handles this
         return hl.import_table(**kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/table_builders.py` around lines 1142 - 1146, The fallback sets
kwargs["force_bgz"]=True whenever resolved_path.endswith(".gz"), which overrides
resolve_compression's explicit False for standard gzip; change the code to
unpack the boolean from resolve_compression (e.g., resolved_path, is_bgz =
resolve_compression(...)) and only set kwargs["force_bgz"]=True when either
force_bgz was explicitly requested or is_bgz is True (do not set it solely based
on resolved_path.endswith(".gz")). Ensure you update references to resolved_path
and the function resolve_compression accordingly and remove the blunt
.endswith(".gz") check.

Comment thread hvantk/tables/table_builders.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
hvantk/ptm/analysis.py (1)

223-224: Materialize annotated before the follow-on Hail actions.

stats, category_counts, and distance_dist are separate actions on the same lazy table. Without persist()/checkpoint(), Hail will rerun annotate_variants_with_ptm() and _label_clinvar() for each one.

♻️ Low-friction fix
     annotated = annotate_variants_with_ptm(clinvar_ht, ptm_ht, flanking_codons)
-    annotated = _label_clinvar(annotated)
+    annotated = _label_clinvar(annotated).persist()

Also applies to: 227-271
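
The recomputation hazard can be illustrated with a toy lazy table; this is a plain-Python analogue, not Hail's actual execution model:

```python
class LazyTable:
    """Toy stand-in for a lazy Hail Table: every action replays the whole
    pipeline unless the table is materialized first."""

    def __init__(self, rows, ops=()):
        self.rows, self.ops = rows, tuple(ops)

    def annotate(self, fn):
        # Lazy: just record the operation, don't run it yet.
        return LazyTable(self.rows, self.ops + (fn,))

    def _compute(self):
        rows = self.rows
        for op in self.ops:
            rows = [op(r) for r in rows]
        return rows

    def aggregate(self, agg):      # an "action": triggers computation
        return agg(self._compute())

    def persist(self):             # materialize once, drop pending ops
        return LazyTable(self._compute())
```

Running two actions on the unpersisted table applies the annotation function twice per row; calling `persist()` first applies it only once, which is the effect the suggested `.persist()` has on `annotated`.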

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 223 - 224, The pipeline recomputes
annotate_variants_with_ptm and _label_clinvar for each subsequent Hail action
because the intermediate Table `annotated` is lazy; persist or checkpoint
`annotated` immediately after calling _label_clinvar (i.e., after annotated =
_label_clinvar(annotated)) so subsequent actions (`stats`, `category_counts`,
`distance_dist`) operate on a materialized table; use annotated =
annotated.persist() or annotated = annotated.checkpoint(...) with an appropriate
path and then continue with the downstream actions to avoid repeated
recomputation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/ptm/analysis.py`:
- Around line 334-353: The is_non_ptm mask currently defined as "is_non_ptm =
~is_ptm & ~is_prox" allows noncoding variants; update the logic to require
coding variants or declare a precondition. Either (A) enforce coding in the
analysis by changing the mask to include a coding flag from gnomad_ht (e.g.,
is_non_ptm = ~is_ptm & ~is_prox & is_coding, where is_coding is derived from
gnomad_ht consequence/variant annotation) and apply the same change where
similar masks are built (see the other block around 375-388), or (B) add a clear
docstring/precondition that gnomad_ht must be pre-filtered to coding variants
and validate that upstream (raise an error if coding annotation/flag is
missing). Reference symbols: is_non_ptm, is_ptm, is_prox, af_field and the PTM
analysis function in hvantk/ptm/analysis.py.
- Around line 165-171: The _vid construction in ht_out (ht.annotate using
ht.locus.contig, ht.locus.position and ht.alleles[1]) only picks the first ALT
and will drop additional alleles for multiallelic records; either enforce
biallelic input or explode multi-allelic sites into one row per ALT. Fix by
validating ht.alleles length (e.g., assert/hl.if_else to raise or filter where
hl.len(ht.alleles) == 2) when you require biallelic, or else transform the table
before building _vid by exploding ALT alleles (use an explode/flatten pattern to
iterate over hl.range(1, hl.len(ht.alleles)) or an equivalent Hail explode on
ht.alleles[1:], create a per-ALT allele variable, and build _vid from that
allele instead of ht.alleles[1]) so each exported row includes the correct ALT.
- Around line 103-124: The code collapses multiallelic ClinVar CLNSIG to
CLNSIG[0] in _extract_clnsig and then labels variants by that single value;
instead, preserve the array and change _label_clinvar to test if any allele’s
CLNSIG matches the pathogenic/benign label sets (use hl.any(lambda x:
p_labels.contains(x), clnsig) and similarly for b_labels), and handle the scalar
case by checking membership directly when clnsig is not an array; update
_extract_clnsig/_label_clinvar to stop using CLNSIG[0] and use hl.any(...) (or
equivalent) against CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS so
multiallelic records are correctly classified.

In `@hvantk/tables/table_builders.py`:
- Around line 1431-1441: create_ptm_sites_tb currently hardcodes adding "chr" to
inputs (contig_remap/ht._contig) which breaks non-prefixed references; change
create_ptm_sites_tb to detect whether the target reference's contigs are
prefixed by checking get_reference(reference_genome).contigs for a "chr" prefix,
then normalize incoming ht.chrom by stripping any leading "chr"/"CHR" to a base
contig name, apply the MT<->M remap on that base (update contig_remap logic),
and finally re-add the "chr" prefix only if the reference uses prefixed contigs
before assigning ht._contig and filtering against valid_contigs (use
valid_contigs = hl.set(hl.literal(hl.get_reference(reference_genome).contigs))
as before).

---

Nitpick comments:
In `@hvantk/ptm/analysis.py`:
- Around line 223-224: The pipeline recomputes annotate_variants_with_ptm and
_label_clinvar for each subsequent Hail action because the intermediate Table
`annotated` is lazy; persist or checkpoint `annotated` immediately after calling
_label_clinvar (i.e., after annotated = _label_clinvar(annotated)) so subsequent
actions (`stats`, `category_counts`, `distance_dist`) operate on a materialized
table; use annotated = annotated.persist() or annotated =
annotated.checkpoint(...) with an appropriate path and then continue with the
downstream actions to avoid repeated recomputation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3d8c33c-a809-45df-a6aa-817e16c324e9

📥 Commits

Reviewing files that changed from the base of the PR and between a58516f and 0dd39f9.

📒 Files selected for processing (3)
  • docs_site/examples/ptm.md
  • hvantk/ptm/analysis.py
  • hvantk/tables/table_builders.py
✅ Files skipped from review due to trivial changes (1)
  • docs_site/examples/ptm.md

Comment thread hvantk/ptm/analysis.py
Comment on lines +103 to +124
def _extract_clnsig(ht: hl.Table) -> hl.Expression:
    """Extract scalar CLNSIG expression, handling array vs string dtype."""
    clnsig_dtype = ht.info.CLNSIG.dtype
    if isinstance(clnsig_dtype, hl.tarray):
        return hl.or_missing(
            hl.is_defined(ht.info.CLNSIG) & (hl.len(ht.info.CLNSIG) > 0),
            ht.info.CLNSIG[0],
        )
    return ht.info.CLNSIG


def _label_clinvar(ht: hl.Table) -> hl.Table:
    """Add _label field ('P', 'B', or 'other') based on ClinVar CLNSIG."""
    clnsig = _extract_clnsig(ht)
    p_labels = hl.literal(CLINVAR_PATHOGENIC_LABELS)
    b_labels = hl.literal(CLINVAR_BENIGN_LABELS)
    return ht.annotate(
        _label=hl.case()
        .when(hl.is_defined(clnsig) & p_labels.contains(clnsig), "P")
        .when(hl.is_defined(clnsig) & b_labels.contains(clnsig), "B")
        .default("other")
    )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and locate the files mentioned
find . -type f -name "*.py" | grep -E "(analysis|table_builders)" | head -20

Repository: bigbio/hvantk

Length of output: 162


🏁 Script executed:

# Read the create_clinvar_tb function
cat -n hvantk/tables/table_builders.py | sed -n '248,310p'

Repository: bigbio/hvantk

Length of output: 2381


🏁 Script executed:

# Also check the imports and constants needed for context
head -100 hvantk/ptm/analysis.py | cat -n

Repository: bigbio/hvantk

Length of output: 4391


🏁 Script executed:

# Search for CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS
grep -n "CLINVAR_PATHOGENIC_LABELS\|CLINVAR_BENIGN_LABELS" hvantk/ptm/analysis.py

Repository: bigbio/hvantk

Length of output: 252


🏁 Script executed:

# Check if split_multiallelic is used elsewhere in the codebase
grep -rn "split_multiallelic" --include="*.py" .

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Find ptm_landscape function and see how it uses _label_clinvar
grep -n "def ptm_landscape\|_label_clinvar" hvantk/ptm/analysis.py

Repository: bigbio/hvantk

Length of output: 175


🏁 Script executed:

# Read the ptm_landscape function to understand the context
cat -n hvantk/ptm/analysis.py | sed -n '200,300p'

Repository: bigbio/hvantk

Length of output: 4533


🏁 Script executed:

# Look for any documentation or tests that show how multiallelic VCFs are handled
find . -type f \( -name "*.py" -o -name "*.md" \) | xargs grep -l "multiallelic\|split_multiallelic" 2>/dev/null | head -10

Repository: bigbio/hvantk

Length of output: 127


🏁 Script executed:

# Check how multiallelic is handled in converters.py
grep -n "multiallelic\|split_multiallelic" hvantk/hgc/converters.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 1507


🏁 Script executed:

# Check the test file
grep -n "multiallelic\|split_multiallelic" hvantk/tests/ancestry/test_filter_hail.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 316


🏁 Script executed:

# Look at the VCF specification and understand CLNSIG in multiallelic context
# First, let's see if there's a README or documentation about ClinVar data handling
find . -type f \( -name "README*" -o -name "*.md" \) | xargs grep -l "ClinVar\|CLNSIG" 2>/dev/null

Repository: bigbio/hvantk

Length of output: 501


🏁 Script executed:

# Check the actual VCF header or info field definitions
grep -rn "CLNSIG\|INFO.*ClinVar" --include="*.py" . | head -20

Repository: bigbio/hvantk

Length of output: 1987


🏁 Script executed:

# Let's look at clinvar_streamer.py to understand the CLNSIG structure better
cat -n hvantk/data/clinvar_streamer.py | sed -n '150,170p'

Repository: bigbio/hvantk

Length of output: 842


🏁 Script executed:

# Check psroc/pipeline.py to see if it has the same CLNSIG extraction pattern and any comments
cat -n hvantk/psroc/pipeline.py | sed -n '1075,1095p'

Repository: bigbio/hvantk

Length of output: 905


🏁 Script executed:

# Look at the ClinVar documentation file
cat docs_site/examples/clinvar.md | head -100

Repository: bigbio/hvantk

Length of output: 2454


🏁 Script executed:

# Check if there are any comments or issues about multiallelic CLNSIG handling
grep -rn "multiallelic.*CLNSIG\|CLNSIG.*multiallelic\|collapse.*CLNSIG\|CLNSIG.*collapse" --include="*.py" --include="*.md" .

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Check how annotate_variants_with_ptm handles the table
grep -n "def annotate_variants_with_ptm" hvantk/ptm/annotate.py

Repository: bigbio/hvantk

Length of output: 91


🏁 Script executed:

# Let's understand the VCF format better - check if there's any filtering or splitting elsewhere
grep -rn "split_multiallelic\|\.split_rows()" --include="*.py" hvantk/

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Check the ClinVar VCF spec or any comments about how CLNSIG values work
grep -rn "CLNSIG.*array\|array.*CLNSIG\|CLNSIG.*multiple\|multiple.*CLNSIG" --include="*.py" --include="*.md" . -i

Repository: bigbio/hvantk

Length of output: 410


🏁 Script executed:

# Let's check what Hail's import_vcf actually does by looking at how alleles are structured
cat -n hvantk/tables/table_builders.py | sed -n '285,310p'

Repository: bigbio/hvantk

Length of output: 998


🏁 Script executed:

# Check if there are any comments or TODOs about CLNSIG handling
grep -rn "TODO\|FIXME\|XXX\|HACK\|BUG" hvantk/ptm/analysis.py hvantk/psroc/pipeline.py | grep -i "clnsig\|multiallelic"

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Look for any tests that show expected behavior with multiallelic CLNSIG
find . -path ./node_modules -prune -o -type f -name "test_*.py" -o -name "*_test.py" | xargs grep -l "CLNSIG\|_extract_clnsig" 2>/dev/null

Repository: bigbio/hvantk

Length of output: 41


🏁 Script executed:

# Check the full _extract_clnsig function context and any comments around it
cat -n hvantk/ptm/analysis.py | sed -n '100,115p'

Repository: bigbio/hvantk

Length of output: 754


🏁 Script executed:

# Compare how clinvar_streamer handles CLNSIG vs how ptm handles it - show surrounding context
cat -n hvantk/data/clinvar_streamer.py | sed -n '145,175p'

Repository: bigbio/hvantk

Length of output: 1281


Don't collapse multiallelic ClinVar CLNSIG to the first entry.

create_clinvar_tb() in hvantk/tables/table_builders.py:292-298 imports raw VCF with hl.import_vcf(...).rows() without splitting multiallelic sites. For multiallelic records, CLNSIG is stored as an array with one value per alternate allele. However, _extract_clnsig() extracts only CLNSIG[0], discarding significance information for alleles at positions 1+. This causes variants with mixed-significance alleles (e.g., alt[0]=Pathogenic, alt[1]=Benign) to be silently mislabeled based only on the first allele, leading to incorrect counts and enrichment calculations in ptm_landscape().

The correct pattern is shown in hvantk/data/clinvar_streamer.py:155-157, which uses .any() to check if any array element matches the label set. Note that hvantk/psroc/pipeline.py:1077-1085 contains the same problematic CLNSIG[0] logic.
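
The difference between the two patterns in plain Python (the label sets are abbreviated for illustration):

```python
PATHOGENIC = {"Pathogenic", "Likely_pathogenic"}  # abbreviated label sets
BENIGN = {"Benign", "Likely_benign"}

def label_first_only(clnsig):
    """The flagged pattern: classify by CLNSIG[0] alone."""
    first = clnsig[0] if clnsig else None
    if first in PATHOGENIC:
        return "P"
    if first in BENIGN:
        return "B"
    return "other"

def label_any(clnsig):
    """The fix: classify 'P' if any per-allele value is pathogenic,
    mirroring hl.any(lambda x: p_labels.contains(x), clnsig)."""
    if any(x in PATHOGENIC for x in clnsig):
        return "P"
    if any(x in BENIGN for x in clnsig):
        return "B"
    return "other"
```

On a mixed-significance multiallelic record the first-only pattern silently reports the first allele's label, while the any-based check recovers the pathogenic signal.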

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 103 - 124, The code collapses
multiallelic ClinVar CLNSIG to CLNSIG[0] in _extract_clnsig and then labels
variants by that single value; instead, preserve the array and change
_label_clinvar to test if any allele’s CLNSIG matches the pathogenic/benign
label sets (use hl.any(lambda x: p_labels.contains(x), clnsig) and similarly for
b_labels), and handle the scalar case by checking membership directly when
clnsig is not an array; update _extract_clnsig/_label_clinvar to stop using
CLNSIG[0] and use hl.any(...) (or equivalent) against CLINVAR_PATHOGENIC_LABELS
and CLINVAR_BENIGN_LABELS so multiallelic records are correctly classified.

Comment thread hvantk/ptm/analysis.py
Comment on lines +165 to +171
ht_out = ht.annotate(
    _vid=hl.delimit(
        [
            hl.str(ht.locus.contig),
            hl.str(ht.locus.position),
            ht.alleles[0],
            ht.alleles[1],

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed: repository listing and source inspection of bigbio/hvantk (fd, git ls-files, rg, and cat/sed over hvantk/ptm/analysis.py, hvantk/ptm/pipeline.py, hvantk/ptm/annotate.py, hvantk/commands/ptm_cli.py, and hvantk/tests/test_ptm.py).

Hardcoded single ALT export will lose data from multi-allelic variants.

The _vid formatter at line 171 accesses only alleles[1], so variants with multiple ALT alleles (e.g., ["A", "T", "C"]) will silently truncate to ref:alt1 and drop the additional ALTs. The exported output will be incomplete for downstream tools when processing unsplit tables. Either validate that the input is biallelic and document this constraint, or emit one row per ALT.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 165-171, the _vid construction in
ht_out (ht.annotate using ht.locus.contig, ht.locus.position and ht.alleles[1])
only picks the first ALT and will drop additional alleles for multiallelic
records; either enforce biallelic input or explode multi-allelic sites into one
row per ALT. Fix by validating ht.alleles length (e.g., assert/hl.if_else to
raise or filter where hl.len(ht.alleles) == 2) when you require biallelic, or
else transform the table before building _vid by exploding ALT alleles (use an
explode/flatten pattern to iterate over hl.range(1, hl.len(ht.alleles)) or an
equivalent Hail explode on ht.alleles[1:], create a per-ALT allele variable, and
build _vid from that allele instead of ht.alleles[1]) so each exported row
includes the correct ALT.
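As a sketch of the second option (one exported row per ALT), the per-allele expansion can be modeled in plain Python; the Hail equivalent would explode over `hl.range(1, hl.len(ht.alleles))`. The helper name and ID format here are illustrative:

```python
def explode_vids(contig, position, alleles):
    """Build one contig:pos:ref:alt identifier per ALT allele.

    Models the explode fix: alleles[0] is REF, alleles[1:] are ALTs,
    so a triallelic site yields two rows instead of one truncated _vid.
    """
    ref = alleles[0]
    return [f"{contig}:{position}:{ref}:{alt}" for alt in alleles[1:]]

print(explode_vids("chr1", 12345, ["A", "T", "C"]))
# ['chr1:12345:A:T', 'chr1:12345:A:C']
```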

Comment thread hvantk/ptm/analysis.py
Comment on lines +334 to +353
"""Population-level allele frequency analysis at PTM sites (Q3).

Annotates gnomAD variants with PTM proximity and compares allele
frequency distributions between PTM-site and non-PTM coding variants.
This is where Hail scalability matters (~700M gnomAD variants).

Parameters
----------
gnomad_ht : hl.Table
gnomAD variant table with allele frequency field.
ptm_ht : hl.Table
PTM sites Hail Table from create_ptm_sites_tb.
output_dir : str
Directory for output files (population_summary.json).
ccr_ht : hl.Table, optional
CCR table for constrained region cross-reference.
af_field : str
Name of the allele frequency field in gnomad_ht (default: "AF").
flanking_codons : int
Flanking codons for proximal window (default: 5).

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed: source inspection of bigbio/hvantk (find, wc, sed, grep, and rg over hvantk/ptm/analysis.py, hvantk/ptm/annotate.py, and hvantk/commands/ptm_cli.py).

The "non-PTM coding" filter is not enforced in the code.

The docstring promises comparison against "non-PTM coding variants," but is_non_ptm = ~is_ptm & ~is_prox includes all noncoding variants from gnomAD. Either add a documented precondition that gnomad_ht must be pre-filtered to coding variants, or add explicit filtering (e.g., is_non_ptm = ~is_ptm & ~is_prox & is_coding).

Also applies to: 375-388

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 334-353, the is_non_ptm mask currently
defined as "is_non_ptm = ~is_ptm & ~is_prox" allows noncoding variants; update
the logic to require coding variants or declare a precondition. Either (A)
enforce coding in the analysis by changing the mask to include a coding flag
from gnomad_ht (e.g., is_non_ptm = ~is_ptm & ~is_prox & is_coding, where
is_coding is derived from gnomad_ht consequence/variant annotation) and apply
the same change where similar masks are built (see the other block around
375-388), or (B) add a clear docstring/precondition that gnomad_ht must be
pre-filtered to coding variants and validate that upstream (raise an error if
coding annotation/flag is missing). Reference symbols: is_non_ptm, is_ptm,
is_prox, af_field and the PTM analysis function in hvantk/ptm/analysis.py.
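The difference between the two masks can be shown with a plain-Python truth-table sketch (function and flag names are illustrative): under the current logic, a noncoding gnomAD variant lands in the "non-PTM" background set.

```python
def non_ptm_mask_current(is_ptm, is_prox):
    """Mask as currently written: lets noncoding variants through."""
    return (not is_ptm) and (not is_prox)

def non_ptm_mask_fixed(is_ptm, is_prox, is_coding):
    """Option (A): additionally require a coding flag."""
    return (not is_ptm) and (not is_prox) and is_coding

# An intronic variant: not at a PTM site, not proximal, not coding.
print(non_ptm_mask_current(False, False))       # True  -> polluted background
print(non_ptm_mask_fixed(False, False, False))  # False -> correctly excluded
```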

Comment on lines +1431 to +1441
# Remap contig names to match GRCh38 (e.g., MT -> M for chrM)
contig_remap = hl.dict({"MT": "M"})
ht = ht.annotate(
    _contig=hl.str("chr") + contig_remap.get(ht.chrom, ht.chrom),
)

# Filter to valid contigs in the reference genome
valid_contigs = hl.set(
    hl.literal(hl.get_reference(reference_genome).contigs)
)
ht = ht.filter(valid_contigs.contains(ht._contig))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query: What contig names do Hail's built-in GRCh37 and GRCh38 reference genomes use by default?

💡 Result: Hail's built-in GRCh37 reference genome uses numeric contig names by default: "1", "2", ..., "22", "X", "Y", "MT". GRCh38 uses prefixed names: "chr1", "chr2", ..., "chr22", "chrX", "chrY", "chrM".

🏁 Scripts executed: source inspection of bigbio/hvantk (fd, wc, and sed over hvantk/tables/table_builders.py around the flagged lines).


Make PTM contig normalization reference-genome-aware.

The code hard-codes "chr" prefix for all inputs, which breaks with reference_genome="GRCh37". GRCh37's contigs are unprefixed ("1", "MT"), while GRCh38's are prefixed ("chr1", "chrM"). With GRCh37, input "1" becomes "chr1" (invalid) and is filtered out; already-prefixed inputs become "chrchr1". Additionally, the MT→M remapping assumes GRCh38's "chrM" format and will mismatch GRCh37's "MT".

The fix should detect the reference genome's contig naming convention at the start of create_ptm_sites_tb, normalize inputs to a base form (removing any existing prefix), remap MT/M appropriately, and conditionally re-add the prefix only if the reference uses one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/table_builders.py` around lines 1431-1441,
create_ptm_sites_tb currently hardcodes adding "chr" to inputs
(contig_remap/ht._contig) which breaks non-prefixed references; change
create_ptm_sites_tb to detect whether the target reference's contigs are
prefixed by checking get_reference(reference_genome).contigs for a "chr" prefix,
then normalize incoming ht.chrom by stripping any leading "chr"/"CHR" to a base
contig name, apply the MT<->M remap on that base (update contig_remap logic),
and finally re-add the "chr" prefix only if the reference uses prefixed contigs
before assigning ht._contig and filtering against valid_contigs (use
valid_contigs = hl.set(hl.literal(hl.get_reference(reference_genome).contigs))
as before).
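A minimal plain-Python sketch of the reference-aware normalization described above, with literal contig lists standing in for `hl.get_reference(...).contigs`; the helper name is illustrative, not part of the module:

```python
def normalize_contig(chrom, reference_contigs):
    """Normalize a contig name to the target reference's convention.

    Strips any existing 'chr'/'CHR' prefix, remaps mitochondrial
    naming (GRCh38 uses chrM, GRCh37 uses MT), then re-adds 'chr'
    only if the reference's own contigs are prefixed.
    """
    base = chrom[3:] if chrom.lower().startswith("chr") else chrom
    prefixed = any(c.startswith("chr") for c in reference_contigs)
    if base in ("MT", "M"):
        base = "M" if prefixed else "MT"
    return f"chr{base}" if prefixed else base

GRCH37 = ["1", "2", "22", "X", "Y", "MT"]
GRCH38 = ["chr1", "chr2", "chr22", "chrX", "chrY", "chrM"]

print(normalize_contig("chr1", GRCH37))  # '1'  (no 'chrchr1', not filtered out)
print(normalize_contig("MT", GRCH38))    # 'chrM'
print(normalize_contig("chrM", GRCH37))  # 'MT'
```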


Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants