
Add new PTM (Post-Translational Modification) analysis module#77

Merged
enriquea merged 23 commits into main from dev
Mar 25, 2026
Conversation

Collaborator

@enriquea enriquea commented Mar 24, 2026

This pull request introduces comprehensive documentation and CLI support for the new PTM (Post-Translational Modification) analysis module in the hvantk toolkit. The changes include updates to the architecture overview, usage guides, data source documentation, and a detailed PTM tool reference. The PTM workflow is now fully documented with example commands, Python API usage, and integration with existing variant analysis pipelines.

PTM Module Documentation and CLI Integration

Documentation and Architecture:

  • Added a new section for the ptm/ module in the architecture overview, detailing its structure and purpose for PTM variant analysis. [1] [2]
  • Introduced ptm_cli.py in the CLI commands list and described associated test files. [1] [2]
  • Added a dedicated PTM example in the workflow index, including a quick start guide and links to further documentation. [1] [2]

Tool and Usage Documentation:

  • Created a comprehensive docs_site/tools/ptm.md reference describing the PTM pipeline, command-line interface, Python API, workflow phases, annotation fields, output files, and module structure.
  • Added a detailed example walkthrough (docs_site/examples/ptm.md) covering all steps from data preparation to report generation, with expected outputs and code snippets.
  • Updated the usage guide to include PTM site table creation and clarified automatic vs. manual data download options.
  • Documented UniProt PTM and Ensembl GTF data sources, including download instructions and integration with the PTM build command.

CLI and Core Functionality Enhancements

Command-line Integration:

  • Registered the UniProt PTM downloader in the download_cli command group, enabling automated acquisition of PTM data. [1] [2]
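
For readers unfamiliar with the pattern, the wiring amounts to registering a named download command in a group. A minimal plain-Python sketch of such a registry — illustrative only; the real CLI is built on command groups, and none of these names (`DOWNLOADERS`, `register_downloader`, `run_download`) are hvantk's actual API:

```python
# Hypothetical sketch of a downloader registry; the real hvantk CLI likely
# uses Click command groups, and these names are illustrative, not its API.
from typing import Callable, Dict

DOWNLOADERS: Dict[str, Callable[[], str]] = {}

def register_downloader(name: str):
    """Decorator that registers a download command under `name`."""
    def wrap(fn: Callable[[], str]) -> Callable[[], str]:
        DOWNLOADERS[name] = fn
        return fn
    return wrap

@register_downloader("uniprot-ptm")
def download_uniprot_ptm() -> str:
    # The real module pages through the UniProt REST API and writes a
    # normalized TSV; here we just return a placeholder path.
    return "data/uniprot_ptm.tsv"

def run_download(name: str) -> str:
    """Dispatch to a registered downloader by name."""
    if name not in DOWNLOADERS:
        raise SystemExit(f"unknown downloader: {name}")
    return DOWNLOADERS[name]()
```

The decorator-based registration keeps the download CLI open to new data sources without editing a central dispatch table.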

Code Consistency:

  • Updated ClinVar annotation logic to use centralized pathogenic/benign label sets from hvantk.core.constants for consistency across modules.
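
Centralizing the label sets means every module classifies ClinVar significance strings identically. A minimal sketch of the idea — the helper and the example label values here are illustrative assumptions; the actual lists live in hvantk.core.constants and may differ:

```python
# Illustrative sketch of centralized ClinVar label sets; the real lists live
# in hvantk.core.constants and these example values are assumptions.
CLINVAR_PATHOGENIC_LABELS = frozenset({
    "Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic",
})
CLINVAR_BENIGN_LABELS = frozenset({
    "Benign", "Likely_benign", "Benign/Likely_benign",
})

def classify_clinvar(clnsig: list) -> str:
    """Classify a variant from its CLNSIG values using the shared sets."""
    if any(s in CLINVAR_PATHOGENIC_LABELS for s in clnsig):
        return "pathogenic"
    if any(s in CLINVAR_BENIGN_LABELS for s in clnsig):
        return "benign"
    return "uncertain"
```

With a single source of truth, annotation, PSROC, and PTM analyses cannot drift apart on which labels count as pathogenic or benign.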

Summary by CodeRabbit

  • New Features

    • Full PTM analysis workflow added: new top-level hvantk ptm commands (build, annotate, landscape, export-strata, population, report), PTM table builder, plotting utilities, and HTML report generation.
    • UniProt PTM downloader and a new mktable ptm-sites command; CLI root now exposes the ptm group.
  • Documentation

    • Extensive PTM docs, examples, quick-start CLI commands, and data-source guidance.
  • Tests

    • New end-to-end PTM test suite.

enriquea added 19 commits March 24, 2026 00:01
@enriquea enriquea requested a review from Copilot March 24, 2026 22:06
@enriquea enriquea self-assigned this Mar 24, 2026
@enriquea enriquea added documentation Improvements or additions to documentation enhancement New feature or request labels Mar 24, 2026
Contributor

coderabbitai bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

Adds a new PTM domain: UniProt PTM download, Ensembl GTF parsing, protein→genome mapping, PTM site Hail table builder, variant annotation, landscape/population analyses, plotting, HTML report generation, CLI entrypoints, utilities (sanitization/renaming), and tests; plus documentation updates.
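
The build pipeline is driven by a config dataclass with lightweight validation, as the test snippets later in this review suggest. A rough sketch of that shape — the field names `output_dir`/`output_ht` and the validate-returns-errors contract come from the test snippets; everything else is an assumption:

```python
# Sketch of the PTM build-pipeline config. Only output_dir/output_ht and the
# validate() contract are taken from this PR's test snippets; the rest is
# assumed for illustration.
from dataclasses import dataclass
from typing import List

@dataclass
class PTMBuildConfig:
    output_dir: str = ""
    output_ht: str = ""

    def validate(self) -> List[str]:
        """Return human-readable errors; an empty list means valid."""
        errors = []
        if not self.output_dir:
            errors.append("output_dir is required")
        if not self.output_ht:
            errors.append("output_ht is required")
        return errors
```

Returning a list of errors (rather than raising on the first problem) lets the CLI report every missing field in one pass.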

Changes

Cohort / File(s) Summary
Documentation
docs_site/architecture.md, docs_site/examples/index.md, docs_site/examples/ptm.md, docs_site/guide/data-sources.md, docs_site/guide/usage.md, docs_site/tools/ptm.md
Added PTM architecture, examples, data-source notes, CLI quickstarts, Python API examples, and tool reference pages.
PTM package & constants
hvantk/ptm/__init__.py, hvantk/ptm/constants.py
New public ptm package surface and configuration constants (UniProt/Ensembl params, PTM categories, output columns, defaults).
Mapping & pipeline orchestration
hvantk/ptm/mapper.py, hvantk/ptm/pipeline.py
GTF parser, codon-level residue→genome mapping, transcript resolution, PTM build pipeline (downloads, mapping, mapped TSV → Hail HT) with config/result dataclasses.
Annotation & analysis
hvantk/ptm/annotate.py, hvantk/ptm/analysis.py
Variant annotation producing is_ptm/is_proximal/distance, export strata, ptm_landscape (ClinVar enrichment) and ptm_population (AF summaries) result types.
Plotting & reporting
hvantk/ptm/plot.py, hvantk/ptm/report.py
Matplotlib plot helpers (landscape, overlap, distance, AF), figure encoding, and HTML report generation with embedded or referenced images.
Datasets & downloaders
hvantk/datasets/uniprot_ptm_datasets.py, hvantk/commands/uniprot_ptm_downloader.py, hvantk/commands/download_cli.py
New UniProt PTM dataset module (pagination, TSV output), CLI downloader uniprot-ptm, and registration in download CLI.
Table builders & CLI wiring
hvantk/tables/table_builders.py, hvantk/tables/registry.py, hvantk/commands/make_table_cli.py, hvantk/hvantk.py
Added create_ptm_sites_tb builder, registered "ptm-sites" in registry, new mktable ptm-sites CLI command, and registered ptm command group in root CLI.
Utilities & sanitization
hvantk/data/file_utils.py, hvantk/datasets/gencc_datasets.py, hvantk/utils/table_utils.py
Added sanitize_tsv to clean TSVs and applied to GenCC downloads; added field-name normalization and str_to_bool for table imports and rename-map building.
Annotation refactor & label centralization
hvantk/annotation/annotate.py, hvantk/core/constants.py, hvantk/psroc/pipeline.py
Centralized ClinVar pathogenic/benign label lists in core.constants; updated annotation and psroc pipeline to import shared constants.
Testing
hvantk/tests/test_ptm.py
New tests for GTF mapping, transcript resolution, map_ptm_sites integration, CLI help wiring, and PTM build config validation.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as "hvantk ptm"
    participant Pipeline as "ptm_build_pipeline"
    participant UniProt as "UniProtPTMDataset"
    participant GTF as "parse_ensembl_gtf"
    participant Mapper as "map_ptm_sites"
    participant Table as "create_ptm_sites_tb"
    participant Annotate as "annotate_variants_with_ptm"
    participant Analysis as "ptm_landscape/ptm_population"
    participant Plot as "plot_*"
    participant Report as "generate_report"

    User->>CLI: ptm build (config)
    CLI->>Pipeline: ptm_build_pipeline(config)
    Pipeline->>UniProt: download_uniprot_ptm() (if needed)
    Pipeline->>GTF: download_ensembl_gtf()/parse_ensembl_gtf(gtf_path)
    GTF-->>Pipeline: GTFData
    Pipeline->>Mapper: map_ptm_sites(ptm_tsv, gtf_data)
    Mapper-->>Pipeline: mapped_tsv, PTMBuildResult
    Pipeline->>Table: create_ptm_sites_tb(mapped_tsv -> HT)
    Table-->>CLI: output HT created

    User->>CLI: ptm annotate
    CLI->>Annotate: annotate_variants_with_ptm(variants_ht, ptm_ht)
    Annotate-->>CLI: annotated HT

    User->>CLI: ptm landscape / population
    CLI->>Analysis: ptm_landscape(...) or ptm_population(...)
    Analysis-->>CLI: Result objects

    CLI->>Plot: plot_* (result)
    Plot-->>CLI: PNGs

    User->>CLI: ptm report
    CLI->>Report: generate_report(landscape_result, population_result)
    Report-->>CLI: HTML report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

Review effort 4/5

Suggested reviewers

  • ypriverol

Poem

🐇 I hopped through UniProt rows with delight,

mapped residues to codons by moonlight,
tables, plots, and a report in a sweep—
now PTM sites dance where variants sleep.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.22%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changeset: introducing a new PTM (Post-Translational Modification) analysis module. This accurately reflects the comprehensive additions of documentation, CLI integration, pipeline code, and API utilities across the changeset. |


Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new PTM (Post-Translational Modification) analysis module to hvantk, including a build pipeline (UniProt PTM + Ensembl GTF → mapped TSV → Hail Table), variant annotation utilities, downstream analyses (landscape/population), plotting/reporting, plus CLI and documentation to integrate it into existing workflows.

Changes:

  • Introduces hvantk.ptm (mapping, annotation, analyses, plotting, HTML reporting) and wires it into the main CLI (hvantk ptm ...).
  • Adds UniProt PTM dataset download support (hvantk download uniprot-ptm) and a PTM sites table builder registered in the table registry / mktable.
  • Improves consistency/utilities: centralized ClinVar label constants, flexible header renaming + string→bool coercion, and a TSV sanitizer used for GenCC downloads.

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
hvantk/utils/table_utils.py Adds flexible header rename mapping and a shared string→bool coercion helper.
hvantk/tests/test_ptm.py Adds unit/integration-style tests for PTM mapping, resolution, pipeline round-trip, and CLI help.
hvantk/tables/table_builders.py Uses new rename/bool helpers for COSMIC CGC; adds create_ptm_sites_tb to build a PTM sites HT.
hvantk/tables/registry.py Registers ptm-sites table adapter.
hvantk/ptm/report.py Adds static HTML report generator combining PTM analysis results and plots.
hvantk/ptm/plot.py Adds plotting utilities for PTM landscape/population analyses and figure encoding.
hvantk/ptm/pipeline.py Adds PTM build pipeline orchestration, download helpers, and mapping TSV writer.
hvantk/ptm/mapper.py Adds local protein→genome coordinate mapper and Ensembl GTF parser.
hvantk/ptm/constants.py Adds PTM module constants (URLs, categories, output columns).
hvantk/ptm/annotate.py Adds Hail-based variant annotation against PTM sites/proximity.
hvantk/ptm/analysis.py Adds Q1/Q3 analyses and export of PTM/non-PTM strata; uses centralized ClinVar labels.
hvantk/ptm/__init__.py Exposes PTM public API.
hvantk/psroc/pipeline.py Switches PSROC ClinVar label constants to centralized core.constants (backward compatible aliases).
hvantk/hvantk.py Registers the new ptm CLI group with the main hvantk CLI.
hvantk/datasets/uniprot_ptm_datasets.py Implements UniProt PTM REST API download to a normalized TSV.
hvantk/datasets/gencc_datasets.py Sanitizes GenCC TSV post-download for Hail import compatibility.
hvantk/data/file_utils.py Adds sanitize_tsv() helper for cleaning TSVs prior to Hail import.
hvantk/core/constants.py Adds centralized ClinVar pathogenic/benign label lists.
hvantk/commands/uniprot_ptm_downloader.py Adds CLI command to download UniProt PTM data.
hvantk/commands/ptm_cli.py Adds hvantk ptm command group and subcommands for build/annotate/analyses/report.
hvantk/commands/make_table_cli.py Adds hvantk mktable ptm-sites builder command.
hvantk/commands/download_cli.py Registers hvantk download uniprot-ptm.
hvantk/annotation/annotate.py Updates ClinVar label logic to use centralized constants.
docs_site/tools/ptm.md Adds PTM tool reference documentation.
docs_site/guide/usage.md Documents PTM build usage in the usage guide.
docs_site/guide/data-sources.md Documents UniProt PTM + Ensembl GTF as data sources.
docs_site/examples/ptm.md Adds an end-to-end PTM example walkthrough.
docs_site/examples/index.md Links PTM example from the examples index.
docs_site/architecture.md Updates architecture overview to include the new ptm/ module and tests.


Comment thread hvantk/data/file_utils.py Outdated
Comment on lines +112 to +118
```python
for row in rows:
    # Flatten any embedded newlines within cells
    cleaned = [
        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
        for cell in row
    ]
    fh.write(delimiter.join(cleaned) + "\n")
```

Copilot AI Mar 24, 2026


When rewriting the file, sanitize_tsv writes delimiter.join(cleaned) without CSV quoting/escaping and does not validate that each row has the same column count as the header. If any cell contains the delimiter/quotes, this can corrupt the TSV structure for downstream parsing. Consider using csv.writer with the same delimiter (and optionally enforcing/repairing a fixed column count) so the output remains a valid delimited file.

Suggested change

```diff
-for row in rows:
-    # Flatten any embedded newlines within cells
-    cleaned = [
-        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
-        for cell in row
-    ]
-    fh.write(delimiter.join(cleaned) + "\n")
+writer = csv.writer(
+    fh,
+    delimiter=delimiter,
+    quotechar='"',
+    lineterminator="\n",
+)
+for row_index, row in enumerate(rows):
+    # Ensure consistent column count with header
+    if len(row) < n_cols:
+        logger.warning(
+            "sanitize_tsv: row %d has %d columns, expected %d; padding with empty fields",
+            row_index, len(row), n_cols,
+        )
+        row = row + [""] * (n_cols - len(row))
+    elif len(row) > n_cols:
+        logger.warning(
+            "sanitize_tsv: row %d has %d columns, expected %d; truncating extra columns",
+            row_index, len(row), n_cols,
+        )
+        row = row[:n_cols]
+    # Flatten any embedded newlines within cells and remove delimiters
+    cleaned = [
+        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
+        for cell in row
+    ]
+    writer.writerow(cleaned)
```

Comment on lines +59 to +76
```python
target_names = set(field_map.values())
rename = {}
for canonical_raw, target in field_map.items():
    norm = _normalize_field_name(canonical_raw)
    actual = norm_to_actual.get(norm)
    if actual is not None and actual != target:
        # Skip if actual field name already equals the target or
        # if the target name is already taken by another field
        if actual not in target_names:
            rename[actual] = target
    # Also try matching by the target name's normalized form
    # (handles cases where actual headers already use the target naming)
    if actual is None:
        norm_target = _normalize_field_name(target)
        actual_by_target = norm_to_actual.get(norm_target)
        if actual_by_target is not None and actual_by_target != target:
            if actual_by_target not in target_names:
                rename[actual_by_target] = target
```

Copilot AI Mar 24, 2026


build_rename_map() attempts to avoid rename collisions, but the current check (if actual not in target_names) does not prevent renaming a field to a target name that already exists in actual_fields (e.g., both "gene_symbol" and "Gene Symbol" present). hl.Table.rename() will fail with duplicate fields in that case. Consider skipping a rename when target in actual_fields (or when target is already produced by another rename) rather than checking actual against target_names.

Suggested change

```diff
-target_names = set(field_map.values())
-rename = {}
-for canonical_raw, target in field_map.items():
-    norm = _normalize_field_name(canonical_raw)
-    actual = norm_to_actual.get(norm)
-    if actual is not None and actual != target:
-        # Skip if actual field name already equals the target or
-        # if the target name is already taken by another field
-        if actual not in target_names:
-            rename[actual] = target
-    # Also try matching by the target name's normalized form
-    # (handles cases where actual headers already use the target naming)
-    if actual is None:
-        norm_target = _normalize_field_name(target)
-        actual_by_target = norm_to_actual.get(norm_target)
-        if actual_by_target is not None and actual_by_target != target:
-            if actual_by_target not in target_names:
-                rename[actual_by_target] = target
+# Track all names that are or will be used as targets to avoid collisions
+used_targets = set(actual_fields)
+rename = {}
+for canonical_raw, target in field_map.items():
+    norm = _normalize_field_name(canonical_raw)
+    actual = norm_to_actual.get(norm)
+    if actual is not None and actual != target:
+        # Skip if actual field name already equals the target or
+        # if the target name is already taken by an existing or renamed field
+        if target not in used_targets:
+            rename[actual] = target
+            used_targets.add(target)
+    # Also try matching by the target name's normalized form
+    # (handles cases where actual headers already use the target naming)
+    if actual is None:
+        norm_target = _normalize_field_name(target)
+        actual_by_target = norm_to_actual.get(norm_target)
+        if actual_by_target is not None and actual_by_target != target:
+            if target not in used_targets:
+                rename[actual_by_target] = target
+                used_targets.add(target)
```

Comment thread hvantk/commands/ptm_cli.py Outdated
- build: Download PTM data, map coordinates, build Hail Table
- annotate: Annotate a variant table with PTM site information
- landscape: PTM-variant overlap analysis (Q1)
- evaluate: Predictor performance at PTM sites (Q2)

Copilot AI Mar 24, 2026


Module docstring lists an evaluate subcommand, but this CLI defines export-strata for Q2 and does not implement evaluate. Please update the docstring to reflect the actual subcommands to avoid confusing users.

Suggested change

```diff
-    - evaluate: Predictor performance at PTM sites (Q2)
+    - export-strata: Export PTM/non-PTM variant strata for predictor evaluation (Q2)
```

Comment thread hvantk/commands/ptm_cli.py Outdated
Comment on lines +233 to +239
```python
from hvantk.ptm.plot import (
    plot_landscape_summary,
    plot_overlap_by_category,
    plot_distance_distribution,
)
import matplotlib
matplotlib.use("Agg")
```

Copilot AI Mar 24, 2026


matplotlib.use("Agg") is called after importing hvantk.ptm.plot, but that module imports matplotlib.pyplot at import time. In headless environments this can fail or select an interactive backend before Agg is set. To reliably force a non-interactive backend, call matplotlib.use("Agg") before importing any module that imports pyplot (i.e., move the import matplotlib; matplotlib.use("Agg") above the from hvantk.ptm.plot import ...).

Suggested change

```diff
-from hvantk.ptm.plot import (
-    plot_landscape_summary,
-    plot_overlap_by_category,
-    plot_distance_distribution,
-)
-import matplotlib
-matplotlib.use("Agg")
+import matplotlib
+matplotlib.use("Agg")
+from hvantk.ptm.plot import (
+    plot_landscape_summary,
+    plot_overlap_by_category,
+    plot_distance_distribution,
+)
```

Comment thread hvantk/commands/ptm_cli.py Outdated
Comment on lines +333 to +335
```python
from hvantk.ptm.plot import plot_population_af
import matplotlib
matplotlib.use("Agg")
```

Copilot AI Mar 24, 2026


Same backend-order issue here: matplotlib.use("Agg") is invoked after importing plot_population_af from hvantk.ptm.plot (which imports pyplot at module import). Move the backend selection before importing plot functions to avoid failures on systems without a display.

Suggested change

```diff
-from hvantk.ptm.plot import plot_population_af
-import matplotlib
-matplotlib.use("Agg")
+import matplotlib
+matplotlib.use("Agg")
+from hvantk.ptm.plot import plot_population_af
```

Comment thread hvantk/annotation/annotate.py Outdated
Comment on lines 35 to 42
```diff
 # Pre-calculate conditions for better readability and performance
 is_pathogenic = t.clinvar_clnsig.any(
-    lambda x: hl.set(pathogenic_label_clinvar).contains(x)
+    lambda x: hl.set(CLINVAR_PATHOGENIC_LABELS).contains(x)
 )
-is_benign = t.clinvar_clnsig.any(lambda x: hl.set(benign_label_clinvar).contains(x))
+is_benign = t.clinvar_clnsig.any(lambda x: hl.set(CLINVAR_BENIGN_LABELS).contains(x))

 t = t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG)
 t = t.annotate(
```

Copilot AI Mar 24, 2026


annotate_clinvar_clnsig computes is_pathogenic / is_benign from t.clinvar_clnsig before clinvar_clnsig is annotated from clinvar_ht. For inputs that don’t already have a clinvar_clnsig field (e.g., commands/annotate_features.py reads a variant HT then calls this), this will raise a field-not-found error and/or classify based on stale data. Compute the conditions from the joined ClinVar CLNSIG expression (or move the t = t.annotate(clinvar_clnsig=...) line before the .any(...) checks).

Comment thread hvantk/tables/table_builders.py Outdated
Comment on lines +1404 to +1406
The table is keyed by a genomic interval covering the PTM codon,
enabling interval-based joins with variant tables.


Copilot AI Mar 24, 2026


Docstring says the PTM sites table is "keyed by a genomic interval covering the PTM codon", but the implementation keys by a single locus (codon_start) and keeps flanking_interval as a non-key field. Please update the docstring to match the actual keying strategy (or change the key to an interval if that’s what downstream joins expect).

Comment thread hvantk/tables/table_builders.py Outdated
hl.Table
The checkpointed Hail Table keyed by locus interval.
"""
from hvantk.ptm.constants import PTM_TYPE_CATEGORIES

Copilot AI Mar 24, 2026


PTM_TYPE_CATEGORIES is imported but never used in this function. Dropping the unused import will avoid lint noise and prevents unnecessary imports when building tables.

Suggested change

```diff
-    from hvantk.ptm.constants import PTM_TYPE_CATEGORIES
```

Comment thread hvantk/ptm/mapper.py
Comment on lines +152 to +158
```python
codon_pos = positions[idx:idx + 3]
return CodonMapping(
    chrom=chrom,
    codon_start=min(codon_pos),
    codon_end=max(codon_pos),
    strand=strand,
)
```

Copilot AI Mar 24, 2026


map_residue_to_genomic returns codon_start=min(codon_pos) and codon_end=max(codon_pos). For split codons spanning an exon boundary, those three genomic positions are not contiguous; using min/max creates an interval that includes the intron and later code works around this by truncating the end. This can cause missed annotations for variants that hit the codon base(s) in the other exon. Consider returning/storing the exact 3 genomic positions (e.g., codon_positions: tuple[int,int,int]) or separate locus fields so downstream annotation can join on the true codon bases.
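
The concern is concrete for codons split across an exon junction: the three coding bases are not contiguous in genomic coordinates, so a min/max interval silently spans the intron. A small self-contained sketch of the suggested alternative, carrying the exact positions — `codon_positions` is the review's suggestion, while the `covers` helper and example coordinates are hypothetical:

```python
# Sketch of the reviewer's suggestion: store the exact genomic base positions
# of the codon instead of a min/max interval. The covers() helper and the
# example coordinates are illustrative assumptions, not hvantk code.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CodonMapping:
    chrom: str
    codon_positions: Tuple[int, int, int]
    strand: str

    def covers(self, pos: int) -> bool:
        # Exact membership test: true only for the three coding bases,
        # never for intronic positions between them.
        return pos in self.codon_positions

# Codon split across an exon boundary: two bases end one exon at
# 1000-1001, the third starts the next exon at 2000.
split = CodonMapping("chr1", (1000, 1001, 2000), "+")
```

A min/max interval for this codon would span 1000-2000 and incorrectly include every intronic position in between, which is exactly the failure mode the review describes.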

Comment thread hvantk/ptm/report.py Outdated
plt.close(fig)
return f"<img src='{src}' alt='{output_path.stem}' class='embedded-image'/>"
except Exception:
logger.warning("Failed to generate plot %s, skipping.", output_path.stem)

Copilot AI Mar 24, 2026


The exception handler in _render_plot logs only a generic warning, which makes debugging plot failures difficult. Logging the exception details (e.g., logger.exception(..., exc_info=True)) would preserve the stack trace while still skipping the plot.

Suggested change

```diff
-        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
+        logger.exception("Failed to generate plot %s, skipping.", output_path.stem)
```

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (12)
hvantk/tests/test_ptm.py (1)

218-221: Use tmp_path fixture instead of hardcoded /tmp paths.

The test uses hardcoded /tmp/test paths which are flagged as potentially insecure. Using pytest's tmp_path fixture is more portable and secure.

♻️ Proposed fix
-    config2 = PTMBuildConfig(
-        output_dir="/tmp/test", output_ht="/tmp/test/out.ht"
-    )
+def test_ptm_build_config_validation(tmp_path):
+    """PTMBuildConfig.validate catches missing required fields."""
+    config = PTMBuildConfig()
+    errors = config.validate()
+    assert len(errors) == 2
+    assert any("output_dir" in e for e in errors)
+    assert any("output_ht" in e for e in errors)
+
+    config2 = PTMBuildConfig(
+        output_dir=str(tmp_path), output_ht=str(tmp_path / "out.ht")
+    )
     assert config2.validate() == []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm.py` around lines 218 - 221, The test uses hardcoded
paths; update the test to accept pytest's tmp_path fixture and construct
PTMBuildConfig using tmp_path (e.g., tmp_path / "test" for output_dir and
output_ht) instead of "/tmp/test" and "/tmp/test/out.ht", then call
config2.validate() as before; ensure the test function signature includes
tmp_path so tmp directories are used and removed automatically.
hvantk/datasets/uniprot_ptm_datasets.py (1)

226-231: Use raise ... from None or from err for exception chaining.

The re-raised ValueError should use proper exception chaining to clarify the error origin.

♻️ Proposed fix
         try:
             datetime.strptime(version_date, "%Y-%m-%d")
         except ValueError:
             raise ValueError(
                 f"Invalid version_date format: {version_date}. Expected YYYY-MM-DD"
-            )
+            ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 226 - 231, The except
block that re-raises a ValueError for invalid version_date should preserve
exception chaining; change "except ValueError:" to "except ValueError as err:"
and re-raise with "raise ValueError(f'Invalid version_date format:
{version_date}. Expected YYYY-MM-DD') from err" so the original parsing error is
chained (refer to the try/except around datetime.strptime and the variable
version_date).
hvantk/ptm/report.py (1)

258-260: Log the exception details for debugging.

The broad except Exception is acceptable here to prevent plot failures from crashing report generation. However, logging only the plot name loses valuable diagnostic information.

♻️ Proposed fix to include exception details
     except Exception:
-        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
+        logger.warning(
+            "Failed to generate plot %s, skipping.", output_path.stem, exc_info=True
+        )
         return ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/report.py` around lines 258 - 260, The except block that currently
catches Exception and calls logger.warning("Failed to generate plot %s,
skipping.", output_path.stem) should be changed to log the actual exception
details; either capture the exception (except Exception as e) and include it in
the log message or call logger.exception/ logger.warning(..., exc_info=True) so
the stack trace is recorded while still returning ""; update the except block
surrounding the plot generation (the block using output_path.stem and returning
"") to emit the exception information for debugging.
docs_site/examples/ptm.md (1)

43-49: Add language specifier to fenced code blocks for output examples.

The output examples lack language specifiers. Use text or console to satisfy linting and improve syntax highlighting consistency.

📝 Proposed fix
 **Expected output:**
-```
+```text
 Mapping complete: 55422/56120 mapped (98.7%), 698 failed
 Resolution: {'ensembl_canonical': 48231, 'longest_cds': 7191}
 Mapped TSV: data/ptm/ptm_mapped.tsv
 Hail Table: data/ptm/ptm_sites.ht

Apply the same fix to the output blocks at lines 63, 83, and 150.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs_site/examples/ptm.md around lines 43 - 49, Add a language specifier
(text or console) to the fenced code blocks that show output examples —
specifically update the block immediately following the "Expected output:"
header (the block containing "Mapping complete: 55422/56120 mapped...") and
apply the same change to the other output blocks referenced (the ones at the
locations corresponding to lines 63, 83, and 150) so each opening
triple-backtick becomes ```text; keep the block contents unchanged.



hvantk/ptm/analysis.py (3)

42-44: Remove extraneous f-string prefix.

Line 44 uses an f-string but contains no placeholders.

📝 Proposed fix

```diff
     def summary(self) -> str:
         lines = [
-            f"PTM-Variant Landscape:",
+            "PTM-Variant Landscape:",
             f"  Variants: {self.n_variants:,} total, "
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 42 - 44, The f-string prefix in the
summary method is unnecessary because there are no placeholders; update the
summary function (def summary(self) -> str:) to use a normal string literal for
the line "PTM-Variant Landscape:" instead of an f-string (locate the list
assigned to the variable lines inside summary) so remove the leading f from that
string.
```


---

77-79: Remove extraneous f-string prefix.

Line 79 uses an f-string but contains no placeholders.

📝 Proposed fix

```diff
     def summary(self) -> str:
         lines = [
-            f"PTM Population Analysis:",
+            "PTM Population Analysis:",
             f"  Total variants: {self.n_variants:,}",
```
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 77 - 79, The string in the summary()
method is using an unnecessary f-string prefix for a literal with no
placeholders; update the list entry in PTM Population Analysis within the
summary() function to use a normal string literal (remove the leading f) so the
line reads a plain quoted string instead of an f-string.
```


---

369-369: Consider validating af_field existence before use.

The dynamic field access `annotated[af_field]` will fail at query execution time if the field doesn't exist. An early validation could provide a clearer error message.

<details>
<summary>📝 Optional validation</summary>

```diff
+    if af_field not in annotated.row:
+        raise ValueError(f"AF field '{af_field}' not found in gnomAD table. "
+                        f"Available fields: {list(annotated.row)}")
     af = annotated[af_field]
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` at line 369, Before indexing annotated with af_field,
validate that the key/property exists to avoid a runtime KeyError; in the
function where af = annotated[af_field] is used, check presence (e.g., "if
af_field not in annotated" or use hasattr for objects) and raise a clear
ValueError/TypeError describing the missing field and listing available keys (or
provide a sensible default/fallback) so callers get a helpful error instead of a
cryptic crash.
```

</details>
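The same guard pattern can be sketched outside Hail, with a plain dict standing in for the annotated table row (field names here are illustrative, not the module's actual schema):

```python
def get_af(annotated_fields: dict, af_field: str) -> float:
    # Validate up front so callers get a clear error naming the missing field,
    # instead of a KeyError deep inside query execution.
    if af_field not in annotated_fields:
        raise ValueError(
            f"AF field '{af_field}' not found in gnomAD table. "
            f"Available fields: {list(annotated_fields)}"
        )
    return annotated_fields[af_field]
```

The error message lists the available fields, which is usually enough for a user to spot a typo in a `--af-field` flag without reading the source.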

</blockquote></details>
<details>
<summary>hvantk/ptm/plot.py (2)</summary><blockquote>

`43-50`: **Consider renaming `format` parameter to avoid shadowing builtin.**

While functional, shadowing the `format` builtin is flagged by linters. Common alternatives: `fmt`, `output_format`, or `file_format`.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 43 - 50, The parameter name format in
function plot_landscape_summary shadows the builtin; rename it to a
non-conflicting identifier such as output_format (or fmt/file_format) in the
function signature and update every internal reference within
plot_landscape_summary (and its type hint/default) to use output_format, plus
update all call sites, docstrings, and tests that pass format to use the new
name so behavior is preserved.
```

</details>
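A minimal illustration of the rename (hypothetical function, not the actual `plot_landscape_summary` signature): with the parameter named `output_format`, the builtin `format()` remains usable inside the body.

```python
def export_plot(basename: str, output_format: str = "png") -> str:
    # Because the parameter is not named `format`, the builtin stays accessible:
    stamp = format(42, "04d")  # zero-padded counter, "0042"
    return f"{basename}_{stamp}.{output_format}"
```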

---

`253-258`: **Add `strict=True` to `zip()` for defensive programming.**

While `bars` and `n_values` are guaranteed to have the same length here, adding `strict=True` catches any future refactoring errors that might cause misalignment.

<details>
<summary>📝 Proposed fix</summary>

```diff
         # Annotate counts on bars
-        for bar, n in zip(bars, n_values):
+        for bar, n in zip(bars, n_values, strict=True):
             ax.text(
                 bar.get_x() + bar.get_width() / 2, bar.get_height(),
                 f"n={n:,}", ha="center", va="bottom", fontsize=9,
             )
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 253 - 258, The loop annotating bar counts
should use zip(..., strict=True) to guard against mismatched lengths; update the
zip call that pairs bars and n_values in hvantk/ptm/plot.py so it reads
zip(bars, n_values, strict=True) and leave the rest of the ax.text(...) logic
unchanged to ensure a ValueError is raised if future refactors cause
misalignment.
```

</details>

</blockquote></details>
<details>
<summary>hvantk/commands/ptm_cli.py (3)</summary><blockquote>

`119-123`: **Add context to validation error output.**

The validation errors are printed without a preceding message indicating what failed.

<details>
<summary>📝 Proposed fix</summary>

```diff
         errors = config.validate()
         if errors:
+            click.echo("Configuration validation failed:", err=True)
             for error in errors:
                 click.echo(f"  - {error}", err=True)
             ctx.exit(1)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` around lines 119 - 123, When config.validate()
returns errors, prepend a clear context message before listing them: call
click.echo with a message like "Configuration validation failed:" (using
err=True) just prior to the for-loop that echoes each error, then proceed to
loop over errors and call ctx.exit(1); update the block that references errors,
config.validate(), click.echo, and ctx.exit to include that preceding
explanatory line so the output clearly indicates what failed.
```

</details>

---

`181-187`: **Consider aggregating counts in a single pass.**

Three separate `count()` calls trigger three table scans. For large variant tables, this could be optimized with a single aggregation.

<details>
<summary>⚡ Proposed optimization</summary>

```diff
-        n_total = result.count()
-        n_ptm_site = result.filter(result.is_ptm_site).count()
-        n_proximal = result.filter(result.is_ptm_proximal).count()
+        counts = result.aggregate(hl.struct(
+            n_total=hl.agg.count(),
+            n_ptm_site=hl.agg.filter(result.is_ptm_site, hl.agg.count()),
+            n_proximal=hl.agg.filter(result.is_ptm_proximal, hl.agg.count()),
+        ))
+        n_total = counts.n_total
+        n_ptm_site = counts.n_ptm_site
+        n_proximal = counts.n_proximal
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` around lines 181 - 187, The three separate
count() calls (n_total = result.count(), n_ptm_site =
result.filter(result.is_ptm_site).count(), n_proximal =
result.filter(result.is_ptm_proximal).count()) cause multiple table scans;
replace them with a single aggregation over result that computes total, PTM site
and proximal counts in one pass (e.g., a single query using conditional
sums/CASE or SQLAlchemy func.sum on expressions for is_ptm_site and
is_ptm_proximal, or iterate once and accumulate into n_total, n_ptm_site,
n_proximal) so the code still sets n_total, n_ptm_site and n_proximal but with
one database scan.
```

</details>
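The effect of the single-pass aggregation can be sketched in plain Python, with a list of dicts standing in for the Hail table (field names borrowed from the snippet above):

```python
rows = [
    {"is_ptm_site": True,  "is_ptm_proximal": False},
    {"is_ptm_site": False, "is_ptm_proximal": True},
    {"is_ptm_site": True,  "is_ptm_proximal": True},
]

# One pass over the data computes all three counts at once,
# instead of three separate scans.
n_total = n_ptm_site = n_proximal = 0
for row in rows:
    n_total += 1
    n_ptm_site += row["is_ptm_site"]      # bools count as 0/1
    n_proximal += row["is_ptm_proximal"]
```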

---

`1-11`: **Docstring mentions "evaluate" command that doesn't exist.**

The module docstring at line 8 lists `evaluate: Predictor performance at PTM sites (Q2)`, but this command is not implemented in the file. The actual Q2 functionality is handled via `export-strata` combined with external `psroc` commands as documented in the `ptm_group` docstring.

Consider updating the module docstring to reflect the actual command set:

<details>
<summary>📝 Proposed fix</summary>

```diff
 """
 PTM CLI Commands - Post-Translational Modification Analysis
 
 This module provides CLI commands for the PTM variant classification pipeline:
 - build: Download PTM data, map coordinates, build Hail Table
 - annotate: Annotate a variant table with PTM site information
 - landscape: PTM-variant overlap analysis (Q1)
-- evaluate: Predictor performance at PTM sites (Q2)
+- export-strata: Export PTM-stratified variant lists for external predictor evaluation (Q2)
 - population: Population-level PTM-variant burden (Q3)
 - report: Generate summary report
 """
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` around lines 1 - 11, Update the module docstring
in ptm_cli.py to remove or replace the incorrect "evaluate: Predictor
performance at PTM sites (Q2)" entry and instead document the actual commands
implemented (build, annotate, landscape, population, report) and note that Q2
functionality is provided via export-strata plus external psroc usage as
described in the ptm_group docstring; edit the top-level triple-quoted string
(module docstring) to reflect the accurate command set and cross-reference the
ptm_group explanation so readers are not misled about an evaluate command that
does not exist.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs_site/examples/index.md:

  • Around line 97-99: The quick-start commands have a PTM HT path mismatch: the
    build command (hvantk ptm build) writes data/ptm/ptm_sites.ht but the landscape
    command (hvantk ptm landscape) expects --ptm-ht ptm_sites.ht; update the
    landscape invocation to use --ptm-ht data/ptm/ptm_sites.ht (or alternatively
    change the build --output-ht to produce ptm_sites.ht in the current dir) so the
    two commands reference the same file path.

In @docs_site/guide/data-sources.md:

  • Around line 250-257: The docs currently list placeholder comments instead of
    actionable commands for pre-downloading data; replace the commented notes in the
    "Download (optional, for offline use)" block with runnable commands such as
    using the hvantk CLI (hvantk download uniprot-ptm --output-dir data/ptm) to
    fetch the UniProt PTM TSV and a wget call to fetch the Ensembl GTF (e.g.,
    download Homo_sapiens.GRCh38.*.gtf.gz into data/ref) so the document contains
    concrete, executable pre-download steps.

In @hvantk/annotation/annotate.py:

  • Around line 30-41: The predicates is_pathogenic and is_benign currently
    reference t.clinvar_clnsig before it is assigned; either compute clinvar_clnsig
    first or build predicates directly from the joined ClinVar expression
    clinvar_ht[t.key].info.CLNSIG. Update the code so that t =
    t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG) is executed before
    creating is_pathogenic/is_benign, or replace usages of t.clinvar_clnsig in the
    is_pathogenic and is_benign expressions with clinvar_ht[t.key].info.CLNSIG
    (using CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS with .any(...) as
    before) to avoid referencing an unassigned field.

In @hvantk/commands/ptm_cli.py:

  • Around line 332-338: Move the matplotlib backend selection before importing
    the plotting function: call matplotlib.use("Agg") prior to the from
    hvantk.ptm.plot import plot_population_af statement in the save_plots block so
    the backend is set before any plotting modules are imported; update the
    save_plots branch that currently imports plot_population_af and then calls
    matplotlib.use to instead import matplotlib, set matplotlib.use("Agg"), then
    import plot_population_af and call plot_population_af(result,
    os.path.join(output, "population_af.png")).
  • Around line 232-244: The call to matplotlib.use("Agg") occurs after importing
    hvantk.ptm.plot (which imports matplotlib.pyplot), so move the backend selection
    before that import: when save_plots is true, call matplotlib.use("Agg") first,
    then import plot_landscape_summary, plot_overlap_by_category, and
    plot_distance_distribution from hvantk.ptm.plot and proceed to call those
    functions (or alternatively set the backend inside the hvantk.ptm.plot module
    itself by adding matplotlib.use("Agg") before any pyplot import).

In @hvantk/commands/uniprot_ptm_downloader.py:

  • Around line 23-26: The CLI help for the "--date" argument incorrectly says the
    default is "today"; update the help text to reflect that when no date is
    provided the code uses the latest UniProt snapshot
    (UniProtPTMDataset.from_latest()), not the current date. Locate the argument
    definition for "--date" (the argparse add_argument call) and change the help
    string to mention "default: latest UniProt snapshot (via
    UniProtPTMDataset.from_latest())" and make the same change for the duplicate
    help text near the UniProtPTMDataset.from_latest() usage so both descriptions
    match the actual behavior.

In @hvantk/data/file_utils.py:

  • Around line 98-118: The sanitizer currently corrupts fields by replacing
    delimiter chars and buffers the whole file; in the sanitize_tsv logic use
    csv.writer to stream-output rows to tmp_path instead of manual delimiter.join
    and remove the cell.replace(delimiter, " ") step so you only normalize embedded
    newlines (replace "\n" and "\r" with spaces), iterate reader rows one-by-one (or
    filter blank rows) and writer.writerow(cleaned_row) for each, and preserve
    existing quoting/escaping semantics provided by csv.writer; ensure you still
    warn/return on empty input and atomically replace the original file with the
    tmp_path after successful write.

In @hvantk/datasets/gencc_datasets.py:

  • Around line 115-119: The call to sanitize_tsv(output_path) can return 0
    indicating failure, but the code always logs success and returns output_path;
    update the surrounding function (the one that calls sanitize_tsv) to capture the
    return value (e.g., result = sanitize_tsv(output_path)), check if result == 0,
    and on failure log an error via logger.error and either raise an exception or
    return None/False (consistent with the function's error handling convention)
    instead of returning output_path; ensure you reference sanitize_tsv and the
    caller function name so the check replaces the unconditional logger.info and
    return when sanitization fails.

In @hvantk/ptm/constants.py:

  • Around line 32-35: The mapping currently places the generic "Glycyl lysine
    isopeptide (Lys-Gly)" entry before the SUMO-specific keys, so
    hvantk/ptm/pipeline.py's first-match loop (which uses desc.startswith(prefix))
    classifies SUMO entries as "ubiquitination"; reorder the dictionary so the two
    SUMO-specific keys ("Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter
    in SUMO)" and "... SUMO2") appear before the generic "Glycyl lysine isopeptide
    (Lys-Gly)" key to ensure those descriptions match "sumoylation" first.

In @hvantk/ptm/report.py:

  • Around line 204-212: The guard only checks result.ccr_mean_ptm but then
    formats result.ccr_mean_non_ptm unconditionally; update the logic in the report
    generation (the block that builds ccr_html using result.ccr_mean_ptm and
    result.ccr_mean_non_ptm) to verify each value before formatting: only include
    the PTM line if result.ccr_mean_ptm is not None and only include the non-PTM
    line if result.ccr_mean_non_ptm is not None (or provide a fallback like "N/A");
    you can either add a combined guard (if result.ccr_mean_ptm is not None or
    result.ccr_mean_non_ptm is not None) and then conditionally append each
    f"..." line, or check both individually, so neither f-string attempts to
    format a None value.

In @hvantk/tables/table_builders.py:

  • Around line 1390-1393: The code currently allows negative flanking_codons
    which leads to negative flank_bp and inverted/shrunk intervals; add an explicit
    validation at the start of create_ptm_sites_tb (and any public entry that
    forwards the parameter, e.g., the mktable ptm-sites command handler) to raise a
    clear exception (ValueError) when flanking_codons < 0, and avoid computing
    flank_bp for negative values (the existing flank_bp calculation block around
    where flank_bp is computed/used, referenced in the diff lines ~1460-1469, should
    only run for non-negative flanking_codons). Ensure the check references the
    function parameter name flanking_codons and returns/raises before any
    interval-building logic runs.

In @hvantk/utils/table_utils.py:

  • Around line 59-76: The rename-building logic in the loop over field_map (using
    _normalize_field_name, norm_to_actual, target_names and rename) must avoid
    producing a mapping whose destination name already exists in the table; update
    the checks so before setting rename[actual]=target (or
    rename[actual_by_target]=target) you verify that target is not present in the
    current actual fields (actual_fields) and not already a destination in
    rename.values(); additionally detect and raise/return a clear error when
    multiple different source headers collapse to the same normalized key so we fail
    fast instead of producing duplicate destinations.

Nitpick comments:
In @docs_site/examples/ptm.md:

  • Around line 43-49: Add a language specifier (text or console) to the
    fenced code blocks that show output examples — specifically update the block
    immediately following the "Expected output:" header (the block containing
    "Mapping complete: 55422/56120 mapped...") and apply the same change to the
    other output blocks referenced (the ones at the locations corresponding to lines
    63, 83, and 150) so each opening triple-backtick becomes ```text; keep the block
    contents unchanged.

In @hvantk/commands/ptm_cli.py:

  • Around line 119-123: When config.validate() returns errors, prepend a clear
    context message before listing them: call click.echo with a message like
    "Configuration validation failed:" (using err=True) just prior to the for-loop
    that echoes each error, then proceed to loop over errors and call ctx.exit(1);
    update the block that references errors, config.validate(), click.echo, and
    ctx.exit to include that preceding explanatory line so the output clearly
    indicates what failed.
  • Around line 181-187: The three separate count() calls (n_total =
    result.count(), n_ptm_site = result.filter(result.is_ptm_site).count(),
    n_proximal = result.filter(result.is_ptm_proximal).count()) cause multiple table
    scans; replace them with a single aggregation over result that computes total,
    PTM site and proximal counts in one pass (e.g., a single query using conditional
    sums/CASE or SQLAlchemy func.sum on expressions for is_ptm_site and
    is_ptm_proximal, or iterate once and accumulate into n_total, n_ptm_site,
    n_proximal) so the code still sets n_total, n_ptm_site and n_proximal but with
    one database scan.
  • Around line 1-11: Update the module docstring in ptm_cli.py to remove or
    replace the incorrect "evaluate: Predictor performance at PTM sites (Q2)" entry
    and instead document the actual commands implemented (build, annotate,
    landscape, population, report) and note that Q2 functionality is provided via
    export-strata plus external psroc usage as described in the ptm_group docstring;
    edit the top-level triple-quoted string (module docstring) to reflect the
    accurate command set and cross-reference the ptm_group explanation so readers
    are not misled about an evaluate command that does not exist.

In @hvantk/datasets/uniprot_ptm_datasets.py:

  • Around line 226-231: The except block that re-raises a ValueError for invalid
    version_date should preserve exception chaining; change "except ValueError:" to
    "except ValueError as err:" and re-raise with "raise ValueError(f'Invalid
    version_date format: {version_date}. Expected YYYY-MM-DD') from err" so the
    original parsing error is chained (refer to the try/except around
    datetime.strptime and the variable version_date).

In @hvantk/ptm/analysis.py:

  • Around line 42-44: The f-string prefix in the summary method is unnecessary
    because there are no placeholders; update the summary function (def
    summary(self) -> str:) to use a normal string literal for the line "PTM-Variant
    Landscape:" instead of an f-string (locate the list assigned to the variable
    lines inside summary) so remove the leading f from that string.
  • Around line 77-79: The string in the summary() method is using an unnecessary
    f-string prefix for a literal with no placeholders; update the list entry in PTM
    Population Analysis within the summary() function to use a normal string literal
    (remove the leading f) so the line reads a plain quoted string instead of an
    f-string.
  • Line 369: Before indexing annotated with af_field, validate that the
    key/property exists to avoid a runtime KeyError; in the function where af =
    annotated[af_field] is used, check presence (e.g., "if af_field not in
    annotated" or use hasattr for objects) and raise a clear ValueError/TypeError
    describing the missing field and listing available keys (or provide a sensible
    default/fallback) so callers get a helpful error instead of a cryptic crash.

In @hvantk/ptm/plot.py:

  • Around line 43-50: The parameter name format in function
    plot_landscape_summary shadows the builtin; rename it to a non-conflicting
    identifier such as output_format (or fmt/file_format) in the function signature
    and update every internal reference within plot_landscape_summary (and its type
    hint/default) to use output_format, plus update all call sites, docstrings, and
    tests that pass format to use the new name so behavior is preserved.
  • Around line 253-258: The loop annotating bar counts should use zip(...,
    strict=True) to guard against mismatched lengths; update the zip call that pairs
    bars and n_values in hvantk/ptm/plot.py so it reads zip(bars, n_values,
    strict=True) and leave the rest of the ax.text(...) logic unchanged to ensure a
    ValueError is raised if future refactors cause misalignment.

In @hvantk/ptm/report.py:

  • Around line 258-260: The except block that currently catches Exception and
    calls logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
    should be changed to log the actual exception details; either capture the
    exception (except Exception as e) and include it in the log message or call
    logger.exception/ logger.warning(..., exc_info=True) so the stack trace is
    recorded while still returning ""; update the except block surrounding the plot
    generation (the block using output_path.stem and returning "") to emit the
    exception information for debugging.

In @hvantk/tests/test_ptm.py:

  • Around line 218-221: The test uses hardcoded paths; update the test to accept
    pytest's tmp_path fixture and construct PTMBuildConfig using tmp_path (e.g.,
    tmp_path / "test" for output_dir and output_ht) instead of "/tmp/test" and
    "/tmp/test/out.ht", then call config2.validate() as before; ensure the test
    function signature includes tmp_path so tmp directories are used and removed
    automatically.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `0caf719e-a360-40a8-adf6-12992f1bcb5f`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 3fa0ef5b95a8b6edaefcb4b5a2c6653bcc5945b4 and 857ce337560ad0b28dcb87d8a5c34d2ed48f7b9f.

</details>

<details>
<summary>⛔ Files ignored due to path filters (1)</summary>

* `docs_site/images/hvantk-ptm-workflow.svg` is excluded by `!**/*.svg`

</details>

<details>
<summary>📒 Files selected for processing (29)</summary>

* `docs_site/architecture.md`
* `docs_site/examples/index.md`
* `docs_site/examples/ptm.md`
* `docs_site/guide/data-sources.md`
* `docs_site/guide/usage.md`
* `docs_site/tools/ptm.md`
* `hvantk/annotation/annotate.py`
* `hvantk/commands/download_cli.py`
* `hvantk/commands/make_table_cli.py`
* `hvantk/commands/ptm_cli.py`
* `hvantk/commands/uniprot_ptm_downloader.py`
* `hvantk/core/constants.py`
* `hvantk/data/file_utils.py`
* `hvantk/datasets/gencc_datasets.py`
* `hvantk/datasets/uniprot_ptm_datasets.py`
* `hvantk/hvantk.py`
* `hvantk/psroc/pipeline.py`
* `hvantk/ptm/__init__.py`
* `hvantk/ptm/analysis.py`
* `hvantk/ptm/annotate.py`
* `hvantk/ptm/constants.py`
* `hvantk/ptm/mapper.py`
* `hvantk/ptm/pipeline.py`
* `hvantk/ptm/plot.py`
* `hvantk/ptm/report.py`
* `hvantk/tables/registry.py`
* `hvantk/tables/table_builders.py`
* `hvantk/tests/test_ptm.py`
* `hvantk/utils/table_utils.py`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread docs_site/examples/index.md
Comment on lines +250 to +257
**Download** (optional, for offline use):

```bash
# The build command handles this automatically, but you can pre-download:
# 1. UniProt PTM TSV via REST API query
# 2. Ensembl GTF for coordinate mapping
```


⚠️ Potential issue | 🟡 Minor

Replace placeholder comments with an executable pre-download command.

Line 253–256 currently documents intent but gives no runnable command. Add hvantk download uniprot-ptm to make offline prep actually actionable.

<details>
<summary>Suggested doc fix</summary>

````diff
 **Download** (optional, for offline use):

 ```bash
-# The build command handles this automatically, but you can pre-download:
-# 1. UniProt PTM TSV via REST API query
-# 2. Ensembl GTF for coordinate mapping
+# UniProt PTM TSV (optional pre-download)
+hvantk download uniprot-ptm --output-dir data/ptm
+
+# Ensembl GTF (optional pre-download for mapping)
+wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz -P data/ref
 ```
````
</details>

<!-- suggestion_start -->

<details>
<summary>📝 Committable suggestion</summary>

> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

````suggestion
**Download** (optional, for offline use):

```bash
# UniProt PTM TSV (optional pre-download)
hvantk download uniprot-ptm --output-dir data/ptm

# Ensembl GTF (optional pre-download for mapping)
wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz -P data/ref
```
````

</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@docs_site/guide/data-sources.md` around lines 250 - 257, The docs currently
list placeholder comments instead of actionable commands for pre-downloading
data; replace the commented notes in the "Download (optional, for offline use)"
block with runnable commands such as using the hvantk CLI (hvantk download
uniprot-ptm --output-dir data/ptm) to fetch the UniProt PTM TSV and a wget call
to fetch the Ensembl GTF (e.g., download Homo_sapiens.GRCh38.*.gtf.gz into
data/ref) so the document contains concrete, executable pre-download steps.
```

</details>

Comment thread hvantk/annotation/annotate.py Outdated
Comment thread hvantk/commands/ptm_cli.py
Comment thread hvantk/commands/ptm_cli.py
Comment thread hvantk/datasets/gencc_datasets.py
Comment thread hvantk/ptm/constants.py Outdated
Comment thread hvantk/ptm/report.py
Comment thread hvantk/tables/table_builders.py
Comment thread hvantk/utils/table_utils.py
- Fix clinvar_clnsig predicate ordering bug (referenced before annotation)
- Move matplotlib.use("Agg") before plot imports in PTM CLI commands
- Reorder PTM_TYPE_CATEGORIES so SUMO entries match before generic Lys-Gly
- Guard ccr_mean_non_ptm for None before formatting in report
- Check sanitize_tsv return value in GenCC downloader
- Rewrite sanitize_tsv to use csv.writer, stream rows, validate column count
- Detect rename collisions and ambiguous headers in build_rename_map
- Fix docstrings (keyed by locus not interval, evaluate→export-strata)
- Remove unused PTM_TYPE_CATEGORIES import, validate flanking_codons >= 0
- Use single-pass Hail aggregation for PTM annotation counts
- Fix docs: path mismatch, placeholder commands, code fence specifiers
- Minor: exception chaining, extraneous f-strings, exc_info logging, strict zip
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (8)
hvantk/annotation/annotate.py (1)

`30-49`: **LGTM! The predicate ordering bug has been correctly fixed.**

The previous issue where is_pathogenic and is_benign referenced t.clinvar_clnsig before assignment is now resolved—the field is annotated first (line 36) before the predicates are computed (lines 39-42). Using centralized constants from hvantk.core.constants improves maintainability.

One optional refinement for clarity and marginal performance: hoist the `hl.set()` calls outside the lambdas.

<details>
<summary>♻️ Optional: Hoist set creation outside lambdas</summary>

```diff
     # First annotate clinvar_clnsig from the ClinVar table
     t = t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG)

     # Now compute conditions using the annotated field
+    pathogenic_labels = hl.set(CLINVAR_PATHOGENIC_LABELS)
+    benign_labels = hl.set(CLINVAR_BENIGN_LABELS)
     is_pathogenic = t.clinvar_clnsig.any(
-        lambda x: hl.set(CLINVAR_PATHOGENIC_LABELS).contains(x)
+        lambda x: pathogenic_labels.contains(x)
     )
-    is_benign = t.clinvar_clnsig.any(lambda x: hl.set(CLINVAR_BENIGN_LABELS).contains(x))
+    is_benign = t.clinvar_clnsig.any(lambda x: benign_labels.contains(x))
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/annotation/annotate.py` around lines 30 - 49, Hoist the hl.set(...)
creations out of the lambdas to avoid rebuilding them repeatedly: create local
Hail sets (e.g., pathogenic_set = hl.set(CLINVAR_PATHOGENIC_LABELS) and
benign_set = hl.set(CLINVAR_BENIGN_LABELS)) before computing is_pathogenic and
is_benign, then use those variables inside the t.clinvar_clnsig.any(...)
lambdas; update references in the is_pathogenic and is_benign predicates and
leave the rest of the t.annotate logic unchanged.
```

</details>
hvantk/commands/uniprot_ptm_downloader.py (1)

`40-55`: **Solid CLI implementation with appropriate error handling.**

The lazy import pattern and error handling with ctx.exit(1) follow good CLI practices.

One minor note: logger.exception() already includes the exception information, so the f-string f"Download failed: {e}" duplicates it slightly. You could simplify to logger.exception("Download failed").

<details>
<summary>✨ Simplify exception logging</summary>

```diff
     except Exception as e:
-        logger.exception(f"Download failed: {e}")
+        logger.exception("Download failed")
         click.echo(f"Error: {e}", err=True)
         ctx.exit(1)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/uniprot_ptm_downloader.py` around lines 40 - 55, The
exception log duplicates the exception text; in the except block for the UniProt
PTM download (the try that imports UniProtPTMDataset and calls
UniProtPTMDataset.from_date/from_latest, download, etc.) replace
logger.exception(f"Download failed: {e}") with a simpler
logger.exception("Download failed") so the stack/exception is still recorded
without duplicating the exception string, keeping the existing
click.echo(f"Error: {e}", err=True) and ctx.exit(1) behavior unchanged.
```

</details>
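The difference is easy to see with a logger wired to an in-memory stream: `logger.exception` already appends the traceback, so the message itself can stay static.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(stream))

try:
    raise RuntimeError("boom")
except Exception:
    # The traceback, including "RuntimeError: boom", is appended automatically,
    # so embedding str(e) in the message would duplicate it.
    logger.exception("Download failed")

output = stream.getvalue()
```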
hvantk/datasets/uniprot_ptm_datasets.py (2)

`257-325`: **Well-implemented download with proper cleanup.**

The download method correctly:

  • Creates output directory if needed
  • Checks for existing files before overwrite
  • Streams results page-by-page to avoid memory issues
  • Cleans up partial files on failure

One minor style improvement per static analysis:

<details>
<summary>✨ Use implicit string conversion in f-string</summary>

```diff
-            raise RuntimeError(
-                f"Failed to download UniProt PTM data: {str(e)}"
-            ) from e
+            raise RuntimeError(
+                f"Failed to download UniProt PTM data: {e}"
+            ) from e
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 257 - 325, In download,
the exception message uses explicit str conversion (f"{str(e)}"); change it to
use implicit string conversion (e.g., f"{e}") when constructing the RuntimeError
message in the except block so the code is cleaner and satisfies the static
analysis rule in the download function's error handling.
```

</details>

`157-183`: **Consider validating URL scheme for defense-in-depth.**

The static analysis flags that urllib.request.urlopen accepts arbitrary URL schemes including file://. While UNIPROT_API_URL is presumably an HTTPS constant, the pagination URLs from _get_next_link come from server response headers. A compromised or malicious server could theoretically redirect to a file:// URL.

For defense-in-depth, consider validating the scheme:

🛡️ Optional: Add scheme validation
 def _fetch_json(url: str) -> tuple:
     """
     Fetch a JSON page from UniProt using *requests* (preferred) or *urllib*.
     ...
     """
+    from urllib.parse import urlparse
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme}")
+
     try:
         import requests as _requests
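As a standalone illustration of the proposed check (the helper name `check_scheme` is hypothetical):

```python
from urllib.parse import urlparse

def check_scheme(url: str) -> str:
    """Raise unless the URL uses http(s); apply to every pagination link too."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")
    return url
```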
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 157 - 183, The fetch
helper _fetch_json currently opens arbitrary URL schemes (via
urllib.request.urlopen or requests.get); validate the URL scheme before issuing
the request by parsing the URL (e.g., urllib.parse.urlparse) and only allowing
"http" and "https" (reject or raise an error for "file", "ftp", etc.), and apply
the same check to any pagination links produced by _get_next_link (ensure those
links are validated before being passed into _fetch_json); this defends against
malicious server-provided pagination URLs while preserving normal
UNIPROT_API_URL usage.
hvantk/commands/ptm_cli.py (1)

182-182: Remove duplicate hail import.

hail as hl is already imported at line 171. This duplicate import is unnecessary.

♻️ Remove duplicate import
         result = result.checkpoint(output_ht, overwrite=overwrite)

-        import hail as hl
         counts = result.aggregate(hl.struct(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/ptm_cli.py` at line 182, Remove the duplicate import of "hail
as hl" found at the second import line; keep the original "import hail as hl"
that already exists earlier (around line with other imports) and delete the
redundant "import hail as hl" occurrence to avoid duplicate imports in the
module (check the top-level imports near the PTM CLI definitions and functions
that use hl).
hvantk/ptm/report.py (1)

258-260: Consider narrowing the exception type for plot failures.

The broad Exception catch (flagged by Ruff BLE001) is acceptable here for fault tolerance, but could be narrowed to catch more specific exceptions like (ValueError, IOError, RuntimeError) to avoid masking unexpected errors.

♻️ Optional: narrow exception scope
-    except Exception:
+    except (ValueError, IOError, RuntimeError, TypeError):
         logger.warning("Failed to generate plot %s, skipping.", output_path.stem, exc_info=True)
         return ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/report.py` around lines 258 - 260, Replace the broad "except
Exception" in the plot-generation failure handler with a narrowed tuple of
expected exceptions (for example except (ValueError, IOError, RuntimeError):) so
you don't mask unrelated errors; keep the existing logger.warning call (using
output_path.stem and exc_info=True) and the return "" behavior unchanged; update
the except clause where the code logs "Failed to generate plot %s, skipping."
(the block that references output_path.stem and logger.warning) to use the
narrower exception tuple to address Ruff BLE001.
hvantk/ptm/plot.py (1)

49-49: Parameter format shadows Python builtin.

The format parameter appears in multiple functions (lines 49, 126, 171, 218, 277, 287, 298). While common in plotting code, consider renaming to fmt or image_format to avoid shadowing.

♻️ Rename `format` parameter across all functions
 def plot_landscape_summary(
     result: PTMLandscapeResult,
     output_path: str,
     figsize: Tuple[int, int] = (8, 5),
     dpi: int = 300,
     title: Optional[str] = None,
-    format: str = "png",
+    fmt: str = "png",
 ) -> plt.Figure:

Apply similar changes to other functions and update internal usages of format to fmt.
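A small demonstration of why the shadowing matters: once the parameter is bound, the builtin `format()` is unreachable inside the function. Both function names below are illustrative, not the module's real plotting functions:

```python
def plot_with_shadowing(format: str = "png"):
    # `format` is now the string "png"; the builtin of the same name is
    # shadowed, so this call raises TypeError: 'str' object is not callable.
    return format(3.14159, ".2f")

def plot_without_shadowing(fmt: str = "png"):
    # With the parameter renamed, the builtin is still available.
    return f"{format(3.14159, '.2f')}.{fmt}"
```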

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` at line 49, Rename the parameter named format to fmt (or
image_format) across all plotting functions in the hvantk.ptm.plot module that
declare "format: str = 'png'" (those function signatures flagged in the review)
and update every internal usage of format to the new name, plus adjust any local
variables or return values that reference it; ensure function signatures (and
docstrings if present) and all call sites within the repository are updated
consistently so nothing still references the old format parameter name.
hvantk/utils/table_utils.py (1)

65-87: Consider detecting duplicate rename destinations for robustness.

The target not in actual_fields check now prevents renaming into an existing field. For additional safety against malformed field_map inputs where multiple entries could resolve to the same actual field, you could also check target not in rename.values() before adding:

             if actual not in target_names and target not in actual_fields:
+                if target in rename.values():
+                    logger.warning(
+                        "build_rename_map: target %r already assigned, skipping %r",
+                        target, actual,
+                    )
+                    continue
                 rename[actual] = target

Current usage with COSMIC_CGC_FIELDS has no collisions, so this is purely defensive.
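A simplified, hypothetical sketch of the defensive guard — the real helper also normalizes names via `_normalize_field_name`; only the duplicate-destination check is shown here:

```python
import logging

logger = logging.getLogger(__name__)

def build_rename_map(actual_fields, field_map):
    """Simplified rename-map builder with a duplicate-target guard."""
    target_names = set(field_map.values())
    rename = {}
    for actual, target in field_map.items():
        if actual not in actual_fields:
            continue                      # source column absent
        if actual in target_names or target in actual_fields:
            continue                      # existing checks: avoid clobbering
        if target in rename.values():     # proposed guard: two sources, one target
            logger.warning(
                "build_rename_map: target %r already assigned, skipping %r",
                target, actual,
            )
            continue
        rename[actual] = target
    return rename
```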

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/utils/table_utils.py` around lines 65 - 87, The current header-rename
logic may allow multiple source fields to map to the same target; update the
blocks that assign to rename (the branches that set rename[actual] = target and
rename[actual_by_target] = target) to also check that the proposed target is not
already used in rename.values() before adding; use the same normalization/path
lookup logic around _normalize_field_name, norm_to_actual, field_map,
actual_fields and target_names to prevent duplicate rename destinations and keep
the existing checks (actual not in target_names and target not in
actual_fields).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/ptm/analysis.py`:
- Around line 86-91: The current guard only checks self.ccr_mean_ptm but still
formats self.ccr_mean_non_ptm which can be None; update the block that emits the
summary (the lines using self.ccr_mean_ptm and self.ccr_mean_non_ptm, produced
by ptm_population()/the summary method) to check each mean independently and
only format the non-None values. Concretely, replace the single-if that formats
both values with logic that builds the string parts conditionally (e.g., if
self.ccr_mean_ptm is not None append "PTM=...", and if self.ccr_mean_non_ptm is
not None append "non-PTM=...") and then join/append that to lines so no f-string
is called with a None value.
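
The conditional formatting described above can be sketched as follows; the `ccr_summary` helper is illustrative, not the class's actual summary method:

```python
def ccr_summary(ccr_mean_ptm, ccr_mean_non_ptm):
    """Build the summary line from whichever means are present,
    so no f-string format spec is ever applied to None."""
    parts = []
    if ccr_mean_ptm is not None:
        parts.append(f"PTM={ccr_mean_ptm:.2f}")
    if ccr_mean_non_ptm is not None:
        parts.append(f"non-PTM={ccr_mean_non_ptm:.2f}")
    return "CCR mean: " + ", ".join(parts) if parts else "CCR mean: n/a"
```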

In `@hvantk/tables/table_builders.py`:
- Around line 1142-1146: The fallback sets kwargs["force_bgz"]=True whenever
resolved_path.endswith(".gz"), which overrides resolve_compression's explicit
False for standard gzip; change the code to unpack the boolean from
resolve_compression (e.g., resolved_path, is_bgz = resolve_compression(...)) and
only set kwargs["force_bgz"]=True when either force_bgz was explicitly requested
or is_bgz is True (do not set it solely based on resolved_path.endswith(".gz")).
Ensure you update references to resolved_path and the function
resolve_compression accordingly and remove the blunt .endswith(".gz") check.
- Around line 1461-1471: The flanking_interval end can exceed chromosome bounds
(ht.codon_end + flank_bp); clamp it to the contig length before calling
hl.locus_interval. Retrieve the contig length for ht._contig from the
reference_genome (the same approach used in the PTM annotation pattern), compute
end_pos = hl.min(contig_length, ht.codon_end + flank_bp), and use that end_pos
in hl.locus_interval (keeping hl.max(1, ht.codon_start - flank_bp) for the
start); update the flanking_interval annotation accordingly (symbols:
flanking_interval, ht.codon_end, flank_bp, ht._contig, reference_genome,
hl.locus_interval).
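
The clamping logic amounts to the following (plain-Python sketch; the real fix would use `hl.min`/`hl.max` on Hail expressions):

```python
def flanking_interval(codon_start, codon_end, flank_bp, contig_length):
    """Clamp a flanked codon interval to [1, contig_length]."""
    start = max(1, codon_start - flank_bp)
    end = min(contig_length, codon_end + flank_bp)
    return start, end
```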

---

Nitpick comments:
In `@hvantk/annotation/annotate.py`:
- Around line 30-49: Hoist the hl.set(...) creations out of the lambdas to avoid
rebuilding them repeatedly: create local Hail sets (e.g., pathogenic_set =
hl.set(CLINVAR_PATHOGENIC_LABELS) and benign_set =
hl.set(CLINVAR_BENIGN_LABELS)) before computing is_pathogenic and is_benign,
then use those variables inside the t.clinvar_clnsig.any(...) lambdas; update
references in the is_pathogenic and is_benign predicates and leave the rest of
the t.annotate logic unchanged.
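
A plain-Python analogue of the hoisting (the label tuple is an illustrative subset; in Hail the hoisted object would be an `hl.set(...)` expression built before `t.annotate`):

```python
CLINVAR_PATHOGENIC_LABELS = ("Pathogenic", "Likely_pathogenic")  # illustrative subset

# Built once, outside any per-row predicate, instead of inside each lambda.
pathogenic_set = set(CLINVAR_PATHOGENIC_LABELS)

def is_pathogenic(clnsig_values):
    return any(label in pathogenic_set for label in clnsig_values)
```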

In `@hvantk/commands/ptm_cli.py`:
- Line 182: Remove the duplicate import of "hail as hl" found at the second
import line; keep the original "import hail as hl" that already exists earlier
(around line with other imports) and delete the redundant "import hail as hl"
occurrence to avoid duplicate imports in the module (check the top-level imports
near the PTM CLI definitions and functions that use hl).

In `@hvantk/commands/uniprot_ptm_downloader.py`:
- Around line 40-55: The exception log duplicates the exception text; in the
except block for the UniProt PTM download (the try that imports
UniProtPTMDataset and calls UniProtPTMDataset.from_date/from_latest, download,
etc.) replace logger.exception(f"Download failed: {e}") with a simpler
logger.exception("Download failed") so the stack/exception is still recorded
without duplicating the exception string, keeping the existing
click.echo(f"Error: {e}", err=True) and ctx.exit(1) behavior unchanged.

In `@hvantk/datasets/uniprot_ptm_datasets.py`:
- Around line 257-325: In download, the exception message uses explicit str
conversion (f"{str(e)}"); change it to use implicit string conversion (e.g.,
f"{e}") when constructing the RuntimeError message in the except block so the
code is cleaner and satisfies the static analysis rule in the download
function's error handling.
- Around line 157-183: The fetch helper _fetch_json currently opens arbitrary
URL schemes (via urllib.request.urlopen or requests.get); validate the URL
scheme before issuing the request by parsing the URL (e.g.,
urllib.parse.urlparse) and only allowing "http" and "https" (reject or raise an
error for "file", "ftp", etc.), and apply the same check to any pagination links
produced by _get_next_link (ensure those links are validated before being passed
into _fetch_json); this defends against malicious server-provided pagination
URLs while preserving normal UNIPROT_API_URL usage.

In `@hvantk/ptm/plot.py`:
- Line 49: Rename the parameter named format to fmt (or image_format) across all
plotting functions in the hvantk.ptm.plot module that declare "format: str =
'png'" (those function signatures flagged in the review) and update every
internal usage of format to the new name, plus adjust any local variables or
return values that reference it; ensure function signatures (and docstrings if
present) and all call sites within the repository are updated consistently so
nothing still references the old format parameter name.

In `@hvantk/ptm/report.py`:
- Around line 258-260: Replace the broad "except Exception" in the
plot-generation failure handler with a narrowed tuple of expected exceptions
(for example except (ValueError, IOError, RuntimeError):) so you don't mask
unrelated errors; keep the existing logger.warning call (using output_path.stem
and exc_info=True) and the return "" behavior unchanged; update the except
clause where the code logs "Failed to generate plot %s, skipping." (the block
that references output_path.stem and logger.warning) to use the narrower
exception tuple to address Ruff BLE001.

In `@hvantk/utils/table_utils.py`:
- Around line 65-87: The current header-rename logic may allow multiple source
fields to map to the same target; update the blocks that assign to rename (the
branches that set rename[actual] = target and rename[actual_by_target] = target)
to also check that the proposed target is not already used in rename.values()
before adding; use the same normalization/path lookup logic around
_normalize_field_name, norm_to_actual, field_map, actual_fields and target_names
to prevent duplicate rename destinations and keep the existing checks (actual
not in target_names and target not in actual_fields).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a83341fb-2b11-4f02-87df-be7d021b8452

📥 Commits

Reviewing files that changed from the base of the PR and between 857ce33 and a58516f.

📒 Files selected for processing (15)
  • docs_site/examples/index.md
  • docs_site/examples/ptm.md
  • docs_site/guide/data-sources.md
  • hvantk/annotation/annotate.py
  • hvantk/commands/ptm_cli.py
  • hvantk/commands/uniprot_ptm_downloader.py
  • hvantk/data/file_utils.py
  • hvantk/datasets/gencc_datasets.py
  • hvantk/datasets/uniprot_ptm_datasets.py
  • hvantk/ptm/analysis.py
  • hvantk/ptm/constants.py
  • hvantk/ptm/plot.py
  • hvantk/ptm/report.py
  • hvantk/tables/table_builders.py
  • hvantk/utils/table_utils.py
✅ Files skipped from review due to trivial changes (4)
  • docs_site/guide/data-sources.md
  • docs_site/examples/index.md
  • hvantk/ptm/constants.py
  • docs_site/examples/ptm.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • hvantk/datasets/gencc_datasets.py
  • hvantk/data/file_utils.py

Comment thread hvantk/ptm/analysis.py Outdated
Comment on lines +1142 to +1146
if force_bgz:
    kwargs["force_bgz"] = True
elif resolved_path.endswith(".gz"):
    kwargs["force_bgz"] = True
return hl.import_table(**kwargs)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Hail import_table force_bgz parameter gzip vs bgzip

💡 Result:

In Hail, hl.import_table imports delimited text files (TSV/CSV) into a Hail Table, and its force_bgz parameter controls how compressed .gz files are handled during import. GZIP is standard gzip compression, which decompresses serially (single-threaded, no parallelism). BGZIP (blocked gzip, BGZF) splits the file into independent blocks, enabling parallel random access and multi-threaded decompression.

By default, Hail treats .gz files as standard gzip (serial). Setting force_bgz=True tells Hail to treat .gz files as block gzip, allowing parallel reading across Spark executors even when the extension isn't .bgz. From the Hail docs: "If True, load .gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not '.bgz', but the file is blocked gzip, so that the file can be read in parallel and not on a single node." Use force_bgz=True only for genuinely BGZF-compressed files (common in genomics); otherwise leave standard gzip handling in place.

Citations:


🏁 Script executed:

# Find the resolve_compression function definition
rg -n "def resolve_compression" --type python

Repository: bigbio/hvantk

Length of output: 87


🏁 Script executed:

# Also check if resolve_compression is imported from elsewhere
rg -n "resolve_compression" hvantk/tables/table_builders.py | head -20

Repository: bigbio/hvantk

Length of output: 235


🏁 Script executed:

# Find and read the resolve_compression function
fd -t f file_utils.py -path "*/data/*"

Repository: bigbio/hvantk

Length of output: 225


🏁 Script executed:

# Search for resolve_compression definition
rg -n "def resolve_compression" --type-list | head -5

Repository: bigbio/hvantk

Length of output: 446


🏁 Script executed:

# Try a simpler search for the function
find . -name "file_utils.py" -type f

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

# Read the file_utils.py to find resolve_compression
cat -n ./hvantk/data/file_utils.py

Repository: bigbio/hvantk

Length of output: 32113


The fallback .gz check incorrectly overrides resolve_compression's explicit False return for standard gzip files.

When resolve_compression detects a standard gzip file, it explicitly returns (filepath, False) to indicate the file cannot be read in parallel. However, the fallback check at line 1144 ignores this signal and sets force_bgz=True anyway because the path ends with .gz.

According to Hail documentation and the learnings from this codebase, force_bgz=True tells Hail the file is block-gzipped (BGZF) and safe for parallel reading. Standard gzip files should NOT have this flag set—Hail will read them single-threaded without it.
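
Whether a .gz file is genuinely BGZF can be sniffed from its gzip header: BGZF sets the FEXTRA flag and carries a 'BC' extra subfield. A heuristic sketch (assuming the 'BC' subfield comes first, as files written by standard BGZF writers do):

```python
import struct

def is_bgzf(path: str) -> bool:
    """Heuristic BGZF sniff: gzip magic + FEXTRA flag + a leading 'BC' subfield.

    Standard gzip files fail this check, which is exactly why force_bgz
    must not be set for them.
    """
    with open(path, "rb") as fh:
        header = fh.read(18)
    if len(header) < 18 or header[:2] != b"\x1f\x8b":
        return False              # not gzip at all
    if not header[3] & 0x04:      # FEXTRA bit clear -> plain gzip, serial read only
        return False
    xlen = struct.unpack("<H", header[10:12])[0]
    # BGZF stores its block size in an extra subfield tagged 'B','C'
    return xlen >= 6 and header[12:14] == b"BC"
```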

🔧 Suggested fix
         if force_bgz:
             kwargs["force_bgz"] = True
-        elif resolved_path.endswith(".gz"):
-            kwargs["force_bgz"] = True
+        # Don't set force_bgz for standard gzip — resolve_compression already handles this
         return hl.import_table(**kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/table_builders.py` around lines 1142 - 1146, The fallback sets
kwargs["force_bgz"]=True whenever resolved_path.endswith(".gz"), which overrides
resolve_compression's explicit False for standard gzip; change the code to
unpack the boolean from resolve_compression (e.g., resolved_path, is_bgz =
resolve_compression(...)) and only set kwargs["force_bgz"]=True when either
force_bgz was explicitly requested or is_bgz is True (do not set it solely based
on resolved_path.endswith(".gz")). Ensure you update references to resolved_path
and the function resolve_compression accordingly and remove the blunt
.endswith(".gz") check.

Comment thread hvantk/tables/table_builders.py

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
hvantk/ptm/analysis.py (1)

223-224: Materialize annotated before the follow-on Hail actions.

stats, category_counts, and distance_dist are separate actions on the same lazy table. Without persist()/checkpoint(), Hail will rerun annotate_variants_with_ptm() and _label_clinvar() for each one.

♻️ Low-friction fix
     annotated = annotate_variants_with_ptm(clinvar_ht, ptm_ht, flanking_codons)
-    annotated = _label_clinvar(annotated)
+    annotated = _label_clinvar(annotated).persist()

Also applies to: 227-271
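
The recomputation hazard can be illustrated with a toy lazy table; this is a plain-Python analogue, not Hail's actual execution model:

```python
class LazyTable:
    """Toy stand-in for a lazy Hail Table: every action replays the whole
    pipeline unless the table is materialized first."""

    def __init__(self, rows, ops=()):
        self.rows, self.ops = rows, tuple(ops)

    def annotate(self, fn):
        # Lazy: just record the operation, don't run it yet.
        return LazyTable(self.rows, self.ops + (fn,))

    def _compute(self):
        rows = self.rows
        for op in self.ops:
            rows = [op(r) for r in rows]
        return rows

    def aggregate(self, agg):      # an "action": triggers computation
        return agg(self._compute())

    def persist(self):             # materialize once, drop pending ops
        return LazyTable(self._compute())
```

Running two actions on the unpersisted table applies the annotation function twice per row; calling `persist()` first applies it only once, which is the effect the suggested `.persist()` has on `annotated`.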

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 223 - 224, The pipeline recomputes
annotate_variants_with_ptm and _label_clinvar for each subsequent Hail action
because the intermediate Table `annotated` is lazy; persist or checkpoint
`annotated` immediately after calling _label_clinvar (i.e., after annotated =
_label_clinvar(annotated)) so subsequent actions (`stats`, `category_counts`,
`distance_dist`) operate on a materialized table; use annotated =
annotated.persist() or annotated = annotated.checkpoint(...) with an appropriate
path and then continue with the downstream actions to avoid repeated
recomputation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/ptm/analysis.py`:
- Around line 334-353: The is_non_ptm mask currently defined as "is_non_ptm =
~is_ptm & ~is_prox" allows noncoding variants; update the logic to require
coding variants or declare a precondition. Either (A) enforce coding in the
analysis by changing the mask to include a coding flag from gnomad_ht (e.g.,
is_non_ptm = ~is_ptm & ~is_prox & is_coding, where is_coding is derived from
gnomad_ht consequence/variant annotation) and apply the same change where
similar masks are built (see the other block around 375-388), or (B) add a clear
docstring/precondition that gnomad_ht must be pre-filtered to coding variants
and validate that upstream (raise an error if coding annotation/flag is
missing). Reference symbols: is_non_ptm, is_ptm, is_prox, af_field and the PTM
analysis function in hvantk/ptm/analysis.py.
- Around line 165-171: The _vid construction in ht_out (ht.annotate using
ht.locus.contig, ht.locus.position and ht.alleles[1]) only picks the first ALT
and will drop additional alleles for multiallelic records; either enforce
biallelic input or explode multi-allelic sites into one row per ALT. Fix by
validating ht.alleles length (e.g., assert/hl.if_else to raise or filter where
hl.len(ht.alleles) == 2) when you require biallelic, or else transform the table
before building _vid by exploding ALT alleles (use an explode/flatten pattern to
iterate over hl.range(1, hl.len(ht.alleles)) or an equivalent Hail explode on
ht.alleles[1:], create a per-ALT allele variable, and build _vid from that
allele instead of ht.alleles[1]) so each exported row includes the correct ALT.
- Around line 103-124: The code collapses multiallelic ClinVar CLNSIG to
CLNSIG[0] in _extract_clnsig and then labels variants by that single value;
instead, preserve the array and change _label_clinvar to test if any allele’s
CLNSIG matches the pathogenic/benign label sets (use hl.any(lambda x:
p_labels.contains(x), clnsig) and similarly for b_labels), and handle the scalar
case by checking membership directly when clnsig is not an array; update
_extract_clnsig/_label_clinvar to stop using CLNSIG[0] and use hl.any(...) (or
equivalent) against CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS so
multiallelic records are correctly classified.

In `@hvantk/tables/table_builders.py`:
- Around line 1431-1441: create_ptm_sites_tb currently hardcodes adding "chr" to
inputs (contig_remap/ht._contig) which breaks non-prefixed references; change
create_ptm_sites_tb to detect whether the target reference's contigs are
prefixed by checking get_reference(reference_genome).contigs for a "chr" prefix,
then normalize incoming ht.chrom by stripping any leading "chr"/"CHR" to a base
contig name, apply the MT<->M remap on that base (update contig_remap logic),
and finally re-add the "chr" prefix only if the reference uses prefixed contigs
before assigning ht._contig and filtering against valid_contigs (use
valid_contigs = hl.set(hl.literal(hl.get_reference(reference_genome).contigs))
as before).

---

Nitpick comments:
In `@hvantk/ptm/analysis.py`:
- Around line 223-224: The pipeline recomputes annotate_variants_with_ptm and
_label_clinvar for each subsequent Hail action because the intermediate Table
`annotated` is lazy; persist or checkpoint `annotated` immediately after calling
_label_clinvar (i.e., after annotated = _label_clinvar(annotated)) so subsequent
actions (`stats`, `category_counts`, `distance_dist`) operate on a materialized
table; use annotated = annotated.persist() or annotated =
annotated.checkpoint(...) with an appropriate path and then continue with the
downstream actions to avoid repeated recomputation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3d8c33c-a809-45df-a6aa-817e16c324e9

📥 Commits

Reviewing files that changed from the base of the PR and between a58516f and 0dd39f9.

📒 Files selected for processing (3)
  • docs_site/examples/ptm.md
  • hvantk/ptm/analysis.py
  • hvantk/tables/table_builders.py
✅ Files skipped from review due to trivial changes (1)
  • docs_site/examples/ptm.md

Comment thread hvantk/ptm/analysis.py
Comment on lines +103 to +124
def _extract_clnsig(ht: hl.Table) -> hl.Expression:
    """Extract scalar CLNSIG expression, handling array vs string dtype."""
    clnsig_dtype = ht.info.CLNSIG.dtype
    if isinstance(clnsig_dtype, hl.tarray):
        return hl.or_missing(
            hl.is_defined(ht.info.CLNSIG) & (hl.len(ht.info.CLNSIG) > 0),
            ht.info.CLNSIG[0],
        )
    return ht.info.CLNSIG


def _label_clinvar(ht: hl.Table) -> hl.Table:
    """Add _label field ('P', 'B', or 'other') based on ClinVar CLNSIG."""
    clnsig = _extract_clnsig(ht)
    p_labels = hl.literal(CLINVAR_PATHOGENIC_LABELS)
    b_labels = hl.literal(CLINVAR_BENIGN_LABELS)
    return ht.annotate(
        _label=hl.case()
        .when(hl.is_defined(clnsig) & p_labels.contains(clnsig), "P")
        .when(hl.is_defined(clnsig) & b_labels.contains(clnsig), "B")
        .default("other")
    )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and locate the files mentioned
find . -type f -name "*.py" | grep -E "(analysis|table_builders)" | head -20

Repository: bigbio/hvantk

Length of output: 162


🏁 Script executed:

# Read the create_clinvar_tb function
cat -n hvantk/tables/table_builders.py | sed -n '248,310p'

Repository: bigbio/hvantk

Length of output: 2381


🏁 Script executed:

# Also check the imports and constants needed for context
head -100 hvantk/ptm/analysis.py | cat -n

Repository: bigbio/hvantk

Length of output: 4391


🏁 Script executed:

# Search for CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS
grep -n "CLINVAR_PATHOGENIC_LABELS\|CLINVAR_BENIGN_LABELS" hvantk/ptm/analysis.py

Repository: bigbio/hvantk

Length of output: 252


🏁 Script executed:

# Check if split_multiallelic is used elsewhere in the codebase
grep -rn "split_multiallelic" --include="*.py" .

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Find ptm_landscape function and see how it uses _label_clinvar
grep -n "def ptm_landscape\|_label_clinvar" hvantk/ptm/analysis.py

Repository: bigbio/hvantk

Length of output: 175


🏁 Script executed:

# Read the ptm_landscape function to understand the context
cat -n hvantk/ptm/analysis.py | sed -n '200,300p'

Repository: bigbio/hvantk

Length of output: 4533


🏁 Script executed:

# Look for any documentation or tests that show how multiallelic VCFs are handled
find . -type f \( -name "*.py" -o -name "*.md" \) | xargs grep -l "multiallelic\|split_multiallelic" 2>/dev/null | head -10

Repository: bigbio/hvantk

Length of output: 127


🏁 Script executed:

# Check how multiallelic is handled in converters.py
grep -n "multiallelic\|split_multiallelic" hvantk/hgc/converters.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 1507


🏁 Script executed:

# Check the test file
grep -n "multiallelic\|split_multiallelic" hvantk/tests/ancestry/test_filter_hail.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 316


🏁 Script executed:

# Look at the VCF specification and understand CLNSIG in multiallelic context
# First, let's see if there's a README or documentation about ClinVar data handling
find . -type f \( -name "README*" -o -name "*.md" \) | xargs grep -l "ClinVar\|CLNSIG" 2>/dev/null

Repository: bigbio/hvantk

Length of output: 501


🏁 Script executed:

# Check the actual VCF header or info field definitions
grep -rn "CLNSIG\|INFO.*ClinVar" --include="*.py" . | head -20

Repository: bigbio/hvantk

Length of output: 1987


🏁 Script executed:

# Let's look at clinvar_streamer.py to understand the CLNSIG structure better
cat -n hvantk/data/clinvar_streamer.py | sed -n '150,170p'

Repository: bigbio/hvantk

Length of output: 842


🏁 Script executed:

# Check psroc/pipeline.py to see if it has the same CLNSIG extraction pattern and any comments
cat -n hvantk/psroc/pipeline.py | sed -n '1075,1095p'

Repository: bigbio/hvantk

Length of output: 905


🏁 Script executed:

# Look at the ClinVar documentation file
cat docs_site/examples/clinvar.md | head -100

Repository: bigbio/hvantk

Length of output: 2454


🏁 Script executed:

# Check if there are any comments or issues about multiallelic CLNSIG handling
grep -rn "multiallelic.*CLNSIG\|CLNSIG.*multiallelic\|collapse.*CLNSIG\|CLNSIG.*collapse" --include="*.py" --include="*.md" .

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Check how annotate_variants_with_ptm handles the table
grep -n "def annotate_variants_with_ptm" hvantk/ptm/annotate.py

Repository: bigbio/hvantk

Length of output: 91


🏁 Script executed:

# Let's understand the VCF format better - check if there's any filtering or splitting elsewhere
grep -rn "split_multiallelic\|\.split_rows()" --include="*.py" hvantk/

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Check the ClinVar VCF spec or any comments about how CLNSIG values work
grep -rn "CLNSIG.*array\|array.*CLNSIG\|CLNSIG.*multiple\|multiple.*CLNSIG" --include="*.py" --include="*.md" . -i

Repository: bigbio/hvantk

Length of output: 410


🏁 Script executed:

# Let's check what Hail's import_vcf actually does by looking at how alleles are structured
cat -n hvantk/tables/table_builders.py | sed -n '285,310p'

Repository: bigbio/hvantk

Length of output: 998


🏁 Script executed:

# Check if there are any comments or TODOs about CLNSIG handling
grep -rn "TODO\|FIXME\|XXX\|HACK\|BUG" hvantk/ptm/analysis.py hvantk/psroc/pipeline.py | grep -i "clnsig\|multiallelic"

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Look for any tests that show expected behavior with multiallelic CLNSIG
find . -path ./node_modules -prune -o -type f -name "test_*.py" -o -name "*_test.py" | xargs grep -l "CLNSIG\|_extract_clnsig" 2>/dev/null

Repository: bigbio/hvantk

Length of output: 41


🏁 Script executed:

# Check the full _extract_clnsig function context and any comments around it
cat -n hvantk/ptm/analysis.py | sed -n '100,115p'

Repository: bigbio/hvantk

Length of output: 754


🏁 Script executed:

# Compare how clinvar_streamer handles CLNSIG vs how ptm handles it - show surrounding context
cat -n hvantk/data/clinvar_streamer.py | sed -n '145,175p'

Repository: bigbio/hvantk

Length of output: 1281


Don't collapse multiallelic ClinVar CLNSIG to the first entry.

create_clinvar_tb() in hvantk/tables/table_builders.py:292-298 imports raw VCF with hl.import_vcf(...).rows() without splitting multiallelic sites. For multiallelic records, CLNSIG is stored as an array with one value per alternate allele. However, _extract_clnsig() extracts only CLNSIG[0], discarding significance information for alleles at positions 1+. This causes variants with mixed-significance alleles (e.g., alt[0]=Pathogenic, alt[1]=Benign) to be silently mislabeled based only on the first allele, leading to incorrect counts and enrichment calculations in ptm_landscape().

The correct pattern is shown in hvantk/data/clinvar_streamer.py:155-157, which uses .any() to check if any array element matches the label set. Note that hvantk/psroc/pipeline.py:1077-1085 contains the same problematic CLNSIG[0] logic.
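
The difference between the two patterns in plain Python (the label sets are abbreviated for illustration):

```python
PATHOGENIC = {"Pathogenic", "Likely_pathogenic"}  # abbreviated label sets
BENIGN = {"Benign", "Likely_benign"}

def label_first_only(clnsig):
    """The flagged pattern: classify by CLNSIG[0] alone."""
    first = clnsig[0] if clnsig else None
    if first in PATHOGENIC:
        return "P"
    if first in BENIGN:
        return "B"
    return "other"

def label_any(clnsig):
    """The fix: classify 'P' if any per-allele value is pathogenic,
    mirroring hl.any(lambda x: p_labels.contains(x), clnsig)."""
    if any(x in PATHOGENIC for x in clnsig):
        return "P"
    if any(x in BENIGN for x in clnsig):
        return "B"
    return "other"
```

On a mixed-significance multiallelic record the first-only pattern silently reports the first allele's label, while the any-based check recovers the pathogenic signal.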

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 103 - 124, The code collapses
multiallelic ClinVar CLNSIG to CLNSIG[0] in _extract_clnsig and then labels
variants by that single value; instead, preserve the array and change
_label_clinvar to test if any allele’s CLNSIG matches the pathogenic/benign
label sets (use hl.any(lambda x: p_labels.contains(x), clnsig) and similarly for
b_labels), and handle the scalar case by checking membership directly when
clnsig is not an array; update _extract_clnsig/_label_clinvar to stop using
CLNSIG[0] and use hl.any(...) (or equivalent) against CLINVAR_PATHOGENIC_LABELS
and CLINVAR_BENIGN_LABELS so multiallelic records are correctly classified.

Comment thread hvantk/ptm/analysis.py
Comment on lines +165 to +171
ht_out = ht.annotate(
    _vid=hl.delimit(
        [
            hl.str(ht.locus.contig),
            hl.str(ht.locus.position),
            ht.alleles[0],
            ht.alleles[1],

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed: repository listing and source inspection of bigbio/hvantk (fd, git ls-files, rg, and cat/sed over hvantk/ptm/analysis.py, hvantk/ptm/pipeline.py, hvantk/ptm/annotate.py, hvantk/commands/ptm_cli.py, and hvantk/tests/test_ptm.py).

Hardcoded single ALT export will lose data from multi-allelic variants.

The _vid formatter at line 171 accesses only alleles[1], so variants with multiple ALT alleles (e.g., ["A", "T", "C"]) will silently truncate to ref:alt1 and drop the additional ALTs. The exported output will be incomplete for downstream tools when processing unsplit tables. Either validate that the input is biallelic and document this constraint, or emit one row per ALT.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 165-171, the _vid construction in
ht_out (ht.annotate using ht.locus.contig, ht.locus.position and ht.alleles[1])
only picks the first ALT and will drop additional alleles for multiallelic
records; either enforce biallelic input or explode multi-allelic sites into one
row per ALT. Fix by validating ht.alleles length (e.g., assert/hl.if_else to
raise or filter where hl.len(ht.alleles) == 2) when you require biallelic, or
else transform the table before building _vid by exploding ALT alleles (use an
explode/flatten pattern to iterate over hl.range(1, hl.len(ht.alleles)) or an
equivalent Hail explode on ht.alleles[1:], create a per-ALT allele variable, and
build _vid from that allele instead of ht.alleles[1]) so each exported row
includes the correct ALT.
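As a sketch of the second option (one exported row per ALT), the per-allele expansion can be modeled in plain Python; the Hail equivalent would explode over `hl.range(1, hl.len(ht.alleles))`. The helper name and ID format here are illustrative:

```python
def explode_vids(contig, position, alleles):
    """Build one contig:pos:ref:alt identifier per ALT allele.

    Models the explode fix: alleles[0] is REF, alleles[1:] are ALTs,
    so a triallelic site yields two rows instead of one truncated _vid.
    """
    ref = alleles[0]
    return [f"{contig}:{position}:{ref}:{alt}" for alt in alleles[1:]]

print(explode_vids("chr1", 12345, ["A", "T", "C"]))
# ['chr1:12345:A:T', 'chr1:12345:A:C']
```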

Comment thread hvantk/ptm/analysis.py
Comment on lines +334 to +353
"""Population-level allele frequency analysis at PTM sites (Q3).

Annotates gnomAD variants with PTM proximity and compares allele
frequency distributions between PTM-site and non-PTM coding variants.
This is where Hail scalability matters (~700M gnomAD variants).

Parameters
----------
gnomad_ht : hl.Table
gnomAD variant table with allele frequency field.
ptm_ht : hl.Table
PTM sites Hail Table from create_ptm_sites_tb.
output_dir : str
Directory for output files (population_summary.json).
ccr_ht : hl.Table, optional
CCR table for constrained region cross-reference.
af_field : str
Name of the allele frequency field in gnomad_ht (default: "AF").
flanking_codons : int
Flanking codons for proximal window (default: 5).

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed: source inspection of bigbio/hvantk (find, wc, sed, grep, and rg over hvantk/ptm/analysis.py, hvantk/ptm/annotate.py, and hvantk/commands/ptm_cli.py).

The "non-PTM coding" filter is not enforced in the code.

The docstring promises comparison against "non-PTM coding variants," but is_non_ptm = ~is_ptm & ~is_prox includes all noncoding variants from gnomAD. Either add a documented precondition that gnomad_ht must be pre-filtered to coding variants, or add explicit filtering (e.g., is_non_ptm = ~is_ptm & ~is_prox & is_coding).

Also applies to: 375-388

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/analysis.py` around lines 334-353, the is_non_ptm mask currently
defined as "is_non_ptm = ~is_ptm & ~is_prox" allows noncoding variants; update
the logic to require coding variants or declare a precondition. Either (A)
enforce coding in the analysis by changing the mask to include a coding flag
from gnomad_ht (e.g., is_non_ptm = ~is_ptm & ~is_prox & is_coding, where
is_coding is derived from gnomad_ht consequence/variant annotation) and apply
the same change where similar masks are built (see the other block around
375-388), or (B) add a clear docstring/precondition that gnomad_ht must be
pre-filtered to coding variants and validate that upstream (raise an error if
coding annotation/flag is missing). Reference symbols: is_non_ptm, is_ptm,
is_prox, af_field and the PTM analysis function in hvantk/ptm/analysis.py.
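The difference between the two masks can be shown with a plain-Python truth-table sketch (function and flag names are illustrative): under the current logic, a noncoding gnomAD variant lands in the "non-PTM" background set.

```python
def non_ptm_mask_current(is_ptm, is_prox):
    """Mask as currently written: lets noncoding variants through."""
    return (not is_ptm) and (not is_prox)

def non_ptm_mask_fixed(is_ptm, is_prox, is_coding):
    """Option (A): additionally require a coding flag."""
    return (not is_ptm) and (not is_prox) and is_coding

# An intronic variant: not at a PTM site, not proximal, not coding.
print(non_ptm_mask_current(False, False))       # True  -> polluted background
print(non_ptm_mask_fixed(False, False, False))  # False -> correctly excluded
```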

Comment on lines +1431 to +1441
# Remap contig names to match GRCh38 (e.g., MT -> M for chrM)
contig_remap = hl.dict({"MT": "M"})
ht = ht.annotate(
    _contig=hl.str("chr") + contig_remap.get(ht.chrom, ht.chrom),
)

# Filter to valid contigs in the reference genome
valid_contigs = hl.set(
    hl.literal(hl.get_reference(reference_genome).contigs)
)
ht = ht.filter(valid_contigs.contains(ht._contig))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query: What contig names do Hail's built-in GRCh37 and GRCh38 reference genomes use by default?

💡 Result: Hail's built-in GRCh37 reference genome uses numeric contig names by default: "1", "2", ..., "22", "X", "Y", "MT". GRCh38 uses prefixed names: "chr1", "chr2", ..., "chr22", "chrX", "chrY", "chrM".

🏁 Scripts executed: source inspection of bigbio/hvantk (fd, wc, and sed over hvantk/tables/table_builders.py around the flagged lines).


Make PTM contig normalization reference-genome-aware.

The code hard-codes "chr" prefix for all inputs, which breaks with reference_genome="GRCh37". GRCh37's contigs are unprefixed ("1", "MT"), while GRCh38's are prefixed ("chr1", "chrM"). With GRCh37, input "1" becomes "chr1" (invalid) and is filtered out; already-prefixed inputs become "chrchr1". Additionally, the MT→M remapping assumes GRCh38's "chrM" format and will mismatch GRCh37's "MT".

The fix should detect the reference genome's contig naming convention at the start of create_ptm_sites_tb, normalize inputs to a base form (removing any existing prefix), remap MT/M appropriately, and conditionally re-add the prefix only if the reference uses one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/table_builders.py` around lines 1431-1441,
create_ptm_sites_tb currently hardcodes adding "chr" to inputs
(contig_remap/ht._contig) which breaks non-prefixed references; change
create_ptm_sites_tb to detect whether the target reference's contigs are
prefixed by checking get_reference(reference_genome).contigs for a "chr" prefix,
then normalize incoming ht.chrom by stripping any leading "chr"/"CHR" to a base
contig name, apply the MT<->M remap on that base (update contig_remap logic),
and finally re-add the "chr" prefix only if the reference uses prefixed contigs
before assigning ht._contig and filtering against valid_contigs (use
valid_contigs = hl.set(hl.literal(hl.get_reference(reference_genome).contigs))
as before).
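A minimal plain-Python sketch of the reference-aware normalization described above, with literal contig lists standing in for `hl.get_reference(...).contigs`; the helper name is illustrative, not part of the module:

```python
def normalize_contig(chrom, reference_contigs):
    """Normalize a contig name to the target reference's convention.

    Strips any existing 'chr'/'CHR' prefix, remaps mitochondrial
    naming (GRCh38 uses chrM, GRCh37 uses MT), then re-adds 'chr'
    only if the reference's own contigs are prefixed.
    """
    base = chrom[3:] if chrom.lower().startswith("chr") else chrom
    prefixed = any(c.startswith("chr") for c in reference_contigs)
    if base in ("MT", "M"):
        base = "M" if prefixed else "MT"
    return f"chr{base}" if prefixed else base

GRCH37 = ["1", "2", "22", "X", "Y", "MT"]
GRCH38 = ["chr1", "chr2", "chr22", "chrX", "chrY", "chrM"]

print(normalize_contig("chr1", GRCH37))  # '1'  (no 'chrchr1', not filtered out)
print(normalize_contig("MT", GRCH38))    # 'chrM'
print(normalize_contig("chrM", GRCH37))  # 'MT'
```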


Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants