📝 Walkthrough

Adds a new PTM domain: UniProt PTM download, Ensembl GTF parsing, protein→genome mapping, PTM site Hail table builder, variant annotation, landscape/population analyses, plotting, HTML report generation, CLI entrypoints, utilities (sanitization/renaming), and tests; plus documentation updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as "hvantk ptm"
    participant Pipeline as "ptm_build_pipeline"
    participant UniProt as "UniProtPTMDataset"
    participant GTF as "parse_ensembl_gtf"
    participant Mapper as "map_ptm_sites"
    participant Table as "create_ptm_sites_tb"
    participant Annotate as "annotate_variants_with_ptm"
    participant Analysis as "ptm_landscape/ptm_population"
    participant Plot as "plot_*"
    participant Report as "generate_report"
    User->>CLI: ptm build (config)
    CLI->>Pipeline: ptm_build_pipeline(config)
    Pipeline->>UniProt: download_uniprot_ptm() (if needed)
    Pipeline->>GTF: download_ensembl_gtf()/parse_ensembl_gtf(gtf_path)
    GTF-->>Pipeline: GTFData
    Pipeline->>Mapper: map_ptm_sites(ptm_tsv, gtf_data)
    Mapper-->>Pipeline: mapped_tsv, PTMBuildResult
    Pipeline->>Table: create_ptm_sites_tb(mapped_tsv -> HT)
    Table-->>CLI: output HT created
    User->>CLI: ptm annotate
    CLI->>Annotate: annotate_variants_with_ptm(variants_ht, ptm_ht)
    Annotate-->>CLI: annotated HT
    User->>CLI: ptm landscape / population
    CLI->>Analysis: ptm_landscape(...) or ptm_population(...)
    Analysis-->>CLI: Result objects
    CLI->>Plot: plot_* (result)
    Plot-->>CLI: PNGs
    User->>CLI: ptm report
    CLI->>Report: generate_report(landscape_result, population_result)
    Report-->>CLI: HTML report
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Pull request overview
This PR adds a new PTM (Post-Translational Modification) analysis module to hvantk, including a build pipeline (UniProt PTM + Ensembl GTF → mapped TSV → Hail Table), variant annotation utilities, downstream analyses (landscape/population), plotting/reporting, plus CLI and documentation to integrate it into existing workflows.
Changes:
- Introduces `hvantk.ptm` (mapping, annotation, analyses, plotting, HTML reporting) and wires it into the main CLI (`hvantk ptm ...`).
- Adds UniProt PTM dataset download support (`hvantk download uniprot-ptm`) and a PTM sites table builder registered in the table registry / `mktable`.
- Improves consistency/utilities: centralized ClinVar label constants, flexible header renaming + string→bool coercion, and a TSV sanitizer used for GenCC downloads.
Reviewed changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| `hvantk/utils/table_utils.py` | Adds flexible header rename mapping and a shared string→bool coercion helper. |
| `hvantk/tests/test_ptm.py` | Adds unit/integration-style tests for PTM mapping, resolution, pipeline round-trip, and CLI help. |
| `hvantk/tables/table_builders.py` | Uses new rename/bool helpers for COSMIC CGC; adds `create_ptm_sites_tb` to build a PTM sites HT. |
| `hvantk/tables/registry.py` | Registers the `ptm-sites` table adapter. |
| `hvantk/ptm/report.py` | Adds static HTML report generator combining PTM analysis results and plots. |
| `hvantk/ptm/plot.py` | Adds plotting utilities for PTM landscape/population analyses and figure encoding. |
| `hvantk/ptm/pipeline.py` | Adds PTM build pipeline orchestration, download helpers, and mapping TSV writer. |
| `hvantk/ptm/mapper.py` | Adds local protein→genome coordinate mapper and Ensembl GTF parser. |
| `hvantk/ptm/constants.py` | Adds PTM module constants (URLs, categories, output columns). |
| `hvantk/ptm/annotate.py` | Adds Hail-based variant annotation against PTM sites/proximity. |
| `hvantk/ptm/analysis.py` | Adds Q1/Q3 analyses and export of PTM/non-PTM strata; uses centralized ClinVar labels. |
| `hvantk/ptm/__init__.py` | Exposes the PTM public API. |
| `hvantk/psroc/pipeline.py` | Switches PSROC ClinVar label constants to centralized `core.constants` (backward-compatible aliases). |
| `hvantk/hvantk.py` | Registers the new `ptm` CLI group with the main hvantk CLI. |
| `hvantk/datasets/uniprot_ptm_datasets.py` | Implements UniProt PTM REST API download to a normalized TSV. |
| `hvantk/datasets/gencc_datasets.py` | Sanitizes the GenCC TSV post-download for Hail import compatibility. |
| `hvantk/data/file_utils.py` | Adds `sanitize_tsv()` helper for cleaning TSVs prior to Hail import. |
| `hvantk/core/constants.py` | Adds centralized ClinVar pathogenic/benign label lists. |
| `hvantk/commands/uniprot_ptm_downloader.py` | Adds CLI command to download UniProt PTM data. |
| `hvantk/commands/ptm_cli.py` | Adds the `hvantk ptm` command group and subcommands for build/annotate/analyses/report. |
| `hvantk/commands/make_table_cli.py` | Adds `hvantk mktable ptm-sites` builder command. |
| `hvantk/commands/download_cli.py` | Registers `hvantk download uniprot-ptm`. |
| `hvantk/annotation/annotate.py` | Updates ClinVar label logic to use centralized constants. |
| `docs_site/tools/ptm.md` | Adds PTM tool reference documentation. |
| `docs_site/guide/usage.md` | Documents PTM build usage in the usage guide. |
| `docs_site/guide/data-sources.md` | Documents UniProt PTM + Ensembl GTF as data sources. |
| `docs_site/examples/ptm.md` | Adds an end-to-end PTM example walkthrough. |
| `docs_site/examples/index.md` | Links the PTM example from the examples index. |
| `docs_site/architecture.md` | Updates the architecture overview to include the new `ptm/` module and tests. |
```python
for row in rows:
    # Flatten any embedded newlines within cells
    cleaned = [
        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
        for cell in row
    ]
    fh.write(delimiter.join(cleaned) + "\n")
```
When rewriting the file, sanitize_tsv writes delimiter.join(cleaned) without CSV quoting/escaping and does not validate that each row has the same column count as the header. If any cell contains the delimiter/quotes, this can corrupt the TSV structure for downstream parsing. Consider using csv.writer with the same delimiter (and optionally enforcing/repairing a fixed column count) so the output remains a valid delimited file.
```diff
-for row in rows:
-    # Flatten any embedded newlines within cells
-    cleaned = [
-        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
-        for cell in row
-    ]
-    fh.write(delimiter.join(cleaned) + "\n")
+writer = csv.writer(
+    fh,
+    delimiter=delimiter,
+    quotechar='"',
+    lineterminator="\n",
+)
+for row_index, row in enumerate(rows):
+    # Ensure consistent column count with header
+    if len(row) < n_cols:
+        logger.warning(
+            "sanitize_tsv: row %d has %d columns, expected %d; padding with empty fields",
+            row_index,
+            len(row),
+            n_cols,
+        )
+        row = row + [""] * (n_cols - len(row))
+    elif len(row) > n_cols:
+        logger.warning(
+            "sanitize_tsv: row %d has %d columns, expected %d; truncating extra columns",
+            row_index,
+            len(row),
+            n_cols,
+        )
+        row = row[:n_cols]
+    # Flatten any embedded newlines within cells and remove delimiters
+    cleaned = [
+        cell.replace("\n", " ").replace("\r", " ").replace(delimiter, " ")
+        for cell in row
+    ]
+    writer.writerow(cleaned)
```
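The corruption risk is easy to demonstrate in isolation. A minimal standalone sketch (not hvantk's actual `sanitize_tsv`; the helper name and sample rows are invented) contrasting a naive join with `csv.writer`, which quotes cells that contain the delimiter:

```python
import csv
import io

def write_rows(rows, delimiter="\t"):
    """Write rows with csv.writer so embedded delimiters are quoted, not split."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quotechar='"', lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

rows = [["gene", "note"], ["TP53", "contains\ta tab"]]

# Naive join: the embedded tab splits one cell into two fields.
naive_line = "\t".join(rows[1])
assert len(naive_line.split("\t")) == 3  # structure corrupted

# csv.writer: the awkward cell is quoted and round-trips intact.
parsed = list(csv.reader(io.StringIO(write_rows(rows)), delimiter="\t"))
assert parsed == rows
```

The same quoting behavior is what the suggested `csv.writer(fh, delimiter=delimiter, ...)` change relies on.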
```python
target_names = set(field_map.values())
rename = {}
for canonical_raw, target in field_map.items():
    norm = _normalize_field_name(canonical_raw)
    actual = norm_to_actual.get(norm)
    if actual is not None and actual != target:
        # Skip if actual field name already equals the target or
        # if the target name is already taken by another field
        if actual not in target_names:
            rename[actual] = target
    # Also try matching by the target name's normalized form
    # (handles cases where actual headers already use the target naming)
    if actual is None:
        norm_target = _normalize_field_name(target)
        actual_by_target = norm_to_actual.get(norm_target)
        if actual_by_target is not None and actual_by_target != target:
            if actual_by_target not in target_names:
                rename[actual_by_target] = target
```
build_rename_map() attempts to avoid rename collisions, but the current check (if actual not in target_names) does not prevent renaming a field to a target name that already exists in actual_fields (e.g., both "gene_symbol" and "Gene Symbol" present). hl.Table.rename() will fail with duplicate fields in that case. Consider skipping a rename when target in actual_fields (or when target is already produced by another rename) rather than checking actual against target_names.
```diff
-target_names = set(field_map.values())
+# Track all names that are or will be used as targets to avoid collisions
+used_targets = set(actual_fields)
 rename = {}
 for canonical_raw, target in field_map.items():
     norm = _normalize_field_name(canonical_raw)
     actual = norm_to_actual.get(norm)
     if actual is not None and actual != target:
         # Skip if actual field name already equals the target or
-        # if the target name is already taken by another field
-        if actual not in target_names:
+        # if the target name is already taken by an existing or renamed field
+        if target not in used_targets:
             rename[actual] = target
+            used_targets.add(target)
     # Also try matching by the target name's normalized form
     # (handles cases where actual headers already use the target naming)
     if actual is None:
         norm_target = _normalize_field_name(target)
         actual_by_target = norm_to_actual.get(norm_target)
         if actual_by_target is not None and actual_by_target != target:
-            if actual_by_target not in target_names:
+            if target not in used_targets:
                 rename[actual_by_target] = target
+                used_targets.add(target)
```
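The collision guard can be exercised without Hail. A minimal pure-Python sketch (the `normalize` helper, function signature, and field names below are illustrative, not hvantk's actual implementation):

```python
def normalize(name: str) -> str:
    # Illustrative normalizer: lowercase and strip non-alphanumerics,
    # so "Gene Symbol" and "gene_symbol" compare equal.
    return "".join(ch for ch in name.lower() if ch.isalnum())

def build_rename_map(actual_fields, field_map):
    """Sketch of the proposed fix: never rename onto a name that already
    exists in the table or was produced by an earlier rename."""
    norm_to_actual = {normalize(f): f for f in actual_fields}
    used_targets = set(actual_fields)
    rename = {}
    for canonical, target in field_map.items():
        actual = norm_to_actual.get(normalize(canonical))
        if actual is None:
            actual = norm_to_actual.get(normalize(target))
        if actual is not None and actual != target and target not in used_targets:
            rename[actual] = target
            used_targets.add(target)
    return rename

# Safe case: only the raw header exists, so the rename is emitted.
assert build_rename_map(["Gene Symbol"], {"Gene Symbol": "gene_symbol"}) == {
    "Gene Symbol": "gene_symbol"
}
# Collision case: the target name is already a real field, so the rename
# is skipped instead of producing a duplicate-field error downstream.
assert build_rename_map(
    ["gene_symbol", "Gene Symbol"], {"Gene Symbol": "gene_symbol"}
) == {}
```

Checking `target` against the live set of used names, rather than `actual` against the static target set, is what prevents two fields from ending up with the same name.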
```text
- build: Download PTM data, map coordinates, build Hail Table
- annotate: Annotate a variant table with PTM site information
- landscape: PTM-variant overlap analysis (Q1)
- evaluate: Predictor performance at PTM sites (Q2)
```
Module docstring lists an evaluate subcommand, but this CLI defines export-strata for Q2 and does not implement evaluate. Please update the docstring to reflect the actual subcommands to avoid confusing users.
```diff
-- evaluate: Predictor performance at PTM sites (Q2)
+- export-strata: Export PTM/non-PTM variant strata for predictor evaluation (Q2)
```
```python
from hvantk.ptm.plot import (
    plot_landscape_summary,
    plot_overlap_by_category,
    plot_distance_distribution,
)
import matplotlib
matplotlib.use("Agg")
```
matplotlib.use("Agg") is called after importing hvantk.ptm.plot, but that module imports matplotlib.pyplot at import time. In headless environments this can fail or select an interactive backend before Agg is set. To reliably force a non-interactive backend, call matplotlib.use("Agg") before importing any module that imports pyplot (i.e., move the import matplotlib; matplotlib.use("Agg") above the from hvantk.ptm.plot import ...).
```diff
-from hvantk.ptm.plot import (
-    plot_landscape_summary,
-    plot_overlap_by_category,
-    plot_distance_distribution,
-)
-import matplotlib
-matplotlib.use("Agg")
+import matplotlib
+matplotlib.use("Agg")
+from hvantk.ptm.plot import (
+    plot_landscape_summary,
+    plot_overlap_by_category,
+    plot_distance_distribution,
+)
```
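The safe ordering can be verified directly. A small standalone sketch (assumes matplotlib is installed; this is not the CLI code itself, just the import pattern the review recommends):

```python
import matplotlib
matplotlib.use("Agg")            # select the headless backend first
import matplotlib.pyplot as plt  # pyplot now binds to Agg, not a GUI backend

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
backend = matplotlib.get_backend()
plt.close(fig)
```

Because `hvantk.ptm.plot` imports pyplot at module import time, any module that imports it transitively must run `matplotlib.use("Agg")` before that import for the selection to be guaranteed.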
```python
from hvantk.ptm.plot import plot_population_af
import matplotlib
matplotlib.use("Agg")
```
Same backend-order issue here: matplotlib.use("Agg") is invoked after importing plot_population_af from hvantk.ptm.plot (which imports pyplot at module import). Move the backend selection before importing plot functions to avoid failures on systems without a display.
```diff
-from hvantk.ptm.plot import plot_population_af
-import matplotlib
-matplotlib.use("Agg")
+import matplotlib
+matplotlib.use("Agg")
+from hvantk.ptm.plot import plot_population_af
```
```diff
 # Pre-calculate conditions for better readability and performance
 is_pathogenic = t.clinvar_clnsig.any(
-    lambda x: hl.set(pathogenic_label_clinvar).contains(x)
+    lambda x: hl.set(CLINVAR_PATHOGENIC_LABELS).contains(x)
 )
-is_benign = t.clinvar_clnsig.any(lambda x: hl.set(benign_label_clinvar).contains(x))
+is_benign = t.clinvar_clnsig.any(lambda x: hl.set(CLINVAR_BENIGN_LABELS).contains(x))

 t = t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG)
 t = t.annotate(
```
annotate_clinvar_clnsig computes is_pathogenic / is_benign from t.clinvar_clnsig before clinvar_clnsig is annotated from clinvar_ht. For inputs that don’t already have a clinvar_clnsig field (e.g., commands/annotate_features.py reads a variant HT then calls this), this will raise a field-not-found error and/or classify based on stale data. Compute the conditions from the joined ClinVar CLNSIG expression (or move the t = t.annotate(clinvar_clnsig=...) line before the .any(...) checks).
```text
The table is keyed by a genomic interval covering the PTM codon,
enabling interval-based joins with variant tables.
```
Docstring says the PTM sites table is "keyed by a genomic interval covering the PTM codon", but the implementation keys by a single locus (codon_start) and keeps flanking_interval as a non-key field. Please update the docstring to match the actual keying strategy (or change the key to an interval if that’s what downstream joins expect).
```python
    hl.Table
        The checkpointed Hail Table keyed by locus interval.
    """
    from hvantk.ptm.constants import PTM_TYPE_CATEGORIES
```
PTM_TYPE_CATEGORIES is imported but never used in this function. Dropping the unused import will avoid lint noise and prevents unnecessary imports when building tables.
```diff
-    from hvantk.ptm.constants import PTM_TYPE_CATEGORIES
```
```python
codon_pos = positions[idx:idx + 3]
return CodonMapping(
    chrom=chrom,
    codon_start=min(codon_pos),
    codon_end=max(codon_pos),
    strand=strand,
)
```
map_residue_to_genomic returns codon_start=min(codon_pos) and codon_end=max(codon_pos). For split codons spanning an exon boundary, those three genomic positions are not contiguous; using min/max creates an interval that includes the intron and later code works around this by truncating the end. This can cause missed annotations for variants that hit the codon base(s) in the other exon. Consider returning/storing the exact 3 genomic positions (e.g., codon_positions: tuple[int,int,int]) or separate locus fields so downstream annotation can join on the true codon bases.
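The failure mode is easy to reproduce with a toy transcript. A hypothetical sketch (the exon coordinates and helper name are invented for illustration; hvantk's `CodonMapping` stores `codon_start`/`codon_end` rather than a position tuple):

```python
def codon_genomic_positions(cds_exons, residue_number):
    """Exact genomic positions (1-based, plus strand) of the three codon
    bases for a protein residue, walking CDS exons in 5'->3' order."""
    positions = []
    for start, end in cds_exons:
        positions.extend(range(start, end + 1))
    idx = (residue_number - 1) * 3
    return tuple(positions[idx:idx + 3])

# Residue 2 of a CDS split across two exons: its codon bases are
# non-contiguous, so a min/max interval would swallow the intron.
exons = [(100, 104), (200, 204)]  # 10 coding bases across two exons
codon = codon_genomic_positions(exons, 2)   # (103, 104, 200)
span = max(codon) - min(codon) + 1          # 98 bases, not 3
```

Storing the exact three positions, as the comment suggests, lets downstream annotation join on the true codon bases instead of an interval that includes intronic positions.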
```python
        plt.close(fig)
        return f"<img src='{src}' alt='{output_path.stem}' class='embedded-image'/>"
    except Exception:
        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
```
The exception handler in _render_plot logs only a generic warning, which makes debugging plot failures difficult. Logging the exception details (e.g., logger.exception(..., exc_info=True)) would preserve the stack trace while still skipping the plot.
```diff
-        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
+        logger.exception("Failed to generate plot %s, skipping.", output_path.stem)
```
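The difference is observable with the stdlib alone; a small sketch using a throwaway logger (logger name and plot name are illustrative):

```python
import io
import logging

logger = logging.getLogger("ptm.report.demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)

try:
    raise ValueError("bad figure data")
except Exception:
    # logger.exception logs at ERROR level and appends the active traceback,
    # so the root cause survives even though the plot is skipped.
    logger.exception("Failed to generate plot %s, skipping.", "overlap_by_category")

output = stream.getvalue()
```

With plain `logger.warning(...)` the `output` would contain only the message; `logger.exception(...)` additionally records the `Traceback` and the `ValueError`.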
Actionable comments posted: 12
🧹 Nitpick comments (12)
hvantk/tests/test_ptm.py (1)
**218-221: Use the `tmp_path` fixture instead of hardcoded `/tmp` paths.** The test uses hardcoded `/tmp/test` paths, which are flagged as potentially insecure. Using pytest's `tmp_path` fixture is more portable and secure.

♻️ Proposed fix

```diff
-    config2 = PTMBuildConfig(
-        output_dir="/tmp/test", output_ht="/tmp/test/out.ht"
-    )
+def test_ptm_build_config_validation(tmp_path):
+    """PTMBuildConfig.validate catches missing required fields."""
+    config = PTMBuildConfig()
+    errors = config.validate()
+    assert len(errors) == 2
+    assert any("output_dir" in e for e in errors)
+    assert any("output_ht" in e for e in errors)
+
+    config2 = PTMBuildConfig(
+        output_dir=str(tmp_path), output_ht=str(tmp_path / "out.ht")
+    )
     assert config2.validate() == []
```

hvantk/datasets/uniprot_ptm_datasets.py (1)
**226-231: Use `raise ... from None` (or `from err`) for exception chaining.** The re-raised `ValueError` should use proper exception chaining to clarify the error origin.

♻️ Proposed fix

```diff
         try:
             datetime.strptime(version_date, "%Y-%m-%d")
         except ValueError:
             raise ValueError(
                 f"Invalid version_date format: {version_date}. Expected YYYY-MM-DD"
-            )
+            ) from None
```

hvantk/ptm/report.py (1)
**258-260: Log the exception details for debugging.** The broad `except Exception` is acceptable here to prevent plot failures from crashing report generation. However, logging only the plot name loses valuable diagnostic information.

♻️ Proposed fix to include exception details

```diff
     except Exception:
-        logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
+        logger.warning(
+            "Failed to generate plot %s, skipping.", output_path.stem, exc_info=True
+        )
         return ""
```

docs_site/examples/ptm.md (1)
**43-49: Add a language specifier to the fenced output blocks.** The output examples lack language specifiers. Use `text` or `console` to satisfy linting and improve syntax highlighting consistency.

📝 Proposed fix

````diff
 **Expected output:**
-```
+```text
 Mapping complete: 55422/56120 mapped (98.7%), 698 failed
 Resolution: {'ensembl_canonical': 48231, 'longest_cds': 7191}
 Mapped TSV: data/ptm/ptm_mapped.tsv
 Hail Table: data/ptm/ptm_sites.ht
````

Apply the same fix to the output blocks at lines 63, 83, and 150.

hvantk/ptm/analysis.py (3)

**42-44: Remove the extraneous f-string prefix.** Line 44 uses an f-string but contains no placeholders.

📝 Proposed fix

```diff
     def summary(self) -> str:
         lines = [
-            f"PTM-Variant Landscape:",
+            "PTM-Variant Landscape:",
             f"  Variants: {self.n_variants:,} total, "
```

**77-79: Remove the extraneous f-string prefix.** Line 79 uses an f-string but contains no placeholders.

📝 Proposed fix

```diff
     def summary(self) -> str:
         lines = [
-            f"PTM Population Analysis:",
+            "PTM Population Analysis:",
             f"  Total variants: {self.n_variants:,}",
```
**369: Consider validating that `af_field` exists before use.** The dynamic field access `annotated[af_field]` will fail at query execution time if the field doesn't exist. An early validation could provide a clearer error message.

📝 Optional validation

```diff
+    if af_field not in annotated.row:
+        raise ValueError(f"AF field '{af_field}' not found in gnomAD table. "
+                         f"Available fields: {list(annotated.row)}")
     af = annotated[af_field]
```

hvantk/ptm/plot.py (2)

**43-50: Consider renaming the `format` parameter to avoid shadowing the builtin.** While functional, shadowing the `format` builtin is flagged by linters. Common alternatives: `fmt`, `output_format`, or `file_format`.

**253-258: Add `strict=True` to `zip()` for defensive programming.** While `bars` and `n_values` are guaranteed to have the same length here, adding `strict=True` catches any future refactoring errors that might cause misalignment.

📝 Proposed fix

```diff
     # Annotate counts on bars
-    for bar, n in zip(bars, n_values):
+    for bar, n in zip(bars, n_values, strict=True):
         ax.text(
             bar.get_x() + bar.get_width() / 2,
             bar.get_height(),
             f"n={n:,}",
             ha="center",
             va="bottom",
             fontsize=9,
         )
```

hvantk/commands/ptm_cli.py (3)

**119-123: Add context to the validation error output.** The validation errors are printed without a preceding message indicating what failed.

📝 Proposed fix

```diff
     errors = config.validate()
     if errors:
+        click.echo("Configuration validation failed:", err=True)
         for error in errors:
             click.echo(f"  - {error}", err=True)
         ctx.exit(1)
```

**181-187: Consider aggregating counts in a single pass.** Three separate `count()` calls trigger three table scans. For large variant tables, this could be optimized with a single aggregation.

⚡ Proposed optimization

```diff
-    n_total = result.count()
-    n_ptm_site = result.filter(result.is_ptm_site).count()
-    n_proximal = result.filter(result.is_ptm_proximal).count()
+    counts = result.aggregate(hl.struct(
+        n_total=hl.agg.count(),
+        n_ptm_site=hl.agg.filter(result.is_ptm_site, hl.agg.count()),
+        n_proximal=hl.agg.filter(result.is_ptm_proximal, hl.agg.count()),
+    ))
+    n_total = counts.n_total
+    n_ptm_site = counts.n_ptm_site
+    n_proximal = counts.n_proximal
```

**1-11: The docstring mentions an `evaluate` command that doesn't exist.** The module docstring at line 8 lists `evaluate: Predictor performance at PTM sites (Q2)`, but this command is not implemented in the file. The actual Q2 functionality is handled via `export-strata` combined with external `psroc` commands, as documented in the `ptm_group` docstring. Consider updating the module docstring to reflect the actual command set:

📝 Proposed fix

```diff
 """
 PTM CLI Commands - Post-Translational Modification Analysis

 This module provides CLI commands for the PTM variant classification pipeline:
 - build: Download PTM data, map coordinates, build Hail Table
 - annotate: Annotate a variant table with PTM site information
 - landscape: PTM-variant overlap analysis (Q1)
-- evaluate: Predictor performance at PTM sites (Q2)
+- export-strata: Export PTM-stratified variant lists for external predictor evaluation (Q2)
 - population: Population-level PTM-variant burden (Q3)
 - report: Generate summary report
 """
```

🤖 Prompt for all review comments with AI agents
Inline comments:
In@docs_site/examples/index.md:
- Around line 97-99: The quick-start commands have a PTM HT path mismatch: the
build command (hvantk ptm build) writes data/ptm/ptm_sites.ht but the landscape
command (hvantk ptm landscape) expects --ptm-ht ptm_sites.ht; update the
landscape invocation to use --ptm-ht data/ptm/ptm_sites.ht (or alternatively
change the build --output-ht to produce ptm_sites.ht in the current dir) so the
two commands reference the same file path.In
@docs_site/guide/data-sources.md:
- Around line 250-257: The docs currently list placeholder comments instead of
actionable commands for pre-downloading data; replace the commented notes in the
"Download (optional, for offline use)" block with runnable commands such as
using the hvantk CLI (hvantk download uniprot-ptm --output-dir data/ptm) to
fetch the UniProt PTM TSV and a wget call to fetch the Ensembl GTF (e.g.,
download Homo_sapiens.GRCh38.*.gtf.gz into data/ref) so the document contains
concrete, executable pre-download steps.

In `@hvantk/annotation/annotate.py`:
- Around line 30-41: The predicates is_pathogenic and is_benign currently
reference t.clinvar_clnsig before it is assigned; either compute clinvar_clnsig
first or build predicates directly from the joined ClinVar expression
clinvar_ht[t.key].info.CLNSIG. Update the code so that t =
t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG) is executed before
creating is_pathogenic/is_benign, or replace usages of t.clinvar_clnsig in the
is_pathogenic and is_benign expressions with clinvar_ht[t.key].info.CLNSIG
(using CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS with .any(...) as
before) to avoid referencing an unassigned field.

In `@hvantk/commands/ptm_cli.py`:
- Around line 332-338: Move the matplotlib backend selection before importing
the plotting function: call matplotlib.use("Agg") prior to the from
hvantk.ptm.plot import plot_population_af statement in the save_plots block so
the backend is set before any plotting modules are imported; update the
save_plots branch that currently imports plot_population_af and then calls
matplotlib.use to instead import matplotlib, set matplotlib.use("Agg"), then
import plot_population_af and call plot_population_af(result,
os.path.join(output, "population_af.png")).
- Around line 232-244: The call to matplotlib.use("Agg") occurs after importing
hvantk.ptm.plot (which imports matplotlib.pyplot), so move the backend selection
before that import: when save_plots is true, call matplotlib.use("Agg") first,
then import plot_landscape_summary, plot_overlap_by_category, and
plot_distance_distribution from hvantk.ptm.plot and proceed to call those
functions (or alternatively set the backend inside the hvantk.ptm.plot module
itself by adding matplotlib.use("Agg") before any pyplot import).

In `@hvantk/commands/uniprot_ptm_downloader.py`:
- Around line 23-26: The CLI help for the "--date" argument incorrectly says the
default is "today"; update the help text to reflect that when no date is
provided the code uses the latest UniProt snapshot
(UniProtPTMDataset.from_latest()), not the current date. Locate the argument
definition for "--date" (the argparse add_argument call) and change the help
string to mention "default: latest UniProt snapshot (via
UniProtPTMDataset.from_latest())" and make the same change for the duplicate
help text near the UniProtPTMDataset.from_latest() usage so both descriptions
match the actual behavior.

In `@hvantk/data/file_utils.py`:
- Around line 98-118: The sanitizer currently corrupts fields by replacing
delimiter chars and buffers the whole file; in the sanitize_tsv logic use
csv.writer to stream-output rows to tmp_path instead of manual delimiter.join
and remove the cell.replace(delimiter, " ") step so you only normalize embedded
newlines (replace "\n" and "\r" with spaces), iterate reader rows one-by-one (or
filter blank rows) and writer.writerow(cleaned_row) for each, and preserve
existing quoting/escaping semantics provided by csv.writer; ensure you still
warn/return on empty input and atomically replace the original file with the
tmp_path after successful write.

In `@hvantk/datasets/gencc_datasets.py`:
- Around line 115-119: The call to sanitize_tsv(output_path) can return 0
indicating failure, but the code always logs success and returns output_path;
update the surrounding function (the one that calls sanitize_tsv) to capture the
return value (e.g., result = sanitize_tsv(output_path)), check if result == 0,
and on failure log an error via logger.error and either raise an exception or
return None/False (consistent with the function's error handling convention)
instead of returning output_path; ensure you reference sanitize_tsv and the
caller function name so the check replaces the unconditional logger.info and
return when sanitization fails.

In `@hvantk/ptm/constants.py`:
- Around line 32-35: The mapping currently places the generic "Glycyl lysine
isopeptide (Lys-Gly)" entry before the SUMO-specific keys, so
hvantk/ptm/pipeline.py's first-match loop (which uses desc.startswith(prefix))
classifies SUMO entries as "ubiquitination"; reorder the dictionary so the two
SUMO-specific keys ("Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter
in SUMO)" and "... SUMO2") appear before the generic "Glycyl lysine isopeptide
(Lys-Gly)" key to ensure those descriptions match "sumoylation" first.

In `@hvantk/ptm/report.py`:
- Around line 204-212: The guard only checks result.ccr_mean_ptm but then
formats result.ccr_mean_non_ptm unconditionally; update the logic in the report
generation (the block that builds ccr_html using result.ccr_mean_ptm and
result.ccr_mean_non_ptm) to verify each value before formatting: only include
the PTM line if result.ccr_mean_ptm is not None and only include the non-PTM
line if result.ccr_mean_non_ptm is not None (or provide a fallback like "N/A");
you can either add a combined guard (if result.ccr_mean_ptm is not None or
result.ccr_mean_non_ptm is not None) and then conditionally append each
f"..." line, or check both individually, so neither f-string attempts to
format a None value.

In `@hvantk/tables/table_builders.py`:
- Around line 1390-1393: The code currently allows negative flanking_codons
which leads to negative flank_bp and inverted/shrunk intervals; add an explicit
validation at the start of create_ptm_sites_tb (and any public entry that
forwards the parameter, e.g., the mktable ptm-sites command handler) to raise a
clear exception (ValueError) when flanking_codons < 0, and avoid computing
flank_bp for negative values (the existing flank_bp calculation block around
where flank_bp is computed/used, referenced in the diff lines ~1460-1469, should
only run for non-negative flanking_codons). Ensure the check references the
function parameter name flanking_codons and returns/raises before any
interval-building logic runs.

In `@hvantk/utils/table_utils.py`:
- Around line 59-76: The rename-building logic in the loop over field_map (using
_normalize_field_name, norm_to_actual, target_names and rename) must avoid
producing a mapping whose destination name already exists in the table; update
the checks so before setting rename[actual]=target (or
rename[actual_by_target]=target) you verify that target is not present in the
current actual fields (actual_fields) and not already a destination in
rename.values(); additionally detect and raise/return a clear error when
multiple different source headers collapse to the same normalized key so we fail
fast instead of producing duplicate destinations.
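The `sanitize_tsv` rewrite requested in the `file_utils.py` item above (stream rows through `csv.writer`, normalize only embedded newlines, never touch the delimiter, replace the file atomically) can be sketched as follows. This is a simplified illustration under the comment's assumptions, not the real `hvantk.data.file_utils` implementation; the return-count convention and signature are taken from the review text.

```python
import csv
import os
import tempfile

def sanitize_tsv(path: str, delimiter: str = "\t") -> int:
    """Stream-sanitize a delimited file in place.

    Normalizes embedded newlines inside cells to spaces, drops blank rows,
    and atomically replaces the original file. Returns the number of rows
    written; 0 signals that nothing usable was found.
    """
    n_rows = 0
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w", newline="") as out, open(path, newline="") as src:
        writer = csv.writer(out, delimiter=delimiter)
        for row in csv.reader(src, delimiter=delimiter):
            if not row:
                continue  # skip blank rows
            # Only normalize embedded newlines; leave delimiters and
            # quoting to csv.writer so field contents are never corrupted.
            cleaned = [cell.replace("\n", " ").replace("\r", " ") for cell in row]
            writer.writerow(cleaned)
            n_rows += 1
    if n_rows:
        os.replace(tmp_path, path)  # atomic swap only on success
    else:
        os.unlink(tmp_path)
    return n_rows
```

Because `csv.reader` parses quoted cells, an embedded newline inside a quoted field is collapsed to a space while tab-separated structure is preserved.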
Nitpick comments:
In `@docs_site/examples/ptm.md`:
- Around line 43-49: Add a language specifier (`text` or `console`) to the
fenced code blocks that show output examples — specifically update the block
immediately following the "Expected output:" header (the block containing
"Mapping complete: 55422/56120 mapped...") and apply the same change to the
other output blocks referenced (the ones at the locations corresponding to lines
63, 83, and 150) so each opening triple-backtick becomes ```text; keep the block
contents unchanged.

In `@hvantk/commands/ptm_cli.py`:
- Around line 119-123: When config.validate() returns errors, prepend a clear
context message before listing them: call click.echo with a message like
"Configuration validation failed:" (using err=True) just prior to the for-loop
that echoes each error, then proceed to loop over errors and call ctx.exit(1);
update the block that references errors, config.validate(), click.echo, and
ctx.exit to include that preceding explanatory line so the output clearly
indicates what failed.
- Around line 181-187: The three separate count() calls (n_total =
result.count(), n_ptm_site = result.filter(result.is_ptm_site).count(),
n_proximal = result.filter(result.is_ptm_proximal).count()) cause multiple table
scans; replace them with a single aggregation over result that computes total,
PTM site and proximal counts in one pass (e.g., a single query using conditional
sums/CASE or SQLAlchemy func.sum on expressions for is_ptm_site and
is_ptm_proximal, or iterate once and accumulate into n_total, n_ptm_site,
n_proximal) so the code still sets n_total, n_ptm_site and n_proximal but with
one database scan.
- Around line 1-11: Update the module docstring in ptm_cli.py to remove or
replace the incorrect "evaluate: Predictor performance at PTM sites (Q2)" entry
and instead document the actual commands implemented (build, annotate,
landscape, population, report) and note that Q2 functionality is provided via
export-strata plus external psroc usage as described in the ptm_group docstring;
edit the top-level triple-quoted string (module docstring) to reflect the
accurate command set and cross-reference the ptm_group explanation so readers
are not misled about an evaluate command that does not exist.

In `@hvantk/datasets/uniprot_ptm_datasets.py`:
- Around line 226-231: The except block that re-raises a ValueError for invalid
version_date should preserve exception chaining; change "except ValueError:" to
"except ValueError as err:" and re-raise with "raise ValueError(f'Invalid
version_date format: {version_date}. Expected YYYY-MM-DD') from err" so the
original parsing error is chained (refer to the try/except around
datetime.strptime and the variable version_date).

In `@hvantk/ptm/analysis.py`:
- Around line 42-44: The f-string prefix in the summary method is unnecessary
because there are no placeholders; update the summary function (def
summary(self) -> str:) to use a normal string literal for the line "PTM-Variant
Landscape:" instead of an f-string (locate the list assigned to the variable
lines inside summary) so remove the leading f from that string.
- Around line 77-79: The string in the summary() method is using an unnecessary
f-string prefix for a literal with no placeholders; update the list entry in PTM
Population Analysis within the summary() function to use a normal string literal
(remove the leading f) so the line reads a plain quoted string instead of an
f-string.
- Line 369: Before indexing annotated with af_field, validate that the
key/property exists to avoid a runtime KeyError; in the function where af =
annotated[af_field] is used, check presence (e.g., "if af_field not in
annotated" or use hasattr for objects) and raise a clear ValueError/TypeError
describing the missing field and listing available keys (or provide a sensible
default/fallback) so callers get a helpful error instead of a cryptic crash.

In `@hvantk/ptm/plot.py`:
- Around line 43-50: The parameter name format in function
plot_landscape_summary shadows the builtin; rename it to a non-conflicting
identifier such as output_format (or fmt/file_format) in the function signature
and update every internal reference within plot_landscape_summary (and its type
hint/default) to use output_format, plus update all call sites, docstrings, and
tests that pass format to use the new name so behavior is preserved.
- Around line 253-258: The loop annotating bar counts should use zip(...,
strict=True) to guard against mismatched lengths; update the zip call that pairs
bars and n_values in hvantk/ptm/plot.py so it reads zip(bars, n_values,
strict=True) and leave the rest of the ax.text(...) logic unchanged to ensure a
ValueError is raised if future refactors cause misalignment.

In `@hvantk/ptm/report.py`:
- Around line 258-260: The except block that currently catches Exception and
calls logger.warning("Failed to generate plot %s, skipping.", output_path.stem)
should be changed to log the actual exception details; either capture the
exception (except Exception as e) and include it in the log message or call
logger.exception/ logger.warning(..., exc_info=True) so the stack trace is
recorded while still returning ""; update the except block surrounding the plot
generation (the block using output_path.stem and returning "") to emit the
exception information for debugging.

In `@hvantk/tests/test_ptm.py`:
- Around line 218-221: The test uses hardcoded paths; update the test to accept
pytest's tmp_path fixture and construct PTMBuildConfig using tmp_path (e.g.,
tmp_path / "test" for output_dir and output_ht) instead of "/tmp/test" and
"/tmp/test/out.ht", then call config2.validate() as before; ensure the test
function signature includes tmp_path so tmp directories are used and removed
automatically.</details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: defaults **Review profile**: CHILL **Plan**: Pro **Run ID**: `0caf719e-a360-40a8-adf6-12992f1bcb5f` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 3fa0ef5b95a8b6edaefcb4b5a2c6653bcc5945b4 and 857ce337560ad0b28dcb87d8a5c34d2ed48f7b9f. </details> <details> <summary>⛔ Files ignored due to path filters (1)</summary> * `docs_site/images/hvantk-ptm-workflow.svg` is excluded by `!**/*.svg` </details> <details> <summary>📒 Files selected for processing (29)</summary> * `docs_site/architecture.md` * `docs_site/examples/index.md` * `docs_site/examples/ptm.md` * `docs_site/guide/data-sources.md` * `docs_site/guide/usage.md` * `docs_site/tools/ptm.md` * `hvantk/annotation/annotate.py` * `hvantk/commands/download_cli.py` * `hvantk/commands/make_table_cli.py` * `hvantk/commands/ptm_cli.py` * `hvantk/commands/uniprot_ptm_downloader.py` * `hvantk/core/constants.py` * `hvantk/data/file_utils.py` * `hvantk/datasets/gencc_datasets.py` * `hvantk/datasets/uniprot_ptm_datasets.py` * `hvantk/hvantk.py` * `hvantk/psroc/pipeline.py` * `hvantk/ptm/__init__.py` * `hvantk/ptm/analysis.py` * `hvantk/ptm/annotate.py` * `hvantk/ptm/constants.py` * `hvantk/ptm/mapper.py` * `hvantk/ptm/pipeline.py` * `hvantk/ptm/plot.py` * `hvantk/ptm/report.py` * `hvantk/tables/registry.py` * `hvantk/tables/table_builders.py` * `hvantk/tests/test_ptm.py` * `hvantk/utils/table_utils.py` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
**Download** (optional, for offline use):

```bash
# The build command handles this automatically, but you can pre-download:
# 1. UniProt PTM TSV via REST API query
# 2. Ensembl GTF for coordinate mapping
```
Replace placeholder comments with an executable pre-download command.
Line 253–256 currently documents intent but gives no runnable command. Add hvantk download uniprot-ptm to make offline prep actually actionable.
Suggested doc fix:

````diff
 **Download** (optional, for offline use):

 ```bash
-# The build command handles this automatically, but you can pre-download:
-# 1. UniProt PTM TSV via REST API query
-# 2. Ensembl GTF for coordinate mapping
+# UniProt PTM TSV (optional pre-download)
+hvantk download uniprot-ptm --output-dir data/ptm
+
+# Ensembl GTF (optional pre-download for mapping)
+wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz -P data/ref
 ```
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs_site/guide/data-sources.md` around lines 250 - 257, The docs currently
list placeholder comments instead of actionable commands for pre-downloading
data; replace the commented notes in the "Download (optional, for offline use)"
block with runnable commands such as using the hvantk CLI (hvantk download
uniprot-ptm --output-dir data/ptm) to fetch the UniProt PTM TSV and a wget call
to fetch the Ensembl GTF (e.g., download Homo_sapiens.GRCh38.*.gtf.gz into
data/ref) so the document contains concrete, executable pre-download steps.
- Fix clinvar_clnsig predicate ordering bug (referenced before annotation)
- Move matplotlib.use("Agg") before plot imports in PTM CLI commands
- Reorder PTM_TYPE_CATEGORIES so SUMO entries match before generic Lys-Gly
- Guard ccr_mean_non_ptm for None before formatting in report
- Check sanitize_tsv return value in GenCC downloader
- Rewrite sanitize_tsv to use csv.writer, stream rows, validate column count
- Detect rename collisions and ambiguous headers in build_rename_map
- Fix docstrings (keyed by locus not interval, evaluate→export-strata)
- Remove unused PTM_TYPE_CATEGORIES import, validate flanking_codons >= 0
- Use single-pass Hail aggregation for PTM annotation counts
- Fix docs: path mismatch, placeholder commands, code fence specifiers
- Minor: exception chaining, extraneous f-strings, exc_info logging, strict zip
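The "single-pass Hail aggregation" bullet above replaces three table scans with one. The accumulation pattern is shown here in plain Python with a hypothetical `VariantRow` stand-in; in Hail itself this would be a single `ht.aggregate(hl.struct(...))` call (e.g., with `hl.agg.count_where`), not a Python loop.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class VariantRow:
    is_ptm_site: bool
    is_ptm_proximal: bool

def count_ptm_overlap(rows: Iterable[VariantRow]) -> Tuple[int, int, int]:
    """Compute total, PTM-site, and PTM-proximal counts in one scan."""
    n_total = n_ptm_site = n_proximal = 0
    for row in rows:
        n_total += 1
        n_ptm_site += row.is_ptm_site      # bools accumulate as 0/1
        n_proximal += row.is_ptm_proximal
    return n_total, n_ptm_site, n_proximal

rows = [VariantRow(True, True), VariantRow(False, True), VariantRow(False, False)]
print(count_ptm_overlap(rows))  # (3, 1, 2)
```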
Actionable comments posted: 3
🧹 Nitpick comments (8)
hvantk/annotation/annotate.py (1)
30-49: LGTM! The predicate ordering bug has been correctly fixed.

The previous issue where `is_pathogenic` and `is_benign` referenced `t.clinvar_clnsig` before assignment is now resolved—the field is annotated first (line 36) before the predicates are computed (lines 39-42). Using centralized constants from `hvantk.core.constants` improves maintainability.

One optional refinement for clarity and marginal performance: hoist the `hl.set()` calls outside the lambdas.
♻️ Optional: Hoist set creation outside lambdas
```diff
 # First annotate clinvar_clnsig from the ClinVar table
 t = t.annotate(clinvar_clnsig=clinvar_ht[t.key].info.CLNSIG)

 # Now compute conditions using the annotated field
+pathogenic_labels = hl.set(CLINVAR_PATHOGENIC_LABELS)
+benign_labels = hl.set(CLINVAR_BENIGN_LABELS)
 is_pathogenic = t.clinvar_clnsig.any(
-    lambda x: hl.set(CLINVAR_PATHOGENIC_LABELS).contains(x)
+    lambda x: pathogenic_labels.contains(x)
 )
-is_benign = t.clinvar_clnsig.any(lambda x: hl.set(CLINVAR_BENIGN_LABELS).contains(x))
+is_benign = t.clinvar_clnsig.any(lambda x: benign_labels.contains(x))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/annotation/annotate.py` around lines 30 - 49, Hoist the hl.set(...) creations out of the lambdas to avoid rebuilding them repeatedly: create local Hail sets (e.g., pathogenic_set = hl.set(CLINVAR_PATHOGENIC_LABELS) and benign_set = hl.set(CLINVAR_BENIGN_LABELS)) before computing is_pathogenic and is_benign, then use those variables inside the t.clinvar_clnsig.any(...) lambdas; update references in the is_pathogenic and is_benign predicates and leave the rest of the t.annotate logic unchanged.

hvantk/commands/uniprot_ptm_downloader.py (1)
40-55: Solid CLI implementation with appropriate error handling.

The lazy import pattern and error handling with `ctx.exit(1)` follow good CLI practices.

One minor note: `logger.exception()` already includes the exception information, so the f-string `f"Download failed: {e}"` duplicates it slightly. You could simplify to `logger.exception("Download failed")`.
logger.exception()already includes the exception information, so the f-stringf"Download failed: {e}"duplicates it slightly. You could simplify tologger.exception("Download failed").✨ Simplify exception logging
```diff
 except Exception as e:
-    logger.exception(f"Download failed: {e}")
+    logger.exception("Download failed")
     click.echo(f"Error: {e}", err=True)
     ctx.exit(1)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/uniprot_ptm_downloader.py` around lines 40 - 55, The exception log duplicates the exception text; in the except block for the UniProt PTM download (the try that imports UniProtPTMDataset and calls UniProtPTMDataset.from_date/from_latest, download, etc.) replace logger.exception(f"Download failed: {e}") with a simpler logger.exception("Download failed") so the stack/exception is still recorded without duplicating the exception string, keeping the existing click.echo(f"Error: {e}", err=True) and ctx.exit(1) behavior unchanged.

hvantk/datasets/uniprot_ptm_datasets.py (2)
257-325: Well-implemented download with proper cleanup.

The download method correctly:
- Creates output directory if needed
- Checks for existing files before overwrite
- Streams results page-by-page to avoid memory issues
- Cleans up partial files on failure
One minor style improvement per static analysis:
✨ Use implicit string conversion in f-string
```diff
-        raise RuntimeError(
-            f"Failed to download UniProt PTM data: {str(e)}"
-        ) from e
+        raise RuntimeError(
+            f"Failed to download UniProt PTM data: {e}"
+        ) from e
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 257 - 325, In download, the exception message uses explicit str conversion (f"{str(e)}"); change it to use implicit string conversion (e.g., f"{e}") when constructing the RuntimeError message in the except block so the code is cleaner and satisfies the static analysis rule in the download function's error handling.
157-183: Consider validating URL scheme for defense-in-depth.

The static analysis flags that `urllib.request.urlopen` accepts arbitrary URL schemes including `file://`. While `UNIPROT_API_URL` is presumably an HTTPS constant, the pagination URLs from `_get_next_link` come from server response headers. A compromised or malicious server could theoretically redirect to a `file://` URL.

For defense-in-depth, consider validating the scheme:
🛡️ Optional: Add scheme validation
```diff
 def _fetch_json(url: str) -> tuple:
     """
     Fetch a JSON page from UniProt using *requests* (preferred) or *urllib*.
     ...
     """
+    from urllib.parse import urlparse
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        raise ValueError(f"Unsupported URL scheme: {parsed.scheme}")
+
     try:
         import requests as _requests
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/datasets/uniprot_ptm_datasets.py` around lines 157 - 183, The fetch helper _fetch_json currently opens arbitrary URL schemes (via urllib.request.urlopen or requests.get); validate the URL scheme before issuing the request by parsing the URL (e.g., urllib.parse.urlparse) and only allowing "http" and "https" (reject or raise an error for "file", "ftp", etc.), and apply the same check to any pagination links produced by _get_next_link (ensure those links are validated before being passed into _fetch_json); this defends against malicious server-provided pagination URLs while preserving normal UNIPROT_API_URL usage.

hvantk/commands/ptm_cli.py (1)
182-182: Remove duplicate `hail` import.

`hail as hl` is already imported at line 171. This duplicate import is unnecessary.

♻️ Remove duplicate import
```diff
 result = result.checkpoint(output_ht, overwrite=overwrite)

-import hail as hl
 counts = result.aggregate(hl.struct(
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/ptm_cli.py` at line 182, Remove the duplicate import of "hail as hl" found at the second import line; keep the original "import hail as hl" that already exists earlier (around line with other imports) and delete the redundant "import hail as hl" occurrence to avoid duplicate imports in the module (check the top-level imports near the PTM CLI definitions and functions that use hl).

hvantk/ptm/report.py (1)
258-260: Consider narrowing the exception type for plot failures.

The broad `Exception` catch (flagged by Ruff BLE001) is acceptable here for fault tolerance, but could be narrowed to catch more specific exceptions like `(ValueError, IOError, RuntimeError)` to avoid masking unexpected errors.

♻️ Optional: narrow exception scope
```diff
-        except Exception:
+        except (ValueError, IOError, RuntimeError, TypeError):
             logger.warning("Failed to generate plot %s, skipping.", output_path.stem, exc_info=True)
             return ""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/ptm/report.py` around lines 258 - 260, Replace the broad "except Exception" in the plot-generation failure handler with a narrowed tuple of expected exceptions (for example except (ValueError, IOError, RuntimeError):) so you don't mask unrelated errors; keep the existing logger.warning call (using output_path.stem and exc_info=True) and the return "" behavior unchanged; update the except clause where the code logs "Failed to generate plot %s, skipping." (the block that references output_path.stem and logger.warning) to use the narrower exception tuple to address Ruff BLE001.

hvantk/ptm/plot.py (1)
49-49: Parameter `format` shadows Python builtin.

The `format` parameter appears in multiple functions (lines 49, 126, 171, 218, 277, 287, 298). While common in plotting code, consider renaming to `fmt` or `image_format` to avoid shadowing.

♻️ Rename `format` parameter across all functions
```diff
 def plot_landscape_summary(
     result: PTMLandscapeResult,
     output_path: str,
     figsize: Tuple[int, int] = (8, 5),
     dpi: int = 300,
     title: Optional[str] = None,
-    format: str = "png",
+    fmt: str = "png",
 ) -> plt.Figure:
```

Apply similar changes to other functions and update internal usages of `format` to `fmt`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/ptm/plot.py` at line 49, Rename the parameter named format to fmt (or image_format) across all plotting functions in the hvantk.ptm.plot module that declare "format: str = 'png'" (those function signatures flagged in the review) and update every internal usage of format to the new name, plus adjust any local variables or return values that reference it; ensure function signatures (and docstrings if present) and all call sites within the repository are updated consistently so nothing still references the old format parameter name.

hvantk/utils/table_utils.py (1)
65-87: Consider detecting duplicate rename destinations for robustness.

The `target not in actual_fields` check now prevents renaming into an existing field. For additional safety against malformed `field_map` inputs where multiple entries could resolve to the same actual field, you could also check `target not in rename.values()` before adding:

```diff
 if actual not in target_names and target not in actual_fields:
+    if target in rename.values():
+        logger.warning(
+            "build_rename_map: target %r already assigned, skipping %r",
+            target, actual,
+        )
+        continue
     rename[actual] = target
```

Current usage with `COSMIC_CGC_FIELDS` has no collisions, so this is purely defensive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/utils/table_utils.py` around lines 65 - 87, The current header-rename logic may allow multiple source fields to map to the same target; update the blocks that assign to rename (the branches that set rename[actual] = target and rename[actual_by_target] = target) to also check that the proposed target is not already used in rename.values() before adding; use the same normalization/path lookup logic around _normalize_field_name, norm_to_actual, field_map, actual_fields and target_names to prevent duplicate rename destinations and keep the existing checks (actual not in target_names and target not in actual_fields).
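The duplicate-destination guard discussed for `build_rename_map` can be illustrated with a plain-dict sketch. This simplified version omits the real helper's header normalization and logging and only demonstrates the collision checks; the function body and signature here are assumptions, not the actual hvantk code.

```python
def build_rename_map(actual_fields, field_map):
    """Map actual header names to canonical targets, guarding collisions.

    field_map: {source_header: target_name}. Skips renames that would
    clobber an existing column, and fails fast if two different source
    headers would collapse onto the same destination.
    """
    actual_set = set(actual_fields)
    rename = {}
    for source, target in field_map.items():
        if source not in actual_set:
            continue  # header absent from this file
        if source == target:
            continue  # already has the canonical name
        if target in actual_set:
            continue  # renaming would clobber an existing column
        if target in rename.values():
            raise ValueError(f"duplicate rename destination: {target!r}")
        rename[source] = target
    return rename

# Example: one matching header is renamed; absent headers are ignored.
print(build_rename_map(["Gene Symbol", "ok"], {"Gene Symbol": "gene", "missing": "x"}))
# → {'Gene Symbol': 'gene'}
```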
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hvantk/ptm/analysis.py`:
- Around line 86-91: The current guard only checks self.ccr_mean_ptm but still
formats self.ccr_mean_non_ptm which can be None; update the block that emits the
summary (the lines using self.ccr_mean_ptm and self.ccr_mean_non_ptm, produced
by ptm_population()/the summary method) to check each mean independently and
only format the non-None values. Concretely, replace the single-if that formats
both values with logic that builds the string parts conditionally (e.g., if
self.ccr_mean_ptm is not None append "PTM=...", and if self.ccr_mean_non_ptm is
not None append "non-PTM=...") and then join/append that to lines so no f-string
is called with a None value.
In `@hvantk/tables/table_builders.py`:
- Around line 1142-1146: The fallback sets kwargs["force_bgz"]=True whenever
resolved_path.endswith(".gz"), which overrides resolve_compression's explicit
False for standard gzip; change the code to unpack the boolean from
resolve_compression (e.g., resolved_path, is_bgz = resolve_compression(...)) and
only set kwargs["force_bgz"]=True when either force_bgz was explicitly requested
or is_bgz is True (do not set it solely based on resolved_path.endswith(".gz")).
Ensure you update references to resolved_path and the function
resolve_compression accordingly and remove the blunt .endswith(".gz") check.
- Around line 1461-1471: The flanking_interval end can exceed chromosome bounds
(ht.codon_end + flank_bp); clamp it to the contig length before calling
hl.locus_interval. Retrieve the contig length for ht._contig from the
reference_genome (the same approach used in the PTM annotation pattern), compute
end_pos = hl.min(contig_length, ht.codon_end + flank_bp), and use that end_pos
in hl.locus_interval (keeping hl.max(1, ht.codon_start - flank_bp) for the
start); update the flanking_interval annotation accordingly (symbols:
flanking_interval, ht.codon_end, flank_bp, ht._contig, reference_genome,
hl.locus_interval).
---
Nitpick comments:
In `@hvantk/annotation/annotate.py`:
- Around line 30-49: Hoist the hl.set(...) creations out of the lambdas to avoid
rebuilding them repeatedly: create local Hail sets (e.g., pathogenic_set =
hl.set(CLINVAR_PATHOGENIC_LABELS) and benign_set =
hl.set(CLINVAR_BENIGN_LABELS)) before computing is_pathogenic and is_benign,
then use those variables inside the t.clinvar_clnsig.any(...) lambdas; update
references in the is_pathogenic and is_benign predicates and leave the rest of
the t.annotate logic unchanged.
In `@hvantk/commands/ptm_cli.py`:
- Line 182: Remove the duplicate import of "hail as hl" found at the second
import line; keep the original "import hail as hl" that already exists earlier
(around line with other imports) and delete the redundant "import hail as hl"
occurrence to avoid duplicate imports in the module (check the top-level imports
near the PTM CLI definitions and functions that use hl).
In `@hvantk/commands/uniprot_ptm_downloader.py`:
- Around line 40-55: The exception log duplicates the exception text; in the
except block for the UniProt PTM download (the try that imports
UniProtPTMDataset and calls UniProtPTMDataset.from_date/from_latest, download,
etc.) replace logger.exception(f"Download failed: {e}") with a simpler
logger.exception("Download failed") so the stack/exception is still recorded
without duplicating the exception string, keeping the existing
click.echo(f"Error: {e}", err=True) and ctx.exit(1) behavior unchanged.
In `@hvantk/datasets/uniprot_ptm_datasets.py`:
- Around line 257-325: In download, the exception message uses explicit str
conversion (f"{str(e)}"); change it to use implicit string conversion (e.g.,
f"{e}") when constructing the RuntimeError message in the except block so the
code is cleaner and satisfies the static analysis rule in the download
function's error handling.
- Around line 157-183: The fetch helper _fetch_json currently opens arbitrary
URL schemes (via urllib.request.urlopen or requests.get); validate the URL
scheme before issuing the request by parsing the URL (e.g.,
urllib.parse.urlparse) and only allowing "http" and "https" (reject or raise an
error for "file", "ftp", etc.), and apply the same check to any pagination links
produced by _get_next_link (ensure those links are validated before being passed
into _fetch_json); this defends against malicious server-provided pagination
URLs while preserving normal UNIPROT_API_URL usage.
In `@hvantk/ptm/plot.py`:
- Line 49: Rename the parameter named format to fmt (or image_format) across all
plotting functions in the hvantk.ptm.plot module that declare "format: str =
'png'" (those function signatures flagged in the review) and update every
internal usage of format to the new name, plus adjust any local variables or
return values that reference it; ensure function signatures (and docstrings if
present) and all call sites within the repository are updated consistently so
nothing still references the old format parameter name.
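A minimal sketch of the rename (the signature is hypothetical, not the module's real one): using `fmt` stops shadowing the builtin `format()`, while any downstream keyword such as matplotlib's `savefig(format=...)` is still passed explicitly.

```python
def save_plot(fig, output_path: str, fmt: str = "png") -> str:
    """Save a figure; `fmt` replaces the old `format` parameter name."""
    out = f"{output_path}.{fmt}"
    if fig is not None:
        # The keyword name on the matplotlib side is unchanged.
        fig.savefig(out, format=fmt)
    return out
```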
In `@hvantk/ptm/report.py`:
- Around line 258-260: Replace the broad "except Exception" in the
plot-generation failure handler with a narrowed tuple of expected exceptions
(for example except (ValueError, IOError, RuntimeError):) so you don't mask
unrelated errors; keep the existing logger.warning call (using output_path.stem
and exc_info=True) and the return "" behavior unchanged; update the except
clause where the code logs "Failed to generate plot %s, skipping." (the block
that references output_path.stem and logger.warning) to use the narrower
exception tuple to address Ruff BLE001.
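Sketched with a stand-in plot callable (the wrapper shape is assumed from the review text, not copied from `report.py`):

```python
import logging
from pathlib import Path

logger = logging.getLogger("hvantk.ptm.report")

def safe_plot(plot_fn, output_path: Path) -> str:
    """Only swallow the failure modes plotting is expected to raise."""
    try:
        plot_fn(output_path)
        return str(output_path)
    except (ValueError, IOError, RuntimeError):
        # Narrowed from a bare `except Exception` (Ruff BLE001):
        # an unexpected KeyError, say, now propagates instead of vanishing.
        logger.warning(
            "Failed to generate plot %s, skipping.", output_path.stem, exc_info=True
        )
        return ""
```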
In `@hvantk/utils/table_utils.py`:
- Around line 65-87: The current header-rename logic may allow multiple source
fields to map to the same target; update the blocks that assign to rename (the
branches that set rename[actual] = target and rename[actual_by_target] = target)
to also check that the proposed target is not already used in rename.values()
before adding; use the same normalization/path lookup logic around
_normalize_field_name, norm_to_actual, field_map, actual_fields and target_names
to prevent duplicate rename destinations and keep the existing checks (actual
not in target_names and target not in actual_fields).
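The duplicate-destination guard can be illustrated with a simplified rename map. The normalization here is plain lower-casing; the real helper per the review is `_normalize_field_name`, and the dict shape is an assumption:

```python
def build_rename_map(actual_fields, wanted_by_norm):
    """Map source fields to targets, refusing duplicate destinations.

    wanted_by_norm: normalized field name -> desired target name.
    """
    rename = {}
    existing = set(actual_fields)
    for actual in actual_fields:
        target = wanted_by_norm.get(actual.lower())
        if target is None or target == actual:
            continue
        # Keep the original checks, plus: never reuse a destination.
        if target in existing or target in rename.values():
            continue
        rename[actual] = target
    return rename
```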
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a83341fb-2b11-4f02-87df-be7d021b8452
📒 Files selected for processing (15)
- docs_site/examples/index.md
- docs_site/examples/ptm.md
- docs_site/guide/data-sources.md
- hvantk/annotation/annotate.py
- hvantk/commands/ptm_cli.py
- hvantk/commands/uniprot_ptm_downloader.py
- hvantk/data/file_utils.py
- hvantk/datasets/gencc_datasets.py
- hvantk/datasets/uniprot_ptm_datasets.py
- hvantk/ptm/analysis.py
- hvantk/ptm/constants.py
- hvantk/ptm/plot.py
- hvantk/ptm/report.py
- hvantk/tables/table_builders.py
- hvantk/utils/table_utils.py
✅ Files skipped from review due to trivial changes (4)
- docs_site/guide/data-sources.md
- docs_site/examples/index.md
- hvantk/ptm/constants.py
- docs_site/examples/ptm.md
🚧 Files skipped from review as they are similar to previous changes (2)
- hvantk/datasets/gencc_datasets.py
- hvantk/data/file_utils.py
if force_bgz:
    kwargs["force_bgz"] = True
elif resolved_path.endswith(".gz"):
    kwargs["force_bgz"] = True
return hl.import_table(**kwargs)
🧩 Analysis chain
🌐 Web query:
Hail import_table force_bgz parameter gzip vs bgzip
💡 Result:
In Hail, the hl.import_table function (also referenced as import_table in methods) imports delimited text files (TSV/CSV) into a Hail Table. The force_bgz parameter controls how compressed .gz files are handled during import. GZIP is standard gzip compression, which decompresses serially (single-threaded, no parallelism). BGZIP (Blocked GZIP) is a bioinformatics format that splits the file into independent blocks, enabling parallel random access and multi-threaded decompression. By default, Hail treats .gz files as standard GZIP (serial). Set force_bgz=True to treat .gz files as BGZIP, allowing parallel reading across Spark executors for faster performance on large files, even if the extension isn't .bgz. From Hail docs: "If True, load .gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not '.bgz', but the file is blocked gzip, so that the file can be read in parallel and not on a single node." Use force_bgz=True for BGZIP-compressed files (common in genomics) to enable parallelism; otherwise, use standard GZIP handling.
Citations:
- 1: https://hail.is/docs/0.2/methods/impex.html
- 2: https://hail.is/docs/0.2/_modules/hail/methods/impex.html
🏁 Scripts executed (repository: bigbio/hvantk):
# Find the resolve_compression function definition
rg -n "def resolve_compression" --type python
# Also check if resolve_compression is imported from elsewhere
rg -n "resolve_compression" hvantk/tables/table_builders.py | head -20
# Find and read the resolve_compression function
fd -t f file_utils.py -path "*/data/*"
# Search for resolve_compression definition
rg -n "def resolve_compression" --type-list | head -5
# Try a simpler search for the function
find . -name "file_utils.py" -type f
# Read the file_utils.py to find resolve_compression
cat -n ./hvantk/data/file_utils.py
The fallback .gz check incorrectly overrides resolve_compression's explicit False return for standard gzip files.
When resolve_compression detects a standard gzip file, it explicitly returns (filepath, False) to indicate the file cannot be read in parallel. However, the fallback check at line 1144 ignores this signal and sets force_bgz=True anyway because the path ends with .gz.
According to Hail documentation and the learnings from this codebase, force_bgz=True tells Hail the file is block-gzipped (BGZF) and safe for parallel reading. Standard gzip files should NOT have this flag set—Hail will read them single-threaded without it.
🔧 Suggested fix
  if force_bgz:
      kwargs["force_bgz"] = True
- elif resolved_path.endswith(".gz"):
-     kwargs["force_bgz"] = True
+ # Don't set force_bgz for standard gzip — resolve_compression already handles this
  return hl.import_table(**kwargs)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/tables/table_builders.py` around lines 1142 - 1146, The fallback sets
kwargs["force_bgz"]=True whenever resolved_path.endswith(".gz"), which overrides
resolve_compression's explicit False for standard gzip; change the code to
unpack the boolean from resolve_compression (e.g., resolved_path, is_bgz =
resolve_compression(...)) and only set kwargs["force_bgz"]=True when either
force_bgz was explicitly requested or is_bgz is True (do not set it solely based
on resolved_path.endswith(".gz")). Ensure you update references to resolved_path
and the function resolve_compression accordingly and remove the blunt
.endswith(".gz") check.
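The gzip-vs-BGZF distinction above can also be checked directly from the file header: BGZF blocks carry a "BC" extra subfield that plain gzip lacks. A self-contained detector, sketched from the BGZF layout in the SAM/BAM specification (not part of this codebase):

```python
import struct

def is_bgzf(path: str) -> bool:
    """True if the file starts with a BGZF (blocked gzip) header.

    BGZF is gzip with FLG.FEXTRA set and an extra subfield SI1='B', SI2='C';
    plain gzip files fail this check and must not get force_bgz=True.
    """
    with open(path, "rb") as fh:
        header = fh.read(18)
    if len(header) < 18 or header[:2] != b"\x1f\x8b":
        return False          # not gzip at all
    if not header[3] & 0x04:  # FLG.FEXTRA must be set for BGZF
        return False
    xlen = struct.unpack("<H", header[10:12])[0]
    if xlen < 4:
        return False
    return header[12:14] == b"BC"  # BGZF subfield identifier
```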
Actionable comments posted: 4
🧹 Nitpick comments (1)
hvantk/ptm/analysis.py (1)
223-224: Materialize `annotated` before the follow-on Hail actions.
`stats`, `category_counts`, and `distance_dist` are separate actions on the same lazy table. Without `persist()`/`checkpoint()`, Hail will rerun `annotate_variants_with_ptm()` and `_label_clinvar()` for each one.

♻️ Low-friction fix

  annotated = annotate_variants_with_ptm(clinvar_ht, ptm_ht, flanking_codons)
- annotated = _label_clinvar(annotated)
+ annotated = _label_clinvar(annotated).persist()

Also applies to: 227-271
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/ptm/analysis.py` around lines 223 - 224, The pipeline recomputes annotate_variants_with_ptm and _label_clinvar for each subsequent Hail action because the intermediate Table `annotated` is lazy; persist or checkpoint `annotated` immediately after calling _label_clinvar (i.e., after annotated = _label_clinvar(annotated)) so subsequent actions (`stats`, `category_counts`, `distance_dist`) operate on a materialized table; use annotated = annotated.persist() or annotated = annotated.checkpoint(...) with an appropriate path and then continue with the downstream actions to avoid repeated recomputation.
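The recompute hazard is easy to reproduce without Hail using a toy lazy table (illustrative only; Hail's actual `persist()` semantics are richer):

```python
class LazyTable:
    """Toy lazy table: every action re-executes the whole pipeline."""

    def __init__(self, compute):
        self._compute = compute

    def count(self):
        return len(self._compute())        # action 1: full recompute

    def aggregate(self, fn):
        return fn(self._compute())         # action 2: full recompute again

    def persist(self):
        rows = self._compute()             # run the pipeline once...
        return LazyTable(lambda: rows)     # ...then serve cached rows
```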
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hvantk/ptm/analysis.py`:
- Around line 334-353: The is_non_ptm mask currently defined as "is_non_ptm =
~is_ptm & ~is_prox" allows noncoding variants; update the logic to require
coding variants or declare a precondition. Either (A) enforce coding in the
analysis by changing the mask to include a coding flag from gnomad_ht (e.g.,
is_non_ptm = ~is_ptm & ~is_prox & is_coding, where is_coding is derived from
gnomad_ht consequence/variant annotation) and apply the same change where
similar masks are built (see the other block around 375-388), or (B) add a clear
docstring/precondition that gnomad_ht must be pre-filtered to coding variants
and validate that upstream (raise an error if coding annotation/flag is
missing). Reference symbols: is_non_ptm, is_ptm, is_prox, af_field and the PTM
analysis function in hvantk/ptm/analysis.py.
- Around line 165-171: The _vid construction in ht_out (ht.annotate using
ht.locus.contig, ht.locus.position and ht.alleles[1]) only picks the first ALT
and will drop additional alleles for multiallelic records; either enforce
biallelic input or explode multi-allelic sites into one row per ALT. Fix by
validating ht.alleles length (e.g., assert/hl.if_else to raise or filter where
hl.len(ht.alleles) == 2) when you require biallelic, or else transform the table
before building _vid by exploding ALT alleles (use an explode/flatten pattern to
iterate over hl.range(1, hl.len(ht.alleles)) or an equivalent Hail explode on
ht.alleles[1:], create a per-ALT allele variable, and build _vid from that
allele instead of ht.alleles[1]) so each exported row includes the correct ALT.
- Around line 103-124: The code collapses multiallelic ClinVar CLNSIG to
CLNSIG[0] in _extract_clnsig and then labels variants by that single value;
instead, preserve the array and change _label_clinvar to test if any allele’s
CLNSIG matches the pathogenic/benign label sets (use hl.any(lambda x:
p_labels.contains(x), clnsig) and similarly for b_labels), and handle the scalar
case by checking membership directly when clnsig is not an array; update
_extract_clnsig/_label_clinvar to stop using CLNSIG[0] and use hl.any(...) (or
equivalent) against CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS so
multiallelic records are correctly classified.
In `@hvantk/tables/table_builders.py`:
- Around line 1431-1441: create_ptm_sites_tb currently hardcodes adding "chr" to
inputs (contig_remap/ht._contig) which breaks non-prefixed references; change
create_ptm_sites_tb to detect whether the target reference's contigs are
prefixed by checking get_reference(reference_genome).contigs for a "chr" prefix,
then normalize incoming ht.chrom by stripping any leading "chr"/"CHR" to a base
contig name, apply the MT<->M remap on that base (update contig_remap logic),
and finally re-add the "chr" prefix only if the reference uses prefixed contigs
before assigning ht._contig and filtering against valid_contigs (use
valid_contigs = hl.set(hl.literal(hl.get_reference(reference_genome).contigs))
as before).
---
Nitpick comments:
In `@hvantk/ptm/analysis.py`:
- Around line 223-224: The pipeline recomputes annotate_variants_with_ptm and
_label_clinvar for each subsequent Hail action because the intermediate Table
`annotated` is lazy; persist or checkpoint `annotated` immediately after calling
_label_clinvar (i.e., after annotated = _label_clinvar(annotated)) so subsequent
actions (`stats`, `category_counts`, `distance_dist`) operate on a materialized
table; use annotated = annotated.persist() or annotated =
annotated.checkpoint(...) with an appropriate path and then continue with the
downstream actions to avoid repeated recomputation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f3d8c33c-a809-45df-a6aa-817e16c324e9
📒 Files selected for processing (3)
- docs_site/examples/ptm.md
- hvantk/ptm/analysis.py
- hvantk/tables/table_builders.py
✅ Files skipped from review due to trivial changes (1)
- docs_site/examples/ptm.md
def _extract_clnsig(ht: hl.Table) -> hl.Expression:
    """Extract scalar CLNSIG expression, handling array vs string dtype."""
    clnsig_dtype = ht.info.CLNSIG.dtype
    if isinstance(clnsig_dtype, hl.tarray):
        return hl.or_missing(
            hl.is_defined(ht.info.CLNSIG) & (hl.len(ht.info.CLNSIG) > 0),
            ht.info.CLNSIG[0],
        )
    return ht.info.CLNSIG


def _label_clinvar(ht: hl.Table) -> hl.Table:
    """Add _label field ('P', 'B', or 'other') based on ClinVar CLNSIG."""
    clnsig = _extract_clnsig(ht)
    p_labels = hl.literal(CLINVAR_PATHOGENIC_LABELS)
    b_labels = hl.literal(CLINVAR_BENIGN_LABELS)
    return ht.annotate(
        _label=hl.case()
        .when(hl.is_defined(clnsig) & p_labels.contains(clnsig), "P")
        .when(hl.is_defined(clnsig) & b_labels.contains(clnsig), "B")
        .default("other")
    )
🧩 Analysis chain
🏁 Scripts executed (repository: bigbio/hvantk):
# First, let's examine the file structure and locate the files mentioned
find . -type f -name "*.py" | grep -E "(analysis|table_builders)" | head -20
# Read the create_clinvar_tb function
cat -n hvantk/tables/table_builders.py | sed -n '248,310p'
# Also check the imports and constants needed for context
head -100 hvantk/ptm/analysis.py | cat -n
# Search for CLINVAR_PATHOGENIC_LABELS and CLINVAR_BENIGN_LABELS
grep -n "CLINVAR_PATHOGENIC_LABELS\|CLINVAR_BENIGN_LABELS" hvantk/ptm/analysis.py
# Check if split_multiallelic is used elsewhere in the codebase
grep -rn "split_multiallelic" --include="*.py" .
# Find ptm_landscape function and see how it uses _label_clinvar
grep -n "def ptm_landscape\|_label_clinvar" hvantk/ptm/analysis.py
# Read the ptm_landscape function to understand the context
cat -n hvantk/ptm/analysis.py | sed -n '200,300p'
# Look for any documentation or tests that show how multiallelic VCFs are handled
find . -type f \( -name "*.py" -o -name "*.md" \) | xargs grep -l "multiallelic\|split_multiallelic" 2>/dev/null | head -10
# Check how multiallelic is handled in converters.py
grep -n "multiallelic\|split_multiallelic" hvantk/hgc/converters.py -B 2 -A 2
# Check the test file
grep -n "multiallelic\|split_multiallelic" hvantk/tests/ancestry/test_filter_hail.py -B 2 -A 2
# Look at the VCF specification and understand CLNSIG in multiallelic context;
# first, see if there's a README or documentation about ClinVar data handling
find . -type f \( -name "README*" -o -name "*.md" \) | xargs grep -l "ClinVar\|CLNSIG" 2>/dev/null
# Check the actual VCF header or info field definitions
grep -rn "CLNSIG\|INFO.*ClinVar" --include="*.py" . | head -20
# Look at clinvar_streamer.py to understand the CLNSIG structure better
cat -n hvantk/data/clinvar_streamer.py | sed -n '150,170p'
# Check psroc/pipeline.py for the same CLNSIG extraction pattern and any comments
cat -n hvantk/psroc/pipeline.py | sed -n '1075,1095p'
# Look at the ClinVar documentation file
cat docs_site/examples/clinvar.md | head -100
# Check for any comments or issues about multiallelic CLNSIG handling
grep -rn "multiallelic.*CLNSIG\|CLNSIG.*multiallelic\|collapse.*CLNSIG\|CLNSIG.*collapse" --include="*.py" --include="*.md" .
# Check how annotate_variants_with_ptm handles the table
grep -n "def annotate_variants_with_ptm" hvantk/ptm/annotate.py
# Check whether any filtering or splitting happens elsewhere
grep -rn "split_multiallelic\|\.split_rows()" --include="*.py" hvantk/
# Check for comments about how CLNSIG values work
grep -rn "CLNSIG.*array\|array.*CLNSIG\|CLNSIG.*multiple\|multiple.*CLNSIG" --include="*.py" --include="*.md" . -i
# Check how alleles are structured after hl.import_vcf
cat -n hvantk/tables/table_builders.py | sed -n '285,310p'
# Check for TODOs about CLNSIG handling
grep -rn "TODO\|FIXME\|XXX\|HACK\|BUG" hvantk/ptm/analysis.py hvantk/psroc/pipeline.py | grep -i "clnsig\|multiallelic"
# Look for tests that show expected behavior with multiallelic CLNSIG
find . -path ./node_modules -prune -o -type f -name "test_*.py" -o -name "*_test.py" | xargs grep -l "CLNSIG\|_extract_clnsig" 2>/dev/null
# Check the full _extract_clnsig function context and any comments around it
cat -n hvantk/ptm/analysis.py | sed -n '100,115p'
# Compare how clinvar_streamer handles CLNSIG vs how ptm handles it
cat -n hvantk/data/clinvar_streamer.py | sed -n '145,175p'
Don't collapse multiallelic ClinVar CLNSIG to the first entry.
create_clinvar_tb() in hvantk/tables/table_builders.py:292-298 imports raw VCF with hl.import_vcf(...).rows() without splitting multiallelic sites. For multiallelic records, CLNSIG is stored as an array with one value per alternate allele. However, _extract_clnsig() extracts only CLNSIG[0], discarding significance information for alleles at positions 1+. This causes variants with mixed-significance alleles (e.g., alt[0]=Pathogenic, alt[1]=Benign) to be silently mislabeled based only on the first allele, leading to incorrect counts and enrichment calculations in ptm_landscape().
The correct pattern is shown in hvantk/data/clinvar_streamer.py:155-157, which uses .any() to check if any array element matches the label set. Note that hvantk/psroc/pipeline.py:1077-1085 contains the same problematic CLNSIG[0] logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/analysis.py` around lines 103 - 124, The code collapses
multiallelic ClinVar CLNSIG to CLNSIG[0] in _extract_clnsig and then labels
variants by that single value; instead, preserve the array and change
_label_clinvar to test if any allele’s CLNSIG matches the pathogenic/benign
label sets (use hl.any(lambda x: p_labels.contains(x), clnsig) and similarly for
b_labels), and handle the scalar case by checking membership directly when
clnsig is not an array; update _extract_clnsig/_label_clinvar to stop using
CLNSIG[0] and use hl.any(...) (or equivalent) against CLINVAR_PATHOGENIC_LABELS
and CLINVAR_BENIGN_LABELS so multiallelic records are correctly classified.
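In plain Python, the any-allele classification the fix calls for looks like the following. The label sets are illustrative stand-ins for CLINVAR_PATHOGENIC_LABELS / CLINVAR_BENIGN_LABELS, and the Hail version would use hl.any(...) instead of Python's any():

```python
PATHOGENIC = {"Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic"}
BENIGN = {"Benign", "Likely_benign", "Benign/Likely_benign"}

def label_clnsig(clnsig) -> str:
    """'P' if ANY allele is pathogenic, else 'B' if any is benign, else 'other'.

    Accepts a scalar (string CLNSIG dtype) or a per-ALT list (array dtype),
    mirroring the suggested hl.any(...) fix for multiallelic records.
    """
    values = clnsig if isinstance(clnsig, list) else [clnsig]
    values = [v for v in values if v is not None]
    if any(v in PATHOGENIC for v in values):
        return "P"
    if any(v in BENIGN for v in values):
        return "B"
    return "other"
```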
ht_out = ht.annotate(
    _vid=hl.delimit(
        [
            hl.str(ht.locus.contig),
            hl.str(ht.locus.position),
            ht.alleles[0],
            ht.alleles[1],
🧩 Analysis chain
🏁 Scripts executed (repository: bigbio/hvantk):
fd hvantk --type d | head -20
git ls-files | grep -E "(analysis|ptm)" | head -20
cat -n hvantk/ptm/analysis.py | sed -n '160,180p'
cat -n hvantk/ptm/analysis.py | sed -n '150,190p'
wc -l hvantk/ptm/analysis.py
cat -n hvantk/ptm/analysis.py | head -200
rg "export_ptm_strata" --type py -B 2 -A 2
rg "annotate_variants_with_ptm" --type py -B 2 -A 2
cat -n hvantk/ptm/pipeline.py | head -150
cat -n hvantk/ptm/pipeline.py | grep -A 10 -B 10 "export_ptm_strata"
rg "def annotate_variants_with_ptm" --type py -A 50
cat -n hvantk/ptm/annotate.py | head -80
rg "split_multi" --type py
rg "biallelic" --type py -i
cat -n hvantk/commands/ptm_cli.py | grep -A 30 "def export_variants"
cat -n hvantk/ptm/pipeline.py | grep -A 30 "export_ptm_strata"
cat -n hvantk/ptm/tests/test_ptm.py
find . -name "*test*ptm*" -type f
find hvantk/tests -name "*.py" -type f
cat -n hvantk/commands/ptm_cli.py
cat -n hvantk/ptm/analysis.py | sed -n '127,182p'
rg "alleles\[" hvantk/ptm/ --type py -B 2 -A 2
cat -n hvantk/tests/test_ptm.py
rg "alleles\|split\|biallelic" hvantk/tests/test_ptm.py -i
rg "unsplit\|multi" hvantk/ptm/ --type py -i -B 2 -A 2
cat -n hvantk/ptm/analysis.py | sed -n '1,130p'
Length of output: 5560
Hardcoded single ALT export will lose data from multi-allelic variants.
The `_vid` formatter at line 171 accesses only `alleles[1]`, so variants with multiple ALT alleles (e.g., ["A", "T", "C"]) will silently truncate to ref:alt1 and drop the additional ALTs. The exported output will be incomplete for downstream tools when processing unsplit tables. Either validate that input is biallelic and document this constraint, or emit one row per ALT.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/analysis.py` around lines 165 - 171, The _vid construction in
ht_out (ht.annotate using ht.locus.contig, ht.locus.position and ht.alleles[1])
only picks the first ALT and will drop additional alleles for multiallelic
records; either enforce biallelic input or explode multi-allelic sites into one
row per ALT. Fix by validating ht.alleles length (e.g., assert/hl.if_else to
raise or filter where hl.len(ht.alleles) == 2) when you require biallelic, or
else transform the table before building _vid by exploding ALT alleles (use an
explode/flatten pattern to iterate over hl.range(1, hl.len(ht.alleles)) or an
equivalent Hail explode on ht.alleles[1:], create a per-ALT allele variable, and
build _vid from that allele instead of ht.alleles[1]) so each exported row
includes the correct ALT.
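The explode option can be previewed in plain Python (a Hail-free sketch; in Hail this would be an explode over hl.range(1, hl.len(ht.alleles)), as the prompt describes):

```python
def variant_ids(contig: str, position: int, alleles: list) -> list:
    """One variant ID per ALT allele, instead of only alleles[1]."""
    if len(alleles) < 2:
        raise ValueError("expected REF plus at least one ALT allele")
    ref = alleles[0]
    # Iterate over every ALT (alleles[1:]), not just the first one.
    return [f"{contig}:{position}:{ref}:{alt}" for alt in alleles[1:]]
```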
"""Population-level allele frequency analysis at PTM sites (Q3).

Annotates gnomAD variants with PTM proximity and compares allele
frequency distributions between PTM-site and non-PTM coding variants.
This is where Hail scalability matters (~700M gnomAD variants).

Parameters
----------
gnomad_ht : hl.Table
    gnomAD variant table with allele frequency field.
ptm_ht : hl.Table
    PTM sites Hail Table from create_ptm_sites_tb.
output_dir : str
    Directory for output files (population_summary.json).
ccr_ht : hl.Table, optional
    CCR table for constrained region cross-reference.
af_field : str
    Name of the allele frequency field in gnomad_ht (default: "AF").
flanking_codons : int
    Flanking codons for proximal window (default: 5).
🧩 Analysis chain
🏁 Scripts executed (repository: bigbio/hvantk):
find . -type f -name "analysis.py" | grep ptm
wc -l ./hvantk/ptm/analysis.py
sed -n '330,395p' ./hvantk/ptm/analysis.py
sed -n '375,388p' ./hvantk/ptm/analysis.py
grep -n "def annotate_variants_with_ptm" ./hvantk/ptm/annotate.py
sed -n '14,80p' ./hvantk/ptm/annotate.py
sed -n '320,340p' ./hvantk/ptm/analysis.py
sed -n '80,150p' ./hvantk/ptm/annotate.py
rg -i "coding|consequence|variant_type" ./hvantk/ptm/analysis.py
sed -n '326,420p' ./hvantk/ptm/analysis.py
rg "ptm_population" ./hvantk --type py -B 3 -A 3
sed -n '1,100p' ./hvantk/commands/ptm_cli.py | head -80
grep -n "def ptm_population_cmd" ./hvantk/commands/ptm_cli.py -A 30
The "non-PTM coding" filter is not enforced in the code.
The docstring promises comparison against "non-PTM coding variants," but is_non_ptm = ~is_ptm & ~is_prox includes all noncoding variants from gnomAD. Either add a documented precondition that gnomad_ht must be pre-filtered to coding variants, or add explicit filtering (e.g., is_non_ptm = ~is_ptm & ~is_prox & is_coding).
Also applies to: 375-388
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/analysis.py` around lines 334 - 353, The is_non_ptm mask currently
defined as "is_non_ptm = ~is_ptm & ~is_prox" allows noncoding variants; update
the logic to require coding variants or declare a precondition. Either (A)
enforce coding in the analysis by changing the mask to include a coding flag
from gnomad_ht (e.g., is_non_ptm = ~is_ptm & ~is_prox & is_coding, where
is_coding is derived from gnomad_ht consequence/variant annotation) and apply
the same change where similar masks are built (see the other block around
375-388), or (B) add a clear docstring/precondition that gnomad_ht must be
pre-filtered to coding variants and validate that upstream (raise an error if
coding annotation/flag is missing). Reference symbols: is_non_ptm, is_ptm,
is_prox, af_field and the PTM analysis function in hvantk/ptm/analysis.py.
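Option (A) from the prompt, written element-wise in plain Python. Field names follow the review text; the real code builds Hail boolean expressions rather than Python lists:

```python
def non_ptm_coding_mask(is_ptm, is_prox, is_coding):
    """Baseline stratum: neither PTM nor proximal, AND coding.

    The added `c` term excludes noncoding gnomAD variants from the
    comparison group, matching what the docstring promises.
    """
    return [
        (not p) and (not x) and c
        for p, x, c in zip(is_ptm, is_prox, is_coding)
    ]
```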
# Remap contig names to match GRCh38 (e.g., MT -> M for chrM)
contig_remap = hl.dict({"MT": "M"})
ht = ht.annotate(
    _contig=hl.str("chr") + contig_remap.get(ht.chrom, ht.chrom),
)

# Filter to valid contigs in the reference genome
valid_contigs = hl.set(
    hl.literal(hl.get_reference(reference_genome).contigs)
)
ht = ht.filter(valid_contigs.contains(ht._contig))
🧩 Analysis chain
🌐 Web query:
What contig names do Hail's built-in GRCh37 and GRCh38 reference genomes use by default? Please cite the official ReferenceGenome docs or source examples.
💡 Result:
Hail's built-in GRCh37 reference genome uses numeric contig names by default: "1", "2", ..., "22", "X", "Y", "MT". GRCh38 uses prefixed names: "chr1", "chr2", ..., "chr22", "chrX", "chrY", "chrM".
Citations:
- 1: https://hail.is/docs/0.2/genetics/hail.genetics.ReferenceGenome.html
- 2: https://hail.is/docs/0.2/_modules/hail/genetics/reference_genome.html
- 3: https://hail.is/docs/0.2/functions/genetics.html
- 4: https://discuss.hail.is/t/contig-1-is-not-in-the-reference-genome-grch38-error/1841
🏁 Scripts executed (repository: bigbio/hvantk):
fd "table_builders.py" | head -5
# Get file size and then read specific lines
wc -l hvantk/tables/table_builders.py
# Read lines 1431-1441 to verify the code snippet
sed -n '1431,1441p' hvantk/tables/table_builders.py
# Read more context around the code to see the function definition and parameters
sed -n '1400,1441p' hvantk/tables/table_builders.py
# Also read the rest of the transform function to understand the full logic
sed -n '1441,1480p' hvantk/tables/table_builders.py
Make PTM contig normalization reference-genome-aware.
The code hard-codes "chr" prefix for all inputs, which breaks with reference_genome="GRCh37". GRCh37's contigs are unprefixed ("1", "MT"), while GRCh38's are prefixed ("chr1", "chrM"). With GRCh37, input "1" becomes "chr1" (invalid) and is filtered out; already-prefixed inputs become "chrchr1". Additionally, the MT→M remapping assumes GRCh38's "chrM" format and will mismatch GRCh37's "MT".
The fix should detect the reference genome's contig naming convention at the start of create_ptm_sites_tb, normalize inputs to a base form (removing any existing prefix), remap MT/M appropriately, and conditionally re-add the prefix only if the reference uses one.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/tables/table_builders.py` around lines 1431 - 1441,
create_ptm_sites_tb currently hardcodes adding "chr" to inputs
(contig_remap/ht._contig) which breaks non-prefixed references; change
create_ptm_sites_tb to detect whether the target reference's contigs are
prefixed by checking get_reference(reference_genome).contigs for a "chr" prefix,
then normalize incoming ht.chrom by stripping any leading "chr"/"CHR" to a base
contig name, apply the MT<->M remap on that base (update contig_remap logic),
and finally re-add the "chr" prefix only if the reference uses prefixed contigs
before assigning ht._contig and filtering against valid_contigs (use
valid_contigs = hl.set(hl.literal(hl.get_reference(reference_genome).contigs))
as before).
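The reference-aware normalization the prompt describes, as a standalone sketch: the contig lists below stand in for hl.get_reference(...).contigs, and the Hail version would express the same logic with hl.if_else over string expressions.

```python
def normalize_contig(chrom: str, reference_contigs) -> str:
    """Normalize a contig name to the target reference's convention.

    Strip any 'chr'/'CHR' prefix, remap the mitochondrial name on the
    base form, then re-add 'chr' only if the reference uses the prefix.
    """
    uses_chr = any(c.startswith("chr") for c in reference_contigs)
    base = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if uses_chr:
        return "chr" + {"MT": "M"}.get(base, base)   # GRCh38 style: chrM
    return {"M": "MT"}.get(base, base)               # GRCh37 style: MT
```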
This pull request introduces comprehensive documentation and CLI support for the new PTM (Post-Translational Modification) analysis module in the `hvantk` toolkit. The changes include updates to the architecture overview, usage guides, data source documentation, and a detailed PTM tool reference. The PTM workflow is now fully documented with example commands, Python API usage, and integration with existing variant analysis pipelines.

PTM Module Documentation and CLI Integration
Documentation and Architecture:
- Documented the new `ptm/` module in the architecture overview, detailing its structure and purpose for PTM variant analysis. [1] [2]
- Added `ptm_cli.py` to the CLI commands list and described associated test files. [1] [2]

Tool and Usage Documentation:
- Added a new `docs_site/tools/ptm.md` reference describing the PTM pipeline, command-line interface, Python API, workflow phases, annotation fields, output files, and module structure.
- Added an end-to-end example workflow (`docs_site/examples/ptm.md`) covering all steps from data preparation to report generation, with expected outputs and code snippets.

CLI and Core Functionality Enhancements
Command-line Integration:
- Registered the PTM downloader in the `download_cli` command group, enabling automated acquisition of PTM data. [1] [2]

Code Consistency:
- Standardized imports from `hvantk.core.constants` for consistency across modules.

Summary by CodeRabbit
New Features
Documentation
Tests