Skip to content

Handling files formats conversion & improve docs#60

Merged
enriquea merged 21 commits intomainfrom
dev
Mar 4, 2026
Merged

Handling files formats conversion & improve docs#60
enriquea merged 21 commits intomainfrom
dev

Conversation

@enriquea
Copy link
Copy Markdown
Collaborator

@enriquea enriquea commented Mar 3, 2026

This pull request introduces support for automatic and standalone conversion of standard gzip (.gz) files to block gzip (.bgz) format, which is required for Hail's parallel import. It adds a new CLI command for conversion and updates relevant commands and documentation to guide users on handling file formats. Additionally, it improves CLI usability with new options and ensures that table and matrix creation commands output Hail object descriptions for easier inspection.

File Format Conversion Support

  • Added a new standalone CLI command convert-bgz (hvantk/commands/convert_bgz_cli.py) to convert .gz files to .bgz, with options for output path and thread count.
  • Updated mktable dbnsfp, mkmatrix ucsc, and mkmatrix expression-atlas commands to support an --auto-convert-bgz flag for automatic conversion of .gz files to .bgz before import. [1] [2] [3] [4] [5] [6] [7]
  • Improved documentation in README.md, docs/library/usage.md, and docs/library/annotation-sources.md with clear instructions and tips for file format conversion, including troubleshooting advice for gzip vs BGZF issues. [1] [2] [3] [4] [5] [6] [7]

CLI Usability Improvements

  • Added short option aliases for many CLI arguments (e.g., -e, -m, -o, -g, etc.) for more convenient usage in matrix and table creation commands. [1] [2] [3]
  • Ensured that after creating a Hail Table or MatrixTable, its describe() output is displayed in the CLI for immediate inspection. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]

Documentation and Example Updates

  • Added usage examples for the new convert-bgz command and the --auto-convert-bgz flag in relevant sections of documentation and CLI help texts. [1] [2] [3] [4] [5] [6]

Minor Improvements

  • Updated the recommended Poetry shell activation in README.md for improved compatibility.

References:
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24]

Summary by CodeRabbit

  • New Features

    • Added convert-bgz CLI to convert gzip → BGZF
    • Added --auto-convert-bgz (and new short flags like -e) to matrix/table commands for on-the-fly conversion
  • Documentation

    • Updated installation activation method
    • Added File Format Conversion guidance and BGZF requirements, plus usage notes for --auto-convert-bgz
  • Tests

    • Added CLI tests for mkmatrix subcommands
    • Expanded file utilities test coverage

@qodo-code-review
Copy link
Copy Markdown
Contributor

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Add gzip-to-BGZF conversion support and improve CLI usability

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Add standalone convert-bgz CLI command for gzip to BGZF conversion
• Implement automatic gzip-to-BGZF conversion in table/matrix builders
• Add shorthand CLI options (-e, -m, -o, -g, etc.) for mkmatrix commands
• Call describe() on created MatrixTables and Tables for inspection
• Add comprehensive unit tests for file compression detection and conversion
Diagram
flowchart LR
  A["Standard gzip files"] -->|detect_compression| B["Compression detection"]
  B -->|is_bgzf/is_gzipped| C["Determine file type"]
  C -->|auto_convert_bgz flag| D["resolve_compression"]
  D -->|convert_gz_to_bgz| E["BGZF output"]
  E -->|hl.import_*| F["Hail parallel import"]
  G["CLI: convert-bgz"] -->|standalone conversion| E
  H["CLI: mkmatrix/mktable"] -->|--auto-convert-bgz| D
Loading

Grey Divider

File Changes

1. hvantk/commands/convert_bgz_cli.py ✨ Enhancement +58/-0

New standalone CLI command for gzip-to-BGZF conversion

hvantk/commands/convert_bgz_cli.py


2. hvantk/commands/make_matrix_cli.py ✨ Enhancement +66/-23

Add shorthand options and auto-convert-bgz flag to mkmatrix

hvantk/commands/make_matrix_cli.py


3. hvantk/commands/make_table_cli.py ✨ Enhancement +23/-8

Add describe() calls and auto-convert-bgz to mktable commands

hvantk/commands/make_table_cli.py


View more (14)
4. hvantk/data/file_utils.py ✨ Enhancement +348/-3

Implement compression detection and gzip-to-BGZF conversion utilities

hvantk/data/file_utils.py


5. hvantk/hvantk.py ✨ Enhancement +9/-0

Lazy load convert-bgz command with error handling

hvantk/hvantk.py


6. hvantk/tables/expression_atlas.py ✨ Enhancement +12/-0

Add auto_convert_bgz parameter and resolve_compression call

hvantk/tables/expression_atlas.py


7. hvantk/tables/genome_builders.py ✨ Enhancement +21/-4

Add auto_convert_bgz support for 1000 Genomes VCF import

hvantk/tables/genome_builders.py


8. hvantk/tables/matrix_builders.py ✨ Enhancement +4/-0

Add auto_convert_bgz parameter to matrix builder functions

hvantk/tables/matrix_builders.py


9. hvantk/tables/table_builders.py ✨ Enhancement +11/-0

Add auto_convert_bgz support to dbNSFP table builder

hvantk/tables/table_builders.py


10. hvantk/tables/ucsc.py ✨ Enhancement +19/-4

Add compression resolution and NA value handling for UCSC metadata

hvantk/tables/ucsc.py


11. hvantk/tests/hgc/test_converters.py 🧪 Tests +2/-1

Add skip condition for test when gnomad package unavailable

hvantk/tests/hgc/test_converters.py


12. hvantk/tests/test_cli_mkmatrix.py 🧪 Tests +145/-0

Add unit tests for mkmatrix CLI shorthand options

hvantk/tests/test_cli_mkmatrix.py


13. hvantk/tests/test_file_utils.py 🧪 Tests +233/-1

Add comprehensive tests for compression detection and conversion

hvantk/tests/test_file_utils.py


14. README.md 📝 Documentation +14/-1

Add file format conversion instructions and poetry shell fix

README.md


15. docs/library/annotation-sources.md 📝 Documentation +9/-0

Add BGZF requirement note and conversion guidance

docs/library/annotation-sources.md


16. docs/library/usage.md 📝 Documentation +63/-0

Add file format conversion section with examples and tips

docs/library/usage.md


17. pyproject.toml Dependencies +3/-2

Make gnomad optional and add pysam as required dependency

pyproject.toml


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Auto-convert overwrites outputs🐞 Bug ⛯ Reliability
Description
Despite documentation promising reuse, --auto-convert-bgz always reconverts and will truncate any
existing adjacent .bgz file, risking accidental clobbering and unnecessary repeated work. This can
also surprise users who manually produced a .bgz (or indexed it) and expect it to be reused.
Code

hvantk/data/file_utils.py[R493-500]

+    if compression == "gzip":
+        if auto_convert:
+            logger.info(
+                "File '%s' is standard gzip. Converting to BGZF for parallel reading...",
+                filepath,
+            )
+            bgz_path = convert_gz_to_bgz(filepath, threads=threads)
+            return bgz_path, True
Evidence
The docs explicitly state the converted .bgz is reused on subsequent runs, but the implementation
always calls convert_gz_to_bgz for gzip inputs when auto_convert is enabled, and the conversion
backends open the output path with write/truncate mode.

docs/library/usage.md[375-386]
hvantk/data/file_utils.py[458-512]
hvantk/data/file_utils.py[367-376]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`--auto-convert-bgz` is documented as writing a sibling `.bgz` file and reusing it on subsequent runs, but the implementation always reconverts and truncates any existing `.bgz` output.
### Issue Context
This impacts commands that call `resolve_compression(..., auto_convert=True)`, and can clobber existing `.bgz` files and waste time.
### Fix Focus Areas
- hvantk/data/file_utils.py[458-512]
- hvantk/data/file_utils.py[379-451]
- hvantk/data/file_utils.py[293-312]
- hvantk/data/file_utils.py[367-377]
- docs/library/usage.md[375-386]
### Notes
- Consider: if output exists and `is_bgzf(output)` is true, return it.
- If converting, write to `output_path + ".tmp"` (or `NamedTemporaryFile` in same dir) and `os.replace` to avoid partial files.
- If you keep reconversion semantics, update the docs to remove the reuse claim and/or add an explicit overwrite flag.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. bgzip pipeline misses errors🐞 Bug ⛯ Reliability
Description
The system bgzip backend only validates bgzip’s exit code and ignores gzip’s exit code/stderr, so
gzip-side failures can surface as unclear downstream issues or produce invalid/partial outputs.
Error messages are also lost because stderr isn’t captured and propagated.
Code

hvantk/data/file_utils.py[R293-311]

+def _convert_with_bgzip(input_path: str, output_path: str, threads: int) -> None:
+    """Convert using system ``bgzip`` via subprocess pipe."""
+    with open(output_path, "wb") as f_out:
+        p1 = subprocess.Popen(
+            ["gzip", "-dc", input_path],
+            stdout=subprocess.PIPE,
+        )
+        p2 = subprocess.Popen(
+            ["bgzip", "-c", f"-@{threads}"],
+            stdin=p1.stdout,
+            stdout=f_out,
+        )
+        p1.stdout.close()  # allow p1 to receive SIGPIPE if p2 exits
+        p2.communicate()
+        p1.wait()
+    if p2.returncode != 0:
+        raise RuntimeError(
+            f"bgzip conversion failed with return code {p2.returncode}"
+        )
Evidence
The conversion pipeline starts two subprocesses but checks only p2.returncode. It never checks
p1.returncode and does not capture stderr from either subprocess to report conversion errors
clearly.

hvantk/data/file_utils.py[293-311]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The bgzip conversion backend validates only the `bgzip` process exit code and ignores failures from the `gzip -dc` process; stderr is not captured, making failures hard to debug and potentially leaving partial outputs.
### Issue Context
This code is used by `convert_gz_to_bgz()` when a system `bgzip` binary is detected.
### Fix Focus Areas
- hvantk/data/file_utils.py[293-312]
- hvantk/data/file_utils.py[379-451]
### Notes
- Capture stderr for both processes (e.g., `stderr=subprocess.PIPE`).
- After `p1.wait()`, check `p1.returncode` and raise with stderr content.
- Consider using a temp output file and `os.replace` to avoid leaving corrupted `.bgz` on disk.
- Optionally verify required binaries (`gzip`, `bgzip`) with `shutil.which` before selecting this backend.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. force_bgz ignored for BGZF🐞 Bug ✓ Correctness
Description
resolve_compression’s docstring states the caller’s desired force_bgz is honored when the input is
actually BGZF, but the implementation always returns True for BGZF inputs. This makes
--no-force-bgz ineffective and violates the function’s contract.
Code

hvantk/data/file_utils.py[R474-492]

+    force_bgz : bool
+        Desired ``force_bgz`` setting (only honoured when the file is
+        actually BGZF).
+    auto_convert : bool
+        When True and the file is plain gzip, automatically convert it to
+        BGZF before import.
+    threads : int
+        Thread count passed to :func:`convert_gz_to_bgz` when converting.
+
+    Returns
+    -------
+    tuple[str, bool]
+        ``(filepath, force_bgz)`` ready to pass to ``hl.import_*``.
+    """
+    compression = detect_compression(filepath)
+
+    if compression == "bgzf":
+        return filepath, True
+
Evidence
The parameter documentation says force_bgz is honored for BGZF inputs, but the code returns a
hard-coded True when compression=="bgzf" and never uses the provided argument.

hvantk/data/file_utils.py[458-492]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`resolve_compression()` ignores its `force_bgz` argument when the detected file is BGZF, always returning `True`.
### Issue Context
This affects CLI flags like `--force-bgz/--no-force-bgz` in `mkmatrix` commands and any other call sites passing `force_bgz`.
### Fix Focus Areas
- hvantk/data/file_utils.py[458-492]
- hvantk/tests/test_file_utils.py[259-276]
### Notes
- If the intent is to honor caller preference for BGZF: change to `return filepath, force_bgz`.
- Add a unit test asserting `resolve_compression(bgzf_file, force_bgz=False)` returns `(bgzf_file, False)`.
- If the intent is to always force BGZF when BGZF is detected, remove the `force_bgz` parameter and update docstrings/callers.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. convert-bgz accepts directories 🐞 Bug ⛯ Reliability
Description
The convert-bgz CLI allows passing a directory path as input (default click.Path permits dirs),
which then fails with a low-level IsADirectoryError when the code tries to open it. This should be
validated at the CLI layer for a clearer error.
Code

hvantk/commands/convert_bgz_cli.py[R18-19]

+@click.command("convert-bgz", context_settings=CONTEXT_SETTINGS)
+@click.argument("input_file", type=click.Path(exists=True))
Evidence
The CLI argument uses click.Path(exists=True) without dir_okay=False, and the compression detection
opens the provided path as a file, which will error for directories.

hvantk/commands/convert_bgz_cli.py[18-19]
hvantk/data/file_utils.py[230-234]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`hvantk convert-bgz` currently accepts directories as `input_file`, leading to an unhandled `IsADirectoryError` during compression detection.
### Issue Context
This is a CLI UX/reliability issue; it should fail early with a clear message.
### Fix Focus Areas
- hvantk/commands/convert_bgz_cli.py[18-26]
- hvantk/data/file_utils.py[184-241]
### Notes
- Update to `click.Path(exists=True, dir_okay=False, file_okay=True, readable=True)`.
- Optionally wrap detection/conversion errors with `click.ClickException` for cleaner CLI output.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

Adds BGZF handling across the toolkit: new convert-bgz CLI, file utilities for detection and gzip→BGZF conversion (multiple backends), an --auto-convert-bgz option propagated to matrix/table builders and CLIs, docs/tests updated, and pysam added as a dependency.

Changes

Cohort / File(s) Summary
Documentation & README
README.md, docs/library/annotation-sources.md, docs/library/usage.md
Added BGZF file-format guidance, usage for --auto-convert-bgz and convert-bgz, and updated installation/quick-start notes.
Compression CLI & CLI wiring
hvantk/commands/convert_bgz_cli.py, hvantk/hvantk.py
New convert-bgz click command with compression detection and backend selection; lazily registered into the top-level CLI.
Matrix & Table CLI updates
hvantk/commands/make_matrix_cli.py, hvantk/commands/make_table_cli.py
Added --auto-convert-bgz flags and short aliases; propagated auto_convert_bgz to builders; added post-create mt.describe() / ht.describe() calls.
File utilities & conversion logic
hvantk/data/file_utils.py
Added detection (is_gzipped, is_bgzf, detect_compression), conversion orchestration (convert_gz_to_bgz, backend selection), resolve_compression, download (download_file, url_exists), and zip utilities.
Builders integration
hvantk/tables/...
hvantk/tables/ucsc.py, .../expression_atlas.py, .../genome_builders.py, .../matrix_builders.py, .../table_builders.py
Added auto_convert_bgz parameters; call resolve_compression before imports to optionally convert gzip→BGZF; minor metadata handling fixes in UCSC loader.
Tests
hvantk/tests/test_file_utils.py, hvantk/tests/test_cli_mkmatrix.py, hvantk/tests/hgc/test_converters.py
Expanded tests for compression detection/conversion/backends and CLI mkmatrix coverage; conditional skip added for a gnomad-dependent test.
Project manifest
pyproject.toml
Added pysam >=0.22 dependency; made gnomad optional and included it in the hgc extra.

Sequence Diagram

sequenceDiagram
    participant User as User/CLI
    participant CLI as mkmatrix/mktable CLI
    participant Resolver as resolve_compression
    participant Converter as convert_gz_to_bgz
    participant Backend as Conversion Backend<br/>(bgzip/pysam/Python)
    participant Importer as hl.import_*

    User->>CLI: Invoke with --auto-convert-bgz
    CLI->>Resolver: resolve_compression(input_file, auto_convert=True)
    Resolver->>Resolver: Detect file compression type
    alt File is gzip
        Resolver->>Converter: convert_gz_to_bgz(input_file)
        Converter->>Converter: Select backend
        Converter->>Backend: Convert gzip → BGZF
        Backend-->>Converter: Converted file path
        Converter-->>Resolver: Return BGZF path
    else File is BGZF
        Resolver-->>Resolver: Return as-is
    end
    Resolver-->>CLI: Return path + force_bgz flag (if applicable)
    CLI->>Importer: Import with force_bgz setting
    Importer-->>CLI: MatrixTable/Table created
    CLI->>User: Display describe() output
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement, documentation, Review effort 4/5

Suggested reviewers

  • ypriverol

Poem

🐰 I nibble bytes both small and big,
I swap gzip for BGZF—what a jig!
Three backends hum, conversions play,
Imports hop faster every day.
Hooray for files that parallel cheer!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is vague and uses a generic phrasing ('Handling' and 'improve docs') that could describe many different changesets without clearly conveying the specific primary change. Revise the title to be more specific about the key feature, such as: 'Add gzip to BGZF conversion CLI command and auto-convert option' or 'Add convert-bgz command and --auto-convert-bgz flag for format handling'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@enriquea enriquea self-assigned this Mar 3, 2026
Comment thread hvantk/data/file_utils.py
Comment thread hvantk/data/file_utils.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (3)
hvantk/data/file_utils.py (1)

460-492: force_bgz is currently ignored for BGZF input.

Line 460 exposes force_bgz, but Line 491 always returns True for BGZF files. That makes the parameter misleading.

♻️ Minimal fix
-    if compression == "bgzf":
-        return filepath, True
+    if compression == "bgzf":
+        return filepath, bool(force_bgz)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/data/file_utils.py` around lines 460 - 492, The function that detects
compression (calls detect_compression) currently ignores the force_bgz parameter
for BGZF inputs by returning True unconditionally; change the BGZF branch to
return the original filepath and the passed-in force_bgz value instead of True
so the caller's preference is honored. Locate the BGZF case (the if compression
== "bgzf": branch) and replace the hardcoded True with the force_bgz parameter,
leaving the other branches (plain gzip conversion via convert_gz_to_bgz,
auto_convert, threads) unchanged.
hvantk/hvantk.py (1)

56-63: Consider narrowing the exception handling.

The silent pass on ImportError could mask legitimate import errors within convert_bgz_cli (e.g., a missing dependency or syntax error in that module). Consider logging a debug message or being more specific about which import failure is acceptable.

♻️ Optional: Add debug logging for import failure
 try:
     from hvantk.commands.convert_bgz_cli import convert_bgz_cmd
 
     cli.add_command(convert_bgz_cmd)  # GZ → BGZF conversion utility
 except ImportError:
-    pass
+    logger.debug("convert_bgz_cmd not available (module not installed or import error)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/hvantk.py` around lines 56 - 63, The try/except around importing
hvantk.commands.convert_bgz_cli currently swallows all ImportError causes;
change it to catch ImportError as e and log a debug or warning message that
includes the exception (e) and the module name
("hvantk.commands.convert_bgz_cli") instead of silently passing, so
missing-optional-command installations are quiet but real import problems are
visible; keep the import attempt and cli.add_command(convert_bgz_cmd) as-is and
allow other exception types to propagate so syntax/bug errors in convert_bgz_cli
are not masked.
hvantk/commands/make_matrix_cli.py (1)

1-9: Misplaced CLI example in module docstring.

The convert-bgz example on line 8 belongs in convert_bgz_cli.py, not in this make_matrix_cli.py module which handles mkmatrix commands. Consider moving this example to the appropriate file for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 1 - 9, The module docstring
for make_matrix_cli.py contains an example for the unrelated "hvantk
convert-bgz" command; remove that convert-bgz example from the
make_matrix_cli.py top docstring and add it to the docstring (or examples
section) of the convert_bgz_cli module (convert_bgz_cli) so CLI examples are in
the module that implements the convert-bgz command and mkmatrix examples remain
in make_matrix_cli.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/library/annotation-sources.md`:
- Around line 9-13: Update the wording that claims ".gz files will fail" to a
more accurate statement: change the sentence in the annotation-sources text that
asserts standard gzip (".gz") files "will fail" so it explains that ".gz and
uncompressed files are processed single-threaded (do not enable parallel
import), while block gzip (BGZF) '.bgz' files enable parallel import." Ensure
the paragraph around the current bgz guidance and the two bullet options (the
`--auto-convert-bgz` flag and `hvantk convert-bgz`) reflects this nuance.

In `@docs/library/usage.md`:
- Around line 373-374: The sentence that reads "Hail requires block gzip (BGZF)
compressed files for parallel import. Standard gzip (`.gz`) files are not
directly supported." should be reworded to clarify that `.bgz`/BGZF enables
parallel import while `.gz` and uncompressed files are still handled but
processed single-threaded (slower). Locate the sentence beginning "Hail requires
block gzip (BGZF)..." and replace "Standard gzip (`.gz`) files are not directly
supported." with language like "Standard gzip (`.gz`) and uncompressed files are
supported but will be processed with a single CPU; only BGZF (`.bgz`) enables
parallel import." Also ensure the subsequent troubleshooting paragraph aligns
with this wording so the doc is consistent.

In `@hvantk/commands/convert_bgz_cli.py`:
- Around line 29-35: Validate that the parsed --threads value is a positive
integer before calling the backend: after parsing (the variable named threads or
args.threads in convert_bgz_cli.py) check threads > 0 and if not, exit early
with a clear error (use parser.error or raise SystemExit with a message) so the
backend never receives non-positive values; update any code paths that accept
threads (including the other occurrences referenced in 36-57) to perform the
same guardrail prior to invoking the conversion function (e.g., whatever calls
into bgzip/system bgzip backend).

In `@hvantk/data/file_utils.py`:
- Around line 293-312: The _convert_with_bgzip function currently only checks
p2.returncode and can leave a partial/invalid output if the gzip (p1) subprocess
fails; update _convert_with_bgzip to (1) capture and wait for both p1 and p2,
checking both return codes (or use subprocess.run for each step) and capturing
stderr from each process, (2) on any failure remove the output_path to avoid
leaving partial files, and (3) raise a RuntimeError that includes both p1 and p2
exit codes and stderr messages (reference p1, p2, output_path, and function name
_convert_with_bgzip to locate the change).

---

Nitpick comments:
In `@hvantk/commands/make_matrix_cli.py`:
- Around line 1-9: The module docstring for make_matrix_cli.py contains an
example for the unrelated "hvantk convert-bgz" command; remove that convert-bgz
example from the make_matrix_cli.py top docstring and add it to the docstring
(or examples section) of the convert_bgz_cli module (convert_bgz_cli) so CLI
examples are in the module that implements the convert-bgz command and mkmatrix
examples remain in make_matrix_cli.py.

In `@hvantk/data/file_utils.py`:
- Around line 460-492: The function that detects compression (calls
detect_compression) currently ignores the force_bgz parameter for BGZF inputs by
returning True unconditionally; change the BGZF branch to return the original
filepath and the passed-in force_bgz value instead of True so the caller's
preference is honored. Locate the BGZF case (the if compression == "bgzf":
branch) and replace the hardcoded True with the force_bgz parameter, leaving the
other branches (plain gzip conversion via convert_gz_to_bgz, auto_convert,
threads) unchanged.

In `@hvantk/hvantk.py`:
- Around line 56-63: The try/except around importing
hvantk.commands.convert_bgz_cli currently swallows all ImportError causes;
change it to catch ImportError as e and log a debug or warning message that
includes the exception (e) and the module name
("hvantk.commands.convert_bgz_cli") instead of silently passing, so
missing-optional-command installations are quiet but real import problems are
visible; keep the import attempt and cli.add_command(convert_bgz_cmd) as-is and
allow other exception types to propagate so syntax/bug errors in convert_bgz_cli
are not masked.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f490fa3 and 6d6a2af.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • README.md
  • docs/library/annotation-sources.md
  • docs/library/usage.md
  • hvantk/commands/convert_bgz_cli.py
  • hvantk/commands/make_matrix_cli.py
  • hvantk/commands/make_table_cli.py
  • hvantk/data/file_utils.py
  • hvantk/hvantk.py
  • hvantk/tables/expression_atlas.py
  • hvantk/tables/genome_builders.py
  • hvantk/tables/matrix_builders.py
  • hvantk/tables/table_builders.py
  • hvantk/tables/ucsc.py
  • hvantk/tests/hgc/test_converters.py
  • hvantk/tests/test_cli_mkmatrix.py
  • hvantk/tests/test_file_utils.py
  • pyproject.toml

Comment thread docs/library/annotation-sources.md Outdated
Comment thread docs/library/usage.md Outdated
Comment thread hvantk/commands/convert_bgz_cli.py
Comment thread hvantk/data/file_utils.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
hvantk/data/file_utils.py (1)

441-471: ⚠️ Potential issue | 🟠 Major

Use atomic writes to avoid reusing corrupted .bgz artifacts.

Line 442 can reuse an existing output based on header detection only. Since conversion writes directly to output_path, an interrupted/failed run (especially non-bgzip backends) can leave a partial file that may be incorrectly reused later.

🛠️ Suggested hardening (temp file + atomic replace)
 def convert_gz_to_bgz(
@@
-    if backend == "bgzip":
-        logger.info(
-            "Converting '%s' to BGZF using system bgzip with %d threads...",
-            input_path,
-            threads,
-        )
-        _convert_with_bgzip(input_path, output_path, threads)
-    elif backend == "pysam":
-        logger.info(
-            "Converting '%s' to BGZF using pysam (system bgzip not found)...",
-            input_path,
-        )
-        _convert_with_pysam(input_path, output_path)
-    else:
-        logger.info(
-            "Converting '%s' to BGZF using pure Python (pysam and bgzip not available)...",
-            input_path,
-        )
-        _convert_with_python(input_path, output_path)
+    tmp_output = f"{output_path}.tmp"
+    try:
+        if os.path.exists(tmp_output):
+            os.remove(tmp_output)
+
+        if backend == "bgzip":
+            logger.info(
+                "Converting '%s' to BGZF using system bgzip with %d threads...",
+                input_path,
+                threads,
+            )
+            _convert_with_bgzip(input_path, tmp_output, threads)
+        elif backend == "pysam":
+            logger.info(
+                "Converting '%s' to BGZF using pysam (system bgzip not found)...",
+                input_path,
+            )
+            _convert_with_pysam(input_path, tmp_output)
+        else:
+            logger.info(
+                "Converting '%s' to BGZF using pure Python (pysam and bgzip not available)...",
+                input_path,
+            )
+            _convert_with_python(input_path, tmp_output)
+
+        os.replace(tmp_output, output_path)
+    finally:
+        if os.path.exists(tmp_output):
+            os.remove(tmp_output)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/data/file_utils.py` around lines 441 - 471, The current early-return
that reuses output_path when detect_compression(output_path) == "bgzf" is unsafe
because conversions write directly to output_path and partial/corrupt files can
be reused; change the conversion flow (around the call sites of
_convert_with_bgzip, _convert_with_pysam, _convert_with_python in this block) to
write to a temporary file (e.g., output_path + a unique suffix or use
tempfile.NamedTemporaryFile) and only atomically replace the final output_path
with os.replace(temp_path, output_path) after successful conversion and a
post-check (e.g., detect_compression(temp_path) == "bgzf"), and ensure any temp
file is removed on failure or exception so interrupted runs cannot leave partial
.bgz that will be reused by the initial detect_compression check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/data/file_utils.py`:
- Around line 395-399: The convert_gz_to_bgz function should validate the
threads parameter early: check that threads is an int > 0 at the top of
convert_gz_to_bgz and raise a clear ValueError (e.g. "threads must be a positive
integer") if the check fails so callers outside the CLI get immediate,
descriptive feedback; update any docstring or parameter validation near the
function signature to reflect this contract.

---

Duplicate comments:
In `@hvantk/data/file_utils.py`:
- Around line 441-471: The current early-return that reuses output_path when
detect_compression(output_path) == "bgzf" is unsafe because conversions write
directly to output_path and partial/corrupt files can be reused; change the
conversion flow (around the call sites of _convert_with_bgzip,
_convert_with_pysam, _convert_with_python in this block) to write to a temporary
file (e.g., output_path + a unique suffix or use tempfile.NamedTemporaryFile)
and only atomically replace the final output_path with os.replace(temp_path,
output_path) after successful conversion and a post-check (e.g.,
detect_compression(temp_path) == "bgzf"), and ensure any temp file is removed on
failure or exception so interrupted runs cannot leave partial .bgz that will be
reused by the initial detect_compression check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e3a00c4d-05d4-4b39-9c3b-64e92ae3434d

📥 Commits

Reviewing files that changed from the base of the PR and between 6d6a2af and 5fa046e.

📒 Files selected for processing (6)
  • docs/library/annotation-sources.md
  • docs/library/usage.md
  • hvantk/commands/convert_bgz_cli.py
  • hvantk/commands/make_matrix_cli.py
  • hvantk/data/file_utils.py
  • hvantk/hvantk.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • hvantk/hvantk.py

Comment thread hvantk/data/file_utils.py
Comment on lines +395 to +399
def convert_gz_to_bgz(
input_path: str,
output_path: str | None = None,
threads: int = 4,
) -> str:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate threads in the shared conversion API.

convert_gz_to_bgz is also callable outside convert-bgz CLI. A non-positive threads value should fail early with a clear error rather than deferring to backend-specific failures.

✅ Small guardrail
 def convert_gz_to_bgz(
     input_path: str,
     output_path: str | None = None,
     threads: int = 4,
 ) -> str:
@@
+    if threads < 1:
+        raise ValueError("threads must be a positive integer")
+
     if not os.path.exists(input_path):
         raise FileNotFoundError(f"File not found: {input_path}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def convert_gz_to_bgz(
input_path: str,
output_path: str | None = None,
threads: int = 4,
) -> str:
def convert_gz_to_bgz(
input_path: str,
output_path: str | None = None,
threads: int = 4,
) -> str:
if threads < 1:
raise ValueError("threads must be a positive integer")
if not os.path.exists(input_path):
raise FileNotFoundError(f"File not found: {input_path}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/data/file_utils.py` around lines 395 - 399, The convert_gz_to_bgz
function should validate the threads parameter early: check that threads is an
int > 0 at the top of convert_gz_to_bgz and raise a clear ValueError (e.g.
"threads must be a positive integer") if the check fails so callers outside the
CLI get immediate, descriptive feedback; update any docstring or parameter
validation near the function signature to reflect this contract.

@enriquea enriquea merged commit 3c0ccee into main Mar 4, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant