…ons for gzip files
Review Summary by Qodo
Add gzip-to-BGZF conversion support and improve CLI usability
Walkthroughs
Description
• Add standalone convert-bgz CLI command for gzip to BGZF conversion
• Implement automatic gzip-to-BGZF conversion in table/matrix builders
• Add shorthand CLI options (-e, -m, -o, -g, etc.) for mkmatrix commands
• Call describe() on created MatrixTables and Tables for inspection
• Add comprehensive unit tests for file compression detection and conversion
Diagram
flowchart LR
A["Standard gzip files"] -->|detect_compression| B["Compression detection"]
B -->|is_bgzf/is_gzipped| C["Determine file type"]
C -->|auto_convert_bgz flag| D["resolve_compression"]
D -->|convert_gz_to_bgz| E["BGZF output"]
E -->|hl.import_*| F["Hail parallel import"]
G["CLI: convert-bgz"] -->|standalone conversion| E
H["CLI: mkmatrix/mktable"] -->|--auto-convert-bgz| D
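The detection step in the flowchart can be illustrated with a header check. This is a minimal sketch, not the PR's `detect_compression`: the function name `sniff_compression` and its return values are hypothetical, and it assumes the BGZF `BC` extra subfield is the first one in the gzip header, which is how bgzip and htslib write it.

```python
GZIP_MAGIC = b"\x1f\x8b"

# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def sniff_compression(path: str) -> str:
    """Return "bgzf", "gzip", or "none" based on the first 16 bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    if head[:2] != GZIP_MAGIC:
        return "none"
    # BGZF is gzip with CM=8 (deflate), the FEXTRA flag (0x04) set,
    # and an extra subfield tagged 'B','C' whose payload length is 2.
    is_bgzf = (
        len(head) == 16
        and head[2] == 8
        and bool(head[3] & 0x04)
        and head[12:14] == b"BC"
        and head[14:16] == b"\x02\x00"
    )
    return "bgzf" if is_bgzf else "gzip"
```

Only the header is inspected, so a truncated BGZF file would still be reported as "bgzf"; a later comment in this thread raises exactly that concern for reused outputs.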
File Changes
1. hvantk/commands/convert_bgz_cli.py
📝 Walkthrough
Adds BGZF handling across the toolkit: new convert-bgz CLI, file utilities for detection and gzip→BGZF conversion (multiple backends), an --auto-convert-bgz option propagated to matrix/table builders and CLIs, docs/tests updated, and pysam added as a dependency.
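To make the "multiple backends" point concrete, here is a rough sketch of what a pure-Python gzip→BGZF fallback has to do. It is an illustration under stated assumptions, not the PR's `_convert_with_python`: the names `bgzf_block` and `gzip_to_bgzf` are hypothetical, and the block layout follows the BGZF section of the SAM specification (gzip members of at most 64 KiB carrying a `BC` extra subfield, terminated by an empty EOF block).

```python
import gzip
import struct
import zlib

BLOCK_INPUT_SIZE = 0xFF00  # uncompressed bytes per block; keeps every block under 64 KiB
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def bgzf_block(data: bytes, level: int = 6) -> bytes:
    """Wrap at most BLOCK_INPUT_SIZE bytes in one BGZF block."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)  # -15 selects raw deflate
    cdata = compressor.compress(data) + compressor.flush()
    bsize = len(cdata) + 25  # total block length (18 + cdata + 8) minus 1
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B,          # gzip magic
        8, 4,                # CM=deflate, FLG=FEXTRA
        0, 0, 0xFF,          # MTIME, XFL, OS=unknown
        6,                   # XLEN: one 6-byte extra subfield
        ord("B"), ord("C"),  # subfield identifier
        2, bsize,            # subfield length, BSIZE
    )
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + cdata + trailer


def gzip_to_bgzf(input_path: str, output_path: str) -> None:
    """Stream a gzip file into BGZF blocks and finish with the EOF marker."""
    with gzip.open(input_path, "rb") as src, open(output_path, "wb") as dst:
        while True:
            chunk = src.read(BLOCK_INPUT_SIZE)
            if not chunk:
                break
            dst.write(bgzf_block(chunk))
        dst.write(BGZF_EOF)
```

Because every block is an independent gzip member, the result stays readable with ordinary gzip tools. This single-threaded loop is likely to be much slower than system bgzip on large files, which is presumably why it is the last fallback.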
Sequence Diagram
sequenceDiagram
participant User as User/CLI
participant CLI as mkmatrix/mktable CLI
participant Resolver as resolve_compression
participant Converter as convert_gz_to_bgz
participant Backend as Conversion Backend<br/>(bgzip/pysam/Python)
participant Importer as hl.import_*
User->>CLI: Invoke with --auto-convert-bgz
CLI->>Resolver: resolve_compression(input_file, auto_convert=True)
Resolver->>Resolver: Detect file compression type
alt File is gzip
Resolver->>Converter: convert_gz_to_bgz(input_file)
Converter->>Converter: Select backend
Converter->>Backend: Convert gzip → BGZF
Backend-->>Converter: Converted file path
Converter-->>Resolver: Return BGZF path
else File is BGZF
Resolver-->>Resolver: Return as-is
end
Resolver-->>CLI: Return path + force_bgz flag (if applicable)
CLI->>Importer: Import with force_bgz setting
Importer-->>CLI: MatrixTable/Table created
CLI->>User: Display describe() output
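The resolver step in the sequence diagram can be sketched as a small decision function. This is an assumption-laden illustration rather than the PR's `resolve_compression`: the detector and converter are passed in as callables so the sketch runs without hvantk, the name `resolve_compression_sketch` is hypothetical, and the behaviour for plain gzip without auto-conversion (return the path unchanged with the flag off) is a guess based on the documentation comments in this thread.

```python
from typing import Callable, Tuple


def resolve_compression_sketch(
    filepath: str,
    detect: Callable[[str], str],
    convert: Callable[[str], str],
    auto_convert: bool = False,
) -> Tuple[str, bool]:
    """Return (path_to_import, force_bgz) following the diagram's branches."""
    kind = detect(filepath)
    if kind == "bgzf":
        # Already block-compressed: import as-is with the BGZF flag set.
        return filepath, True
    if kind == "gzip" and auto_convert:
        # Plain gzip: convert first, then import the converted file.
        return convert(filepath), True
    # Uncompressed input, or gzip without auto-conversion: leave untouched.
    return filepath, False
```

A caller would then pass the returned flag through to the importer, for example `hl.import_table(path, force_bgz=flag)`.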
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 4
🧹 Nitpick comments (3)
hvantk/data/file_utils.py (1)
460-492: `force_bgz` is currently ignored for BGZF input.
Line 460 exposes `force_bgz`, but Line 491 always returns `True` for BGZF files. That makes the parameter misleading.
♻️ Minimal fix

    -    if compression == "bgzf":
    -        return filepath, True
    +    if compression == "bgzf":
    +        return filepath, bool(force_bgz)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/data/file_utils.py` around lines 460 - 492, The function that detects compression (calls detect_compression) currently ignores the force_bgz parameter for BGZF inputs by returning True unconditionally; change the BGZF branch to return the original filepath and the passed-in force_bgz value instead of True so the caller's preference is honored. Locate the BGZF case (the if compression == "bgzf": branch) and replace the hardcoded True with the force_bgz parameter, leaving the other branches (plain gzip conversion via convert_gz_to_bgz, auto_convert, threads) unchanged.

hvantk/hvantk.py (1)
56-63: Consider narrowing the exception handling.
The silent `pass` on `ImportError` could mask legitimate import errors within `convert_bgz_cli` (e.g., a missing dependency or syntax error in that module). Consider logging a debug message or being more specific about which import failure is acceptable.
♻️ Optional: Add debug logging for import failure

     try:
         from hvantk.commands.convert_bgz_cli import convert_bgz_cmd
         cli.add_command(convert_bgz_cmd)  # GZ → BGZF conversion utility
     except ImportError:
    -    pass
    +    logger.debug("convert_bgz_cmd not available (module not installed or import error)")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/hvantk.py` around lines 56 - 63, The try/except around importing hvantk.commands.convert_bgz_cli currently swallows all ImportError causes; change it to catch ImportError as e and log a debug or warning message that includes the exception (e) and the module name ("hvantk.commands.convert_bgz_cli") instead of silently passing, so missing-optional-command installations are quiet but real import problems are visible; keep the import attempt and cli.add_command(convert_bgz_cmd) as-is and allow other exception types to propagate so syntax/bug errors in convert_bgz_cli are not masked.

hvantk/commands/make_matrix_cli.py (1)
1-9: Misplaced CLI example in module docstring.
The `convert-bgz` example on line 8 belongs in `convert_bgz_cli.py`, not in this `make_matrix_cli.py` module which handles `mkmatrix` commands. Consider moving this example to the appropriate file for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/make_matrix_cli.py` around lines 1 - 9, The module docstring for make_matrix_cli.py contains an example for the unrelated "hvantk convert-bgz" command; remove that convert-bgz example from the make_matrix_cli.py top docstring and add it to the docstring (or examples section) of the convert_bgz_cli module (convert_bgz_cli) so CLI examples are in the module that implements the convert-bgz command and mkmatrix examples remain in make_matrix_cli.py.
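For the import-guard nitpick on `hvantk/hvantk.py`, the suggested behaviour (stay quiet when an optional command is missing, but leave a trace of why) might look like the following. This is a generic sketch: the helper `load_optional_command` is hypothetical, and the real code registers the result with `cli.add_command` rather than returning it.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_optional_command(module_name: str, attr_name: str):
    """Import an optional command object, logging instead of silently passing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Missing optional module: not fatal, but record the reason at debug level.
        logger.debug("Optional module %s not available: %s", module_name, exc)
        return None
    # Other exceptions (e.g. SyntaxError inside the module) propagate unchanged.
    return getattr(module, attr_name)
```

Only `ImportError` is caught, so genuine bugs inside the optional module still surface as tracebacks.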
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/library/annotation-sources.md`:
- Around line 9-13: Update the wording that claims ".gz files will fail" to a
more accurate statement: change the sentence in the annotation-sources text that
asserts standard gzip (".gz") files "will fail" so it explains that ".gz and
uncompressed files are processed single-threaded (do not enable parallel
import), while block gzip (BGZF) '.bgz' files enable parallel import." Ensure
the paragraph around the current bgz guidance and the two bullet options (the
`--auto-convert-bgz` flag and `hvantk convert-bgz`) reflects this nuance.
In `@docs/library/usage.md`:
- Around line 373-374: The sentence that reads "Hail requires block gzip (BGZF)
compressed files for parallel import. Standard gzip (`.gz`) files are not
directly supported." should be reworded to clarify that `.bgz`/BGZF enables
parallel import while `.gz` and uncompressed files are still handled but
processed single-threaded (slower). Locate the sentence beginning "Hail requires
block gzip (BGZF)..." and replace "Standard gzip (`.gz`) files are not directly
supported." with language like "Standard gzip (`.gz`) and uncompressed files are
supported but will be processed with a single CPU; only BGZF (`.bgz`) enables
parallel import." Also ensure the subsequent troubleshooting paragraph aligns
with this wording so the doc is consistent.
In `@hvantk/commands/convert_bgz_cli.py`:
- Around line 29-35: Validate that the parsed --threads value is a positive
integer before calling the backend: after parsing (the variable named threads or
args.threads in convert_bgz_cli.py) check threads > 0 and if not, exit early
with a clear error (use parser.error or raise SystemExit with a message) so the
backend never receives non-positive values; update any code paths that accept
threads (including the other occurrences referenced in 36-57) to perform the
same guardrail prior to invoking the conversion function (e.g., whatever calls
into bgzip/system bgzip backend).
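The `--threads` guard described above can be sketched with the standard library. The CLI framework used by `convert_bgz_cli.py` is not shown in this thread, so `argparse` here is a stand-in, and `positive_int` and `build_parser` are hypothetical names; the default of 4 mirrors the `convert_gz_to_bgz` signature quoted later in this review.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type that accepts only integers >= 1."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {text!r}")
    if value < 1:
        raise argparse.ArgumentTypeError("threads must be a positive integer")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="convert-bgz-sketch")
    parser.add_argument("input", help="path to a .gz file")
    parser.add_argument("--threads", type=positive_int, default=4)
    return parser
```

With this in place, `--threads 0` makes the parser exit with a usage error before any backend is invoked.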
In `@hvantk/data/file_utils.py`:
- Around line 293-312: The _convert_with_bgzip function currently only checks
p2.returncode and can leave a partial/invalid output if the gzip (p1) subprocess
fails; update _convert_with_bgzip to (1) capture and wait for both p1 and p2,
checking both return codes (or use subprocess.run for each step) and capturing
stderr from each process, (2) on any failure remove the output_path to avoid
leaving partial files, and (3) raise a RuntimeError that includes both p1 and p2
exit codes and stderr messages (reference p1, p2, output_path, and function name
_convert_with_bgzip to locate the change).
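A two-process pipeline that checks both exit codes and cleans up after itself could be structured as below. This is a sketch of the pattern the comment asks for, not the actual `_convert_with_bgzip`: `run_pipeline` is a hypothetical helper that takes arbitrary producer and consumer commands, and stderr goes to temporary files so neither process can block on a full pipe.

```python
import os
import subprocess
import tempfile


def run_pipeline(producer_cmd, consumer_cmd, output_path):
    """Run `producer | consumer > output_path`, failing if either process fails."""
    try:
        with tempfile.TemporaryFile() as err1, tempfile.TemporaryFile() as err2:
            with open(output_path, "wb") as out:
                p1 = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE, stderr=err1)
                try:
                    p2 = subprocess.Popen(
                        consumer_cmd, stdin=p1.stdout, stdout=out, stderr=err2
                    )
                except Exception:
                    p1.kill()
                    p1.wait()
                    raise
                finally:
                    # Close our copy so the producer sees a broken pipe if the consumer dies.
                    p1.stdout.close()
                rc2 = p2.wait()
                rc1 = p1.wait()
            if rc1 != 0 or rc2 != 0:
                err1.seek(0)
                err2.seek(0)
                msg1 = err1.read().decode(errors="replace").strip()
                msg2 = err2.read().decode(errors="replace").strip()
                raise RuntimeError(
                    f"pipeline failed: producer exit {rc1} ({msg1}); "
                    f"consumer exit {rc2} ({msg2})"
                )
    except BaseException:
        # Never leave a partial output behind.
        if os.path.exists(output_path):
            os.remove(output_path)
        raise
```

For the conversion case the producer would be something like `["gzip", "-dc", input_path]` and the consumer a bgzip invocation reading from stdin; the exact flags should be checked against the installed bgzip version.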
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
- README.md
- docs/library/annotation-sources.md
- docs/library/usage.md
- hvantk/commands/convert_bgz_cli.py
- hvantk/commands/make_matrix_cli.py
- hvantk/commands/make_table_cli.py
- hvantk/data/file_utils.py
- hvantk/hvantk.py
- hvantk/tables/expression_atlas.py
- hvantk/tables/genome_builders.py
- hvantk/tables/matrix_builders.py
- hvantk/tables/table_builders.py
- hvantk/tables/ucsc.py
- hvantk/tests/hgc/test_converters.py
- hvantk/tests/test_cli_mkmatrix.py
- hvantk/tests/test_file_utils.py
- pyproject.toml
…es and emphasize BGZF for parallel import
Actionable comments posted: 1
♻️ Duplicate comments (1)
hvantk/data/file_utils.py (1)
441-471: ⚠️ Potential issue | 🟠 Major
Use atomic writes to avoid reusing corrupted `.bgz` artifacts.
Line 442 can reuse an existing output based on header detection only. Since conversion writes directly to `output_path`, an interrupted/failed run (especially non-bgzip backends) can leave a partial file that may be incorrectly reused later.
🛠️ Suggested hardening (temp file + atomic replace)

     def convert_gz_to_bgz(
    @@
    -    if backend == "bgzip":
    -        logger.info(
    -            "Converting '%s' to BGZF using system bgzip with %d threads...",
    -            input_path,
    -            threads,
    -        )
    -        _convert_with_bgzip(input_path, output_path, threads)
    -    elif backend == "pysam":
    -        logger.info(
    -            "Converting '%s' to BGZF using pysam (system bgzip not found)...",
    -            input_path,
    -        )
    -        _convert_with_pysam(input_path, output_path)
    -    else:
    -        logger.info(
    -            "Converting '%s' to BGZF using pure Python (pysam and bgzip not available)...",
    -            input_path,
    -        )
    -        _convert_with_python(input_path, output_path)
    +    tmp_output = f"{output_path}.tmp"
    +    try:
    +        if os.path.exists(tmp_output):
    +            os.remove(tmp_output)
    +
    +        if backend == "bgzip":
    +            logger.info(
    +                "Converting '%s' to BGZF using system bgzip with %d threads...",
    +                input_path,
    +                threads,
    +            )
    +            _convert_with_bgzip(input_path, tmp_output, threads)
    +        elif backend == "pysam":
    +            logger.info(
    +                "Converting '%s' to BGZF using pysam (system bgzip not found)...",
    +                input_path,
    +            )
    +            _convert_with_pysam(input_path, tmp_output)
    +        else:
    +            logger.info(
    +                "Converting '%s' to BGZF using pure Python (pysam and bgzip not available)...",
    +                input_path,
    +            )
    +            _convert_with_python(input_path, tmp_output)
    +
    +        os.replace(tmp_output, output_path)
    +    finally:
    +        if os.path.exists(tmp_output):
    +            os.remove(tmp_output)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/data/file_utils.py` around lines 441 - 471, The current early-return that reuses output_path when detect_compression(output_path) == "bgzf" is unsafe because conversions write directly to output_path and partial/corrupt files can be reused; change the conversion flow (around the call sites of _convert_with_bgzip, _convert_with_pysam, _convert_with_python in this block) to write to a temporary file (e.g., output_path + a unique suffix or use tempfile.NamedTemporaryFile) and only atomically replace the final output_path with os.replace(temp_path, output_path) after successful conversion and a post-check (e.g., detect_compression(temp_path) == "bgzf"), and ensure any temp file is removed on failure or exception so interrupted runs cannot leave partial .bgz that will be reused by the initial detect_compression check.
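The temp-file-plus-atomic-replace idea can also be expressed as a backend-agnostic wrapper. This is a sketch under assumptions, not the suggested patch itself: `convert_atomically` is a hypothetical helper, and it uses `tempfile.mkstemp` in the destination directory (a unique name, same filesystem) instead of the fixed `.tmp` suffix in the diff above.

```python
import os
import tempfile


def convert_atomically(convert, input_path: str, output_path: str) -> str:
    """Run convert(input_path, tmp_path) and publish the result with os.replace."""
    out_dir = os.path.dirname(os.path.abspath(output_path))
    # Same directory as the target, so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=out_dir)
    os.close(fd)
    try:
        convert(input_path, tmp_path)
        os.replace(tmp_path, output_path)  # atomic on POSIX and Windows
    finally:
        # After a successful replace the temp path no longer exists;
        # after a failure this removes the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return output_path
```

One trade-off to be aware of: `mkstemp` creates the file with owner-only permissions, so a caller that needs group-readable output would have to adjust the mode after the replace.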
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hvantk/data/file_utils.py`:
- Around line 395-399: The convert_gz_to_bgz function should validate the
threads parameter early: check that threads is an int > 0 at the top of
convert_gz_to_bgz and raise a clear ValueError (e.g. "threads must be a positive
integer") if the check fails so callers outside the CLI get immediate,
descriptive feedback; update any docstring or parameter validation near the
function signature to reflect this contract.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e3a00c4d-05d4-4b39-9c3b-64e92ae3434d
📒 Files selected for processing (6)
- docs/library/annotation-sources.md
- docs/library/usage.md
- hvantk/commands/convert_bgz_cli.py
- hvantk/commands/make_matrix_cli.py
- hvantk/data/file_utils.py
- hvantk/hvantk.py
🚧 Files skipped from review as they are similar to previous changes (1)
- hvantk/hvantk.py
    def convert_gz_to_bgz(
        input_path: str,
        output_path: str | None = None,
        threads: int = 4,
    ) -> str:
Validate threads in the shared conversion API.
convert_gz_to_bgz is also callable outside convert-bgz CLI. A non-positive threads value should fail early with a clear error rather than deferring to backend-specific failures.
✅ Small guardrail

     def convert_gz_to_bgz(
         input_path: str,
         output_path: str | None = None,
         threads: int = 4,
     ) -> str:
    @@
    +    if threads < 1:
    +        raise ValueError("threads must be a positive integer")
    +
         if not os.path.exists(input_path):
             raise FileNotFoundError(f"File not found: {input_path}")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    -def convert_gz_to_bgz(
    -    input_path: str,
    -    output_path: str | None = None,
    -    threads: int = 4,
    -) -> str:
    +def convert_gz_to_bgz(
    +    input_path: str,
    +    output_path: str | None = None,
    +    threads: int = 4,
    +) -> str:
    +    if threads < 1:
    +        raise ValueError("threads must be a positive integer")
    +    if not os.path.exists(input_path):
    +        raise FileNotFoundError(f"File not found: {input_path}")
This pull request introduces support for automatic and standalone conversion of standard gzip (`.gz`) files to block gzip (`.bgz`) format, which is required for Hail's parallel import. It adds a new CLI command for conversion and updates relevant commands and documentation to guide users on handling file formats. Additionally, it improves CLI usability with new options and ensures that table and matrix creation commands output Hail object descriptions for easier inspection.

File Format Conversion Support
- Added a new CLI command `convert-bgz` (`hvantk/commands/convert_bgz_cli.py`) to convert `.gz` files to `.bgz`, with options for output path and thread count.
- Updated the `mktable dbnsfp`, `mkmatrix ucsc`, and `mkmatrix expression-atlas` commands to support an `--auto-convert-bgz` flag for automatic conversion of `.gz` files to `.bgz` before import.
- Updated `README.md`, `docs/library/usage.md`, and `docs/library/annotation-sources.md` with clear instructions and tips for file format conversion, including troubleshooting advice for gzip vs BGZF issues.

CLI Usability Improvements
- Added shorthand CLI options (`-e`, `-m`, `-o`, `-g`, etc.) for more convenient usage in matrix and table creation commands.
- … `describe()` output is displayed in the CLI for immediate inspection.

Documentation and Example Updates
- … `convert-bgz` command and the `--auto-convert-bgz` flag in relevant sections of documentation and CLI help texts.

Minor Improvements
- … `README.md` for improved compatibility.
Summary by CodeRabbit
New Features
Documentation
Tests