Conversation
Note: Other AI code review bot(s) detected
CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
A new CPTAC data processing module was added to convert expression and metadata pandas DataFrames into Hail MatrixTables and Tables, combine them, and save them with validation and logging. Comprehensive pytest tests and a minor trailing-newline change were also added.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Pandas as Pandas DataFrames
participant CPTAC as hvantk.htables.cptac
participant Hail as Hail
User->>CPTAC: convert_cptac_expression_to_matrix_table(expression_df)
CPTAC->>Pandas: validate columns, reshape to coords
CPTAC->>Hail: create MatrixTable (rows=genes, cols=samples)
User->>CPTAC: convert_cptac_metadata_to_table(metadata_df)
CPTAC->>Pandas: validate sample ID, cast types
CPTAC->>Hail: create Table keyed by sample ID
User->>CPTAC: create_cptac_matrix_table(expression_df, metadata_df)
CPTAC->>CPTAC: call expression & metadata converters
CPTAC->>Hail: annotate MatrixTable.cols with metadata
User->>CPTAC: save_cptac_matrix_table(mt, path)
CPTAC->>Hail: write MatrixTable to disk (optional overwrite)
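The "validate columns, reshape to coords" step in the diagram can be sketched with pandas alone. This is a minimal illustration, not the module's code: it assumes a wide layout with one gene-ID column and one column per sample, and the function name `expression_to_long` and the column names `gene_id`, `sample_id`, and `expression` are hypothetical.

```python
import pandas as pd


def expression_to_long(df, gene_id_col="gene_id"):
    """Validate a wide expression frame and reshape it to (gene, sample, value) rows.

    Hypothetical helper: the layout and names are assumptions for illustration.
    """
    if gene_id_col not in df.columns:
        raise ValueError(f"Missing gene ID column: {gene_id_col!r}")
    # Every column other than the gene ID is treated as a sample.
    sample_cols = [c for c in df.columns if c != gene_id_col]
    if not sample_cols:
        raise ValueError("No sample columns found")
    return df.melt(
        id_vars=[gene_id_col],
        value_vars=sample_cols,
        var_name="sample_id",
        value_name="expression",
    )


wide_df = pd.DataFrame(
    {"gene_id": ["G1", "G2"], "S1": [1.0, 2.0], "S2": [3.0, 4.0]}
)
long_df = expression_to_long(wide_df)
```

Each row of `long_df` is one (gene, sample) coordinate with its value, which is the shape a row-by-column matrix structure would be built from.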
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
Actionable comments posted: 1
🧹 Nitpick comments (3)
hvantk/htables/cptac.py (1)
50-59: Inefficient construction of gene_name_dict
Iterating over every row with df.iterrows() is O(n) with a high per-row cost. The dictionary only needs unique gene IDs:
- gene_name_dict = {}
- if gene_name_col and gene_name_col in df.columns:
-     for _, row in df.iterrows():
-         gene_id = row[gene_id_col]
-         gene_name = row[gene_name_col]
-         gene_name_dict[gene_id] = gene_name
+ gene_name_dict = (
+     df[[gene_id_col, gene_name_col]]
+     .drop_duplicates(subset=[gene_id_col])
+     .set_index(gene_id_col)[gene_name_col]
+     .to_dict()
+ ) if gene_name_col and gene_name_col in df.columns else {}
Reduces memory and CPU on large data sets.
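One detail to check before adopting the vectorised form: the original loop keeps the last name seen for a repeated gene ID, while `drop_duplicates` keeps the first by default. The sketch below compares the two on a toy frame and passes `keep="last"` to match the loop. The function names and sample data are illustrative assumptions, not code from the PR.

```python
import pandas as pd


def gene_names_loop(df, gene_id_col, gene_name_col):
    """Row-by-row version: later rows overwrite earlier ones."""
    out = {}
    for _, row in df.iterrows():
        out[row[gene_id_col]] = row[gene_name_col]
    return out


def gene_names_vectorised(df, gene_id_col, gene_name_col):
    """Vectorised version; keep="last" mirrors the loop's overwrite behaviour."""
    pairs = df[[gene_id_col, gene_name_col]].drop_duplicates(
        subset=[gene_id_col], keep="last"
    )
    return pairs.set_index(gene_id_col)[gene_name_col].to_dict()


# Toy data with a repeated gene ID that maps to two different names.
toy_df = pd.DataFrame(
    {
        "gene_id": ["G1", "G2", "G1"],
        "gene_name": ["TP53", "EGFR", "TP53-alt"],
    }
)
loop_result = gene_names_loop(toy_df, "gene_id", "gene_name")
vectorised_result = gene_names_vectorised(toy_df, "gene_id", "gene_name")
```

If each gene ID always maps to a single name, the `keep` argument makes no difference and either form gives the same dictionary.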
hvantk/tests/test_cptac.py (2)
8-15: Remove unused imports
tempfile, os, and save_cptac_matrix_table are never referenced; Ruff flags them (F401).
Delete to keep the test module clean.
-import tempfile
-import os
-...
-    save_cptac_matrix_table
+    # save_cptac_matrix_table  # re-add when write-path tests are added
17-19: Avoid global hl.init() in test module
Initialising Hail at import time can clash with other test files and slows collection.
Move into a session-level fixture:
@pytest.fixture(scope="session", autouse=True)
def hail_backend():
    hl.init()
    yield
    hl.stop()
Ensures a single backend per pytest run.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
hvantk/htables/cptac.py (1 hunks)
hvantk/tests/test_cptac.py (1 hunks)
pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hvantk/tests/test_cptac.py (1)
hvantk/htables/cptac.py (4)
convert_cptac_expression_to_matrix_table (16-83)
convert_cptac_metadata_to_table (85-129)
create_cptac_matrix_table (131-197)
save_cptac_matrix_table (199-222)
🪛 Ruff (0.12.2)
hvantk/tests/test_cptac.py
8-8: tempfile imported but unused
Remove unused import: tempfile
(F401)
9-9: os imported but unused
Remove unused import: os
(F401)
14-14: hvantk.htables.cptac.save_cptac_matrix_table imported but unused
Remove unused import: hvantk.htables.cptac.save_cptac_matrix_table
(F401)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
- GitHub Check: build (3.12)
- GitHub Check: build
- GitHub Check: build-linux
🔇 Additional comments (2)
pyproject.toml (1)
38-39: No action needed
A solitary trailing newline was added; harmless and consistent with POSIX text-file conventions.
hvantk/tests/test_cptac.py (1)
140-158: Test expectation conflicts with documented behaviour
create_cptac_matrix_table should raise on sample-ID mismatch (per docstring).
This test expects the call to succeed rather than raise, cementing the bug flagged in the module.
Adjust the test once the function is fixed:
-with pytest.warns(UserWarning):
-    create_cptac_matrix_table(...)
+with pytest.raises(ValueError):
+    create_cptac_matrix_table(...)
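The raise-on-mismatch behaviour the comment asks for could look roughly like the sketch below. It is an assumption-laden illustration using only the standard library: `check_sample_ids` is a hypothetical helper, and the exact contract in the module's docstring is not shown in this conversation.

```python
def check_sample_ids(expression_samples, metadata_samples):
    """Raise ValueError when the two sample-ID collections differ.

    Hypothetical helper illustrating a strict check; not the PR's code.
    """
    expr, meta = set(expression_samples), set(metadata_samples)
    if expr != meta:
        missing_from_metadata = sorted(expr - meta)
        missing_from_expression = sorted(meta - expr)
        raise ValueError(
            "Sample ID mismatch: "
            f"missing from metadata={missing_from_metadata}, "
            f"missing from expression={missing_from_expression}"
        )


# A mismatch is reported as an exception rather than a warning.
try:
    check_sample_ids(["S1", "S2"], ["S1", "S3"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

With a check like this, the test would use `pytest.raises(ValueError)` as the suggestion proposes, since a warning-only path would let mismatched metadata through unnoticed.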
…C data processing
…r mismatch handling
Actionable comments posted: 1
🧹 Nitpick comments (3)
hvantk/htables/cptac.py (1)
55-58: Inefficient per-row loop to build gene_name_dict
Iterating with df.iterrows() scales poorly for large CPTAC datasets.
A vectorised one-liner is clearer and ~10-20× faster:
if gene_name_col and gene_name_col in df.columns:
    gene_name_dict = dict(df[[gene_id_col, gene_name_col]].drop_duplicates().values)
hvantk/tests/test_cptac.py (2)
8-15: Remove unused imports
tempfile, os, and save_cptac_matrix_table are never used; Ruff flags them (F401).
-import tempfile
-import os
-from hvantk.htables.cptac import (
-    convert_cptac_expression_to_matrix_table,
-    convert_cptac_metadata_to_table,
-    create_cptac_matrix_table,
-    save_cptac_matrix_table
-)
+from hvantk.htables.cptac import (
+    convert_cptac_expression_to_matrix_table,
+    convert_cptac_metadata_to_table,
+    create_cptac_matrix_table,
+)
17-18: Avoid side-effects at import time
Calling hl.init() at module import can interfere with test discovery and parallel execution.
Prefer a session-scoped fixture:
@pytest.fixture(scope="session", autouse=True)
def _hail():
    hl.init()
    yield
    hl.stop()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
hvantk/htables/cptac.py (1 hunks)
hvantk/tests/test_cptac.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
hvantk/tests/test_cptac.py
8-8: tempfile imported but unused
Remove unused import: tempfile
(F401)
9-9: os imported but unused
Remove unused import: os
(F401)
14-14: hvantk.htables.cptac.save_cptac_matrix_table imported but unused
Remove unused import: hvantk.htables.cptac.save_cptac_matrix_table
(F401)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: build
- GitHub Check: build (3.12)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
- GitHub Check: build-linux
Pull Request Overview
This PR introduces a new CPTAC data processing module that provides functionality for converting CPTAC gene expression and metadata into Hail MatrixTables and Tables for downstream analysis. The implementation includes comprehensive error handling and validation.
Key changes include:
- Complete CPTAC data processing module with 5 main functions for expression and metadata conversion
- Comprehensive unit tests covering all functions with error handling validation
- Minor formatting fix adding an empty line to pyproject.toml
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| hvantk/htables/cptac.py | Implements core CPTAC data processing functionality with expression-to-MatrixTable and metadata-to-Table conversion functions |
| hvantk/tests/test_cptac.py | Provides comprehensive test coverage for all CPTAC module functions including error handling scenarios |
| pyproject.toml | Adds missing trailing newline for proper formatting |
User description
Summary
This pull request implements the CPTAC data processing module containing functions for converting expression and metadata outputs. Additionally, unit tests have been added to ensure the correctness of the module.
Code Changes
CPTAC data processing module with conversion functions for:
Checklist
PR Type
Enhancement
Description
Add CPTAC data processing module with conversion functions
Implement expression data to MatrixTable conversion
Add metadata processing and table creation utilities
Include comprehensive unit tests for all functions
Diagram Walkthrough
File Walkthrough
cptac.py: CPTAC data processing module implementation (hvantk/htables/cptac.py)
test_cptac.py: Unit tests for CPTAC module (hvantk/tests/test_cptac.py)
pyproject.toml: Minor formatting fix (pyproject.toml)