Skip to content

Add CPTAC data processing module with unit tests#22

Merged
ypriverol merged 6 commits intomainfrom
dev
Aug 11, 2025
Merged

Add CPTAC data processing module with unit tests#22
ypriverol merged 6 commits intomainfrom
dev

Conversation

@enriquea
Copy link
Copy Markdown
Collaborator

@enriquea enriquea commented Aug 8, 2025

User description

Summary

This pull request implements the CPTAC data processing module containing functions for converting expression and metadata outputs. Additionally, unit tests have been added to ensure the correctness of the module.

Code Changes

  • Introduced CPTAC data processing module with conversion functions for:
    • Expression data processing
    • Metadata processing
  • Added corresponding unit tests to validate the processing methods.

Checklist

  • Code builds correctly
  • Tests have been written and pass successfully
  • Documentation updated if necessary

PR Type

Enhancement


Description

  • Add CPTAC data processing module with conversion functions

  • Implement expression data to MatrixTable conversion

  • Add metadata processing and table creation utilities

  • Include comprehensive unit tests for all functions


Diagram Walkthrough

flowchart LR
  A["Expression DataFrame"] --> B["convert_cptac_expression_to_matrix_table"]
  C["Metadata DataFrame"] --> D["convert_cptac_metadata_to_table"]
  B --> E["MatrixTable"]
  D --> F["Metadata Table"]
  E --> G["create_cptac_matrix_table"]
  F --> G
  G --> H["Annotated MatrixTable"]
  H --> I["save_cptac_matrix_table"]
Loading

File Walkthrough

Relevant files
Enhancement
cptac.py
CPTAC data processing module implementation                           

hvantk/htables/cptac.py

  • Add complete CPTAC data processing module with 5 main functions
  • Implement expression DataFrame to Hail MatrixTable conversion
  • Add metadata processing with type casting support
  • Include comprehensive error handling and logging
+222/-0 
Tests
test_cptac.py
Unit tests for CPTAC module                                                           

hvantk/tests/test_cptac.py

  • Add comprehensive unit tests for all CPTAC functions
  • Include test fixtures for sample data generation
  • Test error handling for missing columns and mismatched samples
  • Validate MatrixTable and Table creation with proper types
+157/-0 
Formatting
pyproject.toml
Minor formatting fix                                                                         

pyproject.toml

  • Add empty line at end of file
+1/-0     

Summary by CodeRabbit

  • New Features

    • Added support for converting CPTAC gene expression and sample metadata into Hail MatrixTables/Tables, integrating metadata with validation and annotation, and saving processed CPTAC data for downstream analysis.
  • Tests

    • Added comprehensive tests covering conversion, annotation, type checks, and error handling (including missing columns and sample ID mismatches).
  • Chores

    • Minor formatting newline added to project manifest.

@enriquea enriquea requested a review from ypriverol August 8, 2025 14:37
@enriquea enriquea self-assigned this Aug 8, 2025
@enriquea enriquea added the enhancement New feature or request label Aug 8, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Aug 8, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

A new CPTAC data processing module was added to convert expression and metadata pandas DataFrames into Hail MatrixTables and Tables, combine and save them with validation and logging. Comprehensive pytest tests and a minor trailing newline in pyproject.toml were also added.

Changes

Cohort / File(s) Change Summary
CPTAC Data Processing Module
hvantk/htables/cptac.py
New module adding: convert_cptac_expression_to_matrix_table, convert_cptac_metadata_to_table, create_cptac_matrix_table, and save_cptac_matrix_table. Includes input validation, error reporting, logging, Hail conversions, and sample ID consistency checks.
CPTAC Data Processing Tests
hvantk/tests/test_cptac.py
New pytest module with fixtures and tests covering expression->MatrixTable conversion, metadata->Table conversion, combined MatrixTable creation, error handling for missing columns, and sample ID mismatch scenarios.
Project Configuration Formatting
pyproject.toml
Added a trailing newline at the end of the [tool.poetry.scripts] section; no functional changes.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Pandas as Pandas DataFrames
    participant CPTAC as hvantk.htables.cptac
    participant Hail as Hail

    User->>CPTAC: convert_cptac_expression_to_matrix_table(expression_df)
    CPTAC->>Pandas: validate columns, reshape to coords
    CPTAC->>Hail: create MatrixTable (rows=genes, cols=samples)

    User->>CPTAC: convert_cptac_metadata_to_table(metadata_df)
    CPTAC->>Pandas: validate sample ID, cast types
    CPTAC->>Hail: create Table keyed by sample ID

    User->>CPTAC: create_cptac_matrix_table(expression_df, metadata_df)
    CPTAC->>CPTAC: call expression & metadata converters
    CPTAC->>Hail: annotate MatrixTable.cols with metadata

    User->>CPTAC: save_cptac_matrix_table(mt, path)
    CPTAC->>Hail: write MatrixTable to disk (optional overwrite)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Suggested reviewers

  • ypriverol

Poem

A rabbit munched on rows and IDs,
Mapping genes and samples with speed.
Hail met pandas in a tidy ballet,
Tests hopped in to show the way.
Newlines and logs — data's now ready! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dde5286 and f2b7f79.

📒 Files selected for processing (1)
  • hvantk/htables/cptac.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hvantk/htables/cptac.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
  • GitHub Check: build
  • GitHub Check: build-linux
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@qodo-code-review
Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The conversion from pandas to Hail via hl.Table.from_pandas without explicit schema/casting may infer incorrect types (e.g., Expression as int32 or string) and column name spaces like 'Gene Name' can cause unexpected field handling. Validate dtype casting for entry field and sanitize/standardize column names before creating Hail structures.

coord_df = df[[gene_id_col, sample_id_col, expression_col]].copy()

# Convert to Hail Table
coord_ht = hl.Table.from_pandas(coord_df)

# Convert to MatrixTable using coordinate representation
mt = coord_ht.to_matrix_table(
    row_key=[gene_id_col],
    col_key=[sample_id_col],
    row_fields=[],
    col_fields=[]
)

# If we have gene names, add them as row annotations
if gene_name_dict:
    # Create a literal dictionary in Hail
    gene_name_dict_expr = hl.literal(gene_name_dict)
    # Use the dictionary to annotate rows with gene names
    mt = mt.annotate_rows(gene_name=gene_name_dict_expr.get(mt[gene_id_col]))
Inefficient Collection

Collecting sample IDs with mt[sample_id_col].collect() and metadata_ht[sample_id_col].collect() can be expensive at scale. Consider using keys(), distinct, or semi-joins to compare sets, or count mismatches via anti-joins without full collection.

# Check for sample ID mismatches - get the sample IDs from the col_value
# Instead of mt.s, we need to use mt.col_value which gives us the value in the col_key
expr_samples = set(mt[sample_id_col].collect())
meta_samples = set(metadata_ht[sample_id_col].collect())

if expr_samples != meta_samples:
    missing_in_meta = expr_samples - meta_samples
    missing_in_expr = meta_samples - expr_samples
    if missing_in_meta:
        logger.warning(f"Samples in expression data but not in metadata: {missing_in_meta}")
    if missing_in_expr:
        logger.warning(f"Samples in metadata but not in expression data: {missing_in_expr}")
Global Hail Init

hl.init() is called at import time for the test module, which can interfere with test isolation and configuration. Prefer a session-scoped fixture to initialize and stop Hail per test session.

# Initialize Hail
hl.init()

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review bot commented Aug 8, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix invalid key collection
Suggestion Impact:The commit changed sample ID collection from accessing a field directly to using mt.col_key and updated the annotate_cols join to use mt.col_key. It also altered the mismatch handling, aggregating messages and raising an error instead of logging warnings.

code diff:

+    # Check for sample ID mismatches - get the sample IDs from the MatrixTable columns
+    # Use mt.col_key to access the column keys (sample IDs) properly
+    expr_samples = set(mt.col_key.collect())
     meta_samples = set(metadata_ht[sample_id_col].collect())
 
     if expr_samples != meta_samples:
         missing_in_meta = expr_samples - meta_samples
         missing_in_expr = meta_samples - expr_samples
+        error_messages = []
         if missing_in_meta:
-            logger.warning(f"Samples in expression data but not in metadata: {missing_in_meta}")
+            error_messages.append(f"Samples in expression data but not in metadata: {missing_in_meta}")
         if missing_in_expr:
-            logger.warning(f"Samples in metadata but not in expression data: {missing_in_expr}")
-    
+            error_messages.append(f"Samples in metadata but not in expression data: {missing_in_expr}")
+        raise ValueError("Sample ID mismatches found. " + "; ".join(error_messages))
+
     # Annotate MatrixTable with metadata
     # Use the sample_id_col to join with metadata
-    mt = mt.annotate_cols(**metadata_ht[mt[sample_id_col]])
+    mt = mt.annotate_cols(**metadata_ht[mt.col_key])
 

Accessing column and row values directly on a MatrixTable/Table and calling
collect() is invalid and will error; also it forces full scans. Use
mt.col_key.collect() and metadata_ht.key.collect() (or
.key_by(...).key.collect()) to obtain keys safely, or avoid collecting by
performing a keyed join and checking for missingness. This fixes runtime errors
and reduces unnecessary actions.

hvantk/htables/cptac.py [181-190]

-# Check for sample ID mismatches - get the sample IDs from the col_value
-# Instead of mt.s, we need to use mt.col_value which gives us the value in the col_key
-expr_samples = set(mt[sample_id_col].collect())
-meta_samples = set(metadata_ht[sample_id_col].collect())
+# Compute sample ID sets from keys
+expr_samples = set(hl.eval(hl.agg.collect_as_set(mt.col_key[sample_id_col])))
+meta_samples = set(hl.eval(hl.agg.collect_as_set(metadata_ht.key[sample_id_col])))
 
 if expr_samples != meta_samples:
     missing_in_meta = expr_samples - meta_samples
     missing_in_expr = meta_samples - expr_samples
     if missing_in_meta:
-        logger.warning(f"Samples in expression data but not in metadata: {missing_in_meta}")
+        logger.warning(f"Samples in expression data but not in metadata: {sorted(missing_in_meta)}")
     if missing_in_expr:
-        logger.warning(f"Samples in metadata but not in expression data: {missing_in_expr}")
+        logger.warning(f"Samples in metadata but not in expression data: {sorted(missing_in_expr)}")

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies that the existing code uses an invalid and inefficient method to collect key values, which would cause a runtime error, and proposes a valid and more performant fix.

High
General
Avoid costly full counts

Calling count_rows() and count_cols() triggers full scans and can be very
expensive on large datasets. Avoid eager actions in logging and instead log
schema or keys, or guard counts behind a debug flag. This prevents unexpected
performance regressions.

hvantk/htables/cptac.py [82-83]

-logger.info(f"Created MatrixTable with {mt.count_rows()} genes and {mt.count_cols()} samples")
+logger.info(
+    "Created MatrixTable with row key %s and col key %s",
+    list(mt.row_key), list(mt.col_key)
+)
 return mt

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that count_rows() and count_cols() are expensive operations in Hail and provides a more performant logging alternative, which is a good practice.

Medium
  • Update

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
hvantk/htables/cptac.py (1)

50-59: Inefficient construction of gene_name_dict

Iterating over every row scales O(n). The dictionary only needs unique gene IDs:

-    gene_name_dict = {}
-    if gene_name_col and gene_name_col in df.columns:
-        for _, row in df.iterrows():
-            gene_id = row[gene_id_col]
-            gene_name = row[gene_name_col]
-            gene_name_dict[gene_id] = gene_name
+    gene_name_dict = (
+        df[[gene_id_col, gene_name_col]]
+        .drop_duplicates(subset=[gene_id_col])
+        .set_index(gene_id_col)[gene_name_col]
+        .to_dict()
+    ) if gene_name_col and gene_name_col in df.columns else {}

Reduces memory and CPU on large data sets.

hvantk/tests/test_cptac.py (2)

8-15: Remove unused imports

tempfile, os, and save_cptac_matrix_table are never referenced; Ruff flags them (F401).
Delete to keep the test module clean.

-import tempfile
-import os
-...
-    save_cptac_matrix_table
+    # save_cptac_matrix_table  # re-add when write-path tests are added

17-19: Avoid global hl.init() in test module

Initialising Hail at import-time can clash with other test files and slows collection.
Move into a session-level fixture:

@pytest.fixture(scope="session", autouse=True)
def hail_backend():
    hl.init()
    yield
    hl.stop()

Ensures single backend per pytest run.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f721e3 and 42059f2.

📒 Files selected for processing (3)
  • hvantk/htables/cptac.py (1 hunks)
  • hvantk/tests/test_cptac.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hvantk/tests/test_cptac.py (1)
hvantk/htables/cptac.py (4)
  • convert_cptac_expression_to_matrix_table (16-83)
  • convert_cptac_metadata_to_table (85-129)
  • create_cptac_matrix_table (131-197)
  • save_cptac_matrix_table (199-222)
🪛 Ruff (0.12.2)
hvantk/tests/test_cptac.py

8-8: tempfile imported but unused

Remove unused import: tempfile

(F401)


9-9: os imported but unused

Remove unused import: os

(F401)


14-14: hvantk.htables.cptac.save_cptac_matrix_table imported but unused

Remove unused import: hvantk.htables.cptac.save_cptac_matrix_table

(F401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.12)
  • GitHub Check: build
  • GitHub Check: build-linux
🔇 Additional comments (2)
pyproject.toml (1)

38-39: No action needed

A solitary trailing newline was added; harmless and consistent with POSIX text-file conventions.

hvantk/tests/test_cptac.py (1)

140-158: Test expectation conflicts with documented behaviour

create_cptac_matrix_table currently should raise on sample-ID mismatch (per docstring).
This test expects silent success, cementing the bug flagged in the module.

Adjust the test once the function is fixed:

-with pytest.warns(UserWarning):
-    create_cptac_matrix_table(...)
+with pytest.raises(ValueError):
+    create_cptac_matrix_table(...)

Comment thread hvantk/htables/cptac.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
hvantk/htables/cptac.py (1)

55-58: Inefficient per-row loop to build gene_name_dict

Iterating with df.iterrows() scales poorly for large CPTAC datasets.
A vectorised one-liner is clearer and ~10-20× faster:

if gene_name_col and gene_name_col in df.columns:
    gene_name_dict = dict(df[[gene_id_col, gene_name_col]].drop_duplicates().values)
hvantk/tests/test_cptac.py (2)

8-15: Remove unused imports

tempfile, os, and save_cptac_matrix_table are never used; Ruff flags them (F401).

-import tempfile
-import os
-from hvantk.htables.cptac import (
-    convert_cptac_expression_to_matrix_table,
-    convert_cptac_metadata_to_table,
-    create_cptac_matrix_table,
-    save_cptac_matrix_table
-)
+from hvantk.htables.cptac import (
+    convert_cptac_expression_to_matrix_table,
+    convert_cptac_metadata_to_table,
+    create_cptac_matrix_table,
+)

17-18: Avoid side-effects at import time

Calling hl.init() at module import can interfere with test discovery and parallel execution.
Prefer a session-scoped fixture:

@pytest.fixture(scope="session", autouse=True)
def _hail():
    hl.init()
    yield
    hl.stop()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8846f5e and dde5286.

📒 Files selected for processing (2)
  • hvantk/htables/cptac.py (1 hunks)
  • hvantk/tests/test_cptac.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
hvantk/tests/test_cptac.py

8-8: tempfile imported but unused

Remove unused import: tempfile

(F401)


9-9: os imported but unused

Remove unused import: os

(F401)


14-14: hvantk.htables.cptac.save_cptac_matrix_table imported but unused

Remove unused import: hvantk.htables.cptac.save_cptac_matrix_table

(F401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: build-linux

Comment thread hvantk/htables/cptac.py
@ypriverol ypriverol requested a review from Copilot August 11, 2025 10:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new CPTAC data processing module that provides functionality for converting CPTAC gene expression and metadata into Hail MatrixTables and Tables for downstream analysis. The implementation includes comprehensive error handling and validation.

Key changes include:

  • Complete CPTAC data processing module with 5 main functions for expression and metadata conversion
  • Comprehensive unit tests covering all functions with error handling validation
  • Minor formatting fix adding an empty line to pyproject.toml

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
hvantk/htables/cptac.py Implements core CPTAC data processing functionality with expression-to-MatrixTable and metadata-to-Table conversion functions
hvantk/tests/test_cptac.py Provides comprehensive test coverage for all CPTAC module functions including error handling scenarios
pyproject.toml Adds missing trailing newline for proper formatting

Comment thread hvantk/htables/cptac.py Outdated
Comment thread hvantk/htables/cptac.py Outdated
Comment thread hvantk/htables/cptac.py
@ypriverol ypriverol merged commit d2c51be into main Aug 11, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Review effort 3/5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants