Conversation

@mobiusklein mobiusklein commented Jan 30, 2025

User description

Changes to static data

Isobaric quantification

The isobaric quantification definitions and inference logic were duplicated across multiple files and made heavy use of magic strings. I consolidated them into a single module and converted the magic strings into enums for easier validation. If the classification logic of the different TMT kits were less messy, I would probably have moved these definitions outside of code entirely.

Added benefits:

  • Fewer magic strings
  • Single definition of quantification classification logic
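As a rough sketch of the consolidation, a single enum can own both the category names and the classification logic. The classify rules below are illustrative only, not the PR's exact logic:

```python
from enum import Enum, auto


class QuantificationCategory(Enum):
    """Single source of truth for quantification kinds (sketch)."""

    LFQ = auto()
    TMT = auto()
    ITRAQ = auto()

    @classmethod
    def classify(cls, labels: list) -> "QuantificationCategory":
        """Infer the category from SDRF label strings in one place."""
        if any("tmt" in s.lower() for s in labels):
            return cls.TMT
        if any("itraq" in s.lower() for s in labels):
            return cls.ITRAQ
        if all(s.lower() in ("label free sample", "lfq") for s in labels):
            return cls.LFQ
        raise ValueError(f"Cannot classify labels: {labels!r}")
```

With this shape, an unrecognized label set fails loudly at one choke point instead of falling through a scattered chain of string comparisons.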

Normalization methods

Again, the feature normalization and peptide normalization methods made heavy use of magic strings and had multiple functions with difficult-to-distinguish names. The group reduction logic in feature normalization was also repeated many times. This refactor replaces the magic strings with enums; each enum variant is bound to a function that performs the most granular transformation, and shared methods propagate those transformations up the hierarchy as defined previously.

Added benefits:

  • Fewer magic strings
  • Validation of CLI options
  • More straight-forward application code
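A minimal sketch of the variant-bound-to-function pattern, using plain lists and the statistics module rather than the project's pandas-based reducers; register_fn and the variant names here are assumptions for illustration, not the PR's exact API:

```python
from enum import Enum
from statistics import mean, median


class FeatureNormalizationMethod(Enum):
    """Each variant carries the reducer that does the granular transform."""

    MEDIAN = "median"
    MEAN = "mean"

    def register_fn(self, fn):
        # Bind the transformation function to this specific variant.
        self._fn = fn
        return fn

    def __call__(self, values):
        # Shared dispatch: callers never compare strings themselves.
        return self._fn(values)


@FeatureNormalizationMethod.MEDIAN.register_fn
def _median_center(values):
    m = median(values)
    return [v - m for v in values]


@FeatureNormalizationMethod.MEAN.register_fn
def _mean_center(values):
    m = mean(values)
    return [v - m for v in values]


# CLI options validate through the enum constructor instead of ad-hoc
# string comparison; an unknown name raises ValueError immediately.
method = FeatureNormalizationMethod("median")
```

The enum constructor doubles as the CLI validator, which is where the "Validation of CLI options" benefit comes from.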

Organism definitions

Organisms were defined as a dictionary of dictionaries with a certain schema, and organisms were again selected using magic strings. This change moves the organism definitions into a static data file that is loaded on import instead of defining the data in code, and uses a new type to enforce that schema. The type also provides a means of validating that an organism exists for a given identifier. This links with the re-implementation of the Proteomic Ruler.
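A minimal sketch of the idea; the field names and the load/get API below are illustrative, and the real model reads the packaged organisms.json at import time rather than a string:

```python
import json
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class OrganismDescription:
    """Schema-enforcing organism record with a lookup registry (sketch)."""

    registry: ClassVar[dict] = {}

    name: str
    histone_proteins: list  # field names here are assumptions

    @classmethod
    def load(cls, text: str) -> None:
        # In the real package this would parse organisms.json on import;
        # constructing the dataclass enforces the expected schema.
        for key, spec in json.loads(text).items():
            cls.registry[key.upper()] = cls(name=key, **spec)

    @classmethod
    def get(cls, identifier: str) -> Optional["OrganismDescription"]:
        # Case-insensitive lookup doubles as existence validation.
        return cls.registry.get(identifier.upper())
```

An unknown identifier then returns None (or can raise) at the lookup site, instead of a magic-string comparison silently missing.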

I attempted to streamline the logic in peptides2protein to remove indirectly shared local variables and to move the stateful incremental calculations into objects, uncluttering the main function. PeptideProteinMapper removes the use of nonlocal bindings and encapsulates the TPA calculation. ConcentrationWeightByProteomicRuler encapsulates the Proteomic Ruler implementation previously defined in calculate_weight_and_concentration, without repeating some intermediate vector math.

I/O streamlining

The peptide_normalization program appends to a CSV file on each pass through the inner loop, and only after the entire result has been written to CSV does it do a bulk conversion to Parquet. Additionally, because pandas does the CSV writing, it introduces extra state into the loop. This change introduces two new writer threads, one for writing incremental CSVs and one for writing incremental Parquets. These threads run in the background while DuckDB and pandas do their work. Since these operations are primarily I/O-bound, they should not hold the GIL for long.
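The writer threads could look roughly like this minimal sketch; the WriteTask name and its API are assumptions for illustration, not the actual write_queue.py code. The real tasks would call pandas' to_csv or a Parquet writer inside write_chunk:

```python
import queue
import threading


class WriteTask(threading.Thread):
    """Background writer fed through a bounded queue (sketch)."""

    _DONE = object()  # sentinel that tells the thread to stop

    def __init__(self, write_chunk):
        super().__init__(daemon=True)
        self.write_chunk = write_chunk  # e.g. lambda df: df.to_csv(...)
        self.queue = queue.Queue(maxsize=4)  # back-pressure on the producer
        self.start()

    def run(self):
        while True:
            chunk = self.queue.get()
            if chunk is self._DONE:
                break
            # The actual file I/O happens here; it releases the GIL
            # while the main thread keeps computing the next chunk.
            self.write_chunk(chunk)

    def put(self, chunk):
        self.queue.put(chunk)

    def close(self):
        self.queue.put(self._DONE)
        self.join()
```

The bounded queue keeps memory in check: if the writer falls behind, the producing loop blocks on put() instead of accumulating unwritten chunks.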


PR Type

Enhancement, Documentation


Description

  • Refactored normalization methods and static data handling.

    • Introduced enums for normalization and quantification methods.
    • Centralized organism metadata in a JSON file.
  • Enhanced I/O operations for CSV and Parquet file handling.

    • Added threaded write tasks for efficient file writing.
  • Improved peptide-to-protein mapping and proteomic ruler calculations.

    • Encapsulated logic in new classes for modularity.
  • Updated documentation to reflect new features and usage.


Changes walkthrough 📝

Relevant files

Enhancement (11 files)
  • features2peptides.py: Enhanced CLI options for feature-to-peptide conversion (+11/-3)
  • peptides2protein.py: Improved peptide-to-protein conversion with proteomic ruler (+9/-3)
  • ibaqpy_commons.py: Refactored common utilities and removed redundant code (+22/-82)
  • normalization_methods.py: Removed and replaced with centralized normalization logic (+0/-127)
  • peptide_normalization.py: Refactored peptide normalization with modular methods (+114/-156)
  • utils.py: Adjusted batch correction to handle covariates (+2/-1)
  • write_queue.py: Added threaded CSV and Parquet write tasks (+123/-0)
  • normalization.py: Introduced enums for feature and peptide normalization (+160/-0)
  • organism_metadata.py: Centralized organism metadata in a JSON-backed model (+30/-0)
  • quantification_type.py: Added enums and mappings for quantification categories (+179/-0)
  • organisms.json: Added JSON file for organism metadata (+526/-0)

Configuration changes (1 file)
  • MANIFEST.in: Included data files in package distribution (+1/-0)

Documentation (1 file)
  • README.md: Updated documentation for new features and usage (+29/-14)

Additional files (2 files)
  • peptides2protein.py (+202/-55)
  • __init__.py

    coderabbitai bot commented Jan 30, 2025

    Important

    Review skipped

    Auto reviews are disabled on base/target branches other than the default branch.


    @qodo-code-review

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The new code introduces potential error cases that need validation - for example the QuantificationCategory.classify() method can raise ValueError but the error handling is not consistent across the codebase.

        return clean_peptide
    
    
    def analyse_sdrf(sdrf_path: str) -> tuple[int, QuantificationCategory, list[str], IsobaricLabel | None]:
    Type Safety

    The IsobaricLabelSpec class implements Mapping, but some of its methods could return unexpected types; for example, __getitem__ can raise KeyError, which is not documented.

    class IsobaricLabelSpec(Mapping[str, int]):
        registry: ClassVar[dict[str, "IsobaricLabelSpec"]] = {}
    
        name: str
        channels: dict[str, int] = field(default_factory=dict)
    
        def __post_init__(self):
            self.registry[self.name] = self
    
        @property
        def id(self):
            try:
                return IsobaricLabel[self.name]
            except ValueError:
                return None
    
        def __getitem__(self, key: str) -> int:
            return self.channels[key]
    
        def __iter__(self) -> Iterator[str]:
            yield from self.channels
    
        def __len__(self) -> int:
            return len(self.channels)
    
        def __contains__(self, key) -> bool:
            return key in self.channels
    Code Complexity

    The PeptideProteinMapper class has complex logic for calculating protein group metrics that would benefit from additional validation and testing to ensure correctness.

    class PeptideProteinMapper:
        _peptide_protein_ratio: dict[str, float]
    
        unique_peptide_counts: dict[str, int]
        map_size: dict[str, int]
    
        protein_mass_map: dict[str, float]
    
        def __init__(
            self,
            unique_peptide_counts: dict[str, int] | None = None,
            map_size: dict[str, int] | None = None,
            protein_mass_map: dict[str, float] | None = None,
        ):
            self.unique_peptide_counts = unique_peptide_counts or {}
            self.map_size = map_size or {}
            self.protein_mass_map = protein_mass_map or {}
    
            self._peptide_protein_ratio = {}
    
        def peptide_protein_ratio(self, protein_group: str):
            if protein_group in self._peptide_protein_ratio:
                return self._peptide_protein_ratio[protein_group]
    
            proteins_list = protein_group.split(";")
    
            total = 0
            for prot in proteins_list:
                total += self.unique_peptide_counts[prot]
    
            if not proteins_list:
                val = self._peptide_protein_ratio[protein_group] = 0
            else:
                val = self._peptide_protein_ratio[protein_group] = total / len(proteins_list)
            return val
    
        def get_average_nr_peptides_unique_by_group(self, pdrow: Series) -> float | Series:
            """
            Calculate the average number of unique peptides per protein group.
    
            This function computes the average number of unique peptides for a given
            protein group by dividing the normalized intensity by the product of the
            group size and the average unique peptide count. If no proteins are found
            in the group or the sum of unique peptide counts is zero, it returns NaN.
    
            :param pdrow: Series containing the protein group data.
            :return: Series containing the average number of unique peptides per protein group.
            """
    
            average_peptides_per_protein = self.peptide_protein_ratio(pdrow.name[0])
    
            if average_peptides_per_protein > 0:
                return (
                    pdrow.NormIntensity
                    / self.map_size[pdrow.name]
                    / average_peptides_per_protein
                )
    
            # If there is no protein in the group, return np nan
            return np.nan  # type: ignore
    
        def protein_group_mass(self, protein_group: str):
            """
            Calculate the molecular weight of a protein group.
    
            :param group: Protein group.
            :return: Molecular weight of the protein group.
            """
            mw_list = [self.protein_mass_map[i] for i in protein_group.split(";")]
            return sum(mw_list)

    @qodo-code-review

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Possible issue
    Fix incorrect IQR normalization calculation

    The IQR normalization function incorrectly returns the mean of 75th and 25th
    percentiles instead of normalizing by IQR. It should divide the data by the IQR
    value.

    ibaqpy/model/normalization.py [115-117]

     @FeatureNormalizationMethod.IQR.register_replicate_fn
     def iqr_normalization(df, *args, **kwargs):
    -    return df.quantile([0.75, 0.25], interpolation="linear").mean()
    +    q75, q25 = df.quantile([0.75, 0.25], interpolation="linear")
    +    return df / (q75 - q25)
    Suggestion importance[1-10]: 9

    Why: The current implementation incorrectly returns the mean of quartiles instead of normalizing by IQR (q75-q25). This is a critical bug that would produce incorrect normalization results.

    Escape special regex characters

    The regex pattern in remove_contaminants_entrapments_decoys() should escape special
    regex characters in the contaminant strings to avoid pattern errors. Use re.escape()
    when building the regex pattern.

    ibaqpy/ibaq/peptide_normalization.py [118-119]

    -cregex = "|".join(contaminants)
    +cregex = "|".join(re.escape(c) for c in contaminants)
     return dataset[~dataset[protein_field].str.contains(cregex, regex=True)]
    Suggestion importance[1-10]: 8

    Why: The suggestion addresses a potential security and functionality issue by properly escaping special regex characters in contaminant strings, preventing regex pattern errors that could cause the filtering to fail or behave unexpectedly.

    Handle missing protein mass values

    The protein_group_mass() method should handle cases where a protein ID in the group
    is not found in the protein mass map, rather than letting it raise a KeyError.

    ibaqpy/ibaq/peptides2protein.py [319-320]

    -mw_list = [self.protein_mass_map[i] for i in protein_group.split(";")]
    -return sum(mw_list)
    +mw_list = [self.protein_mass_map.get(i, 0.0) for i in protein_group.split(";")]
    +return sum(mw_list) or 1.0  # Return 1.0 if sum is 0
    Suggestion importance[1-10]: 7

    Why: The suggestion improves code robustness by gracefully handling cases where protein IDs are not found in the mass map, preventing potential KeyError exceptions and providing a fallback value.

    General
    Add validation for TMT labels

    The classify method makes assumptions about TMT labeling schemes based on label
    count or specific labels, but doesn't handle invalid/malformed labels. Add
    validation to ensure labels match expected format.

    ibaqpy/model/quantification_type.py [31-41]

     elif any("tmt" in s.lower() for s in labels):
         label_category = cls.TMT
    +    if not all(l.startswith("TMT") for l in labels):
    +        raise ValueError(f"Invalid TMT labels found in {labels}")
         if (
             len(labels) > 11
             or "TMT134N" in labels
             or "TMT133C" in labels
             or "TMT133N" in labels
             or "TMT132C" in labels
             or "TMT132N" in labels
         ):
             label_scheme = IsobaricLabel.TMT16plex
    Suggestion importance[1-10]: 7

    Why: Adding validation for TMT label format would prevent potential errors from malformed input data and improve robustness of the classification logic.

    Add units to formula variables

    The protein copy number formula should include units for each variable to help users
    understand the calculation. Add units in square brackets after each variable.

    README.md [16]

    -protein copies per cell = protein MS-signal *  (avogadro / molecular mass) * (DNA mass / histone MS-signal)
    +protein copies per cell = protein MS-signal [intensity] * (avogadro [mol^-1] / molecular mass [g/mol]) * (DNA mass [g] / histone MS-signal [intensity])
    Suggestion importance[1-10]: 5

    Why: Adding units to the formula variables would improve clarity and help users understand the calculation better, though it's not critical for functionality. The suggestion is valid and would enhance documentation readability.


    @ypriverol ypriverol self-requested a review January 30, 2025 16:14
    @mobiusklein (Contributor, Author) commented:
    The unit tests are failing because the unit tests are actually incorrect.

    def test_feature_assembly():
        args = {
            "parquet": str(TESTS_DIR / "example/feature.parquet"),
            "sdrf": str(TESTS_DIR / "example/PXD017834-TMT.sdrf.tsv"),
            "min_aa": 7,
            "min_unique": 2,
            "remove_ids": None,
            "remove_decoy_contaminants": True,
            "remove_low_frequency_peptides": True,
            "output": str(TESTS_DIR / "example/PXD017834-peptides-norm.csv"),
            "skip_normalization": False,
            "nmethod": "median",
            "pnmethod": "max_min",
            "log2": True,
            "save_parquet": True,
        }
        print(args)
        peptide_normalization(**args)

    Note the entry "pnmethod": "max_min". That's not a valid peptide normalization; it's technically a feature normalization magic string, though not an advertised one. The only peptide normalizations that were supported before, "globalMedian" and "conditionMedian", are checked for like this:

    if not skip_normalization:
        if pnmethod == GLOBALMEDIAN:
            dataset_df.loc[:, NORM_INTENSITY] = (
                dataset_df[NORM_INTENSITY] / med_map[sample]
            )
        elif pnmethod == CONDITIONMEDIAN:
            con = dataset_df[CONDITION].unique()[0]
            dataset_df.loc[:, NORM_INTENSITY] = (
                dataset_df[NORM_INTENSITY] / med_map[con][sample]
            )

    So if a magic string match isn't found, it's ignored. Ergo this is like passing no peptide normalization method.

    Therefore I've changed the test to "none" as its peptide normalization method and made that preserve the previous behavior.
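The fix can be illustrated with a sketch of the new enum; the variant names are assumed from the old magic strings and are not necessarily the PR's exact spellings:

```python
from enum import Enum


class PeptideNormalizationMethod(Enum):
    """An explicit "none" variant replaces the silent fall-through (sketch)."""

    GlobalMedian = "globalMedian"
    ConditionMedian = "conditionMedian"
    NONE = "none"


# An invalid option like "max_min" is now rejected up front
# instead of silently applying no normalization:
try:
    PeptideNormalizationMethod("max_min")
except ValueError as err:
    print(err)
```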

    @ypriverol ypriverol merged commit 48a091c into bigbio:dev Jan 30, 2025
    8 of 10 checks passed