Conversation

@mobiusklein mobiusklein commented Jan 30, 2025

User description

Changes to static data

Isobaric quantification

The isobaric quantification definitions and inference logic were duplicated across multiple files and made heavy use of magic strings. I consolidated them into a single module and converted the magic strings into enums for easier validation. If the classification logic of the different TMT kits were less messy, I would probably have moved these definitions outside of code entirely.

Added benefits:

  • Fewer magic strings
  • Single definition of quantification classification logic
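As a rough sketch of the consolidation, a single enum can own both the category names and the classification logic. The classify rules below are illustrative only, not the PR's exact logic:

```python
from enum import Enum, auto


class QuantificationCategory(Enum):
    """Single source of truth for quantification kinds (sketch)."""

    LFQ = auto()
    TMT = auto()
    ITRAQ = auto()

    @classmethod
    def classify(cls, labels: list) -> "QuantificationCategory":
        """Infer the category from SDRF label strings in one place."""
        if any("tmt" in s.lower() for s in labels):
            return cls.TMT
        if any("itraq" in s.lower() for s in labels):
            return cls.ITRAQ
        if all(s.lower() in ("label free sample", "lfq") for s in labels):
            return cls.LFQ
        raise ValueError(f"Cannot classify labels: {labels!r}")
```

With this shape, an unrecognized label set fails loudly at one choke point instead of falling through a scattered chain of string comparisons.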

Normalization methods

Again, the feature normalization and peptide normalization methods made heavy use of magic strings and had multiple functions with difficult-to-distinguish names. The group reduction logic in feature normalization was also repeated many times. This refactor replaces the magic strings with enums; each enum variant is bound to a function that performs the most granular transformation, and shared methods propagate those transformations up the hierarchy as defined previously.

Added benefits:

  • Fewer magic strings
  • Validation of CLI options
  • More straight-forward application code
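A minimal sketch of the variant-bound-to-function pattern, using plain lists and the statistics module rather than the project's pandas-based reducers; register_fn and the variant names here are assumptions for illustration, not the PR's exact API:

```python
from enum import Enum
from statistics import mean, median


class FeatureNormalizationMethod(Enum):
    """Each variant carries the reducer that does the granular transform."""

    MEDIAN = "median"
    MEAN = "mean"

    def register_fn(self, fn):
        # Bind the transformation function to this specific variant.
        self._fn = fn
        return fn

    def __call__(self, values):
        # Shared dispatch: callers never compare strings themselves.
        return self._fn(values)


@FeatureNormalizationMethod.MEDIAN.register_fn
def _median_center(values):
    m = median(values)
    return [v - m for v in values]


@FeatureNormalizationMethod.MEAN.register_fn
def _mean_center(values):
    m = mean(values)
    return [v - m for v in values]


# CLI options validate through the enum constructor instead of ad-hoc
# string comparison; an unknown name raises ValueError immediately.
method = FeatureNormalizationMethod("median")
```

The enum constructor doubles as the CLI validator, which is where the "Validation of CLI options" benefit comes from.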

Organism definitions

Organisms were defined as a dictionary of dictionaries with a certain schema, and organisms were again selected using magic strings. This change moves the organism definitions into a static data file that is loaded on import instead of defining the data in code, and uses a new type to enforce that schema. The type also provides a means of validating that an organism exists for a given identifier. This links with the re-implementation of the Proteomic Ruler.
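A minimal sketch of the idea; the field names and the load/get API below are illustrative, and the real model reads the packaged organisms.json at import time rather than a string:

```python
import json
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class OrganismDescription:
    """Schema-enforcing organism record with a lookup registry (sketch)."""

    registry: ClassVar[dict] = {}

    name: str
    histone_proteins: list  # field names here are assumptions

    @classmethod
    def load(cls, text: str) -> None:
        # In the real package this would parse organisms.json on import;
        # constructing the dataclass enforces the expected schema.
        for key, spec in json.loads(text).items():
            cls.registry[key.upper()] = cls(name=key, **spec)

    @classmethod
    def get(cls, identifier: str) -> Optional["OrganismDescription"]:
        # Case-insensitive lookup doubles as existence validation.
        return cls.registry.get(identifier.upper())
```

An unknown identifier then returns None (or can raise) at the lookup site, instead of a magic-string comparison silently missing.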

I attempted to streamline the logic in peptides2protein to remove indirectly shared local variables and to move the stateful incremental calculations into objects, uncluttering the main function. PeptideProteinMapper removes the use of nonlocal bindings and encapsulates the TPA calculation. ConcentrationWeightByProteomicRuler encapsulates the Proteomic Ruler implementation previously defined in calculate_weight_and_concentration, without repeating some intermediate vector math.

I/O streamlining

The peptide_normalization program appends to a CSV file on each pass through the inner loop, and only after the entire result has been written to CSV does it do a bulk conversion to Parquet. Additionally, because pandas does the CSV writing, it introduces extra state into the loop. This change introduces two new writer threads, one for writing incremental CSVs and one for writing incremental Parquets. These threads run in the background while DuckDB and pandas do their work. Since these operations are primarily I/O-bound, they should not hold the GIL for long.
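The writer threads could look roughly like this minimal sketch; the WriteTask name and its API are assumptions for illustration, not the actual write_queue.py code. The real tasks would call pandas' to_csv or a Parquet writer inside write_chunk:

```python
import queue
import threading


class WriteTask(threading.Thread):
    """Background writer fed through a bounded queue (sketch)."""

    _DONE = object()  # sentinel that tells the thread to stop

    def __init__(self, write_chunk):
        super().__init__(daemon=True)
        self.write_chunk = write_chunk  # e.g. lambda df: df.to_csv(...)
        self.queue = queue.Queue(maxsize=4)  # back-pressure on the producer
        self.start()

    def run(self):
        while True:
            chunk = self.queue.get()
            if chunk is self._DONE:
                break
            # The actual file I/O happens here; it releases the GIL
            # while the main thread keeps computing the next chunk.
            self.write_chunk(chunk)

    def put(self, chunk):
        self.queue.put(chunk)

    def close(self):
        self.queue.put(self._DONE)
        self.join()
```

The bounded queue keeps memory in check: if the writer falls behind, the producing loop blocks on put() instead of accumulating unwritten chunks.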


PR Type

Enhancement, Documentation


Description

  • Refactored normalization methods and static data handling.

    • Introduced enums for normalization and quantification methods.
    • Centralized organism metadata in a JSON file.
  • Enhanced I/O operations for CSV and Parquet file handling.

    • Added threaded write tasks for efficient file writing.
  • Improved peptide-to-protein mapping and proteomic ruler calculations.

    • Encapsulated logic in new classes for modularity.
  • Updated documentation to reflect new features and usage.


Changes walkthrough 📝

Relevant files

Enhancement (11 files)
  • features2peptides.py: Enhanced CLI options for feature-to-peptide conversion (+11/-3)
  • peptides2protein.py: Improved peptide-to-protein conversion with proteomic ruler (+9/-3)
  • ibaqpy_commons.py: Refactored common utilities and removed redundant code (+22/-82)
  • normalization_methods.py: Removed and replaced with centralized normalization logic (+0/-127)
  • peptide_normalization.py: Refactored peptide normalization with modular methods (+114/-156)
  • utils.py: Adjusted batch correction to handle covariates (+2/-1)
  • write_queue.py: Added threaded CSV and Parquet write tasks (+123/-0)
  • normalization.py: Introduced enums for feature and peptide normalization (+160/-0)
  • organism_metadata.py: Centralized organism metadata in a JSON-backed model (+30/-0)
  • quantification_type.py: Added enums and mappings for quantification categories (+179/-0)
  • organisms.json: Added JSON file for organism metadata (+526/-0)

Configuration changes (1 file)
  • MANIFEST.in: Included data files in package distribution (+1/-0)

Documentation (1 file)
  • README.md: Updated documentation for new features and usage (+29/-14)

Additional files (2 files)
  • peptides2protein.py (+202/-55)
  • __init__.py

    coderabbitai bot commented Jan 30, 2025

    Important

    Review skipped

    Auto reviews are disabled on base/target branches other than the default branch.


    @qodo-code-review

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The new code introduces potential error cases that need validation - for example the QuantificationCategory.classify() method can raise ValueError but the error handling is not consistent across the codebase.

        return clean_peptide
    
    
    def analyse_sdrf(sdrf_path: str) -> tuple[int, QuantificationCategory, list[str], IsobaricLabel | None]:
    Type Safety

    The IsobaricLabelSpec class implements Mapping, but some of its methods could return unexpected types; for example, __getitem__ can raise KeyError, which is not documented.

    class IsobaricLabelSpec(Mapping[str, int]):
        registry: ClassVar[dict[str, "IsobaricLabelSpec"]] = {}
    
        name: str
        channels: dict[str, int] = field(default_factory=dict)
    
        def __post_init__(self):
            self.registry[self.name] = self
    
        @property
        def id(self):
            try:
                return IsobaricLabel[self.name]
            except ValueError:
                return None
    
        def __getitem__(self, key: str) -> int:
            return self.channels[key]
    
        def __iter__(self) -> Iterator[str]:
            yield from self.channels
    
        def __len__(self) -> int:
            return len(self.channels)
    
        def __contains__(self, key) -> bool:
            return key in self.channels
    Code Complexity

    The PeptideProteinMapper class has complex logic for calculating protein group metrics that would benefit from additional validation and testing to ensure correctness.

    class PeptideProteinMapper:
        _peptide_protein_ratio: dict[str, float]
    
        unique_peptide_counts: dict[str, int]
        map_size: dict[str, int]
    
        protein_mass_map: dict[str, float]
    
        def __init__(
            self,
            unique_peptide_counts: dict[str, int] | None = None,
            map_size: dict[str, int] | None = None,
            protein_mass_map: dict[str, float] | None = None,
        ):
            self.unique_peptide_counts = unique_peptide_counts or {}
            self.map_size = map_size or {}
            self.protein_mass_map = protein_mass_map or {}
    
            self._peptide_protein_ratio = {}
    
        def peptide_protein_ratio(self, protein_group: str):
            if protein_group in self._peptide_protein_ratio:
                return self._peptide_protein_ratio[protein_group]
    
            proteins_list = protein_group.split(";")
    
            total = 0
            for prot in proteins_list:
                total += self.unique_peptide_counts[prot]
    
            if not proteins_list:
                val = self._peptide_protein_ratio[protein_group] = 0
            else:
                val = self._peptide_protein_ratio[protein_group] = total / len(proteins_list)
            return val
    
        def get_average_nr_peptides_unique_by_group(self, pdrow: Series) -> float | Series:
            """
            Calculate the average number of unique peptides per protein group.
    
            This function computes the average number of unique peptides for a given
            protein group by dividing the normalized intensity by the product of the
            group size and the average unique peptide count. If no proteins are found
            in the group or the sum of unique peptide counts is zero, it returns NaN.
    
            :param pdrow: Series containing the protein group data.
            :return: Series containing the average number of unique peptides per protein group.
            """
    
            average_peptides_per_protein = self.peptide_protein_ratio(pdrow.name[0])
    
            if average_peptides_per_protein > 0:
                return (
                    pdrow.NormIntensity
                    / self.map_size[pdrow.name]
                    / average_peptides_per_protein
                )
    
            # If there is no protein in the group, return np nan
            return np.nan  # type: ignore
    
        def protein_group_mass(self, protein_group: str):
            """
            Calculate the molecular weight of a protein group.
    
            :param group: Protein group.
            :return: Molecular weight of the protein group.
            """
            mw_list = [self.protein_mass_map[i] for i in protein_group.split(";")]
            return sum(mw_list)

    @qodo-code-review

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Possible issue
    Fix incorrect IQR normalization calculation

    The IQR normalization function incorrectly returns the mean of 75th and 25th
    percentiles instead of normalizing by IQR. It should divide the data by the IQR
    value.

    ibaqpy/model/normalization.py [115-117]

     @FeatureNormalizationMethod.IQR.register_replicate_fn
     def iqr_normalization(df, *args, **kwargs):
    -    return df.quantile([0.75, 0.25], interpolation="linear").mean()
    +    q75, q25 = df.quantile([0.75, 0.25], interpolation="linear")
    +    return df / (q75 - q25)
    Suggestion importance[1-10]: 9

    Why: The current implementation incorrectly returns the mean of quartiles instead of normalizing by IQR (q75-q25). This is a critical bug that would produce incorrect normalization results.

    Escape special regex characters

    The regex pattern in remove_contaminants_entrapments_decoys() should escape special
    regex characters in the contaminant strings to avoid pattern errors. Use re.escape()
    when building the regex pattern.

    ibaqpy/ibaq/peptide_normalization.py [118-119]

    -cregex = "|".join(contaminants)
    +cregex = "|".join(re.escape(c) for c in contaminants)
     return dataset[~dataset[protein_field].str.contains(cregex, regex=True)]
    Suggestion importance[1-10]: 8

    Why: The suggestion addresses a potential security and functionality issue by properly escaping special regex characters in contaminant strings, preventing regex pattern errors that could cause the filtering to fail or behave unexpectedly.

    Handle missing protein mass values

    The protein_group_mass() method should handle cases where a protein ID in the group
    is not found in the protein mass map, rather than letting it raise a KeyError.

    ibaqpy/ibaq/peptides2protein.py [319-320]

    -mw_list = [self.protein_mass_map[i] for i in protein_group.split(";")]
    -return sum(mw_list)
    +mw_list = [self.protein_mass_map.get(i, 0.0) for i in protein_group.split(";")]
    +return sum(mw_list) or 1.0  # Return 1.0 if sum is 0
    Suggestion importance[1-10]: 7

    Why: The suggestion improves code robustness by gracefully handling cases where protein IDs are not found in the mass map, preventing potential KeyError exceptions and providing a fallback value.

    General
    Add validation for TMT labels

    The classify method makes assumptions about TMT labeling schemes based on label
    count or specific labels, but doesn't handle invalid/malformed labels. Add
    validation to ensure labels match expected format.

    ibaqpy/model/quantification_type.py [31-41]

     elif any("tmt" in s.lower() for s in labels):
         label_category = cls.TMT
    +    if not all(l.startswith("TMT") for l in labels):
    +        raise ValueError(f"Invalid TMT labels found in {labels}")
         if (
             len(labels) > 11
             or "TMT134N" in labels
             or "TMT133C" in labels
             or "TMT133N" in labels
             or "TMT132C" in labels
             or "TMT132N" in labels
         ):
             label_scheme = IsobaricLabel.TMT16plex
    Suggestion importance[1-10]: 7

    Why: Adding validation for TMT label format would prevent potential errors from malformed input data and improve robustness of the classification logic.

    Add units to formula variables

    The protein copy number formula should include units for each variable to help users
    understand the calculation. Add units in square brackets after each variable.

    README.md [16]

    -protein copies per cell = protein MS-signal *  (avogadro / molecular mass) * (DNA mass / histone MS-signal)
    +protein copies per cell = protein MS-signal [intensity] * (avogadro [mol^-1] / molecular mass [g/mol]) * (DNA mass [g] / histone MS-signal [intensity])
    Suggestion importance[1-10]: 5

    Why: Adding units to the formula variables would improve clarity and help users understand the calculation better, though it's not critical for functionality. The suggestion is valid and would enhance documentation readability.


    @ypriverol ypriverol self-requested a review January 30, 2025 16:14
    @mobiusklein (Contributor, Author) commented:
    The unit tests are failing because the unit tests are actually incorrect.

    def test_feature_assembly():
        args = {
            "parquet": str(TESTS_DIR / "example/feature.parquet"),
            "sdrf": str(TESTS_DIR / "example/PXD017834-TMT.sdrf.tsv"),
            "min_aa": 7,
            "min_unique": 2,
            "remove_ids": None,
            "remove_decoy_contaminants": True,
            "remove_low_frequency_peptides": True,
            "output": str(TESTS_DIR / "example/PXD017834-peptides-norm.csv"),
            "skip_normalization": False,
            "nmethod": "median",
            "pnmethod": "max_min",
            "log2": True,
            "save_parquet": True,
        }
        print(args)
        peptide_normalization(**args)

    Note the entry "pnmethod": "max_min". That's not a valid peptide normalization; it's technically a feature normalization magic string, though not an advertised one. The only peptide normalizations that were supported before, "globalMedian" and "conditionMedian", are checked for like this:

    if not skip_normalization:
        if pnmethod == GLOBALMEDIAN:
            dataset_df.loc[:, NORM_INTENSITY] = (
                dataset_df[NORM_INTENSITY] / med_map[sample]
            )
        elif pnmethod == CONDITIONMEDIAN:
            con = dataset_df[CONDITION].unique()[0]
            dataset_df.loc[:, NORM_INTENSITY] = (
                dataset_df[NORM_INTENSITY] / med_map[con][sample]
            )

    So if a magic string match isn't found, it's ignored. Ergo this is like passing no peptide normalization method.

    Therefore I've changed the test to "none" as its peptide normalization method and made that preserve the previous behavior.
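The fix can be illustrated with a sketch of the new enum; the variant names are assumed from the old magic strings and are not necessarily the PR's exact spellings:

```python
from enum import Enum


class PeptideNormalizationMethod(Enum):
    """An explicit "none" variant replaces the silent fall-through (sketch)."""

    GlobalMedian = "globalMedian"
    ConditionMedian = "conditionMedian"
    NONE = "none"


# An invalid option like "max_min" is now rejected up front
# instead of silently applying no normalization:
try:
    PeptideNormalizationMethod("max_min")
except ValueError as err:
    print(err)
```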

    @ypriverol ypriverol merged commit 48a091c into bigbio:dev Jan 30, 2025
    8 of 10 checks passed