Refactoring of static data and dispatch behavior, I/O streamlining #88
documenting code and functions.
update:doc
update: doc
The unit tests are failing because the tests themselves are incorrect. See ibaqpy/tests/test_peptide_normalize.py, lines 7 to 24 in 1058460, and note line 19. Compare that with ibaqpy/ibaqpy/ibaq/peptide_normalization.py, lines 592 to 602 in 1058460: if no magic string match is found, the value is ignored, which is equivalent to passing no peptide normalization method. I've therefore changed the test to use "none" as its peptide normalization method, which preserves the previous behavior.
User description
Changes to static data
Isobaric quantification
The isobaric quantification definitions and inference logic were duplicated and spread across multiple files, and made heavy use of magic strings. I consolidated these into a single module, and converted magic strings into enums for easier validation. If the classification logic of different TMT kits were less messy, I'd have probably moved these to be defined outside of code.
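As a rough illustration of the magic-string-to-enum conversion described above, here is a minimal sketch. The class name `IsobaricLabel`, its members, and the channel names are hypothetical placeholders for illustration, not the definitions used in this PR.

```python
from enum import Enum


class IsobaricLabel(Enum):
    """Hypothetical label kits; the channel tuples are illustrative only."""

    TMT6PLEX = ("TMT126", "TMT127", "TMT128", "TMT129", "TMT130", "TMT131")
    ITRAQ4PLEX = ("ITRAQ114", "ITRAQ115", "ITRAQ116", "ITRAQ117")

    @classmethod
    def from_name(cls, name: str) -> "IsobaricLabel":
        """Resolve a user-supplied string, failing loudly on unknown values."""
        key = name.strip().upper()
        try:
            return cls[key]
        except KeyError:
            valid = ", ".join(member.name for member in cls)
            raise ValueError(
                f"Unknown isobaric label {name!r}; expected one of: {valid}"
            ) from None

    def channels(self) -> tuple:
        """Return the channel names associated with this label kit."""
        return self.value


label = IsobaricLabel.from_name("tmt6plex")
print(label.name, len(label.channels()))
```

With a single lookup point like `from_name`, a misspelled label is rejected once at the boundary instead of being compared against string literals in several files.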
Added benefits:
Normalization methods
Again, the feature normalization and peptide normalization methods made heavy use of magic strings, and had multiple function names that were difficult to distinguish. The group reduction logic in the feature normalizations was also repeated many times. This refactor replaces the magic strings with enums. Each enum variant is bound to a function that performs the most granular transformation, and shared methods then propagate those transformations up the hierarchy as defined previously.
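The pattern of binding each enum variant to a granular function and sharing the group-level propagation could look roughly like the sketch below. The names `FeatureNormalization`, `_REDUCERS`, and `normalize_groups` are assumptions made for illustration, and the division-by-statistic scaling is a stand-in rather than the normalization this PR implements.

```python
from enum import Enum
from statistics import mean, median


class FeatureNormalization(Enum):
    """Hypothetical normalization methods, keyed by their CLI spelling."""

    NONE = "none"
    MEAN = "mean"
    MEDIAN = "median"

    @classmethod
    def from_str(cls, name: str) -> "FeatureNormalization":
        """Parse a method name; unknown names raise instead of being ignored."""
        try:
            return cls(name.strip().lower())
        except ValueError:
            valid = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unknown normalization {name!r}; expected one of: {valid}"
            ) from None

    def statistic(self, values):
        """Apply the granular function bound to this variant."""
        return _REDUCERS[self](values)


# One granular function per variant; NONE scales by 1.0 (a no-op).
_REDUCERS = {
    FeatureNormalization.NONE: lambda values: 1.0,
    FeatureNormalization.MEAN: mean,
    FeatureNormalization.MEDIAN: median,
}


def normalize_groups(groups, method):
    """Shared propagation step: scale each group by its own statistic."""
    result = {}
    for name, values in groups.items():
        scale = method.statistic(values)
        result[name] = [value / scale for value in values]
    return result


print(normalize_groups({"run_a": [1.0, 3.0]}, FeatureNormalization.from_str("mean")))
```

In this sketch an unrecognized method name raises a `ValueError`, and "no normalization" is an explicit `NONE` variant rather than the fallback for any unmatched string.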
Added benefits:
Organism definitions
Organisms were defined as a dictionary of dictionaries with a certain schema, and organisms were again selected using magic strings. This change moves the organisms into a static data file that is loaded on import instead of defining the data in code. A new type enforces the defined schema and provides a means of validating that an organism exists for a given identifier. This links with the re-implementation of the Proteomic Ruler.
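A JSON-backed, schema-enforcing organism type might be sketched as follows. The field names (`genome_size`, `histone_entries`), the `OrganismDescription` class, and the inline JSON with its placeholder values are all assumptions for illustration. The PR loads a packaged data file, whereas this sketch embeds the JSON as a string so that it is self-contained.

```python
import json
from dataclasses import dataclass

# Placeholder data standing in for the packaged JSON file (values are made up).
_EXAMPLE_JSON = """
{
  "example_organism": {
    "genome_size": 1000000,
    "histone_entries": ["H0001", "H0002"]
  }
}
"""


@dataclass(frozen=True)
class OrganismDescription:
    """Hypothetical schema for one organism record."""

    name: str
    genome_size: int
    histone_entries: tuple


def load_registry(text: str) -> dict:
    """Parse the JSON and build typed records; a missing field raises KeyError."""
    registry = {}
    for name, fields in json.loads(text).items():
        registry[name.lower()] = OrganismDescription(
            name=name,
            genome_size=int(fields["genome_size"]),
            histone_entries=tuple(fields["histone_entries"]),
        )
    return registry


# Loaded once at import time, mirroring the "loaded on import" behavior.
REGISTRY = load_registry(_EXAMPLE_JSON)


def get_organism(identifier: str) -> OrganismDescription:
    """Validate that an organism exists for the given identifier."""
    try:
        return REGISTRY[identifier.strip().lower()]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise ValueError(
            f"Unknown organism {identifier!r}; known organisms: {known}"
        ) from None


print(get_organism("Example_Organism").genome_size)
```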
I attempted to streamline the logic in `peptide2protein` in order to remove indirectly shared local variables and to move those stateful incremental calculations into objects to unclutter the main function. `PeptideProteinMapper` should remove the use of `nonlocal` binding and encapsulate the TPA calculation. `ConcentrationWeightByProteomicRuler` encapsulates the Proteomic Ruler implementation previously defined in `calculate_weight_and_concentration` without repeating some intermediate vector math.
I/O streamlining
The `peptide_normalization` program appends to a CSV file on each pass through the inner loop, and only after it is done writing the entire result to CSV, it does a bulk conversion to Parquet. Additionally, because `pandas` does CSV writing, it introduces extra state into the loop. This change introduces two new writing threads, one for writing incremental CSVs, and one for writing incremental Parquets. These threads run in the background while DuckDB and `pandas` do their thing. Since these operations are primarily I/O-bound they should not hold the GIL for very long.
PR Type
Enhancement, Documentation
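For reference, the background-writer idea from the I/O streamlining section can be sketched with only the standard library. The class name `CsvWriterThread`, its `submit` and `close` methods, and the sentinel-based shutdown are assumptions for illustration, not the write tasks added in this PR. A Parquet counterpart could follow the same queue-and-thread shape with an incremental writer such as `pyarrow.parquet.ParquetWriter`.

```python
import csv
import queue
import threading

# Sentinel object that tells the writer thread to finish.
_STOP = object()


class CsvWriterThread(threading.Thread):
    """Hypothetical writer that appends row batches to a CSV in the background."""

    def __init__(self, path, header):
        super().__init__(daemon=True)
        self.path = path
        self.header = header
        self._batches = queue.Queue()

    def run(self):
        # The file stays open for the thread's lifetime, so the main loop
        # never has to track whether the header was already written.
        with open(self.path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(self.header)
            while True:
                batch = self._batches.get()
                if batch is _STOP:
                    break
                writer.writerows(batch)

    def submit(self, rows):
        """Queue one batch of rows; returns immediately."""
        self._batches.put(list(rows))

    def close(self):
        """Signal the end of input and wait for pending batches to be written."""
        self._batches.put(_STOP)
        self.join()
```

A caller would start the thread before the main loop, call `submit` once per pass, and call `close` at the end, so the compute loop only enqueues batches while the file handling lives in the writer.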
Description
Refactored normalization methods and static data handling.
Enhanced I/O operations for CSV and Parquet file handling.
Improved peptide-to-protein mapping and proteomic ruler calculations.
Updated documentation to reflect new features and usage.
Changes walkthrough 📝
11 files
Enhanced CLI options for feature-to-peptide conversion
Improved peptide-to-protein conversion with proteomic ruler
Refactored common utilities and removed redundant code
Removed and replaced with centralized normalization logic
Refactored peptide normalization with modular methods
Adjusted batch correction to handle covariates
Added threaded CSV and Parquet write tasks
Introduced enums for feature and peptide normalization
Centralized organism metadata in a JSON-backed model
Added enums and mappings for quantification categories
Added JSON file for organism metadata
1 file
Included data files in package distribution
1 file
Updated documentation for new features and usage
2 files