
Conversation

@ypriverol (Member) commented Dec 14, 2025

PR Type

Enhancement, Documentation


Description

  • Comprehensive package refactoring from ProtoGain to GenerativeProteomics (GainPro)

  • Fixed method signatures: replaced cls with self in instance methods throughout the codebase (see the sketch after this list)

  • Standardized imports: converted all relative imports to absolute GenerativeProteomics package imports

  • Renamed classes for consistency: GAIN_DANN → GainDann, Imputation_Management → ImputationManagement

  • Added professional package infrastructure: pyproject.toml, MANIFEST.in, CLI entry point via main() function

  • Reorganized project structure: moved datasets to datasets/ folder, scripts to scripts/, configs to configs/

  • Enhanced documentation: added comprehensive docstrings, updated Sphinx docs, improved README with correct paths
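
A minimal sketch of the first two fixes above; Dataset and _build_hint are illustrative stand-ins, while generate_hint and the GenerativeProteomics package name come from this PR:

# Before: relative import, and an instance method mistakenly declared with `cls`:
#
#   from .utils import generate_hint
#
#   class Dataset:
#       def _build_hint(cls, hint_rate):
#           return generate_hint(cls.mask, hint_rate)

# After: absolute package import and the conventional `self`:
from GenerativeProteomics.utils import generate_hint

class Dataset:
    def __init__(self, mask):
        self.mask = mask

    def _build_hint(self, hint_rate):
        # same generate_hint(mask, hint_rate) call pattern seen in the diffs below
        return generate_hint(self.mask, hint_rate)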


Diagram Walkthrough

flowchart LR
  A["ProtoGain<br/>Codebase"] -->|"Rename classes<br/>Fix method signatures"| B["GenerativeProteomics<br/>Core Package"]
  B -->|"Add absolute imports"| C["Standardized<br/>Import System"]
  C -->|"Create pyproject.toml<br/>Add CLI entry"| D["Professional<br/>Package Structure"]
  D -->|"Reorganize folders<br/>Update docs"| E["GainPro v0.2.0<br/>Ready for Distribution"]
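
For reference, a minimal pyproject.toml sketch of the packaging step shown in the diagram; the console-script name gainpro is an assumption for illustration, while the main() entry point, the generativeproteomics module, and v0.2.0 come from this PR:

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "generativeproteomics"
version = "0.2.0"
description = "GAIN-based missing value imputation for proteomics"

[project.scripts]
# hypothetical script name wiring the CLI to the main() function
gainpro = "GenerativeProteomics.generativeproteomics:main"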

File Walkthrough

Relevant files

Documentation (16 files)
  • __init__.py: Added comprehensive module docstring and exports (+71/-0)
  • conf.py: Updated mock imports from ProtoGain to GenerativeProteomics (+1/-1)
  • README.md: Updated documentation with correct paths and references (+4/-4)
  • GainPro.dataset.rst: Updated module references from ProtoGain to GenerativeProteomics (+1/-1)
  • GainPro.generativeproteomics.rst: Updated module references and documentation (+2/-2)
  • GainPro.hypers.rst: Updated module references from ProtoGain to GenerativeProteomics (+1/-1)
  • GainPro.manager.rst: Updated class name and module references (+3/-3)
  • GainPro.model.rst: Updated module references from ProtoGain to GenerativeProteomics (+2/-2)
  • GainPro.output.rst: Updated module references and method signatures (+3/-3)
  • GainPro.rst: Updated all module references to GainPro naming (+7/-7)
  • GainPro.utils.rst: Updated module references from ProtoGain to GenerativeProteomics (+1/-1)
  • How to use.rst: Updated example paths to use datasets folder (+1/-1)
  • Installation.rst: Updated repository URL from ProtoGain to GainPro (+1/-1)
  • index.rst: Updated documentation index references to GainPro (+1/-1)
  • README.md: Created documentation for scripts directory (+22/-0)
  • README.md: Updated class name references in test documentation (+1/-1)

Enhancement (10 files)
  • correlation.py: Fixed relative import to absolute package import (+1/-1)
  • gain_dann_model.py: Renamed class GAIN_DANN to GainDann, fixed imports (+10/-10)
  • gaindann.py: Updated all imports to absolute paths, renamed class usage (+10/-10)
  • generativeproteomics.py: Refactored CLI with docstrings, improved argument parsing (+34/-18)
  • hypers_optimization.py: Fixed imports and renamed method toJSON to to_json (+4/-4)
  • imputation_management.py: Renamed class and fixed absolute imports (+4/-4)
  • __init__.py: Created new models subpackage with documentation (+17/-0)
  • params_gain_dann.py: Renamed method toJSON to to_json for consistency (+1/-1)
  • train.py: Updated all imports to absolute paths, renamed class usage (+13/-13)
  • multiple_runs.sh: Created improved script with documentation and parameters (+16/-0)

Bug fix (3 files)
  • dataset.py: Fixed method signatures from cls to self (+8/-8)
  • model.py: Fixed method signatures and standardized imports (+115/-115)
  • output.py: Fixed method signature and absolute imports (+3/-3)

Miscellaneous (4 files)
  • test.py: Removed obsolete test file (+0/-16)
  • utils.py: Removed root-level utils file, moved to package (+0/-58)
  • multiple_runs.sh: Removed obsolete script from package directory (+0/-8)
  • parameters.json: Removed obsolete parameters file from package (+0/-16)

Tests (5 files)
  • test_imputation_management.py: Updated class name references in test cases (+6/-6)
  • test_generate_reference.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-3)
  • test_hint_generation.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-1)
  • test_imputation_with_reference.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-1)
  • test_impute_no_reference.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-1)

Configuration changes (5 files)
  • parameters.json: Updated paths to use datasets folder structure (+4/-5)
  • parameters_noref.json: Updated paths and removed deprecated parameters (+3/-6)
  • MANIFEST.in: Created manifest for package distribution (+20/-0)
  • params_gain.json: Created new config file with standardized paths (+15/-0)
  • pyproject.toml: Created comprehensive project configuration file (+133/-0)

Dependencies (1 file)
  • requirements.txt: Updated dependencies with version constraints (+9/-4)

@ypriverol requested a review from Copilot December 14, 2025 19:55
@coderabbitai bot commented Dec 14, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.
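
That setting would look like this, assuming the key path named above:

# .coderabbit.yaml
reviews:
  review_status: false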

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review bot commented Dec 14, 2025


PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢 No security concerns identified. No security vulnerabilities were detected by AI analysis; human verification is advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Missing audit logs: the new CLI entry point and training/evaluation flows do not log user identity or action outcomes for critical operations like file I/O and model runs, making audit trails incomplete.

Referred Code
def main():
    """Main entry point for the GenerativeProteomics CLI."""
    start_time = time.time()
    with cProfile.Profile() as profile:

        folder = os.getcwd()

        args = init_arg()

        missing_file = args.i
        output_file = args.o
        ref_file = args.ref
        output_folder = args.ofolder
        num_iterations = args.it
        batch_size = args.batchsize
        alpha = args.alpha
        miss_rate = args.miss
        hint_rate = args.hint
        lr_D = args.lrd
        lr_G = args.lrg
        parameters_file = args.parameters


 ... (clipped 145 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Weak error handling: the new CLI path performs multiple file reads/writes and training steps without try/except or validation of inputs beyond minimal checks, risking silent failures or generic crashes.

Referred Code
def main():
    """Main entry point for the GenerativeProteomics CLI."""
    start_time = time.time()
    with cProfile.Profile() as profile:

        folder = os.getcwd()

        args = init_arg()

        missing_file = args.i
        output_file = args.o
        ref_file = args.ref
        output_folder = args.ofolder
        num_iterations = args.it
        batch_size = args.batchsize
        alpha = args.alpha
        miss_rate = args.miss
        hint_rate = args.hint
        lr_D = args.lrd
        lr_G = args.lrg
        parameters_file = args.parameters


 ... (clipped 145 lines)

Learn more about managing compliance generic rules or creating your own custom rules
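
For context, a minimal sketch of the hardening this check asks for, with the CLI flow wrapped in explicit error handling; run_pipeline is a hypothetical wrapper, while init_arg and the -i/--input argument come from the referred code above:

import logging
import sys

logger = logging.getLogger("GenerativeProteomics")

def main():
    args = init_arg()  # the parser shown in the referred code above
    if args.i is None:
        logger.error("No input file provided; pass one with -i/--input.")
        sys.exit(2)
    try:
        run_pipeline(args)  # hypothetical wrapper around the training/imputation flow
    except FileNotFoundError as exc:
        logger.error("Input file not found: %s", exc.filename)
        sys.exit(1)
    except Exception:
        logger.exception("Imputation run failed")
        sys.exit(1)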

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Error detail exposure: user-facing logging on missing model files uses logger.error with interpolated paths and may propagate exceptions without sanitization, potentially exposing internal paths.

Referred Code
model = GainDann(
    metadata["protein_names"],
    metadata["input_dim"],
    latent_dim=metadata["latent_dim"],
    n_class=metadata["n_class"],
    num_hidden_layers=metadata["params"]["num_hidden_layers"],
    dann_params=dann_params,
    gain_params=gain_params,
    gain_metrics=gain_metrics,
)
model_path = f"{checkpoint_dir}/model.pt"
if not os.path.isfile(model_path):
    logger.error(f"Model in {model_path} not found.")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Input validation gaps: the CLI accepts file paths and parameters and reads CSV/TSV without comprehensive validation or sanitization, which may lead to errors or unsafe handling when files are malformed or unexpected.

Referred Code
"""Parse command-line arguments for the imputation pipeline."""
parser = argparse.ArgumentParser(
    description="GenerativeProteomics (GainPro) - GAIN-based missing value imputation",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument("-i", "--input", dest="i", help="path to missing data file")
parser.add_argument("-o", "--output", dest="o", default="imputed", help="name of output file")
parser.add_argument("--ref", help="path to a reference (complete) dataset")
parser.add_argument(
    "--ofolder", default=os.getcwd() + "/results/", help="path to output folder"
)
parser.add_argument("--it", type=int, default=2001, help="number of iterations")
parser.add_argument("--batchsize", type=int, default=128, help="batch size")
parser.add_argument("--alpha", type=float, default=10, help="alpha")
parser.add_argument("--miss", type=float, default=0.1, help="missing rate")
parser.add_argument("--hint", type=float, default=0.9, help="hint rate")
parser.add_argument(
    "--lrd", type=float, default=0.001, help="learning rate for the discriminator"
)
parser.add_argument(
    "--lrg", type=float, default=0.001, help="learning rate for the generator"


 ... (clipped 38 lines)

Learn more about managing compliance generic rules or creating your own custom rules
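
A small sketch of one way to close this gap using argparse's type hook; existing_file is a hypothetical helper, while the -i/--input argument comes from the referred code above:

import argparse
import os

def existing_file(path: str) -> str:
    # Reject paths that do not point to an actual file before the pipeline runs.
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError(f"file not found: {path}")
    return path

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", dest="i", type=existing_file,
                    help="path to missing data file")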

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review bot commented Dec 14, 2025


PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Fix incorrect attribute assignment logic

Fix the __setitem__ method in ParamsGainDann to correctly set attributes dynamically using setattr(self, key, value) instead of hardcoding the attribute name to key.

GenerativeProteomics/params_gain_dann.py [226-227]

 def __setitem__(self, key, value):
-    self.key = value
+    setattr(self, key, value)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a bug in the __setitem__ method where it incorrectly assigns to self.key instead of using the key parameter to name the attribute, and the proposed fix using setattr is correct.

Impact: Medium
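
To see the bug concretely, a toy reproduction (hypothetical classes, not the project's real ParamsGainDann):

class Broken:
    def __setitem__(self, key, value):
        self.key = value            # always writes the attribute literally named "key"

class Fixed:
    def __setitem__(self, key, value):
        setattr(self, key, value)   # writes the attribute named by the key argument

b = Broken()
b["alpha"] = 10
print(hasattr(b, "alpha"))  # False: the value landed in b.key
f = Fixed()
f["alpha"] = 10
print(f.alpha)              # 10
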
High-level
Refactor duplicated training loop logic

The Network class has three methods (train_ref, evaluate, train) with almost identical training loops. This should be refactored into a single, unified training method to reduce code duplication and improve maintainability.

Examples:

GenerativeProteomics/model.py [147-364]
    def train_ref(self, data: Data, missing_header):

        dim = data.dataset_scaled.shape[1]
        train_size = data.dataset_scaled.shape[0]

        if train_size < self.hypers.batch_size:
            self.hypers.batch_size = train_size
            print(
                "Batch size is larger than the number of samples\nReducing batch size to the number of samples\n"
            )

 ... (clipped 208 lines)

Solution Walkthrough:

Before:

class Network:
    def train_ref(self, data, ...):
        # ... setup ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch with reference ...
            self.metrics.loss_D[it] = self._update_D(...)
            self.metrics.loss_G[it] = self._update_G(...)
            # ... calculate train and test MSE ...
        # ... post-processing ...

    def evaluate(self, data, ...):
        # ... setup ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch for evaluation ...
            self.metrics.loss_D_evaluate[it] = self._update_D(...)
            self.metrics.loss_G_evaluate[it] = self._update_G(...)
            # ... calculate train and test MSE for evaluation ...
        # ... post-processing ...

    def train(self, data, ...):
        # ... setup, weight init ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch without reference ...
            self.metrics.loss_D[it] = self._update_D(...)
            self.metrics.loss_G[it] = self._update_G(...)
            # ... calculate train MSE only ...
        # ... post-processing ...

After:

class Network:
    def _training_loop(self, data, mode='train'):
        # ... common setup ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch based on mode ...
            loss_d = self._update_D(...)
            loss_g = self._update_G(...)
            
            # ... calculate and store metrics based on mode ...
            # (e.g., train MSE, optional test MSE)

    def train_ref(self, data, ...):
        self._training_loop(data, mode='train_ref')
        self.impute(data)
        # ... output ...

    def evaluate(self, data, ...):
        self._training_loop(data, mode='evaluate')
        self._evaluate_impute(data)
        # ... output ...

    def train(self, data, ...):
        # ... weight init ...
        self._training_loop(data, mode='train')
        self.impute(data)
        # ... output ...
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies significant code duplication in the core training logic across three methods, which is a major maintenance concern and violates the DRY principle.

Impact: Medium
General
Scale data before creating reference

In the _create_ref method, scale the dataset before introducing zeros for the
reference set. This prevents the scaler from being influenced by artificial zero
values and ensures it is based on the true data distribution.

GenerativeProteomics/dataset.py [42-56]

 def _create_ref(self, miss_rate, hint_rate):
 
     self.ref_mask = self.mask.detach().clone()
     self.ref_dataset = self.dataset.detach().clone()
+    self.ref_dataset_scaled = torch.from_numpy(self.scaler.transform(self.ref_dataset)).clone()
+    
     zero_idxs = torch.nonzero(self.mask == 1)
     chance = torch.rand(len(zero_idxs))
     miss = chance > miss_rate
 
     selected_idx = zero_idxs[~miss]
     for idx in selected_idx:
         self.ref_mask[tuple(idx)] = 0
+        self.ref_dataset_scaled[tuple(idx)] = 0
         self.ref_dataset[tuple(idx)] = 0
 
     self.ref_hint = generate_hint(self.ref_mask, hint_rate)
-    self.ref_dataset_scaled = torch.from_numpy(self.scaler.transform(self.ref_dataset))
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that scaling the data after introducing zeros can skew the data distribution, which is a valid concern for data preprocessing that could impact model performance.

Impact: Medium
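
A toy illustration of the ordering issue, using made-up one-column data and scikit-learn's MinMaxScaler (the project's actual scaler may differ):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[2.0], [4.0], [6.0]])
scaler = MinMaxScaler().fit(data)

# Zero first, then transform: the injected zero maps to -0.5, outside [0, 1].
masked = data.copy()
masked[1] = 0.0
print(scaler.transform(masked).ravel())  # [ 0.  -0.5  1. ]

# Transform first, then zero: the masked entry stays exactly 0 in scaled space.
scaled = scaler.transform(data)
scaled[1] = 0.0
print(scaled.ravel())                    # [0. 0. 1.]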


Copilot AI left a comment


Pull request overview

This PR represents a significant refactoring of the GainPro (previously ProtoGain) project, involving package renaming, code organization improvements, and the addition of proper Python packaging configuration. The changes modernize the codebase structure by converting to absolute imports, following PEP 8 naming conventions, and adding comprehensive package metadata.

Key changes:

  • Renamed project from "ProtoGain" to "GainPro/GenerativeProteomics" throughout codebase
  • Refactored imports from relative to absolute with GenerativeProteomics. prefix
  • Renamed classes and methods to follow PEP 8 conventions (Imputation_Management → ImputationManagement, GAIN_DANN → GainDann, toJSON → to_json)
  • Fixed method signatures by replacing cls with self for instance methods
  • Added proper Python packaging with pyproject.toml and MANIFEST.in
  • Reorganized scripts and configuration files

Reviewed changes

Copilot reviewed 43 out of 47 changed files in this pull request and generated 4 comments.

Show a summary per file:

  • utils.py: Deleted old utility file
  • test.py: Deleted development test file
  • GenerativeProteomics/multiple_runs.sh: Deleted old shell script
  • GenerativeProteomics/parameters.json: Deleted old hardcoded parameters file
  • scripts/multiple_runs.sh: Added new organized shell script for running multiple experiments
  • scripts/README.md: Added documentation for scripts
  • requirements.txt: Updated with version constraints and better organization
  • pyproject.toml: Added comprehensive Python packaging configuration
  • MANIFEST.in: Added package manifest for distribution
  • GenerativeProteomics/__init__.py: Added package initialization with exports and documentation
  • GenerativeProteomics/models/__init__.py: Added models subpackage initialization
  • GenerativeProteomics/*.py: Updated all Python files with absolute imports and fixed method signatures
  • tests/*.py: Updated test files with class renames and import path fixes
  • use-case/2-tests/*.py: Updated use-case tests with import path fixes
  • docs/source/*.rst: Updated documentation with project name changes
  • Datasets/breast/*.json: Updated dataset parameter files with corrected paths
  • Datasets/breast/breastMissing_20.csv: Added dataset file
  • configs/params_gain.json: Added example configuration file
  • README.md: Updated with corrected paths and project references


ypriverol and others added 4 commits December 14, 2025 20:49, each co-authored by Copilot <175728472+Copilot@users.noreply.github.com>.
@ypriverol ypriverol merged commit 135c883 into dev Dec 14, 2025
1 check passed
