
Conversation

@ypriverol (Member) commented Dec 14, 2025

PR Type

Enhancement, Documentation


Description

  • Comprehensive package refactoring from ProtoGain to GenerativeProteomics (GainPro)

  • Fixed method signatures: replaced cls with self in instance methods throughout the codebase (see the sketch after this list)

  • Standardized imports: converted all relative imports to absolute GenerativeProteomics package imports

  • Renamed classes for consistency: GAIN_DANN → GainDann, Imputation_Management → ImputationManagement

  • Added professional package infrastructure: pyproject.toml, MANIFEST.in, CLI entry point via main() function

  • Reorganized project structure: moved datasets to datasets/ folder, scripts to scripts/, configs to configs/

  • Enhanced documentation: added comprehensive docstrings, updated Sphinx docs, improved README with correct paths
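
A minimal sketch of the first two fixes above; Dataset and _build_hint are illustrative stand-ins, while generate_hint and the GenerativeProteomics package name come from this PR:

# Before: relative import, and an instance method mistakenly declared with `cls`:
#
#   from .utils import generate_hint
#
#   class Dataset:
#       def _build_hint(cls, hint_rate):
#           return generate_hint(cls.mask, hint_rate)

# After: absolute package import and the conventional `self`:
from GenerativeProteomics.utils import generate_hint

class Dataset:
    def __init__(self, mask):
        self.mask = mask

    def _build_hint(self, hint_rate):
        # same generate_hint(mask, hint_rate) call pattern seen in the diffs below
        return generate_hint(self.mask, hint_rate)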


Diagram Walkthrough

flowchart LR
  A["ProtoGain<br/>Codebase"] -->|"Rename classes<br/>Fix method signatures"| B["GenerativeProteomics<br/>Core Package"]
  B -->|"Add absolute imports"| C["Standardized<br/>Import System"]
  C -->|"Create pyproject.toml<br/>Add CLI entry"| D["Professional<br/>Package Structure"]
  D -->|"Reorganize folders<br/>Update docs"| E["GainPro v0.2.0<br/>Ready for Distribution"]
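
For reference, a minimal pyproject.toml sketch of the packaging step shown in the diagram; the console-script name gainpro is an assumption for illustration, while the main() entry point, the generativeproteomics module, and v0.2.0 come from this PR:

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "generativeproteomics"
version = "0.2.0"
description = "GAIN-based missing value imputation for proteomics"

[project.scripts]
# hypothetical script name wiring the CLI to the main() function
gainpro = "GenerativeProteomics.generativeproteomics:main"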

File Walkthrough

Relevant files

Documentation (16 files)
  • __init__.py: Added comprehensive module docstring and exports (+71/-0)
  • conf.py: Updated mock imports from ProtoGain to GenerativeProteomics (+1/-1)
  • README.md: Updated documentation with correct paths and references (+4/-4)
  • GainPro.dataset.rst: Updated module references from ProtoGain to GenerativeProteomics (+1/-1)
  • GainPro.generativeproteomics.rst: Updated module references and documentation (+2/-2)
  • GainPro.hypers.rst: Updated module references from ProtoGain to GenerativeProteomics (+1/-1)
  • GainPro.manager.rst: Updated class name and module references (+3/-3)
  • GainPro.model.rst: Updated module references from ProtoGain to GenerativeProteomics (+2/-2)
  • GainPro.output.rst: Updated module references and method signatures (+3/-3)
  • GainPro.rst: Updated all module references to GainPro naming (+7/-7)
  • GainPro.utils.rst: Updated module references from ProtoGain to GenerativeProteomics (+1/-1)
  • How to use.rst: Updated example paths to use datasets folder (+1/-1)
  • Installation.rst: Updated repository URL from ProtoGain to GainPro (+1/-1)
  • index.rst: Updated documentation index references to GainPro (+1/-1)
  • README.md: Created documentation for scripts directory (+22/-0)
  • README.md: Updated class name references in test documentation (+1/-1)

Enhancement (10 files)
  • correlation.py: Fixed relative import to absolute package import (+1/-1)
  • gain_dann_model.py: Renamed class GAIN_DANN to GainDann, fixed imports (+10/-10)
  • gaindann.py: Updated all imports to absolute paths, renamed class usage (+10/-10)
  • generativeproteomics.py: Refactored CLI with docstrings, improved argument parsing (+34/-18)
  • hypers_optimization.py: Fixed imports and renamed method toJSON to to_json (+4/-4)
  • imputation_management.py: Renamed class and fixed absolute imports (+4/-4)
  • __init__.py: Created new models subpackage with documentation (+17/-0)
  • params_gain_dann.py: Renamed method toJSON to to_json for consistency (+1/-1)
  • train.py: Updated all imports to absolute paths, renamed class usage (+13/-13)
  • multiple_runs.sh: Created improved script with documentation and parameters (+16/-0)

Bug fix (3 files)
  • dataset.py: Fixed method signatures from cls to self (+8/-8)
  • model.py: Fixed method signatures and standardized imports (+115/-115)
  • output.py: Fixed method signature and absolute imports (+3/-3)

Miscellaneous (4 files)
  • test.py: Removed obsolete test file (+0/-16)
  • utils.py: Removed root-level utils file, moved to package (+0/-58)
  • multiple_runs.sh: Removed obsolete script from package directory (+0/-8)
  • parameters.json: Removed obsolete parameters file from package (+0/-16)

Tests (5 files)
  • test_imputation_management.py: Updated class name references in test cases (+6/-6)
  • test_generate_reference.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-3)
  • test_hint_generation.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-1)
  • test_imputation_with_reference.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-1)
  • test_impute_no_reference.py: Updated path references from ProtoGain to GenerativeProteomics (+1/-1)

Configuration changes (5 files)
  • parameters.json: Updated paths to use datasets folder structure (+4/-5)
  • parameters_noref.json: Updated paths and removed deprecated parameters (+3/-6)
  • MANIFEST.in: Created manifest for package distribution (+20/-0)
  • params_gain.json: Created new config file with standardized paths (+15/-0)
  • pyproject.toml: Created comprehensive project configuration file (+133/-0)

Dependencies (1 file)
  • requirements.txt: Updated dependencies with version constraints (+9/-4)

@ypriverol requested a review from Copilot December 14, 2025 19:55
@coderabbitai bot commented Dec 14, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.
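
That setting would look like this, assuming the key path named above:

# .coderabbit.yaml
reviews:
  review_status: false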

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review bot commented Dec 14, 2025


PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢 No security concerns identified. No security vulnerabilities were detected by AI analysis; human verification is advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Missing audit logs: the new CLI entry point and training/evaluation flows do not log user identity or action outcomes for critical operations like file I/O and model runs, making audit trails incomplete.

Referred Code
def main():
    """Main entry point for the GenerativeProteomics CLI."""
    start_time = time.time()
    with cProfile.Profile() as profile:

        folder = os.getcwd()

        args = init_arg()

        missing_file = args.i
        output_file = args.o
        ref_file = args.ref
        output_folder = args.ofolder
        num_iterations = args.it
        batch_size = args.batchsize
        alpha = args.alpha
        miss_rate = args.miss
        hint_rate = args.hint
        lr_D = args.lrd
        lr_G = args.lrg
        parameters_file = args.parameters


 ... (clipped 145 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Weak error handling: the new CLI path performs multiple file reads/writes and training steps without try/except or validation of inputs beyond minimal checks, risking silent failures or generic crashes.

Referred Code
def main():
    """Main entry point for the GenerativeProteomics CLI."""
    start_time = time.time()
    with cProfile.Profile() as profile:

        folder = os.getcwd()

        args = init_arg()

        missing_file = args.i
        output_file = args.o
        ref_file = args.ref
        output_folder = args.ofolder
        num_iterations = args.it
        batch_size = args.batchsize
        alpha = args.alpha
        miss_rate = args.miss
        hint_rate = args.hint
        lr_D = args.lrd
        lr_G = args.lrg
        parameters_file = args.parameters


 ... (clipped 145 lines)

Learn more about managing compliance generic rules or creating your own custom rules
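
For context, a minimal sketch of the hardening this check asks for, with the CLI flow wrapped in explicit error handling; run_pipeline is a hypothetical wrapper, while init_arg and the -i/--input argument come from the referred code above:

import logging
import sys

logger = logging.getLogger("GenerativeProteomics")

def main():
    args = init_arg()  # the parser shown in the referred code above
    if args.i is None:
        logger.error("No input file provided; pass one with -i/--input.")
        sys.exit(2)
    try:
        run_pipeline(args)  # hypothetical wrapper around the training/imputation flow
    except FileNotFoundError as exc:
        logger.error("Input file not found: %s", exc.filename)
        sys.exit(1)
    except Exception:
        logger.exception("Imputation run failed")
        sys.exit(1)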

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Error detail exposure: user-facing logging on missing model files uses logger.error with interpolated paths and may propagate exceptions without sanitization, potentially exposing internal paths.

Referred Code
model = GainDann(
    metadata["protein_names"],
    metadata["input_dim"],
    latent_dim=metadata["latent_dim"],
    n_class=metadata["n_class"],
    num_hidden_layers=metadata["params"]["num_hidden_layers"],
    dann_params=dann_params,
    gain_params=gain_params,
    gain_metrics=gain_metrics,
)
model_path = f"{checkpoint_dir}/model.pt"
if not os.path.isfile(model_path):
    logger.error(f"Model in {model_path} not found.")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Input validation gaps: the CLI accepts file paths and parameters and reads CSV/TSV without comprehensive validation or sanitization, which may lead to errors or unsafe handling when files are malformed or unexpected.

Referred Code
"""Parse command-line arguments for the imputation pipeline."""
parser = argparse.ArgumentParser(
    description="GenerativeProteomics (GainPro) - GAIN-based missing value imputation",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument("-i", "--input", dest="i", help="path to missing data file")
parser.add_argument("-o", "--output", dest="o", default="imputed", help="name of output file")
parser.add_argument("--ref", help="path to a reference (complete) dataset")
parser.add_argument(
    "--ofolder", default=os.getcwd() + "/results/", help="path to output folder"
)
parser.add_argument("--it", type=int, default=2001, help="number of iterations")
parser.add_argument("--batchsize", type=int, default=128, help="batch size")
parser.add_argument("--alpha", type=float, default=10, help="alpha")
parser.add_argument("--miss", type=float, default=0.1, help="missing rate")
parser.add_argument("--hint", type=float, default=0.9, help="hint rate")
parser.add_argument(
    "--lrd", type=float, default=0.001, help="learning rate for the discriminator"
)
parser.add_argument(
    "--lrg", type=float, default=0.001, help="learning rate for the generator"


 ... (clipped 38 lines)

Learn more about managing compliance generic rules or creating your own custom rules
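
A small sketch of one way to close this gap using argparse's type hook; existing_file is a hypothetical helper, while the -i/--input argument comes from the referred code above:

import argparse
import os

def existing_file(path: str) -> str:
    # Reject paths that do not point to an actual file before the pipeline runs.
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError(f"file not found: {path}")
    return path

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", dest="i", type=existing_file,
                    help="path to missing data file")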

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review bot commented Dec 14, 2025


PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Fix incorrect attribute assignment logic

Fix the __setitem__ method in ParamsGainDann to correctly set attributes dynamically using setattr(self, key, value) instead of hardcoding the attribute name to key.

GenerativeProteomics/params_gain_dann.py [226-227]

 def __setitem__(self, key, value):
-    self.key = value
+    setattr(self, key, value)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a bug in the __setitem__ method where it incorrectly assigns to self.key instead of using the key parameter to name the attribute, and the proposed fix using setattr is correct.

Impact: Medium
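
To see the bug concretely, a toy reproduction (hypothetical classes, not the project's real ParamsGainDann):

class Broken:
    def __setitem__(self, key, value):
        self.key = value            # always writes the attribute literally named "key"

class Fixed:
    def __setitem__(self, key, value):
        setattr(self, key, value)   # writes the attribute named by the key argument

b = Broken()
b["alpha"] = 10
print(hasattr(b, "alpha"))  # False: the value landed in b.key
f = Fixed()
f["alpha"] = 10
print(f.alpha)              # 10
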
High-level
Refactor duplicated training loop logic

The Network class has three methods (train_ref, evaluate, train) with almost identical training loops. This should be refactored into a single, unified training method to reduce code duplication and improve maintainability.

Examples:

GenerativeProteomics/model.py [147-364]
    def train_ref(self, data: Data, missing_header):

        dim = data.dataset_scaled.shape[1]
        train_size = data.dataset_scaled.shape[0]

        if train_size < self.hypers.batch_size:
            self.hypers.batch_size = train_size
            print(
                "Batch size is larger than the number of samples\nReducing batch size to the number of samples\n"
            )

 ... (clipped 208 lines)

Solution Walkthrough:

Before:

class Network:
    def train_ref(self, data, ...):
        # ... setup ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch with reference ...
            self.metrics.loss_D[it] = self._update_D(...)
            self.metrics.loss_G[it] = self._update_G(...)
            # ... calculate train and test MSE ...
        # ... post-processing ...

    def evaluate(self, data, ...):
        # ... setup ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch for evaluation ...
            self.metrics.loss_D_evaluate[it] = self._update_D(...)
            self.metrics.loss_G_evaluate[it] = self._update_G(...)
            # ... calculate train and test MSE for evaluation ...
        # ... post-processing ...

    def train(self, data, ...):
        # ... setup, weight init ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch without reference ...
            self.metrics.loss_D[it] = self._update_D(...)
            self.metrics.loss_G[it] = self._update_G(...)
            # ... calculate train MSE only ...
        # ... post-processing ...

After:

class Network:
    def _training_loop(self, data, mode='train'):
        # ... common setup ...
        pbar = tqdm(range(self.hypers.num_iterations))
        for it in pbar:
            # ... sample batch based on mode ...
            loss_d = self._update_D(...)
            loss_g = self._update_G(...)
            
            # ... calculate and store metrics based on mode ...
            # (e.g., train MSE, optional test MSE)

    def train_ref(self, data, ...):
        self._training_loop(data, mode='train_ref')
        self.impute(data)
        # ... output ...

    def evaluate(self, data, ...):
        self._training_loop(data, mode='evaluate')
        self._evaluate_impute(data)
        # ... output ...

    def train(self, data, ...):
        # ... weight init ...
        self._training_loop(data, mode='train')
        self.impute(data)
        # ... output ...
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies significant code duplication in the core training logic across three methods, which is a major maintenance concern and violates the DRY principle.

Impact: Medium
General
Scale data before creating reference

In the _create_ref method, scale the dataset before introducing zeros for the
reference set. This prevents the scaler from being influenced by artificial zero
values and ensures it is based on the true data distribution.

GenerativeProteomics/dataset.py [42-56]

 def _create_ref(self, miss_rate, hint_rate):
 
     self.ref_mask = self.mask.detach().clone()
     self.ref_dataset = self.dataset.detach().clone()
+    self.ref_dataset_scaled = torch.from_numpy(self.scaler.transform(self.ref_dataset)).clone()
+    
     zero_idxs = torch.nonzero(self.mask == 1)
     chance = torch.rand(len(zero_idxs))
     miss = chance > miss_rate
 
     selected_idx = zero_idxs[~miss]
     for idx in selected_idx:
         self.ref_mask[tuple(idx)] = 0
+        self.ref_dataset_scaled[tuple(idx)] = 0
         self.ref_dataset[tuple(idx)] = 0
 
     self.ref_hint = generate_hint(self.ref_mask, hint_rate)
-    self.ref_dataset_scaled = torch.from_numpy(self.scaler.transform(self.ref_dataset))
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that scaling the data after introducing zeros can skew the data distribution, which is a valid concern for data preprocessing that could impact model performance.

Impact: Medium
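
A toy illustration of the ordering issue, using made-up one-column data and scikit-learn's MinMaxScaler (the project's actual scaler may differ):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[2.0], [4.0], [6.0]])
scaler = MinMaxScaler().fit(data)

# Zero first, then transform: the injected zero maps to -0.5, outside [0, 1].
masked = data.copy()
masked[1] = 0.0
print(scaler.transform(masked).ravel())  # [ 0.  -0.5  1. ]

# Transform first, then zero: the masked entry stays exactly 0 in scaled space.
scaled = scaler.transform(data)
scaled[1] = 0.0
print(scaled.ravel())                    # [0. 0. 1.]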


Copilot AI left a comment


Pull request overview

This PR represents a significant refactoring of the GainPro (previously ProtoGain) project, involving package renaming, code organization improvements, and the addition of proper Python packaging configuration. The changes modernize the codebase structure by converting to absolute imports, following PEP 8 naming conventions, and adding comprehensive package metadata.

Key changes:

  • Renamed project from "ProtoGain" to "GainPro/GenerativeProteomics" throughout codebase
  • Refactored imports from relative to absolute with GenerativeProteomics. prefix
  • Renamed classes and methods to follow PEP 8 conventions (Imputation_Management → ImputationManagement, GAIN_DANN → GainDann, toJSON → to_json)
  • Fixed method signatures by replacing cls with self for instance methods
  • Added proper Python packaging with pyproject.toml and MANIFEST.in
  • Reorganized scripts and configuration files

Reviewed changes

Copilot reviewed 43 out of 47 changed files in this pull request and generated 4 comments.

Show a summary per file:

  • utils.py: Deleted old utility file
  • test.py: Deleted development test file
  • GenerativeProteomics/multiple_runs.sh: Deleted old shell script
  • GenerativeProteomics/parameters.json: Deleted old hardcoded parameters file
  • scripts/multiple_runs.sh: Added new organized shell script for running multiple experiments
  • scripts/README.md: Added documentation for scripts
  • requirements.txt: Updated with version constraints and better organization
  • pyproject.toml: Added comprehensive Python packaging configuration
  • MANIFEST.in: Added package manifest for distribution
  • GenerativeProteomics/__init__.py: Added package initialization with exports and documentation
  • GenerativeProteomics/models/__init__.py: Added models subpackage initialization
  • GenerativeProteomics/*.py: Updated all Python files with absolute imports and fixed method signatures
  • tests/*.py: Updated test files with class renames and import path fixes
  • use-case/2-tests/*.py: Updated use-case tests with import path fixes
  • docs/source/*.rst: Updated documentation with project name changes
  • Datasets/breast/*.json: Updated dataset parameter files with corrected paths
  • Datasets/breast/breastMissing_20.csv: Added dataset file
  • configs/params_gain.json: Added example configuration file
  • README.md: Updated with corrected paths and project references


ypriverol and others added 4 commits December 14, 2025 20:49, each co-authored by Copilot <175728472+Copilot@users.noreply.github.com>.
@ypriverol ypriverol merged commit 135c883 into dev Dec 14, 2025
1 check passed
