Skip to content

Initial commit pyLucXor#4

Merged
ypriverol merged 1 commit into
bigbio:mainfrom
weizhongchun:main
Aug 25, 2025
Merged

Initial commit pyLucXor#4
ypriverol merged 1 commit into
bigbio:mainfrom
weizhongchun:main

Conversation

@weizhongchun
Copy link
Copy Markdown
Collaborator

@weizhongchun weizhongchun commented Aug 20, 2025

PR Type

Enhancement, Documentation


Description

• Complete implementation of pyLucXor, a Python package for peptide-spectrum match (PSM) analysis and modification site localization
• Core PSM processing engine with two-stage workflow: Stage 1 for FLR estimation with decoys, Stage 2 for final scoring
• Statistical models for CID and HCD fragmentation scoring with kernel density estimation and parallel training
• Comprehensive peptide sequence analysis with fragment ion generation (b-ions, y-ions) and neutral loss handling
• False localization rate (FLR) calculation using statistical modeling and kernel density estimation
• Parallel processing framework with configurable thread pools for improved performance
• Command-line interface with extensive configuration options and multi-threading support
• Mass spectrum data processing with peak normalization and intensity calculations
• Configuration management system with validation and default value handling
• Complete documentation including installation, usage examples, and API reference


Diagram Walkthrough

flowchart LR
  CLI["Command Line Interface"] --> Core["Core Processor"]
  Core --> PSM["PSM Analysis"]
  Core --> Models["CID/HCD Models"]
  PSM --> Peptide["Peptide Processing"]
  PSM --> Spectrum["Spectrum Matching"]
  Models --> FLR["FLR Calculator"]
  Peptide --> Parallel["Parallel Processing"]
  FLR --> Results["Output (idXML/CSV)"]
Loading

File Walkthrough

Relevant files
Enhancement
10 files
psm.py
Complete PSM class implementation with scoring and FLR calculation

pyLucXor/psm.py

• Implements PSM (Peptide-Spectrum Match) class with comprehensive
functionality for handling peptide-spectrum matches
• Provides methods
for processing PSMs, generating permutations, scoring, and calculating
delta scores and FLR values
• Includes support for modification
handling, spectrum matching, and both CID/HCD fragmentation methods

Implements two-stage processing workflow with decoy generation for FLR
estimation

+1367/-0
models.py
Statistical models for CID and HCD fragmentation scoring 

pyLucXor/models.py

• Implements CID and HCD statistical models (CIDModel, HCDModel) for
scoring peptide-spectrum matches
• Provides model data classes
(ModelData_CID, ModelData_HCD) with intensity and distance parameter
calculations
• Includes non-parametric density estimation using kernel
density estimation with NumPy vectorization
• Supports parallel model
training using ThreadPoolExecutor for improved performance

+1122/-0
parallel.py
Parallel processing framework for PSM analysis                     

pyLucXor/parallel.py

• Implements parallel processing infrastructure with worker classes
for different tasks
• Provides ScoringWorker, PSMProcessingWorker,
SpectrumMatchingWorker for concurrent execution
• Includes utility
functions for optimal thread count calculation and generic parallel
processing
• Supports batch processing of PSMs with configurable
thread pools

+261/-0 
core.py
Core processing engine for PSM workflow orchestration       

pyLucXor/core.py

• Implements CoreProcessor class as main orchestrator for PSM
processing workflow
• Provides two-stage processing: Stage 1 for FLR
estimation with decoys, Stage 2 for final scoring
• Includes methods
for result writing in both idXML and CSV formats
• Integrates FLR
calculation and PSM score management

+207/-0 
peptide.py
Core peptide sequence analysis and fragment ion generation

pyLucXor/peptide.py

• Implements comprehensive Peptide class for peptide sequence
representation and analysis
• Provides fragment ion generation (b-ions
and y-ions) with neutral loss handling
• Includes permutation
generation for modification site localization
• Supports both target
and decoy sequence processing with FLR calculation

+907/-0 
flr.py
False localization rate calculation and statistical modeling

pyLucXor/flr.py

• Implements FLRCalculator class for false localization rate
estimation
• Uses kernel density estimation for statistical modeling
of delta scores
• Provides both global and local FLR calculations with
minorization
• Supports two-round FLR calculation workflow with
mapping persistence

+798/-0 
cli.py
Command-line interface and main processing workflow           

pyLucXor/cli.py

• Implements complete command-line interface for pyLuciPHOr2 tool

Provides comprehensive argument parsing with extensive configuration
options
• Implements two-stage processing workflow with model training
and FLR calculation
• Includes multi-threading support and file I/O
handling for mzML and idXML formats

+781/-0 
spectrum.py
Mass spectrum data representation and processing                 

pyLucXor/spectrum.py

• Implements Spectrum class for mass spectrometry data handling

Provides peak intensity normalization and relative intensity
calculations
• Includes median-based normalization for spectral data
processing
• Supports peak finding and indexing operations

+134/-0 
peak.py
Mass spectrometry peak representation and matching             

pyLucXor/peak.py

• Implements Peak class representing individual mass spectrometry
peaks
• Provides peak matching information and scoring attributes

Includes serialization methods for data persistence
• Supports peak
comparison and hash operations

+129/-0 
globals.py
Global state management and utility functions                       

pyLucXor/globals.py

• Implements global variables management for PSM processing
• Provides
FLR estimate recording and assignment functionality
• Includes decoy
symbol mapping utilities
• Supports PSM score clearing for multi-round
processing

+99/-0   
Configuration changes
3 files
constants.py
Constants and configuration definitions for mass spectrometry

pyLucXor/constants.py

• Defines comprehensive constants including amino acid masses,
modification masses, and decoy mappings
• Provides algorithm types,
physical constants, and default configuration parameters
• Includes
ion types, neutral losses, and score type definitions
• Sets up decoy
amino acid mapping with mass calculations

+199/-0 
__init__.py
Package initialization with lazy loading and API exports 

pyLucXor/init.py

• Implements package initialization with lazy loading using
getattr
• Defines package version and exports main classes

Provides clean API access to core functionality
• Sets up module-level
imports for LucXor, PyLuciPHOr2, Peptide, PSM, and model classes

+27/-0   
config.py
Configuration management and parameter handling                   

pyLucXor/config.py

• Implements LucXorConfig class for configuration management

Provides dictionary-like interface for parameter access
• Includes
configuration validation and default value handling
• Supports dynamic
configuration updates

+113/-0 
Documentation
1 files
README.md
Complete package documentation and user guide                       

pyLucXor/README.md

• Provides comprehensive documentation for pyLucXor package
• Includes
installation instructions, usage examples, and API documentation

Documents algorithm workflow, configuration options, and output
formats
• Contains performance notes, dependencies, and contribution
guidelines

+181/-0 

Summary by CodeRabbit

  • New Features

    • Introduced pyLucXor: a CLI and Python API for PTM site localization (LuciPHOr2-inspired).
    • Supports CID/HCD, multi-threading, mzML/MGF inputs, and idXML/CSV outputs.
    • Two-stage workflow with FLR estimation/assignment; writes delta score, PEP score, and global/local FLR.
    • End-to-end pipeline from identification loading to result export.
  • Documentation

    • Added comprehensive README covering installation, prerequisites, CLI usage with examples, configuration options, output fields, performance notes, citation, and support.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 20, 2025

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@qodo-code-review
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

AttributeError Risk

Methods like normalize_spectrum, reduce_nl_peak, and get_spectrum_peaks use self.logger instead of the module-level logger, which is undefined on the instance and will raise AttributeError at runtime.

def normalize_spectrum(self) -> None:
    """Normalize spectrum intensities."""
    if self.spectrum is None:
        self.logger.warning("No spectrum data available for normalization")
        return

    # Get peaks
    mz_values, intensities = self.spectrum.get_peaks()

    if len(intensities) == 0:
        self.logger.warning("No intensity values available for normalization")
        return

    # Check data validity
    valid_mask = np.isfinite(intensities) & (intensities > 0)
    if not np.any(valid_mask):
        self.logger.warning("No valid intensity values found for normalization")
        return

    # Only use valid intensity values
    valid_intensities = intensities[valid_mask]
    valid_mz_values = mz_values[valid_mask]

    # Log intensity statistics
    self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
                     f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")

    # Normalize to sum of 1
    total = np.sum(valid_intensities)
    if total > 0:
        normalized_intensities = valid_intensities / total

        # Update spectrum with normalized values
        self.spectrum.set_peaks(valid_mz_values, normalized_intensities)

        self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
    else:
        self.logger.warning("Total intensity is zero, skipping normalization")

def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:
    """Reduce neutral loss peaks.

    Args:
        precursor_nl_mass: Precursor neutral loss mass
    """
    if self.spectrum is None:
        self.logger.warning("No spectrum data available for neutral loss reduction")
        return

    # Get peaks
    mz_values, intensities = self.spectrum.get_peaks()

    if len(intensities) == 0:
        self.logger.warning("No intensity values available for neutral loss reduction")
        return

    # Find peaks to reduce
    nl_indices = []
    for i, mz in enumerate(mz_values):
        # Check if peak is within neutral loss mass window
        if abs(mz - precursor_nl_mass) < 0.5:  # Using 0.5 Da window
            nl_indices.append(i)

    if nl_indices:
        # Reduce intensity of neutral loss peaks
        for idx in nl_indices:
            intensities[idx] *= 0.1  # Reduce to 10% of original intensity

        # Update spectrum
        self.spectrum.set_peaks(mz_values, intensities)

        self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
    else:
        self.logger.debug("No neutral loss peaks found")
Scoring Logic Consistency

score_permutations mixes model APIs: it expects model.get_charge_model and also calls model.get_log_np_density_* directly. CIDModel and HCDModel expose those methods but CIDModel’s get_log_np_density_* compute parametric Gaussian while HCDModel uses NP; ensure the selected model matches fragmentation type and that both paths are intended. Also distance uses pos density for both b and y with same call; verify intended behavior.

def score_permutations(self, model: Optional[object]) -> None:
    """
    Score all permutations of the peptide sequence

    Args:
        model: Optional scoring model
    """
    try:
        # Get all permutations from pos_permutation_score_map and neg_permutation_score_map
        all_perms = list(self.pos_permutation_score_map.keys()) + list(self.neg_permutation_score_map.keys())
        base_tolerance = self.config.get('fragment_mass_tolerance', 0.1) # Read from config, default 0.1

        # Score each permutation
        for perm in all_perms:
            # Check if it's a decoy sequence
            is_decoy = self._is_decoy_sequence(perm)


            if is_decoy:
                tolerance = base_tolerance * 2.0  # decoyPadding = 2.0
            else:
                tolerance = base_tolerance

            # Calculate score
            final_score = 0.0
            matched_peaks = self._match_peaks(perm, tolerance)

            # Get charge model
            charge_model = model.get_charge_model(self.charge) if model else None

            if charge_model is None:
                logger.debug(f"No charge model found for charge {self.charge}")
                continue

            # Calculate score for each peak
            for peak in matched_peaks:
                # Get ion type
                ion_type = peak.get('matched_ion_str', '')

                # Calculate intensity score
                intensity_m = 0.0
                intensity_u = 0.0

                if ion_type.startswith('b'):
                    intensity_m = model.get_log_np_density_int('b', peak['norm_intensity'], self.charge)
                elif ion_type.startswith('y'):
                    intensity_m = model.get_log_np_density_int('y', peak['norm_intensity'], self.charge)

                intensity_u = model.get_log_np_density_int('n', peak['norm_intensity'], self.charge)

                # Calculate distance score
                dist_m = 0.0
                dist_u = 0.0

                if ion_type.startswith('b'):
                    dist_m = model.get_log_np_density_dist_pos(peak['mass_diff'], self.charge)
                elif ion_type.startswith('y'):
                    dist_m = model.get_log_np_density_dist_pos(peak['mass_diff'], self.charge)

                # For unmatched ions, use uniform distribution
                dist_u = 0.0

                # Calculate score
                intensity_score = intensity_m - intensity_u
                distance_score = dist_m - dist_u

                # Handle invalid values
                if np.isnan(intensity_score) or np.isinf(intensity_score):
                    intensity_score = 0.0
                if np.isnan(distance_score) or np.isinf(distance_score):
                    distance_score = 0.0

                # Calculate final score: direct addition
                peak_score = intensity_score + distance_score

                # If score is negative, set to 0
                if peak_score < 0:
                    peak_score = 0.0

                # Add to total score
                final_score += peak_score

            logger.debug(f"Scoring permutation: {perm}, Score: {final_score:.6f}, Is decoy: {is_decoy}")

            # Store scores
            if is_decoy:
                self.neg_permutation_score_map[perm] = final_score
                logger.debug(f"Stored decoy score: {perm} -> {final_score:.6f}")
            else:
                self.pos_permutation_score_map[perm] = final_score
                logger.debug(f"Stored real score: {perm} -> {final_score:.6f}")

        # Collect all scores into a list
        all_scores = []
        for score in self.pos_permutation_score_map.values():
            all_scores.append(score)

        # Only add decoy scores in non-unambiguous cases
        if not self.is_unambiguous:
            for score in self.neg_permutation_score_map.values():
                all_scores.append(score)

        # Sort (from low to high, then reverse)
        all_scores.sort()  # From low to high
        all_scores.reverse()  # From high to low

        # Get first two scores, using 6 decimal precision
        score1 = self._round_dbl(all_scores[0], 6) if all_scores else 0.0
        score2 = 0.0
        if not self.is_unambiguous:
            score2 = self._round_dbl(all_scores[1], 6) if len(all_scores) > 1 else 0.0

        # Find corresponding peptide sequence
        pep1 = ""
        pep2 = ""
        num_assigned = 0

        if not self.is_unambiguous:
            # First find highest and second highest scores in non-decoy permutations
            for p, s in self.pos_permutation_score_map.items():
                x = s
                d = self._round_dbl(x, 6)

                if (d == score1) and (pep1 == ""):
                    pep1 = p
                    num_assigned += 1
                elif (d == score2) and (pep2 == ""):
                    pep2 = p
                    num_assigned += 1

                if num_assigned == 2:
                    break

            # If not found, search in decoy permutations
            if num_assigned != 2:
                for p, s in self.neg_permutation_score_map.items():
                    x = s
                    d = self._round_dbl(x, 6)

                    if (d == score1) and (pep1 == ""):
                        pep1 = p
                        num_assigned += 1
                    elif (d == score2) and (pep2 == ""):
                        pep2 = p
                        num_assigned += 1

                    if num_assigned == 2:
                        break
        else:
            # Special handling for unambiguous cases
            for p, s in self.pos_permutation_score_map.items():
                pep1 = p
                break

        # Calculate delta score
        if self.is_unambiguous:
            # Unambiguous case: deltaScore = score1
            self.delta_score = score1
            logger.debug(f"Unambiguous case, delta score: {self.delta_score}")
        else:
            # Ambiguous case: deltaScore = score1 - score2
            self.delta_score = score1 - score2
            logger.debug(f"Non-unambiguous case, delta score: {score1} - {score2} = {self.delta_score}")

        # Set psm_score to score1
        self.psm_score = score1

        self.score1_pep = pep1
        self.score2_pep = pep2

        # Set is_decoy flag based on highest scoring peptide
        if pep1:
            self.is_decoy = self._is_decoy_sequence(pep1)

    except Exception as e:
        logger.error(f"Error scoring permutations: {str(e)}")
        self.delta_score = 0.0
        self.psm_score = 0.0
Decoy Handling

_is_decoy_sequence and generate_decoy_permutations rely on DECOY_AA_MAP and get_decoy_symbol; ensure AA mass lookup in _get_aa_mass supports decoy symbols (currently uppercasing drops decoy symbol mapping). Also tolerance is doubled for decoys; validate this does not bias delta scores unfairly.

def _is_decoy_sequence(self, sequence: str) -> bool:
    """
    Check if a sequence is a decoy sequence.

    Args:
        sequence: Peptide sequence to check

    Returns:
        bool: True if sequence is a decoy sequence
    """
    try:
        # First remove modification markers, then check for decoy amino acids
        unmod_seq = ""
        i = 0
        while i < len(sequence):
            if sequence[i:i+9] == "(Phospho)":
                i += 9
            elif sequence[i:i+11] == "(Oxidation)":
                i += 11
            else:
                unmod_seq += sequence[i]
                i += 1

        # Check if sequence contains decoy amino acid symbols after removing modifications
        for aa in unmod_seq:
            if aa in DECOY_AA_MAP:
                return True

        return False

    except Exception as e:
        logger.error(f"Error checking decoy sequence: {str(e)}")
        return False

def _get_mod_map(self, perm: str) -> Dict[int, float]:
    """
    Get peptide modification site mapping

    Args:
        perm: Peptide permutation

    Returns:
        Dict[int, float]: Position -> modification mass
    """
    mod_map = {}

    # Handle lowercase letter format modifications (new format)
    for i, aa in enumerate(perm):
        if aa.islower() and aa.upper() in ['S', 'T', 'Y']:
            # Lowercase letters indicate phosphorylation modification sites
            mod_map[i] = PHOSPHO_MOD_MASS
        elif aa.islower() and aa.upper() in ['M', 'W', 'F', 'Y']:
            # Lowercase letters indicate oxidation modification sites
            mod_map[i] = OXIDATION_MASS
        elif aa in DECOY_AA_MAP:
            # Handle decoy amino acids
            # In Java, decoy amino acids already have extra mass, no need to add modification mass
            # Because decoy amino acid mass already includes DECOY_MASS in AA_MASSES
            pass  # Don't add extra modification mass

    # Handle bracket format modifications (compatibility with old format)
    i = 0
    while i < len(perm):
        if perm[i:i+9] == "(Phospho)":
            # Get original amino acid at modification site
            orig_aa = perm[i-1].upper()  # Convert to uppercase
            if orig_aa in AA_MASSES:
                mod_map[i-1] = PHOSPHO_MOD_MASS
            i += 9
        elif perm[i:i+11] == "(Oxidation)":
            # Get original amino acid at modification site
            orig_aa = perm[i-1].upper()
            if orig_aa in AA_MASSES:
                mod_map[i-1] = OXIDATION_MASS
            i += 11
        else:
            i += 1

    return mod_map

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Inconsistent two-stage workflow

The codebase implements two different two-stage pipelines (CoreProcessor’s
process_stage1/process_stage2 vs. PSM.process/process_round2 and CLI’s round
0/2), with mismatched round numbers and divergent FLR assignment paths. This
duplication risks inconsistent scoring/FLR results and hard-to-reproduce
behavior; converge on a single orchestrator and a single round semantic (e.g.,
RN=0/1) for permutation generation, scoring, and FLR mapping used consistently
across CLI, core, PSM, and parallel modules.

Examples:

pyLucXor/cli.py [508-555]
        # 5. Score all PSMs with trained model
        self.logger.info("Starting first round calculation (including decoy permutations)...")
        
        # Use multi-threading for PSM processing
        num_threads = get_optimal_thread_count(len(all_psms), max_threads=args.num_threads)
        self.logger.info(f"Using {num_threads} threads for PSM processing...")
        
        parallel_psm_processing(all_psms, model=self.model, round_number=0, num_threads=num_threads)

        # === Round 1: Execute FLR calculation and establish mapping relationships ===

 ... (clipped 38 lines)
pyLucXor/core.py [40-62]
    def process_all_psms(self) -> None:
        """
        Process all PSMs using a two-stage workflow:
        Stage 1 (RN=0): Calculate FLR estimates
        Stage 2 (RN=1): Assign FLR values to PSMs
        """
        logger.info(f"Starting to process all PSMs, total PSM count: {len(self.psms)}")
        
        # Create FLR calculator
        self.flr_calculator = FLRCalculator(

 ... (clipped 13 lines)

Solution Walkthrough:

Before:

# In cli.py, a manual two-round workflow is implemented
class PyLuciPHOr2:
    def run(self):
        # ...
        parallel_psm_processing(all_psms, model, round_number=0)
        flr_calc = FLRCalculator(...)
        flr_calc.calculate_flr_estimates(all_psms)
        # ...
        parallel_psm_processing(all_psms, flr_calc, round_number=2)

# In core.py, a different two-stage workflow exists but is unused
class CoreProcessor:
    def process_all_psms(self):
        self._process_psms_stage1() # Calls psm.process() with round 0
        self._process_psms_stage2() # Calls a non-existent psm.process_stage2()

After:

# In cli.py, delegate workflow to a single orchestrator
class PyLuciPHOr2:
    def run(self):
        # ...
        processor = CoreProcessor(config, all_psms)
        processor.run_workflow()
        processor.write_results(...)

# In core.py, a single, consistent workflow is defined
class CoreProcessor:
    def run_workflow(self):
        # Round 1 (with decoys)
        parallel_psm_processing(self.psms, model, round_number=1)
        self.flr_calculator.calculate_and_map_flr(self.psms)

        # Round 2 (no decoys)
        parallel_psm_processing(self.psms, self.flr_calculator, round_number=2)
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical architectural flaw where two separate and inconsistent two-stage workflows are implemented in cli.py and core.py, which impacts correctness and maintainability.

High
Possible issue
Fix residue indexing with mods

The loop indexes amino acids by string position, but perm includes bracketed
annotations and decoy symbols; indexing by string index causes mod positions and
ion lengths to be wrong. Iterate over the amino-acid sequence only (excluding
annotations) while tracking residue index, and apply mod_map using residue
indices to avoid off-by-one and mass accumulation errors.

pyLucXor/psm.py [854-914]

 def _calc_theoretical_masses(self, perm: str) -> List[float]:
-    """
-    Calculate theoretical masses
-    ...
-    # Calculate b ions
+    if not perm:
+        return []
+    mod_map = self._get_mod_map(perm)  # keys are residue indices
+    masses = []
+    # Build a clean residue list with mapping from string index to residue index
+    residues = []
+    res_idx = -1
+    i = 0
+    while i < len(perm):
+        if perm[i:i+9] == "(Phospho)":
+            i += 9
+            continue
+        if perm[i:i+11] == "(Oxidation)":
+            i += 11
+            continue
+        ch = perm[i]
+        if ch in ['[', ']']:
+            i += 1
+            continue
+        # new residue
+        res_idx += 1
+        residues.append((res_idx, ch))
+        i += 1
+    # b-ions
     current_mass = 0.0
-    for i in range(len(perm)):
-        if perm[i] not in ['(', ')']:  # Skip modification markers
-            # Handle amino acids, including lowercase letters (modification sites)
-            if perm[i].upper() in AA_MASSES:
-                current_mass += self._get_aa_mass(perm[i].upper())
-            elif perm[i] in DECOY_AA_MAP:
-                # Handle decoy amino acids - directly use decoy amino acid mass
-                # Decoy amino acid mass already includes DECOY_MASS in AA_MASSES
-                current_mass += self._get_aa_mass(perm[i])
-        if i in mod_map:
-            current_mass += mod_map[i]
-        # Only add ions longer than minimum length
-        if i >= 1:  # Minimum length is 2
-            # Only generate ions with charge less than peptide charge
-            for z in range(1, self.charge):  # Only generate charge states 1 to charge-1
-                ion_mass = (current_mass + PROTON_MASS) / z
-                if ion_mass < self.peptide.get_precursor_mass():
-                    masses.append(ion_mass)
-    ...
-    # Calculate y ions
+    for k, (ridx, ch) in enumerate(residues):
+        # amino-acid mass (handles decoys via _get_aa_mass)
+        current_mass += self._get_aa_mass(ch)
+        if ridx in mod_map:
+            current_mass += mod_map[ridx]
+        if k >= 1:
+            for z in range(1, max(1, (self.charge or 1))):
+                ion_mz = (current_mass + PROTON_MASS) / z
+                if ion_mz < self.peptide.get_precursor_mass():
+                    masses.append(ion_mz)
+    # y-ions
     current_mass = 0.0
-    for i in range(len(perm)-1, -1, -1):
-        if perm[i] not in ['(', ')']:  # Skip modification markers
-            # Handle amino acids, including lowercase letters (modification sites)
-            if perm[i].upper() in AA_MASSES:
-                current_mass += self._get_aa_mass(perm[i].upper())
-            elif perm[i] in DECOY_AA_MAP:
-                # Handle decoy amino acids - directly use decoy amino acid mass
-                # Decoy amino acid mass already includes DECOY_MASS in AA_MASSES
-                current_mass += self._get_aa_mass(perm[i])
-        if i in mod_map:
-            current_mass += mod_map[i]
-        # Only add ions longer than minimum length
-        if i <= len(perm)-2:  # Minimum length is 2
-            # Only generate ions with charge less than peptide charge
-            for z in range(1, self.charge):  # Only generate charge states 1 to charge-1
-                ion_mass = (current_mass + WATER_MASS + PROTON_MASS) / z
-                if ion_mass < self.peptide.get_precursor_mass():
-                    masses.append(ion_mass)
+    for k, (ridx, ch) in enumerate(reversed(residues)):
+        current_mass += self._get_aa_mass(ch)
+        if ridx in mod_map:
+            current_mass += mod_map[ridx]
+        if k >= 1:
+            for z in range(1, max(1, (self.charge or 1))):
+                ion_mz = (current_mass + WATER_MASS + PROTON_MASS) / z
+                if ion_mz < self.peptide.get_precursor_mass():
+                    masses.append(ion_mz)
+    return masses

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical bug where iterating over the raw permutation string, which includes multi-character modification annotations, leads to incorrect residue indexing for mass calculations and ion length checks.

High
Use proper Spectrum object

spectrum_dict is a plain dict, but later workers expect attributes like
spectrum.native_id and methods like get_peaks(). This mismatch will cause
runtime errors. Wrap the mz/intensity arrays in your Spectrum class and attach
native_id and rt attributes to match downstream access.

pyLucXor/cli.py [432-450]

+# Build Spectrum object to satisfy downstream access
+mz_arr = np.array(spectrum.get_peaks()[0])
+int_arr = np.array(spectrum.get_peaks()[1])
+spectrum_obj = Spectrum(mz_arr, int_arr)
+spectrum_obj.native_id = spectrum.getNativeID()
+spectrum_obj.rt = spectrum.getRT()
+
 peptide = Peptide(sequence, config, charge=charge)
-psm = PSM(peptide, spectrum_dict, config=config)
+psm = PSM(peptide, spectrum_obj, config=config)
 
 # Automatically determine score type and assign values
 score_type = pep_id.getScoreType() if hasattr(pep_id, 'getScoreType') else None
 score = hit.getScore()
 higher_score_better = True
 if hasattr(pep_id, 'isHigherScoreBetter'):
     higher_score_better = pep_id.isHigherScoreBetter()
 elif hasattr(pep_id, 'getHigherScoreBetter'):
     higher_score_better = pep_id.getHigherScoreBetter()
 
-# If it's a PEP score (lower is better), convert to 1-PEP
 if (score_type and 'posterior error probability' in score_type.lower()) or (higher_score_better is False):
     psm.psm_score = 1.0 - score
     peptide.score = 1.0 - score
 else:
     psm.psm_score = score
     peptide.score = score
  • Apply / Chat
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies that passing a plain dictionary spectrum_dict to the PSM constructor will cause runtime errors, as downstream code expects a Spectrum object with specific methods and attributes.

High
Use tolerance for unmatched filtering

Comparing floats for membership causes many matched peaks to be misclassified as
unmatched due to small m/z differences. Use a tolerance-based match (e.g.,
absolute delta within the same fragment tolerance) when filtering unmatched
peaks to avoid contaminating the noise distribution.

pyLucXor/models.py [858-872]

-# Get unmatched peaks (get all peaks from spectrum, then exclude matched peaks)
+# Get unmatched peaks with tolerance-based exclusion
 mz_values, intensities = psm.spectrum.get_peaks()
-matched_mz = {peak['mz'] for peak in matched_peaks}
+tol = psm.config.get('fragment_mass_tolerance', 0.5)
+matched_mz = np.array([pk['mz'] for pk in matched_peaks], dtype=float)
 
 for i in range(len(mz_values)):
-    if mz_values[i] not in matched_mz:
+    mz = float(mz_values[i])
+    is_matched = False
+    if matched_mz.size > 0:
+        # check within tolerance
+        if np.any(np.abs(matched_mz - mz) <= tol):
+            is_matched = True
+    if not is_matched:
         neg_peak = {
-            'mz': mz_values[i],
+            'mz': mz,
             'intensity': intensities[i],
             'norm_intensity': psm.spectrum.norm_intensity[i] if hasattr(psm.spectrum, 'norm_intensity') else intensities[i],
             'mass_diff': 0.0,
             'matched': False,
             'matched_ion_str': 'n'
         }
         neg_peaks.append(neg_peak)
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly points out that using direct float comparison for m/z values is unreliable due to precision issues, and proposes a more robust tolerance-based check, which is crucial for accurately separating matched and unmatched peaks.

Medium
Fix wrong constructor arguments

The Peptide constructor signature is (peptide, config, mod_pep=None,
charge=None), but here the config and an empty dict are swapped. This will break
initialization and downstream logic. Pass config as the second argument and
remove the stray empty dict.

pyLucXor/cli.py [295-296]

 # Create peptide object
-peptide = Peptide(sequence, {}, config, charge=charge)
+peptide = Peptide(sequence, config, charge=charge)
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that the arguments to the Peptide constructor are incorrect, which would cause a runtime error if the process_peptide function is called.

Medium
Prevent y-ion water double-counting

The addl_mass parameter is never used, and y-ions add WATER_MASS twice because
callers already include 18.01056 in ym calculations. Either remove
double-counting or use addl_mass consistently. Use addl_mass to avoid duplicated
water addition and keep callers' logic intact.

pyLucXor/peptide.py [270-291]

 def _fragment_ion_mass(self, ion_str, z, addl_mass, ion_type='b'):
     """
     Calculate fragment ion mass
     Receives complete ion string, such as "b4:FLLE" or "y2:sK"
     """
     mass = 0.0
     # Parse amino acid sequence starting after colon
     start = ion_str.find(":") + 1
     stop = len(ion_str)
     seq = ion_str[start:stop]
     
-    for idx, aa in enumerate(seq):
-        # Directly use mass mapping table, lowercase letters represent modified amino acids
+    for aa in seq:
         if aa in AA_MASSES:
             mass += AA_MASSES[aa]
     
-    # Both b/y ions add z proton masses, y ions also add 1 water molecule mass
+    # Apply any additional mass (e.g., WATER_MASS for y-ions) from caller
+    mass += addl_mass
+    # Add z protons
     mass += PROTON_MASS * z
-    if ion_type == 'y':
-        mass += WATER_MASS
     
     return mass
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies that WATER_MASS is added twice for y-ions, once in the caller and once in _fragment_ion_mass, leading to incorrect mass calculations.

Medium
Fix inconsistent max index tracking

Store max_i_index as the index within the filtered arrays, not the original
input index, to keep it consistent with self.mz and self.raw_intensity. This
prevents out-of-bounds or mismatched indexing when using max_i_index later.

pyLucXor/spectrum.py [58-65]

 for i, idx in enumerate(cand_peaks):
     intensity = intensity_array[idx]
     if intensity > self.max_i:
         self.max_i = intensity
-        self.max_i_index = idx
+        self.max_i_index = i  # index within filtered arrays
     
     self.mz[i] = mz_array[idx]
     self.raw_intensity[i] = intensity
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies an inconsistency in how max_i_index is stored, which could lead to out-of-bounds errors if the attribute were used. This change improves code correctness and robustness.

Low
General
Normalize neutral loss sign convention

Use consistent sign conventions for neutral losses; representing a loss as
positive mass while another entry uses a negative for the same loss is
inconsistent and can cause double-negation bugs. Align the amino-acid-specific
entries with the 'sty' loss to negative values.

pyLucXor/constants.py [148-159]

 # Neutral losses
 NEUTRAL_LOSSES = {
-    'H3PO4': 97.976896,  # Phosphoric acid
-    'H2O': 18.010565,    # Water
-    'NH3': 17.026549,    # Ammonia
-    'CO': 27.994915,     # Carbon monoxide
-    'CO2': 43.989829,    # Carbon dioxide
-    'sty': -97.97690,    # H3PO4
-    'S': 98.00039,  # Ser phosphorylation neutral loss
-    'T': 98.00039,  # Thr phosphorylation neutral loss
-    'Y': 98.00039   # Tyr phosphorylation neutral loss
+    'H3PO4': 97.976896,   # Phosphoric acid (magnitude)
+    'H2O': 18.010565,     # Water (magnitude)
+    'NH3': 17.026549,     # Ammonia (magnitude)
+    'CO': 27.994915,      # Carbon monoxide (magnitude)
+    'CO2': 43.989829,     # Carbon dioxide (magnitude)
+    'sty': -97.97690,     # Loss of H3PO4 applied to S/T/Y
+    'S': -97.97690,       # Ser phosphorylation neutral loss (consistent with sty)
+    'T': -97.97690,       # Thr phosphorylation neutral loss
+    'Y': -97.97690        # Tyr phosphorylation neutral loss
 }
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies an inconsistency in the sign convention for neutral loss masses, which could lead to bugs. Unifying the signs improves code correctness and maintainability.

Medium
  • More

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 21

🧹 Nitpick comments (39)
pyLucXor/README.md (2)

170-173: Specify a language for the fenced code block (markdownlint MD040).

Add a language to the citation block for consistency with linters and better rendering.

Apply this diff:

-```
+```text
 Fermin, D., et al. (2011). LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using target-decoy approach. 
 Molecular & Cellular Proteomics, 10(3), M110.003830.

---

`30-34`: **Replace placeholder repository URL in install instructions.**

The placeholder `<repository-url>` can cause copy-paste errors for users.

Example fix:

```diff
-git clone <repository-url>
+git clone https://github.com/bigbio/onsite.git

If this is a monorepo or subdirectory, add a note clarifying the path to the package root.

pyLucXor/constants.py (2)

59-66: Mixing modified residue masses into AA_MASSES via lowercase keys can be brittle.

Encoding PTMs by overloading residue case makes parsing fragile and opaque. Prefer explicit modification handling (e.g., a sequence + PTM map or using MOD_MASSES and dedicated logic).

Consider removing lowercase entries from AA_MASSES and centralizing PTM mass adjustments in peptide/psm logic. If you keep this, document the convention prominently and ensure all parsers/generators adhere to it.


95-100: Decoy mass augmentation mutates AA_MASSES; ensure no double-add and document.

The guard prevents double-insertion per interpreter session, but the side-effectful mutation on import should be documented since it affects any consumer importing constants.

Add a comment above the loop explaining idempotency and intent, or move to a function that returns an augmented copy when needed to avoid global mutation.

pyLucXor/config.py (2)

62-70: Align set semantics with update: warn on unknown keys.

Currently set silently introduces new keys while update warns on unknown keys. This inconsistency is surprising.

Apply this diff:

     def set(self, key: str, value: Any) -> None:
         """
         Set a configuration value.
@@
-        self.config[key] = value
+        if key in self.config:
+            self.config[key] = value
+        else:
+            logger.warning(f"Unknown configuration key: {key}")

93-101: Also guard __setitem__ against unknown keys for consistency.

Keep behavior consistent across all mutation paths.

Apply this diff:

     def __setitem__(self, key: str, value: Any) -> None:
         """
         Set a configuration value using dictionary syntax.
@@
-        self.config[key] = value
+        if key in self.config:
+            self.config[key] = value
+        else:
+            logger.warning(f"Unknown configuration key: {key}")
pyLucXor/globals.py (2)

5-11: Remove unused import Optional.

Optional is not used.

Apply this diff:

-from typing import List, Dict, Optional
+from typing import List, Dict

81-99: Use direct reverse map for decoy symbol lookup.

You re-derive the reverse mapping via iteration. Use AA_DECOY_MAP for O(1) lookup and simpler code.

Apply this diff:

-from .constants import DECOY_AA_MAP
+from .constants import AA_DECOY_MAP
@@
-    ret = ""
-    src_char = c.upper()
-    
-    for k, v in DECOY_AA_MAP.items():
-        if v.upper() == src_char:
-            ret = k
-            break
-            
-    return ret 
+    return AA_DECOY_MAP.get(c.upper(), "")
pyLucXor/peak.py (1)

119-127: Make from_dict robust to missing keys.

Use .get() with sensible defaults to ease forward/backward compatibility of serialized data.

Apply this diff:

-        peak = cls(data["mz"], data["raw_intensity"], data["norm_intensity"])
-        peak.rel_intensity = data["rel_intensity"]
-        peak.matched = data["matched"]
-        peak.dist = data["dist"]
-        peak.matched_ion_str = data["matched_ion_str"]
-        peak.matched_ion_mz = data["matched_ion_mz"]
-        peak.score = data["score"]
-        peak.intensity_score = data["intensity_score"]
-        peak.dist_score = data["dist_score"]
+        peak = cls(
+            data["mz"],
+            data["raw_intensity"],
+            data.get("norm_intensity", 0.0),
+        )
+        peak.rel_intensity = data.get("rel_intensity", 0.0)
+        peak.matched = data.get("matched", False)
+        peak.dist = data.get("dist", 0.0)
+        peak.matched_ion_str = data.get("matched_ion_str", "")
+        peak.matched_ion_mz = data.get("matched_ion_mz", 0.0)
+        peak.score = data.get("score", 0.0)
+        peak.intensity_score = data.get("intensity_score", 0.0)
+        peak.dist_score = data.get("dist_score", 0.0)
pyLucXor/spectrum.py (4)

9-10: Remove unused imports to satisfy linter and reduce import overhead

Dict, List, Optional, and pyopenms are not used in this module.

-from typing import Dict, List, Tuple, Optional
-import pyopenms
+from typing import Tuple

43-66: Vectorize peak initialization for correctness and speed; also clarify max_i_index semantics

You can replace the loop-based zero-intensity filtering and manual array population with a vectorized approach. This both simplifies the code and ensures max_i_index is the index in the filtered arrays (current code stores the index from the original arrays, which can be confusing downstream).

Is max_i_index intended to be an index into the filtered arrays (self.raw_intensity) or the original input arrays? The change below makes it index into the filtered arrays.

-        # Filter out zero intensity peaks
-        cand_peaks = []
-        for i in range(N):
-            if intensity_array[i] > 0:
-                cand_peaks.append(i)
-
-        self.N = len(cand_peaks)
-
-        # Initialize arrays
-        self.mz = np.zeros(self.N)
-        self.raw_intensity = np.zeros(self.N)
-        self.rel_intensity = np.zeros(self.N)
-        self.norm_intensity = np.zeros(self.N)
-
-        # Record peak data
-        for i, idx in enumerate(cand_peaks):
-            intensity = intensity_array[idx]
-            if intensity > self.max_i:
-                self.max_i = intensity
-                self.max_i_index = idx
-            
-            self.mz[i] = mz_array[idx]
-            self.raw_intensity[i] = intensity
+        # Filter out zero intensity peaks (vectorized)
+        mz_array = np.asarray(mz_array)
+        intensity_array = np.asarray(intensity_array)
+        mask = intensity_array[:N] > 0
+        self.N = int(np.count_nonzero(mask))
+
+        # Initialize arrays
+        self.mz = mz_array[:N][mask].astype(float, copy=False)
+        self.raw_intensity = intensity_array[:N][mask].astype(float, copy=False)
+        self.rel_intensity = np.zeros(self.N, dtype=float)
+        self.norm_intensity = np.zeros(self.N, dtype=float)
+
+        # Record maxima
+        if self.N > 0:
+            self.max_i_index = int(np.argmax(self.raw_intensity))
+            self.max_i = float(self.raw_intensity[self.max_i_index])

83-101: Use numpy median and vectorized log for normalization

Equivalent, faster, and safer. Also guards against accidental non-positive medians.

-        # Sort relative intensities
-        sorted_intensities = sorted(self.rel_intensity)
-        
-        # Calculate median
-        mid = self.N // 2
-        if self.N % 2 == 0:  # even number of peaks
-            a = sorted_intensities[mid - 1]
-            b = sorted_intensities[mid]
-            median_i = (a + b) / 2.0
-        else:  # odd number of peaks
-            median_i = sorted_intensities[mid]
-        
-        # Calculate normalized intensities
-        for i in range(self.N):
-            d = self.rel_intensity[i] / median_i
-            if d > 0:  # avoid log(0)
-                self.norm_intensity[i] = np.log(d)
-            else:
-                self.norm_intensity[i] = float('-inf')
+        # Compute median using numpy
+        median_i = float(np.median(self.rel_intensity))
+        if median_i <= 0:
+            # Defensive: if median is non-positive, set to -inf
+            self.norm_intensity.fill(float("-inf"))
+            return
+
+        # Vectorized log normalization; avoid log(0) by using where/out
+        self.norm_intensity = np.full_like(self.rel_intensity, float("-inf"))
+        np.log(self.rel_intensity / median_i,
+               out=self.norm_intensity,
+               where=self.rel_intensity > 0)

131-134: Optional: speed up m/z index lookup

For large spectra, a vectorized search avoids Python loops; if self.mz is sorted, consider binary search, otherwise vectorized absolute diff works.

-        for i in range(self.N):
-            if abs(self.mz[i] - mz) < 1e-6:  # floating point comparison
-                return i
-        return -1 
+        idxs = np.where(np.abs(self.mz - mz) < 1e-6)[0]
+        return int(idxs[0]) if idxs.size else -1
pyLucXor/flr.py (5)

9-11: Drop unused imports (typing and concurrent.futures)

These symbols aren’t used here; remove them to appease linters and speed imports.

-from typing import List, Dict, Any, Optional, Tuple
-from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import List, Tuple

394-403: Guard against tiny denominators when computing FDR ratios

Division by very small g_auc_f1/l_auc_f1 can produce inf/NaN. Clamp denominator with TINY_NUM. You already use TINY_NUM in KDE; extend the safety here.

-            ratio = (Ndecoy2/Nreal2) * (g_auc_f0 / g_auc_f1)
+            denom = max(g_auc_f1, TINY_NUM)
+            ratio = (Ndecoy2 / Nreal2) * (g_auc_f0 / denom)
             self.global_fdr[i] = ratio
@@
-            ratio = (Ndecoy2/Nreal2) * (l_auc_f0 / l_auc_f1)
+            denom = max(l_auc_f1, TINY_NUM)
+            ratio = (Ndecoy2 / Nreal2) * (l_auc_f0 / denom)
             self.local_fdr[i] = ratio

125-138: Handle small-sample bandwidth edge cases explicitly

When N <= 1 or variance is 0, bandwidth should be 0 (or a small epsilon), not NaN/inf. Make this explicit.

         if data_type == REAL:
             sigma = np.sqrt(self.delta_score_var_pos)
             N = float(self.n_real)
-            x = np.power(N, 0.2)
-            result = 1.06 * (sigma / x)
+            if N <= 1 or sigma == 0.0:
+                result = 0.0
+            else:
+                result = 1.06 * (sigma / (N ** 0.2))
             self.bw_real = result
@@
         elif data_type == DECOY:
             sigma = np.sqrt(self.delta_score_var_neg)
             N = float(self.n_decoy)
-            x = np.power(N, 0.2)
-            result = 1.06 * (sigma / x)
+            if N <= 1 or sigma == 0.0:
+                result = 0.0
+            else:
+                result = 1.06 * (sigma / (N ** 0.2))
             self.bw_decoy = result

205-211: Treat non-finite/negative bandwidth as invalid

If bw becomes NaN/inf or negative, fall back to TINY_NUM densities to avoid propagating invalid values.

-        if N == 0 or bw == 0:
+        if N == 0 or not np.isfinite(bw) or bw <= 0:
             # Avoid division by zero
             if data_type == DECOY:
                 self.f0 = np.full(self.NMARKS, TINY_NUM)
             else:
                 self.f1 = np.full(self.NMARKS, TINY_NUM)
             return

460-481: Minorization slope divisions can divide by zero

Both denominators can hit zero in degenerate cases, producing infs. Add guards.

-            slope = (0.0 - min_val) / (self.max_delta_score * 1.1 - x[min_idx])
+            denom = (self.max_delta_score * 1.1 - x[min_idx])
+            slope = (0.0 - min_val) / denom if denom != 0 else 0.0
@@
-            slope = max_val / (x[max_idx] - x[-1])
+            denom2 = (x[max_idx] - x[-1])
+            slope = max_val / denom2 if denom2 != 0 else 0.0
pyLucXor/__init__.py (1)

22-25: Avoid F401 warnings for CIDModel/HCDModel lazy exports

Returning via locals()[name] still trips some linters. Use a small mapping for clarity and to placate static analysis.

-    elif name in ["CIDModel", "HCDModel"]:
-        from .models import CIDModel, HCDModel
-        return locals()[name]
+    elif name in ("CIDModel", "HCDModel"):
+        from .models import CIDModel, HCDModel
+        return {"CIDModel": CIDModel, "HCDModel": HCDModel}[name]
pyLucXor/core.py (3)

8-10: Clean up unused imports

numpy and typing (Dict, Any, Optional) are not used in this file.

-import numpy as np
 import pyopenms
-from typing import List, Dict, Any, Optional
+from typing import List

131-135: Avoid double assignment of FLR values after process_round2

process_round2 is expected to assign FLR via find_closest_flr. Calling assign_flr_to_psms again may overwrite those values and currently assigns FLR to decoys (fixed separately). Consider removing this to keep a single source of truth.

Would you like me to adjust the stage-2 flow to rely solely on process_round2 for FLR assignment?


200-203: CSV export writes modified peptide equal to unmodified peptide

Use the modified representation if available.

-                    f.write(f"{psm.scan_num},{psm.peptide.peptide},{psm.peptide.peptide},{psm.psm_score:.4f},{psm.delta_score:.4f},{psm.global_flr:.4f},{psm.local_flr:.4f},{1 if psm.is_decoy else 0}\n")
+                    mod_pep = getattr(psm.peptide, "mod_peptide", psm.peptide.peptide)
+                    f.write(f"{psm.scan_num},{psm.peptide.peptide},{mod_pep},{psm.psm_score:.4f},{psm.delta_score:.4f},{psm.global_flr:.4f},{psm.local_flr:.4f},{1 if psm.is_decoy else 0}\n")
pyLucXor/cli.py (6)

10-11: Prune unused imports

json, NTERM_MOD, CTERM_MOD, and AA_MASSES are not used in this file.

-import json
@@
-from pyLucXor.constants import (
-    NTERM_MOD,
-    CTERM_MOD,
-    AA_MASSES,
-    DEFAULT_CONFIG
-)
+from pyLucXor.constants import DEFAULT_CONFIG

Also applies to: 29-33


528-531: Flatten nested if for readability

Minor readability improvement; no behavior change.

-            if hasattr(psm, 'delta_score') and not np.isnan(psm.delta_score):
-                if psm.delta_score > global_flr_calculator.min_delta_score:
-                    global_flr_calculator.add_psm(psm.delta_score, psm.is_decoy)
+            if hasattr(psm, 'delta_score') and not np.isnan(psm.delta_score) \
+               and psm.delta_score > global_flr_calculator.min_delta_score:
+                global_flr_calculator.add_psm(psm.delta_score, psm.is_decoy)

395-395: Rename unused loop index variable

i is unused; use _ to signal intent and silence linters.

-        for i, pep_id in enumerate(pep_ids, 1):
+        for _, pep_id in enumerate(pep_ids, 1):

666-668: Avoid bare except in MGF TITLE parsing

Catching all exceptions hides real errors; scope to expected ones.

-                        except:
+                        except (ValueError, IndexError):
                             logger.warning(f"Cannot extract scan number from title: {line}")

691-693: Avoid bare except when parsing peak lines

Be explicit about parsing failures.

-                        except:
+                        except ValueError:
                             logger.warning(f"Cannot parse peak data: {line}")

521-542: Consider skipping decoy FLR assignment in round 1 mapping

After record_flr_estimates, calling assign_flr_to_psms will assign FLR to decoys unless assign_flr_to_psms is fixed (see flr.py). Once fixed, this is okay. Otherwise, skip assigning here and only assign in round 2.

Do you want me to refactor the flow to avoid assigning FLR in round 1 entirely and rely strictly on round 2 mapping?

pyLucXor/parallel.py (3)

181-187: Thread chunking can create highly imbalanced work for small lists

Using len(psms) // num_threads can create many 1-element tasks and underutilize threads for small inputs. Consider reusing get_optimal_thread_count to bound threads to data size.

Apply this diff:

-    if num_threads is None:
-        num_threads = os.cpu_count() or DEFAULT_NUM_THREADS
+    if num_threads is None:
+        num_threads = get_optimal_thread_count(len(psms))

96-113: Remove unused locals and narrow exception in SpectrumMatchingWorker.match_spectrum_peptide

spectrum_native_id and rt are unused; the inner except Exception as e doesn’t use e.

Apply this diff:

-            # Get spectrum basic information
-            spectrum_native_id = getattr(psm.spectrum, 'native_id', None)
-            rt = getattr(psm.spectrum, 'rt', None)
+            # Spectrum basic information is attached below via getattr on psm/psm.spectrum as needed
@@
-            try:
-                matched_peaks = psm.peptide.match_peaks(psm.spectrum)
-            except Exception as e:
-                matched_peaks = []
+            try:
+                matched_peaks = psm.peptide.match_peaks(psm.spectrum)
+            except Exception:
+                matched_peaks = []

11-11: Drop unused import

threading is unused in this module.

-import threading
pyLucXor/psm.py (3)

781-786: Simplify and harden decoy detection

Use any(...) directly; it’s clearer and avoids early-return logic errors.

-            for aa in unmod_seq:
-                if aa in DECOY_AA_MAP:
-                    return True
-                    
-            return False
+            return any(aa in DECOY_AA_MAP for aa in unmod_seq)

1216-1219: Remove unused local variable

original_tolerance is assigned but never used.

-        original_tolerance = self.config.get('fragment_mass_tolerance', 0.1)
         temp_config = self.config.copy()

7-28: Prune unused imports (keeps runtime trim and static analysis clean)

A number of imports are unused in this module.

-import math
-from typing import Dict, List, Optional, Tuple, Set, Any, Union
+import math
+from typing import Dict, List, Optional, Tuple, Any, Union
@@
-import pyopenms
 from pyopenms import AASequence, Modification
@@
-    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP, AA_DECOY_MAP, NEUTRAL_LOSSES, MIN_DELTA_SCORE, PROTON_MASS, WATER_MASS,
-    DECOY_AMINO_ACIDS, PHOSPHO_MOD_MASS, OXIDATION_MASS
+    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP, PROTON_MASS, WATER_MASS,
+    PHOSPHO_MOD_MASS, OXIDATION_MASS
 )
-from .peak import Peak
 from .peptide import Peptide
 from .spectrum import Spectrum
-from .flr import FLRCalculator
 from .globals import get_decoy_symbol
pyLucXor/peptide.py (2)

687-701: Remove unused debug block in match_peaks or emit it to logs

output_data is built but never used. Either remove it or log/emit when debug is enabled.

-        if use_config.get('debug', False):
-            output_data = {
-                'matched_peaks': matched_peaks,
-                'theoretical_ions': {
-                    'y_ions': y_ions,
-                    'b_ions': b_ions
-                },
-                'peptide': self.peptide,
-                'modified_peptide': self.mod_peptide,
-                'charge': self.charge
-            }
+        if use_config.get('debug', False):
+            logger.debug(f"Matched peaks: {len(matched_peaks)} | peptide={self.peptide} z={self.charge}")

521-586: Potential index mismatch in get_permutations for ambiguous peptides

You iterate sites from self.peptide (which contains marker substrings) but then index into self.mod_peptide. These indices likely diverge once markers are present, producing incorrect lowercase positions.

Consider deriving candidate sites from a “marker-free” sequence and mapping to mod_peptide indices consistently. For example, build an index map from original peptide positions (excluding markers) to mod_peptide indices before flipping case. I can draft this refactor if this method is still in use; otherwise consider removing it to reduce confusion since PSM.generate_permutations covers the needed logic.

pyLucXor/models.py (3)

936-964: Remove unused local and simplify CID distance log-density

mu is assigned but not used; the implementation assumes mean 0. Keep it concise.

-        mu = 0.0 
-        var = charge_model.var_dist_b  
+        var = charge_model.var_dist_b
@@
-        log_prob = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x ** 2) / var
+        log_prob = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x ** 2) / var
         return log_prob

8-18: Trim unused imports

typing.Any and constants ALGORITHM_CID/ALGORITHM_HCD are unused. Keep imports tight.

-from typing import Dict, List, Optional, Tuple, Any
+from typing import Dict, List, Optional, Tuple
@@
-from .constants import ALGORITHM_CID, ALGORITHM_HCD, TINY_NUM
+from .constants import TINY_NUM

535-553: Style: collapse simple if/else paddings with ternaries (optional)

The padding of min_i/max_i blocks can be simplified; not functionally required but improves readability.

-        if min_i < 0:
-            min_i = min_i + (min_i * padding)
-        else:
-            min_i = min_i - (min_i * padding)
+        min_i = min_i + (min_i * padding) if min_i < 0 else min_i - (min_i * padding)
@@
-        if max_i < 0:
-            max_i = max_i - (max_i * padding)
-        else:
-            max_i = max_i + (max_i * padding)
+        max_i = max_i - (max_i * padding) if max_i < 0 else max_i + (max_i * padding)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 778df36 and c5f1ad6.

📒 Files selected for processing (14)
  • pyLucXor/README.md (1 hunks)
  • pyLucXor/__init__.py (1 hunks)
  • pyLucXor/cli.py (1 hunks)
  • pyLucXor/config.py (1 hunks)
  • pyLucXor/constants.py (1 hunks)
  • pyLucXor/core.py (1 hunks)
  • pyLucXor/flr.py (1 hunks)
  • pyLucXor/globals.py (1 hunks)
  • pyLucXor/models.py (1 hunks)
  • pyLucXor/parallel.py (1 hunks)
  • pyLucXor/peak.py (1 hunks)
  • pyLucXor/peptide.py (1 hunks)
  • pyLucXor/psm.py (1 hunks)
  • pyLucXor/spectrum.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (10)
pyLucXor/flr.py (2)
pyLucXor/core.py (1)
  • add_psm (31-38)
pyLucXor/globals.py (1)
  • record_flr_estimates (33-42)
pyLucXor/globals.py (2)
pyLucXor/flr.py (1)
  • record_flr_estimates (642-663)
pyLucXor/psm.py (1)
  • clear_scores (1015-1037)
pyLucXor/psm.py (5)
pyLucXor/peak.py (1)
  • Peak (10-129)
pyLucXor/peptide.py (4)
  • get_precursor_mass (165-196)
  • get_unmodified_sequence (883-907)
  • build_ion_ladders (207-268)
  • match_peaks (587-702)
pyLucXor/spectrum.py (5)
  • Spectrum (14-134)
  • is_empty (112-119)
  • median_normalize_spectra (75-101)
  • calc_relative_intensity (70-73)
  • get_peaks (103-110)
pyLucXor/flr.py (2)
  • add_psm (61-81)
  • find_closest_flr (736-764)
pyLucXor/globals.py (2)
  • globals (13-79)
  • get_decoy_symbol (81-99)
pyLucXor/parallel.py (3)
pyLucXor/psm.py (3)
  • score_permutations (556-733)
  • process (322-452)
  • process_round2 (454-488)
pyLucXor/cli.py (1)
  • process_psms (741-773)
pyLucXor/peptide.py (1)
  • match_peaks (587-702)
pyLucXor/__init__.py (4)
pyLucXor/cli.py (1)
  • PyLuciPHOr2 (40-773)
pyLucXor/peptide.py (1)
  • Peptide (24-907)
pyLucXor/psm.py (1)
  • PSM (31-1367)
pyLucXor/models.py (2)
  • CIDModel (787-964)
  • HCDModel (967-1122)
pyLucXor/core.py (3)
pyLucXor/psm.py (5)
  • PSM (31-1367)
  • process (322-452)
  • clear_scores (1015-1037)
  • get_results (916-926)
  • update_peptide_id (1279-1321)
pyLucXor/flr.py (5)
  • FLRCalculator (15-798)
  • add_psm (61-81)
  • calculate_flr_estimates (596-640)
  • record_flr_estimates (642-663)
  • assign_flr_to_psms (665-705)
pyLucXor/config.py (3)
  • LucXorConfig (14-113)
  • get (49-60)
  • to_dict (72-79)
pyLucXor/peptide.py (4)
pyLucXor/globals.py (2)
  • globals (13-79)
  • get_decoy_symbol (81-99)
pyLucXor/peak.py (1)
  • Peak (10-129)
pyLucXor/spectrum.py (1)
  • get_peaks (103-110)
pyLucXor/models.py (8)
  • get_charge_model (884-894)
  • get_charge_model (1065-1075)
  • get_log_np_density_int (662-723)
  • get_log_np_density_int (896-934)
  • get_log_np_density_int (1077-1099)
  • get_log_np_density_dist_pos (725-769)
  • get_log_np_density_dist_pos (936-964)
  • get_log_np_density_dist_pos (1101-1122)
pyLucXor/cli.py (6)
pyLucXor/psm.py (3)
  • PSM (31-1367)
  • charge (1244-1248)
  • process (322-452)
pyLucXor/peptide.py (2)
  • Peptide (24-907)
  • build_ion_ladders (207-268)
pyLucXor/models.py (4)
  • CIDModel (787-964)
  • HCDModel (967-1122)
  • build (802-834)
  • build (985-1013)
pyLucXor/spectrum.py (2)
  • Spectrum (14-134)
  • get_peaks (103-110)
pyLucXor/flr.py (6)
  • FLRCalculator (15-798)
  • add_psm (61-81)
  • calculate_flr_estimates (596-640)
  • record_flr_estimates (642-663)
  • assign_flr_to_psms (665-705)
  • save_delta_score_flr_mapping (711-734)
pyLucXor/parallel.py (3)
  • parallel_psm_processing (170-201)
  • get_optimal_thread_count (240-261)
  • process_psms (31-34)
pyLucXor/constants.py (1)
pyLucXor/config.py (1)
  • update (36-47)
pyLucXor/models.py (4)
pyLucXor/peak.py (1)
  • Peak (10-129)
pyLucXor/config.py (1)
  • get (49-60)
pyLucXor/psm.py (2)
  • charge (1244-1248)
  • _match_peaks (1190-1227)
pyLucXor/spectrum.py (1)
  • get_peaks (103-110)
🪛 Ruff (0.12.2)
pyLucXor/flr.py

9-9: typing.Dict imported but unused

Remove unused import

(F401)


9-9: typing.Any imported but unused

Remove unused import

(F401)


9-9: typing.Optional imported but unused

Remove unused import

(F401)


10-10: concurrent.futures.ThreadPoolExecutor imported but unused

Remove unused import

(F401)


10-10: concurrent.futures.as_completed imported but unused

Remove unused import

(F401)

pyLucXor/globals.py

6-6: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

pyLucXor/spectrum.py

9-9: typing.Dict imported but unused

Remove unused import

(F401)


9-9: typing.List imported but unused

Remove unused import

(F401)


9-9: typing.Optional imported but unused

Remove unused import

(F401)


10-10: pyopenms imported but unused

Remove unused import: pyopenms

(F401)

pyLucXor/psm.py

9-9: typing.Set imported but unused

Remove unused import: typing.Set

(F401)


16-16: pyopenms imported but unused

Remove unused import: pyopenms

(F401)


20-20: .constants.AA_DECOY_MAP imported but unused

Remove unused import

(F401)


20-20: .constants.NEUTRAL_LOSSES imported but unused

Remove unused import

(F401)


20-20: .constants.MIN_DELTA_SCORE imported but unused

Remove unused import

(F401)


21-21: .constants.DECOY_AMINO_ACIDS imported but unused

Remove unused import

(F401)


23-23: .peak.Peak imported but unused

Remove unused import: .peak.Peak

(F401)


26-26: .flr.FLRCalculator imported but unused

Remove unused import: .flr.FLRCalculator

(F401)


506-506: Loop control variable sites not used within loop body

Rename unused sites to _sites

(B007)


525-528: Combine if branches using logical or operator

Combine if branches

(SIM114)


574-577: Use ternary operator tolerance = base_tolerance * 2.0 if is_decoy else base_tolerance instead of if-else-block

(SIM108)


610-613: Combine if branches using logical or operator

Combine if branches

(SIM114)


706-706: Loop control variable s not used within loop body

(B007)


782-786: Use return any(aa in DECOY_AA_MAP for aa in unmod_seq) instead of for loop

Replace with return any(aa in DECOY_AA_MAP for aa in unmod_seq)

(SIM110)


1089-1092: Combine if branches using logical or operator

Combine if branches

(SIM114)


1180-1184: Use return any(aa in DECOY_AA_MAP for aa in sequence) instead of for loop

Replace with return any(aa in DECOY_AA_MAP for aa in sequence)

(SIM110)


1217-1217: Local variable original_tolerance is assigned to but never used

Remove assignment to unused variable original_tolerance

(F841)


1265-1265: Loop control variable mod_positions not used within loop body

Rename unused mod_positions to _mod_positions

(B007)


1272-1272: Loop control variable mod_positions not used within loop body

Rename unused mod_positions to _mod_positions

(B007)

pyLucXor/parallel.py

11-11: threading imported but unused

Remove unused import: threading

(F401)


106-106: Local variable spectrum_native_id is assigned to but never used

Remove assignment to unused variable spectrum_native_id

(F841)


107-107: Local variable rt is assigned to but never used

Remove assignment to unused variable rt

(F841)


112-112: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

pyLucXor/__init__.py

23-23: .models.CIDModel imported but unused

(F401)


23-23: .models.HCDModel imported but unused

(F401)

pyLucXor/core.py

8-8: numpy imported but unused

Remove unused import: numpy

(F401)


10-10: typing.Dict imported but unused

Remove unused import

(F401)


10-10: typing.Any imported but unused

Remove unused import

(F401)


10-10: typing.Optional imported but unused

Remove unused import

(F401)

pyLucXor/peptide.py

9-9: typing.Set imported but unused

Remove unused import

(F401)


9-9: typing.Tuple imported but unused

Remove unused import

(F401)


9-9: typing.Union imported but unused

Remove unused import

(F401)


12-12: pyopenms imported but unused

Remove unused import: pyopenms

(F401)


15-15: .constants.AA_DECOY_MAP imported but unused

Remove unused import

(F401)


16-16: .constants.ION_TYPES imported but unused

Remove unused import

(F401)


16-16: .constants.NEUTRAL_LOSSES imported but unused

Remove unused import

(F401)


281-281: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


689-689: Local variable output_data is assigned to but never used

Remove assignment to unused variable output_data

(F841)


895-898: Combine if branches using logical or operator

Combine if branches

(SIM114)

pyLucXor/cli.py

10-10: json imported but unused

Remove unused import: json

(F401)


29-29: pyLucXor.constants.NTERM_MOD imported but unused

Remove unused import

(F401)


30-30: pyLucXor.constants.CTERM_MOD imported but unused

Remove unused import

(F401)


31-31: pyLucXor.constants.AA_MASSES imported but unused

Remove unused import

(F401)


395-395: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


528-529: Use a single if statement instead of nested if statements

(SIM102)


666-666: Do not use bare except

(E722)


691-691: Do not use bare except

(E722)

pyLucXor/models.py

11-11: typing.Any imported but unused

Remove unused import: typing.Any

(F401)


16-16: .constants.ALGORITHM_CID imported but unused

Remove unused import

(F401)


16-16: .constants.ALGORITHM_HCD imported but unused

Remove unused import

(F401)


545-548: Use ternary operator min_i = min_i + min_i * padding if min_i < 0 else min_i - min_i * padding instead of if-else-block

Replace if-else-block with min_i = min_i + min_i * padding if min_i < 0 else min_i - min_i * padding

(SIM108)


550-553: Use ternary operator max_i = max_i - max_i * padding if max_i < 0 else max_i + max_i * padding instead of if-else-block

Replace if-else-block with max_i = max_i - max_i * padding if max_i < 0 else max_i + max_i * padding

(SIM108)


957-957: Local variable mu is assigned to but never used

Remove assignment to unused variable mu

(F841)

🪛 LanguageTool
pyLucXor/README.md

[grammar] ~23-~23: There might be a mistake here.
Context: ...on

Prerequisites

  • Python 3.7+
  • PyOpenMS
  • NumPy
  • SciPy

Instal...

(QB_NEW_EN)


[grammar] ~24-~24: There might be a mistake here.
Context: ...erequisites

  • Python 3.7+
  • PyOpenMS
  • NumPy
  • SciPy

Install from sourc...

(QB_NEW_EN)


[grammar] ~25-~25: There might be a mistake here.
Context: ...es

  • Python 3.7+
  • PyOpenMS
  • NumPy
  • SciPy

Install from source


(QB_NEW_EN)

---

[grammar] ~48-~48: There might be a mistake here.
Context: ...trum`: Input spectrum file (mzML format)
- `-id, --input_id`: Input identification file (idXML forma...

(QB_NEW_EN)

---

[grammar] ~49-~49: There might be a mistake here.
Context: ...Input identification file (idXML format)
- `-out, --output`: Output file (idXML format)

#### Opt...

(QB_NEW_EN)

---

[grammar] ~54-~54: There might be a mistake here.
Context: ...tation method (CID or HCD, default: CID)
- `--fragment_mass_tolerance`: Fragment mass tolerance (default: 0.5)...

(QB_NEW_EN)

---

[grammar] ~55-~55: There might be a mistake here.
Context: ...: Fragment mass tolerance (default: 0.5)
- `--fragment_error_units`: Tolerance units (Da or ppm, default: D...

(QB_NEW_EN)

---

[grammar] ~56-~56: There might be a mistake here.
Context: ...Tolerance units (Da or ppm, default: Da)
- `--min_mz`: Minimum m/z value to consider (default...

(QB_NEW_EN)

---

[grammar] ~57-~57: There might be a mistake here.
Context: ...m m/z value to consider (default: 150.0)
- `--target_modifications`: Target modifications (default: ["Phosp...

(QB_NEW_EN)

---

[grammar] ~58-~58: There might be a mistake here.
Context: ...pho (S)", "Phospho (T)", "Phospho (Y)"])
- `--neutral_losses`: Neutral loss definitions (default: ["s...

(QB_NEW_EN)

---

[grammar] ~59-~59: There might be a mistake here.
Context: ...ions (default: ["sty -H3PO4 -97.97690"])
- `--decoy_mass`: Mass for decoy generation (default: 79...

(QB_NEW_EN)

---

[grammar] ~60-~60: There might be a mistake here.
Context: ...or decoy generation (default: 79.966331)
- `--max_charge_state`: Maximum charge state (default: 5)
- `...

(QB_NEW_EN)

---

[grammar] ~61-~61: There might be a mistake here.
Context: ...tate`: Maximum charge state (default: 5)
- `--max_peptide_length`: Maximum peptide length (default: 40)
...

(QB_NEW_EN)

---

[grammar] ~62-~62: There might be a mistake here.
Context: ...h`: Maximum peptide length (default: 40)
- `--max_num_perm`: Maximum permutations (default: 16384)
...

(QB_NEW_EN)

---

[grammar] ~63-~63: There might be a mistake here.
Context: ...`: Maximum permutations (default: 16384)
- `--modeling_score_threshold`: Minimum score for modeling (default: 0...

(QB_NEW_EN)

---

[grammar] ~64-~64: There might be a mistake here.
Context: ...nimum score for modeling (default: 0.95)
- `--min_num_psms_model`: Minimum PSMs for modeling (default: 50...

(QB_NEW_EN)

---

[grammar] ~65-~65: There might be a mistake here.
Context: ... Minimum PSMs for modeling (default: 50)
- `--num_threads`: Number of threads (default: 4)
- `--r...

(QB_NEW_EN)

---

[grammar] ~66-~66: There might be a mistake here.
Context: ...threads`: Number of threads (default: 4)
- `--rt_tolerance`: Retention time tolerance (default: 0.0...

(QB_NEW_EN)

---

[grammar] ~67-~67: There might be a mistake here.
Context: ...Retention time tolerance (default: 0.01)
- `--debug`: Enable debug mode
- `--log_file`: Cus...

(QB_NEW_EN)

---

[grammar] ~68-~68: There might be a mistake here.
Context: ...t: 0.01)
- `--debug`: Enable debug mode
- `--log_file`: Custom log file path

#### Example
...

(QB_NEW_EN)

---

[grammar] ~111-~111: There might be a mistake here.
Context: ...m:

### Stage 1: FLR Estimation (RN=0)
- Process all PSMs with both real and deco...

(QB_NEW_EN)

---

[grammar] ~116-~116: There might be a mistake here.
Context: ...ns

### Stage 2: FLR Assignment (RN=1)
- Re-process PSMs using only real permutat...

(QB_NEW_EN)

---

[grammar] ~123-~123: There might be a mistake here.
Context: ...tool generates an idXML file containing:
- **Luciphor_delta_score**: Main localizatio...

(QB_NEW_EN)

---

[grammar] ~124-~124: There might be a mistake here.
Context: ...r_delta_score**: Main localization score
- **Luciphor_pep_score**: Peptide identifica...

(QB_NEW_EN)

---

[grammar] ~125-~125: There might be a mistake here.
Context: ...ep_score**: Peptide identification score
- **Luciphor_global_flr**: Global false loca...

(QB_NEW_EN)

---

[grammar] ~126-~126: There might be a mistake here.
Context: ...al_flr**: Global false localization rate
- **Luciphor_local_flr**: Local false locali...

(QB_NEW_EN)

---

[grammar] ~133-~133: There might be a mistake here.
Context: ...method**: CID or HCD specific parameters
- **Mass tolerances**: Fragment and precurso...

(QB_NEW_EN)

---

[grammar] ~134-~134: There might be a mistake here.
Context: ...: Fragment and precursor mass tolerances
- **Modification definitions**: Target PTMs ...

(QB_NEW_EN)

---

[grammar] ~135-~135: There might be a mistake here.
Context: ...itions**: Target PTMs and neutral losses
- **Modeling parameters**: Score thresholds ...

(QB_NEW_EN)

---

[grammar] ~136-~136: There might be a mistake here.
Context: ... Score thresholds and minimum PSM counts
- **Performance settings**: Threading and me...

(QB_NEW_EN)

---

[grammar] ~141-~141: There might be a mistake here.
Context: ...ensive logging with configurable levels:
- **INFO**: General progress information
- ...

(QB_NEW_EN)

---

[grammar] ~142-~142: There might be a mistake here.
Context: ...- **INFO**: General progress information
- **DEBUG**: Detailed processing information...

(QB_NEW_EN)

---

[grammar] ~143-~143: There might be a mistake here.
Context: ...DEBUG**: Detailed processing information
- **WARNING**: Non-critical issues
- **ERRO...

(QB_NEW_EN)

---

[grammar] ~144-~144: There might be a mistake here.
Context: ...tion
- **WARNING**: Non-critical issues
- **ERROR**: Critical errors

Log files ar...

(QB_NEW_EN)

---

[grammar] ~151-~151: There might be a mistake here.
Context: ...c thread optimization based on PSM count
- **Memory efficient**: Streaming processing...

(QB_NEW_EN)

---

[grammar] ~152-~152: There might be a mistake here.
Context: ... Streaming processing for large datasets
- **Scalable**: Handles datasets with thousa...

(QB_NEW_EN)

---

[grammar] ~157-~157: There might be a mistake here.
Context: ...penMS**: Mass spectrometry data handling
- **NumPy**: Numerical computations
- **Sci...

(QB_NEW_EN)

---

[grammar] ~158-~158: There might be a mistake here.
Context: ...ing
- **NumPy**: Numerical computations
- **SciPy**: Statistical functions
- **Stan...

(QB_NEW_EN)

---

[grammar] ~159-~159: There might be a mistake here.
Context: ...ions
- **SciPy**: Statistical functions
- **Standard library**: argparse, logging, j...

(QB_NEW_EN)

---

[style] ~177-~177: Consider using a less common alternative to make your writing sound more unique and professional.
Context: ...tributing

Contributions are welcome! Please feel free to submit issues and pull requests.

## ...

(FEEL_FREE_TO_STYLE_ME)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.17.2)</summary>

<details>
<summary>pyLucXor/README.md</summary>

170-170: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

</details>

<details>
<summary>🔇 Additional comments (5)</summary><blockquote>

<details>
<summary>pyLucXor/README.md (2)</summary>

`65-67`: **Docs default for --num_threads disagrees with code (6 vs 4).**

README lists default num_threads=4, but DEFAULT_CONFIG in pyLucXor/constants.py sets it to 6. Please align one of them to avoid user confusion.

Apply one of the following diffs:

- If 6 is intended (change README):

```diff
- - `--num_threads`: Number of threads (default: 4)
+ - `--num_threads`: Number of threads (default: 6)
  • If 4 is intended (change constants.py; see comment in constants.py too):
-    "num_threads": 6,
+    "num_threads": 4,

121-128: Clarify FLR field names to match output tag casing.

The README uses bold keys like Luciphor_delta_score, etc. Ensure the exact tag names and casing match the idXML writer implementation to prevent downstream parsing issues.

If the writer uses different keys (e.g., LuciPHOr vs Luciphor), update the README accordingly.

pyLucXor/constants.py (1)

121-136: Default num_threads=6 conflicts with README (4). Pick one.

This default should match the documented default in README and CLI help.

Apply if 4 is intended:

-    "num_threads": 6,
+    "num_threads": 4,

Or update README to 6 (see corresponding README comment).

pyLucXor/peptide.py (1)

7-20: Prune unused imports to reduce noise

[ suggest_nitpick ]

-from typing import Dict, List, Set, Tuple, Optional, Any, Union
+from typing import Dict, List, Tuple, Optional, Any, Union
@@
-import pyopenms
@@
-from .constants import (
-    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP, AA_DECOY_MAP,
-    WATER_MASS, PROTON_MASS, ION_TYPES, NEUTRAL_LOSSES
-)
+from .constants import (
+    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP,
+    WATER_MASS, PROTON_MASS
+)
pyLucXor/models.py (1)

821-859: Model training builds matched peaks by calling private PSM._match_peaks

This reaches into a private method and duplicates tolerance logic. Consider using the public Peptide.match_peaks with the PSM’s peptide and spectrum to reduce coupling.

[ suggest_optional_refactor ]
Happy to propose a small wrapper on PSM to expose a public matching utility if desired.

Comment thread pyLucXor/__init__.py
Comment on lines +10 to +13
if name == "LucXor":
from .core import LucXor
return LucXor
elif name == "PyLuciPHOr2":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

getattr exposes non-existent symbol 'LucXor'

.core defines CoreProcessor, not LucXor. Importing LucXor will fail at runtime.

-    if name == "LucXor":
-        from .core import LucXor
-        return LucXor
+    if name == "LucXor":
+        # Expose CoreProcessor under the historical/public name "LucXor"
+        from .core import CoreProcessor as LucXor
+        return LucXor

Optionally, you can also expose CoreProcessor explicitly:

+    elif name == "CoreProcessor":
+        from .core import CoreProcessor
+        return CoreProcessor
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if name == "LucXor":
from .core import LucXor
return LucXor
elif name == "PyLuciPHOr2":
if name == "LucXor":
# Expose CoreProcessor under the historical/public name "LucXor"
from .core import CoreProcessor as LucXor
return LucXor
elif name == "CoreProcessor":
from .core import CoreProcessor
return CoreProcessor
elif name == "PyLuciPHOr2":
🤖 Prompt for AI Agents
In pyLucXor/__init__.py around lines 10 to 13, __getattr__ tries to import and
return a symbol named "LucXor" but core.py defines CoreProcessor instead; change
the import to from .core import CoreProcessor and return CoreProcessor (or, if
you intended the public name to remain LucXor, import CoreProcessor and return
it under the LucXor alias), and optionally add an explicit exposure (e.g., set
LucXor = CoreProcessor) so the module only exposes existing symbols.

Comment thread pyLucXor/cli.py
Comment on lines +296 to +305
peptide = Peptide(sequence, {}, config, charge=charge)

# Build ion ladders
self.logger.info("Building ion ladders...")
peptide.build_ion_ladders()

# Create PSM object with spectrum
self.logger.info("Creating PSM object...")
psm = PSM(peptide, spectrum, config)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix Peptide and PSM constructor calls (wrong argument order/arity)

These calls currently pass config into the mod_pep/scan_num positional parameters, breaking object initialization.

-        peptide = Peptide(sequence, {}, config, charge=charge)
+        peptide = Peptide(sequence, config, charge=charge)
@@
-        psm = PSM(peptide, spectrum, config)
+        psm = PSM(peptide, spectrum, config=config)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
peptide = Peptide(sequence, {}, config, charge=charge)
# Build ion ladders
self.logger.info("Building ion ladders...")
peptide.build_ion_ladders()
# Create PSM object with spectrum
self.logger.info("Creating PSM object...")
psm = PSM(peptide, spectrum, config)
peptide = Peptide(sequence, config, charge=charge)
# Build ion ladders
self.logger.info("Building ion ladders...")
peptide.build_ion_ladders()
# Create PSM object with spectrum
self.logger.info("Creating PSM object...")
psm = PSM(peptide, spectrum, config=config)
🤖 Prompt for AI Agents
In pyLucXor/cli.py around lines 296 to 305, the Peptide and PSM constructors are
being called with config passed positionally into parameters that expect
mods/scan_num, breaking initialization; change the calls to pass the mods and
scan_num in their correct positional places (or omit them) and supply config
explicitly as a keyword argument (e.g., Peptide(sequence, mods_dict,
charge=charge, config=config) and PSM(peptide, spectrum, scan_num=...,
config=config) or equivalent according to the actual constructor signatures) so
config is not passed into the wrong positional parameters.

Comment thread pyLucXor/config.py
Comment on lines +29 to +31
# Initialize with default configuration
self.config = DEFAULT_CONFIG.copy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Shallow copy of DEFAULT_CONFIG leaks nested mutations to the global default.

Lists/dicts inside DEFAULT_CONFIG are shared with all instances. Mutating them (e.g., append to target_modifications) will corrupt the module default and other instances.

Apply this diff:

+from copy import deepcopy
@@
-        # Initialize with default configuration
-        self.config = DEFAULT_CONFIG.copy()
+        # Initialize with a deep copy of the default configuration
+        self.config = deepcopy(DEFAULT_CONFIG)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Initialize with default configuration
self.config = DEFAULT_CONFIG.copy()
from copy import deepcopy
# Initialize with a deep copy of the default configuration
self.config = deepcopy(DEFAULT_CONFIG)
🤖 Prompt for AI Agents
In pyLucXor/config.py around lines 29 to 31, the code uses DEFAULT_CONFIG.copy()
which performs a shallow copy and leaks nested mutable objects (lists/dicts)
between instances; replace the shallow copy with a deep copy (e.g., import copy
and use copy.deepcopy(DEFAULT_CONFIG)) so each Config instance gets its own
independent nested structures, and ensure the copy module is imported at the top
of the file.

Comment thread pyLucXor/constants.py
Comment on lines +197 to +199
# Physical constants
WATER = 18.01056
PROTON = 1.00728 No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Duplicate physical constants (WATER/PROTON) may diverge from precise values.

WATER_MASS/PROTON_MASS are precise, while WATER/PROTON are truncated. To avoid divergence, alias them to the precise constants.

Apply this diff:

-# Physical constants
-WATER = 18.01056
-PROTON = 1.00728
+# Physical constants (aliases for precise values)
+WATER = WATER_MASS
+PROTON = PROTON_MASS
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Physical constants
WATER = 18.01056
PROTON = 1.00728
# Physical constants (aliases for precise values)
WATER = WATER_MASS
PROTON = PROTON_MASS
🤖 Prompt for AI Agents
In pyLucXor/constants.py around lines 197 to 199, the file defines WATER and
PROTON with truncated values while there are precise constants WATER_MASS and
PROTON_MASS elsewhere; replace the hardcoded truncated values by aliasing WATER
= WATER_MASS and PROTON = PROTON_MASS so the two names refer to the single
precise values and cannot drift apart.

Comment thread pyLucXor/core.py
Comment on lines +121 to +123
# Process PSM (generate only real permutations, no decoys)
psm.process_stage2(self.config.to_dict())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Broken API call: PSM has process_round2(...), not process_stage2(...)

Per PSM docs, second-round processing should call process_round2(flr_calculator); current call will raise AttributeError and ignores FLR mapping.

-                # Process PSM (generate only real permutations, no decoys)
-                psm.process_stage2(self.config.to_dict())
+                # Process PSM (real permutations only) and assign FLR from mapping
+                psm.process_round2(self.flr_calculator)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Process PSM (generate only real permutations, no decoys)
psm.process_stage2(self.config.to_dict())
# Process PSM (real permutations only) and assign FLR from mapping
psm.process_round2(self.flr_calculator)
🤖 Prompt for AI Agents
In pyLucXor/core.py around lines 121 to 123, the code calls a non-existent
psm.process_stage2(...) which will raise an AttributeError and omits the FLR
mapping; replace this call with the correct API
psm.process_round2(flr_calculator) and pass the FLR calculator/mapping derived
from the current config (e.g., use an existing self.flr_calculator or construct
one from self.config before the call), ensuring the variable is properly
initialized and documented so process_round2 receives the expected FLR
calculator.

Comment thread pyLucXor/psm.py
Comment on lines +525 to +529
if seq[i:i+9] == "(Phospho)":
i += 9
elif seq[i:i+9] == "(Oxidation)":
i += 9
else:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Incorrect oxidation marker length in _validate_permutation

Same 11-char issue here.

-                    elif seq[i:i+9] == "(Oxidation)":
-                        i += 9
+                    elif seq[i:i+11] == "(Oxidation)":
+                        i += 11
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if seq[i:i+9] == "(Phospho)":
i += 9
elif seq[i:i+9] == "(Oxidation)":
i += 9
else:
if seq[i:i+9] == "(Phospho)":
i += 9
elif seq[i:i+11] == "(Oxidation)":
i += 11
else:
🧰 Tools
🪛 Ruff (0.12.2)

525-528: Combine if branches using logical or operator

Combine if branches

(SIM114)

🤖 Prompt for AI Agents
In pyLucXor/psm.py around lines 525 to 529, the branch that advances past the
"(Oxidation)" marker uses i += 9 (same incorrect fixed length as the phospho
case); update that to use the correct length for "(Oxidation)" (or, better,
compute it dynamically with len("(Oxidation)") ) so the index advances by the
full marker length instead of the current wrong value.

Comment thread pyLucXor/psm.py
Comment on lines +928 to +966
def normalize_spectrum(self) -> None:
"""Normalize spectrum intensities."""
if self.spectrum is None:
self.logger.warning("No spectrum data available for normalization")
return

# Get peaks
mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:
self.logger.warning("No intensity values available for normalization")
return

# Check data validity
valid_mask = np.isfinite(intensities) & (intensities > 0)
if not np.any(valid_mask):
self.logger.warning("No valid intensity values found for normalization")
return

# Only use valid intensity values
valid_intensities = intensities[valid_mask]
valid_mz_values = mz_values[valid_mask]

# Log intensity statistics
self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")

# Normalize to sum of 1
total = np.sum(valid_intensities)
if total > 0:
normalized_intensities = valid_intensities / total

# Update spectrum with normalized values
self.spectrum.set_peaks(valid_mz_values, normalized_intensities)

self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
else:
self.logger.warning("Total intensity is zero, skipping normalization")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix logger attribute access in normalize_spectrum

self.logger is not defined on PSM; use the module logger. Calls will raise AttributeError if this method is ever invoked.

Apply this diff:

-        if self.spectrum is None:
-            self.logger.warning("No spectrum data available for normalization")
+        if self.spectrum is None:
+            logger.warning("No spectrum data available for normalization")
@@
-        if len(intensities) == 0:
-            self.logger.warning("No intensity values available for normalization")
+        if len(intensities) == 0:
+            logger.warning("No intensity values available for normalization")
@@
-        if not np.any(valid_mask):
-            self.logger.warning("No valid intensity values found for normalization")
+        if not np.any(valid_mask):
+            logger.warning("No valid intensity values found for normalization")
@@
-        self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
+        logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
                          f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")
@@
-            self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
+            logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
         else:
-            self.logger.warning("Total intensity is zero, skipping normalization")
+            logger.warning("Total intensity is zero, skipping normalization")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def normalize_spectrum(self) -> None:
"""Normalize spectrum intensities."""
if self.spectrum is None:
self.logger.warning("No spectrum data available for normalization")
return
# Get peaks
mz_values, intensities = self.spectrum.get_peaks()
if len(intensities) == 0:
self.logger.warning("No intensity values available for normalization")
return
# Check data validity
valid_mask = np.isfinite(intensities) & (intensities > 0)
if not np.any(valid_mask):
self.logger.warning("No valid intensity values found for normalization")
return
# Only use valid intensity values
valid_intensities = intensities[valid_mask]
valid_mz_values = mz_values[valid_mask]
# Log intensity statistics
self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")
# Normalize to sum of 1
total = np.sum(valid_intensities)
if total > 0:
normalized_intensities = valid_intensities / total
# Update spectrum with normalized values
self.spectrum.set_peaks(valid_mz_values, normalized_intensities)
self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
else:
self.logger.warning("Total intensity is zero, skipping normalization")
def normalize_spectrum(self) -> None:
"""Normalize spectrum intensities."""
if self.spectrum is None:
logger.warning("No spectrum data available for normalization")
return
# Get peaks
mz_values, intensities = self.spectrum.get_peaks()
if len(intensities) == 0:
logger.warning("No intensity values available for normalization")
return
# Check data validity
valid_mask = np.isfinite(intensities) & (intensities > 0)
if not np.any(valid_mask):
logger.warning("No valid intensity values found for normalization")
return
# Only use valid intensity values
valid_intensities = intensities[valid_mask]
valid_mz_values = mz_values[valid_mask]
# Log intensity statistics
logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")
# Normalize to sum of 1
total = np.sum(valid_intensities)
if total > 0:
normalized_intensities = valid_intensities / total
# Update spectrum with normalized values
self.spectrum.set_peaks(valid_mz_values, normalized_intensities)
logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
else:
logger.warning("Total intensity is zero, skipping normalization")

Comment thread pyLucXor/psm.py
Comment on lines +967 to +1002
def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:
"""Reduce neutral loss peaks.

Args:
precursor_nl_mass: Precursor neutral loss mass
"""
if self.spectrum is None:
self.logger.warning("No spectrum data available for neutral loss reduction")
return

# Get peaks
mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:
self.logger.warning("No intensity values available for neutral loss reduction")
return

# Find peaks to reduce
nl_indices = []
for i, mz in enumerate(mz_values):
# Check if peak is within neutral loss mass window
if abs(mz - precursor_nl_mass) < 0.5: # Using 0.5 Da window
nl_indices.append(i)

if nl_indices:
# Reduce intensity of neutral loss peaks
for idx in nl_indices:
intensities[idx] *= 0.1 # Reduce to 10% of original intensity

# Update spectrum
self.spectrum.set_peaks(mz_values, intensities)

self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
else:
self.logger.debug("No neutral loss peaks found")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix logger attribute access in reduce_nl_peak

Same self.logger issue here.

-        if self.spectrum is None:
-            self.logger.warning("No spectrum data available for neutral loss reduction")
+        if self.spectrum is None:
+            logger.warning("No spectrum data available for neutral loss reduction")
@@
-        if len(intensities) == 0:
-            self.logger.warning("No intensity values available for neutral loss reduction")
+        if len(intensities) == 0:
+            logger.warning("No intensity values available for neutral loss reduction")
@@
-            self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
+            logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
         else:
-            self.logger.debug("No neutral loss peaks found")
+            logger.debug("No neutral loss peaks found")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:
"""Reduce neutral loss peaks.
Args:
precursor_nl_mass: Precursor neutral loss mass
"""
if self.spectrum is None:
self.logger.warning("No spectrum data available for neutral loss reduction")
return
# Get peaks
mz_values, intensities = self.spectrum.get_peaks()
if len(intensities) == 0:
self.logger.warning("No intensity values available for neutral loss reduction")
return
# Find peaks to reduce
nl_indices = []
for i, mz in enumerate(mz_values):
# Check if peak is within neutral loss mass window
if abs(mz - precursor_nl_mass) < 0.5: # Using 0.5 Da window
nl_indices.append(i)
if nl_indices:
# Reduce intensity of neutral loss peaks
for idx in nl_indices:
intensities[idx] *= 0.1 # Reduce to 10% of original intensity
# Update spectrum
self.spectrum.set_peaks(mz_values, intensities)
self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
else:
self.logger.debug("No neutral loss peaks found")
def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:
"""Reduce neutral loss peaks.
Args:
precursor_nl_mass: Precursor neutral loss mass
"""
if self.spectrum is None:
logger.warning("No spectrum data available for neutral loss reduction")
return
# Get peaks
mz_values, intensities = self.spectrum.get_peaks()
if len(intensities) == 0:
logger.warning("No intensity values available for neutral loss reduction")
return
# Find peaks to reduce
nl_indices = []
for i, mz in enumerate(mz_values):
# Check if peak is within neutral loss mass window
if abs(mz - precursor_nl_mass) < 0.5: # Using 0.5 Da window
nl_indices.append(i)
if nl_indices:
# Reduce intensity of neutral loss peaks
for idx in nl_indices:
intensities[idx] *= 0.1 # Reduce to 10% of original intensity
# Update spectrum
self.spectrum.set_peaks(mz_values, intensities)
logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
else:
logger.debug("No neutral loss peaks found")
🤖 Prompt for AI Agents
In pyLucXor/psm.py around lines 967 to 1002, the method uses self.logger which
doesn't match the class's logger attribute; replace self.logger with the actual
logger attribute used elsewhere in this class (e.g., self._logger or self.log)
and update all occurrences in this method accordingly; if the class has no
logger attribute, add one in __init__ like self._logger =
logging.getLogger(__name__) and ensure import logging is present, then use that
attribute (and adjust the two warning/debug calls in this method to use it).

Comment thread pyLucXor/psm.py
Comment on lines +1003 to +1014
def get_spectrum_peaks(self) -> Tuple[np.ndarray, np.ndarray]:
"""Get spectrum peaks.

Returns:
Tuple of (m/z array, intensity array)
"""
if self.spectrum is None:
self.logger.warning("No spectrum data available")
return np.array([]), np.array([])

return self.spectrum.get_peaks()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix logger attribute access in get_spectrum_peaks

Same self.logger issue again.

-        if self.spectrum is None:
-            self.logger.warning("No spectrum data available")
+        if self.spectrum is None:
+            logger.warning("No spectrum data available")
             return np.array([]), np.array([])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_spectrum_peaks(self) -> Tuple[np.ndarray, np.ndarray]:
"""Get spectrum peaks.
Returns:
Tuple of (m/z array, intensity array)
"""
if self.spectrum is None:
self.logger.warning("No spectrum data available")
return np.array([]), np.array([])
return self.spectrum.get_peaks()
def get_spectrum_peaks(self) -> Tuple[np.ndarray, np.ndarray]:
"""Get spectrum peaks.
Returns:
Tuple of (m/z array, intensity array)
"""
if self.spectrum is None:
logger.warning("No spectrum data available")
return np.array([]), np.array([])
return self.spectrum.get_peaks()
🤖 Prompt for AI Agents
In pyLucXor/psm.py around lines 1003 to 1014, the method uses self.logger which
is not the correct logger attribute for this class; change the access to the
logger attribute used elsewhere in the class (e.g., self._logger) so the warning
call uses the actual logger instance, leaving the warning message and return
behavior unchanged; if the class uses a different logger property name, use that
exact name instead to keep consistency.

Comment thread pyLucXor/psm.py
Comment on lines +1015 to +1030
def clear_scores(self) -> None:
"""
Clear PSM scores
"""
logger.debug(f"Clearing scores for PSM {self.scan_num}")

# Clear permutation score mappings
self.pos_permutation_score_map.clear()
self.neg_permutation_score_map.clear()

# Reset score-related attributes
self.delta_score = np.nan
self.pep_score = np.nan
self.global_flr = np.nan
self.local_flr = np.nan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Typo: clearing psm_score uses undefined pep_score

self.pep_score doesn’t exist elsewhere; this should reset self.psm_score.

-        self.delta_score = np.nan
-        self.pep_score = np.nan
-        self.global_flr = np.nan
-        self.local_flr = np.nan
+        self.delta_score = np.nan
+        self.psm_score = np.nan
+        self.global_flr = np.nan
+        self.local_flr = np.nan
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def clear_scores(self) -> None:
"""
Clear PSM scores
"""
logger.debug(f"Clearing scores for PSM {self.scan_num}")
# Clear permutation score mappings
self.pos_permutation_score_map.clear()
self.neg_permutation_score_map.clear()
# Reset score-related attributes
self.delta_score = np.nan
self.pep_score = np.nan
self.global_flr = np.nan
self.local_flr = np.nan
def clear_scores(self) -> None:
"""
Clear PSM scores
"""
logger.debug(f"Clearing scores for PSM {self.scan_num}")
# Clear permutation score mappings
self.pos_permutation_score_map.clear()
self.neg_permutation_score_map.clear()
# Reset score-related attributes
self.delta_score = np.nan
self.psm_score = np.nan
self.global_flr = np.nan
self.local_flr = np.nan
🤖 Prompt for AI Agents
In pyLucXor/psm.py around lines 1015 to 1030, the clear_scores method resets
self.pep_score, which is a typo because the class uses self.psm_score elsewhere;
change the reset to self.psm_score = np.nan (and remove or do not add pep_score)
so the correct attribute is cleared; ensure no other references to pep_score
remain and run tests to confirm attribute consistency.

@ypriverol ypriverol requested a review from Copilot August 25, 2025 18:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces pyLucXor, a complete Python package implementation for phosphosite localization analysis with False Localization Rate (FLR) calculation. The implementation provides both a command-line interface and Python API for processing peptide-spectrum matches (PSMs) to determine the most likely phosphorylation sites on peptides.

Key changes include:

  • Complete two-stage PSM processing workflow with statistical model training and FLR estimation
  • Comprehensive peptide sequence analysis with fragment ion generation for both CID and HCD fragmentation
  • Parallel processing framework supporting multi-threading for improved performance

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
pyLucXor/psm.py Core PSM class with scoring logic and two-stage processing workflow
pyLucXor/models.py Statistical models for CID/HCD fragmentation with kernel density estimation
pyLucXor/peptide.py Peptide sequence analysis with ion ladder generation and peak matching
pyLucXor/flr.py FLR calculator implementing statistical modeling for false localization rates
pyLucXor/spectrum.py Mass spectrum data processing with normalization and peak handling
pyLucXor/core.py Main processing orchestrator coordinating the two-stage workflow
pyLucXor/parallel.py Parallel processing infrastructure with worker classes
pyLucXor/peak.py Peak representation for mass spectrometry data
pyLucXor/globals.py Global state management utilities
pyLucXor/constants.py Physical constants and configuration parameters
pyLucXor/config.py Configuration management with validation
Comments suppressed due to low confidence (1)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread pyLucXor/spectrum.py
"""
for i in range(self.N):
if abs(self.mz[i] - mz) < 1e-6: # floating point comparison
return i
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded tolerance value 1e-6 for m/z comparison should be configurable or defined as a named constant to improve maintainability and allow for different precision requirements.

Suggested change
return i
for i in range(self.N):
if abs(self.mz[i] - mz) < self.MZ_TOLERANCE: # floating point comparison
return i

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/psm.py


if is_decoy:
tolerance = base_tolerance * 2.0 # decoyPadding = 2.0
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded decoy padding multiplier 2.0 should be defined as a named constant to improve code maintainability and make the algorithm parameters more explicit.

Suggested change
tolerance = base_tolerance * 2.0 # decoyPadding = 2.0
tolerance = base_tolerance * DECOY_PADDING_MULTIPLIER # decoyPadding

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/psm.py
for i in range(self.spectrum.N):
if abs(self.spectrum.mz[i] - nl_mz) < 0.5: # Use 0.5 Da window
# Reduce intensity to 10% of original
self.spectrum.raw_intensity[i] *= 0.1
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded neutral loss intensity reduction factor 0.1 should be defined as a named constant or made configurable to improve maintainability.

Suggested change
self.spectrum.raw_intensity[i] *= 0.1
factor = self.config.get('neutral_loss_intensity_factor', DEFAULT_NEUTRAL_LOSS_INTENSITY_FACTOR)
self.spectrum.raw_intensity[i] *= factor

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/psm.py
# Check data validity
valid_mask = np.isfinite(intensities) & (intensities > 0)
if not np.any(valid_mask):
self.logger.warning("No valid intensity values found for normalization")
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method accesses self.logger which is not defined. It should use the module-level logger instead.

Suggested change
self.logger.warning("No valid intensity values found for normalization")
logger.warning("No spectrum data available for normalization")
return
# Get peaks
mz_values, intensities = self.spectrum.get_peaks()
if len(intensities) == 0:
logger.warning("No intensity values available for normalization")
return
# Check data validity
valid_mask = np.isfinite(intensities) & (intensities > 0)
if not np.any(valid_mask):
logger.warning("No valid intensity values found for normalization")

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/psm.py
mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:
self.logger.warning("No intensity values available for neutral loss reduction")
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method accesses self.logger which is not defined. It should use the module-level logger instead.

Suggested change
self.logger.warning("No intensity values available for neutral loss reduction")
logger.warning("No spectrum data available for neutral loss reduction")
return
# Get peaks
mz_values, intensities = self.spectrum.get_peaks()
if len(intensities) == 0:
logger.warning("No intensity values available for neutral loss reduction")

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/psm.py
Tuple of (m/z array, intensity array)
"""
if self.spectrum is None:
self.logger.warning("No spectrum data available")
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method accesses self.logger which is not defined. It should use the module-level logger instead.

Suggested change
self.logger.warning("No spectrum data available")
logger.warning("No spectrum data available")

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/peptide.py
"""
Check if sequence is a decoy sequence
"""
return any(aa.islower() for aa in seq)
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decoy sequence detection logic using lowercase letters is fragile and could lead to false positives. Consider using a more explicit decoy marking mechanism or documenting this assumption clearly.

Suggested change
return any(aa.islower() for aa in seq)
Check if sequence is a decoy sequence.
Uses explicit decoy symbol from get_decoy_symbol().
"""
decoy_symbol = get_decoy_symbol()
return decoy_symbol in seq

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/parallel.py

logger = logging.getLogger(__name__)

# Default thread count
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The default thread count should ideally be based on system capabilities rather than a hardcoded value, or at least document why 4 was chosen as the default.

Suggested change
# Default thread count
# Default thread count
# Fallback to 4 threads if os.cpu_count() is unavailable.
# 4 is chosen as a reasonable default for most systems to balance parallelism and resource usage.

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/models.py
# Limit negative distribution size
limit_n = nb + ny
if limit_n < 50000: # constants.MIN_NUM_NEG_PKS
limit_n += 50000
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded value 50000 should be imported from constants module as indicated by the comment, rather than using a magic number.

Suggested change
limit_n += 50000
if limit_n < MIN_NUM_NEG_PKS: # constants.MIN_NUM_NEG_PKS
limit_n += MIN_NUM_NEG_PKS

Copilot uses AI. Check for mistakes.
Comment thread pyLucXor/flr.py
for i in range(self.n_real):
x = self.pos[i]
if x < 0.1:
x = 0.1
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded minimum delta score threshold 0.1 should use the instance variable self.min_delta_score instead of a magic number for consistency.

Suggested change
x = 0.1
if x < self.min_delta_score:
x = self.min_delta_score

Copilot uses AI. Check for mistakes.
@ypriverol ypriverol self-requested a review August 25, 2025 18:40
@ypriverol ypriverol merged commit c2061c1 into bigbio:main Aug 25, 2025
1 check passed
weizhongchun pushed a commit to weizhongchun/onsite that referenced this pull request Oct 23, 2025
weizhongchun pushed a commit to weizhongchun/onsite that referenced this pull request Oct 23, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Oct 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants