Initial commit pyLucXor by weizhongchun · Pull Request #4 · bigbio/onsite

weizhongchun · 2025-08-20T10:05:40Z

PR Type

Enhancement, Documentation

Description

• Complete implementation of pyLucXor, a Python package for peptide-spectrum match (PSM) analysis and modification site localization
• Core PSM processing engine with two-stage workflow: Stage 1 for FLR estimation with decoys, Stage 2 for final scoring
• Statistical models for CID and HCD fragmentation scoring with kernel density estimation and parallel training
• Comprehensive peptide sequence analysis with fragment ion generation (b-ions, y-ions) and neutral loss handling
• False localization rate (FLR) calculation using statistical modeling and kernel density estimation
• Parallel processing framework with configurable thread pools for improved performance
• Command-line interface with extensive configuration options and multi-threading support
• Mass spectrum data processing with peak normalization and intensity calculations
• Configuration management system with validation and default value handling
• Complete documentation including installation, usage examples, and API reference

Diagram Walkthrough

flowchart LR
  CLI["Command Line Interface"] --> Core["Core Processor"]
  Core --> PSM["PSM Analysis"]
  Core --> Models["CID/HCD Models"]
  PSM --> Peptide["Peptide Processing"]
  PSM --> Spectrum["Spectrum Matching"]
  Models --> FLR["FLR Calculator"]
  Peptide --> Parallel["Parallel Processing"]
  FLR --> Results["Output (idXML/CSV)"]

File Walkthrough

Relevant files

Enhancement

10 files

psm.py `Complete PSM class implementation with scoring and FLR calculation` pyLucXor/psm.py • Implements PSM (Peptide-Spectrum Match) class with comprehensive functionality for handling peptide-spectrum matches • Provides methods for processing PSMs, generating permutations, scoring, and calculating delta scores and FLR values • Includes support for modification handling, spectrum matching, and both CID/HCD fragmentation methods • Implements two-stage processing workflow with decoy generation for FLR estimation	+1367/-0
models.py `Statistical models for CID and HCD fragmentation scoring` pyLucXor/models.py • Implements CID and HCD statistical models (`CIDModel`, `HCDModel`) for scoring peptide-spectrum matches • Provides model data classes (`ModelData_CID`, `ModelData_HCD`) with intensity and distance parameter calculations • Includes non-parametric density estimation using kernel density estimation with NumPy vectorization • Supports parallel model training using ThreadPoolExecutor for improved performance	+1122/-0
parallel.py `Parallel processing framework for PSM analysis` pyLucXor/parallel.py • Implements parallel processing infrastructure with worker classes for different tasks • Provides `ScoringWorker`, `PSMProcessingWorker`, `SpectrumMatchingWorker` for concurrent execution • Includes utility functions for optimal thread count calculation and generic parallel processing • Supports batch processing of PSMs with configurable thread pools	+261/-0
core.py `Core processing engine for PSM workflow orchestration` pyLucXor/core.py • Implements `CoreProcessor` class as main orchestrator for PSM processing workflow • Provides two-stage processing: Stage 1 for FLR estimation with decoys, Stage 2 for final scoring • Includes methods for result writing in both idXML and CSV formats • Integrates FLR calculation and PSM score management	+207/-0
peptide.py `Core peptide sequence analysis and fragment ion generation` pyLucXor/peptide.py • Implements comprehensive Peptide class for peptide sequence representation and analysis • Provides fragment ion generation (b-ions and y-ions) with neutral loss handling • Includes permutation generation for modification site localization • Supports both target and decoy sequence processing with FLR calculation	+907/-0
flr.py `False localization rate calculation and statistical modeling` pyLucXor/flr.py • Implements FLRCalculator class for false localization rate estimation • Uses kernel density estimation for statistical modeling of delta scores • Provides both global and local FLR calculations with minorization • Supports two-round FLR calculation workflow with mapping persistence	+798/-0
cli.py `Command-line interface and main processing workflow` pyLucXor/cli.py • Implements complete command-line interface for pyLuciPHOr2 tool • Provides comprehensive argument parsing with extensive configuration options • Implements two-stage processing workflow with model training and FLR calculation • Includes multi-threading support and file I/O handling for mzML and idXML formats	+781/-0
spectrum.py `Mass spectrum data representation and processing` pyLucXor/spectrum.py • Implements Spectrum class for mass spectrometry data handling • Provides peak intensity normalization and relative intensity calculations • Includes median-based normalization for spectral data processing • Supports peak finding and indexing operations	+134/-0
peak.py `Mass spectrometry peak representation and matching` pyLucXor/peak.py • Implements Peak class representing individual mass spectrometry peaks • Provides peak matching information and scoring attributes • Includes serialization methods for data persistence • Supports peak comparison and hash operations	+129/-0
globals.py `Global state management and utility functions` pyLucXor/globals.py • Implements global variables management for PSM processing • Provides FLR estimate recording and assignment functionality • Includes decoy symbol mapping utilities • Supports PSM score clearing for multi-round processing	+99/-0

Configuration changes

3 files

constants.py `Constants and configuration definitions for mass spectrometry` pyLucXor/constants.py • Defines comprehensive constants including amino acid masses, modification masses, and decoy mappings • Provides algorithm types, physical constants, and default configuration parameters • Includes ion types, neutral losses, and score type definitions • Sets up decoy amino acid mapping with mass calculations	+199/-0
__init__.py `Package initialization with lazy loading and API exports` pyLucXor/init.py • Implements package initialization with lazy loading using `getattr` • Defines package version and exports main classes • Provides clean API access to core functionality • Sets up module-level imports for `LucXor`, `PyLuciPHOr2`, `Peptide`, `PSM`, and model classes	+27/-0
config.py `Configuration management and parameter handling` pyLucXor/config.py • Implements LucXorConfig class for configuration management • Provides dictionary-like interface for parameter access • Includes configuration validation and default value handling • Supports dynamic configuration updates	+113/-0

Documentation

1 files

README.md `Complete package documentation and user guide` pyLucXor/README.md • Provides comprehensive documentation for pyLucXor package • Includes installation instructions, usage examples, and API documentation • Documents algorithm workflow, configuration options, and output formats • Contains performance notes, dependencies, and contribution guidelines	+181/-0

Summary by CodeRabbit

New Features
- Introduced pyLucXor: a CLI and Python API for PTM site localization (LuciPHOr2-inspired).
- Supports CID/HCD, multi-threading, mzML/MGF inputs, and idXML/CSV outputs.
- Two-stage workflow with FLR estimation/assignment; writes delta score, PEP score, and global/local FLR.
- End-to-end pipeline from identification loading to result export.
Documentation
- Added comprehensive README covering installation, prerequisites, CLI usage with examples, configuration options, output fields, performance notes, citation, and support.

coderabbitai · 2025-08-20T10:05:48Z

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

qodo-code-review · 2025-08-20T10:06:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review AttributeError Risk Methods like normalize_spectrum, reduce_nl_peak, and get_spectrum_peaks use self.logger instead of the module-level logger, which is undefined on the instance and will raise AttributeError at runtime. def normalize_spectrum(self) -> None: """Normalize spectrum intensities.""" if self.spectrum is None: self.logger.warning("No spectrum data available for normalization") return # Get peaks mz_values, intensities = self.spectrum.get_peaks() if len(intensities) == 0: self.logger.warning("No intensity values available for normalization") return # Check data validity valid_mask = np.isfinite(intensities) & (intensities > 0) if not np.any(valid_mask): self.logger.warning("No valid intensity values found for normalization") return # Only use valid intensity values valid_intensities = intensities[valid_mask] valid_mz_values = mz_values[valid_mask] # Log intensity statistics self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, " f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}") # Normalize to sum of 1 total = np.sum(valid_intensities) if total > 0: normalized_intensities = valid_intensities / total # Update spectrum with normalized values self.spectrum.set_peaks(valid_mz_values, normalized_intensities) self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values") else: self.logger.warning("Total intensity is zero, skipping normalization") def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None: """Reduce neutral loss peaks. Args: precursor_nl_mass: Precursor neutral loss mass """ if self.spectrum is None: self.logger.warning("No spectrum data available for neutral loss reduction") return # Get peaks mz_values, intensities = self.spectrum.get_peaks() if len(intensities) == 0: self.logger.warning("No intensity values available for neutral loss reduction") return # Find peaks to reduce nl_indices = [] for i, mz in enumerate(mz_values): # Check if peak is within neutral loss mass window if abs(mz - precursor_nl_mass) < 0.5: # Using 0.5 Da window nl_indices.append(i) if nl_indices: # Reduce intensity of neutral loss peaks for idx in nl_indices: intensities[idx] = 0.1 # Reduce to 10% of original intensity # Update spectrum self.spectrum.set_peaks(mz_values, intensities) self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks") else: self.logger.debug("No neutral loss peaks found") Scoring Logic Consistency* score_permutations mixes model APIs: it expects model.get_charge_model and also calls model.get_log_np_density_* directly. CIDModel and HCDModel expose those methods but CIDModel’s get_log_np_density_* compute parametric Gaussian while HCDModel uses NP; ensure the selected model matches fragmentation type and that both paths are intended. Also distance uses pos density for both b and y with same call; verify intended behavior. def score_permutations(self, model: Optional[object]) -> None: """ Score all permutations of the peptide sequence Args: model: Optional scoring model """ try: # Get all permutations from pos_permutation_score_map and neg_permutation_score_map all_perms = list(self.pos_permutation_score_map.keys()) + list(self.neg_permutation_score_map.keys()) base_tolerance = self.config.get('fragment_mass_tolerance', 0.1) # Read from config, default 0.1 # Score each permutation for perm in all_perms: # Check if it's a decoy sequence is_decoy = self._is_decoy_sequence(perm) if is_decoy: tolerance = base_tolerance * 2.0 # decoyPadding = 2.0 else: tolerance = base_tolerance # Calculate score final_score = 0.0 matched_peaks = self._match_peaks(perm, tolerance) # Get charge model charge_model = model.get_charge_model(self.charge) if model else None if charge_model is None: logger.debug(f"No charge model found for charge {self.charge}") continue # Calculate score for each peak for peak in matched_peaks: # Get ion type ion_type = peak.get('matched_ion_str', '') # Calculate intensity score intensity_m = 0.0 intensity_u = 0.0 if ion_type.startswith('b'): intensity_m = model.get_log_np_density_int('b', peak['norm_intensity'], self.charge) elif ion_type.startswith('y'): intensity_m = model.get_log_np_density_int('y', peak['norm_intensity'], self.charge) intensity_u = model.get_log_np_density_int('n', peak['norm_intensity'], self.charge) # Calculate distance score dist_m = 0.0 dist_u = 0.0 if ion_type.startswith('b'): dist_m = model.get_log_np_density_dist_pos(peak['mass_diff'], self.charge) elif ion_type.startswith('y'): dist_m = model.get_log_np_density_dist_pos(peak['mass_diff'], self.charge) # For unmatched ions, use uniform distribution dist_u = 0.0 # Calculate score intensity_score = intensity_m - intensity_u distance_score = dist_m - dist_u # Handle invalid values if np.isnan(intensity_score) or np.isinf(intensity_score): intensity_score = 0.0 if np.isnan(distance_score) or np.isinf(distance_score): distance_score = 0.0 # Calculate final score: direct addition peak_score = intensity_score + distance_score # If score is negative, set to 0 if peak_score < 0: peak_score = 0.0 # Add to total score final_score += peak_score logger.debug(f"Scoring permutation: {perm}, Score: {final_score:.6f}, Is decoy: {is_decoy}") # Store scores if is_decoy: self.neg_permutation_score_map[perm] = final_score logger.debug(f"Stored decoy score: {perm} -> {final_score:.6f}") else: self.pos_permutation_score_map[perm] = final_score logger.debug(f"Stored real score: {perm} -> {final_score:.6f}") # Collect all scores into a list all_scores = [] for score in self.pos_permutation_score_map.values(): all_scores.append(score) # Only add decoy scores in non-unambiguous cases if not self.is_unambiguous: for score in self.neg_permutation_score_map.values(): all_scores.append(score) # Sort (from low to high, then reverse) all_scores.sort() # From low to high all_scores.reverse() # From high to low # Get first two scores, using 6 decimal precision score1 = self._round_dbl(all_scores[0], 6) if all_scores else 0.0 score2 = 0.0 if not self.is_unambiguous: score2 = self._round_dbl(all_scores[1], 6) if len(all_scores) > 1 else 0.0 # Find corresponding peptide sequence pep1 = "" pep2 = "" num_assigned = 0 if not self.is_unambiguous: # First find highest and second highest scores in non-decoy permutations for p, s in self.pos_permutation_score_map.items(): x = s d = self._round_dbl(x, 6) if (d == score1) and (pep1 == ""): pep1 = p num_assigned += 1 elif (d == score2) and (pep2 == ""): pep2 = p num_assigned += 1 if num_assigned == 2: break # If not found, search in decoy permutations if num_assigned != 2: for p, s in self.neg_permutation_score_map.items(): x = s d = self._round_dbl(x, 6) if (d == score1) and (pep1 == ""): pep1 = p num_assigned += 1 elif (d == score2) and (pep2 == ""): pep2 = p num_assigned += 1 if num_assigned == 2: break else: # Special handling for unambiguous cases for p, s in self.pos_permutation_score_map.items(): pep1 = p break # Calculate delta score if self.is_unambiguous: # Unambiguous case: deltaScore = score1 self.delta_score = score1 logger.debug(f"Unambiguous case, delta score: {self.delta_score}") else: # Ambiguous case: deltaScore = score1 - score2 self.delta_score = score1 - score2 logger.debug(f"Non-unambiguous case, delta score: {score1} - {score2} = {self.delta_score}") # Set psm_score to score1 self.psm_score = score1 self.score1_pep = pep1 self.score2_pep = pep2 # Set is_decoy flag based on highest scoring peptide if pep1: self.is_decoy = self._is_decoy_sequence(pep1) except Exception as e: logger.error(f"Error scoring permutations: {str(e)}") self.delta_score = 0.0 self.psm_score = 0.0 Decoy Handling _is_decoy_sequence and generate_decoy_permutations rely on DECOY_AA_MAP and get_decoy_symbol; ensure AA mass lookup in _get_aa_mass supports decoy symbols (currently uppercasing drops decoy symbol mapping). Also tolerance is doubled for decoys; validate this does not bias delta scores unfairly. def _is_decoy_sequence(self, sequence: str) -> bool: """ Check if a sequence is a decoy sequence. Args: sequence: Peptide sequence to check Returns: bool: True if sequence is a decoy sequence """ try: # First remove modification markers, then check for decoy amino acids unmod_seq = "" i = 0 while i < len(sequence): if sequence[i:i+9] == "(Phospho)": i += 9 elif sequence[i:i+11] == "(Oxidation)": i += 11 else: unmod_seq += sequence[i] i += 1 # Check if sequence contains decoy amino acid symbols after removing modifications for aa in unmod_seq: if aa in DECOY_AA_MAP: return True return False except Exception as e: logger.error(f"Error checking decoy sequence: {str(e)}") return False def _get_mod_map(self, perm: str) -> Dict[int, float]: """ Get peptide modification site mapping Args: perm: Peptide permutation Returns: Dict[int, float]: Position -> modification mass """ mod_map = {} # Handle lowercase letter format modifications (new format) for i, aa in enumerate(perm): if aa.islower() and aa.upper() in ['S', 'T', 'Y']: # Lowercase letters indicate phosphorylation modification sites mod_map[i] = PHOSPHO_MOD_MASS elif aa.islower() and aa.upper() in ['M', 'W', 'F', 'Y']: # Lowercase letters indicate oxidation modification sites mod_map[i] = OXIDATION_MASS elif aa in DECOY_AA_MAP: # Handle decoy amino acids # In Java, decoy amino acids already have extra mass, no need to add modification mass # Because decoy amino acid mass already includes DECOY_MASS in AA_MASSES pass # Don't add extra modification mass # Handle bracket format modifications (compatibility with old format) i = 0 while i < len(perm): if perm[i:i+9] == "(Phospho)": # Get original amino acid at modification site orig_aa = perm[i-1].upper() # Convert to uppercase if orig_aa in AA_MASSES: mod_map[i-1] = PHOSPHO_MOD_MASS i += 9 elif perm[i:i+11] == "(Oxidation)": # Get original amino acid at modification site orig_aa = perm[i-1].upper() if orig_aa in AA_MASSES: mod_map[i-1] = OXIDATION_MASS i += 11 else: i += 1 return mod_map

qodo-code-review · 2025-08-20T10:08:18Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
High-level	Inconsistent two-stage workflow The codebase implements two different two-stage pipelines (CoreProcessor’s process_stage1/process_stage2 vs. PSM.process/process_round2 and CLI’s round 0/2), with mismatched round numbers and divergent FLR assignment paths. This duplication risks inconsistent scoring/FLR results and hard-to-reproduce behavior; converge on a single orchestrator and a single round semantic (e.g., RN=0/1) for permutation generation, scoring, and FLR mapping used consistently across CLI, core, PSM, and parallel modules. Examples: pyLucXor/cli.py [508-555] # 5. Score all PSMs with trained model self.logger.info("Starting first round calculation (including decoy permutations)...") # Use multi-threading for PSM processing num_threads = get_optimal_thread_count(len(all_psms), max_threads=args.num_threads) self.logger.info(f"Using {num_threads} threads for PSM processing...") parallel_psm_processing(all_psms, model=self.model, round_number=0, num_threads=num_threads) # === Round 1: Execute FLR calculation and establish mapping relationships === ... (clipped 38 lines) pyLucXor/core.py [40-62] def process_all_psms(self) -> None: """ Process all PSMs using a two-stage workflow: Stage 1 (RN=0): Calculate FLR estimates Stage 2 (RN=1): Assign FLR values to PSMs """ logger.info(f"Starting to process all PSMs, total PSM count: {len(self.psms)}") # Create FLR calculator self.flr_calculator = FLRCalculator( ... (clipped 13 lines) Solution Walkthrough: Before: # In cli.py, a manual two-round workflow is implemented class PyLuciPHOr2: def run(self): # ... parallel_psm_processing(all_psms, model, round_number=0) flr_calc = FLRCalculator(...) flr_calc.calculate_flr_estimates(all_psms) # ... parallel_psm_processing(all_psms, flr_calc, round_number=2) # In core.py, a different two-stage workflow exists but is unused class CoreProcessor: def process_all_psms(self): self._process_psms_stage1() # Calls psm.process() with round 0 self._process_psms_stage2() # Calls a non-existent psm.process_stage2() After: # In cli.py, delegate workflow to a single orchestrator class PyLuciPHOr2: def run(self): # ... processor = CoreProcessor(config, all_psms) processor.run_workflow() processor.write_results(...) # In core.py, a single, consistent workflow is defined class CoreProcessor: def run_workflow(self): # Round 1 (with decoys) parallel_psm_processing(self.psms, model, round_number=1) self.flr_calculator.calculate_and_map_flr(self.psms) # Round 2 (no decoys) parallel_psm_processing(self.psms, self.flr_calculator, round_number=2) Suggestion importance[1-10]: 9 __ Why: The suggestion correctly identifies a critical architectural flaw where two separate and inconsistent two-stage workflows are implemented in `cli.py` and `core.py`, which impacts correctness and maintainability.	High
Possible issue	Fix residue indexing with mods The loop indexes amino acids by string position, but `perm` includes bracketed annotations and decoy symbols; indexing by string index causes mod positions and ion lengths to be wrong. Iterate over the amino-acid sequence only (excluding annotations) while tracking residue index, and apply `mod_map` using residue indices to avoid off-by-one and mass accumulation errors. pyLucXor/psm.py [854-914] def _calc_theoretical_masses(self, perm: str) -> List[float]: - """ - Calculate theoretical masses - ... - # Calculate b ions + if not perm: + return [] + mod_map = self._get_mod_map(perm) # keys are residue indices + masses = [] + # Build a clean residue list with mapping from string index to residue index + residues = [] + res_idx = -1 + i = 0 + while i < len(perm): + if perm[i:i+9] == "(Phospho)": + i += 9 + continue + if perm[i:i+11] == "(Oxidation)": + i += 11 + continue + ch = perm[i] + if ch in ['[', ']']: + i += 1 + continue + # new residue + res_idx += 1 + residues.append((res_idx, ch)) + i += 1 + # b-ions current_mass = 0.0 - for i in range(len(perm)): - if perm[i] not in ['(', ')']: # Skip modification markers - # Handle amino acids, including lowercase letters (modification sites) - if perm[i].upper() in AA_MASSES: - current_mass += self._get_aa_mass(perm[i].upper()) - elif perm[i] in DECOY_AA_MAP: - # Handle decoy amino acids - directly use decoy amino acid mass - # Decoy amino acid mass already includes DECOY_MASS in AA_MASSES - current_mass += self._get_aa_mass(perm[i]) - if i in mod_map: - current_mass += mod_map[i] - # Only add ions longer than minimum length - if i >= 1: # Minimum length is 2 - # Only generate ions with charge less than peptide charge - for z in range(1, self.charge): # Only generate charge states 1 to charge-1 - ion_mass = (current_mass + PROTON_MASS) / z - if ion_mass < self.peptide.get_precursor_mass(): - masses.append(ion_mass) - ... - # Calculate y ions + for k, (ridx, ch) in enumerate(residues): + # amino-acid mass (handles decoys via _get_aa_mass) + current_mass += self._get_aa_mass(ch) + if ridx in mod_map: + current_mass += mod_map[ridx] + if k >= 1: + for z in range(1, max(1, (self.charge or 1))): + ion_mz = (current_mass + PROTON_MASS) / z + if ion_mz < self.peptide.get_precursor_mass(): + masses.append(ion_mz) + # y-ions current_mass = 0.0 - for i in range(len(perm)-1, -1, -1): - if perm[i] not in ['(', ')']: # Skip modification markers - # Handle amino acids, including lowercase letters (modification sites) - if perm[i].upper() in AA_MASSES: - current_mass += self._get_aa_mass(perm[i].upper()) - elif perm[i] in DECOY_AA_MAP: - # Handle decoy amino acids - directly use decoy amino acid mass - # Decoy amino acid mass already includes DECOY_MASS in AA_MASSES - current_mass += self._get_aa_mass(perm[i]) - if i in mod_map: - current_mass += mod_map[i] - # Only add ions longer than minimum length - if i <= len(perm)-2: # Minimum length is 2 - # Only generate ions with charge less than peptide charge - for z in range(1, self.charge): # Only generate charge states 1 to charge-1 - ion_mass = (current_mass + WATER_MASS + PROTON_MASS) / z - if ion_mass < self.peptide.get_precursor_mass(): - masses.append(ion_mass) + for k, (ridx, ch) in enumerate(reversed(residues)): + current_mass += self._get_aa_mass(ch) + if ridx in mod_map: + current_mass += mod_map[ridx] + if k >= 1: + for z in range(1, max(1, (self.charge or 1))): + ion_mz = (current_mass + WATER_MASS + PROTON_MASS) / z + if ion_mz < self.peptide.get_precursor_mass(): + masses.append(ion_mz) + return masses `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 9 __ Why: The suggestion correctly identifies a critical bug where iterating over the raw permutation string, which includes multi-character modification annotations, leads to incorrect residue indexing for mass calculations and ion length checks.	High
	Use proper Spectrum object `spectrum_dict` is a plain dict, but later workers expect attributes like `spectrum.native_id` and methods like `get_peaks()`. This mismatch will cause runtime errors. Wrap the mz/intensity arrays in your `Spectrum` class and attach `native_id` and `rt` attributes to match downstream access. pyLucXor/cli.py [432-450] +# Build Spectrum object to satisfy downstream access +mz_arr = np.array(spectrum.get_peaks()[0]) +int_arr = np.array(spectrum.get_peaks()[1]) +spectrum_obj = Spectrum(mz_arr, int_arr) +spectrum_obj.native_id = spectrum.getNativeID() +spectrum_obj.rt = spectrum.getRT() + peptide = Peptide(sequence, config, charge=charge) -psm = PSM(peptide, spectrum_dict, config=config) +psm = PSM(peptide, spectrum_obj, config=config) # Automatically determine score type and assign values score_type = pep_id.getScoreType() if hasattr(pep_id, 'getScoreType') else None score = hit.getScore() higher_score_better = True if hasattr(pep_id, 'isHigherScoreBetter'): higher_score_better = pep_id.isHigherScoreBetter() elif hasattr(pep_id, 'getHigherScoreBetter'): higher_score_better = pep_id.getHigherScoreBetter() -# If it's a PEP score (lower is better), convert to 1-PEP if (score_type and 'posterior error probability' in score_type.lower()) or (higher_score_better is False): psm.psm_score = 1.0 - score peptide.score = 1.0 - score else: psm.psm_score = score peptide.score = score Apply / Chat Suggestion importance[1-10]: 9 __ Why: The suggestion correctly identifies that passing a plain dictionary `spectrum_dict` to the `PSM` constructor will cause runtime errors, as downstream code expects a `Spectrum` object with specific methods and attributes.	High
	Use tolerance for unmatched filtering Comparing floats for membership causes many matched peaks to be misclassified as unmatched due to small m/z differences. Use a tolerance-based match (e.g., absolute delta within the same fragment tolerance) when filtering unmatched peaks to avoid contaminating the noise distribution. pyLucXor/models.py [858-872] -# Get unmatched peaks (get all peaks from spectrum, then exclude matched peaks) +# Get unmatched peaks with tolerance-based exclusion mz_values, intensities = psm.spectrum.get_peaks() -matched_mz = {peak['mz'] for peak in matched_peaks} +tol = psm.config.get('fragment_mass_tolerance', 0.5) +matched_mz = np.array([pk['mz'] for pk in matched_peaks], dtype=float) for i in range(len(mz_values)): - if mz_values[i] not in matched_mz: + mz = float(mz_values[i]) + is_matched = False + if matched_mz.size > 0: + # check within tolerance + if np.any(np.abs(matched_mz - mz) <= tol): + is_matched = True + if not is_matched: neg_peak = { - 'mz': mz_values[i], + 'mz': mz, 'intensity': intensities[i], 'norm_intensity': psm.spectrum.norm_intensity[i] if hasattr(psm.spectrum, 'norm_intensity') else intensities[i], 'mass_diff': 0.0, 'matched': False, 'matched_ion_str': 'n' } neg_peaks.append(neg_peak) Apply / Chat Suggestion importance[1-10]: 8 __ Why: The suggestion correctly points out that using direct float comparison for `m/z` values is unreliable due to precision issues, and proposes a more robust tolerance-based check, which is crucial for accurately separating matched and unmatched peaks.	Medium
	Fix wrong constructor arguments The `Peptide` constructor signature is `(peptide, config, mod_pep=None,` `charge=None)`, but here the `config` and an empty dict are swapped. This will break initialization and downstream logic. Pass `config` as the second argument and remove the stray empty dict. pyLucXor/cli.py [295-296] # Create peptide object -peptide = Peptide(sequence, {}, config, charge=charge) +peptide = Peptide(sequence, config, charge=charge) Apply / Chat Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies that the arguments to the `Peptide` constructor are incorrect, which would cause a runtime error if the `process_peptide` function is called.	Medium
	Prevent y-ion water double-counting The `addl_mass` parameter is never used, and y-ions add `WATER_MASS` twice because callers already include 18.01056 in `ym` calculations. Either remove double-counting or use `addl_mass` consistently. Use `addl_mass` to avoid duplicated water addition and keep callers' logic intact. pyLucXor/peptide.py [270-291] def _fragment_ion_mass(self, ion_str, z, addl_mass, ion_type='b'): """ Calculate fragment ion mass Receives complete ion string, such as "b4:FLLE" or "y2:sK" """ mass = 0.0 # Parse amino acid sequence starting after colon start = ion_str.find(":") + 1 stop = len(ion_str) seq = ion_str[start:stop] - for idx, aa in enumerate(seq): - # Directly use mass mapping table, lowercase letters represent modified amino acids + for aa in seq: if aa in AA_MASSES: mass += AA_MASSES[aa] - # Both b/y ions add z proton masses, y ions also add 1 water molecule mass + # Apply any additional mass (e.g., WATER_MASS for y-ions) from caller + mass += addl_mass + # Add z protons mass += PROTON_MASS * z - if ion_type == 'y': - mass += WATER_MASS return mass Apply / Chat Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies that `WATER_MASS` is added twice for y-ions, once in the caller and once in `_fragment_ion_mass`, leading to incorrect mass calculations.	Medium
	Fix inconsistent max index tracking Store `max_i_index` as the index within the filtered arrays, not the original input index, to keep it consistent with `self.mz` and `self.raw_intensity`. This prevents out-of-bounds or mismatched indexing when using `max_i_index` later. pyLucXor/spectrum.py [58-65] for i, idx in enumerate(cand_peaks): intensity = intensity_array[idx] if intensity > self.max_i: self.max_i = intensity - self.max_i_index = idx + self.max_i_index = i # index within filtered arrays self.mz[i] = mz_array[idx] self.raw_intensity[i] = intensity Apply / Chat Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies an inconsistency in how `max_i_index` is stored, which could lead to out-of-bounds errors if the attribute were used. This change improves code correctness and robustness.	Low
General	Normalize neutral loss sign convention Use consistent sign conventions for neutral losses; representing a loss as positive mass while another entry uses a negative for the same loss is inconsistent and can cause double-negation bugs. Align the amino-acid-specific entries with the 'sty' loss to negative values. pyLucXor/constants.py [148-159] # Neutral losses NEUTRAL_LOSSES = { - 'H3PO4': 97.976896, # Phosphoric acid - 'H2O': 18.010565, # Water - 'NH3': 17.026549, # Ammonia - 'CO': 27.994915, # Carbon monoxide - 'CO2': 43.989829, # Carbon dioxide - 'sty': -97.97690, # H3PO4 - 'S': 98.00039, # Ser phosphorylation neutral loss - 'T': 98.00039, # Thr phosphorylation neutral loss - 'Y': 98.00039 # Tyr phosphorylation neutral loss + 'H3PO4': 97.976896, # Phosphoric acid (magnitude) + 'H2O': 18.010565, # Water (magnitude) + 'NH3': 17.026549, # Ammonia (magnitude) + 'CO': 27.994915, # Carbon monoxide (magnitude) + 'CO2': 43.989829, # Carbon dioxide (magnitude) + 'sty': -97.97690, # Loss of H3PO4 applied to S/T/Y + 'S': -97.97690, # Ser phosphorylation neutral loss (consistent with sty) + 'T': -97.97690, # Thr phosphorylation neutral loss + 'Y': -97.97690 # Tyr phosphorylation neutral loss } Apply / Chat Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies an inconsistency in the sign convention for neutral loss masses, which could lead to bugs. Unifying the signs improves code correctness and maintainability.	Medium
More

coderabbitai

Actionable comments posted: 21

🧹 Nitpick comments (39)

pyLucXor/README.md (2)

170-173: Specify a language for the fenced code block (markdownlint MD040).

Add a language to the citation block for consistency with linters and better rendering.

Apply this diff:
-```
+```text
 Fermin, D., et al. (2011). LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using target-decoy approach. 
 Molecular & Cellular Proteomics, 10(3), M110.003830.
---

`30-34`: **Replace placeholder repository URL in install instructions.**

The placeholder `<repository-url>` can cause copy-paste errors for users.

Example fix:

```diff
-git clone <repository-url>
+git clone https://github.com/bigbio/onsite.git
If this is a monorepo or subdirectory, add a note clarifying the path to the package root.

pyLucXor/constants.py (2)

59-66: Mixing modified residue masses into AA_MASSES via lowercase keys can be brittle.

Encoding PTMs by overloading residue case makes parsing fragile and opaque. Prefer explicit modification handling (e.g., a sequence + PTM map or using MOD_MASSES and dedicated logic).

Consider removing lowercase entries from AA_MASSES and centralizing PTM mass adjustments in peptide/psm logic. If you keep this, document the convention prominently and ensure all parsers/generators adhere to it.

95-100: Decoy mass augmentation mutates AA_MASSES; ensure no double-add and document.

The guard prevents double-insertion per interpreter session, but the side-effectful mutation on import should be documented since it affects any consumer importing constants.

Add a comment above the loop explaining idempotency and intent, or move to a function that returns an augmented copy when needed to avoid global mutation.

pyLucXor/config.py (2)

62-70: Align set semantics with update: warn on unknown keys.

Currently set silently introduces new keys while update warns on unknown keys. This inconsistency is surprising.

Apply this diff:

     def set(self, key: str, value: Any) -> None:
         """
         Set a configuration value.
@@
-        self.config[key] = value
+        if key in self.config:
+            self.config[key] = value
+        else:
+            logger.warning(f"Unknown configuration key: {key}")

93-101: Also guard __setitem__ against unknown keys for consistency.

Keep behavior consistent across all mutation paths.

Apply this diff:

     def __setitem__(self, key: str, value: Any) -> None:
         """
         Set a configuration value using dictionary syntax.
@@
-        self.config[key] = value
+        if key in self.config:
+            self.config[key] = value
+        else:
+            logger.warning(f"Unknown configuration key: {key}")

pyLucXor/globals.py (2)

5-11: Remove unused import Optional.

Optional is not used.

Apply this diff:

-from typing import List, Dict, Optional
+from typing import List, Dict

81-99: Use direct reverse map for decoy symbol lookup.

You re-derive the reverse mapping via iteration. Use AA_DECOY_MAP for O(1) lookup and simpler code.

Apply this diff:

-from .constants import DECOY_AA_MAP
+from .constants import AA_DECOY_MAP
@@
-    ret = ""
-    src_char = c.upper()
-    
-    for k, v in DECOY_AA_MAP.items():
-        if v.upper() == src_char:
-            ret = k
-            break
-            
-    return ret 
+    return AA_DECOY_MAP.get(c.upper(), "")

pyLucXor/peak.py (1)

119-127: Make from_dict robust to missing keys.

Use .get() with sensible defaults to ease forward/backward compatibility of serialized data.

Apply this diff:

-        peak = cls(data["mz"], data["raw_intensity"], data["norm_intensity"])
-        peak.rel_intensity = data["rel_intensity"]
-        peak.matched = data["matched"]
-        peak.dist = data["dist"]
-        peak.matched_ion_str = data["matched_ion_str"]
-        peak.matched_ion_mz = data["matched_ion_mz"]
-        peak.score = data["score"]
-        peak.intensity_score = data["intensity_score"]
-        peak.dist_score = data["dist_score"]
+        peak = cls(
+            data["mz"],
+            data["raw_intensity"],
+            data.get("norm_intensity", 0.0),
+        )
+        peak.rel_intensity = data.get("rel_intensity", 0.0)
+        peak.matched = data.get("matched", False)
+        peak.dist = data.get("dist", 0.0)
+        peak.matched_ion_str = data.get("matched_ion_str", "")
+        peak.matched_ion_mz = data.get("matched_ion_mz", 0.0)
+        peak.score = data.get("score", 0.0)
+        peak.intensity_score = data.get("intensity_score", 0.0)
+        peak.dist_score = data.get("dist_score", 0.0)

pyLucXor/spectrum.py (4)

9-10: Remove unused imports to satisfy linter and reduce import overhead

Dict, List, Optional, and pyopenms are not used in this module.

-from typing import Dict, List, Tuple, Optional
-import pyopenms
+from typing import Tuple

43-66: Vectorize peak initialization for correctness and speed; also clarify max_i_index semantics

You can replace the loop-based zero-intensity filtering and manual array population with a vectorized approach. This both simplifies the code and ensures max_i_index is the index in the filtered arrays (current code stores the index from the original arrays, which can be confusing downstream).

Is max_i_index intended to be an index into the filtered arrays (self.raw_intensity) or the original input arrays? The change below makes it index into the filtered arrays.

-        # Filter out zero intensity peaks
-        cand_peaks = []
-        for i in range(N):
-            if intensity_array[i] > 0:
-                cand_peaks.append(i)
-
-        self.N = len(cand_peaks)
-
-        # Initialize arrays
-        self.mz = np.zeros(self.N)
-        self.raw_intensity = np.zeros(self.N)
-        self.rel_intensity = np.zeros(self.N)
-        self.norm_intensity = np.zeros(self.N)
-
-        # Record peak data
-        for i, idx in enumerate(cand_peaks):
-            intensity = intensity_array[idx]
-            if intensity > self.max_i:
-                self.max_i = intensity
-                self.max_i_index = idx
-            
-            self.mz[i] = mz_array[idx]
-            self.raw_intensity[i] = intensity
+        # Filter out zero intensity peaks (vectorized)
+        mz_array = np.asarray(mz_array)
+        intensity_array = np.asarray(intensity_array)
+        mask = intensity_array[:N] > 0
+        self.N = int(np.count_nonzero(mask))
+
+        # Initialize arrays
+        self.mz = mz_array[:N][mask].astype(float, copy=False)
+        self.raw_intensity = intensity_array[:N][mask].astype(float, copy=False)
+        self.rel_intensity = np.zeros(self.N, dtype=float)
+        self.norm_intensity = np.zeros(self.N, dtype=float)
+
+        # Record maxima
+        if self.N > 0:
+            self.max_i_index = int(np.argmax(self.raw_intensity))
+            self.max_i = float(self.raw_intensity[self.max_i_index])

83-101: Use numpy median and vectorized log for normalization

Equivalent, faster, and safer. Also guards against accidental non-positive medians.

-        # Sort relative intensities
-        sorted_intensities = sorted(self.rel_intensity)
-        
-        # Calculate median
-        mid = self.N // 2
-        if self.N % 2 == 0:  # even number of peaks
-            a = sorted_intensities[mid - 1]
-            b = sorted_intensities[mid]
-            median_i = (a + b) / 2.0
-        else:  # odd number of peaks
-            median_i = sorted_intensities[mid]
-        
-        # Calculate normalized intensities
-        for i in range(self.N):
-            d = self.rel_intensity[i] / median_i
-            if d > 0:  # avoid log(0)
-                self.norm_intensity[i] = np.log(d)
-            else:
-                self.norm_intensity[i] = float('-inf')
+        # Compute median using numpy
+        median_i = float(np.median(self.rel_intensity))
+        if median_i <= 0:
+            # Defensive: if median is non-positive, set to -inf
+            self.norm_intensity.fill(float("-inf"))
+            return
+
+        # Vectorized log normalization; avoid log(0) by using where/out
+        self.norm_intensity = np.full_like(self.rel_intensity, float("-inf"))
+        np.log(self.rel_intensity / median_i,
+               out=self.norm_intensity,
+               where=self.rel_intensity > 0)

131-134: Optional: speed up m/z index lookup

For large spectra, a vectorized search avoids Python loops; if self.mz is sorted, consider binary search, otherwise vectorized absolute diff works.

-        for i in range(self.N):
-            if abs(self.mz[i] - mz) < 1e-6:  # floating point comparison
-                return i
-        return -1 
+        idxs = np.where(np.abs(self.mz - mz) < 1e-6)[0]
+        return int(idxs[0]) if idxs.size else -1

pyLucXor/flr.py (5)

9-11: Drop unused imports (typing and concurrent.futures)

These symbols aren’t used here; remove them to appease linters and speed imports.

-from typing import List, Dict, Any, Optional, Tuple
-from concurrent.futures import ThreadPoolExecutor, as_completed
+from typing import List, Tuple

394-403: Guard against tiny denominators when computing FDR ratios

Division by very small g_auc_f1/l_auc_f1 can produce inf/NaN. Clamp denominator with TINY_NUM. You already use TINY_NUM in KDE; extend the safety here.

-            ratio = (Ndecoy2/Nreal2) * (g_auc_f0 / g_auc_f1)
+            denom = max(g_auc_f1, TINY_NUM)
+            ratio = (Ndecoy2 / Nreal2) * (g_auc_f0 / denom)
             self.global_fdr[i] = ratio
@@
-            ratio = (Ndecoy2/Nreal2) * (l_auc_f0 / l_auc_f1)
+            denom = max(l_auc_f1, TINY_NUM)
+            ratio = (Ndecoy2 / Nreal2) * (l_auc_f0 / denom)
             self.local_fdr[i] = ratio

125-138: Handle small-sample bandwidth edge cases explicitly

When N <= 1 or variance is 0, bandwidth should be 0 (or a small epsilon), not NaN/inf. Make this explicit.

         if data_type == REAL:
             sigma = np.sqrt(self.delta_score_var_pos)
             N = float(self.n_real)
-            x = np.power(N, 0.2)
-            result = 1.06 * (sigma / x)
+            if N <= 1 or sigma == 0.0:
+                result = 0.0
+            else:
+                result = 1.06 * (sigma / (N ** 0.2))
             self.bw_real = result
@@
         elif data_type == DECOY:
             sigma = np.sqrt(self.delta_score_var_neg)
             N = float(self.n_decoy)
-            x = np.power(N, 0.2)
-            result = 1.06 * (sigma / x)
+            if N <= 1 or sigma == 0.0:
+                result = 0.0
+            else:
+                result = 1.06 * (sigma / (N ** 0.2))
             self.bw_decoy = result

205-211: Treat non-finite/negative bandwidth as invalid

If bw becomes NaN/inf or negative, fall back to TINY_NUM densities to avoid propagating invalid values.

-        if N == 0 or bw == 0:
+        if N == 0 or not np.isfinite(bw) or bw <= 0:
             # Avoid division by zero
             if data_type == DECOY:
                 self.f0 = np.full(self.NMARKS, TINY_NUM)
             else:
                 self.f1 = np.full(self.NMARKS, TINY_NUM)
             return

460-481: Minorization slope divisions can divide by zero

Both denominators can hit zero in degenerate cases, producing infs. Add guards.

-            slope = (0.0 - min_val) / (self.max_delta_score * 1.1 - x[min_idx])
+            denom = (self.max_delta_score * 1.1 - x[min_idx])
+            slope = (0.0 - min_val) / denom if denom != 0 else 0.0
@@
-            slope = max_val / (x[max_idx] - x[-1])
+            denom2 = (x[max_idx] - x[-1])
+            slope = max_val / denom2 if denom2 != 0 else 0.0

pyLucXor/__init__.py (1)

22-25: Avoid F401 warnings for CIDModel/HCDModel lazy exports

Returning via locals()[name] still trips some linters. Use a small mapping for clarity and to placate static analysis.
-    elif name in ["CIDModel", "HCDModel"]:
-        from .models import CIDModel, HCDModel
-        return locals()[name]
+    elif name in ("CIDModel", "HCDModel"):
+        from .models import CIDModel, HCDModel
+        return {"CIDModel": CIDModel, "HCDModel": HCDModel}[name]

pyLucXor/core.py (3)

8-10: Clean up unused imports

numpy and typing (Dict, Any, Optional) are not used in this file.
-import numpy as np
 import pyopenms
-from typing import List, Dict, Any, Optional
+from typing import List
131-135: Avoid double assignment of FLR values after process_round2

process_round2 is expected to assign FLR via find_closest_flr. Calling assign_flr_to_psms again may overwrite those values and currently assigns FLR to decoys (fixed separately). Consider removing this to keep a single source of truth.

Would you like me to adjust the stage-2 flow to rely solely on process_round2 for FLR assignment?

200-203: CSV export writes modified peptide equal to unmodified peptide

Use the modified representation if available.
-                    f.write(f"{psm.scan_num},{psm.peptide.peptide},{psm.peptide.peptide},{psm.psm_score:.4f},{psm.delta_score:.4f},{psm.global_flr:.4f},{psm.local_flr:.4f},{1 if psm.is_decoy else 0}\n")
+                    mod_pep = getattr(psm.peptide, "mod_peptide", psm.peptide.peptide)
+                    f.write(f"{psm.scan_num},{psm.peptide.peptide},{mod_pep},{psm.psm_score:.4f},{psm.delta_score:.4f},{psm.global_flr:.4f},{psm.local_flr:.4f},{1 if psm.is_decoy else 0}\n")

pyLucXor/cli.py (6)

10-11: Prune unused imports

json, NTERM_MOD, CTERM_MOD, and AA_MASSES are not used in this file.
-import json
@@
-from pyLucXor.constants import (
-    NTERM_MOD,
-    CTERM_MOD,
-    AA_MASSES,
-    DEFAULT_CONFIG
-)
+from pyLucXor.constants import DEFAULT_CONFIG
Also applies to: 29-33

528-531: Flatten nested if for readability

Minor readability improvement; no behavior change.
-            if hasattr(psm, 'delta_score') and not np.isnan(psm.delta_score):
-                if psm.delta_score > global_flr_calculator.min_delta_score:
-                    global_flr_calculator.add_psm(psm.delta_score, psm.is_decoy)
+            if hasattr(psm, 'delta_score') and not np.isnan(psm.delta_score) \
+               and psm.delta_score > global_flr_calculator.min_delta_score:
+                global_flr_calculator.add_psm(psm.delta_score, psm.is_decoy)
395-395: Rename unused loop index variable

i is unused; use _ to signal intent and silence linters.
-        for i, pep_id in enumerate(pep_ids, 1):
+        for _, pep_id in enumerate(pep_ids, 1):
666-668: Avoid bare except in MGF TITLE parsing

Catching all exceptions hides real errors; scope to expected ones.
-                        except:
+                        except (ValueError, IndexError):
                             logger.warning(f"Cannot extract scan number from title: {line}")
691-693: Avoid bare except when parsing peak lines

Be explicit about parsing failures.
-                        except:
+                        except ValueError:
                             logger.warning(f"Cannot parse peak data: {line}")
521-542: Consider skipping decoy FLR assignment in round 1 mapping

After record_flr_estimates, calling assign_flr_to_psms will assign FLR to decoys unless assign_flr_to_psms is fixed (see flr.py). Once fixed, this is okay. Otherwise, skip assigning here and only assign in round 2.

Do you want me to refactor the flow to avoid assigning FLR in round 1 entirely and rely strictly on round 2 mapping?

pyLucXor/parallel.py (3)

181-187: Thread chunking can create highly imbalanced work for small lists

Using len(psms) // num_threads can create many 1-element tasks and underutilize threads for small inputs. Consider reusing get_optimal_thread_count to bound threads to data size.

Apply this diff:

-    if num_threads is None:
-        num_threads = os.cpu_count() or DEFAULT_NUM_THREADS
+    if num_threads is None:
+        num_threads = get_optimal_thread_count(len(psms))

96-113: Remove unused locals and narrow exception in SpectrumMatchingWorker.match_spectrum_peptide

spectrum_native_id and rt are unused; the inner except Exception as e doesn’t use e.

Apply this diff:

-            # Get spectrum basic information
-            spectrum_native_id = getattr(psm.spectrum, 'native_id', None)
-            rt = getattr(psm.spectrum, 'rt', None)
+            # Spectrum basic information is attached below via getattr on psm/psm.spectrum as needed
@@
-            try:
-                matched_peaks = psm.peptide.match_peaks(psm.spectrum)
-            except Exception as e:
-                matched_peaks = []
+            try:
+                matched_peaks = psm.peptide.match_peaks(psm.spectrum)
+            except Exception:
+                matched_peaks = []

11-11: Drop unused import

threading is unused in this module.

-import threading

pyLucXor/psm.py (3)

781-786: Simplify and harden decoy detection

Use any(...) directly; it’s clearer and avoids early-return logic errors.

-            for aa in unmod_seq:
-                if aa in DECOY_AA_MAP:
-                    return True
-                    
-            return False
+            return any(aa in DECOY_AA_MAP for aa in unmod_seq)

1216-1219: Remove unused local variable

original_tolerance is assigned but never used.

-        original_tolerance = self.config.get('fragment_mass_tolerance', 0.1)
         temp_config = self.config.copy()

7-28: Prune unused imports (keeps runtime trim and static analysis clean)

A number of imports are unused in this module.

-import math
-from typing import Dict, List, Optional, Tuple, Set, Any, Union
+import math
+from typing import Dict, List, Optional, Tuple, Any, Union
@@
-import pyopenms
 from pyopenms import AASequence, Modification
@@
-    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP, AA_DECOY_MAP, NEUTRAL_LOSSES, MIN_DELTA_SCORE, PROTON_MASS, WATER_MASS,
-    DECOY_AMINO_ACIDS, PHOSPHO_MOD_MASS, OXIDATION_MASS
+    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP, PROTON_MASS, WATER_MASS,
+    PHOSPHO_MOD_MASS, OXIDATION_MASS
 )
-from .peak import Peak
 from .peptide import Peptide
 from .spectrum import Spectrum
-from .flr import FLRCalculator
 from .globals import get_decoy_symbol

pyLucXor/peptide.py (2)

687-701: Remove unused debug block in match_peaks or emit it to logs

output_data is built but never used. Either remove it or log/emit when debug is enabled.
-        if use_config.get('debug', False):
-            output_data = {
-                'matched_peaks': matched_peaks,
-                'theoretical_ions': {
-                    'y_ions': y_ions,
-                    'b_ions': b_ions
-                },
-                'peptide': self.peptide,
-                'modified_peptide': self.mod_peptide,
-                'charge': self.charge
-            }
+        if use_config.get('debug', False):
+            logger.debug(f"Matched peaks: {len(matched_peaks)} | peptide={self.peptide} z={self.charge}")
521-586: Potential index mismatch in get_permutations for ambiguous peptides

You iterate sites from self.peptide (which contains marker substrings) but then index into self.mod_peptide. These indices likely diverge once markers are present, producing incorrect lowercase positions.

Consider deriving candidate sites from a “marker-free” sequence and mapping to mod_peptide indices consistently. For example, build an index map from original peptide positions (excluding markers) to mod_peptide indices before flipping case. I can draft this refactor if this method is still in use; otherwise consider removing it to reduce confusion since PSM.generate_permutations covers the needed logic.

pyLucXor/models.py (3)

936-964: Remove unused local and simplify CID distance log-density

mu is assigned but not used; the implementation assumes mean 0. Keep it concise.

-        mu = 0.0 
-        var = charge_model.var_dist_b  
+        var = charge_model.var_dist_b
@@
-        log_prob = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x ** 2) / var
+        log_prob = -0.5 * np.log(2 * np.pi * var) - 0.5 * (x ** 2) / var
         return log_prob

8-18: Trim unused imports

typing.Any and constants ALGORITHM_CID/ALGORITHM_HCD are unused. Keep imports tight.

-from typing import Dict, List, Optional, Tuple, Any
+from typing import Dict, List, Optional, Tuple
@@
-from .constants import ALGORITHM_CID, ALGORITHM_HCD, TINY_NUM
+from .constants import TINY_NUM

535-553: Style: collapse simple if/else paddings with ternaries (optional)

The padding of min_i/max_i blocks can be simplified; not functionally required but improves readability.

-        if min_i < 0:
-            min_i = min_i + (min_i * padding)
-        else:
-            min_i = min_i - (min_i * padding)
+        min_i = min_i + (min_i * padding) if min_i < 0 else min_i - (min_i * padding)
@@
-        if max_i < 0:
-            max_i = max_i - (max_i * padding)
-        else:
-            max_i = max_i + (max_i * padding)
+        max_i = max_i - (max_i * padding) if max_i < 0 else max_i + (max_i * padding)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 778df36 and c5f1ad6.

📒 Files selected for processing (14)

pyLucXor/README.md (1 hunks)
pyLucXor/__init__.py (1 hunks)
pyLucXor/cli.py (1 hunks)
pyLucXor/config.py (1 hunks)
pyLucXor/constants.py (1 hunks)
pyLucXor/core.py (1 hunks)
pyLucXor/flr.py (1 hunks)
pyLucXor/globals.py (1 hunks)
pyLucXor/models.py (1 hunks)
pyLucXor/parallel.py (1 hunks)
pyLucXor/peak.py (1 hunks)
pyLucXor/peptide.py (1 hunks)
pyLucXor/psm.py (1 hunks)
pyLucXor/spectrum.py (1 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (10)

pyLucXor/flr.py (2)

pyLucXor/core.py (1)

add_psm (31-38)

pyLucXor/globals.py (1)

record_flr_estimates (33-42)

pyLucXor/globals.py (2)

pyLucXor/flr.py (1)

record_flr_estimates (642-663)

pyLucXor/psm.py (1)

clear_scores (1015-1037)

pyLucXor/psm.py (5)

pyLucXor/peak.py (1)

Peak (10-129)

pyLucXor/peptide.py (4)

get_precursor_mass (165-196)

get_unmodified_sequence (883-907)

build_ion_ladders (207-268)

match_peaks (587-702)

pyLucXor/spectrum.py (5)

Spectrum (14-134)

is_empty (112-119)

median_normalize_spectra (75-101)

calc_relative_intensity (70-73)

get_peaks (103-110)

pyLucXor/flr.py (2)

add_psm (61-81)

find_closest_flr (736-764)

pyLucXor/globals.py (2)

globals (13-79)

get_decoy_symbol (81-99)

pyLucXor/parallel.py (3)

pyLucXor/psm.py (3)

score_permutations (556-733)

process (322-452)

process_round2 (454-488)

pyLucXor/cli.py (1)

process_psms (741-773)

pyLucXor/peptide.py (1)

match_peaks (587-702)

pyLucXor/__init__.py (4)

pyLucXor/cli.py (1)

PyLuciPHOr2 (40-773)

pyLucXor/peptide.py (1)

Peptide (24-907)

pyLucXor/psm.py (1)

PSM (31-1367)

pyLucXor/models.py (2)

CIDModel (787-964)

HCDModel (967-1122)

pyLucXor/core.py (3)

pyLucXor/psm.py (5)

PSM (31-1367)

process (322-452)

clear_scores (1015-1037)

get_results (916-926)

update_peptide_id (1279-1321)

pyLucXor/flr.py (5)

FLRCalculator (15-798)

add_psm (61-81)

calculate_flr_estimates (596-640)

record_flr_estimates (642-663)

assign_flr_to_psms (665-705)

pyLucXor/config.py (3)

LucXorConfig (14-113)

get (49-60)

to_dict (72-79)

pyLucXor/peptide.py (4)

pyLucXor/globals.py (2)

globals (13-79)

get_decoy_symbol (81-99)

pyLucXor/peak.py (1)

Peak (10-129)

pyLucXor/spectrum.py (1)

get_peaks (103-110)

pyLucXor/models.py (8)

get_charge_model (884-894)

get_charge_model (1065-1075)

get_log_np_density_int (662-723)

get_log_np_density_int (896-934)

get_log_np_density_int (1077-1099)

get_log_np_density_dist_pos (725-769)

get_log_np_density_dist_pos (936-964)

get_log_np_density_dist_pos (1101-1122)

pyLucXor/cli.py (6)

pyLucXor/psm.py (3)

PSM (31-1367)

charge (1244-1248)

process (322-452)

pyLucXor/peptide.py (2)

Peptide (24-907)

build_ion_ladders (207-268)

pyLucXor/models.py (4)

CIDModel (787-964)

HCDModel (967-1122)

build (802-834)

build (985-1013)

pyLucXor/spectrum.py (2)

Spectrum (14-134)

get_peaks (103-110)

pyLucXor/flr.py (6)

FLRCalculator (15-798)

add_psm (61-81)

calculate_flr_estimates (596-640)

record_flr_estimates (642-663)

assign_flr_to_psms (665-705)

save_delta_score_flr_mapping (711-734)

pyLucXor/parallel.py (3)

parallel_psm_processing (170-201)

get_optimal_thread_count (240-261)

process_psms (31-34)

pyLucXor/constants.py (1)

pyLucXor/config.py (1)

update (36-47)

pyLucXor/models.py (4)

pyLucXor/peak.py (1)

Peak (10-129)

pyLucXor/config.py (1)

get (49-60)

pyLucXor/psm.py (2)

charge (1244-1248)

_match_peaks (1190-1227)

pyLucXor/spectrum.py (1)

get_peaks (103-110)

🪛 Ruff (0.12.2)

pyLucXor/flr.py

9-9: typing.Dict imported but unused

Remove unused import

(F401)

9-9: typing.Any imported but unused

Remove unused import

(F401)

9-9: typing.Optional imported but unused

Remove unused import

(F401)

10-10: concurrent.futures.ThreadPoolExecutor imported but unused

Remove unused import

(F401)

10-10: concurrent.futures.as_completed imported but unused

Remove unused import

(F401)

pyLucXor/globals.py

6-6: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

pyLucXor/spectrum.py

9-9: typing.Dict imported but unused

Remove unused import

(F401)

9-9: typing.List imported but unused

Remove unused import

(F401)

9-9: typing.Optional imported but unused

Remove unused import

(F401)

10-10: pyopenms imported but unused

Remove unused import: pyopenms

(F401)

pyLucXor/psm.py

9-9: typing.Set imported but unused

Remove unused import: typing.Set

(F401)

16-16: pyopenms imported but unused

Remove unused import: pyopenms

(F401)

20-20: .constants.AA_DECOY_MAP imported but unused

Remove unused import

(F401)

20-20: .constants.NEUTRAL_LOSSES imported but unused

Remove unused import

(F401)

20-20: .constants.MIN_DELTA_SCORE imported but unused

Remove unused import

(F401)

21-21: .constants.DECOY_AMINO_ACIDS imported but unused

Remove unused import

(F401)

23-23: .peak.Peak imported but unused

Remove unused import: .peak.Peak

(F401)

26-26: .flr.FLRCalculator imported but unused

Remove unused import: .flr.FLRCalculator

(F401)

506-506: Loop control variable sites not used within loop body

Rename unused sites to _sites

(B007)

525-528: Combine if branches using logical or operator

Combine if branches

(SIM114)

574-577: Use ternary operator tolerance = base_tolerance * 2.0 if is_decoy else base_tolerance instead of if-else-block

(SIM108)

610-613: Combine if branches using logical or operator

Combine if branches

(SIM114)

706-706: Loop control variable s not used within loop body

(B007)

782-786: Use return any(aa in DECOY_AA_MAP for aa in unmod_seq) instead of for loop

Replace with return any(aa in DECOY_AA_MAP for aa in unmod_seq)

(SIM110)

1089-1092: Combine if branches using logical or operator

Combine if branches

(SIM114)

1180-1184: Use return any(aa in DECOY_AA_MAP for aa in sequence) instead of for loop

Replace with return any(aa in DECOY_AA_MAP for aa in sequence)

(SIM110)

1217-1217: Local variable original_tolerance is assigned to but never used

Remove assignment to unused variable original_tolerance

(F841)

1265-1265: Loop control variable mod_positions not used within loop body

Rename unused mod_positions to _mod_positions

(B007)

1272-1272: Loop control variable mod_positions not used within loop body

Rename unused mod_positions to _mod_positions

(B007)

pyLucXor/parallel.py

11-11: threading imported but unused

Remove unused import: threading

(F401)

106-106: Local variable spectrum_native_id is assigned to but never used

Remove assignment to unused variable spectrum_native_id

(F841)

107-107: Local variable rt is assigned to but never used

Remove assignment to unused variable rt

(F841)

112-112: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

pyLucXor/__init__.py

23-23: .models.CIDModel imported but unused

(F401)

23-23: .models.HCDModel imported but unused

(F401)

pyLucXor/core.py

8-8: numpy imported but unused

Remove unused import: numpy

(F401)

10-10: typing.Dict imported but unused

Remove unused import

(F401)

10-10: typing.Any imported but unused

Remove unused import

(F401)

10-10: typing.Optional imported but unused

Remove unused import

(F401)

pyLucXor/peptide.py

9-9: typing.Set imported but unused

Remove unused import

(F401)

9-9: typing.Tuple imported but unused

Remove unused import

(F401)

9-9: typing.Union imported but unused

Remove unused import

(F401)

12-12: pyopenms imported but unused

Remove unused import: pyopenms

(F401)

15-15: .constants.AA_DECOY_MAP imported but unused

Remove unused import

(F401)

16-16: .constants.ION_TYPES imported but unused

Remove unused import

(F401)

16-16: .constants.NEUTRAL_LOSSES imported but unused

Remove unused import

(F401)

281-281: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

689-689: Local variable output_data is assigned to but never used

Remove assignment to unused variable output_data

(F841)

895-898: Combine if branches using logical or operator

Combine if branches

(SIM114)

pyLucXor/cli.py

10-10: json imported but unused

Remove unused import: json

(F401)

29-29: pyLucXor.constants.NTERM_MOD imported but unused

Remove unused import

(F401)

30-30: pyLucXor.constants.CTERM_MOD imported but unused

Remove unused import

(F401)

31-31: pyLucXor.constants.AA_MASSES imported but unused

Remove unused import

(F401)

395-395: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

528-529: Use a single if statement instead of nested if statements

(SIM102)

666-666: Do not use bare except

(E722)

691-691: Do not use bare except

(E722)

pyLucXor/models.py

11-11: typing.Any imported but unused

Remove unused import: typing.Any

(F401)

16-16: .constants.ALGORITHM_CID imported but unused

Remove unused import

(F401)

16-16: .constants.ALGORITHM_HCD imported but unused

Remove unused import

(F401)

545-548: Use ternary operator min_i = min_i + min_i * padding if min_i < 0 else min_i - min_i * padding instead of if-else-block

Replace if-else-block with min_i = min_i + min_i * padding if min_i < 0 else min_i - min_i * padding

(SIM108)

550-553: Use ternary operator max_i = max_i - max_i * padding if max_i < 0 else max_i + max_i * padding instead of if-else-block

Replace if-else-block with max_i = max_i - max_i * padding if max_i < 0 else max_i + max_i * padding

(SIM108)

957-957: Local variable mu is assigned to but never used

Remove assignment to unused variable mu

(F841)

🪛 LanguageTool

pyLucXor/README.md

[grammar] ~23-~23: There might be a mistake here.
Context: ...on

Prerequisites

Python 3.7+
PyOpenMS
NumPy
SciPy

Instal...

(QB_NEW_EN)

[grammar] ~24-~24: There might be a mistake here.
Context: ...erequisites

Python 3.7+
PyOpenMS
NumPy
SciPy

Install from sourc...

(QB_NEW_EN)

[grammar] ~25-~25: There might be a mistake here.
Context: ...es

Python 3.7+
PyOpenMS
NumPy
SciPy

Install from source


(QB_NEW_EN)

---

[grammar] ~48-~48: There might be a mistake here.
Context: ...trum`: Input spectrum file (mzML format)
- `-id, --input_id`: Input identification file (idXML forma...

(QB_NEW_EN)

---

[grammar] ~49-~49: There might be a mistake here.
Context: ...Input identification file (idXML format)
- `-out, --output`: Output file (idXML format)

#### Opt...

(QB_NEW_EN)

---

[grammar] ~54-~54: There might be a mistake here.
Context: ...tation method (CID or HCD, default: CID)
- `--fragment_mass_tolerance`: Fragment mass tolerance (default: 0.5)...

(QB_NEW_EN)

---

[grammar] ~55-~55: There might be a mistake here.
Context: ...: Fragment mass tolerance (default: 0.5)
- `--fragment_error_units`: Tolerance units (Da or ppm, default: D...

(QB_NEW_EN)

---

[grammar] ~56-~56: There might be a mistake here.
Context: ...Tolerance units (Da or ppm, default: Da)
- `--min_mz`: Minimum m/z value to consider (default...

(QB_NEW_EN)

---

[grammar] ~57-~57: There might be a mistake here.
Context: ...m m/z value to consider (default: 150.0)
- `--target_modifications`: Target modifications (default: ["Phosp...

(QB_NEW_EN)

---

[grammar] ~58-~58: There might be a mistake here.
Context: ...pho (S)", "Phospho (T)", "Phospho (Y)"])
- `--neutral_losses`: Neutral loss definitions (default: ["s...

(QB_NEW_EN)

---

[grammar] ~59-~59: There might be a mistake here.
Context: ...ions (default: ["sty -H3PO4 -97.97690"])
- `--decoy_mass`: Mass for decoy generation (default: 79...

(QB_NEW_EN)

---

[grammar] ~60-~60: There might be a mistake here.
Context: ...or decoy generation (default: 79.966331)
- `--max_charge_state`: Maximum charge state (default: 5)
- `...

(QB_NEW_EN)

---

[grammar] ~61-~61: There might be a mistake here.
Context: ...tate`: Maximum charge state (default: 5)
- `--max_peptide_length`: Maximum peptide length (default: 40)
...

(QB_NEW_EN)

---

[grammar] ~62-~62: There might be a mistake here.
Context: ...h`: Maximum peptide length (default: 40)
- `--max_num_perm`: Maximum permutations (default: 16384)
...

(QB_NEW_EN)

---

[grammar] ~63-~63: There might be a mistake here.
Context: ...`: Maximum permutations (default: 16384)
- `--modeling_score_threshold`: Minimum score for modeling (default: 0...

(QB_NEW_EN)

---

[grammar] ~64-~64: There might be a mistake here.
Context: ...nimum score for modeling (default: 0.95)
- `--min_num_psms_model`: Minimum PSMs for modeling (default: 50...

(QB_NEW_EN)

---

[grammar] ~65-~65: There might be a mistake here.
Context: ... Minimum PSMs for modeling (default: 50)
- `--num_threads`: Number of threads (default: 4)
- `--r...

(QB_NEW_EN)

---

[grammar] ~66-~66: There might be a mistake here.
Context: ...threads`: Number of threads (default: 4)
- `--rt_tolerance`: Retention time tolerance (default: 0.0...

(QB_NEW_EN)

---

[grammar] ~67-~67: There might be a mistake here.
Context: ...Retention time tolerance (default: 0.01)
- `--debug`: Enable debug mode
- `--log_file`: Cus...

(QB_NEW_EN)

---

[grammar] ~68-~68: There might be a mistake here.
Context: ...t: 0.01)
- `--debug`: Enable debug mode
- `--log_file`: Custom log file path

#### Example
...

(QB_NEW_EN)

---

[grammar] ~111-~111: There might be a mistake here.
Context: ...m:

### Stage 1: FLR Estimation (RN=0)
- Process all PSMs with both real and deco...

(QB_NEW_EN)

---

[grammar] ~116-~116: There might be a mistake here.
Context: ...ns

### Stage 2: FLR Assignment (RN=1)
- Re-process PSMs using only real permutat...

(QB_NEW_EN)

---

[grammar] ~123-~123: There might be a mistake here.
Context: ...tool generates an idXML file containing:
- **Luciphor_delta_score**: Main localizatio...

(QB_NEW_EN)

---

[grammar] ~124-~124: There might be a mistake here.
Context: ...r_delta_score**: Main localization score
- **Luciphor_pep_score**: Peptide identifica...

(QB_NEW_EN)

---

[grammar] ~125-~125: There might be a mistake here.
Context: ...ep_score**: Peptide identification score
- **Luciphor_global_flr**: Global false loca...

(QB_NEW_EN)

---

[grammar] ~126-~126: There might be a mistake here.
Context: ...al_flr**: Global false localization rate
- **Luciphor_local_flr**: Local false locali...

(QB_NEW_EN)

---

[grammar] ~133-~133: There might be a mistake here.
Context: ...method**: CID or HCD specific parameters
- **Mass tolerances**: Fragment and precurso...

(QB_NEW_EN)

---

[grammar] ~134-~134: There might be a mistake here.
Context: ...: Fragment and precursor mass tolerances
- **Modification definitions**: Target PTMs ...

(QB_NEW_EN)

---

[grammar] ~135-~135: There might be a mistake here.
Context: ...itions**: Target PTMs and neutral losses
- **Modeling parameters**: Score thresholds ...

(QB_NEW_EN)

---

[grammar] ~136-~136: There might be a mistake here.
Context: ... Score thresholds and minimum PSM counts
- **Performance settings**: Threading and me...

(QB_NEW_EN)

---

[grammar] ~141-~141: There might be a mistake here.
Context: ...ensive logging with configurable levels:
- **INFO**: General progress information
- ...

(QB_NEW_EN)

---

[grammar] ~142-~142: There might be a mistake here.
Context: ...- **INFO**: General progress information
- **DEBUG**: Detailed processing information...

(QB_NEW_EN)

---

[grammar] ~143-~143: There might be a mistake here.
Context: ...DEBUG**: Detailed processing information
- **WARNING**: Non-critical issues
- **ERRO...

(QB_NEW_EN)

---

[grammar] ~144-~144: There might be a mistake here.
Context: ...tion
- **WARNING**: Non-critical issues
- **ERROR**: Critical errors

Log files ar...

(QB_NEW_EN)

---

[grammar] ~151-~151: There might be a mistake here.
Context: ...c thread optimization based on PSM count
- **Memory efficient**: Streaming processing...

(QB_NEW_EN)

---

[grammar] ~152-~152: There might be a mistake here.
Context: ... Streaming processing for large datasets
- **Scalable**: Handles datasets with thousa...

(QB_NEW_EN)

---

[grammar] ~157-~157: There might be a mistake here.
Context: ...penMS**: Mass spectrometry data handling
- **NumPy**: Numerical computations
- **Sci...

(QB_NEW_EN)

---

[grammar] ~158-~158: There might be a mistake here.
Context: ...ing
- **NumPy**: Numerical computations
- **SciPy**: Statistical functions
- **Stan...

(QB_NEW_EN)

---

[grammar] ~159-~159: There might be a mistake here.
Context: ...ions
- **SciPy**: Statistical functions
- **Standard library**: argparse, logging, j...

(QB_NEW_EN)

---

[style] ~177-~177: Consider using a less common alternative to make your writing sound more unique and professional.
Context: ...tributing

Contributions are welcome! Please feel free to submit issues and pull requests.

## ...

(FEEL_FREE_TO_STYLE_ME)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.17.2)</summary>

<details>
<summary>pyLucXor/README.md</summary>

170-170: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

</details>

<details>
<summary>🔇 Additional comments (5)</summary><blockquote>

<details>
<summary>pyLucXor/README.md (2)</summary>

`65-67`: **Docs default for --num_threads disagrees with code (6 vs 4).**

README lists default num_threads=4, but DEFAULT_CONFIG in pyLucXor/constants.py sets it to 6. Please align one of them to avoid user confusion.

Apply one of the following diffs:

- If 6 is intended (change README):

```diff
- - `--num_threads`: Number of threads (default: 4)
+ - `--num_threads`: Number of threads (default: 6)

If 4 is intended (change constants.py; see comment in constants.py too):

-    "num_threads": 6,
+    "num_threads": 4,

121-128: Clarify FLR field names to match output tag casing.

The README uses bold keys like Luciphor_delta_score, etc. Ensure the exact tag names and casing match the idXML writer implementation to prevent downstream parsing issues.

If the writer uses different keys (e.g., LuciPHOr vs Luciphor), update the README accordingly.

pyLucXor/constants.py (1)

121-136: Default num_threads=6 conflicts with README (4). Pick one.

This default should match the documented default in README and CLI help.

Apply if 4 is intended:

-    "num_threads": 6,
+    "num_threads": 4,

Or update README to 6 (see corresponding README comment).

pyLucXor/peptide.py (1)

7-20: Prune unused imports to reduce noise

[ suggest_nitpick ]

-from typing import Dict, List, Set, Tuple, Optional, Any, Union
+from typing import Dict, List, Tuple, Optional, Any, Union
@@
-import pyopenms
@@
-from .constants import (
-    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP, AA_DECOY_MAP,
-    WATER_MASS, PROTON_MASS, ION_TYPES, NEUTRAL_LOSSES
-)
+from .constants import (
+    NTERM_MOD, CTERM_MOD, AA_MASSES, DECOY_AA_MAP,
+    WATER_MASS, PROTON_MASS
+)

pyLucXor/models.py (1)

821-859: Model training builds matched peaks by calling private PSM._match_peaks

This reaches into a private method and duplicates tolerance logic. Consider using the public Peptide.match_peaks with the PSM’s peptide and spectrum to reduce coupling.

[ suggest_optional_refactor ]
Happy to propose a small wrapper on PSM to expose a public matching utility if desired.

coderabbitai · 2025-08-20T10:17:10Z

+    if name == "LucXor":
+        from .core import LucXor
+        return LucXor
+    elif name == "PyLuciPHOr2":


⚠️ Potential issue

getattr exposes non-existent symbol 'LucXor'

.core defines CoreProcessor, not LucXor. Importing LucXor will fail at runtime.

- if name == "LucXor": - from .core import LucXor - return LucXor + if name == "LucXor": + # Expose CoreProcessor under the historical/public name "LucXor" + from .core import CoreProcessor as LucXor + return LucXor

Optionally, you can also expose CoreProcessor explicitly:

+ elif name == "CoreProcessor": + from .core import CoreProcessor + return CoreProcessor

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if name == "LucXor":

from .core import LucXor

return LucXor

elif name == "PyLuciPHOr2":

if name == "LucXor":

# Expose CoreProcessor under the historical/public name "LucXor"

from .core import CoreProcessor as LucXor

return LucXor

elif name == "CoreProcessor":

from .core import CoreProcessor

return CoreProcessor

elif name == "PyLuciPHOr2":

🤖 Prompt for AI Agents

In pyLucXor/__init__.py around lines 10 to 13, __getattr__ tries to import and return a symbol named "LucXor" but core.py defines CoreProcessor instead; change the import to from .core import CoreProcessor and return CoreProcessor (or, if you intended the public name to remain LucXor, import CoreProcessor and return it under the LucXor alias), and optionally add an explicit exposure (e.g., set LucXor = CoreProcessor) so the module only exposes existing symbols.

coderabbitai · 2025-08-20T10:17:11Z

+        peptide = Peptide(sequence, {}, config, charge=charge)
+
+        # Build ion ladders
+        self.logger.info("Building ion ladders...")
+        peptide.build_ion_ladders()
+
+        # Create PSM object with spectrum
+        self.logger.info("Creating PSM object...")
+        psm = PSM(peptide, spectrum, config)
+


⚠️ Potential issue

Fix Peptide and PSM constructor calls (wrong argument order/arity)

These calls currently pass config into the mod_pep/scan_num positional parameters, breaking object initialization.

- peptide = Peptide(sequence, {}, config, charge=charge) + peptide = Peptide(sequence, config, charge=charge) @@ - psm = PSM(peptide, spectrum, config) + psm = PSM(peptide, spectrum, config=config)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

peptide = Peptide(sequence, {}, config, charge=charge)

# Build ion ladders

self.logger.info("Building ion ladders...")

peptide.build_ion_ladders()

# Create PSM object with spectrum

self.logger.info("Creating PSM object...")

psm = PSM(peptide, spectrum, config)

peptide = Peptide(sequence, config, charge=charge)

# Build ion ladders

self.logger.info("Building ion ladders...")

peptide.build_ion_ladders()

# Create PSM object with spectrum

self.logger.info("Creating PSM object...")

psm = PSM(peptide, spectrum, config=config)

🤖 Prompt for AI Agents

In pyLucXor/cli.py around lines 296 to 305, the Peptide and PSM constructors are being called with config passed positionally into parameters that expect mods/scan_num, breaking initialization; change the calls to pass the mods and scan_num in their correct positional places (or omit them) and supply config explicitly as a keyword argument (e.g., Peptide(sequence, mods_dict, charge=charge, config=config) and PSM(peptide, spectrum, scan_num=..., config=config) or equivalent according to the actual constructor signatures) so config is not passed into the wrong positional parameters.

coderabbitai · 2025-08-20T10:17:11Z

+        # Initialize with default configuration
+        self.config = DEFAULT_CONFIG.copy()
+


⚠️ Potential issue

Shallow copy of DEFAULT_CONFIG leaks nested mutations to the global default.

Lists/dicts inside DEFAULT_CONFIG are shared with all instances. Mutating them (e.g., append to target_modifications) will corrupt the module default and other instances.

Apply this diff:

+from copy import deepcopy @@ - # Initialize with default configuration - self.config = DEFAULT_CONFIG.copy() + # Initialize with a deep copy of the default configuration + self.config = deepcopy(DEFAULT_CONFIG)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Initialize with default configuration

self.config = DEFAULT_CONFIG.copy()

from copy import deepcopy

# Initialize with a deep copy of the default configuration

self.config = deepcopy(DEFAULT_CONFIG)

🤖 Prompt for AI Agents

In pyLucXor/config.py around lines 29 to 31, the code uses DEFAULT_CONFIG.copy() which performs a shallow copy and leaks nested mutable objects (lists/dicts) between instances; replace the shallow copy with a deep copy (e.g., import copy and use copy.deepcopy(DEFAULT_CONFIG)) so each Config instance gets its own independent nested structures, and ensure the copy module is imported at the top of the file.

coderabbitai · 2025-08-20T10:17:11Z

+# Physical constants
+WATER = 18.01056
+PROTON = 1.00728


🛠️ Refactor suggestion

Duplicate physical constants (WATER/PROTON) may diverge from precise values.

WATER_MASS/PROTON_MASS are precise, while WATER/PROTON are truncated. To avoid divergence, alias them to the precise constants.

Apply this diff:

-# Physical constants -WATER = 18.01056 -PROTON = 1.00728 +# Physical constants (aliases for precise values) +WATER = WATER_MASS +PROTON = PROTON_MASS

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Physical constants

WATER = 18.01056

PROTON = 1.00728

# Physical constants (aliases for precise values)

WATER = WATER_MASS

PROTON = PROTON_MASS

🤖 Prompt for AI Agents

In pyLucXor/constants.py around lines 197 to 199, the file defines WATER and PROTON with truncated values while there are precise constants WATER_MASS and PROTON_MASS elsewhere; replace the hardcoded truncated values by aliasing WATER = WATER_MASS and PROTON = PROTON_MASS so the two names refer to the single precise values and cannot drift apart.

coderabbitai · 2025-08-20T10:17:11Z

+                # Process PSM (generate only real permutations, no decoys)
+                psm.process_stage2(self.config.to_dict())
+


⚠️ Potential issue

Broken API call: PSM has process_round2(...), not process_stage2(...)

Per PSM docs, second-round processing should call process_round2(flr_calculator); current call will raise AttributeError and ignores FLR mapping.

- # Process PSM (generate only real permutations, no decoys) - psm.process_stage2(self.config.to_dict()) + # Process PSM (real permutations only) and assign FLR from mapping + psm.process_round2(self.flr_calculator)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Process PSM (generate only real permutations, no decoys)

psm.process_stage2(self.config.to_dict())

# Process PSM (real permutations only) and assign FLR from mapping

psm.process_round2(self.flr_calculator)

🤖 Prompt for AI Agents

In pyLucXor/core.py around lines 121 to 123, the code calls a non-existent psm.process_stage2(...) which will raise an AttributeError and omits the FLR mapping; replace this call with the correct API psm.process_round2(flr_calculator) and pass the FLR calculator/mapping derived from the current config (e.g., use an existing self.flr_calculator or construct one from self.config before the call), ensuring the variable is properly initialized and documented so process_round2 receives the expected FLR calculator.

coderabbitai · 2025-08-20T10:17:13Z

+                    if seq[i:i+9] == "(Phospho)":
+                        i += 9
+                    elif seq[i:i+9] == "(Oxidation)":
+                        i += 9
+                    else:


⚠️ Potential issue

Incorrect oxidation marker length in _validate_permutation

Same 11-char issue here.

- elif seq[i:i+9] == "(Oxidation)": - i += 9 + elif seq[i:i+11] == "(Oxidation)": + i += 11

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if seq[i:i+9] == "(Phospho)":

i += 9

elif seq[i:i+9] == "(Oxidation)":

i += 9

else:

if seq[i:i+9] == "(Phospho)":

i += 9

elif seq[i:i+11] == "(Oxidation)":

i += 11

else:

🧰 Tools

🪛 Ruff (0.12.2)

525-528: Combine if branches using logical or operator

Combine if branches

(SIM114)

🤖 Prompt for AI Agents

In pyLucXor/psm.py around lines 525 to 529, the branch that advances past the "(Oxidation)" marker uses i += 9 (same incorrect fixed length as the phospho case); update that to use the correct length for "(Oxidation)" (or, better, compute it dynamically with len("(Oxidation)") ) so the index advances by the full marker length instead of the current wrong value.

coderabbitai · 2025-08-20T10:17:13Z

+    def normalize_spectrum(self) -> None:
+        """Normalize spectrum intensities."""
+        if self.spectrum is None:
+            self.logger.warning("No spectrum data available for normalization")
+            return
+
+        # Get peaks
+        mz_values, intensities = self.spectrum.get_peaks()
+
+        if len(intensities) == 0:
+            self.logger.warning("No intensity values available for normalization")
+            return
+
+        # Check data validity
+        valid_mask = np.isfinite(intensities) & (intensities > 0)
+        if not np.any(valid_mask):
+            self.logger.warning("No valid intensity values found for normalization")
+            return
+
+        # Only use valid intensity values
+        valid_intensities = intensities[valid_mask]
+        valid_mz_values = mz_values[valid_mask]
+
+        # Log intensity statistics
+        self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "
+                         f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")
+
+        # Normalize to sum of 1
+        total = np.sum(valid_intensities)
+        if total > 0:
+            normalized_intensities = valid_intensities / total
+
+            # Update spectrum with normalized values
+            self.spectrum.set_peaks(valid_mz_values, normalized_intensities)
+
+            self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values")
+        else:
+            self.logger.warning("Total intensity is zero, skipping normalization")
+


⚠️ Potential issue

Fix logger attribute access in normalize_spectrum

self.logger is not defined on PSM; use the module logger. Calls will raise AttributeError if this method is ever invoked.

Apply this diff:

- if self.spectrum is None: - self.logger.warning("No spectrum data available for normalization") + if self.spectrum is None: + logger.warning("No spectrum data available for normalization") @@ - if len(intensities) == 0: - self.logger.warning("No intensity values available for normalization") + if len(intensities) == 0: + logger.warning("No intensity values available for normalization") @@ - if not np.any(valid_mask): - self.logger.warning("No valid intensity values found for normalization") + if not np.any(valid_mask): + logger.warning("No valid intensity values found for normalization") @@ - self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, " + logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, " f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}") @@ - self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values") + logger.debug(f"Normalized {len(normalized_intensities)} intensity values") else: - self.logger.warning("Total intensity is zero, skipping normalization") + logger.warning("Total intensity is zero, skipping normalization")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def normalize_spectrum(self) -> None:

"""Normalize spectrum intensities."""

if self.spectrum is None:

self.logger.warning("No spectrum data available for normalization")

return

# Get peaks

mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:

self.logger.warning("No intensity values available for normalization")

return

# Check data validity

valid_mask = np.isfinite(intensities) & (intensities > 0)

if not np.any(valid_mask):

self.logger.warning("No valid intensity values found for normalization")

return

# Only use valid intensity values

valid_intensities = intensities[valid_mask]

valid_mz_values = mz_values[valid_mask]

# Log intensity statistics

self.logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "

f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")

# Normalize to sum of 1

total = np.sum(valid_intensities)

if total > 0:

normalized_intensities = valid_intensities / total

# Update spectrum with normalized values

self.spectrum.set_peaks(valid_mz_values, normalized_intensities)

self.logger.debug(f"Normalized {len(normalized_intensities)} intensity values")

else:

self.logger.warning("Total intensity is zero, skipping normalization")

def normalize_spectrum(self) -> None:

"""Normalize spectrum intensities."""

if self.spectrum is None:

logger.warning("No spectrum data available for normalization")

return

# Get peaks

mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:

logger.warning("No intensity values available for normalization")

return

# Check data validity

valid_mask = np.isfinite(intensities) & (intensities > 0)

if not np.any(valid_mask):

logger.warning("No valid intensity values found for normalization")

return

# Only use valid intensity values

valid_intensities = intensities[valid_mask]

valid_mz_values = mz_values[valid_mask]

# Log intensity statistics

logger.debug(f"Intensity stats before normalization: min={np.min(valid_intensities):.2f}, "

f"max={np.max(valid_intensities):.2f}, mean={np.mean(valid_intensities):.2f}")

# Normalize to sum of 1

total = np.sum(valid_intensities)

if total > 0:

normalized_intensities = valid_intensities / total

# Update spectrum with normalized values

self.spectrum.set_peaks(valid_mz_values, normalized_intensities)

logger.debug(f"Normalized {len(normalized_intensities)} intensity values")

else:

logger.warning("Total intensity is zero, skipping normalization")

coderabbitai · 2025-08-20T10:17:13Z

+    def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:
+        """Reduce neutral loss peaks.
+
+        Args:
+            precursor_nl_mass: Precursor neutral loss mass
+        """
+        if self.spectrum is None:
+            self.logger.warning("No spectrum data available for neutral loss reduction")
+            return
+
+        # Get peaks
+        mz_values, intensities = self.spectrum.get_peaks()
+
+        if len(intensities) == 0:
+            self.logger.warning("No intensity values available for neutral loss reduction")
+            return
+
+        # Find peaks to reduce
+        nl_indices = []
+        for i, mz in enumerate(mz_values):
+            # Check if peak is within neutral loss mass window
+            if abs(mz - precursor_nl_mass) < 0.5:  # Using 0.5 Da window
+                nl_indices.append(i)
+
+        if nl_indices:
+            # Reduce intensity of neutral loss peaks
+            for idx in nl_indices:
+                intensities[idx] *= 0.1  # Reduce to 10% of original intensity
+
+            # Update spectrum
+            self.spectrum.set_peaks(mz_values, intensities)
+
+            self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")
+        else:
+            self.logger.debug("No neutral loss peaks found")
+


⚠️ Potential issue

Fix logger attribute access in reduce_nl_peak

Same self.logger issue here.

- if self.spectrum is None: - self.logger.warning("No spectrum data available for neutral loss reduction") + if self.spectrum is None: + logger.warning("No spectrum data available for neutral loss reduction") @@ - if len(intensities) == 0: - self.logger.warning("No intensity values available for neutral loss reduction") + if len(intensities) == 0: + logger.warning("No intensity values available for neutral loss reduction") @@ - self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks") + logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks") else: - self.logger.debug("No neutral loss peaks found") + logger.debug("No neutral loss peaks found")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:

"""Reduce neutral loss peaks.

Args:

precursor_nl_mass: Precursor neutral loss mass

"""

if self.spectrum is None:

self.logger.warning("No spectrum data available for neutral loss reduction")

return

# Get peaks

mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:

self.logger.warning("No intensity values available for neutral loss reduction")

return

# Find peaks to reduce

nl_indices = []

for i, mz in enumerate(mz_values):

# Check if peak is within neutral loss mass window

if abs(mz - precursor_nl_mass) < 0.5: # Using 0.5 Da window

nl_indices.append(i)

if nl_indices:

# Reduce intensity of neutral loss peaks

for idx in nl_indices:

intensities[idx] *= 0.1 # Reduce to 10% of original intensity

# Update spectrum

self.spectrum.set_peaks(mz_values, intensities)

self.logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")

else:

self.logger.debug("No neutral loss peaks found")

def reduce_nl_peak(self, precursor_nl_mass: float = 0.0) -> None:

"""Reduce neutral loss peaks.

Args:

precursor_nl_mass: Precursor neutral loss mass

"""

if self.spectrum is None:

logger.warning("No spectrum data available for neutral loss reduction")

return

# Get peaks

mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:

logger.warning("No intensity values available for neutral loss reduction")

return

# Find peaks to reduce

nl_indices = []

for i, mz in enumerate(mz_values):

# Check if peak is within neutral loss mass window

if abs(mz - precursor_nl_mass) < 0.5: # Using 0.5 Da window

nl_indices.append(i)

if nl_indices:

# Reduce intensity of neutral loss peaks

for idx in nl_indices:

intensities[idx] *= 0.1 # Reduce to 10% of original intensity

# Update spectrum

self.spectrum.set_peaks(mz_values, intensities)

logger.debug(f"Reduced {len(nl_indices)} neutral loss peaks")

else:

logger.debug("No neutral loss peaks found")

🤖 Prompt for AI Agents

In pyLucXor/psm.py around lines 967 to 1002, the method uses self.logger which doesn't match the class's logger attribute; replace self.logger with the actual logger attribute used elsewhere in this class (e.g., self._logger or self.log) and update all occurrences in this method accordingly; if the class has no logger attribute, add one in __init__ like self._logger = logging.getLogger(__name__) and ensure import logging is present, then use that attribute (and adjust the two warning/debug calls in this method to use it).

coderabbitai · 2025-08-20T10:17:13Z

+    def get_spectrum_peaks(self) -> Tuple[np.ndarray, np.ndarray]:
+        """Get spectrum peaks.
+
+        Returns:
+            Tuple of (m/z array, intensity array)
+        """
+        if self.spectrum is None:
+            self.logger.warning("No spectrum data available")
+            return np.array([]), np.array([])
+
+        return self.spectrum.get_peaks()
+


⚠️ Potential issue

Fix logger attribute access in get_spectrum_peaks

Same self.logger issue again.

- if self.spectrum is None: - self.logger.warning("No spectrum data available") + if self.spectrum is None: + logger.warning("No spectrum data available") return np.array([]), np.array([])

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def get_spectrum_peaks(self) -> Tuple[np.ndarray, np.ndarray]:

"""Get spectrum peaks.

Returns:

Tuple of (m/z array, intensity array)

"""

if self.spectrum is None:

self.logger.warning("No spectrum data available")

return np.array([]), np.array([])

return self.spectrum.get_peaks()

def get_spectrum_peaks(self) -> Tuple[np.ndarray, np.ndarray]:

"""Get spectrum peaks.

Returns:

Tuple of (m/z array, intensity array)

"""

if self.spectrum is None:

logger.warning("No spectrum data available")

return np.array([]), np.array([])

return self.spectrum.get_peaks()

🤖 Prompt for AI Agents

In pyLucXor/psm.py around lines 1003 to 1014, the method uses self.logger which is not the correct logger attribute for this class; change the access to the logger attribute used elsewhere in the class (e.g., self._logger) so the warning call uses the actual logger instance, leaving the warning message and return behavior unchanged; if the class uses a different logger property name, use that exact name instead to keep consistency.

coderabbitai · 2025-08-20T10:17:13Z

+    def clear_scores(self) -> None:
+        """
+        Clear PSM scores
+        """
+        logger.debug(f"Clearing scores for PSM {self.scan_num}")
+
+        # Clear permutation score mappings
+        self.pos_permutation_score_map.clear()
+        self.neg_permutation_score_map.clear()
+
+        # Reset score-related attributes
+        self.delta_score = np.nan
+        self.pep_score = np.nan
+        self.global_flr = np.nan
+        self.local_flr = np.nan
+


⚠️ Potential issue

Typo: clearing psm_score uses undefined pep_score

self.pep_score doesn’t exist elsewhere; this should reset self.psm_score.

- self.delta_score = np.nan - self.pep_score = np.nan - self.global_flr = np.nan - self.local_flr = np.nan + self.delta_score = np.nan + self.psm_score = np.nan + self.global_flr = np.nan + self.local_flr = np.nan

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def clear_scores(self) -> None:

"""

Clear PSM scores

"""

logger.debug(f"Clearing scores for PSM {self.scan_num}")

# Clear permutation score mappings

self.pos_permutation_score_map.clear()

self.neg_permutation_score_map.clear()

# Reset score-related attributes

self.delta_score = np.nan

self.pep_score = np.nan

self.global_flr = np.nan

self.local_flr = np.nan

def clear_scores(self) -> None:

"""

Clear PSM scores

"""

logger.debug(f"Clearing scores for PSM {self.scan_num}")

# Clear permutation score mappings

self.pos_permutation_score_map.clear()

self.neg_permutation_score_map.clear()

# Reset score-related attributes

self.delta_score = np.nan

self.psm_score = np.nan

self.global_flr = np.nan

self.local_flr = np.nan

🤖 Prompt for AI Agents

In pyLucXor/psm.py around lines 1015 to 1030, the clear_scores method resets self.pep_score, which is a typo because the class uses self.psm_score elsewhere; change the reset to self.psm_score = np.nan (and remove or do not add pep_score) so the correct attribute is cleared; ensure no other references to pep_score remain and run tests to confirm attribute consistency.

Copilot

Pull Request Overview

This PR introduces pyLucXor, a complete Python package implementation for phosphosite localization analysis with False Localization Rate (FLR) calculation. The implementation provides both a command-line interface and Python API for processing peptide-spectrum matches (PSMs) to determine the most likely phosphorylation sites on peptides.

Key changes include:

Complete two-stage PSM processing workflow with statistical model training and FLR estimation
Comprehensive peptide sequence analysis with fragment ion generation for both CID and HCD fragmentation
Parallel processing framework supporting multi-threading for improved performance

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
pyLucXor/psm.py	Core PSM class with scoring logic and two-stage processing workflow
pyLucXor/models.py	Statistical models for CID/HCD fragmentation with kernel density estimation
pyLucXor/peptide.py	Peptide sequence analysis with ion ladder generation and peak matching
pyLucXor/flr.py	FLR calculator implementing statistical modeling for false localization rates
pyLucXor/spectrum.py	Mass spectrum data processing with normalization and peak handling
pyLucXor/core.py	Main processing orchestrator coordinating the two-stage workflow
pyLucXor/parallel.py	Parallel processing infrastructure with worker classes
pyLucXor/peak.py	Peak representation for mass spectrometry data
pyLucXor/globals.py	Global state management utilities
pyLucXor/constants.py	Physical constants and configuration parameters
pyLucXor/config.py	Configuration management with validation

Comments suppressed due to low confidence (1)

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-08-25T18:37:27Z

+        """
+        for i in range(self.N):
+            if abs(self.mz[i] - mz) < 1e-6:  # floating point comparison
+                return i


The hardcoded tolerance value 1e-6 for m/z comparison should be configurable or defined as a named constant to improve maintainability and allow for different precision requirements.

Suggested change

return i

for i in range(self.N):

if abs(self.mz[i] - mz) < self.MZ_TOLERANCE: # floating point comparison

return i

Copilot · 2025-08-25T18:37:27Z

+
+
+                if is_decoy:
+                    tolerance = base_tolerance * 2.0  # decoyPadding = 2.0


The hardcoded decoy padding multiplier 2.0 should be defined as a named constant to improve code maintainability and make the algorithm parameters more explicit.

Suggested change

tolerance = base_tolerance * 2.0 # decoyPadding = 2.0

tolerance = base_tolerance * DECOY_PADDING_MULTIPLIER # decoyPadding

Copilot · 2025-08-25T18:37:28Z

+                for i in range(self.spectrum.N):
+                    if abs(self.spectrum.mz[i] - nl_mz) < 0.5:  # Use 0.5 Da window
+                        # Reduce intensity to 10% of original
+                        self.spectrum.raw_intensity[i] *= 0.1


The hardcoded neutral loss intensity reduction factor 0.1 should be defined as a named constant or made configurable to improve maintainability.

Suggested change

self.spectrum.raw_intensity[i] *= 0.1

factor = self.config.get('neutral_loss_intensity_factor', DEFAULT_NEUTRAL_LOSS_INTENSITY_FACTOR)

self.spectrum.raw_intensity[i] *= factor

Copilot · 2025-08-25T18:37:28Z

+        # Check data validity
+        valid_mask = np.isfinite(intensities) & (intensities > 0)
+        if not np.any(valid_mask):
+            self.logger.warning("No valid intensity values found for normalization")


The method accesses self.logger which is not defined. It should use the module-level logger instead.

Suggested change

self.logger.warning("No valid intensity values found for normalization")

logger.warning("No spectrum data available for normalization")

return

# Get peaks

mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:

logger.warning("No intensity values available for normalization")

return

# Check data validity

valid_mask = np.isfinite(intensities) & (intensities > 0)

if not np.any(valid_mask):

logger.warning("No valid intensity values found for normalization")

Copilot · 2025-08-25T18:37:28Z

+        mz_values, intensities = self.spectrum.get_peaks()
+
+        if len(intensities) == 0:
+            self.logger.warning("No intensity values available for neutral loss reduction")


The method accesses self.logger which is not defined. It should use the module-level logger instead.

Suggested change

self.logger.warning("No intensity values available for neutral loss reduction")

logger.warning("No spectrum data available for neutral loss reduction")

return

# Get peaks

mz_values, intensities = self.spectrum.get_peaks()

if len(intensities) == 0:

logger.warning("No intensity values available for neutral loss reduction")

Copilot · 2025-08-25T18:37:28Z

+            Tuple of (m/z array, intensity array)
+        """
+        if self.spectrum is None:
+            self.logger.warning("No spectrum data available")


The method accesses self.logger which is not defined. It should use the module-level logger instead.

Suggested change

self.logger.warning("No spectrum data available")

logger.warning("No spectrum data available")

Copilot · 2025-08-25T18:37:29Z

+        """
+        Check if sequence is a decoy sequence
+        """
+        return any(aa.islower() for aa in seq)


The decoy sequence detection logic using lowercase letters is fragile and could lead to false positives. Consider using a more explicit decoy marking mechanism or documenting this assumption clearly.

Suggested change

return any(aa.islower() for aa in seq)

Check if sequence is a decoy sequence.

Uses explicit decoy symbol from get_decoy_symbol().

"""

decoy_symbol = get_decoy_symbol()

return decoy_symbol in seq

Copilot · 2025-08-25T18:37:29Z

+
+logger = logging.getLogger(__name__)
+
+# Default thread count


[nitpick] The default thread count should ideally be based on system capabilities rather than a hardcoded value, or at least document why 4 was chosen as the default.

Suggested change

# Default thread count

# Default thread count

# Fallback to 4 threads if os.cpu_count() is unavailable.

# 4 is chosen as a reasonable default for most systems to balance parallelism and resource usage.

Copilot · 2025-08-25T18:37:29Z

+        # Limit negative distribution size
+        limit_n = nb + ny
+        if limit_n < 50000:  # constants.MIN_NUM_NEG_PKS
+            limit_n += 50000


The hardcoded value 50000 should be imported from constants module as indicated by the comment, rather than using a magic number.

Suggested change

limit_n += 50000

if limit_n < MIN_NUM_NEG_PKS: # constants.MIN_NUM_NEG_PKS

limit_n += MIN_NUM_NEG_PKS

Copilot · 2025-08-25T18:37:29Z

+        for i in range(self.n_real):
+            x = self.pos[i]
+            if x < 0.1:
+                x = 0.1


The hardcoded minimum delta score threshold 0.1 should use the instance variable self.min_delta_score instead of a magic number for consistency.

Suggested change

x = 0.1

if x < self.min_delta_score:

x = self.min_delta_score

Initial commit pyLucXor

Initial commit pyLucXor

c5f1ad6

qodo-code-review Bot added the Review effort 5/5 label Aug 20, 2025

coderabbitai Bot reviewed Aug 20, 2025

View reviewed changes

ypriverol requested a review from Copilot August 25, 2025 18:34

Copilot AI reviewed Aug 25, 2025

View reviewed changes

ypriverol self-requested a review August 25, 2025 18:40

ypriverol approved these changes Aug 25, 2025

View reviewed changes

ypriverol merged commit c2061c1 into bigbio:main Aug 25, 2025
1 check passed

This was referenced Oct 16, 2025

initial commit onsite #6

Merged

Minor refinements to code, clean like black, etc. #9

Merged

weizhongchun pushed a commit to weizhongchun/onsite that referenced this pull request Oct 23, 2025

Merge pull request bigbio#4 from weizhongchun/main

679316d

Initial commit pyLucXor

weizhongchun pushed a commit to weizhongchun/onsite that referenced this pull request Oct 23, 2025

Merge pull request bigbio#4 from weizhongchun/main

9a33351

Initial commit pyLucXor

coderabbitai Bot mentioned this pull request Oct 23, 2025

update code and add tests #11

Merged

This was referenced Jan 9, 2026

Add comprehensive performance analysis report #25

Merged

Use PyOpenMS for theoretical spectrum generation in LucXor #32

Merged

coderabbitai Bot mentioned this pull request Jan 22, 2026

Add multi-file batch processing with global FLR calculation #39

Closed

8 tasks

coderabbitai Bot mentioned this pull request Jun 1, 2026

Unified decoy-AA global FLR across AScore / PhosphoRS / LucXor + scoring fixes (#40) #41

Open

		# Initialize with default configuration
		self.config = DEFAULT_CONFIG.copy()

-        # Initialize with default configuration
-        self.config = DEFAULT_CONFIG.copy()
+from copy import deepcopy
+        # Initialize with a deep copy of the default configuration
+        self.config = deepcopy(DEFAULT_CONFIG)

		# Process PSM (generate only real permutations, no decoys)
		psm.process_stage2(self.config.to_dict())



		if is_decoy:
		tolerance = base_tolerance * 2.0 # decoyPadding = 2.0

	tolerance = base_tolerance * 2.0 # decoyPadding = 2.0
	tolerance = base_tolerance * DECOY_PADDING_MULTIPLIER # decoyPadding

	self.spectrum.raw_intensity[i] *= 0.1
	factor = self.config.get('neutral_loss_intensity_factor', DEFAULT_NEUTRAL_LOSS_INTENSITY_FACTOR)
	self.spectrum.raw_intensity[i] *= factor

	self.logger.warning("No spectrum data available")
	logger.warning("No spectrum data available")

-        return any(aa.islower() for aa in seq)
+        Check if sequence is a decoy sequence.
+        Uses explicit decoy symbol from get_decoy_symbol().
+        """
+        decoy_symbol = get_decoy_symbol()
+        return decoy_symbol in seq

	limit_n += 50000
	if limit_n < MIN_NUM_NEG_PKS: # constants.MIN_NUM_NEG_PKS
	limit_n += MIN_NUM_NEG_PKS

Conversation

weizhongchun commented Aug 20, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

qodo-code-review Bot commented Aug 20, 2025

PR Reviewer Guide 🔍

Uh oh!

qodo-code-review Bot commented Aug 20, 2025

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Prerequisites

Instal...

Install from sourc...

Install from source

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 25, 2025

weizhongchun commented Aug 20, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Aug 20, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)