Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughAdds anomaly-detection support across the DIANN → MSstats conversion pipeline: new anomaly-related parameters and docs, NA-aware stateful anomaly score computations, propagation of anomaly metrics through preprocessing/balanced design, and minor build/docs ignore updates. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant DIANNConverter as DIANNtoMSstatsFormat
participant Cleaner as MSstatsClean
participant Preprocessor as MSstatsPreprocess
participant BalancedDesign as MSstatsBalancedDesign
participant AnomalyDetector as MSstatsAnomalyScores
participant Output
User->>DIANNConverter: input + anomaly params
activate DIANNConverter
DIANNConverter->>Cleaner: call (calculateAnomalyScores, anomalyModelFeatures)
activate Cleaner
Cleaner->>Cleaner: filter or zero Intensity per thresholds
Cleaner->>DIANNConverter: cleaned data
deactivate Cleaner
DIANNConverter->>Preprocessor: call (anomaly_metrics)
activate Preprocessor
Preprocessor->>DIANNConverter: preprocessed data
deactivate Preprocessor
DIANNConverter->>BalancedDesign: call (anomaly_metrics)
activate BalancedDesign
BalancedDesign->>DIANNConverter: balanced data
deactivate BalancedDesign
alt calculateAnomalyScores == TRUE
DIANNConverter->>AnomalyDetector: request scores (features, temporal, params)
activate AnomalyDetector
AnomalyDetector->>AnomalyDetector: compute stateful metrics (mean_increase, etc.)
AnomalyDetector->>AnomalyDetector: run isolation forest per feature-group
AnomalyDetector->>DIANNConverter: merged AnomalyScores
deactivate AnomalyDetector
end
DIANNConverter->>User: final output
deactivate DIANNConverter
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
man/dot-cleanRawDIANN.Rd (1)
10-22:⚠️ Potential issue | 🟡 MinorAdd missing
@paramdocumentation for new parameters in the source file.The function signature in
R/clean_DIANN.RincludescalculateAnomalyScoresandanomalyModelFeatures, but the roxygen2@paramtags for these parameters are missing. Add documentation like:#' `@param` calculateAnomalyScores Default is FALSE. If TRUE, will run anomaly detection model and calculate anomaly scores for each feature. #' `@param` anomalyModelFeatures character vector of quality metric column names to be used as features in the anomaly detection model.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@man/dot-cleanRawDIANN.Rd` around lines 10 - 22, Add roxygen `@param` entries for the two missing parameters in the function documented by man/dot-cleanRawDIANN.Rd (source in R/clean_DIANN.R): add a line documenting calculateAnomalyScores (e.g., Default FALSE; when TRUE runs the anomaly detection model and computes anomaly scores per feature) and a line documenting anomalyModelFeatures (e.g., character vector of quality metric column names used as features for the anomaly detection model), placing them with the other `@param` tags so they are emitted into the Rd file for the function (cleanRawDIANN / the function in R/clean_DIANN.R).src/isolation_forest.cpp (1)
204-506:⚠️ Potential issue | 🔴 CriticalMerge improvements from commented code or remove the entire commented block—it contains critical bug fixes to path length traversal.
The ~300-line commented block (lines 204-505) is not just alternative code; it fixes correctness issues in the active implementation:
Path length traversal bug: The active
path_length()function (lines 151-165) fails to respect theis_missing_splitflag, instead unconditionally checkingit->second < node->value. The commented version explicitly marks this as "IMPORTANT FIX" and properly handles missing splits separately from numeric splits, preventing incorrect anomaly score calculations.Suboptimal missing split probability: The active code uses a hardcoded
0.5probability (with a TODO acknowledging this). The commented version implements calibrated probability based on missing data frequency (p_miss = min(0.5, missing_fraction)), which aligns with the algorithm's design principle.Maintenance burden: The commented code creates ambiguity about which implementation is correct, especially given that the commented version explicitly labels a fix for the active code.
Action: Either merge the improvements into the active code (recommended) or delete the commented block entirely. Do not leave both versions in the codebase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/isolation_forest.cpp` around lines 204 - 506, The large commented block contains critical fixes that must be applied or removed: merge the corrected traversal logic from the commented path_length into the active path_length so it respects IsolationTreeNode::is_missing_split (handle missing split vs numeric split separately and use node->size when hitting a leaf), adopt the calibrated missing-split probability p_miss = std::min(0.5, missing_fraction) and missing-handling logic from the commented isolation_tree (use is_missing_split, split_value generation, and left/right assignment), then delete the redundant commented 300-line block to avoid ambiguity; update references to isolation_tree, isolation_forest, and path_length accordingly.R/MSstatsConvert_core_functions.R (1)
550-556:⚠️ Potential issue | 🟡 MinorImprove clarity of temporal feature condition check.
The comparison
temporal_direction[i] != FALSEis semantically unclear for a character vector. While it happens to work (character values are not equal to logical FALSE), the intent is better expressed as checking whether the value is a recognized temporal direction or explicitly checking for non-empty/non-NA values.Consider using:
if (!is.na(temporal_direction[i]) && nzchar(temporal_direction[i]))or more explicitly:
if (temporal_direction[i] %in% c("mean_decrease", "mean_increase", "dispersion_increase"))Note: The mutation of
quality_metricsduring the loop is safe becauseseq_along()creates a fixed sequence at loop entry.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/MSstatsConvert_core_functions.R` around lines 550 - 556, Replace the unclear comparison `temporal_direction[i] != FALSE` inside the loop that iterates with `seq_along(quality_metrics)` by an explicit check for a non-missing, non-empty temporal direction (e.g., use `!is.na(temporal_direction[i]) && nzchar(temporal_direction[i])`) so the intent is clear when appending to `quality_metrics`; keep the existing concatenation logic that uses `paste0(quality_metrics[i], ".", temporal_direction[i])`.R/utils_anomaly_score.R (1)
208-251:⚠️ Potential issue | 🟠 MajorClose the parallel cluster on all exit paths.
Line 208 creates a cluster with
parallel::makeCluster(cores), but there is no correspondingstopClusterbefore the return at line 251 or on error, which leaks worker processes on repeated calls.Proposed fix
.runAnomalyModel = function(input_data, n_trees, max_depth, cores, split_column, quality_metrics){ function_environment = environment() cl = parallel::makeCluster(cores) + on.exit(parallel::stopCluster(cl), add = TRUE)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/utils_anomaly_score.R` around lines 208 - 251, The cluster created with parallel::makeCluster(cores) (variable cl) is never stopped; wrap cluster lifecycle so stopCluster(cl) always runs (including on errors) by calling on.exit(parallel::stopCluster(cl)) immediately after cl is created (or use tryCatch/finally around parallel::parLapply) so the cluster is stopped on normal return and on error; ensure this is added near the makeCluster call and before any early returns, and keep parLapply, clusterExport, and subsequent merge logic unchanged.
🧹 Nitpick comments (3)
R/clean_DIANN.R (1)
95-102: Consider validating thatanomalyModelFeaturescolumns exist in the input.When
calculateAnomalyScores = TRUE, the code assumes all columns inanomalyModelFeaturesexist indn_input. If any are missing, line 102 will throw an error that may not clearly indicate which column is missing.💡 Optional: Add validation with informative error message
qual_cols <- if (calculateAnomalyScores) { + missing_cols <- setdiff(anomalyModelFeatures, names(dn_input)) + if (length(missing_cols) > 0) { + stop("Anomaly model features not found in input: ", + paste(missing_cols, collapse = ", ")) + } anomalyModelFeatures } else { c() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/clean_DIANN.R` around lines 95 - 102, When calculateAnomalyScores is TRUE, validate that every name in anomalyModelFeatures exists in dn_input before building req_cols: compute missing <- setdiff(anomalyModelFeatures, names(dn_input)) and if length(missing)>0 call stop() (or use a clear error via the project's logging) with a message that lists the missing column names and context (e.g., "Missing anomalyModelFeatures in dn_input: ..."); only proceed to set qual_cols and return dn_input[, req_cols, with = FALSE] if no missing columns are found.R/converters_DIANNtoMSstatsFormat.R (1)
89-89: Remove leftover debug comments (# browser()).These lines are debug artifacts and add noise in production code.
Also applies to: 108-108, 141-141, 159-159, 166-166, 173-173
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/converters_DIANNtoMSstatsFormat.R` at line 89, Remove all leftover debug statements ("# browser()") from the R source to avoid noisy debug artifacts in production; search the file converters_DIANNtoMSstatsFormat.R for any occurrences (including the instances around the noted positions) and delete those comment lines so functions like those in this file no longer contain inline debug breakpoints.R/utils_anomaly_score.R (1)
225-250: Avoid positional score assignment; map scores with an explicit key.Line 247 uses
unlist(model_results)and writes scores by position. This is brittle because it depends on implicit row ordering ofmodel_input. Prefer returning a key (e.g., row id) from each worker and joining by that key.Safer pattern
- model_input = unique(input_data[, c(split_column, quality_metrics), with = FALSE]) + model_input = unique(input_data[, c(split_column, quality_metrics), with = FALSE]) + model_input[, row_id := .I] @@ - single_psm = model_input[model_input[[split_column]] == psm_list[[i]], - ..quality_metrics] + single_psm = model_input[model_input[[split_column]] == psm_list[[i]], + c("row_id", quality_metrics), with = FALSE] @@ - forest$anomaly_score + data.table::data.table( + row_id = single_psm$row_id, + AnomalyScores = forest$anomaly_score + ) } ) - model_input$AnomalyScores = unlist(model_results) + score_dt = data.table::rbindlist(model_results) + model_input = merge(model_input, score_dt, by = "row_id", all.x = TRUE) - input_data = merge(input_data, model_input, by = c(split_column, quality_metrics), all.x = TRUE) + input_data = merge( + input_data, + model_input[, c(split_column, quality_metrics, "AnomalyScores"), with = FALSE], + by = c(split_column, quality_metrics), + all.x = TRUE + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/utils_anomaly_score.R` around lines 225 - 250, The current parallel::parLapply worker returns scores only and assigns them to model_input by position (unlist(model_results)), which is brittle; update the worker (the anonymous function passed to parLapply that references psm_list, split_column, and calculate_anomaly_score) to return a named result containing the split key (e.g., the psm identifier from psm_list[[i]] or a unique row id from model_input) and the computed anomaly_score, then after parLapply build a result table/data.frame mapping that key to anomaly_score and merge it back into model_input/input_data using an explicit join by the split key (instead of relying on positional unlist and model_input$AnomalyScores), ensuring you update the merge call that currently uses merge(input_data, model_input, ...) to use the new keyed result for a safe join.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Line 31: Update the documentation for the runOrder parameter in
converters_DIANNtoMSstatsFormat.R to reference DIA-NN instead of Spectronaut:
keep the description that runOrder is a two-column data.table with columns `Run`
and `Order` (where `Run` matches the run name output by DIA-NN and `Order` is an
integer) and that it is used to engineer temporal features in
`anomalyModelFeatureTemporal`.
In `@R/utils_anomaly_score.R`:
- Around line 82-92: The function .add_mean_increase (and the other helper
functions that write to element [1]) must early-return for empty inputs to avoid
creating length-1 outputs from length-0 inputs: at the top of .add_mean_increase
check if n <- length(quality_vector) is 0 and if so return numeric(0) (or
double(0)) immediately; apply the same guard to the other functions that assign
to [1] so an empty quality_vector yields an empty result rather than a length-1
vector.
---
Outside diff comments:
In `@man/dot-cleanRawDIANN.Rd`:
- Around line 10-22: Add roxygen `@param` entries for the two missing parameters
in the function documented by man/dot-cleanRawDIANN.Rd (source in
R/clean_DIANN.R): add a line documenting calculateAnomalyScores (e.g., Default
FALSE; when TRUE runs the anomaly detection model and computes anomaly scores
per feature) and a line documenting anomalyModelFeatures (e.g., character vector
of quality metric column names used as features for the anomaly detection
model), placing them with the other `@param` tags so they are emitted into the Rd
file for the function (cleanRawDIANN / the function in R/clean_DIANN.R).
In `@R/MSstatsConvert_core_functions.R`:
- Around line 550-556: Replace the unclear comparison `temporal_direction[i] !=
FALSE` inside the loop that iterates with `seq_along(quality_metrics)` by an
explicit check for a non-missing, non-empty temporal direction (e.g., use
`!is.na(temporal_direction[i]) && nzchar(temporal_direction[i])`) so the intent
is clear when appending to `quality_metrics`; keep the existing concatenation
logic that uses `paste0(quality_metrics[i], ".", temporal_direction[i])`.
In `@R/utils_anomaly_score.R`:
- Around line 208-251: The cluster created with parallel::makeCluster(cores)
(variable cl) is never stopped; wrap cluster lifecycle so stopCluster(cl) always
runs (including on errors) by calling on.exit(parallel::stopCluster(cl))
immediately after cl is created (or use tryCatch/finally around
parallel::parLapply) so the cluster is stopped on normal return and on error;
ensure this is added near the makeCluster call and before any early returns, and
keep parLapply, clusterExport, and subsequent merge logic unchanged.
In `@src/isolation_forest.cpp`:
- Around line 204-506: The large commented block contains critical fixes that
must be applied or removed: merge the corrected traversal logic from the
commented path_length into the active path_length so it respects
IsolationTreeNode::is_missing_split (handle missing split vs numeric split
separately and use node->size when hitting a leaf), adopt the calibrated
missing-split probability p_miss = std::min(0.5, missing_fraction) and
missing-handling logic from the commented isolation_tree (use is_missing_split,
split_value generation, and left/right assignment), then delete the redundant
commented 300-line block to avoid ambiguity; update references to
isolation_tree, isolation_forest, and path_length accordingly.
---
Nitpick comments:
In `@R/clean_DIANN.R`:
- Around line 95-102: When calculateAnomalyScores is TRUE, validate that every
name in anomalyModelFeatures exists in dn_input before building req_cols:
compute missing <- setdiff(anomalyModelFeatures, names(dn_input)) and if
length(missing)>0 call stop() (or use a clear error via the project's logging)
with a message that lists the missing column names and context (e.g., "Missing
anomalyModelFeatures in dn_input: ..."); only proceed to set qual_cols and
return dn_input[, req_cols, with = FALSE] if no missing columns are found.
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Line 89: Remove all leftover debug statements ("# browser()") from the R
source to avoid noisy debug artifacts in production; search the file
converters_DIANNtoMSstatsFormat.R for any occurrences (including the instances
around the noted positions) and delete those comment lines so functions like
those in this file no longer contain inline debug breakpoints.
In `@R/utils_anomaly_score.R`:
- Around line 225-250: The current parallel::parLapply worker returns scores
only and assigns them to model_input by position (unlist(model_results)), which
is brittle; update the worker (the anonymous function passed to parLapply that
references psm_list, split_column, and calculate_anomaly_score) to return a
named result containing the split key (e.g., the psm identifier from
psm_list[[i]] or a unique row id from model_input) and the computed
anomaly_score, then after parLapply build a result table/data.frame mapping that
key to anomaly_score and merge it back into model_input/input_data using an
explicit join by the split key (instead of relying on positional unlist and
model_input$AnomalyScores), ensuring you update the merge call that currently
uses merge(input_data, model_input, ...) to use the new keyed result for a safe
join.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.gitignoreDESCRIPTIONR/MSstatsConvert_core_functions.RR/clean_DIANN.RR/converters_DIANNtoMSstatsFormat.RR/utils_anomaly_score.Rman/DIANNtoMSstatsFormat.Rdman/MSstatsAnomalyScores.Rdman/MSstatsClean.Rdman/dot-cleanRawDIANN.Rdsrc/isolation_forest.cpp
| .add_mean_increase = function(quality_vector) { | ||
|
|
||
| n = length(quality_vector) | ||
| mean_increase = rep(NA_real_, n) | ||
|
|
||
| d = 0.5 | ||
|
|
||
| for(k in 2:length(quality_vector)) { | ||
| # 5 is reference (3 sigma) | ||
| if (mean_increase[k] > 5){ | ||
| mean_increase[k] = max(0,(quality_vector[k] - d), na.rm = TRUE) | ||
| } else { | ||
| mean_increase[k] = max(0, | ||
| (quality_vector[k] - d + mean_increase[k-1]), | ||
| na.rm = TRUE) # positive CuSum | ||
| h = 5 | ||
|
|
||
| state = 0 | ||
| mean_increase[1] = if (is.na(quality_vector[1])) NA_real_ else 0 | ||
|
|
There was a problem hiding this comment.
Guard empty input vectors before writing index 1.
At Line 91, Line 125, and Line 163, these functions write to [1] even when length(quality_vector) == 0, which can produce a length-1 output from an empty input.
Proposed fix
.add_mean_increase = function(quality_vector) {
n = length(quality_vector)
+ if (n == 0) return(numeric(0))
mean_increase = rep(NA_real_, n)
@@
.add_mean_decrease = function(quality_vector) {
n = length(quality_vector)
+ if (n == 0) return(numeric(0))
mean_decrease = rep(NA_real_, n)
@@
.add_dispersion_increase = function(quality_vector) {
n = length(quality_vector)
+ if (n == 0) return(numeric(0))
dispersion_increase = rep(NA_real_, n)Also applies to: 116-126, 151-169
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/utils_anomaly_score.R` around lines 82 - 92, The function
.add_mean_increase (and the other helper functions that write to element [1])
must early-return for empty inputs to avoid creating length-1 outputs from
length-0 inputs: at the top of .add_mean_increase check if n <-
length(quality_vector) is 0 and if so return numeric(0) (or double(0))
immediately; apply the same guard to the other functions that assign to [1] so
an empty quality_vector yields an empty result rather than a length-1 vector.
There was a problem hiding this comment.
Pull request overview
Updates the DIA-NN import/conversion workflow to better support MSstats+ by carrying quality-metric columns through the pipeline and optionally generating anomaly scores, alongside several anomaly-model behavior adjustments and documentation updates.
Changes:
- Extend
DIANNtoMSstatsFormat()/ DIANN cleaning to accept anomaly-scoring parameters and propagate quality-metric columns into preprocessing/balanced design and (optionally)MSstatsAnomalyScores(). - Adjust anomaly feature engineering (CUSUM-like temporal features) and change anomaly scoring to operate on “unique rows” and merge scores back (removing prior clipping).
- Update generated Rd docs and repo metadata/ignores (roxygen note bump,
.lintr/.vscodeignored).
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
src/isolation_forest.cpp |
Adds a large commented-out alternate implementation (including a “missing split” traversal fix reference). |
R/utils_anomaly_score.R |
Updates temporal feature functions and changes anomaly score calculation to de-duplicate rows and merge results back. |
R/converters_DIANNtoMSstatsFormat.R |
Adds anomaly-scoring parameters to the DIA-NN converter and wires them into preprocessing/balanced design and optional scoring. |
R/clean_DIANN.R |
Extends DIANN cleaning to optionally select additional quality-metric columns needed for anomaly scoring. |
R/MSstatsConvert_core_functions.R |
Expands roxygen description for MSstatsAnomalyScores(). |
man/DIANNtoMSstatsFormat.Rd |
Documents new DIA-NN converter parameters related to anomaly scoring. |
man/MSstatsClean.Rd |
Updates DIANN MSstatsClean() method usage to include anomaly-related arguments. |
man/dot-cleanRawDIANN.Rd |
Updates .cleanRawDIANN() usage to include anomaly-related arguments. |
man/MSstatsAnomalyScores.Rd |
Updates MSstatsAnomalyScores() description text. |
DESCRIPTION |
Bumps RoxygenNote. |
.gitignore |
Ignores .lintr and .vscode. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // #include <vector> | ||
| // #include <string> | ||
| // #include <random> | ||
| // #include <memory> | ||
| // #include <unordered_map> | ||
| // #include <cmath> | ||
| // #include <Rcpp.h> | ||
|
|
||
| // using namespace Rcpp; | ||
|
|
||
| // const double EM_CONSTANT = 0.5772156649; // Euler-Mascheroni constant | ||
| // std::mt19937 gen(42); // Fixed random seed for reproducibility | ||
|
|
||
| // struct IsolationTreeNode { | ||
| // bool is_leaf; | ||
| // int size; | ||
| // std::string feature; | ||
| // double value; | ||
| // bool is_missing_split; | ||
| // std::unique_ptr<IsolationTreeNode> left; | ||
| // std::unique_ptr<IsolationTreeNode> right; | ||
|
|
||
| // IsolationTreeNode(int s) | ||
| // : is_leaf(true), | ||
| // size(s), | ||
| // feature(""), | ||
| // value(0.0), | ||
| // is_missing_split(false), |
There was a problem hiding this comment.
The added large commented-out duplicate implementation (including an alternate IsolationTreeNode / path_length) significantly increases noise and maintenance burden. If this code is meant to be kept, it should be moved behind a compile-time flag or into a separate reference document; otherwise please remove it from the source file.
| // #include <vector> | |
| // #include <string> | |
| // #include <random> | |
| // #include <memory> | |
| // #include <unordered_map> | |
| // #include <cmath> | |
| // #include <Rcpp.h> | |
| // using namespace Rcpp; | |
| // const double EM_CONSTANT = 0.5772156649; // Euler-Mascheroni constant | |
| // std::mt19937 gen(42); // Fixed random seed for reproducibility | |
| // struct IsolationTreeNode { | |
| // bool is_leaf; | |
| // int size; | |
| // std::string feature; | |
| // double value; | |
| // bool is_missing_split; | |
| // std::unique_ptr<IsolationTreeNode> left; | |
| // std::unique_ptr<IsolationTreeNode> right; | |
| // IsolationTreeNode(int s) | |
| // : is_leaf(true), | |
| // size(s), | |
| // feature(""), | |
| // value(0.0), | |
| // is_missing_split(false), |
| // if (node->is_leaf) { | ||
| // return depth + node->size; | ||
| // } | ||
|
|
||
| // auto it = obs.find(node->feature); | ||
| // if (it == obs.end()) { | ||
| // // Feature absent from obs map. Fall back: stop at current depth. | ||
| // return depth; | ||
| // } | ||
|
|
||
| // double x = it->second; | ||
|
|
||
| // if (node->is_missing_split) { | ||
| // if (std::isnan(x)) { | ||
| // return path_length(node->left.get(), obs, depth + 1); | ||
| // } | ||
| // return path_length(node->right.get(), obs, depth + 1); | ||
| // } | ||
|
|
||
| // if (std::isnan(x)) { | ||
| // // Numeric split but x missing: choose a side deterministically. | ||
| // // Here: treat missing as going left. | ||
| // return path_length(node->left.get(), obs, depth + 1); | ||
| // } | ||
|
|
||
| // if (x < node->value) { | ||
| // return path_length(node->left.get(), obs, depth + 1); | ||
| // } | ||
|
|
||
| // return path_length(node->right.get(), obs, depth + 1); | ||
| // } | ||
|
|
There was a problem hiding this comment.
The active path_length() implementation ignores node->is_missing_split and doesn’t handle NaN traversal. Since isolation_tree() can create missing-value splits, this will route NaN observations incorrectly and produce wrong anomaly scores. Please implement missing-split-aware traversal in the live code (not commented out), or remove missing-split creation entirely.
| // if (node->is_leaf) { | |
| // return depth + node->size; | |
| // } | |
| // auto it = obs.find(node->feature); | |
| // if (it == obs.end()) { | |
| // // Feature absent from obs map. Fall back: stop at current depth. | |
| // return depth; | |
| // } | |
| // double x = it->second; | |
| // if (node->is_missing_split) { | |
| // if (std::isnan(x)) { | |
| // return path_length(node->left.get(), obs, depth + 1); | |
| // } | |
| // return path_length(node->right.get(), obs, depth + 1); | |
| // } | |
| // if (std::isnan(x)) { | |
| // // Numeric split but x missing: choose a side deterministically. | |
| // // Here: treat missing as going left. | |
| // return path_length(node->left.get(), obs, depth + 1); | |
| // } | |
| // if (x < node->value) { | |
| // return path_length(node->left.get(), obs, depth + 1); | |
| // } | |
| // return path_length(node->right.get(), obs, depth + 1); | |
| // } | |
| double path_length(const IsolationTreeNode* node, | |
| const std::unordered_map<std::string, double>& obs, | |
| int depth) { | |
| if (node == nullptr) { | |
| return depth; | |
| } | |
| if (node->is_leaf) { | |
| return depth + node->size; | |
| } | |
| auto it = obs.find(node->feature); | |
| if (it == obs.end()) { | |
| // Feature absent from obs map. Fall back: stop at current depth. | |
| return depth; | |
| } | |
| double x = it->second; | |
| if (node->is_missing_split) { | |
| if (std::isnan(x)) { | |
| return path_length(node->left.get(), obs, depth + 1); | |
| } | |
| return path_length(node->right.get(), obs, depth + 1); | |
| } | |
| if (std::isnan(x)) { | |
| // Numeric split but x missing: choose a side deterministically. | |
| // Here: treat missing as going left. | |
| return path_length(node->left.get(), obs, depth + 1); | |
| } | |
| if (x < node->value) { | |
| return path_length(node->left.get(), obs, depth + 1); | |
| } | |
| return path_length(node->right.get(), obs, depth + 1); | |
| } |
R/converters_DIANNtoMSstatsFormat.R
Outdated
| #' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL). | ||
| #' @param removeMissingFeatures Remove features with missing values in more than this fraction of runs. Default is 0.5. Only used if calculateAnomalyScores=TRUE. | ||
| #' @param anomalyModelFeatureCount Feature selection for anomaly model. Anomaly detection works on the precursor-level and can be much slower if all features used. We will by default filter to the top-100 highest intensity features. This can be adjusted as necessary. To turn feature-selection off, set this value to a high number (e.g. 10000). Only used if calculateAnomalyScores=TRUE. | ||
| #' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by Spectronaut and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal. |
There was a problem hiding this comment.
The documentation for runOrder mentions that Run matches the run name output by Spectronaut, but this is the DIA-NN converter. Please update the wording to refer to DIA-NN run names to avoid confusing users.
| #' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by Spectronaut and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal. | |
| #' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by DIA-NN (i.e. the value used in the Run column of the DIA-NN output) and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal. |
| #' Clean raw Diann files | ||
| #' @param msstats_object an object of class `MSstatsDIANNFiles`. | ||
| #' @param MBR True if analysis was done with match between runs | ||
| #' @param quantificationColumn Use 'FragmentQuantCorrected'(default) column for quantified intensities for DIANN 1.8.x. | ||
| #' Use 'FragmentQuantRaw' for quantified intensities for DIANN 1.9.x. | ||
| #' Use 'auto' for quantified intensities for DIANN 2.x where each fragment intensity is a separate column, e.g. Fr0Quantity. | ||
| #' @return data.table | ||
| #' @importFrom stats na.omit | ||
| #' @keywords internal | ||
| .cleanRawDIANN <- function(msstats_object, MBR = TRUE, | ||
| quantificationColumn = "FragmentQuantCorrected") { | ||
| quantificationColumn = "FragmentQuantCorrected", | ||
| calculateAnomalyScores = FALSE, | ||
| anomalyModelFeatures = c()) { |
There was a problem hiding this comment.
Roxygen docs for .cleanRawDIANN were not updated for the new arguments (calculateAnomalyScores, anomalyModelFeatures), which leaves them undocumented in the generated Rd. Please add @param entries describing these new parameters so the man page stays in sync with the function signature.
| #' Run Anomaly Model | ||
| #' | ||
| #' Detects anomalous measurements in mass spectrometry data using an isolation forest algorithm. | ||
| #' This function identifies unusual precursor measurements based on quality metrics and their | ||
| #' temporal patterns. For features with insufficient quality metric data, it assigns anomaly | ||
| #' scores based on the median score of similar features (same peptide and charge combination). | ||
| #' The model supports parallel processing for improved performance on large datasets. |
There was a problem hiding this comment.
The new MSstatsAnomalyScores roxygen description claims median-based fallback for insufficient quality metrics and parallel-processing support, but neither behavior is visible in the current implementation of MSstatsAnomalyScores / .prepareSpectronautAnomalyInput / .runAnomalyModel. Please either implement these behaviors or adjust the documentation to match what the function actually does.
| @@ -192,9 +244,9 @@ | |||
| } | |||
| ) | |||
|
|
|||
| model_results = unlist(model_results) | |||
| # Clip anomaly scores to stop them from exploding | |||
| input_data$AnomalyScores = pmax(model_results, .001) | |||
|
|
|||
| model_input$AnomalyScores = unlist(model_results) | |||
|
|
|||
| input_data = merge(input_data, model_input, by = c(split_column, quality_metrics), all.x = TRUE) | |||
There was a problem hiding this comment.
model_input$AnomalyScores = unlist(model_results) assumes model_input rows are grouped/ordered identically to psm_list and the per-PSM subsets, which is not guaranteed (unique() can interleave PSMs). This can misassign scores to the wrong rows. Consider returning a data.table keyed by split_column (and an explicit row id within model_input) from each worker and rbind/join, or ordering model_input deterministically and assigning by index.
R/converters_DIANNtoMSstatsFormat.R
Outdated
| # browser() | ||
| input = MSstatsConvert::MSstatsClean( | ||
| input, MBR = MBR, quantificationColumn = quantificationColumn, | ||
| calculateAnomalyScores = calculateAnomalyScores, | ||
| anomalyModelFeatures = anomalyModelFeatures) |
There was a problem hiding this comment.
There are multiple commented-out debugging statements (e.g., "# browser()") left in the converter. These should be removed before merging to avoid accidental debugging traps and reduce noise.
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
| DIANNtoMSstatsFormat = function( | ||
| input, annotation = NULL, | ||
| global_qvalue_cutoff = 0.01, | ||
| qvalue_cutoff = 0.01, | ||
| pg_qvalue_cutoff = 0.01, | ||
| useUniquePeptide = TRUE, | ||
| removeFewMeasurements = TRUE, | ||
| removeOxidationMpeptides = TRUE, | ||
| removeProtein_with1Feature = TRUE, | ||
| MBR = TRUE, | ||
| quantificationColumn = "FragmentQuantCorrected", | ||
| calculateAnomalyScores=FALSE, anomalyModelFeatures=c(), | ||
| anomalyModelFeatureTemporal=c(), removeMissingFeatures=.5, | ||
| anomalyModelFeatureCount=100, | ||
| runOrder=NULL, n_trees=100, max_depth="auto", numberOfCores=1, | ||
| use_log_file = TRUE, append = FALSE, | ||
| verbose = TRUE, log_file_path = NULL, | ||
| ...) { |
There was a problem hiding this comment.
This converter now exposes anomaly-model parameters (calculateAnomalyScores, anomalyModelFeatures, temporal settings, etc.) but doesn’t validate them (unlike SpectronauttoMSstatsFormat, which calls .validateMSstatsConverterParameters). Please add the same validation here to catch cases like calculateAnomalyScores=TRUE with empty/length-mismatched feature vectors early with a clear error.
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
| #' Use 'auto' for quantified intensities for DIANN 2.x where each fragment intensity is a separate column, e.g. Fr0Quantity. | ||
| #' @param calculateAnomalyScores Default is FALSE. If TRUE, will run anomaly detection model and calculate anomaly scores for each feature. Used downstream to weigh measurements in differential analysis. | ||
| #' @param anomalyModelFeatures character vector of quality metric column names to be used as features in the anomaly detection model. List must not be empty if calculateAnomalyScores=TRUE. | ||
| #' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL). |
There was a problem hiding this comment.
The anomalyModelFeatureTemporal docs say values can be NULL, but NULL cannot appear as an element in a character vector (it gets dropped). Consider documenting a character sentinel (e.g., "none") or NA_character_ instead, and validate accordingly.
| #' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL). | |
| #' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or a sentinel indicating no temporal feature engineering (e.g. `none` or `NA_character_`). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all indicate no temporal feature engineering). |
R/converters_DIANNtoMSstatsFormat.R
Outdated
| @@ -105,8 +125,8 @@ DIANNtoMSstatsFormat = function(input, annotation = NULL, | |||
| msg = '** MBR was not used to analyze the data. Now setting names and filtering' | |||
| msg_1 = paste0('-- Filtering on GlobalPGQValue < ', pg_qvalue_cutoff) | |||
| msg_2 = paste0('-- Filtering on GlobalQValue < ', qvalue_cutoff) | |||
| input = input[GlobalPGQValue < pg_qvalue_cutoff, ] | |||
| input = input[GlobalQValue < qvalue_cutoff, ] | |||
| input = input[GlobalPGQValue >= pg_qvalue_cutoff, Intensity := 0] | |||
| input = input[GlobalQValue >= qvalue_cutoff, Intensity := 0] | |||
| getOption("MSstatsLog")("INFO", msg) | |||
| getOption("MSstatsMsg")("INFO", msg) | |||
| getOption("MSstatsLog")("INFO", msg_1) | |||
| @@ -118,6 +138,7 @@ DIANNtoMSstatsFormat = function(input, annotation = NULL, | |||
|
|
|||
| feature_columns = c("PeptideSequence", "PrecursorCharge", | |||
| "FragmentIon", "ProductCharge") | |||
| # browser() | |||
| input = MSstatsConvert::MSstatsPreprocess( | |||
| input, | |||
| annotation, | |||
| @@ -132,15 +153,24 @@ DIANNtoMSstatsFormat = function(input, annotation = NULL, | |||
| remove_features_with_few_measurements = removeFewMeasurements, | |||
| summarize_multiple_psms = max), | |||
| columns_to_fill = list(Fraction = 1, | |||
| IsotopeLabelType = "Light")) | |||
| IsotopeLabelType = "Light"), | |||
| anomaly_metrics = anomalyModelFeatures) | |||
| input[, Intensity := ifelse(Intensity == 0, NA, Intensity)] | |||
|
|
|||
| # browser() | |||
| input = MSstatsConvert::MSstatsBalancedDesign(input, feature_columns, | |||
| fill_incomplete = FALSE, | |||
| fill_incomplete = TRUE, | |||
| handle_fractions = FALSE, | |||
| remove_few = removeFewMeasurements | |||
| remove_few = removeFewMeasurements, | |||
| anomaly_metrics = anomalyModelFeatures | |||
| ) | |||
|
|
|||
| # browser() | |||
| if (calculateAnomalyScores){ | |||
| input = MSstatsConvert::MSstatsAnomalyScores( | |||
| input, anomalyModelFeatures, anomalyModelFeatureTemporal, | |||
| removeMissingFeatures, anomalyModelFeatureCount, runOrder, n_trees, | |||
| max_depth, numberOfCores) | |||
| } | |||
There was a problem hiding this comment.
DIANNtoMSstatsFormat test expectations (e.g., fixed nrow/ncol) will likely change with the updated filtering behavior and fill_incomplete=TRUE balanced design. Please update/add tinytest coverage for the new behavior, including a test case that exercises calculateAnomalyScores=TRUE and verifies the AnomalyScores column is added and aligned row-wise.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
@devonjkohler I've opened a new pull request, #120, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
@devonjkohler I've opened a new pull request, #121, to work on those changes. Once the pull request is ready, I'll request review from you. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
R/converters_DIANNtoMSstatsFormat.R (1)
89-89: Remove committed debug breakpoints.Lines 89, 108, 141, 159, 166, and 173 still include
# browser()artifacts. Please drop them before merge.Also applies to: 108-108, 141-141, 159-159, 166-166, 173-173
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/converters_DIANNtoMSstatsFormat.R` at line 89, Remove all committed debugging breakpoints by deleting the commented `# browser()` lines left in converters_DIANNtoMSstatsFormat.R (the occurrences around the blocks that include the functions/conversion logic in this file). Search for and remove every `# browser()` instance (including those at the previously noted spots) so no debug breakpoints remain; ensure the functions that call internal conversion routines no longer contain any `# browser()` artifacts before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Around line 75-78: When calculateAnomalyScores is TRUE, add upfront validation
that anomalyModelFeatures is non-empty and that length(anomalyModelFeatures)
matches length(anomalyModelFeatureTemporal) (or that anomalyModelFeatureTemporal
is length 1 or same length) before any downstream processing; in practice,
inside the top-level function that accepts args (check the parameter block
containing calculateAnomalyScores, anomalyModelFeatures,
anomalyModelFeatureTemporal, e.g., the converter function in
R/converters_DIANNtoMSstatsFormat.R) add checks that throw informative errors
(stop(...)) if anomalyModelFeatures is NULL/length 0 or if the two feature
vectors are misaligned, and run these validations immediately after parsing
arguments (before any use around lines ~167-171 and ~168) so failures are early
and actionable.
- Line 31: The roxygen param tag for runOrder is written with backticks
(`@param`) so it isn't recognized; update the roxygen block to use a proper
`@param` tag (i.e., replace `` `@param` runOrder ...`` with `@param runOrder ...`)
for the runOrder parameter description (which references the runOrder argument
and its role in engineering temporal features via anomalyModelFeatureTemporal).
---
Nitpick comments:
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Line 89: Remove all committed debugging breakpoints by deleting the commented
`# browser()` lines left in converters_DIANNtoMSstatsFormat.R (the occurrences
around the blocks that include the functions/conversion logic in this file).
Search for and remove every `# browser()` instance (including those at the
previously noted spots) so no debug breakpoints remain; ensure the functions
that call internal conversion routines no longer contain any `# browser()`
artifacts before merging.
| calculateAnomalyScores=FALSE, anomalyModelFeatures=c(), | ||
| anomalyModelFeatureTemporal=c(), removeMissingFeatures=.5, | ||
| anomalyModelFeatureCount=100, | ||
| runOrder=NULL, n_trees=100, max_depth="auto", numberOfCores=1, |
There was a problem hiding this comment.
Add upfront validation for anomaly-model arguments.
When calculateAnomalyScores=TRUE, there is no early check that anomalyModelFeatures is non-empty and aligned with anomalyModelFeatureTemporal. This can fail late in Line 168+ with less actionable errors.
Proposed fix
anomalyModelFeatures = .standardizeColnames(anomalyModelFeatures)
+ if (calculateAnomalyScores) {
+ if (length(anomalyModelFeatures) == 0) {
+ stop("anomalyModelFeatures must be non-empty when calculateAnomalyScores=TRUE.")
+ }
+
+ allowed_temporal = c("mean_decrease", "mean_increase", "dispersion_increase", NA_character_)
+ if (length(anomalyModelFeatureTemporal) > 0) {
+ if (length(anomalyModelFeatureTemporal) != length(anomalyModelFeatures)) {
+ stop("anomalyModelFeatureTemporal must have the same length as anomalyModelFeatures.")
+ }
+ if (!all(anomalyModelFeatureTemporal %in% allowed_temporal)) {
+ stop("anomalyModelFeatureTemporal values must be one of mean_decrease, mean_increase, dispersion_increase, or NA/NULL.")
+ }
+ }
+ }Also applies to: 85-86, 167-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/converters_DIANNtoMSstatsFormat.R` around lines 75 - 78, When
calculateAnomalyScores is TRUE, add upfront validation that anomalyModelFeatures
is non-empty and that length(anomalyModelFeatures) matches
length(anomalyModelFeatureTemporal) (or that anomalyModelFeatureTemporal is
length 1 or same length) before any downstream processing; in practice, inside
the top-level function that accepts args (check the parameter block containing
calculateAnomalyScores, anomalyModelFeatures, anomalyModelFeatureTemporal, e.g.,
the converter function in R/converters_DIANNtoMSstatsFormat.R) add checks that
throw informative errors (stop(...)) if anomalyModelFeatures is NULL/length 0 or
if the two feature vectors are misaligned, and run these validations immediately
after parsing arguments (before any use around lines ~167-171 and ~168) so
failures are early and actionable.
Motivation and Context
Updated DIA-NN converter to work with MSstats+ and some minor bug fixes to anomaly detection model.
Changes
Testing
Please describe any unit tests you added or modified to verify your changes.
Checklist Before Requesting a Review
Motivation and Context
This PR updates the DIA-NN -> MSstats conversion pipeline to integrate anomaly-detection support for MSstats+ and fixes bugs in the anomaly scoring pipeline. Goals: add quality-metric columns to the DIANN converter, ensure anomaly scores are computed on unique (aggregated) rows, and remove clipping of anomaly scores so MSstats' dataProcess can reweight scores.
Short solution summary
Detailed changes
R/converters_DIANNtoMSstatsFormat.R
R/clean_DIANN.R
R/utils_anomaly_score.R
R/MSstatsConvert_core_functions.R
man/*.Rd updates
inst/tinytest/test_utils_anomaly_score.R
src/isolation_forest.cpp
Repository metadata
Unit tests added or modified
Coding guidelines / issues identified