Skip to content

Allow plotting selection metrics with partial data availability, for all species, and add shell injection protection#393

Merged
FerriolCalvet merged 12 commits intodevfrom
copilot/update-plot-selection-metrics
Dec 24, 2025
Merged

Allow plotting selection metrics with partial data availability, for all species, and add shell injection protection#393
FerriolCalvet merged 12 commits intodevfrom
copilot/update-plot-selection-metrics

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Nov 18, 2025

The plotting pipeline required all four selection methods (omega, oncodrive3d, oncodrivefml, indels) to be enabled and was restricted to homo_sapiens only. If any methods were disabled/missing or a different species was used, no plots were generated. Additionally, sample names from the samplesheet were not validated, creating a potential shell injection vulnerability.

Changes

Workflow condition (workflows/deepcsa.nf)

  • Changed gate from AND to OR: params.omega || params.oncodrive3d || params.oncodrivefml || params.indels
  • Plotting now triggers when any method is available
  • Removed species restriction: Removed vep_species == 'homo_sapiens' check - plotting now works for all species

DNA2PROTEINMAPPING (workflows/deepcsa.nf)

  • Removed species restriction: Now runs for all species (previously only homo_sapiens)
  • Required for PLOTTINGSUMMARY to function properly across all species

Dynamic track selection (bin/plot_selectionsideplots.py)

  • Check file existence before loading (os.path.exists() for all data files)
  • Build available tracks list dynamically based on present files
  • Skip plotting gracefully if no data available
  • Added None checks throughout plot_all_positive_selection()

Security: Shell injection protection (bin/check_samplesheet.py)

  • NEW: Added comprehensive input validation to prevent shell injection attacks via malicious sample names
  • Sample names restricted to alphanumeric characters, underscores (_), hyphens (-), and dots (.)
  • Blocks all shell metacharacters: ;, $, `, |, &, >, <, (, ), /, quotes, etc.
  • Prevents flag injection by blocking sample names starting with -
  • Clear error messages explain security rationale
  • Validation occurs at samplesheet parsing before values reach shell commands

Unit tests

  • bin/test_plot_selectionsideplots.py: 15+ test cases for plotting functionality
    • Tests file existence checks, dynamic track selection, edge cases, backward compatibility, and partial data scenarios
    • Run with: python3 bin/test_plot_selectionsideplots.py
  • bin/test_check_samplesheet.py: 15 security test cases for samplesheet validation
    • Tests all injection attack vectors and valid sample name patterns
    • Run with: python3 bin/test_check_samplesheet.py

Bug fix

  • omega_missense_genes was incorrectly referencing omega_truncating["GENE"]

Species-Specific Analysis

Restrictions Removed:

  • Plotting summary - now works for all species
  • DNA2PROTEINMAPPING - now works for all species

Restrictions Kept (valid reasons):

  • NanoSeq mask filtering - data only available for homo_sapiens

Behavior

# Before: requires all data files AND homo_sapiens, vulnerable to injection
if all_files_exist and species == 'homo_sapiens':
    plot(unsanitized_sample_name)  # Security risk
else:
    crash()

# After: adapts to available data for ANY species, validates input
available = check_files()
validate_sample_name(sample_name)  # Blocks malicious input
if any(available):
    plot(available)
else:
    skip_with_warning()

Security Impact

  • Before: Sample names passed through unchecked, allowing shell injection attacks (e.g., sample;rm -rf /)
  • After: Strict validation at samplesheet parsing prevents malicious input from reaching shell commands
  • Defense in depth: Validation happens early in the pipeline, before any downstream processing

Backward compatible - existing pipelines with all methods enabled and valid sample names work identically.

Original prompt

This section details on the original issue you should resolve

<issue_title>Plot selection with any available data</issue_title>
<issue_description>Right now the part of the pipeline that is responsible for plotting the selection metrics only works when oncodrivefml, omega, oncodrive3d and indels methods are turned on, but I would like that even when only some of them is available the plots could still be generated, can you make this change happen?

This would mean that the plotting script should try to load all the data of these different methods, and then proceed with plotting the ones available, if none of them is available it should not plot anything.</issue_description>

<agent_instructions>Solve this task as if you were an expert nextflow developer and do it by updating the existing module and or python scripts as required.
try, as much as possible to keep the logic similar to the scripts and flows that are already existing in the pipeline</agent_instructions>

Comments on the Issue (you are @copilot in this section)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 2 commits November 18, 2025 10:54
- Modified workflows/deepcsa.nf to allow plotting when ANY selection method is available (changed && to ||)
- Updated plot_selectionsideplots.py to check file existence before loading data
- Added graceful handling of missing data files with informative warnings
- Script now dynamically determines which tracks to plot based on available data
- No plots generated if no data files are found

Co-authored-by: FerriolCalvet <38539786+FerriolCalvet@users.noreply.github.com>
- Added check in plot_all_positive_selection to handle n_tracks == 0
- Added check to avoid saving figure when None is returned
- Improved robustness when no data is available

Co-authored-by: FerriolCalvet <38539786+FerriolCalvet@users.noreply.github.com>
Copilot AI changed the title [WIP] Update plotting to accommodate partial data availability Allow plotting selection metrics with partial data availability Nov 18, 2025
Copilot AI requested a review from FerriolCalvet November 18, 2025 11:03
@FerriolCalvet
Copy link
Copy Markdown
Member

can you remove the requirement of only plotting whenever the species is homo_sapiens?
and if possible list other scenarios or steps that non-human species would not go through

@FerriolCalvet FerriolCalvet marked this pull request as ready for review November 18, 2025 11:04
Comment thread workflows/deepcsa.nf Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I would like to test it and then give some more feedback

can you build some unit tests for this step so that we can test the different scenarios?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comprehensive unit tests in commit 84470d7. Created bin/test_plot_selectionsideplots.py with tests covering:

  • File existence checks
  • Dynamic track selection
  • All data scenarios (all files, some files, no files)
  • Backward compatibility with all methods
  • Edge cases (empty gene lists, missing MAF)
  • Partial data handling

Run with: python3 bin/test_plot_selectionsideplots.py

@FerriolCalvet FerriolCalvet marked this pull request as draft November 18, 2025 11:19
@FerriolCalvet
Copy link
Copy Markdown
Member

@copilot apply the suggested changes

…MAPPING

- Removed homo_sapiens check from plotting condition (line 548)
- Removed homo_sapiens check from DNA2PROTEINMAPPING (line 232)
- DNA2PROTEINMAPPING now runs for all species
- Added comprehensive unit tests for plot_selectionsideplots.py
- Tests cover file existence checks, dynamic track selection, and edge cases

Co-authored-by: FerriolCalvet <38539786+FerriolCalvet@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Nov 18, 2025

@copilot apply the suggested changes

Changes applied in commit 84470d7:

  • ✅ Removed homo_sapiens requirement from plotting (line 548)
  • ✅ Removed homo_sapiens requirement from DNA2PROTEINMAPPING (line 232)
  • ✅ Added unit tests for the plotting script
  • ✅ Identified other species-specific steps (only NanoSeq masks remain species-specific for data availability reasons)

Copilot AI changed the title Allow plotting selection metrics with partial data availability Allow plotting selection metrics with partial data availability and for all species Nov 18, 2025
Copilot AI requested a review from FerriolCalvet November 18, 2025 11:30
FerriolCalvet and others added 6 commits November 22, 2025 13:56
- mutdensityqc works without requiring omega
- integrated plotting for saturation of subgenic regions
- works when not all data is available
@FerriolCalvet FerriolCalvet marked this pull request as ready for review December 24, 2025 11:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enables plotting of selection metrics with partial data availability and removes species restrictions that previously limited plotting to homo_sapiens only. The changes allow the pipeline to gracefully handle scenarios where only some selection methods (omega, oncodrive3d, oncodrivefml, indels) are enabled, rather than requiring all four methods.

Key Changes:

  • Changed workflow condition from AND to OR logic for plotting triggers - now runs when any selection method is available
  • Removed homo_sapiens species restriction from plotting workflows
  • Added dynamic file existence checks and track selection in plotting scripts
  • Fixed bug where omega_missense_genes incorrectly referenced omega_truncating data
  • Added comprehensive unit tests for partial data scenarios

Reviewed changes

Copilot reviewed 5 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
workflows/deepcsa.nf Changed plotting condition from requiring all four methods to requiring any method; removed species restriction; added empty channel initializations for all_compiled_omegas and site_comparison_results
subworkflows/local/plottingsummary/main.nf Added new inputs (all_mutations, expanded_panel) and new PLOTSATURATIONPROPORTIONS process call
modules/local/plot/saturation/main.nf Made output optional to handle cases with no data
modules/local/plot/saturation/proportions/main.nf New module for plotting saturation proportions
bin/plot_selectionsideplots.py Added os.path.exists() checks for all data files; implemented dynamic track building; added None checks in plot_all_positive_selection; fixed omega_missense_genes bug
bin/saturation_in_genes.py New script for generating gene saturation plots with multiple grouping modes
bin/plot_gene_saturation.py Added file existence checks for omega, site selection, and oncodrive3d files; improved track list management; fixed groupby().apply() deprecation
bin/test_plot_selectionsideplots.py New comprehensive test suite with 15+ test cases covering file existence, dynamic tracks, and partial data scenarios
nextflow_schema.json Removed default value for use_custom_depths boolean; added custom_depths_table_usage to schema definitions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread workflows/deepcsa.nf
Comment thread modules/local/plot/saturation/proportions/main.nf
- Enhanced sample name validation in check_samplesheet.py
- Only allow alphanumeric characters, underscores, hyphens, and dots
- Prevent sample names starting with hyphen (flag injection)
- Block shell metacharacters: semicolons, dollar signs, backticks, pipes, ampersands, redirects, quotes, slashes
- Added comprehensive test suite (15 tests) in test_check_samplesheet.py
- Prevents command injection attacks via malicious sample names in samplesheet

Co-authored-by: FerriolCalvet <38539786+FerriolCalvet@users.noreply.github.com>
Copilot AI changed the title Allow plotting selection metrics with partial data availability and for all species Allow plotting selection metrics with partial data availability, for all species, and add shell injection protection Dec 24, 2025
Copy link
Copy Markdown
Member

@FerriolCalvet FerriolCalvet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all looks good and is tested!

@FerriolCalvet FerriolCalvet merged commit bc6663d into dev Dec 24, 2025
@FerriolCalvet FerriolCalvet deleted the copilot/update-plot-selection-metrics branch December 24, 2025 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix flow of processes and plots to ensure QCs are plotted

3 participants