
Improve raw data file extension handling in input channel creation#69

Merged
ypriverol merged 1 commit into dev from claude/fix-raw-file-types-oopd6
Apr 27, 2026

@ypriverol
Member

Description

This PR improves the handling of raw data file extensions when creating input channels, particularly for compound extensions like .d.zip, .d.tar.gz, and .mzML.gz.

Changes

  • Added a list of known raw-data file extensions (knownRawExts) ordered from longest to shortest to ensure compound suffixes are stripped correctly
  • Refactored the file extension replacement logic to:
    • First attempt to match and strip the longest known raw-data extension
    • Fall back to the previous behavior (stripping at the last dot) if no known extension matches
    • Then append the target extension specified by params.local_input_type
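The three steps above can be sketched in plain Groovy as follows. This is a minimal illustration, not the merged code: the contents of knownRawExts are assumed (the PR text names only .d.zip, .d.tar.gz and .mzML.gz), and replaceRawExt is a hypothetical helper name introduced here.

```groovy
// Assumed list contents; only .d.zip, .d.tar.gz and .mzML.gz are named in the PR.
// Ordered longest-first so a compound suffix wins over its shorter tail.
def knownRawExts = ['.d.tar.gz', '.mzML.gz', '.d.zip', '.mzML', '.raw', '.d']

// Hypothetical helper: swap the raw-data extension of fileName for targetExt.
def replaceRawExt = { String fileName, String targetExt ->
    def stem = fileName
    // Step 1: strip the first (longest) known extension that matches.
    def matched = knownRawExts.find { stem.endsWith(it) }
    if (matched) {
        stem = stem.substring(0, stem.length() - matched.length())
    } else if (stem.lastIndexOf('.') > 0) {
        // Step 2: fall back to cutting at the last dot.
        stem = stem.take(stem.lastIndexOf('.'))
    }
    // Step 3: append the target extension.
    return stem + '.' + targetExt
}

assert replaceRawExt('sample.d.zip', 'raw') == 'sample.raw'
assert replaceRawExt('sample.d.tar.gz', 'd') == 'sample.d'
assert replaceRawExt('sample.wiff', 'mzML') == 'sample.mzML'   // fallback path
```

Because find returns the first match, the longest-first ordering is what keeps '.d' from matching before '.d.zip' has had a chance to.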

Motivation

The previous implementation used lastIndexOf('.'), which strips only the final suffix and therefore mishandles compound extensions. For example, sample.d.zip became sample.d instead of sample before the target extension was appended. This fix ensures that known compound extensions are recognized and stripped as complete units.

Testing

The change maintains backward compatibility with single-extension files while correctly handling compound extensions. Existing tests should continue to pass.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool, have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the bigbio/quantmsdiann branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

https://claude.ai/code/session_01CtVkdv9sgCm6WUnfZxYrPr

…tution

The previous substitution stripped only the last '.ext', so SDRFs that
contain compound extensions (e.g. '.d.zip' as in PXD065380 / test_dia_dotd,
or '.d.tar.gz') produced wrong paths when combined with --root_folder.

Examples of the bug:
  /root/sample.d.zip    + local_input_type=d.zip -> /root/sample.d.d.zip
  /root/sample.d.tar.gz + local_input_type=d     -> /root/sample.d.tar.d
  /root/sample.d.zip    + local_input_type=raw   -> /root/sample.d.raw

Strip the longest matching known raw-data extension first so the resulting
path targets the real file on disk.
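
The three buggy results listed above can be reproduced with a short Groovy sketch of the old behaviour. The original code is not shown in this PR, so oldReplaceExt is a hypothetical reconstruction from the description ("stripped only the last '.ext'"), not the removed implementation.

```groovy
// Hypothetical reconstruction of the pre-fix logic: cut at the last dot,
// then append the target extension.
def oldReplaceExt = { String path, String targetExt ->
    def idx = path.lastIndexOf('.')
    def stem = idx > 0 ? path.take(idx) : path
    return stem + '.' + targetExt
}

// Each assertion matches one example from the commit message.
assert oldReplaceExt('/root/sample.d.zip', 'd.zip') == '/root/sample.d.d.zip'
assert oldReplaceExt('/root/sample.d.tar.gz', 'd') == '/root/sample.d.tar.d'
assert oldReplaceExt('/root/sample.d.zip', 'raw') == '/root/sample.d.raw'
```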

https://claude.ai/code/session_01CtVkdv9sgCm6WUnfZxYrPr
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22a71287-a5d7-49cc-bc39-5cacecc82157


@github-actions

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit b32e74d

+| ✅ 106 tests passed       |+
#| ❔  19 tests were ignored |#
#| ❔   1 tests had warnings |#
!| ❗   4 tests had warnings |!
Details

❗ Test warnings:

  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: conf/igenomes_ignored.config
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml

❔ Tests ignored:

❔ Tests fixed:

✅ Tests passed:

Run details

  • nf-core/tools version 3.5.2
  • Run at 2026-04-22 11:49:51

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the local input-channel creation logic to correctly strip compound raw-data file extensions (e.g., .d.zip, .d.tar.gz, .mzML.gz) before appending the desired --local_input_type, preventing incorrect stems.

Changes:

  • Introduces an ordered knownRawExts list to match and strip the longest known raw-data suffix first.
  • Refactors filename rewriting to prefer known compound-suffix stripping, with a fallback to the previous “strip after last dot” behavior.


Comment on lines +22 to +23
// Known raw-data extensions (order matters: strip longest/compound ones first
// so 'sample.d.zip' -> 'sample', not 'sample.d').

Copilot AI Apr 22, 2026


The comment opened on the previous line has an unmatched parenthesis: "Known raw-data extensions (order matters: ...". Please close the parenthesis or rephrase to avoid the dangling '(' in the comment.

Suggested change
- // Known raw-data extensions (order matters: strip longest/compound ones first
- // so 'sample.d.zip' -> 'sample', not 'sample.d').
+ // Known raw-data extensions; order matters, so strip longest/compound ones first
+ // so 'sample.d.zip' -> 'sample', not 'sample.d'.

Comment on lines +51 to +56
def stem = filestr
def matched = knownRawExts.find { stem.endsWith(it) }
if (matched) {
    stem = stem.substring(0, stem.length() - matched.length())
} else if (stem.lastIndexOf('.') > 0) {
    stem = stem.take(stem.lastIndexOf('.'))

Copilot AI Apr 22, 2026


Suffix matching is currently case-sensitive (stem.endsWith(it)). If the SDRF lists files like sample.mzml.gz / sample.raw.gz (lowercase), no known extension will match and the fallback will only strip the final .gz, producing an incorrect name like sample.mzml.mzML after appending params.local_input_type. Consider doing case-insensitive matching (e.g., compare on toLowerCase()), while still stripping by the matched suffix length.
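
One way the case-insensitive variant suggested above might look is sketched below in plain Groovy. It is an illustration only: the list contents (including .raw.gz) and the helper name replaceRawExtIgnoreCase are assumptions, and the merged code may differ. The comparison is done on lowercased copies, while the stem is cut from the original string using the matched suffix's length.

```groovy
// Assumed list contents, ordered longest-first.
def knownRawExts = ['.d.tar.gz', '.mzML.gz', '.raw.gz', '.d.zip', '.mzML', '.raw', '.d']

// Hypothetical helper: same flow as the PR snippet, but case-insensitive matching.
def replaceRawExtIgnoreCase = { String fileName, String targetExt ->
    // Locale.ROOT avoids locale-specific case rules changing the result.
    def lower = fileName.toLowerCase(Locale.ROOT)
    def matched = knownRawExts.find { lower.endsWith(it.toLowerCase(Locale.ROOT)) }
    def stem = fileName
    if (matched) {
        // Strip by length so the original casing of the stem is preserved.
        stem = fileName.substring(0, fileName.length() - matched.length())
    } else if (fileName.lastIndexOf('.') > 0) {
        stem = fileName.take(fileName.lastIndexOf('.'))
    }
    return stem + '.' + targetExt
}

assert replaceRawExtIgnoreCase('sample.mzml.gz', 'mzML') == 'sample.mzML'
assert replaceRawExtIgnoreCase('Sample.RAW.GZ', 'raw') == 'Sample.raw'
```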

Comment on lines +47 to +59
if (params.local_input_type) {
    // Strip the longest matching known raw-data extension (covers
    // compound suffixes like .d.zip / .d.tar.gz from the SDRF),
    // then append the target extension.
    def stem = filestr
    def matched = knownRawExts.find { stem.endsWith(it) }
    if (matched) {
        stem = stem.substring(0, stem.length() - matched.length())
    } else if (stem.lastIndexOf('.') > 0) {
        stem = stem.take(stem.lastIndexOf('.'))
    }
    filestr = stem + '.' + params.local_input_type
}

Copilot AI Apr 22, 2026


This change fixes a real filename-mapping bug for compound extensions, but there doesn’t appear to be a test exercising the --root_folder + --local_input_type path rewriting (especially for .d.zip / .d.tar.gz / .mzML.gz). Adding an nf-test case (or extending the existing snapshot test inputs) to cover at least one compound extension would help prevent regressions.

@ypriverol ypriverol changed the base branch from copilot/fix-local-raw-file-type to dev April 24, 2026 13:28
@ypriverol ypriverol merged commit b32e74d into dev Apr 27, 2026
24 checks passed
@ypriverol ypriverol deleted the claude/fix-raw-file-types-oopd6 branch May 5, 2026 05:11
3 participants