Improve raw data file extension handling in input channel creation #69
…tution

The previous substitution stripped only the last '.ext', so SDRFs that contain compound extensions (e.g. '.d.zip' as in PXD065380 / test_dia_dotd, or '.d.tar.gz') produced wrong paths when combined with --root_folder.

Examples of the bug:
  /root/sample.d.zip + local_input_type=d.zip -> /root/sample.d.d.zip
  /root/sample.d.tar.gz + local_input_type=d -> /root/sample.d.tar.d
  /root/sample.d.zip + local_input_type=raw -> /root/sample.d.raw

Strip the longest matching known raw-data extension first so the resulting path targets the real file on disk.

https://claude.ai/code/session_01CtVkdv9sgCm6WUnfZxYrPr
Pull request overview
This PR updates the local input-channel creation logic to correctly strip compound raw-data file extensions (e.g., .d.zip, .d.tar.gz, .mzML.gz) before appending the desired --local_input_type, preventing incorrect stems.
Changes:
- Introduces an ordered knownRawExts list to match and strip the longest known raw-data suffix first.
- Refactors filename rewriting to prefer known compound-suffix stripping, with a fallback to the previous “strip after last dot” behavior.
// Known raw-data extensions (order matters: strip longest/compound ones first
// so 'sample.d.zip' -> 'sample', not 'sample.d').
The comment opened on the previous line has an unmatched parenthesis: "Known raw-data extensions (order matters: ...". Please close the parenthesis or rephrase to avoid the dangling '(' in the comment.
Suggested change:
- // Known raw-data extensions (order matters: strip longest/compound ones first
- // so 'sample.d.zip' -> 'sample', not 'sample.d').
+ // Known raw-data extensions; order matters, so strip longest/compound ones first
+ // so 'sample.d.zip' -> 'sample', not 'sample.d'.
def stem = filestr
def matched = knownRawExts.find { stem.endsWith(it) }
if (matched) {
    stem = stem.substring(0, stem.length() - matched.length())
} else if (stem.lastIndexOf('.') > 0) {
    stem = stem.take(stem.lastIndexOf('.'))
Suffix matching is currently case-sensitive (stem.endsWith(it)). If the SDRF lists files like sample.mzml.gz / sample.raw.gz (lowercase), no known extension will match and the fallback will only strip the final .gz, producing an incorrect name like sample.mzml.mzML after appending params.local_input_type. Consider doing case-insensitive matching (e.g., compare on toLowerCase()), while still stripping by the matched suffix length.
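A minimal sketch of what case-insensitive matching could look like, written as plain Groovy outside the pipeline. The contents of knownRawExts and the rewriteExt closure are assumptions made for illustration; the PR's actual list is not reproduced on this page.

```groovy
// Hypothetical extension list (assumption): longest/compound suffixes first
// so they win over their shorter tails.
def knownRawExts = ['.d.tar.gz', '.mzML.gz', '.raw.gz', '.d.zip', '.mzML', '.raw', '.d']

// Hypothetical helper: strip a known suffix ignoring case, then append the target type.
def rewriteExt = { String filestr, String targetType ->
    def lower = filestr.toLowerCase()
    // Compare on lowercase, but strip by the length of the matched suffix
    // so the original casing of the stem is preserved.
    def matched = knownRawExts.find { lower.endsWith(it.toLowerCase()) }
    def stem = filestr
    if (matched) {
        stem = filestr.substring(0, filestr.length() - matched.length())
    } else if (filestr.lastIndexOf('.') > 0) {
        // Fallback: previous "strip after last dot" behavior.
        stem = filestr.take(filestr.lastIndexOf('.'))
    }
    return stem + '.' + targetType
}

assert rewriteExt('sample.mzml.gz', 'mzML') == 'sample.mzML'
assert rewriteExt('sample.RAW.GZ', 'raw') == 'sample.raw'
```

Lowercasing both sides keeps the ordered list unchanged, and because the lowercase suffix has the same length as the listed one, stripping by matched.length() still removes exactly the suffix found in the file name.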
if (params.local_input_type) {
    // Strip the longest matching known raw-data extension (covers
    // compound suffixes like .d.zip / .d.tar.gz from the SDRF),
    // then append the target extension.
    def stem = filestr
    def matched = knownRawExts.find { stem.endsWith(it) }
    if (matched) {
        stem = stem.substring(0, stem.length() - matched.length())
    } else if (stem.lastIndexOf('.') > 0) {
        stem = stem.take(stem.lastIndexOf('.'))
    }
    filestr = stem + '.' + params.local_input_type
}
This change fixes a real filename-mapping bug for compound extensions, but there doesn’t appear to be a test exercising the --root_folder + --local_input_type path rewriting (especially for .d.zip / .d.tar.gz / .mzML.gz). Adding an nf-test case (or extending the existing snapshot test inputs) to cover at least one compound extension would help prevent regressions.
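Short of a full nf-test case, the rewriting logic could be pinned down with a table of plain Groovy assertions along these lines. This is only a sketch: the knownRawExts entries, the rewrite closure name, and the /root prefix are assumptions for illustration, and the closure body mirrors the snippet under review rather than the pipeline's actual wiring of --root_folder.

```groovy
// Hypothetical extension list (assumption), longest/compound suffixes first.
def knownRawExts = ['.d.tar.gz', '.mzML.gz', '.d.zip', '.mzML', '.raw', '.d']

// Mirrors the reviewed logic: strip the first matching known suffix,
// otherwise fall back to stripping after the last dot.
def rewrite = { String filestr, String localInputType ->
    def stem = filestr
    def matched = knownRawExts.find { stem.endsWith(it) }
    if (matched) {
        stem = stem.substring(0, stem.length() - matched.length())
    } else if (stem.lastIndexOf('.') > 0) {
        stem = stem.take(stem.lastIndexOf('.'))
    }
    return stem + '.' + localInputType
}

// Each row: [input path, local_input_type, expected rewritten path].
def cases = [
    ['/root/sample.d.zip',    'd.zip', '/root/sample.d.zip'],
    ['/root/sample.d.tar.gz', 'd',     '/root/sample.d'],
    ['/root/sample.d.zip',    'raw',   '/root/sample.raw'],
    ['/root/sample.mzML.gz',  'mzML',  '/root/sample.mzML'],
    ['/root/sample.raw',      'mzML',  '/root/sample.mzML'],
]
cases.each { c ->
    assert rewrite(c[0], c[1]) == c[2]
}
```

The first three rows are the inputs from the commit message; the last row checks that single-extension files still behave as before.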
Description
This PR improves the handling of raw data file extensions when creating input channels, particularly for compound extensions like .d.zip, .d.tar.gz, and .mzML.gz.

Changes
- … (knownRawExts) ordered from longest to shortest to ensure compound suffixes are stripped correctly
- … params.local_input_type

Motivation
The previous implementation used lastIndexOf('.'), which would incorrectly handle files with compound extensions. For example, sample.d.zip would become sample.d instead of sample when converting to a different format. This fix ensures that known compound extensions are properly recognized and stripped as complete units.

Testing
The change maintains backward compatibility with single-extension files while correctly handling compound extensions. Existing tests should continue to pass.
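For reference, the previous behavior can be reproduced in a few lines of Groovy. This is a reconstruction from the description (strip after the last dot, then append the target type), not the original code; oldRewrite is a hypothetical name.

```groovy
// Reconstruction (assumption) of the old logic: drop everything after the
// last '.', then append the requested local_input_type.
def oldRewrite = { String filestr, String localInputType ->
    def dot = filestr.lastIndexOf('.')
    def stem = dot > 0 ? filestr.take(dot) : filestr
    return stem + '.' + localInputType
}

// Only the final '.zip' / '.gz' is removed, so part of the compound
// extension leaks into the stem.
assert oldRewrite('/root/sample.d.zip', 'd.zip') == '/root/sample.d.d.zip'
assert oldRewrite('/root/sample.d.tar.gz', 'd') == '/root/sample.d.tar.d'
assert oldRewrite('/root/sample.d.zip', 'raw') == '/root/sample.d.raw'
```

These are the three wrong paths listed in the commit message; a single-extension input such as sample.raw is unaffected, which is why the fallback can keep this behavior.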
PR checklist
- … (nf-core pipelines lint).
- … (nextflow run . -profile test,docker --outdir <OUTDIR>).
- … (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
- … docs/usage.md is updated.
- … docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).

https://claude.ai/code/session_01CtVkdv9sgCm6WUnfZxYrPr