Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fastp: correctly parse sample name from --in1/--in2 command. Fallback to file name #2139

Merged
merged 7 commits into from Oct 30, 2023

Conversation

vladsavelyev
Copy link
Member

@vladsavelyev vladsavelyev commented Oct 19, 2023

Fixes #2138

The sample name parsing logic:

  • Parse the command field found in JSON. E.g. "fastp -c -g -y --in1 Sample 1 1.fastq.gz --in2 Sample 1_2.fastq.gz --out1 ..." will parse everything between --in1 and the following -, plus do normal file name trimming, ending up with Sample 1 1.
  • If the step above failed, fall back to the file name (e.g. fastp.json) to extract the sample name (e.g. fastp).
  • Add config.fastp.s_name_filenames option to force using the file names for sample names.

Additionally, fix the self.ignore_samples() logic for the module (it previously affected only the main self.fastp_data, but not the derived dictionaries).

@vladsavelyev vladsavelyev added the bug: module Bug in a MultiQC module label Oct 19, 2023
@vladsavelyev vladsavelyev changed the title Fastp: correctly parse sample name from --in1/--in2 command. Fallback to file name Fastp: correctly parse sample name from --in1/--in2 command. Prefer file name if not fastp.json; fallback to file name when error Oct 19, 2023
@vladsavelyev
Copy link
Member Author

@multiqc-bot changelog

@vladsavelyev vladsavelyev added this to the MultiQC v1.18 milestone Oct 26, 2023
@ewels
Copy link
Member

ewels commented Oct 26, 2023

Let's switch the priority, as discussed here: #2138 (comment)

@vladsavelyev vladsavelyev changed the title Fastp: correctly parse sample name from --in1/--in2 command. Prefer file name if not fastp.json; fallback to file name when error Fastp: correctly parse sample name from --in1/--in2 command. Fallback to file name Oct 27, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 27, 2023

🚀 Deployed on https://mqc-pr-2139--multiqc.netlify.app

@vladsavelyev vladsavelyev merged commit 2d69c45 into master Oct 30, 2023
7 checks passed
@ewels ewels deleted the fastp-fallback-fname branch October 31, 2023 11:10
@tamuanand
Copy link

tamuanand commented Nov 1, 2023

Hi @ewels and @vladsavelyev

with your proposed switch, how will this below be handled

fastp --in1 Sample_A_1.fastq.gz --in2 Sample_A_2.fastq.gz --out Sample_A.fastp.json

Will my general stats table show this as Sample_A_1?

@vladsavelyev
Copy link
Member Author

Hi Anand - yes, exactly. Unless you specify extra_fn_clean_trim: "_1" in the config.

@tamuanand
Copy link

tamuanand commented Nov 4, 2023

Hi @vladsavelyev and @ewels - I do not agree with this - #2138 (comment) - please correct me if I am wrong

My preference is to use the FastQ in the command line as the top priority - this is the same behaviour as we have in most other modules, such as Trimmomatic, Cutadapt, FastQC etc
  • with FastQC (for PE data) - both "Sample_A_1" and "Sample_A_2" make it to the general stats table with the Read count data, %GC data etc.

Will fastp too have "Sample_A_1" and "Sample_A_2" etc reported?

  • if yes, then I agree with @ewels comment above
  • if no, and if only "Sample_A_1" data gets shown in the general stats table where in fact it is referring to consolidated stats from both _1 and _2 then it is going to convey something incorrect and unwanted.

That's why I asked here - #2138 (comment)

  • why fix/change something that has been working for years now

I request you to revisit this.

@vladsavelyev
Copy link
Member Author

@tamuanand -

We want to generalise the MultiQC behaviour as much as possible, and for that reason we want fastp to work similarly to other modules that take pairs of FASTQ files as input: cutadapt, trimmomatic. For that reason, the best bet would be to take the first FASTQ file name as a sample name by default, but allow users to override that behaviour with either module-specific file name trimming, or by forcing MultiQC to use the log file names as sample names:

fastp:
  s_name_filenames: true

That option can be conveniently specified in the command line as well:

multiqc -m fastp test_data/data/modules/fastp -f -v --cl-config "fastp: {s_name_filenames: true}"

Hope that helps.

@ewels
Copy link
Member

ewels commented Nov 7, 2023

why fix/change something that has been working for years now

It's been working for you, but equally for other users it's been broken for years now. It depends on your analysis setup. That's what we try to avoid - the previous behaviour required you to organise your data in a particular way. The newer behaviour (consistent with the rest of MultiQC) does not.

There is a standard way to achieve the behaviour that you want via a config file, mentioned by Vlad. I think this is fine.

if only "Sample_A_1" data gets shown in the general stats table where in fact it is referring to consolidated stats from both _1 and _2 then it is going to convey something incorrect and unwanted.

I agree that this isn't ideal. Again it's not specific to this module, and MultiQC is riddled with imperfections such as this. The data that we have to obtain sample names is by nature imperfect.

I've created a new issue to discuss a potential new general functionality to better clean sample names when we have a pair of filenames available: #2162

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug: module Bug in a MultiQC module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MultiQC 1.17 fails to recognize fastp files
4 participants