Group samples: merge summary stats #542

ewels · 2017-08-08T08:16:14Z

To avoid half-empty rows from broken-up samples, add an option to merge the statistics given in General Statistics. This shouldn't affect plots or any data in sections, only General Stats.

For example, have the following config (modelled on sample name cleaning):

general_stats_merge:
    - '_R1'
    - '_R2'
    - type: regex_keep
      pattern: '[A-Z]{3}[1-9]{2}'

Then group any samples matching these and merge the stats in a module-specific and statistic-specific manner. This would also allow situations such as multiplexed lanes etc.
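A minimal sketch of what "statistic-specific" merging could look like, assuming each statistic declares its own merge rule. The names `MERGE_RULES` and `merge_stats` and the two statistic keys are hypothetical and not part of MultiQC; the unweighted mean is a simplification.

```python
from statistics import mean

# Hypothetical per-statistic merge rules (illustration only, not MultiQC API).
MERGE_RULES = {
    "total_sequences": sum,  # read counts add up across R1/R2 or lanes
    "percent_gc": mean,      # percentages are averaged (unweighted here)
}


def merge_stats(rows, rules=MERGE_RULES):
    """Merge a list of per-sample stat dicts into one summary dict."""
    merged = {}
    for key, merge_fn in rules.items():
        # Only merge values from rows that actually report this statistic.
        values = [row[key] for row in rows if key in row]
        if values:
            merged[key] = merge_fn(values)
    return merged


r1 = {"total_sequences": 1000, "percent_gc": 40.0}
r2 = {"total_sequences": 1000, "percent_gc": 42.0}
summary = merge_stats([r1, r2])
```

A module supporting the feature would then only need to supply its own rule table, which is roughly the per-module work listed under the downsides.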

Benefits:

- Sections in report maintain full original data
- Saved data in multiqc_data maintains full original data
- Summary table more concise and easy to skim, lines up with other modules
- Doesn't affect front-end or any major existing code infrastructure

Downsides:

Requires specific code to be written for each module that will support this.

ewels · 2017-08-08T08:21:28Z

To do this, add a new base module function. Pass it a list of sample names and it will return a dict of lists: the same sample names, grouped into bundles that should be merged, with the sample basename as the key. This result can then be iterated over for the merging process.

eg:

g_stats = {
    'sample_1_R1': {},  # ...
    'sample_1_R2': {},  # ...
    'sample_2_R1': {},  # ...
}
new_g_stats = {}
grouped_s_names = self.general_stats_merge( g_stats.keys() )
for basename, s_names in grouped_s_names.items():
    new_g_stats[basename] = {}
    # loop through s_names collecting stats for each key and merging them into new_g_stats
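As a rough, standalone sketch of the grouping step described above: the function below takes sample names and returns a dict of lists keyed by basename. `group_by_basename` is a hypothetical stand-in for the proposed base module function, and it assumes the patterns are plain suffixes to strip.

```python
from collections import defaultdict


def group_by_basename(s_names, suffixes=("_R1", "_R2")):
    """Group sample names that share a basename once a known suffix is removed."""
    groups = defaultdict(list)
    for s_name in s_names:
        basename = s_name
        for suffix in suffixes:
            if s_name.endswith(suffix):
                # Strip only the first matching suffix.
                basename = s_name[: -len(suffix)]
                break
        groups[basename].append(s_name)
    return dict(groups)


grouped = group_by_basename(["sample_1_R1", "sample_1_R2", "sample_2_R1"])
```

Samples with no matching suffix end up in a group of their own, so the result can always be iterated in the same way.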

ewels · 2017-08-17T18:02:53Z

Discussed with @tbooth on gitter. Would be good to generalise the above more so that it's not tied to the General Stats table. Instead just give the ability to collapse a list of sample names to a list of sets with a shared basename. Modules can then use this to collapse rows in the General Stats if they wish, as well as splitting plots into multiple datasets.

The config for the different patterns should be split into different keys so that this functionality can be used for different purposes. For example, it may be nice to group samples based on read pairs (likely the most common reason) and also on pre- and post-trimming status.

merge_group_patterns:
    - read_pairs:
        - type: regex
          pattern: '_R[12]$'
    - pre_post_trimming:
        - '_trimmed'

This function could then use the already-standard clean_s_name base function with these different config sets for consistent functionality. The same data structure described above could be returned and modules could do whatever they like with it. For example:

g_stats = {
    'sample_1_R1': {},  # ...
    'sample_1_R2': {},  # ...
    'sample_2_R1': {},  # ...
}
self.merge_sample_groups( 'read_pairs', g_stats.keys() )
# {
#     'sample_1': ['sample_1_R1', 'sample_1_R2'],
#     'sample_2': ['sample_2_R1', 'sample_2_R2'],
# }

g_stats_2 = {
    'sample_1': {},  # ...
    'sample_1_trimmed': {},  # ...
    'sample_2': {},  # ...
}
self.merge_sample_groups( 'pre_post_trimming', g_stats_2.keys() )
# {
#     'sample_1': ['sample_1', 'sample_1_trimmed'],
#     'sample_2': ['sample_2', 'sample_2_trimmed'],
# }
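A possible standalone sketch of the keyed lookup, for illustration only. It assumes the config has been flattened into a plain dict (`MERGE_GROUP_PATTERNS`, hypothetical) and uses `re.sub` and `str.replace` as simple stand-ins for `clean_s_name`; the real function would be a method on the base module.

```python
import re
from collections import defaultdict

# Hypothetical in-memory form of the merge_group_patterns config.
MERGE_GROUP_PATTERNS = {
    "read_pairs": [{"type": "regex", "pattern": r"_R[12]$"}],
    "pre_post_trimming": ["_trimmed"],
}


def merge_sample_groups(group, s_names, patterns=MERGE_GROUP_PATTERNS):
    """Return {basename: [sample names]} using the rules stored under `group`."""
    groups = defaultdict(list)
    for s_name in s_names:
        basename = s_name
        for rule in patterns[group]:
            if isinstance(rule, dict) and rule.get("type") == "regex":
                # Regex rules remove whatever the pattern matches.
                basename = re.sub(rule["pattern"], "", basename)
            else:
                # Plain string rules are removed literally.
                basename = basename.replace(str(rule), "")
        groups[basename].append(s_name)
    return dict(groups)


pairs = merge_sample_groups("read_pairs", ["sample_1_R1", "sample_1_R2", "sample_2_R1"])
trimmed = merge_sample_groups("pre_post_trimming", ["sample_1", "sample_1_trimmed", "sample_2"])
```

Keeping each set of patterns under its own key lets a module ask for one kind of grouping without being affected by the others.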

NB: This issue has a similar aim to issue #479 (running modules multiple times on subsets of samples).

tbooth · 2017-08-28T11:28:42Z

Hi,

I now have an implementation of this, with an updated FastQC module to make use of the new functionality. It's not exactly as described above but I think it achieves the same thing, plus it avoids adding redundant code while also maintaining backwards-compatibility.

For details and discussion please see the PR #566

ewels · 2017-08-28T15:26:06Z

See also the group samples branch...

KasperThystrup · 2019-08-02T08:07:00Z

Hi, thank you so much for this wonderful tool!

A possible (probably not optimal) approach to this issue would be to restructure the General Statistics table into a wider format for FastQC results, so that e.g. "% Dups" becomes "% Dups Mate 1" | "% Dups Mate 2" | "% Dups average". The user can then select how they want it to be represented.

Hope it is useful..
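To illustrate the wider layout suggested here, a small sketch that turns one per-mate statistic into three columns per sample. The function name `widen` and the input shapes are assumptions made for this example, and it assumes exactly two mates per sample.

```python
from statistics import mean


def widen(stat_label, grouped, values):
    """Build one wide row per basename: Mate 1, Mate 2 and their average."""
    rows = {}
    for basename, (mate1, mate2) in grouped.items():
        v1, v2 = values[mate1], values[mate2]
        rows[basename] = {
            f"{stat_label} Mate 1": v1,
            f"{stat_label} Mate 2": v2,
            f"{stat_label} average": mean([v1, v2]),
        }
    return rows


wide = widen(
    "% Dups",
    {"sample_1": ["sample_1_R1", "sample_1_R2"]},
    {"sample_1_R1": 10.0, "sample_1_R2": 14.0},
)
```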

ewels · 2019-08-03T19:53:41Z

Thanks @Dungdae - this is probably the approach that we'll take, yes. Though sadly this is the easy bit! The tricky bit is getting the data and knowing which pairs of input files belong together in one table row. I have an implementation that mostly works which I wrote ages ago (see above); one day I will get to testing and merging it in.

lczech · 2021-01-21T03:58:56Z

Hi @ewels,

thanks for already considering this feature! I'd be very much interested in this. Of the tools I currently have included in MultiQC, some yield individual R1/R2 rows in the "General Statistics" table, and others just don't seem to clean up the file names properly (e.g., AdapterRemoval, leading to those .settings filenames), so that my table looks like this:

I guess your feature would be able to solve both issues? Thumbs up if that feature became available :-)

Cheers and thanks again
Lucas

ewels · 2021-01-21T08:02:04Z

Part 1 - yes this issue is to resolve this. Part 2 - see the docs: https://multiqc.info/docs/#sample-name-cleaning

ewels · 2021-01-21T08:03:10Z

NB. If those sample name extensions are standard and come from the tool itself (so would be the same for all users), please make a new issue / PR to add those cleaning suffixes to the MultiQC defaults.

lczech · 2021-01-22T00:49:03Z

Okay, thanks for the quick reply!

Yes, I think those are names that come from the tools (I've seen such issues with AdapterRemoval and with trimmomatic). If I find the time, I'll work on fixes :-)

fgvieira · 2021-03-12T09:48:37Z

Has this been implemented in version 1.10?

ewels · 2021-03-12T15:16:37Z

No it has not.

serge2016 · 2023-01-25T14:02:28Z

Any progress here? We very much hope to see it!

ewels · 2024-01-09T11:03:15Z

Update - there has been some progress on this issue, things are creeping forwards. As one might expect for such a long-lived issue, there has been some scope creep and change in specs. Specifically:

Additional functionality:
- We want to show underlying data in tables as well as merged stats, with a nested click-to-expand view
Splitting out functionality:
- Stats from read pairs should be merged into a single summary value. Stats from pre- and post-trimming should not. These are two separate features. I have moved the trimming use case into a separate issue: Group samples: split summary stats #2260
Blocker / dependency:
- The latest feature requests start to impinge on the front end and plotting. As such, this feature will go back on hold until the plotting library has been switched to Plotly. See Plotly backend for graphs. Adds plotly template #2079

See the comment on #576 for a more in-depth technical status.

Redmar-van-den-Berg · 2024-04-03T07:53:36Z

I'm interested in this functionality, specifically for FastQ quality metrics, which tend to pollute the general statistics table with sample_R1 and sample_R2 lines. What is the current status of this feature? Is there anything I can do to help it along?

vladsavelyev · 2024-04-17T16:33:27Z

That's high on our list, just not high enough to jump on instantly, so we keep putting it off. We are now focused on an unrelated project, so will probably only get back to this in June.

fevac · 2024-05-14T12:32:15Z

We are also very interested in this, particularly for the fastqc and fastp modules, to aggregate information from multiple lanes for the same sample.

vladsavelyev · 2024-09-13T16:45:50Z

Mostly implemented in #2794, with some to-do leftovers:

- Add sample grouping functionality for all tables, not just General Statistics
- Implement for demuxing modules: bcl2fastq/bclconvert/etc
- Fix edge case: hide R1 and the columns don't align any more
- Test (None) - FastQC all samples a bit broken

ewels added the core: back end label Aug 8, 2017

ewels changed the title ~~Merge R1 & R2 stats in General Statistics (FastQC, Cutadapt)~~ New base function to group samples into sets Aug 17, 2017

tbooth mentioned this issue Aug 28, 2017

Attempt to merge my development branch back into mainline #566

Closed

This was referenced Sep 5, 2017

new group_samples() function #576

Closed

paired-end fastqc #575

Closed

ewels mentioned this issue Nov 14, 2017

Trim Galore Read Names Contain _val_1 and _val_2 in Report #626

Closed

mvdbeek mentioned this issue Jan 5, 2018

MultiQC for list of pairs (of FastQC output) galaxyproject/tools-iuc#1658

Closed

ewels mentioned this issue Apr 26, 2018

Move FastQC into trimGalore nf-core/exoseq#15

Closed

This was referenced Aug 23, 2023

MultiQC 2024 tracker #1784

Open

ngsderive: new 'endedness' submodule #1992

Open

ewels added the priority: high label Oct 3, 2023

vladsavelyev added this to the MultiQC v1.17 milestone Oct 3, 2023

vladsavelyev self-assigned this Oct 3, 2023

ewels mentioned this issue Oct 3, 2023

Per-table sample name cleaning #2097

Open

vladsavelyev modified the milestones: MultiQC v1.17, MultiQC v2.0 Oct 15, 2023

vladsavelyev added this to the MultiQC v1.18 milestone Oct 15, 2023

vladsavelyev modified the milestones: MultiQC v1.18, MultiQC v1.19 Nov 14, 2023

ewels modified the milestones: MultiQC v1.19, MultiQC v1.20 Dec 11, 2023

vladsavelyev linked a pull request Jan 2, 2024 that will close this issue

new group_samples() function #576

Closed

ewels changed the title ~~New base function to group samples into sets~~ Group samples: merge summary stats Jan 9, 2024

ewels mentioned this issue Jan 9, 2024

Group samples: split summary stats #2260

Open

ewels added the core: front end label Jan 13, 2024

vladsavelyev modified the milestones: MultiQC v1.20, MultiQC v1.21 Jan 24, 2024

vladsavelyev modified the milestones: MultiQC v1.21: Versions API, MultiQC v1.22 Feb 19, 2024

Redmar-van-den-Berg mentioned this issue Apr 3, 2024

MultiQC integration improvement suggestion rhpvorderman/sequali#121

Closed

ewels modified the milestones: MultiQC v1.22: Pydantic, MultiQC v1.23 May 3, 2024

ewels modified the milestones: MultiQC v1.23 - unit tests, MultiQC v1.24 Jul 4, 2024

vladsavelyev modified the milestones: v1.24, v1.25 Aug 15, 2024

vladsavelyev linked a pull request Sep 5, 2024 that will close this issue

Group read pairs in general stats #2794

Merged

10 tasks

vladsavelyev modified the milestones: v1.25, v1.26 Sep 16, 2024
