Group samples: merge summary stats #542

Open
ewels opened this issue Aug 8, 2017 · 18 comments · Fixed by #2794

Comments

@ewels
Member

ewels commented Aug 8, 2017

To avoid having half-empty rows from broken up samples, have the option to merge the statistics given in General Statistics. This shouldn't affect plots or any data in sections, only General Stats.

For example, have the following config (modelled on sample name cleaning):

general_stats_merge:
    - '_R1'
    - '_R2'
    - type: regex_keep
      pattern: '[A-Z]{3}[1-9]{2}'

Then group any samples matching these and merge the stats in a module-specific and statistic-specific manner. This would also allow situations such as multiplexed lanes etc.

Benefits:

  • Sections in report maintain full original data
  • Saved data in multiqc_data maintains full original data
  • Summary table more concise and easy to skim, lines up with other modules
  • Doesn't affect front-end or any major existing code infrastructure

Downsides:

  • Requires specific code to be written for each module that will support this.
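A minimal sketch of how the proposed `general_stats_merge` patterns might be applied to one sample name. The helper name `group_key` is hypothetical, and it assumes that a plain string is removed from the name while `regex_keep` keeps only the matching part; the proposal above does not pin down either behaviour.

```python
import re


def group_key(s_name, patterns):
    """Return the base name used to group s_name (hypothetical helper)."""
    for p in patterns:
        if isinstance(p, str):
            # Plain string: assumed here to be removed from the sample name
            s_name = s_name.replace(p, "")
        elif p.get("type") == "regex_keep":
            # regex_keep: keep only the part of the name that matches
            match = re.search(p["pattern"], s_name)
            if match:
                s_name = match.group(0)
    return s_name


# Python equivalent of the example config above
patterns = ["_R1", "_R2", {"type": "regex_keep", "pattern": "[A-Z]{3}[1-9]{2}"}]

print(group_key("sample_1_R1", patterns))     # sample_1
print(group_key("ABC12_lane3_R2", patterns))  # ABC12
```

Samples that end up with the same key would then share one row in General Statistics.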
@ewels
Member Author

ewels commented Aug 8, 2017

To do this, add a new base module function. It takes a list of sample names and returns a dict of lists: the same sample names, grouped into bundles that should be merged, keyed by the sample basename. The result can then be iterated over for the merging process.

eg:

g_stats = {
    'sample_1_R1': {},  # ...
    'sample_1_R2': {},  # ...
    'sample_2_R1': {},  # ...
}
new_g_stats = {}
grouped_s_names = self.general_stats_merge( g_stats.keys() )
for basename, s_names in grouped_s_names.items():
    new_g_stats[basename] = {}
    # loop through s_names collecting stats for each key and merging them into new_g_stats
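The merging step left as a comment in the loop above could look roughly like this. It is only a sketch: the statistic names and the `MERGE_RULES` table are made up for illustration, and whether a given statistic should be summed or averaged across a group is a per-module decision.

```python
from statistics import mean

# Hypothetical per-statistic merge rules; the keys are illustrative only
MERGE_RULES = {
    "total_sequences": sum,
    "percent_duplicates": mean,
}


def merge_general_stats(g_stats, grouped_s_names, rules=MERGE_RULES):
    """Collapse per-file stats into one dict per sample basename."""
    new_g_stats = {}
    for basename, s_names in grouped_s_names.items():
        merged = {}
        for key, merge_fn in rules.items():
            # Collect this statistic from every sample in the group that has it
            values = [g_stats[s][key] for s in s_names if key in g_stats.get(s, {})]
            if values:
                merged[key] = merge_fn(values)
        new_g_stats[basename] = merged
    return new_g_stats


g_stats = {
    "sample_1_R1": {"total_sequences": 1000, "percent_duplicates": 10.0},
    "sample_1_R2": {"total_sequences": 1000, "percent_duplicates": 20.0},
    "sample_2_R1": {"total_sequences": 500, "percent_duplicates": 5.0},
}
grouped = {"sample_1": ["sample_1_R1", "sample_1_R2"], "sample_2": ["sample_2_R1"]}
merged = merge_general_stats(g_stats, grouped)
print(merged["sample_1"])  # {'total_sequences': 2000, 'percent_duplicates': 15.0}
```

The original `g_stats` is left untouched, so report sections and saved data keep the full per-file values.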

@ewels ewels changed the title Merge R1 & R2 stats in General Statistics (FastQC, Cutadapt) New base function to group samples into sets Aug 17, 2017
@ewels
Member Author

ewels commented Aug 17, 2017

Discussed with @tbooth on gitter. Would be good to generalise the above more so that it's not tied to the General Stats table. Instead just give the ability to collapse a list of sample names to a list of sets with a shared basename. Modules can then use this to collapse rows in the General Stats if they wish, as well as splitting plots into multiple datasets.

The config for the different patterns should be split into different keys to allow this functionality to be used for different reasons. For example, it may be nice to group samples based on read pairs (likely the most common reason) and also pre- and post- trimming status.

merge_group_patterns:
    - read_pairs:
        - type: regex
          pattern: '_R[12]$'
    - pre_post_trimming:
        - '_trimmed'

This function could then use the already-standard clean_s_name base function with these different config sets for consistent functionality. The same data structure described above could be returned and modules could do whatever they like with it. For example:

g_stats = {
    'sample_1_R1': {},  # ...
    'sample_1_R2': {},  # ...
    'sample_2_R1': {},  # ...
    'sample_2_R2': {},  # ...
}
self.merge_sample_groups( 'read_pairs', g_stats.keys() )
# {
#     'sample_1': ['sample_1_R1', 'sample_1_R2'],
#     'sample_2': ['sample_2_R1', 'sample_2_R2'],
# }

g_stats_2 = {
    'sample_1': {},  # ...
    'sample_1_trimmed': {},  # ...
    'sample_2': {},  # ...
    'sample_2_trimmed': {},  # ...
}
self.merge_sample_groups( 'pre_post_trimming', g_stats_2.keys() )
# {
#     'sample_1': ['sample_1', 'sample_1_trimmed'],
#     'sample_2': ['sample_2', 'sample_2_trimmed'],
# }
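A standalone sketch of what `merge_sample_groups` might do, for illustration only. It does not call the real `clean_s_name`; instead it assumes plain strings are removed from the name and `type: regex` patterns are stripped with `re.sub`. The `MERGE_GROUP_PATTERNS` dict is a flattened stand-in for the `merge_group_patterns` config above.

```python
import re
from collections import defaultdict

# Flattened Python stand-in for the merge_group_patterns config
MERGE_GROUP_PATTERNS = {
    "read_pairs": [{"type": "regex", "pattern": r"_R[12]$"}],
    "pre_post_trimming": ["_trimmed"],
}


def merge_sample_groups(group, s_names, config=MERGE_GROUP_PATTERNS):
    """Group sample names by the basename left after applying one pattern set."""
    groups = defaultdict(list)
    for s_name in s_names:
        basename = s_name
        for p in config[group]:
            if isinstance(p, str):
                # Plain string: assumed to be removed from the name
                basename = basename.replace(p, "")
            elif p.get("type") == "regex":
                # Regex: every match is stripped from the name
                basename = re.sub(p["pattern"], "", basename)
        groups[basename].append(s_name)
    return dict(groups)


pairs = merge_sample_groups("read_pairs", ["sample_1_R1", "sample_1_R2", "sample_2_R1"])
print(pairs)  # {'sample_1': ['sample_1_R1', 'sample_1_R2'], 'sample_2': ['sample_2_R1']}

trimmed = merge_sample_groups("pre_post_trimming", ["sample_1", "sample_1_trimmed", "sample_2"])
print(trimmed)  # {'sample_1': ['sample_1', 'sample_1_trimmed'], 'sample_2': ['sample_2']}
```

A sample with no partner simply ends up in a group of one, so modules can treat paired and unpaired input the same way.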

NB: This issue has a similar aim to issue #479 (running modules multiple times on subsets of samples).

@tbooth
Contributor

tbooth commented Aug 28, 2017

Hi,

I now have an implementation of this, with an updated FastQC module to make use of the new functionality. It's not exactly as described above but I think it achieves the same thing, plus it avoids adding redundant code while also maintaining backwards-compatibility.

For details and discussion please see the PR #566

@ewels
Member Author

ewels commented Aug 28, 2017

See also the group samples branch...

@KasperThystrup

KasperThystrup commented Aug 2, 2019

Hi, thank you so much for this wonderful tool!

A possible (probably not optimal) approach to this issue would be to restructure the General Statistics table into a wider format for FastQC results, so that e.g. "% Dups" becomes "% Dups Mate 1" | "% Dups Mate 2" | "% Dups average". The user could then select how they want it represented.
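As a rough illustration of the wider layout, assuming the two mates of a pair are already known (the helper `widen_row` is hypothetical, not part of MultiQC):

```python
def widen_row(r1, r2):
    """Build one wide row from the stats of mate 1 and mate 2."""
    row = {}
    for key in r1:
        if key not in r2:
            continue  # only widen statistics present for both mates
        row[f"{key} Mate 1"] = r1[key]
        row[f"{key} Mate 2"] = r2[key]
        row[f"{key} average"] = (r1[key] + r2[key]) / 2
    return row


wide = widen_row({"% Dups": 10.0}, {"% Dups": 20.0})
print(wide)  # {'% Dups Mate 1': 10.0, '% Dups Mate 2': 20.0, '% Dups average': 15.0}
```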

Hope it is useful..

@ewels
Member Author

ewels commented Aug 3, 2019

Thanks @Dungdae - this is probably the approach that we'll take, yes. Though sadly this is the easy bit! The tricky bit is getting the data and knowing which pairs of input files belong together in one table row. I have an implementation that mostly works, which I wrote ages ago (see above). One day I will get to testing and merging it in.

@lczech

lczech commented Jan 21, 2021

Hi @ewels,

thanks for already considering this feature! I'd be very much interested in this. With the tools I have currently included in MultiQC, there are some that yield individual R1/R2 rows in the "General Statistics" table, and others that just don't seem to clean up the file names properly (e.g., AdapterRemoval, leading to those .settings filenames), so that my table looks like this:

image

I guess your feature would be able to solve both issues? Thumbs up if that feature became available :-)

Cheers and thanks again
Lucas

@ewels
Member Author

ewels commented Jan 21, 2021

Part 1 - yes this issue is to resolve this. Part 2 - see the docs: https://multiqc.info/docs/#sample-name-cleaning

@ewels
Member Author

ewels commented Jan 21, 2021

NB. If those sample name extensions are standard and come from the tool itself (so would be the same for all users), please make a new issue / PR to add those cleaning suffixes to the MultiQC defaults.

@lczech

lczech commented Jan 22, 2021

Okay, thanks for the quick reply!

Yes, I think those are names that come from the tools (I've seen such issues with AdapterRemoval and with trimmomatic). If I find the time, I'll work on fixes :-)

@fgvieira
Contributor

Has this been implemented in version 1.10?

@ewels
Member Author

ewels commented Mar 12, 2021

No it has not.

@serge2016

Any progress here? We hope to see it very much!

@vladsavelyev vladsavelyev added this to the MultiQC v1.18 milestone Oct 15, 2023
@ewels ewels modified the milestones: MultiQC v1.19, MultiQC v1.20 Dec 11, 2023
@vladsavelyev vladsavelyev linked a pull request Jan 2, 2024 that will close this issue
@ewels
Member Author

ewels commented Jan 9, 2024

Update - there has been some progress on this issue, things are creeping forwards. As one might expect for such a long-lived issue, there has been some scope creep and change in specs. Specifically:

  • Additional functionality:
    • We want to show underlying data in tables as well as merged stats, with a nested click-to-expand view
  • Splitting out functionality:
    • Stats from read pairs should be merged into a single summary value. Stats from pre- and post-trimming should not. These are two separate features. I have moved the trimming use case into a separate issue: Group samples: split summary stats #2260
  • Blocker / dependency:

See comment #576 (comment) for a more in-depth technical status.

@ewels ewels changed the title New base function to group samples into sets Group samples: merge summary stats Jan 9, 2024
@Redmar-van-den-Berg
Contributor

I'm interested in this functionality, specifically for FastQ quality metrics, which tend to pollute the general statistics table with sample_R1 and sample_R2 lines. What is the current status of this feature? Is there anything I can do to help it along?

@vladsavelyev
Member

That's high on our list, just not high enough to jump on immediately, so we keep putting it off. We are now focused on an unrelated project, so will probably only get back to this in June.

@fevac

fevac commented May 14, 2024

We are also very interested in this, particularly for the fastqc and fastp modules, to aggregate information from multiple lanes for the same sample.

@vladsavelyev vladsavelyev modified the milestones: v1.24, v1.25 Aug 15, 2024
@vladsavelyev vladsavelyev linked a pull request Sep 5, 2024 that will close this issue
@vladsavelyev
Member

vladsavelyev commented Sep 13, 2024

Mostly implemented in #2794, some todo leftovers:

  • Add sample grouping functionality for all tables, not just General Statistics
  • Implement for demuxing modules: bcl2fastq/bclconvert/etc
  • Fix edge case: hide R1 and the columns don't align any more
  • Test (None) - FastQC all samples a bit broken

@vladsavelyev vladsavelyev modified the milestones: v1.25, v1.26 Sep 16, 2024