Group samples: merge summary stats #542
Comments
To do this, add a new base module function. Pass it a list of sample names and it returns a dict of lists: the same sample names, grouped into bundles that should be merged, with the sample basename as the key. The result can then be iterated over for the merging process, e.g.:

    g_stats = {
        'sample_1_R1': {},  # ...
        'sample_1_R2': {},  # ...
        'sample_2_R1': {},  # ...
    }
    new_g_stats = {}
    grouped_s_names = self.general_stats_merge( g_stats.keys() )
    for basename, s_names in grouped_s_names.items():
        new_g_stats[basename] = {}
        # loop through s_names collecting stats for each key and merging them into new_g_stats
Discussed with @tbooth on Gitter. It would be good to generalise the above so that it's not tied to the General Stats table. Instead, just provide the ability to collapse a list of sample names into sets that share a basename. Modules can then use this to collapse rows in the General Stats if they wish, as well as to split plots into multiple datasets. The config for the different patterns should be split into different keys so that this functionality can be used for different purposes. For example, it may be nice to group samples by read pairs (likely the most common reason) and also by pre- and post-trimming status.

    merge_group_patterns:
      - read_pairs:
          - type: regex
            pattern: '_R[12]$'
      - pre_post_trimming:
          - '_trimmed'

This function could then use the already-standard

    g_stats = {
        'sample_1_R1': {},  # ...
        'sample_1_R2': {},  # ...
        'sample_2_R1': {},  # ...
        'sample_2_R2': {},  # ...
    }
    self.merge_sample_groups( 'read_pairs', g_stats.keys() )
    # {
    #     'sample_1': ['sample_1_R1', 'sample_1_R2'],
    #     'sample_2': ['sample_2_R1', 'sample_2_R2'],
    # }
    g_stats_2 = {
        'sample_1': {},  # ...
        'sample_1_trimmed': {},  # ...
        'sample_2': {},  # ...
        'sample_2_trimmed': {},  # ...
    }
    self.merge_sample_groups( 'pre_post_trimming', g_stats_2.keys() )
    # {
    #     'sample_1': ['sample_1', 'sample_1_trimmed'],
    #     'sample_2': ['sample_2', 'sample_2_trimmed'],
    # }

NB: This issue has a similar aim to issue #479 (running modules multiple times on subsets of samples).
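A minimal sketch of how such a grouping helper could behave, written as a standalone function rather than a module method. The function name follows the proposal above, but the `MERGE_GROUP_PATTERNS` dict (a flattened form of the config), the `patterns` parameter and the treatment of plain strings as literal suffixes are assumptions made for illustration, not MultiQC's actual API.

```python
import re

# Hypothetical, flattened form of the merge_group_patterns config above.
MERGE_GROUP_PATTERNS = {
    "read_pairs": [{"type": "regex", "pattern": r"_R[12]$"}],
    "pre_post_trimming": ["_trimmed"],
}


def merge_sample_groups(group_name, s_names, patterns=MERGE_GROUP_PATTERNS):
    """Group sample names by the basename left after removing the patterns."""
    groups = {}
    for s_name in s_names:
        basename = s_name
        for pat in patterns[group_name]:
            if isinstance(pat, dict) and pat.get("type") == "regex":
                basename = re.sub(pat["pattern"], "", basename)
            else:
                # Plain strings are treated as literal suffixes (an assumption).
                suffix = pat.get("pattern", "") if isinstance(pat, dict) else pat
                if suffix and basename.endswith(suffix):
                    basename = basename[: -len(suffix)]
        # Samples sharing a basename end up in the same list, in input order.
        groups.setdefault(basename, []).append(s_name)
    return groups
```

A sample that has no partner simply ends up in a group of one, so callers can treat grouped and ungrouped samples the same way.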
Hi, I now have an implementation of this, with an updated FastQC module to make use of the new functionality. It's not exactly as described above but I think it achieves the same thing, plus it avoids adding redundant code while also maintaining backwards-compatibility. For details and discussion please see PR #566.
See also the group samples branch...
Hi, thank you so much for this wonderful tool! A possible (probably not optimal) approach to this issue would be to restructure the General Statistics into a wider table format for the FastQC results, so that e.g. "% Dups" becomes "% Dups Mate 1" | "% Dups Mate 2" | "% Dups average". The user can then select which way they want it to be represented. Hope this is useful.
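One way to picture that wider layout, as a rough sketch: given per-read values for a single metric, emit one row per sample with a column per mate plus an average. The function name `widen_metric`, the column labels and the `_R1`/`_R2` suffix convention are illustrative assumptions, not MultiQC's implementation, and the plain mean of the two mates is only one possible definition of "average".

```python
import re


def widen_metric(stats, metric, label):
    """Pivot per-read rows into one row per sample with Mate 1 / Mate 2 / average columns."""
    wide = {}
    for s_name, values in stats.items():
        match = re.search(r"_R([12])$", s_name)
        if match is None or metric not in values:
            continue  # skip samples that do not follow the assumed naming
        basename = s_name[: match.start()]
        column = f"{label} Mate {match.group(1)}"
        wide.setdefault(basename, {})[column] = values[metric]
    for row in wide.values():
        mates = list(row.values())  # captured before the average column is added
        row[f"{label} average"] = sum(mates) / len(mates)
    return wide
```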
Thanks @Dungdae - this is probably the approach that we'll take, yes. Though sadly this is the easy bit! The tricky bit is getting the data and knowing which pairs of input files belong together in one table row. I have an implementation that mostly works, which I wrote ages ago (see above); one day I will get to testing and merging it in.
Hi @ewels, thanks for already considering this feature! I'd be very much interested in this. With the tools I currently have included in MultiQC, there are some that yield individual R1/R2 rows in the "General Statistics" table, and others that just don't seem to clean up the file names properly (e.g., AdapterRemoval, leading to those
I guess your feature would be able to solve both issues? Thumbs up if that feature became available :-) Cheers and thanks again
Part 1 - yes, this issue is meant to resolve this. Part 2 - see the docs: https://multiqc.info/docs/#sample-name-cleaning
NB: If those sample name extensions are standard and come from the tool itself (so would be the same for all users), please make a new issue / PR to add those cleaning suffixes to the MultiQC defaults.
Okay, thanks for the quick reply! Yes, I think those are names that come from the tools (I've seen such issues with AdapterRemoval and with Trimmomatic). If I find the time, I'll work on fixes :-)
Has this been implemented in version
No, it has not.
Any progress here? We would very much like to see it!
Update - there has been some progress on this issue, things are creeping forwards. As one might expect for such a long-lived issue, there has been some scope creep and change in specs. Specifically:
See comment #576 (comment) for a more in-depth technical status.
I'm interested in this functionality, specifically for FastQ quality metrics which tend to pollute the general statistics table with
That's high on our list, just not high enough to jump on instantly, so we keep putting it off. We are now focused on an unrelated project, so we will probably only get back to this in June.
We are also very interested in this, particularly for the FastQC and fastp modules, to aggregate information from multiple lanes for the same sample.
Mostly implemented in #2794, some todo leftovers:
To avoid having half-empty rows from broken up samples, have the option to merge the statistics given in General Statistics. This shouldn't affect plots or any data in sections, only General Stats.
For example, have the following config (modelled on sample name cleaning):
Then group any samples matching these and merge the stats in a module-specific and statistic-specific manner. This would also allow situations such as multiplexed lanes etc.
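As a hedged illustration of what "statistic-specific" merging could mean: each General Stats key gets its own merge function, for example summing read counts but averaging percentages. The key names, the `MERGERS` table, the function name `merge_general_stats` and the fallback to the first value are hypothetical choices for this sketch; a real module would also likely weight percentages by read count rather than take a plain mean.

```python
from statistics import mean

# Hypothetical per-statistic merge rules; real modules would define their own.
MERGERS = {
    "total_sequences": sum,
    "percent_duplicates": mean,
    "percent_gc": mean,
}


def merge_general_stats(g_stats, grouped_s_names, mergers=MERGERS):
    """Collapse grouped samples into one General Stats row per basename."""
    merged = {}
    for basename, s_names in grouped_s_names.items():
        # Collect every statistic key seen in the group, keeping first-seen order.
        keys = []
        for s_name in s_names:
            for key in g_stats.get(s_name, {}):
                if key not in keys:
                    keys.append(key)
        row = {}
        for key in keys:
            values = [g_stats[s][key] for s in s_names if key in g_stats.get(s, {})]
            # Unknown statistics fall back to the first value seen (an assumption).
            merge = mergers.get(key, lambda vals: vals[0])
            row[key] = merge(values)
        merged[basename] = row
    return merged
```

The input dict is left untouched, so the original per-file values remain available alongside the merged rows.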
Benefits:
multiqc_data maintains full original data
Downsides: