new group_samples() function #576

Closed
wants to merge 31 commits into from

Member

@ewels ewels commented Sep 5, 2017

New base module function group_samples() for bundling samples into groups, such as read pairs. A default config is added and the FastQC module is updated to make use of the read pair grouping.

See #542 for discussion.

@ewels
Member Author

ewels commented May 20, 2018

Notes to self: This got really complicated and ended up being quite bespoke. Need to think about how to improve it.

I think group_samples needs more config so that it can annotate the groups it finds better.

Perhaps:

  1. Break sample_merge_groups down into subgroups with names
  2. Accept a dict instead of just a list of cleaning patterns
    • Allows configuration, e.g. putting unchanged sample names into their own group
sample_merge_groups:
    read_pairs:
        Read 1:
            clean_s_name:
                - type: regex
                  pattern: '_1$'
                - type: regex
                  pattern: '_R1([\._\- ]|$)'
        Read 2:
            clean_s_name:
                - type: regex
                  pattern: '_2$'
                - type: regex
                  pattern: '_R2([\._\- ]|$)'
    pre_post_trimming:
        Raw:
            group_unchanged: True
        Trimmed:
            clean_s_name:
                - '_trimmed'
                - '_val'
                - '_val_1'
                - '_val_2'
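
To make the proposed config more concrete, here is a minimal sketch of how named subgroups with regex patterns could be used to bundle sample names. It is not the code from this PR: `READ_PAIRS`, `match_subgroup` and `group_by_patterns` are hypothetical names, and truncating the sample name at the start of the match is an assumption about how the base name would be derived. The end-of-name case is written as an alternation with `$`, since current Python versions reject `\Z` inside a character class.

```python
import re
from collections import defaultdict

# Hypothetical in-memory form of the `read_pairs` subgroups proposed above.
READ_PAIRS = {
    "Read 1": [r"_1$", r"_R1(?:[._\- ]|$)"],
    "Read 2": [r"_2$", r"_R2(?:[._\- ]|$)"],
}


def match_subgroup(s_name, subgroups):
    """Return (label, base_name) for the first matching pattern, else (None, s_name)."""
    for label, patterns in subgroups.items():
        for pattern in patterns:
            match = re.search(pattern, s_name)
            if match:
                # Assumption: the group name is everything before the match.
                return label, s_name[: match.start()]
    return None, s_name


def group_by_patterns(sample_names, subgroups):
    """Bundle sample names into {base_name: {subgroup_label: original_name}}."""
    groups = defaultdict(dict)
    for s_name in sample_names:
        label, base = match_subgroup(s_name, subgroups)
        # Samples that match nothing stay in a group of their own.
        groups[base][label or "ungrouped"] = s_name
    return dict(groups)


if __name__ == "__main__":
    names = ["test_1", "test_2", "single", "lib_R1.fastq", "lib_R2.fastq"]
    for base, members in group_by_patterns(names, READ_PAIRS).items():
        print(base, members)
```

Keeping the subgroup label next to each original sample name is what would let a module annotate the groups it finds, which is the gap described in the notes above.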

ewels added a commit to MultiQC/test-data that referenced this pull request May 20, 2018
@schorlton
Contributor

@ewels, this is a super awesome feature that we would love to see make it into the main codebase. What is left, and how can we help you get this merged?

@ewels
Member Author

ewels commented Apr 1, 2022

I agree, I'd love to get this feature in too. It has extremely far reaching implications though - it needs implementing really, not just merging 😅 I think the PR only scratches the surface currently.

I will get to it one day! I just need to find the time to get past the firefighting of dealing with bugs and PRs to find time to do development work on the core code. I'm starting a new job soon so hopefully this will be more feasible before long...

@schorlton
Contributor

Congrats on the new position!! 🎉
Thanks for the consideration and keep me posted if we can help in any way.

@serge2016

Dear colleagues, we very much hope to see this feature!

@ewels ewels added this to the MultiQC v2.0 milestone Aug 28, 2023
@ewels ewels mentioned this pull request Aug 28, 2023
11 tasks
@vladsavelyev vladsavelyev self-requested a review October 7, 2023 22:20
@vladsavelyev vladsavelyev removed this from the MultiQC v2.0 milestone Dec 28, 2023
@ewels
Member Author

ewels commented Jan 8, 2024

Ok, started looking at the code and the output report and thinking a bit more about the subtleties here. A few thoughts:

Different use cases

We have different use cases with our example of read groups / trimming here. It makes sense to merge stats from read 1 + 2 and show the average; however, it does not make sense to do this for trimmed and untrimmed data. For read groups we want a merge, but for trimmed we want a "split group", for want of a better term.

I think that the simplest solution here is to just omit the trimming config and leave this functionality as purely for merging sample groups. People can continue to do the current solution for trimmed and untrimmed data - that is, running the module multiple times. Maybe we can implement better functionality for this in a future PR? Or happy to hear others' thoughts on this.
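
As an illustration of the "merge" case for read pairs, the sketch below averages each metric over the members of one group. It is a simplified stand-in rather than the PR's implementation: `merge_group_stats` is a hypothetical helper, and it assumes every metric is numeric and that a plain mean is the right summary for all of them.

```python
from statistics import mean


def merge_group_stats(stats_by_sample, members):
    """Average each metric over the samples in one group.

    stats_by_sample: {sample_name: {metric: value}}
    members: sample names belonging to the group, e.g. read 1 and read 2
    """
    # Collect every metric reported by at least one member.
    metrics = {m for s in members for m in stats_by_sample.get(s, {})}
    merged = {}
    for metric in sorted(metrics):
        # Only average over the members that actually report this metric.
        values = [
            stats_by_sample[s][metric]
            for s in members
            if metric in stats_by_sample.get(s, {})
        ]
        merged[metric] = mean(values)
    return merged


if __name__ == "__main__":
    stats = {
        "test_1": {"percent_gc": 50.0, "percent_fails": 40.0},
        "test_2": {"percent_gc": 48.0, "percent_fails": 20.0},
    }
    print(merge_group_stats(stats, ["test_1", "test_2"]))
```

Applying the same function to a raw and a trimmed sample would run without error but produce a number with no useful meaning, which is why the trimming case needs a different treatment.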

Group grouping

I think that we should keep the groups separate - so instead of having Raw Untrimmed, we should have two groups of buttons.
eg. instead of this:

We should have this:

I think this makes it clearer to the end user what's happening and which samples are showing. It'll also scale better when people are running with larger sets of groups.

Toggle visibility instead of switch

One of the core concepts of MultiQC is to be able to get a quick overview of all samples at a glance. So I'd prefer plot buttons to toggle sample visibility, rather than acting as a switch. With the current "Switch" setup, most samples are hidden until we have user interaction.

eg. instead of this:

We could have this (where clicking shows / hides the samples in that group):

Alternatively, keeping it as a switch but having the default as showing all samples would be fine:

But I think that this is less pretty and maybe more unclear?

Showing ungrouped data in tables

To elaborate a little on what we discussed in the meeting, this is a very quick and dirty in-browser-hacking representation of the kind of thing that I had in mind:

[Screenshot: CleanShot 2024-01-08 at 12 49 01@2x]

Code from the browser (not to be taken seriously, but so that you can copy and paste into the inspector to play with yourself):

HTML code
<table id="general_stats_table" class="table table-condensed mqc_table tablesorter tablesorter-default tablesorterd8f640d3cc683" data-title="General Statistics" data-sortlist="" role="grid">
        <thead><tr role="row" class="tablesorter-headerRow"><th style="width:2px;"></th><th class="rowheader tablesorter-header tablesorter-headerUnSorted" data-column="0" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="Sample Name: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner">Sample Name</div></th><th id="header_mqc-generalstats-fastqc-percent_duplicates" class="mqc-generalstats-fastqc-percent_duplicates tablesorter-header tablesorter-headerUnSorted" data-dmax="100.0" data-dmin="0.0" data-namespace="FastQC" data-column="1" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="% Dups: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner"><span class="mqc_table_tooltip" title="" data-original-title="FastQC: % Duplicate Reads, averaged across read pairs. Values from trimmed data shown.">% Dups</span></div></th><th id="header_mqc-generalstats-fastqc-percent_gc" class="mqc-generalstats-fastqc-percent_gc tablesorter-header tablesorter-headerUnSorted" data-dmax="100.0" data-dmin="0.0" data-namespace="FastQC" data-column="2" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="% GC: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner"><span class="mqc_table_tooltip" title="" data-original-title="FastQC: Average % GC Content, averaged across read pairs. 
Values from trimmed data shown.">% GC</span></div></th><th id="header_mqc-generalstats-fastqc-avg_sequence_length" class="mqc-generalstats-fastqc-avg_sequence_length hidden tablesorter-header tablesorter-headerUnSorted" data-dmax="100.05205146661868" data-dmin="0.0" data-namespace="FastQC" data-column="3" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="Average Read Length: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner"><span class="mqc_table_tooltip" title="" data-original-title="FastQC: Average Read Length (bp), averaged across read pairs. Values from trimmed data shown.">Average Read Length</span></div></th><th id="header_mqc-generalstats-fastqc-median_sequence_length" class="mqc-generalstats-fastqc-median_sequence_length hidden tablesorter-header tablesorter-headerUnSorted" data-dmax="100.0" data-dmin="0.0" data-namespace="FastQC" data-column="4" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="Median Read Length: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner"><span class="mqc_table_tooltip" title="" data-original-title="FastQC: Median Read Length (bp), averaged across read pairs. 
Values from trimmed data shown.">Median Read Length</span></div></th><th id="header_mqc-generalstats-fastqc-percent_fails" class="mqc-generalstats-fastqc-percent_fails hidden tablesorter-header tablesorter-headerUnSorted" data-dmax="100.0" data-dmin="0.0" data-namespace="FastQC" data-column="5" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="% Failed: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner"><span class="mqc_table_tooltip" title="" data-original-title="FastQC: Percentage of modules failed in FastQC report (includes those not plotted here), averaged across read pairs. Values from trimmed data shown.">% Failed</span></div></th><th id="header_mqc-generalstats-fastqc-total_sequences" class="mqc-generalstats-fastqc-total_sequences tablesorter-header tablesorter-headerUnSorted" data-dmax="0.266736" data-dmin="0.0" data-namespace="FastQC" data-shared-key="read_count" data-column="6" tabindex="0" scope="col" role="columnheader" aria-disabled="false" aria-controls="general_stats_table" unselectable="on" aria-sort="none" aria-label="M Seqs: No sort applied, activate to apply a descending sort" style="user-select: none;"><div class="tablesorter-header-inner"><span class="mqc_table_tooltip" title="" data-original-title="FastQC: Total Sequences (millions), averaged across read pairs. 
Values from trimmed data shown.">M Seqs</span></div></th></tr></thead><tbody aria-live="polite" aria-relevant="all"><tr role="row"><th style="width:2px;" width="2"></th><th class="rowheader" data-original-sn="single">single</th><td val="1.7720375527608212" class="data-coloured mqc-generalstats-fastqc-percent_duplicates "><div class="wrapper"><span class="bar" style="width:1.772037552760821%; background-color:#b4d4c4 !important;"></span><span class="val">1.8%</span></div></td><td val="50.0" class="data-coloured mqc-generalstats-fastqc-percent_gc "><div class="wrapper"><span class="bar" style="width:50.0%; background-color:#f5d1e7 !important;"></span><span class="val">50%</span></div></td><td val="100.05205146661868" class="data-coloured mqc-generalstats-fastqc-avg_sequence_length hidden"><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b3d2c3 !important;"></span><span class="val">100 bp</span></div></td><td val="100" class="data-coloured mqc-generalstats-fastqc-median_sequence_length hidden"><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b3d2c3 !important;"></span><span class="val">100 bp</span></div></td><td val="40.0" class="data-coloured mqc-generalstats-fastqc-percent_fails hidden"><div class="wrapper"><span class="bar" style="width:40.0%; background-color:#fedcd2 !important;"></span><span class="val">40%</span></div></td><td val="0.266736" class="data-coloured mqc-generalstats-fastqc-total_sequences "><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b5c1d3 !important;"></span><span class="val">0.3</span></div></td></tr>
<tr role="row">
  <th style="width:2px;"></th>
  <th class="rowheader" data-original-sn="test">test</th>
  <td val="1.3827871827776548" class="data-coloured mqc-generalstats-fastqc-percent_duplicates">
    <div class="wrapper">
      <span class="bar" style="width: 1.3827871827776548%; background-color: #b4d4c4 !important"></span><span class="val">1.4%</span>
    </div>
  </td>
  <td val="49.0" class="data-coloured mqc-generalstats-fastqc-percent_gc">
    <div class="wrapper">
      <span class="bar" style="width: 49%; background-color: #f5d2e8 !important"></span><span class="val">49%</span>
    </div>
  </td>
  <td val="100.05148348929278" class="data-coloured mqc-generalstats-fastqc-avg_sequence_length hidden">
    <div class="wrapper">
      <span class="bar" style="width: 99.99943231816081%; background-color: #b3d2c3 !important"></span><span class="val">100 bp</span>
    </div>
  </td>
  <td class="data-coloured mqc-generalstats-fastqc-median_sequence_length hidden"></td>
  <td val="30.0" class="data-coloured mqc-generalstats-fastqc-percent_fails hidden">
    <div class="wrapper">
      <span class="bar" style="width: 30%; background-color: #fee6dd !important"></span><span class="val">30%</span>
    </div>
  </td>
  <td val="0.266736" class="data-coloured mqc-generalstats-fastqc-total_sequences">
    <div class="wrapper">
      <span class="bar" style="width: 100%; background-color: #b5c1d3 !important"></span><span class="val">0.3</span>
    </div>
  </td>
</tr><tr role="row" class="subgroup"><th style="
    background-color: #efefef;
"></th><th class="rowheader" data-original-sn="test_1" style="
    background-color: #efefef;
">↳ test_1</th><td val="1.7720375527608212" class="data-coloured mqc-generalstats-fastqc-percent_duplicates " style="
    /* background-color: #efefef; */
"><div class="wrapper" style="background-color: #efefef;"><span class="bar" style="width:1.772037552760821%; background-color:#b4d4c4 !important;"></span><span class="val">1.8%</span></div></td><td val="50.0" class="data-coloured mqc-generalstats-fastqc-percent_gc "><div class="wrapper"><span class="bar" style="width:50.0%; background-color:#f5d1e7 !important;"></span><span class="val">50%</span></div></td><td val="100.05205146661868" class="data-coloured mqc-generalstats-fastqc-avg_sequence_length hidden"><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b3d2c3 !important;"></span><span class="val">100 bp</span></div></td><td val="100" class="data-coloured mqc-generalstats-fastqc-median_sequence_length hidden"><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b3d2c3 !important;"></span><span class="val">100 bp</span></div></td><td val="40.0" class="data-coloured mqc-generalstats-fastqc-percent_fails hidden"><div class="wrapper"><span class="bar" style="width:40.0%; background-color:#fedcd2 !important;"></span><span class="val">40%</span></div></td><td val="0.266736" class="data-coloured mqc-generalstats-fastqc-total_sequences "><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b5c1d3 !important;"></span><span class="val">0.3</span></div></td></tr><tr role="row" class="subgroup"><th style="
    background-color: #efefef;
"></th><th class="rowheader" data-original-sn="test_2" style="
    background-color: #efefef;
">↳ test_2</th><td val="0.9935368127944884" class="data-coloured mqc-generalstats-fastqc-percent_duplicates "><div class="wrapper"><span class="bar" style="width:0.9935368127944884%; background-color:#b3d3c4 !important;"></span><span class="val">1.0%</span></div></td><td val="48.0" class="data-coloured mqc-generalstats-fastqc-percent_gc "><div class="wrapper"><span class="bar" style="width:48.0%; background-color:#f4d3e8 !important;"></span><span class="val">48%</span></div></td><td val="100.05091551196689" class="data-coloured mqc-generalstats-fastqc-avg_sequence_length hidden"><div class="wrapper"><span class="bar" style="width:99.99886463632166%; background-color:#b3d2c3 !important;"></span><span class="val">100 bp</span></div></td><td val="100" class="data-coloured mqc-generalstats-fastqc-median_sequence_length hidden"><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b3d2c3 !important;"></span><span class="val">100 bp</span></div></td><td val="20.0" class="data-coloured mqc-generalstats-fastqc-percent_fails hidden"><div class="wrapper"><span class="bar" style="width:20.0%; background-color:#feefe9 !important;"></span><span class="val">20%</span></div></td><td val="0.266736" class="data-coloured mqc-generalstats-fastqc-total_sequences "><div class="wrapper"><span class="bar" style="width:100.0%; background-color:#b5c1d3 !important;"></span><span class="val">0.3</span></div></td></tr><tr role="row">
  <th style="width:2px;"></th>
  <th class="rowheader" data-original-sn="test">test2</th>
  <td val="1.3827871827776548" class="data-coloured mqc-generalstats-fastqc-percent_duplicates">
    <div class="wrapper">
      <span class="bar" style="width: 1.3827871827776548%; background-color: #b4d4c4 !important"></span><span class="val">1.4%</span>
    </div>
  </td>
  <td val="49.0" class="data-coloured mqc-generalstats-fastqc-percent_gc">
    <div class="wrapper">
      <span class="bar" style="width: 49%; background-color: #f5d2e8 !important"></span><span class="val">49%</span>
    </div>
  </td>
  <td val="100.05148348929278" class="data-coloured mqc-generalstats-fastqc-avg_sequence_length hidden">
    <div class="wrapper">
      <span class="bar" style="width: 99.99943231816081%; background-color: #b3d2c3 !important"></span><span class="val">100 bp</span>
    </div>
  </td>
  <td class="data-coloured mqc-generalstats-fastqc-median_sequence_length hidden"></td>
  <td val="30.0" class="data-coloured mqc-generalstats-fastqc-percent_fails hidden">
    <div class="wrapper">
      <span class="bar" style="width: 30%; background-color: #fee6dd !important"></span><span class="val">30%</span>
    </div>
  </td>
  <td val="0.266736" class="data-coloured mqc-generalstats-fastqc-total_sequences">
    <div class="wrapper">
      <span class="bar" style="width: 100%; background-color: #b5c1d3 !important"></span><span class="val">0.3</span>
    </div>
  </td>
</tr>


</tbody></table>
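
The row layout in the mock-up above (a merged row followed by indented per-sample rows) could be produced by flattening the groups into an ordered list before rendering. This is only a sketch under assumptions: `build_rows` and its arguments are hypothetical, and the only things taken from the mock-up are the `subgroup` class name and the `↳` prefix.

```python
def build_rows(groups, stats_by_sample, merged_by_group):
    """Flatten groups into ordered (label, css_class, stats) table rows.

    groups: {group_name: [sample_name, ...]}
    stats_by_sample: {sample_name: {metric: value}}
    merged_by_group: {group_name: {metric: value}} for multi-sample groups
    """
    rows = []
    for g_name, members in groups.items():
        if len(members) == 1:
            # A lone sample needs no summary row; show it as a normal row.
            s_name = members[0]
            rows.append((s_name, "", stats_by_sample[s_name]))
            continue
        # Summary row first, then one indented row per original sample.
        rows.append((g_name, "", merged_by_group[g_name]))
        for s_name in members:
            rows.append(("↳ " + s_name, "subgroup", stats_by_sample[s_name]))
    return rows


if __name__ == "__main__":
    groups = {"single": ["single"], "test": ["test_1", "test_2"]}
    stats = {
        "single": {"percent_gc": 50.0},
        "test_1": {"percent_gc": 50.0},
        "test_2": {"percent_gc": 48.0},
    }
    merged = {"test": {"percent_gc": 49.0}}
    for label, css_class, row_stats in build_rows(groups, stats, merged):
        print(label, css_class, row_stats)
```

Tagging the per-sample rows with a class rather than dropping them leaves it to the report front end to decide whether they start out collapsed or expanded.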

Feedback and thoughts on any / all of the above welcome!

@lskatz

lskatz commented Apr 20, 2024

I love the way it looks with collapsible groups!


codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 79.74684% with 32 lines in your changes missing coverage. Please review.

Project coverage is 88.08%. Comparing base (2041c83) to head (e3fb6ee).
Report is 51 commits behind head on main.

Files with missing lines            Patch %   Lines
multiqc/modules/fastqc/fastqc.py    75.00%    23 Missing ⚠️
multiqc/base_module.py              86.15%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #576      +/-   ##
==========================================
- Coverage   88.15%   88.08%   -0.08%     
==========================================
  Files         473      473              
  Lines       29085    29193     +108     
==========================================
+ Hits        25641    25715      +74     
- Misses       3444     3478      +34     


@vladsavelyev
Member

Closing in favour of #2794

Labels
awaits-review Awaiting final review and merge. core: back end
Development

Successfully merging this pull request may close these issues.

Group samples: merge summary stats
6 participants