Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkqc module #1552

Merged
merged 20 commits into from
Jan 25, 2022
Merged

Checkqc module #1552

merged 20 commits into from
Jan 25, 2022

Conversation

maleasy
Copy link

@maleasy maleasy commented Sep 17, 2021

Initial version of a module for CheckQC (see #1551 (comment))

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md has been updated
  • There is example tool output for tools in the https://github.com/ewels/MultiQC_TestData repository or attached to this PR
  • Code is tested and works locally (including with --lint flag)
  • docs/README.md is updated with link to below
  • docs/modulename.md is created
  • Everything that can be represented with a plot instead of a table is a plot
  • Report sections have a description and help text (with self.add_section)
  • There aren't any huge tables with > 6 columns (explain reasoning if so)
  • Each table column has a different colour scale to its neighbour, which relates to the data (eg. if high numbers are bad, they're red)
  • Module does not do any significant computational work

@maleasy maleasy mentioned this pull request Sep 17, 2021
@maleasy
Copy link
Author

maleasy commented Sep 17, 2021

The MultiQC - Linux / Linux -Python 3.x checks fail in subtest Test for missing CSPs with this output result:

Run python test/print_missing_csp.py --report full_report.html --whitelist CSP.txt
The following scripts are missing from CSP.txt
  'sha256-TUYHIjQsABv4n1G4GdIAz0ZljvuXPEvQ9T20Sok7TvE=' # ////////////////////////////////////////////////// Base JS for MultiQC Reports//
  'sha256-teMLGfDW72TRam2f0Fnj53uchjf3tQR9N0xo+hR2PXU=' # ////////////////////////////////////////////////// MultiQC Table code///////////
  'sha256-AEcY37NW7iIjmlzDOPZvuAorKHxXZxsECPnuSP52fB8=' # ////////////////////////////////////////////////// HighCharts Plotting Code/////
  'sha256-iPIReyQyAoheerXqjtulr8WN7lPMWacMVU4awy5CTJg=' # ////////////////////////////////////////////////// Static MatPlotLib Plots Javas
  'sha256-KGgqQTL/PWbF29mNXwFnepbDYBEnTKdMtejlNhEiXfs=' # ////////////////////////////////////////////////// MultiQC Report Toolbox Code//
  'sha256-RO0nmf6TEJtrp+R4JF8N4UrF+A2Hfkb8BKxyS7nM5go=' # // Return JS code required for plotting a single sample// RSeQC plot. Attempt to
  'sha256-kH3IklvmfvBiZprdCgEZqGuZjhiGEF5V3tHncoRciQI=' # // Javascript for the FastQC MultiQC Mod///////////////// Per Base Sequence Cont

I am not sure where the problem is. Can you please advise?

Thanks!

@gartician gartician mentioned this pull request Oct 11, 2021
11 tasks
@ewels
Copy link
Member

ewels commented Nov 15, 2021

I am not sure where the problem is. Can you please advise?

In short, it wasn't your fault - was failing for all MultiQC tests. It's since been fixed, I just pulled in updates from master and hopefully tests will pass now..

@ewels
Copy link
Member

ewels commented Nov 15, 2021

x-ref test data PR: MultiQC/test-data#218

@ewels
Copy link
Member

ewels commented Nov 15, 2021

Thanks for this @maleasy - will be great to get a CheckQC module in! The tool was written by colleagues of mine so it's been on my radar to add for ages.. Maybe @matrulda you could cast your eyes over this PR if you get a few minutes?

I've just had a quick stab at getting the CI tests to pass and have moved them on a little. But looks like it's still catching some issues, such as the module not returning a UserWarning when no files are found (or possibly not filtering all samples out with --ignore-samples). Now that the CSP tests are fixed hopefully you can push this PR on now to get the little green tick @maleasy

Phil

@matrulda
Copy link
Contributor

Wow! Cool, I will have a look and try it out 😄 👍

@maleasy
Copy link
Author

maleasy commented Nov 16, 2021

@ewels @matrulda I refactored the code primarily to make sure that all samples are filtered out on --ignore-samples before adding content. Passes all tests now.

@matrulda
Copy link
Contributor

@maleasy Just wanted to give you a life sign. This week has been packed with meetings, I've started trying the module out and will hopefully have time to finish the review at the start of next week.

Copy link
Contributor

@matrulda matrulda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work and thanks for doing this! 👏 Left some comments that you can have a look at.

multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
continue
self.add_data_source(f, sample)
p_undetermined = issue["data"]["percentage_undetermined"]
threshold = issue["data"]["threshold"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckQC reports that percentage undetermined couldn't be computed if yield == 0, in that case data only contains lane and percentage_underermined (which is N/A). It would be nice if that case was handled.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer, fixed in ead255e, this case does not break parsing the checkqc JSON now. I added it to the plot as an empty entry so the user can at least see that something is up with the respective sample. Not sure how to transport more info, e.g. that yield is zero, because no legend is shown for a bar of zero length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkqc_module_no_yield
I tried it out and it looks like this. I think this is fine, but it would nice if the message you wrote ("Yield is 0, no undetermined percentage computation possible") was communicated. I'm not super familiar with how MultiQC modules work, so can't really suggest how to best implement it. Would it be helpful if I send you the json file I used?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your test, was the yield 0 in all lanes? I had not thought of this possibility. The corresponding JSON file would be helpful, thanks! Below is a plot where only one lane has yield 0 ( (lane 8 on run hiseq2500_rapidhighoutput_v4_1), I was hoping the legend would include the "Yield is 0" message, but it does not for a value of 0.
I have to see how to do this best, I was hoping to somehow include it in the barplot, but probably a separate plot or section is needed.

checkqc_undetermined-percentage-plot

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, It's basically a failed MiSeq run, so only one lane. It's quite clear from the other handlers that something is wrong, so I don't see it as super necessary to fix, just a nice to have. I can't upload the file here, so send me an e-mail (you can find my address on my profile) if you want me to send the file. But like I said, I'm also fine with approving this as is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a dummy JSON with an entry like this for testing, I think that should cover it (correct me if I'm wrong):

    "UndeterminedPercentageHandler": [
        {
            "type": "error",
            "message": "The percentage of undetermined indexes was to high on lane 8, it was: N/A",
            "data": {
              "lane": 1,
              "percentage_undetermined": "N/A"
            }
        }
    ],

In 1d28323 I added an additional section for cases like this, it contains a table listing the lanes with yield zero.

Copy link
Contributor

@matrulda matrulda Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exactly, only the message is a bit different, but it isn't used in the code as far as I can see.

    "UndeterminedPercentageHandler": [
        {
            "type": "error",
            "message": "Yield for lane: {} was 0. No undetermined percentage could be computed.",
            "data": {
                "lane": 1,
                "percentage_undetermined": "N/A"
            }
        }
    ],

(Oops, we have a 🐛 {})

Christoph Malisi added 3 commits November 23, 2021 08:38
Remove parsing of unused variables
Fixed unique run names generation
Read numbers now always in millions
Handle yield=0 case in UndeterminedPercentageHandler section
multiqc/modules/checkqc/checkqc.py Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
multiqc/modules/checkqc/checkqc.py Outdated Show resolved Hide resolved
"""
data = self.checkqc_data["UndeterminedPercentageHandler_ZeroYield"]

pconfig = {"id": "checkqc_zero-yield-table", "table_title": "CheckQC: Lanes with Yield 0", "scale": "Reds"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be less confusing to set col1_header to something else than the default "Sample Name". Maybe Lane Number? It's a bit overkill since the other column also is specifying lane. I guess only a list is needed really. Is it possible to make a table with only the first column? I've only made lists in MultiQC reports by adding a HTML list using the custom content module before.

Copy link
Author

@maleasy maleasy Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the table columns are somewhat redundant. I don't think it is possible to only have the first column, because this column is created automatically by MultiQC using the sample names, and the column title is also assigned by MultiQC automatically. I used the lane/run names for the other columns because I don't have any other useful info to put in the cell.
I don't find documentation about HTML content in a module, I'm not sure if that is easily possible as with custom content. It would be possible to simply add the list of failed lanes to the section text separated by commas, and have no plot/table at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to use "run" as "sample" (line 149) instead of "lane", and col1_header="Run". That also removes the need for the special case of one run. I tried it out:
zero_yield_table
What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes I tested: matrulda@9051fa6

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a good solution! I pulled your commit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks 😃

Copy link
Contributor

@matrulda matrulda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, approved! ✔️ Again, great work @maleasy !

@ewels I leave the rest to you now. :)

@apeltzer
Copy link
Contributor

Thank you all for making this possible 💯 🥳

@maleasy
Copy link
Author

maleasy commented Nov 25, 2021

Great, thank you all!
@matrulda: Thanks for the helpful comments and code bits

@matrulda
Copy link
Contributor

No problemat all! I've shown this module to my colleagues and we look forward to start using it at our facility 🥳

@ewels
Copy link
Member

ewels commented Nov 26, 2021

Brilliant stuff - thank you for the extensive review @matrulda and thanks for being responsive and implementing all the changes @maleasy!

I'll try to do one final check myself soon and merge when I get a moment 👍🏻

@ewels
Copy link
Member

ewels commented Jan 15, 2022

Thanks for this @maleasy! Just skim read the code and I can't see any issues on that side 👍🏻 (just a minor thing - missing a log output when something is found, which I've just added and pushed to the PR).

Looking at the report (based only on the file I have in the test data repo), I have a few suggestions for changes:

  • General stats table
    • I think that we can remove these columns.. They seem to exactly mimic the bar plot underneath so I'm not sure that they really add anything.
  • I would have to have two heatmaps showing all samples / lanes with pass / warn / fail for each test type. I'm not sure if this is actually possible (samples with no warnings are not reported, right @matrulda?) and I know some are read-specific. But it would be nice to flatten where applicable and have a summary table showing which samples fail across the board.. (as we do for FastQC)
  • Too few reads per sample
    • Can we alphabetically sort the sample names please.. (or maybe lanes first?)
  • Cluster PF too low
    • Can we avoid the PF acronym wherever we're not short on space? eg. Title, first category chart legend
  • Graph sample names
    • I wonder if we drop the axis title and use a more verbose Lane 3 instead of just 3? Lane - Read could definitely do with this. eg. Lane 3 - R2 instead of 3 -2. Prefixes and numbers should always be short so I don't think that space is an issue.
  • Descriptions / help text could be improved
    • Error rate too high followed by Some lanes have too high error rate. is not super helpful 😉 What error rate is this? How is it calculated? What does it mean? Same for most sections.
    • I think also that the descriptions should be rephrased to make it clear that we are only showing the errors / warnings. For example, instead of Some samples have too few reads we could have The following samples reported an error because they did not meet their minimum read threshold. etc...

All fairly minor stuff to just polish the output and make it easier for newcomers. When reviewing I generally try to imagine that I'm a lab scientist who has been emailed a MultiQC report like this and I'm trying to figure it out from scratch.

I hope this is ok! Let me know if you have any thoughts on any of the above 👍🏻

Phil

@maleasy
Copy link
Author

maleasy commented Jan 17, 2022

Hi Phil,
I won't be able to work on this in the coming weeks. I'll check out your suggestions when I find the time.

@ewels
Copy link
Member

ewels commented Jan 18, 2022

Ok, thanks for letting me know 👍🏻 Maybe I can wrap it up with help from @matrulda before then, depending on how I go with the other PRs that need handling.

@ewels
Copy link
Member

ewels commented Jan 25, 2022

Ok, I've just pushed some changes to try to tweak the report output a little. Hopefully you guys agree that these are improvements, please let me know if you disagree with anything.

Thanks for this!

@ewels ewels enabled auto-merge January 25, 2022 14:02
@ewels ewels merged commit 350d501 into MultiQC:master Jan 25, 2022
@apeltzer
Copy link
Contributor

Looks all good to me 👍🏻

@matrulda
Copy link
Contributor

Hi! (over a year later 😓 ). Lost track of this, but super happy that this got merged. Thank you very much for all your work @maleasy @ewels @apeltzer 🎉

@ewels
Copy link
Member

ewels commented Oct 22, 2023

Wow, some serious inbox diving to find this @matrulda! 😂 I hope it's useful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants