Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Picard: directly support Sentieon #2110

Merged
merged 39 commits into from Nov 13, 2023
Merged

Picard: directly support Sentieon #2110

merged 39 commits into from Nov 13, 2023

Conversation

vladsavelyev
Copy link
Member

@vladsavelyev vladsavelyev commented Oct 12, 2023

Sentieon internally uses Picard tools to generate metrics. However, it adds a slightly different header into the outputs. Compare Picard's CollectGcBiasMetrics output:

## htsjdk.samtools.metrics.StringHeader
# picard.analysis.CollectGcBiasMetrics INPUT=SRR4238351_subsamp.sorted.bam <...>
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Apr 05 11:14:59 UTC 2017

## METRICS CLASS        picard.analysis.GcBiasDetailMetrics
ACCUMULATION_LEVEL      GC      WINDOWS READ_STARTS     MEAN_BASE_QUALITY       NORMALIZED_COVERAGE     ERROR_BAR_WIDTH SAMPLE  LIBRARY READ_GROUP
...

And Sentieon's --algo GCBias output:

#SentieonCommandLine: /usr/local/sentieon-genomics-201911/libexec/driver -t 36 -i sorted_0.bam -i sorted_1.bam -r genome/hs37d5.fa --algo GCBias --summary gc_summary.txt gc_metric.txt
ACCUMULATION_LEVEL     GC      WINDOWS READ_STARTS     MEAN_BASE_QUALITY       NORMALIZED_COVERAGE     ERROR_BAR_WIDTH SAMPLE  LIBRARY READ_GROUP
...

The module for Sentieon was implemented completely independent from Picard, however, the actual module code was copied from Picard almost identically (apart from treating the headers and extracting the sample name). It led to some overtime discrepancy, when Picard was updated, whereas Sentieon wasn't (e.g.).

Decoupling module dependencies is a great idea in general, but in my opinion that makes sense when modules exist for different tools, even if very similar. But because Sentieon literally uses the very same Picard tools, I think the code should always stay in sync, so I think it makes sense for the Sentieon module to just import the Picard module code. We already do a similar thing for Bracken that just imports Kraken.

Adds support for all Picard-compatible Sentieon QC outputs, fixes:

UPD:

It makes sense to remove the Sentieon MultiQC module completely, and adjust Picard to support both header formats. This way, MultiQC won't break up the general stats table into two columns, and all the plots sections as well:

Screenshot 2023-10-19 at 18 26 46 Screenshot 2023-10-19 at 18 26 54

TODO:

  • Check linting. We probably shouldn't demand to add doi and version in every submodule, but rather only in the main picard.py.

@vladsavelyev vladsavelyev changed the title Reuse picard AlignmentSummary and GcBias funcs for Sentieon Reuse Picard functions for Sentieon Oct 12, 2023
@vladsavelyev vladsavelyev changed the title Reuse Picard functions for Sentieon Sentieon: reuse Picard functions Oct 12, 2023
@vladsavelyev vladsavelyev marked this pull request as ready for review October 19, 2023 19:28
@vladsavelyev vladsavelyev changed the title Sentieon: reuse Picard functions Picard: directly support Sentieon Oct 19, 2023
@vladsavelyev vladsavelyev added this to the MultiQC v1.18 milestone Oct 26, 2023
@ewels
Copy link
Member

ewels commented Oct 26, 2023

x-ref after discussion: #2152

@ewels
Copy link
Member

ewels commented Oct 26, 2023

Needs a special changelog entry warning Senteion users about changes:

  • -m sentieon will no longer work
  • Exported file names will have picard instead of senteion
  • Report naming will change

@vladsavelyev
Copy link
Member Author

@ewels would be great if you can take another look. I'd love to merge it before completing the ruff linting refactoring!

@ewels
Copy link
Member

ewels commented Nov 3, 2023

Feedback on the Slack thread in nf-core has been positive, so I think we're ok on that front 👍🏻

I'll have a quick look / play now.

@ewels
Copy link
Member

ewels commented Nov 3, 2023

In testing, I was looking for the Senteion sample NA12878 in the report generated from this PR. I didn't find it, but more worryingly I got a JS error (probably from another change?)

Uncaught TypeError: s.replace is not a function
    at Object.format (multiqc_report.html:1562:8475)
    at buildCache (multiqc_report.html:1561:2774)
    at HTMLTableElement.<anonymous> (multiqc_report.html:1562:5250)
    at Function.each (multiqc_report.html:1075:2815)
    at r.fn.init.each (multiqc_report.html:1075:1003)
    at r.fn.init.construct [as tablesorter] (multiqc_report.html:1562:4923)
    at HTMLDocument.<anonymous> (multiqc_report.html:1758:21)
    at j (multiqc_report.html:1075:29948)
    at k (multiqc_report.html:1075:30262)

Need to stop now, but are you able to replicate @vladsavelyev ? This is in a report generated from the picard directory in test data, with this PR checked out and installed. Using the highlighting toolbox functionality for NA12878.

@vladsavelyev
Copy link
Member Author

NA12878 should be reported as sorted_0 - per the command line in the file header, e.g.:

#SentieonCommandLine: /usr/local/sentieon-genomics-201911/libexec/driver -t 36 -i sorted_0.bam -i sorted_1.bam -r genome/hs37d5.fa --algo InsertSizeMetricAlgo is_metric.txt

(in ./InsertSizeMetrics/sentieon_pr_1239/NA12878.InsertSize_metrics.txt)

@vladsavelyev
Copy link
Member Author

Regarding the java script error - that's coming from a change in the previous PR.

Fixed it now: b88eb1b

@ewels
Copy link
Member

ewels commented Nov 8, 2023

So flicking between old / new reports before / after this PR, there is some discrepancies:

  • ✅ 63 General stats rows goes to 57
    • Looking at a diff between the sample names column, I guess that this is all due to us now taking the input filename, so several Senteion samples are probably now overwriting one another as sorted_0 for example. Probably fine.
  • ✅ 32 General stats columns to 29
    • Explained by the loss of the Senteion specific columns
  • 🐛 In "Base Distribution Metrics" there are samples called None_R1 and None_R2 - sounds like we need a fallback there for not finding sample name from inputs?
  • 🐛 Some unexpected fields in multiqc_data seem to be going missing
    • multiqc_picard_IlluminaLaneMetrics.yaml - loses fields PHASING_APPLIED, PREPHASING_APPLIED, TYPE_NAME?
    • multiqc_picard_ExtractIlluminaBarcodes.yaml - loses LANE

@vladsavelyev
Copy link
Member Author

🐛 In "Base Distribution Metrics" there are samples called None_R1 and None_R2 - sounds like we need a fallback there for not finding sample name from inputs?

Fixed that, thanks!

🐛 Some unexpected fields in multiqc_data seem to be going missing
multiqc_picard_IlluminaLaneMetrics.yaml - loses fields PHASING_APPLIED, PREPHASING_APPLIED, TYPE_NAME?

There are two types of IlluminaLaneMetrics outputs, reporting different metrics:

## METRICS CLASS	picard.illumina.IlluminaPhasingMetrics
LANE	TYPE_NAME	PHASING_APPLIED	PREPHASING_APPLIED
...

and

## METRICS CLASS	picard.illumina.IlluminaLaneMetrics
CLUSTER_DENSITY	LANE
...

After refactoring, one started to override another one (like we normally do for duplicated samples). I modified back to updating dicts instead of overwriting for this sort of reports.

multiqc_picard_ExtractIlluminaBarcodes.yaml - loses LANE

LANE was added there to act as a suffix for the sample name, so I assumed it redundant. Should have kept it for compatibility, and it might be useful for the users downstream. Fixed.

Copy link
Member

@ewels ewels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, happy with the two minor bug fixes 👍🏻

Let's merge, then see if anyone complains 😬

Thanks for the major refactoring work here!

@vladsavelyev vladsavelyev merged commit 56ea024 into master Nov 13, 2023
4 of 7 checks passed
@vladsavelyev vladsavelyev deleted the sentieon-picard branch November 13, 2023 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaits-review Awaiting final review and merge. module: enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants