Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CollectQualityYieldMetricsFlowSpace tool #1932

Conversation

dror27
Copy link
Contributor

@dror27 dror27 commented Dec 25, 2023

CollectQualityYieldMetrics for Ultima flow based read files.

@dror27 dror27 force-pushed the CollectQualityYieldMetricsFlowSpace-dev branch from 4cc0e00 to e17dcc8 Compare January 28, 2024 11:44
@dror27 dror27 marked this pull request as draft February 12, 2024 10:22
@dror27 dror27 changed the title CollectQualityYieldMetricsFlowSpace tool (please hold on review) CollectQualityYieldMetricsFlowSpace tool Feb 12, 2024
@dror27 dror27 marked this pull request as ready for review February 19, 2024 13:50
@dror27 dror27 requested a review from ilyasoifer March 11, 2024 11:20
@cmnbroad
Copy link
Contributor

From picard triage: It looks like some of the FlowBasedRead code is duplicated with GATK, and is utility code. Is there any chance that the code that is common with GATK can be factored out so there is less duplication ? Let us know if this turns out to be burdensome. Otherwise, is this ready to merge ? @meganshand Can you do a quick review on this ?

@ilyasoifer
Copy link
Contributor

From picard triage: It looks like some of the FlowBasedRead code is duplicated with GATK, and is utility code. Is there any chance that the code that is common with GATK can be factored out so there is less duplication ? Let us know if this turns out to be burdensome. Otherwise, is this ready to merge ? @meganshand Can you do a quick review on this ?

@cmnbroad - thanks! You are right about the code duplication. I was actually wondering what is the established flow to prevent duplications between the picard and the GATK. Could you suggest or point me to the example? Regarding the PR - could we wait with the review? There is a related tool in the GATK that I would like to have merged there first.

Thanks!

@cmnbroad
Copy link
Contributor

cmnbroad commented Mar 12, 2024

@ilyasoifer The picard jar is included in the GATK build, so anything in picard that is public can be accessed and used in GATK (just search GATK for import picard to see some examples).

So then it becomes a question of what is convenient. Importing and using a class of static utilities into GATK from picard is pretty easy, but sharing something like the FlowBasedRead class is a little harder because in GATK it extends SAMRecordToGATKReadAdapter, which is GATK only.

We just wanted to raise the issue and see if there is any refactoring that can be done to minimize duplication (such as factoring out some common code into static utilities that can live in picard, which could then be accessed by both picard and GATK). Obviously keeping duplicating code means updates have to be made in both projects. On the other hand, if you want to make a change in GATK that requires an update to common code in picard, there is some latency incurred to get those changes into GATK, since we generally only update to a new picard jar in GATK when there is a new picard release. So we'll leave it up to you to strike the right balance. Hope that helps.

@cmnbroad
Copy link
Contributor

cmnbroad commented Apr 9, 2024

@meganshand Can you review this now ? It looks like the GATK PR has been merged ?

@meganshand meganshand requested review from meganshand and removed request for ilyasoifer April 9, 2024 17:52
@meganshand meganshand self-assigned this Apr 9, 2024
Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @cmnbroad above that copying code from GATK should be avoided if possible. Did you look into that and determine that leaving it copied here is the best solution (for FlowBasedRead for example)?

Other than that, a few small comments to update it to match the changes made in the GATK PR. Thanks!

@@ -92,9 +92,6 @@ public class CollectQualityYieldMetrics extends SinglePassSamProgram {
"of bases if there are supplemental alignments in the input file.")
public boolean INCLUDE_SUPPLEMENTAL_ALIGNMENTS = false;

@Argument(doc = "If true, calculates flow-specific READ_LENGTH_AVG_Q metrics.")
public boolean FLOW_MODE = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change won't be backwards compatible with the previous version of Picard. That's ok with me since you're probably aware of most Ultima users using this tool, but wanted to note it in case you'd rather throw an error message pointing them to CollectQualityYieldMetricsFlow if they provide FLOW_MODE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good advice. Restored FLOW_MODE on CollectQualityYieldMetrics. When set, now throws an exception and points to CollectQualityYieldMetricsFlow

/*
* The MIT License
*
* Copyright (c) 2009 The Broad Institute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2009 -> 2024

// add up quals, and quals >= 20
int flow = 0;
for (final int qual : quals) {
metrics.Q20_EQUIVALENT_YIELD += qual;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need to be above line 161 in order to include non-PF reads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q20_EQUIVALENT_YIELD seems to be superfluous. The new tool discards all non PF reads and contains only PF counters (except for TOTAL_READS). Q20_EQUIVALENT_YIELD removed. Your comment is correct, but now that there is no such counters it is resolved.

public long PF_READS = 0;

/**
* The average read length of all the reads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The average read length of all the reads
* The average read length of the PF reads

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually is this read length? Or mean number of flows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name changed to MEAN_PF_READ_NUMBER_OF_FLOWS

/*
* The MIT License
*
* Copyright (c) 2009 The Broad Institute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (c) 2009 The Broad Institute
* Copyright (c) 2024 The Broad Institute

this.metrics = new QualityYieldMetrics(useBQForBaseQualities);
}

public void acceptRecord(final SAMRecord rec, final ReferenceSequence ref) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is all so similar, would there be any way to fold this into the CollectQualityYieldMetricsFlow tool instead? Or is it truly beneficial to have two tools? You're already making a new Metrics class so you can add more metrics to it and in the case that the CRAM hasn't had the BQ tag added (or whatever tag the user specifies for the qualities) you'd just get these metrics applied over the QUALs. It's fine if you want to push back on this suggestion, it just might make this more streamlined (but that's not worth it if the cost is a worse user experience).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Still, from a user experience perspective, the SNVQ is a rather separate and unique feature in its own right, hence the original design of a separate tool.

} else {
// zero base bq values
for ( int i = 0 ; i < tmp.length ; i++ ) {
tmp[i] -= 33;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as in GATK to use htsjdk rather than handling the byte conversion here yourself.

@@ -46,4 +46,5 @@ public class StandardOptionDefinitions {
public static final String MINIMUM_LOD_SHORT_NAME = "LOD";
public static final String SORT_ORDER_SHORT_NAME = "SO";
public static final String USE_ORIGINAL_QUALITIES_SHORT_NAME = "OQ";
public static final String USE_BQ_FOR_BASE_QUALITIES_SHORT_NAME = "UBQ";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully you won't need this after updating the tool to take the tag from the user rather than using "BQ"

/*
* The MIT License
*
* Copyright (c) 2015 The Broad Institute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2015 -> 2024


final CollectQualityYieldMetricsSNVQ.QualityYieldMetrics metrics = output.getMetrics().get(0);
Assert.assertEquals(metrics.TOTAL_READS, 26577);
Assert.assertTrue(metrics.PF_READS > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you only want to check that these values are greater than 0? Or do you know what these values should be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Actual numbers added.

@ilyasoifer
Copy link
Contributor

@cmnbroad and @meganshand - we agree with your suggestion to remove as much as possible code duplication.
So the plan is:

  1. Move as much as possible code from GATK to Picard.
  2. Wait until release of the updated picard version (update of the dependency in the GATK)
  3. Make a PR into the GATK that removes the duplicated code in the GATK and adds dependency on picard.
    Does this sound like a reasonable plan? Or is there a faster way of doing this?

@cmnbroad
Copy link
Contributor

Sounds right to me.

@dror27
Copy link
Contributor Author

dror27 commented Apr 25, 2024

Sounds right to me.

comments by @meganshand addressed - thank you. Next I'd like to verify that what flow code we have in this pull request is enough to be able to pull it and other dup come from GATK. This is verification step - to save us from having to do yet another picard PR to enable GATK flow code cleanup.

@meganshand
Copy link
Contributor

The changes you made to address my comments look good 👍

Am I right in assuming that you'll be adding more to this PR as you're moving code from GATK to Picard? Are you waiting on me for anything at this point?

@dror27
Copy link
Contributor Author

dror27 commented Apr 30, 2024

The changes you made to address my comments look good 👍

Am I right in assuming that you'll be adding more to this PR as you're moving code from GATK to Picard? Are you waiting on me for anything at this point?

Your assumption is correct. I'm in the process of evaluating additional common flow code to be moved from GATK to Picard.

@dror27
Copy link
Contributor Author

dror27 commented Apr 30, 2024

@meganshand : I have now committed changes to picard which include most of the core flow code (from GATK). I have also arrived at a local build which builds a modified version of GATK against this branch (so we know it is usable in this sense).

I will be happy accept your comments on these changes.

@ilyasoifer : FYI

@dror27 dror27 requested a review from meganshand April 30, 2024 11:57
@meganshand
Copy link
Contributor

@dror27 This looks good to me. Did you also run GATK tests when you built the modified version? If so then that would give me much more confidence.

@cmnbroad Did you also want to take a quick look at this?

@dror27
Copy link
Contributor Author

dror27 commented May 2, 2024

@dror27 This looks good to me. Did you also run GATK tests when you built the modified version? If so then that would give me much more confidence.

@cmnbroad Did you also want to take a quick look at this?

@meganshand Yes. I did run all the flow tests on a branch adapted to this picard using something like:

$ ./gradlew test --tests "Flow"

@meganshand
Copy link
Contributor

@cmnbroad I'll wait to merge until you've had a chance to voice anything on the latest changes.

@cmnbroad
Copy link
Contributor

cmnbroad commented May 2, 2024

I won't get to it anytime soon, so I'd say if you're good with it @meganshand, go for it. Thanks.

@meganshand meganshand merged commit cd216a7 into broadinstitute:master May 2, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants