-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CollectQualityYieldMetricsFlowSpace tool #1932
CollectQualityYieldMetricsFlowSpace tool #1932
Conversation
4cc0e00
to
e17dcc8
Compare
From picard triage: It looks like some of the FlowBasedRead code is duplicated with GATK, and is utility code. Is there any chance that the code that is common with GATK can be factored out so there is less duplication ? Let us know if this turns out to be burdensome. Otherwise, is this ready to merge ? @meganshand Can you do a quick review on this ? |
@cmnbroad - thanks! You are right about the code duplication. I was actually wondering what is the established flow to prevent duplications between the picard and the GATK. Could you suggest or point me to the example? Regarding the PR - could we wait with the review? There is a related tool in the GATK that I would like to have merged there first. Thanks! |
@ilyasoifer The picard jar is included in the GATK build, so anything in picard that is public can be accessed and used in GATK (just search GATK for So then it becomes a question of what is convenient. Importing and using a class of static utilities into GATK from picard is pretty easy, but sharing something like the We just wanted to raise the issue and see if there is any refactoring that can be done to minimize duplication (such as factoring out some common code into static utilities that can live in picard, which could then be accessed by both picard and GATK). Obviously keeping duplicating code means updates have to be made in both projects. On the other hand, if you want to make a change in GATK that requires an update to common code in picard, there is some latency incurred to get those changes into GATK, since we generally only update to a new picard jar in GATK when there is a new picard release. So we'll leave it up to you to strike the right balance. Hope that helps. |
@meganshand Can you review this now ? It looks like the GATK PR has been merged ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @cmnbroad above that copying code from GATK should be avoided if possible. Did you look into that and determine that leaving it copied here is the best solution (for FlowBasedRead for example)?
Other than that, a few small comments to update it to match the changes made in the GATK PR. Thanks!
@@ -92,9 +92,6 @@ public class CollectQualityYieldMetrics extends SinglePassSamProgram { | |||
"of bases if there are supplemental alignments in the input file.") | |||
public boolean INCLUDE_SUPPLEMENTAL_ALIGNMENTS = false; | |||
|
|||
@Argument(doc = "If true, calculates flow-specific READ_LENGTH_AVG_Q metrics.") | |||
public boolean FLOW_MODE = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change won't be backwards compatible with the previous version of Picard. That's ok with me since you're probably aware of most Ultima users using this tool, but wanted to note it in case you'd rather throw an error message pointing them to CollectQualityYieldMetricsFlow
if they provide FLOW_MODE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good advice. Restored FLOW_MODE on CollectQualityYieldMetrics. When set, now throws an exception and points to CollectQualityYieldMetricsFlow
/* | ||
* The MIT License | ||
* | ||
* Copyright (c) 2009 The Broad Institute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2009 -> 2024
// add up quals, and quals >= 20 | ||
int flow = 0; | ||
for (final int qual : quals) { | ||
metrics.Q20_EQUIVALENT_YIELD += qual; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this need to be above line 161 in order to include non-PF reads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q20_EQUIVALENT_YIELD seems to be superfluous. The new tool discards all non PF reads and contains only PF counters (except for TOTAL_READS). Q20_EQUIVALENT_YIELD removed. Your comment is correct, but now that there is no such counters it is resolved.
public long PF_READS = 0; | ||
|
||
/** | ||
* The average read length of all the reads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The average read length of all the reads | |
* The average read length of the PF reads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually is this read length? Or mean number of flows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name changed to MEAN_PF_READ_NUMBER_OF_FLOWS
/* | ||
* The MIT License | ||
* | ||
* Copyright (c) 2009 The Broad Institute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright (c) 2009 The Broad Institute | |
* Copyright (c) 2024 The Broad Institute |
this.metrics = new QualityYieldMetrics(useBQForBaseQualities); | ||
} | ||
|
||
public void acceptRecord(final SAMRecord rec, final ReferenceSequence ref) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is all so similar, would there be any way to fold this into the CollectQualityYieldMetricsFlow tool instead? Or is it truly beneficial to have two tools? You're already making a new Metrics class so you can add more metrics to it and in the case that the CRAM hasn't had the BQ tag added (or whatever tag the user specifies for the qualities) you'd just get these metrics applied over the QUALs. It's fine if you want to push back on this suggestion, it just might make this more streamlined (but that's not worth it if the cost is a worse user experience).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Still, from a user experience perspective, the SNVQ is a rather separate and unique feature in its own right, hence the original design of a separate tool.
} else { | ||
// zero base bq values | ||
for ( int i = 0 ; i < tmp.length ; i++ ) { | ||
tmp[i] -= 33; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here as in GATK to use htsjdk rather than handling the byte conversion here yourself.
@@ -46,4 +46,5 @@ public class StandardOptionDefinitions { | |||
public static final String MINIMUM_LOD_SHORT_NAME = "LOD"; | |||
public static final String SORT_ORDER_SHORT_NAME = "SO"; | |||
public static final String USE_ORIGINAL_QUALITIES_SHORT_NAME = "OQ"; | |||
public static final String USE_BQ_FOR_BASE_QUALITIES_SHORT_NAME = "UBQ"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully you won't need this after updating the tool to take the tag from the user rather than using "BQ"
/* | ||
* The MIT License | ||
* | ||
* Copyright (c) 2015 The Broad Institute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2015 -> 2024
|
||
final CollectQualityYieldMetricsSNVQ.QualityYieldMetrics metrics = output.getMetrics().get(0); | ||
Assert.assertEquals(metrics.TOTAL_READS, 26577); | ||
Assert.assertTrue(metrics.PF_READS > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you only want to check that these values are greater than 0? Or do you know what these values should be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Actual numbers added.
@cmnbroad and @meganshand - we agree with your suggestion to remove as much as possible code duplication.
|
Sounds right to me. |
comments by @meganshand addressed - thank you. Next I'd like to verify that what flow code we have in this pull request is enough to be able to pull it and other dup come from GATK. This is verification step - to save us from having to do yet another picard PR to enable GATK flow code cleanup. |
The changes you made to address my comments look good 👍 Am I right in assuming that you'll be adding more to this PR as you're moving code from GATK to Picard? Are you waiting on me for anything at this point? |
Your assumption is correct. I'm in the process of evaluating additional common flow code to be moved from GATK to Picard. |
@meganshand : I have now committed changes to picard which include most of the core flow code (from GATK). I have also arrived at a local build which builds a modified version of GATK against this branch (so we know it is usable in this sense). I will be happy accept your comments on these changes. @ilyasoifer : FYI |
…est framework" This reverts commit fd8b16d.
@meganshand Yes. I did run all the flow tests on a branch adapted to this picard using something like: $ ./gradlew test --tests "Flow" |
@cmnbroad I'll wait to merge until you've had a chance to voice anything on the latest changes. |
I won't get to it anytime soon, so I'd say if you're good with it @meganshand, go for it. Thanks. |
CollectQualityYieldMetrics for Ultima flow based read files.