New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Duplicate Set Index for queryname sorted input to MarkDuplicates #1843
Conversation
428254f
to
2d310e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good and I don't have very much to say except to comment that the logic for this code is very confusing and needs to be better documented/cleaned up to be at all comprehensible especially since there were two very obvious mistakes in plane site that you cleaned up over a 20 line span...
@@ -400,23 +401,6 @@ protected int doWork() { | |||
UmiUtil.setMolecularIdentifier(rec, "", MOLECULAR_IDENTIFIER_TAG, DUPLEX_UMI); | |||
} | |||
|
|||
// Tag any read pair that was in a duplicate set with the duplicate set size and a representative read name | |||
if (TAG_DUPLICATE_SET_MEMBERS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this code really copy-pasted like this all along and nobody noticed? 😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... that does appear to be the case
tester.expectedRepresentativeIndexMap.put(8, representativeReadIndexInFileForward2); | ||
tester.expectedRepresentativeIndexMap.put(11, representativeReadIndexInFileForward2); | ||
tester.expectedSetSizeMap.put("RUNID:1:1:15993:13370",3); | ||
final String duplicateSet2DuplicateReadName1 = "RUNID:1:1:15993:13365"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bleh is this the best of how our testing for this feature worked before? this seems very brittle... not really your problem to fix here though...
* Should be used to update any expectations to be used in implementation | ||
* specific tests. | ||
**/ | ||
void updateExpectationsHook() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
solid refactoring of the testing infrastructure.
@@ -384,13 +386,12 @@ protected int doWork() { | |||
} | |||
final boolean isInDuplicateSet = recordInFileIndex == nextRepresentativeIndex || | |||
(sortOrder == SAMFileHeader.SortOrder.queryname && | |||
recordInFileIndex > nextDuplicateIndex); | |||
recordInFileIndex > nextRepresentativeIndex && rec.getReadName().equals(representativeQueryName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename this/comment the logic for "isInDuplicateSet" here to clairify this? This took a while to pick through what all of the steps are doing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "Is RepresentativeReadForADuplicateSet"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this essentially says that if we are in queryname sorted mode (It should be noted this won't work if we are in query grouped mode even though the logic should be the same, maybe that check should be updated), we mark both the first read and its mates with the duplicate sets but if we are coordinate sorted the mates don't get the flag? Is that asymmetrical behavior desirable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is super confusing, I actually couldn't figure out how this even works just now until I realized the variables are horribly named.
I believe that mates should also be marked with duplicate sets when the input is coordinate sorted as well (which is confirmed by our tests). The reason is that in addRepresentativeReadIndex
we add an entry for both reads of the fragment if end.read1IndexInFile != end.read2IndexInFile
. And when the ReadEnds object gets built we keep those entries the same for queryname sorted, but different for coordinate sorted (see L548 and thereabouts). I think this really goes back to your overall point that this code has become mangled to the point obfuscation. Which I completely agree with, but don't know that anyone has the capacity to fix, unfortunately.
Regardless, I will try to at least make this part a little clearer
if (isInDuplicateSet) { | ||
if (!rec.isSecondaryOrSupplementary() && !rec.getReadUnmappedFlag()) { | ||
if (TAG_DUPLICATE_SET_MEMBERS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh boy this code was not in a good state before you got to it...
Thanks for the review @jamesemery. I agree with your comment that the code is very confusing at this point, and could greatly benefit from significant cleanup. I don't know where we find the capacity for that currently though. I did try to at least make a little clearer that part this PR affects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarifying comments
Duplicate Set index is broken for queryname sorted input. This fixes, adds tests of this case, and also refactors some mark duplicates tests.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests