Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Duplicate Set Index for queryname sorted input to MarkDuplicates #1843

Merged
merged 4 commits into from Feb 13, 2023

Conversation

kachulis
Copy link
Contributor

Duplicate Set index is broken for queryname sorted input. This fixes, adds tests of this case, and also refactors some mark duplicates tests.

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

Copy link
Contributor

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good and I don't have very much to say except to comment that the logic for this code is very confusing and needs to be better documented/cleaned up to be at all comprehensible especially since there were two very obvious mistakes in plane site that you cleaned up over a 20 line span...

@@ -400,23 +401,6 @@ protected int doWork() {
UmiUtil.setMolecularIdentifier(rec, "", MOLECULAR_IDENTIFIER_TAG, DUPLEX_UMI);
}

// Tag any read pair that was in a duplicate set with the duplicate set size and a representative read name
if (TAG_DUPLICATE_SET_MEMBERS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this code really copy-pasted like this all along and nobody noticed? 😞

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... that does appear to be the case

tester.expectedRepresentativeIndexMap.put(8, representativeReadIndexInFileForward2);
tester.expectedRepresentativeIndexMap.put(11, representativeReadIndexInFileForward2);
tester.expectedSetSizeMap.put("RUNID:1:1:15993:13370",3);
final String duplicateSet2DuplicateReadName1 = "RUNID:1:1:15993:13365";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bleh is this the best of how our testing for this feature worked before? this seems very brittle... not really your problem to fix here though...

* Should be used to update any expectations to be used in implementation
* specific tests.
**/
void updateExpectationsHook() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solid refactoring of the testing infrastructure.

@@ -384,13 +386,12 @@ protected int doWork() {
}
final boolean isInDuplicateSet = recordInFileIndex == nextRepresentativeIndex ||
(sortOrder == SAMFileHeader.SortOrder.queryname &&
recordInFileIndex > nextDuplicateIndex);
recordInFileIndex > nextRepresentativeIndex && rec.getReadName().equals(representativeQueryName));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename this/comment the logic for "isInDuplicateSet" here to clairify this? This took a while to pick through what all of the steps are doing...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "Is RepresentativeReadForADuplicateSet"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this essentially says that if we are in queryname sorted mode (It should be noted this won't work if we are in query grouped mode even though the logic should be the same, maybe that check should be updated), we mark both the first read and its mates with the duplicate sets but if we are coordinate sorted the mates don't get the flag? Is that asymmetrical behavior desirable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is super confusing, I actually couldn't figure out how this even works just now until I realized the variables are horribly named.

I believe that mates should also be marked with duplicate sets when the input is coordinate sorted as well (which is confirmed by our tests). The reason is that in addRepresentativeReadIndex we add an entry for both reads of the fragment if end.read1IndexInFile != end.read2IndexInFile. And when the ReadEnds object gets built we keep those entries the same for queryname sorted, but different for coordinate sorted (see L548 and thereabouts). I think this really goes back to your overall point that this code has become mangled to the point obfuscation. Which I completely agree with, but don't know that anyone has the capacity to fix, unfortunately.

Regardless, I will try to at least make this part a little clearer

if (isInDuplicateSet) {
if (!rec.isSecondaryOrSupplementary() && !rec.getReadUnmappedFlag()) {
if (TAG_DUPLICATE_SET_MEMBERS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh boy this code was not in a good state before you got to it...

@kachulis
Copy link
Contributor Author

Thanks for the review @jamesemery. I agree with your comment that the code is very confusing at this point, and could greatly benefit from significant cleanup. I don't know where we find the capacity for that currently though.

I did try to at least make a little clearer that part this PR affects.

Copy link
Contributor

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarifying comments

@kachulis kachulis merged commit 21ccdb5 into master Feb 13, 2023
@kachulis kachulis deleted the ck_MD_duplicate_set_index_tag branch February 13, 2023 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants