Fix Duplicate Set Index for queryname sorted input to MarkDuplicates #1843

kachulis · 2022-11-16T17:32:46Z

The Duplicate Set Index is broken for queryname sorted input. This PR fixes it, adds tests for this case, and also refactors some MarkDuplicates tests.

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

Added or modified tests to cover changes and any new functionality
Edited the README / documentation (if applicable)
All tests passing on Travis

Review

Final thumbs-up from reviewer
Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

jamesemery

Mostly looks good, and I don't have very much to say, except that the logic for this code is very confusing and needs to be better documented/cleaned up to be at all comprehensible, especially since there were two very obvious mistakes in plain sight that you cleaned up over a 20 line span...

jamesemery · 2023-01-10T19:25:04Z

src/main/java/picard/sam/markduplicates/MarkDuplicates.java

@@ -400,23 +401,6 @@ protected int doWork() {
                    UmiUtil.setMolecularIdentifier(rec, "", MOLECULAR_IDENTIFIER_TAG, DUPLEX_UMI);
                }

-                // Tag any read pair that was in a duplicate set with the duplicate set size and a representative read name
-                if (TAG_DUPLICATE_SET_MEMBERS) {


was this code really copy-pasted like this all along and nobody noticed? 😞

yeah... that does appear to be the case

jamesemery · 2023-01-10T19:27:33Z

src/test/java/picard/sam/markduplicates/MarkDuplicatesTagRepresentativeReadIndexTest.java

-        tester.expectedRepresentativeIndexMap.put(8, representativeReadIndexInFileForward2);
-        tester.expectedRepresentativeIndexMap.put(11, representativeReadIndexInFileForward2);
-        tester.expectedSetSizeMap.put("RUNID:1:1:15993:13370",3);
+        final String duplicateSet2DuplicateReadName1 = "RUNID:1:1:15993:13365";


bleh is this the best of how our testing for this feature worked before? this seems very brittle... not really your problem to fix here though...

jamesemery · 2023-01-10T19:28:23Z

src/test/java/picard/sam/markduplicates/AbstractMarkDuplicatesCommandLineProgramTester.java

+    * Should be used to update any expectations to be used in implementation
+    * specific tests.
+     **/
+    void updateExpectationsHook() {


solid refactoring of the testing infrastructure.
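For readers unfamiliar with the pattern, here is a minimal sketch of how a hook like `updateExpectationsHook` is typically used. Only the hook name and `expectedSetSizeMap` come from the diffs in this PR; the class names `TesterSketch` and `QuerynameTesterSketch` and the driver method `collectExpectations` are hypothetical, and the real tester may wire the hook differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the abstract tester. Only the hook name and
// expectedSetSizeMap are taken from the diffs; everything else is assumed.
abstract class TesterSketch {
    final Map<String, Integer> expectedSetSizeMap = new HashMap<>();

    /** No-op by default; implementation-specific testers override it. */
    void updateExpectationsHook() {
    }

    /** Hypothetical driver: lets subclasses adjust expectations, then returns them. */
    final Map<String, Integer> collectExpectations() {
        updateExpectationsHook();
        return expectedSetSizeMap;
    }
}

// Hypothetical implementation-specific tester that fills in its own expectations.
class QuerynameTesterSketch extends TesterSketch {
    @Override
    void updateExpectationsHook() {
        // Read name and set size reused from the removed test line quoted earlier.
        expectedSetSizeMap.put("RUNID:1:1:15993:13370", 3);
    }
}
```

The point of the design is that shared test setup lives in the base class, while each implementation only states the expectations that differ for it.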

jamesemery · 2023-01-10T19:30:29Z

src/main/java/picard/sam/markduplicates/MarkDuplicates.java

@@ -384,13 +386,12 @@ protected int doWork() {
                    }
                    final boolean isInDuplicateSet = recordInFileIndex == nextRepresentativeIndex ||
                            (sortOrder == SAMFileHeader.SortOrder.queryname &&
-                                    recordInFileIndex > nextDuplicateIndex);
+                                    recordInFileIndex > nextRepresentativeIndex && rec.getReadName().equals(representativeQueryName));


can you rename this/comment the logic for "isInDuplicateSet" here to clarify this? It took a while to pick through what all of the steps are doing...

maybe "Is RepresentativeReadForADuplicateSet"

So this essentially says that if we are in queryname sorted mode, we mark both the first read and its mates with the duplicate sets, but if we are coordinate sorted the mates don't get the flag? Is that asymmetrical behavior desirable? (It should be noted this won't work if we are in query grouped mode even though the logic should be the same; maybe that check should be updated.)

Yeah this is super confusing, I actually couldn't figure out how this even works just now until I realized the variables are horribly named.

I believe that mates should also be marked with duplicate sets when the input is coordinate sorted (which is confirmed by our tests). The reason is that in addRepresentativeReadIndex we add an entry for both reads of the fragment if end.read1IndexInFile != end.read2IndexInFile. And when the ReadEnds object gets built we keep those entries the same for queryname sorted, but different for coordinate sorted (see L548 and thereabouts). I think this really goes back to your overall point that this code has become mangled to the point of obfuscation. I completely agree with that, but unfortunately I don't know that anyone has the capacity to fix it.

Regardless, I will try to at least make this part a little clearer
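To make the change in this hunk easier to follow, here is a minimal standalone sketch of the predicate before and after the fix. The variable names mirror the diff, but the class `DuplicateSetPredicateSketch`, the local `SortOrder` enum, the use of `long` for indices, and all example values are assumptions for illustration; the surrounding loop in `doWork()` that advances the indices is not reproduced.

```java
// Standalone sketch; names mirror the diff, but the values and the
// surrounding iteration are NOT taken from Picard.
final class DuplicateSetPredicateSketch {
    enum SortOrder { coordinate, queryname }

    /** Predicate as it read before this PR: compares against nextDuplicateIndex. */
    static boolean before(SortOrder sortOrder, long recordInFileIndex,
                          long nextRepresentativeIndex, long nextDuplicateIndex) {
        return recordInFileIndex == nextRepresentativeIndex ||
                (sortOrder == SortOrder.queryname &&
                        recordInFileIndex > nextDuplicateIndex);
    }

    /** Predicate as it reads after this PR: later record with a matching query name. */
    static boolean after(SortOrder sortOrder, long recordInFileIndex,
                         long nextRepresentativeIndex, String readName,
                         String representativeQueryName) {
        return recordInFileIndex == nextRepresentativeIndex ||
                (sortOrder == SortOrder.queryname &&
                        recordInFileIndex > nextRepresentativeIndex &&
                        readName.equals(representativeQueryName));
    }

    public static void main(String[] args) {
        // Hypothetical state: record 5 ("readB") is unrelated to the next entry,
        // which sits at index 10 ("readA"); the next duplicate index is 3.
        System.out.println(before(SortOrder.queryname, 5, 10, 3));
        System.out.println(after(SortOrder.queryname, 5, 10, "readB", "readA"));
        // A record after the entry that shares its query name (e.g. the mate).
        System.out.println(after(SortOrder.queryname, 11, 10, "readA", "readA"));
    }
}
```

With these made-up values, the old condition accepts an unrelated record simply because its index is past an unrelated counter, while the new condition only accepts records that follow the entry and share its query name.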

jamesemery · 2023-01-10T19:30:51Z

src/main/java/picard/sam/markduplicates/MarkDuplicates.java

                    if (isInDuplicateSet) {
                        if (!rec.isSecondaryOrSupplementary() && !rec.getReadUnmappedFlag()) {
-                            if (TAG_DUPLICATE_SET_MEMBERS) {


oh boy this code was not in a good state before you got to it...

kachulis · 2023-01-11T21:36:25Z

Thanks for the review @jamesemery. I agree with your comment that the code is very confusing at this point, and could greatly benefit from significant cleanup. I don't know where we find the capacity for that currently though.

I did try to at least make the part this PR affects a little clearer.

jamesemery

Thank you for the clarifying comments

kachulis added 3 commits November 16, 2022 12:34

fix duplicate set tags

306c9b5

fix DI for queryname sorted, tests, refactor tests

9ba9104

remove debugging line

2d310e2

kachulis force-pushed the ck_MD_duplicate_set_index_tag branch from 428254f to 2d310e2 Compare November 16, 2022 17:34

kachulis requested a review from jamesemery January 10, 2023 18:28

droazen assigned jamesemery Jan 10, 2023

jamesemery reviewed Jan 10, 2023

make code clearer

1e699ca

jamesemery approved these changes Feb 13, 2023

kachulis merged commit 21ccdb5 into master Feb 13, 2023

kachulis deleted the ck_MD_duplicate_set_index_tag branch February 13, 2023 19:35
