
MarkDuplicatesSpark improvements checkpoint #4656

Merged: 27 commits into master on Apr 26, 2018

Conversation

jamesemery (Collaborator)

This PR is the culmination of work by @lbergelson and me to improve the runtime of MarkDuplicatesSpark on a single machine. It involved a rewrite of the tool as well as a number of improvements that should bring it into closer agreement with Picard's MarkDuplicates.

Note: this is merely a checkpoint; there is still work to be done to bring the tool into agreement with recent MarkDuplicates development in Picard.

Resolves #3706

codecov-io commented Apr 12, 2018

Codecov Report

Merging #4656 into master will increase coverage by 0.305%.
The diff coverage is 92.758%.

@@              Coverage Diff               @@
##             master     #4656       +/-   ##
==============================================
+ Coverage     79.84%   80.146%   +0.305%     
- Complexity    17330     17620      +290     
==============================================
  Files          1074      1080        +6     
  Lines         62907     64036     +1129     
  Branches      10181     10471      +290     
==============================================
+ Hits          50225     51322     +1097     
- Misses         8701      8704        +3     
- Partials       3981      4010       +29
Impacted Files Coverage Δ Complexity Δ
...ava/org/broadinstitute/hellbender/utils/Utils.java 80.241% <0%> (-0.194%) 142 <2> (+2)
...ections/MarkDuplicatesSparkArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...nder/tools/spark/pipelines/ReadsPipelineSpark.java 89.13% <100%> (-0.231%) 12 <0> (ø)
...k/pipelines/BwaAndMarkDuplicatesPipelineSpark.java 77.778% <100%> (-1.17%) 4 <0> (ø)
...s/read/markduplicates/sparkrecords/PairedEnds.java 100% <100%> (ø) 1 <1> (?)
...der/engine/spark/datasources/ReadsSparkSource.java 82.051% <100%> (ø) 44 <5> (ø) ⬇️
...ils/read/markduplicates/sparkrecords/Fragment.java 100% <100%> (ø) 9 <9> (?)
...transforms/markduplicates/MarkDuplicatesSpark.java 95.122% <100%> (+4.213%) 15 <11> (+6) ⬆️
...itute/hellbender/engine/spark/GATKRegistrator.java 100% <100%> (ø) 3 <0> (ø) ⬇️
...icates/sparkrecords/MarkDuplicatesSparkRecord.java 100% <100%> (ø) 7 <7> (?)
... and 37 more

@droazen droazen self-requested a review April 13, 2018 15:06
@droazen droazen self-assigned this Apr 13, 2018

//register to avoid writing the full name of this class over and over
kryo.register(PairedEnds.class, new FieldSerializer<>(kryo, PairedEnds.class));
kryo.register(PairedEnds.class, new PairedEnds.PairedEndsEmptyFragmentSerializer());
Member

This seems wrong, it should be registering for each of the subclasses, not the abstract class
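For illustration, a minimal sketch of per-subclass registration, assuming the concrete record classes named elsewhere in this thread (Pair, Fragment, EmptyFragment, Passthrough) and plain FieldSerializers; this is not the PR's actual registration code:

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.serializers.FieldSerializer;

public static void registerMarkDuplicatesRecordTypes(final Kryo kryo) {
    // Register each concrete record class rather than the abstract PairedEnds
    // parent, so Kryo resolves the intended serializer for each subclass.
    kryo.register(Pair.class, new FieldSerializer<>(kryo, Pair.class));
    kryo.register(Fragment.class, new FieldSerializer<>(kryo, Fragment.class));
    kryo.register(EmptyFragment.class, new FieldSerializer<>(kryo, EmptyFragment.class));
    kryo.register(Passthrough.class, new FieldSerializer<>(kryo, Passthrough.class));
}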

lbergelson (Member) left a comment

@jamesemery some comments, not done, but if you're going to start making changes I'll get these in now

@@ -54,13 +54,16 @@
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME)
protected String output;

@Argument(fullName = "do_not_mark_unmapped_mates", doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
Member

this-isn't-kebab-case

Member

also, it should be a constant since it's duplicated a bunch of times.

Collaborator Author

done

@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = "do_not_mark_unmapped_mates", doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
Member

we might want to consider a DuplicateMarkingArgumentCollection so we don't have to duplicate this all over the place

@@ -34,6 +34,7 @@ public void test() throws Exception {
args.addInput(input);
args.addOutput(output);
args.addBooleanArgument(StandardArgumentDefinitions.DISABLE_SEQUENCE_DICT_VALIDATION_NAME, true);
args.add("--do_not_mark_unmapped_mates");
Member

use the constant

Collaborator Author

done

import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.ReadUtils;

public abstract class MarkDuplicatesSparkData {
Member

I still want a better name for this. It's super vague and confusing. I don't know what to call it though. Also, needs some javadoc here.

Collaborator Author

done

}

public abstract Type getType();
public abstract int getScore();
Member

this can be pushed down to PairedEnds

Collaborator Author

done


final List<IndexPair<GATKRead>> primaryReads = Utils.stream(keyedRead._2())
////// Making The Fragments //////
// Make a PairedEnd object with no second read for each fragment (and an empty one for each paired read)
Member

We should make our language consistent.
"Make a Fragment for each read which has no mapped mate, and a placeholder for each that does."

Member

I still think we should update this comment

.partitionBy(new KnownIndexPartitioner(reads.getNumPartitions()))
.values();

return reads.zipPartitions(repartitionedReadNames, (readsIter, readNamesIter) -> {
Member

I think we have a major bug here:

I think if we have a non-queryname / query-group sorted bam, the zip partitions will fail completely if there are multiple shards, because we don't re-sort into matching partitions the second time around. We can fix that by pulling the step that prepares the reads RDD out of transformToDuplicateNames and then reusing the intermediate sorted RDD on the writeout step (probably want to cache it as well).

Collaborator Author

Right, as we have discussed, the problem here is that it fails to route the duplicate-marking data to the correct partition for all reads in a group if they are not in the same place to start out. So you are liable to incorrectly mark reads as duplicates that should not be, some of the time. I'm going to fix this by keeping a list of indexes of interest in the MarkDuplicatesSparkData object and making sure the data gets duplicated properly during the zip, as opposed to sorting the bam before the mark phase, since that is likely to have a significant performance impact.

Collaborator Author

I opened #4701 for this issue, to expedite this branch.

Member

👍

JavaPairRDD<String, GATKRead> keyReadPairs = reads.mapToPair(read -> new Tuple2<>(ReadsKey.keyForRead(header, read), read));
keyedReads = keyReadPairs.groupByKey(numReducers);
}
static JavaPairRDD<IndexPair<String>, Integer> transformToDuplicateNames(final SAMFileHeader header, final MarkDuplicatesScoringStrategy scoringStrategy, final OpticalDuplicateFinder finder, final JavaRDD<GATKRead> reads, final int numReducers) {
Member

We have a bug that will break our assumptions here, I think. If we have a bam that is query-grouped but not queryname-sorted, which is the usual case, we won't do the adjustment for pairs at the edge of shards. We can fix that with a simple change to the check in ReadsSparkSource.putPairsInSamePartition.

Collaborator Author

done


// Mark duplicates cant properly handle templates with more than two reads in a pair
if (primaryReads.size()>2) {
throw new GATKException(String.format("Readgroup containing read %s has more than two primary reads, this is not valid", primaryReads.get(0).getValue()));
Member

This should probably be a UserException.UnsupportedFeature instead of a GATKException: it's possible in a valid bam, it's just not something we allow. The message is also a bit confusing, since "read group" is an overloaded term. Maybe: "MarkDuplicatesSpark only supports singleton fragments and pairs. We found a group with more than two primary reads (%d reads)", then list all the reads one per line.

Collaborator Author

done
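For illustration, a hedged sketch of the suggested check; the exception subclass follows the reviewer's naming and the message wording is an assumption, not the PR's final code:

// Hypothetical sketch of the suggested validation.
private static void validateAtMostTwoPrimaryReads(final List<IndexPair<GATKRead>> primaryReads) {
    if (primaryReads.size() > 2) {
        // List every offending read on its own line, as suggested above.
        final String readList = primaryReads.stream()
                .map(pair -> pair.getValue().toString())
                .collect(java.util.stream.Collectors.joining("\n"));
        throw new UserException.UnsupportedFeature(String.format(
                "MarkDuplicatesSpark only supports singleton fragments and pairs. " +
                "Found a group with %d primary reads:%n%s", primaryReads.size(), readList));
    }
}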

}).groupByKey(numReducers);
});

final JavaPairRDD<Integer, Iterable<MarkDuplicatesSparkData>> keyedPairs = pairedEnds.groupByKey(); //TODO make this a proper aggregate by key
Member

let's get rid of this comment

Collaborator Author

done

@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
Member

this should probably be an advanced argument, or a "not recommended" one, or something

Collaborator Author

I mean, how "advanced" is matching GATK3 vs. not matching GATK?

Member

very advanced. We need a "not recommended" argument option

final Map<String,Integer> namesOfNonDuplicateReadsAndOpticalCounts = Utils.stream(readNamesIter).collect(Collectors.toMap(Tuple2::_1,Tuple2::_2));
return Utils.stream(readsIter).peek(read -> {
// Handle reads that have been marked as non-duplicates (which also get tagged with optical duplicate summary statistics)
if( namesOfNonDuplicateReadsAndOpticalCounts.containsKey(read.getName())) { //todo figure out if we should be marking the unmapped mates of duplicate reads as duplicates
Member

extraneous comment here

Collaborator Author

done


import javax.validation.constraints.Max;
Member

this import seems spurious

// Place all the reads into a single RDD of MarkDuplicatesSparkRecord objects
final JavaPairRDD<Integer, MarkDuplicatesSparkRecord> pairedEnds = keyedReads.flatMapToPair(keyedRead -> {
final List<Tuple2<Integer, MarkDuplicatesSparkRecord>> out = Lists.newArrayList();
AtomicReference<IndexPair<GATKRead>> hadNonPrimaryRead = new AtomicReference<>();
Member

it would be good to not use an AtomicReference here; they have a lot of overhead

Collaborator Author

done
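For illustration, one way the AtomicReference might be avoided, assuming it was only there as an effectively-final workaround for the lambda rather than for thread safety:

// Hypothetical sketch: a plain local in an explicit loop replaces the AtomicReference.
IndexPair<GATKRead> nonPrimaryRead = null;
for (final IndexPair<GATKRead> indexPair : keyedRead._2()) {
    final GATKRead read = indexPair.getValue();
    if (read.isSecondaryAlignment() || read.isSupplementaryAlignment()) {
        nonPrimaryRead = indexPair; // remember a non-primary read for later handling
    }
}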

}

static JavaPairRDD<String, Iterable<GATKRead>> spanReadsByKey(final SAMFileHeader header, final JavaRDD<GATKRead> reads) {
JavaPairRDD<String, GATKRead> nameReadPairs = reads.mapToPair(read -> new Tuple2<>(read.getName(), read));
//todo use this instead of keeping all unmapped reads as non-duplicate
Member

can't we remove this?

}
// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupKey(MarkDuplicatesSparkRecord record) {
return record.getClass()==Passthrough.class?-1:
Member

did you open an issue to resolve the readgroup issue?

Collaborator Author

yes #4700

return record.getClass()==Passthrough.class?-1:
(((long)((PairedEnds)record).getUnclippedStartPosition()) << 32 |
((PairedEnds)record).getFirstRefIndex() << 16 );
//| ((PairedEnds)pe).getLibraryIndex())).values();
Member

dead code
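For reference, a cleaned-up sketch of the packing scheme the quoted snippet implements, with the dead code dropped; behavior is intended to match the quoted lines:

// The unclipped start goes in the high 32 bits and the reference index is
// shifted into the middle bits, so one long key drives a single groupBy
// instead of several separate grouping passes.
private static long getGroupKey(final MarkDuplicatesSparkRecord record) {
    if (record.getClass() == Passthrough.class) {
        return -1L;
    }
    final PairedEnds pe = (PairedEnds) record;
    return (((long) pe.getUnclippedStartPosition()) << 32)
            | (((long) pe.getFirstRefIndex()) << 16);
}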

* with the same name. TODO: explain why there might be more.
* (3) keyMarkDuplicatesSparkRecords with alignment info:
* (a) Generate a fragment or emptyFragment from each read if its unpaired.
* (b) Pair grouped reads reads into MarkDuplicatesSparkRecord. In most cases there will only be two reads
Member

duplicate reads

return handleFragments(pairedEnds, scoringStrategy, header).iterator();
}
/**
* Primary landing point for MarkDulicateSparkRecords:
Member

typo MarkDulicate

droazen (Collaborator) left a comment

Minor comments for you @jamesemery

return samRecord.getTransientAttribute(key);
}

public void setTransientAttribute(Object key, Object value) {
Collaborator

Add javadoc for these two new methods, with appropriate disclaimers about use of these methods, and an explanation of what the transient attributes are.
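A possible shape for that javadoc (the wording is an assumption; the semantics follow htsjdk's SAMRecord transient attributes, which are held only in memory):

/**
 * Returns the transient attribute stored under the given key, or null if none.
 * Transient attributes live only on the in-memory SAMRecord: they are never
 * serialized or written to output, so use them strictly for ephemeral bookkeeping.
 */
public Object getTransientAttribute(final Object key) {
    return samRecord.getTransientAttribute(key);
}

/**
 * Attaches an ephemeral, non-serialized value to the underlying SAMRecord.
 * Callers must not rely on the value surviving serialization or file output.
 */
public void setTransientAttribute(final Object key, final Object value) {
    samRecord.setTransientAttribute(key, value);
}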

* @param collection any Collection
* @return true if the collection exists and has elements
*/
public static boolean hasElements(Collection<?> collection){
Collaborator

isNonEmpty() or isNonEmptyCollection() might be a better name.
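For illustration, the helper under the suggested name; behavior matches the quoted javadoc ("true if the collection exists and has elements"):

// Sketch with the suggested rename; logic unchanged from the quoted helper.
public static boolean isNonEmpty(final Collection<?> collection) {
    return collection != null && !collection.isEmpty();
}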

@@ -214,7 +214,7 @@ public boolean accept(Path path) {
* so they are processed together. No shuffle is needed.
*/
JavaRDD<GATKRead> putPairsInSamePartition(final SAMFileHeader header, final JavaRDD<GATKRead> reads) {
if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname)) {
if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname) && !SAMFileHeader.GroupOrder.query.equals(header.getGroupOrder())) {
Collaborator

Can you add a comment explaining the GroupOrder vs. SortOrder check here?
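One possible wording for the requested comment, shown attached to the quoted check; the explanatory text is an assumption drawn from this thread:

final boolean queryNameSorted = header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname);
final boolean queryGrouped = SAMFileHeader.GroupOrder.query.equals(header.getGroupOrder());
// Queryname SORTING orders all reads by name; query GROUPING only guarantees
// that reads with the same name are adjacent. Either property is enough to keep
// mates together, so the repartitioning is only needed when neither holds.
if (!queryNameSorted && !queryGrouped) {
    // (existing repartitioning logic from putPairsInSamePartition)
}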

@@ -54,13 +54,16 @@
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME)
protected String output;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;
Collaborator

Move this argument and duplicates_scoring_strategy into a MarkDuplicatesSparkArgumentCollection
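A hedged sketch of what that collection might look like; the class name appears in this PR's changed-file list, but the exact fields, annotations, and defaults here are assumptions:

import java.io.Serializable;
import org.broadinstitute.barclay.argparser.Argument;

// Hypothetical sketch of MarkDuplicatesSparkArgumentCollection; doc strings
// beyond those quoted in this thread are assumptions.
public class MarkDuplicatesSparkArgumentCollection implements Serializable {
    private static final long serialVersionUID = 1L;

    @Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, optional = true,
            doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
    public boolean dontMarkUnmappedMates = false;

    @Argument(fullName = "duplicates_scoring_strategy", optional = true,
            doc = "The scoring strategy for choosing the non-duplicate among candidates.")
    public MarkDuplicatesScoringStrategy duplicatesScoringStrategy = MarkDuplicatesScoringStrategy.SUM_OF_BASE_QUALITIES;
}

Tools would then declare a single @ArgumentCollection field instead of re-declaring each @Argument.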

@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;
Collaborator

Move to argument collection, as noted above.

}
}
return reads;
private static Tuple2<IndexPair<String>, Integer> handleFragments(List<MarkDuplicatesSparkRecord> duplicateFragmentGroup) {
Collaborator

All methods in this class should have basic javadoc with at least an explanation of the method's purpose.

private static final long serialVersionUID = 1l;
private final SAMFileHeader header;
// TODO: Unify with other comparators in the codebase
public static final class PairedEndsCoordinateComparator implements Comparator<PairedEnds>, Serializable {
Collaborator

It would probably be better to have this comparator live in a standalone class in the same package instead of embedded in this utils class like this.

* the same location.
*
* Ordering is almost identical to the {@link htsjdk.samtools.SAMRecordCoordinateComparator},
* modulo a few subtle differences in tie-breaking rules for reads that share the same
Collaborator

You say that the ordering is almost identical to the SAMRecordCoordinateComparator, but it looks like you've stripped out most of the tie-breaking.
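For context, a minimal sketch of the reduced ordering being discussed (an illustration, not the PR's class): reference index first, then unclipped start, with the further SAMRecordCoordinateComparator tie-breaks (strand, name, flags) left out:

import java.io.Serializable;
import java.util.Comparator;

// Hypothetical simplified comparator for PairedEnds.
public final class PairedEndsCoordinateComparatorSketch
        implements Comparator<PairedEnds>, Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public int compare(final PairedEnds first, final PairedEnds second) {
        int result = Integer.compare(first.getFirstRefIndex(), second.getFirstRefIndex());
        if (result == 0) {
            result = Integer.compare(first.getUnclippedStartPosition(),
                                     second.getUnclippedStartPosition());
        }
        return result;
    }
}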

JavaPairRDD<Integer, GATKRead> firstKeyed = firstReads.mapToPair(read -> new Tuple2<>(ReadsKey.hashKeyForFragment(
ReadUtils.getStrandedUnclippedStart(
read),
read.isReverseStrand(),
Collaborator

Strange formatting here...

int key = library != null ? library.hashCode() : 1;
key = key * 31 + referenceIndex;
key = key * 31 + strandedUnclippedStart;
return key * 31 + (reverseStrand ? 0 : 1);
Collaborator

Add comment explaining why you took this strategy for the key generation
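For illustration, the quoted derivation wrapped in a method with the kind of comment being requested; parameter names follow the snippet above, and the stated rationale is the standard one for 31-based polynomial hashes:

public static int hashKeyForFragment(final int strandedUnclippedStart,
                                     final boolean reverseStrand,
                                     final int referenceIndex,
                                     final String library) {
    // Standard Java polynomial hash: repeatedly multiplying by the odd prime 31
    // spreads each field's contribution across the bits of the key, so fragments
    // differing in any one field land in different groupBy buckets with high probability.
    int key = library != null ? library.hashCode() : 1;
    key = key * 31 + referenceIndex;
    key = key * 31 + strandedUnclippedStart;
    return key * 31 + (reverseStrand ? 0 : 1);
}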

@@ -89,4 +98,51 @@ private String getReadGroupId(final SAMFileHeader header, final int index) {
return new Tuple2<>(key, ImmutableList.copyOf(reads));
}

@Test (enabled = false)
public void testSortOrderParitioningCorrectness() throws IOException {
Member

typo paritioning

@@ -89,4 +98,51 @@ private String getReadGroupId(final SAMFileHeader header, final int index) {
return new Tuple2<>(key, ImmutableList.copyOf(reads));
}

@Test (enabled = false)
Member

let's move this into the other branch

Collaborator Author

it's in the other branch now

@@ -38,7 +43,7 @@ public void testSpanningIterator() {
ImmutableList.of(pairIterable(1, "a"), pairIterable(2, "b"), pairIterable(1, "c")));
}

@Test(groups = "spark")
@Test(groups = "spark",enabled = false) //TODO discuss with reviewer what to do about this test. perhaps the readgroups should still be used in the name?
public void testSpanReadsByKeyWithAlternatingGroups() {
SAMFileHeader header = ArtificialReadUtils.createArtificialSamHeaderWithGroups(1, 1, 1000, 2);
Member

can we change this to be names instead of read groups in the test?

second.isReverseStrand() ? "r" : "f");
key = 31 * key + ReadUtils.getReferenceIndex(second, header);
key = 31 * key + ReadUtils.getStrandedUnclippedStart(second);
return 31 * key + (second.isReverseStrand() ? 0 : 1);
}

/**
* Makes a unique key for the read.
Collaborator

Document the reason why the read group isn't needed here.

private final transient GATKRead read;

Passthrough(GATKRead read, int partitionIndex) {
super(partitionIndex, read.getName());
Collaborator

Should make the key at construction time instead of holding a transient reference to the read (can do in a separate PR, however).

Collaborator Author

done

@DefaultSerializer(Pair.Serializer.class)
public final class Pair extends PairedEnds implements OpticalDuplicateFinder.PhysicalLocation {
protected transient GATKRead first;
protected transient GATKRead second;
Collaborator

Can you document why these can be transient (as it's not obvious that this is safe).
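One possible version of the requested documentation; the rationale below is an assumption suggested by the @DefaultSerializer(Pair.Serializer.class) line above, not confirmed by the thread:

// Hypothetical doc sketch. These fields can be transient because Pair.Serializer
// writes out only the derived duplicate-marking fields (positions, strands,
// scores), so the full GATKRead objects never need to survive Kryo serialization.
protected transient GATKRead first;
protected transient GATKRead second;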

* processing on the reads. (eg. unmapped reads we don't want to process but must be non-duplicate marked)
*/
public final class Passthrough extends MarkDuplicatesSparkRecord {
private final transient GATKRead read;
Collaborator

Can you document why these can be transient (as it's not obvious that this is safe).

@@ -121,6 +122,7 @@ public void testReadsPipelineSpark(PipelineTest params) throws IOException {
args.add("-R");
args.add(referenceFile.getAbsolutePath());
}
args.add("--"+ MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES);
Collaborator

Do you have tests that cover the case where this argument is not specified?

Collaborator Author

yes, the corresponding tests (with the same names and arguments in AbstractMarkDuplicatesTester)

@@ -158,4 +158,24 @@ public void testMarkDuplicatesSparkIntegrationTestLocal(
}
}
}

Collaborator

Do you have tests that take name-sorted input? Can you create a ticket to add more tests that use name-sorted input?

Collaborator Author

in another PR

.filter(indexPair -> !(indexPair.getValue().isSecondaryAlignment()||indexPair.getValue().isSupplementaryAlignment()))
.collect(Collectors.toList());

// Mark duplicates cant properly handle templates with more than two reads in a pair
Member

missing '


import java.util.Random;

public class ReadsKeyUnitTest {
Collaborator

All test classes should extend GATKBaseTest

@droazen droazen merged commit 7641f53 into master Apr 26, 2018
@droazen droazen deleted the lb_modify_to_zip_partitions branch April 26, 2018 13:46
lbergelson pushed a commit that referenced this pull request Apr 26, 2018
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

First part of a major rewrite of MarkDuplicatesSpark to improve performance. Tool still has a number of known issues, but is much faster than the previous version.
cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

First part of a major rewrite of MarkDuplicatesSpark to improve performance. Tool still has a number of known issues, but is much faster than the previous version.
Development

Successfully merging this pull request may close these issues.

Use Spark to speed up the MarkDuplicates and SortSam steps in the single-sample pipeline