
MarkDuplicatesSpark improvements checkpoint #4656

Merged: 27 commits into master on Apr 26, 2018

Conversation

jamesemery (Collaborator)

This PR is the culmination of work by @lbergelson and me to improve the runtime of MarkDuplicatesSpark on a single machine. It involved a rewrite of the tool as well as a number of improvements that should bring it into closer agreement with Picard's MarkDuplicates.

Note: this is merely a checkpoint; there is still work to be done to bring the tool into agreement with recent MarkDuplicates development in Picard.

Resolves #3706

codecov-io commented Apr 12, 2018

Codecov Report

Merging #4656 into master will increase coverage by 0.305%.
The diff coverage is 92.758%.

@@              Coverage Diff               @@
##             master     #4656       +/-   ##
==============================================
+ Coverage     79.84%   80.146%   +0.305%     
- Complexity    17330     17620      +290     
==============================================
  Files          1074      1080        +6     
  Lines         62907     64036     +1129     
  Branches      10181     10471      +290     
==============================================
+ Hits          50225     51322     +1097     
- Misses         8701      8704        +3     
- Partials       3981      4010       +29
Impacted Files Coverage Δ Complexity Δ
...ava/org/broadinstitute/hellbender/utils/Utils.java 80.241% <0%> (-0.194%) 142 <2> (+2)
...ections/MarkDuplicatesSparkArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...nder/tools/spark/pipelines/ReadsPipelineSpark.java 89.13% <100%> (-0.231%) 12 <0> (ø)
...k/pipelines/BwaAndMarkDuplicatesPipelineSpark.java 77.778% <100%> (-1.17%) 4 <0> (ø)
...s/read/markduplicates/sparkrecords/PairedEnds.java 100% <100%> (ø) 1 <1> (?)
...der/engine/spark/datasources/ReadsSparkSource.java 82.051% <100%> (ø) 44 <5> (ø) ⬇️
...ils/read/markduplicates/sparkrecords/Fragment.java 100% <100%> (ø) 9 <9> (?)
...transforms/markduplicates/MarkDuplicatesSpark.java 95.122% <100%> (+4.213%) 15 <11> (+6) ⬆️
...itute/hellbender/engine/spark/GATKRegistrator.java 100% <100%> (ø) 3 <0> (ø) ⬇️
...icates/sparkrecords/MarkDuplicatesSparkRecord.java 100% <100%> (ø) 7 <7> (?)
... and 37 more

@droazen droazen self-requested a review April 13, 2018 15:06
@droazen droazen self-assigned this Apr 13, 2018

//register to avoid writing the full name of this class over and over
kryo.register(PairedEnds.class, new FieldSerializer<>(kryo, PairedEnds.class));
kryo.register(PairedEnds.class, new PairedEnds.PairedEndsEmptyFragmentSerializer());
Member

This seems wrong, it should be registering for each of the subclasses, not the abstract class
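For illustration, a minimal sketch of per-subclass registration, assuming the concrete record classes named elsewhere in this thread (Pair, Fragment, EmptyFragment, Passthrough) and plain FieldSerializers; this is not the PR's actual registration code:

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.serializers.FieldSerializer;

public static void registerMarkDuplicatesRecordTypes(final Kryo kryo) {
    // Register each concrete record class rather than the abstract PairedEnds
    // parent, so Kryo resolves the intended serializer for each subclass.
    kryo.register(Pair.class, new FieldSerializer<>(kryo, Pair.class));
    kryo.register(Fragment.class, new FieldSerializer<>(kryo, Fragment.class));
    kryo.register(EmptyFragment.class, new FieldSerializer<>(kryo, EmptyFragment.class));
    kryo.register(Passthrough.class, new FieldSerializer<>(kryo, Passthrough.class));
}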

lbergelson (Member) left a comment

@jamesemery some comments, not done, but if you're going to start making changes I'll get these in now

@@ -54,13 +54,16 @@
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME)
protected String output;

@Argument(fullName = "do_not_mark_unmapped_mates", doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
Member

this-isn't-kebab-case

Member

also, it should be a constant since it's duplicated a bunch of times.

Collaborator Author

done

@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = "do_not_mark_unmapped_mates", doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
Member

we might want to consider a DuplicateMarkingArgumentCollection so we don't have to duplicate this all over the place

@@ -34,6 +34,7 @@ public void test() throws Exception {
args.addInput(input);
args.addOutput(output);
args.addBooleanArgument(StandardArgumentDefinitions.DISABLE_SEQUENCE_DICT_VALIDATION_NAME, true);
args.add("--do_not_mark_unmapped_mates");
Member

use the constant

Collaborator Author

done

import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.ReadUtils;

public abstract class MarkDuplicatesSparkData {
Member

I still want a better name for this. It's super vague and confusing. I don't know what to call it though. Also, needs some javadoc here.

Collaborator Author

done

}

public abstract Type getType();
public abstract int getScore();
Member

this can be pushed down to PairedEnds

Collaborator Author

done


final List<IndexPair<GATKRead>> primaryReads = Utils.stream(keyedRead._2())
////// Making The Fragments //////
// Make a PairedEnd object with no second read for each fragment (and an empty one for each paired read)
Member

We should make our language consistent.
"Make a Fragment for each read which has no mapped mate, and a placeholder for each that does."

Member

I still think we should update this comment

.partitionBy(new KnownIndexPartitioner(reads.getNumPartitions()))
.values();

return reads.zipPartitions(repartitionedReadNames, (readsIter, readNamesIter) -> {
Member

I think we have a major bug here:

I think if we have a non-queryname / query-group sorted bam, the zip partitions will fail completely if there are multiple shards, because we don't re-sort into matching partitions the second time around. We can fix that by pulling the step that prepares the reads RDD out of transformToDuplicateNames and then reusing the intermediate sorted RDD on the writeout step (probably want to cache it as well).

Collaborator Author

Right, as we have discussed, the problem here is that it fails to route the duplicate-marking data to the correct partition for all reads in a group if they are not in the same place to start out. So you are liable to incorrectly mark reads as duplicates that should not be, some of the time. I'm going to fix this by keeping a list of indexes of interest in the MarkDuplicatesSparkData object and making sure the data gets duplicated properly during the zip, as opposed to sorting the bam before the mark phase, since that is likely to have a significant performance impact.

Collaborator Author

I opened #4701 for this issue, to expedite this branch.

Member

👍

JavaPairRDD<String, GATKRead> keyReadPairs = reads.mapToPair(read -> new Tuple2<>(ReadsKey.keyForRead(header, read), read));
keyedReads = keyReadPairs.groupByKey(numReducers);
}
static JavaPairRDD<IndexPair<String>, Integer> transformToDuplicateNames(final SAMFileHeader header, final MarkDuplicatesScoringStrategy scoringStrategy, final OpticalDuplicateFinder finder, final JavaRDD<GATKRead> reads, final int numReducers) {
Member

We have a bug that will break our assumptions here, I think. If we have a bam that is query-grouped but not queryname-sorted, which is the usual case, we won't do the adjustment for pairs at the edge of shards. We can fix that with a simple change to the check in ReadsSparkSource.putPairsInSamePartition.

Collaborator Author

done


// Mark duplicates cant properly handle templates with more than two reads in a pair
if (primaryReads.size()>2) {
throw new GATKException(String.format("Readgroup containing read %s has more than two primary reads, this is not valid", primaryReads.get(0).getValue()));
Member

This should probably be a UserException.UnsupportedFeature instead of a GATKException: it's possible in a valid bam, it's just not something we allow. The message is also a bit confusing, since "read group" is an overloaded term. Maybe: "MarkDuplicatesSpark only supports singleton fragments and pairs. We found a group with more than two primary reads (%d reads)", then list all the reads one per line.

Collaborator Author

done
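For illustration, a hedged sketch of the suggested check; the exception subclass follows the reviewer's naming and the message wording is an assumption, not the PR's final code:

// Hypothetical sketch of the suggested validation.
private static void validateAtMostTwoPrimaryReads(final List<IndexPair<GATKRead>> primaryReads) {
    if (primaryReads.size() > 2) {
        // List every offending read on its own line, as suggested above.
        final String readList = primaryReads.stream()
                .map(pair -> pair.getValue().toString())
                .collect(java.util.stream.Collectors.joining("\n"));
        throw new UserException.UnsupportedFeature(String.format(
                "MarkDuplicatesSpark only supports singleton fragments and pairs. " +
                "Found a group with %d primary reads:%n%s", primaryReads.size(), readList));
    }
}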

}).groupByKey(numReducers);
});

final JavaPairRDD<Integer, Iterable<MarkDuplicatesSparkData>> keyedPairs = pairedEnds.groupByKey(); //TODO make this a proper aggregate by key
Member

let's get rid of this comment

Collaborator Author

done

@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
Member

this should probably be an advanced argument, or a "not recommended" one, or something

Collaborator Author

I mean, how "advanced" is matching GATK3 vs. not matching GATK?

Member

very advanced. We need a "not recommended" argument option

final Map<String,Integer> namesOfNonDuplicateReadsAndOpticalCounts = Utils.stream(readNamesIter).collect(Collectors.toMap(Tuple2::_1,Tuple2::_2));
return Utils.stream(readsIter).peek(read -> {
// Handle reads that have been marked as non-duplicates (which also get tagged with optical duplicate summary statistics)
if( namesOfNonDuplicateReadsAndOpticalCounts.containsKey(read.getName())) { //todo figure out if we should be marking the unmapped mates of duplicate reads as duplicates
Member

extraneous comment here

Collaborator Author

done


import javax.validation.constraints.Max;
Member

this import seems spurious

// Place all the reads into a single RDD of MarkDuplicatesSparkRecord objects
final JavaPairRDD<Integer, MarkDuplicatesSparkRecord> pairedEnds = keyedReads.flatMapToPair(keyedRead -> {
final List<Tuple2<Integer, MarkDuplicatesSparkRecord>> out = Lists.newArrayList();
AtomicReference<IndexPair<GATKRead>> hadNonPrimaryRead = new AtomicReference<>();
Member

it would be good to not use an AtomicReference here; they have a lot of overhead

Collaborator Author

done
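For illustration, one way the AtomicReference might be avoided, assuming it was only there as an effectively-final workaround for the lambda rather than for thread safety:

// Hypothetical sketch: a plain local in an explicit loop replaces the AtomicReference.
IndexPair<GATKRead> nonPrimaryRead = null;
for (final IndexPair<GATKRead> indexPair : keyedRead._2()) {
    final GATKRead read = indexPair.getValue();
    if (read.isSecondaryAlignment() || read.isSupplementaryAlignment()) {
        nonPrimaryRead = indexPair; // remember a non-primary read for later handling
    }
}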

}

static JavaPairRDD<String, Iterable<GATKRead>> spanReadsByKey(final SAMFileHeader header, final JavaRDD<GATKRead> reads) {
JavaPairRDD<String, GATKRead> nameReadPairs = reads.mapToPair(read -> new Tuple2<>(read.getName(), read));
//todo use this instead of keeping all unmapped reads as non-duplicate
Member

can't we remove this?

}
// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupKey(MarkDuplicatesSparkRecord record) {
return record.getClass()==Passthrough.class?-1:
Member

did you open an issue to resolve the readgroup issue?

Collaborator Author

yes #4700

return record.getClass()==Passthrough.class?-1:
(((long)((PairedEnds)record).getUnclippedStartPosition()) << 32 |
((PairedEnds)record).getFirstRefIndex() << 16 );
//| ((PairedEnds)pe).getLibraryIndex())).values();
Member

dead code
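For reference, a cleaned-up sketch of the packing scheme the quoted snippet implements, with the dead code dropped; behavior is intended to match the quoted lines:

// The unclipped start goes in the high 32 bits and the reference index is
// shifted into the middle bits, so one long key drives a single groupBy
// instead of several separate grouping passes.
private static long getGroupKey(final MarkDuplicatesSparkRecord record) {
    if (record.getClass() == Passthrough.class) {
        return -1L;
    }
    final PairedEnds pe = (PairedEnds) record;
    return (((long) pe.getUnclippedStartPosition()) << 32)
            | (((long) pe.getFirstRefIndex()) << 16);
}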

* with the same name. TODO: explain why there might be more.
* (3) keyMarkDuplicatesSparkRecords with alignment info:
* (a) Generate a fragment or emptyFragment from each read if its unpaired.
* (b) Pair grouped reads reads into MarkDuplicatesSparkRecord. In most cases there will only be two reads
Member

duplicate reads

return handleFragments(pairedEnds, scoringStrategy, header).iterator();
}
/**
* Primary landing point for MarkDulicateSparkRecords:
Member

typo MarkDulicate

droazen (Collaborator) left a comment

Minor comments for you @jamesemery

return samRecord.getTransientAttribute(key);
}

public void setTransientAttribute(Object key, Object value) {
Collaborator

Add javadoc for these two new methods, with appropriate disclaimers about use of these methods, and an explanation of what the transient attributes are.
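A possible shape for that javadoc (the wording is an assumption; the semantics follow htsjdk's SAMRecord transient attributes, which are held only in memory):

/**
 * Returns the transient attribute stored under the given key, or null if none.
 * Transient attributes live only on the in-memory SAMRecord: they are never
 * serialized or written to output, so use them strictly for ephemeral bookkeeping.
 */
public Object getTransientAttribute(final Object key) {
    return samRecord.getTransientAttribute(key);
}

/**
 * Attaches an ephemeral, non-serialized value to the underlying SAMRecord.
 * Callers must not rely on the value surviving serialization or file output.
 */
public void setTransientAttribute(final Object key, final Object value) {
    samRecord.setTransientAttribute(key, value);
}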

* @param collection any Collection
* @return true if the collection exists and has elements
*/
public static boolean hasElements(Collection<?> collection){
Collaborator

isNonEmpty() or isNonEmptyCollection() might be a better name.
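For illustration, the helper under the suggested name; behavior matches the quoted javadoc ("true if the collection exists and has elements"):

// Sketch with the suggested rename; logic unchanged from the quoted helper.
public static boolean isNonEmpty(final Collection<?> collection) {
    return collection != null && !collection.isEmpty();
}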

@@ -214,7 +214,7 @@ public boolean accept(Path path) {
* so they are processed together. No shuffle is needed.
*/
JavaRDD<GATKRead> putPairsInSamePartition(final SAMFileHeader header, final JavaRDD<GATKRead> reads) {
if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname)) {
if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname) && !SAMFileHeader.GroupOrder.query.equals(header.getGroupOrder())) {
Collaborator

Can you add a comment explaining the GroupOrder vs. SortOrder check here?
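One possible wording for the requested comment, shown attached to the quoted check; the explanatory text is an assumption drawn from this thread:

final boolean queryNameSorted = header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname);
final boolean queryGrouped = SAMFileHeader.GroupOrder.query.equals(header.getGroupOrder());
// Queryname SORTING orders all reads by name; query GROUPING only guarantees
// that reads with the same name are adjacent. Either property is enough to keep
// mates together, so the repartitioning is only needed when neither holds.
if (!queryNameSorted && !queryGrouped) {
    // (existing repartitioning logic from putPairsInSamePartition)
}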

@@ -54,13 +54,16 @@
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME)
protected String output;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;
Collaborator

Move this argument and duplicates_scoring_strategy into a MarkDuplicatesSparkArgumentCollection
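A hedged sketch of what that collection might look like; the class name appears in this PR's changed-file list, but the exact fields, annotations, and defaults here are assumptions:

import java.io.Serializable;
import org.broadinstitute.barclay.argparser.Argument;

// Hypothetical sketch of MarkDuplicatesSparkArgumentCollection; doc strings
// beyond those quoted in this thread are assumptions.
public class MarkDuplicatesSparkArgumentCollection implements Serializable {
    private static final long serialVersionUID = 1L;

    @Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, optional = true,
            doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
    public boolean dontMarkUnmappedMates = false;

    @Argument(fullName = "duplicates_scoring_strategy", optional = true,
            doc = "The scoring strategy for choosing the non-duplicate among candidates.")
    public MarkDuplicatesScoringStrategy duplicatesScoringStrategy = MarkDuplicatesScoringStrategy.SUM_OF_BASE_QUALITIES;
}

Tools would then declare a single @ArgumentCollection field instead of re-declaring each @Argument.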

@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;
Collaborator

Move to argument collection, as noted above.

}
}
return reads;
private static Tuple2<IndexPair<String>, Integer> handleFragments(List<MarkDuplicatesSparkRecord> duplicateFragmentGroup) {
Collaborator

All methods in this class should have basic javadoc with at least an explanation of the method's purpose.

private static final long serialVersionUID = 1l;
private final SAMFileHeader header;
// TODO: Unify with other comparators in the codebase
public static final class PairedEndsCoordinateComparator implements Comparator<PairedEnds>, Serializable {
Collaborator

It would probably be better to have this comparator live in a standalone class in the same package instead of embedded in this utils class like this.

* the same location.
*
* Ordering is almost identical to the {@link htsjdk.samtools.SAMRecordCoordinateComparator},
* modulo a few subtle differences in tie-breaking rules for reads that share the same
Collaborator

You say that the ordering is almost identical to the SAMRecordCoordinateComparator, but it looks like you've stripped out most of the tie-breaking.
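For context, a minimal sketch of the reduced ordering being discussed (an illustration, not the PR's class): reference index first, then unclipped start, with the further SAMRecordCoordinateComparator tie-breaks (strand, name, flags) left out:

import java.io.Serializable;
import java.util.Comparator;

// Hypothetical simplified comparator for PairedEnds.
public final class PairedEndsCoordinateComparatorSketch
        implements Comparator<PairedEnds>, Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public int compare(final PairedEnds first, final PairedEnds second) {
        int result = Integer.compare(first.getFirstRefIndex(), second.getFirstRefIndex());
        if (result == 0) {
            result = Integer.compare(first.getUnclippedStartPosition(),
                                     second.getUnclippedStartPosition());
        }
        return result;
    }
}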

JavaPairRDD<Integer, GATKRead> firstKeyed = firstReads.mapToPair(read -> new Tuple2<>(ReadsKey.hashKeyForFragment(
ReadUtils.getStrandedUnclippedStart(
read),
read.isReverseStrand(),
Collaborator

Strange formatting here...

int key = library != null ? library.hashCode() : 1;
key = key * 31 + referenceIndex;
key = key * 31 + strandedUnclippedStart;
return key * 31 + (reverseStrand ? 0 : 1);
Collaborator

Add comment explaining why you took this strategy for the key generation
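For illustration, the quoted derivation wrapped in a method with the kind of comment being requested; parameter names follow the snippet above, and the stated rationale is the standard one for 31-based polynomial hashes:

public static int hashKeyForFragment(final int strandedUnclippedStart,
                                     final boolean reverseStrand,
                                     final int referenceIndex,
                                     final String library) {
    // Standard Java polynomial hash: repeatedly multiplying by the odd prime 31
    // spreads each field's contribution across the bits of the key, so fragments
    // differing in any one field land in different groupBy buckets with high probability.
    int key = library != null ? library.hashCode() : 1;
    key = key * 31 + referenceIndex;
    key = key * 31 + strandedUnclippedStart;
    return key * 31 + (reverseStrand ? 0 : 1);
}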

@@ -89,4 +98,51 @@ private String getReadGroupId(final SAMFileHeader header, final int index) {
return new Tuple2<>(key, ImmutableList.copyOf(reads));
}

@Test (enabled = false)
public void testSortOrderParitioningCorrectness() throws IOException {
Member

typo paritioning

@@ -89,4 +98,51 @@ private String getReadGroupId(final SAMFileHeader header, final int index) {
return new Tuple2<>(key, ImmutableList.copyOf(reads));
}

@Test (enabled = false)
Member

let's move this into the other branch

Collaborator Author

it's in the other branch now

@@ -38,7 +43,7 @@ public void testSpanningIterator() {
ImmutableList.of(pairIterable(1, "a"), pairIterable(2, "b"), pairIterable(1, "c")));
}

@Test(groups = "spark")
@Test(groups = "spark",enabled = false) //TODO discuss with reviewer what to do about this test. perhaps the readgroups should still be used in the name?
public void testSpanReadsByKeyWithAlternatingGroups() {
SAMFileHeader header = ArtificialReadUtils.createArtificialSamHeaderWithGroups(1, 1, 1000, 2);
Member

can we change this to be names instead of read groups in the test?

second.isReverseStrand() ? "r" : "f");
key = 31 * key + ReadUtils.getReferenceIndex(second, header);
key = 31 * key + ReadUtils.getStrandedUnclippedStart(second);
return 31 * key + (second.isReverseStrand() ? 0 : 1);
}

/**
* Makes a unique key for the read.
Collaborator

Document the reason why the read group isn't needed here.

private final transient GATKRead read;

Passthrough(GATKRead read, int partitionIndex) {
super(partitionIndex, read.getName());
Collaborator

Should make the key at construction time instead of holding a transient reference to the read (can do in a separate PR, however).

Collaborator Author

done

@DefaultSerializer(Pair.Serializer.class)
public final class Pair extends PairedEnds implements OpticalDuplicateFinder.PhysicalLocation {
protected transient GATKRead first;
protected transient GATKRead second;
Collaborator

Can you document why these can be transient (as it's not obvious that this is safe).
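One possible version of the requested documentation; the rationale below is an assumption suggested by the @DefaultSerializer(Pair.Serializer.class) line above, not confirmed by the thread:

// Hypothetical doc sketch. These fields can be transient because Pair.Serializer
// writes out only the derived duplicate-marking fields (positions, strands,
// scores), so the full GATKRead objects never need to survive Kryo serialization.
protected transient GATKRead first;
protected transient GATKRead second;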

* processing on the reads. (eg. unmapped reads we don't want to process but must be non-duplicate marked)
*/
public final class Passthrough extends MarkDuplicatesSparkRecord {
private final transient GATKRead read;
Collaborator

Can you document why these can be transient (as it's not obvious that this is safe).

@@ -121,6 +122,7 @@ public void testReadsPipelineSpark(PipelineTest params) throws IOException {
args.add("-R");
args.add(referenceFile.getAbsolutePath());
}
args.add("--"+ MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES);
Collaborator

Do you have tests that cover the case where this argument is not specified?

Collaborator Author

yes, the corresponding tests (with the same names and arguments in AbstractMarkDuplicatesTester)

@@ -158,4 +158,24 @@ public void testMarkDuplicatesSparkIntegrationTestLocal(
}
}
}

Collaborator

Do you have tests that take name-sorted input? Can you create a ticket to add more tests that use name-sorted input?

Collaborator Author

in another PR

.filter(indexPair -> !(indexPair.getValue().isSecondaryAlignment()||indexPair.getValue().isSupplementaryAlignment()))
.collect(Collectors.toList());

// Mark duplicates cant properly handle templates with more than two reads in a pair
Member

missing '


import java.util.Random;

public class ReadsKeyUnitTest {
Collaborator

All test classes should extend GATKBaseTest

@droazen droazen merged commit 7641f53 into master Apr 26, 2018
@droazen droazen deleted the lb_modify_to_zip_partitions branch April 26, 2018 13:46
lbergelson pushed a commit that referenced this pull request Apr 26, 2018
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

First part of a major rewrite of MarkDuplicatesSpark to improve performance. Tool still has a number of known issues, but is much faster than the previous version.
cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

First part of a major rewrite of MarkDuplicatesSpark to improve performance. Tool still has a number of known issues, but is much faster than the previous version.
Development

Successfully merging this pull request may close these issues.

Use Spark to speed up the MarkDuplicates and SortSam steps in the single-sample pipeline