New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of MarkDuplicates which seems to improve performance #380

Merged
merged 4 commits into from Sep 18, 2014

Conversation

Projects
None yet
3 participants
@nfergu
Contributor

nfergu commented Sep 14, 2014

This is a rewrite of MarkDuplicates, which gives around a 3x performance increase over the original in my tests. It reduces the total time spent marking duplicates in a job from an average of 860 seconds to an average of 259 seconds when running against NA12878_chr20 on an Amazon r3.xlarge machine (t=23.9, df=8, p<0.0005).

Note that the above refers the time spent actually executing the duplicate marking code. There will be a much less significant impact on the entire job. For example, running a job to mark duplicates in NA12878_chr20 (already sorted, in ADAM format) took an average of 46 minutes 36 seconds with the old code and 43 minutes 57 seconds with the new code (t=6.6, df=8, p<0.0005). This is an improvement of around 5%.

The new implementation takes a similar approach to the old one -- however it is a re-write, so many variables have been renamed, functions moved around etc. This makes the diff a bit difficult to read unfortunately -- apologies for that.

I've checked the new implementation carefully to ensure that it's equivalent to the old one, but it's possible I've missed something -- let me know if I have. I've also tried hard to ensure that it's at least as readable as the old version, but let me know if anything is unclear.

val groupIsFragments = rightPos.isEmpty
// We have no pairs (only fragments) if the current group is a group of fragments
// and there is only one group in total

This comment has been minimized.

@fnothaft

fnothaft Sep 14, 2014

Member

Nifty! Nice approach.

@fnothaft

fnothaft Sep 14, 2014

Member

Nifty! Nice approach.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 14, 2014

Member

+1, this is pretty cool @nfergu! I believe @massie has some scripts for verifying the correctness of markdups, let me see if we can run those over here.

Member

fnothaft commented Sep 14, 2014

+1, this is pretty cool @nfergu! I believe @massie has some scripts for verifying the correctness of markdups, let me see if we can run those over here.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Sep 15, 2014

Member

Thanks, @nfergu! I really appreciate this contribution. The code changes look good to me. Ill dig up my test scripts and take look deeper soon.

Member

massie commented Sep 15, 2014

Thanks, @nfergu! I really appreciate this contribution. The code changes look good to me. Ill dig up my test scripts and take look deeper soon.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Sep 18, 2014

Member

I'm ready to merge this. Before I do, can you add the following unit test to MarkDuplicatesSuite?

sparkTest("read pairs that cross chromosomes") {
  val poorPairs = for (
    i <- 0 until 10;
    read <- createPair("ref0", 10, "ref1", 210, avgPhredScore = 20, readName = "poor%d".format(i))
  ) yield read
  val bestPair = createPair("ref0", 10, "ref1", 210, avgPhredScore = 30, readName = "best")
  val marked = markDuplicates(bestPair ++ poorPairs: _*)
  val (dups, nonDups) = marked.partition(_.getDuplicateRead)
  assert(nonDups.size == 2 && nonDups.forall(p => p.getReadName.toString == "best"))
  assert(dups.forall(p => p.getReadName.startsWith("poor")))
}

Btw, this test passes with your changes.

I really like how you've simplified (and improved) the code. It's easier to reason about and not surprisingly performs better. Thanks!

Member

massie commented Sep 18, 2014

I'm ready to merge this. Before I do, can you add the following unit test to MarkDuplicatesSuite?

sparkTest("read pairs that cross chromosomes") {
  val poorPairs = for (
    i <- 0 until 10;
    read <- createPair("ref0", 10, "ref1", 210, avgPhredScore = 20, readName = "poor%d".format(i))
  ) yield read
  val bestPair = createPair("ref0", 10, "ref1", 210, avgPhredScore = 30, readName = "best")
  val marked = markDuplicates(bestPair ++ poorPairs: _*)
  val (dups, nonDups) = marked.partition(_.getDuplicateRead)
  assert(nonDups.size == 2 && nonDups.forall(p => p.getReadName.toString == "best"))
  assert(dups.forall(p => p.getReadName.startsWith("poor")))
}

Btw, this test passes with your changes.

I really like how you've simplified (and improved) the code. It's easier to reason about and not surprisingly performs better. Thanks!

@nfergu

This comment has been minimized.

Show comment
Hide comment
@nfergu

nfergu Sep 18, 2014

Contributor

@massie -- test added to MarkDuplicatesSuite.

Glad you like it!

Contributor

nfergu commented Sep 18, 2014

@massie -- test added to MarkDuplicatesSuite.

Glad you like it!

massie added a commit that referenced this pull request Sep 18, 2014

Merge pull request #380 from nfergu/markduplicateschanges
Rewrite of MarkDuplicates which seems to improve performance

@massie massie merged commit da9aa88 into bigdatagenomics:master Sep 18, 2014

1 check passed

default Merged build finished.
Details
@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Sep 18, 2014

Member

Thanks, Neil!

Member

massie commented Sep 18, 2014

Thanks, Neil!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment