Bake off different region join implementations #395

Closed
fnothaft opened this Issue Sep 24, 2014 · 12 comments

@fnothaft
Member

fnothaft commented Sep 24, 2014

@kozanitis and I were discussing this. Now that we've got 3 proposed region join strategies, we should go ahead and run a few head-to-head tests to evaluate performance. I think the correct approach would be to define a few reasonable test cases, and to then do a one-by-one bakeoff.

Here are a few test cases that I thought of:

  • Whole genome BAM vs. windows (2-10kbp windows over the whole genome)
  • Whole exome BAM vs. exome panel BED
  • WES/WGS BAM vs. targeted sequencing panel (pick "important cancer genes", or something of that sort)
  • RNA-seq BAM vs. exons?
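For the first test case, the window set can be generated rather than downloaded. A minimal sketch in plain Python (not ADAM code; `tile_windows`, the contig name, and the contig length are made up for illustration), using 0-based half-open coordinates:

```python
from typing import Iterator, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based half-open


def tile_windows(contig: str, length: int, window: int) -> Iterator[Region]:
    """Yield consecutive, non-overlapping windows covering [0, length)."""
    if window <= 0:
        raise ValueError("window must be positive")
    for start in range(0, length, window):
        # The final window is clipped to the end of the contig.
        yield (contig, start, min(start + window, length))


# Hypothetical 25 kbp contig tiled with 10 kbp windows.
windows = list(tile_windows("chr_demo", 25_000, 10_000))
```

Writing the result out as BED would give the "windows" side of the join; the window size is the knob to sweep between 2 kbp and 10 kbp.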

I'm sure there are more cases out there; @tdanford and @carlyeks, what cases did you originally have in mind when you implemented region joins?

cc'ing @massie @laserson, would be glad to hear thoughts from any other interested parties. @kozanitis and I will start working to put together the bake-off tests on Thursday.

@fnothaft fnothaft added the discussion label Sep 24, 2014

@tdanford

Contributor

tdanford commented Oct 7, 2014

@fnothaft the main use case Carl and I had in mind when we wrote this was "join gene annotations with variants."

@laserson

Contributor

laserson commented Oct 7, 2014

@fnothaft Intersecting ENCODE tracks. Inner joins, semijoins, etc. Another difficult test case would be intersecting a set of ChIP-seq peaks with fixedWiggle with step=1. Essentially, there is a value at every base position in the genome.
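To pin down the terms for the bake-off: an inner join returns every overlapping (left, right) pair, while a semijoin returns each left region at most once if it overlaps anything on the right. A naive reference sketch in plain Python (not ADAM code; all names are illustrative), where the 1 bp `signal` regions stand in for a step=1 fixed-step wiggle track:

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based half-open


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions are on the same contig and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def inner_join(left: List[Region], right: List[Region]) -> List[Tuple[Region, Region]]:
    """Every overlapping (left, right) pair; quadratic, for reference only."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


def semi_join(left: List[Region], right: List[Region]) -> List[Region]:
    """Each left region that overlaps at least one right region, listed once."""
    return [l for l in left if any(overlaps(l, r) for r in right)]


peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
signal = [("chr1", 150, 151), ("chr1", 199, 200), ("chr1", 300, 301)]

pairs = inner_join(peaks, signal)     # two pairs, both for the first peak
hit_peaks = semi_join(peaks, signal)  # the first peak, once
```

A quadratic reference like this is only useful on tiny inputs, but it gives the distributed implementations something to be checked against.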

@arq5x

arq5x commented Jan 4, 2015

@laserson has shared a preliminary attempt at this with me. I am really interested in exploring ways to take advantage of ADAM/Spark to scale up the functionality we have already implemented in bedtools. Can you point me to use cases and/or example commands for converting BED and/or BigWig files to ADAM?

@laserson

Contributor

laserson commented Jan 4, 2015

@arq5x Check out README here for the data and a few manipulations:
https://github.com/sryza/aas/tree/master/ch10-genomics

Then the scala code for loading and joining is here:
https://github.com/sryza/aas/blob/master/ch10-genomics/src/main/scala/com/cloudera/datascience/genomics/RunTFPrediction.scala

This is from our upcoming book Advanced Analytics with Spark.

@laserson

Contributor

laserson commented Jan 4, 2015

@arq5x also, the book's code was written before the shuffle join impl I shared with you, so it uses the broadcast join.
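As a rough illustration of the difference between the two strategies, here is a single-machine sketch in plain Python. It is not ADAM's implementation: lists of lists stand in for partitions, and `BIN` and all function names are assumptions. The general idea is that a broadcast join copies the small side to every partition of the large side, while a shuffle join keys both sides by genomic bin so that only regions in the same bin are compared.

```python
from collections import defaultdict
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based half-open
BIN = 1000  # hypothetical bin width used to key the shuffle


def overlaps(a: Region, b: Region) -> bool:
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def broadcast_join(small: List[Region], large_partitions: List[List[Region]]):
    """Each partition of the large side sees a full copy of the small side."""
    out = []
    for part in large_partitions:
        out.extend((s, l) for l in part for s in small if overlaps(s, l))
    return out


def bins(r: Region):
    """All (contig, bin index) keys that a region touches."""
    contig, start, end = r
    return [(contig, b) for b in range(start // BIN, (end - 1) // BIN + 1)]


def shuffle_join(left: List[Region], right: List[Region]):
    """Group both sides by bin, then join within each bin."""
    buckets = defaultdict(lambda: ([], []))
    for l in left:
        for key in bins(l):
            buckets[key][0].append(l)
    for r in right:
        for key in bins(r):
            buckets[key][1].append(r)
    out = set()  # a pair that spans several bins is found more than once
    for ls, rs in buckets.values():
        out.update((l, r) for l in ls for r in rs if overlaps(l, r))
    return sorted(out)


genes = [("chr1", 900, 2100), ("chr2", 0, 500)]
read_partitions = [
    [("chr1", 950, 1050), ("chr1", 2000, 2100)],
    [("chr1", 3000, 3100), ("chr2", 400, 600)],
]
all_reads = [r for part in read_partitions for r in part]

via_broadcast = sorted(broadcast_join(genes, read_partitions))
via_shuffle = shuffle_join(genes, all_reads)
```

Both functions return the same pairs here. The trade-off a bake-off would measure is that the broadcast version needs the small side to fit in memory on every worker, while the shuffle version moves both datasets across the network.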

@arq5x

arq5x commented Jan 4, 2015

Thanks for this. I am still a bit unclear about the implementation of broadcast and join. Do you have any pictorial and/or verbal descriptions of how they each work?

@arq5x

arq5x commented Jan 5, 2015

Thanks - I will try to make my way through the code. Part of my confusion is in knowing which functions are distributed, which are serial, etc...

@laserson

Contributor

laserson commented Jan 5, 2015

@arq5x Yeah, I'd suggest looking at the code from a Scala IDE (like IntelliJ), which should make it easier to tell what type everything has. Anything that's an RDD or a Broadcast is distributed. Everything else is local.

@ryan-williams

Member

ryan-williams commented May 29, 2015

OOC, what are the "3 proposed region join strategies", @fnothaft? BroadcastRegionJoin, ShuffleRegionJoin, and ____?

@fnothaft

Member

fnothaft commented May 29, 2015

@kozanitis had proposed an interval tree based join in #390, but this didn't wind up merging in.
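For context, a generic sketch of what an interval tree lookup looks like, in plain Python for a single contig. This is not the code from #390; `Node`, `build`, and `query` are illustrative names, and the structure shown is one common variant (a balanced tree keyed on start, with each node recording the largest end in its subtree).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open, one contig


@dataclass
class Node:
    interval: Interval
    max_end: int  # largest end coordinate anywhere in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(intervals: List[Interval]) -> Optional[Node]:
    """Build a balanced tree by recursively taking the median of the sorted input."""
    def rec(items: List[Interval]) -> Optional[Node]:
        if not items:
            return None
        mid = len(items) // 2
        left, right = rec(items[:mid]), rec(items[mid + 1:])
        max_end = max([items[mid][1]] + [n.max_end for n in (left, right) if n])
        return Node(items[mid], max_end, left, right)
    return rec(sorted(intervals))


def query(node: Optional[Node], start: int, end: int) -> List[Interval]:
    """Return all stored intervals overlapping [start, end), ordered by start."""
    if node is None or node.max_end <= start:
        return []  # nothing in this subtree ends after the query begins
    hits = query(node.left, start, end)
    if node.interval[0] < end:
        if start < node.interval[1]:
            hits.append(node.interval)
        # Only descend right if this node starts before the query ends;
        # everything to the right starts at or after this node's start.
        hits += query(node.right, start, end)
    return hits


exons = [(100, 200), (150, 400), (500, 600), (900, 1000)]
tree = build(exons)
hits = query(tree, 180, 520)
no_hits = query(tree, 600, 900)
```

A join built on this would index one dataset in the tree and call `query` once per region of the other dataset.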

@fnothaft fnothaft added the wontfix label Jul 20, 2016

@fnothaft

Member

fnothaft commented Jul 20, 2016

Closing as won't fix, as @kozanitis had been most interested in this and has left.

@fnothaft fnothaft closed this Jul 20, 2016
