Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabled thresholding for joins and standardized regionFn #1741

Merged

Conversation

@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Sep 21, 2017

Resolves #1739, #1740.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Sep 21, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2390/
Test PASSed.

@@ -757,18 +757,21 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging {
* @see broadcastRegionJoinAgainst
*/
def broadcastRegionJoin[X, Y <: GenomicRDD[X, Y], Z <: GenomicRDD[(T, X), Z]](
genomicRdd: GenomicRDD[X, Y])(
genomicRdd: GenomicRDD[X, Y],
threshold: Long = 0L)(

This comment has been minimized.

@heuermh

heuermh Sep 21, 2017
Member

threshold param needs scaladoc here and elsewhere.

flankSize, padding, and threshold all have similar meanings, is there any reason to choose one over the other?

Finally, adding a parameter with a Scala default value may make life more difficult for the Java API and downstream in Python and R.

This comment has been minimized.

@devin-petersohn

devin-petersohn Sep 21, 2017
Author Member

Thanks for the scaladoc reminder. I need to push the tests too.

I'm not tied to threshold, but for a join, padding and flankSize don't seem to capture the meaning of the input: to find nearby regions within some threshold.

I can rewrite it to separate the optional parameter into two different methods if that's better.

This comment has been minimized.

@fnothaft

fnothaft Sep 22, 2017
Member

+1 @heuermh RE consistency. My preference is flankSize, as that is the amount of flanking sequence to consider when running an operation. Not a huge deal one way or the other, as long as we're consistent.

Finally, adding a parameter with a Scala default value may make life more difficult for the Java API and downstream in Python and R.

I can rewrite it to separate the optional parameter into two different methods if that's better.

+1 @heuermh and +1 at separating into two methods

This comment has been minimized.

@heuermh

heuermh Sep 26, 2017
Member

Now sometimes we have four

def join(GenomicRDD)
def join(GenomicRDD, Option[Int])
def join(GenomicRDD, Long)
def join(GenomicRDD, Option[Int], Long)

I think it is ok to have only the simplest and longest forms, especially if one is an option

def join(GenomicRDD)
def join(GenomicRDD, Option[Int], Long)

This comment has been minimized.

@devin-petersohn

devin-petersohn Sep 26, 2017
Author Member

I have mixed feelings on that. I don't think 4 is too many, but I do see the issue. I don't know if we should restrict users in that way.

Related: How do you pass an Option from R/Python?

We may just keep the method signatures with the optPartitions parameter private[adam] (because hopefully in the future we can deterministically set the partition number) and go with genomicRDD: GenomicRDD, flankSize: Long for the public API. Do users need the power to set the partition count?

This comment has been minimized.

@heuermh

heuermh Sep 26, 2017
Member

How do you pass an Option from R/Python?

From Java I import scala.Option/scala.Some directly.

I found a library that did Java Optional → Scala Option conversions but didn't feel it was worth adding another dependency.

This comment has been minimized.

@devin-petersohn

devin-petersohn Sep 26, 2017
Author Member

Here is what I am proposing:

public def join(GenomicRDD)
public def join(GenomicRDD, Long)
private[rdd] def join(GenomicRDD, Option[Int], Long)

I just don't see users caring about the number of partitions for their join, and we can probably do better deterministically setting it.

This comment has been minimized.

@devin-petersohn

devin-petersohn Sep 26, 2017
Author Member

public specified for clarity.

This comment has been minimized.

@fnothaft

fnothaft Sep 26, 2017
Member

That SGTM.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Sep 21, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2391/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Few small nits. Thanks @devin-petersohn!

@@ -757,18 +757,21 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging {
* @see broadcastRegionJoinAgainst
*/
def broadcastRegionJoin[X, Y <: GenomicRDD[X, Y], Z <: GenomicRDD[(T, X), Z]](
genomicRdd: GenomicRDD[X, Y])(
genomicRdd: GenomicRDD[X, Y],
threshold: Long = 0L)(

This comment has been minimized.

@fnothaft

fnothaft Sep 22, 2017
Member

+1 @heuermh RE consistency. My preference is flankSize, as that is the amount of flanking sequence to consider when running an operation. Not a huge deal one way or the other, as long as we're consistent.

Finally, adding a parameter with a Scala default value may make life more difficult for the Java API and downstream in Python and R.

I can rewrite it to separate the optional parameter into two different methods if that's better.

+1 @heuermh and +1 at separating into two methods

genomicRdd.flattenRddByRegions()),
sequences ++ genomicRdd.sequences,
kv => {
Seq(kv._1.map(v => getReferenceRegions(v))).flatten.flatten ++
Seq(kv._1.map(v => getReferenceRegions(v)
.map(_.pad(-1 * threshold)))).flatten.flatten ++

This comment has been minimized.

@fnothaft

fnothaft Sep 22, 2017
Member

I did a double-take on the lines that pad by a negative threshold. Can you add documentation to explain that we are removing the padding that we added earlier?

This comment has been minimized.

@devin-petersohn

devin-petersohn Sep 25, 2017
Author Member

Alternatively we could add an unpad and put the documentation there. It doesn't have to be public, but it would make this cleaner.

This comment has been minimized.

@heuermh

heuermh Sep 26, 2017
Member

I think a // style single line comment would be sufficient

@devin-petersohn devin-petersohn force-pushed the devin-petersohn:issue#1739thresholdedJoins branch from 1372977 to 9cf4b26 Sep 25, 2017
@devin-petersohn
Copy link
Member Author

@devin-petersohn devin-petersohn commented Sep 25, 2017

Rebased on current master and overloaded methods with optionals. I haven't changed the threshold name yet. I want to make sure it passes with @AmplabJenkins first.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Sep 25, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2394/
Test PASSed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Sep 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2395/
Test PASSed.

@devin-petersohn devin-petersohn force-pushed the devin-petersohn:issue#1739thresholdedJoins branch from b32654d to 57929e5 Sep 26, 2017
@devin-petersohn
Copy link
Member Author

@devin-petersohn devin-petersohn commented Sep 26, 2017

Rebased and made changes to architecture reflecting this comment.

devin-petersohn added a commit to devin-petersohn/adam that referenced this pull request Sep 26, 2017
@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Sep 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2397/
Test PASSed.

@heuermh
heuermh approved these changes Oct 4, 2017
@heuermh heuermh merged commit 77a16f2 into bigdatagenomics:master Oct 4, 2017
1 of 2 checks passed
1 of 2 checks passed
codacy/pr Not so good... This pull request quality could be better.
Details
default Merged build finished.
Details
@heuermh
Copy link
Member

@heuermh heuermh commented Oct 4, 2017

Squashed and merged as commit 77a16f2. Thank you, @devin-petersohn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.