Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1499] Enable reuse of broadcasted objects in region join. #1524

Merged
merged 3 commits into from May 18, 2017

Conversation

@fnothaft
Copy link
Member

@fnothaft fnothaft commented May 12, 2017

Resolves #1499.

@fnothaft fnothaft added this to the 0.23.0 milestone May 12, 2017
@coveralls
Copy link

@coveralls coveralls commented May 12, 2017

Coverage Status

Coverage decreased (-0.2%) to 81.789% when pulling cedb700 on fnothaft:issues/1499-broadcast-reuse into 18191f9 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1997/
Test PASSed.

Copy link
Member

@heuermh heuermh left a comment

Few doc improvements requested

pom.xml Outdated
@@ -496,7 +496,7 @@
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<!-- This library has no Scala 2.11 version, but using the 2 dot 10 version seems to work. -->
<!-- This library has no Scala 2.11 version, but using the 2.10 version seems to work. -->

This comment has been minimized.

@heuermh

heuermh May 12, 2017
Member

the dot was intentional, so that it doesn't get replaced by the move_to_scala scripts

This comment has been minimized.

@fnothaft

fnothaft May 13, 2017
Author Member

Oops sorry. I had done something silly and messed this up in a rebase.

val readsByFeature = panel.broadcastRegionJoin(reads)
```

We can get a handle to the broadcast panel by rewriting the code as:

This comment has been minimized.

@heuermh

heuermh May 12, 2017
Member

use of the word panel here was confusing, could it be replaced by features?

* in the product of the join. As compared to broadcastRegionJoin, this
* function allows the broadcast object to be reused across multiple joins.
*
* @note This function differs from other region joins as it treats the calling RDD

This comment has been minimized.

@heuermh

heuermh May 12, 2017
Member

could you double check that left and right and this RDD and another RDD and calling RDD are consistent here? I'm having a hard time following

This comment has been minimized.

@fnothaft

fnothaft May 13, 2017
Author Member

This is correct as written.

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented May 13, 2017

@heuermh pushed 4dc0168 to address review comments.

@coveralls
Copy link

@coveralls coveralls commented May 13, 2017

Coverage Status

Coverage decreased (-0.2%) to 81.789% when pulling 4dc0168 on fnothaft:issues/1499-broadcast-reuse into 18191f9 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2004/
Test PASSed.

@@ -412,6 +420,38 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] {
}

/**
* Performs a broadcast inner join between this RDD and data that has been broadcast.
*
* In a broadcast join, the left RDD is collected to the driver,

This comment has been minimized.

@heuermh

heuermh May 16, 2017
Member

This RDD is the right RDD, correct?
There is no left RDD, rather broadcastTree which is an IntervalArray[ReferenceRegion, X].

* function used for this join is the reference region overlap function. Since this
* is an inner join, all values who do not overlap a value from the other
* RDD are dropped. As compared to broadcastRegionJoin, this function allows the
* broadcast object to be reused across multiple joins.

This comment has been minimized.

@heuermh

heuermh May 16, 2017
Member

other RDD? Should this be
...all values in this RDD who do not overlap a value from the broadcast data are dropped.

* is a right outer join, all values in the left RDD that do not overlap a
* value from the right RDD are dropped. If a value from the right RDD does
* not overlap any values in the left RDD, it will be paired with a `None`
* in the product of the join. As compared to broadcastRegionJoin, this function allows the

This comment has been minimized.

@heuermh

heuermh May 16, 2017
Member

...still getting lost by left and right here. One of the two mentions of RDD should be replaced with broadcast data.

* Performs a broadcast right outer join between this RDD and another RDD.
*
* In a broadcast join, the left RDD (this RDD) is collected to the driver,
* and broadcast to all the nodes in the cluster. The key equality function

This comment has been minimized.

@heuermh

heuermh May 16, 2017
Member

...similar

@fnothaft fnothaft force-pushed the fnothaft:issues/1499-broadcast-reuse branch from 4dc0168 to f9c35e7 May 18, 2017
@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented May 18, 2017

@heuermh pushed some docs clarification.

@coveralls
Copy link

@coveralls coveralls commented May 18, 2017

Coverage Status

Coverage decreased (-0.2%) to 82.003% when pulling f9c35e7 on fnothaft:issues/1499-broadcast-reuse into 37b971a on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented May 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2024/
Test PASSed.

@heuermh heuermh merged commit f2d9869 into bigdatagenomics:master May 18, 2017
1 of 3 checks passed
1 of 3 checks passed
codacy/pr Not so good... This pull request quality could be better.
Details
coverage/coveralls Coverage decreased (-0.2%) to 82.003%
Details
default Merged build finished.
Details
@heuermh
Copy link
Member

@heuermh heuermh commented May 18, 2017

Thank you, @fnothaft!

@heuermh heuermh added this to Completed in Release 0.23.0 May 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.