Add outer joins #1109

Merged
heuermh merged 3 commits into bigdatagenomics:master on Oct 3, 2016

5 participants
@fnothaft
Member

fnothaft commented Aug 10, 2016

Resolves #1098. Still a WIP; needs tests, as well as more documentation.

@AmplabJenkins


AmplabJenkins Aug 10, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1377/
Test PASSed.


@fnothaft


fnothaft Aug 25, 2016

Member

Ping for review.


@fnothaft


fnothaft Aug 31, 2016

Member

@akmorrow13 I made the region joins public again (resolves #1143) in f8019d6. Can you review?


@AmplabJenkins


AmplabJenkins Aug 31, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1443/

Build result: FAILURE

GitHub pull request #1109 of commit f8019d6 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1109/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 0c0c983 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1109/merge^{commit} # timeout=10
Checking out Revision 0c0c983 (origin/pr/1109/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 0c0c983ed5f7a7ee7704cd4a00bf473dad398c3c
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft


fnothaft Sep 1, 2016

Member

Jenkins, retest this please.


@AmplabJenkins


AmplabJenkins Sep 1, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1450/
Test PASSed.


+ }
+}
+
+private trait VictimlessSortedIntervalPartitionJoin[T, U, RU] extends SortedIntervalPartitionJoin[T, U, T, RU] with Serializable {


@heuermh

heuermh Sep 6, 2016

Member

Does this mean all the other joins take victims?


@fnothaft

fnothaft Sep 6, 2016

Member

Some of the outer joins use a "victim cache" to store elements from one of the two iterators that did not match an element in the other iterator. The phrase "victim cache" comes from a type of cache that is occasionally used in computer architecture to "save" cache lines that have been evicted. I'll make a pass and add more docs.
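
For readers following along, the victim cache idea described above can be sketched on plain Scala collections. This is only an illustration under stated assumptions, not the code in this PR: `Region`, `rightOuterJoin`, and the single-contig half-open coordinates are hypothetical, and the real joins operate on partitioned, sorted iterators. Right-side regions wait in a cache while they might still overlap a later left region; when one is evicted without ever having matched, it becomes a "victim" and is emitted paired with `None`.

```scala
import scala.collection.mutable.ArrayBuffer

object VictimCacheSketch {
  // Hypothetical half-open interval [start, end) on a single contig.
  final case class Region(start: Long, end: Long) {
    def overlaps(that: Region): Boolean = start < that.end && that.start < end
  }

  // Right outer join: every right region appears at least once in the output.
  def rightOuterJoin(left: Seq[Region], right: Seq[Region]): Vector[(Option[Region], Region)] = {
    val out = ArrayBuffer.empty[(Option[Region], Region)]
    // Right regions that may still overlap a later left region, each with a "hit" flag.
    var cache = Vector.empty[(Region, Boolean)]
    val rightIter = right.sortBy(_.start).iterator.buffered

    // Evicted entries that never matched are the victims; emit them with None.
    def evict(entries: Vector[(Region, Boolean)]): Unit =
      entries.foreach { case (r, hit) => if (!hit) out += ((None, r)) }

    for (l <- left.sortBy(_.start)) {
      // Pull in right regions that start before the current left region ends.
      while (rightIter.hasNext && rightIter.head.start < l.end) {
        cache :+= ((rightIter.next(), false))
      }
      // Entries ending at or before l.start cannot overlap l or any later left region.
      val (keep, dead) = cache.partition { case (r, _) => r.end > l.start }
      evict(dead)
      // Emit matches for the current left region and remember which entries were hit.
      cache = keep.map { case (r, hit) =>
        if (l.overlaps(r)) { out += ((Some(l), r)); (r, true) } else (r, hit)
      }
    }
    // Flush the cache, then anything the right iterator never reached.
    evict(cache)
    rightIter.foreach(r => out += ((None, r)))
    out.toVector
  }
}
```

A "victimless" join in this picture would be one that simply drops evicted entries instead of emitting them, which is all an inner join needs.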


@heuermh


heuermh Sep 6, 2016

Member

I'm not technically proficient enough to review the implementations of these. The code style looks fine.

It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :)


@fnothaft


fnothaft Sep 6, 2016

Member

> It would be nice to have a table describing the different performance characteristics of these and when each would be most useful. In particular, one I could use in a presentation tomorrow night. :)

I'll make a pass and write these up. Do you actually have a presentation you need these for tomorrow? If so, give me a ping so I can figure out the best way to get it to you.


@heuermh


heuermh Sep 6, 2016

Member

> Do you actually have a presentation you need these for tomorrow?

Yeah it will be something short for a local audience, who won't necessarily care about the biology but may be interested in how we extend Spark.


@fnothaft


fnothaft Sep 6, 2016

Member

OK, cool. How about I send you a slide for that? Would you prefer Keynote, Powerpoint, Google Drive, etc...?


@heuermh


heuermh Sep 6, 2016

Member

Any format would be fine, thanks! For my own good, I want to go through these and sketch out whiteboard diagrams of what is happening, similar to those I saw in one Spark book or another.


@heuermh heuermh referenced this pull request Sep 7, 2016

Closed

Release ADAM version 0.20.0 #1048

47 of 61 tasks complete

@jpdna jpdna referenced this pull request Sep 14, 2016

Closed

Interval tree join in ADAM #1171

@jpdna


jpdna Sep 14, 2016

Member

some aside comments on #1171


+ @transient val sc: SparkContext
+
+ // Create the set of bins across the genome for parallel processing
+ protected val seqLengths = Map(sd.records.toSeq.map(rec => (rec.name, rec.length)): _*)


@akmorrow13

akmorrow13 Sep 14, 2016

Contributor

This throws an error downstream in GenomeBins because it tries to set seqLengths from sd when sd is not yet set (is null). What would be the cleanest way to ensure this doesn't happen?
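
The failure mode described here looks like Scala's usual trait initialization order problem: a strict `val` in a trait body is evaluated before a subclass body has assigned the abstract member it reads. A minimal reproduction, assuming hypothetical names (`Record`, `EagerBins`, `LazyBins`) that are not in the PR, with one possible remedy (a `lazy val`); the thread does not show how the PR itself fixed it:

```scala
object InitOrderSketch {
  // Hypothetical stand-in for a sequence dictionary record.
  final case class Record(name: String, length: Long)

  trait EagerBins {
    val sd: Seq[Record]
    // Runs during trait initialization, before a subclass body has set `sd`,
    // so `sd` is still null at this point.
    val seqLengths: Map[String, Long] = Map(sd.map(rec => (rec.name, rec.length)): _*)
  }

  trait LazyBins {
    def sd: Seq[Record]
    // Deferred until first access, by which time `sd` has been initialized.
    lazy val seqLengths: Map[String, Long] = Map(sd.map(rec => (rec.name, rec.length)): _*)
  }

  // `sd` is defined in the subclass body, so it is assigned after the trait initializes.
  def eager(): EagerBins = new EagerBins { val sd = Seq(Record("chr1", 100L)) }
  def deferred(): LazyBins = new LazyBins { val sd = Seq(Record("chr1", 100L)) }
}
```

Calling `InitOrderSketch.eager()` throws a `NullPointerException`, while `InitOrderSketch.deferred().seqLengths` evaluates normally. Passing `sd` as a class constructor parameter is another common way to avoid the problem.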


@fnothaft

fnothaft Sep 27, 2016

Member

Just pushed rebased commits with this fixed. Thanks for catching this, @akmorrow13.


@AmplabJenkins


AmplabJenkins Sep 27, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1503/
Test PASSed.


@fnothaft


fnothaft Sep 29, 2016

Member

Ping for review/merge.


@jpdna


jpdna Sep 30, 2016

Member

I'm going to merge later today unless anyone asks for more time.


genomicRdd.flattenRddByRegions()),
sequences ++ genomicRdd.sequences,
kv => { getReferenceRegions(kv._1) ++ genomicRdd.getReferenceRegions(kv._2) })
.asInstanceOf[GenomicRDD[(T, X), Z]]
}
+ def rightOuterBroadcastRegionJoin[X, Y <: GenomicRDD[X, Y], Z <: GenomicRDD[(Option[T], X), Z]](genomicRdd: GenomicRDD[X, Y])(


@heuermh

heuermh Sep 30, 2016

Member

All these public join methods on GenomicRDD need code-level docs.


@akmorrow13


akmorrow13 Sep 30, 2016

Contributor

Apparently GitHub ate my earlier comment. My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice. In my case, this would not work because both RDDs were too large to collect and broadcast. Is there any way around this?


@fnothaft


fnothaft Sep 30, 2016

Member

> My problem is that these use the BroadcastRegionJoin, collecting one of the RDDs with no notice.

What code uses the BroadcastRegionJoin? This PR largely extends the shuffle region join code (provides 5 new shuffle joins), but does extend the BroadcastRegionJoin in two places.
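
To make the distinction in this exchange concrete, here is a rough sketch of the two strategies on plain Scala collections. It assumes hypothetical names (`Region`, `broadcastStyleJoin`, `shuffleStyleJoin`, `binSize`) and a single contig; it does not use Spark or ADAM's actual join classes. The broadcast-style join needs one whole side in memory, which is the concern raised above, while the shuffle-style join keys both sides by genomic bin and only compares regions that share a bin.

```scala
object RegionJoinSketch {
  // Hypothetical half-open interval [start, end) on a single contig, with end > start.
  final case class Region(start: Long, end: Long) {
    def overlaps(that: Region): Boolean = start < that.end && that.start < end
  }

  // Broadcast style: one side is fully materialized (in Spark it would be
  // collected and broadcast) and every element of the other side probes it.
  def broadcastStyleJoin(small: Seq[Region], large: Seq[Region]): Vector[(Region, Region)] = {
    val table = small.toVector // stand-in for collect + broadcast
    large.toVector.flatMap(r => table.filter(_.overlaps(r)).map(l => (l, r)))
  }

  // All fixed-width bins that a region touches.
  def bins(r: Region, binSize: Long): Seq[Long] =
    (r.start / binSize) to ((r.end - 1) / binSize)

  // Shuffle style: both sides are keyed by bin (in Spark, a repartitioning step)
  // and only regions that land in the same bin are compared.
  def shuffleStyleJoin(left: Seq[Region], right: Seq[Region], binSize: Long): Vector[(Region, Region)] = {
    val leftByBin = left.flatMap(l => bins(l, binSize).map(b => b -> l)).groupBy(_._1)
    val rightByBin = right.flatMap(r => bins(r, binSize).map(b => b -> r)).groupBy(_._1)
    for {
      (bin, ls) <- leftByBin.toVector
      (_, l) <- ls
      (_, r) <- rightByBin.getOrElse(bin, Nil)
      // A pair can share several bins; keep it only in the bin where the overlap starts.
      if l.overlaps(r) && math.max(l.start, r.start) / binSize == bin
    } yield (l, r)
  }
}
```

Both functions return the same pairs, possibly in a different order; what differs is how much of either input has to sit in one place at once.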


@akmorrow13


akmorrow13 Sep 30, 2016

Contributor

@fnothaft Maybe it was a temporary moment of insanity, but it looks like I was wrong. I believe InnerShuffleRegionJoinAndGroupByLeft was previously calling collect somewhere, but that must have been fixed; I can no longer find it.


@fnothaft


fnothaft Sep 30, 2016

Member

@akmorrow13 no sweat!


@fnothaft


fnothaft Oct 3, 2016

Member

@heuermh added docs. Can you make another review pass?


@AmplabJenkins


AmplabJenkins Oct 3, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1513/
Test PASSed.


@heuermh

Docs read great, thanks! Found a couple of minor typos.

+ * @param genomicRdd The right RDD in the join.
+ * @return Returns a new genomic RDD containing all pairs of keys that
+ * overlapped in the genomic coordinate space, grouped together by
+ * the value they overlapped in the left RDD., and all values from the


@heuermh

heuermh Oct 3, 2016

Member

minor typo, ,.


+
+/**
+ * Extends the ShuffleRegionJoin trait to implement an inner join followed by
+ * grouping by the left value..


@heuermh

heuermh Oct 3, 2016

Member

minor typo, ..


@fnothaft


fnothaft Oct 3, 2016

Member

Thanks for catching these, @heuermh! I've fixed the typos, squashed down the documentation commit, and rebased.


@AmplabJenkins


AmplabJenkins Oct 3, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1515/
Test PASSed.


@heuermh

heuermh approved these changes Oct 3, 2016

@heuermh heuermh merged commit bd3c62a into bigdatagenomics:master Oct 3, 2016

1 check passed

default Merged build finished.
@heuermh


heuermh Oct 3, 2016

Member

Thank you, @fnothaft!

