
Adding examples of how to use joins in the real world #1605

Merged

Conversation

@devin-petersohn
Member

@devin-petersohn devin-petersohn commented Jul 13, 2017

Resolves #890.

@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling 77ad223 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@fnothaft
Member

@fnothaft fnothaft commented Jul 13, 2017

Somehow, I am skeptical that this PR decreased coverage by 0.5%.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2216/
Test PASSed.

Member

@fnothaft fnothaft left a comment

Thanks @devin-petersohn! I've dropped a few specific notes inline. As a generalization, code isn't great standalone documentation. For each query, I'd like to see:

  1. Brief synopsis of the query (why would I run this?)
  2. The code for the query
  3. A brief discussion about the query (why it was written in a specific way, any non-obvious nits, performance implications, etc.)

E.g., for use case 2:

This query joins an RDD of Variants against an RDD of Features, and immediately performs a group-by on the Feature. This produces an RDD whose elements are a tuple containing a Feature, and all of the Variants overlapping the Feature. This query is useful for trying to identify annotated variants that may interact (identifying frameshift mutations within a transcript that may act as a pair to shift and then restore the reading frame) or as the start of a query that computes variant density over a set of genomic features.

... code ...

One important implication with this query is that the broadcast region join strategy only supports a group-by on the right side of the tuple. This means that we need to structure our query so that the variant table is the broadcast table. Since we typically expect our variant dataset to be much larger than our feature dataset, this may perform substantially worse than the shuffle join.
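For reference, a minimal sketch of what that use case 2 query might look like against the join API discussed in this PR; the `sc.loadVariants` call, the paths, and the variable names here are illustrative assumptions rather than the PR's final text:

```scala
// Hypothetical sketch: group variants by the feature they overlap.
// The broadcast strategy only supports a group-by on the right side of the tuple,
// so the variant dataset ends up as the broadcast (left) table here.
val variants = sc.loadVariants("my/variants.adam")
val features = sc.loadFeatures("my/features.adam")
val variantsByFeature = variants.broadcastRegionJoinAndGroupByRight(features)
```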

val features = sc.loadFeatures(“my/features.adam”)
// We can use ShuffleRegionJoin…
val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features)


@fnothaft

fnothaft Jul 13, 2017
Member

Can you s/_shuffle/Shuffle/g throughout this PR? Ditto with s/_bcast/Bcast/g.

@@ -339,6 +339,52 @@ val bcastFeatures = sc.loadFeatures("my/features.adam").broadcast()
val readsByFeature = reads.broadcastRegionJoinAgainst(bcastFeatures)
```

#### Examples of real-world analyses possible with the RegionJoin API


@fnothaft

fnothaft Jul 13, 2017
Member

I would drop the "Examples of..." header and segue in with a paragraph before saying "To demonstrate how these APIs can be used, we'll walk through three common queries that can be written using the region join. They are X, Y, and Z." The subheaders should stay.

val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin
val filteredGenotypes_bcast = genotypes.broadcastRegionJoin(features)


@fnothaft

fnothaft Jul 13, 2017
Member

You'd want the features to be on the right side of the join here; the features are expected to be smaller than the variant calls.


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

According to our method level documentation, features would be the right side of the join: link. Please let me know which is correct.


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

Maybe you meant to say features should be on the left side?


@fnothaft

fnothaft Jul 13, 2017
Member

Sorry for the confusion; I'd meant left.
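For concreteness, a minimal sketch of the reoriented broadcast call, assuming the `genotypes` and `features` variables from the surrounding diff:

```scala
// Sketch: with the broadcast strategy, the smaller feature dataset is the one that
// gets broadcast, so it sits on the left side of the call.
val filteredGenotypesBcast = features.broadcastRegionJoin(genotypes)
```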

###### Separate reads into overlapping and non-overlapping features
```scala
// An outer join provides us with both overlapping and non-overlapping data
val reads = sc.loadAlightments(“my/reads.adam”)


@fnothaft

fnothaft Jul 13, 2017
Member

Alightments -> Alignments

val variantsByFeature_shuffle = features.shuffleRegionJoinAndGroupByLeft(variants)
// As a BroadcastRegionJoin, it can be implemented as follows:
val variantsByFeature_bcast = variants.broadcastRegionJoinAndGroupByRight(features)


@fnothaft

fnothaft Jul 13, 2017
Member

  1. Please highlight that the sides of the join have swapped between the two queries. This didn't stick out the first time I read through.
  2. Please add text discussing the performance implications of rewriting this as a broadcast join.
@fnothaft fnothaft added this to the 0.23.0 milestone Jul 13, 2017
@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling f1183b3 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2218/
Test PASSed.

@devin-petersohn
Member Author

@devin-petersohn devin-petersohn commented Jul 13, 2017

I pushed this a little early, I'll push another update shortly.


```scala
// Inner join will filter out genotypic data not represented in the feature dataset
val genotypes = sc.loadGenotypes(“my/genotypes.adam”)


@heuermh

heuermh Jul 13, 2017
Member

I don't think the my/ here is helpful


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

I was keeping consistent with previous documentation. If you want me to change it here, I'll go ahead and change it throughout.

val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin
val joinedGenotypesBcast = genotypes.broadcastRegionJoin(features)


@heuermh

heuermh Jul 13, 2017
Member

Bcast → Broadcast

// We can use ShuffleRegionJoin…
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin


@heuermh

heuermh Jul 13, 2017
Member

A separate code block for the broadcast region join would be helpful. I know folks like to copy and paste from docs. ;)


@fnothaft

fnothaft Jul 13, 2017
Member

I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.


After the join, we can perform a predicate function on the resulting RDD to
manipulate it into providing the answer to our question. Because we were
interested in only getting the Genotypes that overlap the features, we used a


@heuermh

heuermh Jul 13, 2017
Member

interested in only getting the Genotypes → interested in the Genotypes


@fnothaft

fnothaft Jul 13, 2017
Member

We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.


```scala
// Inner join with a group by on the features
val features = sc.loadFeatures(“my/features.adam”)


@heuermh

heuermh Jul 13, 2017
Member

similar here

// As a ShuffleRegionJoin, it can be implemented as follows:
val variantsByFeatureShuffle = features.shuffleRegionJoinAndGroupByLeft(variants)
// As a BroadcastRegionJoin, it can be implemented as follows:


@heuermh

heuermh Jul 13, 2017
Member

...and separate broadcast code block


@fnothaft

fnothaft Jul 13, 2017
Member

-1 on separating into two blocks, please keep as is

```scala
// An outer join provides us with both overlapping and non-overlapping data
val reads = sc.loadAlignments(“my/reads.adam”)


@heuermh

heuermh Jul 13, 2017
Member

...and here

val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
```
Previously, we illustrated that join calls can be different between


@heuermh

heuermh Jul 13, 2017
Member

These summary paragraphs seem not so useful.


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

The following two or all of them?


@heuermh

heuermh Jul 13, 2017
Member

The Previously, and We also previously paragraphs. I don't feel strongly about it.


@fnothaft

fnothaft Jul 13, 2017
Member

I agree WRT removing We also....

I don't think we should remove Previously,, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.
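A minimal sketch of how the two strategies differ for this query (the variable names are assumed from the surrounding examples; the broadcast variant is written as a right outer join since no left outer broadcast join is provided):

```scala
// Shuffle strategy: a left outer join keeps every read, paired with Some(feature)
// when it overlaps a feature and None otherwise.
val readsToFeatures = reads.leftOuterShuffleRegionJoin(features)

// Broadcast strategy: swap the sides so the (small) feature dataset is broadcast,
// and use a right outer join to keep every read.
val featuresToReads = features.rightOuterBroadcastRegionJoin(reads)
```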

@heuermh
Member

@heuermh heuermh commented Jul 13, 2017

Sorry about the outdated comments, I was reviewing in between your pushes :)

@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling ff0b84e on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2219/
Test PASSed.

To ensure that the data is appropriately colocated, we perform a copartition
on the right dataset before the each node conducts the join locally.
ShuffleRegionJoin should be used if the right dataset is too large to send to
all nodes and both datasets have low


@fnothaft

fnothaft Jul 13, 2017
Member

should be high cardinality, no?


@fnothaft

fnothaft Jul 13, 2017
Member

Also, add that there are certain operations (e.g., full outer join) that can only be performed as a shuffle join.
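For example, a one-line sketch (the dataset variables are assumed from the surrounding examples):

```scala
// A full outer join keeps reads and features regardless of overlap; this join shape
// is only available through the shuffle strategy.
val allReadsAndFeatures = reads.fullOuterShuffleRegionJoin(features)
```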

The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you are joining a smaller dataset to a larger one and/or the datasets in the
join have high cardinality.


@fnothaft

fnothaft Jul 13, 2017
Member

should be low cardinality?

// We can use ShuffleRegionJoin…
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin


@fnothaft

fnothaft Jul 13, 2017
Member

I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.


After the join, we can perform a predicate function on the resulting RDD to
manipulate it into providing the answer to our question. Because we were
interested in only getting the Genotypes that overlap the features, we used a


@fnothaft

fnothaft Jul 13, 2017
Member

We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.

interested in only getting the Genotypes that overlap the features, we used a
predicate.

Notice that at the end of the join, we can access the RDD resulting from the


@fnothaft

fnothaft Jul 13, 2017
Member

I don't think this paragraph adds much beyond what is already said in the last sentence of the prior paragraph; let's remove it.

// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._2.isDefined

// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None)
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._2.isEmpty

val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None)
// If we used BroadcastRegionJoin, we filter by None in the keys
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._1.isDefined

// If we used BroadcastRegionJoin, we filter by None in the keys
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._1.isEmpty
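Taken together, these suggestions replace the explicit `None` comparisons with `Option`'s own predicates; a minimal sketch (the input variable names come from the surrounding diff, the output names are illustrative):

```scala
// Shuffle join output pairs each read with an Option[Feature] in the value position.
val overlapsFeatures = readsToFeatures.rdd.filter(_._2.isDefined)
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2.isEmpty)

// Broadcast join output puts the Option[Feature] in the key position instead.
val overlapsFeaturesBcast = featuresToReads.rdd.filter(_._1.isDefined)
val notOverlapsFeaturesBcast = featuresToReads.rdd.filter(_._1.isEmpty)
```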

val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
```
Previously, we illustrated that join calls can be different between


@fnothaft

fnothaft Jul 13, 2017
Member

I agree WRT removing We also....

I don't think we should remove Previously,, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.

@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling 0704601 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2220/
Test PASSed.

@coveralls

@coveralls coveralls commented Jul 14, 2017


Coverage remained the same at 84.157% when pulling 4fbe0a6 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2221/
Test PASSed.

cardinality.

Another important distinction between ShuffleRegionJoin and
The ShuffleRegionJoin is at its core a distributed sort-merge overlap join.


@fnothaft

fnothaft Jul 14, 2017
Member

is at its core -> is

To ensure that the data is appropriately colocated, we perform a copartition
on the right dataset before the each node conducts the join locally.
ShuffleRegionJoin should be used if the right dataset is too large to send to
all nodes and both datasets have high cardinality. Because of the flexibility


@fnothaft

fnothaft Jul 14, 2017
Member

I mean, this is true, but it'd be more accurate to write the converse -> "Since the broadcast join doesn't co-partition the datasets and instead sends the full right table to all nodes, some joins (e.g., left/full outer joins) cannot be written as broadcast joins."


The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you are joining a smaller dataset to a larger one and/or the datasets in the


@fnothaft

fnothaft Jul 14, 2017
Member

More so, I'd say the broadcast region join should be used when you have a dataset that is small enough to be collected and broadcast out, the larger side of the join is unsorted and either the data is so skewed that it is hard to load balance, the data is too large to be worth shuffling, or you don't want sorted output.

read a genomic dataset into memory, this condition is met.

ADAM has a variety of ShuffleRegionJoin types that you can perform on your
ADAM has a variety of ShuffleRegionJoin types that you can perform on your


@fnothaft

fnothaft Jul 14, 2017
Member

ShuffleRegionJoin types -> region joins


Each of these demonstrations illustrates the difference between calling the
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can
be expanded from. For a detailed difference on the optimal performance of


@fnothaft

fnothaft Jul 14, 2017
Member

There's not really a detailed discussion of the performance characteristics above, please strike.

```
When we switch join strategies, we need to change the dataset that is on the
left side of the join. This distinction is very important to understanding the


@fnothaft

fnothaft Jul 14, 2017
Member

It's really the opposite way around. You need to understand the architectural difference between the broadcast and shuffle strategies to understand why we change which dataset is on which side of the join.

dataset may change between BroadcastRegionJoin and ShuffleRegionJoin.
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all


@fnothaft

fnothaft Jul 14, 2017
Member

The point about locality is correct, but comment about the broadcast is somewhat misleading. If the right dataset was sorted and we did a broadcast join, we could do a shuffle free join. But, we don't guarantee any sort invariants when running the broadcast join, so we can't provide a broadcastJoinAndGroupByLeft that has predictable performance.


@devin-petersohn

devin-petersohn Jul 17, 2017
Author Member

Please see rephrased sentence below.


@fnothaft

fnothaft Jul 17, 2017
Member

Seems reasonable. Might tighten up to:

Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.

implementation is due to the fact that the left dataset is broadcasted to all
nodes. It would be impossible to perform a group by function on the resulting
join without a shuffle phase because joined tuples could be on any partition.
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the


@fnothaft

fnothaft Jul 14, 2017
Member

Rejion -> Region

nodes. It would be impossible to perform a group by function on the resulting
join without a shuffle phase because joined tuples could be on any partition.
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the
left data does not require a shuffle. This is primarily due to the invariant


@fnothaft

fnothaft Jul 14, 2017
Member

Last sentence is redundant; please remove.

feature. If a given read does not overlap with any features provided, it is
paired with a `None`. After we perform the join, we use a predicate to separate
the reads into two RDDs. This query is useful for filtering out reads based on
feature data. For example, identifying reads that overlap with ChIPSeq data to


@fnothaft

fnothaft Jul 14, 2017
Member

ATAC-seq? ChIP-seq identifies protein binding locations.

@coveralls

@coveralls coveralls commented Jul 17, 2017


Coverage increased (+0.03%) to 84.191% when pulling d2e4371 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2228/
Test PASSed.

interested in the Genotypes that overlap the features, we used a map function
to extract them.

Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin


@fnothaft

fnothaft Jul 17, 2017
Member

Another important distinction -> The difference
between ShuffleRegionJoin -> between the ShuffleRegionJoin
and BroadcastRegionJoin -> and BroadcastRegionJoin strategies

to extract them.

Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin
is that in a BroadcastRegionJoin, the left dataset is sent to all executors. In


@fnothaft

fnothaft Jul 17, 2017
Member

is that in a BroadcastRegionJoin, the left dataset is sent to all executors -> is that a broadcast join sends the left dataset to all executors.


```scala
// Inner join will filter out genotypic data not represented in the feature dataset
val genotypes = sc.loadGenotypes(“my/genotypes.adam”)


@fnothaft

fnothaft Jul 17, 2017
Member

Small nit: you've got a smattering of angled quotes (“ ”) in here. Can you make these non-angled? They render incorrectly in the PDF version if they are angled.


@fnothaft

fnothaft Jul 17, 2017
Member

There are a smattering of these throughout the new docs, not just here.

// …or BroadcastRegionJoin
val joinedGenotypesBcast = features.broadcastRegionJoin(genotypes)
// In the case that we only want Genotypes, we can use a simple predicate


@fnothaft

fnothaft Jul 17, 2017
Member

predicate -> projection

val filteredGenotypesBcast = joinedGenotypesBcast.rdd.map(_._2)
```

After the join, we can perform a predicate function on the resulting RDD to


@fnothaft

fnothaft Jul 17, 2017
Member

This is unnecessarily verbose. Suggest:

Since we are interested in the Genotypes that overlap a feature, we map over the tuples and select just the Genotype.

When we switch join strategies, we need to change the dataset that is on the
left side of the join.
To perform a `groupBy` after the join, BroadcastRegionJoin only supports


@fnothaft

fnothaft Jul 17, 2017
Member

Unnecessarily verbose. Suggest removing "To perform a `groupBy` after the join,"

To perform a `groupBy` after the join, BroadcastRegionJoin only supports
grouping by the right dataset, and ShuffleRegionJoin supports only grouping by
the left dataset. Thus, depending on the type of join, the left and right


@fnothaft

fnothaft Jul 17, 2017
Member

Can probably remove sentence starting with Thus

dataset may change between BroadcastRegionJoin and ShuffleRegionJoin.
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all


@fnothaft

fnothaft Jul 17, 2017
Member

Seems reasonable. Might tighten up to:

Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.

val featuresToReads = features.rightOuterShuffleRegionJoin(reads)
// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values


@fnothaft

fnothaft Jul 17, 2017
Member

Key/value is unintuitive to me here. Key/value implies that the key is derived from the value.

BroadcastRegionJoin broadcasts the left dataset, so a left outer join would
require an additional shuffle phase. For an outer join, using a
ShuffleRegionJoin will be cheaper if your reads are already sorted, however if
the feature dataset is small, the BroadcastRegionJoin call would likely be more


@fnothaft

fnothaft Jul 17, 2017
Member

the feature dataset is small -> the feature dataset is small and the reads are not sorted

gunjanbaid and others added 2 commits Jul 17, 2017
@coveralls

@coveralls coveralls commented Jul 18, 2017


Coverage increased (+0.4%) to 84.58% when pulling 0a51d85 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2238/
Test PASSed.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2239/

Build result: FAILURE

[...truncated 15 lines...]
Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge)
First time build. Skipping changelog.
Triggering ADAM-prb for the eight Hadoop 2.3.0/2.6.0, Scala 2.10/2.11, Spark 1.6.1/2.0.0 centos configurations.
All eight configurations completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2240/

Build result: FAILURE

[...truncated 15 lines...]
Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge)
Triggering ADAM-prb for the eight Hadoop 2.3.0/2.6.0, Scala 2.10/2.11, Spark 1.6.1/2.0.0 centos configurations.
All eight configurations completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | BroadcastRegionJoin |
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin |


@devin-petersohn

devin-petersohn Jul 18, 2017
Author Member

Thoughts on a separate table for all the broadcastRegionJoinAgainst variants?


@fnothaft

fnothaft Jul 18, 2017
Member

Instead of using a table, I'd break these out into a nested bulleted list:

  • Joins implemented across both shuffle and broadcast
    • Inner
    • ...
  • Shuffle-only joins
    • FullOuter
    • ...
  • Broadcast-only joins
    • RightAndGroupByRight
    • ...
Member

@fnothaft fnothaft left a comment

A couple more nits; this is pretty close! Thanks @devin-petersohn.

The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
larger side of the join is unsorted and either the data is so skewed that it is


@fnothaft

fnothaft Jul 18, 2017
Member

"so" is colloquial; prefer "sufficiently"


The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the


@fnothaft

fnothaft Jul 18, 2017
Member

when you have a dataset that -> when the right side of your join

also, there should be an "and" after "and broadcast out,". The condition is "dataset is small" and ("large side unsorted" or "data is skewed").

entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
larger side of the join is unsorted and either the data is so skewed that it is
hard to load balance, the data is too large to be worth shuffling, or you don't


@fnothaft

fnothaft Jul 18, 2017
Member

"larger side of the join is unsorted" and "data is too large to be worth shuffling" should be grouped together logically.


@fnothaft

fnothaft Jul 18, 2017
Member

I would change you don't want sorted output to you can tolerate unsorted output.

| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | BroadcastRegionJoin |
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin |


@fnothaft

fnothaft Jul 18, 2017
Member

Instead of using a table, I'd break these out into a nested bulleted list:

  • Joins implemented across both shuffle and broadcast
    • Inner
    • ...
  • Shuffle-only joins
    • FullOuter
    • ...
  • Broadcast-only joins
    • RightAndGroupByRight
    • ...
###### Filter Genotypes by Features

This query joins an RDD of Genotypes against an RDD of Features using an inner
join. The inner join will result in an RDD of key-value pairs, where the key is


@fnothaft

fnothaft Jul 18, 2017
Member

Still don't like key-value pairs here, because they're not key/value pairs.

this query would extract all genotypes that fall in exonic regions.

```scala
// Inner join will filter out genotypic data not represented in the feature dataset


@fnothaft

fnothaft Jul 18, 2017
Member

genotypic data -> genotypes
not represented in the feature dataset -> not covered by a feature

and select just the `Genotype`.

The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies
is that a broadcast join sends the left dataset to all executors. In this case,


@fnothaft

fnothaft Jul 18, 2017
Member

For conciseness, trim:

The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies is that a broadcast join sends the left dataset to all executors. In this case,

to

Since a broadcast join sends the left dataset to all executors,

val variantsByFeatureBcast = variants.broadcastRegionJoinAndGroupByRight(features)
```

When we switch join strategies, we need to change the dataset that is on the


@fnothaft

fnothaft Jul 18, 2017
Member

Nit: I think this text would read a bit better if "to change the dataset that is" was replaced with "to swap which dataset is"

@coveralls

@coveralls coveralls commented Jul 18, 2017


Coverage increased (+0.4%) to 84.58% when pulling cd98e93 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2242/
Test PASSed.

Member

@fnothaft fnothaft left a comment

1 really small typo, otherwise LGTM. Thanks @devin-petersohn!

When we switch join strategies, we need to change the dataset that is on the
left side of the join. BroadcastRegionJoin only supports grouping by the right
dataset, and ShuffleRegionJoin supports only grouping by the left dataset.
When we switch join strategies, to swap which dataset is on the left side of


@fnothaft

fnothaft Jul 19, 2017
Member

to swap which -> we swap which

Member

@fnothaft fnothaft left a comment

LGTM. Thanks @devin-petersohn!

@coveralls

@coveralls coveralls commented Jul 19, 2017


Coverage increased (+0.4%) to 84.58% when pulling a49cd04 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 19, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2246/
Test PASSed.

@fnothaft
Member

@fnothaft fnothaft commented Jul 19, 2017

@heuermh just going to ping you for a review. As an FYI, @devin-petersohn is going to clean up the history on the commit, so don't merge until you get an OK from him.

@fnothaft
Member

@fnothaft fnothaft commented Jul 21, 2017

Ping @heuermh

Member

@heuermh heuermh left a comment

Few minor suggestions, otherwise looks good

and select just the `Genotype`.

Since a broadcast join sends the left dataset to all executors, we chose to
send the `features` dataset because feature data is usually smaller in size


@heuermh

heuermh Jul 21, 2017
Member

data is → data are

Each of these demonstrations illustrates the difference between calling the
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can
be expanded from.


@heuermh

heuermh Jul 21, 2017
Member

How about:

These demonstrations illustrate the difference between calling
ShuffleRegionJoin and BroadcastRegionJoin and provide example code
to expand from.

This query joins an RDD of Variants against an RDD of Features, and immediately
performs a group-by on the Feature. This produces an RDD whose elements are a
tuple containing a Feature, and all of the Variants overlapping the Feature.
This query is useful for trying to identify annotated variants that may


@heuermh

heuermh Jul 21, 2017
Member

This produces an RDD whose elements are tuples containing a Feature and all of the Variants overlapping the Feature.

ShuffleRegionJoin supports only grouping by the left dataset.

The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all


@heuermh

heuermh Jul 21, 2017
Member

is broadcasted → is broadcast

@devin-petersohn
Member Author

@devin-petersohn devin-petersohn commented Jul 21, 2017

Ok to squash. I talked to @gunjanbaid and she ok'ed squash.

@fnothaft
Member

@fnothaft fnothaft commented Jul 21, 2017

Thanks @devin-petersohn and @gunjanbaid! @heuermh if the latest changes look good to you, please squash-and-merge.

@coveralls

@coveralls coveralls commented Jul 21, 2017


Coverage decreased (-0.1%) to 84.016% when pulling 6f3fe74 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 21, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2265/

Build result: FAILURE

[...truncated 15 lines...]
Checking out Revision 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 (origin/pr/1605/merge)
Triggering ADAM-prb for the eight Hadoop 2.3.0/2.6.0, Scala 2.10/2.11, Spark 1.6.1/2.1.0 centos configurations.
The 2.3.0,2.10,2.1.0 and 2.6.0,2.10,2.1.0 configurations completed with result FAILURE; the remaining six completed with result SUCCESS.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Member

@heuermh heuermh commented Jul 21, 2017

Jenkins, retest this please

@coveralls

@coveralls coveralls commented Jul 21, 2017


Coverage increased (+0.6%) to 84.743% when pulling 6f3fe74 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 21, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2269/
Test PASSed.

@fnothaft fnothaft merged commit 6abe7a6 into bigdatagenomics:master Jul 21, 2017
3 checks passed
codacy/pr: Good work! A positive pull request.
coverage/coveralls: Coverage increased (+0.6%) to 84.743%
default: Merged build finished.