
Adding examples of how to use joins in the real world #1605

Merged

Conversation

@devin-petersohn
Member

@devin-petersohn devin-petersohn commented Jul 13, 2017

Resolves #890.

@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling 77ad223 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@fnothaft
Member

@fnothaft fnothaft commented Jul 13, 2017

Somehow, I am skeptical that this PR decreased coverage by 0.5%.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2216/
Test PASSed.

Member

@fnothaft fnothaft left a comment

Thanks @devin-petersohn! I've dropped a few specific notes inline. As a generalization, code isn't great standalone documentation. For each query, I'd like to see:

  1. Brief synopsis of the query (why would I run this?)
  2. The code for the query
  3. A brief discussion about the query (why it was written in a specific way, any non-obvious nits, performance implications, etc.)

E.g., for use case 2:

This query joins an RDD of Variants against an RDD of Features, and immediately performs a group-by on the Feature. This produces an RDD whose elements are a tuple containing a Feature, and all of the Variants overlapping the Feature. This query is useful for trying to identify annotated variants that may interact (identifying frameshift mutations within a transcript that may act as a pair to shift and then restore the reading frame) or as the start of a query that computes variant density over a set of genomic features.

... code ...

One important implication with this query is that the broadcast region join strategy only supports a group-by on the right side of the tuple. This means that we need to structure our query so that the variant table is the broadcast table. Since we typically expect our variant dataset to be much larger than our feature dataset, this may perform substantially worse than the shuffle join.
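For reference, a minimal sketch of what that use case 2 query might look like against the join API discussed in this PR; the `sc.loadVariants` call, the paths, and the variable names here are illustrative assumptions rather than the PR's final text:

```scala
// Hypothetical sketch: group variants by the feature they overlap.
// The broadcast strategy only supports a group-by on the right side of the tuple,
// so the variant dataset ends up as the broadcast (left) table here.
val variants = sc.loadVariants("my/variants.adam")
val features = sc.loadFeatures("my/features.adam")
val variantsByFeature = variants.broadcastRegionJoinAndGroupByRight(features)
```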

val features = sc.loadFeatures(“my/features.adam”)
// We can use ShuffleRegionJoin…
val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features)


@fnothaft

fnothaft Jul 13, 2017
Member

Can you s/_shuffle/Shuffle/g throughout this PR? Ditto with s/_bcast/Bcast/g.

@@ -339,6 +339,52 @@ val bcastFeatures = sc.loadFeatures("my/features.adam").broadcast()
val readsByFeature = reads.broadcastRegionJoinAgainst(bcastFeatures)
```

#### Examples of real-world analyses possible with the RegionJoin API


@fnothaft

fnothaft Jul 13, 2017
Member

I would drop the "Examples of..." header and segue in with a paragraph before saying "To demonstrate how these APIs can be used, we'll walk through three common queries that can be written using the region join. They are X, Y, and Z." The subheaders should stay.

val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin
val filteredGenotypes_bcast = genotypes.broadcastRegionJoin(features)


@fnothaft

fnothaft Jul 13, 2017
Member

You'd want the features to be on the right side of the join here; the features are expected to be smaller than the variant calls.


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

According to our method level documentation, features would be the right side of the join: link. Please let me know which is correct.


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

Maybe you meant to say features should be on the left side?


@fnothaft

fnothaft Jul 13, 2017
Member

Sorry for the confusion; I'd meant left.
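For concreteness, a minimal sketch of the reoriented broadcast call, assuming the `genotypes` and `features` variables from the surrounding diff:

```scala
// Sketch: with the broadcast strategy, the smaller feature dataset is the one that
// gets broadcast, so it sits on the left side of the call.
val filteredGenotypesBcast = features.broadcastRegionJoin(genotypes)
```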

###### Separate reads into overlapping and non-overlapping features
```scala
// An outer join provides us with both overlapping and non-overlapping data
val reads = sc.loadAlightments(“my/reads.adam”)


@fnothaft

fnothaft Jul 13, 2017
Member

Alightments -> Alignments

val variantsByFeature_shuffle = features.shuffleRegionJoinAndGroupByLeft(variants)
// As a BroadcastRegionJoin, it can be implemented as follows:
val variantsByFeature_bcast = variants.broadcastRegionJoinAndGroupByRight(features)


@fnothaft

fnothaft Jul 13, 2017
Member

  1. Please highlight that the sides of the join have swapped between the two queries. This didn't stick out the first time I read through.
  2. Please add text discussing the performance implications of rewriting this as a broadcast join.
@fnothaft fnothaft added this to the 0.23.0 milestone Jul 13, 2017
@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling f1183b3 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2218/
Test PASSed.

@devin-petersohn
Member Author

@devin-petersohn devin-petersohn commented Jul 13, 2017

I pushed this a little early, I'll push another update shortly.


```scala
// Inner join will filter out genotypic data not represented in the feature dataset
val genotypes = sc.loadGenotypes(“my/genotypes.adam”)


@heuermh

heuermh Jul 13, 2017
Member

I don't think the my/ here is helpful


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

I was keeping consistent with previous documentation. If you want me to change it here, I'll go ahead and change it throughout.

val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin
val joinedGenotypesBcast = genotypes.broadcastRegionJoin(features)


@heuermh

heuermh Jul 13, 2017
Member

Bcast → Broadcast

// We can use ShuffleRegionJoin…
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin


@heuermh

heuermh Jul 13, 2017
Member

A separate code block for the broadcast region join would be helpful. I know folks like to copy and paste from docs. ;)


@fnothaft

fnothaft Jul 13, 2017
Member

I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.


After the join, we can perform a predicate function on the resulting RDD to
manipulate it into providing the answer to our question. Because we were
interested in only getting the Genotypes that overlap the features, we used a


@heuermh

heuermh Jul 13, 2017
Member

interested in only getting the Genotypes → interested in the Genotypes


@fnothaft

fnothaft Jul 13, 2017
Member

We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.


```scala
// Inner join with a group by on the features
val features = sc.loadFeatures(“my/features.adam”)


@heuermh

heuermh Jul 13, 2017
Member

similar here

// As a ShuffleRegionJoin, it can be implemented as follows:
val variantsByFeatureShuffle = features.shuffleRegionJoinAndGroupByLeft(variants)
// As a BroadcastRegionJoin, it can be implemented as follows:


@heuermh

heuermh Jul 13, 2017
Member

...and separate broadcast code block


@fnothaft

fnothaft Jul 13, 2017
Member

-1 on separating into two blocks, please keep as is

```scala
// An outer join provides us with both overlapping and non-overlapping data
val reads = sc.loadAlignments(“my/reads.adam”)


@heuermh

heuermh Jul 13, 2017
Member

...and here

val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
```
Previously, we illustrated that join calls can be different between


@heuermh

heuermh Jul 13, 2017
Member

These summary paragraphs seem not so useful.


@devin-petersohn

devin-petersohn Jul 13, 2017
Author Member

The following two or all of them?


@heuermh

heuermh Jul 13, 2017
Member

The Previously, and We also previously paragraphs. I don't feel strongly about it.


@fnothaft

fnothaft Jul 13, 2017
Member

I agree WRT removing We also....

I don't think we should remove Previously,, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.
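A minimal sketch of how the two strategies differ for this query (the variable names are assumed from the surrounding examples; the broadcast variant is written as a right outer join since no left outer broadcast join is provided):

```scala
// Shuffle strategy: a left outer join keeps every read, paired with Some(feature)
// when it overlaps a feature and None otherwise.
val readsToFeatures = reads.leftOuterShuffleRegionJoin(features)

// Broadcast strategy: swap the sides so the (small) feature dataset is broadcast,
// and use a right outer join to keep every read.
val featuresToReads = features.rightOuterBroadcastRegionJoin(reads)
```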

@heuermh
Member

@heuermh heuermh commented Jul 13, 2017

Sorry about the outdated comments, I was reviewing in between your pushes :)

@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling ff0b84e on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2219/
Test PASSed.

To ensure that the data is appropriately colocated, we perform a copartition
on the right dataset before the each node conducts the join locally.
ShuffleRegionJoin should be used if the right dataset is too large to send to
all nodes and both datasets have low


@fnothaft

fnothaft Jul 13, 2017
Member

should be high cardinality, no?


@fnothaft

fnothaft Jul 13, 2017
Member

Also, add that there are certain operations (e.g., full outer join) that can only be performed as a shuffle join.
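For example, a one-line sketch (the dataset variables are assumed from the surrounding examples):

```scala
// A full outer join keeps reads and features regardless of overlap; this join shape
// is only available through the shuffle strategy.
val allReadsAndFeatures = reads.fullOuterShuffleRegionJoin(features)
```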

The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you are joining a smaller dataset to a larger one and/or the datasets in the
join have high cardinality.


@fnothaft

fnothaft Jul 13, 2017
Member

should be low cardinality?

// We can use ShuffleRegionJoin…
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)
// …or BroadcastRegionJoin


@fnothaft

fnothaft Jul 13, 2017
Member

I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.


After the join, we can perform a predicate function on the resulting RDD to
manipulate it into providing the answer to our question. Because we were
interested in only getting the Genotypes that overlap the features, we used a


@fnothaft

fnothaft Jul 13, 2017
Member

We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.

interested in only getting the Genotypes that overlap the features, we used a
predicate.

Notice that at the end of the join, we can access the RDD resulting from the


@fnothaft

fnothaft Jul 13, 2017
Member

I don't think this paragraph adds much beyond what is already said in the last sentence of the prior paragraph; let's remove it.

// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._2.isDefined

// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None)
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._2.isEmpty

val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None)
// If we used BroadcastRegionJoin, we filter by None in the keys
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._1.isDefined

// If we used BroadcastRegionJoin, we filter by None in the keys
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)


@fnothaft

fnothaft Jul 13, 2017
Member

_._1.isEmpty
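Taken together, these suggestions replace the explicit `None` comparisons with `Option`'s own predicates; a minimal sketch (the input variable names come from the surrounding diff, the output names are illustrative):

```scala
// Shuffle join output pairs each read with an Option[Feature] in the value position.
val overlapsFeatures = readsToFeatures.rdd.filter(_._2.isDefined)
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2.isEmpty)

// Broadcast join output puts the Option[Feature] in the key position instead.
val overlapsFeaturesBcast = featuresToReads.rdd.filter(_._1.isDefined)
val notOverlapsFeaturesBcast = featuresToReads.rdd.filter(_._1.isEmpty)
```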

val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
```
Previously, we illustrated that join calls can be different between


@fnothaft

fnothaft Jul 13, 2017
Member

I agree WRT removing We also....

I don't think we should remove Previously,, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.

@coveralls

@coveralls coveralls commented Jul 13, 2017


Coverage remained the same at 84.157% when pulling 0704601 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2220/
Test PASSed.

@coveralls

@coveralls coveralls commented Jul 14, 2017


Coverage remained the same at 84.157% when pulling 4fbe0a6 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2221/
Test PASSed.

cardinality.

Another important distinction between ShuffleRegionJoin and
The ShuffleRegionJoin is at its core a distributed sort-merge overlap join.


@fnothaft

fnothaft Jul 14, 2017
Member

is at its core -> is

To ensure that the data is appropriately colocated, we perform a copartition
on the right dataset before the each node conducts the join locally.
ShuffleRegionJoin should be used if the right dataset is too large to send to
all nodes and both datasets have high cardinality. Because of the flexibility


@fnothaft

fnothaft Jul 14, 2017
Member

I mean, this is true, but it'd be more accurate to write the converse -> "Since the broadcast join doesn't co-partition the datasets and instead sends the full right table to all nodes, some joins (e.g., left/full outer joins) cannot be written as broadcast joins."


The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you are joining a smaller dataset to a larger one and/or the datasets in the


@fnothaft

fnothaft Jul 14, 2017
Member

More so, I'd say the broadcast region join should be used when you have a dataset that is small enough to be collected and broadcast out, the larger side of the join is unsorted and either the data is so skewed that it is hard to load balance, the data is too large to be worth shuffling, or you don't want sorted output.

read a genomic dataset into memory, this condition is met.

ADAM has a variety of ShuffleRegionJoin types that you can perform on your
ADAM has a variety of ShuffleRegionJoin types that you can perform on your


@fnothaft

fnothaft Jul 14, 2017
Member

ShuffleRegionJoin types -> region joins


Each of these demonstrations illustrates the difference between calling the
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can
be expanded from. For a detailed difference on the optimal performance of


@fnothaft

fnothaft Jul 14, 2017
Member

There's not really a detailed discussion of the performance characteristics above, please strike.

```
When we switch join strategies, we need to change the dataset that is on the
left side of the join. This distinction is very important to understanding the


@fnothaft

fnothaft Jul 14, 2017
Member

It's really the opposite way around. You need to understand the architectural difference between the broadcast and shuffle strategies to understand why we change which dataset is on which side of the join.

dataset may change between BroadcastRegionJoin and ShuffleRegionJoin.
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all


@fnothaft

fnothaft Jul 14, 2017
Member

The point about locality is correct, but comment about the broadcast is somewhat misleading. If the right dataset was sorted and we did a broadcast join, we could do a shuffle free join. But, we don't guarantee any sort invariants when running the broadcast join, so we can't provide a broadcastJoinAndGroupByLeft that has predictable performance.


@devin-petersohn

devin-petersohn Jul 17, 2017
Author Member

Please see rephrased sentence below.


@fnothaft

fnothaft Jul 17, 2017
Member

Seems reasonable. Might tighten up to:

Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.

implementation is due to the fact that the left dataset is broadcasted to all
nodes. It would be impossible to perform a group by function on the resulting
join without a shuffle phase because joined tuples could be on any partition.
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the


@fnothaft

fnothaft Jul 14, 2017
Member

Rejion -> Region

nodes. It would be impossible to perform a group by function on the resulting
join without a shuffle phase because joined tuples could be on any partition.
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the
left data does not require a shuffle. This is primarily due to the invariant


@fnothaft

fnothaft Jul 14, 2017
Member

Last sentence is redundant; please remove.

feature. If a given read does not overlap with any features provided, it is
paired with a `None`. After we perform the join, we use a predicate to separate
the reads into two RDDs. This query is useful for filtering out reads based on
feature data. For example, identifying reads that overlap with ChIPSeq data to


@fnothaft

fnothaft Jul 14, 2017
Member

ATAC-seq? ChIP-seq identifies protein binding locations.

@coveralls

@coveralls coveralls commented Jul 17, 2017


Coverage increased (+0.03%) to 84.191% when pulling d2e4371 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 17, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2228/
Test PASSed.

interested in the Genotypes that overlap the features, we used a map function
to extract them.

Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin


@fnothaft

fnothaft Jul 17, 2017
Member

Another important distinction -> The difference
between ShuffleRegionJoin -> between the ShuffleRegionJoin
and BroadcastRegionJoin -> and BroadcastRegionJoin strategies

to extract them.

Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin
is that in a BroadcastRegionJoin, the left dataset is sent to all executors. In


@fnothaft

fnothaft Jul 17, 2017
Member

is that in a BroadcastRegionJoin, the left dataset is sent to all executors -> is that a broadcast join sends the left dataset to all executors.


```scala
// Inner join will filter out genotypic data not represented in the feature dataset
val genotypes = sc.loadGenotypes(“my/genotypes.adam”)


@fnothaft

fnothaft Jul 17, 2017
Member

Small nit: you've got a smattering of angled quotes (“ ”) in here. Can you make these non-angled? They render incorrectly in the PDF version if they are angled.


@fnothaft

fnothaft Jul 17, 2017
Member

There are a smattering of these throughout the new docs, not just here.

// …or BroadcastRegionJoin
val joinedGenotypesBcast = features.broadcastRegionJoin(genotypes)
// In the case that we only want Genotypes, we can use a simple predicate


@fnothaft

fnothaft Jul 17, 2017
Member

predicate -> projection

val filteredGenotypesBcast = joinedGenotypesBcast.rdd.map(_._2)
```

After the join, we can perform a predicate function on the resulting RDD to


@fnothaft

fnothaft Jul 17, 2017
Member

This is unnecessarily verbose. Suggest:

Since we are interested in the Genotypes that overlap a feature, we map over the tuples and select just the Genotype.

When we switch join strategies, we need to change the dataset that is on the
left side of the join.
To perform a `groupBy` after the join, BroadcastRegionJoin only supports


@fnothaft

fnothaft Jul 17, 2017
Member

Unnecessarily verbose. Suggest removing "To perform a `groupBy` after the join,"

To perform a `groupBy` after the join, BroadcastRegionJoin only supports
grouping by the right dataset, and ShuffleRegionJoin supports only grouping by
the left dataset. Thus, depending on the type of join, the left and right


@fnothaft

fnothaft Jul 17, 2017
Member

Can probably remove sentence starting with Thus

dataset may change between BroadcastRegionJoin and ShuffleRegionJoin.
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all


@fnothaft

fnothaft Jul 17, 2017
Member

Seems reasonable. Might tighten up to:

Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.

val featuresToReads = features.rightOuterShuffleRegionJoin(reads)
// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values


@fnothaft

fnothaft Jul 17, 2017
Member

Key/value is unintuitive to me here. Key/value implies that the key is derived from the value.

BroadcastRegionJoin broadcasts the left dataset, so a left outer join would
require an additional shuffle phase. For an outer join, using a
ShuffleRegionJoin will be cheaper if your reads are already sorted, however if
the feature dataset is small, the BroadcastRegionJoin call would likely be more


@fnothaft

fnothaft Jul 17, 2017
Member

the feature dataset is small -> the feature dataset is small and the reads are not sorted

gunjanbaid and others added 2 commits Jul 17, 2017
@coveralls

@coveralls coveralls commented Jul 18, 2017


Coverage increased (+0.4%) to 84.58% when pulling 0a51d85 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2238/
Test PASSed.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2239/

Build result: FAILURE

[...truncated 15 lines...]
Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge)
First time build. Skipping changelog.
Triggering ADAM-prb for the eight Hadoop 2.3.0/2.6.0, Scala 2.10/2.11, Spark 1.6.1/2.0.0 centos configurations.
All eight configurations completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2240/

Build result: FAILURE

[...truncated 15 lines...]
Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge)
Triggering ADAM-prb for the eight Hadoop 2.3.0/2.6.0, Scala 2.10/2.11, Spark 1.6.1/2.0.0 centos configurations.
All eight configurations completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | BroadcastRegionJoin |
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin |


@devin-petersohn

devin-petersohn Jul 18, 2017
Author Member

Thoughts on a separate table for all the broadcastRegionJoinAgainst variants?


@fnothaft

fnothaft Jul 18, 2017
Member

Instead of using a table, I'd break these out into a nested bulleted list:

  • Joins implemented across both shuffle and broadcast
    • Inner
    • ...
  • Shuffle-only joins
    • FullOuter
    • ...
  • Broadcast-only joins
    • RightAndGroupByRight
    • ...
Member

@fnothaft fnothaft left a comment

A couple more nits; this is pretty close! Thanks @devin-petersohn.

The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
larger side of the join is unsorted and either the data is so skewed that it is


@fnothaft

fnothaft Jul 18, 2017
Member

"so" is colloquial; prefer "sufficiently"


The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the


@fnothaft

fnothaft Jul 18, 2017
Member

when you have a dataset that -> when the right side of your join

also, there should be an "and" after "and broadcast out,". The condition is "dataset is small" and ("large side unsorted" or "data is skewed").

entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
larger side of the join is unsorted and either the data is so skewed that it is
hard to load balance, the data is too large to be worth shuffling, or you don't


@fnothaft

fnothaft Jul 18, 2017
Member

"larger side of the join is unsorted" and "data is too large to be worth shuffling" should be grouped together logically.


@fnothaft

fnothaft Jul 18, 2017
Member

I would change you don't want sorted output to you can tolerate unsorted output.

| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | BroadcastRegionJoin |
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin |


@fnothaft

fnothaft Jul 18, 2017
Member

Instead of using a table, I'd break these out into a nested bulleted list:

  • Joins implemented across both shuffle and broadcast
    • Inner
    • ...
  • Shuffle-only joins
    • FullOuter
    • ...
  • Broadcast-only joins
    • RightAndGroupByRight
    • ...
###### Filter Genotypes by Features

This query joins an RDD of Genotypes against an RDD of Features using an inner
join. The inner join will result in an RDD of key-value pairs, where the key is


@fnothaft

fnothaft Jul 18, 2017
Member

Still don't like key-value pairs here, because they're not key/value pairs.

this query would extract all genotypes that fall in exonic regions.

```scala
// Inner join will filter out genotypic data not represented in the feature dataset


@fnothaft

fnothaft Jul 18, 2017
Member

genotypic data -> genotypes
not represented in the feature dataset -> not covered by a feature

and select just the `Genotype`.

The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies
is that a broadcast join sends the left dataset to all executors. In this case,


@fnothaft

fnothaft Jul 18, 2017
Member

For conciseness, trim:

The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies is that a broadcast join sends the left dataset to all executors. In this case,

to

Since a broadcast join sends the left dataset to all executors,

val variantsByFeatureBcast = variants.broadcastRegionJoinAndGroupByRight(features)
```

When we switch join strategies, we need to change the dataset that is on the


@fnothaft

fnothaft Jul 18, 2017
Member

Nit: I think this text would read a bit better if "to change the dataset that is" was replaced with "to swap which dataset is"

@coveralls

@coveralls coveralls commented Jul 18, 2017


Coverage increased (+0.4%) to 84.58% when pulling cd98e93 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2242/
Test PASSed.

Member

@fnothaft fnothaft left a comment

1 really small typo, otherwise LGTM. Thanks @devin-petersohn!

When we switch join strategies, we need to change the dataset that is on the
left side of the join. BroadcastRegionJoin only supports grouping by the right
dataset, and ShuffleRegionJoin supports only grouping by the left dataset.
When we switch join strategies, to swap which dataset is on the left side of


@fnothaft

fnothaft Jul 19, 2017
Member

to swap which -> we swap which

Member

@fnothaft fnothaft left a comment

LGTM. Thanks @devin-petersohn!

@coveralls

@coveralls coveralls commented Jul 19, 2017


Coverage increased (+0.4%) to 84.58% when pulling a49cd04 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 19, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2246/
Test PASSed.

@fnothaft
Member

@fnothaft fnothaft commented Jul 19, 2017

@heuermh just going to ping you for a review. As an FYI, @devin-petersohn is going to clean up the history on the commit, so don't merge until you get an OK from him.

@fnothaft
Member

@fnothaft fnothaft commented Jul 21, 2017

Ping @heuermh

Member

@heuermh heuermh left a comment

Few minor suggestions, otherwise looks good

and select just the `Genotype`.

Since a broadcast join sends the left dataset to all executors, we chose to
send the `features` dataset because feature data is usually smaller in size


@heuermh

heuermh Jul 21, 2017
Member

data is → data are

Each of these demonstrations illustrates the difference between calling the
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can
be expanded from.


@heuermh

heuermh Jul 21, 2017
Member

How about:

These demonstrations illustrate the difference between calling
ShuffleRegionJoin and BroadcastRegionJoin and provide example code
to expand from.

This query joins an RDD of Variants against an RDD of Features, and immediately
performs a group-by on the Feature. This produces an RDD whose elements are a
tuple containing a Feature, and all of the Variants overlapping the Feature.
This query is useful for trying to identify annotated variants that may


@heuermh

heuermh Jul 21, 2017
Member

This produces an RDD whose elements are tuples containing a Feature and all of the Variants overlapping the Feature.

ShuffleRegionJoin supports only grouping by the left dataset.

The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all


@heuermh

heuermh Jul 21, 2017
Member

is broadcasted → is broadcast

@devin-petersohn
Member Author

@devin-petersohn devin-petersohn commented Jul 21, 2017

Ok to squash. I talked to @gunjanbaid and she ok'ed squash.

@fnothaft
Member

@fnothaft fnothaft commented Jul 21, 2017

Thanks @devin-petersohn and @gunjanbaid! @heuermh if the latest changes look good to you, please squash-and-merge.

@coveralls

@coveralls coveralls commented Jul 21, 2017


Coverage decreased (-0.1%) to 84.016% when pulling 6f3fe74 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 21, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2265/

Build result: FAILURE

[...truncated 15 lines...]
Checking out Revision 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 (origin/pr/1605/merge)
Triggering ADAM-prb for the eight Hadoop 2.3.0/2.6.0, Scala 2.10/2.11, Spark 1.6.1/2.1.0 centos configurations.
The 2.3.0,2.10,2.1.0 and 2.6.0,2.10,2.1.0 configurations completed with result FAILURE; the remaining six completed with result SUCCESS.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Member

@heuermh heuermh commented Jul 21, 2017

Jenkins, retest this please

@coveralls

@coveralls coveralls commented Jul 21, 2017


Coverage increased (+0.6%) to 84.743% when pulling 6f3fe74 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 21, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2269/
Test PASSed.

@fnothaft fnothaft merged commit 6abe7a6 into bigdatagenomics:master Jul 21, 2017
3 checks passed
codacy/pr: Good work! A positive pull request.
coverage/coveralls: Coverage increased (+0.6%) to 84.743%
default: Merged build finished.