Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding examples of how to use joins in the real world #1605

Merged

Conversation

devin-petersohn
Copy link
Member

Resolves #890.

@coveralls
Copy link

coveralls commented Jul 13, 2017

Coverage Status

Coverage remained the same at 84.157% when pulling 77ad223 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@fnothaft
Copy link
Member

Somehow, I am skeptical that this PR decreased coverage by 0.5%.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2216/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devin-petersohn! I've dropped a few specific notes inline. As a generalization, code isn't great standalone documentation. For each query, I'd like to see:

  1. Brief synopsis of the query (why would I run this?)
  2. The code for the query
  3. A brief discussion about the query (why it was written in a specific way, any non-obvious nits, performance implications, etc.)

E.g., for use case 2:

This query joins an RDD of Variants against an RDD of Features, and immediately performs a group-by on the Feature. This produces an RDD whose elements are a tuple containing a Feature, and all of the Variants overlapping the Feature. This query is useful for trying to identify annotated variants that may interact (identifying frameshift mutations within a transcript that may act as a pair to shift and then restore the reading frame) or as the start of a query that computes variant density over a set of genomic features.

... code ...

One important implication with this query is that the broadcast region join strategy only supports a group-by on the right side of the tuple. This means that we need to structure our query so that the variant table is the broadcast table. Since we typically expect our variant dataset to be much larger than our feature dataset, this may perform substantially worse than the shuffle join.

val features = sc.loadFeatures(“my/features.adam”)

// We can use ShuffleRegionJoin…
val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you s/_shuffle/Shuffle/g throughout this PR? Ditto with s/_bcast/Bcast/g.

@@ -339,6 +339,52 @@ val bcastFeatures = sc.loadFeatures("my/features.adam").broadcast()
val readsByFeature = reads.broadcastRegionJoinAgainst(bcastFeatures)
```

#### Examples of real-world analyses possible with the RegionJoin API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop the "Examples of..." header and segue in with a paragraph before saying "To demonstrate how these APIs can be used, we'll walk through three common queries that can be written using the region join. They are X, Y, and Z." The subheaders should stay.

val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features)

// …or BroadcastRegionJoin
val filteredGenotypes_bcast = genotypes.broadcastRegionJoin(features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd want the features to be on the right side of the join here; the features are expected to be smaller than the variant calls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our method level documentation, features would be the right side of the join: link. Please let me know which is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you meant to say features should be on the left side?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion; I'd meant left.

###### Separate reads into overlapping and non-overlapping features
```scala
// An outer join provides us with both overlapping and non-overlapping data
val reads = sc.loadAlightments(“my/reads.adam”)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alightments -> Alignments

val variantsByFeature_shuffle = features.shuffleRegionJoinAndGroupByLeft(variants)

// As a BroadcastRegionJoin, it can be implemented as follows:
val variantsByFeature_bcast = variants.broadcastRegionJoinAndGroupByRight(features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please highlight that the sides of the join have swapped between the two queries. This didn't stick out the first time I read through.
  2. Please add text discussing the performance implications of rewriting this as a broadcast join.

@fnothaft fnothaft added this to the 0.23.0 milestone Jul 13, 2017
@coveralls
Copy link

coveralls commented Jul 13, 2017

Coverage Status

Coverage remained the same at 84.157% when pulling f1183b3 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2218/
Test PASSed.

@devin-petersohn
Copy link
Member Author

I pushed this a little early, I'll push another update shortly.


```scala
// Inner join will filter out genotypic data not represented in the feature dataset
val genotypes = sc.loadGenotypes(“my/genotypes.adam”)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the my/ here is helpful

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was keeping consistent with previous documentation. If you want me to change it here, I'll go ahead and change it throughout.

val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)

// …or BroadcastRegionJoin
val joinedGenotypesBcast = genotypes.broadcastRegionJoin(features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bcast → Broadcast

// We can use ShuffleRegionJoin…
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)

// …or BroadcastRegionJoin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate code block for the broadcast region join would be helpful. I know folks like to copy and paste from docs. ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.


After the join, we can perform a predicate function on the resulting RDD to
manipulate it into providing the answer to our question. Because we were
interested in only getting the Genotypes that overlap the features, we used a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interested in only getting the Genotypes → interested in the Genotypes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.


```scala
// Inner join with a group by on the features
val features = sc.loadFeatures(“my/features.adam”)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar here

// As a ShuffleRegionJoin, it can be implemented as follows:
val variantsByFeatureShuffle = features.shuffleRegionJoinAndGroupByLeft(variants)

// As a BroadcastRegionJoin, it can be implemented as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and separate broadcast code block

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on separating into two blocks, please keep as is


```scala
// An outer join provides us with both overlapping and non-overlapping data
val reads = sc.loadAlignments(“my/reads.adam”)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and here

val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
```

Previously, we illustrated that join calls can be different between
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These summary paragraphs seem not so useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following two or all of them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Previously, and We also previously paragraphs. I don't feel strongly about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree WRT removing We also....

I don't think we should remove Previously,, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.

@heuermh
Copy link
Member

heuermh commented Jul 13, 2017

Sorry about the outdated comments, was reviewing inbetween yer pushes :)

@coveralls
Copy link

coveralls commented Jul 13, 2017

Coverage Status

Coverage remained the same at 84.157% when pulling ff0b84e on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2219/
Test PASSed.

To ensure that the data is appropriately colocated, we perform a copartition
on the right dataset before the each node conducts the join locally.
ShuffleRegionJoin should be used if the right dataset is too large to send to
all nodes and both datasets have low
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be high cardinality, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, add that there are certain operations (e.g., full outer join) that can only be performed as a shuffle join.

The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you are joining a smaller dataset to a larger one and/or the datasets in the
join have high cardinality.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be low cardinality?

// We can use ShuffleRegionJoin…
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features)

// …or BroadcastRegionJoin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.


After the join, we can perform a predicate function on the resulting RDD to
manipulate it into providing the answer to our question. Because we were
interested in only getting the Genotypes that overlap the features, we used a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.

interested in only getting the Genotypes that overlap the features, we used a
predicate.

Notice that at the end of the join, we can access the RDD resulting from the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this paragraph adds much beyond what is already said in the last sentence of the prior paragraph; let's remove it.


// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_._2.isDefined

// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None)
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_._2.isEmpty

val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None)

// If we used BroadcastRegionJoin, we filter by None in the keys
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_._1.isDefined


// If we used BroadcastRegionJoin, we filter by None in the keys
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_._1.isEmpty

val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None)
```

Previously, we illustrated that join calls can be different between
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree WRT removing We also....

I don't think we should remove Previously,, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.

@coveralls
Copy link

coveralls commented Jul 13, 2017

Coverage Status

Coverage remained the same at 84.157% when pulling 0704601 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2220/
Test PASSed.

@coveralls
Copy link

coveralls commented Jul 14, 2017

Coverage Status

Coverage remained the same at 84.157% when pulling 4fbe0a6 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2221/
Test PASSed.

cardinality.

Another important distinction between ShuffleRegionJoin and
The ShuffleRegionJoin is at its core a distributed sort-merge overlap join.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is at its core -> is

To ensure that the data is appropriately colocated, we perform a copartition
on the right dataset before the each node conducts the join locally.
ShuffleRegionJoin should be used if the right dataset is too large to send to
all nodes and both datasets have high cardinality. Because of the flexibility
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, this is true, but it'd be more accurate to write the converse -> "Since the broadcast join doesn't co-partition the datasets and instead sends the full right table to all nodes, some joins (e.g., left/full outer joins) cannot be written as broadcast joins."


The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you are joining a smaller dataset to a larger one and/or the datasets in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More so, I'd say the broadcast region join should be used when you have a dataset that is small enough to be collected and broadcast out, the larger side of the join is unsorted and either the data is so skewed that it is hard to load balance, the data is too large to be worth shuffling, or you don't want sorted output.

read a genomic dataset into memory, this condition is met.

ADAM has a variety of ShuffleRegionJoin types that you can perform on your
ADAM has a variety of ShuffleRegionJoin types that you can perform on your
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShuffleRegionJoin types -> region joins


Each of these demonstrations illustrates the difference between calling the
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can
be expanded from. For a detailed difference on the optimal performance of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not really a detailed discussion of the performance characteristics above, please strike.

```

When we switch join strategies, we need to change the dataset that is on the
left side of the join. This distinction is very important to understanding the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really the opposite way around. You need to understand the architectural difference between the broadcast and shuffle strategies to understand why we change which dataset is on which side of the join.

dataset may change between BroadcastRegionJoin and ShuffleRegionJoin.

The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point about locality is correct, but comment about the broadcast is somewhat misleading. If the right dataset was sorted and we did a broadcast join, we could do a shuffle free join. But, we don't guarantee any sort invariants when running the broadcast join, so we can't provide a broadcastJoinAndGroupByLeft that has predictable performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see rephrased sentence below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. Might tighten up to:

Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.

implementation is due to the fact that the left dataset is broadcasted to all
nodes. It would be impossible to perform a group by function on the resulting
join without a shuffle phase because joined tuples could be on any partition.
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rejion -> Region

nodes. It would be impossible to perform a group by function on the resulting
join without a shuffle phase because joined tuples could be on any partition.
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the
left data does not require a shuffle. This is primarily due to the invariant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last sentence is redundant; please remove.

feature. If a given read does not overlap with any features provided, it is
paired with a `None`. After we perform the join, we use a predicate to separate
the reads into two RDDs. This query is useful for filtering out reads based on
feature data. For example, identifying reads that overlap with ChIPSeq data to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATAC-seq? ChIP-seq identifies protein binding locations.

@coveralls
Copy link

coveralls commented Jul 17, 2017

Coverage Status

Coverage increased (+0.03%) to 84.191% when pulling d2e4371 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2228/
Test PASSed.

interested in the Genotypes that overlap the features, we used a map function
to extract them.

Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another important distinction -> The difference
between ShuffleRegionJoin -> between the ShuffleRegionJoin
after BroadcastRegionJoin -> BroadcastRegionJoin strategies

to extract them.

Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin
is that in a BroadcastRegionJoin, the left dataset is sent to all executors. In
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that in a BroadcastRegionJoin, the left dataset is sent to all executors -> is that a broadcast join sends the left dataset to all executors.


```scala
// Inner join will filter out genotypic data not represented in the feature dataset
val genotypes = sc.loadGenotypes(“my/genotypes.adam”)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: you've got a smattering of angled quotes () in here. Can you make these non-angled? They render incorrectly in the PDF version if they are angled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a smattering of these throughout the new docs, not just here.

// …or BroadcastRegionJoin
val joinedGenotypesBcast = features.broadcastRegionJoin(genotypes)

// In the case that we only want Genotypes, we can use a simple predicate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predicate -> projection

val filteredGenotypesBcast = joinedGenotypesBcast.rdd.map(_._2)
```

After the join, we can perform a predicate function on the resulting RDD to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessarily verbose. Suggest:

Since we are interested in the Genotypes that overlap a feature, we map over the tuples and select just the Genotype.

When we switch join strategies, we need to change the dataset that is on the
left side of the join.

To perform a `groupBy` after the join, BroadcastRegionJoin only supports
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessarily verbose. Suggest removing To perform a groupByafter the join,


To perform a `groupBy` after the join, BroadcastRegionJoin only supports
grouping by the right dataset, and ShuffleRegionJoin supports only grouping by
the left dataset. Thus, depending on the type of join, the left and right
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably remove sentence starting with Thus

dataset may change between BroadcastRegionJoin and ShuffleRegionJoin.

The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. Might tighten up to:

Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.

val featuresToReads = features.rightOuterShuffleRegionJoin(reads)

// After we have our join, we need to separate the RDD
// If we used the ShuffleRegionJoin, we filter by None in the values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key/value is unintiutive to me here. Key/value implies that the key is derived from the value.

BroadcastRegionJoin broadcasts the left dataset, so a left outer join would
require an additional shuffle phase. For an outer join, using a
ShuffleRegionJoin will be cheaper if your reads are already sorted, however if
the feature dataset is small, the BroadcastRegionJoin call would likely be more
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the feature dataset is small -> the feature dataset is small and the reads are not sorted

@coveralls
Copy link

coveralls commented Jul 18, 2017

Coverage Status

Coverage increased (+0.4%) to 84.58% when pulling 0a51d85 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2238/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2239/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1605/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains d2e7c9740e82adc9aca2caf53026a71b347c48fa # timeout=10Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f d2e7c9740e82adc9aca2caf53026a71b347c48faFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2240/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1605/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains d2e7c9740e82adc9aca2caf53026a71b347c48fa # timeout=10Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f d2e7c9740e82adc9aca2caf53026a71b347c48fa > /home/jenkins/git2/bin/git rev-list d2e7c9740e82adc9aca2caf53026a71b347c48fa # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | ShuffleRegionJoin |
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin |
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on a separate table for all the broadcastRegionJoinAgainst variants?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a table, I'd break these out into a nested bulleted list:

  • Joins implemented across both shuffle and broadcast
    • Inner
    • ...
  • Shuffle-only joins
    • FullOuter
    • ...
  • Broadcast-only joins
    • RightAndGroupByRight
    • ...

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more nits; this is pretty close! Thanks @devin-petersohn.

The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
larger side of the join is unsorted and either the data is so skewed that it is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is colloquial, prefer sufficiently


The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the
entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you have a dataset that -> when the right side of your join

also, there should be an and after and broadcast out,. The condition is "dataset is small" and ("large side unsorted" or "data is skewed").

entire left dataset to each node. The BroadcastRegionJoin should be used when
you have a dataset that is small enough to be collected and broadcast out, the
larger side of the join is unsorted and either the data is so skewed that it is
hard to load balance, the data is too large to be worth shuffling, or you don't
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"larger side of the join is unsorted" and "data is too large to be worth shuffling" should be grouped together logically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change you don't want sorted output to you can tolerate unsorted output.

| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | ShuffleRegionJoin |
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin |
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a table, I'd break these out into a nested bulleted list:

  • Joins implemented across both shuffle and broadcast
    • Inner
    • ...
  • Shuffle-only joins
    • FullOuter
    • ...
  • Broadcast-only joins
    • RightAndGroupByRight
    • ...

###### Filter Genotypes by Features

This query joins an RDD of Genotypes against an RDD of Features using an inner
join. The inner join will result in an RDD of key-value pairs, where the key is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still don't like key-value pairs here, because they're not key/value pairs.

this query would extract all genotypes that fall in exonic regions.

```scala
// Inner join will filter out genotypic data not represented in the feature dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genotypic data -> genotypes
not represented in the feature dataset -> not covered by an feature

and select just the `Genotype`.

The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies
is that a broadcast join sends the left dataset to all executors. In this case,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For conciseness, trim:

The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies is that a broadcast join sends the left dataset to all executors. In this case,

to

Since a broadcast join sends the left dataset to all executors,

val variantsByFeatureBcast = variants.broadcastRegionJoinAndGroupByRight(features)
```

When we switch join strategies, we need to change the dataset that is on the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this text would read a bit better if "to change the dataset that is" was replaced with "to swap which dataset is"

@coveralls
Copy link

coveralls commented Jul 18, 2017

Coverage Status

Coverage increased (+0.4%) to 84.58% when pulling cd98e93 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2242/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 really small typo, otherwise LGTM. Thanks @devin-petersohn!

When we switch join strategies, we need to change the dataset that is on the
left side of the join. BroadcastRegionJoin only supports grouping by the right
dataset, and ShuffleRegionJoin supports only grouping by the left dataset.
When we switch join strategies, to swap which dataset is on the left side of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to swap which -> we swap which

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @devin-petersohn!

@coveralls
Copy link

coveralls commented Jul 19, 2017

Coverage Status

Coverage increased (+0.4%) to 84.58% when pulling a49cd04 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2246/
Test PASSed.

@fnothaft
Copy link
Member

@heuermh just going to ping you for a review. As an FYI, @devin-petersohn is going to clean up the history on the commit, so don't merge until you get an OK from him.

@fnothaft
Copy link
Member

Ping @heuermh

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor suggestions, otherwise looks good

and select just the `Genotype`.

Since a broadcast join sends the left dataset to all executors, we chose to
send the `features` dataset because feature data is usually smaller in size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data is → data are

Each of these demonstrations illustrates the difference between calling the
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can
be expanded from.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

These demonstrations illustrate the difference between calling
ShuffleRegionJoin and BroadcastRegionJoin and provide example code
to expand from.

This query joins an RDD of Variants against an RDD of Features, and immediately
performs a group-by on the Feature. This produces an RDD whose elements are a
tuple containing a Feature, and all of the Variants overlapping the Feature.
This query is useful for trying to identify annotated variants that may
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This produces an RDD whose elements are tuples containing a Feature and all of the Variants overlapping the Feature.

ShuffleRegionJoin supports only grouping by the left dataset.

The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft`
implementation is due to the fact that the left dataset is broadcasted to all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is broadcasted → is broadcast

@devin-petersohn
Copy link
Member Author

Ok to squash. I talked to @gunjanbaid and she ok'ed squash.

@fnothaft
Copy link
Member

Thanks @devin-petersohn and @gunjanbaid! @heuermh if the latest changes look good to you, please squash-and-merge.

@coveralls
Copy link

coveralls commented Jul 21, 2017

Coverage Status

Coverage decreased (-0.1%) to 84.016% when pulling 6f3fe74 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2265/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1605/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 # timeout=10Checking out Revision 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 (origin/pr/1605/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 > /home/jenkins/git2/bin/git rev-list 8704718 # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.10,2.1.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.1.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.1.0,centosTriggering ADAM-prb ? 2.3.0,2.11,2.1.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,2.1.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.1.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.1.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.1.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Copy link
Member

heuermh commented Jul 21, 2017

Jenkins, retest this please

@coveralls
Copy link

coveralls commented Jul 21, 2017

Coverage Status

Coverage increased (+0.6%) to 84.743% when pulling 6f3fe74 on devin-petersohn:issue#890joinDocExamples into 607cd50 on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2269/
Test PASSed.

@fnothaft fnothaft merged commit 6abe7a6 into bigdatagenomics:master Jul 21, 2017
@fnothaft
Copy link
Member

Merged! Thanks @devin-petersohn!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants