New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding examples of how to use joins in the real world #1605
Adding examples of how to use joins in the real world #1605
Conversation
Somehow, I am skeptical that this PR decreased coverage by 0.5%. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @devin-petersohn! I've dropped a few specific notes inline. As a generalization, code isn't great standalone documentation. For each query, I'd like to see:
- Brief synopsis of the query (why would I run this?)
- The code for the query
- A brief discussion about the query (why it was written in a specific way, any non-obvious nits, performance implications, etc.)
E.g., for use case 2:
This query joins an RDD of Variants against an RDD of Features, and immediately performs a group-by on the Feature. This produces an RDD whose elements are a tuple containing a Feature, and all of the Variants overlapping the Feature. This query is useful for trying to identify annotated variants that may interact (identifying frameshift mutations within a transcript that may act as a pair to shift and then restore the reading frame) or as the start of a query that computes variant density over a set of genomic features.
... code ...
One important implication with this query is that the broadcast region join strategy only supports a group-by on the right side of the tuple. This means that we need to structure our query so that the variant table is the broadcast table. Since we typically expect our variant dataset to be much larger than our feature dataset, this may perform substantially worse than the shuffle join.
docs/source/55_api.md
Outdated
val features = sc.loadFeatures(“my/features.adam”) | ||
|
||
// We can use ShuffleRegionJoin… | ||
val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you s/_shuffle/Shuffle/g
throughout this PR? Ditto with s/_bcast/Bcast/g
.
docs/source/55_api.md
Outdated
@@ -339,6 +339,52 @@ val bcastFeatures = sc.loadFeatures("my/features.adam").broadcast() | |||
val readsByFeature = reads.broadcastRegionJoinAgainst(bcastFeatures) | |||
``` | |||
|
|||
#### Examples of real-world analyses possible with the RegionJoin API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would drop the "Examples of..." header and segue in with a paragraph before saying "To demonstrate how these APIs can be used, we'll walk through three common queries that can be written using the region join. They are X, Y, and Z." The subheaders should stay.
docs/source/55_api.md
Outdated
val filteredGenotypes_shuffle = genotypes.shuffleRegionJoin(features) | ||
|
||
// …or BroadcastRegionJoin | ||
val filteredGenotypes_bcast = genotypes.broadcastRegionJoin(features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd want the features
to be on the right side of the join here; the features are expected to be smaller than the variant calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to our method level documentation, features
would be the right side of the join: link. Please let me know which is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you meant to say features
should be on the left side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion; I'd meant left.
docs/source/55_api.md
Outdated
###### Separate reads into overlapping and non-overlapping features | ||
```scala | ||
// An outer join provides us with both overlapping and non-overlapping data | ||
val reads = sc.loadAlightments(“my/reads.adam”) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alightments
-> Alignments
docs/source/55_api.md
Outdated
val variantsByFeature_shuffle = features.shuffleRegionJoinAndGroupByLeft(variants) | ||
|
||
// As a BroadcastRegionJoin, it can be implemented as follows: | ||
val variantsByFeature_bcast = variants.broadcastRegionJoinAndGroupByRight(features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please highlight that the sides of the join have swapped between the two queries. This didn't stick out the first time I read through.
- Please add text discussing the performance implications of rewriting this as a broadcast join.
Test PASSed. |
I pushed this a little early, I'll push another update shortly. |
docs/source/55_api.md
Outdated
|
||
```scala | ||
// Inner join will filter out genotypic data not represented in the feature dataset | ||
val genotypes = sc.loadGenotypes(“my/genotypes.adam”) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the my/
here is helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was keeping consistent with previous documentation. If you want me to change it here, I'll go ahead and change it throughout.
docs/source/55_api.md
Outdated
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features) | ||
|
||
// …or BroadcastRegionJoin | ||
val joinedGenotypesBcast = genotypes.broadcastRegionJoin(features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bcast → Broadcast
// We can use ShuffleRegionJoin… | ||
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features) | ||
|
||
// …or BroadcastRegionJoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A separate code block for the broadcast region join would be helpful. I know folks like to copy and paste from docs. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.
docs/source/55_api.md
Outdated
|
||
After the join, we can perform a predicate function on the resulting RDD to | ||
manipulate it into providing the answer to our question. Because we were | ||
interested in only getting the Genotypes that overlap the features, we used a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interested in only getting the Genotypes → interested in the Genotypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.
docs/source/55_api.md
Outdated
|
||
```scala | ||
// Inner join with a group by on the features | ||
val features = sc.loadFeatures(“my/features.adam”) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here
// As a ShuffleRegionJoin, it can be implemented as follows: | ||
val variantsByFeatureShuffle = features.shuffleRegionJoinAndGroupByLeft(variants) | ||
|
||
// As a BroadcastRegionJoin, it can be implemented as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and separate broadcast code block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 on separating into two blocks, please keep as is
docs/source/55_api.md
Outdated
|
||
```scala | ||
// An outer join provides us with both overlapping and non-overlapping data | ||
val reads = sc.loadAlignments(“my/reads.adam”) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and here
docs/source/55_api.md
Outdated
val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None) | ||
``` | ||
|
||
Previously, we illustrated that join calls can be different between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These summary paragraphs seem not so useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following two or all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Previously, and We also previously paragraphs. I don't feel strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree WRT removing We also...
.
I don't think we should remove Previously,
, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.
Sorry about the outdated comments, was reviewing inbetween yer pushes :) |
Test PASSed. |
docs/source/55_api.md
Outdated
To ensure that the data is appropriately colocated, we perform a copartition | ||
on the right dataset before the each node conducts the join locally. | ||
ShuffleRegionJoin should be used if the right dataset is too large to send to | ||
all nodes and both datasets have low |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be high cardinality, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, add that there are certain operations (e.g., full outer join) that can only be performed as a shuffle join.
docs/source/55_api.md
Outdated
The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the | ||
entire left dataset to each node. The BroadcastRegionJoin should be used when | ||
you are joining a smaller dataset to a larger one and/or the datasets in the | ||
join have high cardinality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be low cardinality?
// We can use ShuffleRegionJoin… | ||
val joinedGenotypesShuffle = genotypes.shuffleRegionJoin(features) | ||
|
||
// …or BroadcastRegionJoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather leave it as is. It makes it easier to visually inspect the differences between the broadcast and shuffle versions of the joins.
docs/source/55_api.md
Outdated
|
||
After the join, we can perform a predicate function on the resulting RDD to | ||
manipulate it into providing the answer to our question. Because we were | ||
interested in only getting the Genotypes that overlap the features, we used a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using a predicate here per se (nothing's getting filtered), we're just selecting the genotype.
docs/source/55_api.md
Outdated
interested in only getting the Genotypes that overlap the features, we used a | ||
predicate. | ||
|
||
Notice that at the end of the join, we can access the RDD resulting from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this paragraph adds much beyond what is already said in the last sentence of the prior paragraph; let's remove it.
docs/source/55_api.md
Outdated
|
||
// After we have our join, we need to separate the RDD | ||
// If we used the ShuffleRegionJoin, we filter by None in the values | ||
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_._2.isDefined
docs/source/55_api.md
Outdated
// After we have our join, we need to separate the RDD | ||
// If we used the ShuffleRegionJoin, we filter by None in the values | ||
val overlapsFeatures = readsToFeatures.rdd.filter(_._2 != None) | ||
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_._2.isEmpty
docs/source/55_api.md
Outdated
val notOverlapsFeatures = readsToFeatures.rdd.filter(_._2 == None) | ||
|
||
// If we used BroadcastRegionJoin, we filter by None in the keys | ||
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_._1.isDefined
docs/source/55_api.md
Outdated
|
||
// If we used BroadcastRegionJoin, we filter by None in the keys | ||
val overlapsFeatures = featuresToReads.rdd.filter(_._1 != None) | ||
val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_._1.isEmpty
docs/source/55_api.md
Outdated
val notOverlapsFeatures = featuresToReads.rdd.filter(_._1 != None) | ||
``` | ||
|
||
Previously, we illustrated that join calls can be different between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree WRT removing We also...
.
I don't think we should remove Previously,
, but I do agree that it isn't really useful as is written. Specifically, if I've gotten to this point in the documentation, I'm probably asking myself right now "Why are the two queries written differently, and when should I choose a shuffle join instead of a broadcast join?" or some variant thereupon. So, this paragraph should explain why you can't run a left outer join using the broadcast strategy, and that a shuffle join is probably cheaper if your reads are already sorted, but that we expect features to be pretty small, so a broadcast join will likely be pretty performant.
Test PASSed. |
Test PASSed. |
docs/source/55_api.md
Outdated
cardinality. | ||
|
||
Another important distinction between ShuffleRegionJoin and | ||
The ShuffleRegionJoin is at its core a distributed sort-merge overlap join. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is at its core
-> is
docs/source/55_api.md
Outdated
To ensure that the data is appropriately colocated, we perform a copartition | ||
on the right dataset before the each node conducts the join locally. | ||
ShuffleRegionJoin should be used if the right dataset is too large to send to | ||
all nodes and both datasets have high cardinality. Because of the flexibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, this is true, but it'd be more accurate to write the converse -> "Since the broadcast join doesn't co-partition the datasets and instead sends the full right table to all nodes, some joins (e.g., left/full outer joins) cannot be written as broadcast joins."
docs/source/55_api.md
Outdated
|
||
The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the | ||
entire left dataset to each node. The BroadcastRegionJoin should be used when | ||
you are joining a smaller dataset to a larger one and/or the datasets in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More so, I'd say the broadcast region join should be used when you have a dataset that is small enough to be collected and broadcast out, the larger side of the join is unsorted and either the data is so skewed that it is hard to load balance, the data is too large to be worth shuffling, or you don't want sorted output.
docs/source/55_api.md
Outdated
read a genomic dataset into memory, this condition is met. | ||
|
||
ADAM has a variety of ShuffleRegionJoin types that you can perform on your | ||
ADAM has a variety of ShuffleRegionJoin types that you can perform on your |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShuffleRegionJoin types
-> region joins
docs/source/55_api.md
Outdated
|
||
Each of these demonstrations illustrates the difference between calling the | ||
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can | ||
be expanded from. For a detailed difference on the optimal performance of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not really a detailed discussion of the performance characteristics above, please strike.
docs/source/55_api.md
Outdated
``` | ||
|
||
When we switch join strategies, we need to change the dataset that is on the | ||
left side of the join. This distinction is very important to understanding the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really the opposite way around. You need to understand the architectural difference between the broadcast and shuffle strategies to understand why we change which dataset is on which side of the join.
docs/source/55_api.md
Outdated
dataset may change between BroadcastRegionJoin and ShuffleRegionJoin. | ||
|
||
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft` | ||
implementation is due to the fact that the left dataset is broadcasted to all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point about locality is correct, but comment about the broadcast is somewhat misleading. If the right dataset was sorted and we did a broadcast join, we could do a shuffle free join. But, we don't guarantee any sort invariants when running the broadcast join, so we can't provide a broadcastJoinAndGroupByLeft
that has predictable performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see rephrased sentence below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable. Might tighten up to:
Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.
docs/source/55_api.md
Outdated
implementation is due to the fact that the left dataset is broadcasted to all | ||
nodes. It would be impossible to perform a group by function on the resulting | ||
join without a shuffle phase because joined tuples could be on any partition. | ||
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rejion
-> Region
docs/source/55_api.md
Outdated
nodes. It would be impossible to perform a group by function on the resulting | ||
join without a shuffle phase because joined tuples could be on any partition. | ||
ShuffleRejionJoin, however, performs a sort-merge join, and grouping by the | ||
left data does not require a shuffle. This is primarily due to the invariant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last sentence is redundant; please remove.
docs/source/55_api.md
Outdated
feature. If a given read does not overlap with any features provided, it is | ||
paired with a `None`. After we perform the join, we use a predicate to separate | ||
the reads into two RDDs. This query is useful for filtering out reads based on | ||
feature data. For example, identifying reads that overlap with ChIPSeq data to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ATAC-seq? ChIP-seq identifies protein binding locations.
Test PASSed. |
docs/source/55_api.md
Outdated
interested in the Genotypes that overlap the features, we used a map function | ||
to extract them. | ||
|
||
Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another important distinction
-> The difference
between ShuffleRegionJoin
-> between the ShuffleRegionJoin
after BroadcastRegionJoin
-> BroadcastRegionJoin strategies
docs/source/55_api.md
Outdated
to extract them. | ||
|
||
Another important distinction between ShuffleRegionJoin and BroadcastRegionJoin | ||
is that in a BroadcastRegionJoin, the left dataset is sent to all executors. In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that in a BroadcastRegionJoin, the left dataset is sent to all executors
-> is that a broadcast join sends the left dataset to all executors
.
docs/source/55_api.md
Outdated
|
||
```scala | ||
// Inner join will filter out genotypic data not represented in the feature dataset | ||
val genotypes = sc.loadGenotypes(“my/genotypes.adam”) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: you've got a smattering of angled quotes (“
) in here. Can you make these non-angled? They render incorrectly in the PDF version if they are angled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a smattering of these throughout the new docs, not just here.
docs/source/55_api.md
Outdated
// …or BroadcastRegionJoin | ||
val joinedGenotypesBcast = features.broadcastRegionJoin(genotypes) | ||
|
||
// In the case that we only want Genotypes, we can use a simple predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predicate
-> projection
docs/source/55_api.md
Outdated
val filteredGenotypesBcast = joinedGenotypesBcast.rdd.map(_._2) | ||
``` | ||
|
||
After the join, we can perform a predicate function on the resulting RDD to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessarily verbose. Suggest:
Since we are interested in the
Genotype
s that overlap a feature, we map over the tuples and select just theGenotype
.
docs/source/55_api.md
Outdated
When we switch join strategies, we need to change the dataset that is on the | ||
left side of the join. | ||
|
||
To perform a `groupBy` after the join, BroadcastRegionJoin only supports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessarily verbose. Suggest removing To perform a
groupByafter the join,
docs/source/55_api.md
Outdated
|
||
To perform a `groupBy` after the join, BroadcastRegionJoin only supports | ||
grouping by the right dataset, and ShuffleRegionJoin supports only grouping by | ||
the left dataset. Thus, depending on the type of join, the left and right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably remove sentence starting with Thus
docs/source/55_api.md
Outdated
dataset may change between BroadcastRegionJoin and ShuffleRegionJoin. | ||
|
||
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft` | ||
implementation is due to the fact that the left dataset is broadcasted to all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable. Might tighten up to:
Unlike shuffle joins, broadcast joins don't maintain a sort order invariant. Because of this, we would need to shuffle all data to a group-by on the left side of the dataset, and there is no opportunity to optimize by combining the join and group-by.
docs/source/55_api.md
Outdated
val featuresToReads = features.rightOuterShuffleRegionJoin(reads) | ||
|
||
// After we have our join, we need to separate the RDD | ||
// If we used the ShuffleRegionJoin, we filter by None in the values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key/value is unintiutive to me here. Key/value implies that the key is derived from the value.
docs/source/55_api.md
Outdated
BroadcastRegionJoin broadcasts the left dataset, so a left outer join would | ||
require an additional shuffle phase. For an outer join, using a | ||
ShuffleRegionJoin will be cheaper if your reads are already sorted, however if | ||
the feature dataset is small, the BroadcastRegionJoin call would likely be more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the feature dataset is small
-> the feature dataset is small and the reads are not sorted
Fixed typos.
Test PASSed. |
Test FAILed. Build result: FAILURE[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1605/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains d2e7c9740e82adc9aca2caf53026a71b347c48fa # timeout=10Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f d2e7c9740e82adc9aca2caf53026a71b347c48faFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1605/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains d2e7c9740e82adc9aca2caf53026a71b347c48fa # timeout=10Checking out Revision d2e7c9740e82adc9aca2caf53026a71b347c48fa (origin/pr/1605/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f d2e7c9740e82adc9aca2caf53026a71b347c48fa > /home/jenkins/git2/bin/git rev-list d2e7c9740e82adc9aca2caf53026a71b347c48fa # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
docs/source/55_api.md
Outdated
| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin | | ||
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | ShuffleRegionJoin | | ||
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin | | ||
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on a separate table for all the broadcastRegionJoinAgainst
variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using a table, I'd break these out into a nested bulleted list:
- Joins implemented across both shuffle and broadcast
- Inner
- ...
- Shuffle-only joins
- FullOuter
- ...
- Broadcast-only joins
- RightAndGroupByRight
- ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more nits; this is pretty close! Thanks @devin-petersohn.
docs/source/55_api.md
Outdated
The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the | ||
entire left dataset to each node. The BroadcastRegionJoin should be used when | ||
you have a dataset that is small enough to be collected and broadcast out, the | ||
larger side of the join is unsorted and either the data is so skewed that it is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so
is colloquial, prefer sufficiently
docs/source/55_api.md
Outdated
|
||
The BroadcastRegionJoin performs an overlap join by broadcasting a copy of the | ||
entire left dataset to each node. The BroadcastRegionJoin should be used when | ||
you have a dataset that is small enough to be collected and broadcast out, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you have a dataset that
-> when the right side of your join
also, there should be an and
after and broadcast out,
. The condition is "dataset is small" and ("large side unsorted" or "data is skewed").
docs/source/55_api.md
Outdated
entire left dataset to each node. The BroadcastRegionJoin should be used when | ||
you have a dataset that is small enough to be collected and broadcast out, the | ||
larger side of the join is unsorted and either the data is so skewed that it is | ||
hard to load balance, the data is too large to be worth shuffling, or you don't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"larger side of the join is unsorted" and "data is too large to be worth shuffling" should be grouped together logically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change you don't want sorted output
to you can tolerate unsorted output
.
docs/source/55_api.md
Outdated
| ```left.shuffleRegionJoinAndGroupByLeft(right)``` | perform an inner join and group joined values by the records on the left | ShuffleRegionJoin | | ||
| ```left.broadcastRegionJoinAndGroupByRight(right)``` ```right.broadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform an inner join and group joined values by the records on the right | ShuffleRegionJoin | | ||
| ```left.rightOuterShuffleRegionJoinAndGroupByLeft(right)``` | perform a right outer join and group joined values by the records on the left | ShuffleRegionJoin | | ||
| ```left.rightOuterBroadcastRegionJoinAndGroupByRight(right)``` ```right.rightOuterBroadcastRegionJoinAgainstAndGroupByRight(broadcastedLeft)``` | perform a right outer join and group joined values by the records on the right | BroadcastRegionJoin | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using a table, I'd break these out into a nested bulleted list:
- Joins implemented across both shuffle and broadcast
- Inner
- ...
- Shuffle-only joins
- FullOuter
- ...
- Broadcast-only joins
- RightAndGroupByRight
- ...
docs/source/55_api.md
Outdated
###### Filter Genotypes by Features | ||
|
||
This query joins an RDD of Genotypes against an RDD of Features using an inner | ||
join. The inner join will result in an RDD of key-value pairs, where the key is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still don't like key-value
pairs here, because they're not key/value pairs.
docs/source/55_api.md
Outdated
this query would extract all genotypes that fall in exonic regions. | ||
|
||
```scala | ||
// Inner join will filter out genotypic data not represented in the feature dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
genotypic data
-> genotypes
not represented in the feature dataset
-> not covered by an feature
docs/source/55_api.md
Outdated
and select just the `Genotype`. | ||
|
||
The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies | ||
is that a broadcast join sends the left dataset to all executors. In this case, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For conciseness, trim:
The difference between the ShuffleRegionJoin and BroadcastRegionJoin strategies is that a broadcast join sends the left dataset to all executors. In this case,
to
Since a broadcast join sends the left dataset to all executors,
docs/source/55_api.md
Outdated
val variantsByFeatureBcast = variants.broadcastRegionJoinAndGroupByRight(features) | ||
``` | ||
|
||
When we switch join strategies, we need to change the dataset that is on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this text would read a bit better if "to change the dataset that is" was replaced with "to swap which dataset is"
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 really small typo, otherwise LGTM. Thanks @devin-petersohn!
docs/source/55_api.md
Outdated
When we switch join strategies, we need to change the dataset that is on the | ||
left side of the join. BroadcastRegionJoin only supports grouping by the right | ||
dataset, and ShuffleRegionJoin supports only grouping by the left dataset. | ||
When we switch join strategies, to swap which dataset is on the left side of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to swap which
-> we swap which
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @devin-petersohn!
Test PASSed. |
@heuermh just going to ping you for a review. As an FYI, @devin-petersohn is going to clean up the history on the commit, so don't merge until you get an OK from him. |
Ping @heuermh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor suggestions, otherwise looks good
docs/source/55_api.md
Outdated
and select just the `Genotype`. | ||
|
||
Since a broadcast join sends the left dataset to all executors, we chose to | ||
send the `features` dataset because feature data is usually smaller in size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data is → data are
docs/source/55_api.md
Outdated
Each of these demonstrations illustrates the difference between calling the | ||
ShuffleRegionJoin and BroadcastRegionJoin and provides example code that can | ||
be expanded from. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
These demonstrations illustrate the difference between calling
ShuffleRegionJoin and BroadcastRegionJoin and provide example code
to expand from.
docs/source/55_api.md
Outdated
This query joins an RDD of Variants against an RDD of Features, and immediately | ||
performs a group-by on the Feature. This produces an RDD whose elements are a | ||
tuple containing a Feature, and all of the Variants overlapping the Feature. | ||
This query is useful for trying to identify annotated variants that may |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This produces an RDD whose elements are tuples containing a Feature and all of the Variants overlapping the Feature.
docs/source/55_api.md
Outdated
ShuffleRegionJoin supports only grouping by the left dataset. | ||
|
||
The reason BroadcastRegionJoin does not have a `joinAndGroupByLeft` | ||
implementation is due to the fact that the left dataset is broadcasted to all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is broadcasted → is broadcast
Ok to squash. I talked to @gunjanbaid and she ok'ed squash. |
Thanks @devin-petersohn and @gunjanbaid! @heuermh if the latest changes look good to you, please squash-and-merge. |
Test FAILed. Build result: FAILURE[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1605/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 # timeout=10Checking out Revision 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 (origin/pr/1605/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 90b0b6cd6ecfdcbe2b521b4ffc71a5f92c27b5a1 > /home/jenkins/git2/bin/git rev-list 8704718 # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.10,2.1.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.1.0,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.1.0,centosTriggering ADAM-prb ? 2.3.0,2.11,2.1.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,2.1.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.1.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.1.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.1.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please |
Test PASSed. |
Merged! Thanks @devin-petersohn! |
Resolves #890.