Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ShuffleRegionJoin usage docs #1384

Closed

Conversation

@devin-petersohn
Copy link
Member

devin-petersohn commented Feb 6, 2017

No description provided.

Copy link
Member

fnothaft left a comment

Good first cut! A few small comments inside.

@@ -256,6 +256,39 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s ShuffleRegionJoin API
Another useful API implemented in ADAM is the ShuffleRegionJoin API, which joins two genomic datasets that contain overlapping regions. This primitive is useful for a number of applications including Variant calls and Feature mapping.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 6, 2017

Member
  1. It'd be good to give a more detailed example. E.g., "The region join primitive is useful for a number of applications including variant calling (identifying all of the reads that overlap a candidate variant) or Feature mapping (what is feature mapping?)."
  2. Break at 80 chars. ;)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 6, 2017

Member
  1. This section should talk about the distinction between a shuffle and broadcast region join, and why one is preferable to the other.
@@ -256,6 +256,39 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s ShuffleRegionJoin API

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 6, 2017

Member

s/Shuffle//g

This comment has been minimized.

Copy link
@devin-petersohn

devin-petersohn Feb 8, 2017

Author Member

I'm sorry I don't understand this comment. I looked up Markdown documentation, but this doesn't look like any Markdown syntax I could find.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Ah sorry, sed ;). I was saying to delete "shuffle"; this section should be docs for all region join implementations.

ADAM has a variety of ShuffleRegionJoin types that you can perform on your data, and all are called in a similar way:
``` scala

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 6, 2017

Member

I have a figure that we can use here (see below). I might suggest that instead of providing code snippets, we do the figure and a table, as a figure is worth a thousand words. ;) The table would show which joins are supported as broadcast joins and which are supported as shuffle joins. IIRC, everything is supported as a shuffle join.

joins

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 6, 2017

Member

I can extend this picture to demonstrate all the join types; I did a subset of joins since this was originally used in a presentation.

@AmplabJenkins
Copy link

AmplabJenkins commented Feb 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1776/
Test PASSed.

@fnothaft fnothaft added this to the 0.21.1 milestone Feb 6, 2017
@AmplabJenkins
Copy link

AmplabJenkins commented Feb 9, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1780/
Test PASSed.

@AmplabJenkins
Copy link

AmplabJenkins commented Feb 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1781/

Build result: FAILURE

[...truncated 38 lines...]Triggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@@ -256,6 +256,31 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s RegionJoin API
Another useful API implemented in ADAM is the RegionJoin API, which joins two genomic datasets that contain overlapping regions. This primitive is useful for a number of applications including variant calls (identifying all of the reads that overlap a candidate variant), coverage analysis (determining the coverage depth for each region in a reference), and indel realignment (identify indels aligned against a reference).

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Nit: please break lines at 80 chars throughout this PR.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Nits:

  • capitalize indels -> INDELs
  • "applications including variant calls" -> "applications including variant calling"
### Using ADAM’s RegionJoin API
Another useful API implemented in ADAM is the RegionJoin API, which joins two genomic datasets that contain overlapping regions. This primitive is useful for a number of applications including variant calls (identifying all of the reads that overlap a candidate variant), coverage analysis (determining the coverage depth for each region in a reference), and indel realignment (identify indels aligned against a reference).
There are two implementations of joins available in ADAM: BroadcastRegionJoin and ShuffleRegionJoin. The result of a ShuffleRegionJoin is identical to the BroadcastRegionJoin, however they should each be used in specific circumstances. The ShuffleRegionJoin performs a copartition on the right dataset before the join on the entire dataset, but the BroadcastRegionJoin sends a copy of the entire right dataset to each node. ShuffleRegionJoin should be used in the case that the right dataset is too large to send to all nodes or if there is high coverage skew in either dataset. The BroadcastRegionJoin should be used when you are joining a smaller dataset to a larger one.

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member
  • "implementations of joins" -> "join implementations"
  • "ShuffleRegionJoin should be used in the case" -> "ShuffleRegionJoin should be used if"
  • RE: "if there is high coverage skew in either dataset," this is not necessarily true. Broadcast joins work really well if your left side has hot keys and a randomized partitioning.
ADAM has a variety of ShuffleRegionJoin types that you can perform on your data, and all are called in a similar way:
![Joins Available](https://cloud.githubusercontent.com/assets/3752466/22659423/700c8aae-ec52-11e6-8534-0e364f91700a.png)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Don't use this link. You should check the file into the ADAM repo as part of this commit.

This comment has been minimized.

Copy link
@devin-petersohn

devin-petersohn Feb 10, 2017

Author Member

Sounds great! Should I create a new image directory, or does it belong in docs?

There are two implementations of joins available in ADAM: BroadcastRegionJoin and ShuffleRegionJoin. The result of a ShuffleRegionJoin is identical to the BroadcastRegionJoin, however they should each be used in specific circumstances. The ShuffleRegionJoin performs a copartition on the right dataset before the join on the entire dataset, but the BroadcastRegionJoin sends a copy of the entire right dataset to each node. ShuffleRegionJoin should be used in the case that the right dataset is too large to send to all nodes or if there is high coverage skew in either dataset. The BroadcastRegionJoin should be used when you are joining a smaller dataset to a larger one.
To perform a ShuffleRegionJoin, add the following to your ADAM script:

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Add a broadcast join example too.

Join call | action |
----------|--------|
```dataset1.shuffleRegionJoin(dataset2) ```| perform an inner join

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Should we assume that our audience will know what a left/right/full outer join is? CC @heuermh @akmorrow13 @laserson

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 9, 2017

Member

Also:

  • FullOuter is missing
  • Table should show which joins are implemented for broadcast and which are implemented for shuffle
@@ -256,6 +256,31 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s RegionJoin API

This comment has been minimized.

Copy link
@heuermh

heuermh Feb 10, 2017

Member

I think this section is in the wrong doc, should it not be in 70_algorithms.md?

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 10, 2017

Member

My general thinking was that 70_algorithms.md is for the detailed implementation notes (e.g., "to optimize join performance, we use an XYZ tree with an ABC load balancer.") while this is more intended as documentation for someone building an app using the ADAM APIs. I do agree though; I think this "chapter" reads right now as more of a "pull in these dependencies", less "here is how to use the APIs". Perhaps we should split the chapter?

This comment has been minimized.

Copy link
@heuermh

heuermh Feb 10, 2017

Member

Sure, maybe a new features or API highlights doc, the chapter before this one?

This comment has been minimized.

Copy link
@fnothaft

fnothaft Feb 10, 2017

Member

+1 for "API highlights"

@laserson
Copy link
Contributor

laserson commented Feb 10, 2017

I'm a bit out of context, but I just used the lastest ADAM recently to work on the 2nd edition of the spark book. The new APIs since then have made things so much easier! But one thing is that AFAICT there is no such thing as BroadcastRegionJoin....there is now TreeRegionJoin?

@coveralls
Copy link

coveralls commented Mar 1, 2017

Coverage Status

Changes Unknown when pulling 3ed5983 on devin-petersohn:joinDocs into ** on bigdatagenomics:master**.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 1, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1822/
Test PASSed.

Copy link
Member

fnothaft left a comment

A few small nits.

data, and all are called in a similar way:
![Joins Available]
(https://github.com/bigdatagenomics/adam/tree/master/docs/source/img/join_examples.png)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Mar 3, 2017

Member

Nit: this link should be internal to the repo, not to the website.

There are two overlap join implementations available in ADAM:
BroadcastRegionJoin and ShuffleRegionJoin. The result of a ShuffleRegionJoin
is identical to the BroadcastRegionJoin, however they serve different

This comment has been minimized.

Copy link
@fnothaft

fnothaft Mar 3, 2017

Member

I would add that the shuffle region join supports a few more operations (e.g., full outer join).

This comment has been minimized.

Copy link
@devin-petersohn

devin-petersohn Mar 6, 2017

Author Member

Added a line in a new paragraph linking to the table below.

(https://github.com/bigdatagenomics/adam/tree/master/docs/source/img/join_examples.png)
Join call | action |

This comment has been minimized.

Copy link
@fnothaft

fnothaft Mar 3, 2017

Member

Table looks good! Can we add a column that lists whether the action is supported in shuffle, broadcast, or both?

@fnothaft
Copy link
Member

fnothaft commented Mar 6, 2017

@devin-petersohn just pinging for an update to this. Let me know when you've finished your pass. From a PR sequencing perspective, I'd like to pull this in to #1422.

@coveralls
Copy link

coveralls commented Mar 6, 2017

Coverage Status

Changes Unknown when pulling 26ce925 on devin-petersohn:joinDocs into ** on bigdatagenomics:master**.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1839/
Test PASSed.

@fnothaft
Copy link
Member

fnothaft commented Mar 10, 2017

Closing as this has merged into #1422 now. Thanks @devin-petersohn!

@fnothaft fnothaft closed this Mar 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.