New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ShuffleRegionJoin usage docs #1384

Closed
wants to merge 6 commits into
from

Conversation

Projects
None yet
6 participants
@devin-petersohn
Member

devin-petersohn commented Feb 6, 2017

No description provided.

@fnothaft

Good first cut! A few small comments inside.

Show outdated Hide outdated docs/source/60_building_apps.md
@@ -256,6 +256,39 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s ShuffleRegionJoin API
Another useful API implemented in ADAM is the ShuffleRegionJoin API, which joins two genomic datasets that contain overlapping regions. This primitive is useful for a number of applications including Variant calls and Feature mapping.

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2017

Member
  1. It'd be good to give a more detailed example. E.g., "The region join primitive is useful for a number of applications including variant calling (identifying all of the reads that overlap a candidate variant) or Feature mapping (what is feature mapping?)."
  2. Break at 80 chars. ;)
@fnothaft

fnothaft Feb 6, 2017

Member
  1. It'd be good to give a more detailed example. E.g., "The region join primitive is useful for a number of applications including variant calling (identifying all of the reads that overlap a candidate variant) or Feature mapping (what is feature mapping?)."
  2. Break at 80 chars. ;)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2017

Member
  1. This section should talk about the distinction between a shuffle and broadcast region join, and why one is preferable to the other.
@fnothaft

fnothaft Feb 6, 2017

Member
  1. This section should talk about the distinction between a shuffle and broadcast region join, and why one is preferable to the other.
Show outdated Hide outdated docs/source/60_building_apps.md
@@ -256,6 +256,39 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s ShuffleRegionJoin API

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2017

Member

s/Shuffle//g

@fnothaft

fnothaft Feb 6, 2017

Member

s/Shuffle//g

This comment has been minimized.

@devin-petersohn

devin-petersohn Feb 8, 2017

Member

I'm sorry I don't understand this comment. I looked up Markdown documentation, but this doesn't look like any Markdown syntax I could find.

@devin-petersohn

devin-petersohn Feb 8, 2017

Member

I'm sorry I don't understand this comment. I looked up Markdown documentation, but this doesn't look like any Markdown syntax I could find.

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Ah sorry, sed ;). I was saying to delete "shuffle"; this section should be docs for all region join implementations.

@fnothaft

fnothaft Feb 9, 2017

Member

Ah sorry, sed ;). I was saying to delete "shuffle"; this section should be docs for all region join implementations.

Show outdated Hide outdated docs/source/60_building_apps.md
ADAM has a variety of ShuffleRegionJoin types that you can perform on your data, and all are called in a similar way:
``` scala

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2017

Member

I have a figure that we can use here (see below). I might suggest that instead of providing code snippets, we do the figure and a table, as a figure is worth a thousand words. ;) The table would show which joins are supported as broadcast joins and which are supported as shuffle joins. IIRC, everything is supported as a shuffle join.

joins

@fnothaft

fnothaft Feb 6, 2017

Member

I have a figure that we can use here (see below). I might suggest that instead of providing code snippets, we do the figure and a table, as a figure is worth a thousand words. ;) The table would show which joins are supported as broadcast joins and which are supported as shuffle joins. IIRC, everything is supported as a shuffle join.

joins

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2017

Member

I can extend this picture to demonstrate all the join types; I did a subset of joins since this was originally used in a presentation.

@fnothaft

fnothaft Feb 6, 2017

Member

I can extend this picture to demonstrate all the join types; I did a subset of joins since this was originally used in a presentation.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1776/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1776/
Test PASSed.

@fnothaft fnothaft added this to the 0.21.1 milestone Feb 6, 2017

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 9, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1780/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1780/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1781/

Build result: FAILURE

[...truncated 38 lines...]Triggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1781/

Build result: FAILURE

[...truncated 38 lines...]Triggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Show outdated Hide outdated docs/source/60_building_apps.md
@@ -256,6 +256,31 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s RegionJoin API
Another useful API implemented in ADAM is the RegionJoin API, which joins two genomic datasets that contain overlapping regions. This primitive is useful for a number of applications including variant calls (identifying all of the reads that overlap a candidate variant), coverage analysis (determining the coverage depth for each region in a reference), and indel realignment (identify indels aligned against a reference).

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Nit: please break lines at 80 chars throughout this PR.

@fnothaft

fnothaft Feb 9, 2017

Member

Nit: please break lines at 80 chars throughout this PR.

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Nits:

  • capitalize indels -> INDELs
  • "applications including variant calls" -> "applications including variant calling"
@fnothaft

fnothaft Feb 9, 2017

Member

Nits:

  • capitalize indels -> INDELs
  • "applications including variant calls" -> "applications including variant calling"
Show outdated Hide outdated docs/source/60_building_apps.md
### Using ADAM’s RegionJoin API
Another useful API implemented in ADAM is the RegionJoin API, which joins two genomic datasets that contain overlapping regions. This primitive is useful for a number of applications including variant calls (identifying all of the reads that overlap a candidate variant), coverage analysis (determining the coverage depth for each region in a reference), and indel realignment (identify indels aligned against a reference).
There are two implementations of joins available in ADAM: BroadcastRegionJoin and ShuffleRegionJoin. The result of a ShuffleRegionJoin is identical to the BroadcastRegionJoin, however they should each be used in specific circumstances. The ShuffleRegionJoin performs a copartition on the right dataset before the join on the entire dataset, but the BroadcastRegionJoin sends a copy of the entire right dataset to each node. ShuffleRegionJoin should be used in the case that the right dataset is too large to send to all nodes or if there is high coverage skew in either dataset. The BroadcastRegionJoin should be used when you are joining a smaller dataset to a larger one.

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member
  • "implementations of joins" -> "join implementations"
  • "ShuffleRegionJoin should be used in the case" -> "ShuffleRegionJoin should be used if"
  • RE: "if there is high coverage skew in either dataset," this is not necessarily true. Broadcast joins work really well if your left side has hot keys and a randomized partitioning.
@fnothaft

fnothaft Feb 9, 2017

Member
  • "implementations of joins" -> "join implementations"
  • "ShuffleRegionJoin should be used in the case" -> "ShuffleRegionJoin should be used if"
  • RE: "if there is high coverage skew in either dataset," this is not necessarily true. Broadcast joins work really well if your left side has hot keys and a randomized partitioning.
Show outdated Hide outdated docs/source/60_building_apps.md
ADAM has a variety of ShuffleRegionJoin types that you can perform on your data, and all are called in a similar way:
![Joins Available](https://cloud.githubusercontent.com/assets/3752466/22659423/700c8aae-ec52-11e6-8534-0e364f91700a.png)

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Don't use this link. You should check the file into the ADAM repo as part of this commit.

@fnothaft

fnothaft Feb 9, 2017

Member

Don't use this link. You should check the file into the ADAM repo as part of this commit.

This comment has been minimized.

@devin-petersohn

devin-petersohn Feb 10, 2017

Member

Sounds great! Should I create a new image directory, or does it belong in docs?

@devin-petersohn

devin-petersohn Feb 10, 2017

Member

Sounds great! Should I create a new image directory, or does it belong in docs?

Show outdated Hide outdated docs/source/60_building_apps.md
There are two implementations of joins available in ADAM: BroadcastRegionJoin and ShuffleRegionJoin. The result of a ShuffleRegionJoin is identical to the BroadcastRegionJoin, however they should each be used in specific circumstances. The ShuffleRegionJoin performs a copartition on the right dataset before the join on the entire dataset, but the BroadcastRegionJoin sends a copy of the entire right dataset to each node. ShuffleRegionJoin should be used in the case that the right dataset is too large to send to all nodes or if there is high coverage skew in either dataset. The BroadcastRegionJoin should be used when you are joining a smaller dataset to a larger one.
To perform a ShuffleRegionJoin, add the following to your ADAM script:

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Add a broadcast join example too.

@fnothaft

fnothaft Feb 9, 2017

Member

Add a broadcast join example too.

Show outdated Hide outdated docs/source/60_building_apps.md
Join call | action |
----------|--------|
```dataset1.shuffleRegionJoin(dataset2) ```| perform an inner join

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Should we assume that our audience will know what a left/right/full outer join is? CC @heuermh @akmorrow13 @laserson

@fnothaft

fnothaft Feb 9, 2017

Member

Should we assume that our audience will know what a left/right/full outer join is? CC @heuermh @akmorrow13 @laserson

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2017

Member

Also:

  • FullOuter is missing
  • Table should show which joins are implemented for broadcast and which are implemented for shuffle
@fnothaft

fnothaft Feb 9, 2017

Member

Also:

  • FullOuter is missing
  • Table should show which joins are implemented for broadcast and which are implemented for shuffle
@@ -256,6 +256,31 @@ $ spark-submit \
A complete example of this pattern can be found in the
[heuermh/adam-examples](https://github.com/heuermh/adam-examples) repository.
### Using ADAM’s RegionJoin API

This comment has been minimized.

@heuermh

heuermh Feb 10, 2017

Member

I think this section is in the wrong doc, should it not be in 70_algorithms.md?

@heuermh

heuermh Feb 10, 2017

Member

I think this section is in the wrong doc, should it not be in 70_algorithms.md?

This comment has been minimized.

@fnothaft

fnothaft Feb 10, 2017

Member

My general thinking was that 70_algorithms.md is for the detailed implementation notes (e.g., "to optimize join performance, we use an XYZ tree with an ABC load balancer.") while this is more intended as documentation for someone building an app using the ADAM APIs. I do agree though; I think this "chapter" reads right now as more of a "pull in these dependencies", less "here is how to use the APIs". Perhaps we should split the chapter?

@fnothaft

fnothaft Feb 10, 2017

Member

My general thinking was that 70_algorithms.md is for the detailed implementation notes (e.g., "to optimize join performance, we use an XYZ tree with an ABC load balancer.") while this is more intended as documentation for someone building an app using the ADAM APIs. I do agree though; I think this "chapter" reads right now as more of a "pull in these dependencies", less "here is how to use the APIs". Perhaps we should split the chapter?

This comment has been minimized.

@heuermh

heuermh Feb 10, 2017

Member

Sure, maybe a new features or API highlights doc, the chapter before this one?

@heuermh

heuermh Feb 10, 2017

Member

Sure, maybe a new features or API highlights doc, the chapter before this one?

This comment has been minimized.

@fnothaft

fnothaft Feb 10, 2017

Member

+1 for "API highlights"

@fnothaft

fnothaft Feb 10, 2017

Member

+1 for "API highlights"

@laserson

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Feb 10, 2017

Contributor

I'm a bit out of context, but I just used the lastest ADAM recently to work on the 2nd edition of the spark book. The new APIs since then have made things so much easier! But one thing is that AFAICT there is no such thing as BroadcastRegionJoin....there is now TreeRegionJoin?

Contributor

laserson commented Feb 10, 2017

I'm a bit out of context, but I just used the lastest ADAM recently to work on the 2nd edition of the spark book. The new APIs since then have made things so much easier! But one thing is that AFAICT there is no such thing as BroadcastRegionJoin....there is now TreeRegionJoin?

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 1, 2017

Coverage Status

Changes Unknown when pulling 3ed5983 on devin-petersohn:joinDocs into ** on bigdatagenomics:master**.

coveralls commented Mar 1, 2017

Coverage Status

Changes Unknown when pulling 3ed5983 on devin-petersohn:joinDocs into ** on bigdatagenomics:master**.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 1, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1822/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1822/
Test PASSed.

@fnothaft

A few small nits.

Show outdated Hide outdated docs/source/60_building_apps.md
data, and all are called in a similar way:
![Joins Available]
(https://github.com/bigdatagenomics/adam/tree/master/docs/source/img/join_examples.png)

This comment has been minimized.

@fnothaft

fnothaft Mar 3, 2017

Member

Nit: this link should be internal to the repo, not to the website.

@fnothaft

fnothaft Mar 3, 2017

Member

Nit: this link should be internal to the repo, not to the website.

There are two overlap join implementations available in ADAM:
BroadcastRegionJoin and ShuffleRegionJoin. The result of a ShuffleRegionJoin
is identical to the BroadcastRegionJoin, however they serve different

This comment has been minimized.

@fnothaft

fnothaft Mar 3, 2017

Member

I would add that the shuffle region join supports a few more operations (e.g., full outer join).

@fnothaft

fnothaft Mar 3, 2017

Member

I would add that the shuffle region join supports a few more operations (e.g., full outer join).

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 6, 2017

Member

Added a line in a new paragraph linking to the table below.

@devin-petersohn

devin-petersohn Mar 6, 2017

Member

Added a line in a new paragraph linking to the table below.

Show outdated Hide outdated docs/source/60_building_apps.md
(https://github.com/bigdatagenomics/adam/tree/master/docs/source/img/join_examples.png)
Join call | action |

This comment has been minimized.

@fnothaft

fnothaft Mar 3, 2017

Member

Table looks good! Can we add a column that lists whether the action is supported in shuffle, broadcast, or both?

@fnothaft

fnothaft Mar 3, 2017

Member

Table looks good! Can we add a column that lists whether the action is supported in shuffle, broadcast, or both?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 6, 2017

Member

@devin-petersohn just pinging for an update to this. Let me know when you've finished your pass. From a PR sequencing perspective, I'd like to pull this in to #1422.

Member

fnothaft commented Mar 6, 2017

@devin-petersohn just pinging for an update to this. Let me know when you've finished your pass. From a PR sequencing perspective, I'd like to pull this in to #1422.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 6, 2017

Coverage Status

Changes Unknown when pulling 26ce925 on devin-petersohn:joinDocs into ** on bigdatagenomics:master**.

coveralls commented Mar 6, 2017

Coverage Status

Changes Unknown when pulling 26ce925 on devin-petersohn:joinDocs into ** on bigdatagenomics:master**.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1839/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1839/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 10, 2017

Member

Closing as this has merged into #1422 now. Thanks @devin-petersohn!

Member

fnothaft commented Mar 10, 2017

Closing as this has merged into #1422 now. Thanks @devin-petersohn!

@fnothaft fnothaft closed this Mar 10, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment