[ADAM-1533] Set Theory #1561

Closed
devin-petersohn wants to merge 13 commits into bigdatagenomics:master from devin-petersohn:issue#1533setTheory

5 participants
@devin-petersohn
Member

devin-petersohn commented Jun 9, 2017

WIP. Looking for quick feedback on architecture changes. The idea is to move the prepare code into the individual set theory primitive classes, which will unbloat GenomicRDD a bit.

Most of the primitives can be reduced to post-processing on the ShuffleRegionJoin implementations now that I have generalized joins to allow distances also.
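To make the "join with a distance, then post-process" idea concrete, here is a minimal sketch. It is not the code in this PR: plain Scala collections stand in for RDDs, `Region` is a simplified stand-in for `ReferenceRegion`, and the names `joinWithin` and `intersections` are hypothetical.

```scala
// Hypothetical sketch: Seq stands in for an RDD, Region for ReferenceRegion.
object DistanceJoinSketch {
  // Half-open interval [start, end) on a named reference sequence.
  final case class Region(name: String, start: Long, end: Long) {
    // 0 when the regions overlap, 1 when they are adjacent, larger as the
    // gap grows, and None when they are on different reference sequences.
    def distance(other: Region): Option[Long] =
      if (name != other.name) None
      else if (end <= other.start) Some(other.start - end + 1)
      else if (other.end <= start) Some(start - other.end + 1)
      else Some(0L)
  }

  // Naive O(n * m) join that keeps every pair within `threshold`.
  // A threshold of 0 reduces to a plain overlap join.
  def joinWithin[T, U](
    left: Seq[(Region, T)],
    right: Seq[(Region, U)],
    threshold: Long): Seq[(T, U)] =
    for {
      (lr, l) <- left
      (rr, r) <- right
      d <- lr.distance(rr).toSeq
      if d <= threshold
    } yield (l, r)

  // Post-processing example: an intersection primitive expressed as an
  // overlap join followed by a map over the joined pairs.
  def intersections(left: Seq[Region], right: Seq[Region]): Seq[Region] =
    joinWithin(left.map(r => (r, r)), right.map(r => (r, r)), 0L).map {
      case (a, b) =>
        Region(a.name, math.max(a.start, b.start), math.min(a.end, b.end))
    }
}
```

In this sketch, a window-style primitive is just `joinWithin` with a non-zero threshold, while intersection needs only a `map` over the threshold-0 join.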

TODO:

  • Implement one-to-self set theory primitives
  • Create Test cases
  • Better/More complete Docs

coveralls commented Jun 9, 2017

Coverage Status

Coverage decreased (-0.3%) to 82.842% when pulling 966be93 on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

AmplabJenkins commented Jun 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2090/

Build result: ABORTED

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1561/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 1daa69788cb362d5664b22a495ca180ca1b99871 # timeout=10
Checking out Revision 1daa69788cb362d5664b22a495ca180ca1b99871 (origin/pr/1561/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 1daa69788cb362d5664b22a495ca180ca1b99871
First time build. Skipping changelog.
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,2.0.0,centos
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result ABORTED
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result ABORTED
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Jun 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2091/

Build result: FAILURE

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1561/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c9ef99a590e6f63b0763b6a25b9df9f8dbf90d0e # timeout=10
Checking out Revision c9ef99a590e6f63b0763b6a25b9df9f8dbf90d0e (origin/pr/1561/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f c9ef99a590e6f63b0763b6a25b9df9f8dbf90d0e
First time build. Skipping changelog.
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,2.0.0,centos
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@devin-petersohn devin-petersohn changed the title from Issue#1533set theory to [ADAM-1533] Set Theory Jun 9, 2017

coveralls commented Jun 9, 2017

Coverage Status

Coverage increased (+0.2%) to 83.336% when pulling 348d150 on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

AmplabJenkins commented Jun 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2094/

Build result: FAILURE

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1561/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 866341412bcb1847e609c4971aeb0eb21ad60026 # timeout=10
Checking out Revision 866341412bcb1847e609c4971aeb0eb21ad60026 (origin/pr/1561/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 866341412bcb1847e609c4971aeb0eb21ad60026
First time build. Skipping changelog.
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,2.0.0,centos
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

I've left a few detailed comments within. I'm hesitant to move forward with this. My main architectural objection is that I think we're inverting an abstraction. Is there any reason we can't build the set theory primitives on top of the join code, instead of building the set theory operators using the guts of the join code? See my comment on closest for a concrete example.

Additionally, I'd like to avoid spreading the partitionMap data structure outside of the GenomicRDD hierarchy. We have to do a full scan of the data to compute the partitionMap, and I believe that we can implement a lighter weight alternative that is cheaper to compute and that is compatible with legacy formats.

adam-core/src/main/scala/org/bdgenomics/adam/models/ReferenceRegion.scala (outdated)
@@ -66,7 +67,6 @@ private[rdd] object GenomicRDD {
* Replaces file references in a command.
*
* @see pipe
*

@fnothaft

fnothaft Jun 9, 2017

Member

Please revert comment spacing changes throughout this file.

@devin-petersohn

devin-petersohn Jun 12, 2017

Member

I will revert these on a future push. I want to avoid reverting and re-reverting repeatedly.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/settheory/Closest.scala (outdated)
adam-core/src/main/scala/org/bdgenomics/adam/rdd/settheory/SetTheory.scala (outdated)
adam-core/src/main/scala/org/bdgenomics/adam/rdd/settheory/SetTheory.scala (outdated)
...src/main/scala/org/bdgenomics/adam/rdd/settheory/ShuffleRegionJoin.scala (outdated)
...src/main/scala/org/bdgenomics/adam/rdd/settheory/ShuffleRegionJoin.scala (outdated)
...src/main/scala/org/bdgenomics/adam/rdd/settheory/ShuffleRegionJoin.scala (outdated)
adam-core/src/main/scala/org/bdgenomics/adam/rdd/settheory/Closest.scala (outdated)
adam-core/src/main/scala/org/bdgenomics/adam/rdd/GenomicRDD.scala (outdated)

coveralls commented Jun 12, 2017

Coverage Status

Coverage increased (+0.05%) to 83.17% when pulling a4d196e on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

AmplabJenkins commented Jun 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2095/
Test PASSed.

AmplabJenkins commented Jun 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2096/
Test PASSed.

coveralls commented Jun 13, 2017

Coverage Status

Coverage increased (+0.2%) to 83.333% when pulling 564bc7d on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

coveralls commented Jun 13, 2017

Coverage Status

Coverage decreased (-0.3%) to 82.843% when pulling 564bc7d on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

coveralls commented Jun 13, 2017

Coverage Status

Coverage increased (+0.03%) to 83.155% when pulling a271b45 on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

AmplabJenkins commented Jun 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2097/
Test PASSed.

@fnothaft

Hi @devin-petersohn,

This is related to the comments I was making in our meeting on Monday. I don't like this architectural shift, because all of the set theoretic operations we are implementing should be able to be implemented on top of the region join primitive. I believe that all the primitives should map into a join [ -> aggregate ] -> predicate flow. There are several advantages to this:

  1. This will allow us to support both join strategies (shuffle and broadcast) in most cases.
  2. This approach requires substantially less code.
  3. This approach should further isolate bugs.
  4. This approach would allow us to minimize the openness of the interfaces we build the join primitives from.
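As an illustration of the join [ -> aggregate ] -> predicate flow described above, here is a rough sketch of a distance-bounded "closest" primitive built only from a join. Everything in it is an assumption made for illustration: `Region` is a simplified stand-in for `ReferenceRegion`, Scala collections stand in for RDDs, and `closestWithin` is a hypothetical name.

```scala
// Hypothetical sketch of join -> aggregate -> predicate on plain collections.
object JoinAggregatePredicateSketch {
  final case class Region(name: String, start: Long, end: Long) {
    // 0 for overlap, 1 for adjacency, None across reference sequences.
    def distance(other: Region): Option[Long] =
      if (name != other.name) None
      else if (end <= other.start) Some(other.start - end + 1)
      else if (other.end <= start) Some(start - other.end + 1)
      else Some(0L)
  }

  // Step 1 (join): pair each left region with every right region that lies
  // within maxDistance, carrying the distance along.
  def join(
    left: Seq[Region],
    right: Seq[Region],
    maxDistance: Long): Seq[(Region, Region, Long)] =
    for {
      l <- left
      r <- right
      d <- l.distance(r).toSeq
      if d <= maxDistance
    } yield (l, r, d)

  // Step 2 (aggregate): group the joined pairs by left region.
  // Step 3 (predicate): keep only the hits at the minimum distance.
  def closestWithin(
    left: Seq[Region],
    right: Seq[Region],
    maxDistance: Long): Map[Region, Seq[Region]] =
    join(left, right, maxDistance)
      .groupBy(_._1)
      .map {
        case (l, hits) =>
          val best = hits.map(_._3).min
          l -> hits.collect { case (_, r, d) if d == best => r }
      }
}
```

Because the primitive only consumes the output of `join`, the join strategy underneath could in principle be swapped without touching steps 2 and 3. Left regions with no hit inside `maxDistance` are simply absent from the result.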

Additionally, I'm not a big fan of making the partition map visible outside of GenomicRDD. Again, there are several reasons for this:

  1. I think that if the entrypoint to ShuffleRegionJoin clearly assumes the contract that the two RDDs are copartitioned, then we can require that all callers enforce that contract, whether they are starting with a GenomicRDD or a plain ReferenceRegion-keyed RDD. If we have an entrypoint that does the prepwork and sets up the contract, then the contract is a bit less clear if you are calling with a ReferenceRegion-keyed RDD. I'm word vomiting a bit here, so let me know if this makes sense or not.
  2. Additionally, this change means that we need to open up various protections on the partition map.
  3. I would like to avoid at least part of that, because I think we can refactor the partition map in a later PR to simplify the data structure and make it easier to compute.
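One way to read the copartitioning contract from the first point is an entrypoint that checks the precondition up front and does no preparation of its own. The sketch below is only an illustration under simplifying assumptions: `Partitioned` and `joinCopartitioned` are hypothetical names, partitions are plain collections, and records that straddle a partition boundary are ignored.

```scala
// Hypothetical sketch: the join entrypoint enforces its contract on callers.
object CopartitionContractSketch {
  final case class Region(name: String, start: Long, end: Long) {
    def overlaps(other: Region): Boolean =
      name == other.name && start < other.end && other.start < end
  }

  // Stand-in for a partitioned, region-keyed dataset: one (bounds, records)
  // entry per partition.
  final case class Partitioned[T](partitions: Vector[(Region, Seq[(Region, T)])]) {
    def bounds: Vector[Region] = partitions.map(_._1)
  }

  // Callers are responsible for copartitioning; mismatched inputs fail fast
  // instead of silently producing an incomplete join.
  def joinCopartitioned[T, U](
    left: Partitioned[T],
    right: Partitioned[U]): Seq[(T, U)] = {
    require(left.bounds == right.bounds, "inputs must be copartitioned")
    left.partitions.zip(right.partitions).flatMap {
      case ((_, ls), (_, rs)) =>
        // Only records in matching partitions are ever compared.
        for {
          (lr, l) <- ls
          (rr, r) <- rs
          if lr.overlaps(rr)
        } yield (l, r)
    }
  }
}
```

With this shape, any preparation step lives with the caller, whether the input started as a `GenomicRDD` or as a plain region-keyed collection.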

Let me know your thoughts.

Member

devin-petersohn commented Jun 15, 2017

> all of the set theoretic operations we are implementing should be able to be implemented on top of the region join primitive

Except for unbounded closest, this is true. Unbounded closest is one of the reasons the SetOperation class is abstracted this way: it uses much of the same architecture as joins and differs only in how it computes the closest and how it copartitions the data. There are also SetOperation abstractions we need when we are dealing with a single collection, e.g. Merge. Merge can be implemented with a series of joins, but it would be much more expensive because of the additional shuffle phases. I still have the SetOperationWithSingleCollection (and all the one-to-self primitives) to implement.
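For a sense of why a single-collection primitive such as Merge does not need a join, here is a minimal local sketch: one sort followed by one linear sweep. It is an illustration only, not the planned SetOperationWithSingleCollection; `Region` is a simplified stand-in for `ReferenceRegion`, and on a cluster the sort would correspond to a single shuffle.

```scala
// Hypothetical sketch: merge overlapping or adjacent regions in one pass.
object MergeSketch {
  // Half-open interval [start, end) on a named reference sequence.
  final case class Region(name: String, start: Long, end: Long)

  def merge(regions: Seq[Region]): Seq[Region] =
    regions
      .sortBy(r => (r.name, r.start, r.end))
      .foldLeft(List.empty[Region]) {
        // The current region overlaps or touches the most recently emitted
        // one, so extend that one instead of emitting a new region.
        case (prev :: done, r) if prev.name == r.name && r.start <= prev.end =>
          prev.copy(end = math.max(prev.end, r.end)) :: done
        // Otherwise start a new merged region.
        case (acc, r) => r :: acc
      }
      .reverse
}
```

Expressing the same result through joins would mean joining the collection against itself and then collapsing chains of overlapping pairs, which is where the extra shuffle phases come from.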

> I believe that all the primitives should map into a join [ -> aggregate ] -> predicate flow

I agree 100% that the majority of the operations should be performed with a join as the first phase, independent of the type of join. However, this will not work, or be efficient, for UnboundedClosest and many one-to-self set theory operations, which are not yet in this PR. I can start laying the groundwork for allowing this, but fully implementing it will probably require a minor refactor of the broadcastRegionJoin code, and perhaps belongs in a separate PR.

> I think that if the entrypoint to ShuffleRegionJoin clearly assumes the contract that the two RDDs are copartitioned

The way that it is currently architected, it does not assume this. The goal for pulling the prepareForShuffleRegionJoin code out was to reduce the bloat in GenomicRDD. It seems reasonable to me for SetOperations to prepare the data themselves, particularly if they have a different partitioning requirement than shuffleRegionJoin. This also gives us strong guarantees around not keeping RDDs that have duplicated records from copartitioning. If we do want to allow both SetOperation(GenomicRDD) and SetOperation(RDD[(ReferenceRegion, T)]), it seems that having the prepare code in the SetOperation class would be easiest. In the case that users want to call ShuffleRegionJoin(...).compute() themselves, they can do so without having to worry about how their data is partitioned. I agree that ideally, users would just call GenomicRDD.shuffleRegionJoin(), but there are cases where they cannot. I personally believe there is a stronger case for moving it into the SetOperations and letting each class define the optimal partition scheme for its respective operation, but I am happy to hear the opposing arguments.
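The shape argued for here, where each operation owns its preparation step, might look roughly like the following. This is a sketch under assumptions, not the classes in this PR: the signatures of `SetOperation`, `prepare`, and `compute` are guesses, `InnerOverlapJoin` is a hypothetical subclass, and sorting stands in for copartitioning.

```scala
// Hypothetical sketch: each set operation decides how to prepare its inputs.
object SetOperationSketch {
  final case class Region(name: String, start: Long, end: Long) {
    def overlaps(other: Region): Boolean =
      name == other.name && start < other.end && other.start < end
  }

  abstract class SetOperation[T, U, R] {
    // Each operation chooses its own partitioning scheme here.
    protected def prepare(
      left: Seq[(Region, T)],
      right: Seq[(Region, U)]): (Seq[(Region, T)], Seq[(Region, U)])

    // The operation itself, run on prepared inputs only.
    protected def process(
      left: Seq[(Region, T)],
      right: Seq[(Region, U)]): Seq[R]

    // Callers never need to know how the data was partitioned.
    final def compute(left: Seq[(Region, T)], right: Seq[(Region, U)]): Seq[R] = {
      val (l, r) = prepare(left, right)
      process(l, r)
    }
  }

  class InnerOverlapJoin[T, U] extends SetOperation[T, U, (T, U)] {
    // Sorting by region stands in for copartitioning in this local sketch.
    protected def prepare(
      left: Seq[(Region, T)],
      right: Seq[(Region, U)]): (Seq[(Region, T)], Seq[(Region, U)]) =
      (left.sortBy(p => (p._1.name, p._1.start, p._1.end)),
        right.sortBy(p => (p._1.name, p._1.start, p._1.end)))

    protected def process(
      left: Seq[(Region, T)],
      right: Seq[(Region, U)]): Seq[(T, U)] =
      for {
        (lr, l) <- left
        (rr, r) <- right
        if lr.overlaps(rr)
      } yield (l, r)
  }
}
```

An operation with different partitioning needs, such as an unbounded closest, would override `prepare` without changing how callers invoke `compute`.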

> I'm a strong -1 on opening up protections on the partition map.

I am not sure what you mean by opening up protections on the partition map since I have moved from private[rdd] -> protected. I do understand what you are saying about setting the optPartitionMap access to private[GenomicRDD], but I disagree with making it that strict. Part of the reason I switched to passing in GenomicRDDs for the SetOperations was because we have the prepare() code in the SetOperations, so they also need to know how the data is partitioned. One-to-self operations need access to the PartitionMap structure to avoid extra shuffle phases and reduce skew.

> As you may have guessed, I am also a strong -1 on making a PartitionMap class, esp. if it is not private to GenomicRDD.

The PartitionMap class is a part of a (hopefully) better way of handling sorted data. I think we would rather know that the data is sorted, independent of whether or not the data has a partition map (which would begin to solve our issues of sorted legacy file formats). I plan to have an accompanying object that builds the PartitionMap when given a GenomicRDD or RDD[(ReferenceRegion, T)]. Making the partitionMap a lazy val will also reduce the amount of code and avoid computing it when it isn't needed. GenomicRDDs will take in a Boolean value for sorted rather than an optional partitionMap. There are also a lot of common operations performed on the PartitionMap (toIntervalArray). The PartitionMap class would live in its own file and be private[rdd]. Having a separate PartitionMap class would also clean out a lot of code in GenomicRDD related to computing the optPartitionMap.
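A rough sketch of the proposal in this paragraph, with a `sorted` flag on the dataset and a lazily built partition map, is below. All names and shapes are assumptions for illustration: `SortedDataset` stands in for a GenomicRDD, partitions are plain collections, and the bounds are simply the first and last region of each partition.

```scala
// Hypothetical sketch: a sorted flag plus a lazily computed partition map.
object PartitionMapSketch {
  final case class Region(name: String, start: Long, end: Long)

  final class PartitionMap(partitions: Seq[Seq[Region]]) {
    // First and last region of each partition, or None for an empty
    // partition. Assumes every partition is already sorted.
    val bounds: Vector[Option[(Region, Region)]] =
      partitions.map(p => p.headOption.map(first => (first, p.last))).toVector
  }

  final case class SortedDataset(partitions: Seq[Seq[Region]], sorted: Boolean) {
    // Built on first access only, so datasets that never need the map
    // never pay for computing it. Unsorted data has no partition map.
    lazy val partitionMap: Option[PartitionMap] =
      if (sorted) Some(new PartitionMap(partitions)) else None
  }
}
```

In this shape the dataset carries only the Boolean, and helper operations on the map (the comment mentions toIntervalArray) would live on the `PartitionMap` class rather than on the dataset.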

Sorry for the wall of text. Feel free to address anything I've said.

* Perform an Inner ShuffleRegionJoin. This is publicly accessible to be
* compatible with legacy code.
*/
object InnerShuffleRegionJoin {

@devin-petersohn

devin-petersohn Jun 21, 2017

Member

Pinging @heuermh and @fnothaft to get your thoughts on this architecture. This was what I came up with to guarantee consistency with our previous implementation. From the user point of view, it looks exactly the same as it did before, despite looking different under the hood.

coveralls commented Jun 21, 2017

Coverage Status

Coverage increased (+0.4%) to 83.527% when pulling 1f73378 on devin-petersohn:issue#1533setTheory into ad5ae6d on bigdatagenomics:master.

AmplabJenkins commented Jun 21, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2111/

Build result: FAILURE

[...truncated 15 lines...]
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1561/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains f08e10f43371f4280767a3e0c8b22fc4bc6de9f8 # timeout=10
Checking out Revision f08e10f43371f4280767a3e0c8b22fc4bc6de9f8 (origin/pr/1561/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f f08e10f43371f4280767a3e0c8b22fc4bc6de9f8
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.10,2.0.0,centos
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Member

devin-petersohn commented Sep 18, 2017

Closing as won't merge. This work belongs in a downstream app.

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
