New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1018] Add support for Spark SQL Datasets. #1391

Merged
merged 6 commits into from Jul 11, 2017

Conversation

Projects
10 participants
@fnothaft
Member

fnothaft commented Feb 15, 2017

Resolves #1018. Adds the adam-codegen module, which generates classes that:

  1. Implement the Scala Product interface and thus can be read into a Spark SQL Dataset.
  2. Have a complete constructor that is compatible with the constructor that Spark SQL expects to see when exporting a Dataset back to Scala.
  3. And, that have methods for converting to/from the bdg-formats Avro models.

Then, we build these model classes in the org.bdgenomics.adam.sql package, and use them for export from the Avro based GenomicRDDs. With a Dataset, we can then export to a DataFrame, which enables us to expose data through Python via RDD->Dataset->DataFrame. This is important since the Avro classes generated by bdg-formats can't be pickled, and thus we can't do a Java RDD to Python RDD crossing with them.

This is on top of #1387.

To do:

  • Add Python/SQL docs
  • Add pyadam-shell script (thoughts on name?)
  • Finish Dataset bind-on-load work, pending review comments
  • PyPI
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 16, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1788/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f6661e40fb6240282f6b7360de9797983ea202cc # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision f6661e40fb6240282f6b7360de9797983ea202cc (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f6661e40fb6240282f6b7360de9797983ea202ccFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Feb 16, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1788/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f6661e40fb6240282f6b7360de9797983ea202cc # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision f6661e40fb6240282f6b7360de9797983ea202cc (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f6661e40fb6240282f6b7360de9797983ea202ccFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Feb 16, 2017

Member

This is awesome, nice work! A few general questions before I dig in to review:

Are the new java-friendly methods on *RDD (from the first commit) necessary or simply more convenient?

Does the dataset stuff compile against Spark 1.6.x or is there some conditional build logic I'm not seeing?

Is name.capitalize in DumpSchemasToProduct going to work consistently everywhere (in other words, do we have any oddly camel-cased field names in bdg-formats)?

Member

heuermh commented Feb 16, 2017

This is awesome, nice work! A few general questions before I dig in to review:

Are the new java-friendly methods on *RDD (from the first commit) necessary or simply more convenient?

Does the dataset stuff compile against Spark 1.6.x or is there some conditional build logic I'm not seeing?

Is name.capitalize in DumpSchemasToProduct going to work consistently everywhere (in other words, do we have any oddly camel-cased field names in bdg-formats)?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Feb 16, 2017

Member

Thanks @heuermh!

The Java friendly methods are for the Python support in #1387. The TL;DR is that Py4j needs "proper" Java methods (no default parameters, value types are java.lang..., etc)

Datasets came in Spark 1.6.0. We'll need to drop Spark 1.5 and earlier, but I think that makes sense generally now that the blessed release of Spark is 2.1.0.

Yup! The Capitalize method just capitalizes the first letter of a string, which is the same behavior Avro uses for making getters/setters.

Member

fnothaft commented Feb 16, 2017

Thanks @heuermh!

The Java friendly methods are for the Python support in #1387. The TL;DR is that Py4j needs "proper" Java methods (no default parameters, value types are java.lang..., etc)

Datasets came in Spark 1.6.0. We'll need to drop Spark 1.5 and earlier, but I think that makes sense generally now that the blessed release of Spark is 2.1.0.

Yup! The Capitalize method just capitalizes the first letter of a string, which is the same behavior Avro uses for making getters/setters.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Feb 16, 2017

Member

+1 to dropping support for Spark 1.5, which is really only a todo for docs & Jenkins

Member

heuermh commented Feb 16, 2017

+1 to dropping support for Spark 1.5, which is really only a todo for docs & Jenkins

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Feb 17, 2017

Member

Rebased on latest and greatest changes over at #1387, and trimmed Jenkins down to Spark 1.6.2 and 2.0.0.

Member

fnothaft commented Feb 17, 2017

Rebased on latest and greatest changes over at #1387, and trimmed Jenkins down to Spark 1.6.2 and 2.0.0.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1804/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 4556d45 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 4556d45 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 4556d458547fb0947ceb9894d6912216d912c208First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Feb 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1804/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 4556d45 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 4556d45 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 4556d458547fb0947ceb9894d6912216d912c208First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1806/
Test PASSed.

AmplabJenkins commented Feb 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1806/
Test PASSed.

@fnothaft fnothaft referenced this pull request Feb 21, 2017

Merged

[ADAM-882] R API #1397

3 of 3 tasks complete
@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Mar 2, 2017

Member

Are there currently any tests that roundtrip an RDD (or its ADAM wrapper like VariantRDD) to a dataset and back?

Member

jpdna commented Mar 2, 2017

Are there currently any tests that roundtrip an RDD (or its ADAM wrapper like VariantRDD) to a dataset and back?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 2, 2017

Member

RE: tests, these are a bit thin, but https://github.com/bigdatagenomics/adam/pull/1391/files#diff-306c6dde689c4cbcb6d3e982dc401221R81 forces a round trip RDD->Dataset->RDD.

Member

fnothaft commented Mar 2, 2017

RE: tests, these are a bit thin, but https://github.com/bigdatagenomics/adam/pull/1391/files#diff-306c6dde689c4cbcb6d3e982dc401221R81 forces a round trip RDD->Dataset->RDD.

@@ -54,7 +54,7 @@ object ConsensusGenerator {
* present in a single aligned read back into the reference sequence where
* they are aligned.
*/
def fromReads: ConsensusGenerator = {
def fromReads(): ConsensusGenerator = {

This comment has been minimized.

@ryan-williams

ryan-williams Mar 7, 2017

Member

what's your rule for when to add these vs. not?

@ryan-williams

ryan-williams Mar 7, 2017

Member

what's your rule for when to add these vs. not?

This comment has been minimized.

@fnothaft

fnothaft Mar 7, 2017

Member

If we want it to be (easily) callable from Java (and thus Python), it seems that we need to have the parens.

@fnothaft

fnothaft Mar 7, 2017

Member

If we want it to be (easily) callable from Java (and thus Python), it seems that we need to have the parens.

@antonkulaga

This comment has been minimized.

Show comment
Hide comment
@antonkulaga

antonkulaga Mar 12, 2017

Contributor

What about moving from ugly-Java classes with getters and setters to case classes (like they have in https://github.com/hammerlab/spark-genomics )?

Contributor

antonkulaga commented Mar 12, 2017

What about moving from ugly-Java classes with getters and setters to case classes (like they have in https://github.com/hammerlab/spark-genomics )?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 12, 2017

Member

@antonkulaga I'd love to do that, but that would require an almost complete rewrite of the core read transformations and format conversion classes, which is pretty hard to justify from the perspective of engineering effort. It's something like a 10k+ LOC rewrite. Realistically, I think we'll move away from the Java Avro models slowly as we rewrite blocks on top of the Dataset APIs over the next year.

Member

fnothaft commented Mar 12, 2017

@antonkulaga I'd love to do that, but that would require an almost complete rewrite of the core read transformations and format conversion classes, which is pretty hard to justify from the perspective of engineering effort. It's something like a 10k+ LOC rewrite. Realistically, I think we'll move away from the Java Avro models slowly as we rewrite blocks on top of the Dataset APIs over the next year.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1937/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 74aafd990026fc13a97d878b93e1f945d27c2d49 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 74aafd990026fc13a97d878b93e1f945d27c2d49 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 74aafd990026fc13a97d878b93e1f945d27c2d49First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Apr 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1937/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 74aafd990026fc13a97d878b93e1f945d27c2d49 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 74aafd990026fc13a97d878b93e1f945d27c2d49 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 74aafd990026fc13a97d878b93e1f945d27c2d49First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1983/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 9662833ef2e2d65041d6624c9f662d14572b530c # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 9662833ef2e2d65041d6624c9f662d14572b530c (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 9662833ef2e2d65041d6624c9f662d14572b530cFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1983/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 9662833ef2e2d65041d6624c9f662d14572b530c # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 9662833ef2e2d65041d6624c9f662d14572b530c (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 9662833ef2e2d65041d6624c9f662d14572b530cFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls May 11, 2017

Coverage Status

Coverage decreased (-8.6%) to 73.27% when pulling 133fcca on fnothaft:issues/1018-dataset-api into ea9ce6c on bigdatagenomics:master.

coveralls commented May 11, 2017

Coverage Status

Coverage decreased (-8.6%) to 73.27% when pulling 133fcca on fnothaft:issues/1018-dataset-api into ea9ce6c on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1984/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 312c057 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 312c057 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 312c057581461d455ccca66b4a1f00c7dfb01746First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1984/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 312c057 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 312c057 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 312c057581461d455ccca66b4a1f00c7dfb01746First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls May 11, 2017

Coverage Status

Coverage decreased (-6.7%) to 75.198% when pulling 92db52a on fnothaft:issues/1018-dataset-api into 0dd09b2 on bigdatagenomics:master.

coveralls commented May 11, 2017

Coverage Status

Coverage decreased (-6.7%) to 75.198% when pulling 92db52a on fnothaft:issues/1018-dataset-api into 0dd09b2 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1987/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f02af2af08fa4462f13edd7adf415ae11ed751fd # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision f02af2af08fa4462f13edd7adf415ae11ed751fd (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f02af2af08fa4462f13edd7adf415ae11ed751fdFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1987/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f02af2af08fa4462f13edd7adf415ae11ed751fd # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision f02af2af08fa4462f13edd7adf415ae11ed751fd (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f02af2af08fa4462f13edd7adf415ae11ed751fdFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Show outdated Hide outdated pom.xml Outdated
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 12, 2017

Member

@heuermh et al

I've started a more ambitious push in 92db52a to dynamically choose whether to load data using ParquetAvro or Spark SQL by an elaborate set of tweaks to the GenomicRDD class hierarchy. For now, I've only built this out on AlignmentRecordRDD. Anyone have comments on this? I think I'm going to build it out further tonight/tomorrow across the other types, but would love early feedback.

Member

fnothaft commented May 12, 2017

@heuermh et al

I've started a more ambitious push in 92db52a to dynamically choose whether to load data using ParquetAvro or Spark SQL by an elaborate set of tweaks to the GenomicRDD class hierarchy. For now, I've only built this out on AlignmentRecordRDD. Anyone have comments on this? I think I'm going to build it out further tonight/tomorrow across the other types, but would love early feedback.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 3, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2173/
Test PASSed.

AmplabJenkins commented Jul 3, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2173/
Test PASSed.

</dependencies>
</dependencyManagement>
<profiles>
<profile>
<id>python</id>

This comment has been minimized.

@heuermh

heuermh Jul 7, 2017

Member
[INFO] --- maven-jar-plugin:3.0.1:jar (default-jar) @ adam-python-spark2_2.11 ---
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] Building jar: adam/adam-python/target/adam-python-spark2_2.11-0.23.0-SNAPSHOT.jar

reminds me that the Sonatype maven release process with fail to validate a release if -sources.jar or -javadocs.jar are missing for any jar artifact.

How are we releasing the python API? I assume our release process will not be using this profile, right?

@heuermh

heuermh Jul 7, 2017

Member
[INFO] --- maven-jar-plugin:3.0.1:jar (default-jar) @ adam-python-spark2_2.11 ---
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] Building jar: adam/adam-python/target/adam-python-spark2_2.11-0.23.0-SNAPSHOT.jar

reminds me that the Sonatype maven release process with fail to validate a release if -sources.jar or -javadocs.jar are missing for any jar artifact.

How are we releasing the python API? I assume our release process will not be using this profile, right?

This comment has been minimized.

@fnothaft

fnothaft Jul 7, 2017

Member

Python is through PyPi and the python profile will not be active during a release (release scripts are unchanged).

@fnothaft

fnothaft Jul 7, 2017

Member

Python is through PyPi and the python profile will not be active during a release (release scripts are unchanged).

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 7, 2017

Member

Woot! The python stuff works for me. Thanks for the improved docs.

Member

heuermh commented Jul 7, 2017

Woot! The python stuff works for me. Thanks for the improved docs.

@heuermh

heuermh approved these changes Jul 7, 2017

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 7, 2017

Member

Squashed and rebased.

Member

fnothaft commented Jul 7, 2017

Squashed and rebased.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 7, 2017

Coverage Status

Coverage decreased (-0.4%) to 82.839% when pulling d9b22cb on fnothaft:issues/1018-dataset-api into 8572fb7 on bigdatagenomics:master.

coveralls commented Jul 7, 2017

Coverage Status

Coverage decreased (-0.4%) to 82.839% when pulling d9b22cb on fnothaft:issues/1018-dataset-api into 8572fb7 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2177/
Test PASSed.

AmplabJenkins commented Jul 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2177/
Test PASSed.

@devin-petersohn

I only have minor formatting points here and there, and some minor correctness points. Otherwise, looks great!

@@ -69,6 +69,7 @@
<configuration>
<sources>
<source>src/main/scala</source>
<source>target/generated-sources/src/main/scala</source>

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Nevermind this, I see it right above here.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Nevermind this, I see it right above here.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala Outdated
Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala Outdated
@@ -156,11 +168,12 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging {
// second.
protected val optPartitionMap: Option[Array[Option[(ReferenceRegion, ReferenceRegion)]]]
assert(optPartitionMap.isEmpty ||
assert(optPartitionMap == null ||

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Can the optPartitionMap ever be null? Should we allow that?

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Can the optPartitionMap ever be null? Should we allow that?

This comment has been minimized.

@fnothaft

fnothaft Jul 7, 2017

Member

This happens due to some weird serialization behavior (IIRC related to the closure cleaner) that comes up in pipe since we use the newRdd.getReferenceRegion function inside of a mapPartitionsWithIndex call. I don't really have any great insight as to what we can do to fix it, but IIRC all the classes that extend GenomicRDD have constructors that are private[rdd], so we should be able to just apply sanity in our code.

@fnothaft

fnothaft Jul 7, 2017

Member

This happens due to some weird serialization behavior (IIRC related to the closure cleaner) that comes up in pipe since we use the newRdd.getReferenceRegion function inside of a mapPartitionsWithIndex call. I don't really have any great insight as to what we can do to fix it, but IIRC all the classes that extend GenomicRDD have constructors that are private[rdd], so we should be able to just apply sanity in our code.

@@ -973,12 +986,40 @@ trait MultisampleGenomicRDD[T, U <: MultisampleGenomicRDD[T, U]] extends Genomic
}
/**
* A trait describing a GenomicRDD that also supports the Spark SQL APIs.
*/
trait GenomicDataset[T, U <: Product, V <: GenomicDataset[T, U, V]] extends GenomicRDD[T, V] {

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Ping @fnothaft for clarification. I was thinking that if we want to restrict access to GenomicRDD, we can, though I doubt that we would want to.

Also, can we set val rdd: RDD[T] = dataset.rdd.map(_.toAvro)? We do it downstream, but I think it makes more sense here.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Ping @fnothaft for clarification. I was thinking that if we want to restrict access to GenomicRDD, we can, though I doubt that we would want to.

Also, can we set val rdd: RDD[T] = dataset.rdd.map(_.toAvro)? We do it downstream, but I think it makes more sense here.

headerLines: Seq[VCFHeaderLine]): VariantRDD = {
VariantRDD(rdd, sequences, headerLines, None)
new DatasetBoundVariantRDD(ds, sequences, headerLines)

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Nit: Remove new for consistency.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

Nit: Remove new for consistency.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/rdd/variant/VariantRDD.scala Outdated
Show outdated Hide outdated ...-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala Outdated
Show outdated Hide outdated ...-core/src/test/scala/org/bdgenomics/adam/rdd/SortedGenomicRDDSuite.scala Outdated
@@ -58,14 +58,45 @@ class CoverageRDDSuite extends ADAMFunSuite {
val coverage = sc.loadCoverage(outputFile)
assert(coverage.rdd.count == 3)
assert(coverage.dataset.count == 3)

This comment has been minimized.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

We dropped assert(coverage.dataset.rdd.count == 3) here and below.

@devin-petersohn

devin-petersohn Jul 7, 2017

Member

We dropped assert(coverage.dataset.rdd.count == 3) here and below.

This comment has been minimized.

@fnothaft

fnothaft Jul 10, 2017

Member

They're not really meaningful in addition to the coverage.rdd.count.

@fnothaft

fnothaft Jul 10, 2017

Member

They're not really meaningful in addition to the coverage.rdd.count.

fnothaft added some commits Feb 9, 2017

[ADAM-538] Add support for an adam-python API.
Resolves #538. Adds support for Python APIs that use the ADAM Java API to make
the ADAMContext and RDD functions accessible natively through python.
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 10, 2017

Member

Rebased and addressed review comments.

Member

fnothaft commented Jul 10, 2017

Rebased and addressed review comments.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 10, 2017

Coverage Status

Coverage decreased (-0.9%) to 83.232% when pulling 95efe05 on fnothaft:issues/1018-dataset-api into acb0c07 on bigdatagenomics:master.

coveralls commented Jul 10, 2017

Coverage Status

Coverage decreased (-0.9%) to 83.232% when pulling 95efe05 on fnothaft:issues/1018-dataset-api into acb0c07 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2187/
Test PASSed.

AmplabJenkins commented Jul 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2187/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2188/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains fffb194e87a4c5b2b788cc0fbc6911434ace31bf # timeout=10Checking out Revision fffb194e87a4c5b2b788cc0fbc6911434ace31bf (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f fffb194e87a4c5b2b788cc0fbc6911434ace31bf > /home/jenkins/git2/bin/git rev-list 3b62fe1f15d69e832817f4315cb7e9b540db40fb # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Jul 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2188/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains fffb194e87a4c5b2b788cc0fbc6911434ace31bf # timeout=10Checking out Revision fffb194e87a4c5b2b788cc0fbc6911434ace31bf (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f fffb194e87a4c5b2b788cc0fbc6911434ace31bf > /home/jenkins/git2/bin/git rev-list 3b62fe1f15d69e832817f4315cb7e9b540db40fb # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

[ADAM-1018] Add support for Spark SQL Datasets.
Resolves #1018. Adds the `adam-codegen` module, which generates classes that:

1. Implement the Scala Product interface and thus can be read into a Spark SQL
Dataset.
2. Have a complete constructor that is compatible with the constructor that
Spark SQL expects to see when exporting a Dataset back to Scala.
3. And, that have methods for converting to/from the bdg-formats Avro models.

Then, we build these model classes in the `org.bdgenomics.adam.sql` package,
and use them for export from the Avro based GenomicRDDs. With a Dataset, we
can then export to a DataFrame, which enables us to expose data through
Python via RDD->Dataset->DataFrame. This is important since the Avro classes
generated by bdg-formats can't be pickled, and thus we can't do a Java RDD to
Python RDD crossing with them.
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2189/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b286cf604daccc72b7a80738f39fa860941ecbc1 # timeout=10Checking out Revision b286cf604daccc72b7a80738f39fa860941ecbc1 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b286cf604daccc72b7a80738f39fa860941ecbc1 > /home/jenkins/git2/bin/git rev-list fffb194e87a4c5b2b788cc0fbc6911434ace31bf # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Jul 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2189/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b286cf604daccc72b7a80738f39fa860941ecbc1 # timeout=10Checking out Revision b286cf604daccc72b7a80738f39fa860941ecbc1 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b286cf604daccc72b7a80738f39fa860941ecbc1 > /home/jenkins/git2/bin/git rev-list fffb194e87a4c5b2b788cc0fbc6911434ace31bf # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Jul 10, 2017

Coverage Status

Coverage decreased (-0.8%) to 83.361% when pulling 4851a77 on fnothaft:issues/1018-dataset-api into acb0c07 on bigdatagenomics:master.

coveralls commented Jul 10, 2017

Coverage Status

Coverage decreased (-0.8%) to 83.361% when pulling 4851a77 on fnothaft:issues/1018-dataset-api into acb0c07 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Jul 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2190/
Test PASSed.

AmplabJenkins commented Jul 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2190/
Test PASSed.

@devin-petersohn

Looks great to me!

@heuermh heuermh merged commit 07a15b5 into bigdatagenomics:master Jul 11, 2017

2 checks passed

coverage/coveralls Coverage decreased (-0.8%) to 83.361%
Details
default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 11, 2017

Member

Thank you, @fnothaft! Also thanks to @jpdna @devin-petersohn for review!

Member

heuermh commented Jul 11, 2017

Thank you, @fnothaft! Also thanks to @jpdna @devin-petersohn for review!

@rxin

This comment has been minimized.

Show comment
Hide comment
@rxin

rxin Jul 11, 2017

@fnothaft any idea what the performance implication is with this change?

rxin commented Jul 11, 2017

@fnothaft any idea what the performance implication is with this change?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 11, 2017

Member

Depends on the kernel. We've got a smattering of various kernels which are fully/partially ported over to Spark SQL. Here's a (very unscientific) few off of the top of my head:

  • #1528, which ports the read coverage counter over to Spark SQL leads to a 2x performance improvement on our standard test dataset.
  • Upstream, bigdatagenomics/avocado#239 moved the genotyper over to Spark SQL which was also about a 2x performance improvement, off of the top of my head. That's a somewhat impure benchmark though, because I just ported the aggregation step that scores the variants over to SQL. The front-end of the genotyper is still RDD based, and the post-aggregation filtration is also still RDD based (it was pending this PR).
  • I need to look up the benchmarks that I ran for the k-mer counter that I ported over in this PR.
  • Ditto for various simple queries (e.g., variant count a la #879, and flagstat) that are ripe to move over to Spark SQL.
  • We have a old branch sitting around that rewrote the duplicate marker on top of Spark SQL; this led to a 25% speedup on Spark 1.5.2. I'm going to clean that up on top of this work, and we'll see what perf it gives us now.
Member

fnothaft commented Jul 11, 2017

Depends on the kernel. We've got a smattering of various kernels which are fully/partially ported over to Spark SQL. Here's a (very unscientific) few off of the top of my head:

  • #1528, which ports the read coverage counter over to Spark SQL leads to a 2x performance improvement on our standard test dataset.
  • Upstream, bigdatagenomics/avocado#239 moved the genotyper over to Spark SQL which was also about a 2x performance improvement, off of the top of my head. That's a somewhat impure benchmark though, because I just ported the aggregation step that scores the variants over to SQL. The front-end of the genotyper is still RDD based, and the post-aggregation filtration is also still RDD based (it was pending this PR).
  • I need to look up the benchmarks that I ran for the k-mer counter that I ported over in this PR.
  • Ditto for various simple queries (e.g., variant count a la #879, and flagstat) that are ripe to move over to Spark SQL.
  • We have a old branch sitting around that rewrote the duplicate marker on top of Spark SQL; this led to a 25% speedup on Spark 1.5.2. I'm going to clean that up on top of this work, and we'll see what perf it gives us now.
@rxin

This comment has been minimized.

Show comment
Hide comment
@rxin

rxin Jul 11, 2017

Nice! Great to know that the work we did in the last two years are paying off and helping genomics :)

rxin commented Jul 11, 2017

Nice! Great to know that the work we did in the last two years are paying off and helping genomics :)

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 11, 2017

Member

@rxin Oh yeah! I'm quite excited about the future possibilities we've got on top of Spark SQL. I see this PR as simply the first step towards moving more code over to the SQL APIs, and moving further away from RDDs.

Member

fnothaft commented Jul 11, 2017

@rxin Oh yeah! I'm quite excited about the future possibilities we've got on top of Spark SQL. I see this PR as simply the first step towards moving more code over to the SQL APIs, and moving further away from RDDs.

@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment