
[ADAM-1018] Add support for Spark SQL Datasets. #1391

Merged
6 commits merged into bigdatagenomics:master on Jul 11, 2017

Conversation

@fnothaft
Member

@fnothaft fnothaft commented Feb 15, 2017

Resolves #1018. Adds the adam-codegen module, which generates classes that:

  1. Implement the Scala Product interface and thus can be read into a Spark SQL Dataset.
  2. Have a complete constructor that is compatible with the constructor that Spark SQL expects to see when exporting a Dataset back to Scala.
  3. And, that have methods for converting to/from the bdg-formats Avro models.

Then, we build these model classes in the org.bdgenomics.adam.sql package, and use them for export from the Avro based GenomicRDDs. With a Dataset, we can then export to a DataFrame, which enables us to expose data through Python via RDD->Dataset->DataFrame. This is important since the Avro classes generated by bdg-formats can't be pickled, and thus we can't do a Java RDD to Python RDD crossing with them.
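As a hedged, Spark-free sketch of the shape of the generated classes (pure Python stand-ins; `AvroVariant`, `SqlVariant`, and the field names are illustrative, not the actual generated API):

```python
# Hypothetical stand-in for a bdg-formats Avro model.
class AvroVariant:
    def __init__(self, contigName, start, alternateAllele):
        self.contigName = contigName
        self.start = start
        self.alternateAllele = alternateAllele

# Hypothetical stand-in for a generated org.bdgenomics.adam.sql model:
# a flat record with a complete constructor plus to/from-Avro
# converters, which is what Spark SQL needs to build (and rebuild)
# Dataset rows.
class SqlVariant:
    def __init__(self, contigName, start, alternateAllele):
        self.contigName = contigName
        self.start = start
        self.alternateAllele = alternateAllele

    @staticmethod
    def fromAvro(record):
        return SqlVariant(record.contigName, record.start,
                          record.alternateAllele)

    def toAvro(self):
        return AvroVariant(self.contigName, self.start,
                           self.alternateAllele)

# Round trip Avro -> sql model -> Avro, as the export path would do.
avro = AvroVariant("chr1", 100, "A")
back = SqlVariant.fromAvro(avro).toAvro()
assert (back.contigName, back.start, back.alternateAllele) == ("chr1", 100, "A")
```

The point is only the shape: a flat record with a complete constructor plus converters in both directions, so data can cross the RDD/Dataset boundary either way.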

This is on top of #1387.

To do:

  • Add Python/SQL docs
  • Add pyadam-shell script (thoughts on name?)
  • Finish Dataset bind-on-load work, pending review comments
  • PyPI
@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Feb 16, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1788/

Build result: FAILURE

[...truncated 3 lines...]Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains f6661e40fb6240282f6b7360de9797983ea202cc # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision f6661e40fb6240282f6b7360de9797983ea202cc (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f6661e40fb6240282f6b7360de9797983ea202ccFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@heuermh
Member

@heuermh heuermh commented Feb 16, 2017

This is awesome, nice work! A few general questions before I dig in to review:

Are the new java-friendly methods on *RDD (from the first commit) necessary or simply more convenient?

Does the dataset stuff compile against Spark 1.6.x or is there some conditional build logic I'm not seeing?

Is name.capitalize in DumpSchemasToProduct going to work consistently everywhere (in other words, do we have any oddly camel-cased field names in bdg-formats)?

@fnothaft
Member Author

@fnothaft fnothaft commented Feb 16, 2017

Thanks @heuermh!

The Java-friendly methods are for the Python support in #1387. The TL;DR is that Py4j needs "proper" Java methods (no default parameters, value types are java.lang..., etc.).

Datasets came in Spark 1.6.0. We'll need to drop Spark 1.5 and earlier, but I think that makes sense generally now that the blessed release of Spark is 2.1.0.

Yup! Scala's capitalize method just upcases the first letter of a string, which is the same behavior Avro uses when generating getters/setters.
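For illustration, a quick sketch of that rule (pure Python; note this mirrors Scala's String.capitalize, not Python's str.capitalize, which also lowercases the tail):

```python
def capitalize_first(name: str) -> str:
    # Upcase only the first character, leaving the rest intact --
    # the same rule Avro uses when deriving getter/setter names.
    return name[:1].upper() + name[1:]

# Camel-cased bdg-formats field names keep their internal casing:
assert capitalize_first("contigName") == "ContigName"

# Python's built-in str.capitalize would NOT be a faithful stand-in:
assert "contigName".capitalize() == "Contigname"
```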

@heuermh
Member

@heuermh heuermh commented Feb 16, 2017

+1 to dropping support for Spark 1.5, which is really only a todo for docs & Jenkins

@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from 814e324 to 0c46879 Feb 17, 2017
@fnothaft
Member Author

@fnothaft fnothaft commented Feb 17, 2017

Rebased on latest and greatest changes over at #1387, and trimmed Jenkins down to Spark 1.6.2 and 2.0.0.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Feb 17, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1804/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 4556d45 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 4556d45 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 4556d458547fb0947ceb9894d6912216d912c208First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from 0c46879 to 05cbccb Feb 18, 2017
@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Feb 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1806/

@fnothaft fnothaft mentioned this pull request Feb 21, 2017
3 of 3 tasks complete
@@ -246,6 +248,28 @@ case class FeatureRDD(rdd: RDD[Feature],
}

/**
* @return Creates a SQL Dataset of genotypes.


@jpdna

jpdna Mar 2, 2017
Member

Is this here a SQL Dataset of "genotypes" or of "Features"?


@fnothaft

fnothaft Mar 2, 2017
Author Member

Ah, good catch! Will fix.

@jpdna
Member

@jpdna jpdna commented Mar 2, 2017

Are there currently any tests that roundtrip an RDD (or its ADAM wrapper like VariantRDD) to a dataset and back?

@fnothaft
Member Author

@fnothaft fnothaft commented Mar 2, 2017

RE: tests, these are a bit thin, but https://github.com/bigdatagenomics/adam/pull/1391/files#diff-306c6dde689c4cbcb6d3e982dc401221R81 forces a round trip RDD->Dataset->RDD.

@@ -54,7 +54,7 @@ object ConsensusGenerator {
* present in a single aligned read back into the reference sequence where
* they are aligned.
*/
def fromReads: ConsensusGenerator = {
def fromReads(): ConsensusGenerator = {


@ryan-williams

ryan-williams Mar 7, 2017
Member

what's your rule for when to add these vs. not?


@fnothaft

fnothaft Mar 7, 2017
Author Member

If we want it to be (easily) callable from Java (and thus Python), it seems that we need to have the parens.

@antonkulaga
Contributor

@antonkulaga antonkulaga commented Mar 12, 2017

What about moving from ugly Java classes with getters and setters to case classes (like they have in https://github.com/hammerlab/spark-genomics)?

@fnothaft
Member Author

@fnothaft fnothaft commented Mar 12, 2017

@antonkulaga I'd love to do that, but that would require an almost complete rewrite of the core read transformations and format conversion classes, which is pretty hard to justify from the perspective of engineering effort. It's something like a 10k+ LOC rewrite. Realistically, I think we'll move away from the Java Avro models slowly as we rewrite blocks on top of the Dataset APIs over the next year.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Apr 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1937/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 74aafd990026fc13a97d878b93e1f945d27c2d49 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 74aafd990026fc13a97d878b93e1f945d27c2d49 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 74aafd990026fc13a97d878b93e1f945d27c2d49First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from 56f5f64 to 306fe62 May 11, 2017
@AmplabJenkins

@AmplabJenkins AmplabJenkins commented May 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1983/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 9662833ef2e2d65041d6624c9f662d14572b530c # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1391/merge^{commit} # timeout=10Checking out Revision 9662833ef2e2d65041d6624c9f662d14572b530c (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 9662833ef2e2d65041d6624c9f662d14572b530cFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from 306fe62 to 133fcca May 11, 2017
@coveralls

@coveralls coveralls commented May 11, 2017

Coverage Status

Coverage decreased (-8.6%) to 73.27% when pulling 133fcca on fnothaft:issues/1018-dataset-api into ea9ce6c on bigdatagenomics:master.

@fnothaft
Member Author

@fnothaft fnothaft commented Jul 7, 2017

Squashed and rebased.

@coveralls

@coveralls coveralls commented Jul 7, 2017

Coverage Status

Coverage decreased (-0.4%) to 82.839% when pulling d9b22cb on fnothaft:issues/1018-dataset-api into 8572fb7 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2177/

Member

@devin-petersohn devin-petersohn left a comment

I only have minor formatting points here and there, and some minor correctness points. Otherwise, looks great!

@@ -69,6 +69,7 @@
<configuration>
<sources>
<source>src/main/scala</source>
<source>target/generated-sources/src/main/scala</source>


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

Nevermind this, I see it right above here.

@@ -1460,7 +1513,17 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
val sd = loadAvroSequenceDictionary(pathName)
val rdd = loadParquet[Feature](pathName, optPredicate, optProjection)


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

I think this line should be deleted.


@fnothaft

fnothaft Jul 7, 2017
Author Member

+1

@@ -1482,7 +1545,18 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
val sd = loadAvroSequenceDictionary(pathName)
val rdd = loadParquet[NucleotideContigFragment](pathName, optPredicate, optProjection)


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

See previous comment about deleting this line.


@fnothaft

fnothaft Jul 7, 2017
Author Member

+1

@@ -156,11 +168,12 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging {
// second.
protected val optPartitionMap: Option[Array[Option[(ReferenceRegion, ReferenceRegion)]]]

assert(optPartitionMap.isEmpty ||
assert(optPartitionMap == null ||


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

Can the optPartitionMap ever be null? Should we allow that?


@fnothaft

fnothaft Jul 7, 2017
Author Member

This happens due to some weird serialization behavior (IIRC related to the closure cleaner) that comes up in pipe, since we use the newRdd.getReferenceRegion function inside of a mapPartitionsWithIndex call. I don't really have any great insight into what we can do to fix it, but IIRC all the classes that extend GenomicRDD have constructors that are private[rdd], so we should be able to just apply sanity in our own code.

@@ -973,12 +986,40 @@ trait MultisampleGenomicRDD[T, U <: MultisampleGenomicRDD[T, U]] extends Genomic
}

/**
* A trait describing a GenomicRDD that also supports the Spark SQL APIs.
*/
trait GenomicDataset[T, U <: Product, V <: GenomicDataset[T, U, V]] extends GenomicRDD[T, V] {


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

Ping @fnothaft for clarification. I was thinking that if we want to restrict access to GenomicRDD, we can, though I doubt that we would want to.

Also, can we set val rdd: RDD[T] = dataset.rdd.map(_.toAvro)? We do it downstream, but I think it makes more sense here.
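A toy sketch of that suggestion (Python property standing in for a Scala val on the trait; Row and to_avro are hypothetical stand-ins, and the Dataset is modeled as a plain list):

```python
class Row:
    """Hypothetical sql-model record."""
    def __init__(self, name):
        self.name = name

    def to_avro(self):
        # Stand-in for the generated toAvro converter.
        return {"name": self.name}

class GenomicDataset:
    """Toy model of a Dataset-backed GenomicRDD."""
    def __init__(self, dataset):
        self.dataset = dataset  # rows, standing in for a Dataset[U]

    @property
    def rdd(self):
        # In spirit: val rdd: RDD[T] = dataset.rdd.map(_.toAvro),
        # derived once here on the base type rather than repeated
        # downstream in each subclass.
        return [row.to_avro() for row in self.dataset]

gd = GenomicDataset([Row("chr1"), Row("chr2")])
assert gd.rdd == [{"name": "chr1"}, {"name": "chr2"}]
```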

headerLines: Seq[VCFHeaderLine]): VariantRDD = {
VariantRDD(rdd, sequences, headerLines, None)
new DatasetBoundVariantRDD(ds, sequences, headerLines)


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

Nit: Remove new for consistency.

sequences,
Seq.empty[Sample],
headerLines,
None)


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

This needs to send the optPartitionMap also.

@@ -71,11 +72,11 @@ class SortedGenomicRDDSuite extends SparkFunSuite {

// sort and make into 16 partitions
val y = x.sortLexicographically(storePartitionMap = true, partitions = 16)
assert(isSorted(y.optPartitionMap.get))
//assert(isSorted(y.optPartitionMap.get))


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

Since I wasn't able to expose optPartitionMap outside of GenomicRDD, we can just delete these. They don't check anything serious, anyway.

@@ -313,6 +314,7 @@ class SortedGenomicRDDSuite extends SparkFunSuite {
assert(h.count == i.count)
}

/*


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

This is the one that I care about. We need to verify that we read things in with the same order and partition schema as it was written.


@fnothaft

fnothaft Jul 10, 2017
Author Member

Suggestions for how to implement said test?


@devin-petersohn

devin-petersohn Jul 10, 2017
Member

We can just test that each partition contains the same data on z and t. If you'd prefer, I can create a pull request against this branch.
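The proposed check, sketched without Spark (partitions modeled as lists of lists; the names z and t come from the test under discussion):

```python
def partitions_match(z, t):
    # Two partitioned collections agree iff they have the same number
    # of partitions and each partition holds the same records in order.
    return len(z) == len(t) and all(zp == tp for zp, tp in zip(z, t))

z = [["a", "b"], ["c"], []]
assert partitions_match(z, [["a", "b"], ["c"], []])
# Same records under a different partition schema should not pass:
assert not partitions_match(z, [["a"], ["b", "c"], []])
```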


@fnothaft

fnothaft Jul 10, 2017
Author Member

Yeah, if you wouldn't mind, that would minimize back-and-forth.

@@ -58,14 +58,45 @@ class CoverageRDDSuite extends ADAMFunSuite {

val coverage = sc.loadCoverage(outputFile)
assert(coverage.rdd.count == 3)
assert(coverage.dataset.count == 3)


@devin-petersohn

devin-petersohn Jul 7, 2017
Member

We dropped assert(coverage.dataset.rdd.count == 3) here and below.


@fnothaft

fnothaft Jul 10, 2017
Author Member

They're not really meaningful in addition to the coverage.rdd.count.

fnothaft added 3 commits Jun 26, 2017
Resolves #538. Adds support for Python APIs that use the ADAM Java API to make
the ADAMContext and RDD functions accessible natively through Python.
@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from d9b22cb to 95efe05 Jul 10, 2017
@fnothaft
Member Author

@fnothaft fnothaft commented Jul 10, 2017

Rebased and addressed review comments.

@coveralls

@coveralls coveralls commented Jul 10, 2017

Coverage Status

Coverage decreased (-0.9%) to 83.232% when pulling 95efe05 on fnothaft:issues/1018-dataset-api into acb0c07 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2187/

@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from 95efe05 to 25bff3f Jul 10, 2017
@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2188/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains fffb194e87a4c5b2b788cc0fbc6911434ace31bf # timeout=10Checking out Revision fffb194e87a4c5b2b788cc0fbc6911434ace31bf (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f fffb194e87a4c5b2b788cc0fbc6911434ace31bf > /home/jenkins/git2/bin/git rev-list 3b62fe1f15d69e832817f4315cb7e9b540db40fb # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

Resolves #1018. Adds the `adam-codegen` module, which generates classes that:

1. Implement the Scala Product interface and thus can be read into a Spark SQL
Dataset.
2. Have a complete constructor that is compatible with the constructor that
Spark SQL expects to see when exporting a Dataset back to Scala.
3. And, that have methods for converting to/from the bdg-formats Avro models.

Then, we build these model classes in the `org.bdgenomics.adam.sql` package,
and use them for export from the Avro based GenomicRDDs. With a Dataset, we
can then export to a DataFrame, which enables us to expose data through
Python via RDD->Dataset->DataFrame. This is important since the Avro classes
generated by bdg-formats can't be pickled, and thus we can't do a Java RDD to
Python RDD crossing with them.
@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from 25bff3f to cbe0141 Jul 10, 2017
@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 10, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2189/

Build result: FAILURE

[...truncated 15 lines...] > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1391/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b286cf604daccc72b7a80738f39fa860941ecbc1 # timeout=10Checking out Revision b286cf604daccc72b7a80738f39fa860941ecbc1 (origin/pr/1391/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b286cf604daccc72b7a80738f39fa860941ecbc1 > /home/jenkins/git2/bin/git rev-list fffb194e87a4c5b2b788cc0fbc6911434ace31bf # timeout=10Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

@fnothaft fnothaft force-pushed the fnothaft:issues/1018-dataset-api branch from cbe0141 to 4851a77 Jul 10, 2017
@coveralls

@coveralls coveralls commented Jul 10, 2017

Coverage Status

Coverage decreased (-0.8%) to 83.361% when pulling 4851a77 on fnothaft:issues/1018-dataset-api into acb0c07 on bigdatagenomics:master.

@AmplabJenkins

@AmplabJenkins AmplabJenkins commented Jul 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2190/

Member

@devin-petersohn devin-petersohn left a comment

Looks great to me!

@heuermh heuermh merged commit 07a15b5 into bigdatagenomics:master Jul 11, 2017
2 checks passed
coverage/coveralls Coverage decreased (-0.8%) to 83.361%
Details
default Merged build finished.
Details
@heuermh
Member

@heuermh heuermh commented Jul 11, 2017

Thank you, @fnothaft! Also thanks to @jpdna @devin-petersohn for review!

@rxin

@rxin rxin commented Jul 11, 2017

@fnothaft any idea what the performance implication is with this change?

@fnothaft
Member Author

@fnothaft fnothaft commented Jul 11, 2017

Depends on the kernel. We've got a smattering of kernels that are fully or partially ported over to Spark SQL. Here are a few (very unscientific) numbers off the top of my head:

  • #1528, which ports the read coverage counter over to Spark SQL, leads to a 2x performance improvement on our standard test dataset.
  • Upstream, bigdatagenomics/avocado#239 moved the genotyper over to Spark SQL, which was also roughly a 2x performance improvement, if memory serves. That's a somewhat impure benchmark, though, because I only ported the aggregation step that scores the variants over to SQL. The front end of the genotyper is still RDD based, and the post-aggregation filtration is also still RDD based (it was pending this PR).
  • I need to look up the benchmarks that I ran for the k-mer counter that I ported over in this PR.
  • Ditto for various simple queries (e.g., variant count a la #879, and flagstat) that are ripe to move over to Spark SQL.
  • We have an old branch sitting around that rewrote the duplicate marker on top of Spark SQL; this led to a 25% speedup on Spark 1.5.2. I'm going to clean that up on top of this work, and we'll see what perf it gives us now.
@rxin

@rxin rxin commented Jul 11, 2017

Nice! Great to know that the work we did in the last two years is paying off and helping genomics :)

@fnothaft
Member Author

@fnothaft fnothaft commented Jul 11, 2017

@rxin Oh yeah! I'm quite excited about the future possibilities we've got on top of Spark SQL. I see this PR as simply the first step towards moving more code over to the SQL APIs, and moving further away from RDDs.

@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Jan 4, 2018