Make Spark SQL APIs supported across all types #1921
Conversation
Test FAILed. Build result: FAILURE. Checked out revision b75295e (origin/pr/1921/head).
ADAM-prb ? 2.6.2, 2.10, 2.2.1, centos completed with result SUCCESS
ADAM-prb ? 2.6.2, 2.11, 2.2.1, centos completed with result FAILURE
ADAM-prb ? 2.7.3, 2.10, 2.2.1, centos completed with result SUCCESS
ADAM-prb ? 2.7.3, 2.11, 2.2.1, centos completed with result FAILURE
+1
a few random review comments/questions, carry on!
/**
 * If true and saving as FASTQ, we will sort by read name.
 */
var sortFastqOutput: Boolean
heuermh
Feb 22, 2018
Member
is now a good time to rid the world of this?
fnothaft
Mar 7, 2018
Author
Member
Don't think so.
heuermh
Mar 8, 2018
Member
ok
 *
 * @param pathName The path name to load genotypes from.
 *   Globs/directories are supported.
 * @return Returns a GenotypeRDD.
heuermh
Feb 22, 2018
Member
doc doesn't match
import sqlContext.implicits._
val ds = sqlContext.read.parquet(pathName).as[VariantContextProduct]

new DatasetBoundVariantContextRDD(ds, sd, samples, headers)
heuermh
Feb 22, 2018
Member
we're intentionally bound to dataset here, is the idea to move other loadParquetXxx methods in the same direction?
fnothaft
Mar 7, 2018
Author
Member
Nah, we don't have an Avro impl of VariantContext, so we can't use the normal loadParquet method.
heuermh
Mar 8, 2018
Member
ok
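For readers following along, here is a minimal sketch of what that dataset-bound load amounts to, assuming a `SparkSession` named `spark` and the `VariantContextProduct` class from the snippet above; the method name is made up for illustration. Because there is no Avro record for VariantContext, the rows can only come back through Spark SQL rather than the Avro-backed `loadParquet` path.

```scala
import org.apache.spark.sql.{ Dataset, SparkSession }

// Sketch only; VariantContextProduct is the Spark SQL product type quoted in
// the diff above and is assumed to be in scope. The method name is invented.
def loadParquetVariantContextsSketch(
  spark: SparkSession,
  pathName: String): Dataset[VariantContextProduct] = {
  import spark.implicits._
  // No Avro implementation of VariantContext exists, so we cannot reuse the
  // generic loadParquet machinery; we read the Parquet files directly into a
  // Dataset of the product type.
  spark.read.parquet(pathName).as[VariantContextProduct]
}
```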
val uTag: TypeTag[U]

/**
 * This data as a Spark SQL Dataset.
heuermh
Feb 22, 2018
Member
These data?
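To make the role of `uTag` concrete, a rough sketch (the trait and method names below are illustrative, not ADAM's): carrying a TypeTag for the product type U is what lets generic code derive the Encoder Spark SQL needs when turning an RDD of U into a Dataset.

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{ Dataset, Encoders, SparkSession }

// Illustrative container: without the TypeTag, Encoders.product[U] could not
// be materialized for an abstract type parameter U.
trait HasDatasetSketch[U <: Product] {
  val uTag: TypeTag[U]

  def toDataset(spark: SparkSession, rdd: RDD[U]): Dataset[U] = {
    implicit val tag: TypeTag[U] = uTag
    spark.createDataset(rdd)(Encoders.product[U])
  }
}
```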
 *
 * @param tFn A function that transforms the underlying RDD as a DataFrame.
 * @return A new RDD where the RDD of genomic data has been replaced, but the
 *   metadata (sequence dictionary, and etc) is copied without modification.
heuermh
Feb 22, 2018
Member
metadata ... are?
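For illustration, a hedged usage sketch of the method this doc block sits on, assuming it is the `transformDataFrame` entry point; `genotypes` and the column name are placeholders. The point is just that only the data is rewritten, while the sequence dictionary and other metadata are copied across unchanged.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// `genotypes` stands in for any dataset-backed genomic container; only the
// DataFrame changes, the metadata comes along as-is.
val cleaned = genotypes.transformDataFrame((df: DataFrame) => {
  df.filter(col("start") >= 0L).dropDuplicates()
})
```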
genomicRdd: GenomicRDD[X, Y],
  flankSize: Long)(
def rightOuterBroadcastRegionJoin[X, Y <: Product, Z <: GenomicDataset[X, Y, Z]](
  genomicRdd: GenomicDataset[X, Y, Z])(
heuermh
Feb 22, 2018
Member
genomicRdd → genomicDataset
fnothaft
Mar 7, 2018
Author
Member
How much do you want me to go through and patch these? For now, I've left them as is (genomicRdd); I'd rather make a cleanup pass later.
heuermh
Mar 8, 2018
Member
ok, please create an issue to track consistency in names and doc for RDD vs Dataset
}

/**
 * Performs a broadcast inner join between this RDD and another RDD.
heuermh
Feb 22, 2018
Member
RDD → Dataset
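As a usage sketch of the join this doc introduces (`targets` and `reads` are placeholder datasets, and the read accessor is assumed): in ADAM's broadcast joins the left side, i.e. the dataset the method is called on, is collected and broadcast to the executors, so it should be the small one.

```scala
// `targets` (small, e.g. an interval/feature dataset) is broadcast;
// `reads` is streamed through the join.
val joined = targets.broadcastRegionJoin(reads)

// Each result element pairs a broadcast target with a read overlapping it.
joined.rdd.take(5).foreach {
  case (target, read) => println(s"$target overlaps read ${read.getReadName}")
}
```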
regionFn)
@transient val uTag: TypeTag[U] = typeTag[U]

def saveAsParquet(filePath: String,
heuermh
Feb 22, 2018
Member
does this one need an override?
pageSize: Int = 1 * 1024 * 1024,
compressCodec: CompressionCodecName = CompressionCodecName.GZIP,
disableDictionaryEncoding: Boolean = false,
optSchema: Option[Schema] = None): Unit = SaveAsADAM.time {
heuermh
Feb 22, 2018
Member
would be nice to have separate SaveRddAsParquet and SaveDatasetAsParquet timers
fnothaft
Mar 7, 2018
Author
Member
We're not really going to get meaningful timing info out of the SaveDatasetAsParquet timer.
heuermh
Mar 8, 2018
Member
ok
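For the record, a hypothetical sketch of what separate timers would have looked like, following the bdg-utils Metrics style that ADAM's existing Timers object uses; the timer names and the helper are invented, and as noted above the dataset-side number would mostly reflect Spark's own job time.

```scala
import org.apache.spark.sql.Dataset
import org.bdgenomics.utils.instrumentation.Metrics

// Hypothetical timers; the code in this PR wraps both code paths in the
// single existing SaveAsADAM timer instead.
object SaveTimersSketch extends Metrics {
  val SaveRddAsParquet = timer("Save RDD-bound data as Parquet")
  val SaveDatasetAsParquet = timer("Save Dataset-bound data as Parquet")
}

// Time only the Dataset write path.
def saveDatasetAsParquetSketch[U](ds: Dataset[U], filePath: String): Unit = {
  SaveTimersSketch.SaveDatasetAsParquet.time {
    ds.write.parquet(filePath)
  }
}
```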
@@ -210,7 +210,10 @@ case class RDDBoundGenotypeRDD private[rdd] (
}
}

sealed abstract class GenotypeRDD extends MultisampleAvroGenomicRDD[Genotype, GenotypeProduct, GenotypeRDD] {
sealed abstract class GenotypeRDD extends MultisampleAvroGenomicDataset[Genotype, GenotypeProduct, GenotypeRDD] {
heuermh
Feb 22, 2018
Member
will you be proposing to move this to GenotypeDataset?
Force-pushed from b75295e to 6d363e7.
Test FAILed. Build result: FAILURE. Checked out revision c8dd89d (origin/pr/1921/merge).
ADAM-prb ? 2.6.2, 2.10, 2.2.1, centos completed with result SUCCESS
ADAM-prb ? 2.6.2, 2.11, 2.2.1, centos completed with result FAILURE
ADAM-prb ? 2.7.3, 2.10, 2.2.1, centos completed with result SUCCESS
ADAM-prb ? 2.7.3, 2.11, 2.2.1, centos completed with result FAILURE
Jenkins, retest this please.
Test PASSed.
@heuermh I think everything in Scala is going to be stable from here on out; can you make a review pass? I'm going to finish up the R bit in the morning.
Test PASSed.
Test PASSed.
Force-pushed from 6184711 to a26047a.
Test PASSed.
Test PASSed.
tManifest: ClassTag[T],
  xManifest: ClassTag[X]): Y = {
def pipe[X, Y <: Product, Z <: GenomicDataset[X, Y, Z], W <: InFormatter[T, U, V, W]](cmd: Seq[String],
  files: Seq[String] = Seq.empty,
heuermh
Mar 8, 2018
Member
nit: reformat as above for line length. Same with other pipe methods
implicit val txTag = ClassTag.AnyRef.asInstanceOf[ClassTag[(Option[T], X)]]
implicit val u1Tag: TypeTag[U] = uTag
implicit val u2Tag: TypeTag[Y] = genomicRdd.uTag
implicit val uyTag = typeTag[(Option[U], Y)]
heuermh
Mar 8, 2018
Member
hope I never need to touch this method signature ;)
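A sketch of why that block of implicits exists (the helper below is illustrative, not ADAM code): to build a Dataset of (Option[U], Y) join pairs, Spark needs an Encoder for the pair type, and the compiler can only materialize the TypeTag that encoder derivation requires if TypeTags for U and Y are already in scope, which is exactly what u1Tag and u2Tag provide.

```scala
import scala.reflect.runtime.universe.{ TypeTag, typeTag }
import org.apache.spark.sql.{ Encoder, Encoders }

def rightOuterPairEncoder[U <: Product: TypeTag, Y <: Product: TypeTag]: Encoder[(Option[U], Y)] = {
  // The compiler can only materialize this tag because TypeTag[U] and
  // TypeTag[Y] are available via the context bounds (cf. u1Tag/u2Tag above).
  val uyTag = typeTag[(Option[U], Y)]
  Encoders.product[(Option[U], Y)](uyTag)
}
```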
mvn -U \
  -P python,r \
  test \
  -Dsuites=select.no.suites\* \
heuermh
Mar 8, 2018
Member
is this to ignore the scala tests while still running the Python and R ones?
I don't know; for my ADAM-based code/libs the renaming is not a problem. What I noticed with the previous version was that in many cases using genomic RDDs was faster than trying to use Spark SQL.
Resolves #1580. Does not compile due to abstract classes not implemented.
Force-pushed from a26047a to 40043e5.
Test PASSed.
Thank you, @fnothaft
Resolves #1580, #1867, WIP towards #1728. Merges the GenomicRDD and GenomicDataset traits together, and makes the VariantContextRDD and GenericGenomicRDD implementations support GenomicDataset. Not done yet, but I wanted to get this out for review, as it's a big one. Remaining TODO:
One general stylistic question: this PR goes halfway to renaming everything "GenomicDataset" instead of "GenomicRDD". Is this a good change, or should we hold on to the GenomicRDD name for the sake of consistency?
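For readers skimming the diff, a rough sketch of the shape the merged trait ends up with; the member names below are abridged and approximate, not the exact ADAM signatures (see GenomicDataset.scala in the PR for the real list, which also carries the sequence dictionary, joins, pipe, and save methods).

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// Approximate shape only: each genomic type exposes both an RDD of Avro
// records and a Dataset of the corresponding generated Product class.
trait GenomicDatasetSketch[T, U <: Product, V <: GenomicDatasetSketch[T, U, V]] {
  // Avro-backed view of the data.
  def rdd: RDD[T]

  // Spark SQL view of the same data, using the generated Product class.
  def dataset: Dataset[U]

  // Runtime evidence needed to build Encoders for U.
  val uTag: TypeTag[U]

  // Rebuild this container around a transformed Dataset, keeping metadata.
  def transformDataset(tFn: Dataset[U] => Dataset[U]): V
}
```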