
add hive style partitioning for contigName #1620

Closed
jpdna wants to merge 10 commits into bigdatagenomics:master from jpdna:hive_partitions

Conversation

6 participants
@jpdna
Member

jpdna commented Jul 21, 2017

The relevant changes to review are in the file AlignmentRecordRDD.scala; the other changes were caused by our scripts that switch the build to Spark 2/Scala 2.11.

This PR implements Hive-style partitioning by contigName for AlignmentRecords when written to Parquet through the Dataset/Spark SQL write path, as described here.

The output directory when the dataset is saved to Parquet looks like the following:

_SUCCESS
_common_metadata
_metadata
_rgdict.avro
_seqdict.avro
contigName=1
contigName=2
...

where the Parquet files for each contigName are written inside the contigName=N directories.
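For reference, this layout is exactly what a Spark SQL partitioned write produces. A minimal sketch of the underlying call (plain Spark, outside the ADAM API; the method name here is illustrative):

import org.apache.spark.sql.{ DataFrame, SaveMode }

// partitionBy produces one contigName=N subdirectory per distinct value
def writePartitionedByContig(df: DataFrame, outputPath: String): Unit = {
  df.write
    .mode(SaveMode.Overwrite)
    .partitionBy("contigName")
    .parquet(outputPath)
}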

Note: in the future we will add another layer of hierarchy, with 10-megabase bins within each chromosome as further subdirectories.

As per discussion in #651, such binning should allow more efficient predicate pushdown for range queries than we may currently get from Parquet. A sketch of the binning follows.
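This is a hypothetical sketch only, reusing df and outputPath from the sketch above; the posBin column name matches what is used later in this thread, and the 10 Mb bin size follows the note (later tests use 1 Mb):

import org.apache.spark.sql.functions.floor

// derive a posBin column from the alignment start position, then partition
// by both contigName and posBin on write
val binSize = 10000000L // 10 megabases
val binned = df.withColumn("posBin", floor(df("start") / binSize))
binned.write.partitionBy("contigName", "posBin").parquet(outputPath)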

Below I describe how to write and read back this data in the spark-shell, and discuss an error that currently occurs when reading it back directly as an RDD.

Load data from SAM file, convert to Dataset, and save as Hive-partitioned parquet

import org.bdgenomics.adam.sql.{ AlignmentRecord => AlignmentRecordProduct }
import org.bdgenomics.adam.rdd.ADAMContext._

val rdd = sc.loadAlignments("../adam/adam-core/src/test/resources/multi_chr.sam")
val x = rdd.transformDataset(ds => {
  import ds.sqlContext.implicits._
  val df = ds.toDF()
  df.as[AlignmentRecordProduct]
})

// save as Hive-style partitioned data by contigName
x.saveAsParquet("test_hivePartitions_1")

Read back in from disk as a Dataset

val read = sc.loadAlignments("test_hivePartitions_1")
read: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD =
ParquetUnboundAlignmentRecordRDD(org.apache.spark.SparkContext@2db4a84a,test_hivePartitions_1,SequenceDictionary{

// view fields from first record as a dataset
scala> read.dataset.collect()(0)
res38: org.bdgenomics.adam.sql.AlignmentRecord = org.bdgenomics.adam.sql.AlignmentRecord@64dba8f1

scala> read.dataset.collect()(0).contigName.get
res40: String = 1

scala> read.dataset.collect()(0).start.get
res41: Long = 26472783

Read back in directly as an RDD (this results in an incorrect contigName=null)

//examine the type
scala> read.rdd
res45: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord] = MapPartitionsRDD[51] at map at ADAMContext.scala:126

// print the first record when read as an rdd
read.rdd.collect()(0)
res44: org.bdgenomics.formats.avro.AlignmentRecord = {"readInFragment": 0, "contigName": null, "start": 26472783, "oldPosition": null, "end": 26472858, "mapq": 60, "readName": "simread:1:26472783:false", "sequence": "GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA", "qual": null, "cigar": "75M", "oldCigar": null, "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, "properPair": false, "readMapped": true, "mateMapped": false, "failedVendorQualityChecks": false, "duplicateRead": false, "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": true, "secondaryAlignment": false, "supplementaryAlignment": false, "mismatchingPositions": null, "origQual": null, "attributes": "XS:i:0\tAS:i:75\tNM:i:0", "recordGroupName": null…

PROBLEM: Note that above, when read as an RDD, the contigName field is null. This is incorrect; contigName should be "1" here.

We can get the correct RDD[org.bdgenomics.formats.avro.AlignmentRecord] another way, though: by reading as a Dataset and then converting to an RDD:

// read as dataset, convert to rdd of sql type then map to Avro bdg AlignmentRecord
val rdd_try2 = read.dataset.rdd.map(_.toAvro)
rdd_try2: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord] = MapPartitionsRDD[58] at map at <console>:30

// checking the first record now, we find that contigName is correct
rdd_try2.collect()(0)
res46: org.bdgenomics.formats.avro.AlignmentRecord = {"readInFragment": 0, "contigName": "1", "start": 26472783, "oldPosition": null, "end": 26472858, "mapq": 60, "readName": "simread:1:26472783:false", "sequence": "GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA", "qual": null, "cigar": "75M", "oldCigar": null, "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, "properPair": false, "readMapped": true, "mateMapped": false, "failedVendorQualityChecks": false, "duplicateRead": false, "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": true, "secondaryAlignment": false, "supplementaryAlignment": false, "mismatchingPositions": null, "origQual": null, "attributes": "XS:i:0\tAS:i:75\tNM:i:0", "recordGroupName": null,..

Conclusion:

When using Hive-style partitioning as implemented in this PR, the direct RDD read path
sc.loadAlignments("myPartitionedSparkSQLParquetDir").rdd
will result in contigName erroneously being null, because the direct RDD reading code does not handle the Hive-style partitioning. When Spark SQL writes partitioned Parquet, the partition column is stored only in the directory names (contigName=N) and omitted from the Parquet files themselves, so the direct RDD read path, which is not partition-aware, sees null for that field.

The cost of having to read Hive-partitioned Parquet data in as a Dataset and then convert to an RDD seems like a reasonable price to pay if we indeed find enough other benefits to the Hive-style partitioning, but we then need to disallow the direct RDD read path so that users don't hit this contigName=null error.

@AmplabJenkins


AmplabJenkins commented Jul 21, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2262/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 95dbf5a # timeout=10
Checking out Revision 95dbf5a (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 95dbf5aef4bb06aa247d1dcb00b0fc9b47fd61e5
First time build. Skipping changelog.
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.3.0/2.6.0 x Scala 2.10/2.11 x Spark 1.6.1/2.1.0, centos); all completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins


AmplabJenkins commented Jul 21, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2267/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 71e9796 # timeout=10
Checking out Revision 71e9796 (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 71e97965e48af15edb6a231d108d04d9bbd208dc
First time build. Skipping changelog.
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.3.0/2.6.0 x Scala 2.10/2.11 x Spark 1.6.1/2.1.0, centos); all completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@jpdna


Member

jpdna commented Jul 21, 2017

I've loaded a chr20 BAM file for a sample from 1000 Genomes. Writing the dataset to partitioned Parquet with contig/1 Mb bins, and reading it back, both seem to be working fine.

Each 1 megabase bin (1..60) under contig 20 has one parquet file in it.

Example queries that are working are below:

val read = sc.loadAlignments("hg00096_chr20_1")
val df = read.dataset
df.filter("contigName = '20'").filter("posBin = '39'").count
df.createOrReplaceTempView("genomedata")
val sqlResult = spark.sql("SELECT * FROM genomedata where contigName='20' and posBin>39 and posBin<50")
sqlResult.count

Next, I will do some performance testing to see whether retrieval of various range sizes is faster or slower through the SQL path with the contigName/posBin columns, compared to:
a) a non-Hive-partitioned dataset written to Parquet
b) non-Hive-partitioned Parquet written from an RDD

@fnothaft - note: the problem with the posBin column that I thought I'd encountered earlier appears not to be a problem after all. I think I was previously not actually using the dataset to which I had added the posBin column; happily, it seems to just work now.

@jpdna


Member

jpdna commented Jul 23, 2017

Some preliminary results:
On the bdg cluster, using 20 executors, reading a 14 GB low-coverage whole-genome BAM file:
Filtering this whole-genome dataset down to a 1 Mb region (chr4:31000000-32000000) and then counting or re-writing to Parquet takes 25 seconds using the existing non-partitioned Parquet ADAM files, but just 1 second when reading Hive-style partitioned Parquet files.

Parsing the original BAM file and writing to partitioned Parquet takes 2.5 minutes, while the existing non-partitioned Parquet write takes 3.5 minutes, so, somewhat surprisingly to me, there is a bit of gain on write too.

I think the improvement in region-filtering time for Hive-style partitioned data can be attributed to eliminating the scan time needed to touch and reject partitions based on min/max statistics in the original Parquet method. Increasing the number of executors relative to the number of partitions in the input dataset should reduce the difference between the methods, in the limit where this scan becomes fully parallel. Single nodes or small clusters should see the most gain from the partitioning strategy, though there seems to be little downside even on a larger cluster.
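As a sanity check (not part of this PR; it reuses the path and columns from the earlier example), partition pruning can be confirmed by looking for the partition predicates in the physical plan:

// if pruning works, the Parquet scan in the plan should list the contigName
// and posBin predicates as PartitionFilters rather than ordinary data filters
val df = spark.read.parquet("hg00096_chr20_1")
df.filter("contigName = '20' and posBin = 39").explain()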

I'm writing up the tests so far in detail, and adding more.
In addition to the counting and re-writing-to-disk tasks, I'll add k-mer counting.

@akmorrow13 - I look forward to seeing if this will be useful for retrieving region slices with lower latency for visualization in Mango.

@jpdna


Member

jpdna commented Jul 27, 2017

Can someone point out to me in the Jenkins output why my tests are failing, and suggest fixes? This builds successfully, with tests, when I run mvn clean package on my local machine.

@fnothaft


Member

fnothaft commented Jul 31, 2017

@jpdna as an aside, I was thinking about the null issue when loading as Parquet into RDD-land and did a small amount of benchmarking. If we bind to a dataset and then do .rdd to convert back into an RDD, we're only about 20% slower (for a really trivial query). I think this overhead is pretty much in the noise, so we could bind Hive-partitioned files to Dataset on load and that'd resolve the null issue.

@jpdna


Member

jpdna commented Aug 2, 2017

bind Hive-partitioned files to Dataset on load

In that case, do we need a flag, or should we otherwise explicitly detect when loading from Hive-partitioned files? Right now the partitioning is just auto-detected when reading, I believe.

@fnothaft


Member

fnothaft commented Aug 2, 2017

bind Hive-partitioned files to Dataset on load

In that case, do we need a flag, or should we otherwise explicitly detect when loading from Hive-partitioned files? Right now the partitioning is just auto-detected when reading, I believe.

Yeah, my thought is that we'd either touch some file (e.g., _hivePartitioned) on write and then look for that file on load, or we'd try to identify the partitioning scheme, whichever is the right combo of easy/robust.

@AmplabJenkins


AmplabJenkins commented Aug 3, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2311/
Test FAILed.

@jpdna


Member

jpdna commented Aug 3, 2017

I reverted the pom back to Scala 2.10/Spark 1.x as @fnothaft suggested, but everything still fails in Jenkins. Suggestions?

@fnothaft


Member

fnothaft commented Aug 3, 2017

Jenkins, retest this please.

@fnothaft


Member

fnothaft commented Aug 3, 2017

@jpdna it was a jenkins glitch, I've just kicked off another test.

@coveralls


coveralls commented Aug 3, 2017

Coverage Status

Coverage decreased (-1.1%) to 83.145% when pulling bbdd372 on jpdna:hive_partitions into 640c44b on bigdatagenomics:master.

@AmplabJenkins


AmplabJenkins commented Aug 3, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2313/
Test PASSed.

@jpdna


Member

jpdna commented Aug 5, 2017

look for that file on load or we'd try to identify the partitioning scheme

I've implemented this _hivePartitioned marker file, so that when the file is present, sc.loadAlignments("mypath") returns a DatasetBoundAlignmentRecordRDD. This prevents the possibility of the null contigName error discussed above when attempting to read directly as an RDD.

@coveralls


coveralls commented Aug 5, 2017

Coverage Status

Coverage decreased (-0.8%) to 83.484% when pulling fbfd059 on jpdna:hive_partitions into 640c44b on bigdatagenomics:master.

@AmplabJenkins


AmplabJenkins commented Aug 5, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2314/
Test PASSed.

@coveralls


coveralls commented Aug 7, 2017

Coverage Status

Coverage decreased (-0.3%) to 83.124% when pulling 4e1f978 on jpdna:hive_partitions into 96fc37f on bigdatagenomics:master.

@AmplabJenkins


AmplabJenkins commented Aug 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2315/
Test PASSed.

@jpdna


Member

jpdna commented Aug 7, 2017

Ready for initial review by others.

I added parameters to saveAsParquet to enable/disable Hive-style partitioning on write and to control the genomic-region bin size of the subdirectories that Parquet files are written into.

rdd.saveAsParquet(outputPath, enableHivePartitioning = true, hivePartitioningGenomicBinSize = 1000000)

This is nice because, in the absence of explicitly enabling partitioning here, behavior should not be altered.

As discussed above, a previous commit includes a _hivePartitioned marker file written alongside the Parquet files; when it is detected on read, we automatically bind to a Dataset, ameliorating the earlier issue with reading as an RDD.

todo:

  • I am suspicious as to why Jenkins says this passes on Spark 1.6, when I believe I found that the partitioning only worked with Spark 2.1.x, so I need to make sure the tests are properly exercising all the partitioning features.
  • add tests to cover all the types (Fragment, etc.)
  • formally report the performance metrics for reading and writing partitioned vs. non-partitioned datasets and RDDs
@fnothaft

Thanks @jpdna! This looks like a great start! I dropped a few nits inline; the File -> Hadoop Path/FileSystem change will need to be propagated across many files.

One thing I was noticing is that I'm not sure that the correct behavior happens if you save an RDDBound...RDD. I think the way to handle this properly is to factor the saveAsParquet code out so that there's a single "save as binned" method and a "save unbinned" method. If this suggestion is unclear, let me know and I can put together a PR showing what I'm thinking.

@@ -1777,8 +1777,15 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
(optPredicate, optProjection) match {
case (None, None) => {
ParquetUnboundAlignmentRecordRDD(sc, pathName, sd, rgd, pgs)
val hiveFlagFile = new File(pathName, "_hivePartitioned")


@fnothaft

fnothaft Aug 8, 2017

Member

This will only work for a file that is stored on a "local" (read: mounted POSIX compatible) file system. I'd suggest factoring this out into a function:

private def checkHiveFlag(pathName: String): Boolean = {
  val hivePath = new Path(pathName, "_hivePartitioned")
  val fs = hivePath.getFileSystem(sc.hadoopConfiguration)
  fs.exists(hivePath)
}
disableDictionaryEncoding: Boolean = false) {
disableDictionaryEncoding: Boolean = false,
enableHivePartitioning: Boolean = false,
hivePartitioningGenomicBinSize: Int = 1000000) {


@fnothaft

fnothaft Aug 8, 2017

Member

I would roll enableHivePartitioning and hivePartitioningGenomicBinSize into one parameter optHivePartitioningBinSize
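A hedged sketch of that signature (the two helper method names below are invented for illustration, not from the PR):

// None preserves today's unpartitioned write; Some(binSize) enables
// partitioning with the given genomic bin size
def saveAsParquet(
  filePath: String,
  optHivePartitioningBinSize: Option[Int] = None): Unit = {
  optHivePartitioningBinSize match {
    case Some(binSize) => savePartitionedParquet(filePath, binSize) // hypothetical helper
    case None          => saveUnpartitionedParquet(filePath)        // hypothetical helper
  }
}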

.option("spark.sql.parquet.compression.codec", compressCodec.toString.toLowerCase())
.save(filePath)
val hiveFlagFile = new File(filePath, "_hivePartitioned")


@fnothaft

fnothaft Aug 8, 2017

Member

These will also need to be rewritten to use either the Hadoop FileSystem API or the java.nio libraries.
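For the write side, a sketch of the same fix using the Hadoop FileSystem API (mirroring the fs.createNewFile call that appears in a later diff):

import org.apache.hadoop.fs.Path

// create the marker file via the Hadoop FileSystem so this also works on
// HDFS, not just a locally mounted POSIX file system
val flagPath = new Path(filePath, "_hivePartitioned")
val fs = flagPath.getFileSystem(sc.hadoopConfiguration)
fs.createNewFile(flagPath)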

@jpdna


Member

jpdna commented Aug 8, 2017

is to factor the saveAsParquet code out so that there's a single "save as binned" method and a "save unbinned"

@fnothaft can you elaborate a bit here: are you suggesting that instead of calling saveAsParquet we'd call saveAsParquetUnbinned or saveAsParquetBinned?

@heuermh


Member

heuermh commented Aug 8, 2017

As a first high level review, I think all references to "Hive" should be dropped.

Then I'm in agreement with @fnothaft in that boolean flags make babies cry. I would prefer methods saveAsParquet and saveAsBinnedParquet or saveAsPartitionedParquet.

Should the marker file be called _binnedByStartPosition or _partitionedByStartPosition? Are there any other binned partitionBy fields we would want to support?

I assume any new command line options will conflict with -coalesce and -repartition?

Finally, is it possible to implement this as an external Maven module? Feels invasive to have to modify/extend all of the *RDD classes.

@jpdna


Member

jpdna commented Aug 8, 2017

Finally, is it possible to implement this as an external Maven module? Feels invasive to have to modify/extend all of the *RDD classes.

As best I can tell (because I cut and pasted it, anyhow), the code implementing the overridden saveAsParquet function is the same in every one of our *RDD classes, and I am wondering whether the implementation can be factored out into the base class or elsewhere to be more DRY. If so, maybe it won't be so invasive.

I am not sure how to go about making this functionality a separate module. If it works well, I'd like to think partitioning Parquet like this may become the standard mode when saving from a Dataset, since so far I don't see much of a performance penalty. I'll proceed with the changes suggested above within the current classes unless others suggest another way forward.

@fnothaft


Member

fnothaft commented Aug 8, 2017

Finally, is it possible to implement this as an external Maven module? Feels invasive to have to modify/extend all of the *RDD classes.

-1. This is very reasonable functionality to have in the GenomicRDDs. Just to clarify in case there's a misunderstanding, this functionality isn't only usable with Hive, the "Hive" nomenclature comes from the fact that Hive was the tool that introduced this type of partitioning scheme.

and if it works well I'd like to think partitioning parquet like this may become the standard mode when saving from a Dataset

+1!

@heuermh


Member

heuermh commented Aug 9, 2017

Just to clarify in case there's a misunderstanding, this functionality isn't only usable with Hive, the "Hive" nomenclature comes from the fact that Hive was the tool that introduced this type of partitioning scheme.

Agreed; in the Spark docs it is referred to as "[an] approach used in systems like Hive." Thus I don't see a need to mention Hive anywhere in field, method, or file names.

...may become the standard mode when saving from a Dataset

Feels like there may be pathological cases where this falls down, say lots of unaligned reads, but perhaps it will be no worse than the default.

@jpdna


Member

jpdna commented Aug 14, 2017

update wrt last push:

  • fixed flag file writing to work with HDFS
  • Now there are specific saveAsPartitionedParquet and loadPartitionedParquet functions. The load functions replace what was previously an attempt to keep a single load function and vary its behavior based on the flag file. The explicit approach is cleaner because the optional Parquet predicate and projection parameters will never be meaningful for the Datasets loaded from partitioned Parquet; those filters/predicates will instead be applied via SQL/Dataset syntax in subsequent steps, after loadPartitionedParquet returns its Dataset handle.
  • All references to Hive have been removed
  • There is only a single implementation of saveAsPartitionedParquet, appearing in the base class GenomicRDD. I need to evaluate whether this creates any problems for subclasses for which this function cannot work (if any)

todo:

  • update tests
  • further collection of performance metrics
  • consider some wrapper functions or other utilities/demos integrating the partitioned/Dataset read path with algorithms and the ADAM command line tools
@coveralls


coveralls commented Aug 14, 2017

Coverage Status

Coverage decreased (-0.5%) to 82.984% when pulling 26a7bcf on jpdna:hive_partitions into 96fc37f on bigdatagenomics:master.

@AmplabJenkins


AmplabJenkins commented Aug 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2318/
Test PASSed.

@AmplabJenkins


AmplabJenkins commented Oct 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2418/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 98154a7 # timeout=10
Checking out Revision 98154a7 (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 98154a7
> /home/jenkins/git2/bin/git rev-list fcecb6b803c52a88747c55f934d79553e7e8bc1e # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); all completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins


AmplabJenkins commented Oct 9, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2419/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 86e20a6 # timeout=10
Checking out Revision 86e20a6 (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 86e20a6
> /home/jenkins/git2/bin/git rev-list 98154a7 # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); the four Spark 2.2.0 builds completed with result SUCCESS and the four Spark 1.6.3 builds with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@jpdna


Member

jpdna commented Oct 9, 2017

If you know a better way to set up a Dataset API query using dataset columns than building up this string, as here: https://github.com/jpdna/adam/blob/217183e397aa97a67043cec8fe155dd3df2e19bf/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L1808
let me know. (One possible alternative is sketched below.)
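One possible alternative, offered here as a suggestion rather than anything in the PR, is to build the predicate from typed Column expressions instead of a concatenated string (reusing the read handle from the earlier example; the bin values are illustrative):

import org.apache.spark.sql.functions.col

// the same query expressed with Column operators; Spark can still push these
// predicates down to the partition columns
val slice = read.dataset
  .filter(col("contigName") === "20" && col("posBin") >= 31 && col("posBin") <= 32)
slice.count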

@@ -2105,6 +2139,21 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
}
}
def loadPartitionedParquetGenotypes(pathName: String): GenotypeRDD = {


@akmorrow13

akmorrow13 Oct 11, 2017

Contributor

Is there a reason there is no regions parameter here?

@@ -2138,6 +2187,35 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
}
}
def loadPartitionedParquetVariants(pathName: String, regions: Option[Iterable[ReferenceRegion]] = None, partitionSize: Int = 1000000): VariantRDD = {


@akmorrow13

akmorrow13 Oct 11, 2017

Contributor

Is there any interest in combining this functionality into the existing load functions? It may be nice so the user does not have to know whether the files are partitioned when loading.


@jpdna

jpdna Oct 16, 2017

Member

I looked at that earlier, using a flag file to detect whether the input was partitioned, but the trouble is that the existing load functions have predicate and projection parameters that we currently have no way to translate directly into something Spark SQL/Datasets can use.


@fnothaft

fnothaft Nov 9, 2017

Member

I think it's fine if the predicate/projection parameters can't be translated over. In the main loadParquet* functions, what I would do is (sketched after this list):

  • Check for the partitionedParquetFlag
  • If the flag is set, check if predicate/projection are provided. If projection is provided, log a warning. If predicate is provided, throw an exception.
  • If we didn't throw an exception, call loadPartitionedParquet*
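A minimal sketch of that gating (hedged: the loadPartitionedParquetAlignments name and the exact logging/exception calls are assumptions; checkPartitionedParquetFlag is the flag check from this PR):

// inside a loadParquet*-style function, gate on the marker file
if (checkPartitionedParquetFlag(pathName)) {
  // projections can't be applied through the Dataset path: warn and ignore
  optProjection.foreach(_ => log.warn("Projection is ignored for partitioned Parquet."))
  // a silently unapplied predicate would be worse: fail fast
  require(optPredicate.isEmpty,
    "Parquet predicates are unsupported for partitioned Parquet; filter the Dataset instead.")
  loadPartitionedParquetAlignments(pathName)
} else {
  loadParquetAlignments(pathName, optPredicate, optProjection)
}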
@jpdna


Member

jpdna commented Oct 16, 2017

I've done some single-node benchmarking using a version of Mango that relies on this PR and uses a partitioned alignment dataset.
Random access of a 1000-base-pair region:
unpartitioned (current) ADAM files: 8-9 seconds
partitioned ADAM files (this PR): 1.5-2.5 seconds

https://docs.google.com/presentation/d/1p6nA_vhydW2J2O7iFTJbsohbC_FAFEhpqfmzNtb8jRU/edit?usp=sharing

@akmorrow13


Contributor

akmorrow13 commented Oct 16, 2017

Awesome @jpdna! Can you make a PR in Mango so we can track these numbers there?

@AmplabJenkins


AmplabJenkins commented Oct 21, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2442/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 88d64fd # timeout=10
Checking out Revision 88d64fd (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 88d64fd
> /home/jenkins/git2/bin/git rev-list 86e20a6 # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); the four Spark 2.2.0 builds completed with result SUCCESS and the four Spark 1.6.3 builds with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

A couple of misc comments. Looks good in general; I'm really excited about where this is going. Can you add Scaladoc to all new methods?

@@ -18,15 +18,10 @@
package org.bdgenomics.adam.rdd
import java.io.{ File, FileNotFoundException, InputStream }


@fnothaft

fnothaft Nov 9, 2017

Member

Can you remove this whitespace?

VCFHeaderLine,
VCFInfoHeaderLine
}
import htsjdk.variant.vcf.{ VCFCompoundHeaderLine, VCFFormatHeaderLine, VCFHeader, VCFHeaderLine, VCFInfoHeaderLine }


@fnothaft

fnothaft Nov 9, 2017

Member

Can you undo this? We break at 80 chars or 4 items.

RDDBoundNucleotideContigFragmentRDD
}
import org.bdgenomics.adam.projections.{ FeatureField, Projection }
import org.bdgenomics.adam.rdd.contig.{ DatasetBoundNucleotideContigFragmentRDD, NucleotideContigFragmentRDD, ParquetUnboundNucleotideContigFragmentRDD, RDDBoundNucleotideContigFragmentRDD }


@fnothaft

fnothaft Nov 9, 2017

Member

Can you break this line? We indent at 80 chars/4 imports.

RDDBoundAlignmentRecordRDD
}
import org.bdgenomics.adam.rdd.fragment.{ DatasetBoundFragmentRDD, FragmentRDD, ParquetUnboundFragmentRDD, RDDBoundFragmentRDD }
import org.bdgenomics.adam.rdd.read.{ AlignmentRecordRDD, DatasetBoundAlignmentRecordRDD, ParquetUnboundAlignmentRecordRDD, RDDBoundAlignmentRecordRDD, RepairPartitions }


@fnothaft

fnothaft Nov 9, 2017

Member

Can you break this line? We indent at 80 chars/4 imports.

NucleotideContigFragment => NucleotideContigFragmentProduct,
Variant => VariantProduct
}
import org.bdgenomics.adam.sql.{ AlignmentRecord => AlignmentRecordProduct, Feature => FeatureProduct, Fragment => FragmentProduct, Genotype => GenotypeProduct, NucleotideContigFragment => NucleotideContigFragmentProduct, Variant => VariantProduct }


@fnothaft

fnothaft Nov 9, 2017

Member

Can you break this line? We indent at 80 chars/4 imports.

val genotypes = ParquetUnboundGenotypeRDD(sc, pathName, sd, samples, headers)
val datasetBoundGenotypeRDD: GenotypeRDD = regions match {


@fnothaft

fnothaft Nov 9, 2017

Member

Instead of having regions: Option[Iterable[ReferenceRegion]] and running:

regions match {
  case Some(x) => DatasetBoundGenotypeRDD(genotypes.dataset.filter(referenceRegionsToDatasetQueryString(x)))
  case None => DatasetBoundGenotypeRDD(genotypes.dataset)
}

Change the type of regions to Iterable[ReferenceRegion] and then apply the region filters by running:

regions.foldLeft(genotypes)((rdd, region) => rdd.transformDataset(_.filter(referenceRegionsToDatasetQueryString(Iterable(region)))))


@jpdna

jpdna Nov 10, 2017

Member

I think I understand: you want to use the more elegant foldLeft over a potentially empty regions list, with the default being the unfiltered dataset if regions is empty.

However, I'm not sure the code you suggest implements the OR logic intended for this region filter; won't the consecutive applications of filter, as the list is folded, result in AND logic?

I'm inclined to leave the clunky regions match code for now if we think it is correct, as it seems to work. (An OR-preserving alternative is sketched below.)
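For what it's worth, a hedged sketch of how the empty-regions default could be kept without foldLeft, preserving the OR logic that referenceRegionsToDatasetQueryString builds over the whole collection (assuming regions is the plain Iterable proposed above):

// apply a single filter built over all regions at once (OR semantics inside
// referenceRegionsToDatasetQueryString), defaulting to the unfiltered dataset
val filtered =
  if (regions.isEmpty) genotypes
  else genotypes.transformDataset(_.filter(referenceRegionsToDatasetQueryString(regions)))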

@@ -2904,4 +2939,29 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
loadParquetFragments(pathName, optPredicate = optPredicate, optProjection = optProjection)
}
}
def writePartitionedParquetFlag(filePath: String): Boolean = {


@fnothaft

fnothaft Nov 9, 2017

Member

This should be private (and go in GenomicRDD.scala, methinks?)

fs.createNewFile(path)
}
def checkPartitionedParquetFlag(filePath: String): Boolean = {


@fnothaft

fnothaft Nov 9, 2017

Member

This should be private.


@jpdna

jpdna Nov 14, 2017

Member

I actually find a need to use this from the Mango library, in order to determine whether a dataset is partitioned and therefore whether the partitioned read functions should be used, since we needed to keep them separate. Perhaps a better option would be some wrapper function like maybeReadPartitioned, but for now I prefer to keep it public and give client code the chance to check explicitly.

fs.exists(path)
}
def referenceRegionsToDatasetQueryString(x: Iterable[ReferenceRegion], partitionSize: Int = 1000000): String = {


@fnothaft

fnothaft Nov 9, 2017

Member

This should be private.

}
def referenceRegionsToDatasetQueryString(x: Iterable[ReferenceRegion], partitionSize: Int = 1000000): String = {
var regionQueryString = "(contigName=" + "\'" + x.head.referenceName.replaceAll("chr", "") + "\' and posBin >= \'" +


@fnothaft

fnothaft Nov 9, 2017

Member

Why are we replacing "chr"s?

@AmplabJenkins


AmplabJenkins commented Nov 14, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2484/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 8fca515 # timeout=10
Checking out Revision 8fca515 (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 8fca515
> /home/jenkins/git2/bin/git rev-list 88d64fd # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); the four Spark 2.2.0 builds completed with result SUCCESS and the four Spark 1.6.3 builds with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins


AmplabJenkins commented Nov 15, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2485/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse origin/pr/1620/merge^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains df2c07b # timeout=10
Checking out Revision df2c07b (origin/pr/1620/merge)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f df2c07b
> /home/jenkins/git2/bin/git rev-list 8fca515 # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); all completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins


AmplabJenkins commented Nov 27, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2489/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse 41a271c^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 41a271c # timeout=10
Checking out Revision 41a271c (origin/pr/1620/head)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 41a271c9f805874032acde57fb2666482fca5ff0
First time build. Skipping changelog.
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); the four Spark 2.2.0 builds completed with result SUCCESS and the four Spark 1.6.3 builds with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins


AmplabJenkins commented Nov 28, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2490/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse 8f6c0be^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 8f6c0be # timeout=10
Checking out Revision 8f6c0be (origin/pr/1620/head)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f 8f6c0be
> /home/jenkins/git2/bin/git rev-list 41a271c # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); all completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh


This is not correct, even for human (e.g., chrMT). It is also probably misplaced.

@akmorrow13 akmorrow13 added this to the 0.24.0 milestone Nov 29, 2017

@AmplabJenkins


AmplabJenkins commented Dec 11, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2507/

Build result: FAILURE

[...truncated 15 lines...]
> /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
> /home/jenkins/git2/bin/git rev-parse bea3dfb^{commit} # timeout=10
> /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains bea3dfb # timeout=10
Checking out Revision bea3dfb (origin/pr/1620/head)
> /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
> /home/jenkins/git2/bin/git checkout -f bea3dfb
> /home/jenkins/git2/bin/git rev-list 8f6c0be # timeout=10
Triggered ADAM-prb for all eight matrix configurations (Hadoop 2.6.2/2.7.3 x Scala 2.10/2.11 x Spark 1.6.3/2.2.0, centos); all completed with result FAILURE.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@@ -2476,4 +2478,29 @@ abstract class AvroGenomicRDD[T <% IndexedRecord: Manifest, U <: Product, V <: A
def saveAsParquet(filePath: java.lang.String) {
saveAsParquet(new JavaSaveArgs(filePath))
}
def writePartitionedParquetFlag(filePath: String): Boolean = {
val path = new Path(filePath, "_isPartitionedByStartPos")


@akmorrow13

akmorrow13 Dec 12, 2017

Contributor

would there be a way to write a _partitioned file containing the fields it is partitioned by, instead of hard-coding start position?
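One hedged possibility (the names here are hypothetical): write the partition fields into the marker file itself, so the load side can read the scheme back rather than assuming it:

import org.apache.hadoop.fs.Path

// record the partition columns in the marker file instead of encoding a
// single scheme in the file name
def writePartitionedFlag(filePath: String, partitionedBy: Seq[String]): Unit = {
  val path = new Path(filePath, "_partitionedBy")
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  val os = fs.create(path)
  os.write(partitionedBy.mkString("\n").getBytes("UTF-8"))
  os.close()
}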

@heuermh heuermh added this to Triage in Release 0.24.0 Jan 4, 2018

@jpdna


Member

jpdna commented Jan 6, 2018

Closing this PR in favor of:
#1864

@jpdna jpdna closed this Jan 6, 2018

@heuermh heuermh moved this from Triage to Completed in Release 0.24.0 Jan 9, 2018
