
[ADAM-1339] Use glob-safe method to load VCF header metadata for Parquet #1340

Merged

Conversation

@fnothaft
Member

fnothaft commented Jan 3, 2017

Resolves #1339.
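
For illustration, here is a minimal sketch of the glob-safe idea (the helper name and exact shape are mine, not the literal diff): expand the input path with Hadoop's FileSystem.globStatus first, then resolve the _header sidecar file under each match, instead of building a single literal "&lt;glob&gt;/_header" path that can never exist.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only; illustrative names, not ADAM's actual methods.
// Resolve _header under every directory matched by a (possibly glob) path.
def headerPaths(pathName: String, conf: Configuration): Seq[Path] = {
  val path = new Path(pathName)
  val fs = path.getFileSystem(conf)
  // globStatus can return null when nothing matches, so guard for it
  Option(fs.globStatus(path))
    .map(_.toSeq)
    .getOrElse(Seq.empty)
    .map(status => new Path(status.getPath, "_header"))
    .filter(p => fs.exists(p))
}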

@AmplabJenkins

AmplabJenkins commented Jan 3, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1717/

@heuermh
Member

heuermh commented Jan 4, 2017

I'm seeing the same on HEAD:

$ ./bin/adam-submit vcf2adam "adam-core/src/test/resources/sorted**vcf" sorted.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
...
Jan 3, 2017 5:55:19 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jan 3, 2017 5:55:19 PM WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for sorted.adam
org.apache.parquet.io.ParquetEncodingException: /Users/heuermh/working/adam/sorted.adam/part-r-00000.gz.parquet invalid: all the files must be contained in the root sorted.adam
	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:444)
	at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1145)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.InstrumentedPairRDDFunctions.saveAsNewAPIHadoopFile(InstrumentedPairRDDFunctions.scala:477)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply$mcV$sp(ADAMRDDFunctions.scala:159)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:143)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:143)
	at scala.Option.fold(Option.scala:157)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions.saveRddAsParquet(ADAMRDDFunctions.scala:143)
	at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:908)
	at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:883)
	at org.bdgenomics.adam.cli.Vcf2ADAM.run(Vcf2ADAM.scala:74)
	at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
	at org.bdgenomics.adam.cli.Vcf2ADAM.run(Vcf2ADAM.scala:53)
	at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:128)
	at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:68)
	at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

and on the issues/1339-parquet-vcf-metadata branch:

adam issues/1339-parquet-vcf-metadata
$ mvn clean install
$ ./bin/adam-submit vcf2adam "adam-core/src/test/resources/sorted**vcf" sorted.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
...
Jan 3, 2017 6:17:14 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 47B for [phaseQuality] INT32: 18 values, 6B raw, 26B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, RLE]
Jan 3, 2017 6:17:14 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jan 3, 2017 6:17:14 PM WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for sorted.adam
org.apache.parquet.io.ParquetEncodingException: /Users/heuermh/working/adam/sorted.adam/part-r-00000.gz.parquet invalid: all the files must be contained in the root sorted.adam
	... (stack trace identical to the run on HEAD above)

Or perhaps I'm doing something wrong?

@heuermh
Member

heuermh commented Jan 4, 2017

I assume you would like this in 0.21.0? Feel free to set the milestone.

fnothaft added this to the 0.21.0 milestone Jan 4, 2017
@fnothaft
Member Author

fnothaft commented Jan 4, 2017

Just set it to 0.21.0. We need this for Variant DB.

@fnothaft
Member Author

fnothaft commented Jan 5, 2017

Or perhaps I'm doing something wrong?

That's an unrelated warning. I haven't figured out exactly what triggers it, but whenever I've seen it, the file has still saved OK.

This PR is solely related to loading Parquet files and doesn't change anything on the save path.

@heuermh
Member

heuermh commented Jan 5, 2017

solely related to loading Parquet files

Ah, I see. Is this on HEAD demonstrating the issue, or do I just not know how to use globbing properly?

scala> val variants = sc.loadVariants("*.adam")
java.io.FileNotFoundException: Couldn't find any files matching *.adam

scala> val variants = sc.loadVariants("/Users/heuermh/working/adam/*.adam")
java.io.FileNotFoundException: Couldn't find any files matching /Users/heuermh/working/adam/*.adam

scala> val variants = sc.loadVariants("/Users/heuermh/working/adam/**.adam")
java.io.FileNotFoundException: Couldn't find any files matching /Users/heuermh/working/adam/**.adam

scala> val variants = sc.loadVariants("/Users/heuermh/working/adam/*.adam/*")
java.io.FileNotFoundException: File /Users/heuermh/working/adam/*.adam/*/_header does not exist

I got the last one from this line in the Variant DB challenge submission. Why would the trailing /* be necessary?

val otherGts = sc.loadGenotypes("%s/NA1289*.gt.adam/*".format(dataDir)).transform(_.cache())
@fnothaft
Member Author

fnothaft commented Jan 5, 2017

This is on HEAD, and the relevant one is the last exception you got ("/path/*/_header does not exist").

Aside: part of your confusion comes from globs getting a bit funny in Hadoop due to sharding. Since we save as directories of sharded Parquet files, your glob needs to select both the directories you want ("*.adam") and the shards within each directory (which is where the trailing glob comes from). We could probably add some more intelligence using listStatus from the Hadoop FileSystem API if there's interest; done correctly, that would eliminate the need for the trailing glob.
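
To make the listStatus idea concrete, here is a rough sketch (a hypothetical helper, not something this PR adds) that expands a glob like "*.adam" to the shard files inside each matched directory, so the trailing /* would no longer be needed:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, not in this PR: expand a directory-level glob
// to the Parquet shards inside each matched directory.
def expandShards(pattern: String, conf: Configuration): Seq[Path] = {
  val path = new Path(pattern)
  val fs = path.getFileSystem(conf)
  Option(fs.globStatus(path)).map(_.toSeq).getOrElse(Seq.empty)
    .flatMap(status =>
      if (status.isDirectory) {
        // a saved ADAM dataset is a directory of sharded Parquet files
        fs.listStatus(status.getPath).map(_.getPath).toSeq
      } else {
        Seq(status.getPath)
      })
}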

@heuermh
Member

heuermh commented Jan 5, 2017

Thanks for the clarification; this works for me on the branch. We might want to document the glob behavior somewhere. Or maybe I should first check whether we already did. :)

heuermh approved these changes Jan 5, 2017
heuermh merged commit 573996b into bigdatagenomics:master Jan 5, 2017
1 check passed
default Merged build finished.
@heuermh
Member

heuermh commented Jan 5, 2017

Thank you, @fnothaft
