[ADAM-1339] Use glob-safe method to load VCF header metadata for Parquet #1340

Merged

merged 1 commit into bigdatagenomics:master from issues/1339-parquet-vcf-metadata on Jan 5, 2017

Conversation

3 participants

fnothaft (Member) commented Jan 3, 2017

Resolves #1339.

AmplabJenkins commented Jan 3, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1717/
Test PASSed.

heuermh (Member) commented Jan 4, 2017

I'm seeing the same on HEAD:

$ ./bin/adam-submit vcf2adam "adam-core/src/test/resources/sorted**vcf" sorted.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
...
Jan 3, 2017 5:55:19 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jan 3, 2017 5:55:19 PM WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for sorted.adam
org.apache.parquet.io.ParquetEncodingException: /Users/heuermh/working/adam/sorted.adam/part-r-00000.gz.parquet invalid: all the files must be contained in the root sorted.adam
	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:444)
	at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1145)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.InstrumentedPairRDDFunctions.saveAsNewAPIHadoopFile(InstrumentedPairRDDFunctions.scala:477)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply$mcV$sp(ADAMRDDFunctions.scala:159)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:143)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:143)
	at scala.Option.fold(Option.scala:157)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions.saveRddAsParquet(ADAMRDDFunctions.scala:143)
	at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:908)
	at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:883)
	at org.bdgenomics.adam.cli.Vcf2ADAM.run(Vcf2ADAM.scala:74)
	at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
	at org.bdgenomics.adam.cli.Vcf2ADAM.run(Vcf2ADAM.scala:53)
	at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:128)
	at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:68)
	at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

and on the issues/1339-parquet-vcf-metadata branch:

adam issues/1339-parquet-vcf-metadata
$ mvn clean install
$ ./bin/adam-submit vcf2adam "adam-core/src/test/resources/sorted**vcf" sorted.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
...
Jan 3, 2017 6:17:14 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 47B for [phaseQuality] INT32: 18 values, 6B raw, 26B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, RLE]
Jan 3, 2017 6:17:14 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jan 3, 2017 6:17:14 PM WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for sorted.adam
org.apache.parquet.io.ParquetEncodingException: /Users/heuermh/working/adam/sorted.adam/part-r-00000.gz.parquet invalid: all the files must be contained in the root sorted.adam
	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:444)
	at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1145)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
	at org.apache.spark.rdd.InstrumentedPairRDDFunctions.saveAsNewAPIHadoopFile(InstrumentedPairRDDFunctions.scala:477)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply$mcV$sp(ADAMRDDFunctions.scala:159)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:143)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:143)
	at scala.Option.fold(Option.scala:157)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMRDDFunctions.saveRddAsParquet(ADAMRDDFunctions.scala:143)
	at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:908)
	at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:883)
	at org.bdgenomics.adam.cli.Vcf2ADAM.run(Vcf2ADAM.scala:74)
	at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
	at org.bdgenomics.adam.cli.Vcf2ADAM.run(Vcf2ADAM.scala:53)
	at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:128)
	at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:68)
	at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Or perhaps I'm doing something wrong?

heuermh (Member) commented Jan 4, 2017

I assume you would like this in 0.21.0? Feel free to set the milestone.

fnothaft added this to the 0.21.0 milestone Jan 4, 2017

fnothaft (Member) commented Jan 4, 2017

Just set to 0.21.0. We need this for Variant DB.

fnothaft referenced this pull request Jan 5, 2017: WIP ADAM queries. #6 (Open)

fnothaft (Member) commented Jan 5, 2017

Or perhaps I'm doing something wrong?

That's an unrelated warning. I haven't figured out what exactly that warning relates to, but whenever I've gotten that warning, the file has still saved OK.

This PR is solely related to loading Parquet files and doesn't change anything on the save path.
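
As a side note, here is a minimal sketch of one way that summary-file warning could be suppressed from the Spark driver, assuming this version of parquet-mr honors the parquet.enable.summary-metadata configuration key:

```scala
// Sketch, not ADAM code: ask parquet-mr not to write the _metadata /
// _common_metadata summary files on job commit. Assumes this version of
// parquet-mr reads the "parquet.enable.summary-metadata" key.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
```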

heuermh (Member) commented Jan 5, 2017

solely related to loading Parquet files

Ah, I see. Is this on HEAD demonstrating the issue, or is it just that I don't know how to use globbing properly?

scala> val variants = sc.loadVariants("*.adam")
java.io.FileNotFoundException: Couldn't find any files matching *.adam

scala> val variants = sc.loadVariants("/Users/heuermh/working/adam/*.adam")
java.io.FileNotFoundException: Couldn't find any files matching /Users/heuermh/working/adam/*.adam

scala> val variants = sc.loadVariants("/Users/heuermh/working/adam/**.adam")
java.io.FileNotFoundException: Couldn't find any files matching /Users/heuermh/working/adam/**.adam

scala> val variants = sc.loadVariants("/Users/heuermh/working/adam/*.adam/*")
java.io.FileNotFoundException: File /Users/heuermh/working/adam/*.adam/*/_header does not exist

I got the last one from this line in the Variant DB challenge submission. Why would the trailing /* be necessary?

val otherGts = sc.loadGenotypes("%s/NA1289*.gt.adam/*".format(dataDir)).transform(_.cache())

fnothaft (Member) commented Jan 5, 2017

This is on HEAD, and it is the last exception you got ("/path/*/_header does not exist").

Aside: part of your confusion is caused by globs getting a bit funny in Hadoop due to sharding. That is, since we save as directories of sharded Parquet, your glob needs to select all of the directories you want ("*.adam") and all of the shards within each directory (which is where the glob at the end comes from). We could probably roll in some more intelligence using listStatus from the Hadoop FileSystem API if there's interest. Done correctly, that would eliminate the need for the glob at the end.
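
For illustration, a rough sketch of that listStatus idea; expandGlob is a hypothetical helper, not existing ADAM code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: expand a glob like "*.adam" so that each matched
// directory of sharded Parquet is replaced by the part files inside it,
// removing the need for a trailing "/*" in the user's glob.
def expandGlob(pathName: String, conf: Configuration): Seq[Path] = {
  val path = new Path(pathName)
  val fs = path.getFileSystem(conf)
  // globStatus resolves the glob against the filesystem; it can return
  // null when nothing matches, hence the Option wrapper
  Option(fs.globStatus(path)).toSeq.flatten.flatMap(status => {
    if (status.isDirectory) {
      // a matched directory (e.g. "sorted.adam"): list the shards inside,
      // skipping sidecar files like _SUCCESS, _metadata, and _header
      fs.listStatus(status.getPath)
        .filter(s => s.isFile && !s.getPath.getName.startsWith("_"))
        .map(_.getPath)
        .toSeq
    } else {
      Seq(status.getPath)
    }
  })
}
```

With something like that on the load path, sc.loadVariants("/Users/heuermh/working/adam/*.adam") could resolve straight to the shard files, without the trailing /*.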

heuermh (Member) commented Jan 5, 2017

Thanks for the clarification; it works for me on the branch. We might want to document the glob behavior somewhere. Or maybe I should first go look and see whether we already did. :)

heuermh approved these changes Jan 5, 2017

heuermh merged commit 573996b into bigdatagenomics:master Jan 5, 2017

1 check passed

default: Merged build finished.

heuermh (Member) commented Jan 5, 2017

Thank you, @fnothaft
