spark-submit throws exception in Spark standalone using .adam transformed from .vcf #1121

Closed
car2008 opened this Issue Aug 23, 2016 · 12 comments


car2008 commented Aug 23, 2016

First, I ran "adam-submit -- vcf2adam /mnt/hgfs/shiyan/ALL.vcf /home/file/new/ALL.adam", which gave me ALL.adam.
Second, I ran a job on Spark:
spark-submit --class "com.neilferguson.PopStrat" --master spark://192.168.2.83:7077 /home/file/uber-neilferguson-0.0.1-SNAPSHOT.jar /home/file/new/ALL.adam /home/file/integrated_call_samples_v3.20130502.ALL.panel
It threw a lot of exceptions like this one:
2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 1.0 in stage 0.0 (TID 1, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00082.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 0.0 in stage 0.0 (TID 0, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00311.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 3.0 in stage 0.0 (TID 3, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00062.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 4.0 in stage 0.0 (TID 4, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00091.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 6.0 in stage 0.0 (TID 6, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00232.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 5.0 in stage 0.0 (TID 5, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00069.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 7.0 in stage 0.0 (TID 7, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00227.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 8.0 in stage 0.0 (TID 8, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00027.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2016-08-22 20:00:41 ERROR TaskSetManager:74 - Task 7 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 16, spark04): java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00227.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at com.neilferguson.PopStrat$.main(PopStrat.scala:79)
at com.neilferguson.PopStrat.main(PopStrat.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: File file:/home/file/new/ALL.adam/part-r-00227.gz.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Aug 22, 2016 7:59:36 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 335
2016-08-22 20:00:41 WARN TaskSetManager:70 - Lost task 6.3 in stage 0.0 (TID 17, spark04): org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:204)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Aug 22, 2016 8:22:01 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 335
But I saw that the files (part-r-00###.gz.parquet) are in ALL.adam, so why does it say they do not exist?

fnothaft (Member) commented Aug 23, 2016

Hi @car2008! It looks like you are running ADAM locally, but the second command on a Spark cluster (I saw --master spark://192.168.2.83:7077 in your spark-submit command). Two thoughts:

  1. You can run ADAM on your Spark cluster (and not in local mode) by running ./bin/adam-submit --master spark://192.168.2.83:7077 -- vcf2adam <input> <output>.
  2. If you're running on a Spark cluster, you'll need the files you're running on to be stored in a file system that is mounted on all of your nodes. Typically you'd use HDFS, but you can also run this off of an NFS mount, etc.; see the sketch after this list. That's why you're getting the file-not-found error: the path exists on the node where you ran the first ADAM job locally, but when you run the command on your cluster, Spark tries to read that file on another node in the cluster, where that path doesn't exist.
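
For example, here is a minimal sketch of that setup using HDFS. The hdfs:///user/car2008/... locations are hypothetical; substitute whatever shared path your cluster actually uses:

# stage the inputs in HDFS so every node sees the same paths
hdfs dfs -mkdir -p /user/car2008
hdfs dfs -put /mnt/hgfs/shiyan/ALL.vcf /user/car2008/
hdfs dfs -put /home/file/integrated_call_samples_v3.20130502.ALL.panel /user/car2008/
# convert on the cluster, reading from and writing to HDFS
./bin/adam-submit --master spark://192.168.2.83:7077 -- vcf2adam hdfs:///user/car2008/ALL.vcf hdfs:///user/car2008/ALL.adam
# run PopStrat against the HDFS copies
spark-submit --class "com.neilferguson.PopStrat" --master spark://192.168.2.83:7077 /home/file/uber-neilferguson-0.0.1-SNAPSHOT.jar hdfs:///user/car2008/ALL.adam hdfs:///user/car2008/integrated_call_samples_v3.20130502.ALL.panel

With every data path on HDFS, each executor resolves the same files no matter which node a task lands on.
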
car2008 commented Aug 23, 2016

Thank you so much @fnothaft. I took your advice and put my files (ALL.adam, integrated_call_samples_v3.20130502.ALL.panel, uber-neilferguson-0.0.1-SNAPSHOT.jar) in the same path on all of my nodes, then ran:
/home/soft/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class "com.neilferguson.PopStrat" --master spark://192.168.2.83:7077 /home/file/uber-neilferguson-0.0.1-SNAPSHOT.jar /home/file/ALL.adam /home/file/integrated_call_samples_v3.20130502.ALL.panel
It no longer threw the exception above, but there was a new error:
2016-08-22 22:28:01 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 3.0 in stage 0.0 (TID 3, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00062.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 0.0 in stage 0.0 (TID 0, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00311.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 2.0 in stage 0.0 (TID 2, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00254.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 7.0 in stage 0.0 (TID 7, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00227.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 6.0 in stage 0.0 (TID 6, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00232.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 1.0 in stage 0.0 (TID 1, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00082.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 5.0 in stage 0.0 (TID 5, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00069.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 4.0 in stage 0.0 (TID 4, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00091.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 8.0 in stage 0.0 (TID 8, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00027.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 9.0 in stage 0.0 (TID 9, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00122.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:17 ERROR TaskSetManager:74 - Task 8 in stage 0.0 failed 4 times; aborting job

car2008 commented Aug 23, 2016

First of all, thank you so much @fnothaft. I took your advice and put my files (ALL.adam, integrated_call_samples_v3.20130502.ALL.panel, uber-neilferguson-0.0.1-SNAPSHOT.jar) at the same path on all of my nodes, then ran the command below. It no longer threw the exception above, but there was a new error:
/home/soft/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class "com.neilferguson.PopStrat" --master spark://192.168.2.83:7077 /home/file/uber-neilferguson-0.0.1-SNAPSHOT.jar /home/file/ALL.adam /home/file/integrated_call_samples_v3.20130502.ALL.panel
2016-08-22 22:28:01 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 3.0 in stage 0.0 (TID 3, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00062.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 0.0 in stage 0.0 (TID 0, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00311.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 2.0 in stage 0.0 (TID 2, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00254.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 7.0 in stage 0.0 (TID 7, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00227.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 6.0 in stage 0.0 (TID 6, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00232.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 1.0 in stage 0.0 (TID 1, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00082.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 5.0 in stage 0.0 (TID 5, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00069.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 4.0 in stage 0.0 (TID 4, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00091.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 8.0 in stage 0.0 (TID 8, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00027.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:15 WARN TaskSetManager:70 - Lost task 9.0 in stage 0.0 (TID 9, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00122.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

2016-08-22 22:29:17 ERROR TaskSetManager:74 - Task 8 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 0.0 failed 4 times, most recent failure: Lost task 8.3 in stage 0.0 (TID 16, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00027.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at com.neilferguson.PopStrat$.main(PopStrat.scala:79)
at com.neilferguson.PopStrat.main(PopStrat.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file file:/home/file/ALL.adam/part-r-00027.gz.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 16 more
Aug 22, 2016 10:28:04 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 335
2016-08-22 22:29:17 WARN TaskSetManager:70 - Lost task 9.3 in stage 0.0 (TID 18, spark04): TaskKilled (killed intentionally)
Then I saw #899; my ALL.vcf looks like this:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
.......
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103.....
I also transformed the .vcf to .adam using /adam-submit -- anno2adam /mnt/hgfs/shiyan/ALL.vcf /home/file/2.adam, then ran the same code on the Spark cluster. It did not work either, and threw an exception like:
Found 0 samples
Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1344)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.first(RDD.scala:1341)
at com.neilferguson.PopStrat$.main(PopStrat.scala:110)
at com.neilferguson.PopStrat.main(PopStrat.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Aug 22, 2016 11:23:14 PM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 335
So, are there some problems with my ADAM files? ADAM version 0.19.0, Spark version 1.6.1, Scala 2.10.4 as default. The .vcf file I used is 11 GB; the .adam is 266 MB from vcf2adam and only 1.7 MB from anno2adam. Is that right? I really don't know how to resolve it.


fnothaft commented Aug 23, 2016

The VCF looks fine. I would strongly suggest against trying to put the files at the same path on each node, and would strongly suggest using a file system that mounts across your cluster (e.g., HDFS). Is there a reason you're not using HDFS or an NFS or a blob store like S3?
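
For what it's worth, the PopStrat code itself should not need to change for that: ADAM loads through Hadoop's FileSystem API, so once the data lives on a cluster-visible store only the URI changes. A minimal sketch, assuming sc is your SparkContext and the namenode address is a placeholder:

import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Genotype

// the same loadGenotypes call works for file://, hdfs://, s3n://, ... URIs
val genotypes: RDD[Genotype] = sc.loadGenotypes("hdfs://namenode:9000/user/ALL.adam")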


car2008 commented Aug 24, 2016

Thank you @fnothaft. Today I used HDFS and uploaded both the .vcf file and the .adam file to it.

  • I ran /home/soft/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class "com.neilferguson.PopStrat" --master spark://192.168.2.83:7077 /home/file/uber-neilferguson-0.0.1-SNAPSHOT.jar hdfs://192.168.2.85:9000/test/ALL.vcf /home/file/integrated_call_samples_v3.20130502.ALL.panel and it worked well.
  • Then I ran /home/soft/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class "com.neilferguson.PopStrat" --master spark://192.168.2.83:7077 /home/file/uber-neilferguson-0.0.1-SNAPSHOT.jar hdfs://192.168.2.85:9000/user/ALL.adam /home/file/integrated_call_samples_v3.20130502.ALL.panel, but it didn't work and threw exceptions:
    2016-08-24 01:58:50 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2016-08-24 02:00:38 WARN TaskSetManager:70 - Lost task 1.0 in stage 0.0 (TID 1, spark04): parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file hdfs://192.168.2.85:9000/user/ALL.adam/part-r-00001.gz.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant
    at org.bdgenomics.formats.avro.Genotype.put(Genotype.java:148)
    at parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:143)
    at parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:39)
    at parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:78)
    at parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:163)
    at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
    ... 16 more
  • So, why can Spark not read value at 0 in block 0 in file hdfs://192.168.2.85:9000/user/ALL.adam/part-r-00001.gz.parquet, when the same run with the .vcf file worked? Are there some problems with my ADAM files? jdk1.8.0_101, ADAM version 0.19.0, Spark version 1.6.1, Scala 2.10.4 as default; the .vcf file I used is 11 GB, the .adam is 266 MB from vcf2adam and 1.7 MB from anno2adam. Is that right?

fnothaft commented Aug 24, 2016

What version/commit of ADAM are you running? This message:

Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.bdgenomics.formats.avro.Variant

is typically a pretty good sign that the two tools you are using disagree on the schema being used. Or, in other words, the version of ADAM you are running is probably newer than the version of ADAM that com.neilferguson.PopStrat depends on. You could try to upgrade com.neilferguson.PopStrat to the newer version of ADAM (depending on the schema changes, this could be as simple as just recompiling) or you could just drop back to using an older version of ADAM so that vcf2adam writes the Parquet file with the schema that com.neilferguson.PopStrat is expecting to see.
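
If you want to check that directly, here is a minimal sketch (my own, not part of ADAM; it assumes parquet-avro and bdg-formats are on the classpath, and it checks both metadata keys that different parquet-avro releases have used) comparing the Avro schema recorded in a part file's footer against the Genotype schema compiled into your jar:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.bdgenomics.formats.avro.Genotype
import parquet.hadoop.ParquetFileReader

object SchemaCheck {
  def main(args: Array[String]): Unit = {
    // args(0): one part file, e.g. /home/file/ALL.adam/part-r-00027.gz.parquet
    val footer = ParquetFileReader.readFooter(new Configuration(), new Path(args(0)))
    val meta = footer.getFileMetaData.getKeyValueMetaData
    // parquet-avro records the writer's Avro schema under one of these keys
    val written = Option(meta.get("parquet.avro.schema")).orElse(Option(meta.get("avro.schema")))
    val compiled = Genotype.getClassSchema.toString
    println("written schema : " + written.getOrElse("<not found>"))
    println("compiled schema: " + compiled)
    // strict string comparison; if it prints false, the reader falls back to
    // GenericData$Record and the ClassCastException above is the result
    println("exact match    : " + written.exists(_ == compiled))
  }
}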


car2008 commented Aug 25, 2016

Hi @fnothaft, I am using version 0.19.1-SNAPSHOT to transform the .vcf to .adam, and I took your advice to

try to upgrade com.neilferguson.PopStrat to the newer version of ADAM

After I changed 0.16.0 (the default) to 0.19.1-SNAPSHOT, the code

val genotypeFile = args(0)
val allGenotypes: RDD[Genotype] = sparkContextToADAMContext(sc).loadGenotypes(genotypeFile)

throws the exception
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.bdgenomics.....
I also found #1118 (Clean up ADAMContext); has ADAMContext changed since version 0.19.0? I want to use version 0.19.1-SNAPSHOT; what can I do to resolve this? My programming skills are not very good, sorry to give you so much trouble.


fnothaft commented Aug 25, 2016

Hi @car2008 !

Can you post the full exception?

What I might suggest is using the ADAM 0.19.0 release, instead of the 0.19.1-SNAPSHOT artifacts. Specifically, we've done a lot of refactoring since we released 0.19.0, and it should be much easier to upgrade 0.16.0 to 0.19.0 than all the way to the latest code. Then, you'd want to use ADAM 0.19.0 for vcf2adam. You can download precompiled binaries for 0.19.0 here.


car2008 commented Aug 25, 2016

OK @fnothaft, thank you. I will try the ADAM 0.19.0 release and let you know the result. The errors while compiling the code were:

  • "not found : value sparkContextToADAMContext"
  • in "import org.bdgenomics.adam.rdd.ADAMContext",the ADAMContext is gray,cann't be clicked with the ctrl key
  • the exception "Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.bdgenomics...." happened because I used ADAM 0.19.1-SNAPSHOT to transform the .vcf to .adam but upgraded com.neilferguson.PopStrat from 0.16.0 to 0.19.0, so the versions were mismatched from the beginning and we can ignore it.
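
For reference, a minimal sketch of how that implicit is meant to come into scope, assuming ADAM 0.19.0 on the classpath (if the adam-core dependency is missing or has the wrong Scala suffix, the compiler cannot resolve org.bdgenomics.adam.rdd.ADAMContext, which is exactly the "not found" error above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// the wildcard import is what brings sparkContextToADAMContext into scope
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Genotype

val sc = new SparkContext(new SparkConf().setAppName("PopStrat"))
// with the implicit in scope, loadGenotypes is available directly on sc
val allGenotypes: RDD[Genotype] = sc.loadGenotypes("/home/file/ALL.adam")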

car2008 commented Aug 26, 2016

Thank you very much @fnothaft, I got the right result using the ADAM 0.19.0 release.
Now I have some advice on Genomic Analysis Using ADAM, Spark and Deep Learning for people who want to reproduce the test.
First, in the .pom file:

  • Spark version 1.6.1 replacing 1.2.0
  • ADAM version 0.19.0 replacing 0.16.0
  • Sparkling Water version 1.6.5 replacing 1.2.5
  • H2O version 3.8.2.6 replacing 3.0.0.8 (we only need to change the version; there is no need to install H2O separately once Sparkling Water is installed)
<dependency>
        <groupId>org.bdgenomics.adam</groupId>
        <artifactId>adam-core</artifactId>
        <version>${adam.version}</version>
</dependency>
<dependency>
         <groupId>org.bdgenomics.adam</groupId>
         <artifactId>adam-apis</artifactId>
         <version>${adam.version}</version>
</dependency>

is modified to

<dependency>
         <groupId>org.bdgenomics.adam</groupId>
         <artifactId>adam-core_2.10</artifactId>
         <version>${adam.version}</version>
</dependency>
<dependency>
         <groupId>org.bdgenomics.adam</groupId>
         <artifactId>adam-apis_2.10</artifactId>
         <version>${adam.version}</version>
</dependency>

then ,in the codes :

val header = StructType(Array(StructField("Region", StringType)) ++
      sortedVariantsBySampleId.first()._2.map(variant => {StructField(variant.variantId.toString, IntegerType)}))

is modified to

val header = DataTypes.createStructType(Array(DataTypes.createStructField("Region", DataTypes.StringType,false)) ++
      sortedVariantsBySampleId.first()._2.map(variant => {DataTypes.createStructField(variant.variantId.toString,DataTypes.IntegerType,false)}))
// Create the SchemaRDD from the header and rows and convert the SchemaRDD into a H2O dataframe
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val schemaRDD = sqlContext.applySchema(rowRDD, header)
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._
    val dataFrame = h2oContext.toDataFrame(schemaRDD)

is modified to

// Create the SchemaRDD from the header and rows and convert the SchemaRDD into a H2O dataframe
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    //val dataFrame=sqlContext.createDataFrame(rowRDD, header)
    val schemaRDD = sqlContext.applySchema(rowRDD, header)
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._ 
    val dataFrame1 =h2oContext.asH2OFrame(schemaRDD)
    val dataFrame=H2OFrameSupport.allStringVecToCategorical(dataFrame1)
// Split the dataframe into 50% training, 30% test, and 20% validation data
    val frameSplitter = new FrameSplitter(dataFrame, Array(.5, .3), Array("training", "test", "validation").map(Key.make), null)

is modified to

// Split the dataframe into 50% training, 30% test, and 20% validation data
   val frameSplitter = new FrameSplitter(dataFrame, Array(.5, .3), Array("training", "test", "validation").map(Key.make[Frame](_)), null)
// Set the parameters for our deep learning model.
    val deepLearningParameters = new DeepLearningParameters()
    deepLearningParameters._train = training
    deepLearningParameters._valid = validation

is modified to

// Set the parameters for our deep learning model.
    val deepLearningParameters = new DeepLearningParameters()
    deepLearningParameters._train = training._key
    deepLearningParameters._valid = validation._key
// Score the model against the entire dataset (training, test, and validation data)
    // This causes the confusion matrix to be printed
    deepLearningModel.score(dataFrame)('predict)

is modified to

// Score the model against the entire dataset (training, test, and validation data)
    // This causes the confusion matrix to be printed
    deepLearningModel.score(dataFrame)
    Add
import org.apache.spark.sql.types.DataTypes
import hex._
import water.fvec._
import water.support._
import _root_.hex.Distribution.Family
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.tree.gbm.GBMModel
import _root_.hex.{Model, ModelMetricsBinomial}

ok ,that's all, it will be better if you have other advice . Thank you again !

@fnothaft

Member

fnothaft commented Aug 26, 2016

@car2008 that's great to hear! Thanks for posting back with the update. If you're able to, you might want to open a pull request with the fixes you used against https://github.com/nfergu/popstrat. I can't speak for him, obviously, but I imagine @nfergu would be interested in the updates!

@car2008

car2008 commented Aug 26, 2016

OK @fnothaft, I have opened a pull request against https://github.com/nfergu/popstrat.

@nfergu

Contributor

nfergu commented Aug 26, 2016

Great, thanks! I'll try and take a look at the PR later today.

@fnothaft fnothaft closed this Nov 8, 2016
