Fail to Write RDD into HDFS with Parquet Format #344

Closed
PengWeiPRC opened this Issue Aug 3, 2014 · 4 comments

@PengWeiPRC

PengWeiPRC commented Aug 3, 2014

Hi there,

I was wondering if anybody could help me fix this issue:

I tried to write a function that stores an RDD into HDFS in Parquet format. The approach that worked with the previous Spark version was to mimic the "adamSave" function in adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMRDDFunctions.scala.
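Concretely, my save function looks roughly like this (a sketch of my own code, not ADAM's; MyRecord stands in for my Avro-generated record class):

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
import parquet.hadoop.ParquetOutputFormat

// Spark 0.9-era pattern: register AvroWriteSupport by hand on a plain
// ParquetOutputFormat, then save through the new Hadoop API.
def saveAsParquet(rdd: RDD[MyRecord], path: String): Unit = {
  val job = new Job(rdd.context.hadoopConfiguration)
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, MyRecord.SCHEMA$)
  rdd.map(r => (null, r)).saveAsNewAPIHadoopFile(
    path,
    classOf[java.lang.Void],
    classOf[MyRecord],
    classOf[ParquetOutputFormat[MyRecord]],
    job.getConfiguration)
}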

However, it stopped working when I upgraded Spark from 0.9 to 1.0.1. The error message is "could not instanciate class parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class".

I then checked the newer version of ADAM and found that you had modified the function and removed the step that sets the write support class.

I modified my code to match your newer version, but it still does not work. The error is:

java.lang.NullPointerException
parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:305)
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:713)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:731)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:731)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

This may not be the most appropriate place to ask, but ADAM is the only example I can find that uses avro-parquet and runs on Spark 1.0.1. Could you help me fix it? I would really appreciate it.

@fnothaft


Member

fnothaft commented Aug 3, 2014

Hi Pengwei!

Parquet updated their docs to indicate that the ParquetAvroOutputFormat should be used instead of the ParquetOutputFormat + AvroWriteSupport. If you've just removed the AvroWriteSupport, you'll also need to change the ParquetOutputFormat over to the ParquetAvroOutputFormat.


@PengWeiPRC


PengWeiPRC commented Aug 3, 2014

Hi Frank,

Thanks for your reply. Did you mean AvroParquetOutputFormat? I modified my code accordingly, but the error is now a ClassNotFoundException.
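For reference, the modified save call roughly looks like this (again a sketch, with MyRecord standing in for my Avro-generated class):

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import parquet.avro.AvroParquetOutputFormat

// Same save path, but AvroParquetOutputFormat supplies the Avro write
// support itself, so there is no explicit setWriteSupportClass call.
def saveAsParquet(rdd: RDD[MyRecord], path: String): Unit = {
  val job = new Job(rdd.context.hadoopConfiguration)
  AvroParquetOutputFormat.setSchema(job, MyRecord.SCHEMA$)
  rdd.map(r => (null, r)).saveAsNewAPIHadoopFile(
    path,
    classOf[java.lang.Void],
    classOf[MyRecord],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)
}

It now fails at run time with: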

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception failure in TID 6 on host Gene1.CS.UCLA.EDU: java.lang.ClassNotFoundException: parquet.avro.AvroParquetOutputFormat

I am pretty sure that I imported "parquet.avro.AvroParquetOutputFormat"; my pom.xml has the parquet-avro dependency, and I can find the class in the jar. Any suggestions? Thanks very much.
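Or could it be that the jar is never shipped to the executors, since the exception is raised on a worker host rather than on the driver? A sketch of what I could try when building the SparkContext (the path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Ship the parquet-avro jar to the executors explicitly; alternatively,
// build a single assembly (fat) jar for submission. Path is illustrative.
val conf = new SparkConf()
  .setAppName("parquet-save")
  .setJars(Seq("/path/to/parquet-avro.jar"))
val sc = new SparkContext(conf)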

@fnothaft


Member

fnothaft commented Sep 20, 2014

@PengWeiPRC are you still seeing this problem?

@fnothaft


Member

fnothaft commented Jul 6, 2016

Closing due to inactivity.

@fnothaft fnothaft closed this Jul 6, 2016
