ADAM works on Cloudera but does NOT work on MAPR #1475

Closed
mhaghdad opened this Issue Apr 5, 2017 · 6 comments


mhaghdad commented Apr 5, 2017

Hi,
I have been struggling with this problem, need your help, and sincerely appreciate any feedback. I usually test new things on Cloudera QuickStart before putting them on the dev server, which has an HDFS cluster with a number of nodes. On Cloudera everything works like a charm, but on MapR I get a cast error every time the code calls sc.loadAlignments.

The Cloudera QuickStart is the latest version, with the latest Spark, Scala, and ADAM versions; it has only one node and uses JDK 1.8. These are the steps I took on the QuickStart:

1- Downloaded and built ADAM – successful
2- Loaded resources like small.sam to HDFS and then used adam-submit transform, which created the small.adam files – successful
3- Used adam-submit flagstat or other commands on the .adam files – successful

The MapR cluster is a much bigger system with many nodes; many Spark projects have been developed and run successfully on it. It has Spark 1.6.1 and Scala 2.10.5. I have tried both older and newer versions of ADAM, and with all of them I get the same cast error as soon as the code calls sc.loadAlignments. These are the steps I took on MapR:

1- Downloaded and built ADAM – successful
2- Loaded resources like small.sam to HDFS and then used adam-submit transform, which created the small.adam files – successful
3- Used adam-submit flagstat or other commands on the .adam files – NOT successful

Here is the error:
mhaghdad@dbslp0567:/home/mhaghdad/adam-adam-parent_2.10-0.20.0
$ adam-submit flagstat maprfs:/datalake/optum/optuminsight/udw/prd/pae/developer/mhaghdad/tmp/small.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/opt/mapr/spark/spark-1.6.1/bin/spark-submit
17/04/05 12:37:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/05 12:37:52 WARN HiveConf: HiveConf of name hive.exec.warehousedir does not exist
17/04/05 12:37:52 WARN HiveConf: HiveConf of name hive.server2.authentication.pam.profiles does not exist
Command body threw exception:
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.apache.avro.specific.SpecificRecordBase
Exception in thread "main" java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to org.apache.avro.specific.SpecificRecordBase

at org.bdgenomics.adam.rdd.ADAMContext.org$bdgenomics$adam$rdd$ADAMContext$$loadAvro(ADAMContext.scala:603)
at org.bdgenomics.adam.rdd.ADAMContext.org$bdgenomics$adam$rdd$ADAMContext$$loadAvroSequencesFile(ADAMContext.scala:220)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAvroSequences$1.apply(ADAMContext.scala:208)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAvroSequences$1.apply(ADAMContext.scala:208)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:208)
at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:640)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAlignments$1.apply(ADAMContext.scala:1447)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAlignments$1.apply(ADAMContext.scala:1425)
at scala.Option.fold(Option.scala:157)
at org.apache.spark.rdd.Timer.time(Timer.scala:48)
at org.bdgenomics.adam.rdd.ADAMContext.loadAlignments(ADAMContext.scala:1423)
at org.bdgenomics.adam.cli.FlagStat.run(FlagStat.scala:72)
at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
at org.bdgenomics.adam.cli.FlagStat.run(FlagStat.scala:48)
at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:132)
at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:72)
at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
mhaghdad@dbslp0567:/home/mhaghdad/adam-adam-parent_2.10-0.20.0
$


Member

fnothaft commented Apr 5, 2017

Hi @mhaghdad! Thanks for pinging us with the issue. Typically, this class cast error means that the file you are trying to load was saved with an old version of the ADAM schemas. Do you know when you converted the old file?


mhaghdad commented Apr 5, 2017

Thank you @fnothaft for such a quick response :) I am using the small.sam that comes with the code. I use transform, which creates small.adam, and then immediately run adam-submit flagstat on the newly created small.adam file. I am not sure I understand the question: I am using everything within the same package and taking the same steps on both systems, and I am not using any older ADAM schema. Whether there is a correlation between the ADAM schema and MapR is not obvious to me. I take the same steps on Cloudera and MapR; one works, the other doesn't. Would you please elaborate? Thanks


Member

heuermh commented Apr 6, 2017

Another cause we've seen for this same issue is a conflict in avro-ipc transitive dependency versions at runtime. Does MAPR add anything to the Spark classpath at runtime?
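One way to investigate the conflict heuermh describes is to list every Avro jar a Spark/Hadoop installation ships. A minimal sketch (the helper name is made up here, and the MapR paths in the comments are assumptions based on this thread):

```shell
# find_avro_jars: list every avro*.jar under a directory tree, so that
# conflicting versions (e.g. avro-1.7.x from the distro next to the
# avro-1.8.0 that ADAM builds against) stand out at a glance.
find_avro_jars() {
  find "$1" -name 'avro*.jar' 2>/dev/null | sort
}

# On a MapR node the interesting roots would be something like:
#   find_avro_jars /opt/mapr/spark/spark-1.6.1
#   find_avro_jars /opt/mapr/hadoop
```

If this turns up more than one Avro version across the roots Spark puts on its classpath, that mismatch is a plausible source of the GenericData$Record cast error.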


mhaghdad commented Apr 7, 2017

Thank you @heuermh. I have checked spark-defaults.conf and other places and do not see any Avro-related jar being added to the classpath at runtime. Is there any other place I specifically need to check?

I also noticed something else that might be causing the issue: on Cloudera QuickStart the Apache Avro library is installed and integrated with Spark, but on MapR it is not. Do I need to add the Avro jar to spark-defaults.conf? For example, adding this:

spark.driver.extraClassPath /opt/mapr/spark/spark-1.6.1/jars/avro-1.7.7.jar
spark.executor.extraClassPath /opt/mapr/spark/spark-1.6.1/jars/avro-1.7.7.jar

Do I need to add any other libraries or jars to Spark 1.6.1 on MapR to make it work? Thanks again.


mhaghdad commented Apr 10, 2017

I have resolved the issue and am going to write up the solution in detail, because no one should ever be subjected to that kind of pain again. Before doing that, I want to thank @fnothaft, @heuermh, and everybody for trying to help.

If you look at the pom.xml in the ADAM code and search for avro, you will find something like this (depending on the ADAM version you are using):

<avro.version>1.8.0</avro.version>

That means my version of ADAM uses Avro 1.8.0, so you have to tell Spark to use that version. It turns out that as of Spark 1.5.2 (the Spark version on my MapR cluster is 1.6.1) you have to take the following steps to integrate Spark SQL with Avro: download the right version of the Avro jar (in my case avro-1.8.0.jar) and add it to both spark.executor.extraClassPath and spark.driver.extraClassPath.
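A quick way to confirm which Avro version a given ADAM checkout was built against is to pull the property straight out of pom.xml. A small sketch (the function name and sed invocation are illustrative, not part of ADAM):

```shell
# avro_version: print the <avro.version> Maven property from a pom.xml file.
avro_version() {
  sed -n 's:.*<avro\.version>\(.*\)</avro\.version>.*:\1:p' "$1"
}

# e.g. avro_version pom.xml
```

Whatever version this prints is the avro-X.Y.Z.jar you need to put on the Spark classpaths described below.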

To do this on MapR, go to /opt/mapr/spark/spark-version/conf and open
/opt/mapr/spark/spark-version/conf/spark-defaults.conf

Inside that file, add the following lines:
spark.executor.extraClassPath /your path to the jar file/avro-1.8.0.jar:<rest_of_path>
spark.driver.extraClassPath /your path to the jar file/avro-1.8.0.jar:<rest_of_path>

If you don't have admin privileges on your MapR cluster and cannot modify /opt/mapr/spark/spark-version/conf/spark-defaults.conf, it is not a big deal. Copy the /opt/mapr/spark/spark-version/conf/ directory to /your desirable path/conf, modify spark-defaults.conf there, and then export:
export SPARK_CONF_DIR=/your desirable path/conf
Now it is going to work and you are no longer going to get the annoying cast error on MAPR ever again. I hope this help :)

Mehdi
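The no-admin-rights workaround Mehdi describes can be condensed into a small shell function. This is a sketch with placeholder paths; the function name and arguments are hypothetical, and the MapR paths in the comments come from this thread:

```shell
# make_user_spark_conf: copy an existing Spark conf dir to a writable
# location, append the driver/executor classpath entries for the Avro jar
# matching ADAM's pom.xml, and export SPARK_CONF_DIR so spark-submit
# (and therefore adam-submit) picks up the copy.
make_user_spark_conf() {
  src_conf=$1   # e.g. /opt/mapr/spark/spark-1.6.1/conf
  my_conf=$2    # e.g. $HOME/spark-conf
  avro_jar=$3   # e.g. /path/to/avro-1.8.0.jar

  cp -r "$src_conf" "$my_conf"
  printf 'spark.executor.extraClassPath %s\n' "$avro_jar" >> "$my_conf/spark-defaults.conf"
  printf 'spark.driver.extraClassPath %s\n' "$avro_jar" >> "$my_conf/spark-defaults.conf"
  export SPARK_CONF_DIR=$my_conf
}
```

Usage would look like `make_user_spark_conf /opt/mapr/spark/spark-1.6.1/conf "$HOME/spark-conf" /path/to/avro-1.8.0.jar`, run before invoking adam-submit.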


Member

heuermh commented Apr 11, 2017

Great to hear!

For your information, things might get better/worse/more complicated in the near future. The dependency version for avro in ADAM was bumped to 1.8.1 in commit 9505d47, and an attempt to move to Parquet version 1.8.2 causes some kind of runtime conflict in Apache Spark related to avro (see e.g. https://issues.apache.org/jira/browse/SPARK-19697).

Meanwhile, is it ok to close this issue?

@fnothaft fnothaft closed this Jun 22, 2017

@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
