
[ADAM-2062] Update Spark version to 2.3.2 #2055

Merged: 1 commit merged into bigdatagenomics:master from heuermh:spark-2.3.2 on Nov 4, 2018

Conversation


@heuermh heuermh commented Sep 21, 2018

Fixes #2062.


@coveralls coveralls commented Sep 21, 2018


Coverage remained the same at 79.25% when pulling e5b9b5c on heuermh:spark-2.3.2 into 43d0595 on bigdatagenomics:master.


@AmplabJenkins AmplabJenkins commented Sep 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2818/


@heuermh heuermh commented Sep 24, 2018

Running on Cloudera Spark 2.2.0 fails with missing/conflicting dependencies:

$ ./bin/adam-shell --master yarn
Using SPARK_SHELL=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2//bin/spark-shell
Spark context available as 'sc' (master = yarn, app id = application_1537216002452_0014).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignments = sc.loadParquetAlignments("foo.alignments.adam")
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = ParquetUnboundAlignmentRecordRDD with 25 reference sequences, 1 read groups, and 1 processing steps

scala> alignments.dataset.count()
res0: Long = 199543

scala> alignments.saveAsParquet("foo.alignments.copy.adam")
error: missing or invalid dependency detected while loading class file 'AvroGenomicDataset.class'.
Could not access type CompressionCodecName in package org.apache.parquet.hadoop.metadata,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'AvroGenomicDataset.class' was compiled against an incompatible version of org.apache.parquet.hadoop.metadata.
error: missing or invalid dependency detected while loading class file 'GenomicDataset.class'.
Could not access type CompressionCodecName in package org.apache.parquet.hadoop.metadata,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'GenomicDataset.class' was compiled against an incompatible version of org.apache.parquet.hadoop.metadata.

@heuermh heuermh commented Sep 24, 2018

Works OK with --packages org.apache.parquet:parquet-avro:1.8.3:

$ ./bin/adam-shell --master yarn --packages org.apache.parquet:parquet-avro:1.8.3
Using SPARK_SHELL=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2//bin/spark-shell
...
org.apache.parquet#parquet-avro added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found org.apache.parquet#parquet-avro;1.8.3 in central
	found org.apache.parquet#parquet-column;1.8.3 in local-m2-cache
	found org.apache.parquet#parquet-common;1.8.3 in local-m2-cache
	found org.slf4j#slf4j-api;1.7.5 in central
	found org.apache.parquet#parquet-encoding;1.8.3 in local-m2-cache
	found commons-codec#commons-codec;1.5 in local-m2-cache
	found org.apache.parquet#parquet-hadoop;1.8.3 in local-m2-cache
	found org.apache.parquet#parquet-format;2.3.1 in local-m2-cache
	found org.apache.parquet#parquet-jackson;1.8.3 in local-m2-cache
	found org.codehaus.jackson#jackson-mapper-asl;1.9.11 in local-m2-cache
	found org.codehaus.jackson#jackson-core-asl;1.9.11 in local-m2-cache
	found org.xerial.snappy#snappy-java;1.1.1.6 in local-m2-cache
	found org.apache.avro#avro;1.8.0 in local-m2-cache
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in local-m2-cache
	found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in local-m2-cache
	found com.thoughtworks.paranamer#paranamer;2.7 in local-m2-cache
	found org.apache.commons#commons-compress;1.8.1 in local-m2-cache
	found org.tukaani#xz;1.5 in local-m2-cache
	found org.slf4j#slf4j-api;1.7.7 in central
	found it.unimi.dsi#fastutil;6.5.7 in central
downloading https://repo1.maven.org/maven2/org/apache/parquet/parquet-avro/1.8.3/parquet-avro-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-avro;1.8.3!parquet-avro.jar (27ms)
downloading .m2/repository/org/apache/parquet/parquet-column/1.8.3/parquet-column-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-column;1.8.3!parquet-column.jar (2ms)
downloading .m2/repository/org/apache/parquet/parquet-hadoop/1.8.3/parquet-hadoop-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-hadoop;1.8.3!parquet-hadoop.jar (2ms)
downloading .m2/repository/org/apache/parquet/parquet-common/1.8.3/parquet-common-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-common;1.8.3!parquet-common.jar (1ms)
downloading .m2/repository/org/apache/parquet/parquet-encoding/1.8.3/parquet-encoding-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-encoding;1.8.3!parquet-encoding.jar (1ms)
downloading .m2/repository/org/apache/parquet/parquet-jackson/1.8.3/parquet-jackson-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-jackson;1.8.3!parquet-jackson.jar (2ms)
:: resolution report :: resolve 1905ms :: artifacts dl 49ms
	:: modules in use:
	com.thoughtworks.paranamer#paranamer;2.7 from local-m2-cache in [default]
	commons-codec#commons-codec;1.5 from local-m2-cache in [default]
	it.unimi.dsi#fastutil;6.5.7 from central in [default]
	org.apache.avro#avro;1.8.0 from local-m2-cache in [default]
	org.apache.commons#commons-compress;1.8.1 from local-m2-cache in [default]
	org.apache.parquet#parquet-avro;1.8.3 from central in [default]
	org.apache.parquet#parquet-column;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-common;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-encoding;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-format;2.3.1 from local-m2-cache in [default]
	org.apache.parquet#parquet-hadoop;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-jackson;1.8.3 from local-m2-cache in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from local-m2-cache in [default]
	org.codehaus.jackson#jackson-mapper-asl;1.9.13 from local-m2-cache in [default]
	org.slf4j#slf4j-api;1.7.7 from central in [default]
	org.tukaani#xz;1.5 from local-m2-cache in [default]
	org.xerial.snappy#snappy-java;1.1.1.6 from local-m2-cache in [default]
	:: evicted modules:
	org.slf4j#slf4j-api;1.7.5 by [org.slf4j#slf4j-api;1.7.7] in [default]
	org.codehaus.jackson#jackson-mapper-asl;1.9.11 by [org.codehaus.jackson#jackson-mapper-asl;1.9.13] in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.11 by [org.codehaus.jackson#jackson-core-asl;1.9.13] in [default]
	org.xerial.snappy#snappy-java;1.1.1.3 by [org.xerial.snappy#snappy-java;1.1.1.6] in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   21  |   6   |   6   |   4   ||   17  |   6   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	6 artifacts copied, 11 already retrieved (2652kB/19ms)

Spark context available as 'sc' (master = yarn, app id = application_1537216002452_0015).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignments = sc.loadParquetAlignments("foo.alignments.adam")
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = ParquetUnboundAlignmentRecordRDD with 25 reference sequences, 1 read groups, and 1 processing steps

scala> alignments.dataset.count()
res0: Long = 199543

scala> alignments.saveAsParquet("foo.alignments.copy.adam")

scala> sc.loadAlignments("foo.alignments.copy.adam").rdd.count()
res2: Long = 199543

@akmorrow13 akmorrow13 commented Sep 24, 2018

This seems related to #1742. Is the current solution still to include --packages org.apache.parquet:parquet-avro:1.8.3?

@heuermh heuermh changed the title from "Update Spark version to 2.3.2-rc6." to "Update Spark version to 2.3.2" Sep 27, 2018
@AmplabJenkins AmplabJenkins commented Sep 27, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2821/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > git rev-parse origin/pr/2055/merge^{commit} # timeout=10
 > git branch -a -v --no-abbrev --contains d631c93 # timeout=10
Checking out Revision d631c93 (origin/pr/2055/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f d631c931e713b23313ee72e25c5db483732e7873
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.2,2.11,2.2.2,ubuntu
Triggering ADAM-prb ? 2.7.3,2.11,2.2.2,ubuntu
ADAM-prb ? 2.6.2,2.11,2.2.2,ubuntu completed with result FAILURE
ADAM-prb ? 2.7.3,2.11,2.2.2,ubuntu completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins AmplabJenkins commented Oct 8, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2828/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > git rev-parse origin/pr/2055/merge^{commit} # timeout=10
 > git branch -a -v --no-abbrev --contains e50c232 # timeout=10
Checking out Revision e50c232 (origin/pr/2055/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f e50c2327d61d0d6fdb18f590944077ebc1105a97
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.2,2.11,2.2.2,ubuntu
Triggering ADAM-prb ? 2.7.3,2.11,2.2.2,ubuntu
ADAM-prb ? 2.6.2,2.11,2.2.2,ubuntu completed with result FAILURE
ADAM-prb ? 2.7.3,2.11,2.2.2,ubuntu completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh heuermh changed the title from "Update Spark version to 2.3.2" to "[ADAM-2062] Update Spark version to 2.3.2" Oct 17, 2018
@heuermh heuermh force-pushed the heuermh:spark-2.3.2 branch from 2551654 to e5b9b5c Oct 26, 2018
@heuermh heuermh added this to the 0.24.1 milestone Oct 26, 2018
@AmplabJenkins AmplabJenkins commented Oct 26, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2838/

@heuermh heuermh requested review from akmorrow13 and fnothaft Oct 26, 2018
-        <spark.version>2.3.0</spark.version>
-        <parquet.version>1.8.2</parquet.version>
+        <spark.version>2.3.2</spark.version>
+        <parquet.version>1.8.3</parquet.version>

@akmorrow13 akmorrow13 commented Oct 26, 2018

So we still can't remove the pom exclusions for parquet with this update?

@heuermh heuermh commented Oct 26, 2018

That is correct. Without the pom exclusions we run into the same org.apache.avro.SchemaParseException: Can't redefine: list in unit tests.
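For context, the kind of pom exclusion being discussed looks roughly like this. This is a hypothetical sketch; the dependency shown is illustrative and not copied from ADAM's actual pom.xml:

```xml
<!-- Hypothetical sketch: excluding a transitive parquet artifact so the
     version pinned by <parquet.version> wins. Artifact names are
     illustrative, not taken from ADAM's pom.xml. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>${spark.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-avro</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```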

@akmorrow13 akmorrow13 commented Oct 26, 2018

That's too bad. For some reason, --packages does not work for parquet-hadoop in Python.

@heuermh heuermh commented Oct 27, 2018

Unfortunately, I don't understand why it is necessary when running on CDH Spark but not on standalone or EMR, or why it works.

@akmorrow13 akmorrow13 commented Oct 29, 2018

I have gone through all the combinations I could think of for the Python problem, and the only thing that works is setting:
--conf spark.driver.extraClassPath=/<path_to_jar>/org.apache.parquet_parquet-hadoop-1.8.3.jar

for pyspark. This means the user first has to download the jar somewhere; I cannot find a cleaner solution.
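End to end, the workaround above might look like the following. This is a sketch under assumptions: the download location and working-directory jar path are illustrative, not taken from the thread.

```shell
# Sketch of the pyspark workaround (assumed paths; adjust to your environment).
# 1. Fetch the parquet-hadoop jar from Maven Central.
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop/1.8.3/parquet-hadoop-1.8.3.jar

# 2. Launch pyspark with the jar prepended to the driver classpath.
pyspark --conf spark.driver.extraClassPath="$PWD/parquet-hadoop-1.8.3.jar"
```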

@akmorrow13 akmorrow13 merged commit ee9d73f into bigdatagenomics:master Nov 4, 2018
1 check passed: AmplabJenkins, default Merged build finished.

@akmorrow13 akmorrow13 commented Nov 4, 2018

Thanks @heuermh !

@heuermh heuermh deleted the heuermh:spark-2.3.2 branch Nov 7, 2018