
[ADAM-2062] Update Spark version to 2.3.2 #2055

Merged
merged 1 commit from heuermh:spark-2.3.2 into bigdatagenomics:master on Nov 4, 2018

Conversation

4 participants
@heuermh
Member

heuermh commented Sep 21, 2018

Fixes #2062.

@coveralls


coveralls commented Sep 21, 2018

Coverage Status

Coverage remained the same at 79.25% when pulling e5b9b5c on heuermh:spark-2.3.2 into 43d0595 on bigdatagenomics:master.

@AmplabJenkins


AmplabJenkins commented Sep 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2818/

@heuermh

Member

heuermh commented Sep 24, 2018

Running on Cloudera Spark 2.2.0 fails with missing/conflicting dependencies:

$ ./bin/adam-shell --master yarn
Using SPARK_SHELL=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2//bin/spark-shell
Spark context available as 'sc' (master = yarn, app id = application_1537216002452_0014).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignments = sc.loadParquetAlignments("foo.alignments.adam")
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = ParquetUnboundAlignmentRecordRDD with 25 reference sequences, 1 read groups, and 1 processing steps

scala> alignments.dataset.count()
res0: Long = 199543

scala> alignments.saveAsParquet("foo.alignments.copy.adam")
error: missing or invalid dependency detected while loading class file 'AvroGenomicDataset.class'.
Could not access type CompressionCodecName in package org.apache.parquet.hadoop.metadata,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'AvroGenomicDataset.class' was compiled against an incompatible version of org.apache.parquet.hadoop.metadata.
error: missing or invalid dependency detected while loading class file 'GenomicDataset.class'.
Could not access type CompressionCodecName in package org.apache.parquet.hadoop.metadata,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'GenomicDataset.class' was compiled against an incompatible version of org.apache.parquet.hadoop.metadata.
@heuermh

Member

heuermh commented Sep 24, 2018

Works OK with --packages org.apache.parquet:parquet-avro:1.8.3:

$ ./bin/adam-shell --master yarn --packages org.apache.parquet:parquet-avro:1.8.3
Using SPARK_SHELL=/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2//bin/spark-shell
...
org.apache.parquet#parquet-avro added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found org.apache.parquet#parquet-avro;1.8.3 in central
	found org.apache.parquet#parquet-column;1.8.3 in local-m2-cache
	found org.apache.parquet#parquet-common;1.8.3 in local-m2-cache
	found org.slf4j#slf4j-api;1.7.5 in central
	found org.apache.parquet#parquet-encoding;1.8.3 in local-m2-cache
	found commons-codec#commons-codec;1.5 in local-m2-cache
	found org.apache.parquet#parquet-hadoop;1.8.3 in local-m2-cache
	found org.apache.parquet#parquet-format;2.3.1 in local-m2-cache
	found org.apache.parquet#parquet-jackson;1.8.3 in local-m2-cache
	found org.codehaus.jackson#jackson-mapper-asl;1.9.11 in local-m2-cache
	found org.codehaus.jackson#jackson-core-asl;1.9.11 in local-m2-cache
	found org.xerial.snappy#snappy-java;1.1.1.6 in local-m2-cache
	found org.apache.avro#avro;1.8.0 in local-m2-cache
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in local-m2-cache
	found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in local-m2-cache
	found com.thoughtworks.paranamer#paranamer;2.7 in local-m2-cache
	found org.apache.commons#commons-compress;1.8.1 in local-m2-cache
	found org.tukaani#xz;1.5 in local-m2-cache
	found org.slf4j#slf4j-api;1.7.7 in central
	found it.unimi.dsi#fastutil;6.5.7 in central
downloading https://repo1.maven.org/maven2/org/apache/parquet/parquet-avro/1.8.3/parquet-avro-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-avro;1.8.3!parquet-avro.jar (27ms)
downloading .m2/repository/org/apache/parquet/parquet-column/1.8.3/parquet-column-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-column;1.8.3!parquet-column.jar (2ms)
downloading .m2/repository/org/apache/parquet/parquet-hadoop/1.8.3/parquet-hadoop-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-hadoop;1.8.3!parquet-hadoop.jar (2ms)
downloading .m2/repository/org/apache/parquet/parquet-common/1.8.3/parquet-common-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-common;1.8.3!parquet-common.jar (1ms)
downloading .m2/repository/org/apache/parquet/parquet-encoding/1.8.3/parquet-encoding-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-encoding;1.8.3!parquet-encoding.jar (1ms)
downloading .m2/repository/org/apache/parquet/parquet-jackson/1.8.3/parquet-jackson-1.8.3.jar ...
	[SUCCESSFUL ] org.apache.parquet#parquet-jackson;1.8.3!parquet-jackson.jar (2ms)
:: resolution report :: resolve 1905ms :: artifacts dl 49ms
	:: modules in use:
	com.thoughtworks.paranamer#paranamer;2.7 from local-m2-cache in [default]
	commons-codec#commons-codec;1.5 from local-m2-cache in [default]
	it.unimi.dsi#fastutil;6.5.7 from central in [default]
	org.apache.avro#avro;1.8.0 from local-m2-cache in [default]
	org.apache.commons#commons-compress;1.8.1 from local-m2-cache in [default]
	org.apache.parquet#parquet-avro;1.8.3 from central in [default]
	org.apache.parquet#parquet-column;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-common;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-encoding;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-format;2.3.1 from local-m2-cache in [default]
	org.apache.parquet#parquet-hadoop;1.8.3 from local-m2-cache in [default]
	org.apache.parquet#parquet-jackson;1.8.3 from local-m2-cache in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from local-m2-cache in [default]
	org.codehaus.jackson#jackson-mapper-asl;1.9.13 from local-m2-cache in [default]
	org.slf4j#slf4j-api;1.7.7 from central in [default]
	org.tukaani#xz;1.5 from local-m2-cache in [default]
	org.xerial.snappy#snappy-java;1.1.1.6 from local-m2-cache in [default]
	:: evicted modules:
	org.slf4j#slf4j-api;1.7.5 by [org.slf4j#slf4j-api;1.7.7] in [default]
	org.codehaus.jackson#jackson-mapper-asl;1.9.11 by [org.codehaus.jackson#jackson-mapper-asl;1.9.13] in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.11 by [org.codehaus.jackson#jackson-core-asl;1.9.13] in [default]
	org.xerial.snappy#snappy-java;1.1.1.3 by [org.xerial.snappy#snappy-java;1.1.1.6] in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   21  |   6   |   6   |   4   ||   17  |   6   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	6 artifacts copied, 11 already retrieved (2652kB/19ms)

Spark context available as 'sc' (master = yarn, app id = application_1537216002452_0015).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignments = sc.loadParquetAlignments("foo.alignments.adam")
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = ParquetUnboundAlignmentRecordRDD with 25 reference sequences, 1 read groups, and 1 processing steps

scala> alignments.dataset.count()
res0: Long = 199543

scala> alignments.saveAsParquet("foo.alignments.copy.adam")

scala> sc.loadAlignments("foo.alignments.copy.adam").rdd.count()
res2: Long = 199543
@akmorrow13

Contributor

akmorrow13 commented Sep 24, 2018

This seems related to #1742. Is the current solution to still include --packages org.apache.parquet:parquet-avro:1.8.3?

@heuermh heuermh changed the title from "Update Spark version to 2.3.2-rc6." to "Update Spark version to 2.3.2" Sep 27, 2018

@AmplabJenkins


AmplabJenkins commented Sep 27, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2821/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Checking out Revision d631c931e713b23313ee72e25c5db483732e7873 (origin/pr/2055/merge)
ADAM-prb ? 2.6.2,2.11,2.2.2,ubuntu completed with result FAILURE
ADAM-prb ? 2.7.3,2.11,2.2.2,ubuntu completed with result FAILURE

@AmplabJenkins


AmplabJenkins commented Oct 8, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2828/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Checking out Revision e50c2327d61d0d6fdb18f590944077ebc1105a97 (origin/pr/2055/merge)
ADAM-prb ? 2.6.2,2.11,2.2.2,ubuntu completed with result FAILURE
ADAM-prb ? 2.7.3,2.11,2.2.2,ubuntu completed with result FAILURE

@heuermh heuermh changed the title from "Update Spark version to 2.3.2" to "[ADAM-2062] Update Spark version to 2.3.2" Oct 17, 2018

@heuermh heuermh force-pushed the heuermh:spark-2.3.2 branch from 2551654 to e5b9b5c Oct 26, 2018

@heuermh heuermh added this to the 0.24.1 milestone Oct 26, 2018

@AmplabJenkins


AmplabJenkins commented Oct 26, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2838/

@heuermh heuermh requested review from akmorrow13 and fnothaft Oct 26, 2018

-    <spark.version>2.3.0</spark.version>
-    <parquet.version>1.8.2</parquet.version>
+    <spark.version>2.3.2</spark.version>
+    <parquet.version>1.8.3</parquet.version>


@akmorrow13

akmorrow13 Oct 26, 2018

Contributor

So we still can't remove the pom exclusions for parquet in this update?


@heuermh

heuermh Oct 26, 2018

Member

That is correct. Without the pom exclusions we run into the same org.apache.avro.SchemaParseException: Can't redefine: list failure in unit tests.
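For readers unfamiliar with the exclusion pattern being discussed, a minimal sketch of what such a pom exclusion looks like is below. This is an illustrative example only, not the actual ADAM pom.xml; the artifact ids and exclusion list here are assumptions, so consult the real pom for the exclusions this thread refers to.

```xml
<!-- Illustrative sketch: exclude Spark's transitive parquet artifacts so the
     parquet version pinned by ${parquet.version} takes effect. The artifact
     list here is hypothetical; see the actual ADAM pom.xml. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>${spark.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-column</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```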


@akmorrow13

akmorrow13 Oct 26, 2018

Contributor

That's too bad. For some reason, --packages does not work for parquet-hadoop in python.


@heuermh

heuermh Oct 27, 2018

Member

Unfortunately, I don't understand why this is necessary when running on CDH Spark but not on standalone or EMR, or why it works.


@akmorrow13

akmorrow13 Oct 29, 2018

Contributor

I have gone through every combination I could think of for the python problem, and the only thing that seems to work is setting:
--conf spark.driver.extraClassPath=/<path_to_jar>/org.apache.parquet_parquet-hadoop-1.8.3.jar

for pyspark. This means the user has to download the jar somewhere first; I cannot find a cleaner solution.
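A sketch of that pyspark workaround end to end (the jar path and the way you fetch it are illustrative; any method of getting parquet-hadoop-1.8.3.jar onto the driver machine works):

```sh
# Illustrative sketch of the workaround above; adjust paths to your environment.
# 1. Fetch parquet-hadoop 1.8.3 into the local Maven repository.
mvn dependency:get -Dartifact=org.apache.parquet:parquet-hadoop:1.8.3

# 2. Point the Spark driver classpath at the downloaded jar when launching pyspark.
JAR=$HOME/.m2/repository/org/apache/parquet/parquet-hadoop/1.8.3/parquet-hadoop-1.8.3.jar
pyspark --conf spark.driver.extraClassPath=$JAR
```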

@akmorrow13 akmorrow13 merged commit ee9d73f into bigdatagenomics:master Nov 4, 2018

1 check passed

default Merged build finished.
@akmorrow13

Contributor

akmorrow13 commented Nov 4, 2018

Thanks @heuermh !

@heuermh heuermh deleted the heuermh:spark-2.3.2 branch Nov 7, 2018
