[SPARK-37102][BUILD] Removed redundant exclusions in hadoop-cloud module #34383

Closed

vmalakhin wants to merge 7 commits into apache:master from vmalakhin:SPARK-37102

Conversation

@vmalakhin
Contributor

@vmalakhin vmalakhin commented Oct 25, 2021

What changes were proposed in this pull request?

Redundant exclusions were removed from the hadoop-cloud module so that the build output contains the required dependency for the hadoop-azure artifact (jackson-mapper-asl).

Why are the changes needed?

Currently the Hadoop ABFS connector (for Azure Data Lake Storage Gen2) is broken due to a missing dependency, so the required dependencies for the hadoop-azure artifact should be included in the distribution output when the hadoop-cloud module is enabled.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unfortunately, Microsoft does not provide support for Data Lake Storage Gen2 in the Azurite emulator, so the change was tested manually, and the build outputs before and after the change were diffed to check whether anything else was picked up. The only change is the inclusion of jackson-mapper-asl-1.9.13.jar.

@dongjoon-hyun
Member

Thank you for making a PR, @vmalakhin . Could you enable GitHub Action on your repository? Apache Spark is utilizing your free open source GitHub Action runner quota.

@dongjoon-hyun
Member

Could you review this, @sunchao ?

@vmalakhin
Contributor Author

> Thank you for making a PR, @vmalakhin . Could you enable GitHub Action on your repository? Apache Spark is utilizing your free open source GitHub Action runner quota.

* https://github.com/apache/spark/pull/34383/checks?check_run_id=4000736256

sure!

@sunchao
Member

sunchao commented Oct 25, 2021

@vmalakhin can you put more details in the PR description?

> Redundant exclusions were removed for hadoop-cloud module

This doesn't fit the description "What changes were proposed in this pull request"

> Currently Hadoop ABFS connector (for Azure Data Lake Storage Gen2) is broken due to missing dependency.

Hm can you share more details? what missing dependency and how is that related to Spark?

> So the only change is inclusion of jackson-mapper-asl-1.9.13.jar.

the PR restores transitive dependency for jackson-mapper-asl, jackson-core-asl, and jackson-core. Do we need the other 2?

also cc @steveloughran

@vmalakhin
Contributor Author

> @vmalakhin can you put more details in the PR description?
>
> > Redundant exclusions were removed for hadoop-cloud module
>
> This doesn't fit the description "What changes were proposed in this pull request"
>
> > Currently Hadoop ABFS connector (for Azure Data Lake Storage Gen2) is broken due to missing dependency.
>
> Hm can you share more details? what missing dependency and how is that related to Spark?
>
> > So the only change is inclusion of jackson-mapper-asl-1.9.13.jar.
>
> the PR restores transitive dependency for jackson-mapper-asl, jackson-core-asl, and jackson-core. Do we need the other 2?
>
> also cc @steveloughran

OK - there are some details posted under SPARK-37102, but if I try to access ADLS Gen2, the following exception occurs:

>>> df=sqlContext.read.parquet("new_test")                            
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/pyspark/sql/readwriter.py", line 361, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/pyspark/sql/utils.py", line 178, in deco
    return f(*a, **kw)
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.parquet.
: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
        at org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.parseListFilesResponse(AbfsHttpOperation.java:508)
        at org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:374)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:274)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:205)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:181)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:454)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:179)
        at org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:301)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:957)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:927)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:909)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:406)
        at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
        at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
        at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
        at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:567)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:227)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:209)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:209)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:553)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 46 more

So the jar providing org.codehaus.jackson.map.ObjectMapper is not present on the class path (i.e., under the jars dir).
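The traceback points at a single missing class, so one way to confirm the diagnosis is to scan the distribution's jars for it. A minimal sketch (the jar layout below is fabricated for the demo; in practice you would point it at the real jars/ directory of the distribution):

```python
import os
import zipfile

def find_class(jars_dir, class_path):
    """Return the names of jars under jars_dir that bundle class_path."""
    hits = []
    for name in sorted(os.listdir(jars_dir)):
        if not name.endswith(".jar"):
            continue
        with zipfile.ZipFile(os.path.join(jars_dir, name)) as zf:
            if class_path in zf.namelist():
                hits.append(name)
    return hits

# Build a stand-in jar containing the class (fabricated for the demo):
os.makedirs("jars", exist_ok=True)
with zipfile.ZipFile("jars/jackson-mapper-asl-1.9.13.jar", "w") as zf:
    zf.writestr("org/codehaus/jackson/map/ObjectMapper.class", b"")

print(find_class("jars", "org/codehaus/jackson/map/ObjectMapper.class"))
# → ['jackson-mapper-asl-1.9.13.jar']
```

If the list comes back empty against a real distribution, the class genuinely isn't on the class path, matching the NoClassDefFoundError above.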
I've compared the jars outputs for the ./dev/make-distribution.sh --name custom-spark-default --tgz --pip -Pkubernetes -Phadoop-cloud build configuration, and the only difference is jackson-mapper-asl-1.9.13.jar. So I can limit the change to this one jar.
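The before/after comparison of build outputs can be sketched like this (the directory names are hypothetical stand-ins for the two make-distribution.sh results):

```python
import os

def jar_diff(before_dir, after_dir):
    """Return (added, removed) jar names between two jars/ directories."""
    before = set(os.listdir(before_dir))
    after = set(os.listdir(after_dir))
    return sorted(after - before), sorted(before - after)

# Fabricated directories standing in for the two builds:
os.makedirs("before/jars", exist_ok=True)
os.makedirs("after/jars", exist_ok=True)
open("before/jars/hadoop-azure-3.3.1.jar", "w").close()
open("after/jars/hadoop-azure-3.3.1.jar", "w").close()
open("after/jars/jackson-mapper-asl-1.9.13.jar", "w").close()

added, removed = jar_diff("before/jars", "after/jars")
print(added, removed)  # → ['jackson-mapper-asl-1.9.13.jar'] []
```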

@dongjoon-hyun dongjoon-hyun changed the title from "SPARK-37102: Hadoop Cloud: removed redundant exclusions" to "[SPARK-37102] Hadoop Cloud: removed redundant exclusions" Oct 25, 2021
@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-37102] Hadoop Cloud: removed redundant exclusions" to "[SPARK-37102][BUILD] Removed redundant exclusions in hadoop-cloud module" Oct 25, 2021

@sunchao sunchao left a comment


Thanks, could you add more info to the PR description (esp. why the changes are required), so it can go into the git commit eventually.

Member

why include Guava?

Contributor Author

Because I don't see it in the dependency:tree output:

[INFO] +- org.apache.hadoop:hadoop-azure:jar:3.3.1:compile
[INFO] |  +- com.microsoft.azure:azure-storage:jar:7.0.1:compile
[INFO] |  |  \- com.microsoft.azure:azure-keyvault-core:jar:1.0.0:compile
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:compile
[INFO] |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO] |  \- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.1:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:3.4.1:compile
[INFO] |  |     +- org.jdom:jdom:jar:1.1:compile
[INFO] |  |     +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |     |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-core:jar:3.4.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-ram:jar:3.0.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-sts:jar:3.0.0:compile
[INFO] |  |     \- com.aliyun:aliyun-java-sdk-ecs:jar:4.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.1:compile
[INFO] |  |  \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile
[INFO] |  \- org.apache.hadoop:hadoop-cos:jar:3.3.1:compile
[INFO] |     \- com.qcloud:cos_api-bundle:jar:5.6.19:compile

But happy to get it added back. Shall we do it?

Member

Hadoop is using hadoop-shaded-guava since 3.3.1 so I think the com.google.guava inclusion is unnecessary.

Contributor Author

Do you mean exclusion?

Member

I mean the change on the following:

  <exclusion>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
  </exclusion>

needs to be reverted from the PR.

Contributor Author

OK, will do. Thanks.

Contributor Author

That's done. @sunchao - would you mind taking a look, please?

Member

nit: unnecessary change.

Contributor Author

ok, reverted.

@srowen
Member

srowen commented Oct 26, 2021

Hm, I'm worried that the exclusions are there to ensure that a newer version of Jackson 'wins' in the build. This may change that. What does the result of dev/test-dependencies.sh --replace-manifest show?

@vmalakhin
Contributor Author

> Hm, I'm worried that the exclusions are there to ensure that a newer version of Jackson 'wins' in the build. This may change that. What does the result of dev/test-dependencies.sh --replace-manifest show?

But that's the problem: the exclusions dropped the required jar completely. Another option could be to add an explicit dependency on the asl artifact with the required version; then versions can be managed.
Re test-dependencies.sh: I will run it and post the results here.
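For context on the manifest check: dev/test-dependencies.sh regenerates the dependency manifests under dev/deps/, so the outcome of the exclusion change shows up as a line-level diff there. A hedged sketch of such a check (the manifest lines and file layout below are illustrative, not copied from the repository):

```python
def manifest_contains(manifest_lines, artifact):
    """Check whether a dependency manifest lists a given jar name."""
    return any(artifact in line for line in manifest_lines)

# Illustrative manifest entries, not taken from an actual dev/deps file:
manifest = [
    "hadoop-azure/3.3.1//hadoop-azure-3.3.1.jar",
    "jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar",
]

print(manifest_contains(manifest, "jackson-mapper-asl-1.9.13.jar"))  # → True
```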

@srowen
Member

srowen commented Oct 26, 2021

codehaus jackson is 1.x, so it is more "OK" to add back. That said, it is probably excluded because it is otherwise unused and triggers security warnings on static analysis.

fasterxml jackson is probably specifically excluded because it is included at a newer version in the Spark build. That isn't related to the error you show. Neither is Guava, which is in a similar situation. Those shouldn't be changed.

Can you just add the dependencies that ABFS requires to your app? I don't think this profile is meant to support third party libraries, though the ABFS connector could arguably be a special case.

hadoop-cloud isn't published as part of the binary release, so it is more "OK" to change in this way. All in all, I could see adding back codehaus jackson.

@vmalakhin
Contributor Author

> codehaus jackson is 1.x, so it is more "OK" to add back. That said, it is probably excluded because it is otherwise unused and triggers security warnings on static analysis.
>
> fasterxml jackson is probably specifically excluded because it is included at a newer version in the Spark build. That isn't related to the error you show. Neither is Guava, which is in a similar situation. Those shouldn't be changed.
>
> Can you just add the dependencies that ABFS requires to your app? I don't think this profile is meant to support third party libraries, though the ABFS connector could arguably be a special case.
>
> hadoop-cloud isn't published as part of the binary release, so it is more "OK" to change in this way. All in all, I could see adding back codehaus jackson.

Yep, the only difference is the jackson-mapper-asl jar in the output. If it's OK to go ahead, can we merge this in? Or please let me know the next steps. Thanks!

@srowen
Member

srowen commented Oct 26, 2021

No, the fasterxml changes need to be reverted

@vmalakhin
Contributor Author

> No, the fasterxml changes need to be reverted

Sure, will do.

@vmalakhin
Contributor Author

> No, the fasterxml changes need to be reverted

That's done. @srowen - would you mind taking another look?

@srowen
Member

srowen commented Oct 27, 2021

Unless @steveloughran has thoughts on this one, I think it's OK

@srowen
Member

srowen commented Oct 30, 2021

Jenkins test this please

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49252/

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49252/

@SparkQA

SparkQA commented Oct 30, 2021

Test build #144783 has finished for PR 34383 at commit f2209ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 6906328 Oct 30, 2021
@srowen
Member

srowen commented Oct 30, 2021

Merged to master

@dongjoon-hyun
Member

Thank you, @vmalakhin , @srowen , @sunchao .
