[SPARK-37102][BUILD] Removed redundant exclusions in hadoop-cloud module #34383

Closed

vmalakhin wants to merge 7 commits into apache:master from vmalakhin:SPARK-37102

Conversation

@vmalakhin
Contributor

@vmalakhin vmalakhin commented Oct 25, 2021

What changes were proposed in this pull request?

Redundant exclusions were removed from the hadoop-cloud module so that the build output contains the required dependency for the hadoop-azure artifact (jackson-mapper-asl).

Why are the changes needed?

Currently the Hadoop ABFS connector (for Azure Data Lake Storage Gen2) is broken due to a missing dependency, so the required dependencies for the hadoop-azure artifact should be included in the distribution output when the hadoop-cloud module is enabled.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unfortunately, Microsoft does not provide support for Data Lake Storage Gen2 in the Azurite emulator, so the change was tested manually, and the build outputs before and after the change were diffed to check whether anything else was picked up. The only change is the inclusion of jackson-mapper-asl-1.9.13.jar.

@dongjoon-hyun
Member

Thank you for making a PR, @vmalakhin . Could you enable GitHub Action on your repository? Apache Spark is utilizing your free open source GitHub Action runner quota.

@dongjoon-hyun
Member

Could you review this, @sunchao ?

@vmalakhin
Contributor Author

> Thank you for making a PR, @vmalakhin . Could you enable GitHub Action on your repository? Apache Spark is utilizing your free open source GitHub Action runner quota.

* https://github.com/apache/spark/pull/34383/checks?check_run_id=4000736256

sure!

@sunchao
Member

sunchao commented Oct 25, 2021

@vmalakhin can you put more details in the PR description?

> Redundant exclusions were removed for hadoop-cloud module

This doesn't fit the description "What changes were proposed in this pull request"

> Currently Hadoop ABFS connector (for Azure Data Lake Storage Gen2) is broken due to missing dependency.

Hm can you share more details? what missing dependency and how is that related to Spark?

> So the only change is inclusion of jackson-mapper-asl-1.9.13.jar.

the PR restores transitive dependency for jackson-mapper-asl, jackson-core-asl, and jackson-core. Do we need the other 2?

also cc @steveloughran

@vmalakhin
Contributor Author

> @vmalakhin can you put more details in the PR description?
>
> > Redundant exclusions were removed for hadoop-cloud module
>
> This doesn't fit the description "What changes were proposed in this pull request"
>
> > Currently Hadoop ABFS connector (for Azure Data Lake Storage Gen2) is broken due to missing dependency.
>
> Hm can you share more details? what missing dependency and how is that related to Spark?
>
> > So the only change is inclusion of jackson-mapper-asl-1.9.13.jar.
>
> the PR restores transitive dependency for jackson-mapper-asl, jackson-core-asl, and jackson-core. Do we need the other 2?
>
> also cc @steveloughran

OK - there are some details posted under SPARK-37102, but if I try to access ADLS Gen2, the following exception occurs:

>>> df=sqlContext.read.parquet("new_test")                            
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/pyspark/sql/readwriter.py", line 361, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/pyspark/sql/utils.py", line 178, in deco
    return f(*a, **kw)
  File "spark/spark-3.3.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.parquet.
: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
        at org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.parseListFilesResponse(AbfsHttpOperation.java:508)
        at org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:374)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:274)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:205)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:181)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:454)
        at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:179)
        at org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:301)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:957)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:927)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:909)
        at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:406)
        at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
        at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
        at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
        at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
        at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:567)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:227)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:209)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:209)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:553)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        ... 46 more

So the jar providing org.codehaus.jackson.map.ObjectMapper is not present on the class path (i.e., under the jars dir).
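The traceback points at a single missing class, so one way to confirm the diagnosis is to scan the distribution's jars for it. A minimal sketch (the jar layout below is fabricated for the demo; in practice you would point it at the real jars/ directory of the distribution):

```python
import os
import zipfile

def find_class(jars_dir, class_path):
    """Return the names of jars under jars_dir that bundle class_path."""
    hits = []
    for name in sorted(os.listdir(jars_dir)):
        if not name.endswith(".jar"):
            continue
        with zipfile.ZipFile(os.path.join(jars_dir, name)) as zf:
            if class_path in zf.namelist():
                hits.append(name)
    return hits

# Build a stand-in jar containing the class (fabricated for the demo):
os.makedirs("jars", exist_ok=True)
with zipfile.ZipFile("jars/jackson-mapper-asl-1.9.13.jar", "w") as zf:
    zf.writestr("org/codehaus/jackson/map/ObjectMapper.class", b"")

print(find_class("jars", "org/codehaus/jackson/map/ObjectMapper.class"))
# → ['jackson-mapper-asl-1.9.13.jar']
```

If the list comes back empty against a real distribution, the class genuinely isn't on the class path, matching the NoClassDefFoundError above.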
I've compared the jars outputs for the ./dev/make-distribution.sh --name custom-spark-default --tgz --pip -Pkubernetes -Phadoop-cloud build configuration, and the only difference is jackson-mapper-asl-1.9.13.jar. So I can limit the change to this one jar.
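The before/after comparison of build outputs can be sketched like this (the directory names are hypothetical stand-ins for the two make-distribution.sh results):

```python
import os

def jar_diff(before_dir, after_dir):
    """Return (added, removed) jar names between two jars/ directories."""
    before = set(os.listdir(before_dir))
    after = set(os.listdir(after_dir))
    return sorted(after - before), sorted(before - after)

# Fabricated directories standing in for the two builds:
os.makedirs("before/jars", exist_ok=True)
os.makedirs("after/jars", exist_ok=True)
open("before/jars/hadoop-azure-3.3.1.jar", "w").close()
open("after/jars/hadoop-azure-3.3.1.jar", "w").close()
open("after/jars/jackson-mapper-asl-1.9.13.jar", "w").close()

added, removed = jar_diff("before/jars", "after/jars")
print(added, removed)  # → ['jackson-mapper-asl-1.9.13.jar'] []
```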

@dongjoon-hyun dongjoon-hyun changed the title from "SPARK-37102: Hadoop Cloud: removed redundant exclusions" to "[SPARK-37102] Hadoop Cloud: removed redundant exclusions" Oct 25, 2021
@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-37102] Hadoop Cloud: removed redundant exclusions" to "[SPARK-37102][BUILD] Removed redundant exclusions in hadoop-cloud module" Oct 25, 2021

@sunchao sunchao left a comment


Thanks, could you add more info to the PR description (esp. why the changes are required), so it can go into the git commit eventually.

Member

why include Guava?

Contributor Author

Because I don't see it in the dependency:tree output:

[INFO] +- org.apache.hadoop:hadoop-azure:jar:3.3.1:compile
[INFO] |  +- com.microsoft.azure:azure-storage:jar:7.0.1:compile
[INFO] |  |  \- com.microsoft.azure:azure-keyvault-core:jar:1.0.0:compile
[INFO] |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:compile
[INFO] |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO] |  \- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.3.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.3.1:compile
[INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:3.4.1:compile
[INFO] |  |     +- org.jdom:jdom:jar:1.1:compile
[INFO] |  |     +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |     |  \- stax:stax-api:jar:1.0.1:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-core:jar:3.4.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-ram:jar:3.0.0:compile
[INFO] |  |     +- com.aliyun:aliyun-java-sdk-sts:jar:3.0.0:compile
[INFO] |  |     \- com.aliyun:aliyun-java-sdk-ecs:jar:4.2.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-azure-datalake:jar:3.3.1:compile
[INFO] |  |  \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.3.9:compile
[INFO] |  \- org.apache.hadoop:hadoop-cos:jar:3.3.1:compile
[INFO] |     \- com.qcloud:cos_api-bundle:jar:5.6.19:compile

But happy to get it added back. Shall we do it?

Member

Hadoop is using hadoop-shaded-guava since 3.3.1 so I think the com.google.guava inclusion is unnecessary.

Contributor Author

Do you mean exclusion?

Member

I mean the change on the following:

  <exclusion>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
  </exclusion>

needs to be reverted from the PR.

Contributor Author

OK, will do. Thanks.

Contributor Author

That's done. @sunchao - would you mind taking a look, please?

Member

nit: unnecessary change.

Contributor Author

ok, reverted.

@srowen
Member

srowen commented Oct 26, 2021

Hm, I'm worried that the exclusions are there to ensure that a newer version of Jackson 'wins' in the build. This may change that. What does the result of dev/test-dependencies.sh --replace-manifest show?

@vmalakhin
Contributor Author

> Hm, I'm worried that the exclusions are there to ensure that a newer version of Jackson 'wins' in the build. This may change that. What does the result of dev/test-dependencies.sh --replace-manifest show?

But that's the problem: the exclusions dropped the required jar completely. Another option could be to add an explicit dependency on the asl artifact with the required version; then versions can be managed.
Re test-dependencies.sh: I will run it and post the results here.
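For context on the manifest check: dev/test-dependencies.sh regenerates the dependency manifests under dev/deps/, so the outcome of the exclusion change shows up as a line-level diff there. A hedged sketch of such a check (the manifest lines and file layout below are illustrative, not copied from the repository):

```python
def manifest_contains(manifest_lines, artifact):
    """Check whether a dependency manifest lists a given jar name."""
    return any(artifact in line for line in manifest_lines)

# Illustrative manifest entries, not taken from an actual dev/deps file:
manifest = [
    "hadoop-azure/3.3.1//hadoop-azure-3.3.1.jar",
    "jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar",
]

print(manifest_contains(manifest, "jackson-mapper-asl-1.9.13.jar"))  # → True
```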

@srowen
Member

srowen commented Oct 26, 2021

codehaus jackson is 1.x, so it is more "OK" to add back. That said, it is probably excluded because it is otherwise unused and triggers security warnings on static analysis.

fasterxml jackson is probably specifically excluded because it is included at a newer version in the Spark build. That isn't related to the error you show. Neither is Guava, which is in a similar situation. Those shouldn't be changed.

Can you just add the dependencies that ABFS requires to your app? I don't think this profile is meant to support third party libraries, though the ABFS connector could arguably be a special case.

hadoop-cloud isn't published as part of the binary release, so it is more "OK" to change in this way. All in all, I could see adding back codehaus jackson.

@vmalakhin
Contributor Author

> codehaus jackson is 1.x, so it is more "OK" to add back. That said, it is probably excluded because it is otherwise unused and triggers security warnings on static analysis.
>
> fasterxml jackson is probably specifically excluded because it is included at a newer version in the Spark build. That isn't related to the error you show. Neither is Guava, which is in a similar situation. Those shouldn't be changed.
>
> Can you just add the dependencies that ABFS requires to your app? I don't think this profile is meant to support third party libraries, though the ABFS connector could arguably be a special case.
>
> hadoop-cloud isn't published as part of the binary release, so it is more "OK" to change in this way. All in all, I could see adding back codehaus jackson.

Yep, the only difference is the jackson-mapper-asl jar in the output. If it's OK to go ahead, can we merge this in? Or please let me know the next steps. Thanks!

@srowen
Member

srowen commented Oct 26, 2021

No, the fasterxml changes need to be reverted

@vmalakhin
Contributor Author

> No, the fasterxml changes need to be reverted

Sure, will do.

@vmalakhin
Contributor Author

> No, the fasterxml changes need to be reverted

That's done. @srowen - would you mind taking another look?

@srowen
Member

srowen commented Oct 27, 2021

Unless @steveloughran has thoughts on this one, I think it's OK

@srowen
Member

srowen commented Oct 30, 2021

Jenkins test this please

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49252/

@SparkQA

SparkQA commented Oct 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49252/

@SparkQA

SparkQA commented Oct 30, 2021

Test build #144783 has finished for PR 34383 at commit f2209ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 6906328 Oct 30, 2021
@srowen
Member

srowen commented Oct 30, 2021

Merged to master

@dongjoon-hyun
Member

Thank you, @vmalakhin , @srowen , @sunchao .
