
[SPARK-37600][BUILD] Upgrade to Hadoop 3.3.2 #34855

Closed · wants to merge 10 commits

Conversation

sunchao
Member

@sunchao sunchao commented Dec 9, 2021

What changes were proposed in this pull request?

This PR aims to upgrade to Hadoop 3.3.2. In addition, it also removes the LZ4 wrapper classes added in SPARK-36669, therefore fixing SPARK-36679.

Why are the changes needed?

Hadoop 3.3.2 contains many bug fixes, and it also lets us remove our internal hacked Hadoop codecs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

@github-actions github-actions bot added the BUILD label Dec 9, 2021
@sunchao sunchao changed the title [SPARK-37600][BUILD] Upgrade to Hadoop 3.3.2 [WIP][SPARK-37600][BUILD] Upgrade to Hadoop 3.3.2 Dec 9, 2021
@SparkQA

SparkQA commented Dec 9, 2021

Test build #146048 has finished for PR 34855 at commit 997590e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Dec 9, 2021
@SparkQA

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50523/

@SparkQA

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50525/

@SparkQA

SparkQA commented Dec 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50523/

@SparkQA

SparkQA commented Dec 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50525/

@SparkQA

SparkQA commented Dec 9, 2021

Test build #146050 has finished for PR 34855 at commit be530f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

pom.xml Outdated
@@ -309,6 +309,17 @@
</extraJavaTestArgs>
</properties>
<repositories>
<repository>
Member

We'd remove this before merging, right? After 3.3.2 is released?

Member Author

Yes, will remove this section once the official 3.3.2 release is out.
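For context, staging artifacts for a not-yet-released Hadoop version are typically pulled in via a temporary `<repository>` entry like the sketch below. The repository id and URL are illustrative, not the actual diff; the real ASF staging repository for the 3.3.2 RC may differ:

```xml
<!-- Temporary: resolve Hadoop 3.3.2 RC artifacts from ASF staging.
     Remove this block once the official 3.3.2 release is published. -->
<repository>
  <id>staged-release</id>
  <name>ASF Staging Repository</name>
  <url>https://repository.apache.org/content/repositories/staging/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
```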

dev/deps/spark-deps-hadoop-3-hive-2.3 (resolved)
@@ -120,7 +120,7 @@
<sbt.project.name>spark</sbt.project.name>
<slf4j.version>1.7.30</slf4j.version>
<log4j.version>1.2.17</log4j.version>
<hadoop.version>3.3.1</hadoop.version>
<hadoop.version>3.3.2</hadoop.version>
Member

should we update #34830 (comment) together?

Member Author

Good point. Will do.

Contributor

> should we update #34830 (comment) together?

+1 on this.

@sunchao
Member Author

sunchao commented Mar 4, 2022

Hmm, somehow YarnClusterSuite started failing after 3.3.2. I'll need to check what caused the issue.

@dongjoon-hyun
Member

Is there any update, @sunchao ?

Member

@dongjoon-hyun dongjoon-hyun left a comment

It seems to pass locally. Could you re-trigger the test simply, @sunchao ?

[info] YarnClusterSuite:
[info] - run Spark in yarn-client mode (10 seconds, 131 milliseconds)
[info] - run Spark in yarn-cluster mode (9 seconds, 90 milliseconds)
[info] - run Spark in yarn-client mode with unmanaged am (8 seconds, 78 milliseconds)
[info] - run Spark in yarn-client mode with different configurations, ensuring redaction (10 seconds, 102 milliseconds)
[info] - run Spark in yarn-cluster mode with different configurations, ensuring redaction (10 seconds, 96 milliseconds)
[info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) (9 seconds, 116 milliseconds)
[info] - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' (10 seconds, 111 milliseconds)
[info] - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' (9 seconds, 96 milliseconds)
[info] - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' and gateway-replacement path (8 seconds, 79 milliseconds)
[info] - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path (9 seconds, 90 milliseconds)
[info] - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path containing an environment variable (9 seconds, 100 milliseconds)
...

orc-core/1.7.3//orc-core-1.7.3.jar
orc-mapreduce/1.7.3//orc-mapreduce-1.7.3.jar
orc-shims/1.7.3//orc-shims-1.7.3.jar
org.jacoco.agent/0.8.5/runtime/org.jacoco.agent-0.8.5-runtime.jar
Contributor

jacoco is a Java code coverage library; I was surprised that it would become a dependency.

Member

Oh, I missed this. Do you want to exclude this as a workaround, @sunchao ?

Member Author

Hmm let me check.

Member Author

This is a test-only dependency brought in by aliyun-java-sdk-core in hadoop-cloud-storage.

Member

Are you sure it's test only? I didn't think those appeared in these deps files.

Member Author

I tracked it down to the commit which introduced it: aliyun/aliyun-openapi-java-sdk@e0d21a3, which looks like it's only used in tests?
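If the exclusion workaround is taken, a Maven `<exclusions>` entry on the transitive path would be the usual fix — roughly like this sketch (the exact dependency declaration and coordinates in hadoop-cloud/pom.xml are assumptions, not the actual diff):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-cloud-storage</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <!-- jacoco is only needed by aliyun-java-sdk-core's own tests,
         so exclude it from the runtime dependency tree -->
    <exclusion>
      <groupId>org.jacoco</groupId>
      <artifactId>org.jacoco.agent</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```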

@LuciferYang
Contributor

> It seems to pass locally. Could you re-trigger the test simply, @sunchao ?

I manually tested with mvn locally, and there was a UT failure:

YarnClusterSuite:
- run Spark in yarn-client mode
- run Spark in yarn-cluster mode
- run Spark in yarn-client mode with unmanaged am
- run Spark in yarn-client mode with different configurations, ensuring redaction
- run Spark in yarn-cluster mode with different configurations, ensuring redaction
- yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630)
- SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local'
- SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local'
- SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' and gateway-replacement path
- SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path
- SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path containing an environment variable
- SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'file'
- SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'file'
- run Spark in yarn-cluster mode unsuccessfully
- run Spark in yarn-cluster mode failure after sc initialized
- run Python application in yarn-client mode *** FAILED ***
  LOST did not equal FINISHED SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/Users/xxx/spark-source/assembly/target/scala-2.12/jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/Users/xxx/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.17.1/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] (BaseYarnClusterSuite.scala:233)	

@dongjoon-hyun
Member

Oh...

@LuciferYang
Contributor

#34855 (comment)

Sorry, this may be my bad. I re-ran it twice and it succeeded.

@dongjoon-hyun
Member

Ya, your failed test case already passed in the original GitHub Actions run. You might have hit a flaky test case that still exists in this module.

@sunchao
Member Author

sunchao commented Mar 7, 2022

Thanks for helping to verify this @LuciferYang @dongjoon-hyun ! yea it seems a bit flaky. I tried to look into the YARN logs locally but couldn't find anything interesting. Let me try to re-trigger the GitHub workflow.

Member

@dongjoon-hyun dongjoon-hyun left a comment

@sunchao . All tests except pyspark-pandas-slow seem to have passed, and that failure looks irrelevant.
Could you remove [WIP] and re-trigger once more, please?

@sunchao sunchao changed the title [WIP][SPARK-37600][BUILD] Upgrade to Hadoop 3.3.2 [SPARK-37600][BUILD] Upgrade to Hadoop 3.3.2 Mar 7, 2022
@sunchao sunchao marked this pull request as ready for review March 7, 2022 23:46
@sunchao
Member Author

sunchao commented Mar 7, 2022

Sure @dongjoon-hyun . Just re-triggered the jobs.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (Pending CIs). Thank you, @sunchao .

@dongjoon-hyun
Member

I believe this is almost done. Could you review this once more, @viirya , @srowen , @HyukjinKwon , @AngersZhuuuu , @LuciferYang ?

@dongjoon-hyun
Member

To @sunchao . It seems that it's not re-triggered yet. You may want to add an empty commit.

@sunchao
Member Author

sunchao commented Mar 7, 2022

Re-triggered via an empty commit. I had done it manually by clicking the "Re-run all jobs" button, but somehow that wasn't reflected here.
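For reference, re-triggering CI with an empty commit is just the following (the branch name is hypothetical):

```shell
# Create a commit with no changes; pushing it kicks off CI again
git commit --allow-empty -m "Retrigger CI"
git push origin SPARK-37600   # hypothetical branch name
```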

dev/deps/spark-deps-hadoop-3-hive-2.3 (resolved)
dev/deps/spark-deps-hadoop-3-hive-2.3 (resolved)
hadoop-cloud/pom.xml (resolved)
Contributor

@LuciferYang LuciferYang left a comment

LGTM +1

Member

@srowen srowen left a comment

Licenses look OK

@dongjoon-hyun
Member

@srowen , do you have any other concerns? Or, the last issue (LICENSE) is resolved and we are good to go?

[Screenshot attached: Screen Shot 2022-03-08 at 6 45 14 PM]

@@ -69,7 +69,7 @@ private[hive] object IsolatedClientLoader extends Logging {
// If the error message contains hadoop, it is probably because the hadoop
// version cannot be resolved.
val fallbackVersion = if (VersionUtils.isHadoop3) {
"3.3.1"
"3.3.2"
} else {
"2.7.4"
Contributor

By the way, can we read the Hadoop version from the project configuration here?

Member

That sounds like an independent improvement idea. Could you file a JIRA for that?

Contributor

> That sounds like an independent improvement idea. Could you file a JIRA for that?

Yea, will try to do this.

Member Author

I'm not sure this is easy: in this case the Hadoop version specified via hadoop.version in pom.xml is customized and is not 3.3.2, which is why it can't be fetched from Maven.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 9, 2022

Thank you, @sunchao , @viirya , @srowen , @HyukjinKwon , @LuciferYang , @AngersZhuuuu .
Merged to master for Apache Spark 3.3.

Also, cc @MaxGekk since he is the release manager for Apache Spark 3.3.

7 participants