
[SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. #33064

Closed

Conversation

@steveloughran (Contributor) commented Jun 24, 2021

What changes were proposed in this pull request?

This patches the Hadoop configuration so that fs.s3a.endpoint is set to
s3.amazonaws.com if neither it nor fs.s3a.endpoint.region is set.

This stops S3A filesystem creation failing with the error
"Unable to find a region via the region provider chain."
in some non-EC2 deployments.

See: HADOOP-17771.

When Spark options are propagated to the Hadoop configuration
in SparkHadoopUtil, the fs.s3a.endpoint value is set to
"s3.amazonaws.com" if it is unset and no explicit region
is set in fs.s3a.endpoint.region.
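
In outline, the change amounts to the following check (a sketch only; the helper name patchS3AEndpoint is invented here, and the merged code lives in SparkHadoopUtil.appendSparkHadoopConfigs, quoted in the review below):

import org.apache.hadoop.conf.Configuration

// Sketch of the defaulting logic, not the verbatim merged code.
def patchS3AEndpoint(hadoopConf: Configuration): Unit = {
  // Apply the default endpoint only when neither the endpoint nor an
  // explicit region has been configured by the user.
  if (hadoopConf.get("fs.s3a.endpoint", "").isEmpty &&
      hadoopConf.get("fs.s3a.endpoint.region") == null) {
    hadoopConf.set("fs.s3a.endpoint", "s3.amazonaws.com")
  }
}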

Why are the changes needed?

A regression in Hadoop 3.3.1 causes S3A filesystem
instantiation to fail outside EC2 deployments when the host has
neither a region declared in the CLI configuration in ~/.aws/config
nor the AWS_REGION environment variable set.

HADOOP-17771 fixes this in Hadoop 3.3.2+, but
this Spark patch corrects the behavior when running
Spark with the 3.3.1 artifacts.

It is harmless on older Hadoop versions and compatible
with Hadoop releases containing the HADOOP-17771
fix.
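
Until a Spark release contains this change, users on Hadoop 3.3.1 can apply the same fix by hand through a spark.hadoop.* option; a workaround sketch (the app name is hypothetical, and setting the region instead of the endpoint is equally valid):

import org.apache.spark.sql.SparkSession

// Pin the S3 endpoint so the AWS SDK region provider chain
// is never consulted on Hadoop 3.3.1.
val spark = SparkSession.builder()
  .appName("s3a-endpoint-workaround")  // hypothetical application name
  .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
  // Alternatively, set the region directly:
  // .config("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
  .getOrCreate()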

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests verify the propagation logic from the Spark configuration to the Hadoop configuration.
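
Roughly, the new tests drive appendSparkHadoopConfigs with a synthetic SparkConf and assert on the resulting Hadoop configuration; a minimal sketch, using plain assert in place of the suite's assertConfigValue helper (shown in the review diff below):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// The real suite lives in org.apache.spark.deploy, where the
// package-private SparkHadoopUtil class is visible.
val sc = new SparkConf()
val hadoopConf = new Configuration(false)
sc.set("spark.hadoop.orc.filterPushdown", "true")
new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
// spark.hadoop.* keys must land in the Hadoop configuration...
assert(hadoopConf.get("orc.filterPushdown") == "true")
// ...and the S3A endpoint default must have been applied.
assert(hadoopConf.get("fs.s3a.endpoint") == "s3.amazonaws.com")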

@github-actions bot added the CORE label Jun 24, 2021
@steveloughran changed the title from [CORE][SPARK-35878] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. to [SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. Jun 24, 2021
@SparkQA commented Jun 24, 2021

Test build #140271 has finished for PR 33064 at commit bc4a585.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

This patch fails Scala style tests.

Thought I'd done this locally. Will review.

@steveloughran (Contributor Author)

Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala:35:54: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala:60:62: No space after token ,
[error] Total time: 48 s, completed Jun 24, 2021 9:38:28 AM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala -Phadoop-3.2 -Phive-2.3 -Pmesos -Phadoop-cloud -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pspark-ganglia-lgpl -Pyarn -Pdocker-integration-tests -Phive ; received return code 1
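
For context, that scalastyle rule flags a comma not immediately followed by a space; an illustrative before/after, using the suite's assertConfigValue helper (quoted in the review diff below):

// Fails scalastyle: no space after the commas.
assertConfigValue(hadoopConf,"orc.filterPushdown","true")
// Passes:
assertConfigValue(hadoopConf, "orc.filterPushdown", "true")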

@SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44803/

@SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44803/

@SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44804/

@SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44804/

@SparkQA commented Jun 24, 2021

Test build #140273 has finished for PR 33064 at commit 4116751.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

The test failures in GA look unrelated, FWIW.

// This change is harmless on older versions and compatible with
// later Hadoop releases
if (hadoopConf.get("fs.s3a.endpoint", "").isEmpty &&
  hadoopConf.get("fs.s3a.endpoint.region") == null) {
Member

Two more spaces.

sc.set("spark.hadoop.orc.filterPushdown", "true")
new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
assertConfigValue(hadoopConf, "orc.filterPushdown", "true" )
assertConfigValue(hadoopConf, "fs.s3a.downgrade.syncable.exceptions", "true")
Member

Thank you for adding this.

@@ -487,6 +490,18 @@ private[spark] object SparkHadoopUtil extends Logging {
if (conf.getOption("spark.hadoop.fs.s3a.downgrade.syncable.exceptions").isEmpty) {
hadoopConf.set("fs.s3a.downgrade.syncable.exceptions", "true")
}
// In Hadoop 3.3.1, AWS region handling with the default "" endpoint only works
// in EC2 deployments or when the AWS CLI is installed.
// The workaround is to set the name of the S3 endpoint explicitly,
Member

We can use AWS_REGION as a workaround, too.

Contributor Author

That's automatically picked up in the default chain, but since you have to set it everywhere, it's not something you can rely on.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @steveloughran .
Merged to master.

@steveloughran (Contributor Author)

Thank you. And sorry for creating the problem. We always rely on a broad set of test configurations to explore the many-dimensional configuration space that represents a cloud setup, but it turns out that all our tests fell in the (all too small) regions of that space where everything worked: in EC2, on systems with the AWS client set up, or in test setups where fs.s3a.endpoint was set to the explicit region where the buckets were.
