
[SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. #33064

Closed

Conversation

@steveloughran (Contributor) commented Jun 24, 2021

What changes were proposed in this pull request?

This patches the Hadoop configuration so that fs.s3a.endpoint is set to
s3.amazonaws.com if neither it nor fs.s3a.endpoint.region is set.

This stops S3A filesystem creation failing with the error
"Unable to find a region via the region provider chain."
in some non-EC2 deployments.

See: HADOOP-17771.

When Spark options are propagated to the Hadoop configuration
in SparkHadoopUtil, the fs.s3a.endpoint value is set to
"s3.amazonaws.com" if it is unset and no explicit region
is set in fs.s3a.endpoint.region.
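
In outline, the change amounts to the following check (a sketch only; the helper name patchS3AEndpoint is invented here, and the merged code lives in SparkHadoopUtil.appendSparkHadoopConfigs, quoted in the review below):

import org.apache.hadoop.conf.Configuration

// Sketch of the defaulting logic, not the verbatim merged code.
def patchS3AEndpoint(hadoopConf: Configuration): Unit = {
  // Apply the default endpoint only when neither the endpoint nor an
  // explicit region has been configured by the user.
  if (hadoopConf.get("fs.s3a.endpoint", "").isEmpty &&
      hadoopConf.get("fs.s3a.endpoint.region") == null) {
    hadoopConf.set("fs.s3a.endpoint", "s3.amazonaws.com")
  }
}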

Why are the changes needed?

A regression in Hadoop 3.3.1 causes S3A filesystem
instantiation to fail outside EC2 deployments when the host has
neither a region declared in the CLI configuration in ~/.aws/config
nor the AWS_REGION environment variable set.

HADOOP-17771 fixes this in Hadoop 3.3.2+, but
this Spark patch corrects the behavior when running
Spark with the 3.3.1 artifacts.

It is harmless on older Hadoop versions and compatible
with Hadoop releases containing the HADOOP-17771
fix.
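
Until a Spark release contains this change, users on Hadoop 3.3.1 can apply the same fix by hand through a spark.hadoop.* option; a workaround sketch (the app name is hypothetical, and setting the region instead of the endpoint is equally valid):

import org.apache.spark.sql.SparkSession

// Pin the S3 endpoint so the AWS SDK region provider chain
// is never consulted on Hadoop 3.3.1.
val spark = SparkSession.builder()
  .appName("s3a-endpoint-workaround")  // hypothetical application name
  .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
  // Alternatively, set the region directly:
  // .config("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
  .getOrCreate()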

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests verify the propagation logic from the Spark configuration to the Hadoop configuration.
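
Roughly, the new tests drive appendSparkHadoopConfigs with a synthetic SparkConf and assert on the resulting Hadoop configuration; a minimal sketch, using plain assert in place of the suite's assertConfigValue helper (shown in the review diff below):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// The real suite lives in org.apache.spark.deploy, where the
// package-private SparkHadoopUtil class is visible.
val sc = new SparkConf()
val hadoopConf = new Configuration(false)
sc.set("spark.hadoop.orc.filterPushdown", "true")
new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
// spark.hadoop.* keys must land in the Hadoop configuration...
assert(hadoopConf.get("orc.filterPushdown") == "true")
// ...and the S3A endpoint default must have been applied.
assert(hadoopConf.get("fs.s3a.endpoint") == "s3.amazonaws.com")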

@github-actions bot added the CORE label Jun 24, 2021
@steveloughran changed the title from [CORE][SPARK-35878] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. to [SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. Jun 24, 2021
@SparkQA commented Jun 24, 2021

Test build #140271 has finished for PR 33064 at commit bc4a585.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

This patch fails Scala style tests.

Thought I'd done this locally. Will review.

@steveloughran (Contributor Author)

Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala:35:54: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala:60:62: No space after token ,
[error] Total time: 48 s, completed Jun 24, 2021 9:38:28 AM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala -Phadoop-3.2 -Phive-2.3 -Pmesos -Phadoop-cloud -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pspark-ganglia-lgpl -Pyarn -Pdocker-integration-tests -Phive ; received return code 1
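
For context, that scalastyle rule flags a comma not immediately followed by a space; an illustrative before/after, using the suite's assertConfigValue helper (quoted in the review diff below):

// Fails scalastyle: no space after the commas.
assertConfigValue(hadoopConf,"orc.filterPushdown","true")
// Passes:
assertConfigValue(hadoopConf, "orc.filterPushdown", "true")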

@SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44803/

@SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44803/

@SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44804/

@SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44804/

@SparkQA commented Jun 24, 2021

Test build #140273 has finished for PR 33064 at commit 4116751.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

The test failures in GA look unrelated, FWIW.

// This change is harmless on older versions and compatible with
// later Hadoop releases
if (hadoopConf.get("fs.s3a.endpoint", "").isEmpty &&
  hadoopConf.get("fs.s3a.endpoint.region") == null) {
Member

Two more spaces.

sc.set("spark.hadoop.orc.filterPushdown", "true")
new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
assertConfigValue(hadoopConf, "orc.filterPushdown", "true" )
assertConfigValue(hadoopConf, "fs.s3a.downgrade.syncable.exceptions", "true")
Member

Thank you for adding this.

@@ -487,6 +490,18 @@ private[spark] object SparkHadoopUtil extends Logging {
if (conf.getOption("spark.hadoop.fs.s3a.downgrade.syncable.exceptions").isEmpty) {
hadoopConf.set("fs.s3a.downgrade.syncable.exceptions", "true")
}
// In Hadoop 3.3.1, AWS region handling with the default "" endpoint only works
// in EC2 deployments or when the AWS CLI is installed.
// The workaround is to set the name of the S3 endpoint explicitly,
Member

We can use AWS_REGION as a workaround, too.

Contributor Author

That's automatically picked up in the default chain, but since you have to set it everywhere, it's not something you can rely on.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @steveloughran .
Merged to master.

@steveloughran (Contributor Author)

Thank you. And sorry for creating the problem. We always rely on a broad set of test configurations to explore the many-dimensional configuration space that represents a cloud setup, but it turns out that all our tests fell in the (all too small) regions of that space where everything worked: in EC2, on systems with the AWS client set up, or in test setups where fs.s3a.endpoint was set to the explicit region where the buckets were.
