[SPARK-35878][CORE] Add fs.s3a.endpoint if unset and fs.s3a.endpoint.region is null. #33064
Conversation
…is null. This patches the hadoop configuration so that fs.s3a.endpoint is set to s3.amazonaws.com if neither it nor fs.s3a.endpoint.region is set. This stops S3A Filesystem creation failing with the error "Unable to find a region via the region provider chain." in some non-EC2 deployments. See: HADOOP-17771. Contributed by Steve Loughran. Change-Id: Ib8f5dd4a5b8d5ddf4c643747d05e7a4e42083422
Test build #140271 has finished for PR 33064 at commit
thought I'd done this locally. will review.
Change-Id: I80e3160ca85c987862ec5a734774361982876173
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #140273 has finished for PR 33064 at commit
The test failures in GA look unrelated FWIW
// This change is harmless on older versions and compatible with
// later Hadoop releases
if (hadoopConf.get("fs.s3a.endpoint", "").isEmpty &&
    hadoopConf.get("fs.s3a.endpoint.region") == null) {
Two more spaces.
sc.set("spark.hadoop.orc.filterPushdown", "true")
new SparkHadoopUtil().appendSparkHadoopConfigs(sc, hadoopConf)
assertConfigValue(hadoopConf, "orc.filterPushdown", "true")
assertConfigValue(hadoopConf, "fs.s3a.downgrade.syncable.exceptions", "true")
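The `assertConfigValue` helper used in the test above is not shown in this diff. A minimal standalone sketch of what such a helper checks (assuming a plain `Map[String, String]` in place of Hadoop's `Configuration`; names and shape are illustrative, not the PR's actual code):

```scala
// Hypothetical sketch: a Map[String, String] stands in for Hadoop's
// Configuration, and assertConfigValue checks one expected entry.
object ConfigAssertSketch {
  def assertConfigValue(conf: Map[String, String],
                        key: String,
                        expected: String): Unit = {
    val actual = conf.getOrElse(key, null)
    assert(actual == expected,
      s"Mismatch on key '$key': expected '$expected', got '$actual'")
  }

  def main(args: Array[String]): Unit = {
    val conf = Map(
      "orc.filterPushdown" -> "true",
      "fs.s3a.downgrade.syncable.exceptions" -> "true")
    assertConfigValue(conf, "orc.filterPushdown", "true")
    assertConfigValue(conf, "fs.s3a.downgrade.syncable.exceptions", "true")
    println("ok")
  }
}
```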
Thank you for adding this.
@@ -487,6 +490,18 @@ private[spark] object SparkHadoopUtil extends Logging {
if (conf.getOption("spark.hadoop.fs.s3a.downgrade.syncable.exceptions").isEmpty) {
  hadoopConf.set("fs.s3a.downgrade.syncable.exceptions", "true")
}
// In Hadoop 3.3.1, AWS region handling with the default "" endpoint only works
// in EC2 deployments or when the AWS CLI is installed.
// The workaround is to set the name of the S3 endpoint explicitly,
We can use AWS_REGION as a workaround, too.
that's automatically picked up in the default chain, but since you have to set it everywhere it's not something you can rely on.
+1, LGTM. Thank you, @steveloughran .
Merged to master.
Thank you. And sorry for creating the problem. We always rely on a broad set of test configs to explore the many-dimensional configuration space which represents a cloud setup/config, but it turns out that all our tests were in (the all too small) regions of that space where everything worked: in-EC2, systems with the aws client set up, or test setups where fs.s3a.endpoint was set to the explicit region where the buckets were.
What changes were proposed in this pull request?
This patches the Hadoop configuration so that fs.s3a.endpoint is set to
s3.amazonaws.com if neither it nor fs.s3a.endpoint.region is set.
This stops S3A filesystem creation failing with the error
"Unable to find a region via the region provider chain."
in some non-EC2 deployments.
See: HADOOP-17771.
When Spark options are propagated to the Hadoop configuration
in SparkHadoopUtil, the fs.s3a.endpoint value is set to
"s3.amazonaws.com" if it is unset and no explicit region
is set in fs.s3a.endpoint.region.
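The defaulting rule described above can be sketched standalone (a mutable map stands in for Hadoop's `Configuration`; the key names come from the PR, while `patchEndpoint` is a hypothetical helper, not the PR's actual method):

```scala
import scala.collection.mutable

object EndpointDefaultSketch {
  // Mirrors the PR's rule: only when fs.s3a.endpoint is unset/empty AND
  // fs.s3a.endpoint.region is absent do we fall back to the central
  // endpoint, s3.amazonaws.com. Any explicit user setting wins.
  def patchEndpoint(conf: mutable.Map[String, String]): Unit = {
    val endpointEmpty = conf.getOrElse("fs.s3a.endpoint", "").isEmpty
    val regionUnset = !conf.contains("fs.s3a.endpoint.region")
    if (endpointEmpty && regionUnset) {
      conf("fs.s3a.endpoint") = "s3.amazonaws.com"
    }
  }

  def main(args: Array[String]): Unit = {
    // Nothing set: the default endpoint is injected.
    val c1 = mutable.Map.empty[String, String]
    patchEndpoint(c1)
    println(c1("fs.s3a.endpoint"))

    // A region is set: the endpoint is left alone.
    val c2 = mutable.Map("fs.s3a.endpoint.region" -> "eu-west-1")
    patchEndpoint(c2)
    println(c2.contains("fs.s3a.endpoint"))
  }
}
```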
Why are the changes needed?
A regression in Hadoop 3.3.1 has surfaced which causes S3A filesystem
instantiation to fail outside EC2 deployments if the host lacks
a CLI configuration in ~/.aws/config declaring the region, or
the AWS_REGION environment variable. HADOOP-17771 fixes this
in Hadoop 3.3.2+, but this Spark patch will correct the behavior
when running Spark with the 3.3.1 artifacts.
It is harmless for older versions and compatible
with Hadoop releases containing the HADOOP-17771
fix.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests to verify propagation logic from spark conf to hadoop conf.
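The key property those tests establish is that the default is only applied when the user set nothing. A rough standalone sketch of that propagation-plus-default flow (plain maps in place of SparkConf/Configuration; `appendSparkHadoopConfigs` here is a simplified stand-in for the real method, not its actual implementation):

```scala
import scala.collection.mutable

object PropagationTestSketch {
  // Spark-side options use the "spark.hadoop." prefix; stripping it yields
  // the Hadoop key. The S3A endpoint default is applied only when neither
  // the endpoint nor the region was set by the user.
  def appendSparkHadoopConfigs(
      sparkConf: Map[String, String],
      hadoopConf: mutable.Map[String, String]): Unit = {
    sparkConf.foreach { case (k, v) =>
      if (k.startsWith("spark.hadoop.")) {
        hadoopConf(k.stripPrefix("spark.hadoop.")) = v
      }
    }
    if (!sparkConf.contains("spark.hadoop.fs.s3a.endpoint") &&
        !sparkConf.contains("spark.hadoop.fs.s3a.endpoint.region")) {
      hadoopConf.getOrElseUpdate("fs.s3a.endpoint", "s3.amazonaws.com")
    }
  }

  def main(args: Array[String]): Unit = {
    // An explicitly configured endpoint must survive propagation unchanged.
    val hc = mutable.Map.empty[String, String]
    appendSparkHadoopConfigs(
      Map("spark.hadoop.fs.s3a.endpoint" -> "s3.eu-west-2.amazonaws.com"), hc)
    println(hc("fs.s3a.endpoint"))
  }
}
```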