[SPARK-35868][CORE] Add fs.s3a.downgrade.syncable.exceptions if not set #33044

Closed

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 24, 2021

What changes were proposed in this pull request?

This PR aims to add fs.s3a.downgrade.syncable.exceptions=true if it's not provided by the users.

Why are the changes needed?

Currently, the event log feature is broken with the Hadoop 3.2 profile. It fails with UnsupportedOperationException because HADOOP-17597 changed the default behavior to throw exceptions, starting with Apache Hadoop 3.3.1. We know the cause is that EventLogFileWriters calls hadoopDataStream.foreach(_.hflush()), but this PR aims to provide the same UX across Spark distributions built with Hadoop 2 and Hadoop 3 in Apache Spark 3.2.0.

$ bin/spark-shell -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/
...
21/06/23 17:34:35 ERROR SparkContext: Error initializing SparkContext.
java.lang.UnsupportedOperationException: S3A streams are not Syncable. See HADOOP-17597.
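
To make the failure mode concrete, here is a toy Python model of the behavior described above. It is only a sketch: `ToyS3AStream`, `write_event`, and `UnsupportedOperationError` are hypothetical names, not Hadoop or Spark classes, and the warn-and-skip branch is an assumption about what "downgrade" means for `hflush()`.

```python
import warnings


class UnsupportedOperationError(Exception):
    """Stand-in for Java's UnsupportedOperationException."""


class ToyS3AStream:
    """Toy model only; this is not the real S3A output stream."""

    def __init__(self, downgrade_syncable_exceptions):
        # Mirrors the idea of fs.s3a.downgrade.syncable.exceptions.
        self.downgrade_syncable_exceptions = downgrade_syncable_exceptions
        self.buffered = []

    def write(self, line):
        self.buffered.append(line)

    def hflush(self):
        # Strict mode: refuse, because the stream cannot honor the call.
        if not self.downgrade_syncable_exceptions:
            raise UnsupportedOperationError(
                "S3A streams are not Syncable. See HADOOP-17597."
            )
        # Downgraded mode (assumed): warn and carry on without flushing.
        warnings.warn("hflush() ignored: stream is not Syncable")


def write_event(stream, event):
    """Write one event and flush, the pattern the event log writer follows."""
    stream.write(event)
    stream.hflush()
```

With `downgrade_syncable_exceptions=False`, the first `write_event` call raises, which corresponds to the SparkContext initialization error in the log above; with `True`, the write goes through and only a warning is emitted.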

Does this PR introduce any user-facing change?

Yes, this restores the previous behavior.

How was this patch tested?

Manual.

$ build/sbt package -Phadoop-3.2 -Phadoop-cloud
$ bin/spark-shell -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/
...(working)...

If users set the configuration explicitly, their value is kept. For example, setting it to false brings back the exception:

$ bin/spark-shell -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ -c spark.hadoop.fs.s3a.downgrade.syncable.exceptions=false
...
21/06/23 17:44:41 ERROR Main: Failed to initialize Spark session.
java.lang.UnsupportedOperationException: S3A streams are not Syncable. See HADOOP-17597.
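
The two manual runs can be summarized with a small Python sketch of the "set only if the user did not" rule. It assumes, as the `-c spark.hadoop.fs.s3a...` flag above suggests, that `spark.hadoop.`-prefixed Spark settings are copied into the Hadoop configuration with the prefix stripped. The function name `hadoop_conf_from_spark` and the dict-based configuration are hypothetical; this is not the code in the patch.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."
S3A_DOWNGRADE_KEY = "fs.s3a.downgrade.syncable.exceptions"


def hadoop_conf_from_spark(spark_conf):
    """Build a Hadoop-style config dict from Spark settings (illustrative only)."""
    # Copy every spark.hadoop.* entry, dropping the prefix.
    hadoop_conf = {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }
    # Apply the default only when the user has not provided a value.
    hadoop_conf.setdefault(S3A_DOWNGRADE_KEY, "true")
    return hadoop_conf


# First run: the user sets only the event log options.
default_run = hadoop_conf_from_spark({"spark.eventLog.enabled": "true"})

# Second run: the user explicitly opts back into exceptions.
explicit_run = hadoop_conf_from_spark(
    {
        "spark.eventLog.enabled": "true",
        SPARK_HADOOP_PREFIX + S3A_DOWNGRADE_KEY: "false",
    }
)
```

In this sketch `default_run` ends up with the key set to "true" and `explicit_run` keeps the user's "false", matching the working and failing runs shown above.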

@github-actions github-actions bot added the CORE label Jun 24, 2021
@dongjoon-hyun
Member Author

cc @sunchao , @steveloughran

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 24, 2021

cc @gengliangwang for Apache Spark 3.2.0.

Member

@sunchao sunchao left a comment

LGTM (non-binding) thanks @dongjoon-hyun !

@dongjoon-hyun
Member Author

Thank you, @sunchao !

@SparkQA

SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44756/

@SparkQA

SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44756/

Member

@gengliangwang gengliangwang left a comment

LGTM

@SparkQA

SparkQA commented Jun 24, 2021

Test build #140229 has finished for PR 33044 at commit 4cbf380.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you so much, @gengliangwang ! The Python UT failures are unrelated to this change.
Merged to master for Apache Spark 3.2.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-35868 branch June 24, 2021 05:47
@HyukjinKwon
Member

lgtm2

@steveloughran
Contributor

thx. FWIW, given it's causing trouble, do you want this to be the default in hadoop default-xml?

It's there to stop people attempting to use s3 as a WAL for HBase or similar, but if applications have been treating it as a low-cost operation in general file IO, then we can just downgrade it broadly and rely on the hope that people don't do this.

6 participants