[SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default #29895
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129193 has finished for PR 29895 at commit
GitHub Action passed. The Jenkins failure is irrelevant. BTW, cc @waleedfateem, @srowen, @HyukjinKwon, @wangyum for #29541
also cc @steveloughran
Looks fine to me, but what do you think, @steveloughran? Looks like your call is important here.
I labeled SPARK-33019 as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result with Apache Spark 3.0 with the Hadoop 3.2 distribution or the Apache Spark 3.1 default distribution. This is a release blocker for Apache Spark 3.1. Note that this is a no-op when the user provides the conf.
Version 2 may have better performance, but version 1 may handle failures better in certain situations,
as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
The default value depends on the Hadoop version used in an environment:
1 for Hadoop versions lower than 3.0
2 for Hadoop versions 3.0 and higher
It's important to note that this can change back to 1 again in the future once <a href="https://issues.apache.org/jira/browse/MAPREDUCE-7282">MAPREDUCE-7282</a>
is fixed and merged.
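The deleted documentation above refers to the config key this PR now pins. For reference, a user can still set it explicitly; a minimal `spark-defaults.conf` fragment (using the standard key discussed throughout this thread — the value shown is just the v1 choice this PR defaults to):

```
# Pin the Hadoop file output committer algorithm explicitly rather than
# relying on the default of the Hadoop build Spark was compiled against.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  1
```

The same key can also be passed per job via `--conf` on `spark-submit`; an explicit user-provided value always wins over the new default.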
Just curious, why is this deleted? It is a very comprehensive comment about the Hadoop version background. @dongjoon-hyun
This PR aims to provide a consistent view for Apache Spark users. For example, "The default value depends on the Hadoop version used in an environment" is not valid any more. After this PR, Apache Spark users will use v1 consistently by default.
FWIW I'm going to change the default to be v1, and log @ WARN in job setup when you use v2 (unless you turn that specific log off). V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. Note that if your job doesn't generate unique files with each task attempt, even without atomic task commit the output is correct. The danger is when you get one or more of
If your attempts are 100% deterministic, you are going to be safe.
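The determinism point above can be sketched without Spark at all. In the toy model below (hypothetical names; a plain Scala set stands in for the output directory), v2-style task commit promotes each attempt's files immediately — safe when a retry reuses the same deterministic file name, but leaving duplicates when a retried attempt picks a new one:

```scala
import scala.collection.mutable

// Toy v2-style task commit: each attempt writes directly into the
// final output set, with no job-level promotion step.
def commitTaskV2(output: mutable.Set[String], fileName: String): Unit =
  output += fileName

// Deterministic naming: a retry overwrites the first attempt's file.
val deterministic = mutable.Set[String]()
commitTaskV2(deterministic, "part-00000") // attempt 1, then the executor dies
commitTaskV2(deterministic, "part-00000") // retry reuses the same name
assert(deterministic.size == 1)           // output is still correct

// Non-deterministic naming: the retry leaves a duplicate behind.
val nonDeterministic = mutable.Set[String]()
commitTaskV2(nonDeterministic, "part-00000-attempt1")
commitTaskV2(nonDeterministic, "part-00000-attempt2")
assert(nonDeterministic.size == 2)        // duplicate output data
println("commit sketch ok")
```

Under v1, by contrast, attempt output stays in a per-attempt directory until the single job-commit step, so a failed attempt's files are never visible in the destination.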
https://issues.apache.org/jira/browse/MAPREDUCE-7282 is not yet resolved, so I think we should wait for resolution there. I don't remember the details off the top of my head, so I need to go look again.
Hi, @steveloughran and @tgravescs. No matter what happens in the future, they cannot change the history (Apache Hadoop 3.2.0 and all existing Hadoop 3.x versions). And, for now, Apache Spark 3.1 will be stuck on Apache Hadoop 3.2.0 due to the Guava issue. That's the reason why we need to do this right now from the Spark side. For the following, @steveloughran, as I wrote in the PR description, this PR doesn't override the explicit user-given config. This is only setting
Eventually, I believe we can use
Patch LGTM: you are changing the default algorithm to v1 if the user doesn't say otherwise. I'm sorry about "the guava problem". Something to discuss there. It's just that there were some security fixes we needed to get in, and we couldn't stay on older versions. FWIW we are removing the Preconditions checks out of hadoop-common entirely and moving to our own, just to avoid grief there, but other bits (executors, cache, ...) will still be used. What a pain. Are
I'm fine with changing the default. I was trying to figure out cases when a user would really see this. The MapReduce paradigm and Spark rely on the output of tasks being deterministic. If they are not, they have other issues with retries, and the output has no guarantees. I thought Spark had deterministic output path naming, but I was just starting to make sure I was remembering properly. If those are true, I think that just leaves the _SUCCESS file thing, which I can see would be a problem if people don't check. Are there cases I'm missing here? Are there cases where cloud providers or other tools are changing the output paths or something? @steveloughran did you see this in a particular situation?
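The `_SUCCESS` marker concern can also be guarded on the consumer side. A minimal sketch (hypothetical helper, plain JVM file APIs, no Spark required): refuse to read an output directory unless the job-level commit actually completed.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical guard: only list part files if the _SUCCESS marker shows
// the job-level commit completed, not just individual task commits.
def committedPartFiles(dir: Path): Seq[Path] = {
  require(Files.exists(dir.resolve("_SUCCESS")),
    s"$dir has no _SUCCESS marker; the job may have failed mid-commit")
  Files.list(dir).iterator().asScala.toSeq
    .filter(_.getFileName.toString.startsWith("part-"))
}

// Demo against a temp directory standing in for a real output path.
val out = Files.createTempDirectory("job-output")
Files.createFile(out.resolve("part-00000"))
Files.createFile(out.resolve("_SUCCESS"))
assert(committedPartFiles(out).size == 1)
println("marker check ok")
```

Spark writes `_SUCCESS` only on successful job commit (when `mapreduce.fileoutputcommitter.marksuccessfuljobs` is enabled), so a check like this catches partially committed output regardless of the committer algorithm version.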
Thank you, @steveloughran and @tgravescs.
@tgravescs. Apache Spark's official cloud integration document is here. We are already recommending
I'll merge this PR. Thanks!
…gorithm.version=1 by default ### What changes were proposed in this pull request? Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of having a warning documentation, this PR aims to use a consistent and safer version of Apache Hadoop file output committer algorithm which is `v1`. This will prevent a silent correctness regression during migration from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. Of course, if there is a user-provided configuration, `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will be used still. ### Why are the changes needed? Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from `v1` to `v2` and now there exists a discussion to remove `v2`. We had better provide a consistent default behavior of `v1` across various Spark distributions. - [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default ### Does this PR introduce _any_ user-facing change? Yes. This changes the default behavior. Users can override this conf. ### How was this patch tested? Manual. **BEFORE (spark-3.0.1-bin-hadoop3.2)** ```scala scala> sc.version res0: String = 3.0.1 scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version") res1: String = 2 ``` **AFTER** ```scala scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version") res0: String = 1 ``` Closes #29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit cc06266) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to master/3.0.