Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-9899] [SQL] Disables customized output committer when speculation is on #8317

Conversation

liancheng
Copy link
Contributor

Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss.

Please see this PR comment for more details.

@@ -58,6 +58,9 @@ private[sql] abstract class BaseWriterContainer(
// This is only used on driver side.
@transient private val jobContext: JobContext = job

private val speculationEnabled: Boolean =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why bother having a private variable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we need this since the relation and SQLContext are transient and so not available on the executors.

@SparkQA
Copy link

SparkQA commented Aug 19, 2015

Test build #41270 has finished for PR 8317 at commit 7564867.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

Thanks, merging to master and 1.5.

@asfgit asfgit closed this in f3ff4c4 Aug 19, 2015
asfgit pushed a commit that referenced this pull request Aug 19, 2015
…ion is on

Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss.

Please see this [PR comment] [1] for more details.

[1]: #8191 (comment)

Author: Cheng Lian <lian@databricks.com>

Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer.

(cherry picked from commit f3ff4c4)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala
@liancheng liancheng deleted the spark-9899/speculation-hates-direct-output-committer branch August 19, 2015 23:55
asfgit pushed a commit that referenced this pull request Sep 14, 2015
…lation enabled

This is a follow-up of #8317.

When speculation is enabled, there may be multiply tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to target path.

However, when we use direct output committer, tasks will write data to target path directly without temporary directory. This causes problems like corrupted data. Please see [PR comment](#8191 (comment)) for more details.

Unfortunately, we don't have a simple flag to tell if a output committer will write to temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8687 from cloud-fan/direct-committer.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
…lation enabled

This is a follow-up of apache/spark#8317.

When speculation is enabled, there may be multiply tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to target path.

However, when we use direct output committer, tasks will write data to target path directly without temporary directory. This causes problems like corrupted data. Please see [PR comment](apache/spark#8191 (comment)) for more details.

Unfortunately, we don't have a simple flag to tell if a output committer will write to temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8687 from cloud-fan/direct-committer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants