
[SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract #10932

Closed
JoshRosen wants to merge 2 commits into master from JoshRosen:SPARK-13021

Conversation

JoshRosen
Contributor

Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition must match its position in the partitions array.

If a custom RDD implementation violates this contract, then Spark can become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html
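For illustration, here is a minimal sketch of a contract-violating custom RDD, modeled on the `BadRDD` test class this patch adds (the class body shown here is an assumption; only the class signature appears in the test output below):

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Reversing the parent's partitions array breaks the contract:
// partitions(i).index will generally not equal i, so recomputing a
// "missing" partition re-runs the wrong task, forever.
class BadRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {

  override def compute(part: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(part, context)

  override protected def getPartitions: Array[Partition] =
    prev.partitions.reverse
}
```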

In order to guard against this infinite-loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs whose `partitions` violate the API contract.
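The check itself amounts to the following (a rough sketch; `PartitionsContract.validate` is a hypothetical standalone helper, not Spark API — in the actual patch the validation runs inside `RDD` when the result of `getPartitions` is first computed and cached):

```scala
import org.apache.spark.Partition

object PartitionsContract {
  /** Hypothetical helper: fail fast if any partition's reported index
    * does not match its position in the partitions array. */
  def validate(partitions: Array[Partition]): Unit = {
    partitions.zipWithIndex.foreach { case (partition, index) =>
      require(partition.index == index,
        s"partitions($index).index == ${partition.index}, but it should equal $index")
    }
  }
}
```

Running this once per RDD turns the silent infinite-recomputation loop into an immediate, descriptive error at job submission time.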

@RussellSpitzer
Member

I'm +1 on this in 2.0 :)

@JoshRosen
Contributor Author

An open question is whether we want to put this in 1.6.1: backporting risks breaking user code that happened to work only by accident, but it would also guard against the infinite-loop behavior.

@rxin
Contributor

rxin commented Jan 26, 2016

I don't think we should put it in 1.6.x.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50137 has finished for PR 10932 at commit 10efe2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50157 has finished for PR 10932 at commit a0dd1e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • `class BadRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev)`

@yhuai
Contributor

yhuai commented Jan 27, 2016

LGTM

@yhuai
Contributor

yhuai commented Jan 27, 2016

Merging to master.

@asfgit closed this in 32f7411 Jan 27, 2016
@JoshRosen deleted the SPARK-13021 branch January 27, 2016 21:30