[SPARK-29042][Core] Sampling-based RDD with unordered input should be INDETERMINATE #25751

viirya · 2019-09-11T01:21:00Z

What changes were proposed in this pull request?

We have previously found and fixed correctness issues that arise when RDD output is INDETERMINATE. One case still missing is the sampling-based RDD. This kind of RDD is sensitive to the order of its input, so a sampling-based RDD with unordered input should be INDETERMINATE.
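To illustrate why sampling is order sensitive, here is a minimal sketch in plain Scala with no Spark dependency. The `sample` helper is a hypothetical stand-in for a seeded per-partition sampler, not Spark's actual implementation: it keeps the element at position i when the i-th draw of a seeded generator falls below `fraction`. With the same seed the same positions are kept, so if a rerun delivers the partition's elements in a different order, different elements can end up in the sample.

```scala
import scala.util.Random

object OrderSensitiveSampling {
  // Hypothetical stand-in for a seeded per-partition sampler (an assumption,
  // not Spark's code): the decision for each element depends only on its
  // position in the iteration order, not on the element itself.
  def sample[T](seed: Long, fraction: Double, partition: Seq[T]): Seq[T] = {
    val rng = new Random(seed)
    partition.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    // Same elements, two arrival orders (e.g. first attempt vs. rerun).
    val firstAttempt = Seq("a", "b", "c", "d", "e", "f", "g", "h")
    val rerunAttempt = firstAttempt.reverse

    val keptFirst = sample(42L, 0.5, firstAttempt)
    val keptRerun = sample(42L, 0.5, rerunAttempt)
    println(s"first attempt kept: $keptFirst")
    println(s"rerun kept:         $keptRerun")

    // The kept positions are identical because the seed is identical;
    // the kept elements follow whatever happens to sit at those positions.
    val positionsFirst = keptFirst.map(x => firstAttempt.indexOf(x))
    val positionsRerun = keptRerun.map(x => rerunAttempt.indexOf(x))
    println(s"same positions kept: ${positionsFirst == positionsRerun}")
  }
}
```

Fixing the seed therefore makes the sample reproducible only if the input order is reproducible as well.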

Why are the changes needed?

A sampling-based RDD with unordered input behaves just like a MapPartitionsRDD with the isOrderSensitive parameter set to true: the RDD output can be different after a rerun.
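The rule described above can be modelled in a few lines. This is a simplified sketch, assuming three determinism levels and a single input; the names `Level`, `Determinate`, `Unordered`, `Indeterminate` and `outputLevel` are hypothetical and only mirror the idea, they are not Spark's classes.

```scala
object DeterminismSketch {
  // Hypothetical three-level model (an assumption made for illustration).
  sealed trait Level
  case object Determinate extends Level   // same rows, same order on rerun
  case object Unordered extends Level     // same rows, order may differ
  case object Indeterminate extends Level // rows themselves may differ

  // An order-sensitive operation over unordered input can produce different
  // rows on rerun, so its output is Indeterminate; otherwise the input level
  // simply carries over.
  def outputLevel(input: Level, isOrderSensitive: Boolean): Level =
    if (isOrderSensitive && input == Unordered) Indeterminate else input

  def main(args: Array[String]): Unit = {
    println(outputLevel(Unordered, isOrderSensitive = true))    // Indeterminate
    println(outputLevel(Unordered, isOrderSensitive = false))   // Unordered
    println(outputLevel(Determinate, isOrderSensitive = true))  // Determinate
  }
}
```

In this model, sampling corresponds to calling `outputLevel` with `isOrderSensitive = true`.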

This is a problem in ML applications.

In ML, sample is used to prepare training data, and the ML algorithm fits the model on the sampled data. If rerun tasks of sample produce different output during model fitting, the ML results will be unreliable and buggy.

Each sample is random, but once the data has been sampled, the output should be determinate.

Does this PR introduce any user-facing change?

Previously, a sampling-based RDD could produce different output after a rerun.
After this patch, a sampling-based RDD with unordered input is INDETERMINATE. For an INDETERMINATE map stage, the Spark scheduler currently retries all the tasks of the failed stage.
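The scheduling consequence can be sketched as a single decision. This is a toy model under stated assumptions, not the DAGScheduler code: `tasksToRerun` and its parameters are hypothetical names, and task identifiers are plain integers.

```scala
object RetrySketch {
  // Hypothetical helper: which map tasks are resubmitted after a failure.
  // For an indeterminate stage, output that survived the failure cannot be
  // combined with regenerated output, so every task runs again.
  def tasksToRerun(
      allTasks: Set[Int],
      lostTasks: Set[Int],
      indeterminate: Boolean): Set[Int] =
    if (indeterminate) allTasks else lostTasks

  def main(args: Array[String]): Unit = {
    val all = Set(0, 1, 2, 3)
    val lost = Set(2)
    println(tasksToRerun(all, lost, indeterminate = false)) // Set(2)
    println(tasksToRerun(all, lost, indeterminate = true).size) // 4
  }
}
```

The cost of the change is visible here: a single lost task in an indeterminate stage triggers a rerun of the whole stage.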

How was this patch tested?

Added test.

viirya · 2019-09-11T01:21:34Z

cc @felixcheung @cloud-fan

HyukjinKwon · 2019-09-11T01:30:10Z

@viirya, technically this looks like it introduces a behaviour change (see "Does this PR introduce any user-facing change?"), assuming from the description that it affects determinism after a rerun.

core/src/main/scala/org/apache/spark/rdd/RDD.scala

viirya · 2019-09-11T03:00:52Z

@HyukjinKwon Thanks! I updated the PR description.

SparkQA · 2019-09-11T05:22:09Z

Test build #110454 has finished for PR 25751 at commit fb94fea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-09-11T15:16:50Z

retest this please

SparkQA · 2019-09-11T18:05:06Z

Test build #110479 has finished for PR 25751 at commit ad06a8f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

LGTM. What's our policy now on correctness changes?

cloud-fan · 2019-09-12T07:30:52Z

Do we have any queries that return wrong results because of it?

For the round-robin partitioner, there is an expectation that it returns the same output when rerun; otherwise we need to rerun the entire stage. This is for the correctness of repartition.

However, I don't think sample has the same problem. End users would expect sample to return random output, so it doesn't matter if Spark returns different output when tasks of sample are rerun.
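The round-robin expectation mentioned in this comment can be made concrete with a small sketch in plain Scala. The `roundRobin` helper is a hypothetical simplification (element i goes to partition i % n), not Spark's partitioner. It shows what goes wrong when one output partition comes from the first attempt and another from a rerun that saw the input in a different order.

```scala
object RoundRobinSketch {
  // Hypothetical round-robin assignment: the i-th input element goes to
  // output partition i % numPartitions, so placement depends on input order.
  def roundRobin[T](input: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
    input.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .map { case (p, pairs) => p -> pairs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val first = roundRobin(Seq("a", "b", "c", "d"), 2) // 0 -> a, c ; 1 -> b, d
    val rerun = roundRobin(Seq("b", "a", "c", "d"), 2) // 0 -> b, c ; 1 -> a, d

    // Partition 0 was already consumed from the first attempt; partition 1
    // is read from the rerun. "a" now appears twice and "b" is lost.
    val mixed = first(0) ++ rerun(1)
    println(mixed) // List(a, c, a, d)
  }
}
```

Mixing outputs from two attempts is only safe when both attempts place every element identically, which is why an order change forces the entire stage to rerun.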

viirya · 2019-09-12T15:29:55Z

This is a problem in ML applications.

In ML, sample is used to prepare training data, and the ML algorithm fits the model on the sampled data. If rerun tasks of sample produce different output during model fitting, the ML results will be unreliable and buggy.

Each sample is random, but once the data has been sampled, the output should be determinate.

core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala

cloud-fan · 2019-09-12T15:36:23Z

Makes sense, LGTM.

SparkQA · 2019-09-12T19:48:49Z

Test build #110528 has finished for PR 25751 at commit 71634f2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-09-12T19:57:58Z

retest this please

SparkQA · 2019-09-12T21:46:54Z

Test build #110539 has finished for PR 25751 at commit 71634f2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2019-09-12T21:54:35Z

core/src/main/scala/org/apache/spark/rdd/RDD.scala

+   * sensitive, it may return totally different result when the input order
+   * is changed. Mostly stateful functions are order-sensitive.
+   */
+  private[spark] def mapPartitionsWithIndex[U: ClassTag](


Shall we expose this to users?

I tend not to for now. @cloud-fan WDYT?

viirya · 2019-09-12T22:00:33Z

retest this please

SparkQA · 2019-09-13T00:21:37Z

Test build #110545 has finished for PR 25751 at commit 71634f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-09-13T17:30:31Z

I will merge this later if there are no other comments. We can decide whether to expose mapPartitionsWithIndex later if we want.

viirya · 2019-09-13T21:08:06Z

Merged to master. Thanks!

Closes apache#25751 from viirya/sample-order-sensitive.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>

gatorsmile · 2019-09-18T02:17:14Z

@viirya Could you backport this to 2.4?

viirya · 2019-09-18T02:21:00Z

@gatorsmile OK, I will do the backport.

HyukjinKwon reviewed Sep 11, 2019

viirya force-pushed the sample-order-sensitive branch from e5c90c0 to fb94fea Compare September 11, 2019 03:00

Sampling-based RDD with unordered input should be INDETERMINATE.

ad06a8f

viirya force-pushed the sample-order-sensitive branch from fb94fea to ad06a8f Compare September 11, 2019 06:03

dongjoon-hyun added the SPARK CORE label Sep 11, 2019

felixcheung approved these changes Sep 12, 2019

cloud-fan reviewed Sep 12, 2019

Simplify test case.

71634f2

jiangxb1987 reviewed Sep 12, 2019

viirya closed this in c610de6 Sep 13, 2019

viirya deleted the sample-order-sensitive branch December 27, 2023 18:22
