Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29042][Core] Sampling-based RDD with unordered input should be INDETERMINATE #25751

Closed
wants to merge 2 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Sep 11, 2019

What changes were proposed in this pull request?

We already have found and fixed the correctness issue before when RDD output is INDETERMINATE. One missing part is sampling-based RDD. This kind of RDDs is order sensitive to its input. A sampling-based RDD with unordered input, should be INDETERMINATE.

Why are the changes needed?

A sampling-based RDD with unordered input is just like MapPartitionsRDD with isOrderSensitive parameter as true. The RDD output can be different after a rerun.

It is a problem in ML applications.

In ML, sample is used to prepare training data. ML algorithm fits the model based on the sampled data. If rerun tasks of sample produce different output during model fitting, ML results will be unreliable and also buggy.

Each sample is random output, but once you sampled, the output should be determinate.

Does this PR introduce any user-facing change?

Previously, a sampling-based RDD can possibly come with different output after a rerun.
After this patch, sampling-based RDD is INDETERMINATE. For an INDETERMINATE map stage, currently Spark scheduler will re-try all the tasks of the failed stage.

How was this patch tested?

Added test.

@viirya
Copy link
Member Author

viirya commented Sep 11, 2019

cc @felixcheung @cloud-fan

@HyukjinKwon
Copy link
Member

@viirya, looks like technically it introduces a behaviour change ("Does this PR introduce any user-facing change?") assuming it affects determinism after a rerun given the description.

@SparkQA

This comment has been minimized.

@viirya
Copy link
Member Author

viirya commented Sep 11, 2019

@HyukjinKwon Thanks! I updated the PR description.

@SparkQA
Copy link

SparkQA commented Sep 11, 2019

Test build #110454 has finished for PR 25751 at commit fb94fea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

This comment has been minimized.

@viirya
Copy link
Member Author

viirya commented Sep 11, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Sep 11, 2019

Test build #110479 has finished for PR 25751 at commit ad06a8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. what's our policy now on correctness change?

@cloud-fan
Copy link
Contributor

Do we have any queries return wrong result because of it?

for round-robin partitioner, it has an expectation that it should return the same output when rerun, otherwise we need to rerun the entire stage. This is for the correctness of repartition.

However, I don't think sample has the same problem. End-users would expect sample to return random output, so it doesn't matter if Spark returns different output when rerun tasks of sample.

@viirya
Copy link
Member Author

viirya commented Sep 12, 2019

It is a problem in ML applications.

In ML, sample is used to prepare training data. ML algorithm fits the model based on the sampled data. If rerun tasks of sample produce different output during model fitting, ML results will be unreliable and also buggy.

Each sample is random output, but once you sampled, the output should be determinate.

@cloud-fan
Copy link
Contributor

make sense, LGTM

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110528 has finished for PR 25751 at commit 71634f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Sep 12, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110539 has finished for PR 25751 at commit 71634f2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* sensitive, it may return totally different result when the input order
* is changed. Mostly stateful functions are order-sensitive.
*/
private[spark] def mapPartitionsWithIndex[U: ClassTag](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we expose this to users?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to not now. @cloud-fan WDYT?

@viirya
Copy link
Member Author

viirya commented Sep 12, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Sep 13, 2019

Test build #110545 has finished for PR 25751 at commit 71634f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Sep 13, 2019

I will go to merge this later if no other comments. We can decide to expose mapPartitionsWithIndex later if we want.

@viirya viirya closed this in c610de6 Sep 13, 2019
@viirya
Copy link
Member Author

viirya commented Sep 13, 2019

Merged to master. Thanks!

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
… INDETERMINATE

### What changes were proposed in this pull request?

We already have found and fixed the correctness issue before when RDD output is INDETERMINATE. One missing part is sampling-based RDD. This kind of RDDs is order sensitive to its input. A sampling-based RDD with unordered input, should be INDETERMINATE.

### Why are the changes needed?

A sampling-based RDD with unordered input is just like MapPartitionsRDD with isOrderSensitive parameter as true. The RDD output can be different after a rerun.

It is a problem in ML applications.

In ML, sample is used to prepare training data. ML algorithm fits the model based on the sampled data. If rerun tasks of sample produce different output during model fitting, ML results will be unreliable and also buggy.

Each sample is random output, but once you sampled, the output should be determinate.

### Does this PR introduce any user-facing change?

Previously, a sampling-based RDD can possibly come with different output after a rerun.
After this patch, sampling-based RDD is INDETERMINATE. For an INDETERMINATE map stage, currently Spark scheduler will re-try all the tasks of the failed stage.

### How was this patch tested?

Added test.

Closes apache#25751 from viirya/sample-order-sensitive.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
@gatorsmile
Copy link
Member

@viirya Could you backport this to 2.4?

@viirya
Copy link
Member Author

viirya commented Sep 18, 2019

@gatorsmile Ok. Will do backport.

@viirya viirya deleted the sample-order-sensitive branch December 27, 2023 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants