[SPARK-29042][Core] Sampling-based RDD with unordered input should be INDETERMINATE #25751
@viirya, it looks like this technically introduces a behaviour change (see "Does this PR introduce any user-facing change?"), assuming it affects determinism after a rerun, as the description suggests.
@HyukjinKwon Thanks! I updated the PR description.
Test build #110454 has finished for PR 25751 at commit
retest this please |
Test build #110479 has finished for PR 25751 at commit
LGTM. what's our policy now on correctness change?
Do we have any queries that return wrong results because of it? The round-robin partitioner is expected to return the same output when rerun; otherwise we need to rerun the entire stage. This is for the correctness of However, I don't think sample has the same problem. End users expect sample to return random output, so it doesn't matter if Spark returns different output when tasks of sample are rerun.
It is a problem in ML applications. In ML, sample is used to prepare training data, and the ML algorithm fits the model to the sampled data. If rerun tasks of sample produce different output during model fitting, the ML results will be unreliable and buggy. Each sample is random, but once you have sampled, the output should be determinate.
Makes sense, LGTM.
Test build #110528 has finished for PR 25751 at commit
retest this please |
Test build #110539 has finished for PR 25751 at commit
 * sensitive, it may return totally different result when the input order
 * is changed. Mostly stateful functions are order-sensitive.
 */
private[spark] def mapPartitionsWithIndex[U: ClassTag](
shall we expose this to users?
I'd rather not for now. @cloud-fan WDYT?
retest this please |
Test build #110545 has finished for PR 25751 at commit
I will merge this later if there are no other comments. We can decide whether to expose mapPartitionsWithIndex later if we want.
Merged to master. Thanks! |
… INDETERMINATE

### What changes were proposed in this pull request?

We already have found and fixed the correctness issue before when RDD output is INDETERMINATE. One missing part is sampling-based RDD. This kind of RDDs is order sensitive to its input. A sampling-based RDD with unordered input, should be INDETERMINATE.

### Why are the changes needed?

A sampling-based RDD with unordered input is just like MapPartitionsRDD with isOrderSensitive parameter as true. The RDD output can be different after a rerun.

It is a problem in ML applications. In ML, sample is used to prepare training data. ML algorithm fits the model based on the sampled data. If rerun tasks of sample produce different output during model fitting, ML results will be unreliable and also buggy. Each sample is random output, but once you sampled, the output should be determinate.

### Does this PR introduce any user-facing change?

Previously, a sampling-based RDD can possibly come with different output after a rerun. After this patch, sampling-based RDD is INDETERMINATE. For an INDETERMINATE map stage, currently Spark scheduler will re-try all the tasks of the failed stage.

### How was this patch tested?

Added test.

Closes apache#25751 from viirya/sample-order-sensitive.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
@viirya Could you backport this to 2.4? |
@gatorsmile Ok. Will do backport. |
What changes were proposed in this pull request?
We have previously found and fixed correctness issues that arise when RDD output is INDETERMINATE. One missing piece is sampling-based RDDs. This kind of RDD is order-sensitive to its input, so a sampling-based RDD with unordered input should be INDETERMINATE.
Why are the changes needed?
A sampling-based RDD with unordered input is just like a MapPartitionsRDD with the isOrderSensitive parameter set to true. The RDD output can be different after a rerun.
This is a problem in ML applications.
In ML, sample is used to prepare training data, and the ML algorithm fits the model to the sampled data. If rerun tasks of sample produce different output during model fitting, the ML results will be unreliable and buggy.
Each sample is random, but once you have sampled, the output should be determinate.
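To illustrate why sampling is order-sensitive, here is a minimal sketch in plain Scala with no Spark dependency. The `sample` helper is a hypothetical stand-in for a seeded per-partition Bernoulli sampler, not Spark's actual implementation, and the reversed vector is an assumed model of a rerun that receives the same records in a different order.

```scala
import scala.util.Random

object OrderSensitiveSampling {
  // Hypothetical stand-in for a seeded per-partition sampler: an element is
  // kept when the next pseudo-random draw falls below `fraction`.
  def sample[T](input: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    input.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val firstAttempt = (1 to 20).toVector
    // Same elements, different arrival order (assumed model of a rerun
    // that reads unordered input).
    val rerun = firstAttempt.reverse

    val a = sample(firstAttempt, 0.5, seed = 42L)
    val b = sample(rerun, 0.5, seed = 42L)

    // Same order and same seed always reproduce the same sample.
    assert(a == sample(firstAttempt, 0.5, seed = 42L))

    println(s"first attempt: $a")
    println(s"rerun:         $b")
    println(s"same elements selected: ${a.toSet == b.toSet}")
  }
}
```

Because the seed fixes which positions are kept rather than which elements, the rerun keeps the elements that happen to sit at those positions in the new order, which is generally a different set.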
Does this PR introduce any user-facing change?
Previously, a sampling-based RDD could produce different output after a rerun.
After this patch, a sampling-based RDD with unordered input is INDETERMINATE. For an INDETERMINATE map stage, the Spark scheduler currently retries all the tasks of the failed stage.
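The rule described above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not Spark's scheduler code: the `Level` type, `outputLevel`, and `tasksToRerun` are hypothetical names introduced here, and the retry rule only mirrors the behaviour stated in this description.

```scala
object DeterminismSketch {
  // Hypothetical model of the three determinism levels discussed in this PR.
  sealed trait Level
  case object Determinate extends Level
  case object Unordered extends Level
  case object Indeterminate extends Level

  // Toy rule: an order-sensitive computation over unordered input becomes
  // indeterminate; otherwise the input level is passed through unchanged.
  def outputLevel(inputLevel: Level, isOrderSensitive: Boolean): Level =
    (inputLevel, isOrderSensitive) match {
      case (Indeterminate, _) => Indeterminate
      case (Unordered, true)  => Indeterminate
      case (other, _)         => other
    }

  // Toy retry rule: an indeterminate stage reruns every task, any other
  // stage reruns only the tasks whose output was lost.
  def tasksToRerun(level: Level, failed: Set[Int], all: Set[Int]): Set[Int] =
    if (level == Indeterminate) all else failed

  def main(args: Array[String]): Unit = {
    val allTasks = Set(0, 1, 2, 3)
    val level = outputLevel(Unordered, isOrderSensitive = true)
    println(s"level: $level")
    println(s"tasks to rerun: ${tasksToRerun(level, Set(3), allTasks)}")
  }
}
```

The user-facing cost is visible in `tasksToRerun`: a single lost task in an indeterminate stage triggers a rerun of the whole stage.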
How was this patch tested?
Added test.