[SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side #46456
Conversation
I'm not sure whether the contract of `sample` guarantees determinism or not, @zhengruifeng.

> which causes inconsistent results in multiple executions

Anyway, if this behavior change is considered a bug fix to match non-Connect PySpark, do we need to backport it?
@dongjoon-hyun

Ah, I got your point. It's a very interesting point. I was thinking of the following. My bad.
```python
def test_sample_with_random_seed(self):
    df = self.spark.range(10000).sample(0.1)
    cnts = [df.count() for i in range(10)]
    self.assertEqual(1, len(set(cnts)))
```
Oh, it seems that we have a unit test. Could you update it?
```
FAIL [0.001s]: test_sample (pyspark.sql.tests.connect.test_connect_plan.SparkConnectPlanTests.test_sample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_plan.py", line 446, in test_sample
    self.assertEqual(plan.root.sample.HasField("seed"), False)
AssertionError: True != False
```
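The failure is expected given the change: the old plan test asserted that `seed` was left unset, while the fix always populates it on the client. A hedged sketch of the flipped expectation, simulated with a plain dict in place of the real Connect proto (names here are illustrative, not the actual PySpark API):

```python
import random

def build_sample_plan(fraction, seed=None):
    """Toy stand-in for the Connect client's sample plan builder."""
    plan = {"fraction": fraction}
    # After the fix the client always fills in a seed,
    # picking a random one if the user did not supply it.
    plan["seed"] = seed if seed is not None else random.randint(0, 2**31 - 1)
    return plan

plan = build_sample_plan(0.1)
# The old assertion (HasField("seed") == False) would now fail;
# the updated expectation is that the seed is always present.
assert "seed" in plan
```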
…le` in Client side

### What changes were proposed in this pull request?
Always set the seed of `Dataframe.sample` in Client side

### Why are the changes needed?
Bug fix

If the seed is not set in Client, it will be set on the server side with a random int
https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386
which causes inconsistent results in multiple executions.

In Spark Classic:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect, before:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 47afe77)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to master/3.5.
Thank you, @zhengruifeng and @HyukjinKwon .
Thanks @dongjoon-hyun and @HyukjinKwon for the reviews.
### What changes were proposed in this pull request?
Document the requirement of seed in protos

### Why are the changes needed?
The seed should be set at the client side; document it to avoid cases like #46456.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Always set the seed of `Dataframe.sample` in Client side.

Why are the changes needed?
Bug fix.
If the seed is not set in Client, it will be set on the server side with a random int:

spark/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala (line 386 in c4df12c)

which causes inconsistent results in multiple executions.
In Spark Classic:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect, before:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:
```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```
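The mechanism behind the fix can be simulated without Spark. A minimal standalone sketch (plain Python, `SampledPlan` is a hypothetical stand-in for a Connect sample relation, not a real class): the seed is resolved once on the client at plan-construction time, so every re-execution of the plan samples identically, instead of the server drawing a fresh random seed per execution.

```python
import random

class SampledPlan:
    """Toy stand-in for a Connect sample relation (illustrative only)."""

    def __init__(self, n, fraction, seed=None):
        self.n = n
        self.fraction = fraction
        # The fix: pick the seed on the client when the plan is built,
        # rather than leaving it unset for the server to randomize later.
        self.seed = seed if seed is not None else random.randint(0, 2**31 - 1)

    def count(self):
        # Each execution re-derives the sample from the stored seed,
        # so repeated counts agree.
        rng = random.Random(self.seed)
        return sum(rng.random() < self.fraction for _ in range(self.n))

df = SampledPlan(10000, 0.1)
counts = [df.count() for _ in range(10)]
# With a fixed seed, all ten executions return the same count.
assert len(set(counts)) == 1
```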
Does this PR introduce any user-facing change?
yes, bug fix
How was this patch tested?
ci
Was this patch authored or co-authored using generative AI tooling?
no