
[SPARK-48184][PYTHON][CONNECT] Always set the seed of Dataframe.sample in Client side #46456

Closed

Conversation

@zhengruifeng (Contributor) commented May 8, 2024

What changes were proposed in this pull request?

Always set the seed of Dataframe.sample in Client side

Why are the changes needed?

Bug fix

If the seed is not set on the client, it is set on the server side with a random int

https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386

which causes inconsistent results across multiple executions.

In Spark Classic:

```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect:

before:

```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:

```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```
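The idea behind the fix can be sketched outside of Spark: resolve the seed once, on the client, when the plan is built, instead of letting the server pick a fresh random seed per execution. The names below (`SamplePlan`, `build_sample_plan`) are hypothetical stand-ins for the Connect client's plan construction, not the actual PySpark API:

```python
import random

class SamplePlan:
    """Hypothetical stand-in for the Connect Sample relation."""
    def __init__(self, fraction, seed):
        self.fraction = fraction
        self.seed = seed

def build_sample_plan(fraction, seed=None):
    # The fix in a nutshell: fill in the seed at plan-build time on the
    # client, so re-executing the same plan always reuses the same seed.
    if seed is None:
        seed = random.randint(0, 2**63 - 1)
    return SamplePlan(fraction, seed)

plan = build_sample_plan(0.1)
# Every execution of `plan` now sees the same, fixed seed.
```

Because the seed is baked into the plan, repeated `df.count()` calls against the same DataFrame sample the same rows, matching Spark Classic's behavior.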

Does this PR introduce any user-facing change?

Yes, this is a bug fix.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

no

@dongjoon-hyun (Member) left a comment

I'm not sure whether the contract of sample guarantees determinism or not, @zhengruifeng .

which causes inconsistent results across multiple executions

Anyway, if this behavior change is considered a bug in order to match the non-connect PySpark, do we need to backport it?

@zhengruifeng (Contributor, Author)

@dongjoon-hyun Given `df2 = df1.sample()`, the DataFrame `df2` should be immutable.
I think it is a bug and we should backport it; I have updated the affected versions in JIRA.

@dongjoon-hyun (Member)

Ah, I got your point. It's a very interesting connector bug.

df2 should be immutable.

I was thinking the following. My bad.

```
scala> spark.range(10000).sample(0.1).count()
res0: Long = 1080

scala> spark.range(10000).sample(0.1).count()
res1: Long = 998
```

```
def test_sample_with_random_seed(self):
    df = self.spark.range(10000).sample(0.1)
    cnts = [df.count() for i in range(10)]
    self.assertEqual(1, len(set(cnts)))
```
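The determinism this test asserts can be illustrated with plain stdlib sampling, assuming a fixed seed (a sketch with `random.Random`, not Spark):

```python
import random

def sample_ids(n, fraction, seed):
    # A fixed seed makes the Bernoulli sample fully deterministic.
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < fraction]

# Re-evaluating the same seeded sample yields the same count every time,
# which is exactly what the unit test above checks for DataFrame.sample.
counts = {len(sample_ids(10000, 0.1, seed=42)) for _ in range(10)}
assert len(counts) == 1
```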
Member

Oh, it seems that we have a unit test. Could you update it?

```
FAIL [0.001s]: test_sample (pyspark.sql.tests.connect.test_connect_plan.SparkConnectPlanTests.test_sample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_plan.py", line 446, in test_sample
    self.assertEqual(plan.root.sample.HasField("seed"), False)
AssertionError: True != False
```
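With the seed now always set on the client, the plan-level expectation flips from `False` to `True`. A minimal standalone sketch of that flipped check, using a hypothetical `FakeSample` stand-in for the proto message rather than the real Connect plan objects:

```python
class FakeSample:
    """Hypothetical stand-in for the Sample proto message (illustration only)."""
    def __init__(self, seed=None):
        self._seed = seed

    def HasField(self, name):
        # Mirrors protobuf's HasField: True only when the field was set.
        return name == "seed" and self._seed is not None

# After the fix, the client always fills in the seed before sending
# the plan, so the assertion in test_connect_plan becomes:
plan_sample = FakeSample(seed=123)
assert plan_sample.HasField("seed") is True
```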

dongjoon-hyun pushed a commit that referenced this pull request May 8, 2024
…le` in Client side

Closes #46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 47afe77)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun (Member)

Merged to master/3.5.

@dongjoon-hyun (Member)

Thank you, @zhengruifeng and @HyukjinKwon .

@zhengruifeng zhengruifeng deleted the py_connect_sample_seed branch May 8, 2024 23:43
@zhengruifeng (Contributor, Author)

Thanks @dongjoon-hyun and @HyukjinKwon for the reviews.

dongjoon-hyun pushed a commit that referenced this pull request May 9, 2024
### What changes were proposed in this pull request?
Document the requirement of seed in protos

### Why are the changes needed?
The seed should be set on the client side.

Document it to avoid cases like #46456.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…le` in Client side


Closes apache#46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

Closes apache#46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>