
[SPARK-48184][PYTHON][CONNECT] Always set the seed of Dataframe.sample in Client side #46456

Closed

Conversation

@zhengruifeng (Contributor) commented May 8, 2024

What changes were proposed in this pull request?

Always set the seed of Dataframe.sample in Client side

Why are the changes needed?

Bug fix

If the seed is not set on the client, it is set on the server side with a random int

https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386

which causes inconsistent results across multiple executions.

In Spark Classic:

```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
```

In Spark Connect:

before:

```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
```

after:

```
In [1]: df = spark.range(10000).sample(0.1)

In [2]: [df.count() for i in range(10)]
Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
```
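The idea behind the fix can be sketched outside of Spark: resolve the seed once, on the client, when the plan is built, instead of letting the server pick a fresh random seed per execution. The names below (`SamplePlan`, `build_sample_plan`) are hypothetical stand-ins for the Connect client's plan construction, not the actual PySpark API:

```python
import random

class SamplePlan:
    """Hypothetical stand-in for the Connect Sample relation."""
    def __init__(self, fraction, seed):
        self.fraction = fraction
        self.seed = seed

def build_sample_plan(fraction, seed=None):
    # The fix in a nutshell: fill in the seed at plan-build time on the
    # client, so re-executing the same plan always reuses the same seed.
    if seed is None:
        seed = random.randint(0, 2**63 - 1)
    return SamplePlan(fraction, seed)

plan = build_sample_plan(0.1)
# Every execution of `plan` now sees the same, fixed seed.
```

Because the seed is baked into the plan, repeated `df.count()` calls against the same DataFrame sample the same rows, matching Spark Classic's behavior.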

Does this PR introduce any user-facing change?

Yes, this is a bug fix.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

no

@dongjoon-hyun (Member) left a comment

I'm not sure whether the contract of sample guarantees determinism or not, @zhengruifeng .

which causes inconsistent results across multiple executions

Anyway, if this behavior change is considered a bug in order to match the non-connect PySpark, do we need to backport it?

@zhengruifeng (Contributor, Author)

@dongjoon-hyun Given `df2 = df1.sample()`, the DataFrame `df2` should be immutable.
I think it is a bug and we should backport it; I have updated the affected versions in JIRA.

@dongjoon-hyun (Member)

Ah, I got your point. It's a very interesting connector bug.

df2 should be immutable.

I was thinking the following. My bad.

```
scala> spark.range(10000).sample(0.1).count()
res0: Long = 1080

scala> spark.range(10000).sample(0.1).count()
res1: Long = 998
```

```
def test_sample_with_random_seed(self):
    df = self.spark.range(10000).sample(0.1)
    cnts = [df.count() for i in range(10)]
    self.assertEqual(1, len(set(cnts)))
```
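The determinism this test asserts can be illustrated with plain stdlib sampling, assuming a fixed seed (a sketch with `random.Random`, not Spark):

```python
import random

def sample_ids(n, fraction, seed):
    # A fixed seed makes the Bernoulli sample fully deterministic.
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < fraction]

# Re-evaluating the same seeded sample yields the same count every time,
# which is exactly what the unit test above checks for DataFrame.sample.
counts = {len(sample_ids(10000, 0.1, seed=42)) for _ in range(10)}
assert len(counts) == 1
```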
Member

Oh, it seems that we have a unit test. Could you update it?

```
FAIL [0.001s]: test_sample (pyspark.sql.tests.connect.test_connect_plan.SparkConnectPlanTests.test_sample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_plan.py", line 446, in test_sample
    self.assertEqual(plan.root.sample.HasField("seed"), False)
AssertionError: True != False
```
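With the seed now always set on the client, the plan-level expectation flips from `False` to `True`. A minimal standalone sketch of that flipped check, using a hypothetical `FakeSample` stand-in for the proto message rather than the real Connect plan objects:

```python
class FakeSample:
    """Hypothetical stand-in for the Sample proto message (illustration only)."""
    def __init__(self, seed=None):
        self._seed = seed

    def HasField(self, name):
        # Mirrors protobuf's HasField: True only when the field was set.
        return name == "seed" and self._seed is not None

# After the fix, the client always fills in the seed before sending
# the plan, so the assertion in test_connect_plan becomes:
plan_sample = FakeSample(seed=123)
assert plan_sample.HasField("seed") is True
```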

dongjoon-hyun pushed a commit that referenced this pull request May 8, 2024
…le` in Client side

Closes #46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 47afe77)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun (Member)

Merged to master/3.5.

@dongjoon-hyun (Member)

Thank you, @zhengruifeng and @HyukjinKwon .

@zhengruifeng zhengruifeng deleted the py_connect_sample_seed branch May 8, 2024 23:43
@zhengruifeng (Contributor, Author)

Thanks @dongjoon-hyun and @HyukjinKwon for the reviews.

dongjoon-hyun pushed a commit that referenced this pull request May 9, 2024
### What changes were proposed in this pull request?
Document the requirement of seed in protos

### Why are the changes needed?
The seed should be set on the client side.

Document it to avoid cases like #46456.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…le` in Client side


Closes apache#46456 from zhengruifeng/py_connect_sample_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

Closes apache#46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>