[SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample #38310

Closed
wants to merge 2 commits into apache:master from zhengruifeng:connect_df_sample

Conversation

@zhengruifeng (Contributor) commented Oct 19, 2022

What changes were proposed in this pull request?

Implement DataFrame.sample in Connect

Why are the changes needed?

for DataFrame API coverage

Does this PR introduce any user-facing change?

Yes, new API

    def sample(
        self,
        fraction: float,
        *,
        withReplacement: bool = False,
        seed: Optional[int] = None,
    ) -> "DataFrame":

How was this patch tested?

added UT


@staticmethod
def _prepare_augments_for_sample(
@zhengruifeng (Contributor, Author):

the pre-processing of sample arguments is pretty complex, so make it a static method and reuse it in connect
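
For context, here is a rough illustration of the kind of normalization involved (a hypothetical sketch, not the actual helper body):

    from typing import Optional, Tuple, Union

    def _prepare_augments_for_sample(
        withReplacement: Optional[Union[float, bool]] = None,
        fraction: Optional[Union[int, float]] = None,
        seed: Optional[int] = None,
    ) -> Tuple[bool, float, Optional[int]]:
        # sample(fraction[, seed]) puts the fraction in the first positional
        # slot; when it arrives as a float, shift the arguments right.
        if isinstance(withReplacement, float):
            withReplacement, fraction, seed = None, withReplacement, fraction
        if not isinstance(fraction, float):
            raise TypeError("fraction must be a float")
        return bool(withReplacement), fraction, seed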

@cloud-fan (Contributor) commented Oct 19, 2022:

If we do need to share code between pyspark and the Spark Connect Python client, we should probably add a new module like pyspark-common.

@zhengruifeng zhengruifeng marked this pull request as ready for review October 19, 2022 10:00
@zhengruifeng zhengruifeng changed the title [SPARK-40839][CONNECT][PYTHON][WIP] Implement DataFrame.sample [SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample Oct 19, 2022
@zhengruifeng zhengruifeng force-pushed the connect_df_sample branch 2 times, most recently from 8953e7f to c114ba4 Compare October 19, 2022 10:13

    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    from pyspark.sql import DataFrame as PySparkDataFrame
Contributor:

Oh, does the Spark Connect Python client depend on pyspark? Then it's not a thin client any more...

@amaliujia (Contributor) commented Oct 19, 2022:

Yes, this now depends on pyspark. In fact, it has depended on pyspark since the first PR. For the short term it is OK. cc @HyukjinKwon

I guess we will need to make a final decision on whether it should depend on pyspark before we do the Python packaging and release.

if withReplacement is None:
    withReplacement = False
if seed is None:
    # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM
Contributor:

@amaliujia We should really consider this. The principle is to move code implementation to the server side as much as possible. We just moved the identifier parsing logic to the server side, and we should probably do the same for parameter default values.

@amaliujia (Contributor) commented Oct 19, 2022:

This makes sense.

@zhengruifeng I am thinking you can wrap this seed into a proto message, so that the server side can know whether it is set or not. In that case, the server side can do the random generation rather than using the value from the proto.

Contributor:

This is an example: #38275

@zhengruifeng (Contributor, Author):

yeah, let me make this change
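
A rough client-side sketch of the idea (module path and field names here are assumptions, not the merged proto):

    from pyspark.sql.connect import proto  # assumed module path

    rel = proto.Relation()
    # Only populate the wrapped message when the user supplied a seed; since
    # `seed` is a message field, proto3 tracks its presence, so the server can
    # tell "unset" apart from "seed = 0" and draw a random seed itself.
    if seed is not None:
        rel.sample.seed.seed = seed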

    withReplacement, fraction, seed
)
if withReplacement is None:
    withReplacement = False
Contributor:

The default bool value for proto is False, so this is probably not needed.

Contributor:

Oh, the Plan definition is not Optional for withReplacement. In this case, setting it to False probably makes sense.
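
For reference, unset proto3 scalars read back as their zero value (a tiny sketch, assuming the generated Sample class):

    s = proto.Sample()                    # freshly constructed message
    assert s.with_replacement is False    # unset bool reads back as False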

class Sample(LogicalPlan):
    def __init__(
        self,
        child: Optional["LogicalPlan"],
        lower_bound: float,
        upper_bound: float,
        with_replacement: bool,
        seed: int,
    ) -> None:
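
For orientation, a sketch of how such a plan node might serialize into the Connect proto (method shape and field names are assumptions, not the merged code):

    def plan(self, session) -> "proto.Relation":
        assert self._child is not None
        rel = proto.Relation()
        rel.sample.input.CopyFrom(self._child.plan(session))  # child relation
        rel.sample.lower_bound = self.lower_bound
        rel.sample.upper_bound = self.upper_bound
        rel.sample.with_replacement = self.with_replacement
        rel.sample.seed.seed = self.seed                      # wrapped Seed message
        return rel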

@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
"""Sort by a specific column"""
return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)

def sample(
@amaliujia (Contributor) commented Oct 19, 2022:

The pyspark dataframe API has

    @overload
    def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame":
        ...

    @overload
    def sample(
        self,
        withReplacement: Optional[bool],
        fraction: float,
        seed: Optional[int] = ...,
    ) -> "DataFrame":
        ...

Can we match it (as easily as copying the API into connect's dataframe.py)?

@zhengruifeng (Contributor, Author):

I guess we can discard those overloads? @HyukjinKwon

Contributor:

Maybe my real question is: will we have compatibility issues with existing pyspark DataFrame code (it needs different imports, of course) if we discard such APIs? I see many other similar APIs in the pyspark DataFrame.

@zhengruifeng (Contributor, Author):

users may have to change their code for this migration, but I think this is also a chance to make some changes.

Contributor:

Sure. We can also go in that direction.

    self,
    withReplacement: Optional[Union[float, bool]] = None,
    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
Member:

Maybe we should just leverage keyword-only arguments, which would make the logic much simpler. Actually, we wanted to do this in the PySpark API layer in the past. Since this is a new API layer, I think it's a good chance to replace them. cc @ueshin

@zhengruifeng (Contributor, Author):

yes, that's a bit confusing at first glance.

Member:

Yes, if we can break the signature, it would be:

def sample(
    self,
    fraction: float,
    *,
    withReplacement: Optional[bool] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    ...
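
With the keyword-only signature, call sites look like this (illustrative):

    df.sample(0.1)                                 # fraction is positional
    df.sample(0.1, withReplacement=True, seed=42)  # the rest must be named
    df.sample(0.1, True, 42)                       # TypeError: extra positional args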

Member:

withReplacement can be `: bool = False` if the default is False.

@zhengruifeng (Contributor, Author):

I like this idea

    Seed seed = 5;
}

message Seed {
@zhengruifeng (Contributor, Author):

I need to define Seed outside of Sample; otherwise there is no hasSeed method in the generated files.

Contributor:

This is not true. The has* methods are generated for non-simple types.
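
For example, under proto3 Python semantics (a sketch, assuming a generated Sample message with a message-typed seed and a scalar with_replacement):

    s = proto.Sample()
    s.HasField("seed")               # OK: presence is tracked for message fields
    s.HasField("with_replacement")   # ValueError: plain proto3 scalars lack presence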

@zhengruifeng (Contributor, Author):

you are right; maybe the jars were out of sync at that time. Let me move Seed into Sample.

Contributor:

Yeah, I always do a clean and then a build.

@amaliujia (Contributor) commented:

LGTM

@zhengruifeng zhengruifeng force-pushed the connect_df_sample branch 2 times, most recently from 4c137ac to 1bd8f65 Compare October 21, 2022 02:07
Commits: nit · fix · fix · fix · fix lint · mark as todo · mark as todo · make seed a msg · mv seed outside of sample · mv seed outside of sample · nit · nit · mv Seed into Sample · fix scala lint · change signature
@HyukjinKwon (Member) commented:

Merged to master.

@zhengruifeng (Contributor, Author) commented:

thank you guys

@zhengruifeng zhengruifeng deleted the connect_df_sample branch October 21, 2022 11:04
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38310 from zhengruifeng/connect_df_sample.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>