[SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample #38310

Closed
wants to merge 2 commits into apache:master from zhengruifeng:connect_df_sample

Conversation

@zhengruifeng (Contributor) commented Oct 19, 2022

What changes were proposed in this pull request?

Implement DataFrame.sample in Connect

Why are the changes needed?

for DataFrame API coverage

Does this PR introduce any user-facing change?

Yes, new API

    def sample(
        self,
        fraction: float,
        *,
        withReplacement: bool = False,
        seed: Optional[int] = None,
    ) -> "DataFrame":

How was this patch tested?

added UT


@staticmethod
def _prepare_augments_for_sample(
@zhengruifeng (Contributor, Author):

the pre-processing of sample arguments is pretty complex, so make it a static method and reuse it in connect
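
For context, here is a rough illustration of the kind of normalization involved (a hypothetical sketch, not the actual helper body):

    from typing import Optional, Tuple, Union

    def _prepare_augments_for_sample(
        withReplacement: Optional[Union[float, bool]] = None,
        fraction: Optional[Union[int, float]] = None,
        seed: Optional[int] = None,
    ) -> Tuple[bool, float, Optional[int]]:
        # sample(fraction[, seed]) puts the fraction in the first positional
        # slot; when it arrives as a float, shift the arguments right.
        if isinstance(withReplacement, float):
            withReplacement, fraction, seed = None, withReplacement, fraction
        if not isinstance(fraction, float):
            raise TypeError("fraction must be a float")
        return bool(withReplacement), fraction, seed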

@cloud-fan (Contributor) commented Oct 19, 2022:

If we do need to share code between pyspark and the Spark Connect Python client, we should probably add a new module like pyspark-common.

@zhengruifeng zhengruifeng marked this pull request as ready for review October 19, 2022 10:00
@zhengruifeng zhengruifeng changed the title [SPARK-40839][CONNECT][PYTHON][WIP] Implement DataFrame.sample [SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample Oct 19, 2022
@zhengruifeng zhengruifeng force-pushed the connect_df_sample branch 2 times, most recently from 8953e7f to c114ba4 Compare October 19, 2022 10:13

    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    from pyspark.sql import DataFrame as PySparkDataFrame
Contributor:

Oh, does the Spark Connect Python client depend on pyspark? Then it's not a thin client any more...

@amaliujia (Contributor) commented Oct 19, 2022:

Yes, this now depends on pyspark. In fact, it has depended on pyspark since the first PR. For the short term it is OK. cc @HyukjinKwon

I guess we will need to make a final decision on whether it should depend on pyspark before we do the Python packaging and release.

if withReplacement is None:
    withReplacement = False
if seed is None:
    # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM
Contributor:

@amaliujia We should really consider this. The principle is to move code implementation to the server side as much as possible. We just moved the identifier parsing logic to the server side, and we should probably do the same for parameter default values.

@amaliujia (Contributor) commented Oct 19, 2022:

This makes sense.

@zhengruifeng I am thinking you can wrap this seed into a proto message, so that the server side can know whether it is set or not. In that case, the server side can do the random generation rather than using the value from the proto.

Contributor:

This is an example: #38275

@zhengruifeng (Contributor, Author):

yeah, let me make this change
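
A rough client-side sketch of the idea (module path and field names here are assumptions, not the merged proto):

    from pyspark.sql.connect import proto  # assumed module path

    rel = proto.Relation()
    # Only populate the wrapped message when the user supplied a seed; since
    # `seed` is a message field, proto3 tracks its presence, so the server can
    # tell "unset" apart from "seed = 0" and draw a random seed itself.
    if seed is not None:
        rel.sample.seed.seed = seed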

    withReplacement, fraction, seed
)
if withReplacement is None:
    withReplacement = False
Contributor:

The default bool value for proto is False, so this is probably not needed.

Contributor:

Oh, the Plan definition is not Optional for withReplacement. In this case, setting it to False probably makes sense.
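
For reference, unset proto3 scalars read back as their zero value (a tiny sketch, assuming the generated Sample class):

    s = proto.Sample()                    # freshly constructed message
    assert s.with_replacement is False    # unset bool reads back as False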

class Sample(LogicalPlan):
    def __init__(
        self,
        child: Optional["LogicalPlan"],
        lower_bound: float,
        upper_bound: float,
        with_replacement: bool,
        seed: int,
    ) -> None:
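
For orientation, a sketch of how such a plan node might serialize into the Connect proto (method shape and field names are assumptions, not the merged code):

    def plan(self, session) -> "proto.Relation":
        assert self._child is not None
        rel = proto.Relation()
        rel.sample.input.CopyFrom(self._child.plan(session))  # child relation
        rel.sample.lower_bound = self.lower_bound
        rel.sample.upper_bound = self.upper_bound
        rel.sample.with_replacement = self.with_replacement
        rel.sample.seed.seed = self.seed                      # wrapped Seed message
        return rel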

@@ -201,6 +202,34 @@ def sort(self, *cols: "ColumnOrString") -> "DataFrame":
"""Sort by a specific column"""
return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)

def sample(
@amaliujia (Contributor) commented Oct 19, 2022:

The pyspark dataframe API has

    @overload
    def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame":
        ...

    @overload
    def sample(
        self,
        withReplacement: Optional[bool],
        fraction: float,
        seed: Optional[int] = ...,
    ) -> "DataFrame":
        ...

Can we match it (as easily as copying the API into connect's dataframe.py)?

@zhengruifeng (Contributor, Author):

I guess we can discard those overloads? @HyukjinKwon

Contributor:

Maybe my real question is: will we have compatibility issues with existing pyspark DataFrame code (it needs different imports, of course) if we discard such APIs? I see many other similar APIs in the pyspark DataFrame.

@zhengruifeng (Contributor, Author):

users may have to change their code for this migration, but I think this is also a chance to make some changes.

Contributor:

Sure. We can also go in that direction.

    self,
    withReplacement: Optional[Union[float, bool]] = None,
    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
Member:

Maybe we should just leverage keyword-only arguments, which would make the logic much simpler. Actually, we wanted to do this in the PySpark API layer in the past. Since this is a new API layer, I think it's a good chance to replace them. cc @ueshin

@zhengruifeng (Contributor, Author):

yes, that's a bit confusing at first glance.

Member:

Yes, if we can break the signature, it would be:

def sample(
    self,
    fraction: float,
    *,
    withReplacement: Optional[bool] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    ...
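
With the keyword-only signature, call sites look like this (illustrative):

    df.sample(0.1)                                 # fraction is positional
    df.sample(0.1, withReplacement=True, seed=42)  # the rest must be named
    df.sample(0.1, True, 42)                       # TypeError: extra positional args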

Member:

withReplacement can be `: bool = False` if the default is False.

@zhengruifeng (Contributor, Author):

I like this idea

    Seed seed = 5;
}

message Seed {
@zhengruifeng (Contributor, Author):

I need to define Seed outside of Sample; otherwise there is no hasSeed method in the generated files.

Contributor:

This is not true. The has* methods are generated for non-simple types.
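
For example, under proto3 Python semantics (a sketch, assuming a generated Sample message with a message-typed seed and a scalar with_replacement):

    s = proto.Sample()
    s.HasField("seed")               # OK: presence is tracked for message fields
    s.HasField("with_replacement")   # ValueError: plain proto3 scalars lack presence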

@zhengruifeng (Contributor, Author):

you are right; maybe the jars were out of sync at that time. Let me move Seed into Sample.

Contributor:

Yeah, I always do a clean and then a build.

@amaliujia (Contributor) commented:

LGTM

@zhengruifeng zhengruifeng force-pushed the connect_df_sample branch 2 times, most recently from 4c137ac to 1bd8f65 Compare October 21, 2022 02:07
Commits: nit · fix · fix · fix · fix lint · mark as todo · mark as todo · make seed a msg · mv seed outside of sample · mv seed outside of sample · nit · nit · mv Seed into Sample · fix scala lint · change signature
@HyukjinKwon (Member) commented:

Merged to master.

@zhengruifeng (Contributor, Author) commented:

thank you guys

@zhengruifeng zhengruifeng deleted the connect_df_sample branch October 21, 2022 11:04
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38310 from zhengruifeng/connect_df_sample.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>