[SPARK-40839][CONNECT][PYTHON] Implement DataFrame.sample
#38310
Conversation
python/pyspark/sql/dataframe.py (outdated)

```python
@staticmethod
def _prepare_augments_for_sample(
```
The pre-processing of sample arguments is pretty complex, so make it a static method and reuse it in Connect.
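As a rough sketch of what such a shared static helper might look like (the method name comes from the diff above, but the body and validation details here are illustrative assumptions, not the actual pyspark implementation):

```python
from typing import Optional, Tuple, Union


class DataFrame:
    @staticmethod
    def _prepare_augments_for_sample(
        withReplacement: Optional[Union[float, bool]] = None,
        fraction: Optional[Union[int, float]] = None,
        seed: Optional[int] = None,
    ) -> Tuple[bool, float, Optional[int]]:
        """Normalize the overloaded sample(...) arguments (sketch, not the real body)."""
        # Support the sample(fraction) form: the first positional argument is
        # actually the fraction when it is a float and no fraction was given.
        if isinstance(withReplacement, float) and fraction is None:
            withReplacement, fraction = None, withReplacement
        if withReplacement is None:
            withReplacement = False
        if not isinstance(fraction, (int, float)):
            raise TypeError("fraction must be a number")
        return bool(withReplacement), float(fraction), seed
```

Factoring the normalization out this way would let both the classic pyspark `DataFrame.sample` and the Connect client call one helper instead of duplicating the argument-shuffling logic.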
If we do need to share code between pyspark and spark connect python client, we should probably add a new module like pyspark-common
Force-pushed `DataFrame.sample` from 8953e7f to c114ba4.
```python
    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    from pyspark.sql import DataFrame as PySparkDataFrame
```
Oh, does the Spark Connect Python client depend on pyspark? Then it's not a thin client any more...
Yes, this now depends on pyspark; in fact it has depended on pyspark since the first PR. For the short term it is OK. cc @HyukjinKwon
I guess we will need to make a final decision on whether it should depend on it before doing the Python packaging and release.
```python
if withReplacement is None:
    withReplacement = False
if seed is None:
    # TODO: make 'seed' optional in proto, then we can use 'Utils.random.nextLong' in JVM
```
@amaliujia We should really consider this. The principle is to move code implementation to the server side as much as possible. We just moved the identifier parsing logic to server side, and we should probably do the same for parameter default values.
This makes sense.
@zhengruifeng I am thinking you can wrap this seed into a proto message, so that the server side can know whether it is set or not. In that case, the server side can do the random generation rather than using the value from the proto.
This is an example: #38275
yeah, let me make this change
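To illustrate the idea with plain Python (a stand-in model, not the actual proto-generated classes: `SampleRelation` and `resolve_seed` are hypothetical names, and `seed is None` models an unset `Seed` sub-message on which the server could otherwise call a `has_seed`-style presence check):

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRelation:
    """Plain-Python stand-in for the Sample proto message."""
    lower_bound: float
    upper_bound: float
    with_replacement: bool = False
    seed: Optional[int] = None  # None models an unset Seed wrapper message


def resolve_seed(rel: SampleRelation) -> int:
    # Server side: only generate a random seed when the client did not set
    # one, mirroring what Utils.random.nextLong would do on the JVM.
    if rel.seed is not None:
        return rel.seed
    return random.randrange(-(2**63), 2**63)  # a random signed 64-bit value
```

With this shape, the default-value logic lives entirely on the server, which matches the principle of moving implementation to the server side.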
```python
    withReplacement, fraction, seed
)
if withReplacement is None:
    withReplacement = False
```
The default bool value for proto is `False`, so this is probably not needed.
Oh, the Plan definition is not Optional for `withReplacement`. In this case setting it to `False` probably makes sense.
```python
class Sample(LogicalPlan):
    def __init__(
        self,
        child: Optional["LogicalPlan"],
        lower_bound: float,
        upper_bound: float,
        with_replacement: bool,
        seed: int,
    ) -> None:
```
```python
def sort(self, *cols: "ColumnOrString") -> "DataFrame":
    """Sort by a specific column"""
    return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session)

def sample(
```
The pyspark DataFrame API has:

```python
@overload
def sample(self, fraction: float, seed: Optional[int] = ...) -> "DataFrame":
    ...

@overload
def sample(
    self,
    withReplacement: Optional[bool],
    fraction: float,
    seed: Optional[int] = ...,
) -> "DataFrame":
    ...
```

Can we match it (as easily as copying the API into the connect dataframe.py)?
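For context, `@overload` declarations only inform the type checker; at runtime a single implementation must disambiguate the forms itself, which is where the complexity comes from. A minimal standalone sketch (a module-level function with an illustrative string result, not the pyspark code):

```python
from typing import Optional, overload


@overload
def sample(fraction: float, seed: Optional[int] = ...) -> str: ...
@overload
def sample(withReplacement: Optional[bool], fraction: float,
           seed: Optional[int] = ...) -> str: ...


def sample(*args):
    # The single runtime implementation disambiguates the two declared
    # forms by the type of the first positional argument.
    if args and isinstance(args[0], float):
        fraction, *rest = args
        withReplacement = False
    else:
        withReplacement, fraction, *rest = args
    seed = rest[0] if rest else None
    return f"{bool(withReplacement)},{fraction},{seed}"
```

This is why keeping the overloads means keeping the argument-shuffling logic alongside them.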
I guess we can discard those ones? @HyukjinKwon
Maybe my real question was: will we have compatibility issues with existing pyspark DataFrame code (it needs different imports, of course) if we discard such an API? I see many other similar APIs existing for the pyspark DataFrame.
Users may have to change their code for this migration, but I think this is also a chance to make some changes.
Sure. We can also go in that direction.
```python
    self,
    withReplacement: Optional[Union[float, bool]] = None,
    fraction: Optional[Union[int, float]] = None,
    seed: Optional[int] = None,
```
Maybe we should just leverage keyword-only arguments, which would make the logic much simpler. Actually we wanted to do this in the PySpark API layer in the past. Since this is a new API layer, I think it's a good chance to replace them. cc @ueshin
yes, that's a bit confusing at first glance.
Yes, if we can break the signature, it would be:

```python
def sample(
    self,
    fraction: float,
    *,
    withReplacement: Optional[bool] = None,
    seed: Optional[int] = None,
) -> "DataFrame":
    ...
```
`withReplacement` can be `: bool = False` if the default is `False`.
I like this idea
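A quick illustration of how the keyword-only form behaves (a standalone sketch of the proposed signature with a stub body, dropping `self` so it runs as a plain function):

```python
from typing import Optional


def sample(
    fraction: float,
    *,
    withReplacement: bool = False,
    seed: Optional[int] = None,
) -> str:
    # Stub body: just describe the resolved arguments.
    return f"sample(fraction={fraction}, withReplacement={withReplacement}, seed={seed})"


# Callers must pass withReplacement and seed by keyword:
ok = sample(0.1, withReplacement=True, seed=42)

# Passing them positionally raises TypeError, removing the old ambiguity
# where the first positional argument could be either a fraction or a flag:
try:
    sample(0.1, True)
    rejected = False
except TypeError:
    rejected = True
```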
Force-pushed from c114ba4 to 374bffc.
```protobuf
  Seed seed = 5;
}

message Seed {
```
I need to define `Seed` outside of `Sample`, otherwise there is no `HasSeed` method in the generated files.
This is not true. The `has*` methods are generated for non-simple types.
You are right; maybe the jars were out of sync at that time. Let me move `Seed` into `Sample`.
Yeah, I always do a clean and then build.
Force-pushed from 906f792 to 7506c56.
LGTM
Force-pushed from 4c137ac to 1bd8f65.
Force-pushed from 1bd8f65 to 25f4f75.
Merged to master.
Thank you, guys!
### What changes were proposed in this pull request?
Implement `DataFrame.sample` in Connect.

### Why are the changes needed?
For DataFrame API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, new API:

```python
def sample(
    self,
    fraction: float,
    *,
    withReplacement: bool = False,
    seed: Optional[int] = None,
) -> "DataFrame":
```

### How was this patch tested?
Added UT.

Closes apache#38310 from zhengruifeng/connect_df_sample.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>