[SPARK-41066][CONNECT][PYTHON] Implement DataFrame.sampleBy and DataFrame.stat.sampleBy #39328
Conversation
@@ -546,6 +547,34 @@ message StatFreqItems {
  optional double support = 3;
}

// Returns a stratified sample without replacement based on the fraction
// given on each stratum.
"It will invoke 'Dataset.stat.sampleBy' (same as 'StatFunctions.sampleBy') to compute the results."
should be added.
It would be better to keep the comment
nice, will update
repeated Fraction fractions = 3;

// (Optional) The random seed.
optional int64 seed = 5;
It seems it should be required.
Here I want to keep in line with other methods, which generate a random seed on the server if one is not provided in the proto.
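A minimal pure-Python sketch of that convention, assuming the server simply fills in a random 63-bit seed when the client omits it (the helper name and seed range are assumptions, not Spark code):

```python
import random

def resolve_seed(seed=None):
    # Hypothetical illustration of the convention discussed above:
    # if the client omits the seed, the server generates one rather
    # than rejecting the request.
    if seed is not None:
        return seed
    return random.randint(0, 2**63 - 1)
```

A client-supplied seed passes through unchanged, so results stay reproducible when the caller wants them to be.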
@@ -419,6 +421,26 @@ class SparkConnectPlanner(session: SparkSession) {
  }
}

private def transformStatSampleBy(rel: proto.StatSampleBy): LogicalPlan = {
  val fractions = mutable.Map.empty[Any, Double]
How about
val fractions = rel.getFractionsList.asScala.toSeq.map { protoFraction =>
...
}
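In Python terms, the difference between the mutable-map accumulation and the suggested functional construction can be sketched like this (the fraction pairs are made-up stand-ins for the proto messages):

```python
# Hypothetical stand-ins for the proto Fraction messages: (stratum, fraction).
proto_fractions = [("a", 0.1), ("b", 0.5)]

# Mutable-accumulator style (what the original code does).
fractions_mut = {}
for stratum, frac in proto_fractions:
    fractions_mut[stratum] = frac

# Comprehension style (the spirit of the suggested rewrite):
# build the mapping in one expression, no mutation.
fractions = {stratum: frac for stratum, frac in proto_fractions}
```

Both produce the same mapping; the functional form just avoids an intermediate mutable variable.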
def sampleBy(
    self, col: "ColumnOrName", fractions: Dict[Any, float], seed: Optional[int] = None
) -> "DataFrame":
    if not isinstance(col, (Column, str)):
The behavior is changed from PySpark SQL, which does:

if isinstance(col, str):
    col = Column(col)
There is no behavior change, since the underlying plan can accept a ColumnOrName.
Got it.
for k, v in fractions.items():
    assert v is not None and isinstance(v, float)

assert seed is None or isinstance(seed, int)
Do we need to add these checks here?
I prefer adding assertions in the plan layer to make sure all parameters are as expected.
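As a standalone Python sketch, the plan-layer validation being discussed might look like this (the function name and error messages are hypothetical; the checks mirror the assertions shown above):

```python
from typing import Any, Dict, Optional

def validate_sample_by_args(fractions: Dict[Any, float], seed: Optional[int]) -> None:
    # Every fraction must be a real float (not None, not an int),
    # and the seed must be an int when it is given at all.
    for stratum, frac in fractions.items():
        if frac is None or not isinstance(frac, float):
            raise TypeError(f"fraction for stratum {stratum!r} must be a float, got {frac!r}")
    if seed is not None and not isinstance(seed, int):
        raise TypeError("seed must be an int or None")
```

Raising a typed error instead of a bare assert gives the caller an actionable message, at the cost of slightly more code.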
I think we can let the server side require that the client must pass a seed.
Merged to master.
What changes were proposed in this pull request?
Implement DataFrame.sampleBy and DataFrame.stat.sampleBy.

Why are the changes needed?
For API coverage.

Does this PR introduce any user-facing change?
Yes.

How was this patch tested?
Added unit tests.
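For context, the stratified-sampling semantics behind sampleBy (keep each row with the probability assigned to its stratum, without replacement) can be illustrated with a small pure-Python sketch. This is not Spark's implementation; the function name and row layout are made up for illustration:

```python
import random
from typing import Any, Dict, List, Optional, Sequence, Tuple

def sample_by(
    rows: Sequence[Tuple],
    key_index: int,
    fractions: Dict[Any, float],
    seed: Optional[int] = None,
) -> List[Tuple]:
    # Keep each row independently with the probability assigned to
    # its stratum; strata missing from `fractions` default to 0.0,
    # i.e. they are dropped entirely.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[key_index], 0.0)]
```

With a fraction of 1.0 every row of that stratum is kept, and with 0.0 (or an absent key) none are, regardless of the seed.

```python
rows = [(0, "x"), (1, "y"), (0, "z")]
sample_by(rows, 0, {0: 1.0}, seed=1)  # → [(0, "x"), (0, "z")]
```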