[SPARK-39054][PYTHON][PS] Ensure infer schema accuracy in GroupBy.apply #36581
What changes were proposed in this pull request?
Ensure that at least 2 rows are sampled so that the schema inferred by `GroupBy.apply` is as accurate as possible.
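A minimal sketch of the idea, not the code in this patch: the helper name `clamp_sample_size` and its parameters are hypothetical and exist only to illustrate enforcing a lower bound of 2 on the number of sampled rows.

```python
def clamp_sample_size(requested: int, minimum: int = 2) -> int:
    """Hypothetical helper: return the number of rows to sample for schema inference.

    The sample is never smaller than `minimum` (2 by default), so that the
    sampled apply does not run on a single-row frame.
    """
    return max(requested, minimum)


# A configured sample size of 0 or 1 is raised to 2; larger values are kept.
for requested in (0, 1, 2, 1000):
    print(requested, "->", clamp_sample_size(requested))
```

Under this assumption, small sample sizes are bumped up to 2, while larger configured values pass through unchanged.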
Why are the changes needed?
`GroupBy.apply` infers the schema when type hints are not specified for the function passed to `groupby.apply`. We cannot guarantee that the schema inferred from the sampled values is completely accurate, but we should make it as accurate as possible.
Since v1.4, pandas has introduced a behavior change [1] that affects `GroupBy.apply`, in particular when the DataFrame has a single row (`head(1)`). As a result, the schema inferred from the sampled apply can mismatch the final apply results, which causes an error.
[1] https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#groupby-apply-consistent-transform-detection
Does this PR introduce any user-facing change?
No
How was this patch tested?
`test_apply_infer_schema_without_shortcut` and `test_apply_with_new_dataframe_without_shortcut` passed with pandas v1.4.