[SPARK-38774][PYTHON] Implement Series.autocorr #36048
check against the Pandas side:
Thanks for working on this @zhengruifeng! cc @ueshin @xinrong-databricks @itholic FYI!
python/pyspark/pandas/series.py
Outdated
return (
    self._internal.spark_frame.select([scol, lag_col])
    .dropna("any")
    .corr("__tmp_col__", "__tmp_lag_col__")
Nit: should we define the column names in variables which are reused throughout the method?
good point, will update soon
Would be good to add some basic tests around this?
Thanks @zhengruifeng! https://github.com/apache/spark/blob/master/python/pyspark/pandas/tests/test_series.py is a good place to add tests. It would be great to specify what changes in the "Does this PR introduce any user-facing change?" section of the PR description. An example is good enough.
@xinrong-databricks Will add the tests and update the PR description, thanks!
This method computes the Pearson correlation between
the Series and its shifted self.
Shall we add `.. versionadded:: 3.4.0`?
Yeah, let's document the `.. versionadded:: 3.4.0` here.
Sure!
+1 LGTM
cc @HyukjinKwon, I think this PR is ready too
Notes
-----
If the Pearson correlation is not well defined return 'NaN'.
Maybe we should also add a note about the global window operation and its performance impact.
LGTM otherwise
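The global-window concern raised above can be made concrete with a short PySpark sketch. This is an illustration under stated assumptions, not the PR's code: `autocorr_sketch`, `TMP_COL`, `TMP_LAG_COL`, and the `order_col` argument are hypothetical names, and only the `select` / `dropna` / `corr` chain mirrors the snippet quoted earlier in this review. The point it shows is that a window with an ordering but no `partitionBy` makes Spark move every row into a single partition before `lag` can be evaluated.

```python
# Hypothetical names, introduced only for this illustration.
TMP_COL = "__tmp_col__"
TMP_LAG_COL = "__tmp_lag_col__"


def autocorr_sketch(sdf, value_col, order_col, lag=1):
    """Correlate `value_col` with itself shifted by `lag` rows (illustrative only)."""
    # Imported lazily so this snippet loads even where PySpark is not installed.
    from pyspark.sql import Window, functions as F

    # No partitionBy here: this is a global window, so all rows are
    # shuffled into one partition, which can be slow on large inputs.
    window = Window.orderBy(order_col)

    value = F.col(value_col).cast("double")
    scol = value.alias(TMP_COL)
    lag_col = F.lag(value, lag).over(window).alias(TMP_LAG_COL)

    # Drop rows where either side is null, then let Spark compute Pearson.
    return sdf.select([scol, lag_col]).dropna("any").corr(TMP_COL, TMP_LAG_COL)
```

Keeping the temporary column names in module-level constants also addresses the earlier nit about reusing the names throughout the method.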
90ca30c to 78daa69
all tests passed
Merged to master.
Thanks all for reviewing!
… consistent with Pandas

### What changes were proposed in this pull request?
In `Series.autocorr`, rename `periods` as `lag`.

### Why are the changes needed?
When implementing `Series.autocorr` in my first PS PR #36048, I wrongly followed the parameter name `min_periods` in `Series.corr`; it should be `lag`, to be the same as [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.autocorr.html).

### Does this PR introduce _any_ user-facing change?
No, since 3.4 is not released.

### How was this patch tested?
Existing UTs.

Closes #38216 from zhengruifeng/ps_ser_autocorr_rename_parameter.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
What changes were proposed in this pull request?
Implement Series.autocorr
Why are the changes needed?
for API coverage
Does this PR introduce any user-facing change?
yes, Series now supports the `autocorr` function
How was this patch tested?
added doctest
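
As a reading aid for the behaviour the docstring describes (Pearson correlation between the Series and its shifted self, with NaN when the correlation is not well defined), here is a minimal pure-Python sketch. It is not the pandas-on-Spark implementation: the function below is a hypothetical stand-in that works on a plain list, only handles non-negative `lag`, and assumes that fewer than two valid pairs or a zero variance counts as "not well defined".

```python
import math


def autocorr(values, lag=1):
    """Pearson correlation between `values` and itself shifted by `lag`."""
    if lag < 0:
        raise ValueError("this sketch only handles lag >= 0")

    # Pair element i + lag with element i; zip stops at the shorter input.
    # Pairs with a missing value (None or NaN) are dropped, like dropna("any").
    pairs = [
        (x, y)
        for x, y in zip(values[lag:], values)
        if x is not None and y is not None
        and not math.isnan(x) and not math.isnan(y)
    ]

    n = len(pairs)
    if n < 2:
        return math.nan  # assumption: too few points means undefined

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)

    if var_x == 0 or var_y == 0:
        return math.nan  # a constant side has no defined correlation

    return cov / math.sqrt(var_x * var_y)
```

For example, `autocorr([1, 2, 3, 4, 5], lag=1)` pairs each value with its predecessor, which is a perfectly linear relationship, while `autocorr([1, 1, 1, 1])` returns NaN because both sides are constant.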