Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38774][PYTHON] Implement Series.autocorr #36048

Closed

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Apr 2, 2022

What changes were proposed in this pull request?

Implement Series.autocorr

Why are the changes needed?

for API coverage

Does this PR introduce any user-facing change?

yes, Series now support function autocorr

In [86]: s = pd.Series([.2, .0, .6, .2, np.nan, .5, .6])

In [87]: s.autocorr()
Out[87]: -0.14121975762272054

How was this patch tested?

added doctest

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Apr 2, 2022

check against the Pandas side:

In [86]: s = pd.Series([.2, .0, .6, .2, np.nan, .5, .6])

In [87]: s.autocorr()
Out[87]: -0.14121975762272054

In [88]: s.autocorr(0)
Out[88]: 1.0

In [89]: s.autocorr(2)
Out[89]: 0.9707253433941511

In [90]: s.autocorr(-3)
Out[90]: 0.2773500981126146

In [91]: s.autocorr(5)
Out[91]: -0.9999999999999998

In [92]: s.autocorr(6)
/home/zrf/.zrf/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:2683: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar, dtype=dtype)
/home/zrf/.zrf/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:2542: RuntimeWarning: divide by zero encountered in true_divide
  c *= np.true_divide(1, fact)
Out[92]: nan



@HyukjinKwon
Copy link
Member

Thanks for working on this @zhengruifeng ! cc @ueshin @xinrong-databricks @itholic FYI!

return (
self._internal.spark_frame.select([scol, lag_col])
.dropna("any")
.corr("__tmp_col__", "__tmp_lag_col__")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we define the column names in variables which are reused throughout the method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, will update soon

Copy link
Contributor

@awdavidson awdavidson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add some basic tests around this?

@xinrong-meng
Copy link
Member

Thanks @zhengruifeng!

https://github.com/apache/spark/blob/master/python/pyspark/pandas/tests/test_series.py is a good place to add tests.

It would be great to specify what changes in Does this PR introduce any user-facing change? section of the PR description. An example is good enough.

@zhengruifeng
Copy link
Contributor Author

@xinrong-databricks Will add the tests and update the PR description, thanks!


This method computes the Pearson correlation between
the Series and its shifted self.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add .. versionadded:: 3.4.0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's document the .. versionadded:: 3.4.0 here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor

@awdavidson awdavidson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@zhengruifeng
Copy link
Contributor Author

cc @HyukjinKwon , I think this PR is ready too


Notes
-----
If the Pearson correlation is not well defined return 'NaN'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also add a note about the global window operation and its performance impact.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@HyukjinKwon
Copy link
Member

Build link: https://github.com/zhengruifeng/spark/runs/6003246221

@zhengruifeng
Copy link
Contributor Author

all tests passed

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng
Copy link
Contributor Author

Thanks all for reviewing!

@zhengruifeng zhengruifeng deleted the pandas_series_autocorr branch April 13, 2022 09:18
zhengruifeng added a commit that referenced this pull request Oct 13, 2022
… consistent with Pandas

### What changes were proposed in this pull request?
in `Series.autocorr`, rename `periods` as `lag`

### Why are the changes needed?
when implementing the `Series.autocorr` in my first PS PR #36048 , I wrongly follow the parameter name `min_periods` in `Series.corr`, it should be `lag` to be the same with [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.autocorr.html)

### Does this PR introduce _any_ user-facing change?
no, since 3.4 is not released

### How was this patch tested?
existing UTs

Closes #38216 from zhengruifeng/ps_ser_autocorr_rename_parameter.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… consistent with Pandas

### What changes were proposed in this pull request?
in `Series.autocorr`, rename `periods` as `lag`

### Why are the changes needed?
when implementing the `Series.autocorr` in my first PS PR apache#36048 , I wrongly follow the parameter name `min_periods` in `Series.corr`, it should be `lag` to be the same with [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.autocorr.html)

### Does this PR introduce _any_ user-facing change?
no, since 3.4 is not released

### How was this patch tested?
existing UTs

Closes apache#38216 from zhengruifeng/ps_ser_autocorr_rename_parameter.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants