[SPARK-40399][PS] Make pearson correlation in DataFrame.corr support missing values and min_periods #37845
zhengruifeng wants to merge 2 commits into apache:master
Conversation
```python
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)

def test_corr(self):
```
The existing test mixes `df.corr` and `ser.corr`, so it is not easy to reuse; I added a new one.
I will delete it after both `df.corr` and `ser.corr` are refactored.
Can we note this in a comment at the top of `test_dataframe_corr`, so as not to forget?
Hm, does another library or method in Spark do this? It feels weird to have a method that computes "mostly a correlation", ignoring data.
@srowen The existing implementation calls the correlation in Spark ML (via `pyspark.pandas.ml`), which does not support missing values or `min_periods`. But pandas-API-on-Spark should follow pandas' semantics.
also cc @HyukjinKwon @itholic @xinrong-meng
Where data is missing, the correlation is really just undefined. Why not return NaN? This is mixing in some arbitrary logic to skip some data; I don't see the point of that. If the caller wants to do that, the caller can just do that first.
But pandas-API-on-Spark should follow the behavior of pandas, which ignores missing values and computes the correlation on the remaining data.
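As a small illustration of that pandas behavior (column names `a`/`b` and the data are made up for this sketch), pandas performs pairwise deletion: each cell of the correlation matrix is computed on the rows where both columns are non-null:

```python
import numpy as np
import pandas as pd

# toy frame: the last value of "a" and the third value of "b" are missing
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, np.nan],
                   "b": [4.0, 5.0, np.nan, 7.0]})

# pandas ignores rows where either column is NaN for this pair,
# so corr(a, b) is computed on the first two rows only
r = df.corr().loc["a", "b"]

# equivalent manual pairwise deletion
r_manual = df[["a", "b"]].dropna().corr().loc["a", "b"]
```

With only two complete pairs left, both `r` and `r_manual` come out as a perfect correlation of 1.0.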
Ok if it matches pandas behavior. Do we need this `min_periods` option? Is that from pandas?
Yes, pandas' `DataFrame.corr` also has the `min_periods` parameter.
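In pandas, `min_periods` is the minimum number of complete (non-null) pairs required per pair of columns for a valid result; below that threshold the cell becomes NaN. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan],
                   "b": [4.0, 6.0, 5.0]})

# only two complete (a, b) pairs exist, so min_periods=3 yields NaN
r2 = df.corr(min_periods=2).loc["a", "b"]
r3 = df.corr(min_periods=3).loc["a", "b"]
```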
itholic left a comment:

Looks pretty good otherwise
```python
if min_periods is not None and not isinstance(min_periods, int):
    raise TypeError(f"Invalid min_periods type {type(min_periods).__name__}")
```
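A minimal standalone sketch of the check above (the helper name and the `None` → 1 fallback are assumptions for illustration; the fallback mirrors pandas' documented default of `min_periods=1`):

```python
def validate_min_periods(min_periods):
    # hypothetical standalone version of the PR's type check:
    # reject any non-int (pandas happens to accept floats only
    # because it lacks this validation)
    if min_periods is not None and not isinstance(min_periods, int):
        raise TypeError(f"Invalid min_periods type {type(min_periods).__name__}")
    return 1 if min_periods is None else min_periods
```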
Seems like pandas allows float:

```python
>>> pdf.corr('pearson', min_periods=1.4)
          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000
```

But I'm not sure if it's intended behavior or not, since it raises `TypeError: an integer is required` when the type is str, as below:

```python
>>> pdf.corr('pearson', min_periods='a')
Traceback (most recent call last):
...
TypeError: an integer is required
```
I am also not sure, but the type of `min_periods` is expected to be int in pandas as well. I think `pdf.corr('pearson', min_periods=1.4)` works in pandas only because a validation is missing there.
python/pyspark/pandas/frame.py (Outdated)
```python
tmp_index_1_col = verify_temp_column_name(sdf, "__tmp_index_1_col__")
tmp_index_2_col = verify_temp_column_name(sdf, "__tmp_index_2_col__")
tmp_value_1_col = verify_temp_column_name(sdf, "__tmp_value_1_col__")
tmp_value_2_col = verify_temp_column_name(sdf, "__tmp_value_2_col__")
```
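For context, `verify_temp_column_name` guards against a reserved temp name colliding with an existing column. A rough pure-Python sketch of that contract (not Spark's actual implementation, which operates on a Spark DataFrame and may check more):

```python
def verify_temp_column_name_sketch(existing_columns, name):
    # rough sketch only: reject a temp name that a user column
    # already occupies, otherwise hand it back for use
    if name in existing_columns:
        raise AssertionError(f"{name} already exists in the DataFrame")
    return name
```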
nit: how about `tmp_index_x_col_name` instead of `tmp_index_x_col`, to explicitly indicate that it's the name of the column rather than the column itself?
ok, let me update them
```python
# +---+---+----+

pair_scols: List[Column] = []
for i in range(0, num_scols):
```
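The loop above (together with an inner loop not shown in this diff hunk) enumerates column pairs. A hypothetical pure-Python sketch of the upper-triangle enumeration a pairwise correlation needs; the column names are made up:

```python
cols = ["dogs", "cats", "birds"]  # made-up column names
num_scols = len(cols)

# one (i, j) pair with i <= j per cell of the upper triangle,
# which is enough since corr(x, y) == corr(y, x)
pairs = [(i, j) for i in range(0, num_scols) for j in range(i, num_scols)]
```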
nit: we can omit the 0 since it's the default?

```diff
- for i in range(0, num_scols):
+ for i in range(num_scols):
```

Either looks okay, though.
Just because other places use `range(0, x)` instead of `range(x)`.
python/pyspark/pandas/frame.py (Outdated)

```python
# | 1| 2| 1.0| null|
# | 2| 2| null| null|
# +-------------------+-------------------+-------------------+-------------------+
tmp_tuple_col = verify_temp_column_name(sdf, "__tmp_tuple_col__")
```
Ditto? I think `tmp_tuple_col_name` is a bit more explicit.
python/pyspark/pandas/frame.py (Outdated)

```python
tmp_corr_col = verify_temp_column_name(sdf, "__tmp_pearson_corr_col__")
tmp_count_col = verify_temp_column_name(sdf, "__tmp_count_col__")
```
python/pyspark/pandas/frame.py (Outdated)

```python
# | 1|[{0, -1.0}, {1, 1...|
# | 2|[{0, null}, {1, n...|
# +-------------------+--------------------+
tmp_array_col = verify_temp_column_name(sdf, "__tmp_array_col__")
```
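The `__tmp_pearson_corr_col__`/`__tmp_count_col__` pair suggests each pairwise correlation is masked by its non-null pair count. A hedged pure-Python sketch of that final step (the PR itself does this in Spark SQL; the function name here is hypothetical):

```python
def mask_by_min_periods(corr, count, min_periods):
    # hypothetical sketch: keep the correlation only when enough
    # complete pairs contributed to it, otherwise emit NaN
    if count < min_periods:
        return float("nan")
    return corr
```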
```python
self.assert_eq(psdf.corr(), pdf.corr(), check_exact=False)
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)
```
Can we also test chained operations? e.g.

```python
self.assert_eq((psdf + 1).corr(), (pdf + 1).corr(), check_exact=False)
```

nice, let me add it
qq: btw, can we say it's a "new API" in the PR description? Maybe "new parameter support" or something like that instead?
```python
"""
Compute pairwise correlation of columns, excluding NA/null values.

.. versionadded:: 3.3.0
```
Merged into master, thanks @srowen @itholic @HyukjinKwon for the reviews!
### What changes were proposed in this pull request?
Remove `pyspark.pandas.ml`

### Why are the changes needed?
`pyspark.pandas.ml` is no longer needed, since we implemented correlations based on Spark SQL:
1. pearson correlation implemented in #37845
2. spearman correlation implemented in #37874

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated suites

Closes #37968 from zhengruifeng/ps_del_ml.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

Refactor pearson correlation in `DataFrame.corr` to support missing values and `min_periods`, and to no longer depend on `VectorUDT`.

before

after
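To make the intended semantics concrete, a self-contained sketch of pearson correlation with pairwise deletion and `min_periods` (pure Python for illustration, not the PR's Spark SQL implementation; `None` stands in for a missing value):

```python
import math

def pearson_pairwise(xs, ys, min_periods=1):
    # pairwise deletion: keep only rows where both values are present
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < max(min_periods, 1):
        # too few complete pairs: the result is undefined
        return float("nan")
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den if den else float("nan")
```

For example, `pearson_pairwise([1, 2, 3, None], [2, 4, 6, 8])` drops the last row and returns 1.0, while passing `min_periods` larger than the number of complete pairs returns NaN.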
Why are the changes needed?
For API coverage, and to support the common case of data containing missing values.
Does this PR introduce any user-facing change?
Yes, API change: a new parameter is supported.
How was this patch tested?
added UT