Add rolling cov #5154
Thanks for the PR. Since |
It looks like there is some numerical imprecision when pandas calculates the rolling mean (as part of calculating cov), and the exact result depends on the array length. So the difference in
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([0.01, 0.1, 0.01, 0.1]),
        "b": np.array([1.0, 1.0, 1.0, 1.0]),
    }
)
df_half = df.iloc[2:]
# Same trailing rows, different array length: compare the two cov outputs.
print(df.rolling(1).cov())
print(df_half.rolling(1).cov())

df = pd.DataFrame({"a": np.array([0.01, 0.1, 0.01, 0.1])})
means = df.rolling(1).mean()["a"]
# Both positions hold the same input value, 0.01.
print(means[0], means[2])
Should we just use a separate dataset for |
I tried this out locally on a few random datasets and everything seems great. Using a different dataset or tweaking the tests appropriately seems fine to me.
Tests were failing on certain pandas versions because of the Panel deprecation warning (Panel was deprecated in 0.20.0 and removed in 0.25.0).
@@ -166,6 +166,29 @@ def test_rolling_methods(method, args, window, center, check_less_precise):
    )


if PANDAS_VERSION <= "0.25.0" and PANDAS_VERSION >= "0.20.0":
Does this need to be conditional on the pandas version? I don't see any harm in always applying the warning.
I was thinking that once all CI test envs use pandas >= 0.25.0, this whole thing (filter_panel_warning) could be removed; right now dask-35.yml is failing without it. By "always applying the warning", did you mean letting CI fail?
Sorry I meant "always apply the warnings filter". Not a big deal though.
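For reference, an unconditional warnings filter of the kind suggested above could look roughly like the sketch below. It uses only the standard `warnings` module; `ignore_panel_warning` and `noisy_cov` are hypothetical names made up for illustration, the message pattern is an assumption, and this is not the PR's `filter_panel_warning`.

```python
import contextlib
import warnings


@contextlib.contextmanager
def ignore_panel_warning():
    """Hypothetical helper: silence Panel-related FutureWarnings in a block."""
    with warnings.catch_warnings():
        # Assumed message pattern; the real warning text may differ.
        warnings.filterwarnings(
            "ignore", message=".*Panel.*", category=FutureWarning
        )
        yield


def noisy_cov():
    # Stand-in for a pandas call that emits the Panel deprecation warning.
    warnings.warn("Panel is deprecated and will be removed", FutureWarning)
    return 1.0


with ignore_panel_warning():
    result = noisy_cov()  # no warning escapes this block
```

Because the filter only matches when the warning is actually raised, applying it on every pandas version should be harmless; the version check mainly documents when the filter can be deleted.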
dask/dataframe/rolling.py
Outdated
@@ -39,10 +39,15 @@ def overlap_chunk(
    if isinstance(before, datetime.timedelta):
        before = len(prev_part)

    expansion = out.shape[0] // combined.shape[0]
Is this likely to cause new issues with empty partitions? e.g. the one from
In [33]: df = pd.DataFrame({"A": range(12), "B": [True] * 3 + [False] * 3 + [True] * 6})
In [34]: ddf = dd.from_pandas(df, 4)
ddf[df.B].get_partition(1).compute()
We may already have issues with empty partitions, in which case don't worry about it.
I didn't find a test case to trigger this but I'm thinking a check here doesn't add much complexity and could prevent confusion later.
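A minimal sketch of such a check, assuming the concern is the integer division when `combined` has zero rows. `expansion_factor` is a hypothetical helper that takes the two row counts (`out.shape[0]` and `combined.shape[0]`), and returning 1 for an empty input is an assumption rather than what the PR necessarily does.

```python
def expansion_factor(n_out: int, n_combined: int) -> int:
    """Hypothetical helper: output rows produced per input row.

    Mirrors `out.shape[0] // combined.shape[0]` from the diff, but returns 1
    for an empty input instead of raising ZeroDivisionError.
    """
    if n_combined == 0:
        # Assumption: an empty partition needs no expansion.
        return 1
    return n_out // n_combined


# One output row per input row, e.g. a rolling mean: 4 rows in, 4 rows out.
same_shape = expansion_factor(4, 4)
# Pairwise rolling cov on a 2-column frame: 4 rows in, 8 rows out.
pairwise_cov = expansion_factor(8, 4)
# Empty partition: no rows in, no rows out.
empty = expansion_factor(0, 0)
```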
Restarted the failing CI (http error). Things look good here I think.
Thanks @ivarsfg!
The issue is that since cov output is 2D (#4053 (comment)), chunk overlap is calculated incorrectly. In the cov tests, the output differed only in infs and NaNs. I've marked this as WIP because I'm not completely sure this is a good approach.
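To illustrate the shape problem with plain pandas: pairwise rolling cov on a DataFrame returns one block of rows per input row, so a count of overlap rows measured on the input no longer lines up with the output. The sketch below assumes the overlap is trimmed positionally with `iloc`; the data and the `before` value are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 4.0, 7.0], "b": [1.0, 3.0, 2.0, 5.0]})

# Pairwise cov: the result has a MultiIndex of (row label, column name),
# i.e. len(df) * len(df.columns) rows.
out = df.rolling(2).cov()

# Output rows produced per input row (2 here, one per column).
expansion = out.shape[0] // df.shape[0]

# Pretend the first input row was borrowed from the previous partition.
before = 1

# Trimming `before` rows would cut a block in half;
# trimming `before * expansion` rows removes the whole block for row 0.
trimmed = out.iloc[before * expansion:]
remaining_rows = trimmed.index.get_level_values(0).unique().tolist()
```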
black dask / flake8 dask
closes issue #4053