Conversation
Thanks for the PR. Since
It looks like there is some numerical imprecision when pandas calculates the rolling mean (as part of calculating cov), and the exact result depends on the array length. So the difference in

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([0.01, 0.1, 0.01, 0.1]),
        "b": np.array([1.0, 1.0, 1.0, 1.0]),
    }
)
df_half = df.iloc[2:]
# Same trailing rows, different history: full frame vs. its second half
print(df.rolling(1).cov())
print(df_half.rolling(1).cov())

# Positions 0 and 2 hold the same input value (0.01); compare their rolling means
df = pd.DataFrame({'a': np.array([0.01, 0.1, 0.01, 0.1])})
means = df.rolling(1).mean()['a']
print(means[0], means[2])
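To illustrate how a rolling mean can depend on what came before the window, here is a minimal pure-Python sketch. It assumes an add/remove running-sum scheme; `rolling_mean_add_remove` is a hypothetical stand-in, not pandas' actual implementation. Because the running total carries rounding residue from earlier values, the mean at a given row may differ in the last bits from the mean computed on a shorter array.

```python
def rolling_mean_add_remove(values, window):
    """Rolling mean via a running sum: add the entering value, subtract the leaving one.

    Hypothetical simplification for illustration only.
    """
    total = 0.0
    means = []
    for i, x in enumerate(values):
        total += x                      # value entering the window
        if i >= window:
            total -= values[i - window]  # value leaving the window
        means.append(total / window if i >= window - 1 else float("nan"))
    return means


data = [0.01, 0.1, 0.01, 0.1]
full = rolling_mean_add_remove(data, 1)      # row 2 is reached after two add/remove steps
half = rolling_mean_add_remove(data[2:], 1)  # the same row is now the first one

# Any gap here is pure floating-point residue from the earlier rows.
print(full[2], half[0], full[2] - half[0])
```

The two results agree to within rounding error, which is harmless for the mean itself but matters once a later step divides by zero or compares results exactly.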
Should we just use a separate dataset for
I tried this out locally on a few random datasets and everything seems great. Using a different dataset or tweaking the tests appropriately seems fine to me.
Tests were failing on certain pandas versions because of the Panel deprecation warning (deprecated in 0.20.0, removed in 0.25.0).
)

if PANDAS_VERSION <= "0.25.0" and PANDAS_VERSION >= "0.20.0":
Does this need to be conditional on the pandas version? I don't see any harm in always applying the warning.
I was thinking that once pandas >= 0.25.0 is used in all CI test envs, this whole thing (filter_panel_warning) could be removed; right now dask-35.yml fails without it. By "always applying the warning", did you mean letting CI fail?
Sorry I meant "always apply the warnings filter". Not a big deal though.
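For readers unfamiliar with the "always apply the warnings filter" option, a rough sketch using only the standard `warnings` module follows. The thread names `filter_panel_warning` but does not show its body, so the helper `ignore_panel_warnings` and the message pattern below are assumptions made for illustration.

```python
import contextlib
import warnings


@contextlib.contextmanager
def ignore_panel_warnings():
    """Hypothetical helper: silence deprecation-style warnings that mention Panel."""
    with warnings.catch_warnings():
        # Assumed pattern and categories; the real warning text may differ.
        for category in (DeprecationWarning, FutureWarning):
            warnings.filterwarnings("ignore", message=".*Panel.*", category=category)
        yield


def count_warnings(use_filter):
    """Emit a fake Panel warning and report how many warnings were recorded."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if use_filter:
            with ignore_panel_warnings():
                warnings.warn("Panel is deprecated", FutureWarning)
        else:
            warnings.warn("Panel is deprecated", FutureWarning)
    return len(caught)


print(count_warnings(use_filter=False), count_warnings(use_filter=True))
```

Applying such a filter unconditionally is harmless on versions that never emit the warning, which is the trade-off the reviewer points at.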
dask/dataframe/rolling.py
Outdated
if isinstance(before, datetime.timedelta):
    before = len(prev_part)

expansion = out.shape[0] // combined.shape[0]
Is this likely to cause new issues with empty partitions? e.g. the one from
In [33]: df = pd.DataFrame({"A": range(12), "B": [True] * 3 + [False] * 3 + [True] * 6})
In [34]: ddf = dd.from_pandas(df, 4)
ddf[df.B].get_partition(1).compute()
We may already have issues with empty partitions, in which case don't worry about it.
I didn't find a test case to trigger this but I'm thinking a check here doesn't add much complexity and could prevent confusion later.
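The role of `expansion` can be sketched with plain pandas. This is a simplified illustration under stated assumptions, not the code in dask/dataframe/rolling.py: `trim_overlap` is a hypothetical helper, and the fallback of 1 for an empty `combined` is one possible guard rather than the check that was actually added.

```python
import numpy as np
import pandas as pd


def trim_overlap(out, combined, before):
    """Drop the output rows that belong to the `before` overlap rows.

    Hypothetical helper: rolling cov emits several output rows per input row,
    so the number of rows to drop is scaled by that expansion factor.
    """
    if combined.shape[0] == 0:
        expansion = 1  # assumed guard against dividing by zero on empty partitions
    else:
        expansion = out.shape[0] // combined.shape[0]
    return out.iloc[before * expansion:], expansion


df = pd.DataFrame(
    {"a": [1.0, 2.0, 4.0, 7.0, 11.0, 16.0], "b": [2.0, 1.0, 3.0, 0.0, 5.0, 4.0]}
)
before = 1                                        # rows borrowed from the previous partition
combined = pd.concat([df.iloc[2:3], df.iloc[3:]])  # overlap row + current partition
out = combined.rolling(2).cov()                    # two output rows per input row

trimmed, expansion = trim_overlap(out, combined, before)
expected = df.rolling(2).cov().iloc[3 * expansion:]
print(expansion, trimmed.shape)
```

Without the scaling, only `before` rows would be dropped and half of the overlap row's output would leak into the partition's result.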
Restarted the failing CI (HTTP error). Things look good here I think.
Thanks @ivarsfg!
The issue is that, because the cov output is 2D (#4053 (comment)), the chunk overlap is calculated incorrectly. In the cov tests the output differed only in infs and nans. I've marked this as WIP because I'm not completely sure this is a good approach.
black dask / flake8 dask
closes issue #4053
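One plausible reading of "differed only in infs and nans" is sketched below. It assumes a sample covariance with ddof = 1 evaluated over a single-observation window (as in the `rolling(1)` example in this thread), so the denominator is zero; `sample_cov_single_window` is a hypothetical function, not pandas code. Under that assumption, an exactly zero numerator gives nan while a tiny rounding residue gives inf.

```python
import numpy as np


def sample_cov_single_window(numerator):
    """Divide by (n - ddof) with n = 1 and ddof = 1, i.e. by zero.

    Hypothetical illustration of the final division step only.
    """
    n, ddof = 1, 1
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.float64(numerator) / np.float64(n - ddof)


exact = sample_cov_single_window(0.0)    # 0 / 0
noisy = sample_cov_single_window(1e-18)  # rounding-sized residue / 0
print(exact, noisy)
```

This would explain why results that are equal up to rounding elsewhere can still disagree between nan and inf in such degenerate windows.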