Add rolling cov #5154

Merged
merged 10 commits into dask:master on Aug 5, 2019

Conversation

@ivarsfg (Contributor) commented Jul 25, 2019

The issue is that, since the cov output is 2D (#4053 (comment)), the chunk overlap is calculated incorrectly. In the cov tests the output differed only in infs and NaNs. I've marked this as WIP because I'm not completely sure this is a good approach.

  • Tests added / passed
  • Passes black dask / flake8 dask

closes issue #4053
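
For context (this illustration is not part of the PR itself), the "2D output" refers to pandas stacking a full covariance matrix for every row, so the pairwise rolling cov result has more rows than the input and the overlap logic cannot assume one output row per input row:

```python
# Illustration only (not from the PR): pandas' pairwise rolling cov
# returns one covariance matrix per input row, stacked into a
# MultiIndexed DataFrame, so the result has len(df) * ncols rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(4.0), "b": np.arange(4.0) ** 2})
out = df.rolling(2).cov()
print(df.shape)   # (4, 2)
print(out.shape)  # (8, 2): rows expanded by the number of columns
```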

@jcrist (Member) commented Jul 25, 2019

Thanks for the PR. Since inf and nan aren't equivalent, instead of changing the tests to ignore this we should figure out how to make the cov method return the correct results where possible. Do you know why inf is appearing in these cases?

@ivarsfg (Contributor, Author) commented Jul 26, 2019

It looks like there is some numerical imprecision when pandas calculates the rolling mean (as part of calculating cov), and the exact result depends on the array length. So the difference in infs and NaNs comes from splitting the dataframe. This can be reproduced with no dask involvement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([0.01, 0.1, 0.01, 0.1]),
        "b": np.array([1.0, 1.0, 1.0, 1.0]),
    }
)
df_half = df.iloc[2:]
print(df.rolling(1).cov())
print(df_half.rolling(1).cov())
```

Output:

```
       a   b
0 a  NaN NaN
  b  NaN NaN
1 a  NaN NaN
  b  NaN NaN
2 a -inf NaN
  b  NaN NaN
3 a  NaN NaN
  b  NaN NaN
      a   b
2 a NaN NaN
  b NaN NaN
3 a NaN NaN
  b NaN NaN
```

The imprecision is visible in the rolling mean itself:

```python
df = pd.DataFrame({"a": np.array([0.01, 0.1, 0.01, 0.1])})
means = df.rolling(1).mean()["a"]
print(means[0], means[2])  # 0.01 0.009999999999999995
```

@ivarsfg (Contributor, Author) commented Jul 26, 2019

Should we just use a separate dataset for cov() tests?

@jcrist (Member) commented Jul 26, 2019

I tried this out locally on a few random datasets and everything seems great. Using a different dataset or tweaking the tests appropriately seems fine to me.
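
As a rough sketch of the kind of comparison such a test makes (this is not the PR's actual test code; the dataset, window size, and partition count are arbitrary illustrative choices), one could check that dask's rolling cov matches pandas on data that avoids the inf/NaN imprecision above:

```python
# Sketch only, not the test added in this PR.
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

df = pd.DataFrame({"a": np.random.randn(25), "b": np.random.randn(25)})
ddf = dd.from_pandas(df, npartitions=3)

# After this PR, ddf.rolling(...).cov() should agree with pandas.
assert_eq(df.rolling(4).cov(), ddf.rolling(4).cov())
```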

dask/dataframe/tests/test_rolling.py: 4 review threads (outdated, resolved)
@ivarsfg (Contributor, Author) commented Jul 29, 2019

Tests were failing on certain pandas versions because of the Panel deprecation warning (Panel was deprecated in 0.20.0 and removed in 0.25.0).

@ivarsfg changed the title from [WIP] Add rolling cov to Add rolling cov on Jul 29, 2019
@@ -166,6 +166,29 @@ def test_rolling_methods(method, args, window, center, check_less_precise):
)


if PANDAS_VERSION <= "0.25.0" and PANDAS_VERSION >= "0.20.0":
@TomAugspurger (Member) Jul 29, 2019

Does this need to be conditional on the pandas version? I don't see any harm in always applying the warning.

@ivarsfg (Contributor, Author) Jul 30, 2019

I was thinking that once pandas >= 0.25.0 is used for all CI test environments, this whole thing (filter_panel_warning) could be removed; right now dask-35.yml fails without it. By "always applying the warning", did you mean letting CI fail?

@TomAugspurger (Member) Aug 1, 2019

Sorry I meant "always apply the warnings filter". Not a big deal though.
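
For reference, a filter_panel_warning-style helper could look roughly like the sketch below. This is a hypothetical reconstruction, not the code merged here, and the exact warning class emitted depends on the pandas version:

```python
# Hypothetical sketch of a filter_panel_warning-style helper; the
# actual helper in the PR may differ.  It silences the Panel
# deprecation warning that pre-0.25 pandas emits for pairwise
# rolling cov/corr results.
import contextlib
import warnings


@contextlib.contextmanager
def filter_panel_warning():
    with warnings.catch_warnings():
        # Depending on the pandas version, the Panel deprecation is
        # reported as a FutureWarning or a DeprecationWarning.
        warnings.simplefilter("ignore", FutureWarning)
        warnings.simplefilter("ignore", DeprecationWarning)
        yield
```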

@@ -39,10 +39,15 @@ def overlap_chunk(
if isinstance(before, datetime.timedelta):
before = len(prev_part)

expansion = out.shape[0] // combined.shape[0]
@TomAugspurger (Member) Jul 29, 2019

Is this likely to cause new issues with empty partitions? e.g. the one from

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"A": range(12), "B": [True] * 3 + [False] * 3 + [True] * 6})
ddf = dd.from_pandas(df, 4)
ddf[df.B].get_partition(1).compute()  # this partition comes back empty
```

We may already have issues with empty partitions, in which case don't worry about it.

@ivarsfg (Contributor, Author) Jul 30, 2019

I didn't find a test case that triggers this, but I think a check here doesn't add much complexity and could prevent confusion later.
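
The kind of guard being discussed could look roughly like this hypothetical helper (the name and the empty-chunk fallback are illustrative, not the merged code):

```python
import pandas as pd


def _row_expansion(combined: pd.DataFrame, out: pd.DataFrame) -> int:
    """Hypothetical helper (not the merged code) showing the idea: the
    rolling output may have a multiple of the input's rows (pairwise cov
    yields ncols output rows per input row), and an empty combined chunk
    must not cause a division by zero."""
    if combined.shape[0] == 0:
        # Illustrative fallback: with no rows there is nothing to trim.
        return 1
    return out.shape[0] // combined.shape[0]
```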

@TomAugspurger (Member) commented Aug 1, 2019

Restarted the failing CI (HTTP error). Things look good here, I think.

@jrbourbeau mentioned this pull request Aug 4, 2019
@jcrist merged commit 35166b3 into dask:master on Aug 5, 2019
2 checks passed
@jcrist (Member) commented Aug 5, 2019

Thanks @ivarsfg!
