Add rolling cov#5154

Merged
jcrist merged 10 commits into master from unknown repository
Aug 5, 2019

Conversation

ghost commented Jul 25, 2019

The issue is that, because the cov output is 2D (#4053 (comment)), the chunk overlap is calculated incorrectly. In the cov tests, the output differed only in infs and nans. I've marked this as WIP because I'm not completely sure this is a good approach.

  • Tests added / passed
  • Passes black dask / flake8 dask

closes issue #4053

Member

jcrist commented Jul 25, 2019

Thanks for the PR. Since inf and nan aren't equivalent, instead of changing the tests to ignore this, we should figure out how to make the cov method return the correct results where possible. Do you know why inf is appearing in these cases?

Author

ghost commented Jul 26, 2019

It looks like there is some numerical imprecision when pandas calculates the rolling mean (as part of calculating cov), and the exact result depends on the array length. So the difference in infs and nans comes from splitting the dataframe. This can be reproduced with no dask involvement:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([0.01, 0.1, 0.01, 0.1]),
        "b": np.array([1.0, 1.0, 1.0, 1.0]),
    }
)
df_half = df.iloc[2:]
print(df.rolling(1).cov())
print(df_half.rolling(1).cov())

       a   b
0 a  NaN NaN
  b  NaN NaN
1 a  NaN NaN
  b  NaN NaN
2 a -inf NaN
  b  NaN NaN
3 a  NaN NaN
  b  NaN NaN

      a   b
2 a NaN NaN
  b NaN NaN
3 a NaN NaN
  b NaN NaN

df = pd.DataFrame({"a": np.array([0.01, 0.1, 0.01, 0.1])})
means = df.rolling(1).mean()["a"]
print(means[0], means[2])

0.01 0.009999999999999995
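
The drift in the rolling mean can be illustrated without pandas. The following is a minimal sketch, assuming the rolling statistic is maintained as a single running accumulator that adds the incoming value and then subtracts the outgoing one. That ordering, the helper name `running_window_sums`, and the infinite scaling factor at the end are assumptions made for illustration, not a description of pandas internals.

```python
import math


def running_window_sums(values, window):
    """Rolling sums kept as one accumulator (add new value, then drop old)."""
    total = 0.0
    sums = []
    for i, value in enumerate(values):
        total += value                   # add the value entering the window
        if i >= window:
            total -= values[i - window]  # drop the value leaving the window
        sums.append(total)
    return sums


values = [0.01, 0.1, 0.01, 0.1]
sums = running_window_sums(values, window=1)

# The accumulator carries rounding error from earlier rows, so the third
# window (which contains only 0.01) no longer sums to exactly 0.01.
residual = sums[2] - values[2]
print(sums[0] == 0.01, sums[2] == 0.01, residual)

# Assumption: with one observation and ddof=1 the sample covariance is
# scaled by a 1/0-style infinite factor. An exact zero then becomes nan,
# while a tiny nonzero residual becomes +/-inf.
print(0.0 * math.inf, residual * math.inf)
```

Because the residual depends on which values passed through the accumulator earlier, a frame that starts at row 2 has no history and no residual, which is consistent with the full frame and the half frame disagreeing only on inf versus nan.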

Author

ghost commented Jul 26, 2019

Should we just use a separate dataset for cov() tests?

Member

jcrist commented Jul 26, 2019

I tried this out locally on a few random datasets and everything seems great. Using a different dataset or tweaking the tests appropriately seems fine to me.

Author

ghost commented Jul 29, 2019

Tests were failing on certain pandas versions because of the Panel deprecation warning (Panel was deprecated in 0.20.0 and removed in 0.25.0).

ghost changed the title from "[WIP] Add rolling cov" to "Add rolling cov" Jul 29, 2019
)


if PANDAS_VERSION <= "0.25.0" and PANDAS_VERSION >= "0.20.0":
Member

Does this need to be conditional on the pandas version? I don't see any harm in always applying the warning.

Author

I was thinking that once pandas >= 0.25.0 is used for all CI test envs, this whole thing (filter_panel_warning) could be removed; right now dask-35.yml is failing without it. By "always applying the warning", did you mean letting CI fail?

Member

Sorry, I meant "always apply the warnings filter". Not a big deal though.
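
For reference, a warnings filter of the kind discussed in this thread can be applied unconditionally, because a filter that never matches has no effect. The following is a minimal sketch, assuming the filter matches on the warning message. The helper name `ignore_panel_warning` and the message pattern are illustrative and are not taken from this PR.

```python
import contextlib
import warnings


@contextlib.contextmanager
def ignore_panel_warning():
    """Temporarily ignore warnings whose message mentions 'Panel'."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        # The pattern is an assumption about how the warning is worded.
        warnings.filterwarnings("ignore", message=".*Panel.*")
        yield


# Usage: only the unrelated warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with ignore_panel_warning():
        warnings.warn("Panel is deprecated", FutureWarning)
        warnings.warn("something else", UserWarning)

print([str(w.message) for w in caught])
```

The filter is scoped to the `with` block, so it is restored on exit and does not hide unrelated warnings elsewhere in the test suite.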

if isinstance(before, datetime.timedelta):
    before = len(prev_part)

expansion = out.shape[0] // combined.shape[0]
Member

TomAugspurger Jul 29, 2019

Is this likely to cause new issues with empty partitions? For example, the one from

In [33]: df = pd.DataFrame({"A": range(12), "B": [True] * 3 + [False] * 3 + [True] * 6})

In [34]: ddf = dd.from_pandas(df, 4)

ddf[df.B].get_partition(1).compute()

We may already have issues with empty partitions, in which case don't worry about it.

Author

I didn't find a test case to trigger this, but I think a check here doesn't add much complexity and could prevent confusion later.
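
To illustrate the `expansion` factor from this thread: a pairwise cov result has one output row per (input row, column) pair, so discarding the `before` overlap rows means dropping `before * expansion` output rows. The sketch below uses plain lists in place of DataFrames. The function `trim_overlap` and its guard for an empty input are hypothetical and are not the code merged in this PR.

```python
def trim_overlap(out_rows, n_combined, before):
    """Drop the output rows produced by the `before` overlap rows.

    out_rows:   rows of the (possibly 2D-expanded) rolling result
    n_combined: number of input rows (overlap + partition)
    before:     number of leading overlap rows to discard
    """
    if n_combined == 0:
        # Empty input: nothing to trim, and no division by zero.
        return out_rows
    expansion = len(out_rows) // n_combined  # output rows per input row
    return out_rows[before * expansion:]


columns = ["a", "b"]
combined_index = [2, 3, 4, 5]  # row 2 is overlap from the previous partition
# Stand-in for a pairwise cov result: one row per (input row, column) pair.
out = [(i, c) for i in combined_index for c in columns]

trimmed = trim_overlap(out, len(combined_index), before=1)
print(trimmed[0], len(trimmed))
```

For a 1D result such as a rolling mean, `expansion` is 1 and the trim reduces to dropping the first `before` rows.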

@TomAugspurger
Member

Restarted the failing CI (HTTP error). Things look good here, I think.

jcrist merged commit 35166b3 into dask:master Aug 5, 2019
Member

jcrist commented Aug 5, 2019

Thanks @ivarsfg!
