Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rolling cov #5154

Merged
merged 10 commits into from Aug 5, 2019

Conversation

@ivarsfg
Copy link
Contributor

commented Jul 25, 2019

The issue is that since cov output is 2D (#4053 (comment)) chunk overlap is calculated incorrectly. In cov tests the output was only differing in infs and nans. I've marked this as WIP because I'm not completely sure if this is a good approach.

  • Tests added / passed
  • Passes black dask / flake8 dask

closes issue #4053

@jcrist

This comment has been minimized.

Copy link
Member

commented Jul 25, 2019

Thanks for the PR. Sinceinf and nan aren't equivalent - instead of changing the tests to ignore this, we should figure out how to make the cov method return the correct results where possible. Do you know why inf is appearing in these cases?

@ivarsfg

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

It looks like there is some numerical imprecision when pandas calculates rolling mean (as part of calculating cov) and the exact result depends on the array length. So the difference in infs and nans is from splitting the dataframe. This can be reproduced with no dask involvement:

df = pd.DataFrame(
    {
        "a": np.array([0.01, 0.1, 0.01, 0.1]),
        "b": np.array([1.0, 1.0, 1.0, 1.0])
    }
)
df_half = df.iloc[2:]
print(df.rolling(1).cov())
print(df_half.rolling(1).cov())
       a   b
0 a  NaN NaN
  b  NaN NaN
1 a  NaN NaN
  b  NaN NaN
2 a -inf NaN
  b  NaN NaN
3 a  NaN NaN
  b  NaN NaN
      a   b
2 a NaN NaN
  b NaN NaN
3 a NaN NaN
  b NaN NaN
df = pd.DataFrame({'a': np.array([0.01, 0.1, 0.01, 0.1])})
means = df.rolling(1).mean()['a']
print(means[0], means[2])
0.01 0.009999999999999995
@ivarsfg

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2019

Should we just use a separate dataset for cov() tests?

@jcrist

This comment has been minimized.

Copy link
Member

commented Jul 26, 2019

I tried this out locally on a few random datasets and everything seems great. Using a different dataset or tweaking the tests appropriately seems fine to me.

dask/dataframe/tests/test_rolling.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_rolling.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_rolling.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_rolling.py Outdated Show resolved Hide resolved
@ivarsfg

This comment has been minimized.

Copy link
Contributor Author

commented Jul 29, 2019

Tests were failing on certain pandas versions because of Panel deprecation warning (deprecated in 0.20.0, removed in 0.25.0).

@ivarsfg ivarsfg changed the title [WIP] Add rolling cov Add rolling cov Jul 29, 2019

@@ -166,6 +166,29 @@ def test_rolling_methods(method, args, window, center, check_less_precise):
)


if PANDAS_VERSION <= "0.25.0" and PANDAS_VERSION >= "0.20.0":

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 29, 2019

Member

Does this need to be conditional on the pandas version? I don't see any harm in always applying the warning.

This comment has been minimized.

Copy link
@ivarsfg

ivarsfg Jul 30, 2019

Author Contributor

I was thinking that after >= 0.25.0 will be used for all CI test envs this whole thing (filter_panel_warning) could be removed, right now dask-35.yml is failing without this. By "always applying the warning" did you mean - let CI fail?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Aug 1, 2019

Member

Sorry I meant "always apply the warnings filter". Not a big deal though.

@@ -39,10 +39,15 @@ def overlap_chunk(
if isinstance(before, datetime.timedelta):
before = len(prev_part)

expansion = out.shape[0] // combined.shape[0]

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 29, 2019

Member

Is this likely to cause new issues with empty partitions? e.g. the one from

In [33]: df = pd.DataFrame({"A": range(12), "B": [True] * 3 + [False] * 3 + [True] * 6})

In [34]: ddf = dd.from_pandas(df, 4)

ddf[df.B].get_partition(1).compute()

We may already have issues with empty partitions, in which case don't worry about it.

This comment has been minimized.

Copy link
@ivarsfg

ivarsfg Jul 30, 2019

Author Contributor

I didn't find a test case to trigger this but I'm thinking a check here doesn't add much complexity and could prevent confusion later.

@TomAugspurger

This comment has been minimized.

Copy link
Member

commented Aug 1, 2019

Restarted the failing CI (http error). Things look good here I think.

@jrbourbeau jrbourbeau referenced this pull request Aug 4, 2019
0 of 2 tasks complete

@jcrist jcrist merged commit 35166b3 into dask:master Aug 5, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jcrist

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

Thanks @ivarsfg!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.