Add rolling cov #5154

Merged
merged 10 commits into dask:master on Aug 5, 2019

Conversation

@ivarsfg (Contributor) commented Jul 25, 2019

The issue is that, since the cov output is 2D (#4053 (comment)), the chunk overlap is calculated incorrectly. In the cov tests the output differed only in infs and NaNs. I've marked this as WIP because I'm not completely sure this is a good approach.

  • Tests added / passed
  • Passes black dask / flake8 dask

closes issue #4053
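
For context (this illustration is not part of the PR itself), the "2D output" refers to pandas stacking a full covariance matrix for every row, so the pairwise rolling cov result has more rows than the input and the overlap logic cannot assume one output row per input row:

```python
# Illustration only (not from the PR): pandas' pairwise rolling cov
# returns one covariance matrix per input row, stacked into a
# MultiIndexed DataFrame, so the result has len(df) * ncols rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(4.0), "b": np.arange(4.0) ** 2})
out = df.rolling(2).cov()
print(df.shape)   # (4, 2)
print(out.shape)  # (8, 2): rows expanded by the number of columns
```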

@jcrist (Member) commented Jul 25, 2019

Thanks for the PR. Since inf and nan aren't equivalent, instead of changing the tests to ignore this we should figure out how to make the cov method return the correct results where possible. Do you know why inf is appearing in these cases?

@ivarsfg (Contributor, Author) commented Jul 26, 2019

It looks like there is some numerical imprecision when pandas calculates the rolling mean (as part of calculating cov), and the exact result depends on the array length. So the difference in infs and NaNs comes from splitting the dataframe. This can be reproduced with no dask involvement:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": np.array([0.01, 0.1, 0.01, 0.1]),
        "b": np.array([1.0, 1.0, 1.0, 1.0]),
    }
)
df_half = df.iloc[2:]
print(df.rolling(1).cov())
print(df_half.rolling(1).cov())
```

Output:

```
       a   b
0 a  NaN NaN
  b  NaN NaN
1 a  NaN NaN
  b  NaN NaN
2 a -inf NaN
  b  NaN NaN
3 a  NaN NaN
  b  NaN NaN
      a   b
2 a NaN NaN
  b NaN NaN
3 a NaN NaN
  b NaN NaN
```

The imprecision is visible in the rolling mean itself:

```python
df = pd.DataFrame({"a": np.array([0.01, 0.1, 0.01, 0.1])})
means = df.rolling(1).mean()["a"]
print(means[0], means[2])  # 0.01 0.009999999999999995
```

@ivarsfg (Contributor, Author) commented Jul 26, 2019

Should we just use a separate dataset for cov() tests?

@jcrist (Member) commented Jul 26, 2019

I tried this out locally on a few random datasets and everything seems great. Using a different dataset or tweaking the tests appropriately seems fine to me.
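
As a rough sketch of the kind of comparison such a test makes (this is not the PR's actual test code; the dataset, window size, and partition count are arbitrary illustrative choices), one could check that dask's rolling cov matches pandas on data that avoids the inf/NaN imprecision above:

```python
# Sketch only, not the test added in this PR.
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

df = pd.DataFrame({"a": np.random.randn(25), "b": np.random.randn(25)})
ddf = dd.from_pandas(df, npartitions=3)

# After this PR, ddf.rolling(...).cov() should agree with pandas.
assert_eq(df.rolling(4).cov(), ddf.rolling(4).cov())
```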

dask/dataframe/tests/test_rolling.py: 4 review threads (outdated, resolved)
@ivarsfg (Contributor, Author) commented Jul 29, 2019

Tests were failing on certain pandas versions because of the Panel deprecation warning (Panel was deprecated in 0.20.0 and removed in 0.25.0).

@ivarsfg changed the title from [WIP] Add rolling cov to Add rolling cov on Jul 29, 2019
@@ -166,6 +166,29 @@ def test_rolling_methods(method, args, window, center, check_less_precise):
)


if PANDAS_VERSION <= "0.25.0" and PANDAS_VERSION >= "0.20.0":
@TomAugspurger (Member) Jul 29, 2019

Does this need to be conditional on the pandas version? I don't see any harm in always applying the warning.

@ivarsfg (Contributor, Author) Jul 30, 2019

I was thinking that once pandas >= 0.25.0 is used for all CI test environments, this whole thing (filter_panel_warning) could be removed; right now dask-35.yml fails without it. By "always applying the warning", did you mean letting CI fail?

@TomAugspurger (Member) Aug 1, 2019

Sorry I meant "always apply the warnings filter". Not a big deal though.
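
For reference, a filter_panel_warning-style helper could look roughly like the sketch below. This is a hypothetical reconstruction, not the code merged here, and the exact warning class emitted depends on the pandas version:

```python
# Hypothetical sketch of a filter_panel_warning-style helper; the
# actual helper in the PR may differ.  It silences the Panel
# deprecation warning that pre-0.25 pandas emits for pairwise
# rolling cov/corr results.
import contextlib
import warnings


@contextlib.contextmanager
def filter_panel_warning():
    with warnings.catch_warnings():
        # Depending on the pandas version, the Panel deprecation is
        # reported as a FutureWarning or a DeprecationWarning.
        warnings.simplefilter("ignore", FutureWarning)
        warnings.simplefilter("ignore", DeprecationWarning)
        yield
```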

@@ -39,10 +39,15 @@ def overlap_chunk(
if isinstance(before, datetime.timedelta):
before = len(prev_part)

expansion = out.shape[0] // combined.shape[0]
@TomAugspurger (Member) Jul 29, 2019

Is this likely to cause new issues with empty partitions? e.g. the one from

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"A": range(12), "B": [True] * 3 + [False] * 3 + [True] * 6})
ddf = dd.from_pandas(df, 4)
ddf[df.B].get_partition(1).compute()  # this partition comes back empty
```

We may already have issues with empty partitions, in which case don't worry about it.

@ivarsfg (Contributor, Author) Jul 30, 2019

I didn't find a test case that triggers this, but I think a check here doesn't add much complexity and could prevent confusion later.
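
The kind of guard being discussed could look roughly like this hypothetical helper (the name and the empty-chunk fallback are illustrative, not the merged code):

```python
import pandas as pd


def _row_expansion(combined: pd.DataFrame, out: pd.DataFrame) -> int:
    """Hypothetical helper (not the merged code) showing the idea: the
    rolling output may have a multiple of the input's rows (pairwise cov
    yields ncols output rows per input row), and an empty combined chunk
    must not cause a division by zero."""
    if combined.shape[0] == 0:
        # Illustrative fallback: with no rows there is nothing to trim.
        return 1
    return out.shape[0] // combined.shape[0]
```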

@TomAugspurger (Member) commented Aug 1, 2019

Restarted the failing CI (HTTP error). Things look good here, I think.

@jrbourbeau mentioned this pull request Aug 4, 2019
@jcrist merged commit 35166b3 into dask:master on Aug 5, 2019
2 checks passed
@jcrist (Member) commented Aug 5, 2019

Thanks @ivarsfg!
