'Rolling' object has no attribute 'cov' in dask.dataframe #4053

Open
NazBen opened this issue Oct 4, 2018 · 5 comments

Comments

@NazBen
commented Oct 4, 2018

Hi,

I want to compute rolling covariance matrices on a large dataset, which is quite computationally expensive, but I am running into an issue with dask.dataframe. Here is code that reproduces it:

import numpy as np
import pandas as pd
import dask.dataframe as dd

d = 20
n = 1000

df = pd.DataFrame(np.random.random((n, d)))
ddf = dd.from_pandas(df, npartitions=3)
cov = df.rolling(100).cov()
dcov = ddf.rolling(100).cov()

Output:

pandas version: 0.23.4
dask version: 0.19.2
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-a4ca6b7745d7> in <module>()
     13 ddf = dd.from_pandas(df, npartitions=3)
     14 cov = df.rolling(100).cov()
---> 15 dcov = ddf.rolling(100).cov()

AttributeError: 'Rolling' object has no attribute 'cov'

Regards,
Nazih.

@mrocklin

Member

commented Oct 9, 2018

Thank you for the excellent issue, @NazBen. I'm curious why this wasn't implemented originally. Most of the rolling algorithms are pretty consistent, so I'm not sure why this one would have been missed.

Would you be interested in submitting a small pull request that adds cov?

You can probably add the implementation here:

@derived_from(pd_Rolling)
def count(self):
    return self._call_method('count')

@derived_from(pd_Rolling)
def sum(self):
    return self._call_method('sum')

@derived_from(pd_Rolling)
def mean(self):
    return self._call_method('mean')

And add tests here:

rolling_method_args_check_less_precise = [
    ('count', (), False),
    ('sum', (), False),
    ('mean', (), False),
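
Until a cov method exists, one possible workaround for a single pair of columns is to build the rolling covariance from rolling means, which the snippet above shows are already supported. The sketch below checks the identity in plain pandas only. The helper name rolling_cov_from_means is hypothetical, the column names are made up for illustration, and it assumes full windows (the default min_periods). The same expression should carry over to dask columns since it only calls rolling(...).mean(), but that is not verified here.

```python
import numpy as np
import pandas as pd


def rolling_cov_from_means(x, y, window):
    """Sample rolling covariance of two Series using only rolling means.

    Hypothetical helper; assumes every window is full (default min_periods).
    """
    mean_xy = (x * y).rolling(window).mean()
    mean_x = x.rolling(window).mean()
    mean_y = y.rolling(window).mean()
    # Bessel correction so the result matches pandas' default ddof=1.
    return (mean_xy - mean_x * mean_y) * window / (window - 1)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
window = 100

expected = df["a"].rolling(window).cov(df["b"])
actual = rolling_cov_from_means(df["a"], df["b"], window)

# Largest disagreement over the rows where both results are defined.
max_abs_diff = (expected - actual).abs().max()
print(max_abs_diff)
```

This gives one column pair at a time, so a full d-by-d matrix needs a loop over pairs. The mean-of-products form can also lose precision when values are large relative to their spread.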

@rmccorm4


commented Feb 24, 2019

I looked into this bug a little bit and a couple of things stood out to me.

  1. Compared to all of the other methods that are tested in rolling_method_args_check_less_precise like mean, sum, etc. -- cov has a 2-D output whereas the other methods have a 1-D output.

  2. As the window size increases, the dask dataframe's shape increasingly diverges from the pandas dataframe's shape, presumably as an artifact of point 1. I have a couple of examples below that illustrate it; perhaps they will help someone else solve this issue.

Using the cov method, the difference in dataframe shapes is a function of npartitions and window_size:

import numpy as np
import pandas as pd
import dask.dataframe as dd

N = 10
df = pd.DataFrame({'a': np.random.randn(N).cumsum(),
                   'b': np.random.randint(100, size=(N,)),
                   'c': np.random.randint(100, size=(N,)),
                   'd': np.random.randint(100, size=(N,)),
                   'e': np.random.randint(100, size=(N,))})
npartitions = 2
ddf = dd.from_pandas(df, npartitions)

window_size = 5
a = df.rolling(window_size).cov()
b = ddf.rolling(window_size).cov().compute()
print(a.shape, '?=', b.shape)
# (50, 5) ?= (66, 5)

dask_diff = 2**npartitions * (window_size-1)
print(dask_diff)  # == 16
shape_diff = b.shape[0] - a.shape[0]
print(shape_diff) # == 16

Using any other method with a 1-D output like sum(), mean(), etc., the shapes will match.

N=10
df = pd.DataFrame({'a': np.random.randn(N).cumsum(),
                   'b': np.random.randint(100, size=(N,)),
                   'c': np.random.randint(100, size=(N,)),
                   'd': np.random.randint(100, size=(N,)),
                   'e': np.random.randint(100, size=(N,))})
npartitions = 2
ddf = dd.from_pandas(df, npartitions)

window_size = 5
a = df.rolling(window_size).mean()
b = ddf.rolling(window_size).mean().compute()
print(a.shape, '?=', b.shape)
# (10, 5) ?= (10, 5)

dask_diff = 2**npartitions * (window_size-1)
print(dask_diff)  # == 16
shape_diff = b.shape[0] - a.shape[0]
print(shape_diff) # == 0
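
One possible explanation for the extra 16 rows, sketched in plain pandas. It assumes (this is not confirmed in the thread) that each partition after the first is computed with window_size - 1 rows prepended from the previous partition, and that the result is then trimmed by that same number of rows. For mean() that trim is exact, but cov() emits one row per column for every input row, so trimming by input-row count leaves rows behind. The partition boundary at row 5 and the names naive and fixed are illustrative.

```python
import numpy as np
import pandas as pd

N, window_size = 10, 5
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(100, size=(N, 5)).astype(float),
                  columns=list("abcde"))
ncols = df.shape[1]
before = window_size - 1  # rows borrowed from the previous partition

# Assumed split into two partitions of five rows each.
part1, part2 = df.iloc[:5], df.iloc[5:]

out1 = part1.rolling(window_size).cov()
combined = pd.concat([part1.iloc[-before:], part2])
out2 = combined.rolling(window_size).cov()  # ncols output rows per input row

# Trimming by input-row count leaves (before * ncols - before) stray rows.
naive = pd.concat([out1, out2.iloc[before:]])
# Trimming by output-row count removes all of the borrowed rows.
fixed = pd.concat([out1, out2.iloc[before * ncols:]])

expected = df.rolling(window_size).cov()
print(expected.shape, naive.shape, fixed.shape)
```

Under these assumptions the surplus is (npartitions - 1) * (window_size - 1) * (ncols - 1), which is 16 for this example, the same number the 2**npartitions formula gives here. The two formulas differ for other partition or column counts, so it would be worth checking which one holds.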
@HSR05

Contributor

commented Feb 28, 2019

@mrocklin Could you give me some guidance on this issue? I think I can work on it.

@TomAugspurger

Member

commented Mar 5, 2019

@HSR05 where are you stuck? Does the summary from #4053 (comment) make sense? Do you need help writing tests, or implementing the fix?

@HSR05

Contributor

commented Mar 5, 2019

@TomAugspurger I just started looking into it again and it is making sense to me. I just need a little guidance in implementing the fix. Can you guide me?

@ivarsfg ivarsfg referenced this issue Jul 25, 2019
2 of 2 tasks complete