dask array usage #10

Closed
cicdw opened this issue Dec 8, 2016 · 7 comments

Comments

@cicdw
Owner

cicdw commented Dec 8, 2016

Ensure that the only methods required of the inputs X and y are .dot() and .T.
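
As an illustration of what that restriction allows, here is a minimal sketch (not the dask-glm implementation) of gradient descent for least squares that touches X only through .dot() and .T. The function name `least_squares_gradient`, the synthetic data, and the step size are assumptions made for this example; NumPy arrays stand in for dask arrays, which expose the same two operations.

```python
import numpy as np


def least_squares_gradient(X, y, beta):
    """Gradient of 0.5 * ||X beta - y||^2, using only X.dot() and X.T."""
    residual = X.dot(beta) - y
    return X.T.dot(residual)


# Synthetic, noise-free data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X.dot(true_beta)

# Plain gradient descent with a fixed step of 1 / ||X||_2^2.
beta = np.zeros(3)
step = 1.0 / np.linalg.norm(X, 2) ** 2
for _ in range(500):
    beta = beta - step * least_squares_gradient(X, y, beta)
```

Because the loop never slices or indexes X, any array-like object that provides .dot() and .T could in principle be passed in its place.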

@mrocklin

When dask/dask#1838 gets in we'll actually have access to almost all of dask.array: arithmetic, elemwise, reductions, transpose, tensordot, etc. The only things that definitely won't work are things like slicing.

@mrocklin

This should now be resolved if you work from dask/master:

In [1]: import dask.dataframe as dd

In [2]: df = dd.demo.make_timeseries('2000', '2001',
   ...:                              {'a': int, 'b': float},
   ...:                              freq='10s', partition_freq='7D', seed=1)

In [3]: df.head()
Out[3]: 
                        a         b
2000-01-01 00:00:00  1018 -0.023142
2000-01-01 00:00:10  1029 -0.911436
2000-01-01 00:00:20   992  0.898917
2000-01-01 00:00:30   999 -0.579152
2000-01-01 00:00:40  1001 -0.712870

In [4]: df
Out[4]: dd.DataFrame<make-ti..., npartitions=52, divisions=(Timestamp('2000-01-01 00:00:00', freq='7D'), Timestamp('2000-01-08 00:00:00', freq='7D'), Timestamp('2000-01-15 00:00:00', freq='7D'), ..., Timestamp('2000-12-23 00:00:00', freq='7D'), Timestamp('2000-12-30 00:00:00', freq='7D'))>

In [5]: x = df.values

In [6]: x
Out[6]: dask.array<values-..., shape=(nan, 2), dtype=float64, chunksize=(nan, 2)>

In [7]: x.dtype
Out[7]: dtype('float64')

In [8]: y = x.T.dot(x)

In [9]: y
Out[9]: dask.array<sum-a15..., shape=(2, 2), dtype=float64, chunksize=(2, 2)>

In [10]: y.compute()
Out[10]: 
array([[  3.14803663e+12,  -5.60514928e+05],
       [ -5.60514928e+05,   1.04724887e+06]])

Or if you want a numpy record array:

In [13]: df.to_records()
Out[13]: dask.array<to-reco..., shape=(nan,), dtype=(numpy.record, [('index', 'O'), ('a', '<i8'), ('b', '<f8')]), chunksize=(nan,)>

In [14]: df.to_records().dtype
Out[14]: dtype((numpy.record, [('index', 'O'), ('a', '<i8'), ('b', '<f8')]))
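
One way to see why x.T.dot(x) can have a known (2, 2) shape even though the number of rows is unknown: the product can be accumulated block by block over the rows. The sketch below shows this with NumPy only; the block boundaries are arbitrary assumptions, and it is an illustration of the idea rather than a description of how dask schedules the computation.

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.normal(size=(10, 2))

# Split the rows into uneven blocks, standing in for partitions
# whose lengths are not known ahead of time.
blocks = [X[:3], X[3:7], X[7:]]

# Each block contributes a (2, 2) partial product; their sum is X^T X.
gram_blockwise = sum(b.T.dot(b) for b in blocks)
gram_direct = X.T.dot(X)
```

The result depends only on the number of columns, so the row count of each block never needs to be known in advance.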

@cicdw
Owner Author

cicdw commented Dec 13, 2016

Resolved with https://github.com/moody-marlin/dask-glm/commit/b5f5033d3d2104b346869a7aa37ec8cd3608cdd1. Dask Series still throw a few errors (due to shape issues in the algorithms), which will be fixed shortly.

@mrocklin

What issues did you run into?

@cicdw
Owner Author

cicdw commented Dec 13, 2016

@mrocklin
Whenever I use a Series as input (e.g., A['var'] rather than A[['var']]), I run into the following error while computing Xbeta = self.X.dot(beta):

ValueError: Chunks do not add up to shape. Got chunks=((nan,),), shape=(1,)

Here, beta = np.zeros(1).

It's an edge case that can only occur whenever there's a single input variable, but my plan was just to "thicken" it up and add another dimension to A['var'].values.

@mrocklin

Ah, I see. Right, I just checked and it looks like pd.Series.values does the same thing and returns a 1d "row" array. Adding a newaxis manually as you suggest is probably the right thing to do here:

In [22]: s.values.shape
Out[22]: (nan,)

In [23]: x.shape
Out[23]: (1,)

In [24]: s.values[:, None].dot(x)
Out[24]: dask.array<sum-d05..., shape=(nan,), dtype=float64, chunksize=(nan,)>
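
The same shape behaviour can be reproduced with plain NumPy, as a minimal sketch. The names `s_values` and `one_d_failed` are made up for this example, and the NumPy error message differs from the dask one quoted above.

```python
import numpy as np

s_values = np.arange(5, dtype=float)  # 1-D, like Series.values: shape (5,)
beta = np.zeros(1)                    # one coefficient: shape (1,)

# A 1-D array of length 5 cannot be dotted with a length-1 vector.
try:
    s_values.dot(beta)
    one_d_failed = False
except ValueError:
    one_d_failed = True

# Adding a trailing axis turns it into a (5, 1) column, so the dot works.
X = s_values[:, None]
Xbeta = X.dot(beta)  # shape (5,)
```

Indexing with [:, None] is equivalent to [:, np.newaxis]; either spelling adds the extra dimension.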

@cicdw
Owner Author

cicdw commented Dec 13, 2016

Nice, fixed with 2d10f73. Closing this issue!

@cicdw cicdw closed this as completed Dec 13, 2016
@cicdw cicdw mentioned this issue Dec 13, 2016
cicdw added a commit that referenced this issue Jan 27, 2017
Performance improvements; the only CI checks that were failing were flake8 - all tests passed.