
Sparse arrays support #123

Open
rth opened this issue Jan 25, 2018 · 24 comments

Comments

@rth
Contributor

rth commented Jan 25, 2018

While sparse arrays are supported in dask, this issue aims to open the discussion on how they could be used in the context of dask-ml.

In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could be #115: some text vectorizer followed by a wrapped out-of-core scikit-learn model (e.g. PartialMultinomialNB).

Potentially relevant estimators

TruncatedSVD, text vectorizers, some estimators in dask_ml.preprocessing, and wrapped scikit-learn models that natively support incremental learning and sparse arrays.

Sparse array format

Several choices here,

  1. should mrocklin/sparse be added as a hard dependency, as the package that works out of the box for sparse arrays?
  2. should sparse arrays from scipy be wrapped to make them compatible in some limited fashion (if possible at all)?

In particular, as far as I understand, the application at hand has no need for the N-D sparse COO arrays provided by the sparse package; 2-D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure whether that still holds for dask. Then there is the partitioning strategy (see next section).
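As an illustration of option 1, here is a minimal sketch of dask.array blocks backed by sparse.COO (the shapes, chunk sizes, and sparsity threshold are made up):

```python
import dask.array as da
import sparse  # the mrocklin/sparse package

# Dense random blocks with ~95% of entries zeroed out.
x = da.random.random((100000, 1000), chunks=(10000, 1000))
x = da.where(x < 0.95, 0, x)

# Convert each block to a sparse COO array; subsequent block-wise
# operations then dispatch to sparse instead of numpy.
s = x.map_blocks(sparse.COO)
totals = s.sum(axis=0).compute()
```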

Partitioning strategy

At least as far as text vectorizers and incremental learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (using the full column width), which would also be natural with the CSR format.
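A sketch of what that partitioning looks like with dask.array (the sizes are made up):

```python
import dask.array as da
import numpy as np

X = np.random.random((100000, 500))

# Row-wise partitioning: every chunk spans the full column width,
# so each block maps naturally onto a single CSR matrix.
dX = da.from_array(X, chunks=(10000, X.shape[1]))
```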

File storage format

For instance, once someone manages to compute a distributed TF-IDF, the question arises of how to store it on disk without loading everything into memory at once. At present, it doesn't look like there is a canonical way to do this (dask/dask#2562 (comment)). zarr-developers/zarr-python#152 might be relevant, but as far as I understand it essentially stores the dense format with compression, which I believe makes later computations with the data difficult.

Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger; how much work this would represent, or what might be the easiest place to start.

cc @jorisvandenbossche @ogrisel

@mrocklin
Member

Long term it would be nice to see a scipy.sparse csr class that satisfied the np.ndarray rather than the np.matrix interface. Then it could operate nicely with dask array while still having the CSR layout. This is discussed in more depth in scipy/scipy#8162.

To be clear, I'm not suggesting a full ndarray implementation, just the current implementation where n <= 2, made to satisfy ndarray conventions. This is a broader issue for the rest of the community. My understanding is that sklearn devs would also appreciate a scipy.sparse.csr_array class.
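For context, two of the np.matrix-style behaviors of the current csr_matrix that an ndarray-conventions class would change:

```python
import scipy.sparse as sp

A = sp.eye(3, format="csr")

# np.matrix semantics: '*' is matrix multiplication, not the
# elementwise product it would be for an ndarray ...
print((A * A).shape)        # (3, 3), a matrix product

# ... and reductions return np.matrix rather than a 1-D ndarray.
print(type(A.sum(axis=0)))  # <class 'numpy.matrix'>
```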

@ogrisel

ogrisel commented Jan 25, 2018

+1 for scipy/scipy#8162 as the only true long term solution.

In the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 but I did not find (take) the time to finish investigating the cause of the test failures.

@ogrisel

ogrisel commented Jan 25, 2018

For sklearn, both CSR and CSC are useful. CSR is typically optimal for incremental learning algorithms (those with a partial_fit), while CSC is optimal for column-wise algorithms such as decision trees and linear models fitted with a coordinate descent optimizer (Lasso, ElasticNet...).
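A small illustration of how the two formats complement each other (shapes and density made up):

```python
import scipy.sparse as sp

X = sp.random(100000, 1000, density=0.01)

X_csr = X.tocsr()  # cheap row slicing: mini-batches for partial_fit
X_csc = X.tocsc()  # cheap column slicing: coordinate descent over features

batch = X_csr[:1000]    # a contiguous block of rows, cheap in CSR
column = X_csc[:, 42]   # a single feature column, cheap in CSC
```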

@ogrisel

ogrisel commented Jan 25, 2018

For storage you could write a custom dask function that relies on joblib.dump to efficiently serialize scipy.sparse.csr_matrix blocks in a filesystem folder.
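A minimal sketch of that idea (the block and file name are hypothetical):

```python
import joblib
import scipy.sparse as sp

block = sp.random(10000, 5000, density=0.001, format="csr")

# One file per dask block; joblib serializes the underlying numpy
# arrays of the csr_matrix efficiently.
joblib.dump(block, "block-00000.joblib")
restored = joblib.load("block-00000.joblib")
```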

@ogrisel

ogrisel commented Jan 25, 2018

Or write your own by using numpy.save / numpy.load on the three underlying numpy arrays: sparse_data_block.indices, sparse_data_block.data, and sparse_data_block.indptr.
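Sketched out with hypothetical helper names (the shape also needs to be saved to reconstruct the matrix):

```python
import numpy as np
import scipy.sparse as sp

def save_csr(prefix, block):
    # Persist the three arrays that define a CSR matrix, plus its shape.
    np.save(prefix + "_data.npy", block.data)
    np.save(prefix + "_indices.npy", block.indices)
    np.save(prefix + "_indptr.npy", block.indptr)
    np.save(prefix + "_shape.npy", np.array(block.shape))

def load_csr(prefix):
    data = np.load(prefix + "_data.npy")
    indices = np.load(prefix + "_indices.npy")
    indptr = np.load(prefix + "_indptr.npy")
    shape = tuple(np.load(prefix + "_shape.npy"))
    return sp.csr_matrix((data, indices, indptr), shape=shape)
```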

@mrocklin
Member

cc @hameerabbasi, dev on the mrocklin/sparse library

@rth
Contributor Author

rth commented Jan 25, 2018

Thanks for the response. The discussion about sparse ndarray in scipy looks quite promising indeed!

For storage you could write a custom dask function that relies on joblib.dump [..] or write your own by using numpy.save

Yes, I imagine that wouldn't be too difficult to do.

In the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 ...

Good to know, thanks.

@hameerabbasi

I'm a bit hesitant to add CSR/CSC in mrocklin/sparse, simply because they limit us to 2-D (N-D CSR/CSC is... murky to say the least), and I'm worried about growing complexity. However, we can achieve near-CSR level performance if mrocklin/sparse#60 is fixed (O(log n) vs O(1)).

Currently, we are working with C-style ordering of indices. If we instead added a Fortran-style mode for ordering indices and modified indexing to fit, we could achieve near-CSC-level performance as well.
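A toy example of the two orderings (shape made up): sorting COO coordinates by their C-style linear index groups entries by row, CSR-like, while the Fortran-style index groups them by column, CSC-like.

```python
import numpy as np

rows = np.array([0, 1, 0, 2])
cols = np.array([2, 0, 1, 1])

# C order: the row index varies slowest, so sorting gives (row, col) order.
c_idx = np.ravel_multi_index((rows, cols), (3, 3), order='C')
# Fortran order: the column index varies slowest, giving (col, row) order.
f_idx = np.ravel_multi_index((rows, cols), (3, 3), order='F')

print(np.argsort(c_idx))  # CSR-like traversal of the entries
print(np.argsort(f_idx))  # CSC-like traversal of the entries
```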

If you could reduce the bug in dask/dask#2842 to an underlying bug or missing feature in mrocklin/sparse, I'd be happy to look into it. If it's a bug in dask itself, I'm less willing to dive into it at this point, but I will soon, to help get it integrated.

@mrocklin
Member

Regarding performance, I think that the main operation where you really need CSR/CSC is tensordot, which is pretty critical in many numeric algorithms; for that operation CSR/CSC is probably essential.

If we don't want to host these in mrocklin/sparse then csr_array and csc_array could also exist in scipy.sparse. That would probably help get more eyes on the code and ensure maintenance, but it presents a challenge due to scipy's slow release and development cycle. Alternatively someone could make a separate standalone package.

At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.

@ogrisel

ogrisel commented Jan 25, 2018

CSR and CSC support would be useful to allow for no-copy behavior when calling Cython code from scikit-learn on sparse data blocks.

There are several algorithms in scikit-learn that do very efficient, cache-aligned data access when fed with CSR or CSC data structures.

@hameerabbasi

Interesting... It seems a lot of algorithms have been optimized for CSR/CSC already. I'll see how hard it is to implement CSR/CSC (limited to 2-D) with array semantics. Mostly it's just a matter of studying and re-implementing scipy.sparse code (or maybe just wrapping it may be good enough).

However, indexing would again return COO, since CSR/CSC isn't defined for 1-D.

@mrocklin
Member

Note that the scope of CSR/CSC is non-trivial. You might want to look at algorithms in scipy/sparse/linalg/. This would almost certainly require writing low-level code in Cython/Numba/C/C++.

@mrocklin
Member

If I were doing this under time constraints I would probably try to find a way to modify the existing solution to support the ndarray interface.

However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).

@hameerabbasi

hameerabbasi commented Jan 25, 2018

I plan to keep mrocklin/sparse a mostly data structure related thing as discussed in scipy/scipy#8162. Maybe implement operators, etc, but that's about it. I want to go down to Cython at some point... Might as well be soon.

dot/tensordot/linalg.solve etc., would either be in Scipy or a separate package... Or in mrocklin/sparse if we had help. I plan to keep the data/indices/indptr structure so their code can be easily ported.

@hameerabbasi

To be clear, I'd love to see mrocklin/sparse expand, or be merged into Scipy when stable. I just don't think I can do it alone. :)

@mrocklin
Member

Yeah, you're not alone here. I think that many people on this issue have demonstrated that they're happy to pitch in when relevant. I also suspect that sparse ndarrays will be of relevance to @jcrist in a couple months.

I plan to keep mrocklin/sparse a mostly data structure related thing as discussed in scipy/scipy#8162. Maybe implement operators, etc, but that's about it. I want to go down to Cython at some point... Might as well be soon.

FWIW I'd rather see a single csr ndarray class somewhere that has all relevant computational operations. Splitting this up between repositories seems painful to me. I'd rather wait until a clear solution is agreed upon instead of launching into this short term.

@jakirkham
Member

I wonder if it would be possible to extend CSR/CSC to N-D. It seems that all it should be is an ordering of which axes are prioritized in the sparse representation. Does that seem correct? If so, it seems like it should be possible to implement this in sparse in a way where it remains general purpose. With CSR/CSC, this amounts to one of two orderings involving axes 0 and 1.

@mrocklin
Member

Yes, this is possible. There are combinatorially many such layouts, but presumably we would store some hierarchy of axes that we cared about. I wouldn't make this choice until some important workloads demonstrated its value (like large sparse iterative factorization algorithms). I also wouldn't want to think about how operations like solve generalize under these conditions.

Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this; this could be naive though.

@rth
Contributor Author

rth commented Jan 25, 2018

I would also be interested in helping a bit, once it is decided where and in what way this should be implemented. (Though my bandwidth before summer won't be great.)

At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.

I would think it's more a question for dask-ml. scikit-learn itself currently has a minimum dependency of scipy 0.13.3 and works reasonably well around the limitations of the matrix interface, which means that this development won't impact it in the near future either way.

dot/tensordot/linalg.solve etc., would either be in Scipy or a separate package...

However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).

I'd rather wait until a clear solution is agreed upon instead of launching into this short term.

At the same time, in the context of dask-ml, even an implementation that supports basic operations and can convert to/from scipy's csr/csc_matrix without much cost might help solve some of the issues?

Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this;

+1

@mrocklin
Member

At the same time, in the context of dask-ml, even an implementation that supports basic operations and can convert to/from scipy's csr/csc_matrix without much cost might help solve some of the issues?

Yes, that's true. Short term we've been using the COO implementation in mrocklin/sparse for this. The conversion isn't free, but it's also not that expensive. We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term. There are some short-term-needs vs long-term-goals tradeoffs to make here. I'm not sure what the best path is.
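For reference, a sketch of the round trip being discussed (assuming the conversion helpers in current versions of the sparse package, from_scipy_sparse and tocsr):

```python
import numpy as np
import scipy.sparse as sp
import sparse  # mrocklin/sparse

csr = sp.random(1000, 100, density=0.01, format="csr")

# scipy CSR -> COO: not free, but it mostly reshuffles index arrays.
coo = sparse.COO.from_scipy_sparse(csr)

# ... block-wise work on the COO side ...

# COO -> scipy CSR when handing a block to scikit-learn.
back = coo.tocsr()
assert np.allclose(csr.toarray(), back.toarray())
```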

@hameerabbasi

hameerabbasi commented Jan 25, 2018

It seems that all it should be is an ordering of which axes are prioritized in the sparse representation.

Yes, that is the most general way to put it. Would CSD (compressed sparse dimensions) make sense, covering both CSR and CSC? For CSR the compressed dimension would be 0, and for CSC it would be 1. We could keep a compressed_axes tuple.

Then, CSR/CSC would just inherit from this or be special cases. We would implement all operations exactly as in 2-D CSR without any broadcasting. Tensordot would reduce to dot for the appropriate choice of axes.
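A rough sketch of what building the compressed pointer could look like under that scheme (the function and argument names are hypothetical):

```python
import numpy as np

def csd_indptr(coords, shape, compressed_axes):
    """CSR-style indptr over an arbitrary choice of compressed axes.

    coords is a (ndim, nnz) array of COO coordinates, assumed sorted
    so that the compressed axes vary slowest. compressed_axes=(0,)
    reproduces CSR for a 2-D array; compressed_axes=(1,) gives CSC.
    """
    comp_shape = tuple(shape[ax] for ax in compressed_axes)

    # Linearize the compressed axes into a single "row" index.
    rows = np.ravel_multi_index(
        tuple(coords[ax] for ax in compressed_axes), comp_shape)

    # indptr[i]:indptr[i+1] spans the entries of compressed "row" i.
    counts = np.bincount(rows, minlength=int(np.prod(comp_shape)))
    return np.concatenate([[0], np.cumsum(counts)])
```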

@rth
Contributor Author

rth commented Jan 25, 2018

We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term.

Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which would expect csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?

@mrocklin
Member

Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which would expect csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?

This is feasible. The only concern would be that we do a lot of work and never reach the functionality of scipy.sparse. It may be that copying the entire implementation over and then tweaking a couple of things to stop using the np.matrix interface is significantly easier to do.

To be clear, people can do whatever they want to do. I have thoughts on how I would or would not spend my time here, but that doesn't stop others from exploring this topic. I doubt that my thoughts on this topic are entirely correct.

@mrocklin
Member

@hameerabbasi's and @jakirkham's thoughts on CSD might be a decent start. This code seems orthogonal to the logic in scipy.sparse (and so would presumably not be thrown away in the future) and might resolve the short-term need.
