
Sparse arrays support #123

Open
rth opened this issue Jan 25, 2018 · 24 comments

Comments

@rth
Contributor

rth commented Jan 25, 2018

While sparse arrays are supported in dask, this issue aims to open the discussion on how they could be used in the context of dask-ml.

In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could be #115: some text vectorizer followed by a wrapped out-of-core scikit-learn model (e.g. PartialMultinomialNB).

Potentially relevant estimators

TruncatedSVD, text vectorizers, some estimators in dask_ml.preprocessing, and wrapped scikit-learn models that natively support incremental learning and sparse arrays.

Sparse array format

Several choices here,

  1. should mrocklin/sparse be added as a hard dependency, as the package that works out of the box for sparse arrays?
  2. should sparse arrays from scipy be wrapped to make them compatible in some limited fashion (if possible at all)?

In particular, as far as I understand, the application at hand has no need for the N-D sparse COO arrays provided by the sparse package; 2-D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure whether that still holds for dask. Then there is the partitioning strategy (see next section).
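As an illustration of option 1, here is a minimal sketch of dask.array blocks backed by sparse.COO (the shapes, chunk sizes, and sparsity threshold are made up):

```python
import dask.array as da
import sparse  # the mrocklin/sparse package

# Dense random blocks with ~95% of entries zeroed out.
x = da.random.random((100000, 1000), chunks=(10000, 1000))
x = da.where(x < 0.95, 0, x)

# Convert each block to a sparse COO array; subsequent block-wise
# operations then dispatch to sparse instead of numpy.
s = x.map_blocks(sparse.COO)
totals = s.sum(axis=0).compute()
```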

Partitioning strategy

At least as far as text vectorizers and incremental learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (using the full column width), which would also be natural with the CSR format.
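A sketch of what that partitioning looks like with dask.array (the sizes are made up):

```python
import dask.array as da
import numpy as np

X = np.random.random((100000, 500))

# Row-wise partitioning: every chunk spans the full column width,
# so each block maps naturally onto a single CSR matrix.
dX = da.from_array(X, chunks=(10000, X.shape[1]))
```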

File storage format

For instance, once someone manages to compute a distributed TF-IDF, the question arises of how to store it on disk without loading everything into memory at once. At present, it doesn't look like there is a canonical way to do this (dask/dask#2562 (comment)). zarr-developers/zarr-python#152 might be relevant, but as far as I understand it essentially stores the dense format with compression, which I believe makes later computations with the data difficult.

Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger; how much work this would represent, or what might be the easiest place to start.

cc @jorisvandenbossche @ogrisel

@mrocklin
Member

Long term it would be nice to see a scipy.sparse csr class that satisfied the np.ndarray rather than the np.matrix interface. Then it could operate nicely with dask array while still having the CSR layout. This is discussed in more depth in scipy/scipy#8162.

To be clear, I'm not suggesting a full ndarray implementation, just the current implementation where n <= 2, made to satisfy ndarray conventions. This is a broader issue for the rest of the community. My understanding is that sklearn devs would also appreciate a scipy.sparse.csr_array class.
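For context, two of the np.matrix-style behaviors of the current csr_matrix that an ndarray-conventions class would change:

```python
import scipy.sparse as sp

A = sp.eye(3, format="csr")

# np.matrix semantics: '*' is matrix multiplication, not the
# elementwise product it would be for an ndarray ...
print((A * A).shape)        # (3, 3), a matrix product

# ... and reductions return np.matrix rather than a 1-D ndarray.
print(type(A.sum(axis=0)))  # <class 'numpy.matrix'>
```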

@ogrisel

ogrisel commented Jan 25, 2018

+1 for scipy/scipy#8162 as the only true long term solution.

In the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 but I did not find (take) the time to finish investigating the cause of the test failures.

@ogrisel

ogrisel commented Jan 25, 2018

For sklearn, both CSR and CSC are useful. CSR is typically optimal for incremental learning algorithms (those with a partial_fit), while CSC is optimal for column-wise algorithms such as decision trees and linear models fitted with a coordinate descent optimizer (Lasso, ElasticNet...).
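A small illustration of how the two formats complement each other (shapes and density made up):

```python
import scipy.sparse as sp

X = sp.random(100000, 1000, density=0.01)

X_csr = X.tocsr()  # cheap row slicing: mini-batches for partial_fit
X_csc = X.tocsc()  # cheap column slicing: coordinate descent over features

batch = X_csr[:1000]    # a contiguous block of rows, cheap in CSR
column = X_csc[:, 42]   # a single feature column, cheap in CSC
```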

@ogrisel

ogrisel commented Jan 25, 2018

For storage you could write a custom dask function that relies on joblib.dump to efficiently serialize scipy.sparse.csr_matrix blocks in a filesystem folder.
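A minimal sketch of that idea (the block and file name are hypothetical):

```python
import joblib
import scipy.sparse as sp

block = sp.random(10000, 5000, density=0.001, format="csr")

# One file per dask block; joblib serializes the underlying numpy
# arrays of the csr_matrix efficiently.
joblib.dump(block, "block-00000.joblib")
restored = joblib.load("block-00000.joblib")
```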

@ogrisel

ogrisel commented Jan 25, 2018

Or write your own by using numpy.save / numpy.load on the three underlying numpy arrays: sparse_data_block.indices, sparse_data_block.data, and sparse_data_block.indptr.
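Sketched out with hypothetical helper names (the shape also needs to be saved to reconstruct the matrix):

```python
import numpy as np
import scipy.sparse as sp

def save_csr(prefix, block):
    # Persist the three arrays that define a CSR matrix, plus its shape.
    np.save(prefix + "_data.npy", block.data)
    np.save(prefix + "_indices.npy", block.indices)
    np.save(prefix + "_indptr.npy", block.indptr)
    np.save(prefix + "_shape.npy", np.array(block.shape))

def load_csr(prefix):
    data = np.load(prefix + "_data.npy")
    indices = np.load(prefix + "_indices.npy")
    indptr = np.load(prefix + "_indptr.npy")
    shape = tuple(np.load(prefix + "_shape.npy"))
    return sp.csr_matrix((data, indices, indptr), shape=shape)
```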

@mrocklin
Member

cc @hameerabbasi, dev on the mrocklin/sparse library

@rth
Contributor Author

rth commented Jan 25, 2018

Thanks for the response. The discussion about sparse ndarray in scipy looks quite promising indeed!

For storage you could write a custom dask function that relies on joblib.dump [..] or write your own by using numpy.save

Yes, I imagine that wouldn't be too difficult to do.

In the shorter term it should be possible to fix the most common discrepancies directly in dask. I found one in dask/dask#2842 ...

Good to know, thanks.

@hameerabbasi

I'm a bit hesitant to add CSR/CSC in mrocklin/sparse, simply because they limit us to 2-D (N-D CSR/CSC is... murky to say the least), and I'm worried about growing complexity. However, we can achieve near-CSR level performance if mrocklin/sparse#60 is fixed (O(log n) vs O(1)).

Currently, we are working with C-style ordering of indices. If we instead added a Fortran-style mode for ordering indices and modified indexing to fit, we could achieve near-CSC-level performance as well.
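A toy example of the two orderings (shape made up): sorting COO coordinates by their C-style linear index groups entries by row, CSR-like, while the Fortran-style index groups them by column, CSC-like.

```python
import numpy as np

rows = np.array([0, 1, 0, 2])
cols = np.array([2, 0, 1, 1])

# C order: the row index varies slowest, so sorting gives (row, col) order.
c_idx = np.ravel_multi_index((rows, cols), (3, 3), order='C')
# Fortran order: the column index varies slowest, giving (col, row) order.
f_idx = np.ravel_multi_index((rows, cols), (3, 3), order='F')

print(np.argsort(c_idx))  # CSR-like traversal of the entries
print(np.argsort(f_idx))  # CSC-like traversal of the entries
```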

If you could reduce the bug in dask/dask#2842 to an underlying bug or missing feature in mrocklin/sparse, I'd be happy to look into it. If it's a bug in dask itself, I'm less willing to dive into it at this point, but I will soon, to help get it integrated.

@mrocklin
Member

Regarding performance, I think that the main operation where you really need CSR/CSC is tensordot, which is pretty critical in many numeric algorithms; for that operation CSR/CSC is probably essential.

If we don't want to host these in mrocklin/sparse then csr_array and csc_array could also exist in scipy.sparse. That would probably help get more eyes on the code and ensure maintenance, but it presents a challenge due to scipy's slow release and development cycle. Alternatively someone could make a separate standalone package.

At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.

@ogrisel

ogrisel commented Jan 25, 2018

CSR and CSC support would be useful to allow for no-copy behavior when calling Cython code from scikit-learn on sparse data blocks.

There are several algorithms in scikit-learn that do very efficient, cache-aligned data access when fed with CSR or CSC data structures.

@hameerabbasi

Interesting... It seems a lot of algorithms have been optimized for CSR/CSC already. I'll see how hard it is to implement CSR/CSC (limited to 2-D) with array semantics. Mostly it's just a matter of studying and re-implementing scipy.sparse code (or maybe just wrapping it may be good enough).

However, indexing would again return COO, since CSR/CSC isn't defined for 1-D.

@mrocklin
Member

Note that the scope of CSR/CSC is non-trivial. You might want to look at algorithms in scipy/sparse/linalg/. This would almost certainly require writing low-level code in Cython/Numba/C/C++.

@mrocklin
Member

If I were doing this under time constraints I would probably try to find a way to modify the existing solution to support the ndarray interface.

However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).

@hameerabbasi

hameerabbasi commented Jan 25, 2018

I plan to keep mrocklin/sparse a mostly data structure related thing as discussed in scipy/scipy#8162. Maybe implement operators, etc, but that's about it. I want to go down to Cython at some point... Might as well be soon.

dot/tensordot/linalg.solve etc., would either be in Scipy or a separate package... Or in mrocklin/sparse if we had help. I plan to keep the data/indices/indptr structure so their code can be easily ported.

@hameerabbasi

To be clear, I'd love to see mrocklin/sparse expand, or be merged into Scipy when stable. I just don't think I can do it alone. :)

@mrocklin
Member

Yeah, you're not alone here. I think that many people on this issue have demonstrated that they're happy to pitch in when relevant. I also suspect that sparse ndarrays will be of relevance to @jcrist in a couple months.

I plan to keep mrocklin/sparse a mostly data structure related thing as discussed in scipy/scipy#8162. Maybe implement operators, etc, but that's about it. I want to go down to Cython at some point... Might as well be soon.

FWIW I'd rather see a single csr ndarray class somewhere that has all relevant computational operations. Splitting this up between repositories seems painful to me. I'd rather wait until a clear solution is agreed upon instead of launching into this short term.

@jakirkham
Member

I wonder if it would be possible to extend CSR/CSC to N-D. It seems that all it should be is an ordering of which axes are prioritized in the sparse representation. Does that seem correct? If so, it seems like it should be possible to implement this in sparse in a way where it remains general purpose. With CSR/CSC, this amounts to one of two orderings involving axes 0 and 1.

@mrocklin
Member

Yes, this is possible. There are combinatorially many such layouts, but presumably we would store some hierarchy of axes that we cared about. I wouldn't make this choice until some important workloads demonstrated its value (like large sparse iterative factorization algorithms). I also wouldn't want to think about how operations like solve generalize under these conditions.

Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this; this could be naive though.

@rth
Contributor Author

rth commented Jan 25, 2018

I would also be interested in helping a bit, once it is decided where and in what way this should be implemented. (Though my bandwidth before summer won't be great.)

At this point I think that downstream library maintainers, like the scikit-learn community, should probably decide what they would prefer to depend upon.

I would think it's more a question for dask-ml. scikit-learn itself currently has a minimum dependency of scipy 0.13.3 and works reasonably well around the limitations of the matrix interface, which means that this development won't impact it in the near future either way.

dot/tensordot/linalg.solve etc., would either be in Scipy or a separate package...

However if I were doing this without serious time constraints then yes, rebuilding that code from scratch might result in significant improvements (and would also be pretty educational I think).

I'd rather wait until a clear solution is agreed upon instead of launching into this short term.

At the same time, in the context of dask-ml, even an implementation that supports basic operations and can convert to/from scipy's csr/csc_matrix without much cost might help solve some of the issues?

Short-to-medium term I think that 2d CSR/CSC with the np.ndarray interface is a very valuable and feasible objective. My hope is that going into the scipy.sparse code and making a few modifications is all that's necessary to achieve this;

+1

@mrocklin
Member

At the same time, in the context of dask-ml, even an implementation that supports basic operations and can convert to/from scipy's csr/csc_matrix without much cost might help solve some of the issues?

Yes, that's true. Short term we've been using the COO implementation in mrocklin/sparse for this. The conversion isn't free, but it's also not that expensive. We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term. There are some short-term-needs vs long-term-goals tradeoffs to make here. I'm not sure what the best path is.
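For reference, a sketch of the round trip being discussed (assuming the conversion helpers in current versions of the sparse package, from_scipy_sparse and tocsr):

```python
import numpy as np
import scipy.sparse as sp
import sparse  # mrocklin/sparse

csr = sp.random(1000, 100, density=0.01, format="csr")

# scipy CSR -> COO: not free, but it mostly reshuffles index arrays.
coo = sparse.COO.from_scipy_sparse(csr)

# ... block-wise work on the COO side ...

# COO -> scipy CSR when handing a block to scikit-learn.
back = coo.tocsr()
assert np.allclose(csr.toarray(), back.toarray())
```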

@hameerabbasi

hameerabbasi commented Jan 25, 2018

It seems that all it should be is an ordering of which axes are prioritized in the sparse representation.

Yes, that is the most general way to put it. Would CSD (compressed sparse dimensions) make sense, covering both CSR and CSC? For CSR the compressed dimension would be 0, and for CSC it would be 1. We could keep a compressed_axes tuple.

Then, CSR/CSC would just inherit from this or be special cases. We would implement all operations exactly as in 2-D CSR without any broadcasting. Tensordot would reduce to dot for the appropriate choice of axes.
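A rough sketch of what building the compressed pointer could look like under that scheme (the function and argument names are hypothetical):

```python
import numpy as np

def csd_indptr(coords, shape, compressed_axes):
    """CSR-style indptr over an arbitrary choice of compressed axes.

    coords is a (ndim, nnz) array of COO coordinates, assumed sorted
    so that the compressed axes vary slowest. compressed_axes=(0,)
    reproduces CSR for a 2-D array; compressed_axes=(1,) gives CSC.
    """
    comp_shape = tuple(shape[ax] for ax in compressed_axes)

    # Linearize the compressed axes into a single "row" index.
    rows = np.ravel_multi_index(
        tuple(coords[ax] for ax in compressed_axes), comp_shape)

    # indptr[i]:indptr[i+1] spans the entries of compressed "row" i.
    counts = np.bincount(rows, minlength=int(np.prod(comp_shape)))
    return np.concatenate([[0], np.cumsum(counts)])
```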

@rth
Contributor Author

rth commented Jan 25, 2018

We could also make a thin wrapper around scipy.sparse that just converted every operation back and forth, but this would probably be a wasted project long-term.

Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which would expect csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?

@mrocklin
Member

Well, we would still need to be able to do the conversion in any case (e.g. to communicate with scikit-learn, which would expect csr_matrix). Can't we start with this and progressively re-implement operations until we reach a final implementation that no longer uses scipy.sparse?

This is feasible. The only concern would be that we do a lot of work and never reach the functionality of scipy.sparse. It may be that copying the entire implementation over and then tweaking a couple of things to stop using the np.matrix interface is significantly easier to do.

To be clear, people can do whatever they want to do. I have thoughts on how I would or would not spend my time here, but that doesn't stop others from exploring this topic. I doubt that my thoughts on this topic are entirely correct.

@mrocklin
Member

@hameerabbasi's and @jakirkham's thoughts on CSD might be a decent start. This code seems orthogonal to the logic in scipy.sparse (and so would presumably not be thrown away in the future) and might resolve the short-term need.
