In [1]:
from GCRS2 import CSR,CSC
import sparse
import numpy as np

The csr/csc arrays shown here are loosely based on the GCRS/GCCS formats presented in [Shaikh et al. 2015](https://ieeexplore.ieee.org/document/7237032). However, I've used a different linearization function, one in which the underlying 2D sparse matrix is C-contiguous and consistent with numpy's `reshape` method. As you'll see, csr/csc offers much better compression than coo. In principle, it should be possible to use these arrays anywhere that expects the numpy ndarray API, and with anything that works with scipy.sparse matrices. Dask, scikit-learn, and xarray are all good candidates for this.

Currently, csr/csc is much faster than coo for indexing 2D arrays. For arrays with more dimensions, the runtime is much longer because the code that transforms n-dimensional coordinates to 2D coordinates is a little sloppy and runs in native Python. Once this is fixed, I suspect that csr/csc will be faster than coo for those arrays as well. The csc indexing still has some bugs that I'm working out.

This codebase is very young and almost everything is likely to change. I'm hoping that, when it is ready, this code might be merged into pydata/sparse.
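To make the linearization idea concrete, here is a minimal sketch of mapping n-dimensional coordinates to 2D coordinates so that the result agrees with numpy's C-order `reshape`. It is not the GCRS2 implementation: the helper name `nd_to_2d_coords` and the `n_row_axes` parameter (how many leading axes are folded into the row index) are hypothetical and chosen only for illustration.

```python
import math
import numpy as np

def nd_to_2d_coords(coords, shape, n_row_axes):
    """Map nd coordinates to (row, col) coordinates of a C-order 2D reshape.

    coords has shape (ndim, n_points). The first n_row_axes axes are folded
    into the row index and the remaining axes into the column index.
    n_row_axes is a hypothetical parameter with 1 <= n_row_axes < ndim.
    """
    coords = np.asarray(coords)
    row_shape, col_shape = shape[:n_row_axes], shape[n_row_axes:]
    rows = np.ravel_multi_index(tuple(coords[:n_row_axes]), row_shape)
    cols = np.ravel_multi_index(tuple(coords[n_row_axes:]), col_shape)
    return rows, cols

shape = (4, 3, 5)
dense = np.arange(math.prod(shape)).reshape(shape)

# Three points, one per column: (0, 2, 4), (1, 0, 2), (3, 1, 0)
coords = np.array([[0, 1, 3],
                   [2, 0, 1],
                   [4, 2, 0]])

rows, cols = nd_to_2d_coords(coords, shape, n_row_axes=2)

# The same elements are found in the reshaped 2D array.
flat2d = dense.reshape(math.prod(shape[:2]), math.prod(shape[2:]))
same = np.array_equal(flat2d[rows, cols], dense[tuple(coords)])
print(same)
```

Because both index groups are raveled in C order, looking up `(rows, cols)` in the reshaped array returns the same elements as looking up the original coordinates in the n-dimensional array, which is the property the paragraph above relies on.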

# Compression

In [2]:
# create a random 2D sparse array with 20% nonzero entries
coo = sparse.random((100,100),density=.2)
dense = coo.todense()
csr = CSR(coo)
csc = CSC(coo)
print('no. bytes dense: ',dense.nbytes,' storage ratio: ', dense.nbytes/dense.nbytes)
print('no. bytes coo: ',coo.nbytes, ' storage ratio: ', coo.nbytes/dense.nbytes)
print('no. bytes csr: ',csr.nbytes,' storage ratio: ', csr.nbytes/dense.nbytes)
print('no. bytes csc: ',csc.nbytes,' storage ratio: ', csc.nbytes/dense.nbytes)

no. bytes dense:  80000  storage ratio:  1.0
no. bytes coo:  48000  storage ratio:  0.6
no. bytes csr:  32404  storage ratio:  0.40505
no. bytes csc:  32404  storage ratio:  0.40505


In [3]:
# create a random 3D sparse array with 20% nonzero entries
coo = sparse.random((100,100,100),density=.2)
dense = coo.todense()
csr = CSR(coo)
csc = CSC(coo)
print('no. bytes dense: ',dense.nbytes,' storage ratio: ', dense.nbytes/dense.nbytes)
print('no. bytes coo: ',coo.nbytes, ' storage ratio: ', coo.nbytes/dense.nbytes)
print('no. bytes csr: ',csr.nbytes,' storage ratio: ', csr.nbytes/dense.nbytes)
print('no. bytes csc: ',csc.nbytes,' storage ratio: ', csc.nbytes/dense.nbytes)

no. bytes dense:  8000000  storage ratio:  1.0
no. bytes coo:  6400000  storage ratio:  0.8
no. bytes csr:  3240004  storage ratio:  0.4050005
no. bytes csc:  3200404  storage ratio:  0.4000505


In [4]:
# create a random 4D sparse array with 20% nonzero entries
coo = sparse.random((50,50,50,50),density=.2)
dense = coo.todense()
csr = CSR(coo)
csc = CSC(coo)
print('no. bytes dense: ',dense.nbytes,' storage ratio: ', dense.nbytes/dense.nbytes)
print('no. bytes coo: ',coo.nbytes, ' storage ratio: ', coo.nbytes/dense.nbytes)
print('no. bytes csr: ',csr.nbytes,' storage ratio: ', csr.nbytes/dense.nbytes)
print('no. bytes csc: ',csc.nbytes,' storage ratio: ', csc.nbytes/dense.nbytes)

no. bytes dense:  50000000  storage ratio:  1.0
no. bytes coo:  50000000  storage ratio:  1.0
no. bytes csr:  20010004  storage ratio:  0.40020008
no. bytes csc:  20010004  storage ratio:  0.40020008


As we can see, unlike the coo format, the compression ratio for csr/csc barely changes as dimensions are added: it stays near 0.4 in all three examples, while the coo ratio grows from 0.6 in 2D to 0.8 in 3D and 1.0 in 4D. In most cases, csr/csc will also compress much better than coo.
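The byte counts above can be reproduced with simple arithmetic, which also shows why coo grows with dimensionality while csr/csc does not. The sketch below is a back-of-the-envelope model, not code from GCRS2 or pydata/sparse. The widths (8-byte values, 8-byte indices, 4-byte index pointers) and the 2D shapes used for the compressed formats are assumptions inferred from the printed totals, and the helper names are hypothetical.

```python
import math

def coo_nbytes(nnz, ndim, index_bytes=8, value_bytes=8):
    # coo stores one coordinate per dimension plus one value per nonzero
    return nnz * (ndim * index_bytes + value_bytes)

def compressed_nbytes(nnz, n_major, index_bytes=8, value_bytes=8, indptr_bytes=4):
    # csr/csc store one index and one value per nonzero, plus an index
    # pointer array with one entry per row (csr) or column (csc), plus one
    return nnz * (index_bytes + value_bytes) + (n_major + 1) * indptr_bytes

# (nd shape, assumed number of rows, assumed number of columns of the 2D form)
cases = [
    ((100, 100), 100, 100),
    ((100, 100, 100), 10000, 100),
    ((50, 50, 50, 50), 2500, 2500),
]

for shape, n_rows, n_cols in cases:
    size = math.prod(shape)
    nnz = size * 2 // 10          # density of 0.2
    dense_bytes = size * 8        # float64 dense array
    coo = coo_nbytes(nnz, len(shape))
    csr = compressed_nbytes(nnz, n_rows)
    csc = compressed_nbytes(nnz, n_cols)
    print(shape, coo / dense_bytes, csr / dense_bytes, csc / dense_bytes)
```

Under these assumptions, coo costs `(ndim + 1) * 8` bytes per nonzero, so at a density of 0.2 its ratio is `0.2 * (ndim + 1)`. The compressed formats cost 16 bytes per nonzero regardless of `ndim`, plus a comparatively small index pointer array, which keeps the ratio close to 0.4.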