# Sorting values in a sparse matrix


Sorting the values of a sparse matrix is not trivial. We want efficient equivalents of `np.sort` and `np.argsort`.

In [13]:
import scipy.sparse as sp
import numpy as np

Let us consider the following example:

In [35]:
np.random.seed(123)
import scipy.sparse as sp
n_samples = 20_000_000
x = sp.random(1, n_samples, density=0.01, format='csr')

In [36]:
x.data

array([0.62885824, 0.36998987, 0.01654456, ..., 0.79507112, 0.55035835,
       0.2160247 ])

In [37]:
x.indices

array([     113,      137,      245, ..., 19999745, 19999843, 19999883],
      dtype=int32)

In [38]:
x[0, x.indices[0]] 

0.6288582392500274

## Efficiently converting a sparse vector to a dense vector

In many applications we might want to cast one row or column of a sparse matrix as a dense vector. This can be done with `todense()` followed by either `ravel()` or `flatten()`.

In [86]:
%%time
x_dense = np.asarray(x.todense()).flatten()

CPU times: user 46 ms, sys: 52.6 ms, total: 98.6 ms
Wall time: 113 ms


In [98]:
%%time
x_dense = np.asarray(x.todense()).ravel()

CPU times: user 15.9 ms, sys: 3.17 ms, total: 19.1 ms
Wall time: 19.7 ms


In [106]:
%%timeit
x.todense()

5.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**Note that `.ravel()` is faster than `.flatten()` for converting the 2D matrix to a 1D vector, and both produce the same output vector.**
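A plausible explanation for the gap, which the timings alone do not prove, is that `.flatten()` always returns a copy while `.ravel()` returns a view whenever the memory layout allows it. The sketch below checks this on a small array; `a`, `r` and `f` are illustrative names that do not appear in the notebook.

```python
import numpy as np

# A small contiguous 2D array standing in for the output of x.toarray()
a = np.arange(6, dtype=float).reshape(1, 6)

r = a.ravel()    # a view on a's buffer, since a is contiguous
f = a.flatten()  # always a freshly allocated copy

print(np.shares_memory(a, r))  # True: no data was copied
print(np.shares_memory(a, f))  # False: the data was copied
print(np.array_equal(r, f))    # True: same values either way
```

Because `r` is a view, writing to it also changes `a`; use `.flatten()` or `.copy()` when an independent vector is needed.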

In [109]:
%%timeit
x.toarray().flatten()

11.5 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [110]:
%%timeit
x.toarray().ravel()

5.49 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [158]:
x_ravel = x.toarray().ravel()
x_flatten = x.toarray().flatten()

np.testing.assert_almost_equal(x_ravel, x_flatten)

## Efficiently finding the top_k values of a sparse vector (cast as dense)

We can sort all the indices of the array and take the positions of the `top_k` largest values.
This is slow, because it sorts the whole array.

In [290]:
top_k = 10

In [246]:
%%timeit
top_k_argsort =  x_dense.argsort()[-top_k:]

600 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We can instead partially sort `-x_dense` so that only its first `top_k` (here 10) positions are in order. Note that this allocates `-x_dense` at runtime, which is wasteful.

In [279]:
%%timeit
top_k_argpartition_allocating = np.argpartition(-x_dense, range(top_k))[0:top_k][::-1]

345 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


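The fastest of the three approaches consists of calling `np.argpartition` with a negative `kth`, which moves the `top_k` largest values to the end of the array, and then sorting only those `top_k` values: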

In [350]:
def get_top_k(x_dense, top_k):
    top_k_argpartition = np.argpartition(x_dense, -top_k)[-top_k:]
    return top_k_argpartition[np.argsort(x_dense[top_k_argpartition])]

In [351]:
%%timeit
top_k_argpartition = get_top_k(x_dense, top_k)

101 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
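To make the two steps of `get_top_k` concrete, here is a minimal sketch on a five-element vector. The function is copied from the cell above; the vector `v` is a made-up example, not data from the notebook.

```python
import numpy as np

def get_top_k(x_dense, top_k):
    # Step 1: indices of the top_k largest values, in arbitrary order
    top_k_argpartition = np.argpartition(x_dense, -top_k)[-top_k:]
    # Step 2: order just those top_k indices by their values (ascending)
    return top_k_argpartition[np.argsort(x_dense[top_k_argpartition])]

v = np.array([0.3, 0.9, 0.1, 0.7, 0.5])
idx = get_top_k(v, 2)

print(idx)     # [3 1]
print(v[idx])  # [0.7 0.9]
```

The result is in ascending order of value, matching what `v.argsort()[-2:]` would give, but only the last `top_k` entries are ever fully sorted.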


We can check that all three approaches return the same result:

In [352]:
top_k_argsort =  x_dense.argsort()[-top_k:]
top_k_argpartition_allocating = np.argpartition(-x_dense, range(top_k))[0:top_k][::-1]
top_k_argpartition = get_top_k(x_dense, top_k)

np.testing.assert_almost_equal(top_k_argsort, top_k_argpartition_allocating)
np.testing.assert_almost_equal(top_k_argsort, top_k_argpartition)
np.testing.assert_almost_equal(top_k_argpartition_allocating, top_k_argpartition)

## Efficiently finding the top_k values of a sparse vector (no cast as dense)

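Note that the lowest time achieved by the previous implementation does not capture the whole process, since it started from a vector that was already dense.

We can rethink the solution to get the `top_k` indices without casting to dense. For reference, this is the cost of the dense route when the conversion is included: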

In [353]:
%%time
x_dense = np.asarray(x.todense()).flatten()
top_k_argpartition = get_top_k(x_dense, top_k)

CPU times: user 138 ms, sys: 77.3 ms, total: 216 ms
Wall time: 212 ms


Note that `x.indices` contains the columns (positions, since `x` is a row vector) where the nonzero data is stored.

Therefore, we can find the positions in `x.data` that hold the highest values (an argsort of `x.data`) and then look those positions up in `x.indices` to know which features of `x` contain the highest values.

In [377]:
x[0,100:120].todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.62885824, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ]])

In [379]:
x.indices

array([     113,      137,      245, ..., 19999745, 19999843, 19999883],
      dtype=int32)

In [383]:
x[0,113]

0.6288582392500274
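The same mapping can be traced by hand on a tiny matrix. This is a sketch with made-up values; `small`, `pos` and `cols` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import scipy.sparse as sp

# One row with three nonzero entries, at columns 1, 3 and 5
dense = np.array([[0.0, 0.4, 0.0, 0.9, 0.0, 0.2]])
small = sp.csr_matrix(dense)

print(small.data)     # [0.4 0.9 0.2]
print(small.indices)  # [1 3 5]

# Positions within small.data, sorted by value (ascending)
pos = np.argsort(small.data)
# The same positions translated to column indices of the row vector
cols = small.indices[pos]

print(pos)   # [2 0 1]
print(cols)  # [5 1 3]
```

Sorting `small.data` only touches the three stored values, which is why this route avoids the cost of building the dense vector.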

We can reuse `get_top_k` to get the positions within `x.data` of the nonzero entries with the highest values:

In [387]:
get_top_k(x.data, top_k)

array([ 26679,  45389, 137501, 113607,  37437, 166040, 151744,  11038,
        88535,  30927])

Now we need to recover the original column indices by using these positions to select the corresponding entries of `x.indices`, which we store in `indices_from_x`.

Then we also need to verify that the indices found without creating a dense vector are the same as the ones we found before.

In [405]:
indices_from_x_data = get_top_k(x.data, top_k)
indices_from_x = x.indices[indices_from_x_data]

np.testing.assert_almost_equal(top_k_argpartition, indices_from_x)

In [406]:
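def get_top_k_indices(x_sparse, top_k):
    # Positions within x_sparse.data of the top_k largest stored values
    indices_from_x_data = get_top_k(x_sparse.data, top_k)
    # Map those positions back to column indices of x_sparse
    indices_from_x = x_sparse.indices[indices_from_x_data]
    return indices_from_x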

In [407]:
%%timeit 
get_top_k_indices(x, top_k)

2.01 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
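The sparse route only looks at stored values, so it implicitly assumes that the row has at least `top_k` stored entries and that they are all positive (as is the case for `sp.random`, which draws values from [0, 1) by default). The sketch below is a hypothetical variant, not part of the original notebook, that returns fewer indices when the row has fewer than `top_k` stored values; the name `get_top_k_indices_checked` is an assumption made for illustration.

```python
import numpy as np
import scipy.sparse as sp

def get_top_k_indices_checked(x_sparse, top_k):
    """Hypothetical variant of get_top_k_indices for a single-row matrix.

    Assumes every stored value is positive, so implicit zeros can never
    belong to the top_k.
    """
    x_csr = sp.csr_matrix(x_sparse)
    if x_csr.shape[0] != 1:
        raise ValueError("expected a matrix with exactly one row")
    data, indices = x_csr.data, x_csr.indices
    # Never ask for more entries than are actually stored
    k = min(top_k, data.size)
    if k == 0:
        return np.empty(0, dtype=indices.dtype)
    # Same two steps as get_top_k: partition, then sort only k values
    part = np.argpartition(data, -k)[-k:]
    order = part[np.argsort(data[part])]
    return indices[order]

small = sp.csr_matrix(np.array([[0.0, 0.4, 0.0, 0.9, 0.0, 0.2]]))
print(get_top_k_indices_checked(small, 2))   # [1 3]
print(get_top_k_indices_checked(small, 10))  # [5 1 3]
```

If the stored values can be negative or explicitly zero, the implicit zeros would have to be considered as well, which this sketch does not attempt.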
