# Basics of Sparse Matrices

In [294]:
import scipy.sparse as sp
import numpy as np
import sklearn

## COO Matrix


The docstring of `coo_matrix` (shown by `help(sp.coo_matrix)`) says the following:

Sparse matrices can be used in arithmetic operations: they support
addition, subtraction, multiplication, division, and matrix power.

- Advantages of the COO format
    - facilitates fast conversion among sparse formats
    - permits duplicate entries (see example)
    - very fast conversion to and from CSR/CSC formats

- Disadvantages of the COO format
    - does not directly support:
        + arithmetic operations
        + slicing

- Intended Usage
    - COO is a fast format for constructing sparse matrices
    - Once a matrix has been constructed, convert to CSR or
      CSC format for fast arithmetic and matrix vector operations
    - By default when converting to CSR or CSC format, duplicate (i,j)
      entries will be summed together.  This facilitates efficient
      construction of finite element matrices and the like. (see example)


In [299]:
X_coo = sp.coo_matrix([[0, 2, 0],
                       [0, 0, 3],
                       [0, 0, 0],
                       [7, 6, 0]],
                       dtype=np.float64)

In [300]:
print(X_coo.data)
print(X_coo.row)
print(X_coo.col)

[2. 3. 7. 6.]
[0 1 3 3]
[1 2 0 1]
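
The docstring notes above mention that COO permits duplicate entries and that they are summed on conversion to CSR or CSC. A minimal sketch of that behaviour follows; the names `dup_coo` and `dup_csr` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import scipy.sparse as sp

# Two entries share the coordinate (0, 1); COO stores both of them as-is.
row = np.array([0, 0, 1, 3, 3])
col = np.array([1, 1, 2, 0, 1])
val = np.array([1.0, 1.0, 3.0, 7.0, 6.0])

dup_coo = sp.coo_matrix((val, (row, col)), shape=(4, 3))
print(dup_coo.nnz)        # number of stored entries, duplicates included

# Converting to CSR sums the duplicate (0, 1) entries into a single value.
dup_csr = dup_coo.tocsr()
print(dup_csr.nnz)
print(dup_csr.toarray())
```

This is why COO is convenient for assembling a matrix from many partial contributions before converting to a format suited to arithmetic.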


## LIL Matrix


The docstring of `lil_matrix` (shown by `help(sp.lil_matrix)`) says the following:


Sparse matrices can be used in arithmetic operations: they support
addition, subtraction, multiplication, division, and matrix power.

- Advantages of the LIL format
    - supports flexible slicing
    - changes to the matrix sparsity structure are efficient

- Disadvantages of the LIL format
    - arithmetic operations LIL + LIL are slow (consider CSR or CSC)
    - slow column slicing (consider CSC)
    - slow matrix vector products (consider CSR or CSC)

- Intended Usage
    - LIL is a convenient format for constructing sparse matrices
    - once a matrix has been constructed, convert to CSR or
      CSC format for fast arithmetic and matrix vector operations
    - consider using the COO format when constructing large matrices

- Data Structure
    - An array (``self.rows``) of rows, each of which is a sorted
      list of column indices of non-zero elements.
    - The corresponding nonzero values are stored in similar
      fashion in ``self.data``.



In [275]:
X_lil = sp.lil_matrix([[0, 2, 0],
                       [0, 0, 3],
                       [0, 0, 0],
                       [7, 6, 0]],
                       dtype=np.float64)

A LIL matrix stores its elements in two arrays of per-row lists:
- `.data`: for each row, the list of its non-zero values.
- `.rows`: for each row, the sorted list of column indices of those non-zero values.


In [276]:
print(X_lil.data)
print(X_lil.rows)

[list([2.0]) list([3.0]) list([]) list([7.0, 6.0])]
[list([1]) list([2]) list([]) list([0, 1])]
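
Since LIL is described above as a convenient format for construction, the following sketch builds the same matrix entry by entry and then converts it to CSR. The names `built` and `built_csr` are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Start from an empty 4x3 LIL matrix and fill in entries one at a time.
built = sp.lil_matrix((4, 3), dtype=np.float64)
built[0, 1] = 2
built[1, 2] = 3
built[3, 0] = 7
built[3, 1] = 6

# Each assignment only touches the Python lists of the affected row.
print(built.rows)
print(built.data)

# Convert once construction is finished, as the docstring recommends.
built_csr = built.tocsr()
print(built_csr.toarray())
```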


One can access row or column `k` with `.getrow(k)` or `.getcol(k)` respectively.

In [277]:
X_lil.getrow(1).todense()

matrix([[0., 0., 3.]])

In [279]:
X_lil.getcol(1).todense()

matrix([[2.],
        [0.],
        [0.],
        [6.]])

In [282]:
data = X_lil.data
rows = X_lil.rows

## CSR Matrix

The docstring of `csr_matrix` (shown by `help(sp.csr_matrix)`) says the following:

Sparse matrices can be used in arithmetic operations: they support
addition, subtraction, multiplication, division, and matrix power.

- Advantages of the CSR format
  - efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
  - efficient row slicing
  - fast matrix vector products

- Disadvantages of the CSR format
  - slow column slicing operations (consider CSC)
  - changes to the sparsity structure are expensive (consider LIL or DOK)

In [257]:
X_csr = sp.csr_matrix([[0, 2, 0],
                       [0, 0, 3],
                       [0, 0, 0],
                       [7, 6, 0]],
                       dtype=np.float64)

A CSR matrix stores the data in three arrays:

- `.data`: all non-zero values, ordered row by row.
- `.indices`: the column index of each value in `.data`.
- `.indptr`: the row boundaries; the entries of row `i` are `data[indptr[i]:indptr[i+1]]`, with column indices `indices[indptr[i]:indptr[i+1]]`.


In [258]:
print(X_csr.data)
print(X_csr.indices)
print(X_csr.indptr)

[2. 3. 7. 6.]
[1 2 0 1]
[0 1 2 2 4]
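
To make the role of `indptr` concrete, here is a small sketch that rebuilds the dense matrix by hand from the three arrays. The names `M_csr` and `rebuilt` are illustrative assumptions; the matrix is the same example as above.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 2, 0],
                  [0, 0, 3],
                  [0, 0, 0],
                  [7, 6, 0]], dtype=np.float64)
M_csr = sp.csr_matrix(dense)

rebuilt = np.zeros(M_csr.shape)
for i in range(M_csr.shape[0]):
    # indptr[i] and indptr[i + 1] delimit row i inside data and indices.
    start, end = M_csr.indptr[i], M_csr.indptr[i + 1]
    cols = M_csr.indices[start:end]
    vals = M_csr.data[start:end]
    print(f"row {i}: columns {cols.tolist()}, values {vals.tolist()}")
    rebuilt[i, cols] = vals
```

An empty row (row 2 here) shows up as two equal consecutive values in `indptr`.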


The natural way to index a CSR matrix is by row.

In [267]:
X_csr[0].toarray()

array([[0., 2., 0.]])

A CSR matrix can be built directly from `data`, `indices`, and `indptr`:

In [266]:
data = X_csr.data
indices = X_csr.indices
indptr = X_csr.indptr

sp.csr_matrix((data, indices, indptr)).todense()

matrix([[0., 2., 0.],
        [0., 0., 3.],
        [0., 0., 0.],
        [7., 6., 0.]])

## CSC Matrix


The docstring of `csc_matrix` (shown by `help(sp.csc_matrix)`) says the following:

Sparse matrices can be used in arithmetic operations: they support
addition, subtraction, multiplication, division, and matrix power.

- Advantages of the CSC format
    - efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
    - efficient column slicing
    - fast matrix vector products (CSR, BSR may be faster)

- Disadvantages of the CSC format
  - slow row slicing operations (consider CSR)
  - changes to the sparsity structure are expensive (consider LIL or DOK)



In [270]:
X_csc = sp.csc_matrix([[0, 2, 0],
                       [0, 0, 3],
                       [0, 0, 0],
                       [7, 6, 0]],
                       dtype=np.float64)

In [271]:
print(X_csc.data)
print(X_csc.indices)
print(X_csc.indptr)

[7. 2. 6. 3.]
[3 0 3 1]
[0 1 3 4]
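
In CSC the same three arrays are organised by column: `indices` holds row indices and `indptr` delimits columns. The sketch below decodes one column by hand; the helper `csc_column` and the name `M_csc` are hypothetical and introduced only for illustration.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 2, 0],
                  [0, 0, 3],
                  [0, 0, 0],
                  [7, 6, 0]], dtype=np.float64)
M_csc = sp.csc_matrix(dense)

def csc_column(m, j):
    """Return column j of a CSC matrix as a dense 1-D array."""
    # indptr[j] and indptr[j + 1] delimit column j inside data and indices.
    start, end = m.indptr[j], m.indptr[j + 1]
    column = np.zeros(m.shape[0], dtype=m.dtype)
    # In CSC, indices contains the row index of each stored value.
    column[m.indices[start:end]] = m.data[start:end]
    return column

col1 = csc_column(M_csc, 1)
print(col1)
```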


The natural way to index a CSC matrix is by column.

In [272]:
X_csc[:,0].toarray()

array([[0.],
       [0.],
       [0.],
       [7.]])

A CSC matrix can be built directly from `data`, `indices`, and `indptr`:


In [273]:
data = X_csc.data
indices = X_csc.indices
indptr = X_csc.indptr

sp.csc_matrix((data, indices, indptr)).todense()

matrix([[0., 2., 0.],
        [0., 0., 3.],
        [0., 0., 0.],
        [7., 6., 0.]])

## Getting data from a sparse matrix

One can get rows/columns of a sparse matrix with:

- integers (including negative integers for 'starting from the end')
- a slice
- a list of row indices
- a Boolean list

In [210]:
idx = 0
print(X_lil[idx].toarray())
print(X_csr[idx].toarray())
print(X_csc[idx].toarray())

idx = -1
print(X_lil[idx].toarray())
print(X_csr[idx].toarray())
print(X_csc[idx].toarray())

[[0. 2. 0.]]
[[0. 2. 0.]]
[[0. 2. 0.]]
[[7. 6. 0.]]
[[7. 6. 0.]]
[[7. 6. 0.]]


In [215]:
idx = slice(0,2)
print(X_lil[idx].toarray())
print(X_csr[idx].toarray())
print(X_csc[idx].toarray())

[[0. 2. 0.]
 [0. 0. 3.]]
[[0. 2. 0.]
 [0. 0. 3.]]
[[0. 2. 0.]
 [0. 0. 3.]]


In [206]:
idx = [1,3]
print(X_lil[idx].toarray())
print(X_csr[idx].toarray())
print(X_csc[idx].toarray())

[[0. 0. 3.]
 [7. 6. 0.]]
[[0. 0. 3.]
 [7. 6. 0.]]
[[0. 0. 3.]
 [7. 6. 0.]]


In [208]:
idx = [True, True, False, True]
print(X_lil[idx].toarray())
print(X_csr[idx].toarray())
print(X_csc[idx].toarray())

[[0. 2. 0.]
 [0. 0. 3.]
 [7. 6. 0.]]
[[0. 2. 0.]
 [0. 0. 3.]
 [7. 6. 0.]]
[[0. 2. 0.]
 [0. 0. 3.]
 [7. 6. 0.]]
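
The cells above only select rows. As a sketch, the same kinds of index can be placed in the second position to select columns; the matrix `M` and the result names are assumptions for illustration, and only integer, slice, and list indices are shown.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 2, 0],
                  [0, 0, 3],
                  [0, 0, 0],
                  [7, 6, 0]], dtype=np.float64)
M = sp.csr_matrix(dense)

col_int = M[:, 1].toarray()         # a single column, kept as a 4x1 matrix
col_slice = M[:, 0:2].toarray()     # the first two columns
col_list = M[:, [0, 2]].toarray()   # an explicit list of column indices

print(col_int)
print(col_slice)
print(col_list)
```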


## Slicing benchmarks

Not all operations that retrieve data from a sparse matrix have the same cost.

In [165]:
from sklearn import datasets

X = datasets.fetch_20newsgroups_vectorized()['data']

In [301]:
X_csc = sp.csc_matrix(X)
X_csr = sp.csr_matrix(X)
X_lil = sp.lil_matrix(X)
X_coo = sp.coo_matrix(X)

Note that you can't do `X_coo[1]`: the COO format does not support indexing or slicing.
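
A common workaround, sketched here on a small matrix, is to convert the COO matrix to CSR before taking rows, or to CSC before taking columns. The names `small_coo`, `row1`, and `col1` are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 2, 0],
                  [0, 0, 3],
                  [0, 0, 0],
                  [7, 6, 0]], dtype=np.float64)
small_coo = sp.coo_matrix(dense)

# Convert first, then index in the format that suits the access pattern.
row1 = small_coo.tocsr()[1].toarray()
col1 = small_coo.tocsc()[:, 1].toarray()

print(row1)
print(col1)
```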

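The timings below show:
- CSC is the fastest format for retrieving columns (about 53 µs, against about 2 ms for CSR and 8.5 ms for LIL).
- CSR is far faster than CSC for retrieving rows (about 56 µs against 2.27 ms); LIL row access was faster still in this run (about 27 µs).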


In [308]:
idx = [1, 2, 100]

t_csc_row = %timeit -o  aux = X_csc[idx]
t_csr_row = %timeit -o  aux = X_csr[idx]
t_lil_row = %timeit -o  aux = X_lil[idx]

t_csc_col = %timeit -o  aux = X_csc[:,idx]
t_csr_col = %timeit -o  aux = X_csr[:,idx]
t_lil_col = %timeit -o  aux = X_lil[:,idx]

2.27 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
55.7 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
27.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
53.3 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
2.04 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.47 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [244]:
print('Get rows')
print(f'\t X_csr[idx]  --> {round(t_csr_row.average,6)}')
print(f'\t X_csc[idx]  --> {round(t_csc_row.average,6)}')
print(f'\t X_lil[idx]  --> {round(t_lil_row.average,6)}')
# COO does not support indexing, so there is no X_coo[idx] timing to report.


print('Get cols')
print(f'\t t_csr_col = {round(t_csr_col.average,6)}')
print(f'\t t_csc_col = {round(t_csc_col.average,6)}')
print(f'\t t_lil_col = {round(t_lil_col.average,6)}')

Get rows
	 X_csr[idx]  --> 5.6e-05
	 X_csc[idx]  --> 0.002266
	 X_lil[idx]  --> 2.7e-05
Get cols
	 t_csr_col = 0.002036
	 t_csc_col = 5.4e-05
	 t_lil_col = 0.008042
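
The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch, a similar comparison can be run in a plain Python script with the standard `timeit` module. The random 2000x2000 matrix, the repetition count, and the names `R_csr`, `R_csc`, `R_lil`, and `timings` are assumptions chosen for illustration, so the absolute numbers will differ from those reported above.

```python
import timeit

import scipy.sparse as sp

# A random sparse matrix stands in for the 20 newsgroups data (assumption).
R_csr = sp.random(2000, 2000, density=0.01, format="csr")
R_csc = R_csr.tocsc()
R_lil = R_csr.tolil()

idx = [1, 2, 100]
repeats = 50
timings = {}
for name, m in [("csr", R_csr), ("csc", R_csc), ("lil", R_lil)]:
    # Average time in seconds of one row selection and one column selection.
    timings[name, "rows"] = timeit.timeit(lambda: m[idx], number=repeats) / repeats
    timings[name, "cols"] = timeit.timeit(lambda: m[:, idx], number=repeats) / repeats

for (name, axis), seconds in timings.items():
    print(f"{name} {axis}: {seconds:.6f} s")
```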
