<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-csr_matrices" data-toc-modified-id="Introduction-to-csr_matrices-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to csr_matrices</a></span></li><li><span><a href="#Filling-csr_matrices" data-toc-modified-id="Filling-csr_matrices-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Filling csr_matrices</a></span></li></ul></div>

## Introduction to csr_matrices

In [1]:
import numpy as np
import scipy as sp
from scipy import sparse

In [2]:
n_words = 1000000
x = np.zeros(n_words)
p = np.random.randint(0,n_words,100)
x[p] = 1.
x.sum()
w = np.random.rand(n_words)
np.dot(w,x)

53.64701369913742

In [3]:
%%timeit
np.dot(w,x)

1.05 ms ± 50.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
x_sp = sp.sparse.csr_matrix((n_words,1), dtype=np.float64)
x_sp[p] = 1.
w_sp = sp.sparse.csr_matrix(w, dtype=np.float64)

(w_sp * x_sp).toarray()


array([[53.6470137]])

Besides the `*` operator used above, there are two APIs to call the dot product.

- `x_sp.dot(y_sp)`

- `sparse.csr_matrix.dot(x_sp, y_sp)`




In [5]:
print(sparse.csr_matrix.dot(w_sp,  x_sp).toarray())
print((w_sp * x_sp).toarray())

[[53.6470137]]
[[53.6470137]]


In [6]:
%%timeit
w_sp * x_sp

2.59 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
%%timeit
sparse.csr_matrix.dot(w_sp,  x_sp)

2.81 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


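In this case the NumPy dot product is faster!

This is unexpected at first sight (a sloppy SciPy dot implementation?). Note, however, that `w_sp` was built from the dense vector `w`, so it stores all 1,000,000 entries and the product cannot skip any of them.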
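A minimal sketch of how one might check where the time goes, assuming a freshly generated `w` and a row-oriented sparse vector (the names `x_row`, `sparse_dense` and `by_indexing` are introduced here only for illustration). It compares the number of stored entries in each operand and computes the same dot product while keeping `w` dense.

```python
import numpy as np
from scipy import sparse

n_words = 1_000_000
rng = np.random.default_rng(0)

# Dense weight vector and a sparse indicator vector with at most 100 ones.
w = rng.random(n_words)
p = np.unique(rng.integers(0, n_words, 100))
x_row = sparse.csr_matrix(
    (np.ones(len(p)), (np.zeros(len(p), dtype=int), p)), shape=(1, n_words)
)

# Converting the dense w to CSR keeps (almost surely) every entry: nothing is saved.
w_sp = sparse.csr_matrix(w)
print(w_sp.nnz, x_row.nnz)

# Sparse x times dense w only needs the stored entries of x.
sparse_dense = x_row.dot(w)[0]

# Same value, computed by indexing the dense vector at the nonzero positions.
by_indexing = w[p].sum()
print(sparse_dense, by_indexing)
```

Timing `x_row.dot(w)` and `w[p].sum()` with `%%timeit` would show whether keeping the weights dense helps on a given machine; no timings are claimed here.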

In [8]:
%time sparse.csr_matrix.dot(w_sp,  x_sp)

CPU times: user 3.45 ms, sys: 458 µs, total: 3.91 ms
Wall time: 4.66 ms


<1x1 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [9]:
%time np.dot(w ,x)

CPU times: user 2.67 ms, sys: 652 µs, total: 3.32 ms
Wall time: 1.73 ms


53.64701369913742

Multiplying by a scalar is about twice as fast in the sparse version

In [10]:
%%timeit
x * 2

3.81 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
x_sp * 2

1.78 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Summing the vector is about 10 times slower in the sparse version here (5.79 ms versus 546 µs)

In [12]:
%%timeit
x_sp.sum()

5.79 ms ± 98.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
%%timeit
x.sum()

546 µs ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
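In the timings above the sparse sum takes longer than the dense one. One plausible reason (an assumption, not something measured in this notebook) is the orientation of the vector: a CSR matrix keeps one row pointer per row, so a column vector of shape `(n_words, 1)` carries a million row pointers even when it holds only 100 values. The sketch below, with illustrative names `x_col` and `x_row`, shows the difference and sums the stored values directly.

```python
import numpy as np
from scipy import sparse

n_words = 1_000_000
idx = np.arange(0, n_words, 10_000)          # 100 distinct positions
ones = np.ones(len(idx))
zeros = np.zeros(len(idx), dtype=int)

# Column vector (n_words x 1): indptr has n_words + 1 entries.
x_col = sparse.csr_matrix((ones, (idx, zeros)), shape=(n_words, 1))

# Row vector (1 x n_words): indptr has only 2 entries.
x_row = sparse.csr_matrix((ones, (zeros, idx)), shape=(1, n_words))

print(len(x_col.indptr), len(x_row.indptr))

# Summing only the stored values avoids walking over the empty rows.
total = x_col.data.sum()
print(total, x_col.sum(), x_row.sum())
```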


Assume each row of `X` represents an example and `w` represents a weight vector

In [14]:
n_examples = 10000

w = np.random.rand(n_words)
w_sp = sp.sparse.csr_matrix(w)

#X = [np.random.randint(0, n_words,100) for i in range(n_examples)]
X_sp = sp.sparse.csr_matrix((n_examples,n_words), dtype=np.float32)

In [75]:
#sp.sparse.csc_matrix([(0,1),(5,1)])

In [None]:
X_sp[0,]

In [21]:
X_sp.shape

(10000, 1000000)

In [24]:
import sys

In [31]:
sys.getsizeof(X_sp), X_sp.shape

(56, (10000, 1000000))

In [30]:
sys.getsizeof(x), x.shape

(8000096, (1000000,))
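Note that for the sparse matrix `sys.getsizeof` only reports the size of the Python wrapper object, not the arrays behind it, so the 56 bytes above understate its real footprint. A small sketch of a fairer comparison follows; `csr_nbytes` is a hypothetical helper written for this example, not a SciPy function.

```python
import sys
import numpy as np
from scipy import sparse

def csr_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

X_empty = sparse.csr_matrix((10_000, 1_000_000), dtype=np.float32)
x_dense = np.zeros(1_000_000)

print(sys.getsizeof(X_empty))   # size of the Python wrapper object only
print(csr_nbytes(X_empty))      # only indptr (one entry per row, plus one)
print(x_dense.nbytes)           # 8 bytes per float64 entry
```

Even counted this way, the empty 10000 x 1000000 sparse matrix is far smaller than a single dense row.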

In [43]:
%time X_sp * w_sp.T

CPU times: user 11 ms, sys: 7.17 ms, total: 18.2 ms
Wall time: 15.9 ms


<10000x1 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [35]:
X_sp.shape

(10000, 1000000)

In [63]:
X = np.random.rand(50,1000000)

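The elementwise multiplication with broadcasting (`X * w`) is slower than the matrix-vector product (`np.dot(X, w)`): it has to allocate and fill a full 50 x 1_000_000 result.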

In [64]:
%time X * w

CPU times: user 165 ms, sys: 163 ms, total: 328 ms
Wall time: 330 ms


array([[0.29851729, 0.22183932, 0.21594377, ..., 0.23025243, 0.10532322,
        0.15672739],
       [0.4552408 , 0.40520724, 0.32506997, ..., 0.05888748, 0.11449384,
        0.13660899],
       [0.60886964, 0.31156141, 0.09371483, ..., 0.18798202, 0.07817784,
        0.10897365],
       ...,
       [0.6521215 , 0.5620724 , 0.04058271, ..., 0.11588074, 0.1099936 ,
        0.16785501],
       [0.60955927, 0.40863199, 0.0556231 , ..., 0.25585338, 0.03019211,
        0.05747164],
       [0.55423306, 0.26572602, 0.40236298, ..., 0.13900469, 0.00811532,
        0.16716085]])

Notice the speed of the matrix-vector product

In [60]:
%time np.dot(X, w)

CPU times: user 65.9 ms, sys: 5.28 ms, total: 71.2 ms
Wall time: 36.6 ms


array([250054.88931125, 249906.18435961, 250101.95941785, 250145.36428914,
       250148.14334656, 250089.94028293, 249953.65023363, 249843.38556682,
       250053.46135393, 250320.29861515, 250113.67903926, 250204.36667919,
       250196.13093261, 249977.92529227, 249900.15014637, 249997.98167936,
       249965.6158709 , 249785.8063665 , 250021.66549167, 249908.67911342,
       249720.4749881 , 250062.47809726, 249979.60873519, 250182.67432   ,
       250070.4801522 , 250194.25087878, 250275.52847528, 250158.27006499,
       249935.96516734, 250272.10283646, 249893.31584715, 250226.84198435,
       249832.27331275, 249972.03508329, 249971.89136104, 250251.00304417,
       249974.05422721, 250118.86870704, 250020.14347498, 250242.4243867 ,
       250107.99403864, 249944.15918611, 250011.17170338, 249911.10053004,
       250162.57003967, 249748.11522963, 250135.05445333, 249909.84263588,
       249688.33123337, 249860.04455445])

The crucial part is memory: just 50 documents with 1_000_000 features already need 400 MB as a dense float64 array.



In [73]:
X.shape

(50, 1000000)

In [74]:
byte_to_MB =  0.000001
sys.getsizeof(X) * byte_to_MB

400.000112

This does not scale for bigger datasets.

- If we want 500 examples we will need 4 GB of memory. 
- If we want 5_000 examples we will need 40 GB of memory. 
- If we want 50_000 examples we will need 400 GB of memory. 
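The figures in the list can be reproduced with a few lines of arithmetic. The sketch below also gives a rough CSR estimate for comparison, under two assumptions that are not from the measurements above: about 100 nonzero features per example and 4-byte indices. The function names `dense_gb` and `csr_gb` are illustrative.

```python
import numpy as np

n_features = 1_000_000
bytes_per_value = np.dtype(np.float64).itemsize   # 8 bytes

def dense_gb(n_examples):
    """Memory of a dense float64 matrix in GB (1 GB = 1e9 bytes)."""
    return n_examples * n_features * bytes_per_value / 1e9

def csr_gb(n_examples, nnz_per_row=100, index_bytes=4):
    """Rough CSR estimate: values + column indices + row pointers."""
    nnz = n_examples * nnz_per_row
    values_and_indices = nnz * (bytes_per_value + index_bytes)
    row_pointers = (n_examples + 1) * index_bytes
    return (values_and_indices + row_pointers) / 1e9

for n in (50, 500, 5_000, 50_000):
    print(n, dense_gb(n), csr_gb(n))
```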

64 GB of RAM costs around 500 euros. To fit a simple logistic regression with 50_000 examples using NumPy arrays we would need 400 GB, which is more than 3000 euros in RAM.

Big Data people: "Let's use Spark, it's scalable..."

Then you get `Spark java.lang.OutOfMemoryError: Java heap space`

But Spark is fault tolerant, so the computation starts again

Then you get `Spark java.lang.OutOfMemoryError: Java heap space`

But Spark is fault tolerant, so the computation starts again

Then you get `Spark java.lang.OutOfMemoryError: Java heap space`
...

Then you might start getting anxious and think "I should have listened to that guy talking about sparse matrices"


## Filling csr_matrices

In [83]:
aux = sp.sparse.lil_matrix((2,10))

In [84]:
aux[0,1] = 2

In [85]:
aux

<2x10 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in LInked List format>

In [87]:
aux.toarray()

array([[0., 2., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

We can use a **`lil_matrix`** and fill it by looping over coordinates.

Then we can convert the matrix to `csr_matrix` type.

In [93]:
aux_sp = sp.sparse.csr_matrix(aux)
aux_sp

<2x10 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

Then we can use the `aux_sp` matrix as a standard `csr_matrix`.

This is not the fastest way to build a `csr_matrix`.

It is better to build it directly from the values and their (row, column) coordinates.


In [94]:
aux.toarray()

array([[0., 2., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [95]:
row  = [0] 
col  = [1]
data = [2]

In [105]:
x_csr = sp.sparse.csr_matrix( (data, (row,col)), shape = (2,10))
x_csr

<2x10 sparse matrix of type '<class 'numpy.longlong'>'
	with 1 stored elements in Compressed Sparse Row format>

In [112]:
x_csr.toarray()

array([[0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

When the same (row, column) coordinate appears more than once, `csr_matrix` adds up the corresponding values

In [113]:
row  = [0,0] 
col  = [1,1]
data = [1,1]

In [114]:
x_csr = sp.sparse.csr_matrix( (data, (row,col)), shape = (2,10))
x_csr.toarray()

array([[0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
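The summing of repeated coordinates is handy for building word-count matrices: one can append a `1` for every word occurrence and let the constructor do the counting. The following is a minimal sketch with a made-up toy corpus and vocabulary (`docs` and `vocab` are hypothetical and only serve the example).

```python
import numpy as np
from scipy import sparse

# Hypothetical toy corpus and vocabulary, only for illustration.
docs = [["red", "blue", "red"], ["green"], ["blue", "blue", "green"]]
vocab = {"red": 0, "green": 1, "blue": 2}

row, col, data = [], [], []
for i, doc in enumerate(docs):
    for word in doc:
        row.append(i)            # document index
        col.append(vocab[word])  # word index
        data.append(1)           # one occurrence; repeats are summed

X_counts = sparse.csr_matrix((data, (row, col)), shape=(len(docs), len(vocab)))
print(X_counts.toarray())
# [[2 0 1]
#  [0 1 0]
#  [0 1 2]]
```

No Python-level dictionary of counts is needed: the duplicates for "red" in the first document and "blue" in the third are merged when the matrix is built.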