<img src="./images/galvanize-logo.png" alt="galvanize-logo" align="center" style="width: 200px;"/>

<hr />

### Big Data: Sparse matrices as a tool for efficient data pipeline development


## Objectives

* Review the concept of data staging

* Explain when to use sparse matrices during the machine learning model development process

* Describe simple uses of sparse matrices.

* Execute Python code to work with simple sparse matrices.


## Data staging and data pipelines

**An example natural language processing pipeline might look like this:**
    
    1. Gather data from multiple sources and merge into a single corpus
    2. Represent the words themselves as tokens (numerically encoded)
    3. Modify the original token matrix (n-grams, remove stop words)
    4. Carry out a transform of the token matrix like TFIDF or use Word Embeddings
    5. Use a machine learning model on the new matrix
    6. ...
    
> A staging area, or landing zone, is an intermediate storage area used for data processing during the extract, transform and load (ETL) process.    

This process illustrates that there is a procedure for getting from raw data to the point where a model can be run.  With a large corpus, steps 1-4 might take several minutes, and under certain circumstances several hours.  If we are tuning a model, it makes sense to 'stage' the data after step 4.  If we are comparing different transforms, it makes sense to stage the data at the end of step 3.

As a rule of thumb, if it takes more than a few seconds to process the data, you should consider staging it.
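
The staging idea can be sketched as follows. This is a minimal illustration, not a production pattern: the toy corpus, the whitespace tokenizer, the helper names `build_count_matrix` and `load_or_build`, and the file name `token_counts.npz` are all assumptions made for the example. It builds a small token-count matrix (a stand-in for the expensive steps), writes it to disk once, and reads the staged copy on later calls.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def build_count_matrix(docs):
    """Toy stand-in for the expensive steps: tokenize on whitespace and count tokens."""
    vocab = {}
    rows, cols = [], []
    for doc_id, doc in enumerate(docs):
        for token in doc.lower().split():
            col = vocab.setdefault(token, len(vocab))  # assign the next free column
            rows.append(doc_id)
            cols.append(col)
    vals = np.ones(len(rows), dtype=np.int64)
    # Duplicate (row, col) pairs are summed when COO is converted to CSR,
    # which turns the list of token occurrences into counts.
    counts = sparse.coo_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))
    return counts.tocsr()


def load_or_build(path, docs):
    """Return the staged matrix if it exists; otherwise build it and stage it."""
    if os.path.exists(path):
        return sparse.load_npz(path)
    matrix = build_count_matrix(docs)
    sparse.save_npz(path, matrix)
    return matrix


docs = ["the cat sat", "the dog sat on the mat"]  # toy corpus, illustration only
staged_path = os.path.join(tempfile.mkdtemp(), "token_counts.npz")

first = load_or_build(staged_path, docs)   # builds the matrix and writes the file
second = load_or_build(staged_path, docs)  # reads the staged file instead
print(first.shape, (first != second).nnz == 0)
```

In a real pipeline the vocabulary mapping would need to be staged alongside the matrix so that the columns can still be interpreted later.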

## First steps in organizing a data pipeline

* When a machine learning model has been deployed the data ingestion pipeline for that model will also be deployed.

* That pipeline cannot be finalized during the development of the machine learning model it feeds. 

* Be careful about investing large amounts of time in building a data ingestion pipeline!

Once a well-trained machine learning model has been deployed, the data ingestion pipeline for that model will also be deployed.  That pipeline will consist of a collection of tools and systems used to fetch, transform, and feed data to the machine learning system in production.  

However, that pipeline cannot be finalized during the development of the machine learning model it feeds.  
Finalizing the process of data ingestion *before* models have been run and your hypotheses about the business use case have been tested often leads to lots of re-work. Early experiments almost always fail and you should be careful about investing large amounts of time in building a data ingestion pipeline until there is enough accumulated evidence that a deployed model will help the business.

## Sparse Matrices


* Data scientists will often use *sparse matrices* during the development and testing of a machine learning model.

* Python libraries are available in the **SciPy** package for working with sparse matrices.


Instead of building a complete data ingestion pipeline, data scientists will often use sparse matrices during the development and testing of a machine learning model.  Sparse matrices are used to represent complex sets of data (e.g., word counts) in a way that reduces the use of computer memory and processing time. 

There are Python libraries available in the **SciPy** package to work with sparse matrices.

The code block below imports this library as well as NumPy for calculations:

In [9]:
import numpy as np
from scipy import sparse


## A middle-ground solution

Sparse matrices offer a middle-ground between:

   - a comprehensive data warehouse solution with extensive test coverage

   - a directory of text files and database dumps

Sparse matrices offer a middle-ground between a comprehensive data warehouse solution with extensive test coverage and a directory of text files and database dumps.  Sparse matrices do not work for all data types, but in situations where they are an appropriate technology you can leverage them even under load in production. 

## Using sparse matrices

* A sparse matrix is one in which most of the values are *zero*.  

* If the number of zero-valued elements divided by the size of the matrix is greater than 0.5 then it is considered *sparse*.


A sparse matrix is one in which most of the values are zero.  If the number of zero-valued elements divided by the size of the matrix is greater than 0.5 then it is considered *sparse*.

In [2]:
A = np.random.randint(0, 2, 100000).reshape(100, 1000)  # each entry is 0 or 1
sparsity = 1.0 - (np.count_nonzero(A) / A.size)  # fraction of zero-valued entries
print(round(sparsity, 4))

0.5017


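Generate an array of 100,000 random integers that are each 0 or 1 (the upper bound of 2 passed to `randint` is exclusive), then reshape that array into a 100x1000 matrix, then compute the sparsity.  Because zeros and ones are equally likely, the sparsity comes out close to 0.5.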
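
The same measure can be computed for a matrix that is already stored in a SciPy sparse format. The sketch below is one possible helper, assuming a two-dimensional input; the name `sparsity` and the Poisson test data are choices made for this example.

```python
import numpy as np
from scipy import sparse


def sparsity(matrix):
    """Fraction of zero-valued entries in a 2-D dense array or SciPy sparse matrix."""
    n_rows, n_cols = matrix.shape
    if sparse.issparse(matrix):
        nonzero = matrix.count_nonzero()  # ignores explicitly stored zeros
    else:
        nonzero = np.count_nonzero(matrix)
    return 1.0 - nonzero / (n_rows * n_cols)


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
dense = rng.poisson(0.3, size=(100, 1000))
print(round(sparsity(dense), 4), round(sparsity(sparse.csr_matrix(dense)), 4))
```

Both calls should report the same value, since converting between dense and sparse storage does not change which entries are zero.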

## Advantage of sparse matrices

* Very large non-sparse matrices require significant amounts of memory.

* Sparse matrices allow you to manage large amounts of data in a memory-efficient and time-efficient manner.

Very large matrices require significant amounts of memory.  For example, if we make a matrix of word counts for a document or a book where the features are all known English words, chances are high that a personal machine does not have enough memory to represent it as a dense matrix.  Sparse matrices have the additional advantage of getting around time-complexity issues that arise with operations on large dense matrices.
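
To give a rough sense of the memory difference, the sketch below compares the bytes used by a dense array with the bytes used by the three arrays behind its CSR representation. The matrix size and the Poisson rate are arbitrary choices for illustration, and the exact numbers will vary with the data and dtype.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
dense = rng.poisson(0.01, size=(1000, 5000))  # roughly 99% zeros
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# A CSR matrix stores three arrays: values, column indices, and row pointers.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```

The saving shrinks as the matrix becomes denser, because every stored value also carries the cost of its index.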

## Sparse matrices in Python

**coo_matrix**: sparse matrix built from the COOrdinates and values of the non-zero entries.

In [3]:
A = np.random.poisson(0.3, (10,100))
B = sparse.coo_matrix(A)
C = B.todense()

print("A",type(A),A.shape,"\n"
      "B",type(B),B.shape,"\n"
      "C",type(C),C.shape,"\n")


A <class 'numpy.ndarray'> (10, 100) 
B <class 'scipy.sparse.coo.coo_matrix'> (10, 100) 
C <class 'numpy.matrix'> (10, 100) 



Create a 10x100 array of random numbers drawn from a Poisson distribution.  Then convert that mostly-zero array into a sparse matrix in coordinate (COO) format, and finally convert it back into a dense matrix.  

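**csc_matrix**:  In coordinate format the row and column index of every non-zero entry is stored, so entries that share a row or column repeat the same index.  The compressed formats remove this redundancy by storing, for each row or column, only a pointer to where its entries begin.  When the compression is applied to columns we get the Compressed Sparse Column (CSC) format; when it is applied to rows we get the Compressed Sparse Row (CSR) format.  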

In [4]:
A = np.random.poisson(0.3, (10,100))
B = sparse.csc_matrix(A)
print(B)

  (1, 0)	1
  (2, 0)	1
  (7, 0)	1
  (2, 1)	1
  (1, 2)	1
  (3, 2)	1
  (8, 2)	1
  (9, 2)	1
  (3, 3)	1
  (0, 4)	1
  (4, 4)	2
  (7, 4)	1
  (0, 5)	1
  (1, 5)	2
  (4, 5)	2
  (7, 5)	1
  (9, 5)	1
  (0, 6)	1
  (4, 6)	1
  (5, 6)	1
  (9, 6)	1
  (8, 7)	1
  (4, 8)	2
  (6, 8)	1
  (8, 8)	2
  :	:
  (0, 87)	1
  (9, 87)	1
  (1, 88)	1
  (3, 88)	1
  (5, 88)	1
  (4, 89)	1
  (5, 89)	1
  (7, 90)	2
  (1, 91)	1
  (2, 91)	1
  (2, 92)	1
  (3, 92)	1
  (5, 92)	1
  (9, 92)	1
  (5, 93)	1
  (6, 93)	1
  (9, 94)	1
  (0, 95)	1
  (8, 95)	1
  (9, 95)	1
  (0, 96)	2
  (1, 96)	1
  (1, 97)	2
  (6, 98)	1
  (7, 98)	2
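
As a rough illustration of how the CSC format stores its entries, the sketch below builds a small matrix and inspects the three arrays that SciPy exposes on it: `data`, `indices`, and `indptr`. The 3x3 example matrix is made up for this purpose.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [1, 0, 0],
                  [2, 0, 4]])
csc = sparse.csc_matrix(dense)

print(csc.data)     # [1 2 3 4]  non-zero values, stored column by column
print(csc.indices)  # [1 2 0 2]  row index of each stored value
print(csc.indptr)   # [0 2 2 4]  where each column starts in data/indices

# Recover the non-zero entries of column j from the three arrays.
j = 2
start, end = csc.indptr[j], csc.indptr[j + 1]
column_entries = list(zip(csc.indices[start:end].tolist(),
                          csc.data[start:end].tolist()))
print(column_entries)  # (row, value) pairs for column 2
```

Column 1 is empty, which shows up as two equal consecutive values in `indptr`.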


Because the coordinate format is easier to create, it is common to create it first then cast to another more efficient format.  Let us first show how to create a matrix from coordinates:

In [5]:
rows = [0,1,2,8]
cols = [1,0,4,8]
vals = [1,2,1,4]

A = sparse.coo_matrix((vals, (rows, cols)))
print(A.todense())

[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]]


Then convert it to a CSR matrix:

In [6]:
B = A.tocsr()
print(B)

  (0, 1)	1
  (1, 0)	2
  (2, 4)	1
  (8, 8)	4


Because this introduction to sparse matrices is applied to data ingestion, we need to be able to:

   1. concatenate matrices (e.g., add a new user to a recommender matrix)
   2. read and write the matrices to and from disk

In [7]:
## concatenate example
C = sparse.csr_matrix(np.array([0,1,0,0,2,0,0,0,1]).reshape(1,9))
print(B.shape,C.shape)
D = sparse.vstack([B,C])
print(D.todense())

(9, 9) (1, 9)
[[0 1 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 4]
 [0 1 0 0 2 0 0 0 1]]


In [8]:
## read and write
file_name = "sparse_matrix.npz"
sparse.save_npz(file_name, D)
E = sparse.load_npz(file_name)
print(E.shape)

(10, 9)
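
The two operations above can be combined into a single ingestion step that appends new rows to a matrix already staged on disk. The following is a minimal sketch under stated assumptions: the helper name `append_rows`, the file name `ratings.npz`, and the toy data are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def append_rows(path, new_rows):
    """Load a staged matrix, stack new rows underneath it, and write it back."""
    staged = sparse.load_npz(path)
    # Request CSR explicitly so the staged file always has the same format.
    combined = sparse.vstack([staged, new_rows], format="csr")
    sparse.save_npz(path, combined)
    return combined


path = os.path.join(tempfile.mkdtemp(), "ratings.npz")  # hypothetical file name
sparse.save_npz(path, sparse.csr_matrix(np.eye(3, dtype=np.int64)))

new_user = sparse.csr_matrix(np.array([[0, 5, 0]]))  # one new row, same column count
updated = append_rows(path, new_user)
print(updated.shape, sparse.load_npz(path).format)
```

Note that this rewrites the whole file on every append, which is reasonable during development but may need rethinking for very large matrices.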


## Questions
