<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width="750" align="center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 1: Basic Statistics and Data Types  

## Local and Distributed Matrices

## Lesson Objectives

After completing this lesson, you should be able to:

- Understand local and distributed matrices
- Create dense and sparse matrices 
- Create different types of distributed matrices 


### Local Matrices 
- Natural extension of Vectors 
- Row and column indices are 0-based integers and values are doubles 
- Local matrices are stored on a single machine 
- MLlib's matrices can be either dense or sparse 
- Entries are stored in column-major order

### Dense Matrices 
- A "reshaped" dense Vector 
- The first two arguments to `Matrices.dense` specify the number of rows and columns
- Entries are stored in a single double array 

### A Dense Matrix Example

In [None]:
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// 3 rows, 2 columns; the values are listed column by column
Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
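
To see how the flat array in the example above maps onto a 3 × 2 grid, here is a minimal plain-Scala sketch of column-major indexing. It does not use MLlib at all; the helper `entry` and the value `grid` are illustrative names assumed for this sketch, not part of the Spark API.

```scala
// Plain Scala, no Spark needed: a sketch of column-major layout.
val numRows = 3
val numCols = 2
val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)

// Hypothetical helper: entry (i, j) lives at offset i + j * numRows.
def entry(i: Int, j: Int): Double = values(i + j * numRows)

// Rebuild the matrix row by row for display.
val grid = (0 until numRows).map(i => (0 until numCols).map(j => entry(i, j)))
grid.foreach(row => println(row.mkString(" ")))
// prints:
// 1.0 2.0
// 3.0 4.0
// 5.0 6.0
```

The first three values fill column 0 and the last three fill column 1, which is why `2.0` ends up in row 0, column 1.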

### Sparse Matrices in Spark: Compressed Sparse Column (CSC) format

- Rows: 5
- Columns: 4
- Column pointers: `(0, 0, 1, 2, 2)`
- Row indices: `(1, 3)`
- Non-zero values: `(34.0, 55.0)`
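
The CSC arrays listed above can be decoded by hand. The following plain-Scala sketch does not use MLlib; it assumes the usual CSC convention that column `j` owns the stored values at positions `colPtrs(j)` up to, but not including, `colPtrs(j + 1)`. The names `nonZeros` and `triples` are illustrative.

```scala
// Plain Scala, no Spark needed: decode the CSC arrays listed above.
val numCols = 4
val colPtrs = Array(0, 0, 1, 2, 2)   // length = numCols + 1
val rowIndices = Array(1, 3)
val nonZeros = Array(34.0, 55.0)

// Column j owns the value positions colPtrs(j) until colPtrs(j + 1).
val triples = for {
  j <- 0 until numCols
  k <- colPtrs(j) until colPtrs(j + 1)
} yield (rowIndices(k), j, nonZeros(k))

triples.foreach(println)  // each line is (row, column, value)
// prints:
// (1,1,34.0)
// (3,2,55.0)
```

Columns 0 and 3 are empty because their pointer ranges have zero length, so the matrix holds only two explicit entries: 34.0 at row 1, column 1 and 55.0 at row 3, column 2.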

### Sparse Matrix Example

In [None]:
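val m = Matrices.sparse(5, 4,
  Array(0, 0, 1, 2, 2), // column pointers (length = columns + 1)
  Array(1, 3),          // row indices of the non-zero values
  Array(34.0, 55.0)     // non-zero values
)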
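
As a quick check of the sparse layout, individual entries can be read back by row and column. This is a small sketch that requires the MLlib classes on the classpath (as in this notebook) but no `SparkContext`; `sm` is an illustrative name.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Same sparse matrix as in the example above; `sm` is just an illustrative name.
val sm: Matrix = Matrices.sparse(5, 4,
  Array(0, 0, 1, 2, 2),
  Array(1, 3),
  Array(34.0, 55.0))

println(sm.numRows)  // 5
println(sm.numCols)  // 4
println(sm(1, 1))    // 34.0
println(sm(3, 2))    // 55.0
println(sm(0, 0))    // 0.0, an entry that is not stored explicitly
```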

### Distributed Matrices 

Distributed matrices are where Spark starts to deliver significant value. They are stored in one or more RDDs.

This lesson covers three types:
- `RowMatrix`
- `IndexedRowMatrix`
- `CoordinateMatrix`

Converting between these types may require an expensive global shuffle.


#### RowMatrix

- The most basic type of distributed matrix 
- It has no meaningful row indices, being only a collection of feature vectors 
- Backed by an RDD of its rows, where each row is a local vector
- Assumes the number of columns is small enough to be stored in a local vector
- Can be easily created from an instance of `RDD[Vector]`


### A Simple RowMatrix Example

In [None]:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val rows: RDD[Vector] = sc.parallelize(Array(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(4.0, 5.0),
  Vectors.dense(7.0, 8.0)))

### A Simple RowMatrix

In [None]:
val mat: RowMatrix = new RowMatrix(rows)

val m = mat.numRows()
val n = mat.numCols()
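
Because a `RowMatrix` is a collection of feature vectors, a natural next step is to summarize its columns. The sketch below assumes that `sc` is the `SparkContext` provided by the notebook, as in the cells above; the names `featureRows`, `rowMat`, and `summary` are illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumes `sc` is the SparkContext provided by the notebook.
val featureRows: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(4.0, 5.0),
  Vectors.dense(7.0, 8.0)))
val rowMat = new RowMatrix(featureRows)

// Column-wise statistics computed over the distributed rows.
val summary = rowMat.computeColumnSummaryStatistics()
println(summary.mean)  // per-column means
println(summary.max)   // per-column maxima
```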

### IndexedRowMatrix

- Similar to a `RowMatrix`, but with meaningful row indices that can be used to identify rows and execute joins
- Backed by an RDD of indexed rows, where each row is a tuple containing an index (long-typed) and a local vector 
- Easily created from an instance of `RDD[IndexedRow]`
- Can be converted to a `RowMatrix` by calling `toRowMatrix()`


### A Simple IndexedRowMatrix Example

In [None]:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

// Each IndexedRow pairs a long-typed row index with a local vector
val rows: RDD[IndexedRow] = sc.parallelize(Array(
  IndexedRow(0, Vectors.dense(1.0, 2.0)),
  IndexedRow(1, Vectors.dense(4.0, 5.0)),
  IndexedRow(2, Vectors.dense(7.0, 8.0))))

val idxMat: IndexedRowMatrix = new IndexedRowMatrix(rows)
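
The conversion to a `RowMatrix` mentioned above can be sketched as follows. The block rebuilds the same data so that it stands alone; it assumes that `sc` is the `SparkContext` provided by the notebook, and the names `indexedRows`, `indexedMat`, and `plainMat` are illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

// Assumes `sc` is the SparkContext provided by the notebook.
val indexedRows: RDD[IndexedRow] = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(4.0, 5.0)),
  IndexedRow(2L, Vectors.dense(7.0, 8.0))))
val indexedMat = new IndexedRowMatrix(indexedRows)

// Dropping the indices yields a plain RowMatrix with the same shape.
val plainMat: RowMatrix = indexedMat.toRowMatrix()
println(plainMat.numRows())  // 3
println(plainMat.numCols())  // 2
```

After the conversion the row indices are gone, so rows can no longer be identified or joined by index.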

### CoordinateMatrix 

- Should be used only when both dimensions are huge and the matrix is very sparse
- Backed by an RDD of matrix entries, where each entry is a tuple `(i: Long, j: Long, value: Double)`: `i` is the row index, `j` is the column index, and `value` is the entry value
- Can be easily created from an instance of `RDD[MatrixEntry]`
- Can be converted to an `IndexedRowMatrix` with sparse rows by calling `toIndexedRowMatrix()`


### A Simple CoordinateMatrix Example

In [None]:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix

val entries: RDD[MatrixEntry] = sc.parallelize(Array(
  MatrixEntry(0, 0, 9.0),
  MatrixEntry(1, 1, 8.0),
  MatrixEntry(2, 1, 6.0)))

val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
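
The conversion to an `IndexedRowMatrix` mentioned above can be sketched as follows. The block rebuilds the same entries so that it stands alone; it assumes that `sc` is the `SparkContext` provided by the notebook, and the names `sparseEntries`, `coo`, and `asIndexedRows` are illustrative.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry}

// Assumes `sc` is the SparkContext provided by the notebook.
val sparseEntries: RDD[MatrixEntry] = sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 9.0),
  MatrixEntry(1L, 1L, 8.0),
  MatrixEntry(2L, 1L, 6.0)))
val coo = new CoordinateMatrix(sparseEntries)

// Dimensions are inferred from the largest row and column index, plus one.
println(coo.numRows())  // 3
println(coo.numCols())  // 2

// Group the entries by row index; each resulting row is a sparse vector.
val asIndexedRows: IndexedRowMatrix = coo.toIndexedRowMatrix()
asIndexedRows.rows.collect().sortBy(_.index).foreach(println)
```

Grouping entries by row is the kind of step that can trigger the shuffle noted earlier, so on large matrices it is worth converting once and reusing the result.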

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.