### MLlib is the Machine Learning Library for Spark
1. Interoperates with NumPy in Python
2. Provides an integrated data analysis workflow
3. Enhances speed and performance
4. Covers clustering, frequent pattern mining, linear algebra, collaborative filtering, classification, and regression

#### Spark sparse and dense vectors
Spark MLlib supports both sparse and dense vectors.

1. A dense vector stores every value, including the zeros.
2. A sparse vector stores only the nonzero values together with their indices. In huge vectors where most of the values are zero, storing every zero is very inefficient.
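The difference between the two layouts can be sketched in plain Python, without Spark. The helper names `to_sparse` and `to_dense` are hypothetical and only illustrate the idea; the (size, indices, values) triple mirrors the arguments that SparseVector takes in the next section.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the nonzero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list, filling every unlisted position with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [1.0, 0.0, 3.0, 0.0, 0.0, 7.0]
sparse = to_sparse(dense)
print(sparse)             # (6, [0, 2, 5], [1.0, 3.0, 7.0])
print(to_dense(*sparse))  # [1.0, 0.0, 3.0, 0.0, 0.0, 7.0]
```

For a vector with millions of entries and only a handful of nonzeros, the sparse form stores just those few index/value pairs plus the size.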

#### Labeled Point
1. A labeled point is a local vector, either dense or sparse, associated with a label/response.
2. In MLlib, labeled points are used in supervised learning algorithms for both regression and classification.

In [1]:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

dense_vector = LabeledPoint(1.0, [1.0, 0, 3.0, 0, 0, 7.0])
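sparse_vector = LabeledPoint(1.0, SparseVector(6, [0, 2, 5], [1.0, 3.0, 7.0]))
# Both labeled points carry the same label (1.0) and the same six features;
# only the storage format of the feature vector differs.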

#### Basic Statistics

In [2]:
import pyspark
sc = pyspark.SparkContext(appName="mlib-stats")

In [3]:
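# Drop lines that start with 'a' (presumably the header row); startswith() also tolerates empty lines
textRdd = sc.textFile("Resources/heart.csv").filter(lambda x: not x.startswith('a'))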
textRdd.take(5)

['67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2',
 '67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1',
 '37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0',
 '41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0',
 '56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0']

In [4]:
def buildVectors(entry):
    vector = entry.split(',')
    return vector
vectorRdd = textRdd.map(buildVectors)

In [5]:
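def convertFloat(entry):
    """Convert each field to float; mark empty or '?' fields with '?'."""
    vector = []
    for x in entry:
        if x != '' and x != '?':
            vector.append(float(x))
        else:
            vector.append('?')
    return vector

# Keep only the rows that contain no missing-value marker
fvectorRdd = vectorRdd.map(convertFloat).filter(lambda x: '?' not in x)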
fvectorRdd.take(5)

[[67.0, 1.0, 4.0, 160.0, 286.0, 0.0, 2.0, 108.0, 1.0, 1.5, 2.0, 3.0, 3.0, 2.0],
 [67.0, 1.0, 4.0, 120.0, 229.0, 0.0, 2.0, 129.0, 1.0, 2.6, 2.0, 2.0, 7.0, 1.0],
 [37.0, 1.0, 3.0, 130.0, 250.0, 0.0, 0.0, 187.0, 0.0, 3.5, 3.0, 0.0, 3.0, 0.0],
 [41.0, 0.0, 2.0, 130.0, 204.0, 0.0, 2.0, 172.0, 0.0, 1.4, 1.0, 0.0, 3.0, 0.0],
 [56.0, 1.0, 2.0, 120.0, 236.0, 0.0, 0.0, 178.0, 0.0, 0.8, 1.0, 0.0, 3.0, 0.0]]
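The split, convert, and filter steps above could also be folded into one function that returns None for incomplete rows. This is a minimal plain-Python sketch, not the notebook's method: `parse_row` is a hypothetical helper, and the two input lines are shortened, illustrative rows rather than records from heart.csv.

```python
def parse_row(line, missing=("", "?")):
    """Return the row as a list of floats, or None if any field is missing."""
    fields = line.split(",")
    if any(f in missing for f in fields):
        return None
    return [float(f) for f in fields]


rows = [
    "67.0,1.0,4.0,160.0",  # complete row
    "53.0,0.0,?,130.0",    # illustrative row with a missing field
]
parsed = [parse_row(r) for r in rows]
clean = [r for r in parsed if r is not None]
print(clean)  # [[67.0, 1.0, 4.0, 160.0]]
```

On an RDD the same function would presumably be applied as textRdd.map(parse_row).filter(lambda r: r is not None), which avoids mixing floats and '?' strings in one list.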

#### Statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

In [6]:
from pyspark.mllib.stat import Statistics
summary = Statistics.colStats(fvectorRdd)
print(summary.mean())
print(summary.variance())
print(summary.numNonzeros())

[  5.45135135e+01   6.75675676e-01   3.16554054e+00   1.31648649e+02
   2.47398649e+02   1.41891892e-01   9.93243243e-01   1.49597973e+02
   3.27702703e-01   1.05135135e+00   1.59797297e+00   6.79054054e-01
   4.72635135e+00   9.49324324e-01]
[  8.19320202e+01   2.19880898e-01   9.18266148e-01   3.15984608e+02
   2.71221342e+03   1.22171324e-01   9.89784700e-01   5.28098843e+02
   2.21060467e-01   1.35918461e+00   3.76809437e-01   8.83085204e-01
   3.76554054e+00   1.52623683e+00]
[ 296.  200.  296.  296.  296.   42.  149.  296.   97.  200.  296.  123.
  296.  137.]
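To make the three summaries concrete, here is a small plain-Python sketch that computes the same kinds of column-wise values for a toy matrix. It is only an illustration, not Spark's implementation, and the rows are made-up numbers. It assumes the variance is the sample variance (divisor n - 1), which is consistent with the output above: the second column has mean p ≈ 0.6757 over 296 rows, and p(1 - p) · 296/295 ≈ 0.2199.

```python
from statistics import mean, variance

# Toy data: 3 rows, 3 columns (made-up values)
rows = [
    [1.0, 0.0, 4.0],
    [3.0, 0.0, 0.0],
    [5.0, 6.0, 2.0],
]

# Transpose so that each tuple holds one column
columns = list(zip(*rows))

col_mean = [mean(c) for c in columns]
col_var = [variance(c) for c in columns]  # sample variance, divisor n - 1
col_nonzeros = [sum(1 for v in c if v != 0) for c in columns]

print(col_mean)      # [3.0, 2.0, 2.0]
print(col_var)       # [4.0, 12.0, 4.0]
print(col_nonzeros)  # [3, 1, 2]
```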


#### Correlation
Statistics.corr() calculates the correlation between two series; the method argument selects Pearson (the default) or Spearman.

In [7]:
x = fvectorRdd.map(lambda x : x[3])
y = fvectorRdd.map(lambda x : x[4])
print(Statistics.corr(x,y,method='pearson'))
print(Statistics.corr(x,y,method="spearman"))

0.1323795611721055
0.1406576227867661
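The two methods measure different things: Pearson captures linear association, while Spearman is the Pearson correlation of the ranks and so captures any monotonic association. The following plain-Python sketch illustrates that relationship on made-up data. The function names are hypothetical, and tied values receive the mean of their ranks, which is a common convention assumed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic in x, but not linear

print(pearson(x, y))   # about 0.981
print(spearman(x, y))  # approximately 1.0, since the rank orderings match
```

In the notebook output above the two coefficients are close (about 0.13 and 0.14), which suggests the weak association between columns 3 and 4 looks similar whether measured on raw values or on ranks.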
