In [None]:
def display(*args, **kargs): pass

# MLlib Data Types
 
This notebook explains the machine-learning-specific data types in Spark.  The focus is on the data types and classes used for generating models.  These include: `DenseVector`, `SparseVector`, `LabeledPoint`, and `Rating`.
 
For reference:
 
The [MLlib Guide](http://spark.apache.org/docs/latest/mllib-guide.html) provides an overview of all aspects of MLlib, and [MLlib Guide: Data Types](http://spark.apache.org/docs/latest/mllib-data-types.html) provides a detailed review of the data types specific to MLlib.
 
After this lab you should understand the differences between `DenseVectors` and `SparseVectors` and be able to create and use `DenseVector`, `SparseVector`, `LabeledPoint`, and `Rating` objects.  You'll also learn where to obtain additional information regarding the APIs and specific class / method functionality.

In [None]:
from pyspark.mllib import linalg
dir(linalg)

In [None]:
help(linalg)

 
#### Dense and Sparse
 
MLlib supports both dense and sparse types for vectors and matrices.  We'll focus on vectors as they are most commonly used in MLlib and matrices have poor scaling properties.
 
A dense vector contains an array of values, while a sparse vector stores the size of the vector, an array of indices, and an array of values that correspond to the indices.  A sparse vector saves space by not storing zero values.
 
For example, if we had the dense vector `[2.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]`, we could store that as a sparse vector with size 7, indices as `[0, 3]`, and values as `[2.0, 3.0]`.
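
To make the two layouts concrete, here is a minimal plain-Python sketch (no Spark required) that converts the dense list above into the size / indices / values triple and back.  The helper names `to_sparse` and `to_dense` are hypothetical and are not part of MLlib.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list, filling every unstored position with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [2.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
size, indices, values = to_sparse(dense)
print('size: {0}, indices: {1}, values: {2}'.format(size, indices, values))
# size: 7, indices: [0, 3], values: [2.0, 3.0]

# Round trip: the sparse triple carries the same information
roundTrip = to_dense(size, indices, values)
```

Here the sparse form stores two indices and two values instead of seven values; the saving grows as the share of zeros grows.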

In [None]:
# import data types
from pyspark.mllib.linalg import DenseVector, SparseVector, SparseMatrix, DenseMatrix, Vectors, Matrices

A great way to get help when using Python is to use the help function on an object or method.  If help isn't sufficient, other places to look include the [programming guides](http://spark.apache.org/docs/latest/programming-guide.html), the [Python API](http://spark.apache.org/docs/latest/api/python/index.html), and directly in the [source code](https://github.com/apache/spark/tree/master/python/pyspark) for PySpark.

In [None]:
help(Vectors.dense)

#### DenseVector

PySpark provides a [DenseVector](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector) class within the module [pyspark.mllib.linalg](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.linalg).  `DenseVector` is used to store arrays of values for use in PySpark.  `DenseVector` actually stores values in a [NumPy array](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) and delegates calculations to that object.  You can create a new `DenseVector` using `DenseVector()` and passing in a NumPy array or a Python list.
 
`DenseVector` implements several functions, such as `DenseVector.dot()` and `DenseVector.norm()`.
 
Note that `DenseVector` stores all values as `np.float64`, so even if you pass in a NumPy array of integers, the resulting `DenseVector` will contain floating-point numbers. Also, `DenseVector` objects exist locally and are not inherently distributed.  `DenseVector` objects can be used in the distributed setting by including them in `RDDs` or `DataFrames`.
 
You can create a dense vector by using the [Vectors](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) object and calling `Vectors.dense`.  The `Vectors` object also contains a method for creating `SparseVectors`.

In [None]:
# Create a DenseVector using Vectors
denseVector = Vectors.dense([1, 2, 3])

print 'type(denseVector): {0}'.format(type(denseVector))
print '\ndenseVector: {0}'.format(denseVector)

**Dot product**
 
We can calculate the dot product of two vectors, or a vector and itself, by using `DenseVector.dot()`.  Note that the dot product is equivalent to performing element-wise multiplication and then summing the result.
 
Below, you'll find the calculation for the dot product of two vectors, where each vector has length \\( n \\):
 
\\[ w \cdot x = \sum_{i=1}^n w_i x_i \\]
 
Note that you may also see \\( w \cdot x \\) represented as \\( w^\top x \\).

In [None]:
denseVector.dot(denseVector)
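
As a cross-check on the formula, the same dot product can be worked out by hand in plain Python.  This sketch uses ordinary lists in place of the `DenseVector` holding `[1.0, 2.0, 3.0]`; the variable names are illustrative only.

```python
w = [1.0, 2.0, 3.0]
x = [1.0, 2.0, 3.0]

# Element-wise multiplication followed by a sum
products = [w_i * x_i for w_i, x_i in zip(w, x)]
dotProduct = sum(products)

print('products: {0}'.format(products))       # products: [1.0, 4.0, 9.0]
print('dot product: {0}'.format(dotProduct))  # dot product: 14.0
```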

**Norm**
 
We can calculate the norm of a vector using `Vectors.norm`.  The norm calculation is:
 
  \\[ ||x|| _p = \bigg( \sum_{i=1}^n |x_i|^p \bigg)^{1/p} \\]
 
 
 
Sometimes we'll want to normalize our features before training a model.  Later on we'll use the `ml` library to perform this normalization using a transformer.

In [None]:
Vectors.norm(denseVector, 2)
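
The norm formula can also be written out directly.  The sketch below is a plain-Python illustration for finite `p >= 1`, not MLlib's implementation; `p_norm` is a hypothetical helper.  It also shows the kind of normalization mentioned above: dividing a vector by its 2-norm yields a vector of length 1.

```python
def p_norm(values, p):
    """Compute (sum_i |x_i|^p)^(1/p) for a finite p >= 1."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


x = [1.0, 2.0, 3.0]
l1 = p_norm(x, 1)  # |1| + |2| + |3|
l2 = p_norm(x, 2)  # square root of 1 + 4 + 9

# Dividing each element by the 2-norm gives a unit-length vector
unit = [v / l2 for v in x]

print('L1: {0}, L2: {1}'.format(l1, l2))
print('2-norm of unit vector: {0}'.format(p_norm(unit, 2)))
```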

In Python, `DenseVector` operations are delegated to an underlying NumPy array so we can perform multiplication, addition, division, etc.

In [None]:
denseVector * denseVector

In [None]:
5 + denseVector

Sometimes we'll want to treat a vector as an array.  We can convert both sparse and dense vectors to arrays by calling the `toArray` method on the vector.

In [None]:
denseArray = denseVector.toArray()
print denseArray

In [None]:
print 'type(denseArray): {0}'.format(type(denseArray))

#### SparseVector

In [None]:
help(Vectors.sparse)

Let's create a `SparseVector` using [Vectors.sparse](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors.sparse).

In [None]:
sparseVector = Vectors.sparse(10, [2, 7], [1.0, 5.0])
print 'type(sparseVector): {0}'.format(type(sparseVector))
print '\nsparseVector: {0}'.format(sparseVector)

 
Let's take a look at what fields and methods are available with a `SparseVector`.  Here are links to the [Python](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector) and [Scala](http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.mllib.linalg.SparseVector) APIs for `SparseVector`.

In [None]:
help(SparseVector)

In [None]:
dir(SparseVector)

In [None]:
dir(sparseVector)

In [None]:
# Show the difference between the two
set(dir(sparseVector)) - set(dir(SparseVector))

In [None]:
# inspect is a handy tool for seeing the Python source code
import inspect
print inspect.getsource(SparseVector)

In [None]:
print 'sparseVector.size: {0}'.format(sparseVector.size)
print 'type(sparseVector.size):{0}'.format(type(sparseVector.size))

print '\nsparseVector.indices: {0}'.format(sparseVector.indices)
print 'type(sparseVector.indices):{0}'.format(type(sparseVector.indices))
print 'type(sparseVector.indices[0]):{0}'.format(type(sparseVector.indices[0]))

print '\nsparseVector.values: {0}'.format(sparseVector.values)
print 'type(sparseVector.values):{0}'.format(type(sparseVector.values))
print 'type(sparseVector.values[0]):{0}'.format(type(sparseVector.values[0]))

Don't try to set these values directly.  If you use the wrong type, hard-to-debug errors will occur when Spark attempts to use the `SparseVector`.  Instead, create a new `SparseVector` using `Vectors.sparse`.

In [None]:
set(dir(DenseVector)) - set(dir(SparseVector))

In [None]:
denseVector + denseVector

In [None]:
try:
    sparseVector + sparseVector
except TypeError as e:
    print e

In [None]:
sparseVector.dot(sparseVector)

In [None]:
sparseVector.norm(2)
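
A sparse dot product only needs to visit the stored entries, because any index missing from either vector contributes zero.  The following plain-Python sketch illustrates that idea with the indices and values of `sparseVector`; `sparse_dot` is a hypothetical helper, not how MLlib implements `SparseVector.dot`.

```python
import math


def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as index / value lists."""
    b = dict(zip(indices_b, values_b))
    # Only indices stored in both vectors contribute to the sum
    return sum(v * b[i] for i, v in zip(indices_a, values_a) if i in b)


indices, values = [2, 7], [1.0, 5.0]
dot = sparse_dot(indices, values, indices, values)  # 1.0 * 1.0 + 5.0 * 5.0
l2Norm = math.sqrt(dot)                             # the 2-norm is sqrt(x . x)

print('dot: {0}'.format(dot))  # dot: 26.0
print('2-norm: {0}'.format(l2Norm))
```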

#### LabeledPoint

 
In MLlib, labeled training instances are stored using the [LabeledPoint](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) object.  Note that the features and label for a `LabeledPoint` are stored in the `features` and `label` attributes of the object.

In [None]:
from pyspark.mllib.regression import LabeledPoint
help(LabeledPoint)

In [None]:
labeledPoint = LabeledPoint(1992, [3.0, 5.5, 10.0])
print 'labeledPoint: {0}'.format(labeledPoint)

print '\nlabeledPoint.features: {0}'.format(labeledPoint.features)
# Notice that features are being stored as a DenseVector
print 'type(labeledPoint.features): {0}'.format(type(labeledPoint.features))

print '\nlabeledPoint.label: {0}'.format(labeledPoint.label)
print 'type(labeledPoint.label): {0}'.format(type(labeledPoint.label))

In [None]:
# View the differences between the class and an instantiated instance
set(dir(labeledPoint)) - set(dir(LabeledPoint))

In [None]:
labeledPointSparse = LabeledPoint(1992, Vectors.sparse(10, {0: 3.0, 1:5.5, 2: 10.0}))
print 'labeledPointSparse: {0}'.format(labeledPointSparse)

print '\nlabeledPointSparse.features: {0}'.format(labeledPointSparse.features)
print 'type(labeledPointSparse.features): {0}'.format(type(labeledPointSparse.features))

#### Rating

In [None]:
from pyspark.mllib.recommendation import Rating
help(Rating)

When performing collaborative filtering we aren't working with vectors or labeled points, so we need another type of object to capture the relationship between users, products, and ratings.  This is represented by a `Rating` which can be found in the [Python](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.recommendation.Rating) and [Scala](https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.mllib.recommendation.Rating) APIs.

In [None]:
print inspect.getsource(Rating)

In [None]:
rating = Rating(4, 10, 2.0)

print 'rating: {0}'.format(rating)
# Note that we can pull out the fields using a dot notation or indexing
print rating.user, rating[0]

#### DataFrames
 
When using Spark's ML library rather than MLlib you'll be working with `DataFrames` instead of `RDDs`.  In this section we'll show how you can create a `DataFrame` using MLlib datatypes.

Above we saw that `Rating` is a `namedtuple`.  `namedtuples` are useful when working with `DataFrames`.  We'll explore how they work below and use them to create a `DataFrame`.

In [None]:
from collections import namedtuple
help(namedtuple)

In [None]:
Address = namedtuple('Address', ['city', 'state'])
address = Address('Boulder', 'CO')
print 'address: {0}'.format(address)

print '\naddress.city: {0}'.format(address.city)
print 'address[0]: {0}'.format(address[0])

print '\naddress.state: {0}'.format(address.state)
print 'address[1]: {0}'.format(address[1])

In [None]:
display(sqlContext.createDataFrame([Address('Boulder', 'CO'), Address('New York', 'NY')]))
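
The column names in the resulting `DataFrame` match the namedtuple's field names.  The sketch below uses only the standard library to show the `_fields` and `_asdict()` accessors that expose those names; that Spark reads them this way internally is an assumption here, not something this notebook demonstrates.

```python
from collections import namedtuple

Address = namedtuple('Address', ['city', 'state'])
rows = [Address('Boulder', 'CO'), Address('New York', 'NY')]

# Field names are stored on the class, in declaration order
print(Address._fields)       # ('city', 'state')

# Each instance can be viewed as a field-name -> value mapping
asDicts = [dict(row._asdict()) for row in rows]
print(asDicts[0]['city'])    # Boulder
```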

Let's create a `DataFrame` with a couple of rows where the first column is the label and the second is the features.

In [None]:
LabelAndFeatures = namedtuple('LabelAndFeatures', ['label', 'features'])
row1 = LabelAndFeatures(10, Vectors.dense([1.0, 2.0]))
row2 = LabelAndFeatures(20, Vectors.dense([1.5, 2.2]))

df = sqlContext.createDataFrame([row1, row2])
display(df)

#### Exercises

Create a `DenseVector` with the values 1.5, 2.5, 3.0 (in that order).

In [None]:
# ANSWER
denseVec = Vectors.dense([1.5, 2.5, 3.0])

In [None]:
# TEST
from test_helper import Test
Test.assertEquals(denseVec, DenseVector([1.5, 2.5, 3.0]), 'incorrect value for denseVec')

Create a `LabeledPoint` with a label equal to 10.0 and features equal to `denseVec`.

In [None]:
# ANSWER
labeledP = LabeledPoint(10.0, denseVec)

In [None]:
# TEST
Test.assertEquals(str(labeledP), '(10.0,[1.5,2.5,3.0])', 'incorrect value for labeledP')

**Challenge Question [Intentionally Hard]**
 
Create a `udf` that pulls the first element out of a column that contains `DenseVectors`.

In [None]:
# ANSWER
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

firstElement = udf(lambda v: float(v[0]), DoubleType())

df2 = df.select(firstElement('features').alias('first'))
df2.show()

In [None]:
# TEST
Test.assertEquals(df2.rdd.map(lambda r: r[0]).collect(), [1.0, 1.5], 'incorrect implementation of firstElement')