# Spark MLlib Example: Clustering

Let's look at a clustering example in Spark MLlib.

Here, we are going to load a transformed version of the Walmart trip-type dataset. Each row holds some stats on one store visit.  We will load the gzipped CSV file as a Spark dataframe and view it.

In [24]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import PCA



In [1]:
dataset = spark.read.csv("../../datasets/walmart-triptype/train-transformed.csv.gz", header=True, inferSchema=True)


In [7]:
dataset.show()


## Creating Vectors

Now that we have a dataframe, let's work on turning it into vectors.  We're going to vectorize every column except 2:

1. VisitNumber, which identifies the visit rather than describing it.
2. TripType, which is the label.

We'll use the VectorAssembler class to create a new column named features. This will be a Vector.
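To make the idea concrete, here is a minimal plain-Python sketch of what "assembling" means: picking the chosen columns of a row, in order, and packing their values into one vector. It is not Spark's implementation, and the `assemble` helper and the sample row are made up for illustration.

```python
# Plain-Python sketch of the idea behind assembling feature columns.
# This is NOT how Spark implements VectorAssembler; `assemble` and the
# sample row below are hypothetical and exist only for illustration.

def assemble(row, input_cols):
    """Return the values of input_cols, in order, as one list of floats."""
    return [float(row[col]) for col in input_cols]

# Made-up row: the two excluded columns plus three feature columns.
row = {"VisitNumber": 7, "TripType": 30, "Weekday": 5, "NumItems": 2, "DAIRY": 1}

# Drop the identifier and the label, keeping every other column.
feature_cols = [c for c in row if c not in ("VisitNumber", "TripType")]
features = assemble(row, feature_cols)

print(feature_cols)
print(features)
```

The order of `feature_cols` matters: position i in the vector always corresponds to the i-th input column.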

In [19]:
columns = dataset.columns
columns.remove('VisitNumber')
columns.remove('TripType')

print(columns)

['Weekday', 'NumItems', 'Return', '1-HR PHOTO', 'ACCESSORIES', 'AUTOMOTIVE', 'BAKERY', 'BATH AND SHOWER', 'BEAUTY', 'BEDDING', 'BOOKS AND MAGAZINES', 'BOYS WEAR', 'BRAS & SHAPEWEAR', 'CAMERAS AND SUPPLIES', 'CANDY, TOBACCO, COOKIES', 'CELEBRATION', 'COMM BREAD', 'CONCEPT STORES', 'COOK AND DINE', 'DAIRY', 'DSD GROCERY', 'ELECTRONICS', 'FABRICS AND CRAFTS', 'FINANCIAL SERVICES', 'FROZEN FOODS', 'FURNITURE', 'GIRLS WEAR, 4-6X  AND 7-14', 'GROCERY DRY GOODS', 'HARDWARE', 'HEALTH AND BEAUTY AIDS', 'HOME DECOR', 'HOME MANAGEMENT', 'HORTICULTURE AND ACCESS', 'HOUSEHOLD CHEMICALS/SUPP', 'HOUSEHOLD PAPER GOODS', 'IMPULSE MERCHANDISE', 'INFANT APPAREL', 'INFANT CONSUMABLE HARDLINES', 'JEWELRY AND SUNGLASSES', 'LADIES SOCKS', 'LADIESWEAR', 'LARGE HOUSEHOLD GOODS', 'LAWN AND GARDEN', 'LIQUOR,WINE,BEER', 'MEAT - FRESH & FROZEN', 'MEDIA AND GAMING', 'MENSWEAR', 'OFFICE SUPPLIES', 'OPTICAL - FRAMES', 'OPTICAL - LENSES', 'OTHER DEPARTMENTS', 'PAINT AND ACCESSORIES', 'PERSONAL CARE', 'PETS AND SUPPLIE

In [20]:

assembler = VectorAssembler(inputCols=columns, outputCol="features")
featureVector = assembler.transform(dataset)


In [21]:
for row in featureVector.select('features').take(10):
    print("Vector: %s\n" % (str(row)))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: -1.0, 2: 1.0, 23: -1.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 2.0, 52: 1.0, 64: 1.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 28.0, 2: 1.0, 19: 2.0, 20: 1.0, 33: 1.0, 44: 1.0, 51: 18.0, 53: 4.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 3.0, 35: 1.0, 59: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 3.0, 14: 1.0, 20: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 4.0, 20: 1.0, 27: 1.0, 35: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 7.0, 11: 2.0, 33: 2.0, 52: 2.0, 64: 1.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 9.0, 22: 9.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 4.0, 14: 2.0, 20: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 9.0, 4: 1.0, 22: 1.0, 31: 1.0, 35: 2.0, 38: 1.0, 46: 3.0}))



Note the output. These are Sparse (not dense) Vectors.  That's because our data IS sparse: in any given row, only a few of the 70 variables are nonzero.
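As a rough illustration of the notation, the sketch below expands the first vector printed above into its dense form in plain Python. The `to_dense` helper is hypothetical and only mirrors the idea of a size plus an index-to-value map; it is not Spark's SparseVector class.

```python
# Plain-Python sketch: expand "size + {index: value}" into a dense list.
# `to_dense` is a hypothetical helper, not part of Spark.

def to_dense(size, nonzeros):
    """Build a list of `size` zeros and fill in the stored nonzero entries."""
    dense = [0.0] * size
    for index, value in nonzeros.items():
        dense[index] = value
    return dense

# First vector from the output: SparseVector(70, {0: 5.0, 1: -1.0, 2: 1.0, 23: -1.0})
dense = to_dense(70, {0: 5.0, 1: -1.0, 2: 1.0, 23: -1.0})
nonzero_count = sum(1 for v in dense if v != 0.0)

print(len(dense), nonzero_count)  # prints: 70 4
```

Storing 4 index/value pairs instead of 70 numbers is where the saving comes from.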

In [22]:
# Check the correlation matrix of the data.

r1 = Correlation.corr(featureVector, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))


Pearson correlation matrix:
DenseMatrix([[ 1.        ,  0.04112535,  0.00239757, ...,  0.00603346,
               0.02650173,  0.00378734],
             [ 0.04112535,  1.        , -0.04703964, ...,  0.03728566,
               0.10396361, -0.0131882 ],
             [ 0.00239757, -0.04703964,  1.        , ..., -0.02663534,
              -0.02400574, -0.03420455],
             ..., 
             [ 0.00603346,  0.03728566, -0.02663534, ...,  1.        ,
               0.01531549, -0.00357236],
             [ 0.02650173,  0.10396361, -0.02400574, ...,  0.01531549,
               1.        ,  0.00143963],
             [ 0.00378734, -0.0131882 , -0.03420455, ..., -0.00357236,
               0.00143963,  1.        ]])


Note that many dimensions in the original dataset are correlated with one another, at least to some degree.  We can see this in the nonzero off-diagonal values of the correlation matrix.  The diagonal is all ones because every dimension is perfectly correlated with itself.
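For readers who want to see what a single entry of that matrix measures, here is a small plain-Python sketch of the Pearson correlation between two columns. The `pearson` helper and the sample columns are made up for illustration; Spark computes the full matrix itself.

```python
import math

# Plain-Python sketch of the Pearson correlation between two columns.
# `pearson` and the sample data are hypothetical, for illustration only.

def pearson(xs, ys):
    """Covariance of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]    # rises with a
c = [4.0, 3.0, 2.0, 1.0]    # falls as a rises
d = [1.0, -1.0, -1.0, 1.0]  # no linear relationship with a

print(pearson(a, a), pearson(a, b), pearson(a, c), pearson(a, d))
```

A column paired with itself gives 1, which matches the diagonal of the matrix above.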

## Scaling

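We need to scale our features so that a dimension with a large numeric range does not dominate the others.  We scale each feature to unit standard deviation (withStd=True) but do not subtract the mean (withMean=False), because centering would turn our sparse vectors into dense ones.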



In [18]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(featureVector)

# Normalize each feature to have unit standard deviation.
ScaledFeatures = scalerModel.transform(featureVector)
ScaledFeatures.select('features', 'scaledFeatures').show()

+--------------------+--------------------+
|            features|      scaledFeatures|
+--------------------+--------------------+
|(70,[0,1,2,23],[5...|(70,[0,1,2,23],[2...|
|(70,[0,1,52,64],[...|(70,[0,1,52,64],[...|
|(70,[0,1,2,19,20,...|(70,[0,1,2,19,20,...|
|(70,[0,1,35,59],[...|(70,[0,1,35,59],[...|
|(70,[0,1,14,20],[...|(70,[0,1,14,20],[...|
|(70,[0,1,20,27,35...|(70,[0,1,20,27,35...|
|(70,[0,1,11,33,52...|(70,[0,1,11,33,52...|
|(70,[0,1,22],[5.0...|(70,[0,1,22],[2.4...|
|(70,[0,1,14,20],[...|(70,[0,1,14,20],[...|
|(70,[0,1,4,22,31,...|(70,[0,1,4,22,31,...|
|(70,[0,1,24,62],[...|(70,[0,1,24,62],[...|
|(70,[0,1,31,59],[...|(70,[0,1,31,59],[...|
|(70,[0,1,37,58],[...|(70,[0,1,37,58],[...|
|(70,[0,1,18,19,24...|(70,[0,1,18,19,24...|
|(70,[0,1,6,16,20,...|(70,[0,1,6,16,20,...|
|(70,[0,1,20],[5.0...|(70,[0,1,20],[2.4...|
|(70,[0,1,40],[5.0...|(70,[0,1,40],[2.4...|
|(70,[0,1,6,19],[5...|(70,[0,1,6,19],[2...|
|(70,[0,1,35],[5.0...|(70,[0,1,35],[2.4...|
|(70,[0,1,20,27],[...|(70,[0,1,2
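The following plain-Python sketch shows the arithmetic behind scaling to unit standard deviation without centering. It assumes the sample (n-1) standard deviation and returns zeros for a constant column; the `scale_to_unit_std` helper and the two sample columns are made up for illustration and are not Spark code.

```python
import statistics

# Plain-Python sketch of scaling one column to unit standard deviation
# without subtracting the mean. `scale_to_unit_std` is hypothetical.

def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation."""
    std = statistics.stdev(column)  # sample std, n-1 in the denominator
    if std == 0.0:
        # Assumption: a constant column is mapped to zeros.
        return [0.0] * len(column)
    return [value / std for value in column]

num_items = [0.0, 2.0, 4.0]   # small range
big_range = [0.0, 0.0, 30.0]  # much larger range

scaled_items = scale_to_unit_std(num_items)
scaled_big = scale_to_unit_std(big_range)

print(scaled_items)
print(scaled_big)
```

Because the values are only divided, never shifted, every zero stays zero, so the vectors remain sparse after scaling.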

## Running PCA

Now we will run PCA to reduce the number of dimensions and decorrelate them.  Here we keep the top 5 principal components (k=5).

In [26]:
pca = PCA(k=5, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(ScaledFeatures)
pcaFeatures = model.transform(ScaledFeatures).select("pcaFeatures")

Let's take a look at the transformed dataset.

## Re-Running the Correlation Matrix

In [29]:
# Check the correlation matrix again, this time on the PCA output.

r1 = Correlation.corr(pcaFeatures, "pcaFeatures").head()
print("Pearson correlation matrix:\n" + str(r1[0]))


Pearson correlation matrix:
DenseMatrix([[  1.00000000e+00,   2.19178051e-15,  -1.96118809e-15,
                1.19928863e-14,  -7.16527547e-16],
             [  2.19178051e-15,   1.00000000e+00,   4.94598957e-15,
               -1.55774434e-14,  -6.85746261e-15],
             [ -1.96118809e-15,   4.94598957e-15,   1.00000000e+00,
               -8.64329987e-15,  -2.69051328e-15],
             [  1.19928863e-14,  -1.55774434e-14,  -8.64329987e-15,
                1.00000000e+00,   7.28128485e-15],
             [ -7.16527547e-16,  -6.85746261e-15,  -2.69051328e-15,
                7.28128485e-15,   1.00000000e+00]])


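Note the very small, close to zero correlations off the diagonal.  Values around 1e-15 are floating-point round-off, so the 5 dimensions are for all practical purposes uncorrelated.  This is what we expect, since principal components are orthogonal directions chosen so that the projected data is uncorrelated.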
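To see why the off-diagonal values vanish, here is a tiny two-dimensional PCA sketch in plain Python. It rotates centered points onto their principal axes using the closed-form angle for a 2x2 covariance matrix, then checks the correlation before and after. The helper names and sample points are made up for illustration, and this is not how Spark computes PCA on 70 dimensions.

```python
import math

# Plain-Python sketch of PCA in two dimensions. `pearson`, `pca_2d` and
# the sample points are hypothetical, for illustration only.

def pearson(xs, ys):
    """Pearson correlation between two equally long columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pca_2d(xs, ys):
    """Center the points and rotate them onto their two principal axes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    sxx = sum(x * x for x in cx) / (n - 1)
    syy = sum(y * y for y in cy) / (n - 1)
    sxy = sum(x * y for x, y in zip(cx, cy)) / (n - 1)
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pc1 = [cos_t * x + sin_t * y for x, y in zip(cx, cy)]
    pc2 = [-sin_t * x + cos_t * y for x, y in zip(cx, cy)]
    return pc1, pc2

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # strongly correlated with xs

pc1, pc2 = pca_2d(xs, ys)
before = pearson(xs, ys)
after = pearson(pc1, pc2)

print(before)  # close to 1
print(after)   # tiny, round-off only
```

As in the matrix above, the correlation after projection is not exactly zero, only zero up to floating-point error.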