# Clustering the Walmart Data with Spark

Let's look at a clustering example in Spark MLlib.

Here, we are going to load the Walmart trip-type dataset. Each row has a `VisitNumber`, a `TripType` label, and a set of numeric feature columns. We will load the CSV file as a Spark DataFrame and view it.

In [7]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans



In [3]:
dataset = spark.read.csv("/data/walmart-triptype/train-transformed.csv.gz", header=True, inferSchema=True)


In [4]:
dataset.show()

+-----------+--------+-------+--------+------+----------+-----------+----------+------+---------------+------+-------+-------------------+---------+----------------+--------------------+-----------------------+-----------+----------+--------------+-------------+-----+-----------+-----------+------------------+------------------+------------+---------+--------------------------+-----------------+--------+----------------------+----------+---------------+-----------------------+------------------------+---------------------+-------------------+--------------+---------------------------+----------------------+------------+----------+---------------------+---------------+----------------+---------------------+----------------+--------+---------------+----------------+----------------+-----------------+---------------------+-------------+-----------------+------------+-----------+-----------------------+------------------+---------------+-------+-------+--------+------------+-------------+-

## Creating Vectors

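We'll again use the VectorAssembler class to create features from the data. We exclude `VisitNumber` and `TripType` so that neither the row identifier nor the label ends up in the feature vector.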

In [11]:
columns = dataset.columns
columns.remove('VisitNumber')
columns.remove('TripType')

assembler = VectorAssembler(inputCols=columns, outputCol="features")
featureVector = assembler.transform(dataset)
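
Conceptually, the assembler concatenates the selected columns of each row, in order, into a single vector. The plain-Python sketch below mimics that for one row. It is an illustration only, not Spark's implementation: `assemble` is a hypothetical helper, and the column names `col_a`, `col_b`, and `col_c` are made up for the example.

```python
# Conceptual sketch only: mimics what a vector assembler does for one row.
# `assemble` and the col_* column names are hypothetical, not from the dataset.
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row, in order, into a feature list."""
    return [float(row[c]) for c in input_cols]

row = {"VisitNumber": 7, "TripType": 30, "col_a": 5, "col_b": -1, "col_c": 0}

# Drop the identifier and the label, keep everything else as a feature.
input_cols = [c for c in row if c not in ("VisitNumber", "TripType")]

features = assemble(row, input_cols)
print(features)  # [5.0, -1.0, 0.0]
```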


In [12]:
for row in featureVector.select('features').take(10):
    print("Vector: %s\n" % (str(row)))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: -1.0, 2: 1.0, 23: -1.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 2.0, 52: 1.0, 64: 1.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 28.0, 2: 1.0, 19: 2.0, 20: 1.0, 33: 1.0, 44: 1.0, 51: 18.0, 53: 4.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 3.0, 35: 1.0, 59: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 3.0, 14: 1.0, 20: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 4.0, 20: 1.0, 27: 1.0, 35: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 7.0, 11: 2.0, 33: 2.0, 52: 2.0, 64: 1.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 9.0, 22: 9.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 4.0, 14: 2.0, 20: 2.0}))

Vector: Row(features=SparseVector(70, {0: 5.0, 1: 9.0, 4: 1.0, 22: 1.0, 31: 1.0, 35: 2.0, 38: 1.0, 46: 3.0}))



Note the output. These are sparse (not dense) vectors. That's because our data is sparse: only a few of the 70 features are non-zero for any given row.
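
To make the notation concrete, `SparseVector(70, {0: 5.0, ...})` reads as "a vector of length 70 whose listed indices hold the given values and whose other entries are zero". The plain-Python sketch below expands the first vector printed above into a dense list. `sparse_to_dense` is a hypothetical helper written for this illustration, not part of Spark.

```python
# Sketch: expand a (size, {index: value}) sparse representation to a dense list.
# `sparse_to_dense` is a hypothetical helper, not a Spark API.
def sparse_to_dense(size, entries):
    dense = [0.0] * size              # every entry defaults to zero
    for index, value in entries.items():
        dense[index] = value          # fill in only the stored entries
    return dense

# First vector from the output above.
first = sparse_to_dense(70, {0: 5.0, 1: -1.0, 2: 1.0, 23: -1.0})

nonzero = sum(1 for v in first if v != 0.0)
print(len(first), nonzero)  # 70 4
```

Only 4 of the 70 entries are stored, which is why the sparse form is the more compact choice here.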

## Step 3: Running K-Means

We know there are 39 trip types, so 39 makes a good "natural" value of k.

In [13]:
k = 39  # Number of triptypes is 39.
kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(featureVector)
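# Within Set Sum of Squared Errors. Note: computeCost was deprecated in Spark 2.4
# and removed in 3.0; on newer versions use model.summary.trainingCost instead.
wssse = model.computeCost(featureVector)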

print(wssse)

2370357.9494931693
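
The number printed above is the Within Set Sum of Squared Errors (WSSSE): for every point, take the squared Euclidean distance to its nearest cluster center, then add these up. Lower values mean tighter clusters. The plain-Python sketch below computes the same quantity on a tiny made-up dataset. The points, centroids, and helper names are assumptions for illustration and have nothing to do with the Walmart data.

```python
# Sketch of the WSSSE computation on toy data (not the Walmart dataset).
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def wssse(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

toy_cost = wssse(points, centroids)
print(toy_cost)  # 4.0
```

Each toy point sits exactly one unit from its nearest centroid, so the four squared distances of 1.0 sum to 4.0.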


Let's take a look at the transformed dataset, starting with how many rows were assigned to each cluster.

In [14]:
predictions = model.transform(featureVector)
histogram = predictions.groupBy('prediction').count().orderBy('prediction')
histogram.show(40)

+----------+-----+
|prediction|count|
+----------+-----+
|         0|  458|
|         1| 8872|
|         2|  127|
|         3| 9746|
|         4| 3120|
|         5|  845|
|         6| 7489|
|         7|  508|
|         8|  118|
|         9|  188|
|        10| 2590|
|        11|    2|
|        12|   61|
|        13|  855|
|        14| 6053|
|        15|   55|
|        16| 1209|
|        17|  223|
|        18|  540|
|        19| 1027|
|        20|  481|
|        21| 1478|
|        22| 1408|
|        23|  197|
|        24| 3779|
|        25| 5467|
|        26| 9590|
|        27| 1539|
|        28|  462|
|        29|11122|
|        30| 7628|
|        31| 1504|
|        32| 1335|
|        33|  508|
|        34|  410|
|        35| 3687|
|        36|  716|
|        37|  166|
|        38|  111|
+----------+-----+



In [None]:
histogram.toPandas().plot.bar(colormap='Greens')

## Step 4: Relate Cluster Numbers to Trip Types

Is there a relationship here? Discuss the results.

Remember, clustering is trying to find "natural" patterns -- it is not a classifier, and if we are trying to classify trip type we should use a classification algorithm and not k-means.
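
One simple way to put a number on "is there a relationship?" is cluster purity: the share of a cluster's rows that carry its most common `TripType`. The sketch below uses the counts for cluster 0 exactly as they appear in the loop output that follows. `purity` is a hypothetical helper, and purity is only one of several possible measures.

```python
# Sketch: purity of one cluster, from its TripType -> count breakdown.
# `purity` is a hypothetical helper; the counts are cluster 0's, copied from
# the output of the loop over clusters.
def purity(type_counts):
    """Fraction of the cluster's rows that belong to its most common TripType."""
    total = sum(type_counts.values())
    return max(type_counts.values()) / total

cluster_0 = {40: 389, 38: 30, 37: 26, 24: 5, 44: 3, 22: 2, 29: 1, 33: 1, 42: 1}

print(sum(cluster_0.values()))      # 458
print(round(purity(cluster_0), 3))  # 0.849
```

A purity near 1 means the cluster is dominated by one trip type, while a low value, as one would expect for a cluster like #1 that mixes many trip types, means the cluster cuts across labels.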

In [15]:

for i in range(k):  # one pass per cluster, 0 through k - 1
    print('Cluster #' + str(i) + ':')
    predictions.filter('prediction == ' + str(i)).groupBy('TripType').count().filter("`count` >= 1").sort('count', ascending=False).show()

Cluster #0:
+--------+-----+
|TripType|count|
+--------+-----+
|      40|  389|
|      38|   30|
|      37|   26|
|      24|    5|
|      44|    3|
|      22|    2|
|      29|    1|
|      33|    1|
|      42|    1|
+--------+-----+

Cluster #1:
+--------+-----+
|TripType|count|
+--------+-----+
|       8| 2600|
|       9| 2285|
|       3|  897|
|       5|  573|
|       7|  534|
|     999|  412|
|       6|  201|
|      30|  149|
|      24|  147|
|      31|  145|
|      22|  139|
|      25|  126|
|      32|   80|
|      19|   64|
|      18|   50|
|      29|   48|
|      20|   45|
|      28|   43|
|       4|   42|
|      23|   36|
+--------+-----+
only showing top 20 rows

Cluster #2:
+--------+-----+
|TripType|count|
+--------+-----+
|      25|   81|
|      42|   20|
|      44|   12|
|      41|    6|
|      30|    2|
|       8|    1|
|      24|    1|
|      32|    1|
|     999|    1|
|      36|    1|
|       7|    1|
+--------+-----+

Cluster #3:
+--------+-----+
|TripType|count|
+-----