# Spark MLlib Example: Clustering

### Download the [spreadsheet](WSSSE-versus-k.xlsx)

Let's look at a clustering example in Spark MLlib.

We are going to load the mtcars dataset, which has some stats on different models of cars.  We will load the CSV file as a Spark dataframe and view it.

The data comes from the 1974 *Motor Trend* magazine.

Here are some of the columns:
* model - name of the car
* mpg - miles per (US) gallon
* cyl - number of cylinders
* disp - displacement (cu. in.)
* hp - gross horsepower
* drat - rear axle ratio

Are there any natural clusters you can identify from this data?

We are going to cluster on the **mpg** and **cyl** attributes.

You can also download and view the raw data in Excel: [cars.csv](/data/cars/mtcars_header.csv)

<img src="../../assets/images/6.1-cars2.png" style="border: 5px solid grey; max-width:100%;" />

In [6]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
sc = spark.sparkContext

Initializing Spark...
Spark found in :  /home/ubuntu/spark
Spark config:
	 executor.memory=2g
	some_property=some_value
	spark.app.name=TestApp
	spark.master=local[*]
	spark.sql.warehouse.dir=/tmp/tmp1qw664_n
	spark.submit.deployMode=client
	spark.ui.showConsoleProgress=true
Spark UI running on port 4045


## Step 1 : Load Data

In [7]:
## Imports
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

In [8]:
dataset = spark.read.csv("/data/cars/mtcars_header.csv", header=True, inferSchema=True)

In [9]:
## TODO : print schema
## Hint : printSchema()
dataset.printSchema()

root
 |-- model: string (nullable = true)
 |-- mpg: double (nullable = true)
 |-- cyl: integer (nullable = true)
 |-- disp: double (nullable = true)
 |-- hp: integer (nullable = true)
 |-- drat: double (nullable = true)
 |-- wt: double (nullable = true)
 |-- qsec: double (nullable = true)
 |-- vs: integer (nullable = true)
 |-- am: integer (nullable = true)
 |-- gear: integer (nullable = true)
 |-- carb: integer (nullable = true)



In [10]:
## TODO : display the data
## Hint : show
dataset.show()

+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|              model| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|          Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|      Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|         Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|     Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|
|  Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|
|            Valiant|18.1|  6|225.0|105|2.76| 3.46|20.22|  1|  0|   3|   1|
|         Duster 360|14.3|  8|360.0|245|3.21| 3.57|15.84|  0|  0|   3|   4|
|          Merc 240D|24.4|  4|146.7| 62|3.69| 3.19| 20.0|  1|  0|   4|   2|
|           Merc 230|22.8|  4|140.8| 95|3.92| 3.15| 22.9|  1|  0|   4|   2|
|           Merc 280|19.2|  6|167.6|123|3.92| 3.44| 18.3|  1|  0|   4|   4|
|          M

## Step 2 : Extract data
We only need the 'model', 'mpg' and 'cyl' columns.

In [11]:
## TODO : extract the columns we need : model, mpg and cyl
dataset2 = dataset.select(["model", "mpg", "cyl"])
dataset2.show()

+-------------------+----+---+
|              model| mpg|cyl|
+-------------------+----+---+
|          Mazda RX4|21.0|  6|
|      Mazda RX4 Wag|21.0|  6|
|         Datsun 710|22.8|  4|
|     Hornet 4 Drive|21.4|  6|
|  Hornet Sportabout|18.7|  8|
|            Valiant|18.1|  6|
|         Duster 360|14.3|  8|
|          Merc 240D|24.4|  4|
|           Merc 230|22.8|  4|
|           Merc 280|19.2|  6|
|          Merc 280C|17.8|  6|
|         Merc 450SE|16.4|  8|
|         Merc 450SL|17.3|  8|
|        Merc 450SLC|15.2|  8|
| Cadillac Fleetwood|10.4|  8|
|Lincoln Continental|10.4|  8|
|  Chrysler Imperial|14.7|  8|
|           Fiat 128|32.4|  4|
|        Honda Civic|30.4|  4|
|     Toyota Corolla|33.9|  4|
+-------------------+----+---+
only showing top 20 rows



## Step 3 : Creating Vectors

Now that we have ourselves a dataframe, let's work on turning it into vectors.  We're going to vectorize 2 columns:

1. MPG
2. Number of cylinders.

We'll use the VectorAssembler class to add a new column named "features". Each value in that column is a Vector holding the row's mpg and cyl values.
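As a rough mental model, the assembling step can be pictured in plain Python. This is only a conceptual sketch and not Spark code: `assemble` is a hypothetical helper, and rows are modeled as dictionaries rather than dataframe rows.

```python
# Conceptual, plain-Python picture of assembling columns into one vector per row.
# Not Spark code: `assemble` is a hypothetical helper used only for illustration.

def assemble(row, input_cols):
    """Return a copy of `row` with a 'features' list built from `input_cols`."""
    features = [float(row[col]) for col in input_cols]
    return {**row, "features": features}


rows = [
    {"model": "Mazda RX4", "mpg": 21.0, "cyl": 6},
    {"model": "Datsun 710", "mpg": 22.8, "cyl": 4},
]
assembled = [assemble(r, ["mpg", "cyl"]) for r in rows]
for r in assembled:
    print(r["model"], r["features"])
```

Note that the integer cylinder count ends up as a floating-point entry in the vector, which matches the `[21.0,6.0]` style of the Spark output.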

In [12]:
## TODO : create featureVector with 'mpg' and 'cyl'
## Hint :  inputCols=['mpg', 'cyl']
assembler = VectorAssembler(inputCols=["mpg", "cyl"], outputCol="features")
featureVector = assembler.transform(dataset2)
featureVector.show()

+-------------------+----+---+----------+
|              model| mpg|cyl|  features|
+-------------------+----+---+----------+
|          Mazda RX4|21.0|  6|[21.0,6.0]|
|      Mazda RX4 Wag|21.0|  6|[21.0,6.0]|
|         Datsun 710|22.8|  4|[22.8,4.0]|
|     Hornet 4 Drive|21.4|  6|[21.4,6.0]|
|  Hornet Sportabout|18.7|  8|[18.7,8.0]|
|            Valiant|18.1|  6|[18.1,6.0]|
|         Duster 360|14.3|  8|[14.3,8.0]|
|          Merc 240D|24.4|  4|[24.4,4.0]|
|           Merc 230|22.8|  4|[22.8,4.0]|
|           Merc 280|19.2|  6|[19.2,6.0]|
|          Merc 280C|17.8|  6|[17.8,6.0]|
|         Merc 450SE|16.4|  8|[16.4,8.0]|
|         Merc 450SL|17.3|  8|[17.3,8.0]|
|        Merc 450SLC|15.2|  8|[15.2,8.0]|
| Cadillac Fleetwood|10.4|  8|[10.4,8.0]|
|Lincoln Continental|10.4|  8|[10.4,8.0]|
|  Chrysler Imperial|14.7|  8|[14.7,8.0]|
|           Fiat 128|32.4|  4|[32.4,4.0]|
|        Honda Civic|30.4|  4|[30.4,4.0]|
|     Toyota Corolla|33.9|  4|[33.9,4.0]|
+-------------------+----+---+----

## Step 4 : Running Kmeans

Now it's time to run k-means on the resulting dataframe. We don't know what value of k to use, so let's just start with k=2.  This means we will cluster into two groups.

We will create a KMeans estimator and fit it to the data (fitting is the training step), then compute the cost of the resulting model.
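To give an idea of what fitting involves, here is a minimal plain-Python sketch of the classic k-means loop (Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its points, and repeat. It is an illustration under simplifying assumptions, not Spark's implementation: the helper names `nearest` and `lloyd_kmeans` are made up for this sketch, and it starts from the first k points, whereas Spark's KMeans defaults to the k-means|| initialization.

```python
def nearest(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def lloyd_kmeans(points, k, max_iter=10):
    """Plain k-means. Starts from the first k points so the result is repeatable."""
    centers = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: put every point in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged, nothing moved
            break
        centers = new_centers
    return centers


# (mpg, cyl) pairs for six cars taken from the table in Step 2.
points = [
    (10.4, 8.0),  # Cadillac Fleetwood
    (33.9, 4.0),  # Toyota Corolla
    (14.3, 8.0),  # Duster 360
    (32.4, 4.0),  # Fiat 128
    (18.7, 8.0),  # Hornet Sportabout
    (30.4, 4.0),  # Honda Civic
]
centers = lloyd_kmeans(points, k=2)
labels = [nearest(p, centers) for p in points]
print(centers)
print(labels)
```

On this tiny sample the three 8-cylinder cars end up in one group and the three 4-cylinder cars in the other. The full dataset and Spark's own initialization can produce different groupings.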

In [13]:
k = 2
kmeans = KMeans().setK(k).setMaxIter(10)
model = kmeans.fit(featureVector)
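# Note: computeCost was deprecated in Spark 2.4 and may be unavailable on newer
# versions; there, model.summary.trainingCost reports the cost on the training data.
wssse = model.computeCost(featureVector)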

print(wssse)

425.39658730158885


WSSSE stands for "within-set sum of squared errors"; lower values mean tighter clusters.  This value is fairly high, so we will probably need to change k.
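To make the quantity concrete: WSSSE adds up, for every point, the squared distance to its nearest cluster center. The following is a minimal plain-Python sketch of that calculation. The function name `within_set_sse` and the hand-picked centers are assumptions for illustration; the four points are (mpg, cyl) values of cars from the tables above.

```python
def within_set_sse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Cadillac Fleetwood, Duster 360, Honda Civic, Toyota Corolla as (mpg, cyl).
points = [(10.4, 8.0), (14.3, 8.0), (30.4, 4.0), (33.9, 4.0)]

# One center at the overall mean versus one center per obvious group.
one_center = [(22.25, 6.0)]
two_centers = [(12.35, 8.0), (32.15, 4.0)]

print(within_set_sse(points, one_center))
print(within_set_sse(points, two_centers))
```

With a single center every car is far from it and the sum is large; with one center per group the distances shrink and so does the total.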



## Step 5 : Display grouping
Let's take a look at the transformed dataset.  Notice the new column "prediction."

In [14]:
predicted = model.transform(featureVector)
predicted.orderBy(['prediction', 'mpg']).show(32)

+-------------------+----+---+----------+----------+
|              model| mpg|cyl|  features|prediction|
+-------------------+----+---+----------+----------+
|          Mazda RX4|21.0|  6|[21.0,6.0]|         0|
|      Mazda RX4 Wag|21.0|  6|[21.0,6.0]|         0|
|     Hornet 4 Drive|21.4|  6|[21.4,6.0]|         0|
|         Volvo 142E|21.4|  4|[21.4,4.0]|         0|
|      Toyota Corona|21.5|  4|[21.5,4.0]|         0|
|         Datsun 710|22.8|  4|[22.8,4.0]|         0|
|           Merc 230|22.8|  4|[22.8,4.0]|         0|
|          Merc 240D|24.4|  4|[24.4,4.0]|         0|
|      Porsche 914-2|26.0|  4|[26.0,4.0]|         0|
|          Fiat X1-9|27.3|  4|[27.3,4.0]|         0|
|       Lotus Europa|30.4|  4|[30.4,4.0]|         0|
|        Honda Civic|30.4|  4|[30.4,4.0]|         0|
|           Fiat 128|32.4|  4|[32.4,4.0]|         0|
|     Toyota Corolla|33.9|  4|[33.9,4.0]|         0|
| Cadillac Fleetwood|10.4|  8|[10.4,8.0]|         1|
|Lincoln Continental|10.4|  8|[10.4,8.0]|     

Notice what we have here.  We have two clusters. One contains the more fuel-efficient cars, like the Fiat and the Corolla (remember, we cluster on two variables only: mpg and cyl).  The other contains basically all the other cars.  We can probably get better results here with a different value of k.

## Step 6 : Adjust K

In [15]:
k = 3
kmeans = KMeans().setK(k).setMaxIter(10)
model = kmeans.fit(featureVector)
wssse = model.computeCost(featureVector)

print('WSSSE: ' + str(wssse))

WSSSE: 169.40535714285784


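This is a much lower WSSSE than with k=2 (lower is better).  Keep in mind that WSSSE generally keeps decreasing as k grows, so a lower value alone does not show that k=3 is the best choice.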

In [16]:
predicted = model.transform(featureVector)
predicted.orderBy(['prediction', 'mpg']).show(32)

+-------------------+----+---+----------+----------+
|              model| mpg|cyl|  features|prediction|
+-------------------+----+---+----------+----------+
|          Merc 280C|17.8|  6|[17.8,6.0]|         0|
|            Valiant|18.1|  6|[18.1,6.0]|         0|
|  Hornet Sportabout|18.7|  8|[18.7,8.0]|         0|
|   Pontiac Firebird|19.2|  8|[19.2,8.0]|         0|
|           Merc 280|19.2|  6|[19.2,6.0]|         0|
|       Ferrari Dino|19.7|  6|[19.7,6.0]|         0|
|      Mazda RX4 Wag|21.0|  6|[21.0,6.0]|         0|
|          Mazda RX4|21.0|  6|[21.0,6.0]|         0|
|     Hornet 4 Drive|21.4|  6|[21.4,6.0]|         0|
|         Volvo 142E|21.4|  4|[21.4,4.0]|         0|
|      Toyota Corona|21.5|  4|[21.5,4.0]|         0|
|         Datsun 710|22.8|  4|[22.8,4.0]|         0|
|           Merc 230|22.8|  4|[22.8,4.0]|         0|
|          Merc 240D|24.4|  4|[24.4,4.0]|         0|
|      Porsche 914-2|26.0|  4|[26.0,4.0]|         1|
|          Fiat X1-9|27.3|  4|[27.3,4.0]|     

## Step 7 : Iterate over K
We are going to calculate WSSSE for various values of K:

In [18]:
for k in range(2,33):
    kmeans = KMeans().setK(k).setMaxIter(10)
    model = kmeans.fit(featureVector)
    wssse = model.computeCost(featureVector)
    print("k", k, "wssse", wssse)

k 2 wssse 425.39658730158885
k 3 wssse 169.40535714285784
k 4 wssse 109.71498412698526
k 5 wssse 93.13778571428566
k 6 wssse 54.20988888888917
k 7 wssse 39.88088888888933
k 8 wssse 32.65122222222237
k 9 wssse 19.45088888888938
k 10 wssse 17.200888888888926
k 11 wssse 12.396388888888282
k 12 wssse 8.817500000000791
k 13 wssse 7.51666666666722
k 14 wssse 4.795333333332792
k 15 wssse 4.670333333332906
k 16 wssse 4.545333333332678
k 17 wssse 2.144166666666706
k 18 wssse 1.019166666666706
k 19 wssse 0.6141666666667334
k 20 wssse 0.5691666666665469
k 21 wssse 0.5216666666671017
k 22 wssse 0.34166666666703804
k 23 wssse 0.2350000000004684
k 24 wssse 0.07666666666648325
k 25 wssse 0.050000000000295586
k 26 wssse 0.050000000000295586
k 27 wssse 0.050000000000295586
k 28 wssse 0.050000000000295586
k 29 wssse 0.050000000000295586
k 30 wssse 0.050000000000295586
k 31 wssse 0.050000000000295586
k 32 wssse 0.050000000000295586
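
Because WSSSE keeps shrinking as k grows, the smallest value is not a useful target by itself. One common heuristic (the "elbow" method) is to look for the point where adding another cluster stops giving a large improvement. The sketch below is one possible way to quantify that from the numbers printed above; the names `wssse_by_k` and `relative_drops` are made up for this illustration, and only the values for k = 2 to 10 are copied in.

```python
# WSSSE values copied from the loop output above (k = 2..10).
wssse_by_k = {
    2: 425.39658730158885,
    3: 169.40535714285784,
    4: 109.71498412698526,
    5: 93.13778571428566,
    6: 54.20988888888917,
    7: 39.88088888888933,
    8: 32.65122222222237,
    9: 19.45088888888938,
    10: 17.200888888888926,
}


def relative_drops(costs):
    """Map each k (after the first) to the fractional WSSSE decrease from the previous k."""
    ks = sorted(costs)
    return {k: (costs[prev] - costs[k]) / costs[prev] for prev, k in zip(ks, ks[1:])}


drops = relative_drops(wssse_by_k)
for k, drop in drops.items():
    print(f"k={k}: WSSSE fell by {drop:.0%}")
```

In these numbers the step from k=2 to k=3 gives the largest relative improvement (roughly 60%). Later steps help less and less evenly, so the final choice of k remains a judgment call; plotting WSSSE against k, as in the linked spreadsheet, is the usual way to inspect it.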
