# K-Means Clustering : Cars Data

Let's look at a clustering example in Spark MLlib.

We are going to use the mtcars dataset, which has some stats on different models of cars.  We will load the CSV file as a Spark DataFrame and view it.

Here is a [spreadsheet](WSSSE-versus-k.xlsx) for you to record k and WSSSE (Within Set Sum of Squared Errors).

## Step 1: Load the Data

In [None]:
%matplotlib inline
from matplotlib import pyplot

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

In [None]:
dataset = spark.read.csv("/data/cars/mtcars_header.csv", header=True, inferSchema=True)

In [None]:
dataset.show()

## Step 2: Creating Vectors

Now that we have a DataFrame, let's work on turning it into vectors.  We're going to vectorize 2 columns:

1. MPG
2. Number of cylinders.

We'll use the VectorAssembler class to create a new column named features. Each row of this column will hold a Vector built from the two input columns.
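To picture what the assembler produces, here is a minimal pure-Python sketch of the idea. It is not Spark code: the `assemble` helper and the sample rows are hypothetical and only illustrate how several numeric columns end up in a single vector-valued column.

```python
# Hypothetical rows standing in for a few DataFrame records (illustrative values only)
rows = [
    {"model": "car_a", "mpg": 21.0, "cyl": 6},
    {"model": "car_b", "mpg": 32.4, "cyl": 4},
    {"model": "car_c", "mpg": 15.2, "cyl": 8},
]

def assemble(row, input_cols, output_col):
    """Return a copy of row with the input columns packed into one list of floats."""
    out = dict(row)
    out[output_col] = [float(row[c]) for c in input_cols]
    return out

assembled = [assemble(r, ["mpg", "cyl"], "features") for r in rows]
for r in assembled:
    print(r["model"], r["features"])
```

The original columns are kept, and the new column simply carries the same numbers side by side, which is the shape a clustering algorithm expects as input.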


In [None]:
## TODO: create an mpg_cyl dataframe with just 'model', 'mpg', and 'cyl'

mpg_cyl = dataset.select("???", "???", "???")
mpg_cyl.show(40)

In [None]:
from pyspark.ml.feature import VectorAssembler

## TODO: create a VectorAssembler that combines "mpg" and "cyl" into the output column "features"
# input : mpg, cyl
# output : features
assembler = VectorAssembler(inputCols=["???", "???"], outputCol="???")


## TODO: transform dataframe in order to create new column with feature vector
## Hint : assembler.transform(mpg_cyl)
featureVector = assembler.???(???)


In [None]:
featureVector.show(40)

## Step 3: Running K-Means

Now it's time to run k-means on the resulting dataframe.  We don't know what value of k to use, so let's just start with k=2.  This means we will cluster into two groups.

We will fit (that is, train) a model on the data, and then compute its WSSSE to see how tightly the points sit around their cluster centers.
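For intuition about what fitting does, here is a minimal pure-Python sketch of the classic k-means loop (Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its points, and repeat. This is an illustration under simplifying assumptions, not Spark's implementation (Spark initializes centers with k-means|| by default and runs distributed). The function names and sample points are made up for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def nearest_index(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))

def kmeans(points, k, iterations=20, seed=1):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: group points by their nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    labels = [nearest_index(p, centers) for p in points]
    return centers, labels

# Made-up 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.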



In [None]:
from pyspark.ml.clustering import KMeans

## TODO: Instantiate K-means model with value k
k = 2
kmeans = KMeans().setK(???).setSeed(1)

## TODO: fit featureVector with kmeans model
## Hint : featureVector
model = kmeans.fit(???)

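## TODO: calculate WSSSE by calling computeCost on dataframe
## Hint : model.computeCost(featureVector)
## Note : computeCost was deprecated in Spark 2.4 and removed in Spark 3.0;
##        on newer versions, model.summary.trainingCost reports the training cost instead
wssse = model.computeCost(???)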

print(wssse)

This WSSSE is fairly high, which suggests that two clusters do not fit the data tightly.  We will probably need to change k.

Let's take a look at the transformed dataset.  Notice the new column "prediction".


In [None]:
## transform the dataset from the model
model.transform(featureVector).orderBy(['prediction', 'mpg']).show(32)

Notice what we have here.  We have two clusters. One contains smaller, fuel-efficient cars like the Fiat and the Corolla (remember, we cluster on two variables only: MPG and cylinders).  The other contains basically all the other cars.  We can probably get better results with a different value of k.

In [None]:
k = 3
kmeans = KMeans().setK(k).setSeed(1)
model = kmeans.fit(featureVector)
wssse = model.computeCost(featureVector)

print('WSSSE: ' + str(wssse))

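This is a much better result for WSSSE (lower is better).  Keep in mind, though, that WSSSE generally keeps dropping as k increases, so a lower value alone does not tell us we have found the best k.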
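To make the number concrete: WSSSE adds up, over all points, the squared distance from each point to its nearest cluster center. The sketch below computes it in plain Python on made-up points and hand-picked centers (none of these values come from the cars data), and shows the cost falling when a second center is added.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Made-up points: two pairs, far apart from each other
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

one_center = [(5.0, 2.0)]                # a single center in the middle
two_centers = [(1.0, 2.0), (9.0, 2.0)]   # one center per pair

print(wssse(points, one_center))   # 68.0
print(wssse(points, two_centers))  # 4.0
```

With one center every point is far from it; with two centers each point has a center close by, so the total squared error shrinks.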

In [None]:
# look at transformed data again for k=3
model.transform(featureVector).orderBy(['prediction', 'mpg']).show(32)

## Step 4: Hyperparameter tuning

Let's try iterating and plotting over values of k, so we can practice using the elbow method.

**Q ==> Why is WSSSE almost zero when k=32?**

In [None]:
kvals = []
wssses = []

# TODO : Run k from 2 to 32 (remember that range() excludes its upper bound)
for k in range(???,???):
    kmeans = KMeans().setK(???).setSeed(1)
    model = kmeans.fit(???)
    wssse = model.computeCost(featureVector)
    print("k={},  wssse={}".format(k, wssse))
    kvals.append(k)
    wssses.append(wssse)

In [None]:
import pandas as pd
df = pd.DataFrame({'k': kvals, 'wssse':wssses})
df

In [None]:
## TODO: plot the values of k as the X axis versus the costs (WSSSE) as the y axis
## Hint  : x=kvals,  y=wssses
pyplot.plot(???, ???)

Using the elbow method, what would be a good value of k?
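Reading the elbow off a plot is a judgment call, but one rough way to put a number on it is to look for the k where the curve bends the most, that is, where the improvement from the previous step is large and the improvement to the next step is small. The sketch below is only one possible heuristic, and the helper name and cost values are invented for illustration; they are not results from this lab.

```python
def elbow_by_second_difference(kvals, costs):
    """Return the k where the drop in cost slows down the most."""
    best_k, best_bend = None, float("-inf")
    # Only interior points have both a previous and a next neighbor
    for i in range(1, len(costs) - 1):
        drop_before = costs[i - 1] - costs[i]
        drop_after = costs[i] - costs[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = kvals[i], bend
    return best_k

# Made-up costs for illustration only (NOT from the mtcars data)
example_kvals = [2, 3, 4, 5, 6, 7]
example_costs = [400.0, 250.0, 100.0, 90.0, 82.0, 76.0]

elbow = elbow_by_second_difference(example_kvals, example_costs)
print(elbow)  # 4
```

On real data the curve is usually noisier, so treat a heuristic like this as a starting point and compare it against what you see in your own plot.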
