# K-Means Clustering : Cars Data

Let's look at a clustering example.

Here, we are going to load the mtcars dataset, which has some stats on different models of cars.  We will load the CSV file as a pandas dataframe and view it.

And here is a [spreadsheet](WSSSE-versus-k.xlsx) for you to record K and WSSSE (within-set sum of squared errors).

## Step 1: Load the Data

In [None]:
import os
import urllib.request

data_location = "../data/cars/mtcars_header.csv"
data_url = 'https://elephantscale-public.s3.amazonaws.com/data/cars/mtcars_header.csv'

if not os.path.exists(data_location):
    data_location = os.path.basename(data_location)
    if not os.path.exists(data_location):
        print("Downloading : ", data_url)
        urllib.request.urlretrieve(data_url, data_location)
print('data_location:', data_location)

In [None]:
import pandas as pd

dataset = pd.read_csv(data_location)
dataset.sample(10)
# dataset

## Step 2: Creating Vectors

Now that we have ourselves a dataframe, let's work on turning it into vectors.  We're going to vectorize 2 columns:

1. MPG
2. Number of cylinders.


In [None]:
## create a smaller dataframe with just 'model', 'mpg', and 'cyl'
## (.copy() so that adding a column later cannot trigger a pandas SettingWithCopyWarning)
dataset2 = dataset[['model', 'mpg', 'cyl']].copy()

## TODO : create our feature vector x by using 'mpg' and 'cyl' columns
## HINT : 'mpg', 'cyl'
x = dataset2[["???", "???"]]
x.sample(10)

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x['mpg'], x['cyl'], marker='o')
plt.show()

## Step 3: Running Kmeans

Now it's time to run kmeans on the resultant dataframe.  We don't know what value of k to use, so let's just start with k=2.  This means we will cluster into two groups.

We will instantiate a model, and then fit (train) it on the data.
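
To make the fitting step less of a black box, here is a minimal NumPy sketch of the basic k-means loop (often called Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its points. The function name `lloyd_kmeans` and the toy points are made up for illustration. scikit-learn's `KMeans` is more sophisticated (for example, smarter initialization and multiple restarts), so treat this as a sketch of the idea, not of the library's implementation.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=10, seed=0):
    """Toy k-means (hypothetical helper): random start, then assign/update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no points is left where it is)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    # labels come from the last assignment step
    return labels, centers

# Toy data (made-up values, not the cars data): two well-separated groups
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                       [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

labels, centers = lloyd_kmeans(toy_points, k=2)
print("labels :", labels)
print("centers:", centers)
```

The cluster ids themselves are arbitrary (which group is called 0 depends on the random start); what matters is which points end up sharing an id.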



In [None]:
from sklearn.cluster import KMeans

## TODO: Instantiate K-means model with value k=2
## Hint : n_clusters=2
kmeans = KMeans(n_clusters=???, random_state=0)

model = kmeans.fit(x)

model

In [None]:
## TODO: get WSSSE from the fitted model's inertia_ attribute
## Hint : inertia_
wssse = model.inertia_

print("k=2, wssse = ", wssse)

The WSSSE for this is not particularly good.  We will probably need to change k.
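
As a sanity check on what `inertia_` measures, the sketch below recomputes WSSSE by hand: for every point, take the squared distance to its assigned centroid, then sum. It uses a small made-up array, not the cars data, and the names `toy_points`, `toy_model`, and `manual_wssse` are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made-up values, not taken from mtcars)
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                       [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)

# Centroid assigned to each point, looked up via its cluster id
assigned_centers = toy_model.cluster_centers_[toy_model.labels_]

# WSSSE: sum of squared distances from each point to its own centroid
manual_wssse = float(((toy_points - assigned_centers) ** 2).sum())

print("manual wssse :", manual_wssse)
print("inertia_     :", toy_model.inertia_)
```

The two printed numbers should agree up to floating-point rounding.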

Let's take a look at the cluster assignments.  `model.labels_` holds one cluster id per row; we will add it to the dataframe as a new column, "cluster."


In [None]:
## cluster id assigned to each row by the fitted model
model.labels_

In [None]:
### TODO: Add new column to DF with cluster labels 
dataset2['cluster'] = model.labels_
dataset2

Notice what we have here.  We have two clusters. One is smaller, fuel-efficient cars like the Fiat and the Corolla (remember, we cluster on only two variables: MPG and cylinders).  The other is for basically all other cars.  We can probably get better results here with a different value of k.

In [None]:
dataset2.sort_values(by=['cluster', 'mpg'])
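
Besides sorting, a per-cluster summary can make the groups easier to compare. The sketch below uses a tiny hand-made dataframe with the same column names as `dataset2`; the rows and the name `toy` are hypothetical, not real mtcars values.

```python
import pandas as pd

# Hand-made rows (hypothetical values) with the same columns as dataset2
toy = pd.DataFrame({
    "model":   ["car_a", "car_b", "car_c", "car_d"],
    "mpg":     [30.0, 32.0, 15.0, 17.0],
    "cyl":     [4, 4, 8, 8],
    "cluster": [0, 0, 1, 1],
})

# One row per cluster: how many cars it holds and its average mpg / cyl
summary = toy.groupby("cluster").agg(
    n_cars=("model", "count"),
    mean_mpg=("mpg", "mean"),
    mean_cyl=("cyl", "mean"),
)
print(summary)
```

The same `groupby("cluster")` call should work on `dataset2` once its `cluster` column has been added.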

### Set K=3

In [None]:
from sklearn.cluster import KMeans

## TODO: Instantiate K-means model with value k=3
kmeans = KMeans(n_clusters=???, random_state=0)

model = kmeans.fit(x)

wssse = model.inertia_
print("k=3, wssse =", wssse)

In [None]:
dataset2['cluster'] = model.labels_
dataset2.sort_values(by=['cluster', 'mpg'])

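This is a better result for WSSSE than with k=2 (lower is better).  Keep in mind that WSSSE generally keeps falling as k grows, so a lower value alone does not tell us that k=3 is the best choice.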

## Step 4: Hyperparameter tuning

Let's try iterating and plotting over values of k, so we can practice using the elbow method.

**Q ==> Why is WSSSE almost zero when k=32?**

In [None]:
kvals = []
wssses = []

## TODO : Run k from 2 to 32
## Hint : range(start, stop) stops just before 'stop'
for k in range(2, ???):
    kmeans = KMeans(n_clusters=k, random_state=0)
    model = kmeans.fit(x)
    wssse = model.inertia_
    print("k={},  wssse={}".format(k, wssse))
    kvals.append(k)
    wssses.append(wssse)

In [None]:
df = pd.DataFrame({'k': kvals, 'wssse': wssses})
df

In [None]:
%matplotlib inline
from matplotlib import pyplot

## TODO: plot the values of k as the X axis versus the costs (WSSSE) as the y axis
## Hint  : x=kvals,  y=wssses
pyplot.plot(kvals, wssses)
pyplot.show()

Using the Elbow method, what would be a good value of k?
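
Reading the elbow off a plot is a judgment call. As a rough cross-check, one simple heuristic (an assumption here, not a standard part of k-means) is to pick the k where the curve bends most sharply, measured by the largest second difference of the costs. The numbers below are made up for illustration; they are not results from the cars data.

```python
import numpy as np

# Made-up (k, WSSSE) pairs for illustration only
ks = np.array([2, 3, 4, 5, 6, 7])
costs = np.array([400.0, 150.0, 100.0, 80.0, 70.0, 65.0])

# Second difference at each interior k: large when a big drop in cost
# is followed by a much smaller one, i.e. where the curve bends
curvature = costs[:-2] - 2 * costs[1:-1] + costs[2:]

# The first and last k have no second difference, so index into ks[1:-1]
elbow_k = int(ks[1:-1][curvature.argmax()])
print("elbow at k =", elbow_k)
```

To try it on your own run, replace `ks` and `costs` with `np.array(kvals)` and `np.array(wssses)`, and compare the suggestion against what you see in the plot.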
