## K-Means Clustering

Clustering is a technique for finding groups of similar points, called clusters, in a dataset. It is considered unsupervised learning because the data carries no prescribed labels: no class values are given that denote an a priori grouping of the data instances. Today we'll try running one of the most famous clustering algorithms, K-means, on a test dataset.

To run k-means, you first choose the number of clusters, k, and randomly initialize k points called the cluster centroids.

There are two steps to K-means: cluster assignment and centroid update. In the assignment step, the algorithm goes through each data point and assigns it to the cluster with the closest centroid. In the update step, each centroid moves to the average of the points in its cluster. The two steps repeat until the cluster assignments stop changing (or until some other stopping condition is met).
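The assignment and update steps can be sketched in plain NumPy. This is a minimal illustration, not the sklearn implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all made up for this example, and it assumes Euclidean distance with the initial centroids drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Toy k-means (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment: index of the nearest centroid for each point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Centroid update: mean of the points assigned to each cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs of made-up data.
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans_sketch(toy, k=2)
print(centroids)
print(labels)
```

Which blob ends up as cluster 0 and which as cluster 1 depends on the random initialization, so only the grouping is meaningful, not the numbering.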

Let's try out the sklearn implementation of k-means. First we'll import the libraries and the dataset we'll be looking at today.

In [1]:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
 
import pandas as pd
import numpy as np
 
%matplotlib inline

# We'll use the iris dataset, a small real world dataset that comes with sklearn
iris = datasets.load_iris()

# We'll store these input values as a Pandas Dataframe
x = pd.DataFrame(iris.data)

Now get a feel for iris by printing things out. Notice that the array in iris.data has four columns.
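As one possible way of printing things out, the sketch below reloads the dataset so it runs on its own and shows the array shape, the first few rows, and per-column summary statistics. The choice of what to print is only a suggestion.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
x = pd.DataFrame(iris.data)

print(iris.data.shape)   # (number of rows, number of columns)
print(x.head())          # first five rows, with the default integer column names
print(x.describe())      # per-column summary statistics
```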

In [5]:
# TODO: Using the Pandas Dataframe we made in the previous step, set the column names to their proper values.
x.columns = "fill this in"

RangeIndex(start=0, stop=4, step=1)

In [None]:
# TODO: now that our inputs are all set, we'll store the target values as a Pandas Dataframe too
y = pd.DataFrame("use the previous cell as a model to fill this in")
y.columns = ["fill this in"]

Note that while we have class labels for the dataset, the key distinction that makes clustering unsupervised is that we aren't training on these labels. Instead of trying to construct a function from the feature space to the label space, we are trying to find statistical structure within the feature space. The labels are used in this example only to compare the clusters we find against the actual classes, to show the power of the clustering method.

Now we can plot our data.

In [None]:
# First, set the plot's size
plt.figure(figsize=(14,7))

# Make a colormap
colormap = np.array(['red', 'lime', 'black'])
 
# TODO: now let's plot the Sepal values
plt.subplot(1, 2, 1)
plt.scatter("fill this in to plot the proper inputs for sepal", c=colormap["fill this in with the output"], s=40)
plt.title('Sepal')

# TODO: do the same thing for Petal values
plt.subplot(1, 2, 2)
plt.scatter("fill this in with the proper inputs for Petal", c=colormap["fill this in with the output"], s=40)
plt.title('Petal')

Now that we've visualized the data, let's try clustering it.

In [None]:
# TODO: fill in parameters for sklearn kmeans function
model = KMeans(n_clusters="based on your prior plots, what number of clusters do you think would be appropriate?")
model.fit(x)

Now we can view the results of k-means. The labels_ attribute holds what the model decided for each point: a number 0, 1, or 2, depending on which cluster the point falls in.

In [None]:
model.labels_
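Besides labels_, a fitted KMeans model exposes a few other attributes worth inspecting. The sketch below refits on iris so that it is self-contained. It assumes n_clusters=3 to match the labels 0, 1, 2 described above, and it fixes n_init and random_state only to make the run repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()

# Assumed settings for this illustration, not values prescribed by the exercise.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(iris.data)

print(np.bincount(model.labels_))     # number of points in each cluster
print(model.cluster_centers_.shape)   # one centroid per cluster, one column per feature
print(model.inertia_)                 # sum of squared distances to the nearest centroid
```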

Let's plot the real classes against the classes predicted by our model.

In [None]:
# Here we are plotting the Petal Length and Width
plt.figure(figsize=(14,7))
 
# Create a colormap
colormap = np.array(['red', 'lime', 'black'])
 
# TODO: Plot Original
plt.subplot(1, 2, 1)
plt.scatter("fill this in", s=40)
plt.title('Actual')
 
# TODO: Plot Models
plt.subplot(1, 2, 2)
plt.scatter("fill this in", s=40)
plt.title('K-Means')

In [None]:
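# K-means numbers its clusters arbitrarily, so remap the cluster ids to the class ids.
# The mapping [1, 0, 2] fits one particular run; adjust it to match your own plots.
predY = np.choose(model.labels_, [1, 0, 2]).astype(np.int64)
print(model.labels_)
print(predY)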

In [None]:
# Now let's see how well we did
sm.accuracy_score(y, predY)
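The np.choose remapping above hard-codes which cluster id corresponds to which class, and that correspondence can change from run to run. As a hedged alternative, not part of the original exercise, the mapping can be derived from the data: build a table of class-versus-cluster counts and pick the one-to-one pairing with the most matches. The sketch assumes SciPy is installed and uses n_clusters=3 with a fixed random_state purely for repeatability.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data).labels_

# Rows are true classes, columns are cluster ids.
counts = confusion_matrix(iris.target, labels)
# Choose the one-to-one class/cluster pairing that maximizes matched points.
class_ids, cluster_ids = linear_sum_assignment(counts, maximize=True)

# Lookup table: cluster id -> class id.
cluster_to_class = np.empty(3, dtype=np.int64)
cluster_to_class[cluster_ids] = class_ids
pred = cluster_to_class[labels]

print(cluster_to_class)
print(accuracy_score(iris.target, pred))
```

The true labels are used here only to score the clustering after the fact; the model itself never sees them.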