# MATH 3375 Examples Notebook #21

# K-Means Clustering

Previous models have been _**supervised**_ models, meaning that we knew in advance what the 'response' should be; we used that information to train models to recognize that known response.

Another type of model is _**unsupervised**_. Here there is no known response to guide the model; instead, we analyze the data to find similarities and differences, and use those to group the observations into clusters as appropriately as possible.

The technique shown below is a very common unsupervised model, called K-Means Clustering.  


In [None]:
#Look at data set
head(iris)

In [None]:
set.seed(3375)
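# Cluster on the first two columns (Sepal.Length, Sepal.Width) with k=2;
# nstart=10 tries 10 random starting configurations and keeps the best one
km_2_2 <- kmeans(iris[,1:2],centers=2,nstart=10)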
km_2_2

In [None]:
km_2_2$cluster

In [None]:
plot(Sepal.Length ~ Sepal.Width, col=km_2_2$cluster, data=iris)

In [None]:
# Min-max scaling: rescale each feature to the range [0, 1]
s.length <- (iris$Sepal.Length - min(iris$Sepal.Length))/(max(iris$Sepal.Length) - min(iris$Sepal.Length))
s.width <- (iris$Sepal.Width - min(iris$Sepal.Width))/(max(iris$Sepal.Width)- min(iris$Sepal.Width))
p.length <- (iris$Petal.Length - min(iris$Petal.Length))/(max(iris$Petal.Length) - min(iris$Petal.Length))
p.width <- (iris$Petal.Width - min(iris$Petal.Width))/(max(iris$Petal.Width)- min(iris$Petal.Width))

iris_scaled <- data.frame(s.length,s.width,p.length,p.width)

summary(iris_scaled)
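
The four rescaling lines above all apply the same min-max formula, (x - min) / (max - min). As a minimal sketch, the same result can be produced with a small helper function. The names `min_max` and `iris_scaled_alt` are introduced here for illustration only and are not part of the original notebook.

```r
# Hypothetical helper (an assumption, not from the notebook):
# min-max rescaling of a numeric vector to the range [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the helper to the four numeric columns of iris
iris_scaled_alt <- as.data.frame(lapply(iris[, 1:4], min_max))

# Use the same column names as the notebook's iris_scaled
names(iris_scaled_alt) <- c("s.length", "s.width", "p.length", "p.width")

summary(iris_scaled_alt)
```

Every column in the summary should have a minimum of 0 and a maximum of 1, which puts all four features on a comparable scale before distances are computed.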

In [None]:
# Same clustering as before (2 sepal features, k=2), now on the scaled data
kms_2_2 <- kmeans(iris_scaled[,1:2],centers=2,nstart=10)
kms_2_2

In [None]:
kms_2_2$cluster

In [None]:
plot(s.length ~ s.width, col=kms_2_2$cluster, data=iris_scaled,
    main="K-Means Clusters, 2 dimensions, k=2")

In [None]:
# Same clusters (found on the scaled data), plotted on the original axes
plot(Sepal.Length ~ Sepal.Width, col=kms_2_2$cluster, data=iris,
    main="K-Means Clusters, 2 dimensions, k=2")

In [None]:
kms_2_2

In [None]:
kms_2_2$totss

In [None]:
kms_2_2$betweenss

In [None]:
kms_2_2$tot.withinss
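
The three quantities printed above are related: the total sum of squares splits into a between-cluster part and a within-cluster part, so `totss` equals `betweenss + tot.withinss`. The sketch below checks this and computes the ratio that the `kmeans` printout labels between_SS / total_SS. It refits the model so that it can run on its own; the seed and the names `min_max`, `sepal_scaled`, and `km_check` are assumptions made for illustration.

```r
# Rebuild the two scaled sepal features so this cell is self-contained
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
sepal_scaled <- data.frame(s.length = min_max(iris$Sepal.Length),
                           s.width  = min_max(iris$Sepal.Width))

set.seed(3375)  # assumed seed, reused from the first kmeans cell
km_check <- kmeans(sepal_scaled, centers = 2, nstart = 10)

# Total SS = between-cluster SS + total within-cluster SS
all.equal(km_check$totss, km_check$betweenss + km_check$tot.withinss)

# tot.withinss is the sum of the per-cluster within SS values
all.equal(km_check$tot.withinss, sum(km_check$withinss))

# Share of the total variation accounted for by the clustering
km_check$betweenss / km_check$totss
```

A ratio closer to 1 means the clusters account for more of the total variation, which is the same as saying the total within-cluster SS is small relative to the total SS.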

## Using More Dimensions

We will run the algorithm again with all 4 features as dimensions. For purposes of _**visualizing**_ the clusters, we will use 2 features as plotting dimensions. Keep in mind that the algorithm is using more than just these 2 features to create the clusters.

In [None]:
# Cluster on all four scaled features, still with k=2
kms_4_2 <- kmeans(iris_scaled[,1:4],centers=2,nstart=10)
kms_4_2

In [None]:
plot(Sepal.Length ~ Sepal.Width, col=kms_4_2$cluster, data=iris,
    main="K-Means Clusters, 4 dimensions, k=2")

In [None]:
plot(Petal.Length ~ Sepal.Width, col=kms_4_2$cluster, data=iris,
    main="K-Means Clusters, 4 dimensions, k=2")

In [None]:
plot(Petal.Length ~ Petal.Width, col=kms_4_2$cluster, data=iris,
    main="K-Means Clusters, 4 dimensions, k=2")
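
Since these clusters were formed in four dimensions, any single two-feature plot shows only one projection of them. One way to view all six feature pairs at once is a scatterplot matrix. The following is a sketch using base R's `pairs` function; it refits the model so the cell stands alone, and the seed and the names `min_max`, `scaled4`, and `km4` are assumptions made for illustration.

```r
# Rebuild the four scaled features so this cell is self-contained
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled4 <- as.data.frame(lapply(iris[, 1:4], min_max))

set.seed(3375)  # assumed seed, reused from the first kmeans cell
km4 <- kmeans(scaled4, centers = 2, nstart = 10)

# Scatterplot matrix of the original features, colored by 4-dimensional cluster
pairs(iris[, 1:4], col = km4$cluster,
      main = "K-Means Clusters, 4 dimensions, k=2")
```

Each panel uses the same cluster assignment, so a pair of features in which the two colors overlap is simply a projection where the separation found in four dimensions is harder to see.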

## Choosing Best Value for K

How should we choose the best value for K? In other words, how do we decide the optimal number of clusters? The following procedure is an accepted practice for choosing K.

* Run the clustering algorithm on the data set with several values of k
* Keep track of the **_total within-cluster SS_** for each k
* Plot the results with k on the x-axis and the total within-cluster SS on the y-axis

The plot is called an 'elbow' graph. Beyond a certain point, the SS value begins to drop much more slowly, meaning that the value of adding more clusters is minimal. This "optimal" point on the graph often appears to be the 'crook' of the elbow. 

We carry out this process below. Notice that we start with k=1 so we can see the "improvement" from a single group to one divided into 2 clusters.

In [None]:
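# Numeric vector to hold the total within-cluster SS for k = 1, ..., 6
ss <- numeric(6)
for (k in 1:6) {
    # Fit k-means on all four scaled features with k clusters
    km <- kmeans(iris_scaled[,1:4],centers=k,nstart=10)
    ss[k] <- km$tot.withinss
}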

k <- 1:6
plot(ss~k,pch=19,type="b",main="Elbow Graph for K-Means",
     ylab="Total Within-Group Sum of Squares")

### Reading the Elbow Graph

Based on the above graph, k=2 has the most obvious 'crook', but there is still a slight crook at k=3. Notice, however, that the SS does drop substantially from k=2 to k=3, whereas it does not drop nearly as much from k=3 to k=4. In other words, k=3 is the point at which the plot really begins to level off. For this reason, it would be most reasonable to select k=3.
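
The visual reading of the elbow can be backed up numerically by looking at how much the total within-cluster SS falls each time k increases by one. The following is a rough sketch, not a formal selection rule. It recomputes the scaled data and the `ss` vector so that it can run on its own; the seed and the names `min_max`, `scaled4`, and `drop_in_ss` are assumptions made for illustration.

```r
# Rebuild the four scaled features so this cell is self-contained
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled4 <- as.data.frame(lapply(iris[, 1:4], min_max))

set.seed(3375)  # assumed seed, reused from the first kmeans cell
ss <- numeric(6)
for (k in 1:6) {
    ss[k] <- kmeans(scaled4, centers = k, nstart = 10)$tot.withinss
}

# Drop in total within-cluster SS when moving from k-1 to k clusters
drop_in_ss <- -diff(ss)
names(drop_in_ss) <- paste0("k=", 2:6)
drop_in_ss

# The same drops expressed as a fraction of the SS at k=1
round(drop_in_ss / ss[1], 3)
```

If the drop labeled k=3 is still noticeably larger than the drop labeled k=4, that agrees with reading k=3 as the point where the curve begins to level off.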


## Review More on Your Own

Remember that the **kmeans** function in R has documentation that you can review to learn more. Open the help page by running the next cell.

In [None]:
?kmeans