# Clustering Algorithms
## Discovering Classifications
*Clustering* aims to uncover **groupings** in data, rather than to predict numeric values. 
It is an unsupervised process: no 'valid' or 'correct' response 
values are used to train the model. It simply finds clusters of similar data *a posteriori*, 
without predefined target groupings.

## References
1. scikit-learn documentation
    * [K-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
    * [Mini-batch k-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans)
    * [Affinity Propagation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation)

---
## K-means  <a id='KMC'></a>
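K-means groups data into clusters by choosing an initial set of *centroids* (one per cluster), assigning each point to its nearest centroid, and then moving each centroid to the mean of the points assigned to it. The assign-and-move steps repeat until the centroids stop changing. 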
An excellent visual explanation can be found [here](http://bigdata-madesimple.com/possibly-the-simplest-way-to-explain-k-means-algorithm/).
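
The assign-and-move loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `lloyd_kmeans`, the toy data, and the choice to initialise centroids from randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain assign-and-move iterations; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct points drawn from the data (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups (made-up data for illustration).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centers, toy_labels = lloyd_kmeans(toy, k=2)
print(toy_labels)
```

scikit-learn's `KMeans` adds refinements on top of this idea, such as smarter initialisation and multiple restarts, so treat the sketch only as a picture of the core loop.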

A very basic example follows, using the classic Iris dataset, 
which consists of numeric data on three species of *Iris* plants:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans, MiniBatchKMeans
from collections import OrderedDict

# load data and separate variables
iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.6)

# print(X_test.shape, y_test.shape)

# construct KMeans model and fit data
# you could do this with a larger number of clusters,
# but we know the Iris dataset has three prominent groups/species
# (KMeans is unsupervised: fit() ignores y, so only X_train is passed)
km = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X_train)

# predict() returns the numeric index of the cluster to which each test point belongs
labels = km.predict(X_test)

# max_index = np.max(labels)
# unique_labels = np.arange(0,max_index+1)

# pick your favorite colors!
colors = ["red", "blue", "green"]

# iterate over the test points and color each by its predicted cluster
for i in range(len(X_test)):
    col = colors[labels[i]]
    plt.plot(X_test[i, 0], X_test[i, 1], color=col, marker='o',
             markersize=5, label="Cluster %i" % labels[i])

# remove duplicates from legend and plot
# (separate names so the array of cluster labels is not overwritten)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
by_label = OrderedDict(zip(legend_labels, legend_handles))
plt.legend(by_label.values(), by_label.keys(), loc='upper right')
plt.show()

### Notes
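If you run this code several times, 
you'll notice that the specific boundaries of each cluster change, 
except for the points at the extremes of each cluster. 
Lloyd's algorithm (the basis of the scikit-learn implementation for `KMeans`) is sensitive to where the centroids start, 
but here `KMeans` is given a fixed `random_state`, so its initialization is repeatable for a given training set. 
The run-to-run variation therefore comes mainly from `train_test_split`, which draws a different random training sample each time 
unless it is given its own `random_state`; the small size of the dataset makes the fitted centroids sensitive to that sample.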
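
To separate the two sources of randomness mentioned above, the sketch below seeds both the split and the model and compares the fitted centroids. The helper name `fit_centers` and the particular seed values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo = load_iris().data[:, :2]

def fit_centers(split_seed):
    # Seeding the split fixes which rows land in the training set.
    X_tr, _ = train_test_split(X_demo, test_size=0.6, random_state=split_seed)
    # Seeding the model fixes the centroid initialisation.
    model = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X_tr)
    return model.cluster_centers_

same_a = fit_centers(split_seed=0)
same_b = fit_centers(split_seed=0)
other = fit_centers(split_seed=1)

# With both seeds fixed, repeated runs give matching centroids.
print(np.allclose(same_a, same_b))
# A different split seed changes the training sample, so the centroids
# will generally differ.
print(np.allclose(same_a, other))
```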

## Minibatch k-means
Mini-batch k-means works more or less the same as standard `KMeans` but scales better to large datasets. 
Instead of using the full training set at every iteration, it updates the centroids from small random subsets (mini-batches) of the data, 
whose size is set by the `batch_size` parameter.

The only changes needed to switch from `KMeans` to `MiniBatchKMeans` are as follows:

* change the import statement
```python
from sklearn.cluster import MiniBatchKMeans  # was: from sklearn.cluster import KMeans
```
* change the object/model constructor
```python
model = MiniBatchKMeans(n_clusters=3, batch_size=60)  # was: model = KMeans(n_clusters=3)
```

In [None]:
# fit() ignores y for clustering models, so only X_train is passed
mbkm = MiniBatchKMeans(n_clusters=3, random_state=2, batch_size=60).fit(X_train)

# predict() returns the numeric index of the cluster to which each test point belongs
labels = mbkm.predict(X_test)

# pick your favorite colors!
colors = ["red", "blue", "green"]

# iterate over the test points and color each by its predicted cluster
for i in range(len(X_test)):
    col = colors[labels[i]]
    plt.plot(X_test[i, 0], X_test[i, 1], color=col, marker='o',
             markersize=5, label="Cluster %i" % labels[i])

# remove duplicates from legend and plot
# (separate names so the array of cluster labels is not overwritten)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
by_label = OrderedDict(zip(legend_labels, legend_handles))
plt.legend(by_label.values(), by_label.keys(), loc='upper right')
plt.show()

Note that the results are more or less the same as in the full `KMeans` example. 
Real-world usage of the mini-batch algorithm should be restricted to larger datasets; 
this is purely to demonstrate their similarity for a given dataset.
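
For a rough sense of how the two models compare at a scale where mini-batching starts to make sense, the sketch below fits both to a synthetic dataset and prints their inertia (the within-cluster sum of squared distances). The dataset size, blob centres, and `batch_size` are arbitrary values chosen for illustration, not recommendations.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 points around three well-separated centres.
X_big, _ = make_blobs(n_samples=10_000, centers=[[0, 0], [6, 6], [-6, 6]],
                      cluster_std=1.0, random_state=0)

full = KMeans(n_clusters=3, n_init=3, random_state=2).fit(X_big)
mini = MiniBatchKMeans(n_clusters=3, n_init=3, batch_size=256,
                       random_state=2).fit(X_big)

# inertia_ is computed over the whole dataset for both models; lower is tighter.
print("KMeans inertia:         ", full.inertia_)
print("MiniBatchKMeans inertia:", mini.inertia_)
```

On easy data like this the two values should be close; the mini-batch version trades a little accuracy for less work per iteration.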

---
# Alternatives to `KMeans`

## `AffinityPropagation`

`sklearn.cluster.AffinityPropagation` is a clustering model notable for its ability to determine the number of clusters within a dataset on its own, unlike `KMeans`, which takes the number of clusters as a parameter (as noted above). The number it finds is influenced by its `preference` parameter rather than set directly. 
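
As a small sketch of that difference, the model below is constructed without any cluster count, and the number of clusters is read back from the fitted object. The synthetic data and the `preference=-50` value follow the scikit-learn example for this estimator; the variable names are assumptions for illustration.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data generated around three centres.
X_ap, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                     cluster_std=0.5, random_state=0)

# No n_clusters argument: the model picks its own exemplars.
ap = AffinityPropagation(preference=-50, random_state=0).fit(X_ap)

# cluster_centers_indices_ holds the row indices of the chosen exemplars,
# so its length is the number of clusters found.
n_found = len(ap.cluster_centers_indices_)
exemplars = X_ap[ap.cluster_centers_indices_]
print("clusters found:", n_found)
```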

An example of how to use `AffinityPropagation` follows on the same Iris dataset used earlier:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from collections import OrderedDict

# load data and separate variables
iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)

# print(X_test.shape, y_test.shape)

# construct AffinityPropagation model and fit data
# no cluster count is passed: the model chooses its own exemplars
# (fit() ignores y, so only X_train is passed)
cls = AffinityPropagation(max_iter=600).fit(X_train)

# predict() returns the numeric index of the cluster to which each test point belongs
labels = cls.predict(X_test)

# max_index = np.max(labels)
# unique_labels = np.arange(0,max_index+1)

# pick your favorite colors!
colors = ["red", "blue", "green", "yellow", "orange", 
          "black", "purple", "grey", "lightblue", "lightgreen"]

# iterate over the test points and color each by its predicted cluster
# (the modulo wraps around if more clusters are found than colors listed)
for i in range(len(X_test)):
    col = colors[labels[i] % len(colors)]
    plt.plot(X_test[i, 0], X_test[i, 1], color=col, marker='o',
             markersize=5, label="Cluster %i" % labels[i])

# remove duplicates from legend and plot
# (separate names so the array of cluster labels is not overwritten)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
by_label = OrderedDict(zip(legend_labels, legend_handles))
plt.legend(by_label.values(), by_label.keys(), loc='best')
plt.show()

This method can be sensitive to its `preference` and `damping` settings and is computationally expensive (it works on the full matrix of pairwise similarities), 
but it is useful for smaller datasets where the number of clusters is not known in advance.
Another demonstration (based on the one in the scikit-learn [documentation](http://scikit-learn.org/stable/auto_examples/cluster/plot_affinity_propagation.html#sphx-glr-auto-examples-cluster-plot-affinity-propagation-py)) follows:

In [None]:
from itertools import cycle

# the generated labels are not needed for clustering, so they are discarded
X, _ = make_blobs(n_samples=300, centers=[[1,1],[-1,-1],[-1,1],[1,-1]],
                  cluster_std=0.5, random_state=2)
af = AffinityPropagation(preference=-50).fit(X)
labels = af.labels_
indices = af.cluster_centers_indices_

n_clusters_ = len(indices)

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    members = labels == k
    center = X[indices[k]]
    plt.plot(X[members, 0], X[members, 1], col + '.')
    plt.plot(center[0], center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
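    # draw a line from the exemplar to each member of its cluster
    for x in X[members]:
        plt.plot([center[0], x[0]], [center[1], x[1]], col)

plt.title("Estimated number of clusters: %d" % n_clusters_)
plt.show()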