<a href="https://colab.research.google.com/github/cagBRT/Clustering-Intro/blob/master/C4_Clustering_Comparisons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Clustering: Comparisons**

An overview of clustering techniques:
>Affinity Propagation<br>
Agglomerative Clustering<br>
BIRCH<br>
DBSCAN<br>
K-Means<br>
Mini-batch K-Means<br>
Mean Shift<br>
OPTICS<br>
Spectral Clustering<br>
Gaussian Mixture Model<br>

Each algorithm offers a different approach to the challenge of discovering natural groups in data.

There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.
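
One way to set up such a controlled experiment is sketched below. It assumes ground-truth labels are available (true here only because the data is synthetic) and uses the adjusted Rand index as the comparison metric. The three candidate models, the names `X_demo`, `y_demo`, `candidates`, and `scores`, and the fixed seeds are illustrative choices, not part of the original notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# synthetic two-class data, generated the same way as later in this notebook
X_demo, y_demo = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                     n_redundant=0, n_clusters_per_class=1, random_state=4)

# candidate models, each asked for two clusters (illustrative selection)
candidates = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
    "gaussian mixture": GaussianMixture(n_components=2, random_state=0),
}

scores = {}
for name, candidate in candidates.items():
    labels = candidate.fit_predict(X_demo)
    # the adjusted Rand index is 1.0 for a perfect match and near 0.0 for
    # random labels; it does not depend on how the cluster ids are numbered
    scores[name] = adjusted_rand_score(y_demo, labels)

# report the candidates from best to worst agreement with the true classes
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: ARI = {score:.3f}")
```

On real data without labels, an internal metric such as the silhouette score would have to stand in for the adjusted Rand index.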

In [0]:
#!pip install scikit-learn
# check scikit-learn version
import sklearn
print(sklearn.__version__)

In [0]:
from matplotlib import pyplot
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification

Plot functions

In [0]:
def plot_dataset(y, X):
  # scatter the samples of each true class (0 and 1) in its own color
  for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])

In [0]:
def plot_function(clusters, yhat, X):
  # scatter the samples of each predicted cluster in its own color
  for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])

**Create a Synthetic Dataset**<br>
To generate a random n-class classification problem, use the make_classification function.<br>

The dataset has two distinct clusters.<br>

Can the clustering algorithms identify these two clusters?

In [0]:
# synthetic classification dataset
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, 
                           n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
plot_dataset(y, X)
# show the plot
pyplot.show()

**K-Means**<br>
K-Means Clustering may be the most widely known clustering algorithm. It assigns examples to clusters in an effort to minimize the variance within each cluster.

The algorithm is described in "Some methods for classification and analysis of multivariate observations", 1967.
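
The quantity being minimized can be checked by hand. The sketch below recomputes the within-cluster sum of squared distances from the fitted centroids and compares it with the `inertia_` attribute that scikit-learn reports. The names `X_demo`, `kmeans`, `sq_dist`, and `wcss` and the fixed seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# squared distance from every sample to every centroid, shape (n_samples, 2)
sq_dist = ((X_demo[:, None, :] - kmeans.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# each sample belongs to its nearest centroid
nearest = sq_dist.argmin(axis=1)

# the k-means objective: total squared distance to the assigned centroid
wcss = sq_dist.min(axis=1).sum()

print("recomputed objective:", wcss)
print("model.inertia_:      ", kmeans.inertia_)
```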

In [0]:
# k-means clustering
from sklearn.cluster import KMeans
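# define the model
# signature from an older scikit-learn release, for reference; the
# precompute_distances and n_jobs parameters were removed in version 1.0
#KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, 
#precompute_distances='auto', verbose=0, random_state=None, copy_x=True, 
#n_jobs=None, algorithm='auto')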

model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)

# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

#In this case, a reasonable grouping is found, although the unequal 
#variance in each dimension makes the method less suited to this dataset.

**Mini-Batch K-Means**<br>
Mini-Batch K-Means is a modified version of k-means that makes updates to the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise.

… we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent.

— Web-Scale K-Means Clustering, 2010.
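
The mini-batch update can be made explicit with `partial_fit`, which updates the centroids from one batch at a time. This is a minimal sketch: the batch size of 100, the single pass over the data, and the names `X_demo`, `mbk`, and `labels` are illustrative assumptions, and calling `fit` as in the cell below handles batching internally.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

mbk = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)

# feed the data in chunks; each call nudges the centroids using only that chunk
batch_size = 100
for start in range(0, len(X_demo), batch_size):
    mbk.partial_fit(X_demo[start:start + batch_size])

# assign every sample to its nearest centroid once all batches are processed
labels = mbk.predict(X_demo)
print(mbk.cluster_centers_)
```

The same loop works when the batches arrive from a stream or from files too large to load at once.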

In [0]:
# mini-batch k-means clustering
from sklearn.cluster import MiniBatchKMeans
# define the model
#MiniBatchKMeans(n_clusters=8, init='k-means++', max_iter=100, batch_size=100,
#verbose=0, compute_labels=True, random_state=None, tol=0.0, 
#max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01)

model = MiniBatchKMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

#In this case, a result equivalent to the standard k-means algorithm is found.

**Gaussian Mixture Model**<br>
A Gaussian mixture model summarizes a multivariate probability density function with a mixture of Gaussian probability distributions, as its name suggests.

For more on the model, see:

[Mixture model, Wikipedia.](https://en.wikipedia.org/wiki/Mixture_model)
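
Because the model is a probability density, each sample receives a probability of belonging to each component rather than only a hard label. The sketch below reads these soft assignments with `predict_proba`. The 0.9 cut-off for calling a sample "uncertain" and the names `X_demo`, `gmm`, `proba`, `hard`, and `uncertain` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X_demo, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

# one row per sample, one column per Gaussian component; each row sums to 1
proba = gmm.predict_proba(X_demo)

# the hard cluster label is the most probable component
hard = proba.argmax(axis=1)

# samples near the boundary between the two Gaussians have no dominant component
uncertain = int((proba.max(axis=1) < 0.9).sum())
print("samples with max membership below 0.9:", uncertain)
```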

In [0]:
# gaussian mixture clustering
from sklearn.mixture import GaussianMixture
# define the model
#GaussianMixture(n_components=1, covariance_type='full', tol=0.001, 
#reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', 
#weights_init=None, means_init=None, precisions_init=None, random_state=None, 
#warm_start=False, verbose=0, verbose_interval=10)

model = GaussianMixture(n_components=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

#In this case, we can see that the clusters were identified perfectly. 
#This is not surprising given that the dataset was generated as a mixture of Gaussians.



**BIRCH**<br>
BIRCH Clustering (BIRCH is short for Balanced Iterative Reducing and Clustering using Hierarchies) involves constructing a tree structure from which cluster centroids are extracted.

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i. e., available memory and time constraints).

— BIRCH: An efficient data clustering method for large databases, 1996.

It is implemented via the Birch class. The main hyperparameters to tune are “threshold” and “n_clusters”, the latter of which provides an estimate of the number of clusters.
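
The effect of “threshold” can be seen by counting the subclusters stored in the leaves of the tree. The sketch below sets `n_clusters=None` so that no final global clustering step is applied and the subclusters are returned as they are. The two threshold values and the names `X_demo`, `birch`, and `counts` are illustrative assumptions.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

counts = {}
for threshold in (0.1, 1.5):
    # n_clusters=None skips the final clustering step and keeps the raw subclusters
    birch = Birch(threshold=threshold, n_clusters=None).fit(X_demo)
    # a smaller threshold limits the radius of each subcluster, so more are needed
    counts[threshold] = len(birch.subcluster_centers_)

for threshold, n_subclusters in counts.items():
    print(f"threshold={threshold}: {n_subclusters} subclusters")
```

When `n_clusters` is an integer, these subclusters are merged further until that many clusters remain.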

In [0]:
# birch clustering
from sklearn.cluster import Birch
# define the model
#Birch(threshold=0.5, branching_factor=50, n_clusters=3, compute_labels=True, 
#copy=True)
model = Birch(threshold=0.01, n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

**Affinity Propagation**<br>
Affinity Propagation involves finding a set of exemplars that best summarize the data. The number of clusters is not specified in advance.

We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges.

— Clustering by Passing Messages Between Data Points, 2007.
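
The exemplars mentioned in the quote are actual data points, and their indices are exposed by the fitted model. The sketch below uses a small, well-separated blob dataset rather than this notebook's dataset, on the assumption that it converges more readily. The `preference=-50` setting, the blob layout, and the names `X_blobs`, `ap`, and `exemplar_idx` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# three compact blobs (illustrative data, not the notebook's dataset)
X_blobs, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                        cluster_std=0.5, random_state=0)

# a lower preference makes fewer points willing to act as exemplars
ap = AffinityPropagation(preference=-50, random_state=0).fit(X_blobs)

# indices of the samples chosen as exemplars; one cluster per exemplar
exemplar_idx = ap.cluster_centers_indices_
print("number of clusters found:", len(exemplar_idx))
print("exemplar coordinates:")
print(X_blobs[exemplar_idx])
```

The number of clusters is an output of the message passing; it is steered indirectly through `preference` and `damping`.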

In [0]:
# affinity propagation clustering
from sklearn.cluster import AffinityPropagation

# main hyperparameter to tune: damping, from 0.5 up to (but not including) 1
#AffinityPropagation(damping=0.5, max_iter=200, convergence_iter=15, 
#copy=True, preference=None, affinity='euclidean', verbose=False)
model = AffinityPropagation(damping=0.9)

# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

#Does this algorithm find the two clusters of the dataset?

**Agglomerative Clustering**<br>
Agglomerative clustering involves merging examples until the desired number of clusters is achieved.

It is a part of a broader class of hierarchical clustering methods and you can learn more here:

[Hierarchical clustering, Wikipedia](https://en.wikipedia.org/wiki/Hierarchical_clustering).
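
The merge sequence itself can be inspected through the `children_` attribute of a fitted model. The sketch below builds the full merge tree on a smaller sample so the output stays readable. The sample size of 200 and the names `X_demo`, `agg`, and `n_samples` are illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=200, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)
n_samples = len(X_demo)

# compute_full_tree=True records every merge, not only those needed for 2 clusters
agg = AgglomerativeClustering(n_clusters=2, compute_full_tree=True).fit(X_demo)

# children_[i] lists the two nodes merged at step i:
#   ids below n_samples are original samples,
#   an id k >= n_samples is the cluster created at step k - n_samples
print("number of merges:", len(agg.children_))
print("first five merges:")
print(agg.children_[:5])
```

Merging n samples down to a single cluster takes n - 1 steps; the labels for `n_clusters=2` correspond to stopping one merge before the end.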

In [0]:
# agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
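# define the model
# signature from an older scikit-learn release, for reference; the
# affinity parameter was renamed to metric in version 1.2
#AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=None, 
#connectivity=None, compute_full_tree='auto', linkage='ward', 
#distance_threshold=None)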
model = AgglomerativeClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

#In this case, a reasonable grouping is found.

**DBSCAN**<br>
DBSCAN Clustering (where DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise) involves finding high-density areas in the domain and expanding those areas of the feature space around them as clusters.

… we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it

— A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.
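
In scikit-learn's implementation, samples that do not fall in any high-density area are labeled -1 (noise) rather than forced into a cluster. The sketch below separates noise from real clusters when counting. The `eps` and `min_samples` values match the cell below; the names `X_demo`, `db`, `n_noise`, and `n_clusters` are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

db = DBSCAN(eps=0.3, min_samples=9).fit(X_demo)
labels = db.labels_

# label -1 marks noise: samples not reachable from any core sample
n_noise = int((labels == -1).sum())

# count the real clusters, excluding the noise label if it is present
n_clusters = len(set(labels.tolist())) - (1 if n_noise > 0 else 0)

# core samples have at least min_samples neighbors within eps
n_core = len(db.core_sample_indices_)

print("clusters:", n_clusters, "| noise samples:", n_noise, "| core samples:", n_core)
```

Plotting helpers that loop over `unique(yhat)` will draw the noise samples as one more group, so the number of colors in a plot can exceed the number of clusters by one.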

In [0]:
# dbscan clustering
from sklearn.cluster import DBSCAN
# define the model
#DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, 
#algorithm='auto', leaf_size=30, p=None, n_jobs=None)
model = DBSCAN(eps=0.3, min_samples=9)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

#In this case, a reasonable grouping is found, although more tuning is required.

**Mean Shift**<br>
Mean shift clustering involves finding and adapting centroids based on the density of examples in the feature space.

We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and thus its utility in detecting the modes of the density.

— Mean Shift: A robust approach toward feature space analysis, 2002.
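
The density estimate depends on a kernel bandwidth, which largely determines how many modes (and therefore clusters) are found. The sketch below estimates a bandwidth with scikit-learn's `estimate_bandwidth` helper and passes it to the model explicitly. The `quantile=0.2` value, the sample size of 500, and the names `X_demo`, `bandwidth`, and `ms` are illustrative assumptions.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=500, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

# derive a bandwidth from nearest-neighbor distances; a smaller quantile
# gives a smaller bandwidth and usually more modes
bandwidth = estimate_bandwidth(X_demo, quantile=0.2, random_state=0)

# bin_seeding=True starts the search from a coarse grid of occupied bins
# instead of from every sample, which speeds up fitting
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X_demo)

print("estimated bandwidth:", float(bandwidth))
print("modes found:", len(ms.cluster_centers_))
```

When `bandwidth` is left unset, as in the cell below, MeanShift estimates it internally.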

In [0]:
# mean shift clustering
from sklearn.cluster import MeanShift
# define the model
model = MeanShift()
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

**OPTICS**<br>
OPTICS clustering (where OPTICS is short for Ordering Points To Identify the Clustering Structure) is a modified version of DBSCAN described above.

We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings.

— OPTICS: ordering points to identify the clustering structure, 1999.
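
The point made in the quote, that one ordering encodes the clusterings for a range of parameter settings, can be sketched with scikit-learn's `cluster_optics_dbscan` helper. It extracts DBSCAN-style labels for different `eps` values from a single fitted model. The two `eps` values, the sample size of 500, and the names `X_demo`, `optics`, and `results` are illustrative assumptions.

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_classification

X_demo, _ = make_classification(n_samples=500, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

# fit once: this computes the cluster ordering and the reachability distances
optics = OPTICS(min_samples=10).fit(X_demo)

results = {}
for eps in (0.3, 0.8):
    # reuse the stored ordering to get DBSCAN-like labels without refitting
    labels = cluster_optics_dbscan(reachability=optics.reachability_,
                                   core_distances=optics.core_distances_,
                                   ordering=optics.ordering_,
                                   eps=eps)
    n_noise = int((labels == -1).sum())
    n_clusters = len(set(labels.tolist())) - (1 if n_noise > 0 else 0)
    results[eps] = (n_clusters, n_noise)

for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise samples")
```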



In [0]:
# optics clustering
from sklearn.cluster import OPTICS
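# define the model
# note: eps only takes effect when cluster_method='dbscan'; with the
# default cluster_method='xi' it is not used to extract the clusters
model = OPTICS(eps=0.8, min_samples=10)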
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()


**Spectral Clustering**<br>
Spectral Clustering is a general class of clustering methods, drawn from linear algebra.

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. Here, one uses the top eigenvectors of a matrix derived from the distance between points.

— On Spectral Clustering: Analysis and an algorithm, 2002.
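
The steps described in the quote can be sketched directly with NumPy. The version below loosely follows the procedure of the cited paper: build an affinity matrix, normalize it, take the top eigenvectors, and run k-means on the rows. It is not a reproduction of scikit-learn's SpectralClustering, whose internals differ; the RBF affinity with `gamma=1.0`, the sample size of 300, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

X_demo, _ = make_classification(n_samples=300, n_features=2, n_informative=2,
                                n_redundant=0, n_clusters_per_class=1, random_state=4)

# 1. affinity matrix: similarity decays with squared distance; no self-similarity
A = rbf_kernel(X_demo, gamma=1.0)
np.fill_diagonal(A, 0.0)

# 2. symmetric normalization: D^(-1/2) A D^(-1/2), where D holds the row sums
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# 3. top-k eigenvectors (eigh returns eigenvalues in ascending order)
k = 2
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, -k:]

# 4. rescale each row to unit length, then cluster the rows with k-means
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

print("largest eigenvalue:", eigvals[-1])
print("cluster sizes:", np.bincount(labels))
```

The clustering happens in the space of eigenvector coordinates, not in the original feature space, which is what lets spectral methods separate groups that are not compact around a centroid.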

In [0]:
# spectral clustering
from sklearn.cluster import SpectralClustering
# define the model
model = SpectralClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# left panel: samples colored by assigned cluster
pyplot.subplot(1,2,1)
plot_function(clusters, yhat,X)
# right panel: samples colored by true class
pyplot.subplot(1,2,2)
plot_dataset(y, X)
pyplot.show()

Reference: [Clustering algorithms with Python, Machine Learning Mastery](https://machinelearningmastery.com/clustering-algorithms-with-python/)
