# 5. Clustering

We talk about clustering when the groups of similar data points
do not have a predefined name or label. When the labels do exist
we talk about classification, which is covered in Chapter 6.
Clustering analysis is an unsupervised machine learning task,
whereas classification is a supervised one.

## 5.1 Clustering

A cluster can be thought of as a group of similar
data points, and therefore the concept of similarity is at the
heart of the definition of a cluster. The greater the similarity among the points in a group, the better the clustering and thus the better
the results. The goal of clustering is to
provide us with a better understanding of our dataset by
dividing the data points into relevant groups. Clustering provides us with a
layer of abstraction from individual data points to collections of them that share similar characteristics. It is
important to clarify that this improved understanding comes from
extracting information from the inherent structure of the
data itself, rather than imposing an arbitrary external one.

## 5.2 Clustering with K-Means

The goal of k-means is to partition an N-dimensional dataset into k different sets, where the number k
is fixed at the start of the process. The algorithm performs a
complete clustering of the dataset; in other words, each data
point considered will belong to exactly one of the k clusters.
The most important part of the process
is determining the partitions that form the k sets. This is
done by defining k centroids and assigning each data point to the cluster with the nearest centroid. Each centroid is then
updated by taking the mean of the data points in its cluster.
The partitions are not scale-invariant, and therefore the same dataset may lead to very different results depending on the scale and units used.
The initial k centroids are set at the beginning of the process, and different initial locations may lead to different results.
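
To make the scale dependence concrete, the following minimal sketch assigns one point to the nearest of two centroids, first with the second feature in its original unit and then with the same feature divided by 1000 (for instance, milligrams converted to grams). All numbers and the helper `nearest_centroid` are hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(distances.argmin())

# Hypothetical data: column 0 is a small-range feature,
# column 1 is a feature measured in a unit that gives large numbers.
point = np.array([13.0, 800.0])
centroids = np.array([[12.0, 810.0],
                      [13.1, 700.0]])

# With the original units, the second feature dominates the distance.
original_choice = nearest_centroid(point, centroids)

# Same data, but the second feature is expressed in a unit 1000 times larger.
unit_change = np.array([1.0, 0.001])
rescaled_choice = nearest_centroid(point * unit_change, centroids * unit_change)

print("nearest centroid, original units:", original_choice)
print("nearest centroid, rescaled units:", rescaled_choice)
```

The point is assigned to a different centroid in each case even though the underlying measurements are identical, which is why features are often standardised before running k-means.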
The general idea behind k-means can be summarised in the following four
steps:

1. Choose the location of the initial k centroids.
2. For each data point, find the distance to each of the k
   centroids and assign the point to the nearest one.
3. Once all data points have been assigned to a cluster,
   recalculate the centroid positions.
4. Repeat steps 2 and 3 until convergence is reached.
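
The four steps can be sketched directly in NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the choice of random data points as initial centroids, the `max_iter` cap, and the "centroids stop moving" convergence test are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the four steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids (an assumption;
    # other initialisation schemes exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every point to every centroid, shape (n_points, k),
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic example: two well-separated groups of 50 points each.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    data_rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Because the result depends on the initial centroids, changing `seed` may swap the cluster numbering or, on less well-separated data, produce a different partition.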

### 5.2.1 Cluster Validation

It is important to note that even in cases where no natural
partition exists, k-means will return a partition of the dataset
into k subsets. It is therefore useful to validate the clusters obtained. Cluster validation can be further used to identify clusters that should be split or merged, or to identify individual points with a disproportionate effect on the overall clustering.

Validation can be done with the help of two measures: cohesion and separation. Cohesion is a measure of how closely related the data points within a cluster are, and is given by the
within-cluster sum of squared errors (SSE). Separation is a measure of how well clusters are segregated from each other. The overall cohesion and
separation measures are given by the sum over clusters; in the case of separation it is not unusual to weight each of the terms in the sum.

An alternative measure of validity that provides us with a
combination of the ideas behind cohesion and separation in
a single coefficient is given by the silhouette score, which ranges from -1 to 1. The average silhouette over the entire
dataset tells us how well the clustering algorithm has
performed and can be used to determine the best number of
clusters for the dataset at hand.

All in all, k-means is fairly efficient in both time and
complexity; however, it does not perform very well with non-convex clusters, or with data having varying shapes
and densities. One possible way to deal with some of these
issues is to increase the value of k and later recombine
the sub-clusters obtained. Also, remember that k-means
requires a carefully chosen distance measure that captures
the properties of the dataset.
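
As an illustration of cohesion and separation, the sketch below computes the within-cluster SSE and a between-cluster sum of squares for a small, made-up dataset. Weighting each separation term by the cluster size is one common choice and is an assumption here, as are the function name `cohesion_separation` and the data. With that weighting, the two measures add up to the total sum of squares around the overall mean.

```python
import numpy as np

def cohesion_separation(X, labels):
    """Return (within-cluster SSE, size-weighted between-cluster sum of squares)."""
    overall_mean = X.mean(axis=0)
    sse = 0.0  # cohesion: smaller means tighter clusters
    ssb = 0.0  # separation: larger means clusters lie further apart
    for label in np.unique(labels):
        members = X[labels == label]
        centroid = members.mean(axis=0)
        sse += ((members - centroid) ** 2).sum()
        ssb += len(members) * ((centroid - overall_mean) ** 2).sum()
    return sse, ssb

# Hypothetical data: two tight groups of three points each.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

sse, ssb = cohesion_separation(X, labels)
total = ((X - X.mean(axis=0)) ** 2).sum()

print("cohesion (SSE):", sse)
print("separation (SSB):", ssb)
print("total sum of squares:", total)
```

Since the total is fixed for a given dataset, lowering the SSE for a given k necessarily raises this weighted separation.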

### 5.2.2 K-Means in action

In [2]:
import pandas as pd
wine = pd.read_csv("Data/wine.csv")

In [3]:
wine.columns

Index(['Wine', 'Alcohol', 'Malic.acid', 'Ash', 'Acl', 'Mg', 'Phenols',
       'Flavanoids', 'Nonflavanoid.phenols', 'Proanth', 'Color.int', 'Hue',
       'OD', 'Proline'],
      dtype='object')

In [5]:
X1 = wine[['Alcohol','Color.int']].values
Y = wine['Wine'].values

In [39]:
from sklearn import cluster
cls_wine = cluster.KMeans(n_clusters=3)
cls_wine.fit(X1)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [40]:
print(cls_wine.labels_)

[1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 0 0 1 0
 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 2 0 1 1 1 1 1 1 1 2 1 1 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 2 2]


In [41]:
print(cls_wine.cluster_centers_)

[[12.25353846  2.854     ]
 [13.45168831  5.19441558]
 [13.38472222  8.74611108]]
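
The centroids printed above are enough to assign a previously unseen observation to a cluster: it goes to the nearest centroid. The sketch below does this by hand with NumPy, using the centroid values from the output above and two hypothetical wines given as (Alcohol, Color.int) pairs. On the fitted estimator, `cls_wine.predict(new_wines)` performs the same nearest-centroid assignment.

```python
import numpy as np

# Centroids copied from the output above (columns: Alcohol, Color.int).
centers = np.array([
    [12.25353846, 2.854],
    [13.45168831, 5.19441558],
    [13.38472222, 8.74611108],
])

# Hypothetical new observations, not taken from the dataset.
new_wines = np.array([
    [13.0, 9.0],
    [12.0, 3.0],
])

# Distance from each new wine to each centroid, shape (2, 3).
distances = np.linalg.norm(new_wines[:, None, :] - centers[None, :, :], axis=2)
assigned = distances.argmin(axis=1)
print(assigned)
```

Note that the cluster numbers themselves carry no meaning: since `random_state` is not fixed in the call to `KMeans`, another run may number the same groups differently.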


In [42]:
from sklearn.metrics import silhouette_score
print(silhouette_score(X1,cls_wine.labels_))

0.5097267872581326
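
The average silhouette can also be used to compare several values of k, as mentioned in the section on cluster validation. The following is a sketch under stated assumptions: it uses synthetic data from `make_blobs` with three well-separated groups rather than the wine data, and the range of k, `n_init=10` and `random_state=0` are arbitrary choices made for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (an assumption for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for several values of k and record the average silhouette.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

# The k with the highest average silhouette is the preferred choice.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: average silhouette = {score:.3f}")
print("best k:", best_k)
```

The same loop could be run on `X1` from the wine example, although real data rarely shows as clear a maximum as these synthetic blobs.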
