<h1>CS4618: Artificial Intelligence I</h1>
<h1>Clustering: Introduction</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score

# Class, for use in pipelines, to select certain columns from a DataFrame and convert to a numpy array
# From A. Geron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017
# Modified by Derek Bridge to allow for casting in the same ways as pandas.DataFrame.astype
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, dtype=None):
        self.attribute_names = attribute_names
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_selected = X[self.attribute_names]
        if self.dtype:
            return X_selected.astype(self.dtype).values
        return X_selected.values
    
# Class, for use in pipelines, to binarize nominal-valued features (while avoiding the dummy variable trap)
# By Derek Bridge, 2017
class FeatureBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, features_values):
        self.features_values = features_values
        self.num_features = len(features_values)
        self.labelencodings = [LabelEncoder().fit(feature_values) for feature_values in features_values]
        self.onehotencoder = OneHotEncoder(sparse=False,
            n_values=[len(feature_values) for feature_values in features_values])
        self.last_indexes = np.cumsum([len(feature_values) - 1 for feature_values in self.features_values])
    def fit(self, X, y=None):
        for i in range(0, self.num_features):
            X[:, i] = self.labelencodings[i].transform(X[:, i])
        return self.onehotencoder.fit(X)
    def transform(self, X, y=None):
        for i in range(0, self.num_features):
            X[:, i] = self.labelencodings[i].transform(X[:, i])
        onehotencoded = self.onehotencoder.transform(X)
        return np.delete(onehotencoded, self.last_indexes, axis=1)
    def fit_transform(self, X, y=None):
        onehotencoded = self.fit(X).transform(X)
        return np.delete(onehotencoded, self.last_indexes, axis=1)
    def get_params(self, deep=True):
        return {"features_values" : self.features_values}
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)  # built-in setattr; objects have no setattr method
        return self

<h1>Clustering</h1>
<ul>
    <li><b>Clustering</b> is the process of grouping objects according to some distance measure</li>
    <li>The goals:
        <ul>
            <li>two objects in the same cluster are a small distance from each other</li>
            <li>two objects in different clusters are a large distance from each other</li>
        </ul>
    </li>
    <li>E.g. how would you cluster these dogs? 
        <img style="float: right" src="images/13_dogs.jpg" />
    </li>
    <li>Applications:
        <ul>
            <li>Genetics: discovering groups of genes that express themselves in similar ways</li>
            <li>Marketing: segmenting customers for targeted advertising or to drive new product development</li>
            <li>Social network analysis: discovering communities in social networks</li>
            <li>Social sciences: analysing populations based on demographics, behaviour, etc</li>
            <li>Genetic algorithms: identifying population niches in an effort to maintain diversity</li>
            <li>&hellip;</li>
        </ul>
    </li>
    <li>Note: Clustering algorithms assign the objects to groups, but they are typically not capable
        of giving meaningful labels (names) to the groups
     </li>
</ul>

<h1>Clustering algorithms</h1>
<ul>
    <li>There are many, many algorithms, falling roughly into two kinds:
        <ul>
            <li><b>Point-assignment algorithms</b>: 
                <ul>
                    <li>objects are initially assigned to clusters, e.g., arbitrarily</li>
                    <li>then, repeatedly, each object is re-considered: it may be assigned to a cluster to 
                        which it is more closely related
                    </li>
                </ul>
            </li>
            <li><b>Hierarchical algorithms</b>: produce a tree of clusters
                <ul>
                    <li><b>Agglomerative algorithms</b> ('bottom-up'):
                        <ul>
                            <li>each object starts in a 'cluster' on its own; </li>
                            <li>then, recursively, pairs of clusters are merged to form a parent cluster</li>
                        </ul>
                    </li>
                    <li><b>Divisive algorithms</b> ('top-down'):
                        <ul>
                            <li>all objects start in a single cluster;</li>
                            <li>then, recursively, a cluster is split into child clusters</li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>There are lots of other ways of distinguishing clustering algorithms from each other, e.g.
        <ul>
            <li>partitioning: must every object belong to exactly one cluster, or may some objects belong to more
                than one cluster and may some objects belong to no cluster?
            </li>
            <li>hard vs. soft: is membership of a cluster Boolean (an object belongs to a cluster or it does not) 
                or is it fuzzy (there are degrees of membership, e.g. it is 0.8 true that this object belongs 
                to this cluster) or probabilistic?
            </li>
            <li>whether they only work for certain distance measures (e.g. Euclidean, Manhattan, Chebyshev) and not
                for others (e.g. cosine)
            </li>
            <li>whether they assume a dataset that fits into main memory or whether they scale to larger
                datasets
            </li>
            <li>whether they assume all the data is available up-front, or whether they assume it arrives
                over time
            </li>
        </ul>
    </li>
    <li>We'll study two of the simpler algorithms: one point-assignment and one hierarchical</li>
</ul>

<h1>$k$-Means Clustering</h1>
<ul>
    <li>The <b>$k$-means algorithm</b> is the best-known <em>point-assignment algorithm</em>
        <ul>
            <li>E.g. the <code>KMeans</code> class in scikit-learn</li>
        </ul>
    </li>
    <li>It assumes that you know the number of clusters, $k$, in advance</li>
    <li>Given a dataset of examples (as vectors) $\v{X}$, it returns a <em>partition</em> of $\v{X}$ into
        $k$ subsets
    </li>
    <li>Key concept: the <b>centroid</b> of a cluster
        <ul>
            <li>the <em>mean</em> of the examples in that cluster, i.e. the mean of each feature
            </li>
        </ul>
     </li>
</ul>

<h1>Centroids</h1>
<ul>
    <li>Class exercise: What are the centroids of these clusters?
        <ol>
            <li>$\Set{\cv{1\\1\\1}, \cv{2\\4\\6}, \cv{3\\7\\11}}$</li>
            <li>$\Set{\cv{4\\3\\7}}$</li>
            <li>$\Set{\cv{2\\3}, \cv{4\\2}}$</li>
        </ol>
    </li>
    <li>Observations:
        <ul>
            <li>The centroid of a cluster that contains just one example is the example itself</li>
            <li>The centroid of a cluster that contains more than one example may not even be one of the
                examples in the cluster
            </li>
        </ul>
    </li>
</ul>
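
A quick way to check centroid calculations like those in the exercise is to let NumPy take the mean of each feature. This is only a sketch: the names <code>clusters</code> and <code>centroids</code> are illustrative, and each cluster is stored with one example per row.

```python
import numpy as np

# The three clusters from the exercise above, one example per row
clusters = [
    np.array([[1, 1, 1], [2, 4, 6], [3, 7, 11]]),
    np.array([[4, 3, 7]]),
    np.array([[2, 3], [4, 2]]),
]

# The centroid is the mean of each feature, i.e. the mean down the rows
centroids = [cluster.mean(axis=0) for cluster in clusters]

for centroid in centroids:
    print(centroid)
```

The first centroid happens to coincide with one of the examples in its cluster, whereas the third does not, which matches the observations above.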

<h1>$k$-Means Algorithm</h1>
<ul>
    <li>It starts by choosing $k$ examples from $\v{X}$ to be the initial centroids, e.g. randomly</li>
    <li>Then, repeatedly,
        <ul>
            <li><b>Assignment step</b>: Each example $\v{x} \in \v{X}$ is assigned to one of the clusters: 
                the one whose centroid is closest to $\v{x}$
            </li>
            <li><b>Update step</b>: It re-computes the centroids of the clusters</li>
        </ul>
    </li>
</ul>
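
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name <code>k_means</code>, its parameters and the toy data are all hypothetical, distances are Euclidean, and an empty cluster simply keeps its old centroid.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """A bare-bones k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Choose k distinct examples from X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each example goes to the cluster with the closest centroid
        distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: re-compute each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # the centroids did not move, so the assignments cannot change
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of three examples each (made-up data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = k_means(X, k=2)
print(labels)
print(centroids)
```

Which of the two groups is numbered 0 depends on the random choice of initial centroids.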

<h1>Toy Example, $k = 2$</h1>
<table>
    <tr style="border-bottom: 0">
        <td style="border-bottom: 0">Dataset $\v{X}$</td><td style="border-bottom: 0">Random centroids</td>
    </tr>
    <tr style="border-top: 0">
        <td style="border-top: 0"><img src="images/13_km1.png" /></td>
        <td style="border-top: 0"><img src="images/13_km2.png" /></td>
    </tr>
    <tr style="border-bottom: 0">
        <td style="border-bottom: 0">Assignment step</td><td style="border-bottom: 0">Update step</td>
    </tr>
    <tr style="border-top: 0">
        <td style="border-top: 0"><img src="images/13_km3.png" /></td>
        <td style="border-top: 0"><img src="images/13_km4.png" /></td>
    </tr>
    <tr style="border-bottom: 0">
        <td style="border-bottom: 0">Assignment step</td><td style="border-bottom: 0">Update step</td>
    </tr>
    <tr style="border-top: 0">
        <td style="border-top: 0"><img src="images/13_km5.png" /></td>
        <td style="border-top: 0"><img src="images/13_km6.png" /></td>
    </tr>
</table>

<h1>When to stop?</h1>
<ul>
    <li>If you run it for enough iterations, there <em>usually</em> comes a point when
        <ul>
            <li>In the update step, the centroids don't change</li>
            <li>Hence, in the assignment step, the clustering doesn't change</li>
        </ul>
    </li>
    <li>But there is a small risk that it <em>never</em> happens and that the algorithm oscillates between two or more
        equally good solutions
    </li>
    <li>Therefore, most implementations have a maximum number of iterations (<code>max_iter</code>
        in scikit-learn)
    </li>
    <li>They might stop earlier, when the algorithm converges &mdash; next slide
    </li>
</ul>

<h1>Inertia and Convergence</h1>
<ul>
    <li>What is $k$-means trying to achieve?
        <ul>
            <li>A clustering that <b>minimizes inertia</b>: the within-cluster sum of distances</li>
            <li>I.e. the sum of the distances from each $\v{x} \in \v{X}$ to its centroid is as low as possible
            </li>
        </ul>
        (Advanced: The algorithm is more correctly formalized as trying to minimise the within-cluster
        sum of squares of distances, which is the quantity that scikit-learn reports as <code>inertia_</code>;
        the intuition is the same)
    </li>
    <li>If you run it for enough iterations, it will <b>converge</b>
        <ul>
            <li>I.e. the inertia will remain unchanged between iterations</li>
        </ul>
        The algorithm can stop at this point
    </li>
    <li>Most implementations have a tolerance (<code>tol</code> in scikit-learn):
        <ul>
            <li>They stop when the change in inertia falls below the tolerance, rather than waiting
                for zero change
            </li>
        </ul>
    </li>
</ul>
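
To make the quantity concrete, the sketch below computes both the sum of distances and the sum of squared distances by hand for a made-up dataset, and prints the latter next to the <code>inertia_</code> attribute of a fitted <code>KMeans</code> object. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two small groups of examples
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The centroid that each example was assigned to
own_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Euclidean distance from each example to its own centroid
distances = np.linalg.norm(X - own_centroids, axis=1)

sum_of_distances = distances.sum()
sum_of_squared_distances = (distances ** 2).sum()

print(sum_of_distances)
print(sum_of_squared_distances, kmeans.inertia_)
```

The second line of output should show two matching values, since scikit-learn's <code>inertia_</code> is the sum of squared distances.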

<h1>Local and Global Minima</h1>
<ul>
    <li>Even if the algorithm converges (no improvement in inertia), the clustering it converges on
        might not be the <b>global minimum</b> (the one with lowest possible inertia)</li>
    <li>$k$-means produces different clusterings depending on the choice of the initial $k$ centroids</li>
    <li>For a given set of initial centroids, the clustering it converges on might be a <b>local minimum</b>:
        <ul>
            <li>For these initial centroids, no better clustering can be found, but it's not
                the very best clustering possible
            </li>
        </ul>
    </li>
    <li>Class exercise. Here, $\v{X}$ contains four examples at the corners of a rectangle:
        <figure>
            <img src="images/13_minima.png" />
        </figure>
        <ul>
            <li>For $k=2$, choose initial centroids that result in a global minimum</li>
            <li>And choose $k=2$ centroids that give a local minimum</li>
        </ul>
    </li>
    <li>Let's look at ways of reducing the problem&hellip;</li>
</ul>
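
One way to experiment with the exercise is to pass explicit initial centroids to <code>KMeans</code> through its <code>init</code> parameter. The coordinates below are an assumption (a rectangle 4 units wide and 1 unit high), since the figure does not give any, and <code>run_from</code> is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates: a rectangle that is 4 wide and 1 high
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

def run_from(initial_centroids):
    # n_init=1 because we supply the initial centroids ourselves
    return KMeans(n_clusters=2, init=np.array(initial_centroids), n_init=1).fit(X)

# Start from the two bottom corners: each one attracts the corner above it
left_right = run_from([[0.0, 0.0], [4.0, 0.0]])

# Start from the two left corners: each one attracts the corner to its right
bottom_top = run_from([[0.0, 0.0], [0.0, 1.0]])

print(left_right.labels_, left_right.inertia_)
print(bottom_top.labels_, bottom_top.inertia_)
```

Both runs converge, but the second one settles on a clustering with much higher inertia than the first.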

<h2>Avoiding local minima: re-running</h2>
<ul>
    <li>The obvious solution is to run $k$-means multiple times (with different initial centroids) and
        return the clustering that has the lowest inertia
    </li>
    <li>No guarantee of finding the global minimum this way but likely to be better</li>
    <li>E.g. in scikit-learn, the number of runs (<code>n_init</code>) is 10, by default</li>
</ul>
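
The idea of re-running can be sketched by hand, which is roughly what <code>n_init</code> automates. The synthetic data from <code>make_blobs</code>, the choice of five runs and the names <code>runs</code> and <code>best</code> are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, purely for illustration
X, _ = make_blobs(n_samples=200, centers=5, random_state=0)

# Five separate runs, each with a single random initialization
runs = [KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
        for seed in range(5)]

# Keep the run whose clustering has the lowest inertia
best = min(runs, key=lambda run: run.inertia_)

print([round(run.inertia_, 1) for run in runs])
print(round(best.inertia_, 1))
```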

<h2>Avoiding local minima: better initial centroids</h2>
<ul>
    <li>Choosing the initial $k$ centroids at random from $\v{X}$ has problems:
        <ul>
            <li>The algorithm can return different clusters for $\v{X}$ each time it is run
            </li>
            <li>The clustering it returns may be a local minimum</li>
            <li>A poor choice can increase the number of iterations needed for convergence
            </li>
        </ul>
    </li>
    <li>There are many alternatives to choosing wholly randomly, e.g.:
        <ul>
            <li>insert into $\mathit{Centroids}$ an example $\v{x} \in \v{X}$,
                chosen at random with uniform probability
            </li>
            <li>while $|\mathit{Centroids}| < k$
                <ul>
                    <li>insert into $\mathit{Centroids}$ a different example $\v{x} \in \v{X}$,
                        chosen with probability proportional to $(\min_{\v{x}' \in \mathit{Centroids}}\dist(\v{x}, \v{x}'))^2$
                    </li>
                </ul>
            </li>
        </ul>
     </li>
     <li>$k$-means++ is the name of the $k$-means algorithm when using the above method
         <ul>
             <li>it still has randomness, so it still suffers from the problems above, but typically less so</li>
         </ul>
     </li>
     <li>In scikit-learn, the <code>init</code> parameter can have values <code>'k-means++'</code> (default)
         or <code>'random'</code>
     </li>
</ul>
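
The seeding procedure described above can be sketched in NumPy as follows. This is an illustration only, not scikit-learn's code: the function name <code>k_means_plus_plus_init</code> is hypothetical, distances are Euclidean, and the rows of the data are assumed to be distinct.

```python
import numpy as np

def k_means_plus_plus_init(X, k, seed=0):
    """Choose k initial centroids from the rows of X, as in the steps above."""
    rng = np.random.default_rng(seed)
    # First centroid: an example chosen uniformly at random
    chosen = [rng.integers(len(X))]
    while len(chosen) < k:
        # Distance from every example to each already-chosen centroid
        distances = np.linalg.norm(X[:, np.newaxis, :] - X[chosen][np.newaxis, :, :], axis=2)
        # Squared distance to the closest already-chosen centroid
        closest_squared = distances.min(axis=1) ** 2
        # Already-chosen examples are at distance 0, so they cannot be picked again
        probabilities = closest_squared / closest_squared.sum()
        chosen.append(rng.choice(len(X), p=probabilities))
    return X[chosen]

# Made-up data with distinct rows
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
initial_centroids = k_means_plus_plus_init(X, k=3)
print(initial_centroids)
```

Examples that are far from every centroid chosen so far are more likely to be picked next, which tends to spread the initial centroids out.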

<h1>$k$-means clustering: discussion</h1>
<ul>
    <li>$k$-means can work well
        <ul>
            <li>but not so much in the presence of outliers or when the natural clusters are elongated or 
                irregular shapes
            </li>
        </ul>
    </li>
    <li>The curse of dimensionality may be relevant
        <ul>
            <li>You might want to include dimensionality reduction such as PCA in your pipeline</li>
        </ul>
    </li>
    <li>The algorithm mostly scales well to larger data
        <ul>
            <li>There are variants for speed-up, e.g. <code>MiniBatchKMeans</code> in scikit-learn
            </li>
        </ul>
    </li>
    <li>There is the problem of choosing $k$ in advance
        <ul>
            <li>Why does it not make sense to run it with all $k$ in $[2,m]$ (where $m$ is the number of
                examples in $\v{X}$) and choose the clustering with lowest inertia?
            </li>
            <li>There are point-assignment algorithms that do not require you to choose $k$ in advance</li>
        </ul>
    </li>
</ul>
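
One way to explore the question about choosing $k$ by inertia is to print the inertia for every $k$ on a tiny dataset. The six made-up examples and the dictionary name <code>inertias</code> are assumptions for illustration; $k = 1$ is included only for comparison.

```python
import numpy as np
from sklearn.cluster import KMeans

# A tiny made-up dataset of m = 6 distinct examples
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
m = len(X)

# Fit k-means for every k from 1 to m and record the final inertia
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, m + 1)}

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

When $k = m$, every example can be its own centroid, so the inertia falls to zero even though such a clustering tells us nothing.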

<h1>$k$-Means in scikit-learn</h1>

In [7]:
df = pd.read_csv("datasets/dataset_corkA.csv")
df.describe(include="all")

            flarea           type     bdrms    bthrms    floors     devment  ber    location       price
count        207.0            207     207.0     207.0     207.0         207  207         207       207.0
unique         NaN              4       NaN       NaN       NaN           2   12          36         NaN
top            NaN  Semi-detached       NaN       NaN       NaN  SecondHand    G  CityCentre         NaN
freq           NaN             65       NaN       NaN       NaN         204   25          40         NaN
mean    128.094686            NaN  3.434783   2.10628  1.826087         NaN  NaN         NaN  274.724638
std      73.970582            NaN   1.23239  1.185802  0.379954         NaN  NaN         NaN  171.756507
min           41.8            NaN       1.0       1.0       1.0         NaN  NaN         NaN        55.0
25%          82.65            NaN       3.0       1.0       2.0         NaN  NaN         NaN       165.0
50%          106.0            NaN       3.0       2.0       2.0         NaN  NaN         NaN       225.0
75%         153.65            NaN       4.0       3.0       2.0         NaN  NaN         NaN       327.5


In [11]:
numeric_features = ["flarea", "bdrms", "bthrms", "floors"]
nominal_features = ["type", "devment", "ber", "location"]

numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(numeric_features)),
    ("scaler", StandardScaler()),
])

nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(nominal_features)),
    ("binarizer", FeatureBinarizer([df[feature].unique() for feature in nominal_features]))
])

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("numeric_feature", numeric_pipeline),
        ("nominal_feature", nominal_pipeline),
    ]))
])

In [22]:
pipeline.fit(df)
X = pipeline.transform(df)

In [33]:
k = 2
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [31]:
help(kmeans)

Help on KMeans in module sklearn.cluster.k_means_ object:

class KMeans(sklearn.base.BaseEstimator, sklearn.base.ClusterMixin, sklearn.base.TransformerMixin)
 |  K-Means clustering
 |  
 |  Read more in the :ref:`User Guide <k_means>`.
 |  
 |  Parameters
 |  ----------
 |  
 |  n_clusters : int, optional, default: 8
 |      The number of clusters to form as well as the number of
 |      centroids to generate.
 |  
 |  init : {'k-means++', 'random' or an ndarray}
 |      Method for initialization, defaults to 'k-means++':
 |  
 |      'k-means++' : selects initial cluster centers for k-mean
 |      clustering in a smart way to speed up convergence. See section
 |      Notes in k_init for more details.
 |  
 |      'random': choose k observations (rows) at random from data for
 |      the initial centroids.
 |  
 |      If an ndarray is passed, it should be of shape (n_clusters, n_features)
 |      and gives the initial centers.
 |  
 |  n_init : int, default: 10
 |      Number of time 

In [5]:
# The features we want to select
numeric_features = ["flarea", "bdrms", "bthrms", "floors"]
nominal_features = ["type", "devment", "ber", "location"]

# Create the pipelines
numeric_pipeline = Pipeline([
        ("selector", DataFrameSelector(numeric_features)),
        ("scaler", StandardScaler())
    ])

nominal_pipeline = Pipeline([
        ("selector", DataFrameSelector(nominal_features)), 
        ("binarizer", FeatureBinarizer([df[feature].unique() for feature in nominal_features]))])

pipeline = Pipeline([("union", FeatureUnion([("numeric_pipeline", numeric_pipeline), 
                                             ("nominal_pipeline", nominal_pipeline)]))])

In [6]:
# Run the pipeline
pipeline.fit(df)
X = pipeline.transform(df)

In [7]:
# Create the clustering object

k = 2
kmeans = KMeans(n_clusters=k)

In [8]:
# Run it
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [9]:
# In case you're interested, you can see the final inertia
kmeans.inertia_

983.77017887121383

In [10]:
# ...and even the vectors of the final centroids
kmeans.cluster_centers_

array([[ -4.31820283e-01,  -3.64013409e-01,  -3.48297304e-01,
         -1.29320334e-01,   1.65605096e-01,   1.46496815e-01,
          3.75796178e-01,   9.80891720e-01,   6.36942675e-03,
          1.91082803e-02,   1.01910828e-01,   7.64331210e-02,
          7.00636943e-02,   9.55414013e-02,   1.01910828e-01,
          1.27388535e-01,   8.28025478e-02,   8.28025478e-02,
          1.33757962e-01,   3.18471338e-02,   1.91082803e-02,
          1.91082803e-02,   2.54777070e-02,  -1.30104261e-17,
          1.91082803e-02,   3.18471338e-02,   1.01910828e-01,
          6.36942675e-03,   2.42038217e-01,  -1.30104261e-17,
          1.27388535e-02,   1.08280255e-01,  -1.30104261e-17,
          2.54777070e-02,   6.36942675e-03,   7.64331210e-02,
          1.91082803e-02,   2.54777070e-02,   6.36942675e-03,
         -2.60208521e-17,   1.91082803e-02,   6.36942675e-03,
         -2.42861287e-17,   1.91082803e-02,   2.54777070e-02,
          5.09554140e-02,   1.27388535e-02,   6.36942675e-03,
        

In [11]:
# The clusters have been labeled (numbered) from 0...(k-1)
# We can see the labels of each example in the dataset
kmeans.labels_

array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0], dtype=int32)

In [12]:
# Let's hack up a function that helps us look at a few examples from each cluster
def inspect_clusters(alg, df, k, features_to_show, how_many_to_show=None):
    for i in range(0, k):
        print("A few examples from cluster ", i)
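        indexes = alg.labels_ == i  # Boolean mask: True for the examples in cluster i
        max_available = indexes.sum()
        # .loc replaces .ix, which has been removed from recent versions of pandas
        print(df.loc[indexes, features_to_show]
                   [:max_available if not how_many_to_show else min(how_many_to_show, max_available)])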
        print()

In [13]:
# Show 3 examples from each cluster for KMeans with its default (k-means++) initialization
inspect_clusters(kmeans, df, k, numeric_features + nominal_features, 3)

A few examples from cluster  0
   flarea  bdrms  bthrms  floors           type     devment ber  location
1    83.6      3       1       1       Detached  SecondHand  D2  Glanmire
2    97.5      3       2       2  Semi-detached  SecondHand  D1  Glanmire
4   118.7      3       2       2  Semi-detached  SecondHand  E2   Douglas

A few examples from cluster  1
   flarea  bdrms  bthrms  floors           type     devment ber      location
0   497.0      4       5       2       Detached  SecondHand  B2  Carrigrohane
3   158.0      5       2       2  Semi-detached  SecondHand  C3       Douglas
5   170.0      4       4       2       Detached  SecondHand  D1       Douglas



<ul>
    <li>Go back and try with a different value for $k$</li>
    <li>But eye-balling examples from the clusters is not a reliable way of judging the quality of the clustering</li>
</ul>

<h1>Evaluating clustering, part 1</h1>
<ul>
    <li>Suppose someone has already done a manual clustering of the dataset ('ground truth'):
        <ul>
            <li>Then you can compare the output of the algorithm with the ground truth</li>
            <li>Discussed in next lecture</li>
        </ul>
    </li>
    <li>Suppose you don't have a ground truth (much more typical!):
        <ul>
            <li><b>Silhouette Coefficient</b> is one of several ways of scoring clustering quality:
                <ul>
                    <li>For each example $\v{x} \in \v{X}$, compute 
                        $$\frac{b - a}{\max(a, b)}$$
                        where $a$ is the mean distance between $\v{x}$ and all other examples in the same cluster,
                        $b$ is the mean distance between $\v{x}$ and all examples in the next nearest cluster
                    </li>
                    <li>The Silhouette Coefficient is the mean of all of these</li>
                </ul>
            </li>
            <li>
                Its value lies in $[-1,1]$:
                <ul>
                    <li>Positive values suggest examples are in their correct clusters</li>
                    <li>Values near 0 indicate clusters that are not well separated</li>
                    <li>Negative values suggest examples are in the wrong clusters</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
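
The definition above can be checked against scikit-learn's <code>silhouette_score</code> with a by-hand sketch. The function name <code>silhouette_by_hand</code>, the data and the labels are illustrative assumptions; distances are Euclidean, and an example that is alone in its cluster is given a score of 0.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_by_hand(X, labels):
    # All pairwise Euclidean distances between examples
    pairwise = np.linalg.norm(X[:, np.newaxis, :] - X[np.newaxis, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the example itself
        if not same.any():
            scores.append(0.0)  # convention for a cluster with one example
            continue
        # a: mean distance to the other examples in the same cluster
        a = pairwise[i, same].mean()
        # b: mean distance to the examples in the nearest other cluster
        b = min(pairwise[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)

# Made-up data and a made-up clustering of it
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(silhouette_by_hand(X, labels))
print(silhouette_score(X, labels, metric="euclidean"))
```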

In [14]:
silhouette_score(X, kmeans.labels_, metric='euclidean')

0.25888899875904836
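
Because the Silhouette Coefficient does not need a ground truth, one possible use is to compare clusterings for several values of $k$. The sketch below uses synthetic data from <code>make_blobs</code> rather than the housing dataset, and the range of $k$ values and the names <code>scores</code> and <code>best_k</code> are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, purely for illustration
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

# Score the clustering that k-means finds for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On real data the scores are rarely this clear-cut, so treat the highest-scoring $k$ as a suggestion rather than a definitive answer.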