# Calculating a Silhouette Coefficient

In [1]:
from sklearn.metrics import silhouette_score

import numpy as np

One way to measure the quality of a clustering on a particular dataset is to calculate its **Silhouette Coefficient**. The main idea is to compare *inter*-cluster distances (how far a point is from the points in other clusters) with *intra*-cluster distances (how close it is to the points in its own cluster).

## Formula

![silo](../img/silo2.png)

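> **a** refers to the average distance between a point and all other points in its own cluster.

> **b** refers to the average distance between that same point and all points in the *nearest* cluster to which it does not belong. With only two clusters, this is simply the average distance to every point in the other cluster.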

The coefficient is calculated for each point in the dataset, then averaged across all points to give a single overall score.
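
Written out, the standard definition is given below (presumably this is what the image above shows). The symbols *s(i)* for the per-point coefficient, *S* for the overall score, and *N* for the number of points are notation assumed here, not taken from the original figure:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad S = \frac{1}{N} \sum_{i=1}^{N} s(i)
$$

When *b(i)* is larger than *a(i)*, as it is for every point in the example that follows, the denominator is just *b(i)*.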

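The Silhouette Coefficient ranges from -1 to 1. The closer the score is to 1, the more clearly defined the clusters are. Scores near 0 indicate overlapping clusters, and negative scores suggest that points have been assigned to the wrong cluster.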

## Example

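Suppose we have four points in a two-dimensional space, grouped into two clusters: (1, 1) and (1, 2) in the first, and (5, 4) and (5, 5) in the second. We start with the Euclidean distances between them.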

In [2]:
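dist1 = 5              # (1, 1)-(5, 4) and (1, 2)-(5, 5): sqrt(4**2 + 3**2)
dist2 = 1              # within each cluster: (1, 1)-(1, 2) and (5, 4)-(5, 5)
dist3 = np.sqrt(20)    # (1, 2)-(5, 4): sqrt(4**2 + 2**2)
dist4 = np.sqrt(32)    # (1, 1)-(5, 5): sqrt(4**2 + 4**2)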
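
The hand-computed distances above can be double-checked with a few lines of NumPy. This is only a sketch; the names `points`, `diffs`, and `pairwise` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

# The four example points, in the same order as in the text.
points = np.array([(1, 1), (1, 2), (5, 4), (5, 5)], dtype=float)

# Broadcasting gives every pairwise difference vector at once;
# pairwise[i, j] is the Euclidean distance between points i and j.
diffs = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

print(np.round(pairwise, 3))
```

Row 0 of the matrix holds the distances from (1, 1), so `pairwise[0, 2]` should match `dist1` and `pairwise[0, 3]` should match `dist4`.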

In [4]:
avg1 = np.mean((dist1, dist3))  # b for (1, 2) and for (5, 4)
avg2 = np.mean((dist1, dist4))  # b for (1, 1) and for (5, 5)

In [5]:
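# Every point has a = dist2 and b > a, so max(a, b) = b.
silh_11 = (avg2 - dist2) / avg2

silh_12 = (avg1 - dist2) / avg1

# By symmetry, (5, 4) mirrors (1, 2) and (5, 5) mirrors (1, 1).
silh_54 = silh_12

silh_55 = silh_11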

In [6]:
np.mean((silh_11, silh_12, silh_54, silh_55))

0.8005908696438615

In [10]:
silhouette_score([(1, 1), (1, 2), (5, 4), (5, 5)], ['spec1', 'spec1', 'spec2', 'spec2'])

0.8005908696438615