&emsp;&emsp;Assume the data have been clustered via any technique, such as k-means, into k clusters.      
&emsp;&emsp;For data point $ i \in C_i $ (data point $ i $ in the cluster $ C_{i} $), let

$$ a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, i \neq j} d(i, j) $$

be the mean distance between $i$ and all other data points in the same cluster, 
where $ d(i,j)$ is the distance between data points $i$ and $ j $ in the cluster $ C_{i} $ 
(we divide by $ |C_{i}|-1 $ because we do not include the distance $ d(i,i) $ in the sum). 
We can interpret $ a(i) $  as a measure of how well $ i $ is assigned to its cluster (the smaller the value, the better the assignment).          
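
&emsp;&emsp;As a rough illustration of the formula for $ a(i) $, the sketch below computes it with NumPy on a toy data set. The helper name `mean_intra_cluster_distance`, the toy points, and the choice of Euclidean distance for $ d(i,j) $ are assumptions made for this example only.

```python
import numpy as np

def mean_intra_cluster_distance(X, labels, i):
    """a(i): mean Euclidean distance from point i to the other points in its cluster."""
    same = labels == labels[i]          # mask of points in C_i (includes i itself)
    n_same = same.sum()
    if n_same == 1:
        # Singleton cluster: a(i) is undefined; 0.0 is a placeholder (assumption).
        return 0.0
    dists = np.linalg.norm(X[same] - X[i], axis=1)  # d(i, i) = 0 adds nothing
    return dists.sum() / (n_same - 1)   # divide by |C_i| - 1, not |C_i|

# Toy data: three points in cluster 0, one point in cluster 1.
X = np.array([[0.0, 0.0], [0.0, 3.0], [4.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 0, 1])

a0 = mean_intra_cluster_distance(X, labels, 0)
print(a0)  # (3 + 4) / 2 = 3.5
```
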
&emsp;&emsp;We then define the mean dissimilarity of point $ i $ to some cluster $ C $ as the mean of the distance from $ i $ to all points in $ C $  (where $ C\neq C_{i} $).    
&emsp;&emsp;For each data point $ i\in C_{i} $, we now define

$$ b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) $$

to be the smallest (hence the $ \min $ operator in the formula) mean distance from $ i $ to all points in any cluster of which $ i $ is not a member. 
The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of $ i $ because it is the next-best-fit cluster for point $ i $.  
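
&emsp;&emsp;The definition of $ b(i) $ and of the neighboring cluster can be sketched in the same way. The helper name `nearest_cluster_distance` and the toy points are hypothetical, and Euclidean distance is again assumed for $ d(i,j) $.

```python
import numpy as np

def nearest_cluster_distance(X, labels, i):
    """Return (b(i), label of the neighboring cluster) using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)   # d(i, j) for every j
    best_label, best_mean = None, np.inf
    for c in np.unique(labels):
        if c == labels[i]:
            continue                            # skip i's own cluster
        mean_d = dists[labels == c].mean()      # divide by |C_k|: i is not in C_k
        if mean_d < best_mean:
            best_label, best_mean = c, mean_d
    return best_mean, best_label

# Toy data: three clusters of two points each.
X = np.array([[0.0, 0.0], [1.0, 0.0],
              [3.0, 0.0], [5.0, 0.0],
              [0.0, 10.0], [0.0, 12.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

b0, neighbor = nearest_cluster_distance(X, labels, 0)
print(b0, neighbor)  # cluster 1: (3 + 5) / 2 = 4; cluster 2: (10 + 12) / 2 = 11
```
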
&emsp;&emsp;We now define the silhouette (value) of one data point $ i $ as

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, \quad \text{if } |C_i| > 1 $$

and

$$ s(i) = 0, \quad \text{if } |C_i| = 1 $$
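
&emsp;&emsp;Putting the pieces together, a minimal NumPy sketch of $ s(i) $ for every point might look as follows. The function name `silhouette_values` and the toy data are assumptions for illustration; the sketch uses Euclidean distance, expects at least two clusters, and assigns $ s(i) = 0 $ to a point that is alone in its cluster.

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) for every point, using Euclidean distance (needs >= 2 clusters)."""
    n = len(X)
    # Pairwise Euclidean distance matrix D[i, j] = d(i, j).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: s(i) stays 0
        a = D[i, own].sum() / (own.sum() - 1)   # d(i, i) = 0 adds nothing
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy 1-D data: two tight pairs and one isolated point.
X = np.array([[0.0], [1.0], [4.0], [5.0], [20.0]])
labels = np.array([0, 0, 1, 1, 2])

s = silhouette_values(X, labels)
print(s)  # first point: a = 1, b = 4.5, so s = 3.5 / 4.5
```

&emsp;&emsp;On data like this, the values should agree with `sklearn.metrics.silhouette_samples` called with its default Euclidean metric.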

&emsp;&emsp;From the above definition it is clear that

$$ -1 \leq s(i) \leq 1 $$
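
&emsp;&emsp;One way to see this bound, assuming $ a(i) $ and $ b(i) $ are not both zero, is to write the definition out case by case:

$$ s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases} $$

&emsp;&emsp;Because distances are non-negative, the ratio in the first and third cases lies in $ [0, 1) $, so the first case gives a value in $ (0, 1] $ and the third a value in $ [-1, 0) $.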

&emsp;&emsp;For $ s(i) $ to be close to 1 we require $ a(i) \ll b(i) $. 
As $ a(i) $ is a measure of how dissimilar $ i $ is to its own cluster, a small value means it is well matched. 
Furthermore, a large $ b(i) $ implies that $ i $ is badly matched to its neighboring cluster. 
Thus an $ s(i) $ close to 1 means that the datum is appropriately clustered. 
If $ s(i) $ is close to $ -1 $, then by the same logic $ i $ would be more appropriately assigned to its neighboring cluster. 
An $ s(i) $ near zero means that the datum is on the border of two natural clusters.

In [17]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

cluster_labels_2 = KMeans(n_clusters=2).fit_predict(X)

In [23]:
X

array([[ 0.49426097,  1.45106697],
       [-1.42808099, -0.83706377],
       [ 0.33855918,  1.03875871],
       ...,
       [-0.05713876, -0.90926105],
       [-1.16939407,  0.03959692],
       [ 0.26322951, -0.92649949]])

In [24]:
cluster_labels_2

array([1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,

In [20]:

# The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. 
print("The average silhouette_score with n_clusters=2: %0.3f"
      % metrics.silhouette_score(X, cluster_labels_2))  # mean silhouette coefficient over all samples

cluster_labels_3 = KMeans(n_clusters=3).fit_predict(X)
print("The average silhouette_score with n_clusters=3: %0.3f"
      % metrics.silhouette_score(X, cluster_labels_3))  # the closer to 1, the better

cluster_labels_4 = KMeans(n_clusters=4).fit_predict(X)
print("The average silhouette_score with n_clusters=4: %0.3f"
      % metrics.silhouette_score(X, cluster_labels_4))

print(metrics.silhouette_score(X, cluster_labels_4,
                               metric='euclidean'))  # Euclidean distance (default)
print(metrics.silhouette_score(X, cluster_labels_4,
                               metric='manhattan'))  # Manhattan distance
print(metrics.silhouette_score(X, cluster_labels_4,
                               metric='cosine'))  # cosine distance

The average silhouette_score with n_clusters=2: 0.517
The average silhouette_score with n_clusters=3: 0.650
The average silhouette_score with n_clusters=4: 0.529
0.5285529828008654
0.5114646181901286
0.8040273185357246
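
&emsp;&emsp;The comparison of `n_clusters=2`, `3` and `4` above can also be written as a loop that keeps the candidate with the highest average silhouette. This is only a sketch: the candidate range `2..5` is an arbitrary choice, and `random_state=0` and `n_init=10` are fixed here solely to make the run repeatable (the cells above do not set them), so the printed scores may differ slightly from those shown.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the cells above.
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                  random_state=0)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 6):  # candidate values of n_clusters (arbitrary range)
    # random_state / n_init are assumptions added for repeatability.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # Euclidean distance by default

# Keep the candidate with the highest average silhouette.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print("n_clusters=%d: average silhouette %0.3f" % (k, score))
print("best n_clusters:", best_k)
```
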


In [21]:
# The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.
samples = metrics.silhouette_samples(X, cluster_labels_4)  # silhouette coefficient of each sample (the closer to 1, the better)
samples

array([ 0.74545084,  0.72480657,  0.64185089,  0.50177063,  0.65082269,
        0.50665836,  0.72159468,  0.27911956,  0.45973964,  0.67791423,
        0.71106754,  0.64174003,  0.6344096 ,  0.50010918,  0.30393452,
        0.69807999,  0.40516267,  0.55854476,  0.11700801,  0.476787  ,
        0.45648864,  0.14831193,  0.52307357,  0.17804839,  0.73446006,
        0.63903211,  0.43490815,  0.70949639,  0.70884387,  0.56079064,
        0.7151231 ,  0.69954619,  0.58801667,  0.74155737,  0.5644136 ,
        0.57427433,  0.50435104,  0.71439016,  0.71477722,  0.16536862,
        0.52035499,  0.72651217,  0.73878139,  0.66430293,  0.62867913,
        0.65131827,  0.65663188,  0.50016998,  0.66492797,  0.7333447 ,
        0.51153418,  0.33339118,  0.60683021,  0.73852426,  0.53139085,
        0.29573541,  0.58132398,  0.58701955,  0.39928863,  0.69025727,
        0.28176123,  0.31326231,  0.47825218,  0.38281839,  0.26833209,
        0.57793602,  0.42378461,  0.4802084 ,  0.6841862 ,  0.71

In [22]:
np.sum(samples) / len(samples)  # mean silhouette coefficient over all samples

0.5285529828008654