&emsp;&emsp;Assume the data have been clustered into $k$ clusters by any technique, such as k-means.  
&emsp;&emsp;For a data point $ i \in C_i $ (data point $ i $ in the cluster $ C_{i} $), let

$$ a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, i \neq j} d(i, j) $$

&emsp;&emsp;be the mean distance between $i$ and all other data points in the same cluster, 
where $ d(i,j)$ is the distance between data points $i$ and $ j $ in the cluster $ C_{i} $ 
(we divide by $ |C_{i}|-1 $ because we do not include the distance $ d(i,i) $ in the sum). 
We can interpret $ a(i) $  as a measure of how well $ i $ is assigned to its cluster (the smaller the value, the better the assignment).          
&emsp;&emsp;We then define the mean dissimilarity of point $ i $ to some other cluster $ C \neq C_{i} $ as the mean of the distances from $ i $ to all points in $ C $.  
&emsp;&emsp;For each data point $ i\in C_{i} $, we now define

$$ b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) $$

to be the smallest (hence the $ \min $ operator in the formula) mean distance from $ i $ to the points of any cluster of which $ i $ is not a member. 
The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of $ i $, because it is the next-best-fitting cluster for point $ i $.  
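
&emsp;&emsp;The two quantities can be checked by hand on a toy example. The sketch below is only an illustration: the 1-D points, the cluster names `A`, `B`, `C`, and the helper functions are all made up for this purpose, and the absolute difference stands in for $ d(i,j) $.

```python
# Toy 1-D data: three clusters given as lists of points (illustrative values).
clusters = {
    "A": [0.0, 1.0, 2.0],
    "B": [5.0, 6.0],
    "C": [10.0, 12.0],
}


def dist(p, q):
    """Absolute difference, standing in for d(i, j)."""
    return abs(p - q)


def mean_dissimilarity(point, members):
    """Mean distance from `point` to every point in `members`."""
    return sum(dist(point, q) for q in members) / len(members)


def a_value(point, own_members):
    """Mean distance from `point` to the other members of its own cluster."""
    others = list(own_members)
    others.remove(point)  # drop the point itself, so d(i, i) is not counted
    return mean_dissimilarity(point, others)


def b_value(point, own_name):
    """Smallest mean dissimilarity to any other cluster, plus that cluster's name."""
    candidates = {
        name: mean_dissimilarity(point, members)
        for name, members in clusters.items()
        if name != own_name
    }
    neighbor = min(candidates, key=candidates.get)
    return candidates[neighbor], neighbor


point = 2.0                          # a member of cluster "A"
a = a_value(point, clusters["A"])    # (2 + 1) / 2 = 1.5
b, neighbor = b_value(point, "A")    # B: (3 + 4) / 2 = 3.5, C: (8 + 10) / 2 = 9.0
print(a, b, neighbor)                # 1.5 3.5 B
```

Here cluster `B` is the neighboring cluster of the point 2.0, since its mean dissimilarity (3.5) is smaller than that of `C` (9.0).
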
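&emsp;&emsp;We now define the silhouette value of a data point $ i $ as

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, \quad \text{if } |C_i| > 1 $$

and

$$ s(i) = 0, \quad \text{if } |C_i| = 1 $$

(for a cluster with a single point, $ a(i) $ is undefined because the divisor $ |C_{i}|-1 $ is zero).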

&emsp;&emsp;From the above definition it is clear that

$$ -1 \leq s(i) \leq 1 $$
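
&emsp;&emsp;One way to see this bound is to expand the $ \max $ in the denominator. The short derivation below assumes $ |C_i| > 1 $ and that $ a(i) $ and $ b(i) $ are non-negative (they are means of distances) and not both zero:

$$ s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)}, & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1, & \text{if } a(i) > b(i) \end{cases} $$

In the first case $ 0 \leq a(i)/b(i) < 1 $, so $ 0 < s(i) \leq 1 $. In the third case $ 0 \leq b(i)/a(i) < 1 $, so $ -1 \leq s(i) < 0 $.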

&emsp;&emsp;For $s(i)$ to be close to 1 we require $ a(i) \ll b(i) $. 
As $ a(i) $ measures how dissimilar $ i $ is to its own cluster, a small value means it is well matched. 
Furthermore, a large $ b(i) $ implies that $ i $ is badly matched to its neighboring cluster. 
Thus an $ s(i) $ close to 1 means that the datum is appropriately clustered. 
If $ s(i) $ is close to $ -1 $, then $ a(i) \gg b(i) $ and, by the same logic, $ i $ would be better placed in its neighboring cluster. 
An $ s(i) $ near zero means that the datum lies on the border between two natural clusters.  
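
&emsp;&emsp;To make these cases concrete, here is a minimal, dependency-free sketch of the definitions above. It is not scikit-learn's implementation: the function name `silhouette_values`, the toy points and labels, and the default absolute-difference distance are all assumptions chosen for illustration. It also assumes at least two clusters and that $ a(i) $ and $ b(i) $ are not both zero.

```python
def silhouette_values(points, labels, dist=lambda p, q: abs(p - q)):
    """Return s(i) for every point, following the definitions in the text.

    Assumes at least two distinct labels and max(a(i), b(i)) > 0.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: s(i) = 0 for a singleton cluster
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Illustrative data: the point 8.0 is deliberately given the "wrong" label 0,
# and the point 30.0 forms a cluster on its own.
points = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0, 11.0, 30.0]
labels = [0, 0, 0, 0, 1, 1, 1, 2]

s = silhouette_values(points, labels)
for p, lab, v in zip(points, labels, s):
    print(f"point {p:5.1f}  cluster {lab}  s = {v:+.3f}")
```

In this toy data the point 8.0 receives a negative value, because it lies much closer to cluster 1 than to its assigned cluster 0. The point 10.0 scores close to 1, and the singleton 30.0 scores exactly 0 by convention.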

In [1]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

cluster_labels_2 = KMeans(n_clusters=2).fit_predict(X)
# The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. 
print("The average silhouette_score with n_clusters=2: %0.3f"
      % metrics.silhouette_score(X, cluster_labels_2)) # mean silhouette coefficient over all samples

cluster_labels_3 = KMeans(n_clusters=3).fit_predict(X)
print("The average silhouette_score with n_clusters=3: %0.3f"
      % metrics.silhouette_score(X, cluster_labels_3)) # the closer to 1, the better

cluster_labels_4 = KMeans(n_clusters=4).fit_predict(X)
print("The average silhouette_score with n_clusters=4: %0.3f"
      % metrics.silhouette_score(X, cluster_labels_4))


print(metrics.silhouette_score(X, cluster_labels_4,
                               metric='euclidean')) # Euclidean distance (the default)
print(metrics.silhouette_score(X, cluster_labels_4,
                               metric='manhattan')) # Manhattan distance
print(metrics.silhouette_score(X, cluster_labels_4,
                               metric='cosine')) # cosine distance

The average silhouette_score with n_clusters=2: 0.517
The average silhouette_score with n_clusters=3: 0.650
The average silhouette_score with n_clusters=4: 0.528
0.5282113071604241
0.5093667814053747
0.8216278157409964


In [2]:
# The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.
samples = metrics.silhouette_samples(X, cluster_labels_4) # silhouette coefficient of each sample (the closer to 1, the better)
samples

array([ 7.41446794e-01,  7.27959200e-01,  6.33367065e-01,  5.28016419e-01,
        6.46438486e-01,  5.18226328e-01,  7.25211977e-01,  2.84218440e-01,
        4.48047604e-01,  6.72656773e-01,  7.15695701e-01,  6.44660630e-01,
        6.39372824e-01,  4.35351855e-01,  3.05041851e-01,  6.93368935e-01,
        4.04534664e-01,  5.67754219e-01,  1.45121563e-01,  5.37743466e-01,
        5.21516909e-01,  2.02968070e-01,  4.66816686e-01,  2.64815967e-02,
        7.37910179e-01,  6.46804210e-01,  5.04458218e-01,  7.05582589e-01,
        7.04274198e-01,  5.34209238e-01,  7.09487942e-01,  7.04839017e-01,
        5.90282371e-01,  7.37098085e-01,  5.71850357e-01,  5.63834440e-01,
        4.59986816e-01,  7.10535175e-01,  7.09307097e-01,  9.78708322e-02,
        5.05640265e-01,  7.30633479e-01,  7.42152664e-01,  6.66903625e-01,
        6.32725320e-01,  6.58043210e-01,  6.48656334e-01,  4.91170437e-01,
        6.60667667e-01,  7.37176092e-01,  4.82292853e-01,  2.93933895e-01,
        6.14298919e-01,  

In [3]:
np.mean(samples) # mean silhouette coefficient over all samples; matches silhouette_score above

0.5282113071604241