## Cross tabulation

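Cross tabulation, or the contingency matrix, is the basis for many clustering quality measures. It shows how similar two clusterings are at the cluster level: entry (i, j) counts the points assigned to cluster i by the first clustering and to cluster j by the second.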

In [1]:
# Packages we will use throughout this notebook
using Clustering
using VegaLite
using VegaDatasets
using DataFrames
using Statistics
using JSON
using CSV
using Distances

In [2]:
download("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv","newhouses.csv")
houses = CSV.read("newhouses.csv", DataFrame)

┌ Error: curl_easy_setopt: 48
└ @ Downloads.Curl /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Downloads/src/Curl/utils.jl:36


| Row | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population |
| --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | Float64? | Float64 |
| 1 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 |
| 2 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 3 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 5 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 6 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 |
| 7 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 8 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 9 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 10 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 |


In [3]:
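# Cluster the houses by geographic position only
X = houses[!, [:latitude, :longitude]]
# kmeans expects one observation per column, hence the transpose
C = kmeans(Matrix(X)', 10)
# Store the k-means cluster label of each house as a new third column
insertcols!(houses, 3, :cluster10_means => C.assignments)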

| Row | longitude | latitude | cluster10_means | housing_median_age | total_rooms | total_bedrooms |
| --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Int64 | Float64 | Float64 | Float64? |
| 1 | -122.23 | 37.88 | 10 | 41.0 | 880.0 | 129.0 |
| 2 | -122.22 | 37.86 | 10 | 21.0 | 7099.0 | 1106.0 |
| 3 | -122.24 | 37.85 | 10 | 52.0 | 1467.0 | 190.0 |
| 4 | -122.25 | 37.85 | 10 | 52.0 | 1274.0 | 235.0 |
| 5 | -122.25 | 37.85 | 10 | 52.0 | 1627.0 | 280.0 |
| 6 | -122.25 | 37.85 | 10 | 52.0 | 919.0 | 213.0 |
| 7 | -122.25 | 37.84 | 10 | 52.0 | 2535.0 | 489.0 |
| 8 | -122.25 | 37.84 | 10 | 52.0 | 3104.0 | 687.0 |
| 9 | -122.26 | 37.84 | 10 | 42.0 | 2555.0 | 665.0 |
| 10 | -122.25 | 37.84 | 10 | 52.0 | 3549.0 | 707.0 |


In [4]:
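# One observation per column, as in the k-means cell
xmatrix = Matrix(X)'
# Full n×n matrix of Euclidean distances between all pairs of houses
# (memory use grows quadratically with the number of rows)
D = pairwise(Euclidean(), xmatrix, xmatrix, dims=2)

# k-medoids works on the precomputed distance matrix rather than on the raw points
K = kmedoids(D, 10)
insertcols!(houses, 3, :medoids_clusters => K.assignments)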

| Row | longitude | latitude | medoids_clusters | cluster10_means | housing_median_age | total_rooms |
| --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Int64 | Int64 | Float64 | Float64 |
| 1 | -122.23 | 37.88 | 7 | 10 | 41.0 | 880.0 |
| 2 | -122.22 | 37.86 | 7 | 10 | 21.0 | 7099.0 |
| 3 | -122.24 | 37.85 | 7 | 10 | 52.0 | 1467.0 |
| 4 | -122.25 | 37.85 | 7 | 10 | 52.0 | 1274.0 |
| 5 | -122.25 | 37.85 | 7 | 10 | 52.0 | 1627.0 |
| 6 | -122.25 | 37.85 | 7 | 10 | 52.0 | 919.0 |
| 7 | -122.25 | 37.84 | 7 | 10 | 52.0 | 2535.0 |
| 8 | -122.25 | 37.84 | 7 | 10 | 52.0 | 3104.0 |
| 9 | -122.26 | 37.84 | 7 | 10 | 42.0 | 2555.0 |
| 10 | -122.25 | 37.84 | 7 | 10 | 52.0 | 3549.0 |


In [8]:
counts(houses[!,:medoids_clusters], houses[!,:cluster10_means]) 

10×10 Matrix{Int64}:
   0  1073    0     0    0     0     0  5965  362     0
  20     0    0     0  397   114     0     0    0     0
   0     0    0     0    0     0  1765     0    0     0
   0     0    0  1276    0     0     0     0    0   411
   0     0  980     1    0     0     0     1  570     0
   0  1874    0     0    0     0   147    25    0     0
   0     0    0     1    0     0     0     0    0  2697
   0     0   76   745    0   145     0     0    0     0
   7     0    0     0    0  1380     0     0    0    10
 519     0    0     0    0     0     0     0    0    79
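
To make the meaning of the matrix concrete, here is a minimal from-scratch sketch on a toy example. The helper name `contingency` and the toy label vectors are hypothetical, and the sketch assumes labels are integers in `1:k`; it is not the library's implementation.

```julia
# Build a contingency matrix for two label vectors (labels assumed to be 1..k).
function contingency(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) || throw(ArgumentError("label vectors must have equal length"))
    M = zeros(Int, maximum(a), maximum(b))
    for (i, j) in zip(a, b)
        M[i, j] += 1   # this point is in cluster i of `a` and cluster j of `b`
    end
    return M
end

# Two hypothetical clusterings of the same six points
a = [1, 1, 1, 2, 2, 3]
b = [2, 2, 1, 1, 1, 3]
M = contingency(a, b)
```

Row sums give the cluster sizes of the first clustering and column sums those of the second. A matrix with exactly one non-zero entry per row and per column means the two clusterings agree up to a relabeling of the clusters.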

## Rand index
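The Rand index measures the similarity between two clusterings of the same data. Mathematically it is related to prediction accuracy, but it is applicable even when the original class labels are not used. In Clustering.jl, `randindex` returns a tuple of four values: the adjusted Rand index (Hubert & Arabie), the Rand index (probability that the two clusterings agree on a pair of points), Mirkin's index (probability that they disagree), and Hubert's index (agreement probability minus disagreement probability).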

In [9]:
randindex(houses[!,:medoids_clusters], houses[!,:cluster10_means])


(0.7338686544611878, 0.9244415207380022, 0.07555847926199778, 0.8488830414760045)
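
The plain (unadjusted) Rand index can be sketched by direct pair counting. The function name `pair_rand_index` and the toy labels are hypothetical; the loop is quadratic in the number of points, so it is meant only to illustrate the definition on small inputs.

```julia
# Fraction of point pairs on which two clusterings agree:
# a pair "agrees" if both clusterings put it together, or both keep it apart.
function pair_rand_index(a::AbstractVector, b::AbstractVector)
    n = length(a)
    n == length(b) || throw(ArgumentError("label vectors must have equal length"))
    n >= 2 || throw(ArgumentError("need at least two points"))
    agree = 0
    for i in 1:n-1, j in i+1:n
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        agree += (same_a == same_b)
    end
    return agree / (n * (n - 1) ÷ 2)
end

# Hypothetical example: the second clustering splits the last cluster of the first
a = [1, 1, 2, 2]
b = [1, 1, 2, 3]
ri = pair_rand_index(a, b)   # 5 of the 6 pairs agree
```

With this definition, Mirkin's index is `1 - ri` and Hubert's index is `2ri - 1`, which matches the relationship between the second, third, and fourth values in the output above.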

## Silhouettes
Silhouettes is a method for evaluating the quality of clustering. Particularly, it provides a quantitative way to measure how well each point lies within its cluster in comparison to the other clusters.

The silhouette value for the $i$-th data point is

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)},$$

where

- $a_i$ is the average distance from the $i$-th point to the other points in the same cluster $z_i$,
- $b_i \stackrel{\text{def}}{=} \min_{k \ne z_i} b_{ik}$, where $b_{ik}$ is the average distance from the $i$-th point to the points in the $k$-th cluster.

Note that $s_i \le 1$, and that $s_i$ is close to $1$ when the $i$-th point lies well within its own cluster. This property allows using `mean(silhouettes(assignments, counts, X))` as a measure of clustering quality. Higher values indicate better separation of clusters w.r.t. point distances.
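
The definition can be traced on a tiny example with a from-scratch sketch. The function name `silhouette_values` and the one-dimensional toy data are hypothetical; the sketch assumes labels in `1:k`, at least two non-empty clusters, and assigns 0 to points in singleton clusters by convention. It is not the library's implementation.

```julia
using Statistics

# Silhouette value of every point, given cluster labels and a distance matrix.
function silhouette_values(assignments::AbstractVector{<:Integer}, D::AbstractMatrix{<:Real})
    n = length(assignments)
    k = maximum(assignments)
    s = zeros(n)
    for i in 1:n
        zi = assignments[i]
        # a_i: mean distance to the other members of the point's own cluster
        own = [j for j in 1:n if assignments[j] == zi && j != i]
        isempty(own) && continue            # singleton cluster: keep s_i = 0
        ai = mean(D[i, j] for j in own)
        # b_i: smallest mean distance to the members of any other cluster
        bi = Inf
        for c in 1:k
            c == zi && continue
            members = [j for j in 1:n if assignments[j] == c]
            isempty(members) && continue
            bi = min(bi, mean(D[i, j] for j in members))
        end
        s[i] = (bi - ai) / max(ai, bi)
    end
    return s
end

# Hypothetical data: two well-separated groups on a line
x = [0.0, 1.0, 10.0, 11.0]
D_toy = [abs(xi - xj) for xi in x, xj in x]
s = silhouette_values([1, 1, 2, 2], D_toy)
```

For the first point, $a_1 = 1$ and $b_1 = 10.5$, so $s_1 = 9.5 / 10.5 \approx 0.905$. The cells that follow use `sum` rather than `mean`; because both clusterings label the same set of points, ranking them by the sum gives the same order as ranking them by the mean.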

In [15]:
sum(silhouettes(houses[!,:cluster10_means], D))

9472.41999682436

In [16]:
sum(silhouettes(houses[!,:medoids_clusters], D))

10011.846774416339

## Variation of Information
Variation of information (also known as shared information distance) measures the distance between two clusterings. It is derived from the mutual information but, unlike mutual information, it is a true metric: it is symmetric and satisfies the triangle inequality. A value of 0 means the two clusterings are identical up to relabeling.

In [21]:
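# The unqualified name `varinfo` fails here, most likely because InteractiveUtils
# (loaded by the REPL and IJulia) also exports a `varinfo`; qualifying the call
# with the module name avoids the clash.
Clustering.varinfo(houses[!,:medoids_clusters], houses[!,:cluster10_means])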
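
As an illustration of the quantity itself, the following sketch computes the variation of information from the identity $VI(A, B) = H(A) + H(B) - 2\,I(A; B)$. All function names and the toy labels are hypothetical, and natural logarithms are an assumption; the library may use a different formulation or scaling.

```julia
# Count how often each label occurs.
function label_counts(labels)
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    return counts
end

# Shannon entropy (natural log) of a labeling.
function entropy_of(labels)
    n = length(labels)
    return -sum(c / n * log(c / n) for c in values(label_counts(labels)))
end

# Mutual information (natural log) between two labelings of the same points.
function mutual_information(a, b)
    n = length(a)
    n == length(b) || throw(ArgumentError("label vectors must have equal length"))
    ca, cb = label_counts(a), label_counts(b)
    joint = label_counts(collect(zip(a, b)))
    return sum(c / n * log(c * n / (ca[x] * cb[y])) for ((x, y), c) in joint)
end

# VI(A, B) = H(A) + H(B) - 2 I(A; B)
variation_of_information(a, b) = entropy_of(a) + entropy_of(b) - 2 * mutual_information(a, b)

# Hypothetical examples
vi_same = variation_of_information([1, 1, 2, 2], [2, 2, 1, 1])   # same partition, relabeled
vi_indep = variation_of_information([1, 1, 2, 2], [1, 2, 1, 2])  # independent partitions
```

The relabeled pair gives a distance of zero, while the independent pair reaches $2 \log 2$, the largest value possible for two clusterings with two equally sized clusters each.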

## V-measure
V-measure can be used to compare clustering results with the existing class labels of the data points or with an alternative clustering. It is defined as the weighted harmonic mean of the homogeneity ($h$) and completeness ($c$) of the clustering:

$$V_\beta = (1 + \beta)\,\frac{h \cdot c}{\beta \cdot h + c}.$$

Both $h$ and $c$ can be expressed in terms of mutual information and entropy measures from information theory. Homogeneity ($h$) is maximized when each cluster contains elements of as few different classes as possible. Completeness ($c$) aims to put all elements of each class into a single cluster. The $\beta$ parameter ($\beta > 0$) can be used to control the weights of $h$ and $c$ in the final measure: if $\beta > 1$, completeness has more weight, and if $\beta < 1$, homogeneity has more weight.

In [19]:
vmeasure(houses[!,:medoids_clusters], houses[!,:cluster10_means]) 


0.8080415648262416
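
The formula can be sketched from a contingency matrix using the common definitions $h = 1 - H(C \mid K) / H(C)$ and $c = 1 - H(K \mid C) / H(K)$, with rows playing the role of classes $C$ and columns the role of clusters $K$. The function names and the toy matrix are hypothetical, and the conventions for zero entropy are assumptions; this is not the library's implementation.

```julia
# Entropy (natural log) of a collection of non-negative counts.
function entropy_from_counts(counts)
    n = sum(counts)
    return -sum(x / n * log(x / n) for x in counts if x > 0)
end

# V-measure from a contingency matrix M (rows: classes C, columns: clusters K).
function vmeasure_sketch(M::AbstractMatrix{<:Integer}; β::Real = 1.0)
    H_C  = entropy_from_counts(vec(sum(M, dims=2)))   # entropy of the row labeling
    H_K  = entropy_from_counts(vec(sum(M, dims=1)))   # entropy of the column labeling
    H_CK = entropy_from_counts(vec(M))                # joint entropy
    # Conditional entropies: H(C|K) = H(C,K) - H(K) and H(K|C) = H(C,K) - H(C)
    h = H_C == 0 ? 1.0 : 1 - (H_CK - H_K) / H_C       # homogeneity
    c = H_K == 0 ? 1.0 : 1 - (H_CK - H_C) / H_K       # completeness
    return (1 + β) * h * c / (β * h + c)              # NaN if h and c are both 0
end

# Hypothetical contingency matrices
v_perfect = vmeasure_sketch([2 0; 0 2])               # identical partitions
M_toy = [1 2 0; 2 0 0; 0 0 1]
v_toy = vmeasure_sketch(M_toy)
```

With `β = 1` the measure is symmetric in the two clusterings, so transposing the matrix leaves the value unchanged; for other values of `β` the order of the arguments matters.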

## Mutual information
Mutual information quantifies the "amount of information" obtained about one random variable through observing the other random variable. It is used in determining the similarity of two different clusterings of a dataset.

From the Clustering.jl documentation:

`mutualinfo(a, b; normed=true) -> Float64`

Compute the mutual information between the two clusterings of the same data points.

`a` and `b` can be either `ClusteringResult` instances or assignments vectors (`AbstractVector{<:Integer}`).

If the `normed` parameter is `true`, the return value is the normalized mutual information (symmetric uncertainty); see "Data Mining: Practical Machine Learning Tools and Techniques", Witten & Frank 2005.

References

Vinh, Epps, and Bailey (2009). "Information theoretic measures for clusterings comparison". Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09).


In [20]:
mutualinfo(houses[!,:medoids_clusters], houses[!,:cluster10_means])


0.8080415648262417
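
This value agrees with the V-measure computed earlier up to floating-point rounding, which is expected rather than coincidental. A short derivation, assuming the standard definitions of homogeneity and completeness, the default $\beta = 1$, and that the normalization is the symmetric uncertainty named in the docstring:

$$
h = 1 - \frac{H(C \mid K)}{H(C)} = \frac{I(C; K)}{H(C)}, \qquad
c = 1 - \frac{H(K \mid C)}{H(K)} = \frac{I(C; K)}{H(K)},
$$

$$
V_1 = \frac{2\,h\,c}{h + c}
    = \frac{2\,I(C; K)^2 / \bigl(H(C)\,H(K)\bigr)}{I(C; K)\,\bigl(H(C) + H(K)\bigr) / \bigl(H(C)\,H(K)\bigr)}
    = \frac{2\,I(C; K)}{H(C) + H(K)}.
$$

The right-hand side is the symmetric uncertainty, so with default parameters the two functions measure the same quantity. They differ once $\beta \ne 1$ or `normed=false` is used.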