In [None]:
## settings 
import numpy as np
import matplotlib.pyplot as plt
import scipy, scipy.stats
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
%matplotlib inline
plt.rcParams['figure.figsize'] = (8.0, 4.0)

$\usepackage{amssymb} \newcommand{\R}{\mathbb{R}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vw}{\vec{w}}$

# 4. Clustering Methods

## 4.1. Relevance for Data Mining

* Clusters are the simplest form of structure in data
* the prevalence of clusters suggests relatedness / affinity and the presence of correlations
* multidimensional clusters can be interpreted as a simple form of rule
* cluster centers offer an economical description of the data (data reduction)

Note that the definition of an objective criterion for the discrimination of clusters is difficult.

## 4.2. Distance Measures 

Starting point: Distance matrix
$$
D= \left[ \begin{array}{cccc}
	d_{11} & d_{12} & \dots & d_{1N} \\
	\vdots & &  & \vdots \\
	d_{N1} & \cdots & \cdots & d_{NN}\\ 
	\end{array} \right] 
$$
* Note that the entries $d_{ij}$ express the dissimilarity (distance) between the corresponding data points $i$ and $j$
* Example: distance between the tastes of different pudding flavours

Requirements for a distance measure ($\forall i,j,k$)
\begin{eqnarray}
d_{ij} & =       & d_{ji} \quad  \text{symmetry}\\
d_{ij} & \geq & 0 \quad ~~ \text{non-negativity}\\
d_{ij}+d_{jk} & \geq & d_{ik} \quad \text{triangle inequality}
\end{eqnarray}
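
The three requirements above can be checked numerically for a given distance matrix. The following is a minimal sketch, assuming $D$ is available as a square NumPy array; the helper name `check_distance_matrix` and the random toy data are illustrative and not part of the lecture material.

```python
import numpy as np

def check_distance_matrix(D, tol=1e-12):
    """Report which of the three requirements the square matrix D satisfies."""
    D = np.asarray(D, dtype=float)
    symmetric = bool(np.allclose(D, D.T, atol=tol))
    non_negative = bool(np.all(D >= -tol))
    # sums[i, j, k] = d_ij + d_jk, to be compared with d_ik for all i, j, k
    sums = D[:, :, None] + D[None, :, :]
    triangle = bool(np.all(sums >= D[:, None, :] - tol))
    return {"symmetry": symmetric,
            "non_negativity": non_negative,
            "triangle": triangle}

# hypothetical toy data: pairwise Euclidean distances of 5 random points
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
D_good = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
result_good = check_distance_matrix(D_good)

# hand-made matrix that violates the triangle inequality: 1 + 1 < 5
D_bad = np.array([[0.0, 1.0, 5.0],
                  [1.0, 0.0, 1.0],
                  [5.0, 1.0, 0.0]])
result_bad = check_distance_matrix(D_bad)
print(result_good)
print(result_bad)
```

The triangle check builds an $N \times N \times N$ array, so it is only practical for small $N$.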

The definition / derivation of meaningful distances $d_{ij}$ from the data depends on the fundamental question of the meaning (semantics) of the data:
* declaring two data points $x_i$ and $x_j$ similar (i.e. assigning a small value of $d_{ij}$) requires a decision about which features are meaningful.
* for that reason, there is no general procedure.

### 4.2.1 Distance measures for data points

Distance measures for real-valued data vectors include:

#### 1. Euclidean Distance 
$$
d(\vx,\vy)= \|\vx-\vy\| = \sqrt{\sum\limits_{i=1}^d (x_i-y_i)^2} 
$$
* simplest, most straightforward, and most frequently used distance measure
* but often insufficient as the following example illustrates
* Example: data set of broomstick features:	
 * $\vx = (\mbox{length in cm}, \text{diameter of the stick in cm})$
 * typical broomsticks {(150, 2.0), (158, 2.1), (165, 2.5), (180, 2.4), (170, 2.2)}
 * a difference of 10 cm in diameter is semantically far more significant than the same difference in length
 * however, (160, 2.0) and (150, 12.0) have exactly the same Euclidean distance to (150, 2.0)!
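
The broomstick example can be reproduced in a few lines. This is a small sketch; the function name `euclidean` and the variable names are chosen here for illustration.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two real-valued vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

reference = (150, 2.0)   # (length in cm, diameter in cm)
longer = (160, 2.0)      # 10 cm longer
thicker = (150, 12.0)    # 10 cm thicker

d_longer = euclidean(reference, longer)
d_thicker = euclidean(reference, thicker)
print(d_longer, d_thicker)  # both distances are 10.0
```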

#### 2. Pearson- or $\chi^2$ distance

$$	d(\vx,\vy)= \sqrt{\sum\limits_i \frac{(x_i-y_i)^2}{\sigma_i^2}} $$
with 
$\sigma_i^2 = \langle(x_i- \langle x_i \rangle)^2\rangle_x$ as scaling factor, i.e. the variance of component $i$ over the data set  
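
As a sketch of how the scaling changes the picture, the formula can be applied to the broomstick data from the Euclidean example. The assumption here is that $\sigma_i^2$ is estimated as the population variance of the five listed broomsticks; the function name `pearson_distance` is illustrative.

```python
import numpy as np

def pearson_distance(x, y, sigma2):
    """Euclidean distance with each squared difference divided by sigma_i^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2 / sigma2))

# broomstick data: (length in cm, diameter in cm)
brooms = np.array([(150, 2.0), (158, 2.1), (165, 2.5), (180, 2.4), (170, 2.2)])
sigma2 = brooms.var(axis=0)  # per-dimension variance (assumption: population variance)

d_longer = pearson_distance((150, 2.0), (160, 2.0), sigma2)
d_thicker = pearson_distance((150, 2.0), (150, 12.0), sigma2)
print(d_longer, d_thicker)
```

With this scaling, the broomstick that is 10 cm thicker ends up much farther from the reference than the one that is 10 cm longer, which matches the semantic expectation.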

* (+) leads to a more balanced weighting of different dimensions for the result
* (-) The Pearson distance assumes uncorrelated vector components $x_i$. This assumption often does not hold
 * Example: repeated features within the vector, e.g. with different units, or basically a 1:1-dependency
	$$ \vx = (\underbrace{u, \dots, u}_{(n-1)\text{ times}}, v)^\tau $$
* The iso-distance surface around a vector $\vx$ is an ellipsoid whose principal axes are aligned with the coordinate axes.
 * A distance measure that removes this restriction is the Mahalanobis distance.

#### 3. Mahalanobis-Distance:
$$
d(\vx,\vy)= \sqrt{(\vx-\vy)^\tau \Sigma^{-1} (\vx-\vy)}
$$
with 
$$
	\Sigma = \langle(\vx-\langle \vx \rangle)(\vx-\langle \vx \rangle)^\tau \rangle 
$$
* This basically scales like the Pearson distance, but after a change into the PCA basis (the eigenbasis of $\Sigma$)
* Iso-distance surfaces around $\vx$ are now rotated ellipsoids (aligned with the variance ellipsoid of the whole data set)
* Example: 
 * w.l.o.g. let $\vy = 0$. 
 * Let's look at the vector along the $i$th principal component with length $\sqrt{\lambda_i}$, so $\vx = \sqrt{\lambda_i}\hat{u}_i$. 
 * With $\Sigma^{-1} = U D^{-1} U^{\tau}$, where the columns of $U$ are the eigenvectors $\hat{u}_i$ and $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, the squared distance is
$$
d^2 = \sqrt{\lambda_i} \hat{u}_i^{\tau} [U D^{-1} U^{\tau}] \sqrt{\lambda_i}\hat{u}_i 
= \sqrt{\lambda_i} \lambda_i^{-1} \sqrt{\lambda_i} = 1 
$$
 * so $d = 1$, and the iso-distance surface for distance 1 extends by $\sqrt{\lambda_i}$ along the eigenvector $\hat{u}_i$
* Attention: Scaling can sometimes even destroy an existing cluster structure 
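
The example above can be verified numerically. The following sketch assumes a synthetic, correlated 2D data set (the mixing matrix `A` and the sample size are arbitrary choices for illustration) and checks that a point at distance $\sqrt{\lambda_i}$ along each principal axis has Mahalanobis distance 1 from the origin.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # solve cov * z = diff instead of forming the inverse explicitly
    return np.sqrt(diff @ np.linalg.solve(cov, diff))

# hypothetical correlated data set
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T
cov = np.cov(X, rowvar=False)

# eigen-decomposition of the covariance: columns of eigvecs are the u_i
eigvals, eigvecs = np.linalg.eigh(cov)

# points at sqrt(lambda_i) along each principal axis, measured from y = 0
dists = [mahalanobis(np.sqrt(lam) * eigvecs[:, i], np.zeros(2), cov)
         for i, lam in enumerate(eigvals)]
print(dists)  # each value is 1 up to rounding error
```

SciPy also provides `scipy.spatial.distance.mahalanobis(u, v, VI)`, which expects the inverse covariance matrix `VI` as its third argument.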

#### 4. City-Block Distance
$$
d(\vx,\vy)= \sum\limits_{i=1}^d |y_i-x_i| 
$$
* the distance to be traveled if streets and avenues are perpendicular, as in Manhattan (hence also called Manhattan distance)

#### 5. Supremum Distance
$$
d(\vx,\vy)= \max_{i=1 \dots d} |y_i-x_i| 
$$
* i.e. the largest of the terms that are summed up in the city-block distance.

#### 6. Minkowski-Distance 
$$
d(\vx,\vy)= \left\{ \sum\limits_{i=1}^d |y_i-x_i|^p \right\}^{1/p} 
$$
Some of the above distance measures arise as special cases, namely:
* $p=1$:  City-Block distance
* $p=2$:  Euclidean distance 
* $p\to \infty$:  Supremum distance
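
The special cases can be illustrated with a short sketch. The function name `minkowski` and the test vectors are chosen here for illustration; $p = \infty$ is handled explicitly as the supremum distance.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return np.sum(diff ** p) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)

d_city = minkowski(x, y, 1)        # |3| + |4| = 7
d_eucl = minkowski(x, y, 2)        # sqrt(9 + 16) = 5
d_sup = minkowski(x, y, np.inf)    # max(3, 4) = 4
d_large_p = minkowski(x, y, 50)    # already very close to the supremum distance
print(d_city, d_eucl, d_sup, d_large_p)
```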

**Remarks:**
* All of the above distance measures implicitly assume the topology of $\R^n$
* They are not suitable to represent angles (which have the topology of a circle)
* Different topologies can be tackled by embedding into a suitable $\R^m$
* Example: 
 * $\phi_1 = 0$ and $\phi_2 = 2\pi$ represent the same angle, but have a numeric distance of $2\pi$.
 * Embedding the angle variable into a 2D-space by $(\cos(\phi), \sin(\phi))$ gives a representation where this problem does not occur anymore.
* Dealing with nominal attributes
 * if a fixed number (e.g. $K$) of alternative values are given, the variable can be embedded into a $K$-dimensional vector space, e.g. 
$$
	\{\mbox{vanilla, chocolate, strawberry}\}\Rightarrow \left\{ \left( \begin{array}{c}1\\0\\0\end{array}
	\right),\left( \begin{array}{c}0\\1\\0\end{array} \right),\left( \begin{array}{c}0\\0\\1\end{array} \right)\right\} 
$$
 * Using a numbering $\{V, C, S\} \Rightarrow \{1,2,3\}$ would instead induce an ordering (e.g. $d(V,C)<d(V, S)$)

 * Trick: if there are many alternative values, the embedding space becomes impractically high-dimensional. A sometimes acceptable compromise is then to use random projections: select $K$ random vectors of length 1 in a vector space $\R^L,~L < K$. If $L$ is large enough, the vectors are approximately orthogonal to each other and therefore more or less uncorrelated.
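
The three embeddings from the remarks can be sketched as follows. All function names are illustrative, and the choice $K = 1000$, $L = 200$ for the random projection is an arbitrary assumption made only to show the effect.

```python
import numpy as np

def embed_angle(phi):
    """Embed an angle into 2D so that phi and phi + 2*pi coincide."""
    return np.array([np.cos(phi), np.sin(phi)])

def one_hot(value, categories):
    """Embed a nominal value as a unit vector in R^K, K = len(categories)."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def random_unit_vectors(K, L, seed=0):
    """Draw K random vectors of length 1 in R^L (Gaussian, then normalised)."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(K, L))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

# angles 0 and 2*pi map to (numerically) the same point
d_angle = np.linalg.norm(embed_angle(0.0) - embed_angle(2 * np.pi))

# all pairs of distinct flavours are equally far apart, so no ordering is induced
flavours = ["vanilla", "chocolate", "strawberry"]
d_vc = np.linalg.norm(one_hot("vanilla", flavours) - one_hot("chocolate", flavours))
d_vs = np.linalg.norm(one_hot("vanilla", flavours) - one_hot("strawberry", flavours))

# random projection: K = 1000 categories represented in only L = 200 dimensions
V = random_unit_vectors(K=1000, L=200)
G = V @ V.T                                  # pairwise dot products
off_diag = np.abs(G[~np.eye(len(V), dtype=bool)])
print(d_angle, d_vc, d_vs, off_diag.mean())
```

The mean absolute dot product between different random vectors is small compared to 1, which is what "approximately orthogonal" means here; it shrinks further as $L$ grows.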
 

### 4.2.2. Distance measures between clusters

Many clustering methods require a distance between clusters.
Let $X$ and $Y$ be clusters containing $N_X$ and $N_Y$ data points, respectively. Frequently used distance measures are:

\begin{eqnarray}
d_1(X,Y) =  \min\limits_{\vx \in X \atop \vy \in Y} d(\vx,\vy) && \mbox{minimal distance} \\[4mm]
d_2(X,Y) =  \max\limits_{\vx \in X \atop \vy \in Y} d(\vx,\vy) && \mbox{maximal distance} \\[4mm]
d_3(X,Y) =  \frac{1}{N_X N_Y} \sum\limits_{\vx \in X \atop \vy \in Y} d(\vx,\vy) && \mbox{average distance}\\[4mm]
d_4(X,Y) =  d\left(\frac{1}{N_X} \sum\limits_{\vx \in X}\vx,
	 \frac{1}{N_Y} \sum\limits_{\vy \in Y}\vy \right) && \mbox{centroid distance}
\end{eqnarray}

<img src="images/cluster-distances.png" width="50%">
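
The four cluster distances can be computed directly from the matrix of pairwise point distances. This is a minimal sketch, assuming the Euclidean distance as the underlying point distance $d(\vx,\vy)$; the function names and the two toy clusters are illustrative.

```python
import numpy as np

def pairwise(X, Y):
    """Matrix of Euclidean distances between all points of X and all points of Y."""
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

def cluster_distances(X, Y):
    """Minimal, maximal, average and centroid distance between clusters X and Y."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    P = pairwise(X, Y)
    return {
        "minimal": P.min(),
        "maximal": P.max(),
        "average": P.mean(),   # sum over all pairs divided by N_X * N_Y
        "centroid": np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)),
    }

# two hypothetical clusters on a line
X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[3.0, 0.0], [5.0, 0.0]])
d = cluster_distances(X, Y)
print(d)  # minimal 2.0, maximal 5.0, average 3.5, centroid 3.5
```

For larger clusters, `scipy.spatial.distance.cdist` computes the same pairwise matrix and supports other point distances via its `metric` argument.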