# Clustering with High-Dimensional Data
Given $x^{(i)} \in \mathbb{R}^p$, group the data points so that observations within a group are similar and observations in different groups are dissimilar.

<hr>

**K-Means**<br>

Starts with a pre-selected number of clusters, $K$, and minimizes the **within group sums of squares** (WGSS)

$WGSS = \sum_{k = 1}^{K} \sum_{x^{(i)}, x^{(j)} \in C_k} d(x^{(i)}, x^{(j)})^2 = \sum_{k = 1}^{K} 2 N_k \sum_{x^{(i)} \in C_k} \lVert x^{(i)} - \mu_k \rVert_2^2$

where $k$ indexes the $K$ clusters, $C_k$ denotes the $k$-th cluster, $\mu_k$ is its centroid (mean), $d(x^{(i)}, x^{(j)})$ is the Euclidean distance between two data points, and $N_k$ is the number of observations in cluster $k$. The first sum runs over all ordered pairs of points in $C_k$.

This is the same as maximizing the **between group sum of squares** (BGSS)
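The equality between the pairwise form and the centroid form of the WGSS can be checked numerically for a single cluster. The snippet below is a minimal sketch in plain Python; the helper names (`centroid`, `pairwise_ss`, `centroid_ss`) and the toy points are illustrative, and the pairwise sum is assumed to run over all ordered pairs.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def pairwise_ss(points):
    """Sum of squared Euclidean distances over all ordered pairs (i, j)."""
    return sum(math.dist(a, b) ** 2 for a in points for b in points)


def centroid_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    mu = centroid(points)
    return sum(math.dist(x, mu) ** 2 for x in points)


# Toy cluster: the four corners of a square with side 2.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

lhs = pairwise_ss(cluster)                     # pairwise form
rhs = 2 * len(cluster) * centroid_ss(cluster)  # 2 * N_k * centroid form

print(math.isclose(lhs, rhs))  # the two forms agree up to rounding
```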

*Algorithm*:
1. Initialize the $K$ means $\{\mu_k\}_{k = 1, \dots, K}$ to random positions
2. Repeat the two steps below until the algorithm converges, i.e. the cluster assignments stop changing
    1. Cluster assignment: Assign each point to the closest centroid $\mu_k$
    2. Centroids update: Update all centroids $\mu_k$ based on all the data points assigned to $C_k$
    
$K$-means is not guaranteed to converge to the global minimum: the result depends on the random initialization of the cluster centroids. To improve the chance of finding the global minimum, run the algorithm several times with different initial centroids and keep the run with the lowest WGSS.
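The algorithm and the restart strategy above can be sketched in plain Python. This is an illustration under stated assumptions rather than a reference implementation: centroids are initialized at randomly chosen data points, an empty cluster keeps its previous centroid, and the quantity compared across restarts is the sum of squared distances to the assigned centroid. The names `lloyd`, `kmeans_restarts` and `n_init` are hypothetical.

```python
import math
import random


def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def lloyd(points, k, rng, max_iter=100):
    """One K-means run from a random start. Returns (centroids, labels, sse)."""
    centroids = rng.sample(points, k)  # step 1: start at k distinct data points
    labels = None
    for _ in range(max_iter):
        # Step 2.1: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Step 2.2: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


def kmeans_restarts(points, k, n_init=10, seed=0):
    """Run several random starts and keep the one with the lowest objective."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(n_init)), key=lambda run: run[2])


# Two well-separated toy groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, sse = kmeans_restarts(data, k=2)
print(labels, round(sse, 3))
```

Which cluster receives index 0 depends on the random start, so only the grouping of the labels is meaningful, not their numeric values.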

*Limitations*:
- Sensitive to local outliers
- Cluster centroids are not necessarily data points
- Assumes $\Sigma_k = \sigma^2 \cdot I$, i.e. spherical clusters with equal variance in every dimension and zero covariances between dimensions

The first two limitations can be addressed by using *medoids* instead of *means*: medoids are more robust to outliers and are always actual data points.
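The difference between a mean and a medoid is easiest to see on a cluster that contains one outlier. The sketch below assumes the medoid is the data point with the smallest total Euclidean distance to all other points in the cluster; `mean` and `medoid` are illustrative helper names and the points are made up.

```python
import math


def mean(points):
    """Component-wise mean; generally not one of the data points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def medoid(points):
    """Data point with the smallest total distance to all points in the cluster."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Three nearby points plus one far-away outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]

center_mean = mean(cluster)      # dragged towards the outlier
center_medoid = medoid(cluster)  # stays inside the dense group

print(center_mean, center_medoid)
```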

*Methods for choosing $K$*
1. Elbow plot
    - Plot WGSS vs $K$
    - Choose $K$ at the "elbow" of the curve, i.e. where the decrease in WGSS slows sharply and adding more clusters gives little further reduction
    
    
2. Downstream accuracy
    - Choose $K$ such that the error of a downstream task is minimized, e.g. RMSE
  
  
3. Business constraints
    - Let business requirements decide the number of clusters, e.g. marketing resources are only available for $K$ campaigns


<hr>

**Gaussian Mixture Models (GMM)**<br>
Considers the covariance of the dimensions and computes a soft probabilistic assignment of each observation to a cluster rather than a hard assignment.

$P(x) = \sum_{k=1}^{K} P(C_k) P(x | C_k)$

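Here $P(C_k)$ is the mixing weight of cluster $k$ and $P(x | C_k) = \mathcal{N}(x; \mu_k, \Sigma_k)$ is its Gaussian density. The parameters $P(C_k), \mu_k, \Sigma_k$ are usually estimated with the **Expectation-Maximization** (EM) algorithm.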

*Algorithm*:
1. **E-step**: Compute $P(C_k | x^{(i)}) = \frac{P(C_k) \cdot P(x^{(i)} | C_k)}{P(x^{(i)})}$

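2. **M-step**: Update the parameter estimates using the probabilities from the E-step

    $P(C_k) = \frac{1}{n} \sum_{i=1}^{n} P(C_k|x^{(i)})$
    
    $\mu_k  = \frac{\sum_{i=1}^{n} P(C_k | x^{(i)}) \, x^{(i)}}{\sum_{i=1}^{n} P(C_k | x^{(i)})}$

    $\Sigma_k = \frac{\sum_{i=1}^{n} P(C_k | x^{(i)}) \, (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T}{\sum_{i=1}^{n} P(C_k | x^{(i)})}$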


3. Repeat steps 1 and 2 until the log-likelihood of the data no longer changes noticeably


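The number of clusters can be chosen by maximizing the *Bayesian Information Criterion* (BIC), defined here as

$BIC = \log L - \frac{\log(n)}{2} \cdot (\text{number of parameters})$

where $L$ is the maximized likelihood and $n$ is the number of observations.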
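The EM iteration and the BIC comparison can be sketched together. To keep the example short and dependency-free it assumes one-dimensional data, so each $\Sigma_k$ reduces to a scalar variance updated by the usual responsibility-weighted variance. The quantile-based initialization, the variance floor `var_floor`, and the function names are illustrative choices, and the parameter count $3K - 1$ applies only to this 1-D case.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, weights, mus, variances):
    """Sum over observations of log P(x) under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)))
        for x in xs
    )


def fit_gmm_1d(xs, k, max_iter=200, tol=1e-9, var_floor=1e-6):
    """EM for a 1-D mixture of k Gaussians. Returns (weights, mus, variances, loglik)."""
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    weights = [1.0 / k] * k
    # Assumption: deterministic start at evenly spaced quantiles of the data.
    mus = [ordered[int((c + 0.5) * n / k)] for c in range(k)]
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    loglik = log_likelihood(xs, weights, mus, variances)
    for _ in range(max_iter):
        # E-step: responsibilities P(C_k | x) via Bayes' rule.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: responsibility-weighted updates of weights, means, variances.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / n
            mus[c] = sum(r[c] * x for r, x in zip(resp, xs)) / mass
            weighted_sq = sum(r[c] * (x - mus[c]) ** 2 for r, x in zip(resp, xs))
            variances[c] = max(weighted_sq / mass, var_floor)  # avoid collapse to zero
        new_loglik = log_likelihood(xs, weights, mus, variances)
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
    return weights, mus, variances, loglik


def bic(loglik, n, k):
    """BIC in the 'larger is better' form used in these notes (1-D mixture)."""
    n_params = 3 * k - 1  # (k - 1) free weights + k means + k variances
    return loglik - math.log(n) / 2 * n_params


# Toy data: one group near 0 and one near 10.
xs = [-0.3, -0.1, 0.0, 0.2, 0.4, 9.6, 9.9, 10.0, 10.1, 10.3]

weights1, mus1, variances1, ll1 = fit_gmm_1d(xs, k=1)
weights2, mus2, variances2, ll2 = fit_gmm_1d(xs, k=2)

print("BIC K=1:", bic(ll1, len(xs), 1))
print("BIC K=2:", bic(ll2, len(xs), 2))
```

Some libraries report BIC with the opposite sign convention, where lower is better, so it is worth checking which form is in use before comparing values.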

<hr>

**Hierarchical Clustering**<br>

1. **Agglomerative clustering**
    
    Build up clusters from individual observations until there is only one cluster at the top of the tree.
    
    Starts with 1 data point per cluster and, at each subsequent stage, merges the pair of clusters that are closest together according to a dissimilarity measure between clusters.
    
    The merging is depicted by a tree, known as a [dendrogram](https://en.wikipedia.org/wiki/Dendrogram). The bottom-most level has $n$ clusters (of 1 observation each) and as merging occurs, the number of clusters decreases and the top-most level has only 1 cluster.
    
    <img alt="Hierarchical Clustering" src="assets/dissimilarity_vs_clusters.png" width="300">
    
    To choose which pair of clusters to merge at each stage, we need to define a dissimilarity measure between clusters. (See below for examples)
    
    
2. **Divisive clustering**

    Start with all observations in a single cluster and recursively split off clusters.
    





Examples of distance metrics between clusters:
- Single linkage (minimum distance)

    $d(C_r, C_s) = \displaystyle \min_{\substack{x^{(i)} \in C_r, x^{(j)} \in C_s}} d(x^{(i)}, x^{(j)})$
    
    
- Complete linkage (maximum distance) (*Default*)

    $d(C_r, C_s) = \displaystyle \max_{\substack{x^{(i)} \in C_r, x^{(j)} \in C_s}} d(x^{(i)}, x^{(j)})$
    

- Average linkage (average distance)

    $d(C_r, C_s) = \frac{1}{n_r} \frac{1}{n_s} \sum_{x^{(i)} \in C_r} \sum_{x^{(j)} \in C_s} d(x^{(i)}, x^{(j)})$
    
    
<img alt="Clustering by various distance metrics" src="assets/distance_metrics.png" width="300">
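The three linkage formulas above translate directly into code, and plugging one of them into a naive merging loop gives a small agglomerative clustering sketch. This is an illustration only: it recomputes every pairwise cluster distance at each stage, which is far slower than what a library would do, and the names `agglomerate` and `merges` are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Minimum distance between any point in a and any point in b."""
    return min(math.dist(x, y) for x in a for y in b)


def complete_linkage(a, b):
    """Maximum distance between any point in a and any point in b."""
    return max(math.dist(x, y) for x in a for y in b)


def average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(points, linkage):
    """Merge the closest pair of clusters until one remains.

    Returns the merges in order as (height, cluster_a, cluster_b), where
    height is the linkage distance at which the two clusters were joined.
    """
    clusters = [[p] for p in points]  # start with one point per cluster
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((linkage(clusters[i], clusters[j]), clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
heights_single = [h for h, _, _ in agglomerate(points, single_linkage)]
heights_complete = [h for h, _, _ in agglomerate(points, complete_linkage)]
heights_average = [h for h, _, _ in agglomerate(points, average_linkage)]
print(heights_single, heights_complete, heights_average)
```

The merge heights are what a dendrogram draws on its vertical axis, so the same list of merges is enough to reconstruct the tree.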


Choosing the number of clusters:
- Find the largest vertical drop in the tree, i.e. the largest gap in dissimilarity between two successive merges, and cut the dendrogram there

<img alt="Dendrogram" src="assets/dendrogram.png" width="300">
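One way to automate this rule, reading the "largest vertical drop" as the largest gap between successive merge heights, is sketched below. The function name and the example heights are hypothetical, and the sketch assumes the merge heights are non-decreasing, as they are for single, complete and average linkage.

```python
def clusters_from_largest_gap(merge_heights, n_points):
    """Pick the number of clusters by cutting the dendrogram in its largest gap.

    merge_heights: dissimilarity at which each of the n_points - 1 merges happened.
    Returns (number_of_clusters, cut_height).
    """
    heights = sorted(merge_heights)
    gaps = [upper - lower for lower, upper in zip(heights, heights[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    cut_height = (heights[widest] + heights[widest + 1]) / 2
    merges_below_cut = widest + 1
    # Every merge below the cut reduces the number of clusters by one.
    return n_points - merges_below_cut, cut_height


# Hypothetical merge heights for 5 observations: three cheap merges, one expensive.
n_clusters, cut_height = clusters_from_largest_gap([1.0, 1.2, 1.5, 9.0], n_points=5)
print(n_clusters, cut_height)
```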


<hr>

**Density-based spatial clustering of applications with noise (DBSCAN)**

Cluster points that are close to each other in a dense region and leave out points that are in low density regions.

To perform DBSCAN, two parameters are required:
1. $\epsilon$, the maximum distance between two connected points
2. $k$, the core strength, i.e. the minimum number of other points a core point must be connected to

Two points are connected if they are within a distance $\epsilon$ of one another. Two points are placed into the same cluster *iff* there is a connecting path between them consisting only of core points (points that are connected to at least $k$ other points), except possibly at the ends of the path.

In the figure below, the blue points are core points for core strength $k = 4$. In each cluster, each non-core (black) point is connected to a core point (blue). The non-connected points are *outliers*.

<img alt="DBSCAN Core Points" src="assets/dbscan_core_points.png" width="300">
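The definitions above map onto a short graph traversal. The sketch below follows the wording of these notes: a core point must be connected to at least $k$ *other* points, and "within a distance $\epsilon$" is taken to include equality. Both are assumptions, and library implementations may count the point itself. A non-core point reachable from two clusters is given to whichever cluster reaches it first, and outliers are labelled `None`.

```python
import math


def dbscan(points, eps, k):
    """Label each point with a cluster id, or None for outliers.

    Returns (labels, is_core).
    """
    n = len(points)
    # Connections: all other points within distance eps.
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= k for nb in neighbors]
    labels = [None] * n
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] is not None:
            continue  # clusters are only grown from unvisited core points
        labels[start] = cluster_id
        stack = [start]  # holds core points whose connections are still to be explored
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    if is_core[j]:  # only core points extend the path
                        stack.append(j)
        cluster_id += 1
    return labels, is_core


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # dense group A
    (2.2, 0.0),                                              # non-core point attached to A
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # dense group B
    (5.0, 5.0),                                              # isolated point
]
labels, is_core = dbscan(points, eps=1.5, k=3)
print(labels)
```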


<hr>

**Evaluating quality of clustering: Silhouette Plots**

Compute for each sample $x^{(i)}$:

- $a(x^{(i)}) =$ Average dissimilarity between $x^{(i)}$ and all other points in its cluster

- $b(x^{(i)}) =$ Average dissimilarity between $x^{(i)}$ and the points of the closest cluster it does not belong to

- $S(x^{(i)}) \in [-1, 1]$ with
    
    $S(x^{(i)}) = \frac{(b(x^{(i)}) - a(x^{(i)}))}{\max(a(x^{(i)}), b(x^{(i)}))}$
    

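Average $S(x^{(i)})$ over all points to get the average silhouette width of the clustering. For an individual point:

$S(x^{(i)})$ close to 1: well-clustered; $S(x^{(i)})$ close to 0: badly clustered (on the border between two clusters); $S(x^{(i)}) < 0$: wrongly clustered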

<img alt="Silhouette Plot" src="assets/silhouette_plot.png" width="300">
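The silhouette quantities can be computed directly from their definitions. The sketch below uses Euclidean distance as the dissimilarity and assigns $S = 0$ to points in singleton clusters, which is a common convention but an assumption here. It also assumes at least two clusters; `silhouette_values` is an illustrative name.

```python
import math


def silhouette_values(points, labels):
    """Return S(x) for every point, given one cluster label per point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a: average distance to the other points in the same cluster
        # (the point's distance to itself is 0, so dividing by len - 1 excludes it).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        values.append((b - a) / max(a, b))
    return values


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_values(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_values(points, [0, 1, 0, 1])   # labels cut across the groups

print(sum(good) / len(good), sum(bad) / len(bad))
```

The first labelling gives values close to 1 for every point, while the second gives negative values, matching the interpretation of well-clustered versus wrongly clustered points.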


# Basic code
A `minimal, reproducible example`