# Unsupervised Learning
<br>
In this case your data *D* is not labeled, so what can you do with it?
* learn structure in the data, such as groups or clusters
* help find features; for example, find the "centroids" of clusters to use as prototypes, then use the similarity to each centroid as a feature.
<br>

Unsupervised learning relies on similarity or dissimilarity measures. These can be defined:
* between two points
* between a point and a cluster
* between two clusters
<br><br>

### Dissimilarity Measures
<br>
Let $\delta(x_{ij}, x_{i'j})$ denote the dissimilarity between points $x_i$ and $x_{i'}$ on feature $j$. Some common dissimilarity measures are:
* squared Euclidean distance, i.e. $(x_{ij} - x_{i'j})^2$ (summing over the features $j$ and taking the square root gives the Euclidean distance)
* l1 norm, i.e. $|x_{ij} - x_{i'j}|$
* for categorical features there is the Hamming distance, i.e. the number of features that differ: $\sum_{j=1}^D \mathbb{1}[x_{ij} \ne x_{i'j}]$
* any monotonically increasing function of a distance such as those above
<br>
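
The measures above can be sketched in a few lines of plain Python. This is only an illustration: the function names are made up for this example, and points are assumed to be equal-length sequences of feature values.

```python
import math

def squared_euclidean(a, b):
    """Sum over features of the squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(squared_euclidean(a, b))

def l1_distance(a, b):
    """l1 norm of the difference: sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Hamming distance: number of features whose values differ."""
    return sum(x != y for x, y in zip(a, b))
```

For example, `euclidean([0, 0], [3, 4])` and `l1_distance([0, 0], [3, 4])` compare the same pair of numeric points under two different measures, while `hamming` works on categorical values such as strings.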

### Similarity Measures
<br>
Likewise, a similarity measure $s(x_i, x_{i'})$ can be:
* any monotonically decreasing function of a dissimilarity, ex: $e^{-\frac{d_{ii'}^2}{\sigma^2}}$, where $d_{ii'}$ is the distance between $x_i$ and $x_{i'}$
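* for binary features, the fraction of shared features, i.e. the number of features present in both points divided by the number present in at least one of them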
  $$s(x_i,x_{i'}) = \frac{x_i^Tx_{i'}}{x_i^Tx_i + x_{i'}^Tx_{i'} - x_i^Tx_{i'}}$$
* for features with physical dependence (temporal or spatial), there is the Pearson correlation coefficient, where $\bar{x}_i = \frac{1}{D}\sum_{j=1}^D x_{ij}$ is the mean of $x_i$ over its features
  $$r_{ii'} = \frac{\sum_{j=1}^D (x_{ij} - \bar{x}_i)(x_{i'j}-\bar{x}_{i'})}{[\sum_{j=1}^D (x_{ij} - \bar{x}_i)^2 \sum_{j=1}^D(x_{i'j}-\bar{x}_{i'})^2]^{1/2}} $$ 
<br><br>
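
As a rough sketch, the three similarity measures above could be written in plain Python as follows. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this example (the measures are undefined in that case).

```python
import math

def gaussian_similarity(d, sigma=1.0):
    """Decreasing function of a distance d: exp(-d^2 / sigma^2)."""
    return math.exp(-(d ** 2) / (sigma ** 2))

def binary_similarity(a, b):
    """Shared-feature fraction for 0/1 vectors: a.b / (a.a + b.b - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Assumption: treat two all-zero vectors as having similarity 0.0.
    return dot / denom if denom else 0.0

def pearson(a, b):
    """Pearson correlation of two points across their D features."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # Assumption: return 0.0 when either point is constant across features.
    return num / den if den else 0.0
```

Note that `gaussian_similarity` equals 1 at distance 0 and decays toward 0 as the distance grows, with `sigma` controlling how quickly.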

A common tool used for unsupervised learning is the "[Dendrogram](https://en.wikipedia.org/wiki/Dendrogram)", which illustrates the hierarchical clustering of a data set under a given dissimilarity measure. There are two approaches for building a dendrogram:
* agglomerative (bottom up)
* divisive (top down)
<br><br>

### Hierarchical Agglomerative Clustering (HAC)
Let $\gamma_{jk}$ be the distance or dissimilarity between clusters $C_j$ and $C_k$, and let $\hat{K}$ be the current number of clusters.
1. Choose a halting condition (H.C.)
2. Initialize $\hat{K}$=N, the iteration counter m=1, and clusters $C_i$ = {$x_i$}, i.e. each point is its own cluster
3. Repeat steps 4-8 until H.C. is met
4. Find the nearest (most similar) pair of clusters, $\gamma' = \min_{j \ne k} \gamma_{jk}$
5. If H.C. is based on $\gamma'$, test for it and halt if true
6. Apply the merge rule: merge $C_j$ and $C_k$ into a new cluster $C_l$
7. Update $\hat{K}$ = $\hat{K} - 1$, m = m+1
8. If H.C. is based on $\hat{K}$, test for it and halt if true
9. Output the final clusters $C_l$, $l = 1, 2, \dots, \hat{K}_{final}$
    * if $\hat{K}_{final}$=1 then the resulting hierarchy is a dendrogram
<br>
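
The steps above can be sketched as a small, unoptimized Python function. This is a minimal illustration under stated assumptions, not a reference implementation: the names `hac`, `k_final`, `gamma_max` and `single_link` are invented here, the default cluster distance is the minimum pairwise Euclidean distance, and the halting condition is either a target number of clusters or a threshold on $\gamma'$.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(cj, ck):
    """Smallest pairwise distance between the points of two clusters."""
    return min(euclidean(a, b) for a in cj for b in ck)

def hac(points, k_final=1, gamma_max=None, linkage=single_link):
    """Merge the closest pair of clusters until a halting condition is met.

    Halts when k_final clusters remain, or (if gamma_max is given) when
    the closest pair is farther apart than gamma_max.
    """
    clusters = [[p] for p in points]          # step 2: one cluster per point
    merges = []                               # (gamma', C_j, C_k) per merge
    while len(clusters) > max(k_final, 1):    # step 8: H.C. based on K
        # Step 4: find the closest pair of clusters.
        gamma, j, k = min(
            (linkage(clusters[j], clusters[k]), j, k)
            for j in range(len(clusters))
            for k in range(j + 1, len(clusters))
        )
        # Step 5: H.C. based on gamma'.
        if gamma_max is not None and gamma > gamma_max:
            break
        # Steps 6-7: merge C_j and C_k; the cluster count drops by one.
        merged = clusters[j] + clusters[k]
        merges.append((gamma, clusters[j], clusters[k]))
        clusters = [c for i, c in enumerate(clusters) if i not in (j, k)]
        clusters.append(merged)
    return clusters, merges
```

Running `hac(points)` with the default `k_final=1` merges everything into one cluster, and the returned `merges` list records the order and distance of each merge, which is the information a dendrogram displays. Recomputing every pairwise cluster distance on each pass is simple but slow for large data sets.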

Some useful dissimilarity measures between clusters $C_j$ and $C_k$ include:
* $\gamma_{mean}(C_j,C_k) = ||\mu_j - \mu_k||_2$, where $\mu_j$ is the mean (centroid) of the points in $C_j$
* $\gamma_{min}(C_j,C_k) = \min_{x \in C_j, x' \in C_k} ||x - x'||_2$
  * This measure is used by the **"Nearest Neighbor"** (single linkage) algorithm
* $\gamma_{max}(C_j,C_k) = \max_{x \in C_j, x' \in C_k} ||x - x'||_2$
  * This measure is used by the **"Farthest Neighbor"** (complete linkage) algorithm
* $\gamma_{average}(C_j,C_k) = \frac{1}{N_jN_k}\sum_{x \in C_j} \sum_{x' \in C_k} ||x - x'||_2$, where $N_j$ is the number of points in $C_j$
<br>
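
To make the four cluster-to-cluster measures concrete, here is a hedged sketch in plain Python. Clusters are assumed to be non-empty lists of equal-length numeric tuples, and the function names simply mirror the subscripts used above.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Per-feature mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(feature) / n for feature in zip(*cluster))

def gamma_mean(cj, ck):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(cj), centroid(ck))

def gamma_min(cj, ck):
    """Nearest neighbor: smallest pairwise distance."""
    return min(euclidean(a, b) for a in cj for b in ck)

def gamma_max(cj, ck):
    """Farthest neighbor: largest pairwise distance."""
    return max(euclidean(a, b) for a in cj for b in ck)

def gamma_average(cj, ck):
    """Mean of all N_j * N_k pairwise distances."""
    total = sum(euclidean(a, b) for a in cj for b in ck)
    return total / (len(cj) * len(ck))
```

For any pair of clusters, `gamma_min` is never larger than `gamma_average`, which in turn is never larger than `gamma_max`, so the choice of measure changes which pair of clusters looks closest.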

### Resources
* Complete linkage clustering example: https://onlinecourses.science.psu.edu/stat555/node/86
* NLP HAC example: https://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html
* Great post on unsupervised learning: https://sdsawtelle.github.io/blog/output/week8-andrew-ng-machine-learning-with-python.html