# Hierarchical Clustering

- Hierarchical clustering groups our data so that we can better understand how it is distributed
- Specifically, it is used to build intuitive heat maps that place similar features or data points next to each other
- Unlike k-means clustering, we do not have to specify the number of clusters in advance. The hierarchical clustering algorithm builds a full hierarchy of merges, and we can choose the number of clusters afterwards by deciding where to cut it

- As a running example, assume that we are working with several blood samples, each described by a set of gene features

### The steps followed in hierarchical clustering are as follows:
1. We find the similarity (the inverse of distance) between the first sample and each of the remaining ones
2. Similarly, we find the similarity between every sample and every other sample
3. Out of all these pairs, we find the most similar pair of samples and group them into a cluster
4. We repeat steps 1 - 3, treating the cluster as a new sample by itself
5. Each repetition merges two samples or clusters, so we eventually end up with a single cluster containing every sample, unless we stop earlier at a chosen number of clusters (for example, 2)
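
The steps above can be sketched in plain Python. This is only a minimal illustration, not how a library implements it: the helper names `single_link_distance` and `agglomerate` are made up here, and the sketch assumes Euclidean distance with the closest pair of points deciding how far apart two clusters are. The sample points are the ones used in the Example section.

```python
import math
from itertools import combinations


def single_link_distance(a, b, points):
    """Smallest Euclidean distance between a point of cluster a and a point of cluster b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of sample indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]  # every sample starts as its own cluster
    history = []
    while len(clusters) > n_clusters:
        # Steps 1-2: compare every cluster with every other cluster.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link_distance(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        # Step 3: group the closest pair into one cluster.
        merged = clusters[a] + clusters[b]
        history.append((clusters[a], clusters[b]))
        # Step 4: the merged cluster replaces its two parts and is treated as a new sample.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, history


points = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
clusters, history = agglomerate(points, n_clusters=2)
print(clusters)      # two groups of sample indices
print(len(history))  # number of merges performed
```

Recomputing every pairwise distance on each pass keeps the sketch short but is slow for large data sets.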

### What is meant by similarity?
- Similarity can be any measure, chosen based on the problem, that defines how close two samples are
- Most commonly we use Euclidean distance as the measure, i.e. the shorter the Euclidean distance between two samples, the closer or more similar they are
- There are many other types of distance measures, such as
    1. Manhattan distance
    2. Minkowski distance
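
To make the measures concrete, here is a small sketch in plain Python. The function names and the two sample vectors are chosen for illustration only. Minkowski distance with parameter `p` reduces to Manhattan distance for `p = 1` and to Euclidean distance for `p = 2`, so the other two are written in terms of it.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)


# Two hypothetical samples with two features each.
a, b = (1, 2), (4, 6)
print(manhattan(a, b))     # |1 - 4| + |2 - 6| = 7.0
print(euclidean(a, b))     # sqrt(3**2 + 4**2) = 5.0
print(minkowski(a, b, 3))  # lies below the Euclidean value
```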

### After grouping the data points into a cluster, how do we compare it with other clusters/data points?
We can use several different measures to compare clusters:
1. We can measure the distance to the centroid of the existing cluster (the centroid is the mean of its points)
2. We can find distances based on the closest point in a cluster (single linkage)
3. We can find distances based on the farthest point in a cluster (complete linkage)
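
The three options can give different answers for the same pair of clusters. The sketch below compares them on two small clusters. It assumes Euclidean distance between points, and the function names and the example clusters are illustrative rather than taken from any library.

```python
import math


def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)


def centroid_distance(a, b):
    """Distance between the mean points of the two clusters."""
    def centroid(cluster):
        return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

    return math.dist(centroid(a), centroid(b))


left = [(1, 0), (1, 2), (1, 4)]
right = [(4, 0), (4, 2), (4, 4)]
print(single_linkage(left, right))     # closest pair, e.g. (1, 2) and (4, 2)
print(complete_linkage(left, right))   # farthest pair, e.g. (1, 0) and (4, 4)
print(centroid_distance(left, right))  # centroids are (1, 2) and (4, 2)
```

Single linkage never exceeds complete linkage for the same pair of clusters, because the minimum and the maximum are taken over the same set of point pairs.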


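### How do we represent the hierarchical clusters?

We can represent hierarchical clusters using dendrograms: tree diagrams in which each merge is drawn as a branch joining two clusters, with the height of the branch showing the distance at which they were merged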
Refer: https://www.displayr.com/what-is-dendrogram/
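
As a rough sketch of how a dendrogram can be built in code, the snippet below uses SciPy, which the notes above do not otherwise rely on, so treat the library choice as an assumption. `linkage` records the sequence of merges and `dendrogram` turns it into a tree layout. Single linkage and the sample points from the Example section are used purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Each row of Z describes one merge:
# [index of cluster A, index of cluster B, merge distance, size of the new cluster]
Z = linkage(X, method="single")
print(Z)

# no_plot=True returns the layout as a dict without drawing anything.
# Dropping it draws the tree, which requires matplotlib to be installed.
tree = dendrogram(Z, no_plot=True)
print(tree["leaves"])  # order of the samples along the bottom of the dendrogram
```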

## Sources:

- https://www.youtube.com/watch?v=7xHsRkOdVwo
- https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

## Example 

In [1]:
from sklearn.cluster import AgglomerativeClustering
import numpy as np

In [2]:
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering().fit(X)
clustering

AgglomerativeClustering()

In [3]:
clustering.labels_

array([1, 1, 1, 0, 0, 0], dtype=int64)

In [5]:
clustering.fit_predict(X)  # the y argument is ignored, so it is left out

array([1, 1, 1, 0, 0, 0], dtype=int64)
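
The cells above rely on the default `n_clusters=2`. As a hedged sketch of letting the merge distances decide the number of clusters instead, `AgglomerativeClustering` also accepts `n_clusters=None` together with a `distance_threshold`. The threshold of 2.5 and the choice of single linkage are assumptions picked to suit this toy data set.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Clusters whose linkage distance is at or above the threshold are not merged.
# n_clusters must be None when distance_threshold is set.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=2.5, linkage="single"
).fit(X)

print(model.n_clusters_)  # number of clusters found for this threshold
print(model.labels_)      # cluster label assigned to each sample
```

A larger threshold allows more merges and therefore produces fewer clusters.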