# Hierarchical Clustering

## The Agglomerative Algorithm
1. Make each data point a single-point cluster.
2. Take the two closest clusters and merge them into one cluster.
3. Again take the two closest clusters and merge them into one cluster.
4. Repeat step 3 until all points belong to a single cluster.<br>

This is the bottom-up approach: __Agglomerative Hierarchical Clustering__<br>
It is also possible to work top-down, starting from one cluster and splitting it: __Divisive Hierarchical Clustering__
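
The bottom-up steps above can be sketched in a few lines of NumPy. This is a deliberately naive illustration, not the implementation scikit-learn uses: the function name `naive_agglomerative`, the toy points, and the choice of single linkage (distance between the closest members) are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    # Step 1: every point starts as its own cluster (lists of row indices)
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Steps 2-3: search all pairs for the two closest clusters
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # merge cluster b into cluster a, then drop b
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# hypothetical toy data: two tight pairs and one isolated point
toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
clusters, merges = naive_agglomerative(toy, n_clusters=3)
print(clusters)
```

Stopping at `n_clusters=1` instead would record the full merge history, which is the information a dendrogram displays.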

#### AGglomerative NESting vs DIvisive ANAlysis = AGNES vs DIANA <br>


## Step 1: Environment

In [None]:
import numpy as np
import pandas as pd

# for drawing the dendrogram
import scipy.cluster.hierarchy as ch
from scipy.spatial.distance import cdist

# for creating a model
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

import matplotlib.pyplot as plt

## Step 2: Data Preparation

In [None]:
!pwd

In [None]:
# Load input data
# X = np.loadtxt('../data/data_clustering.txt', delimiter=',')
data = pd.read_csv('/Users/tdi/Documents/Teaching/Data/shopping-data.csv', delimiter=',')
print(data)

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
# Select attributes and convert to numpy
X = data.iloc[:, 2:5].values
X

### Linkage
Linkage is a measure of dissimilarity between sets: the distance between two sets of observations, defined as a function of the pairwise distances between their members.

![image-2.png](attachment:image-2.png)
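
As a small illustration of how a linkage turns pairwise distances into one set-to-set distance, the sketch below computes three common criteria (single, complete, average) for two made-up sets `A` and `B`. The points are hypothetical; `cdist` is the SciPy function already imported above.

```python
import numpy as np
from scipy.spatial.distance import cdist

# two hypothetical sets of observations
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# all pairwise Euclidean distances, shape (len(A), len(B))
pairwise = cdist(A, B)

single = pairwise.min()     # closest pair of members
complete = pairwise.max()   # farthest pair of members
average = pairwise.mean()   # mean over all pairs

print(single, complete, average)
```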

__Ward's method__ starts with _n_ clusters, each containing a single object, and merges them step by step until one cluster contains all objects. <br>
At each step, it merges the pair of clusters whose union gives the smallest increase in within-cluster variance, measured by an index called __E__ (also called the _sum of squares index_).
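
To make the index __E__ concrete, the following sketch computes the sum of squared deviations from the centroid for two hypothetical clusters and the increase caused by merging them. The helper `sum_of_squares` and the data are assumptions for illustration. The last lines check the increase against the closed form $\frac{n_A n_B}{n_A + n_B}\lVert c_A - c_B \rVert^2$, where $c_A$ and $c_B$ are the centroids.

```python
import numpy as np

def sum_of_squares(cluster):
    """E for one cluster: squared distances of its members to the centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

# two hypothetical clusters
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[10.0, 0.0], [12.0, 0.0], [14.0, 0.0]])

# increase in E if A and B were merged; Ward picks the pair minimising this
merged = np.vstack([A, B])
increase = sum_of_squares(merged) - sum_of_squares(A) - sum_of_squares(B)

# the same quantity from cluster sizes and centroids only
n_a, n_b = len(A), len(B)
centroid_gap = ((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum()
shortcut = n_a * n_b / (n_a + n_b) * centroid_gap

print(increase, shortcut)
```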

### Building the Dendrogram
A dendrogram is a tree diagram that shows the hierarchical relationship between objects: each merge is drawn at the height of the distance at which it happened.

In [None]:
# Use the dendrogram to find the optimal number of clusters
plt.figure(figsize=(20,10))

dendrogram = ch.dendrogram(ch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Observations')
plt.ylabel('Euclidean distance')
plt.show()
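
Once a cut has been chosen from the dendrogram, the same linkage matrix can be turned into flat cluster labels. A minimal sketch, assuming synthetic data (`demo`, three well-separated blobs) in place of the shopping data, and using SciPy's `fcluster` either with a target number of clusters or with a cut height:

```python
import numpy as np
import scipy.cluster.hierarchy as ch

# synthetic stand-in for X: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                  for c in (0.0, 5.0, 10.0)])

Z = ch.linkage(demo, method='ward')

# option 1: ask for at most 3 flat clusters
labels_k = ch.fcluster(Z, t=3, criterion='maxclust')

# option 2: cut the tree at a height read off the dendrogram (10 is an assumption)
labels_h = ch.fcluster(Z, t=10, criterion='distance')

print(np.unique(labels_k), np.unique(labels_h))
```

Note that `fcluster` labels start at 1, whereas scikit-learn labels start at 0.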

## Step 3: Train and Implement the Model

In [None]:
n_clusters = 3
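# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the old `affinity` parameter has been removed from recent scikit-learn releases)
model = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')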
model.fit(X)

In [None]:
X

In [None]:
Y = model.fit_predict(X)
Y
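
The choice of `n_clusters` can also be checked numerically. One option is the silhouette score from `sklearn.metrics`, which was imported above. The sketch below is an assumption about how such a check could look, run on synthetic blobs (`demo`) rather than on the shopping data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

# synthetic stand-in for X: three well-separated blobs of 30 points each
rng = np.random.default_rng(42)
demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
                  for c in (0.0, 6.0, 12.0)])

# fit one model per candidate k and record its silhouette score
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(demo)
    scores[k] = metrics.silhouette_score(demo, labels)

# the k with the highest score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(scores, best_k)
```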

In [None]:
# Visualising the clusters
# Plot the points with color code of the predicted clusters
# viridis - a color map, https://matplotlib.org/users/colormaps.html
# s - marker size, here, the dots size
# c - marker color, here taken from the predicted clusters
plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap='viridis')
plt.title('Discovered Clusters')
plt.xlabel('X1')
plt.ylabel('X2')
# plt.legend()
plt.show()

In [None]:
# plt is already imported above; Axes3D is only needed on older matplotlib
# versions, where importing it registers the '3d' projection
from mpl_toolkits.mplot3d import Axes3D

In [None]:
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], marker='o', s=30, c=Y, cmap='viridis')

plt.show()

## K-Means vs. Hierarchical Clustering
### K-Means Clustering
 #### Advantages
 > Simple to understand <br>
 > Easily adaptable and efficient<br>
 > Works well on both small and large datasets<br>
 
 #### Disadvantages
 > Need to know the optimal number of clusters in advance
 
### Hierarchical Clustering
 #### Advantages
 > The optimal number of clusters can be read from the dendrogram, so it does not have to be fixed in advance<br>
 > Dendrograms are practical and easy to understand<br>
 
 #### Disadvantages
 > Not suitable for large datasets, because the distances between all pairs of observations must be computed

## Store the Model

In [None]:
!pip install joblib

In [None]:
# For serialization and deserialization of data from/to file
# from sklearn.externals import joblib
import joblib
joblib.dump(model, '../deploy/hmodel.pkl')
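
For completeness, a stored model can be read back with `joblib.load`. The sketch below is a self-contained round trip on a tiny made-up dataset, written to a temporary directory instead of `../deploy/`. Keep in mind that `AgglomerativeClustering` has no `predict` method, so the restored object only carries the labels of the data it was fitted on (`labels_`).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# tiny hypothetical dataset: two obvious groups
demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
fitted = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(demo)

# dump and reload inside a temporary directory (path chosen for the example)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'hmodel.pkl')
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# the restored model exposes the same training labels
same_labels = np.array_equal(restored.labels_, fitted.labels_)
print(restored.labels_, same_labels)
```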