## Hierarchical Clustering

In this practice, we will apply hierarchical clustering to the same data sets used in the [k-means clustering practice](KMeans_Clustering.ipynb). Take a look at that practice first if you haven't done so yet.

In [None]:
# read data from the file
data1 <- read.csv("../../../datasets/toydata/data1.csv",header=TRUE)

str(data1)
# Visualize the data
library(ggplot2)
pl1 <- ggplot(data1, aes(X, Y)) + geom_point(colour="black")
pl1

In [None]:
# Let's apply hierarchical clustering to this data set
set.seed(42)
hc_clust11 <- hclust(dist(data1[, 1:2]), method="complete")
plot(hc_clust11)

In a dendrogram, the height at which two branches merge indicates how dissimilar the merged clusters are. Above, the heights of the tree branches suggest that there are two clusters in this data set. Let's cut the tree at two clusters and visualize them.

In [None]:
# cut the tree into two clusters; cutree returns a cluster label for each point
cut2 <- cutree(hc_clust11, k = 2)
# use labels to visualize hclust clusters
pl1 <- ggplot(data1, aes(X, Y)) + geom_point(aes(colour=factor(cut2))) + theme(legend.position="top")
pl1
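
The branch-height reasoning can also be checked numerically, since `hclust` stores the height of every merge in its `height` component. The following is a minimal sketch on synthetic data: the two Gaussian blobs in `toy` are an assumption made for illustration and are not the contents of `data1.csv`.

```r
# Synthetic stand-in data (assumption): two well-separated 2-D blobs of 20 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
hc_toy <- hclust(dist(toy), method = "complete")

# Heights of the last five merges, in increasing order
top_heights <- tail(hc_toy$height, 5)
top_heights

# Gaps between consecutive merge heights; the largest gap sits just below
# the final merge, which suggests cutting the tree into two clusters
gaps <- diff(top_heights)
gaps

# Cut into two clusters and count the points in each
labels_toy <- cutree(hc_toy, k = 2)
table(labels_toy)
```

A large jump in the last merge height relative to the earlier ones is the numeric counterpart of the long vertical branches at the top of the dendrogram.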

The same clustering and visualization can be done with the *eclust* function of the **factoextra** library. For comparison, we first run k-means and then hierarchical clustering:

In [None]:
library(factoextra)
# run k-means on this data and visualize 
km <- eclust(data1[, 1:2], "kmeans", k=2, nstart=20, graph=FALSE)
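# ellipse=FALSE replaces the older frame=FALSE, which recent factoextra versions report as deprecated
fviz_cluster(km, geom="point", ellipse=FALSE)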

In [None]:
# run hclust on this data and visualize 
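# eclust takes the linkage through hc_method (its default is "ward.D2")
hc <- eclust(data1[, 1:2], "hclust", k=2, hc_method="complete", graph=FALSE)
# plot clusters (ellipse=FALSE replaces the older frame=FALSE)
fviz_cluster(hc, geom="point", ellipse=FALSE)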

In [None]:
# also plot the dendrogram
fviz_dend(hc, rect = TRUE, show_labels = FALSE) 

In [None]:
# Let's compare the cluster assignments to the actual class labels
table(cut2, data1$class)
# or 
table(hc$cluster, data1$class)
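
Cluster numbers are arbitrary, so the contingency table should be read row by row rather than along the diagonal. One hedged way to summarize it in a single number is purity: the fraction of points that belong to the majority class of their cluster. The sketch below uses synthetic data with made-up labels (`toy` and `true_class` are assumptions, not the toy data files), and `purity` is a helper quantity introduced here for illustration.

```r
# Synthetic stand-in data (assumption): two blobs with known class labels
set.seed(2)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
true_class <- rep(c("a", "b"), each = 20)

# Hierarchical clustering with complete linkage, cut into two clusters
pred <- cutree(hclust(dist(toy), method = "complete"), k = 2)

# Rows are clusters, columns are true classes
tab <- table(cluster = pred, class = true_class)
tab

# Purity: sum of the largest count in each row, divided by the number of points
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

A purity of 1 means every cluster contains points from only one class; values well below 1 indicate that clusters mix classes.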

Let's see what the hierarchical clustering looks like with three clusters. **Now it's your turn:**

In [None]:
# run hclust on this data and visualize 
hc2 <- eclust(<what goes in here>) 
# plot clusters
fviz_cluster(hc2, <what goes in here>)
# Dendrogram
fviz_dend(hc2, <what goes in here>) 

Three clusters don't make much sense, as can be seen from the branch heights. Here, we can use **NbClust**, just as in the [k-means clustering practice](KMeans_Clustering.ipynb), to decide at what level to cut the tree.
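
NbClust combines many cluster-validity indices. As a rough illustration of the idea, the sketch below computes just one such index, the average silhouette width, for several cuts of the same tree. It uses `silhouette` from the **cluster** package rather than NbClust itself, and the data in `toy` is synthetic (an assumption, not one of the toy data files).

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in data (assumption): two well-separated 2-D blobs
set.seed(3)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
d <- dist(toy)
hc_toy <- hclust(d, method = "complete")

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:6
avg_sil <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(hc_toy, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- candidate_k
avg_sil

# The cut with the highest average silhouette width
best_k <- candidate_k[which.max(avg_sil)]
best_k
```

A single index can be misleading on less clean data, which is why NbClust reports the number of clusters favored by a majority of indices.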

Let's apply hierarchical clustering to the second data set. **Now it's your turn:**

In [None]:
# read data from the file 
data2 <- read.csv("../../../datasets/toydata/data2.csv",header=TRUE)

# Visualize the data
pl2 <- ggplot(data2, aes(X, Y)) + geom_point(colour="black")
pl2

Let's start with two clusters.

In [None]:
# run hclust on this data and visualize 
hc3 <- <what goes in here>
# plot clusters
fviz_cluster(<what goes in here>)
# Dendrogram
fviz_dend(<what goes in here>) 

Let's see how well it does: compute the confusion matrix against the actual class labels. **Now it's your turn:**

In [None]:
table(<what goes in here>)

Let's try the same approach for the third data set. From the [k-means clustering practice](KMeans_Clustering.ipynb), we know that the best number of clusters is either 3 or 6, depending on our choice of scale. Let's see how hclust does for those numbers. **Now it's your turn:**

In [None]:
# read data from file
data3 <- read.csv("../../../datasets/toydata/data3.csv",header=TRUE)
pl3 <- ggplot(data3, aes(X, Y)) + geom_point(colour="black")
pl3

First, try 3 clusters:

In [None]:
# run hclust on this data and visualize 
hc4 <- <what goes in here>
# plot clusters
fviz_cluster(<what goes in here>)
# Dendrogram
fviz_dend(<what goes in here>) 

Now, try 6 clusters:

In [None]:
# run hclust on this data and visualize 
hc5 <- <what goes in here>
# plot clusters
fviz_cluster(<what goes in here>)
# Dendrogram
fviz_dend(<what goes in here>) 

Judging from the branch heights, both 3 and 6 look like reasonable choices for the number of clusters. 4 is not bad either.
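
One reason several cut levels can all look reasonable is that cuts of the same dendrogram are nested: a finer cut only splits clusters of a coarser cut and never mixes them. The sketch below illustrates this on a synthetic layout with three coarse groups, each made of two tight sub-blobs. The `centers` matrix and the resulting `toy` data are assumptions for illustration and do not reproduce `data3.csv`.

```r
# Synthetic stand-in data (assumption): 3 coarse groups, each with 2 tight sub-blobs
set.seed(4)
centers <- rbind(c(0, 0), c(2, 0), c(10, 0), c(12, 0), c(5, 9), c(7, 9))
toy <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(15, centers[i, 1], 0.1), rnorm(15, centers[i, 2], 0.1))
}))
hc_toy <- hclust(dist(toy), method = "complete")

# Cut the same tree at two different scales
k3 <- cutree(hc_toy, k = 3)
k6 <- cutree(hc_toy, k = 6)

# Cross-tabulate the fine cut against the coarse cut:
# every row (fine cluster) has exactly one non-zero entry
nested <- table(k6, k3)
nested

# Heights of the last six merges show the two scales as two separate jumps
tail(hc_toy$height, 6)
```

Which cut to prefer then depends on the scale of interest rather than on the tree alone.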