# Hierarchical Clustering

**author** : Germain Forestier (germain.forestier@uha.fr)

Hierarchical clustering is an unsupervised learning method that groups similar objects into clusters. It builds a hierarchy of clusters that can be visualized as a dendrogram. The hierarchy is built either by repeatedly merging clusters (agglomerative) or by repeatedly splitting them (divisive), based on a dissimilarity measure such as a distance. The key choices are the linkage criterion (e.g., single linkage, complete linkage) and whether to normalize the data so that features on large scales do not dominate the distance calculations.


## **Exercise 1**: Introduction to Hierarchical Clustering

In this exercise, we will perform hierarchical clustering on a dataset, observe the effect of data normalization, and visualize the results using dendrograms.


### Part 1: Import Libraries and Load Data

Import the necessary libraries (`numpy`, `pandas`, `matplotlib`, `seaborn`, and `sklearn`). Load a sample dataset to work with (e.g., the Iris dataset).
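
As a starting point, the loading step could look like the minimal sketch below. The variable names `iris`, `X`, `y`, and `df` are illustrative choices, not names required by the exercise.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris() returns a Bunch holding the feature matrix, the targets, and the names
iris = load_iris()
X = iris.data      # shape (150, 4): one row per flower, one column per measurement
y = iris.target    # species encoded as 0, 1, 2

# Wrapping the features in a DataFrame makes them easier to inspect
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())
```

The species labels in `y` are not used for clustering itself; they are only useful later for checking the clusters against the known classes.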


In [None]:
# TODO: Import the necessary libraries
# Hint: Use `from sklearn.datasets import load_iris` to load the Iris dataset

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the Iris dataset
# TODO: Complete the code to load the dataset
# Hint: You can load the data using `load_iris()`


### Part 2: Normalize the Data

Normalize the dataset to ensure that each feature contributes equally to the distance calculations.
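
To see why normalization matters, the sketch below standardizes a small made-up array whose two columns live on very different scales. The array `X_toy` is hypothetical and only serves as an illustration; the same calls should apply to the Iris features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is in the thousands, column 1 is below 1
X_toy = np.array([[1000.0, 0.1],
                  [2000.0, 0.2],
                  [3000.0, 0.3],
                  [4000.0, 0.4]])

# StandardScaler subtracts each column's mean and divides by its standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # close to 0 for both columns
print(X_scaled.std(axis=0))   # close to 1 for both columns
```

Without scaling, Euclidean distances on `X_toy` would be driven almost entirely by the first column.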


In [None]:
# TODO: Normalize the dataset
# Hint: Use `StandardScaler` from `sklearn.preprocessing`

# Standardize the data

### Part 3: Perform Hierarchical Clustering and Plot Dendrograms

Perform hierarchical clustering on the normalized dataset and visualize the clusters using a dendrogram. Experiment with different linkage methods (e.g., single linkage, complete linkage).
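
Before plotting, it can help to look at what `linkage` returns. The sketch below runs it on a hypothetical four-point dataset (`X_toy` is not part of the exercise) and prints the resulting linkage matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs of points, far apart from each other
X_toy = np.array([[0.0, 0.0],
                  [0.0, 1.0],
                  [10.0, 0.0],
                  [10.0, 1.0]])

# With raw observations as input, linkage uses Euclidean distance by default
Z = linkage(X_toy, method="complete")

# Z has n - 1 rows; each row is
# [cluster index 1, cluster index 2, merge distance, size of the new cluster]
print(Z.shape)
print(Z)
```

This matrix `Z` is what `dendrogram` expects as input, and its third column gives the heights at which branches join.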


In [None]:
# TODO: Perform hierarchical clustering and plot dendrogram
# Hint: Use `scipy.cluster.hierarchy` to perform clustering and plot the dendrogram

from scipy.cluster.hierarchy import dendrogram, linkage

# Perform hierarchical clustering

# Plot dendrogram


#### **Conclusion**:

What can you conclude from the previous results?

## **Exercise 2**: Hierarchical Clustering on Countries Dataset

In this exercise, we will apply hierarchical clustering to a dataset of countries. We will generate dendrograms using both single linkage and complete linkage methods to observe the differences in cluster formation.


### Part 1: Import Libraries and Load Data

First, we need to import the necessary libraries and load the dataset containing information about various countries.


In [None]:
# TODO: Import the necessary libraries for data manipulation, plotting, and clustering.
# You will need pandas for data manipulation, numpy for numerical operations, and matplotlib for plotting.
# Additionally, import pdist and squareform from scipy.spatial.distance, and linkage and dendrogram from scipy.cluster.hierarchy.

# Write your import statements here


### Part 2: Load the Dataset

Load the dataset containing information about different countries. The dataset should include various numerical features for each country. You can download the dataset from the following URL: `https://germain-forestier.info/dataset/countries.csv`.
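
The column layout of `countries.csv` is not described here, so the sketch below uses a small in-memory CSV with hypothetical column names (`country`, `feature_a`, `feature_b`) and fictional country names to show one way of separating labels from numeric features. Adapt the column name to whatever the real file contains.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; the actual column names may differ
csv_text = """country,feature_a,feature_b
Freedonia,1.0,10.0
Borduria,2.0,20.0
Syldavia,3.0,30.0
"""

# pd.read_csv accepts a file-like object, a local path, or a URL string
df = pd.read_csv(io.StringIO(csv_text))

lbl = df["country"].tolist()                    # names used later as leaf labels
data = df.drop(columns="country").to_numpy()    # numeric feature matrix
print(lbl)
print(data.shape)
```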


In [None]:
# TODO: Load the dataset using pandas. The dataset is located at the given URL.
# Extract the feature data into 'data' and country names into 'lbl'.

# Write your code here to load the data


### Part 3: Compute Hierarchical Clustering with Complete Linkage

In this step, you will compute the hierarchical clustering using the complete linkage method. This method defines the distance between two clusters as the maximum distance between any pair of points, one taken from each cluster.
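
The sketch below illustrates complete linkage on a hypothetical one-dimensional dataset (`data_toy`), chosen so that the merge heights can be checked by hand. It is an illustration under those assumptions, not the solution for the countries data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D toy data at positions 0, 1, 3 and 7
data_toy = np.array([[0.0], [1.0], [3.0], [7.0]])

# pdist returns a condensed vector of the n * (n - 1) / 2 pairwise distances
dist = pdist(data_toy, metric="euclidean")
print(squareform(dist))   # the same distances as a full n x n matrix

# linkage accepts the condensed distance vector directly
Z_complete = linkage(dist, method="complete")
print(Z_complete[:, 2])   # heights at which clusters are merged
```

Here the points at 0 and 1 merge first at height 1, the point at 3 joins them at height 3 (its distance to the farthest member, 0), and the point at 7 joins last at height 7.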


In [None]:
# TODO: Calculate the pairwise distance between the data points and then apply complete linkage for hierarchical clustering.
# Hint: Use 'pdist' for calculating distances and 'linkage' for clustering. Specify the method as 'complete'.

# Write your code here to calculate distance matrix and perform clustering


### Part 4: Plot the Dendrogram for Complete Linkage

Next, plot the dendrogram to visualize the clusters formed using the complete linkage method.
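
A labeled dendrogram can be sketched as follows on hypothetical toy data (`data_toy` and `lbl_toy` are made up for illustration). The call uses `no_plot=True` so that it only computes the layout; removing that argument should draw the tree on the current matplotlib axes.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data and leaf names
data_toy = np.array([[0.0], [1.0], [3.0], [7.0]])
lbl_toy = ["A", "B", "C", "D"]

Z = linkage(data_toy, method="complete")

# labels assigns a name to each leaf; leaf_rotation=90 writes them vertically.
# no_plot=True returns the layout dictionary without drawing anything.
tree = dendrogram(Z, labels=lbl_toy, leaf_rotation=90, no_plot=True)
print(tree["ivl"])   # leaf labels in left-to-right plotting order
```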


In [None]:
# TODO: Plot the dendrogram using the linkage matrix. Make sure to label each leaf with country names.
# Hint: Use the 'dendrogram' function and pass the labels from the 'lbl' list. Set the leaf_rotation to 90 degrees.

# Write your code here to plot the dendrogram


### Part 5: Compute and Plot Dendrogram for Single Linkage

Now, let's compare the results by using the single linkage method, which defines the distance between two clusters as the minimum distance between any pair of points, one taken from each cluster. Generate the dendrogram for this method and observe the differences.
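
To make the difference concrete, the sketch below applies both methods to the same hypothetical one-dimensional data (`data_toy`) and compares the merge heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D toy data at positions 0, 1, 3 and 7
data_toy = np.array([[0.0], [1.0], [3.0], [7.0]])

Z_single = linkage(data_toy, method="single")
Z_complete = linkage(data_toy, method="complete")

# Third column of the linkage matrix holds the merge heights
print("single:  ", Z_single[:, 2])
print("complete:", Z_complete[:, 2])
```

Both methods merge the same clusters in the same order on this data, but single linkage joins them at lower heights because it measures the closest pair rather than the farthest one.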


In [None]:
# TODO: Similar to the complete linkage, now perform clustering using single linkage.
# Hint: Change the 'method' parameter in the linkage function to 'single'.

# Write your code here to compute clustering with single linkage and plot the dendrogram


## **Exercise 3**: Clustering with a Shopping Trend Dataset

In this exercise, we will explore clustering using a dataset that captures shopping trends. Specifically, we'll employ hierarchical clustering to identify customer segments based on their annual income and spending scores. This approach will help us understand different consumer behaviors, which is crucial for targeted marketing strategies.


### Part 1: Setup and Data Loading

Begin by setting up your environment and loading the dataset, which includes customer information such as gender, age, annual income, and spending scores.


In [None]:
# TODO: Import necessary libraries and load the dataset.

# Set matplotlib to display inline

# Load the dataset and display the first ten rows

### Part 2: Data Exploration

Before diving into clustering, it's beneficial to explore the dataset. Visualizing distributions of key features like annual income and spending score can provide insights into the data's structure.


In [None]:
# TODO: Plot histograms for the 'Annual Income (k$)' and 'Spending Score (1-100)' columns.


### Part 3: Exploratory Analysis

Explore possible correlations between features such as annual income, age, and spending score.


In [None]:
# TODO: Create scatter plots to explore potential correlations between 'Annual Income (k$)' and 'Spending Score (1-100)', and between 'Age' and 'Spending Score (1-100)'.


### Part 4: Hierarchical Clustering

Segment customers based on 'Annual Income (k$)' and 'Spending Score (1-100)' using hierarchical clustering.


In [None]:
# TODO: Extract the necessary columns for clustering, apply hierarchical clustering using the Ward method, and visualize the results with a dendrogram.


### Part 5: Hierarchical Clustering Implementation

Now, build the hierarchical clustering model to segment the customers based on their income and spending scores.
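
The sketch below shows the general pattern on hypothetical income and spending values (`X_toy` is invented for illustration and uses 3 clusters rather than the 5 asked for in the exercise). With Ward linkage, scikit-learn uses Euclidean distance by default, so the distance parameter is left out here to stay independent of the library version.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical (income, spending score) pairs forming three obvious groups
X_toy = np.array([[15, 80], [16, 82], [17, 79],      # low income, high spending
                  [60, 50], [62, 48], [61, 52],      # medium income, medium spending
                  [120, 10], [122, 12], [118, 11]],  # high income, low spending
                 dtype=float)

model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X_toy)   # one integer cluster label per row
print(labels)
```

The integer values of the labels are arbitrary; only the grouping they define is meaningful.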


In [None]:
# TODO: Import AgglomerativeClustering from sklearn.cluster, instantiate the model with 5 clusters, and fit it on the data.
# Use the Euclidean metric and Ward linkage for the model (the parameter is `metric` in recent scikit-learn releases and `affinity` in older ones).

### Part 6: Plot the Clusters and Label Customer Types

Visualize the clusters and identify the types of customers in each cluster.


In [None]:
# TODO: Plot the clusters. Use a different color for each cluster and label each one with a customer type inferred from its income and spending levels.


## **Exercise 4**: Evaluating Cluster Stability with Adjusted Rand Index

This exercise explores how the Adjusted Rand Index (ARI), a measure of the similarity between two data clusterings, varies as the number of clusters changes. You will use the Iris dataset and hierarchical clustering to compute the ARI for cluster counts ranging from 1 to 20.


### Part 1: Setup and Data Loading

Set up your environment and load the Iris dataset, which includes features of Iris flowers and their classifications.


In [None]:
# TODO: Import necessary libraries, load the dataset, and inspect the first few rows.


### Part 2: Compute Adjusted Rand Index for Various Cluster Counts

Evaluate how the Adjusted Rand Index between the predicted cluster labels and the true species labels changes as the number of clusters varies from 1 to 20.
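
The loop can be sketched on a hypothetical toy dataset with known groups (`X_toy` and `y_true` are made up, and the range is shortened to 1 to 4 because there are only six points). The same pattern should carry over to the Iris data with a range of 1 to 20.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical 1-D data with three well-separated true groups
X_toy = np.array([[0.0], [0.5], [5.0], [5.5], [10.0], [10.5]])
y_true = np.array([0, 0, 1, 1, 2, 2])

ks = range(1, 5)
scores = []
for k in ks:
    # Default linkage is Ward; fit_predict returns one label per point
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_toy)
    scores.append(adjusted_rand_score(y_true, labels))

for k, s in zip(ks, scores):
    print(k, round(s, 3))
```

On this toy data the ARI is 0 for a single cluster and reaches 1 when the number of clusters matches the number of true groups. The ARI ignores how clusters are numbered, so no label matching is needed.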


In [None]:
# TODO: Write a loop to compute the Adjusted Rand Index for each cluster count from 1 to 20 using Agglomerative Clustering.
# Store the ARI scores in a list and plot them to visualize the trends.


## **Exercise 5**: Hierarchical Clustering of Handwritten Digits

In this exercise, you will use hierarchical clustering to analyze a subset of handwritten digits from the `load_digits` dataset available in `sklearn`. The goal is to visually group similar digits and explore how hierarchical clustering organizes these images. You will visualize the clusters using a dendrogram enhanced with actual digit images.


### Part 1: Setup and Data Loading

Begin by setting up your environment, importing necessary libraries, and loading a random subset of digit images.
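
One way to draw the subset is sketched below. It uses NumPy's `default_rng` generator instead of the `np.random.choice` function mentioned in the hint, and the seed value is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()   # 1797 grayscale images of 8 x 8 pixels

# replace=False guarantees that no image is picked twice
rng = np.random.default_rng(seed=0)
idx = rng.choice(len(digits.images), size=30, replace=False)

images = digits.images[idx]   # shape (30, 8, 8), convenient for plt.imshow
X = digits.data[idx]          # shape (30, 64), flattened pixels for clustering
y = digits.target[idx]        # true digit of each selected image
print(images.shape, X.shape)
```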


In [None]:
# TODO: Import necessary libraries, load the dataset, and select a random subset of 30 images.
# Hint: Use np.random.choice to select random indices and load_digits to load the dataset.


### Part 2: Visualizing the Images

Visualize the selected images in a grid format before performing clustering to get a visual understanding of the different digits.


In [None]:
# TODO: Plot the selected images in a grid. Use plt.imshow within a for loop to display each image.


### Part 3: Hierarchical Clustering and Dendrogram

Apply hierarchical clustering to the image data and plot a dendrogram to visualize how images are grouped together.
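
The clustering step can be sketched as follows. For simplicity this example takes the first 30 images rather than a random subset, which is an assumption made only to keep the block short; each image is treated as a 64-dimensional vector of pixel intensities.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data[:30]   # first 30 images, each flattened to 64 pixel values

# Ward linkage merges the pair of clusters that least increases
# the total within-cluster variance
Z = linkage(X, method="ward")

print(Z.shape)      # (29, 4): one row per merge
print(Z[-3:, 2])    # heights of the last three merges
```

Passing `Z` to `dendrogram` then draws the tree; adding the digit images under the leaves requires extra matplotlib code that is not shown here.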


In [None]:
# TODO: Perform hierarchical clustering using the 'ward' method and plot the dendrogram.
# Hint: Use linkage from scipy.cluster.hierarchy and remember to import dendrogram if not already done.
