<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: Hierarchical Clustering 

© ExploreAI Academy

In this exercise, we will test our understanding of the core concepts of hierarchical clustering.

## Learning Objectives
By the end of this exercise, you should be able to:
* Implement an optimal agglomerative hierarchical clustering model.


Import the libraries and load the data that we will need for this exercise.

In [None]:
# data processing
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# plotting
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns


In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/unsupervised_sprint/wine_clustering.csv')
data

## The data
This dataset presents the outcomes of a chemical examination conducted on wines cultivated in a single region in Italy, originating from three distinct grape varieties. The analysis aimed to quantify the levels of 13 different components present in each type of wine.

The goal of applying hierarchical clustering on this dataset is to identify natural clusters to group similar wines together based on their attributes.

This clustering analysis could aid winemakers in quality control, product segmentation, or even in creating targeted marketing strategies based on the identified wine clusters.

## Exercises

### Exercise 1: Feature scaling
The first step in our process is feature scaling. Standardising each feature to zero mean and unit variance prevents attributes with larger magnitudes from dominating the distance calculations, which gives more balanced clustering results.
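
To see why scale matters, here is a minimal sketch on a small, hypothetical DataFrame (the column names and values are made up for illustration and are not taken from the wine dataset). It standardises each column by hand, which is the same calculation that standardisation performs: subtract the column mean and divide by the column standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features on very different scales
toy = pd.DataFrame({
    "alcohol": [12.0, 13.0, 14.0],
    "proline": [500.0, 1000.0, 1500.0],
})

# Standardise each column: subtract its mean, divide by its standard deviation.
# ddof=0 uses the population standard deviation.
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_scaled.round(3))
```

After standardisation both columns are on the same scale, so neither dominates a Euclidean distance calculation.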

Use `StandardScaler` to scale the dataset so that all features contribute equally to the clustering process.

In [None]:
#Your code here

### Exercise 2: Hierarchical clustering and dendrogram visualisation

Next, we apply the hierarchical clustering algorithm to our scaled data. Examining the resulting hierarchy will help us determine the optimal number of clusters to use in our model.

Compute the hierarchical clustering of the data using the `ward` method, then visualise the resulting hierarchy as a dendrogram. Based on the dendrogram, how many clusters should we use for our model?

In [None]:
#Your code here

### Exercise 3: Agglomerative clustering

Perform agglomerative clustering on the scaled data using the `AgglomerativeClustering` class from sklearn with 3 clusters.

Print the resulting cluster labels.

In [None]:
#Your code here

### Exercise 4: Interpretation of cluster characteristics

After performing hierarchical clustering with 3 clusters, we have assigned each wine sample to one of these clusters. To better understand the characteristics of each cluster, we want to examine the average values of the 13 components in our dataset across the samples within each cluster.

Calculate the mean values of the 13 components for each cluster and compare the clusters' characteristics.

**Note**: To make the differences between clusters easier to see, consider visualising the cluster means with a **bar plot**. Together with the dataframe, this gives a clearer and more intuitive view of how chemical composition differs across the clusters.


Based on your observations, how would you describe the differences between the clusters in terms of their chemical composition?


In [None]:
#Your code here

## Solutions

### Exercise 1: Feature scaling

In [None]:

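from sklearn.preprocessing import StandardScaler

# Standardise every feature to zero mean and unit variance
scaler = StandardScaler()
ss = scaler.fit_transform(data)

# Keep the original column names so the scaled features stay identifiable
data_scaled = pd.DataFrame(ss, columns=data.columns)
data_scaled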

### Exercise 2: Hierarchical clustering and dendrogram visualisation

In [None]:
import scipy.cluster.hierarchy as shc

# Build the Ward linkage matrix and draw the dendrogram
plt.figure(figsize=(10, 7))
plt.title("Wine Dendrogram")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.show()

The height at which two branches join shows the distance at which those clusters merge, so long vertical lines indicate clusters that stay separate over a wide range of distances. Observing the arrangement of the branches, we see a clear separation into three main branches, each representing a cohesive cluster. Based on the dendrogram, dividing the data into three clusters is therefore the most appropriate choice.
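
The visual reading of a dendrogram can be cross-checked numerically. The sketch below is illustrative only: it uses synthetic blobs (a hypothetical stand-in, not the wine data) with three known groups, inspects the distances of the last few merges in the linkage matrix, and then cuts the tree into three flat clusters with `fcluster`.

```python
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the scaled data: 3 well-separated synthetic blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage matrix: each row records one merge; column 2 is the merge distance
Z = shc.linkage(X_scaled, method="ward")

# A large jump between successive merge distances suggests a natural place to cut
last_merges = Z[-5:, 2]
gaps = np.diff(last_merges)
print("Last merge distances:", last_merges.round(2))
print("Gaps between them:   ", gaps.round(2))

# Cut the tree into exactly 3 flat clusters (fcluster labels start at 1)
labels = shc.fcluster(Z, t=3, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels)[1:])
```

On the notebook's data, the same approach would apply to the linkage matrix computed from `data_scaled`; the number of clusters passed as `t` is the choice read off the dendrogram.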

### Exercise 3: Agglomerative clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Fit a 3-cluster Ward model and get one cluster label per wine sample
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
types = cluster.fit_predict(data_scaled)
types

### Exercise 4: Interpretation of cluster characteristics

In [None]:
# Calculate the mean values of the 13 components for each cluster
cluster_means = data.groupby(types).mean()
cluster_means


In [None]:
import plotly.express as px

# Reshape the cluster means to long format: one row per (cluster, component) pair
cluster_means_plotly = cluster_means.reset_index().melt(id_vars=['index'], var_name='Wine Component', value_name='Mean Value')

# Plot the cluster means as grouped bars, one bar per component within each cluster
fig = px.bar(cluster_means_plotly, x='index', y='Mean Value', color='Wine Component',
             title='Mean Values of Wine Components Across Clusters',
             labels={'index': 'Cluster', 'Mean Value': 'Mean Value'},
             barmode='group', width=1000, height=600)  # Adjust width and height as needed

# Label each bar with its own value, positioned just above the bar
fig.update_traces(texttemplate='%{y:.2f}', textposition='outside')

fig.show()

The plot may appear condensed at first glance because the components are measured on very different scales. Plotly plots are interactive, so we can zoom in to explore the smaller values more closely and gain a clearer understanding of the data.

The analysis of the cluster means reveals distinct differences in the chemical composition of wines across the three clusters. **Cluster 2** exhibits generally higher values for components such as alcohol, ash, magnesium, total phenols, flavanoids, proanthocyanins, hue, OD280, and proline compared to **Clusters 0 and 1**. On the other hand, **Cluster 1** shows higher color intensity and malic acid content. **Cluster 0** generally has lower values for most components compared to the other clusters. These findings suggest that wines within **Cluster 2** may possess different characteristics or qualities compared to those in **Clusters 0 and 1**, which could be valuable for winemakers in product segmentation or quality control efforts.
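
Because the raw components sit on very different scales, it can also help to compare clusters on the standardised features, where 0 is the overall average and values are expressed in standard deviations. The following is a minimal sketch on synthetic data (the feature names and values are hypothetical, not the wine dataset) showing the idea.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the wine data: 3 synthetic groups, 3 features
X, _ = make_blobs(n_samples=90, centers=[[0, 0, 0], [5, 5, 0], [-5, 5, 5]],
                  cluster_std=0.5, random_state=0)
toy = pd.DataFrame(X, columns=["feature_a", "feature_b", "feature_c"])
toy["feature_c"] *= 1000  # put one feature on a much larger scale

# Standardise, then cluster on the standardised features
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(scaled)

# Raw means are dominated by the large-scale feature...
raw_means = toy.groupby(labels).mean()
# ...while standardised means are directly comparable across features
scaled_means = scaled.groupby(labels).mean()

print(raw_means.round(2))
print(scaled_means.round(2))
```

Applied to this notebook, the equivalent call would be `data_scaled.groupby(types).mean()`. A positive value then means the cluster sits above the overall average for that component, and a negative value means it sits below.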


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>