# **Lab experience #11 (STUDENTS): Clustering of a dataset with mixed data types**

This eleventh lab session aims **to cluster a dataset with mixed data types**. It draws on all of Prof. Stella's lectures on clustering.

In this lab session, you can **re-use code already developed in the past labs**, but you also need to implement some additional pre-processing steps, before going into the clustering part.

The lab session is divided into four main parts:

**Part 1**: Dataset loading and exploratory data analysis.

**Part 2**: Data preparation (handling missing values and outliers, dropping columns if needed, ...).

**Part 3**: Preprocessing (encoding of categorical features, scaling).

**Part 4**: Clustering. For this part, you are required to **choose at least two clustering methods** to apply (among k-means++, hierarchical clustering, DBSCAN). You can decide to apply two or three of them. You can also decide whether it is better to apply them in sequence (see Lab06) or in parallel (i.e., you run them independently and choose the best result). Finally, you need to validate them in an unsupervised manner.

_No true labels will be provided this time during the lab._


## Useful references:

- [Manhattan distance](https://www.cs.cornell.edu/courses/JavaAndDS/files/manhattanDistance.pdf)

- [Jaccard's similarity and distance](https://en.wikipedia.org/wiki/Jaccard_index)

- [Gower's distance: Medium article](https://medium.com/analytics-vidhya/concept-of-gowers-distance-and-it-s-application-using-python-b08cf6139ac2)

- [Clustering mixed data types: Medium article](https://medium.com/analytics-vidhya/the-ultimate-guide-for-clustering-mixed-data-1eefa0b4743b)

## NOTATION:
To uniquely identify the number of clusters in the two/three different clustering solutions, please adhere to the following notation:

> ```
> Kh = number of clusters for the hierarchical clustering solution
> Km = number of clusters for the k-means++ clustering solution
> Kd = number of clusters for the DBSCAN clustering solution
> ```


For the labels assigned by the two/three algorithms, please name them as follows:
> ```
> hierarchical_labels = the labels assigned by the hierarchical clustering solution
> kmeans_labels       = the labels assigned by the k-means++ clustering solution
> dbscan_labels       = the labels assigned by the DBSCAN clustering solution
> ```


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

# **Part 1**: Dataset loading and exploratory data analysis

The dataset has attributes of mixed data types.

Legend for the attributes:

- **ID**: Customer’s unique identifier
- **Year_Birth**: Customer's birth year
- **Education**: Education Qualification of customer
- **Marital_Status**: Marital Status of customer
- **Income**: Customer's yearly household income
- **Kidhome**: Number of children in customer's household
- **Teenhome**: Number of teenagers in customer's household
- **Dt_Customer**: Date of customer's enrollment with the company
- **Recency**: Number of days since customer's last purchase
- **MntWines**: Amount spent on wine
- ...


Hint:
- ```.describe()```: it shows a comprehensive statistical description of the attributes
- ```.info()```: it returns the data type for each attribute and the number of non null elements
- ```.isna().sum()```: it counts the number of NaN for each attribute
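
The three hints above can be tried on a small made-up frame before running them on the real dataset. This is only a sketch: the `toy` frame and its values are invented for illustration, and the last step (listing columns with a single distinct value) is one possible way to spot non-informative attributes, not a prescribed one.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Education": ["PhD", "Master", "PhD", "Basic"],
    "Income": [52000.0, np.nan, 61000.0, 18000.0],
    "Kidhome": [0, 1, 0, 2],
})

print(toy.describe())          # statistics for the numerical columns only
toy.info()                     # dtypes and non-null counts
nan_counts = toy.isna().sum()  # number of NaN per column
print(nan_counts)

# Columns with a single distinct value carry no information for clustering
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
print(constant_cols)
```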


In [None]:
# Load the dataset
df = pd.read_csv('marketing_campaign.csv', sep='\t')

# Print the first part of the dataset
df.head()

In [None]:
# Compute the basic statistical properties of the dataset. Hint: .describe()
#

In [None]:
# Check the data type of each attribute and its number of non-null elements. Hint: .info()
#

Check the values of the categorical (dtype=object) attributes, the presence of _NaN_ values, possible _outliers_, and non-informative attributes (with no variability in the dataset).

In [None]:
# Check occurrences of different categories for the categorical features
print(df["Marital_Status"].value_counts())
print("\n", df["Education"].value_counts())

In [None]:
# Check for NaN values
#

# **Part 2:** Data preparation

The main objective of this part is to transform the dataset in order to have all attributes as numerical ones (to apply clustering).

**Note** the following:

- ```Marital_Status``` has a few records associated with rare categories (e.g., "Alone", "YOLO", "Absurd"). Rename them as "Single".

- ```Year_Birth``` has to be converted into an age. Replace this column with a new ```Age``` column holding the customer's age (reference year is 2024).

- ```Education``` and ```Marital_Status``` are categorical features (dtype=object). Then, we need to apply encoding to them.

- Reduce the number of columns by merging all expenses into a single new column named ```Total_Spent```. To do that, compute the total amount spent across all items (wine, fruits, gold, ...).

- ```Dt_Customer``` is of type object, in the format "dd-mm-yy". Hint: use ```pd.to_datetime()``` as a first step towards a numerical value.

- ```Income``` has NaN values. They have to be removed. Hint: ```.dropna()```.

- Features ```ID```, ```Z_CostContact``` and ```Z_Revenue``` are non-informative, so they can be removed. Hint: ```.drop()```.

- Remove records with _outliers_: age cannot exceed 100 years, and income is most likely below 600000.
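
To make the steps in the list above concrete, here is a minimal sketch on a made-up frame. The `toy` data are invented, and selecting the spending columns by the `Mnt` prefix is an assumption based on the `MntWines` entry in the legend; check the real column names before reusing the idea.

```python
import numpy as np
import pandas as pd

# Made-up records, only to illustrate the cleaning steps
toy = pd.DataFrame({
    "Year_Birth": [1980, 1893, 1975, 1990],
    "Marital_Status": ["Married", "YOLO", "Alone", "Single"],
    "Income": [40000.0, 55000.0, np.nan, 666666.0],
    "MntWines": [100, 20, 5, 300],
    "MntFruits": [10, 0, 2, 40],
})

# Merge the rare categories into "Single"
toy["Marital_Status"] = toy["Marital_Status"].replace(["Alone", "YOLO", "Absurd"], "Single")

# Age with respect to the reference year 2024
toy["Age"] = 2024 - toy["Year_Birth"]
toy = toy.drop(columns="Year_Birth")

# Sum all spending columns (assumed to start with "Mnt") into one
mnt_cols = [c for c in toy.columns if c.startswith("Mnt")]
toy["Total_Spent"] = toy[mnt_cols].sum(axis=1)
toy = toy.drop(columns=mnt_cols)

# Drop rows with missing income, then remove outliers by thresholding
toy = toy.dropna(subset=["Income"])
toy = toy[(toy["Age"] <= 100) & (toy["Income"] < 600000)]
print(toy)
```

In this toy case only the first record survives: the others are removed because of an implausible age, a missing income, and an income above the threshold, respectively.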

In [None]:
# Replace singleton categories with "Single"
#

In [None]:
# Compute the age. Hint: 2024-"Year_Birth"
#

# Replace it in the "Year_Birth" column and rename the column as "Age"
df.rename(...

In [None]:
# Compute the total amount of spendings (any item) and group it into one column called "Total_Spent"
#  "Total_Spent" = "MntWines" + MntFruits + MntMeatProducts + ...
df["Total_Spent"] = ...

In [None]:
# Transform "Dt_Customer" into a numerical attribute,
# by computing the number of days each customer has been engaged with the company (w.r.t. the newest customer). Hint: use pd.to_datetime() and the .dt.days attribute
df['Dt_Customer'] = pd.to_datetime(df.Dt_Customer, dayfirst=True)

newest_customer   = df['Dt_Customer'].max()
df['newest_customer'] = newest_customer

df['days_engaged'] = ...

# Drop columns "Dt_Customer" and "newest_customer" and keep "days_engaged"
#
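
As a hedged illustration of the date handling in the cell above, the sketch below works on three invented enrolment dates. It shows that subtracting two datetime columns yields timedeltas and that `.dt.days` is an attribute (no parentheses). The `toy` frame is hypothetical.

```python
import pandas as pd

# Three made-up enrolment dates in day-month-year order
toy = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})
toy["Dt_Customer"] = pd.to_datetime(toy["Dt_Customer"], dayfirst=True)

# Subtracting datetimes gives timedeltas; .dt.days extracts the whole days
toy["days_engaged"] = (toy["Dt_Customer"].max() - toy["Dt_Customer"]).dt.days
print(toy)
```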

In [None]:
# Drop NaN values. Hint: use .dropna()
#

In [None]:
# Dropping redundant features, in this case "ID", "Z_CostContact", "Z_Revenue"
#

In [None]:
# Outliers/Noise exploration by using sns.pairplot() to plot distributions of three most relevant features
# See documentation at https://seaborn.pydata.org/generated/seaborn.pairplot.html
# sns.pairplot() creates its own figure, so no plt.figure() call is needed
sns.pairplot(df[['Income', 'Age', 'Total_Spent']])
plt.show()

In [None]:
# Dropping the outliers (simplest way, by thresholding). Hint: observe the plot and read instructions above

# outliers in "Age"
#

# outliers in "Income"
#

In [None]:
# Check again the new dataset. Hint: .describe(), .info()
#
#

In [None]:
# num of samples after cleaning
print(len(df))

# **Part 3**: Preprocessing

**Encode categorical features**
- use OrdinalEncoder for ordinal data
- use OneHotEncoder for nominal (unordered) data


In [None]:
# Encode ordinal features with ordinal encoding method. Hint: use OrdinalEncoder()
education_order = ['Basic', '2n Cycle','Graduation','Master', 'PhD']
oe              = OrdinalEncoder(...
education_oe    = oe.fit_transform(df[['Education']])
df_enc          = df.assign(Education_encode = education_oe)      # df_enc is a new dataframe with encoded columns
print(df_enc.shape)
print(df_enc[['Education', 'Education_encode']])

In [None]:
# Encode nominal features with one-hot encoding method. Hint: use OneHotEncoder()
ohe         =  OneHotEncoder(sparse_output = False, dtype = 'int')   # scikit-learn >= 1.2; use sparse = False on older versions
Marital_ohe = ohe.fit_transform(...
Marital_ohe = pd.DataFrame(data = Marital_ohe, columns = ohe.get_feature_names_out(['Marital_Status']), index = df.index,)

In [None]:
# Update df_enc with the new column "Marital_ohe"
df_enc      = ...

# Remove non encoded columns ("Marital_Status" and "Education")
df_enc.drop(['Marital_Status','Education'], axis=1, inplace=True)

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(df_enc.corr(method='pearson'), cmap='viridis', annot = True, annot_kws={"size": 5}, vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

**Note** All features are now numerical. Be careful to use ```df_enc``` from now on.



**Scaling**: Apply z-score (StandardScaler) on continuous variables.
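
A z-score replaces each value x with (x - mean) / std, computed per column. The sketch below applies it to the continuous columns of a made-up frame and leaves the binary column untouched. The column names and values are invented. Keeping the original index when rebuilding the scaled frame is one possible way to keep rows aligned during concatenation; resetting both indices works as well.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame(
    {"Income": [20000.0, 40000.0, 60000.0], "Age": [30.0, 40.0, 50.0], "Complain": [0, 1, 0]},
    index=[10, 11, 12],   # non-default index, as left behind after dropping rows
)
continuous = ["Income", "Age"]

# z = (x - mean) / std, computed column by column
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[continuous]),
    columns=continuous,
    index=toy.index,      # keep the original index so the concat aligns row by row
)
toy_scaled = pd.concat([scaled, toy[["Complain"]]], axis=1)
print(toy_scaled)
```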

In [None]:
# binary (dummy) features do not require normalisation
binary_columns = ['Marital_Status_Divorced','Marital_Status_Married', 'Marital_Status_Single','Marital_Status_Together','Marital_Status_Widow'
                 ,'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
binary_series = df_enc[binary_columns]

In [None]:
# Part of the dataframe that needs scaling
df_to_scaler = df_enc.drop(columns=binary_columns)   # excluding binary attributes from df_enc, before scaling

# Scale the features (z-score)
scaled_values = StandardScaler().fit_transform(df_to_scaler)

# Create a new dataframe with the scaled numerical features
scaled_df     = pd.DataFrame(scaled_values, columns = df_to_scaler.columns)

In [None]:
# Update the scaled dataframe by including the binary columns
scaled_df     = pd.concat([scaled_df.reset_index(drop=True), binary_series.reset_index(drop=True)], axis=1)

In [None]:
# Plot the heatmap of the features pair-wise correlation in the new scaled dataframe. Hint: use sns.heatmap()
plt.figure(figsize=(20,10))
#
plt.title(TITLE_TITLE_TITLE)
plt.show()

**Note** Be careful to use ```scaled_df``` from now on.



# **Part 4:** Clustering

- remove mean, column-wise
- before going to clustering, apply dimensionality reduction (PCA)
- transform the dataframe into numpy to re-use code already developed in previous labs
- use the suggested palette
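
The first two steps in the list above can be sketched on synthetic data. The data, the variable names and the 90% cumulative-variance threshold are all assumptions made for illustration; the number of components to keep for the lab remains your own choice.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with very different variances per column
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Remove the mean column-wise
centered = data - data.mean(axis=0)

# Fit PCA with all components and inspect the cumulative explained variance
pca_full = PCA().fit(centered)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)   # 90% is an arbitrary threshold
print(cumulative.round(3), n_keep)

# Project onto the retained components and convert to a plain numpy array
reduced = PCA(n_components=n_keep).fit_transform(centered)
print(reduced.shape)
```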

In [None]:
# Remove mean column-wise


In [None]:
#Apply PCA
NCOMP = ...
pca = PCA(n_components=NCOMP)
pca.fit(scaled_df)
PCA_df = pd.DataFrame(pca.transform(scaled_df), columns=(["PC1","PC2", ADDOTHERCOLUMNSIFNEEDED]))

In [None]:
# To numpy
X = PCA_df.to_numpy()

In [None]:
# Colors palette for clusters
PAL = ['blue', 'green', 'red', 'yellow', 'orange', 'purple', 'magenta', 'cyan', 'brown']

**Note** Be careful to use ```X``` from now on.

# CLUSTERIZE THE DATASET
Below, you can find code from previous labs.

**k-means++ clustering**
- use the elbow method to choose the best number of clusters (Km)
- run k-means++ on the reduced dataset (with NCOMP components)
- find the clusters
- visualize the solution
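
The elbow method can also be reproduced by hand, which helps when reading the visualizer's plot. The sketch below uses synthetic blobs from `make_blobs` (not the lab dataset) and an arbitrary range of candidate values; it simply records the SSE (`inertia_`) for each k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with three groups
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# SSE (inertia) for each candidate k; the "elbow" is where the decrease flattens
sse = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(points)
    sse[k] = model.inertia_
for k, value in sse.items():
    print(k, round(value, 1))
```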

In [None]:
# Use the elbow method
Elbow_M = KElbowVisualizer(KMeans(), k=RANGE_TO_TEST)
Elbow_M.fit(X)
Elbow_M.show()

In [None]:
# Best number of clusters based on the elbow method
Km = ...

In [None]:
# Apply k-means++
kmeans = KMeans(...

print('The final SSE is: %.2f '% kmeans.inertia_)

# Scatterplot
fig11 = plt.figure('kmeans', figsize=(10,5))
# scatterplot here your data objects using the assigned labels and the given palette
for k in range(Km):
  # complete the following line to plot the cluster centers
  plt.scatter(CLUSTER_CENTERS, s=100, color=PAL[k], marker='s', edgecolor='black', linewidth=1.5)
sns.set_theme(style='dark')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid()
plt.show()

In [None]:
# Validation
# ----------
from sklearn.metrics import silhouette_score

# choose the distance metric
distance_metric = ..

# Compute intra- and inter-cluster distances (dm, Dm). Hint: use the utility function below
#

# Silhouette score
Sm =
print("With k-means clustering, we found an optimal number of clusters equal to Km=%d with a silhouette score of S=%.3f." % (Km, Sm))
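
As a hedged illustration of unsupervised validation, the sketch below computes the silhouette score for several values of k on synthetic blobs. The data and the Euclidean metric are assumptions; for the lab, the metric should match the one you chose above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D data, not the lab dataset
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Silhouette score per candidate k (values lie in [-1, 1], higher is better)
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels, metric="euclidean")
best_k = max(scores, key=scores.get)
print(scores, best_k)
```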

**Hierarchical clustering**

In [None]:
# Import useful packages for clustering
from scipy.cluster import hierarchy
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist as pdist
from scipy.spatial.distance import squareform as sf

# Choose the main algorithm parameters
method_merging =
distance_metric =

# Apply the algorithm to obtain the hierarchy
Z = hierarchy.linkage(X, method_merging, metric=distance_metric, optimal_ordering=True)

# Visualize the dendrogram
fig21 = plt.figure(figsize=(15, 5))
dn = hierarchy.dendrogram(Z, no_plot=0)
plt.tick_params(axis='y', which='major', labelsize=15)
plt.tick_params(axis='x', which='major', labelsize=8)
plt.ylabel('Distance')   # with the default orientation, merge distances are on the y-axis
plt.grid()
plt.show()

In [None]:
# Cut the dendrogram at a certain inter-cluster distance (max_d)
max_d =

# Form the clusters. Note: subtract 1 in order for the labels to start from 0 (as it happens in k-means++)
hierarchical_labels =

print(hierarchical_labels.shape)
print(hierarchical_labels)

# Confirm that you cut correctly, to have N clusters
Kh = hierarchical_labels.max() + 1
print("We got %d cluster(s)." % Kh)


# Add a horizontal line to the dendrogram indicating the cut
plt.figure(fig21)
plt.axhline(y=max_d, color='k', linestyle='--')
plt.show()

# Find clusters centers. Hint: utility function below
hierarchical_centers =
print("\nWe need to compute %d centroids, as we have %d clusters." % (Kh, Kh) )
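
The mechanics of cutting a dendrogram at a given distance can be checked on a few made-up points. In this sketch the Ward linkage, the Euclidean metric and the cut height are assumptions chosen for the toy data only; they are not the parameters expected for the lab.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated groups of 2D points (made-up coordinates)
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

Z_toy = linkage(pts, method="ward", metric="euclidean")

# Cut at a height between the small within-group merges and the final merge
cut = 2.0
labels = fcluster(Z_toy, t=cut, criterion="distance") - 1   # fcluster labels start at 1
n_clusters = labels.max() + 1

# Centroid of each cluster: mean of the points carrying that label
centers = np.array([pts[labels == k].mean(axis=0) for k in range(n_clusters)])
print(labels, n_clusters)
print(centers)
```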

In [None]:
# Validation
# ----------

# Compute intra- and inter-cluster distances (dh, Dh). Hint: use the utility function below
#


# Silhouette score
Sh =
print("With hierarchical clustering, we found an optimal number of clusters equal to Kh=%d with a silhouette score of Sh=%.3f." % (Kh, Sh))



# Visualize this clustering solution
fig21 = plt.figure('Hierarchical clustering (dendrogram cut at %.2f)' % max_d, figsize=(10,5))
# scatterplot here your data objects using the assigned labels and the given palette
for k in range(Kh):
   # complete the following line to plot the cluster centers
   plt.scatter(CLUSTER_CENTERS, s=100, color=PAL[k], marker='s', edgecolor='black', linewidth=1.5)
sns.set_theme(style='dark')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("For this solution, we got K=%d clusters" % Kh)
plt.grid()
plt.show()

**DBSCAN clustering**

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors as knn
!pip install kneed                   # run this line only the first time you run this cell
from kneed import KneeLocator



# Knee method
neighborhood_order =
neighborhood_set   = knn(n_neighbors=neighborhood_order).fit(X)
distances, indices = neighborhood_set.kneighbors(X)
distances          = np.sort(distances[:,neighborhood_order-1], axis=0)
i = np.arange(len(distances))
knee = KneeLocator(...
knee_x = knee.knee
knee_y = knee.knee_y



fig31 = plt.figure(figsize=(5,5))
# plot here the ordered distances
plt.xlabel("Points")
plt.ylabel("Distance")
plt.grid()
plt.axvline(x=knee_x, color='k', linestyle='--')
plt.axhline(y=knee_y, color='k', linestyle='--')
plt.plot((knee_x), (knee_y), 'o', color='r')
plt.show()


# Apply DBSCAN
dbscan = DBSCAN(...
dbscan_labels = dbscan.labels_

# find the number of clusters formed by the algorithm
Kd = ...

print("We got %d cluster(s)." % Kd)
# print(dbscan_labels)
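
When counting the clusters found by DBSCAN, keep in mind that noise points receive the label -1. The sketch below uses made-up points and arbitrary `eps` and `min_samples` values to show one common way of counting clusters while excluding noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (made-up coordinates)
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1],
                [10.0, 10.0]])

labels = DBSCAN(eps=0.3, min_samples=3).fit(pts).labels_

# DBSCAN marks noise with the label -1, which is not a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(labels, n_clusters, n_noise)
```

Noise points also matter for validation and plotting: a loop over `range(n_clusters)` never visits the label -1.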

In [None]:
# Find clusters centers. Hint: utility function below
dbscan_centers = ...
print("\nWe need to compute %d centroids, as we have %d clusters." % (Kd, Kd) )

In [None]:
# Validation
# ----------

# Compute intra- and inter-cluster distances (dd, Dd). Hint: use utility function below
#


# Silhouette score
Sd =
print("With DBSCAN clustering, we found an optimal number of clusters equal to Kd=%d with a silhouette score of Sd=%.3f." % (Kd, Sd))



# Visualize this clustering solution
fig31 = plt.figure('DBSCAN clustering', figsize=(10,5))
# scatterplot
for k in range(Kd):
   # complete the line below
   plt.scatter(CLUSTER_CENTRES, s=100, marker='s', edgecolor='black', linewidth=1.5, color=PAL[k])
sns.set_theme(style='dark')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("For this solution, we got K=%d clusters" % Kd)
plt.grid()
plt.show()

**Compare clustering solutions**

In [None]:
from sklearn import metrics
y1 = hierarchical_labels   # predicted labels from hierarchical clustering
y2 = kmeans.labels_        # predicted labels from k-means clustering
y3 = dbscan_labels         # predicted labels from DBSCAN clustering

#
#
#
#
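
Without true labels, one possible way to compare two clustering solutions is to measure how much their partitions agree. The sketch below uses the adjusted Rand index from `sklearn.metrics` on made-up label vectors; choosing this index is an assumption, and other agreement measures could be used instead.

```python
import numpy as np
from sklearn import metrics

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 0, 0, 2, 2])   # same partition as a, different label names
c = np.array([0, 1, 0, 1, 0, 1])   # a different partition

# The index ignores label names: identical partitions score 1.0
ari_ab = metrics.adjusted_rand_score(a, b)
ari_ac = metrics.adjusted_rand_score(a, c)
print(ari_ab, ari_ac)
```

A value close to 1 means the two algorithms group the objects in nearly the same way, while values near or below 0 indicate agreement no better than chance.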

# Utility functions

In [None]:
# [FROM SOLUTION OF LAB#4] THIS IS A **METHOD** THAT YOU CAN USE IN THE NEXT LAB SESSIONS TO visualize data in 2D with clusters in different colours

def PCA_tSNE_visualization(data2visualize, NCOMP, LABELS, PAL):

  '''
  INPUT
  data2visualize    - data matrix to visualize
  NCOMP             - no. of components to decompose the dataset during PCA
  LABELS            - labels given by the clustering solution
  PAL               - palette of colours to distinguish between clusters
  '''

  '''
  OUTPUT
  Two figures: one using PCA and one using tSNE
  '''


  # PCA
  from sklearn.decomposition import PCA
  pca = PCA(n_components=NCOMP)
  pca_result = pca.fit_transform(data2visualize)
  print('PCA: explained variation per principal component: {}'.format(pca.explained_variance_ratio_.round(2)))

  # tSNE
  from sklearn.manifold import TSNE
  print('\nApplying tSNE...')
  np.random.seed(100)
  tsne = TSNE(n_components=2, verbose=0, perplexity=20, max_iter=300)   # max_iter replaces n_iter from scikit-learn 1.5; use n_iter=300 on older versions
  tsne_results = tsne.fit_transform(data2visualize)


  # Plots
  fig1000 = plt.figure(figsize=(10,5))
  fig1000.suptitle('Dimensionality reduction of the dataset', fontsize=16)


  # Plot 1: 2D image of the entire dataset
  ax1 = fig1000.add_subplot(121)
  sns.scatterplot(x=pca_result[:,0], y=pca_result[:,1], ax=ax1, hue=LABELS, palette=PAL)
  ax1.set_xlabel('Dimension 1', fontsize=10)
  ax1.set_ylabel('Dimension 2', fontsize=10)
  ax1.title.set_text('PCA')
  plt.grid()

  ax2= fig1000.add_subplot(122)
  sns.scatterplot(x=tsne_results[:,0], y=tsne_results[:,1], ax=ax2, hue=LABELS, palette=PAL)
  ax2.set_xlabel('Dimension 1', fontsize=10)
  ax2.set_ylabel('Dimension 2', fontsize=10)
  ax2.title.set_text('tSNE')
  plt.grid()
  plt.show()

In [None]:
# [FROM SOLUTION OF LAB#3] THIS IS A **METHOD** THAT YOU CAN USE IN THE NEXT LAB SESSIONS TO compute the intra- and inter-cluster distances

def intra_inter_cluster_distances(data, K, labels, cluster_centers, distance_metric):

  '''
  INPUT
  data            - data matrix for which to compute the proximity matrix
  K               - the expected number of clusters
  labels          - predicted labels from the clustering solution applied to data
  cluster_centers - cluster centres from the clustering solution applied to data
  distance_metric - metric to compute the distances within and between clusters. Here, you use the same metric for both measurements (but it might be possible to use two different metrics)
  '''

  '''
  OUTPUT
  d               - intra-cluster distance
  D               - inter-cluster distances
  '''

  from scipy.spatial.distance import pdist as pdist
  from scipy.spatial.distance import squareform as sf


  # Intra-cluster distances (average over all pairwise distances) ----------------- NOTE: bug fixed here!
  PM = pdist(data, metric=distance_metric)
  PM = sf(PM).round(2)

  d = np.zeros(K)
  for k in range(K):
    ind = np.array( np.where(labels == k ) )
    for r in range(ind.size):
      d[k] = d[k] + np.sum( PM[ [ind[0][r]], [ind] ] )
    d[k] = d[k]/2                                          # not to consider pairs of pair-wise distance between objects twice (the PM is symmetric)
    d[k] = d[k]/( (ind.size*(ind.size-1)) / 2 )            # to compute the average among N*(N-1)/2 possible unique pairs
  print("The intra-cluster distances of the clusters are: ", d.round(2))


  # Inter-cluster distance ---------------------------------------------------
  D = pdist(cluster_centers, metric=distance_metric)
  D = sf(D).round(2)
  print("\nAll pair-wise inter-cluster distances:\n", D)

  return d, D

In [None]:
# [FROM SOLUTION OF LAB#2] THIS IS A **METHOD** THAT YOU CAN USE IN THE NEXT LAB SESSIONS TO find cluster centers

def find_cluster_centers(data, K, labels):

  '''
  INPUT
  data    - data matrix for which to compute the cluster centres
  K       - the expected number of clusters
  labels  - predicted labels from the clustering solution applied to data
  '''

  '''
  OUTPUT
  cluster_centers   - cluster centres from the clustering solution applied to data
  '''

  # Initialize the output
  cluster_centers = np.zeros((K, np.shape(data)[1]))   # np.shape(data)[1] = no. of attributes

  print("%d centroids are being computed, as we have %d clusters." % (K, K) )

  for k in range(0, K):
    ind = np.array( np.where( labels == k ) )
    cluster_points = data[ind, :][0]
    cluster_centers[k,:] = np.mean(cluster_points, axis=0) # cluster_points.mean(axis=0)
    print("The centroid of cluster %d has coordinates: " % (k), *cluster_centers[k,:].round(2))

  return cluster_centers

# _This is the end of Lab session #11_ ✅
