Instructions:

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.

In [1]:
import pandas as pd
from ucimlrepo import fetch_ucirepo 

zoo = fetch_ucirepo(id=111) 

X = zoo.data.features
y = zoo.data.targets 
zoo_df = pd.merge(X, y, left_index=True, right_index=True)

zoo_df = zoo_df.dropna()

In [2]:
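import itertools

import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding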

def jaccard_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0

def ochiai_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    denominator = np.sqrt(len(set1) * len(set2))
    return intersection / denominator if denominator != 0 else 0

def overlap_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    min_length = min(len(set1), len(set2))
    return intersection / min_length if min_length != 0 else 0

def dice_coefficient(set1, set2):
    intersection = len(set1.intersection(set2))
    dice_denominator = len(set1) + len(set2)
    return 2 * intersection / dice_denominator if dice_denominator != 0 else 0

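def graph_based_representation(data, num_components):
    """Embed the columns of `data` into `num_components` dimensions."""
    num_samples, num_features = data.shape
    # Pairwise Jaccard similarity between the sets of distinct values in each column.
    similarity_matrix = np.zeros((num_features, num_features))
    for i, j in itertools.combinations(range(num_features), 2):
        similarity_matrix[i, j] = jaccard_coefficient(set(data[:, i]), set(data[:, j]))
        similarity_matrix[j, i] = similarity_matrix[i, j]
    # Graph view of the similarity matrix; kept for inspection, not used below.
    G = nx.from_numpy_array(similarity_matrix)
    # Note: with the default affinity ('nearest_neighbors'), the rows of
    # similarity_matrix are treated as feature vectors, not as a precomputed affinity.
    embedding = SpectralEmbedding(n_components=num_components)
    representation_matrix = embedding.fit_transform(similarity_matrix)
    return representation_matrix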

def joint_operation(data, representation_matrix):
    return np.dot(data, representation_matrix)

def mean_operation(data, representation_matrix):
    return np.mean(np.dot(data, representation_matrix), axis=1)

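def perform_clustering(data, k):
    # No random_state is set, so cluster labels (and the scores derived from them) can vary between runs.
    kmeans = KMeans(n_clusters=k)
    return kmeans.fit_predict(data)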
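
To make the four similarity coefficients defined above concrete, the sketch below evaluates them on two small sets. The sets `a` and `b` are hypothetical and are not taken from the Zoo dataset; the functions are compact restatements of the ones in the cell above so that the block runs on its own.

```python
import math

def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def ochiai(a, b):
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0

def overlap(a, b):
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def dice(a, b):
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

# Hypothetical toy sets (assumption for illustration only).
a = {"hair", "milk", "tail", "legs"}
b = {"milk", "tail", "fins"}

# The sets share 2 elements; |a| = 4, |b| = 3, |a U b| = 5.
scores = {f.__name__: f(a, b) for f in (jaccard, ochiai, overlap, dice)}
for name, value in scores.items():
    print(f"{name:8s} {value:.4f}")
# jaccard  0.4000
# ochiai   0.5774
# overlap  0.6667
# dice     0.5714
```

For any pair of non-empty sets the four values are ordered Jaccard <= Dice <= Ochiai <= overlap, so the choice of coefficient changes the scale of the similarity matrix, not only its values.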

3. Read articles about the Adjusted Rand Index, Normalized Mutual Information, and the Fowlkes-Mallows Index (only use papers published in IEEE, ScienceDirect, SpringerLink, or Taylor & Francis).
4. Aside from the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), use the Fowlkes-Mallows Index (FMI), and compare the results of each performance index.

In [3]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
import networkx as nx
import itertools

p = 10
k = 3  

dataset = zoo_df

results = []

try:
    if dataset.isnull().values.any():
        raise ValueError("Dataset contains missing values. Please handle them before proceeding.")

    X = dataset.drop(columns=['type'])  
    y = dataset['type'].values         

    # isinstance check replaces is_categorical_dtype, which is deprecated in recent pandas releases.
    if not all(pd.api.types.is_numeric_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype) for dtype in X.dtypes):
        raise ValueError("All features must be numeric or categorical for OneHotEncoder.")

    try:
        enc = OneHotEncoder(sparse_output=False)
        X_encoded = enc.fit_transform(X)
    except Exception as e:
        raise ValueError(f"Error during OneHotEncoder transformation: {e}")

    representation_matrix = graph_based_representation(X_encoded, p)
    
    integrated_data = joint_operation(X_encoded, representation_matrix)
    
    labels = perform_clustering(integrated_data, k)

    ARI = adjusted_rand_score(y, labels)
    NMI = normalized_mutual_info_score(y, labels)
    FMI = fowlkes_mallows_score(y, labels)
    
    results.append(['zoo_df', ARI, NMI, FMI])

except ValueError as ve:
    print(f"ValueError: {ve}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_df)

  Dataset       ARI       NMI       FMI
0  zoo_df  0.543157  0.599616  0.683843


5. Compare and contrast each performance index. What are the advantages and disadvantages of ARI, NMI, and FMI, and when should each be used?

Based on the output, FMI gave the highest score (0.683843), followed by NMI (0.599616) and then ARI (0.543157).

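The Adjusted Rand Index (ARI) measures pairwise agreement between two partitions of the same data, typically a predicted clustering and the ground-truth labels. Its advantages are that it corrects for chance agreement and has an interpretable scale: 1 indicates identical partitions, values near 0 indicate agreement no better than random labeling, and negative values indicate agreement worse than chance. Its disadvantages are that it assumes a flat clustering structure, which makes it unsuitable for comparing hierarchical clusterings directly, and that it can be biased when cluster sizes vary widely. Use ARI to measure how well a clustering agrees with ground-truth labels.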

Normalized Mutual Information (NMI) measures how much information two clusterings share, normalized by their entropies so that the score lies between 0 and 1. Its advantages are that it is symmetric, treating the true and predicted clusterings equally, and that it is normalized. Its disadvantages are that it is not corrected for chance and that it can perform poorly when clusters differ greatly in size. Use NMI to compare two clusterings symmetrically.

The Fowlkes-Mallows Index (FMI) is the geometric mean of pairwise precision and pairwise recall. Its advantages are that it balances the two and makes no assumptions about cluster structure. Its disadvantage is that it considers pairwise relationships only: it ignores pairs that both clusterings separate, and it is not corrected for chance. Use FMI for a balanced evaluation of clustering performance.
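
The pair-counting view behind ARI and FMI can be sketched in a few lines of plain Python. This is a minimal illustration under the standard pair-counting definitions, not the code used for the results above; the label vectors `true` and `pred` are hypothetical, and the helper names `pair_counts`, `fmi`, and `ari` are introduced here for illustration only.

```python
from itertools import combinations
from math import sqrt

def pair_counts(true, pred):
    """Classify every pair of samples by whether each labeling groups it together."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true)), 2):
        same_true = true[i] == true[j]
        same_pred = pred[i] == pred[j]
        if same_true and same_pred:
            tp += 1   # together in both labelings
        elif same_pred:
            fp += 1   # together only in the prediction
        elif same_true:
            fn += 1   # together only in the ground truth
        else:
            tn += 1   # separated in both labelings
    return tp, fp, fn, tn

def fmi(true, pred):
    """Geometric mean of pairwise precision and recall; true negatives are ignored."""
    tp, fp, fn, _ = pair_counts(true, pred)
    if tp == 0:
        return 0.0
    return sqrt((tp / (tp + fp)) * (tp / (tp + fn)))

def ari(true, pred):
    """Pair agreement minus its expected value under random labeling, rescaled."""
    tp, fp, fn, tn = pair_counts(true, pred)
    total = tp + fp + fn + tn
    expected = (tp + fp) * (tp + fn) / total
    maximum = ((tp + fp) + (tp + fn)) / 2
    if maximum == expected:  # degenerate case: both labelings are trivial
        return 1.0
    return (tp - expected) / (maximum - expected)

# Hypothetical toy labelings (assumption for illustration only).
true = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]

print(pair_counts(true, pred))      # (2, 1, 4, 8)
print(round(fmi(true, pred), 4))    # 0.4714
print(round(ari(true, pred), 4))    # 0.2424
```

On this toy example the same pair counts give a noticeably lower ARI than FMI, because ARI subtracts the 1.2 pairs of agreement expected by chance before rescaling.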

6. Using K-modes and hierarchical clustering on the same dataset, perform categorical data clustering and compare performance using FMI, ARI, and NMI.

In [4]:
from kmodes.kmodes import KModes
from scipy.cluster.hierarchy import linkage, fcluster

def perform_kmodes_clustering(data, k):
    kmodes = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
    labels = kmodes.fit_predict(data)
    return labels

def perform_hierarchical_clustering(data, k):
    Z = linkage(data, method='ward')
    labels = fcluster(Z, k, criterion='maxclust')
    return labels
    
k = 3

dataset = zoo_df

results = []

try:
    if dataset.isnull().values.any():
        raise ValueError("Dataset contains missing values. Please handle them before proceeding.")

    X = dataset.drop(columns=['type']) 
    y = dataset['type'].values         

    # isinstance check replaces is_categorical_dtype, which is deprecated in recent pandas releases.
    if not all(pd.api.types.is_numeric_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype) for dtype in X.dtypes):
        raise ValueError("All features must be numeric or categorical for OneHotEncoder.")

    try:
        enc = OneHotEncoder(sparse_output=False)
        X_encoded = enc.fit_transform(X)
    except Exception as e:
        raise ValueError(f"Error during OneHotEncoder transformation: {e}")

    kmodes_labels = perform_kmodes_clustering(X_encoded, k)
    ARI_kmodes = adjusted_rand_score(y, kmodes_labels)
    NMI_kmodes = normalized_mutual_info_score(y, kmodes_labels)
    FMI_kmodes = fowlkes_mallows_score(y, kmodes_labels)
    results.append(['Kmodes', ARI_kmodes, NMI_kmodes, FMI_kmodes])

    # X_encoded is already a dense array because the encoder uses sparse_output=False.
    hierarchical_labels = perform_hierarchical_clustering(X_encoded, k)
    ARI_hierarchical = adjusted_rand_score(y, hierarchical_labels)
    NMI_hierarchical = normalized_mutual_info_score(y, hierarchical_labels)
    FMI_hierarchical = fowlkes_mallows_score(y, hierarchical_labels)
    results.append(['Hierarchical', ARI_hierarchical, NMI_hierarchical, FMI_hierarchical])

except ValueError as ve:
    print(f"ValueError: {ve}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

results_df = pd.DataFrame(results, columns=["Method", "ARI", "NMI", "FMI"])
print(results_df)

         Method       ARI       NMI       FMI
0        Kmodes  0.733623  0.738007  0.820112
1  Hierarchical  0.716368  0.762020  0.812457
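
Ward linkage, as used in `perform_hierarchical_clustering`, is defined in terms of Euclidean distance. For one-hot encoded categorical features, one alternative that could be tried is average linkage on Hamming distances. The sketch below is not part of the replicated study and was not used for the results above; the matrix `X_toy` is hypothetical, and only standard SciPy calls (`pdist`, `linkage`, `fcluster`) are assumed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical one-hot rows: two identical pairs of samples (assumption for illustration).
X_toy = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
])

# Condensed vector of pairwise Hamming distances (fraction of differing columns).
distances = pdist(X_toy, metric="hamming")

# Average linkage accepts any precomputed distance, unlike Ward.
Z = linkage(distances, method="average")

# Cut the dendrogram into at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The first two rows end up in one cluster and the last two in the other, since the distance within each pair is 0 and the distance between pairs is 1.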


7. Write your report using LaTeX. Your report should focus on the "whys and whats" of each performance metric (i.e., why is FMI always greater than ARI and NMI? What is the problem with ARI and NMI?).

In both K-modes and hierarchical clustering, the Fowlkes-Mallows Index (FMI) gave the highest scores (0.820112 for K-modes and 0.812457 for hierarchical), followed by Normalized Mutual Information (NMI) (0.738007 for K-modes and 0.762020 for hierarchical), with the Adjusted Rand Index (ARI) giving the lowest (0.733623 for K-modes and 0.716368 for hierarchical).

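Apart from degenerate labelings, FMI is never lower than ARI. FMI is built only from the pairs that at least one labeling places together (true positives, false positives, and false negatives) and is not corrected for chance. ARI is computed from the same pair counts but subtracts the agreement expected under random labeling, which lowers the score. NMI is not pair-based: it is computed from the entropies of the two labelings, and although it was below FMI in every run here, that ordering is not guaranteed in general. ARI and NMI also use different normalization schemes, and both may have limitations with extremely large datasets or highly imbalanced clusters.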
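
The relationship between FMI and ARI can be made precise with a short derivation. The following is a sketch under the standard pair-counting definitions; the symbols $TP$, $FP$, $FN$, $a$, $b$, and $N$ are notation introduced here and do not come from the replicated article. It assumes $ab > 0$ and a positive ARI denominator, which excludes degenerate labelings.

```latex
Let $n$ be the number of samples and $N = \binom{n}{2}$ the number of sample pairs.
Let $a = TP + FN$ be the number of pairs grouped together by the ground truth and
$b = TP + FP$ the number grouped together by the clustering. Then
\[
  \mathrm{FMI} = \frac{TP}{\sqrt{ab}},
  \qquad
  \mathrm{ARI} = \frac{TP - E}{\tfrac{1}{2}(a + b) - E},
  \qquad
  E = \frac{ab}{N}.
\]
Because $TP \le \min(a, b) \le \tfrac{1}{2}(a + b)$ and $E \ge 0$, subtracting $E$
from both the numerator and the denominator of $TP / \tfrac{1}{2}(a + b)$ cannot
increase it. The arithmetic mean is at least the geometric mean, so
\[
  \mathrm{ARI}
  \;\le\; \frac{TP}{\tfrac{1}{2}(a + b)}
  \;\le\; \frac{TP}{\sqrt{ab}}
  \;=\; \mathrm{FMI}.
\]
No analogous bound holds for NMI. For the labelings $(0, 0, 1, 1)$ and
$(0, 1, 2, 3)$ we have $TP = 0$, hence $\mathrm{FMI} = 0$, while NMI with
arithmetic-mean normalization equals $\ln 2 / (1.5 \ln 2) = 2/3$.
```

The gap between FMI and ARI therefore comes from two sources: the chance correction $E$ and the difference between the arithmetic and geometric means of $a$ and $b$.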

Even so, ARI and NMI are more widely used than FMI because their normalization allows fair comparison across algorithms and datasets. They are also easier to understand and more robust.