Instructions:
-

1. Read the article: https://www.sciencedirect.com/science/article/abs/pii/S0031320322001753
2. Replicate the study using the same dataset.

In [1]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

# Fetching the datasets
soybean_data = fetch_ucirepo('soybean')
zoo_data = fetch_ucirepo('zoo')
heart_disease_data = fetch_ucirepo('heart disease')
breast_cancer_data = fetch_ucirepo('breast cancer')
dermatology_data = fetch_ucirepo('dermatology')
mushroom_data = fetch_ucirepo('mushroom')

In [2]:
datasets = [
    ('soybean_data', 'soybean_df'),
    ('zoo_data', 'zoo_df'),
    ('heart_disease_data', 'heart_disease_df'),
    ('breast_cancer_data', 'breast_cancer_df'),
    ('dermatology_data', 'dermatology_df'),
    ('mushroom_data', 'mushroom_df')
]

# Loop over datasets to fetch and create dataframes
for dataset_name, dataframe_name in datasets:
    data = fetch_ucirepo(dataset_name.split('_')[0])
    X = data.data.features
    y = data.data.targets
    df = pd.merge(X, y, left_index=True, right_index=True)
    df = df.dropna()
    globals()[dataframe_name] = df

In [3]:
soybean_df

Unnamed: 0,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,...,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots,class
0,6.0,0.0,2.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,diaporthe-stem-canker
1,4.0,0.0,2.0,1.0,0.0,2.0,0.0,2.0,1.0,1.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,diaporthe-stem-canker
2,3.0,0.0,2.0,1.0,0.0,1.0,0.0,2.0,1.0,2.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,diaporthe-stem-canker
3,3.0,0.0,2.0,1.0,0.0,1.0,0.0,2.0,0.0,1.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,diaporthe-stem-canker
4,6.0,0.0,2.0,1.0,0.0,2.0,0.0,1.0,0.0,2.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,diaporthe-stem-canker
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,5.0,1.0,2.0,1.0,0.0,1.0,2.0,1.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,frog-eye-leaf-spot
286,4.0,0.0,2.0,2.0,0.0,1.0,3.0,1.0,1.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,frog-eye-leaf-spot
287,5.0,0.0,2.0,1.0,0.0,1.0,2.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,frog-eye-leaf-spot
288,5.0,0.0,2.0,2.0,0.0,2.0,0.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,frog-eye-leaf-spot


3. Read articles about Adjusted Rand Index, Normalized Mutual Information, and Folkes-Mallows Index (only use paper published in IEEE, sciencedirect, springerlink, Taylor Francis).
4. Aside from the Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI), use the Folkes-Mallows Index (FMI), and compare the result of each performance index.


In [4]:
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import fowlkes_mallows_score, adjusted_rand_score, normalized_mutual_info_score
import numpy as np

datasets = ["soybean_df", "zoo_df", "heart_disease_df", "dermatology_df", "breast_cancer_df", "mushroom_df"]

results = []

for dataset_name in datasets:
    # Fetch dataset
    df = globals()[dataset_name]
    
    # Assuming the last column is the target/label column
    X = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values
    
    # Label encode the target variable if it's categorical
    le = LabelEncoder()
    y = le.fit_transform(y)
    
    # Handle categorical features using one-hot encoding if present
    if np.any(pd.DataFrame(X).dtypes == 'object'):
        # One-hot encode categorical features
        categorical_columns = np.where(pd.DataFrame(X).dtypes == 'object')[0]
        onehot_encoder = OneHotEncoder(categories='auto')
        X_categorical = onehot_encoder.fit_transform(X[:, categorical_columns])
        
        # Reshape one-hot encoded features to match the number of dimensions of numerical features
        X_categorical = X_categorical.toarray().reshape(-1, X_categorical.shape[1])
        
        # Concatenate one-hot encoded features with numerical features
        X_numeric = np.delete(X, categorical_columns, axis=1)
        X = np.hstack((X_numeric, X_categorical))
    
    # Perform any necessary preprocessing (e.g., scaling)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Perform clustering (example using KMeans)
    kmeans = KMeans(n_clusters=len(np.unique(y)))
    predicted_labels = kmeans.fit_predict(X_scaled)
    
    # Compute performance metrics
    ari = adjusted_rand_score(y, predicted_labels)
    nmi = normalized_mutual_info_score(y, predicted_labels)
    fmi = fowlkes_mallows_score(y, predicted_labels)
    
    # Append results to the results list
    results.append([dataset_name, ari, nmi, fmi])

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_df)


            Dataset       ARI       NMI       FMI
0        soybean_df  0.410024  0.722422  0.473503
1            zoo_df  0.634201  0.757663  0.715332
2  heart_disease_df  0.184912  0.203061  0.416507
3    dermatology_df  0.808456  0.882125  0.852665
4  breast_cancer_df  0.001574  0.000130  0.756814
5       mushroom_df  0.496017  0.479574  0.790503


5. Compare and contrast each performance index, what are the advantages and disadvantages of ARI, NMI, and FMI, and when to use each?


The Adjusted Rand Index (ARI) measures the similarity between two clusterings by comparing the agreement of pairs of samples in the true and predicted clusters, after controlling for chance. It ranges from -1 to 1, with 1 indicating perfect agreement between the two clusters, 0 indicating random agreement, and negative values indicating disagreement. Normalized Mutual Information (NMI) measures the mutual information shared by true and predicted labels, which is normalized by entropy terms to account for chance. It provides a normalized measure of clustering quality, with values closer to one indicating better agreement. The Folkes-Mallows Index (FMI) assesses clustering performance by taking into account the pairwise agreement of elements in the same clusters across two clusterings, emphasizing the significance of pairwise relationships. While ARI and NMI are widely used to assess clustering quality, FMI provides insights into finer-grained clustering structures, making it useful for analyzing pairwise similarities in cluster assignments.

6. Using Kmodes and Hierarchical Clustering, use the same dataset and perform categorical data clustering, use FMI, ARI, and NMI for the comparison of performance.


In [5]:
from kmodes.kmodes import KModes
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

results_categorical = []

for dataset_name in datasets:
    dataset = globals()[dataset_name]
    X_cat = dataset.iloc[:, :-1]
    true_labels_cat = dataset.iloc[:, -1]

    encoder = LabelEncoder()
    X_cat_encoded = X_cat.apply(encoder.fit_transform)

    km = KModes(n_clusters=3, init='Huang', n_init=5, verbose=0)
    km_labels = km.fit_predict(X_cat_encoded)
    ARI_km = adjusted_rand_score(true_labels_cat, km_labels)
    NMI_km = normalized_mutual_info_score(true_labels_cat, km_labels)
    FMI_km = fowlkes_mallows_score(true_labels_cat, km_labels)

    ac = AgglomerativeClustering(n_clusters=3, linkage='ward')
    ac_labels = ac.fit_predict(X_cat_encoded)
    ARI_ac = adjusted_rand_score(true_labels_cat, ac_labels)
    NMI_ac = normalized_mutual_info_score(true_labels_cat, ac_labels)
    FMI_ac = fowlkes_mallows_score(true_labels_cat, ac_labels)
    
    results_categorical.append([dataset_name + " (Kmodes)", ARI_km, NMI_km, FMI_km])
    results_categorical.append([dataset_name + " (Hierarchical)", ARI_ac, NMI_ac, FMI_ac])

results_categorical_df = pd.DataFrame(results_categorical, columns=["Dataset", "ARI", "NMI", "FMI"])
print(results_categorical_df)

                            Dataset       ARI       NMI       FMI
0               soybean_df (Kmodes)  0.101866  0.293170  0.280213
1         soybean_df (Hierarchical)  0.190929  0.464970  0.389441
2                   zoo_df (Kmodes)  0.733623  0.738007  0.820112
3             zoo_df (Hierarchical)  0.461701  0.585056  0.645736
4         heart_disease_df (Kmodes)  0.197039  0.194323  0.472942
5   heart_disease_df (Hierarchical)  0.009543  0.010944  0.353434
6           dermatology_df (Kmodes)  0.515638  0.679720  0.677660
7     dermatology_df (Hierarchical)  0.032201  0.078147  0.293614
8         breast_cancer_df (Kmodes)  0.002596  0.005221  0.461092
9   breast_cancer_df (Hierarchical)  0.046450  0.061245  0.499983
10             mushroom_df (Kmodes)  0.297926  0.312381  0.604856
11       mushroom_df (Hierarchical)  0.298722  0.431275  0.610515


7. Write your report using Latex. Your report should be focused on the "why's and the what's" of each performance metrices (i.e. why is FMI always greater than ARI and NMI? What's the problem with ARI and NMI?).

FMI tends to produce higher values than ARI and NMI because it focuses on pairwise agreements between elements in the same cluster. While ARI and NMI provide useful insights into clustering quality, they may show lower values in certain scenarios, such as varying cluster sizes or noisy data, emphasizing the importance of taking into account multiple performance metrics when evaluating clustering algorithms.