<h1>Advanced Analysis of Anuran Calls: Multi-class and Multi-label Classification with Support Vector Machines and K-Means Clustering</h1>

<h5>This project aims to apply advanced machine learning techniques to the Anuran Calls (MFCCs) Dataset for both classification and clustering challenges. It involves multi-class and multi-label classification using Support Vector Machines (SVMs) and explores the effectiveness of various SVM approaches, including Gaussian kernels, L1-penalized SVMs, and Classifier Chains. The project also addresses class imbalance and evaluates classifiers using metrics like exact match, hamming score, and hamming loss. Additionally, K-Means clustering is used to analyze the dataset, with a focus on determining the majority label in each cluster and calculating the Hamming distance, score, and loss. The project employs a Monte-Carlo Simulation approach to report the average and standard deviation of the results.</h5>

## Multi-class and Multi-Label Classification Using Support Vector Machines

Import packages

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import hamming_loss, silhouette_samples, silhouette_score, classification_report, confusion_matrix, f1_score
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings("ignore")

### Download the Anuran Calls (MFCCs) Data Set

In [2]:
MFCC_df = pd.read_csv('../data/Anuran Calls (MFCCs)/Frogs_MFCCs.csv')
MFCC_df

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species,RecordID
0,1.0,0.152936,-0.105586,0.200722,0.317201,0.260764,0.100945,-0.150063,-0.171128,0.124676,...,-0.108351,-0.077623,-0.009568,0.057684,0.118680,0.014038,Leptodactylidae,Adenomera,AdenomeraAndre,1
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,-0.090974,-0.056510,-0.035303,0.020140,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre,1
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,-0.050691,-0.023590,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre,1
3,1.0,0.224392,0.118985,0.329432,0.372088,0.361005,0.015501,-0.194347,-0.098181,0.270375,...,-0.136009,-0.177037,-0.130498,-0.054766,-0.018691,0.023954,Leptodactylidae,Adenomera,AdenomeraAndre,1
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.172700,0.266434,...,-0.048885,-0.053074,-0.088550,-0.031346,0.108610,0.079244,Leptodactylidae,Adenomera,AdenomeraAndre,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7190,1.0,-0.554504,-0.337717,0.035533,0.034511,0.443451,0.093889,-0.100753,0.037087,0.081075,...,0.069430,0.071001,0.021591,0.052449,-0.021860,-0.079860,Hylidae,Scinax,ScinaxRuber,60
7191,1.0,-0.517273,-0.370574,0.030673,0.068097,0.402890,0.096628,-0.116460,0.063727,0.089034,...,0.061127,0.068978,0.017745,0.046461,-0.015418,-0.101892,Hylidae,Scinax,ScinaxRuber,60
7192,1.0,-0.582557,-0.343237,0.029468,0.064179,0.385596,0.114905,-0.103317,0.070370,0.081317,...,0.082474,0.077771,-0.009688,0.027834,-0.000531,-0.080425,Hylidae,Scinax,ScinaxRuber,60
7193,1.0,-0.519497,-0.307553,-0.004922,0.072865,0.377131,0.086866,-0.115799,0.056979,0.089316,...,0.051796,0.069073,0.017963,0.041803,-0.027911,-0.096895,Hylidae,Scinax,ScinaxRuber,60


Choose 70% of the data randomly as the training set.

In [3]:
# there are 3 target attributes-- Family, Genus, Species 
Family = MFCC_df['Family']
Genus = MFCC_df['Genus']
Species = MFCC_df['Species']

# we need to get 70% the data randomly as the training sets
total_num = len(MFCC_df)
train_index = random.sample(range(total_num), int(total_num * 0.7))

test_index = []
for i in range(total_num):
    if i not in train_index:
        test_index.append(i)

# get train and test stes
MFCC_train = MFCC_df.iloc[train_index, :].reset_index(drop=True)
MFCC_test = MFCC_df.iloc[test_index, :].reset_index(drop=True)

# target features for training
X_train = MFCC_train.iloc[:, :-4]
Family_train = MFCC_train['Family']
Genus_train = MFCC_train['Genus']
Species_train = MFCC_train['Species']

# target features for testing
X_test = MFCC_test.iloc[:, :-4]
Family_test = MFCC_test['Family']
Genus_test = MFCC_test['Genus']
Species_test = MFCC_test['Species']

In [4]:
MFCC_train.shape

(5036, 26)

In [5]:
MFCC_test.shape

(2159, 26)

### Train a classifier for each label

#### Research

<b>Exact Match Ratio</b>: The exact match ratio calculates the percentage of samples where the predicted set of labels exactly matches the true set of labels.


<b>Hamming Score/Loss</b>: The hamming loss calculates the fraction of labels that are incorrectly predicted.

In [8]:
def multiLabelMethods(trueY, predictY):
    # Hamming Loss
    hamming = np.mean(trueY.values != predictY.values)

    # Exact Match Ratio
    exact_match = np.all(trueY.values == predictY.values, axis=1)
    exact_ratio = np.mean(exact_match)

    return np.round(exact_ratio, 4), np.round(hamming, 4), 

In [9]:
def multiLabelEva(title, testX, trueY, classifiers):
    predictY = pd.DataFrame({label: clf.predict(testX) for label, clf in classifiers.items()})
    print(f"Multilabel evaluation of {title}")
    hamming, exact_ratio = multiLabelMethods(trueY, predictY)
    result = {"Hamming Loss": [hamming], "Exact Match Ratio": [exact_ratio]}
    print(pd.DataFrame(data=result))
    return [hamming, exact_ratio]

#### Train a SVM for each of the labels

Train a SVM for each of the labels, using Gaussian kernels and one versus all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian Kernel using 10 fold cross validation. You are welcome to try to solve the problem with both standardized and raw attributes and report the results.

In [14]:
def ParameterSearch(classifier, tuned_params, trainX, trainY, testX, testY, label):
    splitter = StratifiedKFold(n_splits=10, random_state=1234, shuffle=True)
    kwargs = {
        'param_grid': tuned_params,
        'cv': splitter,
        'scoring': 'f1_weighted',
        'verbose': 1
    }

    clf = GridSearchCV(estimator=classifier, **kwargs)
    clf.fit(trainX, trainY)
    
    print(f"Class: {label} (Gaussian SVC without Standardization)")

    # get the results stats
    print("Grid Search Results:\n")
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        mean_rounded = round(mean, 3)
        std_rounded = round(std * 2, 3)
        print(f"Mean Score: {mean_rounded} (±{std_rounded}) for Parameters: {params}")
    
    print("\nThe best parameter setting is:")
    print(clf.best_params_, "\n")
    
    test_pred = clf.predict(testX)
    print(classification_report(testY, test_pred))
    
    return clf

In [20]:
gaussianSVC_classifiers = {}
tuned_params = {
    'C': np.logspace(0, 3, 5),
    'gamma': np.logspace(-4, 2, 7)
}

In [22]:
train_data = {
    'train_x': X_train,
    'train_family': Family_train,
    'train_genus': Genus_train,
    'train_species': Species_train
}

test_data = {
    'test_x': X_test,
    'test_family': Family_test,
    'test_genus': Genus_test,
    'test_species': Species_test
}
labels = ['Family', 'Genus', 'Species']

<b>Gaussian SVC with Raw Attributes</b>

In [23]:
for label in labels:
    print(label,"(Gaussian SVC without Standardization):")
    result = ParameterSearch(
        SVC(kernel='rbf'),
        tuned_params,
        X_train,
        train_data[f'train_{label.lower()}'],
        X_test,
        test_data[f'test_{label.lower()}'],
        label
    )
    gaussianSVC_classifiers[label] = result

Family (Gaussian SVC without Standardization):
Fitting 10 folds for each of 35 candidates, totalling 350 fits
Class: Family (Gaussian SVC without Standardization)
Grid Search Results:

Mean Score: 0.472 (±0.002) for Parameters: {'C': 1.0, 'gamma': 0.0001}
Mean Score: 0.485 (±0.013) for Parameters: {'C': 1.0, 'gamma': 0.001}
Mean Score: 0.866 (±0.035) for Parameters: {'C': 1.0, 'gamma': 0.01}
Mean Score: 0.935 (±0.017) for Parameters: {'C': 1.0, 'gamma': 0.1}
Mean Score: 0.986 (±0.007) for Parameters: {'C': 1.0, 'gamma': 1.0}
Mean Score: 0.985 (±0.015) for Parameters: {'C': 1.0, 'gamma': 10.0}
Mean Score: 0.761 (±0.048) for Parameters: {'C': 1.0, 'gamma': 100.0}
Mean Score: 0.472 (±0.002) for Parameters: {'C': 5.623413251903491, 'gamma': 0.0001}
Mean Score: 0.791 (±0.024) for Parameters: {'C': 5.623413251903491, 'gamma': 0.001}
Mean Score: 0.92 (±0.017) for Parameters: {'C': 5.623413251903491, 'gamma': 0.01}
Mean Score: 0.958 (±0.008) for Parameters: {'C': 5.623413251903491, 'gamma': 0.

In [28]:
multiLabelEva("Gaussian SVC with Raw Attributes", X_test, MFCC_test.iloc[:, -4:-1], gaussianSVC_classifiers)

Multilabel evaluation of Gaussian SVC with Raw Attributes
   Hamming Loss  Exact Match Ratio
0        0.9856               0.01


[0.9856, 0.01]

<b>Gaussian SVC with Standardized Attributes</b>

In [29]:
# standardize the attributes
StdScaler = StandardScaler()
X_train_STD = StdScaler.fit_transform(X_train)
X_test_STD = StdScaler.fit_transform(X_test)

In [30]:
for label in labels:
    # print(label,"(Gaussian SVC without Standardization):")
    result = ParameterSearch(
        SVC(kernel='rbf'),
        tuned_params,
        X_train_STD,
        train_data[f'train_{label.lower()}'],
        X_test_STD,
        test_data[f'test_{label.lower()}'],
        label
    )
    gaussianSVC_classifiers[label] = result

Fitting 10 folds for each of 35 candidates, totalling 350 fits
Class: Family (Gaussian SVC without Standardization)
Grid Search Results:

Mean Score: 0.801 (±0.02) for Parameters: {'C': 1.0, 'gamma': 0.0001}
Mean Score: 0.928 (±0.017) for Parameters: {'C': 1.0, 'gamma': 0.001}
Mean Score: 0.965 (±0.01) for Parameters: {'C': 1.0, 'gamma': 0.01}
Mean Score: 0.988 (±0.012) for Parameters: {'C': 1.0, 'gamma': 0.1}
Mean Score: 0.898 (±0.021) for Parameters: {'C': 1.0, 'gamma': 1.0}
Mean Score: 0.55 (±0.03) for Parameters: {'C': 1.0, 'gamma': 10.0}
Mean Score: 0.475 (±0.006) for Parameters: {'C': 1.0, 'gamma': 100.0}
Mean Score: 0.916 (±0.019) for Parameters: {'C': 5.623413251903491, 'gamma': 0.0001}
Mean Score: 0.942 (±0.017) for Parameters: {'C': 5.623413251903491, 'gamma': 0.001}
Mean Score: 0.985 (±0.01) for Parameters: {'C': 5.623413251903491, 'gamma': 0.01}
Mean Score: 0.991 (±0.01) for Parameters: {'C': 5.623413251903491, 'gamma': 0.1}
Mean Score: 0.902 (±0.02) for Parameters: {'C': 5

In [31]:
multiLabelEva("Gaussian SVC with Standardized Attributes", X_test_STD, MFCC_test.iloc[:, -4:-1], gaussianSVC_classifiers)

Multilabel evaluation of Gaussian SVC with Standardized Attributes
   Hamming Loss  Exact Match Ratio
0        0.9819             0.0117


[0.9819, 0.0117]

<b>L1-penalized SVMs with Standardized Attributes</b>

In [32]:
L1_penalized_std_attri= {}
tuned_params = {
    'C': np.logspace(0, 3, 5)
}
for label in labels:
    # print(label,"(Gaussian SVC without Standardization):")
    result = ParameterSearch(
        LinearSVC(penalty='l1', dual=False),
        tuned_params,
        X_train_STD,
        train_data[f'train_{label.lower()}'],
        X_test_STD,
        test_data[f'test_{label.lower()}'],
        label
    )
    L1_penalized_std_attri[label] = result

Fitting 10 folds for each of 5 candidates, totalling 50 fits
Class: Family (Gaussian SVC without Standardization)
Grid Search Results:

Mean Score: 0.93 (±0.018) for Parameters: {'C': 1.0}
Mean Score: 0.93 (±0.018) for Parameters: {'C': 5.623413251903491}
Mean Score: 0.93 (±0.018) for Parameters: {'C': 31.622776601683793}
Mean Score: 0.93 (±0.018) for Parameters: {'C': 177.82794100389228}
Mean Score: 0.93 (±0.018) for Parameters: {'C': 1000.0}

The best parameter setting is:
{'C': 5.623413251903491} 

                 precision    recall  f1-score   support

      Bufonidae       0.00      0.00      0.00        23
  Dendrobatidae       0.90      0.92      0.91       161
        Hylidae       0.92      0.93      0.92       666
Leptodactylidae       0.96      0.97      0.97      1309

       accuracy                           0.94      2159
      macro avg       0.70      0.70      0.70      2159
   weighted avg       0.93      0.94      0.94      2159

Fitting 10 folds for each of 5 can

In [33]:
multiLabelEva("Support Vector Classifier with L1-penalty", X_test_STD, MFCC_test.iloc[:, -4:-1], L1_penalized_std_attri)

Multilabel evaluation of Support Vector Classifier with L1-penalty
   Hamming Loss  Exact Match Ratio
0        0.9231             0.0513


[0.9231, 0.0513]

#### using SMOTE or any other method for imbalance

In [44]:
def SMOTEParameterSearch(classifier, settings, X_train, Y_train, X_test, Y_test, label):
    naive_model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', classifier)
    ])
    selected_model = ParameterSearch(naive_model, settings, X_train, Y_train, X_test, Y_test, label)
    print(f"Class: {label} (L1-penalized and SMOTE with Standardization)")
    return selected_model

In [49]:
tuned_params = {'classification__C': np.logspace(0, 3, 5)}
SMOTE_SVC_classifiers = {}
# splitter = StratifiedKFold(n_splits=10, random_state=1234, shuffle=True)
# kwargs = {
#     'param_grid': tuned_params,
#     'cv': splitter,
#     'scoring': 'f1_weighted',
#     'verbose': 1
# }
labels = ['Family', 'Genus', 'Species']

In [50]:
for label in labels:
    SMOTE_SVC_classifiers[label] = SMOTEParameterSearch(
        LinearSVC(penalty='l1', dual=False),
        tuned_params,
        X_train_STD,
        train_data[f'train_{label.lower()}'],
        X_test_STD,
        test_data[f'test_{label.lower()}'],
        label
    )

Fitting 10 folds for each of 5 candidates, totalling 50 fits
Class: Family (Gaussian SVC without Standardization)
Grid Search Results:

Mean Score: 0.921 (±0.022) for Parameters: {'classification__C': 1.0}
Mean Score: 0.922 (±0.023) for Parameters: {'classification__C': 5.623413251903491}
Mean Score: 0.922 (±0.024) for Parameters: {'classification__C': 31.622776601683793}
Mean Score: 0.921 (±0.024) for Parameters: {'classification__C': 177.82794100389228}
Mean Score: 0.922 (±0.024) for Parameters: {'classification__C': 1000.0}

The best parameter setting is:
{'classification__C': 1000.0} 

                 precision    recall  f1-score   support

      Bufonidae       0.36      0.91      0.52        23
  Dendrobatidae       0.77      0.99      0.86       161
        Hylidae       0.94      0.89      0.91       666
Leptodactylidae       0.98      0.94      0.96      1309

       accuracy                           0.93      2159
      macro avg       0.76      0.93      0.81      2159
  

In [51]:
multiLabelEva("SMOTE with L1-penalized SVMs and Standardized Attributes", X_test_STD, MFCC_test.iloc[:, -4:-1], SMOTE_SVC_classifiers)

Multilabel evaluation of SMOTE with L1-penalized SVMs and Standardized Attributes
   Hamming Loss  Exact Match Ratio
0        0.8685             0.0684


[0.8685, 0.0684]

In [63]:
Gaussian_SVC_Raw_Attributes = pd.DataFrame({
    "Method": ["Gaussian_SVC_Raw_Attributes"],
    "Hamming Loss": [0.9856],
    "Exact Match Ratio": [0.01]
})

Gaussian_SVC_Standardized_Attributes = pd.DataFrame({
    "Method": ["Gaussian_SVC_Standardized_Attributes"],
    "Hamming Loss": [0.9819],
    "Exact Match Ratio": [0.0117]
})

L1_Penalized = pd.DataFrame({
    "Method": ["L1_Penalized"],
    "Hamming Loss": [0.9231],
    "Exact Match Ratio": [0.0513]
})

SMOTE = pd.DataFrame({
    "Method": ["SMOTE"],
    "Hamming Loss": [0.8685],
    "Exact Match Ratio": [0.0684]
})

In [64]:
# Concatenating the DataFrames
combined_df = pd.concat([Gaussian_SVC_Raw_Attributes, Gaussian_SVC_Standardized_Attributes, L1_Penalized, SMOTE])

# Set 'Method' as the index (optional)
combined_df.set_index('Method', inplace=True)
combined_df

Unnamed: 0_level_0,Hamming Loss,Exact Match Ratio
Method,Unnamed: 1_level_1,Unnamed: 2_level_1
Gaussian_SVC_Raw_Attributes,0.9856,0.01
Gaussian_SVC_Standardized_Attributes,0.9819,0.0117
L1_Penalized,0.9231,0.0513
SMOTE,0.8685,0.0684


## K-Means Clustering on a Multi-Class and Multi-Label Data Set

- Use k-means clustering
- Determine which family is the majority
- Calculate the average Hamming distance, Hamming score, and Hamming loss

In [97]:
# get optimal k
def OptimalK(X, num_clusters, rand_state):
    # method used is Silhouettes
    silhouette_scores = []
    for n in range(2, num_clusters+1):
        kmeans = KMeans(n_clusters=n, random_state=rand_state)
        cluster_labels = kmeans.fit_predict(X)
        silhouette_avg = silhouette_score(X, cluster_labels)
        silhouette_scores.append(silhouette_avg)
    optimal_K = silhouette_scores.index(max(silhouette_scores)) + 2  # Adding 2 as range starts from 2
    print(f"The optimal K is: {optimal_K}")
    return optimal_K

# majority labels of a cluster
def MajorityLabels(optimal_K, cluster_labels, Y):
    cluster_major = pd.DataFrame(columns=Y.columns)
    for cluster in range(optimal_K):
        index, = np.where(cluster_labels == cluster)
        cluster_samples = Y.iloc[index, :]
        row = []
        for label in Y.columns:
            cur_major = cluster_samples.loc[:, label].value_counts().index[0]
            row.append(cur_major)
        cluster_major.loc[cluster] = row
    return cluster_major

# calculate hamming distance/loss
# def Hamming_Distance_Loss(cluster_major, cluster_labels, Y):
#     cluster_major = cluster_major.loc[cluster_labels]  # Align the labels
#     hamming_dist = (cluster_major != Y).sum().sum() / (Y.shape[0] * Y.shape[1])
#     hamming_loss = hamming_dist / Y.shape[1]
#     return hamming_dist, hamming_loss

def Hamming_Distance_Loss(cluster_major, cluster_labels, Y):
    cluster_major = cluster_major.loc[cluster_labels]  # Align the labels
    Y = Y.iloc[cluster_labels]  # Align Y with the cluster labels
    hamming_dist = (cluster_major != Y).sum().sum() / (Y.shape[0] * Y.shape[1])
    hamming_loss = hamming_dist / Y.shape[1]
    return hamming_dist, hamming_loss

def MonteCarlo(times, X, Y):
    hamming_dist = []
    hamming_loss = []
    for i in range(times):
        optimal_K = OptimalK(X, num_clusters=50, rand_state=i)
        clusterer = KMeans(n_clusters=optimal_K, random_state=i)
        cluster_labels = clusterer.fit_predict(X)
        majority_families = MajorityLabels(optimal_K, cluster_labels, Y)
        cur_dist, cur_loss = Hamming_Distance_Loss(majority_families, cluster_labels, Y)
        hamming_dist.append(cur_dist)
        hamming_loss.append(cur_loss)
        print(f"Iteration {i + 1} | Major Family: {majority_families}, Hamming Distance: {round(cur_dist, 4)}, Hamming Loss: {round(cur_loss, 4)}")
    return majority_families, hamming_dist, hamming_loss

In [98]:
iterations = 50
majority_families, hamming_dist, hamming_loss = MonteCarlo(iterations, MFCC_df.iloc[:, :-4], MFCC_df.iloc[:, -4:-1])

The optimal K is: 4
Iteration 1 | Major Family:             Family      Genus                 Species
0  Leptodactylidae  Adenomera  AdenomeraHylaedactylus
1          Hylidae  Hypsiboas       HypsiboasCordobae
2    Dendrobatidae   Ameerega      Ameeregatrivittata
3          Hylidae  Hypsiboas    HypsiboasCinerascens, Hamming Distance: 0.6694, Hamming Loss: 0.2231
The optimal K is: 4
Iteration 2 | Major Family:             Family      Genus                 Species
0          Hylidae  Hypsiboas       HypsiboasCordobae
1  Leptodactylidae  Adenomera  AdenomeraHylaedactylus
2    Dendrobatidae   Ameerega      Ameeregatrivittata
3          Hylidae  Hypsiboas    HypsiboasCinerascens, Hamming Distance: 0.6694, Hamming Loss: 0.2231
The optimal K is: 4
Iteration 3 | Major Family:             Family      Genus                 Species
0  Leptodactylidae  Adenomera  AdenomeraHylaedactylus
1  Leptodactylidae  Adenomera          AdenomeraAndre
2          Hylidae  Hypsiboas       HypsiboasCordobae
3   

In [103]:
def summarize(hamming_distance, hamming_loss):
    summary = {
        "Avg Hamming Distance": round(np.mean(hamming_distance), 4),
        "Std of Hamming Distance": round(np.std(hamming_distance), 4),
        "Avg Hamming Loss": round(np.mean(hamming_loss), 4),
        "Std of Hamming Loss": round(np.std(hamming_loss), 4),
        "Avg Hamming Score": round(1 - np.mean(hamming_loss), 4),
        "Std of Hamming Score": round(np.std(hamming_loss), 4)
    }
    
    return pd.DataFrame(summary, index=[0])

In [104]:
hamming_summary = summarize(hamming_dist, hamming_loss)
hamming_summary

Unnamed: 0,Avg Hamming Distance,Std of Hamming Distance,Avg Hamming Loss,Std of Hamming Loss,Avg Hamming Score,Std of Hamming Score
0,0.6614,0.027,0.2205,0.009,0.7795,0.009
