### Introduction
This notebook is a tutorial on using the SpecMix package to cluster mixed-type datasets.

### Generating Synthetic Datasets
We can create synthetic datasets with purely numerical, purely categorical, or mixed features. To generate a dataset with $K$ clusters and $n$ data points, we sample approximately $n/K$ points per cluster in $\mathbb{R}^K$ from normal distributions whose means are the $K$ canonical basis vectors of that space (for example, $[1, 0, 0]$, $[0, 1, 0]$ and $[0, 0, 1]$ if $K = 3$) and whose standard deviation $\sigma_{\text{noise}}$ is set in each experiment. To each of these numerical data points, we add $Q$ categorical variables with $K$ possible categories each. The categories of each data point are chosen according to a value $p \in [0, 1]$ that controls how exclusively each category is tied to a cluster. If $p = 0$, each category is found only in its own cluster. If $p > 0$, a category may appear in a cluster other than its own with probability $p$.
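
The procedure described above can be sketched with NumPy alone. This is only an illustration, not the package's `generate_mixed_dataset`: the function name `sketch_mixed_dataset`, its parameters, and in particular the rule that a "flipped" category is drawn uniformly from the other $K - 1$ categories are assumptions made for this example.

```python
import numpy as np

def sketch_mixed_dataset(n_samples=1000, n_clusters=3, n_categorical=2,
                         sigma_noise=0.1, p=0.3, seed=0):
    """Illustrative generator (hypothetical, not the SpecMix implementation)."""
    rng = np.random.default_rng(seed)
    # Spread the points over the clusters as evenly as possible (about n/K each).
    target = np.sort(np.arange(n_samples) % n_clusters)
    # Numerical part: the cluster's canonical basis vector plus Gaussian noise.
    means = np.eye(n_clusters)[target]
    numerical = means + rng.normal(0.0, sigma_noise, size=means.shape)
    # Categorical part: keep the cluster's own category with probability 1 - p;
    # otherwise shift to one of the K - 1 other categories (assumed uniform).
    own = np.repeat(target[:, None], n_categorical, axis=1)
    shift = rng.integers(1, n_clusters, size=own.shape)
    flip = rng.random(own.shape) < p
    categorical = np.where(flip, (own + shift) % n_clusters, own)
    return numerical, categorical, target

# With p = 0 every categorical value equals the cluster label.
numerical, categorical, target = sketch_mixed_dataset(p=0.0)
print(numerical.shape, categorical.shape)
print((categorical == target[:, None]).all())
```

Raising `p` in this sketch makes the categorical features less informative about the cluster, while raising `sigma_noise` does the same for the numerical features.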

In [3]:
from examples.synthetic_dataset_generation import generate_mixed_dataset

# Generate a synthetic dataset with 3 numerical features, 2 categorical features, 3 clusters, and p=0.3
df = generate_mixed_dataset(n_samples=1000, n_numerical_features=3, n_categorical_features=2, n_clusters=3, p=0.3, save=False)
df

Unnamed: 0,num_feat_0,num_feat_1,num_feat_2,cat_feat_0,cat_feat_1,target
0,0.006731,0.511186,0.527471,0.0,1.0,0
1,0.242235,0.534272,0.229820,2.0,2.0,0
2,0.103889,0.306915,0.333870,0.0,0.0,0
3,-0.338709,0.133508,0.102437,0.0,0.0,0
4,0.100235,0.329353,0.125598,1.0,1.0,0
...,...,...,...,...,...,...
995,-0.258731,0.860525,1.011362,2.0,0.0,2
996,-0.279953,0.836799,1.475038,2.0,2.0,2
997,-0.162176,0.848644,0.708486,2.0,2.0,2
998,-0.108893,0.948853,0.824503,1.0,2.0,2


### Using SpecMix

Our implementation of SpecMix follows scikit-learn's estimator conventions, so it behaves like other scikit-learn estimators: call `fit` on the data, then read the results from fitted attributes such as `labels_`.

In [1]:
from SpecMix.specmix import SpecMix

#Initialize the SpecMix algorithm with 3 clusters
specmix = SpecMix(n_clusters=3, random_state=0)

#Fit the algorithm to the dataset
specmix.fit(df)

#Observe the adjacency matrix created by the algorithm
specmix.adj_matrix_


We benchmark the clustering using the purity score, defined as $\text{Purity}(Y, C) = \frac{1}{N} \sum_{k} \max_j |w_{k} \cap c_{j}|$,
where:
- $\text{Purity}(Y, C)$ is the purity of the clustering solution $C$ relative to the true labels $Y$.
- $N$ is the total number of samples.
- The summation $\sum_k$ runs over all predicted clusters $w_k$.
- The maximum $\max_j$ is taken over all true classes $c_j$.
- $|w_{k} \cap c_{j}|$ is the number of samples in cluster $w_{k}$ that belong to class $c_{j}$.
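
The purity definition above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the `purity_score` function imported from `examples.benchmark_algorithms`; the function name `purity` is hypothetical.

```python
import numpy as np

def purity(true_labels, predicted_labels):
    """Fraction of samples that fall in the majority class of their cluster."""
    true_labels = np.asarray(true_labels).ravel()
    predicted_labels = np.asarray(predicted_labels).ravel()
    # Map arbitrary label values to consecutive indices 0..n-1.
    _, class_idx = np.unique(true_labels, return_inverse=True)
    _, cluster_idx = np.unique(predicted_labels, return_inverse=True)
    # contingency[k, j] = |w_k ∩ c_j|, the overlap of cluster k and class j.
    contingency = np.zeros((cluster_idx.max() + 1, class_idx.max() + 1), dtype=int)
    np.add.at(contingency, (cluster_idx, class_idx), 1)
    # Sum the largest overlap of each cluster and divide by N.
    return float(contingency.max(axis=1).sum() / true_labels.size)

# Cluster ids are arbitrary, so a relabelled perfect clustering has purity 1.
print(purity([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0
# One sample of class 0 ends up in the cluster dominated by class 1.
print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6, about 0.833
```

Note that purity never decreases when clusters are split further, so it is best compared across methods that use the same number of clusters, as is done throughout this notebook.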


In [7]:
from examples.benchmark_algorithms import purity_score
#Calculate the purity score of the algorithm
predicted_labels = specmix.labels_
target_labels = df['target'].tolist()
purity_score(target_labels, predicted_labels) 

0.809

We also provide onlyCat, a clustering algorithm based on the same principles as SpecMix that uses only the categorical variables to cluster the data.

In [9]:
from SpecMix.onlycat import onlyCat

onlycat = onlyCat(n_clusters=3, random_state=0)

#Fit the algorithm to the dataset
onlycat.fit(df)

predicted_labels = onlycat.labels_
target_labels = df['target'].tolist()
purity_score(target_labels, predicted_labels)

0.636

Now, let us compare our algorithm with other popular clustering algorithms. We will use the purity score to evaluate the performance of each algorithm.

In [4]:
from examples.benchmark_algorithms import compare_algorithms
import numpy as np

methods = ['k-prototypes', 'lca', 'spectral', 'onlyCat', 'spectralCAT']
kernels = ['median_pairwise', 'cv_sigma', 'preset']
n_categorical_features = 2
n_samples = 1000
n_clusters = 3
lambdas = [0, 1, 10, 50, 100, 1000, n_samples/(n_clusters*n_categorical_features)]
metrics = ['purity', 'calinski_harabasz', 'adjusted_rand', 'homogeneity', 'silhouette']
sigmas = np.linspace(0.1, 10, 20)
scaling = False

scores = compare_algorithms(methods, df, df["target"].tolist(),  n_clusters, metrics=metrics,kernels=kernels, 
lambda_values=lambdas, sigmas=sigmas, scaling=scaling)

scores


Running k-prototypes
Running lca
Fitting StepMix...


Initializations (n_init) : 100%|██████████| 1/1 [00:00<00:00, 26.87it/s, max_LL=-1.62e+3, max_avg_LL=-1.62]


Running onlyCat
Running spectralCAT
Running SpecMix with lambda=0 and kernel=median_pairwise
Running SpecMix with lambda=1 and kernel=median_pairwise
Running SpecMix with lambda=10 and kernel=median_pairwise
Running SpecMix with lambda=50 and kernel=median_pairwise
Running SpecMix with lambda=100 and kernel=median_pairwise
Running SpecMix with lambda=1000 and kernel=median_pairwise
Running SpecMix with lambda=166.66666666666666 and kernel=median_pairwise
Running SpecMix with lambda=0 and kernel=cv_sigma
Running SpecMix with lambda=1 and kernel=cv_sigma
Running SpecMix with lambda=10 and kernel=cv_sigma
Running SpecMix with lambda=50 and kernel=cv_sigma
Running SpecMix with lambda=100 and kernel=cv_sigma
Running SpecMix with lambda=1000 and kernel=cv_sigma
Running SpecMix with lambda=166.66666666666666 and kernel=cv_sigma
Running SpecMix with lambda=0 and kernel=preset
Running SpecMix with lambda=1 and kernel=preset
Running SpecMix with lambda=10 and kernel=preset
Running SpecMix with l

Unnamed: 0,k-prototypes,lca,onlyCat,spectralCAT,spectral lambda=0 kernel=median_pairwise,spectral lambda=1 kernel=median_pairwise,spectral lambda=10 kernel=median_pairwise,spectral lambda=50 kernel=median_pairwise,spectral lambda=100 kernel=median_pairwise,spectral lambda=1000 kernel=median_pairwise,...,spectral lambda=100 kernel=cv_sigma,spectral lambda=1000 kernel=cv_sigma,spectral lambda=166.66666666666666 kernel=cv_sigma,spectral lambda=0 kernel=preset,spectral lambda=1 kernel=preset,spectral lambda=10 kernel=preset,spectral lambda=50 kernel=preset,spectral lambda=100 kernel=preset,spectral lambda=1000 kernel=preset,spectral lambda=166.66666666666666 kernel=preset
purity,0.932,0.949,0.636,0.509,0.883,0.887,0.898,0.925,0.895,0.73,...,0.743,0.743,0.891,0.865,0.884,0.936,0.895,0.872,0.743,0.869
calinski_harabasz,575.138656,1423.991994,571.60077,96.905022,396.096993,399.330886,434.78961,680.044626,806.753557,651.547375,...,656.58329,656.58329,819.299673,356.104879,397.980031,646.489805,812.334415,817.905611,656.58329,819.751658
adjusted_rand,0.809721,0.855292,0.256828,0.121014,0.693061,0.701595,0.726579,0.789498,0.712146,0.351858,...,0.385228,0.385228,0.700725,0.634242,0.681123,0.818217,0.710945,0.654594,0.385228,0.648481
homogeneity,0.769366,0.81086,0.270643,0.110741,0.68088,0.686,0.698697,0.729528,0.648075,0.367106,...,0.390783,0.390783,0.638471,0.644361,0.674851,0.7837,0.648153,0.602465,0.390783,0.596696
silhouette,0.369328,0.51956,0.358376,0.073743,0.282535,0.284405,0.304342,0.401902,0.432085,0.358211,...,0.36327,0.36327,0.434676,0.258988,0.284059,0.394898,0.433841,0.433966,0.36327,0.434258
time_taken,6.342732,0.047289,0.022136,0.835234,0.092368,0.109411,0.104062,0.098879,0.100909,0.095399,...,2.344951,2.156468,2.112381,0.076413,0.091399,0.098329,0.09222,0.089024,0.089571,0.103065
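
A wide table like the one above is easier to read once it is transposed and sorted. The sketch below assumes that `scores` is a pandas DataFrame with metrics as rows and methods as columns, as the output suggests; to stay self-contained it rebuilds a small, hand-copied subset of the values printed above (run with `scaling = False`) under the hypothetical name `scores_subset`.

```python
import pandas as pd

# Hand-copied subset of the table above: metrics as rows, methods as columns.
scores_subset = pd.DataFrame(
    {
        "k-prototypes": [0.932, 0.809721],
        "lca": [0.949, 0.855292],
        "onlyCat": [0.636, 0.256828],
        "spectralCAT": [0.509, 0.121014],
        "spectral lambda=50 kernel=median_pairwise": [0.925, 0.789498],
        "spectral lambda=10 kernel=preset": [0.936, 0.818217],
    },
    index=["purity", "adjusted_rand"],
)

# One row per method, highest purity first.
ranking = scores_subset.T.sort_values("purity", ascending=False)
print(ranking)

# Best-scoring method for each metric.
best = scores_subset.idxmax(axis=1)
print(best)
```

The same two calls should work on the full `scores` table if it has this layout, although `time_taken` would then need `idxmin` rather than `idxmax`, since lower is better for it.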


In [5]:
scaling = True
scores = compare_algorithms(methods, df, df["target"].tolist(),  n_clusters, metrics=metrics,kernels=kernels, 
lambda_values=lambdas, sigmas=sigmas, scaling=scaling)
scores

Running k-prototypes
Running lca
Fitting StepMix...


Initializations (n_init) : 100%|██████████| 1/1 [00:00<00:00, 23.02it/s, max_LL=-1.62e+3, max_avg_LL=-1.62]


Running onlyCat
Running spectralCAT
Running SpecMix with lambda=0 and kernel=median_pairwise
Running SpecMix with lambda=1 and kernel=median_pairwise
Running SpecMix with lambda=10 and kernel=median_pairwise
Running SpecMix with lambda=50 and kernel=median_pairwise
Running SpecMix with lambda=100 and kernel=median_pairwise
Running SpecMix with lambda=1000 and kernel=median_pairwise
Running SpecMix with lambda=166.66666666666666 and kernel=median_pairwise
Running SpecMix with lambda=0 and kernel=cv_sigma
Running SpecMix with lambda=1 and kernel=cv_sigma
Running SpecMix with lambda=10 and kernel=cv_sigma
Running SpecMix with lambda=50 and kernel=cv_sigma
Running SpecMix with lambda=100 and kernel=cv_sigma
Running SpecMix with lambda=1000 and kernel=cv_sigma
Running SpecMix with lambda=166.66666666666666 and kernel=cv_sigma
Running SpecMix with lambda=0 and kernel=preset
Running SpecMix with lambda=1 and kernel=preset
Running SpecMix with lambda=10 and kernel=preset
Running SpecMix with l

Unnamed: 0,k-prototypes,lca,onlyCat,spectralCAT,spectral lambda=0 kernel=median_pairwise,spectral lambda=1 kernel=median_pairwise,spectral lambda=10 kernel=median_pairwise,spectral lambda=50 kernel=median_pairwise,spectral lambda=100 kernel=median_pairwise,spectral lambda=1000 kernel=median_pairwise,...,spectral lambda=100 kernel=cv_sigma,spectral lambda=1000 kernel=cv_sigma,spectral lambda=166.66666666666666 kernel=cv_sigma,spectral lambda=0 kernel=preset,spectral lambda=1 kernel=preset,spectral lambda=10 kernel=preset,spectral lambda=50 kernel=preset,spectral lambda=100 kernel=preset,spectral lambda=1000 kernel=preset,spectral lambda=166.66666666666666 kernel=preset
purity,0.932,0.949,0.636,0.509,0.66,0.887,0.73,0.73,0.73,0.73,...,0.89,0.743,0.87,0.777,0.809,0.896,0.929,0.899,0.743,0.889
calinski_harabasz,575.138656,1423.991994,571.60077,96.905022,234.427102,665.351642,651.547375,651.547375,651.547375,651.547375,...,814.357511,656.58329,819.422208,274.127427,297.307763,423.899179,672.602206,803.268085,656.58329,820.114382
adjusted_rand,0.809721,0.855292,0.256828,0.121014,0.512932,0.691162,0.351858,0.351858,0.351858,0.351858,...,0.698621,0.385228,0.650459,0.525822,0.566996,0.718348,0.799822,0.722379,0.385228,0.695935
homogeneity,0.769366,0.81086,0.270643,0.110741,0.781606,0.629798,0.367106,0.367106,0.367106,0.367106,...,0.636768,0.390783,0.59929,0.557711,0.592569,0.698687,0.74461,0.658455,0.390783,0.635493
silhouette,0.369328,0.51956,0.358376,0.073743,0.124762,0.392398,0.358211,0.358211,0.358211,0.358211,...,0.434455,0.36327,0.434099,0.202036,0.217263,0.29736,0.402247,0.433078,0.36327,0.434865
time_taken,1.536613,0.053492,0.015716,0.695854,0.216458,0.11785,0.10609,0.08669,0.085993,0.085508,...,75.018741,74.274544,80.89724,0.090254,0.103818,0.10206,0.10138,0.105778,0.097989,0.102213


### Real Datasets
We can also benchmark our algorithm on real datasets. In this example, we use the Post-Operative Patient dataset from the UCI Machine Learning Repository. This dataset contains 90 instances of patients classified into 3 classes. Each instance has 8 attributes; in the code below, 7 are treated as categorical and 1 (`COMFORT`) as numerical.

In [7]:
import pandas as pd
from examples.real_dataset_experiments import real_experiments
# URL of the Post-Operative Patient dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/postoperative-patient-data/post-operative.data"
column_names = ['L-CORE', 'L-SURF', 'L-02', 'L-BP', 'SURF-STBL', 'CORE-STBL', 'BP-STBL', 'COMFORT', 'target']
numerical_cols = ['COMFORT']
categorical_cols = ['L-CORE', 'L-SURF', 'L-02', 'L-BP', 'SURF-STBL', 'CORE-STBL', 'BP-STBL']
n_clusters = 3
scores = real_experiments(methods, metrics, n_clusters, kernels, numerical_cols=numerical_cols, categorical_cols=categorical_cols
                            , path=url, column_names=column_names, sep=',', header=None, drop = None,
                            lambdas=lambdas, knn=0, scaling = scaling, sigmas = sigmas, random_state = 0, n_init = 10, verbose = 0)

scores

Fitting StepMix...


Initializations (n_init) : 100%|██████████| 1/1 [00:00<00:00, 51.63it/s, max_LL=-13.6, max_avg_LL=-.157]


Unnamed: 0,k-prototypes,lca,onlyCat,spectralCAT,spectral lambda=0 kernel=median_pairwise,spectral lambda=1 kernel=median_pairwise,spectral lambda=10 kernel=median_pairwise,spectral lambda=50 kernel=median_pairwise,spectral lambda=100 kernel=median_pairwise,spectral lambda=1000 kernel=median_pairwise,...,spectral lambda=100 kernel=cv_sigma,spectral lambda=1000 kernel=cv_sigma,spectral lambda=166.66666666666666 kernel=cv_sigma,spectral lambda=0 kernel=preset,spectral lambda=1 kernel=preset,spectral lambda=10 kernel=preset,spectral lambda=50 kernel=preset,spectral lambda=100 kernel=preset,spectral lambda=1000 kernel=preset,spectral lambda=166.66666666666666 kernel=preset
purity,0.724138,0.724138,0.712644,0.712644,0.724138,0.712644,0.712644,0.712644,0.712644,0.712644,...,0.712644,0.712644,0.712644,0.712644,0.712644,0.712644,0.712644,0.712644,0.712644,0.712644
calinski_harabasz,2.666396,7.267888,15.634385,15.620125,2.666396,-1.0,18.521372,18.538625,9.7466,9.265093,...,14.134356,9.265093,9.265093,2.809185,2.809185,15.769928,14.134356,14.134356,9.265093,9.265093
adjusted_rand,0.040178,0.040178,-0.028242,-0.009154,0.040178,0.0,-0.026828,-0.02761,-0.039592,-0.023747,...,-0.043683,-0.023747,-0.023747,0.016516,0.016516,-0.027579,-0.043683,-0.043683,-0.023747,-0.023747
homogeneity,0.025247,0.025247,0.017507,0.016787,0.025247,1.0,0.010207,0.011951,0.018373,0.015125,...,0.035749,0.015125,0.015125,0.011113,0.011113,0.015065,0.035749,0.035749,0.015125,0.015125
silhouette,0.076153,0.187215,0.18245,0.182009,0.076153,-1.0,0.206933,0.203361,0.13683,0.111182,...,0.185785,0.111182,0.111182,0.060041,0.060041,0.176451,0.185785,0.185785,0.111182,0.111182
time_taken,0.798326,0.025782,0.021482,0.120517,0.015922,0.033487,0.034239,0.033982,0.034859,0.032597,...,0.321125,0.319427,0.318193,0.016659,0.028932,0.030812,0.032257,0.031596,0.030744,0.032787


We can also benchmark our onlyCat algorithm versus other categorical clustering algorithms. In this example, we will use the Mushroom dataset from the UCI Machine Learning Repository. This dataset contains 8124 instances of mushrooms classified into 2 classes. Each instance has 22 attributes, all of which are categorical.

In [10]:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"

column_names = ['target', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment',
                'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
                'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
                'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
numerical_cols = []
categorical_cols = ['cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment',
                    'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
                    'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
                    'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']               
methods_categorical = ['k-modes', 'lca', 'onlyCat', 'spectralCAT']
n_clusters = 2
scores = real_experiments(methods_categorical, metrics, n_clusters, kernel = kernels, numerical_cols=[], categorical_cols=categorical_cols
                           , path=url, column_names=column_names, sep=',', header=None, drop = None,
                            lambdas=lambdas, knn=0, scaling = scaling, sigmas = sigmas, random_state = 0, n_init = 10, verbose = 0)

scores

Fitting StepMix...


Initializations (n_init) : 100%|██████████| 1/1 [00:00<00:00,  5.31it/s, max_LL=-9.44e+4, max_avg_LL=-16.7]
