# K-Modes

### Clustering Categorical Data

In [1]:
#!pip install kmodes

In [2]:
import os
# Limit OpenMP to a single thread; set before importing the numeric libraries so it takes effect
os.environ["OMP_NUM_THREADS"] = '1'

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from kmodes.kmodes import KModes
from sklearn import datasets, metrics
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

### Generated Data

In [4]:
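# Generate four continuous features (make_blobs output is numerical, not categorical)
data, _ = make_blobs(n_samples=100, n_features=4, centers=3, cluster_std=1, random_state=42)
# Three categorical features with integer codes 1-4 (unseeded, so values differ between runs)
categorical_data = np.random.randint(1, 5, size=(100, 3))

# Concatenate continuous and categorical columns.
# Note: KModes treats every distinct value as its own category, so the nearly
# unique continuous values rarely match a centroid and add little to the clustering.
full_data = np.hstack((data, categorical_data))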

# Specify the number of clusters (k)
n_clusters = 3

# Initialize KModes with the appropriate parameters
km = KModes(n_clusters=n_clusters, init='Huang', n_init=5, verbose=1)

# Fit the model to the data
clusters = km.fit_predict(full_data)

# Print the cluster centroids and labels
print("Cluster Centroids:")
print(km.cluster_centroids_)
print("\nCluster Labels:")
print(clusters)

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 8, cost: 539.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 12, cost: 533.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 27, cost: 535.0
Run 3, iteration: 2/100, moves: 4, cost: 535.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, moves: 27, cost: 526.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 5, iteration: 1/100, moves: 23, cost: 532.0
Best run was number 4
Cluster Centroids:
[[ -8.90476978  -8.74737479 -11.03639446   0.5484215    1.
    2.           4.        ]
 [ -8.0799236   -8.12584837 -12.0795951    0.50965474   2.
    4.           2.        ]
 [ -7.76348463  -8.39495682 -10.65593054  -0.01439923   3.
    1.           1.        ]]

Cluster

## KModes-IRIS 

The Iris dataset can be converted into a categorical dataset by binning its numerical features. Here, sepal length, sepal width, petal length, and petal width are each cut into three equal-width bins, labeled 'Short', 'Medium', 'Long' for the lengths and 'Narrow', 'Medium', 'Wide' for the widths.
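
To make the binning step concrete, here is a minimal pure-Python sketch of equal-width binning. The helper `equal_width_bin` and the sample values are hypothetical and only illustrate the idea; `pd.cut` handles interval edges slightly differently (its intervals are closed on the right), so values that fall exactly on an edge may land in a different bin.

```python
def equal_width_bin(values, labels):
    """Assign each value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        # Index of the interval containing v; the maximum falls in the last bin.
        idx = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[idx])
    return binned


# Hypothetical sepal lengths in cm, not taken from the notebook output
sepal_lengths = [4.3, 5.0, 5.8, 6.4, 7.9]
print(equal_width_bin(sepal_lengths, ["Short", "Medium", "Long"]))
```

Equal-width bins keep the interval boundaries easy to interpret, but they can leave some categories sparsely populated when the feature is skewed.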

**Huang Initialization (KModes):**

Aims to select initial centroids that reflect how often each category occurs in the data.

As implemented in the `kmodes` package, each attribute of each initial centroid is sampled from the values of that attribute in the data, so frequent categories are more likely to be picked. Because a centroid assembled this way may not correspond to any real record and could leave a cluster empty, each centroid is then replaced by the data point closest to it under the matching dissimilarity, preferring points that are not already used as a centroid.
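
The dissimilarity that K-Modes uses by default is the simple matching distance: the number of attributes on which two records differ. Cluster centroids are the per-attribute modes of their members. The sketch below illustrates both ideas in plain Python; the function names and the three records are hypothetical, and this is not the `kmodes` implementation itself.

```python
from collections import Counter


def matching_dissim(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each attribute (the cluster mode)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]


# Hypothetical records: (sepal length, sepal width, petal length, petal width)
cluster = [
    ("Short", "Medium", "Short", "Narrow"),
    ("Short", "Wide", "Short", "Narrow"),
    ("Medium", "Medium", "Short", "Narrow"),
]

mode = column_modes(cluster)
# Total mismatch count between each record and the cluster mode
cost = sum(matching_dissim(record, mode) for record in cluster)
print(mode)
print(cost)
```

The `cost` value in the verbose log appears to be this kind of sum, taken over all points and their assigned centroids, so a lower cost means records agree with their cluster mode on more attributes.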

**Cluster Evaluation**

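The Adjusted Rand Index (ARI) measures the agreement between the true labels and the cluster assignments, corrected for chance. A value of 1 indicates a perfect match, values near 0 are what random labeling would produce, and negative values indicate less agreement than expected by chance. The score does not depend on how the clusters are numbered, which makes it a useful metric for evaluating clustering quality when ground truth labels are available.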
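
For intuition, the standard ARI formula can be sketched from scratch with pair counts. The function `adjusted_rand_index` below is a hypothetical helper written for illustration, assuming at least two samples; in the notebook itself the score comes from `sklearn.metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts (assumes n >= 2)."""
    n = len(labels_true)
    # Pairs of points that share both a true class and a predicted cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true class, and pairs that share a predicted cluster
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 1, 1, 2, 2]
renamed = [1, 1, 2, 2, 0, 0]    # same grouping, different cluster ids
scrambled = [0, 1, 0, 1, 0, 1]  # splits every true class

print(adjusted_rand_index(truth, renamed))
print(adjusted_rand_index(truth, scrambled))
```

The first call shows that renaming the clusters does not change the score, while the second shows that a grouping which cuts across every true class scores below zero.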

In [5]:
# Load Iris dataset
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Binning numerical features
X['sepal length (cm)'] = pd.cut(X['sepal length (cm)'], bins=3, labels=['Short', 'Medium', 'Long'])
X['sepal width (cm)'] = pd.cut(X['sepal width (cm)'], bins=3, labels=['Narrow', 'Medium', 'Wide'])
X['petal length (cm)'] = pd.cut(X['petal length (cm)'], bins=3, labels=['Short', 'Medium', 'Long'])
X['petal width (cm)'] = pd.cut(X['petal width (cm)'], bins=3, labels=['Narrow', 'Medium', 'Wide'])

# Display the modified dataset
print(X.head())

# Initialize and fit KModes
km = KModes(n_clusters=3, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(X)

# Print cluster centroids
print("Cluster Centroids:")
print(pd.DataFrame(km.cluster_centroids_, columns=X.columns))

# Evaluate clustering performance using ARI
ari = adjusted_rand_score(y, clusters)
print("Adjusted Rand Index (ARI):", ari)

  sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0             Short           Medium             Short           Narrow
1             Short           Medium             Short           Narrow
2             Short           Medium             Short           Narrow
3             Short           Medium             Short           Narrow
4             Short           Medium             Short           Narrow
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 19, cost: 107.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 39, cost: 96.0
Run 2, iteration: 2/100, moves: 4, cost: 96.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 21, cost: 174.0
Run 3, iteration: 2/100, moves: 3, cost: 174.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, itera