# K-modes clustering for categorical variables only

### Install kmodes

In [1]:
# pip install kmodes

In [2]:
import numpy as np
import pandas as pd
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

### 1. Use 'Telecom customer.csv'. Save categorical variables

In [3]:
df = pd.read_csv('Telecom customer.csv')
df.head()

Unnamed: 0,tenure,age,income,ed,marital,retire,gender
0,13,44,64,4,1,0,0
1,11,33,136,5,1,0,0
2,68,52,116,1,1,0,1
3,33,33,33,2,0,0,1
4,23,30,30,1,1,0,0


In [4]:
df_cat=df.iloc[:, 4:7]  # all rows and column 4~6
df_cat.head()

Unnamed: 0,marital,retire,gender
0,1,0,0
1,1,0,0
2,1,0,1
3,0,0,1
4,1,0,0


- `df.iloc[:, 4:7]`: selects rows and columns by integer position (not by label).
    - `:` specifies all rows.
    - `4:7` specifies positions 4, 5, and 6 (i.e., the 5th, 6th, and 7th columns: marital, retire, gender). The range includes the start position and excludes the stop position.
- `df_cat`: the new variable that holds the subset of the original DataFrame returned by `iloc`.
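As a quick check of the slicing rule, the sketch below builds a small DataFrame with the same column order as the table above (values copied from its first three rows, not read from the CSV) and shows that positional and label-based selection return the same columns.

```python
import pandas as pd

# Toy frame with the same column order as 'Telecom customer.csv'
toy = pd.DataFrame({
    'tenure':  [13, 11, 68],
    'age':     [44, 33, 52],
    'income':  [64, 136, 116],
    'ed':      [4, 5, 1],
    'marital': [1, 1, 1],
    'retire':  [0, 0, 0],
    'gender':  [0, 0, 1],
})

by_position = toy.iloc[:, 4:7]                   # columns at positions 4, 5, 6
by_name = toy[['marital', 'retire', 'gender']]   # same columns, selected by label

print(list(by_position.columns))
print(by_position.equals(by_name))
```

Selecting by label is less fragile if the column order of the file ever changes.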

### 2. Conduct K-modes clustering using categorical variables

In [5]:
m = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
cls = m.fit_predict(df_cat)

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 250, cost: 47.0
Run 1, iteration: 2/100, moves: 1, cost: 47.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 259, cost: 47.0
Run 2, iteration: 2/100, moves: 2, cost: 47.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 15, cost: 261.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, moves: 0, cost: 274.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 5, iteration: 1/100, moves: 0, cost: 507.0
Best run was number 1


- `KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)`: This initializes a k-modes clustering model
    - `n_clusters=4` specifies that the algorithm should form four clusters.
    - `init='Huang'` selects Huang's initialization method, which seeds the initial modes from the observed category frequencies. The other options are `'Cao'` and `'random'`.
    - `n_init=5` indicates that the k-modes algorithm runs five times from different initial modes, and the run with the lowest cost (error) is kept.
    - `verbose=1` will output logs about the clustering process (useful for debugging or understanding the process).
- `cls = m.fit_predict(df_cat)`: fits the model using the categorical features stored in df_cat and assigns each data point to one of the four clusters.
    - `cls`: stores the cluster labels for each data point

In [6]:
m.cluster_centroids_ 

array([[1, 0, 0],
       [0, 0, 1],
       [1, 0, 1],
       [0, 0, 0]], dtype=int64)

- `m.cluster_centroids_`: This attribute of the fitted k-modes model `m` contains the centroids of the clusters. Each centroid is the mode of its cluster: for every feature, the category that occurs most often among the points assigned to that cluster. Unlike a k-means centroid, it is not an average.
    - Shape: (n_clusters, n_features) = (4, 3)
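To make "mode per feature" concrete, the sketch below recomputes cluster modes by hand on a tiny made-up dataset. The data and the labels are hypothetical; this only illustrates the definition and is not how the kmodes library computes its centroids internally.

```python
import pandas as pd

# Hypothetical categorical data and cluster labels (not from the CSV)
toy = pd.DataFrame({
    'marital': [1, 1, 1, 0, 0, 0],
    'retire':  [0, 0, 1, 0, 0, 0],
    'gender':  [0, 0, 0, 1, 1, 0],
})
labels = pd.Series([0, 0, 0, 1, 1, 1])

# For each cluster, take the most frequent value of every column.
# mode() returns the tied values sorted, so iloc[0] picks the smallest on a tie.
modes = toy.groupby(labels).agg(lambda col: col.mode().iloc[0])
print(modes)
```

Each row of `modes` plays the role of one row in `m.cluster_centroids_`.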

In [7]:
m.cost_

47.0

- `m.cost_`: the cost (clustering error) of the final solution, computed as the sum of dissimilarities between each sample and the mode of its assigned cluster. With the default matching dissimilarity, this is the total number of feature values that differ from the cluster mode, so lower is better. Here the value 47.0 matches the cost of the best run in the log above.
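The cost can be reproduced by hand under the assumption that the dissimilarity is the simple matching distance (the number of features on which a sample differs from its cluster mode), which I believe is the kmodes default. The arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical samples, cluster labels, and cluster modes
X = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([[1, 0, 0],
                      [0, 0, 1]])

# centroids[labels] lines up each sample with the mode of its own cluster;
# counting unequal entries per row gives the matching dissimilarity.
mismatches = (X != centroids[labels]).sum(axis=1)
cost = mismatches.sum()

print(mismatches)
print(cost)
```

On the fitted model, the same expression applied to `df_cat.to_numpy()`, `m.labels_`, and `m.cluster_centroids_` would be expected to agree with `m.cost_`.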

# K-prototypes clustering for mixed categorical and numerical variables

### 1. Use 'Telecom customer.csv'. Save the variables as array

In [8]:
df = pd.read_csv('Telecom customer.csv')
array = df.values  # save df as array
array

array([[ 13,  44,  64, ...,   1,   0,   0],
       [ 11,  33, 136, ...,   1,   0,   0],
       [ 68,  52, 116, ...,   1,   0,   1],
       ...,
       [ 67,  59, 944, ...,   0,   0,   1],
       [ 70,  49,  87, ...,   0,   0,   1],
       [ 50,  36,  39, ...,   1,   0,   1]], dtype=int64)

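- `array = df.values`: converts the DataFrame into a NumPy array. `df.to_numpy()` does the same and is the form the pandas documentation recommends.
- list vs. array: a list can contain elements of different data types, whereas a NumPy array stores all elements with a single data type.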
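A short sketch of the list vs. array difference, using made-up values: a list keeps each element's own type, while converting a DataFrame to an array forces one common dtype for all columns.

```python
import numpy as np
import pandas as pd

mixed_list = [1, 2.5, 'a']   # each element keeps its own type
print([type(x).__name__ for x in mixed_list])

# All-integer columns give an integer array
ints = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_numpy()
print(ints.dtype)

# Integer and float columns are upcast to a common float dtype
mixed = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]}).to_numpy()
print(mixed.dtype)

# Numbers mixed with text fall back to the generic object dtype
with_text = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']}).to_numpy()
print(with_text.dtype)
```

The telecom data is all integers, which is why the array above prints with `dtype=int64`.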

### 2. Conduct K-prototypes clustering using all variables.

In [9]:
cat=[4,5,6]
kproto = KPrototypes(n_clusters=3, verbose=2, max_iter=20).fit(array, categorical=cat)

Initialization method and algorithm are deterministic. Setting n_init to 1.
Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/20, moves: 340, ncost: 4888551.028927475
Run: 1, iteration: 2/20, moves: 181, ncost: 4261412.963537759
Run: 1, iteration: 3/20, moves: 83, ncost: 3885359.237057534
Run: 1, iteration: 4/20, moves: 62, ncost: 3411778.3459221586
Run: 1, iteration: 5/20, moves: 47, ncost: 3166533.178054045
Run: 1, iteration: 6/20, moves: 26, ncost: 3108234.590019671
Run: 1, iteration: 7/20, moves: 8, ncost: 3103262.0649726517
Run: 1, iteration: 8/20, moves: 3, ncost: 3102772.5364551214
Run: 1, iteration: 9/20, moves: 0, ncost: 3102772.5364551214
Init: initializing centroids
Init: initializing clusters
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 2, iteration: 1/20, moves: 274, ncost: 4771469.52510799
Run: 2, iteration: 2/20, moves: 140, 

- `cat = [4, 5, 6]`: This line defines a list cat that specifies the indices of the columns that are categorical.
- `KPrototypes(n_clusters=3, verbose=2, max_iter=20)`: This initializes the k-prototypes clustering algorithm 
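    - KPrototypes combines the two approaches: numeric features are compared with the squared Euclidean distance (as in k-means) and categorical features with the matching dissimilarity (as in k-modes). The categorical part is weighted by the `gamma` parameter, which the library estimates from the data when it is not given.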
    - `n_clusters=3` indicates that the algorithm should form three clusters.
    - `verbose=2` means that the algorithm will output detailed logs about the clustering process. The higher the verbose level, the more information is printed.
    - `max_iter=20` sets the maximum number of iterations the algorithm should run for, preventing it from running indefinitely (default=100).
- `fit(array, categorical=cat)`: This method fits the k-prototypes model to your dataset array. It requires the data and the indices of categorical columns to be specified (provided by cat).
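The following is a minimal sketch of the mixed dissimilarity that k-prototypes minimises, following Huang's formulation: squared Euclidean distance on the numeric columns plus `gamma` times the number of categorical mismatches. The function name `mixed_dissimilarity`, the prototype, and the `gamma` value are made up for illustration, so the numbers will not match the log above.

```python
import numpy as np

def mixed_dissimilarity(x, prototype, categorical, gamma):
    """Distance between one sample and one prototype (illustrative only)."""
    x = np.asarray(x, dtype=float)
    prototype = np.asarray(prototype, dtype=float)

    # Boolean mask that marks the categorical column positions
    cat_mask = np.zeros(x.shape[0], dtype=bool)
    cat_mask[categorical] = True

    # Numeric part: squared Euclidean distance
    numeric_part = np.sum((x[~cat_mask] - prototype[~cat_mask]) ** 2)
    # Categorical part: number of columns whose categories differ
    categorical_part = np.sum(x[cat_mask] != prototype[cat_mask])

    return numeric_part + gamma * categorical_part

sample = [13, 44, 64, 4, 1, 0, 0]      # first row of the data shown earlier
prototype = [10, 40, 60, 4, 0, 0, 0]   # hypothetical prototype
d = mixed_dissimilarity(sample, prototype, categorical=[4, 5, 6], gamma=2.0)
print(d)
```

Because the numeric part is a squared distance on the raw values, a column with a large range such as income can contribute far more than the categorical mismatches.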

### 3. Display cluster centers

In [10]:
kproto.cluster_centroids_  

array([[ 33.72437358,  40.13553531,  49.98633257,   2.63895216,
          0.        ,   0.        ,   1.        ],
       [ 59.88888889,  58.44444444, 873.66666667,   3.44444444,
          0.        ,   0.        ,   1.        ],
       [ 47.5840708 ,  52.38053097, 228.17699115,   2.85840708,
          0.        ,   0.        ,   1.        ]])

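- `kproto.cluster_centroids_`: The centroids (prototypes) of the clusters formed. In each row, the first four values are the means of the numeric columns (tenure, age, income, ed) and the last three are the modes of the categorical columns (marital, retire, gender).
- Shape: (n_clusters, n_variables) = (3, 7)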

### 4. Predict clusters for the samples. Display samples that belong to Cluster=1

In [11]:
# Prediction

cls=kproto.predict(array, categorical=cat)
df['cluster'] = cls
df.head()
# df['cluster'] = kproto.labels_  # alternative

Unnamed: 0,tenure,age,income,ed,marital,retire,gender,cluster
0,13,44,64,4,1,0,0,0
1,11,33,136,5,1,0,0,0
2,68,52,116,1,1,0,1,0
3,33,33,33,2,0,0,1,0
4,23,30,30,1,1,0,0,0


- `cls = kproto.predict(array, categorical=cat)`: This line predicts the closest cluster for each sample
- `df['cluster'] = cls` : This line adds a new column to the original DataFrame, labeled 'cluster', which contains the cluster assignments for each row. 
- `df['cluster'] = kproto.labels_` is an alternative way to achieve the same result without calling the predict method. After fitting a k-prototypes model, the labels_ attribute contains the cluster labels for each point.

In [12]:
df[df['cluster']==1].head()

Unnamed: 0,tenure,age,income,ed,marital,retire,gender,cluster
208,72,64,674,4,0,0,1,1
401,41,52,928,3,0,0,0,1
409,39,59,1668,4,0,0,1,1
680,65,59,732,3,0,0,1,1
799,66,54,591,4,1,0,1,1


- `df['cluster'] == 1`: creates a Boolean series. It goes through the 'cluster' column and checks each row to see whether its value is equal to 1. If a row has the value 1, the corresponding entry in the Boolean series is True; otherwise, it's False.
- `df[df['cluster'] == 1]`: This is a DataFrame indexing operation using the Boolean series. It selects only the rows in df where the Boolean series is True, i.e., only the rows belonging to cluster 1. This operation effectively filters the DataFrame to contain only the data from cluster 1.
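The two filtering steps can be seen separately on a small made-up DataFrame (the values are illustrative, not the clustering result above):

```python
import pandas as pd

# Hypothetical data with a 'cluster' column
toy = pd.DataFrame({
    'income':  [64, 674, 116, 928],
    'cluster': [0, 1, 0, 1],
})

mask = toy['cluster'] == 1   # Boolean Series, one entry per row
subset = toy[mask]           # keeps only the rows where mask is True

print(mask.tolist())
print(subset.index.tolist())

# Number of rows per cluster, ordered by cluster label
print(toy['cluster'].value_counts().sort_index().to_dict())
```

The filtered rows keep their original index labels, which is why the output above starts at index 208 rather than 0.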