```
@title : # Clustering `employee performance` dataset in `Python`
@date  : 20241218 ALUR

@author: Aleksandras Urbonas, aleksandras . urbonas (.) gmail . com
```


---

# Intro

Clustering a dataset that contains both categorical and continuous variables can be more challenging than clustering datasets with only continuous variables, due to the different nature of the variables.

However, with appropriate techniques, it is possible to perform meaningful clustering on this type of data.

By comparing properties of the clusters, we will be able to understand the characteristics of each cluster and identify key differences in employee profiles, which could inform management decisions such as resource allocation, promotions, or performance improvement strategies.

Visualizations are especially useful for presenting this information to non-technical audiences, while statistical tests provide more formal evidence for your findings.



---

# 1. Import data



In [None]:
import pandas as pd
import numpy as np



In [None]:
# Load the clean data
data_0 = pd.read_csv('../data/data_clean.csv', index_col='employee_id')

print(data_0.head(2))

# data shape:
print(data_0.shape, "\n\n *** \n\n")

# review data types
print(data_0.dtypes, "\n\n *** \n\n")

# statistics for numeric values
data_0.describe()



In [None]:
# Drop rows with missing values in any column
# (to restrict to key columns, pass subset=['perf_rank', 'is_men', 'is_promo'])
data_1 = data_0.dropna()



In [None]:
# Drop duplicates (if any)
data_2 = data_1.drop_duplicates()

# Status: notify about number of removed duplicates
print(f'duplicates removed: {data_1.shape[0] - data_2.shape[0]} records.')



In [None]:
# Drop the components of `job_level` (`job_role`, `job_rank`), if present
data_2 = data_2.drop(columns=['job_role', 'job_rank'], errors='ignore')



In [None]:
# Check the cleaned data
print(data_2.isnull().sum())



In [None]:
# final assignment: 
data = data_2
data.head(2)



## Key Approaches and Considerations for Clustering Mixed Data (Categorical + Integer Variables):



#### Preprocessing the Data

Proper preprocessing is essential before applying clustering algorithms. This step involves:

- handling missing values,
- scaling continuous variables, and
- encoding categorical variables.



#### Handling Missing Values:

* Missing values in categorical variables can be imputed using the mode (most frequent value).
* Missing values in continuous variables can be imputed using the mean or median, depending on the distribution of the data.
* For clustering, it is essential that no missing data remains, as most clustering algorithms do not handle missing values directly.
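
The imputation rules above can be sketched on a small toy frame. The column names and values here are hypothetical and only illustrate the mode and median fills; the notebook itself simply drops incomplete rows.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["north", "south", None, "north"],
    "age": [31.0, None, 45.0, 52.0],
})

# Categorical column: fill with the most frequent value (mode)
toy["region"] = toy["region"].fillna(toy["region"].mode().iloc[0])

# Continuous column: fill with the median, which is less sensitive to outliers
toy["age"] = toy["age"].fillna(toy["age"].median())

print(toy)
```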



#### Scaling Continuous Variables:

Standardization: Continuous variables should be standardized to have a mean of 0 and a standard deviation of 1. This prevents variables with larger scales (e.g., age or tenure) from dominating the clustering process.



In [None]:
from sklearn.preprocessing import StandardScaler

# NOTE: plain assignment creates an alias, not a copy. The scaled columns
# added here also appear in `data` and `data_2`, which later cells rely on.
data_scaled = data

# Standardize each continuous column, then drop the raw column
for col in ['age', 'tenure']:
    data_scaled[f'{col}_scaled'] = StandardScaler().fit_transform(data_scaled[[col]]).ravel()
    data_scaled.drop(columns=col, inplace=True)

data_scaled.head(2)



#### Encoding Categorical Variables:

* One-Hot Encoding: the most common approach, where each category in a categorical variable is converted into a binary feature.
    - Example: for a column like gender (Male, Female, Non-binary), one-hot encoding creates three new columns: gender_Male, gender_Female, gender_Non-binary.
* Label Encoding: assigns a unique integer to each category. The integers imply an order, so this suits ordinal variables better than nominal ones.
    - Example: gender can be encoded as 0 for Male, 1 for Female, 2 for Non-binary.
* Frequency or Target Encoding: in some cases, the frequency of the categories or the average of a target variable (e.g., performance ratings or promotion decisions) can be used to encode categorical variables.
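
As a minimal sketch of the first two options, the snippet below encodes a hypothetical `job_function` column both ways with pandas. The toy values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy column; values are illustrative only
toy = pd.DataFrame({"job_function": ["sales", "it", "sales", "hr"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=["job_function"], dtype=int)

# Label encoding: one integer code per category (codes follow sorted category order)
codes = toy["job_function"].astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```

Each one-hot row contains exactly one 1, while the label codes place the categories on a single integer axis.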



In [None]:
# Encode categorical variables
data_dummy = pd.get_dummies(
    data_scaled
    , columns=['region', 'job_level', 'job_function', 'perf_rank']
    , drop_first=False
)

data_dummy.head(2)



### Columns: categorical or numerical



In [None]:
# 'age_scaled' and 'tenure_scaled' are continuous; all other columns are treated as categorical
continuous_columns = ['age_scaled', 'tenure_scaled']  # Continuous columns



In [None]:
use_dummy = True

if use_dummy:
    data = data_dummy
else:
    data = data_scaled



In [None]:
# select all categorical columns
categorical_columns = ['is_promo', 'is_men']  # binary

for col in data.columns:
    if col not in continuous_columns: 
        if col not in categorical_columns:
            categorical_columns.append(col)

print(categorical_columns)



---

## 2. Choosing Clustering Algorithms for Mixed Data

There are several clustering algorithms that can handle both categorical and continuous data effectively:



### `K-Prototypes` Clustering:

* An extension of K-Means that handles mixed data types by using different distance measures for categorical and continuous features.
* It minimizes a cost function that consists of both categorical and continuous components:
    - Continuous variables are handled using Euclidean distance.
    - Categorical variables are handled using a dissimilarity measure (e.g., simple matching coefficient).
* Requires choosing the number of clusters k in advance, similar to K-Means.
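
The combined cost can be sketched for a single point and a single cluster prototype. This is an illustration under assumptions, not the `kmodes` implementation: the function name `mixed_dissimilarity` is hypothetical, and `gamma` stands for the weight that balances the categorical part against the numeric part.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on the numeric part plus gamma * mismatch count."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = np.sum(diff ** 2)
    # Simple matching: count the categorical features whose values differ
    categorical_part = np.sum(np.asarray(x_cat) != np.asarray(proto_cat))
    return float(numeric_part + gamma * categorical_part)

# Hypothetical point and prototype: two scaled numeric features, two categorical features
d = mixed_dissimilarity([0.5, -1.0], ["north", "sales"],
                        [0.0, 0.0], ["north", "it"], gamma=0.5)
print(d)
```

A larger `gamma` makes categorical mismatches count more relative to numeric distance.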



In [None]:
data_kpro = data_dummy



In [None]:
from kmodes.kprototypes import KPrototypes

# Use K-Prototypes for mixed data clustering
kproto = KPrototypes(n_clusters=3, init='Cao', verbose=1)



In [None]:
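# The categorical columns come first in the selection, so their positions
# are 0 .. len(categorical_columns) - 1 (not only the first two)
clusters = kproto.fit_predict(
    data_kpro[categorical_columns + continuous_columns],
    categorical=list(range(len(categorical_columns))),
)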



> Best run was number 9

In [None]:
# Add the cluster label to the dataset
data_kpro['cluster'] = clusters



In [None]:
data_kpro.head(2)



### Cluster Analysis

To compare the properties of clusters in your original dataset after performing K-Prototypes clustering, you need to analyze and summarize the data for each cluster.
This comparison can provide insights into how the clusters differ based on various attributes (e.g., age, tenure, gender, job function, performance ratings).

Here are a few common techniques you can use to compare properties of clusters in your original dataset:

1. Descriptive Statistics by Cluster

You can calculate basic descriptive statistics (mean, median, standard deviation, etc.) for each cluster to compare the properties of continuous variables such as age, tenure, and performance rating.

In [None]:
data_2['cluster'] = clusters



In [None]:
# Group by cluster and calculate descriptive statistics
cluster_stats = data_2.groupby('cluster').agg({
    'age_scaled': ['mean', 'min', 'max'],
    'tenure_scaled': ['mean', 'min', 'max'],
    'perf_rank': ['mean', 'min', 'max'],
    'is_promo': ['mean', 'min', 'max'],
    'is_men': ['mean', 'min', 'max'],
})

print(cluster_stats)


2. Count the Frequency of Categorical Variables in Each Cluster

For categorical variables such as `is_men` or `job_function`, you can use a cross-tabulation or group-by operation to see how each category is distributed across clusters.

This gives a count of how many employees with `is_men` equal to 0 or 1 belong to each cluster. Similarly, you can look at the distribution of the different `job_function` categories in each cluster.

In [None]:
# Count the occurrences of categorical variables by cluster
categorical_comparison = pd.crosstab(data_2['cluster'], data_2['is_men'])
print(categorical_comparison)

categorical_comparison_function = pd.crosstab(data_2['cluster'], data_2['job_function'])
print(categorical_comparison_function)
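
Raw counts are hard to compare when clusters differ in size. As a sketch on hypothetical toy data, `pd.crosstab` can also return row-wise shares through its `normalize` argument.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be data_2
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "is_men": [1, 0, 1, 0, 0, 1],
})

# normalize="index" divides each row by its total, giving shares within each cluster
shares = pd.crosstab(toy["cluster"], toy["is_men"], normalize="index")
print(shares.round(2))
```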



### 3. Compare Cluster Distribution Using Visualization

You can use visualizations to better understand how the clusters differ. Common visualization methods include box plots, violin plots, and bar charts.



a. Box Plot for Continuous Variables

Box plots are useful to compare the distributions of continuous variables (like age, tenure, perf_rank) across clusters.



In [None]:
data = data_2
data.head(2)



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# set chart size
plt.rcParams["figure.figsize"] = 5, 3
sns.set_theme(rc={'figure.figsize':(5, 3)})



In [None]:
# Boxplot for 'age' by cluster
sns.boxplot(x='cluster', y='age_scaled', data=data)
plt.title("Age Distribution by Cluster")
plt.show()

# Boxplot for 'tenure' by cluster
sns.boxplot(x='cluster', y='tenure_scaled', data=data)
plt.title("Tenure Distribution by Cluster")
plt.show()

# Boxplot for 'perf_rating' by cluster
# sns.boxplot(x='cluster', y='perf_rank', data=data)
# plt.title("Performance Rating Distribution by Cluster")
# plt.show()



b. Bar Plot for Categorical Variables

Bar plots are effective for showing how categorical variables like gender or job_function are distributed across clusters.



In [None]:
# Bar plot for gender distribution by cluster
sns.countplot(x='cluster', hue='is_men', data=data)
plt.title("Gender Distribution by Cluster")
plt.show()

# Bar plot for promotion distribution by cluster
sns.countplot(x='cluster', hue='is_promo', data=data)
plt.title("Promotion Distribution by Cluster")
plt.show()



In [None]:
# Bar plot for job_level distribution by cluster
sns.countplot(x='cluster', hue='job_level', data=data)
plt.title("Job Level Distribution by Cluster")
plt.show()



In [None]:
# Bar plot for `perf_rank` distribution by cluster
sns.countplot(x='cluster', hue='perf_rank', data=data)
plt.title("Perf. Rank Distribution by Cluster")
plt.show()

# Bar plot for job_function distribution by cluster
sns.countplot(x='cluster', hue='job_function', data=data)
plt.title("Job Function Distribution by Cluster")
plt.show()



4. Compare Clusters Using Pivot Tables

You can use pivot tables to summarize and compare different properties by cluster. This approach is especially useful for quick comparisons between clusters.

In [None]:
# Pivot table for summary statistics of continuous variables
pivot_stats = pd.pivot_table(
    data_2
    , values=['age_scaled', 'tenure_scaled', 'perf_rank']
    , index='cluster'
    , aggfunc={'age_scaled': ['mean', 'std'], 'tenure_scaled': ['mean', 'std'], 'perf_rank': ['mean', 'std']})

print(pivot_stats)



5. Statistical Tests Between Clusters

If you're interested in testing whether the differences between clusters are statistically significant, you can perform `ANOVA` (for continuous variables) or Chi-squared tests (for categorical variables).

a. ANOVA for Continuous Variables

ANOVA can help determine if there are significant differences between clusters for continuous variables like age and tenure.



In [None]:
from scipy.stats import f_oneway

# ANOVA for 'age'
f_stat, p_val = f_oneway(data_2[data_2['cluster'] == 0]['age_scaled'], 
                          data_2[data_2['cluster'] == 1]['age_scaled'], 
                          data_2[data_2['cluster'] == 2]['age_scaled'])
print(f"ANOVA for Age - F-statistic: {f_stat}, p-value: {p_val}")

# If p-value < 0.05, the differences between clusters in terms of 'age' are statistically significant
if p_val < 0.05:
    print(f"Differences between clusters in terms of 'age' are statistically significant")
else:
    print(f"Differences between clusters in terms of 'age' are not statistically significant")
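
The cell above hard-codes three clusters. A sketch that works for any number of clusters builds the groups with `groupby`; the toy data below is randomly generated and purely illustrative. Because clusters are formed from these same variables, some difference between them is expected, so the test is best read as descriptive.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: three clusters with shifted means
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cluster": np.repeat([0, 1, 2], 30),
    "age_scaled": np.concatenate([
        rng.normal(-1, 1, 30), rng.normal(0, 1, 30), rng.normal(1, 1, 30),
    ]),
})

# One array per cluster, however many clusters there are
groups = [g["age_scaled"].to_numpy() for _, g in toy.groupby("cluster")]
f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_val:.4g}")
```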

b. Chi-Squared Test for Categorical Variables

To compare categorical variables, we perform a Chi-squared test to see if the distribution of categories differs significantly across clusters.



In [None]:
from scipy.stats import chi2_contingency

# Chi-squared test for 'gender' vs. 'cluster'
contingency_table = pd.crosstab(data_2['cluster'], data_2['is_men'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared Test for Gender - p-value: {p_val}")

# If the p-value is less than 0.05, it indicates that the distribution of gender across clusters is significantly different.
if p_val < 0.05:
    print(f"Differences between clusters in terms of 'gender' are statistically significant")
else:
    print(f"Differences between clusters in terms of 'gender' are not statistically significant")

In [None]:
from scipy.stats import chi2_contingency

# Chi-squared test for 'promotion' vs. 'cluster'
contingency_table = pd.crosstab(data_2['cluster'], data_2['is_promo'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared Test for Promotion - p-value: {p_val}")

# If the p-value is less than 0.05, it indicates that the distribution of promotions across clusters is significantly different.
if p_val < 0.05:
    print(f"Differences between clusters in terms of 'promotion' are statistically significant")
else:
    print(f"Differences between clusters in terms of 'promotion' are not statistically significant")

In [None]:
from scipy.stats import chi2_contingency

# Chi-squared test for 'region' vs. 'cluster'
contingency_table = pd.crosstab(data_2['cluster'], data_2['region'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared Test for region - p-value: {p_val}")

# If the p-value is less than 0.05, it indicates that the distribution of regions across clusters is significantly different.
if p_val < 0.05:
    print(f"Differences between clusters in terms of 'region' are statistically significant")
else:
    print(f"Differences between clusters in terms of 'region' are not statistically significant")

In [None]:
from scipy.stats import chi2_contingency

# Chi-squared test for 'job_function' vs. 'cluster'
contingency_table = pd.crosstab(data_2['cluster'], data_2['job_function'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared Test for `job_function` - p-value: {p_val}")

# If the p-value is less than 0.05, it indicates that the distribution of job functions across clusters is significantly different.
if p_val < 0.05:
    print(f"Differences between clusters in terms of 'job_function' are statistically significant")
else:
    print(f"Differences between clusters in terms of 'job_function' are not statistically significant")

In [None]:
from scipy.stats import chi2_contingency

# Chi-squared test for 'job_level' vs. 'cluster'
contingency_table = pd.crosstab(data_2['cluster'], data_2['job_level'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared Test for `job_level` - p-value: {p_val}")

# If the p-value is less than 0.05, it indicates that the distribution of job levels across clusters is significantly different.
if p_val < 0.05:
    print(f"Differences between clusters in terms of 'job_level' are statistically significant")
else:
    print(f"Differences between clusters in terms of 'job_level' are not statistically significant")
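
The repeated cells above follow one pattern, which could be condensed into a small helper. This is a sketch: the function name `chi2_by_cluster` and the toy data are hypothetical, and the 0.05 threshold mirrors the cells above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_by_cluster(df, columns, cluster_col="cluster", alpha=0.05):
    """Run a chi-squared test of each column against the cluster label."""
    rows = []
    for col in columns:
        table = pd.crosstab(df[cluster_col], df[col])
        chi2_stat, p_val, dof, _ = chi2_contingency(table)
        rows.append({"variable": col, "chi2": chi2_stat, "dof": dof,
                     "p_value": p_val, "significant": bool(p_val < alpha)})
    return pd.DataFrame(rows)

# Hypothetical toy data: is_men depends on cluster, is_promo does not
toy = pd.DataFrame({
    "cluster": [0] * 20 + [1] * 20,
    "is_men": [1] * 18 + [0] * 2 + [1] * 3 + [0] * 17,
    "is_promo": [1, 0] * 20,
})
result = chi2_by_cluster(toy, ["is_men", "is_promo"])
print(result)
```

With the notebook's data, the call would take `data_2` and the list of categorical columns to test.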

## 6. Cluster Profiling Summary

Once you've completed the analysis using the above techniques, you can create a summary of the clusters' characteristics. For example:

* Cluster 1: might have a younger average age, higher performance ratings, and a more balanced gender distribution.
* Cluster 2: might have employees with higher tenure and predominantly male employees, with lower performance ratings.
* Cluster 3: might have a mix of younger and older employees, but more females and a higher rate of promotion readiness.



# Hierarchical Clustering with `Gower` Distance:

* Hierarchical clustering can run on a precomputed distance matrix.
* Gower distance is a metric designed to measure dissimilarity between mixed-type data points, so it can handle both continuous and categorical variables.
* Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down).
* `SciPy` and `sklearn` provide hierarchical clustering, but the Gower distance matrix must be computed separately (here with the `gower` package).
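
To make the idea concrete, here is a hand-written sketch of a Gower-style distance with equal feature weights and no missing values. The function name `gower_distance_matrix` and the toy frame are assumptions for illustration; the `gower` package used below may differ in details.

```python
import numpy as np
import pandas as pd

def gower_distance_matrix(df, numeric_cols, categorical_cols):
    """Average per-feature dissimilarity: range-scaled |difference| or 0/1 mismatch."""
    n = len(df)
    total = np.zeros((n, n))
    for col in numeric_cols:
        values = df[col].to_numpy(dtype=float)
        value_range = values.max() - values.min()
        if value_range > 0:  # a constant column contributes zero distance
            total += np.abs(values[:, None] - values[None, :]) / value_range
    for col in categorical_cols:
        values = df[col].to_numpy()
        total += (values[:, None] != values[None, :]).astype(float)
    return total / (len(numeric_cols) + len(categorical_cols))

# Hypothetical toy frame with one numeric and one categorical feature
toy = pd.DataFrame({"age": [20, 40, 60], "region": ["north", "north", "south"]})
D = gower_distance_matrix(toy, ["age"], ["region"])
print(D)
```

Every entry lies between 0 and 1, which is worth keeping in mind when choosing a distance threshold for cutting a dendrogram.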



In [None]:
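# Work on a separate frame without the K-Prototypes labels, so that the
# earlier `cluster` column does not influence the Gower distances
data_gower = data_scaled.drop(columns='cluster', errors='ignore')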



In [None]:
import gower

# Calculate the Gower distance matrix for dataset
distance_matrix = gower.gower_matrix(data_gower)



In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# linkage() expects a condensed distance vector. Passing the square matrix
# makes SciPy treat each row as an observation and emit a ClusterWarning.
# Ward linkage assumes Euclidean distances, so average linkage is used here.
Z = linkage(squareform(distance_matrix, checks=False), method='average')

In [None]:
# import matplotlib.pyplot as plt

# Create a dendrogram to visualize the clustering
plt.figure(figsize=(10, 7))
dendrogram(Z, labels=data_gower.index)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()



Cutting the Dendrogram to Form Clusters

To form flat clusters from the dendrogram, you can use the fcluster() function, either by cutting at a distance level (criterion='distance') or by requesting a maximum number of clusters (criterion='maxclust').



In [None]:
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram into at most 3 flat clusters (the same k as K-Prototypes above).
# With criterion='distance', t would instead be a height read off the dendrogram.
clusters = fcluster(Z, t=3, criterion='maxclust')



In [None]:
# Add the cluster labels to the original dataframe
data_gower['cluster'] = clusters
data_gower.head(2)
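
The full path from a square distance matrix to flat cluster labels can be checked on a tiny example. The matrix below is hypothetical and has two obvious groups; `squareform` converts it to the condensed form that `linkage` expects.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix: points {0, 1} and {2, 3} are close
D = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.05],
    [0.80, 0.90, 0.05, 0.00],
])

condensed = squareform(D)                  # n * (n - 1) / 2 = 6 pairwise distances
Z = linkage(condensed, method="average")   # average linkage works with any distance
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```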



---

# Self-Organizing Maps (SOMs):

* Self-Organizing Maps (SOMs) are neural networks that can be used for clustering.
* SOMs operate on numeric vectors, so categorical variables must be encoded (e.g., one-hot) before training alongside continuous variables.
* They are particularly useful for visualizing high-dimensional data in lower-dimensional representations (2D grid).

`MiniSom` is a Python package for training SOMs.



In [None]:
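# Exclude the K-Prototypes labels and cast to float so that `.values` is a numeric array
data_som = data_dummy.drop(columns='cluster', errors='ignore').astype(float)
data_som.head(2)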



In [None]:
from minisom import MiniSom

# Define the grid size
som = MiniSom(10, 10, len(data_som.columns), sigma=0.5, learning_rate=0.5, random_seed=8)
som.train(data_som.values, 1000, verbose=True)  # Training on data values



In [None]:
from minisom import MiniSom
import numpy as np
import pandas as pd

# Separate illustration on the UCI seeds dataset (two numeric columns only);
# stored as `seeds` so the employee DataFrame `data` is not overwritten
seeds = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', 
                    names=['area', 'perimeter', 'compactness', 'length_kernel', 'width_kernel',
                   'asymmetry_coefficient', 'length_kernel_groove', 'target'], usecols=[0, 5], 
                   sep='\t+', engine='python')
# data normalization
data_som = (seeds - np.mean(seeds, axis=0)) / np.std(seeds, axis=0)
data_som = data_som.values

# Initialization and training
som_shape = (1, 3)
som = MiniSom(som_shape[0], som_shape[1], data_som.shape[1], sigma=.5, learning_rate=.5,
              neighborhood_function='gaussian', random_seed=10)

som.train_batch(data_som, 500, verbose=True)

In [None]:
# each neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in data_som]).T


# with np.ravel_multi_index we convert the bidimensional
# coordinates to a monodimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, som_shape)



In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(data_som[cluster_index == c, 0],
                data_som[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)

# plotting centroids
for centroid in som.get_weights():
    plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
                s=80, linewidths=3.5, color='k', label='centroid')
plt.legend();

In [None]:
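# MiniSom has no `_labels` attribute; each sample's cluster is the index of
# its winning neuron, computed above as `cluster_index`
cluster_labels = cluster_index
# `data_som` is a NumPy array here, so collect the result in a new DataFrame
som_result = pd.DataFrame(data_som).assign(cluster=cluster_labels)
som_result.head(2)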



---

## 3. Evaluation of Clustering

Evaluating the quality of clusters is crucial to determine how well your clustering algorithm has performed.



#### Internal Evaluation Metrics:

Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Values range from -1 to 1, and higher is better. The snippet scores the current `clusters` labels on the two scaled continuous columns only.

    from sklearn.metrics import silhouette_score
    silhouette_avg = silhouette_score(data_2[['age_scaled', 'tenure_scaled']], clusters)
    print(f"Silhouette Score: {silhouette_avg}")

Inertia (within-cluster sum of squares): This is the sum of squared distances from each point to its assigned cluster center. For a fixed number of clusters, lower values indicate tighter clusters (for K-Means/K-Prototypes).
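
For mixed data, the silhouette score can also be computed from a precomputed distance matrix (for example a Gower matrix) rather than from the continuous columns alone. The sketch below uses a hypothetical 4x4 matrix and labels; `metric="precomputed"` is an existing option of `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical pairwise distances (e.g. from Gower) with two clear groups
D = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.05],
    [0.80, 0.90, 0.05, 0.00],
])
labels = np.array([0, 0, 1, 1])

# metric="precomputed" tells scikit-learn that D already holds pairwise distances
score = silhouette_score(D, labels, metric="precomputed")
print(f"Silhouette Score: {score:.3f}")
```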



#### External Evaluation Metrics (if ground truth is available):

Adjusted Rand Index (ARI): Measures the similarity between two data clusterings while correcting for chance.

Normalized Mutual Information (NMI): Measures the amount of information shared between two clusterings.
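
Both metrics are available in scikit-learn. The sketch below compares two hypothetical label vectors that describe the same partition under different label names. The same calls could compare two clusterings of one dataset when no ground truth exists.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical labelings: the same grouping, but with the cluster names permuted
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]

ari = adjusted_rand_score(labels_a, labels_b)
nmi = normalized_mutual_info_score(labels_a, labels_b)
print(f"ARI: {ari:.3f}, NMI: {nmi:.3f}")
```

Both scores ignore the label names and depend only on which points are grouped together.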



## 4. Visualization of Clusters

Visualizing clusters is essential to interpret the results and help with reporting.

#### Pairwise Plots:

Use pairwise plots (scatter plots) to visualize the relationship between different features, coloring by the cluster labels.

    import seaborn as sns
    sns.pairplot(data_2, hue='cluster', vars=['age_scaled', 'tenure_scaled', 'perf_rank'])



#### t-SNE or PCA:

    Reduce the data to two dimensions for visualization using t-SNE or PCA (Principal Component Analysis), coloring by cluster labels.



In [None]:
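# PCA needs numeric input, so the categorical columns are one-hot encoded first
# (`job_role` was dropped earlier and is no longer available)



In [None]:
# assign work dataset
pca_columns = ['job_level', 'job_function', 'region']
data_pca = pd.get_dummies(data_scaled[pca_columns], columns=pca_columns).astype(float)
data_pca.head(2)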



In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_pca)
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=data_2['cluster'])
plt.show()
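
The projection step can be checked in isolation on toy data. The frame below is hypothetical; it mixes one scaled numeric column with a one-hot encoded categorical column, which is one possible way to feed mixed data to PCA.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy frame; names and values are illustrative only
toy = pd.DataFrame({
    "age_scaled": [-1.2, -0.4, 0.1, 0.6, 1.3, -0.4],
    "region": ["north", "south", "north", "east", "south", "east"],
})

# One-hot encode the categorical column so that every feature is numeric
X = pd.get_dummies(toy, columns=["region"], dtype=float)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)   # one 2D point per row, ready for a scatter plot
print(coords.shape, pca.explained_variance_ratio_.round(2))
```

The explained variance ratio indicates how much of the spread the 2D view retains, which helps judge how far to trust the plot.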



## 5. Conclusion

Clustering mixed data (categorical + continuous) requires careful preprocessing and a clustering algorithm suited to the different data types. The following steps can be taken:

* Data Preprocessing: Handle missing values, encode categorical variables, and standardize continuous variables.
* Clustering Algorithms: Use specialized algorithms like K-Prototypes, Hierarchical Clustering with Gower distance, or Self-Organizing Maps (SOMs) for mixed data types.
* Evaluation: Use internal evaluation metrics (e.g., silhouette score) and external metrics if ground truth is available.
* Visualization: Use pairwise plots, t-SNE, or PCA to visualize the clusters.

Together, these steps support a structured exploration of mixed data and help turn differences between clusters into findings that can inform decisions.