## K-Means Clustering

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA


In [None]:
# import all of the functions from k means clustering src
from rice_ml.k_means_clustering import (get_features, 
                                        get_feature_types, 
                                        create_preprocessor,
                                        plot_elbow_curve,
                                        plot_silhouette_scores,
                                        train_kmeans,
                                        evaluate_clustering,
                                        plot_pca_clusters,
                                        plot_cluster_distribution,
                                        cluster_vs_target,
                                        cluster_numeric_summary
                                        )

In [None]:
##  Load and preprocess data

# load dataset
df = pd.read_csv("unsupervised_ObesityDataSet_raw_and_data_sinthetic.csv")

# Prepare features
X = get_features(df)

# Feature types
num_features, cat_features = get_feature_types()

# Preprocessing
preprocessor = create_preprocessor(num_features, cat_features)

# Transform data once for evaluation plots
X_processed = preprocessor.fit_transform(X)


Once everything has been loaded and the data has been prepared for the algorithm, we can begin training the model. For K-means clustering, we start by picking k, the number of clusters. One way to do this is to look at the elbow curve, which shows how the within-cluster sum of squared distances falls as k grows. From the graph, the elbow appears to be around k = 5, but it is hard to pin down.
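As a rough illustration of what an elbow curve is built from, the sketch below fits scikit-learn's `KMeans` for several values of k and records the inertia (within-cluster sum of squared distances) of each fit. It is not the implementation of `plot_elbow_curve`, which is not shown here. The data `X_demo` is synthetic and only stands in for `X_processed`, and the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_processed (assumption: not the obesity data).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    # inertia_ is the sum of squared distances from each point to its centroid.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` gives the elbow curve. The "elbow" is the point after which adding more clusters yields only small reductions in inertia.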

In [None]:
plot_elbow_curve(X_processed)

Since the elbow graph did not give a clear answer, we can instead try a silhouette score analysis. In this chart, a higher value means more clearly defined clusters. In this first pass, the silhouette analysis shows weak clustering overall, but we can still see that 4 clusters provide the most separation. Thus, we will use k = 4.
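The comparison behind a silhouette chart can be sketched as follows: fit `KMeans` for each candidate k, compute `silhouette_score` on the resulting labels, and pick the k with the highest score. This is an assumption about the general approach, not the code inside `plot_silhouette_scores`. `X_demo` is synthetic data standing in for `X_processed`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_processed (assumption: not the obesity data).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Candidate k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

The score ranges from -1 to 1. Values near 0 indicate overlapping clusters, so a low maximum is a sign that no choice of k separates the data cleanly.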

In [None]:
plot_silhouette_scores(X_processed)

Once k has been selected, we can begin training the model! We run the model for 50 iterations. On each iteration, the model proposes an updated set of centroids (cluster centers) that group the data more tightly.
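To make the iteration step concrete, here is a minimal NumPy sketch of the standard Lloyd-style update: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The function `lloyd_kmeans` is a hypothetical helper written for illustration. It is not `train_kmeans`, whose implementation is not shown here, and the random initialization and early-stopping rule are assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Hypothetical helper: plain Lloyd-style k-means, for illustration only."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(centers):
        # Distance from every point to every centroid, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so further iterations change nothing
        centroids = new_centroids
    return centroids, assign(centroids)

# Toy data: two well-separated groups of 50 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = lloyd_kmeans(X_demo, k=2)
print(centroids)
```

The iteration cap acts as an upper bound: the loop exits early once the centroids stop changing.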

In [None]:
# define k
optimal_k = 4

# Train final model
pipeline, clusters = train_kmeans(preprocessor, X, optimal_k)

In [None]:
df["Cluster"] = clusters


Once the model has been trained, we can evaluate how well it did. We look at the silhouette score of the final clusters (0.155) and at a map of the final clusters. The silhouette score ranges from -1 to 1, so a value this close to 0 points to overlapping clusters. The map shows 4 distinguishable clusters, as marked by the cluster labels, but there is some overlap, and some of the observations are hard to separate.
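Judging by its name, `plot_pca_clusters` draws the clusters on a PCA projection, though its implementation is not shown here. The sketch below shows how such a 2D map could be produced with scikit-learn's `PCA`, under that assumption. `X_demo` is synthetic 6-dimensional data standing in for `X_processed`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 6-dimensional stand-in for X_processed (assumption: not the obesity data).
X_demo, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# Project onto the two directions of highest variance for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_demo)

# Fraction of the total variance that the 2D map retains.
retained = pca.explained_variance_ratio_.sum()
print(f"2D coordinates shape: {coords.shape}")
print(f"variance retained by 2 components: {retained:.2%}")
```

A scatter plot of `coords` colored by `labels` gives the cluster map. When the retained variance is low, clusters that look overlapping in 2D may still be separated in the full feature space, so the map is best read alongside the silhouette score.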

In [None]:
# Silhouette Analysis Score
sil_score = evaluate_clustering(X_processed, clusters)
print(f"Silhouette Score: {sil_score:.3f}")

In [None]:
# Plot of final clusters
plot_pca_clusters(X_processed, clusters)

In addition to the spatial map of the clusters, we can look at how observations are distributed across the clusters. First, we can see the number of observations assigned to each cluster. This can be shown for each cluster as a whole, as in the bar chart, or broken down by the target variable, as in the table that follows it.
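Both views can be sketched with plain pandas: `value_counts` for the cluster sizes and `pd.crosstab` for the breakdown by a second variable. The toy frame and the column name `Target` are placeholders for illustration. Only the `Cluster` column name comes from this notebook, and the helpers `plot_cluster_distribution` and `cluster_vs_target` may work differently.

```python
import pandas as pd

# Toy data; "Target" is a placeholder column name, not the dataset's.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1, 2],
    "Target": ["A", "B", "A", "A", "B", "B"],
})

# Number of observations per cluster, ordered by cluster label.
counts = toy["Cluster"].value_counts().sort_index()
print(counts)

# Rows are clusters, columns are target categories, cells are counts.
table = pd.crosstab(toy["Cluster"], toy["Target"])
print(table)
```

Passing `normalize="index"` to `pd.crosstab` turns the counts into per-cluster proportions, which is easier to compare when the clusters differ a lot in size.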

In [None]:
# Bar Chart: distribution of observations in clusters
plot_cluster_distribution(df)

In [None]:
# Table: distribution of observation variables in clusters
print(cluster_vs_target(df))

We can also look at the average value of each numeric variable within the clusters. This is a useful check on whether the clusters capture meaningful differences between observations. As we can see, many variables differ clearly between the clusters (especially Age, Weight, and NCP), which suggests that the algorithm found groups that differ in substance.
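A per-cluster average table of this kind can be sketched with a pandas `groupby`. The toy values below are invented for illustration, and `cluster_numeric_summary` may compute more than the mean.

```python
import pandas as pd

# Toy data with made-up values; only the column names echo the notebook.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1],
    "Age": [20.0, 22.0, 40.0, 44.0],
    "Weight": [60.0, 64.0, 90.0, 94.0],
})

# Mean of each numeric column within each cluster.
summary = toy.groupby("Cluster")[["Age", "Weight"]].mean()
print(summary)
```

Large gaps between rows of the summary for a given column indicate that the column helps distinguish the clusters.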

In [None]:
print(cluster_numeric_summary(df, num_features))