<a href="https://colab.research.google.com/github/dotmanjohn/kosh/blob/master/Customer_Segmentation_using_k_means.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Perform Customer Segmentation Analysis of Bank Customers Using k-means
You are working for an international bank. The credit department is reviewing its offerings and wants to get a better understanding of its current customers. You have been tasked with performing customer segmentation analysis. You will perform cluster analysis with k-means to identify groups of similar customers.

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
from sklearn.preprocessing import StandardScaler

In [None]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter05/DataSet/german.data-numeric'

In [None]:
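# A regex separator needs engine='python'; prefix= was removed from read_csv in pandas 2.0.
df = pd.read_csv(file_url, header=None, sep=r'\s\s+', engine='python').add_prefix('X')
df.head(20)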


Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24
0,1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1.0
1,2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2.0
2,4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1.0
3,1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1.0
4,1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2.0
5,4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1.0
6,4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1.0
7,2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1.0
8,4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1.0
9,2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2.0


Even though every column in this dataset holds integers, most of them are actually categorical variables, so their values are not continuous. Only two variables are truly numeric, and we will use those for clustering.

Three columns (X1, X3, and X9) initially look like candidates. On closer inspection, however, X1 appears to be categorical because its values are all multiples of 3, which leaves columns X3 and X9.
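One rough way to sanity-check this reading is to count distinct values and test divisibility. The sketch below is only an illustration: it uses the ten rows displayed above, typed in by hand, rather than the full dataset, and the helper name `summarize` is hypothetical.

```python
# Values copied from the ten rows displayed above (not the full dataset).
sample = {
    "X1": [6, 48, 12, 42, 24, 36, 24, 36, 12, 30],
    "X3": [12, 60, 21, 79, 49, 91, 28, 69, 31, 52],
    "X9": [67, 22, 49, 45, 53, 35, 53, 35, 61, 28],
}


def summarize(values):
    """Hypothetical helper: distinct-value count and a multiple-of-3 check."""
    return {
        "n_unique": len(set(values)),
        "all_multiples_of_3": all(v % 3 == 0 for v in values),
    }


summary = {name: summarize(values) for name, values in sample.items()}
for name, stats in summary.items():
    print(name, stats)
```

On the full DataFrame, `df[['X1', 'X3', 'X9']].nunique()` gives the distinct-value counts directly.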

Extract the X3 and X9 columns and assign them to a new variable called X

In [None]:
X = df[['X3', 'X9']]

Instantiate a StandardScaler object, standardize the data, and store the result in a variable called X_scaled

In [None]:
standard_scaler = StandardScaler()
X_scaled = standard_scaler.fit_transform(X)
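To make the standardization step concrete, here is a minimal sketch of the arithmetic in plain Python: subtract the column mean and divide by the population standard deviation, which is the convention StandardScaler follows. The function name `standardize` is hypothetical, and the sample values are the X3 entries from the rows shown earlier, not the full column.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score a list of numbers using the population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(v - mu) / sigma for v in column]


# Sample values for illustration only (X3 from the rows shown earlier).
x3_sample = [12, 60, 21, 79, 49, 91, 28, 69, 31, 52]
x3_scaled = standardize(x3_sample)

# After scaling, the mean is approximately 0 and the standard deviation approximately 1.
print(round(fmean(x3_scaled), 6), round(pstdev(x3_scaled), 6))
```

Scaling matters here because k-means relies on Euclidean distance, so a column with a larger numeric range would otherwise dominate the clustering.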

Create an empty pandas DataFrame called clusters and an empty list called inertia. Then add a column called 'cluster_range' to the clusters DataFrame and assign it range(1, 15), which covers 1 through 14 clusters

In [None]:
clusters = pd.DataFrame()
clusters['cluster_range'] = range(1, 15)
inertia = []

Using a for loop, fit a k-means model with the number of clusters defined in the 'cluster_range' column, extract the relevant inertia value, and append it to the inertia list

In [None]:
for k in clusters['cluster_range']:
  kmeans = KMeans(n_clusters=k, random_state=8).fit(X_scaled)
  inertia.append(kmeans.inertia_)

Create a new column called 'inertia' in the clusters DataFrame, assign it the inertia list, and print the clusters DataFrame

In [None]:
clusters['inertia'] = inertia
clusters

Unnamed: 0,cluster_range,inertia
0,1,2000.0
1,2,1280.612749
2,3,767.637196
3,4,576.086134
4,5,443.905649
5,6,360.418261
6,7,291.39305
7,8,252.709449
8,9,219.498996
9,10,193.015983
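Inertia is the sum of squared distances from each point to its nearest centroid. The sketch below computes it by hand on a small sample (X3 and X9 from the rows shown earlier) to show why a single cluster on standardized data yields n_samples * n_features, which would match the 2000.0 in the first row if the data has 1,000 rows and two features. The helper names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score a list of numbers using the population standard deviation."""
    mu, sigma = fmean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total


# Small two-column sample (X3 and X9 from the rows shown earlier).
x3 = standardize([12, 60, 21, 79, 49, 91, 28, 69, 31, 52])
x9 = standardize([67, 22, 49, 45, 53, 35, 53, 35, 61, 28])
points = list(zip(x3, x9))

# With one cluster the centroid is the overall mean, which is (0, 0) after scaling.
one_cluster_inertia = inertia(points, [(0.0, 0.0)])
print(round(one_cluster_inertia, 6))  # 20.0 = 10 samples * 2 features
```

Adding centroids can only lower this total, which is why the inertia column decreases as the number of clusters grows.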


Use the mark_line and encode methods from the altair package to display the Elbow plot

In [None]:
alt.Chart(clusters).mark_line().encode(x='cluster_range', y='inertia')

Looking at the Elbow plot, find the optimal number of clusters and save this value in a new variable called optim_cluster

In [None]:
optim_cluster=5
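Reading an elbow by eye is subjective, so it can help to look at how much inertia falls with each extra cluster. The sketch below is one possible heuristic, not the method used in this notebook: it uses the inertia values printed in the table above, and both `pick_k` and its 20% threshold are assumptions chosen for illustration.

```python
# Inertia values copied from the table above, for k = 1..10.
inertia_by_k = [
    2000.0, 1280.612749, 767.637196, 576.086134, 443.905649,
    360.418261, 291.39305, 252.709449, 219.498996, 193.015983,
]


def relative_drops(values):
    """Fractional decrease in inertia when going from k-1 to k clusters."""
    return [(prev - cur) / prev for prev, cur in zip(values, values[1:])]


def pick_k(values, threshold=0.20):
    """Hypothetical rule: last k reached before the drop falls below threshold."""
    k = 1
    for drop in relative_drops(values):
        if drop < threshold:
            break
        k += 1
    return k


for k, drop in enumerate(relative_drops(inertia_by_k), start=2):
    print(f"k={k}: inertia fell by {drop:.1%}")
print("suggested k:", pick_k(inertia_by_k))
```

With a 20% threshold this rule lands on 5, but a different threshold gives a different answer (0.15 yields 7), so treat it as a cross-check on the plot rather than a replacement for it.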

Fit k-means with optim_cluster clusters and random_state=42, leaving the other hyperparameters at their defaults

In [None]:
kmeans = KMeans(random_state=42, n_clusters=optim_cluster)
kmeans.fit(X_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=42, tol=0.0001, verbose=0)

Use the predict() method from sklearn to get the assigned clusters for all data points saved in X_scaled

In [None]:
df['cluster'] = kmeans.predict(X_scaled)
df.head(20)

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,cluster
0,1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1.0,4
1,2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2.0,0
2,4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1.0,4
3,1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1.0,1
4,1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2.0,4
5,4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1.0,1
6,4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1.0,4
7,2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1.0,1
8,4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1.0,4
9,2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2.0,0
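Conceptually, the prediction step assigns each sample to the centroid closest to it in Euclidean distance. The sketch below illustrates that rule in plain Python; the centroids and points are made-up values, and `nearest_centroid` is a hypothetical helper rather than part of scikit-learn.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids
    ]
    return distances.index(min(distances))


# Hypothetical centroids and scaled points, chosen only for illustration.
centroids = [(-1.0, -1.0), (0.0, 1.5), (2.0, 0.0)]
points = [(-0.8, -1.2), (0.1, 1.0), (1.7, 0.3), (-0.2, 0.4)]

labels = [nearest_centroid(p, centroids) for p in points]
print(labels)  # [0, 1, 2, 1]
```

In scikit-learn, the fitted centroids are available as `kmeans.cluster_centers_`. The label numbers themselves carry no meaning beyond identifying a group.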


Plot the clusters

In [None]:
alt.Chart(df).mark_circle().encode(x='X3', y='X9',color='cluster:N',
      tooltip=['X3', 'X9']).interactive()

Tune the hyperparameters and re-train k-means: fit a model with k-means++ initialization, the same number of clusters, random_state=1, n_init=50, and max_iter=1000

In [None]:
kmeans2 = KMeans(random_state=1, n_clusters=optim_cluster, init='k-means++', n_init=50, max_iter=1000)
kmeans2.fit(X_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
       n_clusters=5, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

Use the predict() method from sklearn to get the assigned clusters for all data points saved in X_scaled:

In [None]:
df['cluster2'] = kmeans2.predict(X_scaled)
df.head(20)

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,cluster,cluster2
0,1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1.0,4,2
1,2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2.0,0,4
2,4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1.0,4,2
3,1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1.0,1,0
4,1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2.0,4,2
5,4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1.0,1,0
6,4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1.0,4,2
7,2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1.0,1,0
8,4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1.0,4,2
9,2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2.0,0,4


Plot the scatter plot with the altair package

In [None]:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='X3', y='X9',color='cluster2:N', tooltip=['X3', 'X9']).interactive()

After tuning n_init and max_iter, the two hyperparameters that control how many times k-means is re-initialized and how long each run may iterate, we can see that raising n_init (the number of runs with different centroid seeds) had little impact on the clustering result for this dataset. The cluster labels are reassigned, which is evident from the change in cluster colors (as expected), but the grouping of points is essentially the same on both plots.
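Because the label numbers are arbitrary, comparing two clusterings means checking whether they group the samples the same way, not whether the numbers match. The sketch below does this for the ten rows shown above, using labels copied from the 'cluster' and 'cluster2' columns; `same_partition` is a hypothetical helper, and ten rows are of course not proof for the whole dataset.

```python
# Labels copied from the 'cluster' and 'cluster2' columns of the rows shown above.
cluster = [4, 0, 4, 1, 4, 1, 4, 1, 4, 0]
cluster2 = [2, 4, 2, 0, 2, 0, 2, 0, 2, 4]


def same_partition(a, b):
    """True if labelings a and b group the samples identically up to renaming."""
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # setdefault records the first pairing; any later mismatch breaks the one-to-one mapping.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return len(a) == len(b)


print(same_partition(cluster, cluster2))  # True
```

For the full dataset, a label-invariant score such as `sklearn.metrics.adjusted_rand_score(df['cluster'], df['cluster2'])` serves the same purpose, with 1.0 meaning identical groupings.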

In this case, it is better to use a lower value for n_init, as it will shorten training time. For a different dataset, however, the results may differ drastically depending on the n_init value. In that case, we will have to find a value of n_init that is neither too small nor too large. The aim is to find the point at which the results no longer change much from one n_init value to the next.
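One way to look for that point is to fit the model with several n_init values and compare the resulting inertia. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_blobs` as a stand-in for X_scaled, and the candidate values (1, 5, 10, 50) are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data; in the notebook this would be X_scaled.
X_demo, _ = make_blobs(n_samples=300, centers=5, random_state=0)

results = {}
for n_init in (1, 5, 10, 50):
    # Each fit keeps the best of n_init runs started from different centroid seeds.
    model = KMeans(n_clusters=5, n_init=n_init, random_state=1).fit(X_demo)
    results[n_init] = model.inertia_

for n_init, value in results.items():
    print(f"n_init={n_init:>2}: inertia={value:.2f}")
```

If the inertia stops improving beyond some n_init, larger values mostly add training time.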