# Customer Segmentation using KMeans

We have a customer dataset, and we want to segment customers based on this historical data. Customer segmentation is the practice of partitioning a customer base into groups of individuals with similar characteristics. It is a significant strategy because a business can target specific groups of customers and allocate marketing resources effectively. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service. One business task is to retain those customers. Another group might include customers from non-profit organizations, and so on.

In [14]:
import pandas as pd
import numpy as np

In [1]:
df = pd.read_csv("Cust_Segmentation.csv")
df.head()

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,Address,DebtIncomeRatio
0,1,41,2,6,19,0.124,1.073,0.0,NBA001,6.3
1,2,47,1,26,100,4.582,8.218,0.0,NBA021,12.8
2,3,33,2,10,57,6.111,5.802,1.0,NBA013,20.9
3,4,29,2,4,19,0.681,0.516,0.0,NBA009,6.3
4,5,47,1,31,253,9.308,8.908,0.0,NBA008,7.2


## Data Pre-processing and Selection

**Address** in this dataset is a categorical variable. The k-means algorithm isn't directly applicable to categorical variables because Euclidean distance isn't meaningful for discrete categories. So, let's drop this feature before clustering.
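
To make the point about Euclidean distance concrete, here is a minimal sketch that integer-codes three address values taken from the preview above. The frame `toy` and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame with three address codes from the preview above.
toy = pd.DataFrame({"Address": ["NBA001", "NBA021", "NBA013"]})

# Integer coding imposes an arbitrary (alphabetical) order:
# NBA001 -> 0, NBA013 -> 1, NBA021 -> 2.
codes = toy["Address"].astype("category").cat.codes.to_numpy()

# Distances between the coded values.
dist_001_021 = abs(int(codes[0]) - int(codes[1]))
dist_001_013 = abs(int(codes[0]) - int(codes[2]))

print(dist_001_021, dist_001_013)
```

The coded distance from NBA001 to NBA021 comes out twice as large as the distance to NBA013, even though no address is meaningfully "closer" to another. A distance-based method such as k-means would treat that artificial ordering as real structure.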

In [2]:
df = df.drop('Address',axis=1)
df.head()

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio
0,1,41,2,6,19,0.124,1.073,0.0,6.3
1,2,47,1,26,100,4.582,8.218,0.0,12.8
2,3,33,2,10,57,6.111,5.802,1.0,20.9
3,4,29,2,4,19,0.681,0.516,0.0,6.3
4,5,47,1,31,253,9.308,8.908,0.0,7.2


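Now let's **normalize** the dataset. `StandardScaler` rescales every feature to zero mean and unit variance, so that features with a large range, such as Income, do not dominate the distance computation.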

In [5]:
from sklearn.preprocessing import StandardScaler

In [16]:
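# Use every column except Customer Id (column 0) as a feature
x = df.values[:,1:]
# Replace missing values (NaN in Defaulted) with 0
x = np.nan_to_num(x)
# Standardize each feature to zero mean and unit variance
clus_data = StandardScaler().fit_transform(x)
clus_data[0:5]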

array([[ 0.74291541,  0.31212243, -0.37878978, -0.71845859, -0.68381116,
        -0.59048916, -0.52379654, -0.57652509],
       [ 1.48949049, -0.76634938,  2.5737211 ,  1.38432469,  1.41447366,
         1.51296181, -0.52379654,  0.39138677],
       [-0.25251804,  0.31212243,  0.2117124 ,  0.26803233,  2.13414111,
         0.80170393,  1.90913822,  1.59755385],
       [-0.75023477,  0.31212243, -0.67404087, -0.71845859, -0.42164323,
        -0.75446707, -0.52379654, -0.57652509],
       [ 1.48949049, -0.76634938,  3.31184882,  5.35624866,  3.63890032,
         1.71609424, -0.52379654, -0.44250653]])
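
As a sanity check on what the scaling step does, the sketch below standardizes the Age and Income values of the five customers shown in the preview and reproduces the result by hand. The array `sample` is an illustrative subset, so its scaled values differ from the output above, which uses the mean and standard deviation of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Income of the five customers shown in the preview above.
sample = np.array([[41, 19], [47, 100], [33, 57], [29, 19], [47, 253]], dtype=float)

scaled = StandardScaler().fit_transform(sample)

# The same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
```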

## Modeling

Let's apply k-means to our dataset and take a look at the cluster labels.

In [17]:
from sklearn.cluster import KMeans

In [20]:
# k-means++ seeding, 3 clusters; keep the best of 12 differently seeded runs
k_means = KMeans(init='k-means++',n_clusters=3,n_init=12)
k_means.fit(clus_data)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=12, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
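
Because the model is fitted with `random_state=None`, the cluster numbering can change from one run to the next, so the labels discussed in the rest of this notebook belong to this particular run. The sketch below, which uses synthetic data from `make_blobs` as a stand-in for `clus_data` and an arbitrary seed of 42, shows one way to make the result repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for clus_data: 300 points around 3 centers.
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same settings as above, plus a fixed seed (42 is an arbitrary choice).
run_a = KMeans(init="k-means++", n_clusters=3, n_init=12, random_state=42).fit(X_demo)
run_b = KMeans(init="k-means++", n_clusters=3, n_init=12, random_state=42).fit(X_demo)

# With the same seed, both fits assign identical labels.
same_labels = bool((run_a.labels_ == run_b.labels_).all())
print(same_labels)
```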

Now let's grab the label of each point from the fitted model's `labels_` attribute and save it as `k_means_labels`.

In [22]:
k_means_labels = k_means.labels_
k_means_labels[0:5]

array([1, 0, 2, 1, 0], dtype=int32)
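
A quick way to see how many customers fall into each cluster is to count the labels. The sketch below uses only the five labels printed above as a stand-in (`labels_demo` is an illustrative name). Applied to the full `k_means_labels` array, the counts should match the cluster sizes reported in the Examine Clusters section (147, 537, and 166 in this run).

```python
import numpy as np

# First five labels from the output above, used as a small stand-in.
labels_demo = np.array([1, 0, 2, 1, 0], dtype=np.int32)

# sizes[i] is the number of points assigned to cluster i.
sizes = np.bincount(labels_demo, minlength=3)
print(sizes)
```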

## Insights

We assign each label to its row in the dataframe.

In [23]:
df['Clus_lbl'] = k_means_labels
df.head()

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
0,1,41,2,6,19,0.124,1.073,0.0,6.3,1
1,2,47,1,26,100,4.582,8.218,0.0,12.8,0
2,3,33,2,10,57,6.111,5.802,1.0,20.9,2
3,4,29,2,4,19,0.681,0.516,0.0,6.3,1
4,5,47,1,31,253,9.308,8.908,0.0,7.2,0


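We can check the centroid of each cluster, in the original units, by averaging the features within each cluster. (`Customer Id` is only an identifier, so its mean carries no meaning.)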

In [24]:
df.groupby('Clus_lbl').mean()

Clus_lbl,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio
0,424.408163,43.0,1.931973,17.197279,101.959184,4.220673,7.954483,0.162393,13.915646
1,426.122905,33.817505,1.603352,7.625698,36.143389,0.853128,1.816855,0.0,7.964991
2,424.451807,31.891566,1.861446,3.963855,31.789157,1.576675,2.843355,0.993939,13.994578
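
The fitted model also stores its centroids in `cluster_centers_`, but in standardized units. The sketch below, on made-up toy data with hypothetical names (`toy`, `scaler`, `km`), keeps the fitted scaler and maps the centroids back to original units with `inverse_transform`, then compares them with the per-cluster means. Note one difference in the notebook itself: missing `Defaulted` values were replaced with 0 before clustering, while `groupby(...).mean()` skips them, so the `Defaulted` column in the table above is not exactly the centroid coordinate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: three well-separated groups of (Age, Income); values are made up.
toy = pd.DataFrame({
    "Age":    [25, 27, 26, 40, 42, 41, 55, 57, 56],
    "Income": [20, 22, 21, 60, 62, 61, 150, 155, 152],
})

scaler = StandardScaler()                      # keep the fitted scaler around
X = scaler.fit_transform(toy.to_numpy(dtype=float))

km = KMeans(n_clusters=3, n_init=12, random_state=0).fit(X)

# Centroids live in standardized space; map them back to original units.
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=toy.columns)

# Per-cluster feature means, computed as in the cell above.
means = toy.assign(Clus_lbl=km.labels_).groupby("Clus_lbl").mean()

print(centers)
print(means)
```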


## Examine Clusters

### Cluster 0

In [45]:
c0 = df[df['Clus_lbl'] ==0]
c0[0:10]

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
1,2,47,1,26,100,4.582,8.218,0.0,12.8,0
4,5,47,1,31,253,9.308,8.908,0.0,7.2,0
5,6,40,1,23,81,0.998,7.831,,10.9,0
9,10,47,3,23,115,0.653,3.947,0.0,4.0,0
18,19,44,1,18,61,2.806,3.782,,10.8,0
23,24,45,1,19,77,2.303,4.165,0.0,8.4,0
24,25,37,4,10,123,3.022,18.257,0.0,17.3,0
31,32,42,2,12,55,2.533,5.717,0.0,15.0,0
39,40,39,3,16,126,1.405,7.163,,6.8,0
41,42,48,3,17,113,3.376,10.184,0.0,12.0,0


In [46]:
c0.shape

(147, 10)

In [47]:
c0.describe()

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
count,147.0,147.0,147.0,147.0,147.0,147.0,147.0,117.0,147.0,147.0
mean,424.408163,43.0,1.931973,17.197279,101.959184,4.220673,7.954483,0.162393,13.915646,0.0
std,242.422273,6.31697,1.031423,6.609084,59.124188,3.590995,4.988961,0.370397,7.860493,0.0
min,2.0,26.0,1.0,1.0,30.0,0.288,1.003,0.0,2.0,0.0
25%,220.0,39.0,1.0,12.0,64.0,1.6395,4.691,0.0,7.7,0.0
50%,444.0,43.0,2.0,17.0,83.0,3.176,7.036,0.0,13.1,0.0
75%,639.0,47.0,3.0,22.0,119.0,5.3265,9.6765,0.0,17.85,0.0
max,850.0,56.0,5.0,33.0,446.0,20.561,35.197,1.0,41.3,0.0


Cluster 0 contains 147 customers aged 26 to 56 (mean 43.0), with 1 to 33 years of employment (mean 17.2) and incomes from 30 to 446 (mean 102.0).

### Cluster 1

In [49]:
c1 = df[df['Clus_lbl'] ==1]
c1[0:10]

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
0,1,41,2,6,19,0.124,1.073,0.0,6.3,1
3,4,29,2,4,19,0.681,0.516,0.0,6.3,1
6,7,38,2,4,56,0.442,0.454,0.0,1.6,1
7,8,42,3,0,64,0.279,3.945,0.0,6.6,1
8,9,26,1,5,18,0.575,2.215,,15.5,1
11,12,34,2,9,40,0.374,0.266,,1.6,1
12,13,24,1,7,18,0.526,0.643,0.0,6.5,1
13,14,46,1,6,30,1.415,3.865,,17.6,1
15,16,24,1,1,16,0.185,1.287,,9.2,1
16,17,29,1,1,17,0.132,0.293,0.0,2.5,1


In [51]:
c1.shape

(537, 10)

In [57]:
c1.describe()

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
count,537.0,537.0,537.0,537.0,537.0,537.0,537.0,418.0,537.0,537.0
mean,426.122905,33.817505,1.603352,7.625698,36.143389,0.853128,1.816855,0.0,7.964991,1.0
std,250.158472,7.053912,0.873014,5.341293,17.499358,0.778158,1.312747,0.0,4.927747,0.0
min,1.0,20.0,1.0,0.0,13.0,0.012,0.046,0.0,0.1,1.0
25%,209.0,29.0,1.0,4.0,24.0,0.291,0.867,0.0,4.4,1.0
50%,412.0,34.0,1.0,7.0,32.0,0.606,1.481,0.0,7.0,1.0
75%,656.0,39.0,2.0,11.0,44.0,1.168,2.47,0.0,10.5,1.0
max,849.0,56.0,5.0,23.0,120.0,4.881,7.286,0.0,24.6,1.0


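Cluster 1 contains 537 customers aged 20 to 56 (mean 33.8), with 0 to 23 years of employment (mean 7.6) and incomes from 13 to 120 (mean 36.1). None of the customers with a recorded `Defaulted` value has defaulted.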

### Cluster 2

In [53]:
c3 = df[df['Clus_lbl'] ==2]
c3[0:10]

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
2,3,33,2,10,57,6.111,5.802,1.0,20.9,2
10,11,44,3,8,88,0.285,5.083,1.0,6.1,2
14,15,28,3,2,20,0.233,1.647,1.0,9.4,2
22,23,28,3,6,47,5.574,3.732,1.0,19.8,2
32,33,23,2,0,42,1.019,0.619,1.0,3.9,2
36,37,35,3,5,37,0.581,1.417,1.0,5.4,2
37,38,37,1,0,18,1.584,0.738,1.0,12.9,2
40,41,20,1,4,14,0.201,1.157,1.0,9.7,2
52,53,24,1,3,19,1.358,3.278,1.0,24.4,2
54,55,29,3,5,70,3.176,10.754,1.0,19.9,2


In [54]:
c3.shape

(166, 10)

In [55]:
c3.describe()

,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_lbl
count,166.0,166.0,166.0,166.0,166.0,166.0,166.0,165.0,166.0,166.0
mean,424.451807,31.891566,1.861446,3.963855,31.789157,1.576675,2.843355,0.993939,13.994578,2.0
std,234.246091,8.031019,0.952869,3.807316,15.785229,1.3943,2.323803,0.07785,7.465137,0.0
min,3.0,20.0,1.0,0.0,14.0,0.073,0.161,0.0,0.9,2.0
25%,223.25,26.0,1.0,1.0,20.0,0.46225,1.255,1.0,8.4,2.0
50%,439.5,29.5,2.0,3.0,27.0,1.2095,2.3355,1.0,13.2,2.0
75%,607.5,36.75,2.0,6.0,40.0,2.142,3.77175,1.0,18.55,2.0
max,848.0,55.0,5.0,16.0,94.0,6.912,15.405,1.0,35.3,2.0


Cluster 2 contains 166 customers aged 20 to 55 (mean 31.9), with 0 to 16 years of employment (mean 4.0) and incomes from 14 to 94 (mean 31.8). Nearly all of them have defaulted (mean `Defaulted` of 0.99).
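
The per-cluster ranges above can also be collected in a single table instead of three separate `describe()` calls. The sketch below uses four rows copied from the previews above (two from cluster 0, two from cluster 1); `demo` and `summary` are illustrative names, and on the full data the same call would be applied to `df`.

```python
import pandas as pd

# Four rows copied from the previews above: two from cluster 0, two from cluster 1.
demo = pd.DataFrame({
    "Age": [47, 47, 41, 29],
    "Years Employed": [26, 31, 6, 4],
    "Income": [100, 253, 19, 19],
    "Clus_lbl": [0, 0, 1, 1],
})

# One row per cluster, with min / mean / max for each selected feature.
cols = ["Age", "Years Employed", "Income"]
summary = demo.groupby("Clus_lbl")[cols].agg(["min", "mean", "max"])
print(summary)
```

Comparing the means side by side is usually more informative than the ranges alone, since the minimum and maximum values of the three clusters overlap heavily.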

Now we can create a profile for each group, considering the common characteristics of each cluster. For example, the 3 clusters can be:

- AFFLUENT, EDUCATED AND OLD AGED
- MIDDLE AGED AND MIDDLE INCOME
- YOUNG AND LOW INCOME
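
Once profiles are chosen, they can be attached to each customer by mapping the cluster label to a name. The sketch below assumes that clusters 0, 1, and 2 of this run correspond to the three profiles in the order listed, which is consistent with the cluster means shown earlier. Since k-means numbering is arbitrary, the mapping in `profiles` (a hypothetical name) would need to be re-checked after every refit.

```python
import pandas as pd

# Assumed mapping from this run's cluster numbers to the profiles listed above.
profiles = {
    0: "AFFLUENT, EDUCATED AND OLD AGED",
    1: "MIDDLE AGED AND MIDDLE INCOME",
    2: "YOUNG AND LOW INCOME",
}

# Labels of the first five customers, taken from the output shown earlier.
demo = pd.DataFrame({"Customer Id": [1, 2, 3, 4, 5], "Clus_lbl": [1, 0, 2, 1, 0]})

# Look up the profile name for each customer's cluster label.
demo["Profile"] = demo["Clus_lbl"].map(profiles)
print(demo)
```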