# Customer Segmentation with K-Means

In this chapter we will use the pre-processed data to identify customer clusters based on their recency, frequency, and monetary value (RFM).

**Key Steps of a segmentation project**
- Data pre-processing
- Choosing a number of clusters
- Running k-means clustering on pre-processed data
- Analyzing average RFM values of each cluster

**Methods to define the number of clusters**
- Visual methods - elbow criterion
- Mathematical method - silhouette coefficient
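
The silhouette coefficient can be computed with `silhouette_score` from `sklearn.metrics`. The sketch below is only an illustration: it uses synthetic data from `make_blobs` as a stand-in for the pre-processed RFM table, and the range of `k` values tried is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the pre-processed data (assumption: 3 features)
X, _ = make_blobs(n_samples=500, centers=4, n_features=3,
                  cluster_std=0.5, random_state=1)

# The silhouette coefficient is undefined for a single cluster, so start at k = 2
silhouette = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    silhouette[k] = silhouette_score(X, labels)

# Scores lie between -1 and 1; higher means better-separated clusters
best_k = max(silhouette, key=silhouette.get)
print(silhouette)
print('Best k by silhouette:', best_k)
```

Unlike the elbow plot, this gives a single number per `k`, so the candidate with the highest score can be picked programmatically and then checked against business sense.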

**Running k-means**
- Import *KMeans* from *sklearn* library and initialize it
    ```python
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters = 2, random_state = 1)
    ```
- Compute *k-means* clustering on pre-processed data
    ```python
    kmeans.fit(datamart_normalized)
    ```
- Extract cluster labels from *labels_* attribute
    ```python
    cluster_labels = kmeans.labels_
    ```

**Analyzing average RFM values of each cluster**

- Create a cluster label column in the **original** DataFrame
    ```python
    datamart_rfm_k2 = datamart_rfm.assign(Cluster = cluster_labels)
    ```
- Calculate average RFM values and size for each cluster
    ```python
    datamart_rfm_k2.groupby(['Cluster']).agg({
        'Recency': 'mean',
        'Frequency': 'mean',
        'MonetaryValue': ['mean', 'count']
    }).round(0)
    ```

```python
import pandas as pd
from sklearn.cluster import KMeans

datamart_normalized = pd.read_csv('./datasets/chapter_4/datamart_normalized_df.csv')
datamart_rfm = pd.read_csv('./datasets/chapter_4/datamart_rfm.csv')

datamart_normalized.head()
```

|   | CustomerID | Recency | Frequency | MonetaryValue |
|---|---|---|---|---|
| 0 | 12747 | -2.002202 | 0.865157 | 1.46494 |
| 1 | 12748 | -2.814518 | 3.815272 | 2.994692 |
| 2 | 12749 | -1.78949 | 1.189117 | 1.347598 |
| 3 | 12820 | -1.78949 | 0.546468 | 0.500595 |
| 4 | 12822 | 0.337315 | 0.020925 | 0.037943 |


```python
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(datamart_normalized)
cluster_labels = kmeans.labels_
```

```python
datamart_rfm_k3 = datamart_rfm.assign(Cluster = cluster_labels)
grouped = datamart_rfm_k3.groupby(['Cluster'])
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
}).round(1)
```

| Cluster | Recency (mean) | Frequency (mean) | MonetaryValue (mean) | MonetaryValue (count) |
|---|---|---|---|---|
| 0 | 89.7 | 17.3 | 343.6 | 1231 |
| 1 | 92.7 | 19.5 | 378.3 | 1229 |
| 2 | 88.8 | 19.3 | 390.9 | 1183 |
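
The three clusters above have almost identical sizes and average RFM values. One likely reason is that `datamart_normalized`, as displayed by `head()` earlier, still carries `CustomerID` as a regular column, so `kmeans.fit` treats it as a feature and its large unscaled values dominate the distance calculation. A minimal sketch of one way to avoid this is to move `CustomerID` into the index before fitting. The frame `demo` below is made-up data for illustration, not the chapter's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data: two well-separated groups in scaled RFM space
rng = np.random.default_rng(1)
features = np.vstack([
    rng.normal(-1.0, 0.2, size=(50, 3)),
    rng.normal(1.0, 0.2, size=(50, 3)),
])
rng.shuffle(features)  # group membership is now unrelated to CustomerID order
demo = pd.DataFrame(features, columns=['Recency', 'Frequency', 'MonetaryValue'])
demo.insert(0, 'CustomerID', np.arange(12000, 12100))

# Fitting on every column lets the large CustomerID values drive the clusters
labels_with_id = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(demo)

# Moving CustomerID to the index keeps it out of the feature matrix
demo_indexed = demo.set_index('CustomerID')
labels_rfm_only = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(demo_indexed)

# Compare each labelling with the true group (sign of Recency in this toy data)
true_group = (demo_indexed['Recency'] > 0).to_numpy()
print(pd.crosstab(true_group, labels_with_id))
print(pd.crosstab(true_group, labels_rfm_only))
```

In this toy example the first cross-tabulation mixes both groups in each cluster, while the second recovers them. Whether the same applies to the chapter's data would need to be checked by refitting on the RFM columns only.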


## Choosing the number of clusters

**Elbow Criterion Method**
- Plot the number of clusters against the within-cluster sum of squared errors (SSE)
    - the sum of squared distances from every data point to its cluster center
- Identify the "elbow" in the plot
    - where the decrease in SSE slows down and becomes marginal
    - beyond this point, increasing the number of clusters brings diminishing returns
- "Elbow" - a point representing an "optimal" number of clusters from a sum-of-squared-errors perspective

```python
from sklearn.cluster import KMeans
import seaborn as sns
from matplotlib import pyplot as plt

#fit KMeans and calculate SSE for each *k*
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, random_state = 1)
    kmeans.fit(datamart_normalized)
    sse[k] = kmeans.inertia_ #sum of squared distances to closest cluster center

#Plot SSE for each *k*
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x = list(sse.keys()), y = list(sse.values()))
plt.show()
```
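
To make the SSE definition concrete, `inertia_` can be reproduced by hand: sum the squared Euclidean distance from each point to the center of its assigned cluster. The sketch below checks this on synthetic `make_blobs` data, which is a stand-in and not the chapter's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption: 3 features, 3 blobs)
X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Center assigned to each point, looked up via its cluster label
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Within-cluster sum of squared errors, computed manually
sse_manual = ((X - assigned_centers) ** 2).sum()

# The two values should agree up to floating-point error
print(sse_manual, kmeans.inertia_)
```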

**Analyze segments**
- Build clustering solutions at and around the elbow
- Analyze their properties - average RFM values
- Compare them against each other and choose the one that makes the most business sense
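
One way to carry out this comparison is to fit a model for each candidate `k` and print the per-cluster averages side by side. The sketch below assumes the elbow was found at `k = 3`, and it uses small randomly generated frames, `rfm_raw` and `rfm_scaled`, as hypothetical stand-ins for `datamart_rfm` and `datamart_normalized`. Standard scaling is assumed as the pre-processing step purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data (not the chapter's dataset)
rng = np.random.default_rng(1)
rfm_raw = pd.DataFrame({
    'Recency': rng.integers(1, 365, size=200),
    'Frequency': rng.integers(1, 50, size=200),
    'MonetaryValue': rng.gamma(2.0, 150.0, size=200),
})
# Assumed pre-processing: standard scaling of each column
rfm_scaled = pd.DataFrame(
    StandardScaler().fit_transform(rfm_raw),
    columns=rfm_raw.columns,
    index=rfm_raw.index,
)

summaries = {}
for k in (2, 3, 4):  # assumed elbow at k = 3, plus its neighbours
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(rfm_scaled)
    # Attach labels to the unscaled values so the averages stay interpretable
    summaries[k] = (
        rfm_raw.assign(Cluster=labels)
        .groupby('Cluster')
        .agg(
            Recency=('Recency', 'mean'),
            Frequency=('Frequency', 'mean'),
            MonetaryValue=('MonetaryValue', 'mean'),
            Size=('MonetaryValue', 'count'),
        )
        .round(1)
    )
    print(f'k = {k}')
    print(summaries[k], end='\n\n')
```

Named aggregation is used here so that the summary has flat column names, which makes the tables for different `k` easier to read next to each other.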

```python
datamart_normalized.head()
```

|   | CustomerID | Recency | Frequency | MonetaryValue |
|---|---|---|---|---|
| 0 | 12747 | -2.002202 | 0.865157 | 1.46494 |
| 1 | 12748 | -2.814518 | 3.815272 | 2.994692 |
| 2 | 12749 | -1.78949 | 1.189117 | 1.347598 |
| 3 | 12820 | -1.78949 | 0.546468 | 0.500595 |
| 4 | 12822 | 0.337315 | 0.020925 | 0.037943 |
