
# Case Study - Customer Segmentation

In today’s competitive market, it is crucial to understand customer behavior and to categorize customers by their demographics and buying behavior. Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means of identifying unmet customer needs. With that insight, companies can outperform the competition by developing uniquely appealing products and services.

The most common ways in which businesses segment their customer base are:
1.   Demographic
2.   Geographic
3.   Psychographic
4.   Behavioral

**The Challenge**

You own a supermarket mall and, through membership cards, you have some basic data about your customers: customer ID, age, gender, annual income, and spending score. You want to understand who your target customers are so that the marketing team can use these insights to plan its strategy accordingly.

The Mall Customer Segmentation Data dataset can be downloaded from the Kaggle website.



**Install and Import Libraries Needed**

Install Library

In [0]:
# Install Category Encoders
! pip install category_encoders

Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd
import numpy as np

# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

**Import Data**

In [0]:
# Import Dataset
df_customer = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/mall_customer.csv', sep=';')
df_customer

In [0]:
# Prints the Dataset Information
df_customer.info()

In [0]:
# Prints Descriptive Statistics
df_customer.describe().transpose()

**Explore the Dataset**

Visualize Data using Pairplot

In [0]:
# Set Graph Size
plt.rcParams['figure.figsize'] = (16, 8)

# Visualize Pair Plot with Colors
sns.pairplot(df_customer, hue='gender')

Visualize Data using Scatterplot

In [0]:
# Draw Scatter Plot (one panel per gender)
g = sns.relplot(x='ann_income_kUSD', y='spending_score', hue='gender', size='age', kind='scatter', col='gender', data=df_customer)
# Label every panel and title the whole figure rather than only the last panel
g.set_axis_labels('Annual Income k USD', 'Spending Score')
plt.suptitle('Customer Behavior', y=1.02)


Visualize Correlation between Features

In [0]:
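# Draw Correlation (numeric columns only, so any text column such as gender is skipped)
sns.clustermap(df_customer.select_dtypes(include='number').corr(), center=0, cmap='vlag', linewidths=.75)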

**Data Preprocessing**

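First, we standardize the data so that all features share a comparable scale. Standardization subtracts each feature's mean and divides by its standard deviation, giving every feature zero mean and unit variance. This matters for K-means because it relies on Euclidean distances, which would otherwise be dominated by the features with the largest ranges.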

In [0]:
# Import the StandardScaler Class
from sklearn.preprocessing import StandardScaler

# Set Name for StandardScaler as scaler
scaler = StandardScaler()

# Select Data (copy so the assignment below does not write into a view of df_customer)
column_names = ['age', 'ann_income_kUSD', 'spending_score']
df_standardized = df_customer[column_names].copy()

# Fit Standardization
df_standardized[column_names] = scaler.fit_transform(df_standardized[column_names])
df_standardized
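
As a sanity check on the standardization step, the sketch below computes the z-score by hand and compares it with `StandardScaler`. The array `x` holds made-up values standing in for a single feature column; it is not taken from the mall dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values standing in for one feature column (made up for illustration)
x = np.array([[15.0], [40.0], [65.0], [90.0], [140.0]])

# Manual z-score: subtract the column mean, divide by the population standard deviation
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler applies the same transformation
z_scaler = StandardScaler().fit_transform(x)

# The two results should agree up to floating-point error
print(np.allclose(z_manual, z_scaler))
print(z_manual.mean(axis=0), z_manual.std(axis=0))
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.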

**Clustering with K-Means**

***Search for the Optimum Number of Clusters (k)***

In [0]:
# Transform Data Frame to Numpy Array
customer = df_standardized.to_numpy()
customer

In [0]:
# Elbow Method
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(customer)
    wcss.append(kmeans.inertia_)
  
# Visualize 
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.xticks(np.arange(1,11,1))
plt.show()
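
The quantity plotted on the elbow curve, `inertia_`, is the within-cluster sum of squares (WCSS): the squared distance from every point to its assigned centroid, summed over all points. A minimal sketch of that computation follows, using synthetic data from `make_blobs` as a stand-in for the customer array (the data and the choice of 4 clusters are assumptions for illustration only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the mall dataset)
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = kmeans_demo.cluster_centers_[kmeans_demo.labels_]
wcss_manual = np.sum((X - assigned_centroids) ** 2)

# Both numbers should match up to floating-point error
print(wcss_manual, kmeans_demo.inertia_)
```

Because WCSS always decreases as more clusters are added, the elbow method looks for the point where the decrease flattens out rather than for a minimum.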

In [0]:
# Silhouette Method
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

for n_cluster in range(2, 11):
    # Same settings as the elbow search, so the results are reproducible and comparable
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++', max_iter=300, n_init=10, random_state=0).fit(customer)
    label = kmeans.labels_
    sil_coeff = silhouette_score(customer, label, metric='euclidean')
    print('For n_clusters={}, The Silhouette Coefficient is {}'.format(n_cluster, sil_coeff))

Based on the elbow plot and the silhouette coefficients, the optimal value of k is taken to be 6.
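
Rather than reading the best k off the printed list, the silhouette scores can be stored and the maximum picked programmatically. The sketch below does this on synthetic `make_blobs` data, which is an assumption for illustration and not the mall dataset, so the k it selects says nothing about the value chosen in this case study.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (not the mall dataset)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Silhouette coefficient for each candidate number of clusters
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette coefficient
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

In practice the silhouette maximum is usually weighed against the elbow plot and business considerations instead of being applied mechanically.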

***Modeling K-Means Clustering***

In [0]:
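# Apply the K-Means Model to the Data
kmeans = KMeans(n_clusters=6, init='k-means++', max_iter=300, n_init=10, random_state=0)
# Fit on the same three standardized features that were used in the search for k
features = ['age', 'ann_income_kUSD', 'spending_score']
clusters = kmeans.fit_predict(df_standardized[features])
df_standardized['label'] = clusters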

# Show Clusters
df_standardized

We visualize the clusters in a 2D graph of age against annual income.

In [0]:
# Visualising Clusters (standardized age vs. standardized annual income)
for cluster_id in range(6):
    members = df_standardized[df_standardized.label == cluster_id]
    plt.scatter(members['age'], members['ann_income_kUSD'], s=50, label='Cluster {}'.format(cluster_id + 1))
# Centroid columns 0 and 1 are age and income when the model is fit on all three features in that order
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],s=200,marker='s', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Age (standardized)')
plt.ylabel('Annual income (standardized)')
plt.legend()
plt.show()

Finally, we make a 3D plot to visualize the customers' age, annual income, and spending score together. The data points are separated into 6 clusters, each shown in a different colour.

In [0]:
# Import Module
from mpl_toolkits.mplot3d import Axes3D

# Visualize Clusters, one colour per cluster label
colors = ['blue', 'red', 'green', 'orange', 'purple', 'yellow']
ax = plt.axes(projection='3d')
for cluster_id, color in enumerate(colors):
    members = df_standardized[df_standardized.label == cluster_id]
    ax.scatter(members['age'], members['ann_income_kUSD'], members['spending_score'], c=color, s=60)
ax.view_init(30, 185)
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')
plt.show()

***Save Prediction Result***

In [0]:
# Add Cluster Information to the Raw Data
df_customer['cluster'] = clusters
df_customer

In [0]:
# Save Prediction Result
df_customer.to_csv('customer_clusters.csv', index=False)
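
Once a `cluster` column is attached to the raw data, a per-cluster summary is one way to turn the labels into segment descriptions for the marketing team. The sketch below uses a small made-up DataFrame, `df_demo`, with the same column names as the case study; the values and cluster assignments are hypothetical.

```python
import pandas as pd

# Toy rows with the same column names as the case study (values are made up)
df_demo = pd.DataFrame({
    'age': [23, 25, 47, 52, 31, 33],
    'ann_income_kUSD': [20, 22, 60, 65, 90, 95],
    'spending_score': [80, 75, 45, 50, 85, 90],
    'cluster': [0, 0, 1, 1, 2, 2],
})

# Mean of each feature per cluster, plus the number of customers in it
profile = (
    df_demo.groupby('cluster')[['age', 'ann_income_kUSD', 'spending_score']]
    .mean()
    .assign(n_customers=df_demo.groupby('cluster').size())
)
print(profile)
```

Profiling on the raw (unstandardized) columns keeps the summary in units that are easy to interpret, such as years and thousands of USD.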

K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply to a clustering task to get an idea of the structure of the dataset. The goal of K-means is to group data points into distinct, non-overlapping subgroups. One of its major applications is customer segmentation, which gives a company a better understanding of its customers and can in turn be used to increase revenue.
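
To make the grouping idea concrete, here is a minimal NumPy sketch of the standard K-means iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function `kmeans_sketch` and the toy data are hypothetical and for illustration only; scikit-learn's `KMeans`, used throughout this case study, adds k-means++ initialization, multiple restarts, and other refinements.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-means loop; not a replacement for sklearn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up points)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

Each point ends up with exactly one label, which is what makes the resulting subgroups non-overlapping.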