# Introduction

Hi there! In this notebook I perform EDA on a mall's customer dataset and then segment the customers using clustering techniques.
The dataset has no labels, so it's a good application of unsupervised learning methods.
I'm comparing K-Means, DBSCAN and Hierarchical Clustering.

We perform clustering to find how many segments can be created and how useful these segments will be
for purposes such as targeted advertising, traffic control, event planning and so on.

# Imports

In [74]:
import os
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.cluster import KMeans 
from sklearn.cluster import MiniBatchKMeans   
from sklearn.cluster import DBSCAN
from sklearn import metrics

# EDA

In [48]:
df_mall = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

Let's check the first couple of records:

In [49]:
df_mall.head(10) 

Let's get some basic statistics about the dataset:

In [50]:
df_mall.describe()

Checking for null values:

In [51]:
df_mall.isnull().sum()

Phew... No null values! Since there are no null or missing values in the dataset, we don't need to
impute or drop any records.

Based on the info() output, we can see that there are 200 customers in the dataset.

In [52]:
df_mall.info()

Let's take a look at the pair plot of customers according to their gender and see if there are any significant differences between male and female shoppers:

In [53]:
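df_mall_copy = df_mall.copy()

# Encode Male / Female as 1 / 0 on the copy so that Gender is available as a numeric column.
# Assigning the mapped column back avoids the chained `inplace` replace, which may not modify the frame.
df_mall_copy['Gender'] = df_mall_copy['Gender'].map({'Female': 0, 'Male': 1})

# The plot uses the original frame: `hue` accepts the string labels directly, which keeps the legend readable.
sns.pairplot(data = df_mall, hue = 'Gender', corner = True, palette = 'bright')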

From the above plot, we see no obvious visual difference based on Gender.

An interesting library I came across was the pandas_profiling library (newer releases are published as `ydata-profiling`).
The ProfileReport is super handy to obtain a detailed statistical report.
It also gives us great visualisations for the dataset.

In [54]:
ProfileReport(df_mall)

# K-Means

The K-Means clustering algorithm groups the dataset into 'k' number of clusters. 
The data belonging to each cluster have similar properties. 

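The 'k' value is usually chosen with two tools that work together:
- WCSS (within-cluster sum of squares, exposed by scikit-learn as `inertia_`), computed for a range of `k` values
- Elbow Method: plot WCSS against `k` and pick the point where the curve stops dropping sharply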
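To make the WCSS definition concrete, here's a minimal sketch that computes it by hand and compares it with `inertia_`. It runs on synthetic two-column data rather than the mall dataset, and the names `X`, `km_demo` and `wcss_manual` are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (income, spending score) features: two loose groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(20, 80), scale=3, size=(50, 2)),
    rng.normal(loc=(80, 20), scale=3, size=(50, 2)),
])

km_demo = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=30).fit(X)

# WCSS: sum of squared distances from each point to the centroid of its own cluster.
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
wcss_manual = float(np.sum((X - assigned_centroids) ** 2))

print(wcss_manual, km_demo.inertia_)
```

The two printed numbers should agree up to floating-point error, which is why `inertia_` can be used directly as the WCSS.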

**Feature selection:**

I'm only extracting two columns as features: `Annual Income` and the `Spending Score`

In [55]:
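# .copy() gives an independent frame, so adding a cluster column later does not trigger SettingWithCopyWarning.
feats = df_mall[['Annual Income (k$)','Spending Score (1-100)']].copy()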
feats

**Finding `k` using WCSS**

I fit the model 19 times, once for each `k` from 1 to 19, and record the WCSS to see which value fits best.
As mentioned in the previous step, the groups will be created based on the annual income and the spending score.
Each model uses scikit-learn's `KMeans` with `k-means++` initialisation and otherwise default settings.

In [56]:
wcss = []
for i in range(1, 20):   # k = 1 .. 19
    km = KMeans(n_clusters = i, init = 'k-means++', random_state = 30)
    km.fit(feats)
    wcss.append(km.inertia_)   # inertia_ is the WCSS for this k

In [57]:
wcss 

**Finding `k` using Elbow Graph method**

In [58]:
plt.plot(range(1, 20), wcss)  

The curve drops steeply until about x = 5 (y ≈ 50000) and flattens out afterwards; this bend is the elbow.

I'm choosing k = 5, i.e., 5 clusters will be created.
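Reading the elbow off a plot is a visual judgment, so a second opinion can help. The sketch below uses the silhouette score, which is a different technique from the elbow method used in this notebook, and it runs on synthetic blobs (via `make_blobs`) rather than the mall data. The names `scores` and `best_k` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Five well-separated synthetic groups, standing in for five customer segments.
centers = [(0, 0), (10, 10), (0, 10), (10, 0), (5, 5)]
X, _ = make_blobs(n_samples=250, centers=centers, cluster_std=0.5, random_state=30)

scores = {}
for k in range(2, 10):   # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=30).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is the candidate number of clusters.
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the peak is usually less clear-cut than on blobs like these, so I'd treat it as a cross-check on the elbow plot rather than a replacement.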

**Building & Fitting the Model**

In [59]:
km = KMeans(n_clusters = 5, init = 'k-means++', random_state = 30)
km.fit_predict(feats)  

We can see that all the rows are labelled with one of our 5 cluster categories, i.e., clusters 0 to 4.
Let's record all the cluster numbers as a column in the dataset:

In [60]:
feats['cluster_number'] = km.fit_predict(feats)
feats

Let's see the points that are inside cluster 4.

In [61]:
feats[feats['cluster_number'] == 4]

Similarly, for cluster 0:

In [62]:
feats[feats['cluster_number'] == 0]

We can clearly see that the data points in cluster 0 all have a high spending score.

Let's try predicting for an annual income of $87,000 (87 in the k$ units of the feature) and a spending score of 75.

In [63]:
km.predict([[87, 75]])   

The model assigns this data point to cluster 0, which is consistent with the cluster 0 rows in the table above.
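Under the hood, `predict` on a fitted K-Means model assigns a new point to the nearest centroid. Here's a small sketch of that idea on synthetic data (not the mall dataset); `km_demo`, `new_point` and `nearest` are names I made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups: low income / low spending and high income / high spending.
rng = np.random.default_rng(30)
X = np.vstack([
    rng.normal(loc=(20, 20), scale=4, size=(40, 2)),
    rng.normal(loc=(85, 80), scale=4, size=(40, 2)),
])
km_demo = KMeans(n_clusters=2, n_init=10, random_state=30).fit(X)

new_point = np.array([[87, 75]])

# Euclidean distance from the new point to every centroid; the smallest one wins.
distances = np.linalg.norm(km_demo.cluster_centers_ - new_point, axis=1)
nearest = int(np.argmin(distances))
predicted = int(km_demo.predict(new_point)[0])

print(nearest, predicted)
```

Both numbers are the same cluster index, since the manual nearest-centroid lookup is what `predict` computes.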

# K-Means + Minibatch

The mini-batch variant of K-Means updates the same `k` centroids using a small random batch of the dataset at each step instead of the full dataset, which trades a little cluster quality for speed.
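To show what the batches actually do, here's a rough sketch that feeds batches to `MiniBatchKMeans` by hand with `partial_fit`. The data is synthetic and the batch size of 50 is an arbitrary choice for the example; calling `fit` handles the batching internally.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Two synthetic groups, shuffled so that every batch contains points from both.
rng = np.random.default_rng(30)
X = np.vstack([
    rng.normal(loc=(20, 20), scale=3, size=(100, 2)),
    rng.normal(loc=(80, 80), scale=3, size=(100, 2)),
])
rng.shuffle(X)   # shuffles the rows in place

mbk = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=30)
for start in range(0, len(X), 50):
    # Each call refines the same two centroids with one more batch.
    mbk.partial_fit(X[start:start + 50])

batch_labels = mbk.predict(X)
print(mbk.cluster_centers_)
```

There is a single set of `n_clusters` centroids at the end, no matter how many batches were processed.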

**Building & Fitting the Model**

In [64]:
mini = MiniBatchKMeans(n_clusters = 5)  
mini.fit(feats[['Annual Income (k$)','Spending Score (1-100)']])

Let's try predicting with an annual income of $16,000 and a spending score of 77:

In [65]:
#model prediction 

mini.predict([[16,77]])   

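Our model assigns the data point to cluster 4. Note that MiniBatchKMeans numbers its clusters independently of the K-Means model above, and since no `random_state` is set, the number may change between runs.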

# DBSCAN

The DBSCAN algorithm is a density-based method: it finds clusters as dense regions of points and marks points in sparse regions as outliers.

Its advantage over K-Means is that the number of clusters doesn't have to be known beforehand.
So model training can be performed even without knowing the `k` value.

**Building & Fitting the Model**

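To create the DBSCAN model, we need to specify two important parameters:
- eps: the maximum distance between two points for them to count as neighbours,
- min_samples: the minimum number of points (including the point itself) within `eps` for a point to be a core point of a cluster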
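A tiny toy example may make the two parameters clearer. The five points below are made up: four sit on a unit square and one is far away.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four points on a unit square plus one isolated point (toy data).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)

# Every square corner has all four corners within 1.5, so each is a core point.
# The isolated point has only itself in its neighbourhood, so it becomes noise.
db = DBSCAN(eps=1.5, min_samples=3).fit(X)

print(db.labels_)                # cluster index per point, -1 for noise
print(db.core_sample_indices_)   # indices of the core points
```

The four corners end up in cluster 0 and the far point gets the label -1.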

In [66]:
dbscan = DBSCAN(eps = 0.7,min_samples = 3)   
dbscan.fit(feats[['Annual Income (k$)','Spending Score (1-100)']])

**Detecting the Outliers:**

In [67]:
dbscan.labels_  

A label of -1 marks an outlier (noise point). Since every data point is an outlier in our model, we increase the radius so that clusters can be formed.
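One common heuristic for picking `eps` is the k-distance curve: sort each point's distance to its k-th nearest neighbour and look for the bend. The sketch below is not what this notebook does (I simply tried a larger radius), and it uses random integer-valued data as a stand-in for the features. With integer-valued columns, distinct points are at least 1 apart, which would explain why a radius of 0.7 finds no neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random integer-valued stand-in for (income, spending score).
rng = np.random.default_rng(30)
X = rng.integers(low=1, high=100, size=(200, 2)).astype(float)

min_samples = 4   # assumed value, matching the DBSCAN setting being tuned

# Querying the training data returns each point as its own first neighbour,
# which matches DBSCAN's convention that min_samples includes the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to the furthest of those neighbours; a sharp bend suggests a candidate eps.
k_distances = np.sort(distances[:, -1])
print(k_distances[:5], k_distances[-5:])
```

Plotting `k_distances` with `plt.plot` gives the usual curve; a value near the bend is a reasonable starting point for `eps`.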

**Building & Fitting the Model**

In [68]:
dbscan = DBSCAN(eps = 5, min_samples = 4)
dbscan.fit(feats[['Annual Income (k$)','Spending Score (1-100)']])

**Detecting the Outliers in the updated model:**

In [69]:
dbscan.labels_  

There are fewer outliers in our new model, and the clusters are labelled 0 to 6, so there are 7 clusters.
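Rather than reading the counts off the printed array, they can be computed from the labels. A quick sketch on a made-up label array (with the real model, `labels` would be `dbscan.labels_`):

```python
import numpy as np

# Hypothetical DBSCAN output: three clusters (0, 1, 2) and two noise points (-1).
labels = np.array([0, 0, 1, -1, 2, 1, -1, 0])

n_noise = int(np.sum(labels == -1))
# -1 is the noise label, not a cluster, so it is excluded from the count.
n_clusters = len(set(labels.tolist()) - {-1})

print(n_clusters, n_noise)
```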

Let's add the cluster info to the dataset:

In [70]:
feats_new = feats[['Annual Income (k$)','Spending Score (1-100)']].copy()   # copy so the new column is added to an independent frame
feats_new['dbscan_cluster_name'] = dbscan.labels_   
feats_new

There is no ground truth here, so let's treat the K-Means `cluster_number` as the reference labels and `dbscan_cluster_name` as the predicted labels.

**Labels**

In [71]:
yt = feats['cluster_number']
yp = feats_new['dbscan_cluster_name']  

# Metrics

**Adjusted Rand Score**

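The adjusted Rand score is about 0.56. It measures the agreement between the two clusterings (1.0 means identical partitions, values near 0 mean chance-level agreement); it is not an accuracy.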

In [72]:
metrics.adjusted_rand_score(yt, yp)
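Cluster numbers are arbitrary, which is why a pair-counting score fits better here than plain accuracy. A small sketch with two made-up labelings that describe the same partition under different numbering:

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Same grouping of six points, but the cluster ids are permuted.
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]

ari = adjusted_rand_score(a, b)   # compares groupings, ignores the ids
acc = accuracy_score(a, b)        # compares ids position by position

print(ari, acc)
```

The adjusted Rand score is 1.0 because the partitions are identical, while accuracy is 0.0 because no id matches.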

**Homogeneity Score**

In [73]:
metrics.homogeneity_score(yt, yp)
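Homogeneity only checks that each predicted cluster contains members of a single reference class, so over-splitting is not penalised. Here's a toy sketch pairing it with `completeness_score`, which I'm adding for contrast (it isn't used in this notebook):

```python
from sklearn.metrics import completeness_score, homogeneity_score

reference = [0, 0, 1, 1]   # two reference classes
predicted = [0, 1, 2, 3]   # every point in its own cluster

# Each cluster holds one class only, so homogeneity is perfect ...
hom = homogeneity_score(reference, predicted)
# ... but each class is split across two clusters, so completeness drops.
com = completeness_score(reference, predicted)

print(hom, com)
```

Here homogeneity is 1.0 and completeness is 0.5, so looking at both gives a fuller picture when one model produces more clusters than the other.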