## Customer Segmentation
This notebook uses a publicly available dataset from a marketing campaign to explore customer segmentation through unsupervised learning methods. Customer segmentation is the practice of separating customers into groups so that the customers within each group share similar characteristics.


### Concept
Customer segmentation is the problem of uncovering information about a firm's customer base, based on their interactions with the business. In most cases this interaction is in terms of their purchase behavior and patterns. 

Put differently, customer segmentation is the process of dividing an organization’s customer base into different sections or segments based on various customer attributes. 

The process of customer segmentation is based on the premise of finding differences among the customers’ behavior and patterns.

The major objectives and benefits behind the motivation for customer segmentation are:
 - **Higher Revenue**: This is the most obvious requirement of any customer segmentation project.
 - **Customer Understanding**: One of the most widely accepted business paradigms is “know your customer”, and a segmentation of the customer base allows for a detailed dissection of this paradigm.
 - **Target Marketing**: The most visible reason for customer segmentation is the ability to focus marketing efforts effectively and efficiently. If a firm knows the different segments of its customer base, it can devise better marketing campaigns which are tailor made for the segment. A good segmentation model allows for better understanding of customer requirements and hence increases the chances of the success of any marketing campaign developed by the organization.
 - **Optimal Product Placement**: A good customer segmentation strategy can also help the firm with developing or offering new products, or a bundle of products together as a combined offering.
 - **Finding Latent Customer Segments**: Segmentation can reveal which groups of customers the firm is currently missing, so that untapped segments can be reached through focused marketing campaigns or new business development.

### Clustering

The most obvious method to perform customer segmentation is using unsupervised Machine Learning methods like clustering. The method is as simple as collecting as much data about the customers as possible in the form of features or attributes and then finding out the different clusters that can be obtained from that data. Finally, we can find traits of customer segments by analyzing the characteristics of the clusters.

Parts of this notebook have been inspired by:

 - https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/notebook
 - https://www.kaggle.com/code/paulinan/bank-customer-segmentation
 - https://www.kaggle.com/code/mgmarques/customer-segmentation-and-market-basket-analysis
 - https://www.kaggle.com/code/kushal1996/customer-segmentation-k-means-analysis/notebook
 - https://www.kaggle.com/code/vjchoudhary7/kmeans-clustering-in-customer-segmentation
 - https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/

## Approach

- explore data
- cleanup data
- select & engineer features
- handle outliers
- train model
- evaluate model
- back to explore data

### Load dependencies

In [None]:
import warnings
warnings.simplefilter(action = 'ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
def ignore_warn(*args, **kwargs):
    pass

warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

In [None]:
import os

import pandas as pd
from datetime import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

import seaborn as sns

from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_rows', 500)
pd.options.display.max_columns = 100

## Dataset

Source: 
- https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/data
- https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/notebook
- https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/

In [None]:
data = pd.read_csv("marketing_campaign.csv", sep="\t")
print("Dataset shape (rows, columns):", data.shape)
data.head()

In [None]:
%%html
<style>
table {float:left}
</style>

## Exploratory Data Analysis (EDA)

 
| Column name |    Details | 
| ------------------- | ----------- |
| ID                  | Customer’s unique identifier |
| Year_Birth          | Customer's birth year |
| Education           | Education Qualification of customer |
| Marital_Status      | Marital Status of customer |
| Income              | Customer's yearly household income |
| Kidhome             | Number of children in customer's household |
| Teenhome            | Number of teenagers in customer's household |
| Dt_Customer         | Date of customer's enrollment with the company |
| Recency             | Number of days since customer's last purchase |
| MntWines            | Amount spent on wine |
| MntFruits           | Amount spent on fruits |
| MntMeatProducts     | Amount spent on meat |
| MntFishProducts     | Amount spent on fish |
| MntSweetProducts    | Amount spent on sweet products |
| MntGoldProds        | Amount spent on gold products |
| NumDealsPurchases   | Number of purchases made with a discount |
| NumWebPurchases     | Number of web purchases |
| NumCatalogPurchases | Number of catalog purchases |
| NumStorePurchases   | Number of store purchases |
| NumWebVisitsMonth   | Number of web site visits per month |
| AcceptedCmp3        | Accepted marketing campaign 3 |
| AcceptedCmp4        | Accepted marketing campaign 4 |
| AcceptedCmp5        | Accepted marketing campaign 5 |
| AcceptedCmp1        | Accepted marketing campaign 1 |
| AcceptedCmp2        | Accepted marketing campaign 2 |
| Complain            | Customer complained |
| Z_CostContact       |  |
| Z_Revenue           |  |
| Response            |  |


In [None]:
data.info()

The function below was created to simplify the analysis of the general characteristics of the data. Inspired by the `str` function of R, it returns the type, count, number of distinct values, number of nulls, missing ratio, unique values, skewness, kurtosis, min and max of each field/feature.

In [None]:
def display_details(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize every column: type, counts, nulls, distinct values, range and shape statistics."""
    obs = df.shape[0]
    types = df.dtypes

    counts = df.count()
    # Collect the unique values column by column so the result is always one entry per column
    uniques = pd.Series({col: df[col].unique() for col in df.columns})
    nulls = df.isnull().sum()
    # Avoid shadowing the built-in min() and max()
    min_values = df.min()
    max_values = df.max()

    distincts = df.nunique()
    missing_ratio = (nulls / obs) * 100
    # Skewness and kurtosis are only defined for numeric columns
    skewness = df.skew(skipna=True, numeric_only=True)
    kurtosis = df.kurt(skipna=True, numeric_only=True)
    print('Data shape:', df.shape)

    cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis', 'min', 'max']
    df_res = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis, min_values, max_values], axis=1, sort=True)
    df_res.columns = cols
    print('___________________________\nData types:\n', df_res.types.value_counts())
    print('___________________________')
    return df_res

details = display_details(data)
display(details)

### Observations

 - `Income` field has missing values
 - `Marital_Status` and `Education` fields are identified as string columns, but we need to transform them to numeric values in order to use them
 - `Dt_Customer` is identified as a string column instead of a date. We need to transform it to a date format and then to a numeric value
 
**Note**: string fields are specified in pandas dataframe as having type `object`

### Handling missing values

The process of replacing missing data with substituted values is called Imputation. 
See more on https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

Options:
 1. drop rows having missing data - `data = data.dropna()`
 1. replace missing Income values with a fixed value - scikit-learn `SimpleImputer(strategy='constant')`
 1. replace missing Income values with the mean or median of our data - scikit-learn `SimpleImputer(strategy='mean')` or `SimpleImputer(strategy='median')`
 1. replace missing Income values with values predicted by a more complex model such as an autoencoder - see https://curiousily.com/posts/data-imputation-using-autoencoders/
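
The first three options can be compared on a toy column before choosing one. The sketch below is only an illustration: the income values are made up, and the autoencoder option is left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy income column with two missing entries and one extreme value (made-up numbers)
toy = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 70000.0, np.nan, 1000000.0]})

# Option 1: drop the rows that contain missing values
dropped = toy.dropna()

# Option 2: fill with a fixed value
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(toy)

# Option 3: fill with a statistic of the observed values
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

print("rows kept by dropna:", len(dropped))
print("mean fill:", mean_filled[1, 0], "median fill:", median_filled[1, 0])
```

The extreme value pulls the mean far above the typical income, while the median stays close to it, which is one reason to prefer the median when a column contains outliers.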

In [None]:
from sklearn.impute import SimpleImputer

# use median for imputation
incomeImputer = SimpleImputer(strategy='median').fit(data[['Income']])
print("Values used to fill missing values: ", incomeImputer.statistics_)

data[['Income']] = incomeImputer.transform(data[['Income']])
# Let's print the info about our dataset again
# Notice the 'nulls' column for the Income field
#details = display_details(data)
#display(details)

## Feature Engineering

- Create a feature (`Customer_For`) holding the number of days each customer has been enrolled, measured relative to the most recent enrollment date in the data
- Encode `Marital_Status` and `Education` into numeric values
- Extract `Age` information of the customer from `Year_Birth`
- Extract total spent in the last two years
- Extract the total number of purchases in last two years
- Extract the average spent per purchase
- Derive `Family_Size` based on `Kidhome` + `Teenhome` and `Marital_Status`
- Combine the `AcceptedCmpXXX` fields into a single count field (`Campaigns`)

In [None]:
# Treat ID field as string since this represents the customer ID not a numeric value 
data = data.astype({"ID": str}) 

In [None]:
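# Keep the column as datetime64 so that the .dt accessor works in the next cell.
# dayfirst=True assumes the file stores dates as dd-mm-yyyy; adjust if your copy differs.
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)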

In [None]:
newest = data["Dt_Customer"].max()
data["Customer_For"] = (newest - data['Dt_Customer']).dt.days
data[['Dt_Customer', "Customer_For"]].head()

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder().fit(data[['Marital_Status', 'Education']])
display(encoder.categories_)

data[['Marital_Status', 'Education']] = encoder.transform(data[['Marital_Status', 'Education']])
data.head()

In [None]:
# Age is computed relative to a fixed reference year (2021)
data['Age'] = 2021 - data['Year_Birth']
data["Spent"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]
data["Purchases"] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases']+data['NumStorePurchases']
# Customers with zero purchases get a non-finite average here; they are removed in the outlier step
data["AvgSpentPPurchase"] = data["Spent"]/data["Purchases"]
data["Campaigns"]=data["AcceptedCmp5"] + data["AcceptedCmp4"] + data["AcceptedCmp3"] + data["AcceptedCmp2"] + data["AcceptedCmp1"]
data["Children"]=data["Kidhome"] + data["Teenhome"]
data["Is_Parent"] = np.where(data.Children> 0, 1, 0)

# Marital_Status was ordinal-encoded in the previous cell, so recover the original labels from the encoder
marital_labels = encoder.categories_[0][data["Marital_Status"].astype(int).to_numpy()]
# 1 = lives with a partner ("Married" or "Together"), 0 = lives alone (every other status);
# kept numeric so that the later scaling and correlation cells accept it
data["Living_With"] = np.isin(marital_labels, ["Married", "Together"]).astype(int)
data["Family_Size"] = 1 + data["Living_With"] + data["Children"]

display(data.head())
display(data.shape)

### Visualize the data

In [None]:
data.describe()

In [None]:
features = [ "Income", "Recency", "Customer_For", "Age", "Spent", "Family_Size", "Purchases"]

plt.figure(figsize=(15, 25))
g = sns.pairplot(data[features], hue= "Family_Size", corner=True, palette = sns.color_palette("crest", 8))
plt.show()

### Handle Outliers

Notice some outliers in the `Income` and `Age` fields.
Let's remove these values, along with customers that have no recorded purchases, and replot the fields.

In [None]:
filtered_data = data[data["Age"] < 90].copy()
filtered_data = filtered_data[filtered_data["Income"] < 600000]
filtered_data = filtered_data[(filtered_data["Purchases"]!=0)]

print("Number of records removed:", len(data) - len(filtered_data))

#g = sns.pairplot(filtered_data[features], hue= "Family_Size", corner=True)
#plt.show()

### Visualize the data

In [None]:
fig = plt.figure(figsize=(25, 7))
f1 = fig.add_subplot(121)

ds_grouped = filtered_data.groupby(["Family_Size"]).Purchases.sum().sort_values(ascending = False)
ds_grouped.plot(kind='bar', title='Total number of purchases based on family size')

plt.show()

In [None]:
fig = plt.figure(figsize=(25, 7))
gb_ID_data = filtered_data.groupby(["ID"])
PercentSales = np.round((gb_ID_data.Spent.sum().sort_values(ascending = False)[:30].sum()/gb_ID_data.Spent.sum().sort_values(ascending = False).sum()) * 100, 2)

gb_ID_data.Spent.sum().sort_values(ascending = False)[:30].plot(kind='bar', title='Top 30 Customers: {:3.2f}% Sales Amount'.format(PercentSales))

############################################
fig = plt.figure(figsize=(25, 7))

f1 = fig.add_subplot(121)
PercentSales =  np.round((gb_ID_data.Spent.sum().sort_values(ascending = False)[:10].sum()/gb_ID_data.Spent.sum().sort_values(ascending = False).sum()) * 100, 2)
gb_ID_data.Spent.sum().sort_values(ascending = False)[:10].plot(kind='bar', title='Top 10 Customers: {:3.2f}% Sales Amount'.format(PercentSales))

f1 = fig.add_subplot(122)
PercentSales =  np.round((gb_ID_data.Purchases.sum().sort_values(ascending = False)[:10].sum()/gb_ID_data.Purchases.sum().sort_values(ascending = False).sum()) * 100, 2)
gb_ID_data.Purchases.sum().sort_values(ascending = False)[:10].plot(kind='bar', title='Top 10 Customers: {:3.2f}% of Purchases'.format(PercentSales))


### Feature Selection

Create a dataset with the fields that are interesting for our experiment.

In [None]:
keep_fields = [
 'ID',
 'Income',
 'Recency',
 'Customer_For',
 'Marital_Status',
 'Education',
 'Age',
 'Spent',
 'Purchases',
 'AvgSpentPPurchase',
 'Family_Size',
 'Campaigns',
 'Kidhome',
 'Teenhome',
 'Children',
 'Is_Parent',
 'Living_With'
]
 
df_base = filtered_data[keep_fields].copy()
df_base.head(10)

#### Correlation matrix

In [None]:
# Correlation matrix (ID is an identifier, not a feature, so it is excluded)
corrmat = df_base.drop(columns=["ID"]).corr(numeric_only=True)
plt.figure(figsize=(20,20))
cmap = sns.color_palette("ch:start=.2,rot=-.3", as_cmap=True)
sns.heatmap(corrmat, annot=True, cmap=cmap, center=0)

### RFM Model for Customer Value

The RFM (Recency, Frequency and Monetary value) model takes the transactions of a customer and derives three informative attributes for each customer:

 - Recency: how recently the customer made a purchase
 - Frequency: how frequently the customer makes purchases
 - Monetary value: the total amount the customer has spent across all purchases
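
In this dataset the three attributes are already aggregated per customer (`Recency`, `Purchases` and `Spent`). For a raw transaction log they would have to be derived first. The sketch below shows one way to do that with pandas; the `transactions` table, its column names and its values are hypothetical and not part of the dataset.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase (made-up values)
transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "date": pd.to_datetime(["2021-01-05", "2021-03-01", "2021-02-10",
                            "2021-01-20", "2021-02-15", "2021-03-10"]),
    "amount": [20.0, 35.0, 100.0, 10.0, 15.0, 5.0],
})

# Measure recency relative to the day after the last recorded purchase
reference_date = transactions["date"].max() + pd.Timedelta(days=1)

rfm = transactions.groupby("customer_id").agg(
    recency=("date", lambda d: (reference_date - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                                  # number of purchases
    monetary=("amount", "sum"),                                   # total amount spent
)
print(rfm)
```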

We will plot the distribution and the QQ-plot of each attribute, starting with Recency, to identify substantive departures from normality such as outliers, skewness and kurtosis.

In [None]:
from scipy.stats import norm, probplot

def QQ_plot(data, measure):
    fig = plt.figure(figsize=(20,7))

    # Fit a normal distribution to get the mean and standard deviation of the data
    (mu, sigma) = norm.fit(data)

    # Histogram with kernel density estimate, overlaid with the fitted normal curve
    # (sns.distplot is deprecated, so histplot is used instead)
    fig1 = fig.add_subplot(121)
    sns.histplot(data, stat='density', kde=True, ax=fig1)
    xs = np.linspace(data.min(), data.max(), 200)
    fig1.plot(xs, norm.pdf(xs, mu, sigma), color='black')
    fig1.set_title(measure + ' Distribution ( mu = {:.2f} and sigma = {:.2f} )'.format(mu, sigma), loc='center')
    fig1.set_xlabel(measure)
    fig1.set_ylabel('Density')

    # QQ plot against the normal distribution
    fig2 = fig.add_subplot(122)
    res = probplot(data, plot=fig2)
    fig2.set_title(measure + ' Probability Plot (skewness: {:.6f} and kurtosis: {:.6f} )'.format(data.skew(), data.kurt()), loc='center')

    plt.tight_layout()
    plt.show()


In [None]:
display(df_base[['ID','Recency']].reset_index().describe().transpose())
QQ_plot(df_base.Recency, 'Recency')

In [None]:
QQ_plot(df_base.Purchases, 'Frequency')

In [None]:
QQ_plot(df_base.Spent, 'Amount')

#### Observations
From the first graph above we can see that the sales recency distribution is not **skewed** and has no long tail.

From the probability plot, we can see that sales recency only partially aligns with the diagonal red line, which represents a normal distribution.

With a **low negative skewness** of -0.004299, we confirm the symmetry of our sales recency. The skewness of a normal distribution is zero, and any symmetric data should have a skewness near zero.
A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

**Kurtosis** is a measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.
Data sets with high kurtosis tend to have heavy tails, or outliers. Positive kurtosis indicates a heavy-tailed distribution and negative kurtosis indicates a light-tailed distribution.
So, with a **negative kurtosis of about -1.2**, sales recency is light-tailed and shows no sign of outliers.
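
To get a feel for these two statistics, it can help to compute them on synthetic samples whose shape is known. The sketch below uses made-up random data, not the campaign dataset. pandas reports excess kurtosis, so a normal sample should land near 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "uniform": rng.uniform(0, 100, 100_000),      # flat and light-tailed
    "normal": rng.normal(50, 10, 100_000),        # reference shape
    "exponential": rng.exponential(10, 100_000),  # right-skewed and heavy-tailed
})

# skew() and kurt() are computed column by column
summary = pd.DataFrame({"skewness": samples.skew(), "kurtosis": samples.kurt()})
print(summary.round(2))
```

A flat, uniform-like sample has a skewness near 0 and a kurtosis near -1.2, which matches the values observed for `Recency`.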


### Data Scaling

We will be using the K-means clustering algorithm. Because K-means relies on distances between points, it works best when the variables are standardized: we replace each value with a standardized value so that every variable has a mean of 0 and a variance of 1. This ensures that all the variables are on the same scale, so that a variable with a large range of values does not dominate the others. This is a form of feature scaling.

Another problem worth investigating is the range of values each variable can take. This problem is particularly noticeable for the monetary amount variable. To take care of it, we transform the skewed variables (`Spent` and `Purchases`) to the log scale. This transformation, along with the standardization, ensures that the input to our algorithm is a homogeneous set of scaled and transformed values.

An important point about the data preprocessing step is that we need it to be reversible. In our case, the clustering results will be expressed in terms of the log-transformed and scaled variables. To make inferences in terms of the original data, we need to reverse the transformation so that we get back the actual RFM figures. This can be done with the scaler's `inverse_transform` method followed by the exponential function.
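
A minimal sketch of the round trip, assuming a log transform on `Spent` and `Purchases` followed by `StandardScaler`. The values are made up, and the log assumes both columns are strictly positive.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RFM-like values; Spent and Purchases must be strictly positive for the log
raw = pd.DataFrame({
    "Spent": [20.0, 150.0, 900.0, 2500.0],
    "Recency": [5.0, 40.0, 70.0, 95.0],
    "Purchases": [2.0, 8.0, 15.0, 30.0],
})
log_cols = ["Spent", "Purchases"]

# Forward: log-transform the skewed columns, then standardize every column
transformed = raw.copy()
transformed[log_cols] = np.log(transformed[log_cols])
scaler = StandardScaler().fit(transformed)
scaled = scaler.transform(transformed)

# Backward: undo the standardization first, then undo the log
restored = pd.DataFrame(scaler.inverse_transform(scaled), columns=raw.columns)
restored[log_cols] = np.exp(restored[log_cols])

print("round trip recovers the original values:", np.allclose(restored, raw))
```

The same two backward steps can be applied to cluster centers found in the scaled space to express them in the original units.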

In [None]:
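# Scale every numeric feature; ID is an identifier, not a feature, so it is left out
features_to_scale = df_base.drop(columns=["ID"]).select_dtypes(include="number")
scaler = StandardScaler()
scaler.fit(features_to_scale)
df_scaled = pd.DataFrame(scaler.transform(features_to_scale), columns=features_to_scale.columns)
print("All features are now scaled")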

In [None]:
df_subset = df_base[['Spent', 'Recency', 'Purchases']].copy()

# Log-transform the skewed variables. Purchases is non-zero after the outlier filtering;
# Spent is assumed to be strictly positive.
df_subset['Purchases'] = np.log(df_subset['Purchases'])
df_subset['Spent'] = np.log(df_subset['Spent'])
# Standardize to zero mean and unit variance
scaler = StandardScaler().fit(df_subset)
arr_subset_scaled = scaler.transform(df_subset)
df_subset = pd.DataFrame(arr_subset_scaled, columns=df_subset.columns)

display(df_subset.describe().T)

In [None]:
fig = plt.figure(figsize=(20,10))

ax1 = fig.add_subplot(121); sns.regplot(x='Recency', y='Spent', data=df_subset)
ax1.title.set_text('Recency Log')

ax2 = fig.add_subplot(122); sns.regplot(x='Recency', y='Spent', data=df_base)
ax2.title.set_text('Recency Base')

fig = plt.figure(figsize=(20,10))
ax3 = fig.add_subplot(121); sns.regplot(x='Purchases', y='Spent', data=df_subset)
ax3.title.set_text('Frequency Log')

ax4 = fig.add_subplot(122); sns.regplot(x='Purchases', y='Spent', data=df_base)
ax4.title.set_text('Frequency Raw')

In [None]:
#!pip install PyQt5
#%matplotlib qt
%matplotlib inline

In [None]:
#3D view 
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(111, projection='3d')

xs = df_subset.Recency
ys = df_subset.Purchases
zs = df_subset.Spent
ax.scatter(xs, ys, zs, s=5)

ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')

plt.show()


### Clustering for Segments
#### K-Means Clustering

K-means belongs to the partition-based (centroid-based) family of hard clustering algorithms, in which each sample in a dataset is assigned to exactly one cluster.

Using the Euclidean distance as the measure of similarity, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia. So, the objective of K-means clustering is to minimize the total intra-cluster variance, or the squared error function:

$$ J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 $$

where:
 - $k$ is the number of clusters
 - $n$ is the number of cases
 - $x_i^{(j)}$ is case $i$ assigned to cluster $j$
 - $c_j$ is the centroid of cluster $j$


The steps of the K-means algorithm for partitioning the data are as follows:

 1. The algorithm starts with random point initializations of the required number of centers. The “K” in K-means stands for the number of clusters.
 1. Each data point is assigned to the center closest to it. The distance metric used in K-means clustering is the ordinary Euclidean distance.
 1. Once the data points are assigned, the centers are recalculated by averaging the coordinates of the points belonging to each cluster.
 1. The process is repeated with the new centers until the assignments become stable, at which point the algorithm terminates.
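
The steps above can be sketched in a few lines of NumPy. This is a teaching sketch on synthetic data, not the scikit-learn implementation used in this notebook: the function name `kmeans`, its parameters and the two blobs are assumptions made for illustration, and initial centers are drawn from the data points rather than with k-means++.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    # Objective J: within-cluster sum of squared distances
    inertia = ((points - centers[labels]) ** 2).sum()
    return labels, centers, inertia

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centers, inertia = kmeans(points, k=2)
print("centers:\n", centers.round(1))
print("inertia:", round(float(inertia), 2))
```

On data this cleanly separated the loop settles after a handful of iterations. On real data the result depends on the initial centers, which is why scikit-learn's `KMeans` offers `init='k-means++'` and repeated runs through `n_init`.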


#### The Elbow Method

We use the elbow method to find the optimal number of clusters. The idea behind the elbow method is to identify the value of k beyond which the distortion stops decreasing rapidly. As k increases, the distortion tends to decrease, because the samples will be closer to the centroids they are assigned to, so we look for the point where adding another cluster brings only a small improvement.

This method looks at the percentage of variance explained as a function of the number of clusters. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance, a quantity related to the statistic used in an F-test. A slight variation of this method plots the curvature of the within-group variance.
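
The elbow can also be inspected numerically by printing the inertia for each k. The sketch below uses synthetic data from `make_blobs` with three groups, chosen only for illustration, so the marginal gain is expected to fall off sharply after k=3.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [5, 10]],
                  cluster_std=1.0, random_state=0)

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

# Share of the remaining distortion removed by going from k-1 to k clusters
for k in range(2, 8):
    drop = 1 - inertias[k] / inertias[k - 1]
    print(f"k={k}: inertia={inertias[k]:.1f}, relative drop={drop:.0%}")
```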

In [None]:
clusters = {}
for k in range (1, 10):
    # Create a kmeans model on our data, using k clusters.  
    # random_state helps ensure that the algorithm returns the same results each time.
    model = KMeans(
        n_clusters=k, 
        init='k-means++',
        n_init=10,
        max_iter=300,
        tol=1e-04,
        random_state=101)

    clusters[k] = model.fit_predict(df_subset)

In [None]:
# Quick examination of elbow method to find numbers of clusters to make.
from yellowbrick.cluster import KElbowVisualizer

kmeans = KMeans(init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=101)

elbow_m = KElbowVisualizer(kmeans, k=10, size=(1000, 500))
elbow_m.fit(df_subset)

best_k = elbow_m.elbow_value_
print("BEST K:", best_k)

elbow_m.show()

### Plot the clusters

In [None]:
for k in (2, 3, 4 , 5, 6):
    fig = plt.figure(figsize=(20,5))
    fig.suptitle("{0} Clusters".format(k))

    ax = fig.add_subplot(121)
    plt.scatter(x = df_subset["Recency"], y = df_subset["Spent"], c=clusters[k], cmap=plt.cm.Set1)
    ax.set_xlabel("Recency")
    ax.set_ylabel("Spent")

    ax = fig.add_subplot(122)
    plt.scatter(x = df_subset["Purchases"], y = df_subset["Spent"], c=clusters[k],cmap=plt.cm.Set1)
    ax.set_xlabel("Purchases")
    ax.set_ylabel("Spent")

    plt.show()

In [None]:
df_count = df_base.copy()

df_count['clusters_3'] = clusters[3]
df_count['clusters_4'] = clusters[4]
df_count['clusters_7'] = clusters[7]

fig = plt.figure(figsize=(20,7))
f1 = fig.add_subplot(131)
market = df_count['clusters_3'].value_counts()
plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('3 Clusters')

f1 = fig.add_subplot(132)
market = df_count['clusters_4'].value_counts()
plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('4 Clusters')

f1 = fig.add_subplot(133)
market = df_count['clusters_7'].value_counts()
plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('7 Clusters')

plt.show()

In [None]:
fig = plt.figure(figsize=(20,15))
pal = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
pl=sns.swarmplot(x=clusters[best_k], y=df_base["Spent"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=clusters[best_k], y=df_base["Spent"], palette=pal)
plt.title("Spent")
plt.show()

In [None]:
fig = plt.figure(figsize=(20,15))
pal = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
pl=sns.swarmplot(x=clusters[best_k], y=df_base["Recency"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=clusters[best_k], y=df_base["Recency"], palette=pal)
plt.title("Recency")
plt.show()

### PCA

https://scikit-learn.org/stable/modules/decomposition.html#pca

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=3)
pca.fit(df_scaled)
df_pca = pd.DataFrame(pca.transform(df_scaled), columns=(["col1","col2", "col3"]))
print('Explained variance ratio:', pca.explained_variance_ratio_)
df_pca.describe().T

In [None]:
#%matplotlib qt
#%matplotlib inline

In [None]:
#A 3D Projection Of Data In The Reduced Dimension
x = df_pca["col1"]
y = df_pca["col2"]
z = df_pca["col3"]

#To plot
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

## Clustering on PCA data

In [None]:
#!pip install yellowbrick

In [None]:
from yellowbrick.cluster import KElbowVisualizer

# Quick examination of elbow method to find numbers of clusters to make.
print('Elbow Method to determine the number of clusters to be formed:')
elbow_m = KElbowVisualizer(KMeans(), k=10)
elbow_m.fit(df_pca)
elbow_m.show()

In [None]:
from sklearn.cluster import AgglomerativeClustering

# create the Agglomerative Clustering model 
aggc = AgglomerativeClustering(n_clusters=4)

# fit model and predict clusters
yhat_aggc = aggc.fit_predict(df_pca)
df_pca["Clusters"] = yhat_aggc

# add the Clusters feature to the original dataframe.
df_base["Clusters"] = yhat_aggc

In [None]:
#%matplotlib qt
%matplotlib inline

In [None]:
# plot the clusters
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, s=40, c=df_pca["Clusters"], marker='o', cmap = cmap )
ax.set_title("The Clusters")
plt.show()

In [None]:
#Plotting countplot of clusters
fig = plt.figure(figsize=(10,8))
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]
pl = sns.countplot(x=df_base["Clusters"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()

In [None]:
#Plotting count of total campaign accepted.
fig = plt.figure(figsize=(15,8))
pl = sns.countplot(x=df_base["Campaigns"], hue=df_base["Clusters"], palette= pal)
pl.set_title("Count Of Promotion Accepted")
pl.set_xlabel("Number Of Total Accepted Promotions")
plt.show()

In [None]:
profile_features = [ "Kidhome", "Teenhome", "Income", "Age", "Children", 
                    "Family_Size", "Is_Parent", "Education","Living_With"]

for pf in profile_features:
    # jointplot creates its own figure, so the size is set through `height`
    sns.jointplot(x=df_base[pf], y=df_base["Spent"], hue=df_base["Clusters"], kind='hist', palette=pal, height=10)
    plt.show()

## Cluster Properties

### Cluster 0
- definitely a parent
- 2 to 4 family members
- most have teenage children
- relatively older

### Cluster 1
- definitely NOT a parent
- max 2 members in the family
- all ages
- high income

### Cluster 2
- most are parents
- max 3 family members
- one kid, typically not a teenager
- relatively younger

### Cluster 3
- definitely a parent
- 2 to 5 family members
- teenager kids
- relatively older
- low income
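
The properties listed above were read off the plots. A numeric summary per cluster can serve as a cross-check. The sketch below runs on a small made-up table whose column names mirror `df_base`; the values and the resulting profile are illustrative only and say nothing about the real clusters.

```python
import pandas as pd

# Made-up customers with already-assigned cluster labels
toy = pd.DataFrame({
    "Clusters":    [0, 0, 1, 1, 2, 2],
    "Income":      [52000, 48000, 82000, 90000, 30000, 34000],
    "Age":         [55, 61, 40, 47, 33, 36],
    "Family_Size": [3, 4, 1, 2, 3, 3],
    "Is_Parent":   [1, 1, 0, 0, 1, 1],
})

# One row per cluster: size plus a summary of each profile feature
profile = toy.groupby("Clusters").agg(
    customers=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_age=("Age", "mean"),
    max_family_size=("Family_Size", "max"),
    share_parents=("Is_Parent", "mean"),
)
print(profile)
```

Applied to `df_base`, the same `groupby("Clusters")` call would give one row per segment to compare against the bullet points.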


### Bonus

Render 3D interactive plots inline.

In [None]:
#!pip install plotly

In [None]:
import plotly.graph_objects as go

df = df_base

trace1 = go.Scatter3d(
    x= df['Age'],
    y= df['Spent'],
    z= df['Purchases'],
    mode='markers',
     marker=dict(
        color = df['Clusters'], 
        size= 10,
        line=dict(
            color= df['Clusters'],
            width= 12
        ),
        opacity=0.8
     )
)
# Named `traces` rather than `data` so the `data` DataFrame loaded earlier is not overwritten
traces = [trace1]
layout = go.Layout(
    # Title and axis labels match the columns that are actually plotted
    title= 'Clusters wrt Age, Spent and Purchases',
    scene = dict(
            xaxis = dict(title  = 'Age'),
            yaxis = dict(title  = 'Spent'),
            zaxis = dict(title  = 'Purchases')
        )
)
fig = go.Figure(data=traces, layout=layout)
fig.show()