# Introduction

In this notebook, we segment the customers of a grocery retailer to understand their needs. We then analyze the segments to give the business insights and recommendations it can use to tailor the customer experience and increase sales.

To organize the customers into clusters, we use the K-Prototypes algorithm. It suits mixed data because it combines K-Means, which handles the numerical variables, with K-Modes, which handles the categorical ones.
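
To make the idea concrete, here is a minimal sketch of the kind of mixed dissimilarity K-Prototypes minimizes: squared Euclidean distance on the numerical part plus a weighted count of mismatches on the categorical part. The function name `mixed_dissimilarity`, the toy customer, and the weight `gamma=0.5` are assumptions made for illustration; this is not the code the `kmodes` library runs internally.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Distance between one record and one prototype for mixed data (illustrative)."""
    # Numerical part: squared Euclidean distance, as in K-Means
    numeric_cost = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    # Categorical part: number of mismatching categories, as in K-Modes
    categorical_cost = sum(a != b for a, b in zip(x_cat, proto_cat))
    # gamma balances the two parts (hypothetical value chosen by the caller)
    return numeric_cost + gamma * categorical_cost

# Hypothetical standardized customer: (income, expenses) plus (education, marital status)
d = mixed_dissimilarity(
    [0.5, -1.0], ["Master", "Single"],
    [0.0, 0.0], ["Master", "In relationship"],
    gamma=0.5,
)
print(d)
```

A larger `gamma` gives the categorical variables more influence on the cluster assignment, while a smaller one lets the numerical variables dominate.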

# Importing Libraries and Dataset

In [1]:
# Downloading the necessary libraries
!pip install proplot

# Importing the necessary libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
from keras.layers import Embedding
import proplot as pplt
from kmodes.kprototypes import KPrototypes
import plotly.express as px

import warnings
warnings.filterwarnings('ignore') 

In [2]:
# Importing data
df = pd.read_csv('/kaggle/input/customer-personality-analysis/marketing_campaign.csv', sep='\t')
df.head()

# Data Preprocessing

## Dealing with Missing Values

In [4]:
# Do we have missing values?
df.isnull().sum()   

In [6]:
# Imputing missing values with the mean
df['Income'] = df['Income'].fillna(df['Income'].mean())

## Adding new variables

In [7]:
df['Kids'] = df['Kidhome'] + df['Teenhome']
df['Expenses'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

## Cleaning some variables

In [8]:
# Check the types of marital status
df['Marital_Status'].value_counts()  

In [9]:
# Renaming categories: collapse marital status into two groups
df['Marital_Status'] = df['Marital_Status'].replace({
    'Married': 'In relationship', 'Together': 'In relationship',
    'Divorced': 'Single', 'Widow': 'Single',
    'Absurd': 'Single', 'Alone': 'Single', 'YOLO': 'Single'
})

In [10]:
# Check the education levels
df['Education'].value_counts()  

In [11]:
# 2n Cycle = Master (Bologna Process)
df['Education'] = df['Education'].str.replace('2n Cycle', 'Master')   

In [12]:
# Customer's enrollment time in days
# Assumes Dt_Customer is stored as day-month-year; dayfirst avoids ambiguous parsing
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], dayfirst=True)
df['Date_Collected'] = pd.to_datetime('2015-01-01')
df['Time_Enrolled_Days'] = (df['Date_Collected'] - df['Dt_Customer']).dt.days

In [13]:
df.columns   # Column names

In [18]:
df.info()

In [14]:
# Removing some variables
df = df.drop(columns=[
       'ID', 'Dt_Customer', 'Kidhome', 'Teenhome', 'Recency', 'NumDealsPurchases', 
       'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response', 'Date_Collected'
       ])

In [15]:
# Rename the columns
df.columns = ['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Wines', 'Fruits', 'Meat', 'Fish', 'Sweet', 'Gold', 'Children', 'Expenses', 'Time_Enrolled_Days']

In [17]:
df.describe().T

## Dealing with outliers

In [19]:
# Removing outliers in income
from scipy import stats

# Keep observations within 3 standard deviations of the mean income, then reset the index
df1 = df[np.abs(stats.zscore(df['Income'])) < 3].reset_index(drop=True)
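
As a quick illustration of how this filter behaves, the sketch below applies the same z-score rule to a small, made-up income column (the values and the `toy` name are assumptions, not data from this notebook).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up incomes: 29 ordinary values and one extreme value
toy = pd.DataFrame({"Income": [50_000 + 1_000 * i for i in range(29)] + [600_000]})

# Absolute z-score of every observation (scipy uses the population std by default)
z = np.abs(stats.zscore(toy["Income"]))

# Keep only the rows within 3 standard deviations of the mean
kept = toy[z < 3].reset_index(drop=True)
print(len(toy), "->", len(kept))
```

Only the extreme value is dropped here. Note that the outlier itself inflates the mean and standard deviation used by the rule, so on very small samples a z-score threshold of 3 may never be reached.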

## Standardization

* By **standardizing** the continuous variables, we make them all equally important to the analysis. This is critical because, when the numerical variables have very different ranges, the variables with larger ranges dominate those with smaller ranges.

* **Standardization** rescales data to have a mean (𝜇) of 0 and standard deviation (𝜎) of 1 (unit variance).

$$ Z = \frac{\text{Value} - \text{Mean}}{\text{Standard Deviation}} $$
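
The formula can be checked by hand on a few made-up values. The sketch below (toy numbers, not the notebook's data) computes the z-scores manually and compares them with `StandardScaler`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up income values, shaped as a single column
income = np.array([[20_000.0], [40_000.0], [60_000.0], [80_000.0]])

# Manual z-score: subtract the mean, divide by the population std (ddof=0)
manual = (income - income.mean()) / income.std()

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(income)

print(np.allclose(manual, scaled))   # both approaches agree
print(scaled.mean(), scaled.std())   # approximately 0 and 1
```

One detail to keep in mind: `pandas.Series.std()` defaults to the sample standard deviation (`ddof=1`), so a manual computation with pandas gives slightly different values unless `ddof=0` is passed.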

In [20]:
df_final = df1.copy()

# Standardization (each numerical column is scaled independently)
num_cols = df_final.select_dtypes(exclude='object').columns
df_final[num_cols] = StandardScaler().fit_transform(df_final[num_cols])

# Modeling clusters

## Clustering using K-Prototypes

In [22]:
#Choosing optimal K
K = range(1,10)
cost = []
for k in K:
    kproto = KPrototypes(n_clusters=k, init='Cao', random_state=42)
    kproto.fit_predict(df_final, categorical=[1,2])
    cost.append(kproto.cost_)

In [23]:
sns.set(rc={'axes.facecolor':'black', 'figure.facecolor':'black', 'axes.grid' : False, 'font.family': 'Ubuntu'})

fig, ax = plt.subplots(figsize =(12, 8))

plt.plot(K, cost, 'o-', color = '#FFC300')
plt.xlabel('k', color = 'white', size = 14)
plt.ylabel('Distortion', color = 'white', size = 14)
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Hide the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# Title
plt.text(2.4, 26800, "The", size=22, color="white")
plt.text(3.3, 26800,"Elbow Method", size=22, color="#FFC300")
plt.text(6.4, 26800, ": Optimal Number of", size=22, color="white")
plt.text(10.9, 26800, "Clusters", size=22, color="#FFC300", fontweight="bold")

# Author
plt.text(14.65, 6900, "@miguelfzzz", fontsize=12, ha="right", color='lightgray', fontweight="bold")


plt.show()

We use the *Elbow Method* to choose the number of clusters: we look for the point on the curve where adding another cluster stops reducing the cost substantially. In our case, we choose **4 clusters**.
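
As a rough complement to reading the plot by eye, the bend can also be estimated numerically. The sketch below uses one simple heuristic, the largest second difference of the cost curve; the function name `elbow_by_second_difference` and the cost values are made up for illustration and are not the costs computed in this notebook.

```python
import numpy as np

def elbow_by_second_difference(ks, costs):
    """Return the k at which the cost curve bends most sharply (heuristic)."""
    costs = np.asarray(costs, dtype=float)
    # Second difference at each interior k: how much the drop in cost shrinks there
    curvature = costs[:-2] - 2 * costs[1:-1] + costs[2:]
    # +1 because the first interior point is the second element of ks
    return list(ks)[int(np.argmax(curvature)) + 1]

# Illustrative, made-up costs for k = 1..7
example_ks = range(1, 8)
example_costs = [1000, 700, 480, 300, 270, 250, 235]

best_k = elbow_by_second_difference(example_ks, example_costs)
print(best_k)
```

Heuristics like this can disagree with each other when the curve is smooth, so the result is best treated as a suggestion that still needs a look at the plot and at how interpretable the clusters are.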

In [24]:
# Clustering (random_state is fixed so that the cluster labels are reproducible)
kproto = KPrototypes(n_clusters=4, init='Cao', n_jobs=4, random_state=42)
clusters = kproto.fit_predict(df_final, categorical=[1,2])

In [25]:
# Merging original data with clusters
df_clusters = pd.concat([df1, pd.DataFrame({'cluster': clusters})], axis=1)   

## Clusters interpretation

In [26]:
# Clusters interpretation
sns.set(rc={'axes.facecolor':'black', 'figure.facecolor':'black', 'axes.grid' : False, 'font.family': 'Ubuntu'})

for i in df_clusters:
    g = sns.FacetGrid(df_clusters, col = "cluster", hue = "cluster", palette = "Set2")
    g.map(plt.hist, i, bins=10, ec="k") 
    g.set_xticklabels(rotation=30, color = 'white')
    g.set_yticklabels(color = 'white')
    g.set_xlabels(size=15, color = 'white')
    g.set_titles(size=15, color = '#FFC300', fontweight="bold")
    g.fig.set_figheight(5)

* **Cluster 0**: People with high incomes and high spending who tend to have one child. We will call this group *good customers*.

* **Cluster 1**: People with the highest incomes and the highest spending. We will call this group *elite customers*.

* **Cluster 2**: People with lower incomes and low spending who have also been enrolled as members for the shortest time. We will call this group *economical customers*.

* **Cluster 3**: People with the lowest incomes and the lowest spending. We will call this group *cheap customers*.
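
Profiles like these can be cross-checked against per-cluster averages rather than histograms alone. The sketch below shows one way to build such a summary; the `toy_clusters` frame is made-up data standing in for `df_clusters`, so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up rows standing in for df_clusters (unscaled values plus a cluster label)
toy_clusters = pd.DataFrame({
    "Income":   [70_000, 72_000, 90_000, 95_000, 35_000, 30_000],
    "Expenses": [1_000, 1_100, 1_900, 2_000, 100, 60],
    "Children": [1, 1, 0, 0, 1, 2],
    "cluster":  [0, 0, 1, 1, 2, 2],
})

# Average value of each variable per cluster
profile = toy_clusters.groupby("cluster")[["Income", "Expenses", "Children"]].mean()

# Number of customers in each cluster
profile["size"] = toy_clusters.groupby("cluster").size()

print(profile)
```

Running the same two lines on `df_clusters` would give a compact table to read next to the plots, assuming the cluster column is named `cluster` as above.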

In [29]:
# Results
clusters_count = df_clusters['cluster'].value_counts()                        # Counting customers per cluster
clusters_count = clusters_count.to_frame().reset_index()                      # Convert series to dataframe
clusters_count.columns = ['clusters', 'count']                                # Rename column names
clusters_count = clusters_count.sort_values('clusters', ascending = True)     # Sorting data

labels = [
        "Good Customers", 
        "Elite Customers", 
        "Economical Customers", 
        "Cheap Customers"
        ]

# Visualization
plt.figure(figsize=(12,8))

mpl.rcParams['font.size'] = 17
colors = sns.color_palette('Set2')[0:4]

plt.pie(clusters_count['count'], 
        explode=(0.05, 0.05, 0.05, 0.05), 
        labels = labels,
        colors= colors,
        autopct='%1.1f%%',
        textprops = dict(color ="white", fontsize=19),
        counterclock = False,
        startangle=180,
        wedgeprops={"edgecolor":"gray",'linewidth': 1}
        )

plt.axis('equal')

# Title 
plt.text(-0.8, 1.2, "Clusters", size=30, color="#FFC300", fontweight="bold")
plt.text(-0.12, 1.2, "Distribution", size=30, color="white")

# Author
plt.text(1.1, -1.25, "@enesimek", fontsize=12, ha="right", color='lightgray', fontweight="bold")

plt.show()


In [30]:
# Creating a new dataset 
clusters_incomes = df_clusters[['Income', 'Expenses', 'cluster']].copy()    # Select variables (copy avoids SettingWithCopyWarning)

# Map cluster numbers to group names
cluster_names = {0: 'Good Customers', 1: 'Elite Customers', 2: 'Economical Customers', 3: 'Cheap Customers'}
clusters_incomes['group'] = clusters_incomes['cluster'].map(cluster_names)

clusters_incomes = clusters_incomes.sort_values('group', ascending = False)    # Sorting data 

# Visualizing 
fig, ax = plt.subplots(figsize =(12, 8))

sns.scatterplot(data = clusters_incomes, x = 'Income', y = 'Expenses', hue = 'group', palette = 'Set2', alpha=0.6)

# Naming axis labels
plt.xlabel('Income', color = 'white', size = 14);
plt.ylabel('Expenses', color = 'white', size = 14);

# Hide the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# Coloring the axis in white
plt.tick_params(colors='white')

# Customize legend
plt.legend(labelcolor = 'white', frameon=False, bbox_to_anchor=(0.33, 0.8))    

# Title
plt.text(20000, 2750, "Clusters", size=22, color="#FFC300", fontweight="bold")
plt.text(38000, 2750, "by", size=22, color="white")
plt.text(45000,2750,"Income", size=22, color="#FFC300")
plt.text(61000, 2750, "&", size=22, color="white")
plt.text(67000,2750,"Expenses", size=22, color="#FFC300")

# Author
plt.text(120000, -350, "@enesimek", fontsize=12, ha="right", color='lightgray', fontweight="bold")

plt.show()



In [31]:
# Creating a new dataset 
clusters_products = df_clusters[['Wines', 'Fruits', 'Meat', 'Fish', 'Sweet', 'Gold', 'cluster']]    # Select variables

clusters_products1 = clusters_products.groupby(['cluster'])
clusters_products2 = clusters_products1.agg({'Wines':'sum', 'Fruits':'sum', 'Meat':'sum', 'Fish':'sum', 'Sweet':'sum', 'Gold':'sum'})

clusters_products3 = clusters_products2.stack().reset_index(name='Count').rename(columns={'level_1':'Products'})   # Opposite of pivoting

# Map cluster numbers to group names
cluster_names = {0: 'Good Customers', 1: 'Elite Customers', 2: 'Economical Customers', 3: 'Cheap Customers'}
clusters_products3['group'] = clusters_products3['cluster'].map(cluster_names)

products = clusters_products3.copy()
products = products.assign(ratio=products.groupby('group').Count.transform(lambda x: x / x.sum()))

# Visualization
fig = px.bar(products, x='group', y='ratio', color='Products',
             labels={
                     "ratio": "Ratio",
                     "group": "Consumer's type"
                     },
             color_discrete_map={
                     'Gold': '#FFD700',
                     'Fish': '#87CEEB',
                     'Wines': '#b11226',
                     'Meat': '#f08080',
                     'Sweet': '#FF69B4',
                     'Fruits': 'lightgreen'},
                title="Products Distribution by Clusters")

fig.layout.yaxis.tickformat = ',.0%'

fig.update_traces(marker_line_color='white', marker_line_width=1, opacity=0.8)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_layout(
    {'plot_bgcolor': 'black',
    'paper_bgcolor': 'black'
    },
    font=dict(
        family="verdana",
        size=21,
        color="white"
    ),
    width=680,
    height=800,
    title_font_color="#FFC300",
    yaxis_title=None,
    xaxis_title=None
)

fig.show()