# Predict Customer Personality to Boost Marketing Campaign by Using Machine Learning

## Task 1 : Conversion Rate Analysis Based On Income, Spending And Age
Goals : Find a pattern of consumer behavior.<br>
Objective : 
- Feature engineering 
- Analyze Conversion Rate with other variables such as age, income, expenses, etc 

### Import Library

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.decomposition import PCA
randomstate=511

### Load Data

In [None]:
pd.set_option('display.max_columns', None)
df = pd.read_csv('./data/marketing_campaign_data.csv')
df.sample(4)

In [None]:
# Display the information about the DataFrame
print("DataFrame Information:")
df.info()

# Display the number of unique values in each column
print("\nNumber of unique values in each column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()}")

# Display the statistical summary of the DataFrame
print("\nDataFrame Description:")
display(df.describe().transpose())

### Feature Engineering

In this section, we create new features to better understand our customers and their behaviors. Here's a brief explanation of each new feature:

1. **Age**: This feature represents the age of each customer. It is calculated by subtracting the `Year_Birth` feature from the current year.

2. **AgeGroup**: This feature categorizes customers into different age groups for easier analysis. The age groups are determined based on the customer's `Age` range, as suggested by this [article](https://www.researchgate.net/figure/Age-intervals-and-age-groups_tbl1_228404297). The minimum age in this dataset is 28.

3. **Parent**: This feature indicates the parental status of each customer. It is `yes` when the customer has at least one kid or teen at home (`Kidhome` or `Teenhome` greater than 0) and `no` otherwise.

4. **NumChild**: This feature represents the total number of children each customer has. It is calculated as the sum of the `Kidhome` and `Teenhome` features.

5. **TotalAcceptedCmp**: This feature represents the total number of campaigns each customer accepted after the campaign was carried out. It is calculated from the sum of the `AcceptedCmp1` to `AcceptedCmp5` features.

6. **TotalSpending**: This feature represents the total amount each customer spent on our platform. It is calculated as the sum of the `MntCoke`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, and `MntGoldProds` features.

7. **TotalTrx**: This feature represents the total number of transactions the customer made in our store, either offline or online. It is calculated as the sum of the `NumDealsPurchases`, `NumWebPurchases`, `NumCatalogPurchases`, and `NumStorePurchases` features.

8. **Online Trx**: This feature represents the number of online transactions the customer made on our platform.

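9. **ConversionRate**: This feature is the ratio of web purchases to monthly website visits (`NumWebPurchases / NumWebVisitsMonth`). It is a ratio rather than a true percentage, so it can exceed 1, and it is undefined (`NaN` or infinite) for customers with zero visits. It is a key metric for understanding the effectiveness of our *online sales efforts*.

10. **Loyalty**: This feature represents the number of months between the customer's `Dt_Customer` date and the current date.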
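
As a minimal sketch of how this ratio behaves, the toy frame below (hypothetical values, not rows from the dataset) shows that plain division yields `inf` or `NaN` when `NumWebVisitsMonth` is 0. The `safe_rate` variant is only one possible alternative shown for illustration; it is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'NumWebPurchases':   [4, 2, 0, 3],
    'NumWebVisitsMonth': [8, 1, 0, 0],
})

# Plain division: x/0 gives inf and 0/0 gives NaN
raw_rate = toy['NumWebPurchases'] / toy['NumWebVisitsMonth']

# Alternative: treat zero visits as undefined so the result is NaN, never inf
safe_rate = toy['NumWebPurchases'] / toy['NumWebVisitsMonth'].replace(0, np.nan)

print(raw_rate.tolist())   # [0.5, 2.0, nan, inf]
print(safe_rate.tolist())  # [0.5, 2.0, nan, nan]
```

The second row also shows that the ratio can be larger than 1, which is why it should not be read as a bounded percentage.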

In [None]:
# Create a copy of the original dataframe to avoid modifying the original data
dfe = df.copy()

# Calculate the age of each customer based on their year of birth
dfe['Age'] = 2024 - dfe['Year_Birth']

# Categorize customers into age groups based on their age
age_grouping = [
    (dfe['Age'] >= 60),
    (dfe['Age'] >= 40 ) & (dfe['Age'] < 60),
    (dfe['Age'] >= 28) & (dfe['Age'] < 40)
]
age_category = ['Old Adults', 'Middled-aged Adults', 'Young Adults']
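# An explicit string default avoids mixing the string labels with np.select's integer default (0)
dfe['AgeGroup'] = np.select(age_grouping, age_category, default='Unknown')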

# Determine whether each customer has a kid at home
dfe['Parent'] = np.where((dfe['Kidhome'] > 0) | (dfe['Teenhome'] > 0), 'yes', 'no')

# Calculate the total number of children each customer has
dfe['NumChild'] = dfe['Kidhome'] + dfe['Teenhome']

# Calculate the total number of campaigns each customer accepted
dfe['TotalAcceptedCmp'] = dfe['AcceptedCmp1'] + dfe['AcceptedCmp2'] + dfe['AcceptedCmp3'] + dfe['AcceptedCmp4'] + dfe['AcceptedCmp5']

# Calculate the total spending of each customer across all product categories
dfe['TotalSpending'] = dfe['MntCoke'] + dfe['MntFruits'] + dfe['MntMeatProducts'] + dfe['MntFishProducts'] + dfe['MntSweetProducts'] + dfe['MntGoldProds']

# Calculate the total number of transactions each customer made
dfe['TotalTrx'] = dfe['NumDealsPurchases'] + dfe['NumWebPurchases'] + dfe['NumCatalogPurchases'] + dfe['NumStorePurchases']

# Convert 'Dt_Customer' to datetime format
dfe['Dt_Customer'] = pd.to_datetime(dfe['Dt_Customer'], format='%d-%m-%Y')

# Calculate the number of months since each customer's first purchase
dfe['Loyalty'] = ((pd.Timestamp.now() - dfe['Dt_Customer']).dt.days / 30.44).astype(int)

# Calculate the conversion rate for each customer (the number of web purchases divided by the number of web visits)
dfe['ConversionRate'] =  dfe['NumWebPurchases'] / dfe['NumWebVisitsMonth']

### EDA

In [None]:
dfe.columns

In [None]:
nums_eda = ['Income', 'Recency', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 
            'MntSweetProducts','MntGoldProds','NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
            'NumStorePurchases', 'NumWebVisitsMonth', 'Age', 'TotalSpending', 'TotalTrx', 'ConversionRate']

# create boxplots for each column with subplots
fig, axes = plt.subplots(2, 8, figsize=(24,8))
fig.suptitle('Boxplots of Necessary Numeric Features', fontsize=16, fontweight='bold', y=1.02)
fig.set_facecolor('#E8E8E8')

for col, ax in zip(nums_eda, axes.flatten()):
    sns.boxplot(y=dfe[col], ax=ax, color='#D1106F', linewidth=2.1, width=0.55, fliersize=3.5)
    ax.set_title(f'Boxplot of {col}', fontsize=14, fontweight='bold', pad=5)
    ax.grid(False)
plt.tight_layout()

In [None]:
nums_eda = ['Income', 'Recency', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
            'MntSweetProducts', 'MntGoldProds', 'Age', 'TotalSpending', 'ConversionRate']

# create KDE plots for each column with subplots
fig, axes = plt.subplots(2, 5, figsize=(15,8))
fig.suptitle('KDE plot for Necessary Features', fontsize=16, fontweight='bold', y=1.02)
for col, ax in zip(nums_eda, axes.flatten()):
    sns.kdeplot(x=dfe[col], ax=ax, color='#D1106F', linewidth=0.7, fill=True)
    ax.set_title(f'KdePlot of {col}', fontsize=14, fontweight='bold', pad=5)
    ax.grid(False)
plt.tight_layout()

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(24,16))
fig.set_facecolor('#E8E8E8')
# Plot 1
sns.scatterplot(x='Income', y='ConversionRate', data=dfe, color='#D1106F', ax=axs[0, 0])
axs[0, 0].set_xlim(0, 200000000)
axs[0, 0].set_ylim(0, 4.7)
axs[0, 0].axvline(x=110000000, color='b', linestyle='--')
axs[0, 0].set_title("Customer Conversion Rate and Income Correlation", fontsize=19, fontweight='bold', y=1.02)
axs[0, 0].set_xlabel('Income', fontsize=13.5)
axs[0, 0].set_ylabel('Conversion Rate', fontsize=13.5)
axs[0, 0].grid(False)

# Plot 2
sns.scatterplot(x='TotalSpending', y='Income', data=dfe, color='#D1106F', ax=axs[0, 1])
axs[0, 1].set_ylim(0, 122000000)
axs[0, 1].set_xlim(0, 2700000)
axs[0, 1].axvline(x=2540000, color='b', linestyle='--')
axs[0, 1].set_title('Customer Income and Total Spending Correlation', fontsize=17, fontweight='bold', y=1.03)
axs[0, 1].set_xlabel('Total Spending', fontsize=13.5)
axs[0, 1].set_ylabel('Income', fontsize=13.5)
axs[0, 1].grid(False)

# Plot 3
sns.scatterplot(x='TotalSpending', y='ConversionRate', data=dfe, color='#D1106F', ax=axs[1, 0])
axs[1, 0].set_ylim(0, 3.8)
axs[1, 0].set_title('Correlation Between Conversion Rate and Total Spending', fontsize=18, fontweight='bold', y=1.02)
axs[1, 0].set_xlabel('Total Spending', fontsize=13.5)
axs[1, 0].set_ylabel('Conversion Rate', fontsize=13.5)
axs[1, 0].grid(False)


# Plot 4
sns.scatterplot(x='Age', y='ConversionRate', data=dfe, color='#D1106F', ax=axs[1, 1])
# axs[1, 1].set_ylim(0, 3.8)
axs[1, 1].set_title('Correlation Between Conversion Rate and Age', fontsize=18, fontweight='bold', y=1.02)
axs[1, 1].set_xlabel('Age', fontsize=13.5)
axs[1, 1].set_ylabel('Conversion Rate', fontsize=13.5)
axs[1, 1].grid(False)

plt.tight_layout()
plt.show()

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(24,12))
fig.set_facecolor('#E8E8E8')

# Get counts of each age group
age_counts = dfe['AgeGroup'].value_counts()
palt = ['#00D19B','#D1106F' ,'#25A9D9']

# Create pie chart
patches, texts, autotexts = axs[0].pie(age_counts, colors=palt, autopct='%1.1f%%', textprops={'size': 13})

# Legend
axs[0].legend(patches, age_counts.index, loc="best")
axs[0].set_title("Distribution of Customer by Age Group", fontsize=18, fontweight='bold', y=1.03)

# Get counts of each parental status
parent_counts = dfe['Parent'].value_counts()
palt = ['#00D19B','#D1106F']

# Create pie chart
patches, texts, autotexts = axs[1].pie(parent_counts, colors=palt, autopct='%1.1f%%', textprops={'size':13})

# Add legend
axs[1].legend(patches, parent_counts.index, loc="best")
axs[1].set_title("Parent Customer Distribution", fontsize=18, fontweight='bold', y=1.02)

plt.tight_layout()
plt.show()

In [None]:
# Create a 3x2 grid of subplots with a specific size and background color
fig, axs = plt.subplots(3, 2, figsize=(20, 30), facecolor='#E8E8E8')
# Define the color palette and order of age groups
palt = ['#D1106F','#00D19B' ,'#25A9D9']
age_order = ['Young Adults', 'Middled-aged Adults', 'Old Adults']

# Define a function to annotate the bars in a bar plot with their height values
def annotate_barplot(barplot):
    for p in barplot.patches:
        height = p.get_height()
        barplot.text(p.get_x()+p.get_width()/2.,
                     height + 0.01,
                     '{:1.2f}'.format(height),
                        ha="center",
                        fontweight='bold')

# Conversion Rate Vs Age Group
barplot = sns.barplot(data=dfe, x='AgeGroup', y='ConversionRate',hue='AgeGroup', order=age_order, legend=False, palette=palt, errorbar=None, edgecolor='black', ax=axs[0, 0])
annotate_barplot(barplot)
axs[0, 0].set_ylim(0, 1.5)
axs[0, 0].set_title("Conversion Rate by Age Group", fontsize=18, fontweight='bold', y=1.03)
axs[0, 0].set_xlabel('Age Group', fontsize=12)
axs[0, 0].set_ylabel('Conversion Rate', fontsize=12)
axs[0, 0].grid(False)

# Total Spending By Age Group
barplot = sns.barplot(data=dfe, x='AgeGroup', y='TotalSpending',hue='AgeGroup', order=age_order, legend=False, palette=palt, errorbar=None, edgecolor='black', ax=axs[0, 1])
annotate_barplot(barplot)
axs[0, 1].set_ylim(0, 820000)
axs[0, 1].set_title("Total Spending by Age Group", fontsize=18, fontweight='bold', y=1.03)
axs[0, 1].set_xlabel('Age Group', fontsize=13)
axs[0, 1].set_ylabel('Total Spending', fontsize=13)
axs[0, 1].grid(False)

# Total Accepted Campaigns By Age Group
barplot = sns.barplot(data=dfe, x='AgeGroup', y='TotalAcceptedCmp',hue='AgeGroup', order=age_order, legend=False, palette=palt, errorbar=None, edgecolor='black', ax=axs[1, 0])
annotate_barplot(barplot)
axs[1, 0].set_title("Total Accepted Campaigns by Age Group", fontsize=18, fontweight='bold', y=1.03)
axs[1, 0].set_xlabel('Age Group', fontsize=13)
axs[1, 0].set_ylabel('Total Accepted Campaigns', fontsize=13)
axs[1, 0].grid(False)

# Conversion Rate by Number of Children
palt = ['#D1106F','#00D19B' ,'#25A9D9', '#D16F11']
barplot = sns.barplot(x='NumChild', y='ConversionRate',hue='NumChild', legend=False, data=dfe, palette=palt, errorbar=None, edgecolor='black', ax=axs[1, 1])
annotate_barplot(barplot)
axs[1, 1].set_ylim(0, 2.2)
axs[1, 1].set_title("Customer Conversion Rate by Number of Children", fontsize=18, fontweight='bold', y=1.03)
axs[1, 1].set_xlabel('Number of Children', fontsize=13.5)
axs[1, 1].set_ylabel('Conversion Rate', fontsize=13.5)
axs[1, 1].grid(False)

# Conversion Rate by Parental Status
palt = ['#D1106F','#00D19B']
barplot = sns.barplot(x='Parent', y='ConversionRate',hue='Parent', data=dfe, legend=False, palette=palt, errorbar=None, edgecolor='black', ax=axs[2, 0])
annotate_barplot(barplot)
axs[2, 0].set_ylim(0, 2.3)
axs[2, 0].set_title('Conversion Rate by Parental Status', fontsize=18, fontweight='bold', y=1.03)
axs[2, 0].set_xlabel('Parental Status', fontsize=12)
axs[2, 0].set_ylabel('Conversion Rate', fontsize=12)
axs[2, 0].grid(False)
# Conversion Rate by education
palt = ['#D1106F','#00D19B' ,'#25A9D9', '#D16F11', '#6F11D1']
ed_order = ['SMA', 'D3', 'S1', 'S2', 'S3']
barplot = sns.barplot(x='Education', y='ConversionRate',hue='Education', data=dfe, order=ed_order, legend=False, palette=palt, errorbar=None, edgecolor='black', ax=axs[2, 1])
annotate_barplot(barplot)
axs[2, 1].set_ylim(0, 1.28)
axs[2, 1].set_title('Conversion Rate by Education Level', fontsize=18, fontweight='bold', y=1.03)
axs[2, 1].set_xlabel('Education', fontsize=12)
axs[2, 1].set_ylabel('Conversion Rate', fontsize=12)
axs[2, 1].grid(False)

plt.tight_layout()
plt.show()

In [None]:
num = ['Income', 'Recency', 'NumWebVisitsMonth',
       'Complain', 'Response', 'Age', 'NumChild', 'TotalAcceptedCmp',
       'TotalSpending', 'TotalTrx', 'ConversionRate']
plt.figure(figsize=(18,10), facecolor='#E8E8E8')
sns.heatmap(dfe[num].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=18, fontweight='bold', y=1.02)
plt.show()

## Task 2 : Data Cleaning & Preprocessing
Goals : Prepare the raw data so that it is clean and ready to be processed by machine learning<br><br>
Objective : 
- Handle Missing Values
- Handle Duplicate Values
- Handle Infinity values 
- Feature Selection 
- Feature Encoding
- Standardization

#### Handle missing values

In [None]:
# make a copy of previous dataframe for next step (Data Preprocessing)
dfp = dfe.copy()

# Print missing values
missing_col = dfp.isna().sum()
display_missing_col = missing_col[missing_col > 0]
print(f'Missing Values : \n \n{display_missing_col}')

In [None]:
missing = dfp.isnull().sum()*100 / len(dfp)

percentage_missing = pd.DataFrame({'column':dfp.columns,
                                   'missing_percentage %':missing.values})
percentage_missing['missing_percentage %'] = percentage_missing['missing_percentage %'].round(2)
percentage_missing = percentage_missing.sort_values('missing_percentage %', ascending=False)
percentage_missing = percentage_missing.reset_index(drop=True)

plt.figure(figsize=(10,8), facecolor='#E8E8E8')
ax = sns.barplot(x='missing_percentage %', y='column', data=percentage_missing, color='#E1341E')
for p in ax.patches:
    ax.annotate('%.2f' % p.get_width() + '%', xy=(p.get_width(), p.get_y()+p.get_height()/2),
                xytext=(8,0), textcoords='offset points', ha='left', va='center', fontsize=10)
plt.title('Percentage of Missing Data', fontsize=17, fontweight='bold')
plt.ylabel('Column', fontsize=12, fontweight='bold')
plt.xlabel('Percentage', fontsize=12, fontweight='bold')
plt.xlim(0,1.5)
plt.show()

In [None]:
missing_cr = dfp[['NumWebPurchases', 'NumWebVisitsMonth', 'ConversionRate']]
missing_crdf = missing_cr[missing_cr.isna().any(axis=1)]

print(f"Highlighted Missing values : \n")
display(missing_crdf)
print('*Conversion Rate not missing at Random*')

In [None]:
plt.figure(figsize=(7, 5), facecolor='#E8E8E8')
sns.kdeplot(data=dfp, x='Income', fill=True, color='#D1106F')
plt.title('Income')

plt.tight_layout()
plt.show()

In [None]:
# print total null on income and conversion rate
total_null_income = dfp['Income'].isna().sum()
total_null_conrate = dfp['ConversionRate'].isna().sum()
print(f"Total Missing Values on Income Column = {total_null_income}")
print(f"Total Missing Values on Conversion Rate Column = {total_null_conrate}")

# print median income
median_income = dfp['Income'].median()
print(f"\nIncome Median to fill the missing value: {median_income}")

# handle missing values with fill and drop method
dfp['Income'] = dfp['Income'].fillna(median_income)
dfp.dropna(subset=['ConversionRate'], inplace=True)

# checking missing values if still exist
nonull_income = dfp['Income'].isna().sum()
nonull_conrate = dfp['ConversionRate'].isna().sum()
print(f"\nMissing Values on Income Column after handling = {nonull_income}")
print(f"Missing Values on Conversion Rate Column after handling = {nonull_conrate}")

#### No Duplicates

In [None]:
total_duplicate = dfp.duplicated().sum()
print(f"Total Duplicated Data = {total_duplicate}")

#### Fix the Infinity Value On Conversion Rate Features

In [None]:
# Count infinite values (+inf or -inf) across the numeric columns
inf_mask = np.isinf(dfp.select_dtypes(include='number'))
count_inf = inf_mask.sum().sum()
print(f"Count of Infinity Values :\nIt Contains {count_inf} Infinite values in dataframe")

# Print the columns where infinite values exist
col_inf = inf_mask.columns[inf_mask.any()]
print("\nColumns where Infinity values exist:")
print(", ".join(col_inf))

In [None]:
# Replace infinity values with NaN
dfp.replace([np.inf, -np.inf], np.nan, inplace=True)

print(f"Dataframe Entries before dropping infinity values {len(dfp)}")

# Drop infinity value as nan value
dfp.dropna(inplace=True)

print(f"\nDataframe Entries After dropping infinity values {len(dfp)}")

no_inf = np.isinf(dfp.select_dtypes(include='number')).sum().sum()
print(f"\nChecking if infinity values still exist in dataframe : {no_inf}")

#### Handle Outliers

In [None]:
# Rebuild dfp from dfe and repeat the missing-value and infinity handling from above
dfp = dfe.copy()
dfp['Income'] = dfp['Income'].fillna(dfp['Income'].median())
dfp.dropna(subset=['ConversionRate'], inplace=True)
dfp.replace([np.inf, -np.inf], np.nan, inplace=True)
dfp.dropna(inplace=True)

In [None]:
# def remove_outliers(data, columns):
#     result = data.copy()
#     for col in columns:
#         Q1 = result[col].quantile(0.25)
#         Q3 = result[col].quantile(0.75)
#         IQR = Q3 - Q1
#         result = result[~((result[col] < (Q1 - 1.5 * IQR)) |(result[col] > (Q3 + 1.5 * IQR)))]
#     return result

# outliers = ['Income', 'MntMeatProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
#             'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth', 'Age', 'TotalTrx', 'ConversionRate'] 

# dfp_noutlier = remove_outliers(dfp, outliers)

In [None]:
def cap_outliers(data, columns):
    result = data.copy()
    for col in columns:
        Q1 = result[col].quantile(0.25)
        Q3 = result[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        result[col] = np.where(result[col] < lower_bound, lower_bound, result[col])
        result[col] = np.where(result[col] > upper_bound, upper_bound, result[col])
    return result

outliers = ['Income', 'MntMeatProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
            'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth', 'Age', 'TotalTrx', 'ConversionRate'] 

dfp_noutlier = cap_outliers(dfp, outliers)

In [None]:
# def replace_outliers(data, columns):
#     result = data.copy()
#     for col in columns:
#         Q1 = result[col].quantile(0.25)
#         Q3 = result[col].quantile(0.75)
#         IQR = Q3 - Q1
#         lower_bound = Q1 - 1.5 * IQR
#         upper_bound = Q3 + 1.5 * IQR
#         median = result[col].median()
#         result[col] = np.where((result[col] < lower_bound) | (result[col] > upper_bound), median, result[col])
#     return result

# outliers = ['Income', 'MntMeatProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
#             'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth', 'Age', 'TotalTrx', 'ConversionRate'] 

# dfp_noutlier = replace_outliers(dfp, outliers)

Columns with outliers: `Income`, `MntMeatProducts`, `MntSweetProducts`, `MntGoldProds`, `NumDealsPurchases`, `NumWebPurchases`, `NumCatalogPurchases`, `NumWebVisitsMonth`, `Age`, `TotalTrx`, `ConversionRate`
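
The IQR capping rule used by `cap_outliers` above can be checked by hand on a small series. This is a minimal sketch with made-up values; it uses `Series.clip` as a compact equivalent of the two `np.where` calls, which is a substitution for illustration rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (200)
s = pd.Series([10, 12, 13, 15, 16, 18, 20, 200])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 12.75 and 18.5
iqr = q3 - q1                                 # 5.75
lower = q1 - 1.5 * iqr                        # 4.125
upper = q3 + 1.5 * iqr                        # 27.125

# Values outside [lower, upper] are pulled to the nearest bound; no rows are dropped
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Capping keeps every row, which matters here because the later clustering step is applied to all remaining customers.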

#### Feature Selection

In [None]:
dfp_noutlier = dfp_noutlier.drop(columns=['Unnamed: 0', 'ID', 'Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue'])

In [None]:
# dfp_slctd = dfp[['Education', 'Marital_Status', 'Income', 'Recency', 'MntCoke',
#        'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
#        'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
#        'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
#        'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
#        'AcceptedCmp2', 'Response', 'Parent', 'AgeGroup','NumChild', 'TotalAcceptedCmp',
#        'TotalSpending', 'TotalTrx', 'Loyalty', 'ConversionRate']].copy()

# uncssry = ['Unnamed: 0', 'ID', 'Year_Birth', 'Kidhome', 'Teenhome', 'Dt_Customer', 'MntCoke', 'MntFruits', 
#            'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds', 'NumDealsPurchases',
#            'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 
#            'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response', 'Age']
# print(f"drop unecessary features and redundant features : \n{uncssry}")

# display(dfp_slctd.sample(5))

#### Feature Encoding
Features to label Encode :<br>
- Education
- Age Group

Features to One Hot Encode: <br>
- Marital_Status
- Parent
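
Note that `LabelEncoder` assigns integer codes in sorted order of the class names, which does not necessarily follow the education ranking. If an ordinal encoding is wanted, an explicit mapping is one option. The sketch below assumes the levels are ordered SMA < D3 < S1 < S2 < S3; that ordering, the toy `edu` series, and the `edu_order` dict are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical rows)
edu = pd.Series(['S1', 'SMA', 'S3', 'D3', 'S2'])

# LabelEncoder sorts class names: D3=0, S1=1, S2=2, S3=3, SMA=4
le_codes = LabelEncoder().fit_transform(edu)

# Explicit mapping that follows the assumed ranking SMA < D3 < S1 < S2 < S3
edu_order = {'SMA': 0, 'D3': 1, 'S1': 2, 'S2': 3, 'S3': 4}
ordinal_codes = edu.map(edu_order)

print(le_codes.tolist())       # [1, 4, 3, 0, 2]
print(ordinal_codes.tolist())  # [2, 0, 4, 1, 3]
```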

In [None]:
# Label Encoding
# Initialize LabelEncoder as le
le = LabelEncoder()

dfp_noutlier['Education'] = le.fit_transform(dfp_noutlier['Education'])
dfp_noutlier['AgeGroup'] = le.fit_transform(dfp_noutlier['AgeGroup'])


# One hot Encoding
ms_encoded = pd.get_dummies(dfp_noutlier['Marital_Status'], prefix='Status').astype(int)
dfp_noutlier = pd.concat([dfp_noutlier, ms_encoded], axis=1)

parent_encoded = pd.get_dummies(dfp_noutlier['Parent'], prefix='Parent').astype(int)
dfp_noutlier = pd.concat([dfp_noutlier, parent_encoded], axis=1)

# drop marital status and parent column after encoded(redundant)
dfp_noutlier.drop(columns=['Marital_Status', 'Parent'], inplace=True)

print('\ndataframe after feature encoding :')
display(dfp_noutlier.head())

#### Standardization
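
For reference, `StandardScaler` rescales each column to zero mean and unit variance using z = (x - mean) / std, with the population standard deviation. A minimal sketch on a made-up matrix `X` that compares the scaler with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix with two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score; np.std defaults to the population std (ddof=0), as StandardScaler does
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

Putting all columns on the same scale matters for k-means, which relies on Euclidean distances.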

In [None]:
# Initialize standard scaler as scaler
scaler = StandardScaler()
# Standardize the data
scaled_data = scaler.fit_transform(dfp_noutlier)

# new dataframe with scaled data
scaled_dfp = pd.DataFrame(scaled_data, columns=dfp_noutlier.columns, index=dfp_noutlier.index)

print('\ndataframe after scaling (standardized) :')
scaled_dfp.head()

## Task 3 : Modelling
Goals : Group customers into several clusters<br><br>
Objective : 
Apply the k-means clustering algorithm to the existing dataset, choose the number of clusters using the elbow method, and evaluate the result using the silhouette score.
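
As a minimal sketch of the selection idea, the example below fits k-means for several values of k on synthetic blobs and keeps the k with the highest silhouette score. The data, the blob centers, and the variable names are assumptions for illustration; the notebook's own selection uses the PCA-reduced customer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (hypothetical centers)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

On real data the silhouette curve is rarely this clear-cut, so it is usually read together with the inertia (elbow) curve.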

### Apply PCA First
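
Projecting onto two components discards some variance, so it can help to check how much is retained. A minimal sketch on synthetic data (the matrix `X` is made up; in this notebook the input would be `scaled_dfp`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical standardized-like matrix: 200 rows, 6 features
X = rng.normal(size=(200, 6))
# Make two features strongly correlated so the first component captures more variance
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

pca_demo = PCA(n_components=2).fit(X)
ratio = pca_demo.explained_variance_ratio_

print(ratio)        # share of variance explained by PC1 and PC2
print(ratio.sum())  # total share retained by the 2-D projection
```

A low retained share means the 2-D clusters may not reflect structure present in the full feature space.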

In [None]:
pca = PCA(n_components=2)
dfpca = pd.DataFrame(pca.fit_transform(scaled_dfp), columns=['PC1', 'PC2'], index=dfp_noutlier.index)

### Find the optimal n cluster with Elbow Method and Silhouette Method 

In [None]:
inertia = []
silhouette = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=randomstate, n_init="auto")
    kmeans.fit(dfpca)
    inertia.append(kmeans.inertia_)
    cluster_label = kmeans.labels_
    silhouette.append(silhouette_score(dfpca, cluster_label))


fig, ax1 = plt.subplots()
fig.set_facecolor("#E8E8E8")

ax1.set_xlabel("k")
ax1.set_ylabel("inertia score", color="tab:blue")
ax1.plot(
    range(2, 10), inertia, marker="o", linestyle="--", color="tab:blue", label="inertia"
)
ax1.tick_params(axis="y", labelcolor="tab:blue")

ax2 = ax1.twinx()

ax2.set_ylabel("silhouette score", color="tab:red")
ax2.plot(
    range(2, 10),
    silhouette,
    marker="o",
    linestyle="--",
    color="tab:red",
    label="silhouette",
)
ax2.tick_params(axis="y", labelcolor="tab:red")

lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, loc="upper right")

plt.title("Inertia-Silhouette Score")
# plt.grid(False)
plt.show()

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(15, 8))
fig.set_facecolor("#E8E8E8")
for i in range(2, 6):
    kmeans = KMeans(n_clusters=i, random_state=randomstate, n_init='auto')
    q, mod = divmod(i, 2)
    visualizer = SilhouetteVisualizer(kmeans, colors="yellowbrick", ax=ax[q - 1][mod])
    visualizer.fit(dfpca)
    ax[q - 1][mod].set_title(f'Silhouette plot for {i} clusters', fontsize=12, fontweight='bold')
    ax[q - 1][mod].set_xlabel('Silhouette Coefficient Values')  # Set x-label
    ax[q - 1][mod].set_ylabel('Cluster Label')  # Set y-label
plt.tight_layout()

Based on the elbow and silhouette plots above, the optimal number of clusters is 4 (`n_clusters = 4`).
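
Cluster ids returned by k-means are arbitrary, so names such as "High Spender" are easier to assign when clusters are ranked by a summary statistic instead of a hand-written order. A minimal sketch with made-up values (the `toy` frame and its numbers are hypothetical):

```python
import pandas as pd

# Hypothetical cluster assignments and spending values
toy = pd.DataFrame({
    'cluster':       [0, 0, 1, 1, 2, 2, 3, 3],
    'TotalSpending': [400, 500, 200, 240, 900, 1100, 50, 70],
})

# Mean spending per cluster, then cluster ids sorted from highest to lowest
mean_spend = toy.groupby('cluster')['TotalSpending'].mean()
cluster_order = mean_spend.sort_values(ascending=False).index.tolist()

print(cluster_order)  # [2, 0, 1, 3]
```

Deriving the order this way keeps the labels consistent if the cluster ids change between runs.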

In [None]:
k_optimal = 4
kmeans = KMeans(n_clusters=k_optimal, random_state=randomstate, n_init='auto')
kmeans.fit(dfpca)
dfpca['cluster'] = kmeans.labels_
dfpca

In [None]:
plt.figure(figsize=(12,8), facecolor='#E8E8E8')
sns.scatterplot(x='PC1', y='PC2', hue='cluster', data=dfpca, palette='Set1')

centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=200, alpha=0.8, marker='x')

plt.title('K-Means Clustering', fontsize=18, fontweight='bold', y=1.03)
plt.xlabel('PCA 1', fontsize=12)
plt.ylabel('PCA 2', fontsize=12)
plt.show()


In [None]:
df_clust = dfp_noutlier.copy()
label = dfpca['cluster']
df_clust['cluster'] = label
# df_clust

In [None]:
features = ['Income', 'TotalSpending', 'ConversionRate', 'Loyalty', 'TotalTrx', 'Recency']

n = len(features)
ncols = 2
nrows = n // ncols if n % ncols == 0 else n // ncols + 1

# Create a figure and a grid of subplots
fig, ax = plt.subplots(nrows, ncols, figsize=(15, nrows*5))
fig.set_facecolor('#E8E8E8')

# Flatten the axes array
ax = ax.flatten()

# cluster order
cluster_order = [1, 2, 3, 0]

# Create subplots for each feature
for i, feature in enumerate(features):
    sns.boxplot(data=df_clust, y=feature, x='cluster', hue='cluster', palette='Set1', ax=ax[i], order=cluster_order, hue_order=cluster_order)
    ax[i].set_title(feature)
    ax[i].grid(False)

    # Change the labels of the hue
    hue_labels = ['High Spender', 'Mid Spender', 'Low Spender', 'Risk Churn']
    legend = ax[i].get_legend()
    for text, label in zip(legend.texts, hue_labels):
        text.set_text(label)

# Remove unused subplots
if n < nrows * ncols:
    for i in range(n, nrows * ncols):
        fig.delaxes(ax[i])

plt.tight_layout()
plt.show()

In [None]:
features = ['Income', 'TotalSpending', 'ConversionRate', 'Loyalty', 'TotalTrx', 'Recency']

n = len(features)
ncols = 2
nrows = n // ncols if n % ncols == 0 else n // ncols + 1

# Create a figure and a grid of subplots
fig, ax = plt.subplots(nrows, ncols, figsize=(15, nrows*5))
fig.set_facecolor('#E8E8E8')

# Flatten the axes array
ax = ax.flatten()

# cluster order
cluster_order = [2, 0, 3, 1]

# Create subplots for each feature
for i, feature in enumerate(features):
    sns.boxplot(data=df_clust, y=feature, x='cluster', hue='cluster', palette='Set1', ax=ax[i], order=cluster_order, hue_order=cluster_order)
    ax[i].set_title(feature)
    ax[i].grid(False)

    # Change the labels of the hue
    hue_labels = ['High Spender', 'Mid Spender', 'Low Spender', 'Risk Churn']
    legend = ax[i].get_legend()
    for text, label in zip(legend.texts, hue_labels):
        text.set_text(label)

# Remove unused subplots
if n < nrows * ncols:
    for i in range(n, nrows * ncols):
        fig.delaxes(ax[i])

plt.tight_layout()
plt.show()