Predicting Customer Personality to Boost Marketing Campaigns Using Machine Learning


“A company can grow rapidly when it understands the behavior and personality of its customers, because it can then offer better services and benefits to customers who have the potential to become loyal. We process historical marketing campaign data to improve campaign performance and target the right customers, so that they transact on the company’s platform. From the insights in this data, our goal is to build a cluster prediction model that makes it easier for the company to make decisions.”

Load Data

import pandas as pd
df = pd.read_csv("marketing_campaign_data.csv")

Feature Engineering

Conversion rate

df['conversion_rate'] = df['Response'] / df['NumWebVisitsMonth']
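A plain division like the one above returns NaN when both Response and NumWebVisitsMonth are zero, and inf when a customer responded without any recorded web visit. The sketch below shows a guarded division on a small made-up frame; the values and the names toy, raw_rate and safe_rate are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the last two rows have zero web visits
toy = pd.DataFrame({
    "Response": [1, 0, 0, 1],
    "NumWebVisitsMonth": [4, 5, 0, 0],
})

# Plain division: 0/0 gives NaN and 1/0 gives inf
raw_rate = toy["Response"] / toy["NumWebVisitsMonth"]

# Guarded division: turn zero visits into NaN first, so inf cannot appear
visits = toy["NumWebVisitsMonth"].replace(0, np.nan)
safe_rate = toy["Response"] / visits

print(raw_rate.tolist())
print(safe_rate.tolist())
```

With the guarded version every undefined rate is a NaN, so a later fillna call handles all of them in one place.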


Age group

def kelompok_usia(x):
    # Lansia = elderly, Dewasa = adult, Remaja = teenager
    if x['Year_Birth'] <= 1954:
        kelompok = 'Lansia'
    elif x['Year_Birth'] <= 1993:
        kelompok = 'Dewasa'
    else:
        kelompok = 'Remaja'
    return kelompok

df['grup_umur'] = df.apply(kelompok_usia, axis=1)
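The same age grouping can also be written without a row-wise apply. The sketch below uses pd.cut on a made-up frame and assumes the cut-off years used in the function above (1954 and 1993); it is one possible alternative, not the code the analysis was run with.

```python
import pandas as pd

# Made-up birth years that cover every boundary case
toy = pd.DataFrame({"Year_Birth": [1940, 1954, 1955, 1980, 1993, 1996]})

# Bins are right-inclusive: (-inf, 1954], (1954, 1993], (1993, inf]
toy["grup_umur"] = pd.cut(
    toy["Year_Birth"],
    bins=[float("-inf"), 1954, 1993, float("inf")],
    labels=["Lansia", "Dewasa", "Remaja"],
)

print(toy)
```

pd.cut returns a categorical column, which keeps the three groups explicit and works on the whole column at once.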

Social status

def kesejahteraan_masyakat(x):
    # Kaya = wealthy, Biasa aja = average
    if x['Income'] >= 5.174150e+07:
        kelompok = 'Kaya'
    else:
        kelompok = 'Biasa aja'
    return kelompok

df['grup_income'] = df.apply(kesejahteraan_masyakat, axis=1)

Number of children, Total transactions and Total expenses

df['Total_Purchases'] = df['NumDealsPurchases'] + df['NumWebPurchases']+df['NumCatalogPurchases']+df['NumStorePurchases']+df['NumWebVisitsMonth']
df['jumlah_anak'] = df['Kidhome'] + df['Teenhome']
df['total_pembelian'] = df['MntCoke']+df['MntFruits']+df['MntMeatProducts']+df['MntFishProducts']+df['MntSweetProducts']+df['MntGoldProds']
df['Total_Transaksi'] = df['Income'] - df['total_pembelian'] 
df['total_acc_cmp'] = df['AcceptedCmp2'] + df['AcceptedCmp1'] + df['AcceptedCmp5'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] 

Exploratory Data Analysis

Univariate Analysis

Numerical Boxplot


The boxplots above show a few outliers, but they lie close to the rest of the data rather than far beyond the whiskers. This indicates that the outliers are still within a reasonable range of values for the data.
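For reference, boxplot outliers are usually flagged with the 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn. The sketch below applies that rule to made-up numbers (not the campaign data) to show how the bounds are derived.

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these bounds are drawn as individual outlier markers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```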

Numerical Distplot


  1. The following variables are approximately normally distributed: Total_Transaksi, NumWebVisitsMonth, NumStorePurchases, NumWebPurchases, NumDealsPurchases, Recency, Year_Birth
  2. The following variables are positively skewed (long right tail): MntCoke, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, conversion_rate
  3. The following variables are bimodal or have more than one mode: total_acc_cmp, jumlah_anak, Kidhome, Teenhome
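Reading skew from a plot is subjective, so it can help to back the list above with a number. The sketch below computes the sample skewness with pandas on two made-up columns; values near 0 suggest a symmetric shape and clearly positive values suggest a long right tail.

```python
import pandas as pd

# Made-up columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 20],
})

skew = toy.skew()
print(skew)
```

On the real data the same call would be df.skew(numeric_only=True), assuming df holds the engineered features.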

Multivariate Analysis


Conversion Rate

Conversion Rate Based On Age

The plot suggests a relationship between customer age and conversion rate: adults tend to have a higher conversion rate than teenagers and the elderly. A likely explanation is that adults are in their active years, have a higher income than teenagers, and are more active than the elderly.

Conversion Rate Based On Number of Children

The graph above shows the relationship between conversion rate and the number of children. It can be seen that people with no children tend to have a higher conversion rate than people with one or more children.
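A comparison like this boils down to a group-wise mean. The sketch below shows the aggregation on made-up numbers, so the values say nothing about the real customers; only the column names jumlah_anak and conversion_rate come from the article.

```python
import pandas as pd

# Made-up conversion rates for customers with 0, 1 and 2 children
toy = pd.DataFrame({
    "jumlah_anak": [0, 0, 1, 1, 2, 2],
    "conversion_rate": [0.50, 0.30, 0.20, 0.10, 0.00, 0.10],
})

# Mean conversion rate per number of children
rate_by_children = toy.groupby("jumlah_anak")["conversion_rate"].mean()
print(rate_by_children)
```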

Data Preprocessing

According to the results, Income has 24 null values, conversion_rate has 11, and Total_Transaksi has 24.
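The matching counts for Income and Total_Transaksi are expected, because Total_Transaksi is computed as Income minus total_pembelian, so every missing Income propagates into it. A small sketch with made-up values illustrates this.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second customer has no recorded income
toy = pd.DataFrame({
    "Income": [50_000_000, np.nan, 30_000_000],
    "total_pembelian": [1_000_000, 500_000, 200_000],
})

# Any arithmetic with NaN yields NaN
toy["Total_Transaksi"] = toy["Income"] - toy["total_pembelian"]

null_counts = toy.isna().sum()
print(null_counts)
```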

Handling Missing Value

We handle missing values with the following code:

# Assign the result back instead of using inplace=True on a column
df['Income'] = df['Income'].fillna(df['Income'].mean())
df['conversion_rate'] = df['conversion_rate'].fillna(0)
df['Total_Transaksi'] = df['Total_Transaksi'].fillna(df['Total_Transaksi'].mean())

Handling Duplicated Data

There are no duplicates in our data.
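A duplicate check of this kind can be sketched as follows. The frame is made up and contains one repeated row on purpose, so the count here is 1 rather than the 0 reported for the campaign data.

```python
import pandas as pd

# Made-up frame with one fully repeated row
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "Income": [10, 20, 20, 30]})

# duplicated() marks every repeat of an earlier row
n_dup = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_dup, len(deduped))
```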


Drop Data

We will remove the columns that are not needed.

df.drop(columns = ['Unnamed: 0','ID', 'Kidhome', 'Teenhome','Z_CostContact', 'Z_Revenue','Dt_Customer'], inplace=True)

Feature Encoding

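# S3 = doctorate, S2 = master's, S1 = bachelor's, D3 = three-year diploma, SMA = senior high school
df['Education'] = df['Education'].map({'S3' : 4, 'S2' : 3, 'S1':2, 'D3':1, 'SMA':0})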
df['grup_income'] = df['grup_income'].map({'Kaya':1, 'Biasa aja':0})
df['grup_umur'] = df['grup_umur'].map({'Dewasa' : 1, 'Lansia': 0, 'Remaja':2})
df['Marital_Status'] = df['Marital_Status'].map({'Single' : 0, 'Couple' : 1})
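One thing to watch with map and a dictionary is that any category missing from the dictionary silently becomes NaN. The sketch below shows a quick check for unmapped values; the 'PhD' entry is a made-up category added only to trigger the case.

```python
import pandas as pd

# Made-up column; 'PhD' is deliberately absent from the mapping
education = pd.Series(["S1", "SMA", "S2", "PhD"], name="Education")
mapping = {"S3": 4, "S2": 3, "S1": 2, "D3": 1, "SMA": 0}

encoded = education.map(mapping)

# Original labels that did not get a code
unmapped = education[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means every category was covered by the mapping.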

Standardization of Features

from sklearn.preprocessing import StandardScaler, MinMaxScaler
scd = StandardScaler()
y_fit = scd.fit_transform(df.astype(float))
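StandardScaler rescales each column to zero mean and unit variance, z = (x − mean) / std, using the population standard deviation. The sketch below checks this against a manual computation on a made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score; np.std uses ddof=0 by default, like StandardScaler
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

Scaling matters for k-means because the algorithm relies on Euclidean distance, so an unscaled large-valued column would dominate the clustering.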

K-Means Clustering - PCA

cluster = df[['Recency', 'Total_Purchases', 'total_pembelian']].copy()
cluster.columns = ['Recency','Frequency','Monetary']
features = ['Recency','Frequency','Monetary']

Let's plot the distribution of each RFM feature.


cols = cluster.columns
plt.figure(figsize= (15, 20))
for i in range(len(cols)):
    plt.subplot(6, 2, i+1)
    sns.kdeplot(x = cluster[cols[i]])



cols = cluster.columns
plt.figure(figsize= (10,15))
for i in range(len(cols)):
    plt.subplot(4, 4, i+1)
    sns.boxplot(y = cluster[cols[i]], orient='v')


Looks like we have a few outliers. Time to handle them.

Handling Outliers

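for col in cols:
    high_cut = cluster[col].quantile(q=0.99)
    low_cut = cluster[col].quantile(q=0.01)
    # Assumption: the cut-offs are used to keep only rows between the 1st and 99th percentiles
    cluster = cluster[(cluster[col] >= low_cut) & (cluster[col] <= high_cut)]

It turns out that there are still some outliers in the Monetary data. Let's handle them with a log transformation.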

tf_log = cluster.copy()
tf_log['Monetary'] = np.log(cluster['Monetary'])

plt.figure(figsize= (5, 5))
sns.kdeplot(x = tf_log['Monetary'])
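One caveat with a plain logarithm is that it maps a value of 0 to negative infinity. Whether the Monetary column contains zeros is not stated in the article, so the sketch below is only a precaution: it compares np.log with np.log1p, which computes log(1 + x), on made-up amounts.

```python
import numpy as np
import pandas as pd

# Made-up spending amounts, including a zero
monetary = pd.Series([0, 9, 99, 999])

# np.log(0) is -inf; the warning is silenced to keep the output clean
with np.errstate(divide="ignore"):
    plain_log = np.log(monetary)

# log1p(x) = log(1 + x) stays finite at zero
shifted_log = np.log1p(monetary)

print(plain_log.tolist())
print(shifted_log.tolist())
```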


Implementing clustering using k-means clustering

inertia = []

for i in range(1,11):
    kmeans = KMeans(n_clusters = i, max_iter = 300, n_init=10, random_state = 42)
    # Assumption: the model is fit on the log-transformed RFM table (tf_log)
    kmeans.fit(tf_log[features])
    inertia.append(kmeans.inertia_)

sns.lineplot(x=range(1,11), y = inertia, color = 'purple')
sns.scatterplot(x=range(1,11), y = inertia, s = 50, color = 'blue')
# Mark the elbow at k = 4
circle = Ellipse((4, 45000), width=0.3, height=2000, color='red', fill=False, linewidth=2)
plt.gca().add_patch(circle)

The inertia curve flattens between k = 4 and k = 5, which marks the elbow. Therefore, n_clusters = 4 is chosen for the k-means clustering model.
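The elbow method can be sketched end to end on synthetic data. The example below uses make_blobs with four centers as a stand-in for the RFM table (an assumption made purely for illustration) and prints the relative drop in inertia for each extra cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the RFM features: four well-separated groups
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)

inertia = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
    model.fit(X)
    # inertia_ is the within-cluster sum of squared distances
    inertia.append(model.inertia_)

# Relative improvement when moving from k to k + 1 clusters
drops = [(inertia[i] - inertia[i + 1]) / inertia[i] for i in range(len(inertia) - 1)]
for k, drop in enumerate(drops, start=1):
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.1%}")
```

The elbow is the last k before the improvements become small, which is the same reading applied to the plot in the article.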

Next, we calculate the silhouette score to evaluate the clustering for several values of k.

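n_cluster = [4,5,6,8,9,10]
fig, ax = plt.subplots(2, 3, figsize=(15,8))
for i in n_cluster:
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    # divmod places k = 4, 5, 6 on the top row and k = 8, 9, 10 on the bottom row
    q, mod = divmod(i, 4)
    visualizer = SilhouetteVisualizer(kmeans, colors='yellowbrick', ax=ax[q-1][mod])
    # Assumption: the visualizer is fit on the log-transformed RFM table (tf_log)
    visualizer.fit(tf_log[features])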

The best silhouette plot is the one on the lower right, with an average score of about 0.6. In general, a silhouette value close to 1 indicates that the points in a cluster sit close to each other and far from the other clusters, so the clustering is very good.
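For a single point, the silhouette value is s = (b − a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes the average score with scikit-learn's silhouette_score on synthetic blobs, which stand in for the RFM table as an illustrative assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the RFM features
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette value over all points, between -1 and 1
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

Comparing the averages numerically complements the silhouette plots, which additionally show how evenly the clusters are sized.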

Principal Component Analysis

# Compare the PCA scatter plot with the previous scatter plot
random.shuffle(colors)  # shuffles in place and returns None, so call it before plotting
sns.pairplot(data=df_pca, hue='Labels', diag_kind='kde', palette=colors)
plt.tight_layout()
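The article does not show how df_pca is built, so the following is only a sketch of one plausible construction: standardize the RFM features, cluster them with k-means, project them onto two principal components, and attach the labels. The data is randomly generated and the column names PC1 and PC2 are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the RFM table
rng = np.random.default_rng(42)
rfm = pd.DataFrame({
    "Recency": rng.integers(0, 100, size=200),
    "Frequency": rng.integers(1, 40, size=200),
    "Monetary": rng.lognormal(mean=12, sigma=1, size=200),
})

scaled = StandardScaler().fit_transform(rfm)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)

# Reduce the three scaled features to two components for plotting
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

df_pca = pd.DataFrame(components, columns=["PC1", "PC2"])
df_pca["Labels"] = labels

print(df_pca.head())
print("explained variance:", pca.explained_variance_ratio_.sum())
```

The explained variance ratio indicates how much of the original spread the two plotted components retain.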


Customer Personality Analysis for Marketing Retargeting

c = ['#957DAD','#E0BBe4','#B7D3DF','#CDE8E6']

def dist_feats(features):
    i = 1
    for feats in features:
        ax = plt.subplot(1,len(features),i)
        ax.vlines(cluster[feats].median(), ymin=-0.5, ymax=3, color='black', linewidth=1,  linestyle='--')
        dfg = cluster.groupby('Labels')
        x = dfg[feats].median().index
        y = dfg[feats].median().values
        ax.barh(x,y, color=c)
        i = i+1



Insights for each feature:

• R, Recency: The higher the recency value, the longer it has been since the customer's last purchase.
• F, Total_Purchases: The higher the frequency value, the more often the customer makes a purchase.
• M, total_pembelian: The higher the monetary value, the more money the customer spends on purchases.

From the visualization above, we can draw the following conclusions:

• Label 0 = has a high R pattern as well as F and M below the median.
• Label 1 = has a high F and M pattern as well as R below the median.
• Label 2 = has a low F, M, and R pattern.
• Label 3 = has a high F, M, and R pattern.
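The high and low patterns listed above come from comparing each cluster's median with the overall median. The sketch below reproduces that comparison on made-up numbers chosen to mimic the four patterns; the values are not the real cluster medians.

```python
import pandas as pd

# Made-up customers, two per label, mimicking the patterns described above
toy = pd.DataFrame({
    "Recency":   [80, 70, 20, 25, 22, 18, 75, 72],
    "Frequency": [5, 6, 25, 24, 4, 5, 26, 27],
    "Monetary":  [50, 60, 900, 950, 40, 45, 1000, 980],
    "Labels":    [0, 0, 1, 1, 2, 2, 3, 3],
})
features = ["Recency", "Frequency", "Monetary"]

cluster_median = toy.groupby("Labels")[features].median()
overall_median = toy[features].median()

# True where a cluster's median exceeds the overall median
above_median = cluster_median.gt(overall_median, axis="columns")
print(above_median)
```

Each row of the resulting table reads directly as a high or low flag for R, F and M.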

• Cluster 0: Most Loyal Customers:
Customers in this cluster last interacted with the business about 74 days ago, with low shopping frequency and low spending (F and M are below the median).

• Cluster 1: New Customers:
Customers in this cluster have just interacted with the business within the last 22 days, with high shopping frequency and significant spending.

• Cluster 2: Impactful Customers:
Customers in this cluster have just interacted with the business within the last 24 days, with low shopping frequency and a fair amount of spending.

• Cluster 3: Passive Customers:
Customers in this cluster last interacted with the business 73 days ago, with high shopping frequency and significant spending.

Selecting clusters for marketing retargeting:

Cluster 3 & Cluster 1: These clusters are good targets for retargeting because of their high shopping frequency and spending. Marketing strategies can focus on offering exclusive deals or purchase bonuses to increase customer loyalty in these groups.

Calculating the potential impact of marketing retargeting results from existing clusters

Cluster_0 = cluster[cluster['Labels'] == 0]['Monetary'].sum()
Cluster_1 = cluster[cluster['Labels'] == 1]['Monetary'].sum()
Cluster_2 = cluster[cluster['Labels'] == 2]['Monetary'].sum()
Cluster_3 = cluster[cluster['Labels'] == 3]['Monetary'].sum()
total_spent  = Cluster_0 + Cluster_1 + Cluster_2 + Cluster_3
potential_impact_cluster_3 = (Cluster_3 / total_spent) * 100
potential_impact_cluster_1 = (Cluster_1 / total_spent) * 100
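The potential impact computed above is each cluster's share of total spending, share = cluster total / total spent × 100. As a worked check, the sketch below repeats the arithmetic with the cluster totals that the article reports in its printed output; the dictionary layout and variable names are illustrative.

```python
# Cluster totals in rupiah, taken from the article's printed output
totals = {0: 88_453_000, 1: 557_668_000, 2: 80_137_000, 3: 626_855_000}

total_spent = sum(totals.values())

# Share of total spending per cluster, in percent
share = {label: spent / total_spent * 100 for label, spent in totals.items()}
combined_share = share[1] + share[3]

for label, pct in share.items():
    print(f"Cluster {label}: Rp {totals[label]:,} ({pct:.2f}%)")
print(f"Clusters 1 and 3 combined: {combined_share:.2f}%")
```

Clusters 1 and 3 together account for roughly 87.5% of spending, which is why they are the retargeting candidates.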

print('Total Spent of Cluster 0: Rp', Cluster_0)
print('Total Spent of Cluster 1: Rp', Cluster_1)
print('Total Spent of Cluster 2: Rp', Cluster_2)
print('Total Spent of Cluster 3: Rp', Cluster_3)
print('Total Spent: Rp', total_spent)
print('Potential Impact of Cluster 3: {:.2f}%'.format(potential_impact_cluster_3))
print('Potential Impact of Cluster 1: {:.2f}%'.format(potential_impact_cluster_1))

Total Spent of Cluster 0: Rp 88453000
Total Spent of Cluster 1: Rp 557668000
Total Spent of Cluster 2: Rp 80137000
Total Spent of Cluster 3: Rp 626855000
Total Spent: Rp 1353113000
Potential Impact of Cluster 3: 46.33%
Potential Impact of Cluster 1: 41.21%

"Focusing our retargeting efforts on Cluster 3 and Cluster 1 could yield significant returns. Cluster 3 accounts for about Rp 626.9 million of spending and Cluster 1 for about Rp 557.7 million, which is 46.3% and 41.2% of total spending respectively. In essence, prioritizing these clusters is a promising way to boost revenue and customer engagement."


