Predict Customer Personality to Boost Marketing Campaign


A company can grow quickly when it understands its customers' behavior, because it can then offer better services and benefits to the customers most likely to become loyal. This project processes historical marketing campaign data to improve campaign performance and target the customers most likely to make transactions on the company's platform. Using insights from this data, the goal is to build a cluster prediction model that makes it easier for the company to make decisions.

Points to Analyze

  • Conversion Rate Analysis Based On Income, Spending And Age
  • Data Modeling
  • Customer Personality Analysis for Marketing Retargeting

Data Overview

Feature Name         Description
Unnamed : 0          Index number
ID                   Customer's unique identifier
Year_Birth           Customer's birth year
Education            Customer's education level
Marital_Status       Customer's marital status
Income               Customer's yearly household income
Kidhome              Number of children in customer's household
Teenhome             Number of teenagers in customer's household
Dt_Customer          Date of customer's enrollment with the company
Recency              Number of days since customer's last purchase
Complain             1 if the customer complained in the last 2 years, 0 otherwise
MntCoke              Amount spent on coke in last 2 years
MntFruits            Amount spent on fruits in last 2 years
MntMeatProducts      Amount spent on meat in last 2 years
MntFishProducts      Amount spent on fish in last 2 years
MntSweetProducts     Amount spent on sweets in last 2 years
MntGoldProds         Amount spent on gold in last 2 years
NumDealsPurchases    Number of purchases made with a discount
AcceptedCmp1         1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2         1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3         1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4         1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5         1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response             1 if the customer accepted the offer in the last campaign, 0 otherwise
NumWebPurchases      Number of purchases made through the company’s website
NumCatalogPurchases  Number of purchases made using a catalog
NumStorePurchases    Number of purchases made directly in stores
NumWebVisitsMonth    Number of visits to the company’s website in the last month
Z_CostContact        Cost to contact a customer
Z_Revenue            Revenue after the customer accepts a campaign

  • The dataset has 2240 rows and 30 features
  • Only one column has null values: Income (24 nulls)
  • The Dt_Customer column needs to be converted to a datetime type
  • There are no duplicate rows
  • Most of the data is numerical, with relatively few outliers
  • Feature extraction adds features such as age, number of children, number of transactions, total spending, and conversion rate, bringing the total to 36 features
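
The preparation steps listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, and the median fill for Income, the `reference_year` argument, the choice of purchase columns summed into TotalTrx, and the CVR definition (purchases per web visit) are all assumptions.

```python
import pandas as pd

SPENDING_COLS = ["MntCoke", "MntFruits", "MntMeatProducts",
                 "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
PURCHASE_COLS = ["NumDealsPurchases", "NumWebPurchases",
                 "NumCatalogPurchases", "NumStorePurchases"]
CAMPAIGN_COLS = [f"AcceptedCmp{i}" for i in range(1, 6)]


def add_features(df: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Return a copy of df with cleaned types and derived features."""
    out = df.copy()
    out["Dt_Customer"] = pd.to_datetime(out["Dt_Customer"])
    # Assumption: missing Income is filled with the median.
    out["Income"] = out["Income"].fillna(out["Income"].median())
    out["Age"] = reference_year - out["Year_Birth"]
    out["Children"] = out["Kidhome"] + out["Teenhome"]
    out["TotalSpending"] = out[SPENDING_COLS].sum(axis=1)
    out["TotalTrx"] = out[PURCHASE_COLS].sum(axis=1)
    out["TotalAccCmp"] = out[CAMPAIGN_COLS].sum(axis=1)
    # Assumption: CVR = purchases per web visit; zero visits give CVR 0.
    visits = out["NumWebVisitsMonth"].where(out["NumWebVisitsMonth"] > 0)
    out["CVR"] = (out["TotalTrx"] / visits).fillna(0.0)
    return out


# Two made-up rows that use the column names from the data overview.
sample = pd.DataFrame({
    "Year_Birth": [1980, 1995],
    "Income": [50.0, None],
    "Kidhome": [1, 0],
    "Teenhome": [0, 1],
    "Dt_Customer": ["2013-05-01", "2014-01-15"],
    "MntCoke": [10, 1], "MntFruits": [5, 1], "MntMeatProducts": [20, 1],
    "MntFishProducts": [0, 1], "MntSweetProducts": [3, 1], "MntGoldProds": [2, 1],
    "NumDealsPurchases": [1, 0], "NumWebPurchases": [2, 1],
    "NumCatalogPurchases": [1, 0], "NumStorePurchases": [4, 1],
    "AcceptedCmp1": [0, 0], "AcceptedCmp2": [1, 0], "AcceptedCmp3": [0, 0],
    "AcceptedCmp4": [0, 0], "AcceptedCmp5": [0, 0],
    "NumWebVisitsMonth": [4, 0],
})
features = add_features(sample, reference_year=2022)
print(features[["Age", "Children", "TotalSpending", "TotalTrx", "CVR"]])
```

The formulas used in the Jupyter Notebook may differ, so the definitions here should be checked against it before reuse.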

Exploratory Data Analysis

1. Data Distribution

The distributions show that many features are close to normal, although Children and TotalAccCmp take only a small range of values. The remaining features are right-skewed.

2. Outliers Checking

Age, Income, TotalSpending, TotalTrx, and CVR all have outliers. The Age outliers are implausible (customers more than 80 years old), so it is best to drop those rows to keep them from distorting the clustering. The same applies to the Income outliers, which exceed 600,000,000. TotalSpending, TotalTrx, and CVR also show outliers that need further handling.
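
One common way to handle such outliers is an interquartile-range (IQR) filter. The sketch below is an assumption about how this could be done, not necessarily the method used in the notebook; the helper names, the 1.5 multiplier, and the sample values are illustrative.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Keep only rows that fall inside the IQR fences for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        low, high = iqr_bounds(df[col], k)
        # between() is inclusive; rows with missing values are dropped too.
        mask &= df[col].between(low, high)
    return df[mask]


# Made-up values: the Age of 120 is the only point outside the fences.
sample = pd.DataFrame({
    "Age": [30, 35, 40, 45, 50, 120],
    "Income": [10, 20, 30, 40, 50, 60],
})
cleaned = drop_outliers(sample, ["Age", "Income"])
print(len(sample), "->", len(cleaned), "rows")
```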

3. Regression Plot of Features and Conversion Rate


4. Categorical Features
The categorical features look clean, but Marital Status can be simplified into fewer values.
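
Simplifying a categorical column usually comes down to a mapping from raw labels to broader groups. In the sketch below both the raw labels and the target groups are hypothetical; they should be replaced with the values actually present in Marital_Status.

```python
import pandas as pd

# Hypothetical labels and grouping, for illustration only.
STATUS_MAP = {
    "Married": "Partnered",
    "Together": "Partnered",
    "Single": "Single",
    "Divorced": "Single",
    "Widow": "Single",
}


def simplify_status(status: pd.Series) -> pd.Series:
    """Map raw labels to broader groups; unmapped labels are kept unchanged."""
    return status.map(STATUS_MAP).fillna(status)


raw = pd.Series(["Married", "Together", "Widow", "Other"])
simplified = simplify_status(raw)
print(simplified.tolist())
```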

Business Insight

Conversion rate analysis looks for insight into the percentage of website visitors whose visit results in a purchase transaction. It relies on feature engineering: the existing variables are combined into a new column, the Conversion Rate.

1. Conversion Rate Based on Age

Based on the cleaned data, the youngest customer is 27 and the oldest is 80. Customers in their late twenties to thirties are the most promising, since the graph shows their conversion rate is the highest. The least promising is the middle group, ages 41-50. The conversion rate falls from the youngest group to this lowest group, then rises again for older customers (>51 years old).
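
A grouped view like this can be produced by binning the feature and averaging the conversion rate per bin. The following is a minimal sketch with made-up rows; the bin edges, labels, and the helper name `cvr_by_group` are assumptions.

```python
import pandas as pd


def cvr_by_group(df: pd.DataFrame, column: str, bins: list, labels: list) -> pd.Series:
    """Mean CVR per bin of `column` (bins are right-inclusive)."""
    groups = pd.cut(df[column], bins=bins, labels=labels, right=True)
    return df.groupby(groups, observed=False)["CVR"].mean()


# Made-up rows, chosen only to show the mechanics.
sample = pd.DataFrame({
    "Age": [28, 35, 45, 48, 60],
    "CVR": [1.0, 0.75, 0.25, 0.5, 0.625],
})
age_cvr = cvr_by_group(
    sample, "Age",
    bins=[20, 30, 40, 50, 80],
    labels=["21-30", "31-40", "41-50", "51-80"],
)
print(age_cvr)
```

The same helper could be applied to Income or TotalSpending by passing different bin edges.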

2. Conversion Rate Based on Income

The conversion rate tends to increase with income. The highest conversion rate comes from the 90-100M income group. This suggests a positive, roughly linear relationship between income and the conversion rate.

3. Conversion Rate Based on Spending

Customer spending is strongly correlated with the conversion rate: the higher the spending, the higher the conversion rate, and the more likely the customer is to make further transactions.


Data Modeling

Before modeling, make sure the data has been cleaned and preprocessed (the detailed steps are in the Jupyter Notebook). In this stage, we cluster the data based on several variables.

a. Elbow Method

First, let's use the elbow method and visualize the inertia. The elbow method is often used to choose the number of clusters for K-Means clustering. Inertia measures how tightly a dataset was clustered by K-Means: it is the sum, over all data points, of the squared distance between each point and its cluster centroid. Inertia is plotted against the number of clusters, and the "elbow" where the curve stops dropping sharply suggests a suitable value.
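
The inertia curve can be computed with scikit-learn's `KMeans`, whose fitted model exposes `inertia_`. The sketch below uses synthetic data with four well-separated groups instead of the customer features; the helper name `inertia_by_k` and the seed are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(X: np.ndarray, k_values, seed: int = 0) -> dict:
    """Fit K-Means once per k and collect the resulting inertia."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


# Synthetic stand-in for the scaled customer features: four tight blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = inertia_by_k(X, range(1, 8))
for k, value in inertias.items():
    print(k, round(value, 1))
```

On real data the features would normally be scaled first (for example with `StandardScaler`), since K-Means distances are sensitive to feature magnitude.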


b. Silhouette Score

The silhouette score of a point compares how close that point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster. It ranges from -1 to 1, and a higher average score indicates better-separated clusters, which helps decide whether the current clustering needs further refinement.

Based on the elbow method and silhouette score, the optimal number of clusters is 4, and the data is well distributed across the clusters.
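
A silhouette comparison across candidate cluster counts can be sketched with scikit-learn's `silhouette_score`. The data here is synthetic (four well-separated groups), so the result only illustrates the procedure; the helper name `silhouette_by_k` is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X: np.ndarray, k_values, seed: int = 0) -> dict:
    """Mean silhouette score for each candidate number of clusters (k >= 2)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in for the scaled customer features: four tight blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = silhouette_by_k(X, range(2, 8))
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```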

Customer Personality Analysis for Each Cluster

The distribution of each cluster can be seen below.


The clustering results can be interpreted based on the characteristics of each group, how each cluster tends to respond to existing marketing campaigns, and the potential revenue from retargeting that cluster.
Now, let's look at the statistics for each cluster across several features (Recency, Total Transactions, Spending, Total Accepted Campaign, and Conversion Rate).
The graph shows the median of these features for each cluster.

Recency and Age do little to differentiate the clusters because the gap between clusters is small. We only know that cluster 0 has the highest recency.
Meanwhile, Total Transactions has a similar mean and median, and clusters 0 and 3 are the highest. The other features all follow a similar pattern, with the clusters ranked by potential as 0 > 3 > 2 > 1.
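
Per-cluster statistics like these can be computed with a grouped median. The sketch below assumes the cluster labels are stored in a column named `Cluster` (a hypothetical name) and uses made-up numbers, not the project's results.

```python
import pandas as pd


def cluster_profile(df: pd.DataFrame, features: list[str],
                    cluster_col: str = "Cluster") -> pd.DataFrame:
    """Median of each feature per cluster, plus the cluster size."""
    profile = df.groupby(cluster_col)[features].median()
    profile["Customers"] = df.groupby(cluster_col).size()
    return profile


# Made-up rows, two per cluster, for illustration only.
sample = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "TotalSpending": [900, 1100, 40, 60, 200, 300, 500, 700],
    "CVR": [4.0, 6.0, 0.5, 1.5, 1.0, 3.0, 2.0, 4.0],
})
profile = cluster_profile(sample, ["TotalSpending", "CVR"])
# Rank clusters from highest to lowest median spending.
ranking = profile.sort_values("TotalSpending", ascending=False).index.tolist()
print(profile)
print(ranking)
```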

Cluster 0 (The Most Potential Customer)

Customers in this cluster tend to respond to existing marketing campaigns. The cluster has the most total transactions, the highest income and spending, and the highest Conversion Rate. Rewards or occasional gifts are highly recommended for this cluster. The best campaign for Cluster 0 is a special gift after spending a certain amount (for example, a minimum transaction of 1 million).

Cluster 3 (The 2nd Potential Customer)

This cluster has about as many transactions as Cluster 0 but lower spending. These customers may transact often but in small amounts, since their income is also lower than Cluster 0's. However, their Conversion Rate is low compared to Cluster 0. This may indicate that the large number of total transactions comes from the large number of customers, since this cluster has the most customers (615), rather than from a strong tendency to convert. The best campaigns for Cluster 3 are lower prices for bundled products, so that spending per transaction increases, or special discounts after a number of purchases (for example, after 5 transactions), which would increase the conversion rate.

Cluster 2

This cluster has lower total transactions and spending than the two previous clusters. Their income, however, is fairly typical (range 4 of 8). This may indicate that these are economical customers who buy only what they need. The best campaign for Cluster 2 is to offer high-quality, higher-priced products, so that spending can still be high even with fewer transactions.

Cluster 1

This cluster has the least potential. It ranks lowest on every indicator. One interpretation is that this cluster's income, the lowest of all clusters, limits its total spending, total transactions, and even its Conversion Rate. The best campaign for Cluster 1 is to spark interest in new kinds of products by offering special prices on the first purchase.


