In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import datetime as dt

pio.renderers.default = 'colab'
pio.templates.default = 'plotly'

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('marketing_campaign.csv', sep='\t')

## Data preprocessing

In [3]:
# Calculate age
df['Age'] = dt.datetime.now().year - df['Year_Birth']

df['Spend'] = df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)

df['Familysize'] = df['Marital_Status'].replace({"Married":"Relationship", "Together":"Relationship",
                                                 "Widow":"Single", "Divorced":"Single", 
                                                 "Single":"Single","Alone":"Single", "Absurd":"Single", "YOLO":"Single"}).replace({'Single': 1, 'Relationship' : 2}).fillna(0).astype(int) + df['Kidhome'] + df['Teenhome']

# change date format
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%d-%m-%Y')

# Calculate last enrollment date
last_enrollment = df['Dt_Customer'].max()


# Calculate days enrolled
df['Days_Enrolled'] = (last_enrollment - df['Dt_Customer']).dt.days

In [4]:
#filter out age >= 100
df = df[(df['Age']<100)]

#unusually high Income
#keep only customers with Income below 600000 to drop the extreme outlier
#rows with a missing Income are dropped as well, because NaN < 600000 evaluates to False
df = df[(df['Income']<600000)]

df_clean = df.copy()

#show the number of rows left after dropping the outliers
print('Number of rows after removing the outliers:', len(df))

Number of rows after removing the outliers: 2212


In this dataset, customer segmentation is performed with RFM (Recency, Frequency, Monetary) analysis. This method allows companies to gauge a customer’s lifetime value and to distinguish customers by their behavior, which helps the marketing team choose an appropriate marketing approach for each customer.

What are R, F, and M?

- **Recency**
    Recency indicates how recently the customer made their last purchase. Customers who purchased recently still have the product on their mind and are more likely to purchase or use it again.
- **Frequency**
    Frequency is the number of orders the customer has placed. Customers who purchase frequently are more likely to purchase again. Additionally, first-time customers may be good targets for advertising that persuades them to become more frequent buyers.
- **Monetary**
    Monetary is the amount of money each customer has spent on the products. High-spending customers are seen as an opportunity for more revenue in the future and are thus considered highly valuable to the business.
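
As a rough illustration of these three definitions, the sketch below derives R, F, and M from a small, made-up transaction log. The `orders` table, its column names, and the snapshot date are all hypothetical; the dataset used in this notebook already ships one aggregated row per customer, so this step is not needed here.

```python
import pandas as pd

# Hypothetical transaction log: one row per order (illustrative data only).
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-15", "2024-03-10",
    ]),
    "amount": [50.0, 20.0, 200.0, 10.0, 15.0, 5.0],
})

# Reference date for recency (assumption: the day after the last order).
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    last_order=("order_date", "max"),   # most recent order per customer
    Frequency=("order_date", "count"),  # number of orders
    Monetary=("amount", "sum"),         # total amount spent
)
# Recency: days between the snapshot date and the last order.
rfm["Recency"] = (snapshot - rfm["last_order"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]
print(rfm)
```

A smaller Recency value means a more recent purchase, which is why "low recency" is the desirable end of that scale.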
    

Performing RFM analysis on this dataset requires only a few attributes and some simple feature engineering to generate the Frequency and Monetary features. The `Recency` feature is already available in the original data.

Frequency is the total number of purchases across all sales channels: `NumWebPurchases`, `NumCatalogPurchases`, and `NumStorePurchases`.

In [5]:
df['Frequency'] = df['NumWebPurchases']+df['NumCatalogPurchases']+df['NumStorePurchases']


Next, `Monetary` is the total amount spent across all product categories: `MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts`, and `MntGoldProds`.

In [6]:
df['Monetary'] = df['MntWines']+df['MntFruits']+df['MntMeatProducts']+df['MntFishProducts']+df['MntSweetProducts']+df['MntGoldProds']

In [7]:
#creating new dataframe of RFM attributes
df_rfm = df[['Recency','Frequency','Monetary']]
df_rfm.describe()

| | Recency | Frequency | Monetary |
|---|---|---|---|
| count | 2212.0 | 2212.0 | 2212.0 |
| mean | 49.019439 | 12.566908 | 607.268083 |
| std | 28.943121 | 7.205427 | 602.513364 |
| min | 0.0 | 0.0 | 5.0 |
| 25% | 24.0 | 6.0 | 69.0 |
| 50% | 49.0 | 12.0 | 397.0 |
| 75% | 74.0 | 18.25 | 1048.0 |
| max | 99.0 | 32.0 | 2525.0 |


The clustering algorithm used here is K-Means. Because K-Means minimizes the sum of squared distances between the points and their cluster centroids, it is sensitive to the scale of the data. The data should therefore be scaled before clustering.

### Data Scaling

In [8]:
#setting the colors of rfm
colors_rfm = ['rgb(183, 9, 76)',  'rgb(92, 77, 125)', 'rgb(0, 145, 173)']

In [9]:
#plotting rfm distribution before scaling
fig = px.box(pd.melt(df_rfm), x='variable', y='value',
             title='<b>RFM Data Distribution Before Scaling</b>',
             color='variable', color_discrete_sequence=colors_rfm,
             boxmode='overlay', points='all')

fig.update_layout(showlegend=False,paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22)

fig.show()

The ranges of the variables differ widely, which will affect the results. If the data is not scaled, the Monetary variable will dominate the distance calculation, largely drowning out the contribution of Recency and Frequency.

Therefore, I will do standardization here, which rescales the data to have a mean of 0 and a standard deviation of 1.

In [10]:
#standardization
std = StandardScaler()
df_rfm_scaled = pd.DataFrame(std.fit_transform(df_rfm), columns=df_rfm.columns)
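
To make the standardization step concrete, here is a minimal sketch on a made-up four-row table (the values in `toy` are illustrative, not taken from the dataset). It applies z = (x - mean) / std by hand and compares the result with `StandardScaler`, which divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RFM values, chosen only to show very different ranges.
toy = pd.DataFrame({
    "Recency": [10.0, 40.0, 70.0, 90.0],
    "Frequency": [2.0, 8.0, 15.0, 25.0],
    "Monetary": [50.0, 300.0, 900.0, 2000.0],
})

# Manual standardization with the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

# The same transformation through scikit-learn.
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(manual, scaled))
print(scaled.describe().loc[["mean", "std"]])
```

Note that `describe()` reports the sample standard deviation (`ddof=1`), so the printed `std` will be slightly above 1 on a small table even though the scaling itself is correct.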

In [11]:
#plotting rfm distribution after scaling
fig = px.box(pd.melt(df_rfm_scaled), x='variable', y='value',
             title='<b>RFM Data Distribution After Scaling</b>',
             color='variable', color_discrete_sequence=colors_rfm,
             boxmode='overlay', points='all')

fig.update_layout(showlegend=False,paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22)

fig.show()

Pareto Principle -> Monetary

## K-Means Clustering

In [12]:

# plotting rfm
fig = px.scatter_3d(df_rfm_scaled, x='Recency', y='Frequency', z='Monetary',
                    title='<b>RFM Mapping</b>',
                    opacity=0.5, color='Monetary', color_continuous_scale='electric')

fig.update_traces(marker=dict(size=5))

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)', title_font_size=22)

fig.show()


**Why use k-means in RFM?**

K-Means is one of the most widely used clustering algorithms. It is simple and intuitive. The algorithm starts by placing k centroids at random and assigning each instance to the cluster with the nearest centroid. It then moves each centroid to the mean of its assigned instances and repeats both steps until the assignments stop changing. The model is evaluated by inertia, which is the sum of squared distances between each instance and its closest centroid.

Although the algorithm is guaranteed to converge, it may get trapped in a local optimum because of its random initialization. K-Means++ improves on this with an initialization step that tends to choose centroids that are far from one another. This makes a poor local optimum much less likely, although it does not guarantee the global optimum. `KMeans` in scikit-learn uses `init='k-means++'` by default.
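
The assign-and-update loop can be sketched in a few lines of NumPy. This is a simplified illustration with plain random initialization, not scikit-learn's implementation; the function name `lloyd_kmeans` and the synthetic two-blob data are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm) with random initialization."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Two well-separated synthetic blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids, inertia = lloyd_kmeans(X, k=2)
print(centroids, inertia)
```

On harder data, different seeds can end in different local optima, which is the weakness that K-Means++ initialization and multiple restarts are meant to reduce.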

To begin with, we need to determine the optimal number of clusters, since K-Means requires the user to specify the number of clusters k in advance.

**Determine the number of clusters**

There are two methods that are frequently used to find the optimal number of clusters.

First, the **Elbow Method**, where the inertia (sum of squared distances) is plotted against the number of clusters in ascending order. Generally, the optimal k is the last value at which the inertia still drops substantially; beyond it, adding more clusters helps little.

Second, the **Silhouette Score**, which is the mean silhouette coefficient over all instances. The silhouette coefficient of each instance is computed as `(b-a)/max(a,b)`, where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the nearest other cluster. The coefficient ranges from -1 to +1. A value close to +1 means the instance sits well inside its own cluster and far from the other clusters, a value near 0 means it lies close to a cluster boundary, and a value close to -1 indicates it may have been assigned to the wrong cluster.
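
The silhouette formula can be checked by hand on a tiny example. The sketch below uses four made-up one-dimensional points with fixed labels, computes `(b-a)/max(a,b)` directly, and compares the result with scikit-learn's `silhouette_samples` and `silhouette_score`. The helper name `silhouette_of` is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Tiny 1-D toy data with two obvious clusters (illustrative values).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i):
    """Silhouette coefficient of point i, computed from the definition."""
    dist = np.abs(X[:, 0] - X[i, 0])
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    # a: mean distance to the other members of the same cluster.
    a = dist[same].mean()
    # b: mean distance to the members of the nearest other cluster.
    b = min(dist[labels == c].mean() for c in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(X))])

print(manual)
print(silhouette_samples(X, labels))
print(manual.mean(), silhouette_score(X, labels))
```

For the first point, a = 1 and b = 10.5, so the coefficient is 9.5 / 10.5, close to +1 as expected for well-separated clusters.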

In [13]:
#calculating inertia for each k for elbow method
#random_state is fixed so that the curve is reproducible between runs
inertias = []
K = range(2,10)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=2022)
    kmeans.fit(df_rfm_scaled)
    inertias.append(kmeans.inertia_)
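
As a side check on what the elbow curve actually measures, the sketch below recomputes inertia by hand on synthetic data and compares it with scikit-learn's `inertia_` attribute, which is a sum (not a mean) of squared distances. The blob data from `make_blobs` is a stand-in, not the RFM table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four groups.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Squared distance from each point to the centroid of its assigned cluster.
sq_dist = ((X - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)

print(sq_dist.sum(), km.inertia_)
```

Because it is a sum, inertia grows with the number of rows, so the absolute values on the y-axis are only comparable across k for the same dataset.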

In [14]:
#plotting elbow method
fig = px.line(x=K, y=inertias,
              title='<b>Optimal Number of Clusters by Elbow Method</b>',
              color_discrete_sequence=['rgb(5, 60, 94)'])

fig.update_traces(line_width=4)

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22,
                  xaxis_title='k (No. of Clusters)', yaxis_title='Inertia')
fig.add_vline(x=4, line_width=3, line_dash='dash', line_color='rgb(183, 9, 76)')

fig.add_annotation(x=4.1, y=1800, text='<i>optimal k=4</i>', font_size=16,
                   showarrow=True, ax=60, ay=-30, arrowhead=2, arrowsize=1,
                   arrowwidth=2)

fig.add_shape(type='circle', xref='x', yref='y',
    x0=3.9, y0=1650, x1=4.1,y1=1914,
    line_color='rgb(183, 9, 76)')

fig.show()

In [15]:
#calculating silhouette score for each k
#random_state is fixed so that the scores are reproducible between runs
silhouette = []
K = range(2,10)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=2022)
    labels = kmeans.fit_predict(df_rfm_scaled)
    score = silhouette_score(df_rfm_scaled, labels)
    silhouette.append(score)

In [16]:
#plotting silhouette score
fig = px.line(x=K, y=silhouette,
              title='<b>Optimal Number of Clusters by Silhouette Score</b>',
              color_discrete_sequence=['rgb(5, 60, 94)'])

fig.update_traces(line_width=4)

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22,
                  xaxis_title='k (No. of Clusters)',
                  yaxis_title='Silhouette Score')

fig.add_vline(x=2, line_width=3, line_dash='dash', line_color='rgb(183, 9, 76)')

fig.add_annotation(x=2.1, y=0.44, text='<i>optimal k=2</i>', font_size=16,
                   showarrow=True, ax=60, ay=-30, arrowhead=2, arrowsize=1,
                   arrowwidth=2)

fig.add_shape(type='circle', xref='x', yref='y',
    x0=1.9, y0=0.432, x1=2.1,y1=0.442,
    line_color='rgb(183, 9, 76)')

fig.show()

The optimal number of clusters is 4 according to the elbow method, but 2 according to the silhouette score. Because the silhouette score takes both intra-cluster and inter-cluster distances into account, it tends to favor better-separated clusters.

However, two clusters may be too coarse for customer segmentation, which needs more specific clusters to build personalized offers. Thus, I will use **k=4** to perform clustering.

### K-Means

In [17]:
kmeans = KMeans(n_clusters=4, random_state=2022)
kmeans.fit(df_rfm_scaled)

#sorting cluster orders to get same label results in every run
idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
luv = np.zeros_like(idx)
luv[idx] = np.arange(4)

y_pred = luv[kmeans.labels_] #use label according to lookup values
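
The relabeling trick in the cell above is easier to follow on a small worked example. The centroids and raw labels below are made up; the point is only to show how `argsort` plus a lookup array turns arbitrary K-Means labels into labels ordered by centroid sum.

```python
import numpy as np

# Hypothetical centroids for 4 clusters in scaled (R, F, M) space,
# in the arbitrary order K-Means happened to return them.
centers = np.array([
    [ 0.8,  0.9,  1.0],   # raw label 0, sum  2.7
    [-0.9, -0.8, -0.7],   # raw label 1, sum -2.4
    [ 0.9, -0.8, -0.7],   # raw label 2, sum -0.6
    [-0.8,  0.9,  1.0],   # raw label 3, sum  1.1
])
raw_labels = np.array([0, 1, 2, 3, 1, 0])

# Raw labels ordered from smallest to largest centroid sum.
idx = np.argsort(centers.sum(axis=1))

# Lookup array: luv[raw_label] is the rank of that cluster.
luv = np.zeros_like(idx)
luv[idx] = np.arange(len(idx))

new_labels = luv[raw_labels]
print(idx, luv, new_labels)
```

The ordering is only stable when the centroid sums are clearly distinct; two clusters with nearly equal sums could still swap labels between runs.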

In [18]:
#adding the clusters column to the original dataframe for further analysis
df_rfm_scaled["Clusters"]= y_pred
df["Clusters"]= y_pred

In [19]:
#setting the colors of clusters
colors_cluster = ['rgb(183, 9, 76)', 'rgb(137, 43, 100)','rgb(69, 94, 137)',
                  'rgb(0, 145, 173)']

In [20]:
df["Clusters"] = pd.to_numeric(df["Clusters"])

fig = px.scatter_3d(df, x='Recency', y='Monetary', z='Frequency',
                    color='Clusters', category_orders=dict(Clusters=[0,1,2,3]),
                    title='<b>Cluster Results</b>', opacity=0.5,
                    color_continuous_scale=colors_cluster,
                    labels={'Clusters': 'Cluster'})

fig.update_traces(marker=dict(size=6, opacity=0.6))

fig.update_layout(showlegend=True, paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22, legend_title='Cluster')

fig.show()


## Clusters Evaluation

A 3-dimensional plot is relatively hard to interpret. Let's look at the distribution of R, F, and M in each cluster to determine the characteristics of each cluster.

In [21]:
#convert to long format table
df_rfm_long = pd.melt(df_rfm_scaled, id_vars='Clusters')

#plotting rfm distributions
fig = px.box(df_rfm_long, x='Clusters', y='value',
             title='<b>RFM Distribution by Cluster</b>',
             color='variable', color_discrete_sequence=colors_rfm,
             boxmode='group')

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',
                  title_font_size=22)

fig.show()

## Tabulate RFM by Cluster

In [22]:
#plotting number of customers in clusters
fig = px.histogram(df, x='Clusters', color='Clusters',
                   color_discrete_sequence=colors_cluster,
                   category_orders=dict(Clusters=[0,1,2,3]),
                   title='<b>Number of Customers in Each Cluster</b>',
                   text_auto=True)

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22,
                  bargap=0.4)

fig.show()

In [23]:
# computing the centroids in original units as per-cluster means
centroid = df.groupby('Clusters')[['Recency', 'Frequency', 'Monetary']].agg('mean')

# resetting the index so that Clusters becomes a regular column
centroid = centroid.reset_index()

# style the table
centroid.style.background_gradient(cmap='Blues')

| | Clusters | Recency | Frequency | Monetary |
|---|---|---|---|---|
| 0 | 0 | 23.490998 | 7.06383 | 151.725041 |
| 1 | 1 | 73.450886 | 7.020934 | 146.52496 |
| 2 | 2 | 22.971922 | 19.773218 | 1185.524838 |
| 3 | 3 | 73.170213 | 19.27853 | 1181.205029 |
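
An alternative way to obtain centroids in original units is to map the fitted centroids back through the scaler. The sketch below, on synthetic data standing in for the RFM table, suggests that both routes agree once K-Means has converged; the variable names and the blob centers are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM table (not the real data).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0, 0], [5, 5, 5], [0, 5, 0], [5, 0, 5]],
    cluster_std=0.5,
    random_state=1,
)
toy = pd.DataFrame(X, columns=["Recency", "Frequency", "Monetary"])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)

# Route 1: average the original values per cluster.
by_group = toy.groupby(km.labels_).mean()

# Route 2: map the fitted centroids back to the original units.
by_inverse = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_), columns=toy.columns
)

print(by_group.round(3))
print(by_inverse.round(3))
```

The group-by route is the one to use after relabeling clusters, since `cluster_centers_` keeps the original K-Means label order.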


Based on the boxplot and the centroids, we can see that the clusters split each RFM value into two groups, low and high. Note that Recency counts the days since the last purchase, so a low Recency means a recent purchase.

- Recency
    Low : about 23 days; High : about 73 days
- Frequency
    Low : 7 purchases; High : 19 - 20 purchases
- Monetary
    Low : about 147 - 152 USD; High : about 1181 - 1186 USD
    
These low/high splits combine into 4 customer segments:
- Cluster 0 = Low Recency, Low Frequency, Low Monetary (**Low-Spending Active Customers**) <br><br> These are casual customers who transacted recently but spent only a small amount of money. The group consists of 619 customers, or 28% of all customers.


- Cluster 1 = High Recency, Low Frequency, Low Monetary (**Churned Low-Spending Customers**) <br><br> These are the customers who have low engagement and spent the least. Unfortunately, the group is large (631 customers, or 28% of all customers). <br>


- Cluster 2 = Low Recency, High Frequency, High Monetary (**Best Active Customers**) <br><br> Cluster 2 contains the top customers, who spend a lot and buy frequently. They contribute a high portion of revenue and are considered the most valuable to the company. Unfortunately, this is the smallest group, with only 466 customers, or 21% of all customers. <br>


- Cluster 3 = High Recency, High Frequency, High Monetary (**Churned Best Customers**) <br><br> The customers in this cluster have spent large amounts and contributed highly to the company's revenue. However, their last purchase was long ago, which might be an indication of churn. This group consists of 524 customers, or 23% of all customers.
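
Segment sizes and shares like the ones quoted above are best recomputed from the labels of the current run, so that they always add up to the number of rows left after filtering. A minimal sketch, using ten made-up labels rather than the real assignments:

```python
import pandas as pd

# Hypothetical cluster labels for ten customers (illustrative only).
labels = pd.Series([0, 0, 0, 1, 1, 1, 1, 2, 3, 3], name="Clusters")

counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index() * 100

summary = pd.DataFrame({
    "customers": counts,            # number of customers per cluster
    "share_pct": shares.round(1),   # percentage of all customers
})
print(summary)
```

In the notebook itself the same two `value_counts` calls could be applied to `df['Clusters']`.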

## Visualization

### Income feature

In [24]:
#plotting income & monetary by clusters
#the Income<500000 filter keeps extreme incomes (such as the 666666 outlier) from obscuring the pattern
fig = px.scatter(df[df['Income']<500000], x='Monetary', y='Income',
                 color='Clusters',
                 color_continuous_scale=colors_cluster,
                 title='<b>Clusters by Income & Monetary</b>',
                 opacity=0.5)

fig.update_traces(marker=dict(size=11, opacity=0.85, line=dict(width=1, color='#F7F7F7')))

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22)

fig.show()

### Insight:
The chart shows that customers who spend more tend to have higher incomes. Notice how clusters 2 and 3 (our best customers) are dominated by high earners. Nonetheless, some high-income customers are low spenders. These customers have high potential to increase their transactions.

### Deals feature

In [25]:
#getting the deals feature together with the cluster labels
#.copy() avoids modifying a slice of df when the new column is added
deals = df[['NumDealsPurchases','Clusters']].copy()

#new variable DealsAccpt is 1 if the customer ever made a purchase with a deal
deals['DealsAccpt'] = np.where(deals['NumDealsPurchases']>0,1,0)
deals = deals.sort_values('DealsAccpt')

In [26]:
colors_bin = ['rgb(183, 9, 76)','rgb(0, 145, 173)']

In [27]:
#plotting deals acceptance rate for each cluster
fig = px.histogram(deals, x='Clusters', barnorm='percent',
                   color='DealsAccpt',
                   color_discrete_sequence=colors_bin,
                   category_orders=dict(Clusters=[0,1,2,3]),
                   title='<b>Deals Acceptance Rate</b>',
                   text_auto=True)

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22,
                  bargap=0.4, yaxis_title='Percent (%)')

fig.show()

In [28]:
#plotting number of deals purchases for each cluster
fig = px.histogram(df, x='Clusters', y='NumDealsPurchases', color='Clusters',
                   color_discrete_sequence=colors_cluster,
                   category_orders=dict(Clusters=[0,1,2,3]),
                   title='<b>Number of Deals Purchases by Clusters</b>',
                   text_auto=True)

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22,
                  bargap=0.4)

fig.show()

### Insight:
The deals received good responses from all clusters, especially clusters 0 and 1 (low-spending customers). Despite their low purchase frequency, these customers bought more discounted items than the best customers. Given this price-sensitive behavior, offering them deals is likely to be the most effective way to increase their transactions.
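
Because the clusters differ in size, summed deal purchases can be misleading when comparing clusters. A hedged sketch of a per-customer view, on a made-up table that only reuses the column names from the cells above:

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Clusters": [0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
    "NumDealsPurchases": [2, 0, 1, 3, 1, 0, 1, 0, 0, 4],
})
toy["DealsAccpt"] = (toy["NumDealsPurchases"] > 0).astype(int)

rates = toy.groupby("Clusters").agg(
    accept_rate=("DealsAccpt", "mean"),               # share of customers who used a deal
    deals_per_customer=("NumDealsPurchases", "mean"), # average deal purchases per customer
)
print(rates)
```

Comparing `deals_per_customer` across clusters removes the effect of cluster size that a plain sum carries.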

### Total Campaign feature

In [29]:
#getting all the campaign related features
#.copy() avoids modifying a slice of df when columns are added or renamed
cmpgn = df[['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4',
            'AcceptedCmp5','Response','Clusters']].copy()

#new variable CmpAccpt is 1 if the customer accepted any of campaigns 1-5
#(the first five columns; Response, the last campaign, is not included)
cmpgn['CmpAccpt'] = np.where((cmpgn.iloc[:,0:5].mean(axis=1))>0,1,0)
cmpgn = cmpgn.sort_values('CmpAccpt')

#shortening column name
new_col = dict(AcceptedCmp1='Cmp1',AcceptedCmp2='Cmp2',
               AcceptedCmp3='Cmp3',AcceptedCmp4='Cmp4',
               AcceptedCmp5='Cmp5',Response='CmpLast')

cmpgn.rename(columns=new_col, inplace=True)

In [30]:
#plotting campaigns acceptance rate
fig = px.histogram(cmpgn, x='Clusters', barnorm='percent',
                   color='CmpAccpt',
                   color_discrete_sequence=colors_bin,
                   category_orders=dict(Clusters=[0,1,2,3]),
                   title='<b>Campaigns Acceptance Rate</b>',
                   text_auto=True)

fig.update_layout(paper_bgcolor='rgb(229, 236, 246)',title_font_size=22,
                  bargap=0.4, yaxis_title='Percent (%)')

fig.show()

### Insight:
In contrast to the deals, the campaigns seem to underperform, with far lower acceptance. In addition, clusters 2 and 3 (best customers) responded more than the low-spending customers. The marketing team might take this into account and craft more personalized campaigns for low-spending customers.

Next, we will take a look at the performance of each campaign.

In [31]:
#convert table wide to long format for each cluster
cmpgn_long_0 = pd.melt(cmpgn[cmpgn['Clusters']==0][['Cmp1','Cmp2',
                                                    'Cmp3','Cmp4',
                                                    'Cmp5','CmpLast']])
cmpgn_long_1 = pd.melt(cmpgn[cmpgn['Clusters']==1][['Cmp1','Cmp2',
                                                    'Cmp3','Cmp4',
                                                    'Cmp5','CmpLast']])
cmpgn_long_2 = pd.melt(cmpgn[cmpgn['Clusters']==2][['Cmp1','Cmp2',
                                                    'Cmp3','Cmp4',
                                                    'Cmp5','CmpLast']])
cmpgn_long_3 = pd.melt(cmpgn[cmpgn['Clusters']==3][['Cmp1','Cmp2',
                                                    'Cmp3','Cmp4',
                                                    'Cmp5','CmpLast']])

In [32]:
#plotting campaigns performance on each cluster
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('Cluster 0', 'Cluster 1', 'Cluster 2', 
                                    'Cluster 3'),
                    shared_yaxes=True)

fig.add_trace(
    go.Histogram(histfunc='sum', x=cmpgn_long_0['variable'],
                 y=cmpgn_long_0['value'], marker_color=colors_cluster[0],
                 texttemplate='%{y}'),
    row=1, col=1)

fig.add_trace(
    go.Histogram(histfunc='sum', x=cmpgn_long_1['variable'],
                 y=cmpgn_long_1['value'], marker_color=colors_cluster[1],
                 texttemplate='%{y}'),
    row=1, col=2)

fig.add_trace(
    go.Histogram(histfunc='sum', x=cmpgn_long_2['variable'],
                 y=cmpgn_long_2['value'], marker_color=colors_cluster[2],
                 texttemplate='%{y}'),
    row=2, col=1)

fig.add_trace(
    go.Histogram(histfunc='sum', x=cmpgn_long_3['variable'],
                 y=cmpgn_long_3['value'], marker_color=colors_cluster[3],
                 texttemplate='%{y}'),
    row=2, col=2)

fig.update_layout(showlegend=False, paper_bgcolor='rgb(229, 236, 246)',
                  title_text='<b>Campaigns Performance on Each Cluster</b>',
                  title_font_size=22, yaxis_range=[0,150], yaxis3_range=[0,150])

fig.show()

### Insight:
Most campaigns failed to attract low-spending customers (clusters 0 and 1); only the third and the last campaign achieved higher acceptance among them. In addition, the second campaign received the fewest responses in all clusters, while the last received the most.

The strategy for the next campaign can be formulated according to this evaluation: which offerings achieved the best acceptance in each cluster, and which did not, should be taken into account when designing it.
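
One way to read "best acceptance in each cluster" off the data is a cluster-by-campaign table of acceptance rates. The sketch below uses a small made-up table with three of the shortened campaign columns; the values are illustrative and do not reproduce the charts above.

```python
import pandas as pd

# Made-up 0/1 acceptance flags; column names follow the shortened names above.
toy = pd.DataFrame({
    "Clusters": [0, 0, 1, 1, 2, 2, 3, 3],
    "Cmp1":     [0, 0, 0, 0, 1, 0, 1, 1],
    "Cmp2":     [0, 0, 0, 0, 0, 0, 0, 1],
    "CmpLast":  [1, 0, 0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 flag is the acceptance rate; scale to percent.
rate_pct = toy.groupby("Clusters").mean().mul(100)

# Campaign with the highest acceptance rate in each cluster
# (ties resolve to the first column).
best = rate_pct.idxmax(axis=1)

print(rate_pct)
print(best)
```

Rates rather than raw counts keep clusters of different sizes comparable.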

In [34]:
#Visualize all the features against spending
df["Marital_Status"] = df["Marital_Status"].astype("category")
df["Education"] = df["Education"].astype("category")

Personal = ["Age", "Income", "Education", "Marital_Status", "NumWebVisitsMonth", "Familysize", 'Days_Enrolled']

#cluster labels as strings so that plotly treats them as discrete categories;
#with numeric labels px.scatter uses a continuous scale and ignores color_discrete_sequence
df_plot = df.assign(Cluster=df["Clusters"].astype(str))

for i in Personal:
    fig = px.scatter(data_frame=df_plot, x=i, y="Spend", color="Cluster", marginal_y="violin", marginal_x="box",
                     color_discrete_sequence=colors_cluster, category_orders=dict(Cluster=['0','1','2','3']))
    fig.update_layout(title=i.capitalize() + " vs Spending",
                      xaxis_title=i.capitalize(),
                      paper_bgcolor='rgb(229, 236, 246)',
                      plot_bgcolor='rgb(229, 236, 246)',
                      template="simple_white")
    fig.show()

# Cluster Analysis

Once the characteristics of the customers are identified, the next step is to craft personalized offers tailored to each customer group. Here are some examples of promotion strategies.

- **Low-Spending Active Customers** : <br>
    The sole purpose of promotion for this group is to increase the frequency and monetary value of the customers. The marketing team can create personalized offers that raise their purchase level, such as vouchers or discounts that become available once a transaction reaches a set threshold.
        
        
- **Churned Low-Spending Customers** : <br>
    Since this group is the largest, we cannot simply ignore them. The promotion can start by engaging them and getting them to transact again. Sending them emails, messages, or notifications with personalized product recommendations and attractive offers might do the job.


- **Best Active Customers** : <br>
    Best customers deserve the best service. The marketers can build a connection and reward them with great offers and services dedicated to this group only, to make them feel special.
    
    
- **Churned Best Customers** : <br>
    This group generated substantial revenue before. Thus, the marketing team should prioritize re-engaging with them, while also listening to their feedback and responding promptly. In addition, the team can send personalized product recommendations with great offers to entice them into buying.

## Low-Spending Active Customers:
- Income ranges from 5000 to 40000 and spending ranges from 0 to 500
- Age ranges from 25 to 50
- From any educational level
- Can be married or unmarried
- Most of them are parents
- Some have one child
- Customer for at least 300 days
- Rarely accept promotions
- Very few complete purchases using discounts

## Churned Low-Spending Customers:
- Income ranges from 65000 to 85000 and spending ranges from 550 to 2000
- Age ranges from 30 to 60
- Almost all have completed graduation
- Most of them are married
- They are not parents
- Have no children
- Customer for at least 250 days
- Promotion acceptance ratio is 0.5
- Rarely complete purchases using discounts

## Best Active Customers:
- Income ranges from 50000 to 80000 and spending ranges from 250 to 1800
- Age ranges from 35 to 60
- Almost all have completed graduation
- Most of them are married
- They are parents
- All have children, most have one child
- Customer for at least 400 days
- Promotion acceptance ratio is poor
- Highly interested in completing purchases using discounts

## Churned Best Customers:
- Income ranges from 40000 to 60000 and spending ranges from 0 to 500
- Age ranges from 40 to 65
- Almost all have completed graduation
- Can be married or unmarried
- They are parents
- All have children, most have two children
- Customer for at least 150 days
- Rarely accept promotions
- Highly interested in completing purchases using discounts