# **Introduction**

Customer segmentation divides a population into groups with similar characteristics. It helps a business understand its customers and serve each group in the most effective way. The purpose of this project is to analyze the data and divide the customers into segments using unsupervised clustering and principal component analysis. Segmentation is based on demographics, which is one of the most basic approaches. Each segment is then analyzed separately to reveal its buying patterns, how much it benefits from the campaigns, and where its purchases take place.

# **Data**

### Customer

* ID: Customer's unique identifier
* Year_Birth: Customer's birth year
* Education: Customer's education level
* Marital_Status: Customer's marital status
* Income: Customer's yearly household income
* Kidhome: Number of children in customer's household
* Teenhome: Number of teenagers in customer's household
* Dt_Customer: Date of customer's enrollment with the company
* Recency: Number of days since customer's last purchase
* Complain: 1 if the customer complained in the last 2 years, 0 otherwise

### Products
 
* MntWines: Amount spent on wine in last 2 years
* MntFruits: Amount spent on fruits in last 2 years
* MntMeatProducts: Amount spent on meat in last 2 years
* MntFishProducts: Amount spent on fish in last 2 years
* MntSweetProducts: Amount spent on sweets in last 2 years
* MntGoldProds: Amount spent on gold in last 2 years

### Promotion
 
* NumDealsPurchases: Number of purchases made with a discount
* AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
* AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
* AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
* AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
* AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
* Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

### Place
 
* NumWebPurchases: Number of purchases made through the company’s website
* NumCatalogPurchases: Number of purchases made using a catalog
* NumStorePurchases: Number of purchases made directly in stores
* NumWebVisitsMonth: Number of visits to company’s website in the last month

# **Libraries**

In [863]:
import numpy as np
import pandas as pd
import scipy

from datetime import datetime
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

# **Explore Data**

In [864]:
df_raw = pd.read_csv('../input/customer-personality-analysis/marketing_campaign.csv', sep='\t')
df_raw.head()

In [865]:
df_raw.info()

# **Data preprocessing**

We need to preprocess the data to make it suitable for visualization and machine learning algorithms. First, we drop customers who have missing data. As seen above, relatively few rows are affected, so dropping them is acceptable.

Next, we process the categorical features, Education and Marital_Status. The Education feature has five categories; to make the difference more obvious, we collapse them into graduate and postgraduate. The Marital_Status feature has eight categories; since many of them have similar meanings, we collapse them into single and couple.

Then we move on to the numerical features Dt_Customer and Year_Birth. We make them easier to interpret by expressing them as elapsed years (years as a customer and age) rather than as dates. Finally, we handle the Kidhome and Teenhome features: we add a new column for family size and then convert both features into flags that show whether a customer has children or teenagers at home.


In [866]:
df_raw.dropna(inplace=True)

df_raw["Education"] = df_raw["Education"].replace({'Basic': "graduate", 'Graduation': "graduate", 
                                                   '2n Cycle' : "postgraduate", 'Master' : "postgraduate", 
                                                   'PhD': "postgraduate"})

df_raw["Education"] = df_raw["Education"].replace({'graduate': "0", 'postgraduate': "1"})


df_raw["Marital_Status"] = df_raw["Marital_Status"].replace({'Married': "couple", 'Together': "couple", 
                                                   'Single' : "single", 'Divorced' : "single", 
                                                   'Widow': "single", 'Alone' : "single", 
                                                   'Absurd': "single", 'YOLO' : "single"})
df_raw["Marital_Status"] = df_raw["Marital_Status"].replace({'single': "1", 'couple': "2"})

# Both columns now hold the strings "0"/"1" or "1"/"2"; convert them to integers.
LABELS = ["Education","Marital_Status"]
df_raw[LABELS] = df_raw[LABELS].astype('int64')
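
One property of `replace` is that any category missing from the dictionary passes through unchanged and only fails later at the integer conversion. The sketch below shows an alternative that surfaces such values early by using `map` in a single step. The toy frame and the `education_codes` dictionary are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the Education column (values are illustrative).
toy = pd.DataFrame({"Education": ["Basic", "Graduation", "2n Cycle", "Master", "PhD", "Master"]})

# Single-step mapping from raw category to integer code (0 = graduate, 1 = postgraduate).
education_codes = {"Basic": 0, "Graduation": 0, "2n Cycle": 1, "Master": 1, "PhD": 1}

encoded = toy["Education"].map(education_codes)

# map() yields NaN for categories missing from the dictionary,
# so unmapped values can be reported before any type conversion.
unmapped = toy.loc[encoded.isna(), "Education"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped Education categories: {list(unmapped)}")

toy["Education"] = encoded.astype("int64")
print(toy["Education"].tolist())
```

The same pattern would apply to Marital_Status with its own dictionary.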

In [867]:
df_raw["Dt_Customer"] = pd.to_datetime(df_raw["Dt_Customer"])

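# Both features are measured against the current year, so their values
# (and the hard-coded year labels in the charts below) depend on when the notebook is run.
df_raw["CustomerYear"] = datetime.now().year - df_raw["Dt_Customer"].dt.year

df_raw["Age"] = datetime.now().year - df_raw["Year_Birth"]

# Marital_Status is 1 (single) or 2 (couple), so it doubles as the number of adults.
df_raw["FamilySize"] = df_raw['Kidhome'] + df_raw['Teenhome'] + df_raw["Marital_Status"]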

df_raw.loc[df_raw['Kidhome'] > 0, 'Kidhome'] = 1

df_raw.loc[df_raw['Teenhome'] > 0, 'Teenhome'] = 1

df_raw.reset_index(drop=True, inplace=True)
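
Because the elapsed-year features are computed from `datetime.now()`, rerunning the notebook in a later year shifts every age and tenure value. A minimal sketch of a reproducible variant follows. `REFERENCE_YEAR` and its value are hypothetical, and the day-first date format of the toy strings is an assumption that should be checked against the real file.

```python
import pandas as pd

# Hypothetical reference year; pin it to the year the analysis is meant to describe.
REFERENCE_YEAR = 2022

toy = pd.DataFrame({
    "Year_Birth": [1965, 1978, 1990],
    # Illustrative dates, assumed to be in day-month-year order.
    "Dt_Customer": ["04-09-2012", "21-08-2013", "19-01-2014"],
})

# An explicit format avoids ambiguous day/month parsing.
toy["Dt_Customer"] = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")

# Elapsed years relative to the fixed reference year.
toy["CustomerYear"] = REFERENCE_YEAR - toy["Dt_Customer"].dt.year
toy["Age"] = REFERENCE_YEAR - toy["Year_Birth"]
print(toy[["CustomerYear", "Age"]])
```

With a fixed reference year, the segment descriptions and chart labels stay valid on every rerun.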

# **Visual EDA**

Before using the data prepared in the previous section, we need to examine it in more detail. First, we split the data according to the demographic segmentation method before starting the visualization. Features such as Year_Birth, Dt_Customer, Recency, Complain, Z_CostContact, and Z_Revenue are left out because they will not contribute to our model.

We start the visualization with the correlation between features. As can be seen, the correlations between features are not very strong. Next, we look at the distributions of the categorical and numeric features. Two points are worth mentioning: age and income are on very different scales, and both features contain outliers.

In [868]:
customer = df_raw[['ID', 'Education', 'Marital_Status', 'Income', 
                   'Kidhome', 'Teenhome', 'CustomerYear', 'Age', 'FamilySize']].copy()

product = df_raw[['ID', 'MntWines', 'MntFruits', 'MntMeatProducts', 
                  'MntFishProducts', 'MntSweetProducts','MntGoldProds']].copy()

promotion = df_raw[['ID', 'NumDealsPurchases', 'AcceptedCmp1', 'AcceptedCmp2', 
                  'AcceptedCmp3', 'AcceptedCmp4','AcceptedCmp5', 'Response']].copy()

place = df_raw[['ID', 'NumStorePurchases', 'NumCatalogPurchases', 
                  'NumWebPurchases', 'NumWebVisitsMonth']].copy()

In [869]:
mask = np.triu(np.ones_like(customer.corr()))
plt.figure(figsize = (12, 9))
s = sns.heatmap(customer.corr(),
               annot = True,
               cmap = 'RdBu',
               mask = mask,
               vmin = -1,
               vmax = 1)
s.set_yticklabels(s.get_yticklabels(), rotation = 0, fontsize = 12)
s.set_xticklabels(s.get_xticklabels(), rotation = 90, fontsize = 12)
plt.title("Correlation Heatmap", y=1.02)
plt.show()

In [870]:
fig, ([ax1,ax2, ax3], [ax4,ax5, ax6]) = plt.subplots(2,3, figsize=(16,10))

sns.countplot(x='Education', data=customer, ax=ax1)
ax1.set(ylabel='Count')
ax1.set_xticklabels(['graduate', 'postgraduate'])

sns.countplot(x='Marital_Status', data=customer, ax=ax2)
ax2.set(ylabel=None)
ax2.set_xticklabels(['single', 'couple'])

sns.countplot(x='Kidhome', data=customer, ax=ax3)
ax3.set(ylabel=None)
ax3.set_xticklabels(['No', 'Yes'])

sns.countplot(x='Teenhome', data=customer, ax=ax4)
ax4.set(ylabel='Count')
ax4.set_xticklabels(['No', 'Yes'])

sns.countplot(x='CustomerYear', data=customer, ax=ax5)
ax5.set(ylabel=None)

sns.countplot(x='FamilySize', data=customer, ax=ax6)
ax6.set(ylabel=None)
fig.suptitle("Categorical data", fontsize=16, y=0.95)

In [871]:
s = sns.pairplot(customer[["Income", "Age"]], height= 3.5, aspect=2, plot_kws={'alpha': 0.5})
s.fig.suptitle("Income vs Age",fontsize=16, x =0.53, y=1.02)
plt.show()

# **Handling outliers**

Since outliers may affect the precision of the model, we exclude customers with outlying values from the data. As seen in the image, the age and income features are closer to a normal distribution after this step.

In [872]:
z = np.abs(stats.zscore(customer['Income']))
OutlierIncome = np.where(z > 3)

z = np.abs(stats.zscore(customer['Age']))
OutlierAge = np.where(z > 3)

Outliers = list(OutlierIncome[0]) + list(OutlierAge[0])

customer.drop(Outliers,inplace = True)
customer.reset_index(drop=True, inplace=True)

product.drop(Outliers,inplace = True)
product.reset_index(drop=True, inplace=True)

promotion.drop(Outliers,inplace = True)
promotion.reset_index(drop=True, inplace=True)

place.drop(Outliers,inplace = True)
place.reset_index(drop=True, inplace=True)

s = sns.pairplot(customer[["Income", "Age"]], height= 3.5, aspect=2, plot_kws={'alpha': 0.5})
s.fig.suptitle("Income vs Age",fontsize=16, x =0.53, y=1.02)
plt.show()
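
The cell above drops rows by the positions returned from `np.where`, which works only because every frame still carries a freshly reset 0..n-1 index. The sketch below shows a variant that builds a boolean mask and filters related frames by ID instead. The toy data, the injected outliers, and the names `is_outlier` and `outlier_ids` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ID": np.arange(200),
    "Income": rng.normal(50_000, 10_000, 200),
    "Age": rng.normal(45, 10, 200),
})
# Inject one obvious outlier per feature.
toy.loc[0, "Income"] = 600_000
toy.loc[1, "Age"] = 130

# Column-wise z-scores; a row is an outlier if |z| > 3 on either feature.
z = np.abs(stats.zscore(toy[["Income", "Age"]].to_numpy(), axis=0))
is_outlier = (z > 3).any(axis=1)
outlier_ids = toy.loc[is_outlier, "ID"]

toy_clean = toy.loc[~is_outlier].reset_index(drop=True)

# Any related frame can be filtered by ID, independent of its row order or index.
product_toy = pd.DataFrame({"ID": np.arange(200), "MntWines": rng.integers(0, 500, 200)})
product_clean = product_toy[~product_toy["ID"].isin(outlier_ids)].reset_index(drop=True)
```

Filtering by ID keeps the customer, product, promotion, and place frames consistent even if one of them is later sorted or re-indexed.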

# **Standardization**

Since the K-means clustering algorithm uses distance-based metrics to determine the similarity between data points, we need to standardize our dataset.

In [873]:
customer.set_index('ID', inplace=True)

scaler = StandardScaler()
customer_std = scaler.fit_transform(customer)
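
To make the motivation concrete, the sketch below compares the Euclidean distance between two customers before and after standardization. The three toy rows are illustrative assumptions; only `StandardScaler` comes from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy rows: [income, age]; values are illustrative only.
X = np.array([[30_000.0, 25.0],
              [60_000.0, 45.0],
              [90_000.0, 65.0]])

# On raw values the distance is dominated almost entirely by income.
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardization every column has mean 0 and unit variance,
# so income and age contribute on the same scale.
X_std = StandardScaler().fit_transform(X)
std_dist = np.linalg.norm(X_std[0] - X_std[1])

print(raw_dist, std_dist)
```

On the raw data a 20-year age gap is negligible next to a 30000 income gap; after scaling both gaps carry equal weight.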

# **Hierarchical Clustering**

Hierarchical clustering, used as the first method, suggests that the ideal number of clusters is three. K-means clustering, used as the second method, suggests that three, four, or five clusters are all reasonable. Since the separation between clusters is clearest with three and it appears to be the safer choice, we initially set the number of clusters to 3.

In [874]:
hier_clust = linkage(customer_std, method = 'ward')

plt.figure(figsize = (8,6))
plt.title('Hierarchical Clustering Dendrogram', fontsize=16)
plt.xlabel('Observations')
plt.ylabel('Distance')
dendrogram(hier_clust,
           truncate_mode = 'level', 
           p = 3, 
           show_leaf_counts = False, 
           no_labels = True)
plt.show()
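
The dendrogram is read visually, but the same linkage matrix can also be cut into a fixed number of flat clusters. The sketch below does this with `scipy.cluster.hierarchy.fcluster` on toy blobs. The data and the choice of three clusters are assumptions for illustration, and this step is not part of the original notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Three well-separated toy blobs in 2D, 30 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centers])

# Same Ward linkage as in the cell above.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
labels = fcluster(Z, t=3, criterion="maxclust")
sizes = np.bincount(labels)[1:]
print(sizes)
```

The resulting labels could then be compared with the K-means labels to see how far the two methods agree.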

# **K-means clustering**

In [875]:
inertia = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(customer_std)
    inertia.append(kmeans.inertia_)
    
plt.figure(figsize = (8,6))
plt.plot(range(1, 11), inertia, marker = 'o', linestyle = '--')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('K-means Clustering', fontsize=16)
plt.show()

In [876]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
kmeans.fit(customer_std)
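
The elbow plot leaves several candidate values open. One complementary check, which is not used in the original analysis, is the silhouette score: it measures how close each point is to its own cluster relative to the nearest other cluster. The sketch below computes it for several values of k on toy data; the data and the range of k are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Three well-separated toy blobs standing in for the standardized customer data.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is the best-separated candidate.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data the scores are usually much closer together than on these toy blobs, so the result is a hint rather than a decision rule.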

# **Analyzing clusters**

The differences between the three clusters show up mostly in the income, age, and family size features.
* In the first cluster, the average income is around 31000, the average age is around 43, and the family size is around 2 or 3 people.
* In the second cluster, the average income is around 66000, the average age is around 55, and the family size is around 1 or 2 people.
* In the third cluster, the average income is around 51000, the average age is around 57, and the family size is around 3 or 4 people.

Finally, when we plot the clusters by income and age, they overlap heavily. This shows that the model still has room for improvement.

In [877]:
customer_kmeans = customer.copy()
customer_kmeans['Segment'] = kmeans.labels_

customer_analysis = customer_kmeans.groupby(['Segment']).mean()
customer_analysis['Size'] = customer_kmeans[['Segment','Education']].groupby(['Segment']).count()

customer_analysis['Proportion'] = customer_analysis['Size'] / customer_analysis['Size'].sum()
customer_analysis

In [878]:
x_ = customer_kmeans['Age']
y_ = customer_kmeans['Income']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_, y = y_, hue = customer_kmeans['Segment'], palette=['blue','red', 'green'])
plt.title('K-means Clustering')
plt.show()

# **Principal Component Analysis**

As seen in the previous section, although the clusters differ from each other in certain ways, they overlap in the scatter plot. We use principal component analysis to improve the model in this respect. Although the data set has relatively few features, PCA is used here to reduce the number of features, reduce noise, and improve performance.

A common rule of thumb is to keep enough components to explain about 80 percent of the variance. As can be seen in the graph below, 4 components come close to 80 percent, so we choose 4 components. In other words, PCA cuts the number of features in half.

In [879]:
pca = PCA()
pca.fit(customer_std)
cum_sum = pca.explained_variance_ratio_.cumsum()

plt.figure(figsize = (8,6))
plt.plot(range(1,9), cum_sum, marker = 'o', linestyle = '--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()

In [880]:
pca = PCA(n_components = 4)
scores_pca = pca.fit_transform(customer_std)
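
The number of components can also be read off the cumulative variance programmatically instead of from the plot. The sketch below applies a strict 80 percent threshold to toy data; the random data and the `threshold` variable are assumptions. Note that a strict threshold can give a different answer from the notebook's choice of 4, which accepts a value close to 80 percent.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))  # toy stand-in for the 8 standardized features

pca_full = PCA().fit(X)
cum_sum = pca_full.explained_variance_ratio_.cumsum()

# Smallest number of components whose cumulative explained variance reaches the threshold.
threshold = 0.80
n_components = int(np.argmax(cum_sum >= threshold) + 1)

# scikit-learn can apply the same rule directly when n_components is a float in (0, 1).
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print(n_components, pca_auto.n_components_)
```

Passing the fraction directly keeps the component count in sync with the data if the feature set changes.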

# **K-means clustering with PCA**

In this section, we combine the results of the PCA method with K-means clustering. The elbow method shows no change in the number of clusters that can be selected, so we continue with 3 clusters, as determined at the beginning.

In [881]:
inertia = []
for i in range(1,11):
    kmeans_pca = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans_pca.fit(scores_pca)
    inertia.append(kmeans_pca.inertia_)
    
plt.figure(figsize = (8,6))
plt.plot(range(1, 11), inertia, marker = 'o', linestyle = '--')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('K-means with PCA')
plt.show()

In [882]:
kmeans_pca = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
kmeans_pca.fit(scores_pca)

# Name the component columns when the frame is built instead of editing columns.values in place.
component_names = ['Component 1', 'Component 2', 'Component 3', 'Component 4']
customer_kmeans_pca = pd.concat([customer.reset_index(drop = False), 
                                 pd.DataFrame(scores_pca, columns = component_names)], axis = 1)
customer_kmeans_pca['Segment PCA'] = kmeans_pca.labels_

# **Visual analysis of final clusters**

In this section, we first visualize the results of the K-means clustering model that we use with PCA. As seen in the image below, the clusters are more clearly separated from each other than in the first model. This is because PCA reduces the number of features and the noise. Thus, we have achieved a better result with the PCA method. In addition, we named the clusters after the income feature to make them memorable and easy to understand. Keep in mind, however, that the clusters were not created using the income feature alone.

In [883]:
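# The label codes come from this particular K-means run; re-check the mapping if the clustering changes.
customer_kmeans_pca['Legend'] = customer_kmeans_pca['Segment PCA'].map({0:'low-income', 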
                                                          1:'high-income',
                                                          2:'middle-income'
                                                          })

x_ = customer_kmeans_pca['Component 1']
y_ = customer_kmeans_pca['Component 2']
plt.figure(figsize = (10, 8))
sns.scatterplot(x=x_, y=y_, hue = customer_kmeans_pca['Legend'] )
plt.title('Clusters by PCA')
plt.show()
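
The hard-coded mapping from label codes to names depends on the order in which K-means happens to number the clusters. A minimal sketch of a more robust alternative ranks the clusters by mean income and assigns the names in that order. The toy frame is an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "Income": [30_000, 32_000, 70_000, 68_000, 50_000, 52_000],
    "Segment PCA": [2, 2, 0, 0, 1, 1],  # arbitrary label order, as K-means may produce
})

# Rank cluster labels by mean income, lowest first.
ordered = toy.groupby("Segment PCA")["Income"].mean().sort_values().index

# Assign names in ascending order of mean income.
names = ["low-income", "middle-income", "high-income"]
segment_names = dict(zip(ordered, names))

toy["Legend"] = toy["Segment PCA"].map(segment_names)
print(segment_names)
```

This keeps the names aligned with the income ranking even if the label numbering changes between runs.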

**Middle-Income:**
* With 843 customers, this segment constitutes 38.2% of the total customers.
* 48% are at the graduate level, while 52% are at the postgraduate level.
* 83.5% are a couple while 16.5% are single.
* 47.8% have children at home, 52.2% do not.
* 100% have teens at home.
* Of this segment, 21.6% have been customers for 10 years, 53% for 9 years, and 25.4% for 8 years.
* 62.8% have families of 3, while 33.5% have families of 4.
* The average income of this segment is around 51000.
* The average age of this segment is around 57 years.

**High-Income:**
* With 781 customers, this segment constitutes 35.4% of the total customers.
* 53.3% are at the graduate level, while 46.7% are at the postgraduate level.
* 42.1% are a couple while 57.9% are single.
* 0.4% have children at home, 99.6% do not.
* 26.6% have teens at home, 73.4% do not.
* Of this segment, 23.7% have been customers for 10 years, 52.2% for 9 years, and 24.1% for 8 years.
* 69% have families of 2, while 30.9% have families of 1.
* The average income of this segment is around 69000.
* The average age of this segment is around 57 years.

**Low-Income:**
* With 581 customers, this segment constitutes 26.3% of the total customers.
* 59.6% are at the graduate level, while 40.4% are at the postgraduate level.
* 67% are a couple while 33% are single.
* 90% have children at home, 10% do not.
* 2.6% have teens at home, 97.4% do not.
* Of this segment, 21% have been customers for 10 years, 53.5% for 9 years, and 25.5% for 8 years.
* 59.7% have families of 3, while 36.3% have families of 2.
* The average income of this segment is around 30000.
* The average age of this segment is around 44 years.
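
Percentages like the ones listed above can be computed directly rather than read off the charts. The sketch below uses a row-normalized `pd.crosstab` on a toy frame; the data are illustrative assumptions and do not reproduce the figures above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Legend": ["low-income"] * 4 + ["high-income"] * 4,
    "Kidhome": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Row-normalized crosstab: share of each Kidhome value within every segment, in percent.
kid_share = (
    pd.crosstab(toy["Legend"], toy["Kidhome"], normalize="index")
      .mul(100)
      .round(1)
)
print(kid_share)
```

Swapping `Kidhome` for Education, Marital_Status, Teenhome, CustomerYear, or FamilySize would give the other rows of each summary.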

In [884]:
values = customer_kmeans_pca['Legend'].value_counts()
# Derive the slice names from the counts so that labels always match their values.
names = values.index.str.title()

fig = px.pie(values = values.values, names = names)
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Cluster sizes', 
                  title_x=0.465, title_y=0.95)
fig.show()

In [885]:
labels_m = ['Postgraduate', 'Graduate']
labels_w_l = ['Graduate', 'Postgraduate']

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Middle-Income', 'High-Income', 'Low-Income'])

values_m = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'middle-income']['Education'].value_counts()
values_w = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'high-income']['Education'].value_counts()
values_l = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'low-income']['Education'].value_counts()

fig.add_trace(go.Pie(labels = labels_m, values = values_m, 
                     scalegroup = 'one', name = "Middle-Income"), 1, 1)

fig.add_trace(go.Pie(labels = labels_w_l, values = values_w, 
                     scalegroup = 'one',  name = "High-Income"), 1, 2)

fig.add_trace(go.Pie(labels = labels_w_l, values = values_l, 
                     scalegroup = 'one',  name = "Low-Income"), 1, 3)

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Education percentages of clusters', 
                  title_x=0.5, title_y=0.95)
fig.show()
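
The pie charts in this section pair hand-written label lists with `value_counts()`, which orders categories by frequency. If the frequencies change, the labels silently point at the wrong slices. The sketch below derives the labels from the index of the counts. The helper `pie_inputs` and the `education_labels` dictionary are hypothetical names introduced for illustration.

```python
import pandas as pd

def pie_inputs(series, label_map):
    """Return (labels, values) for a pie chart, with labels taken from value_counts()."""
    counts = series.value_counts()
    # counts.index holds the category codes in the same order as the counts.
    labels = [label_map[code] for code in counts.index]
    return labels, counts.tolist()

education_labels = {0: "Graduate", 1: "Postgraduate"}

# Toy Education codes for one segment.
toy = pd.Series([1, 0, 1, 1, 0], name="Education")
labels, values = pie_inputs(toy, education_labels)
print(labels, values)
```

The returned lists can be passed straight to `go.Pie(labels=labels, values=values, ...)`, which would also remove the need for separate label lists per segment.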

In [886]:
labels_w = ['Single', 'Couple']
labels_m_l = ['Couple', 'Single']

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Middle-Income', 'High-Income', 'Low-Income'])

values_m = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'middle-income']['Marital_Status'].value_counts()
values_w = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'high-income']['Marital_Status'].value_counts()
values_l = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'low-income']['Marital_Status'].value_counts()

fig.add_trace(go.Pie(labels = labels_m_l, values = values_m, 
                     scalegroup = 'one', name = "Middle-Income"), 1, 1)

fig.add_trace(go.Pie(labels = labels_w, values = values_w, 
                     scalegroup = 'one',  name = "High-Income"), 1, 2)

fig.add_trace(go.Pie(labels = labels_m_l, values = values_l, 
                     scalegroup = 'one',  name = "Low-Income"), 1, 3)

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Marital status percentages of clusters', 
                  title_x=0.5, title_y=0.95)
fig.show()

In [887]:
labels_l = ['Has kid', 'Has no kid']
labels_m_w = ['Has no kid', 'Has kid']

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Middle-Income', 'High-Income', 'Low-Income'])

values_m = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'middle-income']['Kidhome'].value_counts()
values_w = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'high-income']['Kidhome'].value_counts()
values_l = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'low-income']['Kidhome'].value_counts()

fig.add_trace(go.Pie(labels = labels_m_w, values = values_m, 
                     scalegroup = 'one', name = "Middle-Income"), 1, 1)

fig.add_trace(go.Pie(labels = labels_m_w, values = values_w, 
                     scalegroup = 'one',  name = "High-Income"), 1, 2)

fig.add_trace(go.Pie(labels = labels_l, values = values_l, 
                     scalegroup = 'one',  name = "Low-Income"), 1, 3)

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Kid percentages of clusters', 
                  title_x=0.5, title_y=0.95)
fig.show()

In [888]:
labels_m = ['Has teen', 'Has no teen']
labels_l_w = ['Has no teen', 'Has teen']

fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Middle-Income', 'High-Income', 'Low-Income'])

values_m = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'middle-income']['Teenhome'].value_counts()
values_w = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'high-income']['Teenhome'].value_counts()
values_l = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'low-income']['Teenhome'].value_counts()

fig.add_trace(go.Pie(labels = labels_m, values = values_m, 
                     scalegroup = 'one', name = "Middle-Income"), 1, 1)

fig.add_trace(go.Pie(labels = labels_l_w, values = values_w, 
                     scalegroup = 'one',  name = "High-Income"), 1, 2)

fig.add_trace(go.Pie(labels = labels_l_w, values = values_l, 
                     scalegroup = 'one',  name = "Low-Income"), 1, 3)

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Teen percentages of clusters', 
                  title_x=0.5, title_y=0.95)
fig.show()

In [889]:
labels = [9, 8, 10]


fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Middle-Income', 'High-Income', 'Low-Income'])

values_m = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'middle-income']['CustomerYear'].value_counts()
values_w = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'high-income']['CustomerYear'].value_counts()
values_l = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'low-income']['CustomerYear'].value_counts()

fig.add_trace(go.Pie(labels = labels, values = values_m, 
                     scalegroup = 'one', name = "Middle-Income"), 1, 1)

fig.add_trace(go.Pie(labels = labels, values = values_w, 
                     scalegroup = 'one',  name = "High-Income"), 1, 2)

fig.add_trace(go.Pie(labels = labels, values = values_l, 
                     scalegroup = 'one',  name = "Low-Income"), 1, 3)

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Customer loyalty percentages of clusters', 
                  title_x=0.5, title_y=0.95)
fig.show()

In [890]:
labels_m = [3, 4, 5, 2]
labels_w = [2, 1, 3]
labels_l = [3, 2, 4, 1]


fig = make_subplots(1, 3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Middle-Income', 'High-Income', 'Low-Income'])

values_m = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'middle-income']['FamilySize'].value_counts()
values_w = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'high-income']['FamilySize'].value_counts()
values_l = customer_kmeans_pca[customer_kmeans_pca['Legend'] == 'low-income']['FamilySize'].value_counts()

fig.add_trace(go.Pie(labels = labels_m, values = values_m, 
                     scalegroup = 'one', name = "Middle-Income"), 1, 1)

fig.add_trace(go.Pie(labels = labels_w, values = values_w, 
                     scalegroup = 'one',  name = "High-Income"), 1, 2)

fig.add_trace(go.Pie(labels = labels_l, values = values_l, 
                     scalegroup = 'one',  name = "Low-Income"), 1, 3)

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text='Family size percentages of clusters', 
                  title_x=0.5, title_y=0.95)
fig.show()

Don't be misled by the cluster names here, because clusters were not created solely on the basis of income, so there may be extreme values in each cluster.

In [891]:
fig = px.box(customer_kmeans_pca, x="Legend", y="Income", points="all", color="Legend")
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  xaxis_title = 'Clusters',
                  legend_title = 'Clusters',
                  title_text='Income distribution', 
                  title_x=0.46, title_y=0.95)
fig.update_xaxes(showticklabels=False)
fig.show()

In [892]:
fig = px.box(customer_kmeans_pca, x="Legend", y="Age", points="all", color="Legend")
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Age distribution',
                  xaxis_title = 'Clusters',
                  legend_title = 'Clusters',
                  title_x=0.46, title_y=0.95,)
fig.update_xaxes(showticklabels=False)
fig.show()

## **Visual analysis of products by segments**

In [893]:
df_customer_segments = customer_kmeans_pca[["ID", "Segment PCA" , "Legend"]]

product_list = ["MntWines","MntFruits","MntMeatProducts","MntFishProducts","MntSweetProducts","MntGoldProds"]
product["TotalExpenditure"] = product[product_list].sum(axis=1)
product_segments = pd.merge(product, df_customer_segments, on="ID")

The dataset records how much each customer spent on wine, fruit, meat, fish, sweets, and gold products. Using these data, we first visualize the distribution of total expenditure per segment. 
* The average total spend of the high-income segment is around 1100. 
* The average total spend of the middle-income segment is around 310. 
* The average total spend of the low-income segment is around 66.
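
Summary figures like these can be computed with a single `groupby`. The sketch below reports both the mean and the median, since a box plot marks the median and the two can differ noticeably for skewed spending data. The toy values are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Legend": ["high-income"] * 3 + ["low-income"] * 3,
    "TotalExpenditure": [1000, 1200, 1400, 40, 60, 200],
})

# Mean and median total expenditure per segment.
summary = toy.groupby("Legend")["TotalExpenditure"].agg(["mean", "median"])
print(summary)
```

In the toy low-income group a single large basket pulls the mean well above the median, which is the kind of gap worth checking in the real segments.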

In [894]:
fig = px.box(product_segments, x="Legend", y="TotalExpenditure", points="all", color="Legend")
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Total expenditure distribution',
                  xaxis_title = 'Clusters',
                  yaxis_title = 'Total expenditure',
                  legend_title = 'Clusters',
                  title_x=0.46, title_y=0.95,)
fig.update_xaxes(showticklabels=False)
fig.show()

In [895]:
high = pd.DataFrame(product_segments[product_segments["Legend"] == "high-income"][product_list].sum(), columns = ['Total'])
high.reset_index(drop = False, inplace=True)
high["Legend"] = "high-income"
high.sort_values("Total", ascending=False, inplace= True)

middle = pd.DataFrame(product_segments[product_segments["Legend"] == "middle-income"][product_list].sum(), columns = ['Total'])
middle.reset_index(drop = False, inplace=True)
middle["Legend"] = "middle-income"
middle.sort_values("Total", ascending=False, inplace= True)

low = pd.DataFrame(product_segments[product_segments["Legend"] == "low-income"][product_list].sum(), columns = ['Total'])
low.reset_index(drop = False, inplace=True)
low["Legend"] = "low-income"
low.sort_values("Total", ascending=False, inplace=True)

final = pd.concat([high, middle, low], ignore_index=True)
final.rename({'index': 'Product'}, axis=1, inplace=True)

In this section, the total amount spent on each product by each segment is visualized. As can be seen, the high-income segment spends the most. In addition, the top two products are the same in every segment: wine first, then meat.

In [896]:
fig = px.bar(final, x="Legend", y="Total", color="Product", text_auto=True)
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Money spent on products',
                  xaxis_title = 'Clusters',
                  yaxis_title = 'Total',
                  legend_title = 'Products',
                  title_x=0.46, title_y=0.95)
fig.show()

## **Visual analysis of promotion by segments**

The dataset records responses to the six most recent campaigns offered to customers. As can be seen in the chart below, the number of campaigns accepted by customers in each segment is quite low. Comparing the segments, the high-income segment accepts the most campaigns. In addition, the last campaign is the most successful of the six.

In [897]:
promotion_segments = pd.merge(promotion, df_customer_segments, on="ID")
promotion_list = ["AcceptedCmp1","AcceptedCmp2","AcceptedCmp3","AcceptedCmp4","AcceptedCmp5","Response"]

promotion_middle = list(promotion_segments[promotion_segments["Legend"] == "middle-income"][promotion_list].sum())
promotion_high = list(promotion_segments[promotion_segments["Legend"] == "high-income"][promotion_list].sum())
promotion_low = list(promotion_segments[promotion_segments["Legend"] == "low-income"][promotion_list].sum())

fig = go.Figure(data=[
    go.Bar(name='middle-income', x= promotion_list, y=promotion_middle),
    go.Bar(name='high-income', x= promotion_list, y=promotion_high),
    go.Bar(name='low-income', x= promotion_list, y=promotion_low)
])
# Change the bar mode
fig.update_layout(barmode='group',
                  #autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Number of accepted campaigns',
                  xaxis_title = 'Campaigns',
                  yaxis_title = 'Count',
                  legend_title = 'Clusters',
                  title_x=0.46, title_y=0.90)
fig.show()
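
The chart above compares raw counts, but the segments differ in size, so a larger segment can show more acceptances at the same acceptance rate. The sketch below computes acceptance rates per segment instead. The toy data are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Legend": ["high-income"] * 4 + ["low-income"] * 2,
    "AcceptedCmp1": [1, 1, 0, 0, 0, 1],
    "Response": [1, 0, 0, 0, 0, 0],
})
campaign_cols = ["AcceptedCmp1", "Response"]

# Raw counts, as plotted above.
counts = toy.groupby("Legend")[campaign_cols].sum()

# The mean of a 0/1 flag is the share of customers who accepted; expressed in percent.
rates = toy.groupby("Legend")[campaign_cols].mean().mul(100)
print(counts)
print(rates)
```

In this toy example the high-income group has twice as many AcceptedCmp1 acceptances as the low-income group, yet both accept at the same 50 percent rate.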

In this section, the distribution of the number of discounted purchases is visualized. As can be seen in the chart below, the segments that benefit most from discounts are middle-income, low-income, and high-income, in that order.
* Customers in the middle-income segment made 3 discounted purchases on average.
* Customers in the low-income segment made 2 discounted purchases on average.
* Customers in the high-income segment made 1 discounted purchase on average.

In [898]:
fig = px.box(promotion_segments, x="Legend", y="NumDealsPurchases", points="all", color="Legend")
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Distribution of the total number of discounted purchases',
                  xaxis_title = 'Clusters',
                  yaxis_title = 'Total discounted purchases',
                  legend_title = 'Clusters',
                  title_x=0.46, title_y=0.95,)
fig.update_xaxes(showticklabels=False)
fig.show()

## **Visual analysis of place by segments**

The dataset contains the number of purchases each customer made in stores, on the web, and through the catalog. Using these features, the distribution of total purchases per segment is visualized in the chart below. 
* The average number of total purchases in the high-income segment is around 18. 
* The average number of total purchases in the middle-income segment is around 11. 
* The average number of total purchases in the low-income segment is around 6.

In [899]:
place_list = ["NumStorePurchases","NumCatalogPurchases","NumWebPurchases"]
place["TotalPurchase"] = place[place_list].sum(axis=1)
place_segments = pd.merge(place, df_customer_segments, on="ID")

fig = px.box(place_segments, x="Legend", y="TotalPurchase", points="all", color="Legend")
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Total purchasing distributions',
                  xaxis_title = 'Clusters',
                  yaxis_title = 'Total purchase',
                  legend_title = 'Clusters',
                  title_x=0.46, title_y=0.95,)
fig.update_xaxes(showticklabels=False)
fig.show()

In [900]:
high = pd.DataFrame(place_segments[place_segments["Legend"] == "high-income"][place_list].sum(), columns = ['Total'])
high.reset_index(drop = False, inplace=True)
high["Legend"] = "high-income"

middle = pd.DataFrame(place_segments[place_segments["Legend"] == "middle-income"][place_list].sum(), columns = ['Total'])
middle.reset_index(drop = False, inplace=True)
middle["Legend"] = "middle-income"

low = pd.DataFrame(place_segments[place_segments["Legend"] == "low-income"][place_list].sum(), columns = ['Total'])
low.reset_index(drop = False, inplace=True)
low["Legend"] = "low-income"

final = pd.concat([high, middle, low], ignore_index=True)
final.rename({'index': 'Place'}, axis=1, inplace=True)

As seen below, the segments that make the most purchases in total are high-income, middle-income, and low-income, in that order. In addition, every segment has the same channel preference: store first, then web, then catalog.

In [901]:
fig = px.bar(final, x="Legend", y="Total", color="Place", text_auto=True)
fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Purchasing place distributions',
                  xaxis_title = 'Clusters',
                  legend_title = 'Place',
                  title_x=0.46, title_y=0.95)
fig.show()

In this section, we compare the segments' web visits and web purchases. As seen in the graph, most web visits come from the middle-income and low-income segments. However, these segments make very few web purchases relative to their visits. The high-income segment shows the opposite pattern: it has the fewest web visits of all segments but the most web purchases.

In [902]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x=["high-income", "low-income", "middle-income"],
    y=list(place_segments.groupby("Legend")["NumWebVisitsMonth"].sum()),
    name='NumWebVisitsMonth'
))
fig.add_trace(go.Bar(
    x=["high-income", "low-income", "middle-income"],
    y=list(place_segments.groupby("Legend")["NumWebPurchases"].sum()),
    name='NumWebPurchases'
))

fig.update_layout(#autosize=False,
                  #width=1000,
                  #height=500,
                  title_text = 'Web visits vs Web purchases',
                  xaxis_title = 'Clusters',
                  yaxis_title = 'Total',
                  legend_title = 'Legend',
                  title_x=0.46, title_y=0.95)
fig.show()
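
The comparison between visits and purchases can be condensed into one number per segment. The sketch below computes web purchases per web visit on a toy frame. The data and the `PurchasesPerVisit` column are illustrative assumptions, and because the two columns may cover different time windows, the ratio is only meaningful for comparing segments with each other.

```python
import pandas as pd

toy = pd.DataFrame({
    "Legend": ["high-income", "high-income", "low-income", "low-income"],
    "NumWebVisitsMonth": [2, 3, 7, 8],
    "NumWebPurchases": [5, 5, 2, 1],
})

# Total visits and purchases per segment, as in the bar chart above.
web = toy.groupby("Legend")[["NumWebVisitsMonth", "NumWebPurchases"]].sum()

# Relative indicator: higher means more purchases for the same browsing activity.
web["PurchasesPerVisit"] = web["NumWebPurchases"] / web["NumWebVisitsMonth"]
print(web)
```

Grouping by `Legend` also removes the need to hard-code the segment order on the x-axis, since the index of `web` carries the segment names.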