<a href="https://www.kaggle.com/code/amirulmahmud/clustering-pca-for-customer-segmentation?scriptVersionId=124935175" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **The Objective**

The goal of this work is to cluster customers for segmentation. Customer segmentation is the process of separating customers into groups (clusters) so that customers within each group are similar to one another. It helps a business tailor its actions to customers based on their behavior.

# **The Dataset**

The dataset used in this work is taken from a Kaggle dataset (https://www.kaggle.com/datasets/mahdinavaei/customermarketing).

Column Descriptions:

* ID: Customer's unique identifier
* Year_Birth: Customer's birth year
* Education: Customer's education level
* Marital_Status: Customer's marital status
* Income: Customer's yearly household income
* Kidhome: Number of children in customer's household
* Teenhome: Number of teenagers in customer's household
* Dt_Customer: Date of customer's enrollment with the company
* Recency: Number of days since customer's last purchase
* Complain: 1 if the customer complained in the last 2 years, 0 otherwise
* MntWines: Amount spent on wine in last 2 years
* MntFruits: Amount spent on fruits in last 2 years
* MntMeatProducts: Amount spent on meat in last 2 years
* MntFishProducts: Amount spent on fish in last 2 years
* MntSweetProducts: Amount spent on sweets in last 2 years
* MntGoldProds: Amount spent on gold in last 2 years
* NumDealsPurchases: Number of purchases made with a discount
* AcceptedCmp1: 1 if the customer accepted the offer in the 1st campaign, 0 otherwise
* AcceptedCmp2: 1 if the customer accepted the offer in the 2nd campaign, 0 otherwise
* AcceptedCmp3: 1 if the customer accepted the offer in the 3rd campaign, 0 otherwise
* AcceptedCmp4: 1 if the customer accepted the offer in the 4th campaign, 0 otherwise
* AcceptedCmp5: 1 if the customer accepted the offer in the 5th campaign, 0 otherwise
* Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
* NumWebPurchases: Number of purchases made through the company’s website
* NumCatalogPurchases: Number of purchases made using a catalog
* NumStorePurchases: Number of purchases made directly in stores
* NumWebVisitsMonth: Number of visits to the company’s website in the last month

**Import the libraries & dataset**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/customermarketing/Customer marketing.csv",sep="\t")

In [None]:
df.head()

# **Data Cleaning & Feature Engineering**

**Check the data types from each column in dataset**

In [None]:
df.info()

The data type of Dt_Customer is object, so we need to convert it to datetime.

In [None]:
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])

In [None]:
df.info()

**Check Missing Values**

In [None]:
df.isna().sum()

Income has 24 missing values. I will simply drop the rows that contain them.

In [None]:
df = df.dropna()

In [None]:
df.isna().sum()

**Check data duplicates**

In [None]:
df.duplicated().sum()

**Check the unique values for each column**

In [None]:
for col in df.columns:
    print(col," : ",df[col].nunique())

Z_CostContact and Z_Revenue each have only 1 unique value, so they carry no information for clustering. I will simply drop those columns.

In [None]:
df = df.drop(['Z_CostContact','Z_Revenue'],axis=1)
df.info()
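As a side note, constant columns can also be found programmatically instead of reading them off the printed counts. The snippet below is a minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Toy frame (hypothetical values): two columns never change.
toy = pd.DataFrame({
    "Income": [10, 20, 30],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# A column with a single unique value cannot help separate customers.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy_reduced = toy.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", list(toy_reduced.columns))
```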

**Create a new column 'Age' from 'Year_Birth' column. Then, drop 'Year_Birth' column.**

In [None]:
df['Age'] = 2022 - df['Year_Birth']
df = df.drop('Year_Birth',axis=1)
df.info()

**Check column: 'Education'**

In [None]:
df['Education'].value_counts()

In [None]:
# Convert to numeric
df['Education'] = df['Education'].replace({"Basic": 0, "2n Cycle":1, "Graduation": 2, "Master": 3, "PhD": 4})
df['Education'].value_counts()

**Check column: Marital_Status**

In [None]:
df['Marital_Status'].value_counts()

In [None]:
# Create a new column
df['Living_With'] = df["Marital_Status"].replace({"Married":2, "Together":2, "Absurd":1, "Widow":1, "YOLO":1, "Divorced":1,'Single':1,'Alone':1})

# Drop Marital_Status column
df = df.drop('Marital_Status',axis=1)
df['Living_With'].value_counts()

**Create a new column 'Children' that combines Kidhome and Teenhome.**

In [None]:
df['Children'] = df['Kidhome'] + df['Teenhome']

In [None]:
# Drop Kidhome & Teenhome
df = df.drop(['Kidhome','Teenhome'],axis=1)

df['Children'].value_counts()

**Create a new column 'Family_Size'**

In [None]:
df['Family_Size'] = df['Living_With'] + df['Children']
df['Family_Size'].value_counts()

**Check column : Dt_Customer**

In [None]:
# the oldest (earliest) enrollment date
print('oldest : ',min(df['Dt_Customer']))

# the newest (latest) enrollment date
print('newest : ',max(df['Dt_Customer']))

The enrollment dates range from 2012 to 2014. I will create a new column 'Batch' based on the enrollment year, which divides customers into 3 batches.

In [None]:
df['Batch'] = df['Dt_Customer'].dt.year
df['Batch'].value_counts()

In [None]:
df['Batch'] = df['Batch'].replace({2012:1, 2013:2, 2014:3})
df['Batch'].value_counts()

In [None]:
df = df.drop('Dt_Customer',axis=1)
df.info()

**Create a new column 'Total_Spent'**

In [None]:
df['Total_Spent'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntSweetProducts'] + df['MntGoldProds'] + df['MntFishProducts']

**Create a new column 'Total_Accept' that counts the offers each customer accepted across all 6 campaigns.**

In [None]:
df['Total_Accept'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5'] + df['Response']

In [None]:
df.info()

Now all features are numeric, so no further encoding is needed.

**Check again the duplicates.**

In [None]:
df.duplicated().sum()

# **Exploratory Data Analysis**

**Check statistical summary**

In [None]:
df.describe().transpose()

**Identify & remove outliers**

In [None]:
# create function to drop outliers
def drop_outlier(feature: str, data=df):
    print('Dimension before removing outliers : ',data.shape)
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outlier = data[((data[feature] < lower) | (data[feature] > upper))]
    print('List of Outliers: \n',outlier[[feature]])
    data2 = data.drop(outlier.index,axis=0)
    print('Outliers have been removed')
    print('Dimension after removing outliers : ',data2.shape)
    return data2
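To make the rule inside `drop_outlier` concrete, here is a small worked sketch of the same 1.5 × IQR fences on a toy column. The numbers are hypothetical and only serve to show which value gets flagged.

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset): one obvious extreme.
toy = pd.DataFrame({"Income": [10, 12, 13, 15, 16, 18, 20, 200]})

q1 = toy["Income"].quantile(0.25)
q3 = toy["Income"].quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (toy["Income"] < lower) | (toy["Income"] > upper)
toy_clean = toy[~is_outlier]

print(f"fences: [{lower}, {upper}]")
print("flagged:", toy.loc[is_outlier, "Income"].tolist())
print("rows kept:", len(toy_clean))
```

Because the fences are computed from quartiles, a single extreme value barely moves them, which is why this rule is a common default for skewed columns such as income.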

In [None]:
sns.boxplot(data=df,y='Total_Spent')

In [None]:
df2 = drop_outlier('Total_Spent')

In [None]:
# plot the data that the next step filters (Total_Spent outliers already removed)
sns.boxplot(data=df2,y='Income')

In [None]:
df3 = drop_outlier('Income',data=df2)

In [None]:
# plot the data that the next step filters (earlier outliers already removed)
sns.boxplot(data=df3,y='Recency')

In [None]:
df4 = drop_outlier('Recency',data=df3)

In [None]:
# plot the data that the next step filters (earlier outliers already removed)
sns.boxplot(data=df4,y='Age')

In [None]:
df5 = drop_outlier('Age',data=df4)

In [None]:
df5.shape

In [None]:
df5.head()

**Create a heatmap that displays the correlation between features**

In [None]:
plt.figure(figsize=(8,6),dpi=150)
# 'ID' is only an identifier, so it is excluded from the correlation matrix
sns.heatmap(df5.drop('ID',axis=1).corr(),cmap='viridis')

# **Feature Scaling**

In [None]:
# drop the identifier column before scaling
df6 = df5.drop('ID',axis=1)

In [None]:
df6.head()

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
df_scaled = StandardScaler().fit_transform(df6)
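For readers unfamiliar with `StandardScaler`, the sketch below shows on a toy frame (hypothetical values) that it subtracts each column's mean and divides by the population standard deviation, so every feature ends up with mean 0 and standard deviation 1. This matters here because KMeans and PCA are both sensitive to feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values) with two very different scales.
toy = pd.DataFrame({
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "Children": [0.0, 1.0, 2.0, 3.0],
})

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand; StandardScaler uses the population std (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print("matches manual z-score:", np.allclose(scaled, manual.to_numpy()))
print("column means:", scaled.mean(axis=0))
print("column stds :", scaled.std(axis=0))
```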

# **PCA - Dimensionality Reduction**

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)

In [None]:
pca.fit(df_scaled)

In [None]:
df_reduction = pd.DataFrame(pca.transform(df_scaled), columns=['PCA_1','PCA_2'])

In [None]:
df_reduction.head()

In [None]:
plt.figure(figsize=(6,5),dpi=150)
sns.scatterplot(x=df_reduction['PCA_1'],y=df_reduction['PCA_2'])
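One check that may be worth adding after reducing to 2 components is how much of the variance they retain. The sketch below uses synthetic data as a stand-in for the scaled feature matrix (the data and the names `X_scaled` and `pca_check` are assumptions for illustration), and reads `explained_variance_ratio_` from a fitted PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: 6 columns driven by 2 hidden factors.
base = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.3, size=(200, 6))
X = base @ rng.normal(size=(2, 6)) + noise
X_scaled = StandardScaler().fit_transform(X)

pca_check = PCA(n_components=2).fit(X_scaled)
ratios = pca_check.explained_variance_ratio_

print("variance explained per component:", ratios.round(3))
print("total retained by 2 components:", round(float(ratios.sum()), 3))
```

If the retained share is low on the real data, the 2D scatter plot is still useful for visualization, but clusters found in that space may ignore structure that lives in the discarded components.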

# **KMeans Clustering**

In [None]:
from sklearn.cluster import KMeans

In [None]:
from yellowbrick.cluster import KElbowVisualizer

In [None]:
elbow = KElbowVisualizer(estimator=KMeans(), k=12)
elbow.fit(df_reduction)
elbow.show()

In [None]:
#The optimal k value
elbow.elbow_value_
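The elbow plot is based on how the within-cluster sum of squares (inertia) falls as k grows. As a complementary check, the sketch below computes inertia and the silhouette score for a range of k on synthetic blobs (the data is generated with `make_blobs` and is not the customer data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D data standing in for the PCA-reduced features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=101)
    labels = model.fit_predict(X)
    # inertia_: sum of squared distances to the nearest centroid (lower is tighter)
    # silhouette: between -1 and 1 (higher means better separated clusters)
    results[k] = (model.inertia_, silhouette_score(X, labels))

for k, (inertia, sil) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("k with the highest silhouette:", best_k)
```

Inertia always tends to fall as k increases, so it needs the "elbow" judgment, whereas the silhouette score has a maximum and can confirm or question the elbow choice.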

In [None]:
KMeans_model = KMeans(n_clusters=5,random_state=101)
y_pred = KMeans_model.fit_predict(df_reduction)

#Add cluster to reduction dataset
df_reduction['Cluster'] = y_pred

#Add cluster to original dataset
df5['Cluster'] = y_pred
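A note on the assignment above: the index of the original data has gaps after rows were dropped, while the reduced frame has a fresh 0..n-1 index. Assigning a NumPy array is positional, so this works, but assigning a pandas Series would align on index labels instead. The toy sketch below (hypothetical values) shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after rows are dropped.
original = pd.DataFrame({"Income": [10, 20, 30]}, index=[0, 2, 5])

labels_array = np.array([1, 0, 2])
labels_series = pd.Series(labels_array)  # default index 0, 1, 2

# A NumPy array is assigned by position: each row gets its own label.
original["Cluster_from_array"] = labels_array

# A Series is aligned by index label: row 2 silently receives the label
# stored at Series index 2, and row 5 has no match and becomes NaN.
original["Cluster_from_series"] = labels_series

print(original)
```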

**Visualize The Clusters**

In [None]:
plt.figure(figsize=(6,5),dpi=150)
sns.scatterplot(x=df_reduction['PCA_1'],y=df_reduction['PCA_2'],hue=df_reduction['Cluster'])

# **Clusters Interpretation**

**Cluster Distribution**

In [None]:
# pass the column as x explicitly; newer seaborn versions do not accept it positionally
sns.countplot(x=df5['Cluster'])

The distribution of clusters is unbalanced. Cluster 1 has the most members and cluster 4 has the fewest.

In [None]:
df5.info()

**Income VS Spent**

In [None]:
plt.figure(figsize=(6,5),dpi=150)
sns.scatterplot(x=df5['Income'],y=df5['Total_Spent'],hue=df5['Cluster'], palette='deep')

* Cluster 0 : Medium Income & Medium Spent
* Cluster 1 : Low Income & Low Spent
* Cluster 2 : Medium Income & Low Spent
* Cluster 3 : Medium Income & Low Spent
* Cluster 4 : High Income & High Spent
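Labels such as "Low", "Medium" and "High" are read off the scatter plot. A per-cluster summary table can back them up with numbers. The sketch below uses a hypothetical mini-sample with the same column names; on the real data the same `groupby` call would be applied to the clustered frame.

```python
import pandas as pd

# Hypothetical mini-sample (values invented for illustration only).
sample = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 4, 4],
    "Income": [50000, 54000, 22000, 26000, 82000, 90000],
    "Total_Spent": [600, 700, 40, 60, 1500, 1900],
})

# Median and mean income and spending per cluster.
profile = sample.groupby("Cluster")[["Income", "Total_Spent"]].agg(["median", "mean"])
print(profile)
```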

**Check Other Columns**

In [None]:
sns.countplot(x=df5['Cluster'],hue=df5['Education'])

In [None]:
sns.countplot(x=df5['Cluster'],hue=df5['Batch'])

In [None]:
sns.countplot(x=df5['Cluster'],hue=df5['Total_Accept'])

In [None]:
sns.boxplot(y=df5['Total_Accept'],x=df5['Cluster'],showmeans=True)


* Cluster 0 : Majority accepts the offer once in the campaigns
* Cluster 1 : Majority does not accept the offer in the campaigns
* Cluster 2 : Majority does not accept the offer in the campaigns
* Cluster 3 : Majority accepts the offer once or twice in the campaigns
* Cluster 4 : Majority accepts the offer more than twice in the campaigns
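Statements about what the "majority" of a cluster does can be checked with a row-normalized cross-tabulation, which gives the share of each value within each cluster. This is a minimal sketch on hypothetical data with the same column names.

```python
import pandas as pd

# Hypothetical mini-sample (values invented for illustration only).
sample = pd.DataFrame({
    "Cluster":      [0, 0, 0, 1, 1, 1, 1],
    "Total_Accept": [1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" turns counts into per-cluster shares (each row sums to 1).
shares = pd.crosstab(sample["Cluster"], sample["Total_Accept"], normalize="index")
print(shares.round(2))
```

A value is a true majority for a cluster only when its share in that row exceeds 0.5.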


In [None]:
sns.boxplot(y=df5['Age'],x=df5['Cluster'],showmeans=True)

In [None]:
sns.countplot(x=df5['Cluster'],hue=df5['Children'])

In [None]:
sns.boxplot(y=df5['Children'],x=df5['Cluster'],showmeans=True)


* Cluster 0 : Majority does not have children
* Cluster 1 : Majority has 1-3 children
* Cluster 2 : Majority has 1 child
* Cluster 3 : Majority has 1-2 children
* Cluster 4 : Majority does not have children

In [None]:
sns.countplot(x=df5['Cluster'],hue=df5['Family_Size'])

In [None]:
sns.boxplot(y=df5['Family_Size'],x=df5['Cluster'],showmeans=True)

* Cluster 0 : Majority has small family size (1-2)
* Cluster 1 : Majority has big family size (3-5)
* Cluster 2 : Majority has medium family size (2-4)
* Cluster 3 : Majority has medium family size (2-4)
* Cluster 4 : Majority has small family size (1-2)

# **Conclusion**

1. **The optimal number of clusters for this project is 5.**

2. **Characteristics for each Cluster:**

* **Cluster 0**
- Medium Income & Medium Spent
- Majority accepts the offer once in the campaigns
- Majority does not have children
- Majority has small family size (1-2)


* **Cluster 1**
- Low Income & Low Spent
- Majority does not accept the offer in the campaigns
- Majority has 1-3 children
- Majority has big family size (3-5)


* **Cluster 2**
- Medium Income & Low Spent
- Majority does not accept the offer in the campaigns
- Majority has 1 child
- Majority has medium family size (2-4)


* **Cluster 3**
- Medium Income & Low Spent
- Majority accepts the offer once or twice in the campaigns
- Majority has 1-2 children
- Majority has medium family size (2-4)

* **Cluster 4**
- High Income & High Spent
- Majority accepts the offer more than twice in the campaigns
- Majority does not have children
- Majority has small family size (1-2)

**Thank you for reading this notebook. Feel free to share constructive advice or suggestions. I would really appreciate it.**