# Customer Segmentation Analysis using Clustering

### 1. Introduction

In the rapidly evolving marketplace, understanding your customer has never been more important. Traditional demographic methods of segmenting customers, such as age or location, can be enhanced with clustering techniques to understand customer behavior and preferences on a much deeper level.

In this project, we will apply unsupervised learning techniques to identify segments of the customer population. Clustering is an unsupervised machine learning technique that groups data points so that points in the same group are more similar to each other than to points in other groups. Given a set of data points and no predefined labels, a clustering algorithm assigns each data point to a specific group.

The goal of this project is to segment customers into different groups based on their attributes and behavior, such as gender, age, marital status, and spending score. Understanding the distinct groups in our customer base will help in planning marketing activities more effectively.

The dataset we will use contains various customer details like gender, marital status, age, graduation details, profession, work experience, spending score, family size and more.

We will start with an exploratory data analysis, followed by data preprocessing. Then we will apply the KMeans clustering algorithm, for which the number of clusters is commonly chosen with the Elbow Method. The final part of the project will be to analyze and interpret the resulting customer segments. Let's get started!

### 2. Importing Data and Necessary Libraries

In this step, we're importing all the necessary libraries we'll need to analyze the data and perform customer segmentation.

The primary libraries we are using are:

NumPy: It provides a high-performance multidimensional array object and tools for working with these arrays. It is fundamental for scientific computing with Python.

Pandas: This library is excellent for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data.

Matplotlib and Seaborn: These are fantastic libraries for data visualization. They provide a flexible interface for creating plots and graphs.

Scikit-learn: It is one of the most widely used machine learning libraries in Python. We use it for preprocessing the data and for implementing KMeans clustering.

The dataset is loaded into a pandas dataframe, and we display the first few records using the head() function. This gives us an idea of the structure of the dataset we're working with.

In [1]:
# Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

# Load the dataset
df = pd.read_csv('/kaggle/input/customer-segmentation/Train.csv')

# Display the first few records of our dataset
df.head()

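|   | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |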


### 3. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the initial step in data analysis, where we uncover the underlying structure of the data and identify its important variables and the relationships between them. Let's start:

Shape of the data: We print the number of rows and columns in the dataset to understand the size of the data we're working with.

Missing Values: Here we check if there are any missing values in our dataset across different columns. Missing data can lead to weak or biased analysis.

Data Types: Knowing the type of the data is important as the type of data determines the statistical method used to analyze it.

Unique Values: We check for unique values in each column to get a sense of the diversity and spread of data.

Summary Statistics: We print the summary statistics (count, mean, standard deviation, min, 25th percentile, 50th percentile, 75th percentile, max) of the numerical columns to understand how each of them is distributed.

In [2]:
# Checking the shape of the data
print("Number of rows in the dataset:",df.shape[0])
print("Number of columns in the dataset:",df.shape[1])

# Checking for missing values
print("\nMissing Values in the dataset:\n",df.isnull().sum())

# Checking the data types
print("\nData types of the columns:\n",df.dtypes)

# Checking the unique values in each column
for col in df.columns:
    print("\nUnique values in column", col, ":",df[col].nunique())

# Getting the summary statistics of numerical columns
df.describe()

Number of rows in the dataset: 8068
Number of columns in the dataset: 11

Missing Values in the dataset:
 ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64

Data types of the columns:
 ID                   int64
Gender              object
Ever_Married        object
Age                  int64
Graduated           object
Profession          object
Work_Experience    float64
Spending_Score      object
Family_Size        float64
Var_1               object
Segmentation        object
dtype: object

Unique values in column ID : 8068

Unique values in column Gender : 2

Unique values in column Ever_Married : 2

Unique values in column Age : 67

Unique values in column Graduated : 2

Unique values in column Profession : 9

Unique values in column Work_Experience : 15

Unique values i

|   | ID | Age | Work_Experience | Family_Size |
|---|---|---|---|---|
| count | 8068.0 | 8068.0 | 7239.0 | 7733.0 |
| mean | 463479.214551 | 43.466906 | 2.641663 | 2.850123 |
| std | 2595.381232 | 16.711696 | 3.406763 | 1.531413 |
| min | 458982.0 | 18.0 | 0.0 | 1.0 |
| 25% | 461240.75 | 30.0 | 0.0 | 2.0 |
| 50% | 463472.5 | 40.0 | 1.0 | 3.0 |
| 75% | 465744.25 | 53.0 | 4.0 | 4.0 |
| max | 467974.0 | 89.0 | 14.0 | 9.0 |


### 4. Data Preprocessing

Data Preprocessing is a crucial step in the machine learning pipeline. It involves cleaning and formatting the data before feeding into a machine learning algorithm. For this dataset, our data preprocessing will involve the following steps:

Filling Missing Values: We observe from the EDA that 'Ever_Married', 'Graduated', 'Profession', 'Work_Experience', 'Family_Size', and 'Var_1' have missing values. For the categorical variables, we will fill the missing values with the most common class (mode). For the numerical variables 'Work_Experience' and 'Family_Size', we will fill the missing values with the mean. 'Var_1' is not imputed because we drop it (see below).

One-Hot Encoding: Most machine learning algorithms require numerical input, so the categorical data must be converted into a numerical form. We perform one-hot encoding on the categorical variables, using `drop_first=True` so that the first level of each variable is dropped and acts as the implicit baseline.

Feature Scaling: Many machine learning algorithms perform better when numerical input variables are on a comparable scale. This includes algorithms that use a weighted sum of inputs, like linear regression, and algorithms that use distance measures, like k-nearest neighbors and K-means. Without scaling, a feature with a large range such as 'Age' would dominate the distance computation. We standardize features by removing the mean and scaling to unit variance using StandardScaler.

Dropping 'Var_1': The 'Var_1' column contains anonymized categorical data. While it may contain valuable information, the categories are not described, and thus it's difficult to interpret and use them meaningfully in our analysis. We decide to drop this column for simplicity.

Dropping 'Segmentation': The 'Segmentation' column is actually the output from a previous segmentation exercise. Including this in our clustering could bias our results and make it harder to identify truly distinct clusters within the data. As our goal is to demonstrate an unbiased clustering process, we are dropping this column.

In [3]:
# Filling missing values: mode for categorical columns, mean for numerical ones.
# Assigning the result back avoids chained in-place fills, which newer pandas versions warn about.
for col in ['Ever_Married', 'Graduated', 'Profession']:
    df[col] = df[col].fillna(df[col].mode()[0])
for col in ['Work_Experience', 'Family_Size']:
    df[col] = df[col].fillna(df[col].mean())

# Drop 'Var_1' and 'Segmentation' columns
df.drop(['Var_1', 'Segmentation'], axis=1, inplace=True)

# Converting categorical variables into dummy/indicator variables
df = pd.get_dummies(df, drop_first=True)

# Drop the 'ID' column as it is not a feature
df.drop(['ID'], axis=1, inplace=True)

# Feature Scaling (StandardScaler was already imported in the first cell)
sc = StandardScaler()
df_scaled = sc.fit_transform(df)
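
To make the effect of these preprocessing steps concrete, here is a toy illustration on a hypothetical four-row frame. The values are made up and only mimic a few columns of the dataset; the names `toy`, `toy_encoded`, and `toy_scaled` are assumptions introduced for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical four-row frame mimicking a few columns of the dataset
toy = pd.DataFrame({
    "Age": [22, 38, 67, 40],
    "Work_Experience": [1.0, np.nan, 1.0, 0.0],
    "Ever_Married": ["No", "Yes", None, "Yes"],
    "Spending_Score": ["Low", "Average", "Low", "High"],
})

# Mean imputation for the numerical column, mode imputation for the categorical one
toy["Work_Experience"] = toy["Work_Experience"].fillna(toy["Work_Experience"].mean())
toy["Ever_Married"] = toy["Ever_Married"].fillna(toy["Ever_Married"].mode()[0])

# One-hot encode; drop_first=True drops the alphabetically first level of each
# categorical column (here 'No' and 'Average'), which becomes the implicit baseline
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Standardize every column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_encoded)

print(toy_encoded.columns.tolist())
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0).round(6))
```

Because the baseline levels have no column of their own, their share in any group has to be inferred as one minus the sum of the remaining dummy columns for that variable.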

### 5. Clustering Modeling

To perform clustering on the preprocessed data, we will use the K-means algorithm. This algorithm aims to partition the data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). In this notebook, K is fixed at 4 (`n_clusters=4`).
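
The "nearest mean" idea can be sketched in plain NumPy as two alternating steps: assign each point to its closest centroid, then move each centroid to the mean of its points. This is an illustrative sketch only, not scikit-learn's implementation (which adds k-means++ initialization, multiple restarts, and other refinements). The function name `kmeans_sketch` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct, randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids

# Two well-separated synthetic blobs, purely for illustration
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo.round(2))
```
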

In [4]:
# Create an instance of the KMeans algorithm
kmeans = KMeans(n_clusters=4, random_state=0)

# Fit the algorithm to the standardized data
kmeans.fit(df_scaled)

# Get the cluster labels for each data point
labels = kmeans.labels_

# Add the cluster labels to the DataFrame
df['Cluster'] = labels
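
The cell above fixes the number of clusters at 4 without showing how that value was chosen. A minimal sketch of the Elbow Method mentioned in the introduction follows. It runs on synthetic data from `make_blobs` so that it is self-contained; in the notebook, `df_scaled` would take the place of the hypothetical `X_demo`. The range of `k` values and `n_init=10` are assumptions, not settings taken from the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for df_scaled so the sketch runs on its own
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the within-cluster sum of squared distances to the centroids
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` (for example with `plt.plot`) and looking for the point where the curve flattens gives the "elbow". The choice is a visual heuristic, so it is worth cross-checking with another criterion.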



### 6. Cluster Evaluation and Interpretation

After clustering the data, it is important to evaluate the clusters and interpret their characteristics. One common approach is to analyze the distribution of features within each cluster and identify the unique characteristics of each cluster. This can provide insights into the different segments or groups of customers in the dataset.
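
Besides comparing feature distributions, a quantitative score can complement the interpretation. The sketch below uses the silhouette score from scikit-learn, which ranges from -1 to 1 and is higher when points sit closer to their own cluster than to neighboring ones. This metric is not part of the original notebook; the synthetic `X_demo` stands in for `df_scaled`, and the range of `k` values is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_scaled so the sketch runs on its own
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Mean silhouette coefficient over all samples for this clustering
    scores[k] = silhouette_score(X_demo, labels_k)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```
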


In [5]:
# Calculate the mean values of each feature for each cluster
cluster_means = df.groupby('Cluster').mean()

# Analyze the characteristics of each cluster
for cluster in range(4):
    print(f"Cluster {cluster}:\n")
    print(cluster_means.loc[cluster])
    print("\n")

Cluster 0:

Age                         48.584473
Work_Experience              2.483769
Family_Size                  3.150635
Gender_Male                  0.619015
Ever_Married_Yes             0.999658
Graduated_Yes                0.712038
Profession_Doctor            0.071135
Profession_Engineer          0.096101
Profession_Entertainment     0.125171
Profession_Executive         0.192544
Profession_Healthcare        0.022230
Profession_Homemaker         0.029412
Profession_Lawyer            0.000000
Profession_Marketing         0.016416
Spending_Score_High          0.303352
Spending_Score_Low           0.034542
Name: 0, dtype: float64


Cluster 1:

Age                         26.142857
Work_Experience              2.503044
Family_Size                  3.718820
Gender_Male                  0.576893
Ever_Married_Yes             0.088993
Graduated_Yes                0.339578
Profession_Doctor            0.000000
Profession_Engineer          0.000000
Profession_Entertainment     0.000000


The next step is to understand each cluster better, so the company or client can design a different marketing strategy for each one:

Cluster 0: The customers in this cluster are around 48 years old on average, with about 2.5 years of work experience and 3.2 family members. About 62% are male, nearly all are married, and about 71% are graduates. They come from various professions; among the professions shown as columns, executives are the most common (about 19%), while roughly 45% fall in the baseline profession that `drop_first=True` removed from the output. Their spending score is mostly 'Average' (about 66%, the dropped baseline level), with about 30% 'High' and only about 3% 'Low'. This group can be categorized as '**Married Professionals**'.

Cluster 1: This cluster represents the youngest group, with an average age of 26. They have similar work experience to the first cluster, around 2.5 years, and they typically have larger families, with an average size of 3.7. About 58% are male and only about 9% are married. Most of them are not graduates (about 34% are), and they work in healthcare. Their spending score is predominantly 'low'. This group can be categorized as '**Young Healthcare Workers**'.

Cluster 2: The customers in this cluster are around 40 years old on average, with slightly higher work experience (3 years) and smaller family sizes (2.4). The gender distribution is slightly skewed towards females, and the majority are not married. A significant proportion are graduates, and they come from diverse professions. Their spending score is predominantly 'low'. This group can be categorized as '**Independent Professionals**'.

Cluster 3: This is the oldest group with an average age of 75. They have the lowest work experience (1.4 years), and the smallest family sizes (2.0). The gender distribution is fairly even, and most of them are married. Around 63% are graduates and all of them work as lawyers. Their spending score is split fairly evenly between 'high' and 'low'. This group can be categorized as '**Independent Lawyers**'.
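
Cluster profiles like these are easier to sanity-check with a two-dimensional picture. One common option is to project the scaled features onto their first two principal components and color the points by cluster label. The sketch below is an assumption about how such a plot could be produced, not a step taken from the original notebook; it uses synthetic data in place of `df_scaled` and `labels`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic six-feature stand-in for df_scaled
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=6, random_state=0)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_demo)

fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_demo, cmap="viridis", s=15)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("K-means clusters in PCA space (synthetic data)")
ax.legend(*scatter.legend_elements(), title="Cluster")
# Call plt.show() in a notebook, or fig.savefig(...) in a script, to view the figure

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio indicates how much of the overall spread the two plotted components capture. When it is low, clusters that look overlapping in the plot may still be well separated in the full feature space.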

### 7. Conclusion

In this project, we performed a clustering analysis on a dataset of customer information to identify distinct segments or groups within the customer base. The dataset included variables such as gender, age, marital status, profession, and spending score. Our objective was to gain insights into customer behavior and preferences by clustering similar customers together.

We started by conducting exploratory data analysis (EDA) to understand the characteristics and distributions of the variables. We then preprocessed the data by handling missing values, converting categorical variables into numerical form, and scaling the features as necessary.

Next, we applied a clustering algorithm (K-means) to identify clusters in the data. The clustering algorithm grouped similar customers together based on their attributes, allowing us to identify distinct segments within the customer base.

Finally, we profiled each cluster by the mean value of every feature to understand its characteristics. These profiles helped us identify patterns within the clusters and differences between them.

The identified clusters can provide valuable insights for targeted marketing strategies and personalized recommendations. By understanding the preferences and behaviors of different customer segments, businesses can tailor their offerings and communication strategies to the specific needs of each group.