# Customer Segmentation Project

### Introduction


Customer segmentation is a powerful technique used in the field of marketing to divide a customer base into distinct groups or segments based on shared characteristics, behaviors, and preferences. These segments enable businesses to gain valuable insights into their customers and tailor their strategies to effectively target each group.

In this project, we will explore customer segmentation for an online retailer. The project follows a structured approach: data loading and exploration, data preprocessing, exploratory data analysis, customer segmentation techniques, segment profiling, and interpretation. Throughout the process, we use Python libraries such as pandas, NumPy, and scikit-learn for data manipulation, analysis, and modeling tasks.

This project is based on the "Customer Segmentation Dataset" by **Yasser H.**, which can be found on Kaggle. The dataset contains customer transactions for an online retail store.

You can access the dataset at [Kaggle - Customer Segmentation Dataset](https://www.kaggle.com/datasets/yasserh/customer-segmentation-dataset).

### Libraries:
Import necessary libraries and modules:  
**Data manipulation**:
* `Pandas`
* `NumPy`
  
**Visualisation**:
* `Matplotlib`
* `Seaborn`
* `Missingno`  

**Machine Learning and preprocessing**:  
* `Scikit-learn`  


In [1]:
import os
import warnings
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import missingno as no
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Ignore warning messages
warnings.filterwarnings('ignore')



In [2]:
# Iterate through the '/kaggle/input' directory and its subdirectories
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        # Print the full file path of each file
        print(os.path.join(dirname, filename))

/kaggle/input/customer-segmentation-dataset/Online Retail.xlsx


### Load Data

In [None]:
data  = pd.read_excel('/kaggle/input/customer-segmentation-dataset/Online Retail.xlsx', sheet_name='Online Retail')

### Exploratory Data Analysis (EDA)  
EDA techniques are applied to gain insights into the dataset, identify patterns, and understand the distribution of variables.

In [None]:
data.head(5)

In [None]:
data.info()

In [None]:
data.isna().sum()

In [None]:
no.bar(data)
plt.show()

In [None]:
# Calculate the percentage of missing data in the CustomerID column
print(f'Percentage of missing data from CustomerID column is: {round(data.CustomerID.isna().sum() / data.shape[0] *100,2)}%')

Missing values appear primarily in the `CustomerID` and `Description` columns. Because the segmentation is done per customer, rows without a `CustomerID` cannot be attributed to anyone, so the percentage of missing `CustomerID` values tells us how much of the data is unusable for this analysis.
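
The same check can be run for every column at once. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and only stands in for the retail dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the retail data
toy = pd.DataFrame({
    'CustomerID': [17850.0, np.nan, 13047.0, np.nan],
    'Description': ['LANTERN', None, 'MUG', 'CLOCK'],
    'Quantity': [6, 2, 8, 1],
})

# isna().mean() is the fraction of missing values per column
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```

Applied to the real data, the same expression would be `data.isna().mean() * 100`.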


Before removing null values from the 'Description' column, we can replace those null values with a placeholder value to indicate that the description information is missing. This step ensures that we retain the information about missing descriptions and can still use it in our analysis.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
data['Description'] = data['Description'].fillna('No Description')

In [None]:
data['Description'].isna().sum()

Let's remove any remaining rows that contain null values.

In [None]:
# Take an explicit copy so later column assignments act on an independent frame
df = data.dropna().copy()

In [None]:
df.isna().sum()

In [None]:
df.shape

In [None]:
df.describe()

We notice negative values in the `Quantity` column and zero values in the `UnitPrice` column. Let's assume that we consulted stakeholders and they confirmed that the negative quantities are typographical errors and that the correct values are positive. We will therefore use the absolute value of `Quantity` for our analysis and further processing. They also confirmed that zero `UnitPrice` values are invalid entries or represent missing data, so, per their guidance, we will remove these records from the dataset.

In [None]:
df['Quantity'] = df['Quantity'].abs()

In [None]:
# Keep only rows with a non-zero unit price
df = df[df['UnitPrice'] != 0].copy()
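
Before applying fixes like these, it can help to count how many rows they touch. A minimal sketch on made-up values (the `toy` frame is hypothetical) that counts the affected rows and then applies the same two cleaning steps:

```python
import pandas as pd

# Hypothetical toy frame with two negative quantities and one zero price
toy = pd.DataFrame({
    'Quantity': [6, -2, 8, -1],
    'UnitPrice': [2.55, 3.39, 0.0, 1.25],
})

# Count the rows each cleaning rule would affect
n_negative_qty = int((toy['Quantity'] < 0).sum())
n_zero_price = int((toy['UnitPrice'] == 0).sum())

# Apply the same rules as above: absolute quantities, drop zero prices
cleaned = toy.assign(Quantity=toy['Quantity'].abs())
cleaned = cleaned[cleaned['UnitPrice'] != 0]
print(n_negative_qty, n_zero_price, cleaned.shape)
```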

### Feature Engineering

Now that the dataset appears to be in good condition, our focus is shifting towards customer segmentation. In order to accomplish this, we will transform the current sales data into a customer-level perspective.

We will use the RFM marketing metrics to look at customer behaviour based on:  
* **Recency**: How many days have passed since the customer's last purchase.
* **Frequency**: How many times the customer has shopped here.
* **Monetary Value**: How much money the customer has spent.

![RFM](https://d35fo82fjcw0y8.cloudfront.net/2018/03/01013508/Incontent_image.png)

Calculate the total price for each transaction: This step creates a new column named `TotalPrice` by multiplying the `Quantity` and `UnitPrice` columns. It calculates the total monetary value for each transaction, which can be useful for customer segmentation.

In [None]:
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

Find the most recent date in the dataset: Identifying the most recent date in the `InvoiceDate` column helps in calculating the recency of customer transactions.

##### **Why didn't we use the `Country` column?**  
When dealing with categorical variables like the `Country` column in a dataset with a large number of unique categories (37 countries in this case), it is essential to consider the appropriate approach for handling such data.  
Using the `Country` column directly as a feature in clustering algorithms like K-means might not be the best approach because it could lead to a high-dimensional feature space and potential inefficiencies in the clustering process. The large number of unique categories may introduce noise and make it challenging for the algorithm to identify meaningful patterns.
If you believe that the `Country` information is essential for segmentation, you could consider aggregating countries into broader regions or continents. This way, you reduce the number of categories while still capturing regional trends.  
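
As a rough illustration of that aggregation idea, the sketch below maps a few country names to broader regions and one-hot encodes the result. The `REGION_MAP` dictionary and its region names are hypothetical; a real mapping would need to cover every country in the data.

```python
import pandas as pd

# Hypothetical mapping; extend it to cover every country in the dataset
REGION_MAP = {
    'United Kingdom': 'UK & Ireland',
    'EIRE': 'UK & Ireland',
    'France': 'Continental Europe',
    'Germany': 'Continental Europe',
    'Australia': 'Asia-Pacific',
    'Japan': 'Asia-Pacific',
}

countries = pd.Series(['United Kingdom', 'France', 'Japan', 'Brazil'], name='Country')

# Countries without a mapping fall into a catch-all 'Other' region
regions = countries.map(REGION_MAP).fillna('Other')

# One indicator column per region instead of one per country
region_dummies = pd.get_dummies(regions, prefix='Region')
print(region_dummies.columns.tolist())
```

Since these are transaction-level values, they would still have to be reduced to one value per customer before being joined to the customer-level table.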

Group the data by `CustomerID` and calculate Recency, Frequency, and Monetary values: this step groups the data by `CustomerID` and calculates three metrics: recency (the number of days between the most recent date in the dataset and the customer's latest `InvoiceDate`), frequency (the number of invoice rows for each customer; `count` tallies line items, so an invoice with several products is counted once per product), and monetary value (the sum of `TotalPrice` for each customer).

In [None]:
most_recent_date = df['InvoiceDate'].max()
customer_df = df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (most_recent_date - x.max()).days,
                                            'InvoiceNo': 'count',
                                            'TotalPrice': 'sum'})
customer_df.rename(columns={'InvoiceDate':'Recency', 'InvoiceNo':'Frequency', 'TotalPrice':'Monetary'}, inplace=True)
customer_df.head(5)
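
The aggregation above uses `count`, which tallies rows. If frequency should mean distinct invoices instead, `nunique` is one option. The sketch below shows that variant on a hypothetical four-row frame using named aggregation; it is an alternative definition, not the one used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical transactions: customer 1 has two invoices (three rows), customer 2 has one
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-01', '2011-12-05', '2011-11-25']),
    'TotalPrice': [10.0, 5.0, 20.0, 7.5],
})

snapshot = toy['InvoiceDate'].max()
rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),  # distinct invoices, not rows
    Monetary=('TotalPrice', 'sum'),
)
print(rfm)
```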

In [None]:
customer_df.shape

In [None]:
customer_df.describe()

In [None]:
sns.pairplot(customer_df)

Standardize the customer dataframe using StandardScaler: K-means relies on Euclidean distances, so a feature with a much larger range (such as `Monetary`) would otherwise dominate the clustering. Standardizing puts all the variables on the same scale.

In [None]:
scaler = StandardScaler()
norm_df = scaler.fit_transform(customer_df)
norm_df
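
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks this against a manual NumPy computation on a small made-up matrix `X` (the values are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix with columns on very different scales
X = np.array([[1.0, 10.0, 100.0],
              [5.0, 20.0, 400.0],
              [9.0, 60.0, 1000.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score: (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))
```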

### Model Building

Perform K-means clustering with different numbers of clusters (from 1 to 10) and record the inertia for each.

In [None]:
inertia = []
for k in range(1,11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(norm_df)
    inertia.append(kmeans.inertia_)

Plot the inertia values 

In [None]:
plt.plot(range(1,11), 
         inertia,
         marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

Judging by the elbow in the inertia plot, `3` appears to be the most appropriate number of clusters.
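
Reading an elbow by eye is subjective, so it can help to print how much the inertia falls at each step. The sketch below does this on synthetic data from `make_blobs` with three well-separated groups. The data and the `rel_drop` helper are assumptions for illustration, not results from the retail dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

ks = list(range(1, 8))
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

# Relative drop in inertia when moving from k-1 to k clusters
rel_drop = {k: (inertia[i - 1] - inertia[i]) / inertia[i - 1]
            for i, k in enumerate(ks) if i > 0}

for k, drop in rel_drop.items():
    print(f'k={k}: inertia falls by {drop:.1%}')
```

A large drop followed by much smaller ones marks the elbow; on this synthetic data the drop at `k=3` is far larger than the drop at `k=4`.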

Let's validate our findings using the silhouette score.

In [None]:
sil = []
for k in range(2,11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(norm_df)
    sil.append(silhouette_score(norm_df, kmeans.labels_))

In [None]:

plt.plot(range(2,11), sil, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.show()

The silhouette scores for 3, 4, and 5 clusters were approximately the same. The inertia decreased significantly from 3 to 4 clusters and comparatively little from 4 to 5 clusters. We selected 3 clusters as the number of segments for customer segmentation.
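
When the scores are close, it can be useful to look at the numbers rather than the plot. The sketch below stores the silhouette score per `k` in a dictionary and picks the maximum. It runs on synthetic `make_blobs` data, which is an assumption for illustration and not the retail dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the best-separated candidate
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The highest score is only one signal; segment sizes and business interpretability can still justify a different choice.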

### Further Analysis and Interpretation

Analyze the segmented customer data in more detail to understand the characteristics and behaviors of each customer segment.

Perform K-means clustering with the final number of clusters

In [None]:
final_kmeans = KMeans(n_clusters=3,random_state=42)
final_kmeans.fit(norm_df)


Create a new dataframe with customer information and assigned clusters

In [None]:
final_df = customer_df.copy()
final_df['Cluster'] = final_kmeans.labels_ + 1  # shift labels so clusters are numbered from 1 instead of 0
final_df.head(10)

Visualize the distribution of clusters 

In [None]:
# discrete=True centres one bar on each integer cluster label
sns.histplot(final_df.Cluster, discrete=True)
plt.xticks(range(1,4))
plt.show()

In [None]:
final_df.groupby('Cluster').agg({'Monetary':'mean',
                                 'Frequency':'mean',
                                 'Recency':'mean'})
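
Another way to profile segments is to map the K-means centroids back to the original units with `inverse_transform`. The sketch below does this on a hypothetical six-customer RFM table with two clusters; the data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM table: three active big spenders, three lapsed small spenders
rfm = pd.DataFrame({
    'Recency': [2, 5, 3, 200, 250, 220],
    'Frequency': [50, 40, 45, 2, 1, 3],
    'Monetary': [5000.0, 4200.0, 4600.0, 80.0, 40.0, 120.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(rfm)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)

# Convert centroids from standardized space back to days, counts, and currency
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=rfm.columns)
print(centers.round(1))
```

Because a K-means centroid is the mean of its members, these values match the per-cluster means computed with `groupby` once the algorithm has converged.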

After segmenting the customer data into three distinct clusters, we conducted a detailed analysis to understand the characteristics and behaviors of each group. The segmentation revealed clear patterns in customer recency, purchase frequency, and monetary value.  
* **Cluster 1: "High-Value Regular Customers"**  
This cluster consists of customers with relatively high monetary value, moderate frequency, and recent transactions. They are likely to be loyal and valuable customers who make regular purchases.  

* **Cluster 2: "Low-Value Occasional Customers"**  
This cluster includes customers with lower monetary value, lower frequency, and higher recency values (more days since their last purchase). They might be occasional buyers who make infrequent purchases.  

* **Cluster 3: "High-Value VIP Customers"**  
This cluster represents customers with exceptionally high monetary value, high frequency, and very recent transactions. They are top-tier customers who contribute significantly to the business's revenue and should be treated as VIPs.

It's important to note that this project is a starting point, and further analyses and experiments can be performed to gain deeper insights and refine the segmentation strategy. Additionally, business-specific factors and domain knowledge should be taken into account when interpreting the results and making strategic decisions based on the customer segments.