<a href="https://colab.research.google.com/github/chota-mota01/Capstone_Unsupervised_Online_Retail_Customer_Segmentation/blob/main/Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Online Retail Customer Segmentation**



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

The project centered around analyzing transactional data from a UK-based non-store online retail company specializing in unique all-occasion gifts. With the goal of identifying distinct customer segments, we embarked on a comprehensive analysis employing RFM (Recency, Frequency, Monetary) analysis and clustering techniques.

Initially, the dataset was preprocessed: null values and duplicates were handled, and the 'InvoiceDate' column was converted to datetime format. These steps ensured data integrity and prepared the data for analysis. For visualization, we used the seaborn and matplotlib libraries with several chart types, such as bar charts, distplots, scatter plots, count plots, correlation heatmaps, and pair plots. These visualizations helped to simplify complex data and make it more understandable. The United Kingdom leads in transaction count, with Thursdays and November showing the highest activity. Transactions peak at 12 pm, and the afternoon is the busiest period for product sales.

RFM analysis played a pivotal role in segmenting customers based on their transactional behavior. By categorizing customers into segments such as high-value, loyal, at-risk, and dormant, we gained valuable insights into their purchasing patterns and engagement levels. This segmentation laid the foundation for targeted marketing strategies and personalized offerings tailored to each customer segment's preferences and needs.

In parallel, we employed clustering algorithms to further delineate customer segments and determine the optimal number of clusters. Through methods like K-Means with silhouette score, K-Means with elbow method, and agglomerative clustering, we identified clear segmentation among customers, with an optimal number of clusters determined to be 2. This finding provided actionable insights into customer behavior and preferences, enabling businesses to refine their marketing initiatives and operational strategies effectively.

In summary, the project underscored the significance of data-driven approaches in understanding customer dynamics and driving business growth. By leveraging RFM analysis and clustering techniques, businesses can unlock valuable insights, optimize customer engagement strategies, and foster long-term relationships with their customer base, ultimately leading to enhanced customer satisfaction and sustainable business success.

# **GitHub Link -**

https://github.com/chota-mota01/Capstone_Unsupervised_Online_Retail_Customer_Segmentation

# **Problem Statement**


The objective of this project is to segment customers based on a transnational dataset that includes all transactions between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company specializes in selling unique all-occasion gifts, and a significant portion of its customer base consists of wholesalers. The goal is to identify distinct customer segments within the dataset to better understand customer behavior and tailor marketing strategies accordingly.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Standard library
import math
import warnings
from datetime import datetime

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from pylab import rcParams

# Clustering and evaluation
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

%matplotlib inline
sns.set()
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Reading Data
path = '/content/drive/My Drive/Colab Notebooks/Online Retail.xlsx'
cus_data = pd.read_excel(path)

### Dataset First View

In [None]:
# Dataset First Look
# head() method returns first 5 rows of the dataset
cus_data.head()

In [None]:
# If number is specified, head() returns specified number of first rows
cus_data.head(3)

In [None]:
# Dataset Last Look
# tail() method returns last 5 rows of the dataset
cus_data.tail()

In [None]:
# If number is specified, tail() returns specified number of last rows
cus_data.tail(3)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
cus_data.shape

### Dataset Information

In [None]:
# Dataset Info
cus_data.info()

In [None]:
# Columns present in the dataset
list(cus_data.columns)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
cus_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
cus_data.isna().sum().sum()

In [None]:
# Missing/null value count per column
cus_data.isnull().sum()

In [None]:
# Visualizing the missing values
# Check null values by plotting a heatmap (colour bar disabled)
plt.figure(figsize=(12,6))
sns.heatmap(cus_data.isnull(), cbar=False)

### What did you know about your dataset?

The transnational dataset includes all transactions between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. We need to analyze the important factors in the dataset for customer segmentation.

The dataset has 541909 rows and 8 columns. It contains 136534 missing/null values and 5268 duplicate rows. The null values in the columns 'Description' and 'CustomerID' number 1454 and 135080 respectively.

Using the seaborn library, we visualized the missing/null values as a heatmap.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
cus_data.columns

In [None]:
# Dataset Describe
cus_data.describe(include='all')

### Variables Description

* **InvoiceNo :** 6-digit integral number uniquely assigned to each transaction; a code starting with the letter 'C' indicates a cancellation. (Nominal)

* **StockCode :** 5-digit integral number uniquely assigned to each distinct product.(Nominal)

* **Description :** Item name(Nominal)

* **Quantity :** Quantities of each item per transaction (Numeric)

* **InvoiceDate :** The day and time when each transaction was generated (Numeric)

* **UnitPrice :** Product price per unit in sterling (Numeric)

* **CustomerID :** a 5-digit integral number uniquely assigned to each customer(Nominal)

* **Country :** Name of the country where each customer resides(Nominal)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in cus_data:
  print(cus_data[column].unique())

In [None]:
# Count of Unique Values for each variable.
for col in cus_data:
  print("Count of unique values in",col,"is",cus_data[col].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create copy of the dataset
cus_df = cus_data.copy()
cus_df.columns

In [None]:
# Drop Null values
cus_df.dropna(inplace=True)
cus_df.info()

In [None]:
# Convert InvoiceNo to string so that cancellation codes can be detected
cus_df['InvoiceNo'] = cus_df['InvoiceNo'].astype('str')

In [None]:
# Drop cancelled invoices (InvoiceNo starting with 'C')
cus_df = cus_df[~cus_df['InvoiceNo'].str.startswith('C')]

In [None]:
# Check Duplicates
cus_df[cus_df.duplicated()==True].head()

In [None]:
# Drop Duplicates
cus_df.drop_duplicates(inplace=True)

In [None]:
# Recheck Duplicate Sum
cus_df.duplicated().sum()

In [None]:
cus_df.shape

In [None]:
cus_df.describe()

In [None]:
# Keep only rows with a unit price greater than 0
cus_df = cus_df[cus_df['UnitPrice'] > 0]

In [None]:
# Convert InvoiceDate column into datetime format
cus_df["InvoiceDate"] = pd.to_datetime(cus_df["InvoiceDate"], format="%Y-%m-%d %H:%M:%S")

In [None]:
# Create a new feature holding the day name of each invoice
cus_df['Day'] = cus_df['InvoiceDate'].dt.day_name()

In [None]:
# Create new date and time features from InvoiceDate
cus_df["year"] = cus_df["InvoiceDate"].dt.year
cus_df["month_num"] = cus_df["InvoiceDate"].dt.month
cus_df["day_num"] = cus_df["InvoiceDate"].dt.day
cus_df["hour"] = cus_df["InvoiceDate"].dt.hour
cus_df["minute"] = cus_df["InvoiceDate"].dt.minute

In [None]:
# Create a new feature TotalAmount
cus_df['TotalAmount']=cus_df['Quantity']*cus_df['UnitPrice']

In [None]:
# Create a new feature holding the month name of each invoice
cus_df['Month'] = cus_df['InvoiceDate'].dt.month_name()

### What all manipulations have you done and insights you found?

While analyzing the dataset, we found many null values and duplicate values. Before manipulating the data, we created a copy of the dataset so that the changes would not affect the original.

The null values, cancelled invoices, and duplicates were dropped, which left 392732 of the original 541909 rows. Rows with a unit price of 0 or less were then removed. We converted the 'InvoiceDate' column into datetime format and created new features from it (day name, month name, year, month, day, hour, and minute), along with 'TotalAmount' (Quantity multiplied by UnitPrice).

These manipulations make the dataset easier to visualize and analyze.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Barplot for Top 5 Stocks

In [None]:
# Chart - 1 visualization code
# Top 5 Stock Code
# Count rows per stock code; naming the axis and the count column explicitly
# keeps the column names stable across pandas versions
Stcode = cus_df['StockCode'].value_counts().rename_axis('Stock Code').reset_index(name='Count')
Stcode.head()

In [None]:
plt.figure(figsize=(12,7))
plt.title('Top 5 Stock')
sns.barplot(x='Stock Code',y='Count',data=Stcode[:5],palette='pastel')

##### 1. Why did you pick the specific chart?

This chart was chosen because a bar graph represents data as vertical or horizontal rectangular bars whose lengths are proportional to the values they measure. It is a fundamental visualization for comparing categories against a numeric value.

Here it shows the five stock codes that appear most often in the data.

##### 2. What is/are the insight(s) found from the chart?

The five most frequent stock codes are 85123A, 22423, 85099B, 84879 and 47566.

#### Chart - 2 - Barplot for Bottom 5 Stocks

In [None]:
# Chart - 2 visualization code
# Bottom 5 Stock Name
plt.figure(figsize=(12,7))
plt.title('Bottom 5 Stock')
sns.barplot(x='Stock Code',y='Count',data=Stcode[-5:],palette='deep')

##### 1. Why did you pick the specific chart?

This chart was chosen because a bar plot is a visualization technique used to display categorical data with rectangular bars. The length of each bar represents the frequency or count of data in each category.

##### 2. What is/are the insight(s) found from the chart?

The insight found from the chart tells us the bottom 5 stock code - 90059A, 20678, 90059D, 90168 and 23843.

#### Chart - 3 - Distplot for Quantity Distribution

In [None]:
# Chart - 3 visualization code
#distribution of Quantity
plt.figure(figsize=(12,8))
plt.title('Quantity Distribution')
sns.distplot(cus_df['Quantity'],color="Green")

##### 1. Why did you pick the specific chart?

This chart was chosen because a distplot, short for "distribution plot," represents the distribution of a univariate dataset. It combines a histogram with a kernel density estimate (KDE) plot, providing a visual summary of the distribution of values in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the distribution of Quantity on its raw scale.



#### Chart - 4 - Distplot for Log Quantity Distribution

In [None]:
# Chart - 4 visualization code
# Distribution of Quantity
plt.figure(figsize=(12,8))
plt.title('Log Quantity Distribution')
sns.distplot(np.log(cus_df['Quantity']),color="Green")

##### 1. Why did you pick the specific chart?

This chart was chosen because a distplot, short for "distribution plot," represents the distribution of a univariate dataset. It combines a histogram with a kernel density estimate (KDE) plot, providing a visual summary of the distribution of values in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The shape of the quantity distribution can be seen much more clearly after the logarithmic transformation.
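The effect of a log transformation on a skewed variable can be sketched on synthetic data. The `quantities` series below is made up for illustration (it is not drawn from the retail dataset); the sketch simply compares sample skewness before and after applying `np.log`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed positive values standing in for 'Quantity'
# (an assumption for illustration only, not the real data)
rng = np.random.default_rng(0)
quantities = pd.Series(rng.lognormal(mean=1.5, sigma=1.0, size=5000))

# Sample skewness before and after the log transformation
skew_raw = quantities.skew()
skew_log = np.log(quantities).skew()

print(f"Skewness (raw): {skew_raw:.2f}")
print(f"Skewness (log): {skew_log:.2f}")
```

A skewness close to 0 after the transformation suggests a roughly symmetric distribution, which is why the log-scaled plot is easier to read. Note that `np.log` requires strictly positive values.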

#### Chart - 5 - Barplot for Top 10 Countries for Order

In [None]:
# Chart - 5 visualization code
# Plot top 10 countries as percentage of total order
top_coun = cus_df.Country.value_counts()[0:10]/len(cus_df)*100
top_coun=pd.DataFrame(top_coun)
top_coun.columns=['Percent of total orders']
top_coun

In [None]:
sns.barplot(data=top_coun,y=top_coun.index,x=top_coun['Percent of total orders'],color='darkmagenta')
plt.title('Top 10 Countries for Order')
plt.ylabel('Country')


##### 1. Why did you pick the specific chart?

The particular chart was selected because a bar plot is a visualization technique used to display categorical data with rectangular bars. The length of each bar represents the frequency or count of data in each category.

##### 2. What is/are the insight(s) found from the chart?

The following chart gives us top 10 countries according to orders. United Kingdom holds the topmost position.

#### Chart - 6 - Distplot for Unit Price Distribution

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12,8))
plt.title('Unit Price Distribution')
sns.distplot(cus_df['UnitPrice'],color="Blue")

##### 1. Why did you pick the specific chart?

This chart was chosen because a distplot, short for "distribution plot," represents the distribution of a univariate dataset. It combines a histogram with a kernel density estimate (KDE) plot, providing a visual summary of the distribution of values in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most unit prices are concentrated close to 0.

#### Chart - 7 - Barplot for Top 10 Product Name

In [None]:
# Chart - 7 visualization code
# Top 10 Product Name
# Count rows per product description; explicit names keep the columns stable across pandas versions
des = cus_df['Description'].value_counts().rename_axis('Description_Name').reset_index(name='Count')
des.head(10)

In [None]:
plt.figure(figsize=(25,12))
plt.title('Top 10 Product Name')
sns.barplot(x='Description_Name',y='Count',data=des[:10],palette='colorblind')

##### 1. Why did you pick the specific chart?

Bar graphs represent data as vertical or horizontal rectangular bars whose lengths are proportional to the values they measure. They are a fundamental visualization for comparing categories against a numeric value.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top 10 product names by number of occurrences.

#### Chart - 8 - Barplot for Bottom 5 Product

In [None]:
des.tail()

In [None]:
# Chart - 8 visualization code
# Barplot for Bottom 5 Product
plt.figure(figsize=(15,6))
plt.title('Bottom 5 Product Name')
sns.barplot(x='Description_Name',y='Count',data=des[-5:],palette='icefire')

##### 1. Why did you pick the specific chart?

Bar graphs represent data as vertical or horizontal rectangular bars whose lengths are proportional to the values they measure. They are a fundamental visualization for comparing categories against a numeric value.

##### 2. What is/are the insight(s) found from the chart?

The following chart represents the bottom 5 product name.

#### Chart - 9 - Barplot for Days of the Week

In [None]:
# Count rows per day of the week; explicit names keep the columns stable across pandas versions
day_det = cus_df['Day'].value_counts().rename_axis('Day_Name').reset_index(name='Count')
day_det

In [None]:
# Chart - 9 visualization code
# Days of the Week
plt.figure(figsize=(13,8))
plt.title('Days of the Week')
sns.barplot(x='Day_Name',y='Count',data=day_det,palette ='rocket_r')

##### 1. Why did you pick the specific chart?

Bar graphs represent data as vertical or horizontal rectangular bars whose lengths are proportional to the values they measure. They are a fundamental visualization for comparing categories against a numeric value.

##### 2. What is/are the insight(s) found from the chart?

The following chart represents the count of orders placed on different days of the week. The maximum count is obtained on Thursday.

#### Chart - 10 - Barplot for Months of the Year

In [None]:
# Count rows per month; explicit names keep the columns stable across pandas versions
month_det = cus_df['Month'].value_counts().rename_axis('Month_Name').reset_index(name='Count')
month_det

In [None]:
# Chart - 10 visualization code
# Months of the Year
plt.figure(figsize=(13,8))
plt.title('Months of the Year')
sns.barplot(x='Month_Name',y='Count',data=month_det, palette ='bright')

##### 1. Why did you pick the specific chart?

The particular chart is used because bar graph, also known as a bar chart, is a visual representation of data using rectangular bars. Each bar represents a category, and the height or length of the bar corresponds to the value of that category.

##### 2. What is/are the insight(s) found from the chart?

The following chart represents the count of orders placed on different months of the year. The maximum count is obtained in the month of November.

#### Chart - 11 - Barplot for Hour

In [None]:
# Count rows per hour of the day; explicit names keep the columns stable across pandas versions
hour_det = cus_df['hour'].value_counts().rename_axis('Hour_Name').reset_index(name='Count')
hour_det

In [None]:
# Chart - 11 visualization code
# Barplot for Hour
plt.figure(figsize=(13,8))
plt.title('Hour')
sns.barplot(x='Hour_Name',y='Count',data=hour_det,palette='pastel')

##### 1. Why did you pick the specific chart?

A bar graph, also known as a bar chart, is a visual representation of data using rectangular bars. Each bar represents a category, and the height or length of the bar corresponds to the value of that category.

##### 2. What is/are the insight(s) found from the chart?

The following chart represents the count of orders per hour. The maximum count is obtained at 12 pm.

#### Chart - 12 - Barplot for Periods of the Day

In [None]:
# Map an hour of the day (0-23) to a named period
def time_period(time):
  if 6 <= time <= 11:
    return 'Morning'
  elif 12 <= time <= 17:
    return 'Afternoon'
  else:
    return 'Evening'

In [None]:
cus_df['time_period']=cus_df['hour'].apply(time_period)


In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(12,6))
plt.title('Periods of the day')
sns.countplot(x='time_period',data=cus_df,palette='muted')

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the most products are sold during the afternoon.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Restrict to numeric columns, since corr() is not defined for text columns
plt.figure(figsize=(20,5))
cor = sns.heatmap(cus_df.select_dtypes(include='number').corr(), annot=True)

##### 1. Why did you pick the specific chart?

Correlation heatmaps are a type of plot that visualize the strength of relationships between numerical variables. Correlation plots are used to understand which variables are related to each other and the strength of this relationship.

I used the correlation heatmap to find correlation between all the variables along with correlation coefficient.

##### 2. What is/are the insight(s) found from the chart?

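The correlation heatmap reports the pairwise correlation coefficient between the numeric columns. Because this is an unsupervised problem, there is no dependent variable; the chart is mainly useful for spotting features that carry overlapping information. For example, TotalAmount is computed from Quantity and UnitPrice, so it is related to both by construction.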

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# sns.pairplot(cus_df)
# plt.show()
'''The pairplot for RFM is represented below'''

##### 1. Why did you pick the specific chart?

Pairplot visualizes given data to find the relationship between them where the variables can be continuous or categorical. Pairplot allows us to plot pairwise relationships between variables within a dataset.

##### 2. What is/are the insight(s) found from the chart?

The pairplot basically plots entire dataframe. Plots between each column take place in pairplot and a big plot is created to compare overall relationship between each column. This creates nice visualization and helps us understand the large amount of data in a single figure.

**RFM Analysis**

RFM analysis is a customer segmentation technique used in marketing to categorize customers based on their purchasing behavior. RFM stands for Recency, Frequency, and Monetary Value.

Recency, Frequency, and Monetary (RFM) are three key metrics used in RFM analysis to evaluate customer behavior:

Recency: Measures how recently a customer made a purchase. It indicates the time elapsed since the customer's last transaction. Customers who made a purchase more recently are considered more engaged and valuable.

Frequency: Measures how often a customer makes purchases within a specific period. It indicates the number of transactions made by the customer over time. Customers with higher frequency are typically more loyal and engaged.

Monetary: Measures the monetary value of a customer's purchases. It represents the total amount of money spent by the customer on purchases. Customers who spend more are considered higher-value customers.

By analyzing these three dimensions, RFM analysis helps businesses identify different customer segments, such as high-value customers, loyal customers, at-risk customers, and dormant customers.
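The three metrics described above can be sketched on a tiny, made-up table. The `toy` DataFrame, its invoice numbers, and the `snapshot` reference date are assumptions for illustration only. This sketch counts distinct invoices per customer as the frequency, which is one common convention and may differ from how a given analysis counts transactions.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the retail dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["1001", "1002", "1003", "1004", "1005", "1006"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-05", "2011-06-10",
        "2011-07-15", "2011-08-20", "2011-12-09",
    ]),
    "TotalAmount": [20.0, 35.0, 10.0, 15.0, 5.0, 100.0],
})

# Reference date: one day after the last invoice in the toy data
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    # Days since the customer's most recent purchase
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent
    Monetary=("TotalAmount", "sum"),
)
print(toy_rfm)
```

In this toy table, customer 3 is the most recent and highest-spending buyer but has purchased only once, while customer 2 buys often but has not purchased for a long time. This is the kind of contrast that RFM segments are meant to expose.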

In [None]:
# Calculating RFM Scores
# Recency = Latest date - last invoice date, Frequency = number of invoice rows for the customer,
# Monetary = sum of TotalAmount for each customer
import datetime as dt

# Set Latest date 2011-12-10 as last invoice date was 2011-12-09. This is to calculate the number of days from recent purchase
Latest_Date = dt.datetime(2011,12,10)

#Create RFM Modelling scores for each customer
rfm_data = cus_df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

#Convert Invoice Date into type int
rfm_data['InvoiceDate'] = rfm_data['InvoiceDate'].astype(int)

#Rename column names to Recency, Frequency and Monetary
rfm_data.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalAmount': 'Monetary'}, inplace=True)

rfm_data.reset_index().head()

In [None]:
# Descriptive Statistics (Recency)
rfm_data.Recency.describe()


In [None]:
# Recency distribution plot
x = rfm_data['Recency']
plt.figure(figsize=(11,6))
sns.distplot(x)

In [None]:
# Descriptive Statistics (Frequency)
rfm_data.Frequency.describe()

In [None]:
# Frequency distribution plot, taking observations which have frequency less than 1000
x = rfm_data.query('Frequency < 1000')['Frequency']
plt.figure(figsize=(11,6))
sns.distplot(x)

In [None]:
# Descriptive Statistics (Monetary)
rfm_data.Monetary.describe()

In [None]:
# Monetary distribution plot, taking observations with value less than 10000
x = rfm_data.query('Monetary < 10000')['Monetary']
plt.figure(figsize=(11,6))
sns.distplot(x)

In [None]:
# Split into four segments using quantiles
quantiles = rfm_data.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

In [None]:
quantiles

In [None]:
# Functions to create R, F and M segments
def RScoring(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

def FnMScoring(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1


In [None]:
# Calculate and add R, F and M segment value columns (1 = best quartile, 4 = worst)
rfm_data['R'] = rfm_data['Recency'].apply(RScoring, args=('Recency',quantiles,))
rfm_data['F'] = rfm_data['Frequency'].apply(FnMScoring, args=('Frequency',quantiles,))
rfm_data['M'] = rfm_data['Monetary'].apply(FnMScoring, args=('Monetary',quantiles,))
rfm_data.head()

In [None]:
# Calculate and Add RFMGroup value column showing combined concatenated score of RFM
rfm_data['RFMGroup'] = rfm_data.R.map(str) +rfm_data.F.map(str) + rfm_data.M.map(str)

# Calculate and Add RFMScore value column showing total sum of RFMGroup values
rfm_data['RFMScore'] = rfm_data[['R', 'F', 'M']].sum(axis = 1)
rfm_data.head()

In [None]:
# Replace negative and zero values with 1 to avoid undefined or infinite results during log transformation
def handle_neg_n_zero(num):
    if num <= 0:
        return 1
    else:
        return num
# Apply handle_neg_n_zero function
rfm_data['Recency'] = [handle_neg_n_zero(x) for x in rfm_data.Recency]
rfm_data['Monetary'] = [handle_neg_n_zero(x) for x in rfm_data.Monetary]

# Perform Log transformation to bring data into near normal distribution
Log_Tfd = rfm_data[['Recency', 'Frequency', 'Monetary']].apply(np.log, axis = 1).round(3)



In [None]:
# Data distribution after log transformation for Recency
Recency_Plot = Log_Tfd['Recency']
plt.figure(figsize=(11,6))
sns.distplot(Recency_Plot)

In [None]:
# Data distribution after log transformation for Frequency
# (no threshold filter is needed here: the values are already on the log scale)
Frequency_Plot = Log_Tfd['Frequency']
plt.figure(figsize=(11,6))
sns.distplot(Frequency_Plot)

In [None]:
# Data distribution after log transformation for Monetary
Monetary_Plot = Log_Tfd['Monetary']
plt.figure(figsize=(11,6))
sns.distplot(Monetary_Plot)

In [None]:
# Add log-transformed R, F and M columns (natural logarithm)
rfm_data['Recency_log'] = np.log(rfm_data['Recency'])
rfm_data['Frequency_log'] = np.log(rfm_data['Frequency'])
rfm_data['Monetary_log'] = np.log(rfm_data['Monetary'])

## ***5. Hypothesis Testing***

### Based on your chart experiments, define two hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Statement 1 - Recent customers tend to spend more than old customers.

Statement 2 - Frequent customers spend more than non-frequent customers.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Recent customers tend to spend more than old customers

Null hypothesis(H0) : Recent customers do not tend to spend more than old customers.

Alternative hypothesis(H1) : Recent customers tend to spend more than old customers.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# create two groups: recent (made a purchase within the last 30 days) and older (made a purchase more than 30 days ago)
recent = rfm_data[rfm_data['Recency'] <= 30]
older = rfm_data[rfm_data['Recency'] > 30]

# calculate mean monetary value for each group
mean_recent = np.mean(recent['Monetary'])
mean_older = np.mean(older['Monetary'])

# state the null hypothesis and alternative hypothesis
null_hypothesis = "Recent customers do not tend to have a higher monetary value than older customers"
alternative_hypothesis = "Recent customers tend to have a higher monetary value than older customers"

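# perform a one-sided two-sample t-test, matching H1 (recent customers have the higher mean)
# note: the 'alternative' argument requires SciPy 1.6 or later
t, p = ttest_ind(recent['Monetary'], older['Monetary'], equal_var=True, alternative='greater')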

# determine whether to reject the null hypothesis based on the p-value
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis. " + alternative_hypothesis)
else:
    print("Fail to reject the null hypothesis. " + null_hypothesis)

# output the mean monetary value for each group, as well as the t-statistic and p-value
print("Mean monetary value of recent customers: ", mean_recent)
print("Mean monetary value of non-recent customers: ", mean_older)
print("T-statistic: ", t)
print("P-value: ", p)
print("Degrees of freedom: ", len(recent) + len(older) - 2)

##### Which statistical test have you done to obtain P-Value?

We used a two-sample t-test.

##### Why did you choose the specific statistical test?

A two-sample t-test is a statistical hypothesis test used to determine if there is a significant difference between the means of two independent groups.
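As a minimal sketch of how such a test behaves, the example below runs a one-sided two-sample t-test on synthetic data. The arrays `group_a` and `group_b` are made up for illustration. It uses Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups; this is a swapped-in option shown for comparison, not necessarily the setting used elsewhere in this notebook. The `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic spend values for two hypothetical, independent groups
rng = np.random.default_rng(42)
group_a = rng.normal(loc=120.0, scale=30.0, size=200)  # assumed higher-spending group
group_b = rng.normal(loc=100.0, scale=30.0, size=300)

# One-sided test of H1: mean(group_a) > mean(group_b)
# equal_var=False selects Welch's t-test
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

When the hypothesis is directional ("spend more"), a one-sided alternative matches the stated H1, whereas the default two-sided test only asks whether the means differ.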

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Frequent customers spend more than non-frequent customers.

Null hypothesis(H0): Frequent customers do not spend more than non-frequent customers.

Alternative hypothesis(H1): Frequent customers spend more than non-frequent customers.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# create two groups: frequent (made more than 10 purchases) and non-frequent (made 10 or fewer purchases)
frequent = rfm_data[rfm_data['Frequency'] > 10]
non_frequent = rfm_data[rfm_data['Frequency'] <= 10]

# calculate mean monetary value for each group
mean_frequent = np.mean(frequent['Monetary'])
mean_non_frequent = np.mean(non_frequent['Monetary'])

# state the null hypothesis and alternative hypothesis
null_hypothesis = "Frequent customers do not spend more than non-frequent customers"
alternative_hypothesis = "Frequent customers spend more than non-frequent customers"

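# perform a one-sided two-sample t-test, matching H1 (frequent customers have the higher mean)
# note: the 'alternative' argument requires SciPy 1.6 or later
t, p = ttest_ind(frequent['Monetary'], non_frequent['Monetary'], equal_var=True, alternative='greater')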

# determine whether to reject the null hypothesis based on the p-value
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis. " + alternative_hypothesis)
else:
    print("Fail to reject the null hypothesis. " + null_hypothesis)

# output the mean monetary value for each group, as well as the t-statistic and p-value
print("Mean monetary value of frequent customers: ", mean_frequent)
print("Mean monetary value of non-frequent customers: ", mean_non_frequent)
# ddof=1 gives the sample standard deviation, the estimator the t-test itself relies on
print("Standard deviation of monetary value for frequent customers: ", np.std(frequent['Monetary'], ddof=1))
print("Standard deviation of monetary value for non-frequent customers: ", np.std(non_frequent['Monetary'], ddof=1))
print("T-statistic: ", t)
print("P-value: ", p)
print("Degrees of freedom: ", len(frequent) + len(non_frequent) - 2)

##### Which statistical test have you done to obtain P-Value?

We used a two-sample t-test.

##### Why did you choose the specific statistical test?

A two-sample t-test is a statistical hypothesis test used to determine if there is a significant difference between the means of two independent groups.
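
Because the hypothesis above is directional ("spend more"), a one-sided p-value is a natural fit, and Welch's variant of the test avoids assuming that both groups have equal variance. The following is a minimal sketch on synthetic data, not the notebook's `rfm_data`; the helper name `one_sided_welch` and the generated spend values are assumptions made for illustration, and the `alternative` keyword requires SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind


def one_sided_welch(high_group, low_group, alpha=0.05):
    """Test H1: mean(high_group) > mean(low_group) using Welch's t-test."""
    t_stat, p_value = ttest_ind(high_group, low_group,
                                equal_var=False, alternative='greater')
    return t_stat, p_value, p_value < alpha


# Synthetic spend values, for illustration only (not the notebook's data)
rng = np.random.default_rng(0)
frequent_spend = rng.normal(loc=500.0, scale=200.0, size=200)
other_spend = rng.normal(loc=300.0, scale=80.0, size=800)

t_stat, p_value, reject = one_sided_welch(frequent_spend, other_spend)
print("T-statistic:", t_stat)
print("One-sided p-value:", p_value)
print("Reject H0:", reject)
```

With `equal_var=False` the degrees of freedom are no longer `n1 + n2 - 2`, so that figure should only be reported alongside the pooled-variance version of the test.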

In [None]:
# Plot heatmap of the feature correlations in the dataframe
sns.heatmap(rfm_data.corr(), annot=True, cmap='Reds')

In [None]:
# Pairplot using Seaborn.
sns.pairplot(rfm_data, diag_kind='kde')

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
'''The null values in the given dataset were dropped during Data Manipulation.'''

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
'''Outliers were handled by applying a log transformation to the Recency, Frequency and Monetary features.'''
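
The exact transformation is not shown in this section, so the following is only a sketch of how a log transform can tame extreme values. It assumes `np.log1p` (that is, log(1 + x)) and uses a small made-up table named `toy_rfm`; the `_log` suffix mirrors the column names used later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy RFM table with one extreme spender (illustrative values only)
toy_rfm = pd.DataFrame({
    'Recency': [1, 10, 30, 200],
    'Frequency': [50, 5, 2, 1],
    'Monetary': [100000.0, 800.0, 150.0, 20.0],
})

# log1p(x) = log(1 + x) is defined at x = 0 and compresses large values
for col in ['Recency', 'Frequency', 'Monetary']:
    toy_rfm[col + '_log'] = np.log1p(toy_rfm[col])

print(toy_rfm[['Monetary', 'Monetary_log']])
print("Skewness before:", toy_rfm['Monetary'].skew())
print("Skewness after: ", toy_rfm['Monetary_log'].skew())
```

The transform keeps the ordering of customers intact while pulling the largest values closer to the rest, which matters for distance-based methods such as K-Means.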

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
'''Created new features in data manipulation section'''

## ***7. ML Model Implementation***

### ML Model - 1 - K-Means Clustering with Silhouette Method

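K-Means clustering is a popular unsupervised machine learning algorithm that partitions data into K distinct clusters by assigning each point to its nearest cluster centroid. The Silhouette method evaluates the quality of a clustering by comparing how close each point is to the other points in its own cluster with how close it is to the points of the nearest neighboring cluster.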

In [None]:
# ML Model - 1 Implementation
# Applying Silhouette Method on Recency, Frequency and Monetary
feature_vector = ['Recency_log', 'Frequency_log', 'Monetary_log']
X_features = rfm_data[feature_vector].values
# Standardize so each feature contributes equally to the Euclidean distances
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X_features)


In [None]:
range_n_clusters = [2,3,4,5,6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.cluster import KMeans

# Fit K-Means with k = 2; a fixed seed keeps the cluster labels reproducible
kmeans = KMeans(n_clusters=2, random_state=10)
y_kmeans = kmeans.fit_predict(X)

plt.figure(figsize=(15,10))
plt.title('Customer segmentation based on Recency, Frequency and Monetary')
# Only the first two standardized features (Recency_log, Frequency_log) are plotted
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='RdYlBu')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5)

The scatter plot shows the customers in a two-dimensional space, colored by their K-Means cluster. The clustering itself uses all three standardized RFM (Recency, Frequency, Monetary) features, while the plot displays only the first two (log Recency and log Frequency).

With two clusters, the customers separate into two distinct groups in this projection, indicating a clear segmentation.

### ML Model - 2 - K-Means Clustering with Elbow Method

The K-Means clustering algorithm is widely used for partitioning a dataset into a predetermined number of clusters. The Elbow method is a technique used to find the optimal number of clusters by plotting the within-cluster sum of squared distances (WCSS) against the number of clusters.
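
To make the WCSS quantity concrete, the sketch below recomputes it by hand and compares it with scikit-learn's `inertia_` attribute. The data come from `make_blobs` with two synthetic groups, so the names `X_demo` and `wcss_by_k` and all values are illustrative assumptions rather than the notebook's RFM features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with two well-separated groups (illustration only)
X_demo, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# WCSS: squared distance from each point to the centroid of its own cluster
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X_demo - assigned_centers) ** 2)
print("Manual WCSS:", wcss_manual)
print("km.inertia_:", km.inertia_)

# WCSS for several k; the elbow is where the decrease flattens out
wcss_by_k = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    for k in range(1, 6)
}
for k, value in wcss_by_k.items():
    print(k, round(value, 1))
```

On this toy data the drop from k = 1 to k = 2 dwarfs every later drop, which is the pattern the elbow plot is meant to reveal.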

In [None]:
# ML Model - 2 Implementation
# Applying Elbow Method on Recency , Frequency and Monetary
from sklearn.cluster import KMeans

sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_

#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
# Build the K-Means clustering model (fixed seed so the cluster labels are reproducible)
KMean_clust = KMeans(n_clusters= 2, init= 'k-means++', max_iter= 1000, random_state= 10)
KMean_clust.fit(X)

#Find the clusters for the observation given in the dataset
rfm_data['Cluster'] = KMean_clust.labels_
rfm_data.head(10)

In [None]:
# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
'''The chart is shown above (the dendrogram). The number of clusters equals the number of vertical lines crossed by a
horizontal threshold line drawn at a distance of 90.'''

### ML Model - 3 - Agglomerative Clustering

Agglomerative clustering is a hierarchical clustering technique used to group similar data points into clusters. It starts with each data point as a separate cluster and then merges the closest clusters iteratively until only one cluster remains.
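
The merge process can be inspected directly with SciPy. The sketch below uses six made-up points (not the notebook's data) and an arbitrary cut distance of 2.0, both chosen only for illustration: `linkage` records every merge, and `fcluster` cuts the resulting tree at a distance threshold, which is the same idea as drawing a horizontal line across a dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two obvious groups (illustration only)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Each row of Z describes one merge: the ids of the two clusters joined,
# the merge distance, and the number of points in the new cluster
Z = linkage(points, method='ward')
print(Z)

# Cut the tree: merges above the threshold distance are undone
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)
```

With n points there are always n - 1 merges, and the last row of `Z` is the final merge that joins everything into one cluster.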

In [None]:
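# Import necessary libraries
# Fitting hierarchical clustering to the scaled RFM features
from sklearn.cluster import AgglomerativeClustering
# Ward linkage only supports Euclidean distance, which is the default, so the
# `affinity` argument (renamed to `metric` in newer scikit-learn) is omitted
hc = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
y_hc = hc.fit_predict(X)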


In [None]:
# Visualizing the clusters (first two standardized features only)
plt.figure(figsize=(13,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')

plt.title('Clusters of Customers')
# Columns 0 and 1 of X are the standardized Recency_log and Frequency_log features
plt.xlabel('Recency_log (standardized)')
plt.ylabel('Frequency_log (standardized)')
plt.legend()
plt.show()

In [None]:
from prettytable import PrettyTable

# Specify the Column Names while initializing the Table
myTable = PrettyTable(["SL No.", "Model_Name", "Data", "Optimal_Number_of_cluster"])

# Add rows
myTable.add_row(["1", "K-Means with silhouette score", "RFM", "2"])
myTable.add_row(["2", "K-Means with Elbow method", "RFM", "2"])
myTable.add_row(["3", "Agglomerative clustering", "RFM", "2"])
print(myTable)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

We evaluated the clusterings with the average silhouette score, the within-cluster sum of squared distances (elbow method), and the dendrogram from hierarchical clustering. All three indicated that the optimal number of clusters is 2. These checks matter for the business because segments are only actionable when they are well separated. RFM analysis then helps identify customer segments such as high-value, loyal, at-risk, and dormant customers, so that marketing strategies and offerings can be tailored to each segment to improve customer satisfaction and drive revenue growth.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

After conducting various clustering methods, including K-Means with silhouette score, K-Means with the elbow method, and agglomerative clustering, we concluded that the optimal number of clusters for our dataset is 2. Among these methods, we found that K-Means clustering with the silhouette score is the most straightforward and effective approach.

The silhouette score measures clustering quality by comparing, for each data point, the average distance to the other points in its own cluster with the average distance to the points in the nearest other cluster. It ranges from -1 to 1, and a higher score indicates better-defined clusters. In our case, among the tested values of k (2 to 6), the silhouette analysis indicated that 2 clusters provided the most cohesive grouping of data points.
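
As a hedged illustration of that definition, the sketch below computes the silhouette coefficient s(i) = (b - a) / max(a, b) by hand, where a is the mean distance from point i to the other points in its cluster and b is the mean distance to the points of the nearest other cluster, and then compares the result with scikit-learn. The six points, their labels, and the helper `silhouette_by_hand` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two small, well-separated groups with known labels (illustration only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_by_hand(X, labels, i):
    """Return s(i) = (b - a) / max(a, b) using Euclidean distances."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # the point itself is excluded from its own cluster mean
    a = dists[same].mean()
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_by_hand(points, labels, i)
                   for i in range(len(points))])
print("By hand:     ", manual)
print("scikit-learn:", silhouette_samples(points, labels))
print("Average:     ", silhouette_score(points, labels))
```

The average of the per-sample values is the figure that `silhouette_score` reports, which is what the notebook compares across candidate values of k.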

By leveraging K-Means clustering with the silhouette score, we can easily segment our dataset into two distinct clusters, allowing for clear differentiation between customer groups. This approach simplifies the clustering process and facilitates the interpretation of results, making it a practical choice for our analysis.

# **Conclusion**

The objective of this project is to segment customers based on a transnational dataset that includes all transactions between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. To protect the original data from unintended alterations, we worked on a copy of the dataset, from which we removed null values and duplicates.

The United Kingdom leads in transaction count, indicating a strong customer base in that region. Thursday is the day of the week with the highest transaction count, and November is the busiest month, possibly because of seasonal factors or promotions during that period. Transactions peak at 12 pm, and the afternoon is the time period with the highest product sales.

We explored three clustering techniques to identify meaningful patterns within the dataset: K-Means clustering with the silhouette method, K-Means clustering with the elbow method, and agglomerative clustering. All three indicated that the optimal number of clusters is 2. With this segmentation, businesses can allocate resources, tailor marketing initiatives, and optimize operational strategies to improve customer engagement and retention. The insights gained from RFM analysis and clustering support informed decisions that lead to improved customer experiences and sustainable business success.

Each method offers unique insights into the underlying structure of the data and helps us uncover clusters or groups that share similar characteristics. By comparing the results obtained from these different approaches, we can gain a comprehensive understanding of the data and extract valuable insights that can inform decision-making processes.
