<a href="https://colab.research.google.com/github/Ujjwalrai7/Capstone-Project-Online-retail-customer-segmentation/blob/main/Capstone_project_Online_retail_customer_segmentation_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Online Retail Customer Segmentation Project



##### **Project Type**    - Unsupervised Machine Learning Project
##### **Contribution**    - Individual Project By Rai Ujjwal Manoj

# **Project Summary -**

The purpose of this project was to conduct a comprehensive customer segmentation analysis for an online retail company. The company, an established e-commerce platform, wanted to gain a deeper understanding of its customer base and identify distinct customer segments based on their behaviors, preferences, and characteristics. By segmenting customers, the company aimed to personalize marketing efforts, improve customer targeting, and enhance overall customer experience.

The project followed a data-driven approach, utilizing advanced analytical techniques to segment customers. The methodology consisted of the following steps:

Data Preprocessing: The collected data underwent preprocessing steps to ensure data quality and consistency. This involved cleaning the data, handling missing values, and transforming variables into suitable formats for analysis.

Exploratory Data Analysis: An in-depth exploration of the data was conducted to gain insights into customer behavior and identify patterns. Descriptive statistics, visualizations, and correlation analysis were employed to uncover key trends and relationships within the data.

Feature Engineering: To enhance the effectiveness of segmentation, additional features were derived from the existing data. These features were designed to capture specific aspects of customer behavior, such as recency of purchase, frequency of interaction, and monetary value.

Segmentation Analysis: Advanced clustering techniques, such as K-means clustering, hierarchical clustering, or Binning models, were applied to segment customers into distinct groups based on their similarities and differences. The appropriate number of segments was determined by evaluating different clustering solutions and selecting the one that provided the most meaningful and actionable insights.

Segment Profiling: Each customer segment was carefully profiled and characterized based on their unique attributes. This involved analyzing the key features and behaviors that differentiated one segment from another. The profiles provided a deep understanding of customer preferences, motivations, and needs within each segment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The problem at hand is that an online retail company lacks a comprehensive understanding of its diverse customer base. Without proper customer segmentation, the company struggles to effectively target its marketing efforts, personalize customer experiences, and maximize overall customer satisfaction. Therefore, the project aims to address the following problems:

Lack of Customer Understanding: The online retail company lacks insights into its customer base, including their behaviors, preferences, and characteristics. As a result, the company is unable to tailor its strategies and offerings to meet the specific needs and expectations of different customer segments.

Ineffective Marketing Campaigns: Without proper customer segmentation, the company's marketing campaigns lack relevance and fail to effectively reach the intended audience. The absence of targeted messaging and personalized promotions leads to lower customer engagement and suboptimal conversion rates.

Limited Personalization Opportunities: The company struggles to deliver personalized product recommendations, offers, and user experiences to its customers. The absence of segmentation hinders the ability to leverage customer data for personalized marketing, resulting in missed opportunities for cross-selling, upselling, and customer retention.

By addressing these problems through a robust customer segmentation analysis, the online retail company can gain a comprehensive understanding of its customer base and effectively tailor its marketing strategies, personalized recommendations, and user experiences. Ultimately, this will lead to improved customer satisfaction, increased customer engagement, higher conversion rates, and enhanced long-term customer loyalty.


# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import math  # numpy.math was removed in recent NumPy releases; the standard module is equivalent
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
%matplotlib inline

# Apply seaborn's default theme and silence library warnings
sns.set()
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Load the dataset
df=pd.read_excel("/content/drive/MyDrive/Online Retail.xlsx")

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
# Checking the number of unique values in each column
for i in df.columns:
  print(i,':' ,df[i].nunique())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(len(df[df.duplicated()]))

In [None]:
# Dropping duplicate rows
df.drop_duplicates(inplace=True)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False);

### What did you know about your dataset?

This is a transnational dataset containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer.
* The company mainly sells unique all-occasion gifts.
* Many customers of the company are wholesalers.
* The dataset contains 541909 rows and 8 columns.
* There are 2 columns of datatype float64, 1 column of datatype int64, 4 columns of datatype object and 1 column of datatype datetime64.
* The total number of duplicated rows in the dataset: 5268
* Missing data percentage:
  * CustomerID - 24.93%
  * Description - 0.27%

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(f'Features: {df.columns.to_list()}')

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* ### **InvoiceNo**: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* ### **StockCode**: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* ### **Description**: Product (item) name. Nominal.
* ### **Quantity**: The quantities of each product (item) per transaction. Numeric.
* ### **InvoiceDate**: Invoice Date and time. Numeric, the day and time when each transaction was generated.
* ### **UnitPrice**: Unit price. Numeric, Product price per unit in sterling.
* ### **CustomerID**: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* ### **Country**: Country name. Nominal, the name of the country where each customer resides.

### Check Unique Values for each variable.

In [None]:
# Checking the number of unique values in each column
for i in df.columns:
  print(i,':' ,df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Extracting year, month name, day name and hour from the InvoiceDate column
df["year"]  = df["InvoiceDate"].dt.year
df['Month'] = df['InvoiceDate'].dt.month_name()
df['Day']   = df['InvoiceDate'].dt.day_name()
df["hour"]  = df["InvoiceDate"].dt.hour

In [None]:
# Creating a new feature 'TotalAmount' by multiplying Quantity and UnitPrice
df['TotalAmount']= df['UnitPrice'] * df['Quantity']

In [None]:
# Creating a new feature 'TimeType' based on hours to define whether its Morning,Afternoon or Evening

df['TimeType'] = np.where((df["hour"]>5)&(df["hour"]<18), np.where(
                           df["hour"]<12, 'Morning','Afternoon'),'Evening')

In [None]:
# Most orders placed are from these countries
country_invoice = df.groupby("Country")["InvoiceNo"].nunique().reset_index().sort_values("InvoiceNo",ascending=False)
country_invoice.rename(columns={'InvoiceNo': 'Invoice_Count'}, inplace=True)
country_invoice.head(10)

In [None]:
# Most customers are from these countries
country_cust = df.groupby("Country")["CustomerID"].nunique().reset_index().sort_values("CustomerID",ascending=False)
country_cust.rename(columns={'CustomerID': 'Customer_Count'}, inplace=True)
country_cust.head()

In [None]:
# Countrywise average item purchases
# Select the column before aggregating; mean() over non-numeric columns fails in recent pandas
country_quantity = df.groupby("Country")["Quantity"].mean().reset_index().sort_values("Quantity",ascending=False)
country_quantity.rename(columns={'Quantity': 'Average_Quantity'}, inplace=True)
country_quantity.head()

In [None]:
# quantity wise item purchases
# Select the column before aggregating so only Quantity is summed
product_quantity = df.groupby("Description")["Quantity"].sum().reset_index().sort_values("Quantity",ascending=False)
product_quantity.head()

In [None]:
# Amount wise item purchases
# Select the column before aggregating so only TotalAmount is summed
product_price = df.groupby("Description")["TotalAmount"].sum().reset_index().sort_values("TotalAmount",ascending=False)
product_price.head()

In [None]:
# Count rows per stock code; naming the columns explicitly works across pandas versions
StockCode_df = df['StockCode'].value_counts().rename_axis('StockCode_Name').reset_index(name='Count')
StockCode_df.head()

In [None]:
# Count rows per hour; naming the columns explicitly works across pandas versions
hour_df = df['hour'].value_counts().rename_axis('Hour_Name').reset_index(name='Count')
hour_df

In [None]:
# Count rows per month; naming the columns explicitly works across pandas versions
month_df = df['Month'].value_counts().rename_axis('Month_Name').reset_index(name='Count')
month_df

In [None]:
# Count rows per weekday; naming the columns explicitly works across pandas versions
day_df = df['Day'].value_counts().rename_axis('Day_Name').reset_index(name='Count')
day_df

### What all manipulations have you done and insights you found?

1. Most customers are from the United Kingdom. A considerable number are also from Germany, France, EIRE and Spain, whereas Saudi Arabia, Bahrain, Czech Republic, Brazil and Lithuania have the fewest customers.
2. No orders were placed on Saturdays, which suggests it is a non-working day for the retailer.
3. Most purchases were made in November, October, December and September; the fewest were made in April, January and February.
4. Most purchases were made in the afternoon, a moderate number in the morning, and the fewest in the evening.
5. WHITE HANGING HEART T-LIGHT HOLDER, REGENCY CAKESTAND 3 TIER and JUMBO BAG RED RETROSPOT are the most ordered products.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Quantity wise most items purchases

In [None]:
# Chart - 1 visualization code
# quantity wise most item purchases
plt.figure(figsize=(20,5), dpi=90)
plt.subplot(1,2,1)
plt.xticks(rotation=20,ha='right')
plt.title("Product with High quantity orders")
sns.barplot(data=product_quantity.head(10),x="Description",y="Quantity")



##### 1. Why did you pick the specific chart?

To visualise quantity wise purchases of products

##### 2. What is/are the insight(s) found from the chart?

World War 2 gliders, Jumbo bag, Popcorn, etc. are the items purchased in the largest quantities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It would help management decide on incentives such as discounts for these high-selling items, that is, whether to keep, extend or withdraw them.

#### Chart - 2 Country wise visualisation of total invoices.

In [None]:
# Chart - 2 visualization code
# Visualizing top 10 countries based on total invoices
plt.figure(figsize=(20,5),dpi=90)
plt.subplot(1,2,1)
plt.xticks(rotation=20,ha='right')
plt.title("Most orders placed are from these countries")
sns.barplot(data=country_invoice.head(10),x="Country",y="Invoice_Count")

##### 1. Why did you pick the specific chart?

To Visualize top 10 countries based on total invoices

##### 2. What is/are the insight(s) found from the chart?

United Kingdom is making most of the purchases as compared to other countries


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 3 Visualizing top 10 countries based on average item purchases

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(20,5), dpi=90)
plt.subplot(1,2,1)
plt.xticks(rotation=20,ha='right')
plt.title("High quantity orders are from these countries")
sns.barplot(data=country_quantity.head(10),x="Country",y="Average_Quantity")

##### 1. Why did you pick the specific chart?

 To Visualize top 10 countries based on average item purchases

##### 2. What is/are the insight(s) found from the chart?

The largest average order quantities come from customers in the Netherlands.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 4  Visualising the top stocks

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(13,8))
plt.title('Top 5 Stock Name')
sns.barplot(x='StockCode_Name',y='Count',data=StockCode_df[:5])

##### 1. Why did you pick the specific chart?

To visualise the top stocks.

##### 2. What is/are the insight(s) found from the chart?

The insights that can be generated are Top 5 Stock name based on selling are :

1)85123A

2)22423

3)85099B

4)47566

5)20725



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 5    Day wise visualisation of customer count

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(13,8))
plt.title('Day')
sns.barplot(x='Day_Name',y='Count',data=day_df)

##### 1. Why did you pick the specific chart?

To visualise the daily customer count.


##### 2. What is/are the insight(s) found from the chart?

Most purchases were made on Thursday, Wednesday and Tuesday. Since there is no data for Saturday, the store might be closed on Saturdays.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.


#### Chart - 6  Monthly Analysis of customer count.


In [None]:
# Chart - 6 visualization code
# Month wise customer purchase
plt.figure(figsize=(13,8))
plt.title('Month')
sns.barplot(x='Month_Name',y='Count',data=month_df)

##### 1. Why did you pick the specific chart?

To visualise Month wise customer purchase.

##### 2. What is/are the insight(s) found from the chart?

Most purchases were made in November, December, October and September.

The fewest purchases were made in April, January and February.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 7 Hourly analysis of customer purchases

In [None]:
# Chart - 7 visualization code.
plt.figure(figsize=(13,8))
plt.title('Hour')
sns.barplot(x='Hour_Name',y='Count',data=hour_df)

##### 1. Why did you pick the specific chart?

To visualise the hourly customer purchases.

##### 2. What is/are the insight(s) found from the chart?

From this graph we can see that from 11:00 am to 4:00 pm most of the customers have purchased the item.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 8 Time_type wise analysis of customer purchases.

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(13,8))
plt.title('Time_type')
sns.countplot(x='TimeType',data=df)

##### 1. Why did you pick the specific chart?

To visualise at what phase of time there is maximum purchases.

##### 2. What is/are the insight(s) found from the chart?

Most purchases were made in the afternoon, a moderate number in the morning, and the fewest in the evening.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 9 Distribution analysis.

In [None]:
# Chart - 9 visualization code
# Visualizing the distributions.
target = ['Quantity','UnitPrice','TotalAmount']
plt.figure(figsize=(20,5), dpi=90)
for n,col in enumerate(target):
  plt.subplot(1, 3, n+1)
  # histplot with a KDE overlay replaces the deprecated distplot
  sns.histplot(df[col], bins=50, kde=True, stat="density")
  plt.title(col.title())
  plt.tight_layout()


##### 1. Why did you pick the specific chart?

To visualise the distribution of various features.

##### 2. What is/are the insight(s) found from the chart?

1. The distributions are positively skewed: most values cluster on the left while the right tail is longer, which means mean > median > mode.
2. For a symmetric distribution, mean = median = mode.
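The mean and median ordering described above can be checked numerically. The sketch below uses a synthetic right-skewed sample (an assumption for illustration, not the retail data) and pandas' `Series.skew()`; a positive value indicates a longer right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); illustrative only, not the retail data
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

mean_value = sample.mean()
median_value = sample.median()
skewness = sample.skew()  # > 0 indicates a longer right tail

print(f"mean={mean_value:.3f}, median={median_value:.3f}, skew={skewness:.3f}")
```

The same three calls can be applied to `Quantity`, `UnitPrice` or `TotalAmount` to quantify what the plots show.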

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.

#### Chart - 10 Year wise analysis.

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(13,8))
plt.title('Year')
sns.countplot(x='year',data=df)

##### 1. Why did you pick the specific chart?

To visualise the yearly pattern of customer purchases.

##### 2. What is/are the insight(s) found from the chart?

Most of the data is from 2011; very few purchases belong to 2010.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes the gained insights would help creating a positive business impact.



#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(20,12))
# Restrict to numeric columns; non-numeric columns raise an error in recent pandas versions
correlation = df.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

##### 1. Why did you pick the specific chart?


A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

None of the features are highly correlated, except TotalAmount, which is derived from Quantity.
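Why a derived column correlates with its inputs can be seen on a small synthetic example. The sketch below builds made-up `Quantity` and `UnitPrice` columns (assumed independent, not taken from the retail data), derives `TotalAmount` as their product, and prints the Pearson correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic columns standing in for Quantity, UnitPrice and TotalAmount (illustrative only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Quantity": rng.integers(1, 50, size=1_000),
    "UnitPrice": rng.uniform(0.5, 10.0, size=1_000),
})
toy["TotalAmount"] = toy["Quantity"] * toy["UnitPrice"]

corr = toy.corr()  # Pearson correlation; every entry lies in [-1, 1]
print(corr.round(2))
```

Note that plotting `abs(correlation)`, as the heatmap cell does, hides the sign of each coefficient, so a strong negative relationship would look the same as a strong positive one.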

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df ,hue="CustomerID")
plt.show()

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing some simple lines or make linear separation in our data-set.

Thus, I used a pair plot to analyse patterns in the data and relationships between the features. It conveys similar information to the correlation map, but as scatter plots rather than coefficients.

##### 2. What is/are the insight(s) found from the chart?

The above chart depicts the pairwise relationships between the features, coloured by CustomerID.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
#missing data counts and percentage

missing = df.columns[df.isnull().any()].tolist()
missing

print('Missing Data count')
print(df[missing].isnull().sum().sort_values(ascending=False))

print('++'*12)
print('Missing Data Percentage')
print(round(df[missing].isnull().sum().sort_values(ascending=False)/len(df)*100,2))

In [None]:
# Dropping the rows with null values
df.dropna(subset=['CustomerID'],inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

24.93% of purchased items are not assigned to any customer.
Rows without a CustomerID are of no use here, because clusters are formed per customer, so they are dropped from the dataset.

In [None]:
df.info()

In [None]:
# Checking duplicates
print(len(df[df.duplicated()]))

In [None]:
# checking null counts and datatype in each column
df.info()

In [None]:
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
cancellations = df[df['InvoiceNo'].str.contains('C')]
cancellations.shape

In [None]:
# Dropping cancellations from the main dataframe
df = df[~df['InvoiceNo'].str.contains('C')]

In [None]:
df.info()

### 2. Handling Outliers

##### What all outlier treatment techniques have you used and why did you use those techniques?

Not needed in this case.

### 3. Categorical Encoding

#### What all categorical encoding techniques have you used & why did you use those techniques?

Not needed in this case.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

# **RFM Modelling**

***RFM (Recency, Frequency, Monetary) modeling is a statistical technique used in marketing and customer relationship management to analyze customer behavior and determine the value of each customer to a business. It involves analyzing the transactional data of customers and grouping them based on their buying behavior.***

***Recency refers to how recently a customer has made a purchase from the business. Frequency refers to how often a customer makes purchases, and Monetary refers to how much a customer spends on purchases. These three factors are used to assign scores to each customer, which are then used to segment them into different groups.***

***In general, customers who have made recent purchases, purchase frequently, and spend more money are considered to be more valuable to the business than those who have not made a purchase recently, purchase infrequently, and spend less money. RFM modeling helps businesses identify their most valuable customers and tailor their marketing efforts accordingly, such as by offering personalized promotions or improving customer service.***

***RFM modeling can be performed using various statistical techniques, such as clustering or decision tree analysis. However, it requires accurate and up-to-date data on customer transactions, and the results may need to be validated and updated periodically to reflect changes in customer behavior.***
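To make the three definitions concrete, the sketch below computes Recency, Frequency and Monetary on a few hand-made transactions for two hypothetical customers. The data, the names `tx` and `toy_rfm`, and the choice to count distinct invoices for Frequency are assumptions for illustration; the notebook's own cell counts invoice line items instead.

```python
import pandas as pd

# Hand-made transactions for two hypothetical customers (illustrative only)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10", "2011-02-20"]
    ),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5, 2.5],
})

# Reference date: one day after the last transaction in the data
reference_date = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (reference_date - s.max()).days),
    Frequency=("InvoiceNo", "nunique"),  # distinct invoices per customer
    Monetary=("TotalAmount", "sum"),     # total spend per customer
)
print(toy_rfm)
```

Customer 1 bought most recently (1 day before the reference date) and spent 35.0 in total; customer 2 last bought 10 days before and spent 10.0.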

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Adding 1 day to the Last Invoice date to set as Latest date for reference
LatestDate = df["InvoiceDate"].max() + pd.DateOffset(days=1)

# Creating a new dataframe to calculate Recency, Frequency and Monetary scores for each customer
rfm = df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (LatestDate - x.max()).days,
                                    'InvoiceNo': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

# Renaming the columns
rfm.rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency',
                    'TotalAmount': 'Monetary'}, inplace=True)

# Checking top 5 rows
rfm.reset_index().head()


#### 2. Feature Selection

In [None]:
# Calculating R, F and M scores by splitting Recency, Frequency and Monetary based on quantiles
rfm['R'] = pd.qcut(rfm['Recency'], q=4, labels=[4,3,2,1]).astype(int)
rfm['F'] = pd.qcut(rfm['Frequency'], q=4, labels=[1,2,3,4]).astype(int)
rfm['M'] = pd.qcut(rfm['Monetary'], q=4, labels=[1,2,3,4]).astype(int)

In [None]:
# Finding the RFM Group for each customer by combining the factors R, F and M
rfm['RFM'] = 100*rfm['R'] + 10*rfm['F'] + rfm['M']

# Finding the RFM Score for each customer by adding the factors R, F and M
rfm['RFM_Score'] = rfm['R'] + rfm['F'] + rfm['M']

##### What all feature selection methods have you used  and why?

***RFM (Recency, Frequency, Monetary) modeling is a statistical technique used in marketing and customer relationship management to analyze customer behavior and determine the value of each customer to a business. It involves analyzing the transactional data of customers and grouping them based on their buying behavior.***

##### Which all features you found important and why?

In [None]:
rfm.head()

In [None]:
print("Best Customers: ",len(rfm[rfm['RFM']==444]))
print('Loyal Customers: ',len(rfm[rfm['F']==4]))
print("Big Spenders: ",len(rfm[rfm['M']==4]))
print('Almost Lost: ', len(rfm[rfm['RFM']==244]))
print('Lost Customers: ',len(rfm[rfm['RFM']==144]))
print('Lost Cheap Customers: ',len(rfm[rfm['RFM']==111]))

# **Interpretation:**
***1. If the RFM group of a customer is 444, they purchased recently, purchase frequently and spend a lot, so they are among the best customers.***<br>
***2. If the RFM group is 111, they purchased long ago, purchase rarely and spend little, so they are churning customers.***<br>
***3. If the RFM group is 144, they purchased a long time ago but bought frequently and spent a lot. And so on.***<br>
***4. In this way, segments can be defined for any combination of R, F and M based on the use case. The higher the RFM score, the more valuable the customer.***
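The interpretation above can be turned into a small labelling rule. The sketch below is one possible mapping, not the notebook's method: the function `label_segment`, the segment names and the precedence of the rules are assumptions, and unlike the counts printed earlier the resulting segments are mutually exclusive.

```python
import pandas as pd

def label_segment(r: int, f: int, m: int) -> str:
    """Map quartile scores (1-4, higher is better) to a hypothetical segment name."""
    if (r, f, m) == (4, 4, 4):
        return "Best"
    if (r, f, m) == (2, 4, 4):
        return "Almost Lost"
    if (r, f, m) == (1, 4, 4):
        return "Lost"
    if (r, f, m) == (1, 1, 1):
        return "Lost Cheap"
    if f == 4:
        return "Loyal"
    if m == 4:
        return "Big Spender"
    return "Other"

# Toy score table (made up for illustration)
scores = pd.DataFrame({"R": [4, 1, 1, 3, 2], "F": [4, 4, 1, 4, 2], "M": [4, 4, 1, 2, 4]})
scores["Segment"] = [
    label_segment(r, f, m) for r, f, m in zip(scores["R"], scores["F"], scores["M"])
]
print(scores)
```

On the real `rfm` frame the same list comprehension over `rfm['R']`, `rfm['F']` and `rfm['M']` would attach a label to every customer.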

### 5. Data Transformation

In [None]:
# Visualizing the distributions.
target = ['Quantity','UnitPrice','TotalAmount']
plt.figure(figsize=(20,5), dpi=90)
for n,col in enumerate(target):
  plt.subplot(1, 3, n+1)
  # histplot with a KDE overlay replaces the deprecated distplot
  sns.histplot(df[col], bins=50, kde=True, stat="density")
  plt.title(col.title())
  plt.tight_layout()

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

1. It shows a positively skewed distribution because most of the values are clustered around the left side of the distribution while the right tail of the distribution is longer, which means mean>median>mode
2. For symmetric graph mean=median=mode.

In [None]:
# Transform Your data
target = ['Quantity']
plt.figure(figsize=(20,5), dpi=90)
for n,col in enumerate(target):
  plt.subplot(1, 3, n+1)
  # histplot with a KDE overlay replaces the deprecated distplot
  sns.histplot(np.log(df[col]), bins=50, kde=True, stat="density")
  plt.title(col.title())
  plt.tight_layout()

In [None]:
# Handling the zeroes in the dataframe to avoid error in transformations
rfm.replace(0.0,1,inplace=True)

# Applying Log transformation on columns for smoothening the distribution
rfm['Recency_Log']   = rfm['Recency'].apply(np.log)
rfm['Frequency_Log'] = rfm['Frequency'].apply(np.log)
rfm['Monetary_Log']  = rfm['Monetary'].apply(np.log)
rfm.head()

In [None]:
# Visualizing the distributions before and after log transformation.
target = ['Recency', 'Frequency', 'Monetary', 'Recency_Log', 'Frequency_Log', 'Monetary_Log']
plt.figure(figsize=(20,10), dpi=90)
for n,col in enumerate(target):
  plt.subplot(2, 3, n+1)
  # histplot with a KDE overlay replaces the deprecated distplot
  sns.histplot(rfm[col], bins=50, kde=True, stat="density")
  plt.title(col.title())
  plt.tight_layout()

# **Observations:**
***1. The distributions of Recency, Frequency and Monetary were positively skewed; after the log transformation they appear much closer to symmetric and approximately normal.***<br>
***2. The transformed features are therefore more suitable for visualising the clusters.***
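The effect of the transformation can also be measured rather than only read off the plots. The sketch below uses synthetic right-skewed values that include a zero (an assumption for illustration) and `np.log1p`, which computes log(1 + x); this is an alternative to the replace-zeros-then-`np.log` step used in the notebook, not the same operation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "Monetary"-like values, including a zero (illustrative only)
rng = np.random.default_rng(42)
values = pd.Series(np.append(rng.lognormal(mean=5.0, sigma=1.2, size=5_000), 0.0))

# np.log1p computes log(1 + x), so zeros map to 0 without any replacement step
log_values = np.log1p(values)

skew_before = values.skew()
skew_after = log_values.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero after the transformation supports the visual impression of a roughly symmetric distribution.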

In [None]:
# Visualizing the correlations among features.
target = ['Recency_Log','Frequency_Log','Monetary_Log','RFM','RFM_Score']
plt.figure(figsize = (8, 4), dpi=150)
sns.heatmap(abs(rfm[target].corr()), annot=True, cmap='coolwarm')
plt.title('RFM Correlation Heatmap')
plt.show()

### 6. Data Scaling
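
Distance-based clustering treats all features on the same footing, so features with larger numeric ranges dominate unless they are rescaled. A minimal sketch of one common option, z-score standardization, on made-up values; the column names mirror the log-transformed RFM features, and whether this matches the scaling applied later in the notebook is an assumption.

```python
import pandas as pd

# Toy log-transformed RFM values (made up for illustration)
features = pd.DataFrame({
    "Recency_Log":   [0.0, 1.1, 2.3, 3.4, 5.2],
    "Frequency_Log": [4.6, 3.9, 3.0, 2.2, 0.7],
    "Monetary_Log":  [8.1, 7.0, 6.2, 5.5, 4.3],
})

# Z-score standardization: subtract the column mean, divide by the population std (ddof=0)
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(3))
```

Using `ddof=0` gives the same result as scikit-learn's `StandardScaler`, which also divides by the population standard deviation.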

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

This dataset does not need any dimensionality reduction.

Dimensionality reduction is a technique that is used to reduce the number of features in a dataset. It is often used when the number of features is very large, as this can lead to problems such as overfitting and slow computation. There are a variety of techniques that can be used for dimensionality reduction, such as principal component analysis (PCA) and singular value decomposition (SVD).

There are several reasons why dimensionality reduction might be useful. One reason is that it can help to reduce the size of a dataset, which can be particularly useful when the dataset is very large. It can also help to improve the performance of machine learning models by reducing the number of features that the model has to consider, which can lead to faster computation and better generalization to new data.

Another reason to use dimensionality reduction is to mitigate the curse of dimensionality: as the number of dimensions increases, the volume of the space grows exponentially and the data become sparse. Distance-based methods such as nearest neighbor search then become less effective, because the distances between points grow and become harder to tell apart. Reducing the number of dimensions lessens this effect.

Finally, dimensionality reduction can also be useful for visualizing high-dimensional data. It can be difficult to visualize data in more than three dimensions, so reducing the number of dimensions can make it easier to understand the patterns in the data.
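
To illustrate the technique described above, here is a minimal PCA sketch. It uses a synthetic three-column array (two correlated columns and one independent column, all assumptions for illustration) rather than the RFM features, and reports how much variance each principal component explains:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first two columns share a common signal, the third is independent
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
toy = np.hstack([
    shared + 0.3 * rng.normal(size=(500, 1)),
    shared + 0.3 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# PCA is sensitive to scale, so standardize first
toy_scaled = StandardScaler().fit_transform(toy)

pca = PCA(n_components=3).fit(toy_scaled)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print("explained variance ratio:", ratios.round(3))
print("cumulative:              ", cumulative.round(3))
```

With only three input features, as in this project, such a reduction saves little, which is consistent with the decision not to apply it here.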

### 8. Data Splitting

##### What data splitting ratio have you used and why?

Not Required in this case as it is an unsupervised machine learning project.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Imbalanced dataset is relevant primarily in the context of supervised machine learning involving two or more classes.

Imbalance means that the number of data points available for the different classes differs. With two classes, balanced data would mean 50% of the points in each class. For most machine learning techniques, a small imbalance is not a problem: if 60% of the points belong to one class and 40% to the other, it should not cause significant performance degradation. Only when the class imbalance is high, e.g. 90% of the points in one class and 10% in the other, may standard optimization criteria or performance measures become less effective and need modification.

In our case, handling imbalance is not required, because this is an unsupervised problem without class labels.

In [None]:
rfm.head()

In [None]:
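from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')
# Set the title on the 3D axes; calling plt.title() before the axes exist adds an extra empty 2D axes
ax.set_title('3D visualization of log-transformed Recency, Frequency and Monetary')
xs = rfm.Recency_Log
ys = rfm.Frequency_Log
zs = rfm.Monetary_Log
ax.scatter(xs, ys, zs, s=5)
ax.set_xlabel('Recency (log)')
ax.set_ylabel('Frequency (log)')
ax.set_zlabel('Monetary value (log)')
plt.show()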

## ***7. ML Model Implementation***

### ML Model - 1    ****K-Means Clustering****

##### **Applying the Silhouette Method on Recency, Frequency and Monetary**

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from sklearn import preprocessing

In [None]:
feature_vector=['Recency_Log','Frequency_Log','Monetary_Log']
X_features=rfm[feature_vector].values
scaler=preprocessing.StandardScaler()
X=scaler.fit_transform(X_features)

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

range_n_clusters = [2,3,4,5,6,7,8]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Recency_Log (standardized)")
    ax2.set_ylabel("Frequency_Log (standardized)")
    plt.suptitle(("Silhouette analysis for KMeans clustering on RFM data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

### We can see that the silhouette score is highest when the number of clusters is 2 ###
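
The selection rule used above (pick the number of clusters with the highest average silhouette score) can be written compactly. This is a minimal sketch on synthetic blobs with three known centers, not on the RFM matrix `X`; the variable names `X_toy`, `scores` and `best_k` are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X_toy, _ = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=10,
)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Keep the k with the highest average silhouette score
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

Replacing `X_toy` with the scaled RFM matrix gives the same comparison that the plots above show, without the per-cluster silhouette diagrams.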

##### **Applying the Elbow Method on Recency, Frequency and Monetary**

In [None]:
from sklearn.cluster import KMeans

sum_of_sq_dist = {}
for k in range(1,15):
    # Fixed seed so that the elbow curve is reproducible between runs
    km = KMeans(n_clusters=k, init='k-means++', max_iter=1000, random_state=10)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_

#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

## From the above elbow curve we can see that the optimal number of clusters would be 2 or 3 ##
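
Reading an elbow from a plot is subjective, so it can help to look at how much each additional cluster reduces the inertia. The following is a minimal sketch on synthetic blobs (three known centers, not the RFM data); the relative-drop printout is a simple heuristic chosen for illustration, not a standard criterion:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X_toy, _ = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=10,
)

# Within-cluster sum of squared distances (inertia) for each k
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=10).fit(X_toy)
    inertias[k] = km.inertia_

# Relative reduction in inertia gained by adding one more cluster
for k in range(2, 8):
    drop = 1 - inertias[k] / inertias[k - 1]
    print(f"k={k}: inertia={inertias[k]:.1f}, relative drop={drop:.1%}")
```

The elbow is roughly where the relative drop falls off sharply; beyond that point extra clusters mostly split groups that already belong together.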

## Based on both the elbow curve and the silhouette score, we take 2 as the optimal number of clusters and implement K-Means ##

## K-means Implementation

In [None]:
from sklearn.cluster import KMeans
# Fixed seed so that the cluster assignment is reproducible between runs
kmeans = KMeans(n_clusters=2, random_state=10)
y_kmeans = kmeans.fit_predict(X)

In [None]:
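plt.figure(figsize=(15,10))
plt.title('Customer segmentation based on Recency, Frequency and Monetary')
# Only the first two standardized features are shown on the axes
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='RdYlBu')
plt.xlabel('Recency_Log (standardized)')
plt.ylabel('Frequency_Log (standardized)')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5)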

In [None]:

# Build the K-Means clustering model
# (fixed seed so that the cluster ids stay the same between runs)
KMean_clust = KMeans(n_clusters=2, init='k-means++', max_iter=1000, random_state=10)
KMean_clust.fit(X)

# Store the cluster assigned to each observation in the dataset
rfm['Cluster'] = KMean_clust.labels_
rfm.head(10)

In [None]:
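def func(row):
    # NOTE: K-Means assigns cluster ids arbitrarily, so confirm against the
    # per-cluster RFM averages that cluster 0 really holds the best customers.
    if row["Cluster"]==0:
        return 'Major Customers'
    elif row["Cluster"]==1:
        return 'At Risk'
    else:
        # Fallback; not reached when only two clusters are used
        return 'Average Standing Customers'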

In [None]:
rfm['group']=rfm.apply(func, axis=1)
rfm
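
Because K-Means numbers its clusters arbitrarily, a fixed mapping from cluster id to segment name can silently swap the segments between runs. One way to guard against this is to derive the names from the cluster averages. The sketch below uses a small hypothetical dataframe (`toy`) with the same column names as `rfm`; ranking clusters by mean `Monetary` is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical customers with already assigned cluster ids (not the project data)
toy = pd.DataFrame({
    "Recency":   [5, 10, 200, 250, 8, 300],
    "Frequency": [40, 35, 2, 1, 50, 3],
    "Monetary":  [5000, 4200, 150, 90, 6100, 200],
    "Cluster":   [1, 1, 0, 0, 1, 0],
})

# Average RFM values per cluster
cluster_means = toy.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()

# Name the cluster with the highest average spend 'Major Customers', the other 'At Risk'
best_cluster = cluster_means["Monetary"].idxmax()
names = {
    c: ("Major Customers" if c == best_cluster else "At Risk")
    for c in cluster_means.index
}
toy["group"] = toy["Cluster"].map(names)

print(cluster_means)
print(toy[["Cluster", "group"]])
```

In this toy example cluster 1 receives the name 'Major Customers' even though its id is not 0, which is exactly the case a hard-coded mapping would get wrong.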

### ML Model - 2- ***Hierarchical Clustering***

##### **Dendrogram to find the optimal number of clusters**

In [None]:
from scipy.cluster.hierarchy import dendrogram,linkage
# Clustering algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

In [None]:
# Using the dendrogram to decide the number of clusters
plt.figure(figsize=(15,12), dpi=90)                         # Setting the figure size
dendrogram(linkage(X, method='ward'), color_threshold=50)   # using ward linkage method to differ similarities
plt.title('Dendrogram')                                     # Setting the title
plt.xlabel('Customers')                                     # Setting the x label
plt.ylabel('Euclidean Distances')                           # Setting y label
plt.axhline(y=70, color='black', linestyle='--')            # Setting the axis line for y=70
plt.axhline(y=50, color='black', linestyle='--')            # Setting the axis line for y=50
plt.show()

## From the dendrogram, we take 2 as the optimal number of clusters and implement Hierarchical Clustering ##
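
The same linkage matrix that draws a dendrogram can also be cut programmatically. As a minimal sketch on synthetic two-group data (not the RFM matrix), SciPy's `fcluster` with `criterion="maxclust"` returns flat cluster labels, and the gap between the last two merge heights indicates how clearly two clusters stand out:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data with two well-separated groups (illustrative only)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [5, 0]],
    cluster_std=0.8,
    random_state=10,
)

# Ward linkage, as used for the dendrogram
Z = linkage(X_toy, method="ward")

# Cut the tree so that at most two flat clusters remain (labels start at 1)
labels = fcluster(Z, t=2, criterion="maxclust")

# Heights of the merges; a large gap before the final merge suggests two clusters
merge_heights = Z[:, 2]
gap = merge_heights[-1] - merge_heights[-2]

print("cluster sizes:", np.bincount(labels)[1:])
print("gap between the last two merges:", round(float(gap), 2))
```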

In [None]:
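# Fitting hierarchical clustering to the scaled RFM features
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distances, so the distance argument is omitted
# (the 'affinity' parameter was renamed to 'metric' in newer scikit-learn versions)
hc = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
y_hc = hc.fit_predict(X)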

In [None]:
# Visualizing the clusters (first two features only)
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = '0')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = '1')
#plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = '2')

plt.title('Clusters of customers')
plt.xlabel('Recency_Log (standardized)')
plt.ylabel('Frequency_Log (standardized)')
plt.legend()
plt.show()

### ML Model - 3 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups data points based on their density in the feature space. It identifies clusters as dense regions separated by areas of lower density and is robust to noise and outliers. It doesn't require specifying the number of clusters in advance. Its parameters include the radius (Eps) and minimum number of points (MinPts) to form a core point.
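
A common heuristic for choosing the radius is to look at the distance from each point to its k-th nearest neighbor, with k tied to the minimum number of points. The sketch below computes these sorted k-distances on synthetic data (not the RFM matrix); using the percentiles as a starting range for `eps` is an assumption for illustration, not a rule:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data with two groups (illustrative only)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [5, 0]],
    cluster_std=0.8,
    random_state=10,
)

min_samples = 15

# When the training data is passed to kneighbors, each point is returned as its own
# nearest neighbor (distance 0), matching DBSCAN's min_samples, which counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)

# Sorted distance to the farthest of the min_samples neighbors
k_distances = np.sort(distances[:, -1])

print("median / 90th / 95th percentile:",
      np.percentile(k_distances, [50, 90, 95]).round(3))
```

Plotting `k_distances` and picking the value where the curve bends sharply upward gives a candidate `eps`; points beyond the bend tend to be labeled as noise.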

In [None]:
# ML Model - 3 Implementation
from sklearn.cluster import DBSCAN

# Create and fit the DBSCAN model
dbscan = DBSCAN(eps=0.5, min_samples=15)
dbscan.fit(X)

# Plot the results (first two features only; noise points have label -1)
plt.scatter(X[:,0], X[:,1], c=dbscan.labels_, cmap='rainbow')
plt.title('DBSCAN Clustering')
plt.xlabel('Recency_Log (standardized)')
plt.ylabel('Frequency_Log (standardized)')
plt.show()

* The chart used is a scatter plot, which is a suitable choice for visualizing the clustering results of DBSCAN. The x and y axes show the first two of the three standardized features (Recency_Log and Frequency_Log), and the points are colored by their assigned cluster labels.

* The insights gained from the chart include identifying the clusters formed by the DBSCAN algorithm and their density. The points that are closer to each other are assigned to the same cluster, and the outliers or noise points are labeled as -1. By observing the distribution of the points and the density of the clusters, we can understand the structure and characteristics of the data, and potentially find any patterns or anomalies.

* The gained insights can help in creating a positive business impact by identifying groups of similar data points, which can aid in targeting specific segments of customers or optimizing operational processes.
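
Since DBSCAN does not take the number of clusters as input, that number has to be read from the fitted labels, leaving out the noise label -1. A minimal sketch on synthetic data (two dense groups, not the RFM matrix) follows:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data with two dense groups (illustrative only)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [5, 0]],
    cluster_std=0.5,
    random_state=10,
)

labels = DBSCAN(eps=0.5, min_samples=15).fit_predict(X_toy)

# Number of clusters, ignoring the noise label -1 if it is present
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:  ", n_noise)
```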

## Summary

In [None]:

# Import necessary libraries
from prettytable import PrettyTable

# Initialize the table with specified column names
myTable = PrettyTable(['SL No.', "Model_Name", 'Data', "Optimal_Number_of_cluster"])

# Add rows to the table
myTable.add_row(['1', "K-Means with silhouette_score", "RFM", "2"])
myTable.add_row(['2', "K-Means with Elbow method", "RFM", "2"])
myTable.add_row(['3', "Hierarchical clustering", "RFM", "2"])
myTable.add_row(['4', "DBSCAN", "RFM", "3"])

# Print the table
print(myTable)

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
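
The two steps above (save the model, then reload it and predict on unseen data) can be sketched with `joblib`. This is a minimal example under stated assumptions: the data are synthetic, the file name `kmeans_rfm.joblib` is hypothetical, and the scaler is bundled with K-Means in a pipeline so that new data is scaled the same way as the training data:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-feature data with two groups (illustrative only)
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(0.0, 1.0, (100, 3)),
    rng.normal(6.0, 1.0, (100, 3)),
])

# Scaler and clusterer in one object, so both are saved together
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=10),
)
model.fit(toy)

# Save to a temporary directory (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "kmeans_rfm.joblib")
joblib.dump(model, path)

# Reload and run a sanity check on unseen points
loaded = joblib.load(path)
unseen = rng.normal(6.0, 1.0, (5, 3))
same = np.array_equal(model.predict(unseen), loaded.predict(unseen))

print("predictions identical after reload:", same)
```

Only estimators with a `predict` method, such as K-Means, can assign unseen points directly; scikit-learn's `AgglomerativeClustering` and `DBSCAN` do not provide one.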

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**EDA**

* Null values and duplicates were removed from the dataset before applying clustering.

* Top customer IDs were found to be 17841.0, 14911.0, 14096.0, 12748.0, and 14606.0.

* The top five countries based on the percentage of total orders were the United Kingdom (88.95%), Germany (2.33%), France (1.84%), Ireland (1.84%), and Spain (0.62%).

* The top five products purchased based on frequency were White Hanging Heart T-Light Holder, Regency Cakestand 3 Tier, Jumbo Bag Red Retrospot, Party Bunting, and Assorted Colour Bird Ornament.

* The top stock codes based on count values were 85123A, 22423, 85099B, 47566, and 84879.

* New columns were created using InvoiceDate, such as Year, Month, Day, Hour, Month_Num, and Day_Num.

* The total amount of each order was calculated using the product of unit price and quantity.

* The months of November, October, December, September, and May generated the most business.

* The most popular purchasing days were Thursday, Wednesday, Tuesday, Monday, Saturday, and Friday.

* Most customers made purchases between 10:00 A.M. and 2:00 P.M.

* The top time duration for purchasing was found to be afternoon, followed by morning and evening.

**Algorithm**

* The RFM (Recency, Frequency, and Monetary) dataframe summarizes each customer's purchasing behavior, making it easier to recommend and display new products to selected customers.

* Different clustering approaches were applied to the Recency, Frequency & Monetary (RFM) features; K-Means and hierarchical clustering were run with 2 clusters:

1. K-Means with Silhouette_score
2. K-Means with Elbow Method
3. Hierarchical Clustering
4. DBSCAN

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***