# **Project Name** - Online Retail Customer Segmentation



![](https://raw.githubusercontent.com/datasciritwik/Online-Retail/main/Online-retail-shops-800x445.jpeg)

##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual

# **Project Summary**

**This project analyzes a transactional dataset from a UK-based online retail company that specializes in unique all-occasion gifts, with a significant portion of its customer base being wholesalers. The dataset covers transactions from December 1, 2010, to December 9, 2011. The primary goal is to identify major customer segments using unsupervised machine learning techniques like clustering and association rules.**

The project begins with data understanding and preparation. This involves exploring the dataset, handling missing values, removing duplicates, and addressing negative or return transactions. Data quality is improved by ensuring consistent product descriptions and converting data types for analysis.

Exploratory data analysis reveals key insights about customer behavior and sales patterns. Visualizations like bar charts and pie charts are used to understand sales distribution across different countries, identify top customers by sales amount and event sales, and analyze the sales performance of top products.

**The core of the project involves customer segmentation using the RFM model (Recency, Frequency, Monetary Value). This model helps understand customer behavior based on how recently they purchased, how often they purchase, and the total value of their purchases. Each RFM component is calculated and visualized to understand its distribution and identify potential outliers.**
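
The three RFM components can be sketched on a toy table as follows. This is a minimal illustration, not the notebook's actual computation: the sample rows are made up, the column names simply mirror the dataset, and the snapshot date (one day after the last transaction) is an assumption.

```python
import pandas as pd

# Toy transactions (made-up values); column names mirror the dataset.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-10", "2011-11-30", "2011-06-01",
        "2011-06-01", "2011-12-05", "2011-03-15",
    ]),
    "Amount": [20.0, 35.0, 10.0, 5.0, 50.0, 8.0],
})

# Assumed reference date: one day after the most recent transaction.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    # Recency: days between the snapshot and the customer's last purchase
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Frequency: number of distinct invoices
    Frequency=("InvoiceNo", "nunique"),
    # Monetary: total amount spent
    Monetary=("Amount", "sum"),
)
print(rfm)
```

A lower Recency and a higher Frequency or Monetary value indicate a more engaged customer.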

**Data preprocessing is performed to prepare the data for the K-means clustering algorithm. This includes log transformation and standardization to ensure that variables have a mean of 0 and a variance of 1, addressing the issue of varying ranges and potential outliers.**
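
A minimal sketch of this preprocessing step, assuming an RFM table with made-up values. `np.log1p` is used here as one possible log transformation (it tolerates zeros); the transformation actually applied in the notebook may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM values with very different ranges and one large outlier.
rfm = pd.DataFrame({
    "Recency": [1, 6, 266, 30],
    "Frequency": [2, 2, 1, 40],
    "Monetary": [65.0, 55.0, 8.0, 9000.0],
})

# Log transformation compresses the long right tail of each variable.
rfm_log = np.log1p(rfm)

# Standardization rescales each column to mean 0 and variance 1.
scaled = StandardScaler().fit_transform(rfm_log)
rfm_scaled = pd.DataFrame(scaled, columns=rfm.columns, index=rfm.index)

print(rfm_scaled.round(3))
```

K-means relies on Euclidean distance, so without this rescaling the variable with the largest range (typically Monetary) would dominate the clustering.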

**The Elbow method is used to determine the optimal number of clusters for K-means. This involves analyzing the percentage of variance explained (equivalently, the within-cluster distortion) as a function of the number of clusters and identifying the "elbow": the point beyond which adding more clusters yields only a marginal reduction in distortion.**
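
The idea can be sketched on synthetic data. The blobs below are generated with `make_blobs` purely for illustration (they stand in for the scaled RFM features), and the range of cluster counts is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit K-means for several cluster counts and record the inertia
# (within-cluster sum of squared distances, i.e. the distortion).
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# The elbow is where the reduction in inertia flattens out.
for k in range(2, 8):
    drop = inertias[k - 1] - inertias[k]
    print(f"k={k}: inertia={inertias[k]:.1f}, reduction vs k-1={drop:.1f}")
```

On this toy data the reduction is large up to three clusters and small afterwards, which is the pattern one looks for in an elbow plot.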

**Silhouette analysis is then employed to evaluate the quality of the clustering results. This technique helps visualize the separation distance between clusters and identify potential issues like poorly separated clusters or misclassified data points.**
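
A small sketch of silhouette analysis, again on synthetic blobs rather than the retail data. The candidate range of cluster counts is an assumption made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled RFM features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

labels_by_k = {}
scores = {}
for k in range(2, 6):
    labels_by_k[k] = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette coefficient: close to 1 means compact, well-separated clusters.
    scores[k] = silhouette_score(X, labels_by_k[k])

best_k = max(scores, key=scores.get)

# Per-point coefficients; negative values flag points that may sit in the wrong cluster.
per_point = silhouette_samples(X, labels_by_k[best_k])
possibly_misassigned = int((per_point < 0).sum())

print(scores)
print("best k:", best_k, "| points with negative silhouette:", possibly_misassigned)
```

The per-point values are what a silhouette plot draws, one horizontal bar per sample grouped by cluster.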

The project leverages various Python libraries like Pandas, Scikit-learn, and Plotly to perform data manipulation, analysis, and visualization. The findings from this analysis can be used to develop targeted marketing campaigns, optimize product placement, and identify potential new customer segments.

Further analysis could involve using association rule mining to identify relationships between products purchased together, leading to insights for product recommendations and targeted promotions. The project demonstrates a practical application of unsupervised machine learning techniques for customer segmentation and business decision-making in the online retail industry.
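
To make the association-rule idea concrete, the sketch below computes support, confidence, and lift for one rule using plain Python. The baskets and item names are hypothetical, and `rule_metrics` is a helper written for this illustration, not part of any library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set holds the items bought on one invoice.
baskets = [
    {"mug", "teapot"},
    {"mug", "teapot", "napkins"},
    {"mug", "napkins"},
    {"teapot"},
    {"mug", "teapot"},
]
n = len(baskets)

# Count how many baskets contain each item and each pair of items.
item_count = Counter(item for b in baskets for item in b)
pair_count = Counter(
    frozenset(pair) for b in baskets for pair in combinations(sorted(b), 2)
)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support = pair_count[frozenset((antecedent, consequent))] / n
    confidence = support / (item_count[antecedent] / n)
    lift = confidence / (item_count[consequent] / n)
    return support, confidence, lift

support, confidence, lift = rule_metrics("mug", "teapot")
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.4f}")
```

A lift above 1 suggests the two items are bought together more often than chance would predict; a lift below 1 suggests the opposite.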

# **[GitHub Link](https://github.com/datasciritwik/Online-Retail.git)**

![](https://pngimg.com/uploads/github/github_PNG23.png)

# **Problem Statement**


**Identify major customer segments on a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML Model used and its performance using an Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.



# ***Let's Begin !***

## ***1. Know Your Data***

### Fetching Dataset

In [None]:
!git clone https://github.com/datasciritwik/Online-Retail.git

### Installing dependencies

In [None]:
!pip install Orange3 pandasql Orange3-Associate -qq

### Import Libraries

In [None]:
# Import Libraries

# Suppress warnings to keep the output clean
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # Ignore FutureWarnings
warnings.filterwarnings('ignore')  # Ignore all warnings

# Function to ignore warnings
def ignore_warn(*args, **kwargs):
    pass

warnings.warn = ignore_warn  # Override the default warning function

# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
import datetime  # For handling date and time data
import math  # For mathematical functions
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For static plotting
import matplotlib.cm as cm  # For colormap handling

# Enable inline plotting for Jupyter Notebooks
%matplotlib inline

# Import pandasql for SQL-like queries on DataFrames
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())  # Define a function to run SQL queries

# Import seaborn for enhanced statistical data visualization
import seaborn as sns
sns.set(style="ticks", color_codes=True, font_scale=1.5)  # Set seaborn style
color = sns.color_palette()  # Get the default color palette
sns.set_style('darkgrid')  # Set the style to dark grid

# Import Plotly for interactive plotting
import plotly as py
import plotly.graph_objs as go
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
py.offline.init_notebook_mode()  # Initialize Plotly for offline use in Jupyter

# Import statistical functions from SciPy
from scipy import stats
from scipy.stats import skew, norm, probplot, boxcox  # Specific statistical functions

# Import preprocessing tools from Scikit-learn
from sklearn import preprocessing  # For data preprocessing tasks

# Import clustering algorithms and metrics from Scikit-learn
from sklearn.cluster import KMeans  # KMeans clustering algorithm
from sklearn.metrics import silhouette_samples, silhouette_score  # Metrics for evaluating clustering

# Import Orange for data mining and machine learning
import Orange
from Orange.data import Domain, DiscreteVariable, ContinuousVariable  # Data types for Orange
from orangecontrib.associate.fpgrowth import *  # Association rule mining functions

### Dataset Loading

In [None]:
# Load Dataset
data_path = "/content/Online-Retail/Online Retail.xlsx"
df = pd.read_excel(data_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head().style.background_gradient(cmap='Blues')

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Number of Rows: {df.shape[0]}')
print(f'Number of Columns: {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
missing_values = df.isnull().sum()
plt.figure(figsize=(10, 6))
sns.barplot(x=missing_values.index, y=missing_values.values, palette="viridis")
plt.xticks(rotation=90)
plt.xlabel('Columns')
plt.ylabel('Missing Values Count')

### What did you know about your dataset?

The dataset provided is a transactional dataset from a UK-based online retail company specializing in unique all-occasion gifts, with a significant portion of their customer base being wholesalers. The goal is to perform customer segmentation and gain insights into customer behavior and purchasing patterns.

Customer segmentation is the process of dividing a company's customer base into distinct groups based on various characteristics such as demographics, purchase history, and behavior. The aim is to better understand customer needs and preferences to tailor marketing strategies, improve customer satisfaction, and increase revenue.

|Rows|Columns|Columns with Missing Values|Duplicate Count|
|-|-|-|-|
|541909|8|Description(1454), CustomerID(135080)|5268|

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().style.background_gradient(cmap='Blues')

In [None]:
df.describe(include='object').T

### Variables Description

- **InvoiceNo**: A unique identifier for the invoice. An invoice number shared across rows means that those transactions were performed in a single invoice (multiple purchases).
- **StockCode**: Identifier for items contained in an invoice.
- **Description**: Textual description of each stock item.
- **Quantity**: The quantity of the item purchased.
- **InvoiceDate**: Date of purchase.
- **UnitPrice**: Value of each item.
- **CustomerID**: Identifier for customer making the purchase.
- **Country**: Country of customer.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"Feature Name: {col} \n- Total Unique Values - {len(df[col].unique())}\n")

## 3. ***Data Wrangling***

#### Removing Rows with Null Values
The null values occur in a text column (`Description`) and a unique identifier column (`CustomerID`), so no imputation technique can fill them meaningfully. We therefore drop these rows.

In [None]:
df.dropna(inplace = True, axis = 0)

#### Drop Duplicates

In [None]:
df.drop_duplicates(inplace = True)

#### Checking for negative or return transactions - Barplot

In [None]:
# Calculate min and max values
min_max = {}
min_max['Quantity'] = [df['Quantity'].min(), df['Quantity'].max()]
min_max['UnitPrice'] = [df['UnitPrice'].min(), df['UnitPrice'].max()]

# Create a DataFrame for plotting
min_max_df = pd.DataFrame(min_max, index=['Min', 'Max']).reset_index()
min_max_df.rename(columns={'index': 'Metric'}, inplace=True)

# Set the color palette
palette = sns.color_palette("pastel")

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot for Quantity
sns.barplot(x='Metric', y='Quantity', data=min_max_df, ax=axes[0], color=palette[0], alpha=0.7)
axes[0].set_title('Min-Max Chart for Quantity', fontsize=16)
axes[0].set_xlabel('Metric', fontsize=14)
axes[0].set_ylabel('Values', fontsize=14)
axes[0].grid(axis='y', linestyle='--', alpha=0.7)
axes[0].axhline(0, color='red', linestyle='--', linewidth=1)  # Horizontal line at y=0

# Annotate min and max values for Quantity
for index, row in min_max_df.iterrows():
    axes[0].text(row['Metric'], row['Quantity'] + 1, row['Quantity'],
                  color='black', ha='center', fontsize=10)

# Plot for Unit Price
sns.barplot(x='Metric', y='UnitPrice', data=min_max_df, ax=axes[1], color=palette[1], alpha=0.7)
axes[1].set_title('Min-Max Chart for Unit Price', fontsize=16)
axes[1].set_xlabel('Metric', fontsize=14)
axes[1].set_ylabel('Values', fontsize=14)
axes[1].grid(axis='y', linestyle='--', alpha=0.7)
axes[1].axhline(0, color='red', linestyle='--', linewidth=1)  # Horizontal line at y=0

# Annotate min and max values for Unit Price
for index, row in min_max_df.iterrows():
    axes[1].text(row['Metric'], row['UnitPrice'] + 0.1, row['UnitPrice'],
                  color='black', ha='center', fontsize=10)

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Rows where both quantity and price are negative
print('Any rows with both a negative quantity and a negative price?',
      'No' if df[(df.Quantity < 0) & (df.UnitPrice < 0)].shape[0] == 0 else 'Yes', '\n')
# Rows where both quantity and price are zero or negative
print('Rows where quantity and price are both zero or negative:',
      df[(df.Quantity <= 0) & (df.UnitPrice <= 0)].shape[0])
print('\nCustomer IDs of those rows:',
      df.loc[(df.Quantity <= 0) & (df.UnitPrice <= 0), 'CustomerID'].unique())
print('\n% of rows with a negative quantity: {:3.2%}'.format(df[df.Quantity < 0].shape[0] / df.shape[0]))
# str() guards against invoice numbers that were parsed as integers
print('\nFirst character of InvoiceNo for rows with a negative quantity:',
      df.loc[(df.Quantity < 0) & df.CustomerID.notnull(), 'InvoiceNo'].apply(lambda x: str(x)[0]).unique())
print('\nExample of a negative quantity and its related records:')
display(df[(df.CustomerID == 12472) & (df.StockCode == 22244)])

In [None]:
print('Rows with a negative UnitPrice:')
display(df[(df.UnitPrice<0)])
print("Sales records with a Customer ID and a zero UnitPrice:", df[(df.UnitPrice==0) & ~(df.CustomerID.isnull())].shape[0])
df[(df.UnitPrice==0)  & ~(df.CustomerID.isnull())].head()

As you can see, there are no records where both quantity and price are negative, but there are 1,336 records where one of them is negative and the other is 0. Note, however, that none of these records has a customer ID. We therefore conclude that we can drop all records in which the quantity or the price is negative. **In addition, the earlier summary shows that there are 135,080 records without customer identification that we may also disregard.**
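
The filtering logic described above can be sketched on a few made-up rows. The invoice numbers and prices below are illustrative only; the `C` prefix for cancellations follows the invoice-prefix check made in the previous cell.

```python
import pandas as pd

# Made-up rows: one normal sale, one cancellation, one zero-price row, one normal sale.
tx = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381"],
    "Quantity": [6, -1, 3, 2],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25],
})

# Cancellations are marked by an invoice number starting with "C".
is_cancellation = tx["InvoiceNo"].astype(str).str.startswith("C")

# A valid sale has a positive quantity and a positive unit price.
is_valid_sale = (tx["Quantity"] > 0) & (tx["UnitPrice"] > 0)

sales = tx[is_valid_sale]
print(f"cancellations: {is_cancellation.sum()}, valid sales kept: {len(sales)}")
```

Building the boolean masks first makes it easy to count what each rule removes before actually dropping any rows.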

#### Remove negative or return transactions

In [None]:
df = df[~(df.Quantity<0)]
df = df[df.UnitPrice>0]

Some stock codes appear with more than one description, which shows how easily data quality can be degraded in any dataset: a simple spelling mistake is enough to reduce data quality and lead to an erroneous analysis. To fix this, we keep the most frequent description for each stock code.

In [None]:
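# Keep the most frequent description per StockCode; on ties, take the first mode
# so that each StockCode maps to exactly one description (no duplicated rows in the join)
unique_desc = (df.groupby("StockCode")["Description"]
                 .agg(lambda s: s.mode().iloc[0])
                 .reset_index())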
q = '''
select df.InvoiceNo, df.StockCode, un.Description, df.Quantity, df.InvoiceDate,
       df.UnitPrice, df.CustomerID, df.Country
from df as df INNER JOIN
     unique_desc as un on df.StockCode = un.StockCode
'''

df = pysqldf(q)

#### Type Conversion and creating Amount column

In [None]:
df.InvoiceDate = pd.to_datetime(df.InvoiceDate)
df['Amount'] = df.Quantity*df.UnitPrice
df.CustomerID = df.CustomerID.astype('Int64')

### What all manipulations have you done and insights you found?

The code above performs the following data manipulations:

- **Handles missing values:** Removes rows with missing values in any column.
- **Removes duplicate rows:** Keeps only the unique rows in the dataset.
- **Removes negative or return transactions:** Filters out transactions with a negative quantity or a non-positive unit price.
- **Ensures consistent product descriptions:** Groups by StockCode and uses the mode to handle multiple descriptions for the same product.
- **Performs type conversion:** Converts InvoiceDate to datetime, CustomerID to integer, and creates an Amount column by multiplying Quantity and UnitPrice.

The insights from these manipulations include:

- There were missing values in the `Description` and `CustomerID` columns, which were removed as they can't be imputed with meaningful values.
- There were duplicate transactions, which were removed to avoid inaccuracies in analysis.
- Negative quantity values likely indicate returns, and were removed to focus on actual sales.
- Some products had inconsistent descriptions, which were standardized to ensure data quality.
- Type conversion ensures that data is in the correct format for analysis and modeling.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1. Normalized Amount Sales by Country (Bar Chart)

In [None]:
# Chart - 1 visualization code
# Create a copy of the DataFrame
temp1 = df.copy()

# Normalize the 'Amount' values between 0 and 1
temp1['normalized_amount'] = (temp1['Amount'] - temp1['Amount'].min()) / (temp1['Amount'].max() - temp1['Amount'].min())

# Create the first figure for the bar chart
fig1 = plt.figure(figsize=(12, 7))
amount_sales = temp1.groupby(["Country"])['normalized_amount'].sum().sort_values(ascending=False)

# Create the bar plot with thicker bars
bars = amount_sales.plot(kind='bar', color='skyblue', ax=fig1.add_subplot(111), width=0.6)

# Adding titles and labels
plt.title('Normalized Amount Sales by Country', fontsize=20, fontweight='bold')
plt.xlabel('Country', fontsize=12, fontweight='bold')
plt.ylabel('Normalized Sales Amount', fontsize=12, fontweight='bold')
plt.xticks(rotation=90, fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Adding data labels on top of the bars
for index, value in enumerate(amount_sales):
    plt.text(index, value + 0.02, f'{value:.4f}', ha='center', va='bottom', fontsize=8, color='black', rotation=90)

# Show the first plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The bar chart effectively illustrates the distribution of sales amounts across various countries.

##### 2. What is/are the insight(s) found from the chart?

The insight from this chart is that the UK has by far the highest normalized sales amount, followed by the Netherlands and EIRE. This suggests that these countries are the most important markets for the business.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from this chart can definitely have a positive impact on the business. By identifying the top-performing countries, the business can develop targeted marketing campaigns and allocate resources effectively to maximize revenue. There's also potential for expanding into new markets or further developing existing high-performing ones. However, it's important to be mindful of potential downsides. If the business is heavily reliant on just a few countries, any negative changes in those markets could have a big impact. Also, the chart may highlight some underperforming countries, indicating a need for the business to reassess its strategies in those areas.

#### Chart - 2. Internal Market Distribution (Pie Chart)

In [None]:
# Chart - 2 visualization code
# Create the second figure for the pie chart
fig2 = plt.figure(figsize=(6, 6))
temp1['Internal'] = temp1.Country.apply(lambda x: 'UK' if x == 'United Kingdom' else 'Others')
market = temp1.groupby(["Internal"]).Amount.sum().sort_values(ascending=False)

# Create the pie chart
colors = ['lightcoral', 'lightskyblue']
plt.pie(market, labels=market.index, autopct='%1.1f%%', shadow=True, startangle=90, colors=colors)

# Adding title and styling
plt.title('Internal Market Distribution', fontsize=20, fontweight='bold')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Show the second plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is a suitable choice because it effectively visualizes the proportion of sales attributed to the UK market compared to other countries. Pie charts are excellent for displaying the relative size of different categories within a whole, making it easy to understand the dominant market for the online retail business.

##### 2. What is/are the insight(s) found from the chart?

A significant portion of the sales comes from the UK, representing approximately 82.0% of the total sales. This indicates a strong domestic market presence for the online retail business. The remaining 18.0% represents sales from other countries, suggesting potential for international market expansion or a focus on strengthening the existing UK market dominance.
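
The share behind such a pie chart can be computed directly. The sketch below uses made-up amounts (so the resulting percentages are not those of the real dataset) and mirrors the UK-versus-Others split used in the chart.

```python
import pandas as pd

# Made-up sales rows; the amounts are illustrative, not taken from the dataset.
sales = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France", "Germany"],
    "Amount": [40.0, 35.0, 15.0, 10.0],
})

# Collapse every non-UK country into a single "Others" bucket.
market = sales["Country"].where(sales["Country"] == "United Kingdom", "Others")

# Share of total sales per bucket.
share = sales.groupby(market)["Amount"].sum() / sales["Amount"].sum()
print((share * 100).round(1))
```

Reporting the numeric share alongside the chart avoids having to read percentages off the plot.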

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The dominance of the UK market highlights the importance of continuing to cater to those customers, perhaps through tailored marketing or loyalty programs. At the same time, the presence of international sales, even if smaller, shows there's room to grow by expanding into new countries or improving the experience for existing international customers. However, there are also potential risks. Relying too much on the UK market could be problematic if that market experiences difficulties. Additionally, not focusing enough on expanding internationally might mean missing out on valuable opportunities for growth in the long run.

#### Chart - 3. Top Customer Sales Contribution (Bar Chart)

In [None]:
# Chart - 3 visualization code

# Create a figure for Top Customers Sales Amount
fig1 = plt.figure(figsize=(25, 6))

# Top Customers Sales Amount
top_customers = df.groupby(["CustomerID"]).Amount.sum().sort_values(ascending=False)[:51]
total_sales = df.groupby(["CustomerID"]).Amount.sum().sum()
percent_sales = np.round((top_customers.sum() / total_sales) * 100, 2)

# Bar chart for Top Customers
ax1 = fig1.add_subplot(121)
top_customers.plot(kind='bar', color='skyblue', ax=ax1)
ax1.set_title(f'Top Customers: {percent_sales:.2f}% Sales Amount', fontsize=20, fontweight='bold')
ax1.set_xlabel('Customer ID', fontsize=14)
ax1.set_ylabel('Total Sales Amount', fontsize=14)
ax1.tick_params(axis='x', rotation=90, labelsize=8)  # Decrease x tick font size
ax1.tick_params(axis='y', labelsize=8)  # Decrease y tick font size
ax1.grid(axis='y', linestyle='--', alpha=0.7)

# Show the first plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are great for comparing discrete categories, in this case, individual customers, and their corresponding sales amounts. This allows for easy identification of the highest-spending customers and understanding the distribution of sales among the top customer base.

##### 2. What is/are the insight(s) found from the chart?

It highlights the Pareto principle, often known as the 80/20 rule, where a small percentage of customers generate a large proportion of sales. This suggests that focusing on retaining and nurturing these high-value customers is crucial for business success. It also indicates potential for developing strategies to increase the spending of other customer segments.
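
One way to quantify this concentration is the cumulative share of sales across customers ranked by spend. The figures below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical total sales per customer, sorted from highest to lowest.
customer_sales = pd.Series({
    "c1": 500.0, "c2": 250.0, "c3": 100.0, "c4": 50.0, "c5": 40.0,
    "c6": 30.0, "c7": 15.0, "c8": 10.0, "c9": 3.0, "c10": 2.0,
}).sort_values(ascending=False)

# Cumulative share of total sales as customers are added in rank order.
cum_share = customer_sales.cumsum() / customer_sales.sum()

# Smallest number of top customers whose combined sales reach 80% of the total.
n_for_80 = int(np.searchsorted(cum_share.values, 0.80)) + 1
pct_customers = n_for_80 / len(customer_sales)

print(f"{n_for_80} customers ({pct_customers:.0%}) account for "
      f"{cum_share.iloc[n_for_80 - 1]:.0%} of sales")
```

Applied to the real per-customer totals, the same calculation shows how closely the data follows the 80/20 pattern.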

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By understanding which customers are the highest spenders, the business can implement tailored strategies like personalized offers and loyalty programs to keep them happy and coming back for more. It also allows for more targeted marketing efforts, ensuring that different customer groups receive the most relevant messages. However, there are potential drawbacks to consider. If the business relies too heavily on a small group of big spenders, losing even a few could have a significant impact. It's also important to avoid neglecting other customers who, with the right encouragement, could become more valuable over time. Finally, focusing solely on existing customers might lead to slower growth in the overall customer base.

#### Chart - 4. Top Customer Sales Amount vs. Event Frequency (Bar Chart)

In [None]:
# Chart - 4 visualization code
# Create a new figure for Top 10 Customers
fig2 = plt.figure(figsize=(25, 7))
f1 = fig2.add_subplot(121)

# Top 10 Customers Sales Amount
top_10_customers = df.groupby(["CustomerID"]).Amount.sum().sort_values(ascending=False)[:10]
percent_sales_top_10 = np.round((top_10_customers.sum() / total_sales) * 100, 2)

# Bar chart for Top 10 Customers
top_10_customers.plot(kind='bar', color='lightgreen', ax=f1)
f1.set_title(f'Top 10 Customers: {percent_sales_top_10:.2f}% Sales Amount', fontsize=20, fontweight='bold')
f1.set_xlabel('Customer ID', fontsize=14)
f1.set_ylabel('Total Sales Amount', fontsize=14)
f1.tick_params(axis='x', rotation=45)
f1.grid(axis='y', linestyle='--', alpha=0.7)

# Adding data labels on top of the bars
for index, value in enumerate(top_10_customers):
    f1.text(index, value + 5, f'{value}', ha='center', va='bottom', fontsize=12, color='black')

# Top 10 Customers Event Sales
f2 = fig2.add_subplot(122)
top_10_event_sales = df.groupby(["CustomerID"]).Amount.count().sort_values(ascending=False)[:10]
percent_sales_event = np.round((top_10_event_sales.sum() / df.groupby(["CustomerID"]).Amount.count().sum()) * 100, 2)

# Bar chart for Top 10 Customers Event Sales
top_10_event_sales.plot(kind='bar', color='salmon', ax=f2)
f2.set_title(f'Top 10 Customers: {percent_sales_event:.2f}% Event Sales', fontsize=20, fontweight='bold')
f2.set_xlabel('Customer ID', fontsize=14)
f2.set_ylabel('Number of Events', fontsize=14)
f2.tick_params(axis='x', rotation=45)
f2.grid(axis='y', linestyle='--', alpha=0.7)

# Adding data labels on top of the bars
for index, value in enumerate(top_10_event_sales):
    f2.text(index, value + 1, f'{value}', ha='center', va='bottom', fontsize=12, color='black')

# Show the second plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This is a useful approach for comparing related metrics for the same categories. In this case, it allows for a direct comparison of sales amount and the number of purchase events for the top 10 customers. This helps understand not only who the biggest spenders are but also how frequently they make purchases.

##### 2. What is/are the insight(s) found from the chart?

This provides a fascinating look at how different customers contribute to the business. We see that "big spenders" don't necessarily shop frequently, like customer 14646 who makes large, infrequent purchases. In contrast, customers like 18102 are very frequent shoppers, contributing to sales through the sheer volume of transactions rather than individual purchase size. This means that customer value can come in different forms, and understanding these variations is key to developing successful customer engagement strategies.
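
The difference between the two customer types can be made explicit with an average order value per customer. This is a sketch on made-up rows; note that the chart above counts rows (`Amount.count()`) as events, whereas this example counts distinct invoices, which is an assumption made here for illustration.

```python
import pandas as pd

# Made-up transactions for two hypothetical customers.
tx = pd.DataFrame({
    "CustomerID": [101, 101, 102, 102, 102, 102],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "B3", "B4"],
    "Amount": [900.0, 1100.0, 50.0, 60.0, 40.0, 50.0],
})

profile = tx.groupby("CustomerID").agg(
    TotalSpend=("Amount", "sum"),      # how much the customer spent overall
    Orders=("InvoiceNo", "nunique"),   # how many distinct invoices they placed
)
# Average order value separates "few large orders" from "many small orders".
profile["AvgOrderValue"] = profile["TotalSpend"] / profile["Orders"]
print(profile)
```

Here customer 101 spends more in total through two large orders, while customer 102 orders twice as often with a much smaller average order value.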

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from this can definitely guide the business towards a positive impact. By understanding that some customers are big spenders while others are frequent buyers, the business can create much more effective and personalized offers. This could involve anything from exclusive deals for high-value customers to loyalty programs for those who shop regularly. It also helps with things like making sure the right products are in stock and using the most effective communication channels for each customer type. However, there are some potential downsides to be aware of. Relying too much on any one type of customer could be risky if their behavior changes. It's also important to make sure all customer segments feel valued and appreciated, as even small, frequent purchases can add up significantly. The key takeaway is that a flexible approach to customer engagement, recognizing the different ways customers contribute value, is essential for sustained growth.

In [None]:
# Calculate total sales amount and unique invoice counts
AmoutSum = df.groupby(["Description"]).Amount.sum().sort_values(ascending=False)
inv = df[["Description", "InvoiceNo"]].groupby(["Description"]).InvoiceNo.unique().agg(np.size).sort_values(ascending=False)

# Function to create bar charts
def create_bar_chart(data, title, ax, annotate=True):
    bars = data.plot(kind='bar', ax=ax, color='skyblue', edgecolor='black')
    ax.set_title(title, fontsize=20, fontweight='bold')
    ax.set_xlabel('Product Description', fontsize=14)
    ax.set_ylabel('Sales Amount', fontsize=14)
    ax.tick_params(axis='x', rotation=90, labelsize=10)  # Rotate labels to 90 degrees
    ax.tick_params(axis='y', labelsize=10)
    ax.grid(axis='y', linestyle='--', alpha=0.7)

    # Adding data labels on top of the bars if annotate is True
    if annotate:
        for bar in bars.patches:
            ax.annotate(f'{bar.get_height():.0f}',
                        (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                        ha='center', va='bottom', fontsize=10, color='black')

#### Chart - 5. Top Products by Sales Amount and Event Count (Bar Chart)

In [None]:
# Chart - 5 visualization code
# Create the first figure for Top 10 Products
fig1, (f1, f2) = plt.subplots(1, 2, figsize=(25, 7))

Top10 = list(AmoutSum[:10].index)
PercentSales = np.round((AmoutSum[Top10].sum() / AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top10].sum() / inv.sum()) * 100, 2)
create_bar_chart(AmoutSum[Top10],
                  f'Top 10 Products in Sales Amount: {PercentSales:.2f}% of Amount and {PercentEvents:.2f}% of Events',
                  f1)

Top10Ev = list(inv[:10].index)
PercentSales = np.round((AmoutSum[Top10Ev].sum() / AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top10Ev].sum() / inv.sum()) * 100, 2)
create_bar_chart(inv[Top10Ev],
                  f'Events of Top 10 Most Sold Products: {PercentSales:.2f}% of Amount and {PercentEvents:.2f}% of Events',
                  f2)

# Show all plots
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart, which again uses two bar charts side by side, is designed to compare sales performance across different products. Bar charts excel at this type of comparison, allowing us to quickly identify the top-performing products in terms of both sales amount and the number of times they were sold (events). This dual perspective is crucial for understanding which products are driving the most revenue and which are most popular among customers.

##### 2. What is/are the insight(s) found from the chart?

This provides a valuable lesson in not judging a product by its sales figures alone. We can see that some products, like the "REGENCY CAKESTAND 3 TIER," are incredibly popular and fly off the shelves, but don't necessarily bring in the big bucks. On the flip side, there are products that generate a lot of revenue despite being purchased less frequently, likely indicating a higher price point. This tells us that a product's success isn't just about how often it's sold, but also about the value of each sale. The business can use this information to develop different strategies for different products, perhaps focusing on volume for some and promoting higher-value items to specific customer segments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information from this can definitely help the business make smarter decisions and boost its success. By understanding how different products perform in terms of both popularity and revenue, the business can optimize its product mix, pricing, and promotions. This could involve anything from creating attractive bundle deals for popular items to targeting specific customer groups with high-value products. It also helps with things like managing stock levels effectively and getting ideas for new products that are likely to be a hit with customers. However, there are some potential pitfalls to avoid. For example, it's important to consider both sales figures and revenue generated, as focusing on just one aspect could lead to bad choices. The business also needs to make sure it's offering a good variety of products to cater to different needs and budgets. Finally, avoiding over-reliance on a few star products is important to ensure long-term stability and growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Create the second figure for Top 15 Products
fig2 = plt.figure(figsize=(25, 7))
Top15ev = list(inv[:15].index)
PercentSales = np.round((AmoutSum[Top15ev].sum() / AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top15ev].sum() / inv.sum()) * 100, 2)
create_bar_chart(AmoutSum[Top15ev].sort_values(ascending=False),
                  f'Sales Amount of Top 15 Most Sold Products: {PercentSales:.2f}% of Amount and {PercentEvents:.2f}% of Events',
                  plt.gca())

# Show all plots
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This utilizes a single bar chart to visualize the sales amount of the top 15 most sold products. While similar to Chart 5, this chart focuses specifically on the relationship between sales volume (represented by the frequency of products sold) and the resulting sales amount. A bar chart effectively displays this comparison, allowing for a clear understanding of which products generate the highest revenue despite not necessarily being the absolute top sellers in terms of quantity.

##### 2. What is/are the insight(s) found from the chart?

This dives deeper into the sales performance of the top 15 most frequently sold products. The key insight here is that even among the most popular products, there's a significant variation in the generated sales amount. Some products that are frequently purchased contribute significantly to revenue, while others, despite their popularity, generate comparatively lower sales. This suggests a potential difference in price points or the purchase quantity of these products.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This provides a more granular view of product performance, showing that even among the most popular items, there's a difference in how much money they bring in. This information can be used to create smart strategies like bundling low-value items with others to increase overall sales or running targeted promotions to boost revenue from specific products. It also helps the business make better decisions about stock levels, ensuring they have enough of the right products on hand. However, it's important to avoid getting caught up in sales figures alone and to always consider the actual revenue generated by each product. Failing to do so could mean missing out on opportunities to increase profits, such as by not promoting or bundling popular items effectively.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Create the third figure for Top 50 Products
fig3 = plt.figure(figsize=(25, 7))
Top50 = list(AmoutSum[:50].index)
PercentSales = np.round((AmoutSum[Top50].sum() / AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top50].sum() / inv.sum()) * 100, 2)
create_bar_chart(AmoutSum[Top50],
                  f'Top 50 Products in Sales Amount: {PercentSales:.2f}% of Amount and {PercentEvents:.2f}% of Events',
                  plt.gca(), annotate=False)
# Show all plots
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This uses a bar chart to display the top 50 products based on sales amount. This is an effective way to visualize the distribution of revenue across a larger set of products compared to Charts 5 and 6, which focused on smaller subsets. The bar chart format allows for easy comparison of sales amounts for different products, highlighting those that contribute the most to overall revenue.

##### 2. What is/are the insight(s) found from the chart?

This expands the view of product performance by showcasing the top 50 products in terms of sales amount. This broader perspective reveals that a relatively small number of products contribute significantly to the overall revenue. This reinforces the Pareto principle observed in earlier charts, highlighting the importance of focusing on high-performing products and understanding their characteristics to inform future product development and marketing strategies.
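The concentration of revenue described above can be quantified with a cumulative share. The following is a minimal sketch on invented figures: the `revenue` series and the `top_n_share` helper are hypothetical and not part of the notebook, which computes the same kind of percentage from `AmoutSum`.

```python
import pandas as pd

# Hypothetical per-product revenue, for illustration only (not from the dataset).
revenue = pd.Series(
    {"A": 500.0, "B": 300.0, "C": 100.0, "D": 60.0, "E": 40.0},
    name="Amount",
)


def top_n_share(amounts: pd.Series, n: int) -> float:
    """Return the percentage of total revenue contributed by the n largest items."""
    ranked = amounts.sort_values(ascending=False)
    return float(ranked.head(n).sum() / ranked.sum() * 100)


share_top2 = top_n_share(revenue, 2)
print(f"Top 2 of {len(revenue)} products: {share_top2:.1f}% of revenue")
```

In this toy example two of the five products account for 80% of the revenue, which is the kind of imbalance the Pareto principle describes.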

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This gives us a big-picture view of the products that are really driving the business's financial success. By identifying these top performers, the business can make them a priority in terms of marketing, stock management, and even when developing new products. This focused approach can lead to higher sales and increased profits. It also helps to tailor marketing messages to the right customer groups who are most likely to be interested in these key products. However, it's important to remember that putting all your eggs in one basket can be risky. Over-relying on a few top products might leave the business vulnerable if those products lose popularity. It's also important to keep an eye on those niche products that might not be bestsellers but could still be attracting valuable customer segments.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Create the fourth figure for Top 50 Events
fig4 = plt.figure(figsize=(25, 7))
Top50Ev = list(inv[:50].index)
PercentSales = np.round((AmoutSum[Top50Ev].sum() / AmoutSum.sum()) * 100, 2)
PercentEvents = np.round((inv[Top50Ev].sum() / inv.sum()) * 100, 2)
create_bar_chart(inv[Top50Ev],
                  f'Top 50 Most Sold Products: {PercentSales:.2f}% of Amount and {PercentEvents:.2f}% of Events',
                  plt.gca(), annotate=False)

# Show all plots
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This employs a bar chart to visualize the top 50 most sold products based on the number of events or transactions. This complements Chart 7 by focusing on sales volume rather than just revenue. The bar chart format facilitates the comparison of transaction frequencies for different products, highlighting those that are most popular among customers regardless of their individual price or contribution to overall revenue.

##### 2. What is/are the insight(s) found from the chart?

This shifts the focus to the top 50 products based purely on how often they are purchased. This highlights the products that are most popular among customers, regardless of their individual price or revenue contribution. This perspective helps understand customer preferences and identify potential high-demand products that might not necessarily be the top revenue generators.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This gives us a clear picture of what customers love to buy, regardless of the price tag. This information is like gold for the business because it helps them understand customer preferences and tailor their offerings accordingly. It's also really useful for managing stock levels, making sure they have enough of the popular items to satisfy demand. Additionally, the business can use this knowledge to suggest related products or offer premium versions, potentially increasing the value of each sale. Popular products can even be used as bait in promotions to attract more customers. However, there are some potential downsides to watch out for. It's important to remember that popular doesn't always mean profitable. Focusing too much on high-volume sales might lead to overlooking those products that bring in more money despite being purchased less often. If most of the popular items have small profit margins, it could impact the overall profitability of the business.

### Customer Segmentation:
Customer segmentation is the process of dividing an organization’s customer base into different sections or segments based on various customer attributes. It rests on the premise of finding differences in customers’ behavior and patterns.

The major objectives and benefits behind the motivation for customer segmentation are:
* **Higher Revenue**: This is the most obvious requirement of any customer segmentation project.
* **Customer Understanding**: One of the most widely accepted business paradigms is “know your customer”, and a segmentation of the customer base allows for a close dissection of this paradigm.
* **Target Marketing**:  The most visible reason for customer segmentation is the ability to focus marketing efforts effectively and efficiently. If a firm knows the different segments of its customer base, it can devise better marketing campaigns which are tailor made for the segment. A good segmentation model allows for better understanding of customer requirements and hence increases the chances of the success of any marketing campaign developed by the organization.
* **Optimal Product Placement**: A good customer segmentation strategy can also help the firm with developing or offering new products, or a bundle of products together as a combined offering.
* **Finding Latent Customer Segments**: Segmentation can reveal which customer segments the firm might be missing, so that untapped segments can be reached through focused marketing campaigns or new business development.

**Clustering**:

The most obvious method to perform customer segmentation is using unsupervised Machine Learning methods like clustering.  The method is as simple as collecting as much data about the customers as possible in the form of features or attributes and then finding out the different clusters that can be obtained from that data. Finally, we can find traits of customer segments by analyzing the characteristics of the clusters.
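As a rough illustration of the idea, the sketch below clusters invented customers and then profiles each cluster by its average feature values. The data, the two feature names, and the choice of two clusters are assumptions made only for this example; they are not the notebook's actual features or results.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic customer groups, invented for illustration.
low = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(50, 2))
high = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
customers = pd.DataFrame(np.vstack([low, high]), columns=["frequency", "spend"])

# Group the customers into two clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["cluster"] = model.fit_predict(customers[["frequency", "spend"]])

# Profile each cluster by its average feature values.
profile = customers.groupby("cluster").mean()
print(profile)
```

The per-cluster averages in `profile` are what an analyst would read to describe each segment, for example "infrequent, low spend" versus "frequent, high spend".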

**Exploratory Data Analysis**:

Using exploratory data analysis is another way of finding out customer segments. This is usually done by analysts who have a good knowledge about the domain relevant to both products and customers. It can be done flexibly to include the top decision points in an analysis.

### RFM Model for Customer Value:

Since our dataset is limited to sales records and doesn't include any other information about our customers, we will use an **RFM** (**Recency, Frequency and Monetary Value**) based model of customer value to find our customer segments.
The RFM model takes the transactions of a customer and calculates three important informational attributes about each customer:
- **Recency**: The value of how recently a customer purchased at the establishment
- **Frequency**: How frequent the customer’s transactions are at the establishment
- **Monetary value**: The dollar (or pounds in our case) value of all the transactions that the customer made at the establishment
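The three attributes can be sketched in a single aggregation on a toy table. The transactions below are invented, and the one-step `groupby(...).agg(...)` is an alternative formulation shown for illustration; the notebook itself builds the three columns step by step. The column names mirror those used in the notebook.

```python
import pandas as pd

# Toy transactions, invented for illustration; column names mirror the notebook's.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-03-01", "2011-02-01", "2011-02-01"]
    ),
    "Amount": [10.0, 5.0, 20.0, 7.0, 3.0],
})

# Reference date: one day after the last transaction in the data.
reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    # Days between the customer's latest purchase and the reference date.
    recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    # Number of distinct invoices, so multi-line invoices count once.
    frequency=("InvoiceNo", "nunique"),
    # Total amount spent across all transactions.
    monetary=("Amount", "sum"),
)
print(rfm)
```

Counting distinct invoices rather than rows matters here because one invoice usually spans several product lines.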

#### Recency
To create the recency feature variable, we need to decide the reference date for our analysis. Usually, we make use of the last transaction date plus one day. Then, we will construct the recency variable as the number of days before the reference date when a customer last made a purchase.

In [None]:
# Reference date: one day after the most recent transaction in the data
refrence_date = df.InvoiceDate.max() + datetime.timedelta(days = 1)
print('Reference Date:', refrence_date)
# Time elapsed between each transaction and the reference date
df['days_since_last_purchase'] = (refrence_date - df.InvoiceDate).astype('timedelta64[s]')
# Recency per customer: the smallest elapsed time, i.e. the most recent purchase
customer_history_df =  df[['CustomerID', 'days_since_last_purchase']].groupby("CustomerID").min().reset_index()
customer_history_df.rename(columns={'days_since_last_purchase':'recency'}, inplace=True)
customer_history_df.describe().transpose()

##### We will plot the Recency distribution and Q-Q plot to identify substantive departures from normality, such as outliers, skewness and kurtosis.

In [None]:
def QQ_plot(data, measure):
    # Ensure the data is numeric and drop NaN values
    data = pd.to_numeric(data, errors='coerce').dropna()

    # Check if the data is empty after conversion
    if data.empty:
        print("The data provided is empty after conversion to numeric.")
        return

    # Set the seaborn theme
    sns.set_theme(style="whitegrid")  # You can change the style to "darkgrid", "white", "ticks", etc.

    fig = plt.figure(figsize=(20, 7))

    # Get the fitted parameters used by the function
    (mu, sigma) = norm.fit(data)

    # Histogram with kernel density estimate and fitted normal curve
    fig1 = fig.add_subplot(121)
    sns.histplot(data, kde=True, stat="density", color='skyblue', bins=30, alpha=0.6)
    x = np.linspace(min(data), max(data), 100)
    p = norm.pdf(x, mu, sigma)
    plt.plot(x, p, color='red', linewidth=2, label='Fitted Normal Distribution')
    plt.legend()
    fig1.set_title(f'{measure} Distribution (mu = {mu:.2f} and sigma = {sigma:.2f})', loc='center', fontsize=16)
    fig1.set_xlabel(measure, fontsize=14)
    fig1.set_ylabel('Density', fontsize=14)

    # QQ plot
    fig2 = fig.add_subplot(122)
    res = probplot(data, dist="norm", plot=fig2)

    # Change the colors of the Q-Q plot points and the reference line
    fig2.get_lines()[0].set_color('orange')  # Sample points (probplot draws these first)
    fig2.get_lines()[1].set_color('blue')  # Fitted reference line

    fig2.set_title(f'{measure} Probability Plot (skewness: {data.skew():.6f} and kurtosis: {data.kurt():.6f})', loc='center', fontsize=16)

    # Customize tick parameters
    fig1.tick_params(axis='both', labelsize=12)
    fig2.tick_params(axis='both', labelsize=12)

    plt.tight_layout()
    plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
QQ_plot(customer_history_df.recency, 'Recency')

##### 1. Why did you pick the specific chart?

This chart combines two powerful tools, a histogram and a Q-Q plot, to give us a deep dive into how recently customers have been making purchases. The histogram paints a picture of how the data is spread out, showing how many customers fall into different "recency" categories. The Q-Q plot, on the other hand, helps us figure out if this data follows a typical bell curve pattern or if it's skewed in some way. This combo is super helpful because it allows us to see things like whether the data is lopsided, has any extreme values, or any other quirks that could affect how we analyze and use this information for things like customer segmentation.

##### 2. What is/are the insight(s) found from the chart?

The chart provides valuable insights into the recency of customer purchases. The histogram reveals that the data is heavily skewed to the right, indicating that a large proportion of customers have made purchases very recently. This suggests a potentially high level of customer engagement and repeat business. However, the long tail on the right indicates that there are also customers who haven't purchased in a while.

The Q-Q plot further confirms the non-normality of the data, as the points deviate significantly from the straight line, particularly at the tails. This deviation suggests that the data might not be suitable for certain statistical analyses that assume normality. Transformations or alternative methods might be needed for accurate modeling and insights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This provides valuable insights into customer purchase patterns, revealing that many customers are actively engaged while others haven't made a purchase recently. This knowledge allows the business to create targeted strategies, like sending personalized messages to those who haven't shopped in a while to win them back. It also helps with dividing customers into different groups based on their behavior, allowing for more effective marketing and tailored recommendations. Being aware of inactive customers helps prevent them from leaving and potentially losing valuable business. However, it's important to remember that the data isn't perfectly "normal," meaning some adjustments might be needed when using it for analysis to avoid making wrong decisions based on skewed results. Ignoring inactive customers or misinterpreting the data could negatively impact the business.

**Analysis of Sales Recency Distribution**

The first graph illustrates that the sales recency distribution is skewed, exhibiting a peak on the left and a long tail extending to the right. This indicates a deviation from normality, suggesting a positive skew.

In the Probability Plot, we observe that the sales recency data does not align with the diagonal blue line, which represents a normal distribution. This further confirms the right skewness of the distribution.

With a positive skewness of 1.25, we can affirm the lack of symmetry in the data, indicating that sales recency is skewed to the right. This means that the right tail is longer relative to the left tail. For reference, a normal distribution has a skewness of zero, and symmetric data should have a skewness close to zero, appearing identical on both sides of the center point.

Kurtosis measures the heaviness of the tails in a distribution compared to a normal distribution. The value reported here is excess kurtosis, for which a normal distribution scores zero: positive values suggest heavier tails and a greater chance of outliers, while negative values indicate lighter tails. In this case, with a kurtosis value of 0.43, the sales recency distribution is only mildly heavy-tailed, suggesting the presence of some outliers.
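To make these two statistics concrete, here is a small sketch comparing a symmetric sample with a right-skewed one. The samples are synthetic and the distributions were chosen only for illustration; `Series.skew()` and `Series.kurt()` are the same pandas methods that `QQ_plot` uses for its title.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples, for illustration only.
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.exponential(size=10_000))

for name, s in [("normal", symmetric), ("exponential", right_skewed)]:
    # Series.kurt() reports excess kurtosis, so a normal sample is close to 0.
    print(f"{name:12s} skewness={s.skew():.2f}  excess kurtosis={s.kurt():.2f}")
```

The normal sample should land near zero on both measures, while the exponential sample should show clearly positive skewness and kurtosis, the same pattern seen in the recency data.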

#### Chart - 10

In [None]:
# Chart - 10 visualization code
customer_freq = (df[['CustomerID', 'InvoiceNo']].groupby(["CustomerID", 'InvoiceNo']).count().reset_index()).\
                groupby(["CustomerID"]).count().reset_index()
customer_freq.rename(columns={'InvoiceNo':'frequency'},inplace=True)
customer_history_df = customer_history_df.merge(customer_freq)
QQ_plot(customer_history_df.frequency, 'Frequency')

##### 1. Why did you pick the specific chart?

This dives into how often customers make purchases using a similar approach to Chart 9. The histogram gives us a visual snapshot of how often different customers shop, showing how many fall into various frequency categories. Meanwhile, the Q-Q plot helps us determine if this data follows a typical pattern or if it has any quirks that could skew the results of our analysis. This combination is really helpful for getting a complete picture of customer purchase frequency, allowing us to spot things like imbalances, outliers, or other unusual patterns that might need special attention when we use this data for things like customer segmentation or predicting future behavior.

##### 2. What is/are the insight(s) found from the chart?

This reveals interesting patterns in customer purchase frequency. The histogram shows a right-skewed distribution, indicating that a large portion of customers make purchases relatively infrequently. However, there's a long tail on the right, suggesting the presence of a smaller segment of customers who make purchases very frequently.

The Q-Q plot confirms the non-normality of the data, with points deviating significantly from the straight line, especially at the tails. This deviation indicates that the frequency data might not be suitable for statistical analyses that assume normality. Transformations or alternative methods might be needed to ensure accurate insights and modeling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This sheds light on how often customers shop, showing that while many make purchases infrequently, there's a valuable group of frequent buyers. This information is key for creating targeted marketing campaigns and loyalty programs that cater to the different spending habits of various customer groups. It also helps personalize product suggestions, ensuring that frequent shoppers are tempted with new items and less frequent ones are reminded of what they might be missing. Understanding purchase frequency also helps the business make smart decisions about stock levels. However, it's important to avoid focusing solely on the frequent shoppers and neglecting those who buy less often, as they still represent potential for increased sales. Also, remember that the data isn't perfectly "normal," so some adjustments might be needed during analysis to avoid drawing misleading conclusions.

**Analysis of Sales Frequency Distribution**

From the first graph, we observe that the sales frequency distribution is **skewed**, exhibiting a **peak** on the left with a long tail extending to the right. This indicates a **deviation from normal distribution** and suggests that the data is **positively skewed**.

In the **Probability Plot**, it is evident that the **sales frequency** does **not align with the diagonal line**, further confirming that the distribution is skewed to the right.

With a ***skewness of 12.1***, we can affirm a **significant lack of symmetry** in the data. Additionally, a **kurtosis value of 249** indicates that the distribution is **heavy-tailed** and contains **outliers**.

#### Monetary Value

#### Chart - 11

In [None]:
# Chart - 11 visualization code
customer_monetary_val = df[['CustomerID', 'Amount']].groupby("CustomerID").sum().reset_index()
customer_history_df = customer_history_df.merge(customer_monetary_val)
QQ_plot(customer_history_df.Amount, 'Amount')

##### 1. Why did you pick the specific chart?

This takes us on a journey into how much money customers are spending, using a similar approach to Charts 9 and 10. The histogram gives us a visual overview of customer spending, showing how many customers fall into different spending brackets. The Q-Q plot, on the other hand, helps us figure out if this data follows a typical pattern or if it has any unusual characteristics. This combo helps us get a complete picture of customer spending, allowing us to identify big spenders, those who spend less, and any unusual spending patterns that might need special attention when segmenting customers or making predictions about future spending.

##### 2. What is/are the insight(s) found from the chart?

This reveals insightful patterns in customer spending behavior. The histogram shows a right-skewed distribution, indicating that a large portion of customers spend relatively smaller amounts. However, a long tail on the right suggests the presence of a smaller segment of high-spending customers who contribute significantly to overall revenue.

The Q-Q plot confirms the non-normality of the data, with points deviating considerably from the straight line, particularly at the tails. This deviation suggests that the monetary value data might not be suitable for statistical analyses that assume normality. Transformations or alternative methods might be needed to ensure accurate insights and modeling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This gives us a glimpse into the spending habits of customers, showing that while many spend moderate amounts, there's a valuable group of high rollers who contribute significantly to the bottom line. This information is crucial for tailoring marketing strategies and loyalty programs to different customer groups, ensuring that everyone feels valued and receives relevant offers. Understanding spending patterns allows the business to personalize recommendations, tempting big spenders with luxury items and offering budget-conscious shoppers attractive deals. By focusing on retaining those valuable high-spending customers, the business can boost its overall profitability. However, it's important to avoid alienating other customer segments by focusing solely on the high spenders. It's also crucial to remember that the data isn't perfectly "normal," so adjustments might be needed during analysis to avoid making inaccurate conclusions.

**Analysis of Sales Amount Distribution**

From the first graph, we observe that the sales amount distribution is **skewed**, exhibiting a **peak** on the left with a long tail extending to the right. This indicates a **deviation from normal distribution** and suggests that the data is **positively skewed**.

In the **Probability Plot**, it is evident that the **sales amount** does **not align with the diagonal line**, particularly on the right side.

With a ***skewness of 19.3***, we confirm a **significant lack of symmetry** in the data. Additionally, a **kurtosis value of 478** indicates that the distribution is **extremely heavy-tailed** and contains **outliers**, with more than 10 being very extreme.

Let’s take a look at a statistical summary of this dataset:

In [None]:
customer_history_df.describe()

#### Data Preprocessing
Once we have created our customer value dataset, we will perform some preprocessing on the data. For our clustering, we will be using the K-means clustering algorithm. One of the requirements for proper functioning of the algorithm is that the variables be standardized. Standardizing a variable means subtracting its mean and dividing by its standard deviation, so that the variable has a mean of 0 and a variance of 1. This puts all the variables on a comparable scale, so that differences in their ranges don't dominate the distance calculations and degrade the algorithm's results. This is a form of feature scaling.

Another problem worth investigating is the huge range of values each variable can take. This problem is particularly noticeable for the monetary amount variable. To take care of it, we will transform all the variables to the log scale. This transformation, along with the standardization, will ensure that the input to our algorithm is a homogeneous set of scaled and transformed values.
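The effect of the log transformation on a long right tail can be checked directly. The sketch below uses a synthetic, strictly positive sample; treating spend as log-normal is an assumption made only for this illustration, not a claim about the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic, strictly positive "spend" values (log-normal by assumption).
amount = pd.Series(rng.lognormal(mean=5.0, sigma=1.5, size=5_000))

# The natural log compresses large values far more than small ones.
amount_log = np.log(amount)

print(f"skewness before log: {amount.skew():.2f}")
print(f"skewness after log:  {amount_log.skew():.2f}")
```

Note that the logarithm is only defined for strictly positive values, so zero or negative amounts would need to be handled before this step.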

An important point about the data preprocessing step is that sometimes we need it to be reversible. In our case, we will have the clustering results in terms of the log-transformed and scaled variables. But to make inferences in terms of the original data, we will need to reverse transform all the variables so that we get back the actual RFM figures. This can be done by using the preprocessing capabilities of Python.
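One possible way to reverse the two steps is sketched below on invented values: `StandardScaler.inverse_transform` undoes the scaling and `np.exp` undoes the natural log. The toy `rfm` table is hypothetical, and this is only one way to do the round trip, not necessarily how the notebook does it later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy RFM values, invented for illustration.
rfm = pd.DataFrame({
    "recency": [1.0, 30.0, 200.0, 365.0],
    "frequency": [12.0, 4.0, 2.0, 1.0],
    "amount": [5000.0, 800.0, 150.0, 40.0],
})

# Forward: log transform, then standardize to mean 0 and variance 1.
scaler = StandardScaler().fit(np.log(rfm))
X_scaled = scaler.transform(np.log(rfm))

# Backward: undo the scaling, then undo the log.
recovered = pd.DataFrame(
    np.exp(scaler.inverse_transform(X_scaled)), columns=rfm.columns
)
print(recovered.round(2))
```

The same round trip can be applied to cluster centers, which live in the scaled log space, to express them as days, invoice counts, and currency amounts.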

In [None]:
# Log-transform each RFM variable (math.log requires strictly positive values)
customer_history_df['recency_log'] = customer_history_df['recency'].dt.days.apply(math.log)
customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
customer_history_df['amount_log'] = customer_history_df['Amount'].apply(math.log)
feature_vector = ['amount_log', 'recency_log','frequency_log']
X_subset = customer_history_df[feature_vector]
# Standardize the log features to mean 0 and variance 1
scaler = preprocessing.StandardScaler().fit(X_subset)
X_scaled = scaler.transform(X_subset)
# Summary statistics of the scaled features
pd.DataFrame(X_scaled, columns=X_subset.columns).describe().T

In [None]:
customer_history_df['recency'] = customer_history_df['recency'].dt.days

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Create regression plots
fig1 = plt.figure(figsize=(20, 14))

# First subplot: Recency vs Amount
ax1 = fig1.add_subplot(221)
sns.regplot(x='recency', y='Amount', data=customer_history_df,
            scatter_kws={'color': 'green', 's': 50},
            line_kws={'color': 'orange', 'linewidth': 2},
            ax=ax1)
ax1.set_title('Recency vs Amount')

# Second subplot: Frequency vs Amount
ax2 = fig1.add_subplot(222)
sns.regplot(x='frequency', y='Amount', data=customer_history_df,
            scatter_kws={'color': 'green', 's': 50},
            line_kws={'color': 'orange', 'linewidth': 2},
            ax=ax2)
ax2.set_title('Frequency vs Amount')

# Third subplot: Log Recency vs Log Amount
ax3 = fig1.add_subplot(223)
sns.regplot(x='recency_log', y='amount_log', data=customer_history_df,
            scatter_kws={'color': 'green', 's': 50},
            line_kws={'color': 'orange', 'linewidth': 2},
            ax=ax3)
ax3.set_title('Log Recency vs Log Amount')

# Fourth subplot: Log Frequency vs Log Amount
ax4 = fig1.add_subplot(224)
sns.regplot(x='frequency_log', y='amount_log', data=customer_history_df,
            scatter_kws={'color': 'green', 's': 50},
            line_kws={'color': 'orange', 'linewidth': 2},
            ax=ax4)
ax4.set_title('Log Frequency vs Log Amount')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This chart uses scatter plots with fitted regression lines to visualize how "Amount" (the total amount spent by a customer) relates to "Recency" and to "Frequency" (how often a customer makes a purchase), on both the original and the log scales. Scatter plots are excellent for visualizing the relationship between two continuous variables, allowing us to identify patterns, trends, and potential correlations.

This visualization helps determine if there's a correlation between purchase frequency and monetary value, which can inform customer segmentation strategies and the development of targeted marketing campaigns. For instance, it can reveal if customers who purchase more frequently also tend to spend more or if there are other patterns in customer behavior.

##### 2. What is/are the insight(s) found from the chart?

This explores the connection between how often customers shop and how much they spend. It shows that while there's a general tendency for frequent buyers to spend more overall, it's not a hard and fast rule. This means that while loyal, frequent customers are often big spenders, there are also those who spend a lot even if they don't shop as often. Similarly, some customers shop frequently but might not spend as much on each purchase. This chart highlights that there are different types of valuable customers, and understanding these nuances is essential for creating marketing strategies that cater to their individual needs and spending habits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This provides valuable insights for creating a more personalized and effective customer experience. By understanding the relationship between how often customers shop and how much they spend, the business can create specific groups of customers with similar behaviors. This allows for tailoring marketing campaigns and loyalty programs to resonate with each group's unique needs and preferences. For example, loyal customers who spend a lot might receive exclusive offers, while those who spend big but shop less often could get personalized recommendations based on their past purchases. Knowing these patterns also helps the business allocate resources wisely, ensuring that valuable customers receive special attention. However, it's important to remember that not all customers fit neatly into boxes. Overlooking certain groups or making assumptions based solely on the general trend could lead to missed opportunities and ineffective strategies.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Initialize Plotly for offline use
init_notebook_mode(connected=True)

# Create a 3D scatter plot
fig = px.scatter_3d(customer_history_df,
                     x='recency_log',
                     y='frequency_log',
                     z='amount_log',
                     color='recency_log',  # You can choose a different column for color mapping
                     title='3D Scatter Plot of Recency, Frequency, and Monetary Value',
                     labels={'recency_log': 'Recency (Log)',
                             'frequency_log': 'Frequency (Log)',
                             'amount_log': 'Monetary (Log)'},
                     opacity=0.6)

# Update layout to set the size of the plot
fig.update_layout(width=1000, height=600)

fig.show(renderer="colab")

##### 1. Why did you pick the specific chart?

3D scatter plots created using Plotly are highly interactive and informative for visualizing the relationship between three variables, here the log-transformed Recency, Frequency, and Monetary Value. Plotly allows for customization of the plot with features like hover information, zooming, and panning, which enhance the understanding of data patterns and potential correlations.

By using Plotly for this chart the business can explore the relationship between different variables in a more interactive and insightful way, enabling a deeper understanding of customer behavior and the effectiveness of segmentation strategies.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,5))
corr = df.select_dtypes(exclude='object').iloc[:, 4:].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask = mask, annot=True, cmap='flare')

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,5))
corr = customer_history_df.select_dtypes(exclude='object').iloc[:, 4:].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask = mask, annot=True, cmap='flare')

##### 1. Why did you pick the specific chart?

A correlation heatmap is a powerful tool for visualizing the correlation between multiple variables. It uses a color-coded matrix to represent the correlation coefficients, making it easy to identify strong positive or negative relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

A correlation heatmap like Chart 14 would reveal the relationships between different customer behavior metrics, likely including Recency, Frequency, and Monetary Value. Here are some possible insights you might find:

- Strong Positive Correlation: A strong positive correlation between Frequency and MonetaryValue would indicate that customers who purchase more often tend to spend more overall. This is a common pattern in customer behavior.
- Negative Correlation: A negative correlation between Recency and MonetaryValue could suggest that customers who haven't purchased recently tend to have lower overall spending. This could indicate a risk of churn.
- Weak or No Correlation: If there's a weak or no correlation between certain variables, it suggests that those variables might not be strongly related to each other.

By analyzing the correlation patterns in Chart 14, the business can:

- Identify key variables: Determine which variables are most strongly related to customer value and prioritize them for segmentation and marketing efforts.
- Develop targeted strategies: Tailor marketing campaigns and customer engagement strategies based on the identified correlations. For example, focus on increasing purchase frequency for customers with high monetary value.
- Improve predictive models: Use the correlation insights to select relevant features for predictive models like churn prediction or customer lifetime value estimation.


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df, diag_kind='kde', palette='set2')

##### 1. Why did you pick the specific chart?

A pair plot is a powerful visualization tool that displays the relationships between all pairs of variables in a dataset. It creates a matrix of scatter plots, allowing you to quickly identify patterns, correlations, and potential outliers in the data. Pair plots are particularly useful for exploratory data analysis and understanding the interactions between multiple variables.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1. **The average amount spent by customers in the United Kingdom is higher than the average amount spent by customers from other countries.**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis (H0):** There is no difference in the average amount spent by customers in the United Kingdom and customers from other countries.

- **Alternative Hypothesis (H1):** The average amount spent by customers in the United Kingdom is higher than the average amount spent by customers from other countries.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Create two groups for UK and other countries
UK = df[df.Country == 'United Kingdom'].Amount
Other_countries = df[df.Country != 'United Kingdom'].Amount

In [None]:
# Perform the t-test
from scipy import stats
t_stat, p_value = stats.ttest_ind(UK, Other_countries)  # compare two independent samples
print(f'T-statistic: {t_stat}, P-value: {p_value}')

if p_value < 0.05:
    print("- Reject the null hypothesis. \nThere is a significant difference in the average amount spent between UK and other countries.")
else:
    print("- Fail to reject the null hypothesis. \nThere is no significant difference in the average amount spent between UK and other countries.")

##### Which statistical test have you done to obtain P-Value?

t-test

##### Why did you choose the specific statistical test?

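The independent two-sample t-test estimates the true difference between two group means using the ratio of the difference in group means over the pooled standard error of both groups. Note that `stats.ttest_ind` is two-sided by default, so the p-value above tests for any difference; the one-sided alternative stated in H1 can be tested by passing `alternative='greater'`.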
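To make the "difference in means over pooled standard error" ratio concrete, here is a minimal sketch that computes the pooled t-statistic by hand. The function name `pooled_t_statistic` and the two small samples are made up for illustration and are not taken from the retail dataset.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(x1, x2):
    """Two-sample t-statistic under the equal-variance (pooled) assumption."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    # Standard error of the difference between the two means
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / se

# Hypothetical samples (illustrative only)
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [10.0, 9.0, 11.0, 8.0, 12.0]
t_demo = pooled_t_statistic(group_a, group_b)
print(t_demo)
```

This is the same statistic that `stats.ttest_ind` reports with its default `equal_var=True` setting; the p-value is then read from a t-distribution with `n1 + n2 - 2` degrees of freedom.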

### Hypothetical Statement - 2. **The average Unit Price paid by customers in the United Kingdom is higher than the average Unit Price paid by customers from other countries.**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis (H0):** There is no difference in the average Unit Price paid by customers in the United Kingdom and customers from other countries.

- **Alternative Hypothesis (H1):** The average Unit Price paid by customers in the United Kingdom is higher than the average Unit Price paid by customers from other countries.

#### 2. Perform an appropriate statistical test.

In [None]:
# Create two groups for UK and other countries
UK_unit = df[df.Country == 'United Kingdom'].UnitPrice
Other_countries_unit = df[df.Country != 'United Kingdom'].UnitPrice

In [None]:
# Perform Statistical Test to obtain P-Value
t_stat_unit, p_value_unit = stats.ttest_ind(UK_unit, Other_countries_unit)  # compare two independent samples
print(f'T-statistic: {t_stat_unit}, P-value: {p_value_unit}')

if p_value_unit < 0.05:
    print("- Reject the null hypothesis. \nThere is a significant difference in the average Unit Price between UK and other countries.")
else:
    print("- Fail to reject the null hypothesis. \nThere is no significant difference in the average Unit Price between UK and other countries.")

##### Which statistical test have you done to obtain P-Value?

t-test

##### Why did you choose the specific statistical test?

The independent two-sample t-test estimates the true difference between two group means using the ratio of the difference in group means over the pooled standard error of both groups. As called here, the test is two-sided.

### Hypothetical Statement - 3. **The average recency of customers who have purchased more than 50 different items is lower than the average recency of all customers.**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis (H0):** There is no difference between the average recency of customers who have purchased more than 50 different items and that of the other customers.

- **Alternative Hypothesis (H1):** There is a difference between the average recency of customers who have purchased more than 50 different items and that of the other customers.

#### 2. Perform an appropriate statistical test.

In [None]:
# Create two groups: customers with frequency above 50 and the remaining customers
recency_geter_50 = customer_history_df[customer_history_df.frequency > 50].recency
recency_less_50 = customer_history_df[customer_history_df.frequency <= 50].recency  # <= so that frequency == 50 is not dropped

In [None]:
# Perform Statistical Test to obtain P-Value
t_stat_rency, p_value_rency = stats.ttest_ind(recency_geter_50, recency_less_50)  # compare two independent samples
print(f'T-statistic: {t_stat_rency}, P-value: {p_value_rency}')

if p_value_rency < 0.05:
    print("- Reject the null hypothesis. \nThere is a significant difference in the average Recency between customers with frequency above 50 and the other customers.")
else:
    print("- Fail to reject the null hypothesis. \nThere is no significant difference in the average Recency between customers with frequency above 50 and the other customers.")

##### Which statistical test have you done to obtain P-Value?

t-test

##### Why did you choose the specific statistical test?

The t test estimates the true difference between two group means using the ratio of the difference in group means over the pooled standard error of both groups.

## ***6. Unsupervised ML Model Implementation***

### Clustering and Segmentation

#### k-Means Clustering

K-means clustering belongs to the partition-based (centroid-based) family of hard clustering algorithms, in which each sample in a dataset is assigned to exactly one cluster.

Using the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia. The objective of K-means clustering is therefore to minimize the total intra-cluster variance, that is, the squared error function:
![image](https://www.saedsayad.com/images/Clustering_kmeans_c.png)

The steps of the K-means algorithm for partitioning the data are as follows:
1. The algorithm starts with random point initializations of the required number of centers. The “K” in K-means stands for the number of clusters.
2. In the next step, each data point is assigned to the center closest to it. The distance metric used in K-means clustering is the ordinary Euclidean distance.
3. Once the data points are assigned, the centers are recalculated by averaging the dimensions of the points belonging to the cluster.
4. The process is repeated with new centers until we reach a point where the assignments become stable. In this case, the algorithm terminates.
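The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation used in this notebook; the function name `kmeans_sketch` and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means iterations: assign, re-average, repeat until stable."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    # Within-cluster sum of squared errors (the inertia being minimized)
    sse = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, sse

# Toy data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centers_demo, sse_demo = kmeans_sketch(X_demo, k=2)
print(centers_demo, sse_demo)
```

On data this cleanly separated, the loop usually converges within a handful of iterations; on real data the result depends on the initial centers, which is the motivation for k-means++.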

##### K-means++
- Place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than the classic k-means.
- To use k-means++ with scikit-learn's KMeans object, we just need to set the init parameter to k-means++ (the default setting) instead of random.
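As a rough illustration of how k-means++ spreads out the initial centroids, the sketch below picks each new center with probability proportional to its squared distance from the nearest center already chosen. It is a simplified version of the idea (scikit-learn's own seeding is more elaborate), and `kmeans_pp_init` is a hypothetical helper name.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers with squared-distance-weighted sampling."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # First center: one data point chosen uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        chosen = np.array(centers)
        # Squared distance from every point to its nearest chosen center
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Points far from all current centers are proportionally more likely to be picked
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# Toy data with two duplicated locations, so the second center must come from the other location
X_demo = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]])
init_demo = kmeans_pp_init(X_demo, k=2)
print(init_demo)
```

Because points that coincide with an existing center have zero selection probability, the two initial centers in this toy example always land on different locations.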

#### The Elbow Method
  
The elbow method is used to find the optimal number of clusters. The idea behind it is to identify the value of k beyond which the distortion stops decreasing rapidly. As k increases, the distortion decreases, because the samples will be closer to the centroids they are assigned to.

This method looks at the percentage of variance explained as a function of the number of clusters. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance, a quantity related to the F-test. A slight variation of this method plots the curvature of the within-group variance.
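One simple way to turn the "marginal gain drops" idea into a rule is to stop at the last k whose step still reduced the inertia by at least a chosen fraction. The sketch below uses a hypothetical helper `suggest_k` and made-up inertia values; the 10% cutoff is an assumption that matches the `corte` value used in this notebook.

```python
def suggest_k(inertias, cutoff=0.10):
    """Return the last k before the relative drop in inertia falls below `cutoff`.

    `inertias[i]` is assumed to hold the inertia for k = i + 1.
    """
    for i in range(1, len(inertias)):
        relative_gain = (inertias[i - 1] - inertias[i]) / inertias[i - 1]
        if relative_gain < cutoff:
            return i  # moving from k = i to k = i + 1 gained too little
    return len(inertias)  # cutoff never reached: fall back to the largest k tried

# Made-up inertia values for k = 1..6 (illustrative only, not from the dataset)
inertias_demo = [1000.0, 600.0, 400.0, 370.0, 350.0, 340.0]
k_demo = suggest_k(inertias_demo)
print(k_demo)
```

With these values, going from k = 3 to k = 4 reduces the inertia by only 7.5%, so the rule suggests k = 3. The cutoff is a judgment call, and the suggested k should still be checked against the plot.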

In [None]:
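cl = 50       # maximum number of clusters to try
corte = 0.1   # cutoff: minimum relative drop in inertia (10%) to keep adding clusters

anterior = 100000000000000  # inertia of the previous k (large sentinel for k = 1)
cost = []                   # inertia recorded for each k
K_best = cl                 # suggested k (stays at cl if the cutoff is never reached)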

for k in range(1, cl + 1):
    model = KMeans(
        n_clusters=k,
        init='k-means++',
        n_init=10,
        max_iter=300,
        tol=1e-04,
        random_state=101
    )

    model = model.fit(X_scaled)
    labels = model.labels_
    inertia = model.inertia_

    if (K_best == cl) and (((anterior - inertia) / anterior) < corte):
        K_best = k - 1

    cost.append(inertia)
    anterior = inertia

# Inertia plot
plt.figure(figsize=(20, 10))
plt.plot(range(1, cl + 1), cost, marker='o', color='red', linestyle='-')
plt.title('KMeans Inertia vs. Number of Clusters', fontsize=20, fontweight='bold')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.grid(True)
plt.xticks(range(1, cl + 1))
plt.show()

In [None]:
# Create a KMeans model with the best K
print('The best K suggested: ', K_best)
model = KMeans(n_clusters=K_best, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=101)
model = model.fit(X_scaled)
labels = model.labels_

# Visualization of clusters
fig = plt.figure(figsize=(20, 10))

# First subplot
ax1 = fig.add_subplot(121)
scatter1 = ax1.scatter(X_scaled[:, 1], X_scaled[:, 0], c=labels, cmap='viridis', s=50, alpha=0.6)
ax1.set_xlabel(feature_vector[1])
ax1.set_ylabel(feature_vector[0])
ax1.set_title('Clusters Visualization (Feature 1 vs Feature 0)', fontsize=20, fontweight='bold')
plt.colorbar(scatter1, ax=ax1, label='Cluster Label')

# Second subplot
ax2 = fig.add_subplot(122)
scatter2 = ax2.scatter(X_scaled[:, 2], X_scaled[:, 0], c=labels, cmap='viridis', s=50, alpha=0.6)
ax2.set_xlabel(feature_vector[2])
ax2.set_ylabel(feature_vector[0])
ax2.set_title('Clusters Visualization (Feature 2 vs Feature 0)', fontsize=20, fontweight='bold')
plt.colorbar(scatter2, ax=ax2, label='Cluster Label')

plt.tight_layout()
plt.show()

Note that, by the Elbow method, the gain in the decay of the distortion is already low from K equal to 3 onward as K increases, reaching the 10% limit at K equal to 7. With this in mind, we will evaluate the options 3, 5, and 7 more deeply, starting with the silhouette analysis.

#### Silhouette analysis on K-Means clustering

Silhouette analysis can be used to study the separation distance between the resulting clusters. It quantifies the quality of a clustering through a graphical tool that plots how tightly grouped the samples in each cluster are. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to assess parameters such as the number of clusters visually.

It can also be applied to clustering algorithms other than k-means.

Silhouette coefficients have a range of \[-1, 1\] and are calculated as follows:
1. Calculate the cluster cohesion a(i) as the average distance between a sample x(i) and all other points in the same cluster.
2. Calculate the cluster separation b(i) from the next closest cluster as the average distance between the sample x(i) and all samples in the nearest cluster.
3. Calculate the silhouette s(i) as the difference between cluster separation and cohesion divided by the greater of the two, as shown here:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/3d80ab22fb291b347b2d9dc3cc7cd614f6b15479)
Which can be also written as:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/ab5579a6c7150579af8a0d432b6630ba529376f0)

Where:
- A value near +1 indicates that the sample is far away from the neighboring clusters.
- A high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
- If most objects have a high value, then the clustering configuration is appropriate.
- If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
- A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters.
- Negative values indicate that those samples might have been assigned to the wrong cluster.
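The three calculation steps can be written out directly in NumPy. The sketch below is for illustration only (the notebook itself relies on scikit-learn's `silhouette_samples`); the helper name `silhouette_values` and the four toy points are assumptions.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the sample itself
        if not same.any():
            continue                          # singleton cluster: s(i) is left at 0 here
        a = dist[i, same].mean()              # cohesion: mean distance within own cluster
        b = min(dist[i, labels == c].mean()   # separation: mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, far-apart pairs of points (illustrative only)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
s_demo = silhouette_values(X_demo, [0, 0, 1, 1])
print(s_demo)
```

Because each pair is tight and far from the other, every coefficient here is close to +1, which is the "well matched to its own cluster" case in the list above.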

The silhouette plot can reveal a bad choice of K for the given data through the presence of clusters with below-average silhouette scores and through wide fluctuations in the size of the silhouette plots. A good choice of K is found when all the plots are more or less of similar thickness, and hence of similar sizes.

Keep in mind, however, that in several scenarios we may have to set aside the mathematical explanation given by the algorithm and look at the business relevance of the results obtained.

Let's see below how our data performs for each number of clusters (3, 5 and 7) in terms of the silhouette score of each cluster, along with the center of each cluster shown in the scatter plots of amount_log vs recency_log and vs frequency_log.

In [None]:
# Set the seaborn style
sns.set(style="whitegrid")

cluster_centers = dict()

# Choose a color palette
palette = sns.color_palette("husl", 10)  # A color palette with 10 distinct colors

for n_clusters in range(3, K_best + 1, 2):
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(25, 7))

    # Set limits for silhouette plot
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])

    # Fit KMeans
    clusterer = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, max_iter=300, tol=1e-04, random_state=101)
    cluster_labels = clusterer.fit_predict(X_scaled)

    # Calculate silhouette score
    silhouette_avg = silhouette_score(X_scaled, cluster_labels)
    cluster_centers[n_clusters] = {
        'cluster_center': clusterer.cluster_centers_,
        'silhouette_score': silhouette_avg,
        'labels': cluster_labels
    }

    # Silhouette values
    sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = palette[i % len(palette)]  # Use the color palette
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i), fontsize=12, color='black')
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("Silhouette Plot for {} Clusters".format(n_clusters), fontsize=16)
    ax1.set_xlabel("Silhouette Coefficient Values", fontsize=14)
    ax1.set_ylabel("Cluster Label", fontsize=14)
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--", label='Average Silhouette Score')
    ax1.set_yticks([])
    ax1.set_xticks(np.arange(-0.1, 1.1, 0.1))
    ax1.grid(True)

    # Scatter plot for first two features
    centers = clusterer.cluster_centers_
    colors = [palette[label % len(palette)] for label in cluster_labels]  # Use the color palette

    # First scatter plot
    x, y = 1, 0
    ax2.scatter(X_scaled[:, x], X_scaled[:, y], marker='o', s=50, lw=0, alpha=0.6, c=colors, edgecolor='k')
    ax2.scatter(centers[:, x], centers[:, y], marker='o', c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[x], c[y], marker='$%d$' % i, alpha=1, s=100, edgecolor='k', color='black')
    ax2.set_title("{} Clustered Data (Feature 1 vs Feature 0)".format(n_clusters), fontsize=16)
    ax2.set_xlabel(feature_vector[x], fontsize=14)
    ax2.set_ylabel(feature_vector[y], fontsize=14)
    ax2.grid(True)

    # Second scatter plot
    x = 2
    ax3.scatter(X_scaled[:, x], X_scaled[:, y], marker='o', s=50, lw=0, alpha=0.6, c=colors, edgecolor='k')
    ax3.scatter(centers[:, x], centers[:, y], marker='o', c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax3.scatter(c[x], c[y], marker='$%d$' % i, alpha=1, s=100, edgecolor='k', color='black')
    ax3.set_title("Silhouette Score: {:1.2f}".format(silhouette_avg), fontsize=16)
    ax3.set_xlabel(feature_vector[x], fontsize=14)
    ax3.set_ylabel(feature_vector[y], fontsize=14)
    ax3.grid(True)

    plt.suptitle("Silhouette Analysis for KMeans Clustering (n_clusters = {})".format(n_clusters), fontsize=18, fontweight='bold')
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust layout to make room for the title
    plt.show()

When we look at the results of the clustering process, we can infer some interesting insights:

- First, notice that all the K options are valid, because none of them contains clusters with below-average silhouette scores.
- On the other hand, all options show some wide fluctuations in the size of the silhouette plots.

So, the best choice may be the option that gives us a simpler business explanation and at the same time targets customers in focus groups with sizes closer to the desired ones.

#### Clusters Center:
Let's look at the cluster center values after converting them back from the log-transformed, scaled version to the original units.

In [None]:
features = ['Amount',  'recency',  'frequency']
for i in range(3,K_best+1,2):
    print("for {} clusters the silhouette score is {:1.2f}".format(i, cluster_centers[i]['silhouette_score']))
    print("Centers of each cluster:")
    cent_transformed = scaler.inverse_transform(cluster_centers[i]['cluster_center'])
    print(pd.DataFrame(np.exp(cent_transformed),columns=features))
    print('-'*50)

#### Clusters Insights:

With the plots and the centers in the correct units, let's look at some insights for each cluster configuration:

***In the three-cluster:***
- The three clusters appear to show stark differences in the Monetary value of the customers; we will confirm this with a box plot.
- Cluster 1 is the cluster of high-value customers who shop frequently, and is certainly an important segment for any business.
- Similarly, we obtain customer groups with low and medium spend in the clusters labeled 0 and 2, respectively.
- Frequency and Recency follow the Monetary value closely, based on the trend (High Monetary, Low Recency, High Frequency).

***In the five-cluster:***
- Note that clusters 0 and 1 are very similar to their counterparts in the configuration with only 3 clusters.
- Cluster 1 more clearly captures those who shop often and spend large amounts.
- Cluster 2 contains those who spend a decent amount but are not as frequent as cluster 1.
- Cluster 4 purchases medium amounts, with a relatively low frequency and not very recently.
- Cluster 3 makes low-cost purchases, with a relatively low frequency (but above 1), and made its last purchase more recently. This group of customers probably responds to price discounts and could be targeted with loyalty promotions to try to increase the average ticket, a strategy that can be better defined when we analyze the market basket.
- The silhouette score indicates that the five-cluster segmentation is less optimal than the three-cluster segmentation.

***In the Seven-cluster:***
- Cluster 6 clearly defines those who shop often and spend large amounts.
- Clusters 1 and 5 show good spending and good frequency, differing only in how recent their last purchases were: cluster 5's are older. This suggests an active campaign to sell to group 5 as soon as possible, and another for group 1 aimed at raising its frequency.
- Cluster 0 has the fourth-highest purchase amount and a reasonable frequency, but has gone a long time without buying. This group should be sensitive to promotions and activations, so that these customers are not lost and go on to make their next purchase.
- Cluster 5 is similar to 0, but has made its purchases more recently and has a slightly better periodicity. Actions should therefore be taken to raise its frequency and reduce the chance of these customers migrating to cluster 0 by going longer without purchasing.

#### Drill Down Clusters:

To drill down further and assess the quality of these differences, we can label our data with the corresponding cluster label and then visualize the differences. The following code extracts the cluster labels and attaches them to our customer summary dataset.

In [None]:
# Attach the cluster labels of each configuration to customer_history_df (assumed to be already defined)
customer_history_df['clusters_3'] = cluster_centers[3]['labels']
customer_history_df['clusters_5'] = cluster_centers[5]['labels']
customer_history_df['clusters_7'] = cluster_centers[7]['labels']
display(customer_history_df.head())

# Set the color palette
palette = sns.color_palette("pastel")

# Create a figure with horizontal stacking
fig, axs = plt.subplots(1, 3, figsize=(20, 7))

# Plot for 3 clusters
market_3 = customer_history_df.clusters_3.value_counts()
axs[0].pie(market_3, labels=market_3.index, autopct='%1.1f%%', shadow=True, startangle=90, colors=palette, explode=[0.1]*len(market_3))
axs[0].set_title('3 Clusters', fontsize=18, fontweight='bold')

# Plot for 5 clusters
market_5 = customer_history_df.clusters_5.value_counts()
axs[1].pie(market_5, labels=market_5.index, autopct='%1.1f%%', shadow=True, startangle=90, colors=palette, explode=[0.1]*len(market_5))
axs[1].set_title('5 Clusters', fontsize=18, fontweight='bold')

# Plot for 7 clusters
market_7 = customer_history_df.clusters_7.value_counts()
axs[2].pie(market_7, labels=market_7.index, autopct='%1.1f%%', shadow=True, startangle=90, colors=palette, explode=[0.1]*len(market_7))
axs[2].set_title('7 Clusters', fontsize=18, fontweight='bold')

# Adjust layout
plt.tight_layout()
plt.show()

Once we have the labels assigned to each of the customers, our task is simple. We now want to find out how the customer summary varies across the groups. If we can visualize that information, we will be able to see the differences between the clusters of customers and adjust our strategy on the basis of those differences.

The following code uses Plotly to take the cluster labels we obtained for each cluster configuration and create boxplots. Plotly lets us interact with the plots to see the central tendency values in each boxplot in the notebook. Note that we want to avoid the extremely high outlier values of each group, as they would interfere with making a good observation about the central tendencies of each cluster. Since we have only positive values, we restrict the data so that only data points below the 95th percentile of the cluster are used. This gives us good information about the majority of the users in that cluster segment.

I've used these charts to review my previously stated insights, and you can explore them in the same way:
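As a small illustration of the percentile cutoff described above, the sketch below trims one hypothetical cluster's values at its 95th percentile. The numbers are invented for the example and do not come from the dataset.

```python
import numpy as np

# Hypothetical spend values for one cluster, including one extreme outlier
values = np.array([10, 12, 15, 18, 20, 22, 25, 30, 35, 5000], dtype=float)

cutoff = np.percentile(values, 95)   # value below which 95% of the data falls
trimmed = values[values < cutoff]    # keep only the points under the cutoff

print(len(values), len(trimmed), trimmed.max())
```

Only the single extreme value is dropped, so the boxplot of the trimmed data reflects where most customers in the cluster actually sit.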

In [None]:
x_data = ['Cluster 0', 'Cluster 1','Cluster 2','Cluster 3','Cluster 4', 'Cluster 5', 'Cluster 6']
colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)',
          'rgba(22, 80, 57, 0.5)', 'rgba(127, 65, 14, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']
cutoff_quantile = 95

for n_clusters in range(3,K_best+1,2):
    cl = 'clusters_' + str(n_clusters)
    for fild in range(0, 3):
        field_to_plot = features[fild]
        y_data = list()
        ymax = 0
        for i in np.arange(0,n_clusters):
            y0 = customer_history_df[customer_history_df[cl]==i][field_to_plot].values
            y0 = y0[y0<np.percentile(y0, cutoff_quantile)]
            if ymax < max(y0): ymax = max(y0)
            y_data.insert(i, y0)

        traces = []

        for xd, yd, cls in zip(x_data[:n_clusters], y_data, colors[:n_clusters]):
                traces.append(go.Box(y=yd, name=xd, boxpoints=False, jitter=0.5, whiskerwidth=0.2, fillcolor=cls,
                    marker=dict( size=1, ),
                    line=dict(width=1),
                ))

        layout = go.Layout(
            title='Difference in {} with {} Clusters and {:1.2f} Score'.\
            format(field_to_plot, n_clusters, cluster_centers[n_clusters]['silhouette_score']),
            yaxis=dict( autorange=True, showgrid=True, zeroline=True,
                dtick = int(ymax/10),
                gridcolor='black', gridwidth=0.1, zerolinecolor='rgb(255, 255, 255)', zerolinewidth=2, ),
            margin=dict(l=40, r=30, b=50, t=50, ),
            paper_bgcolor='white',
            plot_bgcolor='white',
            showlegend=False
        )

        fig = go.Figure(data=traces, layout=layout)
        fig.show(renderer="colab")

### Next Steps in the Segmentation:

To enhance discovery and further improve the quality of the clustering, relevant features such as other customer information and purchase details may be added to this dataset.

For example, but not limited to:
- New indicators, such as the length of the customer relationship, based on the date of the customer's first purchase
- Whether the customer is from abroad or not
- A product group or category derived from the SKUs
- Data obtained from external data vendors, and so on.

Another dimension to explore is trying out different algorithms for performing the segmentation, for instance hierarchical clustering. A good segmentation process will encompass all these avenues to arrive at optimal segments that provide valuable insight.

## Cross Selling

Cross selling is the ability to sell more products to a customer by analyzing the customer's shopping trends as well as general shopping trends and patterns that have something in common with the customer's shopping patterns. More often than not, these recommended products are very appealing. The retailer will often offer a bundle of products with some attractive offer, and it is highly likely that the customer will end up buying the bundled products instead of just the original item.

So, we research the customer transactions, find potential additions to the customer's original needs, and offer them to the customer as a suggestion, in the hope that they buy them, which benefits both the customer and the retail establishment.

In this section, we explore association rule-mining, a powerful technique that can be used for cross selling, then we apply the concept of market basket analysis to our retail transactions dataset.

### Market Basket Analysis with Association Rule-Mining
![image](https://fiverr-res.cloudinary.com/images/t_main1,q_auto,f_auto/gigs/117717078/original/7cd5947def9d84483612ef9b19829323850ac9af/bulid-a-product-recommender-system-for-ur-site.jpg)
The whole concept of association rule-mining is based on the concept that customer purchase behavior has a pattern which can be exploited for selling more items to the customer in the future.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. This rule-based approach also generates new rules as it analyzes more data. The ultimate goal, assuming a large enough dataset, is to help a machine mimic the human brain's feature extraction and abstract association capabilities from new uncategorized data.

An association rule usually has the structure like below:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/3141048979b977982202dbf7a80596f8a6b1177e)
This rule can be read in the obvious manner: when a customer has bought the items on the left of the rule, they are likely to buy the item on the right. The following are some vital concepts pertaining to association rule-mining.
- **Itemset**: A collection of one or more items that occur together in a transaction. For example, {milk, bread} is an itemset.
- **Support**: Indicates how frequently an itemset appears in the dataset. The support of ***X*** with respect to ***T*** is defined as the proportion of transactions ***t*** in the dataset which contain the itemset ***X***. Mathematically it is defined as:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/1c6acacd3b17051205704b5d323c83fc737e5db1)
- **Confidence**: Confidence is an indication of how often the rule has been found to be true, that is, the proportion of transactions containing the antecedent that also contain the consequent. For a rule which states { beer -> diaper } the confidence is mathematically defined as:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/90324dedc399441696116eed3658fd17c5da4329)
- **Lift**: Lift of the rule is defined as the ratio of observed support to the support expected in the case the elements of the rule were independent. For the previous set of transactions if the rule is defined as { X -> Y }, then the lift of the rule is defined as:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/c392e3111167b60687405dfdc7ed55f22409f4c5)
    - If the rule had a lift of 1, it would imply that the occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent, no rule can be drawn involving them.
    - If the lift is > 1, it tells us the degree to which the two occurrences depend on one another, which makes the rule potentially useful for predicting the consequent in future datasets.
    - If the lift is < 1, it tells us the items are substitutes for each other: the presence of one item has a negative effect on the presence of the other, and vice versa.
- **Frequent itemset**: An itemset whose support is greater than a user-defined support threshold.
- **Conviction**: The ratio of the expected frequency with which X occurs without Y (that is, the frequency with which the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. The conviction of a rule is defined as:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/4c2228820d6a8cb5a84bd059d53764a6b9280386)
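
To make the measures above concrete, here is a minimal sketch that computes them on a tiny, hypothetical basket dataset. The transactions and the helper names (`support`, `confidence`, `lift`, `conviction`) are illustrative assumptions and are not part of the retail data or of any library used in this notebook.

```python
# Toy transactions (hypothetical data, not taken from the retail dataset)
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(X and Y together) / supp(X)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """supp(X and Y together) / (supp(X) * supp(Y))."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / (support(antecedent, transactions) * support(consequent, transactions))

def conviction(antecedent, consequent, transactions):
    """(1 - supp(Y)) / (1 - conf(X -> Y)); infinite when the rule never fails."""
    conf = confidence(antecedent, consequent, transactions)
    if conf == 1:
        return float("inf")
    return (1 - support(consequent, transactions)) / (1 - conf)

X, Y = {"diaper"}, {"beer"}
print("support   :", support(X | Y, transactions))
print("confidence:", confidence(X, Y, transactions))
print("lift      :", lift(X, Y, transactions))
print("conviction:", conviction(X, Y, transactions))
```

In this toy data the rule { diaper -> beer } holds in 3 of 5 transactions and in 3 of the 4 transactions that contain diaper, so its lift is above 1, which is the kind of rule we look for later.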

#### Algorithms:

Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they mine frequent itemsets. A further step is needed afterwards to generate rules from the frequent itemsets found in a database.

The major bottleneck in any association rule-mining algorithm is the generation of frequent itemsets. If the transaction dataset has k unique products, then we potentially have 2<sup>k</sup> possible itemsets.
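
The growth in the number of itemsets can be seen with a short enumeration. This is only an illustrative sketch: the product names and the helper `all_itemsets` are hypothetical, and the count shown excludes the empty itemset, giving 2<sup>k</sup> - 1.

```python
from itertools import combinations

def all_itemsets(items):
    """Enumerate every non-empty itemset over `items` (2**k - 1 of them)."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

products = ["bread", "milk", "diaper", "beer"]  # hypothetical products
print(len(list(all_itemsets(products))), "non-empty itemsets for 4 products")

# The count doubles with every extra product, so brute-force
# enumeration stops being feasible very quickly.
for k in (10, 15, 100):
    print(k, "products ->", 2 ** k, "possible itemsets")
```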

##### Apriori
Apriori uses a breadth-first search strategy to count the support of itemsets and a candidate generation function that exploits the downward closure property of support (every subset of a frequent itemset is itself frequent). The algorithm therefore first generates candidate itemsets and then proceeds to find the frequent ones among them. For around 100 unique products the number of possible itemsets is huge, which makes the Apriori algorithm prohibitively expensive computationally.
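
The level-wise search can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used later in this notebook: the function name `apriori`, the toy transactions, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # Count each candidate against every transaction and keep the frequent ones
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Level 1: frequent single items
    singles = {frozenset([item]) for t in transactions for item in t}
    current = frequent_among(singles)
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent_sets = apriori(transactions, min_support=0.5)
for itemset, supp in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

The prune step is where downward closure pays off: a candidate such as {bread, diaper, beer} is discarded without being counted, because its subset {bread, beer} is already known to be infrequent.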

##### Eclat algorithm
Eclat is a depth-first search algorithm based on set intersection. It is suitable for both sequential as well as parallel execution with locality-enhancing properties.
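
As a rough sketch of the set-intersection idea, the code below stores, for each item, the set of transaction ids that contain it, and extends itemsets depth-first by intersecting those sets. The function name `eclat`, the `min_count` parameter, and the toy baskets are hypothetical and only serve the illustration.

```python
def eclat(transactions, min_count):
    """Return {itemset: support count} using depth-first tidset intersection."""
    # Vertical layout: item -> set of ids of the transactions containing it
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tidset) pairs that can extend `prefix`;
        # each tidset is already restricted to transactions containing `prefix`
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Intersect with the remaining items to go one level deeper
            deeper = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]
                      if len(tids & other_tids) >= min_count]
            if deeper:
                extend(itemset, deeper)

    start = sorted(((item, tids) for item, tids in tidsets.items()
                    if len(tids) >= min_count), key=lambda pair: pair[0])
    extend(frozenset(), start)
    return frequent

# Hypothetical toy baskets
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
eclat_sets = eclat(transactions, min_count=3)
for itemset, count in eclat_sets.items():
    print(sorted(itemset), count)
```

Because the support of an itemset is just the size of an intersection, no pass over the full transaction list is needed once the vertical layout has been built.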

##### FP Growth
FP stands for frequent pattern. The FP growth algorithm is superior to the Apriori algorithm as it doesn't need to generate all the candidate itemsets. It uses a divide-and-conquer strategy and leverages a special data structure called the FP-tree to find frequent itemsets without generating all itemsets. The core steps of the algorithm are as follows:
1. In the first pass, the algorithm takes in the transactional database, counts the occurrences of items (attribute-value pairs) in the dataset, and stores them in a 'header table'.
2. In the second pass, it builds the FP-tree structure by inserting instances to represent frequent itemsets. Items in each instance have to be sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly. Items in each instance that do not meet the minimum coverage threshold are discarded. If many instances share their most frequent items, the FP-tree provides high compression close to the tree root.
3. Divide this compressed representation into multiple conditional datasets such that each one is associated with a frequent pattern.
4. Mine for patterns in each such dataset so that shorter patterns can be recursively concatenated to longer patterns, hence making it more efficient.

Recursive processing of this compressed version of the main dataset grows large itemsets directly, instead of generating candidate items and testing them against the entire database. Growth starts from the bottom of the header table (which has the longest branches) by finding all instances that match a given condition. A new tree is created, with counts projected from the original tree corresponding to the set of instances that are conditional on the attribute, and with each node getting the sum of its children's counts. Recursive growth ends when no individual items conditional on the attribute meet the minimum support threshold, and processing continues on the remaining header items of the original FP-tree.

Once the recursive process has completed, all large itemsets with minimum coverage have been found, and association rule creation begins.
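
The two passes and the recursive mining described above can be sketched compactly in plain Python. This is a simplified illustration, not the Orange implementation used below: the names `FPNode`, `build_tree` and `fp_growth`, the `(items, weight)` input format, and the toy baskets are assumptions made for the example.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    # Pass 1: count item occurrences (the header table counts)
    counts = Counter()
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Pass 2: insert each transaction, items sorted by descending frequency
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> tree nodes holding that item
    for items, weight in weighted_transactions:
        kept = sorted((i for i in set(items) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in kept:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {itemset: support count} by recursively mining conditional trees."""
    header, frequent = build_tree(weighted_transactions, min_count)
    result = {}
    for item, count in frequent.items():
        itemset = suffix | {item}
        result[itemset] = count
        # Conditional pattern base: the prefix paths that lead to `item`
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        result.update(fp_growth(conditional, min_count, itemset))
    return result

# Hypothetical toy baskets, each with weight 1
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
fp_sets = fp_growth([(t, 1) for t in transactions], min_count=3)
for itemset, count in fp_sets.items():
    print(sorted(itemset), count)
```

Each recursive call only sees the prefix paths of one item, so longer patterns are grown from shorter ones without ever generating and testing candidates against the full dataset.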

#### Build Transaction Dataset
To run these algorithms on our data, we must first turn it into a sales event table in which each product is represented by a column whose value is 1 when the product was sold in that event and 0 otherwise. This generates a sparse table.

In [None]:
# Every unique product description becomes one column of the transaction table
items = list(df.Description.unique())
# Collect the descriptions sold under each invoice
grouped = df.groupby('InvoiceNo')
transaction_level = grouped.aggregate(lambda x: tuple(x)).reset_index()[['InvoiceNo','Description']]
temp = dict()
for rec in transaction_level.to_dict('records'):
    invoice_num = rec['InvoiceNo']
    items_sold = set(rec['Description'])
    # 1 if the item was sold in this invoice, 0 otherwise
    temp[invoice_num] = {item: int(item in items_sold) for item in items}

# One row per invoice, one column per product
transaction_df = pd.DataFrame(list(temp.values()))
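
To see the shape of the resulting table, here is a small library-free sketch on a few made-up invoice lines. The invoice numbers and product names are hypothetical and are not taken from the retail dataset.

```python
from collections import defaultdict

# Hypothetical invoice lines: (InvoiceNo, Description)
rows = [
    ("A1", "MUG"), ("A1", "CANDLE"),
    ("A2", "MUG"),
    ("A3", "BAG"), ("A3", "CANDLE"), ("A3", "MUG"),
]

# Every distinct product becomes one column
columns = sorted({desc for _, desc in rows})

# Group the products sold under each invoice
baskets = defaultdict(set)
for invoice, desc in rows:
    baskets[invoice].add(desc)

# One row per invoice: 1 if the product was sold in that invoice, else 0
table = {inv: [int(col in sold) for col in columns] for inv, sold in baskets.items()}

print(columns)
for inv, flags in table.items():
    print(inv, flags)
```

With thousands of products and only a handful of them in each invoice, almost every cell of such a table is 0, which is why the table is sparse.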

#### Prune Dataset for frequently purchased items
We saw earlier in the EDA that only a handful of items are responsible for the bulk of our sales, so we want to prune our dataset to reflect this. For this we have created the function prune_dataset below, which reduces the size of our dataset based on our requirements. The function can perform three types of pruning:
- Pruning based on percentage of total sales: The parameter total_sales_perc will help us select the number of items that will explain the required percentage of sales. The default value is 50% or 0.5.
- Pruning based on ranks of items: Another way to perform the pruning is to specify the starting and the ending rank of the items for which we want to prune our dataset.
- Pruning based on list of features passed to the parameter TopCols.

By default, we will only look for transactions which have at least two items, as transactions with only one item are counter to the whole concept of association rule-mining.

In [None]:
def prune_dataset(input_df, length_trans = 2, total_sales_perc = 0.5,
                  start_item = None, end_item = None, TopCols = None):
    if 'total_items' in input_df.columns:
        del(input_df['total_items'])
    item_count = input_df.sum().sort_values(ascending = False).reset_index()
    total_items = sum(input_df.sum().sort_values(ascending = False))
    item_count.rename(columns={item_count.columns[0]:'item_name',
                               item_count.columns[1]:'item_count'}, inplace=True)
    if TopCols:
        input_df['total_items'] = input_df[TopCols].sum(axis = 1)
        input_df = input_df[input_df.total_items >= length_trans]
        del(input_df['total_items'])
        return input_df[TopCols], item_count[item_count.item_name.isin(TopCols)]
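    # Guard against the default None values, which cannot be compared with ">"
    elif start_item is not None and end_item is not None and end_item > start_item: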
        selected_items = list(item_count[start_item:end_item].item_name)
        input_df['total_items'] = input_df[selected_items].sum(axis = 1)
        input_df = input_df[input_df.total_items >= length_trans]
        del(input_df['total_items'])
        return input_df[selected_items],item_count[start_item:end_item]
    else:
        item_count['item_perc'] = item_count['item_count']/total_items
        item_count['total_perc'] = item_count.item_perc.cumsum()
        selected_items = list(item_count[item_count.total_perc < total_sales_perc].item_name)
        input_df['total_items'] = input_df[selected_items].sum(axis = 1)
        input_df = input_df[input_df.total_items >= length_trans]
        del(input_df['total_items'])
        return input_df[selected_items], item_count[item_count.total_perc < total_sales_perc]
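
The percentage-based branch can be illustrated without pandas. The sketch below is an assumption-laden simplification: the helper names `select_by_cumulative_share` and `keep_baskets` and the toy counts are hypothetical, and the share is computed over item occurrences, as in the function above.

```python
def select_by_cumulative_share(item_counts, share=0.5):
    """Most frequent items whose running share of all occurrences stays below `share`."""
    total = sum(item_counts.values())
    selected, running = [], 0.0
    for item, count in sorted(item_counts.items(), key=lambda kv: -kv[1]):
        running += count / total
        if running >= share:
            break
        selected.append(item)
    return selected

def keep_baskets(baskets, selected, min_items=2):
    """Restrict baskets to the selected items and drop those left with too few items."""
    selected = set(selected)
    return [b & selected for b in baskets if len(b & selected) >= min_items]

# Hypothetical occurrence counts per product
item_counts = {"MUG": 40, "CANDLE": 30, "BAG": 20, "CARD": 10}
top_items = select_by_cumulative_share(item_counts, share=0.8)
print(top_items)

# Hypothetical baskets; only those with at least two selected items survive
baskets = [{"MUG", "CANDLE"}, {"MUG", "CARD"}, {"BAG"}, {"MUG", "CANDLE", "BAG"}]
print(keep_baskets(baskets, top_items))
```

Dropping the baskets that are left with fewer than two selected items matches the `length_trans` default of 2 described above.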

In [None]:
output_df, item_counts = prune_dataset(input_df=transaction_df, length_trans=2,start_item = 0, end_item = 15)
print('Total of Sales Amount by the Top 15 Products in Sales Events (Invoice): {:.2f}'.format(AmoutSum[Top15ev].sum()))
print('Number of Sales Events:', output_df.shape[0])
print('Number of Products:', output_df.shape[1])

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a horizontal bar plot of the item counts
plt.figure(figsize=(12, 8))
sns.barplot(x='item_count', y='item_name', data=item_counts, palette='gnuplot2')

# Add titles and labels
plt.title('Item Count Distribution', fontsize=16)
plt.xlabel('Count', fontsize=14)
plt.ylabel('Item Name', fontsize=14)

# Show the plot
plt.show()

So we find that 15 items are responsible for 8.73% of the sales amount, and that close to 5% of the sales events, 4,664 transactions in total, contain those items along with other items. The next step is to convert this selected data into the required table data structure.

#### Association Rule Mining with FP Growth

##### Orange Table Data Structure
Since we are using the Orange framework, we still have to convert our data to its Table data structure by providing metadata about our columns. We need to define the domain for each of our variables, meaning the set of possible values that each variable can take. This information is stored as metadata and is used in later transformations of the data. As our columns hold only binary values, we can easily create the domain from this information.

In [None]:
input_assoc_rules = output_df
# Defined the data domain by specifying each variable as a DiscreteVariable having values as (0, 1)
domain_transac = Domain([DiscreteVariable.make(name=item,values=['0', '1']) \
                         for item in input_assoc_rules.columns])

# Then using this domain, we created our Table structure for our data
# Use .values instead of .as_matrix()
data_tran = Orange.data.Table.from_numpy(domain=domain_transac,
                                         X=input_assoc_rules.values, Y= None)

# Coding our input so that the entire domain is represented as binary variables
data_tran_en, mapping = OneHot.encode(data_tran, include_class=True)

In [None]:
support = 0.01
print("num of required transactions = ", int(input_assoc_rules.shape[0]*support))
num_trans = input_assoc_rules.shape[0]*support
itemsets = dict(frequent_itemsets(data_tran_en, support))
print('Items Set Size:', len(itemsets))

So we get a whopping 663,273 itemsets for only 15 items and a support of only 1%! This number grows rapidly if we decrease the support or increase the number of items in our dataset. The next step is to specify a confidence value and generate our rules. The following code snippet performs rule generation and decoding of rules, and then compiles everything into a neat dataframe that we can use for further analysis.
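
A quick back-of-the-envelope check helps put that number in context. It assumes, as the `values=['0', '1']` domain suggests, that the one-hot encoding yields one encoded item per product and value, so that 15 products become 30 encoded items. The variable names are hypothetical.

```python
n_products = 15
n_encoded_items = n_products * 2  # assumption: one encoded item per (product, value) pair

# Naive upper bound: any non-empty subset of the 30 encoded items
upper_bound_any = 2 ** n_encoded_items - 1

# A transaction holds exactly one value per product, so an itemset can only have
# non-zero support if each product is absent from it, present as "=0", or present as "=1"
upper_bound_consistent = 3 ** n_products - 1

print("encoded items:", n_encoded_items)
print("any subset of encoded items:", upper_bound_any)
print("at most one value per product:", upper_bound_consistent)
```

Under that assumption, itemsets containing "not bought" values are counted too, which is why the total can exceed the 2<sup>15</sup> subsets of the 15 products alone.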

In [None]:
confidence = 0.6
rules_df = pd.DataFrame()
if len(itemsets) < 1000000:
    rules = [(P, Q, supp, conf)
    for P, Q, supp, conf in association_rules(itemsets, confidence)
       if len(Q) == 1 ]

    names = {item: '{}={}'.format(var.name, val)
        for item, var, val in OneHot.decode(mapping, data_tran, mapping)}

    eligible_ante = [v for k,v in names.items() if v.endswith("1")]

    N = input_assoc_rules.shape[0]

    rule_stats = list(rules_stats(rules, itemsets, N))

    rule_list_df = []
    for ex_rule_frm_rule_stat in rule_stats:
        ante = ex_rule_frm_rule_stat[0]
        cons = ex_rule_frm_rule_stat[1]
        named_cons = names[next(iter(cons))]
        if named_cons in eligible_ante:
            rule_lhs = [names[i][:-2] for i in ante if names[i] in eligible_ante]
            ante_rule = ', '.join(rule_lhs)
            if ante_rule and len(rule_lhs)>1 :
                rule_dict = {'support' : ex_rule_frm_rule_stat[2],
                             'confidence' : ex_rule_frm_rule_stat[3],
                             'coverage' : ex_rule_frm_rule_stat[4],
                             'strength' : ex_rule_frm_rule_stat[5],
                             'lift' : ex_rule_frm_rule_stat[6],
                             'leverage' : ex_rule_frm_rule_stat[7],
                             'antecedent': ante_rule,
                             'consequent':named_cons[:-2] }
                rule_list_df.append(rule_dict)
    rules_df = pd.DataFrame(rule_list_df)
    print("Raw rules data frame of {} rules generated".format(rules_df.shape[0]))
    if not rules_df.empty:
        pruned_rules_df = rules_df.groupby(['antecedent','consequent']).max().reset_index()
    else:
        print("Unable to generate any rule")
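
Stripped of the Orange helpers, the rule-generation step amounts to splitting each frequent itemset into an antecedent and a single-item consequent and keeping the splits whose confidence clears the threshold. The sketch below illustrates this under stated assumptions: `generate_rules`, the hard-coded support counts, and the 0.8 threshold are hypothetical and do not reproduce the notebook's results.

```python
def generate_rules(frequent, n_transactions, min_confidence):
    """Build single-consequent rules from {itemset: support count}."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            consequent = frozenset([item])
            antecedent = itemset - consequent
            # Downward closure guarantees that both parts are present in `frequent`
            conf = count / frequent[antecedent]
            if conf < min_confidence:
                continue
            supp = count / n_transactions
            supp_ante = frequent[antecedent] / n_transactions
            supp_cons = frequent[consequent] / n_transactions
            rules.append({
                "antecedent": ", ".join(sorted(antecedent)),
                "consequent": item,
                "support": supp,
                "confidence": conf,
                "lift": conf / supp_cons,
                "leverage": supp - supp_ante * supp_cons,
            })
    return rules

# Hypothetical frequent itemsets (support counts out of 5 toy transactions)
frequent = {
    frozenset({"bread"}): 4, frozenset({"milk"}): 4,
    frozenset({"diaper"}): 4, frozenset({"beer"}): 3,
    frozenset({"bread", "milk"}): 3, frozenset({"bread", "diaper"}): 3,
    frozenset({"milk", "diaper"}): 3, frozenset({"diaper", "beer"}): 3,
}
toy_rules = generate_rules(frequent, n_transactions=5, min_confidence=0.8)
for rule in toy_rules:
    print(rule)
```

Restricting the consequent to one item mirrors the `len(Q) == 1` filter in the cell above and keeps the number of rules per itemset linear in its size.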

#### Explore The Association Rule Created

Let's see what we get in the first 5 rules with highest confidence:

In [None]:
dw = pd.options.display.max_colwidth
pd.options.display.max_colwidth = 100
(rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
 groupby(['consequent', 'antecedent']).first()
                                      .reset_index()
                                      .sort_values(['confidence', 'support', 'lift'],
                                                   ascending=False)).head()

In [None]:
(rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
 groupby(['consequent', 'antecedent']).first()
                                      .reset_index()
                                      .sort_values(['support', 'confidence', 'lift'],
                                                   ascending=False)).head()

In [None]:
# Create the DataFrame
top_rules = (rules_df[['consequent', 'antecedent', 'support','confidence','lift']].\
 groupby(['consequent', 'antecedent']).first()
                                      .reset_index()
                                      .sort_values(['support', 'confidence', 'lift'],
                                                   ascending=False)).head(10)

# Reset the display option
pd.options.display.max_colwidth = dw

# Create a mapping for the labels
labels = list(pd.concat([top_rules['antecedent'], top_rules['consequent']]).unique())
antecedent_indices = [labels.index(a) for a in top_rules['antecedent']]
consequent_indices = [labels.index(c) for c in top_rules['consequent']]

# Create the ribbon chart
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=labels,
        color="blue"
    ),
    link=dict(
        source=antecedent_indices,  # Indices correspond to labels
        target=consequent_indices,
        value=top_rules['confidence'],
        color='orange'
    ))])

fig.update_layout(title_text="Association Rules Ribbon Chart", font_size=10)
fig.show(renderer="colab")

Typically, a lift value of 1 indicates that the occurrences of the antecedent and the consequent are independent of each other. Hence, the idea is to look for rules with a lift much greater than 1. So, let's see how many rules have a lift greater than 1, equal to 1, and less than 1:

In [None]:
rules_df.lift.apply(lambda x: 'Greater Than One' if x > 1 else 'One' \
                           if x == 1 else 'Less Than One').value_counts()
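
Because lift values are floating-point ratios, an exact comparison with 1 can miss rules whose lift differs from 1 only by rounding error. A tolerance-based variant could look like the sketch below, where `lift_bucket` and its `tol` parameter are hypothetical names chosen for illustration.

```python
import math

def lift_bucket(lift, tol=1e-9):
    """Label a lift value, treating values within `tol` of 1 as equal to 1."""
    if math.isclose(lift, 1.0, abs_tol=tol):
        return "One"
    return "Greater Than One" if lift > 1 else "Less Than One"

for value in (1.25, 1.0, 1.0 + 1e-12, 0.8):
    print(value, "->", lift_bucket(value))
```

It could be applied to the rules with `rules_df.lift.apply(lift_bucket).value_counts()`, assuming `rules_df` is the dataframe built above.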

In [None]:
pd.options.display.max_colwidth = dw

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**Key Findings and Recommendations:**

* **Customer Behavior:** The analysis revealed insights into customer purchase behavior, including top customers, purchase frequency, and sales performance of top products.
* **RFM Segmentation:** The RFM model identified potential customer segments based on their recency, frequency, and monetary value.
* **Clustering Potential:** The project hinted at the potential for K-means clustering to further refine customer segmentation.
* **Marketing and Product Optimization:** The findings can be used to develop targeted marketing campaigns, optimize product placement, and identify new customer segments.
* **Association Rule Mining:** The project suggested further analysis using association rule mining to identify product relationships for recommendations and promotions.

**Considerations and Future Directions:**

* **Computational Efficiency:** The generation of association rules can be computationally expensive, especially with large datasets. Balancing support and confidence levels is crucial for obtaining a reasonable number of strong rules.
* **Rule Targeting:** For rare but high-confidence patterns, consider setting low support and high confidence levels to identify potential areas for cross-selling strategies.
* **Clustering Evaluation:** If K-means clustering is pursued, careful evaluation of cluster quality using methods like Silhouette analysis is essential.
* **Data Quality:** Ensure data quality and consistency throughout the analysis to avoid biases and inaccuracies in the results.

**Overall, this project demonstrates the application of unsupervised machine learning techniques for customer segmentation and business decision-making in the online retail industry.** By leveraging insights from customer behavior, RFM analysis, and potential clustering, businesses can tailor their marketing strategies and product offerings to enhance customer satisfaction and drive sales.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***