<center><h1 style="font-family: 'Georgia'; color: #f2f2f2; background-color:#800040; padding: 20px;">Online Store Customer Descriptive Segmentation</h1></center>

<p style="font-family: 'Georgia'; font-size: 16px; font-weight: 800; color: #800040;">
Customer Segmentation by KMeans Clustering.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; font-weight: 500; color: #800040;">
dataset link: https://www.kaggle.com/datasets/yasserh/customer-segmentation-dataset
</p>

<p style="font-family: 'Georgia'; font-size: 14px; font-weight: 500; color: #800040;">    
In this project, we will extract the following features: 
</p>

<ul> 
    <li style="font-family: 'Georgia'; font-size: 14px; color: #800040;">total purchase amount ---- the sum of all of a customer's orders (quantity * unit price)</li>
    <li style="font-family: 'Georgia'; font-size: 14px; color: #800040;">frequency of purchase ---- how many distinct invoices are recorded for a given customer ID</li>
    <li style="font-family: 'Georgia'; font-size: 14px; color: #800040;">recency of purchase ---- the date of the customer's most recent purchase</li>
</ul>

<p style="font-family: 'Georgia'; font-size: 14px; font-weight: 500; color: #800040;">    
Then we will cluster customers using these three features.
</p>


<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Preprocessing</h1>

In [1]:
# IMPORT LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.cluster import KMeans
import warnings

In [2]:
# LOAD DATASET
# df = pd.read_excel('customer_segmentation.xlsx')
df = pd.read_csv('../datasets/original_datasets/sampled_data_only.csv')
df.head(10)

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,UnitPrice,CustomerID
0,566042,22426,16,2011-09-08 13:53:00,2.95,17811.0
1,555841,21850,1,2011-06-07 12:47:00,4.95,14583.0
2,549238,22084,6,2011-04-07 11:30:00,2.95,14667.0
3,572082,21975,24,2011-10-20 14:18:00,0.55,17672.0
4,581585,21916,24,2011-12-09 12:31:00,0.42,15804.0
5,574289,23002,24,2011-11-03 15:16:00,0.42,17750.0
6,556783,23188,24,2011-06-14 13:15:00,1.65,13408.0
7,544182,85099B,10,2011-02-16 15:13:00,1.95,18257.0
8,539307,90210D,60,2010-12-16 17:43:00,1.25,13694.0
9,566285,22499,3,2011-09-11 14:51:00,5.95,13611.0


In [3]:
# CHECK NULL VALUES
df.isnull().sum()

InvoiceNo      0
StockCode      0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
dtype: int64

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
This analysis focuses on customer behavior (total spending, purchase frequency, and purchase recency), and CustomerID is the key identifier, so rows with a missing CustomerID should be dropped. Otherwise their purchases could not be attributed to any customer. The check above finds no missing values in this sample, so the next cell leaves the row count unchanged.
</p>

In [4]:
# DROP ROWS WITH A MISSING CustomerID
df.dropna(subset=['CustomerID'], inplace=True)
print(df.shape)

(5000, 6)


In [5]:
# CHECK FOR DUPLICATE RECORDS
df.duplicated().sum()

0

In [6]:
# DROP DUPLICATED VALUES
df.drop_duplicates(inplace=True)
print(df.shape)

(5000, 6)


In [7]:
# DROP IRRELEVANT COLUMNS
# df.drop(['Description', 'Country'], axis=1, inplace=True)

In [8]:
print(df.columns)

Index(['InvoiceNo', 'StockCode', 'Quantity', 'InvoiceDate', 'UnitPrice',
       'CustomerID'],
      dtype='object')


<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
    The original dataset also has a Description column, which we decided to drop because it duplicates the information in StockCode. Stores use SKUs (stock keeping units) to track each item with a short code rather than a lengthy description, which makes tracking easier and analyses like this one simpler. The sampled file loaded here no longer contains that column, so the drop above is commented out.
</p>
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
We also dropped the Country column because we want this analysis to apply globally; online stores are usually not limited to domestic customers.
</p>

In [9]:
# EXAMINE THE DATASET SHAPE
print('Data Shape:', df.shape)

Data Shape: (5000, 6)


In [10]:
# EXAMINE DATATYPES
df.dtypes

InvoiceNo       object
StockCode       object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
dtype: object

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
The CustomerID column is an identifier, not a quantity, so it should be treated as categorical rather than as a float. We change its type below.
</p>
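
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
One possible alternative, sketched here on a toy frame with illustrative values, is to cast the float IDs to integers and then to strings. This drops the trailing ".0" and makes the column an unambiguous label. It assumes there are no missing IDs, since NaN cannot be cast to int64.
</p>

```python
import pandas as pd

# Toy frame mimicking the float CustomerID column (illustrative values only).
toy = pd.DataFrame({'CustomerID': [17811.0, 14583.0, 17811.0]})

# float -> int removes the ".0"; int -> str marks the column as a label, not a number.
toy['CustomerID'] = toy['CustomerID'].astype('int64').astype(str)

print(toy['CustomerID'].tolist())
```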

In [11]:
# CHANGE DATATYPES
df['CustomerID'] = df['CustomerID'].astype('object')

In [12]:
df.dtypes

InvoiceNo       object
StockCode       object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID      object
dtype: object

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Creating a New Data Frame</h1>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
My approach in this analysis is to create a new dataframe, df_new, with the following columns:
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
1. CustomerID, with one row for each unique customer in df.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
2. Recency of purchase, which we take from the latest InvoiceDate recorded for each customer.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
3. Frequency of purchase, which we get by counting the distinct invoices for each individual customer.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
4. Total Amount Purchased, which is the sum of what the customer spent on every product they purchased. Each term in this sum is the quantity purchased multiplied by the unit price of the item.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
This approach is possible because each customer in the data is tracked individually under a known identifier. That lets us tell which groups of customers behave the same way.
</p>
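
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
As a rough sketch of the four columns listed above, the same per-customer summary can be built in one pass with a pandas groupby. The toy rows and the names toy, summary, and TotalPrice are illustrative assumptions, not part of the dataset.
</p>

```python
import pandas as pd

# Toy transactions with the same column names as df (values are made up).
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1'],
    'Quantity': [2, 1, 3, 10],
    'InvoiceDate': ['2011-01-05 10:00:00', '2011-01-05 10:00:00',
                    '2011-03-01 09:30:00', '2011-02-10 12:00:00'],
    'UnitPrice': [1.50, 4.00, 2.00, 0.50],
    'CustomerID': [1.0, 1.0, 1.0, 2.0],
})

# Amount spent on each row: quantity times unit price.
toy['TotalPrice'] = toy['Quantity'] * toy['UnitPrice']

# One row per customer: latest date, distinct invoices, total spend.
summary = (
    toy.groupby('CustomerID')
       .agg(MostRecent=('InvoiceDate', 'max'),
            Frequency=('InvoiceNo', 'nunique'),
            TotalSpending=('TotalPrice', 'sum'))
       .reset_index()
)
print(summary)
```

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
Taking the maximum of InvoiceDate as a string works here only because the timestamps are in year-month-day order; converting the column with pd.to_datetime first would be the safer choice.
</p>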

In [23]:
# STORE ALL UNIQUE CUSTOMER IDS (unique() RETURNS A NUMPY ARRAY, NOT A SERIES)
customers_series = df['CustomerID'].unique()

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
Having the list of unique customers up front means we can simply reference it later, instead of finding the customers at the same time as we extract information about them.
</p>

In [24]:
# THE FUNCTION THAT WILL GENERATE THE df_new 
def generate_data_frame(customers_df, df):
    extracted_data = []
    
    for customer_id in customers_df:
        specific_customer = df[df['CustomerID'] == customer_id].copy() # check all records for a specific customer
        
        most_recent = specific_customer['InvoiceDate'].max() # last transaction made
        
        frequency = specific_customer['InvoiceNo'].nunique() # count the number of transactions a specific customer has
        
        specific_customer.loc[:, 'TotalPrice'] = specific_customer['Quantity'] * specific_customer['UnitPrice']  # calculate the total spending for each row
        total_spending = specific_customer['TotalPrice'].sum()  # sum the total spending for the specific customer
        total_spending = round(total_spending, 2)  # round to 2 decimals but keep it numeric for clustering

        extracted_data.append((customer_id, most_recent, frequency, total_spending))
        
    cleaned_df = pd.DataFrame(columns=['CustomerID', 'MostRecent', 'Frequency', 'TotalSpending'], data=extracted_data)
    
    return cleaned_df

In [25]:
df_new = generate_data_frame(customers_series, df)
df_new.head(10)

Unnamed: 0,CustomerID,MostRecent,Frequency,TotalSpending
0,17811.0,2011-12-05 13:46:00,11,196.23
1,14583.0,2011-07-18 12:53:00,3,12.65
2,14667.0,2011-11-18 14:23:00,8,154.35
3,17672.0,2011-10-20 14:18:00,1,26.4
4,15804.0,2011-12-09 12:31:00,4,72.66
5,17750.0,2011-12-04 14:09:00,3,48.82
6,13408.0,2011-10-26 13:28:00,6,270.86
7,18257.0,2011-02-16 15:13:00,1,19.5
8,13694.0,2011-11-14 11:11:00,10,896.52
9,13611.0,2011-09-11 14:51:00,1,17.85
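
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
MostRecent is still a timestamp, while KMeans needs numeric inputs. One common way to handle this, shown here only as a sketch, is to convert it to the number of days since a reference date. The column name RecencyDays and the choice of the latest purchase in the data as the reference date are assumptions; the three rows are copied from the output above.
</p>

```python
import pandas as pd

# Three rows copied from the df_new output above.
toy = pd.DataFrame({
    'CustomerID': [17811.0, 17672.0, 15804.0],
    'MostRecent': ['2011-12-05 13:46:00', '2011-10-20 14:18:00', '2011-12-09 12:31:00'],
})

# Parse the strings so date arithmetic is possible.
toy['MostRecent'] = pd.to_datetime(toy['MostRecent'])

# Assumed reference point: the latest purchase seen in the data.
reference_date = toy['MostRecent'].max()

# Whole days elapsed between each customer's last purchase and the reference date.
toy['RecencyDays'] = (reference_date - toy['MostRecent']).dt.days
print(toy)
```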


In [None]:
# SAVE CLEANED DATASET TO CSV
df_new.to_csv('../datasets/cleaned_datasets/extracted_data.csv', index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Feature Scaling</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Clustering</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Visualization</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Interpretation</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Evaluation</h1>