<center><h1 style="font-family: 'Georgia'; color: #f2f2f2; background-color:#800040; padding: 20px;">Online Store Customer Segmentation</h1></center>

<p style="font-family: 'Georgia'; font-size: 16px; font-weight: 800; color: #800040;">
Customer Segmentation by KMeans Clustering.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; font-weight: 500; color: #800040;">
Dataset link: https://www.kaggle.com/datasets/yasserh/customer-segmentation-dataset
</p>

<p style="font-family: 'Georgia'; font-size: 14px; font-weight: 500; color: #800040;">    
In this project, we will extract the following features: 
</p>

<ul> 
    <li style="font-family: 'Georgia'; font-size: 14px; color: #800040;">total purchase amount ---- the sum of all of a customer's order lines (quantity * unit price)</li>
    <li style="font-family: 'Georgia'; font-size: 14px; color: #800040;">frequency of purchase ---- how often a given customer ID appears in the CustomerID column</li>
    <li style="font-family: 'Georgia'; font-size: 14px; color: #800040;">recency of purchase ---- the date of the customer's most recent purchase</li>
</ul>

<p style="font-family: 'Georgia'; font-size: 14px; font-weight: 500; color: #800040;">    
Then we will compute these three features for each customer ID and use them to group the customers into 3 clusters.
</p>
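
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
As a rough preview of where this is heading, the sketch below clusters a handful of customers into 3 groups with KMeans. The feature values are made up for illustration, and the column names, the use of StandardScaler, and the n_init and random_state settings are assumptions rather than choices taken from this notebook.
</p>

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer features (made-up values, not from the dataset)
features = pd.DataFrame(
    {
        "total_amount": [50.0, 60.0, 55.0, 900.0, 950.0, 1000.0, 5000.0, 5200.0, 5100.0],
        "frequency": [2, 3, 2, 20, 22, 25, 80, 85, 90],
        "recency_days": [300, 280, 310, 60, 50, 55, 5, 3, 4],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6, 7, 8, 9], name="CustomerID"),
)

# Standardize so that no single feature dominates the Euclidean distance
scaled = StandardScaler().fit_transform(features)

# Three clusters, as planned in the text; random_state fixes the initialization
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
features["cluster"] = kmeans.fit_predict(scaled)
print(features)
```

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
The cluster numbers themselves are arbitrary labels; what matters is which customers end up sharing a label.
</p>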


<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Preprocessing</h1>

In [1]:
# IMPORT LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.cluster import KMeans
import warnings

In [2]:
# LOAD DATASET
df = pd.read_excel('customer_segmentation.xlsx')
df.head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,2010-12-01 08:26:00,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,2010-12-01 08:34:00,1.69,13047.0,United Kingdom


In [3]:
# CHECK NULL VALUES
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
This analysis focuses on customer behavior (total spending, frequency of purchase, and recency of purchase), and CustomerID is the key identifier that ties each transaction to a customer. A row with a missing CustomerID cannot be attributed to anyone, so we drop those rows. This leaves us with complete data for every customer we analyze.
</p>
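
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
The small sketch below, on a made-up frame, shows the difference between dropping rows with any missing value and dropping only the rows where CustomerID is missing. The variable names and rows are hypothetical.
</p>

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) with gaps in two different columns
toy = pd.DataFrame(
    {
        "Description": ["LANTERN", None, "HAND WARMER", None],
        "CustomerID": [17850.0, 13047.0, np.nan, np.nan],
    }
)

# Drops every row that has a missing value in any column
dropped_any = toy.dropna()

# Drops only the rows where CustomerID itself is missing
dropped_id = toy.dropna(subset=["CustomerID"])

print(len(dropped_any), len(dropped_id))
```

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
Restricting the check to CustomerID keeps transactions that only lack a description, which still carry a quantity, a price, and a date.
</p>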

In [4]:
# DROP ROWS WITH A MISSING CustomerID
df.dropna(subset=['CustomerID'], inplace=True)
print(df.shape)

(406829, 8)


In [5]:
# CHECK FOR DUPLICATE RECORDS
df.duplicated().sum()

5225

In [6]:
# DROP DUPLICATED VALUES
df.drop_duplicates(inplace=True)
print(df.shape)

(401604, 8)


In [9]:
df['StockCode'].nunique()

3684

In [10]:
# DROP IRRELEVANT COLUMNS
df.drop(['Description', 'Country'], axis=1, inplace=True)

In [11]:
print(df.columns)

Index(['InvoiceNo', 'StockCode', 'Quantity', 'InvoiceDate', 'UnitPrice',
       'CustomerID'],
      dtype='object')


<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
    We dropped the Description column because it is redundant with StockCode: the stock code already identifies the product. Stores use SKUs (Stock Keeping Units) to track each product with a short code rather than a lengthy description, which makes tracking easier and analyses like this one simpler.
</p>
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
We also dropped the Country column because we want this analysis to apply to customers everywhere rather than to segment them by location. Online stores are usually not limited to domestic customers.
</p>
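
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
One way to check that Description really adds nothing beyond StockCode is to count the distinct descriptions recorded for each code. The sketch below does this on a toy frame. The second description for code 71053 is invented for illustration, and on the real data the check would have to run before the Description column is dropped.
</p>

```python
import pandas as pd

# Toy rows; the last description is made up to show an inconsistent code
toy = pd.DataFrame(
    {
        "StockCode": ["85123A", "85123A", "71053", "71053"],
        "Description": [
            "WHITE HANGING HEART T-LIGHT HOLDER",
            "WHITE HANGING HEART T-LIGHT HOLDER",
            "WHITE METAL LANTERN",
            "WHITE MOROCCAN METAL LANTERN",
        ],
    }
)

# Number of distinct descriptions recorded for each stock code
descriptions_per_code = toy.groupby("StockCode")["Description"].nunique()

# Codes with more than one description do not map one-to-one
ambiguous_codes = descriptions_per_code[descriptions_per_code > 1]
print(ambiguous_codes)
```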

In [12]:
# EXAMINE THE DATASET SHAPE
print('Data Shape:', df.shape)

Data Shape: (401604, 6)


In [13]:
# EXAMINE DATATYPES
df.dtypes

InvoiceNo              object
StockCode              object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
dtype: object

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
CustomerID is an identifier, not a numeric quantity, so the column should be treated as categorical. We change its data type accordingly.
</p>
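
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
Converting a float column to object keeps the underlying float values, so the IDs still read as 17850.0. If cleaner labels are wanted, one option (an assumption, not what this notebook does) is to cast to integer first and then to string. That cast is only safe once the missing CustomerID rows have been dropped.
</p>

```python
import pandas as pd

ids = pd.Series([17850.0, 13047.0, 17850.0], name="CustomerID")

# astype('object') keeps the float values, so they still display as 17850.0
as_object = ids.astype("object")

# Casting to int first, then to str, gives labels without the trailing '.0'
as_label = ids.astype("int64").astype(str)

print(as_object.tolist())
print(as_label.tolist())
```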

In [14]:
# CHANGE DATATYPES
df['CustomerID'] = df['CustomerID'].astype('object')

In [15]:
df.dtypes

InvoiceNo              object
StockCode              object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID             object
dtype: object

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Creating a New Data Frame</h1>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
My approach to this problem is to create a new dataframe, df_new, with the following columns:
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
1. CustomerID, which holds each unique customer in df.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
2. Recency of purchase, which we will get from the most recent InvoiceDate of each customer.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
3. Frequency of purchase, which we will get by counting the invoice records of each individual customer.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040; margin-left: 18px;">
4. Total Amount Purchased, which will be obtained by summing the total amounts spent by the customer on all the products they have purchased. Each component in this sum is calculated as the product of the quantity purchased and the unit price of each item.
</p>

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
This approach is possible because each customer in our data is tracked individually through a known identifier. That lets us tell which groups of customers behave alike.
</p>
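
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
The four columns described above could be built in a single groupby pass, as in the sketch below. It uses four rows taken from the head of the dataset, with CustomerID written as integers. The Amount helper column and the Recency, Frequency, and TotalAmount column names are assumptions. Frequency here counts invoice records (rows); counting distinct invoices with nunique instead would be an equally reasonable reading.
</p>

```python
import pandas as pd

# Four rows from the head of the dataset (rows 0, 1, 7 and 9)
toy = pd.DataFrame(
    {
        "InvoiceNo": ["536365", "536365", "536366", "536367"],
        "Quantity": [6, 6, 6, 32],
        "InvoiceDate": pd.to_datetime(
            ["2010-12-01 08:26", "2010-12-01 08:26", "2010-12-01 08:28", "2010-12-01 08:34"]
        ),
        "UnitPrice": [2.55, 3.39, 1.85, 1.69],
        "CustomerID": [17850, 17850, 17850, 13047],
    }
)

# Amount of each line item: quantity times unit price
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]

# One row per customer
df_new = (
    toy.groupby("CustomerID")
    .agg(
        Recency=("InvoiceDate", "max"),    # most recent purchase date
        Frequency=("InvoiceNo", "count"),  # number of invoice records (rows)
        TotalAmount=("Amount", "sum"),     # total amount purchased
    )
    .reset_index()
)
print(df_new)
```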

In [17]:
# STORE ALL UNIQUE CUSTOMER IDS (unique() RETURNS A NUMPY ARRAY)
CustomerID = df['CustomerID'].unique()
CustomerID

array([17850.0, 13047.0, 12583.0, ..., 13298.0, 14569.0, 12713.0],
      dtype=object)

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
Note that unique() returns a NumPy array rather than a pandas Series, listing each customer ID once in order of first appearance. For the per-customer features that follow, we rely on pandas Series and DataFrames rather than plain Python lists. They are optimized for data analysis on larger datasets: beyond memory efficiency, they provide built-in methods for transforming, querying, and aggregating data.
</p>
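
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
The short sketch below, on made-up IDs, contrasts two ways of listing unique customers: unique() gives back a NumPy array, while drop_duplicates() gives back a pandas Series that keeps the original index.
</p>

```python
import numpy as np
import pandas as pd

ids = pd.Series([17850, 13047, 17850, 12583], name="CustomerID")

unique_array = ids.unique()            # NumPy array, order of first appearance
unique_series = ids.drop_duplicates()  # pandas Series, keeps the original index

print(type(unique_array).__name__, type(unique_series).__name__)
```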

In [23]:
# GET MOST RECENT TRANSACTION DATE FOR EACH CUSTOMER
def recent_transactions(df):
    # Group once by customer instead of filtering df for every row,
    # which would rescan the whole frame for each transaction
    last_dates = df.groupby('CustomerID')['InvoiceDate'].max()
    # One (customer_id, last_transaction_date) pair per unique customer
    return list(last_dates.items())


In [None]:
recent = recent_transactions(df)
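
<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
KMeans needs numeric inputs, so the last purchase dates will likely have to be turned into a number at some point. A minimal sketch, assuming recency is measured in whole days before the latest purchase date in the data. The recent list below is hypothetical and only mimics the (customer_id, date) pairs returned by recent_transactions, and the LastPurchase and RecencyDays column names are assumptions.
</p>

```python
import pandas as pd

# Hypothetical pairs in the shape returned by recent_transactions
recent = [
    (17850, pd.Timestamp("2010-12-01 08:28")),
    (13047, pd.Timestamp("2010-12-05 10:00")),
]

recency = pd.DataFrame(recent, columns=["CustomerID", "LastPurchase"])

# Assumption: the reference point is the latest purchase date in the data
snapshot = recency["LastPurchase"].max()

# Whole days elapsed between each customer's last purchase and the snapshot
recency["RecencyDays"] = (snapshot - recency["LastPurchase"]).dt.days
print(recency)
```

<p style="font-family: 'Georgia'; font-size: 14px; color: #800040;">
With this convention, a smaller RecencyDays value means a more recent purchase.
</p>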

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Feature Scaling</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Clustering</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Visualization</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Interpretation</h1>

<h1 style="font-family: 'Georgia'; font-size: 24px; color: #008000;">Evaluation</h1>