# Customer Segmentation with the Tindie Orders' dataset

Customer segmentation will be applied to the KASSER SYNTHS customer database in Tindie using K-means clustering from scikit-learn.

Case Study:
Can this customer database be grouped to develop customized relationships?

To answer this question, three features will be created and used:
- products ordered (Quantity)
- average return rate (refunded items / (billed + refunded items), derived from Status)
- total spending (Item Total)
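
As a minimal sketch of the return-rate feature listed above, the helper below computes refunded / (billed + refunded) from a list of order statuses. The function name `return_rate` is hypothetical and not part of the notebook, and the sketch assumes the only statuses are `billed` and `refunded`, as in the rows shown in this dataset.

```python
def return_rate(statuses):
    """Share of refunded items among all billed and refunded items."""
    refunded = sum(1 for s in statuses if s == "refunded")
    billed = sum(1 for s in statuses if s == "billed")
    total = billed + refunded
    # guard against an empty list so we never divide by zero
    return refunded / total if total else 0.0


print(return_rate(["billed", "refunded"]))  # 0.5
print(return_rate(["billed", "billed"]))    # 0.0
```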

The dataset contains real customer and order data from November 2018 to May 2022 and is pseudonymized for confidentiality.

Imports

In [79]:
# data wrangling
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# for data preprocessing and clustering
from sklearn.cluster import KMeans

%matplotlib inline
# to include graphs inline within the frontends next to code

%config InlineBackend.figure_format='retina'
#to enable retina (high resolution) plots

pd.options.mode.chained_assignment = None
# to bypass warnings in various dataframe assignments

Investigate data

In [80]:
# load data into a dataframe
customers_orders = pd.read_csv("https://raw.githubusercontent.com/abcasas/kasser_tindie_stats/main/datasets/orders/orders.csv")

In [81]:
# first rows of the dataset
customers_orders.head()

Unnamed: 0,Order ID,Status,Order Date,Shipping Date,Refund Date,Customer ID,City,State/Province,Postal/Zip Code,Country,Product Title,Option Summary,Quantity,Unit Price,Discount Price,Item Total,Shipping Total,Discount Total,Order Total
0,134029,billed,2018-12-07,2018-12-12,,79b517750071a0fce0ea0c2ef27fc40d5063df78aac79c...,Brookings,SD,57006,United States of America,DAFM synth - GENESIS YM2612 / YM3438,FM YAMAHA chip: YM3438 - Fully Assembled,1,124.38,124.38,124.38,0.0,0.0,124.38
1,136661,billed,2019-01-04,2019-01-13,,5f54c081a80b3cd0960794be1ea8f4fbd1bb977f7d9b30...,Neustadt,RP,67433,Germany,DAFM synth - GENESIS YM2612 / YM3438,FM YAMAHA chip: YM2612 - Fully Assembled,1,126.12,126.12,126.12,0.0,0.0,126.12
2,136829,billed,2019-01-05,2019-01-16,,1eaa433ace9b356d976ae83bdfac56282ef0fa7afcc05f...,Auckland,Auckland,1024,New Zealand,DAFM synth - GENESIS YM2612 / YM3438,FM YAMAHA chip: YM2612 - Fully Assembled,1,126.12,126.12,126.12,0.0,0.0,126.12
3,137381,billed,2019-01-10,2019-01-19,,92ce259747bc2c850787c5e27547416aa4da0a13a23708...,Berlin,Berlin,10409,Germany,DAFM synth - GENESIS YM2612 / YM3438,FM YAMAHA chip: YM2612 - Fully Assembled,1,126.12,126.12,126.12,0.0,0.0,126.12
4,142040,billed,2019-02-23,2019-02-25,,d1b71ad194e919d69cabdf143cea070293c4190d6c3be3...,Bluff City,TN,37618,United States of America,DAFM synth - GENESIS YM2612 / YM3438,FM YAMAHA chip: YM2612 - Fully Assembled,1,156.11,129.99,129.99,19.99,26.12,149.98


In [82]:
# descriptive statistics of the non-object columns
customers_orders.describe()

Unnamed: 0,Order ID,Quantity,Unit Price,Discount Price,Item Total,Shipping Total,Discount Total,Order Total
count,264.0,264.0,264.0,264.0,264.0,262.0,262.0,262.0
mean,217737.806818,1.0,188.928939,188.063939,188.063939,20.515611,0.871603,210.015153
std,52199.15293,0.0,63.501703,63.478458,63.478458,8.536438,4.580202,68.293399
min,134029.0,1.0,14.98,14.98,14.98,0.0,0.0,14.98
25%,169716.25,1.0,156.11,154.37,154.37,16.49,0.0,175.26
50%,213764.5,1.0,174.99,174.99,174.99,24.99,0.0,199.98
75%,257184.75,1.0,249.99,249.99,249.99,24.99,0.0,274.98
max,332621.0,1.0,274.98,274.98,274.98,90.0,30.0,374.97


In [83]:
# first glance of customers_orders data
customers_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order ID         264 non-null    int64  
 1   Status           264 non-null    object 
 2   Order Date       264 non-null    object 
 3   Shipping Date    253 non-null    object 
 4   Refund Date      13 non-null     object 
 5   Customer ID      264 non-null    object 
 6   City             262 non-null    object 
 7   State/Province   260 non-null    object 
 8   Postal/Zip Code  262 non-null    object 
 9   Country          262 non-null    object 
 10  Product Title    264 non-null    object 
 11  Option Summary   264 non-null    object 
 12  Quantity         264 non-null    int64  
 13  Unit Price       264 non-null    float64
 14  Discount Price   264 non-null    float64
 15  Item Total       264 non-null    float64
 16  Shipping Total   262 non-null    float64
 17  Discount Total  

# 0. Cleaning the data and calculating new columns

The customers_orders.info() output shows that some columns have at least two missing values. This may be related to orders that contain more than one item. Let's check for duplicate order IDs.

In [84]:
customers_orders[customers_orders['Order ID'].duplicated(keep=False)]

Unnamed: 0,Order ID,Status,Order Date,Shipping Date,Refund Date,Customer ID,City,State/Province,Postal/Zip Code,Country,Product Title,Option Summary,Quantity,Unit Price,Discount Price,Item Total,Shipping Total,Discount Total,Order Total
54,166827,billed,2019-09-30,2019-10-08,,4dd0c81d81cee897e87b2bc807b9b82a62a5ba3f978e55...,Kitchener,ON,N2H6M4,Canada,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Metallic Red; OLED Display Color:...,1,174.99,174.99,174.99,24.99,0.0,374.97
55,166827,billed,2019-09-30,,,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,,,,,DAFM synth - GENESIS YM2612 / YM3438,OLED Display Color: Electric Blue; Knobs color...,1,174.99,174.99,174.99,,,
195,256126,billed,2021-04-03,2021-04-07,,7e0a3e445c2e165b164b20375750818514f4c55fae9dfa...,Phoenix,AZ,85009-5108,United States of America,DAFM synth - ARCADE YM2151,Knobs color: Rose Gold,1,259.99,259.99,259.99,24.99,0.0,299.96
196,256126,billed,2021-04-03,,,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,,,,,DAFM synth - Upgrade KIT,Upgrade Kit: USB To Serial Converter,1,14.98,14.98,14.98,,,


Exactly: two orders contain two items each. In these cases the order-level information (shipping, address and totals) appears only on the first item row. Let's group by Order ID and fill in the missing information.
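
To illustrate why the fill is done per order rather than over the whole column, here is a small sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration). Forward-filling inside each `Order ID` group copies the value to the second item of the same order but never leaks it into a different order.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: order 1 has two items, only the first carries the city
toy = pd.DataFrame({
    "Order ID": [1, 1, 2],
    "City": ["Kitchener", np.nan, np.nan],
})

# forward-fill inside each order so values never cross order boundaries
toy["City"] = toy.groupby("Order ID")["City"].ffill()
print(toy)
```

The second row is filled from the first, while the row for order 2 stays missing because it has no earlier row within its own group.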

In [85]:
columns_tofill = ['Shipping Date', 'City', 'State/Province', 'Postal/Zip Code', 'Country', 'Shipping Total', 'Discount Total', 'Order Total']

for column in columns_tofill:
    customers_orders[column] = customers_orders.groupby('Order ID')[column].transform(lambda x: x.ffill())

customers_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order ID         264 non-null    int64  
 1   Status           264 non-null    object 
 2   Order Date       264 non-null    object 
 3   Shipping Date    255 non-null    object 
 4   Refund Date      13 non-null     object 
 5   Customer ID      264 non-null    object 
 6   City             264 non-null    object 
 7   State/Province   262 non-null    object 
 8   Postal/Zip Code  264 non-null    object 
 9   Country          264 non-null    object 
 10  Product Title    264 non-null    object 
 11  Option Summary   264 non-null    object 
 12  Quantity         264 non-null    int64  
 13  Unit Price       264 non-null    float64
 14  Discount Price   264 non-null    float64
 15  Item Total       264 non-null    float64
 16  Shipping Total   264 non-null    float64
 17  Discount Total  

There are still missing values in Shipping Date, Refund Date and State/Province.

Shipping Date: There are nine orders that do not have a shipping date. Let's see why.

In [86]:
customers_orders[customers_orders['Shipping Date'].isnull()]

Unnamed: 0,Order ID,Status,Order Date,Shipping Date,Refund Date,Customer ID,City,State/Province,Postal/Zip Code,Country,Product Title,Option Summary,Quantity,Unit Price,Discount Price,Item Total,Shipping Total,Discount Total,Order Total
26,153924,refunded,2019-06-06,,2019-06-16,159f093907cc176ab57e859c19213d7b908307bf73d55a...,Ottawa,ON,K2J 4P3,Canada,DAFM synth - GENESIS YM2612 / YM3438,OLED Display Color: Electric Blue; Knobs color...,1,174.99,174.99,174.99,24.99,0.0,199.98
81,179904,refunded,2020-01-15,,2020-01-16,e0cdc8183d4daf54548d2f40ef22c23c6a3889f199a995...,Tacoma,WA,98405,United States of America,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; FM YAMAHA chip: YM26...,1,174.99,174.99,174.99,24.99,0.0,199.98
98,193505,refunded,2020-05-08,,2020-05-22,108c224d3ef0ce006ed1e4b9b3e30e3acfb512c562d1a9...,Shime,Fukuoka,8112202,Japan,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; OLED Display Color: ...,1,249.99,249.99,249.99,24.99,0.0,274.98
116,204797,refunded,2020-07-20,,2020-07-20,da95cf8f688e01792724a48b134e61c606481eca724d77...,Snohomish,WA,98296-7608,United States of America,DAFM synth - Upgrade KIT,Upgrade Kit: Green Pill - 128 kbytes Microcont...,1,33.98,33.98,33.98,0.0,0.0,33.98
117,204985,refunded,2020-07-21,,2020-07-21,16cc87c406335e0d9248ecc90769f263f19567c09bd210...,Oakland,CA,94609,United States of America,DAFM synth - Upgrade KIT,Upgrade Kit: Green Pill - 128 kbytes Microcont...,1,33.98,33.98,33.98,0.0,0.0,33.98
138,216979,refunded,2020-09-22,,2020-09-30,80a7d106a7ba6c95b22261d7efa05ed9e90456f02748e7...,BENTLEY,WA,6102,Australia,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; FM YAMAHA chip: YM2612,1,249.99,249.99,249.99,24.99,0.0,274.98
142,220003,refunded,2020-10-08,,2020-10-08,2b399d7420dc7bc7af563099b028273c83adae8cf92fef...,Columbus,WI,53925-1763,United States of America,DAFM synth - Upgrade KIT,Upgrade Kit: Green Pill - 128 kbytes Microcont...,1,33.98,33.98,33.98,0.0,0.0,33.98
207,264226,refunded,2021-05-12,,2021-05-15,65518f673f0f0ee95524bbfac829fac90915a87b89fe92...,Brighton,East Sussex,BN21FH,United Kingdom,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; FM YAMAHA chip: YM3438,1,259.99,259.99,259.99,24.99,0.0,284.98
263,332621,billed,2022-06-12,,,cd36235e7711e378b3ec49eca87fc354ec2b528814c00f...,McMinnville,OR,97128-9554,United States of America,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; FM YAMAHA chip: YM3438,1,269.99,269.99,269.99,24.99,0.0,294.98


As you can see, eight of them are refunded orders, most likely cancellations made before shipping. The ninth row is simply a recent order that has not been shipped yet.

The Refund Date column is empty whenever there was no refund.
Finally, let's see why there are missing values in the State/Province column.

In [87]:
customers_orders[customers_orders['State/Province'].isnull()]

Unnamed: 0,Order ID,Status,Order Date,Shipping Date,Refund Date,Customer ID,City,State/Province,Postal/Zip Code,Country,Product Title,Option Summary,Quantity,Unit Price,Discount Price,Item Total,Shipping Total,Discount Total,Order Total
167,235588,refunded,2020-12-14,2020-12-20,2021-06-09,fc7799bbdd95b92d83288e9063128cd625070e6b45ccea...,Tokyo,,1580082,Japan,DAFM synth - ARCADE YM2151,Knobs color: Metallic Red,1,249.99,249.99,249.99,24.99,0.0,274.98
170,239280,billed,2021-01-06,2021-01-11,,fc7799bbdd95b92d83288e9063128cd625070e6b45ccea...,Tokyo,,1580082,Japan,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; FM YAMAHA chip: YM2612,1,249.99,249.99,249.99,24.99,0.0,274.98


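Both rows appear to belong to the same customer in Tokyo, so this is nothing critical for now. We leave it as is.

Next, we create a new column called Refunded_Quantity from the Status column. Since Quantity is 1 in every row, it is 1 for refunded rows and 0 for billed rows.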

In [88]:
def refunded_orders(row):
    if row['Status'] == 'refunded':
        return 1
    if row['Status'] == 'billed':
        return 0

customers_orders['Refunded_Quantity'] = customers_orders.apply(refunded_orders, axis=1)

customers_orders[customers_orders['Status']=='refunded'].head()

Unnamed: 0,Order ID,Status,Order Date,Shipping Date,Refund Date,Customer ID,City,State/Province,Postal/Zip Code,Country,Product Title,Option Summary,Quantity,Unit Price,Discount Price,Item Total,Shipping Total,Discount Total,Order Total,Refunded_Quantity
26,153924,refunded,2019-06-06,,2019-06-16,159f093907cc176ab57e859c19213d7b908307bf73d55a...,Ottawa,ON,K2J 4P3,Canada,DAFM synth - GENESIS YM2612 / YM3438,OLED Display Color: Electric Blue; Knobs color...,1,174.99,174.99,174.99,24.99,0.0,199.98,1
32,156880,refunded,2019-07-03,2019-07-06,2019-07-29,8a54cfa8c6f9b391b313500e8b359c3cc336c76412581a...,Madrid,Madrid,28046,Spain,DAFM synth - GENESIS YM2612 / YM3438,OLED Display Color: Electric Blue; Knobs color...,1,174.99,174.99,174.99,5.99,0.0,180.98,1
81,179904,refunded,2020-01-15,,2020-01-16,e0cdc8183d4daf54548d2f40ef22c23c6a3889f199a995...,Tacoma,WA,98405,United States of America,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; FM YAMAHA chip: YM26...,1,174.99,174.99,174.99,24.99,0.0,199.98,1
98,193505,refunded,2020-05-08,,2020-05-22,108c224d3ef0ce006ed1e4b9b3e30e3acfb512c562d1a9...,Shime,Fukuoka,8112202,Japan,DAFM synth - GENESIS YM2612 / YM3438,Knobs color: Urban Black; OLED Display Color: ...,1,249.99,249.99,249.99,24.99,0.0,274.98,1
115,203694,refunded,2020-07-12,2020-07-23,2020-10-25,d3e0eed6c223da5f38558b7cddee4f1a886f908993a5e1...,Las Vegas,NV,89102,United States of America,DAFM synth - GENESIS YM2612 / YM3438,FM YAMAHA chip: YM2612; Knobs color: Rose Gold,1,249.99,249.99,249.99,24.99,0.0,274.98,1
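
The row-wise function above could also be written as a vectorised expression. The sketch below uses hypothetical toy data and keeps the ordered quantity for refunded rows, which generalises to quantities larger than 1. Note one difference that is an assumption of this sketch: any status other than `refunded` yields 0 here, whereas the row-wise function returns nothing for unknown statuses.

```python
import pandas as pd

# hypothetical toy data with one refunded row of quantity 2
toy = pd.DataFrame({"Status": ["billed", "refunded", "billed"],
                    "Quantity": [1, 2, 1]})

# keep Quantity where the row is refunded, otherwise use 0
toy["Refunded_Quantity"] = toy["Quantity"].where(toy["Status"] == "refunded", 0)
print(toy["Refunded_Quantity"].tolist())  # [0, 2, 0]
```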


# 1. Products ordered

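It is the number of distinct products (by Product Title) that a customer has ordered. The order rows are first counted per customer and product, and each customer-product pair is then encoded as 1.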

Create functions to identify customers who order multiple products

In [89]:
def encode_column(column):
    if column > 0:
        return 1
    if column <= 0:
        return 0


def aggregate_by_ordered_quantity(dataframe, column_list):
    '''this function:
    1. aggregates the given dataframe by column_list,
    counting the order rows for each combination of the listed columns

    2. adds a products_ordered column that encodes every combination
    with at least one order row as 1

    3. sums products_ordered per value of the first element of column_list,
    giving how many distinct values of the second element were ordered'''

    aggregated_dataframe = (dataframe
                            .groupby(column_list)
                            .Quantity.count()
                            .reset_index())

    aggregated_dataframe["products_ordered"] = (aggregated_dataframe
                                                 .Quantity
                                                 .apply(encode_column))

    final_dataframe = (aggregated_dataframe
                       .groupby(column_list[0])
                       .products_ordered.sum() # distinct products per customer
                       .reset_index())

    return final_dataframe

In [90]:
# apply functions to customers_orders
customers = aggregate_by_ordered_quantity(customers_orders, ["Customer ID", "Product Title"])

In [91]:
customers.head()

Unnamed: 0,Customer ID,products_ordered
0,02d256b0945f57b2c67f2cf6de70b7d49846e77cb09977...,1
1,04dc33d3e53a30be927e33fedee9a25dcbd7ecaf3ad001...,2
2,053e199141a85eeef526609bd0d9340828f1f5135adab4...,1
3,05f39745a03d75243b0135a7b893ebb22303ac5c011305...,1
4,0618510568bedd20b573f8bfeca0ceda5753c949015800...,1
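
For comparison, the same count of distinct products per customer can be sketched with `nunique`. This is an alternative formulation on hypothetical toy data, not the notebook's own function; the product names are shortened placeholders.

```python
import pandas as pd

# hypothetical toy data: customer "a" ordered GENESIS twice and ARCADE once
toy = pd.DataFrame({
    "Customer ID": ["a", "a", "a", "b"],
    "Product Title": ["GENESIS", "GENESIS", "ARCADE", "GENESIS"],
    "Quantity": [1, 1, 1, 1],
})

# number of distinct product titles per customer
products_ordered = (toy.groupby("Customer ID")["Product Title"]
                    .nunique()
                    .reset_index(name="products_ordered"))
print(products_ordered)
```

Repeat purchases of the same product do not increase the count, which matches the encode-then-sum logic of `aggregate_by_ordered_quantity`.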


# 2. Average Return Rate

It is the ratio of returned item quantity and ordered item quantity. This ratio is first calculated per order and then averaged for all orders of a customer.

In [92]:
# aggregate data per customer_id and order_id, 
# to see ordered item sum and returned item sum
ordered_sum_by_customer_order = (customers_orders
                                 .groupby(["Customer ID", "Order ID"])
                                 .Quantity.sum()
                                 .reset_index())

returned_sum_by_customer_order = (customers_orders
                                  .groupby(["Customer ID", "Order ID"])
                                  .Refunded_Quantity.sum()
                                  .reset_index())

# merge the two dataframes on their shared keys to be able to calculate the unit return rate
ordered_returned_sums = pd.merge(ordered_sum_by_customer_order, returned_sum_by_customer_order,
                                 on=["Customer ID", "Order ID"])


In [93]:
# calculate unit return rate per order and customer
ordered_returned_sums["average_return_rate"] = ( 
                                             ordered_returned_sums["Refunded_Quantity"] /
                                             ordered_returned_sums["Quantity"])

In [94]:
ordered_returned_sums.head()

Unnamed: 0,Customer ID,Order ID,Quantity,Refunded_Quantity,average_return_rate
0,02d256b0945f57b2c67f2cf6de70b7d49846e77cb09977...,318128,1,0,0.0
1,04dc33d3e53a30be927e33fedee9a25dcbd7ecaf3ad001...,267893,1,0,0.0
2,04dc33d3e53a30be927e33fedee9a25dcbd7ecaf3ad001...,270070,1,0,0.0
3,053e199141a85eeef526609bd0d9340828f1f5135adab4...,147569,1,0,0.0
4,05f39745a03d75243b0135a7b893ebb22303ac5c011305...,155539,1,0,0.0


In [95]:
# take average of the unit return rate for all orders of a customer
customer_return_rate = (ordered_returned_sums
                        .groupby("Customer ID")
                        .average_return_rate
                        .mean()
                        .reset_index())

In [96]:
# naming the columns explicitly keeps this independent of the pandas version
return_rates = (customer_return_rate["average_return_rate"]
                .value_counts()
                .rename_axis("average return rate")
                .reset_index(name="count of unit return rate"))

return_rates.sort_values(by="average return rate")

Unnamed: 0,average return rate,count of unit return rate
0,0.0,223
2,0.5,5
1,1.0,8
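
Because every order row has a quantity of 1, the per-order rate is either 0 or 1, so a value of 0.5 would arise for a customer with two orders of which one was refunded. The sketch below reproduces that case on hypothetical toy data, following the same two steps: rate per order, then mean per customer.

```python
import pandas as pd

# hypothetical toy data: customer "a" has two orders, one of them refunded
toy = pd.DataFrame({
    "Customer ID": ["a", "a", "b"],
    "Order ID": [10, 11, 12],
    "Quantity": [1, 1, 1],
    "Refunded_Quantity": [0, 1, 0],
})

# step 1: return rate per order
per_order = (toy.groupby(["Customer ID", "Order ID"])
             [["Quantity", "Refunded_Quantity"]].sum())
per_order["rate"] = per_order["Refunded_Quantity"] / per_order["Quantity"]

# step 2: average the per-order rates for each customer
per_customer = per_order.groupby(level="Customer ID")["rate"].mean()
print(per_customer.to_dict())  # {'a': 0.5, 'b': 0.0}
```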


In [97]:
# add average_return_rate to customers dataframe
customers = pd.merge(customers,
                     customer_return_rate,
                     on="Customer ID")

In [98]:
customers.head()

Unnamed: 0,Customer ID,products_ordered,average_return_rate
0,02d256b0945f57b2c67f2cf6de70b7d49846e77cb09977...,1,0.0
1,04dc33d3e53a30be927e33fedee9a25dcbd7ecaf3ad001...,2,0.0
2,053e199141a85eeef526609bd0d9340828f1f5135adab4...,1,0.0
3,05f39745a03d75243b0135a7b893ebb22303ac5c011305...,1,0.0
4,0618510568bedd20b573f8bfeca0ceda5753c949015800...,1,0.0


# 3. Total spending

Total spending is the sum of a customer's sales value after returns: billed items count with their Item Total, refunded items count as 0.

In [99]:
def total_sales(row):
    if row['Status'] == 'refunded':
        return 0
    if row['Status'] == 'billed':
        return row['Item Total']

customers_orders['total_sales'] = customers_orders.apply(total_sales, axis=1)

# aggregate total sales per customer id
customer_total_spending = (customers_orders
                           .groupby("Customer ID")
                           .total_sales
                           .sum()
                           .reset_index())

customer_total_spending.rename(columns = {"total_sales" : "total_spending"},
                               inplace = True)

Create features data frame

In [100]:
# add total sales to customers dataframe
customers = customers.merge(customer_total_spending, 
                            on="Customer ID")

In [101]:
print("The number of customers from the existing customer base:", customers.shape[0])

The number of customers from the existing customer base: 236


In [102]:
# drop id column since it is not a feature
customers.drop(columns="Customer ID",
               inplace=True)

In [103]:
customers.head()

Unnamed: 0,products_ordered,average_return_rate,total_spending
0,1,0.0,259.99
1,2,0.0,519.98
2,1,0.0,174.99
3,1,0.0,124.99
4,1,0.0,259.99


Visualize features

In [104]:
fig = make_subplots(rows=3, cols=1,
                   subplot_titles=("Products Ordered", 
                                   "Average Return Rate", 
                                   "Total Spending"))

fig.append_trace(go.Histogram(x=customers.products_ordered),
                 row=1, col=1)

fig.append_trace(go.Histogram(x=customers.average_return_rate),
                 row=2, col=1)

fig.append_trace(go.Histogram(x=customers.total_spending),
                 row=3, col=1)

fig.update_layout(height=800, width=800,
                  title_text="Distribution of the Features")

fig.show()

Scale Features: Log Transformation

In [105]:
def apply_log1p_transformation(dataframe, column):
    '''This function takes a dataframe and a column in the string format
    then applies numpy log1p transformation to the column
    as a result returns log1p applied pandas series'''
    
    dataframe["log_" + column] = np.log1p(dataframe[column])
    return dataframe["log_" + column]
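
As a quick illustration of why `log1p` is used here rather than a plain logarithm: `log1p(x)` computes log(1 + x), so it is defined for the zeros that occur in the return rate and in the spending of fully refunded customers, and `expm1` inverts it exactly. The sketch below uses only the standard library and three sample values.

```python
import math

for value in [0.0, 0.5, 259.99]:
    transformed = math.log1p(value)      # log(1 + value), defined at 0
    recovered = math.expm1(transformed)  # exp(x) - 1, the inverse transform
    print(value, round(transformed, 6), round(recovered, 2))
```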

1. Products ordered

In [106]:
apply_log1p_transformation(customers, "products_ordered")

0      0.693147
1      1.098612
2      0.693147
3      0.693147
4      0.693147
         ...   
231    0.693147
232    0.693147
233    1.098612
234    0.693147
235    0.693147
Name: log_products_ordered, Length: 236, dtype: float64

2. Average return rate

In [107]:
apply_log1p_transformation(customers, "average_return_rate")

0      0.000000
1      0.000000
2      0.000000
3      0.000000
4      0.000000
         ...   
231    0.000000
232    0.000000
233    0.405465
234    0.000000
235    0.000000
Name: log_average_return_rate, Length: 236, dtype: float64

3. Total spending

In [108]:
apply_log1p_transformation(customers, "total_spending")

0      5.564482
1      6.255712
2      5.170427
3      4.836203
4      5.564482
         ...   
231    5.564482
232    4.836203
233    5.525413
234    5.081342
235    5.525413
Name: log_total_spending, Length: 236, dtype: float64

Visualize log transformation applied features

In [109]:
fig = make_subplots(rows=3, cols=1,
                   subplot_titles=("Products Ordered", 
                                   "Average Return Rate", 
                                   "Total Spending"))

fig.append_trace(go.Histogram(x=customers.log_products_ordered),
                 row=1, col=1)

fig.append_trace(go.Histogram(x=customers.log_average_return_rate),
                 row=2, col=1)

fig.append_trace(go.Histogram(x=customers.log_total_spending),
                 row=3, col=1)

fig.update_layout(height=800, width=800,
                  title_text="Distribution of the Features after Logarithm Transformation")

fig.show()

In [110]:
customers.head()

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending
0,1,0.0,259.99,0.693147,0.0,5.564482
1,2,0.0,519.98,1.098612,0.0,6.255712
2,1,0.0,174.99,0.693147,0.0,5.170427
3,1,0.0,124.99,0.693147,0.0,4.836203
4,1,0.0,259.99,0.693147,0.0,5.564482


In [111]:
# features we are going to use as an input to the model
customers.iloc[:,3:]

Unnamed: 0,log_products_ordered,log_average_return_rate,log_total_spending
0,0.693147,0.000000,5.564482
1,1.098612,0.000000,6.255712
2,0.693147,0.000000,5.170427
3,0.693147,0.000000,4.836203
4,0.693147,0.000000,5.564482
...,...,...,...
231,0.693147,0.000000,5.564482
232,0.693147,0.000000,4.836203
233,1.098612,0.405465,5.525413
234,0.693147,0.000000,5.081342


Create K-means model

In [112]:
# create initial K-means model; n_clusters is not set, so scikit-learn's default of 8 clusters is used
kmeans_model = KMeans(init='k-means++', 
                      max_iter=500, 
                      random_state=42)

In [113]:
kmeans_model.fit(customers.iloc[:,3:])

# print the sum of squared distances from each sample to its nearest cluster center
print("within-cluster sum-of-squares (inertia) of the model is:", kmeans_model.inertia_)

within-cluster sum-of-squares (inertia) of the model is: 2.2441645181869063
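
To make the inertia value concrete, the following sketch computes it by hand with NumPy for hypothetical points and fixed centers (both arrays are made up for illustration). Each point contributes the squared Euclidean distance to its nearest center.

```python
import numpy as np

# hypothetical 2-D points and two fixed cluster centers
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# squared Euclidean distance from every point to every center
sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# each point contributes the squared distance to its nearest center
inertia = sq_dist.min(axis=1).sum()
print(inertia)  # 4.0
```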


Hyperparameter tuning: Find optimal number of clusters

In [114]:
def make_list_of_K(K, dataframe):
    '''inputs: K as integer and dataframe
    apply k-means clustering to dataframe
    and make a list of inertia values against 1 to K (inclusive)
    return the inertia values list
    '''
    
    cluster_values = list(range(1, K+1))
    inertia_values=[]
    
    for c in cluster_values:
        model = KMeans(
            n_clusters = c, 
            init='k-means++', 
            max_iter=500, 
            random_state=42)
        model.fit(dataframe)
        inertia_values.append(model.inertia_)
    
    return inertia_values

Visualize different K and models

In [115]:
# save inertia values in a dataframe for k values from 1 to 15
results = make_list_of_K(15, customers.iloc[:, 3:])

k_values_distances = pd.DataFrame({"clusters": list(range(1, 16)),
                                   "within cluster sum of squared distances": results})


KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.



In [116]:
# visualization for the selection of number of segments
fig = go.Figure()

fig.add_trace(go.Scatter(x=k_values_distances["clusters"], 
                         y=k_values_distances["within cluster sum of squared distances"],
                         mode='lines+markers'))

fig.update_layout(xaxis = dict(
        tickmode = 'linear',
        tick0 = 1,
        dtick = 1),
                  title_text="Within Cluster Sum of Squared Distances VS K Values",
                  xaxis_title="K values",
                  yaxis_title="Cluster sum of squared distances")

fig.show()
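
The elbow in such a plot is usually read by eye. As a rough numeric aid, one can look at how much the inertia falls with each additional cluster. This is a heuristic sketch, not the notebook's method: the helper `relative_drops` is a hypothetical name and the inertia values are invented for illustration, not taken from this dataset.

```python
def relative_drops(inertias):
    """Fractional decrease in inertia when going from k to k + 1 clusters."""
    return [(prev - curr) / prev for prev, curr in zip(inertias, inertias[1:])]


# hypothetical inertia values for k = 1..6, not taken from this dataset
inertias = [100.0, 40.0, 15.0, 6.0, 5.0, 4.5]

for k, drop in enumerate(relative_drops(inertias), start=2):
    print(f"k={k}: inertia fell by {drop:.0%}")
```

In this made-up series the drop shrinks sharply after k=4, which is the kind of pattern an elbow plot shows visually.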

Update K-Means Clustering

In [117]:
# create clustering model with k=4, the value chosen from the elbow plot
updated_kmeans_model = KMeans(n_clusters = 4, 
                              init='k-means++', 
                              max_iter=500, 
                              random_state=42)

updated_kmeans_model.fit_predict(customers.iloc[:,3:])

array([2, 2, 0, 0, 2, 0, 0, 2, 0, 2, 0, 2, 2, 0, 2, 1, 1, 2, 2, 0, 2, 0,
       2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 2, 0, 0, 2, 0, 3, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 2,
       2, 0, 0, 2, 1, 0, 0, 0, 3, 2, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 1, 0, 0, 0, 2, 0, 3, 2, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0,
       2, 2, 0, 0, 0, 0, 2, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0,
       0, 0, 2, 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 2, 2, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 0, 0, 0,
       0, 2, 0, 2, 0, 0, 2, 2, 2, 0, 2, 2, 0, 2, 0, 2])

Add cluster centers to the visualization

In [118]:
# cluster centers in log space, and the same centers mapped back to the original scale with expm1
cluster_centers = updated_kmeans_model.cluster_centers_
actual_data = np.expm1(cluster_centers)
add_points = np.append(actual_data, cluster_centers, axis=1)
add_points

array([[1.04156006e+00, 5.89360527e-03, 1.56189393e+02, 7.13714251e-01,
        5.87630591e-03, 5.05745140e+00],
       [1.00000000e+00, 1.00000000e+00, 0.00000000e+00, 6.93147181e-01,
        6.93147181e-01, 0.00000000e+00],
       [1.16197502e+00, 1.40797545e-02, 2.77465136e+02, 7.71022167e-01,
        1.39815555e-02, 5.62929287e+00],
       [1.00000000e+00, 6.93889390e-18, 1.97487108e+01, 6.93147181e-01,
        6.93889390e-18, 3.03248412e+00]])
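
One caveat when reading the back-transformed centers: K-means averages the log-transformed values, so applying `expm1` to a center does not give the arithmetic mean of the original values in that cluster. The sketch below shows the gap on hypothetical spending values.

```python
import numpy as np

spending = np.array([0.0, 100.0, 300.0])  # hypothetical values

log_center = np.log1p(spending).mean()    # what K-means averages
back_transformed = np.expm1(log_center)   # value obtained after expm1
arithmetic_mean = spending.mean()

print(round(back_transformed, 2), round(arithmetic_mean, 2))
```

The back-transformed value is never larger than the arithmetic mean, so the centers in the original scale are best read as typical values rather than as cluster averages.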

In [119]:
# add labels to customers dataframe and add_points array
add_points = np.append(add_points, [[0], [1], [2], [3]], axis=1)
customers["clusters"] = updated_kmeans_model.labels_

In [120]:
# create centers dataframe from add_points
centers_df = pd.DataFrame(data=add_points, columns=["products_ordered",
                                                    "average_return_rate",
                                                    "total_spending",
                                                    "log_products_ordered",
                                                    "log_average_return_rate",
                                                    "log_total_spending",
                                                    "clusters"])
centers_df.head()

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending,clusters
0,1.04156,0.005893605,156.189393,0.713714,0.005876306,5.057451,0.0
1,1.0,1.0,0.0,0.693147,0.6931472,0.0,1.0
2,1.161975,0.01407975,277.465136,0.771022,0.01398156,5.629293,2.0
3,1.0,6.938894e-18,19.748711,0.693147,6.938894e-18,3.032484,3.0


In [121]:
# align cluster centers of centers_df and customers
centers_df["clusters"] = centers_df["clusters"].astype("int")

In [122]:
centers_df.head()

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending,clusters
0,1.04156,0.005893605,156.189393,0.713714,0.005876306,5.057451,0
1,1.0,1.0,0.0,0.693147,0.6931472,0.0,1
2,1.161975,0.01407975,277.465136,0.771022,0.01398156,5.629293,2
3,1.0,6.938894e-18,19.748711,0.693147,6.938894e-18,3.032484,3


In [123]:
customers.head()

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending,clusters
0,1,0.0,259.99,0.693147,0.0,5.564482,2
1,2,0.0,519.98,1.098612,0.0,6.255712,2
2,1,0.0,174.99,0.693147,0.0,5.170427,0
3,1,0.0,124.99,0.693147,0.0,4.836203,0
4,1,0.0,259.99,0.693147,0.0,5.564482,2


In [124]:
# differentiate between data points and cluster centers
customers["is_center"] = 0
centers_df["is_center"] = 1

# add dataframes together
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
customers = pd.concat([customers, centers_df], ignore_index=True)

In [125]:
customers.tail()

Unnamed: 0,products_ordered,average_return_rate,total_spending,log_products_ordered,log_average_return_rate,log_total_spending,clusters,is_center
235,1.0,0.0,249.99,0.693147,0.0,5.525413,2,0
236,1.04156,0.005893605,156.189393,0.713714,0.005876306,5.057451,0,1
237,1.0,1.0,0.0,0.693147,0.6931472,0.0,1,1
238,1.161975,0.01407975,277.465136,0.771022,0.01398156,5.629293,2,1
239,1.0,6.938894e-18,19.748711,0.693147,6.938894e-18,3.032484,3,1


Visualize Customer Segmentation

In [126]:
# add the cluster label as a string so the plot treats it as a category
customers["cluster_name"] = customers["clusters"].astype(str)

In [127]:
# visualize log_transformation customer segments with a 3D plot
fig = px.scatter_3d(customers,
                    x="log_products_ordered",
                    y="log_average_return_rate",
                    z="log_total_spending",
                    color='cluster_name',
                    hover_data=["products_ordered",
                                "average_return_rate",
                                "total_spending"],
                    category_orders = {"cluster_name": 
                                       ["0", "1", "2", "3"]},
                    symbol = "is_center"
                    )

fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

Check for Cluster Magnitude

In [128]:
# count customers per cluster, excluding the four appended cluster-center rows
cardinality_df = (customers.loc[customers["is_center"] == 0, "cluster_name"]
                  .value_counts()
                  .rename_axis("Customer Groups")
                  .reset_index(name="Customer Group Magnitude"))

In [129]:
cardinality_df

Unnamed: 0,Customer Groups,Customer Group Magnitude
0,0,138
1,2,87
2,1,8
3,3,3


In [130]:
fig = px.bar(cardinality_df, x="Customer Groups", 
             y="Customer Group Magnitude",
             color = "Customer Groups",
             category_orders = {"Customer Groups": ["0", "1", "2", "3"]})

fig.update_layout(xaxis = dict(
        tickmode = 'linear',
        tick0 = 1,
        dtick = 1),
                 yaxis = dict(
        tickmode = 'linear',
        tick0 = 0,
        dtick = 20))

fig.show()
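
To connect the segments back to the case-study question, one possible next step is a per-cluster profile of the original features. The sketch below is an assumption about how that could look, shown on hypothetical toy data that reuses the notebook's column names. It excludes the cluster-center rows via `is_center`.

```python
import pandas as pd

# hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    "total_spending": [150.0, 170.0, 280.0, 0.0],
    "average_return_rate": [0.0, 0.0, 0.0, 1.0],
    "clusters": [0, 0, 2, 1],
    "is_center": [0, 0, 0, 0],
})

# mean and size of each cluster on the original feature scale
profile = (toy[toy["is_center"] == 0]
           .groupby("clusters")[["total_spending", "average_return_rate"]]
           .agg(["mean", "count"]))
print(profile)
```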