# **Customer Lifetime Value (CLTV) Calculation using Online Retail Dataset**

# Business problem
### An e-commerce company wants to divide its customers into groups and apply a marketing approach specific to each group. The purpose of this study is to calculate a CLTV value for each customer and then segment the customers according to these values. Segmentation can be done in many ways; here it is based on lifetime value, so the customers are split into segments (four in this notebook, although any number could be used) according to the calculated CLTV.
# Dataset story
### The dataset named Online Retail contains the online sales transactions of a UK-based retail company between 01/12/2009 and 09/12/2011. The company's product catalog consists of souvenirs, and most of its customers are wholesalers. This notebook uses only the 'Year 2009-2010' sheet.
* Invoice: Invoice number (if this code starts with C, the transaction has been cancelled)
* StockCode: Product code (unique for each product)
* Description: Product name
* Quantity: Number of units of the product sold on the invoice
* InvoiceDate: Invoice date
* Price: Unit price (pounds sterling)
* Customer ID: Unique customer number
* Country: Country name

# Importing the libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Reading dataset

In [2]:
df_ = pd.read_excel('/kaggle/input/online-retail-dataset/online_retail_II.xlsx', sheet_name='Year 2009-2010')
df = df_.copy()
df.columns = [col.lower() for col in df.columns]
df.head()

,invoice,stockcode,description,quantity,invoicedate,price,customer id,country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


# Data preparation

### Let's remove the cancelled transactions (invoices whose code contains 'C')


In [3]:
df = df[~df['invoice'].str.contains('C', na=False)]
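
The filter above relies on `.str.contains` returning a missing value for non-string invoices, which `na=False` then treats as "not cancelled". A minimal sketch of the same idea on made-up rows (the `toy` frame and its values are hypothetical) casts the column to `str` first and checks the prefix only, which follows the dataset description more literally:

```python
import pandas as pd

# Hypothetical toy rows: the invoice column mixes integers and 'C'-prefixed strings.
toy = pd.DataFrame({
    'invoice': [489434, 'C489449', 489435, 'C489459'],
    'quantity': [12, -12, 24, -3],
})

# Casting to str makes the check independent of the column's mixed types.
is_cancelled = toy['invoice'].astype(str).str.startswith('C')
kept = toy[~is_cancelled]

print(is_cancelled.sum())  # number of cancelled rows
print(kept)
```

Whether `startswith` and `contains` select exactly the same rows on the real data depends on the invoice codes, so it is worth comparing the two masks before switching.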

### Let's show the descriptive statistics

In [4]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
quantity,515255.0,10.957,104.354,-9600.0,1.0,3.0,10.0,19152.0
price,515255.0,3.956,127.689,-53594.36,1.25,2.1,4.21,25111.09
customer id,407695.0,15368.504,1679.796,12346.0,13997.0,15321.0,16812.0,18287.0


### The price and quantity variables contain negative values. Rows with a non-positive price or quantity must be removed from the dataset

In [5]:
df = df[df['quantity'] > 0]
df = df[df['price'] > 0]

### Let's examine the missing data

In [6]:
df.isnull().sum()

invoice             0
stockcode           0
description         0
quantity            0
invoicedate         0
price               0
customer id    103901
country             0
dtype: int64

### Let's remove the rows with missing values (only customer id has any, and a transaction without a customer cannot be used for CLTV)

In [7]:
df.dropna(inplace=True)
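
`dropna()` without arguments drops a row if any column is missing. Here that is equivalent to dropping rows without a customer id, because no other column has missing values at this point. A small sketch on hypothetical data shows how the `subset` parameter makes that intent explicit and protects against losing rows for unrelated gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: one is missing a description, one is missing a customer id.
toy = pd.DataFrame({
    'description': ['PINK CHERRY LIGHTS', None, 'WHITE CHERRY LIGHTS'],
    'customer id': [13085.0, 13085.0, np.nan],
})

all_cols = toy.dropna()                       # drops both incomplete rows
only_id = toy.dropna(subset=['customer id'])  # drops only the row without a customer

print(len(all_cols), len(only_id))
```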

### Let's check the descriptive statistics again

In [8]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
quantity,407664.0,13.586,96.841,1.0,2.0,5.0,12.0,19152.0
price,407664.0,3.294,34.758,0.001,1.25,1.95,3.75,10953.5
customer id,407664.0,15368.593,1679.762,12346.0,13997.0,15321.0,16812.0,18287.0


### Let's calculate the total price

In [9]:
df['total_price'] = df['price'] * df['quantity']
df

,invoice,stockcode,description,quantity,invoicedate,price,customer id,country,total_price
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.950,13085.000,United Kingdom,83.400
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.750,13085.000,United Kingdom,81.000
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.750,13085.000,United Kingdom,81.000
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.100,13085.000,United Kingdom,100.800
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.250,13085.000,United Kingdom,30.000
...,...,...,...,...,...,...,...,...,...
525456,538171,22271,FELTCRAFT DOLL ROSIE,2,2010-12-09 20:01:00,2.950,17530.000,United Kingdom,5.900
525457,538171,22750,FELTCRAFT PRINCESS LOLA DOLL,1,2010-12-09 20:01:00,3.750,17530.000,United Kingdom,3.750
525458,538171,22751,FELTCRAFT PRINCESS OLIVIA DOLL,1,2010-12-09 20:01:00,3.750,17530.000,United Kingdom,3.750
525459,538171,20970,PINK FLORAL FELTCRAFT SHOULDER BAG,2,2010-12-09 20:01:00,3.750,17530.000,United Kingdom,7.500


### Let's aggregate the dataset to one row per customer (the CLTV format)

In [10]:
cltv = df.groupby('customer id').agg({
    'invoice': 'nunique',     # number of distinct invoices per customer
    'quantity': 'sum',        # total units purchased
    'total_price': 'sum'      # total amount spent
})
cltv

customer id,invoice,quantity,total_price
12346.000,11,70,372.860
12347.000,2,828,1323.320
12348.000,1,373,222.160
12349.000,3,993,2671.140
12351.000,1,261,300.930
...,...,...,...
18283.000,6,336,641.770
18284.000,1,494,461.680
18285.000,1,145,427.000
18286.000,2,608,1296.430


### Let's change the names of the variables

In [11]:
cltv.columns = ['total_transaction', 'total_unit', 'total_price']
cltv

customer id,total_transaction,total_unit,total_price
12346.000,11,70,372.860
12347.000,2,828,1323.320
12348.000,1,373,222.160
12349.000,3,993,2671.140
12351.000,1,261,300.930
...,...,...,...
18283.000,6,336,641.770
18284.000,1,494,461.680
18285.000,1,145,427.000
18286.000,2,608,1296.430
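
The aggregation and renaming steps above can also be done in a single call with pandas named aggregation. This is only a sketch on hypothetical rows (the `toy` frame is made up); the output column names are the ones used in this notebook:

```python
import pandas as pd

# Hypothetical transaction rows: customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    'customer id': [1, 1, 1, 2],
    'invoice': [100, 100, 101, 200],
    'quantity': [2, 3, 1, 10],
    'total_price': [4.0, 9.0, 5.0, 20.0],
})

# Each keyword is the output column name; the tuple is (input column, aggregation).
summary = toy.groupby('customer id').agg(
    total_transaction=('invoice', 'nunique'),
    total_unit=('quantity', 'sum'),
    total_price=('total_price', 'sum'),
)

print(summary)
```

Naming the columns at aggregation time avoids relying on column order when assigning `cltv.columns` afterwards.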


# Calculation of CLTV values

### Now each observation is unique and represents a different customer

### Let's get the average order value (total_price / total_transaction)

In [12]:
cltv['average_order_value'] = cltv['total_price'] / cltv['total_transaction']
cltv

customer id,total_transaction,total_unit,total_price,average_order_value
12346.000,11,70,372.860,33.896
12347.000,2,828,1323.320,661.660
12348.000,1,373,222.160,222.160
12349.000,3,993,2671.140,890.380
12351.000,1,261,300.930,300.930
...,...,...,...,...
18283.000,6,336,641.770,106.962
18284.000,1,494,461.680,461.680
18285.000,1,145,427.000,427.000
18286.000,2,608,1296.430,648.215


### Let's get the purchase frequency (total_transaction / total number of customers)

In [13]:
cltv['purchase_frequency'] = cltv['total_transaction'] / cltv.shape[0]
cltv

customer id,total_transaction,total_unit,total_price,average_order_value,purchase_frequency
12346.000,11,70,372.860,33.896,0.003
12347.000,2,828,1323.320,661.660,0.000
12348.000,1,373,222.160,222.160,0.000
12349.000,3,993,2671.140,890.380,0.001
12351.000,1,261,300.930,300.930,0.000
...,...,...,...,...,...
18283.000,6,336,641.770,106.962,0.001
18284.000,1,494,461.680,461.680,0.000
18285.000,1,145,427.000,427.000,0.000
18286.000,2,608,1296.430,648.215,0.000


### Let's get the repeat rate (share of customers with more than one transaction) and the churn rate (1 - repeat rate)

In [14]:
repeat_rate = cltv[cltv['total_transaction'] > 1].shape[0] / cltv.shape[0]
churn_rate = 1 - repeat_rate
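
As a quick check of these two definitions, here is a minimal sketch on five hypothetical customers. It uses the mean of a boolean mask, which gives the same ratio as dividing the two `shape[0]` counts:

```python
import pandas as pd

# Hypothetical transaction counts for five customers.
total_transaction = pd.Series([1, 3, 1, 2, 5], name='total_transaction')

# The mean of a boolean Series is the share of True values.
repeat_rate = (total_transaction > 1).mean()
churn_rate = 1 - repeat_rate

print(repeat_rate, churn_rate)
```

Note that both rates are single numbers for the whole customer base, not per-customer values, so every customer is later divided by the same `churn_rate`.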

### Let's get the profit margin (total_price * 0.1, i.e. an assumed 10% margin)

In [15]:
cltv['profit_margin'] = cltv['total_price'] * 0.1
cltv

customer id,total_transaction,total_unit,total_price,average_order_value,purchase_frequency,profit_margin
12346.000,11,70,372.860,33.896,0.003,37.286
12347.000,2,828,1323.320,661.660,0.000,132.332
12348.000,1,373,222.160,222.160,0.000,22.216
12349.000,3,993,2671.140,890.380,0.001,267.114
12351.000,1,261,300.930,300.930,0.000,30.093
...,...,...,...,...,...,...
18283.000,6,336,641.770,106.962,0.001,64.177
18284.000,1,494,461.680,461.680,0.000,46.168
18285.000,1,145,427.000,427.000,0.000,42.700
18286.000,2,608,1296.430,648.215,0.000,129.643


### Let's get the customer value (average_order_value * purchase_frequency)

In [16]:
cltv['customer_value'] = cltv['average_order_value'] * cltv['purchase_frequency']
cltv

customer id,total_transaction,total_unit,total_price,average_order_value,purchase_frequency,profit_margin,customer_value
12346.000,11,70,372.860,33.896,0.003,37.286,0.086
12347.000,2,828,1323.320,661.660,0.000,132.332,0.307
12348.000,1,373,222.160,222.160,0.000,22.216,0.052
12349.000,3,993,2671.140,890.380,0.001,267.114,0.619
12351.000,1,261,300.930,300.930,0.000,30.093,0.070
...,...,...,...,...,...,...,...
18283.000,6,336,641.770,106.962,0.001,64.177,0.149
18284.000,1,494,461.680,461.680,0.000,46.168,0.107
18285.000,1,145,427.000,427.000,0.000,42.700,0.099
18286.000,2,608,1296.430,648.215,0.000,129.643,0.301


### Let's get the CLTV value ((customer_value / churn_rate) * profit_margin)

In [17]:
cltv['cltv'] = (cltv['customer_value'] / churn_rate) * cltv['profit_margin']
cltv

customer id,total_transaction,total_unit,total_price,average_order_value,purchase_frequency,profit_margin,customer_value,cltv
12346.000,11,70,372.860,33.896,0.003,37.286,0.086,9.797
12347.000,2,828,1323.320,661.660,0.000,132.332,0.307,123.409
12348.000,1,373,222.160,222.160,0.000,22.216,0.052,3.478
12349.000,3,993,2671.140,890.380,0.001,267.114,0.619,502.818
12351.000,1,261,300.930,300.930,0.000,30.093,0.070,6.382
...,...,...,...,...,...,...,...,...
18283.000,6,336,641.770,106.962,0.001,64.177,0.149,29.025
18284.000,1,494,461.680,461.680,0.000,46.168,0.107,15.021
18285.000,1,145,427.000,427.000,0.000,42.700,0.099,12.849
18286.000,2,608,1296.430,648.215,0.000,129.643,0.301,118.445
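
Because every quantity above is built from total_price and total_transaction, the formulas can be chained to see what the final value depends on. The derivation below uses shorthand symbols that are not part of the notebook: $P_i$ for total_price and $T_i$ for total_transaction of customer $i$, $N$ for the number of customers, $c$ for churn_rate, and $m = 0.1$ for the margin rate.

$$
\begin{aligned}
\text{average\_order\_value}_i &= \frac{P_i}{T_i}, \qquad \text{purchase\_frequency}_i = \frac{T_i}{N} \\
\text{customer\_value}_i &= \frac{P_i}{T_i} \cdot \frac{T_i}{N} = \frac{P_i}{N} \\
\text{profit\_margin}_i &= m \, P_i \\
\text{cltv}_i &= \frac{\text{customer\_value}_i}{c} \cdot \text{profit\_margin}_i = \frac{m \, P_i^{2}}{N \, c}
\end{aligned}
$$

Under these definitions the transaction count cancels out, so cltv grows with the square of total_price, and ranking customers by cltv gives the same order as ranking them by total_price.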


### Let's sort the cltv values in descending order

In [18]:
cltv.sort_values(by='cltv', ascending=False)

customer id,total_transaction,total_unit,total_price,average_order_value,purchase_frequency,profit_margin,customer_value,cltv
18102.000,89,124216,349164.350,3923.195,0.021,34916.435,80.975,8591666.195
14646.000,78,170278,248396.500,3184.571,0.018,24839.650,57.606,4348190.360
14156.000,102,108107,196566.740,1927.125,0.024,19656.674,45.586,2722937.511
14911.000,205,69722,152147.570,742.183,0.048,15214.757,35.285,1631351.872
13694.000,94,125893,131443.190,1398.332,0.022,13144.319,30.483,1217569.570
...,...,...,...,...,...,...,...,...
18115.000,1,3,9.700,9.700,0.000,0.970,0.002,0.007
15040.000,1,1,7.490,7.490,0.000,0.749,0.002,0.004
15913.000,1,3,6.300,6.300,0.000,0.630,0.001,0.003
13788.000,1,1,3.750,3.750,0.000,0.375,0.001,0.001


# Segmentation of CLTV values

In [19]:
cltv['segment'] = pd.qcut(cltv['cltv'], 4, labels=['D', 'C', 'B', 'A'])
cltv.sort_values(by='cltv', ascending=False)

customer id,total_transaction,total_unit,total_price,average_order_value,purchase_frequency,profit_margin,customer_value,cltv,segment
18102.000,89,124216,349164.350,3923.195,0.021,34916.435,80.975,8591666.195,A
14646.000,78,170278,248396.500,3184.571,0.018,24839.650,57.606,4348190.360,A
14156.000,102,108107,196566.740,1927.125,0.024,19656.674,45.586,2722937.511,A
14911.000,205,69722,152147.570,742.183,0.048,15214.757,35.285,1631351.872,A
13694.000,94,125893,131443.190,1398.332,0.022,13144.319,30.483,1217569.570,A
...,...,...,...,...,...,...,...,...,...
18115.000,1,3,9.700,9.700,0.000,0.970,0.002,0.007,D
15040.000,1,1,7.490,7.490,0.000,0.749,0.002,0.004,D
15913.000,1,3,6.300,6.300,0.000,0.630,0.001,0.003,D
13788.000,1,1,3.750,3.750,0.000,0.375,0.001,0.001,D
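
`pd.qcut` splits the values into quantile-based bins with (roughly) equal numbers of customers, and the labels are assigned from the lowest bin to the highest, which is why 'D' comes first in the list. A minimal sketch on eight hypothetical values:

```python
import pandas as pd

# Hypothetical CLTV values, already distinct so the quartile edges are unique.
cltv_values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Labels run from the lowest quartile ('D') to the highest ('A').
segments = pd.qcut(cltv_values, 4, labels=['D', 'C', 'B', 'A'])

print(segments.tolist())
print(segments.value_counts().sort_index())
```

If many customers share the same value, the quantile edges can coincide and `qcut` raises an error about non-unique bin edges; that does not happen with the data in this notebook.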


### Let's analyse the segments in terms of mean, sum, and count

In [20]:
# A list (not a set) keeps the order of the aggregations deterministic
cltv.groupby('segment').agg(['mean', 'sum', 'count'])

,total_transaction,total_transaction,total_transaction,total_unit,total_unit,total_unit,total_price,total_price,total_price,average_order_value,average_order_value,average_order_value,purchase_frequency,purchase_frequency,purchase_frequency,profit_margin,profit_margin,profit_margin,customer_value,customer_value,customer_value,cltv,cltv,cltv
segment,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count
D,1.229,1325,1078,109.207,117725,1078,178.639,192573.08,1078,157.805,170113.415,1078,0.0,0.307,1078,17.864,19257.308,1078,0.041,44.66,1078,2.653,2860.027,1078
C,2.006,2163,1078,283.472,305583,1078,476.265,513414.153,1078,294.333,317290.818,1078,0.0,0.502,1078,47.627,51341.415,1078,0.11,119.066,1078,16.919,18238.301,1078
B,3.768,4062,1078,680.716,733812,1078,1132.303,1220622.59,1078,390.097,420525.058,1078,0.001,0.942,1078,113.23,122062.259,1078,0.263,283.076,1078,96.354,103869.942,1078
A,10.819,11663,1078,4064.224,4381234,1078,6405.745,6905393.451,1078,671.056,723397.877,1078,0.003,2.705,1078,640.575,690539.345,1078,1.486,1601.436,1078,23462.602,25292684.543,1078
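
The full table is wide because every column is aggregated three ways. When only the CLTV distribution per segment is of interest, selecting the column before aggregating gives a compact summary. A sketch on hypothetical data (the `toy` frame is made up; `observed=True` only tells pandas to report the categories that actually occur):

```python
import pandas as pd

# Hypothetical customers with an ordered categorical segment, as produced by pd.qcut.
toy = pd.DataFrame({
    'cltv': [1.0, 2.0, 10.0, 20.0],
    'segment': pd.Categorical(['D', 'D', 'A', 'A'], categories=['D', 'A'], ordered=True),
})

# Aggregating a single column yields one row per segment and one column per statistic.
summary = toy.groupby('segment', observed=True)['cltv'].agg(['count', 'mean', 'sum'])

print(summary)
```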


### Let's save the final file as a CSV file
#### `cltv.to_csv('cltv.csv')`
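
Since customer id is the index of the table, `to_csv` writes it as the first column, and it has to be named again when the file is read back. A minimal round-trip sketch on hypothetical rows, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Hypothetical final table with customer id as the index.
toy = pd.DataFrame(
    {'cltv': [500.0, 125.0], 'segment': ['A', 'D']},
    index=pd.Index([12346, 12347], name='customer id'),
)

# Write to an in-memory text buffer; a file path works the same way.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores customer id as the index instead of a regular column.
restored = pd.read_csv(buffer, index_col='customer id')
print(restored)
```

The segment column comes back as plain strings, so the category order (D < C < B < A) has to be re-applied if it is needed later.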

# **Thanks for checking my notebook!**