# Customer Segmentation using RFM analysis

This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

We will create customer segments based on Recency, Frequency and Monetary (RFM) analysis to understand our customer base. This knowledge can then be used to target customers, for example to retain them or to pitch offers.

### Importing libraries

In [1]:
import numpy as np
import pandas as pd


import time, warnings
import datetime as dt

#visualizations
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
import seaborn as sns

warnings.filterwarnings("ignore")

### Read the data

In [2]:
data = pd.read_csv('../data/commercial_data.csv')

data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620.0,United Kingdom
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620.0,United Kingdom
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620.0,United Kingdom
3,545220,22139,RETROSPOT TEA SET CERAMIC 11 PC,3,3/1/2011 8:30,4.95,14620.0,United Kingdom
4,545220,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,4,3/1/2011 8:30,3.75,14620.0,United Kingdom


In [3]:
data.shape

(236079, 8)

### Remove rows where CustomerID is missing

In [4]:
data.dropna(subset=['CustomerID'],how='all',inplace=True)
data.shape

(176137, 8)

## RFM Analysis
RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions and also for future personalization services.

**RECENCY (R)**: Days since last purchase

**FREQUENCY (F):** Total number of purchases

**MONETARY VALUE (M):** Total money this customer spent.

We will create those 3 customer attributes for each customer.
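The three attributes can also be sketched in a single aggregation. The example below is a minimal illustration on made-up transactions, not the notebook's data; the frame `tx`, the reference date `snapshot` and the result `rfm_toy` are hypothetical names introduced only for this sketch.

```python
import pandas as pd

# Toy transactions (made-up values, one row per invoice line)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-09-01", "2011-09-01", "2011-09-20", "2011-08-15", "2011-08-15"]
    ),
    "Quantity": [2, 1, 3, 5, 1],
    "UnitPrice": [4.0, 10.0, 2.0, 1.0, 20.0],
})
tx["TotalCost"] = tx["Quantity"] * tx["UnitPrice"]

# Hypothetical reference date from which recency is measured
snapshot = pd.Timestamp("2011-09-30")

rfm_toy = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("TotalCost", "sum"),                                 # total spend
)
print(rfm_toy)
```

The sections that follow build the same three columns step by step, which makes each intermediate table easier to inspect.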

## Recency
To calculate recency, we need a reference date from which to count how many days ago each customer's last purchase was.

### Find the latest date in the data to use as a reference

In [5]:
#last date available in our dataset
#note: InvoiceDate is still a string column here, so max() compares text, not dates
data['InvoiceDate'].max()

'9/9/2011 9:52'

In [6]:
#reference date: the last day of the period covered by the data set (09/12/2011)
now = dt.date(2011,12,9)
print(now)

2011-12-09
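Because `InvoiceDate` is read as text, its maximum is a string comparison rather than a date comparison. The sketch below uses three made-up values in the same month/day/year layout to show the difference; `raw` and `parsed` are hypothetical names used only here.

```python
import pandas as pd

# Made-up date strings in the same month/day/year layout as InvoiceDate
raw = pd.Series(["9/9/2011 9:52", "9/30/2011 15:52", "3/1/2011 8:30"])

# Strings are compared character by character, so "9/9..." sorts after "9/3..."
print(raw.max())

# Parsing first gives the chronologically latest timestamp
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")
print(parsed.max())
```

Parsing the column before taking the maximum is one way to check which date is really the latest in the data.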


### Create a new column called date which contains the date of invoice only

In [7]:
data['date'] = pd.DatetimeIndex(data['InvoiceDate']).date

In [8]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,date
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620.0,United Kingdom,2011-03-01
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620.0,United Kingdom,2011-03-01
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620.0,United Kingdom,2011-03-01
3,545220,22139,RETROSPOT TEA SET CERAMIC 11 PC,3,3/1/2011 8:30,4.95,14620.0,United Kingdom,2011-03-01
4,545220,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,4,3/1/2011 8:30,3.75,14620.0,United Kingdom,2011-03-01


### Check the last date of purchase with respect to CustomerID and calculate the RECENCY

In [9]:
#group by customer and find the last purchase date
recency_df = data.groupby(by='CustomerID', as_index=False)['date'].max()
recency_df.columns = ['CustomerID','LastPurshaceDate']
recency_df.head()

Unnamed: 0,CustomerID,LastPurshaceDate
0,12747.0,2011-08-22
1,12748.0,2011-09-30
2,12749.0,2011-08-01
3,12820.0,2011-09-26
4,12821.0,2011-05-09


In [10]:
#calculate recency
recency_df['Recency'] = recency_df['LastPurshaceDate'].apply(lambda x: (now - x).days)

In [11]:
recency_df.head()

Unnamed: 0,CustomerID,LastPurshaceDate,Recency
0,12747.0,2011-08-22,109
1,12748.0,2011-09-30,70
2,12749.0,2011-08-01,130
3,12820.0,2011-09-26,74
4,12821.0,2011-05-09,214
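As an alternative to `apply` with a lambda, the day difference can be computed for the whole column at once when the dates are stored as pandas datetimes. This is a minimal sketch using three of the last purchase dates shown above; `last_purchase`, `now_ts` and `recency_days` are names assumed for illustration.

```python
import pandas as pd

# Last purchase dates copied from the table above, indexed by CustomerID
last_purchase = pd.to_datetime(
    pd.Series(["2011-08-22", "2011-09-30", "2011-08-01"], index=[12747, 12748, 12749])
)
now_ts = pd.Timestamp("2011-12-09")  # same reference date as `now`

# Subtracting a datetime Series from a Timestamp yields timedeltas;
# .dt.days extracts the whole-day count
recency_days = (now_ts - last_purchase).dt.days
print(recency_days.tolist())
```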


## Frequency
Frequency helps us to know how many times a customer purchased from us. To do that we need to check how many invoices are registered by the same customer.

### Drop duplicate invoice rows

In [12]:
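# keep one row per invoice so that each invoice is counted once
# note: retail_uk_copy is another name for `data`, not a copy, so `data` is deduplicated too
retail_uk_copy = data
retail_uk_copy.drop_duplicates(subset=['InvoiceNo', 'CustomerID'], keep="first", inplace=True)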


### Calculate the frequency of purchases

In [13]:
#calculate frequency of purchases
frequency_df = retail_uk_copy.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
frequency_df.columns = ['CustomerID','Frequency']
frequency_df.head()

Unnamed: 0,CustomerID,Frequency
0,12747.0,5
1,12748.0,96
2,12749.0,3
3,12820.0,1
4,12821.0,1
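Distinct invoices can also be counted without dropping any rows, which leaves the line items available for later steps. The following is a sketch on made-up invoice lines, not the notebook's method; `lines` and `freq` are hypothetical names.

```python
import pandas as pd

# Made-up invoice lines: invoice "A1" has two line items
lines = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})

# Count distinct invoices per customer; `lines` itself is left untouched
freq = (
    lines.groupby("CustomerID")["InvoiceNo"]
    .nunique()
    .reset_index(name="Frequency")
)
print(freq)
```

With this approach the original frame keeps all of its rows, so a later sum over line items would cover every item on each invoice.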


## Monetary

**The Monetary attribute answers the question: how much money did the customer spend over time?**

### To do that, we first create a new column, TotalCost, holding the cost of each row (Quantity * UnitPrice)

In [14]:
#create column total cost
#note: `data` was deduplicated in the Frequency step, so only one row per invoice remains
data['TotalCost'] = data['Quantity'] * data['UnitPrice']

In [15]:
monetary_df = data.groupby(by='CustomerID',as_index=False).agg({'TotalCost': 'sum'})
monetary_df.columns = ['CustomerID','Monetary']
monetary_df.head()

Unnamed: 0,CustomerID,Monetary
0,12747.0,191.85
1,12748.0,1054.43
2,12749.0,67.0
3,12820.0,15.0
4,12821.0,19.92


### Create RFM Table

In [16]:
#merge recency dataframe with frequency dataframe
temp_df = recency_df.merge(frequency_df,on='CustomerID')
temp_df.head()

Unnamed: 0,CustomerID,LastPurshaceDate,Recency,Frequency
0,12747.0,2011-08-22,109,5
1,12748.0,2011-09-30,70,96
2,12749.0,2011-08-01,130,3
3,12820.0,2011-09-26,74,1
4,12821.0,2011-05-09,214,1


In [17]:
#merge with monetary dataframe to get a table with the 3 columns
rfm_df = temp_df.merge(monetary_df,on='CustomerID')
#use CustomerID as index
rfm_df.set_index('CustomerID',inplace=True)
#check the head
rfm_df.head()

CustomerID,LastPurshaceDate,Recency,Frequency,Monetary
12747.0,2011-08-22,109,5,191.85
12748.0,2011-09-30,70,96,1054.43
12749.0,2011-08-01,130,3,67.0
12820.0,2011-09-26,74,1,15.0
12821.0,2011-05-09,214,1,19.92


## Customer segments with RFM Model

**The simplest way to create customer segments from the RFM model is to use quartiles. We assign a score from 1 to 4 to each of Recency, Frequency and Monetary. Four is the best/highest value, and one is the lowest/worst value. The final RFM score is obtained by concatenating the three individual scores.**

Note: Quintiles (scores from 1 to 5) offer finer granularity if the business needs it, but segments become harder to define because there are 5 × 5 × 5 = 125 possible combinations (score codes 111 through 555) instead of the 64 that quartiles give. So, we will use quartiles.
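Each RFM code has three digits, so the number of possible codes is the number of score levels cubed. A small sketch that enumerates them, with `quartile_codes` and `quintile_codes` as illustrative names:

```python
from itertools import product

# Every three-digit code with digits 1-4 (quartiles) and 1-5 (quintiles)
quartile_codes = ["".join(map(str, c)) for c in product(range(1, 5), repeat=3)]
quintile_codes = ["".join(map(str, c)) for c in product(range(1, 6), repeat=3)]

print(len(quartile_codes), len(quintile_codes))
print(quartile_codes[0], quartile_codes[-1])
print(quintile_codes[0], quintile_codes[-1])
```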

### Find RFM quartiles

In [18]:
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles

Unnamed: 0,Recency,Frequency,Monetary
0.25,85.0,1.0,16.35
0.5,119.0,2.0,35.4
0.75,183.0,3.0,92.42


In [19]:
quantiles.to_dict()

{'Recency': {0.25: 85.0, 0.5: 119.0, 0.75: 183.0},
 'Frequency': {0.25: 1.0, 0.5: 2.0, 0.75: 3.0},
 'Monetary': {0.25: 16.35, 0.5: 35.400000000000006, 0.75: 92.42000000000002}}

## Creation of RFM Segments

We will create two scoring functions, since a high recency value is bad while high frequency and monetary values are good.



### Create functions based on the quartile values and apply them to create segments

In [21]:
# Arguments (x = value, p = column name: 'Recency', 'Frequency' or 'Monetary', d = quartiles table)
def RScore(x,p,d):
    '''Function suiting the requirement of high recency being bad'''
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1
    
# Arguments (x = value, p = column name: 'Recency', 'Frequency' or 'Monetary', d = quartiles table)
def FMScore(x,p,d):
    '''Function suiting the requirement of high frequency and monetary value being good'''
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

In [22]:
#create rfm segmentation table
rfm_segmentation = rfm_df
rfm_segmentation['R_Quartile'] = rfm_segmentation['Recency'].apply(RScore, args=('Recency',quantiles,))
rfm_segmentation['F_Quartile'] = rfm_segmentation['Frequency'].apply(FMScore, args=('Frequency',quantiles,))
rfm_segmentation['M_Quartile'] = rfm_segmentation['Monetary'].apply(FMScore, args=('Monetary',quantiles,))

In [23]:
rfm_segmentation.head()

CustomerID,LastPurshaceDate,Recency,Frequency,Monetary,R_Quartile,F_Quartile,M_Quartile
12747.0,2011-08-22,109,5,191.85,3,4,4
12748.0,2011-09-30,70,96,1054.43,4,4,4
12749.0,2011-08-01,130,3,67.0,2,3,3
12820.0,2011-09-26,74,1,15.0,4,1,1
12821.0,2011-05-09,214,1,19.92,1,1,2
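The same scoring can be sketched without a row-wise `apply` by binning with `pd.cut`. The example below is an illustration rather than the notebook's method: it reuses the first five Recency and Frequency values and the quartile cut points shown above, and `r_edges`, `f_edges`, `r_score` and `f_score` are names assumed for this sketch.

```python
import numpy as np
import pandas as pd

# First five Recency and Frequency values from the tables above
recency = pd.Series([109, 70, 130, 74, 214])
frequency = pd.Series([5, 96, 3, 1, 1])

# Quartile cut points from the quantiles table, padded with +/- infinity
r_edges = [-np.inf, 85.0, 119.0, 183.0, np.inf]
f_edges = [-np.inf, 1.0, 2.0, 3.0, np.inf]

# Bins are right-closed by default, matching the `x <= threshold` checks.
# Recency labels run 4..1 (low is good); Frequency labels run 1..4 (high is good).
r_score = pd.cut(recency, bins=r_edges, labels=[4, 3, 2, 1]).astype(int)
f_score = pd.cut(frequency, bins=f_edges, labels=[1, 2, 3, 4]).astype(int)

print(r_score.tolist())
print(f_score.tolist())
```

Passing explicit edges avoids the duplicate-bin problem that can arise when several quartiles of a column share the same value.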


### Now that each customer has a score, combine R_Quartile, F_Quartile and M_Quartile into a single RFM score

In [24]:
rfm_segmentation['RFMScore'] = rfm_segmentation.R_Quartile.map(str) \
                            + rfm_segmentation.F_Quartile.map(str) \
                            + rfm_segmentation.M_Quartile.map(str)
rfm_segmentation.head()

CustomerID,LastPurshaceDate,Recency,Frequency,Monetary,R_Quartile,F_Quartile,M_Quartile,RFMScore
12747.0,2011-08-22,109,5,191.85,3,4,4,344
12748.0,2011-09-30,70,96,1054.43,4,4,4,444
12749.0,2011-08-01,130,3,67.0,2,3,3,233
12820.0,2011-09-26,74,1,15.0,4,1,1,411
12821.0,2011-05-09,214,1,19.92,1,1,2,112


Best Recency score = 4: purchased most recently. Best Frequency score = 4: purchased most often. Best Monetary score = 4: spent the most.

### Find the best customers

In [25]:
rfm_segmentation[rfm_segmentation['RFMScore']=='444'].sort_values('Monetary', ascending=False).head(10)

CustomerID,LastPurshaceDate,Recency,Frequency,Monetary,R_Quartile,F_Quartile,M_Quartile,RFMScore
18102.0,2011-09-28,72,34,26632.62,4,4,4,444
17949.0,2011-09-30,70,32,22504.73,4,4,4,444
17450.0,2011-09-30,70,28,18009.06,4,4,4,444
16029.0,2011-09-20,80,39,15119.49,4,4,4,444
16013.0,2011-09-30,70,24,10402.34,4,4,4,444
12901.0,2011-09-19,81,20,5915.66,4,4,4,444
13798.0,2011-09-28,72,34,4648.8,4,4,4,444
17857.0,2011-09-28,72,12,4644.68,4,4,4,444
13694.0,2011-09-29,71,32,4472.68,4,4,4,444
15061.0,2011-09-27,73,23,3417.7,4,4,4,444


## Learner Activity

**1. Find the following:**
1. Best Customer

2. Loyal Customer

3. Big Spenders

4. Almost lost customers

5. Lost customers

**2. Now that we know our customer segments, how will you target them?**