# Customer Segmentation: RFM

#### Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately.

**WHY SEGMENTATION?**

- It helps in identifying your best and worst customers.
- It helps create more customer-oriented strategies for the company.
- It improves customer relationships through a better understanding of their needs.
- It can show the advantages and disadvantages of the business and products.
- It can help create new products and improve old ones based on customer needs.
- It improves customer service.
- It helps upsell and cross-sell other products and services.

### Customer Segmentation using RFM analysis

RFM (recency, frequency, monetary) analysis is a marketing technique used to determine quantitatively which customers are the best ones by examining how recently a customer has purchased (recency), how often they purchase (frequency), and how much the customer spends (monetary). RFM analysis is based on the marketing axiom that "80% of your business comes from 20% of your customers."
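The 80/20 axiom can be checked directly against customer spend. The snippet below is a minimal sketch on hypothetical toy data (the customer labels and amounts are made up for illustration, not taken from the dataset used later); it measures what share of revenue comes from the top 20% of customers.

```python
import pandas as pd

# Hypothetical toy data: total spend per customer (not from the real dataset).
spend = pd.Series(
    [5000, 3000, 400, 350, 300, 250, 250, 200, 150, 100],
    index=[f"C{i}" for i in range(1, 11)],
    name="monetary",
)

# Sort customers from highest to lowest spend and keep the top 20%.
top_n = int(len(spend) * 0.2)
top_share = spend.sort_values(ascending=False).head(top_n).sum() / spend.sum()

print(f"Top 20% of customers generate {top_share:.0%} of revenue")
```

On real data the share will rarely be exactly 80%, but the same calculation shows how concentrated revenue is.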

![](docs/rfm.png)

Image from: https://clevertap.com/blog/rfm-analysis/

For this analysis we will use the data set we studied last week. [Find the preprocessing steps here.](https://github.com/LilitYolyan/customer_behavior_analysis/blob/master/Week_2_Data_Preparation_and_EDA.ipynb)

### RFM segmentation

In the previous analysis, we found that we have order-level data, which means that there can be multiple rows for the same customer with the same date and time. As the first step of RFM analysis, we need to transform our dataset to the customer level, where each row represents information about one customer.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

%matplotlib inline


In [None]:
# Dataset
data = pd.read_csv('data/data_cleared.csv')
data.head()

In [None]:
# Aggregate to one row per customer and invoice timestamp (first step towards the customer-level dataset)
dt = data.groupby(['CustomerID', 'InvoiceDate'], as_index=False)['TotalPrice'].sum().rename(columns={'TotalPrice':'Budget'})
# uncomment if as_index=True
# pd.DataFrame(dt.head())
dt.head()

In [None]:
dt['InvoiceDate'] = pd.to_datetime(dt['InvoiceDate'])
print("The last date in our dataset: ", dt.InvoiceDate.max())
now = datetime.datetime(2011,12,10)
print('The last date to count recency: ', now)

In [None]:
# datetime.datetime.today()
# datetime.datetime.strptime("2012-Dec-01", "%Y-%b-%d")
# datetime.datetime.strftime

- **For Recency**, calculate the number of days between the present date and the date of each customer's last purchase.
- **For Frequency**, calculate the number of orders for each customer.
- **For Monetary**, calculate the sum of purchase prices for each customer.
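The three definitions above can be sketched on a tiny example. The data here is hypothetical, and `orders`, `snapshot` and `toy_rfm` are illustrative names; the sketch uses pandas named aggregation, which is one possible way to write the computation.

```python
import datetime
import pandas as pd

# Hypothetical toy orders: one row per customer and order date.
orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-11-20", "2011-12-05",
        "2011-06-15",
        "2011-10-01", "2011-12-09",
    ]),
    "Budget": [10.0, 20.0, 30.0, 100.0, 5.0, 15.0],
})
snapshot = datetime.datetime(2011, 12, 10)  # reference "present" date

toy_rfm = orders.groupby("CustomerID").agg(
    # Days between the snapshot date and the customer's last order.
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of orders placed by the customer.
    frequency=("InvoiceDate", "count"),
    # Total amount spent by the customer.
    monetary=("Budget", "sum"),
).reset_index()
print(toy_rfm)
```

Customer 1 ordered three times, most recently 5 days before the snapshot date, and spent 60 in total, so the row reads recency 5, frequency 3, monetary 60.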

In [None]:
# rfm = dt.groupby(['CustomerID'],as_index=True).agg(
#     {
#         'InvoiceDate': lambda date: (now - date.max()).days,
#         'CustomerID': 'count',
#         'Budget': 'sum'})

In [None]:
rfm = dt.groupby(['CustomerID'], as_index=True).agg(
    {
        'InvoiceDate': lambda date: (now - date.max()).days,
        'CustomerID': 'count',
        'Budget': 'sum'}).rename(columns={'InvoiceDate':'recency',
            'CustomerID': 'frequency',
            'Budget': 'monetary'}).reset_index()

In [None]:
rfm.describe()

If we look closely at the table above, we can see that the frequency variable barely changes across the lower quantiles: at the 0.5 quantile it is still only 2. RFM analysis segments customers by quantiles, and four distinct quartile bins cannot be built for the "frequency" column. To solve this problem, we will use 3 groups: up to the 0.5 quantile, from 0.5 to 0.75, and from 0.75 to 1.
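The problem is easy to reproduce. The following is a minimal sketch on hypothetical frequency values (not the real data): when many customers share the same low count, neighbouring quantile edges coincide and `pd.qcut` refuses to build four bins, while the custom edges `[0, .5, .75, 1]` work.

```python
import pandas as pd

# Hypothetical frequency values with many repeated low counts.
freq = pd.Series([1, 1, 1, 1, 2, 2, 3, 5, 8, 20])

# Four equal quantile bins fail: the minimum and the 0.25 quantile are both 1.
try:
    pd.qcut(freq, 4)
    quartiles_ok = True
except ValueError as err:
    quartiles_ok = False
    print("qcut with 4 bins failed:", err)

# Custom quantile edges give three distinct bins.
# Label '1' marks the most frequent buyers, label '3' the least frequent.
f_group = pd.qcut(freq, [0, .5, .75, 1], labels=['3', '2', '1'])
print(f_group.value_counts().sort_index())
```

With these toy values, six customers land in group '3', one in group '2' and three in group '1'.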

**RFM features**

In [None]:
sns.histplot(rfm.recency, kde=True, stat='density')  # distplot is deprecated in recent seaborn versions
plt.axvline(rfm.recency.mean(), color='k', linestyle='dashed', linewidth=1)
plt.axvline(rfm.recency.median(), color='r', linewidth=1)
plt.title('Distribution of Recency')
plt.xlabel("Recency")

Here, we can see that the distribution is right-skewed. The median is somewhere around 50 days. Therefore we can make two assumptions:

1. Many customers recently made their first purchase.
2. A typical customer returns within about 50 days.

The customers in the long right tail of the distribution are most likely the ones who have churned.
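Right skewness can also be checked numerically rather than only by eye. This is a minimal sketch on hypothetical recency values (illustrative numbers, not the real data): in a right-skewed distribution the mean sits above the median and the skewness coefficient is positive.

```python
import pandas as pd

# Hypothetical recency values in days: most are small, a few are very large.
recency = pd.Series([3, 7, 12, 20, 35, 50, 60, 90, 250, 370])

mean_days = recency.mean()
median_days = recency.median()
skewness = recency.skew()  # positive for a right-skewed distribution

print(f"mean={mean_days:.1f}, median={median_days:.1f}, skew={skewness:.2f}")
```

The same three calls can be applied to `rfm.recency` to confirm what the plot suggests.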

In [None]:
sns.histplot(rfm.frequency, kde=True, stat='density')  # distplot is deprecated in recent seaborn versions
plt.axvline(rfm.frequency.mean(), color='k', linestyle='dashed', linewidth=1)
plt.axvline(rfm.frequency.median(), color='r', linewidth=1)
plt.title('Distribution of Frequency')
plt.xlabel("Frequency")

Again, the distribution is highly skewed. Only a few customers purchased more than 25 times. A typical customer makes only two purchases in a lifetime.

In [None]:
sns.histplot(rfm.monetary, kde=True, stat='density')  # distplot is deprecated in recent seaborn versions
plt.axvline(rfm.monetary.mean(), color='k', linestyle='dashed', linewidth=1)
plt.axvline(rfm.monetary.median(), color='r', linewidth=1)
plt.title('Distribution of Monetary value')
plt.xlabel("Monetary value")

As with frequency, a few customers spend large amounts of money on our products; however, on average, customers spend 666 pounds in total.
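Note that `monetary` is a per-customer total, so a per-order figure has to be derived by dividing by `frequency`. The snippet below is a minimal sketch on a hypothetical table that reuses the `rfm` column names; `toy`, `avg_order_value` and `overall_per_order` are illustrative names.

```python
import pandas as pd

# Hypothetical customer-level table with the same column names as `rfm`.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "frequency": [3, 1, 2],
    "monetary": [60.0, 100.0, 20.0],
})

# Average order value per customer: total spend divided by number of orders.
toy["avg_order_value"] = toy["monetary"] / toy["frequency"]

# Overall average per order: total revenue divided by total number of orders.
overall_per_order = toy["monetary"].sum() / toy["frequency"].sum()

print(toy)
print("Average spend per order:", overall_per_order)
```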

**Computing quantiles of RFM values**

In [None]:
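# Lower recency is better, so the most recent buyers get label '1'.
rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, labels=['1', '2', '3', '4'])
# In the frequency column, we manually set the quantile edges to solve the problem described above.
# Higher frequency is better, so the most frequent buyers get label '1'.
rfm['f_quartile'] = pd.qcut(rfm['frequency'], [0, .5, .75, 1], labels=['3', '2', '1'])
# Higher monetary value is better, so the top spenders get label '1'.
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, labels=['4', '3', '2', '1'])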

In [None]:
# Playground
# pd.qcut(rfm['recency'], 4, ['group 1 customer','group 2 customer','group 3 customer','group 4 customer'])

In [None]:
rfm.head()

**RFM Result Interpretation**

In [None]:
rfm['RFM_Score'] = rfm.r_quartile.astype(str) + rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)
rfm.head(30)

In [None]:
results = rfm.groupby('RFM_Score', as_index=False)['CustomerID'].count()
sorted_results = results.rename(columns={'CustomerID': 'Customers'}).sort_values('Customers', ascending=False)
show_top_x = 10
tops = sorted_results.head(show_top_x)
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally.
sns.barplot(x=tops.RFM_Score, y=tops.Customers)
plt.title(f'Top {show_top_x} largest segments')

We have very interesting results. As you can see, the two largest segments are our best and worst segments. On the one hand, we have a lot of loyal customers who are willing to spend more money on our products (111), and on the other hand, we have a large group of customers who have the shortest lifespan and are likely to churn soon (434).
<br>
<br>

This information can be used by managers, marketers and salespeople to make their actions more customer-centric.
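As one illustration of how the scores could be put to use, the sketch below assigns a campaign to each customer based on the RFM score. The data, the `campaigns` mapping and the campaign names are all hypothetical and are not part of the original analysis.

```python
import pandas as pd

# Hypothetical scored customers; the score format matches `RFM_Score` above.
scored = pd.DataFrame({
    "CustomerID": [101, 102, 103, 104, 105, 106],
    "RFM_Score": ["111", "434", "111", "212", "434", "111"],
})

# Hypothetical mapping from scores to campaign names.
campaigns = {"111": "loyalty rewards", "434": "win-back offer"}

# Customers whose score has no dedicated campaign get a default one.
scored["campaign"] = scored["RFM_Score"].map(campaigns).fillna("standard newsletter")

# Collect the customer IDs to target in each campaign.
targets = scored.groupby("campaign")["CustomerID"].apply(list).to_dict()
print(targets)
```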
<br>
<br>
Food for thought: what other assumptions can we make about this segmentation?