# RFM Segmentation
RFM (Recency, Frequency, Monetary) Segmentation is a customer analysis method used to group customers based on:
- Recency: How recently the customer made their last transaction.
- Frequency: How often the customer makes transactions.
- Monetary: How much money the customer spends.

The goal of this segmentation is to group customers by their value and behavior so that companies can target their marketing more precisely. Because RFM is built entirely from the transaction history of existing customers, it is commonly used to support customer retention, loyalty, and revenue-growth programs.
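As a minimal sketch of the three measures, the example below computes recency, frequency, and monetary values on a small hypothetical transaction table. The table `toy`, its values, and the output column names are assumptions made for illustration; they are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical transactions (illustrative values only)
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_id": [1, 2, 3, 4, 5, 6],
    "date": pd.to_datetime(["2010-01-05", "2010-03-01", "2010-02-10",
                            "2010-01-20", "2010-02-15", "2010-03-10"]),
    "amount": [20.0, 35.0, 15.0, 10.0, 12.5, 40.0],
})

# Reference date for recency: the latest date in the data
snapshot = toy["date"].max()

# One row per customer: last order date, number of orders, total spend
rfm_toy = toy.groupby("customer_id").agg(
    last_order=("date", "max"),
    frequency=("order_id", "nunique"),
    monetary=("amount", "sum"),
)
# Recency in days between the snapshot and each customer's last order
rfm_toy["recency_days"] = (snapshot - rfm_toy["last_order"]).dt.days
print(rfm_toy)
```

Customer C bought most recently, most often, and spent the most, so it would rank highest on all three dimensions.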

# Import packages

In [2]:
import pandas as pd
import numpy as np
import datetime as dt

In [3]:
import os
os.getcwd()

'C:\\Users\\LENOVO\\Python\\Intermediate'

# Import data from CSV to DataFrame

In [4]:
df = pd.read_csv('C:/Users/LENOVO/Python/Online Retail Data.csv', header=0)
df

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0
3,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,2010-01-04 09:54:00,0.85,
4,493413,84578,ELEPHANT TOY WITH BLUE T-SHIRT,1,2010-01-04 09:54:00,3.75,
...,...,...,...,...,...,...,...
461768,539991,21618,4 WILDFLOWER BOTANICAL CANDLES,1,2010-12-23 16:49:00,1.25,
461769,539991,72741,GRAND CHOCOLATECANDLE,4,2010-12-23 16:49:00,1.45,
461770,539992,21470,FLOWER VINE RAFFIA FOOD COVER,1,2010-12-23 17:41:00,3.75,
461771,539992,22258,FELT FARM ANIMAL RABBIT,1,2010-12-23 17:41:00,1.25,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 461773 entries, 0 to 461772
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      461773 non-null  object 
 1   product_code  461773 non-null  object 
 2   product_name  459055 non-null  object 
 3   quantity      461773 non-null  int64  
 4   order_date    461773 non-null  object 
 5   price         461773 non-null  float64
 6   customer_id   360853 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 24.7+ MB


In [6]:
df.isna().sum()

order_id             0
product_code         0
product_name      2718
quantity             0
order_date           0
price                0
customer_id     100920
dtype: int64

# Data cleansing

In [7]:
df_clean = df.copy()

## Convert order_date to datetime
Converts the order timestamp to a datetime type so that it can be used to calculate recency and time trends.

In [8]:
df_clean['order_date'] = pd.to_datetime(df_clean['order_date'])

## Delete all rows without customer_id
Removes anonymous transactions, which cannot be analyzed per customer.

In [9]:
df_clean = df_clean[~df_clean['customer_id'].isna()]

## Delete all rows without product_name
Maintains product information consistency; rows without a product name may indicate incomplete data or an input error.

In [10]:
df_clean = df_clean[~df_clean['product_name'].isna()]

## Make all product_names lowercase
Standardizes naming so that case variants of the same name are not treated as different products (e.g., "Shoes" ≠ "shoes").

In [11]:
df_clean['product_name'] = df_clean['product_name'].str.lower()

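## Remove rows whose product_code and product_name both mark them as test data
Removes dummy entries or trial data that do not represent real customer behavior. As written, the filter drops a row only when its product_code contains 'test' and its product_name contains 'test ' (with a trailing space); a row that matches just one of the two conditions is kept.

In [12]:
# Flag rows whose code and name both look like test data, then keep the rest
is_test_code = df_clean['product_code'].str.lower().str.contains('test')
is_test_name = df_clean['product_name'].str.contains('test ')
df_clean = df_clean[~(is_test_code & is_test_name)]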

## Create an order_status column: 'cancelled' if the order_id starts with 'C', otherwise 'delivered'
Explicitly distinguishes between cancelled and delivered transactions for more accurate segmentation analysis.

In [13]:
df_clean['order_status'] = np.where(df_clean['order_id'].str[:1]=='C', 'cancelled', 'delivered')

## Convert negative quantities to positive
Negative quantities only indicate that an order was cancelled, not an actual quantity, so the absolute value is used instead.

In [14]:
df_clean['quantity'] = df_clean['quantity'].abs()

## Remove rows with zero or negative price
Maintains transaction validity; a price of zero or below is illogical in a purchasing context.

In [15]:
df_clean = df_clean[df_clean['price']>0]

## Create amount (quantity * price)
Calculates the total transaction value per row, which is the basis for the monetary score in RFM.

In [16]:
df_clean['amount'] = df_clean['quantity'] * df_clean['price']

## Standardize product_name per product_code
Replaces every name variant of a product_code with the name that appears in the most orders for that code, so each code maps to a single name.

In [17]:
most_freq_product_name = df_clean.groupby(
    ['product_code','product_name'], as_index=False).agg(
    order_cnt=('order_id','nunique')).sort_values(
    ['product_code','order_cnt'], ascending=[True,False])
most_freq_product_name['rank'] = most_freq_product_name.groupby(
    'product_code')['order_cnt'].rank(method='first', ascending=False)
most_freq_product_name = most_freq_product_name[most_freq_product_name['rank']==1].drop(
    columns=['order_cnt','rank'])

In [18]:
df_clean = df_clean.merge(
    most_freq_product_name.rename(
        columns={'product_name':'most_freq_product_name'}), how='left', on='product_code')
df_clean['product_name'] = df_clean['most_freq_product_name']
df_clean = df_clean.drop(columns='most_freq_product_name')
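The same idea can be sketched more compactly with `drop_duplicates` and `map` in place of the rank and merge steps. This is an alternative formulation on a small hypothetical table (`toy` and its rows are assumptions for illustration), not the code used in this notebook.

```python
import pandas as pd

# Hypothetical rows: product 21539 appears under two name variants
toy = pd.DataFrame({
    "order_id": ["1", "2", "3", "4", "5"],
    "product_code": ["21539", "21539", "21539", "21844", "21844"],
    "product_name": ["red retrospot butter dish", "red retrospot butter dish",
                     "retro spots butter dish", "red retrospot mug",
                     "red retrospot mug"],
})

# Count distinct orders per (code, name), most frequent name first within each code
name_counts = (toy.groupby(["product_code", "product_name"], as_index=False)
                  .agg(order_cnt=("order_id", "nunique"))
                  .sort_values(["product_code", "order_cnt"],
                               ascending=[True, False]))

# Keep the first (most frequent) name per code as a code -> name lookup
canonical = (name_counts.drop_duplicates("product_code")
                        .set_index("product_code")["product_name"])

# Overwrite every name with the canonical name of its code
toy["product_name"] = toy["product_code"].map(canonical)
print(toy)
```

After the mapping, each product_code carries exactly one product_name.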

## Convert customer_id to string
Treats customer_id as a categorical identifier rather than a number. Because the column was read as float, the resulting strings keep the decimal suffix (e.g., '12346.0').

In [19]:
df_clean['customer_id'] = df_clean['customer_id'].astype(str)

## Remove outliers
Drops rows whose quantity or amount lies 3 or more standard deviations from the mean (|z-score| ≥ 3), so that extreme rows do not distort the recency, frequency, and monetary metrics.

In [20]:
from scipy import stats
df_clean = df_clean[(np.abs(stats.zscore(df_clean[['quantity','amount']]))<3).all(axis=1)]
df_clean = df_clean.reset_index(drop=True)
df_clean

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,order_status,amount
0,C493411,21539,red retrospot butter dish,1,2010-01-04 09:43:00,4.25,14590.0,cancelled,4.25
1,493414,21844,red retrospot mug,36,2010-01-04 10:28:00,2.55,14590.0,delivered,91.80
2,493414,21533,retro spot large milk jug,12,2010-01-04 10:28:00,4.25,14590.0,delivered,51.00
3,493414,37508,new england ceramic cake server,2,2010-01-04 10:28:00,2.55,14590.0,delivered,5.10
4,493414,35001G,hand open shape gold,2,2010-01-04 10:28:00,4.25,14590.0,delivered,8.50
...,...,...,...,...,...,...,...,...,...
358464,539988,84380,set of 3 butterfly cookie cutters,1,2010-12-23 16:06:00,1.25,18116.0,delivered,1.25
358465,539988,84849D,hot baths soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358466,539988,84849B,fairy soap soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358467,539988,22854,cream sweetheart egg holder,2,2010-12-23 16:06:00,4.95,18116.0,delivered,9.90
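To illustrate what the z-score filter does, the sketch below applies the same rule to a hypothetical table with one extreme row. It computes the z-score with plain pandas, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. The data in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 200 ordinary rows plus one extreme row at the top
quantity = [5000] + [1, 2, 3, 4, 5] * 40
toy = pd.DataFrame({
    "quantity": quantity,
    "amount": [q * 2.0 for q in quantity],
})

# Column-wise z-score with the population standard deviation (ddof=0)
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep only rows where every column is within 3 standard deviations
kept = toy[(z.abs() < 3).all(axis=1)].reset_index(drop=True)
print(len(toy), "->", len(kept))
```

The mean and standard deviation are themselves pulled upward by extreme rows, so on heavily skewed data this rule removes only the most extreme values.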


## Rename the order_date column to date
For naming consistency and to conform to more common conventions.

In [23]:
df_clean = df_clean.rename(columns={'order_date':'date'})
df_clean

Unnamed: 0,order_id,product_code,product_name,quantity,date,price,customer_id,order_status,amount
0,C493411,21539,red retrospot butter dish,1,2010-01-04 09:43:00,4.25,14590.0,cancelled,4.25
1,493414,21844,red retrospot mug,36,2010-01-04 10:28:00,2.55,14590.0,delivered,91.80
2,493414,21533,retro spot large milk jug,12,2010-01-04 10:28:00,4.25,14590.0,delivered,51.00
3,493414,37508,new england ceramic cake server,2,2010-01-04 10:28:00,2.55,14590.0,delivered,5.10
4,493414,35001G,hand open shape gold,2,2010-01-04 10:28:00,4.25,14590.0,delivered,8.50
...,...,...,...,...,...,...,...,...,...
358464,539988,84380,set of 3 butterfly cookie cutters,1,2010-12-23 16:06:00,1.25,18116.0,delivered,1.25
358465,539988,84849D,hot baths soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358466,539988,84849B,fairy soap soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358467,539988,22854,cream sweetheart egg holder,2,2010-12-23 16:06:00,4.95,18116.0,delivered,9.90


# Creating RFM segmentation

## Aggregate transactions per customer: number of orders, total order value, and last order date
Goals:
* Simplify transaction data into a single row per customer.
* Prepare RFM features: recency (last date), frequency (number of transactions), and monetary (total spending).
* Support customer segmentation based on value and spending behavior.
* Facilitate customer data-driven analysis and marketing strategies.

Aggregating transactions into a summary per user aims to evaluate individual customer value and behavior, which is the main foundation for RFM-based segmentation and customer retention strategies.

In [24]:
df_user = df_clean.groupby('customer_id', as_index=False).agg(
    order_cnt=('order_id','nunique'),
    max_order_date=('date','max'),total_order_value=('amount','sum'))
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value
0,12346.0,5,2010-10-04 09:54:00,602.40
1,12608.0,1,2010-10-31 10:49:00,415.79
2,12745.0,2,2010-08-10 10:14:00,723.85
3,12746.0,2,2010-06-30 08:19:00,266.35
4,12747.0,19,2010-12-13 10:41:00,4094.79
...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77
3885,18284.0,2,2010-10-06 12:31:00,486.68
3886,18285.0,1,2010-02-17 10:24:00,427.00
3887,18286.0,2,2010-08-20 11:57:00,941.48


## Create a column for the number of days since the last order
* The purpose is to measure Recency, which is how long (in days) since the customer last made a transaction.
* The day_since_last_order column is the number of whole days between the customer's last order and the latest date in the dataset, which the code uses as today.
* A smaller value means the customer is still active or has recently made a transaction.
* A larger value means the customer hasn't made a transaction in a while and may be inactive.

In [26]:
today = df_clean['date'].max()
df_user['day_since_last_order'] = (today - df_user['max_order_date']).dt.days
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order
0,12346.0,5,2010-10-04 09:54:00,602.40,80
1,12608.0,1,2010-10-31 10:49:00,415.79,53
2,12745.0,2,2010-08-10 10:14:00,723.85,135
3,12746.0,2,2010-06-30 08:19:00,266.35,176
4,12747.0,19,2010-12-13 10:41:00,4094.79,10
...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31
3885,18284.0,2,2010-10-06 12:31:00,486.68,78
3886,18285.0,1,2010-02-17 10:24:00,427.00,309
3887,18286.0,2,2010-08-20 11:57:00,941.48,125
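One detail worth knowing is that `.dt.days` counts completed 24-hour periods, not calendar days. The sketch below uses the dataset's latest timestamp and the last order of customer 12346.0 from the table above, plus one hypothetical timestamp (`2010-12-22 17:00:00`) added only to show the difference.

```python
import pandas as pd

today = pd.Timestamp("2010-12-23 16:06:00")  # latest date in the dataset
last_order = pd.Series(pd.to_datetime([
    "2010-10-04 09:54:00",   # last order of customer 12346.0
    "2010-12-22 17:00:00",   # hypothetical order placed the previous evening
]))

# Completed 24-hour periods, as in the notebook
elapsed_days = (today - last_order).dt.days

# Calendar-day difference: strip the time of day before subtracting
calendar_days = (today.normalize() - last_order.dt.normalize()).dt.days

print(elapsed_days.tolist(), calendar_days.tolist())
```

The two versions differ by at most one day, so the choice rarely changes a customer's recency score, but it is worth making deliberately.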


In [27]:
df_user.describe()

Unnamed: 0,order_cnt,max_order_date,total_order_value,day_since_last_order
count,3889.0,3889,3889.0,3889.0
mean,5.128568,2010-09-23 18:15:51.267678208,1544.623084,90.651581
min,1.0,2010-01-05 12:43:00,1.25,0.0
25%,1.0,2010-08-19 12:30:00,296.36,25.0
50%,3.0,2010-10-26 18:45:00,648.2,57.0
75%,6.0,2010-11-28 14:54:00,1585.94,126.0
max,163.0,2010-12-23 16:06:00,71970.39,352.0
std,8.49933,,3434.816315,88.883201


## Bin the number of days since the last order into 5 bins with boundaries min, P20, P40, P60, P80, and max, and label them 5 to 1 (fewest days = 5) as the recency score
* Converts the recency value (number of days since the last transaction) into a discrete score of 1–5, so customers can be compared by how recently they transacted.
* The fewer days since the last order, the higher the score.
* It is crucial in strategies such as retargeting, rewarding active customers, and reactivating lapsed customers.
* Recency score meaning

| recency_score | Meaning |
| -------------- | ------------------------- |
| 5 | Very recent, very active |
| 4 | Recent |
| 3 | Fairly active |
| 2 | Starting to be inactive |
| 1 | Inactive / old |

In [28]:
df_user['recency_score'] = pd.cut(df_user['day_since_last_order'],
                                  bins=[df_user['day_since_last_order'].min(),
                                        np.percentile(df_user['day_since_last_order'], 20),
                                        np.percentile(df_user['day_since_last_order'], 40),
                                        np.percentile(df_user['day_since_last_order'], 60),
                                        np.percentile(df_user['day_since_last_order'], 80),
                                        df_user['day_since_last_order'].max()],
                                  labels=[5, 4, 3, 2, 1],
                                  include_lowest=True).astype(int)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5
...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2
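For comparison, quantile-based binning can also be written with `pd.qcut`, which computes the percentile edges itself. This is a swapped-in alternative to the `pd.cut` call above, shown on ten hypothetical recency values rather than on `df_user`.

```python
import pandas as pd

# Hypothetical day_since_last_order values, sorted for readability
days = pd.Series([0, 3, 10, 25, 31, 53, 57, 80, 126, 352])

# Five equal-sized quantile bins; the fewest days get the highest score
recency_score = pd.qcut(days, q=5, labels=[5, 4, 3, 2, 1]).astype(int)

print(pd.DataFrame({"days": days, "recency_score": recency_score}))
```

With ten distinct values, each score is assigned to exactly two customers. On real data with many tied values, `pd.qcut` can fail on duplicate edges in the same way as `pd.cut`.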


In [29]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 182.4+ KB


## Bin the total number of transactions (orders) into 5 bins with boundaries 0, P20, P40, P60, P80, and max, and label them 1 to 5 from lowest to highest as the frequency score
* Converts the total number of transactions per customer (order_cnt) into a discrete score of 1–5 to measure how frequently a customer transacts.
* The frequency_score helps identify loyal (high-order) customers versus passive customers.
* Important for rewarding loyal customers, segmenting membership programs, or upselling strategies.
* Meaning of frequency_score

| frequency_score | Meaning |
| ---------------- | ----------------------- |
| 5 | Very frequent transactions |
| 4 | Fairly frequent |
| 3 | Average |
| 2 | Rarely |
| 1 | Very rarely |

In [43]:
# Disabled: with order_cnt.min() as the lowest edge, the first two bin edges are
# both 1 (min equals P20), and pd.cut raises a ValueError for non-unique bin edges.
# df_user['frequency_score'] = pd.cut(df_user['order_cnt'],
#                                     bins=[df_user['order_cnt'].min(),
#                                           np.percentile(df_user['order_cnt'], 20),
#                                           np.percentile(df_user['order_cnt'], 40),
#                                           np.percentile(df_user['order_cnt'], 60),
#                                           np.percentile(df_user['order_cnt'], 80),
#                                           df_user['order_cnt'].max()],
#                                     labels=[1, 2, 3, 4, 5],
#                                     include_lowest=True).astype(int)

In [31]:
df_user['frequency_score'] = pd.cut(df_user['order_cnt'],
                                    bins=[0,
                                          np.percentile(df_user['order_cnt'], 20),
                                          np.percentile(df_user['order_cnt'], 40),
                                          np.percentile(df_user['order_cnt'], 60),
                                          np.percentile(df_user['order_cnt'], 80),
                                          df_user['order_cnt'].max()],
                                    labels=[1, 2, 3, 4, 5],
                                    include_lowest=True).astype(int)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score,frequency_score
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2,4
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3,1
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2,2
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1,2
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5,5
...,...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4,4
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2,2
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1,1
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2,2
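The frequency bins start at 0 rather than at the minimum order count. A likely reason, sketched here on hypothetical order counts, is that when many customers have a single order the minimum and P20 coincide, and `pd.cut` rejects duplicate bin edges. The series `order_cnt` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical order counts: several customers with exactly one order
order_cnt = pd.Series([1, 1, 1, 2, 2, 3, 4, 6, 10, 19])

# Edges as min, P20, P40, P60, P80, max: min and P20 are both 1
edges = ([order_cnt.min()]
         + [np.percentile(order_cnt, p) for p in (20, 40, 60, 80)]
         + [order_cnt.max()])

try:
    pd.cut(order_cnt, bins=edges, labels=[1, 2, 3, 4, 5], include_lowest=True)
    raised = False
except ValueError:
    # pd.cut refuses bin edges that are not unique
    raised = True

# Workaround used in the notebook: start the first bin at 0
edges[0] = 0
frequency_score = pd.cut(order_cnt, bins=edges, labels=[1, 2, 3, 4, 5],
                         include_lowest=True).astype(int)
print(raised, frequency_score.tolist())
```

Starting at 0 keeps five distinct edges, and every customer with a single order falls into the lowest bin.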


In [32]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
 6   frequency_score       3889 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 212.8+ KB


## Bin the total order value into 5 bins with boundaries min, P20, P40, P60, P80, and max, and label them 1 to 5 from lowest to highest as the monetary score
* Converts the total customer transaction value (total_order_value) into a discrete score from 1–5 that represents the customer's financial contribution to the business.
* The monetary_score indicates which customers generate the most revenue.
* Suitable for premium customer retention strategies, exclusive promotions, or priority service.
* Meaning of monetary_score

| monetary_score | Meaning |
| --------------- | ------------------------- |
| 5 | High-value customer |
| 4 | Fairly valuable |
| 3 | Average |
| 2 | Low-value |
| 1 | Very low |

In [33]:
df_user['monetary_score'] = pd.cut(df_user['total_order_value'],
                                   bins=[df_user['total_order_value'].min(),
                                         np.percentile(df_user['total_order_value'], 20),
                                         np.percentile(df_user['total_order_value'], 40),
                                         np.percentile(df_user['total_order_value'], 60),
                                         np.percentile(df_user['total_order_value'], 80),
                                         df_user['total_order_value'].max()],
                                   labels=[1, 2, 3, 4, 5],
                                   include_lowest=True).astype(int)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score,frequency_score,monetary_score
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2,4,3
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3,1,2
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2,2,3
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1,2,2
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5,5,5
...,...,...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4,4,3
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2,2,3
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1,1,2
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2,2,4


In [34]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
 6   frequency_score       3889 non-null   int64         
 7   monetary_score        3889 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 243.2+ KB


## Create segment name columns based on recency and frequency scores
* Groups customers into different behavioral segments based on how recently and how often they purchased.
* Using a combination of recency_score and frequency_score, customers are classified into 10 segments using np.select().
* This segmentation helps with marketing strategy: who to retain, upgrade, or re-engage.
* Useful for campaign personalization, promotional budget allocation, and customer retention.
* List of segments and their meaning

| Segment | Criteria | Key Characteristics |
| ---------------------- | -------------------------- | --------------------------------------------- |
| 01-Champion | Recency 5 & Frequency ≥4 | Active and very frequent customers |
| 02-Loyal Customers | Recency 3–4 & Frequency ≥4 | Frequent shoppers, loyalty needs to be maintained |
| 03-Potential Loyalists | Recency ≥4 & Frequency 2–3 | Recently active, with potential to become loyal |
| 04-Can't Lose Them | Recency ≤2 & Frequency 5 | Very frequent shoppers in the past who have not purchased recently |
| 05-Need Attention | Recency 3 & Frequency 3 | Average on both scores, could be improved |
| 06-New Customers | Recency 5 & Frequency 1 | New customers, need further engagement |
| 07-Promising | Recency 4 & Frequency 1 | Fairly recent but low-frequency shoppers with potential |
| 08-At Risk | Recency ≤2 & Frequency 3–4 | Becoming passive, need preventive action |
| 09-About to Sleep | Recency 3 & Frequency ≤2 | Less active and rarely shop |
| 10-Hibernating | Recency ≤2 & Frequency ≤2 | Very inactive, likely to churn |

In [36]:
df_user['segment'] = np.select(
    [(df_user['recency_score']==5) & (df_user['frequency_score']>=4),
     (df_user['recency_score'].between(3, 4)) & (df_user['frequency_score']>=4),
     (df_user['recency_score']>=4) & (df_user['frequency_score'].between(2, 3)),
     (df_user['recency_score']<=2) & (df_user['frequency_score']==5),
     (df_user['recency_score']==3) & (df_user['frequency_score']==3),
     (df_user['recency_score']==5) & (df_user['frequency_score']==1),
     (df_user['recency_score']==4) & (df_user['frequency_score']==1),
     (df_user['recency_score']<=2) & (df_user['frequency_score'].between(3, 4)),
     (df_user['recency_score']==3) & (df_user['frequency_score']<=2),
     (df_user['recency_score']<=2) & (df_user['frequency_score']<=2)],
    ['01-Champion', '02-Loyal Customers', '03-Potential Loyalists', "04-Can't Lose Them", '05-Need Attention',
     '06-New Customers', '07-Promising', '08-At Risk', '09-About to Sleep', '10-Hibernating'],
    default='Other'
)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score,frequency_score,monetary_score,segment
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2,4,3,08-At Risk
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3,1,2,09-About to Sleep
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2,2,3,10-Hibernating
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1,2,2,10-Hibernating
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5,5,5,01-Champion
...,...,...,...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4,4,3,02-Loyal Customers
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2,2,3,10-Hibernating
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1,1,2,10-Hibernating
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2,2,4,10-Hibernating
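Because `np.select` returns the first matching condition and falls back to 'Other', it is worth checking that the ten rules cover every score combination. The sketch below applies the same conditions to all 25 recency and frequency pairs; the helper table `grid` is an assumption introduced only for this check.

```python
import itertools
import numpy as np
import pandas as pd

# Every (recency_score, frequency_score) pair from 1..5
grid = pd.DataFrame(list(itertools.product(range(1, 6), repeat=2)),
                    columns=["recency_score", "frequency_score"])
r, f = grid["recency_score"], grid["frequency_score"]

# Same rules and order as the segmentation cell above
conditions = [
    (r == 5) & (f >= 4),
    r.between(3, 4) & (f >= 4),
    (r >= 4) & f.between(2, 3),
    (r <= 2) & (f == 5),
    (r == 3) & (f == 3),
    (r == 5) & (f == 1),
    (r == 4) & (f == 1),
    (r <= 2) & f.between(3, 4),
    (r == 3) & (f <= 2),
    (r <= 2) & (f <= 2),
]
names = ["01-Champion", "02-Loyal Customers", "03-Potential Loyalists",
         "04-Can't Lose Them", "05-Need Attention", "06-New Customers",
         "07-Promising", "08-At Risk", "09-About to Sleep", "10-Hibernating"]

grid["segment"] = np.select(conditions, names, default="Other")

# Recency in rows, frequency in columns, segment name in each cell
coverage = grid.pivot(index="recency_score", columns="frequency_score",
                      values="segment")
print(coverage)
```

No pair falls through to 'Other', which is consistent with the ten segments in the output above accounting for all 3,889 customers.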


In [37]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
 6   frequency_score       3889 non-null   int64         
 7   monetary_score        3889 non-null   int64         
 8   segment               3889 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(2)
memory usage: 273.6+ KB


## Summarize the RFM segments: number of users, plus the mean and median of total orders, total order value, and days since the last order

In [39]:
summary = pd.pivot_table(df_user, index='segment',
               values=['customer_id','day_since_last_order','order_cnt','total_order_value'],
               aggfunc={'customer_id': 'nunique',
                        'day_since_last_order': ['mean', 'median'],
                        'order_cnt': ['mean', 'median'],
                        'total_order_value': ['mean', 'median']})

summary['pct_unique'] = (summary['customer_id'] / summary['customer_id'].sum() * 100).round(1)
summary

segment,customer_id (nunique),day_since_last_order (mean),day_since_last_order (median),order_cnt (mean),order_cnt (median),total_order_value (mean),total_order_value (median),pct_unique
01-Champion,553,10.533454,9.0,15.432188,10.0,4989.208761,2773.91,14.2
02-Loyal Customers,549,41.200364,37.0,8.744991,7.0,2618.121117,1937.05,14.1
03-Potential Loyalists,514,23.083658,24.0,2.830739,3.0,766.076265,621.005,13.2
04-Can't Lose Them,62,123.274194,113.0,11.467742,10.0,2851.737258,2268.405,1.6
05-Need Attention,184,58.505435,59.0,3.402174,3.0,1004.317071,826.37,4.7
06-New Customers,50,14.0,16.0,1.0,1.0,244.689,193.675,1.3
07-Promising,133,31.954887,32.0,1.0,1.0,288.694135,239.46,3.4
08-At Risk,418,141.5311,120.0,4.126794,4.0,1141.224835,866.32,10.7
09-About to Sleep,370,58.175676,58.0,1.416216,1.0,448.176081,336.735,9.5
10-Hibernating,1056,197.151515,199.0,1.3125,1.0,342.61845,256.9,27.2
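If flat column names are preferred over the two-level header that `pivot_table` produces, a similar summary can be sketched with `groupby` and named aggregation. The table `toy_user` and the output column names below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical per-customer table with the same columns as df_user
toy_user = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "segment": ["01-Champion", "10-Hibernating",
                "10-Hibernating", "10-Hibernating"],
    "order_cnt": [12, 1, 1, 2],
    "total_order_value": [3000.0, 100.0, 250.0, 400.0],
    "day_since_last_order": [5, 200, 150, 310],
})

# Named aggregation gives one flat, explicitly named column per statistic
summary_flat = toy_user.groupby("segment").agg(
    users=("customer_id", "nunique"),
    orders_mean=("order_cnt", "mean"),
    orders_median=("order_cnt", "median"),
    value_mean=("total_order_value", "mean"),
    value_median=("total_order_value", "median"),
    recency_mean=("day_since_last_order", "mean"),
    recency_median=("day_since_last_order", "median"),
)
# Share of all customers in each segment, in percent
summary_flat["pct_users"] = (
    summary_flat["users"] / summary_flat["users"].sum() * 100).round(1)
print(summary_flat)
```

Flat names also make later steps such as sorting or exporting to CSV simpler, since no column is addressed by a tuple.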


In [40]:
summary['customer_id']

segment,nunique
01-Champion,553
02-Loyal Customers,549
03-Potential Loyalists,514
04-Can't Lose Them,62
05-Need Attention,184
06-New Customers,50
07-Promising,133
08-At Risk,418
09-About to Sleep,370
10-Hibernating,1056


Some insights:
* The largest segment: 10-Hibernating (1,056 users), meaning many users have been inactive for a long time and rarely shop.
* The smallest segment: 04-Can't Lose Them (62 users), meaning a small number of previously active and valuable users who have now become inactive.
* The best segment: 01-Champion (553 users), meaning the users who purchased most recently and most frequently.

In [42]:
summary['customer_id'] / summary['customer_id'].sum() * 100

segment,nunique
01-Champion,14.219594
02-Loyal Customers,14.11674
03-Potential Loyalists,13.216765
04-Can't Lose Them,1.59424
05-Need Attention,4.731293
06-New Customers,1.285678
07-Promising,3.419902
08-At Risk,10.748264
09-About to Sleep,9.514014
10-Hibernating,27.15351


Some insights:
* The largest segment is 10-Hibernating (27.15%). This indicates many users have been inactive for a long time and require a re-engagement strategy.
* 01-Champion (14.22%), 02-Loyal Customers (14.12%), and 03-Potential Loyalists (13.22%) are the most valuable segments and need to be retained and nurtured.
* 08-At Risk (10.75%) and 09-About to Sleep (9.51%) are segments with a significant risk of churn and require special attention.
* 04-Can't Lose Them (1.59%) is small in number, but these users were previously active and high-value.
* 05-Need Attention, 06-New Customers, and 07-Promising are segments that show potential, but still require a different approach to convert them into loyal customers.

# Conclusion
* The passive segments (Hibernating, At Risk, and About to Sleep) together form the largest share of customers, about 47%, while the high-value segments Champions and Loyal Customers account for about 28% and are very important to retain.
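The segment shares can be checked directly from the user counts in the summary table. The short calculation below uses those counts as reported above; the variable names are illustrative.

```python
# User counts per segment, copied from the summary table
counts = {
    "01-Champion": 553, "02-Loyal Customers": 549,
    "03-Potential Loyalists": 514, "04-Can't Lose Them": 62,
    "05-Need Attention": 184, "06-New Customers": 50,
    "07-Promising": 133, "08-At Risk": 418,
    "09-About to Sleep": 370, "10-Hibernating": 1056,
}
total = sum(counts.values())

passive = ["08-At Risk", "09-About to Sleep", "10-Hibernating"]
high_value = ["01-Champion", "02-Loyal Customers"]

# Combined share of each group, in percent of all customers
passive_share = round(sum(counts[s] for s in passive) / total * 100, 1)
high_value_share = round(sum(counts[s] for s in high_value) / total * 100, 1)

print(total, passive_share, high_value_share)
```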

# Strategy
* Retain high-value customers (Champions, Loyal Customers): provide special rewards, loyalty programs, or early access to new products.
* Reactivate at-risk customers (At Risk, Can't Lose Them, About to Sleep): use email marketing with exclusive offers or personal reminders.
* Encourage potential segments (Potential Loyalists, Promising): offer small discounts or product education to encourage them to shop more frequently.
* Re-engage passive customers (Hibernating): run reactivation campaigns or satisfaction surveys to help identify the causes of inactivity.
* Nurture new customers (New Customers, Need Attention): create strong onboarding and encourage repeat purchases with special new-customer promotions.