# RFM Analysis using Python

RFM analysis is used to understand and segment customers based on their buying behavior. RFM stands for recency, frequency, and monetary value, three key metrics that provide information about customer engagement, loyalty, and value to a business.

How it supports prediction: customers who purchase frequently and recently are often considered loyal. RFM analysis identifies these segments, which helps a business estimate which customers are likely to keep buying.

High monetary value, combined with frequency, can indicate customers with a high lifetime value.

By leveraging RFM analysis, businesses can effectively predict these aspects of customer behavior and strategically plan their marketing, sales, customer service, and product development efforts to optimize customer value and business performance.

RFM analysis is widely used by data science professionals, especially in marketing.

Using RFM analysis, a business can assess each customer's:

- Recency (how recently they made their last purchase)

- Frequency (how often they make purchases)

- Monetary value (the amount spent on purchases)

To perform RFM analysis using Python, we need a dataset that includes customer IDs, purchase dates, and transaction amounts.

In [125]:
import pandas as pd 
import plotly.express as px 
import plotly.io as pio
import plotly.graph_objects as go 

data = pd.read_csv('/Users/greenboi/CS/customer-segmentation/Resources/raw/ecommerce_customer_data_custom_ratios.csv')
data.head(10)

Unnamed: 0,Customer ID,Purchase Date,Product Category,Product Price,Quantity,Total Purchase Amount,Payment Method,Customer Age,Returns,Customer Name,Age,Gender,Churn
0,46251,2020-09-08 09:38:32,Electronics,12,3,740,Credit Card,37,0.0,Christine Hernandez,37,Male,0
1,46251,2022-03-05 12:56:35,Home,468,4,2739,PayPal,37,0.0,Christine Hernandez,37,Male,0
2,46251,2022-05-23 18:18:01,Home,288,2,3196,PayPal,37,0.0,Christine Hernandez,37,Male,0
3,46251,2020-11-12 13:13:29,Clothing,196,1,3509,PayPal,37,0.0,Christine Hernandez,37,Male,0
4,13593,2020-11-27 17:55:11,Home,449,1,3452,Credit Card,49,0.0,James Grant,49,Female,1
5,13593,2023-03-07 14:17:42,Home,250,4,575,PayPal,49,1.0,James Grant,49,Female,1
6,13593,2023-04-15 03:02:33,Electronics,73,1,1896,Credit Card,49,0.0,James Grant,49,Female,1
7,13593,2021-03-27 21:23:28,Books,337,2,2937,Cash,49,0.0,James Grant,49,Female,1
8,13593,2020-05-05 20:14:00,Clothing,182,2,3363,PayPal,49,1.0,James Grant,49,Female,1
9,28805,2023-09-13 04:24:00,Electronics,394,2,1993,Credit Card,19,0.0,Jose Collier,19,Male,0


In [126]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Customer ID            250000 non-null  int64  
 1   Purchase Date          250000 non-null  object 
 2   Product Category       250000 non-null  object 
 3   Product Price          250000 non-null  int64  
 4   Quantity               250000 non-null  int64  
 5   Total Purchase Amount  250000 non-null  int64  
 6   Payment Method         250000 non-null  object 
 7   Customer Age           250000 non-null  int64  
 8   Returns                202404 non-null  float64
 9   Customer Name          250000 non-null  object 
 10  Age                    250000 non-null  int64  
 11  Gender                 250000 non-null  object 
 12  Churn                  250000 non-null  int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 24.8+ MB


## Calculating RFM Values

We begin with calculating the Recency, Frequency, and Monetary values of the customers.

To calculate recency, we subtract each purchase date from the most recent purchase date in the dataset and extract the number of days. This gives us the number of days since a purchase, and the value for a customer's latest purchase represents their recency.

To calculate frequency, we group the data by 'Customer ID' and count the purchases made by each customer.

Finally, we calculate the monetary value for each customer. We group the data by 'Customer ID' and sum the 'Total Purchase Amount' values to get the total spent by each customer.

By performing these calculations, we now have the necessary RFM values (recency, frequency, monetary value) for each customer, which are important indicators for understanding customer behavior and segmentation in RFM analysis.
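
As a minimal sketch of the three calculations described above, all three metrics can be computed per customer in one grouped aggregation. It assumes a toy table with the same column names as this dataset; the `transactions` frame and its values are made up for illustration. Taking the latest purchase date per customer before subtracting yields one recency value per customer.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the dataset used here
transactions = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2, 3],
    "Purchase Date": pd.to_datetime([
        "2023-01-10", "2023-09-01", "2022-05-20",
        "2022-11-02", "2023-03-15", "2023-09-14",
    ]),
    "Total Purchase Amount": [100, 250, 80, 40, 60, 500],
})

# Reference date: the latest purchase in the data (an assumption for this sketch)
reference_date = transactions["Purchase Date"].max()

# One row per customer: latest purchase, number of purchases, total spent
rfm = transactions.groupby("Customer ID").agg(
    LastPurchase=("Purchase Date", "max"),
    Frequency=("Purchase Date", "count"),
    MonetaryValue=("Total Purchase Amount", "sum"),
).reset_index()

# Recency: whole days between the reference date and the customer's latest purchase
rfm["Recency"] = (reference_date - rfm["LastPurchase"]).dt.days

print(rfm[["Customer ID", "Recency", "Frequency", "MonetaryValue"]])
```

In this toy data, customer 3 bought on the reference date and so has a recency of 0 days, while customer 2 has the most purchases but the oldest latest purchase.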

In [127]:
# Convert 'Purchase Date' from string to datetime
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   Customer ID            250000 non-null  int64         
 1   Purchase Date          250000 non-null  datetime64[ns]
 2   Product Category       250000 non-null  object        
 3   Product Price          250000 non-null  int64         
 4   Quantity               250000 non-null  int64         
 5   Total Purchase Amount  250000 non-null  int64         
 6   Payment Method         250000 non-null  object        
 7   Customer Age           250000 non-null  int64         
 8   Returns                202404 non-null  float64       
 9   Customer Name          250000 non-null  object        
 10  Age                    250000 non-null  int64         
 11  Gender                 250000 non-null  object        
 12  Churn                  250000 non-null  int64         

In [128]:
data['Purchase Date'].max()

Timestamp('2023-09-15 12:24:08')

In [129]:
data['Recency'] = (data['Purchase Date'].max() - data['Purchase Date'])
data['Recency'] = data['Recency'].dt.days
data.head(10)

Unnamed: 0,Customer ID,Purchase Date,Product Category,Product Price,Quantity,Total Purchase Amount,Payment Method,Customer Age,Returns,Customer Name,Age,Gender,Churn,Recency
0,46251,2020-09-08 09:38:32,Electronics,12,3,740,Credit Card,37,0.0,Christine Hernandez,37,Male,0,1102
1,46251,2022-03-05 12:56:35,Home,468,4,2739,PayPal,37,0.0,Christine Hernandez,37,Male,0,558
2,46251,2022-05-23 18:18:01,Home,288,2,3196,PayPal,37,0.0,Christine Hernandez,37,Male,0,479
3,46251,2020-11-12 13:13:29,Clothing,196,1,3509,PayPal,37,0.0,Christine Hernandez,37,Male,0,1036
4,13593,2020-11-27 17:55:11,Home,449,1,3452,Credit Card,49,0.0,James Grant,49,Female,1,1021
5,13593,2023-03-07 14:17:42,Home,250,4,575,PayPal,49,1.0,James Grant,49,Female,1,191
6,13593,2023-04-15 03:02:33,Electronics,73,1,1896,Credit Card,49,0.0,James Grant,49,Female,1,153
7,13593,2021-03-27 21:23:28,Books,337,2,2937,Cash,49,0.0,James Grant,49,Female,1,901
8,13593,2020-05-05 20:14:00,Clothing,182,2,3363,PayPal,49,1.0,James Grant,49,Female,1,1227
9,28805,2023-09-13 04:24:00,Electronics,394,2,1993,Credit Card,19,0.0,Jose Collier,19,Male,0,2


In [130]:
# Calculate Frequency
frequency_data = data.groupby('Customer ID')['Purchase Date'].count().reset_index()
frequency_data.rename(columns={'Purchase Date': 'Frequency'}, inplace=True)
data = data.merge(frequency_data, on='Customer ID', how='left')
data.head(10)

Unnamed: 0,Customer ID,Purchase Date,Product Category,Product Price,Quantity,Total Purchase Amount,Payment Method,Customer Age,Returns,Customer Name,Age,Gender,Churn,Recency,Frequency
0,46251,2020-09-08 09:38:32,Electronics,12,3,740,Credit Card,37,0.0,Christine Hernandez,37,Male,0,1102,4
1,46251,2022-03-05 12:56:35,Home,468,4,2739,PayPal,37,0.0,Christine Hernandez,37,Male,0,558,4
2,46251,2022-05-23 18:18:01,Home,288,2,3196,PayPal,37,0.0,Christine Hernandez,37,Male,0,479,4
3,46251,2020-11-12 13:13:29,Clothing,196,1,3509,PayPal,37,0.0,Christine Hernandez,37,Male,0,1036,4
4,13593,2020-11-27 17:55:11,Home,449,1,3452,Credit Card,49,0.0,James Grant,49,Female,1,1021,5
5,13593,2023-03-07 14:17:42,Home,250,4,575,PayPal,49,1.0,James Grant,49,Female,1,191,5
6,13593,2023-04-15 03:02:33,Electronics,73,1,1896,Credit Card,49,0.0,James Grant,49,Female,1,153,5
7,13593,2021-03-27 21:23:28,Books,337,2,2937,Cash,49,0.0,James Grant,49,Female,1,901,5
8,13593,2020-05-05 20:14:00,Clothing,182,2,3363,PayPal,49,1.0,James Grant,49,Female,1,1227,5
9,28805,2023-09-13 04:24:00,Electronics,394,2,1993,Credit Card,19,0.0,Jose Collier,19,Male,0,2,6


In [131]:
data = data.groupby('Customer ID').agg({
    'Customer Name': 'first',  # Assuming customer name is unique to each ID
    'Customer Age': 'first',  # Assuming age doesn't change within this dataset
    'Gender': 'first',  # Assuming gender is consistent for each customer ID
    'Churn': 'first',  # Assuming churn status is consistent for each customer ID
    'Frequency': 'first',  # Frequency is already calculated per customer
    'Total Purchase Amount': 'sum',  # Sum of all purchase amounts
    'Product Price': 'mean',  # Average price of products purchased
    'Quantity': 'mean',  # Average quantity of products per purchase
    'Returns': 'sum',  # Total number of returns
    'Recency': 'first'  # Recency of the first listed purchase; 'min' would give days since the latest purchase
}).reset_index()

data = data.rename({'Total Purchase Amount': 'MonetaryValue'}, axis='columns')

# Display the first few rows of the consolidated customer profiles
data.head()

Unnamed: 0,Customer ID,Customer Name,Customer Age,Gender,Churn,Frequency,MonetaryValue,Product Price,Quantity,Returns,Recency
0,1,Nicole Johnson,70,Male,0,1,3491,169.0,5.0,1.0,57
1,2,Marie Wright,27,Female,0,3,7988,140.0,2.666667,1.0,923
2,3,Julie Wolfe,23,Female,0,8,22587,225.375,2.875,3.0,845
3,4,Tracey Smith,66,Male,1,4,8715,134.25,2.25,1.0,597
4,5,Nancy Jones,26,Female,0,8,12524,229.0,4.125,4.0,290


We assign recency scores from 5 to 1, where a higher score indicates a more recent purchase.

We assign frequency scores from 1 to 5, where a higher score indicates a higher purchase frequency.

We assign monetary scores from 1 to 5, where a higher score indicates a higher amount spent by the customer.

To determine the RFM scores, we use pd.cut() to divide the recency, frequency and monetary values into bins. We define 5 equal-width bins for each value and assign the corresponding scores to the bins.

The scores returned by pd.cut() are categorical variables, so we convert them to integers before using them further.
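
The binning step can be illustrated on a handful of values. This is a small sketch with made-up recency values (the `recency_days` series is hypothetical): `pd.cut` with `bins=5` splits the observed range into five equal-width intervals, and the reversed label list gives the smallest recency the highest score.

```python
import pandas as pd

# Hypothetical recency values in days (not taken from the dataset)
recency_days = pd.Series([2, 57, 290, 597, 845, 1026, 1227])

# bins=5 divides the observed range (2 to 1227) into five equal-width intervals;
# labels are listed in reverse so that fewer days since purchase means a higher score
recency_score = pd.cut(recency_days, bins=5, labels=[5, 4, 3, 2, 1])

# pd.cut returns a categorical Series; cast to int before doing arithmetic
recency_score = recency_score.astype(int)

print(pd.DataFrame({"Recency": recency_days, "RecencyScore": recency_score}))
```

Because the bins are equal in width rather than in population, a skewed distribution can leave some scores with very few customers; `pd.qcut` is the quantile-based alternative when balanced groups are preferred.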

In [132]:
recency_scores = [5, 4, 3, 2, 1]  # Higher score for lower recency (more recent)
frequency_scores = [1, 2, 3, 4, 5]  # Higher score for higher frequency
monetary_scores = [1, 2, 3, 4, 5]  # Higher score for higher monetary value

# Calculate RFM scores
data['RecencyScore'] = pd.cut(data['Recency'], bins=5, labels=recency_scores)
data['FrequencyScore'] = pd.cut(data['Frequency'], bins=5, labels=frequency_scores)
data['MonetaryScore'] = pd.cut(data['MonetaryValue'], bins=5, labels=monetary_scores)

## RFM Value Segmentation

We want to calculate the final RFM score and the value segment according to the scores.

To calculate the RFM score, we add the scores obtained for recency, frequency and monetary value. After calculating RFM scores, we can create RFM segments based on the scores. We divide RFM scores into three segments: low value, mid value and high value. Segmentation is done using the pd.qcut() function, which splits the scores at quantile boundaries so that each segment holds roughly the same number of customers.
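
A worked example of this step, sketched on six made-up score rows (the `scores` frame is hypothetical), shows how the sum and the quantile split behave:

```python
import pandas as pd

# Hypothetical individual scores for six customers
scores = pd.DataFrame({
    "RecencyScore":   [5, 2, 2, 3, 4, 3],
    "FrequencyScore": [1, 1, 3, 1, 3, 2],
    "MonetaryScore":  [1, 1, 3, 1, 2, 2],
})

# RFM score: plain sum of the three individual scores
score_columns = ["RecencyScore", "FrequencyScore", "MonetaryScore"]
scores["RFM_Score"] = scores[score_columns].sum(axis=1)

# q=3 cuts at the 1/3 and 2/3 quantiles of RFM_Score
scores["Value Segment"] = pd.qcut(
    scores["RFM_Score"], q=3,
    labels=["Low-Value", "Mid-Value", "High-Value"],
)

print(scores[["RFM_Score", "Value Segment"]])
```

Here the two lowest totals (4 and 5) land in Low-Value, the two 7s in Mid-Value, and 8 and 9 in High-Value. The segment boundaries depend on the distribution of scores, so the same score can fall into a different segment in another dataset.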

In [133]:
data.head(10)

Unnamed: 0,Customer ID,Customer Name,Customer Age,Gender,Churn,Frequency,MonetaryValue,Product Price,Quantity,Returns,Recency,RecencyScore,FrequencyScore,MonetaryScore
0,1,Nicole Johnson,70,Male,0,1,3491,169.0,5.0,1.0,57,5,1,1
1,2,Marie Wright,27,Female,0,3,7988,140.0,2.666667,1.0,923,2,1,1
2,3,Julie Wolfe,23,Female,0,8,22587,225.375,2.875,3.0,845,2,3,3
3,4,Tracey Smith,66,Male,1,4,8715,134.25,2.25,1.0,597,3,1,1
4,5,Nancy Jones,26,Female,0,8,12524,229.0,4.125,4.0,290,4,3,2
5,6,Kayla Smith,24,Female,1,6,15517,191.333333,1.833333,0.0,774,3,2,2
6,7,Kristina Miller,34,Female,0,3,7835,210.333333,5.0,1.0,642,3,1,1
7,8,Michael Weber,50,Female,0,6,15265,214.5,2.833333,1.0,707,3,2,2
8,9,Laurie Holmes,41,Female,0,5,14720,300.6,2.6,3.0,126,5,2,2
9,10,Jessica Daniels,58,Female,1,8,20857,201.75,2.5,4.0,1026,2,3,2


In [134]:
# Convert RFM scores to numeric type
data['RecencyScore'] = data['RecencyScore'].astype(int)
data['FrequencyScore'] = data['FrequencyScore'].astype(int)
data['MonetaryScore'] = data['MonetaryScore'].astype(int)

In [135]:
# Calculate RFM score by combining the individual scores
data['RFM_Score'] = data['RecencyScore'] + data['FrequencyScore'] + data['MonetaryScore']

# Create RFM segments based on the RFM score
segment_labels = ['Low-Value', 'Mid-Value', 'High-Value']
data['Value Segment'] = pd.qcut(data['RFM_Score'], q=3, labels=segment_labels)

In [136]:
data.head(10)

Unnamed: 0,Customer ID,Customer Name,Customer Age,Gender,Churn,Frequency,MonetaryValue,Product Price,Quantity,Returns,Recency,RecencyScore,FrequencyScore,MonetaryScore,RFM_Score,Value Segment
0,1,Nicole Johnson,70,Male,0,1,3491,169.0,5.0,1.0,57,5,1,1,7,Mid-Value
1,2,Marie Wright,27,Female,0,3,7988,140.0,2.666667,1.0,923,2,1,1,4,Low-Value
2,3,Julie Wolfe,23,Female,0,8,22587,225.375,2.875,3.0,845,2,3,3,8,High-Value
3,4,Tracey Smith,66,Male,1,4,8715,134.25,2.25,1.0,597,3,1,1,5,Low-Value
4,5,Nancy Jones,26,Female,0,8,12524,229.0,4.125,4.0,290,4,3,2,9,High-Value
5,6,Kayla Smith,24,Female,1,6,15517,191.333333,1.833333,0.0,774,3,2,2,7,Mid-Value
6,7,Kristina Miller,34,Female,0,3,7835,210.333333,5.0,1.0,642,3,1,1,5,Low-Value
7,8,Michael Weber,50,Female,0,6,15265,214.5,2.833333,1.0,707,3,2,2,7,Mid-Value
8,9,Laurie Holmes,41,Female,0,5,14720,300.6,2.6,3.0,126,5,2,2,9,High-Value
9,10,Jessica Daniels,58,Female,1,8,20857,201.75,2.5,4.0,1026,2,3,2,7,Mid-Value


In [137]:
segment_counts = data['Value Segment'].value_counts().reset_index()
segment_counts.columns = ['Value Segment', 'Count']

pastel_colors = px.colors.qualitative.Pastel

# Create the bar chart
fig_segment_dist = px.bar(segment_counts, x='Value Segment', y='Count', 
                          color='Value Segment', color_discrete_sequence=pastel_colors,
                          title='RFM Value Segment Distribution')

# Update the layout
fig_segment_dist.update_layout(xaxis_title='RFM Value Segment',
                              yaxis_title='Count',
                              showlegend=False)

# Show the figure
fig_segment_dist.show()

## RFM Customer Segments

The segments we have calculated are RFM value segments. We now want to calculate RFM customer segments. The RFM value segment represents the categorization of customers based on their RFM scores into groups such as "low value", "medium value", and "high value". The segments are determined by dividing RFM scores into distinct ranges or groups. The RFM value segment helps us understand the relative value of customers in terms of recency, frequency and monetary aspects.

We can also create and analyze RFM Customer Segments that are broader classifications based on the RFM score. These segments provide a more strategic perspective on customer behavior and characteristics in terms of recency, frequency and monetary aspects.

RFM Customer Segment:

- Objective: The primary goal here is to categorize customers into different groups based on their purchasing behavior. It focuses on identifying various types of customers, such as new customers, loyal customers, lost customers, and others, based on how recently and frequently they've purchased and how much they've spent.
- Application: This segmentation is used to tailor marketing communications, develop customer relationship strategies, and enhance customer engagement. For example, a segment of recent, frequent, high-spending customers might be targeted with loyalty programs, while a segment of customers who haven't purchased recently might receive re-engagement campaigns.

RFM Value Segment:

- Objective: While also utilizing RFM metrics, the focus here shifts towards quantifying the value each customer segment brings to the company. It aims at ranking or scoring customers based on their potential or actual value, often leading to a more nuanced understanding of how different segments contribute to the business’s revenue.
- Application: This approach helps in allocating marketing resources more efficiently, focusing efforts on high-value segments that are more likely to drive profit. For example, high-value segments might be targeted with exclusive offers or premium services to enhance retention and increase their lifetime value.

In [139]:
data['RFM Customer Segments'] = ''

# Assign RFM segments based on the RFM score
data.loc[data['RFM_Score'] >= 9, 'RFM Customer Segments'] = 'Champions'
data.loc[(data['RFM_Score'] >= 6) & (data['RFM_Score'] < 9), 'RFM Customer Segments'] = 'Potential Loyalists'
data.loc[(data['RFM_Score'] >= 5) & (data['RFM_Score'] < 6), 'RFM Customer Segments'] = 'At Risk Customers'
data.loc[(data['RFM_Score'] >= 4) & (data['RFM_Score'] < 5), 'RFM Customer Segments'] = "Can't Lose"
data.loc[(data['RFM_Score'] >= 3) & (data['RFM_Score'] < 4), 'RFM Customer Segments'] = "Lost"

# Print the updated data with RFM segments
print(data[['Customer ID', 'RFM Customer Segments']])

       Customer ID RFM Customer Segments
0                1   Potential Loyalists
1                2            Can't Lose
2                3   Potential Loyalists
3                4     At Risk Customers
4                5             Champions
...            ...                   ...
49668        49996     At Risk Customers
49669        49997             Champions
49670        49998     At Risk Customers
49671        49999            Can't Lose
49672        50000            Can't Lose

[49673 rows x 2 columns]
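
The same score-to-segment mapping can also be written as a single `pd.cut` call with explicit bin edges. This is an alternative sketch, not the approach used in the cell above; the `rfm_score` series is made up, and the edges assume integer scores between 3 and 15.

```python
import pandas as pd

# Hypothetical integer RFM scores covering every segment
rfm_score = pd.Series([3, 4, 5, 6, 8, 9, 15], name="RFM_Score")

# Right-closed intervals: (2, 3], (3, 4], (4, 5], (5, 8], (8, 15]
edges = [2, 3, 4, 5, 8, 15]
labels = ["Lost", "Can't Lose", "At Risk Customers",
          "Potential Loyalists", "Champions"]

segments = pd.cut(rfm_score, bins=edges, labels=labels)

print(pd.DataFrame({"RFM_Score": rfm_score, "Segment": segments}))
```

Keeping the thresholds in one list makes them easier to adjust, and any score outside the edges shows up as a missing value rather than an empty string.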


In [140]:
# observed=True keeps only the segment combinations that actually occur in the data
segment_product_counts = data.groupby(['Value Segment', 'RFM Customer Segments'], observed=True).size().reset_index(name='Count')

segment_product_counts = segment_product_counts.sort_values('Count', ascending=False)

fig_treemap_segment_product = px.treemap(segment_product_counts, 
                                         path=['Value Segment', 'RFM Customer Segments'], 
                                         values='Count',
                                         color='Value Segment', color_discrete_sequence=px.colors.qualitative.Pastel,
                                         title='RFM Customer Segments by Value')
fig_treemap_segment_product.show()

In [141]:
# Filter the data to include only the customers in the Champions segment
champions_segment = data[data['RFM Customer Segments'] == 'Champions']

fig = go.Figure()
fig.add_trace(go.Box(y=champions_segment['RecencyScore'], name='Recency'))
fig.add_trace(go.Box(y=champions_segment['FrequencyScore'], name='Frequency'))
fig.add_trace(go.Box(y=champions_segment['MonetaryScore'], name='Monetary'))

fig.update_layout(title='Distribution of RFM Scores within Champions Segment',
                  yaxis_title='RFM Score',
                  showlegend=True)

fig.show()

A correlation matrix for the RFM (Recency, Frequency, Monetary value) scores shows how these three customer metrics relate to one another. It reveals which metrics tend to move together, which is valuable for making informed decisions in marketing strategy and customer relationship management. The correlation values are interpreted as follows:

- Correlation Coefficient Range: The correlation coefficient between two variables ranges from -1 to +1. A value of +1 indicates a perfect positive correlation, meaning that as one variable increases, the other one also increases. A value of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other decreases. A value of 0 indicates no correlation between the variables.

- Positive Correlation: If the correlation between two RFM metrics is positive, it means that higher values in one metric are associated with higher values in another. For example, a positive correlation between Frequency and Monetary value would suggest that customers who purchase more frequently tend to spend more.

- Negative Correlation: A negative correlation indicates that higher values in one metric are associated with lower values in another. For instance, a negative correlation between Recency (days since the last purchase) and Frequency would suggest that customers who have not purchased recently tend to purchase less frequently.
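
To make these cases concrete, here is a small sketch on made-up values (the `toy` frame is hypothetical) in which one pair of columns rises together and another pair moves in opposite directions:

```python
import pandas as pd

# Hypothetical values chosen to produce perfect linear relationships
toy = pd.DataFrame({
    "Frequency":     [1, 2, 3, 4, 5],
    "MonetaryValue": [100, 200, 300, 400, 500],  # rises with Frequency
    "Recency":       [500, 400, 300, 200, 100],  # falls as Frequency rises
})

# DataFrame.corr() computes pairwise Pearson correlation by default
corr = toy.corr()

print(corr.round(2))
```

Frequency and MonetaryValue correlate at about +1, while Frequency and Recency correlate at about -1. Real customer data will usually sit somewhere in between.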

In [142]:
correlation_matrix = champions_segment[['RecencyScore', 'FrequencyScore', 'MonetaryScore']].corr()

# Visualize the correlation matrix using a heatmap
fig_heatmap = go.Figure(data=go.Heatmap(
                   z=correlation_matrix.values,
                   x=correlation_matrix.columns,
                   y=correlation_matrix.columns,
                   colorscale='RdBu',
                   colorbar=dict(title='Correlation')))

fig_heatmap.update_layout(title='Correlation Matrix of RFM Values within Champions Segment')

fig_heatmap.show()

In [143]:
# Reuse the Pastel palette from plotly.express (already imported as px)
pastel_colors = px.colors.qualitative.Pastel

segment_counts = data['RFM Customer Segments'].value_counts()

# Create a bar chart to compare segment counts
fig = go.Figure(data=[go.Bar(x=segment_counts.index, y=segment_counts.values,
                            marker=dict(color=pastel_colors))])

# Set the color of the Champions segment as a different color
champions_color = 'rgb(158, 202, 225)'
fig.update_traces(marker_color=[champions_color if segment == 'Champions' else pastel_colors[i]
                                for i, segment in enumerate(segment_counts.index)],
                  marker_line_color='rgb(8, 48, 107)',
                  marker_line_width=1.5, opacity=0.6)

# Update the layout
fig.update_layout(title='Comparison of RFM Segments',
                  xaxis_title='RFM Segments',
                  yaxis_title='Number of Customers',
                  showlegend=False)

fig.show()

In [144]:
# Calculate the average Recency, Frequency, and Monetary scores for each segment
segment_scores = data.groupby('RFM Customer Segments')[['RecencyScore', 'FrequencyScore', 'MonetaryScore']].mean().reset_index()

# Create a grouped bar chart to compare segment scores
fig = go.Figure()

# Add bars for Recency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['RecencyScore'],
    name='Recency Score',
    marker_color='rgb(158,202,225)'
))

# Add bars for Frequency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['FrequencyScore'],
    name='Frequency Score',
    marker_color='rgb(94,158,217)'
))

# Add bars for Monetary score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['MonetaryScore'],
    name='Monetary Score',
    marker_color='rgb(32,102,148)'
))

# Update the layout
fig.update_layout(
    title='Comparison of RFM Segments based on Recency, Frequency, and Monetary Scores',
    xaxis_title='RFM Segments',
    yaxis_title='Score',
    barmode='group',
    showlegend=True
)

fig.show()