Customers who haven’t bought in the last 90 days (relative to your snapshot date) → churned
Others → active
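
The Recency column loaded below is assumed to be the number of days between each customer's last purchase and the snapshot date. A minimal sketch of that derivation and the 90-day rule, using a tiny made-up transactions table (the column names `CustomerID` and `InvoiceDate` and the snapshot date are hypothetical, not taken from the original data):

```python
import pandas as pd

# Hypothetical transactions; names and dates are illustrative only
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-10", "2011-11-30", "2011-06-01", "2011-12-05"]
    ),
})
snapshot_date = pd.Timestamp("2011-12-10")  # assumed snapshot date

# Recency = days since each customer's most recent purchase
last_purchase = tx.groupby("CustomerID")["InvoiceDate"].max()
recency = (snapshot_date - last_purchase).dt.days

# Same rule as the notebook: more than 90 days without a purchase -> churned
churn = (recency > 90).astype(int)
print(pd.DataFrame({"Recency": recency, "Churn": churn}))
```

Only customer 2, whose last purchase is more than 90 days before the assumed snapshot date, is labelled as churned here.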

In [3]:
import pandas as pd

# Load the RFM data from the CSV file
rfm = pd.read_csv('../outputs/rfm_clusters.csv')

In [4]:
# Create churn label (1 = churned, 0 = active)
churn_df = rfm.copy()
churn_df['Churn'] = (churn_df['Recency'] > 90).astype(int)
# Quick check
churn_df[['Recency', 'Frequency', 'Monetary', 'Cluster', 'Churn']].head()


   Recency  Frequency  Monetary  Cluster  Churn
0      326          1   77183.6        3      1
1        2          7    4310.0        0      0
2       75          4   1797.24        0      0
3       19          1   1757.55        0      0
4      310          1     334.4        1      1


-------------Prepare features & target-----------
Features: Recency, Frequency, Monetary, Cluster
Target: Churn

In [5]:
X = churn_df[['Recency', 'Frequency', 'Monetary', 'Cluster']]
y = churn_df['Churn']

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

Train size: (3470, 4)
Test size: (868, 4)


In [7]:
# Train model (Random Forest)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred = rf_model.predict(X_test)
y_prob = rf_model.predict_proba(X_test)[:,1]

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", round(roc_auc_score(y_test, y_prob), 3))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       578
           1       1.00      1.00      1.00       290

    accuracy                           1.00       868
   macro avg       1.00      1.00      1.00       868
weighted avg       1.00      1.00      1.00       868

ROC-AUC Score: 1.0
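
A perfect 1.00 on every metric is worth a sanity check. Because Churn is defined as Recency > 90 and Recency is also a feature, a one-line threshold rule reproduces the label exactly, so the forest has little left to learn. The sketch below illustrates this on synthetic data (randomly generated, not the notebook's customers; the column distributions are assumptions) and shows one way to look for signal beyond the definition: refit without Recency.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
# Synthetic stand-in for the RFM table (values are made up)
df = pd.DataFrame({
    "Recency": rng.integers(0, 365, n),
    "Frequency": rng.integers(1, 20, n),
    "Monetary": rng.gamma(2.0, 500.0, n),
})
df["Churn"] = (df["Recency"] > 90).astype(int)

# 1) The labelling rule alone is already a perfect "model"
rule_pred = (df["Recency"] > 90).astype(int)
rule_acc = accuracy_score(df["Churn"], rule_pred)
print("Rule accuracy:", rule_acc)

# 2) Refit without Recency to see what the other features can explain
X = df[["Frequency", "Monetary"]]
y = df["Churn"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
auc_no_recency = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print("ROC-AUC without Recency:", round(auc_no_recency, 3))
```

In this synthetic table the remaining features are independent of Recency by construction, so the second score should sit near 0.5. On real data, correlated features such as Frequency may retain some signal.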


-----------------Interpretation-----------------
Classification report → precision, recall, and F1-score for each class
ROC-AUC → how well the predicted probabilities rank churned customers above active ones (1.0 is perfect, 0.5 is random)
Feature importance → which RFM or cluster features influence churn most
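
To make these metrics concrete, here is a small hand-made example (the labels are invented for illustration) showing how precision and recall for the churned class follow from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = churned, 0 = active
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 0, 1, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of real churners, how many were caught
print("Precision:", precision, "Recall:", recall)
```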

In [8]:
# Feature importance
importances = rf_model.feature_importances_
feat_names = X.columns
for name, importance in zip(feat_names, importances):
    print(f"{name}: {importance:.3f}")

Recency: 0.790
Frequency: 0.025
Monetary: 0.009
Cluster: 0.177
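
Impurity-based `feature_importances_` are computed from the training data and can be skewed toward high-cardinality features. As a cross-check, scikit-learn's `permutation_importance` measures how much a held-out score drops when one column is shuffled. A sketch on synthetic data (made up for illustration, using the same labelling rule):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in for the RFM features (values are made up)
X = pd.DataFrame({
    "Recency": rng.integers(0, 365, n),
    "Frequency": rng.integers(1, 20, n),
    "Monetary": rng.gamma(2.0, 500.0, n),
})
y = (X["Recency"] > 90).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each test-set column in turn and record the drop in accuracy
result = permutation_importance(
    model, X_te, y_te, n_repeats=5, random_state=0
)
perm = pd.Series(result.importances_mean, index=X.columns)
print(perm.sort_values(ascending=False).round(3))
```

If the two rankings disagree on the real data, the permutation result on the test set is usually the safer one to report.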


1. Recency (79%) → DOMINANT FACTOR
Most powerful predictor of churn
Customers who haven’t purchased recently are far more likely to churn
This is expected rather than a validation: Churn is defined as Recency > 90, so the model is mostly recovering the labelling rule, which also explains the perfect scores above

👉 Business insight
“Time since last purchase is the strongest indicator of customer churn.”

2. Cluster (17.7%) → STRATEGIC VALUE
Segmentation adds predictive power
Certain customer segments are structurally more likely to churn
Suggests that the RFM-based clusters carry churn-relevant information, though some of it may be Recency encoded in the cluster label

👉 Business insight
“Customer segments behave differently; churn risk is cluster-dependent.”

3. Frequency (2.5%) → Minor Signal
Buying often helps retention, but it matters far less than recency
A frequent buyer can still churn after a period of inactivity

👉 Insight
“Past loyalty doesn’t guarantee future retention.”

4. Monetary (0.9%) → Least Important
High spenders can still churn
Spending ≠ engagement

👉 Insight
“Revenue alone is a weak indicator of churn risk.”