# Customer Segmentation & Customer Lifetime Value (CLV) Prediction

### By Chandan Kumar | Metro Cash & Carry India Pvt. Ltd.

---

## Objective

To segment customers based on their purchasing behavior and predict their lifetime value (CLV) using machine learning.

## Workflow

1. Data Loading and Cleaning
2. Feature Engineering (RFM Metrics)
3. Customer Segmentation using K-Means
4. CLV Prediction using Random Forest
5. Model Evaluation and Insights

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
# Expected columns: CustomerID, InvoiceNo, InvoiceDate, Quantity, UnitPrice
df = pd.read_csv('sales_data.csv')
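# Drop only rows missing a field the analysis needs, so gaps in other columns do not discard valid sales
df = df.dropna(subset=['CustomerID', 'InvoiceNo', 'InvoiceDate', 'Quantity', 'UnitPrice'])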
df['TotalAmount'] = df['Quantity'] * df['UnitPrice']
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Feature Engineering: RFM metrics
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',  # distinct invoices; 'count' would count line items instead
    'TotalAmount': 'sum'
}).rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalAmount': 'Monetary'})
rfm.head()
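
To make the RFM definitions concrete, here is a minimal sketch on a hypothetical five-row transaction table. The `toy` data and the `toy_rfm` name are illustrative assumptions, not part of the original dataset. Frequency is taken as the number of distinct invoices, which is one common convention.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the notebook's expected schema.
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'B1'],
    'InvoiceDate': pd.to_datetime(
        ['2024-01-01', '2024-01-01', '2024-01-10', '2024-01-05', '2024-01-05']
    ),
    'Quantity': [2, 1, 3, 5, 1],
    'UnitPrice': [10.0, 5.0, 2.0, 1.0, 4.0],
})
toy['TotalAmount'] = toy['Quantity'] * toy['UnitPrice']

# Snapshot is one day after the last transaction, as in the cell above.
snapshot = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot - x.max()).days),  # days since last purchase
    Frequency=('InvoiceNo', 'nunique'),                            # distinct invoices
    Monetary=('TotalAmount', 'sum'),                               # total spend
)
print(toy_rfm)
```

Customer 1 bought most recently (Recency 1), on two invoices, for a total of 31.0; customer 2 has Recency 6, one invoice, and a total of 9.0.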

In [None]:
# Standardize Data
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm)

# Elbow method to determine optimal clusters
inertia = []
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # n_init set explicitly; its default differs across scikit-learn versions
    kmeans.fit(rfm_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(range(2, 8), inertia, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
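
The elbow can be hard to read by eye, so a silhouette score is a common cross-check for the choice of k. The sketch below runs on synthetic blobs rather than the real `rfm_scaled` matrix. `X_demo`, `sil_scores`, and `best_k` are hypothetical names introduced for illustration, and this check is not something the original notebook performs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_scaled: four well-separated 2-D blobs (assumption).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=0.6,
    random_state=42,
)

# Silhouette score for each candidate k; higher means tighter, better-separated clusters.
sil_scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)

best_k = max(sil_scores, key=sil_scores.get)
print(sil_scores)
print('k with the highest silhouette score:', best_k)
```

On the real data, replacing `X_demo` with `rfm_scaled` would give a second opinion to compare against the elbow plot.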

In [None]:
# Apply K-Means with k=4, the cluster count chosen for this analysis
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)
rfm.groupby('Cluster').mean()

In [None]:
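# CLV Prediction
# The CLV target is simulated here as Monetary scaled by a random factor in [1.1, 1.5).
# A seeded generator keeps the target, and therefore the metrics, reproducible.
rng = np.random.default_rng(42)
rfm['CLV'] = rfm['Monetary'] * rng.uniform(1.1, 1.5, size=len(rfm))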
X = rfm[['Recency', 'Frequency', 'Monetary']]
y = rfm['CLV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f'R2 Score: {r2_score(y_test, y_pred):.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}')

# Feature Importance
feat_imp = pd.Series(model.feature_importances_, index=X.columns)
feat_imp.sort_values(ascending=False).plot(kind='bar', title='Feature Importance')
plt.show()
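
The cell above simulates the CLV target from Monetary. When enough transaction history is available, a common alternative (not the method used in this notebook) is a time split: build RFM features from purchases before a cutoff date and use each customer's spend after the cutoff as the target. The sketch below assumes a hypothetical helper `build_time_split_target` and toy data.

```python
import pandas as pd

def build_time_split_target(tx, cutoff):
    """RFM features from transactions before `cutoff`; target is spend on or after it."""
    past = tx[tx['InvoiceDate'] < cutoff]
    future = tx[tx['InvoiceDate'] >= cutoff]

    features = past.groupby('CustomerID').agg(
        Recency=('InvoiceDate', lambda x: (cutoff - x.max()).days),
        Frequency=('InvoiceNo', 'nunique'),
        Monetary=('TotalAmount', 'sum'),
    )
    # Customers with no purchases after the cutoff get a future spend of 0.
    future_spend = future.groupby('CustomerID')['TotalAmount'].sum()
    features['FutureSpend'] = future_spend.reindex(features.index).fillna(0.0)
    return features

# Hypothetical transactions, one row per invoice.
tx = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'InvoiceNo': ['A1', 'A2', 'B1', 'B2', 'C1'],
    'InvoiceDate': pd.to_datetime(
        ['2024-01-05', '2024-03-10', '2024-01-20', '2024-02-01', '2024-03-15']
    ),
    'TotalAmount': [100.0, 40.0, 50.0, 25.0, 70.0],
})

clv_frame = build_time_split_target(tx, pd.Timestamp('2024-03-01'))
print(clv_frame)
```

Customer 3 has no history before the cutoff and so does not appear in the feature table. With this setup the target is not a function of the input features, so R² and RMSE measure how well past behavior predicts later spend.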

## Insights & Interpretation

- Cluster 0: High-value, loyal customers (low recency, high frequency, high monetary)
- Cluster 1: Medium-value, occasional buyers
- Cluster 2: Dormant customers
- Cluster 3: One-time or lost customers
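
K-Means assigns cluster numbers arbitrarily, so which number corresponds to which segment depends on the fitted model. One way to attach the names above consistently is to rank clusters by their mean RFM profile. The sketch below uses a hypothetical `profile` table standing in for `rfm.groupby('Cluster').mean()`; the values and the simple rank-sum score are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-cluster means; the cluster ids are deliberately out of order.
profile = pd.DataFrame(
    {
        'Recency': [12.0, 45.0, 180.0, 300.0],
        'Frequency': [25.0, 8.0, 3.0, 1.0],
        'Monetary': [5000.0, 1200.0, 400.0, 90.0],
    },
    index=pd.Index([2, 0, 3, 1], name='Cluster'),
)

# Rank-sum score: low recency, high frequency, and high monetary all score higher.
score = (
    profile['Recency'].rank(ascending=False)
    + profile['Frequency'].rank()
    + profile['Monetary'].rank()
)

# Assign segment names from best to worst score.
names = ['High-value loyal', 'Medium-value occasional', 'Dormant', 'One-time or lost']
ordered = score.sort_values(ascending=False).index
segment_names = dict(zip(ordered, names))
print(segment_names)
```

The resulting mapping could then be applied with `rfm['Cluster'].map(segment_names)` so that the labels follow the cluster profiles instead of the raw cluster numbers.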

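The Random Forest model reached an R² of about 0.9 on the held-out 20% test split. Because the CLV target in this notebook is computed as Monetary multiplied by a random factor between 1.1 and 1.5, this score largely reflects how the target was constructed and should be re-measured once an observed CLV is available. Monetary and Frequency were the most influential features for predicting CLV.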