# RFM Segmentation
RFM (Recency, Frequency, Monetary) segmentation is a customer analysis method that groups customers by:
- Recency: how recently the customer made their last transaction.
- Frequency: how often the customer transacts.
- Monetary: how much money the customer has spent.

The goal of this segmentation is to identify customer groups by value and behavior, so the company can design better-targeted marketing strategies. Because RFM focuses on existing customers, it is particularly useful for customer retention, loyalty, and revenue growth.
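
As a minimal sketch of the three metrics, the toy example below derives recency, frequency, and monetary values from a handful of made-up transactions. The table tx, its values, and the name snapshot are hypothetical and only illustrate the idea; the real dataset is loaded further down.

```python
import pandas as pd

# Hypothetical toy transactions; values are illustrative, not from the dataset.
tx = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "B", "C"],
    "order_id": ["o1", "o2", "o3", "o4", "o5", "o6"],
    "date": pd.to_datetime([
        "2010-01-05", "2010-03-01", "2010-02-10",
        "2010-02-20", "2010-03-10", "2010-01-15",
    ]),
    "amount": [10.0, 20.0, 5.0, 5.0, 15.0, 100.0],
})

# Reference date: the latest transaction in the data.
snapshot = tx["date"].max()

# One row per customer with the raw inputs for R, F, and M.
rfm = tx.groupby("customer_id").agg(
    last_order=("date", "max"),
    frequency=("order_id", "nunique"),
    monetary=("amount", "sum"),
)
rfm["recency_days"] = (snapshot - rfm["last_order"]).dt.days
print(rfm)
```

Customer B is the most recent and most frequent buyer here, while customer C spent the most in a single, older order; the three metrics capture different aspects of customer value.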

# Import packages

In [2]:
import pandas as pd
import numpy as np
import datetime as dt

In [3]:
import os
os.getcwd()

'C:\\Users\\LENOVO\\Python\\Intermediate'

# Import data from CSV into a DataFrame

In [4]:
df = pd.read_csv('C:/Users/LENOVO/Python/Online Retail Data.csv', header=0)
df

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id
0,493410,TEST001,This is a test product.,5,2010-01-04 09:24:00,4.50,12346.0
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 09:43:00,4.25,14590.0
2,493412,TEST001,This is a test product.,5,2010-01-04 09:53:00,4.50,12346.0
3,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,2010-01-04 09:54:00,0.85,
4,493413,84578,ELEPHANT TOY WITH BLUE T-SHIRT,1,2010-01-04 09:54:00,3.75,
...,...,...,...,...,...,...,...
461768,539991,21618,4 WILDFLOWER BOTANICAL CANDLES,1,2010-12-23 16:49:00,1.25,
461769,539991,72741,GRAND CHOCOLATECANDLE,4,2010-12-23 16:49:00,1.45,
461770,539992,21470,FLOWER VINE RAFFIA FOOD COVER,1,2010-12-23 17:41:00,3.75,
461771,539992,22258,FELT FARM ANIMAL RABBIT,1,2010-12-23 17:41:00,1.25,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 461773 entries, 0 to 461772
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      461773 non-null  object 
 1   product_code  461773 non-null  object 
 2   product_name  459055 non-null  object 
 3   quantity      461773 non-null  int64  
 4   order_date    461773 non-null  object 
 5   price         461773 non-null  float64
 6   customer_id   360853 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 24.7+ MB


In [6]:
df.isna().sum()

order_id             0
product_code         0
product_name      2718
quantity             0
order_date           0
price                0
customer_id     100920
dtype: int64

# Data cleansing

In [7]:
df_clean = df.copy()

## Parse the order date column as datetime
Standardizes the time column so it can be used to compute recency and time trends.

In [8]:
df_clean['order_date'] = pd.to_datetime(df_clean['order_date'])

## Drop all rows without a customer_id
Removes anonymous rows, which cannot be analyzed per customer.

In [9]:
df_clean = df_clean[~df_clean['customer_id'].isna()]

## Drop all rows without a product_name
Keeps product information consistent; rows without a product name are likely incomplete records or input errors.

In [10]:
df_clean = df_clean[~df_clean['product_name'].isna()]

## Convert all product_name values to lowercase
Standardizes naming so that values differing only in case are not treated as different by Python (for example, "T-Shirt" ≠ "t-shirt").

In [11]:
df_clean['product_name'] = df_clean['product_name'].str.lower()

## Drop all rows whose product_code or product_name marks a test entry
Removes dummy or test entries that do not represent real customer behavior.

In [12]:
# Keep a row only when neither the code nor the name marks it as a test entry.
df_clean = df_clean[(~df_clean['product_code'].str.lower().str.contains('test')) &
                    (~df_clean['product_name'].str.contains('test '))]
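
The difference between joining the two negated conditions with | and with & can be checked on a toy frame. This is a small sketch with hypothetical rows (the frame demo and its third row are made up): with |, a row is dropped only when both the code and the name look like test data, whereas with & a row is dropped when either one does.

```python
import pandas as pd

# Hypothetical rows; the second and third are test entries.
demo = pd.DataFrame({
    "product_code": ["21539", "TEST001", "22222"],
    "product_name": ["retro spots butter dish", "this is a test product.", "test item"],
})

is_test_code = demo["product_code"].str.lower().str.contains("test")
is_test_name = demo["product_name"].str.contains("test ")

# OR of the negations: a row is dropped only if BOTH flags are set.
kept_or = demo[~is_test_code | ~is_test_name]
# AND of the negations: a row is dropped if EITHER flag is set.
kept_and = demo[~is_test_code & ~is_test_name]

print(len(kept_or), len(kept_and))
```

The third row, which has a normal code but a test name, survives the | version and is removed by the & version.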

## Add an order_status column: 'cancelled' when order_id starts with 'C', otherwise 'delivered'
Distinguishes cancelled from delivered transactions explicitly, for more accurate segmentation.

In [13]:
df_clean['order_status'] = np.where(df_clean['order_id'].str[:1]=='C', 'cancelled', 'delivered')

## Convert negative quantity values to positive, since the negative sign only marks a cancelled order
Normalizes the item count: the negative sign is a cancellation marker, not part of the actual quantity.

In [14]:
df_clean['quantity'] = df_clean['quantity'].abs()
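
The two steps above can be sketched on two rows modeled on the first rows of the raw data (order 493410 with quantity 5 and order C493411 with quantity -1). The frame name orders is hypothetical.

```python
import numpy as np
import pandas as pd

# Two rows modeled on the first rows of the raw data shown earlier.
orders = pd.DataFrame({"order_id": ["493410", "C493411"], "quantity": [5, -1]})

# An order_id starting with 'C' marks a cancellation.
orders["order_status"] = np.where(
    orders["order_id"].str.startswith("C"), "cancelled", "delivered")

# The sign carried no information beyond the status, so keep the magnitude.
orders["quantity"] = orders["quantity"].abs()
print(orders)
```

Once the sign is removed, order_status is the only remaining marker of a cancellation, so any later step that should exclude cancelled orders has to filter on that column.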

## Drop rows with a zero or negative price
Keeps transactions valid; a price of zero or below is not meaningful for a purchase.

In [15]:
df_clean = df_clean[df_clean['price']>0]

## Create the amount column (quantity * price)
Computes the total value of each transaction row, which is the basis for the monetary metric in RFM.

In [16]:
df_clean['amount'] = df_clean['quantity'] * df_clean['price']

## Replace varying product_name values with one name per product_code
Standardizes product names so that one code does not carry several name variants; the name that appears in the most orders is kept.

In [17]:
# Count distinct orders for each (product_code, product_name) pair.
most_freq_product_name = df_clean.groupby(
    ['product_code','product_name'], as_index=False).agg(
    order_cnt=('order_id','nunique')).sort_values(
    ['product_code','order_cnt'], ascending=[True,False])
# Rank names within each code; method='first' breaks ties by row order.
most_freq_product_name['rank'] = most_freq_product_name.groupby(
    'product_code')['order_cnt'].rank(method='first', ascending=False)
# Keep only the top-ranked name per code.
most_freq_product_name = most_freq_product_name[most_freq_product_name['rank']==1].drop(
    columns=['order_cnt','rank'])

In [18]:
df_clean = df_clean.merge(
    most_freq_product_name.rename(
        columns={'product_name':'most_freq_product_name'}), how='left', on='product_code')
df_clean['product_name'] = df_clean['most_freq_product_name']
df_clean = df_clean.drop(columns='most_freq_product_name')
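
The same idea can be sketched more compactly on toy data. This is an alternative formulation, not the notebook's own code: it uses drop_duplicates and map instead of rank and merge, and the frame items and the series canonical are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: code "21539" appears under two spellings.
items = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "product_code": ["21539", "21539", "21539", "84578"],
    "product_name": ["red retrospot butter dish", "red retrospot butter dish",
                     "retro spots butter dish", "elephant toy"],
})

# Count distinct orders per (code, name), most frequent name first within each code.
counts = (items.groupby(["product_code", "product_name"], as_index=False)
               .agg(order_cnt=("order_id", "nunique"))
               .sort_values(["product_code", "order_cnt"], ascending=[True, False]))

# The first row per code is its most frequent name.
canonical = counts.drop_duplicates("product_code").set_index("product_code")["product_name"]

# Overwrite every name with the canonical name of its code.
items["product_name"] = items["product_code"].map(canonical)
print(items)
```

After the mapping, all three rows of code 21539 share the spelling that was used in two of the three orders.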

## Convert customer_id to string
Treats the ID as a categorical label rather than a number. Note that casting a float column directly keeps the decimal suffix, so 12346.0 becomes '12346.0', as the outputs below show.

In [19]:
df_clean['customer_id'] = df_clean['customer_id'].astype(str)
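
A short sketch of what the cast does to float IDs, using two customer IDs that appear in the data. The variant that casts to an integer first is only an illustration of how the trailing ".0" could be avoided; it assumes no missing IDs remain, which holds here because rows without customer_id were dropped earlier.

```python
import pandas as pd

ids = pd.Series([12346.0, 14590.0])

# Direct cast keeps the float formatting.
as_str = ids.astype(str)

# Casting to an integer type first drops the trailing ".0" (requires no NaN values).
as_clean_str = ids.astype("int64").astype(str)

print(list(as_str), list(as_clean_str))
```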

## Remove outliers
Prevents extreme values from distorting the recency, frequency, and monetary metrics. Rows whose quantity or amount lies 3 or more standard deviations from the mean are dropped.

In [20]:
from scipy import stats
df_clean = df_clean[(np.abs(stats.zscore(df_clean[['quantity','amount']]))<3).all(axis=1)]
df_clean = df_clean.reset_index(drop=True)
df_clean

Unnamed: 0,order_id,product_code,product_name,quantity,order_date,price,customer_id,order_status,amount
0,C493411,21539,red retrospot butter dish,1,2010-01-04 09:43:00,4.25,14590.0,cancelled,4.25
1,493414,21844,red retrospot mug,36,2010-01-04 10:28:00,2.55,14590.0,delivered,91.80
2,493414,21533,retro spot large milk jug,12,2010-01-04 10:28:00,4.25,14590.0,delivered,51.00
3,493414,37508,new england ceramic cake server,2,2010-01-04 10:28:00,2.55,14590.0,delivered,5.10
4,493414,35001G,hand open shape gold,2,2010-01-04 10:28:00,4.25,14590.0,delivered,8.50
...,...,...,...,...,...,...,...,...,...
358464,539988,84380,set of 3 butterfly cookie cutters,1,2010-12-23 16:06:00,1.25,18116.0,delivered,1.25
358465,539988,84849D,hot baths soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358466,539988,84849B,fairy soap soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358467,539988,22854,cream sweetheart egg holder,2,2010-12-23 16:06:00,4.95,18116.0,delivered,9.90
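
The z-score filter can be illustrated on toy data without SciPy. The sketch below computes the z-score by hand with the population standard deviation (ddof=0), which is the default used by scipy.stats.zscore; the frame toy and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data: 20 ordinary rows plus one extreme row.
quantity = [1, 2, 3, 4] * 5 + [1000]
toy = pd.DataFrame({"quantity": quantity})
toy["amount"] = toy["quantity"] * 2.0

cols = toy[["quantity", "amount"]]

# Population z-score per column: (x - mean) / std with ddof=0.
z = (cols - cols.mean()) / cols.std(ddof=0)

# Keep rows where every column is within 3 standard deviations of its mean.
filtered = toy[(z.abs() < 3).all(axis=1)].reset_index(drop=True)
print(len(toy), len(filtered))
```

Only the extreme row is removed. Because the mean and standard deviation are themselves pulled by extreme values, this rule is a coarse filter rather than a precise outlier test.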


## Rename the order_date column to date
For naming consistency and to follow a more common convention.

In [23]:
df_clean = df_clean.rename(columns={'order_date':'date'})
df_clean

Unnamed: 0,order_id,product_code,product_name,quantity,date,price,customer_id,order_status,amount
0,C493411,21539,red retrospot butter dish,1,2010-01-04 09:43:00,4.25,14590.0,cancelled,4.25
1,493414,21844,red retrospot mug,36,2010-01-04 10:28:00,2.55,14590.0,delivered,91.80
2,493414,21533,retro spot large milk jug,12,2010-01-04 10:28:00,4.25,14590.0,delivered,51.00
3,493414,37508,new england ceramic cake server,2,2010-01-04 10:28:00,2.55,14590.0,delivered,5.10
4,493414,35001G,hand open shape gold,2,2010-01-04 10:28:00,4.25,14590.0,delivered,8.50
...,...,...,...,...,...,...,...,...,...
358464,539988,84380,set of 3 butterfly cookie cutters,1,2010-12-23 16:06:00,1.25,18116.0,delivered,1.25
358465,539988,84849D,hot baths soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358466,539988,84849B,fairy soap soap holder,1,2010-12-23 16:06:00,1.69,18116.0,delivered,1.69
358467,539988,22854,cream sweetheart egg holder,2,2010-12-23 16:06:00,4.95,18116.0,delivered,9.90


# Building the RFM segmentation

## Aggregate transactions into a per-customer summary: number of orders, total order value, and last order date
Goals:
* Reduce the transaction data to one row per customer.
* Prepare the RFM features: recency (last order date), frequency (number of orders), monetary (total spend).
* Support customer segmentation by value and purchasing behavior.
* Make data-driven customer analysis and marketing strategy easier.

Aggregating transactions per customer makes it possible to evaluate each customer's value and behavior individually, which is the foundation for RFM-based segmentation and retention strategy.

In [24]:
df_user = df_clean.groupby('customer_id', as_index=False).agg(
    order_cnt=('order_id','nunique'),
    max_order_date=('date','max'),total_order_value=('amount','sum'))
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value
0,12346.0,5,2010-10-04 09:54:00,602.40
1,12608.0,1,2010-10-31 10:49:00,415.79
2,12745.0,2,2010-08-10 10:14:00,723.85
3,12746.0,2,2010-06-30 08:19:00,266.35
4,12747.0,19,2010-12-13 10:41:00,4094.79
...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77
3885,18284.0,2,2010-10-06 12:31:00,486.68
3886,18285.0,1,2010-02-17 10:24:00,427.00
3887,18286.0,2,2010-08-20 11:57:00,941.48


## Add a column with the number of days since the last order
* This measures recency: how many days have passed since the customer's last transaction.
* The day_since_last_order column holds the number of whole days between the customer's last order and the latest date in the dataset (today).
* A smaller value means the customer is still active or transacted recently.
* A larger value means the customer has not transacted for a long time and may no longer be active.

In [26]:
today = df_clean['date'].max()
df_user['day_since_last_order'] = (today - df_user['max_order_date']).dt.days
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order
0,12346.0,5,2010-10-04 09:54:00,602.40,80
1,12608.0,1,2010-10-31 10:49:00,415.79,53
2,12745.0,2,2010-08-10 10:14:00,723.85,135
3,12746.0,2,2010-06-30 08:19:00,266.35,176
4,12747.0,19,2010-12-13 10:41:00,4094.79,10
...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31
3885,18284.0,2,2010-10-06 12:31:00,486.68,78
3886,18285.0,1,2010-02-17 10:24:00,427.00,309
3887,18286.0,2,2010-08-20 11:57:00,941.48,125
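
As a worked check of two rows in the table above, the sketch below recomputes the day difference for customers 12747 and 12346. It assumes today equals 2010-12-23 16:06:00, the latest max_order_date in the data; .dt.days keeps only the whole days of each difference and discards the remaining hours.

```python
import pandas as pd

# Assumed reference date: the latest order timestamp in the cleaned data.
today = pd.Timestamp("2010-12-23 16:06:00")

# Last order timestamps of customers 12747 and 12346 from the table above.
last_orders = pd.Series(pd.to_datetime(["2010-12-13 10:41:00", "2010-10-04 09:54:00"]))

# Timedelta -> whole days (the leftover hours and minutes are dropped).
days = (today - last_orders).dt.days
print(list(days))
```

The results, 10 and 80, match the day_since_last_order values shown for these two customers.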


In [27]:
df_user.describe()

Unnamed: 0,order_cnt,max_order_date,total_order_value,day_since_last_order
count,3889.0,3889,3889.0,3889.0
mean,5.128568,2010-09-23 18:15:51.267678208,1544.623084,90.651581
min,1.0,2010-01-05 12:43:00,1.25,0.0
25%,1.0,2010-08-19 12:30:00,296.36,25.0
50%,3.0,2010-10-26 18:45:00,648.2,57.0
75%,6.0,2010-11-28 14:54:00,1585.94,126.0
max,163.0,2010-12-23 16:06:00,71970.39,352.0
std,8.49933,,3434.816315,88.883201


## Bin the days since last order into 5 bins with edges at min, P20, P40, P60, P80, max, labeled 5 down to 1 from the lowest bin to the highest, as the recency score
* Converts recency (days since the last transaction) into a discrete score from 1 to 5, which makes it easier to segment customers by how recently they transacted.
* The score simplifies the interpretation of customer recency.
* It matters for strategies such as retargeting, rewarding active customers, and reactivating lapsed customers.
* Meaning of recency_score:

| recency\_score | Meaning                      |
| -------------- | ---------------------------- |
| 5              | Very recent, very active     |
| 4              | Recent                       |
| 3              | Fairly active                |
| 2              | Becoming infrequent          |
| 1              | Inactive / long time ago     |


In [28]:
df_user['recency_score'] = pd.cut(df_user['day_since_last_order'],
                                  bins=[df_user['day_since_last_order'].min(),
                                        np.percentile(df_user['day_since_last_order'], 20),
                                        np.percentile(df_user['day_since_last_order'], 40),
                                        np.percentile(df_user['day_since_last_order'], 60),
                                        np.percentile(df_user['day_since_last_order'], 80),
                                        df_user['day_since_last_order'].max()],
                                  labels=[5, 4, 3, 2, 1],
                                  include_lowest=True).astype(int)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5
...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2
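
How percentile edges and reversed labels interact can be seen on a small series. This is a sketch with ten made-up day counts (the series days and the array edges are hypothetical names); it uses the same pd.cut arguments as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical days-since-last-order values, already sorted for readability.
days = pd.Series([0, 3, 10, 25, 40, 57, 80, 126, 200, 352])

# Edges at min, P20, P40, P60, P80, max.
edges = np.percentile(days, [0, 20, 40, 60, 80, 100])

# Fewer days since the last order -> higher score, so labels run 5 down to 1.
score = pd.cut(days, bins=edges, labels=[5, 4, 3, 2, 1],
               include_lowest=True).astype(int)
print(list(score))
```

Each bin receives two of the ten values, and include_lowest=True is what lets the minimum value (0) fall into the first bin instead of becoming missing.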


In [29]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 182.4+ KB


## Bin the number of orders into 5 bins with edges at 0, P20, P40, P60, P80, max, labeled 1 to 5 from the lowest bin to the highest, as the frequency score
* Converts each customer's number of orders (order_cnt) into a discrete score from 1 to 5 that measures how often the customer transacts.
* frequency_score helps separate loyal customers (many orders) from passive ones.
* It matters for rewarding loyal customers, segmenting membership programs, or upselling.
* Meaning of frequency_score:

| frequency\_score | Meaning                  |
| ---------------- | ------------------------ |
| 5                | Transacts very often     |
| 4                | Transacts fairly often   |
| 3                | Average                  |
| 2                | Rarely                   |
| 1                | Very rarely              |

Using df_user['order_cnt'].min() as the lowest edge does not work for this column: the minimum and the 20th percentile of order_cnt are both 1, and pd.cut raises a ValueError when bin edges are duplicated. The cell below therefore uses 0 as the lowest edge.

In [31]:
df_user['frequency_score'] = pd.cut(df_user['order_cnt'],
                                    bins=[0,
                                          np.percentile(df_user['order_cnt'], 20),
                                          np.percentile(df_user['order_cnt'], 40),
                                          np.percentile(df_user['order_cnt'], 60),
                                          np.percentile(df_user['order_cnt'], 80),
                                          df_user['order_cnt'].max()],
                                    labels=[1, 2, 3, 4, 5],
                                    include_lowest=True).astype(int)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score,frequency_score
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2,4
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3,1
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2,2
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1,2
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5,5
...,...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4,4
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2,2
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1,1
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2,2
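
The reason for starting the bins at 0 rather than at the minimum can be reproduced on toy data. In this sketch (the series order_cnt and the flag duplicate_edges_rejected are hypothetical), the minimum and the 20th percentile are both 1, so the first attempt fails with a ValueError and the second succeeds.

```python
import numpy as np
import pandas as pd

# Hypothetical order counts; many customers have exactly one order.
order_cnt = pd.Series([1, 1, 1, 2, 2, 3, 4, 6, 9, 19])
p20, p40, p60, p80 = np.percentile(order_cnt, [20, 40, 60, 80])

# Attempt 1: min as the lowest edge duplicates P20 (both are 1).
try:
    pd.cut(order_cnt, bins=[order_cnt.min(), p20, p40, p60, p80, order_cnt.max()],
           labels=[1, 2, 3, 4, 5], include_lowest=True)
    duplicate_edges_rejected = False
except ValueError:
    duplicate_edges_rejected = True

# Attempt 2: 0 as the lowest edge keeps all edges distinct.
score = pd.cut(order_cnt, bins=[0, p20, p40, p60, p80, order_cnt.max()],
               labels=[1, 2, 3, 4, 5], include_lowest=True).astype(int)
print(duplicate_edges_rejected, list(score))
```

With this choice, frequency score 1 means exactly one order whenever P20 equals 1, which is consistent with the segments defined on frequency 1 further down.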


In [32]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
 6   frequency_score       3889 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 212.8+ KB


## Bin the total order value into 5 bins with edges at min, P20, P40, P60, P80, max, labeled 1 to 5 from the lowest bin to the highest, as the monetary score
* Converts each customer's total transaction value (total_order_value) into a discrete score from 1 to 5 that represents the customer's financial contribution to the business.
* monetary_score shows which customers contribute the most revenue.
* It suits strategies such as retaining premium customers, exclusive promotions, or service prioritization.
* Meaning of monetary_score:

| monetary\_score | Meaning                |
| --------------- | ---------------------- |
| 5               | High-value customer    |
| 4               | Fairly valuable        |
| 3               | Average                |
| 2               | Low value              |
| 1               | Very low value         |


In [33]:
df_user['monetary_score'] = pd.cut(df_user['total_order_value'],
                                   bins=[df_user['total_order_value'].min(),
                                         np.percentile(df_user['total_order_value'], 20),
                                         np.percentile(df_user['total_order_value'], 40),
                                         np.percentile(df_user['total_order_value'], 60),
                                         np.percentile(df_user['total_order_value'], 80),
                                         df_user['total_order_value'].max()],
                                   labels=[1, 2, 3, 4, 5],
                                   include_lowest=True).astype(int)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score,frequency_score,monetary_score
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2,4,3
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3,1,2
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2,2,3
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1,2,2
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5,5,5
...,...,...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4,4,3
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2,2,3
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1,1,2
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2,2,4


In [34]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
 6   frequency_score       3889 non-null   int64         
 7   monetary_score        3889 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 243.2+ KB


## Add a segment name column based on the recency and frequency scores
* Groups customers into distinct behavioral segments based on how recently and how often they purchase.
* Using the combination of recency_score and frequency_score, customers are classified into 10 segments with np.select().
* The segmentation supports marketing decisions: which customers to retain, grow, or win back.
* It is useful for campaign personalization, promotion budget allocation, and customer retention.
* Segments and their meaning:

| Segment                | Criteria                   | Key characteristics                                          |
| ---------------------- | -------------------------- | ------------------------------------------------------------ |
| 01-Champion            | Recency 5 & Frequency ≥4   | Active customers who purchase very often                     |
| 02-Loyal Customers     | Recency 3–4 & Frequency ≥4 | Purchase often; their loyalty needs to be maintained         |
| 03-Potential Loyalists | Recency ≥4 & Frequency 2–3 | Recently active, with potential to become loyal              |
| 04-Can't Lose Them     | Recency ≤2 & Frequency 5   | Used to purchase very often, but have not purchased recently |
| 05-Need Attention      | Recency 3 & Frequency 3    | Average; can be improved                                     |
| 06-New Customers       | Recency 5 & Frequency 1    | New customers who need further engagement                    |
| 07-Promising           | Recency 4 & Frequency 1    | Purchased only once, fairly promising                        |
| 08-At Risk             | Recency ≤2 & Frequency 3–4 | Becoming passive; preventive action needed                   |
| 09-About to Sleep      | Recency 3 & Frequency ≤2   | Less active and purchase rarely                              |
| 10-Hibernating         | Recency ≤2 & Frequency ≤2  | Very inactive; likely to churn                               |


In [36]:
df_user['segment'] = np.select(
    [(df_user['recency_score']==5) & (df_user['frequency_score']>=4),
     (df_user['recency_score'].between(3, 4)) & (df_user['frequency_score']>=4),
     (df_user['recency_score']>=4) & (df_user['frequency_score'].between(2, 3)),
     (df_user['recency_score']<=2) & (df_user['frequency_score']==5),
     (df_user['recency_score']==3) & (df_user['frequency_score']==3),
     (df_user['recency_score']==5) & (df_user['frequency_score']==1),
     (df_user['recency_score']==4) & (df_user['frequency_score']==1),
     (df_user['recency_score']<=2) & (df_user['frequency_score'].between(3, 4)),
     (df_user['recency_score']==3) & (df_user['frequency_score']<=2),
     (df_user['recency_score']<=2) & (df_user['frequency_score']<=2)],
    ['01-Champion', '02-Loyal Customers', '03-Potential Loyalists', "04-Can't Lose Them", '05-Need Attention',
     '06-New Customers', '07-Promising', '08-At Risk', '09-About to Sleep', '10-Hibernating'],
    default='Other'
)
df_user

Unnamed: 0,customer_id,order_cnt,max_order_date,total_order_value,day_since_last_order,recency_score,frequency_score,monetary_score,segment
0,12346.0,5,2010-10-04 09:54:00,602.40,80,2,4,3,08-At Risk
1,12608.0,1,2010-10-31 10:49:00,415.79,53,3,1,2,09-About to Sleep
2,12745.0,2,2010-08-10 10:14:00,723.85,135,2,2,3,10-Hibernating
3,12746.0,2,2010-06-30 08:19:00,266.35,176,1,2,2,10-Hibernating
4,12747.0,19,2010-12-13 10:41:00,4094.79,10,5,5,5,01-Champion
...,...,...,...,...,...,...,...,...,...
3884,18283.0,6,2010-11-22 15:30:00,641.77,31,4,4,3,02-Loyal Customers
3885,18284.0,2,2010-10-06 12:31:00,486.68,78,2,2,3,10-Hibernating
3886,18285.0,1,2010-02-17 10:24:00,427.00,309,1,1,2,10-Hibernating
3887,18286.0,2,2010-08-20 11:57:00,941.48,125,2,2,4,10-Hibernating
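
Because np.select returns the first matching rule and falls back to the default otherwise, it is worth checking that the ten rules cover every score combination exactly once. The sketch below applies the same conditions to all 25 (recency, frequency) pairs; the frame grid and the series matches_per_cell are hypothetical names.

```python
import itertools

import numpy as np
import pandas as pd

# Every possible (recency_score, frequency_score) pair, 1..5 each.
grid = pd.DataFrame(list(itertools.product(range(1, 6), range(1, 6))),
                    columns=["recency_score", "frequency_score"])
r, f = grid["recency_score"], grid["frequency_score"]

# Same rules as the segmentation cell, in the same order.
conditions = [
    (r == 5) & (f >= 4),
    r.between(3, 4) & (f >= 4),
    (r >= 4) & f.between(2, 3),
    (r <= 2) & (f == 5),
    (r == 3) & (f == 3),
    (r == 5) & (f == 1),
    (r == 4) & (f == 1),
    (r <= 2) & f.between(3, 4),
    (r == 3) & (f <= 2),
    (r <= 2) & (f <= 2),
]
labels = ["01-Champion", "02-Loyal Customers", "03-Potential Loyalists",
          "04-Can't Lose Them", "05-Need Attention", "06-New Customers",
          "07-Promising", "08-At Risk", "09-About to Sleep", "10-Hibernating"]

grid["segment"] = np.select(conditions, labels, default="Other")

# Number of rules that match each cell; 1 everywhere means no gaps and no overlaps.
matches_per_cell = sum(c.astype(int) for c in conditions)
print(grid["segment"].value_counts())
```

Every cell is matched by exactly one rule, so the 'Other' default is never used and the order of the rules does not affect the result.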


In [37]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   customer_id           3889 non-null   object        
 1   order_cnt             3889 non-null   int64         
 2   max_order_date        3889 non-null   datetime64[ns]
 3   total_order_value     3889 non-null   float64       
 4   day_since_last_order  3889 non-null   int64         
 5   recency_score         3889 non-null   int64         
 6   frequency_score       3889 non-null   int64         
 7   monetary_score        3889 non-null   int64         
 8   segment               3889 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(2)
memory usage: 273.6+ KB


## Summarize the RFM segmentation (point 8): number of customers, plus the mean and median of order count, total order value, and days since last order

In [39]:
summary = pd.pivot_table(df_user, index='segment',
               values=['customer_id','day_since_last_order','order_cnt','total_order_value'],
               aggfunc={'customer_id': 'nunique',
                        'day_since_last_order': ['mean', 'median'],
                        'order_cnt': ['mean', 'median'],
                        'total_order_value': ['mean', 'median']})

summary['pct_unique'] = (summary['customer_id'] / summary['customer_id'].sum() * 100).round(1)
summary

segment,customer_id (nunique),day_since_last_order (mean),day_since_last_order (median),order_cnt (mean),order_cnt (median),total_order_value (mean),total_order_value (median),pct_unique
01-Champion,553,10.533454,9.0,15.432188,10.0,4989.208761,2773.91,14.2
02-Loyal Customers,549,41.200364,37.0,8.744991,7.0,2618.121117,1937.05,14.1
03-Potential Loyalists,514,23.083658,24.0,2.830739,3.0,766.076265,621.005,13.2
04-Can't Lose Them,62,123.274194,113.0,11.467742,10.0,2851.737258,2268.405,1.6
05-Need Attention,184,58.505435,59.0,3.402174,3.0,1004.317071,826.37,4.7
06-New Customers,50,14.0,16.0,1.0,1.0,244.689,193.675,1.3
07-Promising,133,31.954887,32.0,1.0,1.0,288.694135,239.46,3.4
08-At Risk,418,141.5311,120.0,4.126794,4.0,1141.224835,866.32,10.7
09-About to Sleep,370,58.175676,58.0,1.416216,1.0,448.176081,336.735,9.5
10-Hibernating,1056,197.151515,199.0,1.3125,1.0,342.61845,256.9,27.2
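
A summary with flat column names can also be produced with groupby and named aggregation. This is an alternative sketch, not the notebook's method; the frame users, its values, and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-customer table with two segments.
users = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "segment": ["01-Champion", "01-Champion", "10-Hibernating", "10-Hibernating"],
    "order_cnt": [10, 20, 1, 1],
    "total_order_value": [1000.0, 3000.0, 50.0, 150.0],
    "day_since_last_order": [5, 9, 200, 300],
})

# Named aggregation gives one flat, explicitly named column per statistic.
flat = users.groupby("segment").agg(
    customers=("customer_id", "nunique"),
    recency_mean=("day_since_last_order", "mean"),
    recency_median=("day_since_last_order", "median"),
    orders_mean=("order_cnt", "mean"),
    value_mean=("total_order_value", "mean"),
)

# Share of customers per segment, in percent.
flat["pct_customers"] = (flat["customers"] / flat["customers"].sum() * 100).round(1)
print(flat)
```

Flat column names avoid the two-level header that pivot_table produces when several aggregation functions are passed, which makes later column selection and export simpler.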


In [40]:
summary['customer_id']

segment,nunique
01-Champion,553
02-Loyal Customers,549
03-Potential Loyalists,514
04-Can't Lose Them,62
05-Need Attention,184
06-New Customers,50
07-Promising,133
08-At Risk,418
09-About to Sleep,370
10-Hibernating,1056


Some insights:
* Largest segment: 10-Hibernating (1056 users), meaning many users have been inactive for a long time and rarely purchase.
* Smallest segment: 06-New Customers (50 users), followed by 04-Can't Lose Them (62 users), a small group of users who used to be active and valuable but are now becoming inactive.
* Best segment: 01-Champion (553 users), the most active users with the most recent transactions.

In [42]:
summary['customer_id'] / summary['customer_id'].sum() * 100

segment,nunique (% of total)
01-Champion,14.219594
02-Loyal Customers,14.11674
03-Potential Loyalists,13.216765
04-Can't Lose Them,1.59424
05-Need Attention,4.731293
06-New Customers,1.285678
07-Promising,3.419902
08-At Risk,10.748264
09-About to Sleep,9.514014
10-Hibernating,27.15351


Some insights:
* The largest segment is 10-Hibernating (27.15%). Many users have been inactive for a long time and need a re-engagement strategy.
* 01-Champion (14.22%), 02-Loyal Customers (14.12%), and 03-Potential Loyalists (13.22%) are the most valuable segments and should be retained and supported.
* 08-At Risk (10.75%) and 09-About to Sleep (9.51%) carry significant churn risk and need particular attention.
* 04-Can't Lose Them (1.59%) is small, but these users used to be active and high-value.
* 05-Need Attention, 06-New Customers, and 07-Promising show potential but need different approaches to become loyal customers.

# Conclusion
* Nearly half of the customers (about 47%) fall into passive segments (Hibernating, About to Sleep, and At Risk), while the high-value Champion and Loyal Customers segments together make up about 28% and are very important to retain.
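
The combined shares of the passive and high-value segments can be recomputed directly from the segment counts reported in the summary. The sketch below uses only those counts; the variable names are hypothetical.

```python
# Customer counts per segment, copied from the summary table.
counts = {
    "01-Champion": 553, "02-Loyal Customers": 549, "03-Potential Loyalists": 514,
    "04-Can't Lose Them": 62, "05-Need Attention": 184, "06-New Customers": 50,
    "07-Promising": 133, "08-At Risk": 418, "09-About to Sleep": 370,
    "10-Hibernating": 1056,
}
total = sum(counts.values())

# Passive segments: At Risk, About to Sleep, Hibernating.
passive = counts["08-At Risk"] + counts["09-About to Sleep"] + counts["10-Hibernating"]
# High-value segments: Champion and Loyal Customers.
top = counts["01-Champion"] + counts["02-Loyal Customers"]

passive_pct = round(passive / total * 100, 1)
top_pct = round(top / total * 100, 1)
print(total, passive_pct, top_pct)  # 3889 47.4 28.3
```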

# Strategy
* Retain high-value customers (Champion, Loyal Customers)\
Offer special rewards, loyalty programs, or early access to new products.
* Reactivate at-risk customers (At Risk, Can't Lose Them, About to Sleep)\
Use email marketing with exclusive offers or personal reminders.
* Encourage potential segments (Potential Loyalists, Promising)\
Offer small discounts or product education so they purchase more often.
* Re-engage passive customers (Hibernating)\
Reactivation campaigns or satisfaction surveys can help uncover why they became inactive.
* Nurture new and mid-tier customers (New Customers, Need Attention)\
Provide good onboarding and encourage repeat purchases with promotions aimed at new customers.