# Customer Segmentation with RFM Analysis on Online Retail Dataset

## Description

Stages of Project

1. Data Reading
2. Data Preparation
3. Defining RFM Metrics
4. Calculating RFM Scores
5. Segmentation with RFM Scores
6. Examining Outputs

**Business Problem:**

An e-commerce company wants to segment its customers and determine marketing strategies according to these segments.

**Story of dataset:**

https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

The Online Retail II data set was obtained from a UK-based online store.
It includes sales between 01/12/2009 and 09/12/2011.

**Variables:**

* Invoice: Invoice number, unique to each transaction. (If it starts with C, the transaction was canceled.)
* StockCode: Product code. Unique number for each product.
* Description: Product name.
* Quantity: Number of units of the product sold on the invoice.
* InvoiceDate: Invoice date and time.
* Price: Unit price of the product (in GBP).
* Customer ID: Unique customer number.
* Country: Country where the customer lives.
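
The cancellation rule from the variable list can be tried on a toy frame before touching the real data. This is a minimal sketch, assuming the column names `Invoice`, `Quantity` and `Price` as they appear in the Excel file; the rows are made up for illustration and the name `toy` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "Invoice": [489434, "C489449", 489435],
    "Quantity": [12, -1, 6],
    "Price": [6.95, 2.55, 1.25],
})

# Cast to str first: the column mixes integers and strings, and the
# .str accessor yields NaN for values that are not strings.
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")

# Keep only completed sales and compute the revenue per invoice line.
sales = toy.loc[~is_cancelled].copy()
sales["TotalPrice"] = sales["Quantity"] * sales["Price"]
print(sales)
```

`str.startswith("C")` follows the rule exactly as it is worded in the list; `str.contains("C")` would also match a "C" anywhere else in the invoice number.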

## Data Preparation and Understanding

### Reading Data, Required Libraries, Settings and Functions

In [5]:
import pandas as pd
import datetime as dt
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.width',1000)

In [8]:
df_ = pd.read_excel("./online_retail_II.xlsx", sheet_name="Year 2009-2010")

In [9]:
df = df_.copy()

In [10]:
def check_df(dataframe, head=5):
    print("Shape: \n")
    print(dataframe.shape)
    print("\n Types: \n")
    print(dataframe.dtypes)
    print("\n Types: \n")
    # info() prints its own summary and returns None, so print() also emits "None".
    print(dataframe.info())
    print("\n Head: \n")
    print(dataframe.head(head))
    print("\n Tail: \n")
    print(dataframe.tail(head))
    print("\n MissingValues: \n")
    print(dataframe.isnull().sum())
    print("\n Quantiles: \n")
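    # numeric_only=True skips the object columns, which recent pandas versions refuse to rank.
    print(dataframe.quantile([0, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 1], numeric_only=True).T)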

### Exploratory Data Analysis

In [11]:
check_df(df)

Shape: 

(525461, 8)

 Types: 

Invoice                object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
Price                 float64
Customer ID           float64
Country                object
dtype: object

 Types: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      525461 non-null  object        
 1   StockCode    525461 non-null  object        
 2   Description  522533 non-null  object        
 3   Quantity     525461 non-null  int64         
 4   InvoiceDate  525461 non-null  datetime64[ns]
 5   Price        525461 non-null  float64       
 6   Customer ID  417534 non-null  float64       
 7   Country      525461 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB
None

 Head

## Calculating RFM Metrics, RFM Scores and Segments

In [12]:
def create_rfm(dataframe, csv=False):

    # DATA PREPARATION
    # Work on a copy so the caller's DataFrame is not modified.
    dataframe = dataframe.dropna().copy()
    dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"]
    # Drop canceled transactions (invoice numbers containing "C").
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]

    # CALCULATING THE RFM METRICS
    # Reference date for recency. The sheet loaded above ("Year 2009-2010")
    # ends in December 2010, so every recency value measured from this date
    # carries an offset of roughly one year.
    today_date = dt.datetime(2011, 12, 11)
    rfm = dataframe.groupby('Customer ID').agg({'InvoiceDate': lambda date: (today_date - date.max()).days,
                                                'Invoice': lambda num: num.nunique(),
                                                "TotalPrice": lambda price: price.sum()})
    rfm.columns = ['recency', 'frequency', "monetary"]
    rfm = rfm[(rfm['monetary'] > 0)]

    # CALCULATING THE RFM SCORES
    # Lower recency is better, so the labels run from 5 down to 1.
    rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
    rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
    rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])

    # Recency and frequency scores are converted to strings and concatenated
    rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                        rfm['frequency_score'].astype(str))


    # NAMING THE SEGMENTS
    seg_map = {
        r'[1-2][1-2]': 'hibernating',
        r'[1-2][3-4]': 'at_risk',
        r'[1-2]5': 'cant_loose',
        r'3[1-2]': 'about_to_sleep',
        r'33': 'need_attention',
        r'[3-4][4-5]': 'loyal_customers',
        r'41': 'promising',
        r'51': 'new_customers',
        r'[4-5][2-3]': 'potential_loyalists',
        r'5[4-5]': 'champions'
    }

    rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)
    rfm = rfm[["recency", "frequency", "monetary", "segment"]]
    rfm.index = rfm.index.astype(int)

    if csv:
        rfm.to_csv("rfm.csv")

    return rfm

In [15]:
rfm_df = create_rfm(df, csv=True)
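
Inside `create_rfm`, the frequency score is computed on ranks rather than on the raw counts. One plausible reason, sketched below on made-up numbers, is that many customers share the same purchase count, so the quantile edges collide and `pd.qcut` raises a `ValueError`. Ranking with `method="first"` breaks the ties by order of appearance. The series `freq` is hypothetical.

```python
import pandas as pd

# Made-up frequencies with heavy ties, as is common for purchase counts.
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 10])

try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    raised = False
except ValueError:
    # Several quantile edges equal 1, so the bin edges are not unique.
    raised = True

# method="first" gives every row a distinct rank (1..10),
# so the five quantile bins are well defined.
scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(raised, scores.tolist())
```

A side effect worth keeping in mind: customers with identical frequency can land in different score bins, because ties are split by row order.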

## RFM Analysis

Mean and customer count of recency, frequency and monetary for each segment:

In [16]:
rfm_df[["segment", "recency", "frequency", "monetary"]].groupby("segment").agg(["mean", "count"])

| segment | recency mean | recency count | frequency mean | frequency count | monetary mean | monetary count |
|---|---|---|---|---|---|---|
| about_to_sleep | 418.82 | 343 | 1.2 | 343 | 441.32 | 343 |
| at_risk | 517.16 | 611 | 3.07 | 611 | 1188.88 | 611 |
| cant_loose | 489.12 | 77 | 9.12 | 77 | 4099.45 | 77 |
| champions | 372.12 | 663 | 12.55 | 663 | 6852.26 | 663 |
| hibernating | 578.89 | 1015 | 1.13 | 1015 | 403.98 | 1015 |
| loyal_customers | 401.29 | 742 | 6.83 | 742 | 2746.07 | 742 |
| need_attention | 418.27 | 207 | 2.45 | 207 | 1060.36 | 207 |
| new_customers | 373.58 | 50 | 1.0 | 50 | 386.2 | 50 |
| potential_loyalists | 383.79 | 517 | 2.02 | 517 | 729.51 | 517 |
| promising | 390.75 | 87 | 1.0 | 87 | 367.09 | 87 |
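
Each segment name is derived from a two-character string: the recency score followed by the frequency score. The monetary score is computed in `create_rfm` but does not enter the label. The sketch below replays the regex lookup on a few hand-picked score strings; the `scores` series is made up for illustration.

```python
import pandas as pd

# Same patterns as in create_rfm: first digit = recency score,
# second digit = frequency score.
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}

# Hand-picked RFM_SCORE strings, one per pattern of interest.
scores = pd.Series(["55", "12", "15", "33", "51", "43"])

# With regex=True, each dictionary key is treated as a regular expression.
segments = scores.replace(seg_map, regex=True)
print(segments.tolist())
```

Read this way, `champions` are customers with the top recency score and a frequency score of 4 or 5, while `hibernating` customers score 1 or 2 on both.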
