Creating an RFM segmentation for the customers who made a purchase during the 7 months of data. These customers were identified in the notebook Customer_Behavior_Profile_Analysis.ipynb

RFM Definitions
- Recency (R): Days since last purchase
- Frequency (F): Total number of purchases made by the customer
- Monetary Value (M): Total amount spent by the customer

Reference article: https://towardsdatascience.com/find-your-best-customers-with-customer-segmentation-in-python-61d602f9eee6
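
The three definitions above can be illustrated on a small transaction log. This is only a sketch: the table `tx`, its column names (`event_time`, `price`), and the `snapshot` reference date are hypothetical and are not taken from the capstone data, where R, F, and M are already precomputed as `days_since_last_purchase`, `total_purchases`, and `total_spent`.

```python
import pandas as pd

# Hypothetical transaction log; names and values are illustrative only.
tx = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "event_time": pd.to_datetime([
        "2020-01-05", "2020-03-01", "2020-02-10",
        "2020-01-20", "2020-02-15", "2020-04-20",
    ]),
    "price": [20.0, 35.5, 100.0, 10.0, 15.0, 12.5],
})

# Reference date for recency; assumed here to be the last day covered by the data.
snapshot = pd.Timestamp("2020-04-30")

# One row per customer: last purchase date, purchase count (F), total spent (M).
rfm_example = (
    tx.groupby("user_id")
      .agg(last_purchase=("event_time", "max"),
           F=("event_time", "count"),
           M=("price", "sum"))
      .reset_index()
)

# Recency (R): whole days between the last purchase and the reference date.
rfm_example["R"] = (snapshot - rfm_example["last_purchase"]).dt.days

print(rfm_example[["user_id", "R", "F", "M"]])
```

Computing recency against a fixed reference date keeps the values comparable across customers, whatever day the notebook is rerun.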

In [3]:
# Loading basic needed libraries
import pandas as pd
import numpy as np
import gc
from functools import reduce
import datetime as dt
from datetime import date

# Loading libraries for S3 bucket connection
import boto3
import io
from io import StringIO,BytesIO, TextIOWrapper
import gzip

client = boto3.client('s3') 
resource = boto3.resource('s3') 

In [4]:
# Reading customers who purchased
main_custs = pd.read_csv('s3://myaws-capstone-bucket/data/customers_of_focus.csv')
main_custs.nunique()

user_id                     1817173
total_view                     2424
total_cart_add                  484
total_purchases                 366
total_sessions                  306
total_spent                  387698
min_spent                     39053
max_spent                     57825
cust_retailer_age               213
days_since_last_activity        213
first_view_age                  214
days_since_last_view            214
first_cart_age                  214
days_since_last_cart            214
first_purchase_age              211
days_since_last_purchase        211
dtype: int64

In [5]:
# Verifying distribution of numeric columns
pd.options.display.float_format = '{:.2f}'.format
main_custs.describe()

,user_id,total_view,total_cart_add,total_purchases,total_sessions,total_spent,min_spent,max_spent,cust_retailer_age,days_since_last_activity,first_view_age,days_since_last_view,first_cart_age,days_since_last_cart,first_purchase_age,days_since_last_purchase
count,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0,1817173.0
mean,566039111.99,70.34,6.75,3.14,14.94,1047.44,101.11,377.66,138.5,60.75,154.43,77.02,454.69,422.85,106.9,87.25
std,40622062.9,142.05,12.79,7.81,15.54,3794.79,220.67,379.26,61.93,58.39,399.83,402.4,1816.12,1822.08,62.78,60.75
min,101875240.0,0.0,0.0,1.0,1.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,526974250.0,10.0,2.0,1.0,5.0,158.94,0.0,132.56,93.0,9.0,94.0,9.0,59.0,25.0,53.0,33.0
50%,561943831.0,30.0,4.0,2.0,10.0,348.92,0.0,231.64,155.0,44.0,155.0,44.0,123.0,72.0,113.0,79.0
75%,598362706.0,79.0,7.0,3.0,20.0,900.92,130.25,476.18,194.0,101.0,194.0,103.0,165.0,129.0,164.0,135.0
max,649772024.0,57349.0,2186.0,1975.0,5788.0,790098.29,2574.07,2574.07,212.0,212.0,9999.0,9999.0,9999.0,9999.0,212.0,212.0


In [6]:
# Creating RFM DF
# Keeping only the columns needed for RFM; .copy() makes rfm_df an independent DataFrame
rfm_df = main_custs[['user_id','days_since_last_purchase','total_purchases','total_spent']].copy()
rfm_df.columns = ['user_id','R','F','M']  # Renaming the columns
rfm_df.head()

,user_id,R,F,M
0,101875240,105.0,1.0,184.52
1,107620212,91.0,1.0,244.28
2,128968633,121.0,3.0,358.79
3,136662675,139.0,1.0,102.65
4,145611266,16.0,2.0,81.56


In [7]:
rfm_df.describe()

,user_id,R,F,M
count,1817173.0,1817173.0,1817173.0,1817173.0
mean,566039111.99,87.25,3.14,1047.44
std,40622062.9,60.75,7.81,3794.79
min,101875240.0,0.0,1.0,0.42
25%,526974250.0,33.0,1.0,158.94
50%,561943831.0,79.0,2.0,348.92
75%,598362706.0,135.0,3.0,900.92
max,649772024.0,212.0,1975.0,790098.29


In [8]:
# Creating RFM Scores based on Quantile distribution
Q = rfm_df.quantile(q=[0.25,0.5,0.75])
Q = Q.to_dict()

# Recency Score will be calculated inversely from Frequency and Monetary Values
# The lower the recency the higher the score should be 
def R_Score(x,p,d):
    if x >= d[p][0.25]:
        return 1
    elif x >= d[p][0.50]:
        return 2
    elif x >= d[p][0.75]: 
        return 3
    else:
        return 4
    
# The higher the Frequency and Monetary Values the higher the score should be
def FM_Score(x,p,d):
    if x >= d[p][0.25]:
        return 4
    elif x >= d[p][0.50]:
        return 3
    elif x >= d[p][0.75]: 
        return 2
    else:
        return 1
    

# Creating RFM quantile columns
rfm_df['R_quantile_score'] = rfm_df['R'].apply(R_Score, args=('R',Q,))
rfm_df['F_quantile_score'] = rfm_df['F'].apply(FM_Score, args=('F',Q,))
rfm_df['M_quantile_score'] = rfm_df['M'].apply(FM_Score, args=('M',Q,))

# RFM Score column creation
rfm_df['RFM_Score'] = rfm_df['R_quantile_score'] + rfm_df['F_quantile_score'] + rfm_df['M_quantile_score']

rfm_df.head()

,user_id,R,F,M,R_quantile_score,F_quantile_score,M_quantile_score,RFM_Score
0,101875240,105.0,1.0,184.52,1,4,4,9
1,107620212,91.0,1.0,244.28,1,4,4,9
2,128968633,121.0,3.0,358.79,1,4,4,9
3,136662675,139.0,1.0,102.65,1,4,1,6
4,145611266,16.0,2.0,81.56,4,4,1,9


In [9]:
rfm_df['RFM_Score'].describe()  # Checking the RFM_Score distribution

count   1817173.00
mean          8.99
std           1.80
min           6.00
25%           9.00
50%           9.00
75%           9.00
max          12.00
Name: RFM_Score, dtype: float64
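
The distribution above is very narrow: the minimum is 6, all three quartiles are 9, and the maximum is 12. This follows from the order of the comparisons in `R_Score` and `FM_Score`. Any value at or above the 25th percentile matches the first branch, and any value below it is also below the 50th and 75th percentiles, so the two `elif` branches are never taken and each component takes only two of the four scores. A minimal sketch of a scoring that uses all four quartile bins, assuming the intent stated in the code comments (lower recency scores higher, higher frequency and monetary value score higher), is shown here. The helper names `r_score` and `fm_score` are hypothetical. The cut points and sample rows are copied from the outputs displayed earlier.

```python
import numpy as np
import pandas as pd

# Quartile cut points (25%, 50%, 75%) copied from the rfm_df.describe() output.
quartiles = {
    "R": [33.0, 79.0, 135.0],
    "F": [1.0, 2.0, 3.0],
    "M": [158.94, 348.92, 900.92],
}

def fm_score(values, cuts):
    """Score 1-4: values at or below the 25th percentile get 1, above the 75th get 4."""
    # side="left" counts how many cut points are strictly below each value (0..3).
    return np.searchsorted(cuts, values, side="left") + 1

def r_score(values, cuts):
    """Score 4-1: values at or below the 25th percentile get 4, above the 75th get 1."""
    return 4 - np.searchsorted(cuts, values, side="left")

# First five rows of rfm_df as displayed by rfm_df.head().
sample = pd.DataFrame({
    "R": [105.0, 91.0, 121.0, 139.0, 16.0],
    "F": [1.0, 1.0, 3.0, 1.0, 2.0],
    "M": [184.52, 244.28, 358.79, 102.65, 81.56],
})

sample["R_quantile_score"] = r_score(sample["R"], quartiles["R"])
sample["F_quantile_score"] = fm_score(sample["F"], quartiles["F"])
sample["M_quantile_score"] = fm_score(sample["M"], quartiles["M"])
sample["RFM_Score"] = (
    sample["R_quantile_score"] + sample["F_quantile_score"] + sample["M_quantile_score"]
)
print(sample)
```

Scoring whole columns with `np.searchsorted` also avoids a row-by-row `apply` over roughly 1.8 million customers. The scores this sketch produces differ from the ones recorded in this notebook, so the outputs shown here would change if it were adopted.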

In [10]:
rfm_df.head()

,user_id,R,F,M,R_quantile_score,F_quantile_score,M_quantile_score,RFM_Score
0,101875240,105.0,1.0,184.52,1,4,4,9
1,107620212,91.0,1.0,244.28,1,4,4,9
2,128968633,121.0,3.0,358.79,1,4,4,9
3,136662675,139.0,1.0,102.65,1,4,1,6
4,145611266,16.0,2.0,81.56,4,4,1,9


Creating customer segments from the RFM score:
- Top Customer - Customers with the maximum RFM score
- High Value Customer - Customers whose RFM score is above the 50th percentile but below the maximum score
- Mid Value Customer - Customers whose RFM score is above the 25th percentile but at or below the 50th percentile
- Low Value Customer - Customers whose RFM score is at or below the 25th percentile

In [11]:
# Creating customer segment based on RFM values
rfm_df.loc[(rfm_df['RFM_Score'] == 12),'user_segment'] = 'Top_Customer'
rfm_df.loc[(rfm_df['RFM_Score'] == 11),'user_segment'] = 'High_Value_Customer'
rfm_df.loc[(rfm_df['RFM_Score'] <= rfm_df['RFM_Score'].quantile(.75)),'user_segment'] = 'High_Value_Customer'
rfm_df.loc[(rfm_df['RFM_Score'] <= rfm_df['RFM_Score'].quantile(.50)),'user_segment'] = 'Mid_Value_Customer'
rfm_df.loc[(rfm_df['RFM_Score'] <= rfm_df['RFM_Score'].quantile(.25)),'user_segment'] = 'Low_Value_Customer'
rfm_df.head()

,user_id,R,F,M,R_quantile_score,F_quantile_score,M_quantile_score,RFM_Score,user_segment
0,101875240,105.0,1.0,184.52,1,4,4,9,Low_Value_Customer
1,107620212,91.0,1.0,244.28,1,4,4,9,Low_Value_Customer
2,128968633,121.0,3.0,358.79,1,4,4,9,Low_Value_Customer
3,136662675,139.0,1.0,102.65,1,4,1,6,Low_Value_Customer
4,145611266,16.0,2.0,81.56,4,4,1,9,Low_Value_Customer


In [12]:
# Counting the number of users in each segment
rfm_df['user_segment'].value_counts()

Low_Value_Customer    1491062
Top_Customer           326111
Name: user_segment, dtype: int64
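
Only two segments appear in the count above. The 25th, 50th, and 75th percentiles of `RFM_Score` are all 9, and the `.loc` assignments run in sequence, so the last one relabels every score of 9 or less as `Low_Value_Customer`. A minimal sketch of assigning the four segments as defined in the list above, using `np.select` so that the first matching rule wins, is shown here. The function name `assign_segment` and the example scores are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_segment(scores: pd.Series) -> pd.Series:
    """Label each RFM score; conditions are checked in order, first match wins."""
    q25 = scores.quantile(0.25)
    q50 = scores.quantile(0.50)
    conditions = [
        scores == scores.max(),  # Top: maximum RFM score
        scores > q50,            # High: above the median, below the maximum
        scores > q25,            # Mid: above the 25th percentile, up to the median
    ]
    labels = ["Top_Customer", "High_Value_Customer", "Mid_Value_Customer"]
    # Anything not matched above is at or below the 25th percentile.
    segments = np.select(conditions, labels, default="Low_Value_Customer")
    return pd.Series(segments, index=scores.index, name="user_segment")

# Made-up scores covering the full 3-12 range of a three-component, 1-4 scoring.
example_scores = pd.Series([3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
example_segments = assign_segment(example_scores)
print(example_segments.value_counts())
```

If the score distribution is as concentrated as the one recorded in this notebook, some segments can still come out empty, because several percentiles coincide.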

In [13]:
# Saving Results in S3
rfm_df = rfm_df[['user_id','RFM_Score','user_segment']]
rfm_df.to_csv('s3://myaws-capstone-bucket/data/rfm_segment.csv',index=False)