# eCommerce Events History in Cosmetics Shop View Prediction
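######  ※ Because the raw data file is too large, the Python code is split into two parts, data preprocessing and data analysis. This file is the final small part of the data analysis.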


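This project analyzes cosmetics e-commerce data and segments customers into clusters to uncover business value.  
The dataset contains consumer behavior data from a mid-sized cosmetics online store between October 2019 and February 2020. Each row represents one event, and each event is like a many-to-many relationship between products and users.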

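1. Data preprocessing
2. Exploratory data analysis
3. Key e-commerce metrics analysis
4. RSFM cluster analysis
5. Analysis of cluster results
6. Tableau visualization report
7. Cluster comparison against the test set
8. Conclusion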

Dataset: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop (from the REES46 Marketing Platform).

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Import Data
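Import two datasets that have already been clustered.  
data4m is the clustered data we mainly rely on; data5m is the test set that also includes the final month.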

In [2]:
data4m = pd.read_csv('./final_data2.csv')
data5m = pd.read_csv('./final_data.csv')

## Processing

In [3]:
# Keep the same date range as the four-month data, excluding customers who first appear in February
data5m['date'] = pd.to_datetime(data5m['date'])
data5m = data5m[data5m['date'] <= '2020-01-31']
data5m.tail()

Unnamed: 0,event_time,event_type,product_id,category_id,brand,price,user_id,date,month,day,weekday,hr,group
15661355,2020-01-31 23:59:52,view,5796984,1487580005671109489,masura,1.73,417102560,2020-01-31,Jan 2020,31,Friday,23,Others
15661356,2020-01-31 23:59:52,remove_from_cart,5886774,1487580006317032337,Unknown,1.59,607092857,2020-01-31,Jan 2020,31,Friday,23,New
15661357,2020-01-31 23:59:53,view,5875432,2084144451428549153,Unknown,2.05,423651741,2020-01-31,Jan 2020,31,Friday,23,Others
15661358,2020-01-31 23:59:57,remove_from_cart,5820745,1487580006317032337,Unknown,2.22,607092857,2020-01-31,Jan 2020,31,Friday,23,New
15661359,2020-01-31 23:59:58,remove_from_cart,5820745,1487580006317032337,Unknown,2.22,607092857,2020-01-31,Jan 2020,31,Friday,23,New


In [4]:
# Keep only purchase events in both datasets
data4m = data4m[data4m['event_type'] == 'purchase']
data5m = data5m[data5m['event_type'] == 'purchase']
data4m['user_id'].nunique()

92652

In [5]:
data5m['user_id'].nunique()

92652

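Both datasets contain the same number of clustered customers, 92,652, so that checks out. Next, we put the cluster labels from both datasets into a single DataFrame.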
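
Equal counts do not by themselves guarantee that both datasets contain the same customers. A stricter check could compare the sets of IDs directly. The sketch below is only an illustration: `same_customers` is a hypothetical helper that is not part of the original notebook, and the toy frames stand in for `data4m` and `data5m`.

```python
import pandas as pd

def same_customers(left: pd.DataFrame, right: pd.DataFrame, key: str = "user_id") -> bool:
    """Return True when both frames contain exactly the same set of key values."""
    return set(left[key].unique()) == set(right[key].unique())

# Toy frames standing in for data4m and data5m (values are made up).
toy_4m = pd.DataFrame({"user_id": [1, 1, 2, 3]})
toy_5m = pd.DataFrame({"user_id": [3, 2, 2, 1]})
toy_other = pd.DataFrame({"user_id": [1, 2, 4]})

print(same_customers(toy_4m, toy_5m))     # True: same three IDs
print(same_customers(toy_4m, toy_other))  # False: 3 is missing, 4 is extra
```

In the notebook this would be called as `same_customers(data4m, data5m)` after the purchase filter.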

In [6]:
# Keep only the customer ID and cluster label from each clustering run
data4m_n = data4m[['user_id', 'group']]
# rename() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data5m_n = data5m[['user_id', 'group']].rename(columns={'group': 'group_5m'})
data = pd.merge(data4m_n, data5m_n, on='user_id', how='left')
data = data.drop_duplicates()
group_order = ['Normal', 'New', 'Loyal', 'HighP', 'Sleeping']
data['group'] = pd.Categorical(data['group'], categories=group_order, ordered=True)
data['group_5m'] = pd.Categorical(data['group_5m'], categories=group_order, ordered=True)
data


Unnamed: 0,user_id,group,group_5m
0,474232307,Sleeping,Sleeping
380,543446752,Sleeping,Sleeping
1605,463240011,Sleeping,Sleeping
1966,550021912,Sleeping,Sleeping
1970,250032160,Sleeping,Sleeping
...,...,...,...
37247625,608809325,HighP,HighP
37247626,608817151,New,New
37252803,608812129,New,New
37252867,593827158,New,New
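
The large index values in the output suggest that the merge runs on one row per purchase event and is deduplicated only afterwards. A lighter alternative, sketched here under the assumption that each customer has exactly one label per clustering run, is to deduplicate before merging. `pair_labels` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd

def pair_labels(first: pd.DataFrame, second: pd.DataFrame) -> pd.DataFrame:
    """Return one row per user with the label from each clustering run."""
    a = first[["user_id", "group"]].drop_duplicates()
    b = (second[["user_id", "group"]]
         .drop_duplicates()
         .rename(columns={"group": "group_5m"}))
    # validate= makes pandas raise a MergeError if any user carries
    # more than one label on either side.
    return a.merge(b, on="user_id", how="left", validate="one_to_one")

# Toy purchase events standing in for data4m and data5m.
toy_4m = pd.DataFrame({"user_id": [1, 1, 2], "group": ["New", "New", "Loyal"]})
toy_5m = pd.DataFrame({"user_id": [1, 2, 2], "group": ["Normal", "Loyal", "Loyal"]})

paired = pair_labels(toy_4m, toy_5m)
print(paired)
```

The resulting frame gets a fresh 0..n-1 index, so its row labels would differ from the ones shown in the output above.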


## Check Different Values
Let's compare the results of clustering on four months of data versus five months of data.

In [7]:
diff_values = data[data['group'] != data['group_5m']]
diff_values.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12195 entries, 25306 to 37213817
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   user_id   12195 non-null  int64   
 1   group     12195 non-null  category
 2   group_5m  12195 non-null  category
dtypes: category(2), int64(1)
memory usage: 214.8 KB


In [8]:
count = pd.DataFrame(diff_values.groupby('group').size()).rename(columns={0: 'diff'})
count

Unnamed: 0_level_0,diff
group,Unnamed: 1_level_1
Normal,3981
New,6006
Loyal,1037
HighP,157
Sleeping,1014


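At first glance the numbers do not look too bad. We bring in the original size of each group (from the four-month clustering) to make the comparison easier.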

In [9]:
original_count= pd.DataFrame({'group': ['Normal', 'New', 'Loyal', 'HighP', 'Sleeping'],
                              'count': [37255, 36877, 5091, 2548, 28747]})

final_data= pd.merge(original_count, count, on='group' )
final_data

Unnamed: 0,group,count,diff
0,Normal,37255,3981
1,New,36877,6006
2,Loyal,5091,1037
3,HighP,2548,157
4,Sleeping,28747,1014


Convert to percentages.

In [10]:
final_data['rate']= (final_data['diff'] / final_data['count']).apply(lambda x: f'{x * 100:.2f}%')
final_data

Unnamed: 0,group,count,diff,rate
0,Normal,37255,3981,10.69%
1,New,36877,6006,16.29%
2,Loyal,5091,1037,20.37%
3,HighP,2548,157,6.16%
4,Sleeping,28747,1014,3.53%
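
The hard-coded group sizes above sum to 110,518, while the purchase-only data contains 92,652 distinct customers, so those denominators may come from a broader population than `data`. As a cross-check, the rates could also be computed with denominators taken from the paired labels themselves. This is a minimal sketch, not the method used in this notebook; `change_rate_by_group` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def change_rate_by_group(pairs: pd.DataFrame) -> pd.DataFrame:
    """Per original group: size, number of changed labels, and change rate."""
    # Compare as strings so the check also works for categorical columns.
    changed = pairs["group"].astype(str) != pairs["group_5m"].astype(str)
    summary = (pairs.assign(changed=changed)
               .groupby("group", observed=True)
               .agg(count=("changed", "size"), diff=("changed", "sum")))
    summary["rate"] = summary["diff"] / summary["count"]
    return summary

# Toy paired labels standing in for `data`.
toy_pairs = pd.DataFrame({
    "group":    ["New", "New",    "Loyal", "Loyal"],
    "group_5m": ["New", "Normal", "Loyal", "Loyal"],
})
summary = change_rate_by_group(toy_pairs)
print(summary)
```

Applied to `data`, this would use each group's size within the purchase-only customers as the denominator.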


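The only difference between the two clustering runs is whether February's data is included, so these changes reflect the effect of February's data on each customer's cluster.  
In other words, these customers' behavior in February changed the cluster they belong to.  
(Sleeping customers change the least, which is reasonable: by definition they "came a long time ago but have not been back for a long time.")  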

In [11]:
overall_rate = (final_data['diff'].sum() / final_data['count'].sum())
print(f'Overall Rate: {overall_rate:%}')

Overall Rate: 11.034402%


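Overall, about 11% of customers changed cluster once February's data was taken into account. Next, we look at the number and share of customers moving from each cluster to each other cluster.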

In [12]:
# For each original (four-month) group, show where its customers ended up
# in the five-month clustering. Names avoid shadowing the built-in `type`.
groups = ["Normal", "New", "Loyal", "HighP", "Sleeping"]
for src_group in groups:
    counts = data[data["group"] == src_group].groupby("group_5m").size()
    total = counts.sum()

    for dst_group, n in counts.items():
        percentage = (n / total) * 100 if total > 0 else 0
        print(f'{src_group} to {dst_group}: {n} ({percentage:.2f}%)')

Normal to Normal: 29781 (88.21%)
Normal to New: 561 (1.66%)
Normal to Loyal: 379 (1.12%)
Normal to HighP: 1 (0.00%)
Normal to Sleeping: 3040 (9.00%)
New to Normal: 5764 (23.04%)
New to New: 19014 (76.00%)
New to Loyal: 212 (0.85%)
New to HighP: 30 (0.12%)
New to Sleeping: 0 (0.00%)
Loyal to Normal: 790 (15.04%)
Loyal to New: 2 (0.04%)
Loyal to Loyal: 4216 (80.26%)
Loyal to HighP: 0 (0.00%)
Loyal to Sleeping: 245 (4.66%)
HighP to Normal: 48 (2.16%)
HighP to New: 28 (1.26%)
HighP to Loyal: 1 (0.05%)
HighP to HighP: 2064 (92.93%)
HighP to Sleeping: 80 (3.60%)
Sleeping to Normal: 872 (3.30%)
Sleeping to New: 0 (0.00%)
Sleeping to Loyal: 141 (0.53%)
Sleeping to HighP: 1 (0.00%)
Sleeping to Sleeping: 25382 (96.16%)
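
The same listing can be arranged as a transition table, with original groups as rows and five-month groups as columns. The sketch below uses `pd.crosstab` on toy data; `transition_table` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd

ORDER = ["Normal", "New", "Loyal", "HighP", "Sleeping"]

def transition_table(pairs: pd.DataFrame, normalize: bool = False) -> pd.DataFrame:
    """Rows: original group. Columns: five-month group. Cells: counts or row shares."""
    table = pd.crosstab(pairs["group"], pairs["group_5m"],
                        normalize="index" if normalize else False)
    # Show every group in a fixed order, filling unseen combinations with 0.
    return table.reindex(index=ORDER, columns=ORDER, fill_value=0)

# Toy paired labels standing in for `data`.
toy_pairs = pd.DataFrame({
    "group":    ["New", "New",    "New", "Loyal"],
    "group_5m": ["New", "Normal", "New", "Loyal"],
})
counts_table = transition_table(toy_pairs)
share_table = transition_table(toy_pairs, normalize=True)
print(counts_table)
print(share_table.round(2))
```

With `normalize=True` each row sums to 1, so a cell is the share of an original group that ended up in a given five-month group.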


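After adding February's consumer behavior data, the notable patterns across clusters include:
- In every cluster, most customers keep their original cluster (76.00% to 96.16%), which supports the usefulness of the four-month clustering.
- Moving to Normal is the most common change for New (23.04%), Loyal (15.04%), and Sleeping (3.30%). This is plausible, since behavior tends to converge toward the average as the time window gets longer. HighP is the exception: its most common change is to Sleeping.
- Moving to Sleeping is the most common change for Normal (9.00%) and HighP (3.60%) and the second most common for Loyal (4.66%); no New customer moved to Sleeping. This is probably because, once February's data is added, earlier customers with no purchases in February are more easily classified as sleeping customers.
- Very few customers move into the high-purchasing-power group (HighP). In practice these are VIP customers and a segment worth focusing on.
- Very few customers move into the long-term loyal group (Loyal) either. It is worth considering how to increase purchase frequency, although this may be related to the nature of the store (cosmetics).
- Moves into New are also rare (at most 1.66% from any cluster), which makes sense because we are only appending one month to the end of the original data.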
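
To double-check these observations, the transition counts printed above can be typed into a small table and ranked. This is only a sketch built on the printed numbers; the variable names are illustrative.

```python
import pandas as pd

groups = ["Normal", "New", "Loyal", "HighP", "Sleeping"]
# Rows: original (four-month) group. Columns: five-month group.
# Values are copied from the printed transition listing.
counts = pd.DataFrame(
    [[29781,   561,  379,    1,  3040],
     [ 5764, 19014,  212,   30,     0],
     [  790,     2, 4216,    0,   245],
     [   48,    28,    1, 2064,    80],
     [  872,     0,  141,    1, 25382]],
    index=groups, columns=groups,
)

# Zero out the diagonal so that only customers who changed group remain.
moves = counts.copy()
for g in groups:
    moves.loc[g, g] = 0

top_destination = moves.idxmax(axis=1)  # most common new group per original group
inflow = moves.sum(axis=0)              # customers arriving in each group

print(top_destination)
print(inflow)
```

`top_destination` gives Sleeping for Normal and HighP, and Normal for New, Loyal, and Sleeping. `inflow` totals 12,195, which matches the number of changed customers found earlier.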

*※ The code ends here. See the slide deck for the project's analysis and conclusions.*