# eCommerce Events History in Cosmetics Shop Data Processing
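######  ※ Because the raw data files are very large, the Python code is split into two parts, data preprocessing and data analysis. This file belongs to the data preprocessing part.  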

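This project analyzes data from a cosmetics e-commerce store and segments its customers into clusters to uncover business value.  
The dataset contains consumer behavior data from a medium-sized cosmetics online store, recorded between October 2019 and February 2020. Each row represents one event, and each event is similar to a many-to-many relation between products and users.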

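1. Data preprocessing
2. Exploratory data analysis
3. Key business metrics analysis
4. RSFM cluster analysis
5. Cluster result analysis
6. Tableau visualization report
7. Test set cluster comparison
8. Conclusion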

Dataset: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop (from the REES46 Marketing Platform).

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import os 
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # disable scientific notation so large numbers stay readable

## Loading Data

In [2]:
data1 = pd.read_csv('./ecommerce/2019-Oct.csv')

In [3]:
data1.head()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2019-10-01 00:00:00 UTC,cart,5773203,1487580005134238553,,runail,2.62,463240011,26dd6e6e-4dac-4778-8d2c-92e149dab885
1,2019-10-01 00:00:03 UTC,cart,5773353,1487580005134238553,,runail,2.62,463240011,26dd6e6e-4dac-4778-8d2c-92e149dab885
2,2019-10-01 00:00:07 UTC,cart,5881589,2151191071051219817,,lovely,13.48,429681830,49e8d843-adf3-428b-a2c3-fe8bc6a307c9
3,2019-10-01 00:00:07 UTC,cart,5723490,1487580005134238553,,runail,2.62,463240011,26dd6e6e-4dac-4778-8d2c-92e149dab885
4,2019-10-01 00:00:15 UTC,cart,5881449,1487580013522845895,,lovely,0.56,429681830,49e8d843-adf3-428b-a2c3-fe8bc6a307c9


In [4]:
data2 = pd.read_csv('./ecommerce/2019-Nov.csv')
data3 = pd.read_csv('./ecommerce/2019-Dec.csv')
data4 = pd.read_csv('./ecommerce/2020-Jan.csv')
# Stack the four monthly files into one DataFrame with a fresh 0..n-1 index
all_m_data = pd.concat([data1, data2, data3, data4], ignore_index=True)

all_m_data.drop('user_session', axis= 1, inplace= True)
all_m_data.to_csv("alldata2.csv", index= False)
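
Because the raw files are large, one possible variation is to skip `user_session` at read time instead of dropping it after concatenation. The following is a minimal sketch under that assumption, not the approach used in this notebook. The helper `load_months` and the in-memory stand-in files are hypothetical; in practice the list would hold the monthly CSV paths.

```python
import io
import pandas as pd

# Columns kept at read time; 'user_session' is left out so it is never loaded.
KEEP_COLS = ['event_time', 'event_type', 'product_id', 'category_id',
             'category_code', 'brand', 'price', 'user_id']

def load_months(sources):
    """Read each monthly CSV with only the needed columns and stack them."""
    frames = [pd.read_csv(src, usecols=KEEP_COLS) for src in sources]
    return pd.concat(frames, ignore_index=True)

# Two tiny in-memory stand-ins for the monthly files (rows copied from the head() output).
header = ('event_time,event_type,product_id,category_id,category_code,'
          'brand,price,user_id,user_session\n')
file_a = header + ('2019-10-01 00:00:00 UTC,cart,5773203,1487580005134238553,,'
                   'runail,2.62,463240011,26dd6e6e-4dac-4778-8d2c-92e149dab885\n')
file_b = header + ('2019-10-01 00:00:07 UTC,cart,5881589,2151191071051219817,,'
                   'lovely,13.48,429681830,49e8d843-adf3-428b-a2c3-fe8bc6a307c9\n')

combined = load_months([io.StringIO(file_a), io.StringIO(file_b)])
print(combined.shape)
print(list(combined.columns))
```

With `usecols`, the unwanted column is never parsed, which may lower peak memory compared with loading everything and dropping it afterwards.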

In [5]:
data = pd.read_csv('alldata2.csv')
data.head()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id
0,2019-10-01 00:00:00 UTC,cart,5773203,1487580005134238553,,runail,2.62,463240011
1,2019-10-01 00:00:03 UTC,cart,5773353,1487580005134238553,,runail,2.62,463240011
2,2019-10-01 00:00:07 UTC,cart,5881589,2151191071051219817,,lovely,13.48,429681830
3,2019-10-01 00:00:07 UTC,cart,5723490,1487580005134238553,,runail,2.62,463240011
4,2019-10-01 00:00:15 UTC,cart,5881449,1487580013522845895,,lovely,0.56,429681830


## Data Cleaning

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16536158 entries, 0 to 16536157
Data columns (total 8 columns):
 #   Column         Dtype  
---  ------         -----  
 0   event_time     object 
 1   event_type     object 
 2   product_id     int64  
 3   category_id    int64  
 4   category_code  object 
 5   brand          object 
 6   price          float64
 7   user_id        int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 1009.3+ MB


In [7]:
data.nunique()

event_time       7059662
event_type             4
product_id         51428
category_id          522
category_code         12
brand                262
price               2801
user_id          1341499
dtype: int64

In [8]:
print("Duplicates : ", len(data[data.duplicated()]))

Duplicates :  874708


The data contains some duplicate rows (identical in every column), so we drop the duplicated observations here.

In [9]:
data.drop_duplicates(inplace= True)
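
To illustrate what counts as a duplicate here, the sketch below uses a small made-up frame (the rows are illustrative, not taken from the dataset). A row is flagged only when every column matches an earlier row, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Illustrative events: rows 0 and 1 are identical in every column.
events = pd.DataFrame({
    'event_type': ['cart', 'cart', 'view', 'cart'],
    'product_id': [5773203, 5773203, 5881589, 5773203],
    'user_id':    [463240011, 463240011, 429681830, 429681830],
})

# duplicated() marks the second and later copies of a fully identical row.
n_dup = int(events.duplicated().sum())

# drop_duplicates() keeps the first copy of each row.
deduped = events.drop_duplicates().reset_index(drop=True)

print("Duplicates:", n_dup)
print(deduped)
```

Row 3 shares its event type and product with row 0 but belongs to a different user, so it is kept.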

Take a look at the distribution of the data.

In [10]:
data.describe()

Unnamed: 0,product_id,category_id,price,user_id
count,15661450.0,15661450.0,15661450.0,15661450.0
mean,5476818.088,1.55238452415731e+18,8.729,516141171.815
std,1316683.895,1.6651477120878227e+17,19.689,86182238.514
min,3752.0,1.4875800048070828e+18,-79.37,465496.0
25%,5724283.0,1.4875800057549955e+18,2.13,478511484.0
50%,5809912.0,1.4875800082631895e+18,4.11,550257646.0
75%,5856246.0,1.4875800135060687e+18,7.14,572005517.0
max,5924514.0,2.235524499636224e+18,327.78,608822072.0


The price column unexpectedly contains some negative values, so let's check how many there are.

In [11]:
len(data[data['price']<0])

90

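Compared with the roughly 15.7 million rows remaining after deduplication, this is very few, so we filter out these unreasonable values.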

In [12]:
data = data[data['price'] >= 0].copy()  # copy so later in-place edits do not act on a view of the original
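
The same count-then-filter pattern can be sketched on a tiny frame. The values below are made up for illustration; the negative price in particular is hypothetical.

```python
import pandas as pd

# Illustrative rows: one negative price, one zero price, one positive price.
orders = pd.DataFrame({
    'product_id': [5773203, 5881589, 5723490],
    'price': [2.62, -7.94, 0.0],
})

# A boolean mask sums to the number of rows that match it.
n_negative = int((orders['price'] < 0).sum())

# Keep non-negative prices; zero-priced rows are retained by this filter.
cleaned = orders[orders['price'] >= 0].copy()

print(n_negative)
print(cleaned)
```

Note that the condition `>= 0` keeps zero-priced rows; whether those are meaningful is a separate question that this filter does not address.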

Check for missing values.

In [13]:
na = data.isna().sum()
na

event_time              0
event_type              0
product_id              0
category_id             0
category_code    15393312
brand             6555949
price                   0
user_id                 0
dtype: int64

Both category_code and brand contain a large number of NaN values, so let's see what proportion of each column is missing.

In [14]:
round(na /data.shape[0], 4)*100

event_time       0.000
event_type       0.000
product_id       0.000
category_id      0.000
category_code   98.290
brand           41.860
price            0.000
user_id          0.000
dtype: float64
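
An equivalent way to get the missing-value percentage is to take the mean of the boolean `isna()` mask, since the mean of True/False values is the share of True. A minimal sketch on made-up data (the `'appliances'` category code is a placeholder, not a value confirmed from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame with different amounts of missing data per column.
sample = pd.DataFrame({
    'category_code': [np.nan, np.nan, np.nan, 'appliances'],
    'brand': ['runail', np.nan, 'lovely', np.nan],
    'price': [2.62, 2.62, 13.48, 0.56],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
na_pct = (sample.isna().mean() * 100).round(2)
print(na_pct)
```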

About 98% of category_code is NaN, and category_id carries a similar meaning and can be used for category analysis instead, so we simply drop this column.

In [15]:
data.drop('category_code', axis= 1, inplace= True)

About 42% of brand is NaN, so for now we replace those values with Unknown.  


In [16]:
data['brand'] = data['brand'].fillna('Unknown')

## Time Data
Add time-derived columns and convert the column data types.

In [17]:
data['event_time'] = pd.to_datetime(data['event_time'], format= '%Y-%m-%d %H:%M:%S UTC')
data = data.sort_values(by= 'event_time') 
data['event_time'].head()

0   2019-10-01 00:00:00
1   2019-10-01 00:00:03
2   2019-10-01 00:00:07
3   2019-10-01 00:00:07
4   2019-10-01 00:00:15
Name: event_time, dtype: datetime64[ns]

In [18]:
data['date'] = data['event_time'].dt.date
data['month'] = data['event_time'].dt.strftime('%b %Y')  # month shown as "Mon YYYY", e.g. Feb 2020
data['day'] = data['event_time'].dt.day
data['weekday'] = data['event_time'].dt.day_name()  # name of the day of the week
data['hr'] = data['event_time'].dt.hour
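
As a quick sanity check of the derived columns, the sketch below applies the same accessors to two hand-picked timestamps. The second timestamp is made up for illustration, and the English month and weekday names assume the default locale.

```python
import pandas as pd

# Two sample timestamps in the same string format as the raw event_time column.
ts = pd.DataFrame({'event_time': ['2019-10-01 00:00:00 UTC',
                                  '2020-01-31 23:59:59 UTC']})

# Parse with an explicit format; the literal ' UTC' suffix is matched as text.
ts['event_time'] = pd.to_datetime(ts['event_time'],
                                  format='%Y-%m-%d %H:%M:%S UTC')

ts['date'] = ts['event_time'].dt.date              # calendar date
ts['month'] = ts['event_time'].dt.strftime('%b %Y')  # e.g. "Oct 2019"
ts['day'] = ts['event_time'].dt.day                # day of month
ts['weekday'] = ts['event_time'].dt.day_name()     # e.g. "Tuesday"
ts['hr'] = ts['event_time'].dt.hour                # hour of day, 0-23

print(ts)
```

Note that `dt.date` and `dt.strftime` produce object (string-like) columns, which is why `date` and `month` show up as `object` in `data.info()`.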

In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15661360 entries, 0 to 16536157
Data columns (total 12 columns):
 #   Column       Dtype         
---  ------       -----         
 0   event_time   datetime64[ns]
 1   event_type   object        
 2   product_id   int64         
 3   category_id  int64         
 4   brand        object        
 5   price        float64       
 6   user_id      int64         
 7   date         object        
 8   month        object        
 9   day          int64         
 10  weekday      object        
 11  hr           int64         
dtypes: datetime64[ns](1), float64(1), int64(5), object(5)
memory usage: 1.5+ GB


In [20]:
data.to_csv("newdata2.csv", index= False)
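
CSV does not store column types, so the datetime column written above comes back as plain text unless it is parsed again on load. The sketch below shows one way the analysis notebook could reload the file; it is an assumption about the downstream step, not something taken from this project. It round-trips a small made-up frame through an in-memory buffer in place of `newdata2.csv`.

```python
import io
import pandas as pd

# Small stand-in for the cleaned data.
df = pd.DataFrame({
    'event_time': pd.to_datetime(['2019-10-01 00:00:00', '2019-10-01 00:00:03']),
    'event_type': ['cart', 'cart'],
    'price': [2.62, 2.62],
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# parse_dates restores the datetime type; a category dtype can reduce memory
# for low-cardinality text columns such as event_type.
reloaded = pd.read_csv(buf, parse_dates=['event_time'],
                       dtype={'event_type': 'category'})
print(reloaded.dtypes)
```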