# Projeto 3: Segmentação de clientes no ecommerce

Realizar uma segmentação de clientes aplicando a metodologia RFM para um e-commerce.

O dataset utilizado neste projeto foi obtido desse projeto no [Kaggle](https://www.kaggle.com/datasets/datacertlaboratoria/projeto-3-segmentao-de-clientes-no-ecommerce).

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv("./dados/vendas-por-fatura.csv")

Realizando análise exploratória do dataset

In [3]:
df.head(50)

Unnamed: 0,N° da fatura,Data da fatura,ID Cliente,País,Quantidade,Valor
0,548370,3/30/2021 16:14:00,15528.0,United Kingdom,123,22933
1,575767,11/11/2021 11:11:00,17348.0,United Kingdom,163,20973
2,C570727,10/12/2021 11:32:00,12471.0,Germany,-1,-145
3,549106,4/6/2021 12:08:00,17045.0,United Kingdom,1,3995
4,573112,10/27/2021 15:33:00,16416.0,United Kingdom,357,34483
5,576630,11/16/2021 8:38:00,13816.0,Germany,91,19998
6,538125,12/9/2020 15:46:00,18225.0,United Kingdom,16,3000
7,544354,2/18/2021 10:42:00,13489.0,United Kingdom,64,7728
8,546369,3/11/2021 11:41:00,15513.0,United Kingdom,10,6750
9,570651,10/11/2021 13:34:00,14911.0,EIRE,86,32135


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25953 entries, 0 to 25952
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   N° da fatura    25953 non-null  object 
 1   Data da fatura  25953 non-null  object 
 2   ID Cliente      22229 non-null  float64
 3   País            25953 non-null  object 
 4   Quantidade      25953 non-null  int64  
 5   Valor           25953 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB


In [5]:
df.dtypes

N° da fatura       object
Data da fatura     object
ID Cliente        float64
País               object
Quantidade          int64
Valor              object
dtype: object

In [6]:
# Renomeando as colunas
df.columns = df.columns.str.replace(" ", "_")
df.columns

Index(['N°_da_fatura', 'Data_da_fatura', 'ID_Cliente', 'País', 'Quantidade',
       'Valor'],
      dtype='object')

In [7]:
df.shape

(25953, 6)

As faturas iniciadas com a letra 'C' 

Removendo as linhas das faturas que não possuem ID de clientes registrados, pois assim não é possível identificar qual o cliente e se ele não já foi anteriormente contabilizado.

In [8]:
df = df.dropna(subset=["ID_Cliente"])
df.shape

(22229, 6)

Foram removidos 3.724 registros sem ID (25953 linhas iniciais, resultando em 22229 linhas).

In [9]:
# Convertendo para tipo inteiro o ID
df['ID_Cliente'] = df['ID_Cliente'].astype(int)

In [10]:
df['Valor'] = df['Valor'].astype(str)
df.head()

Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor
0,548370,3/30/2021 16:14:00,15528,United Kingdom,123,22933
1,575767,11/11/2021 11:11:00,17348,United Kingdom,163,20973
2,C570727,10/12/2021 11:32:00,12471,Germany,-1,-145
3,549106,4/6/2021 12:08:00,17045,United Kingdom,1,3995
4,573112,10/27/2021 15:33:00,16416,United Kingdom,357,34483


In [11]:
# Convertendo tipo da 'Data da Fatura'
df["Data_da_fatura"] = pd.to_datetime(df["Data_da_fatura"])
df.head()


Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor
0,548370,2021-03-30 16:14:00,15528,United Kingdom,123,22933
1,575767,2021-11-11 11:11:00,17348,United Kingdom,163,20973
2,C570727,2021-10-12 11:32:00,12471,Germany,-1,-145
3,549106,2021-04-06 12:08:00,17045,United Kingdom,1,3995
4,573112,2021-10-27 15:33:00,16416,United Kingdom,357,34483


In [12]:
# Convertendo o tipo de 'Valor'
df[]


SyntaxError: invalid syntax (Temp/ipykernel_25304/2895650547.py, line 2)

In [None]:
# Removendo os dados duplicados do dataset
df.drop_duplicates(inplace=True)

In [None]:
# Classificando se a fatura é devolução ou não
df.loc[df['N°_da_fatura'].str.startswith('C'), 'Devolucao'] = "SIM"
df.loc[~df['N°_da_fatura'].str.startswith('C'), 'Devolucao'] = "NÃO"
df.head(20)

In [None]:
# Removendo as faturas do tipo 'Devolução' do dataset

In [None]:
# Ordenando ascendente pela Data da Fatura
df.sort_values(by=['Data_da_fatura'], inplace=True)
df.head(10)


Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor
9367,536365,2020-12-01 08:26:00,17850,United Kingdom,40,13912
18259,536366,2020-12-01 08:28:00,17850,United Kingdom,12,2220
11185,536368,2020-12-01 08:34:00,13047,United Kingdom,15,7005
6876,536367,2020-12-01 08:34:00,13047,United Kingdom,83,27873
8195,536369,2020-12-01 08:35:00,13047,United Kingdom,3,1785
505,536370,2020-12-01 08:45:00,12583,France,449,85586
12932,536371,2020-12-01 09:00:00,13748,United Kingdom,80,20400
2773,536372,2020-12-01 09:01:00,17850,United Kingdom,12,2220
20456,536373,2020-12-01 09:02:00,17850,United Kingdom,88,25986
24617,536374,2020-12-01 09:09:00,15100,United Kingdom,32,35040


In [None]:
# Obtendo Ano e Mês
df['Ano_mes'] = df['Data_da_fatura'].dt.to_period('M')
df.head(10)


Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor,Ano_mes
9367,536365,2020-12-01 08:26:00,17850,United Kingdom,40,13912,2020-12
18259,536366,2020-12-01 08:28:00,17850,United Kingdom,12,2220,2020-12
11185,536368,2020-12-01 08:34:00,13047,United Kingdom,15,7005,2020-12
6876,536367,2020-12-01 08:34:00,13047,United Kingdom,83,27873,2020-12
8195,536369,2020-12-01 08:35:00,13047,United Kingdom,3,1785,2020-12
505,536370,2020-12-01 08:45:00,12583,France,449,85586,2020-12
12932,536371,2020-12-01 09:00:00,13748,United Kingdom,80,20400,2020-12
2773,536372,2020-12-01 09:01:00,17850,United Kingdom,12,2220,2020-12
20456,536373,2020-12-01 09:02:00,17850,United Kingdom,88,25986,2020-12
24617,536374,2020-12-01 09:09:00,15100,United Kingdom,32,35040,2020-12


In [None]:
# Classificando se o cliente pertence ou não ao Reino Unido
df.loc[df['País']=="United Kingdom", 'Pertence_UK'] = "SIM"
df.loc[df['País']!="United Kingdom", 'Pertence_UK'] = "NÃO"
df.head(50)

Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor,Ano_mes,Pertence_UK
9367,536365,2020-12-01 08:26:00,17850,United Kingdom,40,13912,2020-12,SIM
18259,536366,2020-12-01 08:28:00,17850,United Kingdom,12,2220,2020-12,SIM
11185,536368,2020-12-01 08:34:00,13047,United Kingdom,15,7005,2020-12,SIM
6876,536367,2020-12-01 08:34:00,13047,United Kingdom,83,27873,2020-12,SIM
8195,536369,2020-12-01 08:35:00,13047,United Kingdom,3,1785,2020-12,SIM
505,536370,2020-12-01 08:45:00,12583,France,449,85586,2020-12,NÃO
12932,536371,2020-12-01 09:00:00,13748,United Kingdom,80,20400,2020-12,SIM
2773,536372,2020-12-01 09:01:00,17850,United Kingdom,12,2220,2020-12,SIM
20456,536373,2020-12-01 09:02:00,17850,United Kingdom,88,25986,2020-12,SIM
24617,536374,2020-12-01 09:09:00,15100,United Kingdom,32,35040,2020-12,SIM


In [None]:
df.head()

Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor,Ano_mes,Pertence_UK,Devolucao
9367,536365,2020-12-01 08:26:00,17850,United Kingdom,40,13912,2020-12,SIM,NÃO
18259,536366,2020-12-01 08:28:00,17850,United Kingdom,12,2220,2020-12,SIM,NÃO
11185,536368,2020-12-01 08:34:00,13047,United Kingdom,15,7005,2020-12,SIM,NÃO
6876,536367,2020-12-01 08:34:00,13047,United Kingdom,83,27873,2020-12,SIM,NÃO
8195,536369,2020-12-01 08:35:00,13047,United Kingdom,3,1785,2020-12,SIM,NÃO


In [None]:
df.dtypes

N°_da_fatura              object
Data_da_fatura    datetime64[ns]
ID_Cliente                 int32
País                      object
Quantidade                 int64
Valor                     object
Ano_mes                period[M]
Pertence_UK               object
Devolucao                 object
Total                     object
dtype: object

In [None]:
df.drop(labels='Devolucao', axis=1, inplace=True)
df.head()

Unnamed: 0,N°_da_fatura,Data_da_fatura,ID_Cliente,País,Quantidade,Valor,Ano_mes,Pertence_UK,Total
9367,536365,2020-12-01 08:26:00,17850,United Kingdom,40,13912,2020-12,SIM,"139,12139,12139,12139,12139,12139,12139,12139,..."
18259,536366,2020-12-01 08:28:00,17850,United Kingdom,12,2220,2020-12,SIM,"22,2022,2022,2022,2022,2022,2022,2022,2022,202..."
11185,536368,2020-12-01 08:34:00,13047,United Kingdom,15,7005,2020-12,SIM,"70,0570,0570,0570,0570,0570,0570,0570,0570,057..."
6876,536367,2020-12-01 08:34:00,13047,United Kingdom,83,27873,2020-12,SIM,"278,73278,73278,73278,73278,73278,73278,73278,..."
8195,536369,2020-12-01 08:35:00,13047,United Kingdom,3,1785,2020-12,SIM,178517851785


In [None]:
temp = df['Valor'][0]+df['Valor'][1]
temp

'229,33209,73'

Unnamed: 0,Cliente,Data_ultima_compra,Recencia
275,12680,2021-12-09 12:50:00,0
587,13113,2021-12-09 12:49:00,0
2562,15804,2021-12-09 12:31:00,0
1067,13777,2021-12-09 12:25:00,0
3854,17581,2021-12-09 12:21:00,0
330,12748,2021-12-09 12:20:00,0
301,12713,2021-12-09 12:16:00,0
146,12526,2021-12-09 12:09:00,0
3215,16705,2021-12-09 12:08:00,0
2192,15311,2021-12-09 12:00:00,0


In [None]:
print(df['ID_Cliente'].unique())

[17850 13047 12583 ... 13298 14569 12713]


In [None]:
# Verificando se há valores nulos
temp = df.isnull().values.any()
temp


False

In [None]:
# contando clientes únicos por ano/mês
df_coorte = pd.DataFrame(df.groupby(['Ano_mes'])['ID_Cliente'].nunique()).reset_index()
df_coorte

Unnamed: 0,Ano_mes,ID_Cliente
0,2020-12,948
1,2021-01,783
2,2021-02,798
3,2021-03,1020
4,2021-04,899
5,2021-05,1079
6,2021-06,1051
7,2021-07,993
8,2021-08,980
9,2021-09,1302


In [None]:
# Criando coluna de Recencia 
df_recencia = df.groupby(by='ID_Cliente',
                        as_index=False)['Data_da_fatura'].max()
df_recencia.columns = ['Cliente', 'Data_ultima_compra']
ultima_compra = df_recencia['Data_ultima_compra'].max()
df_recencia['Recencia'] = df_recencia['Data_ultima_compra'].apply(
    lambda x: (ultima_compra - x).days)
df_recencia.head()

Unnamed: 0,Cliente,Data_ultima_compra,Recencia
0,12346,2021-01-18 10:17:00,325
1,12347,2021-12-07 15:52:00,1
2,12348,2021-09-25 13:13:00,74
3,12349,2021-11-21 09:51:00,18
4,12350,2021-02-02 16:01:00,309


In [None]:
df_frequencia = df.drop_duplicates().groupby(
    by=['ID_Cliente'], as_index=False)['Data_da_fatura'].count()
df_frequencia.columns = ['Cliente', 'Frequencia']
df_frequencia.head()

Unnamed: 0,Cliente,Frequencia
0,12346,2
1,12347,7
2,12348,4
3,12349,1
4,12350,1


In [None]:
df['Total'] = df['Valor']*df['Quantidade']
df_monetario = df.groupby(by='Cliente', as_index=False)['Total'].sum()
df_monetario.columns = ['Cliente', 'Monetario']
df_monetario.head()

KeyError: 'Cliente'