In [73]:
%run "part01_preprocessing.ipynb"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
 8   Unnamed: 8   0 non-null       float64
dtypes: float64(3), int64(1), object(5)
memory usage: 37.2+ MB


In [74]:
HTML("""
<style>

h1 {
    background-color: DarkSlateGray;
    color: white;
    padding: 15px 15px;
    text-align: center;
    font-family: Arial, Helvetica, sans-serif;
    border-radius:10px 10px;
}

h2 {
    background-color: CadetBlue;
    color: white;
    padding: 10px 10px;
    text-align: center;
    font-family: Arial, Helvetica, sans-serif;
    border-radius:10px 10px;
}

</style>
""")

# Content

* **Data Preparation**
    - Normalization
    - Standardization
    - Dimensionality Reduction
    - Feature Selection
    - Dealing with Outliers

# Data Preparation

In [75]:
import numpy as np  # used below for np.log1p
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer
from sklearn.decomposition import PCA
from umap.umap_ import UMAP
from sklearn.manifold import TSNE

In [76]:
data_prep = data_client_resume.copy()

In [77]:
data_prep.isna().sum()

GrossRevenueTotal    0
RecencyDays          0
Frequency            0
dtype: int64

## Normalization

In [78]:
#sns.pairplot(data_prep, aspect=1.5);

In [79]:
log_columns = data_prep.skew().sort_values(ascending=False)
log_columns = log_columns.loc[log_columns > 0.75]
log_columns

GrossRevenueTotal    21.585393
Frequency            12.045707
RecencyDays           1.249082
dtype: float64

In [80]:
# Apply log1p to every column whose skewness exceeds 0.75
for col in log_columns.index:
    data_prep[col] = np.log1p(data_prep[col])
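
The effect of the log transform on skewness can be checked on synthetic data. The following is a minimal sketch, assuming a log-normally distributed series as a stand-in for a revenue-like column; the values and the name `revenue` are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed series (hypothetical values, not the notebook's data)
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=5.0, sigma=1.5, size=5000), name="revenue")

# Skewness before and after the transform
skew_before = revenue.skew()
skew_after = np.log1p(revenue).skew()

# log1p(x) = log(1 + x): zeros stay at zero, and expm1 inverts the transform
restored = np.expm1(np.log1p(revenue))

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

`np.log1p` is preferred over `np.log` here because it is defined at zero, which matters for count-like columns such as a frequency.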

## Standardization

In [81]:
#ss = StandardScaler()
#rs = RobustScaler()
#pt = PowerTransformer()

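# MinMaxScaler gave better results than the scalers commented out above.
# Note that min-max scaling is sensitive to outliers, so it relies on the
# log transform above having already compressed the extreme values.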
mms = MinMaxScaler()

for col in data_prep.columns:
    data_prep[col] = mms.fit_transform(data_prep[[col]]).squeeze()
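
Min-max scaling maps each column to [0, 1] using that column's observed minimum and maximum. The sketch below, on a toy frame with made-up values, illustrates two points under that definition: fitting the scaler once on the whole frame gives the same result as the column-by-column loop, and a single extreme value squeezes the remaining values toward zero.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with hypothetical values; column "a" contains one extreme value
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 1000.0],
                   "b": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Column-by-column scaling, as in the loop above
per_col = df.copy()
for col in per_col.columns:
    per_col[col] = MinMaxScaler().fit_transform(per_col[[col]]).squeeze()

# A single call on the whole frame scales each column independently as well
whole = pd.DataFrame(MinMaxScaler().fit_transform(df),
                     columns=df.columns, index=df.index)

same = np.allclose(per_col.to_numpy(), whole.to_numpy())

# Largest scaled value in "a" once the extreme row is excluded
max_non_outlier = whole["a"].iloc[:-1].max()

print(same, max_non_outlier)
```

This is why the log transform is applied before scaling: without it, heavily skewed columns would leave most customers crowded near zero.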

In [82]:
features = ['GrossRevenueTotal', 'RecencyDays', 'Frequency'] # We will use all features
X = data_prep[features].copy() 

In [83]:
#sns.pairplot(X, aspect=1.5);

## Dimensionality Reduction

In [84]:
clusters_results = X.copy()

In [85]:
# Dimensionality reduction with UMAP (2 components by default)
umap = UMAP(random_state=3456)
umap_embedding = umap.fit_transform(X)

# Store the x and y coordinates of the UMAP embedding
clusters_results['umap_x'] = umap_embedding[:,0]
clusters_results['umap_y'] = umap_embedding[:,1]

In [86]:
# Dimensionality reduction with t-SNE
tsne = TSNE(n_components=2, init='pca', learning_rate='auto', n_jobs=-1, random_state=3456)
tsne_embedding = tsne.fit_transform(X)

# Store the x and y coordinates of the t-SNE embedding
clusters_results['tsne_x'] = tsne_embedding[:,0]
clusters_results['tsne_y'] = tsne_embedding[:,1]

In [87]:
clusters_results

CustomerID,GrossRevenueTotal,RecencyDays,Frequency,umap_x,umap_y,tsne_x,tsne_y
17850,0.638772,0.999490,0.615003,16.587051,1.668091,13.421610,65.418365
13047,0.589531,0.535628,0.366301,16.904604,6.878082,54.411797,26.065199
12583,0.666699,0.132437,0.446811,14.819579,12.711712,63.986713,-15.090689
13748,0.482357,0.741637,0.236060,15.430046,1.299198,7.576268,55.761478
15100,0.445907,0.978449,0.148937,13.987630,-1.100407,-13.432817,58.237530
...,...,...,...,...,...,...,...
13436,0.339589,0.077471,0.000000,10.942719,15.811964,35.044922,-42.773659
15520,0.390068,0.077471,0.000000,11.051549,15.759743,35.283615,-42.299320
13298,0.394328,0.077471,0.000000,11.029748,15.771748,35.297661,-42.264164
14569,0.352641,0.077471,0.000000,10.963229,15.804261,35.128387,-42.635925


## Feature Selection

Clustering algorithms have no built-in mechanism for selecting the best features: they simply group observations by similarity across whatever variables they are given. Whether the resulting groups are good, or offer a useful explanation for the business problem, is left to human interpretation. Because one of the objectives of this study is to compare the machine learning model with the statistical model, we initially use only the features "GrossRevenueTotal", "RecencyDays", and "Frequency" to keep the comparison balanced.

## Dealing with Outliers

Later we will use DBSCAN, a density-based clustering algorithm that, in addition to forming clusters, labels points in low-density regions as noise and can therefore also serve as an outlier detector.
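
As a rough illustration of that behaviour, the sketch below runs scikit-learn's `DBSCAN` on synthetic 2D points. The data, `eps`, and `min_samples` are illustrative assumptions, not the values used later in this study. Points that DBSCAN cannot assign to any cluster receive the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Two dense synthetic blobs plus one isolated point (hypothetical data)
blob_a = rng.normal(loc=[0.2, 0.2], scale=0.02, size=(50, 2))
blob_b = rng.normal(loc=[0.8, 0.8], scale=0.02, size=(50, 2))
isolated = np.array([[0.5, 0.0]])
points = np.vstack([blob_a, blob_b, isolated])

# eps and min_samples are illustrative, untuned values
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(points)

# Label -1 marks noise, i.e. candidate outliers
noise_mask = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"clusters: {n_clusters}, noise points: {noise_mask.sum()}")
```

On real data, both parameters need tuning, since they determine how dense a region must be before its points count as a cluster rather than noise.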