<!--NOTEBOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="./figures/LogoOpenclassrooms.png">
<font size="4">
<p>
This study was carried out as the 5th project of my Data Scientist training, delivered as a MOOC by 
**<font color='blue'>Openclassrooms / Centrale-Supélec</font>**.
</p>    

<p>
This notebook presents a builder for a customer-base segmentation model. The builder is implemented in the file <font color='blue'>P5_ModelBuilder.py</font>.
It constructs an object of the class defined in <font color='blue'>P5_SegmentClassifier.py</font>, an object 
    that implements a model using an M.L. algorithm to predict which segment a customer belongs to.
</p>
<p>
The results of the analyses of supervised and unsupervised M.L. algorithms from the notebook <font color='blue'>P5_2_AllFeature.ipynb</font> are used to tune the algorithms.
</p>

<p>
The model is based on the data provided by the site:
</p>
<p>
https://archive.ics.uci.edu/ml/datasets/Online+Retail
</p>
</font>

In [48]:
import p3_util_plot

import P5_ModelBuilder
from P5_ModelBuilder import *
from P5_SegmentClassifier import *

#---------------------------------------------
# Reload the cleaned dataframe from its dump file.
# If the flag is False, data is read from the CSV
# file, the cleaning process is applied and the
# resulting dataframe is dumped into a file.
#---------------------------------------------
is_data_reloaded = True

#---------------------------------------------
# Reload the dumped sampling file that was
# used for the exploratory step in the
# feature-study notebooks.
#---------------------------------------------
is_sampling_load = True


#---------------------------------------------
# oP5_SegmentClassifier is activated for unit
# tests.
#---------------------------------------------
is_oP5_SegmentClassifier = False


#---------------------------------------------
# Unit tests
#---------------------------------------------
is_data_transform = False
is_data_transform_rfm = False
is_data_transform_timeFeature = False
is_data_transform_nlp = False
is_df_customers_build = False

#---------------------------------------------
# Building classifier
#---------------------------------------------
is_oP5_SegmentClassifier_built = True

if is_oP5_SegmentClassifier_built is True:
    is_data_transform = False
    is_data_transform_rfm = False
    is_data_transform_timeFeature = False
    is_data_transform_nlp = False
    is_df_customers_build = False
else:
    pass

is_model_print = False

In [49]:
import P5_ModelBuilder
help(P5_ModelBuilder)

Help on module P5_ModelBuilder:

NAME
    P5_ModelBuilder

CLASSES
    builtins.object
        P5_ModelBuilder
    
    class P5_ModelBuilder(builtins.object)
     |  This class aims to build a model for customer segmentation as per 
     |  P5 project description from Openclassrooms course of Data-Scientist training.
     |   
     |  Services provided by this class are :
     |     -> It acquires data from CSV file.
     |     -> It proceeds to data preparation : data cleaning and scaling.
     |     -> It builds data to be feed into a computable model.
     |     -> It builds computational models 
     |     -> It builds a deployable component, oP5_SegmentClassifier, 
     |     in charge of customer segmentation. Such class aims to be deployed for 
     |     production.
     |     
     |  Instructions flow :
     |  -------------------
     |     +-->__init__()
     |     |
     |     +-->data_load(fileName)
     |     |
     |     +-->data_clean(fileName)
     |     |
     |    

## <font color='blue'>``oP5_ModelBuilder`` backup and restore </font>

It is useful to run this sequence when code in ``P5_ModelBuilder`` has been modified and the existing ``oP5_ModelBuilder`` object must pick up the change.

This saves time by avoiding a full rebuild of ``oP5_ModelBuilder``.

To activate this sequence, set the flag ``is_oP5_SegmentClassifier_built`` to ``False``.
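
The backup-and-restore idea can be sketched as follows. This is only an illustration under assumptions: `Builder`, its attributes and the example path are hypothetical stand-ins, and the real `copy()` method of `P5_ModelBuilder` may work differently.

```python
from copy import deepcopy


class Builder:
    """Hypothetical stand-in for a class whose code may be edited and reloaded."""

    def __init__(self):
        self.path_to_data = None
        self.df_invoice = None

    def copy(self, other):
        # Transfer the state of `other` into this instance, attribute by attribute.
        # Deep copies avoid sharing mutable objects with the old instance.
        for name, value in vars(other).items():
            setattr(self, name, deepcopy(value))


# An "old" object, built before the class code was modified.
old_builder = Builder()
old_builder.path_to_data = "./data/example.dump"  # hypothetical path
old_builder.df_invoice = [1, 2, 3]                # placeholder for a dataframe

# A fresh instance picks up the new methods of the class,
# while the previously computed state is copied back into it.
new_builder = Builder()
new_builder.copy(old_builder)
print(new_builder.path_to_data)
```

The point of the pattern is that the expensive state is kept while the methods come from the newly loaded class definition.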

In [3]:
from P5_ModelBuilder import *

is_oP5_SegmentClassifier_built = True

if is_oP5_SegmentClassifier_built is False:
    oP5_ModelBuilder_save = P5_ModelBuilder()
    try:
        oP5_ModelBuilder_save.copy(oP5_ModelBuilder)
        oP5_ModelBuilder_save.print()
    except NameError as nameErrorValue:
        p3_util_plot.printmd_error(" WARNING : oP5_ModelBuilder not yet defined")
else:
    pass


if is_oP5_SegmentClassifier_built is False:

    oP5_ModelBuilder = P5_ModelBuilder()
    oP5_ModelBuilder.copy(oP5_ModelBuilder_save)

    del(oP5_ModelBuilder_save)
    oP5_ModelBuilder.print()
else:
    pass

# <font color='blue'>Dataset reading or loading</font>

In [4]:
if is_data_reloaded is False:
    fileName = './data/OnlineRetail.xlsx'
    oP5_ModelBuilder = P5_ModelBuilder(path_to_data=fileName)
    oP5_ModelBuilder.data_read()
    is_data_reloaded = True
else : 
    fileName = './data/df_invoice_line_clean.dump'
    oP5_ModelBuilder = P5_ModelBuilder(path_to_data=fileName)
    oP5_ModelBuilder.data_load(fileName)

p5_util.object_load : fileName= ./data/df_invoice_line_clean.dump
Elapsed time = 0.14


In [5]:
oP5_ModelBuilder.df_invoice.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID'],
      dtype='object')

#### Sampling data from cleaned dataframe.

In [51]:
if is_model_print is True:
    oP5_ModelBuilder.print()

In [7]:
is_sampling_load = True
if is_sampling_load is False:
    oP5_ModelBuilder.data_sampling(ratio=0.8)
else:
    oP5_ModelBuilder.data_sampling_load()

p5_util.object_load : fileName= ./data/df_invoice_line_sample_random.dump
p5_util.object_load : fileName= ./data/df_invoice_line_out_sample_random.dump
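
Judging by their names, the two dump files loaded above hold a random sample and its complement. A minimal sketch of such a split with pandas is given below. It assumes row-level sampling with a ratio of 0.8; whether `data_sampling` samples rows or customers is not shown in this notebook, and the toy dataframe is invented for illustration.

```python
import pandas as pd

# Toy invoice lines; the real dataframe has more columns and rows.
df_invoice = pd.DataFrame({
    "InvoiceNo": range(10),
    "CustomerID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# Draw 80% of the rows at random; a fixed seed makes the split reproducible.
df_sample = df_invoice.sample(frac=0.8, random_state=0)

# Rows that were not drawn form the out-of-sample set.
df_out_sample = df_invoice.drop(df_sample.index)

print(df_sample.shape, df_out_sample.shape)
```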


In [52]:
if is_model_print is True:
    oP5_ModelBuilder.print()

# <font color='blue'>Data model cleaning</font>

Once the dataframe has been cleaned, it is dumped into the file *df_invoice_line_clean.dump*.
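
Dumping and reloading a dataframe can be sketched with the standard `pickle` module. The helpers `dump_to_file` and `load_from_file` are hypothetical; they only illustrate the idea and are not the implementation of `p5_util.object_dump` or `p5_util.object_load`, which is not shown here.

```python
import os
import pickle
import tempfile

import pandas as pd


def dump_to_file(obj, file_name):
    """Serialize any picklable object to a file (hypothetical helper)."""
    with open(file_name, "wb") as handle:
        pickle.dump(obj, handle)


def load_from_file(file_name):
    """Restore an object previously written by dump_to_file."""
    with open(file_name, "rb") as handle:
        return pickle.load(handle)


df_clean = pd.DataFrame({"InvoiceNo": ["536365"], "Quantity": [6]})

# A temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as tmp_dir:
    dump_path = os.path.join(tmp_dir, "df_clean_example.dump")
    dump_to_file(df_clean, dump_path)
    df_restored = load_from_file(dump_path)

print(df_restored.equals(df_clean))
```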

In [9]:
if is_data_reloaded is False:
    oP5_ModelBuilder.data_clean()
    oP5_ModelBuilder.print()

# <font color='blue'>Data model building : unit tests</font>

### Test of method *<font color=blue>data_transform_rfm</font>*

In [10]:
import p3_util_plot
import P5_SegmentClassifier

is_data_transform_rfm = False

if is_data_transform_rfm is True:
    
    try:
        if oP5_SegmentClassifier is None:
            oP5_SegmentClassifier = P5_SegmentClassifier.P5_SegmentClassifier()
        else:
            pass
    except NameError as nameErrorValue:
        p3_util_plot.printmd_error("WARNING : "+str(nameErrorValue))
        oP5_SegmentClassifier = P5_SegmentClassifier.P5_SegmentClassifier()


    df_invoice_line = oP5_ModelBuilder.df_invoice
    try:
        if oP5_SegmentClassifier is not None:
            self = oP5_SegmentClassifier 
            self.df_invoice_line = df_invoice_line
            self.data_transform_rfm()
            oP5_ModelBuilder._oP5_SegmentClassifier = self
        else:
            p3_util_plot.printmd_error("ERROR 2 : oP5_SegmentClassifier : "+str(oP5_SegmentClassifier))
    except NameError as nameErrorValue:
        p3_util_plot.printmd_error("ERROR 3: "+str(nameErrorValue))
else:
    pass

### Test of method *<font color=blue>data_transform_timeFeature</font>*

In [11]:
is_data_transform_timeFeature = False

if is_data_transform_timeFeature is True:
    if oP5_SegmentClassifier is not None:
        self = oP5_SegmentClassifier 
        self.df_invoice_line = df_invoice_line
        self.data_transform_timeFeature()
    else:
        pass
else:
    pass

### Test of method *<font color=blue>data_transform_nlp</font>* with *<font color=blue>_oP5_SegmentClassifier</font>*

In [12]:
is_data_transform_nlp = False

if is_data_transform_nlp is True:
    oP5_ModelBuilder._oP5_SegmentClassifier.data_transform_nlp()
else:
    pass

In [13]:
import p3_util_plot
import P5_SegmentClassifier

is_data_transform_nlp = False

if is_data_transform_nlp is True:
    oP5_ModelBuilder._oP5_SegmentClassifier = P5_SegmentClassifier.P5_SegmentClassifier()
    oP5_ModelBuilder._oP5_SegmentClassifier._df_invoice_line = oP5_ModelBuilder._df_invoice.copy()

    self = oP5_ModelBuilder._oP5_SegmentClassifier 
    self.data_transform_nlp()
else:
    pass

### Test of method *<font color=blue>data_transform_nlp</font>* with *<font color=blue>oP5_SegmentClassifier</font>*

In [14]:
is_data_transform_nlp = False

if is_data_transform_nlp is True:
    if oP5_SegmentClassifier is not None:
        self = oP5_SegmentClassifier 
        self.df_invoice_line = df_invoice_line
        self.data_transform_nlp()
    else:
        pass
else:
    pass

### Test of method *<font color=blue>data_transform</font>*

Includes calls: 
* data_transform_timeFeature()
* data_transform_rfm()
* data_transform_nlp()
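
Of the transformations listed above, RFM (recency, frequency, monetary) is the most standard. The sketch below computes the classic three values per customer with pandas. It is an assumption-laden illustration: the toy data is invented, and the notebook's own RFM step yields 59 columns per customer, so the author's encoding is clearly richer than this.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceNo": ["A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-01-10", "2011-03-01", "2011-02-15"]),
    "Quantity": [2, 1, 5],
    "UnitPrice": [3.0, 4.0, 1.0],
})
df["Total"] = df["Quantity"] * df["UnitPrice"]

# Reference date: the day after the most recent invoice in the data.
reference_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

df_rfm = df.groupby("CustomerID").agg(
    # Days elapsed since the customer's last purchase.
    recency=("InvoiceDate", lambda dates: (reference_date - dates.max()).days),
    # Number of distinct invoices.
    frequency=("InvoiceNo", "nunique"),
    # Total amount spent.
    monetary=("Total", "sum"),
)
print(df_rfm)
```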


In [15]:
is_data_transform = False

if is_data_transform is True:
    if oP5_SegmentClassifier is not None:
        self = oP5_SegmentClassifier 
        self.data_transform(df_invoice_line)

        self.print()
    else:
        pass
else:
    pass

### Test of method *<font color=blue>df_customers_build</font>*

In [16]:
is_df_customers_build = False

if is_df_customers_build is True:
    if oP5_SegmentClassifier is not None:
        oP5_SegmentClassifier.df_customers_build()
    else:
        pass
else:
    pass

# <font color='blue'>Data model building</font>

#### <font color='blue'>Note: transactions with a Total value of 0 are removed.</font>

In [17]:
oP5_ModelBuilder.data_transform()

P5_SegmentClassifier : init done!

*** Time features transformation ***
self.df_invoice_line : (194907, 7)
is_built_step : True
Time feature : month --> (2124, 12)
is_built_step : True
Time feature : day --> (2124, 31)
is_built_step : True
Time feature : dow --> (2124, 6)
is_built_step : True
Time feature : pod --> (2124, 2)
df_timeFeature= (0, 0)
p5_util.object_load : fileName= ./data/df_customers_month.dump
p5_util.object_load : fileName= ./data/df_customers_day.dump
p5_util.object_load : fileName= ./data/df_customers_dow.dump
p5_util.object_load : fileName= ./data/df_customers_pod.dump
df_customers_timeFeature : (2124, 51)
(2124, 30)

*** RFM transformation ***
df_customers_rfm =(2124, 59)

*** NLP transformation ***
3672
3623
3616
3583
3581
3581
3576
3573
(194907, 1716)
Dumping matrix_weights into file= ./data/matrix_weights_NLP.dump
Done!
df_invoice_line : (194907, 8)
self._df_w_nlp : (2124, 1717)
df_customers_pca_nlp : (2124, 250)
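
The note before this cell states that transactions with a `Total` of 0 are removed. A minimal sketch of that filter follows, assuming `Total` is the line amount `Quantity * UnitPrice`; that definition is inferred, not confirmed by the source, and the toy rows are invented.

```python
import pandas as pd

df_invoice_line = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", "A3"],
    "Quantity": [1, 3, 2],
    "UnitPrice": [4.95, 0.0, 1.25],
})

# Assumption: Total is the line amount, Quantity * UnitPrice.
df_invoice_line["Total"] = df_invoice_line["Quantity"] * df_invoice_line["UnitPrice"]

# Keep only the lines whose amount is non-zero.
df_invoice_line = df_invoice_line[df_invoice_line["Total"] != 0].reset_index(drop=True)
print(df_invoice_line)
```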


In [53]:
if is_model_print is True:
    oP5_ModelBuilder._oP5_SegmentClassifier.print()

In [54]:
if is_model_print is True:
    oP5_ModelBuilder.print()

# <font color='blue'>Building Clusters</font>

#### <font color='blue'>Backup P5_ModelBuilder before clustering</font>

In [20]:
import p5_util
p5_util.object_dump(oP5_ModelBuilder, "./data/oP5_ModelBuilder.dump")

#### <font color='blue'>*df_customers*</font> dataframe is aggregated from the separate dataframes built from NLP, RFM and time features

In [21]:
import p3_util_plot
from P5_SegmentClassifier import *

is_oP5_SegmentClassifier = False

if is_oP5_SegmentClassifier is True:
    try:
        oP5_SegmentClassifier.df_customers_fileRead()
    except NameError as nameErrorValue:
        p3_util_plot.printmd_error("WARNING : oP5_SegmentClassifier : "+str(nameErrorValue))
else:
    pass

In [22]:
oP5_ModelBuilder.clusters_build()

p5_util.object_load : fileName= ./data/df_customers_rfm.dump
RFM features : (2124, 59)
p5_util.object_load : fileName= ./data/df_customers_timeFeature_pca.dump
Time features : (2124, 30)
p5_util.object_load : fileName= ./data/df_customers_nlp_pca.dump
NLP features : (2124, 250)
All features : (2124, 339)
p5_util.object_load : fileName= ./data/df_customers.dump
df_customers : (2124, 339)
Clustering model : GMM
Clustering parameters : {'n_clusters': 6, 'covariance_type': 'diag'}
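
The log above shows three per-customer feature blocks (59, 30 and 250 columns) combined into 339 columns, then clustered with a GMM. The sketch below reproduces that flow on random stand-in data with scikit-learn. It rests on assumptions: the blocks are much narrower than the real ones, and the `n_clusters` key printed in the log is taken to map to scikit-learn's `n_components` parameter.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_customers = 300

# Random stand-ins for the three per-customer feature blocks.
df_rfm = pd.DataFrame(rng.normal(size=(n_customers, 3))).add_prefix("rfm_")
df_time = pd.DataFrame(rng.normal(size=(n_customers, 4))).add_prefix("time_")
df_nlp = pd.DataFrame(rng.normal(size=(n_customers, 5))).add_prefix("nlp_")

# Column-wise concatenation: one row per customer, all features side by side.
df_customers = pd.concat([df_rfm, df_time, df_nlp], axis=1)

# Gaussian mixture with diagonal covariances; each customer receives
# the index of its most likely component as cluster label.
gmm = GaussianMixture(n_components=6, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(df_customers.values)

print(df_customers.shape, np.unique(labels))
```

A diagonal covariance keeps the number of parameters linear in the number of features, which matters when there are several hundred columns for about two thousand customers.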


In [55]:
if is_model_print is True:
    oP5_ModelBuilder.print()

# <font color='blue'>Building classifier</font>

In [24]:
oP5_ModelBuilder.classifier_build()

p5_util.object_load : fileName= ./data/df_customers.dump
df_customers : (2124, 339)
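
Once customers have cluster labels, a supervised classifier can learn to predict the segment from the features. The sketch below illustrates this on synthetic data. The choice of `RandomForestClassifier` is an assumption made for the example; the algorithm used by `classifier_build()` is not shown in this notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic customers; `segments` stands in for labels produced by a clustering step.
X, segments = make_blobs(n_samples=600, centers=6, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, segments, test_size=0.2, stratify=segments, random_state=0
)

# The classifier learns to reproduce the cluster assignment from the features.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)

test_accuracy = classifier.score(X_test, y_test)
print("accuracy on held-out customers:", test_accuracy)

# Predicting the segment of one new customer: a single row with the same features.
predicted_segment = classifier.predict(X_test[:1])[0]
print("predicted segment:", predicted_segment)
```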


In [56]:
if is_model_print is True:
    oP5_ModelBuilder.print()

# <font color='blue'>Dump Predictor model</font>

In [26]:
import p3_util_plot

is_oP5_SegmentClassifier = False

if is_oP5_SegmentClassifier is True:
    try:
        oP5_ModelBuilder._oP5_SegmentClassifier = oP5_SegmentClassifier
    except NameError as nameErrorValue:
        p3_util_plot.printmd_error("WARNING : "+str(nameErrorValue))
    oP5_ModelBuilder._oP5_SegmentClassifier.print()
else:
    pass

The following sequence is used when class ``P5_SegmentClassifier`` has been modified 
and the modifications must be carried over to the ``oP5_ModelBuilder._oP5_SegmentClassifier`` object.

This requires the flag ``is_oP5_SegmentClassifier`` to be set to ``True``.


In [46]:
import P5_SegmentClassifier

is_oP5_SegmentClassifier = True

#-------------------------------------------------------------------------
# This sequence is used when class P5_SegmentClassifier has been modified
# and the modifications must be carried over to the oP5_ModelBuilder object.
#
# This requires the flag is_oP5_SegmentClassifier to be True.
#
#-------------------------------------------------------------------------
if is_oP5_SegmentClassifier is True:
    _oP5_SegmentClassifier_save = P5_SegmentClassifier.P5_SegmentClassifier()
    _oP5_SegmentClassifier_save.copy(oP5_ModelBuilder._oP5_SegmentClassifier)
    
    oP5_ModelBuilder._oP5_SegmentClassifier = _oP5_SegmentClassifier_save
else:
    pass

P5_SegmentClassifier : init done!


In [57]:
if is_model_print is True:
    oP5_ModelBuilder._oP5_SegmentClassifier.print()

In [47]:
oP5_ModelBuilder.model_dump()

p5_util.object_load : fileName= ./data/df_invoice_line_clean.dump
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID'],
      dtype='object')
(349189, 7)


# <font color='blue'>Tests and validation</font>

## <font color='blue'>Loading *P5_SegmentClassifier* object</font>

In [36]:
import p3_util_plot
import P5_SegmentClassifier

is_oP5_SegmentClassifier = False

#-------------------------------------------------------------------------
# This sequence is used when class P5_SegmentClassifier has been modified
# and the modifications must be carried over to the oP5_SegmentClassifier object.
#
# This requires the flag is_oP5_SegmentClassifier to be True.
#
#-------------------------------------------------------------------------
if is_oP5_SegmentClassifier is True:
    is_oP5_SegmentClassifier_dumped = False
    oP5_SegmentClassifier_save = P5_SegmentClassifier.P5_SegmentClassifier()
    try:
        oP5_SegmentClassifier_save.copy(oP5_SegmentClassifier)
        P5_SegmentClassifier.model_dump(oP5_SegmentClassifier_save, file_name = "./data/oP5_SegmentClassifier.dump")
        is_oP5_SegmentClassifier_dumped = True
    except NameError as nameErrorValue:
        p3_util_plot.printmd_error("WARNING : "+str(nameErrorValue))


    if is_oP5_SegmentClassifier_dumped is True:
        oP5_SegmentClassifier = P5_SegmentClassifier.model_load("./data/oP5_SegmentClassifier.dump")
    else:
        oP5_SegmentClassifier = None    
else:
    pass
#type(oP5_SegmentClassifier)

In [37]:
if is_oP5_SegmentClassifier is True:
    if oP5_SegmentClassifier is not None:
        oP5_SegmentClassifier.print()
    else:
        pass
else:
    pass

### <font color='blue'>Load the classifier model restored from the dump of *P5_ModelBuilder*</font>

In [38]:
is_oP5_SegmentClassifier_built

True

In [39]:
import P5_SegmentClassifier

is_oP5_SegmentClassifier_built = False

if is_oP5_SegmentClassifier_built is True:
    _oP5_SegmentClassifier_restore = P5_SegmentClassifier.model_load("./data/_oP5_SegmentClassifier.dump")
    _oP5_SegmentClassifier = P5_SegmentClassifier.P5_SegmentClassifier()
    _oP5_SegmentClassifier.copy(_oP5_SegmentClassifier_restore)
    del(_oP5_SegmentClassifier_restore)
    _oP5_SegmentClassifier.print()
else:
    pass

In [40]:
import p5_util

if is_oP5_SegmentClassifier is True:
    p5_util.object_dump(_oP5_SegmentClassifier, "./data/_oP5_SegmentClassifier.dump")
    _oP5_SegmentClassifier.print()
else:
    pass

In [41]:
import p3_util_plot
if is_oP5_SegmentClassifier is False:
    _oP5_SegmentClassifier = oP5_ModelBuilder._oP5_SegmentClassifier
    p3_util_plot.printmd("Got _oP5_SegmentClassifier from oP5_ModelBuilder")
else:
    pass

<p><font color='green'>**Got _oP5_SegmentClassifier from oP5_ModelBuilder**</font></p>

### <font color='blue'>Randomly get an observation</font>

In [42]:
import pandas as pd
import p5_util

if is_oP5_SegmentClassifier_built is False:
    df_invoice_clean = p5_util.object_load('./data/df_invoice_cleaned.dump')
    print(df_invoice_clean.shape)

    df_sample_test = df_invoice_clean.sample(1)
    if 'RFM' in df_sample_test.columns:
        del(df_sample_test['RFM'])
    print(df_sample_test.shape)
    print(df_sample_test)

    # Data is processed in order to be computable

    Description = df_sample_test.Description.iloc[0]
    Quantity = df_sample_test.Quantity.iloc[0]
    InvoiceDate = '2011-11-11 10:19:00'

    UnitPrice = df_sample_test.UnitPrice.iloc[0]
    InvoiceNo = df_sample_test.InvoiceNo.iloc[0]
    CustomerID = df_sample_test.CustomerID.iloc[0]

    dict_invoice_line = {'InvoiceDate':InvoiceDate, 'Description':Description\
    , 'Quantity':Quantity, 'UnitPrice':UnitPrice}
    dict_invoice_line['CustomerID'] = CustomerID
    dict_invoice_line['InvoiceNo'] = InvoiceNo
    df_invoice_line = pd.DataFrame(dict_invoice_line, columns=dict_invoice_line.keys(), index=[0])

    print(df_invoice_line.shape)    
    
else:
    pass

p5_util.object_load : fileName= ./data/df_invoice_cleaned.dump
(36664, 10)
(1, 9)
       InvoiceNo StockCode                        Description  Quantity  \
249051    569546     22720  SET OF 3 CAKE TINS PANTRY DESIGN          1   

        InvoiceDate  UnitPrice  CustomerID         Country  Total  
249051        15252       4.95       15276  United Kingdom   4.95  
(1, 6)


### Global test with the classifier model produced by <font color='blue'>*P5_ModelBuilder*</font>

In [60]:
if is_model_print is True:
    _oP5_SegmentClassifier.print()

In [44]:
_oP5_SegmentClassifier._df_invoice_line.shape

(0, 0)

In [45]:
#oP5_SegmentClassifier.data_process(CustomerID, InvoiceDate, InvoiceNo, Description, Quantity, UnitPrice )
import p3_util_plot

segment = _oP5_SegmentClassifier.get_customer_marketSegment(df_invoice_line)
p3_util_plot.printmd("Customer belongs to segment = "+str(segment))


*** Time features transformation ***
self.df_invoice_line : (1, 6)
is_built_step : False
Time feature : month --> (1, 12)
is_built_step : False
Time feature : day --> (1, 31)
is_built_step : False
Time feature : dow --> (1, 6)
is_built_step : False
Time feature : pod --> (1, 2)
Aggregated time features df : (1, 51)
df_timeFeature= (1, 51)
df_customers_timeFeature : (1, 51)
(1, 30)

*** RFM transformation ***
df_customers_rfm =(1, 59)

*** NLP transformation ***
1
1
1
1
1
1
1
1
df_invoice_line : (1, 8)
self._df_w_nlp : (1, 1717)
df_customers_pca_nlp : (1, 250)
All features : (1, 339)


<p><font color='green'>**Customer belongs to segment = 0**</font></p>