### **4. Customer segmentation**

In this section, we will segment our customers in order to guide our commercial activity. Knowing what the customers in each group are like will help us define our future marketing. The phases we will work on are:

**4.1 Construction and last cleaning**  
**4.2 Clustering using k-means**  
**4.3 Cluster interpretation**  
**4.4 Clients file**  



Libraries:

In [1]:
# Data analysis and wrangling
import numpy as np
import pandas as pd
import datetime
import time
from sklearn.preprocessing import LabelEncoder,MinMaxScaler,OrdinalEncoder,StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors

# Machine learning
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn import metrics

Acquire data from the last notebook:

In [2]:
df = pd.read_csv('data2_easymoney.csv')

A small data type fix

In [3]:
df["pk_partition"]=pd.to_datetime(df["pk_partition"], format='%Y-%m-%d')
df["entry_date"]=pd.to_datetime(df["entry_date"], format='%Y-%m-%d')

#### **4.1 Construction and last cleaning**

We want to segment current customers, so we will use only the most recent partition available in the data. We also keep only customers who are currently alive, that is, deceased = 0.

In [4]:
#We see the partitions of the dataset
df['pk_partition'].unique()

<DatetimeArray>
['2018-01-28 00:00:00', '2018-02-28 00:00:00', '2018-03-28 00:00:00',
 '2018-04-28 00:00:00', '2018-05-28 00:00:00', '2018-06-28 00:00:00',
 '2018-07-28 00:00:00', '2018-08-28 00:00:00', '2018-09-28 00:00:00',
 '2018-10-28 00:00:00', '2018-11-28 00:00:00', '2018-12-28 00:00:00',
 '2019-01-28 00:00:00', '2019-02-28 00:00:00', '2019-03-28 00:00:00',
 '2019-04-28 00:00:00', '2019-05-28 00:00:00']
Length: 17, dtype: datetime64[ns]

In [5]:
# 1075 rows are flagged as deceased (these are rows, not unique clients: a client appears once per partition)
df['deceased'].value_counts()

deceased
0    5961849
1       1075
Name: count, dtype: int64

We create our dataframe, df_last_partition, with the last partition and only the living clients:

In [6]:
# .copy() so that new columns can be added later without a SettingWithCopyWarning
df_last_partition = df.loc[(df['pk_partition'] == '2019-05-28') & (df['deceased'] == 0)].copy()
print(df_last_partition.shape)
df_last_partition.head()

(442909, 30)


Unnamed: 0,pk_cid,pk_partition,short_term_deposit,loans,mortgage,funds,securities,long_term_deposit,credit_card,payroll,...,country_id,region_code,gender,age,deceased,salary,total_products,account_prod,financing_prod,saving_prod
5519929,657826,2019-05-28,0,0,0,0,0,0,0,0,...,1,25.0,0,44,0,54493.38,1,1,0,0
5519930,657817,2019-05-28,0,0,0,0,0,0,0,0,...,1,8.0,1,32,0,110949.86671,0,0,0,0
5519931,657986,2019-05-28,0,0,0,0,0,0,1,1,...,1,41.0,0,39,0,100993.17,6,4,1,1
5519932,657905,2019-05-28,0,0,0,0,0,1,0,0,...,1,28.0,0,85,0,154059.09,2,1,0,1
5519933,657336,2019-05-28,0,0,0,0,0,0,0,0,...,1,28.0,1,38,0,106793.037009,1,1,0,0
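
The filter above hardcodes the date of the last partition. As a minimal sketch on toy data (the frame toy and its values are hypothetical, not the Easymoney data), the same selection can take the most recent partition from the data itself, which keeps working if a newer partition is added:

```python
import pandas as pd

# Toy stand-in for df: three hypothetical customers over two partitions.
toy = pd.DataFrame({
    "pk_cid": [1, 2, 3, 1, 2, 3],
    "pk_partition": pd.to_datetime(["2019-04-28"] * 3 + ["2019-05-28"] * 3),
    "deceased": [0, 0, 0, 0, 1, 0],
})

# Read the most recent partition from the data instead of typing the date.
last_partition = toy["pk_partition"].max()

# Keep living customers of the last partition; .copy() returns an
# independent DataFrame that can safely receive new columns.
toy_last = toy.loc[
    (toy["pk_partition"] == last_partition) & (toy["deceased"] == 0)
].copy()

print(toy_last["pk_cid"].tolist())
```

Customer 2 is dropped because it is flagged as deceased in the last partition.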


##### **Creating new features**

df_last_partition only describes the last month, so it leaves a lot of information from previous months unused. We will create some features that summarize each customer's history, so that the clusters can take it into account:

* Registration time: time from registration to last partition
* Total registrations: total product registrations
* Total cancellations: total product cancellations
* Total revenues: Easymoney revenues from contracted products
* Average duration of products

**Registration time**  

Knowing how long the client has been registered can provide us with useful information.

In [45]:
df_last_partition['registration_time'] = (df_last_partition['pk_partition'] - df_last_partition['entry_date']).dt.days
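
To make the unit of this feature explicit, here is a small sketch on two hypothetical rows (the dates are invented for illustration). Subtracting two datetime columns gives a timedelta, and .dt.days converts it to a whole number of days:

```python
import pandas as pd

# Two hypothetical customers observed in the same partition.
toy = pd.DataFrame({
    "pk_partition": pd.to_datetime(["2019-05-28", "2019-05-28"]),
    "entry_date": pd.to_datetime(["2019-04-28", "2018-05-28"]),
})

# Timedelta between partition and entry date, expressed in days.
toy["registration_time"] = (toy["pk_partition"] - toy["entry_date"]).dt.days

print(toy["registration_time"].tolist())
```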

**Total registrations and total cancellations**

We will calculate how many times each customer has registered for and cancelled each product. Then we will compute the total registrations and total cancellations per customer.

First, we load the list of products from the Preprocessing section:

In [8]:
prod_columns = ['short_term_deposit', 'loans', 'mortgage',
       'funds', 'securities', 'long_term_deposit', 'credit_card', 'payroll',
       'pension_plan', 'payroll_account', 'emc_account', 'debit_card',
       'em_account_p', 'em_account']

We create a RegistrationsCancellations function, which returns how many registrations and cancellations each customer has for each product.

In [9]:
def RegistrationsCancellations(df, products):

    # Assumes the rows of each pk_cid are ordered by pk_partition
    for i in products:
        previous_month = i + "_previous_month"
        difference = i + "_dif"
        registrations = i + "_registrations"
        cancellations = i + "_cancellations"

        # Product status in the previous month and month-over-month difference
        # (NaN in each customer's first partition, so it counts as neither)
        df[previous_month] = df.groupby('pk_cid')[i].shift(1)
        df[difference] = df[i] - df[previous_month]

        # REGISTRATIONS: creation of a dictionary in which -->
        # keys = pk_cid  ;  values = number of times difference is equal to 1
        dicc_reg = (df.loc[df[difference] == 1].groupby('pk_cid').size()).to_dict()
        df[registrations] = df['pk_cid'].map(dicc_reg).fillna(0).astype(int)

        # CANCELLATIONS: creation of a dictionary in which -->
        # keys = pk_cid  ;  values = number of times difference is equal to -1
        dicc_canc = (df.loc[df[difference] == -1].groupby('pk_cid').size()).to_dict()
        df[cancellations] = df['pk_cid'].map(dicc_canc).fillna(0).astype(int)

        # Remove the helper columns
        df = df.drop([previous_month, difference], axis=1)

    return df
    

In [10]:
df = RegistrationsCancellations(df, prod_columns)

Example: the client with 'pk_cid' == 1052217 has 4 registrations and 3 cancellations for the debit_card product. Let's see how these columns look in the final df; note that the per-customer totals are repeated on every row of the client.

In [12]:
df.loc[df['pk_cid'] == 1052217, ['pk_partition', 'debit_card','debit_card_registrations', 'debit_card_cancellations'] ]

Unnamed: 0,pk_partition,debit_card,debit_card_registrations,debit_card_cancellations
884,2018-01-28,0,4,3
474283,2018-02-28,1,4,3
694896,2018-03-28,1,4,3
965738,2018-04-28,1,4,3
981599,2018-05-28,0,4,3
1257703,2018-06-28,0,4,3
1581909,2018-07-28,1,4,3
2046404,2018-08-28,0,4,3
2327229,2018-09-28,0,4,3
2672230,2018-10-28,1,4,3
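
The counting logic can be checked by hand on a small series. The sketch below is an illustration under stated assumptions, not the notebook's function: it uses a hypothetical customer (pk_cid = 1) whose debit_card column reuses only the ten values listed above, so its counts differ from the client's full totals. It also swaps the shift-and-subtract step for groupby(...).diff(), which computes the same month-over-month difference.

```python
import pandas as pd

# Hypothetical customer observed in ten consecutive monthly partitions.
toy = pd.DataFrame({
    "pk_cid": [1] * 10,
    "pk_partition": pd.to_datetime([f"2018-{m:02d}-28" for m in range(1, 11)]),
    "debit_card": [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
})

# The difference is only meaningful if rows are in chronological order.
toy = toy.sort_values(["pk_cid", "pk_partition"])

# Month-over-month change per customer: +1 = registration, -1 = cancellation.
# The first partition has no previous month, so its difference is NaN.
diff = toy.groupby("pk_cid")["debit_card"].diff()

registrations = diff.eq(1).groupby(toy["pk_cid"]).sum()
cancellations = diff.eq(-1).groupby(toy["pk_cid"]).sum()

print(registrations.loc[1], cancellations.loc[1])
```

The 0 to 1 transitions happen in February, July and October (three registrations) and the 1 to 0 transitions in May and August (two cancellations).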


Now we have everything we need to calculate the **Total Registrations and Cancellations** of each client. We create a function that adds both totals per customer.

In [21]:
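def TotalRegistrationsCancellations(df):

    # Select the per-product columns by suffix and skip the totals themselves,
    # so that re-running the cell does not add the totals into the sum
    reg_cols = [c for c in df.columns if c.endswith('_registrations') and c != 'Total_registrations']
    canc_cols = [c for c in df.columns if c.endswith('_cancellations') and c != 'Total_cancellations']

    df['Total_registrations'] = df[reg_cols].sum(axis=1)
    df['Total_cancellations'] = df[canc_cols].sum(axis=1)

    return df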

In [22]:
df = TotalRegistrationsCancellations(df)

In [30]:
df[['pk_cid','Total_registrations','Total_cancellations']].sample(10)

Unnamed: 0,pk_cid,Total_registrations,Total_cancellations
4602906,1119259,0,0
3062389,1213331,0,0
1501125,1112709,1,2
5327701,1343928,0,0
297240,1147020,0,0
874930,1316892,0,0
4774776,1314790,0,0
3812796,1467903,0,0
5569024,1377994,2,1
5393409,1452152,2,2
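
One point to keep in mind when summing columns selected by name: DataFrame.filter(like=...) matches any column whose name contains the given substring. The sketch below uses a hypothetical toy frame in which a registration_time column sits next to the per-product counts, and compares the substring selection with an explicit list built from the product names (the two-product list is an assumption for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    "pk_cid": [1, 2],
    "loans_registrations": [1, 0],
    "debit_card_registrations": [2, 1],
    "registration_time": [30, 365],  # unrelated column sharing the substring
})

# Substring match: registration_time is included in the sum as well.
by_substring = toy.filter(like="registration").sum(axis=1)

# Explicit list derived from the product names: only the count columns.
products = ["loans", "debit_card"]
reg_cols = [p + "_registrations" for p in products]
explicit = toy[reg_cols].sum(axis=1)

print(by_substring.tolist(), explicit.tolist())
```

Building the column list from prod_columns avoids this kind of accidental match.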


**Total revenues**

Easymoney gives us the following information about the income it receives from product registrations:  
* €10 for each account product sold
* €40 for each savings product sold
* €60 for each financing product sold   


First of all we load the list of the different products from the Preprocessing section

In [31]:
account_prod = ['payroll', 'payroll_account', 'emc_account', 'em_account_p', 'em_account', 
                'debit_card']
financing_prod = ['loans', 'mortgage', 'credit_card']
saving_prod = ['short_term_deposit', 'funds', 'securities', 'long_term_deposit',
               'pension_plan' ]

We create the feature total revenue:

In [34]:
# Revenue per registration: 10 (account), 40 (saving), 60 (financing)
df['total_revenue'] = (
    df[[p + '_registrations' for p in account_prod]].sum(axis=1) * 10
    + df[[p + '_registrations' for p in saving_prod]].sum(axis=1) * 40
    + df[[p + '_registrations' for p in financing_prod]].sum(axis=1) * 60
)
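
As a worked example of the arithmetic, the sketch below applies the three prices to two hypothetical customers through a price-per-product mapping. The shortened product lists and the toy values are assumptions made for the illustration:

```python
import pandas as pd

# Shortened, hypothetical product lists (one or two products per family).
account_prod = ["em_account", "debit_card"]
saving_prod = ["funds"]
financing_prod = ["loans"]

# Price in euros for each product registration, by family.
price = {
    **dict.fromkeys(account_prod, 10),
    **dict.fromkeys(saving_prod, 40),
    **dict.fromkeys(financing_prod, 60),
}

toy = pd.DataFrame({
    "em_account_registrations": [1, 0],
    "debit_card_registrations": [1, 1],
    "funds_registrations": [1, 0],
    "loans_registrations": [1, 1],
})

# One weight per registration column, aligned on the column labels.
weights = pd.Series({p + "_registrations": eur for p, eur in price.items()})
toy["total_revenue"] = toy[weights.index].mul(weights, axis=1).sum(axis=1)

print(toy["total_revenue"].tolist())
```

The first customer yields 10 + 10 + 40 + 60 = 120 and the second 10 + 60 = 70.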

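We see that total_revenue takes values from 0 to 550.
We also observe that the vast majority of rows (4,704,895) have a 0 in this variable. Many clients have therefore registered on the platform without any new product registration being recorded in the observed partitions. Products that a client already held in their first partition are not counted, because there is no previous month to compare against.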

In [41]:
df['total_revenue'].unique()

array([ 20,   0,  50,  10,  40,  60,  70, 200, 130, 190, 110,  80, 160,
       120, 100,  90,  30, 150, 260, 230, 220, 140, 170, 330, 180, 210,
       240, 310, 280, 250, 290, 300, 270, 390, 410, 320, 380, 360, 370,
       420, 450, 400, 440, 340, 350, 430, 460, 550, 470, 500, 490])

In [44]:
df['total_revenue'].value_counts().head(10)

total_revenue
0      4704895
10      463407
20      156704
60       96041
40       72142
30       71543
50       67088
70       66577
80       36116
120      35947
Name: count, dtype: int64
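
The counts above are per row, and each client contributes one row per partition in which they appear. A minimal sketch on toy data (hypothetical values) of how the share of zero-revenue clients could be measured per client instead, assuming total_revenue is constant across a client's rows, as it is by construction here:

```python
import pandas as pd

# Three hypothetical clients with a different number of partitions each.
toy = pd.DataFrame({
    "pk_cid": [1, 1, 1, 2, 3, 3],
    "total_revenue": [0, 0, 0, 10, 0, 0],
})

# Share of rows with zero revenue (what value_counts on all rows reflects).
row_share = (toy["total_revenue"] == 0).mean()

# Share of clients with zero revenue: keep one row per client first.
per_customer = toy.drop_duplicates("pk_cid")
customer_share = (per_customer["total_revenue"] == 0).mean()

print(round(row_share, 3), round(customer_share, 3))
```

In this toy frame 5 of 6 rows are zero, but only 2 of 3 clients, because clients with more partitions weigh more in the row count.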

**Average duration of products**

In [None]:
def DurationProducts(df, products):
    
    df