## Overview

Steps involved:

- **Step - 0** : Setup Environment
- **Step - 1** : Data Import
- **Step - 2** : Exploratory Data Analysis
- **Step - 3** : Data Preprocessing
- **Step - 4** : Model Building
- **Step - 5** : Model Evaluation

## Step - 0 : Setup Environment

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import pickle

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

## Step - 1 : Data Import

In [2]:
# Step "1_preprocess.ipynb" produces files "train_hash.csv" and "test_hash.csv"

In [3]:
test_dtype_dict = {'ncodpers':np.int32,'ind_empleado':np.int8,'pais_residencia':np.int8,
'sexo':np.int8,'age':np.int16,'ind_nuevo':np.int8,'antiguedad':np.int16,'indrel':np.int8,
'indrel_1mes':np.int8,'tiprel_1mes':np.int8,'indresi':np.int8,'indext':np.int8,'conyuemp':np.int8,
'canal_entrada':np.int16,'indfall':np.int8,'tipodom':np.int8,'cod_prov':np.int8,
'ind_actividad_cliente':np.int8,'renta':np.int64,'segmento':np.int8,'fecha_dato_month':np.int8,
'fecha_dato_year':np.int8,'month_int':np.int8,'fecha_alta_month':np.int8,'fecha_alta_year':np.int8,
'fecha_alta_day':np.int8,'fecha_alta_month_int':np.int16,'fecha_alta_day_int':np.int32,'ult_fec_cli_1t_month':np.int8,
'ult_fec_cli_1t_year':np.int8,'ult_fec_cli_1t_day':np.int8,'ult_fec_cli_1t_month_int':np.int8}

In [4]:
test_orig = pd.read_csv('/home/pabhijit/out/test_hash.csv',dtype = test_dtype_dict,header=0)

In [5]:
test_orig.shape

(929615, 32)
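
As a rough illustration of why the explicit dtype map is worth the effort, the sketch below reads a tiny in-memory CSV twice, with and without narrow dtypes, and compares memory usage. The sample rows and the names `sample_csv` and `small_dtypes` are made up for illustration; only the column names come from the dictionary above.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for test_hash.csv (values are invented for illustration)
sample_csv = io.StringIO(
    "ncodpers,sexo,age,renta\n"
    "15889,0,56,326124\n"
    "1170544,1,36,-1\n"
    "1170545,0,22,148402\n"
)

# Subset of the dtype map used above
small_dtypes = {'ncodpers': np.int32, 'sexo': np.int8,
                'age': np.int16, 'renta': np.int64}

df_default = pd.read_csv(sample_csv)   # dtypes are inferred by pandas
sample_csv.seek(0)                     # rewind the buffer before re-reading
df_small = pd.read_csv(sample_csv, dtype=small_dtypes)

# Bytes used by the column data only (index excluded)
bytes_default = df_default.memory_usage(index=False).sum()
bytes_small = df_small.memory_usage(index=False).sum()
print(bytes_default, bytes_small)
```

On the full file with roughly 930,000 rows the same idea applies column by column, so the saving scales with the row count.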

In [6]:
test_orig.head()

Unnamed: 0,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,...,month_int,fecha_alta_month,fecha_alta_year,fecha_alta_day,fecha_alta_month_int,fecha_alta_day_int,ult_fec_cli_1t_month,ult_fec_cli_1t_year,ult_fec_cli_1t_day,ult_fec_cli_1t_month_int
0,15889,3,0,0,56,0,256,1,1,0,...,18,1,0,16,1,46,1,5,1,61
1,1170544,0,0,1,36,0,34,1,1,1,...,18,8,18,28,224,6838,1,5,1,61
2,1170545,0,0,0,22,0,34,1,1,0,...,18,8,18,28,224,6838,1,5,1,61
3,1170547,0,0,1,22,0,34,1,1,1,...,18,8,18,28,224,6838,1,5,1,61
4,1170548,0,0,1,22,0,34,1,1,1,...,18,8,18,28,224,6838,1,5,1,61


In [7]:
test_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 929615 entries, 0 to 929614
Data columns (total 32 columns):
ncodpers                    929615 non-null int32
ind_empleado                929615 non-null int8
pais_residencia             929615 non-null int8
sexo                        929615 non-null int8
age                         929615 non-null int16
ind_nuevo                   929615 non-null int8
antiguedad                  929615 non-null int16
indrel                      929615 non-null int8
indrel_1mes                 929615 non-null int8
tiprel_1mes                 929615 non-null int8
indresi                     929615 non-null int8
indext                      929615 non-null int8
conyuemp                    929615 non-null int8
canal_entrada               929615 non-null int16
indfall                     929615 non-null int8
tipodom                     929615 non-null int8
cod_prov                    929615 non-null int8
ind_actividad_cliente       929615 non-null int8
ren

In [8]:
dataset = test_orig

In [9]:
dataset.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ncodpers,929615.0,879456.621092,448156.939785,15889.0,483361.5,966425.0,1264316.5,1553689.0
ind_empleado,929615.0,0.001123,0.050841,0.0,0.0,0.0,0.0,4.0
pais_residencia,929615.0,0.100926,2.041953,0.0,0.0,0.0,0.0,118.0
sexo,929615.0,0.457275,0.498182,-1.0,0.0,0.0,1.0,1.0
age,929615.0,40.249821,17.185119,2.0,25.0,39.0,51.0,164.0
ind_nuevo,929615.0,0.027849,0.164541,0.0,0.0,0.0,0.0,1.0
antiguedad,929615.0,80.955549,67.241854,-1.0,23.0,55.0,136.0,257.0
indrel,929615.0,1.00181,0.042511,1.0,1.0,1.0,1.0,2.0
indrel_1mes,929615.0,1.000009,0.014668,-1.0,1.0,1.0,1.0,3.0
tiprel_1mes,929615.0,0.576555,0.494214,-1.0,0.0,1.0,1.0,2.0


In [10]:
dataset[dataset['cod_prov']<0].shape

(3996, 32)

In [11]:
dataset.shape

(929615, 32)

### 2. Exploratory Data Analysis

### 2.3 : Missing Value

In [12]:
# TOTAL count (non-null values)
sr_count_total = dataset.count()

# MISSING COUNT
sr_count_missing = dataset.isnull().sum()

# MISSING PERCENT
sr_percent_missing = round(dataset.isnull().sum() * 100 / len(dataset),2)

# Combine the three series into one summary dataframe
df_missing_value = pd.DataFrame({'count_total': sr_count_total,
                                 'count_missing': sr_count_missing,
                                 'percent_missing': sr_percent_missing})

# Sort by the field "percent_missing" descending
df_missing_value.sort_values('percent_missing', ascending=False, inplace=True)

# print the dataframe
df_missing_value

Unnamed: 0,count_total,count_missing,percent_missing
ncodpers,929615,0,0.0
ind_empleado,929615,0,0.0
ult_fec_cli_1t_day,929615,0,0.0
ult_fec_cli_1t_year,929615,0,0.0
ult_fec_cli_1t_month,929615,0,0.0
fecha_alta_day_int,929615,0,0.0
fecha_alta_month_int,929615,0,0.0
fecha_alta_day,929615,0,0.0
fecha_alta_year,929615,0,0.0
fecha_alta_month,929615,0,0.0


**Variables with high missing percentages:**

- **conyuemp** (99% missing) : Spouse index. 1 if the customer is the spouse of an employee. This can be dropped, since most values are missing.
- **ult_fec_cli_1t** (99% missing) : Last date as primary customer. This can be dropped, since most values are missing.
- **renta** (~25% missing) : Gross household income.
  

**Variables with few (less than 1%) missing values**
- **segmento** : Segmentation: 01 - VIP, 02 - Individuals, 03 - College graduates. Missing values can be replaced with the most frequent value.
- **canal_entrada** : Channel used by the customer to join. Missing values can be replaced with the most frequent value.
- **cod_prov** : Province code (customer's address). Missing values can be replaced with the most frequent value.
- **tiprel_1mes** : Customer relation type at the beginning of the month: A (active), I (inactive), P (former customer), R (potential). Missing values can be replaced with the most frequent value.
- **indrel_1mes** : Customer type at the beginning of the month: 1 (first/primary customer), 2 (co-owner), P (potential), 3 (former primary), 4 (former co-owner). Missing values are set to primary customer (1).
- **nomprov** : Province name. Since the province code is already in the dataset, the name can be dropped.

**Irrelevant Features**

Based on business knowledge, some variables are unlikely to be relevant to the target model:

- **fecha_dato** : Date field used for internal data partitioning.
- **ncodpers** : Unique customer code, an internal customer identifier.
- **fecha_alta** : Date on which the customer became the first holder of a contract with the bank.

These variables can be dropped from the final dataset.


### 2.4 : EDA Summary

**Variable : antiguedad**
- Only 3 records carry the outlier value -999999.00.
- These values can be replaced with the mean value.
- Apply feature scaling.

**Variable : indrel_1mes**
- Some values are stored as strings (e.g. '1', '1.0', 'P', '3', '3.0', '2.0', '2', '4.0', '4').
- Some values are stored as floats (e.g. 1.0, 3.0, 2.0, 4.0).
- There are also some NaN values.
- This variable is standardized, and its NaN values are handled, during Data Preprocessing.

**Variables : indresi and indext**
- Most customers are residents, i.e. their birth country is the same as the bank's country.
- Both "indresi" and "indext" identify whether a customer is local or foreign.
- Only one of these two variables (not both) is needed in the target dataset.

**Variable : tipodom**
- All customer addresses are primary addresses.
- This variable can be excluded from the target dataset since every value is the same.

**Variables : cod_prov and nomprov**
- There are two province variables: the code and the description.
- We keep the code variable (cod_prov) and drop the description variable (nomprov).

**Suggested Actions**
- **Fix Outlier** : antiguedad
- **Fix Variable** : indrel_1mes
- **Impute Missing** : segmento, canal_entrada, cod_prov, tiprel_1mes, indrel_1mes, renta
- **Impute Strategy** : Replace with the most frequent value for categorical variables and the mean value for continuous variables.
- **Drop Variables** : indext, tipodom, nomprov, conyuemp, ult_fec_cli_1t, fecha_dato, ncodpers, fecha_alta
- **Feature Scaling** : age, antiguedad, renta

### 3 : Data Preprocessing

### 3.4 : Missing Value Imputation

### Fix Invalid values - antiguedad

In [13]:
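# Mean of the positive antiguedad values; invalid entries are stored as -1
# (Alternative, not used: drop the invalid rows instead)
#dataset.drop(dataset[dataset['antiguedad'] <0 ].index, axis=0, inplace=True)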
antiguedad_mean = dataset[dataset['antiguedad']>0]['antiguedad'].mean()
antiguedad_mean

80.96704942424982

In [16]:
dataset['antiguedad'] = dataset['antiguedad'].replace(-1, round(antiguedad_mean))
dataset[dataset['antiguedad']<0].shape

(0, 32)

### Fix Invalid values - renta

In [17]:
# Mean of the valid renta values; it replaces the -1 placeholders in the next cell
renta_mean = dataset[dataset['renta']>0]['renta'].mean()
renta_mean

134087.37664077533

In [18]:
dataset['renta'] = dataset['renta'].replace(-1, renta_mean)
dataset[dataset['renta']<0].shape

(0, 32)
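
The mean-replacement pattern used for `antiguedad` and `renta` can be sketched on a toy series. This sketch assumes -1 is the only placeholder and filters with `>= 0`, so genuine zeros would still count toward the mean; the cells above filter with `> 0`, which differs only if zeros occur in the column.

```python
import pandas as pd

# Toy income values; -1 marks a missing entry (values are illustrative)
renta = pd.Series([326124, -1, 148402, 106885, -1], dtype='float64')

# Mean over valid entries only; the -1 placeholders are excluded
valid_mean = renta[renta >= 0].mean()

# mask() replaces values where the condition is True
renta_filled = renta.mask(renta < 0, valid_mean)

print(valid_mean)
print((renta_filled < 0).sum())
```

Working on a float series avoids a dtype change when the float mean is written into the column.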

### Fix Invalid values - sexo

In [19]:
dataset[dataset['sexo']<0]

Unnamed: 0,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,...,month_int,fecha_alta_month,fecha_alta_year,fecha_alta_day,fecha_alta_month_int,fecha_alta_day_int,ult_fec_cli_1t_month,ult_fec_cli_1t_year,ult_fec_cli_1t_day,ult_fec_cli_1t_month_int
496121,278257,0,0,-1,42,0,177,1,1,1,...,18,10,6,1,82,2491,1,5,1,61
536431,476023,0,0,-1,70,0,15,1,1,0,...,18,3,20,12,243,7402,1,5,1,61
551356,394860,0,0,-1,2,0,203,1,1,0,...,18,7,4,14,55,1684,1,5,1,61
575865,415005,0,51,-1,42,0,158,1,1,0,...,18,4,8,10,100,3050,1,5,1,61
644339,216507,0,0,-1,73,0,10,1,1,0,...,18,8,20,17,248,7557,1,5,1,61


In [20]:
dataset['sexo'].value_counts(normalize=True)

 0    0.542714
 1    0.457281
-1    0.000005
Name: sexo, dtype: float64

Missing values for "sexo" (stored as -1) can be replaced with the most frequent value, i.e. 0.

In [21]:
dataset['sexo'] = dataset['sexo'].replace(-1,0)
dataset[dataset['sexo']<0].shape

(0, 32)

### Fix Invalid values - cod_prov

In [22]:
dataset[dataset['cod_prov']<0].shape

(3996, 32)

In [23]:
dataset['cod_prov'].value_counts(normalize=True).head()

28    0.320832
8     0.095286
46    0.051630
41    0.043558
15    0.030889
Name: cod_prov, dtype: float64

Missing values for "cod_prov" (stored as -1) can be replaced with the most frequent value, i.e. 28.

In [24]:
dataset['cod_prov'] = dataset['cod_prov'].replace(-1,28)
dataset[dataset['cod_prov']<0].shape

(0, 32)

### Fix Invalid values - segmento

In [25]:
dataset[dataset['segmento']<0].shape

(2248, 32)

In [26]:
dataset['segmento'].value_counts(normalize=True)

 2    0.586671
 3    0.372227
 1    0.038684
-1    0.002418
Name: segmento, dtype: float64

Missing values for "segmento" (stored as -1) can be replaced with the most frequent value, i.e. 2.

In [27]:
dataset['segmento'] = dataset['segmento'].replace(-1,2)
dataset[dataset['segmento']<0].shape

(0, 32)
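
The fixes for `sexo`, `cod_prov` and `segmento` all follow the same pattern, so they could be folded into one helper. The function name `fill_sentinel_with_mode` and the toy data are hypothetical; the sketch assumes -1 is the placeholder, as in the cells above.

```python
import pandas as pd


def fill_sentinel_with_mode(series, sentinel=-1):
    """Replace `sentinel` with the most frequent non-sentinel value."""
    mode_value = series[series != sentinel].mode().iloc[0]
    return series.replace(sentinel, mode_value)


# Toy frame standing in for `dataset`
toy = pd.DataFrame({'sexo': [0, 0, 1, -1],
                    'segmento': [2, 3, 2, -1]})

for col in ['sexo', 'segmento']:
    toy[col] = fill_sentinel_with_mode(toy[col])

print(toy)
```

Computing the mode from the data, rather than hard-coding 0, 28 or 2, keeps the step valid if the distribution changes between files.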

Re-validate that no missing values remain in the dataset.

In [28]:
# TOTAL count (non-null values)
sr_count_total = dataset.count()

# MISSING COUNT
sr_count_missing = dataset.isnull().sum()

# MISSING PERCENT
sr_percent_missing = round(dataset.isnull().sum() * 100 / len(dataset),2)

# Combine the three series into one summary dataframe
df_missing_value = pd.DataFrame({'count_total': sr_count_total,
                                 'count_missing': sr_count_missing,
                                 'percent_missing': sr_percent_missing})

# Sort by the field "percent_missing" descending
df_missing_value.sort_values('percent_missing', ascending=False, inplace=True)

# print the dataframe
df_missing_value

Unnamed: 0,count_total,count_missing,percent_missing
ncodpers,929615,0,0.0
ind_empleado,929615,0,0.0
ult_fec_cli_1t_day,929615,0,0.0
ult_fec_cli_1t_year,929615,0,0.0
ult_fec_cli_1t_month,929615,0,0.0
fecha_alta_day_int,929615,0,0.0
fecha_alta_month_int,929615,0,0.0
fecha_alta_day,929615,0,0.0
fecha_alta_year,929615,0,0.0
fecha_alta_month,929615,0,0.0


### 3.5 : Feature Scaling

The continuous variables "age", "antiguedad" and "renta" are scaled to the range [0, 1] using min-max normalization.

In [29]:
dataset[['age', 'antiguedad', 'renta']].describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,929615.0,40.249821,17.185119,2.0,25.0,39.0,51.0,164.0
antiguedad,929615.0,80.955814,67.241693,0.0,23.0,55.0,136.0,257.0
renta,929615.0,134087.376641,201827.436801,1202.0,78295.0,131484.0,134087.376641,28894395.0


In [30]:
scaler = MinMaxScaler()

In [31]:
dataset[['age', 'antiguedad', 'renta']]

Unnamed: 0,age,antiguedad,renta
0,56,256,326124.000000
1,36,34,134087.376641
2,22,34,134087.376641
3,22,34,148402.000000
4,22,34,106885.000000
...,...,...,...
929610,55,206,128643.000000
929611,30,115,134087.376641
929612,52,115,72765.000000
929613,32,115,147488.000000


In [32]:
# Scale each column to [0, 1]; the scaler is refit separately for every column
dataset['age_scaled'] = scaler.fit_transform(np.array(dataset['age']).reshape(-1, 1))
dataset['antiguedad_scaled'] = scaler.fit_transform(np.array(dataset['antiguedad']).reshape(-1, 1))
dataset['renta_scaled'] = scaler.fit_transform(np.array(dataset['renta']).reshape(-1, 1))
dataset[['age','age_scaled','antiguedad','antiguedad_scaled','renta','renta_scaled']]

Unnamed: 0,age,age_scaled,antiguedad,antiguedad_scaled,renta,renta_scaled
0,56,0.333333,256,0.996109,326124.000000,0.011246
1,36,0.209877,34,0.132296,134087.376641,0.004599
2,22,0.123457,34,0.132296,134087.376641,0.004599
3,22,0.123457,34,0.132296,148402.000000,0.005095
4,22,0.123457,34,0.132296,106885.000000,0.003658
...,...,...,...,...,...,...
929610,55,0.327160,206,0.801556,128643.000000,0.004411
929611,30,0.172840,115,0.447471,134087.376641,0.004599
929612,52,0.308642,115,0.447471,72765.000000,0.002477
929613,32,0.185185,115,0.447471,147488.000000,0.005063
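
Min-max scaling maps each value to (x - min) / (max - min). As a sanity check, the first row of the output can be reproduced by hand from the minimum and maximum reported by `describe()` in cell 29. The helper name `min_max` is illustrative.

```python
def min_max(x, lo, hi):
    """Scale x into [0, 1] given the column minimum and maximum."""
    return (x - lo) / (hi - lo)


# (value in row 0, column min, column max) taken from the outputs above
age_scaled = min_max(56, 2, 164)
antiguedad_scaled = min_max(256, 0, 257)
renta_scaled = min_max(326124, 1202, 28894395)

print(round(age_scaled, 6), round(antiguedad_scaled, 6), round(renta_scaled, 6))
# 0.333333 0.996109 0.011246
```

Because the scaler here is fit on the test file itself, its minimum and maximum may differ from those of the training file. If the training data was scaled with its own range, reusing that fitted range on the test data would usually be the consistent choice; whether the training notebook does so is not shown here.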


### 3.6 : Categorical Encoding

- Categorical encoding converts categorical values to numeric values.
- How each categorical variable is encoded depends on its type (ordinal or nominal), as discussed below.
- **prod_matrix** : This can simply be assigned a numeric class label.

- **ind_empleado** : Employee index (A/B/F/N/P). 99% of the values are N, so a single variable, say ind_empleado_N, is enough (1 => N, 0 => others).

- **pais_residencia** : Customer's country of residence. This is a nominal variable. EDA showed that the data is spread across 118 countries, but around 99% of customers belong to ES. A single variable, say "pais_residencia_es", can therefore indicate whether a customer is Spanish: 1 for customers with country code ES and 0 for all others.

- **sexo** : Two groups, V (~55%) and H (~45%). Direct one-hot encoding is fine.
- **tiprel_1mes** : Four categories (I ~57%, A ~42%, P/R <1%), so three groups can be created: I, A and Others.
- **indresi** : Residence index, S (yes) or N (no). 99% of the values are S, so one variable can be created with 1 for S and 0 for N.
- **canal_entrada** : There are 162 banking channels, but most customers joined through KHE, KAT or KFC. This can be grouped into 4 categories: KHE, KAT, KFC and Other.
- **indfall** : Deceased index (N/S). 99% are N, so a single variable is enough.
- **cod_prov** : Province code. Madrid (28) ~32%, around 35% across 10 other provinces, and the rest (~33%) spread across the remaining provinces. Three groups can be created: Group 1 (Madrid), Group 2 (province codes 8, 46, 41, 15, 30, 29, 50, 3, 11, 36) and Group 3 (others).
- **ind_actividad_cliente** : Activity index (0 ~57%, 1 ~43%). Two groups.
- **segmento** : Segmentation: 01 - VIP (~4%), 02 - Individuals (~59%), 03 - College graduates (~37%). Can be split into 3 groups.
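
For the columns where one category dominates, the single indicator variable can be written directly as a comparison, without building the full dummy matrix first. The sketch below assumes the dominant category has integer code 0 in the hashed data, which is consistent with the `describe()` output for `ind_empleado` and `pais_residencia` but should be checked against the encoding produced by the preprocessing notebook.

```python
import pandas as pd

# Toy country codes; 0 is assumed to be the dominant category
toy = pd.DataFrame({'pais_residencia': [0, 0, 51, 0, 118]})

dominant_code = 0  # assumed code of the dominant category

# 1 where the row belongs to the dominant category, 0 otherwise
toy['pais_residencia_0'] = (toy['pais_residencia'] == dominant_code).astype('int8')
print(toy)
```

This produces the same column as selecting `pais_residencia_0` from `pd.get_dummies`, while avoiding the temporary frame with one column per country.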

### 3.6.1 Encode Target Variable - prod_matrix

### 3.6.2 Encode ind_empleado

Employee index (A/B/F/N/P). 99% of the values are N, so a single variable, say ind_empleado_N, is enough (1 => N, 0 => others).

In [33]:
df_ind_empleado = pd.get_dummies(dataset['ind_empleado'], prefix='ind_empleado')
df_ind_empleado.head()

Unnamed: 0,ind_empleado_0,ind_empleado_1,ind_empleado_2,ind_empleado_3,ind_empleado_4
0,0,0,0,1,0
1,1,0,0,0,0
2,1,0,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0


The encoded values are stored in a dataframe called "**df_ind_empleado**".  
We will pick only the field "**ind_empleado_0**" from this dataframe (code 0 is the dominant category, which presumably corresponds to N in the raw data) and concatenate it with the original dataframe.

In [34]:
df_ind_empleado = pd.DataFrame(df_ind_empleado['ind_empleado_0'])
df_ind_empleado.head()

Unnamed: 0,ind_empleado_0
0,0
1,1
2,1
3,1
4,1


### 3.6.3 Encode pais_residencia

Customer's country of residence. This is a nominal variable. EDA showed that the data is spread across 118 countries, but around 99% of customers belong to ES. A single variable, say "pais_residencia_es", can therefore indicate whether a customer is Spanish: 1 for customers with country code ES and 0 for all others.

In [35]:
df_pais_residencia = pd.get_dummies(dataset['pais_residencia'],prefix='pais_residencia')
df_pais_residencia.head()

Unnamed: 0,pais_residencia_0,pais_residencia_2,pais_residencia_3,pais_residencia_4,pais_residencia_5,pais_residencia_6,pais_residencia_7,pais_residencia_8,pais_residencia_9,pais_residencia_10,...,pais_residencia_109,pais_residencia_110,pais_residencia_111,pais_residencia_112,pais_residencia_113,pais_residencia_114,pais_residencia_115,pais_residencia_116,pais_residencia_117,pais_residencia_118
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The encoded values are stored in a dataframe called "**df_pais_residencia**".  
We will pick only the field "**pais_residencia_0**" from this dataframe (code 0 is the dominant category, which presumably corresponds to ES in the raw data) and concatenate it with the original dataframe.

In [36]:
df_pais_residencia = pd.DataFrame(df_pais_residencia['pais_residencia_0'])
df_pais_residencia.head()

Unnamed: 0,pais_residencia_0
0,1
1,1
2,1
3,1
4,1


### 3.6.4 Encode sexo

Two groups, V (~55%) and H (~45%).

In [37]:
df_sexo = pd.get_dummies(dataset['sexo'], prefix='sexo')
df_sexo.head()

Unnamed: 0,sexo_0,sexo_1
0,1,0
1,0,1
2,1,0
3,0,1
4,0,1


The encoded values are stored in a dataframe called "**df_sexo**".  
We will pick only the field "**sexo_0**" (the more frequent of the two categories) from this dataframe and concatenate it with the original dataframe.

In [38]:
df_sexo = pd.DataFrame(df_sexo['sexo_0'])
df_sexo.head()

Unnamed: 0,sexo_0
0,1
1,0
2,1
3,0
4,0


### 3.6.5 Encode ind_nuevo



In [39]:
df_ind_nuevo = pd.get_dummies(dataset['ind_nuevo'], prefix='ind_nuevo')
df_ind_nuevo.head()

Unnamed: 0,ind_nuevo_0,ind_nuevo_1
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


The encoded values are stored in a dataframe called "df_ind_nuevo".
We will pick only the field "ind_nuevo_0" from this dataframe and concatenate with original dataframe.

In [40]:
df_ind_nuevo = pd.DataFrame(df_ind_nuevo['ind_nuevo_0'])
df_ind_nuevo.head()

Unnamed: 0,ind_nuevo_0
0,1
1,1
2,1
3,1
4,1


### 3.6.6 Encode indrel_1mes

99% of the values of this variable are 1.

In [41]:
df_indrel_1mes = pd.get_dummies(dataset['indrel_1mes'], prefix='indrel_1mes')
df_indrel_1mes = df_indrel_1mes.rename(columns={"indrel_1mes_1.0": "indrel_1mes_1"})
df_indrel_1mes.head()

Unnamed: 0,indrel_1mes_-1,indrel_1mes_1,indrel_1mes_3
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0


The encoded values are stored in a dataframe called "**df_indrel_1mes**".  
We will pick only the field "**indrel_1mes_1**" from this dataframe and concatenate with original dataframe.

In [42]:
df_indrel_1mes = pd.DataFrame(df_indrel_1mes['indrel_1mes_1'])
df_indrel_1mes.head()

Unnamed: 0,indrel_1mes_1
0,1
1,1
2,1
3,1
4,1


### 3.6.7 Encode tiprel_1mes

There are 4 categories (I ~57%, A ~42%, P/R <1%), so 3 groups can be created: I, A and Others.

In [43]:
df_tiprel_1mes = pd.get_dummies(dataset['tiprel_1mes'], prefix='tiprel_1mes')
df_tiprel_1mes.head()

Unnamed: 0,tiprel_1mes_-1,tiprel_1mes_0,tiprel_1mes_1,tiprel_1mes_2
0,0,1,0,0
1,0,0,1,0
2,0,1,0,0
3,0,0,1,0
4,0,0,1,0


The encoded values are stored in a dataframe called "**df_tiprel_1mes**".  
We will pick only the field "**tiprel_1mes_1**" from this dataframe and concatenate it with the original dataframe.

In [44]:
df_tiprel_1mes = pd.DataFrame(df_tiprel_1mes['tiprel_1mes_1'])
df_tiprel_1mes.head()

Unnamed: 0,tiprel_1mes_1
0,0
1,1
2,0
3,1
4,1


### 3.6.8 Encode indresi

Residence index, S (yes) or N (no). 99% of the values are S, so one variable can be created with 1 for S and 0 for N.

In [45]:
df_indresi = pd.get_dummies(dataset['indresi'], prefix='indresi')
df_indresi.head()

Unnamed: 0,indresi_0,indresi_1
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


The encoded values are stored in a dataframe called "**df_indresi**".  
We will pick only the field "**indresi_1**" (the dominant category) from this dataframe and concatenate it with the original dataframe.

In [46]:
df_indresi = pd.DataFrame(df_indresi['indresi_1'])
df_indresi.head()

Unnamed: 0,indresi_1
0,1
1,1
2,1
3,1
4,1


### 3.6.9 Encode canal_entrada

There are 162 banking channels, but most customers joined through KHE, KAT or KFC. This can be grouped into 4 categories: KHE, KAT, KFC and Other.

In [47]:
df_canal_entrada = pd.get_dummies(dataset['canal_entrada'], prefix='canal_entrada')
df_canal_entrada.head()

Unnamed: 0,canal_entrada_0,canal_entrada_1,canal_entrada_2,canal_entrada_3,canal_entrada_4,canal_entrada_5,canal_entrada_6,canal_entrada_7,canal_entrada_8,canal_entrada_9,...,canal_entrada_153,canal_entrada_154,canal_entrada_155,canal_entrada_156,canal_entrada_157,canal_entrada_158,canal_entrada_159,canal_entrada_160,canal_entrada_161,canal_entrada_162
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The encoded values are stored in a dataframe called "**df_canal_entrada**".  
We will pick only the fields "**canal_entrada_1, canal_entrada_0**" (integer codes from the hashed data) from this dataframe and concatenate them with the original dataframe.

In [48]:
#df_canal_entrada = pd.DataFrame(df_canal_entrada[['canal_entrada_KHE', 'canal_entrada_KAT', 'canal_entrada_KFC']])
df_canal_entrada = pd.DataFrame(df_canal_entrada[['canal_entrada_1', 'canal_entrada_0']])
df_canal_entrada.head()

Unnamed: 0,canal_entrada_1,canal_entrada_0
0,0,0
1,0,0
2,1,0
3,1,0
4,1,0
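
The cell above keeps only two of the dummy columns. The "top channels plus Other" grouping described for `canal_entrada` could be sketched as follows. The value of `top_n`, the toy codes and the label `other` are illustrative choices, not the notebook's.

```python
import pandas as pd

# Toy channel codes (illustrative); cast to str so 'other' fits the same dtype
canal = pd.Series([1, 1, 5, 5, 5, 0, 0, 7, 9, 1], name='canal_entrada').astype(str)

top_n = 3
top_codes = canal.value_counts().nlargest(top_n).index

# Keep the most frequent codes as-is, collapse everything else into 'other'
canal_grouped = canal.where(canal.isin(top_codes), 'other')

df_canal = pd.get_dummies(canal_grouped, prefix='canal_entrada').astype('int8')
print(df_canal.columns.tolist())
```

Each row has exactly one indicator set, so one of the resulting columns can be dropped as the reference category if the model requires it.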


### 3.6.10 Encode indfall

Deceased index (N/S). 99% are N, so a single variable is enough.

In [49]:
df_indfall = pd.get_dummies(dataset['indfall'], prefix='indfall')
df_indfall.head()

Unnamed: 0,indfall_0,indfall_1
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


The encoded values are stored in a dataframe called "**df_indfall**".  
We will pick only the field "**indfall_0**" from this dataframe (code 0 is the dominant category, which presumably corresponds to N in the raw data) and concatenate it with the original dataframe.

In [50]:
df_indfall = pd.DataFrame(df_indfall['indfall_0'])
df_indfall.head()

Unnamed: 0,indfall_0
0,1
1,1
2,1
3,1
4,1


### 3.6.11 Encode cod_prov

Province code. Madrid (28) ~32%, around 35% across 10 other provinces, and the rest (~33%) spread across the remaining provinces. Three groups can be created: Group 1 (Madrid), Group 2 (province codes 8, 46, 41, 15, 30, 29, 50, 3, 11, 36) and Group 3 (others).

In [51]:
df_cod_prov = pd.get_dummies(dataset['cod_prov'], prefix='cod_prov')
df_cod_prov.head()

Unnamed: 0,cod_prov_1,cod_prov_2,cod_prov_3,cod_prov_4,cod_prov_5,cod_prov_6,cod_prov_7,cod_prov_8,cod_prov_9,cod_prov_10,...,cod_prov_43,cod_prov_44,cod_prov_45,cod_prov_46,cod_prov_47,cod_prov_48,cod_prov_49,cod_prov_50,cod_prov_51,cod_prov_52
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The encoded values are stored in a dataframe called "**df_cod_prov**".  
We will pick only the field "**cod_prov_28**" (Madrid) from this dataframe and concatenate it with the original dataframe.

In [52]:
df_cod_prov = pd.DataFrame(df_cod_prov['cod_prov_28'])
#df_cod_prov = pd.DataFrame(df_cod_prov['cod_prov_28.0'], columns=['cod_prov_28'])
#df_cod_prov = df_cod_prov.rename(columns={"cod_prov_28.0": "cod_prov_28"})
df_cod_prov.head()

Unnamed: 0,cod_prov_28
0,1
1,0
2,0
3,0
4,0
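
The cell above keeps only the Madrid indicator. The three-group scheme described for `cod_prov` could be sketched as below. The group labels, the helper name `prov_group` and the toy codes are illustrative; the list of Group 2 codes is the one given in the text.

```python
import pandas as pd

# Toy province codes (illustrative)
cod_prov = pd.Series([28, 8, 46, 33, 28, 50, 1], name='cod_prov')

# Group 2 province codes as listed in the text above
group2_codes = {8, 46, 41, 15, 30, 29, 50, 3, 11, 36}


def prov_group(code):
    """Map a province code to one of three coarse groups."""
    if code == 28:
        return 'madrid'
    if code in group2_codes:
        return 'top10'
    return 'other'


cod_prov_group = cod_prov.map(prov_group)
df_cod_prov_grouped = pd.get_dummies(cod_prov_group, prefix='cod_prov').astype('int8')
print(cod_prov_group.tolist())
```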


### 3.6.12 Encode ind_actividad_cliente

Activity index (0~ 57%, 1~ 43%). Two Groups.

In [53]:
df_ind_actividad_cliente = pd.get_dummies(dataset['ind_actividad_cliente'], prefix='ind_actividad_cliente')
df_ind_actividad_cliente.head()

Unnamed: 0,ind_actividad_cliente_0,ind_actividad_cliente_1
0,0,1
1,1,0
2,0,1
3,1,0
4,1,0


The encoded values are stored in a dataframe called "**df_ind_actividad_cliente**".  
We will pick only the field "**ind_actividad_cliente_0**" from this dataframe and concatenate it with the original dataframe.

In [54]:
df_ind_actividad_cliente = pd.DataFrame(df_ind_actividad_cliente['ind_actividad_cliente_0'])
df_ind_actividad_cliente.head()

Unnamed: 0,ind_actividad_cliente_0
0,0
1,1
2,0
3,1
4,1


### 3.6.13 Encode segmento

Below is the percentage distribution of the values:  
02 - PARTICULARES     ~ 59%  
03 - UNIVERSITARIO    ~ 37%  
01 - TOP              ~ 4%  

In [55]:
df_segmento = pd.get_dummies(dataset['segmento'], prefix='segmento')
df_segmento = df_segmento.rename(columns={"segmento_02 - PARTICULARES": "segmento_02", 
                                          "segmento_03 - UNIVERSITARIO": "segmento_03"})
df_segmento.head()

Unnamed: 0,segmento_1,segmento_2,segmento_3
0,1,0,0
1,0,1,0
2,0,0,1
3,0,0,1
4,0,0,1


The encoded values are stored in a dataframe called "**df_segmento**".  
We will pick only the field "**segmento_1**" (01 - TOP) from this dataframe and concatenate it with the original dataframe.

In [56]:
#df_segmento1 = df_segmento[['segmento_01 - TOP', 'segmento_03 - UNIVERSITARIO']]
#df_segmento = pd.DataFrame(df_segmento[['segmento_01 - TOP', 'segmento_03 - UNIVERSITARIO']])
#df_segmento = df_segmento.rename(columns={"segmento_01 - TOP": "segmento_01", "segmento_03 - UNIVERSITARIO": "segmento_03"})
df_segmento = pd.DataFrame(df_segmento['segmento_1'])
df_segmento.head()

Unnamed: 0,segmento_1
0,1
1,0
2,0
3,0
4,0


### 3.7 : Final Dataset for Modeling

### 3.7.1 - Continuous Variables

In [57]:
df_cont = dataset[['month_int', 'ncodpers', 'age_scaled', 'antiguedad_scaled', 'renta_scaled']]
df_cont = df_cont.rename(columns={"age_scaled": "age", 
                                "antiguedad_scaled": "antiguedad",
                                "renta_scaled": "renta"})
df_cont.head()

Unnamed: 0,month_int,ncodpers,age,antiguedad,renta
0,18,15889,0.333333,0.996109,0.011246
1,18,1170544,0.209877,0.132296,0.004599
2,18,1170545,0.123457,0.132296,0.004599
3,18,1170547,0.123457,0.132296,0.005095
4,18,1170548,0.123457,0.132296,0.003658


### 3.7.2 - Categorical Variables

### 3.7.3 - Join Categorical Dataframes

Join the categorical variable dataframes into a single dataframe.  
Ref - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html  
Usage: result = left.join(right)

The variables below are highly imbalanced, i.e. 99% of their values fall into one category:

indfall, indresi, indrel_1mes, ind_empleado

Hence the respective dataframes are excluded from modeling.

In [58]:
#df_catg = df_ind_empleado.join(df_pais_residencia)
df_catg = df_pais_residencia.join(df_sexo)
df_catg = df_catg.join(df_ind_nuevo)
#df_catg = df_catg.join(df_indrel_1mes)
df_catg = df_catg.join(df_tiprel_1mes)
#df_catg = df_catg.join(df_indresi)
df_catg = df_catg.join(df_canal_entrada)
#df_catg = df_catg.join(df_indfall)
df_catg = df_catg.join(df_cod_prov)
df_catg = df_catg.join(df_ind_actividad_cliente)
df_catg = df_catg.join(df_segmento)
df_catg.head()

Unnamed: 0,pais_residencia_0,sexo_0,ind_nuevo_0,tiprel_1mes_1,canal_entrada_1,canal_entrada_0,cod_prov_28,ind_actividad_cliente_0,segmento_1
0,1,1,1,0,0,0,1,0,1
1,1,0,1,1,0,0,0,1,0
2,1,1,1,0,1,0,0,0,0
3,1,0,1,1,1,0,0,1,0
4,1,0,1,1,1,0,0,1,0


In [59]:
df_catg.shape

(929615, 9)
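
The chain of `join` calls above can also be written as one column-wise `pd.concat`, which aligns on the index in the same way. This is a sketch on toy frames; the names `df_a`, `df_b`, `df_c` and `df_catg_toy` are made up for illustration.

```python
import pandas as pd

# Toy single-column frames standing in for the encoded dataframes
df_a = pd.DataFrame({'sexo_0': [1, 0, 1]})
df_b = pd.DataFrame({'ind_nuevo_0': [1, 1, 1]})
df_c = pd.DataFrame({'segmento_1': [1, 0, 0]})

# Column-wise concatenation aligns rows on the shared index
df_catg_toy = pd.concat([df_a, df_b, df_c], axis=1)

# The row count must be unchanged if all frames share the same index
rows_preserved = len(df_catg_toy) == len(df_a)
print(df_catg_toy.shape, rows_preserved)
```

Keeping the frames in a list makes it easy to include or exclude a variable by editing one entry, instead of commenting out individual `join` lines.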

### 3.7.4 - Join Continuous and Categorical Dataframes

In [60]:
print(df_cont.shape)
print(df_catg.shape)

(929615, 5)
(929615, 9)


In [61]:
df_model = df_cont.join(df_catg)

In [62]:
print(df_model.shape)
df_model.head()

(929615, 14)


Unnamed: 0,month_int,ncodpers,age,antiguedad,renta,pais_residencia_0,sexo_0,ind_nuevo_0,tiprel_1mes_1,canal_entrada_1,canal_entrada_0,cod_prov_28,ind_actividad_cliente_0,segmento_1
0,18,15889,0.333333,0.996109,0.011246,1,1,1,0,0,0,1,0,1
1,18,1170544,0.209877,0.132296,0.004599,1,0,1,1,0,0,0,1,0
2,18,1170545,0.123457,0.132296,0.004599,1,1,1,0,1,0,0,0,0
3,18,1170547,0.123457,0.132296,0.005095,1,0,1,1,1,0,0,1,0
4,18,1170548,0.123457,0.132296,0.003658,1,0,1,1,1,0,0,1,0


In [63]:
# Pickle the dataset for further use.
df_model.to_pickle('/home/pabhijit/data/model_ready_test.pkl')
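
Before moving on, it can be worth confirming that the pickled frame reads back unchanged. A minimal round-trip check on a toy frame follows; the temporary directory and the file name `model_ready_toy.pkl` are hypothetical and stand in for the real output path.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_model
df_toy = pd.DataFrame({'ncodpers': [15889, 1170544],
                       'age': [0.333333, 0.209877]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_ready_toy.pkl')  # hypothetical file name
    df_toy.to_pickle(path)
    df_back = pd.read_pickle(path)

# equals() compares values, dtypes and labels
round_trip_ok = df_toy.equals(df_back)
print(round_trip_ok)
```

Pickle preserves dtypes and the index, which a CSV round trip would not, but pickle files should only be loaded from trusted sources.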

...Move to next step - **Step_4_Modeling**

# Misc Testing Below