### **3. Wrangle, prepare, cleanse the data**

In this section, we prepare the data so that the model can perform well. The steps follow from the analysis carried out in Section 2, Data Analysis. The phases are:

**3.1 Cleaning:** Removing duplicates, correcting incorrect column names, removing irrelevant columns and changing data types.\
**3.2 Integration:** Data grouping and handling missing values.\
**3.3 Construction:** Creating new variables and encoding categorical variables.\
**3.4 Variable selection:** Studying correlations and variance.

Libraries:

In [6]:
# Data analysis and wrangling
import numpy as np
import pandas as pd
import datetime
import time
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OrdinalEncoder, StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning

# Best practices
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.options.display.float_format = '{:.2f}'.format

Acquire data:

In [7]:
df = pd.read_csv('data_easymoney.csv')

#### **3.1 Cleaning**

##### **Removing duplicates:**

We note that we do not have any duplicated rows.

In [8]:
df.duplicated().sum()

0
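One caveat: at this point the frame still contains 'Unnamed: 0', which acts as a row index. If that column is unique per row, a full-row check can never report a duplicate. The sketch below illustrates this on a small made-up frame (the values are invented for illustration, not taken from the dataset) by repeating the check without the index-like column.

```python
import pandas as pd

# Toy frame standing in for the raw CSV; "Unnamed: 0" mimics the exported row index.
# All values here are invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "pk_cid": [100, 100, 200],
    "pk_partition": ["2018-01-28", "2018-01-28", "2018-01-28"],
    "loans": [0, 0, 1],
})

# Full-row check: the unique index column hides the repeated record.
full_row_dupes = int(toy.duplicated().sum())

# Same check, ignoring the index-like column.
content_dupes = int(toy.drop(columns=["Unnamed: 0"]).duplicated().sum())

print(full_row_dupes, content_dupes)
```

On the real data, the same idea would be `df.drop(columns=["Unnamed: 0"]).duplicated().sum()`; whether it also returns 0 is not shown in this notebook.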

##### **Correcting incorrect column names:**

The em_acount column name is misspelled; it should be em_account. However minor the error may seem now, leaving it uncorrected can cause problems later when we work with this column. We correct it here:

In [9]:
df.columns

Index(['Unnamed: 0', 'pk_cid', 'pk_partition', 'short_term_deposit', 'loans',
       'mortgage', 'funds', 'securities', 'long_term_deposit', 'em_account_pp',
       'credit_card', 'payroll', 'pension_plan', 'payroll_account',
       'emc_account', 'debit_card', 'em_account_p', 'em_acount', 'entry_date',
       'entry_channel', 'active_customer', 'segment', 'country_id',
       'region_code', 'gender', 'age', 'deceased', 'salary'],
      dtype='object')

In [10]:
df.rename(columns = {"em_acount":"em_account"}, inplace = True)

##### **Removing irrelevant columns:**

Later we will study correlations and variance to examine in more depth which columns may be irrelevant. A quick look already shows that:

* 'Unnamed: 0': This column acts as a row index for our sample. It is redundant because the dataframe already has its own index.
* 'em_account_pp': Every value in this column is 0. A constant column carries no information, so it is irrelevant.

In [11]:
df['em_account_pp'].value_counts()

em_account_pp
0    5962924
Name: count, dtype: int64

In [12]:
df.drop(columns=["Unnamed: 0", "em_account_pp"], inplace=True)
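Instead of inspecting columns one at a time, constant columns can be found programmatically. The following is a minimal sketch on an invented toy frame; it assumes that "constant" means a single distinct value with missing values counted as a value of their own.

```python
import pandas as pd

# Toy frame with invented values; only the column names echo the dataset.
toy = pd.DataFrame({
    "em_account_pp": [0, 0, 0, 0],
    "loans": [0, 1, 0, 0],
    "salary": [1000.0, None, 2500.0, 1800.0],
})

# dropna=False counts NaN as a distinct value, so a column that mixes
# 0 and NaN is not flagged as constant.
n_unique = toy.nunique(dropna=False)
constant_columns = n_unique[n_unique <= 1].index.tolist()

print(constant_columns)
```

Applied to `df`, the resulting list could be passed directly to `df.drop(columns=...)`.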

##### **Changing data types in columns:**

Two columns hold dates, so we convert them to datetime. First, however, we have to fix values that are not valid calendar dates. The data contains February 29 entries for 2015 and 2019, neither of which is a leap year, so we move them to February 28:

In [13]:
df.loc[df['entry_date'] == '2015-02-29', 'entry_date'] = '2015-02-28'
df.loc[df['entry_date'] == '2019-02-29', 'entry_date'] = '2019-02-28'

In [14]:
df["pk_partition"]=pd.to_datetime(df["pk_partition"], format='%Y-%m-%d')
df["entry_date"]=pd.to_datetime(df["entry_date"], format='%Y-%m-%d')
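The invalid dates do not have to be known in advance. As a sketch, parsing with `errors="coerce"` turns any unparseable string into `NaT`, which makes the offending values easy to list before deciding how to fix them. The series below is invented for illustration.

```python
import pandas as pd

# Invented sample: 2016-02-29 is a real date (leap year); the 2015 and 2019 ones are not.
raw_dates = pd.Series(["2015-02-28", "2015-02-29", "2016-02-29", "2019-02-29"])

# errors="coerce" returns NaT instead of raising on an invalid date.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Strings that were present but could not be parsed.
invalid_dates = raw_dates[parsed.isna() & raw_dates.notna()].unique().tolist()

print(invalid_dates)
```

Running this check on `df["entry_date"]` before the replacement step would confirm whether the two hard-coded dates are the only problematic ones.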

Columns whose only values are 0 and 1 can be converted to int8 to save memory and reduce computational cost.

In [20]:
columns_to_convert = ['short_term_deposit', 'loans', 'mortgage', 'funds', 'securities', 'long_term_deposit', 'credit_card',
                      'payroll','pension_plan', 'payroll_account', 'emc_account', 'debit_card', 'em_account_p', 'active_customer']

In [21]:
for column in columns_to_convert:
    df[column] = df[column].astype('int8')

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
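The error indicates that at least one of the listed columns contains missing values, which NumPy's `int8` cannot represent. One possible workaround, sketched below on an invented toy frame, is to find the affected columns and cast to pandas' nullable `Int8` dtype (capital "I"), which stores missing entries as `<NA>`. Which columns of the real data are affected is not shown here; the `payroll` values in the example are an assumption made only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; "payroll" is given a NaN purely as an example.
toy = pd.DataFrame({
    "payroll": [0.0, 1.0, np.nan],
    "loans": [0, 1, 0],
})
flag_columns = ["payroll", "loans"]

# Identify which flag columns would break a plain int8 cast.
columns_with_nan = [c for c in flag_columns if toy[c].isna().any()]

# Nullable integer dtype: keeps missing entries as <NA> instead of raising.
toy = toy.astype({c: "Int8" for c in flag_columns})

print(columns_with_nan)
print(toy.dtypes)
```

The alternative is to impute the missing values first (see the section on handling missing values) and then cast to plain `int8`.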

#### **3.2 Integration**

When a categorical variable takes many distinct values, models can become less efficient. It often helps to reduce the number of values, for example by grouping the less frequent ones into a single category.
First, let's see which variables this procedure applies to:

* country_id: ES (Spain) accounts for 5,960,672 records, which is 99.96% of the total. We can split the data into Spain and Others.

In [53]:
df['country_id'].value_counts()[df['country_id'].value_counts() < 500 ].index.tolist()

['GB',
 'FR',
 'DE',
 'US',
 'CH',
 'BR',
 'BE',
 'VE',
 'IE',
 'MX',
 'AT',
 'AR',
 'PL',
 'IT',
 'MA',
 'CL',
 'CN',
 'CA',
 'LU',
 'ET',
 'QA',
 'CI',
 'SA',
 'CM',
 'SN',
 'MR',
 'NO',
 'RU',
 'CO',
 'GA',
 'GT',
 'DO',
 'SE',
 'DJ',
 'PT',
 'JM',
 'RO',
 'HU',
 'DZ',
 'PE']

In [44]:
df['country_id'].value_counts()

country_id
ES    5960672
GB        441
FR        225
DE        199
US        195
CH        194
BR         87
BE         81
VE         79
IE         68
MX         58
AT         51
AR         51
PL         49
IT         45
MA         34
CL         30
CN         28
CA         22
LU         17
ET         17
QA         17
CI         17
SA         17
CM         17
SN         17
MR         17
NO         17
RU         17
CO         17
GA         17
GT         17
DO         17
SE         16
DJ         11
PT         11
JM         11
RO          9
HU          8
DZ          7
PE          4
Name: count, dtype: int64
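Given these counts, the Spain/Others split could be sketched as follows. The frame and the `country_group` column name are hypothetical and used only for illustration.

```python
import pandas as pd

# Invented sample of country codes.
toy = pd.DataFrame({"country_id": ["ES", "ES", "GB", "FR", "ES", "PE"]})

# Keep "ES" and replace every other value with "Other".
# Note: missing country codes would also end up in "Other".
toy["country_group"] = toy["country_id"].where(toy["country_id"] == "ES", "Other")

counts = toy["country_group"].value_counts()
print(counts)
```

Writing the result to a new column keeps the original `country_id` available in case the grouping needs to be revisited.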

In [34]:
df.columns

Index(['pk_cid', 'pk_partition', 'short_term_deposit', 'loans', 'mortgage',
       'funds', 'securities', 'long_term_deposit', 'credit_card', 'payroll',
       'pension_plan', 'payroll_account', 'emc_account', 'debit_card',
       'em_account_p', 'em_account', 'entry_date', 'entry_channel',
       'active_customer', 'segment', 'country_id', 'region_code', 'gender',
       'age', 'deceased', 'salary'],
      dtype='object')

#### **Handling missing values:**

#### **Creating new variables:**

#### **Encoding categorical variables:**

#### **Studying correlations:**

#### **Studying variance:**