# **[ExE]Bank Investimentos: Clean Data**

> **Campanha de marketing otimizada**: a campanha anual para a adesão ao depósito a prazo fixo impacta milhões de possíveis clientes, mas tem uma taxa de conversão modesta de 13%
>> **Objetivo**: melhorar os resultados da campanha aumentando a taxa de conversão e reduzir custo.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
infobank_df = pd.read_csv('bank-additional-full.csv', sep=';')
infobank_df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


## **Dados faltantes**

In [3]:
infobank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [4]:
infobank_df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [5]:
infobank_df.dropna(inplace=True)

## **Features**

In [6]:
infobank_df.keys()

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')

## **Agrupar categorais minoritárias: jobs**

In [7]:
np.unique(infobank_df['job'])

array(['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
       'retired', 'self-employed', 'services', 'student', 'technician',
       'unemployed', 'unknown'], dtype=object)

In [8]:
ii = np.isin(infobank_df['job'] , ['student', 'retired', 'admin.'])
infobank_df['job'][~ii] = 'others'
infobank_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,others,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,others,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,others,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,others,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,others,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,others,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


## **Converter algumas variáveis categóricas em numéricas**

### **Education**

In [9]:
np.unique(infobank_df['education'])

array(['basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate',
       'professional.course', 'university.degree', 'unknown'],
      dtype=object)

In [10]:
new_edu = infobank_df['education'].replace(['illiterate', 'unknown', 'basic.4y', 'basic.6y', 'basic.9y', 
                                  'high.school', 'professional.course', 'university.degree'],
                                [0, 0, 1, 1, 1, 2, 3, 4])
infobank_df['education'] = new_edu

### **Marital**

In [11]:
np.unique(infobank_df['marital'])

array(['divorced', 'married', 'single', 'unknown'], dtype=object)

In [12]:
infobank_df['marital'] = infobank_df['marital'].replace(['divorced', 'married', 'single', 'unknown'],
                                                       [0, 0, 1, 1])

### **Default**

In [13]:
np.unique(infobank_df['default'])

array(['no', 'unknown', 'yes'], dtype=object)

In [14]:
infobank_df['default'] = infobank_df['default'].replace(['no', 'unknown', 'yes'], [0, 1, 1])

### **Housing**

In [15]:
np.unique(infobank_df['housing'])

array(['no', 'unknown', 'yes'], dtype=object)

In [16]:
infobank_df['housing'] = infobank_df['housing'].replace(['no', 'unknown', 'yes'], [0, 0, 1])

### **Loan**

In [17]:
np.unique(infobank_df['loan'])

array(['no', 'unknown', 'yes'], dtype=object)

In [18]:
infobank_df['loan'] = infobank_df['loan'].replace(['no', 'unknown', 'yes'], [0, 1, 1])

### **Contact**

In [19]:
np.unique(infobank_df['contact'])

array(['cellular', 'telephone'], dtype=object)

In [20]:
infobank_df['contact'] = infobank_df['contact'].replace(['telephone', 'cellular'], [0, 1])

### **Agrupar meses do ano em trimestres**

In [21]:
np.unique(infobank_df['month'])

array(['apr', 'aug', 'dec', 'jul', 'jun', 'mar', 'may', 'nov', 'oct',
       'sep'], dtype=object)

In [22]:
infobank_df['month'] = infobank_df['month'].replace(['mar', 'apr', 'may'], [1, 1, 1])
infobank_df['month'] = infobank_df['month'].replace(['jun', 'jul', 'aug'], [2, 2, 2])
infobank_df['month'] = infobank_df['month'].replace(['sep', 'oct', 'nov', 'dec'], [3, 3, 3, 3])

In [23]:
infobank_df.rename(columns={'month': 'trimestre'}, inplace=True)

### **Days of week**

In [24]:
np.unique(infobank_df['day_of_week'])

array(['fri', 'mon', 'thu', 'tue', 'wed'], dtype=object)

In [25]:
infobank_df['day_of_week'] = infobank_df['day_of_week'].replace(
    ['mon', 'tue', 'wed', 'thu', 'fri'], [0, 1, 2, 3, 4]
)

### **Pdays**

In [26]:
np.unique(infobank_df['pdays'], return_counts=True)

(array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  25,  26,  27,
        999]),
 array([   15,    26,    61,   439,   118,    46,   412,    60,    18,
           64,    52,    28,    58,    36,    20,    24,    11,     8,
            7,     3,     1,     2,     3,     1,     1,     1, 39673]))

In [27]:
## Contato recente
ii = (infobank_df['pdays'] >= 0) & (infobank_df['pdays'] < 10)
infobank_df['pdays'][ii] = 0

# Contato distante
ii = (infobank_df['pdays'] >= 10) & (infobank_df['pdays'] < 20)
infobank_df['pdays'][ii] = 1

# Contato muito distante 
ii = infobank_df['pdays'] > 20 & (infobank_df['pdays'] != 999)
infobank_df['pdays'][ii] = 3

# Contato nunca existiu
ii = infobank_df['pdays'] == 999
infobank_df['pdays'][ii] = 4

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import

### **Poutcome**

In [28]:
np.unique(infobank_df['poutcome'])

array(['failure', 'nonexistent', 'success'], dtype=object)

In [29]:
infobank_df['poutcome'] = infobank_df['poutcome'].replace(['failure', 'nonexistent', 'success'], [0, 0, 1])

## **Remover ligações não completadas**

In [30]:
ii = infobank_df['duration'] == 0
print('Ligações não completadas:', ii.sum())
infobank_df = infobank_df[~ii]

Ligações não completadas: 4


## **Resposta do cliente e tempo de chamada**

In [31]:
infobank_df['y'] = infobank_df['y'].replace(['no', 'yes'], [0, 1])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## **Remover dados duplicados**

In [32]:
ii = infobank_df.duplicated()
print('Quantidade de dados duplicados:', ii.sum())
infobank_df = infobank_df[~ii]

Quantidade de dados duplicados: 18


## **Clean Dataset**

In [33]:
infobank_df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,trimestre,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,others,0,1,0,0,0,0,1,0,...,1,3,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,others,0,2,1,0,0,0,1,0,...,1,3,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,others,0,2,0,1,0,0,1,0,...,1,3,0,0,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,0,1,0,0,0,0,1,0,...,1,3,0,0,1.1,93.994,-36.4,4.857,5191.0,0
4,56,others,0,2,0,0,1,0,1,0,...,1,3,0,0,1.1,93.994,-36.4,4.857,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,0,3,0,1,0,1,3,4,...,1,3,0,0,-1.1,94.767,-50.8,1.028,4963.6,1
41184,46,others,0,3,0,0,0,1,3,4,...,1,3,0,0,-1.1,94.767,-50.8,1.028,4963.6,0
41185,56,retired,0,4,0,1,0,1,3,4,...,2,3,0,0,-1.1,94.767,-50.8,1.028,4963.6,0
41186,44,others,0,3,0,0,0,1,3,4,...,1,3,0,0,-1.1,94.767,-50.8,1.028,4963.6,1


In [34]:
infobank_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41166 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41166 non-null  int64  
 1   job             41166 non-null  object 
 2   marital         41166 non-null  int64  
 3   education       41166 non-null  int64  
 4   default         41166 non-null  int64  
 5   housing         41166 non-null  int64  
 6   loan            41166 non-null  int64  
 7   contact         41166 non-null  int64  
 8   trimestre       41166 non-null  int64  
 9   day_of_week     41166 non-null  int64  
 10  duration        41166 non-null  int64  
 11  campaign        41166 non-null  int64  
 12  pdays           41166 non-null  int64  
 13  previous        41166 non-null  int64  
 14  poutcome        41166 non-null  int64  
 15  emp.var.rate    41166 non-null  float64
 16  cons.price.idx  41166 non-null  float64
 17  cons.conf.idx   41166 non-null 

In [35]:
infobank_df.to_csv('database_clean.csv', header=True, index=False)