
## Business Description

Dataset link: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

**Abstract: The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).**

Article:

https://medium.com/@mauridurcak/bussines-analitycs-and-logistic-regression-for-bank-marketing-dataset-1f20cb829c85


Attribute Information:

Input variables:
# bank client data:
1 - **age** (numeric)
2 - **job**: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - **marital** : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - **education** (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - **default**: has credit in default? (categorical: 'no','yes','unknown')

6 - **housing**: has housing loan? (categorical: 'no','yes','unknown')

7 - **loan**: has personal loan? (categorical: 'no','yes','unknown')

# related with the last contact of the current campaign:
8 - **contact**: contact communication type (categorical: 'cellular','telephone')

9 - **month**: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - **day_of_week**: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - **duration**: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). 
Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

# other attributes:
12 - **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - **pdays**: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - **previous**: number of contacts performed before this campaign and for this client (numeric)

15 - **poutcome**: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

# social and economic context attributes
16 - **emp.var.rate**: employment variation rate - quarterly indicator (numeric)

17 - **cons.price.idx**: consumer price index - monthly indicator (numeric)

18 - **cons.conf.idx**: consumer confidence index - monthly indicator (numeric)

19 - **euribor3m**: euribor 3 month rate - daily indicator (numeric)

20 - **nr.employed**: number of employees - quarterly indicator (numeric)
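
The note on **duration** above says it should only be kept for benchmark purposes. A minimal sketch of that split, assuming a toy frame whose column names follow the attribute list (the variable names `toy`, `X_benchmark` and `X_realistic` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [79, 220, 0],
    "campaign": [1, 2, 1],
    "y": ["no", "yes", "no"],
})

# Benchmark feature set: keeps `duration`, which is only known after the call.
X_benchmark = toy.drop(columns=["y"])

# Realistic feature set: `duration` is removed along with the target.
X_realistic = toy.drop(columns=["y", "duration"])

print(list(X_realistic.columns))
```

Keeping both feature sets side by side makes it easy to compare a benchmark model with one that could actually be used before a call is made.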

# 1.0 Imports

In [1]:
import pandas                 as pd
import numpy                  as np
import seaborn                as sns

from IPython.core.display     import HTML
from IPython.display          import Image
import matplotlib.pyplot      as plt
from scipy                    import stats



## Helper Functions

In [2]:
# adjust jupyter notebook viz
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    sns.set()
    
jupyter_settings()




def crammer_v(x, y):
    # bias-corrected Cramér's V between two categorical series
    cm = pd.crosstab(x, y).to_numpy()
    # n = total number of observations
    n = cm.sum()
    # r, k = number of rows and columns, respectively
    r, k = cm.shape

    # bias correction: the correction term is subtracted from phi^2 = chi2 / n
    chi2 = stats.chi2_contingency(cm)[0]
    phi2corr = max(0, chi2/n - (k-1)*(r-1)/(n-1))

    kcorr = k - (k-1)**2/(n-1)
    rcorr = r - (r-1)**2/(n-1)

    return np.sqrt(phi2corr / min(kcorr-1, rcorr-1))


Populating the interactive namespace from numpy and matplotlib
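
As a sanity check for the association helper above, the sketch below evaluates a bias-corrected Cramér's V on two toy cases: independent variables (expected near 0) and identical variables (expected near 1). It assumes the correction is applied to phi² = chi²/n; the function name `cramers_v_corrected` and the toy series are hypothetical and only serve this illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V (hypothetical helper name, for illustration)."""
    cm = pd.crosstab(x, y).to_numpy()
    n = cm.sum()
    r, k = cm.shape
    phi2 = stats.chi2_contingency(cm)[0] / n
    # Subtract the bias term from phi^2 and clip at zero.
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    kcorr = k - (k - 1) ** 2 / (n - 1)
    rcorr = r - (r - 1) ** 2 / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Toy data: every (x, y) pair occurs equally often, so the variables are independent.
pairs = [(a, b) for a in "abc" for b in "uvw"] * 10
x_ind = pd.Series([p[0] for p in pairs])
y_ind = pd.Series([p[1] for p in pairs])

# Toy data: y is a copy of x, so the association is perfect.
x_dep = pd.Series(list("abc") * 30)
y_dep = x_dep.copy()

v_independent = cramers_v_corrected(x_ind, y_ind)
v_dependent = cramers_v_corrected(x_dep, y_dep)
print(v_independent, v_dependent)
```

Three categories per variable are used on purpose: for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, which would pull the perfect-association value slightly below 1.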


# 2.0 Import And Summarize data

In [3]:
data = pd.read_csv( 'bank.csv', sep = ';')

In [5]:
# separate numeric and categorical variables

num_atributes = data.select_dtypes(include = ['int64'])
cat_atributes = data.select_dtypes(exclude = ['int64', 'float64'])


In [4]:
data.head()

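| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |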


In [16]:
## Summary statistics
# Central tendency - mean, median (applied to every numeric column)
central_tendecy = pd.DataFrame(num_atributes.apply(np.mean)).T
central_tendecy2 = pd.DataFrame(num_atributes.apply(np.median)).T

# Dispersion - std, min, max, range, skew, kurtosis

dispersion_std = pd.DataFrame(num_atributes.apply(np.std)).T
dispersion_min = pd.DataFrame(num_atributes.apply(min)).T
dispersion_max = pd.DataFrame(num_atributes.apply(max)).T
dispersion_range = pd.DataFrame(num_atributes.apply(lambda x: x.max() - x.min())).T
dispersion_skew = pd.DataFrame(num_atributes.apply(lambda x: x.skew())).T
dispersion_kurtosis = pd.DataFrame(num_atributes.apply(lambda x: x.kurtosis())).T

metrics = pd.concat([dispersion_min,dispersion_max,dispersion_range,central_tendecy, central_tendecy2,dispersion_std, dispersion_skew,dispersion_kurtosis]).T.reset_index()
metrics.columns = ['atributes','min','max','range','mean','median','std','skew','kurtosis']

In [14]:
metrics

| | atributes | min | max | range | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | age | 19.0 | 87.0 | 68.0 | 41.170095 | 39.0 | 10.575041 | 0.699501 | 0.348775 |
| 1 | balance | -3313.0 | 71188.0 | 74501.0 | 1422.657819 | 444.0 | 3009.305273 | 6.596431 | 88.390332 |
| 2 | day | 1.0 | 31.0 | 30.0 | 15.915284 | 16.0 | 8.246755 | 0.094627 | -1.039531 |
| 3 | duration | 4.0 | 3025.0 | 3021.0 | 263.961292 | 185.0 | 259.827892 | 2.77242 | 12.53005 |
| 4 | campaign | 1.0 | 50.0 | 49.0 | 2.79363 | 2.0 | 3.109463 | 4.743914 | 37.16892 |
| 5 | pdays | -1.0 | 871.0 | 872.0 | 39.766645 | -1.0 | 100.110051 | 2.717071 | 7.957128 |
| 6 | previous | 0.0 | 25.0 | 25.0 | 0.542579 | 0.0 | 1.693375 | 5.875259 | 51.995212 |
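
The same kind of summary can be assembled more compactly with `DataFrame.agg`. This is only a sketch on a toy frame (the values are the `age` and `balance` columns from the five rows shown by `data.head()`); the names `toy` and `summary` are illustrative. Note that pandas' `"std"` is the sample standard deviation (`ddof=1`), so it will not match a population standard deviation exactly.

```python
import pandas as pd

# Toy frame built from the first five rows shown earlier.
toy = pd.DataFrame({
    "age": [30, 33, 35, 30, 59],
    "balance": [1787, 4789, 1350, 1476, 0],
})

# One row per attribute, one column per statistic.
summary = toy.agg(["min", "max", "mean", "median", "std", "skew", "kurt"]).T

# Range is derived from the min and max columns.
summary["range"] = summary["max"] - summary["min"]

print(summary)
```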


In [42]:
# Frequency table of y
y_tab = pd.crosstab(index=data["y"],columns="count")   

# summary statistics grouped by y (numeric columns only)

metric_y = data.groupby('y').mean(numeric_only=True)

# Number of clients with a negative balance
negative = data[data['balance'] < 0].count().unique()

# Number of clients with pdays equal to -1
p_days_negative = data[data['pdays'] == -1].count().unique()

metric_y

| y | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| no | 40.998 | 1403.21175 | 15.94875 | 226.3475 | 2.86225 | 36.006 | 0.47125 |
| yes | 42.491363 | 1571.955854 | 15.658349 | 552.742802 | 2.266795 | 68.639155 | 1.090211 |
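
The quantities computed in the cell above (target frequencies, per-class means, and counts of special values) can also be obtained as plain scalars and proportions. A minimal sketch on toy rows, with made-up values and illustrative variable names:

```python
import pandas as pd

# Toy rows (made up) with the same column names as the real data.
toy = pd.DataFrame({
    "y": ["no", "no", "no", "yes"],
    "duration": [79, 220, 185, 600],
    "balance": [1787, -50, 1350, 0],
    "pdays": [-1, 339, -1, -1],
})

# Share of each class of the target.
y_share = toy["y"].value_counts(normalize=True)

# Mean of the numeric columns per class.
mean_by_y = toy.groupby("y").mean(numeric_only=True)

# Counts as plain integers: summing a boolean mask counts the True values.
n_negative_balance = int((toy["balance"] < 0).sum())
n_pdays_minus_one = int((toy["pdays"] == -1).sum())

print(y_share, mean_by_y, n_negative_balance, n_pdays_minus_one, sep="\n")
```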


# Summary Statistics of the Data

1. The minimum age of the participants is 19 years and the maximum is 87 years.
2. The maximum balance is 71,188 USD and the minimum is -3,313 USD (debt).
3. The minimum call duration is 4 s and the maximum is 3,025 s.
4. Number of contacts made to a client during the campaign: maximum 50, minimum 1.
5. Number of days since the last contact (pdays): minimum -1 (which could be replaced by 0), maximum 871 days.
6. Number of contacts made to the client before the campaign: minimum 0, maximum 25.
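
Item 5 suggests replacing the `pdays` value of -1 with 0. One possible sketch of that cleaning step is shown below; it also keeps a separate flag, since after the replacement a 0 no longer distinguishes a sentinel from a real value. The column names `previously_contacted` and `pdays_clean` are assumptions made for this example.

```python
import pandas as pd

# pdays values taken from the five rows shown by data.head().
toy = pd.DataFrame({"pdays": [-1, 339, 330, -1, -1]})

# Flag rows where pdays holds a real number of days rather than the -1 sentinel.
toy["previously_contacted"] = toy["pdays"] != -1

# Replace the sentinel with 0, leaving real values untouched.
toy["pdays_clean"] = toy["pdays"].replace(-1, 0)

print(toy)
```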



# Formulating Hypotheses



1. Which profiles engage most with the campaigns?
2. Student clients do not subscribe to campaigns more than other profiles.
3. The higher the education level, the greater the engagement with the campaign.
4. Clients who are contacted more often engage more with the campaigns than those contacted less often.
5. Clients who are not single subscribe to campaigns more than single clients.
6. Clients with a negative balance subscribe less to the campaigns.
7. Clients with personal loans subscribe more to the campaigns.
8. Clients with a telephone subscribe more to the campaigns.
9. Clients in default subscribe less to the campaigns.
10. Clients with a housing loan subscribe more to the campaigns.

In [None]:
# Plot the counts of the categorical variables.

plt.subplot(3,2,1)
sns.countplot(x = 'month', data = data)

plt.subplot(3,2,2)
sns.countplot(x = 'contact', data = data)

plt.subplot(3,2,3)
sns.countplot(x = 'marital', data = data)

plt.subplot(3,2,4)
sns.countplot(x = 'education', data = data)
# plt.xticks( rotation = 90);



### Hypotheses

May is the month with the highest number of calls.
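
The observation about the busiest month can be confirmed numerically instead of being read off the bar chart. A small sketch on made-up month values (the names `toy_months`, `calls_per_month` and `busiest_month` are illustrative):

```python
import pandas as pd

# Made-up contact months standing in for data['month'].
toy_months = pd.Series(["may", "may", "jun", "oct", "may", "apr"], name="month")

# Number of contacts per month, and the month with the largest count.
calls_per_month = toy_months.value_counts()
busiest_month = calls_per_month.idxmax()

print(busiest_month, calls_per_month[busiest_month])
```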

## numerical variables

In [None]:
num_atributes.hist(bins = 25);


In [None]:
fig, ax = plt.subplots()

sns.countplot(x = 'job', data = data, ax = ax)
ax.set_title('Job Count Distribution', fontsize=15)
ax.tick_params(labelsize=15)