# Which clients are most likely to subscribe to the bank's term deposit?

## Abstract: 
The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

## Input variables:
### bank client data:
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
### related to the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular','telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
### other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
### social and economic context attributes
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)

## Output variable (desired target):
21. y - has the client subscribed to a term deposit? (binary: 'yes','no')

# Analytic plan
## Preprocessing
* One-hot encoding for categorical data
* Normalize numeric values
* **No time for feature engineering**
## ML prediction
* Classification models:
    * XGBoost
    * Logistic Regression
## Clustering
* Spectral embedding for dimensionality reduction
* Ward dendrogram with a connectivity constraint
* DBSCAN?

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import quantile_transform
from sklearn.preprocessing import OneHotEncoder




In [12]:
data_base=pd.read_csv("bank-full.csv", sep=";")
data_base.describe()

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.93621 | 1362.272058 | 15.806419 | 258.16308 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


In [13]:
y_data=np.array((data_base["y"]=="yes")*1)

In [14]:
pd.crosstab(data_base["y"], columns="count")
# pd.crosstab(data_base["y"], columns="count", normalize=True)

| y | count |
|---|---|
| no | 39922 |
| yes | 5289 |


### Question 2
Proportion of target values:

    No: 88.30%
    Yes: 11.70%

#### The classes are imbalanced.
To balance the data, "data augmentation" by random oversampling is used: the "yes" cases are resampled with replacement until they match the number of "no" cases.
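
The oversampling step described above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual balancing code; the names `toy`, `majority`, `minority` and `balanced` are assumptions introduced here.

```python
import pandas as pd

# Toy frame standing in for data_base; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [30, 41, 52, 28, 39, 60, 45, 33],
    "y": ["no", "no", "no", "no", "no", "no", "yes", "yes"],
})

majority = toy[toy["y"] == "no"]
minority = toy[toy["y"] == "yes"]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

# Concatenate and shuffle so the two classes are interleaved.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

counts = balanced["y"].value_counts()
```

If the data is later split into training and test sets, oversampling only the training part keeps duplicated "yes" rows from appearing on both sides of the split.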

In [15]:
data_base.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

### Variable types

    age           int64   num
    job          object   cat
    marital      object   cat
    education    object   cat
    default      object   binary
    balance       int64   num
    housing      object   binary
    loan         object   binary
    contact      object   cat
    day           int64   num
    month        object   cat # process as numeric
    duration      int64   num
    campaign      int64   num
    pdays         int64   num
    previous      int64   num
    poutcome     object   cat
    y            object   binary


In [16]:
months_nms=["jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"]

# Map month abbreviations to 0.0-11.0 in a single step; this avoids the chained
# assignment that triggers pandas' SettingWithCopyWarning.
month_to_num = {month: float(i) for i, month in enumerate(months_nms)}
data_base["month"] = data_base["month"].map(month_to_num)


### Preprocessing
* Split the variables by type
    * Numeric variables are normalized to their quantiles (percentile ranks)
    * Categorical variables are one-hot encoded
    * Binary variables are encoded as 1/0
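
To illustrate what quantile normalization does to a skewed column, here is a small sketch on invented values. The toy array and the choice `n_quantiles=5` are assumptions for this example only; the notebook itself calls `quantile_transform` with its defaults.

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

# Toy column with one extreme value, loosely mimicking a skewed variable such as balance.
x = np.array([[1.0], [2.0], [3.0], [10.0], [1000.0]])

# n_quantiles is set to the number of rows because the toy sample is tiny.
x_cent = quantile_transform(x, n_quantiles=5, output_distribution="uniform", copy=True)

# Each value is replaced by its approximate rank in [0, 1], so the outlier
# lands at the top of the range instead of dominating the scale.
ranks = x_cent.ravel()
```

Because only the ordering of the values matters, the result is insensitive to how extreme the largest balance is.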


In [24]:
x_numericas=data_base[["age","balance","day","month","duration","campaign","pdays","previous"]]
x_categoricas=data_base[["job","marital","education","contact","poutcome"]]
x_dicotomicas=data_base[["default","housing","loan"]]

In [32]:
x_dicotomicas=(x_dicotomicas=="yes")*1.

In [37]:
x_numericas_cent=quantile_transform(x_numericas)

In [38]:
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(x_categoricas)
x_categoricas_hot=enc.transform(x_categoricas).toarray()

In [43]:
# pd.DataFrame(data, index, columns) takes a single data block, so the three
# arrays are stacked column-wise first: 26 one-hot + 8 numeric + 3 binary columns.
x_all = pd.DataFrame(np.hstack([x_categoricas_hot, x_numericas_cent, np.array(x_dicotomicas)]))

## Predictive model
Two predictive models are proposed:
* The first is a logistic regression, used as the baseline
* The second is a boosted ensemble of decision trees (Gradient Boosting)

Tree-based ensembles have proven to be good algorithms for both regression and classification, and are generally robust to overfitting and to problems caused by sample bias.
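
A minimal sketch of how the two models could be compared follows. It rests on several assumptions: the data is synthetic (generated with `make_classification` at roughly the same class imbalance), scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and ROC AUC is one possible metric, not one the notebook has chosen.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed bank features (about 12% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.88, 0.12], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),  # baseline
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and score the predicted probability of the positive class.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
```

Scoring on predicted probabilities rather than hard labels matches the goal of ranking clients by how likely they are to subscribe.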



## Clustering model
One clustering model is proposed:
a dendrogram (hierarchical clustering) with Ward linkage and a spatial (connectivity) constraint, built on a 12-dimensional representation.
Spectral embedding will be the dimensionality reduction strategy.
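
The clustering plan could be sketched as below. Everything specific here is an assumption made for illustration: the data is synthetic (`make_blobs`), the connectivity constraint is taken to be a k-nearest-neighbors graph, and the values `n_neighbors=15` and `n_clusters=4` are arbitrary. Only the 12 dimensions, Ward linkage and spectral embedding come from the text.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.manifold import SpectralEmbedding
from sklearn.neighbors import kneighbors_graph

# Small synthetic stand-in for the preprocessed feature matrix.
X, _ = make_blobs(
    n_samples=300, n_features=20, centers=4,
    cluster_std=1.0, center_box=(-2.0, 2.0), random_state=0,
)

# Reduce to a 12-dimensional representation with spectral embedding.
embedding = SpectralEmbedding(n_components=12, n_neighbors=15, random_state=0)
X_embedded = embedding.fit_transform(X)

# Connectivity constraint: only neighboring points in the embedding may be merged.
connectivity = kneighbors_graph(X_embedded, n_neighbors=15, include_self=False)

# Ward-linkage hierarchical clustering restricted by the connectivity graph.
ward = AgglomerativeClustering(n_clusters=4, linkage="ward", connectivity=connectivity)
labels = ward.fit_predict(X_embedded)
```

The merge tree behind the dendrogram is available in `ward.children_`. Spectral embedding builds an affinity graph over all samples, so on the full 45,211 rows it may be worth trying on a subsample first.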