# Exploring Bank Marketing Data Set
[link data: archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)

<div style="text-align: right"> Clemetne José </div>


* [Introduction](#Introduction).

* [Column description](#Column-description).

* [Loading data](#Import-data).




# Introduction

In this analysis we examine whether a bank customer will place their savings in a term deposit offered by the bank. We draw on historical information supplied by the bank itself to predict which customers would subscribe to the deposit, based on the measured attributes.

The data relate to direct marketing campaigns (call center) of a Portuguese banking institution.
The campaigns were based on phone calls. Often, more than one contact with the same client was required to determine whether the product (bank term deposit) would be ('yes') or would not be ('no') subscribed.
The records are ordered by date (from May 2008 to November 2010).



# Column description


### _bank client data:_

1. **age** : (numeric)
2. **job** : type of job (categorical)
        'admin.',
        'blue-collar',
        'entrepreneur',
        'housemaid',
        'management',
        'retired',
        'self-employed',
        'services',
        'student',
        'technician',
        'unemployed',
        'unknown'
3. **marital** : marital status (categorical)
        'divorced' (note: 'divorced' means divorced or widowed),
        'married',
        'single',
        'unknown'
4. **education** : (categorical)
        'basic.4y',
        'basic.6y',
        'basic.9y',
        'high.school',
        'illiterate',
        'professional.course',
        'university.degree',
        'unknown'
5. **default** : has credit in default? (categorical)
        'no',
        'yes',
        'unknown'
6. **housing** : has housing loan? (categorical)
        'no',
        'yes',
        'unknown'
7. **loan** : has personal loan? (categorical)
        'no',
        'yes',
        'unknown'
    
### _related with the last contact of the current campaign:_

8. **contact** : contact communication type (categorical)
        'cellular',
        'telephone'
9. **month** : last contact month of year (categorical)
        'jan',
        'feb',
        'mar',
        ...,
        'nov',
        'dec'
10. **day_of_week** : last contact day of the week (categorical)
        'mon',
        'tue',
        'wed',
        'thu',
        'fri'
11. **duration** : last contact duration, in seconds (numeric)

    **_Important note_**: _this attribute highly affects the output target (e.g., if duration=0 then y='no').
    Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known.
    Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to
    have a realistic predictive model._
    
### _other attributes:_

12. **campaign** : number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. **pdays** : number of days that passed by after the client was last contacted from a previous campaign (numeric) **_Note_**: _999 means client was not previously contacted_
14. **previous** : number of contacts performed before this campaign and for this client (numeric)
15. **poutcome** : outcome of the previous marketing campaign (categorical)
        'failure',
        'nonexistent',
        'success'
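
The `999` sentinel in **pdays** is easy to misread as a real number of days. A minimal sketch of one way to handle it, assuming a toy frame with made-up values (the names `toy`, `previously_contacted`, and `pdays_clean` are illustrative and not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag whether the client was contacted in a previous campaign
toy["previously_contacted"] = toy["pdays"] != 999

# Mask the sentinel so it becomes a missing value instead of a huge day count
toy["pdays_clean"] = toy["pdays"].mask(toy["pdays"] == 999)

print(toy)
```

With the sentinel masked, summary statistics such as the mean only reflect clients who were actually contacted before.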
    
### _social and economic context attributes_

16. **emp.var.rate** : employment variation rate - quarterly indicator (numeric)
17. **cons.price.idx** : consumer price index - monthly indicator (numeric)
18. **cons.conf.idx** : consumer confidence index - monthly indicator (numeric)
19. **euribor3m** : euribor 3 month rate - daily indicator (numeric)
20. **nr.employed** : number of employees - quarterly indicator (numeric)

### _Output variable (desired target):_

21. **y** : has the client subscribed to a term deposit? (binary)
        'yes',
        'no'

# Libraries

In [1]:
# Data manipulation
import os
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

In [2]:
# Preprocessing
from sklearn.preprocessing import OneHotEncoder
# Modeling
from sklearn.model_selection import train_test_split


# Import data


In [3]:
path = '../Data'

In [4]:
df = pd.read_csv(
    filepath_or_buffer = os.path.join(path,'bank-additional.csv'),
    sep = ';')

# Exploratory Data Analysis

In [5]:
df.sample(10).style

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
3373,46,technician,single,professional.course,no,no,no,cellular,aug,thu,128,1,999,0,nonexistent,1.4,93.444,-36.1,4.968,5228.1,no
2738,31,admin.,married,university.degree,no,yes,no,cellular,nov,thu,210,2,999,0,nonexistent,-0.1,93.2,-42.0,4.076,5195.8,no
3297,40,entrepreneur,married,basic.4y,no,yes,yes,cellular,nov,wed,322,3,999,0,nonexistent,-0.1,93.2,-42.0,4.12,5195.8,no
1025,47,admin.,divorced,university.degree,no,yes,no,telephone,may,mon,208,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2973,39,blue-collar,married,basic.6y,no,no,no,telephone,may,wed,488,12,999,0,nonexistent,1.1,93.994,-36.4,4.859,5191.0,no
1690,33,blue-collar,married,basic.6y,unknown,no,no,cellular,may,wed,139,2,999,0,nonexistent,-1.8,92.893,-46.2,1.281,5099.1,no
3635,32,admin.,single,university.degree,no,yes,no,cellular,aug,tue,460,2,999,0,nonexistent,1.4,93.444,-36.1,4.966,5228.1,no
1005,38,management,married,basic.9y,unknown,no,no,telephone,may,fri,772,1,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,yes
980,29,technician,married,high.school,unknown,no,no,cellular,jul,thu,321,1,999,0,nonexistent,1.4,93.918,-42.7,4.958,5228.1,no
2125,45,technician,married,professional.course,no,no,no,telephone,jul,wed,20,5,999,0,nonexistent,1.4,93.918,-42.7,4.962,5228.1,no


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4119 entries, 0 to 4118
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4119 non-null   int64  
 1   job             4119 non-null   object 
 2   marital         4119 non-null   object 
 3   education       4119 non-null   object 
 4   default         4119 non-null   object 
 5   housing         4119 non-null   object 
 6   loan            4119 non-null   object 
 7   contact         4119 non-null   object 
 8   month           4119 non-null   object 
 9   day_of_week     4119 non-null   object 
 10  duration        4119 non-null   int64  
 11  campaign        4119 non-null   int64  
 12  pdays           4119 non-null   int64  
 13  previous        4119 non-null   int64  
 14  poutcome        4119 non-null   object 
 15  emp.var.rate    4119 non-null   float64
 16  cons.price.idx  4119 non-null   float64
 17  cons.conf.idx   4119 non-null   f
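
The non-null counts reported by `df.info()` do not reveal missing information in this dataset, because the column description lists `'unknown'` as an ordinary category level. A small sketch for counting those labels, assuming a toy frame with made-up rows (`toy` and `unknown_counts` are illustrative names):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services"],
    "default": ["no", "unknown", "unknown"],
    "age": [30, 41, 25],
})

# Keep only the non-numeric (categorical) columns
categorical = toy.select_dtypes(exclude="number")

# Count how many times the label 'unknown' appears in each of them
unknown_counts = (categorical == "unknown").sum()
print(unknown_counts)
```

The same two lines could be applied to `df` to see which attributes carry the most `'unknown'` entries.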

# Prediction considerations

Following the dataset documentation, we drop the "_duration_" attribute, which refers to the duration of the phone call. Since we have no data about the salesperson, we cannot infer how much the quality of the sales pitch affects indicators such as call time.

In [7]:
df = df.loc[:,df.columns!="duration"]

Since our analysis focuses on characterizing the customer rather than the type of marketing campaign carried out, we are also not interested in the following attributes:
* contact
* month
* day_of_week
* pdays

In [69]:
# `columns=` already selects the axis, so `axis=1` is redundant
df = df.drop(columns=['contact', 'month', 'day_of_week', 'pdays'])

We might also ask how prior contact with the bank affects the outcome: is it positive or negative?

We can examine the campaign-related attributes and see how they relate to the campaign's success.
+ campaign : number of contacts performed during this campaign and for this client.
+ pdays : number of days elapsed since the client was last contacted in a previous campaign.
+ previous : number of contacts performed before this campaign and for this client.
+ poutcome : outcome of the previous marketing campaign.
+ contact : contact communication type (cellular, telephone).
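
One way to look at how a campaign attribute relates to the outcome is a row-normalized contingency table. A minimal sketch, assuming a toy frame with made-up rows (`toy` and `rates` are illustrative names; on the real data this would have to run before the campaign columns are dropped):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "poutcome": ["success", "failure", "nonexistent",
                 "success", "nonexistent", "failure"],
    "y": ["yes", "no", "no", "yes", "yes", "no"],
})

# Share of 'yes' and 'no' within each previous-campaign outcome (rows sum to 1)
rates = pd.crosstab(toy["poutcome"], toy["y"], normalize="index")
print(rates)
```

The same call with `campaign` or `previous` in place of `poutcome` would give a comparable view for the numeric attributes.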

Since there is no record of the salespeople, these data could be very noisy or biased. We expect the outcome of a contact to depend heavily on who made it.

Since we aim to characterize the customer portfolio and lack the complete information needed to relate it to the campaign actions, as a first step we remove the attributes above.

In [12]:
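# 'pdays' and 'contact' may already have been dropped in an earlier cell;
# errors='ignore' skips labels that are no longer present
df = df.drop(columns=['campaign', 'pdays', 'previous', 'poutcome', 'contact'], errors='ignore')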

In [13]:
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,-0.1,93.200,-42.0,4.191,5195.8,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4114,30,admin.,married,basic.6y,no,yes,yes,cellular,jul,thu,1.4,93.918,-42.7,4.958,5228.1,no
4115,39,admin.,married,high.school,no,yes,no,telephone,jul,fri,1.4,93.918,-42.7,4.959,5228.1,no
4116,27,student,single,high.school,no,no,no,cellular,may,mon,-1.8,92.893,-46.2,1.354,5099.1,no
4117,58,admin.,married,high.school,no,no,no,cellular,aug,fri,1.4,93.444,-36.1,4.966,5228.1,no


In [None]:
# Separate the descriptive features from the output
X_data=df.drop('y', axis=1)
y_data=df['y']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3)
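
The split above is random on every run and does not control the class proportions. A hedged sketch of a reproducible, stratified variant, assuming a toy frame with made-up values (`toy`, `X_toy`, `y_toy`, and `random_state=0` are illustrative choices, not the notebook's settings):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced target (values are illustrative only)
toy = pd.DataFrame({
    "age": list(range(20, 40)),
    "y": ["yes"] * 4 + ["no"] * 16,
})
X_toy = toy.drop(columns="y")
y_toy = toy["y"]

# stratify keeps the yes/no ratio similar in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Stratifying tends to matter when one class is rare, since a purely random split can leave very few positive cases in the test set.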

