# Topic 7 - SVM Exercise

The dataset in the file Bank.csv (inside Bank.zip) holds information on
4,521 customers who were offered a term deposit by a Portuguese bank.
The zip also contains a text file, Bank-names.txt, that describes every
variable in detail.
Using this dataset, build a support vector machine (SVM) classifier that
predicts, from the customer's characteristics, whether the customer will
take out the deposit.

When working through this exercise, keep in mind that:

- The target variable of our model is “y”, which takes the value
“yes” if the customer took out the deposit and “no” otherwise.

- Note that there are several categorical variables that you will have to
transform before fitting the model.

- Do not forget to normalize the data before feeding them into the model.
  



Import the dependencies

In [1]:
import pandas as pd # to load and manipulate data and for One-Hot Encoding
import numpy as np # to calculate the mean and standard deviation
import matplotlib.pyplot as plt 
import matplotlib.colors as colors

from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import scale  
from sklearn.model_selection import GridSearchCV # this will do cross validation (grid search cross validation)

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn.svm import SVC  # SVM for classification

# Extra?
from sklearn.decomposition import PCA # to perform PCA to plot the data

## Step 1: Import the data

In [5]:
## import data
bank_raw = pd.read_csv(r"./Bank/bank.csv",sep=';')


## Step 2: Explore and process the data

In [9]:
# explore and prepare data
bank_raw.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [10]:
bank_raw.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,15.915284,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,8.247667,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,1.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,9.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,16.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,21.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,31.0,3025.0,50.0,871.0,25.0


In [11]:
# To show the categorical variables
bank_raw.describe(include='object')

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,poutcome,y
count,4521,4521,4521,4521,4521,4521,4521,4521,4521,4521
unique,12,3,4,2,2,2,3,12,4,2
top,management,married,secondary,no,yes,no,cellular,may,unknown,no
freq,969,2797,2306,4445,2559,3830,2896,1398,3705,4000
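
The `freq` row above shows that "no" is the most frequent value of `y`, with 4,000 of the 4,521 rows. As a small worked calculation (using only those two counts from the output), this is how unbalanced the target is and what accuracy a trivial "always no" classifier would reach:

```python
# Counts taken from the describe(include='object') output above.
n_total = 4521
n_no = 4000
n_yes = n_total - n_no

# Share of the majority class = accuracy of a model that always predicts "no".
majority_share = n_no / n_total

print(f"yes: {n_yes}, no: {n_no}, majority share: {majority_share:.3f}")
```

Any accuracy figure reported for the SVM is easier to interpret when compared against this baseline.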


In [19]:
# Check whether the categorical variables contain unexpected values or missing-value markers
print(bank_raw['job'].unique())
print(bank_raw['marital'].unique())
print(bank_raw['education'].unique())
print(bank_raw['default'].unique())
print(bank_raw['housing'].unique())
print(bank_raw['loan'].unique())
print(bank_raw['contact'].unique())
print(bank_raw['month'].unique())
print(bank_raw['poutcome'].unique())
print(bank_raw['y'].unique())

['unemployed' 'services' 'management' 'blue-collar' 'self-employed'
 'technician' 'entrepreneur' 'admin.' 'student' 'housemaid' 'retired'
 'unknown']
['married' 'single' 'divorced']
['primary' 'secondary' 'tertiary' 'unknown']
['no' 'yes']
['no' 'yes']
['no' 'yes']
['cellular' 'unknown' 'telephone']
['oct' 'may' 'apr' 'jun' 'feb' 'aug' 'jan' 'jul' 'nov' 'sep' 'mar' 'dec']
['unknown' 'failure' 'other' 'success']
['no' 'yes']


At first glance there are no problems.
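
The same check can be done numerically instead of by reading the printed lists. The sketch below uses a small made-up frame (`toy` and its values are hypothetical stand-ins for `bank_raw`) and counts both real missing values (NaN) and the literal label "unknown", which appears in several columns of the output above and will simply become its own category when encoded:

```python
import pandas as pd

# Hypothetical toy frame standing in for bank_raw.
toy = pd.DataFrame({
    "job": ["services", "unknown", "management", "unknown"],
    "marital": ["married", "single", "married", "divorced"],
    "age": [30, 33, 35, 59],
})

# Real missing values (NaN) per column.
nan_counts = toy.isna().sum()

# Occurrences of the literal "unknown" label in the categorical columns.
categorical_cols = ["job", "marital"]
unknown_counts = (toy[categorical_cols] == "unknown").sum()

print(nan_counts)
print(unknown_counts)
```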

Next, we separate the dependent variable from the rest:

In [20]:
X=bank_raw.drop('y',axis=1).copy()
X.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown


In [21]:
y = bank_raw['y'].copy()
y.head()

0    no
1    no
2    no
3    no
4    no
Name: y, dtype: object

The SVM does not accept categorical variables, so we convert them to numeric variables using the technique known as "**one-hot encoding**".
(The month variable will instead be encoded with values from 1 to 12, to avoid adding 12 columns.)

In [23]:
X_encoded=pd.get_dummies(X, columns=['job','marital','education','default','housing','loan','contact','poutcome'], dtype='int')
X_encoded.head()

Unnamed: 0,age,balance,day,month,duration,campaign,pdays,previous,job_admin.,job_blue-collar,...,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,oct,79,1,-1,0,0,0,...,0,1,0,1,0,0,0,0,0,1
1,33,4789,11,may,220,1,339,4,0,0,...,1,0,1,1,0,0,1,0,0,0
2,35,1350,16,apr,185,1,330,1,0,0,...,1,1,0,1,0,0,1,0,0,0
3,30,1476,3,jun,199,4,-1,0,0,0,...,1,0,1,0,0,1,0,0,0,1
4,59,0,5,may,226,1,-1,0,0,1,...,1,1,0,0,0,1,0,0,0,1
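
To make the effect of `pd.get_dummies` easier to see, here is a minimal sketch on a three-row frame (`toy` is a hypothetical example, not part of the dataset). It also shows the optional `drop_first=True` argument, which drops one level per variable; for yes/no variables such as `loan`, the two columns `loan_no` and `loan_yes` carry the same information, so one of them is redundant. The notebook keeps all levels, and this is only an illustration of the alternative.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "single", "married"],
})

# One column per category level, as in the cell above.
encoded = pd.get_dummies(toy, columns=["marital"], dtype="int")

# Same encoding, but dropping the first level of each encoded variable.
encoded_drop = pd.get_dummies(toy, columns=["marital"], dtype="int", drop_first=True)

print(encoded)
print(encoded_drop)
```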


Now we convert the months to numbers ("jan" -> 1, "feb" -> 2, ...)

In [28]:
#encoding month (name to number)
from datetime import datetime
X_encoded['month'] = X_encoded['month'].apply(lambda m : datetime.strptime(m, '%b').month)
#X_encoded['month']
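
`datetime.strptime` with `'%b'` interprets abbreviated month names according to the active locale, so the cell above assumes an English locale. A locale-independent alternative, sketched here on a small hypothetical Series (`months`, `MONTH_TO_NUMBER` are illustrative names), is an explicit dictionary lookup:

```python
import pandas as pd

# Explicit mapping from the lowercase abbreviations used in the dataset.
MONTH_TO_NUMBER = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Hypothetical sample with the months of the first five rows.
months = pd.Series(["oct", "may", "apr", "jun", "may"])

# Any abbreviation missing from the dictionary would become NaN.
month_numbers = months.map(MONTH_TO_NUMBER)

print(month_numbers.tolist())
```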

In [30]:
X_encoded.head()

Unnamed: 0,age,balance,day,month,duration,campaign,pdays,previous,job_admin.,job_blue-collar,...,housing_yes,loan_no,loan_yes,contact_cellular,contact_telephone,contact_unknown,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,19,10,79,1,-1,0,0,0,...,0,1,0,1,0,0,0,0,0,1
1,33,4789,11,5,220,1,339,4,0,0,...,1,0,1,1,0,0,1,0,0,0
2,35,1350,16,4,185,1,330,1,0,0,...,1,1,0,1,0,0,1,0,0,0
3,30,1476,3,6,199,4,-1,0,0,0,...,1,0,1,0,0,1,0,0,0,1
4,59,0,5,5,226,1,-1,0,0,1,...,1,1,0,0,0,1,0,0,0,1


All that remains is to normalize the numeric variables.

First, though, we split the data into training and test sets. Scaling after the split avoids "data leakage", that is, information from the test set influencing how the training data are prepared.

In [43]:
%%time
X_train, X_test, y_train, y_test=train_test_split(X_encoded, y, test_size=0.25, random_state=99)

X_train_scaled = scale(X_train)
X_test_scaled = scale(X_test)

CPU times: user 12.1 ms, sys: 3.83 ms, total: 15.9 ms
Wall time: 14.7 ms
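
Note that `scale` standardizes each array with that array's own mean and standard deviation, so the cell above scales the training and test sets with different statistics. A common alternative is to fit a `StandardScaler` on the training set only and reuse it for the test set. The sketch below illustrates this on synthetic data (`X_demo`, `y_demo` and the other names are hypothetical, not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and the target.
rng = np.random.default_rng(99)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=99
)

# Learn the mean and standard deviation from the training data only...
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# ...and apply those same statistics to the test data.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # near 0, but not exactly 0
```

With this approach the test set is transformed exactly as new, unseen data would be at prediction time.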


In [40]:
print(X_train_scaled.shape)
print(X_test_scaled.shape)
print(y_train.shape)
print(y_test.shape)

(3390, 40)
(1131, 40)
(3390,)
(1131,)
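
Because only about 11.5% of the customers answered "yes" (521 of 4,521), a purely random split can leave the two sets with slightly different class proportions. `train_test_split` accepts a `stratify` argument that keeps the proportions aligned. The following is a sketch on synthetic data (`X_demo` and `y_demo` are hypothetical), not a change to the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 10% positive class.
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.array([1] * 20 + [0] * 180)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=99, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```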


## Step 3: Train the model

In [44]:
%%time
#create classifier SVM
clf_svm=SVC(random_state=99)

CPU times: user 47 μs, sys: 8 μs, total: 55 μs
Wall time: 63.4 μs


In [45]:
%%time

clf_svm.fit(X_train_scaled, y_train)

CPU times: user 154 ms, sys: 0 ns, total: 154 ms
Wall time: 154 ms
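
The classifier above is created with no arguments other than `random_state`, so it uses scikit-learn's default hyperparameters. As a quick sketch, they can be inspected with `get_params()` (the variable name `clf` is only illustrative):

```python
from sklearn.svm import SVC

# Same construction as in the notebook: defaults plus a fixed random_state.
clf = SVC(random_state=99)

# get_params() returns every constructor argument with its current value.
params = clf.get_params()
for name in ("C", "kernel", "gamma"):
    print(name, "=", params[name])
```

These three values (regularization strength, kernel type, and kernel coefficient) are the ones usually tuned when searching for a better model with `GridSearchCV`.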


## Step 4: Evaluate the model

We start with a simple model that uses the default parameters.