# Applying a Tree-Based Algorithm
We start from the CSV file produced by the data-preparation phase of the first deliverable.

In [1]:
import pandas as pd

encoding = 'iso-8859-1'    
delimiter = ';'
filename = 'bank_balanced.csv'

bank_balanced = pd.read_csv(filename, 
                   delimiter = delimiter,
                   encoding = encoding)
bank_balanced.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,campaign,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,1.0,0.0,unknown,no
1,44.0,technician,single,secondary,no,29.0,yes,no,1.0,0.0,unknown,no
2,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,1.0,0.0,unknown,no
3,47.0,blue-collar,married,unknown,no,1506.0,yes,no,1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,1.0,0.0,unknown,no


## Applying Transformations
Because scikit-learn uses an optimized version of the CART algorithm that does not support categorical variables, the categorical columns of the dataset are converted to numeric codes. First, the distinct values of each categorical column are listed.

In [2]:
print('job: ', bank_balanced['job'].unique())
print('marital: ', bank_balanced['marital'].unique())
print('education: ', bank_balanced['education'].unique())
print('default: ', bank_balanced['default'].unique())
print('housing: ', bank_balanced['housing'].unique())
print('loan: ', bank_balanced['loan'].unique())
print('poutcome: ', bank_balanced['poutcome'].unique())
print('y: ', bank_balanced['y'].unique())

job:  ['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student' 'other']
marital:  ['married' 'single' 'divorced']
education:  ['tertiary' 'secondary' 'unknown' 'primary']
default:  ['no' 'yes']
housing:  ['yes' 'no']
loan:  ['no' 'yes']
poutcome:  ['unknown' 'failure' 'other' 'success']
y:  ['no' 'yes']


## Numeric Encoding
One function per column maps each label to an integer code. A label not covered by a function falls through and is returned as `None`.

In [3]:
def bool_to_numeric(x):
    if x=='no': return 0
    if x=='yes': return 1

def job_to_numeric(x):
    if x == 'unknown': return 0
    if x == 'management': return 1
    if x == 'technician': return 2
    if x == 'entrepreneur': return 3
    if x == 'blue-collar': return 4
    if x == 'retired': return 5
    if x == 'admin.': return 6
    if x == 'services': return 7
    if x == 'self-employed': return 8
    if x == 'unemployed': return 9
    if x == 'housemaid': return 10
    if x == 'student': return 11
    if x == 'other': return 12
    
def marital_to_numeric(x):
    if x == 'married': return 0
    if x == 'single': return 1
    if x == 'divorced': return 2

def education_to_numeric(x):
    if x == 'unknown': return 0
    if x == 'primary': return 1
    if x == 'secondary': return 2
    if x == 'tertiary': return 3

def poutcome_to_numeric(x):
    if x == 'unknown': return 0
    if x == 'failure': return 1
    if x == 'success': return 2
    if x == 'other': return 3

bank_balanced['job'] = bank_balanced['job'].apply(job_to_numeric)
bank_balanced['marital'] = bank_balanced['marital'].apply(marital_to_numeric)
bank_balanced['education'] = bank_balanced['education'].apply(education_to_numeric)
bank_balanced['default'] = bank_balanced['default'].apply(bool_to_numeric)
bank_balanced['housing'] = bank_balanced['housing'].apply(bool_to_numeric)
bank_balanced['loan'] = bank_balanced['loan'].apply(bool_to_numeric)
bank_balanced['poutcome'] = bank_balanced['poutcome'].apply(poutcome_to_numeric)
bank_balanced['y'] = bank_balanced['y'].apply(bool_to_numeric)
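
The same encoding can be written more compactly with lookup dictionaries and `Series.map`. The following is a minimal sketch rather than the notebook's own code: the names `BOOL_CODES` and `MARITAL_CODES` and the small `sample` frame are illustrative assumptions, while the codes themselves match the functions above. A label missing from a dictionary becomes a missing value, so the sketch counts those explicitly.

```python
import pandas as pd

# Hypothetical lookup tables mirroring the functions above.
BOOL_CODES = {'no': 0, 'yes': 1}
MARITAL_CODES = {'married': 0, 'single': 1, 'divorced': 2}

# Small illustrative frame; the real notebook uses bank_balanced.
sample = pd.DataFrame({
    'marital': ['married', 'single', 'divorced'],
    'loan': ['no', 'yes', 'no'],
})

encoded = sample.copy()
encoded['marital'] = encoded['marital'].map(MARITAL_CODES)
encoded['loan'] = encoded['loan'].map(BOOL_CODES)

# Labels missing from a mapping become NaN, so count them explicitly.
unmapped = int(encoded.isna().sum().sum())
print(encoded)
print('unmapped values:', unmapped)
```

Keeping the codes in dictionaries also makes it straightforward to invert the mapping later when interpreting the tree.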

## Transformed File
The numerically encoded dataset is shown below:

In [4]:
bank_balanced.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,campaign,previous,poutcome,y
0,58.0,1,0,3,0,2143.0,1,0,1.0,0.0,0,0
1,44.0,2,1,2,0,29.0,1,0,1.0,0.0,0,0
2,33.0,3,0,2,0,2.0,1,1,1.0,0.0,0,0
3,47.0,4,0,0,0,1506.0,1,0,1.0,0.0,0,0
4,33.0,0,1,0,0,1.0,0,0,1.0,0.0,0,0
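
Integer codes give nominal columns such as `job` and `marital` an arbitrary order, and a decision tree splits on thresholds over that order. One common alternative, shown here only as a sketch and not something this notebook does, is one-hot encoding with `pandas.get_dummies`. The `sample` frame below is illustrative.

```python
import pandas as pd

# Illustrative rows with two nominal columns and one numeric column.
sample = pd.DataFrame({
    'age': [58.0, 44.0, 33.0],
    'job': ['management', 'technician', 'management'],
    'marital': ['married', 'single', 'married'],
})

# One indicator column per category value; numeric columns pass through unchanged.
one_hot = pd.get_dummies(sample, columns=['job', 'marital'])
print(one_hot.columns.tolist())
```

The trade-off is a wider feature matrix: `job` alone would expand into 13 indicator columns for the values listed earlier.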


## Creating the Training and Test Sets
The data is split into two sets: the first is used for training and the second for testing.

In [5]:
from sklearn.model_selection import train_test_split

bank_data = bank_balanced[['age','job','marital', 'education', 'default', 'balance', 'housing', 'loan', 'campaign', 'previous', 'poutcome']]
bank_target = bank_balanced['y']

bank_train, bank_test, y_train, y_test = train_test_split(bank_data, bank_target, random_state=0)

### Checking Sizes

In [6]:
print("Bank_train shape: {}".format(bank_train.shape))
print("Y_train shape: {}".format(y_train.shape))
print("Bank_test shape: {}".format(bank_test.shape))
print("Y_test shape: {}".format(y_test.shape))

Bank_train shape: (59850, 11)
Y_train shape: (59850,)
Bank_test shape: (19950, 11)
Y_test shape: (19950,)
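
These sizes follow from `train_test_split` holding out 25% of the rows when `test_size` is not given. A minimal sketch that reproduces the shapes with placeholder arrays (the zero-filled `X` and `y` are stand-ins with the same number of rows, 59850 + 19950 = 79800, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 59850 + 19950  # total rows implied by the shapes above
X = np.zeros((n_rows, 11))
y = np.zeros(n_rows, dtype=int)

# No test_size given, so scikit-learn uses its default of 0.25.
X_train, X_test, y_train_demo, y_test_demo = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
```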


### Cross-Validation
The complexity of a tree can be controlled through its depth. With this in mind, 10-fold cross-validation is run on the training set, trying every depth from 3 to 39 (`range(3, 40)` excludes the upper bound) to find the best value.

In [7]:
from sklearn import tree
from sklearn.model_selection import GridSearchCV

# Candidate depths 3..39 (range excludes the upper bound)
parameters = {'max_depth': range(3, 40)}
# 10-fold cross-validation over every candidate depth
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, cv=10)
clf.fit(X=bank_train, y=y_train)
tree_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)

0.8331495405179615 {'max_depth': 37}
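
The best depth found (37) lies near the upper end of the searched range, so it can help to look at how the score changes with depth rather than only at the maximum. The following is a hedged sketch on synthetic data from `make_classification`, an assumption made so the example runs on its own; the grid is deliberately smaller than the one used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for bank_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': range(1, 6)},
    cv=5,
)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate depth.
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('param_max_depth'))
```

If the mean score flattens out well before the deepest candidate, a shallower tree may give nearly the same accuracy with a simpler model.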


## Applying the Algorithm
Once the data is in the required format and the best depth is known, the model is trained and the prediction vector is generated.

In [8]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=37)
clf = clf.fit(bank_train, y_train)
y_predict = clf.predict(bank_test)
y_predict.size

19950

In [9]:
y_test.size

19950
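
Because `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), the fitted search object can also predict directly, without training a second tree. A sketch on synthetic data (the `make_classification` data and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': range(1, 6)}, cv=5)
search.fit(X_tr, y_tr)

# search.predict delegates to the refitted best estimator.
pred_from_search = search.predict(X_te)
pred_from_best = search.best_estimator_.predict(X_te)
print(np.array_equal(pred_from_search, pred_from_best))
```

Fixing `random_state` on the classifier, as done in this sketch, also makes repeated runs comparable.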

## Metrics
The confusion matrix is built from the test and prediction vectors.

In [10]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)
cm

array([[7950, 1871],
       [1349, 8780]], dtype=int64)

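The following values are derived from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. VN, FP, FN and VP stand for true negatives, false positives, false negatives and true positives: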

In [11]:
VN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]
VP = cm[1,1]

print('VN: ', VN)
print('FP: ', FP)
print('FN: ', FN)
print('VP: ', VP)
print('-------------')
print('Exactitud: ', (VN + VP) / (VN + VP + FN + FP))
print('Recall: ', VP / (FN + VP))
print('Precisión: ', VP / (FP + VP))
print('F1-Score: ', 2 * VP / (2 * VP + FN + FP))

VN:  7950
FP:  1871
FN:  1349
VP:  8780
-------------
Exactitud:  0.8385964912280702
Recall:  0.8668180471912331
Precisión:  0.8243357431227115
F1-Score:  0.8450433108758422
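
The hand-computed values (Exactitud is accuracy, Precisión is precision) can be cross-checked against the functions in `sklearn.metrics`. Since the original `y_test` and `y_predict` are not reproduced here, this sketch rebuilds label vectors with the same four counts as the confusion matrix above; the vectors are a reconstruction, not the original data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

VN, FP, FN, VP = 7950, 1871, 1349, 8780  # counts from the confusion matrix

# Rebuild label vectors that yield exactly these four counts.
y_true = np.array([0] * (VN + FP) + [1] * (FN + VP))
y_pred = np.array([0] * VN + [1] * FP + [0] * FN + [1] * VP)

print('Accuracy: ', accuracy_score(y_true, y_pred))
print('Recall:   ', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F1-score: ', f1_score(y_true, y_pred))
```

In the notebook itself, the same functions could be called directly on `y_test` and `y_predict`.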
