# Decision Trees

In this notebook we walk through a simple example of how to train a decision tree using the Titanic dataset.

In [26]:
import pandas as pd
# Load the data and drop the identifier-like columns we will not use as features
data = pd.read_csv("titanic.csv").drop(["PassengerId", "Ticket", "Name"], axis=1)

In [27]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S


## Categorical variables

As the preview shows, several variables are categorical: sex, cabin, and port of embarkation. scikit-learn's decision trees only accept numeric input, so we create one column per category, set to 1 if the row belongs to that category and 0 otherwise. pandas provides the "get_dummies" function for this.
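
To make the encoding concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not part of the Titanic file). It passes `dtype=int` so the result is 0/1 regardless of the pandas version, since recent versions return booleans by default. Note that a missing category produces a row of zeros rather than a null.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and two categorical ones
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],  # the last passenger has no known port
})

# Numeric columns pass through; each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

This is why columns such as Cabin, which is empty for many passengers, do not show up as nulls after encoding.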

In [30]:
numeric_data = pd.get_dummies(data)
numeric_data.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Cabin_A10,Cabin_A14,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,1,38.0,1,0,71.2833,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,1,35.0,1,0,53.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,3,35.0,0,0,8.05,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [31]:
numeric_data.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Cabin_A10,Cabin_A14,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
count,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,...,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.352413,0.647587,0.001122,0.001122,...,0.002245,0.003367,0.003367,0.001122,0.002245,0.004489,0.001122,0.188552,0.08642,0.722783
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.47799,0.47799,0.033501,0.033501,...,0.047351,0.057961,0.057961,0.033501,0.047351,0.06689,0.033501,0.391372,0.281141,0.447876
min,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,3.0,28.0,0.0,0.0,14.4542,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,1.0,3.0,38.0,1.0,0.0,31.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Null values
Let's check whether our dataset has any null values.

In [32]:
numeric_data.isnull().sum()

Survived        0
Pclass          0
Age           177
SibSp           0
Parch           0
             ... 
Cabin_G6        0
Cabin_T         0
Embarked_C      0
Embarked_Q      0
Embarked_S      0
Length: 158, dtype: int64
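
With 158 columns the printed summary is truncated, so it can help to keep only the columns that actually contain nulls. A minimal sketch of that filter on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data with a single missing value in Age
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex_male": [1, 0, 0],
})

null_counts = toy.isnull().sum()
# Boolean indexing keeps only the columns with at least one null
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

Applying the same filter to `numeric_data.isnull().sum()` would list only the affected columns.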

## Imputing null values
Since the Age column has 177 null values, we will impute them with the column mean. pandas has a method called fillna for filling null values; let's apply it.

In [33]:
imputed_data = numeric_data.fillna(numeric_data.mean())
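
The call works column by column: `mean()` returns one value per column and `fillna` uses each value only for the nulls in the matching column. A small sketch on made-up numbers (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age has one gap, Fare is complete
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [7.0, 8.0, 9.0],
})

# mean() skips nulls, so the Age mean is (20 + 40) / 2
column_means = toy.mean()
filled = toy.fillna(column_means)
print(filled)
```

Columns without nulls, such as Fare here, are left unchanged.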

## Splitting into feature matrix and target variable

Since we want to predict whether a passenger survived the Titanic, we use the Survived column as the target variable.

In [9]:
X = imputed_data.drop(["Survived"], axis=1)
Y = imputed_data.loc[:,"Survived"]

In [10]:
X.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Cabin_A10,Cabin_A14,Cabin_A16,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,3,22.0,1,0,7.25,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,38.0,1,0,71.2833,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,26.0,0,0,7.925,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,35.0,1,0,53.1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,3,35.0,0,0,8.05,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [11]:
Y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

## Splitting into train and test sets

As we know, the dataset should always be split into a training set and a test set, so that the model is evaluated on data it has not seen during training. sklearn provides the train_test_split function for this.

In [14]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.33, random_state=42)

In [36]:
print(Xtrain.shape)
print(Ytrain.shape)
print(Xtest.shape)
print(Ytest.shape)

(596, 157)
(596,)
(295, 157)
(295,)
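
The 596/295 split follows from how a fractional `test_size` is applied: the test set gets the fraction rounded up, and the training set gets the remaining rows. The sketch below reproduces this on placeholder arrays with the same number of rows (the `X_demo` and `y_demo` arrays are assumptions for illustration, not the Titanic data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # same row count as the Titanic dataset
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y_demo = np.arange(n_rows) % 2                     # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The test set size is the fraction of rows rounded up
expected_test = math.ceil(0.33 * n_rows)
print(len(X_tr), len(X_te), expected_test)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.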


## Implementing the decision tree

In [19]:
from sklearn.tree import DecisionTreeClassifier

In [45]:
arbol_titanic = DecisionTreeClassifier(max_depth=1,
                                     min_samples_split=2,
                                     min_samples_leaf=2, 
                                     max_leaf_nodes=2)
arbol_titanic.fit(Xtrain, Ytrain)
arbol_titanic.score(Xtest, Ytest)

0.7966101694915254
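
A tree with `max_depth=1` makes a single split, so it can be useful to look at which feature it chose. The sketch below fits such a tree on a made-up six-row frame and prints its rule with `export_text` (the `X_toy` and `y_toy` data are assumptions for illustration; the same call should work on `arbol_titanic` with `list(Xtrain.columns)` as the feature names).

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data in which sex separates the classes perfectly
X_toy = pd.DataFrame({
    "Sex_female": [1, 1, 1, 0, 0, 0],
    "Fare": [70.0, 8.0, 50.0, 7.0, 8.0, 9.0],
})
y_toy = [1, 1, 1, 0, 0, 0]

# One split only; each leaf must keep at least two samples
stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=2, random_state=0)
stump.fit(X_toy, y_toy)

# Text rendering of the learned rule
rules = export_text(stump, feature_names=list(X_toy.columns))
print(rules)
```

Because a depth-1 tree can have at most two leaves, `max_leaf_nodes=2` adds no further restriction in the cell above.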

## Grid Search
Evaluating each of these hyperparameters by hand would take forever. For this, sklearn provides the GridSearchCV class, which uses cross-validation to evaluate every combination of the parameter values we specify and returns the best one.

In [43]:
from sklearn.model_selection import GridSearchCV

In [44]:
parametros = {
    "max_depth": [1, 2, 3, 4],
    "min_samples_split": [ 2, 5, 10, 20],
    "min_samples_leaf": [2, 5, 10, 20],
    "max_leaf_nodes": [2, 5, 10, 20]
}

In [48]:
arbol_titanic = DecisionTreeClassifier()
titanic_search = GridSearchCV(arbol_titanic, parametros, cv=3, n_jobs=-1)
titanic_search.fit(Xtrain, Ytrain)
titanic_search.best_estimator_.score(Xtest, Ytest)

0.8067796610169492
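
The grid above has 4 values for each of 4 parameters, which gives 256 combinations, each fitted on 3 folds. The fitted search object also exposes the winning combination and its cross-validated score. A minimal sketch on synthetic data with a deliberately small grid (the dataset and the `small_grid` values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# 2 x 2 = 4 candidate combinations, each evaluated with 3-fold CV
small_grid = {"max_depth": [1, 3], "min_samples_leaf": [2, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)
print(search.best_params_)       # winning combination
print(search.best_score_)        # its mean cross-validated accuracy
print(search.score(X_te, y_te))  # accuracy of the refitted best model on test data
```

With the default `refit=True`, the best combination is retrained on the full training set, so `search.score` gives the same result as calling `score` on `best_estimator_`.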