# Handling Missing Values with XGBoost

In [1]:
# Load Libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd 
import multiprocessing
import random

from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder
import timeit

## Dataset

Let's read the dataset we will use.

In [2]:
# Load the dataset 
X = pd.read_csv("../datasets/dat_sanidad_raw.csv", sep='|', decimal=',', encoding='latin-1')
X.shape

(32706, 10)

The dataset consists of:

- 32706 **rows** or instances
- 10 **columns** or variables.

Let's look at its contents.

In [3]:
X.head()

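|   | gravedad | pct_mortalidad_norma | edad_dias | numproc | potencial_ambul | proc | estancia_esperada | tipgrd | tiping | exitus |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0.40873 | 12596.0 | 21 | 0 | 1 | NaN | Q | 1 | 0 |
| 1 | 4 | NaN | 20973.0 | 22 | 0 | 1 | 99.0 | Q | 1 | 0 |
| 2 | 4 | 0.278481 | 19611.0 | 19 | 0 | 1 | NaN | Q | 1 | 0 |
| 3 | 3 | 0.150289 | 13583.0 | 22 | 0 | 1 | 100.0 | Q | 1 | 0 |
| 4 | 1 | 0.016573 | 18042.0 | 2 | 0 | 1 | NaN | Q | 1 | 0 |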


- **gravedad**: Severity within the GRD (diagnosis-related group). Values from 1 to 4.

- **pct_mortalidad_norma**: Historical mortality rate for that GRD.

- **edad_dias**: Patient's age in days.

- **numproc**: Number of procedures performed on the patient in the first 24 hours.

- **potencial_ambul**: Flag (0 = no / 1 = yes) indicating whether the case has been classed as potentially ambulatory, i.e. not requiring admission.

- **proc**: Patient's origin.

- **estancia_esperada**: Number of days the patient is expected to stay in hospital for this care episode.

- **tipgrd**: Medical (M) or surgical (Q) GRD.

- **tiping**: Type of admission: scheduled, urgent...

- **exitus**: 1 = the patient died.

Of these variables, exitus is the target and the remaining 9 are the inputs to our model. The file has no fecing column, so the train/validation/test split further down relies on the row order instead.

In this case, our dataset **has missing values**.

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32706 entries, 0 to 32705
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   gravedad              32706 non-null  int64  
 1   pct_mortalidad_norma  29399 non-null  float64
 2   edad_dias             32232 non-null  float64
 3   numproc               32706 non-null  int64  
 4   potencial_ambul       32706 non-null  int64  
 5   proc                  32706 non-null  int64  
 6   estancia_esperada     5127 non-null   float64
 7   tipgrd                32706 non-null  object 
 8   tiping                32706 non-null  int64  
 9   exitus                32706 non-null  int64  
dtypes: float64(3), int64(6), object(1)
memory usage: 2.5+ MB
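
The non-null counts above translate directly into missing-value rates. The following is a small sketch of that arithmetic, with the counts copied by hand from the `X.info()` output; the variable names are only illustrative.

```python
# Non-null counts copied from the X.info() output above
n_rows = 32706
non_null = {
    "pct_mortalidad_norma": 29399,
    "edad_dias": 32232,
    "estancia_esperada": 5127,
}

# Missing count and missing rate (in percent) per column
missing = {col: n_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / n_rows, 1) for col, m in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```

On the DataFrame itself, `X.isna().mean()` gives the same rates as fractions. Note that estancia_esperada is missing in roughly 84% of the rows.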


## Pre-procesado

### One-Hot Encoding

We will use one-hot encoding, which replaces each categorical variable with one binary column per category.
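
To make the idea concrete, here is a minimal plain-Python sketch of one-hot encoding. The helper `one_hot` and the `x_` column prefix are hypothetical and only illustrate the transformation; the notebook itself uses scikit-learn's `OneHotEncoder`.

```python
def one_hot(values):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    names = [f"x_{c}" for c in categories]
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return names, rows

# A few made-up tipgrd values: M (medical) or Q (surgical)
names, rows = one_hot(["Q", "Q", "M", "Q"])
print(names)
for row in rows:
    print(row)
```

Each row contains exactly one 1.0, in the column of its category.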

<img src="../figures/oh.png" width="50%">

Let's first select the categorical variables.

In [5]:
categorical_vars = set(['gravedad', 'proc', 'tiping', 'tipgrd'])
numerical_vars = set(X.columns) - categorical_vars
categorical_vars = list(categorical_vars)
numerical_vars = list(numerical_vars)

In [6]:
print(categorical_vars)
print(numerical_vars)

['tiping', 'tipgrd', 'gravedad', 'proc']
['edad_dias', 'potencial_ambul', 'estancia_esperada', 'pct_mortalidad_norma', 'exitus', 'numproc']


We one-hot encode the categorical variables.

In [7]:
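# One-hot encode the categorical columns as a dense array.
# In newer scikit-learn releases `sparse` is called `sparse_output`
# and `get_feature_names` is called `get_feature_names_out`.
ohe = OneHotEncoder(sparse = False)
ohe_fit = ohe.fit(X[categorical_vars])
X_ohe = pd.DataFrame(ohe_fit.transform(X[categorical_vars]))
X_ohe.columns = pd.DataFrame(ohe_fit.get_feature_names())  # yields labels such as (x0_1,)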


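We display the original data and the one-hot encoded data to see the difference. In the encoded table, the prefixes x0 to x3 stand for tiping, tipgrd, gravedad and proc, in the order of categorical_vars.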

In [8]:
X[categorical_vars].head()

|   | tiping | tipgrd | gravedad | proc |
|---|---|---|---|---|
| 0 | 1 | Q | 4 | 1 |
| 1 | 1 | Q | 4 | 1 |
| 2 | 1 | Q | 4 | 1 |
| 3 | 1 | Q | 3 | 1 |
| 4 | 1 | Q | 1 | 1 |


In [9]:
X_ohe.head()

|   | (x0_1,) | (x0_2,) | (x0_3,) | (x1_M,) | (x1_Q,) | (x2_1,) | (x2_2,) | (x2_3,) | (x2_4,) | (x3_1,) | (x3_2,) | (x3_3,) | (x3_4,) | (x3_6,) | (x3_7,) | (x3_8,) | (x3_9,) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


We concatenate the numerical variables back onto the encoded ones.

In [10]:
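# reset_index() keeps the old index as an extra 'index' feature column; reset_index(drop=True) would avoid it
X = pd.concat((X_ohe, X[numerical_vars].reset_index()), axis=1)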

### Standardization

Now we standardize the data, i.e. bring every column to mean 0 and standard deviation 1.
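
Standardization computes z = (x - mean) / std for each column. The sketch below shows it in plain Python, assuming (as scikit-learn's `scale` documents) that NaN entries are skipped when computing the statistics and left as NaN in the result. The helper `standardize` is hypothetical.

```python
import math

def standardize(values):
    """Scale to mean 0 / std 1, ignoring NaN for the statistics and keeping it in place."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    # Population standard deviation (divide by n, not n - 1)
    std = math.sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    return [v if math.isnan(v) else (v - mean) / std for v in values]

# Made-up column with one missing value
z = standardize([99.0, float("nan"), 100.0, 101.0])
print(z)
```

The missing entry stays missing after scaling, which is why the NaN values survive this step and still reach the models later on.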

<img src="../figures/tipify.png" width="50%">

In [11]:
y = X['exitus']
del X['exitus']

In [12]:
# Standardize every column; scale() skips NaN when computing the mean and std and keeps it in the output
X_scale = pd.DataFrame(scale(X))
X_scale.columns = X.columns
X = X_scale
print(X.head())

    (x0_1,)  (x0_2,)   (x0_3,)   (x1_M,)   (x1_Q,)   (x2_1,)   (x2_2,)  \
0  0.589285  -0.5681 -0.118114 -1.554964  1.554964 -1.322185 -0.435743   
1  0.589285  -0.5681 -0.118114 -1.554964  1.554964 -1.322185 -0.435743   
2  0.589285  -0.5681 -0.118114 -1.554964  1.554964 -1.322185 -0.435743   
3  0.589285  -0.5681 -0.118114 -1.554964  1.554964 -1.322185 -0.435743   
4  0.589285  -0.5681 -0.118114 -1.554964  1.554964  0.756324 -0.435743   

    (x2_3,)   (x2_4,)   (x3_1,)  ...   (x3_6,)   (x3_7,)   (x3_8,)   (x3_9,)  \
0 -0.417921  4.120705  0.589002  ... -0.118114 -0.009578 -0.134718 -0.090724   
1 -0.417921  4.120705  0.589002  ... -0.118114 -0.009578 -0.134718 -0.090724   
2 -0.417921  4.120705  0.589002  ... -0.118114 -0.009578 -0.134718 -0.090724   
3  2.392797 -0.242677  0.589002  ... -0.118114 -0.009578 -0.134718 -0.090724   
4 -0.417921 -0.242677  0.589002  ... -0.118114 -0.009578 -0.134718 -0.090724   

      index  edad_dias  potencial_ambul  estancia_esperada  \
0 -1.731998 

### Train/Validation/Test Split

As an example, we will use the commonly recommended ratios:

- Train: 70%.
- Validation: 15%.
- Test: 15%.


In [13]:
perc_values = [0.7, 0.15, 0.15]

We create the train, validation and test sets with the chosen sizes while respecting the time axis, that is, by slicing the rows in order without shuffling.

In [14]:
# sizes of the train, validation and test sets
n_train = int(X.shape[0] * perc_values[0])
n_val = int(X.shape[0] * perc_values[1])
n_test = int(X.shape[0] * perc_values[2])

# select the train set
X_train = X.iloc[:n_train]
y_train = y.iloc[:n_train]

# select the validation set
X_val = X.iloc[(n_train):(n_train+n_val)]
y_val = y.iloc[(n_train):(n_train+n_val)]

# select the test set (everything that is left)
X_test = X.iloc[(n_train+n_val):]
y_test = y.iloc[(n_train+n_val):]

We check the size of the 3 subsets.

In [15]:
print('Train data size = ' + str(X_train.shape))
print('Train target size = ' + str(y_train.shape))
print('Validation data size = ' + str(X_val.shape))
print('Validation target size = ' + str(y_val.shape))
print('Test data size = ' + str(X_test.shape))
print('Test target size = ' + str(y_test.shape))

Train data size = (22894, 23)
Train target size = (22894,)
Validation data size = (4905, 23)
Validation target size = (4905,)
Test data size = (4907, 23)
Test target size = (4907,)
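
The sizes above follow from truncating the products with `int()` and letting the last slice take the remainder. A short sketch of that arithmetic, using only the row count and ratios already shown:

```python
n_rows = 32706
perc_values = [0.7, 0.15, 0.15]

n_train = int(n_rows * perc_values[0])  # int() truncates the fractional part
n_val = int(n_rows * perc_values[1])
n_test = n_rows - n_train - n_val       # the open-ended slice takes whatever is left

print(n_train, n_val, n_test)
```

This is why the test set has 4907 rows rather than 4905: the rows lost to truncation in the first two sets end up in the final slice.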


## SVM

1) Import the model.

In [16]:
from sklearn.svm import SVC as model_constructor

2) Import the metric.

In [17]:
from sklearn.metrics import roc_auc_score as metric

3) Define the model.

We will train a model with default parameters.

In [18]:
model = model_constructor(random_state = 1)

4) Call the fit method to train the model.

In [19]:
model.fit(X_train, 
          np.array(y_train))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

As we saw in class, most models **do not accept missing values**, which is why the SVM above fails with a ValueError. However, some **tree-based models**, XGBoost among them, are an **exception**.


## XGBoost

We now fit an XGBoost model.
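
XGBoost can train on NaN because each split learns a default direction for rows whose feature is missing. The following is a deliberately simplified sketch of that idea: it picks the direction with the lower squared error for one fixed threshold, whereas XGBoost itself scores splits with a gain based on gradient statistics. The function names are hypothetical.

```python
import math

def sse(labels):
    """Sum of squared errors around the mean (0 for an empty leaf)."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels)

def best_default_direction(x, y, threshold):
    """Try sending all missing rows left, then right; keep the lower-error option."""
    left = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi < threshold]
    right = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi >= threshold]
    missing = [yi for xi, yi in zip(x, y) if math.isnan(xi)]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"

nan = float("nan")
# Toy data: rows with a missing feature mostly have label 1, like the high-x rows
x = [1.0, 2.0, 3.0, 8.0, 9.0, nan, nan, nan]
y = [0, 0, 0, 1, 1, 1, 1, 0]
direction = best_default_direction(x, y, threshold=5.0)
print(direction)
```

In this toy case the missing rows resemble the right-hand leaf, so they are routed to the right. The point is that missingness is handled inside the tree instead of being imputed beforehand.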

1) Import the model.

In [20]:
from xgboost import XGBClassifier as model_constructor

3) Define the model.

In [21]:
model = model_constructor(random_state = 1)

4) Call the fit method to train the model.

In [22]:
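# In newer XGBoost releases, eval_metric and early_stopping_rounds are set in the constructor, not in fit()
model.fit(X_train, 
          np.array(y_train), 
          eval_metric = "auc", 
          eval_set=[(X_val, y_val)],
          early_stopping_rounds = 10,
          verbose=True)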

[0]	validation_0-auc:0.90469
[1]	validation_0-auc:0.90519
[2]	validation_0-auc:0.91386
[3]	validation_0-auc:0.91716
[4]	validation_0-auc:0.91539
[5]	validation_0-auc:0.91755
[6]	validation_0-auc:0.91752
[7]	validation_0-auc:0.91800
[8]	validation_0-auc:0.92148
[9]	validation_0-auc:0.92331
[10]	validation_0-auc:0.92184
[11]	validation_0-auc:0.92404
[12]	validation_0-auc:0.92470
[13]	validation_0-auc:0.92558
[14]	validation_0-auc:0.92564
[15]	validation_0-auc:0.92563
[16]	validation_0-auc:0.92711
[17]	validation_0-auc:0.92477
[18]	validation_0-auc:0.92508
[19]	validation_0-auc:0.92623
[20]	validation_0-auc:0.92679
[21]	validation_0-auc:0.92693
[22]	validation_0-auc:0.92383
[23]	validation_0-auc:0.92419
[24]	validation_0-auc:0.92401
[25]	validation_0-auc:0.92464
[26]	validation_0-auc:0.92420


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=1, ...)

It works without any imputation!
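
The training log above also shows early stopping at work. As a rough sketch of the rule (not XGBoost's actual implementation), training stops once the validation metric has gone `patience` rounds without improving; the helper `early_stopping` is hypothetical and the AUC values are copied from the log.

```python
def early_stopping(scores, patience):
    """Return (best_round, last_round) for a metric that should be maximized."""
    best_round = 0
    for i, score in enumerate(scores):
        if score > scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return best_round, i  # no improvement for `patience` rounds
    return best_round, len(scores) - 1

# Validation AUC per boosting round, copied from the log above
val_auc = [
    0.90469, 0.90519, 0.91386, 0.91716, 0.91539, 0.91755, 0.91752,
    0.91800, 0.92148, 0.92331, 0.92184, 0.92404, 0.92470, 0.92558,
    0.92564, 0.92563, 0.92711, 0.92477, 0.92508, 0.92623, 0.92679,
    0.92693, 0.92383, 0.92419, 0.92401, 0.92464, 0.92420,
]
best_round, last_round = early_stopping(val_auc, patience=10)
print(best_round, last_round)
```

With this log the sketch returns round 16 as the best and round 26 as the last. The AUC at round 16 (0.92711) agrees with the validation AUC reported in the results table of this section.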

5) Call the predict_proba method to generate the predictions.

In [23]:
pred_train_p = model.predict_proba(X_train)
pred_val_p = model.predict_proba(X_val)
pred_test_p = model.predict_proba(X_test) 

6) Compute the metric using the predictions obtained in the previous step.

In [24]:
# Compute the evaluation metric (AUC) from the positive-class probabilities
auc_train = metric(y_train, pred_train_p[:, 1])
auc_val = metric(y_val, pred_val_p[:, 1])
auc_test = metric(y_test, pred_test_p[:, 1])
# DataFrame.append was removed in pandas 2.0; build the first row directly
results = pd.DataFrame({'model': ['XGBoost (Default)'], 'auc_train': [auc_train], 'auc_val': [auc_val], 'auc_test': [auc_test]})

In [25]:
results

|   | model | auc_train | auc_val | auc_test |
|---|---|---|---|---|
| 0 | XGBoost (Default) | 0.965539 | 0.927106 | 0.925819 |
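
As a reminder of what the AUC measures, it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way on made-up data. The helper `auc` is hypothetical and far slower than `roc_auc_score`, but for binary labels it should give the same value.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(y_true, scores)
print(toy_auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.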


Let's compare its performance with a basic approach such as filling the missing values with the mean.

In [28]:
# Impute missing values with the column means computed on the training set only
numerical_vars = list(set(numerical_vars) - set(['exitus']))
means = X_train[numerical_vars].mean().to_dict()  # Series.mean skips NaN by default
X_train = X_train.fillna(value = means, axis = 0)
X_val = X_val.fillna(value = means, axis = 0)
X_test = X_test.fillna(value = means, axis = 0)
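
The cell above takes the means from the training set and reuses them for validation and test, so no information from those sets leaks into the imputation. A minimal plain-Python sketch of the same pattern on made-up values (the helper names are hypothetical):

```python
import math

def column_mean(values):
    """Mean of the observed (non-NaN) values."""
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed)

def fill_missing(values, fill_value):
    """Replace every NaN with fill_value."""
    return [fill_value if math.isnan(v) else v for v in values]

nan = float("nan")
train = [1.0, nan, 3.0]
test = [nan, 10.0]

train_mean = column_mean(train)                # computed from training rows only
train_filled = fill_missing(train, train_mean)
test_filled = fill_missing(test, train_mean)   # the test set reuses the training mean
print(train_filled, test_filled)
```

The missing test value is filled with 2.0, the training mean, not with the mean of the test column.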

3) Define the model.

In [29]:
model = model_constructor(random_state = 1)

4) Call the fit method to train the model.

In [30]:
model.fit(X_train, 
          np.array(y_train), 
          eval_metric = "auc", 
          eval_set=[(X_val, y_val)],
          early_stopping_rounds = 10,
          verbose=True)

[0]	validation_0-auc:0.90608
[1]	validation_0-auc:0.90428
[2]	validation_0-auc:0.90338
[3]	validation_0-auc:0.91100
[4]	validation_0-auc:0.91124
[5]	validation_0-auc:0.91581
[6]	validation_0-auc:0.91632
[7]	validation_0-auc:0.91766
[8]	validation_0-auc:0.91728
[9]	validation_0-auc:0.92208
[10]	validation_0-auc:0.92358
[11]	validation_0-auc:0.92412
[12]	validation_0-auc:0.92524
[13]	validation_0-auc:0.92525
[14]	validation_0-auc:0.92553
[15]	validation_0-auc:0.92509
[16]	validation_0-auc:0.92540
[17]	validation_0-auc:0.92781
[18]	validation_0-auc:0.92878
[19]	validation_0-auc:0.92342
[20]	validation_0-auc:0.92365
[21]	validation_0-auc:0.92332
[22]	validation_0-auc:0.92365
[23]	validation_0-auc:0.92334
[24]	validation_0-auc:0.92336
[25]	validation_0-auc:0.92330
[26]	validation_0-auc:0.92367
[27]	validation_0-auc:0.92446


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=1, ...)

5) Call the predict_proba method to generate the predictions.

In [31]:
pred_train_p = model.predict_proba(X_train)
pred_val_p = model.predict_proba(X_val)
pred_test_p = model.predict_proba(X_test) 

6) Compute the metric using the predictions obtained in the previous step.

In [32]:
# Compute the evaluation metric (AUC) from the positive-class probabilities
auc_train = metric(y_train, pred_train_p[:, 1])
auc_val = metric(y_val, pred_val_p[:, 1])
auc_test = metric(y_test, pred_test_p[:, 1])
# Add a row with pd.concat (DataFrame.append was removed in pandas 2.0)
results = pd.concat([results, pd.DataFrame({'model': ['XGBoost fill missing'], 'auc_train': [auc_train], 'auc_val': [auc_val], 'auc_test': [auc_test]})], ignore_index=True)

In [33]:
results

|   | model | auc_train | auc_val | auc_test |
|---|---|---|---|---|
| 0 | XGBoost (Default) | 0.965539 | 0.927106 | 0.925819 |
| 1 | XGBoost fill missing | 0.964769 | 0.92878 | 0.921408 |
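
To read the table more easily, the gaps between the two rows can be computed directly. The sketch below only does arithmetic on the values copied from the table; the dictionary layout is illustrative.

```python
# AUC values copied from the results table above
results_table = {
    "XGBoost (Default)": {"auc_val": 0.927106, "auc_test": 0.925819},
    "XGBoost fill missing": {"auc_val": 0.92878, "auc_test": 0.921408},
}

# Native missing-value handling minus mean imputation, per split
diff = {
    split: round(
        results_table["XGBoost (Default)"][split]
        - results_table["XGBoost fill missing"][split],
        6,
    )
    for split in ("auc_val", "auc_test")
}
print(diff)
```

On this single run, native handling scores about 0.004 higher on test and about 0.002 lower on validation, so the two approaches are close here.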
