# Predicting Exitus (Death) with XGBoost and Variants

In [1]:
# Load libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
import multiprocessing
import random
import timeit

from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder

## Dataset

Let's read the dataset we will use.

In [2]:
# Load the dataset 
X = pd.read_csv("../datasets/dat_sanidad.csv", sep=';', decimal=',', encoding='latin-1')
X.shape

(32706, 10)

The dataset consists of:

- 32706 **rows** (instances)
- 10 **columns** (variables)

Let's look at its contents.

In [3]:
X.head()

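| | gravedad | pct_mortalidad_norma | edad_dias | numproc | potencial_ambul | proc | estancia_esperada | tipgrd | tiping | exitus |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0.40873 | 12596 | 21 | 0 | 1 | 151 | Q | 1 | 0 |
| 1 | 4 | 0.306931 | 20973 | 22 | 0 | 1 | 99 | Q | 1 | 0 |
| 2 | 4 | 0.278481 | 19611 | 19 | 0 | 1 | 87 | Q | 1 | 0 |
| 3 | 3 | 0.150289 | 13583 | 22 | 0 | 1 | 100 | Q | 1 | 0 |
| 4 | 1 | 0.016573 | 18042 | 2 | 0 | 1 | 44 | Q | 1 | 0 |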


- **gravedad**: Severity within the GRD (diagnosis-related group). Values from 1 to 4.
- **pct_mortalidad_norma**: Historical mortality rate for that GRD.
- **edad_dias**: Patient age in days.
- **numproc**: Number of procedures performed on the patient within the first 24 hours.
- **potencial_ambul**: Flag (0 = no / 1 = yes) indicating whether the case was classified as potentially ambulatory, i.e. not requiring admission.
- **proc**: Patient's origin.
- **estancia_esperada**: Expected number of days the patient will stay in hospital for this care episode.
- **tipgrd**: Medical (M) or surgical (Q) GRD.
- **tiping**: Admission type: scheduled, urgent...
- **exitus**: 1 = the patient died.

Of these variables, `exitus` is the target and the remaining 9 are the inputs to our model. The table shown above has no admission-date column (`fecing`), so the train/validation/test split relies on row order instead.

## Preprocessing

### One-Hot Encoding

We will use one-hot encoding, which replaces each categorical variable with one binary column per category.
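As a minimal illustration of the idea, the sketch below one-hot encodes a toy column with `pandas.get_dummies`. The toy frame and its values are made up for the example; the notebook itself uses scikit-learn's `OneHotEncoder` further down.

```python
import pandas as pd

# Toy data (hypothetical values, only for illustration)
toy = pd.DataFrame({"tipgrd": ["Q", "M", "Q"]})

# One binary column per category; dtype=float gives 0.0/1.0 values
toy_ohe = pd.get_dummies(toy, columns=["tipgrd"], dtype=float)
print(toy_ohe)
```

Each row has exactly one 1.0 among the columns generated for a given variable.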

<img src="../figures/oh.png" width="50%">

First, let's select the categorical variables.

In [4]:
categorical_vars = set(['gravedad', 'proc', 'tiping', 'tipgrd'])
numerical_vars = set(X.columns) - categorical_vars
categorical_vars = list(categorical_vars)
numerical_vars = list(numerical_vars)

In [5]:
print(categorical_vars)
print(numerical_vars)

['tipgrd', 'proc', 'tiping', 'gravedad']
['numproc', 'edad_dias', 'potencial_ambul', 'exitus', 'estancia_esperada', 'pct_mortalidad_norma']


We one-hot encode the categorical variables.

In [6]:
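# Note: recent scikit-learn releases rename `sparse` to `sparse_output` and
# replace `get_feature_names()` with `get_feature_names_out()`.
ohe = OneHotEncoder(sparse = False)
X_ohe = pd.DataFrame(ohe.fit_transform(X[categorical_vars]))
# Wrapping the names in a DataFrame yields tuple-like labels such as (x0_M,)
X_ohe.columns = pd.DataFrame(ohe.get_feature_names())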


Let's compare the original data with the one-hot encoded version to see the difference.

In [7]:
X[categorical_vars].head()

| | tipgrd | proc | tiping | gravedad |
|---|---|---|---|---|
| 0 | Q | 1 | 1 | 4 |
| 1 | Q | 1 | 1 | 4 |
| 2 | Q | 1 | 1 | 4 |
| 3 | Q | 1 | 1 | 3 |
| 4 | Q | 1 | 1 | 1 |


In [8]:
X_ohe.head()

Unnamed: 0,"(x0_M,)","(x0_Q,)","(x1_1,)","(x1_2,)","(x1_3,)","(x1_4,)","(x1_6,)","(x1_7,)","(x1_8,)","(x1_9,)","(x2_1,)","(x2_2,)","(x2_3,)","(x3_1,)","(x3_2,)","(x3_3,)","(x3_4,)"
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


We now append the numerical variables back.

In [9]:
# Note: reset_index() without drop=True also adds the old index as an extra `index` column
X = pd.concat((X_ohe, X[numerical_vars].reset_index()), axis=1)
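To make the effect of `reset_index()` in this step concrete, here is a small sketch with made-up frames. It compares the call as used above with `reset_index(drop=True)`, which discards the old index instead of turning it into a column.

```python
import pandas as pd

# Toy frames (hypothetical values)
left = pd.DataFrame({"a": [1.0, 0.0, 1.0]})
right = pd.DataFrame({"b": [10, 20, 30]}, index=[5, 6, 7])

# Without drop=True the old index becomes a regular column named "index"
with_index = pd.concat((left, right.reset_index()), axis=1)

# With drop=True the old index is discarded and only the data columns remain
without_index = pd.concat((left, right.reset_index(drop=True)), axis=1)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

In the notebook, the extra `index` column is carried through scaling and into the model inputs, which is worth keeping in mind when reading the feature counts later on.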

### Standardization

Now we standardize the data, i.e. rescale each column to mean 0 and standard deviation 1.
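As a quick check of what this transformation computes, the sketch below applies z = (x - mean) / std by hand to a made-up column and compares it with scikit-learn's `scale`, which uses the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import scale

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy column (hypothetical values)

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = scale(x)

print(np.allclose(z_manual, z_sklearn))
```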

<img src="../figures/tipify.png" width="50%">

In [10]:
y = X['exitus']
del X['exitus']

In [11]:
X_scale = pd.DataFrame(scale(X))
X_scale.columns = X.columns
X = X_scale
print(X.head())

    (x0_M,)   (x0_Q,)   (x1_1,)   (x1_2,)   (x1_3,)   (x1_4,)   (x1_6,)  \
0 -1.554964  1.554964  0.589002 -0.471911 -0.128712 -0.140259 -0.118114   
1 -1.554964  1.554964  0.589002 -0.471911 -0.128712 -0.140259 -0.118114   
2 -1.554964  1.554964  0.589002 -0.471911 -0.128712 -0.140259 -0.118114   
3 -1.554964  1.554964  0.589002 -0.471911 -0.128712 -0.140259 -0.118114   
4 -1.554964  1.554964  0.589002 -0.471911 -0.128712 -0.140259 -0.118114   

    (x1_7,)   (x1_8,)   (x1_9,)  ...   (x3_1,)   (x3_2,)   (x3_3,)   (x3_4,)  \
0 -0.009578 -0.134718 -0.090724  ... -1.322185 -0.435743 -0.417921  4.120705   
1 -0.009578 -0.134718 -0.090724  ... -1.322185 -0.435743 -0.417921  4.120705   
2 -0.009578 -0.134718 -0.090724  ... -1.322185 -0.435743 -0.417921  4.120705   
3 -0.009578 -0.134718 -0.090724  ... -1.322185 -0.435743  2.392797 -0.242677   
4 -0.009578 -0.134718 -0.090724  ...  0.756324 -0.435743 -0.417921 -0.242677   

      index   numproc  edad_dias  potencial_ambul  estancia_esperada

### Train/Validation/Test Split

As an example, we will use the commonly recommended ratios:

- Train: 70%.
- Validation: 15%.
- Test: 15%.


In [12]:
perc_values = [0.7, 0.15, 0.15]

We create the train, validation and test sets with the chosen sizes while respecting the temporal axis: rows are taken in order, without shuffling.

In [13]:
# sizes of the train, validation and test sets
n_train = int(X.shape[0] * perc_values[0])
n_val = int(X.shape[0] * perc_values[1])
n_test = int(X.shape[0] * perc_values[2])

# select the train set
X_train = X.iloc[:n_train]
y_train = y.iloc[:n_train]

# select the validation set
X_val = X.iloc[(n_train):(n_train+n_val)]
y_val = y.iloc[(n_train):(n_train+n_val)]

# select the test set (all remaining rows)
X_test = X.iloc[(n_train+n_val):]
y_test = y.iloc[(n_train+n_val):]
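The same ordered split can be wrapped in a small function. This is only a sketch: the name `ordered_split` and the dummy data are made up for illustration, and the row count simply mirrors the dataset loaded earlier.

```python
import numpy as np
import pandas as pd

def ordered_split(X, y, fractions=(0.7, 0.15, 0.15)):
    """Split X and y into consecutive train/validation/test blocks, in row order."""
    n = X.shape[0]
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    bounds = [(0, n_train), (n_train, n_train + n_val), (n_train + n_val, n)]
    # The test block takes whatever rows remain, so no row is lost to rounding
    return [(X.iloc[a:b], y.iloc[a:b]) for a, b in bounds]

# Dummy data with the same number of rows as the dataset
X_demo = pd.DataFrame({"f": np.arange(32706)})
y_demo = pd.Series(np.zeros(32706, dtype=int))
(train, _), (val, _), (test, _) = ordered_split(X_demo, y_demo)
print(len(train), len(val), len(test))  # 22894 4905 4907
```

Because the test block is defined as "everything left", it ends up two rows larger than the validation block even though both fractions are 0.15.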

Let's check the size of the three subsets.

In [14]:
print('Train data size = ' + str(X_train.shape))
print('Train target size = ' + str(y_train.shape))
print('Validation data size = ' + str(X_val.shape))
print('Validation target size = ' + str(y_val.shape))
print('Test data size = ' + str(X_test.shape))
print('Test target size = ' + str(y_test.shape))

Train data size = (22894, 23)
Train target size = (22894,)
Validation data size = (4905, 23)
Validation target size = (4905,)
Test data size = (4907, 23)
Test target size = (4907,)
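Because the standardization above was computed on the full dataset before splitting, the validation and test rows contribute to the means and standard deviations. A common alternative, sketched here with made-up data and scikit-learn's `StandardScaler` (not used in the original notebook), fits the scaler on the training rows only and reuses it for the other sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy features

# Ordered 70/15/15 split, as in the notebook
X_tr, X_va, X_te = X_demo[:70], X_demo[70:85], X_demo[85:]

scaler = StandardScaler().fit(X_tr)   # statistics come from the train rows only
X_tr_s = scaler.transform(X_tr)
X_va_s = scaler.transform(X_va)       # reuses the train mean and std
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(6))
```

With this approach only the training block is guaranteed to have mean 0 and standard deviation 1; the other blocks are expressed in the training set's units.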



## [1] XGBoost

Next, we apply an XGBoost model.

1) Import the model.

In [18]:
from xgboost import XGBClassifier as model_constructor

2) Import the metric.

In [19]:
from sklearn.metrics import roc_auc_score as metric
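For readers less familiar with this metric, the toy example below (made-up labels and scores) shows that ROC AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # toy scores

auc = roc_auc_score(y_true, y_score)

# Equivalent pairwise view: fraction of (positive, negative) pairs ranked correctly
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)  # both are 0.75 for this toy example
```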

3) Define the model.

In [20]:
model = model_constructor(tree_method = 'hist', random_state = 14)

4) Call the fit method to train the model.

In this case we also measure how long the model takes to train.

In [21]:
import timeit

In [22]:
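# Note: XGBoost 2.x no longer accepts eval_metric / early_stopping_rounds in fit();
# there they are passed to the XGBClassifier constructor instead.
start = timeit.default_timer()
model.fit(X_train, 
          np.array(y_train), 
          eval_metric = "auc", 
          eval_set=[(X_val, y_val)],
          early_stopping_rounds = 10,
          verbose=True)
time = timeit.default_timer() - start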

[0]	validation_0-auc:0.89724
[1]	validation_0-auc:0.91853
[2]	validation_0-auc:0.92261
[3]	validation_0-auc:0.92769
[4]	validation_0-auc:0.92363
[5]	validation_0-auc:0.92377
[6]	validation_0-auc:0.92750
[7]	validation_0-auc:0.92128
[8]	validation_0-auc:0.92424
[9]	validation_0-auc:0.92575
[10]	validation_0-auc:0.92916
[11]	validation_0-auc:0.92980
[12]	validation_0-auc:0.93024
[13]	validation_0-auc:0.93293
[14]	validation_0-auc:0.93369
[15]	validation_0-auc:0.93374
[16]	validation_0-auc:0.93382
[17]	validation_0-auc:0.93386
[18]	validation_0-auc:0.93260
[19]	validation_0-auc:0.93275
[20]	validation_0-auc:0.93298
[21]	validation_0-auc:0.93305
[22]	validation_0-auc:0.93257
[23]	validation_0-auc:0.93004
[24]	validation_0-auc:0.93029
[25]	validation_0-auc:0.92991
[26]	validation_0-auc:0.92931
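The early-stopping rule behind this log can be sketched in plain Python. The helper name `best_round` is made up, the AUC values are copied from the log above, and the rule shown (stop once `patience` rounds pass without a new best score) is a simplified reading of `early_stopping_rounds`, not XGBoost's actual implementation.

```python
def best_round(scores, patience):
    """Return (best_iteration, stop_iteration); stop_iteration is None if never triggered."""
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, None

# Validation AUC per round, as printed in the log above (rounds 0 to 26)
val_auc = [0.89724, 0.91853, 0.92261, 0.92769, 0.92363, 0.92377, 0.92750,
           0.92128, 0.92424, 0.92575, 0.92916, 0.92980, 0.93024, 0.93293,
           0.93369, 0.93374, 0.93382, 0.93386, 0.93260, 0.93275, 0.93298,
           0.93305, 0.93257, 0.93004, 0.93029, 0.92991, 0.92931]

print(best_round(val_auc, patience=10))  # (17, None)
print(best_round(val_auc, patience=9))   # (17, 26)
```

The best round is 17 (AUC 0.93386). The log shown ends at round 26, nine rounds after the best one, so a patience of 10 has not yet been exhausted within the visible rounds.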


5) Call the prediction method (`predict_proba` here) to generate the predictions.

In [23]:
pred_train_p = model.predict_proba(X_train)
pred_val_p = model.predict_proba(X_val)
pred_test_p = model.predict_proba(X_test) 

6) Compute the metric using the predictions from the previous step.

In [24]:
# Compute evaluation metrics
auc_train = metric(y_train, pred_train_p[:,1])
auc_val = metric(y_val, pred_val_p[:,1])
auc_test = metric(y_test, pred_test_p[:,1])
# DataFrame.append was removed in pandas 2.0, so the table is built directly
results = pd.DataFrame([{'model': 'XGBoost', 'auc_train': auc_train, 'auc_val': auc_val, 'auc_test': auc_test, 'time': time}])

In [25]:
results

| | model | auc_train | auc_val | auc_test | time |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.970494 | 0.933863 | 0.928994 | 0.247592 |
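Since the same row-building step recurs for every model in this notebook, it could be factored into a helper. The sketch below is one possible shape for it; `add_result` and `COLUMNS` are hypothetical names, and the numbers are the XGBoost results reported above.

```python
import pandas as pd

COLUMNS = ['model', 'auc_train', 'auc_val', 'auc_test', 'time']

def add_result(results, model_name, auc_train, auc_val, auc_test, seconds):
    """Return a new results table with one extra row (hypothetical helper)."""
    row = pd.DataFrame([[model_name, auc_train, auc_val, auc_test, seconds]], columns=COLUMNS)
    if results is None or results.empty:
        return row
    return pd.concat([results, row], ignore_index=True)

# Usage with the XGBoost numbers reported above
demo = add_result(None, 'XGBoost', 0.970494, 0.933863, 0.928994, 0.247592)
print(demo)
```

Returning a new table rather than mutating the input keeps each notebook cell safe to re-run.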



## [2] LightGBM

Next, we apply a LightGBM model.

In [26]:
%pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [27]:
import lightgbm as lgb

The datasets must be converted to LightGBM's own `Dataset` format.

In [28]:
X_train.columns = ["var_" + str(x) for x in range(X.shape[1])]
X_val.columns = ["var_" + str(x) for x in range(X.shape[1])]
X_test.columns = ["var_" + str(x) for x in range(X.shape[1])]
train_data = lgb.Dataset(data = X_train, label = y_train)
validation_data = train_data.create_valid(data = X_val, label = y_val)

We train the model.

In [29]:
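# Note: no 'objective' is set, so LightGBM falls back to its default (regression);
# {'objective': 'binary', 'metric': 'auc'} is the usual choice for a 0/1 target.
param = {'metric':'auc'}
start = timeit.default_timer()
model = lgb.train(param, train_data, valid_sets=[validation_data], callbacks=[lgb.early_stopping(stopping_rounds=10)])
time = timeit.default_timer() - start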

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 910
[LightGBM] [Info] Number of data points in the train set: 22894, number of used features: 22
[LightGBM] [Info] Start training from score 0.038656
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[26]	valid_0's auc: 0.938033


5) Call the predict method to generate the predictions.

In [30]:
pred_train_p = model.predict(X_train)
pred_val_p = model.predict(X_val)
pred_test_p = model.predict(X_test) 
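ROC AUC depends only on how the scores rank the cases, not on their scale, so scores that are not probabilities (which `predict` may return here, depending on the training objective) can still be passed to `roc_auc_score`. The check below uses made-up scores and a logistic transform purely as an example of a strictly increasing function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
raw_scores = rng.normal(size=200) + y_true   # toy, unbounded scores

# A strictly increasing transform (here the logistic function) keeps the ranking
as_probs = 1.0 / (1.0 + np.exp(-raw_scores))

auc_raw = roc_auc_score(y_true, raw_scores)
auc_probs = roc_auc_score(y_true, as_probs)
print(np.isclose(auc_raw, auc_probs))
```

This is why the AUC comparison across the three models remains meaningful even if their outputs are on different scales; calibration of the scores is a separate question.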

6) Compute the metric using the predictions from the previous step.

In [31]:
# Compute evaluation metrics
auc_train = metric(y_train, pred_train_p)
auc_val = metric(y_val, pred_val_p)
auc_test = metric(y_test, pred_test_p)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, pd.DataFrame([{'model': 'LightGBM', 'auc_train': auc_train, 'auc_val': auc_val, 'auc_test': auc_test, 'time': time}])], ignore_index=True)

In [32]:
results

| | model | auc_train | auc_val | auc_test | time |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.970494 | 0.933863 | 0.928994 | 0.247592 |
| 1 | LightGBM | 0.961832 | 0.938033 | 0.935709 | 0.073074 |



## [3] CatBoost

Next, we apply a CatBoost model.

In [33]:
%pip install catboost

Note: you may need to restart the kernel to use updated packages.


1) Import the model.

In [34]:
from catboost import CatBoostClassifier as model_constructor

3) Define the model.

In [35]:
model = model_constructor(eval_metric='AUC', verbose=True)

We train the model.

In [36]:
start = timeit.default_timer()
model = model.fit(X_train, y_train,
                 eval_set=(X_val, y_val), early_stopping_rounds = 10)
time = timeit.default_timer() - start

Learning rate set to 0.068669
0:	test: 0.9129386	best: 0.9129386 (0)	total: 165ms	remaining: 2m 44s
1:	test: 0.9202737	best: 0.9202737 (1)	total: 175ms	remaining: 1m 27s
2:	test: 0.9277720	best: 0.9277720 (2)	total: 184ms	remaining: 1m 1s
3:	test: 0.9360158	best: 0.9360158 (3)	total: 194ms	remaining: 48.3s
4:	test: 0.9366291	best: 0.9366291 (4)	total: 201ms	remaining: 40.1s
5:	test: 0.9374164	best: 0.9374164 (5)	total: 210ms	remaining: 34.8s
6:	test: 0.9374897	best: 0.9374897 (6)	total: 218ms	remaining: 30.9s
7:	test: 0.9397091	best: 0.9397091 (7)	total: 226ms	remaining: 28s
8:	test: 0.9396106	best: 0.9397091 (7)	total: 232ms	remaining: 25.6s
9:	test: 0.9410954	best: 0.9410954 (9)	total: 240ms	remaining: 23.7s
10:	test: 0.9415363	best: 0.9415363 (10)	total: 247ms	remaining: 22.2s
11:	test: 0.9421896	best: 0.9421896 (11)	total: 254ms	remaining: 20.9s
12:	test: 0.9430749	best: 0.9430749 (12)	total: 260ms	remaining: 19.8s
13:	test: 0.9435158	best: 0.9435158 (13)	total: 267ms	remaining: 18

5) Call the prediction method (`predict_proba` here) to generate the predictions.

In [37]:
pred_train_p = model.predict_proba(X_train)
pred_val_p = model.predict_proba(X_val)
pred_test_p = model.predict_proba(X_test) 

6) Compute the metric using the predictions from the previous step.

In [38]:
# Calcular métricas de evaluación
auc_train = metric(y_train, pred_train_p[:,1])
auc_val = metric(y_val, pred_val_p[:,1])
auc_test = metric(y_test, pred_test_p[:,1])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results = pd.concat([results, pd.DataFrame([{'model': 'CatBoost', 'auc_train': auc_train, 'auc_val': auc_val, 'auc_test': auc_test, 'time': time}])], ignore_index=True)

In [39]:
results

| | model | auc_train | auc_val | auc_test | time |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.970494 | 0.933863 | 0.928994 | 0.247592 |
| 1 | LightGBM | 0.961832 | 0.938033 | 0.935709 | 0.073074 |
| 2 | CatBoost | 0.952849 | 0.946268 | 0.939088 | 0.724394 |
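To read the table more easily, the sketch below recomputes a few summary quantities from the reported values: the train-validation AUC gap (a rough indicator of overfitting), the model with the best test AUC, and the fastest model. The frame is rebuilt by hand from the numbers above; `results_demo` and `gap_train_val` are names chosen for the example.

```python
import pandas as pd

# Values copied from the results table above
results_demo = pd.DataFrame({
    'model': ['XGBoost', 'LightGBM', 'CatBoost'],
    'auc_train': [0.970494, 0.961832, 0.952849],
    'auc_val': [0.933863, 0.938033, 0.946268],
    'auc_test': [0.928994, 0.935709, 0.939088],
    'time': [0.247592, 0.073074, 0.724394],
}).set_index('model')

# Train-validation gap: a rough indicator of overfitting
results_demo['gap_train_val'] = results_demo['auc_train'] - results_demo['auc_val']

print(results_demo['gap_train_val'].round(4))
print('Best test AUC:', results_demo['auc_test'].idxmax())
print('Fastest:', results_demo['time'].idxmin())
```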


Conclusions:
- LightGBM is the fastest to train (0.07 s, versus 0.25 s for XGBoost and 0.72 s for CatBoost).
- LightGBM requires some extra steps to run (renaming the columns and building `Dataset` objects).
- CatBoost performs very well with default parameters: it has the highest validation and test AUC (0.946 and 0.939).
- **A grid search would be needed for a more informative comparison before naming a final winner.**
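A grid search against a fixed validation set could be organized as follows. This is a sketch under stated assumptions: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for the three libraries so the example depends only on scikit-learn, and the grid values are illustrative rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in data with an ordered 70/30 train/validation split
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, y_tr = X_demo[:420], y_demo[:420]
X_va, y_va = X_demo[420:], y_demo[420:]

# Hypothetical grid; the values are illustrative only
grid = ParameterGrid({'max_depth': [2, 3], 'learning_rate': [0.05, 0.1]})

best_auc, best_params = -1.0, None
for params in grid:
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0, **params)
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
    if auc > best_auc:
        best_auc, best_params = auc, params

print(best_params, round(best_auc, 4))
```

The same loop applies to each boosting library in turn: select hyperparameters on the validation set, then report the test AUC once for the chosen configuration.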