# MIMIC-IV-ED Dataset

## Description
MIMIC-IV-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2019. The database contains ~425,000 ED stays. Vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses are available. All data are deidentified to comply with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-IV-ED is intended to support a diverse range of education initiatives and research studies.

## Plan
Use the data available right after the triage assessment to predict whether the patient will be admitted to the hospital. The information available at that point is:
- race
- gender
- age
- the means by which the patient arrived at the hospital
- date and time of arrival at the hospital
- medications in use up to the emergency department visit
  - National Drug Code (NDC): the NDC is a unique 10-digit, 3-segment number. It is a universal product identifier for human drugs in the United States. The code appears on the packaging and package inserts of all over-the-counter (OTC) and prescription drugs in the US.
  - Ontology group
  Note: because a medication can be classified into multiple groups in the ontology, a single medication may appear on more than one row. For example, Adderall is (1) a CNS stimulant, (2) a therapy for attention deficit hyperactivity disorder, and (3) a therapy for narcolepsy.

- vital signs and other objective assessments, including:
  - temperature
  - heart rate
  - respiratory rate
  - O2 saturation
  - systolic blood pressure
  - diastolic blood pressure
- acuity level: a number from 1 to 5 reflecting the urgency of the case as judged by the professional who performed the triage. 1 is the most severe and 5 the least severe.

Each medication identifier will be converted into an admission proportion, and for each stay the maximum, mean, minimum, and standard deviation of those proportions will be taken. The admission proportions must be computed only from the data selected for training.

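The idea in the paragraph above can be sketched on toy data. This is only an illustration under assumptions: the frames `med` and `train_outcome` and their values are made up, and the notebook's own implementation further down uses a different (row-wise) approach. The point shown is that the per-code proportions come from training stays only, and are then aggregated per stay.

```python
import pandas as pd

# Toy medication reconciliation rows: one row per (stay, medication code).
# All values here are invented for illustration.
med = pd.DataFrame({
    "stay_id": [1, 1, 2, 3, 3, 4],
    "ndc":     ["A", "B", "A", "B", "C", "C"],
})

# Outcomes of the training stays only (1 = admitted); stay 4 is held out.
train_outcome = pd.Series({1: 1, 2: 0, 3: 1}, name="admitted")

# Admission proportion per code, computed from training stays only.
train_med = med[med["stay_id"].isin(train_outcome.index)]
prop_by_code = (
    train_med.assign(admitted=train_med["stay_id"].map(train_outcome))
    .groupby("ndc")["admitted"]
    .mean()
)

# Aggregate the per-code proportions for every stay, including held-out ones.
features = (
    med.assign(prop=med["ndc"].map(prop_by_code))
    .groupby("stay_id")["prop"]
    .agg(["max", "mean", "min", "std"])
)
print(features)
```

A stay with a single medication has an undefined sample standard deviation (NaN), so that column would need its own fallback value.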
Machine learning models to test and optimize for the task:
- Random Forest
- XGBoost
- LightGBM
- GBM
- GLM


## Previous work

---

**A Machine Learning Pipeline Using KNIME to Predict Hospital Admission in the MIMIC-IV Database**

**Abstract:**
It is well known that overcrowding in emergency department (ED) lowers the standard of care and raises the risk of medical errors. An initial predictive supplementary tool of hospital admission at an early stage of a patient's arrival to the emergency department (ED) can provide health care professionals a number of advantages, such as, more efficient patient flow management and better hospital care. In this paper, we use data from the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database to predict whether a patient will be admitted to the hospital or not. The choice of predictive attributes was driven by simplicity (a set of basic vital signs were used) so that the prediction can be made at an early stage of the patient's arrival. Several versions of Machine Learning (ML) algorithms based on Decision Trees (DT) were used for classification and prediction. An important asset of the proposed methodology is that the whole process is implemented through an ML pipeline created with an open-source, visual programming tool. The proposed methodology contains the pre-processing stage, the modelling stage includes seven classifiers, and the combined visualization of the evaluation of the predictive models. The Gradient Boosted Trees method outperforms the rest of the algorithms that were used. An accuracy of 80% can be achieved only by using early triage data.

**URL:** https://ieeexplore.ieee.org/document/10345903

---
**Machine Learning in Medical Triage: A Predictive Model for Emergency Department Disposition**

**Abstract:**
The study explores the application of automated machine learning (AutoML) using the MIMIC-IV-ED database to enhance decision-making in emergency department (ED) triage. We developed a predictive model that utilizes triage data to forecast hospital admissions, aiming to support medical staff by providing an advanced decision-support system. The model, powered by H2O.ai’s AutoML platform, was trained on approximately 280,000 preprocessed records from the Beth Israel Deaconess Medical Center collected between 2011 and 2019. The selected Gradient Boosting Machine (GBM) model demonstrated an AUC ROC of 0.8256, indicating its efficacy in predicting patient dispositions. Key variables such as acuity and waiting hours were identified as significant predictors, emphasizing the model’s capability to integrate critical triage metrics into its predictions. However, challenges related to the complexity and heterogeneity of medical data, privacy concerns, and the need for model interpretability were addressed through the incorporation of Explainable AI (XAI) techniques. These techniques ensure the transparency of the predictive processes, fostering trust and facilitating ethical AI use in clinical settings. Future work will focus on external validation and expanding the model to include a broader array of variables from diverse healthcare environments, enhancing the model’s utility and applicability in global emergency care contexts.

**URL:** https://www.mdpi.com/2076-3417/14/15/6623

---

In [1]:
# %load_ext cudf.pandas

In [2]:
# !wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password https://physionet.org/files/mimic-iv-ed/2.2/ -q

In [3]:
# !wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password https://physionet.org/files/mimiciv/3.0/hosp/patients.csv.gz -q

In [4]:
# After downloading the .gz files, decompress them to get the csv files

import os
from pathlib import Path
import gzip
import shutil

cwd = Path(os.getcwd())

for (root,dirs,files) in os.walk(cwd/'physionet.org'):
  for file in files:
    if file.endswith('.gz'):
      compressed_file_path = Path(root, file)
      decompressed_file_path = Path(root, file[:-3])

      with gzip.open(compressed_file_path, 'rb') as f1:
        with open(decompressed_file_path, 'wb') as f2:
          shutil.copyfileobj(f1, f2)

      os.remove(compressed_file_path)
      print(f"File {decompressed_file_path} decompressed.")

In [5]:
# Load every table into a DataFrame

import pandas as pd

edstays_path = Path('physionet.org/files/mimic-iv-ed/2.2/ed/edstays.csv')
edstays_df = pd.read_csv(edstays_path)

diagnosis_path = Path('physionet.org/files/mimic-iv-ed/2.2/ed/diagnosis.csv')
diagnosis_df = pd.read_csv(diagnosis_path)

medrecon_path = Path('physionet.org/files/mimic-iv-ed/2.2/ed/medrecon.csv')
medrecon_df = pd.read_csv(medrecon_path)

pyxis_path = Path('physionet.org/files/mimic-iv-ed/2.2/ed/pyxis.csv')
pyxis_df = pd.read_csv(pyxis_path)

triage_path = Path('physionet.org/files/mimic-iv-ed/2.2/ed/triage.csv')
triage_df = pd.read_csv(triage_path)

vitalsign_path = Path('physionet.org/files/mimic-iv-ed/2.2/ed/vitalsign.csv')
vitalsign_df = pd.read_csv(vitalsign_path)

patients_path = Path('physionet.org/files/mimiciv/3.0/hosp/patients.csv')
patients_df = pd.read_csv(patients_path)

In [6]:
edstays_df.head()

|   | subject_id | hadm_id | stay_id | intime | outtime | gender | race | arrival_transport | disposition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000032 | 22595853.0 | 33258284 | 2180-05-06 19:17:00 | 2180-05-06 23:30:00 | F | WHITE | AMBULANCE | ADMITTED |
| 1 | 10000032 | 22841357.0 | 38112554 | 2180-06-26 15:54:00 | 2180-06-26 21:31:00 | F | WHITE | AMBULANCE | ADMITTED |
| 2 | 10000032 | 25742920.0 | 35968195 | 2180-08-05 20:58:00 | 2180-08-06 01:44:00 | F | WHITE | AMBULANCE | ADMITTED |
| 3 | 10000032 | 29079034.0 | 32952584 | 2180-07-22 16:24:00 | 2180-07-23 05:54:00 | F | WHITE | AMBULANCE | HOME |
| 4 | 10000032 | 29079034.0 | 39399961 | 2180-07-23 05:54:00 | 2180-07-23 14:00:00 | F | WHITE | AMBULANCE | ADMITTED |


In [7]:
patients_df.head()

|   | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10000032 | F | 52 | 2180 | 2014 - 2016 | 2180-09-09 |
| 1 | 10000048 | F | 23 | 2126 | 2008 - 2010 |  |
| 2 | 10000058 | F | 33 | 2168 | 2020 - 2022 |  |
| 3 | 10000068 | F | 19 | 2160 | 2008 - 2010 |  |
| 4 | 10000084 | M | 72 | 2160 | 2017 - 2019 | 2161-02-13 |

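The notebook uses `anchor_age` directly as the age feature. Assuming, as the MIMIC-IV documentation describes, that `anchor_age` is the patient's age in `anchor_year`, the age at a given ED visit could be approximated by adding the number of (shifted) years elapsed since the anchor year. The sketch below is a possible refinement, not what the notebook does; the second visit date is hypothetical.

```python
import pandas as pd

# First row mirrors the patients_df sample above.
patients = pd.DataFrame({
    "subject_id": [10000032],
    "anchor_age": [52],
    "anchor_year": [2180],
})

# First intime comes from the edstays_df sample; the second one is hypothetical.
stays = pd.DataFrame({
    "subject_id": [10000032, 10000032],
    "intime": pd.to_datetime(["2180-05-06 19:17:00", "2183-01-15 08:00:00"]),
})

merged = stays.merge(patients, on="subject_id", how="left")

# Approximate age at the visit: age in the anchor year plus elapsed years.
merged["age_at_visit"] = merged["anchor_age"] + (
    merged["intime"].dt.year - merged["anchor_year"]
)
print(merged[["subject_id", "intime", "age_at_visit"]])
```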

In [8]:
# Build a single DataFrame with the data of interest
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer


data_df = pd.DataFrame()
data_df["id"] = edstays_df["stay_id"]
data_df["subject_id"] = edstays_df["subject_id"]

# Encode categorical data
## Select categorical variables
data_df["gender"] = edstays_df["gender"]
data_df["race"] = edstays_df["race"]
data_df["arrival_transport"] = edstays_df["arrival_transport"]
categoric_cols = ["gender", "race", "arrival_transport"]
## One-hot encode the categorical variables
data_df = pd.get_dummies(data_df, columns=categoric_cols)

# Include the hour, the day of the week, and the month of the year in which the emergency department visit took place
data_df['intime'] = pd.to_datetime(edstays_df['intime'])
data_df['in_day_of_the_week'] = data_df['intime'].dt.day_of_week
data_df['in_hour_of_the_day'] = data_df['intime'].dt.hour
data_df['in_month_of_the_year'] = data_df['intime'].dt.month

# Circular encoding: map cyclic features onto the unit circle so that values on
# either side of the wrap-around (e.g. 23h and 0h) stay close to each other
## Select the cyclic variables and their fixed periods
cyclic_periods = {'in_day_of_the_week': 7, 'in_hour_of_the_day': 24, 'in_month_of_the_year': 12}
cyclic_cols = list(cyclic_periods)
## Apply circular encoding to each variable
for col, period in cyclic_periods.items():
  data_df[f'{col}_sin'] = np.sin(2 * np.pi * data_df[col] / period)
  data_df[f'{col}_cos'] = np.cos(2 * np.pi * data_df[col] / period)
## Drop the original columns
data_df.drop(columns=cyclic_cols, inplace=True)

# Include whether the patient was admitted to the hospital (a hospital admission id exists)
data_df['admitted_to_hosp'] = edstays_df['hadm_id'].notna().astype(int)

# Include data collected at hospital triage
## Select the relevant variables (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
triage_data_df = triage_df[['stay_id', 'temperature', 'heartrate', 'resprate', 'o2sat', 'sbp', 'dbp', 'acuity', 'pain']].copy()

## Convert the pain column to an integer where possible; otherwise leave it as missing
def convert_to_int(value):
    try:
        return int(value)
    except (ValueError, TypeError, OverflowError):
        return np.nan

triage_data_df['pain'] = triage_data_df['pain'].apply(convert_to_int)

## Merge the tables
data_df = pd.merge(data_df, triage_data_df, left_on='id', right_on='stay_id', how='left')

# Include the patient's age ('anchor_age')
patients_age_df = patients_df[['subject_id', 'anchor_age']]
## Merge the tables on the patient id
data_df = pd.merge(data_df, patients_age_df, on='subject_id', how='left')

# Handle missing data
## For each variable with missing values, add a binary indicator column flagging whether the value is missing
cols_with_missing_values = []
for col in data_df.columns:
  if data_df[col].isna().any():
    cols_with_missing_values.append(col)
    data_df[f'{col}_is_missing'] = data_df[col].isna().astype(int)

# Gather the medications each patient was taking up to the emergency department visit
# Group by each available code, which covers active ingredient and medication group
# Get the list of NDC (National Drug Code) codes and attach it to each stay_id
med_by_stay_id_ndc = medrecon_df.groupby('stay_id')['ndc'].unique()
data_df = pd.merge(data_df, med_by_stay_id_ndc, left_on='id', right_on='stay_id', how='left')

# Get the list of etccode codes and attach it to each stay_id
med_by_stay_id_etccode = medrecon_df.groupby('stay_id')['etccode'].unique()
data_df = pd.merge(data_df, med_by_stay_id_etccode, left_on='id', right_on='stay_id', how='left')

# Get the list of Generic Sequence Number (GSN) codes and attach it to each stay_id
med_by_stay_id_gsn = medrecon_df.groupby('stay_id')['gsn'].unique()
data_df = pd.merge(data_df, med_by_stay_id_gsn, left_on='id', right_on='stay_id', how='left')

# Get the list of medication names and attach it to each stay_id
med_by_stay_id_name = medrecon_df.groupby('stay_id')['name'].unique()
data_df = pd.merge(data_df, med_by_stay_id_name, left_on='id', right_on='stay_id', how='left')


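To see why the time features in the cell above are encoded as sine/cosine pairs, the small sketch below compares distances between hours after encoding. The helper name `encode_cyclic` is made up for this illustration and is not part of the notebook.

```python
import numpy as np

def encode_cyclic(values, period):
    """Map values of a periodic feature onto the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Hours 23, 0 and 12 of a 24-hour day
hour_sin, hour_cos = encode_cyclic([23, 0, 12], period=24)

def gap(i, j):
    """Euclidean distance between two encoded hours."""
    return float(np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j]))

print("23h vs 0h :", round(gap(0, 1), 3))   # neighbours across midnight
print("0h vs 12h :", round(gap(1, 2), 3))   # opposite sides of the day
```

With the raw integer hour, 23 and 0 would be 23 units apart; after the encoding they are close neighbours, while midnight and noon sit at opposite ends of the circle.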
In [9]:
# Include the number of medications the patient was taking before the emergency department visit

import numpy as np

def get_len(values):
    # Stays without medication records hold NaN (a float) instead of an array of codes
    if isinstance(values, float) and np.isnan(values):
        return 0
    return len(values)

# Number of medications by unique NDC codes
data_df['med_count_by_ndc'] = data_df['ndc'].apply(get_len)
# Number of medications by unique ETC codes
data_df['med_count_by_etccode'] = data_df['etccode'].apply(get_len)
# Number of medications by unique medication names
data_df['med_count_by_name'] = data_df['name'].apply(get_len)
# Number of medications by unique GSN codes
data_df['med_count_by_gsn'] = data_df['gsn'].apply(get_len)

In [10]:
# Add a column flagging patients who were not on any regular medication at the time of the emergency department visit
data_df['use_no_medication'] = data_df['med_count_by_ndc']==0

In [11]:
# Prepare the datasets
random_state = 42
## Separate the predictors from the target variable
X = data_df.drop('admitted_to_hosp', axis=1)
y = data_df['admitted_to_hosp']
## Split into training (90%) and test (10%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state)
## Split the training data again into training (90%) and validation (10%), i.e. 81% / 9% / 10% overall
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=random_state)

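The split above is done per stay, and the `edstays_df` sample shows that one `subject_id` can have several stays, so the same patient may end up in both the training and the test set. If a patient-level split is preferred, one option is scikit-learn's `GroupShuffleSplit`. The sketch below uses made-up toy arrays and is an alternative to consider, not the method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 stays belonging to 6 patients (values are invented)
subject_ids = np.array([1, 1, 1, 2, 2, 3, 4, 4, 5, 6])
X_toy = np.arange(len(subject_ids)).reshape(-1, 1)
y_toy = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1])

# Hold out roughly 20% of the patients (not 20% of the stays)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=subject_ids))

train_subjects = set(subject_ids[train_idx])
test_subjects = set(subject_ids[test_idx])
print("patients in train:", sorted(train_subjects))
print("patients in test :", sorted(test_subjects))
```

Because whole patients are held out, the share of stays in the test set only approximates the requested fraction.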
### Target encoding of medications
Admission proportions computed from:
- NDC codes
- ETC codes
- medication names
- GSN codes

In [12]:
# Keep only the medication rows of training stays, so the target encoding uses training data only
medrecon_train_df = medrecon_df[medrecon_df["stay_id"].isin(X_train["id"])]

#### Admission proportion by medication NDC code

In [13]:
# Group the medications in the training set by NDC code, collecting the emergency department stays for each code
med_adm_prop_by_ndc = medrecon_train_df.groupby('ndc')['stay_id'].unique().reset_index()

# Compute the proportion of emergency department stays involving a given NDC code that ended in hospital admission
med_adm_prop_by_ndc['admission_proportion_by_ndc'] = med_adm_prop_by_ndc['stay_id'].apply(
    lambda stay_ids: edstays_df[edstays_df["stay_id"].isin(stay_ids)]['hadm_id'].notna().mean()
)
# med_adm_prop_by_ndc is the lookup table for the admission proportion of each medication
med_adm_prop_by_ndc.drop(columns=['stay_id'], inplace=True)

# Fall back to the overall admission proportion of the training data when no medication with an NDC code is in use
general_admission_proportion = y_train.mean()

In [14]:
# Define a function that estimates the chance of admission from the NDC codes.
def estimate_max_admission_proportion_by_ndc(ndc_list):
  if ndc_list is None or ndc_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the highest admission proportion
  return med_adm_prop_by_ndc[med_adm_prop_by_ndc['ndc'].isin(ndc_list)]['admission_proportion_by_ndc'].max()

X_train['ndc_max_adm_prop'] = X_train['ndc'].apply(estimate_max_admission_proportion_by_ndc).fillna(general_admission_proportion)
X_val['ndc_max_adm_prop'] = X_val['ndc'].apply(estimate_max_admission_proportion_by_ndc).fillna(general_admission_proportion)
X_test['ndc_max_adm_prop'] = X_test['ndc'].apply(estimate_max_admission_proportion_by_ndc).fillna(general_admission_proportion)

In [15]:
# Define a function that estimates the chance of admission from the NDC codes.
def estimate_mean_admission_proportion_by_ndc(ndc_list):
  if ndc_list is None or ndc_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the mean of their admission proportions
  return med_adm_prop_by_ndc[med_adm_prop_by_ndc['ndc'].isin(ndc_list)]['admission_proportion_by_ndc'].mean()

X_train['ndc_mean_adm_prop'] = X_train['ndc'].apply(estimate_mean_admission_proportion_by_ndc).fillna(general_admission_proportion)
X_val['ndc_mean_adm_prop'] = X_val['ndc'].apply(estimate_mean_admission_proportion_by_ndc).fillna(general_admission_proportion)
X_test['ndc_mean_adm_prop'] = X_test['ndc'].apply(estimate_mean_admission_proportion_by_ndc).fillna(general_admission_proportion)

In [16]:
# Define a function that estimates the chance of admission from the NDC codes.
def estimate_min_admission_proportion_by_ndc(ndc_list):
  if ndc_list is None or ndc_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the lowest admission proportion
  return med_adm_prop_by_ndc[med_adm_prop_by_ndc['ndc'].isin(ndc_list)]['admission_proportion_by_ndc'].min()

X_train['ndc_min_adm_prop'] = X_train['ndc'].apply(estimate_min_admission_proportion_by_ndc).fillna(general_admission_proportion)
X_val['ndc_min_adm_prop'] = X_val['ndc'].apply(estimate_min_admission_proportion_by_ndc).fillna(general_admission_proportion)
X_test['ndc_min_adm_prop'] = X_test['ndc'].apply(estimate_min_admission_proportion_by_ndc).fillna(general_admission_proportion)
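The three cells above differ only in the aggregation applied, and the same pattern is repeated for each medication identifier. A single parameterised helper could cover all of these cases. The sketch below is one possible refactoring on toy values; the function name `aggregate_admission_proportion` and the lookup Series are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

def aggregate_admission_proportion(codes, proportion_by_code, how, default):
    """Aggregate the admission proportions of the codes used in one stay.

    codes: array of codes for the stay, or NaN/None when there are none.
    proportion_by_code: Series indexed by code (training-set proportions).
    how: aggregation name understood by pandas, e.g. "max", "mean", "min".
    default: value returned when no code can be matched.
    """
    if codes is None or isinstance(codes, float):  # NaN marks "no medication"
        return default
    matched = proportion_by_code[proportion_by_code.index.isin(codes)]
    if matched.empty:  # codes never seen in the training data
        return default
    return matched.agg(how)

# Toy lookup table and fallback value (invented numbers)
lookup = pd.Series({"A": 0.5, "B": 1.0})
fallback = 0.4

print(aggregate_admission_proportion(np.array(["A", "B"]), lookup, "max", fallback))
print(aggregate_admission_proportion(np.array(["A", "B"]), lookup, "mean", fallback))
print(aggregate_admission_proportion(np.nan, lookup, "min", fallback))
print(aggregate_admission_proportion(np.array(["Z"]), lookup, "min", fallback))
```

Returning the fallback inside the helper also removes the need for the trailing `.fillna(general_admission_proportion)` on unseen codes.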

#### Admission proportion by medication ETC code

In [17]:
# Group the medications in the training set by ETC code, collecting the emergency department stays for each code
med_adm_prop_by_etc = medrecon_train_df.groupby('etccode')['stay_id'].unique().reset_index()

# Compute the proportion of emergency department stays involving a given ETC code that ended in hospital admission
med_adm_prop_by_etc['admission_proportion_by_etc'] = med_adm_prop_by_etc['stay_id'].apply(
    lambda stay_ids: edstays_df[edstays_df["stay_id"].isin(stay_ids)]['hadm_id'].notna().mean()
)
# med_adm_prop_by_etc is the lookup table for the admission proportion of each medication
med_adm_prop_by_etc.drop(columns=['stay_id'], inplace=True)

# Fall back to the overall admission proportion of the training data when no medication with an ETC code is in use
general_admission_proportion = y_train.mean()

In [18]:
# Define a function that estimates the chance of admission from the ETC codes.
def estimate_max_admission_proportion_by_etc(etc_list):
  if etc_list is None or etc_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the highest admission proportion
  return med_adm_prop_by_etc[med_adm_prop_by_etc['etccode'].isin(etc_list)]['admission_proportion_by_etc'].max()

X_train['etc_max_adm_prop'] = X_train['etccode'].apply(estimate_max_admission_proportion_by_etc).fillna(general_admission_proportion)
X_val['etc_max_adm_prop'] = X_val['etccode'].apply(estimate_max_admission_proportion_by_etc).fillna(general_admission_proportion)
X_test['etc_max_adm_prop'] = X_test['etccode'].apply(estimate_max_admission_proportion_by_etc).fillna(general_admission_proportion)

In [19]:
# Define a function that estimates the chance of admission from the ETC codes.
def estimate_mean_admission_proportion_by_etc(etc_list):
  if etc_list is None or etc_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the mean of their admission proportions
  return med_adm_prop_by_etc[med_adm_prop_by_etc['etccode'].isin(etc_list)]['admission_proportion_by_etc'].mean()

X_train['etc_mean_adm_prop'] = X_train['etccode'].apply(estimate_mean_admission_proportion_by_etc).fillna(general_admission_proportion)
X_val['etc_mean_adm_prop'] = X_val['etccode'].apply(estimate_mean_admission_proportion_by_etc).fillna(general_admission_proportion)
X_test['etc_mean_adm_prop'] = X_test['etccode'].apply(estimate_mean_admission_proportion_by_etc).fillna(general_admission_proportion)

In [20]:
# Define a function that estimates the chance of admission from the ETC codes.
def estimate_min_admission_proportion_by_etc(etc_list):
  if etc_list is None or etc_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the lowest admission proportion
  return med_adm_prop_by_etc[med_adm_prop_by_etc['etccode'].isin(etc_list)]['admission_proportion_by_etc'].min()

X_train['etc_min_adm_prop'] = X_train['etccode'].apply(estimate_min_admission_proportion_by_etc).fillna(general_admission_proportion)
X_val['etc_min_adm_prop'] = X_val['etccode'].apply(estimate_min_admission_proportion_by_etc).fillna(general_admission_proportion)
X_test['etc_min_adm_prop'] = X_test['etccode'].apply(estimate_min_admission_proportion_by_etc).fillna(general_admission_proportion)

#### Admission proportion by medication GSN code

In [21]:
# Group the medications in the training set by GSN code, collecting the emergency department stays for each code
med_adm_prop_by_gsn = medrecon_train_df.groupby('gsn')['stay_id'].unique().reset_index()

In [22]:
# Compute the proportion of emergency department stays involving a given GSN code that ended in hospital admission
med_adm_prop_by_gsn['admission_proportion_by_gsn'] = med_adm_prop_by_gsn['stay_id'].apply(
    lambda stay_ids: edstays_df[edstays_df["stay_id"].isin(stay_ids)]['hadm_id'].notna().mean()
)
# med_adm_prop_by_gsn is the lookup table for the admission proportion of each medication
med_adm_prop_by_gsn.drop(columns=['stay_id'], inplace=True)

In [23]:
# Fall back to the overall admission proportion of the training data when no medication with a GSN code is in use
general_admission_proportion = y_train.mean()

# Define a function that estimates the chance of admission from the GSN codes.
def estimate_max_admission_proportion_by_gsn(gsn_list):
  if gsn_list is None or gsn_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the highest admission proportion
  return med_adm_prop_by_gsn[med_adm_prop_by_gsn['gsn'].isin(gsn_list)]['admission_proportion_by_gsn'].max()

X_train['gsn_max_adm_prop'] = X_train['gsn'].apply(estimate_max_admission_proportion_by_gsn).fillna(general_admission_proportion)
X_val['gsn_max_adm_prop'] = X_val['gsn'].apply(estimate_max_admission_proportion_by_gsn).fillna(general_admission_proportion)
X_test['gsn_max_adm_prop'] = X_test['gsn'].apply(estimate_max_admission_proportion_by_gsn).fillna(general_admission_proportion)

In [24]:
# Define a function that estimates the chance of admission from the GSN codes.
def estimate_mean_admission_proportion_by_gsn(gsn_list):
  if gsn_list is None or gsn_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the mean of their admission proportions
  return med_adm_prop_by_gsn[med_adm_prop_by_gsn['gsn'].isin(gsn_list)]['admission_proportion_by_gsn'].mean()

X_train['gsn_mean_adm_prop'] = X_train['gsn'].apply(estimate_mean_admission_proportion_by_gsn).fillna(general_admission_proportion)
X_val['gsn_mean_adm_prop'] = X_val['gsn'].apply(estimate_mean_admission_proportion_by_gsn).fillna(general_admission_proportion)
X_test['gsn_mean_adm_prop'] = X_test['gsn'].apply(estimate_mean_admission_proportion_by_gsn).fillna(general_admission_proportion)

In [25]:
# Define a function that estimates the chance of admission from the GSN codes.
def estimate_min_admission_proportion_by_gsn(gsn_list):
  if gsn_list is None or gsn_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the lowest admission proportion
  return med_adm_prop_by_gsn[med_adm_prop_by_gsn['gsn'].isin(gsn_list)]['admission_proportion_by_gsn'].min()

X_train['gsn_min_adm_prop'] = X_train['gsn'].apply(estimate_min_admission_proportion_by_gsn).fillna(general_admission_proportion)
X_val['gsn_min_adm_prop'] = X_val['gsn'].apply(estimate_min_admission_proportion_by_gsn).fillna(general_admission_proportion)
X_test['gsn_min_adm_prop'] = X_test['gsn'].apply(estimate_min_admission_proportion_by_gsn).fillna(general_admission_proportion)

#### Admission proportion by medication name

In [26]:
# Group the medications in the training set by name, collecting the emergency department stays for each name
med_adm_prop_by_name = medrecon_train_df.groupby('name')['stay_id'].unique().reset_index()

# Compute the proportion of emergency department stays involving a given medication name that ended in hospital admission
med_adm_prop_by_name['admission_proportion_by_name'] = med_adm_prop_by_name['stay_id'].apply(
    lambda stay_ids: edstays_df[edstays_df["stay_id"].isin(stay_ids)]['hadm_id'].notna().mean()
)
# med_adm_prop_by_name is the lookup table for the admission proportion of each medication
med_adm_prop_by_name.drop(columns=['stay_id'], inplace=True)

# Fall back to the overall admission proportion of the training data when no named medication is in use
general_admission_proportion = y_train.mean()

In [27]:
# Define a function that estimates the chance of admission from the medication names.
def estimate_max_admission_proportion_by_name(name_list):
  if name_list is None or name_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the highest admission proportion across medication names
  return med_adm_prop_by_name[med_adm_prop_by_name['name'].isin(name_list)]['admission_proportion_by_name'].max()

X_train['med_name_max_adm_prop'] = X_train['name'].apply(estimate_max_admission_proportion_by_name).fillna(general_admission_proportion)
X_val['med_name_max_adm_prop'] = X_val['name'].apply(estimate_max_admission_proportion_by_name).fillna(general_admission_proportion)
X_test['med_name_max_adm_prop'] = X_test['name'].apply(estimate_max_admission_proportion_by_name).fillna(general_admission_proportion)

In [28]:
# Define a function that estimates the chance of admission from the medication names.
def estimate_mean_admission_proportion_by_name(name_list):
  if name_list is None or name_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the mean admission proportion across medication names
  return med_adm_prop_by_name[med_adm_prop_by_name['name'].isin(name_list)]['admission_proportion_by_name'].mean()

X_train['med_name_mean_adm_prop'] = X_train['name'].apply(estimate_mean_admission_proportion_by_name).fillna(general_admission_proportion)
X_val['med_name_mean_adm_prop'] = X_val['name'].apply(estimate_mean_admission_proportion_by_name).fillna(general_admission_proportion)
X_test['med_name_mean_adm_prop'] = X_test['name'].apply(estimate_mean_admission_proportion_by_name).fillna(general_admission_proportion)

In [29]:
# Define a function that estimates the chance of admission from the medication names.
def estimate_min_admission_proportion_by_name(name_list):
  if name_list is None or name_list is np.nan:
    return general_admission_proportion
  # Among the medications in use, take the lowest admission proportion across medication names
  return med_adm_prop_by_name[med_adm_prop_by_name['name'].isin(name_list)]['admission_proportion_by_name'].min()

X_train['med_name_min_adm_prop'] = X_train['name'].apply(estimate_min_admission_proportion_by_name).fillna(general_admission_proportion)
X_val['med_name_min_adm_prop'] = X_val['name'].apply(estimate_min_admission_proportion_by_name).fillna(general_admission_proportion)
X_test['med_name_min_adm_prop'] = X_test['name'].apply(estimate_min_admission_proportion_by_name).fillna(general_admission_proportion)

#### Normalize numeric data based on the training data

In [30]:
# Normalize numeric data
## Select numeric columns
numeric_cols = ['anchor_age','temperature', 'heartrate', 'resprate', 'o2sat', 'sbp', 'dbp', 'acuity', 'pain', 'etc_min_adm_prop', 'etc_mean_adm_prop', 'etc_max_adm_prop', 'ndc_min_adm_prop', 'ndc_mean_adm_prop', 'ndc_max_adm_prop', 'med_count_by_etccode', 'med_count_by_ndc', 'med_name_min_adm_prop', 'med_name_mean_adm_prop', 'med_name_max_adm_prop', 'gsn_min_adm_prop', 'gsn_mean_adm_prop', 'gsn_max_adm_prop']
## Normalization
scaler = MinMaxScaler()
## Training data: fit the scaler and transform
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
## Apply the same fitted scaler to the validation and test data
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
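Two properties of this step are worth keeping in mind: a scaler fitted on the training data can map validation or test values outside the [0, 1] range, and scikit-learn's `MinMaxScaler` ignores NaN when fitting and keeps it when transforming, which is why scaling can run before imputation here. A toy check with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training column with one missing value; range seen in training is 10..30
train_col = np.array([[10.0], [20.0], [np.nan], [30.0]])
# Toy validation column: one value above the training maximum, one missing
val_col = np.array([[40.0], [np.nan]])

toy_scaler = MinMaxScaler().fit(train_col)  # NaN is disregarded when fitting
val_scaled = toy_scaler.transform(val_col)  # NaN is passed through unchanged

print(val_scaled.ravel())
```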


#### Drop the columns excluded from the analysis

In [31]:
# Drop columns that will not be used in the analysis
cols_to_drop = ['intime','stay_id','subject_id','id','gsn','ndc','etccode','name']
# Only the feature sets carry these columns; the targets (y_*) are Series and need no change
datasets = [X_train, X_val, X_test]
for dataset in datasets:
  dataset.drop(columns=cols_to_drop, inplace=True, errors='ignore')

print("Train set size:", X_train.shape)
print("Validation set size:", X_val.shape)
print("Test set size:", X_test.shape)


Train set size: (344320, 81)
Validation set size: (38258, 81)
Test set size: (42509, 81)


#### Impute missing values with the mean of the training data

In [32]:
# Impute missing values with the mean of the training data
imputer = SimpleImputer(strategy='mean')
## Fit the imputer on the training data
imputer.fit(X_train[cols_with_missing_values])
## Apply it to every dataset (training, validation, and test)
X_train[cols_with_missing_values] = imputer.transform(X_train[cols_with_missing_values])
X_val[cols_with_missing_values] = imputer.transform(X_val[cols_with_missing_values])
X_test[cols_with_missing_values] = imputer.transform(X_test[cols_with_missing_values])
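The notebook builds its `*_is_missing` columns by hand earlier on and imputes the mean here. For reference, `SimpleImputer` can produce both in one step through its `add_indicator` option. The sketch below uses toy arrays and is an alternative to consider, not the approach taken in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data: only the first column has a missing value
train_toy = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, 30.0]])
val_toy = np.array([[np.nan, 40.0]])

# Means are learned from the training data only; add_indicator appends one
# binary column per feature that had missing values during fit
toy_imputer = SimpleImputer(strategy="mean", add_indicator=True).fit(train_toy)
val_imputed = toy_imputer.transform(val_toy)

print(val_imputed)  # columns: imputed col 0, imputed col 1, missing flag for col 0
```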

In [33]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 344320 entries, 235356 to 241375
Data columns (total 81 columns):
 #   Column                                          Non-Null Count   Dtype  
---  ------                                          --------------   -----  
 0   gender_F                                        344320 non-null  bool   
 1   gender_M                                        344320 non-null  bool   
 2   race_AMERICAN INDIAN/ALASKA NATIVE              344320 non-null  bool   
 3   race_ASIAN                                      344320 non-null  bool   
 4   race_ASIAN - ASIAN INDIAN                       344320 non-null  bool   
 5   race_ASIAN - CHINESE                            344320 non-null  bool   
 6   race_ASIAN - KOREAN                             344320 non-null  bool   
 7   race_ASIAN - SOUTH EAST ASIAN                   344320 non-null  bool   
 8   race_BLACK/AFRICAN                              344320 non-null  bool   
 9   race_BLACK/AFRICAN AMERICA

## Train the first models and evaluate them on the validation set

### Random Forest

In [34]:
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier object
model = RandomForestClassifier(
    random_state=42,  # Set random_state for reproducibility
    n_estimators=400
)

# Fit the model to the training data
model.fit(X_train, y_train)

In [35]:
from sklearn.metrics import roc_auc_score

# Predict probabilities for the validation set
y_pred_proba = model.predict_proba(X_val)[:, 1]  # Get probabilities for the positive class

# Calculate AUC
auc = roc_auc_score(y_val, y_pred_proba)

print("AUC:", auc)

AUC: 0.8222197506347865
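
For binary labels, ROC AUC can be read as the probability that a randomly chosen positive case (here, an admitted patient) receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity by brute force on four toy predictions. The function name `pairwise_auc` and the data are illustrative assumptions, and the quadratic loop is only meant to show the definition, not to replace `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(y_true, y_score)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```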


### XGBoost

In [36]:
import xgboost as xgb

# Wrap the training features and labels in a DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)

# Initial, untuned parameters
param = {
    'objective': 'binary:logistic',  # For binary classification
    'eval_metric': 'auc',
    'max_depth': 3,
    'eta': 0.2,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

# Train the model
num_round = 1000  # Number of boosting rounds
model = xgb.train(param, dtrain, num_round)

In [37]:
# Wrap the validation features and labels in a DMatrix
dval = xgb.DMatrix(X_val, label=y_val)

# Predict probabilities for the positive class
y_pred_proba = model.predict(dval)

# Calculate AUC
auc = roc_auc_score(y_val, y_pred_proba)

print("AUC:", auc)

AUC: 0.8210867664085449


In [38]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.2, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc'),
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=3,  # Number of cross-validation folds
                           verbose=2)

# Perform the grid search
grid_search.fit(X_train, y_train)

# Print the best parameters found
print("Best parameters:", grid_search.best_params_)

# Get the best model
best_model = grid_search.best_estimator_

Fitting 3 folds for each of 81 candidates, totalling 243 fits
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.7; total time=   0.5s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.7; total time=   0.5s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.7; total time=   0.5s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.8; total time=   0.4s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.8; total time=   0.4s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.8; total time=   0.5s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.9; total time=   0.4s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.9; total time=   0.5s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=3, subsample=0.9; total time=   0.4s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=5, subsample=0.7; total time=   0.6s
[CV] END colsample_bytree=0.7, eta=0.1, max_depth=5, subsample=0.7; 
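
The "81 candidates, totalling 243 fits" in the log follows directly from the grid: four hyperparameters with three values each give 3^4 combinations, and each one is fitted once per fold. A short stdlib-only sketch of that count (the enumeration order here is not necessarily the order scikit-learn uses):

```python
from itertools import product

# Same grid as in the search above
param_grid = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.2, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}
n_folds = 3

# Every combination of one value per hyperparameter
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates))  # 81
print(total_fits)       # 243
```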

In [39]:
# Predict probabilities for the validation set using the best model
y_pred_proba = best_model.predict_proba(X_val)[:, 1]

# Calculate AUC on the validation set
auc = roc_auc_score(y_val, y_pred_proba)

print("Validation AUC:", auc)

Validation AUC: 0.8203943279662188


In [40]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats

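# Define the hyperparameter distributions
# (scipy's uniform(loc, scale) spans [loc, loc + scale]; randint excludes the upper bound)
param_dist = {
    'max_depth': stats.randint(3, 10),           # integers 3..9
    'learning_rate': stats.uniform(0.01, 0.1),   # range [0.01, 0.11]
    'subsample': stats.uniform(0.5, 0.5),        # range [0.5, 1.0]
    'n_estimators': stats.randint(50, 200)       # integers 50..199
}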

# Create the XGBoost model object
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc')

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(xgb_model, param_distributions=param_dist, n_iter=100, cv=5, scoring='roc_auc')

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best set of hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)

Best set of hyperparameters:  {'learning_rate': np.float64(0.08057545948216362), 'max_depth': 9, 'n_estimators': 162, 'subsample': np.float64(0.8856221455197644)}
Best score:  0.8461416893458009


In [41]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats

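# Define the hyperparameter distributions
# (scipy's uniform(loc, scale) spans [loc, loc + scale]; randint excludes the upper bound)
param_dist = {
    'max_depth': stats.randint(6, 30),           # integers 6..29
    'learning_rate': stats.uniform(0.01, 0.1),   # range [0.01, 0.11]
    'subsample': stats.uniform(0.5, 0.5),        # range [0.5, 1.0]
    'n_estimators': stats.randint(400, 800)      # integers 400..799
}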

# Create the XGBoost model object
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc')

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(xgb_model, param_distributions=param_dist, n_iter=10, cv=5, scoring='roc_auc')

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best set of hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)

#### XGBoost hyperparameters

Best set of hyperparameters:  {'learning_rate': np.float64(0.08057545948216362), 'max_depth': 9, 'n_estimators': 162, 'subsample': np.float64(0.8856221455197644)}

Best score:  0.8461416893458009

---

Best set of hyperparameters:  {'learning_rate': np.float64(0.03682064960174636), 'max_depth': 11, 'n_estimators': 374, 'subsample': np.float64(0.9099935577581488)}

Best score:  0.8465467772144158

---

Best set of hyperparameters:  {'learning_rate': np.float64(0.028847877325705033), 'max_depth': 11, 'n_estimators': 476, 'subsample': np.float64(0.908755289935228)}

Best score:  0.8468310267107876

---

In [None]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import xgboost as xgb

# Define the objective function to minimize
def objective(params):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)

    model = xgb.train(params, dtrain, num_boost_round=1000,
                      early_stopping_rounds=10, evals=[(dval, 'eval')], verbose_eval=False)

    y_pred_proba = model.predict(dval)
    auc = roc_auc_score(y_val, y_pred_proba)

    return {'loss': -auc, 'status': STATUS_OK}

# Define the hyperparameter search space
space = {
    'max_depth': hp.choice('max_depth', range(3, 10)),
    'eta': hp.uniform('eta', 0.01, 0.3),
    'subsample': hp.uniform('subsample', 0.7, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.7, 1.0),
    'objective': 'binary:logistic',
    'eval_metric': 'auc'
}

# Initialize trials object to track results
trials = Trials()

# Run hyperparameter optimization
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50,  # Number of evaluations
            trials=trials)

print("Best hyperparameters:", best)

100%|██████████| 50/50 [01:40<00:00,  2.00s/trial, best loss: -0.821987666465536] 
Best hyperparameters: {'colsample_bytree': np.float64(0.8172142575101772), 'eta': np.float64(0.03588580696508417), 'max_depth': np.int64(5), 'subsample': np.float64(0.8320331706857366)}


In [None]:
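# Train the final model with the best hyperparameters.
# Caveat: fmin reports hp.choice parameters as an index into the option list
# (so best['max_depth'] is not the depth itself) and omits the fixed
# 'objective' and 'eval_metric' entries of the search space.
best_model = xgb.train(best, xgb.DMatrix(X_train, label=y_train), num_boost_round=1000)

# Predict scores on the validation set
dval = xgb.DMatrix(X_val)
y_pred_proba = best_model.predict(dval)

# Calculate AUC on the validation set
val_auc = roc_auc_score(y_val, y_pred_proba)

print("Validation AUC with best hyperparameters:", val_auc)

Validation AUC with best hyperparameters: 0.8231851611233246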
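
One detail of the `best` dictionary returned by `fmin` deserves care. For parameters defined with `hp.choice`, hyperopt reports the index of the selected option rather than the option itself, and constant entries of the search space (here `objective` and `eval_metric`) are not included. The sketch below decodes the result by hand in plain Python; the rounded values are copied from the run above, and the variable names are illustrative assumptions. hyperopt also provides `space_eval(space, best)` for the same purpose.

```python
# Options passed to hp.choice('max_depth', ...) in the search space
max_depth_options = list(range(3, 10))  # [3, 4, 5, 6, 7, 8, 9]

# Result reported by fmin in the run above (floats rounded for brevity)
best = {
    'colsample_bytree': 0.8172,
    'eta': 0.0359,
    'max_depth': 5,       # index into max_depth_options, not a tree depth
    'subsample': 0.8320,
}

# Rebuild the full parameter dict that would be passed to xgb.train
final_params = dict(best)
final_params['max_depth'] = max_depth_options[int(best['max_depth'])]
final_params['objective'] = 'binary:logistic'  # constant from the search space
final_params['eval_metric'] = 'auc'            # constant from the search space

print(final_params['max_depth'])  # 8
```

Under this reading, the selected depth in the run above is 8 rather than 5, and a model trained from `best` alone would use xgboost's default objective rather than `binary:logistic`.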
