# Relationship between insulin administration and CK (CPK) values
The goal of this notebook is to look for a relationship between the insulin administered to patients and their CK (CPK) values, both with and without taking obesity into account.

The following tables are used:
- `insulin_inputs_in_ck_pacients.csv`: insulin administration records for patients who had CK tests
- `ck_norm.csv`: CK values of the patients
- `bmi_norm.csv`: body mass index (BMI) values of the patients

In [21]:
import pandas as pd

path_insulin = "../data/interim/insulin_inputs_in_ck_pacients.csv"
path_cpk = "../data/processed/ck_norm.csv"
path_bmi = "../data/processed/bmi_norm.csv"

df_insulin = pd.read_csv(path_insulin)
df_cpk = pd.read_csv(path_cpk)
df_bmi = pd.read_csv(path_bmi)
df_insulin.head()

Unnamed: 0.1,Unnamed: 0,subject_id,hadm_id,stay_id,caregiver_id,starttime,endtime,storetime,itemid,amount,...,ordercomponenttypedescription,ordercategorydescription,patientweight,totalamount,totalamountuom,isopenbag,continueinnextdept,statusdescription,originalamount,originalrate
0,772,10002428,23473524,35479615,41151,2156-05-19 22:41:00,2156-05-19 22:42:00,2156-05-19 22:42:00,223262,0.0,...,Main order parameter,Drug Push,48.4,,,0,0,FinishedRunning,0.0,0.0
1,780,10002428,23473524,35479615,41151,2156-05-20 04:00:00,2156-05-20 04:01:00,2156-05-20 06:05:00,223262,0.0,...,Main order parameter,Drug Push,48.4,,,0,0,FinishedRunning,0.0,0.0
2,859,10002428,23473524,35479615,74793,2156-05-13 09:43:00,2156-05-13 09:44:00,2156-05-13 09:43:00,223258,2.0,...,Main order parameter,Drug Push,48.4,,,0,0,FinishedRunning,2.0,2.0
3,993,10002428,23473524,35479615,98628,2156-05-13 16:00:00,2156-05-13 16:01:00,2156-05-13 17:16:00,223258,0.0,...,Main order parameter,Drug Push,48.4,,,0,0,FinishedRunning,0.0,0.0
4,2842,10004235,24181354,34100191,24745,2196-02-25 00:54:00,2196-02-25 04:12:00,2196-02-25 00:54:00,223258,13.197716,...,Main order parameter,Continuous Med,127.0,100.0,ml,0,0,ChangeDose/Rate,96.25,4.0


In [16]:
df_cpk.head()

Unnamed: 0.1,Unnamed: 0,subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
0,932,10000980,26913865,39765666,,2189-06-27 12:58:00,2189-06-27 14:18:00,225634,263,263.0,IU/L,1.0
1,18918,10001884,26184834,37510196,,2131-01-14 03:08:00,2131-01-14 09:18:00,225634,786,786.0,IU/L,1.0
2,19039,10001884,26184834,37510196,,2131-01-16 04:02:00,2131-01-16 05:06:00,225634,361,361.0,IU/L,1.0
3,22053,10002155,20345487,32358465,,2131-03-10 02:04:00,2131-03-10 03:17:00,225634,99,99.0,IU/L,0.0
4,27263,10002155,23822395,33685454,,2129-08-04 17:56:00,2129-08-04 18:51:00,225634,589,589.0,IU/L,1.0


In [22]:
df_bmi.head()

Unnamed: 0.1,Unnamed: 0,subject_id,chartdate,seq_num,result_name,result_value
0,329,10000980,2185-06-17,1,BMI (kg/m2),33.6
1,331,10000980,2185-09-17,1,BMI (kg/m2),34.0
2,333,10000980,2185-12-17,1,BMI (kg/m2),34.2
3,336,10000980,2186-03-17,1,BMI (kg/m2),34.0
4,340,10000980,2186-09-15,1,BMI (kg/m2),34.2


Only records with an administered insulin amount greater than 0 are kept.

In [23]:
df_insulin = df_insulin[df_insulin["amount"] > 0]
df_insulin.head()

Unnamed: 0.1,Unnamed: 0,subject_id,hadm_id,stay_id,caregiver_id,starttime,endtime,storetime,itemid,amount,...,ordercomponenttypedescription,ordercategorydescription,patientweight,totalamount,totalamountuom,isopenbag,continueinnextdept,statusdescription,originalamount,originalrate
2,859,10002428,23473524,35479615,74793,2156-05-13 09:43:00,2156-05-13 09:44:00,2156-05-13 09:43:00,223258,2.0,...,Main order parameter,Drug Push,48.4,,,0,0,FinishedRunning,2.0,2.0
4,2842,10004235,24181354,34100191,24745,2196-02-25 00:54:00,2196-02-25 04:12:00,2196-02-25 00:54:00,223258,13.197716,...,Main order parameter,Continuous Med,127.0,100.0,ml,0,0,ChangeDose/Rate,96.25,4.0
5,2876,10004235,24181354,34100191,24834,2196-02-24 22:55:00,2196-02-24 22:56:00,2196-02-24 22:55:00,223258,10.0,...,Main order parameter,Drug Push,127.0,,,0,0,FinishedRunning,10.0,10.0
6,2877,10004235,24181354,34100191,24834,2196-02-24 22:55:00,2196-02-25 00:01:00,2196-02-24 22:55:00,223258,1.1,...,Main order parameter,Continuous Med,127.0,100.0,ml,0,0,ChangeDose/Rate,100.0,1.0
7,2886,10004235,24181354,34100191,24834,2196-02-25 00:01:00,2196-02-25 00:54:00,2196-02-25 00:01:00,223258,2.65,...,Main order parameter,Continuous Med,127.0,100.0,ml,0,0,ChangeDose/Rate,98.900002,3.0


MIMIC-IV contains more than one type of insulin.
The type of insulin administered is treated as a categorical value and mapped to an integer identifier.

In [24]:
insulin_types = df_insulin["itemid"].unique()
insulin_onehot_dict = {}
for i, i_t in enumerate(insulin_types):
    insulin_onehot_dict[i_t] = i
print(insulin_onehot_dict)

{223258: 0, 223262: 1, 223259: 2, 223260: 3, 229299: 4, 223261: 5, 223257: 6, 229619: 7}
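
The same kind of mapping can probably be obtained with `pandas.factorize`, which assigns integer codes in order of first appearance. The sketch below uses a small made-up series in place of `df_insulin["itemid"]`, and `insulin_label_by_itemid` is a hypothetical name; note that the result is a label mapping, not yet a one-hot encoding.

```python
import pandas as pd

# Toy stand-in for df_insulin["itemid"]; the ids are taken from the mapping printed above.
itemids = pd.Series([223258, 223262, 223258, 223259, 223262])

# factorize assigns integer codes in order of first appearance,
# the same order that enumerate(unique()) produces.
codes, uniques = pd.factorize(itemids)
insulin_label_by_itemid = {int(item): code for code, item in enumerate(uniques)}

print(codes.tolist())
print(insulin_label_by_itemid)
```

The `codes` array could be stored as a new column directly, which avoids a dictionary lookup per row.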


Collecting the data that will be used. For each CK measurement, the patient's first insulin record and mean BMI are taken, giving one tuple per measurement:
(insulin type, amount of insulin administered, patient BMI, patient CK value)

In [42]:
from tqdm import tqdm

data_array = []
for _, row in tqdm(df_cpk.iterrows(), total=df_cpk.shape[0]):
    insulin_pacient = df_insulin[df_insulin['subject_id'] == row['subject_id']]
    if len(insulin_pacient) == 0:
        continue
    insulin_pacient = insulin_pacient.iloc[0]
    insulin_label = insulin_onehot_dict[insulin_pacient['itemid']]
    insulin_value = insulin_pacient['amount']
    cpk_value = row['value']
    bmi_value = df_bmi[df_bmi['subject_id'] == row['subject_id']]['result_value'].mean()
    data_array.append((insulin_label, insulin_value, bmi_value, cpk_value))
print(len(data_array))

100%|████████████████████████████████████| 46536/46536 [01:28<00:00, 524.34it/s]

29188
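
The row-by-row loop above takes about a minute and a half. A merge-based version should produce the same rows much faster. The following is a minimal sketch on made-up toy frames (the `*_toy` names and values are illustrative only), assuming that "first insulin record" means the first row per `subject_id` in the current frame order, as `.iloc[0]` does in the loop.

```python
import pandas as pd

# Toy frames with only the columns the loop reads (values are made up).
df_cpk_toy = pd.DataFrame({"subject_id": [1, 1, 2, 3], "value": [100, 250, 80, 40]})
df_insulin_toy = pd.DataFrame(
    {"subject_id": [1, 1, 3], "itemid": [223258, 223262, 223259], "amount": [2.0, 5.0, 1.5]}
)
df_bmi_toy = pd.DataFrame({"subject_id": [1, 1, 2], "result_value": [30.0, 32.0, 22.0]})

# First insulin record per patient, mirroring `.iloc[0]` in the loop.
first_insulin = df_insulin_toy.drop_duplicates("subject_id", keep="first")[
    ["subject_id", "itemid", "amount"]
]
# Mean BMI per patient.
mean_bmi = (
    df_bmi_toy.groupby("subject_id")["result_value"].mean().rename("bmi").reset_index()
)

# The inner merge drops CK rows of patients with no insulin record (the loop's `continue`);
# the left merge keeps patients without BMI records and fills their BMI with NaN.
merged = (
    df_cpk_toy.merge(first_insulin, on="subject_id", how="inner")
    .merge(mean_bmi, on="subject_id", how="left")
)
print(merged[["itemid", "amount", "bmi", "value"]])
```

The `itemid` column would still need to be mapped to the integer label before building the tuples.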





In [43]:
data_array[0]  # (insulin type, amount of insulin administered, patient BMI, patient CK value)

(0, 2.0, 19.74285714285714, 29)

## Data normalization
To keep all features on the same scale, the values are min-max normalized to the interval [0, 1]. Samples without a BMI value are discarded first.
The data is also split into two sets: all patients, and obese patients only (BMI >= 30). The categorical insulin type is converted to an array in _one-hot encoding_ format.

Maximum and minimum values:
- Amount of insulin administered (ml): min=0.0333275645971298; max=216.0
- BMI: min=0.4; max=89.5
- CK: min=3; max=9996
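
As a worked example of the min-max formula `(value - min) / (max - min)`, the sketch below applies it to the first collected sample, `(0, 2.0, 19.74285714285714, 29)`, using the extremes listed above. The helper `min_max_scale` is a hypothetical name introduced only for illustration.

```python
def min_max_scale(value, lo, hi):
    """Map a value from the interval [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

# Extremes as listed above.
MIN_INSULIN, MAX_INSULIN = 0.0333275645971298, 216.0
MIN_BMI, MAX_BMI = 0.4, 89.5
MIN_CK, MAX_CK = 3, 9996

# First collected sample: (insulin type, amount, mean BMI, CK).
_, amount, bmi, ck = (0, 2.0, 19.74285714285714, 29)

norm_insulin = min_max_scale(amount, MIN_INSULIN, MAX_INSULIN)
norm_bmi = min_max_scale(bmi, MIN_BMI, MAX_BMI)
norm_ck = min_max_scale(ck, MIN_CK, MAX_CK)

print(norm_insulin, norm_bmi, norm_ck)  # roughly 0.0091, 0.2171, 0.0026
```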

In [127]:
import math
import numpy as np

non_nan_data = [x for x in data_array if not math.isnan(x[2])]
min_insulin = min([x[1] for x in non_nan_data])
max_insulin = max([x[1] for x in non_nan_data])
min_bmi = min([x[2] for x in non_nan_data])
max_bmi = max([x[2] for x in non_nan_data])
min_ck = min([x[3] for x in non_nan_data])
max_ck = max([x[3] for x in non_nan_data])

print(min_insulin, max_insulin)
print(min_bmi, max_bmi)
print(min_ck, max_ck)

norm_data = []
norm_obese_data = []
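for i_l, i_v, bmi, ck in non_nan_data:
    # Min-max scale each numeric value to [0, 1]
    norm_insulin = (i_v-min_insulin) / (max_insulin - min_insulin)
    norm_bmi = (bmi-min_bmi) / (max_bmi - min_bmi)
    norm_ck = (ck-min_ck) / (max_ck - min_ck)
    # One-hot encode the insulin type, then append the scaled amount
    one_hot_array = [0] * len(insulin_types)
    one_hot_array[i_l] = 1
    one_hot_array.append(norm_insulin)

    # Obese subset (BMI >= 30): features without the BMI column
    if bmi >= 30:
        norm_obese_data.append((one_hot_array.copy(), norm_ck))

    # General set: the scaled BMI is the last feature
    one_hot_array.append(norm_bmi)
    norm_data.append((one_hot_array, norm_ck))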

print(norm_data[0])
print(norm_obese_data[0])

0.0333275645971298 216.0
0.4 89.5
3 9996
([1, 0, 0, 0, 0, 0, 0, 0, 0.009106370039530593, 0.21709155042488376], 0.0026018212748924246)
([1, 0, 0, 0, 0, 0, 0, 0, 0.060955645921728224], 0.989492644851396)


## Model for patients in general

### Linear regression
A linear regression model is trained to predict the patient's CK value from the insulin type, the amount administered and the BMI.

The data is split into 80% for training (12359 samples) and 20% for testing (3090 samples).

In [140]:
import random
random.seed(10)
random.shuffle(norm_data)
train_size = int(len(norm_data) * 0.8)
train_data = norm_data[0 : train_size]
test_data = norm_data[train_size:]
print(len(train_data))
print(len(test_data))

12359
3090


In [141]:
x_train = [x[0] for x in train_data]
y_train = [x[1] for x in train_data]
x_test = [x[0] for x in test_data]
y_test = [x[1] for x in test_data]
print(x_train[0], "->", y_train[0])

[0, 1, 0, 0, 0, 0, 0, 0, 0.018367058170002268, 0.35099139543583996] -> 0.005403782647853498


In [142]:
from sklearn.linear_model import LinearRegression
insulin_model = LinearRegression().fit(x_train, y_train)

In [143]:
insulin_model.coef_

array([8.92736306e+10, 8.92736306e+10, 8.92736306e+10, 8.92736306e+10,
       8.92736306e+10, 8.92736306e+10, 8.92736306e+10, 1.30587561e+06,
       1.19682901e-01, 1.64008668e-01])

In [144]:
insulin_model.score(x_test, y_test)

0.016304729645855498
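
The near-identical coefficients of about 8.9e10 on the one-hot columns are what one would typically expect when those columns sum to 1 in every row and are therefore collinear with the intercept. This is an interpretation, not something the notebook verifies. The sketch below illustrates the effect on synthetic data (not MIMIC-IV) and shows one common remedy, dropping one category; all names and values are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_types = 200, 3

# Synthetic data: a one-hot "insulin type" plus one numeric feature.
labels = rng.integers(0, n_types, size=n_samples)
one_hot = np.eye(n_types)[labels]
amount = rng.random(n_samples)
y = 0.5 * amount + 0.1 * labels + rng.normal(0, 0.01, n_samples)

x_full = np.column_stack([one_hot, amount])
# Every row of the one-hot block sums to 1, duplicating the intercept column,
# so the design matrix with an intercept is rank deficient.
design_with_intercept = np.column_stack([np.ones(n_samples), x_full])
rank = np.linalg.matrix_rank(design_with_intercept)

# Dropping the first category removes the redundancy; the remaining
# coefficients are then offsets relative to the dropped type.
x_reduced = x_full[:, 1:]
reduced_model = LinearRegression().fit(x_reduced, y)

print(rank, design_with_intercept.shape[1])
print(reduced_model.coef_)
```

Fitting with `fit_intercept=False` on the full one-hot encoding is an alternative that keeps all categories.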

### Logistic regression
A logistic regression model is trained for comparison with the linear model. To do so, the problem is recast as classification: the mean CK value is computed and each sample is labeled as high CK (above the mean) or low CK (at or below the mean).

The goal of the logistic regression is then to predict, from the insulin type, the amount administered and the BMI, whether the patient has a high or low CK value (binary classification).

In [119]:
ck_values = [x[1] for x in norm_data]
mean_ck = sum(ck_values) / len(ck_values)
y_train_binary = [int(x > mean_ck) for x in y_train]
y_test_binary = [int(x > mean_ck) for x in y_test]

In [120]:
from sklearn.linear_model import LogisticRegression
insulin_class_model = LogisticRegression(random_state=0).fit(x_train, y_train_binary)

In [123]:
insulin_class_model.coef_

array([[ 0.80362295,  0.27992811,  0.28019234,  0.24901903,  0.73737364,
        -0.71708455, -1.63868049,  0.        ,  1.68608572,  2.01404578]])

In [122]:
insulin_class_model.score(x_test, y_test_binary)

0.7812297734627832
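
Because the classes are defined by the mean CK value, they may be imbalanced if the CK distribution is skewed, so the accuracy is easier to interpret next to a majority-class baseline. The sketch below computes such a baseline with scikit-learn's `DummyClassifier` on synthetic, right-skewed values; the data is made up and the resulting number says nothing about the notebook's actual class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for CK values (not MIMIC-IV data).
ck = rng.lognormal(mean=5.0, sigma=1.0, size=1000)
y_binary = (ck > ck.mean()).astype(int)
x_placeholder = np.zeros((len(y_binary), 1))  # features are ignored by the baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(x_placeholder, y_binary)
baseline_accuracy = baseline.score(x_placeholder, y_binary)
print(baseline_accuracy)
```

In the notebook, the equivalent check would fit the baseline on `y_train_binary` and score it on `y_test_binary`.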

## Model for obese patients

### Linear regression
The linear regression is applied to data from obese patients only.
Unlike the data for patients in general, the obese-patient data does not include the BMI values.

In [128]:
random.shuffle(norm_obese_data)
train_size = int(len(norm_obese_data) * 0.8)
train_data = norm_obese_data[0 : train_size]
test_data = norm_obese_data[train_size:]
print(len(train_data))
print(len(test_data))

4635
1159


In [129]:
x_train = [x[0] for x in train_data]
y_train = [x[1] for x in train_data]
x_test = [x[0] for x in test_data]
y_test = [x[1] for x in test_data]
print(x_train[0], "->", y_train[0])

[0, 1, 0, 0, 0, 0, 0, 0, 0.009106370039530593] -> 0.11568097668367858


In [133]:
from sklearn.linear_model import LinearRegression
insulin_obese_model = LinearRegression().fit(x_train, y_train)

In [134]:
insulin_obese_model.score(x_test, y_test)

0.0010681015149365258

### Logistic regression

In [136]:
y_train_binary = [int(x > mean_ck) for x in y_train]
y_test_binary = [int(x > mean_ck) for x in y_test]

In [137]:
from sklearn.linear_model import LogisticRegression
insulin_obese_class_model = LogisticRegression(random_state=0).fit(x_train, y_train_binary)

In [139]:
insulin_obese_class_model.score(x_test, y_test_binary)

0.7428817946505608