# MODIFIED: 24:12 36 hour prediction window

In this notebook, we modify the 48-hour prediction window into a 36-hour one (24:12). Using the data from each 24-hour chunk, we predict whether a patient will develop AKI in the next 12 hours. Hopefully I'll be able to set this up so that a doctor can enter whatever window settings they want and the notebook updates accordingly.
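
One way to make the window sizes adjustable is to keep them as two parameters and derive every slice from them. The sketch below is only an illustration, not code from this notebook: `window_bounds`, `obs_hours`, and `pred_hours` are hypothetical names, and it assumes one row per hour.

```python
def window_bounds(t, n_rows, obs_hours=24, pred_hours=12):
    """Return (obs_start, obs_end, pred_start, pred_end) as half-open row ranges.

    Assumes one row per hour (an assumption of this sketch).
    The observation window is the obs_hours rows ending at row t (inclusive);
    the prediction window is the pred_hours rows that follow it.
    Both are clipped to the rows that actually exist.
    """
    obs_start = max(0, t - obs_hours + 1)
    obs_end = t + 1
    pred_start = t + 1
    pred_end = min(n_rows, t + 1 + pred_hours)
    return obs_start, obs_end, pred_start, pred_end


# 24 hours of observation, 12 hours of prediction, evaluated at row 30
bounds = window_bounds(t=30, n_rows=100)
print(bounds)
```

Changing `obs_hours` or `pred_hours` in one place would then move every slice consistently, which is the kind of single setting a doctor could adjust.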

Training and testing sets are separated by patient ID. We use a .npy file in the Data folder that contains just the patient IDs, so the split is already fixed before we train on the actual dataset. In this case, Zijian has already split the data and saved it into train.pkl and test.pkl.

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [2]:
hadm_id = np.load('C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/all_ids.npy')
# randomly split 70-30
np.random.seed(123) # set seed for reproducibility 
train_ids = np.random.choice( hadm_id, round(len(hadm_id)*0.7), replace = False )
test_ids = np.array(list(set(hadm_id) - set(train_ids)))

print("all patient ids stored in 'HADM_ID'", hadm_id, 
      "; the total number of patients being 4120, \n which is 147 short from the 4267 supposed patients stated in the article :(")

all patient ids stored in 'HADM_ID' [20001687 20005241 20006999 ... 29994991 29997844 29998115] ; the total number of patients being 4120, 
 which is 147 short from the 4267 supposed patients stated in the article :(


In [3]:
print("length of training data:", len(train_ids), '\n', "length of testing data:", len(test_ids))

length of training data: 2884 
 length of testing data: 1236
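
The split above takes the test IDs as a Python set difference, whose order is not guaranteed. A minimal sketch of an alternative, assuming made-up IDs and using `np.setdiff1d` (which returns sorted unique values) so the test IDs come out in a stable order. Note that `np.random.default_rng` draws differently from `np.random.seed`, so this would not reproduce the exact split used here.

```python
import numpy as np

ids = np.arange(1000, 1010)  # made-up ids, for illustration only

rng = np.random.default_rng(123)  # seeded generator for reproducibility
train = rng.choice(ids, size=round(len(ids) * 0.7), replace=False)
test = np.setdiff1d(ids, train)  # sorted ids that are not in train

# sanity check: no patient appears on both sides of the split
assert set(train).isdisjoint(test)
print(len(train), len(test))
```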


For some reason there are 491 hadm_ids?

In [44]:
# import os

# def count_files_in_folder(folder_path):
#     folder_path = os.path.join(folder_path, 'C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/byID_daily')
#     num_files = len([name for name in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, name))])
#     return num_files

# folder_path = 'C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/byID_daily'
# num_files = count_files_in_folder(folder_path)
# print(f'Number of files in folder "{folder_path}" is: {num_files}')


Number of files in folder "C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/byID_daily" is: 491
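
To see which IDs actually have a file behind them, the ID list can be compared against the file names in the folder. This is a sketch under assumptions: `split_by_file_presence` is a hypothetical helper, the files are assumed to be named `<hadm_id>.pkl`, and the demonstration uses a temporary folder with made-up IDs rather than the real data.

```python
from pathlib import Path
import tempfile


def split_by_file_presence(ids, folder):
    """Split ids into (present, missing) by whether '<id>.pkl' exists in folder."""
    available = {p.stem for p in Path(folder).glob('*.pkl')}
    present = [i for i in ids if str(i) in available]
    missing = [i for i in ids if str(i) not in available]
    return present, missing


# toy demonstration: two files exist, three ids are requested
with tempfile.TemporaryDirectory() as tmp:
    for fake_id in (101, 102):
        (Path(tmp) / f'{fake_id}.pkl').touch()
    present, missing = split_by_file_presence([101, 102, 103], tmp)

print(len(present), 'ids have a file;', len(missing), 'do not')
```

Filtering `train_ids` and `test_ids` this way before the loading loops would avoid a `FileNotFoundError` for IDs that have no file.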


### Revelations

Data -> byID: one file per hadm_id (2000+ in total), each containing that patient's hourly data.

Data -> byID_daily: DAILY summaries of the data for all 2000+ hadm_ids (basically all of the daily minimums)

In [6]:
# prep test code for actual code below
temp = pd.read_pickle('C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/byID/20955149.pkl')
print(temp)

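# number of hourly rows recorded for this patient
row_count = len(temp)

temp_df = pd.DataFrame()

# column-wise minimum over the first 12 hourly rows (numeric columns only)
rows_to_process = temp.iloc[0:12]
min_values = rows_to_process.min(axis=0, numeric_only=True)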



       hadm_id        timedelta  arterial_bp_diastolic  arterial_bp_mean  \
0   20955149.0  0 days 01:00:00                    NaN               NaN   
1   20955149.0  0 days 02:00:00                    NaN               NaN   
2   20955149.0  0 days 03:00:00                    NaN               NaN   
3   20955149.0  0 days 04:00:00              62.470588         77.823529   
4   20955149.0  0 days 05:00:00              68.219563         86.328310   
5   20955149.0  0 days 06:00:00              63.236763         81.770156   
6   20955149.0  0 days 07:00:00              66.000000         85.000000   
7   20955149.0  0 days 08:00:00              57.440000         72.000000   
8   20955149.0  0 days 09:00:00              69.000000         95.000000   
9   20955149.0  0 days 10:00:00              63.000000         84.000000   
10  20955149.0  0 days 11:00:00              56.000000         76.000000   
11  20955149.0  0 days 12:00:00              53.000000         69.000000   
12  20955149

In [5]:
# prep test code for actual code below
temp = pd.read_pickle('C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/byID_daily/20955149.pkl')
temp

#this is the daily patient aki result i guess??

Unnamed: 0,hadm_id,arterial_bp_diastolic_min,arterial_bp_mean_min,arterial_bp_systolic_min,cvp_min,heart_rate_min,spo2_min,pap_diastolic_min,pap_mean_min,pap_systolic_min,...,mild_liver_disease,diabetes_without_cc,diabetes_with_cc,paraplegia,renal_disease,malignant_cancer,severe_liver_disease,metastatic_solid_tumor,aids,AKI_any
0,20955149.0,51.0,68.0,93.0,,68.709988,94.0,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,20955149.0,62.0,82.0,113.0,,78.0,95.0,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Because 'daily_data' is relatively robust (less affected by inaccurate time indicators),  
we rearrange the data so that today's parameters are used to predict the AKI indicator for tomorrow (24 hrs ahead)
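
The rearrangement can be sketched on a toy frame. This is an illustration with made-up values, not the notebook's own code: `add_future_label`, `days_ahead`, and the `AKI_future` column name are hypothetical, and it assumes one row per day in chronological order.

```python
import pandas as pd

# toy daily summary for one patient (made-up values)
daily = pd.DataFrame({
    'heart_rate_min': [68.0, 78.0, 80.0, 75.0],
    'AKI_any': [0.0, 0.0, 1.0, 1.0],
})


def add_future_label(daily, days_ahead=1, label='AKI_future'):
    """Pair each day's features with AKI_any from days_ahead days later."""
    out = daily.copy()
    # shift(-k) moves the value from row i + k up to row i
    out[label] = out['AKI_any'].shift(-days_ahead)
    # the last days_ahead rows have no future label, so they are dropped
    # (rows whose future AKI_any is itself missing would be dropped too)
    out = out.drop(columns='AKI_any').dropna(subset=[label])
    return out.reset_index(drop=True)


shifted = add_future_label(daily, days_ahead=1)
print(shifted)
```

With `days_ahead=2` the same helper pairs each day with the indicator two days later.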

In [24]:
# prepare training set
training_appended = []
training = pd.DataFrame()

for idx in tqdm(range(len(train_ids))):
    tem_id = train_ids[idx]
    # path fixed: 'C:/' needs the slash after the drive letter, and the folder needs a trailing '/'
    tem_daily = pd.read_pickle( 'C:/Users/faith/OneDrive/Documents/FaithZhang/SURF/RealTimeDetection/Data/byID_daily/' + str(tem_id) + '.pkl')
    
    if len(tem_daily) > 2: # only patients with more than two daily rows
        # features: every row (day) except the last two
        # label: AKI_any from the third row onward, so each row is paired with AKI_any from two rows later
        tem_train = pd.concat(  [tem_daily.iloc[ 0:(len(tem_daily)-2)],
                     pd.DataFrame(tem_daily[ 'AKI_any' ].iloc[ 2:len(tem_daily) ].values) ]
                    , axis = 1,
                    ).drop( 'AKI_any', axis = 1) # drop the same-day AKI_any; the shifted label is column 0 until it is renamed below
        
        training_appended.append(tem_train) # collect each patient's frame in a list
        
training = pd.concat(training_appended) # concatenate once at the end
training.rename( columns = {0:'AKI_in_48'}, inplace = True)
training.to_pickle('train_set_48_faith.pkl')

In [5]:
# superseded: collecting the frames in a list and concatenating once is faster

# # prepare training set
# training = pd.DataFrame()
# for idx in tqdm(range(len(train_ids))):
#     tem_id = train_ids[idx]
#     tem_daily = pd.read_pickle( '../Data/byID_daily/' + str(tem_id) + '.pkl')
    
#     if len(tem_daily) > 1:
#         tem_train = pd.concat(  [tem_daily.iloc[ 0:(len(tem_daily)-1)],  
#                      pd.DataFrame(tem_daily[ 'AKI_any' ].iloc[ 1:len(tem_daily) ].values) ]
#                     , axis = 1,
#                     ).drop( 'AKI_any', axis = 1)
        
#         training = pd.concat( [training, tem_train], ignore_index = True )
        
# training.rename( columns = {0:'AKI_in_24'}, inplace = True)
# training.to_pickle('./train_set.pkl')

In [6]:
# reload the training set if the connection was lost
training = pd.read_pickle('./train_set.pkl')

testing = pd.DataFrame( columns = np.concatenate( [ ['time'], training.columns.values]))
inform = pd.DataFrame( columns = ['hadm_id', 'if_AKI_inICU','first_AKI_detected'])

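The testing set is built with a rolling window: at each hour t, the features come from hours max(0, t-23) through t (at most 24 hours), and the label is the maximum of AKI_any over hours t+1 through t+24.  
We also record additional information for the future report:  
'whether the patient develops AKI during the ICU stay'  
'the first time the patient is found to have AKI'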
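
The rolling window can be sketched with pandas `rolling` on a toy hourly frame. This is an illustration under assumptions, not the cell below: the values are made up, the look-back and look-ahead are shortened to 3 and 2 hours so the result can be checked by hand, and the names `look_back`, `look_ahead`, and `AKI_next` are hypothetical.

```python
import numpy as np
import pandas as pd

# toy hourly data for one patient (made-up values, one row per hour)
hourly = pd.DataFrame({
    'heart_rate': [70.0, 72.0, np.nan, 90.0, 85.0, 60.0],
    'AKI_any': [0, 0, 0, 1, 0, 0],
})

look_back, look_ahead = 3, 2  # hours; shortened for readability

# backward-looking features: min/max over the last look_back rows, NaN skipped
feat_min = hourly['heart_rate'].rolling(look_back, min_periods=1).min()
feat_max = hourly['heart_rate'].rolling(look_back, min_periods=1).max()

# forward-looking label: reverse, roll, reverse back, then shift by one row
# so that row t sees rows t+1 .. t+look_ahead
future_max = (hourly['AKI_any'][::-1]
              .rolling(look_ahead, min_periods=1).max()[::-1]
              .shift(-1))

features = pd.DataFrame({
    'time': hourly.index,
    'heart_rate_min': feat_min,
    'heart_rate_max': feat_max,
    'AKI_next': future_max,
}).iloc[:-1]  # the last hour has no future rows, so it is dropped

print(features)
```

The vectorised form computes each window once per column, whereas filtering the full frame with `isin` at every hour repeats that work for each row.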

CREATING INFORM and TESTING datasets:

In [7]:
testing_appended = []
inform_appended = []

# prepare testing
for idx in tqdm(range(len(test_ids))):
    tem_id = test_ids[idx]
    tem_patient = pd.read_pickle( '../Data/byID/' + str(tem_id) + '.pkl')
    time_stamps = tem_patient['timedelta']
    
    tem_testing = pd.DataFrame( columns = np.concatenate( [ ['time'], training.columns.values]))
    for time_id in range(len(time_stamps)):
        if time_id != (len(time_stamps) - 1):
            # hourly rows in the look-back window (up to 24 hours ending at time_id), selected once
            window = tem_patient.loc[ tem_patient['timedelta'].isin( time_stamps[ max(0,(time_id-23)) : (time_id+1)].values ) ].drop( ['hadm_id','timedelta','Heart Rhythm','admission_category'], axis = 1 )
            # label: maximum of AKI_any over the following 24 hours
            label = tem_patient.loc[ tem_patient['timedelta'].isin( time_stamps[ (time_id+1) : min(len(time_stamps), time_id+25 )].values ), 'AKI_any' ].max()
            tem = pd.DataFrame( data = np.concatenate( (np.array([time_id, tem_id]),
                    window.min(axis = 0).values[0:547],
                    window.max(axis = 0).values[0:572],
                    np.array( [label])
                    )).reshape( [1,-1] ),
                    columns = testing.columns.values)
            
            testing_appended.append(tem)

    if tem_patient['AKI_any'].max() == 0:
        new_inform = pd.DataFrame( data = np.array([tem_id, tem_patient['AKI_any'].max(), float('NaN')]).reshape([1,-1]),
                    columns = ['hadm_id', 'if_AKI_inICU','first_AKI_detected'])
    else:
        new_inform = pd.DataFrame( data = np.array([tem_id, tem_patient['AKI_any'].max(), tem_patient.loc[ tem_patient['AKI_any'] == 1, 'timedelta'].index[0]]).reshape([1,-1]),
                    columns = ['hadm_id', 'if_AKI_inICU','first_AKI_detected'])
        
    inform_appended.append(new_inform)
    

#testing.to_pickle('./testing/test_set.pkl')
#inform.to_pickle('./testing/inform.pkl')

100%|██████████| 1236/1236 [27:03<00:00,  1.31s/it] 


In [8]:
testing = pd.concat(testing_appended)
inform = pd.concat(inform_appended)

In [9]:
testing.loc[testing['gender'] == 'M', 'gender'] = 0
testing.loc[testing['gender'] == 'F', 'gender'] = 1 

In [10]:
testing = testing.astype( 'float' )
testing.to_pickle('./test_set.pkl')

In [11]:
inform.to_pickle('./inform.pkl')