**Describe our project**

# About the MIMIC dataset

**Overview of the MIMIC dataset: background, goals of MIMIC, and content of the database**

MIMIC is a very large dataset, so we selected only part of it. We included tables from four modules: core, ed, hosp, and icu:
* core: **The core module stores patient tracking information necessary for any data analysis using MIMIC-IV. The core module contains three tables: patients, admissions, and transfers. These tables provide demographics for the patient, a record for each hospitalization, and a record for each ward stay within a hospitalization.**
* ed: **Patient stays are tracked in the edstays table. Each row of the edstays table has a unique stay_id, which represents a unique patient stay in the ED. The edstays table contains the following columns: subject_id, hadm_id, stay_id, intime, and outtime. The intime indicates the time at which the patient was admitted to the ED, and the outtime indicates the time at which the patient was discharged from the ED. If the patient was admitted to the hospital following their ED stay, the hadm_id column will be populated with an identifier representing their hospital stay. hadm_id can be linked with the hadm_id in MIMIC-IV to obtain further detail about the patient’s hospital stay. Finally, each individual is assigned a unique subject_id, and patients with multiple ED stays will have the same subject_id across stays in the edstays table. Note that subject_id can be linked with MIMIC-IV to obtain patient demographics. subject_id can also be linked with the PatientID DICOM attribute in MIMIC-CXR to obtain chest x-rays for patients if they were taken.**
* hosp: **The hosp module contains data derived from the hospital-wide EHR. These measurements are predominantly recorded during the hospital stay, though some tables include data from outside the hospital as well (e.g. outpatient laboratory tests in labevents). Information includes laboratory measurements (labevents, d_labitems), microbiology cultures (microbiologyevents, d_micro), provider orders (poe, poe_detail), medication administration (emar, emar_detail), medication prescription (prescriptions, pharmacy), hospital billing information**
* icu: **The icu module contains data sourced from the clinical information system at the BIDMC: MetaVision (iMDSoft). MetaVision tables were denormalized to create a star schema where the icustays and d_items tables link to a set of data tables all suffixed with events. Data documented in the icu module includes intravenous and fluid inputs (inputevents), patient outputs (outputevents), procedures (procedureevents), information documented as a date or time (datetimeevents), and other charted information (chartevents). All events tables contain a stay_id column allowing identification of the associated ICU patient in icustays, and an itemid column allowing identification of the concept documented in d_items.**

We first take a quick look at each of these modules.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
# Load the data 
admissions = pd.read_csv('../data/core/admissions.csv')
patients = pd.read_csv('../data/core/patients.csv')
diagnose = pd.read_csv('../data/hosp/d_icd_diagnoses.csv')
subj_diagnose = pd.read_csv('../data/hosp/diagnoses_icd.csv')
vital_raw = pd.read_csv('../data/ed/vitalsign.csv')
icu_stays = pd.read_csv('../data/icu/icustays.csv')
item_names = pd.read_csv('../data/icu/d_items.csv').set_index('itemid')
chart_event = pd.read_csv('../data/icu/chart_event_filtered.csv')

## `core` module

The most important table is `patients`, which stores the information that stays constant over a patient's lifetime. We use it to identify each patient. The `dod` column holds information about in-hospital mortality, which we use as the target in this study.



In [3]:
patients.head(3)

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
0,10000048,F,23,2126,2008 - 2010,
1,10002723,F,0,2128,2017 - 2019,
2,10003939,M,0,2184,2008 - 2010,


**describe**

In [4]:
admissions.head(3)

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag
0,14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0
1,15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0
2,11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0


## `hosp` module

Since this analysis focuses on ICU patients, most of the tables in this module are not of interest to us.

The only information we need is each patient's billed diagnoses, which we use to select our cohort.

`diagnoses_icd.csv` contains the ICD code for each subject and the corresponding ICD version. With the code and version, we can look up the name of the diagnosis in the `d_icd_diagnoses.csv` file.
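
As a minimal sketch of that lookup, the two tables can be merged on both `icd_code` and `icd_version`, since a code string is only meaningful together with its version. The frames below are small toy stand-ins for the two CSV files, and the names `subj_diag_demo`, `icd_dict_demo`, and `labelled` are hypothetical.

```python
import pandas as pd

# Toy stand-in for diagnoses_icd.csv (one row per billed diagnosis)
subj_diag_demo = pd.DataFrame({
    'subject_id': [12116269, 14050724, 15734973],
    'icd_code': ['431', 'I610', '2825'],
    'icd_version': [9, 10, 9],
})

# Toy stand-in for d_icd_diagnoses.csv (the code dictionary)
icd_dict_demo = pd.DataFrame({
    'icd_code': ['431', 'I610'],
    'icd_version': [9, 10],
    'long_title': [
        'Intracerebral hemorrhage',
        'Nontraumatic intracerebral hemorrhage in hemisphere, subcortical',
    ],
})

# A left merge keeps every diagnosis row; codes missing from the
# dictionary simply get NaN as their title
labelled = subj_diag_demo.merge(
    icd_dict_demo, on=['icd_code', 'icd_version'], how='left'
)
print(labelled)
```

Merging on the code alone would risk attaching a title from the wrong ICD version, which is why both keys are used here.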

In [5]:
subj_diagnose.head(3)

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version
0,15734973,20475282,3,2825,9
1,15734973,20475282,2,V0251,9
2,15734973,20475282,5,V270,9


In [6]:
diagnose.head(3)

Unnamed: 0,icd_code,icd_version,long_title
0,10,9,Cholera due to vibrio cholerae
1,11,9,Cholera due to vibrio cholerae el tor
2,19,9,"Cholera, unspecified"


## `ed` module

We use only the `vitalsign` table from this module, since vital signs such as heart rate and respiratory rate are directly related to mortality. The first set of vital signs recorded after a patient arrives at the emergency department is the most important, since the patient has not yet received any medication.
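
One way to pick the first set of vital signs per patient is to sort by `charttime` before dropping duplicates. The following is only a sketch on toy data with illustrative values; `vitals_demo` and `first_vitals` are hypothetical names, and it assumes `charttime` can be parsed as a datetime.

```python
import pandas as pd

# Toy stand-in for vitalsign.csv; rows are deliberately not in time order
vitals_demo = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'charttime': ['2167-08-29 04:51:00', '2167-08-29 02:25:00',
                  '2148-12-19 12:34:00', '2148-12-19 13:00:00'],
    'heartrate': [79.0, 82.0, 100.0, 95.0],
})
vitals_demo['charttime'] = pd.to_datetime(vitals_demo['charttime'])

# Sort chronologically, then keep the earliest row for each patient
first_vitals = (
    vitals_demo.sort_values('charttime')
    .drop_duplicates('subject_id', keep='first')
    .set_index('subject_id')
)
print(first_vitals)
```

Without the explicit sort, `drop_duplicates` keeps rows according to their order in the file, which need not be chronological.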

In [7]:
vital_raw.head(3)

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp,rhythm,pain
0,16113983,37539106,2116-06-10 00:32:00,98.2,82.0,15.0,,106.0,72.0,,
1,15128994,30058281,2167-08-29 02:25:00,98.3,79.0,20.0,97.0,126.0,73.0,,0.0
2,15128994,30058281,2167-08-29 04:51:00,97.6,79.0,20.0,98.0,126.0,73.0,,0.0


## `icu` module

The `icu` module contains the main data we are going to analyze.

`icustays.csv` tracks information about ICU stays, including admission and discharge times. We use this table to identify the patients who were admitted to the ICU, since not all hospital patients are.

In [8]:
icu_stays.head(3)

Unnamed: 0,subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
0,17867402,24528534,31793211,Trauma SICU (TSICU),Trauma SICU (TSICU),2154-03-03 04:11:00,2154-03-04 18:16:56,1.587454
1,14435996,28960964,31983544,Trauma SICU (TSICU),Trauma SICU (TSICU),2150-06-19 17:57:00,2150-06-22 18:33:54,3.025625
2,17609946,27385897,33183475,Trauma SICU (TSICU),Trauma SICU (TSICU),2138-02-05 18:54:00,2138-02-15 12:42:05,9.741725


`chart_event_filtered.csv` contains the majority of the data. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The electronic chart displays patients’ routine vital signs and any additional information relevant to their care: ventilator settings, laboratory values, code status, mental status, and so on. Each event is identified by an `itemid` in this table, and the name of the event can be found in the `d_items.csv` file.
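
The `itemid` lookup can be sketched as a join against the item dictionary. The toy frames below reuse a few values shown elsewhere in this notebook; `chart_demo`, `items_demo`, and `chart_named` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the charted events table
chart_demo = pd.DataFrame({
    'subject_id': [10006277, 10006277],
    'itemid': [220045, 220046],
    'value': ['95', '120'],
})

# Toy stand-in for d_items.csv, indexed by itemid
items_demo = pd.DataFrame({
    'itemid': [220045, 220046],
    'label': ['Heart Rate', 'Heart rate Alarm - High'],
}).set_index('itemid')

# Attach the human-readable label to every event row
chart_named = chart_demo.join(items_demo['label'], on='itemid')
print(chart_named)
```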

The `chart_event` table is fairly large, so we uploaded a filtered version. The code used for filtering is shown in the appendix.

In [9]:
chart_event.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
0,0,0,10006277,25610553,30888848,2176-06-08 01:18:00,2176-06-08 01:20:00,224876,Cloudy,,,0
1,2,2,10006277,25610553,30888848,2176-06-08 00:57:00,2176-06-08 01:00:00,223770,90,90.0,%,0
2,3,3,10006277,25610553,30888848,2176-06-08 04:18:00,2176-06-08 04:18:00,223988,Clear,,,0


In [10]:
item_names.head(3)

Unnamed: 0_level_0,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
itemid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,


# Select the Cohort 
Here we select the patients diagnosed with intracerebral hemorrhage (ICH) from all hospital admissions. The corresponding ICD codes are 'I61*' (ICD-10) and '431*' (ICD-9).

Then we keep only the patients who were admitted to the intensive care unit.
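
Because the selection relies on code prefixes, it can help to match from the start of the code and to check the ICD version as well. The sketch below illustrates this on made-up code strings; `match_icd` is a hypothetical helper and is not the function used in this notebook.

```python
import pandas as pd

# Toy diagnoses; 'S4310' and '4310' are made-up strings for illustration
diag_demo = pd.DataFrame({
    'subject_id': [1, 2, 3, 4],
    'icd_code': ['431', 'I615', 'S4310', '4310'],
    'icd_version': [9, 10, 10, 9],
})

def match_icd(df, pattern, version):
    """Rows whose code matches `pattern` at its start and whose version equals `version`."""
    # Series.str.match only matches at the beginning of each string
    return df[(df['icd_version'] == version) & df['icd_code'].str.match(pattern)]

ich = pd.concat([match_icd(diag_demo, r'431', 9),
                 match_icd(diag_demo, r'I61', 10)])
print(ich)

# For comparison: an unanchored search for '431' also hits 'S4310'
print(diag_demo['icd_code'].str.contains('431').sum())
```

In this toy example the version-aware match returns subjects 1, 2 and 4, while the unanchored search would also pick up the unrelated ICD-10 string.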

In [11]:
icu_ids = icu_stays['subject_id'].unique()

In [12]:
def get_icu_cohort(icd10, icd9):
    """
    Select the ICU patients diagnosed with a given disease.

    Input parameters are the ICD-10 and ICD-9 codes for the disease.
    For higher reliability, they should be regular expressions.
    """
    # filter all diagnosed subjects
    df1 = patients.set_index('subject_id')
    df_icd10 = subj_diagnose[subj_diagnose['icd_code'].str.contains(str(icd10))]
    df_icd9 = subj_diagnose[subj_diagnose['icd_code'].str.contains(str(icd9))]
    df_icdall = pd.concat([df_icd9, df_icd10]).drop_duplicates('subject_id', keep='first')
    # filter the patients admitted to icu
    df_icu = df_icdall[df_icdall['subject_id'].isin(icu_ids)].set_index('subject_id')
    data = df_icu.join(df1, how='left')
    # add the target, i.e., mortality flag to the table:
    # 1 if a date of death is recorded, 0 otherwise
    data['dod'] = data['dod'].notna().astype(int)

    return data

def mortality_rate(df):
    """Calculate the mortality rate for the selected cohort."""
    return np.count_nonzero(df['dod'] == 1) / len(df['dod'])

In [13]:
cohort = get_icu_cohort('I61',r'^431')
print('cohort size', cohort.shape[0])
print('mortality rate', mortality_rate(cohort))
cohort

cohort size 2485
mortality rate 0.2784708249496982


Unnamed: 0_level_0,hadm_id,seq_num,icd_code,icd_version,gender,anchor_age,anchor_year,anchor_year_group,dod
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
12116269,24214849,2,431,9,M,66,2166,2008 - 2010,0
12155595,26672436,2,431,9,F,52,2168,2008 - 2010,0
19620109,26497452,2,431,9,F,66,2128,2008 - 2010,0
19330004,20802265,1,431,9,M,52,2187,2008 - 2010,0
18414729,22609366,2,431,9,F,60,2152,2008 - 2010,0
...,...,...,...,...,...,...,...,...,...
14050724,20952526,1,I615,10,F,57,2112,2011 - 2013,0
12557389,21294125,1,I613,10,F,79,2151,2017 - 2019,0
18065731,25556934,1,I618,10,M,46,2137,2011 - 2013,0
12853711,25849371,2,I611,10,F,87,2117,2017 - 2019,0


# Select Features

In [14]:
def get_data(cohort_index, data_table, index_colname):
    return data_table[data_table[index_colname].isin(cohort_index)]

In [15]:
def fill_na_mean(df, rd=2, inplace=True):
    """Fill missing values in numeric columns with the column mean, rounded to `rd` decimals."""
    target = df if inplace else df.copy()
    # Only numeric columns have a meaningful mean
    for col in target.select_dtypes(include='number').columns:
        if target[col].isnull().any():
            target[col] = target[col].fillna(round(target[col].mean(), rd))
    if not inplace:
        return target

In [16]:
cohort_ind = cohort.index

## Load Vital Signs




In [17]:
vital_raw.head()

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp,rhythm,pain
0,16113983,37539106,2116-06-10 00:32:00,98.2,82.0,15.0,,106.0,72.0,,
1,15128994,30058281,2167-08-29 02:25:00,98.3,79.0,20.0,97.0,126.0,73.0,,0
2,15128994,30058281,2167-08-29 04:51:00,97.6,79.0,20.0,98.0,126.0,73.0,,0
3,15128994,30058281,2167-08-29 05:35:00,98.3,76.0,18.0,,123.0,68.0,,0/10
4,18019452,37300626,2148-12-19 12:34:00,98.1,100.0,16.0,98.0,129.0,86.0,,0


In [18]:
# Drop rhythm, pain which have a lot of missing data
vital_id = get_data(cohort_ind, vital_raw, 'subject_id' ).iloc[:,:-2]
vital_id.head()

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp
105,10046166,38848658,2132-12-06 11:53:00,97.0,75.0,12.0,,154.0,73.0
261,10900387,30300312,2146-09-08 01:25:00,98.8,79.0,16.0,93.0,162.0,87.0
525,16652205,35474533,2169-04-09 18:55:00,,73.0,27.0,98.0,169.0,79.0
526,16652205,35474533,2169-04-09 19:50:00,,70.0,45.0,100.0,187.0,69.0
527,16652205,35474533,2169-04-09 21:00:00,98.4,70.0,30.0,98.0,138.0,58.0


In [19]:
fill_na_mean(vital_id)
vital_id

Unnamed: 0,subject_id,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp
105,10046166,38848658,2132-12-06 11:53:00,97.00,75.0,12.0,97.55,154.0,73.0
261,10900387,30300312,2146-09-08 01:25:00,98.80,79.0,16.0,93.00,162.0,87.0
525,16652205,35474533,2169-04-09 18:55:00,97.47,73.0,27.0,98.00,169.0,79.0
526,16652205,35474533,2169-04-09 19:50:00,97.47,70.0,45.0,100.00,187.0,69.0
527,16652205,35474533,2169-04-09 21:00:00,98.40,70.0,30.0,98.00,138.0,58.0
...,...,...,...,...,...,...,...,...,...
1650418,12557389,35529368,2153-01-24 23:15:00,97.47,75.0,23.0,98.00,119.0,62.0
1650902,14745196,36076575,2204-06-17 12:14:00,97.20,89.0,16.0,98.00,129.0,86.0
1650903,14745196,36076575,2204-06-17 14:59:00,98.90,88.0,18.0,97.00,166.0,91.0
1650904,14745196,36076575,2204-06-17 18:33:00,99.50,95.0,20.0,94.00,153.0,67.0


In [20]:
# Keep one row per patient: the last row in file order for each subject_id
vital_id = vital_id.drop_duplicates('subject_id', keep='last').set_index('subject_id')

In [21]:
vital_id.head()

Unnamed: 0_level_0,stay_id,charttime,temperature,heartrate,resprate,o2sat,sbp,dbp
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
16652205,31288294,2169-04-22 23:50:00,101.4,100.0,44.0,97.55,97.0,61.0
11213607,32187974,2194-06-11 01:48:00,98.4,98.0,20.0,98.0,152.0,80.0
19620109,36855170,2132-10-16 23:43:00,98.5,67.0,15.0,100.0,101.0,67.0
16379037,37641494,2177-04-16 23:54:00,98.0,100.0,19.0,98.0,145.0,80.0
15936063,38505545,2161-05-22 16:30:00,97.47,118.0,20.0,98.0,145.0,72.0


In [22]:
final = cohort.join(vital_id)
final.shape

(2485, 17)

## Load Charted Events

In [23]:
chart_event.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,subject_id,hadm_id,stay_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
0,0,0,10006277,25610553,30888848,2176-06-08 01:18:00,2176-06-08 01:20:00,224876,Cloudy,,,0
1,2,2,10006277,25610553,30888848,2176-06-08 00:57:00,2176-06-08 01:00:00,223770,90,90.0,%,0
2,3,3,10006277,25610553,30888848,2176-06-08 04:18:00,2176-06-08 04:18:00,223988,Clear,,,0
3,6,6,10006277,25610553,30888848,2176-06-08 01:16:00,2176-06-08 01:16:00,220739,Spontaneously,4.0,,0
4,8,8,10006277,25610553,30888848,2176-06-08 11:00:00,2176-06-08 12:20:00,224650,,,,0


In [24]:
chart_pivot= chart_event.pivot(index='subject_id', columns='itemid', values='value')
chart_pivot.head()

itemid,220045,220046,220047,220048,220179,220180,220181,220210,220228,220277,...,227443,227457,227465,227466,227467,227944,227968,227969,228096,228299
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10006277,95,120,60,AF (Atrial Fibrillation),139,82,94,37,13.0,95,...,25.0,178.0,18.5,28.1,2.0,3 rails up,Yes,Bed locked in low position,,
10007677,84,130,50,SR (Sinus Rhythm),88,51,58,18,9.8,95,...,26.0,193.0,13.8,27.4,1.3,3 rails up,Yes,Pain evaluated and treated,0 Alert and calm,0 Alert and calm
10013310,100,130,50,ST (Sinus Tachycardia),116,61,84,19,8.8,100,...,27.0,290.0,12.8,27.6,1.3,3 rails up,Yes,Lines and tubes concealed,0 Alert and calm,0 Alert and calm
10017492,88,120,60,1st AV (First degree AV Block),149,63,71,25,8.4,98,...,23.0,357.0,11.1,23.7,1.0,3 rails up,Yes,Adequate lighting,0 Alert and calm,0 Alert and calm
10025463,81,120,60,ST (Sinus Tachycardia),122,81,91,22,,97,...,,,,,,3 rails up,Yes,Pain evaluated and treated,"-5 Unarousable, no response to voice or physic...","-2 Light sedation, briefly awakens to voice (e..."
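
Note that `DataFrame.pivot` requires each (`subject_id`, `itemid`) pair to occur at most once. If the event table contains repeated measurements, one option is `pivot_table` with an explicit aggregation. The sketch below keeps the earliest charted value on toy data; `events_demo` and `wide` are hypothetical names, and the choice of "earliest" is an assumption rather than the rule used to build the filtered file.

```python
import pandas as pd

# Toy events: subject 1 has two heart-rate rows (itemid 220045)
events_demo = pd.DataFrame({
    'subject_id': [1, 1, 1, 2],
    'itemid': [220045, 220045, 220210, 220045],
    'charttime': pd.to_datetime(['2176-06-08 01:00', '2176-06-08 02:00',
                                 '2176-06-08 01:00', '2176-06-08 03:00']),
    'value': ['95', '99', '37', '84'],
})

# pivot() would raise ValueError here because (1, 220045) occurs twice.
# Sorting by time and aggregating with 'first' keeps the earliest value.
wide = (
    events_demo.sort_values('charttime')
    .pivot_table(index='subject_id', columns='itemid',
                 values='value', aggfunc='first')
)
print(wide)
```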


In [25]:
item_dict = dict()
for item_id in chart_pivot.columns:
    item_dict[item_id] = item_names.loc[item_id, 'abbreviation']

In [26]:
feature_list = np.array(list(item_dict.values()))
chart_pivot.columns=feature_list
feature_list[:5]

array(['HR', 'HR Alarm - High', 'HR Alarm - Low', 'Heart Rhythm', 'NBPs'],
      dtype='<U43')

In [27]:
chart_pivot

Unnamed: 0_level_0,HR,HR Alarm - High,HR Alarm - Low,Heart Rhythm,NBPs,NBPd,NBPm,RR,Hemoglobin,SpO2,...,HCO3 (serum),Platelet Count,PT,PTT,INR,Side Rails,All Medications Tolerated,Safety Measures,Richmond-RAS Scale,Goal Richmond-RAS Scale
subject_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10006277,95,120,60,AF (Atrial Fibrillation),139,82,94,37,13,95,...,25,178,18.5,28.1,2,3 rails up,Yes,Bed locked in low position,,
10007677,84,130,50,SR (Sinus Rhythm),88,51,58,18,9.8,95,...,26,193,13.8,27.4,1.3,3 rails up,Yes,Pain evaluated and treated,0 Alert and calm,0 Alert and calm
10013310,100,130,50,ST (Sinus Tachycardia),116,61,84,19,8.8,100,...,27,290,12.8,27.6,1.3,3 rails up,Yes,Lines and tubes concealed,0 Alert and calm,0 Alert and calm
10017492,88,120,60,1st AV (First degree AV Block),149,63,71,25,8.4,98,...,23,357,11.1,23.7,1,3 rails up,Yes,Adequate lighting,0 Alert and calm,0 Alert and calm
10025463,81,120,60,ST (Sinus Tachycardia),122,81,91,22,,97,...,,,,,,3 rails up,Yes,Pain evaluated and treated,"-5 Unarousable, no response to voice or physic...","-2 Light sedation, briefly awakens to voice (e..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19992425,63,120,40,SB (Sinus Bradycardia),91,54,62,9,13.2,96,...,27,129,13.5,26.9,1.3,3 rails up,Yes,Call light within reach,0 Alert and calm,0 Alert and calm
19992885,73,130,50,ST (Sinus Tachycardia),133,52,80,13,7.7,100,...,28,133,17.4,39.4,1.6,3 rails up,Yes,Bed locked in low position,"-2 Light sedation, briefly awakens to voice (e...",0 Alert and calm
19994233,62,120,45,AF (Atrial Fibrillation),114,51,71,17,11.5,97,...,30,220,11.8,27.4,1.1,3 rails up,,Pain evaluated and treated,0 Alert and calm,0 Alert and calm
19999442,86,120,60,SR (Sinus Rhythm),146,80,98,13,14,100,...,22,119,12.5,26.2,1.3,3 rails up,Yes,Adequate lighting,-1 Awakens to voice (eye opening/contact) > 10...,0 Alert and calm


In [34]:
final = final.join(chart_pivot)

In [35]:
test = final.iloc[:,:-5].dropna(how='any',axis=0)
test[['dod']] = test[['dod']].astype('int')
test.describe()

Unnamed: 0,hadm_id,seq_num,icd_version,anchor_age,anchor_year,dod,stay_id,temperature,heartrate,resprate,o2sat,sbp,dbp
count,608.0,608.0,608.0,608.0,608.0,608.0,608.0,608.0,608.0,608.0,608.0,608.0,608.0
mean,24931690.0,3.166118,9.523026,64.560855,2152.975329,0.210526,35015780.0,97.689984,81.041941,18.415132,97.604112,129.319079,70.958882
std,2855539.0,4.732763,0.499881,15.039199,23.288345,0.408018,2941544.0,4.403925,16.703869,4.120856,4.421384,20.27421,14.681154
min,20037890.0,1.0,9.0,19.0,2110.0,0.0,30012880.0,34.7,42.0,2.0,2.0,10.0,23.0
25%,22577160.0,1.0,9.0,55.0,2134.0,0.0,32556220.0,97.47,70.0,16.0,96.0,115.0,61.0
50%,24922270.0,1.0,10.0,66.0,2152.0,0.0,34984340.0,97.8,80.0,18.0,98.0,129.0,71.0
75%,27385530.0,3.0,10.0,76.0,2173.0,0.0,37662860.0,98.3,92.0,20.0,100.0,140.25,79.0
max,29999620.0,38.0,10.0,91.0,2202.0,1.0,39965340.0,105.0,138.0,44.0,100.0,196.0,130.0


In [37]:
test.shape

(608, 143)

In [None]:
item_names.head()

# Description

Index(['charttime', 'temperature', 'heartrate', 'resprate', 'o2sat', 'sbp',
       'dbp', 'HR', 'HR Alarm - High', 'HR Alarm - Low',
       ...
       'IV/Saline lock', 'Gait/Transferring', 'Mental status',
       '20 Gauge Dressing Occlusive', 'Potassium (serum)', 'HCO3 (serum)',
       'Platelet Count', 'PT', 'PTT', 'INR'],
      dtype='object', length=133)

# Modelling

In [39]:
from DS_MIMIC_knn import train_test_split

In [40]:
test[test.dod==0].shape[0]/test.shape[0]

0.7894736842105263

In [51]:
test['Education Topic'].unique()

array(['ICU Environment', 'Activity', 'Plan of Care', 'IV Therapy',
       'Medications', 'Disease Process ', 'Coping',
       'Incentive Spirometry', 'Equipment monitor', 'MRI',
       'Cough/Deep Breath', 'Pain Scale', 'Diabetic',
       'Discharge Instruction', 'Procedures', 'Invasive Lines',
       'PCA/Pain Management', 'Pre-Op', 'Suctioning', 'Post-Op',
       'Blood Transfusion', 'Echocardiogram', 'Cardiac Cath',
       'Hypertension', 'Stroke Education', 'Pressure Injury Information',
       'Pacemaker', 'Nuclear medicine'], dtype=object)

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = test[test.columns[11:]]
y = test['dod']

X_train_orig, X_test_orig, y_train, y_test = train_test_split(X, y, 0.2)
std = StandardScaler()
X_train = std.fit_transform(X_train_orig)
# Apply the training-set scaling to the test set instead of refitting on it
X_test = std.transform(X_test_orig)

ValueError: could not convert string to float: 'SR (Sinus Rhythm)'
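
The error above arises because some charted columns hold text (for example heart rhythm descriptions), which cannot be standardized. A possible remedy, sketched here on toy data with pandas only, is to keep the columns that convert cleanly to numbers and to scale them with statistics from the training rows. All names below are hypothetical, and dropping the text columns is an assumption; they could also be encoded instead.

```python
import numpy as np
import pandas as pd

# Toy feature table mixing numeric strings and free text
X_demo = pd.DataFrame({
    'HR': ['95', '84', '100', '88'],
    'Heart Rhythm': ['AF (Atrial Fibrillation)', 'SR (Sinus Rhythm)',
                     'ST (Sinus Tachycardia)', 'SR (Sinus Rhythm)'],
    'RR': ['37', '18', '19', '25'],
})

# Convert every column to numbers; non-numeric entries become NaN
as_numbers = X_demo.apply(pd.to_numeric, errors='coerce')
# Keep only the columns in which every entry converted successfully
numeric_cols = [c for c in X_demo.columns if as_numbers[c].notna().all()]
X_num = as_numbers[numeric_cols]

# Standardize using the mean and standard deviation of the training rows only
X_tr, X_te = X_num.iloc[:3], X_num.iloc[3:]
mu, sigma = X_tr.mean(), X_tr.std(ddof=0)
X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma
print(numeric_cols)
```

With `ddof=0` this is the same kind of scaling that `StandardScaler` applies: the statistics come from the training rows and are reused unchanged for the test rows.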

In [None]:
X_train.shape

## Logistic Regression

In [None]:
# check classification scores of logistic regression
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score 
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)
print('Train/Test split results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(logreg.__class__.__name__+" log_loss is %2.3f" % log_loss(y_test, y_pred_proba))
print(logreg.__class__.__name__+" auc is %2.3f" % auc(fpr, tpr))

In [None]:
idx = np.min(np.where(tpr > 0.95)) # index of the first threshold for which the sensitivity > 0.95

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0,fpr[idx]], [tpr[idx],tpr[idx]], '--', color='blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], '--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +  
      "and a specificity of %.3f" % (1-fpr[idx]) + 
      ", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))

## Random Forest

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

# Fit the model
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_train_orig, y_train)

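# Use score method to calculate the accuracy over the whole test set
acc = rf.score(X_test_orig, y_test)
print(acc)

# Refit on the full data set to rank the features by importance
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X, y)

# Label each importance with its feature name
ft_imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)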
ft_imp.head(10)
ft_imp

In [None]:
import shap  # package used to calculate SHAP values

# Create object that can calculate SHAP values
explainer = shap.TreeExplainer(rf)

# Calculate SHAP values. This is what we will plot.
# Calculate shap_values for all of X_test_orig rather than a single row, to have more data for the plot.
shap_values = explainer.shap_values(X_test_orig)

# Make plot. Index [1] selects the SHAP values for the positive class (dod == 1),
# assuming shap returns one array per class for this classifier.
shap.summary_plot(shap_values[1], X_test_orig)

## KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(40)
test_scores = []
train_scores = []

for i in range(1,15):

    knn = KNeighborsClassifier(i)
    knn.fit(X_train,y_train)
    
    train_scores.append(knn.score(X_train,y_train))
    test_scores.append(knn.score(X_test,y_test))


In [None]:
max_train_score = max(train_scores)
train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
print('Max train score {} % and k = {}'.format(max_train_score*100,list(map(lambda x: x+1, train_scores_ind))))

In [None]:
# Score on the data points that were split off at the beginning and used solely for testing
max_test_score = max(test_scores)
test_scores_ind = [i for i, v in enumerate(test_scores) if v == max_test_score]
print('Max test score {} % and k = {}'.format(max_test_score*100,list(map(lambda x: x+1, test_scores_ind))))

In [None]:
plt.figure(figsize=(12,5))
p = sns.lineplot(x=range(1,15),y=train_scores,marker='*',label='Train Score')
p = sns.lineplot(x=range(1,15),y=test_scores,marker='o',label='Test Score')

## My KNN


In [None]:
def My_knncomper_difftest(X_train, X_test, y_train, y_test, kmax):
    """Plot the test and train accuracy (%) of MYknn for k = 1 .. kmax-1."""
    # Convert the labels to arrays so that comparisons are positional, not index-based
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    print('enter phase 2')
    print("maximum K value is " + str(kmax))
    Train_acc = []
    Test_acc = []
    for k in range(1, kmax):
        # Accuracy on the held-out test set
        predictlist = np.asarray(MYknn(X_test, X_train, y_train, k))
        test_predict_score = np.mean(predictlist == y_test)
        print(test_predict_score * 100)
        Test_acc.append(test_predict_score * 100)
        # Accuracy on the training set itself
        predictlist = np.asarray(MYknn(X_train, X_train, y_train, k))
        Train_predict_score = np.mean(predictlist == y_train)
        Train_acc.append(Train_predict_score * 100)

    plt.plot(Test_acc)
    plt.title('Test_acc')
    plt.show()
    plt.plot(Train_acc)
    plt.title('Train_acc')
    plt.show()

   


In [None]:
def MYknn(test_object, training_object, training_object_target, K):
    """Predict a label for each row of test_object by majority vote among its K nearest training points."""
    training_object = np.asarray(training_object)
    training_object_target = np.asarray(training_object_target)
    predictlist = []
    for newpoint in np.asarray(test_object):
        # Euclidean distance from newpoint to every training point (via broadcasting)
        distances = np.sqrt(((training_object - newpoint) ** 2).sum(axis=1))
        sortedDistIndicies = distances.argsort()
        # Count the labels of the K nearest neighbours
        classCount = {}
        for i in range(K):
            voteIlabel = training_object_target[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # default to 0 if the label has not been seen yet
        # The most frequent label wins; ties go to the label counted first
        predictlist.append(max(classCount, key=classCount.get))
    return predictlist

In [None]:
My_knncomper_difftest(X_train, X_test, y_train, y_test, 15)

In [None]:
## from sklearn.model_selection import train_test_split
# from DS_MIMIC_knn import *
import numpy as np
# from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler




---------------------
# Find dod ID


In [None]:
raise TypeError

In [None]:
# fillnan_test.set_index('subject_id',inplace=True)

In [None]:
# dod_id=cohort[cohort['dod']==1]
# test = dod_id.join(fillnan_test,how='left')
# np.count_nonzero(test['stay_id']>1)

In [None]:
dod_id

In [None]:
test['stay_id']

# Load output event

In [None]:
outevents = pd.read_csv('../data/icu/outputevents.csv')
items = pd.read_csv('../data/icu/d_items.csv')

In [None]:
test_main = cohort.copy()

In [None]:
out = outevents[outevents['subject_id'].isin(test_main.index)]

In [None]:
for i in range(out.shape[0]):
    entry = out.iloc[i,:]
    sub_id, itemid = entry[['subject_id', 'itemid']]
    test_main.loc[sub_id, itemid] = entry['value']
test_main.head() 
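
The row-by-row loop above can be slow on large tables. A vectorized alternative, sketched here on toy data, is to reshape the events with `pivot_table` and join the result onto the cohort. The names ending in `_demo` and the item IDs are illustrative, and `aggfunc='last'` is chosen to mimic the loop, where a later row overwrites an earlier one.

```python
import pandas as pd

# Toy stand-ins for the cohort and the output events
cohort_demo = pd.DataFrame(
    {'dod': [0, 1]}, index=pd.Index([1, 2], name='subject_id')
)
out_demo = pd.DataFrame({
    'subject_id': [1, 1, 2],
    'itemid': [226559, 226559, 226560],
    'value': [100.0, 250.0, 50.0],
})

# One column per itemid; for repeated pairs the last value in row order is kept
out_wide = out_demo.pivot_table(
    index='subject_id', columns='itemid', values='value', aggfunc='last'
)

# Attach the wide table to the cohort by subject_id
test_main_demo = cohort_demo.join(out_wide)
print(test_main_demo)
```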

# Appendix

We selected the chart event data from Google BigQuery using SQL. The query took about 20 minutes to run and returned millions of rows of data, so we do not run it in the notebook. Instead, we filtered the data and saved it locally.

In [None]:
# Query the cohort in 5 batches of 497 subjects each (5 * 497 = 2485, the cohort size)
for i in range(5):
    sql = f"""SELECT * 
    FROM `physionet-data.mimic_icu.chartevents`
    WHERE subject_id in {tuple(cohort.index[497*i:497*(i+1)].values.tolist())}
    ORDER BY subject_id"""

    df = pd.read_gbq(sql, project_id='focus-dragon-313813', dialect='standard', use_bqstorage_api=True)
    df.to_csv(f'df{i}.csv')
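
The batching in the loop above can also be written as a small generator that works for any cohort size. This is only a sketch: `batched_in_queries` is a hypothetical helper, and it only builds the query strings without contacting BigQuery.

```python
def batched_in_queries(ids, batch_size,
                       table='physionet-data.mimic_icu.chartevents'):
    """Yield one SQL query string per batch of subject IDs (hypothetical helper)."""
    ids = [int(i) for i in ids]
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # Join the IDs explicitly: formatting a one-element tuple would
        # produce "(123,)", which is not valid SQL
        id_list = ', '.join(str(i) for i in batch)
        yield (f"SELECT * FROM `{table}` "
               f"WHERE subject_id IN ({id_list}) "
               f"ORDER BY subject_id")

# Three IDs in batches of two give two queries
queries = list(batched_in_queries([10006277, 10007677, 10013310], batch_size=2))
for q in queries:
    print(q)
```

Each string could then be passed to the same `pd.read_gbq` call used above.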