### Descriptions:
Main: process parts of cohort and demographic variables.

**Inputs**: Ccohort (Conor's original cohort), encounters (SQL), code_status (SQL), flowsheet_HW (SQL), vs1st or vs1st_complete (from R3 notebook)

* Check Conor's cohort (such as years of admission 2015 - 2019)
* Join with encounter table to get ESI and inpatient ID information
* Exclude patients who have any order for EXISTING code status that is not full prior to and within 24hrs after inpatient admission (even if code was cancelled later)
* Exclude patients who are < 18 years old and, later on, exclude patients who do not have at least one complete set of vital signs

--> *cohort_demo* only contains demographic and ESI --> **cohort**: first pass of processed cohort, same info as with *cohort_demo*

* Process language --> English: yes/no
* Process insurance (medicare/caid/cal - contains "MEDI" in insurance) --> medis: yes/no
* Process height and weight: combine with flowsheet H&W as first take
* Use *vs1st_complete* to impute ESI with MICE --> *cohort_demo_clean_imputedHW*
* One-hot encode gender and race
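
The one-hot step listed above could look roughly like the sketch below. It uses base R's `model.matrix()` on made-up toy data; the column names `gender` and `race` follow this notebook, but the values and the object names (`toy`, `toy_onehot`) are assumptions for illustration only, not the notebook's actual encoder.

```r
# Toy data; column names match the notebook, values are made up.
toy <- data.frame(
  gender = factor(c("Female", "Male", "Female")),
  race   = factor(c("Asian", "White", "Black"))
)

# "- 1" drops the intercept, so each factor encoded on its own
# gets one indicator column per level.
onehot_gender <- model.matrix(~ gender - 1, data = toy)
onehot_race   <- model.matrix(~ race - 1, data = toy)

# Combine the indicator columns into one matrix.
toy_onehot <- cbind(onehot_gender, onehot_race)
toy_onehot
```

With R's default `na.action`, `model.matrix()` drops rows that contain NA, so the row count should be checked against the cohort before binding the result back.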

**Outputs**: 
* **cohort**: first pass of processed cohort, same info as with *cohort_demo*
* **cohort_demo_final**: final cohort with demographic variables processed and imputed

OLD -- Modify Conor's cohort with admit time (UTC) from order_proc and code status. --> Admit time is the earlier of the "Admit to Inpatient" order time from the order_proc table and the effective time from the adt table. When admit time from order_proc is NA, take admit time from the ADT table, which is always available.

OLD -- Remove 1 obs whose gender is unknown

### Importing R libraries

In [None]:
library(caret) # import this before glmnet to avoid rlang version problem
library(xgboost)
library(data.table)
library(tidyverse)
library(lubridate)
library(Matrix)
# library(slam)
library(glmnet)
library(bit64)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
library(mice)
options(repr.matrix.max.rows=200, repr.matrix.max.cols=30)

### Testing

In [None]:
# coh %>% filter(pat_enc_csn_id_coded == 131227093710) # old datalake2018 has no inpatient_id_coded

### OLD --- Call the cohort from Conor's or with inpatient IDs from Tiffany's
* Use both mrn and csn: anon_id and pat_enc_csn_id_coded as the unique combo
* For our cohort, the csn is still unique. Not true for the entire EHR data (a few errors)
* Tiffany further filtered out patients who have no inpatient ID, as these can be erroneous in the ADT table where effective time is before the admission time
* Ccohort is the original draft cohort, available on BigQuery under conor_db
* Ccohort_inptid.csv is Conor's cohort with inpatient ID, done by Tiffany, available on BQ, Traige
* Size is 52318, from 2015 - 2019

### NEW --- with shc_core
* Use Conor's original cohort, with admit_time (this is the effective time < event time for admission) from adt table
* No longer checking for admission order from order_proc because these orders don't have the level of care
* Get inpatient id and acuity_level = ESI from encounter table
* Keep patients with FULL CODE status and >= 18 years old
* Keep patients with admit years from 2015 - 2019

In [None]:
# Ccohort.csv is the original draft cohort using Conor's query
# fread here gives data.table, and read.csv gives dataframe
# cohort <- as.data.frame(cohort) # if use fread

ccohort <- read.csv("./Data/Ccohort.csv") 
nrow(ccohort) 

In [None]:
# Check admit years. Conor cohort still has 2020 
ccohort <- ccohort %>% mutate(admit_time = ymd_hms(admit_time_jittered), 
                              adm_year = year(admit_time))
unique(ccohort$adm_year)
nrow(ccohort %>% filter(adm_year==2020))

In [None]:
# remove patients with admit year 2020
# rename the label as max level of care at 24 hour
# change to date_time, but when saved, it goes back to factor!
ccohort <- ccohort %>% mutate(admit_time = ymd_hms(admit_time_jittered), 
                              adm_year = year(admit_time)) %>% 
                        filter(adm_year != 2020) %>% select(-c(admit_time_jittered, adm_year)) %>%
                        rename(label_max24 = label)
nrow(ccohort)

In [None]:
# 1 MRN (patient) can have multiple CSN (visits/encounters)
# but 1 CSN is associated with 1 MRN, not true for the whole EHR data

nrow(ccohort %>% distinct(anon_id))
nrow(ccohort %>% distinct(pat_enc_csn_id_coded))
nrow(ccohort %>% distinct(anon_id, pat_enc_csn_id_coded))
# nrow(cohort %>% group_by(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded) %>% unique()) # unique rows

In [None]:
# count how many visits/csn each MRN (anon_id) has
ccohort_count <- ccohort %>% count(anon_id, sort = TRUE, name = 'csn')
nrow(ccohort_count)
summary(ccohort_count$csn)

# display histogram of freq of patients who have more than 1 visit
hist(ccohort_count[ccohort_count$csn >1, ]$csn, breaks=38, col="powderblue")

In [None]:
head(ccohort_count, n=1)
ccohort %>% filter(pat_enc_csn_id_coded == 131227093710)
ccohort %>% filter(anon_id == 'JCe90a9c')
ccohort %>% filter(anon_id == "JCcec38b")

### Check encounter table, joined with Conor's cohort
* This table gives us the inpatient id and ESI level
* All have Hospital encounter type
* All Visit type are NA, as this might be more applicable to outpatient encounters
* Acuity levels are ESI with some missing (~1900)

==> we can remove enc_type and visit_type
* Hospital Admission time (such as ED arrival) is probably before the admit time (to inpatient) from order proc
* Admit time from Conor's is effective_time_jittered_utc, same as Hospital admission time, all times are UTC

In [None]:
enc <- read.csv("./Data/encounters.csv") 
nrow(enc)

In [None]:
head(enc, n=1)
enc %>% filter(is.na(inpatient_data_id_coded))
enc %>% filter(pat_enc_csn_id_coded == 131227093710)

In [None]:
enc %>% gather(var, value) %>% distinct() %>% count(var) %>% arrange(n)

In [None]:
summary(enc %>% select(acuity_level, enc_type, visit_type))

In [None]:
# join the encounter table with cohort table to get the inpatient data id coded
enc <- enc %>% select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, 
                      enc_type, visit_type, ESI = ACUITY_LEVEL_C, hosp_admsn_time = hosp_admsn_time_jittered_utc) %>% 
                mutate(hosp_admsn_time = ymd_hms(hosp_admsn_time))

cohort <- left_join(ccohort, enc) %>% 
            mutate(ed_time_hr = as.numeric(difftime(admit_time, hosp_admsn_time, units ="hours")))

summary(cohort$ed_time_hr)
hist(cohort$ed_time_hr, xlim=c(0, 20), breaks=300, col="blue")

In [None]:
cohort %>% filter(pat_enc_csn_id_coded==131227093710)
cohort %>% filter(anon_id == "JCcec38b")

### Skip getting earlier admit time
* skip adjusting admission time as the earliest time as done above
* Process code status: keep patients whose code status is Full prior to admission or within 24 hours after admission. Otherwise interventions might not match with presentations.
* Note that in the display_name, some codes are blank but they are converted to something in description --> use description
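
A minimal sketch of the exclusion rule described above, written in base R on made-up data. The description value "DNR/DNI" and the toy object names are hypothetical; only `description`, `code_diff_hr` and `pat_enc_csn_id_coded` come from this notebook.

```r
# Toy code-status orders; code_diff_hr is order time minus admit time, in hours.
orders <- data.frame(
  pat_enc_csn_id_coded = c(1, 1, 2, 3, 4),
  description  = c("FULL CODE", "DNR/DNI", "FULL CODE", "DNR/DNI", NA),
  code_diff_hr = c(-2, 30, 5, -10, NA),
  stringsAsFactors = FALSE
)

# An encounter is excluded if ANY non-full order was placed before admission
# or within 24 hours after it. Missing descriptions count as full code.
not_full <- !is.na(orders$description) &
  orders$description != "FULL CODE" &
  orders$code_diff_hr <= 24
excluded_csn <- unique(orders$pat_enc_csn_id_coded[not_full])

# Keep every encounter that is not in the excluded set (an anti-join).
cohort_toy <- data.frame(pat_enc_csn_id_coded = 1:4)
kept <- cohort_toy[!cohort_toy$pat_enc_csn_id_coded %in% excluded_csn, , drop = FALSE]
kept$pat_enc_csn_id_coded
```

Encounter 1 stays because its non-full order came 30 hours after admission; encounter 3 is dropped because its non-full order precedes admission.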

In [None]:
# code_status.csv is from querying code status order from order_proc
code <- read.csv("./Data/code_status.csv") 
nrow(code) # 162069
summary(code %>% select(order_status, display_name, description)) # Full and n/a 117187 + 743 = 117930

In [None]:
head(code, n=1) 

In [None]:
code <- code %>% select(anon_id, pat_enc_csn_id_coded, description, order_time = order_time_jittered_utc) %>% 
                    mutate(order_time = ymd_hms(order_time)) %>% distinct() 
nrow(code) 
summary(code)   

In [None]:
# calculate the difference between admit time and code status order time
code <- left_join(cohort, code) %>% 
            select(-c(enc_type, visit_type)) %>%
            mutate(code_diff_hr = as.numeric(difftime(order_time, admit_time, units = 'hours')))

nrow(code %>% group_by(anon_id, pat_enc_csn_id_coded) %>% unique())
nrow(code %>% select(anon_id, pat_enc_csn_id_coded) %>% group_by(anon_id, pat_enc_csn_id_coded) %>% unique()) 

In [None]:
head(code, n=1)

In [None]:
# some do not have code status, time diff in hours
summary(code$code_diff_hr)
hist(code$code_diff_hr,  xlim=c(-80, 48), breaks=720, col="steelblue")

In [None]:
# number of patients whose code status is either full or NA (if exists, consider NA = FULL)
# code order time is prior to or within 24 hours after admission
nrow(filter(code, description != "FULL CODE" & !is.na(description))) 

# cohort of patients whose code status before admission or 24 hour after admission is not FULL CODE
code_notfull <- code %>% 
                filter(code_diff_hr <=24 & description != "FULL CODE" & !is.na(description)) %>% 
                select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, admit_time) %>%
                distinct()

# number of patients whose code status is not full before or within 24 hours after admission
nrow(code_notfull)
nrow(code_notfull %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) # 6341 for 2015-2018
head(code_notfull, n=1)

In [None]:
# remove 8501 non-fullcode non-na code patients from the cohort
cohort_code <- anti_join(cohort, code_notfull) %>% select(-c(enc_type, visit_type))
nrow(cohort_code)
nrow(unique(cohort_code %>% select(anon_id, pat_enc_csn_id_coded))) 
nrow(filter(cohort_code, is.na(admit_time)))

In [None]:
head(cohort_code, n=1)
code %>% filter(anon_id == "JCcec38b")

In [None]:
# this is the cohort with corrected admit time and full code before and 24hr after admission!
write.csv(cohort_code, file = "./Data/cohort_code.csv", row.names=FALSE)

### Explore and process demographic variables
1. Age: keep only patients >= 18
2. Insurance: Medi-Cal/Medicare/Medicaid (contains "MEDI") or blank insurance --> medis = 1, otherwise 0
3. Language --> English = 1 for English, otherwise 0
4. Leave height and weight to be processed with the flowsheet data (age >= 18 only)
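
As a rough illustration of items 2 and 3, the base R sketch below builds the two binary flags on made-up rows. The insurance and language values are invented for the example; the flag names `medis` and `English` follow this notebook.

```r
# Toy demographics; the values are hypothetical.
demo_toy <- data.frame(
  insurance = c("MEDICARE", "MEDI-CAL", "", "BLUE CROSS"),
  language  = c("English", "Spanish", "English", "Mandarin"),
  stringsAsFactors = FALSE
)

# medis = 1 when the insurance name contains "MEDI" or is blank.
demo_toy$medis <- as.integer(
  grepl("MEDI", demo_toy$insurance, fixed = TRUE) | demo_toy$insurance == ""
)

# English = 1 only for an exact match on "English".
demo_toy$English <- as.integer(demo_toy$language == "English")
demo_toy
```

Note that `grepl()` returns FALSE for NA input, whereas `str_detect()` returns NA, so the two approaches can differ on rows with missing insurance.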

In [None]:
# demographic.csv is the file from querying the demographics table
demo0 <- read.csv("./Data/demographic.csv") %>% distinct()
nrow(demo0)

In [None]:
cohort_code %>% filter(pat_enc_csn_id_coded == 131227093710)

In [None]:
head(demo0, n=3) 
demo0 %>% filter(anon_id == "JCcec38b")

In [None]:
# calculate age, only keep patients age >=18, (304 < 18)
cohort_demo <- left_join(cohort_code, demo0) %>%
                    mutate(dob = ymd(dob),
                           age = round(as.numeric(difftime(ymd_hms(admit_time), dob, units="days")/365),0)) %>%
                    filter(age >= 18) %>% select(-dob)
nrow(cohort_demo)

In [None]:
head(cohort_demo, n=1)

In [None]:
# checking duplicates
nrow(cohort_demo %>% select(anon_id) %>% distinct()) # 30073
nrow(cohort_demo %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) # 43524
cohort_demo[duplicated(cohort_demo[, c('anon_id','pat_enc_csn_id_coded')]),]

# cohort_demo %>% filter(anon_id == "JCd49287") # pat_enc_csn_id_coded = 131195706986, 
# cohort_demo <- cohort_demo %>% filter (!(anon_id == "JCd49287" & race == "Unknown"))

In [None]:
summary(cohort_demo %>% select(ESI, gender, race, recent_height_cm, recent_weight_kg, age))
cohort_demo %>% gather(var, value) %>% distinct() %>% count(var) %>% arrange(n)
# number of distinct values per column (summarise_each() and funs() are deprecated)
cohort_demo %>% summarise(across(everything(), n_distinct))

In [None]:
summary(cohort_demo$age)
hist(cohort_demo$age, breaks=100, col="dodgerblue")

In [None]:
options(repr.matrix.max.rows=135, repr.matrix.max.cols=20)
cohort_demo %>%                 # filter(!is.na(col)) %>% filter out all rows with NAs in col 
                group_by(insurance) %>% count() %>% arrange(desc(n))

In [None]:
cohort_demo %>% filter(str_detect(insurance, "MEDI") | insurance == "") %>% group_by(insurance) %>% count()

In [None]:
# turn insurance into medis which has "MEDI" under insurance
nrow(cohort_demo %>% filter(insurance == "")) # | insurance == ""
cohort_demo <- cohort_demo %>%
                    mutate(medis = ifelse(str_detect(insurance, "MEDI") | insurance == "", 1, 0)) %>%
#                     mutate(medis = ifelse(str_detect(insurance, paste(medis, collapse="|")) | insurance=="", 1, 0)) %>%
                    group_by(anon_id, pat_enc_csn_id_coded) %>%
                    mutate(medis = sum(medis)) %>% ungroup() %>% # just to make sure if anyone else has more than 1
                    mutate(medis = ifelse(medis>0, 1, medis)) 

cohort_demo %>% count(medis)

In [None]:
cohort_demo %>% group_by(language) %>% count() %>% arrange(desc(n))

In [None]:
cohort_demo <- cohort_demo %>%
                    mutate(English = ifelse(language == "English", 1, 0)) %>% 
                    select(-c(language, insurance))

cohort_demo %>% count(English)
nrow(cohort_demo)
summary(cohort_demo %>% select(ESI, gender, race, recent_height_cm, recent_weight_kg, age, medis, English))

In [None]:
head(cohort_demo, n=1)
colnames(cohort_demo)

### Save the data: cohort_demo and updated_cohort

NEW --
* **cohort_demo** is now the new **cohort**, but will change after incorporating height and weight from flowsheet
* cohort will be updated to keep only patients who have at least 1 complete set of vital signs (to have the same cohort for both simple and complex dataset approaches)
* cohort_demo will have height and weight combined from flowsheet and imputed
* toward the end, combine with the cleaned first set of vital signs (-GCS) to impute missing ESI

OLD datalake2018:
* 1 unknown gender obs removed
* this patient (anon_id == "JCcf85d5" & pat_enc_csn_id_coded == 131227093710): nothing in order_proc, no inpatient ID, wrong csn with another patient's mrn, and has effective admit time before admit time in adt table. Tiffany confirmed

In [None]:
# cohort_demo includes cohort, just more variables
write.csv(cohort_demo, file = "./Data/cohort_demo.csv", row.names=FALSE) # same name on laptop

cohort <- cohort_demo %>% 
            select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, label_max24, admit_time) 

write.csv(cohort, file = "./Data/cohort.csv", row.names=FALSE)

In [None]:
nrow(cohort)
nrow(cohort_demo)
length(unique(cohort$anon_id))
cohort_demo %>% filter(anon_id == "JCcec38b")

### Use Flowsheet (named vitals now) to get Height and Weight
* Height in inches and Weight in oz in flowsheet
* Clean: return NA for height < 40 or  > 85 and Weight details below
* Take the closest value to admit_time
* Convert to cm and kg and merge with the demographics Height and Weight
* Check for erroneous values again, return NA for H&W using the same rules as above
* Patients who don't have H&W from flowsheet will get them from demographic table
* Finally, missing H & W will be imputed using the rest of demographic variables
* Update: new flowsheetHW, no time restriction!
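
The range checks and unit conversions listed above can be sketched as a small base R helper. The thresholds (40 and 85 inches, 600 and 12345 oz) are the ones used later in this notebook; the function name `clean_hw` and the toy inputs are assumptions made for illustration.

```r
oz_to_kg <- 0.0283495   # kilograms per ounce
in_to_cm <- 2.54        # centimetres per inch

# Hypothetical helper: out-of-range values become NA instead of dropping rows.
clean_hw <- function(height_in, weight_oz) {
  height_in[!is.na(height_in) & (height_in < 40 | height_in > 85)] <- NA
  weight_oz[!is.na(weight_oz) & (weight_oz < 600 | weight_oz > 12345)] <- NA
  data.frame(
    height_cm = round(height_in * in_to_cm),
    weight_kg = round(weight_oz * oz_to_kg)
  )
}

hw <- clean_hw(height_in = c(65, 30, NA, 70),
               weight_oz = c(2400, 500, 13000, NA))
hw
```

Replacing with NA rather than filtering keeps every encounter in the table, so the missing values can still be filled from demographics or imputed later.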

In [None]:
cohort <- read.csv("./Data/cohort.csv")

vitals_hw <- read.csv("./Data/flowsheet_HW.csv") %>% 
                select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, row_disp_name, 
                       recorded_time = recorded_time_utc, num_value1, num_value2)

nrow(cohort)
nrow(vitals_hw) #208446
nrow(cohort %>% select(anon_id) %>% distinct())
cohort %>% summarise(n_patients = n_distinct(anon_id))
vitals_hw %>% group_by(row_disp_name) %>% count(sort=TRUE)
nrow(vitals_hw %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())

In [None]:
head(cohort, n=1)
head(vitals_hw, n=1)

In [None]:
# vitals %>% group_by(row_disp_name, units) %>% summarise(count1 = length(num_value1), count2 = num_value2) %>% distinct()
vitals_hw %>% group_by(row_disp_name) %>% summarise(n1 = length(num_value1[!is.na(num_value1)]), n2=length(num_value2[!is.na(num_value2)]))
summary(vitals_hw %>% select(row_disp_name, num_value1, num_value2))

In [None]:
# combine cohort with vital signs, remove num_value2 as H&W are under num_value1
vitals_hw <- left_join(cohort, vitals_hw) %>% 
                select(-num_value2) %>% distinct()

nrow(vitals_hw)
nrow(vitals_hw %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
vitals_hw %>% group_by(row_disp_name) %>% count(sort=TRUE)

### Height and Weight
* Height is in cm in demographics and in inches in the flowsheet (num_value1) --> convert to cm.
  * Most heights are missing from the flowsheet, but the recorded values all look reasonable: keep all flowsheet values and merge with demographics.
* Weight is in kg in demographics and in ounces in the flowsheet (num_value1) --> convert to kg.
  * Some weights are missing, but not many.
  * Low end: a few values fall within 700-1000 oz --> remove values below 600 oz (17 kg).
  * The rest look reasonable, even for very heavy patients; there is 1 outlier of 317 kg, which is plausible.
* Use flowsheet data as the source, taking the value closest to admit time. For missing values, get replacements from demographics.
  * Use the date from the timestamp, as demographics only has dates; choose the record closest to the admit date.
  * Afterward, use mice to impute H&W from the other variables (age, gender, race, medis; not language).
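
Picking the record closest to admit time for each encounter can be sketched in base R as below. The data are made up; `mindiff` follows the notebook's meaning (absolute gap between admit_time and recorded_time, in minutes).

```r
# Toy flowsheet rows: several weights per encounter.
rec <- data.frame(
  pat_enc_csn_id_coded = c(1, 1, 1, 2, 2),
  mindiff = c(300, 15, 90, 40, 600),
  Weight  = c(2500, 2480, 2510, 1900, 1950)
)

# Sort by encounter and time gap, then keep the first row of each encounter.
rec <- rec[order(rec$pat_enc_csn_id_coded, rec$mindiff), ]
closest <- rec[!duplicated(rec$pat_enc_csn_id_coded), ]
closest
```

Ties on `mindiff` are resolved by row order here, which matches the effect of keeping a single row per encounter.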

In [None]:
# get the closest H and W to admit time by the minutes (longest is 26 days)
# drop rows that have both NA in Height and Weight
vitals_hw <- vitals_hw %>%
                spread(row_disp_name, num_value1) %>%
                mutate(mindiff = abs(as.numeric(difftime(admit_time, recorded_time, units = 'mins')))) %>%
#                 group_by(anon_id, pat_enc_csn_id_coded) %>%
#                 filter(mindiff == min(mindiff)) %>% slice(1L) %>%
                filter_at(vars(Weight, Height), any_vars(!is.na(.))) 

# should remove NA to avoid having a NA col                
nrow(vitals_hw) # 173163
summary(vitals_hw %>% select(Height, Weight, mindiff))

In [None]:
head(vitals_hw, n=1)

In [None]:
# many weights recorded, some are erroneous
# filter out encounters with only 1 weight vs more than 1
# Weight, looks for within same encounter, and same patient
# Height, looks for same patients

suppressWarnings(
vitals_hw1 <- vitals_hw %>% group_by(anon_id, pat_enc_csn_id_coded) %>% 
                mutate(nWe = length(Weight[!is.na(Weight)]), nHe = length(Height[!is.na(Height)])) %>%
                group_by(anon_id) %>% 
                mutate(nWp = length(Weight[!is.na(Weight)]), nHp = length(Height[!is.na(Height)]),
                       minW = min(Weight, na.rm=TRUE), maxW = max(Weight, na.rm=TRUE),
                       rmaxw = round(maxW/Weight, 2)) %>% select(-'<NA>')
)

In [None]:
summary(vitals_hw1 %>% select(Weight, Height, nWe, nWp, nHe, nHp, rmaxw))

In [None]:
# by encounters or patients, same, Weight seems reasonable 
summary(vitals_hw1 %>% filter(nWp == 1) %>% select(Weight)) # oz here; filter < 20kg, > 220kg later
summary(vitals_hw1 %>% filter(nHp == 1) %>% select(Height)) # inches here; filter cm

### explore Height (inches): 
* Return NA if Height < 40 or > 85 --> some entries < 40: JCd0f204 and JCda7d7f

In [None]:
hist(vitals_hw1[vitals_hw1$Height < 50,]$Height, xlim=c(0, 100), col="slateblue") # remove < 40 (keep 42)
hist(vitals_hw1[vitals_hw1$Height > 80,]$Height, col="blue", add=TRUE) # remove > 85 or 90

errorH <- vitals_hw1 %>% filter(Height > 90 | Height < 40) %>% arrange(Height)
nrow(errorH)

In [None]:
# 81 is ok, lost the 90s
# errors: JCd0f204 and JCda7d7f
errorH

In [None]:
# this patient's height of 91 is the only record, whereas the patient with height 92 is an error
vitals_hw1 %>% filter(anon_id %in% c("JCd2c9f5", "JCce21de") & !is.na(Height))

In [None]:
# it looks like most heights < 40 are errors, except cases with only 1 record --> don't know JC2a2d702, ...
vitals_hw1 %>% filter(anon_id %in% errorH$anon_id  & !is.na(Height)) %>% 
                arrange(anon_id, Height) %>% slice(1:3)

### Explore Weight
* Return NA for Weight above 12345.9 oz (about 350kg) --> (only 1 and this one is erroneous  "JCe08adf")
* Only 1 slightly above 12000oz, but looks correct "JCceb47b"
* Return NA for Weight < 600 (17kg): JCd16b07, JCe3e7e8. min Weight above this is 710, about 20kg --> reasonable
* For those W < 900: if maxW/W is > 2, or if maxW > 1.3*W within 30 days --> return NA
* JCd0500f x4 weights in 3 years, keep this
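
The low-weight rule above can be traced on a toy table. This base R sketch reuses the notebook's column names (`rmaxw`, `maxddiff`, `Werror`, `sumErr`), but the patients and values are made up, and `Weight_clean` is a hypothetical name.

```r
# Toy weights in ounces; maxddiff is days between a record and the patient's max weight.
w <- data.frame(
  anon_id  = c("A", "A", "B", "B", "C", "C"),
  Weight   = c(840, 2600, 700, 1000, 720, 1000),
  maxddiff = c(400, 0, 10, 0, 200, 0),
  stringsAsFactors = FALSE
)

# Ratio of each patient's max weight to the current weight.
w$maxW  <- ave(w$Weight, w$anon_id, FUN = max)
w$rmaxw <- round(w$maxW / w$Weight, 2)

# Flag: max weight more than 2x, or more than 1.3x within 30 days.
w$Werror <- as.integer(w$rmaxw > 2 | (w$rmaxw > 1.3 & w$maxddiff < 30))
w$sumErr <- ave(w$Werror, w$anon_id, FUN = sum)

# Only weights below 900 oz are replaced with NA when the patient has a flag.
w$Weight_clean <- ifelse(w$Weight < 900 & w$sumErr > 0, NA, w$Weight)
w
```

Patient A trips the 2x rule, patient B trips the 1.3x within 30 days rule, and patient C is kept because the larger weight was recorded 200 days away.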

In [None]:
summary(vitals_hw$Weight) # 22046.00 is 625kg, 710 is about 20kg, 600 is about 17kg

In [None]:
hist(vitals_hw1$Weight, breaks=100, xlim=c(0, 12000), col="dodgerblue")
hist(vitals_hw1[vitals_hw1$Weight < 1000,]$Weight, xlim=c(0,1000), col="slateblue") 
hist(vitals_hw1[vitals_hw1$Weight > 10000,]$Weight, col="blue") 

errorW <- vitals_hw1 %>% filter(Weight > 12000 | Weight < 1000) %>% 
            select(-c(inpatient_data_id_coded, label_max24)) %>% arrange(anon_id, Weight)
nrow(errorW)

In [None]:
# JCd0500f x4 weight in 3 years!!!
# JCea3866 doubled weight in 2 years!!
errorW # 710 is 20kg

In [None]:
# CHECK for weights compare to max weight, see the ratios
options(repr.matrix.max.rows=120, repr.matrix.max.cols=20)
vitals_hw1 %>% filter(anon_id %in% errorW$anon_id  & !is.na(Weight) & rmaxw > 2) %>% 
                select(-c(inpatient_data_id_coded, label_max24)) %>% 
                arrange(anon_id, Weight) %>% slice(1:3) 

In [None]:
# ok JCcba5a7 some examples
vitals_hw1 %>% filter(anon_id == "JCcba5a7") %>% arrange(admit_time, Weight)# different encounter
vitals_hw1 %>% filter(anon_id == "JCcbb4b8") # remove 840

In [None]:
# find 1 max Weight and its date for each patient.
nrow(vitals_hw1)
suppressWarnings(
maxWdate <- vitals_hw1 %>% group_by(anon_id) %>%
                filter(Weight == max(Weight, na.rm=TRUE)) %>% rename(maxWDate = recorded_time, maxWw = Weight) %>%
                select(anon_id, maxWDate, maxWw) %>% slice(1L)
)
nrow(maxWdate)
length(unique(maxWdate$anon_id))

vitals_hw2 <- vitals_hw1 %>% select(-maxW) %>% left_join(maxWdate) %>% rename(maxW = maxWw)
nrow(vitals_hw2)

In [None]:
head(maxWdate, n=1)
head(vitals_hw, n=3)
vitals_hw %>% filter(anon_id =="JCcb658e")
vitals_hw2 %>% filter(anon_id =="JCcb658e")

In [None]:
# calculate the days difference between max Weight day and another Weight
# this takes a while (~5 min)
vitals_hw2 <- vitals_hw2 %>% mutate(maxddiff = difftime(ymd_hms(maxWDate), ymd_hms(recorded_time), units="days"))
head(vitals_hw2)

In [None]:
# weight < 1000 is quite complicated: < 700 is error, weight ~800: if multiple --> correct, if only 1 --> error
# flag a weight as a possible error if the patient's max weight is > 2x this weight,
# or > 1.3x this weight and recorded within 30 days of it; the < 900 cut-off is applied in the final cleaning step
# probably should add within a year as the patient JCd0500f 4x weight in 3 years
suppressWarnings(
vitals_hw2 <- vitals_hw2 %>% group_by(anon_id) %>% 
                mutate(maxddiff = abs(round(maxddiff,1)),
                       Werror = ifelse(rmaxw > 2 | (rmaxw > 1.3 & maxddiff < 30), 1, 0),
                       sumErr = sum(Werror, na.rm=TRUE)) # cumsum(replace_na(Werror, 0)) --> take max
) 

### Clean height and weight: Final step
* Height < 40 or > 85 inches and Weight < 600 oz or > 12345 oz are returned as NA.
* Lowest remaining Height is 42, which occurred twice for the same patient. Two patients have 91 and 92 inches (91 is the patient's only measurement; 92 is an error).
* Return NA if Weight < 900 and sumErr > 0, that is, the same patient has a max Weight > 2*Weight, or a max Weight > 1.3*Weight within 30 days.
* One patient has 950 and 4500 oz, more than 4 times higher but within 3 years, with only 2 measurements from different encounters: JCd0500f

In [None]:
# Weight: replace with NA anything > 12345 oz (350 kg) or < 600 oz (17 kg)
# For anything < 900 oz, replace with NA if the patient has a flagged weight error (sumErr > 0; better to take within a year)
# Height: replace with NA anything < 40 or > 85 inches (102 and 216 cm)

# DO NOT filter but instead replace these values with NA
vitals_hw <- vitals_hw2 %>% 
                mutate(Height = ifelse((Height < 40 | Height > 85), NA, Height),
                       Weight = ifelse((Weight > 12345 | Weight < 600 | (Weight < 900 & sumErr > 0)), NA, Weight))
#                 filter((Height >= 40 & Height <= 85) & (Weight <= 12345)) %>% filter(!(Weight < 900 & sumErr > 0))

In [None]:
summary(vitals_hw %>% select(Height, Weight))
lowW <- vitals_hw %>% select(anon_id, pat_enc_csn_id_coded, admit_time, recorded_time, Height, Weight,
                     nWe, nWp, minW, rmaxw, maxWDate, maxW, maxddiff, sumErr) %>%
                        filter(Weight < 900) %>% arrange(anon_id, Weight) # %>% slice(1:3) 

In [None]:
lowW

In [None]:
# looks OK
vitals_hw2 %>% filter(anon_id %in% lowW$anon_id) %>%
                select(anon_id, pat_enc_csn_id_coded, admit_time, recorded_time, Height, Weight,
                       nWe, nWp, minW, rmaxw, maxWDate, maxW, maxddiff, sumErr) %>% arrange(anon_id, Weight)

### Combine Height and Weight with demographic table and clean again
* convert Weight as oz in flowsheet to kg as in demographic
* convert Height as inches in flowsheet to cm as in demographic
* Take the closest clean values to admit_time
* after combining, return NA as above again
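
The fallback from flowsheet to demographics values can be sketched in base R as below. The toy values are made up; `Height`, `Weight`, `recent_height_cm`, `recent_weight_kg`, `Height2` and `Weight2` are the names used in this notebook.

```r
# Toy joined table: flowsheet values (already in cm and kg) next to demographics values.
hw_toy <- data.frame(
  Height           = c(170, NA, NA),
  recent_height_cm = c(168, 175, NA),
  Weight           = c(NA, 80, NA),
  recent_weight_kg = c(70, 82, NA)
)

# Prefer the flowsheet value; fall back to demographics only when it is missing.
hw_toy$Height2 <- ifelse(is.na(hw_toy$Height), hw_toy$recent_height_cm, hw_toy$Height)
hw_toy$Weight2 <- ifelse(is.na(hw_toy$Weight), hw_toy$recent_weight_kg, hw_toy$Weight)
hw_toy
```

The third row is missing in both sources, so it stays NA and is left for the imputation step.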

In [None]:
# for flowsheet, convert W from oz to kg, H from inches to cm 
# keep track of the lowest weight obs of 8kg  (288 oz) to investigate. If scale, remove this here --> not here for new shc_core
# this patient later appear to have a weight of 83 kg

# take the closest values to admit_time, mindiff is time difference between admit_time and recorded_time, in minutes
w <- 0.0283495 # convert oz to kg
h <- 2.54 # convert inches to cm
vitals_hw <- vitals_hw %>% select(anon_id, pat_enc_csn_id_coded, admit_time, recorded_time, Height, Weight, mindiff) %>%
                mutate(Weight = round(Weight * w, 0),
                       Height = round(Height * h, 0)) %>%
                group_by(anon_id, pat_enc_csn_id_coded) %>%
                filter(mindiff == min(mindiff)) %>% slice(1L)
nrow(vitals_hw) # 41240
summary(vitals_hw %>% select(Height, Weight, mindiff))

In [None]:
write.csv(vitals_hw, "./Data/vitals_hw.csv", row.names=FALSE)
# vitals_hw <- read.csv("./Data/vitals_hw.csv")

In [None]:
# remember: age is relevant to the admit date/time
# only height, weight, and esi have missing values
cohort_demo <- read.csv("./Data/cohort_demo.csv")
nrow(cohort_demo)

# take the closest recorded date to admit_date (but only 1 record), not clean yet
demo_hw <- cohort_demo %>% 
            select(anon_id, pat_enc_csn_id_coded, admit_time, recent_height_cm, recent_weight_kg, recent_date, age) %>%
            mutate(admit_date = as.Date(admit_time),
                   recent_date = as.Date(recent_date),
                   daydiff = abs(as.numeric(difftime(admit_date, recent_date, units ="days")))) #%>%
#                 group_by(anon_id, pat_enc_csn_id_coded) %>%
#                 filter(daydiff == min(daydiff)) %>% slice(1L)
summary(demo_hw %>% select(recent_height_cm, recent_weight_kg, age, daydiff))

In [None]:
w <- 0.0283495 # convert oz to kg
h <- 2.54 # convert inches to cm

# check weight, if there are many, will process as above, NA for < 600 or for < 900 with another large weight
nrow(demo_hw %>% filter(recent_weight_kg < 600*w | recent_weight_kg > 12345*w))
demo_hw %>% filter(recent_weight_kg < 900*w | recent_weight_kg > 12345*w)
demo_hw %>% filter(anon_id == "JCcfecc2")
vitals_hw %>% filter(anon_id =="JCcfecc2")

# check height
nrow(demo_hw %>% filter(recent_height_cm < 40*h | recent_height_cm > 85*h))
demo_hw %>% filter(recent_height_cm < 40*h | recent_height_cm > 85*h)

hist(demo_hw$recent_weight_kg, breaks=100, xlim=c(0, 330), col="dodgerblue")
hist(demo_hw$recent_height_cm, breaks = 100, col = "blue")
hist(demo_hw[demo_hw$recent_height_cm < 120,]$recent_height_cm, breaks = 100, col = "blue")

In [None]:
nrow(demo_hw)
summary(demo_hw %>% select(recent_height_cm, recent_weight_kg))

In [None]:
600*w
900*w
demo_hw %>% filter(recent_weight_kg < 900*w)

In [None]:
# this is already distinct; only return NA for out-of-range heights and for weights < 600 oz
# Note that demo's weights are pretty clean: only 1 patient, 2 encounters with W > 600 but < 900 oz, but they look OK
# so we don't write the algorithm to process W as above here
demo_hw <- demo_hw %>% 
                mutate(recent_height_cm = ifelse(recent_height_cm >= 40*h & recent_height_cm <= 85*h, recent_height_cm, NA),
                       recent_weight_kg = ifelse(recent_weight_kg >= 600*w, recent_weight_kg, NA)) 
nrow(demo_hw)
nrow(demo_hw %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
summary(demo_hw)

In [None]:
# within 4 years for weight?, but height ok
nrow(demo_hw %>% filter(daydiff > 1480)) #  4years
# demo_hw %>% filter(daydiff > 1480)

hist(demo_hw$daydiff, breaks= 100, col="blue")
hist(demo_hw$daydiff[demo_hw$daydiff > 1400 & demo_hw$daydiff < 1600], breaks= 100, col="red")

In [None]:
# this patient has both H and W in both flowsheet and demo
vitals_hw %>% filter(anon_id== "JCcfecc2")
demo_hw %>% filter(anon_id== "JCcfecc2")

# this patient here is just to check different encounters
vitals_hw %>% filter(anon_id == "JCe67df3") 
demo_hw %>% filter(anon_id== "JCe67df3")

In [None]:
# if values are missing from flowsheet, impute with information in demographics
nrow(vitals_hw %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
nrow(vitals_hw %>% distinct())

nrow(demo_hw %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
nrow(demo_hw %>% distinct())

# make sure to join by anon_id and csn, not admit_time!!! lots of headache with admit time
HW <- full_join(vitals_hw, demo_hw, by=c("anon_id", "pat_enc_csn_id_coded"))

nrow(HW)
nrow(HW %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
summary(HW %>% select(Height, Weight, recent_height_cm, recent_weight_kg))

HW %>% filter(anon_id== "JCcfecc2")
HW %>% filter(anon_id== "JCe67df3")

In [None]:
# take the height and weight (closest to admit_time) of the same admission/encounter from (flowsheet)
# if missing from flowsheet, take from recent values from demographic

# demo weight: only use if within 4 years: mutate(recent_weight_kg = ifelse(daydiff > 4*365, NA, recent_weight_kg)) %>%
# take everything for now, as 4 years is still arbitrary
HW <- HW %>% mutate(Height2 = ifelse(is.na(Height), recent_height_cm, Height),
                    Weight2 = ifelse(is.na(Weight), recent_weight_kg, Weight)) %>%
                    select(anon_id, pat_enc_csn_id_coded, Height2, Height, Weight2, Weight, daydiff) %>% distinct()

nrow(HW)
nrow(HW %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
summary(HW %>% select(Height2, Height, Weight2, Weight, daydiff))
HW %>% filter(anon_id== "JCe67df3") %>% arrange(pat_enc_csn_id_coded)

In [None]:
# among those with n/a Weight from flowsheet get from demo, check how far from admit_date
summary(HW %>% filter(is.na(Weight)) %>% select(Height2, Height, Weight2, Weight, daydiff))
# 1917 is 5.25 years

summary(HW %>% filter(is.na(Height)) %>% select(Height2, Height, Weight2, Weight, daydiff))

In [None]:
# join with cohort to see how many missing Weight from flowsheet
cohort_demo_clean <- left_join(cohort_demo, HW) %>%
        rename(Weight_fs = Weight, Height_fs = Height, Weight = Weight2, Height = Height2)

nrow(cohort_demo)
nrow(cohort_demo_clean)
nrow(cohort_demo_clean %>% distinct(anon_id, pat_enc_csn_id_coded))
summary(cohort_demo_clean %>% select(Height, Height_fs, Weight, Weight_fs, daydiff)) 
# if replace everything, 358 W  vs 3years: 1805 W missing vs 4years: 954 W missing

In [None]:
# # change 0 to NA, 
# HW <- HW %>% mutate(Height = ifelse(Height==0.0, NA, Height),
#                     Weight = ifelse(Weight==0.0, NA, Weight))
# HW %>% filter(anon_id == "JCd16b07")
# summary(HW %>% select(Height, Weight, daydiff))
# # the unknown gender obs JCd1cd10 is here from vitals, ok, just be careful when joining with demo_clean

In [None]:
# save this to impute height and weight # saved with (1) on laptop
write.csv(cohort_demo_clean, file = "./Data/cohort_demo_clean.csv", row.names=FALSE)

In [None]:
colnames(cohort_demo_clean)
nrow(cohort_demo_clean)

In [None]:
summary(cohort_demo_clean %>% select(label_max24, gender, race, age, medis, English, Height, Weight))
cohort_demo_clean %>% summarise(across(everything(), n_distinct))

### Imputation for Height and Weight using just the demographics set
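
This notebook records missingness indicators (`delta_H`, `delta_W`) alongside the imputed values. The base R sketch below shows the idea on made-up numbers: the indicators are computed from the original columns, not the imputed ones. The `imp` values are stand-ins, not output from mice.

```r
# Original values with gaps, and stand-in imputed values (not real mice output).
obs <- data.frame(Height = c(170, NA, 160), Weight = c(NA, NA, 55))
imp <- data.frame(Height_i = c(170, 165, 160), Weight_i = c(72, 68, 55))

# Indicators flag which values were missing before imputation.
out <- cbind(
  imp,
  delta_H = as.integer(is.na(obs$Height)),
  delta_W = as.integer(is.na(obs$Weight))
)
out
```

Keeping the indicators lets a downstream model distinguish measured values from imputed ones.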

In [None]:
# this takes awhile(~4min)
# cohort_demo_clean <- read.csv("./Data/cohort_demo_clean.csv")
demo_clean <- cohort_demo_clean %>% select(gender, race, age, medis, Height, Weight)

md.pattern(demo_clean)
demo_mice <- mice(demo_clean, m=3, maxit=50, meth='pmm', seed=123)
demo_imp2 <- complete(demo_mice, 2)

In [None]:
summary(demo_imp2)

In [None]:
nrow(demo_imp2)
head(demo_imp2, n=1)

nrow(cohort_demo_clean)
head(cohort_demo_clean, n=1)

# here: bind the original set with the imputed set, just to compare
demo_imp2name <- demo_imp2 %>% select(Height, Weight) %>% 
                    rename(Height_i = Height, Weight_i = Weight)
cohort_demo_imputed_all <- bind_cols(cohort_demo_clean, demo_imp2name) 
head(cohort_demo_imputed_all, n=1)
colnames(cohort_demo_imputed_all)

In [None]:
# all means we have both the original values and the imputed values in this same dataset, 
# but it doesn't have the indicators of missingness
cohort_demo_imputed_all <- cohort_demo_imputed_all %>% 
    select(c(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, label_max24, admit_time, ESI, 
             gender, race, age, medis, English, Height, Height_i, Weight, Weight_i))

In [None]:
# only keep the imputed versions of Height and Weight and add indicators
cohort_demo_imputed <- cohort_demo_imputed_all %>% 
                        mutate(delta_H = ifelse(is.na(Height), 1, 0),
                               delta_W = ifelse(is.na(Weight), 1, 0)) %>%
                        select(-c(Height, Weight))
summary(cohort_demo_imputed %>% select(delta_H, delta_W))
head(cohort_demo_imputed, n=1)

In [None]:
# missing ESI
summary(cohort_demo_imputed)

In [None]:
# from the cohort_demo_clean, imputed height and weight, save the new cohort_demo table
write.csv(cohort_demo_imputed, file = "./Data/cohort_demo_imputedHW.csv", row.names=FALSE)

### ESI -- from acuity_level encounter table
* use first vital values from vitals_clean to impute ESI
* this requires the vitals_clean dataset to be finished first
* before: use cohort with at least 1 vs --> change: use cohort with a complete set of vs --> simple and complex models
* vs1st_complete.csv is already updated
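
The steps above can be sketched on toy data in base R: keep only encounters with a complete set of first vital signs, attach ESI, and flag where ESI is missing so it can be imputed afterwards. The tables `vs_toy` and `demo_toy` and all their values are hypothetical; only the column names `pat_enc_csn_id_coded`, `ESI`, and `delta_ESI` follow the notebook.

```r
# Toy first-vitals table (hypothetical values), one row per encounter
vs_toy <- data.frame(
  pat_enc_csn_id_coded = 1:5,
  SBP   = c(120, NA, 135, 110, 128),
  Pulse = c(80, 92, NA, 70, 88)
)
# Toy demographics with some missing ESI
demo_toy <- data.frame(
  pat_enc_csn_id_coded = 1:5,
  ESI = c(2, 3, NA, NA, 3)
)

# Keep only encounters with a complete set of vital signs
vs_complete <- vs_toy[complete.cases(vs_toy), ]

# Attach ESI and flag the rows where it is missing (to be imputed later)
vs_demo <- merge(vs_complete, demo_toy, by = "pat_enc_csn_id_coded", all.x = TRUE)
vs_demo$delta_ESI <- as.integer(is.na(vs_demo$ESI))
vs_demo
```

Restricting to complete vitals first means the imputation model only has to fill in ESI, not the vital signs themselves.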

In [None]:
cohort_demo %>% group_by(ESI, label_max24) %>% count() %>% group_by(ESI) %>% mutate(p = round(100* n/sum(n), 2))
# will need to impute with the first sets of vital signs

In [None]:
# cohort_demo_imputedHW %>% group_by(esi) %>% count(label_max24) # --> convert NA
cohort_demo %>% group_by(ESI, label_max24) %>%
                summarise(n_esi = n()) %>%
                ungroup() %>% 
                mutate(percent = 100*n_esi/sum(n_esi)) %>%
                select(ESI, label = label_max24, n_esi, percent)

In [None]:
# use the vs1st_complete.csv --> 41654, better cohort for imputation
# if use vs1st.csv with GCS we have 43320, but remove these so only 43291
# vs1st <- read.csv("./Data/vs1st.csv")

vs1st <- read.csv("./Data/vs1st_complete.csv")
nrow(vs1st)

# this demo is not the demo read from demographics; it is the same as cohort_demo_imputed saved above
demo <- read.csv("./Data/cohort_demo_imputedHW.csv")
nrow(demo)

summary(vs1st) # vs1st is the table read above; vs1wide is not defined in this section

In [None]:
colnames(demo)

In [None]:
vs1demo <- left_join(vs1st, demo) # vs1st is the table read from vs1st_complete.csv
nrow(vs1demo)
head(vs1demo)
colnames(vs1demo)

In [None]:
summary(vs1demo) # missing 1665 ESI

In [None]:
# this chunk takes a while (~5 min)
# m is the number of imputed data sets; the default of 5 takes too long, so m = 3 is used
# meth = 'pmm' selects the imputation method, predictive mean matching
# complete(vs1demo_mice, 2) returns the 2nd completed data set
vs1demo_imp <- vs1demo %>% select(ESI, gender, race, age, medis, Height_i, Weight_i, SBP, DBP, Pulse, RR, SpO2, Temp)

md.pattern(vs1demo_imp)
vs1demo_mice <- mice(vs1demo_imp, m=3, maxit=50, meth='pmm', seed=123)
vs1demo_imp2 <- complete(vs1demo_mice, 2)

In [None]:
summary(vs1demo_imp2)

In [None]:
# rename the imputed variables from the imputed data set with an added _i
# not doing this: SBP_i=SBP, DBP_i=DBP, Pulse_i=Pulse, RR_i=RR, SpO2_i=SpO2, Temp_i=Temp
vs1demo_imp_name <- vs1demo_imp2 %>% select(ESI_i=ESI) 

# bind the imputed ESI with the original data and add a missing indicator for ESI
vs1demo_all <- bind_cols(vs1demo, vs1demo_imp_name) %>% mutate(delta_ESI = ifelse(is.na(ESI), 1, 0))
colnames(vs1demo_all)

In [None]:
dim(vs1demo_all)
head(vs1demo_all %>% filter(delta_ESI ==1))

In [None]:
# rearrange all the columns 
cohort_demo <- vs1demo_all %>% select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, label_max24, admit_time, 
                                      ESI_i, delta_ESI, gender, race, age, medis, English,  
                                      Height_i, delta_H, Weight_i, delta_W,
                                      SBP, DBP, Pulse, RR, SpO2, Temp)
dim(cohort_demo)

### One hot coding for gender and race
* Gender: simply 1 for female and 0 for male
* Race: one-hot coding as usual
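
As a sketch of the two encodings on toy data, the block below uses base R `model.matrix` rather than `caret::dummyVars`; the data frame `toy` and its values are hypothetical, and the resulting column names (`raceAsian`, ...) may differ from those `dummyVars` produces.

```r
toy <- data.frame(
  gender = c("Male", "Female", "Female", "Male"),
  race   = factor(c("Asian", "White", "Black", "White"))
)

# Gender: 1 for female, 0 for male
toy$gender <- ifelse(toy$gender == "Male", 0, 1)

# Race: full one-hot coding; "- 1" drops the intercept so no level is left out
race_1hot <- as.data.frame(model.matrix(~ race - 1, data = toy))
toy_1hot  <- cbind(toy["gender"], race_1hot)
toy_1hot
```

Each row has exactly one race column set to 1. Note that `model.matrix` drops rows with missing `race` by default, so this sketch assumes race is complete.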

In [None]:
# 1 for female and 0 for male:
cohort_demo <- cohort_demo %>% mutate(gender = ifelse(gender == "Male", 0, 1))
summary(cohort_demo %>% select(gender, race))

In [None]:
# onehot coding for race:
dummy <- dummyVars(~ race, data = cohort_demo) # to encode more variables, use e.g. ~ gender + race
race_1hot <- data.frame(predict(dummy, newdata = cohort_demo))
cohort_demo <-  cohort_demo %>% select(-race) %>% bind_cols(race_1hot)
ncol(cohort_demo)

In [None]:
summary(cohort_demo)

In [None]:
nrow(cohort_demo)
head(cohort_demo, n=3)

In [None]:
# save file: all ESI (and first_val of vital signs imputed)
write.csv(cohort_demo, "./Data/cohort_demo_final.csv", row.names=FALSE)

# this is the correct new cohort with at least one component of vital signs
# cohort_has_vs <- cohort_demo %>% select(anon_id, pat_enc_csn_id_coded, inpatient_data_id_coded, label_max24, admit_time)

# nrow(cohort_has_vs)
# write.csv(cohort_has_vs, "./Data/cohort_has_vs.csv", row.names=FALSE)