### Description:
* Clean the flowsheet variables; the final set contains vital signs and GCS scores only
* Keep SBP/DBP (BP num_value1 and num_value2), RR, Heart Rate, SpO2, Temp (2 units, in num_value1 and num_value2), and GCS (see details below). The remaining flowsheet variables have only NA values

Inputs:
* flowsheet.csv (SQL), cohort.csv (1st pass of processed cohort from R1 notebook)
* cohort_labels (from Tiffany's) to get the final cohort with at least 1 complete set of VS
* cohort_demo_final to combine with summary stats for the simple data set

Output files: 

* **vitals_clean.csv** - cleaned VS, for joining later
* **vs1st_complete** - for ESI imputation: cohort with at least one complete first set of VS
* **cohort_final** - patients who have a complete set of vital sign values (and available labels from Tiffany's cohort)
* **data_simple** - simple data set with summary statistics of VS and demographics

### Importing R libraries

In [None]:
library(caret) # import this before glmnet to avoid rlang version problem
library(xgboost)
library(data.table)
library(tidyverse)
library(lubridate)
library(Matrix)
# library(slam)
library(glmnet)
library(bit64)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
library(mice)
options(repr.matrix.max.rows=100, repr.matrix.max.cols=20)

### Check other flowsheet variables, mainly vital signs:
   * Combine/collapse similar names into standard names for the below features
   * Look at how common these are, look at the distributions, check extreme values on either side
   * Investigate some of these to see how far off they are from the normal range before treating them as erroneous.
   * Note: num_value1 vs. num_value2, see details below
   * Replace erroneous values with NA.
   
1. Blood pressure: 
* DBP: num_value2 --> return NA if SBP > 10*DBP (12000 value in datalake2018 not here)
* There's no DBP without an SBP
* Return NA for BP if both DBP and SBP are 0
* SBP: return NA if < 30 or 33, same effect

2. Pulse: return NA if < 21

3. Temp: exist in both num_value1 and num_value2
* num_value2 is not NA only when there's num_value1. if num_value1 is NA then num_value2 is also NA
* For num_value1 <= 60, treat it as degrees Celsius
* For num_value1 > 60, treat it as Fahrenheit and convert to Celsius
* Return NA for num_value1 < 20 --> only 2: 0.1 and 9.4 (lowest 26C)

4. RR: return NA for < 4 or > 60

5. SpO2: return NA for < 40%

6. GCS: points are different from scores, only use scores
* num_value2 are the normal GCS score, num_value1 can be just points on 0-4 scale
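
The simple range rules above (SBP, Pulse, RR, SpO2) can be sketched on toy data as follows. This is only an illustration: the helper name `flag_implausible` and the data frame `toy` are hypothetical, and the sketch assumes a long-format table with columns `features` and `values`, which is how the vitals are stored later in this notebook.

```r
library(dplyr)

# Hypothetical helper: set implausible readings to NA, one rule per feature.
# Thresholds are the ones listed above (SBP < 33, Pulse < 21, RR < 4 or > 60, SpO2 < 40).
flag_implausible <- function(df) {
  df %>%
    mutate(values = case_when(
      features == "SBP"   & values < 33                ~ NA_real_,
      features == "Pulse" & values < 21                ~ NA_real_,
      features == "RR"    & (values < 4 | values > 60) ~ NA_real_,
      features == "SpO2"  & values < 40                ~ NA_real_,
      TRUE                                             ~ values  # everything else is kept
    ))
}

# Made-up readings, not taken from the cohort
toy <- data.frame(
  features = c("SBP", "SBP", "Pulse", "RR", "RR", "SpO2"),
  values   = c(120, 20, 15, 18, 75, 35)
)
cleaned <- flag_implausible(toy)
cleaned
```

Only the first SBP reading and the RR of 18 survive in this toy example. `case_when` is used here instead of nested `ifelse` purely for readability.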

In [None]:
vitals0 <- read.csv("./Data/flowsheet.csv")
cohort <- read.csv("./Data/cohort.csv")
nrow(vitals0)
nrow(cohort) # 43493

In [None]:
head(cohort, n=1)
head(vitals0, n=1)

In [None]:
vitals0 %>% group_by(units) %>% count()

In [None]:
# combine cohort with vitals sign, calculate difftime
vitals0 <- vitals0 %>% 
                select(-c(admit_time, label_max24, template, units)) %>%
                rename(recorded_time = recorded_time_utc)

vitals0 <- left_join(cohort, vitals0) %>% 
            mutate(timediff = as.numeric(difftime(admit_time, recorded_time, units = "mins"))) %>%
            distinct()
#             filter(ymd_hms(recorded_time) < ymd_hms(admit_time)) %>% # no need this one, SQL took care of this

nrow(vitals0) # 1024402 for non-distinct vs 1019404 for distinct
summary(vitals0$timediff) # all positive --> recorded time is before admit time

In [None]:
head(vitals0, n=1)

In [None]:
# check summary to compare num_value1 vs num_value2
display_summary_num1and2 <- function(df, var1, var2){
    suppressWarnings(
        df %>% summarise(n=n(), 
                        mean1 = mean({{var1}}, na.rm=T), mean2 = mean({{var2}}, na.rm=T),
                        median1 = median({{var1}}, na.rm=T), median2 = median({{var2}}, na.rm=T),
                        min1 = min({{var1}}, na.rm=T), min2 = min({{var2}}, na.rm=T), 
                        max1 = max({{var1}}, na.rm=T), max2 = max({{var2}}, na.rm=T)) %>%
                arrange(desc(n))
    )
}

In [None]:
# check the original vitals data,
# some variables have no values, temp and gcs scores have 2 different units/scales, BP has SBP and DBP
vitals0 %>% group_by(row_disp_name) %>% display_summary_num1and2(num_value1, num_value2)# %>% arrange(desc(n))

In [None]:
# rename the variables, combine similar ones
# only 1 name: SpO2 and BP
GCS = c("Glasgow", "GCS Score") # 40654 + 44 = 40698 # not using GCS points, different scale
Pulse = c("Pulse", "Heart Rate") 
RR = c("Resp", "Resp Rate") # not using Respiratory Rate, different scale

# will name BP as SBP and take num_value1 only, DBP will be processed separately
# distinct will reduce many rows, that have values with same recorded time
vitals <- vitals0 %>% rename(features = row_disp_name) %>% 
                mutate(features = ifelse(str_detect(features, paste(GCS, collapse="|")), "GCS",
                                    ifelse(features == "BP", "SBP",
                                    ifelse(features %in% Pulse, "Pulse", 
                                    ifelse(features == "SpO2", "SpO2",
                                    ifelse(str_detect(features, "Temp"), "Temp", 
                                    ifelse(features %in% RR, "RR", as.character(features)))))))) %>% # last ... in ifelse(cond, iftrue, ...)
                distinct()
nrow(vitals)       
unique(vitals$features)
fs_feats <- c("Pulse", "SpO2", "RR", "SBP", "Temp", "GCS")
vitals %>% filter(features %in% fs_feats) %>% group_by(features) %>% display_summary_num1and2(num_value1, num_value2)

### num_value1 vs num_value2:
* Temp: num_value1 (in C when num_value2 is not NA) and num_value2 (F)  vs. Temp(in C) num_value1
* BP --> SBP: num_value1; DBP: num_value2. Do not use Blood Pressure (0-2 scale)
* GCS: Glasgow Coma Scale Score's num_value1 and GCS Score's num_value1. Do not use POINTS, it's different scale 0-4
* SpO2, Pulse & Heart Rate, RR or Resp Rate (Do not use Respiratory Rate --> different scale) : num_value1

### Check Temperatures
* Only check Temp, because "Temp (in Celsius)" only has num_value1 (in C), whereas Temp has values in both C and F
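
One way to confirm that paired Temp readings agree is to convert the Fahrenheit value and compare it with the Celsius value. The sketch below is an assumption-laden illustration: the helper `temps_agree` and the tolerance of 0.2 degrees are hypothetical choices, not part of the notebook.

```r
# Hypothetical helper: TRUE where the Celsius and Fahrenheit readings describe
# the same temperature, within a tolerance (0.2 C is an arbitrary choice here).
temps_agree <- function(celsius, fahrenheit, tol = 0.2) {
  abs((fahrenheit - 32) * 5 / 9 - celsius) <= tol
}

# Made-up pairs: the third one is deliberately inconsistent
agree <- temps_agree(celsius = c(37.0, 36.6, 38.0),
                     fahrenheit = c(98.6, 97.9, 98.6))
agree
```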

In [None]:
# when num_value2 is not NA, take num_value1, equivalent, in C --> all look correct
summary(vitals0 %>% filter(row_disp_name == "Temp" & !is.na(num_value2)) %>% select(num_value1, num_value2))

# when num_value2 is NA or num_value1 is not NA, --> need further checking
summary(vitals0 %>% filter(row_disp_name == "Temp" & is.na(num_value2)) %>% select(num_value1, num_value2))
summary(vitals0 %>% filter(row_disp_name == "Temp" & !is.na(num_value1)) %>% select(num_value1, num_value2))

# when num_value1 is less than 3, num_value2 = NA
summary(vitals0 %>% filter(row_disp_name == "Temp" & num_value1 < 3) %>% select(num_value1, num_value2))

# when num_value1 is NA --> num_value2 is also NA
summary(vitals0 %>% filter(row_disp_name == "Temp" & is.na(num_value1)) %>% select(num_value1, num_value2))

In [None]:
# check when num_value2 is NA:
# for num_value1 <= 60, take it, and for num_value1 > 60, convert to C
# return NA for < 10 (or 20) --> only 2: 0.1 and 9.4,  
temp1 <- vitals0 %>% filter(row_disp_name=="Temp"  & is.na(num_value2)) %>% arrange(num_value1)
hist(temp1$num_value1, ylim=c(0, 50))

hist(temp1[temp1$num_value1<=60, ]$num_value1, xlim=c(0, 60), ylim=c(0,30))
hist(temp1[temp1$num_value1>60, ]$num_value1, xlim=c(60, 120))

# now check the other temp variable, max is 43.6
hist(vitals0[vitals0$row_disp_name=="Temp (in Celsius)" & vitals0$num_value1 < 30, ]$num_value1)
vitals0 %>% filter(row_disp_name =="Temp (in Celsius)" & num_value1 < 30) %>% arrange(desc(num_value1))

### Check GCS scores
* GCS: num_value2 is good, but GCS Score num_value1 can be on a 0-2 scale --> one option is to replace num_value1 with num_value2 up front and then use only num_value1
* GCS points are on a 0-4 scale in num_value1 --> not using these
* or, as done here, use num_value1 whenever it is >= 3 and num_value2 whenever num_value1 < 3 (the two cases do not overlap).
* If there were overlap, as with GCS points, we would need to handle it before combining the names as above
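
The selection rule can be sketched in base R as below. The helper name `pick_gcs_score` is hypothetical. Unlike a plain `ifelse(num_value1 >= 3, ...)`, this variant also falls back to num_value2 when num_value1 is missing; that should make no difference if, as the checks in this section suggest, num_value2 is never present without num_value1.

```r
# Hypothetical helper: a valid GCS total is at least 3, so keep num_value1 when
# it is >= 3 and otherwise fall back to num_value2 (which may itself be NA).
pick_gcs_score <- function(num_value1, num_value2) {
  ifelse(!is.na(num_value1) & num_value1 >= 3, num_value1, num_value2)
}

# Made-up rows: valid score, 0-2 scale value with a score in num_value2,
# both missing, and the minimum valid score
picked <- pick_gcs_score(num_value1 = c(15, 2, NA, 3),
                         num_value2 = c(NA, 14, NA, NA))
picked
```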

In [None]:
gcs <- vitals %>% filter(features == "GCS")
summary(gcs %>% select(num_value1, num_value2)) 

# if num_value1 is na, then no num_value2 --> no num_value2 without num_value1
gcs2 <- gcs %>% filter(is.na(num_value1)) %>% arrange(num_value2)
summary(gcs2$num_value2)

# when num_value1 >= 3, then all NA for num_value2 --> take all num_value1 >= 3
nrow(gcs %>% filter(num_value1 >= 3) %>% drop_na(num_value2))
# for num_value1 < 3, only 7 rows have num_value2 --> take these
gcs %>% filter(num_value1 < 3 | is.na(num_value1)) %>% drop_na(num_value2)

# if num_value2 is na, there are still num_value1
gcs1 <- gcs %>% filter(is.na(num_value2)) %>% arrange(num_value1)
summary(gcs1$num_value1)

In [None]:
# GCS: cannot < 3 --> take all num_value1 >=3 (no num_value2), otherwise, replace num_value1 by num_value2 (only 7 entries)
# keep this one separate as we will row bind later
gcs <- gcs %>% mutate(num_value1 = ifelse(num_value1 >=3, num_value1, num_value2)) %>% 
                    drop_na(num_value1) %>% mutate(features = "GCS") %>% 
                    select(-c(num_value2, timediff)) %>%
                    rename(values = num_value1)
nrow(gcs)
summary(gcs$values)

# looks for errors GCS 4 and 15 at the same time
gcs %>% filter(pat_enc_csn_id_coded %in% c(131187403487) | anon_id =="JCec2887")

# remove total GCS points solved the problem of GCS 4 and 15 at the same time
# here, check if any rows that are the same except for values
gcs %>% group_by_at(vars(-values)) %>% filter(n() > 1) 

### Check DBP and process this separately, to bind rows afterward

In [None]:
# check DBP (num_value2). Note: BP always have 2 values
nrow(vitals %>% filter(features == "SBP" & is.na(num_value1) & !is.na(num_value2)))
nrow(vitals %>% filter(features == "SBP" & is.na(num_value2) & !is.na(num_value1)))

# display some extreme values of DBP
options(repr.matrix.max.rows=150, repr.matrix.max.cols=20)
vitals %>% filter(features == 'SBP' & (num_value2 > 200 | num_value2 < 25)) %>% arrange(num_value2)

In [None]:
# again, DBP value is num_value2
summary(vitals$num_value2)
nrow(vitals %>% filter(num_value2 > 200))
nrow(vitals %>% filter(num_value2 < 15))

hist(vitals[vitals$num_value2 > 150 & vitals$num_value2 < 200,]$num_value2, breaks =50)
hist(vitals[vitals$num_value2 < 30,]$num_value2, breaks = 30)

### Clean DBP 
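
A minimal sketch of the DBP rules on toy data, assuming SBP is stored in num_value1 and DBP in num_value2 as described earlier. The data frame `bp` is made up for illustration, and the two-step filter is one possible way to express the rules, not necessarily how the cell that follows does it.

```r
library(dplyr)

# Made-up BP rows: num_value1 = SBP, num_value2 = DBP
bp <- data.frame(
  num_value1 = c(120, 0, 250, 110, 90),
  num_value2 = c(80, 0, 20, NA, 60)
)

dbp_clean <- bp %>%
  # drop zero readings; filter() also drops rows where the condition is NA
  filter(num_value1 != 0, num_value2 != 0) %>%
  # drop pairs where SBP is more than 10 times DBP
  filter(num_value1 <= 10 * num_value2)

dbp_clean
```

Here the 0/0 pair, the 250/20 pair and the row with a missing DBP are removed, leaving the 120/80 and 90/60 readings.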

In [None]:
# drop BP rows where SBP or DBP is 0 or missing (the product filter removes both), then drop DBP if SBP > 10*DBP
# will bind rows later --> so ok to filter instead of replacing by NA
DBP <- vitals %>% filter(features == "SBP" & num_value1*num_value2 !=0) %>% 
                    mutate(num_value2 = ifelse(num_value1 > 10*num_value2, NA, num_value2)) %>%
                    drop_na(num_value2) 
nrow(DBP)
summary(DBP$num_value2)
hist(DBP$num_value2, col = "dodgerblue", breaks = 125)

In [None]:
# check DBP and SBP
DBP %>% filter(num_value2 <20) %>% arrange(num_value2)

In [None]:
# change variable name and drop SBP
DBP <- DBP %>% mutate(features = "DBP") %>% 
        select(-c(num_value1, timediff)) %>%
        rename(values = num_value2) %>% distinct()

# distinct() above already removed the one duplicated row; these checks confirm none remain
nrow(DBP %>% distinct())
DBP[duplicated(DBP), ]

# rows that are the same except for values, ok to keep
DBP %>% group_by_at(vars(-values)) %>% filter(n() > 1) 

In [None]:
# how many GCS score per patient: 35 max
summary(vitals %>% filter(features == "GCS") %>% 
                    group_by(anon_id, pat_enc_csn_id_coded) %>% 
                    count(num_value1) %>% select(num_value1, n))

In [None]:
# this set, we look at num_value1 only, no DBP, but first, process GCS: take num_value1 unless:
# GCS num_value1 < 3 (on 0-2 scale), then replaced by num_value2 (only abt 7 patients)
vitals <- vitals %>% mutate(num_value1 = ifelse(features == "GCS",
                                                ifelse(num_value1 >= 3, num_value1, num_value2),
                                                num_value1))
# keep only variable in the list of VS
vitals <- vitals %>% filter(features %in% fs_feats) %>%
                        select(-c(num_value2)) %>% rename(values=num_value1) %>% distinct()

# check for distributions of these
nrow(vitals) # 929382 vs 929186
for (f in fs_feats){
    print(f)
    df = vitals %>% select(features, values) %>%  filter(features==f)
    val = df$values
    print(summary(val))
    hist(val, main = f, breaks = 100)
}

### Explore the rest of VS

In [None]:
df <- vitals %>% filter(features == "SBP")
nrow(filter(df, values > 310 | values <33))

vitals %>% filter(features == 'SBP' & (values > 300 | values < 50)) %>% arrange(values)

hist(df[df$values > 250,]$values, breaks=100)
hist(df[df$values < 50,]$values, breaks = 50)

In [None]:
df <- vitals %>% filter(features == "Pulse")
nrow(filter(df, values > 250 | values < 25)) #remove <10 only
filter(df, values > 250 | (values < 25 & values > 6)) %>% arrange(values)

hist(df[df$values > 250,]$values, breaks=100)
hist(df[df$values < 30,]$values, breaks=25)

In [None]:
df <- vitals %>% filter(features == "RR")
nrow(filter(df, values > 70 | values < 5))

filter(df, values > 70 | (values < 8 & values > 0)) %>% arrange(values)

hist(df[df$values > 50,]$values, breaks=100)
hist(df[df$values < 10,]$values, breaks=10)

In [None]:
df <- vitals %>% filter(features == "SpO2")
nrow(filter(df, values < 40))
filter(df, values < 50 & values > 10) %>% arrange(values)

hist(df[df$values < 70 & df$values > 10,]$values, breaks=35)

In [None]:
df <-vitals %>% filter(features == "Temp")
nrow(filter(df, values < 30 & values > 0))
hist(df[df$values < 30,]$values, breaks=30, xlim=c(20, 30))

In [None]:
unique(vitals$features)
summary(vitals$values)

head(vitals)

### Now clean the variables:
* Note: GCS was already processed above: use num_value1, and replace num_value1 < 3 with num_value2
* Here we clean Temp first, as it is a bit more complicated than the rest, but it only involves num_value1
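
The Temp rule can be sketched as a small base R function. The helper name `to_celsius` is hypothetical, and keeping one decimal place is a choice made in this sketch; the cut-offs of 20 and 60 are the ones used in this notebook.

```r
# Hypothetical helper: readings above 60 are assumed to be Fahrenheit and are
# converted; readings below 20 are treated as errors; the rest are kept as Celsius.
to_celsius <- function(x) {
  out <- ifelse(x > 60, (x - 32) * 5 / 9, x)
  out[!is.na(x) & x < 20] <- NA
  round(out, 1)
}

# Made-up readings: Fahrenheit, Celsius, implausibly low, missing, Fahrenheit
converted <- to_celsius(c(98.6, 37.2, 9.4, NA, 104))
converted
```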

In [None]:
# process temp: num_value2 is na, for num_value1 <= 60, take it, and for num_value1 > 60, convert to C
# replace num_value1 < 10 (or 20) --> only 2: 0.1 and 9.4 for Temp, and a bunch for Temp(in C)
vitals <- vitals %>% 
            mutate(values=ifelse(features=="Temp",                                  
                                 ifelse(values < 20, NA,
                                         ifelse(values <= 60, values, round((values - 32)*5.0/9.0, 1))), # keep 1 decimal, like the values recorded in C
                                 values))

summary(vitals %>% filter(features == "Temp") %>% select(values))

In [None]:
# remove  all NA's, by each feature, as the data is in the long format,
# keep only distinct rows, ok with recorded_time same but different values
vitals <- vitals %>% 
            mutate(values = ifelse(features == "SBP" & (values < 33), NA, # < 33 or 30 same effect & > 310 old
                             ifelse(features == "Pulse" & values < 21, NA, # 25 before
                                 ifelse(features == "RR" & (values < 4 | values > 60), NA, # 60 before (tried 70)
                                     ifelse(features == "SpO2" & values < 40, NA, # 40 before (tried 30)
                                         ifelse(features == "Temp" & values < 25, NA, values)))))) %>%  # 29 before, 26 min here
            drop_na(values) %>% distinct()

nrow(vitals)

vitals %>% count(features) %>% arrange(desc(n))
nrow(vitals %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
summary(vitals$values)

In [None]:
for (f in fs_feats){
    print(f)
    df = vitals %>% select(features, values) %>%  filter(features==f)
    val = df$values
    print(summary(val))
    hist(val, main = f, col="dodgerblue", xlim=c(min(val), max(val)), breaks= 120)
}

In [None]:
# look at duplicates rows with same recorded time, but different values. These are close, ok to keep
for (f in fs_feats){
    print(f)
    df <- vitals %>% filter(features == f) %>% group_by_at(vars(-values)) %>% filter(n() > 1)
    print(nrow(df))
}

In [None]:
vitals %>% filter(features == "SBP") %>% group_by_at(vars(-values)) %>% filter(n() > 1)
vitals %>% filter(features == "SpO2") %>% group_by_at(vars(-values)) %>% filter(n() > 1)
vitals %>% filter(features == "RR") %>% group_by_at(vars(-values)) %>% filter(n() > 1)

In [None]:
df_pulse <- vitals %>% filter(features == "Pulse") %>% group_by_at(vars(-values)) %>% filter(n() > 1) %>% 
                arrange(anon_id, pat_enc_csn_id_coded, features, recorded_time, values)
head(df_pulse, n=10)

In [None]:
df_temp <- vitals %>% filter(features == "Temp") %>% group_by_at(vars(-values)) %>% filter(n() > 1) %>% 
                arrange(anon_id, pat_enc_csn_id_coded, features, recorded_time, values)
head(df_temp, n=10)

### Combine DBP back into VS

In [None]:
head(DBP, n=1)
head(vitals, n=1)

In [None]:
nrow(DBP)
nrow(vitals)
nrow(cohort)

vitals <- vitals %>% select(-timediff)
vitals <- bind_rows(vitals, DBP) %>% mutate(feature_type = "vitals") 
summary(vitals$values)
nrow(vitals)
nrow(vitals %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())

In [None]:
head(vitals, n=1)

In [None]:
summary(vitals$values)

In [None]:
# save cohort vital signs after cleaning, no NA's here, has recorded time, used for binning
# 43230 encounters, will remove those without vital signs (except for GCS)
write.csv(vitals, file = "./Data/vitals_clean.csv", row.names=FALSE)

### Get the first set of vital signs
Use this to combine with demographic table to impute ESI, under the *features_demographicR1.ipynb*

In [None]:
vitals <- read.csv("./Data/vitals_clean.csv")

In [None]:
# earliest value of each feature per encounter (mean if several share the earliest time); takes ~7 min to run
vs1st <- vitals %>% mutate(recorded_time = ymd_hms(recorded_time)) %>% 
            group_by(anon_id, pat_enc_csn_id_coded, features) %>%
            top_n(n=-1, recorded_time) %>%
            summarise(first_val = mean(values, na.rm=TRUE)) %>% distinct()

In [None]:
nrow(vs1st)
nrow(vs1st %>% distinct(anon_id, pat_enc_csn_id_coded, features))
nrow(vs1st %>% distinct(anon_id, pat_enc_csn_id_coded)) # 43320
head(vs1st)

In [None]:
write.csv(vs1st, "./Data/vs1st.csv", row.names=FALSE)

### Get VS for imputation and summary statistics for simple data/model: USE cohort with labels from Tiffany's
* Demographic features
* Vital signs (no GCS): first and last values, summary statistics, difference from last - first and max - min
* Cohort with labels from Tiffany's and only contains those with at least a complete set of vital signs

Take the vs1st table already computed (or recompute it faster with top_n), then find the most recent values and the summary statistics.

Note that more than one value can be recorded at the first (or last) time point; take the average when this happens.
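
The tie handling can be sketched with `slice_min()` and `slice_max()`, which keep all rows tied at the extreme time so that their values can be averaged. This is an illustrative alternative that assumes dplyr 1.0.0 or later; the toy data frame is made up, and the column names follow the ones used in this notebook.

```r
library(dplyr)

# Made-up encounter with two Temp readings recorded at the same latest time
toy <- data.frame(
  pat_enc_csn_id_coded = c(1, 1, 1, 1),
  features = "Temp",
  recorded_time = as.POSIXct(c("2020-01-01 08:00:00", "2020-01-01 09:00:00",
                               "2020-01-01 09:00:00", "2020-01-01 08:30:00"),
                             tz = "UTC"),
  values = c(36.5, 37.0, 36.6, 36.9)
)

# Latest reading per encounter and feature; ties are kept and averaged
vs_last_toy <- toy %>%
  group_by(pat_enc_csn_id_coded, features) %>%
  slice_max(recorded_time, n = 1, with_ties = TRUE) %>%
  summarise(last_val = mean(values), .groups = "drop")

# Earliest reading, same idea
vs_first_toy <- toy %>%
  group_by(pat_enc_csn_id_coded, features) %>%
  slice_min(recorded_time, n = 1, with_ties = TRUE) %>%
  summarise(first_val = mean(values), .groups = "drop")

vs_last_toy
vs_first_toy
```

In this toy example the last value is the average of the two readings at 09:00, and the first value is the single reading at 08:00.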

In [None]:
# cohort_labels is Tiffany's final cohort of 43,008 encounters
# will update this to remove patients without a complete set of vital signs
# update vs1st, to redo imputation for ESI --> update features_demos_vitals
vitals <- read.csv("./Data/vitals_clean.csv")
cohort <- read.csv("./Data/cohort_labels.csv")
vs1st <- read.csv("./Data/vs1st.csv") # 291538
nrow(vitals)
nrow(cohort)
nrow(vs1st)

In [None]:
nrow(vs1st %>% drop_na() %>% distinct())
nrow(vs1st %>% drop_na() %>% distinct(pat_enc_csn_id_coded))

In [None]:
head(vitals, n=1)
head(cohort, n=1)
head(vs1st, n=1)

In [None]:
# get the most recent value of each feature, top_n(n=1, recorded_time) or slice_max
# top_n(n=-1, recorded_time) for earliest value, slice_min
vs_last <- vitals %>% mutate(recorded_time = ymd_hms(recorded_time)) %>% 
            group_by(anon_id, pat_enc_csn_id_coded, features) %>%
            top_n(n=1, recorded_time) %>%
            summarise(last_val = mean(values, na.rm=TRUE)) %>% distinct()

In [None]:
### DO NOT RUN THIS CELL, it won't produce the same result here
# this is prior to taking the mean values, we have 2 temp recorded at the same time
vs_last %>% filter(pat_enc_csn_id_coded == 131231466934) %>% arrange(features)

In [None]:
### DO NOT RUN THIS CELL
# THIS IS AFTER taking the mean values for those recorded at the same time, temp is 36.8 avg of 37 and 36.6
vs_last2 %>% filter(pat_enc_csn_id_coded == 131231466934) %>% arrange(features)

In [None]:
vs_last %>% filter(pat_enc_csn_id_coded == 131231466934) %>% arrange(features)

In [None]:
vitals %>% filter(pat_enc_csn_id_coded == 131231466934) %>% arrange(features, recorded_time)

In [None]:
nrow(cohort %>% select(pat_enc_csn_id_coded) %>% distinct())

In [None]:
# exclude GCS from first set of VS and for simple data
cohortID <- cohort %>% select(anon_id, pat_enc_csn_id_coded)
vitals <- vitals %>% select(anon_id, pat_enc_csn_id_coded, features, values) %>% 
            filter(features != "GCS")
vs1st <- vs1st %>% filter(features != "GCS") %>% drop_na()
vs_last <- vs_last %>% filter(features != "GCS") %>% 
            select(anon_id, pat_enc_csn_id_coded, features, last_val)

In [None]:
# get the cohort with a complete 1st set of VS for ESI imputation
cohort1vs <- left_join(cohortID, vs1st) %>% spread(features, first_val) %>% drop_na() 
nrow(cohort1vs %>% distinct(pat_enc_csn_id_coded)) # 40953??? if join with cohort directly 
head(cohort1vs, n=1)

In [None]:
# use this for imputation of ESI, better cohort, more complete
write.csv(cohort1vs, './Data/vs1st_complete.csv', row.names = FALSE)

In [None]:
head(cohort1vs)

In [None]:
# join first, last, and the rest of values for vitals
vitals0 <- left_join(cohortID, vs1st) %>% left_join(vs_last) %>% left_join(vitals) 
nrow(vitals0)
vitals0 %>% group_by(features) %>% count()
head(vitals0, n=1)

In [None]:
head(vitals0)

In [None]:
# get summary stats, including differences for first and last (0 if 1 value), min and max
# all the NA were actually already dropped, so na.rm here is redundant
vsum <- vitals0 %>% 
            group_by(anon_id, pat_enc_csn_id_coded, features, first_val, last_val) %>%
            summarise(count = n(), meanx = mean(values, na.rm=TRUE), medianx = median(values, na.rm=TRUE), 
                      minx = min(values, na.rm=TRUE), maxx = max(values, na.rm=TRUE), sdx = sd(values, na.rm=TRUE),
                      madx = mad(values, na.rm=TRUE), IQRx = IQR(values, na.rm=TRUE)) %>%
            mutate(mmdiff = round(maxx - minx, 1), fldiff = round(last_val - first_val, 1)) #

In [None]:
head(vsum)

In [None]:
nrow(vsum)
head(vsum, n=1)
summary(vsum)

In [None]:
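# replace NA in sdx with 0 (sd is undefined for a single value: n - 1 = 0 in the denominator)
# reshape: gather the stat columns to long, unite feature + stat names, then spread back to wide
# drop rows with any NA --> keep only encounters with a complete set of vital signs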
vsum_wide <- vsum %>% mutate(sdx = ifelse(is.na(sdx), 0, sdx)) %>%
                gather(variable, value, first_val:fldiff) %>%
                unite(temp, features, variable) %>%
                spread(temp, value) %>% drop_na()

In [None]:
# remove further 1354 patients from 43,008 = 41654
colnames(vsum_wide)
nrow(vsum_wide %>% select(pat_enc_csn_id_coded) %>% distinct()) 
summary(vsum_wide)

### Get the dataset for simple models:
A cohort with complete set of VS, with labels (43008 --> 41654), with the following features
* get back the demographics (with imputed ESI using 1st set of vs)
* vital signs (first values and summary statistics) only (no GCS)
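
The wide, complete-case layout this dataset relies on (one row per encounter, one column per feature and statistic) can be sketched with `pivot_longer()` and `pivot_wider()`. This is an illustrative alternative that assumes tidyr 1.0.0 or later; the toy table and the column name `feature_stat` are made up.

```r
library(dplyr)
library(tidyr)

# Made-up long summary table: encounter 2 has no RR row
vsum_toy <- data.frame(
  pat_enc_csn_id_coded = c(1, 1, 2),
  features  = c("Pulse", "RR", "Pulse"),
  first_val = c(80, 18, 95),
  meanx     = c(82, 17, 97)
)

vsum_wide_toy <- vsum_toy %>%
  # stack the statistic columns into name/value pairs
  pivot_longer(c(first_val, meanx), names_to = "variable", values_to = "value") %>%
  # build column names such as Pulse_first_val
  unite("feature_stat", features, variable) %>%
  pivot_wider(names_from = feature_stat, values_from = value) %>%
  # keep only encounters that have every feature
  drop_na()

vsum_wide_toy
```

Encounter 2 is dropped because it has no RR values, which mirrors how encounters without a complete set of vital signs leave the cohort.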

In [None]:
demos <- read.csv("./Data/cohort_demo_final.csv")
nrow(demos)
colnames(demos)

In [None]:
# add demographic features to this data with vital signs
demos <- demos %>% select(-c(inpatient_data_id_coded, label_max24, admit_time))
data_simple <- left_join(vsum_wide, demos)
dim(data_simple)
nrow(data_simple %>% select(pat_enc_csn_id_coded) %>% distinct())
colnames(data_simple)
summary(data_simple)

In [None]:
# update cohort with labels to include only patients with a complete set of VS
# cohort <- read.csv("./Data/cohort_labels.csv")
cohort_final <- data_simple %>% select(anon_id, pat_enc_csn_id_coded) %>% left_join(cohort) %>%
                select(-c(int64_field_0))

dim(cohort_final)
head(cohort_final, n=1)

In [None]:
colnames(cohort_final)
summary(cohort_final)

In [None]:
# update data_simple to include labels:
data_simple <- left_join(cohort_final, data_simple)
dim(data_simple)
colnames(data_simple)

In [None]:
write.csv(cohort_final, './Data/cohort_final.csv', row.names = FALSE)

In [None]:
write.csv(data_simple, "./Data/data_simple.csv", row.names=FALSE)

### EXTRA

In [None]:
# did not miss any BP
added_vs <- read.csv("./Data/added_vs.csv")
nrow(added_vs)
added_vs %>% count(row_disp_name)
head(added_vs)