### Description:
Explore and clean lab variables

* First select labs by lab_name (base_name also works). Then look up their base_names in the all_labs file and pull any other lab_names that share those base_names. Adding these to the original selection gives the updated labs.
* Check the units and ranges of these labs.
* Combine equivalent lab_names, convert to common units where needed, and remove some lab_names.

Filtering by base_name alone would be shorter, but it still needs checking because some equivalent labs have different base_names.

* Use ord_num_value. A value of 9999999 marks an extreme result at either end (OLD: about 5 NA and 1 "hi").
* ord_value is needed to recover the bound for 9999999 values; rows without an ord_value are removed. For example, if ord_value is <0.2 and ord_num_value is 9999999, replace 9999999 with 0.2.
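
The sentinel replacement described above can be sketched on toy data. This is only an illustration: the rows are made up, `parse_bound` is a hypothetical helper name, and base R is used so the snippet runs without the tidyverse.

```r
# Toy rows; the column names follow the notebook, the values are invented.
labs99_demo <- data.frame(
  ord_value     = c("<0.2", ">150", "N/A", "", "hi"),
  ord_num_value = rep(9999999, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip a leading "<" or ">" (optionally "<=" / ">=")
# and convert the remainder to a number. Text such as "N/A", "hi" or a
# blank cannot be converted and becomes NA.
parse_bound <- function(x) {
  suppressWarnings(as.numeric(gsub("^[[:space:]]*[<>]=?[[:space:]]*", "", x)))
}

labs99_demo$values <- parse_bound(labs99_demo$ord_value)

# Rows without a usable bound are dropped.
labs99_fixed <- labs99_demo[!is.na(labs99_demo$values), ]
labs99_fixed
```

Only the `<0.2` and `>150` rows survive, with the bound itself taking the place of 9999999.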

Output files:
* **labs99** --> pushed to BigQuery (BQ) to get the ord_value
* *labs9joined* --> intermediate file with 9999999 replaced by bounds from ord_value, then bound to the rest
* **labs_clean** --> final cleaned labs; patients with no labs are absent from this file

### Importing R libraries

In [None]:
library(caret) # import this before glmnet to avoid rlang version problem
library(xgboost)
library(data.table)
library(tidyverse)
library(lubridate)
library(Matrix)
# library(slam)
library(glmnet)
library(bit64)
# library(mtools) for one hot coder, not available on Nero or use caret or tidyr
library(mice)
options(repr.matrix.max.rows=100, repr.matrix.max.cols=20)

### LABS
2. We will use these as is. There are too many to process manually, and labs are usually reported more accurately. Errors usually arise in specimen collection, and those samples are usually redrawn and/or flagged, so counts are OK.
3. We also have a short list of labs for which more of the actual values can be used, in combination with the result_in_range flag (Y/N) and result_flag (e.g. erroneous results). The time from draw to result would also help categorize labs as Stat (within 1 hr) vs. routine, since this is not reflected in the data here (it may be available if we connect with the order proc).
4. Here are some notes about labs after checking these:
a) "result_in_range_yn" cannot be trusted --> remove it
b) There is 1 instance of Glucose by Meter = hi in ord_value
c) ord_value has text, blanks and #
d) ord_num_value has NA, 9999999, #. The blanks from ord_value are NA in ord_num_value
e) result_flag: "Abnormal" for text in ord_value, usually correct. However, when ord_value is N/A or starts with < or >, ord_num_value is 9999999.
5. ord_num_value and result_flag are reliable, so use these.
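
Item 3 suggests separating Stat from routine labs by the time between draw and result. The notebook does not implement this; the following is a minimal sketch assuming both timestamps are available as UTC strings in `taken_time_utc` and `result_time_utc`. The rows, the `turnaround_min` and `priority` columns, and the 60-minute cutoff applied as "at most 60 minutes" are illustrative assumptions.

```r
# Made-up timestamps for three results.
tat_demo <- data.frame(
  lab_name        = c("Glucose by Meter", "Sodium, Ser/Plas", "WBC"),
  taken_time_utc  = c("2019-01-01 08:00:00", "2019-01-01 08:00:00", "2019-01-01 09:15:00"),
  result_time_utc = c("2019-01-01 08:05:00", "2019-01-01 10:30:00", "2019-01-01 10:15:00"),
  stringsAsFactors = FALSE
)

taken    <- as.POSIXct(tat_demo$taken_time_utc,  tz = "UTC")
resulted <- as.POSIXct(tat_demo$result_time_utc, tz = "UTC")

# Minutes from specimen draw to result.
tat_demo$turnaround_min <- as.numeric(difftime(resulted, taken, units = "mins"))

# Label a result "stat" when it came back within one hour, otherwise "routine".
tat_demo$priority <- ifelse(tat_demo$turnaround_min <= 60, "stat", "routine")
tat_demo
```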

In [None]:
# labs filtered by cohort and prior to admit time
labs <- read.csv("./Data/lab_result.csv")
nrow(labs) #1769014
nrow(labs %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())

In [None]:
# checking all lab_name and base_name from 2019
all_labs <- read.csv("./Data/labs_all.csv")
base_names <- read.csv("./Data/labs_basename.csv")

In [None]:
# check base_name of selected lab_name, then re-query for missing other lab_name here
labs_add <- read.csv('./Data/labs_additional.csv')
nrow(labs_add)

In [None]:
colnames(labs)
colnames(labs_add)

In [None]:
## merge labs and labs_add
labs_updated <- full_join(labs, labs_add)
nrow(labs_updated) # correct 1789991 = 20977 + 1769014

In [None]:
# check all base_name with count > 10K
all_labs %>% group_by(base_name) %>% count(sort=TRUE) %>% filter(n > 10000)

In [None]:
# look at base_name as the parent and corresponding lab_name as children, from cohort and our list
options(repr.matrix.max.rows=130, repr.matrix.max.cols=20)
labs %>% group_by(base_name, lab_name) %>% count() %>% arrange(base_name, lab_name)

In [None]:
# look at individual lab name and its group base name from the cohort and our list
options(repr.matrix.max.rows=130, repr.matrix.max.cols=20)
labs %>% group_by(lab_name, base_name) %>% count() %>% arrange(lab_name, base_name)

In [None]:
options(repr.matrix.max.rows=130, repr.matrix.max.cols=20)
base_names %>% select(base_name, lab_name) %>% distinct() %>% arrange(base_name)

In [None]:
### Check additional labs after looking backward for basename
labs_add %>% group_by(base_name) %>% count(sort=TRUE)

In [None]:
labs_add %>% group_by(lab_name, base_name) %>% count(sort=TRUE)

In [None]:
# take the above lab names and put in a vector to check and group

# long_labs <- as.vector(labs$lab_name %>% unique())
ulabs <- labs %>% count(lab_name) %>% arrange(desc(n)) # labs %>% group_by(lab_name, base_name) %>% count() %>% arrange(lab_name, base_name)
# quoted <- paste0(ulabs, collapse=";")
quoted <- gsub(";", "'", ulabs$lab_name) %>% strsplit("; ")
# long_labs <- strsplit(quoted, "; ")
paste(shQuote(quoted), collapse=", ")

long_labs <- c("Platelet count",  "RBC", "WBC", "MCH", "MCHC", "MCV","Calcium, Ser/Plas", "CO2, Ser/Plas",
               "Albumin, Ser/Plas", "ALT (SGPT), Ser/Plas", "AST (SGOT), Ser/Plas", "Alk P'TASE, Total, Ser/Plas",
               "Globulin", "Total Bilirubin, Ser/Plas", "Protein, Total, Ser/Plas",
               "TROPONIN I", "Magnesium, Ser/Plas",  "Phosphorus, Ser/Plas",
               
               "Hematocrit", "Hct, ISTAT", "Hct(Calc), ISTAT",
               "Hemoglobin", "Hgb(Calc), ISTAT",
               "Potassium, Ser/Plas", "Potassium, ISTAT",  "Potassium, Whole Bld",
               "Sodium, Ser/Plas", "Sodium, ISTAT",  "Sodium, Whole Blood", 
               "Chloride, Ser/Plas",  "Chloride, ISTAT",  "Chloride, Whole Bld",
               "Creatinine, Ser/Plas", "Creatinine,ISTAT", 
               "Anion Gap", "Anion Gap, ISTAT",
               "Glucose, Ser/Plas", "Glucose,ISTAT", "Glucose, Whole Blood", "Glucose by Meter",
               "BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas",
               "Neutrophil, Absolute", "NEUT, ABS", 
               "Basophil, Absolute", "BASOS, ABS", "Basophils",
               "Eosinophil, Absolute", "EOS, ABS", 
               "Lymphocyte, Absolute", "LYM, ABS", 
               "Monocyte, Absolute", "MONO, ABS",
               "Lactate, ISTAT", "Lactate, Whole Bld", 
               "Base Excess, ISTAT", "Base Excess (vt)",
                "HCO3", "HCO3 (a), ISTAT", 
                "HCO3 (v), ISTAT", "HCO3, ISTAT", 
                "O2 Saturation (a)", "O2 Saturation, ISTAT",
                "O2 Saturation (v)",  "O2 Saturation, ISTAT (Ven)",
                "pCO2 (a)", "pCO2 (a), ISTAT",  "PCO2, ISTAT",
                "pCO2 (v)", "PCO2 (v), ISTAT",
                "pH (a)", "PH (a), ISTAT", 
                "pH (v)", "PH (v), ISTAT", "PH, ISTAT", 
                "pO2 (a)", "PO2 (a), ISTAT",  "PO2, ISTAT", 
                "pO2 (v)", "PO2 (v), ISTAT",
                "tCO2", "TCO2 (a), ISTAT","TCO2, ISTAT",
                "TCO2 (v), ISTAT", 
               "INR", "INR, ISTAT", 
               "Prothrombin Time", "PT, ISTAT", 
              "pH", "Ketone, urine", "Leukocyte Esterase, urine") #"O2 Saturation, ISTAT (Oth)", "ctO2 (a)",
length(long_labs)
nrow(ulabs)

In [None]:
# check to see if missing any important labs
labs %>% filter(!lab_name %in% long_labs) %>% select(lab_name, base_name) %>% count(lab_name, sort=TRUE)

In [None]:
# base name for only lab in our list, can query this to see if we miss any individual lab name
base_names <- labs %>% filter(lab_name %in% long_labs) %>% 
                            group_by(lab_name, base_name) %>% count() %>% arrange(base_name)
base_names

In [None]:
unique(base_names$base_name)
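
The "look backward by base_name" step can be sketched as follows: take the base_names of the selected lab_names, then list other lab_names that share them. The catalogue below is a toy stand-in for `all_labs`, and the base_name codes in it are invented.

```r
# Toy catalogue; base_name codes are made up for the example.
all_labs_demo <- data.frame(
  lab_name  = c("Sodium, Ser/Plas", "Sodium, ISTAT", "Sodium, Whole Blood",
                "Potassium, Ser/Plas", "Ferritin"),
  base_name = c("SOD", "SODI", "SOD", "POT", "FER"),
  stringsAsFactors = FALSE
)
selected <- c("Sodium, Ser/Plas", "Potassium, Ser/Plas")

# base_names covered by the selected lab_names
selected_bases <- unique(all_labs_demo$base_name[all_labs_demo$lab_name %in% selected])

# lab_names that share one of those base_names but are not selected yet
siblings <- unique(all_labs_demo$lab_name[
  all_labs_demo$base_name %in% selected_bases &
    !all_labs_demo$lab_name %in% selected
])
siblings
```

In this toy data "Sodium, ISTAT" is not picked up because it sits under a different base_name, which is why the lists still need a manual check.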

### Short list of labs

### ADD MORE LABS --> LONG LIST of LABS
short_labs = c("Glucose by Meter", "Sodium, Ser/Plas", "Potassium, Ser/Plas",
               "Magnesium, Ser/Plas", "Albumin, Ser/Plas", "Creatinine, Ser/Plas",
               "BUN, Ser/Plas", "CO2, Ser/Plas","Anion Gap",
               "Glucose, Ser/Plas", "AST (SGOT), Ser/Plas", "ALT (SGPT), Ser/Plas", # was "Glucose Ser/Plas" (missing comma), which matches no lab_name
               "Total Bilirubin, Ser/Plas", "Platelet count", "Hemoglobin", 
               "WBC", "Neutrophil, Absolute")

In [None]:
labs_add %>% group_by(lab_name, base_name) %>% select(lab_name, base_name) %>% distinct() %>% arrange(base_name)

In [None]:
# take the above lab names and put in a vector to check and group
# updated with additional lab_name, note that lab_name Glucose has 608 UGLU base_name and only 1 GLU base_name

long_labs <- c("RBC", "MCH", "MCHC", "MCV","Calcium, Ser/Plas", "CO2, Ser/Plas",
               "Albumin, Ser/Plas", "ALT (SGPT), Ser/Plas", "AST (SGOT), Ser/Plas", "Alk P'TASE, Total, Ser/Plas",
               "Globulin", "Protein, Total, Ser/Plas",
               "Magnesium, Ser/Plas",  "Phosphorus, Ser/Plas",
               
               "WBC", "WBC count",
               "TROPONIN I", "Troponin I, POCT",
               "Total Bilirubin, Ser/Plas", "Total Bilirubin",
               "Platelet count", "Platelets",
               "Hematocrit", "Hct, ISTAT", "Hct(Calc), ISTAT", "Hct (Est)", "HCT, POC", "Hematocrit (Manual Entry) See EMR for details",
               "Hemoglobin", "Hgb(Calc), ISTAT", "Hgb, calculated, POC", "HgB", 
               "Potassium, Ser/Plas", "Potassium, ISTAT",  "Potassium, Whole Bld", "Potassium, whole blood, ePOC", "Potassium", 
               "Sodium, Ser/Plas", "Sodium, ISTAT",  "Sodium, Whole Blood", 
               "Chloride, Ser/Plas",  "Chloride, ISTAT",  "Chloride, Whole Bld",
               "Creatinine, Ser/Plas", "Creatinine,ISTAT", 
               "Anion Gap", "Anion Gap, ISTAT",
               "Glucose, Ser/Plas", "Glucose,ISTAT", "Glucose, Whole Blood", "Glucose by Meter", "Glucose, Non-fasting",
               "BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas",
               "Neutrophil, Absolute", "NEUT, ABS", "Neut, ABS (Seg+Band) (man diff)", "Neutrophils, Absolute (Manual Diff)", "Neut, ABS (Seg+Band) (man diff)",
               "Basophil, Absolute", "BASOS, ABS", "Basophils", "Basophils, ABS (man diff)", "Baso, ABS (man diff)", 
               "Eosinophil, Absolute", "EOS, ABS", "Eosinophils, ABS (man diff)", "Eos, ABS (man diff)",
               "Lymphocyte, Absolute", "LYM, ABS", "Lymphocytes, ABS (man diff)", "Lym, ABS (man diff)", "Lymphocytes, Abs.",
               "Monocyte, Absolute", "MONO, ABS", "Monocytes, ABS (man diff)", "Mono, ABS (man diff)",
               "Lactate, ISTAT", "Lactate, Whole Bld", "Lactic Acid",
               "Base Excess, ISTAT", "Base Excess (vt)", "Base Excess Arterial for POC",
                "HCO3", "HCO3 (a), ISTAT", "Bicarbonate, Art for POC", 
                "HCO3 (v), ISTAT", "HCO3, ISTAT", 
                "O2 Saturation (a)", "O2 Saturation, ISTAT", "Oxygen Saturation for POC", 
                "O2 Saturation (v)",  "O2 Saturation, ISTAT (Ven)",
                "pCO2 (a)", "pCO2 (a), ISTAT",  "PCO2, ISTAT", "Arterial pCO2 for POC",
                "pCO2 (v)", "PCO2 (v), ISTAT",
                "pH (a)", "PH (a), ISTAT", "pH by Meter", "Arterial pH for POC",
                "pH (v)", "PH (v), ISTAT", "PH, ISTAT", 
                "pO2 (a)", "PO2 (a), ISTAT",  "PO2, ISTAT", "Arterial pO2 for POC",
                "pO2 (v)", "PO2 (v), ISTAT", 
                "tCO2", "TCO2 (a), ISTAT","TCO2, ISTAT", "TCO2, (ISTAT)", "CO2 Arterial Total for POC", 
                "TCO2 (v), ISTAT", 
               "INR", "INR, ISTAT", 
               "Prothrombin Time", "PT, ISTAT", 
              "pH", "Ketone, urine", "Leukocyte Esterase, urine") #"O2 Saturation, ISTAT (Oth)", "ctO2 (a)",
length(long_labs)

In [None]:
# all 3 K and Cl categories are the same... true for all electrolytes except Ca; remove "Calcium, Ion, ISTAT" (= ionized calcium), not used
# reference ranges differ, but they also vary within the same lab (121 and 135 low for ser/plas). values are comparable
# same for Hct and Hgb, and all the different white counts.
# all Creatinine,ISTAT are NA
# lactate OK for ISTAT and whole blood; Basophils not OK (not consistent)

# lab <- labs %>% select(lab_name, ord_value, ord_num_value, reference_low, reference_high, reference_unit) %>%
# #         filter(lab_name == "Base Excess (vt)") %>%
# #         filter(lab_name == "HCO3" & reference_unit != "mEq/L") %>%
#         filter(str_detect(lab_name, "Base Excess, ISTAT") & ord_num_value != 9999999) %>% 
#         drop_na(ord_num_value) # grepl("Potassium", lab_name) 

In [None]:
# check each lab in the list for reference units, low and high to see if consistent among similar lab names
lab_test <- labs_updated %>% select(lab_name, ord_num_value, reference_unit, reference_low, reference_high, base_name) %>%
                                drop_na(ord_num_value)
n_checked <- 0 # counter; not named `c` so it does not shadow base::c
for (l in long_labs){
    n_checked <- n_checked + 1
    lab <- lab_test %>% filter(lab_name == l & ord_num_value != 9999999)
    print(l)
    print(lab %>% group_by(base_name) %>% count())
    print(summary(lab %>% select(ord_num_value, reference_unit, reference_low, reference_high)))
}
# summary(lab)
print(n_checked)
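
As an alternative to printing one summary per lab, the same check can be collected into a single table with one row per lab_name and unit. This is a sketch on invented rows; the unit strings are assumptions, and base R `aggregate` stands in for the dplyr pipeline.

```r
# Invented rows; the unit strings are placeholders, not the real reference_unit values.
unit_demo <- data.frame(
  lab_name       = c("WBC", "WBC", "WBC count", "WBC count", "Basophils"),
  ord_num_value  = c(7.2, 11.5, 7200, 9999999, 0.5),
  reference_unit = c("K/uL", "K/uL", "/uL", "/uL", "%"),
  stringsAsFactors = FALSE
)

# Drop the 9999999 sentinel before summarising.
ok <- unit_demo[unit_demo$ord_num_value != 9999999, ]

# One row per lab_name and unit, with count and range of the numeric values.
unit_summary <- aggregate(
  ord_num_value ~ lab_name + reference_unit,
  data = ok,
  FUN  = function(x) c(n = length(x), min = min(x), median = median(x), max = max(x))
)
# aggregate() stores the four statistics in one matrix column; flatten it.
unit_summary <- do.call(data.frame, unit_summary)
unit_summary
```

Labs that belong together but differ by a factor of 1000, or that are reported in %, stand out when the rows are read side by side.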

In [None]:
# UPDATE lab list, cut some labs 
# Glucose has 1 GLU basename, and 608 UGLU basename --> already removed this
# be careful with Lactate has 2 basename whole blood and none

# remove Potassium --> 1 entry, 7.4 weird
# remove Basophils --> unit is %
# remove Platelets --> all NA
# remove "pH", "Ketone, urine", "Leukocyte Esterase, urine" # few, and all NA

lab_list <- c("RBC", "MCH", "MCHC", "MCV", "Calcium, Ser/Plas", "CO2, Ser/Plas",
               "Albumin, Ser/Plas", "ALT (SGPT), Ser/Plas", "AST (SGOT), Ser/Plas", "Alk P'TASE, Total, Ser/Plas",
               "Globulin", "Protein, Total, Ser/Plas",
               "Magnesium, Ser/Plas",  "Phosphorus, Ser/Plas", #14 ind
               
               "WBC", "WBC count", # WBC count is 1000*WBC
               "TROPONIN I", "Troponin I, POCT",
               "Total Bilirubin, Ser/Plas", "Total Bilirubin",
               "Platelet count", # remove Platelets
              
               "Hematocrit", "Hct, ISTAT", "Hct(Calc), ISTAT", "Hct (Est)", "HCT, POC", "Hematocrit (Manual Entry) See EMR for details",
               "Hemoglobin", "Hgb(Calc), ISTAT", "Hgb, calculated, POC", "HgB", 
               "Potassium, Ser/Plas", "Potassium, ISTAT",  "Potassium, Whole Bld", "Potassium, whole blood, ePOC", # remove Potassium
               "Sodium, Ser/Plas", "Sodium, ISTAT",  "Sodium, Whole Blood", 
               "Chloride, Ser/Plas",  "Chloride, ISTAT",  "Chloride, Whole Bld",
               "Creatinine, Ser/Plas", "Creatinine,ISTAT", 
              
               "Anion Gap", "Anion Gap, ISTAT",
               "Glucose, Ser/Plas", "Glucose,ISTAT", "Glucose, Whole Blood", "Glucose by Meter", "Glucose, Non-fasting",
               "BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas",
               "Neutrophil, Absolute", "NEUT, ABS", "Neut, ABS (Seg+Band) (man diff)", "Neutrophils, Absolute (Manual Diff)", "Neut, ABS (Seg+Band) (man diff)",
               "Basophil, Absolute", "BASOS, ABS", "Basophils, ABS (man diff)", "Baso, ABS (man diff)", # remove Basophils
               "Eosinophil, Absolute", "EOS, ABS", "Eosinophils, ABS (man diff)", "Eos, ABS (man diff)",
               "Lymphocyte, Absolute", "LYM, ABS", "Lymphocytes, ABS (man diff)", "Lym, ABS (man diff)", "Lymphocytes, Abs.",
                  # Lymphocytes, Abs. = 1000* the rest of lymphocytes
               "Monocyte, Absolute", "MONO, ABS", "Monocytes, ABS (man diff)", "Mono, ABS (man diff)",
               "Lactate, ISTAT", "Lactate, Whole Bld", "Lactic Acid",
               "Base Excess, ISTAT", "Base Excess (vt)", "Base Excess Arterial for POC",
              
                "HCO3", "HCO3 (a), ISTAT", "Bicarbonate, Art for POC", 
                "HCO3 (v), ISTAT", "HCO3, ISTAT", 
                "O2 Saturation (a)", "O2 Saturation, ISTAT", "Oxygen Saturation for POC", 
                "O2 Saturation (v)",  "O2 Saturation, ISTAT (Ven)",
                "pCO2 (a)", "pCO2 (a), ISTAT",  "PCO2, ISTAT", "Arterial pCO2 for POC",
                "pCO2 (v)", "PCO2 (v), ISTAT",
                "pH (a)", "PH (a), ISTAT", "pH by Meter", "Arterial pH for POC",
                "pH (v)", "PH (v), ISTAT", "PH, ISTAT", 
                "pO2 (a)", "PO2 (a), ISTAT",  "PO2, ISTAT", "Arterial pO2 for POC",
                "pO2 (v)", "PO2 (v), ISTAT", 
              
                "tCO2", "TCO2 (a), ISTAT","TCO2, ISTAT", "TCO2, (ISTAT)", "CO2 Arterial Total for POC", 
                "TCO2 (v), ISTAT", # 1 individual
                "INR", "INR, ISTAT", 
                "Prothrombin Time", "PT, ISTAT") # 33 more than 1
#               "pH", "Ketone, urine", "Leukocyte Esterase, urine") #"O2 Saturation, ISTAT (Oth)", "ctO2 (a)"
length(lab_list)

### Redo lab processing with the long list of labs:

In [None]:
llabs <- labs_updated %>% filter(lab_name %in% lab_list) %>% 
                    select(-c(taken_time_utc, order_time_utc)) %>%
#             select(anon_id, pat_enc_csn_id_coded, lab_name, base_name, ord_num_value, 
#                    result_flag, order_time_utc, taken_time_utc, result_time_utc) %>%
                    rename(features = lab_name, values = ord_num_value, result_time = result_time_utc) %>% 
#             mutate(feature_type = "labs", result_flag = ifelse(result_flag=="",0,1)) %>% 
                    drop_na(values) %>% distinct()
nrow(llabs)

In [None]:
# convert WBC count to WBC unit and Lymphocytes Abs. before combining
llabs <- llabs %>% mutate(values = ifelse(features == "WBC count", round(values/1000.0, 1), 
                                          ifelse(features == "Lymphocytes, Abs.", round(values/1000.0, 3), values))) 
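
If more labs need rescaling later, the nested `ifelse` can be swapped for a small table of divisors. The sketch below assumes a divisor of 1000 for the two labs converted above and leaves every other lab unchanged; the rows are invented and the rounding used above is left out for brevity.

```r
conv_demo <- data.frame(
  features = c("WBC", "WBC count", "Lymphocytes, Abs.", "LYM, ABS"),
  values   = c(7.2, 7200, 1850, 1.9),
  stringsAsFactors = FALSE
)

# Divisor per lab name; labs that are not listed keep their value.
divisor <- c("WBC count" = 1000, "Lymphocytes, Abs." = 1000)

d <- divisor[conv_demo$features]  # NA where the lab has no entry
d[is.na(d)] <- 1
conv_demo$values <- unname(conv_demo$values / d)
conv_demo
```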

In [None]:
# in range: Y or blank 
llabs %>% count(result_in_range_yn) %>% arrange(desc(result_in_range_yn))
# normal as blanks
llabs %>% count(result_flag) %>% arrange(desc(n))
head(llabs %>% count(values) %>% arrange(values)) # smallest values
tail(llabs %>% count(values) %>% arrange(values)) # largest values, including 9999999

In [None]:
# combine equivalent labs under one name and rename them in the data
# could use base_name, except that LACTATE has 2 different base_names
# the Glucose lab_name has both GLU and UGLU as base_name; excluded here, only 1 entry anyway

Platelet = c("Platelet count") # remove platelets
TBili = c("Total Bilirubin, Ser/Plas", "Total Bilirubin")
Trop = c("TROPONIN I", "Troponin I, POCT")
WBC = c("WBC", "WBC count") # WBC count /1000 7

Hct = c("Hematocrit", "Hct, ISTAT", "Hct(Calc), ISTAT", "Hct (Est)", "HCT, POC", "Hematocrit (Manual Entry) See EMR for details")
Hgb = c("Hemoglobin", "Hgb(Calc), ISTAT", "Hgb, calculated, POC", "HgB")
K = c("Potassium, Ser/Plas", "Potassium, ISTAT",  "Potassium, Whole Bld", "Potassium, whole blood, ePOC") # remove Potassium
Na = c("Sodium, Ser/Plas", "Sodium, ISTAT",  "Sodium, Whole Blood")
Cl = c("Chloride, Ser/Plas",  "Chloride, ISTAT",  "Chloride, Whole Bld")
Cr = c("Creatinine, Ser/Plas", "Creatinine,ISTAT") #22

AnionGap = c("Anion Gap", "Anion Gap, ISTAT")
Glucose = c("Glucose, Ser/Plas", "Glucose,ISTAT", "Glucose, Whole Blood", "Glucose by Meter", "Glucose, Non-fasting")
BUN = c("BUN, Ser/Plas", "BUN, ISTAT", "Urea Nitrogen,Ser/Plas") #10
Neut = c("Neutrophil, Absolute", "NEUT, ABS", "Neut, ABS (Seg+Band) (man diff)", "Neutrophils, Absolute (Manual Diff)") # duplicate "Neut, ABS (Seg+Band) (man diff)" entry dropped
Basos = c("Basophil, Absolute", "BASOS, ABS", "Basophils, ABS (man diff)", "Baso, ABS (man diff)")
Eos = c("Eosinophil, Absolute", "EOS, ABS", "Eosinophils, ABS (man diff)", "Eos, ABS (man diff)")
Lymp = c("Lymphocyte, Absolute", "LYM, ABS", "Lymphocytes, ABS (man diff)", "Lym, ABS (man diff)", "Lymphocytes, Abs.")
Mono = c("Monocyte, Absolute", "MONO, ABS", "Monocytes, ABS (man diff)", "Mono, ABS (man diff)")
Lactate = c("Lactate, ISTAT", "Lactate, Whole Bld", "Lactic Acid")
Base = c("Base Excess, ISTAT", "Base Excess (vt)", "Base Excess Arterial for POC") #28

HCO3_a = c("HCO3", "HCO3 (a), ISTAT", "Bicarbonate, Art for POC")
HCO3_v = c("HCO3 (v), ISTAT", "HCO3, ISTAT") 
O2sat_a = c("O2 Saturation (a)", "O2 Saturation, ISTAT", "Oxygen Saturation for POC")
O2sat_v = c("O2 Saturation (v)",  "O2 Saturation, ISTAT (Ven)")
pCO2_a = c("pCO2 (a)", "pCO2 (a), ISTAT",  "PCO2, ISTAT", "Arterial pCO2 for POC")
pCO2_v = c("pCO2 (v)", "PCO2 (v), ISTAT") 
pH_a = c("pH (a)", "PH (a), ISTAT", "pH by Meter", "Arterial pH for POC")
pH_v = c ("pH (v)", "PH (v), ISTAT", "PH, ISTAT")
PO2_a = c("pO2 (a)", "PO2 (a), ISTAT",  "PO2, ISTAT", "Arterial pO2 for POC")
PO2_v = c ("pO2 (v)", "PO2 (v), ISTAT") #29

TCO2_a = c("tCO2", "TCO2 (a), ISTAT","TCO2, ISTAT", "TCO2, (ISTAT)", "CO2 Arterial Total for POC")
INR = c("INR", "INR, ISTAT")
PT = c("Prothrombin Time", "PT, ISTAT") #9

llabs <- llabs %>% 
                mutate(features = 
                            ifelse(features %in% Platelet, "Platelet", ifelse(features %in% TBili, "TBili", 
                            ifelse(features %in% Trop, "Trop", ifelse(features %in% WBC, "WBC",
                            ifelse(features %in% Hct, "Hct", ifelse(features %in% Hgb, "Hgb", 
                            ifelse(features %in% K, "K", ifelse(features %in% Na, "Na",
                            ifelse(features %in% Cl, "Cl", ifelse(features %in% Cr, "Cr",
                            ifelse(features %in% AnionGap, "AnionGap", ifelse(features %in% Glucose, "Glucose",
                            ifelse(features %in% BUN, "BUN", ifelse(features %in% Neut, "Neut",    
                            ifelse(features %in% Basos, "Basos", ifelse(features %in% Eos, "Eos",      
                            ifelse(features %in% Lymp, "Lymp", ifelse(features %in% Mono, "Mono",    
                            ifelse(features %in% Lactate, "Lactate",  ifelse(features %in% Base, "Base", 
                            ifelse(features %in% HCO3_a, "HCO3_a", ifelse(features %in% HCO3_v, "HCO3_v", 
                            ifelse(features %in% O2sat_a, "O2sat_a", ifelse(features %in% O2sat_v, "O2sat_v",
                            ifelse(features %in% pCO2_a, "pCO2_a", ifelse(features %in% pCO2_v, "pCO2_v",
                            ifelse(features %in% pH_a, "pH_a", ifelse(features %in% pH_v, "pH_v",      
                            ifelse(features %in% PO2_a, "PO2_a", ifelse(features %in% PO2_v, "PO2_v",
                            ifelse(features %in% TCO2_a, "TCO2_a", ifelse(features %in% INR, "INR",
                            ifelse(features %in% PT, "PT", as.character(features))))))))))))))))))))))))))))))))))) %>%
                distinct()
# nrow(llabs)       
# llabs %>% count(features) %>% arrange(desc(n))
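
The renaming above can also be expressed as a lookup table, which avoids the deep `ifelse` nesting and keeps each group in one place. This is a sketch with three of the groups only; `groups`, `lookup`, `raw` and `recoded` are illustrative names, not objects from the notebook.

```r
# A few of the groups defined above, as a named list: label -> raw lab names.
groups <- list(
  TBili = c("Total Bilirubin, Ser/Plas", "Total Bilirubin"),
  Trop  = c("TROPONIN I", "Troponin I, POCT"),
  WBC   = c("WBC", "WBC count")
)

# Flatten to a named character vector: names are raw lab names, values are labels.
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

raw <- c("WBC count", "TROPONIN I", "RBC", "Total Bilirubin")

# Look up each raw name; names without a group keep their original value.
recoded <- unname(lookup[raw])
unmatched <- is.na(recoded)
recoded[unmatched] <- raw[unmatched]
recoded
```

The same `lookup` vector could be reused wherever the mapping is applied a second time, so the two copies cannot drift apart.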

In [None]:
nrow(llabs)

# # total 48 lab_name but 56 base_name
nrow(llabs %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())
llabs %>% select(base_name, features) %>% distinct() %>% arrange(features, base_name) 
llabs %>% group_by(features) %>% count(sort=TRUE) #%>% arrange(features, base_name) 

In [None]:
# there are no NA, but 8964 rows carry the sentinel value 9999999
value99 <- llabs %>% filter(values==9999999) 
summary(value99 %>% select(values, reference_unit, reference_low, reference_high, result_in_range_yn, result_flag))

In [None]:
value99 %>% count(features, sort=TRUE)

In [None]:
head(value99 %>% filter(features == "Trop"))

In [None]:
# to SAVE on big query, table for 9999999 values with order_id
llabs99 <- labs_updated %>% filter(lab_name %in% lab_list & ord_num_value == 9999999) %>% 
                drop_na(ord_num_value) %>% distinct() %>% mutate(features = lab_name) %>%
                mutate(features = 
                            ifelse(features %in% Platelet, "Platelet", ifelse(features %in% TBili, "TBili", 
                            ifelse(features %in% Trop, "Trop", ifelse(features %in% WBC, "WBC",
                            ifelse(features %in% Hct, "Hct", ifelse(features %in% Hgb, "Hgb", 
                            ifelse(features %in% K, "K", ifelse(features %in% Na, "Na",
                            ifelse(features %in% Cl, "Cl", ifelse(features %in% Cr, "Cr",
                            ifelse(features %in% AnionGap, "AnionGap", ifelse(features %in% Glucose, "Glucose",
                            ifelse(features %in% BUN, "BUN", ifelse(features %in% Neut, "Neut",    
                            ifelse(features %in% Basos, "Basos", ifelse(features %in% Eos, "Eos",      
                            ifelse(features %in% Lymp, "Lymp", ifelse(features %in% Mono, "Mono",    
                            ifelse(features %in% Lactate, "Lactate",  ifelse(features %in% Base, "Base", 
                            ifelse(features %in% HCO3_a, "HCO3_a", ifelse(features %in% HCO3_v, "HCO3_v", 
                            ifelse(features %in% O2sat_a, "O2sat_a", ifelse(features %in% O2sat_v, "O2sat_v",
                            ifelse(features %in% pCO2_a, "pCO2_a", ifelse(features %in% pCO2_v, "pCO2_v",
                            ifelse(features %in% pH_a, "pH_a", ifelse(features %in% pH_v, "pH_v",      
                            ifelse(features %in% PO2_a, "PO2_a", ifelse(features %in% PO2_v, "PO2_v",
                            ifelse(features %in% TCO2_a, "TCO2_a", ifelse(features %in% INR, "INR",
                            ifelse(features %in% PT, "PT", as.character(features))))))))))))))))))))))))))))))))))) %>%
                distinct() 
nrow(llabs99)
# write.csv(llabs99, "./Data/labs99.csv", row.names=FALSE)

In [None]:
nrow(value99)
# write.csv(value99, "./Data/labs99.csv", row.names=FALSE)

In [None]:
# read in the joined file for labs9
labs9joined <- read.csv("./Data/labs9joined.csv")
nrow(labs9joined)

In [None]:
head(llabs99, n=1)
head(labs9joined)

In [None]:
# OLD, new data doesn't have ord_value
# take the upper/lower bounds (old datalake2018, a few NA are there) for the 9999999 values
temp <-  seq(from=2, 2*nrow(labs9joined), by= 2)
head(temp)
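# strip the "<" / ">" prefix from ord_value to recover the numeric bound;
# text that cannot be converted becomes NA (with a coercion warning) and is dropped
labs9joined <- labs9joined %>% mutate(ord_value = as.character(ord_value)) %>%
                    mutate(values = as.double(gsub(paste(c(">", "<"), collapse = "|"),"", ord_value))) %>%
        drop_na(values) # %>% distinct() # same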
head(labs9joined)
nrow(labs9joined)

In [None]:
# JCe2f9dd
summary(labs9joined$values)
head(labs9joined %>% arrange(desc(values)))

In [None]:
labs9joined <- labs9joined %>% select(colnames(llabs))
nrow(labs9joined)
head(labs9joined)

In [None]:
# remove all 9999999 values and bind with the rows whose values were replaced
nrow(llabs)
llabs <- llabs %>% filter(values != 9999999) %>% bind_rows(labs9joined)
nrow(llabs) # removed 490 rows without any values in ord_value

In [None]:
labs_clean <- llabs %>% select(anon_id, pat_enc_csn_id_coded, features, values, result_time) %>%
                        mutate(result_time = ymd_hms(result_time))

### Join with cohort (after removing patients without vital signs)

In [None]:
nrow(labs_clean)
nrow(labs_clean %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct())

cohort <- read.csv("./Data/cohort_final.csv") #41654
nrow(cohort)
labs_clean <- inner_join(cohort, labs_clean) %>% mutate(feature_type = "labs")
nrow(labs_clean)
nrow(labs_clean %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) #39226 about 2428 pts have no labs
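
To see which encounters drop out of the inner join, the cohort can be compared with the labs on the two key columns. This is a sketch on toy data in base R; `key` is a hypothetical helper and the IDs are invented.

```r
cohort_demo <- data.frame(
  anon_id              = c("A1", "A2", "A3"),
  pat_enc_csn_id_coded = c(101, 102, 103),
  stringsAsFactors = FALSE
)
labs_demo <- data.frame(
  anon_id              = c("A1", "A1", "A3"),
  pat_enc_csn_id_coded = c(101, 101, 103),
  features             = c("Na", "K", "WBC"),
  values               = c(140, 4.1, 7.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one string per (patient, encounter) pair.
key <- function(df) paste(df$anon_id, df$pat_enc_csn_id_coded, sep = "|")

# Cohort encounters that have no lab rows at all.
no_labs <- cohort_demo[!key(cohort_demo) %in% key(labs_demo), ]
no_labs
```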

In [None]:
head(labs_clean)

In [None]:
# save the new lab table (9999999 values were replaced by ord_value bounds or removed above)
write.csv(labs_clean, "./Data/labs_clean.csv", row.names=FALSE)

### OLD ---Check overlapping cohort

In [None]:
#llabs is cohort_long_labs
noLabs <- cohort_demo_imputed %>% select(anon_id, pat_enc_csn_id_coded) %>% 
            anti_join(cohort_long_labs %>% select(anon_id, pat_enc_csn_id_coded) %>% distinct()) %>%
            mutate(LABS = "noLabs")
nrow(noLabs) # dropped from 7800 to 348!
head(noLabs, n=1) # 

In [None]:
# table noLabs noVS show how many are in each and combined
nrow(cohort_demo_imputed)
nrow(noVS)
nrow(noLabs)

VsLabs <- cohort_demo_imputed %>% select(anon_id, pat_enc_csn_id_coded, admit_time, label) %>%
                left_join(noVS) %>% left_join(noLabs)
head(VsLabs)
nrow(VsLabs)
VsLabs %>% count(VS, LABS)

cohort_VsLabs <- VsLabs %>% filter(is.na(VS) | is.na(LABS)) %>% select(-c(VS, LABS))
nrow(cohort_VsLabs)
head(cohort_VsLabs)
cohort_VsLabs %>% count(label)

noVSnoLabs <- VsLabs %>% filter(!is.na(VS) & !is.na(LABS))
nrow(noVSnoLabs)
noVSnoLabs %>% count(label)

In [None]:
nrow(cohort_VsLabs)

In [None]:
# this is the new cohort, from adjust_cohort to remove those without both VS and LAB values
write.csv(cohort_VsLabs, file = "./Data/cohort_VsLabs.csv", row.names=FALSE)

### Exploration - Labs!!!

In [None]:
slabs %>% filter(ord_value %in% c("N/A","hi") | result_flag == "Abnormal" )

In [None]:
slabs %>% filter(ord_num_value == 9999999 & reference_high == 100) # only a few

In [None]:
slabs %>% filter(result_in_range_yn == "Y") %>% count(lab_name)
head(slabs %>% filter(result_in_range_yn == ""), n=10)
slabs %>% filter(result_in_range_yn == "Y" & lab_name == "Magnesium, Ser/Plas")

In [None]:
options(repr.matrix.max.rows=900, repr.matrix.max.cols=20)
slabs %>% filter(ord_num_value == 9999999) %>% group_by(lab_name) %>% mutate(n=n()) %>% arrange(n)

In [None]:
slabs %>% count(ord_value) %>% arrange(ord_value)

In [None]:
head(slabs %>% count(ord_num_value) %>% arrange(desc(ord_num_value)))
tail(slabs %>% count(ord_num_value) %>% arrange(desc(ord_num_value)))
slabs %>% filter(ord_num_value == 6589)

In [None]:
head(slabs %>% filter(result_flag == ""), n=20)