# Format lab results

Here, we'll use the `dplyr` package in R to join and arrange our lab data along with sample metadata. [`dplyr`'s verbs](https://dplyr.tidyverse.org/) make this process a lot easier than doing similar work in Python's `pandas` library.

In [1]:
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Read and select metadata

In [3]:
meta_csv <- "output/br1-br2_hise_metadata_2024-02-03.csv"
meta <- read.csv(meta_csv)

We won't need all of the columns provided by HISE for our analysis. Let's keep columns related to subject and sample, which have "subject" and "sample" prefixes.

The column `subject.id` gets assigned per HISE project, so we can have the same sample listed multiple times if we keep this column. We'll drop it for this use.

In [4]:
nrow(meta)

In [5]:
meta <- meta %>%
  select(starts_with("subject"), starts_with("sample")) %>%
  select(-subject.id) %>%
  unique()
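
To see why dropping `subject.id` matters, here is a minimal sketch with made-up values. The `meta_demo` data frame and its IDs are hypothetical, not HISE data. The same sample appears under two project-specific `subject.id` values and collapses to one row once that column is removed.

```r
library(dplyr)

# Hypothetical metadata: one sample registered under two project-specific subject IDs
meta_demo <- data.frame(
  subject.id = c(101, 202),
  subject.subjectGuid = c("BR0001", "BR0001"),
  sample.sampleKitGuid = c("KT00001", "KT00001"),
  other.column = c("a", "b"),
  stringsAsFactors = FALSE
)

# Keeping subject.id leaves both rows distinct
with_id <- meta_demo %>%
  select(starts_with("subject"), starts_with("sample")) %>%
  unique()

# Dropping subject.id lets unique() collapse them to a single row
without_id <- meta_demo %>%
  select(starts_with("subject"), starts_with("sample")) %>%
  select(-subject.id) %>%
  unique()

nrow(with_id)
nrow(without_id)
```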

In [6]:
nrow(meta)

In [7]:
names(meta)

## Read labs

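Lab names contain spaces and parentheses. By default, `read.csv` converts column names like these into syntactically valid R names, so we'll set the `read.csv` parameter `check.names = FALSE` to keep the lab names exactly as provided.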

In [8]:
labs_csv <- "output/br1-br2_hise_labs_2024-02-03.csv"
labs <- read.csv(labs_csv, check.names = FALSE, row.names = 1)
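
As a small illustration of what `check.names` changes, the sketch below reads the same inline CSV text both ways. The kit ID and values are made up for the example; only the lab names are taken from this notebook.

```r
# Hypothetical one-row CSV with lab-style column names
csv_text <- "sampleKitGuid,Body Mass Index (BMI),Alanine Transaminase (ALT)\nKT00001,22.5,18"

# Default behavior: names are passed through make.names()
mangled <- read.csv(text = csv_text)

# check.names = FALSE keeps the original header text
preserved <- read.csv(text = csv_text, check.names = FALSE)

names(mangled)
names(preserved)
```

With the default, spaces and parentheses are replaced by dots, so the names would no longer match the `lab` column of a lookup table keyed on the original lab names.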

## Assign groups

There are a TON of different lab results, and they're provided alphabetically. To make them easier to parse, we'll group them based on the category of assay.

We have a table that includes the group for each lab, along with simplified column names that make these data easier to use in computational work.

In [9]:
lab_groups <- read.csv("br1-br2_clinical_lab_groups.csv")

In [10]:
head(lab_groups)

| | category_name | lab | column_name |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` |
| 1 | Anthropometric measures | Body Mass Index (BMI) | am.bmi |
| 2 | Anthropometric measures | Height | am.height |
| 3 | Anthropometric measures | Weight | am.weight |
| 4 | Blood Chemistry | Alanine Transaminase (ALT) | chem.alt |
| 5 | Blood Chemistry | Albumin | chem.albumin |
| 6 | Blood Chemistry | Alkaline Phosphatase | chem.alkaline_phosphatase |


In [11]:
old_names <- names(labs)
new_names <- old_names
# Swap each lab's display name for its simplified column name, if one is defined
for (i in seq_along(old_names)) {
    hit <- match(old_names[i], lab_groups$lab)
    if (!is.na(hit)) {
        new_names[i] <- lab_groups$column_name[hit]
    }
}
# Prefix the ID columns so they match the metadata column names
new_names <- sub("sampleKitGuid", "sample.sampleKitGuid", new_names)
new_names <- sub("subjectGuid", "subject.subjectGuid", new_names)

In [12]:
names(labs) <- new_names
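
The renaming loop above can also be written as a vectorized lookup. This is a sketch on toy inputs, not the code used in this notebook; `lab_groups_demo` and `old_names_demo` are hypothetical stand-ins for `lab_groups` and `names(labs)`.

```r
# Hypothetical lookup table and column names
lab_groups_demo <- data.frame(
  lab = c("Height", "Weight"),
  column_name = c("am.height", "am.weight"),
  stringsAsFactors = FALSE
)
old_names_demo <- c("sampleKitGuid", "Height", "Unmapped Lab", "Weight")

# match() returns the row of the lookup table for each name, or NA if absent
hit <- match(old_names_demo, lab_groups_demo$lab)

# Keep the original name wherever no simplified name is defined
new_names_demo <- ifelse(is.na(hit), old_names_demo, lab_groups_demo$column_name[hit])

new_names_demo
```

Names without an entry in the lookup table pass through unchanged, which mirrors the `%in%` check in the loop.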

We'll drop the columns we don't need, and arrange the columns by name so that they fall into their categories.

In [15]:
keep_columns <- c("subject.subjectGuid", "sample.sampleKitGuid", intersect(lab_groups$column_name, names(labs)))
labs <- labs[,keep_columns]

In [16]:
labs <- labs[,sort(names(labs))]
labs <- labs %>%
  select(subject.subjectGuid, sample.sampleKitGuid, everything()) %>%
  arrange(subject.subjectGuid, sample.sampleKitGuid)

In [17]:
names(labs)

## Clean up duplicates

We should have just one row of metadata and one row of labs per sample. If we have multiple rows, we need to clean them up.

In [18]:
meta <- unique(meta)

In [19]:
nrow(meta)

Some duplicate entries are missing values. Remove rows that are missing either of these:  
`sample.visitDetails`  
`sample.daysSinceFirstVisit`

In [20]:
meta <- meta %>%
  filter(sample.visitDetails != "") %>%
  filter(!is.na(sample.daysSinceFirstVisit))
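
Note that `filter()` keeps only rows where the condition is `TRUE`, so rows where a comparison evaluates to `NA` are dropped as well. A small sketch with made-up values (the `visits_demo` data frame and its column names are hypothetical):

```r
library(dplyr)

# Hypothetical visits: one blank details value, one missing day count
visits_demo <- data.frame(
  kit = c("KT00001", "KT00002", "KT00003"),
  details = c("Visit 1", "", "Visit 2"),
  days = c(0, 7, NA),
  stringsAsFactors = FALSE
)

# A comparison against NA yields NA, which filter() treats like FALSE
by_comparison <- visits_demo %>% filter(days >= 0)

# The explicit test keeps the same rows and states the intent directly
by_is_na <- visits_demo %>% filter(!is.na(days))

# Applying both conditions leaves only the complete row
complete_demo <- visits_demo %>% filter(details != "", !is.na(days))
```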

A few kits are still duplicated because their `sample.daysSinceFirstVisit` values differ by 1 day. Since I'm not sure which value is correct, we'll just take the first row for each kit.

In [21]:
meta <- meta %>%
  group_by(sample.sampleKitGuid) %>%
  slice(1) %>%
  ungroup()
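
Before taking the first row per kit, it can be useful to look at which kits are duplicated and how their rows differ. The sketch below shows one way to do that on toy data; `meta_dup_demo` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical metadata: KT00001 appears twice with day counts that differ by 1
meta_dup_demo <- data.frame(
  sample.sampleKitGuid = c("KT00001", "KT00001", "KT00002"),
  sample.daysSinceFirstVisit = c(7, 8, 0),
  stringsAsFactors = FALSE
)

# Kits that occur on more than one row
dup_kits <- meta_dup_demo %>%
  count(sample.sampleKitGuid, name = "n_rows") %>%
  filter(n_rows > 1)

# All rows belonging to those kits, for side-by-side inspection
dup_rows <- meta_dup_demo %>%
  semi_join(dup_kits, by = "sample.sampleKitGuid")

dup_rows
```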

In [22]:
nrow(meta)

In [23]:
labs <- unique(labs)

Some entries are duplicates that are missing all of their lab values. We can find and remove these by counting the number of missing labs in each row, then keeping the row with the fewest missing values for each sample.

In [24]:
count_missing <- function(x) {
    sum(is.na(x) | x == "")
}

In [25]:
labs <- labs %>%
  mutate(n_missing = apply(labs, 1, count_missing))

In [26]:
labs <- labs %>%
  group_by(sample.sampleKitGuid) %>%
  arrange(n_missing) %>%
  slice(1) %>%
  ungroup()
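
The same idea can be sketched on toy data using `rowSums()` for the count and `slice_min()` for the selection. This is an alternative formulation under assumed inputs, not the code used in this notebook; `labs_demo` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical labs: KT00001 has one empty row and one populated row
labs_demo <- data.frame(
  sample.sampleKitGuid = c("KT00001", "KT00001", "KT00002"),
  chem.alt = c(NA, 18, 25),
  chem.albumin = c(NA, 4.1, NA),
  stringsAsFactors = FALSE
)

# Count NA or empty-string values across the lab columns of each row
value_cols <- setdiff(names(labs_demo), "sample.sampleKitGuid")
values <- labs_demo[value_cols]
labs_demo$n_missing <- unname(rowSums(is.na(values) | values == ""))

# Keep the single row with the fewest missing values per kit
best_rows <- labs_demo %>%
  group_by(sample.sampleKitGuid) %>%
  slice_min(n_missing, n = 1, with_ties = FALSE) %>%
  ungroup()

best_rows
```

Setting `with_ties = FALSE` guarantees exactly one row per kit even when two rows have the same number of missing values.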

In [27]:
nrow(labs)

## Join metadata and labs

In [28]:
all_results <- meta %>%
  left_join(labs) %>%
  arrange(subject.subjectGuid, sample.sampleKitGuid)

Joining with `by = join_by(subject.subjectGuid, sample.sampleKitGuid)`


In [29]:
head(all_results)

| subject.subjectGuid | subject.cohort | subject.biologicalSex | subject.race | subject.ethnicity | subject.birthYear | subject.ageAtEnrollment | sample.visitName | sample.visitDetails | sample.sampleGuid | ⋯ | infl.rf_iga_result | infl.rf_igm_interpretation | infl.rf_igm_result | lip.chlesterol_hdl_ratio | lip.cholesterol_hdl | lip.cholesterol_ldl | lip.cholesterol_non_hdl | lip.cholesterol_total | lip.triglycerides | n_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | ⋯ | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| BR1001 | BR1 | Female | Caucasian | Non-Hispanic origin | 1987 | 32 | Flu Year 1 Day 0 | N/A - Flu-Series Timepoint Only | 599 | ⋯ | 0.0 | Negative | 5.902 | 2.6 | 65.0 | 85.0 | 104.0 | 169.0 | 94.0 | 5 |
| BR1001 | BR1 | Female | Caucasian | Non-Hispanic origin | 1987 | 32 | Flu Year 1 Day 7 | N/A - Flu-Series Timepoint Only | 729 | ⋯ | | | | | | | | | | 58 |
| BR1001 | BR1 | Female | Caucasian | Non-Hispanic origin | 1987 | 32 | Flu Year 1 Day 90 | N/A - Flu-Series Timepoint Only | 1811 | ⋯ | | | | | | | | | | 56 |
| BR1002 | BR1 | Male | Caucasian | Non-Hispanic origin | 1991 | 28 | Flu Year 1 Day 0 | N/A - Flu-Series Timepoint Only | 641 | ⋯ | | | | 2.8 | 56.0 | 72.0 | 99.0 | 155.0 | 197.0 | 11 |
| BR1002 | BR1 | Male | Caucasian | Non-Hispanic origin | 1991 | 28 | Flu Year 1 Day 7 | N/A - Flu-Series Timepoint Only | 730 | ⋯ | | | | 2.5 | 61.0 | 80.0 | 93.0 | 154.0 | 58.0 | 11 |
| BR1002 | BR1 | Male | Caucasian | Non-Hispanic origin | 1991 | 28 | Flu Year 1 Day 90 | N/A - Flu-Series Timepoint Only | 1808 | ⋯ | | | | 2.3 | 65.0 | 71.0 | 84.0 | 149.0 | 45.0 | 11 |
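
The join above lets `dplyr` pick the key columns automatically. As a hedged sketch on toy data, the keys can also be named explicitly, and `anti_join()` can list the metadata rows that have no matching labs. The data frames and IDs below are hypothetical.

```r
library(dplyr)

# Hypothetical metadata with three kits
meta_join_demo <- data.frame(
  subject.subjectGuid = c("BR0001", "BR0001", "BR0002"),
  sample.sampleKitGuid = c("KT00001", "KT00002", "KT00003"),
  stringsAsFactors = FALSE
)

# Hypothetical labs covering only two of those kits
labs_join_demo <- data.frame(
  subject.subjectGuid = c("BR0001", "BR0002"),
  sample.sampleKitGuid = c("KT00001", "KT00003"),
  chem.alt = c(18, 25),
  stringsAsFactors = FALSE
)

join_keys <- c("subject.subjectGuid", "sample.sampleKitGuid")

# Explicit keys: no "Joining with" message, and no surprise keys if columns change
joined_demo <- meta_join_demo %>%
  left_join(labs_join_demo, by = join_keys)

# Metadata rows without any lab results
no_labs <- meta_join_demo %>%
  anti_join(labs_join_demo, by = join_keys)
```

A left join keeps every metadata row, so kits without labs remain in the result with `NA` in the lab columns.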


## Write assembled lab results

In [30]:
out_file <- paste0("output/br1-br2_assembled_labs_", Sys.Date(), ".csv")
write.csv(
    all_results,
    out_file,
    row.names = FALSE,
    quote = TRUE
)

In [31]:
sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4

loaded via a namespace (and not attached):
 [1] digest_0.6.34    IRdisplay_1.1    R6_2.5.1         utf8_1.2.4      
 [5] base64enc_0.1-3  fastmap_1.1.1    tidyselect_1.2.0 magrittr_2.0.3  
 [9] glue_1.7.0       tibble_3.2.1     pkgconfig_2.0.3  htmltools_0.5.7 
[13] generics_0.1.3   repr_1.1.6.9