# UK Biobank lipids phenotypes and covariates

In this notebook we review and explore the available UK Biobank data for lipids phenotypes and covariates.

TODOs
* check that the assays used here are comparable to the data from the AoU measurements
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30690
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30760
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30780
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30870
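  * The AoU and UKB measures use different units (mg/dL and mmol/L). A single factor of 18 (`mg_dl = 18 * mmol_L`) is the glucose conversion; lipids need 38.67 for the cholesterol measures (total, HDL, LDL) and 88.57 for triglycerides. Converting with 18 would explain at least part of the noticeable difference between the median values for the corresponding measurements in each cohort.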
  * This needs a correction and/or explanation.
  * Margaret is going to take a closer look
* use lower and upper bound cutoffs appropriate for each measurement
* add in the cohort query for WGS samples
* also determine the relevant statin phenotypes so that we can correct for statin use:
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20003
  * per Margaret, statin use for UKB was documented at two time points
* incorporate exclusion criteria
* other issues to decide
  * the NIH has a newer calculation for LDL, we should decide whether to use it
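
As a reference for the unit question in the TODOs above, here is a minimal sketch of a per-measurement conversion from mmol/L to mg/dL. The factors 38.67 (cholesterol) and 88.57 (triglycerides) are the standard molar-mass conversions. The measurement labels and the helper name `mmol_l_to_mg_dl` are assumptions made for illustration, and may not match the labels produced from the real column names.

```r
# Conversion factors from mmol/L to mg/dL.
# Cholesterol measures (total, HDL, LDL) use 38.67; triglycerides use 88.57.
# The labels below are assumed for illustration only.
MMOL_L_TO_MG_DL <- c(
  Cholesterol = 38.67,
  HDL_cholesterol = 38.67,
  LDL_direct = 38.67,
  Triglycerides = 88.57
)

# Convert mmol/L values to mg/dL using each value's measurement label.
# Unknown labels yield NA rather than a silently wrong value.
mmol_l_to_mg_dl <- function(mmol_l, measurement) {
  unname(mmol_l * MMOL_L_TO_MG_DL[as.character(measurement)])
}

# Made-up values, not UK Biobank data.
example <- data.frame(
  measurement = c('Cholesterol', 'HDL_cholesterol', 'LDL_direct', 'Triglycerides'),
  mmol_L = c(5.0, 1.5, 3.0, 1.5)
)
example$mg_dl <- mmol_l_to_mg_dl(example$mmol_L, example$measurement)
print(example)
```

Returning `NA` for an unrecognized label is a deliberate choice in this sketch: it makes an unexpected measurement visible in the missing-value counts instead of converting it with the wrong factor.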


Friedewald formula:
`LDL-c (mg/dL) = TC (mg/dL) − HDL-c (mg/dL) − TG (mg/dL)/5`

1. LDL adjustment based on TG/LDL values
   1. `If TG > 400, then LDL = NA`
   2. `If LDL < 10, then LDL = NA`
2. LDL and TC adjustment based on statin (lipid-lowering medication) use
   1. `If STATIN is used, LDL_ADJ = LDL/0.7`
   2. `If STATIN is used, TOTAL_ADJ = TC/0.8`
3. TG adjustment
   1. `TG_LOG = log(TG)`
4. Calculation of residuals, adjusting for covariates
   1. Example for LDL: `tmp.ldl$LDL_ADJ.resid <- resid(lm(LDL_ADJ ~ sex+age+age2+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11, data = tmp.ldl))`
5. Normalization
   1. Example for LDL: `tmp.ldl$LDL_ADJ.norm <- sd(tmp.ldl$LDL_ADJ)*scale(qnorm((rank(tmp.ldl$LDL_ADJ.resid,na.last="keep")-0.5)/length(tmp.ldl$LDL_ADJ.resid)))`
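
The steps above can be sketched end to end on simulated data. This is an illustration under stated assumptions, not the final pipeline. The data frame `lipids` and its columns are hypothetical, only two principal components are used to keep the example short, and the data are random draws rather than UK Biobank values. Two details differ from the one-line examples above and are assumptions of this sketch: `na.action = na.exclude` keeps the residual vector aligned with the input rows when LDL is missing, and the rank is divided by the number of non-missing residuals rather than the full vector length.

```r
set.seed(42)

# Simulated data for illustration only; all column names are hypothetical.
n <- 500
lipids <- data.frame(
  TC = rnorm(n, 200, 35),             # total cholesterol, mg/dL
  HDL = rnorm(n, 55, 12),             # HDL cholesterol, mg/dL
  TG = exp(rnorm(n, log(120), 0.5)),  # triglycerides, mg/dL
  STATIN = rbinom(n, 1, 0.2) == 1,    # TRUE if on a statin
  sex = rbinom(n, 1, 0.5),
  age = runif(n, 40, 70),
  PC1 = rnorm(n),
  PC2 = rnorm(n)
)
lipids$age2 <- lipids$age^2
lipids$TG[1] <- 450  # force one value above the TG threshold to exercise step 1

# Friedewald estimate (all inputs in mg/dL).
lipids$LDL <- lipids$TC - lipids$HDL - lipids$TG / 5

# Step 1: LDL is set to NA when TG > 400 or LDL < 10.
lipids$LDL[which(lipids$TG > 400 | lipids$LDL < 10)] <- NA

# Step 2: adjust LDL and TC for statin use.
lipids$LDL_ADJ <- ifelse(lipids$STATIN, lipids$LDL / 0.7, lipids$LDL)
lipids$TOTAL_ADJ <- ifelse(lipids$STATIN, lipids$TC / 0.8, lipids$TC)

# Step 3: log-transform TG.
lipids$TG_LOG <- log(lipids$TG)

# Step 4: residuals after adjusting for covariates.
# na.exclude pads the residuals with NA so they line up with the rows of lipids.
fit <- lm(LDL_ADJ ~ sex + age + age2 + PC1 + PC2,
          data = lipids, na.action = na.exclude)
lipids$LDL_ADJ.resid <- resid(fit)

# Step 5: rank-based inverse normal transform, rescaled to the SD of LDL_ADJ.
n_obs <- sum(!is.na(lipids$LDL_ADJ.resid))
ranks <- rank(lipids$LDL_ADJ.resid, na.last = 'keep')
lipids$LDL_ADJ.norm <- sd(lipids$LDL_ADJ, na.rm = TRUE) *
  as.numeric(scale(qnorm((ranks - 0.5) / n_obs)))

summary(lipids[, c('LDL', 'LDL_ADJ', 'LDL_ADJ.resid', 'LDL_ADJ.norm')])
```

Rows excluded in step 1 stay `NA` through every later step, so the output can be joined back to the participant table without losing row alignment.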

# Setup

<div class="alert alert-block alert-warning">
This notebook is intended to run on the UK Biobank Research Analysis Platform; it uses the <kbd>dx</kbd> command-line tool to retrieve its input.
</div>

In [None]:
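# Install any missing packages; invisible() suppresses the list of NULLs that lapply returns.
invisible(lapply(c('skimr', 'tidyverse'),
       function(pkg) { if (!requireNamespace(pkg, quietly = TRUE)) { install.packages(pkg) } }))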

In [None]:
library(skimr)
library(tidyverse)

In [None]:
## Plot setup.
theme_set(theme_bw(base_size = 16)) # Default theme for plots.

#' Returns a data frame with a y position and a label, for use annotating ggplot boxplots.
#'
#' @param y A numeric vector holding the y values of one boxplot group, as passed by stat_summary.
#' @return A one-row data frame with column y as the group maximum and column label as 'N = <count>'.
get_boxplot_fun_data <- function(y) {
  return(data.frame(y = max(y), label = stringr::str_c('N = ', length(y))))
}

## Retrieve and load the data extract

<div class="alert alert-block alert-warning">
This section assumes that the CSV created by notebook <kbd>ukb_lipids_phenotypes_retrieval.ipynb</kbd> is available.
</div>

In [None]:
system('dx download lipids.csv', intern = TRUE)

In [None]:
pheno <- read_csv('lipids.csv')

In [None]:
skim(pheno)

In [None]:
# At this time, we are only interested in the first instance.
pheno <- pheno %>% select(eid, contains('_i0_'))

In [None]:
skim(pheno)

In [None]:
options(repr.plot.height = 12, repr.plot.width = 16)

pheno %>%
    ggplot(aes(x = cut_width(p21003_i0_Age_when_attended_assessment_centre_years, width = 10, boundary = 0), y = `p30690_i0_Cholesterol_mmol/L`)) +
    geom_boxplot() +
    stat_summary(fun.data = get_boxplot_fun_data, geom = 'text', size = 4,
                 position = position_dodge(width = 0.9), vjust = -0.8) +
    coord_flip() +
    xlab('age') +
    labs(title = 'Instance 0 cholesterol measurement per person, by age',
         caption = 'Source: UK Biobank data')

In [None]:
table(pheno$p30693_i0_Cholesterol_correction_level, useNA = 'always')

In [None]:
table(pheno$p30694_i0_Cholesterol_correction_reason, useNA = 'always')

In [None]:
table(pheno$p30695_i0_Cholesterol_missing_reason, useNA = 'always')

In [None]:
table(pheno$p30692_i0_Cholesterol_aliquot, useNA = 'always')

In [None]:
pheno %>%
    filter(!is.na(`p30690_i0_Cholesterol_mmol/L`)) %>%
    group_by(p30695_i0_Cholesterol_missing_reason) %>%
    summarize(count = n())

# Pivot and plot the data 

In [None]:
assay <- pheno %>%
    select(eid, p21003_i0_Age_when_attended_assessment_centre_years, ends_with('mmol/L')) %>%
    pivot_longer(
        cols = ends_with('mmol/L'),
        names_to = c('instance', 'measurement'),
        names_pattern = 'p\\d+_(i\\d+)_(.*)_mmol/L',
        values_to = 'mmol_L') %>%
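    # Convert to the units used by AoU measurements (mg/dL). The factor depends on the analyte:
    # 38.67 for the cholesterol measures (total, HDL, LDL) and 88.57 for triglycerides.
    mutate(mg_dl = mmol_L * if_else(str_detect(measurement, regex('triglyceride', ignore_case = TRUE)), 88.57, 38.67)) %>%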
    inner_join(
      pheno %>%
      select(eid, p21003_i0_Age_when_attended_assessment_centre_years, ends_with('correction_reason')) %>%
      pivot_longer(
          cols = ends_with('correction_reason'),
          names_to = c('instance', 'measurement'),
          names_pattern = 'p\\d+_(i\\d+)_(.*)_correction_reason',
          values_to = 'correction_reason')) %>% 
    inner_join(
      pheno %>%
      select(eid, p21003_i0_Age_when_attended_assessment_centre_years, ends_with('missing_reason')) %>%
      pivot_longer(
          cols = ends_with('missing_reason'),
          names_to = c('instance', 'measurement'),
          names_pattern = 'p\\d+_(i\\d+)_(.*)_missing_reason',
          values_to = 'missing_reason')) %>%
    inner_join(
      pheno %>%
      select(eid, p21003_i0_Age_when_attended_assessment_centre_years, ends_with('correction_level')) %>%
      pivot_longer(
          cols = ends_with('correction_level'),
          names_to = c('instance', 'measurement'),
          names_pattern = 'p\\d+_(i\\d+)_(.*)_correction_level',
          values_to = 'correction_level')) %>%
    inner_join(
      pheno %>%
      select(eid, p21003_i0_Age_when_attended_assessment_centre_years, ends_with('aliquot')) %>%
      pivot_longer(
          cols = ends_with('aliquot'),
          names_to = c('instance', 'measurement'),
          names_pattern = 'p\\d+_(i\\d+)_(.*)_aliquot',
          values_to = 'aliquot')) %>%
    inner_join(
      pheno %>%
      select(eid, p21003_i0_Age_when_attended_assessment_centre_years, ends_with('reportability')) %>%
      pivot_longer(
          cols = ends_with('reportability'),
          names_to = c('instance', 'measurement'),
          names_pattern = 'p\\d+_(i\\d+)_(.*)_reportability',
          values_to = 'reportability'))

In [None]:
# Check the result of the join: expect one row per participant for each of the 4 lipid measurements.
(dim(assay))
(nrow(pheno) * 4)
stopifnot(nrow(assay) == nrow(pheno) * 4)

In [None]:
# Uncomment the line below to see row level data.
#head(assay)

In [None]:
assay %>%
    filter(!is.na(mg_dl)) %>%
    group_by(missing_reason) %>%
    summarize(count = n())

In [None]:
assay %>%
    group_by(measurement) %>%
    summarize(
        count = n(),
        missing = sum(is.na(mg_dl)),
        median = median(mg_dl, na.rm = TRUE),
        mean = mean(mg_dl, na.rm = TRUE),
        stddev = sd(mg_dl, na.rm = TRUE)
    )

In [None]:
options(repr.plot.height = 18, repr.plot.width = 16)

assay %>%
    filter(!is.na(p21003_i0_Age_when_attended_assessment_centre_years)) %>%
    ggplot(aes(x = cut_width(p21003_i0_Age_when_attended_assessment_centre_years, width = 10, boundary = 0), y = mg_dl)) +
    geom_boxplot() +
    stat_summary(fun.data = get_boxplot_fun_data, geom = 'text', size = 4,
                 position = position_dodge(width = 0.9), vjust = -0.8) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
#    scale_y_log10(breaks = scales::pretty_breaks(n = 10)) +  # Uncomment if the data looks skewed.
    coord_flip() +
    facet_wrap(~ measurement, nrow = length(unique(assay$measurement)), scales = 'free_x') +
    xlab('age') +
    labs(title = str_glue('Instance 0 measurement per person, by age'),
         caption = 'Source: UK Biobank data')

# Provenance 

In [None]:
devtools::session_info()