# Prepare phenotypes and covariates for GWAS

In this notebook we prepare the All of Us lipids phenotypes and covariates for GWAS.

Note that the corresponding UKB phenotypes are:
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30690
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30760
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30780
  * https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30870

TODOs
* use lower and upper bound cutoffs appropriate for each measurement
* add in the cohort query for WGS samples
* also determine the relevant statin phenotypes so that we can correct for statin use:
  * statin drug [concept set](https://preprod-workbench.researchallofus.org/workspaces/aou-rw-preprod-e2a365a8/pooledanalysisofallofusandukbiobankgenomicdata/data/concepts/sets/53)
  * participants with statin use [dataset](https://preprod-workbench.researchallofus.org/workspaces/aou-rw-preprod-e2a365a8/pooledanalysisofallofusandukbiobankgenomicdata/data/data-sets/70)
  * https://databrowser.researchallofus.org/ehr/drug-exposures?search=statin
  * https://databrowser.researchallofus.org/ehr/conditions?search=statin
  * https://databrowser.researchallofus.org/ehr/procedures?search=statin
* incorporate exclusion criteria
* other issues to decide
  * timing of AoU measurements - which to use
    * right now it uses the most recent measurement
    * another suggestion was to use the maximum value per participant
  * whether to incorporate the measurements with a missing unit concept id so that we can work with a larger portion of the cohort
  * the NIH has a newer calculation for LDL, we should decide whether to use it
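
To make the measurement-timing question above concrete, the following minimal sketch compares the two candidate rules (most recent versus maximum value per participant) on a made-up table. The values are invented for illustration; only the column names `person_id`, `measurement_date`, and `value_as_number` come from this notebook. It uses base R so that it runs without extra packages.

```r
# Toy data: invented values, not All of Us measurements.
toy <- data.frame(
  person_id = c(1, 1, 1, 2, 2),
  measurement_date = as.Date(c('2018-05-01', '2019-07-15', '2020-02-10',
                               '2017-03-03', '2021-01-20')),
  value_as_number = c(130, 160, 110, 95, 90)
)

# Rule 1: most recent measurement per participant.
# Sort by person and date, then keep the last row seen for each person.
ordered <- toy[order(toy$person_id, toy$measurement_date), ]
most_recent <- ordered[!duplicated(ordered$person_id, fromLast = TRUE), ]

# Rule 2: maximum value per participant, regardless of date.
maximum <- aggregate(value_as_number ~ person_id, data = toy, FUN = max)

most_recent
maximum
```

In this toy table the two rules disagree for both participants, which is the situation where the choice would matter for the GWAS phenotype.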


Friedewald formula:
`LDL-c (mg/dL) = TC (mg/dL) − HDL-c (mg/dL) − TG (mg/dL)/5`
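
As a minimal sketch, the Friedewald formula can be written as a vectorized R function. The name `friedewald_ldl` and its argument names are hypothetical, chosen for illustration; all inputs are assumed to be in mg/dL.

```r
# Hypothetical helper: Friedewald estimate of LDL cholesterol.
# tc  = total cholesterol, hdl = HDL cholesterol, tg = triglycerides (all mg/dL).
friedewald_ldl <- function(tc, hdl, tg) {
  tc - hdl - tg / 5
}

# Worked example: 200 - 50 - 150 / 5 = 120 mg/dL.
friedewald_ldl(tc = 200, hdl = 50, tg = 150)
```

Because the arithmetic is vectorized, the same function can be applied to whole columns of a data frame.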

1. LDL adjustment based on TG/LDL values
   1. `If TG > 400, then LDL = NA`
   2. `If LDL < 10, then LDL = NA`
2. LDL and TC adjustment based on statin use (lipid-lowering medication)
   1. `If STATIN is used, LDL_ADJ = LDL/0.7`
   2. `If STATIN is used, TOTAL_ADJ = TC/0.8`
3. TG adjustment
   1. `TG_LOG = log(TG)`
4. Calculation of residuals, adjusting for covariates
   1. Example for LDL: `tmp.ldl$LDL_ADJ.resid <- resid(lm(LDL_ADJ ~ sex+age+age2+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11, data = tmp.ldl))`
5. Normalization of the residuals
   1. Example for LDL: `tmp.ldl$LDL_ADJ.norm <- sd(tmp.ldl$LDL_ADJ)*scale(qnorm((rank(tmp.ldl$LDL_ADJ.resid,na.last="keep")-0.5)/length(tmp.ldl$LDL_ADJ.resid)))`
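
The steps above can be sketched end to end on simulated data. This is an illustration under stated assumptions, not the final pipeline: the data frame `toy` and all of its values are simulated, only two principal components are used instead of eleven, and two details differ from the one-line examples in the list. The model is fit with `na.action = na.exclude` so that the residuals keep one value per row, and the rank is divided by the number of non-missing residuals rather than by `length()`.

```r
set.seed(42)
n <- 200

# Simulated inputs (mg/dL); STATIN marks statin use at the time of measurement.
toy <- data.frame(
  TC     = rnorm(n, mean = 200, sd = 35),
  HDL    = rnorm(n, mean = 55, sd = 12),
  TG     = rlnorm(n, meanlog = log(120), sdlog = 0.5),
  STATIN = rbinom(n, size = 1, prob = 0.25) == 1,
  sex    = rbinom(n, size = 1, prob = 0.5),
  age    = runif(n, min = 20, max = 80),
  PC1    = rnorm(n),
  PC2    = rnorm(n)
)
toy$age2 <- toy$age^2

# Friedewald LDL, then step 1: set values outside the stated bounds to NA.
toy$LDL <- toy$TC - toy$HDL - toy$TG / 5
toy$LDL[toy$TG > 400] <- NA
toy$LDL[!is.na(toy$LDL) & toy$LDL < 10] <- NA

# Step 2: adjust LDL and TC for statin use.
toy$LDL_ADJ   <- ifelse(toy$STATIN, toy$LDL / 0.7, toy$LDL)
toy$TOTAL_ADJ <- ifelse(toy$STATIN, toy$TC / 0.8, toy$TC)

# Step 3: log-transform TG.
toy$TG_LOG <- log(toy$TG)

# Step 4: residuals after adjusting for covariates.
# na.exclude pads missing rows with NA, so the result has one value per row.
fit <- lm(LDL_ADJ ~ sex + age + age2 + PC1 + PC2,
          data = toy, na.action = na.exclude)
toy$LDL_ADJ.resid <- resid(fit)

# Step 5: rank-based inverse normal transform, rescaled to the SD of LDL_ADJ.
n_obs <- sum(!is.na(toy$LDL_ADJ.resid))
ranks <- rank(toy$LDL_ADJ.resid, na.last = "keep")
toy$LDL_ADJ.norm <- sd(toy$LDL_ADJ, na.rm = TRUE) *
  as.numeric(scale(qnorm((ranks - 0.5) / n_obs)))

summary(toy$LDL_ADJ.norm)
```

Rows whose LDL was set to NA in step 1 stay NA through the residual and normalization steps, so the output column lines up with the input rows.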

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench. It runs fine on the default Cloud Environment. 
</div>

In [None]:
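# Install any missing CRAN packages.
lapply(c('viridis', 'ggthemes', 'skimr', 'fuzzyjoin'),
       function(pkg_name) { if(! pkg_name %in% rownames(installed.packages())) { install.packages(pkg_name)} } )
# Install any missing Bioconductor packages (assumes BiocManager is already installed).
lapply(c('IRanges'),
       function(pkg_name) { if(! pkg_name %in% rownames(installed.packages())) { BiocManager::install(pkg_name)} } )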

library(viridis)    # A nice color scheme for plots.
library(ggthemes)   # Common themes to change the look and feel of plots.
library(scales)     # Graphical scales map data to aesthetics in plots.
library(skimr)      # Better summaries of data.
library(lubridate)  # Date library from the tidyverse.
library(bigrquery)  # BigQuery R client.
library(tidyverse)  # Data wrangling packages.
library(fuzzyjoin)  # Joins on inexact matches, such as date ranges.

In [None]:
## BigQuery setup.
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
# Get the BigQuery curated dataset for the current workspace context.
CDR <- Sys.getenv('WORKSPACE_CDR')

WORKSPACE_BUCKET <- Sys.getenv('WORKSPACE_BUCKET')

## Plot setup.
theme_set(theme_bw(base_size = 16)) # Default theme for plots.

#' Returns a data frame with a y position and a label, for use annotating ggplot boxplots.
#'
#' @param df A vector of the y values in one group, as passed by stat_summary(fun.data = ...).
#' @return A one-row data frame with column y as the maximum value and column label as the count.
get_boxplot_fun_data <- function(df) {
  return(data.frame(y = max(df), label = stringr::str_c('N = ', length(df))))
}

# Retrieve genomics alpha2 cohort

In [None]:
participants_with_genomic_data <- read_csv(
    pipe(str_glue('gsutil cat {WORKSPACE_BUCKET}/data/researchIDsAlpha2Release_04272021.txt')),
    col_names = c('person_id')
)

dim(participants_with_genomic_data)

In [None]:
head(participants_with_genomic_data)

# Retrieve most recent lipids measurements

This CSV was created in notebook `aou_lipids_phenotypes.ipynb`.

In [None]:
most_recent_lipids_measurements_df <- read_csv(
    pipe(str_glue('gsutil cat {WORKSPACE_BUCKET}/data/most_recent_lipids_measurements.csv')))

## Full cohort summary

In [None]:
most_recent_lipids_measurements_df %>%
    group_by(title) %>%
    summarize(
        count = n(),
        missing = sum(is.na(value_as_number)),
        median = median(value_as_number, na.rm = TRUE),
        mean = mean(value_as_number, na.rm = TRUE),
        stddev = sd(value_as_number, na.rm = TRUE)
    )

## Genomics cohort summary

In [None]:
most_recent_lipids_measurements_df %>%
    filter(person_id %in% participants_with_genomic_data$person_id) %>%
    group_by(title) %>%
    summarize(
        count = n(),
        missing = sum(is.na(value_as_number)),
        median = median(value_as_number, na.rm = TRUE),
        mean = mean(value_as_number, na.rm = TRUE),
        stddev = sd(value_as_number, na.rm = TRUE)
    )

# Retrieve statin drug exposures

These CSVs are from notebook `aou_participants_with_statin_use.ipynb`.

## Individual statin drug exposures

In [None]:
statin_use_df <- read_csv(
    pipe(str_glue('gsutil cat {WORKSPACE_BUCKET}/data/participants_with_statin_use.csv')),
    col_types = cols(
        DRUG_EXPOSURE_START_DATETIME = col_datetime(format = '%Y/%m/%d %H:%M:%S'),
        DRUG_EXPOSURE_END_DATETIME = col_datetime(format = '%Y/%m/%d %H:%M:%S')),
    guess_max = 25000
)

dim(statin_use_df)

In [None]:
head(statin_use_df)

## Statin drug exposures summarized per person

In [None]:
statin_use_summary_df <- read_csv(
    pipe(str_glue('gsutil cat {WORKSPACE_BUCKET}/data/participants_with_statin_use_summary.csv')),
    col_types = cols(
        first_use = col_datetime(format = '%Y/%m/%d %H:%M:%S'),
        last_use_start = col_datetime(format = '%Y/%m/%d %H:%M:%S'),
        last_use_end = col_datetime(format = '%Y/%m/%d %H:%M:%S')),
    guess_max = 25000
)

dim(statin_use_summary_df)

In [None]:
head(statin_use_summary_df)

# Merge measurements with statin use

<div class="alert alert-block alert-info">
In the merge below we do the following to keep the memory requirements for R lower:<ol>
    <li>only use the 22k participants in the genomics alpha2 release</li>
    <li>merge LDL and total cholesterol separately, as opposed to merging statin use with all the lipids measurements</li>
    </ol>
        
To run the merge at scale on the entire cohort and all the lipids measurements, we'll need to use a machine with more RAM or use BigQuery to do the merge.
</div>
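
One alternative that may need less memory than a fuzzy join is an exact join on the person ID followed by a date-range filter. The sketch below shows the idea in base R on invented toy tables. The tables `measurements` and `exposures` and their values are hypothetical; the column names mirror the ones used in the merges in this section. The toy data has no missing end dates; real data would need the NA fill applied first.

```r
# Toy data: one measurement per person, zero or more statin exposures per person.
measurements <- data.frame(
  person_id = c(1, 2, 3),
  date = as.Date(c('2020-01-15', '2020-03-01', '2020-06-01'))
)
exposures <- data.frame(
  PERSON_ID = c(1, 1, 2),
  statin_use_start_date = as.Date(c('2019-12-01', '2020-05-01', '2020-04-01')),
  statin_use_end_date   = as.Date(c('2020-02-01', '2020-06-01', '2020-05-01'))
)

# Exact join on the person id; all.x = TRUE keeps people with no exposures.
joined <- merge(measurements, exposures,
                by.x = 'person_id', by.y = 'PERSON_ID', all.x = TRUE)

# TRUE when the measurement date falls inside an exposure window.
joined$in_window <- !is.na(joined$statin_use_start_date) &
  joined$date >= joined$statin_use_start_date &
  joined$date <= joined$statin_use_end_date

# Collapse to one statin-use flag per person and attach it to the measurements.
statin_flag <- aggregate(in_window ~ person_id, data = joined, FUN = any)
names(statin_flag)[2] <- 'statin_use'
result <- merge(measurements, statin_flag, by = 'person_id')

result
```

Here person 1 was measured during an exposure window, person 2 was measured before their only exposure began, and person 3 has no exposures, so only person 1 is flagged.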

## Merge with drug exposures summarized per person

In [None]:
# Preview the date columns derived from the per-person summary.
statin_use_summary_df %>%
    mutate(
        statin_use_start_date = as_date(first_use),
        # Fill NAs: fall back to the start of the last exposure when its end is missing.
        # Use case_when() rather than ifelse() here, because ifelse() drops the datetime class.
        statin_use_end_date = case_when(
            is.na(last_use_end) ~ as_date(last_use_start),
            TRUE ~ as_date(last_use_end)
        )
    ) %>%
    select(first_use, last_use_start, last_use_end, statin_use_start_date, statin_use_end_date)

In [None]:
ldl_with_exposure_summary <- most_recent_lipids_measurements_df %>%
    filter(title == 'Cholesterol in LDL [Mass/volume] in Serum or Plasma by calculation [milligram per deciliter]') %>%
    mutate(
        date = as_date(measurement_date)
    ) %>%
    filter(person_id %in% participants_with_genomic_data$person_id) %>%
    fuzzy_left_join(
        statin_use_summary_df %>%
            mutate(
                statin_use_start_date = as_date(first_use),
                # Fill NAs: fall back to the start of the last exposure when its end is missing.
                statin_use_end_date = case_when(
                    is.na(last_use_end) ~ as_date(last_use_start),
                    TRUE ~ as_date(last_use_end)
                )
            ) %>%
            filter(PERSON_ID %in% participants_with_genomic_data$person_id),
        by = c('person_id' = 'PERSON_ID',
               'date' = 'statin_use_start_date',
               'date' = 'statin_use_end_date'),
        match_fun = list(`==`, `>=`, `<=`)
    )

dim(ldl_with_exposure_summary)

In [None]:
# Confirm that after the merge, we still have one row per participant.
stopifnot(
    nrow(most_recent_lipids_measurements_df %>%
         filter(title == 'Cholesterol in LDL [Mass/volume] in Serum or Plasma by calculation [milligram per deciliter]') %>%
         filter(person_id %in% participants_with_genomic_data$person_id))
    ==
    nrow(ldl_with_exposure_summary))

In [None]:
# Check the merge.
ldl_with_exposure_summary %>%
  mutate(
      statin_use = !is.na(statin_use_start_date)
  ) %>%
  select(person_id, PERSON_ID, statin_use, measurement_date, statin_use_start_date, statin_use_end_date) %>%
  head(n = 20)

Take a look at the distribution of statin use for our indicator variable.

In [None]:
ldl_with_exposure_summary %>%
  mutate(
      statin_use = !is.na(statin_use_start_date)
  ) %>%
  group_by(title, statin_use) %>%
  summarize(
      count = n(),
      missing = sum(is.na(value_as_number)),
      median = median(value_as_number, na.rm = TRUE),
      mean = mean(value_as_number, na.rm = TRUE),
      stddev = sd(value_as_number, na.rm = TRUE)
  )

In [None]:
options(repr.plot.height = 16, repr.plot.width = 16)

ldl_with_exposure_summary %>%
    filter(value_as_number > 0) %>% # Get rid of nonsensical outliers.
    mutate(
        age_at_measurement = year(as.period(interval(start = birth_datetime, end = measurement_date))),
        statin_use = !is.na(statin_use_start_date)
    ) %>%
    # Exclude measurements taken at age 20 or younger.
    filter(age_at_measurement > 20) %>%
    ggplot(aes(x = cut_width(age_at_measurement, width = 10, boundary = 0), y = value_as_number, fill = statin_use)) +
    geom_boxplot() +
    stat_summary(fun.data = get_boxplot_fun_data, geom = 'text', size = 4,
                 position = position_dodge(width = 0.9), vjust = -0.8) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
#    scale_y_log10(breaks = scales::pretty_breaks(n = 10)) +  # Uncomment if the data looks skewed.
    coord_flip() +
    facet_wrap(~ title, nrow = length(unique(ldl_with_exposure_summary$title)), scales = 'free_x') +
    xlab('age') +
    labs(title = str_glue('Most recent measurement per person, by age'),
         caption = 'Source: All Of Us Data')

## Merge with individual drug exposures

In [None]:
ldl_with_overlapping_exposures <- most_recent_lipids_measurements_df %>%
    filter(title == 'Cholesterol in LDL [Mass/volume] in Serum or Plasma by calculation [milligram per deciliter]') %>%
    mutate(
        date = as_date(measurement_date)
    ) %>%
    filter(person_id %in% participants_with_genomic_data$person_id) %>%
    fuzzy_left_join(
        statin_use_df %>%
            filter(PERSON_ID %in% participants_with_genomic_data$person_id) %>%
            mutate(
                statin_use_start_date = as_date(DRUG_EXPOSURE_START_DATETIME),
                statin_use_end_date = as_date(DRUG_EXPOSURE_END_DATETIME)
            ),
        by = c('person_id' = 'PERSON_ID',
               'date' = 'statin_use_start_date',
               'date' = 'statin_use_end_date'),
        match_fun = list(`==`, `>=`, `<=`)
    )

dim(ldl_with_overlapping_exposures)

In [None]:
# After this merge, we'll have multiple rows per person. Consolidate into one row.
ldl_with_overlapping_exposures %>%
    group_by(person_id, title, measurement_date, value_as_number) %>%
    summarize(
        first_use = min(DRUG_EXPOSURE_START_DATETIME),
        last_use_start = max(DRUG_EXPOSURE_START_DATETIME),
        last_use_end = max(DRUG_EXPOSURE_END_DATETIME),
        statin_rx_count = sum(!is.na(STANDARD_CONCEPT_CODE)),
        statin_drugs = str_c(sort(unique(STANDARD_CONCEPT_NAME)), collapse = ', ')
    )

### TODO(deflaux) finish refactoring this section 

Just test the logic on a small set of participants. To do this at scale, we'll need to use a machine with more RAM or use BigQuery to do the JOIN.

In [None]:
# Use this for a test to include some participants with no statin drug exposures.
(a_few_eids = most_recent_lipids_measurements_df$person_id %>% head(100))

In [None]:
ldl <- most_recent_lipids_measurements_df %>%
    filter(title == 'Cholesterol in LDL [Mass/volume] in Serum or Plasma by calculation [milligram per deciliter]') %>%
#    filter(person_id %in% a_few_eids) %>%
    filter(person_id %in% participants_with_genomic_data$person_id) %>%
    mutate(
        date = as_date(measurement_date)
    )

dim(ldl)

In [None]:
length(unique(ldl$person_id))

In [None]:
statin <- statin_use_df %>%
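    # Note: this restricts exposures to the test subset, while `ldl` above is filtered to the genomics cohort.
    filter(PERSON_ID %in% a_few_eids) %>%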
    mutate(
        start_date = as_date(DRUG_EXPOSURE_START_DATETIME),
        end_date = as_date(DRUG_EXPOSURE_END_DATETIME)
    )

dim(statin)

In [None]:
length(unique(statin$PERSON_ID))

In [None]:
overlap <- ldl %>%
    fuzzy_left_join(
        statin,
        by = c('person_id' = 'PERSON_ID',
               'date' = 'start_date',
               'date' = 'end_date'),
        match_fun = list(`==`, `>=`, `<=`)
    )

dim(overlap)

In [None]:
length(unique(overlap$person_id))

In [None]:
overlap %>%
    group_by(person_id, title, measurement_date, value_as_number) %>%
    summarize(
        first_use = min(DRUG_EXPOSURE_START_DATETIME),
        last_use_start = max(DRUG_EXPOSURE_START_DATETIME),
        last_use_end = max(DRUG_EXPOSURE_END_DATETIME),
        statin_rx_count = sum(!is.na(STANDARD_CONCEPT_CODE)),
        statin_drugs = str_c(sort(unique(STANDARD_CONCEPT_NAME)), collapse = ', ')
    )

In [None]:
head(overlap)

# Apply Friedewald formula 

TODO

# Normalize values

TODO

# Provenance 

In [None]:
devtools::session_info()