# Compare SQL phenotype to R phenotype

<div class="alert alert-block alert-success">
    <b>For AoU there are some logic changes that affect <i>which of a person's measurements</i> is used.</b> <i>Which measurement</i> is used in turn affects the <b>age</b>, since it is the age at the time of measurement, and the <b>statin use indicator</b>, since the measurement must occur within the statin use interval for the indicator to be true.
    <ol>
        <li>AoU: We now retain only measurements where <kbd>value_as_number IS NOT NULL AND value_as_number > 0</kbd>.</li>
        <li>AoU: Previously the R code was modifying LDL during the lipids adjustment. Now LDL is the original value from the measurements table. Adjustments only occur within LDL_adjusted.</li>
        <li>AoU: A single age and statin use indicator was previously chosen per person, even though those values could vary between a person's different lipid measurements. Now each measurement retains the age and statin use flag associated with the datetime of the measurement.</li>
        <li>AoU: When choosing the "most recent" measurement, the SQL code goes to greater lengths to make the result reproducible by sorting not only by measurement date, but also by measurement time, and measurement id in the case of ties.</li>
        <li>AoU: The SQL JOIN logic for measurements and statin use intervals uses the datetime instead of the date.</li>
        <li>UKB: 148 UKB samples were getting dropped incorrectly. I narrowed it down to the <kbd>na.omit</kbd> command being used to keep only people with all four lipids. Since <kbd>na.omit</kbd> is run on the entire dataframe, it also checks other columns for NAs, such as the European ancestry column.</li>
        <li>UKB: The lipids adjustment does not use the same formula. Specifically, the rule <kbd>If TG > 400, then LDL = NA</kbd> was not applied to ldladj in the natarajan dataframe provided.</li>
    </ol>
 </div>
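
The `na.omit` issue described in the UKB item above can be reproduced on toy data. The sketch below is illustrative only: the data frame, its values, and the column name `european_ancestry` are assumptions, not the real phenotype table. It shows that `na.omit` drops a person who has all four lipids but a missing value in an unrelated column, while restricting the check to the lipid columns with `complete.cases` keeps that person.

```r
# Toy data (hypothetical values): person "b" lacks LDL, person "c" has all
# four lipids but a missing value in an unrelated column.
toy <- data.frame(
  IID = c("a", "b", "c"),
  HDL = c(50, 61, 45),
  LDL = c(110, NA, 130),
  TC = c(180, 200, 210),
  TG = c(120, 150, 90),
  european_ancestry = c(TRUE, TRUE, NA)  # hypothetical column name
)

lipid_cols <- c("HDL", "LDL", "TC", "TG")

# na.omit() checks every column, so person "c" is dropped as well as "b".
dropped_too_many <- na.omit(toy)

# Restricting the NA check to the lipid columns drops only person "b".
kept_correctly <- toy[complete.cases(toy[, lipid_cols]), ]

print(dropped_too_many$IID)
print(kept_correctly$IID)
```

With tidyverse loaded, `tidyr::drop_na(toy, HDL, LDL, TC, TG)` is an equivalent way to limit the check to named columns.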

# Setup

In [None]:
# Install any of these packages that are not already available.
lapply(c('hexbin', 'hrbrthemes', 'skimr', 'viridis'),
       function(pkg) { if (!pkg %in% rownames(installed.packages())) { install.packages(pkg) } })

In [None]:
library(hexbin)
library(hrbrthemes)
library(skimr)
library(tidyverse)

In [None]:
ORIG_R_PHENO <- c(
    HDL = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_HDL_Iteration2_ForGWAS.csv',
    LDL = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_LDL_Iteration2_ForGWAS.csv',
    TC = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_TC_Iteration2_ForGWAS.csv',
    TG = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/MergedData_TG_Iteration2_ForGWAS.csv'
)

In [None]:
NEW_SQL_PHENO <- 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/pooled/phenotypes/20211224/aou_alpha2_ukb_pooled_lipids_phenotype.tsv'

In [None]:
# Set some visualization defaults.
theme_set(theme_ipsum(base_size = 16)) # Default theme for plots.

#' Returns a data frame with a y position and a label, for use annotating ggplot boxplots.
#'
#' @param values A numeric vector holding the y values of one boxplot group.
#' @return A data frame with column y as the max of values and column label as 'N = ' plus the length of values.
get_boxplot_fun_data <- function(values) {
  return(data.frame(y = max(values), label = stringr::str_c('N = ', length(values))))
}

# Load data

In [None]:
orig_hdl <- read_csv(pipe(str_glue('gsutil cat {ORIG_R_PHENO[["HDL"]]}')))

In [None]:
orig_ldl <- read_csv(pipe(str_glue('gsutil cat {ORIG_R_PHENO[["LDL"]]}')))

In [None]:
orig_tc <- read_csv(pipe(str_glue('gsutil cat {ORIG_R_PHENO[["TC"]]}')))

In [None]:
orig_tg <- read_csv(pipe(str_glue('gsutil cat {ORIG_R_PHENO[["TG"]]}')))

In [None]:
orig_pheno_wide <- orig_hdl %>%
    full_join(orig_ldl) %>%
    full_join(orig_tc) %>%
    full_join(orig_tg) %>%
    mutate(
        FID = paste0(sampleid, '_', cohort),
        IID = FID
    )

In [None]:
nrow(orig_pheno_wide)
length(unique(orig_pheno_wide$IID))

stopifnot(nrow(orig_pheno_wide) == length(unique(orig_pheno_wide$IID)))

In [None]:
colnames(orig_pheno_wide)

In [None]:
new_pheno_wide <- read_tsv(pipe(str_glue('gsutil cat {NEW_SQL_PHENO}')))

In [None]:
colnames(new_pheno_wide)

# Compare data

In [None]:
dim(orig_pheno_wide)
dim(new_pheno_wide)

<div class="alert alert-block alert-success">
The SQL phenotype has more rows than the R phenotype: we've retained more non-zero and non-null measurements.
</div>

In [None]:
length(unique(orig_pheno_wide$IID))
length(unique(new_pheno_wide$IID))

nrow(new_pheno_wide) - nrow(orig_pheno_wide)

In [None]:
new_pheno_wide %>%
    filter(!IID %in% orig_pheno_wide$IID) %>%
    group_by(cohort) %>%
    summarize(count = n())

<div class="alert alert-block alert-success">
We've also included more genomes.
</div>

In [None]:
pheno_versions <- inner_join(
    new_pheno_wide,
    orig_pheno_wide,
    suffix = c('_sql_phenotypes', '_r_phenotypes'),
    by = c('FID', 'IID')
)

dim(pheno_versions)

In [None]:
stopifnot(nrow(orig_pheno_wide) == nrow(pheno_versions))

In [None]:
colnames(pheno_versions)

## Check age

In [None]:
sum(abs(pheno_versions$age_sql_phenotypes - pheno_versions$age_r_phenotypes) > 2)

In [None]:
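# List the people counted above: use the absolute difference so that ages
# differing by more than 2 years in either direction are shown.
pheno_versions %>%
    select(IID, age_r_phenotypes, age_sql_phenotypes) %>%
    filter(abs(age_sql_phenotypes - age_r_phenotypes) > 2)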

## Check cohort

In [None]:
table(pheno_versions$cohort_r_phenotypes, pheno_versions$cohort_sql_phenotypes)

<div class="alert alert-block alert-success">
The results are identical.
</div>

## Check sex_at_birth

In [None]:
table(pheno_versions$sex, pheno_versions$sex_at_birth)

<div class="alert alert-block alert-success">
The results are identical.
</div>

## Check PCs

In [None]:
skim(pheno_versions %>%
     select(pc1, PC1, pc2, PC2, pc3, PC3, pc4, PC4, pc5, PC5, pc6, PC6, pc7, PC7, pc8, PC8, pc9, PC9, pc10, PC10))

<div class="alert alert-block alert-success">
The results are identical.
</div>

## Check raw lipids

In [None]:
skim(pheno_versions %>%
     filter(cohort_r_phenotypes == 'AOU') %>%
     select(HDLraw, HDL, LDLraw, LDL,
            TCraw, TC, TGraw, TG))

In [None]:
skim(pheno_versions %>%
     filter(cohort_r_phenotypes == 'UKB') %>%
     select(HDLraw, HDL, LDLraw, LDL,
            TCraw, TC, TGraw, TG))

<div class="alert alert-block alert-success">
The results have minor differences, but no major differences.
</div>
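
One way to put numbers behind "minor differences" is to summarize each pair of columns directly rather than reading the `skim` output side by side. The sketch below is a minimal illustration on toy data: `toy_versions`, its values, and the helper `compare_pair` are hypothetical, and it assumes the `*raw` columns are the older version and the unsuffixed columns the newer one.

```r
# Toy stand-in for pheno_versions; the values are illustrative only.
toy_versions <- data.frame(
  HDLraw = c(50, 61, 45, NA),
  HDL    = c(50, 60, 45, 70),
  LDLraw = c(110, 95, 130, 100),
  LDL    = c(110, 95, 131, 100)
)

# Hypothetical helper: summarize how far apart two versions of a measure are.
compare_pair <- function(df, old_col, new_col) {
  diff <- df[[new_col]] - df[[old_col]]
  data.frame(
    old_col = old_col,
    new_col = new_col,
    n_both_present = sum(!is.na(diff)),       # rows with a value in both versions
    n_differ = sum(diff != 0, na.rm = TRUE),  # rows where the two values disagree
    max_abs_diff = max(abs(diff), na.rm = TRUE)
  )
}

summary_tbl <- rbind(
  compare_pair(toy_versions, "HDLraw", "HDL"),
  compare_pair(toy_versions, "LDLraw", "LDL")
)
print(summary_tbl)
```

The same helper could be pointed at `pheno_versions`, filtered by cohort, to report how many people changed and by how much for each lipid.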

## Check adjusted lipids

In [None]:
skim(pheno_versions %>%
     filter(cohort_r_phenotypes == 'AOU') %>%
     select(HDLadj, HDL, LDLadj, LDL_adjusted,
            TCadj, TC_adjusted, TGadj, TG_adjusted))

In [None]:
skim(pheno_versions %>%
     filter(cohort_r_phenotypes == 'UKB') %>%
     select(HDLadj, HDL, LDLadj, LDL_adjusted,
            TCadj, TC_adjusted, TGadj, TG_adjusted))

<div class="alert alert-block alert-success">
The results have minor differences, but no unexpected major differences. (It is expected that we have more NA values for LDL_adjusted.)
</div>

## Check normalized lipids

In [None]:
skim(pheno_versions %>%
     filter(cohort_r_phenotypes == 'AOU') %>%
     select(HDLnorm, HDL_norm, LDLnorm, LDL_adjusted_norm,
            TCnorm, TC_adjusted_norm, TGnorm, TG_adjusted_norm))

In [None]:
skim(pheno_versions %>%
     filter(cohort_r_phenotypes == 'UKB') %>%
     select(HDLnorm, HDL_norm, LDLnorm, LDL_adjusted_norm,
            TCnorm, TC_adjusted_norm, TGnorm, TG_adjusted_norm))

<div class="alert alert-block alert-success">
The results have minor differences, but no major differences.
</div>

# Provenance

In [None]:
devtools::session_info()