# Prepare a pooled lipids phenotype

In this notebook we combine the _All of Us_ lipids phenotype with the UK Biobank lipids phenotype adjusted for statin use.

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **pooled** analysis portion of the project: it combines the results from `aou_workbench_siloed_analyses/01_aou_lipids_phenotype.ipynb` and `aou_workbench_pooled_analyses/01_ukb_lipids_phenotype.ipynb`.

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the <i>All of Us</i> Workbench.
    <ul>
        <li>Use "Recommended Environment" <kbd><b>General Analysis</b></kbd> which creates compute type <kbd><b>Standard VM</b></kbd> with reasonable defaults for CPU, RAM, and disk.</li>
        <li>This notebook takes only a few minutes to run interactively. You can also run it in the background via <kbd>run_notebook_in_the_background</kbd> for the sake of provenance and reproducibility.</li>
    </ul>
</div>

In [None]:
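# Install any required package that is not already present.
# installed.packages() returns a matrix, so compare against its row names (the package names).
lapply(c('skimr'), function(pkg_name) { if (!pkg_name %in% rownames(installed.packages())) { install.packages(pkg_name) } })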

In [None]:
library(lubridate)
library(skimr)
library(tidyverse)

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# Created via aou_workbench_siloed_analyses/01_aou_lipids_phenotype.ipynb
AOU_PHENO <- 'gs://fc-secure-471c1068-cd3d-4b43-9b5d-a618c85ceea5/data/aou/pheno/20220208/aou_alpha3_lipids_phenotype.csv'
# Created via aou_workbench_pooled_analyses/01_ukb_lipids_phenotype.ipynb
UKB_PHENO <- 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/ukb/pheno/20220304/ukb_lipids_phenotype.csv'

#---[ Outputs ]---
# Create a timestamp for a folder of results generated today.
DATESTAMP <- strftime(now(), '%Y%m%d')
DESTINATION <- str_glue('{Sys.getenv("WORKSPACE_BUCKET")}/data/pooled/pheno/{DATESTAMP}/')
MERGED_PHENOTYPE_FILENAME <- 'aou_alpha3_ukb_lipids_phenotype.csv'
MERGED_ID_FILENAME <- 'aou_alpha3_ukb_lipids_ids.tsv'
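
The output folder above depends on the `WORKSPACE_BUCKET` environment variable; if it is unset, the glued path would start with `/data/...` instead of a `gs://` URL. The following is a minimal sketch of a fail-fast variant, written in base R so it runs without the tidyverse. The helper name `build_destination` and the bucket `gs://example-bucket` are hypothetical and not part of this project.

```r
# Hypothetical helper (not part of the notebook): build the dated output folder.
build_destination <- function(bucket, date = Sys.Date()) {
  # Fail fast if the bucket is empty; otherwise the path would begin with '/'.
  if (!nzchar(bucket)) {
    stop('WORKSPACE_BUCKET is not set; refusing to build a destination path.')
  }
  # Same layout as DESTINATION above: <bucket>/data/pooled/pheno/<YYYYMMDD>/
  paste0(bucket, '/data/pooled/pheno/', format(date, '%Y%m%d'), '/')
}

# Example with a made-up bucket name and a fixed date.
example_destination <- build_destination('gs://example-bucket', as.Date('2022-03-04'))
print(example_destination)
```

On the Workbench, the equivalent call would presumably be `build_destination(Sys.getenv('WORKSPACE_BUCKET'))`.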

# Load data

## Retrieve AoU lipids phenotype

In [None]:
aou_pheno <- read_csv(pipe(str_glue('gsutil cat {AOU_PHENO}')))

dim(aou_pheno)
head(aou_pheno)

In [None]:
skim(aou_pheno)

## Retrieve UKB lipids phenotype

In [None]:
ukb_pheno <- read_csv(pipe(str_glue('gsutil cat {UKB_PHENO}')))

dim(ukb_pheno)
head(ukb_pheno)

In [None]:
skim(ukb_pheno)
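
Before pooling, it can help to confirm that each table carries the columns the pooling step selects. The sketch below is one way to do that in base R; the column lists are copied from the `select()` and `mutate()` calls in this notebook, while the helper `missing_columns` and the toy data frame are hypothetical illustrations.

```r
# Columns the pooling step reads from each table (taken from this notebook).
AOU_REQUIRED <- c('person_id', 'age', 'age2', 'statin_use', 'sex_at_birth',
                  'race', 'lipid_type', 'value_as_number')
UKB_REQUIRED <- c('eid', 'eid_31063', 'age', 'age2', 'statin_use', 'sex_at_birth',
                  'top_level_ethnic_background', 'lipid_type', 'mg_dl')

# Hypothetical helper: return the required columns that are absent from df.
missing_columns <- function(df, required) {
  setdiff(required, colnames(df))
}

# Toy table that deliberately lacks most of the UKB columns.
toy_ukb <- data.frame(eid = 1L, age = 50, mg_dl = 120)
toy_missing <- missing_columns(toy_ukb, UKB_REQUIRED)
print(toy_missing)
```

With the real data, `stopifnot(length(missing_columns(aou_pheno, AOU_REQUIRED)) == 0)` and the analogous UKB check would stop the notebook early if an input file changed shape.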

# Pool the phenotypes

Add the `IID` and `FID` columns needed by regenie, along with a `cohort` covariate. Keep in mind that the UKB sample id (`eid_31063`) is different from the `eid`, so `IID` is built from the sample id rather than from `id`.
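
The `IID` is the sample id rendered as text and suffixed with the cohort name. As a small illustration on made-up ids, the base R sketch below shows two properties of `format()` that matter here: `scientific = FALSE` keeps large ids out of exponent notation, and `trim = TRUE` prevents shorter ids from being padded with leading spaces to a common width.

```r
# Made-up sample ids of different lengths, one per cohort.
sample_id <- c(5, 1234567)
cohort <- c('AOU', 'UKB')

# Without scientific = FALSE, a round id can be rendered in exponent notation
# (under R's default options).
sci_example <- format(1e7)
fixed_example <- format(1e7, scientific = FALSE)

# format() pads a vector to a common width unless trim = TRUE is given.
padded_iid <- paste0(format(sample_id, scientific = FALSE), '_', cohort)
trimmed_iid <- paste0(format(sample_id, scientific = FALSE, trim = TRUE), '_', cohort)

print(padded_iid)
print(trimmed_iid)
```

If every id in the pooled table has the same number of digits, the padded and trimmed forms coincide.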

In [None]:
long_pooled_pheno <- bind_rows(
    aou_pheno %>%
        mutate(
            sample_id = person_id,
            cohort = 'AOU'
        ) %>%
        select(id=person_id, sample_id, cohort, age, age2, statin_use, sex_at_birth,
               race, lipid_type, mg_dl = value_as_number),
    ukb_pheno %>%
        mutate(
            sample_id = eid_31063,
            cohort = 'UKB',
            # Convert the 0/1 indicator to a logical to match the AoU column.
            statin_use = (statin_use == 1)
        ) %>%
        select(id=eid, sample_id, cohort, age, age2, statin_use, sex_at_birth,
               race=top_level_ethnic_background, lipid_type, mg_dl)
    ) %>%
    mutate(
        # trim = TRUE keeps format() from left-padding shorter ids with spaces.
        IID = paste0(format(sample_id, scientific = FALSE, trim = TRUE), '_', cohort),
        FID = IID
    )

dim(long_pooled_pheno)
head(long_pooled_pheno)

In [None]:
skim(long_pooled_pheno)

In [None]:
length(unique(long_pooled_pheno$id))
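
Counting unique `id` values treats the two cohorts as one identifier space. If an AoU `person_id` and a UKB `eid` ever happened to share the same number, that count would merge them, whereas `IID` stays distinct because it carries the cohort suffix. The toy example below (made-up ids, base R only) sketches the difference and a per-cohort tally.

```r
# Toy long-format table: id 1 appears in both cohorts, with repeated rows per lipid type.
toy <- data.frame(
  id = c(1, 1, 2, 1, 3),
  cohort = c('AOU', 'AOU', 'AOU', 'UKB', 'UKB'),
  lipid_type = c('LDL', 'HDL', 'LDL', 'LDL', 'LDL')
)
toy$IID <- paste0(toy$id, '_', toy$cohort)

# Unique ids ignore the cohort; unique IIDs do not.
n_unique_id <- length(unique(toy$id))
n_unique_iid <- length(unique(toy$IID))

# Number of distinct participants within each cohort.
participants_per_cohort <- tapply(toy$IID, toy$cohort, function(x) length(unique(x)))

print(c(id = n_unique_id, IID = n_unique_iid))
print(participants_per_cohort)
```

On the pooled table, the analogous check would compare `length(unique(long_pooled_pheno$id))` with `length(unique(long_pooled_pheno$IID))`; equal counts suggest there are no cross-cohort collisions.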

# Write phenotypes to workspace bucket

In [None]:
# Write the dataframe to a file.
write_csv(long_pooled_pheno, MERGED_PHENOTYPE_FILENAME)

In [None]:
# Copy the file to the workspace bucket.
system(str_glue('gsutil cp {MERGED_PHENOTYPE_FILENAME} {DESTINATION}'), intern = TRUE)

In [None]:
# Write the distinct FID/IID pairs to a file.
write_tsv(long_pooled_pheno %>%
              select(FID, IID) %>%
              distinct(),
          MERGED_ID_FILENAME)

In [None]:
# Copy the file to the workspace bucket.
system(str_glue('gsutil cp {MERGED_ID_FILENAME} {DESTINATION}'), intern = TRUE)

In [None]:
# Check the destination.
system(str_glue('gsutil ls -lh {DESTINATION}'), intern = TRUE)

# Provenance

In [None]:
devtools::session_info()