# Aggregate GWAS results

In this notebook we filter our GWAS results to comply with the *All of Us* Researcher Workbench data dissemination rules:

> Researchers must comply with the All of Us Data and Statistics Policy (detailed in full on https://www.researchallofus.org/data-tools/data-access/), which prevents download or dissemination of any data or statistics that allow a participant count of 1 to 20 to be derived. For this reason, researchers are prohibited from downloading or distributing results with allele counts <40 without an exemption. Researchers may apply for an exemption by contacting support@researchallofus.org.

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **siloed** analysis portion of that project.

# Setup

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Single Node' with sufficient CPU and RAM (e.g. start with 4 CPUs and 15 GB RAM, increase if needed).</li>
        <li>This notebook runs quickly, but we recommend running it in the background via <kbd>dx run dxjupyterlab</kbd> to capture provenance.</li>
    </ul>
</div>

```
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x4 \
    -icmd="papermill 11_aggregate_gwas_results.ipynb 11_aggregate_gwas_results_$(date +%Y%m%d).ipynb" \
    -iin=11_aggregate_gwas_results.ipynb \
    --folder=outputs/aggregate-gwas-results/$(date +%Y%m%d)/
```
See also https://platform.dnanexus.com/app/dxjupyterlab

In [None]:
# Install required packages that are not already present.
lapply(c('tidyverse'), function(pkg) { if (! pkg %in% rownames(installed.packages())) { install.packages(pkg) } })

In [None]:
library(tidyverse)

# Constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

# Created via notebook aou_workbench_siloed_analyses/06_aou_regenie_gwas.ipynb
REGENIE_RESULTS <- c(
    HDL='/mnt/project/outputs/regenie-step-2/20220426/ukb_200kwes_lipids_regenie_step2_HDL_mg_dl_norm.regenie',
    LDL='/mnt/project/outputs/regenie-step-2/20220426/ukb_200kwes_lipids_regenie_step2_LDL_adj_mg_dl_norm.regenie',
    TC='/mnt/project/outputs/regenie-step-2/20220426/ukb_200kwes_lipids_regenie_step2_TC_adj_mg_dl_norm.regenie',
    TG='/mnt/project/outputs/regenie-step-2/20220426/ukb_200kwes_lipids_regenie_step2_TG_log_mg_dl_norm.regenie'
)

LIPIDS <- names(REGENIE_RESULTS)

RAW_FILE_SUFFIX <- '.regenie'
AGGREGATE_FILE_SUFFIX <- '_aggregated.tsv'

# Copy the inputs locally

For reasons that are unclear, `read_csv`, `read_delim`, and `read_tsv` return errors when reading directly from the project-mounted location, so we first copy the inputs to the local working directory.

In [None]:
for (result in REGENIE_RESULTS) {
    system(str_glue('cp -v {result} .'), intern=TRUE)
}
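
The copy step above shells out to `cp`. As a minimal sketch of the same idea in base R, `file.copy` returns one logical value per file, which makes it easy to stop early if a copy fails. The paths below are hypothetical temporary files created only for illustration; they are not the project's inputs.

```r
# Hypothetical stand-ins for the mounted inputs: two small temporary files.
src_dir <- tempfile('mounted_')
dest_dir <- tempfile('local_')
dir.create(src_dir)
dir.create(dest_dir)
inputs <- file.path(src_dir, c('HDL.regenie', 'LDL.regenie'))
for (path in inputs) {
    writeLines('CHROM GENPOS ID', path)
}

# file.copy() returns one TRUE/FALSE per input file.
copied <- file.copy(from = inputs, to = dest_dir, overwrite = TRUE)
stopifnot('all inputs must be copied' = all(copied))

# Confirm the local copies exist before trying to read them.
local_files <- file.path(dest_dir, basename(inputs))
print(file.exists(local_files))
```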

# Load the regenie GWAS results

Bring our results into a single dataframe with a lipid type column.

In [None]:
combined_regenie_results <- bind_rows(
    lapply(LIPIDS, function(lipid) {
        file <- REGENIE_RESULTS[[lipid]]
        read_delim(basename(file), delim = ' ') %>%
            mutate(lipid_type = lipid)
    })) %>%
    mutate(
        AN = 2 * N,
        AC_alt = A1FREQ * AN,
        AC_ref = (1 - A1FREQ) * AN
    )

dim(combined_regenie_results)
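
The allele counts above are derived rather than read from the file. Each participant contributes two alleles at a diploid site, so the allele number is `2 * N`, and each allele count is its frequency multiplied by that number. Here is a small worked sketch with made-up values. The variant names and numbers are hypothetical, and `A1FREQ` is treated as the alternate-allele frequency, following the cell above.

```r
library(tibble)
library(dplyr)

# Hypothetical values for illustration only; not real GWAS output.
toy <- tibble(
    ID = c('variant_a', 'variant_b', 'variant_c'),
    N = c(1000, 1000, 1000),
    A1FREQ = c(0.01, 0.03, 0.5)
) %>%
    mutate(
        AN = 2 * N,                    # two alleles per participant
        AC_alt = A1FREQ * AN,          # approximate alternate allele count
        AC_ref = (1 - A1FREQ) * AN,    # approximate reference allele count
        meets_threshold = AC_alt >= 40 & AC_ref >= 40
    )

print(toy)
```

In this toy data, `variant_a` has an alternate allele count of about 20 and falls below the threshold of 40, while the other two variants meet it. The derived counts need not be whole numbers, since they come from a frequency rather than a direct count.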

In [None]:
head(combined_regenie_results)

In [None]:
combined_regenie_results %>%
    group_by(lipid_type) %>%
    summarize(
        count = n(),
        min_LOG10P = min(LOG10P),
        max_LOG10P = max(LOG10P),
        min_A1FREQ = min(A1FREQ),
        max_A1FREQ = max(A1FREQ),
        min_N = min(N),
        max_N = max(N),
        min_AC_alt = min(AC_alt),
        max_AC_alt = max(AC_alt),
        min_AC_ref = min(AC_ref),
        max_AC_ref = max(AC_ref),
    )

## How many significant results will be removed from the aggregate?

In [None]:
combined_regenie_results %>%
    mutate(
        significant = LOG10P > -log10(5e-08),
        group_size_threshold = ifelse(AC_alt < 40 | AC_ref < 40,
                                      'below minimum group size threshold',
                                      'meets group size threshold'),
    ) %>%
    group_by(lipid_type, significant, group_size_threshold) %>%
    summarize(count = n())
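
The comparison above works on the `LOG10P` scale, where regenie reports `-log10` of the p-value. The conventional genome-wide significance cutoff of p < 5e-08 therefore becomes `LOG10P` greater than roughly 7.3. A short sketch of that conversion, using made-up p-values:

```r
# Genome-wide significance cutoff on the -log10 scale.
threshold <- -log10(5e-08)
print(round(threshold, 3))

# Hypothetical p-values for illustration only.
p_values <- c(1e-03, 5e-08, 1e-09)
log10p <- -log10(p_values)

# Strictly greater than the cutoff, matching the comparison in the cell above.
significant <- log10p > threshold
print(significant)
```

Because the comparison is strict, a p-value exactly equal to 5e-08 is not counted as significant.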

# Filter to variants with alternate and reference allele counts of at least 40

In [None]:
aggregate_regenie_results <- combined_regenie_results %>%
    filter(AC_alt >= 40 & AC_ref >= 40)

In [None]:
aggregate_regenie_results %>%
    group_by(lipid_type) %>%
    summarize(
        count = n(),
        min_LOG10P = min(LOG10P),
        max_LOG10P = max(LOG10P),
        min_A1FREQ = min(A1FREQ),
        max_A1FREQ = max(A1FREQ),
        min_N = min(N),
        max_N = max(N),
        min_AC_alt = min(AC_alt),
        max_AC_alt = max(AC_alt),
        min_AC_ref = min(AC_ref),
        max_AC_ref = max(AC_ref),
    )

# Write out the aggregate data to local disk

In [None]:
for (lipid in LIPIDS) {
    input_file <- REGENIE_RESULTS[[lipid]]
    output_file <- input_file %>%
        str_replace('/mnt/project/outputs/', '') %>%
        str_replace_all('/', '_') %>%
        str_replace(str_glue('{RAW_FILE_SUFFIX}$'), AGGREGATE_FILE_SUFFIX)
    message(str_glue('Aggregating results from {input_file} to {output_file}'))
    stopifnot('output filename must be different from input filename' =
              output_file != input_file)
    write_tsv(aggregate_regenie_results %>% filter(lipid_type == lipid), output_file)
}
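
To make the output naming scheme concrete, the sketch below applies the same three string replacements to the HDL path from the constants cell. It only computes a file name and does not read or write any data.

```r
library(stringr)
library(magrittr)

RAW_FILE_SUFFIX <- '.regenie'
AGGREGATE_FILE_SUFFIX <- '_aggregated.tsv'

input_file <- '/mnt/project/outputs/regenie-step-2/20220426/ukb_200kwes_lipids_regenie_step2_HDL_mg_dl_norm.regenie'

output_file <- input_file %>%
    # Drop the mount prefix so the name is relative.
    str_replace('/mnt/project/outputs/', '') %>%
    # Flatten the remaining directories into the file name.
    str_replace_all('/', '_') %>%
    # Swap the raw suffix at the end of the name for the aggregate suffix.
    str_replace(str_glue('{RAW_FILE_SUFFIX}$'), AGGREGATE_FILE_SUFFIX)

print(output_file)
# regenie-step-2_20220426_ukb_200kwes_lipids_regenie_step2_HDL_mg_dl_norm_aggregated.tsv
```

Flattening the directory structure keeps the run date in the file name, so downloads from different runs do not collide.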

# Now you can download these files!

**Be sure to download the aggregated TSV files**, not the .regenie files with the raw results.
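
Before downloading, it can be worth re-reading each aggregated file and confirming that no row falls below the allele count threshold. Below is a minimal sketch of such a check. It runs against a small hypothetical TSV written to a temporary directory rather than the real outputs, and `check_aggregate` is an illustrative helper name, not part of this notebook.

```r
library(readr)

# Illustrative helper (not part of the original notebook): returns TRUE when
# every row in the file has both allele counts at or above the minimum.
check_aggregate <- function(path, min_allele_count = 40) {
    results <- read_tsv(path, show_col_types = FALSE)
    all(results$AC_alt >= min_allele_count & results$AC_ref >= min_allele_count)
}

# Hypothetical file for illustration only.
toy_path <- file.path(tempdir(), 'toy_aggregated.tsv')
write_tsv(
    data.frame(ID = c('variant_b', 'variant_c'),
               AC_alt = c(60, 1000),
               AC_ref = c(1940, 1000)),
    toy_path
)

print(check_aggregate(toy_path))
```

In the notebook itself, the same kind of check could be applied to each file whose name ends in `AGGREGATE_FILE_SUFFIX`.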

# Provenance

In [None]:
devtools::session_info()