# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study (GWAS) using [regenie](https://rgcgithub.github.io/regenie/). Unlike the lipids GWAS, the phenotypes here are the cohort indicators `is_aou` and `is_ukb`: **this GWAS explores the batch effect of the data.**

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **pooled** analysis portion of the project.

# Setup

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the <i>All of Us</i> Workbench.
    <ul>
        <li>Use "Recommended Environment" <kbd><b>General Analysis</b></kbd> which creates compute type <kbd><b>Standard VM</b></kbd> with sufficient CPU and RAM (e.g. start with <b>8 CPUs</b> and <b>30 GB RAM</b>, increase if needed).</li>
        <li>This notebook takes about 7 hours to run. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>

In [None]:
from datetime import datetime
import os
import time

## Setup regenie

Note: regenie is already installed locally by default, but we are choosing to update to a more recent version.

In [None]:
!regenie --help | head

<div class="alert alert-block alert-warning">
    <b>Note:</b> REGENIE 2.2.4 was used for the lipids GWAS, but we upgraded to 3.1.1 for the batch GWAS because of the error <kbd>ERROR: logistic regression did not converge for phenotype is_aou. Perhaps increase --niter?</kbd> on chromosome 10.
</div>

In [None]:
%%bash

REGENIE_VERSION=v3.1.1
# Remove any previous download; -f avoids an error when the file does not exist.
rm -f regenie.zip
curl -L -o regenie.zip "https://github.com/rgcgithub/regenie/releases/download/${REGENIE_VERSION}/regenie_${REGENIE_VERSION}.gz_x86_64_Linux.zip"
unzip -o regenie.zip

In [None]:
!./regenie_v3.1.1.gz_x86_64_Linux --help | head
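
The download cell above hard-codes the version in one place and the binary name in several others. The following is a minimal sketch of deriving both from a single version string so they stay in sync. The helper names `regenie_release_url` and `regenie_binary_name` are hypothetical, and the naming pattern is only known to hold for the v3.1.1 release used in this notebook; other releases may name their assets differently.

```python
def regenie_release_url(version: str) -> str:
    """Return the GitHub release URL for the x86_64 Linux regenie zip.

    Assumption: the asset follows the same naming pattern as v3.1.1.
    """
    if not version.startswith("v"):
        raise ValueError(f"expected a tag like 'v3.1.1', got {version!r}")
    return (
        "https://github.com/rgcgithub/regenie/releases/download/"
        f"{version}/regenie_{version}.gz_x86_64_Linux.zip"
    )


def regenie_binary_name(version: str) -> str:
    """Return the executable name that the release zip unpacks to."""
    return f"regenie_{version}.gz_x86_64_Linux"


REGENIE_VERSION = "v3.1.1"
print(regenie_release_url(REGENIE_VERSION))
print(regenie_binary_name(REGENIE_VERSION))
```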

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# The BGEN file created via aou_workbench_pooled_analyses/05_write_pooled_bgen.ipynb.
REMOTE_MERGED_BGEN = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/geno/20220215/aou-alpha3-ukb-chr1-chr22.bgen'
# The sample file created via aou_workbench_pooled_analyses/05_write_pooled_bgen.ipynb.
REMOTE_MERGED_BGEN_SAMPLE = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/geno/20220215/aou-alpha3-ukb-chr1-chr22.sample'
# Created via aou_workbench_pooled_analyses/08_pooled_phenotype_for_gwas.ipynb
REMOTE_GWAS_PHENOTYPES = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/pheno/20220413/aou_alpha3_ukb_lipids_gwas_phenotype.tsv'
# These four files were created via notebook aou_workbench_pooled_analyses/06_pooled_variant_qc.ipynb
REMOTE_STEP1_VARIANT_QC_ID = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/variant-qc/20220311/aou_alpha3_ukb_lipids_step1QC_plink.id'
REMOTE_STEP1_VARIANT_QC_SNPLIST = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/variant-qc/20220311/aou_alpha3_ukb_lipids_step1QC_plink.snplist'
REMOTE_STEP2_VARIANT_QC_ID = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/variant-qc/20220311/aou_alpha3_ukb_lipids_step2QC_plink.id'
REMOTE_STEP2_VARIANT_QC_SNPLIST = 'gs://fc-secure-e53e4a44-7fe2-42b7-89b7-01aae1e399f7/data/pooled/variant-qc/20220311/aou_alpha3_ukb_lipids_step2QC_plink.snplist'

#---[ Outputs ]---
# Create a timestamp for a folder of results generated today.
DATESTAMP = time.strftime('%Y%m%d')
OUTPUT_FILENAME_PREFIX = 'aou_alpha3_ukb_batch'
REGENIE_OUTPUTS = f'{os.getenv("WORKSPACE_BUCKET")}/data/pooled/regenie/{DATESTAMP}/'
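
If `WORKSPACE_BUCKET` is not set, `os.getenv` returns `None` and the f-string above produces a path that begins with the literal text `None`. The sketch below shows one way to fail fast instead. The function name `regenie_output_dir` and the bucket `gs://example-bucket` are hypothetical and used only for illustration.

```python
from typing import Optional


def regenie_output_dir(bucket: Optional[str], datestamp: str) -> str:
    """Build the output folder, refusing to continue without a bucket."""
    if not bucket:
        raise ValueError("WORKSPACE_BUCKET is not set; cannot build the output path")
    # Strip a trailing slash so the result never contains '//' after the bucket.
    return f"{bucket.rstrip('/')}/data/pooled/regenie/{datestamp}/"


# Hypothetical bucket name, for illustration only.
print(regenie_output_dir("gs://example-bucket", "20220413"))
```

In the notebook this could be called as `regenie_output_dir(os.getenv("WORKSPACE_BUCKET"), DATESTAMP)`.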

In [None]:
LOCAL_MERGED_BGEN = os.path.basename(REMOTE_MERGED_BGEN)
LOCAL_MERGED_BGEN_SAMPLE = os.path.basename(REMOTE_MERGED_BGEN_SAMPLE)
LOCAL_GWAS_PHENOTYPES = os.path.basename(REMOTE_GWAS_PHENOTYPES)
LOCAL_STEP1_VARIANT_QC_ID = os.path.basename(REMOTE_STEP1_VARIANT_QC_ID)
LOCAL_STEP1_VARIANT_QC_SNPLIST = os.path.basename(REMOTE_STEP1_VARIANT_QC_SNPLIST)
LOCAL_STEP2_VARIANT_QC_ID = os.path.basename(REMOTE_STEP2_VARIANT_QC_ID)
LOCAL_STEP2_VARIANT_QC_SNPLIST = os.path.basename(REMOTE_STEP2_VARIANT_QC_SNPLIST)

## Copy data locally

In [None]:
!gsutil cp -n {REMOTE_MERGED_BGEN} {REMOTE_MERGED_BGEN_SAMPLE} .

In [None]:
!gsutil cp {REMOTE_GWAS_PHENOTYPES} {REMOTE_STEP1_VARIANT_QC_ID} {REMOTE_STEP1_VARIANT_QC_SNPLIST} \
    {REMOTE_STEP2_VARIANT_QC_ID} {REMOTE_STEP2_VARIANT_QC_SNPLIST} .
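
Because the regenie steps run for hours, it may be worth confirming that every copied input is present and non-empty before starting. The sketch below is one way to do that; `missing_or_empty` is a hypothetical helper, and the demonstration uses temporary stand-in files, not the real inputs.

```python
import os
import tempfile
from typing import Iterable, List


def missing_or_empty(paths: Iterable[str]) -> List[str]:
    """Return the paths that are not regular files or that have zero bytes."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


# Self-contained demonstration with temporary stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.bgen")
    empty = os.path.join(tmp, "empty.sample")
    absent = os.path.join(tmp, "absent.snplist")
    with open(present, "wb") as handle:
        handle.write(b"stand-in bytes")
    open(empty, "wb").close()
    problems = missing_or_empty([present, empty, absent])
    print([os.path.basename(p) for p in problems])
```

In the notebook, the same function could be applied to the `LOCAL_*` names defined earlier, for example `missing_or_empty([LOCAL_MERGED_BGEN, LOCAL_MERGED_BGEN_SAMPLE, LOCAL_GWAS_PHENOTYPES])`.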

# regenie step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

See also the regenie documentation: https://rgcgithub.github.io/regenie/options/#input

<div class="alert alert-block alert-warning">
    <b>Note:</b> Compared to the pooled lipids GWAS, we added the parameter <kbd>--loocv</kbd> to address the error <kbd>ERROR: one of the folds has only cases/controls for phenotype 'is_aou'. Either use smaller #folds (option --cv) or use LOOCV (option --loocv).</kbd>
</div>
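
The toy sketch below illustrates how a cross-validation fold can end up with only cases or only controls. It rests on two assumptions that are not confirmed by this notebook: that folds are contiguous blocks of samples, and that the pooled sample list places one cohort entirely before the other. It does not reproduce regenie's actual fold assignment, and `single_class_folds` is a hypothetical helper.

```python
from typing import List, Sequence


def single_class_folds(labels: Sequence[int], k: int) -> List[int]:
    """Return the indices of contiguous folds that contain fewer than two classes."""
    n = len(labels)
    bad = []
    for fold in range(k):
        start = fold * n // k
        stop = (fold + 1) * n // k
        if len(set(labels[start:stop])) < 2:
            bad.append(fold)
    return bad


# Toy labels: 1 marks one cohort, 0 marks the other.
sorted_by_cohort = [1] * 6 + [0] * 4
interleaved = [1, 0] * 5

print(single_class_folds(sorted_by_cohort, 5))  # every fold holds a single cohort
print(single_class_folds(interleaved, 5))       # no fold holds a single cohort
```

Under those assumptions, a cohort-membership phenotype such as `is_aou` would be especially prone to this error, which is consistent with the need for `--loocv` here.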

In [None]:
!./regenie_v3.1.1.gz_x86_64_Linux \
    --step 1 \
    --bgen {LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample {LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile {LOCAL_GWAS_PHENOTYPES} \
    --phenoColList is_aou,is_ukb \
    --bt \
    --loocv \
    --covarFile {LOCAL_GWAS_PHENOTYPES} \
    --catCovarList sex_at_birth \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --extract {LOCAL_STEP1_VARIANT_QC_SNPLIST} \
    --keep {LOCAL_STEP1_VARIANT_QC_ID} \
    --bsize 1000 \
    --verbose \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_step1

In [None]:
!ls -lth {OUTPUT_FILENAME_PREFIX}*

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}_regenie_step1* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls -lh {REGENIE_OUTPUTS}
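
Step 1 writes a `_pred.list` file that step 2 reads through `--pred`. Assuming each line of that file holds a phenotype name followed by the path to its LOCO predictions file, the sketch below parses it and reports any referenced files that are missing. `read_pred_list` is a hypothetical helper, and the demonstration writes a stand-in file; it does not use real regenie output.

```python
import os
import tempfile
from typing import Dict


def read_pred_list(path: str) -> Dict[str, str]:
    """Map each phenotype name to the LOCO predictions file listed for it."""
    preds = {}
    with open(path) as handle:
        for line in handle:
            fields = line.strip().split(maxsplit=1)
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            phenotype, loco_path = fields
            preds[phenotype] = loco_path
    return preds


# Demonstration with a stand-in pred.list (file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    loco_1 = os.path.join(tmp, "demo_step1_1.loco")
    loco_2 = os.path.join(tmp, "demo_step1_2.loco")
    open(loco_1, "w").close()  # only the first LOCO file exists
    pred_list = os.path.join(tmp, "demo_step1_pred.list")
    with open(pred_list, "w") as handle:
        handle.write(f"is_aou {loco_1}\nis_ukb {loco_2}\n")

    preds = read_pred_list(pred_list)
    missing = [name for name, p in preds.items() if not os.path.isfile(p)]
    print(sorted(preds), missing)
```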

# regenie step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.

<div class="alert alert-block alert-warning">
    <b>Note:</b> We do not use a Firth logistic regression model (the <kbd>--firth</kbd> option is omitted) in order to avoid the error <kbd>ERROR: Firth penalized logistic regression failed to converge for all phenotypes. Try decreasing the maximum step size using `--maxstep-null` (currently=5) and increasing the maximum number of iterations using `--maxiter-null` (currently=5000).</kbd>
</div>

In [None]:
!./regenie_v3.1.1.gz_x86_64_Linux \
    --step 2 \
    --bgen {LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample {LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile {LOCAL_GWAS_PHENOTYPES} \
    --phenoColList is_aou,is_ukb \
    --bt \
    --covarFile {LOCAL_GWAS_PHENOTYPES} \
    --catCovarList sex_at_birth \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --extract {LOCAL_STEP2_VARIANT_QC_SNPLIST} \
    --keep {LOCAL_STEP2_VARIANT_QC_ID} \
    --pred {OUTPUT_FILENAME_PREFIX}_regenie_step1_pred.list \
    --bsize 400 \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_step2

In [None]:
!ls -lth {OUTPUT_FILENAME_PREFIX}*

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}_regenie_step2* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls -lh {REGENIE_OUTPUTS}
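
As a quick look at the step 2 results, the sketch below filters association rows by the conventional genome-wide threshold of p < 5e-8. It assumes the results are space-delimited text with a header row containing `ID` and `LOG10P` columns, and that failed tests are reported as `NA`. `significant_hits` is a hypothetical helper, and the sample rows are made up for illustration; they are not results from this analysis.

```python
import csv
import io
import math
from typing import Dict, Iterable, List

# Conventional genome-wide significance threshold, p < 5e-8, on the -log10 scale.
GENOME_WIDE_LOG10P = -math.log10(5e-8)


def significant_hits(
    lines: Iterable[str], min_log10p: float = GENOME_WIDE_LOG10P
) -> List[Dict[str, str]]:
    """Return rows whose LOG10P meets the threshold, skipping missing values."""
    reader = csv.DictReader(lines, delimiter=" ")
    hits = []
    for row in reader:
        value = row.get("LOG10P")
        if value in (None, "", "NA"):
            continue  # test failed or value absent
        if float(value) >= min_log10p:
            hits.append(row)
    return hits


# Made-up rows in the assumed layout, for illustration only.
sample = """CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ N TEST BETA SE CHISQ LOG10P EXTRA
1 1000 rs_a A G 0.10 1000 ADD 0.50 0.05 100.0 22.5 NA
1 2000 rs_b C T 0.20 1000 ADD 0.01 0.05 0.04 0.07 NA
2 3000 rs_c G A 0.30 1000 ADD NA NA NA NA TEST_FAIL
"""

hits = significant_hits(io.StringIO(sample))
print([row["ID"] for row in hits])
```

With a real results file, an open file handle can be passed in place of the `io.StringIO` object.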

# Provenance

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze