# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/).

Note that this work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://docs.google.com/document/d/19ZS0z_-7FEM37pNDAXaWaqBSLnqyd9MZEkiOmtF3n_0/edit#). Specifically, this notebook belongs to the **siloed** analysis portion of that project.

# Setup

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the <i>All of Us</i> Workbench.
    <ul>
        <li>Use "Recommended Environment" <kbd><b>General Analysis</b></kbd> which creates compute type <kbd><b>Standard VM</b></kbd> with sufficient CPU and RAM (e.g. start with <b>8 CPUs</b> and <b>30 GB RAM</b>, increase if needed).</li>
        <li>This notebook can take a while to run <b>TBD DETAILS CHR21 VS. ALL CHRS</b>. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>

In [None]:
from datetime import datetime
import os
import time

## Setup regenie

Note: regenie is already installed in this environment by default, but we download a more recent release (v2.2.4) and use that binary for the rest of the notebook.

In [None]:
!regenie --version

In [None]:
%%bash

REGENIE_VERSION=v2.2.4
# -f avoids an error on the first run, when no previous download exists.
rm -f regenie.zip
curl -L -o regenie.zip "https://github.com/rgcgithub/regenie/releases/download/${REGENIE_VERSION}/regenie_${REGENIE_VERSION}.gz_x86_64_Linux.zip"
unzip -o regenie.zip

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux --version # --help

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# The BGEN file was created via aou_workbench_siloed_analyses/02_aou_write_filtered_bgen.ipynb.
REMOTE_BGEN = 'gs://fc-secure-471c1068-cd3d-4b43-9b5d-a618c85ceea5/data/aou/geno/20220207/aou-alpha3-chr21.bgen'
# The sample file was created via aou_workbench_siloed_analyses/02_aou_write_filtered_bgen.ipynb.
REMOTE_BGEN_SAMPLE = 'gs://fc-secure-471c1068-cd3d-4b43-9b5d-a618c85ceea5/data/aou/geno/20220207/aou-alpha3-chr21.sample'
# This CSV was created via notebook aou_workbench_siloed_analyses/05_aou_phenotype_for_gwas
REMOTE_GWAS_PHENOTYPES = 'path/to/gwas/pheno'
# These four files were created via notebook aou_workbench_siloed_analyses/03_aou_variant_qc.ipynb
REMOTE_STEP1_VARIANT_QC_ID = 'path/to/_step1QC_plink.id'
REMOTE_STEP1_VARIANT_QC_SNPLIST = 'path/to/_step1QC_plink.snplist'
REMOTE_STEP2_VARIANT_QC_ID = 'path/to/_step2QC_plink.id'
REMOTE_STEP2_VARIANT_QC_SNPLIST = 'path/to/_step2QC_plink.snplist'

#---[ Outputs ]---
# Create a timestamp for a folder of results generated today.
DATESTAMP = time.strftime('%Y%m%d')
OUTPUT_FILENAME_PREFIX = 'aou_alpha3_lipids'
REGENIE_OUTPUTS = f'{os.getenv("WORKSPACE_BUCKET")}/data/aou/regenie/{DATESTAMP}/'
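
One caveat with the cell above: `os.getenv` returns `None` when `WORKSPACE_BUCKET` is unset, so the f-string would silently produce a path that starts with `None/`. The following is a minimal sketch of a stricter variant. `build_output_dir` is a hypothetical helper introduced here for illustration (it is not part of this notebook or of the Workbench), and `gs://example-bucket` is a made-up fallback.

```python
import os
import time


def build_output_dir(bucket, datestamp):
    """Return the results folder URI, or raise if the bucket is not set."""
    if not bucket:
        raise ValueError('WORKSPACE_BUCKET is not set; cannot build the output path.')
    # Strip a trailing slash so the joined path never contains '//'.
    return f'{bucket.rstrip("/")}/data/aou/regenie/{datestamp}/'


# Hypothetical bucket name, used only when the environment variable is absent.
bucket = os.getenv('WORKSPACE_BUCKET', 'gs://example-bucket')
print(build_output_dir(bucket, time.strftime('%Y%m%d')))
```

Failing early here is cheaper than discovering a misplaced upload after a long regenie run.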

In [None]:
LOCAL_BGEN = os.path.basename(REMOTE_BGEN)
LOCAL_BGEN_SAMPLE = os.path.basename(REMOTE_BGEN_SAMPLE)
LOCAL_GWAS_PHENOTYPES = os.path.basename(REMOTE_GWAS_PHENOTYPES)
LOCAL_STEP1_VARIANT_QC_ID = os.path.basename(REMOTE_STEP1_VARIANT_QC_ID)
LOCAL_STEP1_VARIANT_QC_SNPLIST = os.path.basename(REMOTE_STEP1_VARIANT_QC_SNPLIST)
LOCAL_STEP2_VARIANT_QC_ID = os.path.basename(REMOTE_STEP2_VARIANT_QC_ID)
LOCAL_STEP2_VARIANT_QC_SNPLIST = os.path.basename(REMOTE_STEP2_VARIANT_QC_SNPLIST)

## Copy data locally

In [None]:
!gsutil cp -n {REMOTE_BGEN} {REMOTE_BGEN_SAMPLE} .

In [None]:
!gsutil cp {REMOTE_GWAS_PHENOTYPES} {REMOTE_STEP1_VARIANT_QC_ID} {REMOTE_STEP1_VARIANT_QC_SNPLIST} \
    {REMOTE_STEP2_VARIANT_QC_ID} {REMOTE_STEP2_VARIANT_QC_SNPLIST} .
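
Because `gsutil cp` reports failures only in its log output, it can help to confirm that every input arrived before starting regenie. The sketch below uses a hypothetical helper, `find_missing`, and throwaway temporary files; in the notebook one would pass the `LOCAL_*` constants instead.

```python
import os
import tempfile


def find_missing(paths):
    """Return the paths that do not exist as non-empty regular files."""
    return [p for p in paths if not (os.path.isfile(p) and os.path.getsize(p) > 0)]


# Demonstration with throwaway files; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, 'present.snplist')
    absent = os.path.join(tmp, 'absent.snplist')
    with open(present, 'w') as f:
        f.write('placeholder_variant_id\n')
    missing = find_missing([present, absent])
    print([os.path.basename(p) for p in missing])
```

An empty list means all inputs are in place; anything else names the files to copy again.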

# regenie step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

See also the regenie documentation on input options: https://rgcgithub.github.io/regenie/options/#input

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 1 \
    --bgen {LOCAL_BGEN} \
    --ref-first \
    --sample {LOCAL_BGEN_SAMPLE} \
    --phenoFile {LOCAL_GWAS_PHENOTYPES} \
    --phenoColList LDL_adjusted_norm,HDL_norm,TC_adjusted_norm,TG_adjusted_norm \
    --covarFile {LOCAL_GWAS_PHENOTYPES} \
    --catCovarList sex_at_birth \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --extract {LOCAL_STEP1_VARIANT_QC_SNPLIST} \
    --keep {LOCAL_STEP1_VARIANT_QC_ID} \
    --bsize 1000 \
    --verbose \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_step1
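
The same step 1 invocation can also be assembled as an argument list, which may be easier to parameterize and to log than a shell one-liner. This is only a sketch: `build_step1_args` is a hypothetical helper, the file names are the basenames of the placeholder paths defined earlier, and the flags are exactly those used in the cell above.

```python
import shlex


def build_step1_args(binary, bgen, sample, pheno_file, pheno_cols, covar_cols,
                     cat_covar_cols, snplist, keep_ids, out_prefix, bsize=1000):
    """Assemble the step 1 command line as a list of arguments."""
    return [
        binary,
        '--step', '1',
        '--bgen', bgen,
        '--ref-first',
        '--sample', sample,
        '--phenoFile', pheno_file,
        '--phenoColList', ','.join(pheno_cols),
        '--covarFile', pheno_file,  # phenotypes and covariates share one file
        '--catCovarList', ','.join(cat_covar_cols),
        '--covarColList', ','.join(covar_cols),
        '--extract', snplist,
        '--keep', keep_ids,
        '--bsize', str(bsize),
        '--verbose',
        '--out', out_prefix,
    ]


args = build_step1_args(
    binary='./regenie_v2.2.4.gz_x86_64_Linux',
    bgen='aou-alpha3-chr21.bgen',
    sample='aou-alpha3-chr21.sample',
    pheno_file='pheno',
    pheno_cols=['LDL_adjusted_norm', 'HDL_norm', 'TC_adjusted_norm', 'TG_adjusted_norm'],
    covar_cols=['age', 'age2'] + [f'PC{i}' for i in range(1, 11)],
    cat_covar_cols=['sex_at_birth'],
    snplist='_step1QC_plink.snplist',
    keep_ids='_step1QC_plink.id',
    out_prefix='aou_alpha3_lipids_regenie_step1',
)
# Print a shell-quoted version for the run log.
print(shlex.join(args))
# To execute: subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues, and `check=True` raises an exception if regenie exits with a non-zero status.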

In [None]:
!ls -lth . | head

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls -lh {REGENIE_OUTPUTS}
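
Step 1 writes a `_pred.list` file that step 2 consumes through `--pred`. As we understand the regenie documentation, each line holds a phenotype name followed by the path to its LOCO prediction file; treat that layout as an assumption to verify against your own output. A minimal sketch that checks every requested phenotype has an entry (`read_pred_list` is a hypothetical helper, and the paths in the demo are made up):

```python
import io


def read_pred_list(handle):
    """Map each phenotype name to the LOCO prediction file listed for it."""
    predictions = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f'Expected 2 fields, got {len(fields)}: {line!r}')
        phenotype, loco_path = fields
        predictions[phenotype] = loco_path
    return predictions


# Synthetic content for illustration; real paths are written by regenie.
demo = io.StringIO(
    'LDL_adjusted_norm /home/jupyter/aou_alpha3_lipids_regenie_step1_1.loco\n'
    'HDL_norm /home/jupyter/aou_alpha3_lipids_regenie_step1_2.loco\n'
)
pred = read_pred_list(demo)
expected = {'LDL_adjusted_norm', 'HDL_norm', 'TC_adjusted_norm', 'TG_adjusted_norm'}
# Phenotypes requested via --phenoColList that have no prediction file listed.
print(sorted(expected - set(pred)))
```

In the notebook, the handle would come from `open(f'{OUTPUT_FILENAME_PREFIX}_regenie_step1_pred.list')`. Note that this simple parser assumes paths without spaces.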

# regenie step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 2 \
    --bgen {LOCAL_BGEN} \
    --ref-first \
    --sample {LOCAL_BGEN_SAMPLE} \
    --phenoFile {LOCAL_GWAS_PHENOTYPES} \
    --phenoColList LDL_adjusted_norm,HDL_norm,TC_adjusted_norm,TG_adjusted_norm \
    --covarFile {LOCAL_GWAS_PHENOTYPES} \
    --catCovarList sex_at_birth \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --extract {LOCAL_STEP2_VARIANT_QC_SNPLIST} \
    --keep {LOCAL_STEP2_VARIANT_QC_ID} \
    --pred {OUTPUT_FILENAME_PREFIX}_regenie_step1_pred.list \
    --bsize 400 \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_step2

In [None]:
!ls -lth {OUTPUT_FILENAME_PREFIX}*

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls -lh {REGENIE_OUTPUTS}
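
As a quick sanity check on the step 2 results, one can scan a per-phenotype output file for genome-wide significant variants. The sketch below assumes a space-delimited file with a header row containing `ID` and `LOG10P` columns, where `LOG10P` is -log10 of the p-value; confirm both against your actual output. The 5e-8 threshold is the conventional GWAS cutoff rather than a value taken from this notebook, `significant_hits` is a hypothetical helper, and the demo rows are synthetic.

```python
import csv
import io
import math

GENOME_WIDE_P = 5e-8  # conventional threshold; an assumption, not from this notebook


def significant_hits(handle, p_threshold=GENOME_WIDE_P):
    """Yield rows whose LOG10P exceeds -log10(p_threshold)."""
    min_log10p = -math.log10(p_threshold)
    for row in csv.DictReader(handle, delimiter=' '):
        value = row['LOG10P']
        if value in ('NA', ''):
            continue  # test statistic unavailable for this variant
        if float(value) > min_log10p:
            yield row


# Synthetic rows, made up purely to exercise the parser.
demo = io.StringIO(
    'CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ N TEST BETA SE CHISQ LOG10P EXTRA\n'
    '21 100 var_a A G 0.10 1000 ADD 0.50 0.05 100.0 22.5 NA\n'
    '21 200 var_b C T 0.20 1000 ADD 0.01 0.05 0.04 0.07 NA\n'
    '21 300 var_c G A 0.30 1000 ADD NA NA NA NA NA\n'
)
hits = list(significant_hits(demo))
print([h['ID'] for h in hits])
```

Comparing on the log scale avoids floating-point underflow for very small p-values.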

# Provenance

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze