# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/).

# Setup

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench.
    <ul>
        <li>Use compute type 'Standard VM' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook can take a while to run. We recommend running it in the background via <kbd>run_notebook_in_the_background</kbd>.</li>
    </ul>
</div>

In [None]:
from datetime import datetime
import os
import time

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

##### plink 2 install
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink2/

In [None]:
!/tmp/plink2/plink2 --version # --help

## Setup regenie

Note: regenie is already installed locally by default, but we are choosing to update to a more recent version.

In [None]:
!regenie --version # --help

In [None]:
%%bash

REGENIE_VERSION=v2.2.4
rm -f regenie.zip  # -f: do not fail when no earlier download exists
curl -L -o regenie.zip "https://github.com/rgcgithub/regenie/releases/download/${REGENIE_VERSION}/regenie_${REGENIE_VERSION}.gz_x86_64_Linux.zip"
unzip -o regenie.zip

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux --version # --help
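
The binary name above repeats the version that the install cell downloads. As a minimal sketch, the path could instead be derived from a single constant so the two cannot drift apart. The Python-side names `REGENIE_VERSION`, `regenie_binary`, and `REGENIE_BINARY` are hypothetical and not part of the original notebook.

```python
# Hypothetical constant: keep the release tag in one place on the Python side.
REGENIE_VERSION = 'v2.2.4'


def regenie_binary(version: str) -> str:
    """Return the relative path of the unzipped regenie binary for a release tag."""
    # Mirrors the asset naming used in the download URL of the install cell.
    return f'./regenie_{version}.gz_x86_64_Linux'


REGENIE_BINARY = regenie_binary(REGENIE_VERSION)
print(REGENIE_BINARY)
```

Later cells could then call `!{REGENIE_BINARY} --version` rather than repeating the file name.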

## Define constants

The merged BGEN file and its sample file were created via notebook `write_bgen.ipynb`.

In [None]:
REMOTE_MERGED_BGEN = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210906/ukb-aou-alpha2-chr1-chr22.bgen'
REMOTE_MERGED_BGEN_SAMPLE = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210906/ukb-aou-alpha2-chr1-chr22.sample'

LOCAL_MERGED_BGEN = os.path.basename(REMOTE_MERGED_BGEN)
LOCAL_MERGED_BGEN_SAMPLE = os.path.basename(REMOTE_MERGED_BGEN_SAMPLE)

This TSV was created via notebook `7_pooled_lipids_gwas_phenotype.ipynb`.

In [None]:
REMOTE_REGENIE_PHENOTYPES = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/pooled/phenotypes/20211224/aou_alpha2_ukb_pooled_lipids_phenotype.tsv'

LOCAL_REGENIE_PHENOTYPES = os.path.basename(REMOTE_REGENIE_PHENOTYPES)

In [None]:
RESULT_BUCKET = os.getenv("WORKSPACE_BUCKET")
DATESTAMP = time.strftime('%Y%m%d')

# Outputs
OUTPUT_FILENAME_PREFIX = 'aou_alpha2_ukb_lipids'
REGENIE_OUTPUTS = f'{RESULT_BUCKET}/data/merged/regenie/{DATESTAMP}/'
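
If `WORKSPACE_BUCKET` is unset, `os.getenv` returns `None` and the output path silently becomes `None/data/merged/regenie/...`. The sketch below fails early in that case. The helper name `regenie_output_prefix` and the bucket `gs://example-bucket` are hypothetical, used only for illustration.

```python
import time


def regenie_output_prefix(bucket, datestamp):
    """Build the GCS output folder, raising if the bucket is missing or malformed."""
    if not bucket or not bucket.startswith('gs://'):
        raise ValueError(f'WORKSPACE_BUCKET is not a gs:// URL: {bucket!r}')
    # Strip a trailing slash so the result never contains '//' after the bucket name.
    return f'{bucket.rstrip("/")}/data/merged/regenie/{datestamp}/'


# Example with a placeholder bucket name (hypothetical).
print(regenie_output_prefix('gs://example-bucket', time.strftime('%Y%m%d')))
```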

## Copy data locally

In [None]:
!gsutil cp -n {REMOTE_MERGED_BGEN} {REMOTE_MERGED_BGEN_SAMPLE} .

In [None]:
!gsutil cp {REMOTE_REGENIE_PHENOTYPES} .
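
Before running regenie, it can help to confirm that the phenotype TSV contains every column the later commands name. This is a minimal sketch: the helper `missing_columns` is hypothetical, the column list is copied from the regenie commands in this notebook, and the presence of `FID` and `IID` as identifier columns is an assumption based on regenie's documented phenotype file format.

```python
import csv

# Column names taken from the regenie commands later in this notebook.
# FID and IID are assumed to be the sample identifier columns.
REQUIRED_COLUMNS = [
    'FID', 'IID',
    'LDL_adjusted_norm', 'HDL_norm', 'TC_adjusted_norm', 'TG_adjusted_norm',
    'sex_at_birth', 'cohort', 'age', 'age2',
] + [f'PC{i}' for i in range(1, 11)]


def missing_columns(path, required, delimiter='\t'):
    """Return the required column names that are absent from the file's header row."""
    with open(path, newline='') as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    return [name for name in required if name not in header]
```

Calling `missing_columns(LOCAL_REGENIE_PHENOTYPES, REQUIRED_COLUMNS)` should return an empty list when the file is complete.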

# Variant QC via PLINK

Use [plink2 to perform the variant QC](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files) and obtain a subset of SNPs roughly equal to the number of samples.

## TEMPORARY: test on chr21

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_MERGED_BGEN} ref-first \
  --chr 21 \
  --geno 0.1 \
  --mind 0.1 \
  --mac 100 \
  --hwe 1e-15 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out {OUTPUT_FILENAME_PREFIX}_plink

# Change this back when we are finished testing.
#  --chr 1-22 \


# This is too strict and removes too many variants.
#  --maf 0.01 \

In [None]:
!ls -lth . | head

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_plink.id

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_plink.id

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_plink.id

In [None]:
!head {OUTPUT_FILENAME_PREFIX}_plink.snplist

In [None]:
!tail {OUTPUT_FILENAME_PREFIX}_plink.snplist

In [None]:
!wc -l {OUTPUT_FILENAME_PREFIX}_plink.snplist
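
The `wc -l` checks above can be combined into a single comparison of retained variants to retained samples. This is a minimal sketch with hypothetical helper names. It relies on `--no-id-header`, so that every line of the `.id` file is one sample.

```python
def count_lines(path):
    """Count the non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def variants_per_sample(snplist_path, id_path):
    """Return the ratio of retained variants to retained samples."""
    n_samples = count_lines(id_path)
    if n_samples == 0:
        raise ValueError(f'no samples found in {id_path}')
    return count_lines(snplist_path) / n_samples
```

For example, `variants_per_sample(f'{OUTPUT_FILENAME_PREFIX}_plink.snplist', f'{OUTPUT_FILENAME_PREFIX}_plink.id')` gives a value near 1 when the SNP subset is roughly the size of the sample set.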

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

# regenie

This work is based on https://github.com/briansha/Regenie_WDL/blob/master/regenie.wdl

See also:
* regenie documentation: https://rgcgithub.github.io/regenie/options/#input
* dsub documentation: https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md

## Step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

In [None]:
# Parameters to add
# 8 core machine
# 11 GB ram
# 500 GB disk

!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 1 \
    --bgen={LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample={LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile={LOCAL_REGENIE_PHENOTYPES} \
    --phenoColList=LDL_adjusted_norm,HDL_norm,TC_adjusted_norm,TG_adjusted_norm \
    --covarFile={LOCAL_REGENIE_PHENOTYPES} \
    --catCovarList=sex_at_birth,cohort \
    --covarColList=age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --extract {OUTPUT_FILENAME_PREFIX}_plink.snplist \
    --bsize 1000 \
    --verbose \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_part1


# Note: the QC step did not omit any samples, so we leave out the --keep file below.
# When it was included, regenie complained that it did not match the samples in the BGEN file.
#     --keep {OUTPUT_FILENAME_PREFIX}_plink.id \


In [None]:
!ls -lth . | head

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls {REGENIE_OUTPUTS}
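
Step 2 reads the Step 1 predictions through the `_pred.list` file, so it is worth checking that file before continuing. The sketch below assumes each line holds a phenotype name followed by the path of its LOCO prediction file, separated by whitespace. The helper names are hypothetical, and paths containing spaces are not handled.

```python
import os


def read_pred_list(path):
    """Parse a Step 1 *_pred.list file into a {phenotype: loco_file_path} dict."""
    predictions = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                predictions[fields[0]] = fields[1]
    return predictions


def missing_loco_files(predictions):
    """Return the phenotypes whose LOCO prediction file is not on local disk."""
    return [name for name, loco in predictions.items() if not os.path.exists(loco)]
```

With `preds = read_pred_list(f'{OUTPUT_FILENAME_PREFIX}_regenie_part1_pred.list')`, an empty result from `missing_loco_files(preds)` suggests all four phenotypes have predictions available locally.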

## Step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 2 \
    --bgen={LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample={LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile={LOCAL_REGENIE_PHENOTYPES} \
    --phenoColList=LDL_adjusted_norm,HDL_norm,TC_adjusted_norm,TG_adjusted_norm \
    --covarFile={LOCAL_REGENIE_PHENOTYPES} \
    --catCovarList=sex_at_birth,cohort \
    --covarColList=age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --firth \
    --approx \
    --pThresh 0.01 \
    --pred {OUTPUT_FILENAME_PREFIX}_regenie_part1_pred.list \
    --bsize 400 \
    --out {OUTPUT_FILENAME_PREFIX}_regenie_part2

In [None]:
!ls -lth {OUTPUT_FILENAME_PREFIX}*

In [None]:
!gsutil -m cp {OUTPUT_FILENAME_PREFIX}* {REGENIE_OUTPUTS}

In [None]:
!gsutil ls {REGENIE_OUTPUTS}
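
Once Step 2 finishes, the association results can be screened for strong signals. This sketch assumes that regenie writes one whitespace-delimited `<out>_<phenotype>.regenie` file per phenotype with a `LOG10P` column holding -log10(p). The helper name and the 5e-8 genome-wide threshold are illustrative choices, not taken from the original notebook.

```python
import math

GENOME_WIDE_P = 5e-8  # conventional threshold; an assumption, not from the notebook


def significant_hits(path, p_threshold=GENOME_WIDE_P):
    """Return rows (as dicts) of a .regenie file whose p-value is below the threshold."""
    min_log10p = -math.log10(p_threshold)
    hits = []
    header = None
    with open(path) as handle:
        for line in handle:
            if line.startswith('##') or not line.strip():
                continue  # skip comment and blank lines
            if header is None:
                header = line.split()  # first remaining line is the column header
                continue
            row = dict(zip(header, line.split()))
            value = row.get('LOG10P', 'NA')
            # LOG10P is -log10(p), so a larger value means a smaller p-value.
            if value != 'NA' and float(value) > min_log10p:
                hits.append(row)
    return hits
```

For example, `significant_hits(f'{OUTPUT_FILENAME_PREFIX}_regenie_part2_LDL_adjusted_norm.regenie')` would list the variants passing the threshold for the LDL phenotype, if the output naming assumption holds.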

# Provenance

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze