# Run a GWAS via regenie step 2

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/). This is step two of two.

This work is part of a larger project to [Demonstrate the Potential for Pooled Analysis of All of Us and UK Biobank Genomic Data](https://github.com/all-of-us/ukb-cross-analysis-demo-project). Specifically, this notebook belongs to the **siloed** analysis portion of that project.

# Setup 


<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Single Node' with sufficient CPU and RAM (e.g. start with 8 CPUs and 30 GB RAM, increase if needed).</li>
        <li>This notebook can take a while to run (e.g., 90 minutes). We recommend running it in the background via <kbd>dx run dxjupyterlab</kbd>, which also captures provenance.</li>
    </ul>
</div>

```
dx run dxjupyterlab \
    --instance-type=mem2_ssd1_v2_x8 \
    -icmd="papermill 09_ukb_regenie_step2_gwas.ipynb 09_ukb_regenie_step2_gwas_$(date +%Y%m%d).ipynb" \
    -iin=09_ukb_regenie_step2_gwas.ipynb \
    -iduration=1440 \
    --folder=outputs/regenie-step-2/$(date +%Y%m%d)/
```
See also https://platform.dnanexus.com/app/dxjupyterlab

In [None]:
from datetime import datetime
import os
import pandas as pd
import time

## Setup regenie

In [None]:
%%bash

REGENIE_VERSION=v2.2.4
# Remove any previous download; -f avoids an error when the file does not exist yet.
rm -f regenie.zip
curl -L -o regenie.zip "https://github.com/rgcgithub/regenie/releases/download/${REGENIE_VERSION}/regenie_${REGENIE_VERSION}.gz_x86_64_Linux.zip"
unzip -o regenie.zip

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux --help | head

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

#---[ Inputs ]---
# This was created via ukb_rap_siloed_analyses/04_ukb_plink_merge_bed_files.ipynb
BED_FILE = '/mnt/project/outputs/plink-merge-bed/20220420/ukb_200kwes_filtered_plink_mergebed'
# This was created via ukb_rap_siloed_analyses/07_ukb_lipids_phenotype_for_gwas.ipynb
GWAS_PHENOTYPES = '/mnt/project/outputs/r-prepare-phenotype-for-gwas/20220425/ukb_200kwes_lipids_gwas_phenotype.tsv'
# These two files were created via notebook ukb_rap_siloed_analyses/05_ukb_plink_variant_qc.ipynb
QCED_VARIANTS = '/mnt/project/outputs/plink-variant-qc/20230329/ukb_200kwes_lipids_regenie_step2_plink_variant_qc.snplist'
QCED_SAMPLES = '/mnt/project/outputs/plink-variant-qc/20230329/ukb_200kwes_lipids_regenie_step2_plink_variant_qc.id'
# This was created via ukb_rap_siloed_analyses/08_ukb_regenie_step1_gwas.ipynb
REGENIE_STEP_1 = '/mnt/project/outputs/regenie-step-1/20230403/ukb_200kwes_lipids_regenie_step1_pred.list'

#---[ Outputs ]---
REGENIE_OUTPUT_FILENAME_PREFIX = 'ukb_200kwes_lipids'
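
Because the run is long, it can help to confirm that every input exists before starting. The following is a minimal sketch, not part of the original workflow: the helper name `find_missing_inputs` is hypothetical, and it assumes `BED_FILE` is a PLINK prefix shared by `.bed`, `.bim`, and `.fam` files.

```python
import os


def find_missing_inputs(bed_prefix, other_paths):
    """Return the expected input paths that do not exist on disk.

    Hypothetical helper for illustration only.
    """
    # Assumption: a PLINK "bed file" is a prefix shared by three files.
    expected = [bed_prefix + ext for ext in ('.bed', '.bim', '.fam')]
    expected.extend(other_paths)
    return [path for path in expected if not os.path.exists(path)]
```

With the constants above, `find_missing_inputs(BED_FILE, [GWAS_PHENOTYPES, QCED_VARIANTS, QCED_SAMPLES, REGENIE_STEP_1])` should return an empty list when everything is in place.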

# regenie step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.

In [None]:
# Regenie expects the loco files to be in the current working directory.
!cp $(dirname '{REGENIE_STEP_1}')/*loco .
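
To check that the copy picked up every prediction file, one option is to read the step 1 list and look for each loco file. This sketch assumes each line of the list holds a phenotype name followed by the path to its loco file, separated by whitespace; the helper names `read_pred_list` and `unresolved_loco_files` are hypothetical.

```python
import os


def read_pred_list(pred_list_path):
    """Parse a step 1 prediction list into {phenotype: loco_path}.

    Assumption: each non-empty line is "<phenotype> <path to loco file>".
    """
    loco_by_phenotype = {}
    with open(pred_list_path) as handle:
        for line in handle:
            if not line.strip():
                continue
            phenotype, loco_path = line.split(maxsplit=1)
            loco_by_phenotype[phenotype] = loco_path.strip()
    return loco_by_phenotype


def unresolved_loco_files(pred_list_path, working_dir='.'):
    """Return listed loco paths found neither as written nor by basename in working_dir."""
    missing = []
    for loco_path in read_pred_list(pred_list_path).values():
        candidates = [
            # os.path.join keeps an absolute loco_path unchanged.
            os.path.join(working_dir, loco_path),
            os.path.join(working_dir, os.path.basename(loco_path)),
        ]
        if not any(os.path.exists(candidate) for candidate in candidates):
            missing.append(loco_path)
    return missing
```

Calling `unresolved_loco_files(REGENIE_STEP_1)` after the copy should return an empty list if all loco files are reachable from the current working directory.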

In [None]:
!./regenie_v2.2.4.gz_x86_64_Linux \
    --step 2 \
    --bed {BED_FILE} \
    --phenoFile {GWAS_PHENOTYPES} \
    --phenoColList TC_adj_mg_dl_norm,LDL_adj_mg_dl_norm,HDL_mg_dl_norm,TG_log_mg_dl_norm \
    --covarFile {GWAS_PHENOTYPES} \
    --catCovarList sex \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
    --extract {QCED_VARIANTS} \
    --keep {QCED_SAMPLES} \
    --pred {REGENIE_STEP_1} \
    --bsize 400 \
    --out {REGENIE_OUTPUT_FILENAME_PREFIX}_regenie_step2
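
Once step 2 finishes, a quick look at the strongest associations can confirm that the run produced sensible results. The sketch below rests on several assumptions: that regenie writes one space-delimited file per phenotype named `<out prefix>_<phenotype>.regenie`, that each file has a header row with a `LOG10P` column, and that failed tests are reported as `NA`. The helper name `significant_hits` is hypothetical, and the 5e-8 cutoff is the conventional genome-wide threshold rather than a value taken from this project.

```python
import csv
import math

# Conventional genome-wide significance threshold (p < 5e-8) on the -log10 scale.
GENOME_WIDE_LOG10P = -math.log10(5e-8)


def significant_hits(regenie_path, min_log10p=GENOME_WIDE_LOG10P):
    """Return rows (as dicts) whose LOG10P meets the threshold.

    Assumptions: space-delimited file with a header row that includes
    a LOG10P column, and 'NA' marking tests without a result.
    """
    hits = []
    with open(regenie_path, newline='') as handle:
        for row in csv.DictReader(handle, delimiter=' '):
            value = row['LOG10P']
            if value == 'NA':
                continue  # Skip variants with no test result.
            if float(value) >= min_log10p:
                hits.append(row)
    return hits
```

Under the naming assumption above, the total cholesterol results would be read with `significant_hits(f'{REGENIE_OUTPUT_FILENAME_PREFIX}_regenie_step2_TC_adj_mg_dl_norm.regenie')`.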

In [None]:
!ls -lth . | head

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze