# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/) via [dsub](https://github.com/databiosphere/dsub).

# Setup 

In [None]:
import os

## Setup plink2

https://www.cog-genomics.org/plink/2.0/

In [None]:
%%bash

##### plink 2 install
PLINK_VERSION=2.3.Alpha
PLINK_ZIP_PATH=/tmp/plink-$PLINK_VERSION.zip
curl -L -o $PLINK_ZIP_PATH https://s3.amazonaws.com/plink2-assets/alpha2/plink2_linux_x86_64.zip
mkdir -p /tmp/plink2/
unzip -o $PLINK_ZIP_PATH -d /tmp/plink2/

In [None]:
!/tmp/plink2/plink2 --version # --help

## Setup regenie

Note: regenie is already installed locally by default.

For longer-running jobs we will run it via dsub. regenie is installed in Docker image `briansha/regenie:v2.0.1_boost`.

In [None]:
!regenie --version # --help

## Setup dsub

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench. It runs fine on the default Cloud Environment. 
</div>

In [None]:
!pip3 install --upgrade dsub

In [None]:
!dsub --version # --help

In [None]:
%%bash

gcloud auth list

<div class="alert alert-block alert-warning">
    <b>Note:</b> (1) You must use your own PET account. (2) Your PET account has to be granted access to run itself as a service account.
</div>

## Setup bgen_reader

In [None]:
!pip3 install bgen-reader

## Define constants

The BGEN file was created via `write_bgen_20210719_172314.ipynb`. Whether it is in the correct format for regenie is still to be determined.

Note that Brian successfully created a BGEN for regenie using this command:
`./plink2 --bfile 'MEGA_data_common_filtered_final' --chr 1-22 --export bgen-1.2 bits=8  --out 'MEGA_data_common_filtered_final_chr1_22'`

In [None]:
REMOTE_MERGED_BGEN = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210802/ukb-aou-alpha1-chr3.bgen'
#'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210729/ukb-aou-alpha1.bgen'
#'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210729/ukb-aou-alpha1-chr3.bgen'
REMOTE_MERGED_BGEN_SAMPLE = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210802/ukb-aou-alpha1-chr3.sample'
# 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210729/ukb-aou-alpha1.sample'
# 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210729/ukb-aou-alpha1-chr3.sample'

LOCAL_MERGED_BGEN = os.path.basename(REMOTE_MERGED_BGEN)
LOCAL_MERGED_BGEN_SAMPLE = os.path.basename(REMOTE_MERGED_BGEN_SAMPLE)

This TSV was created via notebook `hail_gwas.ipynb`.

In [None]:
REMOTE_PHENOTYPES = 'gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210727/phenotypes.tsv'

LOCAL_PHENOTYPES = os.path.basename(REMOTE_PHENOTYPES)

## Copy data locally for testing

In [None]:
!gsutil cp {REMOTE_MERGED_BGEN} {REMOTE_MERGED_BGEN_SAMPLE} {REMOTE_PHENOTYPES} .    

In [None]:
!head {LOCAL_PHENOTYPES}
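
Beyond eyeballing the first few rows, it can help to confirm that the header contains every column the regenie commands in this notebook refer to. The sketch below is a minimal check under two assumptions: the file is tab-separated with a single header line, and it starts with `FID` and `IID` columns as regenie expects. The helper name `missing_columns` and the in-memory example rows are hypothetical.

```python
import csv
import io

# Column names copied from the regenie commands used in this notebook.
PHENOTYPE_COLS = ["LDL_norm", "HDL_norm", "TC_norm", "TG_norm"]
COVARIATE_COLS = ["is_male", "is_aou_cohort", "age", "age2"] + [
    f"pc{i}" for i in range(1, 11)
]


def missing_columns(tsv_handle, required):
    """Return the required column names that are absent from the TSV header."""
    header = next(csv.reader(tsv_handle, delimiter="\t"))
    return [col for col in required if col not in header]


# Tiny in-memory stand-in for the real phenotypes file (hypothetical values).
example = io.StringIO("FID\tIID\tLDL_norm\tage\n0\t1001\t0.12\t54\n")
print(missing_columns(example, ["FID", "IID", "LDL_norm", "HDL_norm"]))
```

Against the real file, the same helper could be called as `missing_columns(open(LOCAL_PHENOTYPES), ["FID", "IID"] + PHENOTYPE_COLS + COVARIATE_COLS)`; an empty list means every expected column is present.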

# Check the BGEN file

A few problems needed to be fixed:
* The first time I created it, I hit [this issue](https://hail.zulipchat.com/#narrow/stream/123010-Hail-0.2E2.20support/topic/hl.2Eexport_bgen).
* The second time, I realized that PLINK required rsids, so those were computed by Hail and written to a new BGEN.

In [None]:
!head {LOCAL_MERGED_BGEN_SAMPLE}

In [None]:
from bgen_reader import read_bgen

In [None]:
bgen = read_bgen(LOCAL_MERGED_BGEN, verbose=True)

In [None]:
# Variants metadata.
print(bgen["variants"].head())

In [None]:
# Samples read from the bgen file.
print(bgen["samples"].head())

In [None]:
# The last few samples read from the bgen file.
print(bgen["samples"].tail())

In [None]:
# Print the total number of variants.
print(len(bgen["genotype"]))

In [None]:
# For performance and memory reasons, this library avoids accessing the bgen
# file as much as possible. The `compute` function tells the library to
# actually read the file and retrieve the data.
geno = bgen["genotype"][0].compute()
print(geno.keys())
# Let's have a look at the probabilities regarding the first variant.
print(geno["probs"])
# The above matrix is of size samples-by-(combination-of-alleles).
print(geno["probs"].shape)
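
To make the samples-by-genotype layout concrete, here is a minimal sketch of turning one row of probabilities into an expected allele count (a dosage). It assumes a biallelic variant in diploid samples, with the three columns ordered as homozygous for the first allele, heterozygous, and homozygous for the second allele. That ordering and the helper name `second_allele_dosage` are assumptions, and the example rows are made up rather than read from the BGEN file.

```python
def second_allele_dosage(probs_row):
    """Expected count of the second allele from [P(AA), P(AB), P(BB)]."""
    p_aa, p_ab, p_bb = probs_row
    return p_ab + 2.0 * p_bb


# Hypothetical probability rows for three samples at one variant.
example_probs = [
    [1.0, 0.0, 0.0],  # confidently homozygous for the first allele
    [0.0, 1.0, 0.0],  # confidently heterozygous
    [0.1, 0.3, 0.6],  # uncertain call
]
dosages = [second_allele_dosage(row) for row in example_probs]
print(dosages)
```

With the real data, each row of `geno["probs"]` could be passed to the same function, provided the variant is biallelic.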

## Check missingness

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_MERGED_BGEN} ref-first \
  --missing \
  --chr 21

# Variant QC via PLINK

Per Margaret, use [plink2 to perform the variant QC](https://rgcgithub.github.io/regenie/recommendations/#exclusion-files) and obtain a subset of SNPs roughly equal to the number of samples.


We'll run this locally since it's pretty quick.

In [None]:
!/tmp/plink2/plink2 \
  --bgen {LOCAL_MERGED_BGEN} ref-first \
  --geno 0.1 \
  --mind 0.1 \
  --maf 0.01 \
  --hwe 1e-15 \
  --chr 1-22 \
  --write-snplist \
  --write-samples \
  --no-id-header \
  --out qc_pass

# This is too strict and removes too many samples.
#   --mac 100 \

# These parameters yield errors, and are not relevant to this variant QC.
#  --pheno {LOCAL_PHENOTYPES} \
#  --pheno-name FID,IID,sex,LDL_norm \
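
As a rough illustration of what the `--geno` and `--maf` thresholds above mean for a single variant, the sketch below applies them to hard genotype calls. It is a simplification, not a reimplementation of plink2: it ignores `--mind` and `--hwe`, and it works on 0/1/2 calls rather than BGEN probabilities. The function names and example genotypes are hypothetical.

```python
def missing_rate(genotypes):
    """Fraction of calls that are missing (represented here as None)."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 alternate-allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_variant_qc(genotypes, max_missing=0.1, min_maf=0.01):
    """Keep a variant only if it is well called and not too rare."""
    return (
        missing_rate(genotypes) <= max_missing
        and minor_allele_frequency(genotypes) >= min_maf
    )


common_variant = [0, 1, 2, 0]        # no missing calls, MAF 0.375
poorly_called = [0, 0, 0, None]      # 25% missing, above the 0.1 limit
monomorphic = [0] * 100              # MAF 0, below the 0.01 limit
print([passes_variant_qc(v) for v in (common_variant, poorly_called, monomorphic)])
```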

In [None]:
!ls -lth . | head

In [None]:
!head qc_pass.id

In [None]:
!wc -l qc_pass.id

In [None]:
!head qc_pass.snplist

In [None]:
!tail qc_pass.snplist

In [None]:
!wc -l qc_pass.snplist
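
Since the goal is a SNP subset roughly equal in size to the number of samples, the two `wc -l` counts can be compared directly. The sketch below does so in Python. It relies on `--no-id-header` having been used, so that every line of the sample file is one sample. The helper names and the temporary demo files are hypothetical stand-ins for `qc_pass.snplist` and `qc_pass.id`.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a text file without loading it into memory."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def snp_to_sample_ratio(snplist_path, samples_path):
    """Ratio of retained SNPs to retained samples (about 1.0 is the target)."""
    return count_lines(snplist_path) / count_lines(samples_path)


# Demo on small temporary files standing in for the plink2 outputs.
with tempfile.TemporaryDirectory() as tmp:
    snps = os.path.join(tmp, "demo.snplist")
    ids = os.path.join(tmp, "demo.id")
    with open(snps, "w") as f:
        f.write("rs1\nrs2\nrs3\n")
    with open(ids, "w") as f:
        f.write("0\t1001\n0\t1002\n")
    demo_ratio = snp_to_sample_ratio(snps, ids)

print(demo_ratio)
```

On the real outputs this would be `snp_to_sample_ratio("qc_pass.snplist", "qc_pass.id")`. A ratio far above 1 suggests the SNP list may need further thinning before Step 1.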

# regenie

This work is based on https://github.com/briansha/Regenie_WDL/blob/master/regenie.wdl

See also:
* regenie documentation https://rgcgithub.github.io/regenie/options/#input
* dsub documentation https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md

## Step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

In [None]:
# Parameters to add
# 8 core machine
# 11 GB ram
# 500 GB disk

!regenie \
    --step 1 \
    --bgen={LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample={LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile={LOCAL_PHENOTYPES} \
    --phenoColList=LDL_norm,HDL_norm,TC_norm,TG_norm \
    --covarFile={LOCAL_PHENOTYPES} \
    --covarColList=is_male,is_aou_cohort,age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
    --extract qc_pass.snplist \
    --bsize 1000 \
    --verbose \
    --out fit_bin_out

# Note that no samples were omitted by the QC step, so we are leaving out this file since regenie 
# complained that it did not match the samples in the BGEN file.
#     --keep qc_pass.id \
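
Step 1 and Step 2 share most of their arguments, so one option is to build the command line from a single definition of the phenotype and covariate lists. The sketch below is one way to do that. The helper `regenie_command` is hypothetical, the file names are the local names used in this notebook, and only flags that already appear in the cells of this notebook are used.

```python
import shlex

PHENOTYPES = ["LDL_norm", "HDL_norm", "TC_norm", "TG_norm"]
COVARIATES = ["is_male", "is_aou_cohort", "age", "age2"] + [
    f"pc{i}" for i in range(1, 11)
]


def regenie_command(step, bgen, sample, pheno_file, out_prefix, bsize, extra=()):
    """Build the regenie argument list shared by Step 1 and Step 2."""
    return [
        "regenie",
        "--step", str(step),
        f"--bgen={bgen}",
        "--ref-first",
        f"--sample={sample}",
        f"--phenoFile={pheno_file}",
        "--phenoColList=" + ",".join(PHENOTYPES),
        f"--covarFile={pheno_file}",
        "--covarColList=" + ",".join(COVARIATES),
        "--bsize", str(bsize),
        "--out", out_prefix,
        *extra,
    ]


step1 = regenie_command(
    step=1,
    bgen="ukb-aou-alpha1-chr3.bgen",
    sample="ukb-aou-alpha1-chr3.sample",
    pheno_file="phenotypes.tsv",
    out_prefix="fit_bin_out",
    bsize=1000,
    extra=["--extract", "qc_pass.snplist", "--verbose"],
)
print(shlex.join(step1))
```

The resulting list can be passed to `subprocess.run(step1, check=True)`, which avoids shell quoting issues and keeps the two steps from drifting apart.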


## Step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.
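
Step 2 finds the Step 1 predictions through the file passed to `--pred`. Before launching a long run, it may be worth checking that every prediction file listed there still exists. The sketch below assumes each line of that file has the form `<phenotype> <path to LOCO file>`, which is how we understand the regenie output but should be verified against the actual file. The helper names and example paths are hypothetical.

```python
import os


def parse_pred_list(text):
    """Map each phenotype name to the LOCO prediction file listed for it."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        phenotype, loco_path = line.split(maxsplit=1)
        entries[phenotype] = loco_path.strip()
    return entries


def missing_loco_files(entries):
    """Return the listed prediction files that do not exist on disk."""
    return [path for path in entries.values() if not os.path.exists(path)]


# Hypothetical contents; the real file would be fit_bin_out_pred.list.
example_text = (
    "LDL_norm /nonexistent-dir/fit_bin_out_1.loco\n"
    "HDL_norm /nonexistent-dir/fit_bin_out_2.loco\n"
)
example_entries = parse_pred_list(example_text)
print(missing_loco_files(example_entries))
```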

In [None]:
!regenie \
    --step 2 \
    --bgen={LOCAL_MERGED_BGEN} \
    --ref-first \
    --sample={LOCAL_MERGED_BGEN_SAMPLE} \
    --phenoFile={LOCAL_PHENOTYPES} \
    --phenoColList=LDL_norm,HDL_norm,TC_norm,TG_norm \
    --covarFile={LOCAL_PHENOTYPES} \
    --covarColList=is_male,is_aou_cohort,age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
    --firth \
    --approx \
    --pThresh 0.01 \
    --pred fit_bin_out_pred.list \
    --bsize 400 \
    --split \
    --out merged_aou_ukb_step2

In [None]:
!ls -lth . | head -20
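
Once Step 2 finishes, a quick way to look for signals is to scan each results file for variants above a significance threshold. The sketch below assumes the results are whitespace-delimited with a header containing `ID` and `LOG10P` columns, where `LOG10P` is the negative log10 p-value. That matches our understanding of regenie v2 output but should be checked against the files actually produced. The 5e-8 threshold is the conventional genome-wide cutoff and is not taken from this notebook. The function name and example rows are hypothetical.

```python
import io
import math

GENOME_WIDE_P = 5e-8  # conventional threshold; an assumption, adjust as needed


def significant_hits(handle, p_threshold=GENOME_WIDE_P):
    """Return (variant ID, LOG10P) pairs at or above the threshold."""
    min_log10p = -math.log10(p_threshold)
    header = handle.readline().split()
    id_idx = header.index("ID")
    log10p_idx = header.index("LOG10P")
    hits = []
    for line in handle:
        fields = line.split()
        if not fields or fields[log10p_idx] == "NA":
            continue  # skip blank lines and untested variants
        log10p = float(fields[log10p_idx])
        if log10p >= min_log10p:
            hits.append((fields[id_idx], log10p))
    return hits


# Made-up rows with a reduced set of columns, for illustration only.
example_results = io.StringIO(
    "CHROM GENPOS ID LOG10P\n"
    "3 100 rs1 9.2\n"
    "3 200 rs2 1.1\n"
    "3 300 rs3 NA\n"
)
example_hits = significant_hits(example_results)
print(example_hits)
```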

# Appendix

The sections below are still being rewritten to run this at scale in the background via dsub.

## regenie via dsub

In [None]:
#%%bash

# Parameters to add
# 8 core machine
# 11 GB ram
# 500 GB disk
# --input VARIANT_EXCLUSION_FILE=TODO(deflaux) \
# --exclude="${VARIANT_EXCLUSION_FILE}" \
# NOTE: per Margaret, instead of using the exclusion file, use plink2 to perform the QC
# and obtain a subset of SNPs roughly equal to the number of samples.

job_response = !dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --image "briansha/regenie:v2.0.1_boost" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --input PHENO_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210727/phenotypes.tsv \
  --input BGEN_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210719/ukb-aou-alpha1.bgen \
  --input SAMPLE_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210719/ukb-aou-alpha1.sample \
  --output OUT="${WORKSPACE_BUCKET}/dsub/regenie-step1/$(date +'%Y%m%d/%H%M%S')/fit_bin_out" \
  --command 'set -euo pipefail && \
    regenie \
      --step 1 \
      --bgen="${BGEN_FILE}" \
      --sample="${SAMPLE_FILE}" \
      --phenoFile="${PHENO_FILE}" \
      --phenoColList=LDL_norm,HDL_norm,TC_norm,TG_norm \
      --covarFile="${PHENO_FILE}" \
      --covarColList=is_male,is_aou_cohort,age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
      --bt \
      --bsize 1000 \
      --verbose \
      --out fit_bin_out \
    && ls -la'

job_response

In [None]:
JOB_ID = job_response[1].split(' ')[3]

JOB_ID
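
Indexing into a fixed line and word position of the dsub output is fragile, because the output layout can vary between dsub versions. A slightly more tolerant sketch is to search every captured line for something shaped like the job IDs seen in this notebook (for example `set--jupyter-user--210727-235345-76`). The pattern is inferred from those examples and is an assumption, as are the helper name and the sample output lines.

```python
import re

# Shape inferred from job IDs in this notebook: <name>--<user>--<date>-<time>-<n>.
JOB_ID_PATTERN = re.compile(r"[\w-]+--[\w-]+--\d{6}-\d{6}-\d+")


def find_job_id(output_lines):
    """Return the first job-ID-like token in the captured output, or None."""
    for line in output_lines:
        match = JOB_ID_PATTERN.search(line)
        if match:
            return match.group(0)
    return None


# Hypothetical captured output; the real dsub output may look different.
example_output = [
    "Job properties:",
    "  job-id: set--jupyter-user--210727-235345-76",
]
print(find_job_id(example_output))
```

Used with the cell above, this would be `JOB_ID = find_job_id(job_response)`, with a `None` result signalling that the submission output should be inspected by hand.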

In [None]:
%%bash

# Parameters to add
# 8 core machine
# --input VARIANT_EXCLUSION_FILE=TODO(deflaux) \
# --exclude="${VARIANT_EXCLUSION_FILE}" \

dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --image "briansha/regenie:v2.0.1_boost" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --input PHENO_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210727/phenotypes.tsv \
  --input BGEN_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210719/ukb-aou-alpha1.bgen \
  --input SAMPLE_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210719/ukb-aou-alpha1.sample \
  --output OUT="${WORKSPACE_BUCKET}/dsub/regenie-step1/$(date +'%Y%m%d/%H%M%S')/fit_bin_out" \
  --command 'set -euo pipefail
        regenie \
        --step 1 \
        --bgen="${BGEN_FILE}" \
        --sample="${SAMPLE_FILE}" \
        --phenoFile="${PHENO_FILE}" \
        --phenoColList=LDL_norm,HDL_norm,TC_norm,TG_norm \
        --covarFile="${PHENO_FILE}" \
        --covarColList=is_male,is_aou_cohort,age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
        --bt \
        --bsize 1000 \
        --verbose \
        --out fit_bin_out \
    && ls -la
  '

In [None]:
stat_response = !dstat --provider google-cls-v2 --project aou-rw-preprod-acef10ae --location us-central1 \
    --users jupyter-user --status '*' --full \
    --jobs {JOB_ID}

stat_response

In [None]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/logging/20210727/235345/set--jupyter-user--210727-235345-76.log"

## Compress Hail logs

In [None]:
logs = !gsutil ls "${WORKSPACE_BUCKET}/hail-logs/*/*.log"

logs

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data={
    '--input INPUT_FILE': logs,
    '--output OUTPUT_FILE': [f'{log}.gz' for log in logs]
})

df.head()

In [None]:
df.to_csv('compress_hail_logs.tsv', sep='\t', index=False)

In [None]:
!cat compress_hail_logs.tsv | head

In [None]:
%%bash

dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --preemptible \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --command 'set -o errexit && \
             set -o xtrace && \
             gzip "${INPUT_FILE}" && \
             mv "${INPUT_FILE}.gz" "$(dirname "${OUTPUT_FILE}")"' \
  --tasks compress_hail_logs.tsv \
  --wait

In [None]:
%%bash

dstat --provider google-cls-v2 --project aou-rw-preprod-acef10ae --location us-central1 \
    --jobs 'set--jupyter-user--210719-221709-68' \
    --users 'jupyter-user' --status '*'   --full

In [None]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/logging/20210719/221708/set--jupyter-user--210719-221709-68.1*"

In [None]:
compressed_logs = !gsutil ls "${WORKSPACE_BUCKET}/hail-logs/*/*.log.gz"

compressed_logs[0:9]

In [None]:
len(logs)

In [None]:
len(compressed_logs)
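
Comparing the two lengths is a good first check, but equal counts do not guarantee that every individual log has a compressed counterpart. The sketch below lists any original log whose `.gz` twin is missing, so the uncompressed files are only removed once that list is empty. The helper name and the bucket paths in the example are hypothetical.

```python
def logs_without_gz(logs, compressed_logs):
    """Return the original log paths that have no matching `.gz` file."""
    compressed = set(compressed_logs)
    return [log for log in logs if f"{log}.gz" not in compressed]


# Hypothetical listings standing in for the two `gsutil ls` results.
example_logs = [
    "gs://example-bucket/hail-logs/a/run1.log",
    "gs://example-bucket/hail-logs/b/run2.log",
]
example_compressed = ["gs://example-bucket/hail-logs/a/run1.log.gz"]
print(logs_without_gz(example_logs, example_compressed))
```

With the variables from this notebook, `logs_without_gz(logs, compressed_logs)` returning an empty list is the condition to look for before deleting anything.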

In [None]:
!gsutil -m rm "${WORKSPACE_BUCKET}/hail-logs/*/*.log"

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze