# Run a GWAS via regenie

In this notebook, we perform a genome-wide association study using [regenie](https://rgcgithub.github.io/regenie/) via [dsub](https://github.com/databiosphere/dsub).

# Set up dsub

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench. It runs fine on the default Cloud Environment. 
</div>

In [None]:
!pip3 install --upgrade dsub

In [None]:
%%bash

gcloud auth list

<div class="alert alert-block alert-warning">
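    <b>Note:</b> (1) Replace the <code>--service-account</code> value in the <code>dsub</code> commands below with your own PET account, as listed by <code>gcloud auth list</code> above. (2) Your PET account must be granted access to run as itself as a service account.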
</div>

# regenie

This work is based on https://github.com/briansha/Regenie_WDL/blob/master/regenie.wdl

See also https://rgcgithub.github.io/regenie/options/#input

## Step 1

From https://rgcgithub.github.io/regenie/overview/:
> In the first step a subset of genetic markers are used to fit a whole genome regression model that captures a good fraction of the phenotype variance attributable to genetic effects.

In [None]:
%%bash

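# TODO: parameters still to add to the dsub call below:
#   - request an 8-core machine
#   - --input VARIANT_EXCLUSION_FILE=TODO(deflaux) \
#   - regenie option: --exclude="${VARIANT_EXCLUSION_FILE}" \
# Note: --bt tells regenie to treat the phenotypes as binary traits. Drop it if
# the *_norm columns are quantitative.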

dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --image "briansha/regenie:v2.0.1_boost" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --input PHENO_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210727/phenotypes.tsv \
  --input BGEN_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210719/ukb-aou-alpha1.bgen \
  --input SAMPLE_FILE=gs://fc-secure-fd6786bf-6c28-4f33-ac30-3860fbeee5bb/data/merged/20210719/ukb-aou-alpha1.sample \
  --output OUT="${WORKSPACE_BUCKET}/dsub/regenie-step1/$(date +'%Y%m%d/%H%M%S')/fit_bin_out*" \
  --command 'set -euo pipefail
        regenie \
        --step 1 \
        --bgen="${BGEN_FILE}" \
        --sample="${SAMPLE_FILE}" \
        --phenoFile="${PHENO_FILE}" \
        --phenoColList=LDL_norm,HDL_norm,TC_norm,TG_norm \
        --covarFile="${PHENO_FILE}" \
        --covarColList=is_male,is_aou_cohort,age,age2,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
        --bt \
        --bsize 1000 \
        --verbose \
        --out "$(dirname "${OUT}")/fit_bin_out" \
    && ls -la "$(dirname "${OUT}")"
  '
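The `dstat` and `gsutil cat` cells below hard-code a job ID. One way to avoid that is to capture the ID at submission time. The following is a minimal sketch, assuming `dsub` writes the job ID to standard output as its README describes. The helper name `submit_and_capture_job_id` is hypothetical, and a stand-in Python command replaces the real `dsub` call so the sketch runs anywhere.

```python
import subprocess
import sys


def submit_and_capture_job_id(argv):
    """Run a submission command and return its last non-empty stdout line."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("submission command printed nothing to stdout")
    return lines[-1]


# Stand-in command (assumption): it only prints a made-up job ID.
# In the notebook, argv would instead be the dsub invocation above,
# split into a list of arguments.
demo_argv = [sys.executable, "-c", "print('set--jupyter-user--000000-000000-00')"]
job_id = submit_and_capture_job_id(demo_argv)
print(job_id)
```

The captured value could then be passed to `dstat --jobs` in place of a pasted ID.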

In [None]:
%%bash

dstat --provider google-cls-v2 --project aou-rw-preprod-acef10ae --location us-central1 \
    --users jupyter-user --status '*' --full \
    --jobs set--jupyter-user--210727-235345-76    

In [None]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/logging/20210727/235345/set--jupyter-user--210727-235345-76.log"

## Step 2

From https://rgcgithub.github.io/regenie/overview/:
> In the second step, a larger set of genetic markers (e.g. imputed markers) are tested for association with the phenotype conditional upon the prediction from the regression model in Step 1, using a leave one chromosome out (LOCO) scheme, that avoids proximal contamination.
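The leave-one-chromosome-out idea can be illustrated with a toy calculation. This is a sketch of the concept only, not how regenie computes or stores its predictions: the function name `loco_predictions`, the dictionary layout, and the numbers are all hypothetical. Given each chromosome's contribution to a per-sample genetic prediction, the LOCO prediction used when testing variants on chromosome `c` is the total minus the contribution of `c`, so a variant's own signal is not part of what it is conditioned on.

```python
def loco_predictions(per_chrom_pred):
    """Return, for each chromosome, the per-sample prediction that omits it.

    per_chrom_pred maps a chromosome label to a list of per-sample
    contributions to the whole-genome prediction (toy layout, an assumption).
    """
    chroms = list(per_chrom_pred)
    n_samples = len(per_chrom_pred[chroms[0]])
    # Whole-genome prediction per sample: sum over all chromosomes.
    total = [sum(per_chrom_pred[c][i] for c in chroms) for i in range(n_samples)]
    # LOCO prediction for chromosome c: the total with c's contribution removed.
    return {
        c: [total[i] - per_chrom_pred[c][i] for i in range(n_samples)]
        for c in chroms
    }


# Toy data: three chromosomes, two samples.
toy = {"1": [0.5, -0.25], "2": [0.25, 0.5], "3": [-0.125, 0.125]}
loco = loco_predictions(toy)
print(loco["2"])  # → [0.375, -0.125]
```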

# Appendix

## Compress Hail logs

In [None]:
logs = !gsutil ls "${WORKSPACE_BUCKET}/hail-logs/*/*.log"

logs

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data={
    '--input INPUT_FILE': logs,
    '--output OUTPUT_FILE': [f'{log}.gz' for log in logs]
})

df.head()

In [None]:
df.to_csv('compress_hail_logs.tsv', sep='\t', index=False)

In [None]:
!head compress_hail_logs.tsv

In [None]:
%%bash

dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --preemptible \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --command 'set -o errexit && \
             set -o xtrace && \
             gzip "${INPUT_FILE}" && \
             mv "${INPUT_FILE}.gz" "$(dirname "${OUTPUT_FILE}")"' \
  --tasks compress_hail_logs.tsv \
  --wait

In [None]:
%%bash

dstat --provider google-cls-v2 --project aou-rw-preprod-acef10ae --location us-central1 \
    --jobs 'set--jupyter-user--210719-221709-68' \
    --users 'jupyter-user' --status '*'   --full

In [None]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/logging/20210719/221708/set--jupyter-user--210719-221709-68.1*"

In [None]:
compressed_logs = !gsutil ls "${WORKSPACE_BUCKET}/hail-logs/*/*.log.gz"

compressed_logs[0:9]

In [None]:
len(logs)

In [None]:
len(compressed_logs)
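Comparing the two lengths shows only that the counts match. Before deleting the originals, it may be safer to confirm that every log has a compressed counterpart. A minimal sketch follows, assuming each compressed file is named by appending `.gz` to the original path, as in the tasks file built above. The helper name `missing_compressed` and the `gs://bucket/...` paths are hypothetical.

```python
def missing_compressed(logs, compressed_logs):
    """Return the expected .gz paths that are absent from compressed_logs."""
    expected = {f"{log}.gz" for log in logs}
    return sorted(expected - set(compressed_logs))


# Hypothetical paths for illustration only.
demo_logs = ["gs://bucket/hail-logs/a/x.log", "gs://bucket/hail-logs/b/y.log"]
demo_compressed = ["gs://bucket/hail-logs/a/x.log.gz"]
print(missing_compressed(demo_logs, demo_compressed))
# → ['gs://bucket/hail-logs/b/y.log.gz']
```

In the notebook, `missing_compressed(logs, compressed_logs)` should return an empty list before the `gsutil -m rm` cell is run.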

In [None]:
!gsutil -m rm "${WORKSPACE_BUCKET}/hail-logs/*/*.log"

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze