# REGENIE for EBV DNA GWAS

This notebook largely follows the demo workspace for running GWAS in All of Us (AoU): https://workbench.researchallofus.org/workspaces/aou-rw-5981f9dc/aouldlgwasregeniedsubctv6duplicate/analysis. Specifically, it is a modified version of the `4.0_regenie_dsub_HP_TM` script, adapted to run REGENIE on our binary EBV DNA trait.

### Environment setup

In [None]:
# Python Package Import
import sys
import os 
import numpy as np
import pandas as pd
from datetime import datetime

In [None]:
# Ensuring dsub is up to date
! pip3 install --upgrade dsub

### Environment Variables setup
These steps set the environment variables referenced by the main dsub script.

In [None]:
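# Save as an environment variable so it's easier to use within %%bash cells
## NOTE: LINE_COUNT_JOB_ID is only defined once a `%%bash --out LINE_COUNT_JOB_ID`
## submission cell (see "Run REGENIE") has run, so run this cell after submitting a job
%env JOB_ID={LINE_COUNT_JOB_ID}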

In [None]:
# Define the necessary paths
my_bucket = os.environ['WORKSPACE_BUCKET']

In [None]:
# Setting for running dsub jobs
pd.set_option('display.max_colwidth', 0)

In [None]:
USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save as an environment variable so it's easier to use within %%bash cells
%env USER_NAME={USER_NAME}

In [None]:
# Modify the JOB_NAME variable for the individual job names 
## NOTE: Use underscores, not whitespace, since it will become part of the bucket path
JOB_NAME='EBV_DNA' ## add name in quotes, copy name in quotes to 4.1

# Save as an environment variable so it's easier to use within %%bash cells
%env JOB_NAME={JOB_NAME}

In [None]:
## Set up analysis results folder
line_count_results_folder = os.path.join(
    os.getenv('WORKSPACE_BUCKET'),
    'dsub',
    'results',
    JOB_NAME,
    USER_NAME,
    datetime.now().strftime('%Y%m%d'))

line_count_results_folder

In [None]:
# Set up path for saving output files
output_files = os.path.join(line_count_results_folder, "results")
print(output_files)

In [None]:
OUTPUT_FILES = output_files

# Save as an environment variable so it's easier to use within %%bash cells
%env OUTPUT_FILES={OUTPUT_FILES}

## Get input files

REGENIE requires input bgen and sample files. Get the filepaths to the datasets of interest listed here and make a copy in a personal gs bucket: https://support.researchallofus.org/hc/en-us/articles/29475233432212-Controlled-CDR-Directory .

In [None]:
# Get bgen and sample files for ACAF threshold callsets
## TODO: replace my_bucket with the actual string
! gsutil -u $GOOGLE_PROJECT -m cp -r gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/ {my_bucket}/data/dsub/

In [None]:
# This should list .bgen, .bgen.bgi, and .sample files for each chromosome
! gsutil ls {my_bucket}/data/dsub/bgen

## Shell script for analysis
This is the shell script that runs REGENIE.
The file inputs (passed to dsub with `--input`):

- `bgen_file`: the path to the bgen file
- `sample_file`: the path to the sample file
- `pheno_file`: the path to the phenotype file
- `cov_file`: the path to the covariate file
- `step1_snplist`: input SNPs for step1
- `step2_snplist`: input SNPs for step2

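The environment string inputs (passed to dsub with `--env`):

- `cat_cov`: categorical covariates (comma separated); hard-coded in the script below as `sex_at_birth`
- `cov_list`: continuous covariates (comma separated); hard-coded in the script below as `age`, `age2`, and `PC1` through `PC15`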
- `phen_col`: phenotype as defined by the phenotype column in the phenotype file (name of trait) (comma separated if more than one)
- `trait`: qt or bt for quantitative or binary trait
- `chrom`: chromosome (automatic from the dsub loop)
- `prefix`: desired file prefix
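
As a rough sketch of how the two groups of inputs reach the script: dsub copies each `--input NAME=path` file to the VM and exposes its local path as `NAME`, while `--env NAME=value` passes a plain string. The helper `build_dsub_args` below is hypothetical (it is not part of dsub or the demo workspace) and only assembles the argument list; `gs://my-bucket` is a placeholder bucket name.

```python
def build_dsub_args(bucket, chrom, phen_col="has_ebv", trait="bt", prefix="EBV_DNA"):
    """Assemble the --input/--env arguments for one chromosome (hypothetical helper)."""
    data = f"{bucket}/data/dsub"
    # File inputs: dsub localizes these and sets each name to the local path
    file_inputs = {
        "bgen_file": f"{data}/bgen/acaf_threshold.chr{chrom}.bgen",
        "sample_file": f"{data}/bgen/acaf_threshold.chr{chrom}.sample",
        "pheno_file": f"{data}/EUR/ebv_EUR_0018.tsv",
        "cov_file": f"{data}/EUR/ebv_EUR_covar.tsv",
        "step1_snplist": f"{data}/AOU_SNPs_EUR/chr{chrom}_snplist.txt",
        "step2_snplist": f"{data}/AOU_SNPs_EUR/chr{chrom}_snplist.txt",
    }
    # Environment strings: passed through to the script unchanged
    env_strings = {
        "chrom": str(chrom),
        "prefix": prefix,
        "trait": trait,
        "phen_col": phen_col,
    }
    args = []
    for name, path in file_inputs.items():
        args += ["--input", f"{name}={path}"]
    for name, value in env_strings.items():
        args += ["--env", f"{name}={value}"]
    return args


example_args = build_dsub_args("gs://my-bucket", 2)
print(" ".join(example_args))
```

Keeping the paths in one place like this makes it easier to spot a snplist naming mismatch before any job is submitted.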

### Potential modifications
Add the `--apply-rint` flag for quantitative traits (i.e., if running EBV DNA without binarizing by our 0.0018 threshold).
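
To illustrate what a rank-based inverse normal transform does to a skewed quantitative trait, here is a minimal sketch using only the standard library. The Blom offset of 3/8 and the average-rank treatment of ties are assumptions made for illustration; REGENIE's own `--apply-rint` implementation may use different constants or tie handling. The input values are made up.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=3 / 8):
    """Rank-based inverse normal transform (offset and tie rule are assumptions)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank across the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Map each rank to a quantile strictly inside (0, 1), then to a z-score
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


ebv_dna = [0.0, 0.0004, 0.0018, 0.0100, 0.2500]  # made-up illustrative values
transformed = rank_inverse_normal(ebv_dna)
print([round(z, 3) for z in transformed])
```

The transform depends only on ranks, so the long right tail of the raw values is mapped to evenly spaced normal quantiles.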

In [None]:
%%writefile ~/Regenie_GWAS_custom.sh

set -o pipefail 
set -o errexit

# step1
regenie \
    --step 1 \
    --bgen "${bgen_file}" \
    --sample  "${sample_file}" \
    --phenoFile "${pheno_file}" \
    --phenoColList "${phen_col}" \
    --covarFile "${cov_file}" \
    --catCovarList sex_at_birth \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15 \
    --bsize 1000 \
    --extract "${step1_snplist}" \
    --verbose \
    --"${trait}" \
    --ref-first \
    --out "${prefix}"_step1_chr"${chrom}"

# step2
regenie \
    --step 2 \
    --bgen "${bgen_file}" \
    --sample  "${sample_file}" \
    --phenoFile "${pheno_file}" \
    --phenoColList "${phen_col}" \
    --covarFile "${cov_file}" \
    --catCovarList sex_at_birth \
    --covarColList age,age2,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15 \
    --pred "${prefix}"_step1_chr"${chrom}"_pred.list \
    --extract "${step2_snplist}" \
    --bsize 400 \
    --verbose \
    --"${trait}" \
    --ref-first \
    --out "${prefix}"_step2_chr"${chrom}"

export regenie_results="${prefix}_step2_chr${chrom}_${phen_col}.regenie"
echo "regenie_results: ${regenie_results}"
mv "${regenie_results}" "${OUTPUT_PATH}"

## Copy scripts and inputs into gs bucket
Input files must be in a Google Cloud Storage (`gs://`) bucket so that dsub can resolve their paths.

In [None]:
# Copy files to personal gs bucket
! gsutil cp /home/jupyter/Regenie_GWAS_custom.sh {my_bucket}/data/dsub/
! gsutil -m cp -r /home/jupyter/workspaces/ebvgwas/AOU_SNPs_EUR {my_bucket}/data/dsub
! gsutil -m cp -r /home/jupyter/workspaces/ebvgwas/EBV_GWAS_data/EUR {my_bucket}/data/dsub
# Check files are in bucket
## NOTE: replace {my_bucket} with the actual string
! gsutil ls {my_bucket}/data/dsub

## Run REGENIE

This script submits a job for each chromosome in the for loop.

Note that the `--disk-size 220` flag only needs to be set when running chr2, as its bgen file is 197 GB and the default of 200 runs out of space. 
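
One way to avoid paying for the larger disk on every chromosome is to choose the size per job. The sketch below is hypothetical: `disk_size_gb` is not part of dsub, and the two sizes are simply the values quoted in the note above.

```python
DEFAULT_DISK_GB = 200  # default data-disk size quoted in the note above
LARGE_DISK_GB = 220    # size this notebook uses so the 197 GB chr2 bgen fits


def disk_size_gb(chrom, large_chroms=(2,)):
    """Return the --disk-size value to request for a chromosome (hypothetical helper)."""
    return LARGE_DISK_GB if chrom in large_chroms else DEFAULT_DISK_GB


for chrom in (1, 2, 3):
    print(f"chr{chrom}: --disk-size {disk_size_gb(chrom)}")
```

If other chromosomes turn out to need more space, they can be added to `large_chroms` without touching the rest of the submission logic.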

In [None]:
%%bash --out LINE_COUNT_JOB_ID

# Get a shorter username to leave more characters for the job name
DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

# For AoU RWB projects network name is "network"
AOU_NETWORK=network
AOU_SUBNETWORK=subnetwork

## TODO: replace {my_bucket} with the actual string
MACHINE_TYPE="n2-standard-4"
BASH_SCRIPT="{my_bucket}/data/dsub/Regenie_GWAS_custom.sh"

# Set the chromosomes of interest
## TODO: replace {my_bucket} with the actual string
## TODO: make sure the snplists naming format is correct (for example, when running chr2_snplist_2.txt)
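# NOTE: UPPER is exclusive (the loop tests chromo<UPPER), so these values submit
# chr1 through chr20; set UPPER=23 to cover all 22 autosomes
LOWER=1
UPPER=21
for ((chromo=$LOWER;chromo<$UPPER;chromo+=1))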
do
    # Print all relevant variables
    echo "GOOGLE_PROJECT: ${GOOGLE_PROJECT}"
    echo "AOU_NETWORK: ${AOU_NETWORK}"
    echo "AOU_SUBNETWORK: ${AOU_SUBNETWORK}"
    echo "DSUB_USER_NAME: ${DSUB_USER_NAME}"
    echo "MACHINE_TYPE: ${MACHINE_TYPE}"
    echo "BASH_SCRIPT: ${BASH_SCRIPT}"
    echo "chromo: ${chromo}"
    echo "WORKSPACE_BUCKET: ${WORKSPACE_BUCKET}"
    echo "JOB_NAME: ${JOB_NAME}"
    echo "OUTPUT_FILES: ${OUTPUT_FILES}"
    echo "bgen_file: {my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.bgen"
    echo "sample_file: {my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.sample"
    echo "pheno_file: {my_bucket}/data/dsub/EUR/ebv_EUR_0018.tsv"
    echo "cov_file: {my_bucket}/data/dsub/EUR/ebv_EUR_covar.tsv"
    echo "step1_snplist: {my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt"
    echo "step2_snplist: {my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt"
    # These three are passed to dsub as literal --env values below, not shell variables
    echo "phen_col: has_ebv"
    echo "prefix: EBV_DNA"
    echo "trait: bt"

    # Now run the dsub command
    dsub \
    --provider google-cls-v2 \
    --user-project "${GOOGLE_PROJECT}" \
    --project "${GOOGLE_PROJECT}" \
    --image "gcr.io/bick-aps2/ghcr.io/rgcgithub/regenie/regenie:v3.2.4.gz" \
    --network "${AOU_NETWORK}" \
    --subnetwork "${AOU_SUBNETWORK}" \
    --service-account "$(gcloud config get-value account)" \
    --user "${DSUB_USER_NAME}" \
    --regions us-central1 \
    --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
    "$@" \
    --preemptible \
    --boot-disk-size 1000 \
    --disk-size 220 \
    --machine-type ${MACHINE_TYPE} \
    --name "${JOB_NAME}" \
    --script "${BASH_SCRIPT}" \
    --env GOOGLE_PROJECT=${GOOGLE_PROJECT} \
    --input bgen_file="{my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.bgen" \
    --input sample_file="{my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.sample" \
    --input pheno_file="{my_bucket}/data/dsub/EUR/ebv_EUR_0018.tsv" \
    --input cov_file="{my_bucket}/data/dsub/EUR/ebv_EUR_covar.tsv" \
    --input step1_snplist="{my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt" \
    --input step2_snplist="{my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt" \
    --env chrom=${chromo} \
    --env prefix=EBV_DNA \
    --env trait=bt \
    --env phen_col=has_ebv \
    --output-recursive OUTPUT_PATH="${OUTPUT_FILES}/${chromo}"
done

Jobs are sometimes terminated, occasionally for no apparent reason (with `--preemptible`, the VM can also be reclaimed by Google Cloud at any time). In other cases we do not want to run a contiguous range of chromosomes. Either way, specific chromosomes can be listed explicitly:

In [None]:
%%bash --out LINE_COUNT_JOB_ID

# Get a shorter username to leave more characters for the job name
DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

# For AoU RWB projects network name is "network"
AOU_NETWORK=network
AOU_SUBNETWORK=subnetwork

## TODO: replace {my_bucket} with the actual string
MACHINE_TYPE="n2-standard-4"
BASH_SCRIPT="{my_bucket}/data/dsub/Regenie_GWAS_custom.sh"

# Set the chromosomes of interest
## TODO: replace {my_bucket} with the actual string
## TODO: make sure the snplists naming format is correct (for example, when running chr2_snplist_2.txt)
for chromo in 2
do
    # Print all relevant variables
    echo "GOOGLE_PROJECT: ${GOOGLE_PROJECT}"
    echo "AOU_NETWORK: ${AOU_NETWORK}"
    echo "AOU_SUBNETWORK: ${AOU_SUBNETWORK}"
    echo "DSUB_USER_NAME: ${DSUB_USER_NAME}"
    echo "MACHINE_TYPE: ${MACHINE_TYPE}"
    echo "BASH_SCRIPT: ${BASH_SCRIPT}"
    echo "chromo: ${chromo}"
    echo "WORKSPACE_BUCKET: ${WORKSPACE_BUCKET}"
    echo "JOB_NAME: ${JOB_NAME}"
    echo "OUTPUT_FILES: ${OUTPUT_FILES}"
    echo "bgen_file: {my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.bgen"
    echo "sample_file: {my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.sample"
    echo "pheno_file: {my_bucket}/data/dsub/EUR/ebv_EUR_0018.tsv"
    echo "cov_file: {my_bucket}/data/dsub/EUR/ebv_EUR_covar.tsv"
    echo "step1_snplist: {my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt"
    echo "step2_snplist: {my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt"
    # These three are passed to dsub as literal --env values below, not shell variables
    echo "phen_col: has_ebv"
    echo "prefix: EBV_DNA"
    echo "trait: bt"

    # Now run the dsub command
    dsub \
    --provider google-cls-v2 \
    --user-project "${GOOGLE_PROJECT}" \
    --project "${GOOGLE_PROJECT}" \
    --image "gcr.io/bick-aps2/ghcr.io/rgcgithub/regenie/regenie:v3.2.4.gz" \
    --network "${AOU_NETWORK}" \
    --subnetwork "${AOU_SUBNETWORK}" \
    --service-account "$(gcloud config get-value account)" \
    --user "${DSUB_USER_NAME}" \
    --regions us-central1 \
    --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
    "$@" \
    --preemptible \
    --boot-disk-size 1000 \
    --disk-size 220 \
    --machine-type ${MACHINE_TYPE} \
    --name "${JOB_NAME}" \
    --script "${BASH_SCRIPT}" \
    --env GOOGLE_PROJECT=${GOOGLE_PROJECT} \
    --input bgen_file="{my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.bgen" \
    --input sample_file="{my_bucket}/data/dsub/bgen/acaf_threshold.chr${chromo}.sample" \
    --input pheno_file="{my_bucket}/data/dsub/EUR/ebv_EUR_0018.tsv" \
    --input cov_file="{my_bucket}/data/dsub/EUR/ebv_EUR_covar.tsv" \
    --input step1_snplist="{my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt" \
    --input step2_snplist="{my_bucket}/data/dsub/AOU_SNPs_EUR/chr${chromo}_snplist.txt" \
    --env chrom=${chromo} \
    --env prefix=EBV_DNA \
    --env trait=bt \
    --env phen_col=has_ebv \
    --output-recursive OUTPUT_PATH="${OUTPUT_FILES}/${chromo}"
done

Running this dsub command prints output like:
```text
Job properties:
  job-id: ebv-dna--snyeo--250415-142725-88
  job-name: ebv-dna
  user-id: snyeo
Provider internal-id (operation): projects/681565494320/locations/us-central1/operations/12640613304380835749
Launched job-id: ebv-dna--snyeo--250415-142725-88
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-47bdfd92 --location us-central1 --jobs 'ebv-dna--snyeo--250415-142725-88' --users 'snyeo' --status '*'
To cancel the job, run:
  ddel --provider google-cls-v2 --project terra-vpc-sc-47bdfd92 --location us-central1 --jobs 'ebv-dna--snyeo--250415-142725-88' --users 'snyeo'
```
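
The job-id is needed for the `dstat` and `ddel` commands, so it can be convenient to pull it out of the printed text rather than copying it by hand. This is a minimal sketch that assumes the text above has already been captured into a Python string; `parse_job_id` is a hypothetical helper, and the pattern relies only on the `Launched job-id:` line shown in the example.

```python
import re


def parse_job_id(dsub_output):
    """Return the job-id from dsub's 'Launched job-id:' line, or None if absent."""
    match = re.search(r"^Launched job-id:\s*(\S+)", dsub_output, flags=re.MULTILINE)
    return match.group(1) if match else None


# Excerpt of the example output shown above
sample_output = """Job properties:
  job-id: ebv-dna--snyeo--250415-142725-88
  job-name: ebv-dna
  user-id: snyeo
Launched job-id: ebv-dna--snyeo--250415-142725-88
"""

job_id = parse_job_id(sample_output)
print(job_id)
```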

## Monitor dsub job progress

In [None]:
# Summary of current job(s) status
## TODO: replace jobs with the job-id printed above 
## TODO: replace users with the user-id
! dstat \
        --provider google-cls-v2 \
        --project terra-vpc-sc-47bdfd92 \
        --location us-central1 \
        --jobs 'ebv-dna--snyeo--250416-125527-75' \
        --users 'snyeo' \
        --status '*'

This prints something like:
```text
Job Name    Status    Last Update
----------  --------  --------------
ebv-dna     Success   04-16 18:05:06
```
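
When many chromosomes are submitted, scanning this table by eye gets tedious. The sketch below parses a table in the layout shown above and lists jobs whose status is not `Success`. It assumes columns are separated by at least two spaces, which holds for the example but may not hold for every dstat status message; `parse_dstat_table` is a hypothetical helper.

```python
import re


def parse_dstat_table(text):
    """Parse a dstat summary table into a list of row dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[2:]:  # skip the header and the dashed rule beneath it
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(dict(zip(header, cells)))
    return rows


# The example table shown above
sample_table = """Job Name    Status    Last Update
----------  --------  --------------
ebv-dna     Success   04-16 18:05:06
"""

jobs = parse_dstat_table(sample_table)
unfinished = [job["Job Name"] for job in jobs if job["Status"] != "Success"]
print(jobs)
print("Jobs not yet successful:", unfinished)
```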

In [None]:
# Full summary of current job(s) status
! dstat \
        --provider google-cls-v2 \
        --project terra-vpc-sc-47bdfd92 \
        --location us-central1 \
        --jobs 'ebv-dna--snyeo--250416-125527-75' \
        --users 'snyeo' \
        --status '*' \
        --full

This prints something like:
```text
- create-time: '2025-04-16 12:55:27.911982'
  dsub-version: v0-5-0
  end-time: ''
  envs:
    GOOGLE_PROJECT: terra-vpc-sc-47bdfd92
    chrom: '2'
    phen_col: has_ebv
    prefix: EBV_DNA
    trait: bt
  events:
  - name: start
    start-time: 2025-04-16 12:55:42.756543+00:00
  - name: pulling-image
    start-time: 2025-04-16 12:56:24.834398+00:00
  - name: localizing-files
    start-time: 2025-04-16 12:56:37.460297+00:00
  - name: running-docker
    start-time: 2025-04-16 13:48:37.851483+00:00
  input-recursives: {}
  inputs:
  ...
```
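
The event timestamps show where a job spends its time. As a small sketch, the code below computes how long each stage lasted from the four events in the example above; `stage_durations` is a hypothetical helper. In this run, localizing the input files took about 52 minutes before REGENIE started.

```python
from datetime import datetime

# Event names and start times copied from the example output above
events = [
    ("start", "2025-04-16 12:55:42.756543+00:00"),
    ("pulling-image", "2025-04-16 12:56:24.834398+00:00"),
    ("localizing-files", "2025-04-16 12:56:37.460297+00:00"),
    ("running-docker", "2025-04-16 13:48:37.851483+00:00"),
]


def stage_durations(event_list):
    """Seconds spent in each stage, measured up to the start of the next stage."""
    times = [(name, datetime.fromisoformat(stamp)) for name, stamp in event_list]
    return {
        name: (next_start - start).total_seconds()
        for (name, start), (_, next_start) in zip(times, times[1:])
    }


durations = stage_durations(events)
for name, seconds in durations.items():
    print(f"{name}: {seconds / 60:.1f} min")
```

The last event has no duration because the job was still running when the summary was printed.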

Check the log file output:

In [None]:
# print last five lines
## TODO: replace log file filepath with what's printed in the full summary
! gsutil cat gs://fc-secure-44d65cda-9cdd-49bc-b829-a681f3123cfa/dsub/logs/ebv-dna/snyeo/20250416/125526/ebv-dna--snyeo--250416-125527-75-task-None.log | tail -n 5

## Save results

Copy over the log file(s) and the .regenie output file(s) to the local workspace.

In [None]:
# copy log file
## TODO: replace log file filepath with what's printed in the full summary
! gsutil cp gs://fc-secure-44d65cda-9cdd-49bc-b829-a681f3123cfa/dsub/logs/ebv-dna/snyeo/20250416/125526/ebv-dna--snyeo--250416-125527-75-task-None.log .

In [None]:
# copy REGENIE output file (will be listed in the log file)
## TODO: replace the results filepath with your own OUTPUT_FILES path and chromosome folder
! gsutil cp gs://fc-secure-44d65cda-9cdd-49bc-b829-a681f3123cfa/dsub/results/EBV_DNA/snyeo/20250412/results/1/* .

Zip REGENIE files to save space: `gzip *.regenie`
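
The same compression can be done from a Python cell. This is a sketch under the assumption that the `.regenie` files sit in one folder; `gzip_regenie_files` is a hypothetical helper, and the demonstration runs on a throwaway directory with a placeholder file rather than real results.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_regenie_files(folder="."):
    """Compress every *.regenie file in `folder` and remove the originals."""
    compressed = []
    for path in sorted(Path(folder).glob("*.regenie")):
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # mirror the gzip CLI, which replaces the original file
        compressed.append(target)
    return compressed


# Demonstration on a temporary directory with a placeholder file
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "EBV_DNA_step2_chr1_has_ebv.regenie"
    demo.write_text("placeholder contents\n")
    created = [p.name for p in gzip_regenie_files(tmp)]
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    with gzip.open(Path(tmp) / created[0], "rt") as handle:
        round_trip = handle.read()

print(created)
```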