**AUTHOR:** <br>
Vasilis Raptis

**DATE:** <br>
09.07.2024 

**PURPOSE:** <br>
This notebook:
- Runs regenie step 2 on the acaf srWGS data
- Uses dsub to submit the jobs

**NOTES:** <br>
- ***Run the 00_GWAS_pipeline_00_dsub_setup.ipynb notebook first***
- Uses the for_bgen files created in 03_part2b_run_regenie_step2_clinvar.ipynb
- Uses the raw acaf .bgen files; the regenie summary statistics will be filtered later
- Uses code from the "How to use dsub in the Researcher Workbench (v7)" featured workspace: https://workbench.researchallofus.org/workspaces/aou-rw-6221d5ec/howtousedsubintheresearcherworkbenchv7/analysis

**Setup:**

In [2]:
## Python Package Import
import sys
import os 
import numpy as np
import pandas as pd
from datetime import datetime

In [3]:
##Ensuring dsub is up to date
!pip3 install --upgrade dsub

In [4]:
my_bucket = os.getenv('WORKSPACE_BUCKET')
my_bucket

'gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f'

In [5]:
USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save this Python variable as an environment variable so that it's easier to use within %%bash cells.
%env USER_NAME={USER_NAME}

env: USER_NAME=vraptis


In [6]:
## Setting for running dsub jobs
pd.set_option('display.max_colwidth', 0)

**Check WGS files in google bucket:**

In [6]:
!gsutil -u $GOOGLE_PROJECT du -h gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/*bgen
# 2.52 TiB total 

190.75 GiB   gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr1.bgen
124.66 GiB   gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr10.bgen
107.56 GiB   gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr11.bgen
115.94 GiB   gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr12.bgen
79.68 GiB    gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr13.bgen
75.73 GiB    gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr14.bgen
76.21 GiB    gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr15.bgen
88.61 GiB    gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr16.bgen
87.28 GiB

**Check files in bucket:**

In [7]:
# check in bucket
!gsutil ls {my_bucket}/data/files_for_bgen_step2_all/

gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/afr_pheno_clean_for_bgen.txt
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/amr_pheno_clean_for_bgen.txt
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/amr_pheno_for_bgen.txt
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/arrays_qc_afr_clean_for_bgen.id
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/arrays_qc_amr_clean_for_bgen.id
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/arrays_qc_eur_clean_for_bgen.id
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/del_afr_clean_step1_1.loco
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/del_afr_clean_step1_for_bucket_pred.list
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/del_amr_c

In [8]:
## Write updated *pred.list files so the path to each *loco file matches the bucket folder as seen inside the dsub job (paths copied from the listing above)
# space-delimited

## eur
!echo delirium_status /mnt/data/input/gs/fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/del_eur_clean_step1_1.loco > del_eur_clean_step1_for_bucket_pred.list
!gsutil cp del_eur_clean_step1_for_bucket_pred.list {my_bucket}/data/files_for_bgen_step2_all/
## afr
!echo delirium_status /mnt/data/input/gs/fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/del_afr_clean_step1_1.loco > del_afr_clean_step1_for_bucket_pred.list
!gsutil cp del_afr_clean_step1_for_bucket_pred.list {my_bucket}/data/files_for_bgen_step2_all/
## amr
!echo delirium_status /mnt/data/input/gs/fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all/del_amr_clean_step1_1.loco > del_amr_clean_step1_for_bucket_pred.list
!gsutil cp del_amr_clean_step1_for_bucket_pred.list {my_bucket}/data/files_for_bgen_step2_all/

## clean in workspace
!rm del_*_clean_step1_for_bucket_pred.list

Copying file://del_eur_clean_step1_for_bucket_pred.list [Content-Type=application/octet-stream]...
/ [1 files][  139.0 B/  139.0 B]                                                
Operation completed over 1 objects/139.0 B.                                      
Copying file://del_afr_clean_step1_for_bucket_pred.list [Content-Type=application/octet-stream]...
/ [1 files][  139.0 B/  139.0 B]                                                
Operation completed over 1 objects/139.0 B.                                      
Copying file://del_amr_clean_step1_for_bucket_pred.list [Content-Type=application/octet-stream]...
/ [1 files][  139.0 B/  139.0 B]                                                
Operation completed over 1 objects/139.0 B.                                      
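
The pred.list files written above have to point at the path where the .loco file lands inside the dsub job, not at its gs:// URL. In the job logs in this notebook, an input at `gs://<bucket>/<path>` shows up under `/mnt/data/input/gs/<bucket>/<path>`. The following is a minimal sketch of that mapping, assuming the mount prefix is the same for every job; the helper names `localized_path` and `pred_list_line` are hypothetical and not part of dsub or regenie.

```python
def localized_path(gs_url, mount_root="/mnt/data/input/gs"):
    """Map a gs:// URL to the path it is assumed to have inside the dsub job."""
    prefix = "gs://"
    if not gs_url.startswith(prefix):
        raise ValueError(f"not a gs:// URL: {gs_url}")
    return f"{mount_root}/{gs_url[len(prefix):]}"


def pred_list_line(phenotype, loco_gs_url):
    """Build one space-delimited pred.list line: <phenotype> <loco path>."""
    return f"{phenotype} {localized_path(loco_gs_url)}"


# Bucket and file names as used in this notebook.
bucket = "gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f"
for ancestry in ("eur", "afr", "amr"):
    loco = f"{bucket}/data/files_for_bgen_step2_all/del_{ancestry}_clean_step1_1.loco"
    print(pred_list_line("delirium_status", loco))
```

Each printed line could be written to the matching `del_<ancestry>_clean_step1_for_bucket_pred.list` file instead of being typed by hand in the `echo` commands.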


**Download & test latest regenie version:**

See x_dsub_tut 6.dsub_use_case_regenie.ipynb. Find latest regenie version here: https://github.com/rgcgithub/regenie/releases/tag/v3.3

In [53]:
## get zip file 
# !wget https://github.com/rgcgithub/regenie/releases/download/v3.3/regenie_v3.3.gz_x86_64_Linux.zip
## unzip and rename
# !unzip regenie_v3.3.gz_x86_64_Linux.zip
# !mv regenie_v3.3.gz_x86_64_Linux regenie_new

In [55]:
# !gsutil cp regenie_new {my_bucket}/data/regenie/
# !gsutil ls {my_bucket}/data/regenie/regenie_new

Copying file://regenie_new [Content-Type=application/octet-stream]...
- [1 files][ 11.2 MiB/ 11.2 MiB]                                                
Operation completed over 1 objects/11.2 MiB.                                     
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/regenie/regenie_new


In [56]:
# %%writefile regenie_new_test.sh
# #!/bin/bash
# set -o errexit
# set -o nounset
# cp "${input_file}" regenie_new
# chmod +x regenie_new
# # replace the line below with your own command
# ./regenie_new --help > "${OUTPUT_FILE}"


Writing regenie_new_test.sh


In [57]:
# !gsutil cp regenie_new_test.sh {my_bucket}/dsub/scripts/

Copying file://regenie_new_test.sh [Content-Type=text/x-sh]...
/ [1 files][  181.0 B/  181.0 B]                                                
Operation completed over 1 objects/181.0 B.                                      


In [62]:
# %%bash --out test_ID

# source ~/aou_dsub.bash 
# aou_dsub \
#   --image us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:2.2.4 \
#   --disk-size 512 \
#   --boot-disk-size 100 \
#   --logging "${WORKSPACE_BUCKET}/data/logging" \
#   --input input_file="${WORKSPACE_BUCKET}/data/regenie/regenie_new" \
#   --output OUTPUT_FILE="${WORKSPACE_BUCKET}/data/regenie/regenie_new_sh_test.txt" \
#   --script "${WORKSPACE_BUCKET}/dsub/scripts/regenie_new_test.sh"

Job properties:
  job-id: regenie-ne--vraptis--240612-100003-45
  job-name: regenie-new-test
  user-id: vraptis
Provider internal-id (operation): projects/143618326083/locations/us-central1/operations/6283796120732617686
Launched job-id: regenie-ne--vraptis--240612-100003-45
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-47152b29 --location us-central1 --jobs 'regenie-ne--vraptis--240612-100003-45' --users 'vraptis' --status '*'
To cancel the job, run:
  ddel --provider google-cls-v2 --project terra-vpc-sc-47152b29 --location us-central1 --jobs 'regenie-ne--vraptis--240612-100003-45' --users 'vraptis'


In [78]:
# %%bash
# dstat --provider google-cls-v2 --project terra-vpc-sc-47152b29 --location us-central1 --jobs 'regenie-ne--vraptis--240612-100003-45' --users 'vraptis' --status '*'


Job Name         Status    Last Update
---------------  --------  --------------
regenie-new-...  Success   06-12 10:21:44



In [79]:
## check output 
# !gsutil cat ${WORKSPACE_BUCKET}/data/regenie/regenie_new_sh_test.txt

              |      REGENIE v3.3.gz      |

Copyright (c) 2020-2023 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.


Usage:
  ./regenie_new [OPTION...]

  -h, --help      print list of available options
      --helpFull  print list of all available options

 Main options:
      --step INT                specify if fitting null model (=1) or 
                                association testing (=2)
      --bed PREFIX              prefix to PLINK .bed/.bim/.fam files
      --pgen PREFIX             prefix to PLINK2 .pgen/.pvar/.psam files
      --bgen FILE               BGEN file
      --sample FILE             sample file corresponding to BGEN file
      --bgi FILE                index bgi file corresponding to BGEN file
      --ref-first               use the first allele as the reference for 
                                BGEN or PLINK bed/bim/fam input format 
               

**Write the regenie script:** 

In [9]:
%%writefile step2.sh
#!/bin/bash

set -o errexit
set -o nounset

### update: the following lines run the latest regenie version (v3.3)
cp "${regenie_new}" regenie_new
chmod +x regenie_new
### the --af-cc option only available with regenie_new

#regenie --step 2 \
./regenie_new --step 2 \
        --bgen "${bgen_file}" \
        --ref-first \
        --sample  "${sample_file}" \
        --covarFile "${pheno_file}" \
        --covarColList age,PC{1:10} \
        --catCovarList sex \
        --phenoFile "${pheno_file}" \
        --phenoCol "${phen_col}" \
        --keep "${keep_file}" \
        --af-cc \
        --bsize 400 \
        --bt \
        --firth --approx --pThresh 0.01 --firth-se \
        --pred "${pred_file}" \
        --out "${out_path}/chr${chromo}"



Writing step2.sh
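
The script above takes its covariates from `--covarColList age,PC{1:10}` and `--catCovarList sex`, so the phenotype file needs columns `age`, `PC1` to `PC10`, `sex` and the phenotype column. A rough pre-flight check could expand the brace range and compare it with the file header before any job is submitted. This is only a sketch: `expand_col_list` and `missing_columns` are hypothetical helpers, and the header string below is an invented example rather than the real file.

```python
import re


def expand_col_list(spec):
    """Expand a regenie-style column list such as 'age,PC{1:10}' into names."""
    cols = []
    for item in spec.split(","):
        match = re.fullmatch(r"(.*)\{(\d+):(\d+)\}(.*)", item)
        if match:
            head, tail = match.group(1), match.group(4)
            low, high = int(match.group(2)), int(match.group(3))
            # The range is inclusive on both ends: PC{1:10} covers PC1..PC10.
            cols.extend(f"{head}{i}{tail}" for i in range(low, high + 1))
        else:
            cols.append(item)
    return cols


def missing_columns(header_line, required):
    """Return the required column names that are absent from a header line."""
    present = set(header_line.split())
    return [col for col in required if col not in present]


required = expand_col_list("age,PC{1:10}") + ["sex", "delirium_status"]
# Invented header for illustration; read the first line of the real file instead.
example_header = "FID IID delirium_status age sex PC1 PC2 PC3"
print(missing_columns(example_header, required))
```

An empty list means every column the script asks for is present.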


In [10]:
# copy script to bucket
!gsutil cp step2.sh {my_bucket}/dsub/scripts/step2.sh

Copying file://step2.sh [Content-Type=text/x-sh]...
/ [1 files][  753.0 B/  753.0 B]                                                
Operation completed over 1 objects/753.0 B.                                      


In [11]:
!gsutil ls {my_bucket}/dsub/scripts/

gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/dsub/scripts/regenie_new_test.sh
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/dsub/scripts/regenie_test.sh
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/dsub/scripts/step2.sh


**Run the dsub command:**

In [36]:
%%bash --out step2_ID

source ~/aou_dsub.bash # This file is created by the dsub setup notebook (see NOTES at the top).

# give name
JOB_NAME="step2"

# Change for your bucket, path in output of cell directly above:
BASH_SCRIPT="gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/dsub/scripts/step2.sh"

# Change for eur, afr or amr analysis
ANCESTRY="afr"
# Change for input folder in my bucket
INPUT_FOLDER="gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_all"

MACHINE_TYPE="n2-standard-4"

# The loop below stops before UPPER: the upper bound is exclusive
# To run the regression across chromosomes 1-22, set LOWER to 1 and UPPER to 23
# To run a single chromosome, set LOWER to the chromosome of interest and UPPER to the next integer

LOWER=22
UPPER=23
for ((chromo=$LOWER;chromo<$UPPER;chromo+=1))
#for chromo in "X"
do
    aou_dsub \
    --name "${JOB_NAME}_${ANCESTRY}_${chromo}" \
    --after step2--vraptis--240612-195412-18 \
    --image us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:2.2.4 \
    --disk-size 512 \
    --boot-disk-size 1000 \
    --logging "${WORKSPACE_BUCKET}/data/logging" \
    --machine-type ${MACHINE_TYPE} \
    --input regenie_new="${WORKSPACE_BUCKET}/data/regenie/regenie_new" \
    --input bgen_file="gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr${chromo}.bgen" \
    --input sample_file="gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr${chromo}.sample" \
    --input pheno_file="${INPUT_FOLDER}/${ANCESTRY}_pheno_clean_for_bgen.txt" \
    --input keep_file="${INPUT_FOLDER}/arrays_qc_${ANCESTRY}_clean_for_bgen.id" \
    --input pred_file="${INPUT_FOLDER}/del_${ANCESTRY}_clean_step1_for_bucket_pred.list" \
    --input loco_file="${INPUT_FOLDER}/del_${ANCESTRY}_clean_step1_1.loco" \
    --env chromo=${chromo} \
    --env phen_col=delirium_status \
    --env ancestry=${ANCESTRY} \
    --script "${BASH_SCRIPT}" \
    --output-recursive out_path="${WORKSPACE_BUCKET}/data/regenie/step2_acaf/all/${ANCESTRY}" 

#      --input-recursive input_path="${WORKSPACE_BUCKET}/data/examples" \

done


Job properties:
  job-id: step2-afr---vraptis--240709-191526-61
  job-name: step2-afr-22
  user-id: vraptis
Waiting for predecessor jobs to complete...
Waiting for: step2--vraptis--240612-195412-18.
  step2--vraptis--240612-195412-18: SUCCESS
Provider internal-id (operation): projects/143618326083/locations/us-central1/operations/17833311607551591597
Launched job-id: step2-afr---vraptis--240709-191526-61
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-47152b29 --location us-central1 --jobs 'step2-afr---vraptis--240709-191526-61' --users 'vraptis' --status '*'
To cancel the job, run:
  ddel --provider google-cls-v2 --project terra-vpc-sc-47152b29 --location us-central1 --jobs 'step2-afr---vraptis--240709-191526-61' --users 'vraptis'
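
The loop above submits one job per chromosome, with `UPPER` excluded, and chromosome X is handled by swapping the `for` line. As an alternative sketch, the per-chromosome values could be generated in Python and written to a tab-separated file of the kind dsub accepts through its `--tasks` option, where the header names each `--env` or `--input` variable. Whether the `aou_dsub` wrapper passes `--tasks` through unchanged is an assumption that has not been tested here, and `chromosome_labels` and `tasks_tsv` are hypothetical helper names.

```python
import csv
import io

BGEN_DIR = ("gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/"
            "acaf_threshold_v7.1/bgen")


def chromosome_labels(lower, upper, include_x=False):
    """Labels for chromosomes lower..upper-1 (upper is exclusive), plus optional X."""
    labels = [str(c) for c in range(lower, upper)]
    if include_x:
        labels.append("X")
    return labels


def tasks_tsv(labels, bgen_dir=BGEN_DIR):
    """Render a TSV with one row per chromosome and dsub-style column headers."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["--env chromo", "--input bgen_file", "--input sample_file"])
    for label in labels:
        stem = f"{bgen_dir}/acaf_threshold.chr{label}"
        writer.writerow([label, f"{stem}.bgen", f"{stem}.sample"])
    return buf.getvalue()


print(tasks_tsv(chromosome_labels(22, 23, include_x=True)))
```

The inputs that do not change per chromosome (phenotype, keep, pred and loco files) would stay as ordinary `--input` flags on the command line.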


In [13]:
# Save this Python variable value as an environment variable so that it's easier to use within %%bash cells.
%env JOB_ID={step2_ID}
#%env JOB_ID='step2-acaf--vraptis--240611-200604-06'
#%env USER_NAME='vraptis'

env: JOB_ID=step2-eur--vraptis--240709-133910-35


In [34]:
%%bash

#dstat -h

**Monitor dsub jobs:**

In [8]:
%%bash


## check status. The same command is given in the job submission output above
dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --users "${USER_NAME}" \
    --status 'RUNNING' 'SUCCESS' 'FAILURE' \
#    --name step2_afr_d{} \
#    --jobs "${JOB_ID}" \
#    --full \
#    | grep 'logging'

Job Name         Status                                      Last Update       Task
---------------  ------------------------------------------  --------------  ------
step2-afr-22     Success                                     07-09 20:52:35
step2-amr-22     Success                                     07-09 20:45:07
step2-amr-21     Success                                     07-09 20:37:12
step2-amr-20     Success                                     07-09 21:10:16
step2-amr-19     Success                                     07-09 21:28:58
step2-amr-18     Success                                     07-09 21:16:45
step2-amr-17     Success                                     07-09 21:45:35
step2-amr-16     Success                                     07-09 22:00:54
step2-amr-15     Success                                     07-09 21:32:38
step2-amr-14     Success                                     07-09 21:31:01
step2-amr-13     Success                                     07-09 21:51

step2            Success                                     06-12 21:46:19
step2            Success                                     06-12 22:45:20
step2            Success                                     06-12 23:16:54
step2            Success                                     06-12 23:05:14
step2            Success                                     06-12 22:45:40
step2            Success                                     06-12 23:31:16
step2            Success                                     06-12 23:15:31
step2            Success                                     06-12 23:21:48
step2            Success                                     06-13 00:10:38
step2            Success                                     06-12 17:17:28
step2            Success                                     06-12 16:10:30
step2            Success                                     06-12 16:57:20
step2            Success                                     06-12 15:02:50
step2       
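
With one job per chromosome and ancestry, the dstat table gets long. A small sketch for tallying it by status follows, assuming the plain-text layout shown above (a dashed separator line, then columns separated by two or more spaces). `status_counts` is a hypothetical helper, and the sample text is copied from the first rows of the output above.

```python
import re
from collections import Counter


def status_counts(dstat_text):
    """Count jobs per status in dstat's plain-text table output."""
    counts = Counter()
    in_table = False
    for line in dstat_text.splitlines():
        if line.startswith("---"):
            # Rows start after the dashed separator under the header.
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


sample = """Job Name         Status    Last Update
---------------  --------  --------------
step2-afr-22     Success   07-09 20:52:35
step2-amr-22     Success   07-09 20:45:07
"""
print(status_counts(sample))
```

In the notebook, the text could come from capturing the dstat cell with `%%bash --out`, as is already done for the job ID.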

In [67]:
## check log file (get path from dstat output with --full option)
#!gsutil cat {my_bucket}/data/logging/step2-eur--vraptis--240709-095859-68.log 
#!gsutil cat {my_bucket}/data/regenie/step2_acaf/1on10/eur/chr22.log

Start time: Wed Jun 12 15:20:27 2024

              |      REGENIE v3.3.gz      |

Copyright (c) 2020-2023 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini.
Distributed under the MIT License.
Compiled with Boost Iostream library.

Log of output saved in file : /mnt/data/output/gs/fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/regenie/step2_acaf/1on10/eur/chr22.log

Options in effect:
  --step 2 \
  --bgen /mnt/data/input/gs/fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr22.bgen \
  --ref-first \
  --sample /mnt/data/input/gs/fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/acaf_threshold_v7.1/bgen/acaf_threshold.chr22.sample \
  --covarFile /mnt/data/input/gs/fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/files_for_bgen_step2_1on10/eur_pheno_for_bgen.txt \
  --covarColList age,PC{1:10} \
  --catCovarList sex \
  --phenoFile /mnt/data/input/gs/fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/fi

 block [2514/3992] : done (440ms) 
 block [2515/3992] : done (485ms) 
 block [2516/3992] : done (650ms) 
 block [2517/3992] : done (655ms) 
 block [2518/3992] : done (414ms) 
 block [2519/3992] : done (487ms) 
 block [2520/3992] : done (627ms) 
 block [2521/3992] : done (518ms) 
 block [2522/3992] : done (467ms) 
 block [2523/3992] : done (475ms) 
 block [2524/3992] : done (360ms) 
 block [2525/3992] : done (333ms) 
 block [2526/3992] : done (333ms) 
 block [2527/3992] : done (635ms) 
 block [2528/3992] : done (389ms) 
 block [2529/3992] : done (668ms) 
 block [2530/3992] : done (460ms) 
 block [2531/3992] : done (446ms) 
 block [2532/3992] : done (554ms) 
 block [2533/3992] : done (454ms) 
 block [2534/3992] : done (603ms) 
 block [2535/3992] : done (733ms) 
 block [2536/3992] : done (389ms) 
 block [2537/3992] : done (488ms) 
 block [2538/3992] : done (542ms) 
 block [2539/3992] : done (576ms) 
 block [2540/3992] : done (507ms) 
 block [2541/3992] : done (3

In [197]:
%%bash
## delete dsub job
#ddel --provider google-cls-v2 --project terra-vpc-sc-47152b29 --location us-central1 --jobs 'step2-acaf--vraptis--240611-135053-42' --users 'vraptis'

**Check results:**

In [28]:
!gsutil ls {my_bucket}/data/regenie/step2_acaf/all/eur/
#!gsutil du -h {my_bucket}/data

gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/regenie/step2_acaf/all/eur/chr22.log
gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/regenie/step2_acaf/all/eur/chr22_delirium_status.regenie


In [9]:
## copy to workspace
!mkdir -p acaf/eur
!mkdir -p acaf/amr
!mkdir -p acaf/afr

#!gsutil cp -r {my_bucket}/data/regenie/step2_acaf/1on10/eur/chr*_delirium_status.regenie acaf/eur/
#!gsutil cp -r {my_bucket}/data/regenie/step2_acaf/1on10/amr/chr*_delirium_status.regenie acaf/amr/
!gsutil cp -r {my_bucket}/data/regenie/step2_acaf/1on10/afr/chr*_delirium_status.regenie acaf/afr/
#!gsutil cp -r {my_bucket}/data/regenie/step2_acaf/1on10/afr/chrX_delirium_status.regenie acaf/afr/


Copying gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/regenie/step2_acaf/1on10/afr/chr22_delirium_status.regenie...
\ [1 files][ 65.9 MiB/ 65.9 MiB]                                                
Operation completed over 1 objects/65.9 MiB.                                     
Copying gs://fc-secure-0e4de6e0-e2d7-4267-949d-7b1ad758a53f/data/regenie/step2_acaf/1on10/afr/chrX_delirium_status.regenie...
\ [1 files][184.6 MiB/184.6 MiB]                                                
Operation completed over 1 objects/184.6 MiB.                                    
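
Once the per-chromosome files are in the workspace, they can be read back for the later filtering step mentioned in the notes. The sketch below uses only the standard library and relies on regenie writing space-delimited text with a header row that includes a `LOG10P` column holding -log10(p). The function names are hypothetical, and the default cut-off of 7.30103 (p = 5e-8) is a common genome-wide convention, not a value taken from this notebook.

```python
import csv
import glob
import os


def iter_regenie_rows(folder, phenotype="delirium_status"):
    """Yield one dict per variant from every chr*_<phenotype>.regenie file."""
    pattern = os.path.join(folder, f"chr*_{phenotype}.regenie")
    # Sorted lexicographically, so chr10 comes before chr2.
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter=" "):
                yield row


def significant_hits(folder, min_log10p=7.30103, phenotype="delirium_status"):
    """Collect rows whose LOG10P is at least min_log10p; unparsable values are skipped."""
    hits = []
    for row in iter_regenie_rows(folder, phenotype):
        try:
            log10p = float(row["LOG10P"])
        except (KeyError, TypeError, ValueError):
            continue  # e.g. LOG10P reported as NA
        if log10p >= min_log10p:
            hits.append(row)
    return hits


# Folder name as created in the cell above; prints 0 if no files are present.
print(len(significant_hits("acaf/afr")))
```

For files of this size (tens to hundreds of MiB each), streaming row by row keeps memory use low; loading everything into a pandas DataFrame is an option when the full table is needed at once.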
