<div class="alert alert-info" style="font-family:'arial';font-size:25px"> Set up for computing with cuKING on the *All of Us* Researcher Workbench with dsub </div>

- For this notebook, your cloud analysis environment can be left with the default settings for a General Analyses.
- This notebook only takes a couple minutes to run interactively. 
- This notebook writes a cuKING_dsub.bash file containing the bash `cuKING_dsub` function to disk which is then sourced within the .bashrc so the function can be used when the terminal or the file can be sourced within a notebook cell to use it interactively.

**This notebook only needs to be run once per compute environment. If the environment is paused the notebook does not need to be rerun. If it is deleted, the notebook will need to be rerun.

<div class="alert alert-block alert-info"><b>Cloud Analysis Environment</b>: Use "Recommended Environment" <kbd><b>General Analysis</b></kbd> which creates compute type <kbd><b>Standard VM</b></kbd> with default values of 4 CPUs, 15GB RAM, and 120GB disk.
</div>

# Objectives

This notebook will set up [dsub](https://github.com/databiosphere/dsub) with the cuKING image from the gnomAD artifact registry for use on the Researcher Workbench.

See also the [dsub documentation](https://github.com/databiosphere/dsub#dsub-features).

# What is dsub?

dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud.

The dsub user experience is modeled after traditional high-performance computing job schedulers like Grid Engine and Slurm. You write a script and then submit it to a job scheduler from a shell prompt. You can submit via jupyter notebook cell or via the terminal application after running this notebook which will edit the .bashrc.

NOTE: You need to source the .bashrc in every cell that you use the cuking_dsub command in. Each cell is independent so running source once, as you will in the terminal, will not work in the notebook.

See also the [dsub documentation](https://github.com/databiosphere/dsub#dsub-features).

# When to use dsub?

You can use `%%bash` or `!` in a notebook to run command line tools like `plink`. It works fine, and it's nice to show your work in a [literate programming style](https://en.wikipedia.org/wiki/Literate_programming)! You can of course also use the Jupyter terminal to run command line tools like `plink`.


You might prefer to run those command line tools via scripts with dsub when you want to:
* run the script in **parallel** (e.g., to process data for different chromosomes on different machines simultaneously)
* run the script **on a different machine** than where Jupyter is running (e.g., so that CPU and RAM are dedicated, not shared)
* run something that may take **longer than 24 hours** (e.g., to avoid cloud analysis environment [autopause](https://support.terra.bio/hc/en-us/articles/360029761352-Preventing-runaway-costs-with-notebook-auto-pause-#h_de5698f5-3c82-4763-aaaf-ea7df6a1869c))
* run something using **inexpensive [preemptible VMs](https://cloud.google.com/compute/docs/instances/preemptible)**

# Setup dsub

In [1]:
# optional as dsub is installed already
!pip3 install --upgrade dsub


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# View `dsub --help`

dsub has many parameters. It is a really flexible system, but for several of the parameter values, we either always want to pass the same value when running on the *All of Us* Researcher Workbench, or there is a recommended default value.

In [2]:
%%bash

dsub --help

usage: /opt/conda/bin/dsub [-h] [--provider PROVIDER] [--version VERSION]
                           [--unique-job-id] [--name NAME]
                           [--tasks [FILE M-N ...]] [--image IMAGE]
                           [--dry-run] [--command COMMAND] [--script SCRIPT]
                           [--env [KEY=VALUE ...]] [--label [KEY=VALUE ...]]
                           [--input [KEY=REMOTE_PATH ...]]
                           [--input-recursive [KEY=REMOTE_PATH ...]]
                           [--output [KEY=REMOTE_PATH ...]]
                           [--output-recursive [KEY=REMOTE_PATH ...]]
                           [--user USER] [--user-project USER_PROJECT]
                           [--mount [KEY=PATH_SPEC ...]] [--wait]
                           [--retries RETRIES] [--poll-interval POLL_INTERVAL]
                           [--after AFTER [AFTER ...]] [--skip] [--summary]
                           [--min-cores MIN_CORES] [--min-ram MIN_RAM]
                        

# Setup `cuKING_dsub` function

We can avoid a lot of unnecessary typing of those default parameter values by using this [bash function](https://linuxize.com/post/bash-functions/) to call dsub.

Use of the `cuKING_dsub` function instead of `dsub` is optional but recommended because, in addition to removing a lot of boilerplate code, it also creates a nice folder structure for dsub log files. 

You can use this `cuKING_dsub` function from both within `%%bash` cells in notebooks and also from the Jupyter terminal. However every cell needs to run `source ~/.bashrc` or `source ~/cuKING_dsub.bash` to utilize the below `cuKING_dsub` function or the cell will not find the `cuKING_dsub` command.

In [3]:
%%writefile ~/cuKING_dsub.bash

#!/bin/bash


# This shell function passes reasonable defaults for several dsub parameters, while
# allowing the caller to override any of them. It creates a nice folder structure within
# the workspace bucket for dsub log files.

# --[ Parameters ]--
# any valid dsub parameter flag

#--[ Returns ]--
# the job id of the job created by dsub

#--[ Details ]--
# The first five parameters below should always be those values when running on AoU RWB.

# Feel free to change the values for --user, --regions, --logging, and --image if you like.

# Note that we insert some job data into the logging path.
# https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md#inserting-job-data

function cuKING_dsub () {

  # Get a shorter username to leave more characters for the job name.
  local DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

  # For AoU RWB projects network name is "network".
  local AOU_NETWORK=network
  local AOU_SUBNETWORK=subnetwork

  dsub \
      --provider google-cls-v2 \
      --user-project "${GOOGLE_PROJECT}"\
      --project "${GOOGLE_PROJECT}"\
      --network "${AOU_NETWORK}" \
      --subnetwork "${AOU_SUBNETWORK}" \
      --service-account "$(gcloud config get-value account)" \
      --image 'us-central1-docker.pkg.dev/broad-mpg-gnomad/images/cuking:v1.0.6' \
      --user "${DSUB_USER_NAME}" \
      --regions us-central1 \
      --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
      "$@"
}

Writing /home/jupyter/cuKING_dsub.bash


## Make `cuKING_dsub` available from the terminal

We can also add this function to `.bashrc` so that we can easily use it from the terminal.

In [4]:
%%bash

echo source /home/jupyter/cuKING_dsub.bash >> ~/.bashrc

## Call cuKING_dsub  --help
**Call cuKING_dsub  --help**

In [5]:
%%bash
source ~/.bashrc
cuKING_dsub --help | more

usage: /opt/conda/bin/dsub [-h] [--provider PROVIDER] [--version VERSION]
                           [--unique-job-id] [--name NAME]
                           [--tasks [FILE M-N ...]] [--image IMAGE]
                           [--dry-run] [--command COMMAND] [--script SCRIPT]
                           [--env [KEY=VALUE ...]] [--label [KEY=VALUE ...]]
                           [--input [KEY=REMOTE_PATH ...]]
                           [--input-recursive [KEY=REMOTE_PATH ...]]
                           [--output [KEY=REMOTE_PATH ...]]
                           [--output-recursive [KEY=REMOTE_PATH ...]]
                           [--user USER] [--user-project USER_PROJECT]
                           [--mount [KEY=PATH_SPEC ...]] [--wait]
                           [--retries RETRIES] [--poll-interval POLL_INTERVAL]
                           [--after AFTER [AFTER ...]] [--skip] [--summary]
                           [--min-cores MIN_CORES] [--min-ram MIN_RAM]
                        

If you have run all the cells in this notebook, open a **new** terminal and run:
```
cuKING_dsub --help | more
```

[If you run this in an existing terminal and get error `aou_dsub: command not found`, just run `source ~/.bashrc` first.]

By default in this environment, our username is `jupyter` but that's not very informative. There is an environment variable holding our account name that we can use instead as our username.

In [6]:
from datetime import datetime
import os

USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save this Python variable as an environment variable so that its easier to use within %%bash cells.
%env USER_NAME={USER_NAME}

env: USER_NAME=kchao


# Run a basic dsub job that succeeds
**Run a basic dsub job that succeeds**

First, we call `dsub`. This is a simple example that will run a "Hello World" job on a Google Cloud VM. It will run on a **different** VM than the one on which this Jupyter notebook is currently running.

Logs and output files will be written to the workspace bucket.

Notice in the cell below that we use the `cuKING_dsub` function we created previously to set a bunch of default parameter values. The only new parameter values we need to pass are:
* `--name` a friendly name for the job
* `--output` the path in the workspace bucket where the result will be stored (notice the nice folder structure we create here which is similar to the one in `cuKING_dsub` for `--logging`)
* `--command` which in this case is a simple call to `echo`

In [7]:
%%bash --out HELLO_WORLD_JOB_ID

source ~/.bashrc # This file was created earlier in this notebook

cuKING_dsub \
  --name "${JOB_NAME}" \
  --output OUT="${WORKSPACE_BUCKET}/dsub/results/${JOB_NAME}/${USER_NAME}/$(date +'%Y%m%d/%H%M%S')/out.txt" \
  --command 'set -o errexit && \
             set -o xtrace && \
             echo Hello world from the AoU workbench!! > "${OUT}"'

Job properties:
  job-id: set--kchao--250605-174037-81
  job-name: set
  user-id: kchao
Provider internal-id (operation): projects/417087853780/locations/us-central1/operations/13226008634939225106
Launched job-id: set--kchao--250605-174037-81
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-93ccd8d2 --location us-central1 --jobs 'set--kchao--250605-174037-81' --users 'kchao' --status '*'
To cancel the job, run:
  ddel --provider google-cls-v2 --project terra-vpc-sc-93ccd8d2 --location us-central1 --jobs 'set--kchao--250605-174037-81' --users 'kchao'


In [8]:
# Save this Python variable value as an environment variable so that its easier to use within %%bash cells.
%env JOB_ID={HELLO_WORLD_JOB_ID}

env: JOB_ID=set--kchao--250605-174037-81


## Check the status of the job
**Check the status of the job**

You can see in the pink box above that dsub helpfully returns a message saying "*To check the status, run:*" and that command above is the same as the one below that uses shell environment variables.

Always feel free to copy and run those commands from the dsub output with the exact parameter values. We're just using shell environment variables below so that you can skip that copy/paste step.

In [9]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*'

Job Name    Status                                 Last Update
----------  -------------------------------------  --------------
set         VM starting (awaiting worker checkin)  06-05 17:40:54



**Use `--full` to get more detail**

When you add `--full` you get a lot more detail such as where to find the log files for the run of the job.

In [10]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*' \
    --full

- create-time: '2025-06-05 17:40:38.124566'
  dsub-version: v0-5-0
  end-time: ''
  envs: {}
  events:
  - name: start
    start-time: 2025-06-05 17:40:54.304511+00:00
  input-recursives: {}
  inputs: {}
  internal-id: projects/417087853780/locations/us-central1/operations/13226008634939225106
  job-id: set--kchao--250605-174037-81
  job-name: set
  labels: {}
  last-update: '2025-06-05 17:40:54.304511'
  logging: gs://fc-secure-b25d1307-7763-48b8-8045-fcae9caadfa1/dsub/logs/set/kchao/20250605/174037/set--kchao--250605-174037-81-task-None.log
  mounts: {}
  output-recursives: {}
  outputs:
    OUT: gs://fc-secure-b25d1307-7763-48b8-8045-fcae9caadfa1/dsub/results//kchao/20250605/174036/out.txt
  provider: google-cls-v2
  provider-attributes:
    accelerators: []
    block-external-network: null
    boot-disk-size: 10
    cpu_platform: ''
    disk-size: 200
    disk-type: pd-standard
    enable-stackdriver-monitoring: false
    instance-name: google-pipelines-worker-e8530c3a779f836f9277e

In [11]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*'

Job Name    Status                                      Last Update
----------  ------------------------------------------  --------------
set         Pulling "gcr.io/google.com/cloudsdktool...  06-05 17:41:31



In [12]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*'

Job Name    Status    Last Update
----------  --------  --------------
set         Success   06-05 17:42:21



**Take a look at the result file created by the job**

Next, we list a few directories in the workspace bucket to ensure that log files and output files exist.

<div class="alert alert-block alert-warning">
    <b>Note:</b> the output file will not exist until the job is in <b>status: SUCCESS</b>. You might need to wait a minute or two.
</div>

In [13]:
%%bash

gsutil ls "${WORKSPACE_BUCKET}/dsub/results/${JOB_NAME}/${USER_NAME}/**"

gs://fc-secure-b25d1307-7763-48b8-8045-fcae9caadfa1/dsub/results//kchao/20250605/174036/out.txt


In [14]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/results/${JOB_NAME}/${USER_NAME}/$(date +'%Y%m%d')/*/out.txt"

Hello world from the AoU workbench!!


You should see
```
Hello world from the AoU workbench!!
```

# Run a basic dsub job that fails
**Run a basic dsub job that fails**

In [15]:
%%bash --out HELLO_WORLD_JOB_ID

source ~/cuKING_dsub.bash # This file was created via notebook 01_dsub_setup.ipynb.

cuKING_dsub \
  --name "${JOB_NAME}" \
  --output OUT="${WORKSPACE_BUCKET}/dsub/results/${JOB_NAME}/${USER_NAME}/$(date +'%Y%m%d/%H%M%S')/out.txt" \
  --command 'set -o errexit && \
             set -o xtrace && \
             echo "This job fails because no input is passed to cuKING" && \
             cuking > "${OUT}"'

Job properties:
  job-id: set--kchao--250605-174339-51
  job-name: set
  user-id: kchao
Provider internal-id (operation): projects/417087853780/locations/us-central1/operations/968185655538755317
Launched job-id: set--kchao--250605-174339-51
To check the status, run:
  dstat --provider google-cls-v2 --project terra-vpc-sc-93ccd8d2 --location us-central1 --jobs 'set--kchao--250605-174339-51' --users 'kchao' --status '*'
To cancel the job, run:
  ddel --provider google-cls-v2 --project terra-vpc-sc-93ccd8d2 --location us-central1 --jobs 'set--kchao--250605-174339-51' --users 'kchao'


In [16]:
# Save this Python variable value as an environment variable so that its easier to use within %%bash cells.
%env JOB_ID={HELLO_WORLD_JOB_ID}

env: JOB_ID=set--kchao--250605-174339-51


In [17]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*' \


Job Name    Status    Last Update
----------  --------  --------------
set         Pending   06-05 17:43:39



In [18]:
%%bash

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*'

Job Name    Status                                      Last Update
----------  ------------------------------------------  --------------
set         Stopped running "user-command": exit st...  06-05 17:45:13



## Take a look at the error message in the log file
**Take a look at the error message in the log file**

<div class="alert alert-block alert-info">
    <b>Note:</b> the log files may not exist until the job is in <b>status: FAILURE</b>. You might need to wait a minute or two.
</div>

<div class="alert alert-block alert-warning">
    <b>Note:</b> If you have waited for the job to complete, you will see <b>two sets of log files</b>: one set from the successful run and the other set from the failed run.
</div>

In [19]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/logs/${JOB_NAME}/${USER_NAME}/$(date +'%Y%m%d')/*/${JOB_ID}*.log"

CommandException: No URLs matched: gs://fc-secure-b25d1307-7763-48b8-8045-fcae9caadfa1/dsub/logs//kchao/20250605/*/set--kchao--250605-174339-51*.log


CalledProcessError: Command 'b'\ngsutil cat "${WORKSPACE_BUCKET}/dsub/logs/${JOB_NAME}/${USER_NAME}/$(date +\'%Y%m%d\')/*/${JOB_ID}*.log"\n'' returned non-zero exit status 1.

You should see the message from the successful `echo` command that occurred before the error:
```
This job fails because no input is passed to cuKING
```

**Take a look at what was sent to [stderr](http://www.linfo.org/standard_error.html).**

In [20]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/logs/${JOB_NAME}/${USER_NAME}/$(date +'%Y%m%d')/*/${JOB_ID}*-stderr.log"

CommandException: No URLs matched: gs://fc-secure-b25d1307-7763-48b8-8045-fcae9caadfa1/dsub/logs//kchao/20250605/*/set--kchao--250605-174339-51*-stderr.log


CalledProcessError: Command 'b'\ngsutil cat "${WORKSPACE_BUCKET}/dsub/logs/${JOB_NAME}/${USER_NAME}/$(date +\'%Y%m%d\')/*/${JOB_ID}*-stderr.log"\n'' returned non-zero exit status 1.

You should see the error message:
```
Error: INVALID_ARGUMENT: No input URI specified
```