<div class="alert alert-info" style="font-family:'arial';font-size:25px"> Set up for batch-style computing on the *All of Us* Researcher Workbench with dsub </div>

- The cell below lists the environment settings, time, and cost needed to run this notebook. For this tutorial, you can leave your cloud analysis environment at the default settings for General Analysis.
- This notebook only takes a few minutes to run interactively.

<div class="alert alert-block alert-info"><b>Cloud Analysis Environment</b>: Use "Recommended Environment" <kbd><b>General Analysis</b></kbd> which creates compute type <kbd><b>Standard VM</b></kbd> with default values of 4 CPUs, 15GB RAM, and 120GB disk.
</div>

# Objectives

We recommend that researchers use this notebook to learn the basics of using dsub on the *All of Us* Researcher Workbench, including with genomic data. It shows how to set up [dsub](https://github.com/databiosphere/dsub) for use on the Workbench.

**What you will learn:**
1. What is dsub?
1. When would I want to use dsub instead of a notebook?
1. How to install dsub.
1. How to create a bash function with default argument values for dsub.


See also the [dsub documentation](https://github.com/databiosphere/dsub#dsub-features).

# What is dsub?

dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud.

The dsub user experience is modeled after traditional high-performance computing job schedulers like Grid Engine and Slurm. You write a script and then submit it to a job scheduler from a shell prompt on your local machine.
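
As a rough illustration of that model, a batch script is ordinary bash that reads its parameters from environment variables. In the sketch below, `CHROM` and `OUT` are hypothetical names. With dsub they would be supplied through flags such as `--env` and `--output`; here they are set by hand so the script can be smoke-tested locally before any job is submitted.

```bash
#!/bin/bash
# Write a tiny batch script to a temporary file.
SCRIPT="$(mktemp)"
cat > "${SCRIPT}" <<'EOF'
#!/bin/bash
set -o errexit
set -o nounset
# CHROM and OUT are hypothetical names; dsub would set them from --env and --output.
echo "processing ${CHROM}" > "${OUT}"
EOF

# Smoke-test the script locally by setting the same variables by hand.
OUT_FILE="$(mktemp)"
CHROM=chr21 OUT="${OUT_FILE}" bash "${SCRIPT}"
cat "${OUT_FILE}"
```

Keeping the script free of hard-coded paths is what lets the same file run unchanged on a remote VM.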

See also the [dsub documentation](https://github.com/databiosphere/dsub#dsub-features).

# When would I want to use dsub?

You can use `%%bash` or `!` in a notebook to run command line tools like `plink`. It works fine, and it's nice to show your work in a [literate programming style](https://en.wikipedia.org/wiki/Literate_programming)! You can of course also use the Jupyter terminal to run command line tools like `plink`.


You might prefer to run those command line tools via scripts with dsub when you want to:
* run the script in **parallel** (e.g., to process data for different chromosomes on different machines simultaneously)
* run the script **on a different machine** than where Jupyter is running (e.g., so that CPU and RAM are dedicated, not shared)
* run something that may take **longer than 24 hours** (e.g., to avoid cloud analysis environment [autopause](https://support.terra.bio/hc/en-us/articles/360029761352-Preventing-runaway-costs-with-notebook-auto-pause-#h_de5698f5-3c82-4763-aaaf-ea7df6a1869c))
* run something using **inexpensive [preemptible VMs](https://cloud.google.com/compute/docs/instances/preemptible)**
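
The parallel case in the first bullet is usually expressed with a dsub tasks file: a tab-separated file whose header row names flags such as `--env` or `--output`, with one row per task. The sketch below only builds such a file for chromosomes 1 to 22 and does not submit anything. The variable name `CHROM`, the `dsub/results/` folder, and the fallback bucket `gs://example-bucket` are assumptions made for illustration.

```bash
#!/bin/bash
# Hypothetical fallback bucket so the sketch also runs outside the Workbench.
BUCKET="${WORKSPACE_BUCKET:-gs://example-bucket}"
TASKS_FILE="$(mktemp)"

# Header row: each column names a dsub flag and the variable it sets.
printf -- '--env CHROM\t--output OUT\n' > "${TASKS_FILE}"

# One row per autosome; each row describes one task.
for n in $(seq 1 22); do
  printf 'chr%s\t%s/dsub/results/chr%s.txt\n' "${n}" "${BUCKET}" "${n}" >> "${TASKS_FILE}"
done

# Show the header and the first two task rows.
head -n 3 "${TASKS_FILE}"
```

A file like this would then be passed with `--tasks`, together with a `--script` or `--command` that reads `CHROM` and writes to `OUT`.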

# Set up dsub

In [1]:
!pip3 install --upgrade dsub

[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: pip install --upgrade pip


# View `dsub --help`

dsub is flexible and has many parameters. For several of them, though, we either always pass the same value when running on the *All of Us* Researcher Workbench, or there is a recommended default value.

In [2]:
%%bash

dsub --help

usage: /opt/conda/bin/dsub [-h] [--provider PROVIDER] [--version VERSION]
                           [--unique-job-id] [--name NAME]
                           [--tasks [FILE M-N ...]] [--image IMAGE]
                           [--dry-run] [--command COMMAND] [--script SCRIPT]
                           [--env [KEY=VALUE ...]] [--label [KEY=VALUE ...]]
                           [--input [KEY=REMOTE_PATH ...]]
                           [--input-recursive [KEY=REMOTE_PATH ...]]
                           [--output [KEY=REMOTE_PATH ...]]
                           [--output-recursive [KEY=REMOTE_PATH ...]]
                           [--user USER] [--user-project USER_PROJECT]
                           [--mount [KEY=PATH_SPEC ...]] [--wait]
                           [--retries RETRIES] [--poll-interval POLL_INTERVAL]
                           [--after AFTER [AFTER ...]] [--skip] [--summary]
                           [--min-cores MIN_CORES] [--min-ram MIN_RAM]
                        

  --boot-disk-size BOOT_DISK_SIZE
                        Size (in GB) of the boot disk (default: 10, 30 for
                        google-batch)
  --preemptible [PREEMPTIBLE]
                        If --preemptible is given without a number, enables
                        preemptible VMs for all attempts for all tasks. If a
                        number value N is used, enables preemptible VMs for up
                        to N attempts for each task. Defaults to not using
                        preemptible VMs.
  --zones ZONES [ZONES ...]
                        List of Google Compute Engine zones.
  --scopes SCOPES [SCOPES ...]
                        Space-separated scopes for Google Compute Engine
                        instances. If unspecified, provider will use 'https://
                        www.googleapis.com/auth/bigquery,https://www.googleapi
                        s.com/auth/compute,https://www.googleapis.com/auth/dev
                        storage.full_control,

# Set up the `aou_dsub` function

Researchers can avoid a lot of unnecessary typing of those default parameter values by using this [bash function](https://linuxize.com/post/bash-functions/) to call dsub.

Use of the `aou_dsub` function is optional but recommended because, in addition to removing a lot of boilerplate code, it also creates a nice folder structure for dsub log files. Feel free to customize it to meet your needs.

You can use this `aou_dsub` function from both within `%%bash` cells in notebooks and also from the Jupyter terminal.

In [3]:
%%writefile ~/aou_dsub.bash

#!/bin/bash

# This shell function passes reasonable defaults for several dsub parameters, while
# allowing the caller to override any of them. It creates a nice folder structure within
# the workspace bucket for dsub log files.

# --[ Parameters ]--
# any valid dsub parameter flag

#--[ Returns ]--
# the job id of the job created by dsub

#--[ Details ]--
# The --provider, --user-project, --project, --network, --subnetwork, and
# --service-account values below should always be used as-is when running on AoU RWB.

# Feel free to change the values for --user, --regions, --logging, and --image if you like.

# Note that we insert some job data into the logging path.
# https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md#inserting-job-data

function aou_dsub () {

  # Get a shorter username to leave more characters for the job name.
  local DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

  # For AoU RWB projects, the network is named "network" and the subnetwork "subnetwork".
  local AOU_NETWORK=network
  local AOU_SUBNETWORK=subnetwork

  dsub \
      --provider google-cls-v2 \
      --user-project "${GOOGLE_PROJECT}" \
      --project "${GOOGLE_PROJECT}" \
      --image 'marketplace.gcr.io/google/ubuntu1804:latest' \
      --network "${AOU_NETWORK}" \
      --subnetwork "${AOU_SUBNETWORK}" \
      --service-account "$(gcloud config get-value account)" \
      --user "${DSUB_USER_NAME}" \
      --regions us-central1 \
      --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
      "$@"
}

Writing /home/jupyter/aou_dsub.bash
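
To make the two derived values in the function more concrete, the sketch below reproduces the user-name and logging-path expressions with stand-in inputs. `EXAMPLE_EMAIL` and `EXAMPLE_BUCKET` are hypothetical substitutes for `OWNER_EMAIL` and `WORKSPACE_BUCKET`, chosen so that the real variables are not overwritten.

```bash
#!/bin/bash
# Hypothetical stand-ins for the OWNER_EMAIL and WORKSPACE_BUCKET variables.
EXAMPLE_EMAIL="jane.doe@researchallofus.org"
EXAMPLE_BUCKET="gs://example-bucket"

# Same expression as in aou_dsub: keep only the part before the "@".
EXAMPLE_USER="$(echo "${EXAMPLE_EMAIL}" | cut -d@ -f1)"

# The date portion is filled in by bash at submission time; the {...}
# placeholders are passed through literally for dsub to substitute.
EXAMPLE_LOG="${EXAMPLE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log"

echo "user:    ${EXAMPLE_USER}"
echo "logging: ${EXAMPLE_LOG}"
```

Because the date is expanded when the function is called, logs from separate submissions of the same job name land in separate timestamped folders.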


## Make `aou_dsub` available from the terminal too

We can also add this function to `.bashrc` so that we can easily use it from the terminal.

In [4]:
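%%bash

# Append the source line only once, so re-running this cell does not duplicate it.
grep -qxF "source ${HOME}/aou_dsub.bash" ~/.bashrc 2> /dev/null \
  || echo "source ${HOME}/aou_dsub.bash" >> ~/.bashrc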

# Try it now!

If you have run all the cells in this notebook, open a **new** terminal and run:
```
aou_dsub --help | more
```

If you run this in an existing terminal and get the error `aou_dsub: command not found`, run `source ~/.bashrc` first.
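
Once the function is available, a first submission might look like the sketch below. It uses only flags that appear in the `dsub --help` listing (`--name`, `--command`, `--output`, `--dry-run`). The job name `hello-dsub`, the `dsub/results/` folder, and the fallback bucket are assumptions for illustration. The guard lets the snippet run harmlessly in a shell where `aou_dsub` has not been sourced.

```bash
#!/bin/bash
# Hypothetical output location; the dsub/results/ folder is an assumption.
RESULT_PATH="${WORKSPACE_BUCKET:-gs://example-bucket}/dsub/results/hello.txt"

if command -v aou_dsub > /dev/null 2>&1; then
  # --dry-run asks dsub to describe the job rather than submit it.
  aou_dsub \
    --name hello-dsub \
    --command 'echo "Hello from dsub" > "${OUT}"' \
    --output OUT="${RESULT_PATH}" \
    --dry-run
else
  echo "aou_dsub is not defined; run 'source ~/aou_dsub.bash' first."
fi
```

Removing `--dry-run` would submit the job for real, so it is worth checking the dry-run output first.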

In [5]:
%%bash

pip3 freeze



dsub==0.4.12

