# Batch-style computing on the All of Us Workbench

Full workflow support is coming soon to the Workbench. In the interim, this notebook uses [dsub](https://github.com/databiosphere/dsub) to submit batch jobs.

# Test data access 

## UKB phenotypes

This is our UK Biobank data project, for which we have:
* WRITE access when using Terra
* READ-ONLY access when using the AoU workbench

See Terra workspace [ukb-application-7089](https://app.terra.bio/#workspaces/uk-biobank-sek/ukb-application-7089).

In [None]:
!gsutil ls gs://uk-biobank-sek-data-us-east1/phenotypes/raw/

## UKB exomes

This is the Terra workspace where the UKB Exomes are stored.

In [None]:
!gsutil ls gs://fc-7130e767-a885-4678-95ed-7c966c79e2d0/200K/pvcf/ukb23156_c10_b0_v1.vcf.gz

## Public annotation data

In [None]:
!bq ls bigquery-public-data:human_variant_annotation

In [None]:
%load_ext google.cloud.bigquery

In [None]:
%%bigquery --use_rest_api

SELECT COUNT(*) AS cnt FROM `bigquery-public-data.gnomAD.v3_genomes__chr21`

In [None]:
GNOMAD_V3 = 'gs://gnomad-public/release/3.0/ht/genomes/gnomad.genomes.r3.0.sites.ht'

!gsutil ls {GNOMAD_V3}

In [None]:
!gsutil ls gs://genomics-public-data/

## DeepVariant 1,000 Genomes

In [None]:
%%bash

gsutil ls gs://brain-genomics-public/research/cohort/1KGP

# Setup dsub

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench. It runs fine on the default Cloud Environment. 
</div>

In [None]:
!pip3 install --upgrade dsub

# Run some test dsub jobs

## Hello world

In [None]:
%%bash

gcloud auth list

<div class="alert alert-block alert-warning">
    <b>Note:</b> (1) You must use your own PET account in place of the <code>--service-account</code> value shown in the commands below. (2) Your PET account must be granted access to run as itself as a service account.
</div>
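
Because the service account differs per user, it can help to assemble the dsub arguments in one place instead of editing each cell. The following is a minimal sketch, not part of dsub itself: `build_dsub_args` and its parameters are hypothetical names, and the flags are simply the ones used in the cells of this notebook. It also computes the timestamp once, so the logging and output paths share the same value.

```python
from datetime import datetime
from typing import List, Optional


def build_dsub_args(
    project: str,
    bucket: str,
    service_account: str,
    job_name: str,
    command: str,
    image: Optional[str] = None,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the argv for one dsub job, mirroring the flags used in this notebook."""
    # Compute the timestamp once so the log and output paths share it.
    stamp = (now or datetime.now()).strftime("%Y%m%d/%H%M%S")
    args = [
        "dsub",
        "--provider", "google-cls-v2",
        "--service-account", service_account,  # your own PET account
        "--project", project,
        "--zones", "us-central1-*",
        "--network", "network",
        "--subnetwork", "subnetwork",
        "--logging", f"{bucket}/dsub/logging/{stamp}",
        "--output", f"OUT={bucket}/dsub/{job_name}/{stamp}/out.txt",
        "--command", command,
        "--wait",
    ]
    if image:
        # Only pass --image when a specific container is requested.
        args += ["--image", image]
    return args
```

Assuming `GOOGLE_PROJECT` and `WORKSPACE_BUCKET` are set as in the bash cells, the returned list could be passed to `subprocess.run(args, check=True)`, with `os.environ["GOOGLE_PROJECT"]` and `os.environ["WORKSPACE_BUCKET"]` supplying the first two arguments.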

In [None]:
%%bash

dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --output OUT="${WORKSPACE_BUCKET}/dsub/hello/$(date +'%Y%m%d/%H%M%S')/out.txt" \
  --command 'echo Hello world from the AoU workbench!! > "${OUT}"' \
  --wait

In [None]:
%%bash

gsutil ls "${WORKSPACE_BUCKET}/dsub/**"

In [None]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/hello/$(date +'%Y%m%d')/*/out.txt"

## regenie 'hello world'

TODO(deflaux): port the three tasks in [regenie.wdl](https://github.com/briansha/Regenie_WDL/blob/master/regenie.wdl) to three dsub pipelines.
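
One possible shape for that port, sketched under assumptions: each WDL task becomes one dsub invocation, and each later job names the previous job's ID through dsub's `--after` flag so the steps run in order. The helper names `run_dsub` and `submit_chain` are hypothetical, the commands passed in are placeholders rather than the contents of regenie.wdl, and the sketch assumes dsub prints only the job ID to stdout when launched without `--wait`.

```python
import subprocess
from typing import Callable, List, Sequence


def run_dsub(argv: Sequence[str]) -> str:
    """Launch one dsub job and return the job ID read from its stdout."""
    result = subprocess.run(list(argv), check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_chain(
    base_args: Sequence[str],
    commands: Sequence[str],
    submit: Callable[[Sequence[str]], str] = run_dsub,
) -> List[str]:
    """Submit the commands in order, making each job wait on the previous one."""
    job_ids: List[str] = []
    for command in commands:
        argv = ["dsub", *base_args, "--command", command]
        if job_ids:
            # --after holds this job until the named job has finished successfully.
            argv += ["--after", job_ids[-1]]
        job_ids.append(submit(argv))
    return job_ids
```

Here `base_args` would carry the shared flags (provider, service account, project, zones, network, image, logging) without `--wait`, so that each submission returns its job ID immediately. The `submit` parameter exists so the chaining logic can be exercised with a stand-in function before any real job is launched.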

In [None]:
%%bash

dsub \
  --provider google-cls-v2 \
  --service-account "pet-101767132834091462320@aou-rw-preprod-acef10ae.iam.gserviceaccount.com" \
  --project "${GOOGLE_PROJECT}" \
  --zones "us-central1-*" \
  --network "network" \
  --subnetwork "subnetwork" \
  --image "briansha/regenie:v2.0.1_boost" \
  --logging "${WORKSPACE_BUCKET}/dsub/logging/$(date +'%Y%m%d/%H%M%S')" \
  --output OUT="${WORKSPACE_BUCKET}/dsub/hello-regenie/$(date +'%Y%m%d/%H%M%S')/out.txt" \
  --command 'echo Hello world from regenie on the AoU workbench!! > "${OUT}"' \
  --wait

In [None]:
%%bash

gsutil cat "${WORKSPACE_BUCKET}/dsub/hello-regenie/$(date +'%Y%m%d')/*/out.txt"

# Provenance 

In [None]:
%%bash

date

In [None]:
%%bash

pip3 freeze