# HLA haplotype reconstruction using T1K

Jacob Gutierrez

This notebook runs T1K HLA haplotyping on the All of Us workbench using `dsub`. The Docker image is available at https://hub.docker.com/r/jacobog02/jg-t1k. The overview section contains general instructions on how the dsub command was executed.

In [1]:
from datetime import datetime
import os
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [2]:
USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save this Python variable as an environment variable to use within %%bash cells.
%env USER_NAME={USER_NAME}

env: USER_NAME=jacobog02


## Install and define bash function

Following the tutorial: https://workbench.researchallofus.org/workspaces/aou-rw-6221d5ec/howtousedsubintheresearcherworkbenchv7/analysis/preview/1.%20dsub%20set%20up%20and%20read%20me.ipynb

This shell function passes defaults for several dsub parameters so that the actual call is cleaner. 

In [None]:
# optional
#!pip3 install --upgrade dsub

In [None]:
%%writefile ~/aou_dsub.bash

#!/bin/bash

function aou_dsub () {

  # Get a shorter username to leave more characters for the job name.
  local DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

  # For AoU projects, network name is "network".
  local AOU_NETWORK=network
  local AOU_SUBNETWORK=subnetwork

  dsub \
      --provider google-cls-v2 \
      --user-project "${GOOGLE_PROJECT}" \
      --project "${GOOGLE_PROJECT}" \
      --image 'gcr.io/jg-public-docker-gcp/jg-t1k:latest' \
      --network "${AOU_NETWORK}" \
      --subnetwork "${AOU_SUBNETWORK}" \
      --service-account "$(gcloud config get-value account)" \
      --user "${DSUB_USER_NAME}" \
      --regions us-central1 \
      --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
      "$@"
}

In [None]:
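%%bash
# Source the function from ~/.bashrc so that new shells can use it.
# The %%bash magic must be the first line of the cell.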

echo source ~/aou_dsub.bash >> ~/.bashrc
source ~/.bashrc

## Upload hg38 reference sequences

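Index the reference FASTA (`.fai`) and create its sequence dictionary (`.dict`). Each `!` line runs in its own shell, so the `cd` has to be repeated:

In [None]:
!cd ref && samtools faidx Homo_sapiens_assembly38.fasta
!cd ref && /home/jupyter/bin/gatk-4.2.6.0/gatk CreateSequenceDictionary -R Homo_sapiens_assembly38.fasta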

The `small_hla_regions_nounmapped.intervals` file is the output of `HLA_haplotype_construction/compute_hla_regions.R`. 

Copy these files to the google bucket file path:

In [None]:
!gsutil cp ref/small_hla_regions_nounmapped.intervals ${WORKSPACE_BUCKET}/ref/small_hla_regions_nounmapped.intervals

## Avoid `gsutil cp` for large files: it uploads with a single core and is very slow.
## `gcloud storage cp` uploads in parallel, so throughput is limited by core count and network speed.
## Observed: gsutil cp ~1.8 MB/s vs. gcloud storage cp ~20-25 MB/s, roughly 10 times faster.
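!gcloud storage cp ref/Homo_sapiens_assembly38.fasta ${WORKSPACE_BUCKET}/ref/Homo_sapiens_assembly38.fasta
## The .fai and .dict indexes must sit next to the fasta in the bucket (GATK needs them in the dsub script).
!gcloud storage cp ref/Homo_sapiens_assembly38.fasta.fai ref/Homo_sapiens_assembly38.dict ${WORKSPACE_BUCKET}/ref/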

In [None]:
!gsutil ls ${WORKSPACE_BUCKET}/ref/

## Tasks script 

In [5]:
%%writefile scripts/yield_dsub_tasks_v5.R

#!/bin/R

library(data.table)
library(dplyr)

# All AoU IDs mapped to corresponding WGS cram files
## This can be obtained via: gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv
df <- fread("data/manifest.csv")

# Subset to individuals of EUR ancestry. See: EBV DNA Quantification / Format EBV DNA covariates notebook.
ancestry_df <- fread("data/ancestry_preds.tsv")
EUR_IDs <- ancestry_df %>% filter(ancestry_pred == "eur") %>% pull(research_id)
df <- df %>% filter(person_id %in% EUR_IDs)

# List samples that already have output in the bucket
prev_run <- system("gsutil ls ${WORKSPACE_BUCKET}/t1k/out/", intern = TRUE)
prev_run <- prev_run %>% basename() %>% gsub("_genotype.tsv", "", ., fixed = TRUE)

# Exclude previously run samples (those with a genotype.tsv) from new runs
`%ni%` <- Negate(`%in%`)
df <- df %>% filter(person_id %ni% prev_run)

use_v <- df[,2] %>% pull()

# Create output dir  
out_p <- "dsub_files_euro_f/"
dir.create(out_p,showWarnings=F)

## Batches of ~90 should be roughly 12 hours of run time. This way we can get a sense of the time needed
size <- 90

split_l <- split(use_v, ceiling(seq_along(use_v)/size))
names(split_l) <- paste0("Task_", names(split_l))
names(split_l) <- paste0(out_p,names(split_l), "_input.txt")

# Write one input file per batch, listing the cram paths for that task
catch <- lapply(seq_along(split_l) , function(i) writeLines(split_l[[i]],names(split_l)[i]))

# Create task file
gproj <- Sys.getenv("WORKSPACE_BUCKET")
task_out <- paste0(out_p,"taskfile.tsv")
header <- c("--input input_p")
out_df <- data.frame(V1 = paste0(gproj,"/", names(split_l)))
writeLines(paste(header, collapse = "\t"), task_out)
fwrite(out_df, task_out, col.names = F, append = T )
# out_p already ends in "/", so no extra separator is needed
system(sprintf("gsutil -m cp -r %s* ${WORKSPACE_BUCKET}/%s", out_p, out_p), intern = TRUE)

## Output is dsub_files_euro_f/taskfile.tsv

Writing scripts/yield_dsub_tasks_v5.R


In [None]:
!Rscript scripts/yield_dsub_tasks_v5.R
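
The batching logic of the tasks script can be sketched in Python as follows. This is a minimal illustration and not a replacement for the R script: the manifest rows, bucket name and the helper names `make_batches` and `task_file_lines` are hypothetical, and only the batch size of 90, the `Task_<n>_input.txt` naming and the `--input input_p` header come from the script above.

```python
def make_batches(manifest, done_ids, size=90):
    """Drop samples that already have output, then split the cram paths into batches.

    manifest: list of (person_id, cram_path) pairs (hypothetical layout).
    done_ids: set of person IDs that already have a genotype file.
    """
    todo = [cram for person_id, cram in manifest if person_id not in done_ids]
    return [todo[i:i + size] for i in range(0, len(todo), size)]


def task_file_lines(bucket, out_dir, n_batches):
    """Header line plus one row per task, each pointing at that task's input list."""
    rows = [f"{bucket}/{out_dir}/Task_{i}_input.txt" for i in range(1, n_batches + 1)]
    return ["--input input_p"] + rows


# Made-up manifest of 200 samples, two of which were already processed.
manifest = [(str(i), f"gs://example-bucket/cram/wgs_{i}.cram") for i in range(1, 201)]
batches = make_batches(manifest, done_ids={"3", "7"})
lines = task_file_lines("gs://example-bucket", "dsub_files_euro_f", len(batches))

print([len(b) for b in batches])
print(lines[:2])
```

Each row of the task file becomes one dsub task, so the number of batches is also the number of VMs requested.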

## dsub bash script 

Bash script that runs T1K on each cram. `proc_t1k` processes one cram (one individual); each dsub task loops over the crams listed in its `input_p` file, and the task file `dsub_files_euro_f/taskfile.tsv` assigns one input file to each task.

In [8]:
%%writefile scripts/t1k_singlerun_mnt.sh
#!/bin/bash

#set -o errexit
set -o nounset

function proc_t1k (){

## Inputs:
## 1) one_cram: path to a single cram ("gs://..."), passed as the function argument
## 2) ref_bucket: directory holding small_hla_regions_nounmapped.intervals and the hg38 fasta
##    (public copy: 'gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta')
## 3) the_proj: Google project, passed as an environment variable
## 4) out_path: output path, set in the dsub call via --output-recursive out_path="${WORKSPACE_BUCKET}/t1k/out/"

# Grab basename from cram
one_cram=$1 ## as function
basename=`basename "${one_cram}" .cram | cut -f2 -d "_"`
prefix="${basename}_T1K"
mkdir -p ${prefix}
    
# Get reference filepaths 
regions=${ref_bucket}/small_hla_regions_nounmapped.intervals
reference=${ref_bucket}/Homo_sapiens_assembly38.fasta


# Extract reads from the HLA regions and convert them to paired R1/R2 fastqs
# Requires that the reference fasta has both its .fai and .dict indexes in the same directory
gatk PrintReads -I "${one_cram}" \
    -L "${regions}" -R "${reference}" \
    --gcs-project-for-requester-pays "${the_proj}" \
    --cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0 -ip 1 \
    -O /dev/stdout | gatk SamToFastq I=/dev/stdin VALIDATION_STRINGENCY=SILENT \
    F="${prefix}/${basename}"_hla_R1.fq F2="${prefix}/${basename}"_hla_R2.fq 


# Run T1K 
/gatk/T1K/run-t1k -t 2 --noExtraction --preset hla-wgs -f /gatk/T1K/hlaidx/hlaidx_dna_seq.fa \
-1 "${prefix}/${basename}"_hla_R1.fq -2 "${prefix}/${basename}"_hla_R2.fq  -o "${prefix}/${basename}"


# Copy the genotype table to the output path
gsutil cp "${prefix}/${basename}_genotype.tsv" "${out_path}/"

# Remove the input directory (not relevant for single sample, but will be for loops)
rm -rf "${prefix}/"
} 

run_v=(`cat ${input_p}`)

for one_cram in "${run_v[@]}"; do

    # Start time
    echo "start_time," ${one_cram} "," $(date +%s)

    proc_t1k "${one_cram}"

    # End time
    echo "end_time," ${one_cram} "," $(date +%s)

done


Overwriting scripts/t1k_singlerun_mnt.sh
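
To make the naming convention explicit, here is a small Python sketch of how the script derives the sample ID and output file name from a cram path, mirroring `basename "${one_cram}" .cram | cut -f2 -d "_"`. The example path and the function names are hypothetical; the assumption is that cram files are named with the ID as the second underscore-separated field.

```python
import os


def sample_id_from_cram(cram_path):
    """Mimic: basename "$cram" .cram | cut -f2 -d "_"."""
    name = os.path.basename(cram_path)
    if name.endswith(".cram"):
        name = name[: -len(".cram")]
    fields = name.split("_")
    # cut prints the whole line when the delimiter is absent
    return fields[1] if len(fields) > 1 else name


def output_name(cram_path):
    """Name of the genotype table that the script copies to out_path."""
    return f"{sample_id_from_cram(cram_path)}_genotype.tsv"


example_cram = "gs://example-bucket/cram/wgs_1234567.cram"  # made-up path
print(sample_id_from_cram(example_cram))
print(output_name(example_cram))
```

The tasks script relies on the same convention in reverse: it strips `_genotype.tsv` from the files in the output folder to find samples that have already been run.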


In [None]:
!gsutil cp scripts/t1k_singlerun_mnt.sh ${WORKSPACE_BUCKET}/scripts/t1k_singlerun_mnt.sh

In [10]:
# Use hyphens and not whitespace, since it will become part of the bucket path
JOB_NAME='t1k-task-euro'

# Save this Python variable as an environment variable 
%env JOB_NAME={JOB_NAME}

env: JOB_NAME=t1k-task-euro


In [None]:
%%bash --out JOB_ID

source ~/aou_dsub.bash # created above

intask="dsub_files_euro_f/taskfile.tsv"

aou_dsub \
    --name "${JOB_NAME}" \
    --boot-disk-size 20 \
    --machine-type n2-standard-2 \
    --logging "${WORKSPACE_BUCKET}/t1k/logging" \
    --env the_proj="${GOOGLE_PROJECT}" \
    --input-recursive ref_bucket="gs://fc-secure-0a5076ba-84f3-4336-8ae9-f629927dcc61/ref/" \
    --tasks "${intask}" \
    --output-recursive out_path="${WORKSPACE_BUCKET}/t1k/out/" \
    --script "${WORKSPACE_BUCKET}/scripts/t1k_singlerun_mnt.sh"


In [7]:
# Save this Python variable 
%env JOB_ID={JOB_ID}

env: JOB_ID=t1k-task-v--jacobog02--250501-194508-00


Check job status:

In [None]:
!dstat --provider google-cls-v2 --project terra-vpc-sc-ae102581 --location us-central1 --jobs 't1k-task-e--jacobog02--250502-213635-98' --users 'jacobog02' --status '*' --format json | jq --arg key "status-message" '.[] | .[$key]' -  | sort | uniq -c
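
The same tally can be done in Python once the `dstat --format json` output has been captured as a string. The sketch below assumes, as the `jq` filter above does, that the output is a JSON array of task records with a `status-message` key; the example records and their messages are made up.

```python
import json
from collections import Counter


def count_status_messages(dstat_json):
    """Count tasks per status message, like `jq ... | sort | uniq -c`."""
    tasks = json.loads(dstat_json)
    return Counter(task.get("status-message") for task in tasks)


# Made-up dstat output with three tasks.
example_json = json.dumps([
    {"task-id": "1", "status-message": "Success"},
    {"task-id": "2", "status-message": "Success"},
    {"task-id": "3", "status-message": "Running"},
])

counts = count_status_messages(example_json)
for message, n in counts.most_common():
    print(n, message)
```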

In [None]:
%%bash 

a_data=`dstat --provider google-cls-v2 --project terra-vpc-sc-ae102581 --location us-central1 --jobs 't1k-task-e--jacobog02--250502-213635-98' --users 'jacobog02' --status '*'`

# Count the lines of dstat output
echo "${a_data}" | wc -l

In [None]:
%%bash

# --full
dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*'

In [None]:
!gsutil ls ${WORKSPACE_BUCKET}/t1k/logging/
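
The script prints `start_time` and `end_time` lines around each sample, so per-sample run time can be recovered from a downloaded log. The following is a minimal sketch that assumes the log lines look exactly like the `echo` output in the script (`start_time, <cram> , <epoch seconds>`); the function name and the example lines are hypothetical.

```python
def parse_timing(log_lines):
    """Return {cram_path: seconds} from start_time/end_time lines in a task log."""
    starts, runtimes = {}, {}
    for line in log_lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or parts[0] not in ("start_time", "end_time"):
            continue  # not a timing line
        cram = ",".join(parts[1:-1])
        try:
            stamp = int(parts[-1])
        except ValueError:
            continue  # malformed timestamp
        if parts[0] == "start_time":
            starts[cram] = stamp
        elif cram in starts:
            runtimes[cram] = stamp - starts[cram]
    return runtimes


# Made-up log excerpt.
example_log = [
    "start_time, gs://example-bucket/cram/wgs_1.cram , 1000",
    "some other log output",
    "end_time, gs://example-bucket/cram/wgs_1.cram , 1480",
]
runtimes = parse_timing(example_log)
print(runtimes)
```

As an illustration of how this feeds back into batch sizing, a sample that takes 480 seconds implies 480 × 90 = 43,200 seconds, or 12 hours, for a batch of 90.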

In [None]:
!gsutil ls -lh ${WORKSPACE_BUCKET}/t1k/out/

Copy over outputs to local folder:

In [None]:
!mkdir -p t1k_out/
!gcloud storage cp -r ${WORKSPACE_BUCKET}/t1k/out/ t1k_out/
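
Once the genotype tables are local, they can be stacked into one table with a sample ID column. The sketch below parses the text of a single file using only the standard library. The column names are an assumption based on the T1K documentation (a headerless, tab-separated file) and should be checked against the actual output; the example row is made up.

```python
import csv
import io

# Assumed column order of a T1K *_genotype.tsv file; verify before relying on it.
T1K_COLUMNS = [
    "gene_name", "num_diff_alleles",
    "allele_1", "abundance_1", "quality_1",
    "allele_2", "abundance_2", "quality_2",
]


def read_genotype_tsv(text, sample_id):
    """Parse one genotype table into a list of dicts tagged with the sample ID."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(T1K_COLUMNS, fields))  # extra trailing columns are ignored
        row["sample_id"] = sample_id
        rows.append(row)
    return rows


# Made-up single-gene example.
example_text = "HLA-A\t2\tHLA-A*02:01:01\t100.5\t60\tHLA-A*24:02:01\t95.2\t57\n"
rows = read_genotype_tsv(example_text, sample_id="1234567")
print(rows[0]["gene_name"], rows[0]["allele_1"], rows[0]["allele_2"])
```

The sample ID for each file can be taken from its name by dropping the `_genotype.tsv` suffix, as the tasks script does.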