**Working-IGV-example**  
This notebook is a patch for the materials distributed with [Genomics in the Cloud](https://oreil.ly/genomics-cloud), an O'Reilly book by Geraldine A. Van der Auwera and Brian D. O'Connor. You can read it [online in the O'Reilly library](https://learning.oreilly.com/library/view/genomics-in-the/9781491975183/)
or [order the hardcopy on Amazon](https://www.amazon.com/Genomics-Cloud-GATK-Spark-Docker/dp/1491975199/). It provides a workaround for a bug that currently affects the Genomics Notebook that supports the exercises in Chapter 12 of the book.  

**Environment configuration**   
This patch notebook requires a custom [Terra](https://app.terra.bio/) Cloud Environment image provided as the container `gcr.io/broad-dsde-outreach/terra-base:ipyigv1`, complemented by a startup script (`gs://genomics-in-the-cloud/v1/scripts/install_GATK_4130_with_igv.sh`) that installs GATK version 4.1.3.0.  

You must customize your environment using the Cloud Environment configuration panel to match this notebook's requirements; **some commands will not work if you skip this step.**

- In the configuration panel, set the `Application Configuration` to `Custom Environment` (all the way at the bottom of the menu) and paste the container address given above into the `Container image` field. 
- Then (still in the config panel), in the `Cloud compute profile` box, paste the startup script link given above into the `Startup Script` field. 

Refer to [Terra documentation on customizing your environment](https://support.terra.bio/hc/en-us/articles/360038125912) to learn more about environment customization options.

**Kernel**  
By default this notebook opens on a Python 3 kernel. When you have the notebook running in EDIT mode, the upper right corner of the notebook (under the Notebook Runtime widget) should display the label `Python3`. 

----

# Getting started with Jupyter in Terra

Compared to the original notebook, this patch notebook skips past the very basic intro-to-Jupyter material and picks things up at the intro-to-cloud-data material. The patch retains the cell numbering from the original to make the two notebooks easier to compare.


## Using gsutil to Interact with Google Cloud Storage Buckets
Let's look at how to pull in data from GCS buckets.

*Cell 9: List the bucket contents*

In [1]:
! gsutil ls gs://genomics-in-the-cloud/

gs://genomics-in-the-cloud/hello.txt
gs://genomics-in-the-cloud/figures/
gs://genomics-in-the-cloud/v1/


*Cell 10: Copy a file from the bucket to the notebook's local storage*

In [2]:
! gsutil cp gs://genomics-in-the-cloud/hello.txt .

Copying gs://genomics-in-the-cloud/hello.txt...
/ [1 files][   20.0 B/   20.0 B]                                                
Operation completed over 1 objects/20.0 B.                                       


*Cell 11: Read the contents of a locally-stored text file*

In [3]:
! cat hello.txt

HELLO, DEAR READER!


## Setting Up a Variable Pointing to the Germline Data in the Book Bucket
We're going to want to access the data in the bucket multiple times, so we make a variable to avoid hardcoding and repeating file paths.

*Cell 12: Create a Python variable*

In [4]:
GERM_DATA = "gs://genomics-in-the-cloud/v1/data/germline"
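
Because the paths used in the rest of the notebook are built from `GERM_DATA`, it can help to compose them in one place. The sketch below is a minimal illustration, not part of the original notebook: `gcs_join` is a hypothetical helper name, and it only does string handling (it never contacts the bucket).

```python
GERM_DATA = "gs://genomics-in-the-cloud/v1/data/germline"


def gcs_join(base, *parts):
    """Join path segments onto a gs:// base without doubling slashes.

    Hypothetical helper for illustration; it does not check that the
    resulting object actually exists in the bucket.
    """
    if not base.startswith("gs://"):
        raise ValueError(f"not a GCS path: {base!r}")
    segments = [base.rstrip("/")]
    # Drop empty segments and stray leading/trailing slashes.
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(segments)


mother_bam = gcs_join(GERM_DATA, "bams", "mother.bam")
print(mother_bam)
```

The result can be interpolated into a shell cell the same way as `GERM_DATA`, for example `! gsutil ls {mother_bam}`.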

*Cell 13: Use the variable to list the bucket contents*  
*(Erratum: this cell was erroneously numbered 14 in the initial print run)*

In [5]:
! gsutil ls {GERM_DATA}

gs://genomics-in-the-cloud/v1/data/germline/bams/
gs://genomics-in-the-cloud/v1/data/germline/gvcfs/
gs://genomics-in-the-cloud/v1/data/germline/intervals/
gs://genomics-in-the-cloud/v1/data/germline/ref/
gs://genomics-in-the-cloud/v1/data/germline/resources/
gs://genomics-in-the-cloud/v1/data/germline/vcfs/


*Cell 14: List the `bams` directory to get the paths of the files it contains*

In [6]:
! gsutil ls {GERM_DATA}/bams

gs://genomics-in-the-cloud/v1/data/germline/bams/father.bai
gs://genomics-in-the-cloud/v1/data/germline/bams/father.bam
gs://genomics-in-the-cloud/v1/data/germline/bams/mother.bai
gs://genomics-in-the-cloud/v1/data/germline/bams/mother.bam
gs://genomics-in-the-cloud/v1/data/germline/bams/motherNEX.bai
gs://genomics-in-the-cloud/v1/data/germline/bams/motherNEX.bam
gs://genomics-in-the-cloud/v1/data/germline/bams/motherRnaseq.bai
gs://genomics-in-the-cloud/v1/data/germline/bams/motherRnaseq.bam
gs://genomics-in-the-cloud/v1/data/germline/bams/son.bai
gs://genomics-in-the-cloud/v1/data/germline/bams/son.bam


*Cell 15: Copy the BAM file and index for the mother*

In [7]:
! gsutil cp {GERM_DATA}/bams/mother.ba* .

Copying gs://genomics-in-the-cloud/v1/data/germline/bams/mother.bai...
Copying gs://genomics-in-the-cloud/v1/data/germline/bams/mother.bam...          
/ [2 files][ 23.8 MiB/ 23.8 MiB]                                                
Operation completed over 2 objects/23.8 MiB.                                     


*Cell 16: List the local working directory to confirm the success of the copy operation*

In [8]:
! ls .

Genomics-Notebook-executed.ipynb  mother.bai  token.txt
Genomics-Notebook.ipynb		  mother.bam  Working-IGV-example.ipynb
hello.txt			  sandbox


## Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
Now that we know how to bring in data, let's go over how we're going to save the outputs of any analyses we run.

*Cell 17: Create a new directory, ignoring any errors if the path already exists (`-p`)*

In [9]:
! mkdir -p sandbox/

*Cell 18: Move the mother BAM and index files that we copied earlier to the sandbox*

In [10]:
! mv mother.ba* sandbox/
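
The `mkdir -p` and `mv` shell cells can also be written in plain Python, which is sometimes convenient when the same staging step is reused in a function. The following is a rough equivalent using only the standard library; the function name `move_into_sandbox` is made up for this sketch.

```python
import glob
import shutil
from pathlib import Path


def move_into_sandbox(pattern, sandbox="sandbox"):
    """Move every file matching `pattern` into `sandbox`, creating it if needed.

    Mirrors `mkdir -p sandbox/` followed by `mv <pattern> sandbox/`.
    Returns the list of new file paths.
    """
    dest = Path(sandbox)
    # parents=True / exist_ok=True gives the same behavior as `mkdir -p`.
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(glob.glob(pattern)):
        target = dest / Path(src).name
        shutil.move(src, str(target))
        moved.append(str(target))
    return moved


# Example call, equivalent to the two shell cells:
# move_into_sandbox("mother.ba*")
```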

*Cell 19: List the contents of the sandbox to check that everything is where you expect it to be*

In [11]:
! ls sandbox

mother.bai	   motherHCdebug.bam	  mother_variants.200k.vcf.gz
mother.bam	   motherHCdebug.vcf	  mother_variants.200k.vcf.gz.tbi
motherHCdebug.bai  motherHCdebug.vcf.idx


*Cell 20: Import the `os` package, look up the value of the `WORKSPACE_BUCKET` environment variable (set by Terra at the kernel level) and store it in a Python variable for easy access*

In [12]:
import os
WS_BUCKET = os.environ['WORKSPACE_BUCKET']

*Cell 21: Check the value of your new variable*

In [13]:
print(WS_BUCKET)

gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19
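
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing, for instance if the notebook is run outside Terra. A slightly more defensive lookup might look like the sketch below; `get_workspace_bucket` is a hypothetical name, and the error message is only a suggestion.

```python
import os


def get_workspace_bucket(env=os.environ):
    """Return the workspace bucket address, or fail with a readable message.

    `env` defaults to the real environment; passing a dict makes the
    function easy to try out without touching os.environ.
    """
    bucket = env.get("WORKSPACE_BUCKET")
    if not bucket:
        raise RuntimeError(
            "WORKSPACE_BUCKET is not set; is this notebook running in Terra?"
        )
    # Normalize so that later `bucket + '/sandbox/...'` never doubles the slash.
    return bucket.rstrip("/")


# In the notebook this would replace the direct lookup:
# WS_BUCKET = get_workspace_bucket()
```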


*Cell 22: List the full (`-r`) contents of the workspace bucket (results will depend on what other work you have done in your workspace)*

In [14]:
! gsutil ls -r {WS_BUCKET}

gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/notebooks/:
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/notebooks/Genomics-Notebook-executed.ipynb
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/notebooks/Genomics-Notebook.ipynb
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/notebooks/Working-IGV-example.ipynb

gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/:
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/mother.bai
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/mother.bam


*Cell 23: Copy the contents of your sandbox to the workspace bucket (using `-m` for efficient transfer)*

In [15]:
! gsutil -m cp -r sandbox {WS_BUCKET}

Copying file://sandbox/mother.bai [Content-Type=application/octet-stream]...
Copying file://sandbox/mother_variants.200k.vcf.gz.tbi [Content-Type=application/octet-stream]...
Copying file://sandbox/motherHCdebug.vcf.idx [Content-Type=application/octet-stream]...
Copying file://sandbox/motherHCdebug.bam [Content-Type=application/octet-stream]...
Copying file://sandbox/mother.bam [Content-Type=application/octet-stream]...    
Copying file://sandbox/mother_variants.200k.vcf.gz [Content-Type=text/vcard]... 
Copying file://sandbox/motherHCdebug.vcf [Content-Type=text/vcard]...           
Copying file://sandbox/motherHCdebug.bai [Content-Type=application/octet-stream]...
- [8/8 files][ 23.9 MiB/ 23.9 MiB] 100% Done                                    
Operation completed over 8 objects/23.9 MiB.                                     


*Cell 24: List the contents of the copy of your sandbox that is now stored in the bucket*

In [16]:
! gsutil ls {WS_BUCKET}/sandbox

gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/mother.bai
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/mother.bam
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/motherHCdebug.bai
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/motherHCdebug.bam
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/motherHCdebug.vcf
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/motherHCdebug.vcf.idx
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/mother_variants.200k.vcf.gz
gs://fc-a955f70a-e7db-4f26-a86c-2f00bf178e19/sandbox/mother_variants.200k.vcf.gz.tbi


This completes the "getting started" portion of this notebook.

----

# Visualizing Genomic Data in an Embedded IGV Window
In this section, we embed IGV windows in the notebook in order to visualize genomic data without leaving the notebook environment.

## Setting Up the Embedded IGV Browser
First we need to import the `ipyigv` package and initialize a browser window.

*Cell 25: Import the `ipyigv` package*

In [17]:
import ipyigv as igv
from ipywidgets.widgets.trait_types import InstanceDict
from ipyigv.options import ReferenceGenome, Track
from ipywidgets import Output

*Cell 26: Initialize the browser instance with a genome reference*

In [18]:
genomeDict = igv.PUBLIC_GENOMES.hg19
genome = ReferenceGenome(**genomeDict)
browser = igv.IgvBrowser(genome=genome)

*Cell 27: Display the browser window*

In [19]:
browser

IgvBrowser(genome=ReferenceGenome(cytobandURL='https://s3.dualstack.us-east-1.amazonaws.com/igv.broadinstitute…

## Adding Data to the IGV Browser
Now we can add data by pointing to files in a GCS bucket.

*Cells 28 and 29: Define data tracks for two BAM files (whole genome and exome versions of the mother sample)*

In [20]:
wgs_track = {
  'name': 'Mother WGS',
  'format': 'bam',
  'url': GERM_DATA + '/bams/mother.bam',
  'indexURL': GERM_DATA + '/bams/mother.bai',
  'height': 200
}
browser.add_track(Track(**wgs_track))

In [21]:
exome_track = {
  'name': 'Mother Exome',
  'format': 'bam',
  'url': GERM_DATA + '/bams/motherNEX.bam',
  'indexURL': GERM_DATA + '/bams/motherNEX.bai',
  'height': 200
}
browser.add_track(Track(**exome_track))
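
The two track definitions differ only in their name and file paths, so they could be produced by a small helper. The sketch below assumes that each index sits next to its BAM with the `.bam` suffix replaced by `.bai`, which matches the file listing shown earlier but is not a general rule. `make_bam_track` is a hypothetical name; the dictionary keys are the ones used in this notebook, including the `oauthToken` key that appears later for private files.

```python
def make_bam_track(name, bam_url, index_url=None, height=200, oauth_token=None):
    """Build a track dictionary for a BAM file.

    If `index_url` is omitted, it is derived by swapping the `.bam`
    suffix for `.bai` (an assumption that fits this dataset).
    """
    if index_url is None:
        if not bam_url.endswith(".bam"):
            raise ValueError(f"cannot derive an index URL from {bam_url!r}")
        index_url = bam_url[: -len(".bam")] + ".bai"
    track = {
        "name": name,
        "format": "bam",
        "url": bam_url,
        "indexURL": index_url,
        "height": height,
    }
    if oauth_token:
        # Only needed for files in private buckets.
        track["oauthToken"] = oauth_token
    return track


GERM_DATA = "gs://genomics-in-the-cloud/v1/data/germline"
exome_track = make_bam_track("Mother Exome", GERM_DATA + "/bams/motherNEX.bam")
print(exome_track["indexURL"])
```

The returned dictionary can be passed on exactly as in the cells above, with `browser.add_track(Track(**exome_track))`.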

*Extra cells: zoom to region of interest*

In [22]:
browser.search('chr20:10,025,584-10,036,143')

Search completed. Check the widget instance for results.


## Setting Up an Access Token to View Private Data
IGV needs an access token to retrieve data from private buckets (including the workspace bucket).

*Cell 30: Generate an access token and save it to a file*

In [23]:
!gcloud auth print-access-token > token.txt

**Important note:** As long as this file is saved only to your notebook’s local storage, it is secure because your cloud environment is strictly personal to you and cannot be accessed by others, even if you share your workspace or your notebook with them. But don’t save this file to your workspace bucket! Saving it to the bucket would make it visible to anyone with whom you share the workspace.
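
One way to guard against uploading the token by accident is to check a directory for it before copying that directory to the bucket. The sketch below is only a suggestion, assuming the token is stored under the file name `token.txt` as in this notebook; `find_token_files` is a hypothetical helper.

```python
from pathlib import Path


def find_token_files(directory, token_name="token.txt"):
    """Return the paths of any token files found under `directory`.

    Searches recursively, since `gsutil cp -r` also copies subdirectories.
    """
    return sorted(str(p) for p in Path(directory).rglob(token_name))


# Possible use before running `gsutil -m cp -r sandbox {WS_BUCKET}`:
# leaked = find_token_files("sandbox")
# if leaked:
#     raise RuntimeError(f"remove these files before uploading: {leaked}")
```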

*Cell 31: Read the contents of the token file into a Python variable*

In [24]:
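# Read the token, closing the file afterwards and stripping the trailing
# newline left by the shell redirect so it is not sent as part of the token
with open("token.txt", "r") as token_file:
    token = token_file.readline().strip()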
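
An alternative that avoids writing the token to disk is to capture the output of the same `gcloud auth print-access-token` command directly in Python. This is not how the book's notebook does it; the sketch below is one possible variation, and the `run` parameter exists only so the function can be exercised without `gcloud` installed.

```python
import subprocess


def fetch_access_token(run=subprocess.run):
    """Return an access token by calling `gcloud auth print-access-token`.

    The token is kept in memory only. `run` defaults to subprocess.run;
    a stand-in can be supplied for testing.
    """
    result = run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gcloud fails
    )
    # gcloud prints the token followed by a newline.
    return result.stdout.strip()


# In the notebook:
# token = fetch_access_token()
```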

*Cell 32: Include the token in the track definition of any private files*

In [25]:
private_track = {
  'name': 'Workspace bucket copy of Mother WGS',
  'format': 'bam',
  'url': WS_BUCKET + '/sandbox/mother.bam',
  'indexURL': WS_BUCKET + '/sandbox/mother.bai',
  'height': 200,
  'oauthToken': token
}

browser.add_track(Track(**private_track))

This concludes the section on visualizing genomic data.

----

# Running GATK Commands to Learn, Test, or Troubleshoot
Now let's look at how we can run GATK commands inside the notebook.

## Running a Basic GATK Command: HaplotypeCaller
First we run a simple command like we did in Chapter 5, except we're running directly on the files located in GCS instead of localizing them first.

*Cell 33: Run HaplotypeCaller on files in GCS*

In [26]:
! gatk HaplotypeCaller \
-R {GERM_DATA}/ref/ref.fasta \
-I {GERM_DATA}/bams/mother.bam \
-O sandbox/mother_variants.200k.vcf.gz \
-L 20:10,000,000-10,200,000

Using GATK jar /etc/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar HaplotypeCaller -R gs://genomics-in-the-cloud/v1/data/germline/ref/ref.fasta -I gs://genomics-in-the-cloud/v1/data/germline/bams/mother.bam -O sandbox/mother_variants.200k.vcf.gz -L 20:10,000,000-10,200,000
05:42:10.806 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
05:42:11.206 INFO  HaplotypeCaller - ------------------------------------------------------------
05:42:11.207 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.3.0
05:42:11.207 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
05:42:11.211 INFO  HaplotypeCa

*Cell 34: Verify that the output file is in the sandbox*

In [27]:
! ls sandbox

mother.bai	   motherHCdebug.bam	  mother_variants.200k.vcf.gz
mother.bam	   motherHCdebug.vcf	  mother_variants.200k.vcf.gz.tbi
motherHCdebug.bai  motherHCdebug.vcf.idx


**Note:** GATK can read `gs://` inputs this way from any machine with an internet connection. We could even write the output directly to a GCS bucket if we wanted to; the output filepath just has to start with a valid `gs://` bucket address.
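
To make the local-versus-bucket choice concrete, the sketch below assembles the same HaplotypeCaller arguments in Python and changes only the `-O` value. It builds the command without running it. The function name is hypothetical, `gs://my-example-bucket` is a placeholder rather than a real bucket, and only flags already used in this notebook appear.

```python
import shlex


def haplotypecaller_args(reference, input_bam, output, interval, bamout=None):
    """Return the argument list for a `gatk HaplotypeCaller` invocation."""
    args = [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", input_bam,
        "-O", output,
        "-L", interval,
    ]
    if bamout is not None:
        args += ["-bamout", bamout]
    return args


GERM_DATA = "gs://genomics-in-the-cloud/v1/data/germline"
common = dict(
    reference=GERM_DATA + "/ref/ref.fasta",
    input_bam=GERM_DATA + "/bams/mother.bam",
    interval="20:10,000,000-10,200,000",
)

# Output written to the notebook's local storage, as in the cell above.
local_cmd = haplotypecaller_args(output="sandbox/mother_variants.200k.vcf.gz", **common)

# Output written straight to a bucket (placeholder bucket name).
bucket_cmd = haplotypecaller_args(
    output="gs://my-example-bucket/sandbox/mother_variants.200k.vcf.gz", **common
)

print(shlex.join(local_cmd))
print(shlex.join(bucket_cmd))
```

Either list could then be executed with `subprocess.run(local_cmd, check=True)` in an environment where `gatk` is on the path.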

## Loading the Data (BAM and VCF) into IGV
Now we do a simple visual check of the result.

*Cell 35: Initialize a new IGV window*

In [28]:
second_browser = igv.IgvBrowser(genome=genome)

second_browser

IgvBrowser(genome=ReferenceGenome(cytobandURL='https://s3.dualstack.us-east-1.amazonaws.com/igv.broadinstitute…

*Cell 36: Load the variant calls produced by the HaplotypeCaller above*

*Adding `'color': "#000000"` as a workaround to [this issue](https://github.com/QuantStack/ipyigv/issues/21).*

In [29]:
var_track = {
  'name': 'Mother variants',
  'format': 'vcf',
  'url': 'files/sandbox/mother_variants.200k.vcf.gz',
  'indexURL': 'files/sandbox/mother_variants.200k.vcf.gz.tbi',
  'color': "#000000"
}
second_browser.add_track(Track(**var_track))

In [30]:
second_browser.search('chr20:10,002,000-10,003,000')

Search completed. Check the widget instance for results.
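
The GATK commands in this notebook name the contig `20`, whereas the IGV search strings use `chr20`. When the same region is needed in both places, a small converter keeps the two in sync. The sketch below is illustrative only: the function names are hypothetical, and adding or stripping the `chr` prefix is appropriate only when the two references differ in exactly that way.

```python
import re

# Accepts forms such as "20:10,002,000-10,003,000" or "chr20:10002000-10003000".
_LOCUS = re.compile(r"^(?:chr)?(?P<contig>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_locus(text):
    """Split a locus string into (contig without 'chr', start, end)."""
    match = _LOCUS.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized locus: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return match["contig"], start, end


def to_igv_locus(text):
    """Format a locus for `browser.search`, with 'chr' prefix and commas."""
    contig, start, end = parse_locus(text)
    return f"chr{contig}:{start:,}-{end:,}"


def to_gatk_interval(text):
    """Format a locus for GATK's -L argument, without the 'chr' prefix."""
    contig, start, end = parse_locus(text)
    return f"{contig}:{start}-{end}"


print(to_igv_locus("20:10,002,000-10,003,000"))
print(to_gatk_interval("chr20:10,025,584-10,036,143"))
```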


*Cell 37: Load the original BAM file on which you ran HaplotypeCaller*

In [31]:
wgs_track = {
  'name': 'Mother WGS',
  'format': 'bam',
  'url': GERM_DATA + '/bams/mother.bam',
  'indexURL': GERM_DATA + '/bams/mother.bai',
  'height': 200
}
second_browser.add_track(Track(**wgs_track))

## Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
Something looks odd in one of the variant calls, so we do some systematic troubleshooting.

*Cell 38: Run HaplotypeCaller on the problem region to produce an output BAM, the `bamout`*

In [32]:
! gatk HaplotypeCaller \
-R {GERM_DATA}/ref/ref.fasta \
-I {GERM_DATA}/bams/mother.bam \
-O sandbox/motherHCdebug.vcf \
-bamout sandbox/motherHCdebug.bam \
-L 20:10,002,000-10,003,000

Using GATK jar /etc/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar HaplotypeCaller -R gs://genomics-in-the-cloud/v1/data/germline/ref/ref.fasta -I gs://genomics-in-the-cloud/v1/data/germline/bams/mother.bam -O sandbox/motherHCdebug.vcf -bamout sandbox/motherHCdebug.bam -L 20:10,002,000-10,003,000
05:43:39.773 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
05:43:40.123 INFO  HaplotypeCaller - ------------------------------------------------------------
05:43:40.124 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.3.0
05:43:40.124 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
05:43:

*Cell 39: Load the `bamout` file into the IGV window*

In [33]:
bamout_track = {
"name": "Mother HC bamout",
"url": "files/sandbox/motherHCdebug.bam",
"indexURL": "files/sandbox/motherHCdebug.bai",
"height": 500,
"format": "bam"
}
second_browser.add_track(Track(**bamout_track))

This concludes the GATK variant calling section of this notebook. 

----