**Content summary**  
This notebook provides a short introduction to basic Jupyter Notebook functionality and illustrates some options for working with genomic data in cloud storage. It is based on [source code](https://github.com/broadinstitute/genomics-in-the-cloud/tree/main/notebooks) provided with the [Genomics in the Cloud](https://oreil.ly/genomics-cloud) book (Van der Auwera & O'Connor, O'Reilly 2020).


**Environment configuration**   
This notebook requires a custom [Terra](https://app.terra.bio/) Cloud Environment image provided as the container `gcr.io/broad-dsde-outreach/terra-base:ipyigv1`, complemented by a startup script (`gs://genomics-in-the-cloud/v1/scripts/install_GATK_4130_with_igv.sh`) that installs GATK version 4.1.3.0.

You must customize your environment using the Cloud Environment configuration panel to match this notebook's requirements; SOME COMMANDS WILL NOT WORK IF YOU DO NOT DO THIS. 

- In the configuration panel, set the `Application Configuration` to `Custom Environment` (all the way at the bottom of the menu) and paste the container address given above into the `Container image` field. 
- Then (still in the config panel), in the `Cloud compute profile` box, paste the startup script link given above into the `Startup Script` field. 

Refer to [Terra documentation on customizing your environment](https://support.terra.bio/hc/en-us/articles/360038125912) to learn more about environment customization options.

**Kernel**  
By default this notebook opens on a Python 3 kernel. When you have the notebook running in EDIT mode, the upper right corner of the notebook (under the Notebook Runtime widget) should display the label `Python3`. 

----

# Getting started with Jupyter in Terra

In this section, we run through some exercises to familiarize you with the basic usage of Jupyter notebooks in the Terra environment.


## Run the Hello World cells
We start with some simple Hello World examples, first in Python, then with a command-line tool call.

Run the basic Hello World in Python

In [None]:
print("Hello World!")

Run the command-line tool `echo` using `!`

In [None]:
! echo "Hello World!"

## Interact with local storage

List contents of local storage (persistent disk)

In [None]:
! ls .

Make a sandbox directory to keep project files organized

In [None]:
! mkdir -p sandbox/
! ls
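
The same directory setup can also be done from Python instead of the shell. This is a minimal sketch using the standard-library `pathlib` module; the helper name `make_sandbox` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def make_sandbox(parent=".", name="sandbox"):
    """Create the sandbox directory if it does not exist and return its path."""
    sandbox = Path(parent) / name
    # parents=True and exist_ok=True together behave like `mkdir -p`
    sandbox.mkdir(parents=True, exist_ok=True)
    return sandbox


sandbox_dir = make_sandbox()
print(sandbox_dir)
```

Calling the function a second time is harmless, which makes the cell safe to re-run.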

## Access data in cloud storage buckets 

List the contents of a public cloud storage bucket called `genomics-in-the-cloud`

In [None]:
! gsutil ls gs://genomics-in-the-cloud/

Copy a file from the bucket to the sandbox (on persistent disk)

In [None]:
! gsutil cp gs://genomics-in-the-cloud/hello.txt sandbox/

Read the contents of the locally-stored text file

In [None]:
! cat sandbox/hello.txt

## Save local files to the workspace's storage bucket

Import the `os` package, look up the value of the `WORKSPACE_BUCKET` environment variable (set by Terra at the kernel level) and store it in a Python variable for easy access

In [None]:
import os
WS_BUCKET = os.environ['WORKSPACE_BUCKET']
print(WS_BUCKET)
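
If the notebook is run outside Terra, the direct lookup above raises a bare `KeyError`. The following is a sketch of a more defensive lookup, assuming the variable always holds a `gs://` URL when it is set; the function name `get_workspace_bucket` and the bucket name in the comment are hypothetical.

```python
import os


def get_workspace_bucket(env=None):
    """Return the workspace bucket URL, or fail with an explanatory message."""
    env = os.environ if env is None else env
    bucket = env.get("WORKSPACE_BUCKET", "")
    if not bucket.startswith("gs://"):
        raise RuntimeError(
            "WORKSPACE_BUCKET is not set to a gs:// URL; "
            "is this notebook running in a Terra Cloud Environment?"
        )
    # Drop any trailing slash so that paths can be appended with '/'
    return bucket.rstrip("/")


# Example with a made-up bucket name, passed in explicitly for illustration
print(get_workspace_bucket({"WORKSPACE_BUCKET": "gs://fc-example-bucket/"}))
```

Inside Terra, calling `get_workspace_bucket()` with no argument reads the real environment.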

Back up the sandbox directory from the persistent disk to the workspace bucket 

In [None]:
! gsutil cp -r sandbox {WS_BUCKET}

Verify that it worked as expected

In [None]:
! gsutil ls -r {WS_BUCKET}

## Set up variables pointing to genomic data in the bucket
We're going to want to access the data in the bucket multiple times, so we define variables to avoid hardcoding and repeating file paths.

Create Python variables

In [None]:
BAMS = "gs://genomics-in-the-cloud/v1/data/germline/bams"
REF = "gs://genomics-in-the-cloud/v1/data/germline/ref"

Use the variables to list the bucket contents and verify that they work as expected

In [None]:
! gsutil ls {BAMS}

In [None]:
! gsutil ls {REF}
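
Individual file paths can then be derived from these base variables rather than typed out in full. Below is a small sketch using the standard-library `posixpath` module, which joins components with forward slashes on every platform; the helper name `gs_path` is hypothetical.

```python
import posixpath

BAMS = "gs://genomics-in-the-cloud/v1/data/germline/bams"
REF = "gs://genomics-in-the-cloud/v1/data/germline/ref"


def gs_path(base, *parts):
    """Join path components onto a gs:// base URL."""
    # Components must not start with '/', or posixpath.join discards the base
    return posixpath.join(base, *parts)


mother_bam = gs_path(BAMS, "mother.bam")
ref_fasta = gs_path(REF, "ref.fasta")
print(mother_bam)
print(ref_fasta)
```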

This completes the "getting started" portion of this notebook.

----

# Visualizing genomic data in an embedded IGV window
In this section, we embed IGV windows in the notebook in order to visualize genomic data without leaving the notebook environment.

## Set up the embedded IGV browser
First we need to import the `ipyigv` package and initialize a browser window.

In [None]:
import ipyigv as igv
from ipywidgets.widgets.trait_types import InstanceDict
from ipyigv.options import ReferenceGenome, Track
from ipywidgets import Output

Initialize the browser instance with a genome reference

In [None]:
genomeDict = igv.PUBLIC_GENOMES.hg19
genome = ReferenceGenome(**genomeDict)
browser = igv.IgvBrowser(genome=genome)

Display the browser window

In [None]:
browser

## Add data to the IGV browser
Now we can add data by pointing to files in a bucket.

Define data tracks for two BAM files (whole genome and exome versions of the mother sample)

In [None]:
wgs_track = {
  'name': 'Mother WGS',
  'format': 'bam',
  'url': BAMS + '/mother.bam',
  'indexURL': BAMS + '/mother.bai',
  'height': 200
}
browser.add_track(Track(**wgs_track))

In [None]:
exome_track = {
  'name': 'Mother Exome',
  'format': 'bam',
  'url': BAMS + '/motherNEX.bam',
  'indexURL': BAMS + '/motherNEX.bai',
  'height': 200
}
browser.add_track(Track(**exome_track))

Zoom in to region of interest

In [None]:
browser.search('chr20:10,025,584-10,036,143')
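
The two track definitions above differ only in their name and file stem, so they can be generated by a small function. This is a sketch only: `bam_track` is a hypothetical helper, and it assumes the index sits next to the BAM file with a `.bai` extension in place of `.bam`, as is the case for the files used here.

```python
def bam_track(name, base_url, stem, height=200):
    """Build a track dictionary for a BAM file and its .bai index."""
    return {
        "name": name,
        "format": "bam",
        "url": f"{base_url}/{stem}.bam",
        "indexURL": f"{base_url}/{stem}.bai",
        "height": height,
    }


BAMS = "gs://genomics-in-the-cloud/v1/data/germline/bams"
wgs_track = bam_track("Mother WGS", BAMS, "mother")
exome_track = bam_track("Mother Exome", BAMS, "motherNEX")
print(wgs_track["url"])
print(exome_track["indexURL"])
```

Each dictionary can be passed to `Track(**...)` and `browser.add_track(...)` exactly as in the cells above.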

## Set up an access token to view private data
IGV needs an access token to retrieve data from private buckets (including your workspace's own bucket).

Generate an access token and save it to a file, then read it into a variable

In [None]:
! gcloud auth print-access-token > token.txt

# Read the token, closing the file afterwards and dropping the trailing newline
with open("token.txt", "r") as token_file:
    token = token_file.readline().strip()

**Important note:** As long as this file is saved only to your notebook’s local storage, it is secure because your cloud environment is strictly personal to you and cannot be accessed by others, even if you share your workspace or your notebook with them. But don’t save this file to your workspace bucket! Saving it to the bucket would make it visible to anyone with whom you share the workspace.
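
One way to sidestep the question of where the token file lives is to never write it to disk. The sketch below captures the output of the same `gcloud auth print-access-token` command directly into a Python string; the function name `fetch_access_token` and its `run` parameter (which exists only so the command runner can be swapped out) are hypothetical.

```python
import subprocess


def fetch_access_token(run=subprocess.run):
    """Return a gcloud access token as a string without writing it to disk."""
    result = run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True,
        text=True,
        check=True,
    )
    # The command output ends with a newline that is not part of the token
    return result.stdout.strip()
```

In the notebook, `token = fetch_access_token()` would replace the two-step file approach. Access tokens expire, so calling the function again is a simple way to refresh one.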

Copy a BAM file and its index to the workspace bucket

In [None]:
! gsutil cp {BAMS}/mother.ba* {WS_BUCKET}/bams
! gsutil ls {WS_BUCKET}/bams

Include the token in the track definition of any private files

In [None]:
private_track = {
  'name': 'Workspace bucket copy of Mother WGS',
  'format': 'bam',
  'url': WS_BUCKET + '/bams/mother.bam',
  'indexURL': WS_BUCKET + '/bams/mother.bai',
  'height': 200,
  'oauthToken': token
}

browser.add_track(Track(**private_track))

This concludes the section on visualizing genomic data.

----

# Running GATK commands to learn, test, or troubleshoot
Now let's look at how we can run GATK commands inside the notebook.

## Running a basic GATK command: HaplotypeCaller
First we run a simple command. Note that we can run GATK directly on the files located in cloud storage; there is no need to copy them to local storage first.

Run HaplotypeCaller on files in cloud storage

In [None]:
! gatk HaplotypeCaller \
-R {REF}/ref.fasta \
-I {BAMS}/mother.bam \
-O sandbox/mother_variants.200k.vcf.gz \
-L 20:10,000,000-10,200,000

Verify that the output file is in the sandbox

In [None]:
! ls sandbox

**Note:** GATK can read files in cloud storage this way from anywhere with an internet connection! We could even write the output directly to a bucket if we wanted to; the output file path just has to start with a valid `gs://` bucket address.
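
When the same command needs to be run with different inputs, it can help to assemble it in Python first. The following is a sketch, not a GATK API: `haplotypecaller_cmd` is a hypothetical helper that only builds the argument list from the `-R`, `-I`, `-O` and `-L` arguments used above.

```python
import shlex

REF = "gs://genomics-in-the-cloud/v1/data/germline/ref"
BAMS = "gs://genomics-in-the-cloud/v1/data/germline/bams"


def haplotypecaller_cmd(reference, bam, output, interval):
    """Assemble a `gatk HaplotypeCaller` call as a list of arguments."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", output,
        "-L", interval,
    ]


cmd = haplotypecaller_cmd(
    f"{REF}/ref.fasta",
    f"{BAMS}/mother.bam",
    "sandbox/mother_variants.200k.vcf.gz",
    "20:10,000,000-10,200,000",
)
# Render the list as a single shell-safe string for inspection
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)
```

The list can be executed with `subprocess.run(cmd, check=True)`. To write the output to a bucket instead, pass a `gs://` path as the `output` argument.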

## Loading the data (BAM and VCF) into IGV
Now we do a simple visual check of the result.

Initialize a new IGV window

In [None]:
second_browser = igv.IgvBrowser(genome=genome)

second_browser

Load the variant calls produced by the HaplotypeCaller above

*Adding `'color': "#000000"` as a workaround to [this issue](https://github.com/QuantStack/ipyigv/issues/21).*

In [None]:
var_track = {
  'name': 'Mother variants',
  'format': 'vcf',
  'url': 'files/sandbox/mother_variants.200k.vcf.gz',
  'indexURL': 'files/sandbox/mother_variants.200k.vcf.gz.tbi',
  'color': "#000000"
}
second_browser.add_track(Track(**var_track))

In [None]:
second_browser.search('chr20:10,002,000-10,003,000')
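
Note that the GATK commands in this notebook name the contig `20`, while the hg19 browser searches use `chr20`. Below is a small sketch of a converter between the two notations, assuming the only difference is the `chr` prefix (this does not hold for every contig, for example the mitochondrial one); the function name is hypothetical.

```python
def gatk_interval_to_igv_locus(interval, prefix="chr"):
    """Convert an interval such as '20:10,002,000-10,003,000' to an IGV locus."""
    contig, _, coords = interval.partition(":")
    # Add the prefix only if the contig name does not already carry it
    if not contig.startswith(prefix):
        contig = prefix + contig
    return f"{contig}:{coords}" if coords else contig


locus = gatk_interval_to_igv_locus("20:10,002,000-10,003,000")
print(locus)
```

The result can be passed to `second_browser.search(locus)`.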

Load the original BAM file on which we ran HaplotypeCaller

In [None]:
wgs_track = {
  'name': 'Mother WGS',
  'format': 'bam',
  'url': BAMS + '/mother.bam',
  'indexURL': BAMS + '/mother.bai',
  'height': 200
}
second_browser.add_track(Track(**wgs_track))

## Troubleshooting a questionable variant call in the embedded IGV browser
Something looks odd so we do some systematic troubleshooting...

Run HaplotypeCaller on the problem region to produce an output BAM, the `bamout`

In [None]:
! gatk HaplotypeCaller \
-R {REF}/ref.fasta \
-I {BAMS}/mother.bam \
-O sandbox/motherHCdebug.vcf \
-bamout sandbox/motherHCdebug.bam \
-L 20:10,002,000-10,003,000

Load the `bamout` file into the IGV window

In [None]:
bamout_track = {
  "name": "Mother HC bamout",
  "url": "files/sandbox/motherHCdebug.bam",
  "indexURL": "files/sandbox/motherHCdebug.bai",
  "height": 500,
  "format": "bam"
}
second_browser.add_track(Track(**bamout_track))

This concludes the GATK variant calling section of this notebook. 

----