<div style="padding-bottom:30px">
<a href="https://github.com/iqtk/iqtk"><img src="https://raw.githubusercontent.com/iqtk/iqtk/master/inquiry/docs/assets/logotype_blue_small.png" style="width:100px; margin-left:0px"></img></a>
<p style="color:#9E9E9E">
<a href="https://github.com/cwbeitel/inquiry/tree/master/docs">Getting Started Guide</a> // <a href="https://goo.gl/forms/2cOmuUrQ3n3CKpim1">Documentation Feedback</a></p>
</div>

<h1 style="color:#9E9E9E">Genotype analysis with GATK</h1>

In this analysis we use the Broad Institute GATK toolset to make genotype calls for a sample of interest. Read more about [GATK](https://software.broadinstitute.org/gatk/) or [genotyping in general](https://en.wikipedia.org/wiki/Genotyping).

To review: we'll run workflows generated by the Inquiry Toolkit directly from the Cloud Dataflow UI, so once you've completed the [getting started documentation]() you'll be ready for this tutorial.

Our data processing pipeline will deliver its results to a BigQuery table. We'll explore the data from the BigQuery UI and also pull a subset into this notebook for subsequent analysis and visualization.

<h2 style="color:#9E9E9E">Configuration and Run</h2>

First we need to parameterize the analysis. The parameters file should have the following structure: a path to the reference FASTA and a list of read-file pairs.

```json
{
  "ref_fasta": "gs://cflow-public/data/genomes/Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/BWAIndex/genome.fa",
  "reads": [
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794487_C2_R2_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794487_C2_R2_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794488_C2_R3_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794488_C2_R3_2_small.fq"]
  ]
}
```
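Before submitting, it can help to sanity-check the parameters file locally. The following is a minimal sketch, not part of the Inquiry Toolkit: the function name `validate_params` is hypothetical, and the rules it enforces (every path starts with `gs://`, every entry in `reads` is a two-element pair) are assumptions inferred from the example above rather than documented requirements.

```python
import json


def validate_params(params):
    """Return a list of problems found in a parameters dict (empty if none).

    Hypothetical helper: the checks are assumptions inferred from the
    example parameters file, not documented toolkit requirements.
    """
    problems = []

    # Assumption: the reference must live in Google Cloud Storage.
    ref = params.get("ref_fasta")
    if not isinstance(ref, str) or not ref.startswith("gs://"):
        problems.append("ref_fasta must be a gs:// path")

    reads = params.get("reads")
    if not isinstance(reads, list) or not reads:
        problems.append("reads must be a non-empty list of read pairs")
        return problems

    # Assumption: each entry is a [forward, reverse] pair of read files.
    for i, pair in enumerate(reads):
        if not isinstance(pair, list) or len(pair) != 2:
            problems.append("reads[%d] must be a [forward, reverse] pair" % i)
            continue
        for path in pair:
            if not isinstance(path, str) or not path.startswith("gs://"):
                problems.append(
                    "reads[%d] contains a non-gs:// path: %r" % (i, path))
    return problems


# A shortened version of the parameters shown above.
example = json.loads("""
{
  "ref_fasta": "gs://cflow-public/data/genomes/Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/BWAIndex/genome.fa",
  "reads": [
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_2_small.fq"]
  ]
}
""")

print(validate_params(example))
```

An empty list means no problems were found under these assumed rules; it does not guarantee that the files exist or that the job will run.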

(Dataflow submission UI screenshot)

Once our job is submitted we'll be able to check its status using the Cloud Dataflow UI as described in the [getting started documentation](). For illustration, the running workflow should look like the following:

(Dataflow job running screenshot)

When the workflow is complete you can obtain the path in Google Cloud Storage to the resulting files as described in the [getting started documentation]().

<h2 style="color:#9E9E9E">Exploring the data</h2>

Next we'll query some of the data produced by the above workflow and take a first look at it.

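```python
import google.datalab.bigquery as bq

# Fetch the first three rows of the variants table produced by the workflow.
query = bq.Query('SELECT * FROM `jbei-cloud.demonstration.variants3` LIMIT 3')
# Bypass the query cache so the results reflect the current table contents.
output_options = bq.QueryOutput.table(use_cache=False)
result = query.execute(output_options=output_options).result()
result.to_dataframe()
```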

| Unnamed: 0 | id | refname | start | end | refbases | altbases | quality | filter | info | format |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | . | 2L | 18327 | | T | `[<, *, >]` | | . | `DP=17;I16=15,2,0,0,680,27200,0,0,423,11439,0,0...` | PL |
| 1 | . | 2L | 10936 | | T | `[<, *, >]` | | . | `DP=32;I16=20,12,0,0,1276,50896,0,0,1920,115200...` | PL |
| 2 | . | 2L | 9858 | | T | `[<, *, >]` | | . | `DP=30;I16=27,0,0,0,1200,57600,0,0,1620,97200,0...` | PL |
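The `info` column packs several annotations into one semicolon-separated string, in the style of the VCF INFO field, so a small parser is useful before any further analysis. The sketch below is one possible approach, not code from the workflow: `parse_info` and `read_depth` are hypothetical helper names, and the sample string is made up for illustration because the values displayed above are truncated.

```python
def parse_info(info):
    """Split a VCF-style INFO string into a dict.

    "KEY=value" becomes a string, "KEY=a,b" becomes a list of strings,
    and a bare "KEY" (a flag) becomes True.
    """
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate empty segments such as a trailing ";"
        key, sep, value = item.partition("=")
        if not sep:
            fields[key] = True
        elif "," in value:
            fields[key] = value.split(",")
        else:
            fields[key] = value
    return fields


def read_depth(info):
    """Return the DP (depth) annotation as an int, or None if absent."""
    dp = parse_info(info).get("DP")
    return int(dp) if isinstance(dp, str) else None


# Illustrative string only; not a row from the table above.
sample = "DP=17;INDEL;QS=0.75,0.25"
print(parse_info(sample))
print(read_depth(sample))
```

Assuming the query result has been stored in a pandas DataFrame named `df` (a hypothetical name), something like `df["info"].apply(read_depth)` would yield a depth column suitable for filtering or plotting.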


<h3 style="color:#9E9E9E">References</h3>

1. McKenna, A., et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.
2. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.

<h3 style="color:#9E9E9E">Contact</h3>

Want to get in touch? You can [provide feedback](https://goo.gl/forms/2cOmuUrQ3n3CKpim1) regarding this or other documentation,
[reach out to us](https://goo.gl/forms/j8FWdNJqABAoJvcW2) regarding collaboration, or [request a new feature or analytical capability](https://goo.gl/forms/dQm3SDcoNZsV7AAd2). We're looking forward to hearing from you!

<div style="padding-top: 30px">
<p style="color:#9E9E9E; text-align:center">This notebook was prepared for <a href="https://github.com/cwbeitel/inquiry">Project Inquiry</a> in support of the research mission of the Joint BioEnergy Institute (JBEI). Learn more at https://www.jbei.org/.</p>
<p style="color:#9E9E9E; text-align:center">The Joint BioEnergy Institute is a program of the U.S. Department of Energy Office of Science.</p>
<p style="color:#9E9E9E; text-align:center">© Regents of the University of California, 2017. Licensed under a BSD-3 <a href="https://github.com/cwbeitel/inquiry/blob/master/LICENSE">license</a>.</p>
<img src="https://raw.githubusercontent.com/iqtk/iqtk/master/inquiry/docs/assets/logotype_blue_small.png" style="width:100px"></img>
</div>