<div style="padding-bottom:30px">
<a href="https://github.com/cwbeitel/inquiry"><img src="https://raw.githubusercontent.com/cwbeitel/iqassets/master/logotype_blue_small.png" style="width:100px; margin-left:0px"></img></a>
<p style="color:#9E9E9E">
<a href="https://github.com/cwbeitel/inquiry/tree/master/docs">Getting Started Guide</a> // <a href="https://goo.gl/forms/2cOmuUrQ3n3CKpim1">Documentation Feedback</a></p>
</div>

<h1 style="color:#9E9E9E">Genotype analysis with GATK</h1>

In this analysis we use the [GATK](https://software.broadinstitute.org/gatk/) toolset to make genotype calls for a sample of interest. Concepts you might want to brush up on include [SNV genotyping](https://en.wikipedia.org/wiki/SNV_calling_from_NGS_data), [indels](https://en.wikipedia.org/wiki/Indel), and perhaps [genotyping in general](https://en.wikipedia.org/wiki/Genotyping).

To review, we'll perform our analyses using the Inquiry analysis toolkit, which helps automate and abstract portions of the process. Read more about the toolkit and the project in general in the [Inquiry Toolkit Overview](), and check out the [Background Reading]() notebook for a review of the necessary concepts and technologies.

<h2 style="color:#9E9E9E">Configuration and Run</h2>

The first thing we need to do is parameterize our analysis. See the [Getting Started Guide]() for a review of the different ways workflows can be parameterized and run. We'll use the following configuration:

```json
{
  "dry_run": true,
  "debug": true,
  "cloud": true,
  "_meta": {
    "workflow": "core:genotype"
  },
  "ref_fasta": "gs://cflow-public/data/genomes/Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/BWAIndex/genome.fa",
  "reads": [
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794487_C2_R2_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794487_C2_R2_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794488_C2_R3_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794488_C2_R3_2_small.fq"]
  ]
}
```
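
Before submitting a run, it can help to sanity-check the configuration and write it to disk programmatically. The following is a minimal sketch, not part of the Inquiry toolkit: `check_config` and `save_config` are hypothetical helpers, and the rules they enforce (a non-empty `ref_fasta` string and `reads` given as pairs of paths) are assumptions inferred from the example above rather than documented requirements. The config is abbreviated to a single read pair.

```python
import json

# The configuration shown above, abbreviated to a single read pair.
config = {
    "dry_run": True,
    "debug": True,
    "cloud": True,
    "_meta": {"workflow": "core:genotype"},
    "ref_fasta": "gs://cflow-public/data/genomes/Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/BWAIndex/genome.fa",
    "reads": [
        ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_1_small.fq",
         "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_2_small.fq"],
    ],
}


def check_config(cfg):
    """Return a list of problems found; the checks are illustrative assumptions."""
    problems = []
    if not isinstance(cfg.get("ref_fasta"), str) or not cfg["ref_fasta"]:
        problems.append("ref_fasta should be a non-empty string")
    reads = cfg.get("reads")
    if not isinstance(reads, list) or not reads:
        problems.append("reads should be a non-empty list of read pairs")
    else:
        for i, pair in enumerate(reads):
            if not (isinstance(pair, list) and len(pair) == 2
                    and all(isinstance(p, str) for p in pair)):
                problems.append("reads[%d] should be a pair of file paths" % i)
    return problems


def save_config(cfg, path):
    """Write the config as JSON, refusing to write if any check fails."""
    problems = check_config(cfg)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w") as handle:
        json.dump(cfg, handle, indent=2)


if __name__ == "__main__":
    save_config(config, "config.yaml")
```

Writing JSON into a file named `config.yaml` is deliberate here: JSON documents are generally also valid YAML, so the same file can be read by either kind of parser.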

With this configuration saved to a file on our local filesystem (named `config.yaml` here; JSON is also valid YAML), we can submit a genotyping run using the following command:

```bash
iqtk run genotyping config.yaml
```

<h2 style="color:#9E9E9E">Exploring the data</h2>

For a tutorial on analyzing variant data using BigQuery and SQL, check out [this notebook](https://github.com/googledatalab/notebooks/blob/master/samples/Exploring%20Genomics%20Data.ipynb) from the Google Genomics team. What follows is similar.

```python
import datalab.bigquery as bq

# First, we'll reference the BigQuery table to which our variant data was written.
variants = bq.Table('jbei-cloud:demonstration.variants3')

# We can query this object to see how many variant records are stored here.
variants.length
```

```
578
```

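Next we define a named query that selects records with a single-base reference allele and exactly one alternate allele. Because the query uses standard SQL, the table is referenced with dots (`project.dataset.table`) rather than the legacy `project:dataset.table` form.

```sql
%%bq query --name single_base
SELECT
  reference_name,
  start,
  reference_bases,
  alternate_bases
FROM
  `jbei-cloud.demonstration.variants3`
WHERE
  reference_bases IN ('A','C','G','T') AND
  ARRAY_LENGTH(alternate_bases) = 1
LIMIT 100
```

```python
%bq execute --query single_base
```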
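
To make the filter in the `single_base` query concrete, here is a small sketch that applies the same condition to in-memory records in plain Python. The records are made up for illustration and are not drawn from the variants table, and the sketch assumes `alternate_bases` is a list of strings.

```python
SINGLE_BASES = {"A", "C", "G", "T"}


def is_single_base_record(record):
    """Mirror the WHERE clause: one-base reference allele, exactly one alternate."""
    return (record["reference_bases"] in SINGLE_BASES
            and len(record["alternate_bases"]) == 1)


# Made-up records for illustration only.
records = [
    {"reference_name": "2L", "start": 100, "reference_bases": "A", "alternate_bases": ["G"]},
    {"reference_name": "2L", "start": 250, "reference_bases": "AT", "alternate_bases": ["A"]},
    {"reference_name": "3R", "start": 42, "reference_bases": "C", "alternate_bases": ["T", "G"]},
    {"reference_name": "X", "start": 7, "reference_bases": "G", "alternate_bases": ["GA"]},
]

# Keep at most 100 matches, as LIMIT 100 does in the query.
single_base = [r for r in records if is_single_base_record(r)][:100]

for r in single_base:
    print(r["reference_name"], r["start"], r["reference_bases"], r["alternate_bases"])
```

Note that the record at `X:7` passes even though its alternate allele is two bases long: the condition constrains the length of the reference allele and the number of alternates, not the length of each alternate. Restricting results to strict single-nucleotide variants would need an additional check on the alternate allele itself.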

Note that the `%%bq` and `%bq` cells above rely on Datalab's cell magics and do not work outside of Datalab.
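
Outside of Datalab, one option is to run the same query with the `google-cloud-bigquery` client library. The following is a hedged sketch rather than a tested replacement: it assumes the library is installed, that application default credentials are configured, and that you have read access to the table. `run_single_base_query` is a hypothetical helper name, and the table is written in the dotted `project.dataset.table` form that standard SQL expects.

```python
SINGLE_BASE_SQL = """
SELECT
  reference_name,
  start,
  reference_bases,
  alternate_bases
FROM
  `jbei-cloud.demonstration.variants3`
WHERE
  reference_bases IN ('A','C','G','T') AND
  ARRAY_LENGTH(alternate_bases) = 1
LIMIT 100
"""


def run_single_base_query(client=None):
    """Run the query and return the result rows as a list of dicts."""
    if client is None:
        # Imported lazily so this module loads even without the library installed.
        from google.cloud import bigquery
        client = bigquery.Client()
    rows = client.query(SINGLE_BASE_SQL).result()
    return [dict(row.items()) for row in rows]
```

Calling `run_single_base_query()` with no arguments creates a client from the ambient credentials; passing a client explicitly makes it easier to set a project or to substitute a stub when testing.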

<h3 style="color:#9E9E9E">Contact</h3>

Want to get in touch? You can [provide feedback](https://goo.gl/forms/2cOmuUrQ3n3CKpim1) regarding this or other documentation,
[reach out to us](https://goo.gl/forms/j8FWdNJqABAoJvcW2) regarding collaboration, or [request a new feature or analytical capability](https://goo.gl/forms/dQm3SDcoNZsV7AAd2). We're looking forward to hearing from you!

<div style="padding-top: 30px">
<p style="color:#9E9E9E; text-align:center">This notebook was prepared for <a href="https://github.com/cwbeitel/inquiry">Project Inquiry</a> in support of the research mission of the Joint BioEnergy Institute (JBEI). Learn more at https://www.jbei.org/.</p>
<p style="color:#9E9E9E; text-align:center">The Joint BioEnergy Institute is a program of the U.S. Department of Energy Office of Science.</p>
<p style="color:#9E9E9E; text-align:center">© Regents of the University of California, 2017. Licensed under a BSD-3 <a href="https://github.com/cwbeitel/inquiry/blob/master/LICENSE">license</a>.</p>
<img src="https://raw.githubusercontent.com/cwbeitel/iqassets/master/logotype_blue_small.png" style="width:100px"></img>
</div>