# Setup
Let's set up a few directories that we will use for this week's tutorial.

In [None]:
!echo $PWD

In [None]:
!mkdir -p $PWD/ref
!mkdir -p $PWD/unaligned/normal
!mkdir -p $PWD/aligned/normal

In [None]:
ls

## Docker
As discussed in Week 1, we will be using Docker throughout this workshop.

Hopefully everyone has Docker installed in their local environment. If not, please [see Week 1](https://github.com/genome/bfx-workshop/tree/master/week_01).

We are pulling a commonly used image, the same one used in the O'Reilly book [Genomics in the Cloud](https://www.oreilly.com/library/view/genomics-in-the/9781491975183/).

"Pulling" the image means that Docker is downloading the binary image that includes all of the necessary software tools pre-installed.


In [None]:
!docker pull broadinstitute/genomes-in-the-cloud:2.3.1-1512499786

### Samtools
[Samtools](http://www.htslib.org/) is a suite of programs for interacting with high-throughput sequencing data.

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools

### BWA
[BWA](https://github.com/lh3/bwa) is a software package for mapping DNA sequences against a large reference genome, such as the human genome.

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/gitc/bwa

## Inputs

We are using a toy example data set based on the HCC1395 blood normal cell line. The sequence reads and genome reference are subsets targeting two regions: the HLA-A and HLA-B-C genes on chr6, and the TP53 and BRCA1 genes on chr17.

- [FASTA](https://storage.googleapis.com/analysis-workflows-example-data/somatic_inputs/hla_and_brca_genes.fa)
- [Normal Reads Lane 3](https://storage.googleapis.com/analysis-workflows-example-data/unaligned_subset_bams/normal/2895499331.bam)
- [Normal Reads Lane 4](https://storage.googleapis.com/analysis-workflows-example-data/unaligned_subset_bams/normal/2895499399.bam)

All inputs and additional resources can be viewed at: https://console.cloud.google.com/storage/browser/analysis-workflows-example-data

In this example, each file was downloaded to ~/Downloads. If you saved the files somewhere else, update the paths below to match.

In [None]:
!mv ~/Downloads/hla_and_brca_genes.fa $PWD/ref/.
!mv ~/Downloads/2895499331.bam $PWD/unaligned/normal/.
!mv ~/Downloads/2895499399.bam $PWD/unaligned/normal/.

# Index

In [None]:
ls $PWD/ref

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools faidx /data/ref/hla_and_brca_genes.fa

In [None]:
ls $PWD/ref

In [None]:
!head -n 20 $PWD/ref/hla_and_brca_genes.fa

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools faidx /data/ref/hla_and_brca_genes.fa chr17:43044295-43170245
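
The region lookup above is fast because the `.fai` index stores, for each sequence, its length, the byte offset of its first base, the number of bases per line, and the number of bytes per line. The following is a minimal Python sketch of that arithmetic on a made-up in-memory FASTA. The names `build_fai`, `fetch`, `toy_fasta`, and the sequences `chrA` and `chrB` are illustrative assumptions, not part of samtools.

```python
import io

def build_fai(fasta_bytes):
    """Return {name: (length, offset, linebases, linewidth)} for an in-memory FASTA."""
    index = {}
    name = None
    pos = 0  # byte position of the start of the current line
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            # The sequence starts on the byte right after the header line.
            index[name] = [0, pos + len(line), 0, 0]
        else:
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                # The first sequence line defines bases per line and bytes per line.
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos += len(line)
    return {key: tuple(value) for key, value in index.items()}

def fetch(fasta_bytes, index, region):
    """Fetch a 1-based, inclusive region such as 'chrA:9-13'."""
    name, span = region.split(":")
    start, end = (int(x) for x in span.replace(",", "").split("-"))
    _length, offset, linebases, linewidth = index[name]

    def byte_of(position):
        # Whole lines before the position, plus the column within its line.
        zero_based = position - 1
        return offset + (zero_based // linebases) * linewidth + zero_based % linebases

    raw = fasta_bytes[byte_of(start):byte_of(end) + 1]
    return raw.replace(b"\n", b"").decode()

# Toy data, invented for illustration only.
toy_fasta = b">chrA demo\nACGTACGTAC\nGGGTTTCCCA\nAC\n>chrB\nTTTT\n"
fai = build_fai(toy_fasta)
print(fai["chrA"])                         # (22, 11, 10, 11)
print(fetch(toy_fasta, fai, "chrA:9-13"))  # ACGGG
```

The four numbers printed for `chrA` correspond to columns two through five of a `.fai` line, so you can compare them against the real index written next to `hla_and_brca_genes.fa`.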

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/gitc/bwa index /data/ref/hla_and_brca_genes.fa

In [None]:
ls $PWD/ref

# Alignment

## Align FASTQ

In [None]:
ls $PWD/unaligned/normal

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools view -H /data/unaligned/normal/2895499331.bam

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools view /data/unaligned/normal/2895499331.bam

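The second column of each record is the FLAG, a bitwise field describing the read. Picard's flag explainer decodes a flag value: https://broadinstitute.github.io/picard/explain-flags.html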
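
As a rough offline equivalent of the flag explainer, the sketch below decodes a FLAG value using the bit definitions from the SAM specification. The function name `explain_flag` and the label wording are assumptions made for this example.

```python
# Bit values follow the SAM specification; the labels are paraphrased.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def explain_flag(flag):
    """Return the labels of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]

# 77 and 141 are the flags typically seen on the two mates of an unaligned pair.
for value in (77, 141, 99):
    print(value, explain_flag(value))
```

Flags 77 and 141 both include "read unmapped" and "mate unmapped", which is what you would expect to see in an unaligned BAM.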

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 java -Xms2G -jar /usr/gitc/picard.jar

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 java -Xms2G -jar /usr/gitc/picard.jar SamToFastq 

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 java -Xms2G -jar /usr/gitc/picard.jar SamToFastq \
        INPUT=/data/unaligned/normal/2895499331.bam OUTPUT_PER_RG=true COMPRESS_OUTPUTS_PER_RG=true RG_TAG=ID OUTPUT_DIR=/data/unaligned/normal

In [None]:
ls $PWD/unaligned/normal

In [None]:
!head $PWD/unaligned/normal/2895499331_1.fastq.gz

In [None]:
!gunzip $PWD/unaligned/normal/2895499331_1.fastq.gz

In [None]:
ls $PWD/unaligned/normal

In [None]:
!head $PWD/unaligned/normal/2895499331_1.fastq
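
The decompressed file shows the FASTQ layout: four lines per read (name, bases, a `+` separator, and qualities encoded as ASCII characters offset by 33). As a minimal sketch, the Python below reads gzipped FASTQ directly, without writing a decompressed copy to disk. The helper `read_fastq` and the two records are invented for illustration and are not taken from the tutorial data.

```python
import gzip
import io

def read_fastq(handle):
    """Yield (name, sequence, phred_scores) from a text handle of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qualities = handle.readline().rstrip("\n")
        # Phred+33 encoding: quality score = ASCII code minus 33.
        yield header[1:], sequence, [ord(char) - 33 for char in qualities]

# Two made-up records, compressed in memory to stand in for a .fastq.gz file.
raw = "@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\n!!5I\n"
compressed = gzip.compress(raw.encode())

print(compressed[:2])  # gzip magic bytes, which is why `head` on the .gz looks like garbage

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    records = list(read_fastq(handle))

for name, sequence, scores in records:
    print(name, sequence, scores)
```

To try this on the real file, replace the in-memory buffer with the path to `2895499331_1.fastq.gz`.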

In [None]:
!gzip $PWD/unaligned/normal/2895499331_1.fastq

In [None]:
ls $PWD/unaligned/normal

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/gitc/bwa mem

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/gitc/bwa mem /data/ref/hla_and_brca_genes.fa /data/unaligned/normal/2895499331_1.fastq.gz /data/unaligned/normal/2895499331_2.fastq.gz

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools view

Redirecting output with `>` needs a shell inside the container. In the next cell the whole command is passed as a single quoted string, which Docker treats as the name of one executable, so it is expected to fail. The cell after it wraps the same command in `/bin/bash -c`.

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 "/usr/gitc/bwa mem /data/ref/hla_and_brca_genes.fa /data/unaligned/normal/2895499331_1.fastq.gz /data/unaligned/normal/2895499331_2.fastq.gz > /data/aligned/normal/2895499331.sam"

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c "/usr/gitc/bwa mem /data/ref/hla_and_brca_genes.fa /data/unaligned/normal/2895499331_1.fastq.gz /data/unaligned/normal/2895499331_2.fastq.gz > /data/aligned/normal/2895499331.sam" 

In [None]:
!head $PWD/aligned/normal/2895499331.sam

In [None]:
!docker run -v $PWD:/data -v $PWD/aligned/normal:/data/aligned/normal -v $PWD/ref:/data/ref -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c '/usr/gitc/bwa mem -R "@RG\tID:2895499331\tPL:ILLUMINA\tPU:H7HY2CCXX-TGACCACG.3\tLB:H_NJ-HCC1395-HCC1395_BL-lg21-lib1\tSM:H_NJ-HCC1395-HCC1395_BL\tCN:MGI" /data/ref/hla_and_brca_genes.fa /data/unaligned/normal/2895499331_1.fastq.gz /data/unaligned/normal/2895499331_2.fastq.gz | /usr/local/bin/samtools view -1 -o /data/aligned/normal/2895499331.bam -' 
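
The `-R` argument above is a complete `@RG` header line in which the two-character sequence `\t` stands for a tab; `bwa mem` converts it to a real tab in the output. The sketch below assembles that string from named fields. The tag meanings in the comments follow the SAM specification, and the helper name `read_group_arg` is an assumption made for this example.

```python
def read_group_arg(fields):
    """Join read group tags into the literal-backslash-t form accepted by `bwa mem -R`."""
    parts = ["@RG"] + [f"{tag}:{value}" for tag, value in fields.items()]
    return "\\t".join(parts)

# Values copied from the lane 3 command in this tutorial.
lane3 = {
    "ID": "2895499331",                         # read group identifier
    "PL": "ILLUMINA",                           # sequencing platform
    "PU": "H7HY2CCXX-TGACCACG.3",               # platform unit
    "LB": "H_NJ-HCC1395-HCC1395_BL-lg21-lib1",  # library
    "SM": "H_NJ-HCC1395-HCC1395_BL",            # sample
    "CN": "MGI",                                # sequencing center
}

rg = read_group_arg(lane3)
print(rg)

# What the header line looks like once the escapes become real tabs.
header_line = rg.replace("\\t", "\t")
print(header_line.split("\t"))
```

Both lanes share the same `SM` and `LB` values and differ only in `ID` and `PU`, which is what lets downstream tools treat them as one sample sequenced on two lanes.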

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c "/usr/local/bin/samtools view -H /data/aligned/normal/2895499331.bam" 

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c "/usr/local/bin/samtools view /data/aligned/normal/2895499331.bam | head" 

## Align Unaligned BAM

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /usr/local/bin/samtools view -H /data/unaligned/normal/2895499399.bam

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c \
    'set -o pipefail; set -o errexit; java -Xms2G -jar /usr/gitc/picard.jar SamToFastq INPUT=/data/unaligned/normal/2895499399.bam FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true | /usr/gitc/bwa mem -R "@RG\tID:2895499399\tPL:ILLUMINA\tPU:H7HY2CCXX-TGACCACG.4\tLB:H_NJ-HCC1395-HCC1395_BL-lg21-lib1\tSM:H_NJ-HCC1395-HCC1395_BL\tCN:MGI" -p /data/ref/hla_and_brca_genes.fa /dev/stdin | /usr/local/bin/samtools view -1 -o /data/aligned/normal/2895499399.bam -' 
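
This pipeline streams both mates of every pair through one pipe: `INTERLEAVE=true` makes SamToFastq write mate 1 and mate 2 alternately, and `-p` tells `bwa mem` to read the input as interleaved pairs. A small Python sketch of that ordering follows, using invented read names and hypothetical helpers `interleave` and `deinterleave`.

```python
def interleave(first_mates, second_mates):
    """Alternate mate 1 and mate 2 records so that each pair sits next to each other."""
    if len(first_mates) != len(second_mates):
        raise ValueError("every read needs a mate")
    stream = []
    for mate1, mate2 in zip(first_mates, second_mates):
        stream.extend([mate1, mate2])
    return stream

def deinterleave(stream):
    """Split an interleaved stream back into (mate 1 records, mate 2 records)."""
    return stream[0::2], stream[1::2]

# Made-up read names standing in for full FASTQ records.
mates_1 = ["readA/1", "readB/1", "readC/1"]
mates_2 = ["readA/2", "readB/2", "readC/2"]

stream = interleave(mates_1, mates_2)
print(stream)
print(deinterleave(stream))
```

Because pairing depends only on record order, a single missing or extra record would shift every pair after it, which is one reason to let a tool such as SamToFastq produce the stream rather than building it by hand.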

In [None]:
ls $PWD/aligned/normal

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c "/usr/local/bin/samtools view -H /data/aligned/normal/2895499399.bam" 

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c "/usr/local/bin/samtools view /data/aligned/normal/2895499399.bam | head" 

## Merge Alignments

In [None]:
!docker run -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 java -Xms2G -jar /usr/gitc/picard.jar MergeSamFiles

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 java -Xms2G -jar /usr/gitc/picard.jar MergeSamFiles OUTPUT=/data/aligned/normal.bam INPUT=/data/aligned/normal/2895499331.bam INPUT=/data/aligned/normal/2895499399.bam
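
MergeSamFiles takes one `INPUT=` argument per file, so the command grows with the number of lanes. The sketch below builds the same command as a list, which may be handy when there are more than two inputs. The function `merge_command` and the placeholder working directory are assumptions for illustration, and the `-it` flags are left out on the assumption that the command would run non-interactively.

```python
import shlex

IMAGE = "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786"

def merge_command(workdir, output, inputs):
    """Build a `docker run ... MergeSamFiles` argument list with one INPUT= per BAM."""
    command = [
        "docker", "run", "-v", f"{workdir}:/data", IMAGE,
        "java", "-Xms2G", "-jar", "/usr/gitc/picard.jar", "MergeSamFiles",
        f"OUTPUT={output}",
    ]
    command += [f"INPUT={path}" for path in inputs]
    return command

lane_bams = [
    "/data/aligned/normal/2895499331.bam",
    "/data/aligned/normal/2895499399.bam",
]

# "/path/to/workdir" is a placeholder for the directory printed by `echo $PWD`.
command = merge_command("/path/to/workdir", "/data/aligned/normal.bam", lane_bams)
print(shlex.join(command))
```

The printed string can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)` on a machine where Docker is available.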

In [None]:
ls $PWD/aligned

In [None]:
!docker run -v $PWD:/data -it broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 /bin/bash -c "/usr/local/bin/samtools view -H /data/aligned/normal.bam" 

# Homework
- Index the normal.bam file. HINT: Use `samtools index` or igvtools.
- View the indexed normal.bam file with IGV. HINT: Search for BRCA1.
- Make a list of questions and/or observations about the alignments to discuss next week.
- Are there other post-alignment processing steps we've missed? Bring suggestions for next week.