# Joint Genotyping gVCF
```
pi:ababaian
files: ~/Crown/data/..
start: 2017 03 10
complete : 2017 03 20
```
## Introduction

Reviewing the arrays of variation, one of the most important distinctions is between a called non-variant (reference sequence only) and no data. Some sites, like A59G, are readily amenable to sequencing/coverage, so the variant is detected in 100% of the samples. Many other positions are variant in 50%+ of samples, but the VCF file format carries no information about why no variant was called in the remaining data sets: either the site is truly the reference sequence, or there is no sequencing depth, or the reads there are trashy.

I found a very useful [guide](http://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations) on variant calling which led me to [another guide](https://software.broadinstitute.org/gatk/documentation/topic?name=methods) which outlines how to address exactly this problem. Essentially what I want is a `gVCF` file as shown [here](http://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf) to retain this information.

The idea is to load all the aligned BAM files onto an EC2 instance, re-run HaplotypeCaller, and generate a 'gVCF' file for the 107 genomes (including the Utah Trio platinum genomes).

The hypothesis is that there are multiple '100% present' variant sites in rDNA besides A59G, but the data is missing because of high GC content.

Essentially I want a rate of 'calling' reference alleles at the ~1,600 variant sites.
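
To make the "called reference vs. no data" distinction concrete, here is a minimal sketch that tallies genotypes per site in a multi-sample VCF. It assumes `GT` is the first FORMAT subfield (the VCF spec requires this whenever `GT` is present) and that no-calls are written with `.` (for example `./.`). The function name `site_call_rates` is hypothetical, not part of any tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: for each site print CHROM, POS, and the number of
# samples that are hom-ref, carry an ALT allele, or have no call.
site_call_rates() {
  awk -F'\t' '
    /^#/ { next }                        # skip header lines
    {
      ref = 0; alt = 0; nocall = 0
      for (i = 10; i <= NF; i++) {       # sample columns start at field 10
        split($i, f, ":")
        gt = f[1]                        # GT is assumed to be the first subfield
        if (gt ~ /\./)          nocall++ # any missing allele counts as no-call
        else if (gt ~ /[1-9]/)  alt++    # at least one non-reference allele
        else                    ref++    # only reference alleles called
      }
      printf "%s\t%s\t%d\t%d\t%d\n", $1, $2, ref, alt, nocall
    }' "$1"
}

# Usage (file name taken from this notebook):
#   site_call_rates 100g.hgr1_all.vcf > site_call_rates.tsv
```

The reference call rate at a site is then the `ref` column divided by the number of samples, and the `nocall` column shows how much of the remainder is simply missing data.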


## Objective

- Generate a gVCF file for 100genomes data

## Materials and Methods

### EC2 Instance

Opened up a c4.2xlarge instance with 400 GB of space.

DNS: ec2-52-26-25-196.us-west-2.compute.amazonaws.com

### Guide

I'm essentially going to follow this [guide](https://software.broadinstitute.org/gatk/guide/article?id=3893).

#### Manual Filter Cutoffs (roughly)
[Good read](http://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations)

- FS > 55
- DP < 30
- QD < 2
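
As a rough illustration of how these cutoffs could be applied, the sketch below flags each record by reading `FS`, `DP` and `QD` from the INFO column with plain `awk`. It is only a sketch under assumptions: `flag_hard_filters` is a hypothetical name, an annotation missing from a record is simply not tested, and `DP` here is the INFO-level (all-sample) depth rather than per-sample depth.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print CHROM, POS and either PASS or the list of
# rough cutoffs the record fails (FS > 55, DP < 30, QD < 2).
flag_hard_filters() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                                  # skip header lines
    {
      fs = ""; dp = ""; qd = ""
      n = split($8, kv, ";")                       # INFO is column 8
      for (i = 1; i <= n; i++) {
        split(kv[i], p, "=")
        if (p[1] == "FS") fs = p[2]
        if (p[1] == "DP") dp = p[2]
        if (p[1] == "QD") qd = p[2]
      }
      reason = ""
      if (fs != "" && fs + 0 > 55) reason = reason "FS;"
      if (dp != "" && dp + 0 < 30) reason = reason "DP;"
      if (qd != "" && qd + 0 < 2)  reason = reason "QD;"
      if (reason == "") reason = "PASS"
      else sub(/;$/, "", reason)                   # drop the trailing separator
      print $1, $2, reason
    }' "$1"
}

# Usage (file name taken from this notebook):
#   flag_hard_filters 100g.hgr1_all.vcf | cut -f3 | sort | uniq -c
```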


In [None]:
## EC2 instance Code
## Up and running! May take some hours
## ec2-52-26-25-196.us-west-2.compute.amazonaws.com

# Make folder for analysis
  mkdir -p work; cd work;
  
# Download resource files
# GATK variant calling
  aws s3 cp s3://crownproject/resources/hgr1.gatk.fa ./
  aws s3 cp s3://crownproject/resources/hgr1.gatk.fa.fai ./
  aws s3 cp s3://crownproject/resources/hgr1.gatk.dict ./


# Download all the hgr1 bam files and their index
# ~ 1 Gb
  aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.bam"
  aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.bam.bai"

rm testGenome.hgr1.bam
rm testGenome.hgr1.bam.bai

# BAM files
#BAM=$(ls *.bam)

ls *.bam > bams.list

# Run HaplotypeCaller to make VCF file (unified)
  
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr1.gatk.fa -T HaplotypeCaller \
  -ploidy 2 --max_alternate_alleles 6 \
  -I bams.list \
  -o 100g.hgr1_all.vcf

# Re-run with (unified)
# -ploidy 10 --max_alternate_alleles 10
  
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr1.gatk.fa -T HaplotypeCaller \
  -ploidy 10 --max_alternate_alleles 10 \
  -I bams.list \
  -o 100g.hgr1_all.ploidy10.vcf

aws s3 cp 100g.hgr1_all.vcf s3://crownproject/1kg_hgr1/reVCF/
aws s3 cp 100g.hgr1_all.vcf.idx s3://crownproject/1kg_hgr1/reVCF/

aws s3 cp 100g.hgr1_all.ploidy10.vcf s3://crownproject/1kg_hgr1/reVCF/
aws s3 cp 100g.hgr1_all.ploidy10.vcf.idx s3://crownproject/1kg_hgr1/reVCF/


# Re-make g.vcf files for each bam library

for BAMFILE in *.bam
do

  # Sample ID: the text before the first '.' in the file name
  sampleID=$(echo "$BAMFILE" | cut -f1 -d'.' -)

  # Per-library g.vcf with a record at every base (BP_RESOLUTION)
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr1.gatk.fa -T HaplotypeCaller \
  -ERC BP_RESOLUTION \
  -ploidy 2 --max_alternate_alleles 6 \
  -I "$BAMFILE" \
  -o "$sampleID.g.vcf"

  # Upload the g.vcf and its index
  aws s3 cp "$sampleID.g.vcf" s3://crownproject/1kg_hgr1/reVCF/
  aws s3 cp "$sampleID.g.vcf.idx" s3://crownproject/1kg_hgr1/reVCF/

done

# QED
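
Because the per-library loop runs for a while, it helps to be able to see which BAM files still lack a g.vcf if the run is interrupted. A minimal sketch, assuming the `<sampleID>.<anything>.bam` naming used above and that a non-empty `<sampleID>.g.vcf` or `<sampleID>.g.vcf.gz` means the library is done; `pending_bams` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list BAM files in a directory that have no
# non-empty g.vcf (plain or gzipped) next to them yet.
pending_bams() {
  local dir="${1:-.}" bam id
  for bam in "$dir"/*.bam; do
    [ -e "$bam" ] || continue            # no BAM files at all: glob stays literal
    id=$(basename "$bam")
    id=${id%%.*}                         # sample ID = text before the first '.'
    if [ ! -s "$dir/$id.g.vcf" ] && [ ! -s "$dir/$id.g.vcf.gz" ]; then
      printf '%s\n' "$bam"
    fi
  done
}

# Usage: run from the work directory to see what is left to process
#   pending_bams .
```

Note that a non-empty file is only a weak completion check; a run killed partway through HaplotypeCaller can leave a partial g.vcf behind.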

In [None]:
## EC2 instance Code
# m4.xlarge 400 Gb
# ec2-52-38-47-116.us-west-2.compute.amazonaws.com
## I'm worried the script above will take a LONG time to run.
## The alternative approach is to generate a full BP_RESOLUTION
## g.vcf file for EACH BAM library and merge them last.

# Make folder for analysis
  mkdir -p work; cd work;
  
# Download resource files
# GATK variant calling
  aws s3 cp s3://crownproject/resources/hgr1.gatk.fa ./
  aws s3 cp s3://crownproject/resources/hgr1.gatk.fa.fai ./
  aws s3 cp s3://crownproject/resources/hgr1.gatk.dict ./


# Download all the hgr1 bam files and their index
# ~ 1 Gb
  aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.bam"
  aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.bam.bai"

rm testGenome.hgr1.bam
rm testGenome.hgr1.bam.bai

# BAM files
#BAM=$(ls *.bam)

ls *.bam > bams.list

# Re-make g.vcf files for each bam library

for BAMFILE in *.bam
do

  # Sample ID: the text before the first '.' in the file name
  sampleID=$(echo "$BAMFILE" | cut -f1 -d'.' -)

  # Per-library g.vcf with a record at every base (BP_RESOLUTION)
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr1.gatk.fa -T HaplotypeCaller \
  -ERC BP_RESOLUTION \
  -ploidy 2 --max_alternate_alleles 6 \
  -I "$BAMFILE" \
  -o "$sampleID.g.vcf"

  # Compress before upload; the .idx was built for the uncompressed file
  gzip "$sampleID.g.vcf"
  aws s3 cp "$sampleID.g.vcf.gz" s3://crownproject/1kg_hgr1/reVCF/
  aws s3 cp "$sampleID.g.vcf.idx" s3://crownproject/1kg_hgr1/reVCF/

done

# QED
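
The "merge them last" step is not shown in this notebook. In GATK 3 the usual tool for it is `GenotypeGVCFs`, which takes one `--variant` argument per g.vcf. The sketch below only builds and prints such a command as a dry run; it does not run GATK. The helper name `genotype_gvcfs_cmd` and the output file name in the usage line are hypothetical, and the inputs are assumed to be uncompressed `.g.vcf` files (the gzipped copies would need `gzip -d` first).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a GATK 3 GenotypeGVCFs command line.
#   $1 = reference FASTA, $2 = output VCF, remaining args = per-sample g.vcf files
genotype_gvcfs_cmd() {
  local ref="$1" out="$2" gvcf
  shift 2
  printf 'java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar -T GenotypeGVCFs -R %s' "$ref"
  for gvcf in "$@"; do
    printf ' --variant %s' "$gvcf"       # one --variant per input g.vcf
  done
  printf ' -o %s\n' "$out"
}

# Usage: review the printed command, then paste it or pipe it to bash
#   genotype_gvcfs_cmd hgr1.gatk.fa 100g.hgr1_joint.vcf *.g.vcf
```

Printing the command first makes it easy to check that all expected libraries are included before committing to a long run.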

In [1]:
aws s3 ls s3://crownproject/1kg_hgr1/reVCF/

2017-03-10 21:09:42          0 
2017-03-11 12:18:11   10380413 100g.hgr1_all.ploidy10.vcf
2017-03-11 16:26:40    3483506 100g.hgr1_all.ploidy10.vcf.gz
2017-03-11 12:18:10    2778272 100g.hgr1_all.vcf
2017-03-11 16:26:54     902300 100g.hgr1_all.vcf.gz
2017-03-11 12:18:11       4326 100g.hgr1_all.vcf.idx
2017-03-11 12:18:48   67604874 HG00128.g.vcf
2017-03-10 22:00:12    2725466 HG00128.g.vcf.gz
2017-03-11 12:18:50        190 HG00128.g.vcf.idx
2017-03-11 12:19:56   67641800 HG00139.g.vcf
2017-03-10 22:01:31    2746535 HG00139.g.vcf.gz
2017-03-11 12:19:58        190 HG00139.g.vcf.idx
2017-03-11 12:21:26   67618954 HG00234.g.vcf
2017-03-10 22:03:17    2730628 HG00234.g.vcf.gz
2017-03-11 12:21:28        190 HG00234.g.vcf.idx
2017-03-11 12:22:14   67631559 HG00253.g.vcf
2017-03-10 22:04:14    2741457 HG00253.g.vcf.gz
2017-03-11 12:22:15        190 HG00253.g.vcf.idx
2017-03-11 12:23:11   67623799 HG00337.g.vcf
2017-03-10 22:05:21    2737871 HG00337.g.vcf.gz
2017-03-11 12:

### Runs Finished

Lots of analysis files; for now stick to the total g.vcf file for variant analysis.


In [2]:
cd ~/Crown/data/hgr1_vcf/

mkdir -p 100g_gvcf; cd 100g_gvcf

aws s3 cp s3://crownproject/1kg_hgr1/reVCF/100g.hgr1_all.vcf.gz ./
aws s3 cp s3://crownproject/1kg_hgr1/reVCF/100g.hgr1_all.ploidy10.vcf.gz ./

gzip -d 100g.hgr1_all.vcf.gz

download: s3://crownproject/1kg_hgr1/reVCF/100g.hgr1_all.vcf.gz to ./100g.hgr1_all.vcf.gz


In [3]:
## Plot vcfStats

bcftools stats 100g.hgr1_all.vcf > 100gvcf.vcfplot
plot-vcfstats -p plot/100gvcf 100gvcf.vcfplot

Parsing bcftools stats output: 100gvcf.vcfplot
Plotting graphs: python plot/100gvcf-plot.py
Creating PDF: pdflatex 100gvcf-summary.tex >100gvcf-plot-vcfstats.log 2>&1
Finished: plot/100gvcf-summary.pdf


#### Quality Metrics of g.vcf Genotype Qualities

Standard vcfR Quality Control for the `100g.hgr1_all.vcf` joint genotyping. Script: `~/Crown/data/vcfR_analysis/gvcf_100g.r`

![100g hgr1 QC](../figure/20170313_QC_hgr1_100gvcf.png)

Note: The 'Quality' metric is the PHRED-scaled score, -10 log10 of the probability that no alternate allele exists at the site, and has a huge range in this file (30.44 to 555697.96), so it was re-plotted on a log scale.

![100g hgr1 Quality Log Replot](../figure/20170313_QC_AltGeno_Quality.png)

### Individual VCF File per library

For detailed analysis, an individual g.vcf file was also generated for each library (see the per-BAM HaplotypeCaller loop above).



In [6]:
# Download individual VCF files from each library
cd ~/Crown/data/hgr1_vcf/
mkdir -p 100g_gvcf/gvcf_individual
cd 100g_gvcf/gvcf_individual

aws s3 cp s3://crownproject/1kg_hgr1/reVCF ./ --recursive --exclude "*" --include "*.vcf.gz"


Completed 256.0 KiB/284.0 MiB with 109 file(s) remaining ... [progress output truncated]

### Renaming libraries for figure

For the `~/Crown/data/hgr1_vcf/100g.hgr1.g.vcf` file (quality metrics above), make a renamed version with sample names converted to include population identifiers.



In [7]:
# Sample Order
cd ~/Crown/data/hgr1_vcf/

cut -f 5 nameConversion_figure.txt > alpha.order.tmp

# Manually replaced \n with , in gedit, since bcftools view -s
# expects a comma-separated list of sample names
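
The manual search-and-replace can probably be avoided. A minimal sketch, assuming column 5 of `nameConversion_figure.txt` holds the converted names (as the `cut -f 5` above implies); `sample_order_csv` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print column 5 of a tab-separated conversion
# table as a single comma-separated line.
sample_order_csv() {
  cut -f 5 "$1" | paste -sd, -           # -s joins all lines, -d, sets the delimiter
}

# Usage: write the list in the form bcftools view -s expects
#   sample_order_csv nameConversion_figure.txt > alpha.order.tmp
```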



In [15]:
## Renaming Sample IDs to include population information
cd ~/Crown/data/hgr1_vcf/

FILE='100g.hgr1.g.vcf'
OUTPUT_VCF='POPID_100g.hgr1.g.vcf'
CONVERSION='nameConversion_figure.txt'

cp $FILE $OUTPUT_VCF

N_LINES=$(wc -l $CONVERSION | cut -f1 -d' ' -)

for N in $(seq 1 $N_LINES)
do

    originalID=$(sed -n "$N"p $CONVERSION | cut -f 1 -)
    convertID=$(sed -n "$N"p $CONVERSION | cut -f 5 - )
        
    #echo "Converting $originalID to $convertID"
    
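    # Within the VCF file, convert the ID on the #CHROM header line only,
    # matching the whole tab-delimited field so that one ID cannot
    # partially match another or alter data lines (GNU sed syntax)
    sed -i "/^#CHROM/ s/\t$originalID\(\t\|$\)/\t$convertID\1/" $OUTPUT_VCF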
    
done

cp $OUTPUT_VCF holder.tmp.vcf

bcftools view -s $(cat alpha.order.tmp) holder.tmp.vcf > $OUTPUT_VCF

#gzip $OUTPUT_VCF
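
Running `sed` once per sample rewrites the whole file each time. A single-pass alternative is sketched below: it loads the conversion table into a lookup and rewrites only the sample columns of the `#CHROM` line, matching whole fields. It assumes the table is tab-separated with the original ID in column 1 and the converted ID in column 5, as in the loop above; `rename_vcf_samples` is a hypothetical name. `bcftools reheader -s` is another route to the same result.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rename samples on the #CHROM header line.
#   $1 = tab-separated conversion table, $2 = input VCF; output goes to stdout
rename_vcf_samples() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { map[$1] = $5; next }     # first file: old ID -> new ID
    /^#CHROM/ {
      for (i = 10; i <= NF; i++)         # sample columns start at field 10
        if ($i in map) $i = map[$i]      # exact field match only
    }
    { print }                            # all other lines pass through unchanged
  ' "$1" "$2"
}

# Usage (file names taken from this notebook):
#   rename_vcf_samples nameConversion_figure.txt 100g.hgr1.g.vcf > POPID_100g.hgr1.g.vcf
```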



In [None]:
## Note: Sample AFR ASW NA20362 is not included in the gvcf file;
## it was dropped somewhere along the way. Because I included 5 Yoruba
## samples (to have NA19240 in the analysis), I miscounted the total.
## For now, continue the analysis with the 107 genomes available. In the next
## major analysis pipeline I'll include more genomes anyway.
## This requires manually removing NA20362 from the conversion table.
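
A dropped sample like this can be caught before the renaming step by comparing the conversion table against the VCF header. A minimal sketch, assuming the original sample ID is in column 1 of the tab-separated table; `missing_samples` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print IDs from column 1 of the conversion table
# that do not appear as sample columns in the VCF header.
#   $1 = conversion table, $2 = VCF
missing_samples() {
  awk -F'\t' '
    NR == FNR {                          # first file read is the VCF
      if ($0 ~ /^#CHROM/)
        for (i = 10; i <= NF; i++) seen[$i] = 1
      next
    }
    !($1 in seen) { print $1 }           # table ID absent from the header
  ' "$2" "$1"
}

# Usage (file names taken from this notebook):
#   missing_samples nameConversion_figure.txt 100g.hgr1.g.vcf
```

Any ID this prints would have to be removed from the conversion table, or its library re-run, before `bcftools view -s` is given the sample list.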

## Discussion
