# hgr1 bcftools analysis / variant analysis
```
pi:ababaian
files: ~/Crown/data/hgr1_vcf/
start: 2017 03 07
complete : 2017 03 13
```
## Introduction

I have the hgr1_v1 pipeline up and running; part of the output includes variant calling from GATK.

```
  java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
  -R hgr1.gatk.fa -T HaplotypeCaller \
  -ploidy 2 --max_alternate_alleles 6 \
  -I $LIBRARY.hgr1.bam -o $LIBRARY.hgr1.vcf
```

This is the downstream analysis of those files to broadly describe variation in rDNA across many individuals.

## Objective

- Automate VCF analysis
- Calculate which intra-individual variants are found in the population and which are maintained in humans
- Calculate the frequency of variation in different regions of the rDNA loci


## Materials and Methods

### Key Files

- `100g_hgr1_individual.tar.gz` : Each individual hgr1_v1 output vcf file from 107 genomes (3 platinum + 104 population)

### Current VCF files - 30 genomes

#### Manipulation of VCF files

In [1]:
cd ~/Crown/data/
mkdir -p hgr1_vcf/; cd hgr1_vcf




In [6]:
date # this is in flux, will be updated
aws s3 ls s3://crownproject/1kg_hgr1/ | grep ".vcf" -

Tue Mar  7 16:55:32 PST 2017
2017-03-07 14:24:07      61563 HG00478.hgr1.vcf
2017-03-07 14:24:07       4324 HG00478.hgr1.vcf.idx
2017-03-07 16:08:21      58909 HG00536.hgr1.vcf
2017-03-07 16:08:22       2298 HG00536.hgr1.vcf.idx
2017-03-07 16:09:22      37016 HG00557.hgr1.vcf
2017-03-07 16:09:22        782 HG00557.hgr1.vcf.idx
2017-03-07 16:11:04      17098 HG00717.hgr1.vcf
2017-03-07 16:11:05        284 HG00717.hgr1.vcf.idx
2017-03-06 23:18:53      63888 HG00851.hgr1.vcf
2017-03-06 23:18:53       4325 HG00851.hgr1.vcf.idx
2017-03-06 22:54:06      43098 HG00978.hgr1.vcf
2017-03-06 22:54:07        781 HG00978.hgr1.vcf.idx
2017-03-07 11:28:46      53048 HG00982.hgr1.vcf
2017-03-07 11:28:46       2301 HG00982.hgr1.vcf.idx
2017-03-06 22:18:41      28059 HG02190.hgr1.vcf
2017-03-06 22:18:42        284 HG02190.hgr1.vcf.idx
2017-03-06 16:16:48      26310 HG02283.hgr1.vcf
2017-03-06 16:16:49        283 HG02283.hgr1.vcf.idx
2017-03-06 16:03:32      23316 HG02343.hgr1.vcf
201

In [8]:
aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.vcf"

rm testGenome.hgr1.vcf

download: s3://crownproject/1kg_hgr1/HG00717.hgr1.vcf to ./HG00717.hgr1.vcf
download: s3://crownproject/1kg_hgr1/HG02283.hgr1.vcf to ./HG02283.hgr1.vcf
download: s3://crownproject/1kg_hgr1/HG00557.hgr1.vcf to ./HG00557.hgr1.vcf
download: s3://crownproject/1kg_hgr1/HG02190.hgr1.vcf to ./HG02190.hgr1.vcf
download: s3://crownproject/1kg_hgr1/HG00851.hgr1.vcf to ./HG00851.hgr1.vcf
download: s3://crownproject

In [11]:
# convert files to BCF
for FILE in *.vcf
do
    # Convert VCF to BCF File
    bcftools view -O b "$FILE" -o "$FILE.bcf"
    bcftools index "$FILE.bcf"
done

bcftools merge $(ls *.bcf) -o 100genomes_merged.vcf
bcftools view -O b 100genomes_merged.vcf -o 100genomes_merged.bcf
bcftools index 100genomes_merged.bcf




In [4]:
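# Filter for variants with 30x coverage
# Note: the output is an uncompressed VCF, which `bcftools index` cannot index
# (it expects bgzip-compressed VCF or BCF), hence the error below.

bcftools view -i "DP>=30" 100genomes_merged.vcf -o 100genomes_DP30.vcf
bcftools index 100genomes_DP30.vcf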
#bcftools filter -i "DP>=30" 100genomes_merged.bcf | bcftools view -O b 100genomes.DP30.bcf


[E::main_vcfindex] unknown filetype; expected bgzip compressed VCF or BCF
[E::main_vcfindex] was the VCF/BCF compressed with bgzip?
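
The indexing error above comes from handing `bcftools index` an uncompressed VCF. A minimal sketch of one way around it, assuming a bgzip-compressed output is acceptable downstream: write the filtered file with `-O z` and index that. The function name `filter_and_index` and the `.vcf.gz` file name are my own placeholders, not part of the pipeline.

```shell
# Sketch: filter with an expression, write bgzip-compressed VCF (-O z),
# then index the compressed file.
# Arguments: 1 = input VCF/BCF, 2 = filter expression, 3 = output .vcf.gz
filter_and_index() {
    bcftools view -i "$2" -O z -o "$3" "$1" &&
    bcftools index "$3"
}

# Example call (not run here; output name is an assumption):
# filter_and_index 100genomes_merged.vcf 'DP>=30' 100genomes_DP30.vcf.gz
```

Writing BCF with `-O b` instead would also give an indexable file, which is what the commented-out line in the cell above was heading towards.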


#### Summary Plots for VCF

In [3]:
## vcfplot (summary stats)

# DP30 filtered
bcftools stats 100genomes_DP30.vcf > 100g.dp30.vcfplot
plot-vcfstats -p plot_dp30/100g 100g.dp30.vcfplot

# nonDP30 filtered
bcftools stats 100genomes_merged.vcf > 100g.vcfplot
plot-vcfstats -p plot/100g 100g.vcfplot

Parsing bcftools stats output: 100g.dp30.vcfplot
Plotting graphs: python plot_dp30/100g-plot.py
Creating PDF: pdflatex 100g-summary.tex >100g-plot-vcfstats.log 2>&1
Finished: plot_dp30/100g-summary.pdf
Parsing bcftools stats output: 100g.vcfplot
Plotting graphs: python plot/100g-plot.py
Creating PDF: pdflatex 100g-summary.tex >100g-plot-vcfstats.log 2>&1
Finished: plot/100g-summary.pdf


In [None]:
## Measure the density of variants in each region
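
This cell was left empty. As a rough sketch of what the density measurement could look like, the function below counts variants falling inside each region and divides by the region length. It assumes a simplified three-column region table (name, start, end; 1-based, inclusive) rather than the real `rDNA.gff`, and an uncompressed VCF; the function name and the table layout are hypothetical.

```shell
# Sketch: variants per region and variants per base.
# $1 = region table: name <tab> start <tab> end (hypothetical layout)
# $2 = uncompressed VCF (POS is column 2, header lines start with '#')
variants_per_region() {
    awk -F'\t' '
        # First file: load the regions
        NR == FNR { rname[NR] = $1; rstart[NR] = $2 + 0; rstop[NR] = $3 + 0; n = NR; next }
        # Second file: skip VCF header lines
        /^#/      { next }
        {
            pos = $2 + 0
            for (i = 1; i <= n; i++)
                if (pos >= rstart[i] && pos <= rstop[i]) hits[i]++
        }
        END {
            for (i = 1; i <= n; i++) {
                len = rstop[i] - rstart[i] + 1
                printf "%s\t%d\t%.4f\n", rname[i], hits[i], hits[i] / len
            }
        }
    ' "$1" "$2"
}
```

The output is one line per region: name, variant count, variants per base. Overlapping regions would each count the same variant, which may or may not be what is wanted.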


#### vcfR Analysis - statistical analysis in R

To do the heavy lifting number-wise I went to an R-based package called `vcfR`, which is actually really good.

I still have to write some custom code for the rRNA variation analysis since it's outside the norm, but with vcfR I can extract information from the VCF file into a data.frame fairly quickly. A copy of the script is below; the version on disk is subject to change.

`cd ~/Crown/data/vcfR_analysis/`

`vcf_100g.r` - script for analysis

```
# vcfR analysis
# 100 genomes hgr1
# 170308
#
# hgr1_v1 vcf (variant input only) analysis

# Install vcfR
#install.packages("vcfR")
library("vcfR")
library("ggplot2")
library("reshape2")

# File pointers

  vcf_file = '30genomes_merged.vcf.gz'
  dna_file = 'rDNA.fa'
  gff_file = 'rDNA.gff'

# output Prefix (for plots)
  
  outPrefix='30genomes'

# SCRIPT ============================================================
#====================================================================
# Import VCF / DNA / GFF
  VCF = read.vcfR(vcf_file)
  DNA = ape::read.dna(dna_file, format = 'fasta')
  GFF = read.table(gff_file, sep="\t", quote = "")
  
  
# Create chromR Object
  chrom = create.chromR(name="rDNA", vcf=VCF, seq=DNA, ann=GFF)

  #plot(chrom)
 #chromoqc(chrom, dp.alpha = 22)


# Variant Statistics ================================================

# Intra-individual variation (_i)
  
  # Depth of coverage at each variant
  DP = extract.gt(chrom, element="DP", as.numeric=TRUE)

  # Reference allele-only depth
  RAD = extract.gt(chrom, element="AD", as.numeric=TRUE)

  # Variant allele frequency (intra-individual)
  iVAF = (DP-RAD)/DP
  
  
# Population variation (called variants)
  
  # Number of individual genomes in total population
  N_pop = length(DP[1,])
  
  # Called Variant allele count (population)
  pV = rowSums(!is.na(iVAF))
  pVAF = pV / N_pop
  
  # Average intra-genomic variant allele frequency
  mean_iVAF = apply(iVAF,1,mean, na.rm = T)
  sd_iVAF = apply(iVAF,1,sd, na.rm = T)

# Population Variant Statistics
  
POPVAR = data.frame(pVAF, mean_iVAF, sd_iVAF)

# Plot--------------------------------------------------------------
# Mean Intra-individual Variant Allele Frequency vs.
# Population-level Variant Allele Frequency (called vs not called)

# open PDF device
pdf(file = paste0(outPrefix,".MeaniVAF_pVAF.pdf"), width = 5, height = 5)

PLOT1 = ggplot(POPVAR, aes(mean_iVAF, pVAF))
PLOT1 = PLOT1 + geom_point(alpha = 2/10, stroke = 0, aes(size = 0.5))
PLOT1 = PLOT1 + theme_bw()
PLOT1 = PLOT1 + theme(legend.position="none")
PLOT1 = PLOT1 + xlab('Mean Intra-individual Variant Allele Frequency')
PLOT1 = PLOT1 + ylab('Population Variant Allele Frequency')
PLOT1

dev.off()
# ------------------------------------------------------------------ 
  
# Stratify variants by their mean Intra-individual allele frequency
# the hypothesis was that there are two classes of variants;
# Directional Variants, which can range from 0 - 1
# Stabilized Variants, which are maintained in a narrower range

# strat0_2 =  which(mean_iVAF <= 0.2)
# strat2_4 =  which(mean_iVAF > 0.2 & mean_iVAF <= 0.4)
# strat4_6 =  which(mean_iVAF > 0.4 & mean_iVAF <= 0.6)
# strat6_8 =  which(mean_iVAF > 0.6 & mean_iVAF <= 0.8)
# strat8_1 =  which(mean_iVAF > 0.8)


Stratify = function(iVAF, minVAF, maxVAF){
  
  # calculate the average intra-individual VAF for each variant
  mean_iVAF = apply(iVAF,1, mean, na.rm = T)
  
  # Subselect (index) mean iVAF between a range of values
  strat_lim = which(mean_iVAF > minVAF & mean_iVAF <= maxVAF)
  
  # Actually subselect by stratification
  iVAF_str = apply(iVAF[strat_lim,], 1, sort, decreasing = TRUE)
  
  N_strats = length(iVAF_str) # number of variants in this stratification
  N_samples = length(iVAF[1,]) # number of samples in the analysis (people)
  
  iVAF_str_matrix = matrix( rep(0, N_strats * N_samples), nrow = N_strats)
  
  for (X in 1:N_strats){
    LINE_VALUES = unlist(iVAF_str[X])
    length(LINE_VALUES) = N_samples
    LINE_VALUES[is.na(LINE_VALUES)] <- 0
    
    iVAF_str_matrix[X,] = iVAF_str_matrix[X,] + LINE_VALUES
  }
  
  return(iVAF_str_matrix)
  
}

Stratify_rownames = function(iVAF, minVAF, maxVAF){
  # calculate the average intra-individual VAF for each variant
  mean_iVAF = apply(iVAF,1, mean, na.rm = T)
  
  # Subselect (index) mean iVAF between a range of values
  strat_lim = which(mean_iVAF > minVAF & mean_iVAF <= maxVAF)
  
  return(names(strat_lim))
}

StratifyPlot = function(iVAF, minVAF, maxVAF){
  
  ## Stratify variants by intra-individual variant allele frequency
  iVAF_SUB_matrix = data.frame(Stratify(iVAF, minVAF, maxVAF), row.names = Stratify_rownames(iVAF, minVAF, maxVAF))
  
  # Shape data to be plotted
  iVAF_SUB_matrix = melt(iVAF_SUB_matrix)
  iVAF_SUB_matrix$rowid = Stratify_rownames(iVAF, minVAF, maxVAF)

    # Plot--------------------------------------------------------------
    # Mean Intra-individual Variant Allele Frequency vs.
    # Population-level Variant Allele Frequency (called vs not called)
    plotTitle = paste0("Mean VAF: ",minVAF," - ",maxVAF)
  
    PLOT2 = ggplot(iVAF_SUB_matrix, aes(variable, value, group = factor(rowid)))
    PLOT2 = PLOT2 + geom_line(aes(color = factor(rowid)))
    PLOT2 = PLOT2 + theme_bw()
    PLOT2 = PLOT2 + theme(legend.position="none")
    PLOT2 = PLOT2 + ylim(0,1)
    PLOT2 = PLOT2 + scale_x_discrete(breaks = c(1,N_pop))
    PLOT2 = PLOT2 + ggtitle(plotTitle)
    #PLOT2 = PLOT2 + ylab('Intra-individual Variant Allele Frequency')
    #PLOT2 = PLOT2 + xlab('Genomes')
    ## -----------------------------------------------------------------
    #PLOT2
    
    return(PLOT2)
}


# Plot Stratify for multiple ranges
minVAF=0 # initialize with min VAF at 0
nStratification=5 # How many equal parts to divide the VAF from 0-1

# range width of each stratification
widthStratification=1/nStratification

for (nS in 1:nStratification){

  maxVAF = minVAF + widthStratification
  
  # Generate Plot
  PLOT_VAF = StratifyPlot(iVAF, minVAF, maxVAF)
  
  # Open PDF file to write to
  pdf(file = paste0(outPrefix,".",nS,".strat.pdf"),
     width = 5, height = 5)
  
    print(PLOT_VAF)
  
  dev.off()
  
  minVAF = minVAF + widthStratification
}


# Heatmap ================================================
# Log transform depth-data at each variant
logDP = log(DP)

# # chrom from tutorial

heatmap.bp(iVAF, cbarplot = F, rbarplot = F, rlabels = F)

heatmap.bp(logDP, cbarplot = F, rbarplot = F, rlabels = F)

```
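
To make the two frequencies in the script concrete, here is a small worked sketch of the arithmetic, written in shell/awk rather than R so it can be run without vcfR. It assumes, following the script's "reference allele-only depth" comment, that only the first value of the `AD` field is used. The function names `ivaf` and `pvaf` are placeholders of mine.

```shell
# Intra-individual variant allele frequency for one sample at one site:
#   iVAF = (DP - refAD) / DP
# $1 = DP, $2 = AD field as "ref,alt[,alt...]"
ivaf() {
    awk -v dp="$1" -v ad="$2" 'BEGIN {
        split(ad, a, ",")
        if (dp > 0) printf "%.2f\n", (dp - a[1]) / dp
        else        print "NA"
    }'
}

# Population frequency of a called variant:
#   pVAF = genomes with a call / total genomes
# Arguments: one iVAF value per genome, NA where the variant was not called
pvaf() {
    printf '%s\n' "$@" |
        awk '$1 != "NA" { called++ } END { printf "%.2f\n", called / NR }'
}
```

For example, `ivaf 100 52,48` gives 0.48, and `pvaf 0.48 NA 1.00 NA` gives 0.50. Note that pVAF here only says in how many genomes the variant was called; a genome with NA may be reference-only or simply lack data.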


#### Output from vcfR analysis + plotVCF

Summary Statistics (read-depth 30 filtered) for 30 genomes
![Summary Statistics (read-depth 30) for 30 genomes](../figure/20170308_summary_30g.png)

Intra-genomic Variant Allele Frequency (VAF) vs. Population VAF
![iVAF vs. pVAF](../figure/20170308_30g_iVAF_v_pVAF.png)

- Bottom right panel is iVAF vs. pVAF for every variant in the dataset.
- Note the variants that are variable within a high fraction of genomes, namely the point at (0.48,1.0), which is the 28S.A49G variant. It is present in 100% of genomes, which is evidence for stabilizing selection of this allele in all humans.
- The line graphs are a little involved. Variants were stratified by their average intra-genomic VAF. This is best exemplified in the 0.4 - 0.6 graph: the vast majority of individual variants (lines) are present at a high intra-genomic frequency (y-axis) in a few individuals (x-axis) and absent in the others. The exception is again 28S.A49G, which is intra-genomically variant in all genomes.
- Keep in mind this doesn't account for 'allelic dropout' when sequencing depth is poor over high-GC regions, so other 'stabilized' intra-genomic variants may have individual samples where the variant looks 'fixed' or 'absent'. This needs to be refined a bit, but it's incredibly strong evidence for paralogous rRNA in humans.
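
One possible refinement for the dropout caveat, sketched under my own assumptions: report iVAF for a sample only when its depth at that site reaches a minimum, and print NA otherwise, so low-coverage samples are not read as 'fixed' or 'absent'. The input layout (sample, DP, reference AD, tab-separated) and the function name `mask_low_depth` are hypothetical; the threshold of 30 in the example only echoes the DP30 filter used earlier.

```shell
# Sketch: mask iVAF at low depth.
# $1    = minimum depth
# stdin = sample <tab> DP <tab> reference-allele depth (hypothetical layout)
mask_low_depth() {
    awk -F'\t' -v min="$1" '{
        dp = $2 + 0
        if (dp > 0 && dp >= min + 0) printf "%s\t%.2f\n", $1, (dp - $3) / dp
        else                         printf "%s\tNA\n", $1
    }'
}

# Example: printf 'A\t100\t52\nB\t4\t4\n' | mask_low_depth 30
```

In the example, sample A keeps its iVAF of 0.48 while sample B, with a depth of 4, is reported as NA instead of 0.00.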

30 Genomes - Intra-genomic variant allele frequency for each called variant
![30g iVAF heatmap](../figure/20170308_30g_iVAF_hm.png)

30 Genomes - Read depth of each called variant.
![30g DP heatmap](../figure/20170308_30g_log_DP_hm.png)

- Column = individual genome
- Row = individual variants
- Scale = logDP 1 - 3200; iVAF 0.0 - 1.0

### Full dataset - 107 genomes

I moved the joined 30 genomes VCF file (and renamed it) to the 30g folder in ~/Crown/data/hgr1_vcf

Download the complete set of .vcf files from 104 genomes and re-analyze as above.


In [2]:
# Move to hgr1_vcf folder
cd ~/Crown/data/hgr1_vcf/; ls
mkdir -p 100g; cd 100g
mkdir -p vcf; cd vcf

date

aws s3 ls s3://crownproject/1kg_hgr1/ | grep ".vcf" -

aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.vcf"
aws s3 cp s3://crownproject/1kg_hgr1/ ./ --recursive --exclude "*" --include "*.vcf.idx"


rm testGenome.hgr1.vcf testGenome.hgr1.vcf.idx

30g.dp30.vcfplot    30genomes_DP30.bcf.csi  30genomes_merged.vcf  plot
30genomes_DP30.bcf  30genomes_DP30.vcf	    30g.vcfplot		  plot_dp30
Fri Mar 10 10:19:47 PST 2017
2017-03-08 12:31:14      39989 HG00128.hgr1.vcf
2017-03-08 12:31:15        781 HG00128.hgr1.vcf.idx
2017-03-08 12:41:08      26601 HG00139.hgr1.vcf
2017-03-08 12:41:09        282 HG00139.hgr1.vcf.idx
2017-03-08 13:53:28      60138 HG00234.hgr1.vcf
2017-03-08 13:53:29       2300 HG00234.hgr1.vcf.idx
2017-03-08 12:17:24      22863 HG00253.hgr1.vcf
2017-03-08 12:17:24        282 HG00253.hgr1.vcf.idx
2017-03-08 00:48:57      41492 HG00337.hgr1.vcf
2017-03-08 00:48:57        782 HG00337.hgr1.vcf.idx
2017-03-08 00:15:02      67236 HG00353.hgr1.vcf
2017-03-08 00:15:02       4325 HG00353.hgr1.vcf.idx
2017-03-08 00:13:24      49655 HG00358.hgr1.vcf
2017-03-08 00:13:24        780 HG00358.hgr1.vcf.idx
2017-03-08 01:07:43      27207 HG00362.hgr1.vcf
2017-03-08 01:07:43        281 HG00362.hgr1.vcf.idx
2017-03-07 14

In [7]:
# convert files to BCF
cd ~/Crown/data/hgr1_vcf/100g_raw/vcf
mkdir -p ../bcf #bcf storage folder

for FILE in *.vcf
do
    # Convert VCF to BCF File
    bcftools view -O b "$FILE" -o "../bcf/$FILE.bcf"
    bcftools index "../bcf/$FILE.bcf"
done

cd ../bcf

bcftools merge $(ls *.bcf) -o ../100g_hgr1_v1.vcf
bcftools view -O b ../100g_hgr1_v1.vcf -o 100g_hgr1_v1.bcf
bcftools index 100g_hgr1_v1.bcf

# Filter for variants with 30x coverage
cd ..

bcftools view -i "DP>=30" 100g_hgr1_v1.vcf -o 100g_hgr1_v1.dp30.vcf
bcftools index 100g_hgr1_v1.dp30.vcf


Failed to open 100g_hgr1_v1.bcf: could not load index
[E::main_vcfindex] unknown filetype; expected bgzip compressed VCF or BCF
[E::main_vcfindex] was the VCF/BCF compressed with bgzip?


In [8]:
## vcfplot (summary stats)
cd ~/Crown/data/hgr1_vcf/100g_raw/

# nonDP30 filtered
bcftools stats 100g_hgr1_v1.vcf > 100g_hgr1_v1.vcfplot
plot-vcfstats -p plot/100g 100g_hgr1_v1.vcfplot

# DP30 filtered
bcftools stats 100g_hgr1_v1.dp30.vcf > 100g_hgr1_v1.dp30.vcfplot
plot-vcfstats -p plot_dp30/100g 100g_hgr1_v1.dp30.vcfplot


Parsing bcftools stats output: 100g_hgr1_v1.vcfplot
Plotting graphs: python plot/100g-plot.py
Creating PDF: pdflatex 100g-summary.tex >100g-plot-vcfstats.log 2>&1
Finished: plot/100g-summary.pdf
Parsing bcftools stats output: 100g_hgr1_v1.dp30.vcfplot
Plotting graphs: python plot_dp30/100g-plot.py
Creating PDF: pdflatex 100g-summary.tex >100g-plot-vcfstats.log 2>&1
Finished: plot_dp30/100g-summary.pdf


In [10]:
# I'd like to rename the files based on their Super-population_Population_ID
# so they alpha-sort by world population
# I'll likely set this up in Excel (a conversion table)

cd ~/Crown/data/hgr1_vcf/
mkdir -p renamed_100g

#cp -r 100g_raw/vcf/*.vcf renamed_100g/vcf/
ln -s $PWD/100g_raw/vcf/ $PWD/renamed_100g/vcf

cd renamed_100g



In [11]:
# I made a name conversion table in the main spreadsheet
cat nameConversion.txt
cp nameConversion.txt ../

# I'll be renaming files/ID to
# <SuperPopulation_Number>.<SuperPopulation>.<Population>_<Individual>
#
# This means the vcf files will alpha-sort into populations easily
# as opposed to the chronological order in which they were entered into
# the project

## For figure make the Utah Trio super-group 0 (they will be 3 genomes at the top)
## in ~/Crown/data/hgr1_vcf/renamed_100g
##
## > nameConversion_figure.txt
##

NA12878_pp	CEU	EUR	4	4.EUR.CEU_NA12878_pp
NA12891_pp	CEU	EUR	4	4.EUR.CEU_NA12891_pp
NA12892_pp	CEU	EUR	4	4.EUR.CEU_NA12892_pp
HG02283	ACB	AFR	1	1.AFR.ACB_HG02283
HG02343	ACB	AFR	1	1.AFR.ACB_HG02343
HG02508	ACB	AFR	1	1.AFR.ACB_HG02508
HG02479	ACB	AFR	1	1.AFR.ACB_HG02479
NA20357	ASW	AFR	1	1.AFR.ASW_NA20357
NA20359	ASW	AFR	1	1.AFR.ASW_NA20359
NA20362	ASW	AFR	1	1.AFR.ASW_NA20362
NA19984	ASW	AFR	1	1.AFR.ASW_NA19984
HG03607	BEB	SAS	5	5.SAS.BEB_HG03607
HG03604	BEB	SAS	5	5.SAS.BEB_HG03604
HG04162	BEB	SAS	5	5.SAS.BEB_HG04162
HG03616	BEB	SAS	5	5.SAS.BEB_HG03616
HG02190	CDX	EAS	3	3.EAS.CDX_HG02190
HG00851	CDX	EAS	3	3.EAS.CDX_HG00851
HG00978	CDX	EAS	3	3.EAS.CDX_HG00978
HG00982	CDX	EAS	3	3.EAS.CDX_HG00982
NA06985	CEU	EUR	4	4.EUR.CEU_NA06985
NA11881	CEU	EUR	4	4.EUR.CEU_NA11881
NA12234	CEU	EUR	4	4.EUR.CEU_NA12234
NA12283	CEU	EUR	4	4.EUR.CEU_NA12283
NA18622	CHB	EAS	3	3.EAS.CHB_NA18622
NA18632	CHB	EAS	3	3.EAS.CHB_NA18632
NA18533	CHB	EAS	3	3.EAS.CHB_NA18533
NA18647	CHB	EAS	3	3.

In [18]:
# Iterate through every VCF file in the VCF folder;
# rename the file; rename the ID tag in the file
# create VCF index for it
cd ~/Crown/data/hgr1_vcf/renamed_100g/
mkdir -p renaming

cp -r vcf/* renaming/

cd renaming

VCF=$(ls *.vcf)

for FILE in $(ls *.vcf)
do
    originalID=$(echo $FILE | cut -f1 -d'.' -)
    
    convertID=$(grep $originalID ../nameConversion.txt | cut -f 5 - )
    

    # Underscores are removed from contigID so search without it
    dropUnderScore=$(echo $originalID | cut -f1 -d'_' -)
    convertID_ds=$(echo $convertID | cut -f1 -d'_' -)
    
    # Within VCF file, convert ID
    sed -i "s/\t$dropUnderScore/\t$convertID_ds/g" $FILE
    
    # Rename the file itself (retain underscores)
    mv $FILE $convertID.vcf
    
    # Convert to bcf file
    bcftools view -O b $convertID.vcf -o $convertID.bcf
    bcftools index $convertID.bcf
done


bcftools merge $(ls *.bcf) -o ../100g_hgr1_pop.vcf
cd ..

gzip 100g_hgr1_pop.vcf

rm -r renaming
#bcftools view -O b 100g_hgr1_pop.vcf -o 100g_hgr1_pop.bcf
#bcftools index 100g_hgr1_pop.bcf

[E::bcf_hdr_add_sample] Empty sample name: trailing spaces/tabs in the header line?
Aborted (core dumped)
[E::hts_open_format] fail to open file '.bcf'
Failed to read .bcf
[E::bcf_hdr_add_sample] Empty sample name: trailing spaces/tabs in the header line?
Aborted (core dumped)
[E::hts_open_format] fail to open file '.bcf'
Failed to read .bcf
[E::bcf_hdr_add_sample] Empty sample name: trailing spaces/tabs in the header line?
Aborted (core dumped)
[E::hts_open_format] fail to open file '.bcf'
Failed to read .bcf
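
The three `Empty sample name` / `'.bcf'` failures above are consistent with `$convertID` coming back empty for three files, so the `sed` writes an empty sample name and the file is renamed to `.vcf`. Separately, `cut -f1 -d'_'` applied to a new name such as `4.EUR.CEU_NA12878_pp` returns `4.EUR.CEU`, which drops the individual ID. Below is a hedged sketch of a stricter lookup, assuming the five-column tab-separated layout of `nameConversion.txt` shown earlier; `lookup_id` and `strip_pp` are hypothetical helper names.

```shell
# Sketch: exact-match lookup of the new name (column 5) by original ID (column 1).
# $1 = original ID, $2 = conversion table (tab-separated)
lookup_id() {
    awk -F'\t' -v id="$1" '$1 == id { print $5; exit }' "$2"
}

# Sketch: remove only a trailing "_pp", keeping the rest of the name intact.
strip_pp() {
    printf '%s\n' "${1%_pp}"
}

# Usage inside the renaming loop (not run here):
# for FILE in *.vcf; do
#     originalID=${FILE%%.*}
#     convertID=$(lookup_id "$originalID" ../nameConversion.txt)
#     if [ -z "$convertID" ]; then
#         echo "no table entry for $originalID, skipping" >&2
#         continue
#     fi
#     ...
# done
```

Skipping unmatched files keeps the run from producing `.vcf` and `.bcf`. For the in-file sample name, `bcftools reheader -s` may be a safer route than `sed`, though I have not tried it on these files.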


## Discussion

Data loaded into IGV and used to make figure. Roughly...

![Raw 100 genome variants viewed in IGV and made into figure](../figure/20170313_hgr1_100g_raw.png)

The main problem with this data is that it doesn't distinguish 'no data' from 'reference allele only'. The pipeline needs to be re-run with g.vcf output.
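
A sketch of what the re-run could look like, assuming the pipeline stays on GATK 3.x (which the `-T HaplotypeCaller` syntax in the Introduction suggests): the same call with `--emitRefConfidence GVCF` added, so reference-confidence records are emitted alongside variant sites. The function name `call_gvcf` and the `.g.vcf` output name are my assumptions; the other arguments are copied from the Introduction.

```shell
# Sketch: HaplotypeCaller in GVCF mode for one library.
# $1 = library name (e.g. HG00478)
call_gvcf() {
    LIBRARY=$1
    java -Xmx12G -jar /home/ubuntu/software/GenomeAnalysisTK.jar \
        -R hgr1.gatk.fa -T HaplotypeCaller \
        -ploidy 2 --max_alternate_alleles 6 \
        --emitRefConfidence GVCF \
        -I "$LIBRARY.hgr1.bam" -o "$LIBRARY.hgr1.g.vcf"
}
```

The per-sample g.vcf files would then need a joint genotyping step before the downstream analysis here, which is not sketched.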