# Comparing various genotyping strategies

In [1]:
library(tidyverse)
library(here)

devtools::load_all(".")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
here() starts at /mnt/expressions/mp/ychr
Loading ychr


# A00

We can clearly see that the strict consensus produces the smallest number of genotypes. This is even more extreme for the ancient Y chromosomes below, because a significant proportion of their reads carry false alleles caused by aDNA damage.
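To make the difference between the two consensus strategies concrete, here is a minimal sketch of consensus calling at a single haploid site. The helper `call_consensus()` and its `tolerance` argument are hypothetical names introduced for illustration only; this is not the implementation used to produce the VCF files in this notebook, and the inclusive (`>=`) cutoff is an assumption.

```r
# Hypothetical helper (not part of the ychr package): call a haploid genotype
# from the read bases observed at one site. `tolerance` is the minimum fraction
# of reads that must support the majority allele (1 = strict consensus).
call_consensus <- function(bases, tolerance = 1) {
    if (length(bases) == 0) return(NA_character_)
    counts <- table(bases)
    top_allele <- names(counts)[which.max(counts)]
    top_fraction <- max(counts) / length(bases)
    # Assumption: the cutoff is inclusive
    if (top_fraction >= tolerance) top_allele else NA_character_
}

clean   <- rep("G", 12)
damaged <- c(rep("G", 11), "A")   # one read with a damage-like mismatch

call_consensus(clean)                      # "G"
call_consensus(damaged)                    # NA: strict consensus gives no call
call_consensus(damaged, tolerance = 0.9)   # "G": 11/12 of reads agree
```

Under these assumptions a single discordant read is enough to remove a site from the strict call set, which is one way the strict consensus could end up with the fewest genotypes.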

In [2]:
read_vcf(here("test/a00_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [3]:
read_vcf(here("test/a00_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [4]:
read_vcf(here("test/a00_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [5]:
read_vcf(here("test/a00_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [6]:
a00 <- read_vcf(here("test/genotyping_a00.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [7]:
nrow(a00)

Sites that bcftools calls differently with and without the BAQ option look very odd, and manual inspection of the alignments shows that it is practically impossible to tell which are true variant sites and which are errors. Many look like alignment artifacts. Being conservative in these particular cases seems like a good thing. The situation is even worse for the archaic samples below, where we often see a 50:50 mixture of two alleles and bcftools calls one or the other more or less at random.

In [8]:
filter(a00, baq != nobaq)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,2805552,G,A,0,1,,
Y,3715248,A,G,0,1,,
Y,8561368,C,T,0,1,,
Y,8561376,C,T,0,1,,
Y,17029481,G,A,0,1,,
Y,17029484,T,A,0,1,,


In [9]:
filter(a00, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,6


In [10]:
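# sites without a strict consensus call but with a tolerance call,
# sampled separately for variant and invariant (empty ALT) sites
filter(a00, is.na(cons), !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(10) %>% ungroup %>% select(-alts)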

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,19119198,A,G,1.0,1,,1
Y,10035969,C,T,1.0,1,,1
Y,8388616,A,C,,1,,1
Y,23466468,T,C,1.0,1,,1
Y,3720176,C,T,1.0,1,,1
Y,19504659,A,T,1.0,1,,1
Y,6868118,A,G,1.0,1,,1
Y,7180949,G,A,1.0,1,,1
Y,17566434,C,A,1.0,1,,1
Y,9930635,C,G,1.0,1,,1


In the sites sampled in the table directly above, whenever we call an allele with the 90% tolerance consensus cutoff, the call is consistent with bcftools. So the only calls we lose are those that would be unreliable anyway, based on the manual inspection of alignments above.
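The table above shows only a random sample of sites. A sketch of how the same claim could be checked across all sites is given below, using a toy table with made-up values and the same column names as the genotype tables in this notebook. The object names `gt`, `rescued` and `discordant` are illustrative assumptions.

```r
library(dplyr)

# Toy genotype table (made-up values) with the columns used in this notebook
gt <- tibble(
    pos   = c(100L, 200L, 300L, 400L),
    baq   = c(1, 1, NA, 0),
    nobaq = c(1, 1, 1, 1),
    cons  = c(NA, 1, NA, NA),
    tol   = c(1, 1, 1, NA)
)

# Sites with no strict consensus call but with a tolerance call
rescued <- filter(gt, is.na(cons), !is.na(tol))

# Rescued sites whose tolerance call contradicts a bcftools call;
# a missing bcftools call is not counted as a contradiction
discordant <- filter(
    rescued,
    (!is.na(baq) & baq != tol) | (!is.na(nobaq) & nobaq != tol)
)

nrow(rescued)      # 2
nrow(discordant)   # 0
```

Applied to a real table such as `a00`, an empty `discordant` result would support the statement for every site rather than for a sample of ten.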

# Mezmaiskaya 2 (high-coverage archaic, ideal case)

In [11]:
read_vcf(here("test/mez2_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [12]:
read_vcf(here("test/mez2_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [13]:
read_vcf(here("test/mez2_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [14]:
read_vcf(here("test/mez2_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [15]:
mez2 <- read_vcf(here("test/genotyping_mez2.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [16]:
nrow(mez2)

In [17]:
filter(mez2, baq != nobaq) %>% sample_n(20)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,15946979,G,A,0,1,,
Y,8455563,T,C,1,0,,
Y,3720007,C,A,0,1,,
Y,22894136,G,A,0,1,,
Y,14860080,A,G,0,1,,
Y,18083168,G,A,0,1,,
Y,8363429,G,A,0,1,,
Y,23540925,C,T,0,1,,
Y,19420838,G,A,0,1,,
Y,3719194,C,A,0,1,,


In [18]:
filter(mez2, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,35
1,0,5


In [19]:
filter(mez2, cons != tol)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [20]:
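# sites without a strict consensus call but with a tolerance call,
# sampled separately for variant and invariant (empty ALT) sites
filter(mez2, is.na(cons), !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(10) %>% ungroup %>% select(-alts)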

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,16790509,A,G,1.0,1,,1
Y,6993202,T,C,1.0,1,,1
Y,21070230,A,G,1.0,1,,1
Y,2889149,T,C,1.0,1,,1
Y,2889635,G,C,1.0,1,,1
Y,18127181,A,G,1.0,1,,1
Y,18081586,C,G,1.0,1,,1
Y,19130311,T,C,1.0,1,,1
Y,9642406,A,G,1.0,1,,1
Y,18069115,A,G,1.0,1,,1


# Denisova 8 (low-coverage archaic, extreme case)

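For low-coverage samples, we expect the 90% consensus cutoff to give the same genotypes as the strict 100% consensus: at a depth below 10 reads, a single discordant read already exceeds the 10% tolerance.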
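This expectation follows from simple arithmetic, sketched below. The code assumes that the cutoff is inclusive (a site is called when at least 90% of reads agree); the variable names are illustrative. Integer arithmetic is used to avoid floating-point comparisons at the boundary.

```r
# With one discordant read at depth d, the majority allele has (d - 1) / d
# of the reads. The 90% cutoff tolerates it only if (d - 1) / d >= 9 / 10,
# i.e. (d - 1) * 10 >= 9 * d (assumption: the cutoff is inclusive).
depth <- 3:12
one_mismatch_ok <- (depth - 1L) * 10L >= 9L * depth

data.frame(depth, one_mismatch_ok)

# Smallest depth at which a single discordant read is tolerated
min(depth[one_mismatch_ok])   # 10
```

Below that depth, any discordant read removes the call under both strategies, so the 90% and 100% cutoffs should behave identically.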

In [21]:
read_vcf(here("test/den8_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [22]:
read_vcf(here("test/den8_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [23]:
read_vcf(here("test/den8_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [24]:
read_vcf(here("test/den8_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [25]:
den8 <- read_vcf(here("test/genotyping_den8.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [26]:
nrow(den8)

In [27]:
filter(den8, baq != nobaq) %>% sample_n(20)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,21570706,G,A,0,1,,
Y,17097106,C,T,0,1,,
Y,14533615,T,C,0,1,,
Y,7146238,G,A,0,1,,
Y,13206359,G,A,0,1,,
Y,15021654,C,T,0,1,,
Y,15494975,G,A,0,1,,
Y,19471407,C,T,0,1,,
Y,19119665,G,A,0,1,,
Y,7054697,G,A,0,1,,


In [28]:
filter(den8, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,224
1,0,11


In [29]:
filter(den8, cons != tol)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [30]:
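# sites without a strict consensus call but with a tolerance call,
# sampled separately for variant and invariant (empty ALT) sites
filter(den8, is.na(cons), !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(3) %>% ungroup %>% select(-alts)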

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,8396636,A,G,1,1,,1
Y,17517362,A,G,1,1,,1
Y,18759298,A,G,1,1,,1
Y,9640642,C,,0,0,,0
Y,8859053,C,,0,0,,0
Y,15689844,A,,0,0,,0


# Spy 1 (low-coverage archaic, even more extreme case)

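As with Denisova 8, we expect the 90% consensus cutoff to give the same genotypes as the strict 100% consensus in this low-coverage sample: at a depth below 10 reads, a single discordant read already exceeds the 10% tolerance.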

In [31]:
read_vcf(here("test/spy1_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [32]:
read_vcf(here("test/spy1_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [33]:
read_vcf(here("test/spy1_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [34]:
read_vcf(here("test/spy1_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [35]:
spy1 <- read_vcf(here("test/genotyping_spy1.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [36]:
nrow(spy1)

In [37]:
filter(spy1, baq != nobaq) %>% sample_n(20)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,17180462,C,T,0,1,,
Y,8299933,G,T,0,1,,
Y,15315939,G,A,0,1,,
Y,21843326,G,A,1,0,,
Y,22748900,C,T,0,1,,
Y,15241736,G,A,0,1,,
Y,21242313,C,T,0,1,,
Y,18184489,C,G,0,1,,
Y,8707564,G,A,0,1,,
Y,2740233,C,T,0,1,,


In [38]:
filter(spy1, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,61
1,0,14


In [39]:
filter(spy1, cons != tol)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [40]:
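# sites without a strict consensus call but with a tolerance call,
# sampled separately for variant and invariant (empty ALT) sites
filter(spy1, is.na(cons), !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(10) %>% ungroup %>% select(-alts)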

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
