# Comparing various genotyping strategies

In [1]:
library(tidyverse)
library(here)

devtools::load_all(".")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
here() starts at /mnt/expressions/mp/archaic-ychr
Loading ychr


# A00

We can clearly see that the strict consensus produces the smallest number of genotypes. The difference is fairly small because the rate of sequencing errors is expected to be relatively low. However, the difference between strict consensus and tolerance calling is much more extreme for the ancient Y chromosomes below, because a significant proportion of reads carry false alleles caused by aDNA damage.
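
To make the difference between the two consensus modes concrete, here is a minimal sketch of a consensus caller in base R. It is not the actual bam-caller used to produce these VCFs: the function name `call_consensus()`, its `cutoff` argument, and the assumption that the cutoff is inclusive (a call is made when the majority allele is supported by at least that proportion of reads) are all made up for illustration.

```r
# Hypothetical helper, not part of the ychr package or the real bam-caller.
# bases:  character vector of bases observed in the reads covering one site
# cutoff: minimum proportion of reads that must support the majority allele
#         (assumed inclusive; 1 = strict consensus, 0.9 = 90% tolerance)
call_consensus <- function(bases, cutoff = 1) {
    counts <- table(bases)
    n <- as.vector(counts)
    best <- which.max(n)
    # call the majority allele only if enough reads support it
    if (n[best] >= cutoff * length(bases)) names(counts)[best] else NA_character_
}

one_mismatch <- c(rep("C", 19), "T")   # 19 concordant reads, 1 error or damage
call_consensus(one_mismatch, cutoff = 1)     # no call: one mismatch is enough
call_consensus(one_mismatch, cutoff = 0.9)   # calls "C": 19 of 20 reads agree

mixture <- rep(c("C", "T"), 5)         # 50:50 mixture of two alleles
call_consensus(mixture, cutoff = 0.9)        # no call: neither allele reaches 90%
```

Under these assumptions, a site that gets a strict call always gets the same call under the 90% tolerance, while ambiguous sites are left uncalled in both modes.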

In [2]:
read_vcf(here("test/a00_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [3]:
read_vcf(here("test/a00_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [4]:
read_vcf(here("test/a00_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [5]:
read_vcf(here("test/a00_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [6]:
a00 <- read_vcf(here("test/genotyping_a00.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [7]:
nrow(a00)

Sites that bcftools calls differently with and without the BAQ option are very weird, and manual inspection of the alignments shows that it's pretty much impossible to tell which are the true variant sites and which are errors. I don't really understand why bcftools even calls an allele in cases like this, but I guess they decided on a slightly higher false positive rate and put the responsibility for filtering on the user.

I'd say being conservative in these particular cases is a good thing (note that my bam-caller prefers not to call anything at these weird sites). The situation is even worse for the archaic samples below, where we often have a 50:50 mixture of two alleles and bcftools calls one or the other pretty much at random, which is not great.

In [8]:
filter(a00, baq != nobaq)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,2805552,G,A,0,1,,
Y,3715248,A,G,0,1,,
Y,8561368,C,T,0,1,,
Y,8561376,C,T,0,1,,
Y,17029481,G,A,0,1,,
Y,17029484,T,A,0,1,,


In [9]:
filter(a00, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,6


In [10]:
# sites with a 90% tolerance call but no strict consensus call
filter(a00, is.na(cons) & !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(10) %>% ungroup %>% select(-alts)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,9863471,C,T,1,1,,1
Y,18137451,A,T,1,1,,1
Y,7647357,A,G,1,1,,1
Y,17566434,C,A,1,1,,1
Y,23021227,C,T,1,1,,1
Y,22730777,C,T,1,1,,1
Y,8882487,C,A,1,1,,1
Y,23466468,T,C,1,1,,1
Y,8616761,A,G,1,1,,1
Y,19477659,C,A,1,1,,1


In [11]:
filter(a00, tol != baq | tol != nobaq)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


Based on the table right above (which is empty), whenever we call an allele with the 90% tolerance consensus cutoff, the call is consistent with bcftools. So the only calls we are losing are ones that would be unreliable anyway, based on the manual inspection of alignments above.
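
One detail worth keeping in mind when reading an empty table like this: comparisons involving `NA` evaluate to `NA`, and `filter()` drops rows where the condition is `NA`, so sites without a tolerance call can never show up as disagreements. The toy example below (made-up genotypes, not taken from the VCFs) illustrates this with base R's `subset()`, which treats `NA` conditions the same way:

```r
# Toy genotypes invented for illustration only
toy <- data.frame(
    pos   = 1:4,
    baq   = c(1, 0, 1, NA),
    nobaq = c(1, 1, 1, 0),
    tol   = c(1, NA, 0, 1)
)

# NA comparisons give NA, except that NA | TRUE is TRUE
cond <- toy$tol != toy$baq | toy$tol != toy$nobaq
cond

# like dplyr::filter(), subset() keeps only rows where the condition is TRUE
mismatches <- subset(toy, tol != baq | tol != nobaq)
mismatches$pos
```

Here the site at `pos = 2` (no tolerance call) is silently dropped, while the site at `pos = 4` is kept because one of the two comparisons is `TRUE` even though the other is `NA`.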

# Mezmaiskaya 2 (high-coverage archaic, ideal case)

In [12]:
read_vcf(here("test/mez2_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [13]:
read_vcf(here("test/mez2_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

Notice the crazy drop in called SNPs when we require a 100% strict consensus! This is the effect of damage-induced substitutions scattered throughout the aDNA reads.
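
A back-of-the-envelope calculation shows why a strict consensus suffers more as coverage grows. Assume, purely for illustration, that each read independently shows a mismatch (damage or sequencing error) at a given site with probability `p`; the value 0.02 below is made up and not estimated from these data. The chance that all reads agree then shrinks quickly with depth:

```r
p <- 0.02                       # assumed per-read mismatch probability (illustrative)
depth <- c(3, 5, 10, 20, 40)    # number of reads covering a site

# a strict consensus needs every read to agree: probability (1 - p)^depth
p_strict <- (1 - p)^depth
round(data.frame(depth, p_strict), 3)
```

Real damage is concentrated at particular substitution types and read positions, so this independence assumption is only a rough guide to the trend.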

In [14]:
read_vcf(here("test/mez2_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [15]:
read_vcf(here("test/mez2_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [16]:
mez2 <- read_vcf(here("test/genotyping_mez2.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [17]:
nrow(mez2)

In [18]:
filter(mez2, baq != nobaq) %>% sample_n(20)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,6922468,A,G,1,0,,
Y,15946979,G,A,0,1,,
Y,23052718,G,A,0,1,,
Y,23540925,C,T,0,1,,
Y,14860080,A,G,0,1,,
Y,22160922,C,T,1,0,,
Y,22894136,G,A,0,1,,
Y,8363429,G,A,0,1,,
Y,18614096,A,G,0,1,,
Y,9984843,G,A,0,1,,


In [19]:
filter(mez2, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,35
1,0,5


There shouldn't be any difference between strict and less strict consensus at sites called by both:

In [20]:
filter(mez2, cons != tol)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [21]:
# sites with a 90% tolerance call but no strict consensus call
filter(mez2, is.na(cons) & !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(10) %>% ungroup %>% select(-alts)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,21571895,A,G,1,1,,1
Y,15008055,T,G,1,1,,1
Y,22025458,T,C,1,1,,1
Y,8846519,T,G,1,1,,1
Y,23079657,C,T,1,1,,1
Y,17517362,A,G,1,1,,1
Y,22200468,A,G,1,1,,1
Y,13229168,G,C,1,1,,1
Y,8565452,A,G,1,1,,1
Y,17019065,T,G,1,1,,1


The looser 90% tolerance consensus perfectly matches the calls made by bcftools.

In [22]:
filter(mez2, tol != baq | tol != nobaq)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


# Denisova 8 (low-coverage archaic, extreme case)

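At the 90% consensus cutoff we expect the low-coverage genotypes to be the same as under the strict 100% consensus: at a site covered by fewer than 10 reads, a single mismatching read already drops the agreement below 90%.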
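
The arithmetic behind this expectation can be sketched as follows, assuming the cutoff is inclusive (a call needs at least 90% of the reads to agree; this is an assumption about the caller, not something verified here). The number of mismatching reads that can be tolerated at depth `d` is then the integer part of `d / 10`:

```r
depth <- 3:12                 # 3 is the mindp filter used throughout this notebook
tolerated <- depth %/% 10     # mismatching reads allowed at an inclusive 90% cutoff
data.frame(depth, tolerated)
```

Below 10 reads no mismatch is tolerated, so the 90% cutoff behaves exactly like the strict consensus at those sites.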

In [23]:
read_vcf(here("test/den8_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [24]:
read_vcf(here("test/den8_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [25]:
read_vcf(here("test/den8_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [26]:
read_vcf(here("test/den8_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [27]:
den8 <- read_vcf(here("test/genotyping_den8.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [28]:
nrow(den8)

In [29]:
filter(den8, baq != nobaq) %>% sample_n(20)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,15937550,C,T,0,1,,
Y,15211776,C,T,0,1,,
Y,16725174,G,A,0,1,,
Y,15494975,G,A,0,1,,
Y,22921994,G,A,0,1,,
Y,14532528,G,A,1,0,,
Y,8735945,G,A,0,1,,
Y,7367933,G,A,0,1,,
Y,21205887,G,A,0,1,,
Y,14735404,C,T,0,1,,


In [30]:
filter(den8, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,224
1,0,11


In [31]:
filter(den8, cons != tol)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [32]:
# sites with a 90% tolerance call but no strict consensus call
filter(den8, is.na(cons) & !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(3) %>% ungroup %>% select(-alts)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,8396636,A,G,1,1,,1
Y,17517362,A,G,1,1,,1
Y,18759298,A,G,1,1,,1
Y,6761519,G,,0,0,,0
Y,22935742,G,,0,0,,0
Y,15591592,G,,0,0,,0


In [33]:
filter(den8, tol != baq | tol != nobaq)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


# Spy 1 (low-coverage archaic, even more extreme case)

As with Denisova 8, at the 90% consensus cutoff we expect the low-coverage genotypes to be the same as under the strict 100% consensus.

In [34]:
read_vcf(here("test/spy1_baq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [35]:
read_vcf(here("test/spy1_nobaq.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [36]:
read_vcf(here("test/spy1_consensus.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [37]:
read_vcf(here("test/spy1_tolerance.vcf.gz"), mindp = 3, maxdp = 0.98) %>% nrow

In [38]:
spy1 <- read_vcf(here("test/genotyping_spy1.vcf.gz"), mindp = 3, maxdp = 0.98) %>%
    filter(!is.na(baq) | !is.na(nobaq) | !is.na(cons) | !is.na(tol))

In [39]:
nrow(spy1)

In [40]:
filter(spy1, baq != nobaq) %>% sample_n(20)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Y,17906889,G,A,0,1,,
Y,21195481,G,A,0,1,,
Y,21843326,G,A,1,0,,
Y,8586539,C,T,1,0,,
Y,15833781,G,A,1,0,,
Y,18108864,C,T,0,1,,
Y,15668421,G,A,0,1,,
Y,8508441,C,T,0,1,,
Y,22748358,G,A,0,1,,
Y,18184489,C,G,0,1,,


In [41]:
filter(spy1, baq != nobaq) %>% group_by(baq, nobaq) %>% tally

baq,nobaq,n
<dbl>,<dbl>,<int>
0,1,61
1,0,14


In [42]:
filter(spy1, cons != tol)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [43]:
# sites with a 90% tolerance call but no strict consensus call
filter(spy1, is.na(cons) & !is.na(tol)) %>%
    group_by(alts = ALT == "") %>% sample_n(10) %>% ungroup %>% select(-alts)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>


In [44]:
filter(spy1, tol != baq | tol != nobaq)

chrom,pos,REF,ALT,baq,nobaq,cons,tol
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
