# Reference bias: expected vs observed number of bases with no coverage

* https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
* https://en.wikipedia.org/wiki/DNA_sequencing_theory#Early_uses_derived_from_elementary_probability_theory
* http://seqanswers.com/forums/showpost.php?p=161353&postcount=2

The problem boils down to the following question: given an average sequence coverage $X$, what proportion of sites do we expect to have no coverage at all? If per-site coverage is modelled as $Poisson(\lambda = X)$, this proportion is the probability of the zero class, $P(0) = e^{-X}$. Under ideal conditions and with an infinite number of sites, the empirical proportion of sites with no coverage should be exactly equal to this value.
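As a quick sanity check of this reasoning, the sketch below simulates per-site coverage under an idealized Poisson model and compares the observed fraction of uncovered sites with $e^{-\lambda}$. The coverage value, the number of simulated sites, and the seed are arbitrary assumptions chosen for illustration; none of them come from the real data.

```r
# Idealized model: coverage at every site is an independent Poisson draw.
set.seed(42)                      # arbitrary seed, for reproducibility only
lambda  <- 0.52                   # illustrative average coverage (assumption)
n_sites <- 1e6                    # arbitrary number of simulated sites

coverage <- rpois(n_sites, lambda)

expected_zero <- dpois(0, lambda) # Poisson zero class, identical to exp(-lambda)
observed_zero <- mean(coverage == 0)

c(expected = expected_zero, observed = observed_zero)
```

In the simulation the two numbers agree closely, which is the behaviour the empirical data are compared against.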

What is the proportion of sites with no coverage in our samples? Are there any worrying differences between the Neanderthal and Denisovan samples? Keep in mind that _Denisova 4_ and _Denisova 8_ have almost the same TMRCA, which is significantly different from that of all the Neanderthals, who in turn give values consistent with each other.

In [1]:
library(tidyverse)
library(glue)
library(here)
suppressPackageStartupMessages({library(rtracklayer); library(GenomicRanges)})

devtools::load_all(".")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1     ✔ purrr   0.3.0
✔ tibble  2.1.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘glue’

The following object is masked from ‘package:dplyr’:

    collapse

here() starts at /mnt/expressions/mp/ychr
“no function found corresponding to methods exports from ‘GenomicRanges’ for: ‘concatenateObjects’”
Loading ychr


In [2]:
cov_df <- readRDS(here("data/rds/cov_df.rds"))

In [3]:
samples <- unique(cov_df$name)

In [4]:
samples

In [79]:
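# Average coverage per sample, computed separately for each set of regions
avg_cov <- cov_df %>%
    group_by(name, regions) %>%
    summarise(avg_coverage = mean(coverage)) %>%
    spread(regions, avg_coverage) %>%          # one column per set of regions
    rename(avg_cov = full) %>%                 # keep only the "full" regions
    select(-exome, -lippold) %>%
    filter(!str_detect(name, "elsidron"))      # drop samples named *elsidron*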

In [84]:
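avg_cov <- avg_cov %>%
    mutate(
        # observed fraction of sites with zero coverage, rounded to 4 decimal places
        observed_zero = round(map_dbl(name, ~ mean(filter(cov_df, name == .x)$coverage == 0)), 4),
        # expected fraction under a Poisson model with lambda = average coverage
        expected_zero = round(dpois(x = 0, lambda = avg_cov), 4)
    )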
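The per-sample calculation above can also be written as a single grouped summary. The sketch below uses a small made-up `toy_cov` table (hypothetical data, with the same `name`, `regions`, and `coverage` columns that `cov_df` appears to have) and restricts the calculation to the `full` regions, so that the observed zero fraction and the average coverage come from the same sites. Whether that restriction matches the intent of the analysis is an assumption; filtering on `name` alone would pool rows from every set of regions.

```r
library(dplyr)

# Hypothetical toy data standing in for cov_df: one row per site and region set.
toy_cov <- data.frame(
    name     = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
    regions  = c("full", "full", "full", "full", "exome", "exome",
                 "full", "full", "full", "full"),
    coverage = c(0, 1, 2, 0, 3, 4, 0, 0, 0, 4),
    stringsAsFactors = FALSE
)

zero_summary <- toy_cov %>%
    filter(regions == "full") %>%             # same sites for both statistics
    group_by(name) %>%
    summarise(
        avg_cov       = mean(coverage),       # estimate of the Poisson lambda
        observed_zero = mean(coverage == 0),  # fraction of uncovered sites
        expected_zero = dpois(0, avg_cov)     # equals exp(-avg_cov)
    ) %>%
    mutate(obs_vs_exp_ratio = observed_zero / expected_zero)

zero_summary
```

Keeping both statistics in one `summarise()` call makes it harder for them to drift apart, for example by being computed over different subsets of sites.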

In [89]:
avg_cov %>% mutate(obs_vs_exp_ratio = observed_zero / expected_zero) %>% arrange(obs_vs_exp_ratio)

| name | avg_cov | observed_zero | expected_zero | obs_vs_exp_ratio |
|---|---|---|---|---|
| shotgun_spy1 | 0.5218778 | 0.6106 | 0.5934 | 1.028986 |
| shotgun_mez2 | 0.8879353 | 0.4337 | 0.4115 | 1.053949 |
| spy1 | 0.9103632 | 0.4896 | 0.4024 | 1.2167 |
| den4 | 1.6634329 | 0.2601 | 0.1895 | 1.372559 |
| den8 | 3.6038601 | 0.0975 | 0.0272 | 3.584559 |
| a00 | 21.2707718 | 0.0005 | 0.0 | inf |
| mez2 | 14.645643 | 0.04 | 0.0 | inf |
| S_BedouinB-1 | 21.961267 | 0.0001 | 0.0 | inf |
| S_Dai-2 | 20.0761092 | 0.0001 | 0.0 | inf |
| S_Dinka-1 | 21.2868001 | 0.0001 | 0.0 | inf |
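
The `inf` ratios in the high-coverage rows come from rounding: once the average coverage exceeds roughly 10, $e^{-\lambda}$ is smaller than the four decimal places kept in `expected_zero`, so the ratio divides by zero. A minimal sketch of the effect, using the `mez2` values from the table (the variable names are made up for illustration):

```r
lambda        <- 14.645643   # avg_cov of mez2, taken from the table
observed_zero <- 0.04        # observed_zero of mez2, taken from the table

expected_exact   <- dpois(0, lambda)          # tiny but strictly positive
expected_rounded <- round(expected_exact, 4)  # rounds to exactly 0

ratio_rounded <- observed_zero / expected_rounded  # division by zero gives Inf
ratio_exact   <- observed_zero / expected_exact    # finite, but very large

c(rounded = ratio_rounded, exact = ratio_exact)
```

Comparing the unrounded values, or their logarithms, avoids the artificial `inf` entries, although for these samples the ratio is still several orders of magnitude above 1.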
