# Reference bias using expected vs observed # of bases with no coverage

* https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf
* https://en.wikipedia.org/wiki/DNA_sequencing_theory#Early_uses_derived_from_elementary_probability_theory
* http://seqanswers.com/forums/showpost.php?p=161353&postcount=2

The problem can be boiled down to the following question: given mean sequence coverage $X$, what proportion of sites do we expect to have no coverage at all? If per-site coverage is modelled as $Poisson(\lambda = X)$, the expected proportion of sites with no coverage is $P(k = 0) = e^{-X}$. Under ideal conditions and an infinite number of sites, the empirical proportion of sites with no coverage should be exactly equal to this value.
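
As a quick sanity check of this reasoning, the sketch below compares the Poisson probability of zero coverage with $e^{-\lambda}$ and with simulated sites. The coverage values, the seed and the number of simulated sites are arbitrary assumptions chosen for illustration; they are not taken from the samples analysed here.

```r
# Illustrative mean coverages (assumed values, not taken from the data)
lambda <- c(0.5, 1, 2, 5)

# Expected fraction of sites with zero coverage under a Poisson model
expected_zero <- dpois(0, lambda = lambda)

# P(k = 0) for a Poisson distribution reduces to exp(-lambda)
stopifnot(isTRUE(all.equal(expected_zero, exp(-lambda))))

# With many simulated sites, the empirical zero fraction approaches the expectation
set.seed(42)
n_sites <- 1e6
simulated_zero <- sapply(lambda, function(l) mean(rpois(n_sites, l) == 0))

data.frame(lambda, expected_zero, simulated_zero)
```

The simulated and expected columns should agree closely, which is the "ideal conditions" baseline against which the real samples are compared.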

What is the proportion of sites with no coverage in our samples? Are there any worrying differences between the Neanderthal and Denisovan samples? Keep in mind that _Denisova 4_ and _Denisova 8_ have almost the same TMRCA, which is significantly different from that of the Neanderthals, who in turn give consistent values among themselves.

In [1]:
library(tidyverse)
library(glue)
library(here)
suppressPackageStartupMessages({library(rtracklayer); library(GenomicRanges)})

devtools::load_all(".")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1     ✔ purrr   0.3.0
✔ tibble  2.1.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘glue’

The following object is masked from ‘package:dplyr’:

    collapse

here() starts at /mnt/expressions/mp/ychr
“no function found corresponding to methods exports from ‘GenomicRanges’ for: ‘concatenateObjects’”
Loading ychr


In [2]:
cov_df <- readRDS(here("data/rds/cov_df.rds"))

In [3]:
samples <- unique(cov_df$name)

In [4]:
samples

In [5]:
avg_cov <- cov_df %>%
    group_by(name, regions) %>%
    summarise(avg_coverage = mean(coverage)) %>%
    spread(regions, avg_coverage) %>%
    rename(avg_cov = full) %>%
    select(-exome, -lippold) %>%
    filter(!str_detect(name, "elsidron"))

In [16]:
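avg_cov <- avg_cov %>%
    mutate(
        # observed: fraction of this sample's rows in cov_df with zero coverage
        observed_zero = round(map_dbl(name, ~ mean(filter(cov_df, name == .x)$coverage == 0)), 5),
        # expected: Poisson probability of zero coverage, given the sample's average coverage
        expected_zero = round(dpois(x = 0, lambda = avg_cov), 5)
    )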
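
To illustrate how the observed-minus-expected difference behaves, here is a toy base-R sketch with two simulated samples of roughly equal mean coverage. Both samples, the 10% fraction of never-covered sites and the helper `zero_summary()` are hypothetical and not part of this analysis. The sketch only shows that an excess of uncovered sites produces a positive difference; it makes no claim about what causes the excess in the real data.

```r
set.seed(1)
n_sites <- 1e6

# Hypothetical sample "ideal": every site has Poisson coverage with mean 2
ideal <- rpois(n_sites, lambda = 2)

# Hypothetical sample "masked": 10% of sites can never be covered; the rest
# are Poisson with mean 2 / 0.9, so the overall mean coverage is still about 2
reachable <- rbinom(n_sites, size = 1, prob = 0.9)
masked <- reachable * rpois(n_sites, lambda = 2 / 0.9)

# Same quantities as in the cell above, computed for one coverage vector
zero_summary <- function(coverage) {
    avg_cov <- mean(coverage)
    observed_zero <- mean(coverage == 0)
    expected_zero <- dpois(0, lambda = avg_cov)
    c(avg_cov = avg_cov, observed_zero = observed_zero,
      expected_zero = expected_zero, difference = observed_zero - expected_zero)
}

toy <- rbind(ideal = zero_summary(ideal), masked = zero_summary(masked))
round(toy, 4)
```

In this toy setting the `ideal` row has a difference close to zero, while the `masked` row has a clearly positive one even though both have nearly the same average coverage.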

In [17]:
avg_cov %>%
    mutate(difference = observed_zero - expected_zero) %>%
    arrange(-difference) %>%
    print(n = Inf)

# A tibble: 27 x 5
# Groups:   name [27]
   name              avg_cov observed_zero expected_zero difference
   <chr>               <dbl>         <dbl>         <dbl>      <dbl>
 1 spy1                0.910      0.490           0.402    0.0872
 2 den4                1.66       0.260           0.189    0.0706
 3 den8                3.60       0.0975          0.0272   0.0702
 4 mez2               14.6        0.0400          0        0.0400
 5 shotgun_mez2        0.888      0.434           0.412    0.0222
 6 shotgun_spy1        0.522      0.611           0.593    0.0172
 7 S_Mbuti-1          20.6        0.00152