In [1]:
library(tidyverse)
library(here)

suppressPackageStartupMessages(library(VariantAnnotation))

devtools::load_all(".")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
here() starts at /mnt/expressions/mp/archaic-ychr
Loading ychr


### Read Mez2 genotypes generated by `bam-sample`

In [2]:
bamsample <- read_vcf(here("data/vcf/full_mez2.vcf.gz"), mindp = 3, maxdp = 0.98)

### Read Mez2 genotypes generated by snpAD

In [3]:
path <- here("data/vcf/full_mez2_snpad.vcf.gz")

vcf <- VariantAnnotation::readVcf(path)
gr <- GenomicRanges::granges(vcf)
dp <- VariantAnnotation::geno(vcf)$DP

In [4]:
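# Per sample: keep sites with depth >= 3 and at most the 98th percentile of depth
mask <- apply(dp, 2, function(i) i >= 3 & i <= quantile(i, 0.98, na.rm = TRUE))
# The chimp column is exempt from depth filtering
if ("chimp" %in% colnames(mask))
    mask[, "chimp"] <- TRUE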

In [5]:
gt <- VariantAnnotation::geno(vcf)$GT %>% replace(. == ".", NA) %>% replace(!mask, NA)

In [6]:
length(as.character(GenomicRanges::seqnames(gr)))

In [7]:
length(GenomicRanges::start(gr))

In [8]:
length(as.character(gr$REF))

In [9]:
length(as.character(unlist(gr$ALT)))

In [10]:
elementNROWS(gr$ALT)  %>% table

.
      1       2 
6806673       2 

In [11]:
biallelic_pos <- elementNROWS(gr$ALT) < 2

gt_df <- tibble::as_tibble(gt)[biallelic_pos, ]

info_df <- tibble::tibble(
    chrom = as.character(GenomicRanges::seqnames(gr))[biallelic_pos], 
    pos = GenomicRanges::start(gr)[biallelic_pos],
    REF = as.character(gr$REF)[biallelic_pos], 
    ALT = as.character(unlist(gr$ALT[biallelic_pos, ]))
)

df <- dplyr::bind_cols(info_df, gt_df)
colnames(df) <- str_replace_all(colnames(df), "-", "_")

In [12]:
snpad <- df

### Read pileups

In [13]:
pileups <- read_tsv(here("data/pileup/full_mez2.txt.gz"), col_types = "cicccccc") %>% rename(REF = ref)

In [14]:
head(pileups)

chrom,pos,REF,pileup,A,C,G,T
<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Y,2649811,A,A,1,0,0,0
Y,2649812,A,AA,2,0,0,0
Y,2649813,A,AAA,3,0,0,0
Y,2649814,A,AAAA,4,0,0,0
Y,2649815,A,AAAA,4,0,0,0
Y,2649816,A,AAAA,4,0,0,0
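
The `A`, `C`, `G` and `T` columns appear to be per-base counts of the `pileup` string. A minimal sketch of that relationship, assuming the pileup column holds plain base letters only (no indel or strand symbols); `count_bases` is a hypothetical helper, not part of the `ychr` package:

```r
# Hypothetical helper: tabulate A/C/G/T counts in one pileup string.
count_bases <- function(pileup) {
    bases <- strsplit(toupper(pileup), "")[[1]]
    # factor() with fixed levels keeps zero counts for absent bases
    counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
    setNames(as.integer(counts), names(counts))
}

count_bases("TGTGATGG")   # one A, no C, four G, three T
```

Reading the counts as integers this way would also avoid the `as.numeric()` conversions needed when the columns are parsed as character.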


### Merge all three tables into one

In [15]:
merged <-
    full_join(bamsample, snpad, by = c("chrom", "pos", "REF")) %>%
    left_join(pileups, by = c("chrom", "pos", "REF"))
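
The two join types matter here: the full join keeps every position called by either genotyper, while the left join only annotates those positions with pileups and never adds pileup-only rows. A small sketch on made-up toy tables, using base R `merge()` as a stand-in for `full_join()` and `left_join()` so that it runs without dplyr:

```r
# Toy stand-ins for the three tables (values are made up for illustration).
calls_a <- data.frame(pos = c(1L, 2L), REF = c("A", "C"), mez2 = c(0, 1))
calls_b <- data.frame(pos = c(2L, 3L), REF = c("C", "G"),
                      mez2_snpad = c("1/1", "0/0"))
piles   <- data.frame(pos = c(1L, 2L, 4L), REF = c("A", "C", "T"),
                      pileup = c("AAA", "TTT", "TT"))

# Full join: keep positions present in either call set (all = TRUE).
both <- merge(calls_a, calls_b, by = c("pos", "REF"), all = TRUE)

# Left join: attach pileups without adding pileup-only positions (all.x = TRUE).
toy_merged <- merge(both, piles, by = c("pos", "REF"), all.x = TRUE)

toy_merged
```

Position 4 exists only in the toy pileup table, so it is absent from `toy_merged`; position 3 has no pileup and gets `NA` there.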

In [16]:
nrow(merged)

In [17]:
head(merged)

chrom,pos,REF,ALT.x,mez2,ALT.y,mez2_snpad,pileup,A,C,G,T
<chr>,<int>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Y,2649811,A,,,,,A,1,0,0,0
Y,2649812,A,,,,,AA,2,0,0,0
Y,2649813,A,,0.0,,,AAA,3,0,0,0
Y,2649814,A,,0.0,,,AAAA,4,0,0,0
Y,2649815,A,,0.0,,,AAAA,4,0,0,0
Y,2649816,A,,0.0,,,AAAA,4,0,0,0


There is one site that snpAD calls heterozygous, but it has exceedingly high coverage (57 reads in the pileup) and is filtered out:

In [18]:
filter(merged, ALT.x != ALT.y)

chrom,pos,REF,ALT.x,mez2,ALT.y,mez2_snpad,pileup,A,C,G,T
<chr>,<int>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Y,13449837,T,,,A,,TTTTTTATTTTTTATTTTTTTTTTTATATTTATTTTTTTTTTTTTTTTTTTTTTTTT,5,0,0,52
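
To illustrate how a depth filter of this kind removes an unusually deep site, here is a sketch on made-up depths for a single sample. It applies the same rule as the mask above (depth of at least 3 and no more than the 98th percentile); the actual cutoff for Mez2 is not shown in this notebook, so the numbers are purely illustrative:

```r
# Toy depths for one sample (made-up values, not taken from the VCF).
depth <- c(1, 3, 4, 5, 4, 3, 57, NA)

# Keep sites with depth >= 3 and at most the 98th percentile of the sample.
upper <- quantile(depth, 0.98, na.rm = TRUE)
keep <- depth >= 3 & depth <= upper

keep
```

The site with depth 57 exceeds the upper bound and is masked, the site with depth 1 fails the lower bound, and the missing depth stays `NA`.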


In [19]:
merged <- mutate(merged, ALT = ALT.y) %>% select(-ALT.x, -ALT.y) %>%
    mutate(total = as.numeric(A) + as.numeric(C) + as.numeric(G) + as.numeric(T))

# Miscalled ALTs?

In [20]:
filter(merged, mez2 == 1 & mez2_snpad != "1/1") %>% head

chrom,pos,REF,mez2,mez2_snpad,pileup,A,C,G,T,ALT,total
<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>


None.

# Miscalled REFs?

In [21]:
filter(merged, mez2 == 0 & mez2_snpad != "0/0") %>% head

chrom,pos,REF,mez2,mez2_snpad,pileup,A,C,G,T,ALT,total
<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>


None.

# Investigate snpAD hets

These must be errors: the male-specific part of the Y chromosome is haploid, so a true heterozygous genotype cannot occur there.
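
As a sketch of what counts as a het call, the hypothetical helper below flags diploid `GT` strings whose two alleles differ. It assumes genotypes are written as `a/b` or `a|b` and treats missing genotypes as not heterozygous; `is_het` is illustrative and not part of the `ychr` package:

```r
# Hypothetical helper: TRUE where a diploid GT string has two different alleles.
is_het <- function(gt) {
    alleles <- strsplit(gt, "[/|]")
    vapply(
        alleles,
        function(a) length(a) == 2 && !anyNA(a) && a[1] != a[2],
        logical(1)
    )
}

gt <- c("0/0", "0/1", "1/1", "1|0", NA)
is_het(gt)
```

Unlike an exact match on `"0/1"`, this also catches phased or reordered genotypes such as `"1|0"`.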

How many?

In [22]:
filter(merged, mez2_snpad != "0/0", mez2_snpad != "1/1") %>% nrow

Do I even call something at snpAD het sites?

In [23]:
filter(merged, mez2_snpad == "0/1") %>% filter(!is.na(mez2))

chrom,pos,REF,mez2,mez2_snpad,pileup,A,C,G,T,ALT,total
<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>


Nope, all snpAD het sites are excluded by my genotyper.

List all snpAD het sites:

In [24]:
filter(merged, mez2_snpad == "0/1")

chrom,pos,REF,mez2,mez2_snpad,pileup,A,C,G,T,ALT,total
<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
Y,2702987,G,,0/1,TGTGATGG,1,0,4,3,T,8
Y,2702988,T,,0/1,TTGTTTTTG,0,0,2,7,G,9
Y,2777844,G,,0/1,GGGGAGAGAGAGAGAG,6,0,10,0,A,16
Y,2841055,G,,0/1,GGGGGGGGAAAAAAAGGG,7,0,11,0,A,18
Y,2854738,G,,0/1,GAGGGAAGGGGGAAA,6,0,9,0,A,15
Y,3405761,T,,0/1,TTTTTTTACATTTTAT,3,1,0,12,A,16
Y,3405762,A,,0/1,AAAAAAATATAAAATAA,14,0,0,3,T,17
Y,3405811,A,,0/1,AACAAACAAAAACCACAA,13,5,0,0,C,18
Y,3406061,G,,0/1,GGAAGAAGGGGGGGGAGG,5,0,13,0,A,18
Y,3406216,G,,0/1,GAGAAAGGGGAGGGGGGGGGAAGA,8,0,16,0,A,24


Are there sites with a mixture of bases that I ignore but snpAD calls? The low-coverage sites below carry a mixture of alleles in the pileup, so I remove them, whereas snpAD calls them homozygous.

In [25]:
filter(merged, is.na(mez2) & !is.na(mez2_snpad) & !is.na(A)) %>%
    arrange(desc(A), desc(C), desc(G), desc(T)) %>%
    filter(total < 4)

chrom,pos,REF,mez2,mez2_snpad,pileup,A,C,G,T,ALT,total
<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
Y,2652955,A,,0/0,AAC,2,1,0,0,,3
Y,6900142,A,,0/0,CAA,2,1,0,0,,3
Y,6900147,A,,0/0,CAA,2,1,0,0,,3
Y,6936384,A,,0/0,AAC,2,1,0,0,,3
Y,7637931,A,,0/0,CAA,2,1,0,0,,3
Y,7948643,A,,0/0,ACA,2,1,0,0,,3
Y,8014399,A,,0/0,AAC,2,1,0,0,,3
Y,8037425,A,,0/0,AAC,2,1,0,0,,3
Y,8160183,A,,0/0,AAC,2,1,0,0,,3
Y,8224198,A,,0/0,CAA,2,1,0,0,,3
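
One way to see why a strict genotyper drops these sites is the sketch below. It is an assumption for illustration, not the actual `bam-sample` logic: the hypothetical `call_strict` only makes a call when at least `min_depth` reads are present and all of them show the same base.

```r
# Hypothetical strict caller: call an allele only if all reads agree.
call_strict <- function(pileup, ref, min_depth = 3) {
    bases <- strsplit(toupper(pileup), "")[[1]]
    if (length(bases) < min_depth || length(unique(bases)) != 1) {
        return(NA_real_)   # too shallow or mixed pileup: no call
    }
    if (bases[1] == toupper(ref)) 0 else 1   # 0 = REF, 1 = ALT
}

call_strict("AAC", "A")   # mixed pileup, no call
call_strict("AAA", "A")   # all reads match REF
```

Under this rule a pileup such as `AAC` yields no call, while a likelihood-based caller can still report `0/0` by treating the single `C` as a sequencing or damage error.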
