# Cuckoo fresh sample analysis

    author: Gekkonid Consulting
    date: 2021-10-24

This notebook runs the full set of basic popgen analyses on all fresh or
otherwise high-quality samples. It takes a much more traditional shape than
the analysis of the museum samples.

In [None]:
library(tidyverse)
library(SNPRelate)
library(pcaMethods)
library(adegenet)
library(hierfstat)
# recursive=TRUE so this also works when "out/" itself does not exist yet
if (!dir.exists("out/04_fresh")) dir.create("out/04_fresh", recursive=TRUE)
meta = read_csv("../rawdata/cuckoo_metadata_oct2021.csv")


## Poor sample removal

Nearly all fresh samples look pretty good, and nearly all museum samples
performed poorly. This means we can just filter the data by per-sample
missingness to select the set of samples we carry forward.

In [None]:
gds = snpgdsOpen("data/2_popgen/cuckoo_q50_dp10_maf3_mis80_radloci.gds",
                 allow.duplicate=T)
gds.sum = snpgdsSummary(gds)

Keep metadata only for sequenced samples, which makes the code below much
simpler.

In [None]:
meta = meta %>%
    filter(Library_id %in% gds.sum$sample.id)

So in total we have 82 samples and 40k SNPs (approx. 4 SNPs per locus, which
sounds on the high side given that insert sizes range from 50 to 500 bp).


In [None]:
smr = snpgdsSampMissRate(gds)

In [None]:
hist(smr, breaks=60)
abline(v=0.45)

45% missing data seems to be a pretty good threshold.

In [None]:
table(smr < 0.45)
good.samp = gds.sum$sample.id[smr < 0.45]
meta.good = meta[match(good.samp, meta$Library_id),]
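
As a rough illustration of the filter above: the per-sample missing rate is the fraction of SNPs without a genotype call for that sample. Below is a minimal base-R sketch on a made-up matrix. The objects `geno`, `miss_rate`, and `keep` are hypothetical and not part of the analysis.

```r
# Toy genotype matrix: rows are samples, columns are SNPs, NA = missing call.
# All values are invented for illustration only.
geno <- matrix(c( 0,  1, NA, 2,
                 NA, NA, NA, 1,
                  2,  0,  1, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("s1", "s2", "s3"), NULL))

# Per-sample missing rate: fraction of SNPs with no call in each row.
miss_rate <- rowMeans(is.na(geno))

# Keep samples below the same 45% threshold used in the cell above.
keep <- names(miss_rate)[miss_rate < 0.45]
print(miss_rate)
print(keep)
```

Here only the sample with three of four calls missing falls above the threshold and is dropped.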

This leaves us with 29 good samples. Let's very quickly see what population
structure looks like in PCA form:

In [None]:
pca.fresh.initial = snpgdsPCA(gds, num.thread=12, autosome.only=F,
                              missing.rate=0.85, sample.id = good.samp)

In [None]:
morph.col = ifelse(is.na(meta.good$Adult_morph), "Chick_unknown_morph",
                   meta.good$Adult_morph) %>%
    as.factor()
plot(pca.fresh.initial, 1:10, col=morph.col, pch=19, oma=c(14,4,4,4))
par(xpd=TRUE)
legend("bottom", legend=levels(morph.col), pch=19, ncol=3,
       col=1:length(levels(morph.col)), y.intersp=0.5, x.intersp=0.5,
       text.width=0.2)

In [None]:
par(xpd=F)

In [None]:
dev.copy(pdf, "out/04_fresh/initial-pca.pdf", width=12, height=10)
dev.off()

Still not super clear. We have a couple of outlier samples (which drive PC1
& PC2), probably corresponding to genuine outliers given we know there are
some non-LBC individuals in there.

## Extract genotypes

To perform subsequent analyses, we first need to extract a set of good
quality SNPs from the GDS file using SNPRelate.

First, get the per-SNP stats

In [None]:
snprate = snpgdsSNPRateFreq(gds, sample.id = good.samp)

...and plot them.

In [None]:
hist(snprate$MissingRate, breaks=30)
abline(v=0.3)
hist(snprate$MinorFreq, breaks=30)
abline(v=0.03)

The missingness rate looks really good for most of these, and MAF follows a
pretty common pattern. We exclude the long tail of high missingness SNPs,
and throw out very rare alleles (mainly to reduce the number of SNPs).

In [None]:
good.snp = snprate$MissingRate < 0.3 & snprate$MinorFreq > 0.03
table(good.snp)
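
To make the two filter criteria concrete, here is a small base-R sketch that computes per-SNP missingness and minor allele frequency directly from a 0/1/2 allele-count matrix. The matrix and the names `snp_miss`, `alt_freq`, `maf`, and `good` are made up for illustration. The real values above come from `snpgdsSNPRateFreq`.

```r
# Toy matrix of alternate-allele counts: rows are samples, columns are SNPs.
geno <- matrix(c(0,  1, NA, 2,
                 0, NA, NA, 1,
                 0,  0,  1, 1,
                 0,  2, NA, 0),
               nrow = 4, byrow = TRUE)

# Per-SNP missing rate: fraction of samples with no call in each column.
snp_miss <- colMeans(is.na(geno))

# Alternate allele frequency among called genotypes (diploid, hence / 2),
# folded to the minor allele frequency.
alt_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(alt_freq, 1 - alt_freq)

# Same thresholds as in the cell above.
good <- snp_miss < 0.3 & maf > 0.03
print(data.frame(snp_miss, maf, good))
```

In this toy example the first SNP fails because it is monomorphic, and the third fails on missingness.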

So in total we have 25k good SNPs, which seems like plenty.

In [None]:
# Map the logical filter to actual SNP IDs rather than assuming IDs are 1..N
good.gn = snpgdsGetGeno(gds, sample.id = good.samp,
                        snp.id = gds.sum$snp.id[which(good.snp)])

Below is a visual representation of the missingness in this dataset.

In [None]:
image(is.na(good.gn))


## Probabilistic PCA

Use PPCA from the pcaMethods[^pcamethods] package to do a missing-data
tolerant PCA, as it copes with missing genotypes better than the SVD-based
PCA above.

In [None]:
dim(good.gn)
bpc = xfun::cache_rds({
     pca(good.gn, method="ppca", center=T, nPcs=3, maxSteps=1000)
}, file="04_bpca", dir="data/cache/",  compress="xz")
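
To give a feel for how a PCA can tolerate missing genotypes, below is a minimal base-R sketch of iterative SVD imputation. This is a deliberately simpler stand-in, not the PPCA algorithm that pcaMethods implements. The function `impute_pca` and the simulated data are hypothetical.

```r
# Iterative SVD imputation: fill missing cells, fit a low-rank PCA, replace
# the missing cells with the low-rank reconstruction, and repeat.
impute_pca <- function(x, n_pcs = 2, max_steps = 100, tol = 1e-6) {
  miss <- is.na(x)
  filled <- x
  # Start by filling each missing cell with its column mean.
  filled[miss] <- colMeans(x, na.rm = TRUE)[col(x)[miss]]
  for (step in seq_len(max_steps)) {
    if (!any(miss)) break
    mu <- colMeans(filled)
    s <- svd(sweep(filled, 2, mu), nu = n_pcs, nv = n_pcs)
    # Rank-n_pcs reconstruction, shifted back to the original scale.
    recon <- s$u %*% diag(s$d[seq_len(n_pcs)], n_pcs) %*% t(s$v)
    recon <- sweep(recon, 2, mu, "+")
    change <- max(abs(filled[miss] - recon[miss]))
    filled[miss] <- recon[miss]   # observed cells are never modified
    if (change < tol) break
  }
  # Final PCA on the completed matrix.
  s <- svd(sweep(filled, 2, colMeans(filled)), nu = n_pcs, nv = n_pcs)
  list(scores = s$u %*% diag(s$d[seq_len(n_pcs)], n_pcs),
       completed = filled)
}

# Simulated genotypes: 20 samples x 50 SNPs with 20% of calls set to missing.
set.seed(1)
geno <- matrix(as.numeric(rbinom(20 * 50, size = 2, prob = 0.3)), nrow = 20)
geno[sample(length(geno), 200)] <- NA
fit <- impute_pca(geno, n_pcs = 3)
print(dim(fit$scores))
```

The point of the sketch is only that missing cells are estimated from the low-rank structure rather than dropped. PPCA reaches a similar goal through a probabilistic model fitted by EM.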

In [None]:
bpc.scores = scores(bpc)
pairs(bpc.scores, col=morph.col, pch=19, oma=c(14,4,4,4))
par(xpd=TRUE)
legend("bottom", legend=levels(morph.col), pch=19, ncol=3,
       col=1:length(levels(morph.col)), y.intersp=0.5, x.intersp=0.5,
       text.width=0.2)
par(xpd=F)
dev.copy(pdf, "out/04_fresh/prob-pca.pdf", width=12, height=10)
dev.off()

In [None]:
write_tsv(meta.good, "allgood.tsv")

In [None]:
morphchick.col = ifelse(is.na(meta.good$Adult_morph), meta.good$Library_id, 
                   meta.good$Adult_morph) %>%
    as.factor()

In [None]:
pairs(bpc.scores, col=morphchick.col, pch=19, oma=c(14,4,4,4))
par(xpd=TRUE)
legend("bottom", legend=levels(morphchick.col), pch=19, ncol=3,
       col=1:length(levels(morphchick.col)), y.intersp=0.5, x.intersp=0.5,
       text.width=0.2)
par(xpd=F)

In [None]:
dev.copy(pdf, "out/04_fresh/prob-pca-chickid.pdf", width=12, height=10)
dev.off()

In [None]:
str(meta.good)
latcut = cut(meta.good$Lat, breaks=8)
pairs(bpc.scores, col=latcut, pch=19, oma=c(14,4,4,4))
par(xpd=TRUE)
legend("bottom", legend=levels(latcut), pch=19, ncol=3,
       col=1:length(levels(latcut)), y.intersp=0.5, x.intersp=0.5,
       text.width=0.2)
par(xpd=F)

So that looks a lot better than the default SVD-based PCA. Although the
signal is quite dispersed, there are some signs of very weak population
structure between morphs. It looks like most of the chicks are probably QLD
russatus, given where they cluster. They aren't getting split any time soon
(hopefully), but I think there is clearly some genetic evidence to support
very weak and likely nascent population differentiation.

## DAPC

DAPC is part of adegenet, so first let's convert the SNP matrix to a
genlight object.

In [None]:
snp.gl = new("genlight", gen=good.gn, ploidy=2, ind.names=good.samp)

I find DAPC to be a bit of a funny method, as it will *always* find the
expected structure. I do this analysis mostly for completeness, as it is a
good way of showing structure visually when a vanilla PCA gets swamped by
technical noise, as is the case to some extent here.

In [None]:
dapc.morph =  dapc(snp.gl, morph.col, n.pca=20, n.da=4)
scatter(dapc.morph)
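
The caveat about DAPC always finding structure can be illustrated with a small simulation: a discriminant analysis fitted on many PCs from few samples separates even arbitrary groups in pure noise. This sketch uses a two-group Fisher discriminant in base R as a simplified stand-in for DAPC. The sample size, number of PCs, and all object names are assumptions chosen to loosely mirror the setting above.

```r
set.seed(42)
n <- 30   # samples
p <- 20   # retained PCs

# Pure noise standing in for PC scores; the group labels are arbitrary.
pcs <- matrix(rnorm(n * p), nrow = n)
grp <- rep(c("A", "B"), each = n / 2)

# Fisher's linear discriminant for two groups: w = Sw^-1 (mean_A - mean_B).
mean_a <- colMeans(pcs[grp == "A", ])
mean_b <- colMeans(pcs[grp == "B", ])
centred <- rbind(sweep(pcs[grp == "A", ], 2, mean_a),
                 sweep(pcs[grp == "B", ], 2, mean_b))
sw <- crossprod(centred) / (n - 2)   # pooled within-group covariance
w <- solve(sw, mean_a - mean_b)

# Project the same samples used for fitting and classify by the midpoint
# between the two projected group means.
score <- as.vector(pcs %*% w)
cutoff <- sum(w * (mean_a + mean_b)) / 2
pred <- ifelse(score > cutoff, "A", "B")
accuracy <- mean(pred == grp)
print(accuracy)
```

A high in-sample accuracy here reflects overfitting only, since the data contain no real group signal. Clean separation in a DAPC scatter plot is therefore weak evidence on its own.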


## Population differentiation: $F_{ST}$

We will use hierfstat to compute $F_{ST}$.

hierfstat takes a data frame whose rows are individuals, whose first
column is a population code, and whose remaining columns are loci.

In [None]:
hierf.dat = as.data.frame(cbind(morph.col, as.matrix(snp.gl)))
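
One thing worth checking at this step: hierfstat's documented genotype format codes diploid genotypes as allele pairs (for example 11, 12, 22) rather than as 0/1/2 allele counts. If `as.matrix(snp.gl)` holds allele counts, a recoding along the following lines may be needed before calling `pairwise.WCfst`. This is a sketch only. The helper `dosage_to_hierfstat` and the toy data are hypothetical and not part of the original analysis.

```r
# Recode 0/1/2 alternate-allele counts as two-digit genotypes:
# 0 -> 11 (hom ref), 1 -> 12 (het), 2 -> 22 (hom alt); NA stays NA.
dosage_to_hierfstat <- function(dosage) {
  codes <- c(11, 12, 22)
  matrix(codes[dosage + 1], nrow = nrow(dosage),
         dimnames = dimnames(dosage))
}

# Toy example: two individuals, four loci, one missing call.
dosage <- matrix(c(0, 1, 2, NA,
                   2, 2, 0,  1), nrow = 2, byrow = TRUE)
recoded <- dosage_to_hierfstat(dosage)

# First column is the population code, remaining columns are loci.
hierf_toy <- data.frame(pop = c(1, 2), recoded)
print(hierf_toy)
```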

This calculation takes a while, so we cache it.

In [None]:
pwfst = xfun::cache_rds({
    pairwise.WCfst(hierf.dat)
}, file="04_pwfst", dir="data/cache/",  compress="xz")

In [None]:
pwfst

So this reveals the extremely small amount of divergence in these samples:
inter-population $F_{ST}$ is under 1%.
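
For a sense of scale, the textbook heterozygosity-based definition $F_{ST} = (H_T - H_S) / H_T$ can be evaluated for a single biallelic locus in two equally sized populations. Note that this is not the Weir and Cockerham estimator computed by `pairwise.WCfst`. The function names and allele frequencies below are made up for illustration.

```r
# Expected heterozygosity at a biallelic locus with allele frequency p.
het <- function(p) 2 * p * (1 - p)

# FST = (HT - HS) / HT for two equally sized populations.
fst_two_pops <- function(p1, p2) {
  hs <- mean(c(het(p1), het(p2)))   # mean within-population heterozygosity
  ht <- het(mean(c(p1, p2)))        # heterozygosity of the pooled population
  (ht - hs) / ht
}

fst_small  <- fst_two_pops(0.45, 0.55)   # frequencies differ by 0.10
fst_larger <- fst_two_pops(0.40, 0.60)   # frequencies differ by 0.20
print(c(fst_small, fst_larger))
```

Under this definition, an allele frequency difference of 0.10 around a frequency of 0.5 already gives an $F_{ST}$ of 1%, so values below 1% correspond to very subtle frequency shifts.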


# Biological take-home messages

1. The dataset is of pretty good quality, and a high-quality subset had 29
   samples and about 25k SNPs.
2. In a vanilla PCA, we see a small signal of divergence between morphs;
   however, technical noise muddies this signal.
3. Probabilistic PCA resolved this, revealing some potentially weak
   population structure.
4. DAPC shows very clear divergence between populations; however, DAPC will
   always do so with data of any reasonable quality.
5. $F_{ST}$ shows that the relative divergence is very low (<1%). This is
   likely an underestimate, as $F_{ST}$ will be underestimated in the
   presence of noisy data like ours.

[^pcamethods]: pcaMethods: a Bioconductor package providing PCA methods for incomplete data. https://academic.oup.com/bioinformatics/article/23/9/1164/272597