# Basic Popgen analyses for Cuckoo dataset

    author: Gekkonid Consulting
    date: 2021-10-24

In this notebook we perform some basic popgen analyses including PCA,
$F_{ST}$, etc. 

In [None]:
library(tidyverse)
library(SNPRelate)
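# recursive=TRUE also creates the parent directory if it does not exist yet
if (!dir.exists("data/2_popgen/")) dir.create("data/2_popgen", recursive=TRUE)
if (!dir.exists("out/03_basic_popgen/")) dir.create("out/03_basic_popgen", recursive=TRUE)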

In [None]:
meta = read_csv("../rawdata/cuckoo_metadata_oct2021.csv")

We are using the SNP set with standard filters (QUAL > 50, DEPTH > 10, MAF >
3%, MISSING < 80%) combined with a RAD-locus filter. In total this should
leave about 10k RAD loci. These thresholds throw out a large number of very
rare or near-fixed SNPs, many of which have data in only a few samples (i.e.
high missingness).
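
To make the per-SNP thresholds concrete, here is a minimal sketch of how minor allele frequency and missingness could be computed from a genotype matrix. The matrix `geno` and everything in it is simulated and hypothetical; the real filtering was done upstream on the VCF, not with this code.

```r
# Toy genotype matrix: rows are samples, columns are SNPs. Values are
# alt-allele counts (0, 1, 2), or NA for a missing call. Simulated data only.
set.seed(1)
n.samp = 20
n.snp = 200
geno = matrix(rbinom(n.samp * n.snp, size=2, prob=0.2), nrow=n.samp)
geno[sample(length(geno), 0.3 * length(geno))] = NA

# Per-SNP missingness: proportion of samples without a genotype call
snp.miss = colMeans(is.na(geno))

# Per-SNP alt-allele frequency among called genotypes, folded to the MAF
alt.freq = colMeans(geno, na.rm=TRUE) / 2
snp.maf = pmin(alt.freq, 1 - alt.freq)

# Keep SNPs with MAF > 3% and missingness < 80%
keep = !is.na(snp.maf) & snp.maf > 0.03 & snp.miss < 0.80
table(keep)
```

SNPs that are nearly fixed have a MAF close to zero regardless of which allele is common, which is why the frequency is folded with `pmin` before thresholding.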

We will use SNPRelate to do most of the basic popgen analyses. One therefore
needs to convert the vcf.gz to a 'gds' file (see `?gdsfmt`).

In [None]:
if (!file.exists("data/2_popgen/cuckoo_q50_dp10_maf3_mis80_radloci.gds")) {
    snpgdsVCF2GDS("data/1_filtered/cuckoo_q50_dp10_maf3_mis80_radloci.vcf.gz",
                  "data/2_popgen/cuckoo_q50_dp10_maf3_mis80_radloci.gds")
}
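
As a self-contained illustration of the same conversion step, the sketch below converts the small example VCF that ships with SNPRelate, not the cuckoo data. The output path is a temporary file and the object names are hypothetical.

```r
library(SNPRelate)

# Example VCF bundled with the SNPRelate package (not the cuckoo data)
vcf.fn = system.file("extdata", "sequence.vcf", package="SNPRelate")
gds.fn = tempfile(fileext=".gds")

# "biallelic.only" keeps only biallelic SNPs during conversion
snpgdsVCF2GDS(vcf.fn, gds.fn, method="biallelic.only", verbose=FALSE)

# Summarise the converted file directly from its path
ex.sum = snpgdsSummary(gds.fn, show=FALSE)
length(ex.sum$sample.id)
length(ex.sum$snp.id)
```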

In [None]:
gds = snpgdsOpen("data/2_popgen/cuckoo_q50_dp10_maf3_mis80_radloci.gds", allow.duplicate=TRUE)

First, let's get a summary of this file (SNP and sample IDs are saved to
`gds.sum`).

In [None]:
gds.sum = snpgdsSummary(gds)

In [None]:
# Ensure we keep metadata only for sequenced samples -- make code much simpler
# below.
meta = meta %>%
    filter(Library_id %in% gds.sum$sample.id)
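
Before relying on the filter above, it can be worth checking for mismatches in both directions between sequenced samples and metadata. This is a minimal sketch with invented IDs; `gds.ids` and `toy.meta` are hypothetical stand-ins for `gds.sum$sample.id` and `meta`.

```r
# Hypothetical sample IDs standing in for gds.sum$sample.id and the metadata
gds.ids = c("lib01", "lib02", "lib03")
toy.meta = data.frame(Library_id=c("lib01", "lib02", "lib04"),
                      Sample_type=c("Feather", "Adult_tissue", "Nest"))

# Sequenced samples without a metadata row (later joins would drop these)
no.meta = setdiff(gds.ids, toy.meta$Library_id)

# Metadata rows without a sequenced sample (the %in% filter removes these)
not.sequenced = setdiff(toy.meta$Library_id, gds.ids)

no.meta
not.sequenced
```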

So in total we have 82 samples and 40k SNPs (approx. 4 SNPs per locus, which
sounds on the high side given that insert sizes range from 50 to 500 bp).

## Screen for failed samples

The first step in a typical analysis would be removing poor samples. However,
this analysis is a bit different, in that we need to split the dataset in
two: the "fresh" samples that broadly worked well, and the museum samples
that mostly performed poorly. We still wish to do some basic analysis of the
museum samples, but that should be conducted separately.

First though, let's plot everything together.

In [None]:
sampmiss = read_tsv("data/1_filtered/cuckoo_q50_dp10_maf3_mis80_radloci_samphist.tsv") %>%
    inner_join(meta, by=c("sample"="Library_id")) %>%
    arrange(-missing_prop)

In [None]:
ggplot(sampmiss, aes(x=missing_prop)) +
    geom_histogram(aes(fill=Sample_type, colour=Sample_type)) +
    labs(title="RAD-locus Filtered Sample Missingness")

Now let's subdivide the dataset into a "fresh" section and a "museum"
section.

In [None]:
museum.samp.types = c("Feather", "Ethanol-preserved_eggshell",
                      "Museum_eggshell", "Nest")
fresh.samp.types = c("Adult_tissue",
                     "Ethanol-preserved_chick",
                     "Ethanol-preserved_embryo",
                     "Frozen_chick", "Frozen_embryo",
                     "Frozen_embryos")

In [None]:
samp.fresh = meta %>%
    filter(Sample_type %in% fresh.samp.types) %>%
    pull(Library_id)
writeLines(samp.fresh, "data/samples_fresh.txt")

In [None]:
samp.museum = meta %>%
    filter(Sample_type %in% museum.samp.types) %>%
    pull(Library_id)
writeLines(samp.museum, "data/samples_museum.txt")

...and then plot the same figure again, this time coloured by "fresh" vs
"museum" (plus an "excluded" category for any sample type in neither list).

In [None]:
sampmiss = sampmiss %>%
    mutate(sample.category = case_when(
        Sample_type %in% fresh.samp.types ~ "fresh",
        Sample_type %in% museum.samp.types ~ "museum",
        TRUE ~ "excluded"))

In [None]:
ggplot(sampmiss, aes(x=missing_prop)) +
    geom_histogram(aes(fill=sample.category,
                       colour=sample.category)) +
    labs(x="Sample Missingness Rate")
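
A numeric summary per category can complement the histogram. The sketch below uses an invented table, `toy.miss`, with the same column names as `sampmiss`; the values are made up for illustration and are not results from this dataset.

```r
# Toy table with the same columns as sampmiss; all values are invented
toy.miss = data.frame(
    sample.category=c("fresh", "fresh", "museum", "museum", "excluded"),
    missing_prop=c(0.10, 0.25, 0.90, 0.97, 0.50))

# Number of samples and median missingness per category
n.per.cat = table(toy.miss$sample.category)
med.per.cat = tapply(toy.miss$missing_prop, toy.miss$sample.category, median)

n.per.cat
med.per.cat
```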

## Museum sample basic analysis

Let's dig into the museum samples. As these samples are mostly very poor
quality, we don't filter out failed samples (by any traditional definition
of failed, they all are!). We still weed out the worst SNPs though (SNP
missingness threshold of 90%).

In [None]:
pca.museum = snpgdsPCA(gds, sample.id=samp.museum, num.thread=12,
                       autosome.only=FALSE, missing.rate=0.9)
plot(pca.museum)
# The PCA result is a list, not a matrix, so select axes via the eig argument
plot(pca.museum, 1:3)

It appears that the first couple of PCs describe the two mysteriously
successful "museum" samples that deviate from the "cloud" of low-coverage
samples, which is normal for such outliers. Let's plot the subsequent axes,
which hopefully "see past" these two outlier samples.

In [None]:
miss.col.mus = sampmiss[match(pca.museum$sample.id, sampmiss$sample),]$missing_prop %>%
    cut(breaks=5)
plot(pca.museum, 1:3, col=miss.col.mus, pch=19, oma=c(14,4,4,4))
par(xpd=TRUE)
legend("bottom", legend=levels(miss.col.mus), pch=19, col=1:5, ncol=3,
       y.intersp=0.5, x.intersp=0.5, text.width=0.2)
par(xpd=FALSE)
dev.copy(pdf, "out/03_basic_popgen/museum-samples.pdf", width=12, height=10)
dev.off()

Apparently not. It takes until the ninth and tenth axes before we get past
axes describing single samples' divergences from the pack, and even then it
seems as though the patterns are driven by missing data (or something
correlated with it).
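
One way to check the suspicion that axes track missing data is to correlate each PC with per-sample missingness. The sketch below does this on simulated genotypes using base R's `prcomp`; the mean imputation and all object names are assumptions of the sketch. With the real data one would instead correlate the columns of `pca.museum$eigenvect` with `missing_prop`.

```r
set.seed(42)
n.samp = 30
n.snp = 500

# Simulated genotypes (0/1/2) with a different missing rate for each sample
geno = matrix(rbinom(n.samp * n.snp, size=2, prob=0.3), nrow=n.samp)
samp.miss.rate = runif(n.samp, 0.05, 0.9)
for (i in seq_len(n.samp)) {
    geno[i, runif(n.snp) < samp.miss.rate[i]] = NA
}
obs.miss = rowMeans(is.na(geno))

# Mean-impute each SNP (an assumption of this sketch), drop invariant SNPs
snp.mean = colMeans(geno, na.rm=TRUE)
imputed = geno
for (j in seq_len(n.snp)) {
    imputed[is.na(imputed[, j]), j] = snp.mean[j]
}
imputed = imputed[, which(apply(imputed, 2, sd) > 0)]

# PCA on the imputed matrix, keeping the first ten axes
pcs = prcomp(imputed, center=TRUE, scale.=TRUE)$x[, 1:10]

# Absolute correlation of each PC with per-sample missingness
pc.miss.cor = abs(cor(pcs, obs.miss))[, 1]
round(pc.miss.cor, 2)
```

Axes with a high absolute correlation would be candidates for being driven by missingness rather than by biology.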

I'm not sure how much more we can do with this dataset.

## Fresh sample basic popgen

In [None]:
pca.fresh = snpgdsPCA(gds, num.thread=12, autosome.only=FALSE,
                      missing.rate=0.85, sample.id=samp.fresh)
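
When reading a PCA it helps to know how much variance each axis explains, which `snpgdsPCA` returns in the `varprop` element. The sketch below shows this on the example GDS file bundled with SNPRelate, not on the cuckoo data; `ex`, `ex.pca` and `pc.percent` are hypothetical names.

```r
library(SNPRelate)

# Example GDS file shipped with SNPRelate (not the cuckoo data)
ex = snpgdsOpen(snpgdsExampleFileName())
ex.pca = snpgdsPCA(ex, num.thread=1, autosome.only=FALSE, verbose=FALSE)
snpgdsClose(ex)

# Percentage of variance explained by the first ten axes
pc.percent = ex.pca$varprop[1:10] * 100
round(pc.percent, 2)
```

The same expression applied to `pca.fresh$varprop` would show how much of the variation the outlier-dominated axes account for.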

In [None]:
meta.fresh = sampmiss[match(pca.fresh$sample.id, sampmiss$sample),]
morph.col = ifelse(is.na(meta.fresh$Adult_morph), "Chick_unknown_morph",
                   meta.fresh$Adult_morph) %>%
    as.factor()
plot(pca.fresh, 1:3, col=morph.col, pch=19, oma=c(14,4,4,4))
par(xpd=TRUE)
legend("bottom", legend=levels(morph.col), pch=19, ncol=3,
       col=1:length(levels(morph.col)), y.intersp=0.5, x.intersp=0.5,
       text.width=0.2)
par(xpd=FALSE)
dev.copy(pdf, "out/03_basic_popgen/fresh-samples.pdf", width=12, height=10)
dev.off()

So that is promising. There is some clear population structure in there (see
the clustering of morphs), but the signal is fairly noisy and the first few
axes are dominated by single outlier samples. This may be partly biological,
as there are a few individuals from other species in there, but technical
artefacts are also a likely cause.

We will do subsequent analyses on this sample set in the next notebook.