analyze.Rmd

---
title: "Canine Epilepsy - Data Analysis"
author: "BJC"
date: "3/28/2020"
output: html_document
---

## Setup

Load libraries, setup paths, prepare environment:
```{r init, warning=FALSE, message=FALSE}
library(dada2);packageVersion("dada2")
library(phyloseq); packageVersion("phyloseq")
library(DECIPHER); packageVersion("DECIPHER")
library(phangorn); packageVersion("phangorn")
library(vegan); packageVersion("vegan")
library(ggplot2)
library(reshape2)
path.fig <- "Figures/" # Local path. Assumes working directory is this files location
path.rds <- "RDS/" # Local path...
fnMeta <- "epilepsy_metadata.tsv" # Local path...
theme_set(theme_bw())
set.seed(100)
```

Read in and coordinate the processed data:
```{r}
df <- read.table(fnMeta, header=TRUE, sep="\t", stringsAsFactors = FALSE)
df$Category <- factor(df$Category)
df$Household <- factor(df$Household)
rownames(df) <- df$SampleID
st <- readRDS(file.path(path.rds, "st.rds"))
tax <- readRDS(file.path(path.rds, "taxp.rds"))
if(!identical(colnames(st), rownames(tax))) stop("st/tax mismatch.")
# Fix sample names for st, currently fastq filename rather than shroter sample name
rownames(st) <- sapply(strsplit(rownames(st), "_"), `[`, 1)
if(!all(rownames(st) %in% rownames(df))) stop("st/df mismatch.")
# Coordinate df with st
df <- df[rownames(st),]
if(!identical(rownames(df), rownames(st))) stop("st/df mismatch (2).")
# Make frequency table, i.e. normalized to proportions
ft <- sweep(st, 1, rowSums(st), "/")
```

Save these input tables in renamed variables:
```{r}
dfi <- df
sti <- st
fti <- ft
taxi <- tax
sq <- colnames(st)
```

## Quality Assurance

Do some QA on the sequencing libraries:
```{r}
plot(rowSums(st), log="y")
```

One sample had about 10x more reads than the others, and another sample had very few (<100) reads. Going to need to remove that very low read count sample:

```{r}
keep.samples <- rownames(sti)[rowSums(sti) > 100]
st <- sti[keep.samples,]
ft <- fti[keep.samples,]
samdf <- dfi[keep.samples,]
if(!identical(rownames(samdf), rownames(st))) stop("st/samdf mismatch (3).")
```

## Identify and Explore the Lactobacilli

Identifying the Lactobacillus ASVs:
```{r}
i.lacto <- which(tax[,6] %in% "Lactobacillus")
sq.lacto <- getSequences(tax)[i.lacto]
unname(tax[sq.lacto,6:7]) # Just 2 have species IDs
```

Eight Lactobacillus ASVs. Just two have species assignments by the unambiguous exact-matching method, so will have to do some by hand species (or species-group) assignments.

Looking at the overall frequencies of these:
```{r}
unname(colSums(ft[,sq.lacto]))/sum(ft) # OOM dropoff after the first 3
```

These are at relatively low overall frequencies. The three most abundant Lactobacillus ASVs are only ~0.05\% of total bacterial frequency across the study, and the remaining are an order of magnitude less frequent than that.

Let's look on a per-sample basis.
```{r}
plot(rowSums(ft[,sq.lacto]), ylab="Lactobacillus proportion", xlab="Sample")
```

```{r, warning=FALSE}
plot(rowSums(ft[,sq.lacto]), log="y", ylab="Lactobacillus proportion", xlab="Sample")
```

One sample is ~4% Lactobacillus, all the rest at less than 0.2%, but most samples do have some Lactobacillus.

Plotting the per-sample frequencies of each Lactobacillus ASV.
```{r, warning=FALSE}
ft.lacto <- ft[,sq.lacto]
colnames(ft.lacto) <- paste0("Lacto", seq_along(sq.lacto))
rownames(ft.lacto) <- seq(nrow(ft.lacto))
pdf.lacto <- melt(ft.lacto, varnames=c("Sample", "ASV"), value.name="Proportion")
ggplot(data=pdf.lacto, aes(x=Sample, y=Proportion, color=ASV)) + geom_point() +
  scale_y_log10()
```

Quite a bit of co-occurrence here, in which multiple distinct Lactobacillus ASVs are observed in the same sample. No discernible evidence of this coming from multiple ASVs from the same strain though, as no ASVs are strongly correlated in their frequency distribution across samples.

Consider how different the various Lactobacillus ASVs are from one another:
```{r}
outer(sq.lacto, sq.lacto, nwhamming, vec=TRUE)
```

These Lactobacillus ASVs are quite different from each other, on the order of 15 substitution difference between a random pair of Lactobacillus ASVs is about a 6\% sequence-level difference.

Now going to to by-hand species-level assignment using BLAST against nt (excluding environmental/uncultured), and manual inspection/curation of the results. Only 100\% identity matches will be considered.

```{r}
dada2:::pfasta(sq.lacto)
# BLAST againt nt excluding environmental/uncultured using web interface
```

By-hand species group assignments based on the BLAST 100\% match results. *Note, there are some ad hoc cutoffs being used here based on there being "lots more" of certain species matches than others, and these assignments are best interpreted as suggestive rather than definitive.*
```{r}
species.lacto <- c("johnsonii/gasseri",
                   "reuteri",
                   "murinus/animals",
                   "reuteri",
                   "casei/paracasei",
                   "sakei/curvatus",
                   "aviarius",
                   "amylovorus")
```

Taking a closer look at Lacto ASV2/4 that were both assigned to *L. reuteri*. Could they be intragenomic variants?
```{r}
freq.lacto2 <- ft[,sq.lacto[[2]]]
freq.lacto4 <- ft[,sq.lacto[[4]]]
cor(freq.lacto2, freq.lacto4)
cor(freq.lacto2, freq.lacto4, method="spearman")
```

```{r}
p <- ggplot(data=data.frame(Lacto2=freq.lacto2, Lacto4=freq.lacto4), aes(x=Lacto2, y=Lacto4)) + 
     geom_point()
p
p + scale_y_log10() + scale_x_log10()
```

It's possible they could be intragenomic variants from a single strain, but the evidence is not compelling. The huge Pearson correlation is driven by the one sample in which both ASVs appear in high frequency.

## Make Table of Lactobacillus species prevalance based on 16S results

Creating a table of prevalence (i.e. number of samples in which a taxa was detected) for each of the 7 Lactobacillus species detected by 16S (remember that 2/8 ASVs both came from *L. reuteri*). This data makes up Table 2 in the manuscript.
```{r}
species <- unique(species.lacto)
df.lacto <- data.frame(Species = paste("Lactobacillus", species), 
                       Prevalence = integer(length(species)), 
                       Frequency=numeric(length(species)))
rownames(df.lacto) <- species
for(spc in species) {
  ii <- i.lacto[species.lacto==spc]
  prv <- sum(rowSums(st[,ii,drop=FALSE])>0)
  frq <- sum(ft[,ii,drop=FALSE])/nrow(ft)
  df.lacto[spc,"Prevalence"] <- prv
  df.lacto[spc,"Frequency"] <- frq
}
df.lacto <- df.lacto[order(df.lacto$Prevalence, decreasing=TRUE),]
rownames(df.lacto) <- NULL
df.lacto
write.table(df.lacto, file.path(path.fig, "Lacto_prev_freq.tsv"), 
            sep="\t", row.names=FALSE, quote=FALSE)
```

## Compare Lacto CFUs to Lacto 16S Frequencies

We generated Lactobacillus CFU counts from these same samples. How well do the frequencies of Lactoabcillus as measured by 16S rRNA gene sequencing correspond to those CFU counts?

```{r, warning=FALSE}
samdf$Freq.lacto <- rowSums(ft[,i.lacto])
plot(samdf$CFU.lacto, samdf$Freq.lacto)
plot(samdf$CFU.lacto, samdf$Freq.lacto, log="xy")
```

There is some correlation, but it is not that strong. Of course, this is expected, as the CFU count is an absolute abundance count, while the 16S gene frequency is a relative abundance measurement, and both are subject to different biases in their sensitivity to *Lactobacillus* generally, that could also potentially vary between different *Lactobacillus* species.

How about the consistency of presence-absence detections across the methods?
```{r}
has.CFU <- !is.na(samdf$CFU.lacto) # At least one sample had a failed CFU measurement
table(samdf$CFU.lacto[has.CFU]>0, samdf$Freq.lacto[has.CFU]>0)
```

No correspondence in this crude presence-absence comparison. Perhaps better selecting presence-absence thresholds could improve this, but again, it is well known that CFU plating methods and community sequencing methods have qualitatively different sensitivity thresholds to rare taxa.

Now making the publication quality figure of this correspondence:
```{r, warning=FALSE}
p.cor <- ggplot(data=samdf, aes(x=CFU.lacto, y=Freq.lacto)) + geom_point() + 
  scale_x_log10() + scale_y_log10() + 
  xlab("CFUs/g (log10 scale)") + ylab("16S Frequency (log10 scale)") +
  geom_smooth(method="lm") + ggtitle("Lactobacillus")
p.cor
ggsave(file.path(path.fig, "lacto_cor.pdf"), p.cor, width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "lacto_cor.png"), p.cor, width=6, height=4, units="in")
```

Quantifying the correlation between these measures:
```{r}
fitdf <- samdf[has.CFU,]
fitdf$LogCFU <- log(fitdf$CFU.lacto)
fitdf$LogFreq <- log(fitdf$Freq.lacto)
fitdf.nozeros <- fitdf[is.finite(fitdf$LogCFU) & is.finite(fitdf$LogFreq),]
mod <- lm(LogCFU ~ LogFreq, fitdf.nozeros)
summary(mod)
```

Note that in this analysis we excluded the zero measurements (by either method). Alternatively one could replace the zeros with defined replacement values, e.g. half the minimum value observed in the dataset. We also check that approach to ensure that our qualitative conclusions remain the same:

```{r}
fitdf.pseudo <- fitdf
fitdf.pseudo$CFU.lacto[fitdf.pseudo$CFU.lacto == 0] <- 
  (1./2.) * min(fitdf.pseudo$CFU.lacto[fitdf.pseudo$CFU.lacto>0])
fitdf.pseudo$Freq.lacto[fitdf.pseudo$Freq.lacto == 0] <- 
  (1./2.) * min(fitdf.pseudo$Freq.lacto[fitdf.pseudo$Freq.lacto>0])
fitdf.pseudo$LogCFU <- log(fitdf.pseudo$CFU.lacto)
fitdf.pseudo$LogFreq <- log(fitdf.pseudo$Freq.lacto)
mod.pseudo <- lm(LogCFU ~ LogFreq, fitdf.pseudo)
summary(mod.pseudo)
```

Clearly R^2 is less, but still a significant trend and a non-trivial R^2 of greater than 0.2. Will use the previous numbers in the paper though, as they correspond to the fit line on the plot which excludes the zero/infinites.

## Ordination plotting of microbiome profiles

First we'll now move the data into the phyloseq R package which provides some convenient interfaces for doing ordination plotting. For this, we will only keep samples from households in which samples from both animals (the epileptic and control dog) are available.

```{r}
ps <- phyloseq(otu_table(ft, taxa_are_rows=FALSE), sample_data(samdf), tax_table(tax))
tab <- table(samdf$Household) # 020Petey was removed for too few reads, so just 1 member of Household 10
keep <- names(tab)[tab>1]
ps <- subset_samples(ps, Household %in% keep); ps # 26 samples, OK
ps <- subset_taxa(ps, taxa_sums(ps)>0)
seqs <- taxa_names(ps); names(seqs) <- seqs
if(!identical(unname(seqs), sq)) stop("Mismatch between phyloseq seqs and sq.")
```

Now to create a phylogenetic tree for these ASVs. The code here is adapted from [Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses](https://f1000research.com/articles/5-1492/v2).
```{r}
## Uncomment to run this part, otherwise will load pre-computed object
#alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
#phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
#dm <- dist.ml(phang.align)
#treeNJ <- NJ(dm) # Note, tip order != sequence order
#fit = pml(treeNJ, data=phang.align)
#fitGTR <- update(fit, k=4, inv=0.2)
#fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
#                    rearrangement = "stochastic", control = pml.control(trace = 0))
#saveRDS(fitGTR, file.path(path.rds, "fitGTR.rds"))
## End uncomment
fitGTR <- readRDS(file.path(path.rds, "fitGTR.rds"))
psord <- merge_phyloseq(ps, phy_tree(fitGTR$tree))
if(!identical(otu_table(psord), otu_table(ps))) stop("psord/ps mismatch.")
```

Plotting ordinations for the manuscript figure. First perform the ordination calculations:
```{r, results='hide'}
ord.bray.mds <- ordinate(psord, method="MDS", distance="bray")
ord.bray.nmds <- ordinate(psord, method="NMDS", distance="bray")
ord.wuf.mds <- ordinate(psord, method="MDS", distance="wunifrac")
ord.wuf.nmds <- ordinate(psord, method="NMDS", distance="wunifrac")
```

Plot Bray_Curtis and weighted Unifrac ordination figures, using both MDS and NMDS ordination methods, to confirm that results are qualitatively robust (i.e. households group together) to choice of dissimilarity and ordination method.

Bray-Curtis ordinations:
```{r}
plot.ord <- function(psi, ordi, color="Category", label="Household", size=5, axes=c(1,2)) {
  p <- plot_ordination(psi, ordi, axes=axes)
  ggplot(data=p$data, aes_string(x=quo_name(p$mapping$x), y=quo_name(p$mapping$y), 
                                 color=color, label=label)) + geom_text(size=size)
}

p.bray <- plot.ord(psord, ord.bray.mds)
p.bray + ggtitle("Bray - MDS")
plot.ord(psord, ord.bray.nmds) + ggtitle("Bray - NMDS")
```

Weighted-Unifrac ordinations:
```{r}
p.wuf <- plot.ord(psord, ord.wuf.mds)
p.wuf + ggtitle("WUF - MDS")
plot.ord(psord, ord.wuf.nmds) + ggtitle("WUF - NMDS")
```

Saving the MDS ordinations for use in the mansucript:
```{r}
ggsave(file.path(path.fig, "ordination_bray.pdf"), p.bray, width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "ordination_bray.png"), p.bray, width=6, height=4, units="in")
ggsave(file.path(path.fig, "ordination_unifrac.pdf"), p.wuf, width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "ordination_unifrac.png"), p.wuf, width=6, height=4, units="in")
```

## Permutation testing of houshold effect on the microbiome

The tendency for samples from the same household to group together in the ordination is clear visually, but to confirm this insight we explore this statistically by comparing the community dissimilarities between pairs of dogs in the same household to the null distribution generated by computing dissimilarities between random pairs of dogs.

First calculate the dissimilarities between all samples:
```{r}
bc <- phyloseq::distance(psord, method="bray")
```

Our test statistic will be the median dissimilarity between samples from the same household. Define method for extracting this statistic from the dissimilarities and the vector of households:
```{r}
get.median.dist <- function(dst, hh) {
  dst <- as.matrix(dst)
  hmat <- sapply(levels(hh), function(x) which(hh==x))
  hdsts <- sapply(seq(ncol(hmat)), function(c) dst[hmat[1,c], hmat[2,c]])
  median(hdsts)
}
hhs <- sample_data(psord)$Household
test.stat <- get.median.dist(bc, hhs)
```

Now calculate a null distribution of the test statistic under permutations of the household label:
```{r}
perms <- lapply(1:10000, function(i) sample(hhs))
distro.stat <- sapply(perms, function(prm) get.median.dist(bc, prm))
```

Plot the null (permuted) distribution of the statistic:
```{r}
hist(distro.stat, xlim=c(0,1))
abline(v=test.stat, col="red")
```

The median distance between samples from the same household is lower than the smallest value calculated from 10,000 permutations. We can safely conclude that the same-household dogs have lower dissimilarity between their microbiomes with P < 0.001

## Permanova testing of Household/Epilepsy/Age/Sex effects on the microbiome

First perform permanova testing on Household:
```{r}
samdf.ord <- data.frame(sample_data(psord))
vegan::adonis(bc ~ Household, data = samdf.ord, permutations=1e4)
```

There is a very statistically significant household effect, P < 0.001, and it explains quite a lot of the overall community dissimilarities (R2 > 0.6).

Next perform permanova testing on Category (i.e. epeleptic status), both independently, and when controlling for the strong Household effect by constraining permutations to be within households:
```{r}
vegan::adonis(bc ~ Category, data = samdf.ord, permutations=1e4)
vegan::adonis(bc ~ Category, data = samdf.ord, strata=samdf.ord$Household, permutations=1e4)
```

There is no statistically significant Category effect. P > 0.05 (much greater) whether controlling for Household or not.

Next perform permanova testing on Age, both independently, and when controlling for the strong Household effect by constraining permutations to be within households:
```{r}
vegan::adonis(bc ~ Age, data = samdf.ord, permutations=1e4)
vegan::adonis(bc ~ Age, data = samdf.ord, strata=samdf.ord$Household, permutations=1e4)
```

There is no statistically significant Age effect (P > 0.05). However, it "trends towards significance" when controlling for the Household effect, so Age is certainly worth revisiting in our subsequent larger study, also because other larger studies have shown an Age effect.

Finally perform permanova testing on Sex, both independently, and when controlling for the strong Household effect by constraining permutations to be within households:
```{r}
vegan::adonis(bc ~ Sex, data = samdf.ord, permutations=1e4)
vegan::adonis(bc ~ Sex, data = samdf.ord, strata=samdf.ord$Household, permutations=1e4)
```

There is no statistically significant Sex effect. P > 0.05 (much greater) whether controlling for Household or not.

## Differential Prevalence testing of Lactobacillus

### Fisher's exact testing

We'll perform Fisher's exact testing of the differential prevalence of each ASV in the Epilepsy and Control categories:
```{r}
stv <- as(otu_table(psord), "matrix")
stv10 <- stv[,colSums(stv>0)>=2 & colSums(stv>0)<=(nrow(stv)-2)] # At least two presences/absences for each
ctv <- sample_data(psord)$Category
fpv <- sapply(colnames(stv10), function(sqi) fisher.test(x=factor(stv10[,sqi]>0), y=ctv)$p.value)
plot(fpv, ylim=c(0,1)) 
abline(h=0.1, col="red")
```

Given that this is an exploratory analysis with a high number of hypotheses, there is nothing statistically significant here after any reasonable FDR correction.

We look more closely at our prior hypotheses, the Lactobacilli:
```{r}
unname(fpv[sq.lacto])
```

Nope, even in the absence of an FDR correction for these taxa we previously proposed to study, there is no significant differential prevalence.

Last, we will look at the Lactobacillus genus:
```{r}
table(ctv,LactoPresent=rowSums(stv10[,sq.lacto])>0)
fisher.test(x=factor(rowSums(stv10[,sq.lacto])>0), y=ctv)$p.value
```

Nope.

```{r}
sessionInfo()
```