-
Notifications
You must be signed in to change notification settings - Fork 3
/
analyze.Rmd
407 lines (333 loc) · 17.6 KB
/
analyze.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
---
title: "Canine Epilepsy - Data Analysis"
author: "BJC"
date: "3/28/2020"
output: html_document
---
## Setup
Load libraries, setup paths, prepare environment:
```{r init, warning=FALSE, message=FALSE}
library(dada2);packageVersion("dada2")
library(phyloseq); packageVersion("phyloseq")
library(DECIPHER); packageVersion("DECIPHER")
library(phangorn); packageVersion("phangorn")
library(vegan); packageVersion("vegan")
library(ggplot2)
library(reshape2)
path.fig <- "Figures/" # Local path. Assumes working directory is this files location
path.rds <- "RDS/" # Local path...
fnMeta <- "epilepsy_metadata.tsv" # Local path...
theme_set(theme_bw())
set.seed(100)
```
Read in and coordinate the processed data:
```{r}
df <- read.table(fnMeta, header=TRUE, sep="\t", stringsAsFactors = FALSE)
df$Category <- factor(df$Category)
df$Household <- factor(df$Household)
rownames(df) <- df$SampleID
st <- readRDS(file.path(path.rds, "st.rds"))
tax <- readRDS(file.path(path.rds, "taxp.rds"))
if(!identical(colnames(st), rownames(tax))) stop("st/tax mismatch.")
# Fix sample names for st, currently fastq filename rather than shroter sample name
rownames(st) <- sapply(strsplit(rownames(st), "_"), `[`, 1)
if(!all(rownames(st) %in% rownames(df))) stop("st/df mismatch.")
# Coordinate df with st
df <- df[rownames(st),]
if(!identical(rownames(df), rownames(st))) stop("st/df mismatch (2).")
# Make frequency table, i.e. normalized to proportions
ft <- sweep(st, 1, rowSums(st), "/")
```
Save these input tables in renamed variables:
```{r}
dfi <- df
sti <- st
fti <- ft
taxi <- tax
sq <- colnames(st)
```
## Quality Assurance
Do some QA on the sequencing libraries:
```{r}
plot(rowSums(st), log="y")
```
One sample had about 10x more reads than the others, and another sample had very few (<100) reads. Going to need to remove that very low read count sample:
```{r}
keep.samples <- rownames(sti)[rowSums(sti) > 100]
st <- sti[keep.samples,]
ft <- fti[keep.samples,]
samdf <- dfi[keep.samples,]
if(!identical(rownames(samdf), rownames(st))) stop("st/samdf mismatch (3).")
```
## Identify and Explore the Lactobacilli
Identifying the Lactobacillus ASVs:
```{r}
i.lacto <- which(tax[,6] %in% "Lactobacillus")
sq.lacto <- getSequences(tax)[i.lacto]
unname(tax[sq.lacto,6:7]) # Just 2 have species IDs
```
Eight Lactobacillus ASVs. Just two have species assignments by the unambiguous exact-matching method, so will have to do some by hand species (or species-group) assignments.
Looking at the overall frequencies of these:
```{r}
unname(colSums(ft[,sq.lacto]))/sum(ft) # OOM dropoff after the first 3
```
These are at relatively low overall frequencies. The three most abundant Lactobacillus ASVs are only ~0.05\% of total bacterial frequency across the study, and the remaining are an order of magnitude less frequent than that.
Let's look on a per-sample basis.
```{r}
plot(rowSums(ft[,sq.lacto]), ylab="Lactobacillus proportion", xlab="Sample")
```
```{r, warning=FALSE}
plot(rowSums(ft[,sq.lacto]), log="y", ylab="Lactobacillus proportion", xlab="Sample")
```
One sample is ~4% Lactobacillus, all the rest at less than 0.2%, but most samples do have some Lactobacillus.
Plotting the per-sample frequencies of each Lactobacillus ASV.
```{r, warning=FALSE}
ft.lacto <- ft[,sq.lacto]
colnames(ft.lacto) <- paste0("Lacto", seq_along(sq.lacto))
rownames(ft.lacto) <- seq(nrow(ft.lacto))
pdf.lacto <- melt(ft.lacto, varnames=c("Sample", "ASV"), value.name="Proportion")
ggplot(data=pdf.lacto, aes(x=Sample, y=Proportion, color=ASV)) + geom_point() +
scale_y_log10()
```
Quite a bit of co-occurrence here, in which multiple distinct Lactobacillus ASVs are observed in the same sample. No discernible evidence of this coming from multiple ASVs from the same strain though, as no ASVs are strongly correlated in their frequency distribution across samples.
Consider how different the various Lactobacillus ASVs are from one another:
```{r}
outer(sq.lacto, sq.lacto, nwhamming, vec=TRUE)
```
These Lactobacillus ASVs are quite different from each other, on the order of 15 substitution difference between a random pair of Lactobacillus ASVs is about a 6\% sequence-level difference.
Now going to to by-hand species-level assignment using BLAST against nt (excluding environmental/uncultured), and manual inspection/curation of the results. Only 100\% identity matches will be considered.
```{r}
dada2:::pfasta(sq.lacto)
# BLAST againt nt excluding environmental/uncultured using web interface
```
By-hand species group assignments based on the BLAST 100\% match results. *Note, there are some ad hoc cutoffs being used here based on there being "lots more" of certain species matches than others, and these assignments are best interpreted as suggestive rather than definitive.*
```{r}
species.lacto <- c("johnsonii/gasseri",
"reuteri",
"murinus/animals",
"reuteri",
"casei/paracasei",
"sakei/curvatus",
"aviarius",
"amylovorus")
```
Taking a closer look at Lacto ASV2/4 that were both assigned to *L. reuteri*. Could they be intragenomic variants?
```{r}
freq.lacto2 <- ft[,sq.lacto[[2]]]
freq.lacto4 <- ft[,sq.lacto[[4]]]
cor(freq.lacto2, freq.lacto4)
cor(freq.lacto2, freq.lacto4, method="spearman")
```
```{r}
p <- ggplot(data=data.frame(Lacto2=freq.lacto2, Lacto4=freq.lacto4), aes(x=Lacto2, y=Lacto4)) +
geom_point()
p
p + scale_y_log10() + scale_x_log10()
```
It's possible they could be intragenomic variants from a single strain, but the evidence is not compelling. The huge Pearson correlation is driven by the one sample in which both ASVs appear in high frequency.
## Make Table of Lactobacillus species prevalance based on 16S results
Creating a table of prevalence (i.e. number of samples in which a taxa was detected) for each of the 7 Lactobacillus species detected by 16S (remember that 2/8 ASVs both came from *L. reuteri*). This data makes up Table 2 in the manuscript.
```{r}
species <- unique(species.lacto)
df.lacto <- data.frame(Species = paste("Lactobacillus", species),
Prevalence = integer(length(species)),
Frequency=numeric(length(species)))
rownames(df.lacto) <- species
for(spc in species) {
ii <- i.lacto[species.lacto==spc]
prv <- sum(rowSums(st[,ii,drop=FALSE])>0)
frq <- sum(ft[,ii,drop=FALSE])/nrow(ft)
df.lacto[spc,"Prevalence"] <- prv
df.lacto[spc,"Frequency"] <- frq
}
df.lacto <- df.lacto[order(df.lacto$Prevalence, decreasing=TRUE),]
rownames(df.lacto) <- NULL
df.lacto
write.table(df.lacto, file.path(path.fig, "Lacto_prev_freq.tsv"),
sep="\t", row.names=FALSE, quote=FALSE)
```
## Compare Lacto CFUs to Lacto 16S Frequencies
We generated Lactobacillus CFU counts from these same samples. How well do the frequencies of Lactoabcillus as measured by 16S rRNA gene sequencing correspond to those CFU counts?
```{r, warning=FALSE}
samdf$Freq.lacto <- rowSums(ft[,i.lacto])
plot(samdf$CFU.lacto, samdf$Freq.lacto)
plot(samdf$CFU.lacto, samdf$Freq.lacto, log="xy")
```
There is some correlation, but it is not that strong. Of course, this is expected, as the CFU count is an absolute abundance count, while the 16S gene frequency is a relative abundance measurement, and both are subject to different biases in their sensitivity to *Lactobacillus* generally, that could also potentially vary between different *Lactobacillus* species.
How about the consistency of presence-absence detections across the methods?
```{r}
has.CFU <- !is.na(samdf$CFU.lacto) # At least one sample had a failed CFU measurement
table(samdf$CFU.lacto[has.CFU]>0, samdf$Freq.lacto[has.CFU]>0)
```
No correspondence in this crude presence-absence comparison. Perhaps better selecting presence-absence thresholds could improve this, but again, it is well known that CFU plating methods and community sequencing methods have qualitatively different sensitivity thresholds to rare taxa.
Now making the publication quality figure of this correspondence:
```{r, warning=FALSE}
p.cor <- ggplot(data=samdf, aes(x=CFU.lacto, y=Freq.lacto)) + geom_point() +
scale_x_log10() + scale_y_log10() +
xlab("CFUs/g (log10 scale)") + ylab("16S Frequency (log10 scale)") +
geom_smooth(method="lm") + ggtitle("Lactobacillus")
p.cor
ggsave(file.path(path.fig, "lacto_cor.pdf"), p.cor, width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "lacto_cor.png"), p.cor, width=6, height=4, units="in")
```
Quantifying the correlation between these measures:
```{r}
fitdf <- samdf[has.CFU,]
fitdf$LogCFU <- log(fitdf$CFU.lacto)
fitdf$LogFreq <- log(fitdf$Freq.lacto)
fitdf.nozeros <- fitdf[is.finite(fitdf$LogCFU) & is.finite(fitdf$LogFreq),]
mod <- lm(LogCFU ~ LogFreq, fitdf.nozeros)
summary(mod)
```
Note that in this analysis we excluded the zero measurements (by either method). Alternatively one could replace the zeros with defined replacement values, e.g. half the minimum value observed in the dataset. We also check that approach to ensure that our qualitative conclusions remain the same:
```{r}
fitdf.pseudo <- fitdf
fitdf.pseudo$CFU.lacto[fitdf.pseudo$CFU.lacto == 0] <-
(1./2.) * min(fitdf.pseudo$CFU.lacto[fitdf.pseudo$CFU.lacto>0])
fitdf.pseudo$Freq.lacto[fitdf.pseudo$Freq.lacto == 0] <-
(1./2.) * min(fitdf.pseudo$Freq.lacto[fitdf.pseudo$Freq.lacto>0])
fitdf.pseudo$LogCFU <- log(fitdf.pseudo$CFU.lacto)
fitdf.pseudo$LogFreq <- log(fitdf.pseudo$Freq.lacto)
mod.pseudo <- lm(LogCFU ~ LogFreq, fitdf.pseudo)
summary(mod.pseudo)
```
Clearly R^2 is less, but still a significant trend and a non-trivial R^2 of greater than 0.2. Will use the previous numbers in the paper though, as they correspond to the fit line on the plot which excludes the zero/infinites.
## Ordination plotting of microbiome profiles
First we'll now move the data into the phyloseq R package which provides some convenient interfaces for doing ordination plotting. For this, we will only keep samples from households in which samples from both animals (the epileptic and control dog) are available.
```{r}
ps <- phyloseq(otu_table(ft, taxa_are_rows=FALSE), sample_data(samdf), tax_table(tax))
tab <- table(samdf$Household) # 020Petey was removed for too few reads, so just 1 member of Household 10
keep <- names(tab)[tab>1]
ps <- subset_samples(ps, Household %in% keep); ps # 26 samples, OK
ps <- subset_taxa(ps, taxa_sums(ps)>0)
seqs <- taxa_names(ps); names(seqs) <- seqs
if(!identical(unname(seqs), sq)) stop("Mismatch between phyloseq seqs and sq.")
```
Now to create a phylogenetic tree for these ASVs. The code here is adapted from [Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses](https://f1000research.com/articles/5-1492/v2).
```{r}
## Uncomment to run this part, otherwise will load pre-computed object
#alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
#phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
#dm <- dist.ml(phang.align)
#treeNJ <- NJ(dm) # Note, tip order != sequence order
#fit = pml(treeNJ, data=phang.align)
#fitGTR <- update(fit, k=4, inv=0.2)
#fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
# rearrangement = "stochastic", control = pml.control(trace = 0))
#saveRDS(fitGTR, file.path(path.rds, "fitGTR.rds"))
## End uncomment
fitGTR <- readRDS(file.path(path.rds, "fitGTR.rds"))
psord <- merge_phyloseq(ps, phy_tree(fitGTR$tree))
if(!identical(otu_table(psord), otu_table(ps))) stop("psord/ps mismatch.")
```
Plotting ordinations for the manuscript figure. First perform the ordination calculations:
```{r, results='hide'}
ord.bray.mds <- ordinate(psord, method="MDS", distance="bray")
ord.bray.nmds <- ordinate(psord, method="NMDS", distance="bray")
ord.wuf.mds <- ordinate(psord, method="MDS", distance="wunifrac")
ord.wuf.nmds <- ordinate(psord, method="NMDS", distance="wunifrac")
```
Plot Bray_Curtis and weighted Unifrac ordination figures, using both MDS and NMDS ordination methods, to confirm that results are qualitatively robust (i.e. households group together) to choice of dissimilarity and ordination method.
Bray-Curtis ordinations:
```{r}
plot.ord <- function(psi, ordi, color="Category", label="Household", size=5, axes=c(1,2)) {
p <- plot_ordination(psi, ordi, axes=axes)
ggplot(data=p$data, aes_string(x=quo_name(p$mapping$x), y=quo_name(p$mapping$y),
color=color, label=label)) + geom_text(size=size)
}
p.bray <- plot.ord(psord, ord.bray.mds)
p.bray + ggtitle("Bray - MDS")
plot.ord(psord, ord.bray.nmds) + ggtitle("Bray - NMDS")
```
Weighted-Unifrac ordinations:
```{r}
p.wuf <- plot.ord(psord, ord.wuf.mds)
p.wuf + ggtitle("WUF - MDS")
plot.ord(psord, ord.wuf.nmds) + ggtitle("WUF - NMDS")
```
Saving the MDS ordinations for use in the mansucript:
```{r}
ggsave(file.path(path.fig, "ordination_bray.pdf"), p.bray, width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "ordination_bray.png"), p.bray, width=6, height=4, units="in")
ggsave(file.path(path.fig, "ordination_unifrac.pdf"), p.wuf, width=6, height=4, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "ordination_unifrac.png"), p.wuf, width=6, height=4, units="in")
```
## Permutation testing of houshold effect on the microbiome
The tendency for samples from the same household to group together in the ordination is clear visually, but to confirm this insight we explore this statistically by comparing the community dissimilarities between pairs of dogs in the same household to the null distribution generated by computing dissimilarities between random pairs of dogs.
First calculate the dissimilarities between all samples:
```{r}
bc <- phyloseq::distance(psord, method="bray")
```
Our test statistic will be the median dissimilarity between samples from the same household. Define method for extracting this statistic from the dissimilarities and the vector of households:
```{r}
get.median.dist <- function(dst, hh) {
dst <- as.matrix(dst)
hmat <- sapply(levels(hh), function(x) which(hh==x))
hdsts <- sapply(seq(ncol(hmat)), function(c) dst[hmat[1,c], hmat[2,c]])
median(hdsts)
}
hhs <- sample_data(psord)$Household
test.stat <- get.median.dist(bc, hhs)
```
Now calculate a null distribution of the test statistic under permutations of the household label:
```{r}
perms <- lapply(1:10000, function(i) sample(hhs))
distro.stat <- sapply(perms, function(prm) get.median.dist(bc, prm))
```
Plot the null (permuted) distribution of the statistic:
```{r}
hist(distro.stat, xlim=c(0,1))
abline(v=test.stat, col="red")
```
The median distance between samples from the same household is lower than the smallest value calculated from 10,000 permutations. We can safely conclude that the same-household dogs have lower dissimilarity between their microbiomes with P < 0.001
## Permanova testing of Household/Epilepsy/Age/Sex effects on the microbiome
First perform permanova testing on Household:
```{r}
samdf.ord <- data.frame(sample_data(psord))
vegan::adonis(bc ~ Household, data = samdf.ord, permutations=1e4)
```
There is a very statistically significant household effect, P < 0.001, and it explains quite a lot of the overall community dissimilarities (R2 > 0.6).
Next perform permanova testing on Category (i.e. epeleptic status), both independently, and when controlling for the strong Household effect by constraining permutations to be within households:
```{r}
vegan::adonis(bc ~ Category, data = samdf.ord, permutations=1e4)
vegan::adonis(bc ~ Category, data = samdf.ord, strata=samdf.ord$Household, permutations=1e4)
```
There is no statistically significant Category effect. P > 0.05 (much greater) whether controlling for Household or not.
Next perform permanova testing on Age, both independently, and when controlling for the strong Household effect by constraining permutations to be within households:
```{r}
vegan::adonis(bc ~ Age, data = samdf.ord, permutations=1e4)
vegan::adonis(bc ~ Age, data = samdf.ord, strata=samdf.ord$Household, permutations=1e4)
```
There is no statistically significant Age effect (P > 0.05). However, it "trends towards significance" when controlling for the Household effect, so Age is certainly worth revisiting in our subsequent larger study, also because other larger studies have shown an Age effect.
Finally perform permanova testing on Sex, both independently, and when controlling for the strong Household effect by constraining permutations to be within households:
```{r}
vegan::adonis(bc ~ Sex, data = samdf.ord, permutations=1e4)
vegan::adonis(bc ~ Sex, data = samdf.ord, strata=samdf.ord$Household, permutations=1e4)
```
There is no statistically significant Sex effect. P > 0.05 (much greater) whether controlling for Household or not.
## Differential Prevalence testing of Lactobacillus
### Fisher's exact testing
We'll perform Fisher's exact testing of the differential prevalence of each ASV in the Epilepsy and Control categories:
```{r}
stv <- as(otu_table(psord), "matrix")
stv10 <- stv[,colSums(stv>0)>=2 & colSums(stv>0)<=(nrow(stv)-2)] # At least two presences/absences for each
ctv <- sample_data(psord)$Category
fpv <- sapply(colnames(stv10), function(sqi) fisher.test(x=factor(stv10[,sqi]>0), y=ctv)$p.value)
plot(fpv, ylim=c(0,1))
abline(h=0.1, col="red")
```
Given that this is an exploratory analysis with a high number of hypotheses, there is nothing statistically significant here after any reasonable FDR correction.
We look more closely at our prior hypotheses, the Lactobacilli:
```{r}
unname(fpv[sq.lacto])
```
Nope, even in the absence of an FDR correction for these taxa we previously proposed to study, there is no significant differential prevalence.
Last, we will look at the Lactobacillus genus:
```{r}
table(ctv,LactoPresent=rowSums(stv10[,sq.lacto])>0)
fisher.test(x=factor(rowSums(stv10[,sq.lacto])>0), y=ctv)$p.value
```
Nope.
```{r}
sessionInfo()
```