/
statistics.Rmd
253 lines (189 loc) · 9.25 KB
/
statistics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
title: "Statistics"
---
```{r, include = FALSE}
library(rbiom)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
R.options = list(
pillar.print_min = 5,
pillar.print_max = 5 ))
```
# Introduction
Visualizations are one of the best ways to identify correlations in your
dataset. If you can see a trend with your eyes, then you're on the right track.
Rbiom's plotting functions are
```{r, echo=FALSE, results='asis'}
cat(glue::glue('
<table class="table-borderless">
<tr>{adiv_boxplot}{bdiv_corrplot}{rare_stacked}{taxa_boxplot}</tr>
<tr>{adiv_corrplot}{bdiv_heatmap}{rare_corrplot}{taxa_corrplot}</tr>
<tr>{bdiv_boxplot}{bdiv_ord_plot}{taxa_stacked}{taxa_heatmap}</tr>
</table>',
.transformer = function (text, ...) {
sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text)
}))
```
The `*_boxplot()`, `*_corrplot()`, and `bdiv_ord_plot()` functions will
automatically add p-values to your figures whenever possible. The ggplot object
they return has `$data`, `$code`, `$stats`, and `$stats$code` attributes you
can use to automate, reproduce, and customize your figures.
To generate statistics without creating a plot, use one of the following:
```{r, echo=FALSE, results='asis'}
cat(glue::glue('
<table class="table-borderless">
<tr>{adiv_stats}{bdiv_stats}{distmat_stats}{taxa_stats}</tr>
</table>',
.transformer = function (text, ...) {
sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text)
}))
```
\cr
# Quick Start
Your metadata field of interest and microbiome property will determine which
rbiom function to use.
<table>
<tr>
<th rowspan=2 class="border-0 border-end border-bottom align-middle">Metadata<br>Property</th>
<th colspan=3 class="border-0" align="center">Microbiome Property</th>
</tr><tr>
<th class="border-bottom">Alpha Diversity<br><span class="eg">Shannon, Simpson</span></th>
<th class="border-bottom">Beta Diversity<br><span class="eg">UniFrac, Jaccard</span></th>
<th class="border-bottom">Taxa Abundance<br><span class="eg">Phylum, Genus</span></th>
</tr><tr>
<th class="border-end">Categorical<br><span class="eg">Sex, Body Site</span></th>
<td>`adiv_boxplot()`</td>
<td>`bdiv_boxplot()`<br>`bdiv_ord_plot()`</td>
<td>`taxa_boxplot()`</td>
</tr><tr>
<th class="border-end">Numeric<br><span class="eg">Age, BMI</span></th>
<td>`adiv_corrplot()`</td>
<td>`bdiv_corrplot()`</td>
<td>`taxa_corrplot()`</td>
</tr><tr>
<th class="border-end"><i>Any</i></th>
<td>`adiv_stats()`</td>
<td>`bdiv_stats()`</td>
<td>`taxa_stats()`</td>
</tr>
</table>
For instance, to explore the effect of Body Site (a categorical metadata field)
on Shannon Diversity (an alpha diversity metric), we'd use `adiv_boxplot()` to
produce a plot with statistics, or `adiv_stats()` if we only want the stats.
Although all the functions have important differences, the statistical methods
they employ can be grouped by the three plot types: ordination plot, box plot,
and correlation plot.
\cr
# Ordination Plots
Statistics for ordination plots are the most straight-forward. Set a
categorical metadata field to the `color.by` parameter to test whether
inter-sample distances are correlated with that variable.
```{r}
p <- bdiv_ord_plot(
biom = rarefy(hmp50),
color.by = "Body Site",
bdiv = c("Jaccard", "Bray-Curtis"),
ord = c("PCoA", "UMAP") )
p
p$stats
p$stats$code
```
The plot subtitles have the summary statistics. Additionally, `p$stats`
contains a tibble data.frame with the full statistics table, and `p$stats$code`
shows the R commands for reproducing the statistics outside of rbiom.
Note that the ordination statistics are not dependent on the ordination, only
the distance metric. This is because the statistics are based on beta diversity
distances which are computed prior to ordination.
By default, `bdiv_ord_plot()` applies the perMANOVA test. You can change this
to MRPP by specifying `test="mrpp"`. Details on the available tests are below.
| Test | Function | Method |
| --------- | ------------------ | ----------------------------------------------------------- |
| `adonis2` | `vegan::adonis2()` | Permutational Multivariate Analysis of Variance (perMANOVA) |
| `mrpp` | `vegan::mrpp()` | Multiple Response Permutation Procedure (MRPP) |
\cr
# Box Plots
Statistics on box plots will automatically toggle between pairwise and
group-wise statistics based on the values of `x` and `color.by`: `x` controls
pairwise and `color.by` controls group-wise. You can set `x` and `color.by` to
the same categorical metadata field to get colored pairwise statistics, or set
them to different categorical metadata fields to get multiple group-wise
statistics per plot.
```{r}
biom <- rarefy(hmp50) %>%
subset(`Body Site` %in% c('Saliva', 'Stool', 'Buccal mucosa'))
p1 <- adiv_boxplot(biom, x = "Body Site")
p2 <- adiv_boxplot(biom, color.by = "Body Site")
p3 <- adiv_boxplot(biom, x = "Body Site", color.by = "Body Site")
p4 <- adiv_boxplot(biom, x = "Sex", color.by = "Body Site")
plots <- list(
p1 + ggplot2::labs(subtitle = 'x = "Body Site"'),
p2 + ggplot2::labs(subtitle = 'color.by = "Body Site"'),
p3 + ggplot2::labs(subtitle = 'x = color.by = "Body Site"'),
p4 + ggplot2::labs(subtitle = 'x = "Sex", color.by = "Body Site"')) %>%
lapply(`+`, ggplot2::labs(x = NULL, y = NULL, caption = NULL))
patchwork::wrap_plots(plots, guides = "collect")
```
Above, the two plots on the left are annotated with pairwise statistics while
the two on the right have group-wise statistics. As with other plots, you can
find the full statistics tables and reproducible R code in the plot attributes.
```{r}
p1$stats
p2$stats
p2$stats$code
```
Internally, rbiom uses the non-parametric functions listed below.
| Test | Function | Method |
| ------------- | ----------------------- | -------------------------------------------------------- |
| pairwise | `stats::wilcox.test()` | Two-sample Wilcoxon Rank Sum Test, aka Mann-Whitney Test |
| group-wise | `stats::kruskal.test()` | Kruskal-Wallis Rank Sum Test |
\cr
# Background
## Normality
A normal distribution is visualized as a "bell curve", where values further
from the mean are observed less often. Microbial abundances do not follow this
pattern; it's common to observe high or low abundances more often than a
"medium" abundance.
```{r}
library(ggplot2)
patchwork::wrap_plots(
widths = c(1, 1.5),
ggplot() +
geom_histogram(aes(x=rnorm(1000)), bins = 10) +
ggtitle("Normal Distribution"),
ggplot(data = taxa_table(rarefy(hmp50), taxa = 4)) +
geom_histogram(aes(x=.abundance), bins = 10) +
facet_wrap(".taxa") +
ggtitle("Genera Abundance Distributions")
)
```
To compensate for this non-normality, rbiom uses the following non-parametric
tests for categorical variables that are based on ranking or permutations.
| Test | Function | Used For |
| ----------------------- | ----------------------- | -------------------------- |
| Wilcoxon Rank-Sum | `stats::wilcox.test()` | Pairwise boxplot |
| Kruskal-Wallis Rank Sum | `stats::kruskal.test()` | Groupwise boxplot |
| Permutational MANOVA | `vegan::adonis2()` | `bdiv_ord_plot()` clusters |
For correlation/regression analysis, rbiom uses estimated marginal means
(`emmeans::emmeans()`) on rank-transformed values.
Further reading:
* [Applied Multivariate Statistics in R](https://uw.pressbooks.pub/appliedmultivariatestatistics/):
[PERMANOVA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/permanova/),
[Comparison of Techniques](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/comparison-of-techniques/)
* [R Applications — Part 1: Simple Linear Regression](https://medium.com/datasciencearth/r-applications-part-1-simple-linear-regression-ef5a0e19a05d),
particularly the section on [assumption control](https://medium.com/datasciencearth/r-applications-part-1-simple-linear-regression-ef5a0e19a05d#a6f6).
\cr
## Compositionality
Compositional data arises when the counts don't represent the entire
population. In microbiome studies, the number of microbes that get sequenced
is far less than the number of microbes from where the sample was collected.
Articles by [Gloor et al](https://doi.org/10.3389/fmicb.2017.02224) and
[McMurdie and Holmes](https://doi.org/10.1371/journal.pcbi.1003531)
propose the use of their analysis tools (
[ALDEx2](https://doi.org/doi:10.18129/B9.bioc.ALDEx2) and
[metagenomeSeq](https://doi.org/doi:10.18129/B9.bioc.metagenomeSeq),
respectively) to apply the proper statistical methods for this situation.
Conversely, rbiom does not correct for compositionality. This is because
correcting for compositionality introduces extra noise into your dataset and
severely limits your selection of metrics and visualizations, typically
without any significant benefit to analysis.