# 7. Beta diversity

**************************

## Section contents
[1. Overview](#overview)

[2. Setting up your analysis environment](#envana1)

[3. Preparing your data](#dataana1)

[4. Choosing a categorical variable to analyse](#varana)

[5. PCoA plots and statistics - categorical variables](#catana)

[6. PCoA plots - continuous variables](#contana)



****************************

## 1. Overview <a class="anchor" id="overview"></a>

While alpha diversity examines differences within treatment groups (and thus can only examine categorical variables), beta diversity measures the similarity (or dissimilarity) of microbial community composition **between** samples.

Each variable is plotted on [Principal coordinates analysis (PCoA)](https://mb3is.megx.net/gustame/dissimilarity-based-methods/principal-coordinates-analysis) plots, to examine the variance between samples based on a dissimilarity matrix. A detailed explanation of PCoA and other ordination methods can be seen here: http://albertsenlab.org/ampvis2-ordination/

Sample distance has been measured using 3 distance-based ordination methods (plotted on 2 separate PCoA plots per variable). These methods are:

1. [**Bray–Curtis dissimilarity**](https://www.statisticshowto.com/bray-curtis-dissimilarity/) measures the fraction of overabundant counts.

`Sørensen, T. (1948) “A method of establishing groups of equal amplitude in plant sociology based on similarity of species content.” Kongelige Danske Videnskabernes Selskab 5.1-34: 4-7.`

2. [**Cao index**](https://www.sciencedirect.com/science/article/abs/pii/S0043135496003223) is a minimally biased index for high beta diversity and variable sampling intensity. It should not be confused with the similarly named Chao index (`"chao"` in vegdist), which is a different measure that tries to take into account the number of unseen species pairs.

`Cao, Y., Bark, A. W., & Williams, W. P. (1997). Analysing benthic macroinvertebrate community changes along a pollution gradient: a framework for the development of biotic indices. Water Research, 31(4), 884-892.`

3. [**Jaccard similarity index**](https://www.statology.org/jaccard-similarity/) measures the fraction of unique features, regardless of abundance.

`Jaccard, P. (1908). “Nouvelles recherches sur la distribution florale.” Bull. Soc. Vaud. Sci. Nat., (44):223-270.`
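To make the first and third definitions more concrete, here is a minimal base R sketch on two made-up samples (the counts are invented purely for illustration). It calculates Bray–Curtis dissimilarity by hand and a presence/absence Jaccard dissimilarity. Note that vegan's `vegdist(method = "jaccard")` only ignores abundance when `binary = TRUE` is also set.

```r
# Two hypothetical samples with counts for four ASVs (invented numbers)
sample_a <- c(asv1 = 10, asv2 = 0, asv3 = 5, asv4 = 5)
sample_b <- c(asv1 = 8,  asv2 = 2, asv3 = 0, asv4 = 10)

# Bray-Curtis dissimilarity: summed absolute differences / summed totals
bray_curtis <- sum(abs(sample_a - sample_b)) / sum(sample_a + sample_b)

# Jaccard dissimilarity on presence/absence:
# 1 - (ASVs found in both samples / ASVs found in either sample)
present_a <- sample_a > 0
present_b <- sample_b > 0
jaccard <- 1 - sum(present_a & present_b) / sum(present_a | present_b)

bray_curtis  # 0.35
jaccard      # 0.5
```

Both values range from 0 (identical samples) to 1 (no overlap).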


**Significance tests**

For each beta diversity method, both overall and pairwise significance were calculated using a [Permutational Multivariate Analysis of Variance (PERMANOVA)](https://archetypalecology.wordpress.com/2018/02/21/permutational-multivariate-analysis-of-variance-permanova-in-r-preliminary/), a non-parametric multivariate statistical test. This was done in R using the [adonis](https://rdrr.io/rforge/vegan/man/adonis.html) family of functions (`adonis2` in the code in this section) from the [vegan: Community Ecology Package](https://rdrr.io/rforge/vegan/).

A sample-sample distance matrix was first generated from relative (normalised) abundance tables (except for Cao index, which used absolute abundances) using the [vegdist](https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/vegdist) function with each of the three distance measures (Bray-Curtis, Cao and Jaccard). PERMANOVA R-squared and p values were then calculated on this distance matrix.

The R-squared value represents the proportion of variance explained by the examined groups, e.g. if R-squared = 0.23 then 23% of the total variance in community composition is explained by groupwise differences. PERMANOVA, as applied here, is based on groupwise differences, so it is not applied to continuous data.
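The following is a simplified base R sketch of what a one-factor PERMANOVA calculates, using an invented toy dataset (the counts, group labels and helper names are all assumptions made for illustration). It is not a replacement for vegan, which tests a pseudo-F statistic and handles more complex designs, but with a single grouping factor the logic is the same: R-squared is the share of the total squared distance that is not within groups, and the p value is the fraction of label shuffles that do at least as well.

```r
# Toy data: 6 samples (rows) in 2 groups, 3 hypothetical ASVs (columns)
counts <- matrix(c(10, 2, 0,
                    9, 3, 1,
                   11, 1, 0,
                    2, 8, 5,
                    1, 9, 6,
                    3, 7, 4), ncol = 3, byrow = TRUE)
grp <- rep(c("A", "B"), each = 3)

# Relative abundance per sample, then a Bray-Curtis distance matrix
rel <- counts / rowSums(counts)
n <- nrow(rel)
d <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n) {
  d[i, j] <- sum(abs(rel[i, ] - rel[j, ])) / sum(rel[i, ] + rel[j, ])
}

# Total sum of squares from all pairwise distances
ss_total <- sum(d[upper.tri(d)]^2) / n

# Within-group sum of squares for a given set of group labels
ss_within <- function(labels) {
  sum(sapply(unique(labels), function(k) {
    idx <- which(labels == k)
    dk <- d[idx, idx, drop = FALSE]
    sum(dk[upper.tri(dk)]^2) / length(idx)
  }))
}

# Observed R-squared: proportion of variance explained by the groups
r2_obs <- 1 - ss_within(grp) / ss_total

# Permutation p value: shuffle the labels and recompute R-squared
set.seed(1)
n_perm <- 999
r2_perm <- replicate(n_perm, 1 - ss_within(sample(grp)) / ss_total)
p_val <- (sum(r2_perm >= r2_obs - 1e-12) + 1) / (n_perm + 1)

r2_obs
p_val
```

With only three samples per group there are very few distinct ways to shuffle the labels, so the smallest achievable p value is limited by the number of replicates.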

***********************************

## 2. Setting up your analysis environment <a class="anchor" id="envana1"></a>

Before you can begin your analysis you need to set up certain requirements, such as setting your working directory and installing/loading required R packages.

### Set your working directory

Your [working directory](https://r-coder.com/working-directory-r/) in R is a base directory where R looks for your anacapa data files (and outputs files to).

You will need to find the path of your anacapa results directory on the HPC and paste the location into the code cell below (setwd("~/Paste/Your/anacapa/Results/Directory/Path/Here")).

In [None]:
setwd("~/anacapa")

### Install the required R packages

R does most of its analysis using [functions](https://www.tutorialspoint.com/r/r_functions.htm). Some of these are built into base R, but many come as external [packages](https://r-pkgs.org/intro.html), which need to be installed and loaded into R.

Load any required packages that have previously been installed using the [library()](https://www.tutorialspoint.com/r/r_packages.htm) function:

In [None]:
library(tidyverse)
library(scales)
library(viridis)

Other packages need to be installed first.

The key R analysis package used was ampvis2 (['Tools for visualising amplicon data'](https://madsalbertsen.github.io/ampvis2/)). Results from anacapa (ASV tables, taxonomy assignment, etc) are used as input for ampvis2 in this notebook. Install and load ampvis2 (this will take a few minutes as there are multiple dependent packages installed):

In [None]:
install.packages("remotes", verbose = FALSE)
remotes::install_github("MadsAlbertsen/ampvis2", quiet = TRUE)
library(ampvis2)

Install and load the vegan package. This is needed for statistical analysis of beta diversity results.

In [None]:
install.packages("vegan")
library(vegan)
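If you re-run this notebook often, you may prefer to check which packages are missing before installing anything. The sketch below uses a hypothetical helper, `missing_packages()`, that is not part of any package. It only reports what is missing, and the install line is left commented out so that nothing is installed unless you choose to.

```r
# Packages loaded in this section
needed <- c("tidyverse", "scales", "viridis", "vegan")

# Hypothetical helper: return the packages in 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All required packages are already installed")
}
```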

Define a set of colours for plotting. Some of these plots have multiple clusters and it's difficult to find enough contrasting colours to visually separate the clusters. I've developed a set of 25 colours that I've found contrast well, that we can use in the plots for this (and other) sections.

In [None]:
c25 <- c(
  "dodgerblue2", "#E31A1C", # red
  "green4",
  "#6A3D9A", # purple
  "#FF7F00", # orange
  "black", "gold1",
  "skyblue2", "#FB9A99", # lt pink
  "palegreen2",
  "#CAB2D6", # lt purple
  "#FDBF6F", # lt orange
  "gray70", "khaki2",
  "maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "brown"
)
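A manual colour scale needs at least one colour per group, so it can be worth checking this before plotting. The sketch below uses a hypothetical helper, `colours_for_groups()`, and a shortened example palette. Both are assumptions for illustration rather than part of this workflow.

```r
# Hypothetical helper: return one colour per group, or stop with a clear
# message if the palette has fewer colours than there are groups
colours_for_groups <- function(groups, palette) {
  n_groups <- length(unique(groups))
  if (n_groups > length(palette)) {
    stop("Palette has ", length(palette), " colours but there are ",
         n_groups, " groups")
  }
  palette[seq_len(n_groups)]
}

# Shortened example palette and invented group labels
example_palette <- c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A")
example_groups <- c("S1", "S1", "S2", "S3")

colours_for_groups(example_groups, example_palette)
```

The same call with the full 25-colour set would work for any variable with up to 25 groups.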

Set the default width and height for plots output on this Notebook. You can modify this as you prefer. Note that every plot in this Notebook is followed by code to output it as a file and this code defines width/height separately from the options below.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

****************************

## 3. Preparing your data <a class="anchor" id="dataana1"></a>

This ampvis2-based analysis requires, as input, an R object in ampvis2 format, which has a specific structure. At minimum this requires a samples table, an ASV count table and a taxonomy table, which are then combined into a single ampvis2 object.

First you'll need to provide an ID for your project. This must be the project ID you used in the filtration section. See the previous section for details.

In [None]:
project_id <- "rbcl"

### Import the samples table

In [None]:
samples_table <- read.csv("sample_table.csv", header = TRUE)

Have a look at your samples table and variables (metadata). In the previous filtration section we didn't use this information, but when examining diversity indices, etc, the metadata is critical.

In [None]:
samples_table

### Import the ASV abundance table and taxonomy table

**IMPORTANT: this is the FILTERED data that you exported at the end of the previous filtration section. You must have run that section once (only once is needed) for this and following sections to work**

In [None]:
filtered_data <- read.csv(paste0(project_id, "_filtered_data.csv"))

Have a look at the top few rows of your data. The first 'ASV' column should contain the ASV IDs, the next columns are the samples, followed by the taxonomy levels.

In [None]:
head(filtered_data)
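Before combining the two tables, it can help to confirm that every sample ID in the samples table also appears as a column in the ASV table. The sketch below uses small invented tables. The `sample.id` column name follows the one used later in this section, while the taxonomy column names and counts are assumptions for illustration.

```r
# Invented samples table (metadata)
toy_samples <- data.frame(
  sample.id = c("S1_a", "S1_b", "S2_a"),
  Site = c("S1", "S1", "S2")
)

# Invented ASV table: ASV IDs, then one column per sample, then taxonomy
toy_asv <- data.frame(
  ASV = c("asv_1", "asv_2"),
  S1_a = c(12, 0),
  S1_b = c(7, 3),
  S2_a = c(0, 9),
  Kingdom = c("Eukaryota", "Eukaryota"),
  Phylum = c("Streptophyta", "Chlorophyta")
)

# Sample IDs that have no matching column in the ASV table
missing_ids <- toy_samples$sample.id[!toy_samples$sample.id %in% colnames(toy_asv)]
missing_ids  # empty if everything matches

# read.csv() makes column names syntactically valid by default, so IDs
# that start with a digit or contain hyphens are altered on import
make.names(c("S1_a", "1-A"))  # "S1_a" "X1.A"
```

If `missing_ids` is not empty for your own data, compare the spelling of the sample IDs in both files.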

### Create the ampvis2 database

Combine the samples data with the ASV table using amp_load(). This creates an ampvis2 object that can be used by the ampvis2 functions.

In [None]:
ampvisdata <- amp_load(otutable = filtered_data,
              metadata = samples_table)

You can see a summary of your dataset by simply running the object name. This shows you the number of taxa identified, total/average/maximum/minimum number of reads per sample, etc. 

In [None]:
ampvisdata

************************************

## 4. Choosing a categorical variable to analyse <a class="anchor" id="varana"></a>

**NOTE** This section is for choosing categorical variables only. See section 6 for analysis of continuous (i.e. numeric) variables.

You can view your variables as column names in your samples_table:

In [None]:
colnames(samples_table)

Enter the column name of the variable you want to analyse.

In [None]:
group <- "Site"

### Ordering your variable

The plotting done in ampvis2 is done by the [ggplot2](https://ggplot2.tidyverse.org/) package. ggplot [factorises](https://www.datamentor.io/r-programming/factor/) variables and automatically orders them on the plot by alphabetical order.

This can cause your groups to be ordered incorrectly on the plot axes (e.g. a time series may not be plotted sequentially). 

You can manually set the order of your variable here. This can be useful where the order of groups on a plot matters, e.g. for time series, or low, medium, high groups, etc. **You can skip this 'Ordering your variable' section if you don't need your groups in a particular order on plots.**

First have a look at how ggplot will order your variable.

In [None]:
levels(factor(ampvisdata$metadata[[group]]))

If these are in the order you want to see them on your plot axes, nothing needs to be done. If they are in the wrong order you need to order them manually by setting the [**levels**](https://www.datamentor.io/r-programming/factor/).

Choose how you want to order your groups here:

In [None]:
lev <- c("S1", "S2", "S3", "S4")

To order your variable you need to put **all** the variable levels into `lev <- c(...)`. Make sure each level is in double quotes and separated by a comma.

Then run the following to apply the levels to your data:

In [None]:
ampvisdata$metadata[[group]] <- factor(ampvisdata$metadata[[group]], levels = lev)
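The toy example below (with invented time point names) shows why the default alphabetical order can be wrong, and why **all** levels need to be listed: any value that is left out of `levels` is silently converted to `NA`.

```r
# Invented time points, one per sample
time_points <- c("week10", "week2", "week1", "week2")

# Default factor order is alphabetical, so week10 comes before week2
levels(factor(time_points))

# Setting the levels manually gives the intended order
lev_example <- c("week1", "week2", "week10")
ordered_tp <- factor(time_points, levels = lev_example)
levels(ordered_tp)

# Leaving out a level ("week10" here) turns those samples into NA
incomplete <- factor(time_points, levels = c("week1", "week2"))
sum(is.na(incomplete))  # 1
```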

**************************************

## 5. PCoA plots and statistics - categorical variables <a class="anchor" id="catana"></a>

The overview section outlined (with links and references) the ordination methods that can be used to estimate and plot beta diversity.

Briefly, these are: **Bray–Curtis dissimilarity**, **Cao index**, and **Jaccard similarity index**. Each of these has strengths and weaknesses. It's up to you, the researcher, to explore the literature and decide which is the best index to use for your data.

First, confirm the samples and variable that you chose in the previous section:

In [None]:
group

In [None]:
samples_table$sample.id

Then, choose the ordination method you want to use to estimate and plot beta diversity.

Bray–Curtis dissimilarity is used by default (`"bray"`). Change this to `"cao"` for Cao index, or `"jaccard"` for Jaccard similarity index.

In [None]:
index <- "bray"
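A typo in the index name would only surface later, when the plot or statistics are run. As an optional check, the sketch below tests a value against the three options described in this notebook. The helper name `is_valid_index()` is hypothetical, and the underlying functions accept other distance measures that are not covered here.

```r
# The three options described in this notebook
allowed_indices <- c("bray", "cao", "jaccard")

# Hypothetical helper: TRUE if 'x' is a single value from the allowed options
is_valid_index <- function(x) {
  length(x) == 1 && x %in% allowed_indices
}

is_valid_index("bray")      # TRUE
is_valid_index("Jaccard")   # FALSE: names are case sensitive
```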

<font color="blue">**Optional** You can include another variable in the plot using different point shapes. The initial variable is differentiated by colour, which can be seen in the below plot code with the `sample_color_by =` parameter. You can add the second variable to the `sample_shape_by =` argument.</font>

<font color="blue">First we'll name the variable as an object. To see which variables you can choose from, view the samples table column names:</font>

In [None]:
colnames(samples_table)

<font color="blue">Add the name of one of these to the code cell below (e.g. `group2 <- "Animal"`). If you don't want to include a second variable, leave the code cell below as `group2 <- NULL`</font>

In [None]:
group2 <- NULL

Now create the PCoA plot.

In [None]:
p <- amp_ordinate(ampvisdata, type = "pcoa", filter_species = 0, transform = "none", sample_label_by = 'sample_id', distmeasure = index, sample_color_by = group, sample_shape_by = group2, sample_point_size = 3, sample_colorframe = TRUE) + 
scale_color_manual(values=c25) + 
scale_fill_manual(values=c25) + 
theme(text = element_text(size = 18))
p

You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. **These files can be found in your working directory.**

**Tip:** you can adjust the width and height of the saved images by changing `width =` and `height =` in the code below (and every time ggsave appears in this workflow). Pdf files can be opened within Jupyter, so a good way to test a suitable width/height would be to save the image by running the pdf code below with the default 20cm width/height, then open the pdf file by clicking on it in the file browser panel (to the left of this notebook), then change the width/height and repeat this process as needed.

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0("PCoA_beta_div_", group, "_", index, ".tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0("PCoA_beta_div_", group, "_", index, ".pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Statistical analysis

To test for overall differences between the groups within your chosen variable, a PERMANOVA test can be performed. Similarly, a pairwise PERMANOVA test can be performed to compare each pair of groups.

**PERMANOVA:**

In [None]:
# Need to remove rows (from ASV abundance table) with all 0 counts first
asvmatrix <- ampvisdata$abund
asvmatrix <- asvmatrix[rowSums(asvmatrix) > 0, ]
# Also need to transpose (samples need to be as rows, ASVs as columns)
asvmatrix <- t(asvmatrix)
# Then generate pairwise distance matrix using the index chosen above
sampdist <- vegdist(asvmatrix, method = index)
# Use adonis2 function (vegan package: "Permutational Multivariate Analysis of Variance Using Distance Matrices") to run PERMANOVA on distances
pathotype.adonis <- adonis2(sampdist ~ get(group), data = samples_table)
# Output the r squared and p values as variables
r2 <- pathotype.adonis$R2[1]
pval <- pathotype.adonis$`Pr(>F)`[1]

PERMANOVA R squared =

In [None]:
round(r2, 2)

PERMANOVA significance (p) =

In [None]:
pval

**Pairwise PERMANOVA:**

Calculate the pairwise PERMANOVA. This is a bit complex, as each group within the variable has to be compared to each other group in a variety of ways. Code comments (#) explain what each line of code does.

In [None]:
# The combn function creates every combination of provided elements
# Below it takes all group names and combines them pairwise (2)
# Creates a matrix where each column = a combination
comb_pair <- data.frame(combn(unique(samples_table[[group]]),2))
# Convert scores to relative abundance
# Use sweep function to divide ("/") each column (2) by its total (colSums)
comm <- sweep(ampvisdata$abund,2,colSums(ampvisdata$abund),"/")
# Using adonis function (vegan package: "Permutational Multivariate Analysis of Variance Using Distance Matrices")
tabstat_adonis <- c()
# Loop through each pair (i.e. column in 'comb_pair')
for (i in 1:ncol(comb_pair)) {
  # Pull out pair data
  # From samples table
  samples_table_SB_pair <- samples_table[samples_table[[group]] %in% comb_pair[[i]], ]
  # From ASV matrix
  asvmatrix_pair <- comm[samples_table_SB_pair$sample.id]
  # Remove ASVs with all 0 counts in this pair of groups
  asvmatrix_pair <- asvmatrix_pair[rowSums(asvmatrix_pair) > 0, ]
  # Also need to transpose (samples need to be as rows, ASVs as columns)
  asvmatrix_pair <- t(asvmatrix_pair)
  # Then generate pairwise distance matrix using the index chosen above
  sampdist_pair <- vegdist(asvmatrix_pair, method = index)
  # Use adonis2 (vegan) to run PERMANOVA on this pair of groups
  x2 <- adonis2(sampdist_pair ~ get(group), data = samples_table_SB_pair)
  # Pull out just r squared and p value
  r2_adonis <- x2$R2[1]
  pval_adonis <- x2$`Pr(>F)`[1]
  # Combine into data frame
  tabstat_adonis <- cbind(tabstat_adonis, c(r2_adonis, pval_adonis))
  # Name vector with group combinations
  colnames(tabstat_adonis)[i] <- paste(comb_pair[[i]], collapse=' Vs ')
}
row.names(tabstat_adonis) <- c("R squared", "p")
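Two base R functions do most of the work in the loop above: `combn()` builds the group pairs and `sweep()` converts counts to relative abundance. The toy example below shows each on its own, using invented group names and counts.

```r
# Invented group names
toy_groups <- c("S1", "S2", "S3", "S4")

# combn() returns a matrix with one column per pairwise combination
toy_pairs <- combn(toy_groups, 2)
ncol(toy_pairs)  # 6 pairs from 4 groups

# Labels in the same style as the results table
pair_labels <- apply(toy_pairs, 2, paste, collapse = " Vs ")
pair_labels

# Invented count matrix: ASVs as rows, samples as columns
toy_counts <- matrix(c(5, 15,
                       30, 10),
                     nrow = 2,
                     dimnames = list(c("asv_1", "asv_2"), c("sample1", "sample2")))

# sweep() divides each column (margin 2) by that column's total
toy_rel <- sweep(toy_counts, 2, colSums(toy_counts), "/")
toy_rel
colSums(toy_rel)  # each sample now sums to 1
```

The number of pairs grows quickly with the number of groups (n × (n − 1) / 2), so variables with many groups produce long results tables.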

Now output these results as a table.

In [None]:
tabstat_adonis <- t(tabstat_adonis)
tabstat_adonis <- data.frame(tabstat_adonis)
tabstat_adonis

You can export this table as a csv file:

In [None]:
# write.csv keeps the row names (the group pairs), which write_csv would drop
write.csv(tabstat_adonis, paste0("beta_div_pairwise_PERMANOVA_", group, "_", index, ".csv"))

***********************************

## 6. PCoA plots - continuous variables <a class="anchor" id="contana"></a>

**NOTE:** PERMANOVA scores aren't generated for continuous variables as PERMANOVA depends on groupwise comparisons. For statistics of continuous variables, it's recommended that you use the Alpha diversity correlation and GLM statistics.

To refresh your memory regarding which variables exist and which are categorical or continuous, have a look at the first few rows of the samples table:

In [None]:
head(samples_table)

Now choose the continuous variable you want to analyse.

Enter the column name of the continuous variable you want to analyse (i.e. replace `"Dummy3"` in the cell below with your chosen variable's column name). This has to be exactly the same as the column name, including capitalisation, characters such as underscores, etc:

In [None]:
group <- "Dummy3"
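The continuous colour scale used for this plot expects numeric values, so it's worth confirming that R has read your chosen column as a number rather than text. The sketch below uses an invented samples table. The `Depth_cm` column is a hypothetical example of a continuous variable.

```r
# Invented samples table with one categorical and one continuous variable
toy_samples <- data.frame(
  sample.id = c("a", "b", "c"),
  Site = c("S1", "S2", "S3"),
  Depth_cm = c(5, 10, 20)
)

# Which columns has R stored as numeric?
is_continuous <- vapply(toy_samples, is.numeric, logical(1))
names(is_continuous)[is_continuous]  # "Depth_cm"

# Check a single chosen variable
toy_group <- "Depth_cm"
is.numeric(toy_samples[[toy_group]])  # TRUE
```

A column that looks numeric but contains text entries (e.g. "unknown") will be read as character, and will return `FALSE` here.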

The code to generate the PCoA plot for a continuous variable is very similar to the previous categorical variable PCoA, with a few minor differences.

Again, choose the ordination method you want to use to estimate and plot beta diversity. Bray–Curtis dissimilarity is used by default (`"bray"`). Change this to `"cao"` for Cao index, or `"jaccard"` for Jaccard similarity index.

In [None]:
index <- "bray"

<font color="blue">**Optional** You can include a second categorical variable in the plot using different point shapes. The initial variable is differentiated by colour, which can be seen in the below plot code with the `sample_color_by =` parameter. You can add the second variable to the `sample_shape_by =` argument.</font>

<font color="blue">First we'll name the variable as an object. To see which variables you can choose from, view the samples table column names:</font>

In [None]:
colnames(samples_table)

<font color="blue">Add the name of one of these to the code cell below (e.g. `group2 <- "Animal"`). If you don't want to include a second variable, leave the code cell below as `group2 <- NULL`</font>

In [None]:
group2 <- NULL

Now create the PCoA plot.

In [None]:
p <- amp_ordinate(ampvisdata, type = "pcoa", filter_species = 0, transform = "none", distmeasure = index, sample_color_by = group, sample_shape_by = group2, sample_point_size = 4) + 
scale_color_viridis() + 
theme(text = element_text(size = 18))
p

You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. **These files can be found in your working directory.**

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0("beta_div_PCoA_", group, "_", index, ".tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0("beta_div_PCoA_", group, "_", index, ".pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

[Click here to go to the next section: 8. Community structure](./anacapa_8_CS.ipynb)