# 6. Community structure

**************************

## Section contents
[1. Overview](#overview)

[2. Setting up your analysis environment](#envana1)

[3. Preparing your data](#dataana1)

[4. Taxonomic abundance bar plots](#abbarana)

[5. Taxonomic abundance box and whisker plots](#abboxana)

[6. Choosing a categorical variable to analyse](#varana2)

[7. Community structure by variable](#commana)


****************************

## 1. Overview <a class="anchor" id="overview"></a>

This section of the analysis workflow examines the community structure - i.e. the proportion, variance and abundance of taxonomic groups found in each sample and treatment group.

**Amplicon sequence variants**

Amplicon sequence variants (ASVs) were referenced in the previous alpha diversity section but, as they are a key concept, they are discussed in more detail here. Taxonomic groups in this study are based on ASVs, inferred using the [**DADA2 software package**](https://www.nature.com/articles/nmeth.3869).

**DADA2** infers sample sequences exactly and resolves differences of as little as 1 nucleotide.

**ASV vs OTU**: Traditionally, Operational Taxonomic Units (OTUs) have been used in 16S amplicon studies. More recently ASVs have been adopted, due to their improved accuracy in identifying taxa, particularly at the genus and species level. In short, OTUs use a similarity clustering method to group sequences into taxa, whereas ASVs are inferred as exact sequences, with a statistical error model used to separate true biological variants from sequencing errors.

OTUs are typically clustered at a 97% similarity threshold, so sequences that differ by less than about 3% are merged, whereas the ASV method can distinguish even single base-pair differences. This enables a finer resolution of taxa, down to the genus and species level. Note that there is increasing 'fuzziness' toward the lower taxonomic levels, as the diversity within some taxa is greater than the diversity between them and other taxa (in other words, even with ASVs, not all taxa can be resolved to lower taxonomic levels, and this is highly dependent on the taxonomic group involved). See here for more details: https://www.zymoresearch.com/blogs/blog/microbiome-informatics-otu-vs-asv
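
To make the 97% versus single-nucleotide distinction concrete, here is a small sketch comparing two made-up sequences that differ at one position. The 250 bp length, the random sequence and the simple position-by-position identity measure are all illustrative assumptions; this is not how DADA2 or an OTU clustering tool actually compares reads.

```r
# Two hypothetical, pre-aligned 250 bp amplicons that differ at one position
set.seed(1)
bases <- c("A", "C", "G", "T")
seq_a <- sample(bases, 250, replace = TRUE)
seq_b <- seq_a
seq_b[100] <- setdiff(bases, seq_a[100])[1]  # swap in a different base

seq_identity <- mean(seq_a == seq_b)   # fraction of identical positions
same_otu <- seq_identity >= 0.97       # a 97% threshold would merge the two
same_asv <- identical(seq_a, seq_b)    # exact matching keeps them apart

cat(sprintf("identity = %.1f%%, same OTU: %s, same ASV: %s\n",
            100 * seq_identity, same_otu, same_asv))
```

The two sequences are 99.6% identical (249 of 250 positions), so a 97% clustering threshold treats them as one unit, while an exact-sequence approach keeps them as two variants.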

**Figures used in this section**

Taxonomic differences between groups are presented in several ways in this section, including:

 - Stacked bar plots - To examine proportional taxonomic diversity per sample.

 - Heatmaps - Raw read counts in the abundance matrix are normalised (transformed to percentages). For categorical variables only. Samples are aggregated by variable (e.g. generation) and the most abundant taxa compared.

 - Box and whisker plots - For categorical variables only. As with heatmaps, samples are aggregated by variable, and taxa are ordered by mean read abundance across all samples.
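
The normalisation and aggregation steps mentioned above can be sketched in base R. The taxa, samples, counts and the `generation` grouping are all made up for illustration, and ampvis2 does this work internally, so treat this only as a picture of the arithmetic.

```r
# Hypothetical raw read counts: taxa in rows, samples in columns
counts <- matrix(c(500, 300, 200,
                    10,  10,  80,
                    50,  50, 100,
                     4,   4,   2),
                 nrow = 3,
                 dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                 c("S1", "S2", "S3", "S4")))

# Normalise: convert each sample's counts to percentages of that sample's total
pct <- sweep(counts, 2, colSums(counts), "/") * 100

# Aggregate: mean percentage of each taxon within each (made-up) generation
generation <- c("G1", "G1", "G2", "G2")
group_means <- sapply(split(colnames(pct), generation),
                      function(s) rowMeans(pct[, s, drop = FALSE]))
print(round(group_means, 1))
```

Because every sample is scaled to 100% first, a sample with very few reads (S4 here) carries the same weight in the group mean as a deeply sequenced one (S1).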

*************************

## 2. Setting up your analysis environment <a class="anchor" id="envana1"></a>

Before you can begin your analysis you need to set up certain requirements, such as setting your working directory and installing/loading required R packages.

### Set your working directory

Your [working directory](https://r-coder.com/working-directory-r/) in R is a base directory where R looks for your anacapa data files (and outputs files to).

You will need to find the path of your anacapa results directory on the HPC and paste the location into the code cell below (setwd("~/Paste/Your/anacapa/Results/Directory/Path/Here")).

In [None]:
setwd("~/anacapa")

### Install the required R packages

R does most of its analysis using [functions](https://www.tutorialspoint.com/r/r_functions.htm). Some of these are built into base R, but many come as external [packages](https://r-pkgs.org/intro.html), which need to be installed and loaded into R.

Load any required packages that have previously been installed using the [library()](https://www.tutorialspoint.com/r/r_packages.htm) function:

In [None]:
library(tidyverse)
library(scales)

Other packages need to be installed first.

The key R analysis package used is ampvis2 (['Tools for visualising amplicon data'](https://madsalbertsen.github.io/ampvis2/)). Results from anacapa (ASV tables, taxonomy assignment, etc) are used as input for ampvis2 in this notebook. Install and load ampvis2 (this will take a few minutes, as multiple dependent packages are installed):

In [None]:
install.packages("remotes", verbose = F)
remotes::install_github("MadsAlbertsen/ampvis2", quiet = T)
library(ampvis2)
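
If you re-run this Notebook often, you may prefer to skip the installation when a package is already present. The helper below is a hypothetical convenience function (`install_if_missing` is not part of any package) built on base R's `requireNamespace()`; it is a sketch rather than a required step.

```r
# Hypothetical helper: run the installer only if the package cannot be loaded
install_if_missing <- function(pkg, installer = install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    installer(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing
ok <- install_if_missing("stats")
```

In this Notebook the equivalent calls would be `install_if_missing("remotes")` followed by `install_if_missing("ampvis2", function(pkg) remotes::install_github("MadsAlbertsen/ampvis2"))`.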

Define a set of colours for plotting. Some of these plots have many clusters and it's difficult to find enough contrasting colours to visually separate them. I've developed a set of 25 colours that I've found contrast well, which we can use in the plots for this (and other) sections.

In [None]:
c25 <- c(
  "dodgerblue2", "#E31A1C", # red
  "green4",
  "#6A3D9A", # purple
  "#FF7F00", # orange
  "black", "gold1",
  "skyblue2", "#FB9A99", # lt pink
  "palegreen2",
  "#CAB2D6", # lt purple
  "#FDBF6F", # lt orange
  "gray70", "khaki2",
  "maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "brown"
)
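
A manual colour scale needs at least as many colours as there are groups to fill, so it can be worth checking your palette before plotting. The sketch below uses only the first four `c25` colours and a made-up group count (`n_groups`) to show one way of validating the colour names and, if needed, interpolating extra colours with base R's `colorRampPalette()`.

```r
pal <- c("dodgerblue2", "#E31A1C", "green4", "#6A3D9A")  # first four c25 colours
n_groups <- 6  # hypothetical number of taxa in a plot

# col2rgb() errors on an unknown colour name, so use it as a validity check
is_valid <- vapply(pal, function(col) {
  !inherits(try(col2rgb(col), silent = TRUE), "try-error")
}, logical(1))

# If there are too few colours, interpolate extra ones between the existing colours
plot_cols <- if (length(pal) >= n_groups) pal else colorRampPalette(pal)(n_groups)
```

Interpolated colours contrast less well than hand-picked ones, so adding your own colours to `c25` is usually the better option when you have more than 25 groups.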

Set the default width and height for plots output on this Notebook. You can modify this as you prefer. Note that every plot in this Notebook is followed by code to output it as a file and this code defines width/height separately from the options below.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

****************************

## 3. Preparing your data <a class="anchor" id="dataana1"></a>

This ampvis2-based analysis requires, as input, an R object in ampvis2 format, which has a specific structure. At minimum this requires a samples table, an ASV count table and a taxonomy table, which are then combined into a single ampvis2 object.

First you'll need to provide an ID for your project. This must be the project ID you used in the filtration section. See the previous section for details.

In [None]:
project_id <- "rbcl"

### Import the samples table

In [None]:
samples_table <- read.csv("sample_table.csv", header = T)

Have a look at your samples table and variables (metadata). In the previous filtration section we didn't use this information, but when examining diversity indices, etc, the metadata is critical.

In [None]:
samples_table

### Import the ASV abundance table and taxonomy table

**IMPORTANT: this is the FILTERED data that you exported at the end of the previous filtration section. You must have run that section once (only once is needed) for this and following sections to work**

In [None]:
filtered_data <- read.csv(paste0(project_id, "_filtered_data.csv"))
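
If the import fails, the most likely causes are a different `project_id` or a working directory that doesn't contain the exported file. A quick check along these lines (a sketch using only base R) shows which file name R is looking for and where:

```r
project_id <- "rbcl"
infile <- paste0(project_id, "_filtered_data.csv")

if (file.exists(infile)) {
  message("Found ", infile)
} else {
  message(infile, " not found in ", getwd(),
          ": check project_id and re-run the filtration section")
}
```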

Have a look at the top few rows of your data. The first 'ASV' column should contain the ASV IDs, the next columns are the samples, followed by the taxonomy levels.

In [None]:
head(filtered_data)
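
It can also help to confirm that the sample columns in the abundance table line up with the sample IDs in your samples table before combining them. The sketch below uses two toy tables; the `SampleID` column name, the taxonomy column names in `tax_levels` and all values are assumptions for illustration, so substitute the names from your own tables.

```r
# Toy tables in the layout described above; all names and values are made up
toy_samples <- data.frame(SampleID = c("PS.F1", "PS.N6"),
                          Site = c("S1", "S2"))
toy_filtered <- data.frame(
  ASV = c("ASV_1", "ASV_2"),
  PS.F1 = c(120, 30),
  PS.N6 = c(0, 75),
  Phylum = c("Phylum_A", "Phylum_A"),
  Genus = c("Genus_A", "Genus_B")
)

# Assumed taxonomy column names; adjust to match your own table
tax_levels <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
sample_cols <- setdiff(colnames(toy_filtered), c("ASV", tax_levels))

# Samples present in one table but not the other (both should be empty)
not_in_metadata <- setdiff(sample_cols, toy_samples$SampleID)
not_in_counts <- setdiff(toy_samples$SampleID, sample_cols)
```

If either `not_in_metadata` or `not_in_counts` is non-empty, look for typos or renamed samples before moving on.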

### Create the ampvis2 database

Combine the samples data with the ASV table using amp_load(). This creates an ampvis2 database (an R object) that the ampvis2 functions can use.

In [None]:
ampvisdata <- amp_load(otutable = filtered_data,
              metadata = samples_table)

You can see a summary of your dataset by simply running the object name. This shows you the number of taxa identified, total/average/maximum/minimum number of reads per sample, etc. 

In [None]:
ampvisdata

************************************

## 4. Taxonomic abundance bar plots <a class="anchor" id="abbarana"></a>

This section generates stacked bar plots showing the proportion of taxa per sample, for each taxonomic level.

First, you can output the raw (absolute) ASV abundance data as a csv file. You can add this as supplemental data to a manuscript or use it in other plotting or statistical programs. If you want the normalised (relative) abundance data, change `normalise = F` to `normalise = T` in the code below.

In [None]:
amp_export_otutable(ampvisdata, filename = paste0("ASVtable"), normalise = F, sep = ",")

Choose the taxonomic level you want to plot (Choose from Phylum, Class, Order, Family, Genus, Species)

In [None]:
taxgroup <- "Order"

Generate the data for plotting. Only taxa with a mean relative abundance above 1% are kept.

In [None]:
# Run amp_boxplot to generate aggregated data for the chosen taxonomic level; only the data object is used, not the plot
# normalise = T converts read counts to relative abundance (%) so that the 1% cut-off below applies
x <- amp_boxplot(ampvisdata, tax_aggregate = taxgroup, tax_empty = "remove", normalise = T)
# Keep only taxa with > 1% mean relative abundance
# First calculate the mean abundance of each aggregated taxon across samples
x2 <- aggregate(x$data$Abundance, list(x$data$Display), mean)
# Select only taxa with mean abundance > 1% (change 1 to 0 to keep every taxon)
x3 <- x2[x2$x > 1,]
# Pull out just those taxa in x$data
x4 <- x$data[x$data$Display %in% x3$Group.1, ]
# Remove duplicate rows so that no sample/taxon combination is counted twice
x5 <- x4[!duplicated(x4), ]
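
The filtering logic (mean per taxon, then a cut-off) is easier to follow on a toy table. The sketch below mimics the `Display`, `Sample` and `Abundance` columns used above with made-up percentages and applies a 1% cut-off; it needs only base R.

```r
# Made-up relative abundances (%) for three orders in two samples
toy <- data.frame(
  Display   = rep(c("OrderA", "OrderB", "OrderC"), each = 2),
  Sample    = rep(c("S1", "S2"), times = 3),
  Abundance = c(60, 70, 39.5, 29.2, 0.5, 0.8)
)

# Mean abundance of each taxon across samples
taxon_means <- aggregate(Abundance ~ Display, data = toy, FUN = mean)

# Keep taxa whose mean is above the 1% cut-off, then subset the full table
keep <- as.character(taxon_means$Display[taxon_means$Abundance > 1])
toy_kept <- toy[toy$Display %in% keep, ]
```

OrderC averages 0.65% and is dropped, while all rows for OrderA and OrderB are retained for plotting.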

If you want your samples to be in a specific order on the plot (e.g. to cluster them by groups) then you need to re-order the levels, as you did in the previous sections. If you don't want to re-order your samples, skip the next three code cells and go straight to generating the stacked bar plot.

Check the current order of your samples:

In [None]:
levels(factor(x5$Sample))

Set the order of the samples how you like (all samples have to be included) then re-order them by running the below two code cells:

In [None]:
lev <- c("PS.F1", "PS.F4", "PS.F7", "PS.F11", "PS.F12", "PS.F21", "PS.F24", "PS.N6", "PS.N8", "PS.N10", "PS.N11", "PS.N12", "PS.N13", "PS.N19", "PS.N22", "PS.N23", "PS.N24", "PS.F16", "PS.F18", "PS.F5", "PS.F9", "PS.F14", "PS.F15", "PS.F17", "PS.F22")

In [None]:
x5$Sample <- factor(x5$Sample, levels = lev)
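
Here is a small base R sketch of why every sample has to appear in `lev`. The sample names are borrowed from the example above; the data are otherwise made up.

```r
samples <- c("PS.F4", "PS.F11", "PS.F1", "PS.F4")

# Default ordering is a character-by-character sort, so PS.F11 comes before PS.F4
default_order <- levels(factor(samples))

# Supplying all levels explicitly gives the order you want on the axis
lev <- c("PS.F1", "PS.F4", "PS.F11")
reordered <- factor(samples, levels = lev)

# Leaving a sample out of the levels silently turns it into NA
incomplete <- factor(samples, levels = c("PS.F1", "PS.F4"))

# A quick way to spot anything you forgot to list
missing_from_lev <- setdiff(unique(samples), c("PS.F1", "PS.F4"))
```

Running the same `setdiff()` check on your own data (`setdiff(unique(x5$Sample), lev)`) before applying the levels should return an empty result; anything it lists would otherwise become `NA` and be dropped from the plot or shown as a separate `NA` bar.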

Generate stacked bar plot. 

As with previous plots, you can modify the colours. `scale_fill_manual()` uses the `c25` vector of colours we defined in section 2.

You can change the `c25` colours in section 2 or add colours of your choice to the below code, e.g. `scale_fill_manual(values = c("red", "blue", "green"))`.

You can also change axis labels, text size and angle, [default theme](https://ggplot2.tidyverse.org/reference/ggtheme.html), etc.

Feel free to add additional modifications to this or any other plot in this Notebook. Here is a good guide for doing this: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html

In [None]:
p <- ggplot(x5, aes(x = Sample, y = Abundance, fill = Display))
p <- p + geom_bar(position="fill", stat="identity") +
  labs(y = "Abundance (%)") +
  theme_classic() +
  scale_y_continuous(label = label_percent()) + 
  scale_fill_manual(values = c25) + 
  theme(text = element_text(size = 18), axis.text.x = element_text(angle = 90, size=12), axis.text.y = element_text(size=14)) + 
  labs(fill = taxgroup)
p

To plot a different taxonomic level, change `taxgroup <- "Order"` above to another taxonomic level, then re-run the code from that point.

You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. **These files can be found in your working directory.**

**Tip:** you can adjust the width and height of the saved images by changing `width =` and `height =` in the code below (and every time ggsave appears in this workflow). Pdf files can be opened within Jupyter, so a good way to test a suitable width/height would be to save the image by running the pdf code below with the default 20cm width/height, then open the pdf file by clicking on it in the file browser panel (to the left of this notebook), then change the width/height and repeat this process as needed.

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0("community_tax_abund_bar_plot_", taxgroup, ".tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0("community_tax_abund_bar_plot_", taxgroup, ".pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
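
When choosing a width and height, it can help to know how many pixels a given size produces at 300 dpi (some journals state figure requirements in pixels). The conversion is centimetres divided by 2.54 (cm per inch), multiplied by the dpi. `cm_to_px` below is a hypothetical helper, not part of ggplot2.

```r
# Hypothetical helper: pixel size of an image dimension given in cm
cm_to_px <- function(cm, dpi = 300) round(cm / 2.54 * dpi)

width_px <- cm_to_px(20)   # the default 20 cm width used above
half_px  <- cm_to_px(10)
```

So the default 20 cm x 20 cm tiff is 2362 x 2362 pixels at 300 dpi.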

******************************

## 5. Taxonomic abundance box and whisker plots <a class="anchor" id="abboxana"></a>

This section contains box and whisker plots for each taxonomic level (phylum … species).

As with the bar plots in the previous section, these B&W plots show the relative abundance (in %) of taxa. But whereas heatmaps show the mean %, box and whisker plots show the interquartile range between samples ([max, min, median, quartiles](https://courses.lumenlearning.com/fscj-introstats1/chapter/box-plots/)), thus providing additional statistical information.
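
If you'd like to see where the numbers behind a box and whisker plot come from, base R can calculate them directly. The abundances below are made up, and the sketch assumes the usual 1.5 x IQR whisker rule, which is the default in both base R and ggplot2.

```r
# Hypothetical relative abundances (%) of one taxon in five samples
abund <- c(2, 4, 5, 7, 20)

five_num <- fivenum(abund)   # min, lower hinge, median, upper hinge, max
iqr <- IQR(abund)            # width of the box
bp <- boxplot.stats(abund)   # what a box plot would actually draw

bp$stats  # whisker ends, box edges and median
bp$out    # points beyond 1.5 x IQR from the box, drawn individually
```

Here the box spans 4 to 7 with a median of 5, the upper whisker stops at 7, and the value of 20 is drawn as a separate outlier point because it lies more than 1.5 x IQR above the box.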

Choose the taxonomic level you want to plot (Choose from Phylum, Class, Order, Family, Genus, Species)

In [None]:
taxgroup <- "Genus"

Generate the plot (and modify the attributes as you like)

Note the `tax_show = 10` attribute. This says to show the top 10 taxa. Change this as desired. 

In [None]:
# normalise = T plots relative abundance (%); change to F to plot raw read counts instead
p <- amp_boxplot(ampvisdata, tax_aggregate = taxgroup, tax_empty = "remove", tax_show = 10, normalise = T) +
theme_bw() +
theme(text = element_text(size = 18), axis.text.x = element_text(size=16), axis.text.y = element_text(size=14)) +
labs(x = taxgroup)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0("community_tax_abund_BW_plot_", taxgroup, ".tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0("community_tax_abund_BW_plot_", taxgroup, ".pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

***********************************

## 6. Choosing a categorical variable to analyse <a class="anchor" id="varana2"></a>

In your metadata you'll usually have multiple variables. These need to be analysed individually, by selecting the variable in this section, then running the remaining analysis sections on this chosen variable. You can then re-run the analysis on another variable by returning to this section, changing the variable name, then running again the remaining analysis sections.

**NOTE** This section is for choosing categorical variables only. See section 8 onward for analysis of continuous (i.e. numeric) variables.

You can view your variables as column names in your samples_table:

In [None]:
colnames(samples_table)

Enter the column name of the variable you want to analyse (i.e. change `"Site"` in the below cell to your chosen variable's column name). This has to be exactly the same as the column name, including capitalisation, characters such as underscores, etc.

In [None]:
group <- "Site"
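
Because the match is case-sensitive, a quick check like the one sketched below can catch a mistyped variable name before it causes a confusing error further down. The toy table and its `SampleID` column are made up; with your own data you would check against `colnames(samples_table)`.

```r
toy_samples <- data.frame(SampleID = c("PS.F1", "PS.N6"), Site = c("S1", "S2"))
toy_group <- "site"  # deliberately wrong capitalisation

# TRUE only if the name matches a column exactly
exact_match <- toy_group %in% colnames(toy_samples)

# Columns that would match if capitalisation were ignored
near_match <- colnames(toy_samples)[tolower(colnames(toy_samples)) == tolower(toy_group)]

if (!exact_match) {
  message("'", toy_group, "' is not a column name. Did you mean: ",
          paste(near_match, collapse = ", "), "?")
}
```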

### Ordering your variable

The plotting done in ampvis2 is done by the [ggplot2](https://ggplot2.tidyverse.org/) package. ggplot [factorises](https://www.datamentor.io/r-programming/factor/) variables and automatically orders them on the plot by alphabetical order.

This can cause your groups to be ordered incorrectly on the plot axes (e.g. a time series may not be plotted sequentially). 

You can manually set the order of your variable here. 

First have a look at how ggplot will order your variable.

In [None]:
levels(factor(ampvisdata$metadata[[group]]))

If these are in the order you want to see them on your plot axes, nothing needs to be done. If they are in the wrong order you need to order them manually by setting the [**levels**](https://www.datamentor.io/r-programming/factor/).

Choose how you want to order your groups here:

In [None]:
lev <- c("S1", "S2", "S3", "S4")

To order your variable you need to put **all** the variable levels into the `lev <- c(...)`. Make sure each level is in double quotes and separated by a comma.

Then run the following to apply the levels to your data:

In [None]:
ampvisdata$metadata[[group]] <- factor(ampvisdata$metadata[[group]], levels = lev)

**************************************

## 7. Community structure by variable <a class="anchor" id="commana"></a>

In this section you can generate heatmaps and box and whisker plots for the variable you selected above, and for each taxonomic level.

If you want to analyse another variable here, go back to the '6. Choosing a categorical variable to analyse' section, choose another variable and re-run the Notebook from that point.

Choose the taxonomic level you want to plot (Choose from Phylum, Class, Order, Family, Genus, Species). This applies to both the heatmaps and B&W plots below.

In [None]:
taxgroup <- "Family"

You can add a secondary variable here (e.g. `var2 <- "Phase"`), which will split the plots by that variable. If you don't want to examine a secondary variable, leave the below as `var2 <- NULL`. **You must run the below code cell regardless**

In [None]:
var2 <- "Animal"

### Heatmaps

This section contains heatmaps for each taxonomic level (phylum to species) for your selected variable. The number in each cell is the relative proportion (in %) of taxa per group.

Generate the plot. There are a variety of attributes you can modify:

Note the `tax_show = 5` attribute. This says to show the top 5 taxa. 

The `plot_values_size =` defines the size of the text in the heatmap cells.

The `tax_add = NULL` attribute adds an additional taxonomic group to the taxa names (e.g. changing this to `tax_add = "Phylum"` and plotting Genus will name the taxa as 'phylum:genus', but leaving it as `tax_add = NULL` will just give the genus name).

`showRemainingTaxa = T` will aggregate, in a single row, the remaining taxa not shown on this heatmap. Change to `F` if you don't want to see this.

`color_vector = NULL` uses the default colour range (orange -> blue). You can change this by providing your own colour range, e.g. `color_vector = c("red", "green")`. You can choose from the huge number of R colours here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

`facet_by = var2` splits the plot by the secondary variable you provided above.

In [None]:
p <- amp_heatmap(ampvisdata, tax_show = 5, plot_values_size = 5, tax_add = NULL, showRemainingTaxa = T, color_vector = NULL, facet_by = var2, group_by = group, tax_aggregate = taxgroup, tax_empty = "remove")
p

Add some additional plot modifications. Change or remove these as desired ([or add your own modifications](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html)).

In [None]:
p <- p +
theme_bw() +
theme(text = element_text(size = 18), axis.text.x = element_text(angle = 90, size=16), axis.text.y = element_text(size=14)) +
labs(y = taxgroup)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0("community_tax_abund_heatmap_", group, "_", taxgroup, ".tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0("community_tax_abund_heatmap_", group, "_", taxgroup, ".pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Box and whisker plots

This section contains B&W plots for each taxonomic level (phylum to species) for your selected variable. This is similar to the B&W plots you generated in section 5, but separates the boxes into groups per variable, thus enabling an examination of how the variable groups differ in terms of taxonomic abundance.

Generate the plot. There are a variety of attributes you can modify:

Note the `tax_show = 5` attribute. This says to show the top 5 taxa (**Tip:** to look at specific taxa you can provide a vector of taxa names, e.g. `tax_show = c("Muridae", "Dasyuridae")`).

The `tax_add = NULL` attribute adds an additional taxonomic group to the taxa names (e.g. changing this to `tax_add = "Phylum"` and plotting Genus will name the taxa as 'phylum:genus', but leaving it as `tax_add = NULL` will just give the genus name).

In [None]:
p <- amp_boxplot(ampvisdata, tax_show = 5, tax_add = NULL, tax_aggregate = taxgroup, tax_empty = "remove", group_by = group)
p

You can change additional properties of the plot here

In [None]:
p$mapping$fill <- as.name(".Group")
p <- p + theme_bw() +
scale_color_manual(values=c25) + scale_fill_manual(values=c25) +
labs(fill = group, x = taxgroup) +
theme(text = element_text(size = 18), axis.text.x = element_text(size=18), axis.text.y = element_text(size=16)) +
guides(color = "none")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0("community_tax_abund_BW_", group, "_", taxgroup, ".tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0("community_tax_abund_BW_", group, "_", taxgroup, ".pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**************************************

[Click here to go to the next section: 9. Differential abundance](./anacapa_9_DA.ipynb)