# 4. Filtering taxonomic assignments

****************************

## Section contents
[1. Overview](#overview1)

[2. Setting up your analysis environment](#envana1)

[3. Importing your data](#dataana1)

[4. Removing low abundance taxa](#lowana1)

[5. Removing contaminant taxa](#contana1)

[6. Filtering taxonomy assignments by confidence level](#filtana1)

[7. Examining filtered taxonomy assignments](#viewana1)

[8. Exporting filtered results](#expana1)


****************************

## 1. Overview <a class="anchor" id="overview1"></a>

In this section we will remove 1) low abundance taxa, 2) contaminant taxa and 3) low confidence taxonomy assignments.

**Low abundance taxa**

Most of the assigned taxa will be based on a very small number of matching reads. These are almost certainly false positives, caused by sequencing errors, very low quantities of contaminant DNA (from the air, water, etc.) and similar sources. We will filter out taxa that fall below a low read count threshold.

**Contaminant taxa**

There may be taxa in your dataset that you know are present due to contamination of your samples. One of the most common contaminants is human DNA, but there could be others. For example, if you were collecting kangaroo faeces (in order to identify their diet) from a cattle field, you would likely find cattle DNA in your samples. You would also find kangaroo DNA, which you would want to remove because it is not a 'diet' taxon.

**Taxonomy confidence**

By default, Anacapa assigns taxonomy to ASVs using a [Bayesian Lowest Common Ancestor method (BLCA)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1670-4). Each taxonomy assignment is given a confidence score that indicates the likely accuracy of the assignment. Every ASV is assigned to species level, but the confidence scores vary. Because of this, many of the assignments at lower taxonomic levels (family, genus, species) may have low confidence scores and may therefore be incorrect.

For example, you may have an ASV that has 86% confidence for the family level taxonomic assignment, 54% confidence at the genus level and 32% confidence at the species level. In this case you might say that you only have sufficient confidence in the family level assignment, but you think the genus and species level assignments are likely false positives.
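
The example above can be written as a short sketch. The scores and the cut-off of 85 are illustrative values only, and the names `example_conf` and `example_cutoff` are hypothetical (they are not used elsewhere in this workflow):

```r
# Hypothetical confidence scores (%) for a single ASV, taken from the example above
example_conf <- c(family = 86, genus = 54, species = 32)
# Illustrative cut-off; the real threshold is chosen later in this section
example_cutoff <- 85
# Taxonomic levels whose assignment would be trusted at this cut-off
example_kept <- names(example_conf)[example_conf >= example_cutoff]
example_kept
```

Only the family level passes here, so the genus and species assignments for this ASV would be discarded.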

***********************************

## 2. Setting up your analysis environment <a class="anchor" id="envana1"></a>

Before you can begin your analysis you need to set up certain requirements, such as setting your working directory and installing/loading the required R packages.

Choose your working directory (should just be called 'anacapa', if you followed the previous sections).

In [None]:
setwd("~/anacapa")

R can tell you what working directory you're currently in using the getwd() function:

In [None]:
getwd()

You can see what is in this directory by running the dir() function. Your ASV table (output from Anacapa) and your samples table should be here:

In [None]:
dir()

### Install the required R packages

R does most of its analysis using [functions](https://www.tutorialspoint.com/r/r_functions.htm). Some of these are built into base R, but many come as external [packages](https://r-pkgs.org/intro.html), which need to be installed and loaded into R.

Load any required packages that have previously been installed using the [library()](https://www.tutorialspoint.com/r/r_packages.htm) function:

In [None]:
library(tidyverse)
library(scales)
library(DT)
library(ggpubr)

Other packages need to be installed first.

The key R analysis package used is ampvis2 (['Tools for visualising amplicon data'](https://madsalbertsen.github.io/ampvis2/)). Results from Anacapa (ASV tables, taxonomy assignments, etc.) are used as input for ampvis2 in this notebook. Install and load ampvis2 (this will take a few minutes, as several dependent packages are installed with it):

In [None]:
install.packages("remotes", verbose = FALSE)
remotes::install_github("MadsAlbertsen/ampvis2", quiet = TRUE)
library(ampvis2)

Set the default width and height for plots output on this Notebook. You can modify this as you prefer. Note that every plot in this Notebook is followed by code to output it as a file and this code defines width/height separately from the options below.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

****************************

## 3. Importing your data <a class="anchor" id="dataana1"></a>

Anacapa outputs an ASV abundance table (read counts per sample per ASV) that also contains taxonomy assignments and taxonomy confidence scores in separate columns. So to filter your taxonomy assignments by a confidence threshold, you first need to import your ASV table, then separate out the taxonomy data (assignments and confidence), then finally filter the taxonomy assignments based on the confidence data.

### Import the samples table

First import your samples table. This is not needed for filtration, but is used later when comparing pre and post filtration results.

This samples table contains information on your samples and variables. We need to import this from a file to run our analysis on selected variables.

This needs to be a csv file named 'sample_table.csv'. As a minimum, it needs to have the sample IDs in the first column (these need to be the same as the column names in the ASV table you import below). Any additional columns can hold other variable information (i.e. metadata) that is available for your samples (e.g. location collected, sex, size, etc.). You can create, save and modify this file in Excel to get it into the correct structure before you start this analysis.

In [None]:
samples_table <- read.csv("sample_table.csv", header = T)

You can view your samples table:

In [None]:
samples_table

### Import the ASV abundance table

Now, provide an ID for your project. This is the 'xxxx' in your 'xxxx_ASV_taxonomy_brief.txt' file. See the previous section for details.

In [None]:
project_id <- "rbcl"

Import the abundance table into R:

In [None]:
asvtable <- read.table(paste0("./" , project_id, "_ASV_taxonomy_brief.txt"), check.names = FALSE, sep = "\t", stringsAsFactors = FALSE, comment.char = "", header = T)
colnames(asvtable)[1] <- "ASV"

<mark><font color="red">Pia, I'm also removing some repeated text from your sample names, as they appear on the ASV table (named after the fastq files). This is specific to your owl test dataset only and won't work on other datasets:</font></mark>
    
<mark><font color="red">Dataset1:</font></mark>

In [None]:
colnames(asvtable) <- gsub("X16Smam_Pia357NanoTest", "", colnames(asvtable))
# fixed = TRUE matches ".L001" literally (otherwise "." matches any character)
colnames(asvtable) <- gsub(".L001", "", colnames(asvtable), fixed = TRUE)

<mark><font color="red">
Dataset2:</font></mark>

In [None]:
colnames(asvtable) <- gsub("cytbvert_", "", colnames(asvtable))
# fixed = TRUE matches ".L001" literally (otherwise "." matches any character)
colnames(asvtable) <- gsub(".L001", "", colnames(asvtable), fixed = TRUE)

<mark><font color="red">
Plant data:</font></mark>

In [None]:
colnames(asvtable) <- gsub("rbcl_", "", colnames(asvtable))

Have a look at the top few rows of the ASV table, to see if it looks right. The ASV IDs should be in the first column and the sample IDs should be the column names. Apart from the `taxonomy`, `taxonomy_confidence` and `accessions` columns, all other columns should contain numbers (i.e. the number of times each ASV was found in each sample).

In [None]:
head(asvtable)
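
The checks described above can also be done programmatically. The following is a minimal sketch on toy data; `toy_samples` and `toy_asv` are hypothetical stand-ins for your samples table and ASV table, and their values are invented for illustration:

```r
# Toy stand-ins for the samples table and the ASV table (hypothetical values)
toy_samples <- data.frame(SampleID = c("S1", "S2", "S3"),
                          location = c("north", "north", "south"),
                          stringsAsFactors = FALSE)
toy_asv <- data.frame(ASV = c("asv_1", "asv_2"),
                      S1 = c(10, 0), S2 = c(3, 7), S3 = c(0, 1),
                      taxonomy = c("x", "y"),
                      stringsAsFactors = FALSE, check.names = FALSE)

# Sample IDs listed in the samples table but missing from the ASV table columns
toy_missing <- setdiff(toy_samples[[1]], colnames(toy_asv))
toy_missing

# The sample columns are expected to hold read counts, so they should be numeric
toy_numeric <- sapply(toy_asv[toy_samples[[1]]], is.numeric)
all(toy_numeric)
```

With your own data, the equivalent first check would be `setdiff(samples_table[[1]], colnames(asvtable))`, which should return an empty result if every sample ID has a matching column.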

### Create a taxonomy table

The taxonomy data (both taxonomic assignments and confidence scores) are already in the Anacapa ASV taxonomy file. We just need to put them in a separate object, in the correct format for ampvis2 (the primary R package we're using for downstream analysis). This allows each ASV to be assigned taxonomy by ampvis2.

In [None]:
# Import taxonomy table from ASV table as separate object
mytax <- data.frame(asvtable$taxonomy)
# Split taxa into separate columns. Taxa are separated by ";"
mytaxsplit <- tidyr::separate(data = mytax, col = asvtable.taxonomy, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = ";")

The taxonomy assignments from Anacapa are prefixed by the rank name (`phylum:`, `class:`, etc.), but ampvis2 expects taxonomy assignments to be prefixed by `p__`, `c__`, etc. So we need to convert the prefixes.

In [None]:
mytaxsplit$Kingdom <- gsub("superkingdom:", "k__", mytaxsplit$Kingdom)
mytaxsplit$Phylum <- gsub("phylum:", "p__", mytaxsplit$Phylum)
mytaxsplit$Class <- gsub("class:", "c__", mytaxsplit$Class)
mytaxsplit$Order <- gsub("order:", "o__", mytaxsplit$Order)
mytaxsplit$Family <- gsub("family:", "f__", mytaxsplit$Family)
mytaxsplit$Genus <- gsub("genus:", "g__", mytaxsplit$Genus)
mytaxsplit$Species <- gsub("species:", "s__", mytaxsplit$Species)
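
The same conversion can be expressed as a loop over a lookup list, which avoids repeating one line per rank. This is only a sketch on toy data; `toy_tax` and `toy_prefixes` are hypothetical names and only two ranks are shown:

```r
# Toy taxonomy table with Anacapa-style rank prefixes (hypothetical values)
toy_tax <- data.frame(Family = c("family:Bovidae", "family:Hominidae"),
                      Genus = c("genus:Bos", "genus:Homo"),
                      stringsAsFactors = FALSE)

# For each column: the prefix it currently carries and the prefix to use instead
toy_prefixes <- list(Family = c("family:", "f__"),
                     Genus  = c("genus:", "g__"))

for (rk in names(toy_prefixes)) {
  old_prefix <- toy_prefixes[[rk]][1]
  new_prefix <- toy_prefixes[[rk]][2]
  # fixed = TRUE treats the prefix as literal text rather than a regular expression
  toy_tax[[rk]] <- sub(old_prefix, new_prefix, toy_tax[[rk]], fixed = TRUE)
}
toy_tax
```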

Have a look at the first few rows of the taxonomy table. Each taxonomic level should be in its own column.

In [None]:
head(mytaxsplit)
# Also create copies of the taxonomy and ASV tables, so that pre- and post-filtration results can be compared later
mytaxsplit_prefilt <- mytaxsplit
asvtable_prefilt <- asvtable

************************************

## 4. Removing low abundance taxa <a class="anchor" id="lowana1"></a>

In any dataset, the majority of taxa assignments will have very low abundance and are likely to be false positives. This happens for a variety of reasons, such as the inherent (but very low) error rate in Illumina sequencing: for example, if 1 in 10,000 sequences has a single base pair error, this minority of sequences may match an incorrect taxon. There is also minor contamination from air, water, soil, etc. Even a minuscule amount of contaminant DNA will be present in the dataset though, again, usually as very low abundance taxa assignments.

In this section we will remove the very low abundance taxa. The optimal filtration threshold depends on factors specific to each dataset, such as the total number of sequences, overall diversity and so on. You can confidently remove any taxa with only one or two reads, but depending on your dataset you can increase this to a higher number. The default threshold we use here is 5, i.e. any ASV that has fewer than 5 reads in every sample will be filtered out (an ASV is kept if it reaches 5 reads in at least one sample).

Have a look at your abundance table to decide on a reasonable abundance threshold.

In [None]:
asvtable_prefilt

Select the minimum read count threshold you wish to filter by (or leave at the default of 5):

In [None]:
minreads <- 5

You can check how many taxa (ASVs) will be present before and after filtration by this threshold.

Total number of (prefiltered) taxa:

In [None]:
nrow(asvtable_prefilt)

Number of taxa present after filtration by the provided abundance threshold (default: 5):

In [None]:
asvtable_thresh <- subset(asvtable, select=-c(ASV, taxonomy, taxonomy_confidence, accessions))
asvtable_thresh_bool <- asvtable_thresh >= minreads
sum(rowSums(asvtable_thresh_bool) > 0)

Now we can do the actual filtration of taxa, based on the above abundance threshold:

In [None]:
# Rows (ASVs) that reach the threshold in at least one sample
keep_rows <- rowSums(asvtable_thresh_bool) > 0
asvtable <- asvtable[keep_rows, ]
# Taxonomy table must also be filtered by this threshold
mytaxsplit <- mytaxsplit[keep_rows, ]
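
To see exactly which rows this kind of filter keeps, here is a minimal sketch on a toy count table. The names `toy_counts`, `toy_minreads` and `toy_keep` are hypothetical and the counts are invented:

```r
# Toy read counts: three ASVs (rows) across three samples (columns)
toy_counts <- data.frame(S1 = c(120, 2, 0),
                         S2 = c(98, 1, 6),
                         S3 = c(0, 4, 0),
                         row.names = c("asv_1", "asv_2", "asv_3"))
toy_minreads <- 5

# TRUE for an ASV that reaches the threshold in at least one sample
toy_keep <- rowSums(toy_counts >= toy_minreads) > 0
toy_keep

# asv_2 never reaches 5 reads in any sample, so it is dropped
toy_counts[toy_keep, ]
```

Note that `asv_3` is kept even though it is absent from two samples, because it reaches the threshold in `S2`.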

You can now view the abundance-filtered table:

In [None]:
asvtable

************************************

## 5. Removing contaminant taxa <a class="anchor" id="contana1"></a>

In this section we will identify any contaminant genus or species in your dataset, then remove them. See the overview section for an explanation of contaminant taxa.

**If you don't find any contaminant taxa, you can skip this section**

### Finding contaminant taxa

Have a look through your taxonomy table for any contaminant species. Note: filtration in this section is based on genus or species level only:

In [None]:
DT::datatable(mytaxsplit, rownames = F)

Alternatively (if, for example, you have a very large number of taxa), you can search for taxa names using R.

Enter a genus or species name below (i.e. a possible contaminant species). You can change this as many times as you like to search for any and all potentially contaminant genus/species. 

In [None]:
is_contaminant <- "Homo sapiens"

Now search your taxa table for this name. The output is the number of ASVs that matched this taxon.

In [None]:
nrow(mytaxsplit[grepl(is_contaminant, mytaxsplit$Species),])

### Removing contaminant genus/species

Enter all of your contaminant genus/species names in the code cell below. Names must be separated by a vertical bar (`|`, which is usually the key just above the 'enter' key on your keyboard). E.g. `"Homo sapiens|Bos|Canis familiaris"`. This example would filter out humans, dogs and any cattle (*Bos* genus) species. Remember, R is case-sensitive.

In [None]:
contaminants <- "Coleura afra|Bos|Canis"

Now we will identify which rows of your taxa table contain these species names, and remove them. The same rows need to be removed from your ASV abundance table too.

In [None]:
asvtable <- asvtable[!grepl(contaminants, mytaxsplit$Species),]
mytaxsplit <- mytaxsplit[!grepl(contaminants, mytaxsplit$Species),]
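
The following sketch shows how the `|`-separated pattern behaves with `grepl()`. The species list is invented for illustration and the `toy_` names are hypothetical:

```r
# Toy species column (hypothetical values)
toy_species <- c("s__Homo sapiens", "s__Bos taurus", "s__Bos indicus",
                 "s__Canis lupus", "s__Tyto alba")

# "|" means "or": a row matches if it contains any one of the names
toy_contaminants <- "Homo sapiens|Bos|Canis familiaris"
toy_is_contaminant <- grepl(toy_contaminants, toy_species)

# Rows that would be kept after removing the contaminants
toy_species[!toy_is_contaminant]
```

Both *Bos* species are removed because the genus name alone matches, while `s__Canis lupus` is kept because the pattern only names `Canis familiaris`. A pattern matches anywhere in the text, so very short names may match more than you intend.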

If you like, you can have another look at your taxa table, to see that the contaminant taxa have been removed:

In [None]:
DT::datatable(mytaxsplit, rownames = F)

************************************

## 6. Filtering taxonomy assignments by confidence level <a class="anchor" id="filtana1"></a>

In this section you can examine the confidence levels for each taxonomic level, then filter out low confidence assignments. These filtered results can then be exported as a file, to be used in downstream analysis (i.e. the other sections of this report - alpha, beta diversity, etc). See the overview section for an explanation of confidence scores.

As with the taxonomy assignments, the taxonomy confidence scores are in a separate column in the original Anacapa ASV file.

First, extract the confidence scores. 

In [None]:
# Import from ASV table as separate object
myconf <- data.frame(asvtable$taxonomy_confidence)
# Taxa are separated by ";". Use this separator to put each taxa in its own column
myconfsplit <- tidyr::separate(data = myconf, col = asvtable.taxonomy_confidence, into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"), sep = ";")

The confidence scores have some text attached to them (taxonomy names). Remove this and convert to numeric data.

In [None]:
# Remove superfluous text
myconfsplit$Kingdom <- gsub("superkingdom:", "", myconfsplit$Kingdom)
myconfsplit$Phylum <- gsub("phylum:", "", myconfsplit$Phylum)
myconfsplit$Class <- gsub("class:", "", myconfsplit$Class)
myconfsplit$Order <- gsub("order:", "", myconfsplit$Order)
myconfsplit$Family <- gsub("family:", "", myconfsplit$Family)
myconfsplit$Genus <- gsub("genus:", "", myconfsplit$Genus)
myconfsplit$Species <- gsub("species:", "", myconfsplit$Species)
# Convert columns to numeric
myconfsplit <- sapply(myconfsplit, as.numeric)
# Need to convert any NAs to 0, else they won't be converted to boolean
myconfsplit[is.na(myconfsplit)] <- 0
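
To illustrate what the steps above do to a single row, here is a base R sketch that parses one confidence string. The string follows the `rank:score;rank:score` layout implied by the code above, but its values are invented and the `toy_` names are hypothetical:

```r
# One hypothetical taxonomy_confidence entry
toy_conf_string <- "superkingdom:100;phylum:100;class:100;order:98;family:86;genus:54;species:32"

# Split on ";" to get one "rank:score" piece per taxonomic level
toy_parts <- strsplit(toy_conf_string, ";", fixed = TRUE)[[1]]

# Drop everything up to and including the colon, then convert to numbers
toy_scores <- as.numeric(sub("^[^:]*:", "", toy_parts))
# Use the text before the colon as the name of each score
names(toy_scores) <- sub(":.*$", "", toy_parts)
toy_scores
```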

Now you should just see the confidence scores for each ASV (rows) and each taxonomic level (columns). The `head()` function simply displays the first 6 rows.

In [None]:
head(myconfsplit)

Now we can filter by a confidence level of your choice.

Select a confidence level by changing the number in the code cell below (85 = any assignment below 85% confidence will be removed).

The confidence level you choose depends on the nature of your dataset. If you set the level too high, you risk removing accurate taxonomy assignments; if you set it too low, you risk including false positives. You can experiment with different levels. For example, if you've set your level at 85 and you are seeing species that you know cannot be in your sample (e.g. from a different country), then you can raise the level.

In [None]:
conf <- 85

Filter your data using this confidence level.

In [None]:
# Assignments with a score below the chosen level are set to NA
mytaxsplit[myconfsplit < conf] <- NA
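
The masking step works because the taxonomy table and the confidence matrix have the same shape, so a logical matrix can pick out individual cells. A minimal sketch on toy data follows; the `mask_` names and all values are hypothetical, and a strict `<` comparison is used here so that a score exactly at the cut-off is kept:

```r
# Toy taxonomy table: two ASVs, three taxonomic levels
mask_tax <- data.frame(Family = c("f__Bovidae", "f__Tytonidae"),
                       Genus = c("g__Bos", "g__Tyto"),
                       Species = c("s__Bos taurus", "s__Tyto alba"),
                       stringsAsFactors = FALSE)

# Matching confidence scores (the matrix is filled column by column)
mask_conf <- matrix(c(99, 86,    # Family
                      95, 54,    # Genus
                      90, 32),   # Species
                    nrow = 2,
                    dimnames = list(NULL, c("Family", "Genus", "Species")))
mask_cutoff <- 85

# Cells whose score is below the cut-off are replaced by NA
mask_tax[mask_conf < mask_cutoff] <- NA
mask_tax
```

The second ASV keeps its family assignment (score 86) but loses its genus and species assignments.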

Any taxa that were below the confidence threshold will now be replaced by 'NA', which means they will be excluded from downstream analysis. You can view your entire filtered taxonomy table like so: 

In [None]:
DT::datatable(mytaxsplit, rownames = F,
              width = "100%",
              extensions = 'Buttons',
              options = list(scrollX = TRUE,
                             dom = 'Bfrtip',
                             columnDefs = list(list(className = 'dt-center', targets="_all")),
                             buttons =
          list('copy', 'print', list(
            extend = 'collection',
            buttons = list(
                list(extend = 'csv', filename = "tax_table_filtered"),
                list(extend = 'excel', filename = "tax_table_filtered"),
                list(extend = 'pdf', filename = "tax_table_filtered")),
            text = 'Download'
          ))
      )
    ) 

************************************

## 7. Examining filtered taxonomy assignments <a class="anchor" id="viewana1"></a>

In this section we can quantify how many taxa passed filtration to each taxonomic level. You can use this section to validate your previous filtration choices. For example, if you see taxa that are still present that you know are contaminants, or you see legitimate taxa in your unfiltered table but removed in your filtered table, then you may want to re-run the filtration steps using different parameters. 

First, combine the ASV table, taxonomy table and samples table into an ampvis2 object, for both pre-filtered and post-filtered data. Ampvis2 objects allow an examination of various statistics of the whole dataset.

Combine pre-filtered data:

In [None]:
# Re-combine the ASV and taxonomy data
asv_table_prefilt <- cbind(asvtable_prefilt, mytaxsplit_prefilt)
# Remove the taxonomy, taxonomy_confidence and accessions columns, as they are not needed
asv_table_prefilt <- subset(asv_table_prefilt, select=-c(taxonomy, taxonomy_confidence, accessions))
# Combine as an ampvis2 object, using the `amp_load` function
ampvisdata_prefilt <- amp_load(otutable = asv_table_prefilt, metadata = samples_table)

Combine post-filtered data:

In [None]:
asv_table <- cbind(asvtable, mytaxsplit)
asv_table <- subset(asv_table, select=-c(taxonomy, taxonomy_confidence, accessions))
ampvisdata <- amp_load(otutable = asv_table, metadata = samples_table)

Now we can simply type the name of the ampvis2 objects to see a variety of information about the datasets.

Let's look at the pre-filtered data first:

In [None]:
ampvisdata_prefilt

Then the post-filtered results:

In [None]:
ampvisdata

These show an overview of the number of ASVs (called 'OTUs' here) identified. Anacapa annotates every ASV to species level, so if you have 200 ASVs here then you have 200 species-level assignments. However, you will often see the same species assigned to multiple ASVs. This is due to variations in the genomic databases for that species.

Because of this, the 200 species-level assignments may represent far fewer actual species. We can see how many actual taxa there are, per taxonomic level, in the following subsections.
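
As a quick illustration of the difference between the number of ASVs and the number of distinct species, consider this sketch on an invented species column (the `toy_` names are hypothetical):

```r
# Toy species assignments for five ASVs; NA marks an assignment removed by filtering
toy_species_per_asv <- c("s__Tyto alba", "s__Tyto alba", "s__Mus musculus",
                         NA, "s__Tyto alba")

# Number of ASVs
toy_n_asvs <- length(toy_species_per_asv)
toy_n_asvs

# Number of distinct species, ignoring the filtered (NA) assignments
toy_n_species <- length(unique(na.omit(toy_species_per_asv)))
toy_n_species
```

With your own data, the same idea could be applied to a column such as `mytaxsplit$Species`.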

### Phylum

First we can see how many phyla were identified (this will probably be a single phylum, so the results won't look very interesting. Lower taxonomic levels will be more detailed).

Pre-filtered first:

In [None]:
gsub("p__", "", unique(mytaxsplit_prefilt$Phylum))

Number of ASVs that fall under each phylum, pre-filtration

In [None]:
table(mytaxsplit_prefilt$Phylum)

Then post-filtered:

In [None]:
gsub("p__", "", unique(mytaxsplit$Phylum))

Number of ASVs that fall under each phylum, post-filtration

In [None]:
table(mytaxsplit$Phylum)

Now we can plot the pre- and post-filtration phyla, to see both how many phyla there were/are (pre- and post-filtration) and the number of ASVs that matched each phylum.

In [None]:
p <- ggplot(as.data.frame(table(mytaxsplit_prefilt$Phylum)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p <- p + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Pre-filtered") + theme(legend.position="none", text = element_text(size = 18))
p2 <- ggplot(as.data.frame(table(mytaxsplit$Phylum)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p2 <- p2 + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Post-filtered") + theme(legend.position="none", text = element_text(size = 18))

In [None]:
p3 <- ggarrange(p, p2)
p3

### Exporting your plot as a file

You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. **These files can be found in your working directory.**

**Tip:** you can adjust the width and height of the saved images by changing `width =` and `height =` in the code below (and every time ggsave appears in this workflow). Pdf files can be opened within Jupyter, so a good way to test a suitable width/height would be to save the image by running the pdf code below with the default 20cm width/height, then open the pdf file by clicking on it in the file browser panel (to the left of this notebook), then change the width/height and repeat this process as needed.

Export as a 300dpi TIFF

In [None]:
tiff_exp <- paste0(project_id, "_pre_vs_postfiltration_phylum.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p3, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(project_id, "_pre_vs_postfiltration_phylum.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p3, width = 20, height = 20, units = "cm")

You can now find these files in your working directory (which you originally defined in the 'Setting up your analysis environment' section).

Tip: To see what your working directory is, use the getwd() command. This will be where you output the above images to.

In [None]:
getwd()

Now we can do the same for every other taxonomic level:
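
Since the same comparison is repeated for every level, it can help to see the counts side by side in one table. The helper below is a hypothetical sketch (it is not part of ampvis2 or of this workflow) that uses only base R and is demonstrated on invented data:

```r
# Hypothetical helper: number of ASVs per taxon before and after filtration
compare_rank <- function(pre, post, rank_name) {
  pre_counts <- as.data.frame(table(pre[[rank_name]]), stringsAsFactors = FALSE)
  post_counts <- as.data.frame(table(post[[rank_name]]), stringsAsFactors = FALSE)
  names(pre_counts) <- c("taxon", "pre")
  names(post_counts) <- c("taxon", "post")
  # Keep every pre-filtration taxon, even if it disappeared after filtration
  out <- merge(pre_counts, post_counts, by = "taxon", all.x = TRUE)
  out$post[is.na(out$post)] <- 0
  out
}

# Toy pre- and post-filtration taxonomy tables (hypothetical values)
toy_pre <- data.frame(Class = c("c__Mammalia", "c__Mammalia", "c__Aves", "c__Insecta"),
                      stringsAsFactors = FALSE)
toy_post <- data.frame(Class = c("c__Mammalia", NA, "c__Aves"),
                       stringsAsFactors = FALSE)

toy_summary <- compare_rank(toy_pre, toy_post, "Class")
toy_summary
```

With your own data this might be called as `compare_rank(mytaxsplit_prefilt, mytaxsplit, "Class")`, changing the last argument for each taxonomic level.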

### Class

Unique classes, pre-filtration.

In [None]:
gsub("c__", "", unique(mytaxsplit_prefilt$Class))

Number of ASVs that fall under each class, pre-filtration

In [None]:
table(mytaxsplit_prefilt$Class)

Unique classes, post-filtration.

In [None]:
gsub("c__", "", unique(mytaxsplit$Class))

Number of ASVs that fall under each class, post-filtration

In [None]:
table(mytaxsplit$Class)

Plot of pre-filtration vs post-filtration classes

In [None]:
p <- ggplot(as.data.frame(table(mytaxsplit_prefilt$Class)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p <- p + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Pre-filtered") + theme(legend.position="none", text = element_text(size = 18))
p2 <- ggplot(as.data.frame(table(mytaxsplit$Class)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p2 <- p2 + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Post-filtered") + theme(legend.position="none", text = element_text(size = 18))
p3 <- ggarrange(p, p2)
p3

Export as a 300dpi TIFF

In [None]:
tiff_exp <- paste0(project_id, "_pre_vs_postfiltration_class.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p3, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(project_id, "_pre_vs_postfiltration_class.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p3, width = 20, height = 20, units = "cm")

### Order

Unique orders, pre-filtration.

In [None]:
gsub("o__", "", unique(mytaxsplit_prefilt$Order))

Number of ASVs that fall under each order, pre-filtration

In [None]:
table(mytaxsplit_prefilt$Order)

Unique orders, post-filtration.

In [None]:
gsub("o__", "", unique(mytaxsplit$Order))

Number of ASVs that fall under each order, post-filtration

In [None]:
table(mytaxsplit$Order)

Plot of pre-filtration vs post-filtration orders

In [None]:
p <- ggplot(as.data.frame(table(mytaxsplit_prefilt$Order)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p <- p + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Pre-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 10))
p2 <- ggplot(as.data.frame(table(mytaxsplit$Order)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p2 <- p2 + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Post-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 12))
p3 <- ggarrange(p, p2)
p3

Export as a 300dpi TIFF

In [None]:
tiff_exp <- paste0(project_id, "_pre_vs_postfiltration_order.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p3, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(project_id, "_pre_vs_postfiltration_order.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p3, width = 20, height = 20, units = "cm")

### Family

Unique families, pre-filtration.

In [None]:
gsub("f__", "", unique(mytaxsplit_prefilt$Family))

Number of ASVs that fall under each family, pre-filtration

In [None]:
table(mytaxsplit_prefilt$Family)

Unique families, post-filtration.

In [None]:
gsub("f__", "", unique(mytaxsplit$Family))

Number of ASVs that fall under each family, post-filtration

In [None]:
table(mytaxsplit$Family)

Plot of pre-filtration vs post-filtration families

In [None]:
p <- ggplot(as.data.frame(table(mytaxsplit_prefilt$Family)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p <- p + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Pre-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 6))
p2 <- ggplot(as.data.frame(table(mytaxsplit$Family)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p2 <- p2 + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Post-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 12))
p3 <- ggarrange(p, p2)
p3

Export as a 300dpi TIFF

In [None]:
tiff_exp <- paste0(project_id, "_pre_vs_postfiltration_family.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p3, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(project_id, "_pre_vs_postfiltration_family.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p3, width = 20, height = 20, units = "cm")

### Genus

Unique genera, pre-filtration.

In [None]:
gsub("g__", "", unique(mytaxsplit_prefilt$Genus))

Number of ASVs that fall under each genus, pre-filtration

In [None]:
table(mytaxsplit_prefilt$Genus)

Unique genera, post-filtration.

In [None]:
gsub("g__", "", unique(mytaxsplit$Genus))

Number of ASVs that fall under each genus, post-filtration

In [None]:
table(mytaxsplit$Genus)

Plot of pre-filtration vs post-filtration genera

In [None]:
p <- ggplot(as.data.frame(table(mytaxsplit_prefilt$Genus)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p <- p + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Pre-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 6))
p2 <- ggplot(as.data.frame(table(mytaxsplit$Genus)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p2 <- p2 + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Post-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 12))
p3 <- ggarrange(p, p2)
p3

Export as a 300dpi TIFF

In [None]:
tiff_exp <- paste0(project_id, "_pre_vs_postfiltration_genus.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p3, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(project_id, "_pre_vs_postfiltration_genus.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p3, width = 20, height = 20, units = "cm")

### Species

Unique species, pre-filtration.

In [None]:
gsub("s__", "", unique(mytaxsplit_prefilt$Species))

Number of ASVs that fall under each species, pre-filtration

In [None]:
table(mytaxsplit_prefilt$Species)

Unique species, post-filtration.

In [None]:
gsub("s__", "", unique(mytaxsplit$Species))

Number of ASVs that fall under each species, post-filtration

In [None]:
table(mytaxsplit$Species)

Plot of pre-filtration vs post-filtration species

In [None]:
p <- ggplot(as.data.frame(table(mytaxsplit_prefilt$Species)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p <- p + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Pre-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 6))
p2 <- ggplot(as.data.frame(table(mytaxsplit$Species)), aes(Var1, Freq, fill = Var1)) + 
  geom_bar(stat="identity")
p2 <- p2 + theme_bw() + ylab("Number of taxa") + xlab("Taxa name") + ggtitle("Post-filtered") + theme(legend.position="none", text = element_text(size = 18), axis.text.x = element_text(angle = 90, size = 12))
p3 <- ggarrange(p, p2)
p3

Export as a 300dpi TIFF

In [None]:
tiff_exp <- paste0(project_id, "_pre_vs_postfiltration_species.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p3, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(project_id, "_pre_vs_postfiltration_species.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p3, width = 20, height = 20, units = "cm")

******************************

## 8. Exporting filtered results <a class="anchor" id="expana1"></a>

In this section we'll save the filtered data as files, which can be imported into the other sections of this workflow. This means filtration only has to be completed once. The filtered data files could also be used as supplemental material in manuscripts.

Export the filtered abundance table and taxonomy table (to your working directory):

In [None]:
write.csv(asv_table, paste0(project_id, "_filtered_data.csv"), row.names = F)
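
If you want to confirm that an exported file can be read back correctly, a round trip like the one sketched below can be used. It writes an invented table to a temporary file, so it does not touch your real output; all `toy_` names are hypothetical:

```r
# Toy filtered table: counts plus one taxonomy column (NA = filtered assignment)
toy_export <- data.frame(ASV = c("asv_1", "asv_2"),
                         S1 = c(120, 0),
                         S2 = c(98, 6),
                         Species = c("s__Tyto alba", NA),
                         stringsAsFactors = FALSE)

# Write to a temporary file, then read it back in
toy_file <- tempfile(fileext = ".csv")
write.csv(toy_export, toy_file, row.names = FALSE)
toy_reimport <- read.csv(toy_file, stringsAsFactors = FALSE)

# Same number of rows and columns, and the filtered assignment is still NA
dim(toy_reimport)
is.na(toy_reimport$Species)
```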

[Click here to go to the next section: 6. Alpha diversity](./anacapa_6_AD.ipynb)