# Workflow sections
[1. How to use this notebook](#overview)

[2. anacapa pipeline](#pipelineana)

[3. Copying your data to Jupyter](#dataana)

[4. Data overview](#spana)

[5. Filtering taxonomic assignments](./anacapa_5_filtration.ipynb)

[6. Alpha diversity](./anacapa_6_AD.ipynb)

[7. Beta diversity](./anacapa_7_BD.ipynb)

[8. Community structure](./anacapa_8_CS.ipynb)

[9. Differential abundance](./anacapa_9_DA.ipynb)

******************************

# 1. How to use this notebook <a class="anchor" id="overview"></a>

This is a [Jupyter notebook](https://jupyter.org/) for analysing 16S amplicon sequence data that has been processed by the [anacapa](https://ucedna.com/software) pipeline.

Technically, however, as long as you have a taxa abundance table and a samples table, you can run this workflow (with some minor modifications).

Most sections in this notebook can be opened as separate Jupyter notebooks (by clicking on the link in the table of contents) and are designed to be run completely independently.

Jupyter notebooks are interactive documents that contain 'live code', which allows the user to complete an analysis by running code 'cells', which can be modified, updated or added to by the user.

Individual Jupyter notebooks are based on a specific 'kernel', or analysis environment (usually a programming language). This particular notebook is based on R. To see which version of R this notebook is based on, and as an example of running a code cell, click on the cell below and press the 'Run' button (top of the page).


In [None]:
R.Version()

You should now see details about the version of R installed for this notebook. Every other code cell can be run similarly. There are two main types of code cells, plain R code (as seen above) and [markdown](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf). 

Markdown is a simple language for formatting text. The instructions (including this cell), headings, etc. are written in markdown cells. You can add additional cells (markdown or R) by clicking on the plus sign and then selecting 'Markdown' or 'Code' in the dropdown box. This way you can add your own analysis code cells or your own notes in markdown.

***********************************

# 2. anacapa pipeline <a class="anchor" id="pipelineana"></a>

This Jupyter eDNA analysis workflow has been designed to process data generated by the anacapa pipeline (though it can be readily modified to use output from other upstream eDNA analysis pipelines). Output data generated by anacapa can be directly used by this Jupyter workflow. Anacapa overview:

> Anacapa is an eDNA toolkit that allows users to build comprehensive reference databases and assign taxonomy to raw multilocus metabarcode sequence data. It address longstanding needs of the eDNA for modular informatics tools, comprehensive and customizable reference databases, flexibility across high-throughput sequencing platforms, fast multilocus metabarcode processing, and accurate taxonomic assignment. Anacapa toolkit processes eDNA reads and assigns taxonomy using existing software or modifications to existing software. This modular toolkit is designed to analyze multiple samples and metabarcodes simultaneously from any Illumina sequencing platform. A significant advantage of the Anacapa toolkit is that it does not require that paired reads overlap, or that both reads in a pair pass QC. Taxonomy results are generated for all read types and the user can decide which read types they wish to retain for downstream analysis.

See the anacapa GitHub repository for details: https://github.com/limey-bean/Anacapa

Anacapa includes four modules:

1. Building reference libraries using CRUX.
2. Running quality control (QC) and assigning Amplicon Sequence Variants (ASVs) using DADA2 (Sequence QC and ASV Parsing).
3. Assigning taxonomy using Bowtie 2 and a Bowtie 2-specific Bayesian Least Common Ancestor (BLCA) (Assignment).
4. Running exploratory data analysis and generating ecological diversity summary statistics for a set of samples using ranacapa.

### Sequence QC and ASV Parsing using dada2

![](https://raw.githubusercontent.com/limey-bean/Anacapa/New-Master/dada2_QC_flow.png)

### Taxonomic Assignment using Bowtie 2 and BLCA

![](https://raw.githubusercontent.com/limey-bean/Anacapa/New-Master/Anacapa_class_flow.png)

These Jupyter notebooks are for downstream analysis of eDNA data, based on a table of read/sequence counts per taxon, per sample. Thus, **to be able to run these Jupyter notebooks, you must have already run your sequence data through an upstream eDNA data processing pipeline, such as anacapa**.

**CITING anacapa**

>Curd, Emily E., Zack Gold, Gaurav S. Kandlikar, Jesse Gomer, Max Ogden, Taylor O'Connell, Lenore Pipes et al. "Anacapa Toolkit: an environmental DNA toolkit for processing multilocus metabarcode datasets." Methods in Ecology and Evolution 10, no. 9 (2019): 1469-1475.

## R analysis

This Jupyter notebook, and the eDNA analysis therein, is based on R. Run the following code cell to see the version of R installed on this notebook:

In [None]:
R.Version()$version.string

The key R analysis package used in this Jupyter notebook is **[ampvis2](https://github.com/KasperSkytte/ampvis2)** ('Tools for visualising amplicon data'). Output from the anacapa pipeline is used as input for ampvis2. Alpha diversity, beta diversity and community structure are mostly analysed using the ampvis2 package. Details of this analysis are in those sections of this analysis workflow.

The final section of the workflow examines differential abundance of taxa between sample groups. The main R package used in this analysis is **[ANCOM-BC](https://bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC.html)**.
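
Before running the later sections, it can help to confirm that the packages named in this workflow are available in the notebook's R environment. The cell below is a minimal sketch, not part of the original workflow: it assumes the R package names are `ampvis2`, `ANCOMBC` and `DT`, and the variable names `required_pkgs` and `pkg_available` are illustrative only. It only reports availability and does not install anything.

```r
# Packages named in this workflow (package names are assumed from the text)
required_pkgs <- c("ampvis2", "ANCOMBC", "DT")

# requireNamespace() returns TRUE/FALSE without attaching the package
pkg_available <- vapply(
  required_pkgs,
  function(pkg) requireNamespace(pkg, quietly = TRUE),
  logical(1)
)

print(pkg_available)

# List any packages that still need to be installed
missing_pkgs <- names(pkg_available)[!pkg_available]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

If a package is reported as not installed, follow the installation instructions on that package's own page.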

********************************************

# 3. Copying your data to Jupyter <a class="anchor" id="dataana"></a>

## What data do you need?

You need three pieces of information to run this eDNA analysis workflow.

1. An amplicon abundance table, with samples as columns and taxa (ASVs) as rows
2. A taxonomy table, where each ASV is linked to a taxonomic group (Kingdom ... Species) 
3. A samples table

As long as you have these three pieces of information, you can run this Jupyter eDNA analysis workflow. Most upstream eDNA pipelines will generate this information, but this workflow has been specifically designed to use the output produced by the [anacapa pipeline](https://ucedna.com/software). If you have generated your data using a different upstream eDNA pipeline, this Jupyter workflow will need to be modified slightly to use it. Contact Paul Whatmore - paul.whatmore@qut.edu.au (the author of this workflow) if you need this or any other support.

**Anacapa produces an output file that combines both (1) the amplicon abundance table and (2) the taxonomy table, so you only need to upload this file and a samples table to Jupyter.**

Your anacapa output file will be in a subdirectory of your main anacapa directory, called 'xxxx_anacapa_output/xxxx/xxxx_taxonomy_tables/', where xxxx is the primer pair name you provided to anacapa (e.g. '16Smam'). The file itself is called 'xxxx_ASV_taxonomy_brief.txt', where xxxx is again the primer pair name.

You will also need a samples table. This needs to be in a specific format and structure: a column called 'sample-id' and one column for each variable you have (such as location, collection date, etc). An example of the required structure is shown below. Create this in Excel and save it as a CSV file named 'sample_table.csv' (`File` -> `Save as` -> drop down to select `CSV (Comma delimited) (*.csv)`).

| sample-id  | Species | Dummy1 | Dummy2 | Dummy3 |
|------------|---------|--------|--------|--------|
| Samp10.S10 | Owl     | AB     | Big    | 34     |
| Sampl1.S1  | Owl     | AB     | Big    | 23     |
| Sampl2.S2  | Owl     | AB     | Big    | 12     |
| Sampl3.S3  | Owl     | CD     | Big    | 45     |
| Sampl4.S4  | Owl     | EF     | Big    | 23     |
| Sampl5.S5  | Owl     | CD     | Small  | 54     |
| Sampl6.S6  | Owl     | CD     | Small  | 32     |
| Sampl7.S7  | Owl     | CD     | Small  | 28     |
| Sampl8.S8  | Owl     | EF     | Small  | 52     |
| Sampl9.S9  | Owl     | EF     | Small  | 19     |
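
If you prefer not to use Excel, a samples table with the same structure can also be written from R. The following is a minimal sketch using a few rows of the example table above; the object name `samples_example` is illustrative, and the file is written to a temporary folder here so that nothing in your project is overwritten.

```r
# Build a small samples table; check.names = FALSE keeps the hyphen in 'sample-id'
samples_example <- data.frame(
  "sample-id" = c("Sampl1.S1", "Sampl2.S2", "Sampl3.S3"),
  Species = c("Owl", "Owl", "Owl"),
  Dummy1 = c("AB", "AB", "CD"),
  Dummy2 = c("Big", "Big", "Big"),
  Dummy3 = c(23, 12, 45),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Written to a temporary folder for this demonstration only
out_file <- file.path(tempdir(), "sample_table.csv")
write.csv(samples_example, out_file, row.names = FALSE)

# Show the header line and the first data row of the CSV file
readLines(out_file, n = 2)
```

Setting `row.names = FALSE` avoids an extra unnamed first column in the CSV file.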

## Uploading your data

First, create a folder called 'anacapa' in the left-hand file panel in Jupyter: right-click -> 'New folder', then rename it to 'anacapa' (all lower case, as R is case-sensitive). All your data files need to be in this folder, and any files or figures created by this workflow will be output here. Open this folder.

You can now drag and drop your two data files ('xxxx_ASV_taxonomy_brief.txt' and 'sample_table.csv') into this anacapa folder.
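
To confirm that both files arrived with the expected names, you could run a check like the one below. This is a sketch only: `check_uploads()` is a hypothetical helper (not part of this workflow), and the file names it looks for are taken from the import cells in section 4. The demonstration uses a temporary folder containing only a samples table, so the ASV file is reported as missing.

```r
# Hypothetical helper: report which of the two expected files exist in a folder
check_uploads <- function(data_dir, project_id) {
  expected <- c(
    paste0(project_id, "_ASV_taxonomy_brief.txt"),
    "sample_table.csv"
  )
  found <- file.exists(file.path(data_dir, expected))
  names(found) <- expected
  found
}

# Demonstration in a temporary folder that holds only a samples table
demo_dir <- file.path(tempdir(), "anacapa_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "sample_table.csv"))

upload_status <- check_uploads(demo_dir, "16Smam")
print(upload_status)
```

For your own data, you would call `check_uploads("~/anacapa", "xxxx")`, replacing xxxx with your primer pair name.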

*************************************

# 4. Data overview <a class="anchor" id="spana"></a>

As a basic example of how to examine your data in Jupyter, in this section we'll import your data file and display it as a table. This is also to check if you have the correct data files uploaded and they have the correct structure.

## Amplicon abundance table

The following table shows the number of amplicons that mapped to a taxonomic group for each sample. Taxonomic groups are identified as ASVs ([Amplicon Sequence Variants](https://www.nature.com/articles/ismej2017119)). This amplicon abundance table (absolute abundance) provides the foundation for the analysis in this workflow. Quality filtering was done by DADA2, which performs several sequence quality control steps: quality filtering, denoising, read pair merging and PCR chimera removal.

These steps are outlined in detail in the DADA2 article (https://www.nature.com/articles/nmeth.3869):

>Callahan, Benjamin J., et al. "DADA2: high-resolution sample inference from Illumina amplicon data." Nature methods 13.7 (2016): 581.

First, choose your working directory. This is the directory where your anacapa output files are (abundance table and samples table). See the previous section for details.

In [None]:
setwd("~/anacapa")

Now, provide an ID for your project. This must match the prefix of your anacapa output file name (see the previous section, '3. Copying your data to Jupyter'). So for the 'xxxx_ASV_taxonomy_brief.txt' file you uploaded, your project ID will be 'xxxx'. E.g. if your uploaded file is called '16Smam_ASV_taxonomy_brief.txt', your project ID will be '16Smam'. Edit the code cell below to match your anacapa output file name.

In [None]:
project_id <- "rbcl"

Then, we import the abundance table into R:

In [None]:
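# Read the tab-delimited anacapa table, keeping column names as in the file, then name the ASV ID column
asvtable <- read.table(paste0("./", project_id, "_ASV_taxonomy_brief.txt"), check.names = FALSE,
                       sep = "\t", stringsAsFactors = FALSE, comment.char = "", header = TRUE)
colnames(asvtable)[1] <- "ASV"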
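
Because the imported file combines counts and taxonomy, it can be useful to separate the numeric sample columns from the text columns when checking the import. The cell below is a sketch on a small made-up table: the sample names, the counts and the column name `taxonomy` are assumptions for illustration, so check the column names in your own file before relying on them.

```r
# Toy stand-in for an imported table; all values and the 'taxonomy' column name are made up
asv_example <- data.frame(
  ASV = c("asv_1", "asv_2", "asv_3"),
  Sampl1.S1 = c(10, 0, 5),
  Sampl2.S2 = c(3, 7, 0),
  taxonomy = c("tax_a", "tax_b", "tax_c"),
  stringsAsFactors = FALSE
)

# Sample columns are the numeric ones; ASV IDs and taxonomy are text
is_count <- vapply(asv_example, is.numeric, logical(1))
counts <- as.matrix(asv_example[, is_count, drop = FALSE])
rownames(counts) <- asv_example$ASV

dim(counts)      # number of ASVs, number of samples
colSums(counts)  # total reads per sample
```

On your own data, `str(asvtable)` gives a quick view of which columns were read as numbers and which as text.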

<mark><font color="red">Pia, I'm also removing some repeated text from your sample names, as they appear on the ASV table (named after the fastq files). This is specific to your owl test dataset only and won't work on other datasets:</font></mark>

<mark><font color="red">Dataset1:</font></mark>

In [None]:
colnames(asvtable) <- gsub("X16Smam_Pia357NanoTest", "", colnames(asvtable))
# fixed = TRUE treats ".L001" as literal text, so "." is not a regex wildcard
colnames(asvtable) <- gsub(".L001", "", colnames(asvtable), fixed = TRUE)

<mark><font color="red">Dataset2:</font></mark>

In [None]:
colnames(asvtable) <- gsub("cytbvert_", "", colnames(asvtable))
# fixed = TRUE treats ".L001" as literal text, so "." is not a regex wildcard
colnames(asvtable) <- gsub(".L001", "", colnames(asvtable), fixed = TRUE)

<mark><font color="red">Plant data:</font></mark>

In [None]:
colnames(asvtable) <- gsub("rbcl_", "", colnames(asvtable))
colnames(asvtable) <- gsub("\\.", "_", colnames(asvtable))
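
When adapting these clean-up cells to your own sample names, note that `gsub()` treats its pattern as a regular expression by default, so an unescaped `.` matches any character. The sketch below uses two made-up column names to show the difference between a regular-expression pattern and a literal one (`fixed = TRUE`).

```r
# Made-up column names for illustration only
example_names <- c("rbcl_Sampl1.S1.L001", "rbcl_SamplXL001.S2")

# Regex pattern: '.' matches any character, so 'XL001' is removed as well
regex_result <- gsub(".L001", "", example_names)

# Literal pattern: only the exact text '.L001' is removed
fixed_result <- gsub(".L001", "", example_names, fixed = TRUE)

print(regex_result)
print(fixed_result)
```

Escaping the dot as `"\\.L001"` has the same effect as `fixed = TRUE` for this pattern.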

To generate the table we'll use the [DT: datatables package](https://rstudio.github.io/DT/). This creates a table that is searchable, has sortable columns, and can be exported as a csv or Excel file. Load the DT library:

In [None]:
library(DT)

Now generate the table:

In [None]:
DT::datatable(asvtable,
              rownames = FALSE,
              width = "100%",
              extensions = 'Buttons',
              options = list(
                scrollX = TRUE,
                dom = 'Bfrtip',
                columnDefs = list(list(className = 'dt-center', targets = "_all")),
                buttons = list('copy', 'print', list(
                  extend = 'collection',
                  buttons = list(
                    list(extend = 'csv', filename = "ASV_table"),
                    list(extend = 'excel', filename = "ASV_table"),
                    list(extend = 'pdf', filename = "ASV_table")),
                  text = 'Download'
                ))
              )
)

## Samples table

Now import your samples table into R and view it as a table.

In [None]:
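# Note: by default read.csv() converts the 'sample-id' header to 'sample.id' (check.names = TRUE)
samples_table <- read.csv("sample_table.csv", header = TRUE)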
samples_table

**IMPORTANT: The sample IDs in the sample table must exactly match the sample IDs (column names) in the ASV table.** If they don't, you'll need to manually edit the sample names in the samples table and re-upload it.
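
One way to check this is to compare the two sets of IDs with `setdiff()`. The cell below is a sketch using made-up ID vectors. On your own data, the first vector would come from the sample columns of `asvtable` and the second from the sample ID column of `samples_table`; exactly which columns those are depends on your files.

```r
# Made-up IDs for illustration; replace with the IDs from your own tables
asv_sample_ids <- c("Sampl1.S1", "Sampl2.S2", "Sampl3.S3")
table_sample_ids <- c("Sampl1.S1", "Sampl2.S2", "Sampl4.S4")

# IDs present in the ASV table but absent from the samples table, and vice versa
missing_from_samples_table <- setdiff(asv_sample_ids, table_sample_ids)
missing_from_asv_table <- setdiff(table_sample_ids, asv_sample_ids)

print(missing_from_samples_table)
print(missing_from_asv_table)

# Both results are empty only when the two sets of IDs match exactly
ids_match <- length(missing_from_samples_table) == 0 &&
  length(missing_from_asv_table) == 0
print(ids_match)
```

Any ID listed in either result needs to be corrected before moving on to the next section.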

[Click here to go to the next section: 5. Filtering taxonomic assignments](./anacapa_5_filtration.ipynb)