<a href="https://colab.research.google.com/github/bokulich-lab/sysbio_course_2022/blob/main/amplicon_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🦠 Amplicon Sequencing Data Analysis with QIIME 2
This notebook and the corresponding setup script were adapted from the [ISB Virtual Microbiome Series workshop](https://github.com/Gibbons-Lab/isb_course_2021) by Gibbons Lab (including all images, which are re-used under a CC-BY-SA 4.0 license). The adapted materials accompany the **Advanced Block Course: Computational Biology**. This notebook gives a minimal example of 16S rRNA gene amplicon sequence analysis for bacterial community profiling with the bioinformatics platform [QIIME 2](https://qiime2.org/). To learn more about the QIIME 2 project and its applications in microbiome research and beyond, visit https://qiime2.org/.

Save your own local copy of this notebook by using `File > Save a copy in Drive`. At some point you may be prompted to trust the notebook. We promise that it is safe 🤞

**Disclaimer:**

The Google Colab notebook environment interprets any command as Python code by default. To run bash commands, we have to prefix them with `!`. So any command you see with a leading `!` is a bash command, and if you wanted to run it in your terminal you would omit the leading `!`. For example, if you ran `!wget` in the Colab notebook, you would just run `wget` in your terminal.
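
As a rough analogy for what the `!` prefix does, the sketch below runs an external command from plain Python with the standard-library `subprocess` module. This is only an illustration: Colab handles `!` through IPython rather than through this exact call, and the command used here is a placeholder chosen so that the example runs wherever Python is installed.

```python
import subprocess
import sys

# Placeholder command: ask the current Python interpreter to print a message.
# In a terminal you would type a command directly; in Colab you would
# prefix it with `!`.
command = [sys.executable, "-c", "print('hello from a subprocess')"]

# Run the command, capture its text output, and raise if it exits with an error.
result = subprocess.run(command, capture_output=True, text=True, check=True)
output = result.stdout.strip()
print(output)
```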

In this notebook we use the `!` prefix because we run all QIIME 2 commands using the [`q2cli`](https://github.com/qiime2/q2cli/) (QIIME 2 command-line interface). However, QIIME 2 also has a python API and a Galaxy interface. You can learn more about these and other QIIME 2 interfaces at https://qiime2.org/.

## ❗ STOP! Before you run this notebook... ❗
_Note:_ In order to fetch the sequencing data from NCBI, we need to provide our e-mail address. Please fill out your e-mail address below
so that it can be used when connecting to the NCBI servers.

In [None]:
your_email = ''

Once you have entered your email address, you can run the entire notebook by selecting `Runtime > Run all` from the menu in Google Colab. Some steps are time-consuming and the entire notebook may take 30-60 minutes, so start the full run now and we will inspect the commands and results as we work through them as a class.

## Setup

QIIME 2 is usually installed by following the [official installation instructions](https://docs.qiime2.org/2022.2/install/). However, because we are using Google Colab and there are some caveats to using conda here, we will have to hack around the installation a little. But no worries, we provide a setup script below which does all this work for us. 😌

So...let's start by pulling a local copy of the project repository down from GitHub.

In [None]:
!git clone https://github.com/bokulich-lab/sysbio_course_2022.git materials
!mkdir /content/prefetch_cache

We will switch to working within the `materials` directory for the rest of the notebook.

In [None]:
%cd materials

Now we are ready to set up our environment. This will take about 10 minutes.

**Note**: This setup is only relevant for Google Colaboratory and will not work on your local machine. Please follow the [official installation instructions](https://docs.qiime2.org/2021.8/install/) for that.

In [None]:
%run setup_qiime2

And we will use some python packages below, so let's load those here:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

## Today's dataset

Today we will use a small subset of public data from the [Earth Microbiome Project](https://www.nature.com/articles/nature24621), which profiled the microbiome of > 27,000 samples from different ecosystems across planet Earth. That study sought to uncover generalizable rules that explain the diversity and biogeography of microbial life across the planet... in today's tutorial we will tackle a much more humble goal: quantitatively comparing the bacterial diversity of a small selection of Earth's samples.

We can use pandas to inspect some of the metadata about our samples, to learn more about what we are investigating today:

In [5]:
md = pd.read_csv('data/metadata.tsv', sep='\t', index_col=0)
md

| id | alias | read_length | barcode | num_bases | bioproject | source | animal | saline |
|---|---|---|---|---|---|---|---|---|
| ERR1548680 | qiita_sid_1064:1064.G.CV263 | 151 | ACAGGAGGGTGT | 5456838 | PRJEB14927 | honey bee microbiome | yes | no |
| ERR1548733 | qiita_sid_1064:1064.G.CV328 | 151 | AACCATGCCAAC | 5436000 | PRJEB14927 | honey bee microbiome | yes | no |
| ERR1548821 | qiita_sid_1064:1064.H.CV214 | 151 | GAGGTTCTTGAC | 5223090 | PRJEB14927 | honey bee microbiome | yes | no |
| ERR1529655 | qiita_sid_1481:1481.PO1.2.T0 | 151 | TCTAACGAGTGC | 3804898 | PRJEB14782 | human microbiome | yes | no |
| ERR1529656 | qiita_sid_1481:1481.PO1.2.T8 | 151 | CATCTGGGCAAT | 3286364 | PRJEB14782 | human microbiome | yes | no |
| ERR1529685 | qiita_sid_1481:1481.PO5.10.T4 | 151 | GCTTCCAGACAA | 3030570 | PRJEB14782 | human microbiome | yes | no |
| ERR1530728 | qiita_sid_1222:1222.B1.5.11.06 | 141 | GACGCACTAACT | 27746200 | PRJEB14793 | ocean | no | yes |
| ERR1530761 | qiita_sid_1222:1222.B3.5.19.06 | 142 | GAGCGTATCCAT | 20439520 | PRJEB14793 | ocean | no | yes |
| ERR1530769 | qiita_sid_1222:1222.B4.5.9.06 | 142 | TCACGGTGACAT | 21630381 | PRJEB14793 | ocean | no | yes |
| ERR1844439 | qiita_sid_1714:1714.McG.PB02 | 150 | TATGGAGCTAGT | 6137796 | PRJEB19497 | soil | no | no |


## Our first QIIME 2 command

The following schematic gives an overview of today's workflow:

![our workflow](https://github.com/Gibbons-Lab/isb_course_2021/raw/main/docs/16S/assets/steps.png)

Before we begin, we first need to import our input data as a QIIME 2 ["artifact"](https://dev.qiime2.org/latest/glossary/).

We can import the data with the `import` action from the tools. Here, we will import a list of SRA run IDs that
we will later use to fetch the corresponding sequences. For that we have to tell
QIIME 2 what *type of data* we are importing and what *type of artifact* we want.

**QoL Tip:** QIIME 2 commands can get very long. To split them up over several lines we can use `\` which means "continue on the next line".

In [None]:
!qiime tools import \
  --type 'NCBIAccessionIDs' \
  --input-path data/ids.tsv \
  --output-path ids.qza

# Fetching sequencing data from NCBI
The data we will be analyzing today is already deposited in the Sequence Read Archive (SRA) maintained by NCBI. Given a list of accession IDs,
we can use the [**q2-fondue**](https://github.com/bokulich-lab/q2-fondue) plugin to fetch all those sequences. They will be automatically imported into a QIIME 2 artifact that we can then
directly use for the subsequent analysis steps.

In [None]:
!qiime fondue get-all \
    --i-accession-ids ids.qza \
    --p-email {your_email} \
    --p-n-jobs 2 \
    --o-metadata metadata.qza  \
    --o-single-reads sequences.qza \
    --o-paired-reads sequences-paired.qza \
    --o-failed-runs failed-runs.qza \
    --verbose

The action above should have fetched all the single-end sequences we need. Since we have quality information for the sequencing reads, let's also generate
our first visualization to inspect sequence quality.

---

QIIME 2 commands can become pretty long. Here are some pointers to remember the
structure of a command:

```
qiime plugin action --i-argument1 ... --o-argument2 ...
```

Argument types usually begin with a letter denoting their meaning:

- `--i-...` = input files
- `--o-...` = output files
- `--p-...` = parameters
- `--m-...` = metadata

If you ever need help, just add the `--help` flag to a command to see the help documentation for that plugin or command inline (or check out the online documentation at https://docs.qiime2.org/).
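
To make the prefix convention concrete, here is a small Python sketch that labels each flag in a command string according to the list above. The helper `describe_flags` and the `PREFIX_MEANINGS` table are hypothetical names made up for this illustration; they are not part of QIIME 2.

```python
# Prefix meanings, taken from the list above.
PREFIX_MEANINGS = {
    "--i-": "input file",
    "--o-": "output file",
    "--p-": "parameter",
    "--m-": "metadata",
}


def describe_flags(command):
    """Return (flag, meaning) pairs for every prefixed flag in a command string.

    Hypothetical helper for illustration only; flags without one of the four
    prefixes (e.g. --help or --verbose) are skipped.
    """
    described = []
    for token in command.split():
        for prefix, meaning in PREFIX_MEANINGS.items():
            if token.startswith(prefix):
                described.append((token, meaning))
    return described


example = "qiime demux summarize --i-data sequences.qza --o-visualization qualities.qzv"
for flag, meaning in describe_flags(example):
    print(f"{flag}: {meaning}")
```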

---

In this case we will use the `summarize` action from the `demux` plugin with the previously generated artifact as input and output the resulting visualization to the `qualities.qzv` file.

In [None]:
!qiime demux summarize \
    --i-data sequences.qza \
    --o-visualization qualities.qzv

You can view the plot by downloading the .qzv file and opening it using http://view.qiime2.org. To download the file click on the folder symbol to the left, open the `materials` folder, and choose download from the dot menu next to the `qualities.qzv` file. Note that the visualization that opens in your browser window has multiple tabs, allowing you to also view the citation and data provenance information associated with this output.

🤔 What do you observe across the read? Where would you truncate the reads?

Note that `q2-fondue` downloads sequences belonging to each sample requested from NCBI, as well as metadata associated with these samples. For most of this tutorial we will use a smaller sample metadata file provided as a text file alongside the tutorial (no downloading needed). The metadata file downloaded by `q2-fondue` contains even more information about these samples! This might be interesting to use below when you get to the exercises. You can inspect the full sample metadata (and various other sample and feature metadata) as an interactive table using the following command:

In [None]:
!qiime metadata tabulate \
    --m-input-file metadata.qza \
    --o-visualization metadata.qzv

# Denoising amplicon sequence variants

Raw DNA sequencing reads (in this case from Illumina sequencing) can contain various types of errors, e.g., base-calling errors, chimeric reads, or other defects. To correct and remove these errors, we will use QIIME 2's plugin for [DADA2](https://benjjneb.github.io/dada2/) to perform the following quality control steps:

1. filter and trim the reads (i.e., to remove low quality terminal segments)
2. find the most likely set of unique sequences in the sample (ASVs)
3. remove chimeras
4. count the abundances of each ASV


This step can take a long time to run, so let's start the process and use the time to
understand what is happening:

In [None]:
!qiime dada2 denoise-single \
    --i-demultiplexed-seqs sequences.qza \
    --p-trunc-len 140 \
    --p-n-threads 2 \
    --output-dir dada2 --verbose

If this step takes too long or fails, you can also copy the results from the treasure chest. **However, don't run the next cell if the previous cell completed successfully.**

In [None]:
# obscure magic that will only copy if the previous command failed
![ -d dada2 ] || cp -r results/dada2 .

Ok, this step ran, but we should also make sure it kind of worked. One good way to tell if the identified ASVs are representative of the sample is to see how many reads were maintained throughout the pipeline. Here, the most common issues and solutions are:

**A large fraction of reads is lost during "merging" (only relevant for paired-end data)**

![read overlap](https://gibbons-lab.github.io/isb_course_2020/16S/assets/read_overlap.png)

To merge forward and reverse reads, DADA2 requires an overlap of at least 12 bases by default. Thus, your reads must allow for sufficient overlap *after* trimming. For example, if your amplified region is 450bp long, you have 2x250bp reads, and you trim the last 30 bases of each read (truncating the length to 220bp), the total length of covered sequence is 2x220 = 440bp, which is shorter than 450bp, so there will be no overlap. To solve this issue, trim less of the reads or lower the `--p-min-overlap` parameter (but not too low).
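
The arithmetic in this example can be written out as a short Python sketch. It assumes both reads are truncated to the same length and together span the amplicon end to end (primers and indels are ignored); the function names are made up for illustration, and the default of 12 bases is the value quoted above.

```python
def merged_overlap(amplicon_length, truncated_read_length):
    """Bases shared by the forward and reverse read after truncation.

    A negative value means there is a gap between the two reads.
    """
    return 2 * truncated_read_length - amplicon_length


def can_merge(amplicon_length, truncated_read_length, min_overlap=12):
    """True if the truncated reads overlap by at least `min_overlap` bases."""
    return merged_overlap(amplicon_length, truncated_read_length) >= min_overlap


# The example from the text: 450bp amplicon, reads truncated to 220bp.
print(merged_overlap(450, 220))
print(can_merge(450, 220))

# Truncating less aggressively (to 240bp) leaves enough overlap.
print(merged_overlap(450, 240))
print(can_merge(450, 240))
```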

<br>

**Most of the reads are lost as chimeric**

![chimera](https://gibbons-lab.github.io/isb_course_2020/16S/assets/chimera.png)

This is usually an experimental issue as chimeras are introduced during amplification. If you can adjust your PCR, try to run fewer cycles. Chimeras can also be introduced by incorrect merging. If your minimum overlap is too small, ASVs may be merged at incorrect positions. Possible fixes are to increase the `--p-min-overlap` parameter or run the analysis on the forward reads only (in our empirical observations, chimeras are more likely to be introduced in the joined reads). *However, losing between 5-25% of your reads to chimeras is normal and does not require any adjustments.*

Our denoising stats are contained in an artifact. We can view this report as an interactive table by running the command `qiime metadata tabulate`.

In [None]:
!qiime metadata tabulate \
    --m-input-file dada2/denoising_stats.qza \
    --o-visualization dada2/denoising-stats.qzv

What proportion of reads was retained throughout the entire pipeline? Look at the final number of used reads (non-chimeric). What do you observe when comparing those values between samples and how might that affect diversity metrics?
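
If you prefer to compute the retained proportion yourself, a minimal Python sketch follows. The read counts are invented for illustration and are not taken from this dataset; substitute the input and non-chimeric counts you see in the denoising stats table.

```python
def percent_retained(input_reads, non_chimeric_reads):
    """Percentage of input reads that survive through chimera removal."""
    if input_reads <= 0:
        raise ValueError("input_reads must be positive")
    return 100.0 * non_chimeric_reads / input_reads


# Hypothetical (input, non-chimeric) read counts for two made-up samples.
example_counts = {
    "sample_A": (20000, 15000),
    "sample_B": (8000, 2000),
}

for sample, (n_input, n_final) in example_counts.items():
    print(f"{sample}: {percent_retained(n_input, n_final):.1f}% retained")
```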

# Phylogeny and diversity

## Building a tree

We can build a phylogenetic tree for our sequences using the following command:

In [None]:
!qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences dada2/representative_sequences.qza \
    --output-dir tree

We can create a visualization for the tree using the [empress](https://github.com/biocore/empress) QIIME 2 plugin.

In [None]:
!qiime empress tree-plot \
    --i-tree tree/rooted_tree.qza \
    --o-visualization tree/empress.qzv

This looks tree-like but is not particularly informative as is because we have not yet passed it any annotation information. For now, the main utility of our tree will be in complementing our diversity analyses. It will tell us which ASVs are more or less related to one another, which will allow us to calculate different kinds of ecological diversity metrics that are _phylogenetically aware_ (i.e., incorporate evolutionary distances to measure the similarity between communities on the basis of genetic/evolutionary diversity, rather than unique hits alone).

## Alpha and Beta Diversity

![sample sources](https://github.com/Gibbons-Lab/isb_course_2021/raw/main/docs/16S/assets/sample_sources.png)

One of our main goals will be to compare the microbial composition from different environments. Some very common comparisons performed in microbial ecology include the species richness (alpha diversity) and similarity between communities (beta diversity), each of which can be compared with various metrics. QIIME 2 has many options and actions for each of these; here we will use a "run-all" command for diversity analyses to perform some of the most routine diversity measurements. This will:

1. Subsample our feature table so that each sample has the same total number of reads (Why?) 
2. Calculate alpha and beta diversity measures using multiple metrics (Why multiple? What do these measure?)
3. Visualize [PCoA](https://en.wikipedia.org/wiki/Multidimensional_scaling) projections (PCoA is one dimensionality reduction method available in QIIME 2, and is very widely used in microbial ecology; t-SNE and UMAP are also available in the `q2-diversity` plugin used here)

In [None]:
!qiime diversity core-metrics-phylogenetic \
    --i-table dada2/table.qza \
    --i-phylogeny tree/rooted_tree.qza \
    --p-sampling-depth 10000 \
    --m-metadata-file data/metadata.tsv \
    --output-dir diversity
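
The subsampling in step 1 can be sketched in a few lines of Python. This is a simplified illustration using only the standard library, not the plugin's implementation; the function name `rarefy`, the fixed seed, and the toy counts are assumptions made for this example. Samples with fewer reads than the requested depth are dropped, which is why the choice of `--p-sampling-depth` matters.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a dict of feature counts.

    Returns None when the sample has fewer than `depth` reads, to mirror the
    fact that such samples are dropped from the rarefied table.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # One list entry per read, labelled with the feature it belongs to.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


# Toy sample (hypothetical counts): 100 reads spread over three ASVs.
toy_sample = {"asv1": 50, "asv2": 30, "asv3": 20}
subsampled = rarefy(toy_sample, depth=40)
print(subsampled)
print(rarefy(toy_sample, depth=1000))  # too few reads: sample is dropped
```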

## Statistical analyses

Let's first have a look at alpha diversity. This action runs a series of [Kruskal-Wallis](https://en.wikipedia.org/wiki/Kruskal–Wallis_one-way_analysis_of_variance) tests to see if the normalized alpha diversity ([Shannon entropy](https://en.wikipedia.org/wiki/Diversity_index#Shannon_index) in this example) is different between groups (null hypothesis: all group medians are equal).

Can we see a difference in the per-sample diversity across environments? And between animal-associated and free-living communities?

In [None]:
!qiime diversity alpha-group-significance \
    --i-alpha-diversity diversity/shannon_vector.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization diversity/alpha_groups.qzv
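
To see what the Shannon entropy tested here measures, the following Python sketch computes it for a single sample from a list of feature counts. The logarithm base of 2 is an assumption made for this example (check the plugin documentation for the base it uses), and the counts are toy values.

```python
import math


def shannon_entropy(counts, base=2):
    """Shannon entropy H = -sum(p_i * log(p_i)) over features with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p, base) for p in proportions)


# Four equally abundant features: the most even community of this size.
print(shannon_entropy([10, 10, 10, 10]))
# One dominant feature: much lower entropy despite the same richness.
print(shannon_entropy([37, 1, 1, 1]))
# A single feature: no uncertainty at all.
print(shannon_entropy([40]))
```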

Now, let's use beta diversity to see how different the samples are from one another. First download `diversity/weighted_unifrac_emperor.qzv` and `diversity/unweighted_unifrac_emperor.qzv` and take a look at each. Do samples separate based on the environment? Which metric does a "better job" of separating these?

We can check whether that separation is 'significant' by using a [PERMANOVA](https://en.wikipedia.org/wiki/Permutational_analysis_of_variance) test (null hypothesis: group centroids and dispersions are equivalent for all groups).

In [None]:
!qiime diversity adonis \
    --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
    --m-metadata-file data/metadata.tsv \
    --p-formula "source" \
    --p-n-jobs 2 \
    --o-visualization diversity/permanova.qzv
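
The idea behind a permutation test on a distance matrix can be sketched in pure Python. This is a one-way PERMANOVA-style illustration on made-up one-dimensional data, not the `adonis` implementation called above (which supports formulas with several terms); the function names, the seed, and the p-value convention `(hits + 1) / (permutations + 1)` are choices made for this example.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and one label per sample."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: the same quantity computed inside each group.
    ss_within = 0.0
    for group in groups:
        members = [i for i in range(n) if labels[i] == group]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value) for the given grouping."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data (hypothetical): positions on a line for two tight, distant groups.
positions = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
labels = ["a"] * 4 + ["b"] * 4
dist = [[abs(x - y) for y in positions] for x in positions]

f_stat, p_value = permanova(dist, labels)
print(f"pseudo-F = {f_stat:.1f}, p = {p_value:.3f}")
```

With only a handful of samples per group, the number of distinct label arrangements is small, so the smallest achievable p-value is limited; keep this in mind when interpreting results from a dataset as small as ours.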

# Taxonomy

Another common question in microbial ecology research is to ask "who is there"? We have a set of sequences derived from our samples, and the next step is to classify these to predict the nearest taxonomic lineage (e.g., to detect known pathogens or other functionally important species). There are many approaches, some based on DNA sequence alignment and others based on machine learning models (e.g., using subsequence signatures).

Here we will use a [Naïve Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) trained on k-mer frequencies derived from the GreenGenes 16S rRNA gene database. Various pre-trained classifiers can be downloaded from https://docs.qiime2.org/2022.2/data-resources/ (hint: use `q2View`'s provenance view to see how these were generated). We will use a "bespoke" classifier, which is trained using taxonomic class weights derived from the average frequency distributions of various microbial species across planet Earth (using data from [Qiita](https://qiita.ucsd.edu/)). Use of taxonomic weights [improves classification accuracy](https://www.nature.com/articles/s41467-019-12669-6) vs. assuming uniform distributions, as it helps differentiate related species based on niche segregation patterns.

Microbial taxonomy is a thorny topic that we don't have time to touch on now.
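
To give a feel for the k-mer frequencies such a classifier is trained on, here is a minimal Python sketch that counts overlapping k-mers in a sequence. The value of `k` and the toy sequence are arbitrary choices for illustration; the classifier's actual k-mer size and feature extraction are configured inside the plugin and are not reproduced here.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence (not a real 16S read), with k=2 to keep the output readable.
counts = kmer_counts("ACGTAC", 2)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```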

In [None]:
!wget https://data.qiime2.org/2022.2/common/gg-13-8-99-515-806-nb-weighted-classifier.qza

In [None]:
!qiime feature-classifier classify-sklearn \
    --i-reads dada2/representative_sequences.qza \
    --i-classifier gg-13-8-99-515-806-nb-weighted-classifier.qza \
    --p-n-jobs 2 \
    --o-classification taxa.qza

There are a few ways to view and evaluate taxonomic classifications using QIIME 2. We will qualitatively compare for now by inspecting the relative abundances of the different bacterial taxa we have in each sample using interactive barplots:

In [None]:
!qiime taxa barplot \
    --i-table dada2/table.qza \
    --i-taxonomy taxa.qza \
    --m-metadata-file data/metadata.tsv \
    --o-visualization taxa_barplot.qzv

We can also collapse data on a particular taxonomic rank using the QIIME 2 [q2-taxa plugin](https://docs.qiime2.org/2022.2/plugins/available/taxa/). Why might we want to look at different taxonomic ranks, rather than just looking at ASVs?

In [None]:
!qiime taxa collapse \
    --i-table dada2/table.qza \
    --i-taxonomy taxa.qza \
    --p-level 6 \
    --o-collapsed-table genus.qza
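
Conceptually, collapsing sums the counts of all features that share the same lineage down to the chosen rank. The Python sketch below illustrates this on a toy table; the function `collapse`, the nested-dict data layout, and the ASV names are assumptions made for this example and do not reflect how the plugin stores its data. Lineages shorter than the requested level are padded with `__` in this sketch.

```python
from collections import defaultdict


def collapse(table, taxonomy, level):
    """Sum feature counts per sample for features sharing the first `level` ranks."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature, sample_counts in table.items():
        ranks = [rank.strip() for rank in taxonomy[feature].split(";")]
        # Pad short lineages with "__", then keep only the first `level` ranks.
        ranks = (ranks + ["__"] * level)[:level]
        key = ";".join(ranks)
        for sample, count in sample_counts.items():
            collapsed[key][sample] += count
    return {key: dict(samples) for key, samples in collapsed.items()}


# Toy inputs (hypothetical ASVs, samples, and lineages).
taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "asv2": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "asv3": "k__Bacteria; p__Proteobacteria",
}
table = {
    "asv1": {"s1": 5, "s2": 0},
    "asv2": {"s1": 3, "s2": 2},
    "asv3": {"s1": 1, "s2": 7},
}

phylum_table = collapse(table, taxonomy, level=2)
for lineage, samples in phylum_table.items():
    print(lineage, samples)
```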

We can export the table and convert it to a .csv file so that we can analyze these data using tools outside of the QIIME 2 environment.

In [None]:
!qiime tools export \
    --input-path genus.qza \
    --output-path exported
!biom convert -i exported/feature-table.biom -o genus.tsv --to-tsv

Now the data are in a common format and we can use them, for instance, to draw a heatmap using Pandas and Seaborn. Do not worry if you do not understand every bit of code here. This just serves to illustrate that you can get data out of QIIME 2 for custom visualizations (and this is even slicker when using QIIME 2's python API, as QIIME 2 objects can be "viewed" automatically as `DataFrames` or other python objects; but when using the CLI we just need to `export` the data first, as shown above).

In [None]:
abundances = pd.read_table("genus.tsv", skiprows=1, index_col=0)
abundances.index = abundances.index.str.split(";").str[5]       # Use only the genus name
abundances = abundances[~abundances.index.isin(["g__", "__"])]  # remove unclassified genera
abundances = abundances.iloc[0:100]                             # use only the first 100 genera

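# Log-ratio transform with a 0.5 pseudocount: each genus (row) is centered on the
# log of its mean abundance across samples. This is a simplified, row-wise variant
# of the centered log-ratio (CLR) transform.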
transformed = abundances.apply(
    lambda xs: np.log(xs + 0.5) - np.log(xs.mean() + 0.5),
    axis=1)

# and re-label the samples for the sake of plotting
transformed.columns = [x + '_' + md.loc[x, 'source'] for x in transformed.columns]

sns.clustermap(transformed.T, cmap="magma", xticklabels=True, figsize=(19, 6))
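
For comparison, the cell above subtracts the log of each genus's arithmetic mean across samples, whereas the textbook centered log-ratio transform is applied within one sample and subtracts the mean of the logs (i.e., the log of the geometric mean). A minimal sketch of the textbook version, using the same 0.5 pseudocount and made-up counts:

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio of one sample: log(x_i) minus the mean of all log(x_j)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Toy sample (hypothetical genus counts, including a zero).
toy_counts = [10, 0, 5, 85]
transformed_sample = clr(toy_counts)
print([round(value, 3) for value in transformed_sample])
```

A useful property to check is that CLR values within a sample sum to zero, which makes them comparable between samples with different sequencing depths.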

# Exercises

Okay, that's enough time in the back seat. 

It's time to take the wheel 🚗

Now you can dive into the data 🏊

## Exercise 1 - Supervised classification of microbiome data

One pretty basic question we can ask is whether the microbial community composition is predictive of environmental type. 
Could you predict the source environment of a sample from 16S data alone? 


Start with the `classify-samples` action and follow it up by finding and looking at the `heatmap` visualization afterwards that shows important taxa.

We will try to build a machine learning model that can predict whether a microbiome is animal-associated or free-living. By default, this action will use a Random Forest classifier, but other algorithms can be selected with the `--p-estimator` parameter. This action will automatically split your dataset into training and test sets (i.e., only a subset is used for model validation), but other actions in the `q2-sample-classifier` plugin allow other schemes (e.g., nested cross-validation).
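
A random split of sample IDs can be sketched as follows. This is a simplified illustration, not the plugin's procedure (which may, for example, stratify by class); the function name, the seed, and the rounding of the test-set size with `math.ceil` are assumptions made for this example.

```python
import math
import random


def train_test_split(sample_ids, test_size=0.33, seed=0):
    """Shuffle the IDs and hold out a fraction `test_size` of them as the test set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = math.ceil(test_size * len(ids))
    return ids[n_test:], ids[:n_test]


# The ten run IDs from today's metadata table.
sample_ids = [
    "ERR1548680", "ERR1548733", "ERR1548821", "ERR1529655", "ERR1529656",
    "ERR1529685", "ERR1530728", "ERR1530761", "ERR1530769", "ERR1844439",
]
train_ids, test_ids = train_test_split(sample_ids)
print(len(train_ids), "training samples;", len(test_ids), "test samples")
```

With only ten samples, the held-out set is tiny, so a single misclassified sample shifts the reported accuracy considerably.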

Also inspect the accuracy results to see how well the phenotype can be predicted from the microbial composition. Next, you could try predicting other sample metadata columns, different models, etc, to see what works best (why?)...

I filled in the command for you but it is missing some inputs. Can you complete it? 

Try to use the genus table and not the ASV table here (why?).

**QUESTIONS:**

1) What does it mean for data to be in the 'training' or 'test' set? 

2) How well did this classifier perform? 

3) What ASVs contributed most to model performance? Why do you think these ASVs were so important?

4) Do you think this model is broadly useful? Would it perform well on external data that it has not seen yet? Why or why not?


In [None]:
!rm -rf classifier
!qiime sample-classifier classify-samples \
    --i-table [EMPTY] \
    --m-metadata-file data/metadata.tsv \
    --m-metadata-column source \
    --p-n-jobs 2 \
    --p-test-size 0.33 \
    --p-cv 1 \
    --output-dir classifier 

## Exercise 2 - Decorate Your Tree 🎄

One visualization that we did not spend a lot of time on was the phylogenetic tree of our ASVs. Let's change that!

We previously viewed the tree before we had any annotations, and it was fairly boring. Now that we have taxonomy annotations and diversity data, let's decorate our tree. We will use the empress plugin again but this time with the `community-plot` action, which allows us to incorporate community and taxonomy information into our tree.

Once again, I filled in the command but see if you can fill in the blanks...

If you want to display this side-by-side with a PCoA plot, check out the `--i-pcoa` input.

**QUESTIONS:**

1) Are some of the branch lengths on the tree longer than you would expect? Do you notice anything interesting or suspicious about the taxonomic identities of these branches?

2) Can you find examples of phyla that are polyphyletic (i.e. where clusters of ASVs from the same phylum are found in different locations on the tree, showing different common ancestors)? What about polyphyletic taxa at lower taxonomic levels, like at the family or genus levels? Why do you think these patterns exist?

In [None]:
!qiime empress community-plot \
    --i-tree [EMPTY] \
    --i-feature-table [EMPTY] \
    --m-sample-metadata-file [EMPTY] \
    --m-feature-metadata-file taxa.qza \
    --o-visualization community-tree-viz.qzv

## Exercise 3 - create your own sequence reference database

At the start of this workshop we used the q2-fondue plugin to download raw sequence data from NCBI Sequence Read Archive. A different plugin, RESCRIPt (https://github.com/bokulich-lab/RESCRIPt/) can be used to download gene or genome data from NCBI GenBank (and some other sources) to create a reference database (e.g., for taxonomic classification, sequence alignment, comparative genomics, et cetera), as well as various other functions for sequence database management.

As a small exercise, here we will use RESCRIPt to generate a reference database of [NCBI RefSeqs](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/) 16S rRNA gene sequences. Then we demonstrate using [VSEARCH](https://pubmed.ncbi.nlm.nih.gov/27781170/) (via the q2-feature-classifier plugin) to align our query sequences against this reference database (in this case for the purpose of taxonomic classification, but other alignment tasks would also be possible here).

In [None]:
!qiime rescript get-ncbi-data \
  --p-query '33175[BioProject] OR 33317[BioProject]' \
  --o-sequences ncbi-refseqs.qza \
  --o-taxonomy ncbi-refseqs-taxonomy.qza

In [None]:
# Now it's your turn! Based on the outputs above, guess what should go where.
# I have filled in various parameter settings but feel free to adjust these to
# see how they work (or see the docs 😀)
!qiime feature-classifier classify-consensus-vsearch \
  --i-query dada2/representative_sequences.qza \
  --i-reference-reads [EMPTY] \
  --i-reference-taxonomy [EMPTY] \
  --p-maxaccepts 3 \
  --p-maxhits 3 \
  --p-perc-identity 0.95 \
  --p-top-hits-only \
  --p-threads 2 \
  --p-min-consensus 0.51 \
  --o-classification taxa-vsearch.qza

And finally you should view the results as a barplot... you already did this above, so I will let you fill in the entire command! Make sure you use the correct filepath to use the vsearch taxonomic classifications.

In [None]:
# you do the rest! 

## Bonus exercise! Looking back

One of the unique features of QIIME 2 is its integrated data provenance tracker, which embeds workflow information directly into output artifacts and visualizations. This makes QIIME 2-based workflows more transparent and reproducible, as this allows anyone to retrospectively review how any file was created (including relevant computational environment specs). We will now demonstrate a few ways to inspect and "replay" this provenance.

Earlier in this workshop you may have used `q2View` to inspect provenance information. If you did not, try it now. Use https://view.qiime2.org/ to view the `community-tree-viz.qzv` that you just created in the exercise above. Click on the "Provenance" tab (top-right corner) to view the provenance graph, which displays the workflow steps used to create that output. You can also find citation information in the "Details" tab.

`q2View` only allows inspection of one output at a time, however, and only supports retrospective provenance. A new QIIME 2 plugin, [provenance-lib](https://github.com/qiime2/provenance-lib) (still in alpha release at the time of writing), allows you to "replay" an entire workflow by collating provenance information, citation details, and/or other information (e.g., metadata) into a single output. Here is an example that "replays" our workflow (see `replay --help` for other actions and options). This function parses all provenance data (and optionally metadata, citations, etc.) in a directory of QIIME 2 outputs and generates a workflow script.

In [None]:
!replay provenance \
  --i-in-fp . \
  --p-recurse \
  --p-no-dump-recorded-metadata \
  --p-no-verbose \
  --o-out-fp ./q2-example-workflow-cli.sh

Now we will display the script in-line to view the complete workflow (you can also download the file from Colab if you would prefer to view with a text viewer). There are many commands! Do you remember running all of these? 

If you have spotted some unusual entries, good work. This replay script includes the provenance of some files that you downloaded (e.g., pre-trained classifiers), and this becomes chained into the provenance of downstream results. You can use `q2View` to inspect the provenance of the pre-trained classifier (or the results of a downstream action that you ran) to see how those files were generated. Or you can use this replay script (with some modifications to update filepaths et cetera) to recreate these files yourself!


In [None]:
with open('/content/materials/q2-example-workflow-cli.sh', 'r') as f:
    print(f.read())

<br><br><br>

---

# 🦠 Space for inspiration

If we make it this far, you can use this space to explore some other actions in QIIME 2, or export and analyze your data with other python functions (for example). As a starting place, check out all the QIIME 2 plugins already installed in your environment and see if anything sounds like an inspiring place to jump in...

In [None]:
!qiime --help

In [None]:
# You can add more code cells with the "+ Code" button on the top right
