Documentation

compmetagen · Apr 19, 2018 · c471e79
1 parent 638cc6e
commit c471e79

Showing 24 changed files with 770 additions and 671 deletions.
diff --git a/doc/source/denoising_illumina.rst b/doc/source/denoising_illumina.rst
@@ -0,0 +1,71 @@
+Denoising (Illumina only)
+=========================
+
+Usually, amplicon sequences are clustered into **Operational Taxonomic Units**
+(OTUs) using a similarity threshold of 97%, which represents the common working
+definition of bacterial species. 
+
+Another approach consists of identifying the **Sequence Variants** (SVs, see
+:doc:`/otu` for details). This approach avoids clustering sequences at a
+predefined similarity threshold and usually includes a denoising algorithm in
+order to identify SVs.
+
+In this tutorial we show how to perform the denoising of Illumina overlapping
+paired-end sequences in order to detect the SVs. Although this tutorial explains
+how to apply the pipeline to 16S paired-end Illumina reads, it can be adapted to
+Illumina single-end sequencing or to other marker genes/spacers, e.g. **Internal
+Transcribed Spacer (ITS)**, **18S** or **28S**.
+
+.. contents:: Table of Contents
+    :local:
+
+Data download and preprocessing
+-------------------------------
+
+In this tutorial we analyze the same dataset used in :doc:`/pairedend_97`. Reads
+merging, primer trimming and quality filtering are the same as in
+:doc:`/pairedend_97`:
+
+.. code-block:: sh
+
+    wget ftp://ftp.fmach.it/metagenomics/micca/examples/garda.tar.gz
+    tar -zxvf garda.tar.gz
+    cd garda
+
+    micca mergepairs -i fastq/*_R1*.fastq -o merged.fastq -l 100 -d 30
+    micca trim -i merged.fastq -o trimmed.fastq -w CCTACGGGNGGCWGCAG -r GACTACNVGGGTWTCTAATCC -W -R -c
+    micca filter -i trimmed.fastq -o filtered.fasta -e 0.75 -m 400
+
+Denoising - Sequence Variants identification
+--------------------------------------------
+
+The :doc:`/commands/otu` command implements the UNOISE3 protocol
+(``denovo_unoise``) which includes dereplication, denoising and chimera
+filtering:
+
+.. code-block:: sh
+
+    micca otu -m denovo_unoise -i filtered.fasta -o denovo_unoise_otus -t 4 -c
+
+The :doc:`/commands/otu` command returns several files in the output directory,
+including the **SV table** (``otutable.txt``) and a FASTA file containing the
+**representative sequences** (``otus.fasta``).
+
+.. Note::
+
+    See :doc:`/otu` for how to apply the **de novo swarm**,
+    **closed-reference** and **open-reference** OTU picking strategies to
+    these data.
+
+Further steps
+-------------
+
+* :ref:`pairedend_97-taxonomy`
+
+* :ref:`pairedend_97-tree`
+
+* :ref:`pairedend_97-biom`
+
+* :doc:`/phyloseq`
+
+* :doc:`/table`
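
Note on the tutorial above: the SV table written by the denoising step (``otutable.txt``) can be sanity-checked with a few lines of Python before moving on to the further steps. The following is a minimal sketch, assuming the TAB-delimited OTU x sample layout with integer counts described in ``formats.rst``. The function names ``read_table`` and ``sample_totals`` and the inline example data are hypothetical and are not part of micca.

```python
import csv
import io


def read_table(handle):
    """Parse a TAB-delimited OTU/SV x sample table.

    Returns (samples, counts), where counts maps each OTU/SV identifier
    to a list of integer counts, one per sample, in header order.
    Assumes integer counts, as in the documented example table.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the OTU/SV identifier
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [int(value) for value in row[1:]]
    return samples, counts


def sample_totals(samples, counts):
    """Sum the counts of all OTUs/SVs for each sample."""
    totals = dict.fromkeys(samples, 0)
    for values in counts.values():
        for sample, value in zip(samples, values):
            totals[sample] += value
    return totals


# Made-up data for illustration only (not taken from the garda dataset).
example = "OTU\tS1\tS2\nDENOVO1\t10\t0\nDENOVO2\t5\t7\n"
samples, counts = read_table(io.StringIO(example))
totals = sample_totals(samples, counts)
print(totals)
```

To run it on a real table, ``read_table(open(path, newline=""))`` can replace the ``io.StringIO`` call, where ``path`` points to the ``otutable.txt`` file in the output directory. Samples with very low totals are worth a second look before any downstream analysis.
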
diff --git a/doc/source/filtering.rst b/doc/source/filtering.rst
diff --git a/doc/source/formats.rst b/doc/source/formats.rst
@@ -5,43 +5,38 @@ Sequence files
 --------------
 
 `FASTA <https://en.wikipedia.org/wiki/FASTA_format>`_ and `FASTQ
-<https://en.wikipedia.org/wiki/FASTQ_format>`_ Sanger/Illumina 1.8+
-format (phred+33) formats are supported. micca provides the
-:doc:`/commands/convert` command to convert between sequence file
-formats.
-
+<https://en.wikipedia.org/wiki/FASTQ_format>`_ (Sanger/Illumina 1.8+,
+phred+33) formats are supported. micca provides the :doc:`/commands/convert`
+command to convert between sequence file formats.
 
 Taxonomy files
 --------------
 
 Taxonomy files map sequence IDs to taxonomy. Input taxonomy files must
 be TAB-delimited files where each row is in one of the following forms:
+
+#. ``SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;``
+#. ``SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;``
+#. ``SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales``
+#. ``SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;``
 
-   #. ``SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;``
-   #. ``SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;``
-   #. ``SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales``
-   #. ``SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;``
-
-
-Compatible taxonomy files are in:
+Compatible taxonomy files are:
 
   * Greengenes (http://greengenes.secondgenome.com/downloads);
   * QIIME-formatted SILVA (https://www.arb-silva.de/download/archive/qiime/);
   * UNITE (https://unite.ut.ee/repository.php);
   * Human Oral Microbiome Database (HOMD) (http://www.homd.org/).
 
 The output taxonomy file returned by :doc:`/commands/classify` is a
-TAB-delimited file where each row is always in the format::
+TAB-delimited file where each row is in the format::
 
    SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
 
+OTU/SV tables and taxonomy tables
+---------------------------------
 
-OTU table and taxonomy tables
------------------------------
-
-The OTU table returned by :doc:`/commands/otu` is an OTU x sample,
-TAB-delimited text file, containing the number of times an OTU is
-found in each sample::
+The OTU table returned by :doc:`/commands/otu` is an OTU x sample, TAB-delimited
+text file, containing the number of times an OTU is found in each sample::
 
    OTU     Mw_01 Mw_02 Mw_03 ...
    DENOVO1 151   178   177   ...
@@ -50,14 +45,14 @@ found in each sample::
    DENOVO4 166   299   115   ...
    ...     ...   ...   ...   ...
 
-The :doc:`/commands/tabletotax` command returns the "taxonomy tables"
-for each taxonomic level, e.g.::
+The :doc:`/commands/tabletotax` command returns the "taxonomy tables" for each
+taxonomic level, e.g.::
 
    OTU                                Mw_01 Mw_02 Mw_03 ...
-   Bacteria;Bacteroidetes	      1363  1543  1168  ...
+   Bacteria;Bacteroidetes             1363  1543  1168  ...
    Bacteria;Cyanobacteria/Chloroplast 0     0     0     ...
    Bacteria;Firmicutes                6257  5780  6761  ...
-   Bacteria;Lentisphaerae	      0     1     0     ...
+   Bacteria;Lentisphaerae             0     1     0     ...
    ...                                ...   ...   ...   ...
 
 
@@ -66,13 +61,12 @@ for each taxonomic level, e.g.::
 Sample data
 -----------
 
-The sample data file contains all of the information about the
-samples. In QIIME this file is called `Mapping File
-<http://qiime.org/tutorials/tutorial.html#mapping-file-tab-delimited-txt>`_.
-In micca, the sample data file must be a TAB-delimited text file (a
-row for each sample). The first column must be the sample identifier
-(assigned in :doc:`/commands/merge`, :doc:`/commands/split` or
-:doc:`/commands/mergepairs`)::
+The sample data file contains all of the information about the samples. In QIIME
+this file is called `Mapping File
+<http://qiime.org/tutorials/tutorial.html#mapping-file-tab-delimited-txt>`_. In
+micca, the sample data file must be a TAB-delimited text file (a row for each
+sample). The first column must be the sample identifier (assigned in
+:doc:`/commands/merge`, :doc:`/commands/split` or :doc:`/commands/mergepairs`)::
 
    ID    Group Altitude
    Mw_01 Mw1   492
@@ -87,3 +81,10 @@ Phylogenetic tree
 
 Only the `Newick format <https://en.wikipedia.org/wiki/Newick_format>`_ is
 supported.
+
+BIOM file
+---------
+
+The :doc:`/commands/tobiom` command generates OTU/SV tables in the BIOM version
+1.0 JSON file format
+(http://biom-format.org/documentation/format_versions/biom-1.0.html).
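
Note on the taxonomy formats in ``formats.rst``: the four accepted input row forms all carry the same lineage as the single output form returned by the classify command. The sketch below shows one way to reduce each input form to that output form. It is an illustration under stated assumptions, not micca's own implementation: the function name ``normalize_taxonomy`` is hypothetical, and the prefix pattern covers only the ``k__``-style and ``D_0__``-style prefixes that appear in the documented examples.

```python
import re

# Rank prefixes seen in the documented example rows:
# single-letter ("k__", "p__", ...) or SILVA-style ("D_0__", "D_1__", ...).
PREFIX = re.compile(r"^(?:[A-Za-z]|D_\d+)__")


def normalize_taxonomy(line):
    """Split a 'SEQID[TAB]lineage' row and return (seqid, cleaned_lineage).

    Rank prefixes are stripped and trailing empty ranks are dropped,
    so all four documented input forms yield the same lineage string.
    """
    seqid, _, lineage = line.rstrip("\n").partition("\t")
    ranks = [PREFIX.sub("", rank.strip()) for rank in lineage.split(";")]
    while ranks and not ranks[-1]:
        ranks.pop()  # remove empty ranks at the end of the lineage
    return seqid, ";".join(ranks)


# The four input forms listed in the documentation.
rows = [
    "SEQID\tk__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;",
    "SEQID\tBacteria;Firmicutes;Clostridia;Clostridiales;;;",
    "SEQID\tBacteria;Firmicutes;Clostridia;Clostridiales",
    "SEQID\tD_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;"
    "D_3__Clostridiales;D_4__;D_5__;",
]
results = [normalize_taxonomy(row) for row in rows]
for seqid, lineage in results:
    print(seqid, lineage, sep="\t")
```
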
diff --git a/doc/source/images/alpha454.png b/doc/source/images/alpha454.png
diff --git a/doc/source/images/beta454.png b/doc/source/images/beta454.png
diff --git a/doc/source/images/filterstatspaired.png b/doc/source/images/filterstatspaired.png
diff --git a/doc/source/images/garda_alpha.png b/doc/source/images/garda_alpha.png
diff --git a/doc/source/images/garda_beta.png b/doc/source/images/garda_beta.png
diff --git a/doc/source/images/garda_stats_plot.png b/doc/source/images/garda_stats_plot.png
diff --git a/doc/source/images/garda_stats_qualsumm_plot.png b/doc/source/images/garda_stats_qualsumm_plot.png
diff --git a/doc/source/images/garda_taxtable2.png b/doc/source/images/garda_taxtable2.png
diff --git a/doc/source/images/rarecurve.png b/doc/source/images/rarecurve.png
diff --git a/doc/source/images/taxtable.png b/doc/source/images/taxtable.png
diff --git a/doc/source/index.rst b/doc/source/index.rst
@@ -10,15 +10,15 @@
    :caption: Getting Started
 
    install
-   run
    databases
 
 .. toctree::
    :maxdepth: 1
    :caption: Tutorials
 
+   pairedend_97
+   denoising_illumina
    singleend
-   pairedend
    phyloseq
    table
    picrust
@@ -27,7 +27,6 @@
    :maxdepth: 1
    :caption: In Depth
 
-   filtering
    otu
    formats
    changes

diff --git a/doc/source/install.rst b/doc/source/install.rst
@@ -48,6 +48,8 @@ which all the software has already been installed, configured and tested.
 Using pip
 ---------
 
+At the moment, only Python 2.7 is supported.
+
 On Ubuntu >= 12.04 and Debian >=7
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -182,11 +184,9 @@ Testing the installation
 Install RDP classifier (optional)
 ---------------------------------
 
-The RDP Classifier is a naive bayesian classifier for
-taxonomic assignments
-(http://sourceforge.net/projects/rdp-classifier/). The RDP classifier
-can be used in the :doc:`/commands/classify` command (option
-``-m/--method rdp``).
+The RDP Classifier is a naive Bayesian classifier for taxonomic assignments
+(http://sourceforge.net/projects/rdp-classifier/). The RDP classifier can be
+used in the :doc:`/commands/classify` command (option ``-m/--method rdp``).
 
 .. warning::