GitHub - fangly/hitman: HIgh Throughput Microbial ANalysis

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
bpipe		bpipe
contrib		contrib
lib		lib
scripts		scripts
.gitignore		.gitignore
Changes		Changes
LICENSE		LICENSE
README		README

Repository files navigation

HiTMAn - High Throughput Microbial Analysis

Hitman is a pipeline for analysing 16S rRNA amplicon datasets. The default
values make it suitable for bacterial and archaeal profiling in 454 and Illumina
datasets.


INSTALLATION

Please install these dependencies first:
   perl
   Getopt::Euclid perl module
   IPC::System::Simple perl module
   bpipe 0.9.8.5
   pigz 2.1.6
   sff2fastq 0.8.0
   pear 0.9
   fastx_toolkit 0.0.13.2
   trimmomatic 0.32
   ea-utils 1.1.2
   acacia 1.50
   usearch v7
   emboss 6.6.0
   bio-community 0.001005
   bioperl

Then add the ./scripts/ and ./contrib/ folder into your PATH environment
variable.


DESCRIPTION

Hitman is a bioinformatic workflow based around the UPARSE methodology and
composed using bpipe. In brief, Hitman:
1/  demultiplexes libraries using EA-utils' fastq-multx tool
2/  joins read pairs with PEAR, but keeps the forward read when pairs cannot be
    joined
3/  clips sequencing adapters using TRIMMOMATIC
4/  truncates the 3' end of sequences at the first residue below a threshold
    quality Q value using TRIMMOMATIC
5/  trims the 3' end of all sequences to a target length L using TRIMMOMATIC, 
    discarding all smaller sequences
6/  removes sequences exceeding the maximum number of expected errors E using
    USEARCH's fastq_filter
7/  uses USEARCH's cluster_otus to form operational taxonomic units (OTUs) from
    high-fidelity sequences (stringent QA in steps 4 and 6) that are sorted by
    decreasing abundance, occur at least twice in the dataset and have at least
    O% similarity
8/  discards chimeric OTUs in a reference-independent using USEARCH's
    cluster_otus, and in a reference-based fashion using UCHIME with database C
9/  assigns regular-fidelity sequences (less stringent QA in steps 4 and 6) to
    each OTU using USEARCH's usearch_global
10/ formats the results in BIOM format using Bio-Community's bc_convert_files
11/ gives a taxonomic assignment to each OTU by globally aligning their reference
    sequences against a database T of reference sequences trimmed to the target
    region (keeping only the best-matching alignment with at least I% identity)
    using USEARCH's usearch_global
12/ removes OTUs belonging to specific taxa W using Bio-Community's
    bc_manage_samples
13/ rarefies the microbiome profiles at the given depth D with Bio-Community's
    bc_accumulate
14/ corrects gene-copy number (GCN) bias using CopyRighter.

Many of these steps are optional or accept user-specified parameters. The
output is a BIOM file containing OTU-level, rarefied, GCN-corrected microbial
profiles with taxonomic affiliations.


RUNNING HITMAN

Three types of input files are necessary:
1/ SFF or FASTQ files containing regular or paired-end amplicon reads (can be
   gzipped)
2/ FASTA file containing the PCR primers used
3/ If the sequencing run was 454-multiplexed, a 2-column tabular mapping files
   indicating the multiplex identifiers (MIDs) used for each sample

The pipeline itself comprises three steps:
1/ hitman_hire: Collect amplicon reads to analyze from SFF/FASTQ files.
2/ hitman_deploy: Quality-filter and trim the sequences.
3/ hitman_execute: Produce a rarefied OTU table with taxonomic assignments.

Run these scripts with the --man flag to learn more about what they do.

A typical 454 GS-FLX Ti processing looks like this:
   hitman_hire    --dir 01-hire/    --input Gasket67.sff --mapping Gasket67.mapping
   ln -s 01-hire/*cat_files.fastq seqs.fastq
   hitman_deploy  --dir 02-deploy/  --input seqs.fastq
   ln -s 02-deploy/*trunc_seqs.fna seqs_qa.fna
   hitman_execute --dir 03-execute/ --input seqs_qa.fna --primers primers.fna

For typical Illumina MiSeq processing (with paired-end reads and no mapping file),
we can skip the Acacia homopolymer denoising step and increase rarefaction depth
from 1,000 (by default) to 10,000 reads:
   hitman_hire    --dir 01-hire/    --five Sample1.R1.fasta.gz Sample2.R1.fasta.gz --three Sample1.R2.fasta.gz Sample2.R2.fasta.gz
   ln -s 01-hire/*cat_files.fastq seqs.fastq
   hitman_deploy  --dir 02-deploy/  --input seqs.fastq -p SKIP_DENOISING=1 -p LIB_ADAPTERS=/path/to/trimmomatic/adapters/NexteraPE-PE.fa
   ln -s 02-deploy/*trunc_seqs.fna seqs_qa.fna
   hitman_execute --dir 03-execute/ --input seqs_qa.fna --primers primers.fna -p RAREFACTION_DEPTH=10000

If you are feeling comfortable with the process, though, I advise running a four-step process:
1/ hitman_hire (as usual)
2/ hitman_deploy, using stringent quality cutoffs
   (these reads will be used to determine OTU representative sequences only)
3/ hitman_deploy, using regular quality cutoffs
   (these reads will be mapped to high-quality OTUs)
4/ hitman_execute, using --input and --input2 to pass high- and regular- quality reads

Note: Hitman relies quite heavily on file names and extensions to work. Make
sure your files are named properly.


COPYRIGHT

Copyright 2012-2014 Florent ANGLY <florent.angly@gmail.com>

Hitman is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your
option) any later version.

Hitman is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along
with Hitman. If not, see <http://www.gnu.org/licenses/>.