Mucobiome is a simple pipeline which aims to analyse 16S rRNA genomic data from high-throughput sequencers. Neither QIIME nor mothur is required. A few dependencies such as vsearch are necessary and are listed below. Mucobiome uses Snakemake as its backbone, which makes it possible to run the pipeline efficiently using multithreading.
The pipeline takes several paired-end FASTQ files as input (one per sample) and generates one BIOM file containing an OTU table with taxonomy and sample metadata.
- For each read pair:
  - Merge the FASTQ files using vsearch or FLASH
  - Quality-trim the FASTQ file using sickle
  - Reverse reads with seqtk
  - Trim adapters from reads using cutadapt
  - Dereplicate reads with vsearch
- Merge all read pairs into one FASTA file
- With the Greengenes database:
  - Extract the region of interest from the Greengenes 16S database using cutadapt and the user's primers
  - Make a taxonomy assignment using vsearch --usearch_global
  - Compare sequences from the merged file against the Greengenes database
- Create a BIOM file
  - Add taxonomy and sample metadata into the BIOM file
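The per-pair steps above can be sketched as a short shell script. This is a hedged reading aid, not the pipeline's exact invocations: the sample name `S1`, the primer sequence, and all file names are hypothetical, and the real rules live in the Snakefile. The sketch is written to a reference file so the tools themselves are not required:

```shell
# A hedged sketch of the per-pair steps, saved as a reference script.
# S1, the primer sequence, and file names are hypothetical; the real
# pipeline's exact flags may differ.
cat <<'EOF' > per_pair_sketch.sh
# 1. Merge the read pair
vsearch --fastq_mergepairs S1_1.fastq.gz --reverse S1_2.fastq.gz --fastqout S1.merged.fastq
# 2. Quality-trim the merged (now single-end) reads
sickle se -f S1.merged.fastq -t sanger -q 20 -o S1.clean.fastq
# 3. Reverse-complement the reads
seqtk seq -r S1.clean.fastq > S1.rev.fastq
# 4. Trim the primer (CCTACGGGNGGCWGCAG stands in for the real forward primer)
cutadapt -g CCTACGGGNGGCWGCAG -o S1.trimmed.fastq S1.rev.fastq
# 5. Dereplicate
vsearch --derep_fulllength S1.trimmed.fastq --sizeout --output S1.derep.fasta
EOF
```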
Mucobiome has been written with Python 3.4.
pip install -r requirements.txt
wget https://github.com/torognes/vsearch/archive/v2.3.4.tar.gz
tar xzf v2.3.4.tar.gz
cd vsearch-2.3.4
./autogen.sh
./configure
make
make install # as root or sudo make install
git clone https://github.com/lh3/seqtk.git
cd seqtk
make
sudo make install
git clone https://github.com/najoshi/sickle.git
cd sickle
make
sudo make install
Mucobiome works with Greengenes, but you can use another database as long as it follows the same format. Run download_greengene.sh from the database folder to download the Greengenes data.
cd database; sh download_greengene.sh
The repository contains a simple dataset. Try the following command, which executes nothing but displays every pipeline command (a dry run):
# you are in the main directory
snakemake -d working_directory -np --configfile config.yaml
Option | Description |
---|---|
-d | The working directory. All generated files will be dropped here |
-n | Do not execute any commands (dry run) |
-p | Print the shell commands |
--configfile | Specify which config file to use |
Input data are paired fastq.gz files. You must put all your data into the data/raw folder. Both paired files must follow the naming convention below, where {SAMPLE} is your sample name. Do not use the "." character in sample names; use alphanumeric characters only.
{SAMPLE}_1.fastq.gz
{SAMPLE}_2.fastq.gz
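For example, if a sequencer delivered files with an R1/R2 suffix (the names below are hypothetical), they could be renamed to match the convention:

```shell
# Hypothetical example: rename sequencer output (gutA_R1/gutA_R2) to the
# {SAMPLE}_1.fastq.gz / {SAMPLE}_2.fastq.gz convention under data/raw.
mkdir -p data/raw
touch gutA_R1.fastq.gz gutA_R2.fastq.gz   # stand-ins for real sequencer files
mv gutA_R1.fastq.gz data/raw/gutA_1.fastq.gz
mv gutA_R2.fastq.gz data/raw/gutA_2.fastq.gz
```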
The configuration file (config.yaml) contains all parameters required to perform an analysis.
Option | Description |
---|---|
raw_folder | The directory containing the input FASTQ files |
primer_forward | The forward primer. The default primers select the V3-V5 region |
reverse_primer | The reverse primer. The default primers select the V3-V5 region |
database_fasta | A FASTA file containing complete 16S sequences. By default it uses Greengenes |
database_taxonomy | A two-column file mapping FASTA header IDs from database_fasta to their taxonomy |
sample_data | The sample metadata file |
threshold | Taxonomy assignment threshold. By default 97% |
quality | Remove all reads below this quality threshold. You should set this value to 20 |
min_len | Minimum accepted read length |
max_len | Maximum accepted read length |
merge_tool | Use FLASH or vsearch to perform merging. By default vsearch is used |
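Put together, a config file could look like the following. The key names come from the table above; every value (primers, paths, lengths) is illustrative only:

```yaml
raw_folder: data/raw
primer_forward: CCTACGGGNGGCWGCAG       # illustrative forward primer
reverse_primer: GACTACHVGGGTATCTAATCC   # illustrative reverse primer
database_fasta: database/greengenes_16S.fasta        # illustrative path
database_taxonomy: database/greengenes_taxonomy.txt  # illustrative path
sample_data: sampledata.txt
threshold: 0.97
quality: 20
min_len: 200   # illustrative length bounds
max_len: 500
merge_tool: vsearch
```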
This file contains metadata for each sample. Fill it in with your own data.
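For instance, a tab-separated sample metadata file could look like this; the sample IDs match the {SAMPLE} part of the FASTQ file names, and every column beyond the first is hypothetical:

```
#SampleID	Treatment	Site
gutA	control	colon
gutB	antibiotic	colon
```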
Once all requirements are met, you can run one of the following commands:
# Run your pipeline using 60 threads
snakemake -d working_directory -p --configfile config.yaml --cores 60
# Force the pipeline to rebuild the last step
snakemake -d working_directory -fp --configfile config.yaml --cores 60
# Force the pipeline to rebuild everything
snakemake -d working_directory -Fp --configfile config.yaml --cores 60