Mucobiome

Mucobiome is a simple pipeline for analysing 16S rRNA sequencing data from high-throughput sequencers. Neither QIIME nor mothur is required. Only a few dependencies, such as vsearch, are needed; they are listed below. Mucobiome uses Snakemake as its backbone, which lets the pipeline run efficiently with multithreading.

How does it work? Paired FASTQ files -> BIOM

The pipeline takes several paired-end FASTQ files as input (one pair per sample) and generates one BIOM file containing the OTU table with taxonomy and sample metadata.

  • For each read pair:
      • Merge the FASTQ files using vsearch or FLASH
      • Clean the FASTQ file using sickle
      • Reverse reads with seqtk
      • Trim adapters from reads using cutadapt
      • Dereplicate reads with vsearch
  • Merge all read pairs into one FASTA file
  • With the Greengenes database:
      • Extract the region of interest from the Greengenes 16S database using cutadapt and the user's adapters
  • Make a taxonomy assignment using vsearch --usearch_global
      • Compare sequences from the merged file against the Greengenes database
  • Create a BIOM file
      • Add taxonomy and sample metadata to the BIOM file
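
To make the dereplication step concrete, here is a minimal Python sketch of what dereplication means: identical reads are collapsed into one entry with an abundance count. It is only an illustration and not the pipeline's code (Mucobiome delegates this step to vsearch). The function name `dereplicate` and the case-insensitive comparison are assumptions made for the example.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical reads into (sequence, count) pairs.

    Illustrative only: the pipeline itself relies on vsearch for this step.
    Assumption: reads are compared case-insensitively.
    """
    counts = Counter(seq.upper() for seq in sequences)
    # most_common() orders entries by decreasing abundance
    return counts.most_common()


if __name__ == "__main__":
    reads = ["ACGT", "acgt", "TTGA", "ACGT"]
    for seq, size in dereplicate(reads):
        print(f"{seq}\tsize={size}")
```

Collapsing duplicates before the database search means each distinct sequence only has to be compared once, while the count preserves its abundance for the OTU table.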


Python dependencies

Mucobiome has been written with Python 3.4.

pip install -r requirements.txt 

Install vsearch

tar xzf v2.3.4.tar.gz
cd vsearch-2.3.4
make install  # as root or sudo make install

Install seqtk

git clone;
cd seqtk;
sudo make install

Install sickle

cd sickle
sudo make install

Download Database

Mucobiome works with Greengenes, but you can use another database as long as it follows the same format. To download the Greengenes data, run the download script from the database folder.

 cd database; sh 


Test your installation

The repository contains a small test dataset. Try the following command, which performs a dry run: it prints every pipeline command without executing anything.

# you are in the main directory 
snakemake -d working_directory -np --configfile config.yaml

-d            The working directory. All generated files are written here.
-n            Don't execute any commands.
-p            Print commands.
--configfile  Which config file to use.

Input file

Input data are paired fastq.gz files. Put all your data in the data/raw folder. Both files of a pair must respect the following syntax, where {Sample} is your sample name. Do not use the "." character in sample names; use alphanumeric characters only.
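
As a quick sanity check before running the pipeline, sample names can be validated against the rule above. The sketch below is not part of Mucobiome; the helper name `is_valid_sample_name` is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits only, so underscores and dashes are rejected as well.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits only.
SAMPLE_NAME = re.compile(r"[A-Za-z0-9]+")


def is_valid_sample_name(name):
    """Return True if the sample name contains only letters and digits."""
    return SAMPLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["sample01", "sample.01", "patient_A"]:
        print(name, is_valid_sample_name(name))
```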



This file contains all parameters required to perform an analysis.

raw_folder         The directory which contains the fastq input files
primer_forward     The forward primer. By default, the primers select the V3-V5 region
reverse_primer     The reverse primer. By default, the primers select the V3-V5 region
database_fasta     A fasta file which contains complete 16S sequences. By default, Greengenes is used
database_taxonomy  A two-column file: the fasta header ID from database_fasta and the taxonomy
sample_data        Sample metadata
threshold          Taxonomy assignment threshold. By default 97%
quality            Remove all reads below this quality threshold. You should set this value to 20
min_len            Minimum read length accepted
max_len            Maximum read length accepted
merge_tool         Use FLASH or vsearch to perform merging. By default, vsearch is used
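
Before launching a run, it can help to check that a configuration defines every parameter listed above. The following Python sketch is an illustration only: the helper `missing_keys` is hypothetical, and it assumes every listed parameter must be present, whereas the pipeline may fall back to defaults for some of them.

```python
# Parameter names taken from the list above.
REQUIRED_KEYS = [
    "raw_folder",
    "primer_forward",
    "reverse_primer",
    "database_fasta",
    "database_taxonomy",
    "sample_data",
    "threshold",
    "quality",
    "min_len",
    "max_len",
    "merge_tool",
]


def missing_keys(config):
    """Return the listed parameters that are absent from a config mapping."""
    return [key for key in REQUIRED_KEYS if key not in config]


if __name__ == "__main__":
    # A deliberately incomplete configuration, for illustration.
    partial = {"raw_folder": "data/raw", "quality": 20, "merge_tool": "vsearch"}
    print("Missing parameters:", ", ".join(missing_keys(partial)))
```

The config mapping could come from any YAML loader; the check itself only needs a dictionary.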


This file contains metadata for each sample. Fill it with your own data.

Run your experiment

Once all requirements are met, you can run one of the following commands:

   # Run your pipeline using 60 threads
   snakemake -d working_directory -p --configfile config.yaml --cores 60
   # Force the pipeline to rebuild the last step
   snakemake -d working_directory -fp --configfile config.yaml --cores 60
   # Force the pipeline to rebuild everything
   snakemake -d working_directory -Fp --configfile config.yaml --cores 60
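
If you launch the pipeline from a script, the commands above can be assembled programmatically. This is a minimal sketch under stated assumptions: the helper `snakemake_command` is hypothetical and not shipped with Mucobiome, and it only uses the options already shown in this README (-d, -n, -p, --configfile, --cores).

```python
import shlex


def snakemake_command(workdir, configfile, cores=None, dry_run=False):
    """Build the snakemake argument list used in this README."""
    cmd = ["snakemake", "-d", workdir]
    # -n (dry run) is combined with -p (print commands) as in the test command
    cmd.append("-np" if dry_run else "-p")
    cmd += ["--configfile", configfile]
    if cores is not None:
        cmd += ["--cores", str(cores)]
    return cmd


if __name__ == "__main__":
    cmd = snakemake_command("working_directory", "config.yaml", cores=60)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to actually start the run; printing it first mirrors the dry-run habit recommended above.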


