Snakemake workflow to identify novel microbial species from a set of genomes.
Genomes are first quality-filtered using their CheckM statistics, then compared against a genome database using Mash and MUMmer. Unknown hits are extracted, clustered at the species level using dRep, and further quality-controlled with GUNC.
git clone https://github.com/alexmsalmeida/magscreen.git
- Edit config.yml with the selected input, output and databases arguments. The input should point to the directory containing the .fa assemblies to analyse and to the .csv file with CheckM completeness and contamination scores. The databases folder should contain the GUNC diamond database and a custom Mash database (.msh) with the genomes you want to screen against.
- (option 1) Run the pipeline locally (adjust -j based on the number of available cores)
snakemake --use-conda -k -j 4
- (option 2) Run the pipeline on a cluster (e.g., SLURM)
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A ALMEIDA-SL3-CPU -p icelake-himem --time=12:00:00 --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output}'
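Before launching either option, it can help to confirm that every .fa assembly in the input directory has a row in the CheckM .csv file. The following is a minimal sketch and not part of the pipeline: the function name `missing_checkm_entries` and the `genome` column header are assumptions made for illustration, so adjust the column name to match the header of your own CheckM file.

```python
import csv
from pathlib import Path


def missing_checkm_entries(genome_dir, checkm_csv, id_column="genome"):
    """Return the .fa file names in genome_dir that have no row in the CheckM CSV.

    id_column is an assumed header name (hypothetical); change it to match
    the column that holds the genome file name in your CSV.
    """
    # File names of every assembly found in the input directory
    fasta_names = sorted(p.name for p in Path(genome_dir).glob("*.fa"))

    with open(checkm_csv, newline="") as handle:
        # Collect the genome names listed in the CheckM table
        listed = {row[id_column] for row in csv.DictReader(handle)}

    # Assemblies present on disk but absent from the table
    return [name for name in fasta_names if name not in listed]
```

An empty list means every assembly has CheckM scores; any names returned are assemblies that are missing from the table.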
The main output is located in the directory new_species/, which contains the best-quality representative genomes (.fa files) of each new species. New species matching all of the following criteria are filtered out:
- Flagged by GUNC: clade_separation_score >0.45; contamination_portion >0.05; reference_representation_score >0.5
- Are singletons (dRep clusters with only one member)
- Are <90% complete based on CheckM
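The filtering rule above can be sketched as a small predicate. This is an illustration of the stated criteria, not the pipeline's own code. The `Candidate` record and the function names are hypothetical. The sketch rests on two assumptions: a genome counts as flagged by GUNC only when all three listed thresholds are exceeded, and "matching all" is read literally, so a species is removed only when the three criteria hold together.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record; field names mirror the criteria listed above
    name: str
    clade_separation_score: float
    contamination_portion: float
    reference_representation_score: float
    cluster_size: int      # number of genomes in the dRep cluster
    completeness: float    # CheckM completeness, in percent


def gunc_flagged(c):
    """True when all three GUNC thresholds are exceeded (assumed reading)."""
    return (
        c.clade_separation_score > 0.45
        and c.contamination_portion > 0.05
        and c.reference_representation_score > 0.5
    )


def is_filtered_out(c):
    """True when the candidate meets all three exclusion criteria at once."""
    is_singleton = c.cluster_size == 1
    is_incomplete = c.completeness < 90.0
    return gunc_flagged(c) and is_singleton and is_incomplete
```

Under this reading, a GUNC-flagged singleton that is 85% complete is removed, while the same genome in a cluster of three members is kept.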