A Snakemake pipeline that goes from raw metagenomic FASTQs to per-sample abundance tables of antibiotic resistance genes (ARGs) and virulence factors (VFs).
Environmental metagenomes — wastewater, soil, rivers, the gut — are reservoirs of resistance genes and virulence traits that can move between organisms. Surveying which ARGs and VFs are present, in which microbes, and at what abundance across samples is the bread and butter of antimicrobial resistance (AMR) monitoring. Doing it well means stitching together a long sequence of bioinformatics tools, each with its own quirks: read QC, assembly, binning, quality assessment, taxonomic placement, gene prediction, BLASTP against curated databases, read mapping, coverage counting, and a couple of cleanup steps in between.
ARGus packages that whole chain as one reproducible Snakemake workflow. You point it at your FASTQs, fill in two database paths, and it produces a tidy set of feature tables you can drop straight into downstream analysis (a heatmap, a stats notebook, a manuscript).
fastp → fastq-pair → MEGAHIT → MaxBin2 → CheckM1 → GTDB-Tk
↓
Prodigal (binned + unbinned tracks)
↓
BLASTP (CARD / VFDB) → Bowtie2 → CoverM
↓
ARG / VF feature tables, MAG abundance
In words: reads are trimmed and re-paired (fastp, fastq-pair), co-assembled
across samples with MEGAHIT, binned into MAGs by MaxBin2, then assessed for
quality (CheckM1) and taxonomically classified (GTDB-Tk). Prodigal runs on
two parallel tracks — single-genome mode on each bin, metagenomic mode on
the unbinned contigs. The unbinned proteins are BLASTed against CARD's
antibiotic inactivation + efflux subsets; the binned proteins against VFDB.
For each hit, the matching nucleotide CDS is pulled out of Prodigal's .fna
output, deduplicated by header (seqkit), indexed with bowtie2-build, and the
original paired reads are mapped to it. CoverM then counts reads per CDS;
those counts get merged with the BLAST annotation to give the per-sample
feature tables. MultiQC rolls up the QC.
You'll need Snakemake ≥ 9, conda or mamba (for the per-rule environments), and ~150 GB of disk free for the reference databases — most of which is GTDB-Tk.
git clone https://github.com/<your-username>/ARGus.git
cd ARGus
# 1. describe your samples
$EDITOR config/samples.csv # one row per sample: sample_id, r1, r2, group
# 2. point at your reference databases (see the table below for what these are)
$EDITOR config/config.yaml
# 3. dry-run to see what will execute
snakemake -n --profile config/profiles/local
# 4. go
snakemake --profile config/profiles/localA full run on three 5–10 GB FASTQ samples takes ~4–6 hours on a single 60-core node. MEGAHIT and the final GTDB-Tk classify step are the two longest individual jobs.
If anything fails partway through, just re-run the same command — the local
profile sets rerun-incomplete: true, so Snakemake picks up where it stopped.
Two big ones you download once, two small ones the pipeline grabs for you on first run:
| Database | Manual or auto? | Size |
|---|---|---|
| CheckM1 v1.2.3 data | Manual (one-time download) | ~1.4 GB |
| GTDB-Tk r232 reference bundle | Manual (one-time download) | ~110 GB |
| CARD (filtered to inactivation + efflux mechanisms) | Auto on first run | ~50 MB |
| VFDB Set B | Auto on first run | ~10 MB |
CheckM1 and GTDB-Tk are too large to fetch silently. Download them once,
keep them somewhere central, and point config/config.yaml at the
directories:
# CheckM1 (~1.4 GB)
mkdir -p /path/to/checkm1_db && cd /path/to/checkm1_db
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar xzf checkm_data_2015_01_16.tar.gz && rm checkm_data_2015_01_16.tar.gz
# GTDB-Tk r232 (~110 GB — takes a couple of hours)
mkdir -p /path/to/gtdbtk_r232 && cd /path/to/gtdbtk_r232
wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.gz
tar xzf gtdbtk_r232_data.tar.gz --strip 1 && rm gtdbtk_r232_data.tar.gzCARD and VFDB are small enough that the pipeline downloads + indexes them
itself the first time you run it (set their database_path: entries to
null in the config to enable this; provide a path if you already have a
formatted BLAST DB).
Everything lands in results/. The interesting bits:
arg_analysis/<assembly_id>/arg_feature_table.tsv— deduplicated ARG hits with per-sample read counts. One row per CARD accession.arg_analysis/<assembly_id>/top10_args.tsv— the ten most abundant ARGs by total reads.vf_analysis/<assembly_id>/vf_feature_table.tsv— per-sample VF read counts, indexed bybin__protein_id.vf_analysis/<assembly_id>/vf_taxonomy.tsv— VF count matrix with each column header annotated by its MAG's full GTDB taxonomy.vf_analysis/<assembly_id>/vf_heatmap.html— interactive Plotly heatmap of VFs × MAGs. Open in any browser.mag_abundance/<assembly_id>/mag_feature_table.csv— per-MAG relative abundance, mean coverage, and covered fraction across samples.quality/checkm1/<assembly_id>/quality_report.tsv— CheckM1 completeness and contamination per bin.taxonomic_classification/<assembly_id>/gtdbtk.bac120.summary.tsv— GTDB-Tk taxonomy assignments for bacterial MAGs.reports/multiqc_report.html— MultiQC summary of fastp + bowtie2 logs.
Per-rule logs (results/logs/<stage>/) and benchmarks (results/benchmarks/<stage>/)
are kept alongside, which is handy when something misbehaves and you want to
know which rule and how much memory it used.
A few options in config/config.yaml you'll likely want to touch:
assembly.mode—co_assembly(default) for cross-sample comparisons, orsingle_samplefor one assembly per sample.assembly.assembler—megahit(default) ormetaspades.- MEGAHIT k-mer range — the default narrows MEGAHIT to
k = 21, 31, 41(instead of the standard 21→141 sweep) so a run finishes in hours, not days. Setkmin/kmax/ksteptonullto use MEGAHIT's full default range. arg_analysis.bowtie2_very_sensitive/vf_analysis.bowtie2_very_sensitive— settrueto pass--very-sensitivefor the read-mapping steps. Slower, but better recall on divergent reads.resources.max_threads/resources.max_memory_gb— caps applied to per-rule resource scaling. Thecores:line inconfig/profiles/local/config.yamlis a separate parallel-job budget; usually you want them equal.
Profiles for SGE and SLURM are in config/profiles/. Install the matching
executor plugin first:
pip install snakemake-executor-plugin-cluster-generic # SGE
pip install snakemake-executor-plugin-slurm # SLURM
snakemake --profile config/profiles/sge
snakemake --profile config/profiles/slurmYou'll likely want to edit those YAMLs for your cluster's partition name, account, and concurrent-job limits.
A few gotchas we've already hit so they don't bite you twice:
- CheckM1 fails with
ModuleNotFoundError: pkg_resources— already pinned inworkflow/envs/checkm1.yamlviasetuptools<81. If you swap envs around, this can resurface; the fix is the pin. - GTDB-Tk crashes with
unrecognized arguments: --skip_ani_screen— that flag was removed in GTDB-Tk 2.7. If you pin an older version (2.4–2.6), add the flag back to thegtdbtk_classifyrule inworkflow/rules/taxonomic_classification.smk. - GTDB-Tk summary column lookup fails — the column is
user_genome(singular, no_id); pandas scripts must key on that exact name. - CoverM column-name mismatch — CoverM 0.7.0 produces columns shaped
<sample>.sorted Mean(strips.bam). If you upgrade or downgrade CoverM, theextract_per_sample_abundancegrep pattern inworkflow/rules/depth.smkmay need adjusting.
| Stage | Tool | Pinned version |
|---|---|---|
| Read QC | fastp | 1.0.1 |
| Re-pair | fastq-pair | 1.0 |
| Assembly | MEGAHIT | 1.2.9 |
| Binning | MaxBin2 | 2.2.7 |
| Bin quality | CheckM1 | 1.2.3 |
| Taxonomy | GTDB-Tk | 2.7.1 (r232 data) |
| Gene prediction | Prodigal | 2.6.3 |
| BLASTP | NCBI BLAST+ | 2.16.0 |
| FASTA cleanup | seqkit | 2.9.0 |
| Read mapping | Bowtie2 + samtools | 2.5.4 / 1.22.1 |
| Coverage / counts | CoverM | 0.7.0 |
| QC aggregation | MultiQC | latest |
- CARD (Comprehensive Antibiotic Resistance Database) — McMaster University. https://card.mcmaster.ca
- VFDB (Virulence Factor Database) — Chen et al. http://www.mgc.ac.cn/VFs/
- GTDB-Tk — Chaumeil et al. (2022). https://github.com/Ecogenomics/GTDBTk
- CheckM1 — Parks et al. (2015). https://ecogenomics.github.io/CheckM/
- MEGAHIT — Li et al. (2015). https://github.com/voutcn/megahit
- MaxBin2 — Wu et al. (2016). https://sourceforge.net/projects/maxbin2/
- Snakemake — Köster & Rahmann (2012). https://snakemake.readthedocs.io