ARGus

A Snakemake pipeline that goes from raw metagenomic FASTQs to per-sample abundance tables of antibiotic resistance genes (ARGs) and virulence factors (VFs).

Why this exists

Environmental metagenomes — wastewater, soil, rivers, the gut — are reservoirs of resistance genes and virulence traits that can move between organisms. Surveying which ARGs and VFs are present, in which microbes, and at what abundance across samples is the bread and butter of antimicrobial resistance (AMR) monitoring. Doing it well means stitching together a long sequence of bioinformatics tools, each with its own quirks: read QC, assembly, binning, quality assessment, taxonomic placement, gene prediction, BLASTP against curated databases, read mapping, coverage counting, and a couple of cleanup steps in between.

ARGus packages that whole chain as one reproducible Snakemake workflow. You point it at your FASTQs, fill in two database paths, and it produces a tidy set of feature tables you can drop straight into downstream analysis (a heatmap, a stats notebook, a manuscript).

Pipeline outline

fastp → fastq-pair → MEGAHIT → MaxBin2 → CheckM1 → GTDB-Tk
                                  ↓
                 Prodigal (binned + unbinned tracks)
                                  ↓
        BLASTP (CARD / VFDB) → Bowtie2 → CoverM
                                  ↓
              ARG / VF feature tables, MAG abundance

In words: reads are trimmed and re-paired (fastp, fastq-pair), co-assembled across samples with MEGAHIT, binned into MAGs by MaxBin2, then assessed for quality (CheckM1) and taxonomically classified (GTDB-Tk). Prodigal runs on two parallel tracks — single-genome mode on each bin, metagenomic mode on the unbinned contigs. The unbinned proteins are BLASTed against CARD's antibiotic inactivation + efflux subsets; the binned proteins against VFDB. For each hit, the matching nucleotide CDS is pulled out of Prodigal's .fna output, deduplicated by header (seqkit), indexed with bowtie2-build, and the original paired reads are mapped to it. CoverM then counts reads per CDS; those counts get merged with the BLAST annotation to give the per-sample feature tables. MultiQC rolls up the QC.

Quick start

You'll need Snakemake ≥ 9, conda or mamba (for the per-rule environments), and ~150 GB of disk free for the reference databases — most of which is GTDB-Tk.

git clone https://github.com/<your-username>/ARGus.git
cd ARGus

# 1. describe your samples
$EDITOR config/samples.csv          # one row per sample: sample_id, r1, r2, group

# 2. point at your reference databases (see the table below for what these are)
$EDITOR config/config.yaml

# 3. dry-run to see what will execute
snakemake -n --profile config/profiles/local

# 4. go
snakemake --profile config/profiles/local

A full run on three 5–10 GB FASTQ samples takes ~4–6 hours on a single 60-core node. MEGAHIT and the final GTDB-Tk classify step are the two longest individual jobs.

If anything fails partway through, just re-run the same command — the local profile sets rerun-incomplete: true, so Snakemake picks up where it stopped.

Reference databases

Two big ones you download once, two small ones the pipeline grabs for you on first run:

Database	Manual or auto?	Size
CheckM1 v1.2.3 data	Manual (one-time download)	~1.4 GB
GTDB-Tk r232 reference bundle	Manual (one-time download)	~110 GB
CARD (filtered to inactivation + efflux mechanisms)	Auto on first run	~50 MB
VFDB Set B	Auto on first run	~10 MB

CheckM1 and GTDB-Tk are too large to fetch silently. Download them once, keep them somewhere central, and point config/config.yaml at the directories:

# CheckM1 (~1.4 GB)
mkdir -p /path/to/checkm1_db && cd /path/to/checkm1_db
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar xzf checkm_data_2015_01_16.tar.gz && rm checkm_data_2015_01_16.tar.gz

# GTDB-Tk r232 (~110 GB — takes a couple of hours)
mkdir -p /path/to/gtdbtk_r232 && cd /path/to/gtdbtk_r232
wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.gz
tar xzf gtdbtk_r232_data.tar.gz --strip 1 && rm gtdbtk_r232_data.tar.gz

CARD and VFDB are small enough that the pipeline downloads + indexes them itself the first time you run it (set their database_path: entries to null in the config to enable this; provide a path if you already have a formatted BLAST DB).

Outputs

Everything lands in results/. The interesting bits:

arg_analysis/<assembly_id>/arg_feature_table.tsv — deduplicated ARG hits with per-sample read counts. One row per CARD accession.
arg_analysis/<assembly_id>/top10_args.tsv — the ten most abundant ARGs by total reads.
vf_analysis/<assembly_id>/vf_feature_table.tsv — per-sample VF read counts, indexed by bin__protein_id.
vf_analysis/<assembly_id>/vf_taxonomy.tsv — VF count matrix with each column header annotated by its MAG's full GTDB taxonomy.
vf_analysis/<assembly_id>/vf_heatmap.html — interactive Plotly heatmap of VFs × MAGs. Open in any browser.
mag_abundance/<assembly_id>/mag_feature_table.csv — per-MAG relative abundance, mean coverage, and covered fraction across samples.
quality/checkm1/<assembly_id>/quality_report.tsv — CheckM1 completeness and contamination per bin.
taxonomic_classification/<assembly_id>/gtdbtk.bac120.summary.tsv — GTDB-Tk taxonomy assignments for bacterial MAGs.
reports/multiqc_report.html — MultiQC summary of fastp + bowtie2 logs.

Per-rule logs (results/logs/<stage>/) and benchmarks (results/benchmarks/<stage>/) are kept alongside, which is handy when something misbehaves and you want to know which rule and how much memory it used.

Configuration knobs worth knowing

A few options in config/config.yaml you'll likely want to touch:

assembly.mode — co_assembly (default) for cross-sample comparisons, or single_sample for one assembly per sample.
assembly.assembler — megahit (default) or metaspades.
MEGAHIT k-mer range — the default narrows MEGAHIT to k = 21, 31, 41 (instead of the standard 21→141 sweep) so a run finishes in hours, not days. Set kmin/kmax/kstep to null to use MEGAHIT's full default range.
arg_analysis.bowtie2_very_sensitive / vf_analysis.bowtie2_very_sensitive — set true to pass --very-sensitive for the read-mapping steps. Slower, but better recall on divergent reads.
resources.max_threads / resources.max_memory_gb — caps applied to per-rule resource scaling. The cores: line in config/profiles/local/config.yaml is a separate parallel-job budget; usually you want them equal.

Cluster execution

Profiles for SGE and SLURM are in config/profiles/. Install the matching executor plugin first:

pip install snakemake-executor-plugin-cluster-generic   # SGE
pip install snakemake-executor-plugin-slurm             # SLURM

snakemake --profile config/profiles/sge
snakemake --profile config/profiles/slurm

You'll likely want to edit those YAMLs for your cluster's partition name, account, and concurrent-job limits.

Troubleshooting

A few gotchas we've already hit so they don't bite you twice:

CheckM1 fails with ModuleNotFoundError: pkg_resources — already pinned in workflow/envs/checkm1.yaml via setuptools<81. If you swap envs around, this can resurface; the fix is the pin.
GTDB-Tk crashes with unrecognized arguments: --skip_ani_screen — that flag was removed in GTDB-Tk 2.7. If you pin an older version (2.4–2.6), add the flag back to the gtdbtk_classify rule in workflow/rules/taxonomic_classification.smk.
GTDB-Tk summary column lookup fails — the column is user_genome (singular, no _id); pandas scripts must key on that exact name.
CoverM column-name mismatch — CoverM 0.7.0 produces columns shaped <sample>.sorted Mean (strips .bam). If you upgrade or downgrade CoverM, the extract_per_sample_abundance grep pattern in workflow/rules/depth.smk may need adjusting.

What's under the hood

Stage	Tool	Pinned version
Read QC	fastp	1.0.1
Re-pair	fastq-pair	1.0
Assembly	MEGAHIT	1.2.9
Binning	MaxBin2	2.2.7
Bin quality	CheckM1	1.2.3
Taxonomy	GTDB-Tk	2.7.1 (r232 data)
Gene prediction	Prodigal	2.6.3
BLASTP	NCBI BLAST+	2.16.0
FASTA cleanup	seqkit	2.9.0
Read mapping	Bowtie2 + samtools	2.5.4 / 1.22.1
Coverage / counts	CoverM	0.7.0
QC aggregation	MultiQC	latest

References

CARD (Comprehensive Antibiotic Resistance Database) — McMaster University. https://card.mcmaster.ca
VFDB (Virulence Factor Database) — Chen et al. http://www.mgc.ac.cn/VFs/
GTDB-Tk — Chaumeil et al. (2022). https://github.com/Ecogenomics/GTDBTk
CheckM1 — Parks et al. (2015). https://ecogenomics.github.io/CheckM/
MEGAHIT — Li et al. (2015). https://github.com/voutcn/megahit
MaxBin2 — Wu et al. (2016). https://sourceforge.net/projects/maxbin2/
Snakemake — Köster & Rahmann (2012). https://snakemake.readthedocs.io

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
config		config
workflow		workflow
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ARGus

Why this exists

Pipeline outline

Quick start

Reference databases

Outputs

Configuration knobs worth knowing

Cluster execution

Troubleshooting

What's under the hood

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ARGus

Why this exists

Pipeline outline

Quick start

Reference databases

Outputs

Configuration knobs worth knowing

Cluster execution

Troubleshooting

What's under the hood

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages