Skip to content

ddkrutkin/ARGus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

ARGus

A Snakemake pipeline that goes from raw metagenomic FASTQs to per-sample abundance tables of antibiotic resistance genes (ARGs) and virulence factors (VFs).

Why this exists

Environmental metagenomes — wastewater, soil, rivers, the gut — are reservoirs of resistance genes and virulence traits that can move between organisms. Surveying which ARGs and VFs are present, in which microbes, and at what abundance across samples is the bread and butter of antimicrobial resistance (AMR) monitoring. Doing it well means stitching together a long sequence of bioinformatics tools, each with its own quirks: read QC, assembly, binning, quality assessment, taxonomic placement, gene prediction, BLASTP against curated databases, read mapping, coverage counting, and a couple of cleanup steps in between.

ARGus packages that whole chain as one reproducible Snakemake workflow. You point it at your FASTQs, fill in two database paths, and it produces a tidy set of feature tables you can drop straight into downstream analysis (a heatmap, a stats notebook, a manuscript).

Pipeline outline

fastp → fastq-pair → MEGAHIT → MaxBin2 → CheckM1 → GTDB-Tk
                                  ↓
                 Prodigal (binned + unbinned tracks)
                                  ↓
        BLASTP (CARD / VFDB) → Bowtie2 → CoverM
                                  ↓
              ARG / VF feature tables, MAG abundance

In words: reads are trimmed and re-paired (fastp, fastq-pair), co-assembled across samples with MEGAHIT, binned into MAGs by MaxBin2, then assessed for quality (CheckM1) and taxonomically classified (GTDB-Tk). Prodigal runs on two parallel tracks — single-genome mode on each bin, metagenomic mode on the unbinned contigs. The unbinned proteins are BLASTed against CARD's antibiotic inactivation + efflux subsets; the binned proteins against VFDB. For each hit, the matching nucleotide CDS is pulled out of Prodigal's .fna output, deduplicated by header (seqkit), indexed with bowtie2-build, and the original paired reads are mapped to it. CoverM then counts reads per CDS; those counts get merged with the BLAST annotation to give the per-sample feature tables. MultiQC rolls up the QC.

Quick start

You'll need Snakemake ≥ 9, conda or mamba (for the per-rule environments), and ~150 GB of disk free for the reference databases — most of which is GTDB-Tk.

git clone https://github.com/<your-username>/ARGus.git
cd ARGus

# 1. describe your samples
$EDITOR config/samples.csv          # one row per sample: sample_id, r1, r2, group

# 2. point at your reference databases (see the table below for what these are)
$EDITOR config/config.yaml

# 3. dry-run to see what will execute
snakemake -n --profile config/profiles/local

# 4. go
snakemake --profile config/profiles/local

A full run on three 5–10 GB FASTQ samples takes ~4–6 hours on a single 60-core node. MEGAHIT and the final GTDB-Tk classify step are the two longest individual jobs.

If anything fails partway through, just re-run the same command — the local profile sets rerun-incomplete: true, so Snakemake picks up where it stopped.

Reference databases

Two big ones you download once, two small ones the pipeline grabs for you on first run:

Database Manual or auto? Size
CheckM1 v1.2.3 data Manual (one-time download) ~1.4 GB
GTDB-Tk r232 reference bundle Manual (one-time download) ~110 GB
CARD (filtered to inactivation + efflux mechanisms) Auto on first run ~50 MB
VFDB Set B Auto on first run ~10 MB

CheckM1 and GTDB-Tk are too large to fetch silently. Download them once, keep them somewhere central, and point config/config.yaml at the directories:

# CheckM1 (~1.4 GB)
mkdir -p /path/to/checkm1_db && cd /path/to/checkm1_db
wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar xzf checkm_data_2015_01_16.tar.gz && rm checkm_data_2015_01_16.tar.gz

# GTDB-Tk r232 (~110 GB — takes a couple of hours)
mkdir -p /path/to/gtdbtk_r232 && cd /path/to/gtdbtk_r232
wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.gz
tar xzf gtdbtk_r232_data.tar.gz --strip 1 && rm gtdbtk_r232_data.tar.gz

CARD and VFDB are small enough that the pipeline downloads + indexes them itself the first time you run it (set their database_path: entries to null in the config to enable this; provide a path if you already have a formatted BLAST DB).

Outputs

Everything lands in results/. The interesting bits:

  • arg_analysis/<assembly_id>/arg_feature_table.tsv — deduplicated ARG hits with per-sample read counts. One row per CARD accession.
  • arg_analysis/<assembly_id>/top10_args.tsv — the ten most abundant ARGs by total reads.
  • vf_analysis/<assembly_id>/vf_feature_table.tsv — per-sample VF read counts, indexed by bin__protein_id.
  • vf_analysis/<assembly_id>/vf_taxonomy.tsv — VF count matrix with each column header annotated by its MAG's full GTDB taxonomy.
  • vf_analysis/<assembly_id>/vf_heatmap.html — interactive Plotly heatmap of VFs × MAGs. Open in any browser.
  • mag_abundance/<assembly_id>/mag_feature_table.csv — per-MAG relative abundance, mean coverage, and covered fraction across samples.
  • quality/checkm1/<assembly_id>/quality_report.tsv — CheckM1 completeness and contamination per bin.
  • taxonomic_classification/<assembly_id>/gtdbtk.bac120.summary.tsv — GTDB-Tk taxonomy assignments for bacterial MAGs.
  • reports/multiqc_report.html — MultiQC summary of fastp + bowtie2 logs.

Per-rule logs (results/logs/<stage>/) and benchmarks (results/benchmarks/<stage>/) are kept alongside, which is handy when something misbehaves and you want to know which rule and how much memory it used.

Configuration knobs worth knowing

A few options in config/config.yaml you'll likely want to touch:

  • assembly.modeco_assembly (default) for cross-sample comparisons, or single_sample for one assembly per sample.
  • assembly.assemblermegahit (default) or metaspades.
  • MEGAHIT k-mer range — the default narrows MEGAHIT to k = 21, 31, 41 (instead of the standard 21→141 sweep) so a run finishes in hours, not days. Set kmin/kmax/kstep to null to use MEGAHIT's full default range.
  • arg_analysis.bowtie2_very_sensitive / vf_analysis.bowtie2_very_sensitive — set true to pass --very-sensitive for the read-mapping steps. Slower, but better recall on divergent reads.
  • resources.max_threads / resources.max_memory_gb — caps applied to per-rule resource scaling. The cores: line in config/profiles/local/config.yaml is a separate parallel-job budget; usually you want them equal.

Cluster execution

Profiles for SGE and SLURM are in config/profiles/. Install the matching executor plugin first:

pip install snakemake-executor-plugin-cluster-generic   # SGE
pip install snakemake-executor-plugin-slurm             # SLURM

snakemake --profile config/profiles/sge
snakemake --profile config/profiles/slurm

You'll likely want to edit those YAMLs for your cluster's partition name, account, and concurrent-job limits.

Troubleshooting

A few gotchas we've already hit so they don't bite you twice:

  • CheckM1 fails with ModuleNotFoundError: pkg_resources — already pinned in workflow/envs/checkm1.yaml via setuptools<81. If you swap envs around, this can resurface; the fix is the pin.
  • GTDB-Tk crashes with unrecognized arguments: --skip_ani_screen — that flag was removed in GTDB-Tk 2.7. If you pin an older version (2.4–2.6), add the flag back to the gtdbtk_classify rule in workflow/rules/taxonomic_classification.smk.
  • GTDB-Tk summary column lookup fails — the column is user_genome (singular, no _id); pandas scripts must key on that exact name.
  • CoverM column-name mismatch — CoverM 0.7.0 produces columns shaped <sample>.sorted Mean (strips .bam). If you upgrade or downgrade CoverM, the extract_per_sample_abundance grep pattern in workflow/rules/depth.smk may need adjusting.

What's under the hood

Stage Tool Pinned version
Read QC fastp 1.0.1
Re-pair fastq-pair 1.0
Assembly MEGAHIT 1.2.9
Binning MaxBin2 2.2.7
Bin quality CheckM1 1.2.3
Taxonomy GTDB-Tk 2.7.1 (r232 data)
Gene prediction Prodigal 2.6.3
BLASTP NCBI BLAST+ 2.16.0
FASTA cleanup seqkit 2.9.0
Read mapping Bowtie2 + samtools 2.5.4 / 1.22.1
Coverage / counts CoverM 0.7.0
QC aggregation MultiQC latest

References

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages