ITS subregion extraction for fungal metabarcoding at long-read scale.
As long-read amplicon sequencing (Oxford Nanopore and PacBio HiFi) becomes routine, extracting ITS subregions (ITS1, 5.8S, ITS2, full ITS) reliably at scale can become a throughput and robustness bottleneck. ITSxRust is a Rust-based ITS extractor that follows the standard approach of locating conserved ribosomal flanks using profile-HMMs (via HMMER), while adding long-read–oriented features for reproducible, high-throughput processing.
- HMMER-based boundary detection — locates conserved ribosomal flanks (SSU, 5.8S, LSU) using
nhmmerprofile-HMM searches to delimit ITS subregions - Platform presets —
--preset ont(tolerant E-values, wider length constraints) and--preset hifi(strict thresholds); explicit flags override any preset value - Partial-chain fallback — when the full 4-anchor chain (SSU→5.8S_start→5.8S_end→LSU) is unavailable, recovers subregions from 2-anchor pairs (e.g., SSU+5.8S_start for ITS1)
- Confidence classification — each extracted read is labelled
confident,ambiguous, orpartialbased on per-anchor score/E-value thresholds; ambiguous reads can be diverted to a separate file with--write-ambiguous - Exact dereplication —
--derephashes identical sequences and searches only unique representatives, projecting results back to duplicates - Structured QC output —
--qc-jsonemits a per-sample JSON summary (read counts, skip-reason breakdown, parameters) suitable for MultiQC ingestion;--anchors-tsv/--anchors-jsonlexport per-read anchor coordinates and confidence labels - Multi-region output — extract ITS1, ITS2, full ITS, or all three simultaneously (
--region all) - FASTA and FASTQ support — reads gzipped or uncompressed inputs; preserves quality scores when outputting FASTQ
itsxrust extract \
--input reads.fastq.gz \
--hmm F.hmm \
--output its2_extracted.fasta \
--region its2 \
--preset ont \
--hmmer-cpu 8Expected output:
Using preset: ont
Params: in=Fastq out=Fasta region=its2 max_per_anchor=10 derep=false inc_e=1.0e-3 ...
tblout hits parsed: 142389 | anchor hits: 131052 | stored(topK): 98412 | reads w/anchor hits: 48231
Reads with computed bounds: 45102 (full-chain: 40811, confident: 38500, ambiguous: 2311, partial: 4291)
Wrote output: its2_extracted.fasta
Kept: 45102 (partial: 4291) Ambiguous (separate): 0 Skipped: 3129
To extract ITS1, ITS2, and full ITS simultaneously, use --region all. In this mode, --output is treated as a prefix:
itsxrust extract \
--input reads.fastq.gz \
--hmm F.hmm \
--output results/sample1 \
--region all \
--preset ont \
--hmmer-cpu 8This produces results/sample1.its1.fasta, results/sample1.its2.fasta, and results/sample1.full.fasta.
Download the binary for your platform from GitHub Releases, then:
chmod +x itsxrust
./itsxrust --helpThe Docker image bundles HMMER, so nhmmer is available out of the box:
docker run --rm -v $(pwd):/data ghcr.io/ayobi/itsxrust:latest \
extract --input /data/reads.fastq.gz --hmm /data/F.hmm \
--output /data/its2_extracted.fasta --region its2 --preset ontRequires Rust (stable, edition 2024) and Cargo:
cargo install --path .
itsxrust --helpITSxRust calls nhmmer to search profile-HMMs against input sequences. Install HMMER 3.x and ensure nhmmer is on your PATH:
conda install -c bioconda hmmerThe Docker image includes HMMER, so no separate installation is needed when using the container.
ITSxRust uses the same fungal HMM profiles as ITSx. The file is typically called F.hmm (for fungi) and is distributed with ITSx. After installing ITSx, find it with:
find $(dirname $(which ITSx))/../ -name "F.hmm" 2>/dev/nullOr download the ITSx package and extract the HMM files from the ITSx_db/HMMs/ directory.
itsxrust --help
itsxrust extract --help| Flag | Description | Default |
|---|---|---|
--input |
Input FASTA/FASTQ (.gz OK) |
required |
--hmm |
HMM profile file | required (unless --tblout-existing) |
--output |
Output file (single region) or prefix (--region all) |
required |
--region |
its1, its2, full, or all |
full |
--preset |
ont or hifi — sets E-value, constraints, confidence thresholds |
— |
--hmmer-cpu |
Threads for nhmmer | 8 |
--inc-e |
E-value inclusion threshold | 1e-5 (ont: 1e-3, hifi: 1e-10) |
--derep |
Exact dereplication before HMMER search | false |
--input-format |
auto, fasta, or fastq |
auto |
--output-format |
auto, fasta, or fastq |
auto |
--tblout-existing |
Reuse a previous nhmmer --tblout file (skips nhmmer) |
— |
--anchors-tsv |
Write per-read anchor coordinates as TSV | — |
--anchors-jsonl |
Write per-read anchor coordinates as JSONL | — |
--qc-json |
Write per-sample QC summary as JSON | — |
--write-ambiguous |
Divert ambiguous reads to a separate file | — |
--write-skipped |
Write skipped reads to a separate file | — |
--explain N |
Print skip reasons for the first N skipped reads | 0 |
Presets bundle sensible defaults for each platform. Explicit flags always override preset values.
| Parameter | No preset | --preset ont |
--preset hifi |
|---|---|---|---|
--inc-e |
1e-5 | 1e-3 | 1e-10 |
--max-per-anchor |
8 | 10 | 6 |
--min-its1 / --max-its1 |
50 / 1500 | 30 / 1800 | 50 / 1500 |
--min-its2 / --max-its2 |
50 / 2000 | 30 / 2500 | 50 / 2000 |
--min-full / --max-full |
150 / 4000 | 100 / 5000 | 150 / 4000 |
--min-anchor-score |
20 | 15 | 30 |
--max-anchor-evalue |
1e-4 | 1e-3 | 1e-8 |
Use --explain to see why reads are being skipped:
itsxrust extract --input reads.fq --hmm F.hmm --output out.fasta \
--region full --explain 10SKIP read_42: missing anchors: LSU_start
SKIP read_87: anchors present but no valid chain under constraints
Use --qc-json for a machine-readable summary of the full run:
itsxrust extract --input reads.fq --hmm F.hmm --output out.fasta \
--region all --preset ont --qc-json qc_summary.jsonThe JSON includes total reads, kept/skipped counts with reason-code breakdowns, confidence classification counts, dereplication stats (if --derep), and the effective parameters used.
Use --anchors-tsv or --anchors-jsonl to export per-read anchor hit coordinates, confidence labels, and ambiguous-reason annotations for reads that produced a valid chain.
Inputs
- FASTA or FASTQ, optionally gzipped
- HMM profile file (e.g.,
F.hmmfrom ITSx)
Outputs
- Extracted sequences as FASTA or FASTQ (one file per region, or one file for a single region)
- Optional: per-read anchor coordinates (TSV or JSONL), with confidence and ambiguity annotations
- Optional: QC summary JSON (
--qc-json) - Optional: ambiguous reads (
--write-ambiguous) and skipped reads (--write-skipped) in separate files
cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo testBenchmarking and simulation scripts live in bench/. See bench/sim/README.md for the simulation-based evaluation pipeline.
src/ Rust source (main, select, tblout, trim, preset, derep, report, seq, fasta, fastq, hmmer)
tests/ Integration tests
bench/ Benchmarking + simulation harness
- Bioconda recipe
- In-process HMM bindings (replace nhmmer subprocess)
MIT (see LICENSE).
If you use ITSxRust, please cite the repository via GitHub's "Cite this repository" button (powered by CITATION.cff).