FastSTR — Ultra-fast and accurate identification of Short Tandem Repeats (STRs) from long-read DNA sequences. Developed for genome-wide STR detection, consensus construction, and comparative STR analysis.
- Overview
- Installation
- Quick Start
- Command Line Options
- Input & Output
- Usage
- Performance
- Citation
- License
- Changelog
FastSTR is a novel and efficient tool for de novo detection of short tandem repeats (STRs) in genomic sequences. It combines fast motif recognition with accurate sequence alignment to achieve both high precision and completeness in STR identification. FastSTR is optimized for large-scale genomic datasets and enables rapid detection of repetitive elements without relying on predefined motif libraries or fixed repeat-length thresholds.
Compared to classical tools such as TRF, T-REKS, and TRASH, FastSTR offers:
- ⚡ High-speed parallel processing — Processes genomic fragments in parallel, achieving up to 10× faster runtime.
- 🧠 Context-aware motif recognition — Uses an N-gram + Markov model to identify representative motifs without predefined motif libraries.
- 🧩 Segmented global alignment — Efficiently handles ultra-long or complex STRs while maintaining base-level precision.
- 🔍 Smart interval merging — Applies an interval-gain decision strategy to accurately resolve overlapping STRs.
- 🧬 Enhanced detection in complex regions — Identifies confounding or nested repeat regions (e.g., centromeric satellites) with a novel density-based concentration test.
- 💾 Lightweight & scalable — Requires few dependencies, easy to install and run, and supports multiple operating systems.
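
To make the overlapping-STR problem from the list above concrete, here is a minimal Python sketch of the baseline case: a plain union of overlapping candidate intervals. It is not FastSTR's interval-gain decision strategy, which this README does not spell out; the function name `merge_overlapping` and the half-open `(start, end)` convention are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive; an assumed convention


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Union intervals that overlap or touch (a baseline, not FastSTR's strategy)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Two overlapping candidates and one isolated candidate.
    print(merge_overlapping([(100, 160), (150, 220), (400, 430)]))
```

A plain union like this always merges on any overlap. A gain-based strategy would presumably decide per pair whether merging improves the result, which is where the two approaches would differ.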
pip install faststr (coming soon)

conda install -c bioconda faststr

git clone https://github.com/yourname/faststr.git
cd faststr
pip install -e .

faststr [--strict | --normal | --loose] [--default] genome.fa

faststr --strict --default genome.fa

This runs FastSTR in strict mode using the default model to identify STRs in the genome.fa file.
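
For scripted runs, the command line above can be assembled in Python. The sketch below is a hypothetical helper, not part of FastSTR: `build_faststr_command` is an invented name, and it only uses flags documented in this README (`--strict`, `--normal`, `--loose`, `--default`, `-p`, `-f`), in the same order as the examples.

```python
from typing import List, Optional

MODES = ("strict", "normal", "loose")  # the three modes listed in this README


def build_faststr_command(mode: str, fasta: str,
                          cores: Optional[int] = None,
                          out_dir: Optional[str] = None) -> List[str]:
    """Return an argv list such as ['faststr', '--strict', '--default', 'genome.fa']."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    cmd = ["faststr", f"--{mode}", "--default", fasta]
    if cores is not None:
        cmd += ["-p", str(cores)]   # -p: number of CPU cores
    if out_dir is not None:
        cmd += ["-f", out_dir]      # -f: output directory
    return cmd


if __name__ == "__main__":
    print(" ".join(build_faststr_command("strict", "genome.fa", cores=8)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `faststr` is installed and on the `PATH`.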
| Argument | Type | Default | Description |
|---|---|---|---|
| match | int | 2 | Match score |
| mismatch | int | 5 | Mismatch score |
| gap_open | int | 7 | Gap opening penalty |
| gap_extend | int | 3 | Gap extension penalty |
| p_indel | int | 15 | Indel percentage threshold |
| p_match | int | 80 | Match percentage threshold |
| score | int | 50 | Alignment score threshold |
| quality_control | bool | False | Enable read-level quality control |
| DNA_file | str | - | Path to DNA FASTA input |
| -f | str | - | Output directory |
| -s | int | 1 | Start index |
| -e | int | 0 | End index |
| -l | int | 15000 | Sub-read length |
| -o | int | 1000 | Overlap length |
| -p | int | 1 | Number of CPU cores |
| -b | float | 0.045 | Motif coverage threshold |
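
The `-l` and `-o` defaults suggest that long sequences are cut into overlapping sub-reads before parallel processing. The Python sketch below shows one way such windows could be laid out. It assumes fixed-length windows that each share `overlap` bases with the next one; FastSTR's actual splitting logic may differ, and `split_into_subreads` is a hypothetical name.

```python
from typing import List, Tuple


def split_into_subreads(seq_len: int, sub_len: int = 15000,
                        overlap: int = 1000) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows covering [0, seq_len)."""
    if sub_len <= overlap:
        raise ValueError("sub_len must be larger than overlap")
    if seq_len <= 0:
        return []
    step = sub_len - overlap  # each window starts this far after the previous one
    windows = []
    start = 0
    while True:
        end = min(start + sub_len, seq_len)
        windows.append((start, end))
        if end >= seq_len:
            break
        start += step
    return windows


if __name__ == "__main__":
    for window in split_into_subreads(40000):
        print(window)
```

With the defaults, a 40,000-base sequence yields three windows, and neighbouring windows share 1,000 bases so that a repeat spanning a cut point is fully contained in at least one window if it is shorter than the overlap.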
| Mode | Description |
|---|---|
| --strict | High precision, recommended for curated assemblies |
| --normal | Balanced mode, suitable for most datasets |
| --loose | High sensitivity, tolerant of mismatches |
| Preset | Description |
|---|---|
| --default | Standard scoring model |
| (future) --sensitive | Optimized for noisy long reads |
| (future) --speed | Optimized for large-scale detection |
- DNA sequences in FASTA format
| File Pattern | Description |
|---|---|
| *detail.dat | Contains all STR positions and motifs, quality statistics for each STR, and STR counts per chromosome. |
| *align.dat | Detailed alignment of all STRs against reference STRs, including mismatches and indels. |
| *.csv | Merged STR intervals with representative motifs and summary statistics for each interval. |
| *.log | Processing logs. |
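
The `*.csv` output can be loaded with Python's standard `csv` module. The sketch below assumes the file starts with a header row, which this README does not confirm. The column names in the demo (`chrom`, `start`, `end`, `motif`) are placeholders for illustration only, so check the header of a real output file before relying on them.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_str_intervals(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read merged STR intervals from an open CSV file into a list of dicts."""
    return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Placeholder content; a real file would be opened with open(path, newline="").
    demo = io.StringIO("chrom,start,end,motif\nchr1,100,160,AC\nchr2,5,41,GAA\n")
    rows = load_str_intervals(demo)
    print(len(rows), "intervals; columns:", list(rows[0].keys()))
```

All values are returned as strings, so numeric columns need an explicit `int()` or `float()` conversion.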
faststr --normal --default human_genome.fa

faststr --strict --default genome.fa -p 8

faststr --strict --default NC_024512.1.fna -p 8

| Dataset | Genome Size | Tool | Runtime | Recall | Precision |
|---|---|---|---|---|---|
| Human (T2T) | 2.94 G | TRF | 18 h 31 min | - | - |
| | | FastSTR | 1 h 13 min | 0.950 | 0.994 |
| Mouse (GRCm39) | 2.57 G | TRF | 1 h 41 min | - | - |
| | | FastSTR | 38 min | 0.966 | 0.997 |
| Zebrafish (GRCz11) | 1.58 G | TRF | 2 h 51 min | - | - |
| | | FastSTR | 25 min | 0.945 | 0.998 |
Note: TRF output is used as the ground truth, so recall and precision are not reported for TRF itself. FastSTR was run on 72 CPUs.
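
The runtimes in the table can be turned into speedup ratios with a few lines of Python. The sketch below only restates the table's numbers; the names `RUNTIMES_MIN` and `speedup` are illustrative. The note gives FastSTR's CPU count but not TRF's, so the ratios are wall-clock comparisons only.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H h M min' runtime to minutes."""
    return hours * 60 + minutes


# (TRF runtime, FastSTR runtime) in minutes, copied from the table above.
RUNTIMES_MIN = {
    "Human (T2T)": (to_minutes(18, 31), to_minutes(1, 13)),
    "Mouse (GRCm39)": (to_minutes(1, 41), to_minutes(0, 38)),
    "Zebrafish (GRCz11)": (to_minutes(2, 51), to_minutes(0, 25)),
}


def speedup(trf_minutes: int, faststr_minutes: int) -> float:
    """Wall-clock ratio of TRF runtime to FastSTR runtime."""
    return trf_minutes / faststr_minutes


if __name__ == "__main__":
    for dataset, (trf, fast) in RUNTIMES_MIN.items():
        print(f"{dataset}: {speedup(trf, fast):.1f}x")
```

This gives roughly 15.2x for human, 2.7x for mouse, and 6.8x for zebrafish.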
If you use FastSTR in your research, please cite:
Xingyu Liao et al.,
Efficient Identification of Short Tandem Repeats via Context-Aware Motif Discovery and Ultra-Fast Sequence Alignment,
Nat. XXX, 2026.
This project is licensed under the MIT License.
See LICENSE for more details.
- Initial release of FastSTR
- Supports three alignment modes and one default model
- Implemented parallel computation
- Added .csv, .dat, and .log outputs