USTAR2 is a k-mer set compressor that also supports k-mer counts.
It builds on the ideas of UST and ProphAsm to compute a spectrum-preserving string set (SPSS, also known as simplitigs) for a given k-mer set.
Additionally, it exploits the possibility of reusing already visited nodes to achieve better compression as shown by Matchtigs.
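To make the SPSS idea concrete: a string of length L in the representation spells L - k + 1 overlapping k-mers, so longer strings store each k-mer in fewer characters. The sketch below is only an illustration. The helper name `kmers_of` is hypothetical and not part of USTAR2, and it does not canonicalize reverse complements.

```shell
#!/bin/sh
# Illustration only: list the k-mers spelled by each sequence line of a FASTA stream.
# kmers_of is a hypothetical helper, not a USTAR2 command.
kmers_of() {
    k="$1"
    awk -v k="$k" '
        /^>/ { next }                   # skip FASTA header lines
        {
            n = length($0) - k + 1      # a string of length L holds L-k+1 k-mers
            for (i = 1; i <= n; i++) print substr($0, i, k)
        }'
}

# One 5-character string stands in for three separate 3-mers (5 vs 9 characters).
printf '>s1\nACGTT\n' | kmers_of 3
```

Here the single string ACGTT yields ACG, CGT and GTT, which is the kind of saving an SPSS aims for across the whole k-mer set.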
You will find four executables and one bash script:
ustar: the main program
ustarx: a k-mer extractor
ustars: computes k-mer statistics
ustar-test: used for debugging
validate: a validation script
USTAR2 itself has no dependencies. However, you'll need BCALM2 to compute the compacted de Bruijn graph (cdBG) of your multi-FASTA file.
git clone https://github.com/enricorox/USTAR
cd USTAR
cmake . && make -j 4
Run BCALM2 first:
./bcalm -kmer-size <kmer-size> -in <your-multi-fasta> -all-abundance-counts
Then USTAR2:
./ustar -k <kmer-size> -i <bcalm-output>
To use the best heuristic, add -s+aa -x+u.
See the help (./ustar -h) for details and advanced options.
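The two steps can be chained in a small wrapper. This is a minimal sketch under stated assumptions: the function names `run` and `run_pipeline` are hypothetical, both binaries are assumed to sit in the current directory, and BCALM2 is assumed to write its unitigs to the input basename without its extension plus `.unitigs.fa` (check this against your BCALM2 version). By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the BCALM2 -> USTAR2 pipeline using only the flags shown above.
# Assumption: BCALM2 writes <input-basename-without-extension>.unitigs.fa
# in the current directory. Verify this for your BCALM2 version.

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# run_pipeline <kmer-size> <multi-fasta>  (hypothetical helper)
run_pipeline() {
    k="$1"
    fasta="$2"
    base="$(basename "$fasta")"
    unitigs="${base%.*}.unitigs.fa"     # assumed BCALM2 output name
    run ./bcalm -kmer-size "$k" -in "$fasta" -all-abundance-counts &&
    run ./ustar -k "$k" -i "$unitigs" -s+aa -x+u
}

run_pipeline 31 reads.fa
```

Keeping the dry run as the default makes it easy to inspect the exact command lines before launching a long BCALM2 job.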
You can check that the output file contains the same k-mers as your BCALM2 file with your preferred k-mer counter.
To check that both the k-mers and their counts are correct, run:
./ustarx -k <kmer-size> -i <ustar-fasta> -c <ustar-counts> -s
./validate <kmer-size> <your-multi-fasta> <ustar-kmers-counts>
Note that you'll need to install Jellyfish-2 in order to use validate.
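The idea behind such a check is an order-independent comparison of two k-mer count tables. The sketch below shows it by hand and is not how validate is implemented. The names `normalize` and `same_counts` are hypothetical, and the file layout is an assumption: one k-mer and its count per line, separated by whitespace, with k-mers already in canonical form.

```shell
#!/bin/sh
# Sketch: order-independent comparison of two k-mer count tables.
# Assumption: each line is "<kmer> <count>" and k-mers are already canonical.

# Keep the first two columns, unify the separator, and sort bytewise.
normalize() {
    awk 'NF >= 2 { print $1, $2 }' "$1" | LC_ALL=C sort
}

# Return 0 if both tables hold the same k-mers with the same counts.
same_counts() {
    a="$(mktemp)"
    b="$(mktemp)"
    normalize "$1" > "$a"
    normalize "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Tiny demo: the two tables differ only in line order and separator.
tmp="$(mktemp -d)"
printf 'ACG 2\nCGT 1\n' > "$tmp/a.txt"
printf 'CGT\t1\nACG\t2\n' > "$tmp/b.txt"
same_counts "$tmp/a.txt" "$tmp/b.txt" && echo "counts match"
rm -rf "$tmp"
```

Sorting with LC_ALL=C keeps the ordering identical across machines, so the byte comparison does not depend on the locale.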
If you use this tool, please cite the paper:
@inproceedings{rossignolo2024ustar2,
title={USTAR2: Fast and Succinct Representation of k-mer Sets Using De Bruijn Graphs.},
author={Rossignolo, Enrico and Comin, Matteo},
booktitle={BIOSTEC (1)},
pages={368--378},
year={2024}
}