Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
lib
 
 
 
 
 
 
 
 
 
 
 
 

README.md

gsufsort

gsufsort [1] is a fast, portable and lightweight tool for constructing the suffix array and related data structures for string collections.

gsufsort runs in internal memory and data structures are written to disk.

For a string collection, gsufsort can compute the following data structures:

  • Suffix array (SA)
  • Inverse suffix array (ISA)
  • LCP-array (LCP)
  • k-truncated LCP-array (k-LCP)
  • Document array (DA)
  • Burrows-Wheeler transform (BWT)
  • Inverse BWT
  • Generalized suffix array (GSA)
  • Generalized enhanced suffix array (GESA)

Compilation and installation

gsufsort will compile in systems with a standard C compiler (like gcc) and make.

git clone https://github.com/felipelouza/gsufsort.git
cd gsufsort
make 

Issuing these commands will build executables gsufsort and gsufsort-64.

For inputs larger than 2GB, gsufsort-64 must be used.

To enable support to compressed files, zlib is required. If zlib is not installed in you system, build with option make GZ=0.

Execution

./gsufsort INPUT [options]

where INPUT is a single file or directory with a string collection.

Construction options:

--build	              (default)
--sa    [w]           compute the SA using w bytes (default 4), write to INPUT.w.sa
--isa   [w]           compute the ISA, write to INPUT.w.isa
--lcp   [w]           compute the LCP, write to INPUT.w.lcp
--trlcp k             compute the k-truncated LCP array, write to INPUT.w.lcp
--da    [w]           compute the DA, write to INPUT.w.da
--gsa   [w1][w2]      compute the GSA=(string,suffix) as pairs of w1+w2 bytes, write to INPUT.w1.w2.gsa
--gesa  [w1][w2][w3]  compute the GESA=(GSA,LCP,BWT), write to INPUT.w1.w2.w3.1.gesa
--bwt                 compute the BWT using 1 byte per symbol, write to INPUT.bwt
--docs  d             process only the first d strings in the collection
--light               run the lightweight algorithm to compute DA, GSA and GESA
--output DIR/NAME     write output files to DIR and use NAME as a prefix for file names

Loading options:

--load                load data-structures from disk INPUT[.sa][.da][.lcp][.gsa][.str]
--ibwt                invert the BWT, given INPUT.bwt, write output to INPUT.bwt.ibwt

Input options:

--txt                 handle input as raw text files (one string per line)
--fasta               handle input as fasta files 
--fastq               handle input as fastq files
--dir                 include all files at the input directory
--lower               convert input to lowercase before data structures construction
--upper               convert input to uppercase before data structures construction

Output options:

--str                 write the collection concatenation (T^{cat}) to INPUT.1.str
--print [p]           print the first p elements of arrays to stdout, defaults to the collection length
--qs                  write QS sequences in fastq permuted according to the BWT to INPUT.bwt.qs
--lcp_max             print maximum LCP value
--lcp_max_text        print maximum LCP value (text order)
--lcp_avg             print average LCP value
--time                print the running time in seconds
--verbose             verbose output
--help                this help message

Input files

  • File types (text, fasta or fastq) will be selected by extensions: .txt, .fasta (or .fa, .fna.) and .fastq (or .fq).

  • Options --txt, --fasta and --fastq enable loading file disregarding extensions.

  • In txt files, each line is taken as a strings in the collection. In fasta and fastq files, each sequence is taken as a string in the collection.

  • gsufsort supports the ASCII alphabet, but values 0 and 255 are reserved and must not occur in the input.

  • IUPAC symbols and 'N' are not handled as special symbols in fasta or fastq files.

  • gzipped input files (with .gz extension) are supported using zlib and kseq libraries. If zlib is not installed in your system, build gsufsort with the option make GZ=0. If zlib is not available and a gzipped file is given as input, a runtime error will occur.

  • A directory may be given as input, selecting option --dir. Every file with expected extensions in the directory will be processed to compose the collection, and the default output file prefix will be all. See also Wiki.

Output files

  • Output files are written by default in the current directory.

  • If option --output DIR/ is set, files are written to directory DIR. Setting --output DIR/NAME will make files be written to directory DIR with suffix NAME.

  • Output files format is discussed below.

quick test

To run a test with all strings from dataset/example.txt, type:

./gsufsort dataset/example.txt --sa --bwt
## gsufsort ##
## store_to_disk ##
example.txt.4.sa	76 bytes (n = 19)
example.txt.bwt	19 bytes (n = 19)

To see the result (option --print) stored in disk INPUT.4.sa and INPUT.bwt, use --load option:

./gsufsort example.txt --sa --bwt --load --print
## load_from_disk ##
example.txt.4.sa	76 bytes (n = 19)
example.txt.bwt	19 bytes (n = 19)
i	SA	BWT	suffixes
0	18	$	#
1	6	a	$
2	12	a	$
3	17	n	$
4	5	n	a$
5	11	b	a$
6	9	n	aba$
7	15	n	an$
8	3	n	ana$
9	7	$	anaba$
10	13	$	anan$
11	1	b	anana$
12	10	a	ba$
13	0	#	banana$
14	16	a	n$
15	4	a	na$
16	8	a	naba$
17	14	a	nan$
18	2	a	nana$

output format

  • SA, ISA, LCP, k-LCP and DA are each written sequentially to a binary file. The file has no header and every integer takes w bytes. The default value of w is 4.

  • BWT and iBWT are written in ASCII format, using 1 byte per input symbol.

  • The GSA is written sequentially to a binary file, with no headers. The values of DA and SA are intercalated throughout the file with w1 and w2 bytes respectively: DA[0],SA[0],DA[1],SA[1],...

  • The GESA is written sequentially to a binary file, with no headers. The values of DA and SA, LCP and BWT are intercalated throughout the file with w1, w2, w3 and 1 bytes respectively: DA[0],SA[0],LCP[0],BWT[0],DA[1],SA[1],LCP[1],BWT[1]...

wiki

See more details and additional features in Wiki.

authors

thanks

We thank to Antonis Maronikolakis for his helpful suggestions.

references

[1] Louza, F.A., Telles, G.P., Gog, S., Prezza, N., Rosone, G.. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol 15, 18 (2020). https://doi.org/10.1186/s13015-020-00177-y

About

gsufsort: building suffix arrays, LCP-arrays and BWTs for string collections [AMB 2020]

Topics

Resources

License

You can’t perform that action at this time.