kseq++ is a C++11 re-implementation of kseq.h
by Heng Li. The goal for re-implementation of kseq
is
providing better API and resource management while preserving its flexibility
and performance. Like original kseq, this parser is based on generic stream
buffer and works with different file types. However, instead of using C macros,
it uses C++ templates.
It inherits all features from kseq (quoting from kseq homepage):
- Parse both FASTA and FASTQ format, and even a mixture of FASTA and FASTQ records in one file.
- Seamlessly adapt to gzipped compressed file when used with zlib.
- Support multi-line FASTQ.
- Work on a stream with an internal stream buffer.
while additionally provides:
- simpler and more readable API
- RAII-style memory management
The library also comes with a FASTA/Q writer. Like reading, it can write mixed multi-line FASTA and FASTQ records with gzip compression. The writer is multi-threaded and the actual write function call happens in another thread in order to hide the IO latency.
The RAII-style class KStream
is the core class which handles input and output
streams. Each FASTA or FASTQ record will be stored in a KSeq
object.
This library provides another layer of abstraction which hides most details and
provides very simple API on top of KStream
: SeqStreamIn
and SeqStreamOut
classes for reading and writing a sequence file respectively with exactly the
same interface. It is highly recommended to use these classes unless you
intent to use low-level interface like changing buffer size or use custom stream
type.
Jump to Examples or Installation.
KStream
is a generic, template class with the following template parameters
which are usually inferred by the compiler when constructed (so, there is no
need to provide them manually):
TFile
: type of the underlying stream/file (e.g.gzFile
)TFunc
: type of the read/write function corresponding toTFile
(e.g.int (*)(gzFile_s*, const void*, unsigned int)
for an output stream withgzFile
as underlying file type)TSpec
: stream opening mode (with values:mode::in
ormode::out
)
The template parameters are inferred by compiler in C++17 when instantiated by
calling their constructors. make_kstream
function family also construct
KStream
s which might be useful for inferring template parameters when using
older standards; e.g. C++11 or C++14.
To construct an instance, it requires at least three arguments: 1) the file object/pointer/descriptor (can be of any type), 2) its corresponding read/write function, and 3) stream opening mode (see Examples).
This header file defines SeqStream
class set: i.e. SeqStreamIn
and
SeqStreamOut
. SeqStream
classes are inherited from KStream
with simpler
constructors using sensible defaults. They do not define any new method or
override inherited ones. So, they can be treated the same way as KStream
.
In order to prevent imposing any unwanted external libraries (e.g. zlib
) , the
SeqStream
class set are defined in a separated header file (seqio.hpp
) from
the core library.
These examples read FASTQ/A records one by one from either compressed or uncompressed file.
Using SeqStreamIn
:
#include <iostream>
#include <kseq++/seqio.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
KSeq record;
SeqStreamIn iss("file.fq.gz");
while (iss >> record) {
std::cout << record.name << std::endl;
if (!record.comment.empty()) std::cout << record.comment << std::endl;
std::cout << record.seq << std::endl;
if (!record.qual.empty()) std::cout << record.qual << std::endl;
}
}
Low-level API
Using KStream
#include <iostream>
#include <zlib>
#include <kseq++/kseq++.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
KSeq record;
gzFile fp = gzopen(filename, "r");
auto ks = make_kstream(fp, gzread, mode::in);
// auto ks = KStream(fp, gzread, mode::in); // C++17
// auto ks = KStreamIn(fp, gzread); // C++17
while (ks >> record) {
std::cout << record.name << std::endl;
if (!record.comment.empty()) std::cout << record.comment << std::endl;
std::cout << record.seq << std::endl;
if (!record.qual.empty()) std::cout << record.qual << std::endl;
}
gzclose(fp);
}
Or records can be fetched and stored in a std::vector< KSeq >
in chunks.
Using SeqStreamIn
:
#include <iostream>
#include <kseq++/seqio.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
SeqStreamIn iss("file.fq");
auto records = iss.read();
// auto records = iss.read(100); // read a chunk of 100 records
}
Low-level API
Using KStream
#include <iostream>
#include <zlib>
#include <kseq++/kseq++.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
gzFile fp = gzopen(filename, "r");
auto ks = make_ikstream(fp, gzread);
auto records = ks.read(); // fetch all the records
// auto records = ks.read(100); // read a chunk of 100 records
gzclose(fp);
}
These examples write FASTA/Q records to an uncompressed file.
Using SeqStreamIn
:
#include <iostream>
#include <kseq++/seqio.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
SeqStreamOut oss("file.dat");
for (KSeq const& r : records) oss << r;
}
Low-level API
Using KStream
#include <iostream>
#include <zlib>
#include <kseq++/kseq++.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
int fd = open(filename, O_WRONLY);
auto ks = make_kstream(fd, write, mode::out);
// auto ks = KStreamOut(fd, write); // C++ 17
// ...
for (KSeq const& r : records) ks << r;
ks << kend;
close(fd);
}
Another example for writing a series of FASTQ records to a gzipped file in FASTA format:
#include <iostream>
#include <kseq++/seqio.hpp>
using namespace klibpp;
int main(int argc, char* argv[])
{
/* let `record` be a list of FASTQ records */
SeqStreamOut oss("file.fa.gz", /* compression */ true, format::fasta);
for (KSeq const& r : records) oss << r;
}
NOTE
The buffer will be flushed to the file when the KStream
object goes out of the
scope. Otherwise, ks << kend
is required to be called before closing the file
to make sure that there is no data loss.
There is no need to write kend
to the stream if using SeqStreamOut
.
While writing a record to a file, sequence and quality scores can be wrapped at
a certain length. The default wrapping length for FASTA format is 60 bps and can
be customised by KStream::set_wraplen
method. For FASTQ format -- i.e. when
the format is explicitly set to format::fastq
-- output sequence and quality
string are not wrapped by default.
Wrapping can be disabled or enable by KStream::set_nowrapping
and
KStream::set_wrapping
methods respectively. The latter reset the wrapping
length to the default value (60 bps).
The default behaviour is to write a record in FASTQ format if it has quality
information. Otherwise, i.e. when the quality string is empty, the record will
be written in FASTA format. So, the output might be a mixture of FASTQ and FASTA
records. However, the output format can be forced by using format::fasta
and
format::fastq
modifiers. For example:
out << format::fasta << fastq_record;
out << another_record; // all other calls after this will also be in FASTA format.
will write a FASTQ record in FASTA format. These modifiers affect all writes
after them until another modifier is used. The format::mix
modifier reverts
the behaviour to default.
NOTE
Writing a FASTA record in FASTQ format throws an exception unless the record is empty (a record with empty sequence and quality string).
kseq++ is a header-only library and can be simply included in a project. Use
the package provided in the
Releases section and copy
include/kseq++
to your project tree.
The kseq++.hpp
is the core header file and seqio.hpp
is optional and only
needs to be included when using higher-level API (see
above). The latter requires zlib
as dependency
which should be linked.
There are also other ways to install the library:
Installing from source requires CMake>= 3.10:
git clone https://github.com/cartoonist/kseqpp
cd kseqpp
mkdir build && cd build
cmake .. # -DCMAKE_INSTALL_PREFIX=/path/to/custom/install/prefix (optional)
make install
It is also distributed on bioconda:
conda install -c bioconda kseqpp
After installing the library, you can import the library to your project using
find_package
. It imports kseq++::kseq++
target which can be passed to
target_include_directories
and target_link_libraries
calls. This is a sample
CMake file for building myprogram
which uses the library:
cmake_minimum_required(VERSION 3.10)
project(myprogram VERSION 0.0.1 LANGUAGES CXX)
find_package(kseq++ REQUIRED)
set(SOURCES "src/main.cpp")
add_executable(myprogram ${SOURCES})
target_include_directories(myprogram
PRIVATE kseq++::kseq++)
target_link_libraries(myprogram
PRIVATE kseq++::kseq++)
CMake options:
- for building tests:
-DBUILD_TESTING=on
- for building benchmark:
-DBUILD_BENCHMARKING=on
NOTE: The results below are based on older versions of kseq++ and kseq.h
.
- TODO Update benchmark
NOTE: It is fair to say that kseq++ comes with a very negligible overhead
and is almost as fast as kseq.h
(in 'read' mode) with an idiomatic C++ API
and more convinient resource management. The original kseq.h
does not support
writing FASTA/Q files.
For this benchmark, I re-used sequence files from SeqKit benchmark: seqkit-benchmark-data.tar.gz
file | format | type | num_seqs | sum_len | min_len | avg_len | max_len |
---|---|---|---|---|---|---|---|
dataset_A.fa | FASTA | DNA | 67,748 | 2,807,643,808 | 56 | 41,442.5 | 5,976,145 |
dataset_B.fa | FASTA | DNA | 194 | 3,099,750,718 | 970 | 15,978,096.5 | 248,956,422 |
dataset_C.fq | FASTQ | DNA | 9,186,045 | 918,604,500 | 100 | 100 | 100 |
- CPU: Intel® Xeon® CPU E3-1241 v3 @ 3.50GHz, 4 cores, 8 threads
- RAM: DDR3 1600 MHz, 16352 MB
- HDD: Seagate Desktop HDD 500GB, 16MB Cache, SATA-3
- OS: Debian GNU/Linux 9.4 (stretch), Linux 4.9.91-1-amd64-smp
- Compiler: GCC 6.3.0, compiled with optimisation level 3 (
-O3
)
file | kseq++ | kseq | SeqAn | kseq++/read* | SeqAn/readRecords** |
---|---|---|---|---|---|
dataset_A.fa | 2.35 s | 2.5 s | 2.92 s | 3.52 s | 4.94 s |
dataset_B.fa | 2.66 s | 2.8 s | 3.34 s | 3.74 s | 9.82 s |
dataset_C.fq | 2.56 s | 2.46 s | 2.66 s | 4.56 s | 11.8 s |
* storing all records in std::vector
.
** storing all records in seqan2::StringSet< seqan2::CharString >
.
file | kseq++/plain | kseq++/gzipped | SeqAn/plain |
---|---|---|---|
dataset_A.fa | 2.3 s | 866 s | 2.29 s |
dataset_B.fa | 2.19 s | 849 s | 2.33 s |
dataset_C.fq | 1.94 s | 365 s | 2.24 s |