Performance
Replaced kseq++ (C++ wrapper) with kseq.h (Heng Li) for faster sequence I/O (~13% speedup in querysketch)
Batch-read pattern in querySample eliminates per-read lock contention under OpenMP
Rolling hash in lowComplexity filter replaces per-position recomputation
Stack-allocated spectrum arrays in lowComplexity replace heap-allocated vectors
Replaced std::endl with "\n" in hot loops to avoid unnecessary stream flushes
BloomFilter constructor uses value-initialized allocation instead of manual zeroing
comparesketch pre-caches all k-mer hashes to avoid redundant file I/O (M+N reads instead of M×N)
comparesketch reuses per-thread Bloom filters with memset instead of re-allocating
Sort lambda and range-for loops use const references to avoid copies
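To illustrate the rolling-hash item above, here is a minimal sketch of the general technique, not the code used in the lowComplexity filter. It assumes a 2-bit nucleotide encoding with k ≤ 31; the names `baseCode`, `encodeAt`, and `rollingCodes` are hypothetical. `encodeAt` recomputes each k-mer from scratch in O(k), while `rollingCodes` updates the previous value in O(1) per position.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 2-bit encoding; non-ACGT characters are mapped to 0 for brevity.
inline std::uint64_t baseCode(char c) {
    switch (c) {
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return 0;
    }
}

// Per-position recomputation: O(k) work for every window.
std::uint64_t encodeAt(const std::string& seq, std::size_t pos, unsigned k) {
    std::uint64_t code = 0;
    for (unsigned i = 0; i < k; ++i) {
        code = (code << 2) | baseCode(seq[pos + i]);
    }
    return code;
}

// Rolling update: shift in the new base and mask off the base that left the window.
std::vector<std::uint64_t> rollingCodes(const std::string& seq, unsigned k) {
    std::vector<std::uint64_t> out;
    if (k == 0 || k > 31 || seq.size() < k) return out;
    const std::uint64_t mask = (std::uint64_t{1} << (2 * k)) - 1;
    std::uint64_t code = 0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        code = ((code << 2) | baseCode(seq[i])) & mask;
        if (i + 1 >= k) out.push_back(code);  // first full window ends at index k-1
    }
    return out;
}
```

Both functions yield the same value for every window; the rolling version simply avoids rereading the k-1 bases shared with the previous window.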
Features
Genomically spaced signature selection in buildSketch — signatures are evenly distributed across reference genome positions instead of randomly shuffled
Added read-level log output (logread_*.tsv) in querysketch listing matched read IDs per reference
comparesketch now supports compressed FASTA input (.fa.gz) via kseq.h/zlib
Added CMake build system (CMakeLists.txt) alongside existing Makefile
Added --cluster=N option to uniqsketch: it automatically clusters nearly identical references using single-linkage union-find on pairwise unique k-mer counts, selects one representative per cluster, and writes clusters.tsv, which maps every member to its representative
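The clustering described for --cluster=N can be pictured with the following sketch of single-linkage clustering via union-find. It is an illustration under stated assumptions, not the uniqsketch implementation: it assumes two references are linked when their pairwise unique k-mer count is below a threshold, and that the representative is the lowest-indexed member. The names `UnionFind` and `clusterReferences` are hypothetical.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Minimal union-find; the smaller index always becomes the root,
// so each cluster's root is its lowest-indexed member.
struct UnionFind {
    std::vector<std::size_t> parent;

    explicit UnionFind(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), std::size_t{0});
    }

    std::size_t find(std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }

    void unite(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (a < b) parent[b] = a; else parent[a] = b;
    }
};

// uniqueCounts[i][j]: assumed number of k-mers that distinguish reference i from j.
// Returns, for each reference, the index of its cluster representative.
std::vector<std::size_t> clusterReferences(
        const std::vector<std::vector<std::size_t>>& uniqueCounts,
        std::size_t threshold) {
    const std::size_t n = uniqueCounts.size();
    UnionFind uf(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            // Assumption: a low unique k-mer count means "nearly identical".
            if (uniqueCounts[i][j] < threshold) uf.unite(i, j);
        }
    }
    std::vector<std::size_t> representative(n);
    for (std::size_t i = 0; i < n; ++i) representative[i] = uf.find(i);
    return representative;
}
```

Because the linkage is single, references 0 and 2 end up in the same cluster if each is close to reference 1, even when 0 and 2 are not close to each other directly.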
Error Handling
Added file existence validation for all input files across all three tools
Mistyped filenames now produce clear error messages instead of core dumps
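An up-front check of this kind might look like the sketch below. It is an assumption about the general approach, not the code in these tools; `validateInputFiles` is a hypothetical helper and the error message wording is illustrative.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Returns true only if every path can be opened for reading.
// Each failure is reported on stderr so all bad paths are visible in one run.
bool validateInputFiles(const std::vector<std::string>& paths) {
    bool ok = true;
    for (const auto& path : paths) {
        std::ifstream in(path);
        if (!in.good()) {
            std::cerr << "Error: cannot open input file '" << path << "'\n";
            ok = false;
        }
    }
    return ok;
}
```

A tool would typically call such a check right after argument parsing and exit with a non-zero status when it returns false, before any sketching work begins.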