Upgrade to AllSome/Split sequence bloom trees? #545
Are there any plans to upgrade the backend SBT code to use AllSome or Split sequence bloom trees, as they're faster in both creating and searching the indices?
They've been implemented, but the license isn't compatible with BSD.
The SBT data structure seems less than optimal given recent work by Zamin Iqbal's group on the Bitsliced Genomic Signature Index (BIGSI). They report needing 4 orders of magnitude less space for the index, and data can be added to the index incrementally. There is also Mantis (Rob Patro's group), which seems to outperform SBTs as well (not sure here). Given that these are efficiently implemented, I wonder whether sourmash could use these implementations instead of SBT, via C bindings for example (@luizirber).
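For readers unfamiliar with the bit-sliced idea, here is a minimal toy sketch of how such an index can work. It assumes one Bloom filter per dataset, stored column-wise so that each bit position becomes a row (a bitmask over datasets). The class name `BitslicedIndex`, the helper `kmers`, and the default parameters are hypothetical and for illustration only; this is not BIGSI's actual code or API.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BitslicedIndex:
    """Toy bit-sliced Bloom filter index (hypothetical, not BIGSI's API).

    Each row corresponds to one Bloom filter bit position and is stored
    as a Python int used as a bitmask with one bit per dataset.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.rows = [0] * num_bits
        self.names = []

    def _positions(self, kmer):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_dataset(self, name, dataset_kmers):
        # Adding a dataset appends one column; existing bits are untouched.
        column = len(self.names)
        self.names.append(name)
        for kmer in dataset_kmers:
            for pos in self._positions(kmer):
                self.rows[pos] |= 1 << column

    def query(self, kmer):
        # AND the rows for the k-mer's bit positions; the surviving bits
        # mark datasets that may contain it (false positives are possible).
        mask = (1 << len(self.names)) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return [n for i, n in enumerate(self.names) if (mask >> i) & 1]


index = BitslicedIndex()
index.add_dataset("sample_a", kmers("ACGTACGTGG", 4))
index.add_dataset("sample_b", kmers("TTTTGGGGCC", 4))
print(index.query("ACGT"))
```

A query touches only `num_hashes` rows regardless of how many datasets are indexed, and adding a dataset never rewrites existing columns, which is what makes incremental updates cheap in this layout.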
From the BIGSI article on why SBTs are suboptimal for microbiology:
After the LCA index was implemented we figured it has a pretty similar API to the SBT index, and with the interest in other indexes it seems like a good time to make a
About Rust: I'm glad you see my long term plan =P
I'll put up some scaffolding and open a PR this week, and we can work together to implement it.
@phiweger dirty truth: the
I'm not opposed to using BIGSI or Mantis, but a benefit of the current SBT code is that it's very easy to add more datasets. BIGSI and Mantis are also tackling a harder problem because they take every k-mer into account, and we don't (@rob-p @Phelimb is this a fair assessment of BIGSI and Mantis?). But I would focus on fixing the low-hanging fruit before going too crazy and implementing everything under the sun (I still need to finish my PhD too! =P)
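To make the "every k-mer" distinction concrete, here is a minimal sketch, assuming the subsampling is a bottom-k MinHash sketch: only the smallest few k-mer hashes are kept per dataset instead of the full k-mer set. The function names (`all_kmers`, `hash_kmer`, `bottom_sketch`, `jaccard_estimate`) are hypothetical and are not sourmash's API.

```python
import hashlib


def all_kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using SHA-256."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def bottom_sketch(seq, k, num):
    """Keep only the `num` smallest distinct k-mer hashes of a sequence."""
    return sorted({hash_kmer(km) for km in all_kmers(seq, k)})[:num]


def jaccard_estimate(sketch_a, sketch_b, num):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `num` smallest hashes of the merged sketches and count
    how many of them appear in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


seq_a = "ACGTACGTGGCCTTAAGGCCAATT"
seq_b = "ACGTACGTGGCCTTAAGGTTTTTT"
sk_a = bottom_sketch(seq_a, 5, 8)
sk_b = bottom_sketch(seq_b, 5, 8)
print(len(all_kmers(seq_a, 5)), "k-mers reduced to", len(sk_a), "hashes")
print("estimated similarity:", jaccard_estimate(sk_a, sk_b, 8))
```

The sketch size stays fixed no matter how large the input is, which keeps the index small, but a query for a k-mer that was not sampled cannot be answered. Indexes that store every k-mer do not have that limitation, at the cost of more space.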