New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to AllSome/Split sequence bloom trees? #545

Open
olgabot opened this Issue Sep 17, 2018 · 2 comments

Comments

Projects
None yet
3 participants
@olgabot
Copy link
Contributor

olgabot commented Sep 17, 2018

Are there any plans to upgrade the backend SBT code to use AllSome or Split sequence bloom trees, as they're faster in both creating and searching the indices?

They've been implemented, but the licenses isn't compatible with the BSD.

I see there's some interest in using Rust for the backend, and this might be a nice excuse to play around in the language. I'm happy to help where I can!

@phiweger

This comment has been minimized.

Copy link

phiweger commented Sep 20, 2018

The SBT data structure seems less optimal given recent work by Zamin Iqbal's group on a Bitsliced Genomic Signature Index (BIGSI). They report that they need 4 orders of magnitude less space for the index. And you can incrementally add data to the index. Also there is Mantis (Rob Patro's group) which seems to outperform SBTs as well (not sure here). Given that these things are efficiently implemented, I wonder whether sourmash could use these implementations instead of SBT, via C bindings for example (@luizirber).

From the BIGSI article on why SBTs are suboptimal for microbiology:

The first scalable search engine for raw sequence was developed in 2016, the Sequence Bloom Tree (SBT) (13), a k-mer (fixed-length DNA word) index, developed to enable detection of specific transcripts in RNA-seq data. However, SBT and its recent successors (14-16) shared with previous similar methods (Cortex, vari, McCortex) (17-20) a scaling dependence on the total number of k-mers in the union of datasets indexed. For species where there is considerable k-mer sharing between datasets (e.g. human), this works well. However, bacteria are fundamentally different: even within a species there can be enormous diversity due to horizontal transfer of DNA (Supplementary Figure 1), and we will show that this diversity renders previous methods unable to scale.

@luizirber

This comment has been minimized.

Copy link
Member

luizirber commented Sep 24, 2018

@olgabot, hmm, now I see the motivation for your question =]

After the LCA index was implemented we figured it has a pretty similar API to the SBT index, and with the interested in other indexes it seems it's a good time to make a Index interface/base class and support more options.

About Rust: I'm glad you see my long term plan =P
I started implementing the regular SBT in sourmash-rust, but it involves more effort than the previous goals (moving the C++ core and signature loading/saving). The biggest hurdle is internal nodes being khmer Nodegraphs, but we can also sidestep the problem by implementing new Indexes in Rust and exposing to Python instead of porting everything.

I'll put up some scaffold and open a PR this week, and we can work together to implement it.

@phiweger dirty truth: the Nodegraphs we use are too small, but once they saturate they keep working fine (but with longer search times, because we get more false positives). You still get precise results (because we go down to the leaves, and false positives don't change results), but it could be more efficient.

I'm not opposed to use BIGSI or Mantis, but a benefit for current SBT code is that it's very easy to add more datasets. BIGSI and Mantis are also tackling a harder problem because they take every k-mer into account, and we don't (@rob-p @Phelimb is this a fair assessment of BIGSI and mantis?). But I would focus in fixing the low hanging fruit before going too crazy into implementing everything under the sun (I still need to finish my PhD too! =P)

@luizirber luizirber referenced this issue Oct 5, 2018

Open

[WIP] Create an Index abstract base class #556

0 of 5 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment