Implement lightweight SBT combining/adding for large SBTs #229

Closed
ctb opened this Issue May 17, 2017 · 3 comments

@ctb
Member

ctb commented May 17, 2017

In response to @meren,

I am sure there is a way to add more genomes (incrementally maybe? or by re-computing the entire thing?) to the database.

We actually can do this in a few different ways:

the heaviest weight way right now is to combine or update the database, which is not that time/resource intensive but is still inconvenient. (The database can be updated mostly incrementally; it’s a Sequence Bloom Tree underneath). We have a command line way to do this with ‘sourmash sbt_combine’.

the medium weight way (mostly just frustrating) is to have sbt_gather output unknown bits of the signature. Then you could do iterative search (run sbt_gather on database A, take what remains, run it on database B, etc.). There are many reasons to support it and it’s very easy, so we will probably add it next time I need it myself.

the lightest weight way to do this is not yet supported but is an hour of hacking away: let the sbt_gather and sbt_search commands take multiple SBTs. The SBT search is very lightweight in terms of memory and resources (searching all of GenBank takes seconds and < 500 MB of RAM), so simply running it 2x or 3x on multiple databases and then massaging the results is not difficult. But I am trying to be a bit careful about complexifying the command line, so I am hesitant to blindly add it. Easy to do once we need it, tho.
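To make the iterative search idea concrete, here is a rough sketch in plain Python. It is only a toy model, not sourmash code: a signature is assumed to be a set of integer hashes, a database is assumed to be a dict of such sets, and the names gather_one and iterative_gather are made up for illustration. The real sbt_gather works on an SBT and may pick matches differently.

```python
# Toy model only: a "signature" is a set of integer hashes and a "database"
# is a dict mapping a name to such a set. None of this is sourmash's API.

def gather_one(query, database):
    """Greedily assign query hashes to database entries.

    Returns (matches, unassigned), where matches is a list of
    (entry_name, number_of_hashes_assigned) tuples.
    """
    remaining = set(query)
    matches = []
    while remaining:
        # Pick the entry sharing the most hashes with what is still unassigned.
        best_name, best_overlap = None, set()
        for name, hashes in database.items():
            overlap = remaining & hashes
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        if best_name is None:
            break  # nothing in this database explains the rest
        matches.append((best_name, len(best_overlap)))
        remaining -= best_overlap
    return matches, remaining


def iterative_gather(query, databases):
    """Run gather_one on each database in turn, passing the leftovers along."""
    all_matches = []
    remaining = set(query)
    for db_name, database in databases:
        matches, remaining = gather_one(remaining, database)
        all_matches.extend((db_name, name, n) for name, n in matches)
    return all_matches, remaining


# Made-up example data.
db_a = {"genome1": {1, 2, 3, 4}, "genome2": {5, 6}}
db_b = {"genome3": {7, 8, 9}}
query = {1, 2, 3, 5, 7, 8, 42}

matches, unassigned = iterative_gather(query, [("A", db_a), ("B", db_b)])
print(matches)     # which entries absorbed how many query hashes
print(unassigned)  # hashes no database could explain
```

The unassigned set returned at each step plays the role of the "unknown bits of the signature": it is what gets handed to the next database.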

@luizirber

Member

luizirber commented May 17, 2017

On top of sbt_combine there is also PR 120 for adding a --append option to sbt_index, which would open an existing SBT and add new signatures to it.

And I actually think implementing all three options is useful; they cover many different use cases.

@ctb

Member

ctb commented May 19, 2017

#240 adds the second and third options - you can now do --output-unassigned to get unassigned hashes in a single signature, and you can do sourmash gather query.sig sbt1 sig2 sbt3 sig4 to search/gather multiple SBTs and signatures.
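As a rough illustration of what searching several databases and then combining the results can look like, here is a small Python sketch. It is not how #240 is implemented: the databases are assumed to be plain dicts of hash sets, the names jaccard and search_many are hypothetical, and the threshold value is arbitrary.

```python
# Toy model only: each "database" is a dict mapping a name to a set of hashes.
# The database names reuse sbt1/sbt3 from the command above purely as labels.

def jaccard(a, b):
    """Jaccard similarity of two hash sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_many(query, databases, threshold=0.1):
    """Search every database and return one merged, ranked list of hits."""
    results = []
    for db_name, database in databases.items():
        for name, hashes in database.items():
            score = jaccard(query, hashes)
            if score >= threshold:
                results.append((score, db_name, name))
    # "Massaging the results" here just means sorting all hits together.
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Made-up example data.
databases = {
    "sbt1": {"genomeA": {1, 2, 3, 4}, "genomeB": {10, 11}},
    "sbt3": {"genomeC": {1, 2, 3, 4, 5, 6}},
}
query = {1, 2, 3, 4, 5}

hits = search_many(query, databases, threshold=0.5)
for score, db_name, name in hits:
    print(f"{score:.3f}  {db_name}  {name}")
```

Because each database is searched independently, the per-database cost stays the same and only the final merge step is new.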

@ctb

Member

ctb commented May 21, 2017

Closed by #120 and #240.

@ctb ctb closed this May 21, 2017
