sbt_gather documentation #129

Closed
phiweger opened this issue Feb 7, 2017 · 8 comments

@phiweger

phiweger commented Feb 7, 2017

Dear all,

What does the sbt_gather command do exactly? On Titus' blog the following appears in the comments section:

Second, the asymmetry between query subsampling and subject subsampling. We have a solution to this but it's still several blog posts out and the software is not robust. Regardless, it is all in the sourmash version you have installed -- build bigger minhashes for your query by doing something like '-n 10000' and then look at the 'sbt_gather' command.

and that sbt_gather

Tries to pull a MinHash signature apart into its constituents.

  1. What does that mean exactly? I managed to run the command, but ran into more questions:

  2. Why can it only be used with signatures that have been computed with both --with-cardinality and --scaled?

  3. What does --scaled do? For example, what does --scaled 50 "mean"? I see that the signature is shorter than what I set with the -n argument, and that the "number" field in the signature JSON contains some large integer, but I couldn't figure out from these changes what exactly happens. My guess is that it is related to the minhash "sampling rate", i.e. n / cardinality?

  4. Does the sbt_gather method have any relation to e.g. this asymmetric minhash publication?

Thank you very much for your help!

@ctb
Contributor

ctb commented Feb 7, 2017 via email

@phiweger
Author

phiweger commented Feb 8, 2017

Ah, I see. Could you elaborate on what "banded shingling" is? Does it mean splitting the full input set of k-mers (or their hashes) into bands? If so, given b bands, you then select the x_b smallest hash values from each band so that sum(x) == desired_resolution, right?

Hearing "band" and "minhash" reminds me of the "metahashing" that you can do to a minhash signature, which reduces the nearest-neighbor search space significantly; see chapter 3 of MMDS (Mining of Massive Datasets), section 3.4.1. I wonder whether this is related.

@ctb
Contributor

ctb commented Feb 8, 2017 via email

@phiweger
Author

Let's see if I understand correctly:

The number of bands ~ band size is chosen to set the desired resolution, e.g. if --scaled L, then 4**64/L bands, and 1/L of the k-mers are sampled for signature creation.

  1. The "lowest" band means the lowest lexicographically ordered set of k-mers (of size_band), i.e. the k-mer space is lexicographically ordered prior to applying bands?
  2. Sorry if this might be obvious, but where does the "64" come from?
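
To make the reading above concrete, here is a minimal sketch of "keep only the k-mers whose hash falls in the lowest band". It is not sourmash's code. It assumes that hash values are 64-bit integers, so the hash space holds 2**64 values (which would be one possible source of the "64"), and it uses blake2b from the Python standard library as a stand-in hash. The names hash64, kmers and scaled_sketch are hypothetical.

```python
import hashlib

HASH_SPACE = 2**64  # assumption: hash values are 64-bit integers


def hash64(kmer: str) -> int:
    """Stand-in 64-bit hash of a k-mer (not the hash function sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def scaled_sketch(seq: str, k: int, scaled: int) -> set:
    """Keep the hashes that fall in the lowest band [0, HASH_SPACE // scaled)."""
    max_hash = HASH_SPACE // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < max_hash}


if __name__ == "__main__":
    import random

    rng = random.Random(42)
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    total = len({hash64(x) for x in kmers(seq, 21)})
    kept = len(scaled_sketch(seq, 21, scaled=50))
    print(f"distinct k-mer hashes: {total}, kept with scaled=50: {kept}")
```

Under these assumptions the band is defined on the hash values rather than on the k-mers themselves, and because the hash spreads k-mers roughly uniformly over the hash space, about 1/L of the distinct k-mers land in the lowest band. The sketch therefore grows with the size of the input instead of being capped at a fixed -n.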

@luizirber
Member

luizirber commented Feb 15, 2017 via email

@phiweger
Author

Ah, ok. What I don't get is this: You focus on one (arbitrary) band of k-mers, discarding all the others. Would it make a difference if you selected n random k-mers, where n is the size of one band? (I promise to shut up after that question :)

@luizirber
Member

luizirber commented Feb 15, 2017 via email

@phiweger
Author

I see, this confused me a little:

Almost - what we're doing is selecting one band (the lowest, currently, but can be any band).

Makes perfect sense. Thanks a lot!
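
For readers following along, the difference between "one fixed band" and "n random k-mers" can be sketched with a small simulation. This is only an illustration under stated assumptions: hash values are modelled as uniform random 64-bit integers, the two data sets are synthetic, and lowest_band and jaccard are hypothetical helper names, not sourmash functions.

```python
import random

HASH_SPACE = 2**64  # assumption: 64-bit hash values


def lowest_band(hashes, scaled):
    """Keep the hashes in the lowest band [0, HASH_SPACE // scaled)."""
    max_hash = HASH_SPACE // scaled
    return {h for h in hashes if h < max_hash}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


rng = random.Random(1)
# Two synthetic "genomes" as sets of hash values that share half of their union.
shared = {rng.randrange(HASH_SPACE) for _ in range(50_000)}
only_a = {rng.randrange(HASH_SPACE) for _ in range(25_000)}
only_b = {rng.randrange(HASH_SPACE) for _ in range(25_000)}
a, b = shared | only_a, shared | only_b

true_j = jaccard(a, b)

# Same band on both sides: a shared hash is kept in both sketches or in neither.
band_j = jaccard(lowest_band(a, 100), lowest_band(b, 100))

# Independent random subsets of the same size: a shared hash is rarely
# picked on both sides, so the overlap mostly disappears.
n = len(lowest_band(a, 100))
rand_a = set(rng.sample(sorted(a), n))
rand_b = set(rng.sample(sorted(b), n))
rand_j = jaccard(rand_a, rand_b)

print(f"true: {true_j:.3f}  same band: {band_j:.3f}  independent random: {rand_j:.3f}")
```

In this toy setting any single band would do, as long as every signature uses the same one. Choosing n k-mers independently at random for each signature gives subsets of a similar size, but they are no longer comparable with each other.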
