-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sbt_gather documentation #129
Comments
Hi, sbt_gather is a demo implementation based on some research concepts
that we haven't fully explored. @philliptbrooks is working on
evaluating the basic approach, and I would be happy to put you in touch
with him via e-mail (titus@idyll.org to reach me).
A few overly specific answers:
* one of the goals of gather is to estimate how much of a particular
signature comes from a known species; for this, you need to normalize
for genome size. This can be done using --with-cardinality or --scaled.
* --scaled uses k-mer hash space banding ("banded shingling") to build minhash
signatures with comparable resolution. The parameter is the denominator
for the sampling rate, e.g. --scaled 10000 gives you a sketch where 1 in
10,000 k-mers has been sampled from the input data set.
I had not seen the asymmetric minhash publication - thanks! I'll have
to read it and understand it before commenting.
|
Ah, I see. Could you elaborate on what "banded shingling" is? Does it mean splitting the full input set of k-mers/ their hashes into bands ? If so, given b bands, you then select the x_b smallest hash values from each band so that sum(x) == desired_resolution, right? Hearing "band" and "minhash" reminds me of the "metahashing" that you can do to a minhash signature, which reduces the nearest-neighbor search space significantly, see chapter 3 of MMDS, 3.4.1. Wondering if this is related. |
On Tue, Feb 07, 2017 at 11:49:05PM -0800, viehwegerlib wrote:
Ah, I see. Could you elaborate on what "banded shingling" is? Does it mean splitting the full input set of k-mers/ their hashes into bands ? If so, given b bands, you then select the x_b smallest hash values from each band so that sum(x) == desired_resolution, right?
Almost - what we're doing is selecting one band (the lowest, currently, but
can be any band) to pay attention to, and discarding the rest of the kmers.
The number of bands ~ band size is chosen to set the desired resolution,
e.g. if --scaled L, then 4**64/L bands, and 1/L of the k-mers are sampled
for signature creation.
Hearing "band" and "minhash" reminds me of the "metahashing" that you can do to a minhash signature, which reduces the nearest-neighbor search space significantly, see chapter 3 of [MMDS, 3.4.1](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf). Wondering if this is related.
I don't think it is, directly, but all of this stuff is entangled :)
|
Lets see if I understand correctly:
|
64 bits is the size of our hash function output (and that 4**64 should have
been 2**64)
…On Wed, Feb 15, 2017 at 12:49 PM, viehwegerlib ***@***.***> wrote:
Lets see if I understand correctly:
The number of bands ~ band size is chosen to set the desired resolution,
e.g. if --scaled L, then 4**64/L bands, and 1/L of the k-mers are sampled
for signature creation.
1. The "lowest" band means the lowest lexicographically ordered set of
k-mers (of size_band), i.e. the k-mer space is lexicographically ordered
prior to applying bands?
2. Sorry if this might be obvious, but where does the "64" come from?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#129 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAZ8jliS6ktNdwg8JLcla6PZ6we2FfEks5rc2TmgaJpZM4L5a9h>
.
|
Ah, ok. What I don't get is this: You focus on one (arbitrary) band of k-mers, discarding all the others. Would it make a difference if you selected n random k-mers, where n is the size of one band? (I promise to shut up after that question :) |
We are picking the lowest band because it is a MinHash, if you get the
highest band you get a MaxHash.
Being consistent in how you choose kmers (minimum, or maximum) is a way of
biasing the selection to make signatures comparable (which is harder if you
select random kmers).
(And please don't shut up =] )
…On Wed, Feb 15, 2017 at 1:01 PM, viehwegerlib ***@***.***> wrote:
Ah, ok. What I don't get is this: You focus on one (arbitrary) band of
k-mers, discarding all the others. Would it make a difference if you
selected n random k-mers, where n is the size of one band? (I promise to
shut up after that question :)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#129 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAZ8hMJuXOD8vAA9g6mB3tA1E3bTNKVks5rc2fBgaJpZM4L5a9h>
.
|
I see, this confused me a little:
Makes perfect sense. Thanks a lot! |
Dear all,
What does the sbt_gather command do exactly? On Titus' blog the following appears in the comments section:
and that sbt_gather
What does that mean exactly? I managed to run the command, but ran into more questions:
Why can it only be used with signatures that have been computed with both, --with-cardinality and --scaled?
What does --scaled do? E.g. what does --scaled 50 "mean"? I see that the signature is shorter than what I set with the -n argument, and that the "number" field in the signature JSON has some large integer in it, but I couldn't figure out from these changes what exactly occurs. If I were to guess is that it is related to the minhash "sampling rate", i.e. n / cardinality?
Does the sbt_gather method have any relation to e.g. this asymmetric minhash publication?
Thank you very much for your help!
The text was updated successfully, but these errors were encountered: