The use case is "given a key prefix, sample a random key having that prefix".
The old proposal
On IRC, I pitched the following scheme to @edmonds --
For example, here are the keys -- the brackets are for ease of understanding:
[0000][123456]
[0001][373737]
[0002][37deadbe]
[0003][37ffffffff00000ba7]
[0004][ffffffffffffffff]
Let's say that you want a random key with the prefix 37. This can be done in 3 lookups:
(a) Ignoring the first 2 bytes of each key, look up the first key with prefix 37 -- mtbl_source_get_strip_prefix(src, 2, "37", 2). Make a note of the first 2 bytes of the resulting key.
(b) Do the same to find the first key not having the prefix 37 -- mtbl_source_get_strip_prefix(src, 2, "38", 2). Take the first two bytes of the key, and subtract 1.
(c) Generate a random number between the values from (a) and (b), inclusive. Look up by the resulting 2 bytes, mtbl_source_get(src, "\x00\x02", 2).
The new suggestion
I ran the above scheme by @tudor, who pointed out that it has a few problems:
- It effectively breaks prefix compression, which can be a substantial size / perf hit.
- Its lookup cost is logarithmic. I was willing to accept that, but @tudor had a better fix.
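To make the old scheme and its logarithmic cost concrete, here is a minimal in-memory sketch of the three lookups. It is a model, not MTBL code: a sorted Python list stands in for the MTBL source, `lower_bound` stands in for both the proposed mtbl_source_get_strip_prefix and the final lookup, and the prefix "37" is treated as the single byte 0x37. All helper names (`lower_bound`, `successor`, `random_key_with_prefix`) are made up for illustration.

```python
import random

COUNTER_LEN = 2  # width of the autoincrementing counter prepended to each key

# The example keys: 2-byte counter followed by the original key (hex).
KEYS = [
    bytes.fromhex("0000" + "123456"),
    bytes.fromhex("0001" + "373737"),
    bytes.fromhex("0002" + "37deadbe"),
    bytes.fromhex("0003" + "37ffffffff00000ba7"),
    bytes.fromhex("0004" + "ffffffffffffffff"),
]


def lower_bound(keys, needle, strip=0):
    """Binary search: first key whose bytes after the first `strip` are >= needle.

    With strip=COUNTER_LEN this models the proposed strip-prefix lookup;
    it relies on counters having been assigned in original-key order.
    """
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid][strip:] < needle:
            lo = mid + 1
        else:
            hi = mid
    return keys[lo] if lo < len(keys) else None


def successor(prefix):
    """Smallest byte string greater than every string starting with `prefix`."""
    p = bytearray(prefix)
    while p and p[-1] == 0xFF:
        p.pop()
    if not p:
        return None  # prefix is all 0xFF bytes: nothing sorts after it
    p[-1] += 1
    return bytes(p)


def random_key_with_prefix(keys, prefix, rng=random):
    # Lookup (a): first key with the prefix, ignoring the counter bytes.
    first = lower_bound(keys, prefix, strip=COUNTER_LEN)
    if first is None or not first[COUNTER_LEN:].startswith(prefix):
        return None
    low = int.from_bytes(first[:COUNTER_LEN], "big")

    # Lookup (b): first key past the prefix range; its counter minus 1.
    upper = successor(prefix)
    past = lower_bound(keys, upper, strip=COUNTER_LEN) if upper is not None else None
    if past is not None:
        high = int.from_bytes(past[:COUNTER_LEN], "big") - 1
    else:
        high = int.from_bytes(keys[-1][:COUNTER_LEN], "big")

    # Lookup (c): pick a counter in [low, high] and look the key up by it.
    pick = rng.randint(low, high)
    return lower_bound(keys, pick.to_bytes(COUNTER_LEN, "big"))
```

For the example keys, `random_key_with_prefix(KEYS, b"\x37")` returns one of the keys with counters 0001 through 0003. Each of the three lookups is a binary search, which is the logarithmic cost mentioned above.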
His suggestion was to extend the MTBL API so that one can quickly seek to a specific key. This would take the shape of something like (restartable block offset, key index from that offset). I haven't yet checked if this can be boiled down to a single number, but that's beside the point.
Then, I can store (e.g. in the "foreign prefix" section of the MTBL file) an uncompressed list of key addresses, in order -- a single seek to base_offset + autoincrement_id tells you how to find the actual key.
In order to look up autoincrement_id from the key, it's easy enough to append the autoincrement ID to the key. Then, prefix compression works well, and lookups on the original keys work well.
Conclusion
It seems like adding mtbl_source_get_strip_prefix is not the best option, and the better option is to allow efficient serialization / deserialization of iterators. If there are no objections, I might try to prototype this.
Thoughts?
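For discussion, here is a rough in-memory sketch of how the new suggestion might fit together. It is only a model under several assumptions. Blocks are plain Python lists. An address is a (block number, index in block) pair standing in for (restartable block offset, key index). The ID is 2 bytes wide. No key is a prefix of another, so appending the ID does not change the sort order. None of the names below exist in MTBL.

```python
import random

ID_LEN = 2       # assumed width of the appended autoincrement ID
BLOCK_SIZE = 2   # assumed (tiny) number of keys per block, for illustration

ORIGINAL = [bytes.fromhex(h) for h in
            ("123456", "373737", "37deadbe", "37ffffffff00000ba7", "ffffffffffffffff")]

# Stored keys: original key with the autoincrement ID appended, in sorted order.
STORED = [k + i.to_bytes(ID_LEN, "big") for i, k in enumerate(ORIGINAL)]
BLOCKS = [STORED[i:i + BLOCK_SIZE] for i in range(0, len(STORED), BLOCK_SIZE)]

# Uncompressed list of key addresses, indexed by autoincrement ID.
ADDRESSES = [(i // BLOCK_SIZE, i % BLOCK_SIZE) for i in range(len(STORED))]


def seek(address):
    """Jump straight to a key given its (block, index) address."""
    block, index = address
    return BLOCKS[block][index]


def first_id_at_or_after(needle):
    """ID of the first stored key >= needle, or len(STORED) if there is none."""
    lo, hi = 0, len(STORED)
    while lo < hi:
        mid = (lo + hi) // 2
        if STORED[mid] < needle:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(STORED):
        return len(STORED)
    return int.from_bytes(STORED[lo][-ID_LEN:], "big")


def successor(prefix):
    """Smallest byte string greater than every string starting with `prefix`."""
    p = bytearray(prefix)
    while p and p[-1] == 0xFF:
        p.pop()
    if not p:
        return None
    p[-1] += 1
    return bytes(p)


def sample(prefix, rng=random):
    """Pick a random original key with the given prefix, or None if there is none."""
    low = first_id_at_or_after(prefix)
    upper = successor(prefix)
    high = (first_id_at_or_after(upper) if upper is not None else len(STORED)) - 1
    if low > high:
        return None
    stored_key = seek(ADDRESSES[rng.randint(low, high)])  # one direct seek
    return stored_key[:-ID_LEN]                           # drop the appended ID
```

In this model the two range bounds still come from ordinary key lookups, but the final fetch is a single table read plus a direct seek, and the stored keys keep their shared leading bytes because the ID sits at the end.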