Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to best do random access with MTBL files? #15

Open
snarkmaster opened this issue May 26, 2015 · 1 comment
Open

How to best do random access with MTBL files? #15

snarkmaster opened this issue May 26, 2015 · 1 comment

Comments

@snarkmaster
Copy link
Contributor

The use case is "given a key prefix, sample a random key having that prefix".

The old proposal

On IRC, I pitched the following scheme to @edmonds --

  1. Prepend an 8-byte "autoincrement key" to each actual key, see below.
  2. Allow querying keys with a "skip prefix" option.

For example, here are the keys -- the brackets are for ease of understanding:

[0000][123456]
[0001][373737]
[0002][37deadbe]
[0003][37ffffffff00000ba7]
[0004][ffffffffffffffff]

Let's say that you want a random key with the prefix 37. This can be done in 3 lookups:

  1. Ignoring the first 2 bytes of each key, look up the first key with prefix 37 -- mtbl_source_get_strip_prefix(src, 2, "37", 2). Make a note of the first 2 bytes of the resulting key.
  2. Do the same to find the first key not having the prefix 37 -- mtbl_source_get_strip_prefix(src, 2, "38", 2). Take the first two bytes of the key, and subtract 1.
  3. Generate a random number between the values from (a) and (b), inclusive. Look up by the resulting 2 bytes, mtbl_source_get(src, "\x00\x02", 2).

The new suggestion

I ran the above scheme by @tudor, who pointed out that it has a few problems:

  1. It effectively breaks prefix compression, which can be a substantial size / perf hit.
  2. Its lookup cost is logarithmic. I was willing to accept that, but @tudor had a better fix.

His suggestion was to extend the MTBL API so that one can quickly seek to a specific key. This would take the shape of something like (restartable block offset, key index from that offset). I haven't yet checked if this can be boiled down to a single number, but that's beside the point.

Then, I can store (e.g. in the "foreign prefix" section of the MTBL file) an uncompressed list of key addresses, in order -- a single seek to base_offset + autoincrement_id tells you how to find the actual key.

In order to look up autoincrement_id from the key, it's easy enough to appendthe autoincrement ID to the key. Then, prefix compression works well, and lookups on the original keys work well.

Conclusion

It seems like adding mtbl_source_get_strip_prefix is not the best option, and the better option is to allow efficient serialization / deserialization of iterators. If there are no objections, I might try to prototype this.

Thoughts?

@edmonds
Copy link
Contributor

edmonds commented May 27, 2015

Hi, Alexey:

Sounds interesting! I'd be happy to review a patch implementing the new proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants