Add SCAN command #579
Conversation
Thank you Pieter, for now I put it into the 2.8 milestone, but actually the code looks so small and self-contained, and its effects so good, that maybe it's worth considering porting it back into 2.6 after some testing. I'll try to understand the code today and provide feedback. Cheers!
FYI, this segfaults with 0 arguments.
OK, great, thanks. Then let's hope that this will be released in 2.6 :-) Thanks, keep up the good work guys!
Hello @pietern! I'm starting to re-evaluate the idea of an iterator for Redis, and the first item in this task is definitely to understand your pull request and implementation better. I don't exactly understand the implementation with the reversed bits counter, however for now that's not important; I would like to understand the behavior and guarantees of your implementation. If I'm correct, this is what I get:
Am I correct? Also, about the order the keys are visited in, could you provide some info? I see that if the table is not actively rehashing, keys are visited just incrementally from the first to the last bucket. However, what happens if instead there are two tables? Last question: is there a simple way to implement SCAN without low-level bit operations, but just taking two indexes instead of one (even if we'll conceptually glue the indexes together as a single one in order to expose it to the user), in order to make the code more obvious to the casual reader? Thank you for your help!
@antirez 👍 this is a really interesting take
@antirez Those guarantees are correct. Although I don't have a formal proof for them, I'm reasonably confident they hold. I worked through every hash table state (stable, grow, shrink) and the approach appears to work everywhere by means of the reverse binary iteration (for lack of a better word).

Nothing can be said about order, because the algorithm only deals with the hashes. From the point of view of the user, the order will be relatively random. Also, the number of keys emitted per operation can be anywhere between 0 and the maximum number of collisions in a slot.

To answer the question about multiple tables, it is important to first understand how the iteration works in the normal case. If a hash table is stable, you can iterate by enumerating the slots sequentially and emitting all entries in a slot. This clearly breaks down when you rehash, because you have no clue whether or not elements in slots have already been emitted. This is a problem that can be ignored, making the iteration algorithm suboptimal, but we can do better. Whenever a hash table is grown, its size doubles from one power of two to the next. In the base case, consider a hash table that grows from size 2 to size 4. Assume we have an iterator that has just started and has emitted the elements at slot 0 before the hash table is resized. When the iterator resumes after the hash table has resized, it should not emit elements in slots 0 or 2, but only the elements in slots 1 and 3.

To carry this information between iterations, a cursor is used. When the iterator is called, it first emits the elements in the slot that the cursor points to. It then masks the irrelevant bits of the cursor to one and performs an increment on the reversed bit string. For example: consider a hash table with size 4. Only the 2 LSB's are relevant in addressing the hash table's slots. This means that if we want an increment operation to have effect on these 2 LSB's, we need to mask the other 62 bits to one.
If this step is skipped, an increment on the reverse bit string might not carry all the way to the relevant 2 LSB's. Which bits are relevant depends on the size of the hash table: for a table of size 2^n, only the n least significant bits are relevant.

When a hash table is shrunk, multiple slots are collapsed into a single one. This means that it is impossible to fully avoid elements being emitted twice: if an iterator has already emitted the elements of one of the collapsed slots, it will see them again when the combined slot is visited. The trick that is applied when the hash table is in the process of being resized is that it always uses the smaller hash table as the reference, and emits both the elements in the slot in the smaller hash table and the elements in all expanded slots in the larger hash table. I hope this clarifies things. Cheers!
The irrelevant bits shouldn't be masked to 1. This can result in slots being skipped when the hash table is resized between calls to the iterator.
Hello, I believe I found an alternative to Pieter's algorithm, inspired by his algorithm but much simpler to describe, that has similar behavior.

Dict.c hash table layout

Redis hash tables always use power of two sizes and are implemented in the file dict.c. However it is possible that at a given time the hash table uses two internal tables, T0 and T1, in order to incrementally rehash data from one table to one that is better dimensioned for the number of elements currently in the table. Conflict resolution is obtained using chaining, so it is guaranteed that in a given table, a key is always in the same bucket. In general, in order to put a new element in the hash table, the hash H(key) is computed, trimmed to the size of the hash table by ANDing the hash with a bitmask (given that the table sizes are always powers of two), and the element is finally added to a linked list that is stored at the bucket. When a rehashing is in progress and there are two tables, new elements are always added to the second table, the one we are rehashing to. When there are two tables, read accesses are performed in the buckets of both tables, by trimming the hash function output with both masks, since we don't know if a key is still in the old table or was already rehashed to the new one.

Algorithm description
Proof (somewhat...)

Assume that there is an element that hashes to position K inside T0, or may hash to another position inside T1, if T1 exists when we start iterating. Only two cases are possible:
English explanation of the algorithm

By visiting a given index in both tables at every step, we cover every bucket a key can occupy while rehashing is in progress. This shows how, during the rehashing process, it is not possible for us to miss a key just because it moved from one place to the other. The "Proof" section explains why this also works globally, that is, why a given key that exists from start to end of the iteration is guaranteed to be returned at least one time.
Hey Salvatore, there are a couple of observations I made while reading your algorithm:
Maybe I'm missing something in the explanation of your algorithm, but as it stands it appears these things need to be fixed for us to maintain the guarantees we talked about earlier in this thread. Ciao!
Hello Pieter! Thanks. I'll analyze your concerns and reply here. Thanks! |
Btw, these concerns are addressed by the bitwise reverse iterator, since the cursor is oblivious to table size ;-) |
Hello Pieter, the similarities between the algorithms are due to the fact that I'm just trying to copy your algorithm as closely as possible, as I think it is very sound; I'm just trying to see it from a different point of view in order to lower the complexity. Now that I'm "into" it, it seems natural, but only because I spent a non-trivial amount of time studying it over the last few days.

Your concerns about my algorithm are very true, and the "stop" condition in my algorithm is wrong, so the result is that it skips some combinations when the new table is smaller. Indeed this can't happen with your implementation, as you test the higher bits first in a given combination, and the magic is that, if the table gets smaller, the mask changes so we test the lower combination again (for example, the table was 8 and we were iterating on higher bits like 100; what happens if the table is resized to 4 is that the new mask is "11", so we effectively try 00 again, which is the only place the 100 and 000 elements of the previous iteration can have moved to). At the same time, when the table gets bigger it just continues iterating, but because the mask changed, fewer bits are turned on by the inversion of the mask when the counter is incremented, so it will explore all the higher bit combinations first.

Super cool, but I still suspect that this same result can be obtained in a way that is simpler to explain, because the concept is indeed "try the expansions first, and later move to the next combination of lower bits", or something like that. I wonder if there is a way to make that more intuitive... so I'm investing some more time into this, and if I fail I'll just merge your code, trying to augment it with more comments... Thanks!
Ok, I suspect that the idea of counting backward to test the higher-level bits first was indeed the right approach in my modified algorithm, but the thing is, to really do it well we need to count down from the largest possible integer (ULONG_MAX, basically) to 0, using as index the integer masked with the current mask. Working more on that after lunch.
No way... to get the same result as Pieter's algorithm I'd have to make mine complex enough that it becomes a pointless exercise; we just have to explain his in better terms. The beauty of Pieter's approach is that it never does additional operations: by counting with the bits flipped, if the table gets bigger, for example from a 3-bit mask to a 4-bit mask, only the combinations not already tried with the previous mask are tried, since it sets the upper bits first. Similarly, when the mask is reduced, it only tests again the patterns that can be a compression of buckets not already checked in the previous table. Hard to explain but awesome.

At this point let's talk API :-) since I want to merge ASAP. My proposal:
In order to iterate over the keyspace.
In order to iterate over everything else, whatever the key is, as long as it is a set, sorted set, or hash. What do you think? Also if you want to play with Pieter's algorithm idea, here is a C program that simulates it:

#include <stdio.h>
#include <stdlib.h>

unsigned long size;
unsigned long v;

unsigned long rev(unsigned long v) {
    unsigned long s = 8 * sizeof(v); // bit size; must be power of 2
    unsigned long mask = ~0;
    while ((s >>= 1) > 0) {
        mask ^= (mask << s);
        v = ((v >> s) & mask) | ((v << s) & ~mask);
    }
    return v;
}

void init(void) {
    size = 8; /* Initial table size. */
    v = 0;    /* Initial cursor position. */
    printf("--- size: %d\n", (int)size);
}

/* Print the low 16 bits of v, with a marker at the table boundary. */
void print_bin(unsigned long v) {
    unsigned long bit = 1<<15;
    while(bit) {
        if (bit<<1 == size) printf("|");
        printf("%s", (v&bit) ? "1" : "0");
        bit >>= 1;
    }
    printf("\n");
}

/* Simulates a SCAN call. */
void scan(void) {
    unsigned long mask = size-1;
    print_bin(v & mask);
    printf("Visit %lu\n", v & mask);
    v |= ~mask;  /* Set the irrelevant high bits to one, ... */
    v = rev(v);  /* ... reverse, ... */
    v++;         /* ... increment, ... */
    v = rev(v);  /* ... and reverse again. */
}

int main(void) {
    int iter = 0;
    init();
    do {
        scan();
        iter++;
        if (iter == 3) {
            size = 16;
            printf("--- size: %d\n", (int)size);
        }
        if (iter == 6) {
            size = 4;
            printf("--- size: %d\n", (int)size);
        }
    } while(v);
    return 0;
}
feels like it should be SCAN or SCAN to stop any confusion. (Khushil Dep)
@khushil I have the feeling that it is much simpler if the key is the first argument, as this is how it usually works in Redis; however I'm not entirely happy with the fact that there is some sort of confusion possible. Maybe we could use SCAN for all the keys iteration, and another command name such as KSCAN for the keyspace; this also plays a lot better with Redis Cluster.
p.s. the second proposal would also allow me to merge Pieter's original implementation ASAP, just renaming it KSCAN, and only later add the SCAN command when it works with all the data types. I would also love to see this implemented by Pieter, if he has some time, since he is the expert in both the iteration algorithm AND the zipped data types one needs to deal with when iterating encoded objects.
@antirez What I mentioned the other day on twitter, is that the proposed approach (even if I like it a lot) breaks this assumption in redis-cluster clients: https://github.com/antirez/redis-rb-cluster/blob/master/cluster.rb#L115-L135
Wow, I was completely unaware that this feature was being discussed and developed. This is great :-) In my opinion, while the implementation may be similar in both cases, iterating over the keyspace and over the elements in a key are fundamentally different from a design point of view. That's why I'd vote for giving these two uses different names. For the keyspace I propose
Hey Damian! In order to simplify things a bit I'm thinking about, as Pieter suggested, to call the key space iterator just SCAN. However I think I'll use a different interface compared to the current one. What I propose is also an extension on the features. Here is a short description:
This is basically a SORT-alike syntax. Supported options:
That's it... For iterating keys I'll give it some more thought, but actually I'm not completely convinced by a generic command, because with the idea of options it makes sense to have different options for different types: for instance, one may want to iterate scores within a given range in sorted sets, or even have different semantics for the return value of sorted sets given that there is a score, so we'll likely end up with HSCAN, SSCAN, ZSCAN. Another reason for that, actually, is that when you read some source code it may be of some semantic help to read HSCAN, for instance. Also this one for example would return key-value pairs, so the client library may return a different format. Cheers,
👍 The options to SCAN that are already implemented are
Oh awesome Pieter! I had not looked into the actual implementation of the SCAN pattern matching, but apparently it is exactly as I wish it to be, so I can just merge at this point I guess, and add TYPE later :-)
p.s. also COUNT is better than MINELE. We just have to document that it is a minimum, as there is no way to guarantee that exactly COUNT elements are returned.
Perfect, I'm +1 on
The command scans until it finds
Disregard my previous comment, I read the code and it's clear now.
Btw, the API looks great now.
@soveran good catch, let's see the different cases. No TYPE or PATTERN option given: what you describe can theoretically happen, to be honest, but it is extremely unlikely, since the hash table size is always proportional to the number of elements it contains and the "low level" is 10% full. I can't imagine how in practice you could ever end up with something like a 1 million bucket hash table with 10 elements. When TYPE or PATTERN are given instead, it is extremely likely to happen :-) So I think that, yes, we should be allowed to return an empty set and not scan more than a given number of buckets per iteration. I think that semantically there are no problems at all with not returning elements; it is just a matter of implementation, and of course of documentation. Thanks a ton! That was really helpful.
p.s. I'm not sure why you disregarded your comment later. Is it already the case that it returns without elements? Reading the code myself...
I think there are two different meanings for
Ok, reading the code it is clear that the keys are collected in one place and then filtered later, so only the first source of high latency is possible, the one about a hash table that is almost empty but has many buckets, which is a rare condition. Given that the filtering step is performed later, this means we are already allowed to return zero elements in the current API. Because of this it will be easy to change the code later, if needed, to also break the busy loop if we see issues with that:
But AFAIK it will not be a source of issues.
And yes, because of the actual semantics, COUNT will be hard to explain to users.
Yes, I think
My "MINLEN" proposal was actually, semantically, the same as "COUNT" when using PATTERN or TYPE, so probably "COUNT" is a better name, since it really does not make the promises that "MINLEN" would imply :-)
On IRC @soveran suggested that the line:
Could be changed with:
While I don't usually touch things that work and have been tested in production for months (Twitter is using the SCAN iterator in production, as far as I can tell), it looks like a pretty safe change. Basically, given that m is guaranteed to be a power of two minus one, v|m will always have all of its lowest bits set to one. Because of the +1, all those low bits turn into 0, so masking with ~m performs no useful operation at all.
IRC again ... reminds me of a story not long ago :). This time the person could be right (logically, anyway), though my first impression from an -O3 run of gcc -S revealed the following difference:

_bitop1:
Leh_func_begin1:
    pushq   %rbp
Ltmp0:
    movq    %rsp, %rbp
Ltmp1:
    movq    %rsi, %rcx
    andq    %rdi, %rcx
    orq     %rsi, %rdi
    incq    %rdi
    notq    %rsi
    andq    %rdi, %rsi
    movq    %rsi, %rax
    orq     %rcx, %rax
    popq    %rbp
    ret
Leh_func_end1:

versus

_bitop2:
Leh_func_begin2:
    pushq   %rbp
Ltmp2:
    movq    %rsp, %rbp
Ltmp3:
    movq    %rsi, %rcx
    andq    %rdi, %rcx
    orq     %rdi, %rsi
    incq    %rsi
    movq    %rsi, %rax
    orq     %rcx, %rax
    popq    %rbp
    ret
Leh_func_end2:

My asm is a bit rusty at the moment, however I believe the two to be identical for all unsigned long values on 64-bit and 32-bit architectures, which essentially boils down to a difference of:

    orq     %rsi, %rdi
    incq    %rdi
    notq    %rsi
    andq    %rdi, %rsi

versus

    orq     %rdi, %rsi
    incq    %rsi

Or if you prefer Clang:

    orq     %rsi, %rdi
    leaq    1(%rdi), %rax
    notq    %rsi
    andq    %rsi, %rax

versus

    orq     %rdi, %rsi
    leaq    1(%rsi), %rax

Could be an interesting SO question regarding compiler bit operation optimisations; pseudo code here:

#include <stdio.h>
#include <stdlib.h>

unsigned long bitop1(unsigned long v, unsigned long m0) {
    return (((v | m0) + 1) & ~m0) | (v & m0);
}

unsigned long bitop2(unsigned long v, unsigned long m0) {
    return ((v | m0) + 1) | (v & m0);
}

int main(void) {
    /*
    printf("bitop1: 0x%08lx --- bitop2: 0x%08lx\n",
        bitop1(0x0, 0x0),
        bitop2(0x0, 0x0));
    */
    bitop1(0x0, 0x0);
    bitop2(0x0, 0x0);
    exit(EXIT_SUCCESS);
}
I don't think the compiler could ever understand that "m0" is always a power of two minus one, hence the difference.
I'll write up the details of this technique later (I wrote a nice and detailed commit message, but vim trashed it).