(Please see http://groups.google.com/group/redis-db/browse_thread/thread/67d1b0bbe7669071 for full details; here I'm just summarizing the problem with some cut&paste from that thread.)
If you are familiar with the design you know that the key space is split into 4096 parts.
Every part is called a "hash slot", and every node has a routing table that maps every hash slot to a cluster node.
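As a rough illustration of the routing table idea, here is a minimal Python sketch. The hash function (CRC16 via `binascii.crc_hqx`, modulo the slot count), the two-node layout, and the names `key_slot`, `routing_table` and `handle_query` are all assumptions made for the example; they are not taken from the actual implementation.

```python
import binascii

NUM_SLOTS = 4096  # number of hash slots mentioned above


def key_slot(key: bytes) -> int:
    # Assumption: CRC16 of the key modulo the number of slots.
    return binascii.crc_hqx(key, 0) % NUM_SLOTS


# Hypothetical routing table: index = hash slot, value = node name.
routing_table = ["A" if slot < NUM_SLOTS // 2 else "B" for slot in range(NUM_SLOTS)]


def node_for_key(key: bytes) -> str:
    return routing_table[key_slot(key)]


def handle_query(node: str, key: bytes) -> str:
    """Reply of `node` to a query about `key` in a stable cluster."""
    owner = node_for_key(key)
    if owner == node:
        return "OK"  # this node is responsible for the key, serve the query
    return f"-MOVED {key_slot(key)} {owner}"  # redirect to the right node
```

Every node holds the same table, so whichever node the client picks, it is either served or told exactly where to go.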
This way, if a client sends a query to a node that is not responsible for the keys mentioned in the query, it gets a -MOVED message redirecting it to the right node. However, we also have the ability to reconfigure the cluster while it is running. For instance, say hash slot 100 is assigned to node A and I want to move it to node B.
This is accomplished (and redis-trib is already able to do it for you automatically) with the following steps.
What is interesting is that while the hash slot is set as "Migrating to B", node A still replies to every request about a key of this hash slot that is still present in its key space. If a request is about a key that belongs to hash slot 100 but is not found inside the key space, A generates a "-ASK" error. It is like "-MOVED" but means: please send only this exact query to the specified node, but don't update your table about it. Keep sending new queries about hash slot 100 to me.
This way all the new keys in hash slot 100 are created directly in B, while A handles all the queries about keys that are still in A. At the same time redis-trib moves keys from A to B. Eventually all the keys are moved and the hash slot configuration is consolidated to the new one, using other CLUSTER SETSLOT subcommands.
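The behavior of node A while the slot is migrating can be sketched as follows. This is only a toy model in Python: the dict-based key space and the helper names `reply_from_migrating_node` and `migrate_key` are made up for illustration.

```python
def reply_from_migrating_node(store: dict, slot: int, target: str, key: str) -> str:
    """Reply of the MIGRATING node to a query about `key` in the migrating slot."""
    if key in store:
        return store[key]  # key still lives here: serve it as usual
    # Key not found: it was already moved, or it is new. Send the client to the
    # target for this query only; the client must not update its table.
    return f"-ASK {slot} {target}"


def migrate_key(source: dict, dest: dict, key: str) -> None:
    # Hypothetical stand-in for what redis-trib does for each key.
    dest[key] = source.pop(key)
```

For example, with `store_a = {"foo": "1"}`, a query for `foo` is served by A, a query for a missing key gets `-ASK 100 B`, and after `migrate_key(store_a, store_b, "foo")` queries for `foo` get `-ASK 100 B` as well.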
So far this is pretty cool. But there is a subtle problem with this.
When the cluster is stable, that is, there is no resharding in progress, a client may send a query to a random node.
There is only one node that will reply to queries related to a specific hash slot. All the other nodes will redirect the client to this node. However, when rehashing is in progress there are two nodes that will reply to queries for a given hash slot, that is, the MIGRATING node and the IMPORTING node.
If the client is a "smart" client with an internal routing table, it starts every connection to a cluster by asking for the slot->node map, and makes sure to update the table when -MOVED messages are received. But there are also clients that are not smart, without a table, or even clients that are smart but have not updated their table for a long time because they were idle while the cluster moved a lot of hash slots. To make things simpler, let's just focus on the stupid client that has no internal map. It just sends queries to a random node among a list of configured nodes, expecting to get redirected if the wrong node was selected.
Such a simple client is only able to deal with -MOVED and -ASK redirections, and it handles the two messages in the same way, that is, by just querying the node specified in the redirection message. It is easy to see how this client may create a race condition, like this:
So in the case above, smart clients will have no problems: after a -ASK redirection they will send:
ASKING
LPUSH foo bar
ASKING sets a flag that is cleared after the next command is executed. If a client is dumb (no internal routing table cache) but is still able to remember that after a -ASK redirection it should start the next query with ASKING, everything is fine as well.
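To make the one-shot flag concrete, here is a small Python model of the IMPORTING node. The `ImportingNode` class and its `execute` method are hypothetical, and the -MOVED reply for queries that arrive without ASKING is my reading of the behavior discussed here, not something copied from the code.

```python
class ImportingNode:
    """Toy model of node B while it is importing a hash slot from `owner`."""

    def __init__(self, slot: int, owner: str):
        self.slot = slot
        self.owner = owner
        self.store = {}
        self.asking = False

    def execute(self, command: str, *args):
        if command == "ASKING":
            self.asking = True
            return "+OK"
        # The flag only covers the command that immediately follows ASKING.
        asking, self.asking = self.asking, False
        if not asking:
            # Assumption: without ASKING the client is sent back to the owner.
            return f"-MOVED {self.slot} {self.owner}"
        if command == "LPUSH":
            key, value = args
            self.store.setdefault(key, []).insert(0, value)
            return len(self.store[key])
        return "-ERR unknown command"
```

With `b = ImportingNode(100, "A")`, a bare `b.execute("LPUSH", "foo", "bar")` is redirected, while `b.execute("ASKING")` followed by the same LPUSH is served, and the command after that is redirected again.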
A completely stupid client that is not able to start the conversation with ASKING will simply ping-pong between A and B until the hash slot migration is completed, and will finally be served.
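The ping-pong can be simulated in a few lines of Python. This is just a sketch under simplifying assumptions: the key is never found in A, the migration is modelled as finishing after a fixed number of hops, and `simulate_dumb_client` is a hypothetical helper.

```python
def simulate_dumb_client(hops_until_migration_done: int, max_hops: int = 100) -> list:
    """Return the nodes visited by a client that never sends ASKING."""
    visited = []
    node = "A"
    for hop in range(max_hops):
        visited.append(node)
        migration_done = hop >= hops_until_migration_done
        if node == "A":
            # A does not have the key: it replies -ASK (or -MOVED once the
            # slot is consolidated). Either way the client goes to B.
            node = "B"
        elif migration_done:
            return visited  # B now owns the slot and serves the query
        else:
            node = "A"  # B without ASKING redirects back to the owner
    raise RuntimeError("migration did not complete within max_hops")
```

For instance, `simulate_dumb_client(3)` visits A, B, A, B and is served on the last hop, so the client pays extra round trips but never gets a wrong answer.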
This was implemented, but the implementation still needs to be verified. Keeping the issue open for now.
Implemented a long time ago, but the issue was not closed. Closing.