Clients with blocking mutation commands block forever when master demoted to slave #5770

Open
nickwilliams-eventbrite opened this Issue Jan 11, 2019 · 0 comments

nickwilliams-eventbrite commented Jan 11, 2019

We are using Redis version 3.0.6-1ubuntu0.2 in a Sentinel configuration of one master and two slaves.

Our code implements a pub-sub-like architecture using RPUSH and BLPOP. It is designed to gracefully detect failovers, reload master information, and reconnect to the new master. This has been tested exhaustively with simulated master failures for various reasons and it has worked consistently.

This week, however, we did something we hadn't foreseen doing and therefore hadn't tested for (and this was, for sure, our oversight): We intentionally demoted a master for maintenance. That did not go well.

We used the command redis-cli -p 26379 sentinel failover [...] to initiate the planned promotion/demotion. The problem arose in our code that executes BLPOP [key name] 5 (so, a timeout of five seconds). More than half of those clients blocked forever. We can replicate this only about 30% of the time and it appears to be a weird timing issue (race condition?) that is darn near impossible to trace. We don't know for sure the exact circumstances and can formulate only a decent theory.

During a demotion of a master to a slave, our expectation is that one of the following would happen to all clients that are currently blocked in a BLPOP on the once-master-now-slave:

  • A READONLY error would be returned (because the master has now been demoted to a slave)
  • A block timeout would occur after 5 seconds (or, preferably, less, at the moment of demotion)
  • The client connection would be closed

And this expectation appears to bear out most of the time. Our clients detect the issue, recognize it as a failover, refresh master information from Sentinel, connect to the new master, and call BLPOP again. But sometimes (not even most of the time), most of the clients will instead block indefinitely, waiting for BLPOP to return or time out, and it never does. The only thing we've been able to determine for sure is that the code is blocked on that command. We haven't been able to determine what the once-master-now-slave Redis instance is doing at that time that would block the command forever. All of our calls to BLPOP include a timeout > 0 and < 60, so that is definitely not the issue. These clients blocked for > 5 minutes until we forcefully killed the processes.
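For anyone trying to reason about the recovery path described above, here is a minimal Python sketch of it, with one addition that is not in our current code: a client-side deadline slightly longer than the BLPOP timeout, so the client gives up on its own if the server never answers. This is not our production code. The names `pop_with_deadline`, `discover_master`, `blocking_pop`, `ReadOnlyError`, and the 2-second grace period are all hypothetical; `discover_master` stands in for a Sentinel lookup and `blocking_pop` for a BLPOP call made on a socket with a read deadline.

```python
import socket
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]
Item = Tuple[str, str]


class ReadOnlyError(Exception):
    """Hypothetical: raised by blocking_pop when the server replies READONLY."""


def pop_with_deadline(
    discover_master: Callable[[], Address],
    blocking_pop: Callable[[Address, float, float], Optional[Item]],
    block_timeout_s: float = 5.0,
    grace_s: float = 2.0,  # assumed margin, not a value from our setup
    max_attempts: int = 3,
) -> Optional[Item]:
    """Try one blocking pop, re-resolving the master after a failed attempt.

    blocking_pop(address, block_timeout_s, deadline_s) is expected to return
    an item, return None on a normal block timeout, or raise on a READONLY
    reply, a closed connection, or an expired client-side deadline.
    """
    # The client-side deadline exceeds the server-side block timeout, so it
    # only fires when the server fails to answer at all.
    deadline_s = block_timeout_s + grace_s
    for _ in range(max_attempts):
        address = discover_master()  # ask Sentinel again on every attempt
        try:
            return blocking_pop(address, block_timeout_s, deadline_s)
        except (ReadOnlyError, ConnectionError, socket.timeout):
            # Treat all three as "this node may no longer be the master".
            continue
    # All attempts failed; callers cannot tell this apart from a normal
    # timeout in this sketch.
    return None
```

With a guard like this, a client stuck on the demoted instance would fall through to the rediscovery step after roughly seven seconds instead of hanging. It would only work around the symptom, though, not explain why the once-master-now-slave stops honoring the block timeout.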
