I encountered redis instance hanged.
My application had been executing RANDOMKEY to slave instance continuously.
Suddenly the instance did not response to any command using CPU 100%.
I attached redis process and found the instance reiterated in dbRandomKey loop.
In fact, the db has only 40 keys and those are all expired.
The logic in expireIfNeeded did not delete any expired key in case of instance is slave
and dbRandomKey tried to find other key again.
I fixed it to return NULL if the instance is slave and the size of db and expired is equal.
fix infinite loop by randomkey if the slave only has expired key.
Hello! Thanks for finding this issue. While we should find a way to fix it, maybe the current proposed solution is not optimal as it breaks the semantics completely.
Imagine a slave with 1000 keys, all with an expire set, but only 1 expired.
You call RANDOMKEY in that slave, and the first key you hit with dictGetRandomKey() is the expired one, so you enter the second if branch, and return NULL since the key has already expired, even if there are 999 possible keys to return.
IMHO a better approach could be to have a counter of max tries decrementing, and when we reach zero we either return the next key we fetch, whatever is expired or not, or the key found with the best ttl so far. Note that I'm suggesting to never return NULL here but instead an already expired key because RANDOMKEY semantics is to return a key as long as the database is not empty.
You may think this is a race, however it is normal for RANDOMKEY to return a key that is about to expire so that the user will not find it accessible in the next attempt, so probably this makes sense in this case, anyway it is a corner case that will be hard to trigger, usually the DELs synthesized by the master will empty the slave DB so RADNOMKEY will start to return an empty reply eventually.
- count retry and return key if it reach dbSize
I agree with you.
I understand the same size of db and expires does not mean all keys are expired.
Thank you for detailed explanation.
As you mentioned, I set the retry limit by size of 'db' and return a key when the retry limit exceed.
I am not sure on the retry limit by the size of 'db'. but I think it will not cause so big problem by slow command.
In case of master instance, the escape condition will never triggered. but I supposed no size effect.
Would you like to check this again?