
Slave reports master_link_status:up but does not receive updates from master. #4069

micha opened this issue Jun 21, 2017 · 2 comments

micha commented Jun 21, 2017

Hi! Redis is great, thanks for all your work! We have been running Redis in production for years without issue, but suddenly I am seeing some strange behavior around replication.

TL;DR

The slave looks normal and reports that it is replicating and in sync with the master, but it is in fact missing updates that should have been propagated from the master. Running slaveof no one followed by slaveof <master> <port> on the affected slave helps temporarily, but the problem returns after a while. The only indication of trouble we could identify is a discrepancy in the number of keys between the affected slave and the master redis.
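For reference, the workaround amounts to the following redis-cli invocations (hostnames and ports are placeholders):

```sh
# run against the affected slave; hostnames/ports are placeholders
redis-cli -h <slave-host> -p 6379 slaveof no one
redis-cli -h <slave-host> -p 6379 slaveof <master-host> 6379
```

Detaching and re-attaching like this forces the slave to resynchronize from the master, which is why the key counts converge again for a while.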

[screenshot: graph of the key-count discrepancy between the master and each slave, 2017-06-21]

The graph above shows abs(numberOfKeys(master) - numberOfKeys(slave)), grouped by slave. The red line is the instance experiencing the issue. The sharp dips, where the red line drops to the level of the blue background-noise lines (the other redis slaves), are the points where we stopped and restarted replication (i.e. slaveof no one followed by slaveof <host> <port>).
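The metric in the graph is just a key-count comparison between the master and each slave; a minimal sketch, assuming redis-py and hypothetical hostnames:

```python
# Sketch of the key-count check plotted above, assuming redis-py and
# hypothetical hostnames. DBSIZE returns the key count of the currently
# selected database (db 0 here).
import redis

master = redis.Redis(host="redis-master.internal", port=6379)
slave = redis.Redis(host="redis-slave-az-a.internal", port=6379)

discrepancy = abs(master.dbsize() - slave.dbsize())
print("key-count discrepancy:", discrepancy)
```

A small background difference is expected (the slaves accept local writes), which is the noise visible for the healthy instances.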

So the questions are:

  1. What could cause this issue, and how can we avoid it in the future?
  2. How can we detect that replication on a slave has silently stopped working, so we can configure monitors and alerts? (A sketch of one possible check follows this list.)
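For question 2, a minimal sketch of the kind of check we could alert on, assuming redis-py and hypothetical hostnames; it compares the replication offsets reported by INFO replication on the master and on a slave, since master_link_status alone reads up even while updates are missing:

```python
# Sketch of an offset-based replication check, assuming redis-py and
# hypothetical hostnames; master_link_status alone stays "up" here even
# while the slave is missing updates.
import redis

master = redis.Redis(host="redis-master.internal", port=6379)
slave = redis.Redis(host="redis-slave-az-a.internal", port=6379)

master_offset = master.info("replication")["master_repl_offset"]
slave_repl = slave.info("replication")

print("master_link_status:", slave_repl["master_link_status"])
print("offset lag (bytes):", master_offset - slave_repl["slave_repl_offset"])
```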

Our Setup

  • We use redis as a cache.
  • All of our redis instances are running open source redis on EC2.
  • We run a single master and fan out to slaves in 3 availability zones (the "level 1" slaves).
  • Application servers in the 3 availability zones (AZs) have redis slaves onboard which replicate from the level 1 slaves in their AZ. These onboard slaves are configured to allow writes, because we use lua scripts that need to be able to create temporary keys.
  • Each availability zone also has an extra slave redis, replicating from a level 1 slave in the same AZ. These are likewise configured to allow writes; they're used by newly launched application servers whose onboard redis has not yet finished synchronizing with their level 1 slave. (A sketch of the relevant config follows this list.)
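For clarity, a minimal sketch of the replication-related config on the two slave tiers (hostnames are hypothetical; directive names as in redis 3.x):

```
# level-1 slave (one per AZ) -- hypothetical master hostname
slaveof redis-master.internal 6379

# onboard / extra slave -- replicates from the level-1 slave in its own AZ
slaveof redis-level1-az-a.internal 6379
slave-read-only no    # allow lua scripts to create temporary keys locally
```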

[diagram: redis replication hierarchy across the three availability zones]

The instance in red above is the affected one.

Mitigation Attempts

  1. Restart replication on the affected slave.
  2. Rebuild the affected slave on a new EC2 instance.
  3. Rebuild the servers that connect to the affected slave on new EC2 instances.

The weirdest part is that the problem persists even after #2: how does a brand-new EC2 instance with a fresh redis install develop the same issue the previous instance had, while no other instance is affected? We thought it must be related to something that a client or slave of that slave was doing, so in #3 we rebuilt all of those on new instances as well. The problem still came back.

antirez commented Jun 28, 2017

Hello, what does the master's CLIENT LIST output look like for the entry representing the slave? Is the output buffer full because the data could not be sent?
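For reference, a minimal sketch of how those fields could be pulled out programmatically, assuming redis-py and a hypothetical master hostname; in CLIENT LIST, the S flag marks a slave connection and obl/oll/omem are the output buffer length, output list length and output buffer memory:

```python
# Print the output-buffer fields for every slave connection on the master,
# assuming redis-py and a hypothetical master hostname.
import redis

master = redis.Redis(host="redis-master.internal", port=6379)

for client in master.client_list():
    if "S" in client["flags"]:  # 'S' marks a slave connection
        print(client["addr"], "obl=%(obl)s oll=%(oll)s omem=%(omem)s" % client)
```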

antirez commented Jun 28, 2017

Also the reverse would be useful: running CLIENT LIST on the slave to see what the entry for the master connection looks like.
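A similar sketch for the slave side, again assuming redis-py and a hypothetical hostname; the connection coming from the master carries the M flag in CLIENT LIST:

```python
# Print the CLIENT LIST entry for the master's connection, as seen from
# the slave, assuming redis-py and a hypothetical slave hostname.
import redis

slave = redis.Redis(host="redis-slave-az-a.internal", port=6379)

for client in slave.client_list():
    if "M" in client["flags"]:
        print(client)
```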
