
Slave reports master_link_status:up but does not receive updates from master. #4069

micha opened this issue Jun 21, 2017 · 2 comments

micha commented Jun 21, 2017

Hi! Redis is great, thanks for all your work! We have been running Redis in production for years without issue, but suddenly I am seeing some strange behavior around replication.

TL;DR

The slave looks normal and reports that it is replicating and in sync with the master, but it is in fact missing updates that should have been propagated from the master. Running slaveof no one followed by slaveof <master> <port> on the affected slave helps temporarily, but the problem returns after a while. The only indication of trouble we could identify is a discrepancy in the number of keys between the affected slave and the master redis.
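For reference, the workaround amounts to the following redis-cli invocations (hostnames and ports are placeholders):

```sh
# run against the affected slave; hostnames/ports are placeholders
redis-cli -h <slave-host> -p 6379 slaveof no one
redis-cli -h <slave-host> -p 6379 slaveof <master-host> 6379
```

Detaching and re-attaching like this forces the slave to resynchronize from the master, which is why the key counts converge again for a while.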

[screenshot: graph of the key-count discrepancy between the master and each slave, 2017-06-21]

The graph above shows abs(numberOfKeys(master) - numberOfKeys(slave)), grouped by slave. The red line is the instance experiencing the issue. The sharp dips, where the red line drops to the level of the blue background-noise lines (the other redis slaves), are the points where we stopped and restarted replication (i.e. slaveof no one followed by slaveof <host> <port>).
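The metric in the graph is just a key-count comparison between the master and each slave; a minimal sketch, assuming redis-py and hypothetical hostnames:

```python
# Sketch of the key-count check plotted above, assuming redis-py and
# hypothetical hostnames. DBSIZE returns the key count of the currently
# selected database (db 0 here).
import redis

master = redis.Redis(host="redis-master.internal", port=6379)
slave = redis.Redis(host="redis-slave-az-a.internal", port=6379)

discrepancy = abs(master.dbsize() - slave.dbsize())
print("key-count discrepancy:", discrepancy)
```

A small background difference is expected (the slaves accept local writes), which is the noise visible for the healthy instances.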

So the questions are:

  1. What could cause this issue, and how can we avoid it in the future?
  2. How can we detect that replication on a slave has silently stopped working, so we can configure monitors and alerts? (A sketch of one possible check follows this list.)
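For question 2, a minimal sketch of the kind of check we could alert on, assuming redis-py and hypothetical hostnames; it compares the replication offsets reported by INFO replication on the master and on a slave, since master_link_status alone reads up even while updates are missing:

```python
# Sketch of an offset-based replication check, assuming redis-py and
# hypothetical hostnames; master_link_status alone stays "up" here even
# while the slave is missing updates.
import redis

master = redis.Redis(host="redis-master.internal", port=6379)
slave = redis.Redis(host="redis-slave-az-a.internal", port=6379)

master_offset = master.info("replication")["master_repl_offset"]
slave_repl = slave.info("replication")

print("master_link_status:", slave_repl["master_link_status"])
print("offset lag (bytes):", master_offset - slave_repl["slave_repl_offset"])
```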

Our Setup

  • We use redis as a cache.
  • All of our redis instances are running open source redis on EC2.
  • We run a single master and fan out to slaves in 3 availability zones (the "level 1" slaves).
  • Application servers in the 3 availability zones (AZs) have redis slaves onboard which replicate from the level 1 slaves in their AZ. These onboard slaves are configured to allow writes, because we use lua scripts that need to be able to create temporary keys.
  • Each availability zone also has an extra slave redis, replicating from a level 1 slave in the same AZ. These are likewise configured to allow writes; they're used by newly launched application servers whose onboard redis has not yet finished synchronizing with their level 1 slave. (A sketch of the relevant config follows this list.)
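For clarity, a minimal sketch of the replication-related config on the two slave tiers (hostnames are hypothetical; directive names as in redis 3.x):

```
# level-1 slave (one per AZ) -- hypothetical master hostname
slaveof redis-master.internal 6379

# onboard / extra slave -- replicates from the level-1 slave in its own AZ
slaveof redis-level1-az-a.internal 6379
slave-read-only no    # allow lua scripts to create temporary keys locally
```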

[diagram: redis replication hierarchy across the three availability zones]

The instance in red above is the affected one.

Mitigation Attempts

  1. Restart replication on the affected slave.
  2. Rebuild the affected slave on a new EC2 instance.
  3. Rebuild the servers that connect to the affected slave on new EC2 instances.

The weirdest part is that the problem persists even after #2: how does a brand-new EC2 instance with a fresh redis install develop the same issue the previous instance had, while no other instance is affected? We thought it must be related to something that a client or slave of that slave was doing, so in #3 we rebuilt all of those on new instances as well. The problem still came back.

antirez commented Jun 28, 2017

Hello, what does the master's CLIENT LIST output look like for the entry representing the slave? Is the output buffer full because the data could not be sent?
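For reference, a minimal sketch of how those fields could be pulled out programmatically, assuming redis-py and a hypothetical master hostname; in CLIENT LIST, the S flag marks a slave connection and obl/oll/omem are the output buffer length, output list length and output buffer memory:

```python
# Print the output-buffer fields for every slave connection on the master,
# assuming redis-py and a hypothetical master hostname.
import redis

master = redis.Redis(host="redis-master.internal", port=6379)

for client in master.client_list():
    if "S" in client["flags"]:  # 'S' marks a slave connection
        print(client["addr"], "obl=%(obl)s oll=%(oll)s omem=%(omem)s" % client)
```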

antirez commented Jun 28, 2017

Also the reverse would be useful: running CLIENT LIST on the slave to see what the entry for the master connection looks like.
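A similar sketch for the slave side, again assuming redis-py and a hypothetical hostname; the connection coming from the master carries the M flag in CLIENT LIST:

```python
# Print the CLIENT LIST entry for the master's connection, as seen from
# the slave, assuming redis-py and a hypothetical slave hostname.
import redis

slave = redis.Redis(host="redis-slave-az-a.internal", port=6379)

for client in slave.client_list():
    if "M" in client["flags"]:
        print(client)
```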
