
Slave replication fails when the RDB takes a long time to load and hz is set to 1 #1535

Closed
yoav-steinberg opened this Issue · 3 comments

2 participants

@yoav-steinberg

Tested on 2.8.4.
When hz is set to 1 and loading the RDB at the final stage of slave replication takes longer than repl-timeout, the slave fails with: "MASTER timeout: no data nor PING received..."
The failure isn't caused by the master disconnecting the repl link, so I know this isn't #1449.
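
For reproduction, a slave configuration along these lines should be enough (the repl-timeout value is illustrative; anything shorter than the RDB load time will do):

    # slave-side redis.conf (illustrative values)
    hz 1                # server cron ticks once per second
    repl-timeout 60     # anything shorter than the RDB load time
    slaveof <master-host> 6379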

What seems to be happening is that while rdbLoad() runs, the server cron isn't updating server.unixtime. So we're left with a very old unixtime when we create the master client object (server.master = createClient()), and this makes the lastinteraction value in the client object very old as well.
At this point there's a race: a call to replicationCron() will kill the replication link because of the old lastinteraction value, but reception of new data (or a ping) from the master will update lastinteraction to a sane value. I guess both events are pending in the event loop.
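
The mechanism can be shown with a self-contained C sketch; cached_unixtime stands in for server.unixtime and the sleep stands in for a long rdbLoad(), so none of this is actual Redis source:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static time_t cached_unixtime;        /* plays the role of server.unixtime */

    static void cron(void) {              /* plays the role of serverCron(): the */
        cached_unixtime = time(NULL);     /* only place the cached time updates  */
    }

    int main(void) {
        const int repl_timeout = 3;       /* like repl-timeout, in seconds */

        cron();                           /* last cron tick before loading */
        sleep(repl_timeout + 2);          /* stands in for a slow rdbLoad() */

        /* The master client is created with the stale cached time, as in
         * server.master = createClient() -> lastinteraction assignment. */
        time_t lastinteraction = cached_unixtime;

        /* The first replicationCron() after loading compares the real clock
         * with the stale timestamp and "times out" the link. */
        if (time(NULL) - lastinteraction > repl_timeout)
            printf("MASTER timeout: no data nor PING received...\n");
        return 0;
    }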

For some reason, when hz is set to 1, replicationCron() always wins the race and kills my replication link.

This is a bit complex, so I might be missing something. Assuming this analysis is correct, I'd suggest something like updating server.unixtime right after calling rdbLoad() in the replication code.
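
In terms of the sketch above, the suggestion amounts to refreshing the cached clock right after the blocking load, before lastinteraction is captured (again an illustration, not actual replication.c code):

    sleep(repl_timeout + 2);                  /* slow rdbLoad() finishes...    */
    cached_unixtime = time(NULL);             /* ...then refresh the cache     */
    time_t lastinteraction = cached_unixtime; /* fresh value, no false timeout */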

@antirez
Owner

Thanks, investigating...

@antirez
Owner

I believe the analysis is correct. I fixed it in a slightly different way, but I hope you agree the fix is appropriate. Thanks for the issue and analysis!

@antirez antirez referenced this issue from a commit
@antirez Update cached time in rdbLoad() callback.
server.unixtime and server.mstime are cached, less precise timestamps that we use whenever we don't need an accurate time representation and a syscall would be too slow for the number of calls we require.

One such example is the initialization and updating of the last interaction time with a client, which is used for timeouts.

However, rdbLoad() can take a long time to load the DB, and it did not update the cached time while loading. This resulted in the bug described in issue #1535: during replication the slave loads the DB and creates the redisClient representation of its master, but the timestamp is so old that the master, under certain conditions, is sensed as already timed out.

Thanks to @yoav-steinberg and Redis Labs Inc for the bug report and
analysis.
85492dc
@antirez antirez referenced this issue from a commit
@antirez Update cached time in rdbLoad() callback.
51bd9da
@antirez antirez referenced this issue from a commit
@antirez Update cached time in rdbLoad() callback.
3c1672d
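
A sketch of what the commit message describes: the cached-time update lives in a helper and is also invoked from the callback that rdbLoad() calls periodically while loading. Function names follow the commit message and common Redis naming but are assumptions, not verbatim source:

    /* redis.c: refresh the cached, less precise timestamps (assumed helper). */
    void updateCachedTime(void) {
        server.unixtime = time(NULL);
        server.mstime = mstime();
    }

    /* rdb.c: callback invoked periodically while rdbLoad() is busy. Refreshing
     * the cache here keeps server.unixtime fresh, so the master client created
     * right after loading gets a sane lastinteraction and replicationCron()
     * no longer sees a false timeout. */
    void rdbLoadProgressCallback(rio *r, const void *buf, size_t len) {
        /* ... existing loading-progress and ping handling ... */
        updateCachedTime();
    }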
@antirez antirez closed this
@yoav-steinberg

Thanks for the fix. I like it better than my original suggestion :)
