Bind source address for cluster communication #1567

Merged
merged 1 commit into from Mar 10, 2014

Conversation

Projects
None yet
2 participants
@mattsta
Contributor

mattsta commented Feb 26, 2014

This patch allows multiple Redis Cluster instances
to communicate when running on the same interface
of the same host.

The first address given to Redis as a bind parameter
(server.bindaddr[0]) gets used as the source IP
for cluster communication.

If no bind address is specified by the user, the
behavior remains unchanged (if the user doesn't
specify a bind address, server.bindaddr[0] is NULL
and the corresponding bind-to-ip code will be skipped).

@antirez

This comment has been minimized.

Show comment Hide comment
@antirez

antirez Mar 3, 2014

Owner

Hello Matt! Thanks for the patch, this definitely helps to make the Cluster more reliable in real environments.

A couple of remarks:

  • I've the feeling logging an error on connection errors is not a good idea, this will happen all the time, so why to fill the log? After all we have the disconnection flag that will tell us the big picture of nodes that can't talk.
  • You are passing bindaddr[0] anyway even if the user did not bound any address in the config, is this intentional or an error?
  • Why to apply the same concept to MIGRATE? In this case we act as a client, so any interface will do.

Salvatore

Owner

antirez commented Mar 3, 2014

Hello Matt! Thanks for the patch, this definitely helps to make the Cluster more reliable in real environments.

A couple of remarks:

  • I've the feeling logging an error on connection errors is not a good idea, this will happen all the time, so why to fill the log? After all we have the disconnection flag that will tell us the big picture of nodes that can't talk.
  • You are passing bindaddr[0] anyway even if the user did not bound any address in the config, is this intentional or an error?
  • Why to apply the same concept to MIGRATE? In this case we act as a client, so any interface will do.

Salvatore

@mattsta

This comment has been minimized.

Show comment Hide comment
@mattsta

mattsta Mar 3, 2014

Contributor

Thanks for looking this over!

Here's one reason for the logging addition and three possible fixes:

  • When I was first tracking down my connection problems, the actual error wasn't showing up with regard to connecting to another cluster port. (It may have been detectable by flags in CLUSTER NODES, but that wasn't apparent at the time; in production settings, people may forget to allow port+10000 through firewalls and will appreciate explicit error messages.)
    • Possible Fix A: Move log level to DEBUG. People expect DEBUG to both be very noisy and very accurate as to what the server is doing.
    • Possible Fix B: Remove the logging.
    • Possible Fix C (long term): Allow the logging system to report errors when they first happen, then every N times after (fast errors like replication disconnects retries fill up logs quickly too). Just print the error every 20, 50, or 100 times with a syslog-like "[Last message repeated N times]" notice.

Reason for the bindaddr[0] usage: if the user doesn't specify a bind address, bindaddr[0] is set to NULL. NULL as the source IP parameter to anetTcpNonBlockBindConnect results in the bind setup not running.

Removing the MIGRATE change is a good idea. I was trying to be consistent with converting cluster usage of anetTcpNonBlockConnect to anetTcpNonBlockBindConnect, but if it's not needed we can certainly get rid of it.

Result

I updated the commit with:

  • Change log level for cluster connection errors to DEBUG
  • Remove bind for MIGRATE connect
Contributor

mattsta commented Mar 3, 2014

Thanks for looking this over!

Here's one reason for the logging addition and three possible fixes:

  • When I was first tracking down my connection problems, the actual error wasn't showing up with regard to connecting to another cluster port. (It may have been detectable by flags in CLUSTER NODES, but that wasn't apparent at the time; in production settings, people may forget to allow port+10000 through firewalls and will appreciate explicit error messages.)
    • Possible Fix A: Move log level to DEBUG. People expect DEBUG to both be very noisy and very accurate as to what the server is doing.
    • Possible Fix B: Remove the logging.
    • Possible Fix C (long term): Allow the logging system to report errors when they first happen, then every N times after (fast errors like replication disconnects retries fill up logs quickly too). Just print the error every 20, 50, or 100 times with a syslog-like "[Last message repeated N times]" notice.

Reason for the bindaddr[0] usage: if the user doesn't specify a bind address, bindaddr[0] is set to NULL. NULL as the source IP parameter to anetTcpNonBlockBindConnect results in the bind setup not running.

Removing the MIGRATE change is a good idea. I was trying to be consistent with converting cluster usage of anetTcpNonBlockConnect to anetTcpNonBlockBindConnect, but if it's not needed we can certainly get rid of it.

Result

I updated the commit with:

  • Change log level for cluster connection errors to DEBUG
  • Remove bind for MIGRATE connect
Bind source address for cluster communication
The first address specified as a bind parameter
(server.bindaddr[0]) gets used as the source IP
for cluster communication.

If no bind address is specified by the user, the
behavior is unchanged.

This patch allows multiple Redis Cluster instances
to communicate when running on the same interface
of the same host.
@antirez

This comment has been minimized.

Show comment Hide comment
@antirez

antirez Mar 6, 2014

Owner

Thanks Matt,

  • Logging: I vote for "A", since this is an error that is too much the norm in a large environment, and we aim to conquer the world, so...
  • bindaddr[0]: now I see, the code is correct, but I think my failure at understanding that you were relying on a side effect of another function is an argument to turn the code into... (server.bindaddr_count == 0) ? NULL : bindaddr[0] :-)
  • Well, after all, we may take it to MIGRATE as well. It is not needed but I understand your point I think, if the server is told to use a given address to expose itself to the net, it's cool if outgoing connections will use it.

Thanks!

Owner

antirez commented Mar 6, 2014

Thanks Matt,

  • Logging: I vote for "A", since this is an error that is too much the norm in a large environment, and we aim to conquer the world, so...
  • bindaddr[0]: now I see, the code is correct, but I think my failure at understanding that you were relying on a side effect of another function is an argument to turn the code into... (server.bindaddr_count == 0) ? NULL : bindaddr[0] :-)
  • Well, after all, we may take it to MIGRATE as well. It is not needed but I understand your point I think, if the server is told to use a given address to expose itself to the net, it's cool if outgoing connections will use it.

Thanks!

antirez added a commit that referenced this pull request Mar 10, 2014

Merge pull request #1567 from mattsta/fix-cluster-join
Bind source address for cluster communication

@antirez antirez merged commit 3b0edb8 into antirez:unstable Mar 10, 2014

@antirez

This comment has been minimized.

Show comment Hide comment
@antirez

antirez Mar 10, 2014

Owner

Merged :-) Thanks.

Owner

antirez commented Mar 10, 2014

Merged :-) Thanks.

@mattsta mattsta deleted the mattsta:fix-cluster-join branch Mar 26, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment