Improve gcomm operation in presence of high packet loss #71

Closed
temeo opened this Issue Jun 25, 2014 · 12 comments

Contributor

temeo commented Jun 25, 2014

This will be the parent ticket for further packet-loss-related work.

Original report: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1274192

Introducing high delay and packet loss into the network makes inter-node communication highly unreliable, in the form of duplicated, lost or delayed messages. These kinds of conditions bring up some EVS-related bugs, like the currently open #37 and #40. EVS protocol bugs are not in the scope of this ticket and should be reported and fixed separately.

Further work can be divided roughly into three parts, outlined below.

1. Monitoring and manual eviction

Automatic node isolation or eviction is currently not possible, since EVS lacks some necessary elements, like proper per-node statistics collection and a protocol to communicate the current view of node states without running the full membership protocol. Running the full membership protocol over an unstable network should be avoided, as it may result in oscillating membership if the network conditions are difficult enough. Therefore the first implementation of node eviction should be based on monitoring and manual eviction, until a better understanding of automatic eviction has been gained.

Proposed implementation:

  • Expose the list of delayed node UUIDs and communication endpoints as wsrep status variables for monitoring purposes
  • Implement an evs.evict wsrep provider option taking a list of UUIDs that should be evicted manually from the cluster

2. Join time health check

Joining a node with a bad network connection is problematic, since the join operation starts an EVS membership protocol round, which in turn performs poorly over an unstable network. To avoid starting the join operation over an unstable network, an additional health-check phase for GMCast should be devised.

When the GComm protocol stack is started, the joiner should first connect to all known peers in the GMCast network and exchange keepalive packets to verify that the network is OK. The joining node starts the upper GComm protocol layers only after the health check passes. Other nodes should not treat the joining node as a fully qualified member of the GMCast network until the joiner sends its first upper-level protocol packet.
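
For illustration only, here is a minimal Python sketch of the proposed gating idea; nothing like this exists in gcomm yet, and the peer list and probe callback are assumptions standing in for the GMCast-level keepalive exchange:

  from typing import Callable, Iterable

  def join_health_check(peers: Iterable[str],
                        probe: Callable[[str], bool],
                        probes_per_peer: int = 10) -> bool:
      """Return True only if every known peer answers all keepalive probes.

      probe(peer) is a caller-supplied stand-in for one GMCast keepalive
      round trip; it should return True on a timely response.
      """
      for peer in peers:
          if not all(probe(peer) for _ in range(probes_per_peer)):
              return False          # unstable link detected, do not join yet
      return True

  # The joiner would run this gate before starting the upper GComm layers,
  # i.e. proceed with the join only when join_health_check(...) returns True.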

3. Automatic eviction

Once enough understanding has been gained about how to properly identify the node that is causing turbulence for group communication, an automatic eviction protocol can be enabled. This work will require proper per-node statistics collection on the EVS level and a protocol extension to communicate the statistics-related view to other nodes.

@temeo temeo added the enhancement label Jun 25, 2014

temeo added a commit that referenced this issue Jun 26, 2014

temeo added a commit that referenced this issue Jun 26, 2014

temeo added a commit that referenced this issue Jun 26, 2014

refs #71 - simple state change counter for delay list entry to distinguish
           between fully broken communication link (0), once broken but now
           back to normal communication link (1) or communication link which
           suffers from packet loss (>1).

temeo added a commit that referenced this issue Jun 30, 2014

temeo added a commit that referenced this issue Jun 30, 2014

@temeo temeo self-assigned this Jul 2, 2014

temeo added a commit that referenced this issue Jul 2, 2014

temeo added a commit that referenced this issue Jul 2, 2014

temeo added a commit that referenced this issue Jul 2, 2014

refs #71 - simple state change counter for delay list entry to distinguish
           between fully broken communication link (0), once broken but now
           back to normal communication link (1) or communication link which
           suffers from packet loss (>1).

temeo added a commit that referenced this issue Jul 2, 2014

temeo added a commit that referenced this issue Jul 2, 2014

temeo added a commit that referenced this issue Jul 2, 2014

Contributor

temeo commented Jul 4, 2014

Outline of current design:

EVS monitors the response time of each other node. If the response time exceeds evs.delayed_period, the node is added to the list of delayed nodes.

A node on the list of delayed nodes has an associated state (OK or DELAYED, initialized to DELAYED) and a counter (0...255, initialized to 0). Each time the check for delayed nodes is done (once per evs.inactive_check_period) and a state change is detected, the counter is incremented. Once the counter goes over 1, the node UUID and counter are reported to the other nodes.

If the node state stays OK over evs.delayed_decay_period, its counter is decremented by one. Once the counter reaches zero, the node is removed from the delayed list.
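
As a reading aid, here is a simplified Python sketch of this bookkeeping. It is not the actual gcomm code; the class and field names are made up for illustration:

  DELAYED, OK = "DELAYED", "OK"

  class DelayedEntry:
      """One delayed-list entry: a state plus a state-change counter."""

      def __init__(self):
          self.state = DELAYED      # initialized to DELAYED
          self.counter = 0          # 0..255, initialized to 0
          self.ok_since = None      # time when the state last turned OK

      def check(self, responsive, now, delayed_decay_period):
          """Run once per evs.inactive_check_period; returns the counter."""
          new_state = OK if responsive else DELAYED
          if new_state != self.state:                  # state change detected
              self.state = new_state
              self.counter = min(self.counter + 1, 255)
              self.ok_since = now if new_state == OK else None
          elif (self.state == OK and self.ok_since is not None
                and now - self.ok_since >= delayed_decay_period):
              self.counter = max(self.counter - 1, 0)  # decay while staying OK
              self.ok_since = now
          # counter > 1: report this UUID and counter to the other nodes
          # counter == 0 while OK: the entry can be dropped from the list
          return self.counter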

The delayed list can be monitored via the wsrep_evs_delayed status variable. The list is in the form of comma-separated values, with a single element having the format

  0aac8e2a-0379-11e4-ba9c-afbf6afe14bc:tcp://192.168.17.13:10031:1

The first part is the node UUID. The second part, starting from tcp://, is the low-level connectivity endpoint (if known), and the final number is the value of the state counter.
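
For monitoring scripts, each element can be split on the first and last colon, since the UUID itself contains no colons and the counter is always the last field. A small parsing sketch in Python (the function name is just illustrative):

  def parse_wsrep_evs_delayed(value):
      """Split a wsrep_evs_delayed value into (uuid, endpoint, counter) tuples."""
      entries = []
      for element in filter(None, value.split(",")):
          uuid, _, rest = element.partition(":")
          endpoint, _, counter = rest.rpartition(":")
          entries.append((uuid, endpoint, int(counter)))
      return entries

  # parse_wsrep_evs_delayed(
  #     "0aac8e2a-0379-11e4-ba9c-afbf6afe14bc:tcp://192.168.17.13:10031:1")
  # -> [('0aac8e2a-0379-11e4-ba9c-afbf6afe14bc', 'tcp://192.168.17.13:10031', 1)]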

A node can be manually evicted from the cluster by assigning its UUID to the evs.evict wsrep provider option:

set global wsrep_provider_options='evs.evict=0aac8e2a-0379-11e4-ba9c-afbf6afe14bc';

This adds the node UUID to the evicted list and triggers a group membership change.

At the moment it is possible to manually evict only one node at a time. This will cause problems if several nodes reside behind a bad link (like in the multi-data-center case), since the membership protocol runs poorly over a bad network. The evs.evict parameter parsing should be enhanced to allow several UUIDs at once.
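
As an illustration of how the monitoring and manual eviction pieces could be tied together from the client side, here is a sketch; it is not part of Galera itself, it assumes the mysql-connector-python package, reuses the parse_wsrep_evs_delayed helper sketched above, and the counter threshold is an arbitrary example value:

  import mysql.connector  # pip install mysql-connector-python

  EVICT_THRESHOLD = 5  # arbitrary example value

  def evict_flaky_nodes(conn):
      """Evict every node whose delayed-list counter exceeds the threshold."""
      cur = conn.cursor()
      cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_evs_delayed'")
      row = cur.fetchone()
      if not row or not row[1]:
          return                       # nothing on the delayed list
      for uuid, endpoint, counter in parse_wsrep_evs_delayed(row[1]):
          if counter > EVICT_THRESHOLD:
              print("evicting %s (%s), counter=%d" % (uuid, endpoint, counter))
              cur.execute("SET GLOBAL wsrep_provider_options = %s",
                          ("evs.evict=" + uuid,))

  if __name__ == "__main__":
      evict_flaky_nodes(mysql.connector.connect(
          host="127.0.0.1", user="root", password=""))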

Setting the evs.auto_evict parameter to a value greater than zero enables automatic eviction. The operation is roughly the following:

Every node listens for messages reporting delayed nodes. When such a message is received, it is stored for the evs.decay_period time.

The stored messages are iterated over and the following per-node counters are updated:

  • in how many messages the node has been reported with a counter value over
    evs.auto_evict
  • in how many messages the node has been reported at all

Messages containing the processing node's own UUID are ignored.

If at least one candidate is found, all nodes that have been reported by a majority of the group are evicted automatically.
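
A rough Python sketch of that decision rule as described above (the message representation and names are illustrative, not the gcomm data structures):

  from collections import Counter

  def auto_evict(stored_messages, own_uuid, group_size, auto_evict_threshold):
      """stored_messages: the delayed-list reports kept for the decay period,
      each a dict mapping reported node UUID -> state change counter."""
      over_threshold = Counter()   # reports with counter above the threshold
      reported = Counter()         # reports of the node at all
      for msg in stored_messages:
          if own_uuid in msg:      # ignore messages containing our own UUID
              continue
          for uuid, counter in msg.items():
              reported[uuid] += 1
              if counter > auto_evict_threshold:
                  over_threshold[uuid] += 1
      if not over_threshold:       # no candidate, nothing to evict
          return set()
      majority = group_size // 2 + 1
      return {uuid for uuid, n in reported.items() if n >= majority}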

This approach also has a problem if there are several nodes behind a bad link. Some kind of heuristic should be applied to either try to keep the group intact or to evict all delayed candidates without losing the majority of the group.

This kind of design emerged because:

  • There must be some way to gather statistics to make the eviction decision, and doing it along with the inactive check was the easiest choice.
  • Simply having OK and DELAYED states for nodes to be evicted would not be enough, since eviction would happen with each partition and this would defeat the purpose of a partition-tolerant algorithm. Therefore the state change counter is used as a measure of whether the communication is OK (state OK and counter 0, or node not on the list), completely failed (state DELAYED and counter 0) or acting randomly (counter over 1).
  • Slowly decrementing the state counter was chosen to keep the entries with the most state changes on the list longer.
  • The auto-eviction algorithm aims to always keep the majority of the group alive, or otherwise to avoid eviction. Without this, the cluster would quickly disintegrate through cross auto-eviction if the connectivity between all nodes had problems.

Suggested parameter values for testing with default EVS values:

evs.delayed_period=PT2S
evs.delayed_decay_period=PT30S
evs.auto_evict=5
Owner

ayurchen commented Jul 5, 2014

On 2014-07-04 16:59, temeo wrote:

Suggested parameter values for testing with default EVS values:

evs.delayed_period=PT2S

But not 'period'! Just 'evs.delay' or evs.response_delay.

2 seconds default is probably too much. 1 second should be more than
enough.

Also, could this be linked to suspect_timeout?

evs.delayed_decay_period=PT30S
evs.auto_evict=5

@dirtysalt dirtysalt added this to the 3.6 milestone Jul 11, 2014

Contributor

temeo commented Jul 28, 2014

@ayurchen The EVS response delay should be higher than the keepalive period, otherwise there will be false positives in an idle cluster.

Linking it to the suspect timeout would be nice, but it would be hard to define a good default for that.

temeo added a commit that referenced this issue Aug 14, 2014

refs #71 Renamed evs.delayed_period to evs.delay_margin and
         evs.delayed_decay_period to evs.delayed_keep_period.
         Now evs.delay_margin is added with evs.keepalive_period
         to get condition for delayed node, so there is no need
         to adjust evs.delay_margin between LAN and WAN setups.

ronin13 commented Aug 15, 2014

This is still reproducible:

Log files:

http://files.wnohang.net/files/results-140.tar.gz

Console: http://jenkins.percona.com/job/PXC-5.6-netem/140/label_exp=qaserver-04/console

Also here: http://files.wnohang.net/files/results-140-consoleText

The test is as follows:

a) Start one node.

b) Load 20 tables into it with sysbench, 1000 records each.

c) Start 9 other nodes, each with a random gmcast.segment in [1,5].

d) Make one node have "150ms 20ms distribution normal and loss 20% ".

e) Select 9 other nodes to write to. (sockets in "sysbench on sockets")

f) Start writing with sysbench - oltp test, 20 threads, 360 seconds,
--oltp_index_updates=20 --oltp_non_index_updates=20 --oltp_distinct_ranges=15

g) sysbench exits with an error due to network partitioning. All nodes except the one
with loss are in non-PC.

h) Sleep for 60 seconds and then run sanity tests.

i) The sanity tests also fail and test exits.

With the last few commits, the time to network partitioning has increased (it is
stable for up to 230 seconds as seen in the console, whereas earlier it may have been
failing much sooner).

Now, what is the general design of the solution expected here? Is it to isolate the
node with the bad network - STONITH - implemented in galera? Is it to mark it as
down, etc.? If that is the case, as soon as the node is isolated, the cluster
should return to normal, but that is not happening.

Also, note that the writes are specifically done to all nodes except the one
maimed for loss. This is to avoid sysbench itself being affected by the network.

ronin13 commented Aug 15, 2014

Note that this is up to c5353b1 in the galera-3.x tree.

Contributor

dirtysalt commented Aug 15, 2014

I think gh71 is supposed to fix the network partitioning problem, but it is not merged into 3.x yet.

ronin13 commented Aug 16, 2014

With gh71 merged, I am seeing 3 nodes evicted instead of just the 1 node with packet loss.

Full logs: http://files.wnohang.net/files/results-142.tar.gz
Console: http://jenkins.percona.com/job/PXC-5.6-netem/142/label_exp=qaserver-04/console

Dock3.log:[Aug 16 19:56:11.002] 2014-08-16 20:56:11 1 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock3.log:[Aug 16 19:56:11.002] 2014-08-16 20:56:11 1 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock4.log:[Aug 16 19:55:56.474] 2014-08-16 20:55:56 1 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock4.log:[Aug 16 19:55:56.474] 2014-08-16 20:55:56 1 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock5.log:[Aug 16 19:55:56.044] 2014-08-16 20:55:56 1 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock5.log:[Aug 16 19:55:56.045] 2014-08-16 20:55:56 1 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)

ronin13 commented Aug 17, 2014

The earlier logs were with multiple segments.

http://files.wnohang.net/files/results-145.tar.gz is with segment 0 for all nodes.

In this run, Dock3 and Dock10 get evicted, even though Dock3 is the only one with
loss/delay.

ronin13 commented Aug 21, 2014

Logs:

http://files.wnohang.net/files/results-154.tar.gz

Dock2 and Dock5 are evicted even though only Dock5 has the loss.

http://jenkins.percona.com/job/PXC-5.6-netem/154/label_exp=qaserver-04/console

Nodes failed to reach primary too, but the cluster lasted a bit longer before going non-primary.

ronin13 commented Aug 21, 2014

http://files.wnohang.net/files/results-156.tar.gz
http://jenkins.percona.com/job/PXC-5.6-netem/156/label_exp=qaserver-04/console

This is with evs.info_log_mask=0x3 in addition to others.

Dock9 is the one with loss, Dock1 is evicted.

temeo added a commit that referenced this issue Aug 25, 2014

refs #71 Additional timestamp to stamp nodes by last seen message
as opposed to original timestamp which is used to stamp nodes by
last known protocol advance.

Now Node::tstamp_ is used to determine suspected/inactive status
and Node::self_tstamp_ is used to determine delayed status.

dirtysalt added a commit that referenced this issue Sep 2, 2014

Refs #71: add some debug logs. increase initial state change count to 1 and increase corresponding count threshold

dirtysalt added a commit that referenced this issue Sep 2, 2014

dirtysalt added a commit that referenced this issue Sep 2, 2014

Refs #71: remove view state file when node is evicted. otherwise it will use same uuid which is probably in other nodes evicted list

@temeo temeo added the 2 - Working label Sep 12, 2014

temeo added a commit that referenced this issue Sep 24, 2014

temeo added a commit that referenced this issue Oct 1, 2014

temeo added a commit that referenced this issue Oct 9, 2014

refs #71 fixed merge errors, changed naming from evict list to
         delayed list, delayed list messages are sent only on
         protocol version 1.

temeo added a commit that referenced this issue Oct 9, 2014

refs #71 auto evict implementation, squash commit from gh71
Auto eviction is enabled by setting evs.auto_evict value greater
than zero and evs.version to 1.
Contributor

dirtysalt commented Oct 9, 2014

LGTM.

temeo added a commit that referenced this issue Oct 10, 2014

refs #71 auto evict implementation, squash commit from gh71
Auto eviction is enabled by setting evs.auto_evict value greater
than zero and evs.version to 1.

temeo added a commit that referenced this issue Oct 13, 2014

Merge pull request #155 from codership/3.x-gh71
refs #71, #155 auto evict implementation

@ronin13 ronin13 referenced this issue in codership/documentation Oct 21, 2014

Closed

Document EVS auto eviction #27

temeo added a commit that referenced this issue Jan 5, 2015

Contributor

temeo commented Jan 5, 2015

Monitoring, manual eviction and automatic eviction from the original plan have been implemented. If there is still a need for the join-time health check, it should be reported as a separate issue.

@temeo temeo added 3 - Done and removed 2 - Working labels Jan 5, 2015

@temeo temeo closed this Jan 5, 2015

@temeo temeo removed the 3 - Done label Jan 5, 2015
