Improve gcomm operation in presence of high packet loss #71
Comments
temeo added the enhancement label on Jun 25, 2014
temeo self-assigned this on Jul 2, 2014
temeo commented Jul 4, 2014

Outline of current design: EVS monitors the response time of each other node. If the response time exceeds the configured margin, the node is put on a list of delayed nodes. Each node on the delayed list has an associated state (OK or DELAYED, initialized to DELAYED) and a counter (0...255, initialized to 0). Each time the check for delayed nodes is run (once per check period) the state and counter are updated, and if the node state stays OK over a configured period the node is dropped from the list.
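As an illustration of the bookkeeping described above, here is a minimal C++ sketch; the names (DelayedEntry, update_delayed) and the saturating-counter detail are assumptions, not the actual gcomm implementation.

```cpp
// Sketch of the per-node delayed-list bookkeeping described above.
// DelayedEntry and update_delayed are hypothetical names, not the
// actual gcomm classes.
#include <cstdint>
#include <map>
#include <string>

enum class DelayedState { OK, DELAYED };

struct DelayedEntry
{
    DelayedState state   = DelayedState::DELAYED; // initialized to DELAYED
    std::uint8_t counter = 0;                      // 0..255, initialized to 0
};

// Run once per check period for each node currently on the delayed list.
void update_delayed(std::map<std::string, DelayedEntry>& delayed,
                    const std::string& uuid, bool responded_in_time)
{
    DelayedEntry& e = delayed[uuid];
    if (responded_in_time)
    {
        // A separate sweep would drop entries that stay OK long enough.
        e.state = DelayedState::OK;
    }
    else
    {
        e.state = DelayedState::DELAYED;
        if (e.counter < 255) ++e.counter; // saturating count of delayed checks
    }
}
```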
The delayed list can be monitored via a status variable: the first part of each entry is the node UUID, and the second part carries the node's delayed state. A node can be manually evicted from the cluster by assigning its UUID to the eviction parameter.
This adds the node UUID to the evicted list and triggers a group membership change. At the moment only one node at a time can be evicted manually. This will cause problems if several nodes reside behind a bad link (as in the multi data center case), since the membership protocol runs poorly over a bad network. Automatic eviction is controlled by a separate parameter: when it is enabled, every node listens to messages reporting delayed nodes. When such a message is received it is stored for a period; stored messages are then iterated over and per-node counters are updated.
If at least one candidate is found, all nodes that have been reported by a majority of the group are evicted automatically. This approach also has a problem if there are several nodes behind a bad link: some kind of heuristic should be applied to either keep the group intact or to evict all delayed candidates without losing the majority of the group.
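For illustration, a C++ sketch of the majority-based candidate selection, under the assumption that each reporter's latest message simply lists the nodes it currently sees as delayed; the names and the strict-majority rule are placeholders, not the actual implementation.

```cpp
// Sketch of majority-based selection of auto-eviction candidates.
// delayed_reports maps reporter UUID -> set of node UUIDs it reports
// as delayed; names and the strict-majority rule are placeholders.
#include <cstddef>
#include <map>
#include <set>
#include <string>

std::set<std::string>
auto_evict_candidates(const std::map<std::string, std::set<std::string>>& delayed_reports,
                      std::size_t group_size)
{
    std::map<std::string, std::size_t> votes; // reported node -> reporter count
    for (const auto& report : delayed_reports)
        for (const auto& uuid : report.second)
            ++votes[uuid];

    std::set<std::string> candidates;
    for (const auto& v : votes)
        if (v.second > group_size / 2) // reported by a majority of the group
            candidates.insert(v.first);
    return candidates;
}
```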
This kind of design emerged because:

Suggested parameter values for testing with default EVS values:
ayurchen commented Jul 4, 2014

On 2014-07-04 16:59, temeo wrote:

But not 'period'! Just 'evs.delay' or evs.response_delay. A 2 second default is probably too much; 1 second should be more than enough. Also, could this be linked to suspect_timeout?
dirtysalt added this to the 3.6 milestone on Jul 11, 2014
@ayurchen The EVS response delay should be higher than the keepalive period, otherwise there will be false positives in an idle cluster. Linking it to the suspect timeout would be nice, but it would be hard to define a good default for that.
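To make the constraint concrete, here is a small hypothetical helper that derives a response-delay margin from the keepalive period; the factor of 2 and the 1 second floor are only illustrative, not actual gcomm defaults.

```cpp
// Hypothetical helper deriving a response-delay margin from the
// keepalive period; the factor of 2 and the 1 second floor are only
// illustrative, not actual gcomm defaults.
#include <algorithm>
#include <chrono>

std::chrono::milliseconds
default_delay_margin(std::chrono::milliseconds keepalive_period)
{
    // Must exceed the keepalive period: in an idle cluster only
    // keepalives flow, so a smaller margin gives false positives.
    return std::max(std::chrono::milliseconds(1000), 2 * keepalive_period);
}
```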
dirtysalt referenced this issue on Aug 10, 2014: inequality of local state ('un' field) before and after state exchange #92 (closed)
ronin13 commented Aug 15, 2014

This is still reproducible.

Log files: http://files.wnohang.net/files/results-140.tar.gz
Console: http://jenkins.percona.com/job/PXC-5.6-netem/140/label_exp=qaserver-04/console
Also here: http://files.wnohang.net/files/results-140-consoleText

The test is as follows:
a) Start one node.
b) Load 20 tables into it with sysbench, 1000 records each.
c) Start 9 other nodes, each with a random gmcast.segment in [1,5].
d) Make one node have "150ms 20ms distribution normal and loss 20%".
e) Select the 9 other nodes to write to (the sockets in "sysbench on sockets").
f) Start writing with sysbench: oltp test, 20 threads, 360 seconds.
g) sysbench exits with an error due to network partitioning; all nodes except one go non-primary.
h) Sleep for 60 seconds and then run sanity tests.
i) The sanity tests also fail and the test exits.

With the last few commits, the point at which network partitioning occurs comes later.

Now, what is the general design of the solution expected here? Is it to isolate the node with the bad link? Also, note that the writes are done to all nodes except the one with the packet loss.
ronin13 commented Aug 15, 2014

Note that this is up to c5353b1 in the galera-3.x tree.
I think gh71 is supposed to fix the network partitioning problem, but it is not merged into 3.x yet.
ronin13 commented Aug 16, 2014

With gh71 merged, I am seeing 3 nodes evicted instead of just the 1 node with packet loss. Full logs: http://files.wnohang.net/files/results-142.tar.gz
ronin13 commented Aug 17, 2014

The earlier logs were with multiple segments. http://files.wnohang.net/files/results-145.tar.gz is with segment 0 for all nodes. In this run, Dock3 and Dock10 get evicted, even though Dock3 is the one with the packet loss.
ronin13 commented Aug 21, 2014

Logs: http://files.wnohang.net/files/results-154.tar.gz
Console: http://jenkins.percona.com/job/PXC-5.6-netem/154/label_exp=qaserver-04/console

Dock2 and Dock5 are evicted even though only Dock5 has the loss. Nodes failed to reach primary too, but the cluster lasted a bit longer before going non-primary.
ronin13 commented Aug 21, 2014

http://files.wnohang.net/files/results-156.tar.gz

This is with evs.info_log_mask=0x3 in addition to the other settings. Dock9 is the one with the loss, yet Dock1 is evicted.
temeo added the 2 - Working label on Sep 12, 2014
LGTM.
ronin13 referenced this issue in codership/documentation on Oct 21, 2014: Document EVS auto eviction #27 (closed)
Monitoring, manual eviction and automatic eviction from the original plan have been implemented. If there is still a need for the join time health check, it should be reported as a separate issue.
temeo commented Jun 25, 2014

This will be the parent ticket for further packet-loss-related work.
Original report: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1274192
Introducing high delay and packet loss into the network makes inter-node communication highly unreliable, in the form of duplicated, lost or delayed messages. These kinds of conditions bring up some EVS-related bugs, like the currently open #37 and #40. EVS protocol bugs are not in the scope of this ticket and should be reported and fixed separately.
Further work can be divided roughly into three parts, which are outlined below.
1. Monitoring and manual eviction
Automatic node isolation or eviction is currently not possible since EVS lacks some necessary elements, like proper per-node statistics collection and a protocol to communicate the current view of node states without running the full membership protocol. Running the full membership protocol in an unstable network should be avoided, as it may result in oscillating membership if the network conditions are difficult enough. Therefore the first implementation of node eviction should be based on monitoring and manual eviction, until a better understanding of automatic eviction has been gained.
Proposed implementation:
An evs.evict wsrep provider option taking a list of node UUIDs that should be evicted manually from the cluster.
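As a rough sketch of how such an option value could be handled, the helper below splits a comma-separated UUID list into an eviction set; the function name and parsing details are assumptions, not the actual provider option code.

```cpp
// Sketch: split a comma-separated UUID list (as an evs.evict-style
// option value might carry) into an eviction set. Hypothetical helper,
// not the actual provider option handling.
#include <set>
#include <sstream>
#include <string>

std::set<std::string> parse_evict_list(const std::string& value)
{
    std::set<std::string> evicted;
    std::istringstream is(value);
    std::string uuid;
    while (std::getline(is, uuid, ','))
    {
        if (!uuid.empty()) evicted.insert(uuid);
    }
    return evicted;
}
```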
2. Join time health check

Joining a node with a bad network connection is problematic, since the join operation starts an EVS membership protocol round, which in turn performs poorly in an unstable network. To avoid starting the join operation over an unstable network, an additional health check phase for GMCast should be devised.
When the GComm protocol stack is started, the joiner should first connect to all known peers in the GMCast network and exchange keepalive packets to verify that the network is OK. The joining node starts the upper GComm protocol layers only after the health check passes. Other nodes should not treat the joining node as a fully qualified member of the GMCast network until the joiner sends its first upper level protocol packet.
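A possible shape for that health check phase, sketched in C++ with placeholder types; PeerLink and its methods stand in for the real GMCast transport and are assumptions, not existing APIs.

```cpp
// Sketch of the proposed join-time health check: probe every known peer
// and start the upper GComm layers only if enough probes are answered.
// PeerLink and its methods are placeholders for the real GMCast transport.
#include <chrono>
#include <vector>

struct PeerLink
{
    bool send_keepalive();                                  // send one probe
    bool wait_response(std::chrono::milliseconds timeout);  // wait for reply
};

bool join_health_check(std::vector<PeerLink>& peers,
                       int probes_per_peer,
                       double required_success_ratio)
{
    int attempted = 0;
    int answered  = 0;
    for (auto& peer : peers)
    {
        for (int i = 0; i < probes_per_peer; ++i)
        {
            ++attempted;
            if (peer.send_keepalive() &&
                peer.wait_response(std::chrono::milliseconds(500)))
            {
                ++answered;
            }
        }
    }
    // Only pass if the network looks healthy; otherwise the joiner
    // should not trigger an EVS membership round yet.
    return attempted > 0 &&
           static_cast<double>(answered) / attempted >= required_success_ratio;
}
```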
3. Automatic eviction
After enough understanding has been gained about how to properly identify the node that is causing turbulence for group communication, an automatic eviction protocol can be enabled. This work will require proper per-node statistics collection at the EVS level and a protocol extension to communicate the statistics-related view to other nodes.
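For illustration, one possible layout of such per-node statistics; the field names are assumptions about what the EVS layer might need to collect, not existing gcomm structures.

```cpp
// One possible layout for the per-node statistics EVS would need to
// collect before automatic eviction can be decided reliably; field
// names are assumptions for illustration.
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

struct NodeStats
{
    std::uint64_t msgs_sent       = 0;
    std::uint64_t msgs_received   = 0;
    std::uint64_t retransmissions = 0;  // resend requests caused by this node
    std::uint64_t delayed_checks  = 0;  // times the node exceeded the margin
    std::chrono::microseconds max_response_time{0};
};

// One entry per group member, keyed by node UUID. A protocol extension
// would periodically exchange a digest of this view between nodes.
using StatsView = std::map<std::string, NodeStats>;
```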