
Discussion of how the MetaServer handles partitions in the DDD state #80

Closed
qinzuoyan opened this issue Jun 5, 2018 · 3 comments
@qinzuoyan
Member

Background

The following situation occasionally occurs in production:

  1. The cluster is healthy. Suppose a partition's config is as follows (the first node is the Primary):
    Config[C,B,A], LastDrop[]
  2. ReplicaServer node A goes down (due to a full disk, a bad disk, a machine crash, etc.). Because of the replica_assign_delay_ms_for_dropouts limit, the MetaServer does not immediately add a replacement replica:
    Config[C,B], LastDrop[A]
  3. Another ReplicaServer node B also goes down (perhaps due to a full disk, a core dump during learn, etc.). The partition is now down to a single replica, so the MetaServer immediately adds a secondary on a new node. If there are many replicas or the data volume is large, the learn process may be slow, so for a while the partition has only one replica:
    Config[C], LastDrop[A,B]
  4. Node A restarts successfully, the ReplicaServer recovers, and the missing replica is added back:
    Config[C,A], LastDrop[B]
  5. If the cluster is stopped at this point, since the order in which the nodes stop is nondeterministic, the config finally recorded by the MetaServer to Zookeeper may be:
    Config[], LastDrop[B,A,C]
  6. For some reason (perhaps corrupted data) node A has to be kicked out, so when the whole cluster is restarted, node A is not started. The partition then enters the DDD state, but the last two nodes in LastDrop are A and C. Under the current policy, a Primary can only be selected once the last two nodes in LastDrop have both come back. Since A cannot start, the error "last dropped node A haven't come back yet" appears, and the operations team has to intervene manually.

The above is just one example. In fact, whenever a partition enters the DDD state and one of the last two nodes in LastDrop cannot start normally, the partition ends up in a DDD state that requires manual intervention. As nodes in a production cluster are repeatedly stopped and started, this situation can easily arise.
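To make the current policy concrete, here is a minimal, self-contained C++ sketch of the check described above. It is illustrative only: the names dropped_node and select_primary_for_ddd are made up (they are not the real server_load_balancer.cpp interfaces), and the ballot / committed-decree comparison is an assumption about how the "collected information" could be used once both nodes are back.

// Illustrative sketch only -- not the real server_load_balancer.cpp code.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct dropped_node
{
    std::string addr;        // "ip:port" of the replica server
    bool alive = false;      // has the node come back and reported its replica?
    int64_t ballot = -1;     // ballot it reported after coming back (-1 = unknown)
    int64_t committed = -1;  // committed decree it reported (-1 = unknown)
};

// last_drop is ordered from the earliest dropped node to the most recently
// dropped one, matching the LastDrop list in the scenario above.
std::optional<std::string> select_primary_for_ddd(const std::vector<dropped_node> &last_drop)
{
    if (last_drop.size() < 2)
        return std::nullopt;
    const dropped_node &a = last_drop[last_drop.size() - 2];
    const dropped_node &b = last_drop.back();

    // Current policy: both of the last two dropped nodes must be back,
    // otherwise the partition stays in DDD state and waits for an
    // administrator ("last dropped node X haven't come back yet").
    if (!a.alive || !b.alive)
        return std::nullopt;

    // With both back, prefer the replica with the newer state, so that no
    // write acknowledged to clients can be missed (assumption about how the
    // collected ballot/decree information is used).
    return std::make_pair(b.ballot, b.committed) >= std::make_pair(a.ballot, a.committed)
               ? b.addr
               : a.addr;
}

int main()
{
    // The scenario from the background: LastDrop[B, A, C], where A is the
    // node that cannot be started again.
    std::vector<dropped_node> last_drop = {
        {"B", true, 3, 4}, {"A", false, -1, -1}, {"C", true, 5, 4}};
    auto p = select_primary_for_ddd(last_drop);
    std::cout << (p ? "primary = " + *p : "DDD: waiting for administrator") << std::endl;
}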

If there are many such partitions, manual intervention becomes a heavy workload; recovering them one by one by hand is not realistic. So we should consider whether the recovery process for this case can be automated.

If, of the last two nodes in LastDrop, one has recovered and the other is the node being kicked out, recovery can in fact be automated, for example by directly selecting the recovered node as the Primary. But we need to prove that such a choice definitely does not lose data.
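As a rough illustration of what that automation could look like, here is a hypothetical extension of the sketch above. It reuses the dropped_node struct and the same comparison; nodes_to_remove is a made-up parameter standing for the set of nodes the administrator has explicitly decided to kick out of the cluster.

// Relaxed variant (sketch); compile together with the sketch above.
#include <set>

std::optional<std::string> select_primary_for_ddd_relaxed(
    const std::vector<dropped_node> &last_drop,
    const std::set<std::string> &nodes_to_remove)
{
    if (last_drop.size() < 2)
        return std::nullopt;
    const dropped_node &a = last_drop[last_drop.size() - 2];
    const dropped_node &b = last_drop.back();

    // Same as the current policy when both of the last two nodes are back.
    if (a.alive && b.alive)
        return std::make_pair(b.ballot, b.committed) >= std::make_pair(a.ballot, a.committed)
                   ? b.addr
                   : a.addr;

    // New case: exactly one of the two is back, and the missing one has been
    // explicitly kicked out, so it will never report its state. Pick the
    // recovered node -- provided we can prove this never loses acknowledged
    // writes, which is the open question raised above.
    if (a.alive && nodes_to_remove.count(b.addr))
        return a.addr;
    if (b.alive && nodes_to_remove.count(a.addr))
        return b.addr;

    return std::nullopt;  // still DDD; leave it to the administrator
}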

Alternatively, as a step back, even without full automation, we could provide an automatic diagnosis tool in the Shell that lists the partitions stuck in a DDD state requiring manual intervention, suggests how to intervene, and lets the user confirm and pick an appropriate action. This would greatly reduce the operations workload while still guaranteeing data correctness.

@qinzuoyan qinzuoyan self-assigned this Jun 6, 2018
@qinzuoyan
Member Author

qinzuoyan commented Jun 7, 2018

Simulating a DDD situation that requires manual intervention in a onebox environment (e.g. based on Pegasus v1.10.0):

  1. Clean up the environment:
./run.sh clear_onebox
  2. Start the onebox cluster and wait for table creation to finish:
./run.sh start_onebox -m 1 -r 8 -p 32 --use_product_config -w
  3. Get the three-replica config of the partition with pidx=31 and record the IDs of the nodes hosting the one primary and two secondaries, say id1, id2, id3 (note: the last digit of the port number is the node ID):
echo 'app temp -d' | ./run.sh shell | grep '^31'
  4. Modify the MetaServer config and restart the MetaServer (disable add_secondary, to simulate the case where learning a secondary cannot finish):
sed -i 's/add_secondary_max_count_for_one_node = [0-9]*/add_secondary_max_count_for_one_node = 0/' onebox/meta1/config.ini

./run.sh restart_onebox_instance -m 1
  5. Stop the ReplicaServers id1, id2, id3 in that order, waiting about 15 seconds after each stop:
./run.sh stop_onebox_instance -r <id1>
sleep 15

./run.sh stop_onebox_instance -r <id2>
sleep 15

./run.sh stop_onebox_instance -r <id3>
sleep 15
  6. Check the table's config info now: the replica count of the partition with pidx=31 has dropped to 0:
echo 'app temp -d' | ./run.sh shell | grep '^31'
  7. Modify the MetaServer config and restart the MetaServer (re-enable add_secondary, to simulate the case where learning a secondary can finish normally):
sed -i 's/add_secondary_max_count_for_one_node = 0/add_secondary_max_count_for_one_node = 20/' onebox/meta1/config.ini

./run.sh restart_onebox_instance -m 1
  8. Restart the two ReplicaServers that were stopped first, and wait about 15 seconds:
./run.sh start_onebox_instance -r <id1>
./run.sh start_onebox_instance -r <id2>
sleep 15
  9. Check the table's config info again: the replica count of the partition with pidx=31 is still 0, i.e. it has not recovered:
echo 'app temp -d' | ./run.sh shell | grep '^31'

Check the MetaServer log, which shows that the partition has entered a DDD state requiring manual intervention:

grep '1.31 enters DDD state' onebox/meta1/data/log/log.3.txt

Example log:

W2018-06-08 00:03:27.156 (1528387407156783000 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:455:on_missing_primary(): 1.31 enters DDD state, we are waiting for all replicas to come back, and select primary according to informations collected
D2018-06-08 00:03:27.156 (1528387407156794948 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:475:on_missing_primary(): 1.31: config_context.dropped[0]: node(10.231.58.225:34801), time(1528386967131){2018-06-07 23:56:07.131}, ballot(-1), commit_decree(-1), prepare_decree(-1)
D2018-06-08 00:03:27.156 (1528387407156808931 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:475:on_missing_primary(): 1.31: config_context.dropped[1]: node(10.231.58.225:34804), time(1528386967131){2018-06-07 23:56:07.131}, ballot(3), commit_decree(4), prepare_decree(5)
D2018-06-08 00:03:27.156 (1528387407156822007 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:475:on_missing_primary(): 1.31: config_context.dropped[2]: node(10.231.58.225:34803), time(1528386967131){2018-06-07 23:56:07.131}, ballot(5), commit_decree(4), prepare_decree(5)
D2018-06-08 00:03:27.156 (1528387407156844264 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:482:on_missing_primary(): 1.31: config_context.last_drop[0]: node(10.231.58.225:34804)
D2018-06-08 00:03:27.156 (1528387407156853696 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:482:on_missing_primary(): 1.31: config_context.last_drop[1]: node(10.231.58.225:34803)
D2018-06-08 00:03:27.156 (1528387407156862491 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:482:on_missing_primary(): 1.31: config_context.last_drop[2]: node(10.231.58.225:34801)
D2018-06-08 00:03:27.156 (1528387407156872455 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:498:on_missing_primary(): 1.31: last two drops are 10.231.58.225:34803 and 10.231.58.225:34801
W2018-06-08 00:03:27.156 (1528387407156882152 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:505:on_missing_primary(): 1.31: last dropped node 10.231.58.225:34801 haven't come back yet
W2018-06-08 00:03:27.156 (1528387407156889570 6691)   meta.meta_state0.020100000000000f: server_load_balancer.cpp:590:on_missing_primary(): 1.31: don't select any node for security reason, administrator can select a proper one by shell

@neverchanje
Contributor

"只有LastDrop最后的两个节点都恢复了,才能选出一个Primary"

为什么要这么设计?

@qinzuoyan
Member Author

This was done originally to guarantee data safety: the most recently dropped replicas are the ones that may hold the newest committed writes, so the Primary is only chosen after both of them have come back and their ballot / committed decree can be compared; otherwise writes acknowledged to clients could be silently lost.
