scripts: check health status for all clusters #244

Merged 4 commits on Dec 29, 2018

neverchanje
Contributor

@neverchanje neverchanje commented Dec 27, 2018

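  • scripts/pegasus_check_clusters.py: checks the status of each cluster, reporting whether the load is unbalanced and whether there are unhealthy nodes. The balance check is still fairly crude and may produce false positives; later it can be combined with "Support check if the cluster is balanced" #237.
    In the output below, c3srv-xchat has no imbalance, so its output is empty.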
=== c3srv-ada
address                 status              replica_count       primary_count       secondary_count     
xxxxxxxxxxxxx:47801     ALIVE               86                  48                  38                  
xxxxxxxxxxxxx:47801     ALIVE               86                  48                  38                  
xxxxxxxxxxxxx:47801     ALIVE               93                  13                  80                  
xxxxxxxxxxxxx:47801     ALIVE               69                  12                  57                  
xxxxxxxxxxxxx:47801     ALIVE               87                  16                  70                  

total_node_count   : 5
alive_node_count   : 5
unalive_node_count : 0
cluster is write unhealthy, write_unhealthy_app_count = 1
cluster is read unhealthy, read_unhealthy_app_count = 1
===
=== c3srv-xchat
===
=== c3srv-lbs
===
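
A crude balance check of the kind described above could look like the following. This is only a sketch, not the logic in scripts/pegasus_check_clusters.py: the function name `skewed_fields`, the 20% tolerance, and the spread-over-mean rule are all assumptions made for illustration. The node rows are taken from the c3srv-ada output above.

```python
from typing import Dict, List, Sequence


def skewed_fields(
    nodes: List[Dict[str, int]],
    fields: Sequence[str] = ("replica_count", "primary_count"),
    tolerance: float = 0.2,  # hypothetical threshold, not from the script
) -> List[str]:
    """Return the fields whose (max - min) spread exceeds tolerance * mean."""
    skewed = []
    for field in fields:
        values = [node[field] for node in nodes]
        mean = sum(values) / len(values)
        if mean == 0:
            continue  # nothing placed on any node, nothing to compare
        if (max(values) - min(values)) / mean > tolerance:
            skewed.append(field)
    return skewed


# Rows copied from the c3srv-ada sample output.
c3srv_ada = [
    {"replica_count": 86, "primary_count": 48},
    {"replica_count": 86, "primary_count": 48},
    {"replica_count": 93, "primary_count": 13},
    {"replica_count": 69, "primary_count": 12},
    {"replica_count": 87, "primary_count": 16},
]

if __name__ == "__main__":
    print(skewed_fields(c3srv_ada))
```

A rule this simple only looks at totals per node, which is why it can flag clusters that meta itself considers balanced.
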
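  • scripts/pegasus_check_ports.py: checks the ports used by the clusters, reports conflicting ports, suggests a port number to use, and shows how many meta processes are running on each machine.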
cluster c4srv-msg: 36600 [xxxxxxxxx # xxxxxxxxxxxx.bj]
cluster c4srv-adc: 53600 [xxxxxxxxx # xxxxxxxxxxxx.bj]
cluster c4srv-adb: 45600 [xxxxxxxxx # xxxxxxxxxxxx.bj]
cluster c4srv-feedprofile: 59600 [yyyyyyyyy # yyyyyyyyyyyy.bj]

port number conflicted: 53600 c4srv-feedhistory [xxxxxxxxxxxx # xxxxxxxxxxxx.bj]

recommended port number for [xxxxxxxxx # xxxxxxxxxxxx.bj] is: 54600
[xxxxxxxxx # xxxxxxxxxxxx.bj] has in total 3 clusters on it

recommended port number for [yyyyyyyyy # yyyyyyyyyyyy.bj] is: 60600
[yyyyyyyyy # yyyyyyyyyyyy.bj] has in total 1 clusters on it
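
The sample output above is consistent with a simple per-machine rule: a port is in conflict if a second cluster on the same machine claims it, and the recommended port is the highest port in use plus 1000. The sketch below encodes that reading. It is inferred from the sample output only, so `check_ports`, `PORT_STEP`, and the host names `host-x` and `host-y` are hypothetical and may not match what scripts/pegasus_check_ports.py actually does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PORT_STEP = 1000  # assumption: inferred from 53600 -> 54600 and 59600 -> 60600


def check_ports(
    entries: Iterable[Tuple[str, str, int]],
) -> Tuple[List[Tuple[int, str, str]], Dict[str, int], Dict[str, int]]:
    """entries are (cluster, host, port) triples.

    Returns (conflicts, recommended_port_per_host, cluster_count_per_host).
    """
    ports_by_host: Dict[str, Dict[int, str]] = defaultdict(dict)
    conflicts = []
    for cluster, host, port in entries:
        if port in ports_by_host[host]:
            # A second cluster on the same host wants an already-used port.
            conflicts.append((port, cluster, host))
        else:
            ports_by_host[host][port] = cluster
    recommended = {h: max(p) + PORT_STEP for h, p in ports_by_host.items()}
    counts = {h: len(p) for h, p in ports_by_host.items()}
    return conflicts, recommended, counts


# Mirrors the sample output, with placeholder host names.
sample = [
    ("c4srv-msg", "host-x", 36600),
    ("c4srv-adc", "host-x", 53600),
    ("c4srv-adb", "host-x", 45600),
    ("c4srv-feedprofile", "host-y", 59600),
    ("c4srv-feedhistory", "host-x", 53600),
]

if __name__ == "__main__":
    conflicts, recommended, counts = check_ports(sample)
    print(conflicts, recommended, counts)
```
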

@shengofsun
Contributor

shengofsun commented Dec 29, 2018

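What do you mean by false positives? The meta server's current balancing strategy does not adjust across tables for partition counts that do not divide evenly among the nodes, so the distribution may look somewhat uneven.
Also, does this require installing any Python libraries? If so, remember to update the documentation.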

@neverchanje
Contributor Author

@shengofsun
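It requires click; the installation steps are in the comment at the top of the script.
A false positive is a case like staging, which has a very large number of tables with 4 partitions each: the cluster looks uneven, but meta actually considers it balanced. Most clusters do not have this problem. A colleague is working on a shell command that queries meta's is_balanced(); once the script uses that, there will be no more false positives.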
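
To make the staging case concrete, here is a small worked example. It assumes, purely for illustration, that a table counts as balanced when its per-node counts differ by at most one, and that the leftover partitions of every table land on the same nodes (the worst case). The node count of 5, the table count of 50, and both helper names are hypothetical; none of this is taken from the meta server code.

```python
from typing import List


def per_table_counts(partitions: int, nodes: int) -> List[int]:
    """Spread one table's primaries so node counts differ by at most one."""
    base, extra = divmod(partitions, nodes)
    return [base + 1 if i < extra else base for i in range(nodes)]


def cluster_totals(tables: int, partitions: int, nodes: int) -> List[int]:
    """Sum per-node counts when every table uses the same placement."""
    totals = [0] * nodes
    for _ in range(tables):
        for i, count in enumerate(per_table_counts(partitions, nodes)):
            totals[i] += count
    return totals


if __name__ == "__main__":
    # Each table: 4 primaries on 5 nodes, so one node always gets none.
    print(per_table_counts(4, 5))
    # 50 such tables: every table is even on its own, the totals are not.
    print(cluster_totals(50, 4, 5))
```

Under these assumptions every table passes a per-table check, yet a script that compares per-node totals would report the cluster as uneven.
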

@neverchanje neverchanje merged commit 53ff3d7 into apache:master Dec 29, 2018
neverchanje pushed a commit to neverchanje/pegasus that referenced this pull request Jul 13, 2019
Former-commit-id: f2087564d1f156ad4b33f8364b3004a5ca23785d [formerly 53ff3d7]
Former-commit-id: 73282c611173089a15f7164fb1888fd554cd23b6