Health check by rpc call #694

Merged
merged 38 commits into from Apr 15, 2019

@zyearn (Member) commented Mar 20, 2019

By default, a health check succeeds as soon as the server can be connected. If this feature is used, the health check is not complete until the server can be connected and an additional HTTP call, configured by FLAGS_health_check_path and FLAGS_health_check_timeout_ms, also succeeds.

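The change works as follows. During health checking, the default behaviour is that the server is revived as soon as connect succeeds. When health_check_path is set, the socket instead enters a "health check by RPC" state after Revive(), and the original HealthCheckTask ends; at this point users still cannot select this machine from the lb or as a single server. While the socket is in this state: if the RPC succeeds, the socket is restored to the normal state; if the RPC fails, RPCs keep being sent at intervals for as long as the socket stays connected; if the RPC fails and the socket is in the Failed state, this state ends, and the earlier SetFailed triggers the next hc.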

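User calls also need to be distinguished from hc calls: the former cannot select the socket, while the latter can select it normally and therefore send the RPC. This is implemented by adding a flag to the controller.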
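The state transitions described above can be sketched as a small model. This is only an illustration under stated assumptions: `SketchSocket`, `SocketState`, `Selectable` and `RunAppHealthCheck` are hypothetical names, not brpc's real Socket API, and the retry interval between RPCs is left out.

```cpp
#include <functional>

// Hypothetical model of the states described in the PR text.
enum class SocketState { kHealthy, kFailed, kAppHealthCheck };

struct SketchSocket {
    SocketState state = SocketState::kFailed;

    // The connection-level check succeeded. Without a health-check path the
    // server recovers at once; with one it enters the RPC-check state.
    void OnConnected(bool has_health_check_path) {
        state = has_health_check_path ? SocketState::kAppHealthCheck
                                      : SocketState::kHealthy;
    }

    // Ordinary calls may only pick healthy sockets. A call flagged as a
    // health-check call may also pick a socket in the RPC-check state.
    bool Selectable(bool is_health_check_call) const {
        if (state == SocketState::kHealthy) return true;
        return is_health_check_call && state == SocketState::kAppHealthCheck;
    }
};

// Sends health-check RPCs until one succeeds or the connection breaks.
// send_rpc returns true when the RPC succeeded; connection_broken returns
// true when the socket has been set to failed. Returns the number of RPCs
// sent. A real implementation would wait between attempts.
int RunAppHealthCheck(SketchSocket& s,
                      const std::function<bool()>& send_rpc,
                      const std::function<bool()>& connection_broken) {
    int sent = 0;
    while (s.state == SocketState::kAppHealthCheck) {
        ++sent;
        if (send_rpc()) {
            s.state = SocketState::kHealthy;   // server is usable again
        } else if (connection_broken()) {
            s.state = SocketState::kFailed;    // connect-level check takes over
        }
        // Otherwise stay in the RPC-check state and try again.
    }
    return sent;
}
```

In this model a socket in the RPC-check state stays invisible to ordinary calls until an RPC succeeds, which matches the distinction between user calls and hc calls made above.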

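@@ -242,7 +242,7 @@ locality-aware: prefers downstreams with lower latency, until their latency is higher than that of other machines
| ------------------------- | ----- | ---------------------------------------- | ----------------------- |
| health_check_interval (R) | 3 | seconds between consecutive health-checkings | src/brpc/socket_map.cpp |

-Once a server is connected, it is restored to the available state. If the server is removed from the naming service while it is isolated, brpc also stops trying to connect.
+Under the default configuration, once a server is connected it is restored to the available state. brpc also provides an application-level health check mechanism over HTTP: the server counts as recovered only when it returns 200. Enable this by setting -health\_check\_path to the path to be checked (if the downstream is also brpc, /health is recommended, which returns 200 when the service is healthy); -health\_check\_timeout\_ms sets the timeout (500ms by default). If the server is removed from the naming service while it is isolated, brpc also stops trying to connect.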
Contributor

Describe what input the HTTP service behind -health_check_path expects and what output it returns.
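As a rough illustration of the exchange this comment asks to have documented, the sketch below builds a request for the configured path and accepts only a 200 status line, which is the rule stated in the doc change. The GET method, the `Host` header and both function names are assumptions made for the example; they are not taken from the PR.

```cpp
#include <string>

// Builds the request a health checker might send (assumed shape: a plain
// GET to the configured path with no body).
std::string BuildHealthCheckRequest(const std::string& path,
                                    const std::string& host) {
    return "GET " + path + " HTTP/1.1\r\nHost: " + host + "\r\n\r\n";
}

// Returns true only if the status line carries status code 200,
// for example "HTTP/1.1 200 OK".
bool IsHealthyStatusLine(const std::string& status_line) {
    const std::string::size_type sp = status_line.find(' ');
    if (sp == std::string::npos) return false;
    // The three characters after the first space must be exactly "200",
    // followed by either the end of the line or another space.
    return status_line.compare(sp + 1, 3, "200") == 0
        && (status_line.size() == sp + 4 || status_line[sp + 4] == ' ');
}
```

Any other status, such as 503, leaves the server in the isolated state under this rule.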

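@@ -242,7 +242,7 @@ locality-aware: prefers downstreams with lower latency, until their latency is higher than that of other machines
| ------------------------- | ----- | ---------------------------------------- | ----------------------- |
| health_check_interval (R) | 3 | seconds between consecutive health-checkings | src/brpc/socket_map.cpp |

-Once a server is connected, it is restored to the available state. If the server is removed from the naming service while it is isolated, brpc also stops trying to connect.
+Under the default configuration, once a server is connected it is restored to the available state. brpc also provides an application-level health check mechanism over HTTP: the server counts as recovered only when it returns 200. Enable this by setting -health\_check\_path to the path to be checked (if the downstream is also brpc, /health is recommended, which returns 200 when the service is healthy); -health\_check\_timeout\_ms sets the timeout (500ms by default). If the server is removed from the naming service while it is isolated, brpc also stops trying to connect.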
Contributor

Does health_check_timeout_ms also need to apply to Connect?

Member Author

It probably should.

@@ -118,6 +118,8 @@ friend int StreamCreate(StreamId*, Controller&, const StreamOptions*);
friend int StreamAccept(StreamId*, Controller&, const StreamOptions*);
friend void policy::ProcessMongoRequest(InputMessageBase*);
friend void policy::ProcessThriftRequest(InputMessageBase*);
+friend class OnAppHealthCheckDone;
+friend class HealthCheckManager;
Contributor

Use PrivateAccessor; don't add friend classes.
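The pattern the reviewer refers to can be sketched as follows: the class grants friendship to a single accessor type, and internal code goes through that accessor instead of each new helper being added as a friend. All names here (`SketchController`, `SketchControllerPrivateAccessor`, `set_health_check_call`) are hypothetical and only illustrate the idea; they are not brpc's actual declarations.

```cpp
// Hypothetical controller with one private flag that internal code needs
// to set. Only the accessor is a friend.
class SketchController {
public:
    bool is_health_check_call() const { return _is_health_check_call; }

private:
    friend class SketchControllerPrivateAccessor;  // the only friend
    bool _is_health_check_call = false;
};

// Internal code uses this wrapper to reach private state, so adding a new
// internal helper does not require touching the controller's friend list.
class SketchControllerPrivateAccessor {
public:
    explicit SketchControllerPrivateAccessor(SketchController* cntl)
        : _cntl(cntl) {}

    SketchControllerPrivateAccessor& set_health_check_call() {
        _cntl->_is_health_check_call = true;
        return *this;
    }

private:
    SketchController* _cntl;
};
```

The design choice is that the public header lists one friend rather than growing a new `friend class` line for every internal component.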

@@ -122,7 +122,8 @@ int DynPartLoadBalancer::SelectServer(const SelectIn& in, SelectOut* out) {
for (size_t i = 0; i < n; ++i) {
const SocketId id = s->server_list[i].id;
if ((!exclusion || !ExcludedServers::IsExcluded(in.excluded, id))
-&& Socket::Address(id, &ptrs[nptr].first) == 0) {
+&& Socket::Address(id, &ptrs[nptr].first) == 0
+&& !(*out->ptr)->IsAppHealthCheck()) {
Contributor

Read the surrounding context carefully.

"By default health check succeeds if server can be connected. If this"
"flag is set, health check is completed not only when server can be"
"connected but also an additional http call succeeds indicated by this"
"flag and FLAGS_health_check_timeout_ms");
Contributor

if server can be connected -> if the server is connectable

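"not only...but also" says that the hc completes both when the connection is established and again when the http call succeeds, that is, it completes more than once, which is surely not the intended meaning. It should read: If this flag is set, health check is not completed until a http call to the path succeeds within -health_check_timeout_ms.

If I were writing it, I would also add a hint: " (to make sure the server functions well)".
Another detail: FLAGS_xxx is the name used inside the program; for users it is naturally -xxx, not FLAGS_xxx.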
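Putting the reviewer's suggestions together, the help text might look like the sketch below. The constant name `kHealthCheckPathHelp` is hypothetical, and the string is shown as a plain constant rather than inside a gflags definition so that the example stands alone. Because adjacent string literals are concatenated verbatim, each piece ends with a space.

```cpp
#include <string>

// Hypothetical constant holding the revised flag description. Each literal
// ends with a space because the compiler joins adjacent literals without
// inserting any separator.
const char* const kHealthCheckPathHelp =
    "By default health check succeeds if the server is connectable. "
    "If this flag is set, health check is not completed until a http "
    "call to the path succeeds within -health_check_timeout_ms "
    "(to make sure the server functions well).";
```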

@jamesge jamesge merged commit 1ebba7f into apache:master Apr 15, 2019
4 participants