[ISSUE #4925] correct the member's state when the cluster information… #4948

MajorHe1 · 2021-02-24T13:30:51Z

Please do not create a Pull Request without creating an issue first.

What is the purpose of the change

Try to fix issue #4925.

Brief changelog

Set the member's state to DOWN when the member joins the server list for the first time, in case the corresponding nacos-server has not been started.
Keep the member's state consistent with the value in the server list; otherwise there will be problems like issue #4925 (nacos 1.4.1: when a node in the cluster goes offline, the remaining nodes cannot perceive its state correctly and stably).

Verifying this change

XXXX

Follow this checklist to help us incorporate your contribution quickly and easily:

Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
Write the necessary unit tests to verify your logic correction; prefer mocking when cross-module dependencies exist. If a new feature or significant change is committed, please remember to add an integration test in the test module.
Run mvn -B clean package apache-rat:check findbugs:findbugs -Dmaven.test.skip=true to make sure the basic checks pass. Run mvn clean install -DskipITs to make sure the unit tests pass. Run mvn clean test-compile failsafe:integration-test to make sure the integration tests pass.

…rmation changed

CLAassistant · 2021-02-24T13:30:56Z

All committers have signed the CLA.

KomachiSion · 2021-02-25T02:42:34Z

core/src/main/java/com/alibaba/nacos/core/cluster/ServerMemberManager.java

+                member.setState(NodeState.DOWN);
+            } else {
+                //fix issue # 4925
+                member.setState(serverList.get(address).getState());


I don't think this is the right fix.

This method updates the members, so I think we should use the input member's state.

MajorHe1 · 2021-02-25

I don't think we can use the input member's state in this method:
synchronized boolean memberChange(Collection<Member> members) {}

This method is only called in three places:
1.

In this case, the member state is set to UP by default, whether the member comes from cluster.conf or from the address-server.

If a member has already been judged as DOWN, it is reset to UP after this method runs. That is why issue 4925 occurred.

In this case, a joined member is not in the serverList. We cannot be sure whether the joined member's process has started, so it should be set to DOWN at first and then set to UP after MemberInfoReportTask completes.

This has no effect on the method memberLeave().
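The behaviour described above can be sketched as follows. This is only a minimal illustration, not the code in ServerMemberManager: the `Member` and `NodeState` types here are simplified, hypothetical stand-ins, the addresses are made up, and `serverList` is modelled as a plain map keyed by address.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MemberStateSketch {

    enum NodeState { UP, DOWN }

    /** Hypothetical, simplified stand-in for the real Member class. */
    static final class Member {
        final String address;
        NodeState state = NodeState.UP; // freshly parsed members default to UP

        Member(String address) {
            this.address = address;
        }
    }

    /** Stand-in for the in-memory server list, keyed by address. */
    final Map<String, Member> serverList = new HashMap<>();

    synchronized void memberChange(Collection<Member> members) {
        Map<String, Member> updated = new HashMap<>();
        for (Member member : members) {
            Member known = serverList.get(member.address);
            if (known == null) {
                // First time this address is seen: its process may not be running yet.
                member.state = NodeState.DOWN;
            } else {
                // Already known: keep the state held in memory, not the default UP.
                member.state = known.state;
            }
            updated.put(member.address, member);
        }
        serverList.clear();
        serverList.putAll(updated);
    }

    public static void main(String[] args) {
        MemberStateSketch manager = new MemberStateSketch();

        // A member joins for the first time.
        manager.memberChange(List.of(new Member("10.0.0.1:8848")));
        System.out.println(manager.serverList.get("10.0.0.1:8848").state); // DOWN

        // Simulate a later report marking that member as UP.
        manager.serverList.get("10.0.0.1:8848").state = NodeState.UP;

        // The cluster information is refreshed with one additional address.
        manager.memberChange(List.of(new Member("10.0.0.1:8848"), new Member("10.0.0.2:8848")));
        System.out.println(manager.serverList.get("10.0.0.1:8848").state); // UP
        System.out.println(manager.serverList.get("10.0.0.2:8848").state); // DOWN
    }
}
```

In this sketch a refresh of the cluster information never overwrites a state that is already held in memory; only addresses that were not known before start as DOWN.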

KomachiSion · 2021-02-26

I'm not sure whether it will affect other report or state-update workflows. I think we need to discuss this.

@chuntaojun @shiyiyue1102 What do you think?

MajorHe1 · 2021-02-26

Not only the state value: I think all of the member's properties should be based on the values in the server list.
Otherwise, the data returned by the interface /nacos/v1/ns/operator/servers keeps changing (being reset), for the same reason as issue 4925.

KomachiSion · 2021-02-26

But some other Member attributes should be modified by other members' reports.
So I think the final value should be updated from the inputs.

chuntaojun · 2021-02-26

This is a temporary plan. Eventually we should sort out how remote data updates and local data updates interact in order to solve this problem.

MajorHe1 · 2021-03-01

But some other Member attributes should be modified by other members' reports.
So I think the final value should be updated from the inputs.

There is no problem with the final value being updated from the inputs, because this method
memberChange(Collection<Member> members)
is not called when nacos-server receives other members' reports.
It seems that this method is only called when the cluster information is updated, so I think the member's attribute values should be based on the values in nacos-server memory.

zrlw · 2021-12-29T01:18:14Z

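KomachiSion's concern was right. We ran into a problem caused by the change in #4948:
A nacos server node that operations needed to restart did not forward heartbeats to the responsible nodes for more than 20 seconds after startup. We later found that this happened because #4948 changed the state of the other nodes from the default UP to DOWN. Going from DOWN to UP requires the current node to receive health-check reports from all the other nodes, and the other nodes send such a report only after the restarted node has first sent its own health report to notify them.
The root cause is that MemberInfoReportTask not only starts after a 5-second delay following node startup, but also runs serially: it sends only one report to one target node at a time, with a 2-second interval between sends. With 10 nodes, 25 seconds have passed by the time all of them have been notified. During that period the restarted node oversteps its role and accepts heartbeats that other nodes are responsible for receiving, and as a result the other nodes take the instances whose heartbeats have expired offline.
We changed MemberInfoReportTask so that every execution notifies all the other nodes, so that notification does not become too slow as the number of nodes grows and leave the states inconsistent.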
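The timing in this comment can be reproduced with a rough model. The sketch below is not Nacos code: the class and method names are hypothetical, it assumes 10 target nodes, and it charges one full 2-second interval per target after the 5-second start delay, ignoring network latency.

```java
class ReportDelaySketch {

    /** Serial reporting: one target per execution, one interval per target. */
    static long serialSeconds(long startDelaySeconds, long intervalSeconds, int targets) {
        return startDelaySeconds + intervalSeconds * targets;
    }

    /** Full notification: every target is contacted in the first execution. */
    static long broadcastSeconds(long startDelaySeconds) {
        return startDelaySeconds;
    }

    public static void main(String[] args) {
        long startDelay = 5; // seconds before the task first runs
        long interval = 2;   // seconds between two consecutive reports
        int targets = 10;    // assumed number of nodes to notify

        System.out.println(serialSeconds(startDelay, interval, targets)); // 25
        System.out.println(broadcastSeconds(startDelay));                 // 5
    }
}
```

Under these assumptions the serial approach grows linearly with the number of nodes, while notifying all nodes in each execution stays at the start delay.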

[ISSUE alibaba#4925] correct the member's state when the cluster info…

6dad63d

…rmation changed

KomachiSion requested changes Feb 25, 2021

KomachiSion requested review from chuntaojun and shiyiyue1102 February 26, 2021 01:59

chuntaojun approved these changes Feb 26, 2021

KomachiSion approved these changes Mar 1, 2021

KomachiSion added the kind/bug Category issues or prs related to bug. label Mar 1, 2021

KomachiSion added this to the 1.4.2 milestone Mar 1, 2021

KomachiSion merged commit 4191286 into alibaba:develop Mar 1, 2021

zrlw added a commit to zrlw/nacos that referenced this pull request Dec 29, 2021

fix problem caused by alibaba#4948

e28ec7b

zrlw mentioned this pull request Feb 21, 2022

[1.x] Patch v1.x member manager #7806

Closed
