[ISSUE #4925] correct the member's state when the cluster information… #4948

Merged
1 commit merged on Mar 1, 2021

Conversation

MajorHe1
Collaborator

@MajorHe1 MajorHe1 commented Feb 24, 2021

Please do not create a Pull Request without creating an issue first.

What is the purpose of the change

Try to fix issue #4925.

Brief changelog

  1. Set the member's state to DOWN when the member joins the server list for the first time, in case the corresponding nacos-server has not been started yet.
  2. Keep the member's state consistent with the value in the server list; otherwise problems like issue #4925 occur (nacos 1.4.1: when a node in the cluster goes offline, the other nodes cannot correctly and stably detect its state).
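
The two changelog items can be sketched as follows. This is a minimal illustration, not the actual Nacos code: `NodeState`, `Member`, and `MemberStateSketch` are simplified stand-ins, and `markState` and `stateOf` are hypothetical helpers added only so the behaviour can be observed.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the Nacos types; not the real classes.
enum NodeState { UP, DOWN }

class Member {
    private final String address;
    // The discussion notes that freshly loaded members default to UP.
    private NodeState state = NodeState.UP;

    Member(String address) { this.address = address; }
    String getAddress() { return address; }
    NodeState getState() { return state; }
    void setState(NodeState state) { this.state = state; }
}

class MemberStateSketch {
    // address -> member as currently tracked in this node's memory
    private final Map<String, Member> serverList = new HashMap<>();

    /** Applies a refreshed member list; returns true if the set of addresses changed. */
    synchronized boolean memberChange(Collection<Member> members) {
        Map<String, Member> updated = new HashMap<>();
        for (Member member : members) {
            Member known = serverList.get(member.getAddress());
            if (known == null) {
                // Item 1: a member seen for the first time starts as DOWN.
                member.setState(NodeState.DOWN);
            } else {
                // Item 2: a known member keeps the state already held in memory.
                member.setState(known.getState());
            }
            updated.put(member.getAddress(), member);
        }
        boolean changed = !updated.keySet().equals(serverList.keySet());
        serverList.clear();
        serverList.putAll(updated);
        return changed;
    }

    // Hypothetical helper: stands in for whatever later confirms or revokes health.
    synchronized void markState(String address, NodeState state) {
        Member m = serverList.get(address);
        if (m != null) {
            m.setState(state);
        }
    }

    // Hypothetical helper: returns null for an unknown address.
    synchronized NodeState stateOf(String address) {
        Member m = serverList.get(address);
        return m == null ? null : m.getState();
    }
}
```

Under these assumptions, reloading the same address list a second time leaves every member's state untouched, and only a separate health signal (here `markState`) can move a member between UP and DOWN.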

Verifying this change

XXXX

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
  • Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests to verify your logic correction; prefer mocking where cross-module dependencies exist. If a new feature or significant change is committed, please remember to add an integration test in the test module.
  • Run mvn -B clean package apache-rat:check findbugs:findbugs -Dmaven.test.skip=true to make sure basic checks pass. Run mvn clean install -DskipITs to make sure unit-test pass. Run mvn clean test-compile failsafe:integration-test to make sure integration-test pass.

@CLAassistant

CLAassistant commented Feb 24, 2021

CLA assistant check
All committers have signed the CLA.

member.setState(NodeState.DOWN);
} else {
//fix issue # 4925
member.setState(serverList.get(address).getState());

Collaborator

I don't think this is the right fix.

This method updates members, so I think we should use the state of the input member.

Collaborator Author

I don't think we can use the input member's state in this method:
synchronized boolean memberChange(Collection<Member> members) {}

This method is only called in three places:
1.
image

In this case, the member state is set to UP by default, whether the member comes from cluster.conf or from the address-server.
image

If a member has already been judged DOWN, this method resets it to UP. That is how issue 4925 occurred.

image

In this case, a joined member is not in the serverList. We cannot be sure that the joined member's process has started, so it should be set to DOWN at first and set to UP once MemberInfoReportTask has completed.

image

This has no effect on the method memberLeave().
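
To make the reset described above concrete, here is a small sketch of the pre-fix behaviour under simplifying assumptions. `StateResetSketch`, `parse`, and `reloadUsingInputState` are hypothetical names; the only property taken from the discussion is that members loaded from cluster.conf or the address-server arrive with state UP.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StateResetSketch {
    enum NodeState { UP, DOWN }

    // address -> state held in this node's memory
    final Map<String, NodeState> serverList = new LinkedHashMap<>();

    // Hypothetical loader: every address parsed from the config comes back as UP.
    static Map<String, NodeState> parse(List<String> addresses) {
        Map<String, NodeState> parsed = new LinkedHashMap<>();
        for (String address : addresses) {
            parsed.put(address, NodeState.UP);
        }
        return parsed;
    }

    // Pre-fix behaviour as described in the comment: trust the state carried by the input.
    void reloadUsingInputState(List<String> addresses) {
        serverList.clear();
        serverList.putAll(parse(addresses));
    }

    public static void main(String[] args) {
        StateResetSketch node = new StateResetSketch();
        List<String> conf = Arrays.asList("10.0.0.1:8848", "10.0.0.2:8848");

        node.reloadUsingInputState(conf);
        // Some health check has since judged the second node DOWN.
        node.serverList.put("10.0.0.2:8848", NodeState.DOWN);

        // The cluster information is refreshed with the same addresses.
        node.reloadUsingInputState(conf);
        System.out.println(node.serverList.get("10.0.0.2:8848")); // prints UP
    }
}
```

The refresh carries no new information about the second node, yet its DOWN verdict is lost, which is the flapping reported in issue 4925.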

Collaborator

I'm not sure whether this will affect other report or state-update workflows. I think we need to discuss it.

@chuntaojun @shiyiyue1102 What do you think?

Collaborator Author

It's not just the state value: I think all of a member's properties should be based on the values in the serverList.
Otherwise, the data returned by the /nacos/v1/ns/operator/servers interface keeps changing (being reset), for the same reason as issue 4925.

Collaborator

But some other Member attributes should be modified by other members' reports.
So I think the final value should be updated from the inputs.

Collaborator

This is a temporary plan. Eventually we should sort out how remote data updates and local data updates are handled in order to solve this problem.

Collaborator Author

@MajorHe1 MajorHe1 Mar 1, 2021

> But some other Member attributes should be modified by other members' reports.
> So I think the final value should be updated from the inputs.

There is no conflict with the final value being updated from the inputs, because this method
memberChange(Collection<Member> members)
is not called when nacos-server receives other members' reports.
It seems that this method is only called when the cluster information is updated, so I think a member's attribute values should be based on the values held in nacos-server memory.

@KomachiSion KomachiSion added the kind/bug label Mar 1, 2021
@KomachiSion KomachiSion added this to the 1.4.2 milestone Mar 1, 2021
@KomachiSion KomachiSion merged commit 4191286 into alibaba:develop Mar 1, 2021
@zrlw
Contributor

zrlw commented Dec 29, 2021

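KomachiSion's concern was justified. We ran into a problem caused by the #4948 change:
A nacos server node that our operations team needed to restart did not forward heartbeats to the responsible nodes for more than 20 seconds after startup. We later found the cause: #4948 changed the state of the other nodes from the default UP to DOWN, and going from DOWN to UP requires the current node to receive health-check reports from all the other nodes, while the other nodes only send those reports after the restarted node has first sent its own health report to notify them.
The root cause is the MemberInfoReportTask. Not only does it start after a 5-second delay once the node has started, it also runs serially: each run sends a single report to one target node, with a 2-second interval between sends. With 10 nodes, 25 seconds have passed by the time all of them have been notified. During that time the restarted node oversteps its role and accepts heartbeats that other nodes are responsible for receiving, and as a result the other nodes take the instances whose heartbeats have expired offline.
We changed MemberInfoReportTask so that every run notifies all other nodes, to avoid state inconsistency caused by slow notification when there are many nodes.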
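
The timing argument above can be checked with a rough back-of-the-envelope model. This is only a sketch: `ReportTimingSketch` and its methods are hypothetical, network latency is ignored, and charging one full interval per target is an assumption chosen because it reproduces the 25-second figure quoted in the comment.

```java
class ReportTimingSketch {
    // Serial reporting: wait for the initial delay, then notify one target per interval.
    static long serialNotifySeconds(int targets, long initialDelaySeconds, long intervalSeconds) {
        return initialDelaySeconds + (long) targets * intervalSeconds;
    }

    // Full notification on every run: all targets are reached on the first run.
    static long broadcastNotifySeconds(long initialDelaySeconds) {
        return initialDelaySeconds;
    }

    public static void main(String[] args) {
        // 10 targets, 5-second startup delay, 2-second interval between sends.
        System.out.println(serialNotifySeconds(10, 5, 2)); // prints 25
        System.out.println(broadcastNotifySeconds(5));     // prints 5
    }
}
```

In this model the serial delay grows linearly with cluster size, while notifying every node on each run keeps it at the startup delay regardless of how many nodes there are.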

zrlw added a commit to zrlw/nacos that referenced this pull request Dec 29, 2021
Labels
kind/bug Category issues or prs related to bug.

5 participants