HDDS-4568. Add SCMContext to SCM HA by GlenGeng-awx · Pull Request #1737 · apache/ozone

GlenGeng-awx · 2020-12-25T03:08:22Z

What changes were proposed in this pull request?

We want to provide SCMContext, which would be a single source of truth for some key information that is shared across all components within SCM.

SCMContext holds two kinds of key information:

RaftServer related info: isLeader, term.
SafeMode related info: inSafeMode, preCheckComplete.
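
As a rough illustration of the shape described above, here is a minimal sketch of a context object that holds the four fields behind a read/write lock. The class name `ScmContextSketch`, its constructor, and all method names are assumptions made for this example; they are not taken from the patch.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: names and signatures are illustrative, not the PR's code.
final class ScmContextSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  // RaftServer related info.
  private boolean isLeader;
  private long term;

  // SafeMode related info.
  private boolean inSafeMode;
  private boolean preCheckComplete;

  ScmContextSketch(boolean isLeader, long term,
                   boolean inSafeMode, boolean preCheckComplete) {
    this.isLeader = isLeader;
    this.term = term;
    this.inSafeMode = inSafeMode;
    this.preCheckComplete = preCheckComplete;
  }

  // Updates leadership and term together so readers never see a mixed state.
  void updateLeaderAndTerm(boolean leader, long newTerm) {
    lock.writeLock().lock();
    try {
      isLeader = leader;
      term = newTerm;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Updates both safe mode flags together.
  void updateSafeModeStatus(boolean safeMode, boolean preCheck) {
    lock.writeLock().lock();
    try {
      inSafeMode = safeMode;
      preCheckComplete = preCheck;
    } finally {
      lock.writeLock().unlock();
    }
  }

  boolean isLeader() {
    lock.readLock().lock();
    try {
      return isLeader;
    } finally {
      lock.readLock().unlock();
    }
  }

  long getTerm() {
    lock.readLock().lock();
    try {
      return term;
    } finally {
      lock.readLock().unlock();
    }
  }

  boolean isInSafeMode() {
    lock.readLock().lock();
    try {
      return inSafeMode;
    } finally {
      lock.readLock().unlock();
    }
  }

  boolean isPreCheckComplete() {
    lock.readLock().lock();
    try {
      return preCheckComplete;
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    ScmContextSketch ctx = new ScmContextSketch(false, 0L, true, false);
    ctx.updateLeaderAndTerm(true, 1L);
    ctx.updateSafeModeStatus(false, true);
    System.out.println("leader=" + ctx.isLeader() + ", term=" + ctx.getTerm()
        + ", inSafeMode=" + ctx.isInSafeMode());
  }
}
```

Updating each pair of fields under one write lock is a design choice of this sketch, so that a reader cannot observe a new leader flag with a stale term.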

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4568

How was this patch tested?

CI

GlenGeng-awx · 2020-12-25T05:08:13Z

cc @nandakumar131 @ChenSammi @amaliujia

amaliujia

I think some of the changes related to the new Ratis API are already in the master branch. It might be better to merge master into the HDDS-2823 branch and then rebase this PR to drop those changes (e.g. the changes in XceiverServerRatis).

amaliujia · 2020-12-25T22:45:39Z

...java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java

This part of the change is already in master (#1728).

I see, so this comes from the commit cherry-picked from master.

amaliujia · 2020-12-25T22:53:41Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

Would it make sense to use SCMContext to get the current term?

We use SCMContext to encapsulate Raft-related info, so that components in SCM don't need to hold a reference to SCMHAManager or SCMRatisServer.

For non-HA mode or unit tests, we just need an empty SCMContext instead of a mocked SCMHAManager or a mocked SCMRatisServer.
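
To make the testing argument concrete, here is a small sketch of a component that depends only on a context object. The names `Context`, `empty()`, and `BackgroundTask` are hypothetical, and the defaults chosen for the empty context (leader, term 0) are an assumption of this example rather than something stated in the PR.

```java
// Hypothetical sketch: a component that needs only a context, not the HA manager.
final class EmptyContextSketch {

  // Stand-in for SCMContext, reduced to the Raft-related fields.
  static final class Context {
    private final boolean leader;
    private final long term;

    Context(boolean leader, long term) {
      this.leader = leader;
      this.term = term;
    }

    // Assumed defaults for non-HA mode and unit tests: act as leader at term 0.
    static Context empty() {
      return new Context(true, 0L);
    }

    boolean isLeader() {
      return leader;
    }

    long getTerm() {
      return term;
    }
  }

  // A background task that does work only while this SCM is the leader.
  static final class BackgroundTask {
    private final Context context;
    private int runs;

    BackgroundTask(Context context) {
      this.context = context;
    }

    // Returns true if the task ran, false if it was skipped on a follower.
    boolean runOnce() {
      if (!context.isLeader()) {
        return false;
      }
      runs++;
      return true;
    }

    int getRuns() {
      return runs;
    }
  }

  public static void main(String[] args) {
    BackgroundTask inTest = new BackgroundTask(Context.empty());
    BackgroundTask onFollower = new BackgroundTask(new Context(false, 5L));
    System.out.println("empty context ran: " + inTest.runOnce());
    System.out.println("follower ran: " + onFollower.runOnce());
  }
}
```

A unit test can then construct the component with `Context.empty()` directly, without any mocking framework.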

linyiqun

@GlenGeng, I left one minor comment below.

I see that we now use the term index value for the SCM leader check, and that this is used across SCM internal components. Will this also cover client request behaviour?

For example, suppose a client is configured with a single SCM address that belongs to a follower, and it sends a request. Will that update SCM metadata such as pipelines and containers? I see there was an isLeader check before, but it was removed in HDDS-4551. Do you know the context for this?

linyiqun · 2021-01-02T03:00:31Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/SCMBlockDeletingService.java

Could we also catch the NotLeaderException here, as other places do? There are some other places where we don't catch this exception, so it is swallowed by the IOException handler.

    } catch (NotLeaderException nle) {
    } catch (IOException e) {
      // We may tolerate a number of failures for sometime
      // but if it continues to fail, at some point we need to raise
      // an exception and probably fail the SCM ? At present, it simply
      // continues to retry the scanning.
      LOG.error("Failed to get block deletion transactions from delTX log", e);
      return EmptyTaskResult.newResult();
    }

Sure, I will fix them in the next patch.
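
For readers following along, here is a self-contained sketch of the pattern being requested: handle the not-leader case in its own catch clause ahead of the generic IOException handler. The `NotLeaderException` class below is a local stand-in that extends IOException, and `TransactionSource`, `Outcome`, and `runOnce` are hypothetical names invented for this example.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of catching a not-leader error before the generic handler.
final class NotLeaderCatchSketch {

  // Local stand-in for the Raft not-leader exception; assumed to be an IOException.
  static class NotLeaderException extends IOException {
    NotLeaderException(String message) {
      super(message);
    }
  }

  // Hypothetical source of block deletion transactions.
  interface TransactionSource {
    List<String> getTransactions() throws IOException;
  }

  enum Outcome { PROCESSED, SKIPPED_NOT_LEADER, FAILED }

  static Outcome runOnce(TransactionSource source) {
    try {
      List<String> transactions = source.getTransactions();
      System.out.println("Processing " + transactions.size() + " transactions");
      return Outcome.PROCESSED;
    } catch (NotLeaderException nle) {
      // Expected on a follower: skip quietly instead of logging an error.
      return Outcome.SKIPPED_NOT_LEADER;
    } catch (IOException e) {
      // Any other I/O failure is still reported as an error.
      System.err.println("Failed to get transactions: " + e.getMessage());
      return Outcome.FAILED;
    }
  }

  public static void main(String[] args) {
    System.out.println(runOnce(() -> Collections.singletonList("tx-1")));
    System.out.println(runOnce(() -> {
      throw new NotLeaderException("this SCM is a follower");
    }));
    System.out.println(runOnce(() -> {
      throw new IOException("disk error");
    }));
  }
}
```

Because the stand-in is a subclass of IOException, its catch clause has to come first; Java rejects the reverse order at compile time.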

GlenGeng-awx · 2021-01-04T04:03:39Z

Hey Yiqun, thanks for the review!

For example, suppose a client is configured with a single SCM address that belongs to a follower, and it sends a request.

For now, the client needs to know all the SCM instances that take part in the SCM Raft cluster. If the client sends a request to a follower SCM, it gets a NotLeaderException and fails over to the next SCM instance.
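
The failover behaviour described here can be sketched roughly as follows. This is not the Ozone client code: `ScmEndpoint`, `callWithFailover`, and the local `NotLeaderException` stand-in are all hypothetical names, and a real client would likely also add retry limits and backoff.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a client that fails over across the configured SCMs.
final class FailoverSketch {

  // Local stand-in for the Raft not-leader exception.
  static class NotLeaderException extends IOException {
    NotLeaderException(String message) {
      super(message);
    }
  }

  // Hypothetical handle to one SCM instance.
  interface ScmEndpoint {
    String call(String request) throws IOException;
  }

  // Tries each known SCM in order and moves on when one reports it is not the leader.
  static String callWithFailover(List<ScmEndpoint> endpoints, String request)
      throws IOException {
    IOException last = null;
    for (ScmEndpoint endpoint : endpoints) {
      try {
        return endpoint.call(request);
      } catch (NotLeaderException nle) {
        last = nle; // follower: remember the error and try the next instance
      }
    }
    throw new IOException(
        "No leader found among " + endpoints.size() + " SCM instances", last);
  }

  public static void main(String[] args) throws IOException {
    List<ScmEndpoint> endpoints = new ArrayList<>();
    endpoints.add(request -> {
      throw new NotLeaderException("scm1 is a follower");
    });
    endpoints.add(request -> "scm2 handled " + request);
    System.out.println(callWithFailover(endpoints, "allocateContainer"));
  }
}
```

This also shows why the client needs the full list of SCM addresses: with a single follower address configured, the loop has nowhere to fail over to.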

Will it update the SCM metadata, like pipeline, containers?

We removed the leader check in HDDS-4551 because every metadata update that is saved to RocksDB goes through Ratis as a RaftClientRequest. If the underlying Raft server is not the leader, the returned RaftClientReply carries a NotLeaderException.

Please check SCMHAInvocationHandler to see how we implement this. For now, the following operations go through Ratis: add container, remove container, update container state, add pipeline, remove pipeline, and update pipeline state.
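
As a rough illustration of the general idea only, the sketch below uses a `java.lang.reflect.Proxy` to intercept calls and reject annotated write operations when the local node is not the leader. It is not taken from SCMHAInvocationHandler: the `Replicated` annotation, the `ContainerStore` interface, and the `BooleanSupplier` standing in for the Raft layer are all assumptions of this example, and no real Ratis request is submitted.

```java
import java.io.IOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of routing annotated write calls through a leader check.
final class ReplicatedProxySketch {

  // Marks methods that would be replicated through Raft in a real system.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  @interface Replicated { }

  // Local stand-in for the Raft not-leader exception.
  static class NotLeaderException extends IOException {
    NotLeaderException(String message) {
      super(message);
    }
  }

  interface ContainerStore {
    @Replicated
    void addContainer(String id) throws IOException;

    boolean hasContainer(String id) throws IOException; // local read
  }

  static final class InMemoryStore implements ContainerStore {
    private final List<String> ids = new ArrayList<>();

    @Override
    public void addContainer(String id) {
      ids.add(id);
    }

    @Override
    public boolean hasContainer(String id) {
      return ids.contains(id);
    }
  }

  // Wraps the store so that replicated methods fail fast on a non-leader.
  static ContainerStore wrap(ContainerStore local, BooleanSupplier raftIsLeader) {
    InvocationHandler handler = (proxy, method, args) -> {
      if (method.isAnnotationPresent(Replicated.class)
          && !raftIsLeader.getAsBoolean()) {
        throw new NotLeaderException("not leader for " + method.getName());
      }
      try {
        // A real handler would submit a Raft request here; this sketch applies locally.
        return method.invoke(local, args);
      } catch (InvocationTargetException e) {
        throw e.getCause();
      }
    };
    return (ContainerStore) Proxy.newProxyInstance(
        ContainerStore.class.getClassLoader(),
        new Class<?>[] {ContainerStore.class},
        handler);
  }

  public static void main(String[] args) throws IOException {
    ContainerStore onLeader = wrap(new InMemoryStore(), () -> true);
    onLeader.addContainer("c1");
    System.out.println("leader has c1: " + onLeader.hasContainer("c1"));

    ContainerStore onFollower = wrap(new InMemoryStore(), () -> false);
    try {
      onFollower.addContainer("c2");
    } catch (NotLeaderException e) {
      System.out.println("follower rejected write: " + e.getMessage());
    }
  }
}
```

With this layout the callers of `ContainerStore` need no explicit leader check of their own, which matches the reasoning given above for removing it.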

If needed, we can schedule a Zoom meeting to discuss the current SCM HA design.

linyiqun · 2021-01-04T06:09:06Z

@GlenGeng, thanks for the detailed explanation!

amaliujia · 2021-01-12T07:39:33Z

Thanks, Glen, for rebasing the PR! I will try to give it another pass.

hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/ha/TestSCMContext.java

amaliujia

Overall LGTM. Left a comment.

runzhiwang

LGTM

bshashikant · 2021-01-15T09:03:51Z

Thanks @GlenGeng for the detailed explanation. The changes look good to me except for one minor comment. Thanks for the effort.

GlenGeng-awx changed the title ~~HDDS-4568. add SCMContext to SCM HA~~ HDDS-4568. Add SCMContext to SCM HA Dec 25, 2020

amaliujia reviewed Dec 25, 2020

linyiqun reviewed Jan 2, 2021

GlenGeng-awx force-pushed the HDDS-4568-Latest branch from ee5e166 to 369efc5 Compare January 6, 2021 11:54

Glen Geng added 3 commits January 12, 2021 15:19

HDDS-4568: SCMContext Part 1: Raft Related Info

d57a79a

HDDS-4568: SCMContext Part 2: SafeMode Related Info

b91c678

HDDS-4568: add UT

4ea79af

GlenGeng-awx force-pushed the HDDS-4568-Latest branch from 369efc5 to 4ea79af Compare January 12, 2021 07:26

HDDS-4568: Fix comments and failed tests.

b38236c

GlenGeng-awx force-pushed the HDDS-4568-Latest branch from b358bfc to b38236c Compare January 12, 2021 10:40

amaliujia reviewed Jan 13, 2021

hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/ha/TestSCMContext.java

amaliujia reviewed Jan 13, 2021

amaliujia mentioned this pull request Jan 13, 2021

HDDS-3205. DeleteBlock via Ratis in SCM HA #1780

Merged

runzhiwang approved these changes Jan 14, 2021

bshashikant approved these changes Jan 18, 2021

ChenSammi merged commit bb9c68f into apache:HDDS-2823 Jan 19, 2021
