
HDDS-5471. Enable SCM HA in ozone-ha Kubernetes example #2447

Merged
merged 12 commits into apache:master on Jul 11, 2023

Conversation

@JacksonYao287 (Contributor)

What changes were proposed in this pull request?

Generate correct YAMLs for running SCM HA in k8s.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-5471


How was this patch tested?

k8s test

@JacksonYao287 (Contributor Author) commented Jul 21, 2021

@adoroszlai @elek please take a look! The new YAML files are generated by Flekszible from the updated Flekszible descriptor file.

args:
- ozone
- scm
- --bootstrap
@ChenSammi (Contributor) commented Aug 2, 2021

Do all three SCM instances need to start with --bootstrap?

@JacksonYao287 (Contributor Author)

No, the first SCM instance is started with --init, and the other two are started with --bootstrap. This YAML file is used in the k8s cluster of my test environment and works well.

@JacksonYao287 (Contributor Author)

@ChenSammi the logic here is as follows.
We want to create 3 SCM replicas in this YAML file. For each SCM instance, two init containers are executed first, and then the main container is started. It is as if we had 3 individual SCMs, and for each SCM we run ozone scm --init, ozone scm --bootstrap, and ozone scm sequentially, one by one.

So for the first (primordial) SCM, ozone scm --init succeeds, the next ozone scm --bootstrap fails with the message "SCM bootstrap command can only be executed in non-Primordial SCM", and then it is started by ozone scm as the primordial SCM.

For the second and third SCMs, ozone scm --init fails with the message "SCM init command can only be executed in Primordial SCM", ozone scm --bootstrap succeeds, and then each is started by ozone scm as a non-primordial SCM.

This way, we get the SCM HA cluster!

I am not sure whether this is clear to you; please let me know if the description is not clear. Thanks!
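For illustration, a minimal sketch of the initContainers sequence described above. The bootstrap block mirrors the snippet quoted later in this thread; the init block is an assumption based on this description, not the exact file contents:

initContainers:
# Runs first; per the explanation above, succeeds only on the primordial SCM
# (the container name "init" is an assumption)
- name: init
  image: '@docker.image@'
  args:
  - ozone
  - scm
  - --init
  envFrom:
  - configMapRef:
      name: config
  volumeMounts:
  - name: data
    mountPath: /data
# Runs second; per the explanation above, succeeds only on non-primordial SCMs
- name: bootstrap
  image: '@docker.image@'
  args:
  - ozone
  - scm
  - --bootstrap
  envFrom:
  - configMapRef:
      name: config
  volumeMounts:
  - name: data
    mountPath: /data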

@JacksonYao287 (Contributor Author)

@adoroszlai could you take a look at this? Thanks!

@adoroszlai (Contributor)

Thanks @JacksonYao287 for the patch. I tried adding test.sh to the affected ozone-ha example (copied from the ozone example) to verify the change. However, additional SCMs failed to start with the following log.

2021-12-27 13:49:00 ERROR StorageContainerManager:320 - Please make sure you have run 'ozone scm --init' command to generate all the required metadata to /data/metadata/scm or make sure you have run 'ozone scm --bootstrap' cmd to add the SCM to existing SCM HA group.
2021-12-27 13:49:00 ERROR StorageContainerManagerStarter:73 - SCM start failed with exception
org.apache.hadoop.hdds.scm.exceptions.SCMException: SCM not initialized due to storage config failure.
	at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:321)
	at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:491)
	at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:503)
	at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:168)
	at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:142)
	at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:71)
	at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:47)

But bootstrap was reported as successful:

STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG:   host = scm-2/10.42.0.44
STARTUP_MSG:   args = [--bootstrap]
...
2021-12-27 13:45:53 INFO  StorageContainerManager:987 - SCM BootStrap  is successful for ClusterID CID-b449e5a7-66ff-44eb-8a6e-bbf202cbf05e, SCMID 5053d1c5-7b10-4093-b19f-dcf28410a21f
2021-12-27 13:45:53 INFO  StorageContainerManager:989 - Primary SCM Node ID ba47cc25-6474-4e64-a32e-baa5126f85fd

Can you please check?

Here's the run from my fork:
https://github.com/adoroszlai/hadoop-ozone/runs/4642413726

@JacksonYao287 (Contributor Author)

Thanks @adoroszlai for the review and verification.

However, additional SCMs failed to start with the following log

I have used this YAML file to deploy an SCM-HA-enabled k8s cluster and run many chaos tests on it; it has always worked well.
I'll look into why it fails here.

@adoroszlai (Contributor)

/pending

@github-actions (bot) left a comment


Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

Please note that the PR will be closed after 21 days of inactivity from now. (But can be re-opened anytime later...)

/pending

@adoroszlai (Contributor)

@sokui do you have any idea why SCM HA might fail to start in Kubernetes?

@sokui (Contributor) commented May 9, 2022

@adoroszlai Your pasted link seems to have expired: https://github.com/adoroszlai/hadoop-ozone/runs/4642413726. Did you run the test.sh in this PR against the old ozone examples? Or did you merge the whole PR to your local branch and run it? Please clarify. This PR looks good to me, though.

@adoroszlai (Contributor)

@sokui Thanks for the quick response.

Did you run the test.sh in this PR against the old ozone examples? Or did you merge the whole PR to your local branch and run it?

I checked out the PR branch locally, added test.sh to the ozone-ha kubernetes example, and submitted it to my fork. (Later I also temporarily disabled other tests for faster feedback.)

Your pasted link seems to have expired

After my comment about the failed run from my fork, @JacksonYao287 also added ozone-ha/test.sh to this PR, and the logs are still available.

These will probably also expire in about a week. Then we can trigger another run if still needed.

This PR looks good to me, though.

Yes, it looks good to me, too. But we prefer tests to verify, and unfortunately the test is failing. ;)

@github-actions (bot)

Thank you very much for the patch. I am closing this PR temporarily as there was no activity recently and it is waiting for a response from its author.

It doesn't mean that this PR is not important or ignored: feel free to reopen the PR at any time.

It only means that attention of committers is not required. We prefer to keep the review queue clean. This ensures PRs in need of review are more visible, which results in faster feedback for all PRs.

If you need ANY help to finish this PR, please contact the community on the mailing list or the slack channel.

@github-actions bot closed this Jul 22, 2022
@adoroszlai removed the pending label Jul 4, 2023
@adoroszlai reopened this Jul 4, 2023
@adoroszlai changed the title from "HDDS-5471. generate correct yamls for running SCM-HA in k8s" to "HDDS-5471. Enable SCM HA in ozone-ha Kubernetes example" Jul 4, 2023
@adoroszlai (Contributor)

/ready

@github-actions bot dismissed their stale review July 4, 2023 12:46

Blocking review request is removed.

@adoroszlai (Contributor)

@JacksonYao287 @sokui I've finally found the cause of the test failure.

The patch includes a volume mount for the bootstrap initContainer:

- name: bootstrap
  image: '@docker.image@'
  args:
  - ozone
  - scm
  - --bootstrap
  envFrom:
  - configMapRef:
      name: config
  volumeMounts:
  - name: data
    mountPath: /data

However, each run of the kubernetes check regenerates the Kubernetes YAML files based on Flekszible and k8s/definitions. So the test does not use our version of scm-statefulset.yaml, but rather the regenerated file, which does not have the volume mount, due to the order of transformations:

- type: ozone/persistence
- type: ozone/scm-ha

The latter adds the bootstrap container, but by that time the volumes have already been added, so files written by bootstrap were not present in the real SCM container.

This can be fixed simply by first applying the scm-ha transformation, and only then persistence (see 873d292).
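Spelled out, the corrected ordering in the Flekszible descriptor is just the two transformations swapped (a sketch of the relevant lines):

- type: ozone/scm-ha
- type: ozone/persistence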

@adoroszlai requested a review from smengcl July 4, 2023 14:30
@@ -20,6 +20,7 @@ import:
 transformations:
 - type: Image
   image: "@docker.image@"
+- type: ozone/scm-ha
Contributor

What does this line actually accomplish? I removed it and ran Flekszible again, and it didn't seem to make a difference, in the sense that the file changes created by Flekszible appeared to be the same without it.

@adoroszlai (Contributor) commented Jul 8, 2023

This line applies the transformations described in
hadoop-ozone/dist/src/main/k8s/definitions/ozone/definitions/scm-ha.yaml.

Removing it and regenerating the ozone-ha example results in this difference:

diff --git hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml
index 61555e1eb5..b3acc6f1d2 100644
--- hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml
+++ hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml
@@ -22,7 +22,10 @@ data:
   OZONE-SITE.XML_hdds.datanode.dir: /data/storage
   OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
   OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
+  OZONE-SITE.XML_ozone.scm.block.client.address: scm-0.scm
   OZONE-SITE.XML_ozone.om.address: om-0.om
+  OZONE-SITE.XML_ozone.scm.client.address: scm-0.scm
+  OZONE-SITE.XML_ozone.scm.names: scm-0.scm
   OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "3"
   OZONE-SITE.XML_ozone.datanode.pipeline.limit: "1"
   OZONE-SITE.XML_dfs.datanode.use.datanode.hostname: "true"
@@ -31,10 +34,3 @@ data:
   LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
   LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd
     HH:mm:ss} %-5p %c{1}:%L - %m%n'
-  OZONE-SITE.XML_ozone.scm.service.ids: scmservice
-  OZONE-SITE.XML_ozone.scm.nodes.scmservice: scm0,scm1,scm2
-  OZONE-SITE.XML_ozone.scm.address.scmservice.scm0: scm-0.scm.default.svc.cluster.local
-  OZONE-SITE.XML_ozone.scm.address.scmservice.scm1: scm-1.scm.default.svc.cluster.local
-  OZONE-SITE.XML_ozone.scm.address.scmservice.scm2: scm-2.scm.default.svc.cluster.local
-  OZONE-SITE.XML_ozone.scm.ratis.enable: "true"
-  OZONE-SITE.XML_ozone.scm.primordial.node.id: scm0
diff --git hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml
index a0be3d58ed..0cbca68524 100644
--- hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml
+++ hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml
@@ -26,7 +26,7 @@ spec:
       app: ozone
       component: scm
   serviceName: scm
-  replicas: 3
+  replicas: 1
   template:
     metadata:
       labels:
@@ -52,18 +52,6 @@ spec:
         volumeMounts:
         - name: data
           mountPath: /data
-      - name: bootstrap
-        image: '@docker.image@'
-        args:
-        - ozone
-        - scm
-        - --bootstrap
-        envFrom:
-        - configMapRef:
-            name: config
-        volumeMounts:
-        - name: data
-          mountPath: /data
       containers:
       - name: scm
         image: '@docker.image@'

Contributor

That helps, thanks! Not sure why I didn't see that when I tried.


pre_run_setup

execute_robot_test scm-0 smoketest/basic/basic.robot
Contributor

Should we run any more tests?

Contributor

Sure, we can add more tests later. For now, most environments perform only this basic/basic test.
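(If more coverage is wanted later, further suites could be appended to test.sh in the same style; the path below is a hypothetical placeholder, not an existing suite:)

# hypothetical additional suite; replace with a real robot file
execute_robot_test scm-0 smoketest/<suite>/<test>.robot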

@@ -27,7 +27,6 @@ spec:
       component: datanode
   serviceName: datanode
   replicas: 3
-  podManagementPolicy: Parallel
Contributor

Why was this removed?

Contributor

Nice catch.

podManagementPolicy and imagePullPolicy were added manually in 2ffbfff. Such manual changes are confusing: an example with manual edits does not match what we test. We should add these in the Flekszible definitions instead and regenerate the examples.
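For reference, a sketch of how such a setting might be expressed as a Flekszible Add transformation in the definitions (the trigger and path values here are assumptions, not taken from the actual definition files):

- type: Add
  trigger:
    metadata:
      name: datanode   # assumed trigger; targets the datanode StatefulSet
  path:
  - spec
  value:
    podManagementPolicy: Parallel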

Contributor

On second thought, I'm not sure we should add these in the examples. In our test environment it does not make a difference. Users may add it in their own environment as needed.

@@ -53,7 +52,6 @@ spec:
       containers:
       - name: datanode
         image: '@docker.image@'
-        imagePullPolicy: Always
Contributor

Why was it removed?

@GeorgeJahad (Contributor) left a comment

lgtm

@adoroszlai merged commit 54b0c31 into apache:master Jul 11, 2023
11 checks passed
@adoroszlai (Contributor)

Thanks @JacksonYao287 for the patch, @ChenSammi, @GeorgeJahad, @sokui for the reviews.
