HDDS-5471. Enable SCM HA in ozone-ha Kubernetes example #2447
Conversation
@adoroszlai @elek please take a look! The new yaml file is generated by Flekszible with the updated Flekszible file.
args:
  - ozone
  - scm
  - --bootstrap
Do all three scm instances need to start with --bootstrap?
No, the first SCM instance will be started with --init, and the other two will be started with --bootstrap. This CRD file is used in the k8s cluster of my test environment and works well.
@ChenSammi the logic here is as follows.
We want to create 3 SCM replicas in this yaml file. For each SCM instance, two init containers are executed first, and then the main container is started. In effect we have 3 individual SCMs, and for each SCM we run `ozone scm --init`, `ozone scm --bootstrap` and `ozone scm` sequentially, one by one.
For the first (primordial) SCM, `ozone scm --init` succeeds, the next `ozone scm --bootstrap` fails with the message "SCM bootstrap command can only be executed in non-Primordial SCM", and then it is started by `ozone scm` as a primordial SCM.
For the second and third SCMs, `ozone scm --init` fails with the message "SCM init command can only be executed in Primordial SCM", `ozone scm --bootstrap` succeeds, and then each is started by `ozone scm` as a non-primordial SCM.
This way we get an SCM HA cluster!
I am not sure whether this is clear to you; please let me know if the description is unclear, thanks.
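The sequence above can be sketched as a pod-spec excerpt. This is illustrative only: the bootstrap entry matches the container shown in this PR's diff, while the name of the --init container and the omitted envFrom/volumeMounts are assumptions following the same pattern.

```yaml
# Illustrative excerpt of the SCM StatefulSet pod spec (container names are
# assumed; the bootstrap entry mirrors the one in this PR's diff).
initContainers:
  - name: init
    image: '@docker.image@'
    args: ["ozone", "scm", "--init"]       # succeeds only on the primordial SCM
  - name: bootstrap
    image: '@docker.image@'
    args: ["ozone", "scm", "--bootstrap"]  # succeeds only on non-primordial SCMs
containers:
  - name: scm
    image: '@docker.image@'
    args: ["ozone", "scm"]                 # the daemon then starts either way
```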
@adoroszlai could you take a look at this? thanks
Thanks @JacksonYao287 for the patch. I tried adding
But bootstrap was reported to be successful:
Can you please check? Here's the run from my fork:
Thanks @adoroszlai for the review and verification.
I have used this yaml file to deploy an SCM-HA-enabled k8s cluster and run many chaos tests on it; it has always worked well.
/pending
Marking this issue as un-mergeable as requested.
Please use the /ready comment when it is resolved.
Please note that the PR will be closed after 21 days of inactivity from now. (But it can be re-opened anytime later...)
@sokui do you have any idea why SCM HA might fail to start in kubernetes?
@adoroszlai your pasted link seems expired: https://github.com/adoroszlai/hadoop-ozone/runs/4642413726. Do you mean you ran the test.sh in this PR against the old ozone examples? Or did you merge the whole PR to your local branch and run it? Please clarify. This PR seems good to me, though.
@sokui Thanks for the quick response.
I checked out the PR branch locally, added
After my comment about the failed run from my fork, @JacksonYao287 also added
These will probably also expire in about a week. Then we can trigger another run if still needed.
Yes, it looks good to me, too. But we prefer tests to verify, and unfortunately it is failing. ;)
Thank you very much for the patch. I am closing this PR temporarily as there was no activity recently and it is waiting for a response from its author. It doesn't mean that this PR is not important or ignored: feel free to reopen the PR at any time. It only means that the attention of committers is not required. We prefer to keep the review queue clean. This ensures PRs in need of review are more visible, which results in faster feedback for all PRs. If you need ANY help to finish this PR, please contact the community on the mailing list or the slack channel.
/ready
Blocking review request is removed.
@JacksonYao287 @sokui finally I've found out the cause of the test failure. The patch includes a volume mount for the bootstrap init container: ozone/hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml, lines 55 to 66 in 3400b28.
However, each test in … ozone/hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/Flekszible, lines 23 to 24 in 3400b28 …
The latter adds … This can be fixed simply by first applying …
@@ -20,6 +20,7 @@ import:
 transformations:
   - type: Image
     image: "@docker.image@"
+  - type: ozone/scm-ha
What does this line actually accomplish? I removed it and ran Flekszible again, and it didn't seem to make a difference, in the sense that the file changes created by Flekszible appeared to be the same without it.
This line applies the transformations described in hadoop-ozone/dist/src/main/k8s/definitions/ozone/definitions/scm-ha.yaml. Removing it and regenerating the ozone-ha example results in this difference:
diff --git hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml
index 61555e1eb5..b3acc6f1d2 100644
--- hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml
+++ hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/config-configmap.yaml
@@ -22,7 +22,10 @@ data:
OZONE-SITE.XML_hdds.datanode.dir: /data/storage
OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
+ OZONE-SITE.XML_ozone.scm.block.client.address: scm-0.scm
OZONE-SITE.XML_ozone.om.address: om-0.om
+ OZONE-SITE.XML_ozone.scm.client.address: scm-0.scm
+ OZONE-SITE.XML_ozone.scm.names: scm-0.scm
OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "3"
OZONE-SITE.XML_ozone.datanode.pipeline.limit: "1"
OZONE-SITE.XML_dfs.datanode.use.datanode.hostname: "true"
@@ -31,10 +34,3 @@ data:
LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd
HH:mm:ss} %-5p %c{1}:%L - %m%n'
- OZONE-SITE.XML_ozone.scm.service.ids: scmservice
- OZONE-SITE.XML_ozone.scm.nodes.scmservice: scm0,scm1,scm2
- OZONE-SITE.XML_ozone.scm.address.scmservice.scm0: scm-0.scm.default.svc.cluster.local
- OZONE-SITE.XML_ozone.scm.address.scmservice.scm1: scm-1.scm.default.svc.cluster.local
- OZONE-SITE.XML_ozone.scm.address.scmservice.scm2: scm-2.scm.default.svc.cluster.local
- OZONE-SITE.XML_ozone.scm.ratis.enable: "true"
- OZONE-SITE.XML_ozone.scm.primordial.node.id: scm0
diff --git hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml
index a0be3d58ed..0cbca68524 100644
--- hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml
+++ hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/scm-statefulset.yaml
@@ -26,7 +26,7 @@ spec:
app: ozone
component: scm
serviceName: scm
- replicas: 3
+ replicas: 1
template:
metadata:
labels:
@@ -52,18 +52,6 @@ spec:
volumeMounts:
- name: data
mountPath: /data
- - name: bootstrap
- image: '@docker.image@'
- args:
- - ozone
- - scm
- - --bootstrap
- envFrom:
- - configMapRef:
- name: config
- volumeMounts:
- - name: data
- mountPath: /data
containers:
- name: scm
image: '@docker.image@'
That helps, thanks!
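For reference, the transformation list under discussion (reconstructed from the Flekszible diff earlier in this thread) ends up looking like:

```yaml
# hadoop-ozone/dist/src/main/k8s/examples/ozone-ha/Flekszible (excerpt).
# The ozone/scm-ha entry applies the transformations defined in
# hadoop-ozone/dist/src/main/k8s/definitions/ozone/definitions/scm-ha.yaml.
transformations:
  - type: Image
    image: "@docker.image@"
  - type: ozone/scm-ha
```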
pre_run_setup
execute_robot_test scm-0 smoketest/basic/basic.robot
Should we run any more tests?
Sure, we can add more tests later. For now, most environments perform only this basic/basic test.
@@ -27,7 +27,6 @@ spec:
   component: datanode
   serviceName: datanode
   replicas: 3
-  podManagementPolicy: Parallel
why was this removed?
Nice catch. podManagementPolicy and imagePullPolicy were added manually in 2ffbfff. Such changes are confusing: an example with manual edits does not match what we test. We need to add these in the Flekszible definitions instead, and regenerate the examples.
On second thought, I'm not sure we should add these in the examples. In our test environment they do not make a difference. Users may add them in their own environments as needed.
@@ -53,7 +52,6 @@ spec:
   containers:
     - name: datanode
       image: '@docker.image@'
-      imagePullPolicy: Always
why removed?
lgtm
Thanks @JacksonYao287 for the patch, and @ChenSammi, @GeorgeJahad, @sokui for the reviews.
What changes were proposed in this pull request?
Generate correct yamls for running SCM HA in k8s.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-5471
How was this patch tested?
k8s test