-
Notifications
You must be signed in to change notification settings - Fork 474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-2291. Acceptance tests for OM HA. #11
Conversation
Thank you very much the patch @hanishakoneru . Overall I am very to happy have more HA tests with the robot framework and I would be happy to commit it (after clean builds). Personally I would prefer to use a different approach, but it's only because I may have different thinking. It may not be better or worse. The only thing what I would like to do here is the explain my view, just because this is the fan part: to understand the thinking of each other. 1. The level of the tests To run acceptance test we need to solve two problems: 1.) Create a running ozone cluster (and may restart services during the tests) Currently these two roles/levels are separated. The second one is implemented by the robot tests but the (existing) robot tests don't include any logic to start (or restart) services. The environments are mainly defined with docker-compose files and the logic to start them is defined by shell scripts (for example this is the simplest one) The two levels/roles are separated. __ 2. the flexibility __ The main advantage of this approach that you can run the tests in different environments. For example I can replace the shell script based cluster creation process with anything else.
__ 3. blockade __ Blockade based tests are slightly different. They do both 1 (cluster creation) and 2 (test + assertion). Mainly because they are more interested about the environment setup (creating cluster, shutting down nodes, etc.). They do all the cluster set up / tear down based on docker-compose and the logic is defined in python scripts. __ 4. docker + ssh __ This patch follows a different approach. Instead of using docker-compose to start/stop/restart services/nodes it installs an additional ssh daemon inside the containers to make it possible to restart the jvm process instead of the containers. (docker-compose is used to start/stop services and ssh daemons are used to restart) Usually this is not the way which is suggested to use in containerized environments. With docker usually it's easier to restart the containers and run only one process per container (and it provides better separation and easier management). __ 5. this patch __ But the previous approach (using docker-compose to start / stop instead of ssh) is not portable at all. It can't be started inside kubernetes with little effort(for example). On the other hand this patch can be used very easily in other environments as the "service restart" part of the environment management is included (with the help of ssh). Summary:
With other words: if we separate the environment creation from the test definitions where should we put the restart functionality to. You put it to the place where we have the test definition, I described a system where it can be put to the place where we have the environment creation. I think both approach is acceptable, I will commit this one after a green acceptance test run. (And we can continue the thinking about how these tests can be evolved. For example: Do we need to separate these kind of the tests and create more tests where we restart clusters? |
/retest |
@elek This is a brilliant comment. Can you please add this comment to the readme in the compose directory and to the wiki pages so that others are aware of this. It is this kind of context which is often missing when you read code. Thank you taking time out to explain the thought process. |
Thank you very much @elek for the explanation. Appreciate it.
I totally agree. We should create more tests where the cluster is restarted. Specially with OM HA, we need tests which cover restarts extensively. And when we do have more such tests which involve service restarts, we can separate them out. |
84d3739
to
1a6f179
Compare
1a6f179
to
615ece0
Compare
/retest |
The failed integration test is unrelated to this PR. |
Yes. It seems to be an independent problem. I have seen the same OzoneSecure problem in the nightly. While I am investigating that issue, It's not related to the new test. I will merge this PR soon... |
@hanishakoneru, @elek, on second look, Here is another one, where I can successfully run |
Seems to be a docker issue, can be a problem with one of the k8s node:
I have seen it earlier. It can be more stable with github actions where all the builds are executed in ephemeral vms |
# This is the 1st commit message: Initial Commit # This is the commit message apache#2: more slight changes # This is the commit message apache#3: changes++ # This is the commit message apache#4: getExecutorService Changes # This is the commit message apache#5: applyTransaction() Changes # This is the commit message apache#6: changes++ # This is the commit message apache#7: TestOzoneManagerLock changes # This is the commit message apache#8: add changes # This is the commit message apache#9: add more minor changes # This is the commit message apache#10: add config to ozone-default.xml # This is the commit message apache#11: minor changes # This is the commit message apache#12: change modulo logic # This is the commit message apache#13: changes # This is the commit message apache#14: changes++ # This is the commit message apache#15: add changes++ # This is the commit message apache#16: minor changes # This is the commit message apache#17: Changes (to be reverted) # This is the commit message apache#18: Changes 09/05
Add robot tests to test OM HA functionality.
This patch is dependent on HDDS-2240 (#9)
In this robot test, we stop/ start OM from inside the robot test. And the docker container is started with a long running process (sleep) so that om stop does not result in container exit.
We are still discussing the right approach to go about this. This patch is just posted as a POC to help with the discussion.
cc. @elek