HDDS-2291. Acceptance tests for OM HA. #11

hanishakoneru · 2019-10-13T21:13:28Z

Add robot tests to test OM HA functionality.

This patch is dependent on HDDS-2240 (#9)

In this robot test, we stop/ start OM from inside the robot test. And the docker container is started with a long running process (sleep) so that om stop does not result in container exit.
We are still discussing the right approach to go about this. This patch is just posted as a POC to help with the discussion.
cc. @elek

elek · 2019-10-21T08:38:03Z

Thank you very much the patch @hanishakoneru . Overall I am very to happy have more HA tests with the robot framework and I would be happy to commit it (after clean builds).

Personally I would prefer to use a different approach, but it's only because I may have different thinking. It may not be better or worse. The only thing what I would like to do here is the explain my view, just because this is the fan part: to understand the thinking of each other.

1. The level of the tests

To run acceptance test we need to solve two problems:

1.) Create a running ozone cluster (and may restart services during the tests)
2.) Execute commands / check the results (run tests + assert)

Currently these two roles/levels are separated.

The second one is implemented by the robot tests but the (existing) robot tests don't include any logic to start (or restart) services.

The environments are mainly defined with docker-compose files and the logic to start them is defined by shell scripts (for example this is the simplest one)

The two levels/roles are separated.

__ 2. the flexibility __

The main advantage of this approach that you can run the tests in different environments. For example I can replace the shell script based cluster creation process with anything else.

I can create kubernetes clusters and execute the same robot tests inside.
Anybody can execute the same robot tests in any commercial Hadoop/Ozone distribution

__ 3. blockade __

Blockade based tests are slightly different. They do both 1 (cluster creation) and 2 (test + assertion). Mainly because they are more interested about the environment setup (creating cluster, shutting down nodes, etc.).

They do all the cluster set up / tear down based on docker-compose and the logic is defined in python scripts.

__ 4. docker + ssh __

This patch follows a different approach. Instead of using docker-compose to start/stop/restart services/nodes it installs an additional ssh daemon inside the containers to make it possible to restart the jvm process instead of the containers. (docker-compose is used to start/stop services and ssh daemons are used to restart)

Usually this is not the way which is suggested to use in containerized environments. With docker usually it's easier to restart the containers and run only one process per container (and it provides better separation and easier management).

__ 5. this patch __

But the previous approach (using docker-compose to start / stop instead of ssh) is not portable at all. It can't be started inside kubernetes with little effort(for example).

On the other hand this patch can be used very easily in other environments as the "service restart" part of the environment management is included (with the help of ssh).

Summary:

This is a slightly different approach what we followed in the normal tests and not the mainstream usage of the containers
But it's very effective and has some clear advantages (easier to re-use tests in different env)
I have ideas how can it be done in a different way but they have different drawbacks (and different advantages)

With other words: if we separate the environment creation from the test definitions where should we put the restart functionality to. You put it to the place where we have the test definition, I described a system where it can be put to the place where we have the environment creation.

I think both approach is acceptable, I will commit this one after a green acceptance test run.

(And we can continue the thinking about how these tests can be evolved. For example: Do we need to separate these kind of the tests and create more tests where we restart clusters?

elek · 2019-10-21T08:42:22Z

/retest

anuengineer · 2019-10-21T16:27:03Z

@elek This is a brilliant comment. Can you please add this comment to the readme in the compose directory and to the wiki pages so that others are aware of this. It is this kind of context which is often missing when you read code. Thank you taking time out to explain the thought process.

hanishakoneru · 2019-10-30T19:28:59Z

Thank you very much @elek for the explanation. Appreciate it.

And we can continue the thinking about how these tests can be evolved. For example: Do we need to separate these kind of the tests and create more tests where we restart clusters?

I totally agree. We should create more tests where the cluster is restarted. Specially with OM HA, we need tests which cover restarts extensively. And when we do have more such tests which involve service restarts, we can separate them out.

hanishakoneru · 2019-10-31T01:41:55Z

/retest

hanishakoneru · 2019-10-31T17:39:15Z

The failed integration test is unrelated to this PR.
In acceptance tests, the failures are in OzoneSecure suit. Ozone OM HA robot tests are passing.

elek · 2019-11-04T13:57:15Z

In acceptance tests, the failures are in OzoneSecure suit. Ozone OM HA robot tests are passing.

Yes. It seems to be an independent problem. I have seen the same OzoneSecure problem in the nightly. While I am investigating that issue, It's not related to the new test.

I will merge this PR soon...

adoroszlai · 2019-11-04T14:05:31Z

I have seen the same OzoneSecure problem in the nightly. While I am investigating that issue, It's not related to the new test.

I think it was broken in #81 and is being fixed in #110.

elek · 2019-11-04T14:07:03Z

I think it was broken in #81 and is being fixed in #110.

Thanks a lot for this pointer...

adoroszlai · 2019-11-05T11:28:47Z

@hanishakoneru, @elek, on second look, ozone-om-ha acceptance test is failing after successful execution due to being unable to contact/stop OM containers.

Here is another one, where ozonesecure is fixed, the report is all green, but status is still failure. (Leftover OM container bound to port 9880 also results in ozones3-haproxy test failure.)

I can successfully run ozone-om-ha locally both with and without the fix from HDDS-2321.

elek · 2019-11-08T11:10:54Z

Seems to be a docker issue, can be a problem with one of the k8s node:

Connection pool is full, discarding connection: 0.0.0.0

I have seen it earlier. It can be more stable with github actions where all the builds are executed in ephemeral vms

Hdds 3188

# This is the 1st commit message: Initial Commit # This is the commit message apache#2: more slight changes # This is the commit message apache#3: changes++ # This is the commit message apache#4: getExecutorService Changes # This is the commit message apache#5: applyTransaction() Changes # This is the commit message apache#6: changes++ # This is the commit message apache#7: TestOzoneManagerLock changes # This is the commit message apache#8: add changes # This is the commit message apache#9: add more minor changes # This is the commit message apache#10: add config to ozone-default.xml # This is the commit message apache#11: minor changes # This is the commit message apache#12: change modulo logic # This is the commit message apache#13: changes # This is the commit message apache#14: changes++ # This is the commit message apache#15: add changes++ # This is the commit message apache#16: minor changes # This is the commit message apache#17: Changes (to be reverted) # This is the commit message apache#18: Changes 09/05

hanishakoneru requested a review from elek October 13, 2019 21:13

POC for OM HA robot tests

4394f38

hanishakoneru force-pushed the HDDS-2291 branch from 84d3739 to 1a6f179 Compare October 30, 2019 22:42

fixes after HDDS-2240

615ece0

hanishakoneru force-pushed the HDDS-2291 branch from 1a6f179 to 615ece0 Compare October 31, 2019 01:41

elek closed this in 9857b0c Nov 4, 2019

adoroszlai mentioned this pull request Nov 5, 2019

HDDS-2321. Ozone Block Token verify should not apply to all datanode … #110

Closed

GlenGeng-awx pushed a commit to GlenGeng-awx/hadoop-ozone that referenced this pull request Sep 18, 2020

Merge pull request apache#11 from timmylicheng/HDDS-3188

6dbe41b

Hdds 3188

hanishakoneru deleted the HDDS-2291 branch December 1, 2020 21:28

k5342 pushed a commit to k5342/ozone that referenced this pull request Oct 20, 2023

fliptable (apache#11)

73cf670

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-2291. Acceptance tests for OM HA. #11

HDDS-2291. Acceptance tests for OM HA. #11

hanishakoneru commented Oct 13, 2019 •

edited

elek commented Oct 21, 2019 •

edited

elek commented Oct 21, 2019

anuengineer commented Oct 21, 2019

hanishakoneru commented Oct 30, 2019

hanishakoneru commented Oct 31, 2019

hanishakoneru commented Oct 31, 2019

elek commented Nov 4, 2019

adoroszlai commented Nov 4, 2019

elek commented Nov 4, 2019

adoroszlai commented Nov 5, 2019 •

edited

elek commented Nov 8, 2019 •

edited

HDDS-2291. Acceptance tests for OM HA. #11

HDDS-2291. Acceptance tests for OM HA. #11

Conversation

hanishakoneru commented Oct 13, 2019 • edited

elek commented Oct 21, 2019 • edited

elek commented Oct 21, 2019

anuengineer commented Oct 21, 2019

hanishakoneru commented Oct 30, 2019

hanishakoneru commented Oct 31, 2019

hanishakoneru commented Oct 31, 2019

elek commented Nov 4, 2019

adoroszlai commented Nov 4, 2019

elek commented Nov 4, 2019

adoroszlai commented Nov 5, 2019 • edited

elek commented Nov 8, 2019 • edited

hanishakoneru commented Oct 13, 2019 •

edited

elek commented Oct 21, 2019 •

edited

adoroszlai commented Nov 5, 2019 •

edited

elek commented Nov 8, 2019 •

edited