Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-2291. Acceptance tests for OM HA. #11

Closed
wants to merge 2 commits into from

Conversation

hanishakoneru
Copy link
Contributor

@hanishakoneru hanishakoneru commented Oct 13, 2019

Add robot tests to test OM HA functionality.

This patch is dependent on HDDS-2240 (#9)

In this robot test, we stop/ start OM from inside the robot test. And the docker container is started with a long running process (sleep) so that om stop does not result in container exit.
We are still discussing the right approach to go about this. This patch is just posted as a POC to help with the discussion.
cc. @elek

@elek
Copy link
Member

elek commented Oct 21, 2019

Thank you very much the patch @hanishakoneru . Overall I am very to happy have more HA tests with the robot framework and I would be happy to commit it (after clean builds).

Personally I would prefer to use a different approach, but it's only because I may have different thinking. It may not be better or worse. The only thing what I would like to do here is the explain my view, just because this is the fan part: to understand the thinking of each other.

1. The level of the tests

To run acceptance test we need to solve two problems:

1.) Create a running ozone cluster (and may restart services during the tests)
2.) Execute commands / check the results (run tests + assert)

Currently these two roles/levels are separated.

The second one is implemented by the robot tests but the (existing) robot tests don't include any logic to start (or restart) services.

The environments are mainly defined with docker-compose files and the logic to start them is defined by shell scripts (for example this is the simplest one)

The two levels/roles are separated.

__ 2. the flexibility __

The main advantage of this approach that you can run the tests in different environments. For example I can replace the shell script based cluster creation process with anything else.

  1. I can create kubernetes clusters and execute the same robot tests inside.
  2. Anybody can execute the same robot tests in any commercial Hadoop/Ozone distribution

__ 3. blockade __

Blockade based tests are slightly different. They do both 1 (cluster creation) and 2 (test + assertion). Mainly because they are more interested about the environment setup (creating cluster, shutting down nodes, etc.).

They do all the cluster set up / tear down based on docker-compose and the logic is defined in python scripts.

__ 4. docker + ssh __

This patch follows a different approach. Instead of using docker-compose to start/stop/restart services/nodes it installs an additional ssh daemon inside the containers to make it possible to restart the jvm process instead of the containers. (docker-compose is used to start/stop services and ssh daemons are used to restart)

Usually this is not the way which is suggested to use in containerized environments. With docker usually it's easier to restart the containers and run only one process per container (and it provides better separation and easier management).

__ 5. this patch __

But the previous approach (using docker-compose to start / stop instead of ssh) is not portable at all. It can't be started inside kubernetes with little effort(for example).

On the other hand this patch can be used very easily in other environments as the "service restart" part of the environment management is included (with the help of ssh).

Summary:

  • This is a slightly different approach what we followed in the normal tests and not the mainstream usage of the containers
  • But it's very effective and has some clear advantages (easier to re-use tests in different env)
  • I have ideas how can it be done in a different way but they have different drawbacks (and different advantages)

With other words: if we separate the environment creation from the test definitions where should we put the restart functionality to. You put it to the place where we have the test definition, I described a system where it can be put to the place where we have the environment creation.

I think both approach is acceptable, I will commit this one after a green acceptance test run.

(And we can continue the thinking about how these tests can be evolved. For example: Do we need to separate these kind of the tests and create more tests where we restart clusters?

@elek
Copy link
Member

elek commented Oct 21, 2019

/retest

@anuengineer
Copy link
Contributor

@elek This is a brilliant comment. Can you please add this comment to the readme in the compose directory and to the wiki pages so that others are aware of this. It is this kind of context which is often missing when you read code. Thank you taking time out to explain the thought process.

@hanishakoneru
Copy link
Contributor Author

Thank you very much @elek for the explanation. Appreciate it.

And we can continue the thinking about how these tests can be evolved. For example: Do we need to separate these kind of the tests and create more tests where we restart clusters?

I totally agree. We should create more tests where the cluster is restarted. Specially with OM HA, we need tests which cover restarts extensively. And when we do have more such tests which involve service restarts, we can separate them out.

@hanishakoneru
Copy link
Contributor Author

/retest

@hanishakoneru
Copy link
Contributor Author

The failed integration test is unrelated to this PR.
In acceptance tests, the failures are in OzoneSecure suit. Ozone OM HA robot tests are passing.

@elek
Copy link
Member

elek commented Nov 4, 2019

In acceptance tests, the failures are in OzoneSecure suit. Ozone OM HA robot tests are passing.

Yes. It seems to be an independent problem. I have seen the same OzoneSecure problem in the nightly. While I am investigating that issue, It's not related to the new test.

I will merge this PR soon...

@elek elek closed this in 9857b0c Nov 4, 2019
@adoroszlai
Copy link
Contributor

I have seen the same OzoneSecure problem in the nightly. While I am investigating that issue, It's not related to the new test.

I think it was broken in #81 and is being fixed in #110.

@elek
Copy link
Member

elek commented Nov 4, 2019

I think it was broken in #81 and is being fixed in #110.

Thanks a lot for this pointer...

@adoroszlai
Copy link
Contributor

adoroszlai commented Nov 5, 2019

@hanishakoneru, @elek, on second look, ozone-om-ha acceptance test is failing after successful execution due to being unable to contact/stop OM containers.

Here is another one, where ozonesecure is fixed, the report is all green, but status is still failure. (Leftover OM container bound to port 9880 also results in ozones3-haproxy test failure.)

I can successfully run ozone-om-ha locally both with and without the fix from HDDS-2321.

@elek
Copy link
Member

elek commented Nov 8, 2019

Seems to be a docker issue, can be a problem with one of the k8s node:

Connection pool is full, discarding connection: 0.0.0.0

I have seen it earlier. It can be more stable with github actions where all the builds are executed in ephemeral vms

GlenGeng-awx pushed a commit to GlenGeng-awx/hadoop-ozone that referenced this pull request Sep 18, 2020
@hanishakoneru hanishakoneru deleted the HDDS-2291 branch December 1, 2020 21:28
tanvipenumudy added a commit to tanvipenumudy/ozone that referenced this pull request May 12, 2022
# This is the 1st commit message:

Initial Commit

# This is the commit message apache#2:

more slight changes

# This is the commit message apache#3:

changes++

# This is the commit message apache#4:

getExecutorService Changes

# This is the commit message apache#5:

applyTransaction() Changes

# This is the commit message apache#6:

changes++

# This is the commit message apache#7:

TestOzoneManagerLock changes

# This is the commit message apache#8:

add changes

# This is the commit message apache#9:

add more minor changes

# This is the commit message apache#10:

add config to ozone-default.xml

# This is the commit message apache#11:

minor changes

# This is the commit message apache#12:

change modulo logic

# This is the commit message apache#13:

changes

# This is the commit message apache#14:

changes++

# This is the commit message apache#15:

add changes++

# This is the commit message apache#16:

minor changes

# This is the commit message apache#17:

Changes (to be reverted)

# This is the commit message apache#18:

Changes 09/05
k5342 pushed a commit to k5342/ozone that referenced this pull request Oct 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants