Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179

Merged
merged 2 commits into from
Jan 22, 2019

Conversation

ph
Copy link
Contributor

@ph ph commented Jan 18, 2019

This reverts commit edeed09.

Looking at this with a fresh eyes, I am not sure its flakyness, we should get log from the docker containers on failures.

@ph
Copy link
Contributor Author

ph commented Jan 18, 2019

Seriously, my only theory is that it's something takes too much time for a container to bring up and make the health check fails, I am currently looking to expose the docker compose logs. Because debugging this on Travis is not really fun.

What is frustrating is the docker compose file of x-pack is really close to the other file that doesn't fail.. I am thinking that the password setup that we do in the beginning of the ES container is making the thing goes boom, of course it doesn't happen always.

@ph
Copy link
Contributor Author

ph commented Jan 18, 2019

Do not merge this PR is for testing.

@ph
Copy link
Contributor Author

ph commented Jan 18, 2019

From my experience at looking at theses log they always seems to fails on the health check for elasticsearch.

Creating libbeat667d4597047394139f4e96d47c5ca886faea66b5_elasticsearch_1

@ph ph requested a review from a team as a code owner January 18, 2019 22:55
@urso
Copy link

urso commented Jan 20, 2019

would we get some more info by running docker-compose with -verbose flag?

@ph
Copy link
Contributor Author

ph commented Jan 21, 2019

After talking with security the main problem here is our health check is actually wrong (and might be wrong in our other docker compose files). Security is just an index, when ES is started user are added to the Security index, If ES respond to calls it doesn't mean that the cluster is green, it doesn't mean that the security index is accessible, until the cluster is green it is possible that looking for user effectively fails.

@ph ph requested review from a team as code owners January 21, 2019 20:07
@ph ph added the in progress Pull request is currently in progress. label Jan 21, 2019
@ph ph changed the title Revert "Skip x-pack libbeat tests again as flaky (#10068)" [WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" Jan 21, 2019
This reverts commit edeed09.

And do the following:

- Move all docker-compose.yml file to the version 2.3 format to have
support for `start_period`
- The health check check the cluster health instead of checking that the
host respond.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- DUMP the last health check information and the docker-compose logs
environment:
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
- "network.host="
- "transport.host=127.0.0.1"
- "http.host=0.0.0.0"
- "xpack.security.enabled=true"
- "xpack.license.self_generated.type=trial"
command: bash -c "bin/elasticsearch-keystore create && echo changeme | bin/elasticsearch-keystore add --stdin bootstrap.password && /usr/local/bin/docker-entrypoint.sh eswrapper"
restart: on-failure:5
- ELASTIC_PASSWORD=changeme
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ph ph added the needs_backport PR is waiting to be backported to other branches. label Jan 22, 2019
Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the investigation. Probably the health check should also be updated in testing/environment setup?

@@ -51,6 +48,6 @@ services:
healthcheck:
test: ["CMD-SHELL", 'python -c ''import urllib, json; response = urllib.urlopen("http://elastic:changeme@localhost:5601/api/status"); data = json.loads(response.read()); exit(1) if data["status"]["overall"]["state"] != "green" else exit(0);''']
retries: 1200
interval: 1s
interval: 5s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this quite long delay and lower interval? This will slow down builds I would assume?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well slow down is actually relative in the healthcheck, if you wait for 5s instead of 1s, it's 4 fewer requests ES has to deal before the next check. Since Travis is not the fastest thing in the world, giving it more time to start is a good compromise. You see that I've only modified theses limit for the x-pack/libbeat/docker-compose.yml and not the other files, it's actually because it takes more times to start and recover the security index to allow authentification.

@ruflin if the test suite becomes flaky or unreliable we will have to switch to file-based authentification for ES.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that ES can cope with 1 request per sec ;-) For the start_period, tried to find some good docs but I assume it waits initially 60s? Seems a bit too much.

Overall my goal is to keep the CI time down, as soon as a service is available, tests should start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think too, but well its Travis its docker, I prefer to give it as much chance as possible, maybe call that I am fed up with that and I would prefer to not whack a mole more :D

The start_period is a really bad name, it should be called grace period.
When the container starts, it means that health check that fails will not increase the fail count but if the health check succeeds during that period the container can be marked as healthy, when this happen the remaining time of the start_period is ignored. So it doesn't really impact that much the startup it just gives a bit more time to be healthy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, good argument ;-)


elasticsearch:
extends:
file: ${ES_BEATS}/testing/environments/${TESTING_ENVIRONMENT}.yml
service: elasticsearch
healthcheck:
test: ["CMD", "curl", "-u", "elastic:changeme", "-f", "http://localhost:9200"]
test: ["CMD-SHELL", 'python -c ''import urllib, json; response = urllib.urlopen("http://elastic:changeme@localhost:9200/_cluster/health"); data = json.loads(response.read()); exit(1) if data["status"] != "green" else exit(0);''']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this should only matter for x-pack or anything that uses security, I didn't want to change anything on other docker-compose file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin Just to clarify, I will not change the healthcheck for the testing/environment files, become as of today nothing assume that a cluster need to be green other than security, but as we add more test or when/if metricbeat has integration tests that requires security we could move all the docker-compose files to that strategy.

retries: 1200
interval: 1s
interval: 5s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the old one again to speed up bulids?

@@ -202,7 +202,7 @@ integration-tests: prepare-tests
.PHONY: integration-tests-environment
integration-tests-environment: ## @testing Runs the integration inside a virtual environment. This can be run on any docker-machine (local, remote)
integration-tests-environment: prepare-tests build-image
${DOCKER_COMPOSE} run beat make integration-tests RACE_DETECTOR=$(RACE_DETECTOR) DOCKER_COMPOSE_PROJECT_NAME=${DOCKER_COMPOSE_PROJECT_NAME}
${DOCKER_COMPOSE} run beat make integration-tests RACE_DETECTOR=$(RACE_DETECTOR) DOCKER_COMPOSE_PROJECT_NAME=${DOCKER_COMPOSE_PROJECT_NAME} || docker inspect --format "{{json .State.Health }}" $$(${DOCKER_COMPOSE} ps -q) | jq && ${DOCKER_COMPOSE} logs --tail 200
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was for debugging purpose. Should we keep it in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin YES we should, because when docker-compose fails it can be hard to either reproduce it or get the logs, it took me a few tries to get an idea how to have useful debug information and when I finally found a good way it took 50 runs (7m each) to finally make it fails..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above will do two things:

  1. Display the last information for the healthcheck.
  2. Display 200 log lines for each container.

The 2, was why I was able to actually make sense of this error, Logging at the log it was clearly demonstrating that ES was actually not in a green state and the authentification failed.

@ph ph added review and removed in progress Pull request is currently in progress. labels Jan 22, 2019
@ph ph added the v6.6.0 label Jan 22, 2019
@ph ph added the :Testing label Jan 22, 2019
ph added a commit to ph/beats that referenced this pull request Jan 25, 2019
…astic#10179)

* Enable back CM integration suite

This reverts commit edeed09.

And do the following:

- Move all docker-compose.yml file to the version 2.3 format to have
support for `start_period`
- The health check check the cluster health instead of checking that the
host respond.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- DUMP the last health check information and the docker-compose logs
ph added a commit to ph/beats that referenced this pull request Jan 26, 2019
…astic#10179)

* Enable back CM integration suite

This reverts commit edeed09.

And do the following:

- Move all docker-compose.yml file to the version 2.3 format to have
support for `start_period`
- The health check check the cluster health instead of checking that the
host respond.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- DUMP the last health check information and the docker-compose logs
ph added a commit that referenced this pull request Jan 26, 2019
…10259)

* Enable back CM integration suite

This reverts commit edeed09.

And do the following:

- Move all docker-compose.yml file to the version 2.3 format to have
support for `start_period`
- The health check check the cluster health instead of checking that the
host respond.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- DUMP the last health check information and the docker-compose logs
ph added a commit that referenced this pull request Jan 26, 2019
…10260)

* Enable back CM integration suite

This reverts commit edeed09.

And do the following:

- Move all docker-compose.yml file to the version 2.3 format to have
support for `start_period`
- The health check check the cluster health instead of checking that the
host respond.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- DUMP the last health check information and the docker-compose logs
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…astic#10179) (elastic#10260)

* Enable back CM integration suite

This reverts commit 0724101.

And do the following:

- Move all docker-compose.yml file to the version 2.3 format to have
support for `start_period`
- The health check check the cluster health instead of checking that the
host respond.
- Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI.
- DUMP the last health check information and the docker-compose logs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants