[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179

ph · 2019-01-18T17:17:25Z

This reverts commit edeed09.

Looking at this with a fresh eyes, I am not sure its flakyness, we should get log from the docker containers on failures.

ph · 2019-01-18T20:02:49Z

Seriously, my only theory is that it's something takes too much time for a container to bring up and make the health check fails, I am currently looking to expose the docker compose logs. Because debugging this on Travis is not really fun.

What is frustrating is the docker compose file of x-pack is really close to the other file that doesn't fail.. I am thinking that the password setup that we do in the beginning of the ES container is making the thing goes boom, of course it doesn't happen always.

ph · 2019-01-18T20:06:54Z

Do not merge this PR is for testing.

ph · 2019-01-18T22:03:28Z

From my experience at looking at theses log they always seems to fails on the health check for elasticsearch.

Creating libbeat667d4597047394139f4e96d47c5ca886faea66b5_elasticsearch_1

urso · 2019-01-20T01:09:41Z

would we get some more info by running docker-compose with -verbose flag?

ph · 2019-01-21T19:56:51Z

After talking with security the main problem here is our health check is actually wrong (and might be wrong in our other docker compose files). Security is just an index, when ES is started user are added to the Security index, If ES respond to calls it doesn't mean that the cluster is green, it doesn't mean that the security index is accessible, until the cluster is green it is possible that looking for user effectively fails.

This reverts commit edeed09. And do the following: - Move all docker-compose.yml file to the version 2.3 format to have support for `start_period` - The health check check the cluster health instead of checking that the host respond. - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI. - DUMP the last health check information and the docker-compose logs

ph · 2019-01-21T20:11:31Z

x-pack/libbeat/docker-compose.yml

    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - "network.host="
      - "transport.host=127.0.0.1"
      - "http.host=0.0.0.0"
      - "xpack.security.enabled=true"
      - "xpack.license.self_generated.type=trial"
-    command: bash -c "bin/elasticsearch-keystore create && echo changeme | bin/elasticsearch-keystore add --stdin bootstrap.password && /usr/local/bin/docker-entrypoint.sh eswrapper"
-    restart: on-failure:5
+      - ELASTIC_PASSWORD=changeme


Doc reference: https://github.com/elastic/elasticsearch-docker/blob/c2877efb6e6d9121c13a02a8384599b9ef42f71b/.tedi/template/bin/docker-entrypoint.sh#L78-L91

ruflin

Thanks for the investigation. Probably the health check should also be updated in testing/environment setup?

ruflin · 2019-01-22T08:39:21Z

x-pack/libbeat/docker-compose.yml

@@ -51,6 +48,6 @@ services:
    healthcheck:
      test: ["CMD-SHELL", 'python -c ''import urllib, json; response = urllib.urlopen("http://elastic:changeme@localhost:5601/api/status"); data = json.loads(response.read()); exit(1) if data["status"]["overall"]["state"] != "green" else exit(0);''']
      retries: 1200
-      interval: 1s
+      interval: 5s


Any reason for this quite long delay and lower interval? This will slow down builds I would assume?

Well slow down is actually relative in the healthcheck, if you wait for 5s instead of 1s, it's 4 fewer requests ES has to deal before the next check. Since Travis is not the fastest thing in the world, giving it more time to start is a good compromise. You see that I've only modified theses limit for the x-pack/libbeat/docker-compose.yml and not the other files, it's actually because it takes more times to start and recover the security index to allow authentification.

@ruflin if the test suite becomes flaky or unreliable we will have to switch to file-based authentification for ES.

I would hope that ES can cope with 1 request per sec ;-) For the start_period, tried to find some good docs but I assume it waits initially 60s? Seems a bit too much.

Overall my goal is to keep the CI time down, as soon as a service is available, tests should start.

I think too, but well its Travis its docker, I prefer to give it as much chance as possible, maybe call that I am fed up with that and I would prefer to not whack a mole more :D

The start_period is a really bad name, it should be called grace period.
When the container starts, it means that health check that fails will not increase the fail count but if the health check succeeds during that period the container can be marked as healthy, when this happen the remaining time of the start_period is ignored. So it doesn't really impact that much the startup it just gives a bit more time to be healthy.

SGTM, good argument ;-)

ruflin · 2019-01-22T08:39:40Z

x-pack/libbeat/docker-compose.yml


  elasticsearch:
    extends:
      file: ${ES_BEATS}/testing/environments/${TESTING_ENVIRONMENT}.yml
      service: elasticsearch
    healthcheck:
-      test: ["CMD", "curl", "-u", "elastic:changeme", "-f", "http://localhost:9200"]
+      test: ["CMD-SHELL", 'python -c ''import urllib, json; response = urllib.urlopen("http://elastic:changeme@localhost:9200/_cluster/health"); data = json.loads(response.read()); exit(1) if data["status"] != "green" else exit(0);''']


Currently this should only matter for x-pack or anything that uses security, I didn't want to change anything on other docker-compose file.

@ruflin Just to clarify, I will not change the healthcheck for the testing/environment files, become as of today nothing assume that a cluster need to be green other than security, but as we add more test or when/if metricbeat has integration tests that requires security we could move all the docker-compose files to that strategy.

ruflin · 2019-01-22T08:39:52Z

x-pack/libbeat/docker-compose.yml

      retries: 1200
-      interval: 1s
+      interval: 5s


Can we use the old one again to speed up bulids?

ruflin · 2019-01-22T08:40:09Z

libbeat/scripts/Makefile

@@ -202,7 +202,7 @@ integration-tests: prepare-tests
 .PHONY: integration-tests-environment
 integration-tests-environment:  ## @testing Runs the integration inside a virtual environment. This can be run on any docker-machine (local, remote)
 integration-tests-environment: prepare-tests build-image
-	${DOCKER_COMPOSE} run beat make integration-tests RACE_DETECTOR=$(RACE_DETECTOR) DOCKER_COMPOSE_PROJECT_NAME=${DOCKER_COMPOSE_PROJECT_NAME}
+	${DOCKER_COMPOSE} run beat make integration-tests RACE_DETECTOR=$(RACE_DETECTOR) DOCKER_COMPOSE_PROJECT_NAME=${DOCKER_COMPOSE_PROJECT_NAME} || docker inspect --format "{{json .State.Health }}" $$(${DOCKER_COMPOSE} ps -q) | jq && ${DOCKER_COMPOSE} logs --tail 200


I assume this was for debugging purpose. Should we keep it in?

@ruflin YES we should, because when docker-compose fails it can be hard to either reproduce it or get the logs, it took me a few tries to get an idea how to have useful debug information and when I finally found a good way it took 50 runs (7m each) to finally make it fails..

The above will do two things:

Display the last information for the healthcheck.

Display 200 log lines for each container.

The 2, was why I was able to actually make sense of this error, Logging at the log it was clearly demonstrating that ES was actually not in a green state and the authentification failed.

…astic#10179) * Enable back CM integration suite This reverts commit edeed09. And do the following: - Move all docker-compose.yml file to the version 2.3 format to have support for `start_period` - The health check check the cluster health instead of checking that the host respond. - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI. - DUMP the last health check information and the docker-compose logs

…10259) * Enable back CM integration suite This reverts commit edeed09. And do the following: - Move all docker-compose.yml file to the version 2.3 format to have support for `start_period` - The health check check the cluster health instead of checking that the host respond. - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI. - DUMP the last health check information and the docker-compose logs

…10260) * Enable back CM integration suite This reverts commit edeed09. And do the following: - Move all docker-compose.yml file to the version 2.3 format to have support for `start_period` - The health check check the cluster health instead of checking that the host respond. - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI. - DUMP the last health check information and the docker-compose logs

…astic#10179) (elastic#10260) * Enable back CM integration suite This reverts commit 0724101. And do the following: - Move all docker-compose.yml file to the version 2.3 format to have support for `start_period` - The health check check the cluster health instead of checking that the host respond. - Use the `ELASTIC_PASSWORD` variable instead of invoking the CLI. - DUMP the last health check information and the docker-compose logs

ph added review libbeat and removed review labels Jan 18, 2019

ph requested a review from a team as a code owner January 18, 2019 22:55

ph force-pushed the fix/enable-cm-back branch from 9491a9b to 10f35c4 Compare January 21, 2019 20:07

ph requested review from a team as code owners January 21, 2019 20:07

ph added the in progress Pull request is currently in progress. label Jan 21, 2019

ph changed the title ~~Revert "Skip x-pack libbeat tests again as flaky (#10068)"~~ [WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" Jan 21, 2019

ph force-pushed the fix/enable-cm-back branch from 10f35c4 to 13345b6 Compare January 21, 2019 20:09

ph commented Jan 21, 2019

View reviewed changes

enforce

444456e

ph added the needs_backport PR is waiting to be backported to other branches. label Jan 22, 2019

ruflin reviewed Jan 22, 2019

View reviewed changes

ph added review and removed in progress Pull request is currently in progress. labels Jan 22, 2019

ruflin approved these changes Jan 22, 2019

View reviewed changes

ph merged commit 5a339e7 into elastic:master Jan 22, 2019

ph mentioned this pull request Jan 22, 2019

Cherry-pick #10179 to 6.x: Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10259

Merged

ph added v6.7.0 and removed needs_backport PR is waiting to be backported to other branches. labels Jan 22, 2019

ph mentioned this pull request Jan 22, 2019

Cherry-pick #10179 to 6.6: Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10260

Merged

ph added the v6.6.0 label Jan 22, 2019

ph added the :Testing label Jan 22, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179

[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179

ph commented Jan 18, 2019

ph commented Jan 18, 2019 •

edited

Loading

ph commented Jan 18, 2019

ph commented Jan 18, 2019

urso commented Jan 20, 2019

ph commented Jan 21, 2019

ph Jan 21, 2019

ruflin left a comment

ruflin Jan 22, 2019

ph Jan 22, 2019

ruflin Jan 22, 2019

ph Jan 22, 2019

ruflin Jan 22, 2019

ruflin Jan 22, 2019

ph Jan 22, 2019

ph Jan 22, 2019

ruflin Jan 22, 2019

ruflin Jan 22, 2019

ph Jan 22, 2019

ph Jan 22, 2019

[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179

[WIP] Revert "Skip x-pack libbeat tests again as flaky (#10068)" #10179

Conversation

ph commented Jan 18, 2019

ph commented Jan 18, 2019 • edited Loading

ph commented Jan 18, 2019

ph commented Jan 18, 2019

urso commented Jan 20, 2019

ph commented Jan 21, 2019

Choose a reason for hiding this comment

ruflin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ph commented Jan 18, 2019 •

edited

Loading