Replace the usage of tianon/true images with docker compose up --wait #1720
Conversation
Thanks @pkoutsovasilis, this change looks interesting!
I added some comments about it. I see at least one difference in behavior: the startup process is not retried if the elastic-agent container fails, as it is in main (non-detached mode).
Another concern about this PR is that it would break the current support of elastic-package with docker-compose V1. Should that support be removed as part of the future 1.0 version? @jsoriano
Docker itself has also dropped it (docker-compose release notes):
> Effective July 2023, Compose V1 stopped receiving updates and is no longer in new Docker Desktop releases.
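(For reference, a minimal sketch of how a tool can tell which Compose generation is available, assuming the `docker compose` V2 CLI plugin is installed; the same `docker compose version --short` command shows up in the debug logs later in this thread. This is an illustration, not the elastic-package implementation.)

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Compose V2 is a docker CLI plugin invoked as `docker compose`;
	// standalone V1 only ships as the separate `docker-compose` binary,
	// so this command failing is already a strong V2-absence signal.
	out, err := exec.Command("docker", "compose", "version", "--short").Output()
	if err != nil {
		fmt.Println("docker compose (V2) not available:", err)
		return
	}
	version := strings.TrimSpace(string(out))
	version = strings.TrimPrefix(version, "v")
	if strings.HasPrefix(version, "1.") {
		fmt.Println("Compose V1 detected; flags such as --wait are unsupported")
		return
	}
	fmt.Println("Determined Docker Compose version:", version)
}
```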
(Review comment on this snippet, which pairs each service with its `*_is_ready` companion:)

```go
var allServices []string
for _, aService := range services {
	allServices = append(allServices, aService, fmt.Sprintf("%s_%s", aService, readyServicesSuffix))
}
```
If docker-compose V1 no longer needs to be supported and all the _is_ready containers can be removed, it looks like the readyServicesSuffix constant could be deleted too.
If that is the case, the code of the Status function in internal/stack/status.go could also be simplified.
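(A sketch of what that simplification could amount to, based only on the snippet above; the package name, function wrapper, and constant value are assumptions for illustration:)

```go
package stack

import "fmt"

// Assumed value, matching container names like package-registry_is_ready-1.
const readyServicesSuffix = "is_ready"

// serviceNames returns the compose service names to query for status.
// Today each service is paired with a generated "<service>_is_ready"
// companion; once those companions are removed, the pairing loop and the
// readyServicesSuffix constant disappear and this reduces to `return services`.
func serviceNames(services []string) []string {
	allServices := make([]string, 0, 2*len(services))
	for _, aService := range services {
		allServices = append(allServices, aService,
			fmt.Sprintf("%s_%s", aService, readyServicesSuffix))
	}
	return allServices
}
```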
Those container definitions are part of the services required for testing (system tests).
Related to the status command, it should not be affected, since that command is intended to show just information about the containers related to the Stack (Kibana, Package Registry, Elasticsearch, etc.). It uses the docker-compose project from the profile:

elastic-package/internal/stack/compose.go, line 189 at 62fe1c9:

```go
containerIDs, err := docker.ContainerIDsWithLabel(projectLabelDockerCompose, DockerComposeProjectName(options.Profile))
```
About the test packages, I am not sure what the best option would be.
Currently, the servicedeployer runs in detached mode, using the -d docker-compose flag (the same line appears in three servicedeployer locations):

```go
ExtraArgs: []string{"--build", "-d"},
```
About the test packages, if servicedeployer is not updated with the new flags, those containers should be kept. An explicit method is also run to wait for the containers to become ready/healthy (again in two locations):

```go
err = p.WaitForHealthy(ctx, opts)
```
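(For context, a hedged sketch of what an explicit wait of this kind typically looks like; the `containersHealthy` helper below is hypothetical and only exists to make the sketch self-contained, it is not the real WaitForHealthy:)

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// containersHealthy stands in for a real check (e.g. docker-inspecting the
// health state of the project's containers); hypothetical for this sketch.
func containersHealthy(ctx context.Context) (bool, error) { return true, nil }

// waitForHealthy polls until all containers report healthy, the context is
// cancelled, or the timeout elapses; this is the explicit-wait alternative
// to delegating the wait to `docker compose up --wait`.
func waitForHealthy(ctx context.Context, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		ok, err := containersHealthy(ctx)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("containers not healthy: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	if err := waitForHealthy(context.Background(), 10*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```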
@jsoriano Should servicedeployer be updated too (Up options) with these new flags? Or should the current implementation be kept? As they are running with -d, it looks safe.
About the test package, it could be removed... but in the integrations repository they would keep using that container. The tianon/true container could probably be kept here to make sure that case is also tested. WDYT?
(Review comment on this hunk adding the new flags to the Up arguments:)

```diff
 var args []string
 if options.DaemonMode {
-	args = append(args, "-d")
+	args = append(args, "-d", "--wait", "--wait-timeout", fmt.Sprintf("%d", 600))
```
Testing in the main branch without detached mode (-d), if the elastic-agent container fails... the docker-compose up is retried:

```
dependency failed to start: container elastic-package-stack-elastic-agent-1 exited (1)
2024/03/13 19:13:05 DEBUG output command: /usr/bin/docker ps -a --filter label=com.docker.compose.project=elastic-package-stack --format {{.ID}}
2024/03/13 19:13:05 DEBUG output command: /usr/bin/docker inspect d5df6764fb51 09f479d5637e 644fbbaa8557 341a68995249 831674f8c23f 920f6d104b80 25de6ef50f35 8cd0417a9622 afa7b068ae1b 44f10e51526b
Elastic Agent failed to start, trying again in 10s.
2024/03/13 19:13:16 DEBUG running command: /usr/bin/docker compose version --short
2024/03/13 19:13:16 DEBUG Determined Docker Compose version: 2.24.6
2024/03/13 19:13:16 DEBUG running command: /usr/bin/docker compose -f /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up
[+] Running 9/0
```

With this change, it looks like elastic-package does not get the error (elastic-agent failed to start) and cannot retry the docker-compose up, leaving the scenario with the elastic-agent container in exited status. It looks like the is_ready containers help in this case, though I am not sure how.
@mrodm this is weird, as the e2e-tests in CI have caught one case where this happens. Can you help with an example that hopefully reproduces what you see?
> this is weird, as the e2e-tests in CI have caught one case where this happens.

Steps in CI use detached mode (-d):

```
elastic-package stack up -d -v
```
> Can you help with an example that hopefully reproduces what you see?

Sure! Here I was referring to the case where that flag is not used:

```
elastic-package stack up -v
```

Not all runs of this command fail with this, so it needs to be repeated until that error is hit.
After some retries running the above command, the elastic-agent-1 container could not start, and elastic-package did not try to restart it as it would before:
elastic-agent-1 | {"log.level":"info","@timestamp":"2024-03-26T11:51:50.649Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":505},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
elastic-agent-1 | Error: fail to enroll: fail to execute request to fleet-server: EOF
elastic-agent-1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 | Error: enrollment failed: exit status 1
elastic-agent-1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 exited with code 1
And the status of the cluster:
```
$ elastic-package stack status
Status of Elastic stack services:
╭──────────────────┬─────────┬───────────────────╮
│ SERVICE          │ VERSION │ STATUS            │
├──────────────────┼─────────┼───────────────────┤
│ elastic-agent    │ 8.12.2  │ exited (1)        │
│ elasticsearch    │ 8.12.2  │ running (healthy) │
│ fleet-server     │ 8.12.2  │ running (healthy) │
│ kibana           │ 8.12.2  │ running (healthy) │
│ package-registry │ latest  │ running (healthy) │
╰──────────────────┴─────────┴───────────────────╯
```
Running with the same options on the latest published version, it does retry:
elastic-agent-1 | {"log.level":"info","@timestamp":"2024-03-26T12:01:12.540Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":505},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
elastic-agent-1 | Error: fail to enroll: fail to execute request to fleet-server: dial tcp 192.168.192.6:8220: connect: connection refused
elastic-agent-1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 | Error: enrollment failed: exit status 1
elastic-agent-1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 exited with code 1
dependency failed to start: container elastic-package-stack-elastic-agent-1 exited (1)
2024/03/26 13:01:12 DEBUG output command: /usr/bin/docker ps -a --filter label=com.docker.compose.project=elastic-package-stack --format {{.ID}}
2024/03/26 13:01:12 DEBUG output command: /usr/bin/docker inspect 0ebea2626c5e b5873a01c18c 94ec7181f20c 0449f8bf38ef a8c2d3eaf656 cc5633104400 1916c6b5f1dc 25baea050703 b1af9e72d4f2 2f268ae81ff3
Elastic Agent failed to start, trying again in 10s.
2024/03/26 13:01:22 DEBUG running command: /usr/bin/docker compose version --short
2024/03/26 13:01:22 DEBUG Determined Docker Compose version: 2.25.0
2024/03/26 13:01:22 DEBUG running command: /usr/bin/docker compose -f /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up
WARN[0000] /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml: `version` is obsolete
[+] Running 9/0
✔ Container elastic-package-stack-package-registry-1 Running 0.0s
✔ Container elastic-package-stack-elasticsearch-1 Running 0.0s
✔ Container elastic-package-stack-package-registry_is_ready-1 Created 0.0s
✔ Container elastic-package-stack-elasticsearch_is_ready-1 Created 0.0s
✔ Container elastic-package-stack-kibana-1 Running 0.0s
✔ Container elastic-package-stack-kibana_is_ready-1 Created 0.0s
✔ Container elastic-package-stack-fleet-server-1 Running 0.0s
✔ Container elastic-package-stack-fleet-server_is_ready-1 Created 0.0s
✔ Container elastic-package-stack-elastic-agent-1 Created 0.0s
Attaching to elastic-agent-1, elastic-agent_is_ready-1, elasticsearch-1, elasticsearch_is_ready-1, fleet-server-1, fleet-server_is_ready-1, kibana-1, kibana_is_ready-1, package-registry-1, package-registry_is_ready-1
that's odd 🥲 hmmm I see, I think this is what happens: with the *_is_ready services in place we have dependencies on the actual services, so when elastic-agent fails, the respective _is_ready service can't start because its dependency failed completely, and thus docker compose up returns an error in the sense of "sorry, I couldn't bring up all the services". However, when we remove the *_is_ready services this isn't triggered, as nothing depends on elastic-agent, and thus up considers that everything has been brought up and the user will have to deal with any errors visible in the logs... I don't have an immediate fix for that
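(One possible direction, purely a sketch and not part of this PR: after up returns, explicitly inspect the project's containers for exited services, reusing the same label filter that appears in the debug logs above; the format fields are standard docker ps placeholders:)

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// List every container of the compose project with its state; with the
	// *_is_ready services gone, nothing depends on elastic-agent, so an
	// exited agent has to be detected explicitly like this.
	out, err := exec.Command("docker", "ps", "-a",
		"--filter", "label=com.docker.compose.project=elastic-package-stack",
		"--format", "{{.Names}}\t{{.State}}").Output()
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.HasSuffix(line, "\texited") {
			fmt.Println("service failed:", line)
		}
	}
}
```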
I think we can remove support for v1, it was completely discontinued last year, and is not included anymore in docker desktop distributions. The migration to v2 is straightforward, and should be transparent for anyone using a recent docker version, since #1592. No need to relate this change to #1158.
After speaking with @jsoriano, we decided that it is better to introduce a milder change to support native arm64 execution by exploiting docker compose's built-in --wait.
After having a look at the code I realised that the `tianon/true` images, as well as the respective docker compose `*_is_ready` services, are required only in the detached mode of `docker compose up`. Specifically, based on the dependencies of services, docker compose will detach when all services are reported as healthy and only the `*_is_ready` ones are in the state of `Started`. So `elastic-package stack up -d -v` would look like this image. Notice that `Done` is being printed while the `*_is_ready` services are at `Started`.

This PR substitutes the usage of such services with the built-in `docker compose --wait` for detached mode, which results in the following image. Notice that `Done` is being printed when all services are reported `Healthy`.



As for the compose up attached mode, this essentially remains the same. As before, the invocation will immediately transition to the container logs.

Why should we try to avoid `tianon/true` images?! Because many devs use Apple-silicon MacBooks, and this gets us a step closer to running natively without any ISA emulation. This last bit, assuming this PR makes it, requires only `elastic-package-registry` and `fleet-server` to offer arm64 images, and native execution, here we go.

As always, I may have missed something in my logic above, so please feel free to correct me 🙂

This PR closes #1708 (PS: @mrodm I told you in EAH that I didn't open an issue because I wanted to open a PR, but joking aside, thanks for capturing this in an issue 😄)