Evaluate container's restart policy and exit status before starting #1668
I think this is a bit more tricky than that @20k-ultra. What you are describing is already implemented by the engine, so we could effectively be racing against the engine. What happens if the container was deleted? The exit code also persists after a reboot, so should the supervisor never start a stopped container? I agree there is a product discussion we want to have, because there is a pattern here, but we might have to think about the implications of this for other users.
Under what conditions is a container deleted?
Correct. It should comply with the restart semantics specified in the compose file, and should match how Docker would behave.
@pipex I think you're misunderstanding. The Supervisor will ignore restart policies and start containers that the engine isn't trying to start, so there's no race condition. In the case where the engine does want to restart a container, that race condition exists today, because the Supervisor always wants to start containers, and with a `restart: always` policy the engine will too. The key difference is that we shouldn't start a container if the policy is configured not to want that. If the container was deleted, the Supervisor wouldn't be trying to decide whether to start it, since it would be trying to create it. I'd hope exit codes persist after a reboot so that the Supervisor and engine continue to respect the restart policy specified.
@pwt we actually had an internal discussion about this recently and talked about introducing another directive for docker-compose called `start_on`. I have to check our meeting recording to remember the rationale. It seems like more work to remember another balena-specific config, but I recall it provided a bunch of new functionality... I'll comment with specifics next week. Link to the internal meeting: https://jel.ly.fish/3ca18f95-d72a-4de8-834d-61a8e1467308
[20k-ultra] This issue has attached support thread https://jel.ly.fish/def651d2-3b13-4240-a6b8-80c04a3141aa |
After discussing this feature, we should be able to achieve the desired result (not starting a container in a way that violates its specified restart policy) by using a `start_on`-like directive in docker-compose. This could take values such as:
A spec will be required to cover this, as it can have large implications, with a lot of benefits if done correctly, compared to simply respecting the restart policy of a container. We have to remember that Docker doesn't know what an application or a release is, so it's not fair to copy exactly what Docker does and say that is what balena devices should do.
The supervisor currently respects the restart policy of the container while it's running, because it uses information in memory to tell whether a container was already started when a new target state comes in. Once the supervisor is restarted, that information is lost, so the supervisor takes the safe path and restarts the container. If the supervisor can identify from the container state on the engine when the target state was already applied, this would solve the issue. If we do it properly, we'll see the correct behavior: "follow the restart policy except when the service is being updated".
@20k-ultra thanks for directing me here from my query on the forums regarding this issue. I had thought it was just a bug; I didn't realise there was ongoing conversation around expanding the feature.

Based on my use cases, and any use cases I can think of for others, I can't see any reason to expand beyond the standard Docker restart policies. Things like "creation: this is for one time containers which will only run once" don't seem to bring much value. You will still have the image taking up space, and an exited container isn't using any resources even if you don't remove it. The same effect could be achieved by setting the restart policy to never restart, then having the process running in the container exit, or calling 'stop' via the API.

If the supervisor currently respects the restart policy of the container while the Supervisor is running (although it appears not to when you push a change to a device: https://forums.balena.io/t/stopped-containers-keep-restarting/329650), it seemed to me that simply extending that behaviour to cover reboots, supervisor updates, container updates, etc. would suffice.

When I connect to one of my devices with a production image, linked to the Cloud, run a container (`balena run -it alpine sh`), exit the container, and restart the device, the exited container still has the same 'State' info stored for that container (seen via `balena inspect container-uid`).
It seemed to me that having the Supervisor read from this stored container info, rather than temporary memory, would do the trick? I'm not sure "follow the restart policy except when the service is being updated" would make as much sense as just "always follow the restart policy". I can't think of a scenario where I would want a stopped container to restart after an update. I would want it to update, and then run when I tell it to. In this sense, it is the same as doing a `balena pull`: I want to update the image, but do what I tell it to do (or have already told it to do) through the restart policy and earlier instruction. Or in short, make sure the Supervisor follows the same behaviour as balena Engine.
@Maggie0002 wrote:
Strong agree.
Can anyone think of a temporary workaround for this issue? It just so happens that this is a security vulnerability in my app if a container that has been instructed to stop keeps starting again without instruction from the user.
None of these seem like particularly good options, with 1 and 2 being especially ugly. Any ideas welcome.
I'm currently using (1). I agree that it's suboptimal. |
Another option might be to add something to your container entrypoint like:

```shell
if should_not_be_running; then
    exit
fi
```

Would that work in your case?
And then have a global env variable set by one container and read by the one I want to stop? Not sure if you can share an env var between containers. I guess it would involve mounting a shared .env file in a volume? I may try it as a .pid file in a shared volume, and then check for the existence of the pid file on launch of the container I want stopped. That would be a little better than sending a stop command to the API on boot too. Always helpful to bounce ideas around; I'm glad I asked. Thanks.
For reference, here is the entrypoint line I ended up using in my docker-compose file:
Then through my app (or script, or however else you would like to do it), I write the PID file to the shared volume. At the moment the container runs when the PID file exists. Someone may want to reverse the behaviour, depending on what they want their default behaviour to be on boot. Thanks @pipex for pointing me in the right direction.
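A minimal sketch of this PID-file workaround (the flag path, variable names, and service command are assumptions, not the exact entrypoint used above): a sibling container or the app itself creates or removes a flag file in a shared volume, and the entrypoint only launches the service when the flag exists, otherwise exiting cleanly so the engine's restart policy leaves it stopped.

```shell
#!/bin/sh
# Hypothetical entrypoint sketch for the shared-volume flag approach.

FLAG="${FLAG:-/shared/should-run.pid}"

# should_start: succeed (exit 0) only when the flag file exists.
should_start() {
    [ -f "$1" ]
}

if should_start "$FLAG"; then
    echo "flag present: starting service"
    # exec my-service   # the real long-running process would go here
else
    echo "flag absent: exiting cleanly so the engine leaves us stopped"
fi
```

The key design point is that the decision lives inside the container, so it survives Supervisor restarts: no matter how many times the container is started, it immediately exits again unless the flag says otherwise.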
hey @Maggie0002 @pwt this came up again after I mentioned to some people that it's not obvious to know what restart policy a container is using in the dashboard, and I pointed out how the Supervisor can violate a policy if it gets restarted (the issue being discussed here). Last year I said we discussed a `start_on` directive.

The Supervisor should not do any kind of "restart" management once a container has been started. This is because that mechanism already exists in the engine, as @pipex said. Therefore, I plan to PR the Supervisor to ensure it does not start a container again after it has been started once. I plan to achieve this by checking whether the container's startedAt timestamp is GTE its createdAt. I have to check if the timestamp is GTE because the engine sets a bogus timestamp value, for some reason, to 0001-01-01; it does the same for finishedAt.

Does this proposed patch solve the issue better for you @Maggie0002 @pwt? From what I've gathered rereading this thread, you two were on the side of this issue being a bug that just needed to be fixed rather than a feature introduced to prevent it, which I am more on board with now.
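The timestamp check above can be sketched as follows (the function name is an assumption): since the engine reports `State.StartedAt` as the zero value `0001-01-01T00:00:00Z` for a container that was created but never started, any other timestamp means the container has been started at least once.

```shell
#!/bin/sh
# Sketch: decide from State.StartedAt whether a container ever started.

ever_started() {
    case "$1" in
        0001-01-01*) return 1 ;;  # engine's zero value: never started
        *)           return 0 ;;
    esac
}

# Illustrative usage against a live engine:
#   ts=$(balena inspect --format '{{.State.StartedAt}}' my-container)
#   ever_started "$ts" && echo "already started at least once"
```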
Hi @20k-ultra. Thanks for the response. Yes, I think this is the correct (and expected) behaviour. |
Sounds right to me. Docker engine (i.e. balena Engine) is a service manager; it seems to make sense to let it do its job managing services as much as possible, rather than have the Supervisor interfere, especially as that produces more predictable outcomes from the restart policies we add to the docker-compose files. I'm not sure how necessary it is to compare createdAt vs startedAt though: wouldn't the test always return true, unless for some reason there was a major malfunction during the create process that just happened to fall in that tiny gap between creating and starting? At which point, some other form of error handling would likely take over? Maybe that's not the case, I have no idea of the workflow, it just sprang to mind reading the message.
I shared the proposed mechanism for exactly this kind of conversation, so thank you for the input. The Supervisor creates the container and then starts it in separate steps, so it's possible to have a created container that was never started. Having a valid startedAt seems to be the most obvious way to tell if a container has been started. Does the choice of condition seem ok now?
Ah I see, then I guess better safe than sorry. Seems like a fairly harmless redundancy (famous last words). You would need to be confident in the time sync mechanism in place though: for offline devices without access to NTP, if there was a failure in whatever workaround solution is applied to the devices for time, it could appear as though a container started in 1970. Perhaps unlikely, a bit of an edge case if something else didn't go quite as planned. Same caveats again, not as familiar with balena Engine etc., but I think if it is a two stage process of
Edit: Quick glance looks like it does only return |
Sharing the state of a container that is just created (never started) for discussions
and here is a running container that has never exited:
and that container stopped:
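For reference, the relevant `State` fields for the three cases above look roughly like this in `balena inspect` output (fields abbreviated, timestamps illustrative rather than the original values):

```json
{
  "created": { "Status": "created", "Running": false, "ExitCode": 0,
               "StartedAt": "0001-01-01T00:00:00Z",
               "FinishedAt": "0001-01-01T00:00:00Z" },
  "running": { "Status": "running", "Running": true, "ExitCode": 0,
               "StartedAt": "2022-03-01T10:00:00Z",
               "FinishedAt": "0001-01-01T00:00:00Z" },
  "exited":  { "Status": "exited", "Running": false, "ExitCode": 0,
               "StartedAt": "2022-03-01T10:00:00Z",
               "FinishedAt": "2022-03-01T10:05:00Z" }
}
```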
@Maggie0002 I believe the time sync issue is a significant enough edge case that I will try to think of another way to tell if a container has started. The status is a good one, which we'll probably go with.
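A status-based check could be sketched like this (the function name is an assumption): the engine's `State.Status` is one of `created`, `running`, `paused`, `restarting`, `removing`, `exited`, or `dead`, and only a container that was never started reports `created`, so the check avoids relying on timestamps entirely.

```shell
#!/bin/sh
# Sketch: decide from State.Status whether a container ever started.

has_been_started() {
    [ "$1" != "created" ]
}

# Illustrative usage against a live engine:
#   status=$(balena inspect --format '{{.State.Status}}' my-container)
#   has_been_started "$status" && echo "started at least once"
```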
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
hey @Maggie0002 @pwt just merged #1926 which fixes this issue. It's available in Supervisor v13.1.3 |
If a container is stopped either from the dashboard (which uses the Supervisor API) or on-device via the CLI, the Supervisor should not start the stopped container just because the target state wants it running. The target state actually never changes the running state to stopped (there is no way to make it stopped), which makes respecting `unless-stopped` impossible with the current Supervisor behaviour.
The Supervisor behaves this way because it remembers which containers it has started, so on subsequent target state polls, even though the target state wants the stopped container running, the Supervisor won't try to start it because it has already started it. This mechanism fails when the Supervisor is restarted, losing its memory of the start.
This mechanism could be improved by evaluating the container it wants to start, noting the restart policy and exit code. For example, if the restart policy is `unless-stopped` and the exit code is 0, then we know the container was stopped by a user and we shouldn't start it again. Additionally, if the restart policy is `no`, the exit code doesn't matter, and so on for each policy.
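The evaluation described above can be sketched as a small decision function (the function name and the exit-code heuristic mirror this issue's description; they are a sketch, not a finished design):

```shell
#!/bin/sh
# should_start_again: given a stopped container's restart policy and
# exit code, decide whether the Supervisor should start it again.
# Per the heuristic above: "no" never restarts; "always" always does;
# "unless-stopped" and "on-failure" treat a zero exit code as a
# deliberate stop and only restart on a non-zero exit code.

should_start_again() {
    policy="$1"
    exit_code="$2"
    case "$policy" in
        no)             return 1 ;;
        always)         return 0 ;;
        unless-stopped) [ "$exit_code" -ne 0 ] ;;
        on-failure)     [ "$exit_code" -ne 0 ] ;;
        *)              return 1 ;;  # unknown policy: don't start
    esac
}
```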
tl;dr how Supervisor will evaluate stopped containers:
https://docs.docker.com/config/containers/start-containers-automatically/#use-a-restart-policy
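For reference, the standard policies from that page map onto docker-compose service definitions like this (service and image names are illustrative):

```yaml
services:
  api:
    image: my-api              # illustrative
    restart: unless-stopped    # restart unless manually stopped
  one-shot:
    image: my-migrations       # illustrative
    restart: "no"              # quoted so YAML doesn't parse it as false
  worker:
    image: my-worker           # illustrative
    restart: on-failure        # restart only on a non-zero exit code
```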