
Evaluate container's restart policy and exit status before starting #1668

Closed
20k-ultra opened this issue Apr 21, 2021 · 21 comments
Labels
Needs discussion: The fix for this needs to be discussed.

Comments

@20k-ultra
Contributor

20k-ultra commented Apr 21, 2021

If a container is stopped either from the dashboard (which uses the Supervisor API) or on the device via the CLI, the Supervisor should not start it again just because the target state wants it running. The target state never actually changes the running state to stopped (there is no way to make it stopped), which makes respecting unless-stopped impossible with the current Supervisor behaviour.

The Supervisor behaves this way because it remembers which containers it has started, so on subsequent target state polls, even though the target state wants the stopped container running, the Supervisor won't try to start it because it has already started it once. This mechanism fails when the Supervisor is restarted and thus loses its memory of the start.

This mechanism could be improved by evaluating the container it wants to start against its restart policy and exit code. For example, if the restart policy was unless-stopped and the exit code is 0, then we know the container was stopped by a user and we shouldn't start it again. Additionally, if the restart policy is no, the exit code doesn't matter, and so on for each policy.

tl;dr how the Supervisor would evaluate stopped containers (a rough code sketch follows the Docker docs link below):

  • unless-stopped: if the exit code is 0 (the container was stopped deliberately) then do not start; otherwise start
  • always: exit code is irrelevant, always start
  • on-failure: if the exit code is non-zero then start
  • no: exit code is irrelevant, never start

https://docs.docker.com/config/containers/start-containers-automatically/#use-a-restart-policy
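
For illustration, a minimal TypeScript sketch of that evaluation (a hypothetical helper, not actual Supervisor code; the policy and exit code would come from the engine's container inspect data):

type RestartPolicy = 'no' | 'always' | 'on-failure' | 'unless-stopped';

// Hypothetical sketch only: decide whether a stopped container may be
// started again, given its restart policy and last exit code.
function shouldStartStoppedContainer(policy: RestartPolicy, exitCode: number): boolean {
  switch (policy) {
    case 'always':
      return true; // exit code is irrelevant, always start
    case 'no':
      return false; // exit code is irrelevant, never start
    case 'on-failure':
      return exitCode !== 0; // only restart containers that exited with a failure
    case 'unless-stopped':
      return exitCode !== 0; // a clean exit (0) suggests a deliberate stop, so leave it stopped
  }
}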

@20k-ultra changed the title from "Do not restart a container with unless-stopped policy" to "Evaluate container's restart policy and exit status before starting" on Apr 21, 2021
@pipex
Contributor

pipex commented May 7, 2021

I think this is a bit more tricky than that, @20k-ultra: what you are describing is already implemented by the engine, so we could effectively be racing against the engine.

What happens if the container was deleted? The exit code also persists after a reboot, so the supervisor should never start a stopped container? I agree there is a product discussion we want to have, because there is a pattern here, but we might have to think about the implications of this for other users.

@pwt

pwt commented May 8, 2021

What happens if the container was deleted?

Under what conditions is a container deleted?

The exit code also persists after a reboot, so the supervisor should never start a stopped container?

Correct. It should comply with the restart semantics specified in the compose file, and should match how Docker would behave.

@cywang117 added the "Needs discussion" label on May 10, 2021
@20k-ultra
Contributor Author

@pipex I think you're misunderstanding. The Supervisor will ignore restart policies and start containers that the engine isn't trying to start, so there's no race condition. In the case where the engine does want to restart a container, that race condition already exists today, because the Supervisor always wants to start containers and, with a restart: always policy, the engine will too. The key difference is that we shouldn't start the container if the policy is configured not to want that.

If the container was deleted then the Supervisor wouldn't be trying to decide whether it should start it, since it would be trying to create it. I'd hope exit codes persist after a reboot so that the Supervisor and engine continue to respect the specified restart policy.

@20k-ultra
Contributor Author

20k-ultra commented May 19, 2021

@pwt we actually had an internal discussion about this recently and talked about introducing another directive for docker-compose called start_on. I have to check our meeting recording to remember the rationale, but it seems like more work to remember another balena-specific config; however, I recall it provided a bunch of new functionality... I'll comment with specifics next week.

link for internal meeting https://jel.ly.fish/3ca18f95-d72a-4de8-834d-61a8e1467308

@jellyfish-bot

[20k-ultra] This issue has attached support thread https://jel.ly.fish/def651d2-3b13-4240-a6b8-80c04a3141aa

@20k-ultra
Contributor Author

After discussing this feature, we should be able to achieve the desired result (do not start a container in a way that violates the specified restart policy) by using a start_on-like directive in docker-compose. This could take values such as:

  • never: on-demand services which will be started by another application
  • creation: for one-time containers which will only run once when installed and never again (perhaps like configuring network interfaces)
  • always: this will always start a service if it does not match the target state value
  • etc.

A spec will be required to cover this, as it can have large implications with a lot of benefits if done correctly, compared to simply respecting the restart policy of a container. We have to remember that Docker doesn't know what an application or a release is, so it's not fair to copy exactly what Docker does and say that is what balena devices should do.

@pipex
Contributor

pipex commented Jul 5, 2021

I think the start_on feature still requires more research and discussion. I'm not entirely convinced there is a real use case for it.

The supervisor currently respects the restart policy of the container while it's running, because it uses information in memory to tell if a container was already started when a new target state comes in. Once the supervisor is restarted, that information is lost, so the supervisor takes the safe path and restarts the container. If the supervisor can identify from the container state on the engine when the target state was already applied, this would solve the issue. If we do it properly, we'll see the correct behavior: "follow the restart policy except when the service is being updated".

@maggie44

maggie44 commented Jul 6, 2021

@20k-ultra thanks for directing me here from my query on the forums regarding this issue.

I had thought it was just a bug, I didn't realise there was ongoing conversation around expanding the feature. Based on my use cases and any use cases I can think of for others, I can't think of any reason to expand beyond the standard docker based restart policies. Things like "creation: this is for one time containers which will only run once" don't seem to bring much value. You will still have the image taking up space, and an exited container isn't using any resources even if you don't remove it. The same effect could be achieved by having a restart policy of no, then having the process running in the container exit, or calling 'stop' via the API.

If the supervisor currently respects the restart policy of the container while the Supervisor is running (although it appears not to when you push a change to a device: https://forums.balena.io/t/stopped-containers-keep-restarting/329650), it seemed to me that simply extending that behaviour to cover restarts, supervisor updates, container updates, etc. would suffice.

When I connect to one of my devices with a production image, linked to the Cloud, run a container (balena run -it alpine sh), exit the container, and restart the device, the exited container still has the same 'State' info stored for it (seen via balena inspect container-uid):

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-07-06T00:24:51.454128206Z",
            "FinishedAt": "2021-07-06T00:24:52.564791101Z"
        },
        "HostConfig": {
        ....
            "RestartPolicy": {
                "Name": "no",
        ....
            },

It seemed to me that having the Supervisor read from this stored container info, rather than from temporary memory, would do the trick?
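
For illustration only, a rough sketch (using dockerode, which the Supervisor uses to talk to the engine; the helper name and the decision logic here are made up) of reading that persisted info instead of relying on in-memory bookkeeping:

import Docker from 'dockerode';

// Rough sketch: read the persisted State and RestartPolicy from the engine
// rather than relying on what the Supervisor remembers in memory.
const docker = new Docker({ socketPath: '/var/run/balena-engine.sock' });

async function wasStoppedIntentionally(containerId: string): Promise<boolean> {
  const info = await docker.getContainer(containerId).inspect();
  const { Status, ExitCode } = info.State;
  const policy = info.HostConfig.RestartPolicy?.Name; // e.g. 'no', 'always', 'unless-stopped'

  // A container that exited cleanly under a 'no' or 'unless-stopped' policy was
  // most likely stopped on purpose, so the Supervisor should leave it alone.
  return (
    Status === 'exited' &&
    ExitCode === 0 &&
    (policy === 'no' || policy === 'unless-stopped')
  );
}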

I'm not sure if "follow the restart policy except when the service is being updated" would make as much sense as just "always follow the restart policy". I can't think of a scenario where I would want a stopped container to restart after an update. I would want it to update, and then run when I tell it to. In this sense, it is the same as doing a 'balena pull': I want to update the image, but have it do what I tell it to do, or have already told it to do, through the restart policy and earlier instruction.

Or in short, make sure Supervisor follows the same behaviour as Balena Engine.

@pwt

pwt commented Jul 6, 2021

@Maggie0002 wrote:

@20k-ultra thanks for directing me here from my query on the forums regarding this issue.

I had thought it was just a bug, I didn't realise there was ongoing conversation around expanding the feature. Based on my use cases and any use cases I can think of for others, I can't think of any reason to expand beyond the standard docker based restart policies.

Strong agree.

@maggie44

maggie44 commented Jul 29, 2021

Can anyone think of a temporary workaround for this issue? It just so happens that it is a security vulnerability in my app if a container that has been instructed to stop keeps starting again without instruction from the user.

  1. Polling the Supervisor API at regular intervals to see if the container is running and then stopping it again (see the sketch at the end of this comment; a restart isn't much of an issue as it only requires one check, but it is a little ugly for monitoring updates that may come at unexpected intervals)?
  2. Starting the container on Balena Engine as a container outside of Supervisor control, not part of the deployment (this would mean putting the built Docker image in the repository as a file and writing an install script, as I am preloading to allow immediate use offline and won't have internet access to pull from an online repo)?
  3. Is there a way to monitor the Supervisor for when one of the actions that causes the issue occurs (namely updates)?

None of these seem like particularly good options; 1 and 2 are especially ugly. Any ideas welcome.
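
For reference, option 1 might look roughly like the sketch below. This is only a sketch: it assumes the io.balena.features.supervisor-api label (which provides BALENA_SUPERVISOR_ADDRESS and BALENA_SUPERVISOR_API_KEY), Node 18+ for the global fetch, and the v2 state/stop-service endpoints, so the exact paths and response shape should be double-checked against the Supervisor API docs.

// Sketch of workaround option 1: poll the Supervisor API and re-stop the service.
// Endpoint paths and response shape are assumptions; check the Supervisor API docs.
const address = process.env.BALENA_SUPERVISOR_ADDRESS;
const apiKey = process.env.BALENA_SUPERVISOR_API_KEY;
const appId = process.env.BALENA_APP_ID;
const serviceName = 'my-service'; // hypothetical name of the service that must stay stopped

async function stopIfRunning(): Promise<void> {
  const res = await fetch(`${address}/v2/applications/state?apikey=${apiKey}`);
  const state = await res.json();
  // The state is keyed by app name; each app lists its services and their status.
  const app = state[Object.keys(state)[0]];
  if (app?.services?.[serviceName]?.status === 'Running') {
    await fetch(`${address}/v2/applications/${appId}/stop-service?apikey=${apiKey}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ serviceName }),
    });
  }
}

setInterval(stopIfRunning, 30_000); // poll every 30 seconds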

@pwt

pwt commented Jul 29, 2021

I'm currently using (1). I agree that it's suboptimal.

@pipex
Contributor

pipex commented Jul 29, 2021

Can anyone think of a temporary workaround for this issue?

Another option might be to add something to your container entrypoint like

if should_not_be_running; then
    exit;
fi

would that work in your case?

@maggie44

maggie44 commented Jul 29, 2021

Can anyone think of a temporary workaround for this issue?

Another option might be to add something to your container entrypoint like

if should_not_be_running; then
    exit;
fi

would that work in your case?

And then have a global env variable set by one container and read by the one I want to stop? Not sure if you can share an env var between containers. Guess it would involve mounting a shared .env file in a volume?

I may try it as a .pid file in a shared volume and then check for the existence of the pid file on launch of the container I want stopped. That would be a little better than sending a stop command to the API on boot too.

Always helpful to bounce ideas around, I’m glad I asked. Thanks.

@maggie44

maggie44 commented Jul 31, 2021

For reference, here is the entrypoint line I ended up using in my docker-compose file:

entrypoint: /bin/sh -l -c '[[ -f /any/folder/any-name.pid ]] && exec /original-entrypoint/command || exit 0'

Then through my app (or script, or however else you would like to do it), I write the file /any/folder/any-name.pid and have the /any/folder/ folder mounted as a volume in both containers.

The exec before the original command ensures the process still runs as PID 1 and exits cleanly on SIGTERM, etc.

At the moment the container runs when the .pid file exists. Someone may want to reverse the behaviour depending on what they want their default behaviour to be on boot.

Thanks @pipex for putting me in the right direction.

@20k-ultra
Contributor Author

hey @Maggie0002 @pwt this came up again after I mentioned to some people that it's not obvious in the dashboard which restart policy a container is using, and I pointed out how the Supervisor can violate a policy if it gets restarted (the issue being discussed here). Last year I said we had discussed a start_on directive for docker-compose as a feature to solve this... I rewatched our internal talk about it, and although that feature might mitigate the underlying issue, the issue would still exist and would probably surface eventually.

The Supervisor should not do any kind of "restart" management once a container has been started, because that mechanism already exists in the engine, as @pipex said. Therefore, I plan to PR the Supervisor to ensure it does not start a container again after it has been started once. I plan to achieve this by checking whether the startedAt timestamp is equal to or greater than the createdAt timestamp, since this would confirm the container has started.

I have to check if the timestamp is GTE because the engine, for some reason, sets a bogus timestamp value of 0001-01-01 on containers that have never started. It does the same for exitedAt...
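
As a rough illustration of that check (a hypothetical helper, not the actual patch):

// Hypothetical sketch of the startedAt-vs-createdAt (GTE) check described above.
// The engine reports '0001-01-01T00:00:00Z' for containers that have never been
// started, which parses to a date far earlier than any real createdAt value.
function hasEverBeenStarted(startedAt: string, createdAt: string): boolean {
  const started = Date.parse(startedAt); // sub-millisecond precision is truncated
  const created = Date.parse(createdAt);
  return !isNaN(started) && started >= created;
}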

Does this proposed patch solve the issue better for you @Maggie0002 @pwt? From what I've gathered rereading this thread, you two were on the side of this issue being a bug that just needed to be fixed, rather than a feature being introduced to prevent it, which I am now more on board with.

@pwt

pwt commented Apr 5, 2022

Does this proposed patch solve the issue better for you @Maggie0002 @pwt? From what I've gathered rereading this thread, you two were on the side of this issue being a bug that just needed to be fixed, rather than a feature being introduced to prevent it, which I am now more on board with.

Hi @20k-ultra. Thanks for the response. Yes, I think this is the correct (and expected) behaviour.

@maggie44

maggie44 commented Apr 5, 2022

hey @Maggie0002 @pwt this came up again after I mentioned to some people that it's not obvious in the dashboard which restart policy a container is using, and I pointed out how the Supervisor can violate a policy if it gets restarted (the issue being discussed here). Last year I said we had discussed a start_on directive for docker-compose as a feature to solve this... I rewatched our internal talk about it, and although that feature might mitigate the underlying issue, the issue would still exist and would probably surface eventually.

The Supervisor should not do any kind of "restart" management once a container has been started, because that mechanism already exists in the engine, as @pipex said. Therefore, I plan to PR the Supervisor to ensure it does not start a container again after it has been started once. I plan to achieve this by checking whether the startedAt timestamp is equal to or greater than the createdAt timestamp, since this would confirm the container has started.

I have to check if the timestamp is GTE because the engine, for some reason, sets a bogus timestamp value of 0001-01-01 on containers that have never started. It does the same for exitedAt...

Does this proposed patch solve the issue better for you @Maggie0002 @pwt? From what I've gathered rereading this thread, you two were on the side of this issue being a bug that just needed to be fixed, rather than a feature being introduced to prevent it, which I am now more on board with.

Sounds right to me. Docker engine (i.e. Balena engine) is a services manager, so it seems to make sense to let it do its job managing services as much as possible, rather than have the Supervisor interfere, especially as it then produces more predictable outcomes from the restart policies we add to the docker-compose files. Not sure how necessary it is to compare the createdAt vs the startedAt, wouldn't the test always return true unless for some reason there was a major malfunction during the create process that just happened to be in that tiny gap between creating and starting? At which point, some other form of error handling would likely take over? Maybe that's not the case, I have no idea of the workflow; it just kinda sprung to mind reading the message.

@20k-ultra
Contributor Author

Not sure how necessary it is to compare the createdAt vs the startedAt, wouldn't the test always return true unless for some reason there was a major malfunction during the create process that just happened to be in that tiny gap between creating and starting?

I shared the proposed mechanism for this exact conversation, so thank you for the input.

The Supervisor creates the container and then starts it in separate steps, so it's possible to have a created container that was not started. Having a valid startedAt seems to be the most obvious way to tell if a container has been started. Does the choice of condition seem OK now?

@maggie44

maggie44 commented Apr 5, 2022

Not sure how necessary it is to compare the createdAt vs the startedAt, wouldn't the test always return true unless for some reason there was a major malfunction during the create process that just happened to be in that tiny gap between creating and starting?

I shared the proposed mechanism for this exact conversation, so thank you for the input.

The Supervisor creates the container and then starts it in separate steps, so it's possible to have a created container that was not started. Having a valid startedAt seems to be the most obvious way to tell if a container has been started. Does the choice of condition seem OK now?

Ah I see, then I guess better safe than sorry. Seems like a fairly harmless redundancy (famous last words). You would need to be confident in the time sync mechanism in place though: for offline devices without access to NTP, if there was a failure in whatever workaround is applied to the devices for time, a container could appear to have started in 1970. Perhaps unlikely, a bit of an edge case if something else didn't work quite as planned.

Same caveats again, I'm not as familiar with Balena Engine etc., but I think if it is a two-stage process of create -> run, then in Docker at least you could just check the status?

docker inspect -f '{{.State.Status}}' created_container
created
docker inspect -f '{{.State.Status}}' run_container
exited

(The second output could be a number of other values, but I don't think it can be 'created' again after the first start; I would have to experiment or check the code to be sure.)

Edit: A quick glance suggests it does only return created if it hasn't tried to run before, using a test to see whether the start time is set: https://github.com/docker/docker-ce/blob/30893e979c3c6d759616ea63545b3463a7fcd483/components/engine/container/state.go#L93. I guess you could do the same test, or just rely on the created status. Again, I haven't done any testing, so don't bet on me.

@20k-ultra
Contributor Author

20k-ultra commented Apr 7, 2022

Sharing the state of a container that has just been created (never started), for discussion:

    "State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

and here is a running container that has never exited:

        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 27946,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2022-04-07T19:59:20.35667396Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        }

and that container stopped:

       "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2022-04-07T19:59:20.35667396Z",
            "FinishedAt": "2022-04-07T20:57:40.420144413Z"
        },

@Maggie0002 I believe the time sync issue is a significant enough edge case that I will try to think of another way to tell if a container has started. The status is a good one, which we'll probably go with.
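
Something along those lines could look like the sketch below (illustration only; the actual change is in the commits referenced below):

import Docker from 'dockerode';

// Sketch of the status-based check: a container whose status is still 'created'
// has never been started, so the Supervisor is responsible for the first start;
// anything else means the engine's restart policy should decide what happens next.
const docker = new Docker({ socketPath: '/var/run/balena-engine.sock' });

async function hasContainerStarted(containerId: string): Promise<boolean> {
  const { State } = await docker.getContainer(containerId).inspect();
  return State.Status !== 'created';
}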

20k-ultra added a commit that referenced this issue Apr 12, 2022
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
20k-ultra added a commit that referenced this issue Apr 14, 2022
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
20k-ultra added a commit that referenced this issue Apr 14, 2022
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
20k-ultra added a commit that referenced this issue Apr 15, 2022
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
20k-ultra added a commit that referenced this issue Apr 15, 2022
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
20k-ultra added a commit that referenced this issue Apr 20, 2022
This will ensure the restart policy specified is not violated

Change-type: patch
Closes: #1668
Signed-off-by: 20k-ultra <3946250+20k-ultra@users.noreply.github.com>
@20k-ultra
Contributor Author

hey @Maggie0002 @pwt, just merged #1926, which fixes this issue. It's available in Supervisor v13.1.3.
