Add restart handling #5

windsource · 2023-09-05T08:44:33Z

Description

The state contains a boolean field restart which is currently ignored. The restart handling should be implemented.

Goals

Define if a simple boolean flag is sufficient or if we need a something like always, neverand on failure
When restart is set the the agent shall restart a workload when it was stopped for any reason like failure or system restart

Tasks

Create and adapt swdds
Write stests
Write user documentation for restart handling
Introduce a new channel to receive workload states of the GenericPollingStateChecker inside the WorkloadControlLoop
Check in the WorkloadControlLoop if a restart is required (RestartPolicy + received workload state must match)
Perform the restart operation by executing an update with the same workload configuration

Final Result

Summary

The restart handling is implemented according to the newly introduced restart policies for a workload:

NEVER // The workload is never restarted. Once the workload exits, it remains in the exited state.
ON_FAILURE // If the workload exits with a non-zero exit code, it will be restarted.
ALWAYS // The workload is restarted upon termination, regardless of the exit code.

The WorkloadControlLoop handles the restarts of a workload when the ExecutionState of the workload it manages fulfills the workload's configured restart policy. The restart is represented by an update operation with the same workload spec.
When restarting a workload and there is no full device restart, the inter-workload dependencies are not considered.

The text was updated successfully, but these errors were encountered:

krucod3 · 2023-09-07T12:22:08Z

Decisions after discussion:

the restart bool flag will be removed and supplemented with a restart policy enum with the fields mentioned above
default value for the restart flag is never
the difference between always and failure is the exit code of the workload
all workloads will be started again when the agent is started (unless they are already running), because a newly started agent is treated as a restart of the node. Already running workloads will not be restarted in this case. This is independent from the restart policy of the workload.

krucod3 · 2024-01-25T08:46:34Z

After discussions and because of the bug ticket #158: most important here is that the restart of Ankaios starts the workloads again .

inf17101 · 2024-03-22T07:58:51Z

I have done some research again, just to check how the restart flags vary between all the known software orchestrators.

Kubernetes supports Never, Always, OnFailure (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/). K3s and docker-compose supports the same values. But Docker (https://docs.docker.com/config/containers/start-containers-automatically/) supports one more interesting field called unless-stopped. The unless-stopped restart policy restarts a workload in the same way like always does besides the user has actively stopped the workload. For transferring into automotive use cases maybe someone wants to configure that a workload shall be restarted when it is failed or what else, but if an integrator stops the workload the orchestrator shall not restart it (because it is unnecessary since the user's decision was actively to see the workload not running anymore).

What's also interesting is that the Docker allows the user to limit the number of restarts inside the on-failure(max_retries) option.

krucod3 · 2024-03-22T11:47:37Z

I was also thinking about a max retries value for the on-failure case. We need to think if we would go with a value for the enum or a separate attribute. As for the unless-stopped, for Ankaios this would be the normal case as we either delete or assign to the "" agent which cannot restart anything.

krucod3 · 2024-03-22T11:53:17Z

As we would like to release the v0.3 soon, the implementation for this feature will need to be shifted to the next release. I'll for now move this to v0.4, but we need to see if we will make an intermediate release only with this change, or change the way we release features (e.g., with fine granular releases).
For now we can at least define the interface (the changes to the manifest) and release it with v0.3 to reduce the number of releases with breaking changes. This will be done with #212

inf17101 · 2024-03-26T10:10:20Z

Proposal for an Ankaios manifest containing the restart policy enum flags:

apiVersion: v0.1
workloads:
  restarted_always:
    runtime: podman
    agent: agent_A
    restart: ALWAYS
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sh" ]
      commandArgs: [ "-c", "echo 'Always restarted.'; sleep 10"]
  restarted_never:
    runtime: podman
    agent: agent_A
    restart: NEVER
    runtimeConfig: |
      image: alpine:latest
      commandArgs: [ "echo", "Explicitly never restarted."]
  default_restarted_never: # default restart value = NEVER
    runtime: podman
    agent: agent_A
    runtimeConfig: |
      image: alpine:latest
      commandArgs: [ "echo", "Implicitly never restarted."]
  restarted_on_failure:
    runtime: podman
    agent: agent_A
    restart: ON_FAILURE
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sh" ]
      commandArgs: [ "-c", "echo 'Restarted on failure.'; sleep 7; exit 1"]

The default value is NEVER and is used when the user does not specify the restart flag.

Three different restart policies:

NEVER: Ankaios never restarts the workload. Once the workload exits, it remains in the exited state. If the Ankaios agent is restarted, it resumes the workload. If the runtimeConfig has changed, Ankaios replaces and starts the workload with the workload config inside the desired state.
ALWAYS: Ankaios restarts the workload if it has exited, regardless of the exit code. This ensures continuous operation of the container. If the Ankaios agent is restarted, the workload will also be restarted if it is not running. If the runtimeConfig has changed, Ankaios replaces and starts the workload with the workload config inside the desired state.
ON_FAILURE: Ankaios restarts the workload if it exits with a non-zero exit code. If the Ankaios agent is restarted and detects that the workload has failed, it will be restarted. If the runtimeConfig has changed, Ankaios replaces and starts the workload with the workload config inside the desired state.

Replacing the workloads upon detecting a different runtime configuration aligns with Ankaios component's restart feature. This is because the Ankaios cluster is self-healing and ensures the desired state.

The enum values are written in SCREAMING_SNAKE_CASE to maintain consistency with the dependency enum values inside the Ankaios manifest.

windsource · 2024-04-15T07:56:08Z

What does it mean for NEVER

If the Ankaios agent is restarted, it resumes the workload.

Does 'resume' mean, that the workload is kept in its current state?

krucod3 · 2024-04-15T09:01:58Z

@windsource, exactly. The internal 'resume' handling just leaves the workload in the state it currently is. If the workload was running (and shall be running), it would be left in running.

inf17101 · 2024-04-15T11:06:20Z

Swdd design ideas:

In general, a restart can be represented in Ankaios as an normal update with the same configuration of the workload (update = delete + create). For an update the WorkloadSpec is necessary. The WorkloadControlLoop stores the current WorkloadSpec and does anyway workload related tasks (delete, update, create, retry). Storing the WorkloadSpec in the WorkloadObject itself to retrieve it for the restart, would lead to inconsistencies in such cases. This is why the ControlLoop maintains the WorkloadSpec which was implemented in the past accordingly. The WorkloadControlLoop can handle the restart itself, meaning the Workload itself handles its own restart behavior according to the configured restart policy.

There a few implementation possibilities how the WorkloadControlLoop can be triggered:

Currently, the AgentManager receives all workload states (from other agents + from the workloads it manages itself). When the AgentManager receives a new workload state of the workloads it manages, the AgentManager triggers the RuntimeManager which has the map of workloads. The RuntimeManager or a separate Restart component called by the RuntimeManager decides when to restart the workload according to the restart policy and the current workload state or the WorkloadControlLoop decides it. The extra component would send an "restart" (which is an update) to the ControlLoop or if the ControlLoop itself handles the restart decision it just triggers the update.
The WorkloadControlLoop shall receive the workload states itself and forwards it to the AgentManager which then forwards the workload state to the Ankaios server at the end. The WorkloadControlLoop handles the independently of other components the restart decisions and updates the workload if it must be restarted.

The first approach which does function calls top down from the AgentManager, over the RuntimeManager to the WorkloadObject, which sends then an Update command to the WorkloadControlLoop, would do unnecessary method calls if the workload must not be restarted (in the end in such cases, the WorkloadControlLoop ignores the call). This is a little bit confusing in my opinion, because the call is triggered high level and flows over a lot of components to be ignored at the low level implementation (and in addition is abstracted away at the end via a message channel from the WorkloadObject to the WorkloadControlLoop).

The second approach is better, because a minimum of component knows about the restart handling, which is easier for maintaining. Only the WorkloadControlLoop knows the restart logic and does the restarts independently when its necessary. The disadvantages are the overhead for extra communicaton of workload states through another channel which needs to be introduced and a potential more "busy control loop" if the frequency of the state checker reported states is increased.

For now, I am thinking for another option and I am still analyzing scenarios and the code.

inf17101 · 2024-04-24T07:34:17Z

@krucod3: I have introduced a fix for the restart handling in a special scenario. When the agent receives an update of the workload and the state checker reports the execution state Failed(ExecFailed) for the old workload during the update code path is traversed, then the restart is wrongly executed after the update, because the workload state was inside the workload state channel. This is might happen also for other restart policy combinations like RestartPolicy::Always. Receiving all workload state is desired and fine, because every workload state must be forwarded to the AgentManager for inter-workload dependency handling and forwarding to the Ankaios server.

I have introduced a fix, that the workload control loop compares the WorkloadInstanceName of the received workload state with the instance name that is currently valid inside the workload control loop. In case of different instance names, the restart is skipped otherwise it is executed. The received workload state is forwarded in both cases to the AgentManager.

Now the restart is skipped when the instance names are different:

In addition, this integration test scenario cannot be committed, because the test is unstable due to the fact that this scenario only happens in rare circumstances depending on the timing of tokio::select method polling workload states and workload commands inside the workload control loop.
But to have a regressoin test and a tested condition, I will introduce an unit test to test the condition for skipping restarts isolated.

inf17101 · 2024-04-24T08:02:41Z

@krucod3: The restart handling does not consider the inter-workload dependencies, currently. If someone needs this feature he or she can develop a workload using the control interface to handle this case. If the feature extension is needed in the future, we can still develop it. But for now we keep it simple.

krucod3 · 2024-04-24T08:16:30Z

@inf17101: yes, let's ignore the dependencies for the restart. We also shortly discussed this with @windsource and he also shares our opinion here.
To document the background better, the idea of the dependencies is to influence the startup behavior, If a failed workload needs to be started it was running already, normally there is no need to take care of dependencies again.

inf17101 · 2024-04-29T08:11:45Z

@krucod3: Like discussed, to bring the full-device restart behavior in line with the inter-workload dependencies and the restart handling, the following options are available, @windsource :

Option 1: The resume handling will be changed, so that only "running" workloads will be resumed, exited workloads will be replaced (delete + create). The create operation will be done, when the inter-workload dependencies of that workload are ready.

Option 2: Enhancement of option 1. A replace will be only done, when the restart policy is not NEVER (meaning enabled) and the ExecutionState and RestartPolicy are matching.

For all options, if a workload exits and there is no device restart, then the workload is restarted without considering the inter-workload dependencies. Since there is no full device restart and its dependencies may be in the running state, there is no need to consider the dependencies as the workload was already running before.

In addition: Regardless of option 1 or 2, for both the behavior is triggered, when an agent is restarted, because only on agent restart the resume code is executed currently. So we would have one side effect, that a workload is replaced as well, when no full device restart happens and only the agent is restarted. Maybe we can adapt this behavior, too. But for now it is like that.

windsource · 2024-04-29T11:43:07Z

@inf17101 As discussed option 1 shall be implemented.

inf17101 · 2024-05-07T07:46:26Z

The changes for the general restart handling is merged into main. The changes for resuming workloads with considering dependencies after device restarts are done in a separate issue #261 (PR #265 ).

windsource added the enhancement New feature or request. Issue will appear in the change log "Features" label Sep 5, 2023

windsource added this to the v0.2 milestone Sep 5, 2023

lingnoi self-assigned this Sep 7, 2023

krucod3 modified the milestones: v0.2, v0.3 Oct 19, 2023

krucod3 unassigned lingnoi Nov 29, 2023

krucod3 modified the milestones: v0.3, backlog, v0.4 Nov 29, 2023

windsource mentioned this issue Jan 18, 2024

reusable workloads not resumed properly #158

Closed

krucod3 modified the milestones: v0.4, v0.3 Jan 25, 2024

krucod3 self-assigned this Mar 7, 2024

krucod3 mentioned this issue Mar 22, 2024

Change the restart flag to a restart policy enum #212

Closed

3 tasks

krucod3 modified the milestones: v0.3, v0.4 Mar 22, 2024

inf17101 mentioned this issue Mar 26, 2024

Change restart flag to an restart policy enum in interfaces #217

Merged

1 task

krucod3 removed their assignment Mar 28, 2024

krucod3 modified the milestones: v0.4, v0.3.1 Apr 15, 2024

This was referenced Apr 16, 2024

Cannot update workloads if runtime config is unchanged #222

Closed

Add restart policy handling for workloads #249

Merged

windsource mentioned this issue Apr 29, 2024

Re-use existing container bundles on device re-start #260

Open

3 tasks

inf17101 self-assigned this Apr 30, 2024

inf17101 closed this as completed May 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add restart handling #5

Add restart handling #5

windsource commented Sep 5, 2023 •

edited by inf17101

Loading

krucod3 commented Sep 7, 2023

krucod3 commented Jan 25, 2024

inf17101 commented Mar 22, 2024 •

edited

Loading

krucod3 commented Mar 22, 2024

krucod3 commented Mar 22, 2024

inf17101 commented Mar 26, 2024 •

edited

Loading

windsource commented Apr 15, 2024 •

edited

Loading

krucod3 commented Apr 15, 2024

inf17101 commented Apr 15, 2024 •

edited

Loading

inf17101 commented Apr 24, 2024 •

edited

Loading

inf17101 commented Apr 24, 2024

krucod3 commented Apr 24, 2024

inf17101 commented Apr 29, 2024

windsource commented Apr 29, 2024

inf17101 commented May 7, 2024

Add restart handling #5

Add restart handling #5

Comments

windsource commented Sep 5, 2023 • edited by inf17101 Loading

Description

Goals

Tasks

Final Result

Summary

krucod3 commented Sep 7, 2023

krucod3 commented Jan 25, 2024

inf17101 commented Mar 22, 2024 • edited Loading

krucod3 commented Mar 22, 2024

krucod3 commented Mar 22, 2024

inf17101 commented Mar 26, 2024 • edited Loading

windsource commented Apr 15, 2024 • edited Loading

krucod3 commented Apr 15, 2024

inf17101 commented Apr 15, 2024 • edited Loading

inf17101 commented Apr 24, 2024 • edited Loading

inf17101 commented Apr 24, 2024

krucod3 commented Apr 24, 2024

inf17101 commented Apr 29, 2024

windsource commented Apr 29, 2024

inf17101 commented May 7, 2024

windsource commented Sep 5, 2023 •

edited by inf17101

Loading

inf17101 commented Mar 22, 2024 •

edited

Loading

inf17101 commented Mar 26, 2024 •

edited

Loading

windsource commented Apr 15, 2024 •

edited

Loading

inf17101 commented Apr 15, 2024 •

edited

Loading

inf17101 commented Apr 24, 2024 •

edited

Loading