Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add restart handling #5

Closed
6 tasks done
windsource opened this issue Sep 5, 2023 · 15 comments
Closed
6 tasks done

Add restart handling #5

windsource opened this issue Sep 5, 2023 · 15 comments
Assignees
Labels
enhancement New feature or request. Issue will appear in the change log "Features"
Milestone

Comments

@windsource
Copy link
Contributor

windsource commented Sep 5, 2023

Description

The state contains a boolean field restart which is currently ignored. The restart handling should be implemented.

Goals

  • Define if a simple boolean flag is sufficient or if we need a something like always, neverand on failure
  • When restart is set the the agent shall restart a workload when it was stopped for any reason like failure or system restart

Tasks

  • Create and adapt swdds
  • Write stests
  • Write user documentation for restart handling
  • Introduce a new channel to receive workload states of the GenericPollingStateChecker inside the WorkloadControlLoop
  • Check in the WorkloadControlLoop if a restart is required (RestartPolicy + received workload state must match)
  • Perform the restart operation by executing an update with the same workload configuration

Final Result

Summary

The restart handling is implemented according to the newly introduced restart policies for a workload:

  • NEVER // The workload is never restarted. Once the workload exits, it remains in the exited state.
  • ON_FAILURE // If the workload exits with a non-zero exit code, it will be restarted.
  • ALWAYS // The workload is restarted upon termination, regardless of the exit code.

The WorkloadControlLoop handles the restarts of a workload when the ExecutionState of the workload it manages fulfills the workload's configured restart policy. The restart is represented by an update operation with the same workload spec.
When restarting a workload and there is no full device restart, the inter-workload dependencies are not considered.

@windsource windsource added the enhancement New feature or request. Issue will appear in the change log "Features" label Sep 5, 2023
@windsource windsource added this to the v0.2 milestone Sep 5, 2023
@krucod3
Copy link
Contributor

krucod3 commented Sep 7, 2023

Decisions after discussion:

  • the restart bool flag will be removed and supplemented with a restart policy enum with the fields mentioned above
  • default value for the restart flag is never
  • the difference between always and failure is the exit code of the workload
  • all workloads will be started again when the agent is started (unless they are already running), because a newly started agent is treated as a restart of the node. Already running workloads will not be restarted in this case. This is independent from the restart policy of the workload.

@lingnoi lingnoi self-assigned this Sep 7, 2023
@krucod3 krucod3 modified the milestones: v0.2, v0.3 Oct 19, 2023
@krucod3 krucod3 modified the milestones: v0.3, backlog, v0.4 Nov 29, 2023
@krucod3 krucod3 modified the milestones: v0.4, v0.3 Jan 25, 2024
@krucod3
Copy link
Contributor

krucod3 commented Jan 25, 2024

After discussions and because of the bug ticket #158: most important here is that the restart of Ankaios starts the workloads again .

@krucod3 krucod3 self-assigned this Mar 7, 2024
@inf17101
Copy link
Contributor

inf17101 commented Mar 22, 2024

I have done some research again, just to check how the restart flags vary between all the known software orchestrators.

Kubernetes supports Never, Always, OnFailure (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/). K3s and docker-compose supports the same values. But Docker (https://docs.docker.com/config/containers/start-containers-automatically/) supports one more interesting field called unless-stopped. The unless-stopped restart policy restarts a workload in the same way like always does besides the user has actively stopped the workload. For transferring into automotive use cases maybe someone wants to configure that a workload shall be restarted when it is failed or what else, but if an integrator stops the workload the orchestrator shall not restart it (because it is unnecessary since the user's decision was actively to see the workload not running anymore).

What's also interesting is that the Docker allows the user to limit the number of restarts inside the on-failure(max_retries) option.

@krucod3
Copy link
Contributor

krucod3 commented Mar 22, 2024

I was also thinking about a max retries value for the on-failure case. We need to think if we would go with a value for the enum or a separate attribute. As for the unless-stopped, for Ankaios this would be the normal case as we either delete or assign to the "" agent which cannot restart anything.

@krucod3
Copy link
Contributor

krucod3 commented Mar 22, 2024

As we would like to release the v0.3 soon, the implementation for this feature will need to be shifted to the next release. I'll for now move this to v0.4, but we need to see if we will make an intermediate release only with this change, or change the way we release features (e.g., with fine granular releases).
For now we can at least define the interface (the changes to the manifest) and release it with v0.3 to reduce the number of releases with breaking changes. This will be done with #212

@krucod3 krucod3 modified the milestones: v0.3, v0.4 Mar 22, 2024
@inf17101
Copy link
Contributor

inf17101 commented Mar 26, 2024

Proposal for an Ankaios manifest containing the restart policy enum flags:

apiVersion: v0.1
workloads:
  restarted_always:
    runtime: podman
    agent: agent_A
    restart: ALWAYS
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sh" ]
      commandArgs: [ "-c", "echo 'Always restarted.'; sleep 10"]
  restarted_never:
    runtime: podman
    agent: agent_A
    restart: NEVER
    runtimeConfig: |
      image: alpine:latest
      commandArgs: [ "echo", "Explicitly never restarted."]
  default_restarted_never: # default restart value = NEVER
    runtime: podman
    agent: agent_A
    runtimeConfig: |
      image: alpine:latest
      commandArgs: [ "echo", "Implicitly never restarted."]
  restarted_on_failure:
    runtime: podman
    agent: agent_A
    restart: ON_FAILURE
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sh" ]
      commandArgs: [ "-c", "echo 'Restarted on failure.'; sleep 7; exit 1"]

The default value is NEVER and is used when the user does not specify the restart flag.

Three different restart policies:

  • NEVER: Ankaios never restarts the workload. Once the workload exits, it remains in the exited state. If the Ankaios agent is restarted, it resumes the workload. If the runtimeConfig has changed, Ankaios replaces and starts the workload with the workload config inside the desired state.
  • ALWAYS: Ankaios restarts the workload if it has exited, regardless of the exit code. This ensures continuous operation of the container. If the Ankaios agent is restarted, the workload will also be restarted if it is not running. If the runtimeConfig has changed, Ankaios replaces and starts the workload with the workload config inside the desired state.
  • ON_FAILURE: Ankaios restarts the workload if it exits with a non-zero exit code. If the Ankaios agent is restarted and detects that the workload has failed, it will be restarted. If the runtimeConfig has changed, Ankaios replaces and starts the workload with the workload config inside the desired state.

Replacing the workloads upon detecting a different runtime configuration aligns with Ankaios component's restart feature. This is because the Ankaios cluster is self-healing and ensures the desired state.

The enum values are written in SCREAMING_SNAKE_CASE to maintain consistency with the dependency enum values inside the Ankaios manifest.

@windsource
Copy link
Contributor Author

windsource commented Apr 15, 2024

What does it mean for NEVER

If the Ankaios agent is restarted, it resumes the workload.

Does 'resume' mean, that the workload is kept in its current state?

@krucod3 krucod3 modified the milestones: v0.4, v0.3.1 Apr 15, 2024
@krucod3
Copy link
Contributor

krucod3 commented Apr 15, 2024

@windsource, exactly. The internal 'resume' handling just leaves the workload in the state it currently is. If the workload was running (and shall be running), it would be left in running.

@inf17101
Copy link
Contributor

inf17101 commented Apr 15, 2024

Swdd design ideas:

In general, a restart can be represented in Ankaios as an normal update with the same configuration of the workload (update = delete + create). For an update the WorkloadSpec is necessary. The WorkloadControlLoop stores the current WorkloadSpec and does anyway workload related tasks (delete, update, create, retry). Storing the WorkloadSpec in the WorkloadObject itself to retrieve it for the restart, would lead to inconsistencies in such cases. This is why the ControlLoop maintains the WorkloadSpec which was implemented in the past accordingly. The WorkloadControlLoop can handle the restart itself, meaning the Workload itself handles its own restart behavior according to the configured restart policy.

There a few implementation possibilities how the WorkloadControlLoop can be triggered:

  1. Currently, the AgentManager receives all workload states (from other agents + from the workloads it manages itself). When the AgentManager receives a new workload state of the workloads it manages, the AgentManager triggers the RuntimeManager which has the map of workloads. The RuntimeManager or a separate Restart component called by the RuntimeManager decides when to restart the workload according to the restart policy and the current workload state or the WorkloadControlLoop decides it. The extra component would send an "restart" (which is an update) to the ControlLoop or if the ControlLoop itself handles the restart decision it just triggers the update.

  2. The WorkloadControlLoop shall receive the workload states itself and forwards it to the AgentManager which then forwards the workload state to the Ankaios server at the end. The WorkloadControlLoop handles the independently of other components the restart decisions and updates the workload if it must be restarted.

The first approach which does function calls top down from the AgentManager, over the RuntimeManager to the WorkloadObject, which sends then an Update command to the WorkloadControlLoop, would do unnecessary method calls if the workload must not be restarted (in the end in such cases, the WorkloadControlLoop ignores the call). This is a little bit confusing in my opinion, because the call is triggered high level and flows over a lot of components to be ignored at the low level implementation (and in addition is abstracted away at the end via a message channel from the WorkloadObject to the WorkloadControlLoop).

The second approach is better, because a minimum of component knows about the restart handling, which is easier for maintaining. Only the WorkloadControlLoop knows the restart logic and does the restarts independently when its necessary. The disadvantages are the overhead for extra communicaton of workload states through another channel which needs to be introduced and a potential more "busy control loop" if the frequency of the state checker reported states is increased.

For now, I am thinking for another option and I am still analyzing scenarios and the code.

@inf17101
Copy link
Contributor

inf17101 commented Apr 24, 2024

@krucod3: I have introduced a fix for the restart handling in a special scenario. When the agent receives an update of the workload and the state checker reports the execution state Failed(ExecFailed) for the old workload during the update code path is traversed, then the restart is wrongly executed after the update, because the workload state was inside the workload state channel. This is might happen also for other restart policy combinations like RestartPolicy::Always. Receiving all workload state is desired and fine, because every workload state must be forwarded to the AgentManager for inter-workload dependency handling and forwarding to the Ankaios server.

I have introduced a fix, that the workload control loop compares the WorkloadInstanceName of the received workload state with the instance name that is currently valid inside the workload control loop. In case of different instance names, the restart is skipped otherwise it is executed. The received workload state is forwarded in both cases to the AgentManager.

Now the restart is skipped when the instance names are different:
image

In addition, this integration test scenario cannot be committed, because the test is unstable due to the fact that this scenario only happens in rare circumstances depending on the timing of tokio::select method polling workload states and workload commands inside the workload control loop.
But to have a regressoin test and a tested condition, I will introduce an unit test to test the condition for skipping restarts isolated.

@inf17101
Copy link
Contributor

@krucod3: The restart handling does not consider the inter-workload dependencies, currently. If someone needs this feature he or she can develop a workload using the control interface to handle this case. If the feature extension is needed in the future, we can still develop it. But for now we keep it simple.

@krucod3
Copy link
Contributor

krucod3 commented Apr 24, 2024

@inf17101: yes, let's ignore the dependencies for the restart. We also shortly discussed this with @windsource and he also shares our opinion here.
To document the background better, the idea of the dependencies is to influence the startup behavior, If a failed workload needs to be started it was running already, normally there is no need to take care of dependencies again.

@inf17101
Copy link
Contributor

@krucod3: Like discussed, to bring the full-device restart behavior in line with the inter-workload dependencies and the restart handling, the following options are available, @windsource :

Option 1: The resume handling will be changed, so that only "running" workloads will be resumed, exited workloads will be replaced (delete + create). The create operation will be done, when the inter-workload dependencies of that workload are ready.

Option 2: Enhancement of option 1. A replace will be only done, when the restart policy is not NEVER (meaning enabled) and the ExecutionState and RestartPolicy are matching.

For all options, if a workload exits and there is no device restart, then the workload is restarted without considering the inter-workload dependencies. Since there is no full device restart and its dependencies may be in the running state, there is no need to consider the dependencies as the workload was already running before.

In addition: Regardless of option 1 or 2, for both the behavior is triggered, when an agent is restarted, because only on agent restart the resume code is executed currently. So we would have one side effect, that a workload is replaced as well, when no full device restart happens and only the agent is restarted. Maybe we can adapt this behavior, too. But for now it is like that.

@windsource
Copy link
Contributor Author

@inf17101 As discussed option 1 shall be implemented.

@inf17101
Copy link
Contributor

inf17101 commented May 7, 2024

The changes for the general restart handling is merged into main. The changes for resuming workloads with considering dependencies after device restarts are done in a separate issue #261 (PR #265 ).

@inf17101 inf17101 closed this as completed May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. Issue will appear in the change log "Features"
Projects
None yet
Development

No branches or pull requests

4 participants