Add restart handling #5
Decisions after discussion:
After the discussions and because of bug ticket #158: most important here is that a restart of Ankaios starts the workloads again.
I have done some research again, just to check how the restart flags vary between the known software orchestrators. Kubernetes supports the restart policies Always, OnFailure and Never. What's also interesting is that Docker allows the user to limit the number of restarts for the on-failure policy (e.g., `--restart on-failure:5`).
I was also thinking about a max-retries value for the on-failure case. We need to decide whether we would go with a value attached to the enum variant or a separate attribute. As for unless-stopped: for Ankaios this would be the normal case, as we either delete the workload or assign it to the "" agent, which cannot restart anything.
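The two modeling options for the max-retries value can be sketched as follows. This is a hypothetical illustration only; the type names (`RestartPolicyA`, `RestartPolicyB`, `RestartConfig`) and the field layout are assumptions, not the actual Ankaios API.

```rust
// Option A: the max-retries value lives inside the enum variant.
#[derive(Debug, PartialEq)]
enum RestartPolicyA {
    Never,
    Always,
    OnFailure { max_retries: Option<u32> },
}

// Option B: the enum stays plain and a separate attribute carries the limit.
#[derive(Debug, PartialEq)]
enum RestartPolicyB {
    Never,
    Always,
    OnFailure,
}

struct RestartConfig {
    policy: RestartPolicyB,
    // Only meaningful when policy == OnFailure; must be validated separately.
    max_retries: Option<u32>,
}

fn main() {
    let a = RestartPolicyA::OnFailure { max_retries: Some(3) };
    let b = RestartConfig { policy: RestartPolicyB::OnFailure, max_retries: Some(3) };
    // Option A makes invalid combinations (e.g. Never + max_retries) unrepresentable.
    assert!(matches!(a, RestartPolicyA::OnFailure { max_retries: Some(3) }));
    assert_eq!(b.policy, RestartPolicyB::OnFailure);
    assert_eq!(b.max_retries, Some(3));
}
```

Option A is stricter at the type level; option B keeps the manifest schema flatter but needs extra validation.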
As we would like to release the v0.3 soon, the implementation for this feature will need to be shifted to the next release. I'll for now move this to v0.4, but we need to see if we will make an intermediate release only with this change, or change the way we release features (e.g., with fine granular releases). |
Proposal for an Ankaios manifest containing the restart policy enum flags:

```yaml
apiVersion: v0.1
workloads:
  restarted_always:
    runtime: podman
    agent: agent_A
    restart: ALWAYS
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sh" ]
      commandArgs: [ "-c", "echo 'Always restarted.'; sleep 10" ]
  restarted_never:
    runtime: podman
    agent: agent_A
    restart: NEVER
    runtimeConfig: |
      image: alpine:latest
      commandArgs: [ "echo", "Explicitly never restarted." ]
  default_restarted_never: # default restart value = NEVER
    runtime: podman
    agent: agent_A
    runtimeConfig: |
      image: alpine:latest
      commandArgs: [ "echo", "Implicitly never restarted." ]
  restarted_on_failure:
    runtime: podman
    agent: agent_A
    restart: ON_FAILURE
    runtimeConfig: |
      image: alpine:latest
      commandOptions: [ "--entrypoint", "/bin/sh" ]
      commandArgs: [ "-c", "echo 'Restarted on failure.'; sleep 7; exit 1" ]
```

The default value is NEVER. Three different restart policies are proposed: ALWAYS, NEVER and ON_FAILURE.
Replacing the workloads upon detecting a different runtime configuration aligns with Ankaios' restart feature, because the Ankaios cluster is self-healing and ensures the desired state. The enum values are written in SCREAMING_SNAKE_CASE to stay consistent with the dependency enum values inside the Ankaios manifest.
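The semantics of the three policies can be sketched as a small decision function. The types and names below (`RestartPolicy`, `ExecutionState`, `should_restart`) are illustrative assumptions, not the actual Ankaios internals:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RestartPolicy {
    Never,     // NEVER: never restart (the default)
    Always,    // ALWAYS: restart regardless of the exit result
    OnFailure, // ON_FAILURE: restart only on a non-zero exit code
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ExecutionState {
    Running,
    Succeeded,            // exited with code 0
    Failed { code: i32 }, // exited with a non-zero code
}

/// Returns true when the configured policy demands a restart for the state.
fn should_restart(policy: RestartPolicy, state: ExecutionState) -> bool {
    match (policy, state) {
        // A running workload is never restarted.
        (_, ExecutionState::Running) => false,
        (RestartPolicy::Never, _) => false,
        (RestartPolicy::Always, _) => true,
        (RestartPolicy::OnFailure, ExecutionState::Failed { .. }) => true,
        (RestartPolicy::OnFailure, ExecutionState::Succeeded) => false,
    }
}

fn main() {
    assert!(should_restart(RestartPolicy::Always, ExecutionState::Succeeded));
    assert!(!should_restart(RestartPolicy::Never, ExecutionState::Failed { code: 1 }));
    assert!(should_restart(RestartPolicy::OnFailure, ExecutionState::Failed { code: 1 }));
    assert!(!should_restart(RestartPolicy::OnFailure, ExecutionState::Succeeded));
}
```

In this sketch the policy is evaluated only on an exit event, matching the manifest example above where `restarted_on_failure` exits with code 1 and would be restarted.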
What does it mean for a workload to be resumed? Does 'resume' mean that the workload is kept in its current state?
@windsource, exactly. The internal 'resume' handling just leaves the workload in the state it currently is. If the workload was running (and shall be running), it is left running.
Swdd design ideas: In general, a restart can be represented in Ankaios as a normal update with the same configuration of the workload (update = delete + create). For an update, the WorkloadSpec is necessary. The WorkloadControlLoop stores the current WorkloadSpec and already performs the workload-related tasks (delete, update, create, retry). Storing the WorkloadSpec in the WorkloadObject itself, to retrieve it for the restart, would lead to inconsistencies in such cases; this is why the control loop maintains the WorkloadSpec, which was implemented accordingly in the past. The WorkloadControlLoop can handle the restart itself, meaning the workload handles its own restart behavior according to the configured restart policy. There are a few implementation possibilities for how the WorkloadControlLoop can be triggered:
The first approach, which does function calls top-down from the AgentManager over the RuntimeManager to the WorkloadObject, which then sends an Update command to the WorkloadControlLoop, would do unnecessary method calls if the workload must not be restarted (in such cases, the WorkloadControlLoop ignores the call in the end). This is a little bit confusing in my opinion, because the call is triggered at a high level and flows through a lot of components only to be ignored in the low-level implementation (and in addition is abstracted away at the end via a message channel from the WorkloadObject to the WorkloadControlLoop). The second approach is better, because a minimum of components know about the restart handling, which is easier to maintain. Only the WorkloadControlLoop knows the restart logic and performs the restarts independently when necessary. The disadvantages are the overhead of extra communication of workload states through another channel that needs to be introduced, and a potentially busier control loop if the frequency of the state-checker-reported states is increased. For now, I am thinking about another option and am still analyzing scenarios and the code.
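The second approach can be sketched roughly as follows. This is a simplified, std-only model under stated assumptions: the real implementation polls separate channels with tokio's `select!`, while here a single enum channel (`LoopEvent`) stands in for that multiplexing, and all type names are illustrative, not Ankaios' actual code.

```rust
use std::sync::mpsc;

// Illustrative types only; not the actual Ankaios definitions.
#[derive(Debug, PartialEq)]
enum WorkloadCommand {
    Delete,
}

#[derive(Debug, PartialEq)]
enum ExecutionState {
    Failed,
    Succeeded,
}

// Events the control loop multiplexes over.
enum LoopEvent {
    Command(WorkloadCommand),
    State(ExecutionState),
}

/// Runs until a Delete command arrives; counts restarts triggered internally
/// by an ON_FAILURE-style policy. Only this loop knows the restart logic.
fn control_loop(events: mpsc::Receiver<LoopEvent>) -> u32 {
    let mut restarts = 0;
    for event in events {
        match event {
            LoopEvent::Command(WorkloadCommand::Delete) => break,
            LoopEvent::State(ExecutionState::Failed) => {
                // Restart = update with the stored WorkloadSpec (elided here).
                restarts += 1;
            }
            LoopEvent::State(ExecutionState::Succeeded) => {}
        }
    }
    restarts
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(LoopEvent::State(ExecutionState::Failed)).unwrap();
    tx.send(LoopEvent::State(ExecutionState::Succeeded)).unwrap();
    tx.send(LoopEvent::State(ExecutionState::Failed)).unwrap();
    tx.send(LoopEvent::Command(WorkloadCommand::Delete)).unwrap();
    assert_eq!(control_loop(rx), 2);
}
```

The design point is visible in the sketch: no component above the loop needs to know about restarts, at the cost of routing workload states into the loop through an additional channel.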
@krucod3: I have introduced a fix for the restart handling in a special scenario. When the agent receives an update of the workload and the state checker reports the execution state I have introduced a fix, that the workload control loop compares the Now the restart is skipped when the instance names are different: In addition, this integration test scenario cannot be committed, because the test is unstable due to the fact that this scenario only happens in rare circumstances depending on the timing of tokio::select method polling workload states and workload commands inside the workload control loop. |
@krucod3: The restart handling does not currently consider the inter-workload dependencies. If someone needs this feature, they can develop a workload using the control interface to handle this case. If the feature extension is needed in the future, we can still develop it; for now we keep it simple.
@inf17101: yes, let's ignore the dependencies for the restart. We also briefly discussed this with @windsource and he shares our opinion here.
@krucod3: As discussed, to bring the full-device restart behavior in line with the inter-workload dependencies and the restart handling, the following options are available, @windsource:

Option 1: The resume handling will be changed so that only "running" workloads are resumed; exited workloads are replaced (delete + create). The create operation is done when the inter-workload dependencies of that workload are ready.

Option 2: Enhancement of option 1. A replace is only done when the restart policy is not NEVER (meaning enabled) and the ExecutionState and RestartPolicy match.

For both options, if a workload exits and there is no device restart, then the workload is restarted without considering the inter-workload dependencies. Since there is no full device restart, its dependencies may still be in the running state, so there is no need to consider the dependencies, as the workload was already running before.

In addition: regardless of option 1 or 2, the behavior is triggered when an agent is restarted, because currently the resume code is only executed on agent restart. So we would have the side effect that a workload is also replaced when no full device restart happens and only the agent is restarted. Maybe we can adapt this behavior too, but for now it is like that.
@inf17101 As discussed, option 1 shall be implemented.
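Option 1 boils down to a small resume decision on agent restart. A minimal sketch, with hypothetical `ExecutionState` and `ResumeAction` types (the real resume code in the agent differs):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExecutionState {
    Running,
    Succeeded,
    Failed,
}

#[derive(Debug, PartialEq)]
enum ResumeAction {
    Resume,  // keep the workload as-is
    Replace, // delete + create, waiting for inter-workload dependencies
}

/// Option 1: only running workloads are resumed; exited ones are replaced.
/// The create part of the replace waits until the workload's inter-workload
/// dependencies are ready (not modeled here).
fn on_agent_restart(state: ExecutionState) -> ResumeAction {
    match state {
        ExecutionState::Running => ResumeAction::Resume,
        ExecutionState::Succeeded | ExecutionState::Failed => ResumeAction::Replace,
    }
}

fn main() {
    assert_eq!(on_agent_restart(ExecutionState::Running), ResumeAction::Resume);
    assert_eq!(on_agent_restart(ExecutionState::Failed), ResumeAction::Replace);
    assert_eq!(on_agent_restart(ExecutionState::Succeeded), ResumeAction::Replace);
}
```

Option 2 would additionally gate the `Replace` branch on the configured restart policy not being NEVER.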
Description
The state contains a boolean field `restart` which is currently ignored. The restart handling should be implemented.
Goals
Support the restart policies `always`, `never` and `on failure`.
Tasks
Final Result
Summary
The restart handling is implemented according to the newly introduced restart policies for a workload:
The WorkloadControlLoop handles the restarts of a workload when the ExecutionState of the workload it manages fulfills the workload's configured restart policy. The restart is represented by an update operation with the same workload spec.
When restarting a workload and there is no full device restart, the inter-workload dependencies are not considered.