TLA+ model for SwarmKit tasks #2613

Merged
merged 1 commit into docker:master from talex5:tla on Jun 7, 2018

Conversation

4 participants
@talex5
Contributor

talex5 commented Apr 20, 2018

- What I did

I want to learn about how SwarmKit works, and extend its documentation. As a first step, I've been reading through the docs in the design directory and writing a more formal version of them. If someone familiar with SwarmKit has time to read over and correct my interpretation, that would be great. I marked with XXX things I was particularly unsure about. For example, the docs say very little about how a service is removed.

To make it more precise, I've written the document using TLA+ notation, but hopefully it's still readable to people not familiar with that.

So far, this is based only on reading the design docs. I plan to start reading through the Go code soon, but I wanted to get some early feedback first. Once it's correct, it could be added to the existing documentation.

- How to test it

It's easiest to read the PDF rendering of the spec.

You can also load the spec into the TLA toolbox and run the model checker on it to test the claimed properties. The document explains how to do that.

- Description for the changelog

Add a TLA+ spec for SwarmKit's task model

@codecov


codecov bot commented Apr 20, 2018

Codecov Report

Merging #2613 into master will decrease coverage by 0.48%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2613      +/-   ##
==========================================
- Coverage   62.31%   61.82%   -0.49%     
==========================================
  Files         134      134              
  Lines       21739    21821      +82     
==========================================
- Hits        13546    13491      -55     
- Misses       6740     6892     +152     
+ Partials     1453     1438      -15
@dperny
Member

dperny left a comment

Hopefully these explanations will make some things clearer.

(* This has every field mentioned in `task_model.md' except for `spec', which
it doesn't seem to use for anything.
XXX: can desired_state be any state, or just some (e.g. {ready, running, shutdown, remove})? *)


@dperny

dperny Apr 20, 2018

Member

I believe that there are no guarantees about this, but the current behavior should be only running, shutdown, and remove. I have also envisioned that one day, for single-run tasks, the desired state might be completed. So to model it accurately, it may be best to model the desired state as any possible state. And if that doesn't work, then we can constrain states as needed.
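
A rough TLA+ sketch of that suggestion, assuming the individual state names (running, shutdown, ...) are already defined as constants, as they appear to be in the spec; the operator names here are mine and purely illustrative:

(* Every task state mentioned in the docs and the transition table below. *)
TaskStates == { new, pending, assigned, accepted, preparing, ready, starting,
                running, complete, shutdown, failed, rejected, remove, orphaned }

(* What SwarmKit sets today, per the comment above. *)
CurrentDesiredStates == { running, shutdown, remove }

(* Most permissive modelling choice: any state may be a desired state. *)
DesiredStates == TaskStates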

\/ CreateService
\/ UpdateService

(* XXX: what about removing a service? (not described in any detail in the design docs) *)


@dperny

dperny Apr 20, 2018

Member

Removing a service is possible. The original way it was done was to delete the service object. This cascaded to the Orchestrator, which saw that the service was removed and deleted all of its tasks. The deletion of the tasks caused the Dispatcher to send updated Assignments to the workers which did not include the deleted tasks, and in turn the workers shut down the deleted tasks.

This caused an issue because shutting down a task is far from instant, while deleting a task is. When a task is deleted, its network resources are freed by the Allocator, and can be reused. However, because the task may take some time to shut down, it was possible for network resources (IP addresses) to be reused while the old tasks' containers were still shutting down.

The solution for this was to add the Remove state, which (being a terminal state) indicates to a worker that the task should be shut down. Then, when the task is shut down, that state is reported back to the manager, which affirmatively knows that the task is terminated, and at that point can safely delete the resources it uses.

However, a further extension of this problem is the fact that a service itself can have resources associated with it, specifically Virtual IPs. Deleting the service would free those virtual IPs, creating the same problem. So, to fix that, I believe @nishanttotla was or is working on a change to the service object to flag it for removal, instead of removing it outright, which would be treated like a deletion used to be, kicking off the movement of tasks to desired state removed, etc., and then finally deleting the service when all of the tasks were deleted.
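
Read that way, a rough TLA+ sketch of the proposed two-phase removal could look like the following. All names here are illustrative (not SwarmKit's or the spec's), and it assumes `services' is a function from service IDs to records with a `remove' flag and `tasks' is a set of task records:

(* Phase 1: the user's delete only flags the service for removal. *)
FlagServiceForRemoval(sid) ==
    /\ services' = [services EXCEPT ![sid].remove = TRUE]
    /\ UNCHANGED tasks

(* Phase 2: once every task of the service has been reaped, the service
   (and service-level resources such as virtual IPs) can really be deleted. *)
DeleteFlaggedService(sid) ==
    /\ services[sid].remove
    /\ \A t \in tasks : t.service_id # sid
    /\ services' = [s \in (DOMAIN services) \ {sid} |-> services[s]]
    /\ UNCHANGED tasks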


@talex5

talex5 Apr 23, 2018

Contributor

OK, I'll wait for @nishanttotla to do that (or submit a design for it) and add it to the model then.

To check I understand the current system:

  1. The service is removed from the system.
  2. The orchestrator sets the desired state for every task without a service to remove.
  3. Once a task with no service is shut down, the reaper will remove it.

...are configured to start the new task before stopping the old one.'' - need to implement this
XXX: ``The updater takes care of making sure that each slot converges to having a single running task.''
How does it do that?


@dperny

dperny Apr 20, 2018

Member

You'd have to look at the code itself, but the pattern is probably something along the lines of watching the event stream for task updates, and moving desired state of older tasks in a slot during a rolling update to shutdown.

XXX: `orchestrators.md' says: ``[The Restart supervisor] atomically changes the state of the old task to Shutdown''.
However, this conflicts with the claim that only the agent can change the state of a running task.
Presumably it means the desired state.


@dperny

dperny Apr 20, 2018

Member

Yes, probably means desired state. Every state after assigned is set by the agent. Feel free to either open a new PR or add to this PR an update to the documentation.

(* XXX: spec says ``Tasks that were removed because of service removal
or scale down are not kept around in task history.''
Does that include tasks that had already finished for some
other reason? *)


@dperny

dperny Apr 20, 2018

Member

Building on my explanation of slots, and on the fact that the primary purpose of a slot is to maintain task history: the tasks in a slot have very little relationship to each other beyond the fact that they share a slot number.

When a task is removed in a scale down or service removal, the history for that slot is no longer useful in any meaningful way, so all tasks with the removed slot number in any state are removed. The slot no longer "exists" (as far as the concept of existence is meaningful for a slot) and therefore no longer requires any history.
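
In the vocabulary of this model, that might be sketched roughly as follows (illustrative only; it assumes `tasks' is the set of task records with `service_id' and `slot' fields):

(* Dropping a slot's history removes every task that shares that
   (service, slot) pair, regardless of the state each task is in. *)
RemoveSlotHistory(sid, slot) ==
    tasks' = { t \in tasks : ~(t.service_id = sid /\ t.slot = slot) }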

(* Every task has a service:
XXX: The spec says: ``In some cases, there are tasks that exist independent of any service.
These do not have a value set in service_id.''. Need an example of one. *)


@dperny

dperny Apr 20, 2018

Member

Network attachment tasks. They're an abomination, an absolutely disgusting hack that should never have been shipped. They are tasks which use the NetworkAttachmentSpec runtime spec, a spec which contains a container ID. These tasks are created through the grpc server implemented in resourceapi, and they are created by the agent. The purpose of Network Attachment tasks is to attach regular docker run containers to a swarmkit overlay network that has been created with the --attachable option.

There have been bugs in the past with attachment tasks because we have no visibility into their lifecycle and their creation and destruction is entirely at the whim of the agent. The end result is that network attachment tasks would remain in the object store after they had been terminated, leaking resources and causing the associated attached networks to be unremovable.

I added a workaround to remove network attachment tasks when a node is removed in #2409, but the underlying issue IIRC remains.

I'm sorry.


@dperny

dperny Apr 20, 2018

Member

Also, @jlhawn is working on a hack week project to run one-off tasks, like docker run-esque tasks, that would exist independently of services.

Transition == {
<< null, new >>,
<< new, pending >>,
<< pending, assigned >>,


@dperny

dperny Apr 20, 2018

Member

I can't quite remember, I'd have to go check the code, but it may be possible to move from new to a terminal state, if the task cannot be allocated because there aren't enough free resources. I think. I'm not sure.

<< shutdown, remove >>,
<< rejected, remove >>,
<< complete, remove >>,
<< failed, remove >>,


@dperny

dperny Apr 20, 2018

Member

Tasks will never actually BE in the removed state. They will only be in the desired state of removed. The state functions mainly as a flag to the task reaper, indicating that once a task is in a terminal state, it and its history can be deleted.

<< shutdown, orphaned >>,

<< remove, null >>,
<< orphaned, null >>


@dperny

dperny Apr 20, 2018

Member

Tasks can be deleted (moved to the null state, I think, in terms of the implementation of this model) from any terminal state. For example, when a task history is deleted, the old tasks will all be in terminal states (like SHUTDOWN or FAILED) and will be deleted straight from those states. Once a task is in a terminal state, the resources it had in use are definitely free and it is safe to remove it.
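
A rough sketch of that rule, using the state names from the transition table; the operator names are illustrative, and `state' is assumed to be the field holding the actual (not desired) state:

(* Terminal states: once a task is here, its resources are free. *)
TerminalStates == { complete, shutdown, failed, rejected, orphaned }

(* Any task already in a terminal state may be deleted outright. *)
DeleteTerminatedTask ==
    \E t \in tasks :
        /\ t.state \in TerminalStates
        /\ tasks' = tasks \ {t}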

=============================================================================
\* Modification History
\* Last modified Thu Apr 19 14:55:46 BST 2018 by tal
\* Created Mon Apr 16 11:31:19 BST 2018 by tal


@dperny

dperny Apr 20, 2018

Member

this is a dumb nit, but wouldn't the modification history just be tracked in git? Or does this history get baked into the PDF that a TLA+ file creates?


@talex5

talex5 Apr 23, 2018

Contributor

It's just some noise the TLA+ toolbox updates automatically when you save. Although it looks like if you delete it then it doesn't create it again, so easily solved!

@talex5

Contributor

talex5 commented Apr 23, 2018

Thanks for the detailed review @dperny! I've made some updates:

Task.pdf

  • Explicitly model SwarmKit's view of a node's state (up or down). This is needed because we shouldn't automatically orphan a node's tasks as soon as the node is down. It also makes it easier to write
    properties that depend on node state.

  • Note that removing services will be modelled once SwarmKit's new design for this is ready.

  • Confirm that slots do not exist as separate objects.

  • Confirm that already-terminated tasks are removed when their slot is no longer needed.

  • remove is not actually a possible state of a task, only a desired state. The reaper removes any task that has finished and has this as its desired state.

  • Model the allocator rejecting tasks.

  • Some minor corrections to the existing markdown docs.

Note: I also made a change to the tasks variable, which should make things a bit easier to read as the model now mainly talks about things in terms of tasks rather than task IDs.

I also updated the set of valid transitions to group them by the component that performs them, and to check this.

- progress through the `ACCEPTED`, `PREPARING`, and `STARTING` states on the
- way to `RUNNING`. If a task exits without an error code, it moves to the
+ progress through the `ACCEPTED`, `PREPARING`, `READY`, and `STARTING` states on
+ the way to `RUNNING`. If a task exits without an error code, it moves to the


@anshulpundir

anshulpundir Apr 23, 2018

Contributor

If a task exits without an error code, it moves to the COMPLETE state.

Let's also clarify this by saying that the actual state is changed to COMPLETE by the agent.


@talex5

talex5 Apr 24, 2018

Contributor

Actually, it says the agent handles these states on the line above the bit shown by GitHub's diff, so maybe it's OK as it is.


@anshulpundir

anshulpundir May 16, 2018

Contributor

OK, it's fine as is.

@dperny

Member

dperny commented Apr 24, 2018

OK, I'll wait for @nishanttotla to do that (or submit a design for it) and add it to the model then.

To check I understand the current system:

  1. The service is removed from the system.
  2. The orchestrator sets the desired state for every task without a service to remove.
  3. Once a task with no service is shut down, the reaper will remove it.

More accurately, the service is removed from the system, the process of which involves setting the desired state of all of its tasks to REMOVE, and when any task in desired state REMOVE enters a terminal state, the taskreaper will delete them.

(* A description of the task model in SwarmKit.
Currently, this is mostly based on `design/task_model.md'.
Note: I am not yet familiar with the SwarmKit code, and this document is likely to


@anshulpundir

anshulpundir Apr 24, 2018

Contributor

nit: looks like this is no longer true :)

@anshulpundir

Contributor

anshulpundir commented Apr 24, 2018

More accurately, the service is removed from the system, the process of which involves setting the desired state of all of its tasks to REMOVE, and when any task in desired state REMOVE enters a terminal state, the taskreaper will delete them.

Even more accurately, the taskreaper will delete any task which is in a terminal state or in an unassigned state (i.e. < ASSIGNED).

Also:

The orchestrator sets the desired state for every task without a service to remove.

Not every task without a service, but just the tasks for the service just deleted.

@dperny @talex5
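
Putting those refinements together, the reaper's guard might be sketched roughly like this (names are illustrative, with `TerminalStates' as in the earlier sketch and `new'/`pending' standing in for "before ASSIGNED"):

(* A task flagged for removal may be reaped once it is either in a terminal
   state or has never been assigned to a node. *)
Reapable(t) ==
    /\ t.desired_state = remove
    /\ \/ t.state \in TerminalStates
       \/ t.state \in { new, pending }    \* i.e. earlier than `assigned'

ReapTask ==
    \E t \in tasks :
        /\ Reapable(t)
        /\ tasks' = tasks \ {t}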

@talex5

Contributor

talex5 commented Apr 24, 2018

Thanks for the explanation. I've added service removal. I modelled it as flagging the service and all its tasks for removal and then removing it when it had no tasks left (i.e. the proposed design rather than the current one). I added a comment about it in the model.

I also updated the reaper to allow removing tasks that haven't yet been assigned (as explained by @anshulpundir), and added a couple of exceptions to the rule that only one component is in charge of each state to cover that.

Task.pdf

@GordonTheTurtle


GordonTheTurtle commented Apr 30, 2018

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "tla" git@github.com:talex5/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842353920240
# editor opens
# change each 'pick' to 'edit'
# save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@talex5 talex5 force-pushed the talex5:tla branch from 3cba954 to 96d9575 Apr 30, 2018

@GordonTheTurtle GordonTheTurtle removed the dco/no label Apr 30, 2018

@talex5 talex5 changed the title from "[RFC] TLA+ model for SwarmKit tasks" to "TLA+ model for SwarmKit tasks" May 2, 2018

@talex5

Contributor

talex5 commented May 2, 2018

I've now split the spec into multiple files, and created a separate low-level spec for worker nodes. I also added a Makefile so you can build the PDFs and run the model checker without having to install the TLA toolbox (it uses a Docker image). This could also be used in CI.

@talex5

Contributor

talex5 commented May 11, 2018

I think this PR is ready for merging now.

@anshulpundir
Contributor

anshulpundir left a comment

Some minor comments but otherwise looks good!

I also think we should squash the commits together.


(* The user removes a service.
Note: Currently, SwarmKit deletes the service from its records immediately.
However, this isn't right because we need to wait for the resources to be freed.


@anshulpundir

anshulpundir May 16, 2018

Contributor

this isn't right because

Instead of saying this isn't right, I'd rather say that this is not very elegant, because we still wait for resources to be freed before deleting the tasks, and resources are tied to tasks and not services.


@talex5

talex5 May 16, 2018

Contributor

@dperny mentioned that there are also service-level resources that need to be freed. How about:

However, this isn't right because we need to wait for service-level resources such as Virtual IPs to be freed.

\E id \in UnusedId(sid, slot) :
tasks' = tasks \union { NewTask(sid, slot, id, running) }

(* Add a task if a slot exists, contains no runnable tasks, and we weren't trying to remove it.


@anshulpundir

anshulpundir May 16, 2018

Contributor

supernit: remove comma before 'and' ?


@talex5

talex5 May 16, 2018

Contributor

I think this is correct. According to https://en.wikipedia.org/wiki/Serial_comma:

A majority of American style guides mandate use of the serial comma

(* We don't model the updater explicitly, but we allow any task to be restarted (perhaps with
a different image) at any time, which should cover the behaviours of the restart supervisor.
XXX: wait for new task to be ready before shutting down old one?


@anshulpundir

anshulpundir May 16, 2018

Contributor

Is this a TODO? I think some clarification is needed.


@talex5

talex5 May 16, 2018

Contributor

I agree. I've updated it to:

TODO: SwarmKit also allows ``start-first'' mode updates where we first get the new task to
`running' and then mark the old task for shutdown. Add this to the model.
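
For reference, a very rough sketch of what such a start-first step could look like; this is hypothetical, not part of the model in this PR, and assumes task records with `service_id', `slot', `state' and `desired_state' fields:

(* Start-first update: only once the replacement task is running do we
   mark the task it replaces for shutdown. *)
StartFirstShutdownOld ==
    \E old, repl \in tasks :
        /\ old # repl
        /\ old.service_id = repl.service_id
        /\ old.slot = repl.slot
        /\ repl.state = running
        /\ old.desired_state = running
        /\ tasks' = (tasks \ {old}) \union
                    { [old EXCEPT !.desired_state = shutdown] }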

TODO: ``However, there are also cases where a slot may have multiple tasks with the
desired state of RUNNING. This can happen during rolling updates when the updates
are configured to start the new task before stopping the old one.'' - need to implement this *)


@anshulpundir

anshulpundir May 16, 2018

Contributor

need to implement this

By this you mean implement the TLA+ model for this?


(* The allocator rejects the task because there aren't enough resources.
XXX: Not clear whether SwarmKit actually does this.


@anshulpundir

anshulpundir May 16, 2018

Contributor

Don't think it does. The task just stays in NEW (or READY?) until resources are available. cc @dperny


@talex5

talex5 May 16, 2018

Contributor

He said he wasn't sure: #2613 (comment)
Shall I remove this then?


@anshulpundir

anshulpundir May 21, 2018

Contributor


@anshulpundir

anshulpundir May 22, 2018

Contributor

Browsing through the code, I don't see a case where the current allocator fails tasks if there aren't enough resources. @talex5


@talex5

talex5 Jun 4, 2018

Contributor

OK, thanks for checking! I've removed this bit now.


---------------------------- MODULE Scheduler -------------------------------

\* Actions performed the scheduler


@anshulpundir

anshulpundir May 16, 2018

Contributor

performed the scheduler => performed by the scheduler

(* Forget about tasks in remove or orphan states.
Orphaned tasks belong to nodes that we are assuming are lost forever.
Or, at the very least, it's likely that a node outage lasting longer than the timeout


@anshulpundir

anshulpundir May 16, 2018

Contributor

has crashed => has happened ?


@talex5

talex5 May 16, 2018

Contributor

How about:

Orphaned tasks belong to nodes that we are assuming are lost forever (or have crashed and will come up with nothing running, which is an equally fine outcome).

I think this is what @dperny meant in #2613 (comment).

``Moving a task to the Orphaned state is not desirable,
because it's the one case where we break the otherwise invariant
that the agent sets all states past ASSIGNED.''


@anshulpundir

anshulpundir May 16, 2018

Contributor

that the agent sets all states past ASSIGNED.

Clarify that this is the actual state and not the desired state ?

@talex5 talex5 force-pushed the talex5:tla branch from 9431093 to c43c585 May 16, 2018

@talex5 talex5 force-pushed the talex5:tla branch from c43c585 to de903c5 Jun 4, 2018

Add some TLA+ specifications for SwarmKit
These documents use the TLA+ notation to describe the behaviour of
SwarmKit more precisely than the prose documentation in the design
directory.

Run `make pdfs` to render these documents as PDF files.
The best one to start with is `SwarmKit.pdf`, which introduces the TLA+
notation and describes the overall components of SwarmKit.

The specifications can also be executed by the TLC model checker to help
find mistakes. Use `make check` to run the checks.

Signed-off-by: Thomas Leonard <thomas.leonard@docker.com>

@talex5 talex5 force-pushed the talex5:tla branch from de903c5 to 33108cb Jun 4, 2018

@anshulpundir anshulpundir merged commit 2546862 into docker:master Jun 7, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/project 61.82% (target 0%)
Details
dco-signed All commits are signed

@talex5 talex5 deleted the talex5:tla branch Jun 8, 2018

anshulpundir added a commit to anshulpundir/swarmkit that referenced this pull request Jun 11, 2018

Merge pull request docker#2613 from talex5/tla
TLA+ model for SwarmKit tasks