
Add jobs support to CLI #2262

Merged: 1 commit merged into docker:master on May 6, 2020

Conversation

@dperny (Contributor) commented Jan 16, 2020

- What I did

Add support to the CLI for swarm jobs (moby/moby#40307).

Does not include compose support.

- How I did it

  • Added two new modes accepted by the --mode flag:
    • replicated-job creates a replicated job.
    • global-job creates a global job.
  • When using replicated-job mode, the --replicas flag sets the TotalCompletions parameter of the job. This is the total number of tasks that will run.
  • Added a new flag, --max-concurrent, for use with replicated-job mode. This flag sets the MaxConcurrent parameter of the job, which is the maximum number of replicas the job will run simultaneously (see the sketch after this list).
  • When using replicated-job or global-job mode, using any of the update parameter flags will result in an error, as jobs cannot be updated in the traditional sense.
  • Updated the docker service ls UI to include the completion status (completed vs total tasks) if the service is a job.
  • Updated the progress bar UI for service creation and update to support jobs. For jobs, a bar is displayed showing the overall progress of the job (the number of tasks completed over the total number of tasks to complete).
  • Added documentation explaining the use of the new flags, and of jobs in general.
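
A quick sketch of the new flags in use (illustrative only; the service names are placeholders, and bash true mirrors the test commands used elsewhere in this thread):

```bash
# Replicated job: 10 tasks in total (--replicas sets TotalCompletions),
# with at most 2 tasks running at any one time (--max-concurrent).
docker service create --name example-job \
  --mode replicated-job \
  --replicas 10 \
  --max-concurrent 2 \
  bash true

# Global job: one task on each node that satisfies the placement constraints.
docker service create --name example-global-job \
  --mode global-job \
  bash true
```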

- How to verify it

Includes automated tests for all major changes.

- Description for the changelog

Added CLI support for swarm jobs.

@dperny (Contributor, Author) commented Jan 16, 2020

Added nolint: gocyclo to the (*serviceOptions).ToService method. In exchange for this concession, I have added a documentation comment to that method.

@mathroc commented Jan 17, 2020

@dperny I see that a job will have to be force-updated to run again; will it be possible to have a --rm flag that removes the job once completed (like docker run --rm does)?

So that we don't have to worry about whether the job should be created or force-updated (when running a database migration job, for example).

Alternatively, would it be possible to create a job service without starting it, and have a command to run the job whenever we like? The benefit would be that the job configuration always stays the same and can easily be included in a stack, and Docker would keep the task history (compared to the --rm proposal above).

@dperny (Contributor, Author) commented Jan 17, 2020

@mathroc for whatever reason, I had not considered the possibility of an --rm flag, actually. I'm unsure how to implement it correctly, and it certainly won't make it into this release. That said, if my distant memories of being a mediocre Ruby on Rails developer are somewhat intact, database migrations should be idempotent, mitigating the possibility of screwing things up by accidentally re-running a job.

Second, to create a job without starting it, you can just set --replicas to 0, which should be valid. Or, for a global job, set a placement constraint that cannot be met.
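
For example (a sketch of both approaches; the service names and the node label are hypothetical):

```bash
# Replicated job created "idle": zero replicas means no tasks run until the
# job is scaled up or force-updated.
docker service create --name pending-job \
  --mode replicated-job \
  --replicas 0 \
  bash true

# Global job held back by a placement constraint that no node satisfies
# (node.labels.run-jobs is an arbitrary example label).
docker service create --name pending-global-job \
  --mode global-job \
  --constraint node.labels.run-jobs==true \
  bash true
```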

@mathroc commented Jan 18, 2020

Initializing a service job with --replicas 0 might be enough.

The problem with the database migration in my example was not that it could run twice, but that I thought I would have to query Docker to see if the service job already exists and then either create the service or force-update it. But with --replicas 0, I can initialize the service once (e.g. with docker stack deploy), and then I don't have to worry; I just have to run docker service update --replicas 1 --force to run the job, and it should work every time.

thx @dperny

@dperny dperny changed the title WIP: Add jobs support to CLI Add jobs support to CLI Jan 20, 2020
@dperny (Contributor, Author) commented Jan 20, 2020

Removed WIP. The support for jobs upstream was merged.

@SvenDowideit (Contributor) commented:

@thaJeztah Is there some chance this can be expected in the next release? I'm presuming there will be a docker-v20.04?

@dperny dperny force-pushed the swarm-jobs branch 2 times, most recently from 67d4da0 to e1dabab on February 20, 2020
@dperny (Contributor, Author) commented Feb 20, 2020

Rebased to hopefully fix merge conflicts.

@silvin-lubecki (Contributor) commented:

The code itself looks good to me, but I need to take it for a spin to check the UX and the whole feature 👍
Anyway thank you @dperny for this awesome work 🐱

@nkabbara commented:

Hello! Is there an ETA for this feature?

@nkabbara commented:

> Initializing a service job with --replicas 0 might be enough.
>
> The problem with the database migration in my example was not that it could run twice, but that I thought I would have to query Docker to see if the service job already exists and then either create the service or force-update it. But with --replicas 0, I can initialize the service once (e.g. with docker stack deploy), and then I don't have to worry; I just have to run docker service update --replicas 1 --force to run the job, and it should work every time.
>
> thx @dperny

Hi @mathroc, curious about which migration strategy you settled on.

I'm thinking about something similar to what you've suggested:

  1. Create a migrations service with 0 replicas and restart-condition set to none.
  2. Update the image with the new release.
  3. Raise replicas to 1 (sketched below).
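
Something like this, perhaps (a sketch of those three steps; the service and image names are hypothetical):

```bash
# 1. Create the migrations job idle: zero replicas, and no automatic retry
#    of failed tasks.
docker service create --name migrations \
  --mode replicated-job \
  --replicas 0 \
  --restart-condition none \
  myapp/migrations:v1

# 2. Point the still-idle job at the new release.
docker service update --image myapp/migrations:v2 migrations

# 3. Raise replicas to 1 to actually run the migration.
docker service update --replicas 1 migrations
```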


The following review thread concerns this documentation example:

```bash
$ docker service create --name mythrottledjob \
    --mode replicated-job \
```

Collaborator: Maybe --kind=job --mode=replicated?

Collaborator: Alternatively this could be docker job create even if the API is for a service.

Member: While docker job create is probably a clean(er) option; advantages:

  • separate subcommand
  • we could hide/remove flags that don't apply to jobs

Downsides:

  • given that they're both backed by a service, we would need to filter jobs out of docker service ls (etc.) and vice-versa.
  • that might become confusing if we don't apply the same filter very strictly (docker service rm <job service> could otherwise remove a job)
  • also think of docker service create myjob, which could show an error that service myjob already exists, but docker service ls wouldn't show it

@thaJeztah (Member) commented Apr 21, 2020

Reviewing this together with @silvin-lubecki; I'll post notes along the way (sorry for the extra noise).

I tried running the example you included in the documentation;

docker service create --name myjob \
   --mode replicated-job \
   bash "true"

Output looks good to me

job progress: 1 out of 1 complete [==================================================>]
active tasks: 0 out of 0 tasks
1/1: complete  [==================================================>]
job complete

What I think is slightly confusing is that "REPLICAS" shows 0/1: we expect this job to run once, so should it show 1/1?

docker service ls
ID                  NAME                MODE                REPLICAS              IMAGE               PORTS
1tzs32id63hz        myjob               replicated job      0/1 (1/1 completed)   bash:latest

Thinking about whether I can come up with a better presentation for that 🤔

Trying with more replicas:

docker service rm myjob
docker service create --name myjob --mode replicated-job --replicas=4 bash "true"
vbtoewcdxz17hfa14p2kua96r
job progress: 4 out of 4 complete [==================================================>]
active tasks: 0 out of 0 tasks
1/4: complete  [==================================================>]
2/4: complete  [==================================================>]
3/4: complete  [==================================================>]
4/4: complete  [==================================================>]
job complete
docker service ls
ID                  NAME                MODE                REPLICAS              IMAGE               PORTS
vbtoewcdxz17        myjob               replicated job      0/4 (4/4 completed)   bash:latest

@thaJeztah (Member) commented:

Slightly confusing:

$ docker service scale myjob=2
myjob: scale can only be used with replicated mode

The job is replicated, so perhaps we should change this to "cannot be used with jobs" instead of mentioning the replicated mode?

@thaJeztah (Member) commented Apr 21, 2020

Looks like the compose schema (or validation) needs some updating; using this compose file:

```yaml
version: "3.9"
services:
  job:
    image: bash
    command: "true"
    deploy:
      mode: "replicated-job"
      replicas: 6
```

I get an error:

docker stack deploy -c docker-compose.yml mystack
Creating network mystack_default
service job: Unknown mode: replicated-job

I tried updating the compose code:

```diff
diff --git a/cli/compose/convert/service.go b/cli/compose/convert/service.go
index da182bbf..9ce91b90 100644
--- a/cli/compose/convert/service.go
+++ b/cli/compose/convert/service.go
@@ -609,12 +609,12 @@ func convertDeployMode(mode string, replicas *uint64) (swarm.ServiceMode, error)
 	serviceMode := swarm.ServiceMode{}

 	switch mode {
-	case "global":
+	case "global", "global-job":
 		if replicas != nil {
 			return serviceMode, errors.Errorf("replicas can only be used with replicated mode")
 		}
 		serviceMode.Global = &swarm.GlobalService{}
-	case "replicated", "":
+	case "replicated", "replicated-job", "":
 		serviceMode.Replicated = &swarm.ReplicatedService{Replicas: replicas}
 	default:
 		return serviceMode, errors.Errorf("Unknown mode: %s", mode)
```

After that, docker stack deploy "worked", but behind the scenes it keeps failing; probably because docker stack deploy updates the service after it has been created (adding a network alias). By the time it tries doing so, the task has already exited, so a new task is created;

docker service ls
ID                  NAME                MODE                REPLICAS                IMAGE               PORTS
uq7b0h3v6ghf        mystack_job         replicated          0/6                     bash:latest
DEBU[2020-04-21T13:23:24.888699099Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).addSvcRecords(tasks.job, 10.0.1.69, <nil>, false) addServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.888713104Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).addSvcRecords(mystack_job, 10.0.1.2, <nil>, false) addServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.888719214Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).addSvcRecords(job, 10.0.1.2, <nil>, false) addServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.888725306Z] addServiceBinding from addServiceInfoToCluster END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.888921206Z] addServiceInfoToCluster END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.889020616Z] EnableService 745ed888b6fc250aa77129acc92efe3c86ae56ea0b2d9d9cf2e78acc92c561ac DONE
DEBU[2020-04-21T13:23:24.889211225Z] deleteServiceInfoFromCluster from sbLeave START for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.889277881Z] rmServiceBinding from deleteServiceInfoFromCluster START for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab p:0xc00092bd00 nid:t1holdww2ke12dtrzc8u5cf76 sKey:{uq7b0h3v6ghf9t6su2svo3o6r } deleteSvc:true
DEBU[2020-04-21T13:23:24.889579250Z] rmServiceBinding fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab delete t1holdww2ke12dtrzc8u5cf76, p:0xc001fcc300 in loadbalancers len:0
DEBU[2020-04-21T13:23:24.891846283Z] state for task nzfaz6g6xq9je6isbt5zc2qh1 updated to COMPLETE  method="(*Dispatcher).processUpdates" module=dispatcher node.id=cof80xs5o119fw86iyoj4g1eb state.transition="STARTING->COMPLETE" task.id=nzfaz6g6xq9je6isbt5zc2qh1
DEBU[2020-04-21T13:23:24.892457593Z] dispatcher committed status update to store   method="(*Dispatcher).processUpdates" module=dispatcher node.id=cof80xs5o119fw86iyoj4g1eb state.transition="STARTING->COMPLETE" task.id=nzfaz6g6xq9je6isbt5zc2qh1
ERRO[2020-04-21T13:23:24.896670528Z] Failed to allocate network resources for node cof80xs5o119fw86iyoj4g1eb  error="could not find network allocator state for network t1holdww2ke12dtrzc8u5cf76" module=node node.id=cof80xs5o119fw86iyoj4g1eb
[... the same "Failed to allocate network resources" ERRO line repeated 25 more times with later timestamps ...]
DEBU[2020-04-21T13:23:24.986545081Z] deleteEndpointNameResolution fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab mystack_job rm_service:true suppress:false sAliases:[job] tAliases:[745ed888b6fc]
DEBU[2020-04-21T13:23:24.986952947Z] delContainerNameResolution fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab mystack_job.6.n9ein9uqdaal1iz9ofnrlp5mj
DEBU[2020-04-21T13:23:24.987067454Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(mystack_job.6.n9ein9uqdaal1iz9ofnrlp5mj, 10.0.1.69, <nil>, true) rmServiceBinding sid:fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.987134811Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(745ed888b6fc, 10.0.1.69, <nil>, true) rmServiceBinding sid:fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.987266665Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(tasks.mystack_job, 10.0.1.69, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.987487926Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(tasks.job, 10.0.1.69, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.987826764Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(mystack_job, 10.0.1.2, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.988210127Z] fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab (t1holdw).deleteSvcRecords(job, 10.0.1.2, <nil>, false) rmServiceBinding sid:uq7b0h3v6ghf9t6su2svo3o6r
DEBU[2020-04-21T13:23:24.988428037Z] rmServiceBinding from deleteServiceInfoFromCluster END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:24.988553281Z] deleteServiceInfoFromCluster from sbLeave END for mystack_job fbfa7ee83a20d767534e9494dcfe93ed47d123597fd60094a9c0723ee43470ab
DEBU[2020-04-21T13:23:25.032479740Z] Revoking external connectivity on endpoint gateway_c8a85dc2ca97 (1852b1637272d761266368a9aaa7ae92db121dba94dd081f7d05de816eccf998)
DEBU[2020-04-21T13:23:25.034617553Z] DeleteConntrackEntries purged ipv4:0, ipv6:0
DEBU[2020-04-21T13:23:25.036293286Z] (*worker).Update                              len(assignments)=30 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036400099Z] (*worker).reconcileSecrets                    len(removedSecrets)=0 len(updatedSecrets)=0 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036425854Z] (*worker).reconcileConfigs                    len(removedConfigs)=0 len(updatedConfigs)=0 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036762896Z] (*worker).reconcileTaskState                  len(removedTasks)=25 len(updatedTasks)=5 module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb
DEBU[2020-04-21T13:23:25.036865221Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=41ctnmnrtlxmy0y25zeypuvbb
DEBU[2020-04-21T13:23:25.036991513Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=uve1k8wnxu6f9aqjeyyaj5rby
DEBU[2020-04-21T13:23:25.037227835Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=dexskffhpf6kdg3fcq5425ly1
DEBU[2020-04-21T13:23:25.037288598Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=n9ein9uqdaal1iz9ofnrlp5mj
DEBU[2020-04-21T13:23:25.037554188Z] assigned                                      module=node/agent node.id=cof80xs5o119fw86iyoj4g1eb task.desiredstate=REMOVE task.id=fwdyyzse2za6250edat9ipb8m
DEBU[2020-04-21T13:23:25.037741656Z] Could not find network sandbox for container mystack_job.1.41ctnmnrtlxmy0y25zeypuvbb on service binding deactivation request
DEBU[2020-04-21T13:23:25.037855034Z] Could not find network sandbox for container mystack_job.3.jzdojq6rhvjstq497ivin4t6f on service binding deactivation request
DEBU[2020-04-21T13:23:25.039183320Z] Could not find network sandbox for container mystack_job.1.858lwuzkldxpnsitrddc3xufv on service binding deactivation request
DEBU[2020-04-21T13:23:25.038237993Z] Could not find network sandbox for container mystack_job.6.mhnw31hqedi0suglvi6hlagqf on service binding deactivation request
DEBU[2020-04-21T13:23:25.038373469Z] Could not find network sandbox for container mystack_job.2.yrouuun8qlcteih936149fvmf on service binding deactivation request
DEBU[2020-04-21T13:23:25.038400634Z] Could not find network sandbox for container mystack_job.5.5ji21r2vthmqld15zandztqz6 on service binding deactivation request
DEBU[2020-04-21T13:23:25.038465065Z] Could not find network sandbox for container mystack_job.6.oruzoa6aps23e5vfo4ga1winl on service binding deactivation request
DEBU[2020-04-21T13:23:25.039286354Z] Could not find network sandbox for container mystack_job.6.b5lwk0fovf9693pglumovqd1g on service binding deactivation request
DEBU[2020-04-21T13:23:25.038113392Z] Could not find network sandbox for container mystack_job.1.a526t4vylb020ailytz3ec1uo on service binding deactivation request
DEBU[2020-04-21T13:23:25.038690939Z] Could not find network sandbox for container mystack_job.3.lgd2xt92l7y05tcu291ho0xkq on service binding deactivation request
DEBU[2020-04-21T13:23:25.038730482Z] Could not find network sandbox for container mystack_job.1.cgmzrchzdewstm5yzk2xh80fw on service binding deactivation request
DEBU[2020-04-21T13:23:25.038734658Z] Could not find network sandbox for container mystack_job.2.n0g0rb4pvuuchel57owaihqqx on service binding deactivation request

@silvin-lubecki (Contributor) commented:

Tried with multiple replicas:

$ docker service create --name test --replicas=10 --max-concurrent=2  --mode=replicated-job bash true
idja6562brblp995gnq7ps78i
job progress: 10 out of 10 complete [==================================================>]
active tasks: 0 out of 0 tasks
1/10: complete  [==================================================>]
2/10: complete  [==================================================>]
3/10: complete  [==================================================>]
4/10: complete  [==================================================>]
5/10: complete  [==================================================>]
6/10: complete  [==================================================>]
7/10: complete  [==================================================>]
8/10: complete  [==================================================>]
9/10: complete  [==================================================>]
10/10: complete  [==================================================>]
job complete

Then tried a ps

$ docker service ps test
ID                  NAME                             IMAGE                NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
tti15rwuoglx        test.1                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
u0ef31o2mgyq        test.2                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
avvo2zxxysu8        test.3                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
l022qhhmz148        test.4                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
u3sc4nht7va9        test.5                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
zsgfzezhit0y        test.6                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
t1dx8jej05lq        test.7                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
teon4syxhul9        test.8                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
e7qb4wn8ew7h        test.9                           hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago
lankjaond9iy        test.9h335bnxjxm95yf2n8uuz4kk3   hello-world:latest   4d35eab424f4        Complete            Complete 7 minutes ago

This part of the UI/UX looks nice so far 👍

@thaJeztah (Member) commented:

Trying with a long-running container as a job; creation (as expected) keeps waiting for the job to complete, so I had to CTRL-C;

docker service create --mode=replicated-job --name=longy nginx:alpine
qx9ol95pxk0i7ztz5291wspc8
job progress: 0 out of 1 complete [>                                                  ]
active tasks: 1 out of 1 tasks
1/1: running   [=============================================>     ]
^COperation continuing in background.

After that, I killed the container:

docker kill fdad641d54a6

Looking at docker service ls

docker service ls
ID                  NAME                MODE                REPLICAS                IMAGE               PORTS
qx9ol95pxk0i        longy               replicated job      1/1 (0/1 completed)     nginx:alpine

I can see a new task was created for the service

docker service ps longy
ID                  NAME                                  IMAGE               NODE                DESIRED STATE       CURRENT STATE           ERROR                         PORTS
3fwmn61dybah        longy.cof80xs5o119fw86iyoj4g1eb       nginx:alpine        1db0546f51e6        Complete            Running 4 minutes ago
l0qiutahfdmr         \_ longy.cof80xs5o119fw86iyoj4g1eb   nginx:alpine        1db0546f51e6        Shutdown            Failed 4 minutes ago    "task: non-zero exit (137)"

Is it expected that a new task is created if one fails, or should it terminate the job and mark it as "failed"?

c8wgl7q4ndfd frontend replicated 5/5 nginx:alpine
dmu1ept4cxcf redis replicated 3/3 redis:3.0.6
iwe3278osahj mongo global 7/7 mongo:3.3
hh08h9uu8uwr job replicated-job 1/1 (3/5 completed) nginx:latest

Member: is 1/1 correct here?


Contributor (Author): Yes. It implies that 1 task is still running, 3 tasks are completed, and 5 tasks are desired. This would imply the job is running 5 iterations one after another.


Jobs are a special kind of service designed to run an operation to completion
and then stop, as opposed to running long-running daemons. When a Task
belonging to a job exits successfully (return value 0), the Task is marked as

@thaJeztah (Member) commented Apr 21, 2020:
See my other comment; do we want failed tasks to be started / replaced / tried again by default? Or should it have --restart-condition=none ?

@thaJeztah (Member) commented:

One thing I'm thinking of; should we have an alias (on the CLI) for replicated-job to allow people to just set --mode=job (which is the same as --mode=replicated-job)?

@thaJeztah (Member) commented:

I'm overall good with the current UX. I think that fully separating "jobs" from regular services would not be possible (because they share the same constructs under the hood). That said; it would be possible to add docker job subcommands in future (or just write a simple plugin for this). When doing so, we should (I think) not try to hide that jobs are services (just show both in docker service ps, and just make docker job create a shorthand / convenience function).

Some things I think should be addressed:

  • better call out that the default is --restart-condition=failure, so users must set a custom restart policy if they do not want a job to run multiple times when it fails (see the sketch after this list)
  • perhaps discuss the --mode=job alias (open to input)
  • ideally we'd have docker stack deploy working with this (but could be added in future if it's problematic to get working)
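
To make the first point concrete, a minimal sketch (an illustrative command, not taken from the PR's documentation):

```bash
# A failing job task is retried by default; with --restart-condition none
# each task runs at most once, so a failed run is not re-attempted.
docker service create --name one-shot-job \
  --mode replicated-job \
  --restart-condition none \
  bash true
```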

@thaJeztah (Member) commented:

Discussing with @tonistiigi and @cpuguy83: haven't checked yet, but we need to check what happens if I create a job with --restart-condition=any (is it rejected, or OK? Because that would make it exactly the same as a regular service)

@dperny (Contributor, Author) commented Apr 24, 2020

  1. The problem where completed jobs are showing 0/4 (4/4 Completed) is actually a bit of a bug in Swarmkit. In the ServiceStatus, Swarmkit should not be setting the denominator to MaxReplicas, but should instead set it to the lesser of MaxReplicas or TotalCompletions - CompletedTasks (sketched after this list). It's an easy fix, but it's not in this code.
  2. docker service scale should be usable with jobs, and the fact that it's not is a consequence of me overlooking it.
  3. Compose support for jobs isn't in this PR. I was going to open a second PR with compose support. I can add it to this PR if desired.
  4. It is expected that, if a Task fails, a new task should be spawned, until the desired number of completions is reached. The exception should be if --restart-condition=none is set.
  5. RestartOnAny is treated the same as RestartOnFailure if the service is a Job. This needs to be added both to the documentation here and to the swagger docs in the main repo, actually. This behavior isn't accidental; it was a deliberate decision (IIRC, it was part of the jobs design spec).
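
A sketch of the arithmetic in point 1 (illustrative shell only; not the actual Swarmkit code):

```bash
# Intended denominator for a job's REPLICAS column, per point 1:
#   min(MaxReplicas, TotalCompletions - CompletedTasks)
total_completions=4; completed_tasks=4; max_replicas=4
remaining=$((total_completions - completed_tasks))
denominator=$(( remaining < max_replicas ? remaining : max_replicas ))
echo "$denominator"   # prints 0, so a finished job would no longer show "0/4"
```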

@dperny (Contributor, Author) commented Apr 24, 2020

I'm opposed to the alias of --mode=job for --mode=replicated-job primarily because it makes the documentation unwieldy.

* Added two new modes accepted by the `--mode` flag
  * `replicated-job` creates a replicated job
  * `global-job` creates a global job.
* When using `replicated-job` mode, the `replicas` flag sets the
  `TotalCompletions` parameter of the job. This is the total number of
  tasks that will run
* Added a new flag, `max-concurrent`, for use with `replicated-job`
  mode. This flag sets the `MaxConcurrent` parameter of the job, which
  is the maximum number of replicas the job will run simultaneously.
* When using `replicated-job` or `global-job` mode, using any of the
  update parameter flags will result in an error, as jobs cannot be
  updated in the traditional sense.
* Updated the `docker service ls` UI to include the completion status
  (completed vs total tasks) if the service is a job.
* Updated the progress bars UI for service creation and update to
  support jobs. For jobs, there is displayed a bar covering the overall
  progress of the job (the number of tasks completed over the total
  number of tasks to complete).
* Added documentation explaining the use of the new flags, and of jobs
  in general.

Signed-off-by: Drew Erny <derny@mirantis.com>
@thaJeztah (Member) commented:

Sorry for the delay (again)

> Compose support for jobs isn't in this PR. I was going to open a second PR with compose support. I can add it to this PR if desired.

I think it's ok to leave it out of compose for now; we should perhaps consider if we want it to be a separate "entity" inside compose files (instead of just an option for mode?)

> It is expected that, if a Task fails, a new task should be spawned, until the desired number of completions is reached. The exception should be if --restart-condition=none is set.

There's something to be said for both sides; either I want (e.g.) a migration to run (but don't try to run it again if it failed), or have a guarantee that all my jobs will at least continue until completed.

I think it's ok in the current implementation, as long as we're explicit about this in the documentation so that users are not caught by surprise

> I'm opposed to the alias of --mode=job for --mode=replicated-job primarily because it makes the documentation unwieldy

👍 mostly me thinking out loud; it could also be easily added in the future if there's a strong need for it, so no blocker from my perspective

@thaJeztah (Member) commented:

@dperny I see you pushed after my previous comment; were there specific things you addressed/changed?

@dperny (Contributor, Author) commented May 1, 2020

Yes. I fixed docker service scale to work with replicated jobs, and I reworded some docs to address general comments (although I can't now remember exactly what I reworded; I think it had to do with restart-condition).

@silvin-lubecki (Contributor) left a review: LGTM, thanks a lot for that PR @dperny 🎉

@thaJeztah (Member) left a review: LGTM. Let's merge; we can tweak docs later if needed 👍

Thanks @dperny!

@thaJeztah thaJeztah merged commit 4f05814 into docker:master May 6, 2020
@thaJeztah thaJeztah added this to the 20.03.0 milestone Jun 10, 2020
@silvin-lubecki silvin-lubecki mentioned this pull request Aug 17, 2020
@Ohtar10 commented Oct 21, 2020

This is really good stuff!

Question: I see some discussion w.r.t. compose support; would that have its own PR too, then? Would it make it into a 20.x.0 release?

Thanks!

@thaJeztah (Member) commented:

@Ohtar10 compose support has not been added yet. Some discussion may be needed on whether we implement this as a "mode" for services, or add a new "jobs" top-level property. Perhaps "mode" could be implemented as a (temporary?) solution, but it may need some work; #2262 (comment)

(contributions should be welcome though!)

@Ohtar10 commented Oct 22, 2020

Cool, thanks for the reply @thaJeztah.

As an end-user, and after trying the feature from the test channel, I would say that if jobs keep being part of the docker service create command, then the compose file should handle them the same way for the sake of consistency, i.e. as a "mode", at least as a temporary solution, as you mention.

If the feature evolves into something more elaborate, e.g. cron jobs, additional configurable properties, etc., then I think it would make sense to have a top-level property not only in the compose file but at the Docker CLI level as well, e.g. docker job create.
