task image get and task are on different workers #6218

Closed
evanchaoli opened this issue Oct 27, 2020 · 7 comments · Fixed by #6495

@evanchaoli
Contributor

evanchaoli commented Oct 27, 2020

Summary

I found this problem in the same test as #6217. See the screenshot:

[Screenshot: task-get-different-worker]

We can see that the task's image get was done on a different worker than the worker where the task ran.

I used "random" container placement strategy. But image fetching should be done on the worker where the task is going to run, right? @vito

Steps to reproduce

Expected results

Actual results

Additional context

Triaging info

  • Concourse version:
  • Browser (if applicable):
  • Did this used to work?
@evanchaoli evanchaoli added the bug label Oct 27, 2020
@vito
Member

vito commented Oct 27, 2020

Yeah, image fetching used to happen on the same worker. It was a very hairy code path that duplicated a lot of the logic from regular check/get, which this refactor cleaned up, but this is one trade-off of the decoupling.

The upside of the decoupling is that it should make it a lot easier to implement runtimes like Kubernetes, because now everything really just flows through the exec check/get/put/task steps.

The new logic should already use the image cache if it already exists on the same worker the task chose (same as with normal get steps and caching semantics). So there's a chance that it fetches on a different worker and then uses a local cache anyway. We could accelerate the cache warming by marking the volume that we stream over as a cache too, which is a feature request that comes up pretty often.

One alternative could be to explicitly choose a worker first and then specify the worker in the GetPlan, but that feels kind of hairy. None of the Plan types currently have such low-level configuration, and it might not make as much sense with other runtimes.
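For illustration only, a hypothetical sketch of that rejected alternative. The struct below is a simplified stand-in for the real plan type, and the Worker field is an assumption that does not exist today:

```go
// Hypothetical sketch only, not the real atc.GetPlan: what it might look like
// if a Plan type carried low-level worker placement. Field names are
// illustrative assumptions.
package plansketch

type GetPlan struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	Resource string `json:"resource,omitempty"`

	// Worker would pin the image get to the worker already chosen for the
	// task. No existing Plan type carries placement configuration like this,
	// which is part of why the option feels hairy.
	Worker string `json:"worker,omitempty"`
}
```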

@evanchaoli
Contributor Author

evanchaoli commented Nov 20, 2020

@vito Recently I read through your refactored code and did some tests. Now I feel that marking a streamed volume as a cache is something we must do; otherwise, I'm afraid v7 will have a performance regression. For example, say there are 3 workers and the job has only a put step:

  1. Build 1: the put step is assigned to worker2, and the image get is assigned to worker1.
  2. After the image is fetched on worker1, it is streamed to worker2, then build 1's put runs on worker2.
  3. Build 2: the put step is assigned to worker2 again, and the image get is assigned to worker3.
  4. The image is fetched on worker3 again and streamed again from worker3 to worker2.

This process sounds a little weird, since in build 1 the image has already been streamed to worker2. Compared to pre-v7, this process consumes more resources:

  • 1 more image get
  • 2 more image streams
  • 1 more image cache's worth of space on worker2

If the put sticks to worker2 (for example via a tag), then the more workers there are, the more resources are wasted.

So I am thinking of several enhancements (a rough sketch of point 2 follows the list):

  1. After a volume is streamed to another worker, clone its tuple in the volumes table, with the other worker's name in the cloned tuple. Currently, handle is unique in the volumes table; we would need to change the uniqueness constraint to worker+handle.
  2. For the get step, if a version has already been fetched on another worker, we can just stream the volume to the current worker instead of running a real get. Especially with p2p streaming (but even without it), streaming a volume from another worker should be cheaper than running a real get. Also, combined with 1, we may find multiple workers to stream from, so it's possible to choose an idle worker as the source and retry with another if one fails. This benefits us in multiple ways:
    1. it improves the get step's stability;
    2. it reduces the number of get containers, which improves Concourse performance;
    3. it reduces third-party systems' workload, e.g. fewer git clones from git servers. This idea is similar to global resources, but it also applies to non-global-resource mode.
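A rough sketch of point 2, with hypothetical stand-in interfaces rather than Concourse's real worker and volume types: before running a real get, look for the same resource cache on another worker, stream it over, and fall back to the next candidate if one source fails.

```go
// Sketch only: Volume and Worker are placeholder interfaces, not the real
// Concourse abstractions.
package getsketch

import (
	"context"
	"errors"
)

type Volume interface {
	StreamTo(ctx context.Context, dest Worker) error
}

type Worker interface {
	Name() string
	// FindResourceCacheVolume reports whether this worker already holds a
	// volume for the given resource cache.
	FindResourceCacheVolume(ctx context.Context, cacheID int) (Volume, bool, error)
}

// fetchViaStreaming tries each candidate source worker in turn; it returns an
// error only if no worker could stream the cached volume, in which case the
// caller would fall back to running the real get step.
func fetchViaStreaming(ctx context.Context, cacheID int, dest Worker, candidates []Worker) error {
	for _, src := range candidates {
		if src.Name() == dest.Name() {
			continue // already local; nothing to stream
		}
		vol, found, err := src.FindResourceCacheVolume(ctx, cacheID)
		if err != nil || !found {
			continue // try the next candidate (e.g. a less busy worker)
		}
		if err := vol.StreamTo(ctx, dest); err != nil {
			continue // retry with another source worker
		}
		return nil
	}
	return errors.New("no worker could stream the cached volume; run the real get")
}
```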

WDYT?

@vito
Member

vito commented Nov 24, 2020

@evanchaoli Agreed - I think this refactor really forces the issue, and it's worth addressing now.

After a volume is streamed to another worker, clone its tuple in the volumes table, with the other worker's name in the cloned tuple. Currently, handle is unique in the volumes table; we would need to change the uniqueness constraint to worker+handle.

I think it might be easier to call InitializeResourceCache on the streamed volume instead - then we don't have to change how handles work. That should already allow the volume to be found by later steps on different workers based on the existing volume lookup rules. It's not as general as the handles change, but I think resource caches are the most important thing to propagate as they last beyond the duration of the build. (Task caches do too, but those won't be propagating from step to step in the first place.)
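As a minimal sketch of that idea (placeholder interfaces, not the actual db package types): after streaming a volume to the destination worker, register it as a resource cache there so later steps on that worker find it through the normal cache lookup, without changing how handles work.

```go
// Sketch only: ResourceCache and CreatedVolume are stand-ins for Concourse's
// db.UsedResourceCache and db.CreatedVolume.
package cachesketch

import "context"

type ResourceCache interface{ ID() int }

type CreatedVolume interface {
	// InitializeResourceCache marks the volume as a cache for the given
	// resource cache on the volume's own worker.
	InitializeResourceCache(cache ResourceCache) error
}

type StreamFunc func(ctx context.Context) (CreatedVolume, error)

// streamAndCache streams the source volume to the destination worker and then
// marks the resulting volume as a resource cache there.
func streamAndCache(ctx context.Context, stream StreamFunc, cache ResourceCache) (CreatedVolume, error) {
	destVol, err := stream(ctx)
	if err != nil {
		return nil, err
	}
	if err := destVol.InitializeResourceCache(cache); err != nil {
		return nil, err
	}
	return destVol, nil
}
```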

For the get step, if a version has already been fetched on another worker, we can just stream the volume to the current worker instead of running a real get. Especially with p2p streaming (but even without it), streaming a volume from another worker should be cheaper than running a real get. Also, combined with 1, we may find multiple workers to stream from, so it's possible to choose an idle worker as the source and retry with another if one fails. This benefits us in multiple ways:

Yeah this sounds worth trying; it'd work really well once we have the above behavior. 👍

@evanchaoli
Contributor Author

@vito Thanks for your reply. I plan to create some PoC code for these enhancements.

After a volume is streamed to another worker, clone its tuple in the volumes table, with the other worker's name in the cloned tuple. Currently, handle is unique in the volumes table; we would need to change the uniqueness constraint to worker+handle.

I think it might be easier to call InitializeResourceCache on the streamed volume instead - then we don't have to change how handles work. That should already allow the volume to be found by later steps on different workers based on the existing volume lookup rules. It's not as general as the handles change, but I think resource caches are the most important thing to propagate as they last beyond the duration of the build. (Task caches do too, but those won't be propagating from step to step in the first place.)

Yeah, when a volume is streamed, the destination handle is different from the source handle, so handles will always be unique in the volumes table. Instead of calling InitializeResourceCache, can we create a new method, db.CloneVolumeToOtherWorker? It would just clone the source volume tuple, updating a few fields: handle, worker, worker_base_resource_id, etc. This approach is more generic; it could handle all types of volumes.
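A purely hypothetical sketch of what such a method might look like. Neither db.CloneVolumeToOtherWorker nor this simplified column list exists in Concourse today; it only illustrates the "copy the tuple, override the worker-specific fields" idea.

```go
// Hypothetical sketch of the proposed db.CloneVolumeToOtherWorker; the column
// list is illustrative, not the actual Concourse volumes schema.
package clonesketch

import "database/sql"

// CloneVolumeToOtherWorker inserts a copy of the source volume's row with a
// new handle and the destination worker's name, leaving other columns intact.
func CloneVolumeToOtherWorker(tx *sql.Tx, srcHandle, newHandle, destWorker string) error {
	_, err := tx.Exec(`
		INSERT INTO volumes (handle, worker_name, state, size_in_bytes)
		SELECT $2, $3, state, size_in_bytes
		FROM volumes
		WHERE handle = $1
	`, srcHandle, newHandle, destWorker)
	return err
}
```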

@vito
Member

vito commented Nov 24, 2020

@evanchaoli I don't think that's necessary because the handle doesn't matter when querying for resource cache volumes and resource caches are already treated as equivalent even if the handles are different. I think it's simpler to keep handles globally unique.

If you have a use case for supporting other types of volumes, maybe it could be done by adding a cloned_volume_handle column which refers to the parent? But I'd like to know the use case first. The only other type of volume that goes from worker to worker would be a task output, so it seems like it would only be to handle a case where an output streams to multiple subsequent steps which selected the same worker. Is that what you had in mind?

@evanchaoli
Contributor Author

@vito I guess you misunderstood me; maybe I didn't state it clearly.

I have realized that the handle doesn't matter when finding a resource cache, and a streamed volume has a different handle than the source volume, so handles remain globally unique; I'm not going to change that. What I meant by "clone" is: after streaming a volume, create a new tuple in the volumes table, set the new volume's handle and worker name, and copy the rest of the fields from the source volume's tuple. This way, we don't need to distinguish between volume types when marking a streamed volume as a cache.

@vito
Member

vito commented Nov 24, 2020

@evanchaoli Ah! Yeah, sorry, I misunderstood.

I think the problem with that approach is that the remaining columns to be 'cloned' - worker_resource_cache, worker_base_resource_type_id, worker_task_cache, etc. - themselves reference a worker. So if you were to clone them directly you'd have a volume that claims to be on worker A but then has columns that are tied to worker B. Whereas if you call InitializeResourceCache it will create the worker resource cache on worker A and point to it for you. Cloning the values directly might work to some extent, but I think it would result in strange behavior elsewhere. 🤔

For example, when a worker goes away, the worker_resource_cache rows that reference it are removed, which cascades to garbage-collecting the volumes. So if a volume is cloned from worker B to worker A and then worker B goes away, this would result in removing the caches that were cloned to worker A, too. Combined with having get effectively fetch once per cluster, I think this would result in a lot more churn as a result of a single worker going away. Does that make sense?
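For illustration, a simplified schema sketch of the cascade in question; this is an assumption about the shape of the tables, not the actual Concourse schema. Deleting worker B removes its worker_resource_caches rows, which in turn removes any volumes (including ones naively cloned onto worker A) that still point at B's caches.

```go
// Illustrative DDL only, embedded as a Go constant; not the real schema.
package schemasketch

const illustrativeDDL = `
CREATE TABLE workers (
    name text PRIMARY KEY
);

CREATE TABLE worker_resource_caches (
    id serial PRIMARY KEY,
    worker_name text NOT NULL REFERENCES workers (name) ON DELETE CASCADE,
    resource_cache_id integer NOT NULL
);

CREATE TABLE volumes (
    handle text PRIMARY KEY,
    worker_name text NOT NULL REFERENCES workers (name) ON DELETE CASCADE,
    worker_resource_cache_id integer
        REFERENCES worker_resource_caches (id) ON DELETE CASCADE
);
`
```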
