task image get and task are on different workers #6218

Closed
evanchaoli opened this issue Oct 27, 2020 · 7 comments · Fixed by #6495

@evanchaoli
Contributor

evanchaoli commented Oct 27, 2020

Summary

I found this problem in the same test as #6217. See the screenshot:

[Screenshot: task-get-different-worker]

We can see that the task's image get was done on a different worker than the worker where the task ran.

I used "random" container placement strategy. But image fetching should be done on the worker where the task is going to run, right? @vito

Steps to reproduce

Expected results

Actual results

Additional context

Triaging info

  • Concourse version:
  • Browser (if applicable):
  • Did this used to work?
@evanchaoli evanchaoli added the bug label Oct 27, 2020
@vito
Member

vito commented Oct 27, 2020

Yeah, image fetching used to happen on the same worker. It was a very hairy code path that duplicated a lot of the logic from regular check/get, which this refactor cleaned up, but this is one trade-off of the decoupling.

The upside of the decoupling is that it should make it a lot easier to implement runtimes like Kubernetes, because now everything really just flows through the exec check/get/put/task steps.

The new logic should already use the image cache if it already exists on the same worker the task chose (same as with normal get steps and caching semantics). So there's a chance that it fetches on a different worker and then uses a local cache anyway. We could accelerate the cache warming by marking the volume that we stream over as a cache too, which is a feature request that comes up pretty often.

One alternative could be to explicitly choose a worker first and then specify the worker in the GetPlan, but that feels kind of hairy. None of the Plan types currently have such low-level configuration, and it might not make as much sense with other runtimes.
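For illustration only, a hypothetical sketch of that rejected alternative. The struct below is a simplified stand-in for the real plan type, and the Worker field is an assumption that does not exist today:

```go
// Hypothetical sketch only, not the real atc.GetPlan: what it might look like
// if a Plan type carried low-level worker placement. Field names are
// illustrative assumptions.
package plansketch

type GetPlan struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	Resource string `json:"resource,omitempty"`

	// Worker would pin the image get to the worker already chosen for the
	// task. No existing Plan type carries placement configuration like this,
	// which is part of why the option feels hairy.
	Worker string `json:"worker,omitempty"`
}
```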

@evanchaoli
Contributor Author

evanchaoli commented Nov 20, 2020

@vito Recently I read through your refactored code and did some tests. Now I feel that marking a streamed volume as a cache is something we must do; otherwise, I'm afraid v7 will have a performance regression. For example, say there are 3 workers and the job has only a put step:

  1. Build 1: the put step is assigned to worker2, and the image get is assigned to worker1.
  2. After the image is fetched on worker1, it is streamed to worker2, then build 1's put runs on worker2.
  3. Build 2: the put step is assigned to worker2 again, and the image get is assigned to worker3.
  4. The image is fetched on worker3 again and streamed again from worker3 to worker2.

This process sounds a little weird, since in build 1 the image has already been streamed to worker2. Compared to pre-v7, this process consumes more resources:

  • 1 more image get
  • 2 more image streams
  • 1 more image cache's worth of space on worker2

If the put sticks to worker2 (for example via a tag), then the more workers there are, the more resources are wasted.

So I am thinking of several enhancements (a rough sketch of point 2 follows the list):

  1. After a volume is streamed to another worker, clone its tuple in the volumes table, with the other worker's name in the cloned tuple. Currently, handle is unique in the volumes table; we would need to change the uniqueness constraint to worker+handle.
  2. For the get step, if a version has already been fetched on another worker, we can just stream the volume to the current worker instead of running a real get. Especially with p2p streaming (but even without it), streaming a volume from another worker should be cheaper than running a real get. Also, combined with 1, we may find multiple workers to stream from, so it's possible to choose an idle worker as the source and retry with another if one fails. This benefits us in multiple ways:
    1. it improves the get step's stability;
    2. it reduces the number of get containers, which improves Concourse performance;
    3. it reduces third-party systems' workload, e.g. fewer git clones from git servers. This idea is similar to global resources, but it also applies to non-global-resource mode.
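A rough sketch of point 2, with hypothetical stand-in interfaces rather than Concourse's real worker and volume types: before running a real get, look for the same resource cache on another worker, stream it over, and fall back to the next candidate if one source fails.

```go
// Sketch only: Volume and Worker are placeholder interfaces, not the real
// Concourse abstractions.
package getsketch

import (
	"context"
	"errors"
)

type Volume interface {
	StreamTo(ctx context.Context, dest Worker) error
}

type Worker interface {
	Name() string
	// FindResourceCacheVolume reports whether this worker already holds a
	// volume for the given resource cache.
	FindResourceCacheVolume(ctx context.Context, cacheID int) (Volume, bool, error)
}

// fetchViaStreaming tries each candidate source worker in turn; it returns an
// error only if no worker could stream the cached volume, in which case the
// caller would fall back to running the real get step.
func fetchViaStreaming(ctx context.Context, cacheID int, dest Worker, candidates []Worker) error {
	for _, src := range candidates {
		if src.Name() == dest.Name() {
			continue // already local; nothing to stream
		}
		vol, found, err := src.FindResourceCacheVolume(ctx, cacheID)
		if err != nil || !found {
			continue // try the next candidate (e.g. a less busy worker)
		}
		if err := vol.StreamTo(ctx, dest); err != nil {
			continue // retry with another source worker
		}
		return nil
	}
	return errors.New("no worker could stream the cached volume; run the real get")
}
```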

WDYT?

@vito
Member

vito commented Nov 24, 2020

@evanchaoli Agreed - I think this refactor really forces the issue, and it's worth addressing now.

After a volume is streamed to another worker, clone its tuple in the volumes table, with the other worker's name in the cloned tuple. Currently, handle is unique in the volumes table; we would need to change the uniqueness constraint to worker+handle.

I think it might be easier to call InitializeResourceCache on the streamed volume instead - then we don't have to change how handles work. That should already allow the volume to be found by later steps on different workers based on the existing volume lookup rules. It's not as general as the handles change, but I think resource caches are the most important thing to propagate as they last beyond the duration of the build. (Task caches do too, but those won't be propagating from step to step in the first place.)
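As a minimal sketch of that idea (placeholder interfaces, not the actual db package types): after streaming a volume to the destination worker, register it as a resource cache there so later steps on that worker find it through the normal cache lookup, without changing how handles work.

```go
// Sketch only: ResourceCache and CreatedVolume are stand-ins for Concourse's
// db.UsedResourceCache and db.CreatedVolume.
package cachesketch

import "context"

type ResourceCache interface{ ID() int }

type CreatedVolume interface {
	// InitializeResourceCache marks the volume as a cache for the given
	// resource cache on the volume's own worker.
	InitializeResourceCache(cache ResourceCache) error
}

type StreamFunc func(ctx context.Context) (CreatedVolume, error)

// streamAndCache streams the source volume to the destination worker and then
// marks the resulting volume as a resource cache there.
func streamAndCache(ctx context.Context, stream StreamFunc, cache ResourceCache) (CreatedVolume, error) {
	destVol, err := stream(ctx)
	if err != nil {
		return nil, err
	}
	if err := destVol.InitializeResourceCache(cache); err != nil {
		return nil, err
	}
	return destVol, nil
}
```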

For the get step, if a version has already been fetched on another worker, we can just stream the volume to the current worker instead of running a real get. Especially with p2p streaming (but even without it), streaming a volume from another worker should be cheaper than running a real get. Also, combined with 1, we may find multiple workers to stream from, so it's possible to choose an idle worker as the source and retry with another if one fails. This benefits us in multiple ways:

Yeah this sounds worth trying; it'd work really well once we have the above behavior. 👍

@evanchaoli
Contributor Author

@vito Thanks for your reply. I plan to create some PoC code for these enhancements.

After a volume is streamed to another worker, clone its tuple in the volumes table, with the other worker's name in the cloned tuple. Currently, handle is unique in the volumes table; we would need to change the uniqueness constraint to worker+handle.

I think it might be easier to call InitializeResourceCache on the streamed volume instead - then we don't have to change how handles work. That should already allow the volume to be found by later steps on different workers based on the existing volume lookup rules. It's not as general as the handles change, but I think resource caches are the most important thing to propagate as they last beyond the duration of the build. (Task caches do too, but those won't be propagating from step to step in the first place.)

Yeah, when a volume is streamed, the destination handle is different from the source handle, so handles will always be unique in the volumes table. Instead of calling InitializeResourceCache, can we create a new method, db.CloneVolumeToOtherWorker? It would just clone the source volume tuple, updating a few fields: handle, worker, worker_base_resource_id, etc. This approach is more generic; it could handle all types of volumes.
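A purely hypothetical sketch of what such a method might look like. Neither db.CloneVolumeToOtherWorker nor this simplified column list exists in Concourse today; it only illustrates the "copy the tuple, override the worker-specific fields" idea.

```go
// Hypothetical sketch of the proposed db.CloneVolumeToOtherWorker; the column
// list is illustrative, not the actual Concourse volumes schema.
package clonesketch

import "database/sql"

// CloneVolumeToOtherWorker inserts a copy of the source volume's row with a
// new handle and the destination worker's name, leaving other columns intact.
func CloneVolumeToOtherWorker(tx *sql.Tx, srcHandle, newHandle, destWorker string) error {
	_, err := tx.Exec(`
		INSERT INTO volumes (handle, worker_name, state, size_in_bytes)
		SELECT $2, $3, state, size_in_bytes
		FROM volumes
		WHERE handle = $1
	`, srcHandle, newHandle, destWorker)
	return err
}
```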

@vito
Member

vito commented Nov 24, 2020

@evanchaoli I don't think that's necessary because the handle doesn't matter when querying for resource cache volumes and resource caches are already treated as equivalent even if the handles are different. I think it's simpler to keep handles globally unique.

If you have a use case for supporting other types of volumes, maybe it could be done by adding a cloned_volume_handle column which refers to the parent? But I'd like to know the use case first. The only other type of volume that goes from worker to worker would be a task output, so it seems like it would only be to handle a case where an output streams to multiple subsequent steps which selected the same worker. Is that what you had in mind?

@evanchaoli
Contributor Author

@vito I guess you misunderstood me; maybe I didn't state it clearly.

I have realized that the handle doesn't matter when finding a resource cache, and a streamed volume has a different handle than the source volume, so handles remain globally unique; I'm not going to change that. What I meant by "clone" is: after streaming a volume, create a new tuple in the volumes table, set the new volume's handle and worker name, and copy the rest of the fields from the source volume's tuple. This way, we don't need to distinguish between volume types when marking a streamed volume as a cache.

@vito
Member

vito commented Nov 24, 2020

@evanchaoli Ah! Yeah, sorry, I misunderstood.

I think the problem with that approach is that the remaining columns to be 'cloned' - worker_resource_cache, worker_base_resource_type_id, worker_task_cache, etc. - themselves reference a worker. So if you were to clone them directly you'd have a volume that claims to be on worker A but then has columns that are tied to worker B. Whereas if you call InitializeResourceCache it will create the worker resource cache on worker A and point to it for you. Cloning the values directly might work to some extent, but I think it would result in strange behavior elsewhere. 🤔

For example, when a worker goes away, the worker_resource_cache rows that reference it are removed, which cascades to garbage-collecting the volumes. So if a volume is cloned from worker B to worker A and then worker B goes away, this would result in removing the caches that were cloned to worker A, too. Combined with having get effectively fetch once per cluster, I think this would result in a lot more churn as a result of a single worker going away. Does that make sense?
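For illustration, a simplified schema sketch of the cascade in question; this is an assumption about the shape of the tables, not the actual Concourse schema. Deleting worker B removes its worker_resource_caches rows, which in turn removes any volumes (including ones naively cloned onto worker A) that still point at B's caches.

```go
// Illustrative DDL only, embedded as a Go constant; not the real schema.
package schemasketch

const illustrativeDDL = `
CREATE TABLE workers (
    name text PRIMARY KEY
);

CREATE TABLE worker_resource_caches (
    id serial PRIMARY KEY,
    worker_name text NOT NULL REFERENCES workers (name) ON DELETE CASCADE,
    resource_cache_id integer NOT NULL
);

CREATE TABLE volumes (
    handle text PRIMARY KEY,
    worker_name text NOT NULL REFERENCES workers (name) ON DELETE CASCADE,
    worker_resource_cache_id integer
        REFERENCES worker_resource_caches (id) ON DELETE CASCADE
);
`
```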
