Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volume-locality strategy should ignore inputs of cached resource #8070

Closed
evanchaoli opened this issue Feb 14, 2022 · 6 comments · Fixed by #8061
Closed

Volume-locality strategy should ignore inputs of cached resource #8070

evanchaoli opened this issue Feb 14, 2022 · 6 comments · Fixed by #8061
Labels
Milestone

Comments

@evanchaoli
Copy link
Contributor

evanchaoli commented Feb 14, 2022

Summary

With EnableCachedtreamedVolumes turning on, if get step finds a resource cache it's gonna fetch, then it will not do anything, just returned the found volume.

Say, a following task needs the get step as an input. If container placement strategy is volume-locality, then the task will be placed on the worker where the resource cache volume locates, which is bad, if the resource cache is widely used by many pipelines, because it will end up:

  • All containers the need the resource as input will go to the same worker
  • The resource volume will get no chance to be streamed to other workers, thus no way to sharing loads across workers.

Let's think about if EnableCachedtreamedVolumes off. Then the get should go to a fewest-build-container worker (#8061), then the task should go to the same worker per volume-locality.

Thus, when EnableCachedtreamedVolumes is on, as get does nothing, the task should go to a fewest-build-container worker, then stream the input volume from source worker. This is still more performant, because streaming a volume between workers should anyway cheaper than running a real get step. But for those common resource caches, which are shared by a lot of pipelines, once they are warmed up on most of workers, then build should run much faster.

Steps to reproduce

Expected results

Actual results

Additional context

Triaging info

  • Concourse version: 7.6.0
  • Browser (if applicable):
  • Did this used to work?
@xtremerui
Copy link
Contributor

Thus, when EnableCachedtreamedVolumes is on, as get does nothing, the task should go to a fewest-build-container worker, then stream the input volume from source worker. This is still more performant, because streaming a volume between workers should anyway cheaper than running a real get step.

Could you elaborate more on running a real get step in the end when you also mentioned as get does nothing in the beginning?

@evanchaoli
Copy link
Contributor Author

@xtremerui Let me explain using a sample pipeline:

- get: mygit
- task: mytest
  config:
  - inputs: {name: mygit}

Assuming container placement is volume-locality+fewest-builds-containers.

Case 1: When EnableCachedtreamedVolumes is off

In first build, build-1, get step, say, is placed on worker-1 due to fewest-builds-container strategy. Then task will be placed on worker-1 too due to volume-locality strategy.

Then in second build, build-2, get step, say, is placed on worker-3 due to fewest-builds-container strategy. Then task will be placed on worker-3 too due to volume-locality strategy.

So, get should always go to a relative idle worker, and task goes to the same worker as get. But get will always create a container and run a git clone.

Case 2: When EnableCachedtreamedVolumes is on, and WithOut this PR

In first build, build-1, get step, say, is placed on worker-1 due to fewest-builds-container strategy. Then task will be placed on worker-1 too due to volume-locality strategy.

Then in second build, build-2, get step will do nothing as it find a cached volume on worker-1. Then task will go to worker-1 due to volume-locality strategy.

In theory, all following builds will only run on worker-1 due to volume-locality. This make the volume gets no chance to stream to other workers, so other workers will never share builds of the job.

Further considering that, if mygit is a global resource that are shared by multiple pipeline/jobs. Then all builds of those jobs will only run on worker-1, which might overload worker-1 but other workers have no way to share worker loads.

Case 3: When EnableCachedtreamedVolumes is on, and With this PR

In first build, build-1, get step, say, is placed on worker-1 due to fewest-builds-container strategy. Then task will be placed on worker-1 too due to volume-locality strategy.

Then in second build, build-2, get step will do nothing as it find a cached volume on worker-1. With this PR, cached input is not considered by volume-locality, so task will use fewest-builds-container strategy. Say worker-2 has fewest containers, then task is placed on worker-2. As worker-2 doesn't have the volume, it needs to stream mygit volume from worker-1 to worker-2.

Here let's compare build-2 of case 1 and case 3:

  • In case 1, build-2 runs a get and a task on a fewest-container worker.
  • In case 3, build-2 run a task on a fewest-container worker, and invokes a volume-streaming. If we compare the real get step of case 1 (creating a container and run a git clone) with the volume-streaming of of case 3, I believe case 3 is cheaper.

Moving forward, in case 1, a build will always run a get and a task. But in case 3, as the volume warming up on all workers, eventually, volume streaming is not needed, a build only needs to run a task.

Hopefully the above sample helps you understand why I consider this PR is very important.

@xtremerui
Copy link
Contributor

I just want to make clear that EnableCachedStreamedVolumes means it will cache a streamed volume (like mygit in build 2 case 3).

For a get step, a volume will be cached despite if EnableCachedStreamedVolumes is true or false.

volume, versionResult, processResult, err := step.performGetAndInitCache(ctx, logger, delegate, getResource, resourceCache, workerSpec, containerSpec, containerOwner, worker)

so your scenarios could be reduced to case 2 and case 3 that EnableCachedStreamedVolumes wouldn't relavent to whether the task will be distributed across workers but only by your changes in this PR.

Though, you might want to use EnableCacheStreamedVolumes carefully

@evanchaoli
Copy link
Contributor Author

so your scenarios could be reduced to case 2 and case 3 that EnableCachedStreamedVolumes wouldn't relavent to whether the task will be distributed across workers but only by your changes in this PR.

@xtremerui I don't think so. Case 1 and case 2/3 has big difference. When EnableCachedStreamedVolumes if off, get will always fetch a new volume, and register the new volume in current build's artifact repository, then due to volume-locality, task will go to the same worker as get. But when EnableCachedStreamedVolumes is on, without this PR, other builds' task will also the worker where build-1's get ran.

Though, you might want to use EnableCacheStreamedVolumes carefully

I'm not sure about non-Linux. But IMO, on 7.x, EnableCachedStreamedVolumes should be recommended. Otherwise image fetching would become to a big concern.

Before 7.x, say a task is placed on worker-1, then image fetch will happen on worker-1. If next build also place the task on worker-1, then as image is there already, no need to fetch the image again.

But with 7.x, say a task is placed on worker-1, nested image get is placed on worker-2. The image will be fetched on worker-2, and streamed to worker-1. If next build, the task is still on worker-1, and image-get is still on worker-2. As worker-2 has the image already, no need to fetch image, but it still needs to stream the image to worker-1, which is a duplicate. On a large deployment, this is a big performance concern.

@xtremerui
Copy link
Contributor

What is the impact of this PR when EnableCacheStreamedVolumes is off?

@evanchaoli
Copy link
Contributor Author

What is the impact of this PR when EnableCacheStreamedVolumes is off?

This PR has NO impact when EnableCacheStreamedVolumes is off. As you can see, FromCache flag is only set to true when EnableCacheStreamedVolumes is on and get find a resource cache from some worker.

@evanchaoli evanchaoli modified the milestones: v7.8.0, v7.7.0 Feb 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants