task image get and task are on different workers #6218
Comments
Yeah, image fetching used to happen on the same worker. It was a very hairy code path that duplicated a lot of the logic from the regular check/get, which this refactor cleaned up, but this is one trade-off of the decoupling. The upside of the decoupling is that it should make it a lot easier to implement runtimes like Kubernetes, because now everything really just flows through the exec check/get/put/task steps. The new logic should already use the image cache if it already exists on the same worker the task chose (same as with normal …). One alternative could be to explicitly choose a worker first and then specify the worker in the GetPlan, but that feels kind of hairy. None of the Plan types currently have such low-level configuration, and it might not make as much sense with other runtimes.
@vito Recently I read through your refactored code and did some tests. Now I feel that marking streamed volumes as caches is something we must do; otherwise, I'm afraid v7 will have a performance regression. For example: say there are 3 workers, and the job has only a …
This process sounds a little weird, as in build 1 the image has been streamed to worker2 already. Compared to pre-v7, this process consumes more resources:
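The scenario above can be modeled as a toy simulation: if a streamed volume is not registered as a resource cache on the destination worker, every later build landing there streams the image again, while marking it as a cache makes the second build a local hit. The names and the two-build shape are illustrative, not taken from the real scheduler.

```python
# Toy model: count how many times an image must be streamed to a worker
# across two builds, with and without registering the streamed volume as a
# resource cache on the destination worker.
def run_build(caches: set[str], image: str, mark_streamed_as_cache: bool) -> int:
    """Return 1 if the image had to be streamed, 0 on a local cache hit."""
    if image in caches:
        return 0  # local cache hit, no cross-worker streaming
    if mark_streamed_as_cache:
        caches.add(image)  # the proposed enhancement
    return 1  # streamed from another worker

results = {}
for mark in (False, True):
    worker2_caches: set[str] = set()
    streams = sum(run_build(worker2_caches, "task-image", mark) for _ in range(2))
    results[mark] = streams

print(results)  # {False: 2, True: 1}
```

With the enhancement, build 2 reuses the volume streamed during build 1 instead of streaming it again, which is the pre-v7 behavior being compared against.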
If the … So I am thinking of several enhancements:
WDYT?
@evanchaoli Agreed - I think this refactor really forces the issue, and it's worth addressing now.
I think it might be easier to call …
Yeah, this sounds worth trying; it'd work really well once we have the above behavior. 👍
@vito Thanks for your reply. I plan to create some PoC code for these enhancements.
Yeah, when a volume is streamed, the dest handle is different from the source handle, so …
@evanchaoli I don't think that's necessary, because the handle doesn't matter when querying for resource cache volumes, and resource caches are already treated as equivalent even if the handles are different. I think it's simpler to keep handles globally unique. If you have a use case for supporting other types of volumes, maybe it could be done by adding a …
@vito I guess you misunderstood me; maybe I didn't state it clearly. I have realized that the handle doesn't matter when finding a resource cache, and that a streamed volume has a different handle than the source volume, so handles stay globally unique; I'm not going to change that. What I meant by "clone" is: after streaming a volume, create a new tuple in …
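The "clone" idea under discussion could look roughly like this: after streaming, record a new volume row with its own globally unique handle but the same resource cache identity, so cache lookups on the destination worker hit locally. The `VolumeRow` field names here are assumptions for illustration, not the real schema.

```python
from dataclasses import dataclass, replace

# Illustrative sketch of cloning a volume row after streaming: new handle,
# new worker, same resource cache identity. Field names are guesses at the
# shape of the volumes table, not the actual columns.
@dataclass(frozen=True)
class VolumeRow:
    handle: str
    worker_name: str
    resource_cache_id: int

def clone_after_stream(src: VolumeRow, dest_handle: str, dest_worker: str) -> VolumeRow:
    # Handles remain globally unique; only the cache identity is shared.
    return replace(src, handle=dest_handle, worker_name=dest_worker)

src = VolumeRow(handle="vol-abc", worker_name="worker1", resource_cache_id=42)
dst = clone_after_stream(src, dest_handle="vol-def", dest_worker="worker2")
print(dst)
```

The follow-up objection in the next comment is that the remaining columns of such a cloned row carry per-worker lifecycle state (e.g. what happens when a worker goes away), which makes a naive clone harder than it looks.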
@evanchaoli Ah! Yeah, sorry, I misunderstood. I think the problem with that approach is the remaining columns to be 'cloned' … For example, when a worker goes away, the …
Summary
I found this problem in the same test as #6217. See the screenshot:
We can see that the task image get was done on a different worker than the one where the task ran.
I used the "random" container placement strategy. But image fetching should be done on the worker where the task is going to run, right? @vito
Steps to reproduce
Expected results
Actual results
Additional context
Triaging info