
Remove assign_cpu_and_gpu_sets #4412

Open · wants to merge 4 commits into master
Conversation

epicfaace (Member)

Reasons for making this change

Remove assign_cpu_and_gpu_sets, as I don't think this check is needed. This is important because it simplifies #4385: workers won't need to run checks on their own cpusets / gpusets (they can just request pods with specific GPUs). I could have bypassed this code path only for the Kubernetes runtime, but it seems to be an unnecessary check we can do away with entirely, since the bundle manager already ensures the worker has enough CPUs / GPUs available to run a bundle. What do you think @AndrewJGaut @percyliang?
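For context on "request pods with specific GPUs": under Kubernetes the worker would put the resource request on the pod spec and let the kubelet pick the actual devices, rather than computing a gpuset itself. A minimal sketch, not the #4385 implementation; the function name and field values are illustrative, and `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name:

```python
# Hypothetical sketch: request N GPUs on the pod and let Kubernetes
# choose the devices, instead of the worker pinning a gpuset itself.
def make_pod_spec(bundle_uuid, request_cpus, request_gpus, image):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"run-{bundle_uuid}"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "run",
                "image": image,
                "resources": {
                    "limits": {
                        "cpu": str(request_cpus),
                        "nvidia.com/gpu": str(request_gpus),
                    }
                },
            }],
        },
    }
```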

                % (request_gpus, len(gpuset), len(self.gpuset))
            )

        def propose_set(resource_set, request_count):
epicfaace (Member Author)
Is there a chance that the user will actually get fewer CPUs / GPUs than they request? If so, then we might need to keep this function.
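For readers without the surrounding file: judging from the diff excerpt above, this code raises when the worker lacks the requested resources and otherwise hands back exactly request_count ids, so a run never silently gets fewer than requested. A rough reconstruction of the removed logic, inferred from the snippet rather than the verbatim code (the error string is a guess that matches the format arguments in the diff):

```python
def assign_cpu_and_gpu_sets(self, request_cpus, request_gpus):
    # Start from everything the worker owns, then drop ids already
    # pinned to other running bundles.
    cpuset, gpuset = set(self.cpuset), set(self.gpuset)
    for run_state in self.runs.values():
        cpuset -= run_state.cpuset
        gpuset -= run_state.gpuset
    if len(cpuset) < request_cpus or len(gpuset) < request_gpus:
        raise Exception(
            'Requested %d GPUs, only %d of %d available'
            % (request_gpus, len(gpuset), len(self.gpuset))
        )

    def propose_set(resource_set, request_count):
        # Take the first request_count free ids.
        return set(str(el) for el in list(resource_set)[:request_count])

    return propose_set(cpuset, request_cpus), propose_set(gpuset, request_gpus)
```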

@epicfaace (Member Author)

We should discuss this in a meeting.

@@ -237,19 +237,7 @@ def mount_dependency(dependency, shared_file_system):
            )
            return run_state._replace(stage=RunStage.CLEANING_UP)

        # Check CPU and GPU availability
        try:
            cpuset, gpuset = self.assign_cpu_and_gpu_sets_fn(
epicfaace (Member Author)

Add an if statement -- consider just not doing this with kubernetes.
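The suggested guard might look like the following sketch; `self.runtime_name` and the call's arguments are illustrative, since the real attribute for detecting the Kubernetes runtime may differ:

```python
# Hypothetical guard: skip cpuset/gpuset assignment under Kubernetes,
# where the pod spec carries the resource request instead.
cpuset, gpuset = set(), set()
if self.runtime_name != 'kubernetes':  # attribute name is illustrative
    try:
        cpuset, gpuset = self.assign_cpu_and_gpu_sets_fn(
            run_state.resources.cpus, run_state.resources.gpus  # assumed args
        )
    except Exception:
        # existing failure path: transition the run to CLEANING_UP
        ...
```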

@epicfaace (Member Author)

@AndrewJGaut any thoughts on this?

@AndrewJGaut (Contributor) commented Mar 22, 2023

@AndrewJGaut any thoughts on this?

I believe you're correct @epicfaace and assign_cpu_and_gpu_sets can be removed.

Here's why: we do check whether the worker has enough resources in the bundle-manager, and we keep a tally of the worker's cpus and gpus (e.g. see here).

Now, we could run into issues if, for instance, the bundle_manager were starting bundles in a multi-threaded fashion (and sent a start message to the same worker multiple times before decrementing worker['cpus'] and/or worker['gpus'] in that function). However, the condition for entering the loop in which those resources are decremented does not return until the rest-server has received the start message and relayed it to the worker (or it returns False and no bundle is started anyway). Therefore, as long as the bundle_manager is single-threaded, there will never be a race condition related to worker cpu and gpu counts.
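A condensed sketch of the single-threaded tally described above; the worker['cpus'] / worker['gpus'] keys come from the comment, while try_to_start and the loop shape are assumptions for illustration:

```python
# Because a single thread runs this loop, the availability check and the
# decrement can never interleave with another start on the same worker.
def schedule(pending_bundles, workers):
    for bundle in pending_bundles:
        for worker in workers:
            if (worker['cpus'] >= bundle.resources.cpus
                    and worker['gpus'] >= bundle.resources.gpus):
                # try_to_start does not return until the rest-server has
                # relayed the start message to the worker (or it fails).
                if try_to_start(worker, bundle):
                    worker['cpus'] -= bundle.resources.cpus
                    worker['gpus'] -= bundle.resources.gpus
                    break
```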

Since the bundle-manager will never schedule a run on the worker unless it has sufficient resources for all of its running bundles, it should never be the case that there aren't enough cpus and/or gpus on the worker to run a bundle.

Therefore, the assign_cpu_and_gpu_sets check done in the worker is not necessary and can be removed. However, it could become necessary if we wanted the worker and/or bundle-manager to be multi-threaded (with respect to transitioning bundle stages on the worker and scheduling runs on the bundle-manager). In that case, though, I'd recommend keeping a member variable in the worker that tracks the number of cpus and gpus available, rather than looping over the counts for all run bundles as done here.
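If the worker (or bundle-manager) did go multi-threaded, the member-variable approach suggested above could look roughly like this; a sketch, not existing code:

```python
import threading

class ResourceTally:
    """Hypothetical per-worker counter of free cpus/gpus, replacing the
    per-transition loop over all run bundles."""

    def __init__(self, cpus, gpus):
        self._lock = threading.Lock()
        self.free_cpus = cpus
        self.free_gpus = gpus

    def try_reserve(self, cpus, gpus):
        # Check and decrement under one lock so two concurrent stage
        # transitions can't both claim the last GPU.
        with self._lock:
            if self.free_cpus >= cpus and self.free_gpus >= gpus:
                self.free_cpus -= cpus
                self.free_gpus -= gpus
                return True
            return False

    def release(self, cpus, gpus):
        # Called when a run bundle finishes.
        with self._lock:
            self.free_cpus += cpus
            self.free_gpus += gpus
```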

@AndrewJGaut (Contributor) commented Mar 22, 2023

Some extra interesting tidbits:

  • The beginning of the function is the only place where we check whether there are currently enough CPUs or GPUs on the worker to run that bundle.
  • The scheduling algorithm on the bundle-manager makes more sense to me now. It loops over the bundles it's trying to run. For each bundle, it calls a function that checks whether any worker has the resources required to run it; if so, it tries to start the bundle on one of those workers, deducting resources if that fails. In particular, it handles each bundle individually and on a single thread, which is why we can be sure no race conditions will arise.
  • The only thing the bundle-manager does with multiple threads is making bundles.
  • I can't actually find where the cpu and gpu sets are incremented back after a run bundle finishes... but I know it must happen, because otherwise workers wouldn't be able to keep scheduling bundles indefinitely.
