
Simulation: add max_calls arg to ray.remote to avoid Ray::IDLE in GPUs #1384

Closed

Conversation

mofanv

@mofanv mofanv commented Aug 23, 2022

Reference Issues/PRs

Fixes #1152 and #1376

What does this implement/fix? Explain your changes.

Issue: @ray.remote in ray_client_proxy.py is called repeatedly when running a simulation. By default, after each client finishes its training, the Ray worker still sits on the GPU as Ray::IDLE. These idle workers accumulate and eventually cause CUDA memory to run out.

Change: By adding the argument max_calls=1, i.e. @ray.remote(max_calls=1), each Ray worker is removed after the client it served finishes.
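
For reference, a minimal sketch of what the change amounts to (the task name and arguments below are illustrative stand-ins, not the actual code in ray_client_proxy.py):

```python
import ray


# max_calls=1 tells Ray to shut the worker process down after it has executed
# a single task, so the GPU memory the process held is released instead of
# lingering as a Ray::IDLE process.
@ray.remote(max_calls=1)
def launch_and_fit(client_fn, cid, fit_ins):
    # Hypothetical task wrapping one client's local training for a round.
    client = client_fn(cid)       # build the simulated client
    return client.fit(fit_ins)    # run its local training and return the result
```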

Any other comments?

None

@danieljanes
Member

danieljanes commented Aug 23, 2022

Thanks for the PR @mofanv!

CC'ing @pedropgusmao @jafermarq @Ryan0v0, thoughts?

@jafermarq
Contributor

Thanks for looking into this @mofanv. Have you noticed any slowdown in training when, for example, you use many workers (i.e. more than can run concurrently)? I suspect using max_calls=1 will create some overhead from spawning a new worker within the same round.

@mofanv
Author

mofanv commented Aug 24, 2022

Hi, I have not noticed any significant slowdown from using max_calls=1 to close Ray workers. How many workers do you think I should test with to see such spawning overhead?
My current setting uses 10 out of 100 clients each round, i.e. fraction_fit=0.1. I changed the model to MobileNetV3 and it started to hit CUDA out-of-memory errors.
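
For context, the sampling described here roughly corresponds to a strategy configured like the following (a sketch using flwr's public FedAvg options; the concrete numbers are the ones quoted above):

```python
import flwr as fl

# 100 clients in the pool; fraction_fit=0.1 samples 10 of them each round.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.1,
    min_fit_clients=10,
    min_available_clients=100,
)
```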

@jafermarq
Contributor

I'd imagine the slowdown will only happen in settings where you have N clients per round but your system (e.g. GPUs and CPUs) can only accommodate M at a given time, with M < N. For example, if you have 50 clients per round but your system can only run 4 concurrently, then the remaining 46 will be scheduled by Ray to run later, once some resources have been freed. In this example, I would imagine that having max_calls=1 hurts performance because resources need to be deallocated (when a client ends -- i.e. when the call ends) and re-allocated again (when a new client in the round is spawned).

About the CUDA out-of-memory error, how are you setting the num_gpus resource for the clients? This should be a fractional number in [0, 1] representing the portion of the GPU (memory) that a single client needs; the right value depends on how much VRAM your GPU has.
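
For reference, per-client GPU resources are passed to the simulation via client_resources; a minimal sketch assuming flwr's 1.x start_simulation signature (client_fn and strategy stand for the user's own definitions):

```python
import flwr as fl

fl.simulation.start_simulation(
    client_fn=client_fn,        # user-defined client factory (placeholder)
    num_clients=100,
    strategy=strategy,          # e.g. the FedAvg strategy sketched above
    # Each client reserves half a GPU, so Ray schedules at most two clients
    # per physical GPU at a time.
    client_resources={"num_cpus": 1, "num_gpus": 0.5},
)
```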

@mofanv
Author

mofanv commented Aug 24, 2022

Thanks for the clarification. I wonder whether such re-allocation causes a significant overhead, since I would imagine most of the execution time is spent on model training on the GPUs. Also, if I understand correctly, keeping clients (i.e. Ray workers) around for later reuse still holds their memory, which may add up to a large amount when clients are training a large model.

I tried 8 clients on 8 GPUs (i.e., num_gpus=1) and also 8 clients on 4 GPUs (i.e., num_gpus=0.5); both hit this GPU OOM problem. My machine has 8 V100 GPUs with 16 GB of memory each. I monitored GPU usage with watch -n0.1 nvidia-smi and observed more Ray::IDLE processes resting on the GPUs every round. Each round costs around 1.5 GB of memory for MobileNetV3.

I have also tested the original simulation_pytorch example, and the accumulating-memory issue occurs there too, just less severely because the model is small.

@jafermarq
Contributor

jafermarq commented Aug 24, 2022

Which version of Ray do you use? Ray 1.12 resulted in an issue similar to what you describe ("more Ray::IDLE processes resting in GPUs every round"). Could you downgrade to 1.11.1 and try again? pip install ray==1.11.1

@mofanv
Author

mofanv commented Aug 25, 2022

I see. I was using Ray 1.13, and after downgrading to 1.11.1 there are no Ray::IDLE processes anymore.
So now both methods work well for me. Which do you suggest for now: i) use Ray 1.11.1 and wait for Ray 1.13 or a later version to fix the problem, or ii) just add max_calls=1 so Ray 1.13 can be used without OOM?

@mofanv
Author

mofanv commented Aug 26, 2022

I checked again; this OOM problem also exists in Ray 2.0.0.

@mofanv mofanv closed this Sep 8, 2022
@vtsouval

vtsouval commented Dec 7, 2022

@mofanv Where should the ray.remote(max_calls=1) parameter go? When I add it to the client's fit function, the functionality breaks.

@mofanv
Author

mofanv commented Dec 9, 2022

> @mofanv Where should the ray.remote(max_calls=1) parameter go? When I add it to the client's fit function, the functionality breaks.

Not sure whether it is still an issue in the current version; I think the official suggestion is to downgrade the Ray version.
But if you still want to do the trick, then change ray.remote in this and all the following lines here

@vtsouval

vtsouval commented Dec 9, 2022

@mofanv Thank you.

Another workaround that worked for me was to limit the number of CPU cores that Ray can utilize to be equal to the number of CPU cores used by the clients per round. This can be set in the initialization parameters of Ray in start_simulation (i.e. ray_init_args={"num_cpus": 10}). This way, all IDLE clients are removed in each federated round.
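
A minimal sketch of that workaround, assuming flwr's 1.x start_simulation signature (client_fn and strategy stand for the user's own definitions, and 10 matches the 10 clients sampled per round):

```python
import flwr as fl

fl.simulation.start_simulation(
    client_fn=client_fn,               # user-defined client factory (placeholder)
    num_clients=100,
    strategy=strategy,                 # placeholder strategy
    client_resources={"num_cpus": 1},  # one CPU per client
    # Per the workaround above: cap Ray's CPU budget at the number of clients
    # per round so it cannot keep extra idle workers alive between rounds.
    ray_init_args={"num_cpus": 10},
)
```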

@pedropgusmao
Contributor

max_calls=1 can be a problem if you increase the number of clients per round. I'm running experiments where, without max_calls=1, each round with 100 clients takes about 40 seconds. If I set max_calls=1, the required time goes up to 10 minutes!

@kuihao
Contributor

kuihao commented Jan 15, 2023

Thank you very much, you really saved my life. I checked all night long and finally solved the problem: changing all @ray.remote to @ray.remote(max_calls=1) really works. Not only does it save memory, but it also reduces the running time from 15 minutes to 1 minute.

As for downgrading the Ray version, it didn't work for me. After I downgraded from ray==2.2.0 to ray==1.11.1, all the processes died as soon as they were sent to CUDA, and the Flower server, for reasons I don't understand, didn't wait for CUDA to return and just ended itself.

@mxbi

mxbi commented Feb 2, 2023

Potentially useful: https://docs.ray.io/en/latest/ray-core/tasks/using-ray-with-gpus.html#workers-not-releasing-gpu-resources
The issue seems to be that Ray assumes GPU resources are freed entirely after a task is finished, which is far from guaranteed.

In my case, a very simple FL on a tiny dataset ran out of memory on my 24GB RTX 3090 after about 15 clients/5 rounds, because new threads were being launched for every round and not being freed.

Additionally, this behaviour can vary from system to system because Ray seems to lazily keep around num_cpus clients before it starts re-using them. In my case, it worked on a Colab notebook with 2 threads, but not on my machine with 48 threads. Essentially, Ray's default behaviour is to delay garbage collection until it runs out of CPU threads, rather than until it runs out of GPU memory.

As @vtsouval suggested, pointing users towards setting ray_init_args={"num_cpus": MAX_WORKERS} could be a better option, where MAX_WORKERS=n_clients ensures maximum parallelisation (assuming you have the VRAM for it). This fixed the memory leak for me while retaining the performance benefit of re-using workers between rounds. Setting max_calls=1 seems to destroy and recreate the process on every call, which, as noted, can greatly harm performance.

(Caveat: I'm a first-time Flower user, so I could be off-base here. That being said, this was the first snag I hit using it and I'm probably not the only one, so +1 for finding a fix for this)

@jafermarq
Contributor

jafermarq commented Feb 3, 2023

@mxbi if you use Ray 1.11 it will work just fine (as I indicated earlier in this thread, #1384 (comment)). AFAIK this issue with Flower+Ray happens with Ray 2+.

@pedropgusmao @danieljanes why aren't we pinning Ray 1.11 for Flower simulation? I always downgrade Ray manually.

@pedropgusmao
Contributor

Tbh, I don't think keeping an older version as the default is good long-term, as this might lead to broken dependencies later on.
Ray already moved to max_calls=1 in their 2.2 version, but I still think we might want to look for alternatives.

@alexkyllo

I'm running into this issue with flwr==1.3.0 and ray==2.2.0. I need to train 10 clients per round on 1 GPU, but only 5 clients will fit in my GPU memory. Ray keeps creating processes and leaving them "IDLE", consuming memory until I get a CUDA OOM error. I've tried changing the arguments like start_simulation(client_resources={"num_cpus": 8,"num_gpus": 1}). I can't downgrade to ray==1.11.1 because it only supports Python <=3.9 and I'm on Python 3.10. Does anyone know of another workaround?

@kuihao
Contributor

kuihao commented Apr 16, 2023

@alexkyllo Although I am using Python 3.9 (ray 2.2.0, flwr 1.2.0), I have chosen not to downgrade the Ray version and still use Ray 2.2.0.

Open ray_client_proxy.py (/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py) and replace every @ray.remote with @ray.remote(max_calls=1). This ensures the Ray actors (clients) are cleared when idle rather than occupying VRAM on the GPU (Nvidia RTX 3090, GTX 1080 Ti, GTX 1070, A6000, A4000, P5000). At least, this method has been effective for me; you may want to give it a try.

@alexkyllo

> @alexkyllo Although I am using Python 3.9 (ray 2.2.0, flwr 1.2.0), I have chosen not to downgrade the Ray version and still use Ray 2.2.0.
>
> Open ray_client_proxy.py (/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py) and replace every @ray.remote with @ray.remote(max_calls=1). This ensures the Ray actors (clients) are cleared when idle rather than occupying VRAM on the GPU (Nvidia RTX 3090, GTX 1080 Ti, GTX 1070, A6000, A4000, P5000). At least, this method has been effective for me; you may want to give it a try.

Thanks, this works, but I will need to fork the flwr package to make it work across my environments.

I was a little confused because @pedropgusmao's comment mentioned that max_calls=1 was the default in Ray 2.2, but this doesn't appear to be the case after all.

Would it be a good idea to enhance flwr to make this max_calls thing configurable via an argument to start_simulation or something?
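
For illustration only, one hypothetical shape such an option could take; the remote_task_options parameter below does not exist in flwr and is invented purely to sketch the suggestion:

```python
import flwr as fl

# Purely hypothetical: "remote_task_options" is an invented name for a dict that
# flwr could forward to ray.remote(...) inside ray_client_proxy.py.
fl.simulation.start_simulation(
    client_fn=client_fn,                    # placeholder client factory
    num_clients=100,
    strategy=strategy,                      # placeholder strategy
    remote_task_options={"max_calls": 1},   # would map to @ray.remote(max_calls=1)
)
```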


Successfully merging this pull request may close these issues.

Exhausting computational resources with multiple fl.simulation.start_simulation runnings (ray)