
Simulation: add max_calls arg to ray.remote to avoid Ray::IDLE in GPUs #1384

Closed

Conversation

mofanv

@mofanv mofanv commented Aug 23, 2022

Reference Issues/PRs

Fixes #1152 and #1376

What does this implement/fix? Explain your changes.

Issue: @ray.remote in ray_client_proxy.py is called repeatedly when running a simulation. By default, after each client finishes its training, the Ray worker still sits on the GPU as Ray::IDLE. These idle workers accumulate and eventually cause CUDA memory to run out.

Change: By adding the argument max_calls=1, i.e. @ray.remote(max_calls=1), each Ray worker is removed after the client it served finishes.
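
For reference, a minimal sketch of what the change amounts to (the task name and arguments below are illustrative stand-ins, not the actual code in ray_client_proxy.py):

```python
import ray


# max_calls=1 tells Ray to shut the worker process down after it has executed
# a single task, so the GPU memory the process held is released instead of
# lingering as a Ray::IDLE process.
@ray.remote(max_calls=1)
def launch_and_fit(client_fn, cid, fit_ins):
    # Hypothetical task wrapping one client's local training for a round.
    client = client_fn(cid)       # build the simulated client
    return client.fit(fit_ins)    # run its local training and return the result
```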

Any other comments?

None

@danieljanes
Member

danieljanes commented Aug 23, 2022

Thanks for the PR @mofanv!

CC'ing @pedropgusmao @jafermarq @Ryan0v0, thoughts?

@jafermarq
Contributor

Thanks for looking into this @mofanv. Have you noticed any slowdown in training when, for example, you use many workers (i.e. more than can run concurrently)? I suspect using max_calls=1 will create some overhead from spawning a new worker within the same round.

@mofanv
Author

mofanv commented Aug 24, 2022

Hi, I have not noticed any significant slowdown from using max_calls=1 to close Ray workers. How many workers do you think I should test with to see such spawning overhead?
My current setting uses 10 out of 100 clients each round, i.e. fraction_fit=0.1. I changed the model to MobileNetV3 and it started to hit CUDA out-of-memory errors.
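
For context, the sampling described here roughly corresponds to a strategy configured like the following (a sketch using flwr's public FedAvg options; the concrete numbers are the ones quoted above):

```python
import flwr as fl

# 100 clients in the pool; fraction_fit=0.1 samples 10 of them each round.
strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.1,
    min_fit_clients=10,
    min_available_clients=100,
)
```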

@jafermarq
Contributor

I'd imagine the slowdown will only happen in settings where you have N clients per round but your system (e.g. GPUs and CPUs) can only accommodate M at a given time, with M < N. For example, if you have 50 clients per round but your system can only run 4 concurrently, then the remaining 46 will be scheduled by Ray to run later, once some resources have been freed. In this example, I would imagine that having max_calls=1 hurts performance because resources need to be deallocated (when a client ends -- i.e. when the call ends) and re-allocated again (when a new client in the round is spawned).

About the CUDA out-of-memory error, how are you setting the num_gpus resource for the clients? This should be a fractional number in [0, 1] representing the portion of the GPU (memory) that a single client needs; the right value depends on how much VRAM your GPU has.
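
For reference, per-client GPU resources are passed to the simulation via client_resources; a minimal sketch assuming flwr's 1.x start_simulation signature (client_fn and strategy stand for the user's own definitions):

```python
import flwr as fl

fl.simulation.start_simulation(
    client_fn=client_fn,        # user-defined client factory (placeholder)
    num_clients=100,
    strategy=strategy,          # e.g. the FedAvg strategy sketched above
    # Each client reserves half a GPU, so Ray schedules at most two clients
    # per physical GPU at a time.
    client_resources={"num_cpus": 1, "num_gpus": 0.5},
)
```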

@mofanv
Author

mofanv commented Aug 24, 2022

Thanks for the clarification. I wonder whether such re-allocation causes a significant overhead, since I would imagine most of the execution time is spent on model training on the GPUs. Also, if I understand correctly, keeping clients (i.e. Ray workers) around for later reuse still holds their memory, which may add up to a large amount when clients are training a large model.

I tried 8 clients on 8 GPUs (i.e., num_gpus=1) and also 8 clients on 4 GPUs (i.e., num_gpus=0.5); both hit this GPU OOM problem. My machine has 8 V100 GPUs with 16 GB of memory each. I monitored GPU usage with watch -n0.1 nvidia-smi and observed more Ray::IDLE processes resting on the GPUs every round. Each round costs around 1.5 GB of memory for MobileNetV3.

I have also tested the original simulation_pytorch example, and the accumulating-memory issue occurs there too, just less severely because the model is small.

@jafermarq
Contributor

jafermarq commented Aug 24, 2022

Which version of Ray do you use? Ray 1.12 resulted in an issue similar to what you describe ("more Ray::IDLE processes resting in GPUs every round"). Could you downgrade to 1.11.1 and try again? pip install ray==1.11.1

@mofanv
Author

mofanv commented Aug 25, 2022

I see. I was using Ray 1.13, and after downgrading to 1.11.1 there are no Ray::IDLE processes anymore.
So now both methods work well for me. Which do you suggest for now: i) use Ray 1.11.1 and wait for Ray 1.13 or a later version to fix the problem, or ii) just add max_calls=1 so Ray 1.13 can be used without OOM?

@mofanv
Author

mofanv commented Aug 26, 2022

I checked again; this OOM problem also exists in Ray 2.0.0.

@mofanv mofanv closed this Sep 8, 2022
@vtsouval

vtsouval commented Dec 7, 2022

@mofanv Where should the ray.remote(max_calls=1) parameter go? When I add it to the client's fit function, the functionality breaks.

@mofanv
Author

mofanv commented Dec 9, 2022

> @mofanv Where should the ray.remote(max_calls=1) parameter go? When I add it to the client's fit function, the functionality breaks.

Not sure whether it is still an issue in the current version; I think the official suggestion is to downgrade the Ray version.
But if you still want to do the trick, then change ray.remote in this and all the following lines here

@vtsouval

vtsouval commented Dec 9, 2022

@mofanv Thank you.

Another workaround that worked for me was to limit the number of CPU cores that Ray can utilize to be equal to the number of CPU cores used by the clients per round. This can be set in the initialization parameters of Ray in start_simulation (i.e. ray_init_args={"num_cpus": 10}). This way, all IDLE clients are removed in each federated round.
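
A minimal sketch of that workaround, assuming flwr's 1.x start_simulation signature (client_fn and strategy stand for the user's own definitions, and 10 matches the 10 clients sampled per round):

```python
import flwr as fl

fl.simulation.start_simulation(
    client_fn=client_fn,               # user-defined client factory (placeholder)
    num_clients=100,
    strategy=strategy,                 # placeholder strategy
    client_resources={"num_cpus": 1},  # one CPU per client
    # Per the workaround above: cap Ray's CPU budget at the number of clients
    # per round so it cannot keep extra idle workers alive between rounds.
    ray_init_args={"num_cpus": 10},
)
```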

@pedropgusmao
Contributor

max_calls=1 can be a problem if you increase the number of clients per round. I'm running experiments where, without max_calls=1, each round with 100 clients takes about 40 seconds. If I set max_calls=1, the required time goes up to 10 minutes!

@kuihao
Contributor

kuihao commented Jan 15, 2023

Thank you very much, you really saved my life. I checked all night long and finally solved the problem: changing all @ray.remote to @ray.remote(max_calls=1) really works. Not only does it save memory, but it also reduces the running time from 15 minutes to 1 minute.

As for downgrading the Ray version, it didn't work for me. After I downgraded from ray==2.2.0 to ray==1.11.1, all the processes died as soon as they were sent to CUDA, and the Flower server, for reasons I don't understand, didn't wait for CUDA to return and just ended itself.

@mxbi

mxbi commented Feb 2, 2023

Potentially useful: https://docs.ray.io/en/latest/ray-core/tasks/using-ray-with-gpus.html#workers-not-releasing-gpu-resources
The issue seems to be that Ray assumes GPU resources are freed entirely after a task is finished, which is far from guaranteed.

In my case, a very simple FL on a tiny dataset ran out of memory on my 24GB RTX 3090 after about 15 clients/5 rounds, because new threads were being launched for every round and not being freed.

Additionally, this behaviour can vary from system to system because Ray seems to lazily keep around num_cpus clients before it starts re-using them. In my case, it worked on a Colab notebook with 2 threads, but not on my machine with 48 threads. Essentially, Ray's default behaviour is to delay garbage collection until it runs out of CPU threads, rather than until it runs out of GPU memory.

As @vtsouval suggested, pointing users towards setting ray_init_args={"num_cpus": MAX_WORKERS} could be a better option, where MAX_WORKERS=n_clients ensures maximum parallelisation (assuming you have the VRAM for it). This fixed the memory leak for me while retaining the performance benefit of re-using workers between rounds. Setting max_calls=1 seems to destroy and recreate the process on every call, which, as noted, can greatly harm performance.

(Caveat: I'm a first-time Flower user, so I could be off-base here. That being said, this was the first snag I hit using it and I'm probably not the only one, so +1 for finding a fix for this)

@jafermarq
Contributor

jafermarq commented Feb 3, 2023

@mxbi if you use Ray 1.11 it will work just fine (as I indicated earlier in this thread, #1384 (comment)). AFAIK this issue with Flower+Ray happens with Ray 2+.

@pedropgusmao @danieljanes why aren't we pinning Ray 1.11 for Flower simulation? I always downgrade Ray manually.

@pedropgusmao
Contributor

Tbh, I don't think keeping an older version as the default is good long-term, as this might lead to broken dependencies later on.
Ray already moved to max_calls=1 in their 2.2 version, but I still think we might want to look for alternatives.

@alexkyllo

I'm running into this issue with flwr==1.3.0 and ray==2.2.0. I need to train 10 clients per round on 1 GPU, but only 5 clients will fit in my GPU memory. Ray keeps creating processes and leaving them "IDLE", consuming memory until I get a CUDA OOM error. I've tried changing the arguments like start_simulation(client_resources={"num_cpus": 8,"num_gpus": 1}). I can't downgrade to ray==1.11.1 because it only supports Python <=3.9 and I'm on Python 3.10. Does anyone know of another workaround?

@kuihao
Contributor

kuihao commented Apr 16, 2023

@alexkyllo Although I am using Python 3.9 (ray 2.2.0, flwr 1.2.0), I have chosen not to downgrade the Ray version and still use Ray 2.2.0.

Open ray_client_proxy.py (/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py) and replace every @ray.remote with @ray.remote(max_calls=1). This ensures the Ray actors (clients) are cleared when idle rather than occupying VRAM on the GPU (Nvidia RTX 3090, GTX 1080 Ti, GTX 1070, A6000, A4000, P5000). At least, this method has been effective for me; you may want to give it a try.

@alexkyllo

> @alexkyllo Although I am using Python 3.9 (ray 2.2.0, flwr 1.2.0), I have chosen not to downgrade the Ray version and still use Ray 2.2.0.
>
> Open ray_client_proxy.py (/lib/python3.9/site-packages/flwr/simulation/ray_transport/ray_client_proxy.py) and replace every @ray.remote with @ray.remote(max_calls=1). This ensures the Ray actors (clients) are cleared when idle rather than occupying VRAM on the GPU (Nvidia RTX 3090, GTX 1080 Ti, GTX 1070, A6000, A4000, P5000). At least, this method has been effective for me; you may want to give it a try.

Thanks, this works, but I will need to fork the flwr package to make it work across my environments.

I was a little confused because @pedropgusmao's comment mentioned that max_calls=1 was the default in Ray 2.2, but this doesn't appear to be the case after all.

Would it be a good idea to enhance flwr to make this max_calls thing configurable via an argument to start_simulation or something?
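
For illustration only, one hypothetical shape such an option could take; the remote_task_options parameter below does not exist in flwr and is invented purely to sketch the suggestion:

```python
import flwr as fl

# Purely hypothetical: "remote_task_options" is an invented name for a dict that
# flwr could forward to ray.remote(...) inside ray_client_proxy.py.
fl.simulation.start_simulation(
    client_fn=client_fn,                    # placeholder client factory
    num_clients=100,
    strategy=strategy,                      # placeholder strategy
    remote_task_options={"max_calls": 1},   # would map to @ray.remote(max_calls=1)
)
```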


Successfully merging this pull request may close these issues.

Exhausting computational resources with multiple fl.simulation.start_simulation runnings (ray)