Adding advanced_pytorch example #1007

cozek · 2022-01-16T11:40:17Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Implements an advanced_pytorch example that closely mirrors the advanced_tensorflow example.

Any other comments?

I added a --toy flag since my machine can't run the full 10-client simulation.
Please run the full simulation and let me know how it works. Happy to make any required changes.
I had a notebook for prototyping, removed it. Can add it in a later PR if needed.

danieljanes · 2022-01-16T18:27:06Z

Hi @cozek , thanks for this :) we'll review the PR shortly. In the meantime, could you add a reference to the new advanced_pytorch example in README.md (close to where the advanced_tensorflow example is listed)?

## Reference Issues/PRs Fix adap#1007 ## What does this implement/fix? Explain your changes. The original example sends the data and model to the device in utils.py but not in client.py, which may raise an error.

pedropgusmao · 2022-02-16T09:43:25Z

Hi @cozek, the PR is looking great. Could you just set DEVICE = torch.device("cpu") in:
https://github.com/cozek/flower/blob/d36170c6785ca08716dd6ee2b6a1f184d6015dfb/examples/advanced_pytorch/utils.py#L9
Otherwise we are likely to run out of CUDA memory.

pedropgusmao · 2022-02-16T09:53:14Z

Another thing, each client appears to be downloading the CIFAR10 dataset. This creates problems when launching multiple clients. I suggest moving the download functionality to ./run.sh and simply loading the file later.
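A minimal sketch of the download-once idea, assuming the data lives under a `./dataset` folder. The helper names `is_cached`, `download_once`, and `load_for_client` are hypothetical and are not taken from the PR. torchvision is imported lazily so that only the functions that need it require it to be installed.

```python
import os

# Folder that torchvision's CIFAR10 class creates under `root` after extraction.
CIFAR_DIRNAME = "cifar-10-batches-py"


def is_cached(root: str) -> bool:
    """Return True if the extracted CIFAR-10 folder already exists under root."""
    return os.path.isdir(os.path.join(root, CIFAR_DIRNAME))


def download_once(root: str = "./dataset") -> None:
    """Intended to run a single time, before any client process starts."""
    from torchvision.datasets import CIFAR10  # lazy import, needs torchvision

    CIFAR10(root, train=True, download=True)
    CIFAR10(root, train=False, download=True)


def load_for_client(root: str = "./dataset"):
    """Clients only read from disk and fail fast if the data is missing."""
    if not is_cached(root):
        raise FileNotFoundError(
            f"CIFAR-10 not found under {root}; run the download step first"
        )
    from torchvision import transforms
    from torchvision.datasets import CIFAR10

    return CIFAR10(root, train=True, download=False, transform=transforms.ToTensor())
```

With this split, run.sh could call `download_once()` before launching the server and clients, so that concurrent clients never write to the same archive.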

…to advanced-pytorch

cozek · 2022-02-16T15:13:14Z

Another thing, each client appears to be downloading the CIFAR10 dataset. This creates problems when launching multiple clients. I suggest moving the download functionality to ./run.sh and simply loading the file later.

Done.

pedropgusmao · 2022-02-18T10:15:26Z

@cozek , it's looking good.
I just ran a few tests and apparently running on CPU takes about 25 min per round... a bit too much :).
To resolve this trade-off between CUDA_OUT_OF_MEMORY errors and long CPU training times, I suggest the following:

Keep the model on the CPU inside the clients and move it to CUDA only during training (remembering to move it back to the CPU afterwards).
Reduce the number of participating clients to 2 per round.
Reduce batch_size to 16.

You can then keep an eye on your GPU memory usage which hopefully won't be too much.
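The first suggestion could be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `train_one_round` is a hypothetical name, and the linear model and random batch stand in for the real network and the CIFAR-10 loader.

```python
import torch
from torch import nn


def train_one_round(model: nn.Module, batches, device: torch.device, lr: float = 0.01) -> nn.Module:
    """Move the model to `device` for training, then back to the CPU."""
    model.to(device)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    try:
        for images, labels in batches:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    finally:
        # Always return the model to the CPU, even if training raised.
        model.to("cpu")
    return model


# Falls back to the CPU when no GPU is present, so the sketch runs anywhere.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3)  # placeholder model
batches = [(torch.randn(16, 8), torch.randint(0, 3, (16,)))]  # one batch of 16
train_one_round(model, batches, device)
```

Between rounds the weights then sit in host memory, so only the clients that are actively training hold model weights on the GPU.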

…aluate func in server

cozek · 2022-03-05T06:22:44Z

@cozek , it's looking good. I just ran a few tests and apparently running on CPU takes about 25 min per round... a bit too much :). To resolve this trade-off between CUDA_OUT_OF_MEMORY errors and long CPU training times, I suggest the following:

Keep the model on the CPU inside the clients and move it to CUDA only during training (remembering to move it back to the CPU afterwards).

Reduce the number of participating clients to 2 per round.

Reduce batch_size to 16.

You can then keep an eye on your GPU memory usage which hopefully won't be too much.

Done.

pedropgusmao · 2022-03-09T07:35:23Z

Traceback (most recent call last):
  File "server.py", line 108, in <module>
    main()
  File "server.py", line 104, in main
    fl.server.start_server("0.0.0.0:8080", config={"num_rounds": 4}, strategy=strategy)
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/app.py", line 114, in start_server
    force_final_distributed_eval=force_final_distributed_eval,
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/app.py", line 148, in _fl
    hist = server.fit(num_rounds=config["num_rounds"])
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/server.py", line 87, in fit
    res = self.strategy.evaluate(parameters=self.parameters)
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/strategy/fedavg.py", line 178, in evaluate
    eval_res = self.eval_fn(weights)
  File "server.py", line 60, in evaluate
    loss, accuracy = utils.test(model, valset)
  File "/home/pedro/repos/advanced_pytorch/examples/advanced_pytorch/utils.py", line 86, in test
    images, labels = images.to(DEVICE), labels.to(DEVICE)
AttributeError: 'int' object has no attribute 'to'

I'm getting the error above. Maybe the labels are being returned as plain integers (as in the original CIFAR dataset); they need to be a LongTensor (int64) instead.
Also, cuda_device in train() is not used.
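One way this error can arise, shown as a sketch under assumptions: if test() iterates directly over a dataset whose items are (tensor, int) pairs, each label is a Python int with no `.to` method. `TinyLabelledSet` below is a hypothetical stand-in for CIFAR-10. Wrapping the dataset in a `DataLoader` is one possible fix, since the default collation turns the integer labels into an int64 tensor.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TinyLabelledSet(Dataset):
    """Hypothetical stand-in for CIFAR-10: items are (image tensor, Python int)."""

    def __len__(self) -> int:
        return 8

    def __getitem__(self, idx: int):
        return torch.zeros(3, 32, 32), idx % 10


valset = TinyLabelledSet()
_, raw_label = valset[0]  # a plain int, so raw_label.to(...) would raise AttributeError

loader = DataLoader(valset, batch_size=4)
images, labels = next(iter(loader))  # default collation batches the ints into a tensor
labels = labels.to(torch.device("cpu"))  # works, because labels is now a tensor
```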

examples/advanced_pytorch/client.py

examples/advanced_pytorch/utils.py

examples/advanced_pytorch/client.py

pedropgusmao · 2022-04-27T09:03:45Z

Hi @cozek,
I left a few suggestions on how to load the model only if needed and how to pass a flag indicating whether or not to use GPUs.
Let me know what you think.
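A rough sketch of the flag part of this suggestion, using argparse. The `--toy` flag is mentioned in the PR description; the `--use_cuda` name and both helper functions are assumptions made for illustration and may differ from what was merged.

```python
import argparse

import torch


def parse_args(argv=None) -> argparse.Namespace:
    """Parse client options; both flags default to False when omitted."""
    parser = argparse.ArgumentParser(description="client options (illustrative)")
    parser.add_argument(
        "--use_cuda", action="store_true", help="train on a GPU if one is available"
    )
    parser.add_argument(
        "--toy", action="store_true", help="use a small data subset for a dry run"
    )
    return parser.parse_args(argv)


def select_device(use_cuda: bool) -> torch.device:
    """Use the GPU only when requested and actually present."""
    if use_cuda and torch.cuda.is_available():
        return torch.device("cuda:0")
    return torch.device("cpu")
```

A client would then call `select_device(parse_args().use_cuda)` once at start-up and pass the resulting device to its training and evaluation functions.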

…or doing a dry run; client no longer stores model as an attribute

cozek · 2022-04-28T07:58:41Z

Hi @cozek, I left a few suggestions on how to load the model only if needed and how to pass a flag indicating whether or not to use GPUs. Let me know what you think.

Thanks for the suggestions. Made the changes accordingly.

pedropgusmao · 2022-05-04T14:53:39Z

@cozek @danieljanes looks good. The only remaining point is that server-side evaluation still runs on CPU only, which makes it a bit slow, but it is working properly.

danieljanes · 2022-05-05T11:22:01Z

Thanks for the PR @cozek & thanks for the review @pedropgusmao 👍

cozek added 7 commits January 15, 2022 23:09

adding notebook for local simulation

915812e

updating code

9750ef8

adding server code

6f5e0b1

adding client server scripts

6e15fcb

updating code

4234c19

removing notebook

d19e1a7

black formatting

6af58a3

cozek requested review from danieljanes and tanertopal as code owners January 16, 2022 11:40

Add advanced_pytorch example.

426758d

danieljanes assigned pedropgusmao Jan 19, 2022

danieljanes added documentation Improvements or additions to documentation enhancement New feature or request labels Jan 19, 2022

Merge branch 'main' into advanced-pytorch

d36170c

cozek added 3 commits February 16, 2022 20:27

Merge branch 'main' into advanced-pytorch

298074e

Merge branch 'advanced-pytorch' of https://github.com/cozek/flower in…

4e86244

…to advanced-pytorch

Add CIFAR download to run.sh; change device to CPU

b04cda6

pedropgusmao and others added 3 commits February 18, 2022 10:21

Merge branch 'main' into advanced-pytorch

dd5430e

Merge branch 'main' into advanced-pytorch

82fbc43

download model in run.sh; move model to cpu after training; return ev…

b434ace

…aluate func in server

Merge branch 'main' into advanced-pytorch

ce59a09