distributed training #74

Merged: 155 commits merged into main on Feb 6, 2023
Conversation

@clementchadebec (Owner) commented Feb 3, 2023

New features

  • Pythae now supports distributed training (built on top of PyTorch DDP). A distributed training is launched from a training script in which the distributed environment variables are passed to a BaseTrainerConfig instance as follows:
training_config = BaseTrainerConfig(
    num_epochs=10,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    dist_backend="nccl",     # distributed backend
    world_size=8,            # total number of GPUs to use (n_nodes x n_gpus_per_node)
    rank=0,                  # global rank (process/GPU id)
    local_rank=1,            # local rank of the process on its node
    master_addr="localhost", # master address
    master_port="12345",     # master port
)

The script can then be launched using a launcher such as srun. This module was tested in both mono-node multi-GPU and multi-node multi-GPU settings.
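For illustration only, here is a minimal sketch of how such a script might fill the distributed fields from a Slurm allocation. The environment variables used (SLURM_NTASKS, SLURM_PROCID, SLURM_LOCALID, MASTER_ADDR, MASTER_PORT) and their defaults are assumptions that depend on your cluster setup, not part of Pythae itself:

import os
from pythae.trainers import BaseTrainerConfig

training_config = BaseTrainerConfig(
    num_epochs=10,
    learning_rate=1e-3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    dist_backend="nccl",
    # Slurm exposes the distributed topology through environment variables
    world_size=int(os.environ["SLURM_NTASKS"]),   # total number of processes (n_nodes x n_gpus_per_node)
    rank=int(os.environ["SLURM_PROCID"]),         # global process id
    local_rank=int(os.environ["SLURM_LOCALID"]),  # process id within its node
    master_addr=os.environ.get("MASTER_ADDR", "localhost"),
    master_port=os.environ.get("MASTER_PORT", "12345"),
)

Such a script would typically be submitted with something like srun python train.py inside a Slurm allocation, but the exact launch command depends on your scheduler.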

Major changes

  • The selection and definition of custom optimizers and schedulers has changed. It is no longer necessary to build the optimizer (resp. scheduler) and pass it to the Trainer. As of v0.1.0, the choice and parameters of the optimizers and schedulers can be passed directly to the TrainerConfig. See the changes below:

As of v0.1.0

my_model = VAE(model_config=model_config)
# Specify the classes and their params directly in the Trainer config
training_config = BaseTrainerConfig(
    ...,
    optimizer_cls="AdamW",
    optimizer_params={"betas": (0.91, 0.995)},
    scheduler_cls="MultiStepLR",
    scheduler_params={"milestones": [10, 20, 30], "gamma": 10**(-1/5)}
)
trainer = BaseTrainer(
    model=my_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    training_config=training_config
)
# Launch training
trainer.train()

Before v0.1.0

my_model = VAE(model_config=model_config)
training_config = BaseTrainerConfig(...)
### Optimizer
optimizer = torch.optim.AdamW(my_model.parameters(), lr=training_config.learning_rate, betas=(0.91, 0.995))
### Scheduler
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20, 30], gamma=10**(-1/5))
# Pass the instances to the Trainer
trainer = BaseTrainer(
    model=my_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    training_config=training_config,
    optimizer=optimizer,
    scheduler=scheduler
)
# Launch training
trainer.train()
  • The batch_size key is no longer available in the Trainer configurations. It is replaced by the keys per_device_train_batch_size and per_device_eval_batch_size, which specify the batch size per device. Please note that in a distributed setting with, for instance, 4 GPUs and a per_device_train_batch_size of 64, the effective (global) batch size is 4*64, i.e. the same as training on a single GPU with a batch_size of 256, as sketched below.
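A quick sketch of that arithmetic (the numbers are illustrative only):

per_device_train_batch_size = 64
world_size = 4  # 4 GPUs in total
# Each optimizer step sees the per-device batch from every process,
# so the effective (global) batch size is their product.
global_batch_size = per_device_train_batch_size * world_size  # 256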

Minor changes

  • Added the ability to specify the desired number of workers for data loading in the Trainer configuration under the keys train_dataloader_num_workers and eval_dataloader_num_workers (see the sketch after this list)
  • Cleaned up the __init__ of the Trainers and moved sanity checks from the train method to __init__
  • Moved the checks on optimizers and schedulers into the TrainerConfig's __post_init_post_parse__
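A minimal sketch of the new dataloader keys (the worker counts are illustrative only):

from pythae.trainers import BaseTrainerConfig

training_config = BaseTrainerConfig(
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    train_dataloader_num_workers=8,  # workers spawned for the training dataloader
    eval_dataloader_num_workers=4,   # workers spawned for the evaluation dataloader
)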

@clementchadebec clementchadebec added the enhancement (New feature or request) label Feb 4, 2023
@clementchadebec clementchadebec linked an issue Feb 6, 2023 that may be closed by this pull request
@clementchadebec clementchadebec marked this pull request as ready for review February 6, 2023 16:40
@clementchadebec clementchadebec merged commit 08f805e into main Feb 6, 2023
Successfully merging this pull request may close the following issue: Allow distributed training