distributed training #74
Merged
New features
Pythae now supports distributed training (built on top of PyTorch DDP). A distributed training run can be launched from a training script in which all of the distributed environment variables are passed to a `BaseTrainerConfig` instance. The script can then be launched using a launcher such as `srun`. This module was tested in both mono-node multi-GPU and multi-node multi-GPU settings.

Major Changes
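A minimal sketch of such a training script, assuming a SLURM-style launcher and assuming that `BaseTrainerConfig` accepts `world_size`, `rank`, `local_rank`, `master_addr` and `master_port` fields (field names are assumptions, not confirmed by these notes):

```python
import os

# Hypothetical helper: collect the distributed environment variables
# (here, the ones exported by a SLURM launcher such as `srun`) into the
# keyword arguments expected by `BaseTrainerConfig`.
def ddp_kwargs_from_env():
    return {
        "world_size": int(os.environ.get("SLURM_NTASKS", 1)),
        "rank": int(os.environ.get("SLURM_PROCID", 0)),
        "local_rank": int(os.environ.get("SLURM_LOCALID", 0)),
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": os.environ.get("MASTER_PORT", "12345"),
    }

# In the actual training script one would then write, e.g.:
#
#   from pythae.trainers import BaseTrainerConfig
#   config = BaseTrainerConfig(
#       num_epochs=10,
#       **ddp_kwargs_from_env(),
#   )
```

With `srun`, each task then picks up its own rank from the environment, so the same script serves every process in the job.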
The handling of `optimizers` and `schedulers` changed: it is no longer needed to build the `optimizer` (resp. `scheduler`) and pass it to the `Trainer`. As of v0.1.0, the choice and parameters of the `optimizers` and `schedulers` can be passed directly to the `TrainerConfig`.
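By way of illustration, the before/after might look as follows. This is a sketch, not the library's exact signature: the `optimizer_cls`/`scheduler_cls` field names and the `BaseTrainer` call shape are assumptions.

```python
# Before v0.1.0 (sketch): optimizer and scheduler were built by hand and
# handed to the Trainer alongside its config.
#
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
#   scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
#   trainer = BaseTrainer(
#       model=model,
#       train_dataset=train_dataset,
#       training_config=config,
#       optimizer=optimizer,
#       scheduler=scheduler,
#   )

# As of v0.1.0 (sketch): the choice and parameters live in the config,
# and the Trainer builds the objects itself.
#
#   config = BaseTrainerConfig(
#       learning_rate=1e-3,
#       optimizer_cls="Adam",
#       scheduler_cls="ReduceLROnPlateau",
#       scheduler_params={"patience": 5, "factor": 0.5},
#   )
#   trainer = BaseTrainer(
#       model=model,
#       train_dataset=train_dataset,
#       training_config=config,
#   )
```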
The `batch_size` key is no longer available in the `Trainer` configurations. It is replaced by the keys `per_device_train_batch_size` and `per_device_eval_batch_size`, which specify the batch size per device. Please note that if you are in a distributed setting with, for instance, 4 GPUs and specify `per_device_train_batch_size=64`, this is equivalent to training on a single GPU with a batch size of 4*64 = 256.

Minor changes
The dataloaders' number of workers can now be set in the `Trainer` configuration under the keys `train_dataloader_num_workers` and `eval_dataloader_num_workers`.
Reworked the `__init__` of the `Trainers` and moved sanity checks from the `train` method to `__init__`.
Added checks on the `optimizers` and `schedulers` configuration in the `TrainerConfig`'s `__post_init_post_parse__`.
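Pulling the renamed keys together: the sketch below collects the per-device batch-size keys and the new dataloader worker keys from these notes into a plain dict (the surrounding code is illustrative, only the key names come from the notes), and checks the per-device batch-size arithmetic from the Major Changes section.

```python
# Keyword arguments one might pass to a v0.1.0-style Trainer
# configuration, using the renamed per-device batch-size keys and the
# new dataloader worker keys from these release notes.
config_kwargs = {
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "train_dataloader_num_workers": 4,
    "eval_dataloader_num_workers": 4,
}

# In a data-parallel run, the effective (global) batch size is the
# per-device batch size times the number of participating devices.
def effective_batch_size(per_device: int, n_devices: int) -> int:
    return per_device * n_devices

# 4 GPUs at 64 samples per device behave like a single-GPU batch of 256.
print(effective_batch_size(config_kwargs["per_device_train_batch_size"], 4))
```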