Changes to `train_model` for distributed training support #3390

scarecrow1123 · 2019-10-23T16:52:38Z

This is the second PR after #3372 to bring in distributed training support. The high-level changes are:

* Extra params on the `TrainerBase`, `Trainer` and `CallbackTrainer` classes for the `rank` and `world_size` options.
* In a distributed setup, the plan is to run training in subprocesses using `multiprocessing.spawn`. Hence parts of the `train_model` function in `train.py` are isolated into a new `_train_worker` function, which is invoked as a process.
* The changes for distributed training itself are not in this PR yet. Only the isolation of the function mentioned above is done, mostly to test its correctness against the existing tests/experiments.

~~P.S: Since this is a dependent PR, the changes made for the previous PR are being shown until the older one is merged. I'd be happy to know if there is a better way to do this :)~~

@DeNeutoy Once this is reviewed, the next PR candidate would be the actual distributed training related code.

DeNeutoy · 2019-10-23T20:52:29Z

@scarecrow1123 could you rebase this please? Then we'll review 👍

scarecrow1123 · 2019-10-23T22:25:33Z

@DeNeutoy Rebase is done. Please proceed with the review.

DeNeutoy

LGTM

At some point the distributed stuff will need to have some test coverage, but I can see that this doesn't actually implement any new things and it's just some wrangling with the command line api. So looks good!

allennlp/training/trainer.py

DeNeutoy · 2019-10-28T17:25:55Z

Thanks!

Followup PR to #3390 and #3372 to bring in distributed training support. Following are the major changes done:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance
* `Trainer.__init__` wraps `self.model` with `DistributedDataParallel`
* Logging and metric aggregation are already done in the previous PRs
* `Vocabulary` creation in case of distributed training is done before spawning the workers and creating the `Trainer` class

To run distributed training, the trainer needs to have the following flag enabled:

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```

TODO:

* Try to reproduce comparable results and share extensive results for existing/selected models
* Check if other commands like `evaluate`, `predict`, `fine-tune` work well with the new changes
* Do all the callbacks need to be called from every worker in the case of callback-based training?
* Should the current dataset readers be changed to support distributed training as well? (to selectively yield data based on their rank)
* Write tests - _would be happy to get some suggestions on how to write tests for this_
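The open question about dataset readers yielding data selectively by rank could be answered in several ways. One possible approach, shown here only as a sketch, is round-robin sharding over the instance stream. The function name `shard_for_rank` is hypothetical and not part of any existing reader API.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(instances: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every `world_size`-th instance, starting at offset `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    # islice(iterable, start, stop, step): start at `rank`, step by `world_size`.
    return islice(instances, rank, None, world_size)


# Each worker sees a disjoint slice of the same stream.
print(list(shard_for_rank(range(10), rank=1, world_size=3)))  # [1, 4, 7]
```

With this scheme the shards are disjoint and together cover the whole stream, but their lengths can differ by one when the number of instances is not a multiple of `world_size`, which a real reader would have to account for.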

* Logging and metrics changes for distributed training (#3372)
  * Refactor logging setup to support distributed attrs
  * `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
  * Remove `TeeLogger` and use standard log handlers
  * Remove `replace_cr_with_newline` and use the standard logging practice of using `logging.Filter`
  * Introduce `rank` and `world_size` optional attributes to support distributed workers
  * Support for distributed training in `get_metrics`
  * Remove bad import
  * Fix duplicate log messages in stdout
  * Remove preemptive `logging.shutdown`: `logging.shutdown` is called by the logging module by default during exit, which makes it unnecessary to call it from `train_model`
  * Fix black formatting issues
  * Remove `tee_logger` references in API doc
  * Set log level from `ALLENNLP_DEBUG` env
* Changes to `train_model` for distributed training support (#3390)
  * High level API changes to support distributed training
  * Fix flake8 error
  * Fix mypy error
  * Add docstring and misc fixes
  * Fix flake tests
* `Trainer` changes for distributed training (#3414)
* Dist tests (#3515)
  * add some tests
  * another test, fix incorrect type annotations
  * torch mp uses its own context, no need to set default
  * lint
* strip out old DP stuff, ensure multiple cuda devices raises err… (#3516)
  * strip out old DP stuff, ensure multiple cuda devices raises errors
  * lint
  * remove unused attribute
  * remove _cuda_devices everywhere
  * fixes
  * move distributed config up to top level
  * lint
  * clean up
  * rename occurrences of batch_group
  * remove hack from find_learning_rate
  * fix last tests
  * black
  * use a top level distributed config
  * correct error for int
  * change up parse_cuda_devices to raise good error and be strongly typed
  * fix merge

High level API changes to support distributed training

5353e36

scarecrow1123 force-pushed the ddp-2 branch from 5d4dd80 to 5353e36 October 23, 2019 22:24

scarecrow1123 added 2 commits October 24, 2019 05:16

Fix flake8 error

d405482

Fix mypy error

107e2f3

DeNeutoy approved these changes Oct 24, 2019

allennlp/training/trainer.py (review thread resolved)

scarecrow1123 added 2 commits October 28, 2019 18:50

Add docstring and misc fixes

03b77d4

Fix flake tests

b5c8ef0

DeNeutoy merged commit 8dde222 into allenai:torch-distributed Oct 28, 2019

scarecrow1123 mentioned this pull request Oct 30, 2019

WIP: Trainer changes for distributed training #3414

Merged

DeNeutoy mentioned this pull request Dec 16, 2019

Torch distributed #3529

Merged
