This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Torch distributed #3529

Merged
merged 9 commits into master from torch-distributed on Dec 17, 2019

Conversation

@DeNeutoy (Contributor) commented Dec 16, 2019

Includes the following PRs from the torch-distributed branch:
#3516
#3515
#3414
#3390
#3372

scarecrow1123 and others added 6 commits October 23, 2019 13:38
* Refactor logging setup to support distributed attrs

* `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
* Remove `TeeLogger` and use standard log handlers
* Remove `replace_cr_with_newline` and use the standard `logging.Filter` mechanism instead (see the sketch below)
* Introduce `rank` and `world_size` optional attributes to support
distributed workers
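
  A minimal sketch of the kind of `logging.Filter` this refactor points at; the `rank` attribute follows the commit message, but the class name and the wiring below are illustrative, not the actual AllenNLP code.

  ```python
  import logging


  class WorkerLogFilter(logging.Filter):
      """Illustrative filter that tags every record with the worker's rank."""

      def __init__(self, rank: int = -1) -> None:
          super().__init__()
          self._rank = rank

      def filter(self, record: logging.LogRecord) -> bool:
          if self._rank != -1:
              record.msg = f"Rank {self._rank} | {record.msg}"
          return True  # never drop records, only annotate them


  # Hypothetical wiring: each distributed worker attaches a filter carrying its own rank.
  handler = logging.StreamHandler()
  handler.addFilter(WorkerLogFilter(rank=0))
  logging.getLogger().addHandler(handler)
  ```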

* Support for distributed training in `get_metrics`
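
  A hedged sketch of the kind of cross-worker aggregation `get_metrics` needs in the distributed setting; the helper below is illustrative and assumes the `torch.distributed` process group has already been initialized.

  ```python
  import torch
  import torch.distributed as dist


  def aggregate_metric(value: float, world_size: int) -> float:
      """Average a scalar metric across all workers in the process group."""
      tensor = torch.tensor(value, dtype=torch.float32)
      # Sum the per-worker values in place, then divide by the number of workers.
      dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
      return tensor.item() / world_size
  ```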

* Remove bad import

* Fix duplicate log messages in stdout

* Remove preemptive `logging.shutdown`

`logging.shutdown` is registered by the logging module to run automatically at interpreter exit, so calling it from `train_model` is unnecessary.

* Fix black formatting issues

* Remove `tee_logger` references in API doc

* Set log level from `ALLENNLP_DEBUG` env
* High level API changes to support distributed training

* Fix flake8 error

* Fix mypy error

* Add docstring and misc fixes

* Fix flake tests
Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance (see the sketch after this list)
* `Trainer.__init__` wraps `self.model` with `DistributedDataParallel`
* Logging and metric aggregation were already handled in the previous PRs
* In the distributed case, the `Vocabulary` is created before the workers are spawned and the `Trainer` instances are constructed
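
A minimal sketch of the worker-spawning pattern described in the list above, using `torch.multiprocessing.spawn` and `DistributedDataParallel`; the function name, backend choice, and rendezvous address are illustrative, not the actual trainer code.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def _worker(process_rank: int, world_size: int) -> None:
    # Each spawned worker joins the process group under its own rank.
    dist.init_process_group(
        backend="gloo",                       # "nccl" for multi-GPU setups
        init_method="tcp://127.0.0.1:29500",  # illustrative rendezvous address
        world_size=world_size,
        rank=process_rank,
    )
    model = torch.nn.Linear(10, 2)
    # A trainer constructed in this worker would wrap its model like this:
    wrapped_model = DistributedDataParallel(model)
    # ... build the Trainer around wrapped_model and run training ...


if __name__ == "__main__":
    world_size = 2
    # mp.spawn passes the process rank as the first argument to _worker.
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```

Building the model inside the worker avoids pickling it across the spawn boundary; that detail is a choice of this sketch, not a claim about what the PR does.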

To run distributed training, the following flag needs to be enabled in the trainer config:

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```

TODO:
* Try to reproduce comparable results and share extensive results for existing/selected models
* Check whether other commands like `evaluate`, `predict`, and `fine-tune` work well with the new changes
* Should all callbacks be invoked from every worker when using callback-based training?
* Should the current dataset readers also be changed to support distributed training (i.e. to selectively yield data based on worker rank)? See the sketch after this list.
* Write tests - _would be happy to get some suggestions on how to write tests for this_
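
One possible answer to the rank-based sharding question above, sketched purely as an illustration and not as the project's chosen design: each worker keeps only the instances that map to its rank.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_worker(instances: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield only the instances assigned to this worker (round-robin by index)."""
    for index, instance in enumerate(instances):
        if index % world_size == rank:
            yield instance


# Hypothetical use inside a dataset reader's _read(), assuming the reader
# knows its own rank and the world size:
# yield from shard_for_worker(all_instances, rank=self._rank, world_size=self._world_size)
```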
* add some tests

* another test, fix incorrect type annotations

* torch mp uses its own context, no need to set the default

* lint
* strip out old DP stuff, ensure multiple cuda devices raises errors

* lint

* remove unused attribute

* remove _cuda_devices everywhere

* fixes

* move distributed config up to top level

* lint

* clean up

* rename occurrences of batch_group

* remove hack from find_learning_rate

* fix last tests

* black

* use a top level distributed config

* correct error for int

* change `parse_cuda_devices` to raise a good error and be strongly typed
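
A rough illustration of the kind of check the last two commits describe (rejecting multiple CUDA devices now that the old DataParallel path is gone); the function and error type below are hypothetical stand-ins, not the actual `parse_cuda_devices` code.

```python
from typing import List, Union


class ConfigurationError(Exception):
    """Stand-in for the project's configuration error type."""


def parse_cuda_device(cuda_device: Union[int, str, List[int]]) -> int:
    """Return a single device id, rejecting lists of devices with a clear error."""
    if isinstance(cuda_device, list):
        raise ConfigurationError(
            "Multiple CUDA devices are no longer supported via `cuda_device`; "
            "use distributed training with one device per worker instead."
        )
    return int(cuda_device)
```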
@DeNeutoy (Contributor, Author)

@matt-gardner I assigned you just in case you had any Qs/ red flags.

@brendan-ai2 (Contributor) left a comment


Looks like a clean merge of all the PRs from torch-distributed, so this is an easy LGTM from me. :)

@matt-gardner (Contributor)

If you two are happy with it, no concerns from me. It'd be nice to see a brief writeup somewhere with what's new and different here (maybe in release notes?)

@DeNeutoy (Contributor, Author)

Yep, will do 👍

@DeNeutoy DeNeutoy merged commit ca453c8 into master Dec 17, 2019
@DeNeutoy DeNeutoy deleted the torch-distributed branch December 17, 2019 00:53
@brendan-ai2 (Contributor)

🚀

@scarecrow1123 (Contributor)

Hey @DeNeutoy, thanks for taking this up! Just adding a word of caution: I noticed that `torch.set_start_method("spawn")` has been removed in the final version of `run.py`. The default start method that PyTorch uses is `fork`, which is prone to deadlocks in various scenarios, so the recommended approach is to set it to either `spawn` or `forkserver`. This ensures that any multi-threading happening in the distributed context doesn't lead to deadlocks. There may be no multi-threading code in AllenNLP itself, but things could also go wrong in the underlying NCCL/OpenMP layers (see this open PyTorch issue). I wasn't sure whether I should raise an issue for this.
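
For reference, a small sketch of the kind of guard being suggested here (setting the start method explicitly); whether it is actually needed is exactly the open question in this thread, and the helper name is illustrative.

```python
import torch.multiprocessing as mp


def ensure_safe_start_method() -> None:
    """Prefer 'spawn' over the default 'fork' to reduce the risk of deadlocks."""
    try:
        mp.set_start_method("spawn")
    except RuntimeError:
        # The start method can only be set once per process; it may already be set.
        pass
```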

@brendan-ai2 (Contributor)

@scarecrow1123 we dug into this a bit and it's not clear that this is necessary. In particular, we're using multiprocessing.spawn from https://github.com/pytorch/pytorch/blob/f9010d764894f3ba9bc6a129213a0faf42af9b65/torch/multiprocessing/spawn.py#L138 to create processes which uses get_context with 'spawn', i.e. it doesn't appear to rely on the default start_method you suggest setting. That said, my understanding here is far from crystal clear, so if you can elaborate on the issue and why it's necessary, we'd definitely appreciate it. :)

@scarecrow1123 (Contributor)

@brendan-ai2 You're spot on with `mp.spawn`. I just want to make sure that the places where multiprocessing is used, such as the data reader/iterator, would not be affected. I don't remember them using a separate context; please correct me if I'm wrong. I'm not sure anything would actually go wrong either, I'm just being careful here.

@brendan-ai2 (Contributor)

Thanks for the extra info, @scarecrow1123! Fortunately we're planning on ripping out the old MultiprocessDatasetReader and MultiprocessIterator shortly, so they shouldn't be an issue for long. Point well taken about being careful!
