
Logging and metrics changes for distributed training #3372

Merged
merged 8 commits into allenai:torch-distributed from scarecrow1123:ddp-1 on Oct 23, 2019

Conversation

scarecrow1123
Contributor

As discussed in #2536, this is the first PR to bring in `DistributedDataParallel` support. To start with, this contains refactored implementations of the `prepare_global_logging` and `training_util.get_metrics` methods.

  • Both changes involve adding `rank` and `world_size` as optional attributes.

  • Regarding logging, in the distributed case each worker logs to its own files, named `stdout_{rank}.log`/`stderr_{rank}.log`.

  • Logs to stdout/stderr come only from the master (rank=0) process. tqdm gets updated with metrics aggregated across all workers.

  • With these changes to logging, `TeeLogger` becomes unnecessary IMO, and hence I took the liberty of removing it.

  • `get_metrics` does a `dist.all_reduce` to aggregate all the provided metrics (see the sketch below).

P.S.: pytest failures are yet to be fixed.
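For context, here is a minimal sketch of the kind of aggregation `get_metrics` performs in the distributed case. The function name, the assumption that every metric is a scalar reported by every worker, and the simple mean over workers are illustrative only, not the exact AllenNLP implementation:

```python
import torch
import torch.distributed as dist


def aggregate_metrics(metrics: dict, world_size: int) -> dict:
    """Average scalar metrics across workers with `dist.all_reduce`.

    Illustrative sketch: assumes `dist.init_process_group` has already been
    called and that every worker reports the same metric keys.
    """
    aggregated = {}
    for name, value in metrics.items():
        tensor = torch.tensor(float(value))
        # Sum this metric's value across all workers in place ...
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        # ... then divide by the number of workers to get the mean.
        aggregated[name] = (tensor / world_size).item()
    return aggregated
```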

Contributor

@DeNeutoy DeNeutoy left a comment


@scarecrow1123 looking good, I caught a few things (not creating the output dir is why like 95% of the tests are failing, I'm sure).

@DeNeutoy
Contributor

Also, when you're working on PRs to this torch-distributed branch, could you try to make sure that you start from torch-distributed as a base?

It would be good if you could revert the changes in 7afe971, because that will be in master when we rebase.

It's fine this time because we created it after you started working, but e.g. you can see that 7afe971 snuck in, and it will make rebasing/merging against master easier if we don't pull in diffs from there as we go. Thanks!

@DeNeutoy
Contributor

Also, as you're going to be submitting several PRs, here are a couple of development tips for allennlp which I find useful:

  • You can run the CI checks locally using the scripts in the `scripts` directory, e.g. `bash scripts/black.sh` or `bash scripts/flake8.sh`.
  • Typically I run only the tests I think I might have broken locally, e.g. `pytest allennlp/tests/the/one/I/broke`; otherwise it takes super long.
  • In our CI we have checks for the allennlp docs. In general don't worry about these, as we will always be able to tell you how to fix them, but basically we check that we have docs for all public modules, and that we don't have docs for modules which don't exist. For example, this PR is failing the docs check because you deleted the TeeLogger, but the docs are still present here:

https://github.com/allenai/allennlp/blob/master/doc/api/allennlp.common.tee_logger.rst

Hopefully adding/removing them is self-explanatory, but getting them to build correctly can be a pain (sphinx is very fiddly). If you ever have trouble with that, let us know, because there are lots of sphinx rules I've internalised, haha.

@brendan-ai2
Contributor

One clarification: if we literally `git revert` 7afe971, that revert will be carried into master when we later merge all of this work. So I suspect @DeNeutoy was really suggesting something like:

  1. Make a new branch fix based off of torch-distributed.
  2. git cherry-pick in just the desired commits from scarecrow1123:ddp-1.
  3. Then check out scarecrow1123:ddp-1 and git reset --hard fix.

That will ensure scarecrow1123:ddp-1 contains just the desired commits without the harmful revert. I suspect there's a better way to do step 2 than cherry picking in the commits individually, but this is what I came up with off the top of my head.

* `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
* Remove `TeeLogger` and use standard log handlers
* Remove `replace_cr_with_newline` and use the standard logging practice of using `logging.Filter`
* Introduce `rank` and `world_size` optional attributes to support distributed workers
* Remove the preemptive `logging.shutdown` call: `logging.shutdown` is called by the logging module by default during exit, which makes it unnecessary to call it from `train_model`
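As a rough illustration of the approach in these commits (not the actual AllenNLP code; the function signature, file names, and filter behaviour below are assumptions), per-rank file handlers plus a `logging.Filter` can take over the roles that `TeeLogger` and `replace_cr_with_newline` used to play:

```python
import logging
import os
import sys


class CarriageReturnFilter(logging.Filter):
    """Rewrite carriage returns (e.g. from tqdm updates) as newlines so log files stay readable."""

    def filter(self, record: logging.LogRecord) -> bool:
        if isinstance(record.msg, str):
            record.msg = record.msg.replace("\r", "\n")
        return True  # never drop the record, only rewrite it


def prepare_worker_logging(serialization_dir: str, rank: int = 0, world_size: int = 1) -> None:
    # In the distributed case each worker writes to its own file: stdout_0.log, stdout_1.log, ...
    filename = f"stdout_{rank}.log" if world_size > 1 else "stdout.log"
    file_handler = logging.FileHandler(os.path.join(serialization_dir, filename))
    file_handler.addFilter(CarriageReturnFilter())

    root = logging.getLogger()
    root.setLevel(logging.DEBUG if os.environ.get("ALLENNLP_DEBUG") else logging.INFO)
    root.addHandler(file_handler)

    # Only the master process (rank 0) also logs to the console.
    if rank == 0:
        root.addHandler(logging.StreamHandler(sys.stdout))
```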
@scarecrow1123
Contributor Author

> it will make rebasing/merging against master easier if we don't pull in diffs from there as we go

> That will ensure scarecrow1123:ddp-1 contains just the desired commits without the harmful revert. I suspect there's a better way to do step 2 than cherry picking in the commits individually, but this is what I came up with off the top of my head.

Apologies for the goof-up. I've modified the tree to be clean, thanks to @brendan-ai2's suggestions.

scarecrow1123 changed the title from "WIP: Logging and metrics changes for distributed training" to "Logging and metrics changes for distributed training" on Oct 23, 2019
@scarecrow1123
Contributor Author

@DeNeutoy Let me know if there is anything else to be changed here and if this is good to go into the branch. The changes for the next PR are underway.

Contributor

@DeNeutoy DeNeutoy left a comment


LGTM

@DeNeutoy DeNeutoy merged commit 2d7a51b into allenai:torch-distributed Oct 23, 2019
brendan-ai2 pushed a commit that referenced this pull request Nov 21, 2019
Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance
* `Trainer.__init__` wraps `self.model` with `DistributedDataParallel` (see the sketch after the config example below)
* Logging and metric aggregation were already handled in the previous PRs
* In the distributed case, `Vocabulary` creation is done before spawning the workers and creating the `Trainer` instances

To run distributed training, the trainer config needs the following flag enabled:

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```
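A rough sketch of the spawning and wrapping pattern described in the list above; the `gloo` backend, the rendezvous address, and the toy model are assumptions, and the real code builds a full `Trainer` where the comment indicates:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def _worker(rank: int, world_size: int) -> None:
    # Each spawned worker joins the process group and wraps its own model replica.
    dist.init_process_group(
        backend="gloo",  # assumption: "nccl" would be the usual choice for GPU training
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    model = torch.nn.Linear(4, 2)  # stand-in for the real model
    ddp_model = DistributedDataParallel(model)
    # ... construct a Trainer around ddp_model and call train() here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    # mp.spawn passes the worker index (the rank) as the first argument to _worker.
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```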

TODO:
* Try to reproduce comparable results and share extensive results for existing/selected models
* Check if other commands like `evaluate`, `predict`, and `fine-tune` work well with the new changes
* Should all callbacks be called from every worker in the case of callback-based training?
* Should the current dataset readers be changed to support distributed training as well? (e.g. to selectively yield data based on worker rank; see the sketch after this list)
* Write tests - _would be happy to get some suggestions on how to write tests for this_
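On the dataset-reader question in the list above, one possible (purely hypothetical) sharding scheme would be for each worker to keep only every `world_size`-th instance, round-robin by rank:

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_by_rank(instances: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield only the instances that belong to this worker (round-robin by index)."""
    for index, instance in enumerate(instances):
        if index % world_size == rank:
            yield instance


# For example, with world_size=4 the worker with rank=1 sees instances 1, 5, 9, ...
```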
@DeNeutoy DeNeutoy mentioned this pull request Dec 16, 2019
DeNeutoy added a commit that referenced this pull request Dec 17, 2019
* Logging and metrics changes for distributed training (#3372)

* Refactor logging setup to support distributed attrs

* `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
* Remove `TeeLogger` and use standard log handlers
* Remove `replace_cr_with_newline` and use the standard logging practice of using `logging.Filter`
* Introduce `rank` and `world_size` optional attributes to support distributed workers

* Support for distributed training in `get_metrics`

* Remove bad import

* Fix duplicate log messages in stdout

* Remove preemptive `logging.shutdown`

`logging.shutdown` is called by the logging module by default during exit, which makes it unnecessary to call it from `train_model`

* Fix black formatting issues

* Remove `tee_logger` references in API doc

* Set log level from `ALLENNLP_DEBUG` env

* Changes to `train_model` for distributed training support (#3390)

* High level API changes to support distributed training

* Fix flake8 error

* Fix mypy error

* Add docstring and misc fixes

* Fix flake tests

* `Trainer` changes for distributed training (#3414)

Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance
* `Trainer.__init__` wraps `self.model` with `DistributedDataParallel`
* Logging and metric aggregation were already handled in the previous PRs
* In the distributed case, `Vocabulary` creation is done before spawning the workers and creating the `Trainer` instances

To run distributed training, the trainer config needs the following flag enabled:

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```

TODO:
* Try to reproduce comparable results and share extensive results for existing/selected models
* Check if other commands like `evaluate`, `predict`, and `fine-tune` work well with the new changes
* Should all callbacks be called from every worker in the case of callback-based training?
* Should the current dataset readers be changed to support distributed training as well? (e.g. to selectively yield data based on worker rank)
* Write tests - _would be happy to get some suggestions on how to write tests for this_

* Dist tests (#3515)

* add some tests

* another test, fix incorrect type annotations

* torch mp uses its own context, no need to set default

* lint

* strip out old DP stuff, ensure multiple cuda devices raises err… (#3516)

* strip out old DP stuff, ensure multiple cuda devices raises errors

* lint

* remove unused attribute

* remove _cuda_devices everywhere

* fixes

* move distributed config up to top level

* lint

* clean up

* rename occurrences of batch_group

* remove hack from find_learning_rate

* fix last tests

* black

* use a top level distributed config

* correct error for int

* change up parse_cuda_devices to raise good error and be strongly typed

* fix merge