This repository has been archived by the owner on Dec 16, 2022. It is now read-only.
Commits on Oct 23, 2019
- Logging and metrics changes for distributed training (#3372) [2d7a51b]
  * Refactor the logging setup to support distributed attributes
  * `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
  * Remove `TeeLogger` and use standard log handlers
  * Remove `replace_cr_with_newline` and follow the standard logging practice of using a `logging.Filter` (see the sketch below)
  * Introduce optional `rank` and `world_size` attributes to support distributed workers
  * Support distributed training in `get_metrics`
  * Remove a bad import
  * Fix duplicate log messages in stdout
  * Remove the preemptive `logging.shutdown` call: the logging module calls it automatically at exit, so `train_model` does not need to call it
  * Fix black formatting issues
  * Remove `tee_logger` references in the API docs
  * Set the log level from the `ALLENNLP_DEBUG` environment variable
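The handler-and-filter approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual AllenNLP code: `WorkerLogFilter` and `prepare_global_logging` are assumed names, and only the rank prefixing, the stdout handler, and the `ALLENNLP_DEBUG` level handling are shown.

```python
import logging
import os
import sys


class WorkerLogFilter(logging.Filter):
    """Illustrative filter that prefixes every record with the worker's rank."""

    def __init__(self, rank: int = -1) -> None:
        super().__init__()
        self._rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        if self._rank != -1:
            record.msg = f"Rank {self._rank} | {record.msg}"
        return True  # never drop the record, only rewrite it


def prepare_global_logging(rank: int = 0, world_size: int = 1) -> None:
    # Take the log level from the ALLENNLP_DEBUG environment variable, as in the commit.
    level = logging.DEBUG if os.environ.get("ALLENNLP_DEBUG") else logging.INFO

    handler = logging.StreamHandler(sys.stdout)
    if world_size > 1:
        handler.addFilter(WorkerLogFilter(rank))
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(name)s - %(message)s")
    )

    root = logging.getLogger()
    root.setLevel(level)
    root.handlers = [handler]  # a single handler avoids duplicate messages in stdout
    # No explicit cleanup is needed: logging.shutdown() runs automatically at interpreter exit.
```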
Commits on Oct 28, 2019
- Changes to `train_model` for distributed training support (#3390) [8dde222]
  * High-level API changes to support distributed training (see the sketch below)
  * Fix flake8 error
  * Fix mypy error
  * Add docstring and misc fixes
  * Fix flake tests
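A rough sketch of the high-level dispatch such a distributed-aware `train_model` could perform: spawn one worker per GPU with `torch.multiprocessing` and have each worker join a process group. `_train_worker`, the plain-dict config access, and the hard-coded `init_method` address are illustrative placeholders rather than the real AllenNLP API.

```python
from typing import Any, Dict

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _train_worker(process_rank: int, params: Dict[str, Any], world_size: int) -> None:
    # Placeholder worker: join the process group, then build the vocabulary,
    # model, and trainer for this rank and run training.
    if world_size > 1:
        dist.init_process_group(
            backend="nccl" if torch.cuda.is_available() else "gloo",
            init_method="tcp://127.0.0.1:29500",
            rank=process_rank,
            world_size=world_size,
        )
    ...  # construct the per-rank Trainer here and call trainer.train()


def train_model(params: Dict[str, Any]) -> None:
    distributed = params.get("trainer", {}).get("distributed", False)
    if not distributed:
        _train_worker(0, params, world_size=1)
        return

    # One worker per visible GPU; mp.spawn passes the process rank as the first argument.
    world_size = torch.cuda.device_count()
    mp.spawn(_train_worker, args=(params, world_size), nprocs=world_size)
```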
Commits on Nov 21, 2019
- `Trainer` changes for distributed training (#3414) [8d004f2]
  Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:
  * Workers are spawned using `mp.spawn`, and each worker creates its own `Trainer` instance
  * `Trainer.__init__` wraps `self.model` with `DistributedDataParallel` (sketched in the example below)
  * Logging and metric aggregation were already handled in the previous PRs
  * In the distributed case, the `Vocabulary` is created before spawning the workers and creating the `Trainer`
  To run distributed training, the trainer needs the following flag enabled:
  ```jsonnet
  {
      "trainer": {
          "distributed": true,
          // ...
      }
  }
  ```
  TODO:
  * Try to reproduce comparable results and share extensive results for existing/selected models
  * Check whether other commands such as `evaluate`, `predict`, and `fine-tune` work well with the new changes
  * Should all callbacks be called from every worker in the case of callback-based training?
  * Should the current dataset readers also be changed to support distributed training (i.e. selectively yield data based on their rank)?
  * Write tests (suggestions on how to write tests for this would be welcome)
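The example below is a bare-bones illustration of the `DistributedDataParallel` wrapping described above. `MinimalTrainer` and its arguments are placeholders, not the real `Trainer` signature, and it assumes the process group has already been initialized by the spawning code.

```python
import torch
from torch.nn.parallel import DistributedDataParallel


class MinimalTrainer:
    """Placeholder trainer showing DDP wrapping in __init__; not the real Trainer."""

    def __init__(
        self,
        model: torch.nn.Module,
        cuda_device: int = -1,
        distributed: bool = False,
        world_size: int = 1,
    ) -> None:
        if cuda_device >= 0:
            model = model.cuda(cuda_device)
        if distributed:
            # Assumes torch.distributed.init_process_group() was already called by
            # the spawning code; each worker wraps its own replica, and gradients
            # are averaged across ranks during backward().
            device_ids = [cuda_device] if cuda_device >= 0 else None
            model = DistributedDataParallel(model, device_ids=device_ids)
        self.model = model
        self.world_size = world_size
```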
Commits on Dec 12, 2019
- [9831717]
  * Add some tests
  * Add another test; fix incorrect type annotations
  * torch mp uses its own context, so there is no need to set the default (illustrated below)
  * Lint
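A small, self-contained illustration of the `torch.multiprocessing` point above: `mp.spawn` creates its worker processes through its own multiprocessing context (the `spawn` start method by default), so the caller does not need to call `multiprocessing.set_start_method` first. The worker function here is just a placeholder.

```python
import torch.multiprocessing as mp


def _worker(rank: int, message: str) -> None:
    # mp.spawn passes the worker's rank as the first positional argument.
    print(f"worker {rank}: {message}")


if __name__ == "__main__":
    # mp.spawn builds its processes through its own multiprocessing context
    # (using the "spawn" start method by default), so there is no need to call
    # multiprocessing.set_start_method() beforehand.
    mp.spawn(_worker, args=("hello from the parent process",), nprocs=2)
```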
Commits on Dec 13, 2019
- strip out old DP stuff, ensure multiple cuda devices raises errors (#3516) [18878d6]
  * Lint
  * Remove an unused attribute
  * Remove `_cuda_devices` everywhere
  * Fixes
  * Move the distributed config up to the top level
  * Clean up
  * Rename occurrences of `batch_group`
  * Remove the hack from `find_learning_rate`
  * Fix the last tests
  * Black
  * Use a top-level distributed config
  * Correct the error for int
  * Change `parse_cuda_devices` to raise a good error and be strongly typed (see the sketch below)
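As a hedged sketch of the strongly-typed `parse_cuda_devices` behaviour mentioned in the last bullet, something along these lines would reject a list of devices with a helpful error. The error type and message are assumptions for illustration, not the actual AllenNLP implementation.

```python
from typing import Any, List, Union


class ConfigurationError(Exception):
    """Stand-in for the project's configuration error type (assumed name)."""


def parse_cuda_devices(cuda_device: Union[int, str, List[Any]]) -> int:
    # Sketch of the strongly-typed check described in the commit: a list of
    # devices is rejected with a pointer to the top-level distributed config,
    # and anything else must be coercible to a single int.
    if isinstance(cuda_device, list):
        raise ConfigurationError(
            "Multiple cuda devices are handled by distributed training now; "
            "use the top-level 'distributed' configuration instead of a device list."
        )
    try:
        return int(cuda_device)
    except (TypeError, ValueError):
        raise ConfigurationError(f"Expected an int for cuda_device, got {cuda_device!r}")
```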
Commits on Dec 16, 2019
- [b2caecf]
- [308984f]
- [587f212]
Commits on Dec 17, 2019
- [79f0513]