
Torch distributed #3529

Merged · merged 9 commits on Dec 17, 2019

Commits on Oct 23, 2019

  1. Logging and metrics changes for distributed training (#3372)

    * Refactor logging setup to support distributed attrs
    
    * `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
    * Remove `TeeLogger` and use standard log handlers
    * Remove `replace_cr_with_newline` in favor of the standard logging practice of
    using a `logging.Filter`
    * Introduce `rank` and `world_size` optional attributes to support
    distributed workers
    
    * Support for distributed training in `get_metrics`
    
    * Remove bad import
    
    * Fix duplicate log messages in stdout
    
    * Remove preemptive `logging.shutdown`
    
    `logging.shutdown` is called automatically by the logging module
    during interpreter exit, so calling it from `train_model` is unnecessary
    
    * Fix black formatting issues
    
    * Remove `tee_logger` references in API doc
    
    * Set log level from `ALLENNLP_DEBUG` env
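    
    As a rough illustration of the setup these bullets describe (the names
    `WorkerLogFilter` and `prepare_global_logging` and the exact file layout are
    assumptions for this sketch, not necessarily what the repo uses):
    
    ```python
    import logging
    import os
    import sys
    
    
    class WorkerLogFilter(logging.Filter):
        """Prefix every log record with the distributed worker's rank."""
    
        def __init__(self, rank: int = -1) -> None:
            super().__init__()
            self._rank = rank
    
        def filter(self, record: logging.LogRecord) -> bool:
            if self._rank != -1:
                record.msg = f"Rank {self._rank} | {record.msg}"
            return True
    
    
    def prepare_global_logging(serialization_dir: str, rank: int = 0, world_size: int = 1) -> None:
        # Log level can be raised to DEBUG via the ALLENNLP_DEBUG environment variable.
        level = logging.DEBUG if os.environ.get("ALLENNLP_DEBUG") else logging.INFO
    
        # Standard handlers (one log file per worker plus stdout) replace TeeLogger.
        log_file = "out.log" if world_size == 1 else f"out_worker{rank}.log"
        handlers = [
            logging.FileHandler(os.path.join(serialization_dir, log_file)),
            logging.StreamHandler(sys.stdout),
        ]
    
        root = logging.getLogger()
        root.setLevel(level)
        worker_filter = WorkerLogFilter(rank if world_size > 1 else -1)
        for handler in handlers:
            handler.addFilter(worker_filter)
            root.addHandler(handler)
    ```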
    scarecrow1123 authored and DeNeutoy committed Oct 23, 2019 · commit 2d7a51b

Commits on Oct 28, 2019

  1. Changes to train_model for distributed training support (#3390)

    * High level API changes to support distributed training
    
    * Fix flake8 error
    
    * Fix mypy error
    
    * Add docstring and misc fixes
    
    * Fix flake tests
    scarecrow1123 authored and DeNeutoy committed Oct 28, 2019 · commit 8dde222

Commits on Nov 21, 2019

  1. Trainer changes for distributed training (#3414)

    Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:
    
    * Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance
    * `Trainer.__init__` wraps up `self.model` with `DistributedDataParallel`
    *  Logging and metric aggregation are already done in the previous PRs
    * `Vocabulary` creation in case of distributed training is done before spawning the workers and creating `Trainer` class
    
    To run distributed training, the following flag needs to be enabled in the trainer config:
    
    ```jsonnet
    {
        "trainer": {
            "distributed": true,
            // ...
        }
    }
    ```
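    
    Roughly, the spawning pattern described above looks like the following sketch
    (this is not the exact trainer code; the worker function name, the `gloo`
    backend, and the env-var defaults are assumptions made to keep the example
    CPU-runnable):
    
    ```python
    import os
    
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel
    
    
    def _train_worker(process_rank: int, world_size: int, model: torch.nn.Module) -> None:
        # Each spawned worker joins the process group and builds its own Trainer.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("gloo", rank=process_rank, world_size=world_size)
    
        # Inside Trainer.__init__ the model is wrapped so that gradients are
        # synchronized across workers during backward().
        ddp_model = DistributedDataParallel(model)
    
        # ... build the Trainer around ddp_model and run training ...
    
        dist.destroy_process_group()
    
    
    if __name__ == "__main__":
        world_size = 2  # number of workers/devices
        # The Vocabulary is created here, before spawning, so every worker shares it.
        model = torch.nn.Linear(10, 2)
        mp.spawn(_train_worker, args=(world_size, model), nprocs=world_size, join=True)
    ```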
    
    TODO:
    * Try to reproduce comparable results and share extensive results for existing/selected models
    * Check whether other commands like `evaluate`, `predict`, and `fine-tune` work well with the new changes
    * Should all the callbacks be invoked from every worker in the case of callback-based training?
    * Should the current dataset readers also be changed to support distributed training (i.e. to selectively yield data based on worker rank)?
    * Write tests - _would be happy to get some suggestions on how to write tests for this_
    scarecrow1123 authored and brendan-ai2 committed Nov 21, 2019 · commit 8d004f2

Commits on Dec 12, 2019

  1. Dist tests (#3515)

    * add some tests
    
    * another test, fix incorrect type annotations
    
    * torch mp uses its own context, no need to set the default
    
    * lint
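    
    As a rough sketch of what such a test can look like (this is not the test
    added here; the function names and the port are illustrative), two CPU
    workers are spawned with the `gloo` backend and a value is aggregated
    across ranks, similar to how distributed metrics are combined:
    
    ```python
    import os
    
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    
    
    def _worker(rank: int, world_size: int) -> None:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29501"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
    
        # Each worker contributes its own value; all_reduce sums them so every
        # rank ends up with the same aggregated result.
        loss = torch.tensor([float(rank + 1)])
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        assert loss.item() == sum(range(1, world_size + 1))
    
        dist.destroy_process_group()
    
    
    def test_value_is_aggregated_across_workers():
        world_size = 2
        # torch.multiprocessing.spawn uses the "spawn" start method internally,
        # so there is no need to set a default start method first.
        mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
    ```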
    DeNeutoy committed Dec 12, 2019 · commit 9831717

Commits on Dec 13, 2019

  1. strip out old DP stuff, ensure multiple cuda devices raises errors (#3516)

    * strip out old DP stuff, ensure multiple cuda devices raises errors
    
    * lint
    
    * remove unused attribute
    
    * remove _cuda_devices everywhere
    
    * fixes
    
    * move distributed config up to top level
    
    * lint
    
    * clean up
    
    * rename occurrences of batch_group
    
    * remove hack from find_learning_rate
    
    * fix last tests
    
    * black
    
    * use a top level distributed config
    
    * correct error for int
    
    * change up `parse_cuda_devices` to raise a good error and be strongly typed
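    
    For illustration, a strongly typed `parse_cuda_devices` along the lines
    described here might look like the sketch below (the exact signature,
    return type, and error messages in the repo are likely different):
    
    ```python
    from typing import List, Union
    
    
    def parse_cuda_devices(cuda_device: Union[int, str, List[int]]) -> int:
        """
        Normalize the `cuda_device` config value to a single int, raising an
        informative error up front instead of failing later with a confusing one.
        """
        if isinstance(cuda_device, str):
            raise ValueError(
                f"cuda_device was given as the string '{cuda_device}'; "
                "it should be an integer (use -1 for CPU)."
            )
        if isinstance(cuda_device, list):
            raise ValueError(
                "Multiple cuda devices are no longer handled by the trainer itself; "
                "use the top-level 'distributed' configuration for multi-GPU training."
            )
        return int(cuda_device)
    ```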
    DeNeutoy committed Dec 13, 2019 · commit 18878d6

Commits on Dec 16, 2019

  1. commit b2caecf
  2. commit 308984f
  3. commit 587f212

Commits on Dec 17, 2019

  1. fix merge

    DeNeutoy committed Dec 17, 2019 · commit 79f0513