
strip out old DP stuff, ensure multiple cuda devices raises errors #3516

Merged
merged 15 commits into allenai:torch-distributed from remove-dp-2 on Dec 13, 2019

Conversation


@DeNeutoy (Contributor) commented on Dec 13, 2019:

  • `DataParallel` is removed.
  • `Trainer` and `CallbackTrainer` now take a single cuda device.
  • `batch_loss` now takes only a single batch, rather than a list of batches (previously required for `DataParallel` training).
  • `trainer._cuda_devices` is removed and replaced with `trainer.cuda_device`.
  • `trainer._multi_gpu` is removed.
  • Config for distributed training is now read from the top level of the config rather than from the `Trainer` section, because it is not explicitly used to construct the `Trainer` object. This means less config inspection in `allennlp train`, and it ensures that `Trainer` configs work with all other commands. (A before/after config sketch follows this list.)
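For readers migrating configs, here is a rough before/after sketch. It is written as a Python dict wrapped in allennlp's `Params` rather than as jsonnet, and the exact shape of the top-level `distributed` stanza follows the review discussion below, not a finalized schema:

```python
from allennlp.common import Params

# Old style (removed by this PR): the Trainer received a list of devices and
# wrapped the model in DataParallel.
old_config = Params(
    {"trainer": {"cuda_device": [0, 1, 2, 3], "num_epochs": 20, "optimizer": "adam"}}
)

# New style (per the discussion below): the Trainer gets a single device per
# process, and multi-GPU training is configured at the top level of the config.
new_config = Params(
    {
        "distributed": {"cuda_devices": [0, 1, 2, 3]},
        "trainer": {"num_epochs": 20, "optimizer": "adam"},
    }
)
```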

@@ -193,6 +193,14 @@ def find_learning_rate_model(
    prepare_environment(params)

    cuda_device = params.params.get("trainer").get("cuda_device", -1)
    devices = parse_cuda_device(cuda_device)

    # HACK: The trainer can not be constructed with multiple gpus.
DeNeutoy (Contributor Author):

@brendan-ai2 - this PR makes the `Trainer` take strictly a single GPU. This presents a small problem: we still want configs that have a `"cuda_device": [1, 2, 3]` line to work with commands other than `train` (like this `find_learning_rate` command).

I put this hack in here to demonstrate the problem (we have to manually parse the cuda devices every time we try to build the trainer, which is kind of gross), but I think a good way to resolve this is to have a separate `"distributed_gpus"` argument, either in the trainer or at the top level of the config, which we can use in `train` to determine and launch the workers. This way, `Trainer.from_params` would be guaranteed to work in all cases without manually parsing out the cuda device.
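Concretely, the pattern being objected to looks roughly like the sketch below. This is a hypothetical paraphrase of the hack in the diff above, not the actual code; the import path and the fallback to the first device are my own illustration:

```python
from allennlp.common import Params
from allennlp.common.checks import parse_cuda_device


def _single_device_hack(params: Params) -> None:
    """Normalise a possibly-list-valued trainer cuda_device to a single int
    before constructing a Trainer from a config."""
    cuda_device = params.params.get("trainer").get("cuda_device", -1)
    devices = parse_cuda_device(cuda_device)
    if isinstance(devices, list):
        # The Trainer can no longer be constructed with multiple GPUs, so a
        # non-train command just keeps the first device.
        params.params["trainer"]["cuda_device"] = devices[0]
    # ... then build the Trainer from params as usual ...
```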

Also, maybe it makes sense to move the `distributed`, `master_address` and `master_port` config arguments to the top level of the config - they are not actually arguments to the `Trainer`; they just affect how it is constructed within the `train` command. Does that make sense? Do you have any other ideas?

DeNeutoy (Contributor Author):

Actually, I think moving the distributed config up to the top level solves all of these problems - I'll do that.

brendan-ai2 (Contributor):

Yeah, that makes total sense. One suggestion, though: I suspect we may need further distributed options soon (e.g., various multi-node settings). What about having config like:

"trainer": {"num_epochs": 2, "optimizer": "adam"},
"distributed": {"cuda_devices": [0, 1, 2]},

This gives us an obvious place to add them.
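For example (purely illustrative: `master_address` and `master_port` are the existing trainer-level options mentioned earlier in this thread, and the values are placeholders):

```python
# A possible future shape for the top-level stanza, with room for multi-node
# settings alongside the device list (placeholder values only).
distributed_config = {
    "cuda_devices": [0, 1, 2],
    "master_address": "127.0.0.1",
    "master_port": 29500,
}
```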

"cuda_device": [0, 1],
},
"trainer": {"num_epochs": 2, "optimizer": "adam"},
"distributed": True,
DeNeutoy (Contributor Author):

This flag is a bit pointless now - maybe we can just rely on distributed cuda_devices as the single flag?

brendan-ai2 (Contributor):

Less redundancy sounds good. Though see my point above about having a top-level `distributed: {...}` stanza.

else:
model_device = cuda_device
if model_device >= 0:
raise ConfigurationError(
DeNeutoy (Contributor Author):

I would like to move this check inside `parse_cuda_devices`, as the two places we pass cuda devices are strictly typed - `cuda_device` inside the trainer config must be an `int`, and `distributed_cuda_devices` at the top level must be a `List[int]`. However, we should probably still raise a nice warning for people who are converting their configs. Thoughts?

brendan-ai2 (Contributor):

I'm definitely in favor of raising a nice warning to assist in changing the configs. To that end, could we add a snippet in the exception? Something like:

"""In allennlp 1.0, the Trainer cannot be passed multiple cuda devices. Instead, use the faster Distributed Data Parallel. For instance, if you previously had config like:
{
  "trainer": {
    "cuda_device": [0, 1, 2, 3],
    "num_epochs": 20,
    ...
  }
}

simply change it to:

{
  "distributed": {
    "cuda_devices": [0, 1, 2, 3],
  },
  "trainer": {
    "num_epochs": 20,
    ...
  }
}
"""

brendan-ai2 (Contributor):

As for where to put the warning, it seems that `parse_cuda_devices` is outdated now, given the strict typing you mention. Maybe make a `parse_cuda_device` that errors if you pass it a list, and similarly a `parse_cuda_devices` that requires a list?

DeNeutoy (Contributor Author):

Added this inside `parse_cuda_device`, which now strictly returns an `int`. We ended up not needing a version that parses multiple devices, as we only do that in one place.
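A minimal sketch of the resulting behaviour, assuming the signature and import path below (the real implementation may differ); the error text follows the migration hint suggested above:

```python
from typing import List, Union

from allennlp.common.checks import ConfigurationError


def parse_cuda_device(cuda_device: Union[str, int, List[int]]) -> int:
    """Sketch: return a single device id, rejecting lists with a hint about
    the new top-level "distributed" config."""
    if isinstance(cuda_device, list):
        raise ConfigurationError(
            "In allennlp 1.0, the Trainer cannot be passed multiple cuda devices. "
            "Instead, use distributed training by moving the device list to a "
            'top-level "distributed": {"cuda_devices": [...]} entry in your config.'
        )
    return int(cuda_device)
```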

@@ -36,12 +36,18 @@ def __init__(
rank: int = 0,
brendan-ai2 (Contributor):

The type of `cuda_device` a few lines up should be just `int` now.

DeNeutoy (Contributor Author):

Thanks, fixed

@@ -328,36 +325,6 @@ def create_serialization_dir(
os.makedirs(serialization_dir, exist_ok=True)


def data_parallel(
brendan-ai2 (Contributor):

Hooray! 🎉

"our Trainer always uses a single GPU per process."
)

if not isinstance(cuda_device, int):
    raise ConfigurationError(
        "Expected an int or list for cuda_device, got {}".format(cuda_device)
brendan-ai2 (Contributor):

This error message still mentions `list`.

DeNeutoy (Contributor Author):

Thanks, fixed

@brendan-ai2 (Contributor) left a comment:

Looks solid. Left a few minor notes. Thanks for the PR!

@DeNeutoy merged commit 18878d6 into allenai:torch-distributed on Dec 13, 2019
@DeNeutoy deleted the remove-dp-2 branch on December 13, 2019 at 23:38
@DeNeutoy mentioned this pull request on Dec 16, 2019
DeNeutoy added a commit that referenced this pull request on Dec 17, 2019
* Logging and metrics changes for distributed training (#3372)

* Refactor logging setup to support distributed attrs

* `cleanup_logging()` is replaced with stdlib's `logging.shutdown()`
* Remove `TeeLogger` and use standard log handlers
* Remove `replace_cr_with_newline` and use the standard logging practice of using
`logging.Filter`
* Introduce `rank` and `world_size` optional attributes to support
distributed workers

* Support for distributed training in `get_metrics`

* Remove bad import

* Fix duplicate log messages in stdout

* Remove preemptive `logging.shutdown`

`logging.shutdown` is called by the logging module by default during exit, which makes it unnecessary to call it from `train_model`.

* Fix black formatting issues

* Remove `tee_logger` references in API doc

* Set log level from `ALLENNLP_DEBUG` env

* Changes to `train_model` for distributed training support (#3390)

* High level API changes to support distributed training

* Fix flake8 error

* Fix mypy error

* Add docstring and misc fixes

* Fix flake tests

* `Trainer` changes for distributed training (#3414)

Follow-up PR to #3390 and #3372 to bring in distributed training support. The major changes are:

* Workers are spawned using `mp.spawn` and each worker creates its own `Trainer` instance
* `Trainer.__init__` wraps up `self.model` with `DistributedDataParallel`
* Logging and metric aggregation were already handled in the previous PRs
* `Vocabulary` creation in the distributed case is done before spawning the workers and creating the `Trainer` class

To run distributed training, the trainer config needs the following flag enabled (a generic PyTorch sketch of the underlying worker/DDP pattern follows the config example):

```jsonnet
{
    "trainer": {
        "distributed": true,
        // ...
    }
}
```
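The mechanics described above follow the standard PyTorch pattern. A generic, self-contained sketch (plain torch with a stand-in model, not the actual allennlp worker code; the address, port and nccl backend are assumptions) looks roughly like this:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel


def _worker(rank: int, world_size: int) -> None:
    # Each spawned worker joins the process group and owns exactly one GPU.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)  # stand-in for the real model
    # Wrapping in DistributedDataParallel synchronises gradients across workers,
    # which is what Trainer.__init__ does to self.model in the distributed case.
    ddp_model = DistributedDataParallel(model, device_ids=[rank])

    # ... run the training loop with ddp_model here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # One process per GPU; mp.spawn passes the worker index as the first argument.
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)
```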

TODO:
* Try to reproduce comparable results and share extensive results for existing/selected models
* Check if other commands like `evaluate`, `predict`, and `fine-tune` work well with the new changes
* Should all the callbacks be called from every worker in the case of callback-based training?
* Should the current dataset readers be changed to support distributed training as well (to selectively yield data based on their rank)?
* Write tests - _would be happy to get some suggestions on how to write tests for this_

* Dist tests (#3515)

* add some tests

* another test, fix incorrect type annotations

* torch mp uses its own context, no need to set default

* lint

* strip out old DP stuff, ensure multiple cuda devices raises err… (#3516)

* strip out old DP stuff, ensure multiple cuda devices raises errors

* lint

* remove unused attribute

* remove _cuda_devices everywhere

* fixes

* move distributed config up to top level

* lint

* clean up

* rename occurrences of batch_group

* remove hack from find_learning_rate

* fix last tests

* black

* use a top level distributed config

* correct error for int

* change up parse_cuda_devices to raise good error and be strongly typed

* fix merge