
FairScale integration #5242

Merged: 104 commits from fairscale into main on Jul 19, 2021

Conversation

@epwalsh (Member) commented Jun 3, 2021

Still TODO ☑️ 👇

(stars indicate relative difficulty / complexity / how much I'm dreading this item)

  • ★☆☆☆☆ Make BeamSearch a lazy parameter to T5 module and model
  • ★☆☆☆☆ Add a configuration file to allennlp_models/training_configs/generation/ for fine-tuning T5 on CNN/DM
  • ★★★★☆ Implement abstraction for activation/gradient checkpointing (plus optional activation CPU offloading)
  • ★★★☆☆ Integrate activation/gradient checkpointing and CPU offloading into T5
  • ★☆☆☆☆ Get rid of the do_auto_wrap option.
  • ★★☆☆☆ Improve the internal API/handling of _MODULE_SHARDED_FLAG and _WRAPPED_MODULE_GETTER. Maybe add a function in nn.util like set_module_sharded (which would take a wrapped_module_getter function) and an is_sharded_module function, or just create a mixin base class with the needed methods (see the sketch after this list).
  • ★★★★★ Get state checkpointing working
  • ★★☆☆☆ Add AdaFactor optimizer
  • ★★☆☆☆ API clean up
  • ★☆☆☆☆ Update CHANGELOG
  • ★☆☆☆☆ Remove allennlp-models branch patch in .github/workflows/ci.yml, and do the same in the corresponding allennlp-models PR (requirements.txt and Makefile)
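
For the sharded-module item above, a minimal sketch of the mixin idea; the names ShardedModuleMixin, get_original_module, and is_sharded_module are illustrative here, not a final API:

```python
import torch.nn as nn


class ShardedModuleMixin:
    """
    Hypothetical mixin that marks a module as sharded and exposes the
    underlying wrapped module, replacing the private _MODULE_SHARDED_FLAG
    and _WRAPPED_MODULE_GETTER attributes.
    """

    def get_original_module(self) -> nn.Module:
        raise NotImplementedError


def is_sharded_module(module: nn.Module) -> bool:
    # A plain isinstance check is enough once the mixin exists.
    return isinstance(module, ShardedModuleMixin)
```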

For reviewers of this PR, I would suggest you start by looking at the new functionality provided in allennlp/nn/parallel/ and allennlp/nn/checkpoint/. Then look at allennlp/modules/transformer/t5.py to see how these features are integrated into a model. The training config in the models PR is a complete example of how to specify these options in a config.

@epwalsh epwalsh marked this pull request as ready for review June 29, 2021 22:24
@epwalsh epwalsh changed the title [WIP] FairScale integration FairScale integration Jun 29, 2021
@epwalsh epwalsh requested review from dirkgr and AkshitaB June 29, 2021 22:25
@epwalsh (Member, Author) commented Jun 29, 2021

@jacobdanovitch you might find this interesting!

@dirkgr (Member) left a comment

100 lines of code for distributed training. 1400 lines for checkpointing. I see why you gave it all those stars.

I'm not a big fan of those wrappers, or wrapper factories. They change the model init API in unintuitive ways, they are annoying to pass around, they mess with serialization. Is there some way we could do this from outside the model, cleanly? For example, what if we gave regexes that tell the trainer which modules to wrap? Or maybe an API where the model can optionally return a list of modules that it would like wrapped by the trainer during initialization? What scenarios would be broken if we had that approach?
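
A rough sketch of the regex idea in the paragraph above, under the assumption that the trainer, not the model, decides what to wrap; wrap_matching_modules and its patterns argument are made up for illustration, and wrap_fn stands in for whatever wrapping call would be used:

```python
import re
from typing import Callable, Iterable

import torch.nn as nn


def _get_submodule(root: nn.Module, dotted_name: str) -> nn.Module:
    # Walk a dotted submodule name like "encoder.layers.3" down from the root.
    module = root
    for part in dotted_name.split("."):
        module = getattr(module, part)
    return module


def wrap_matching_modules(
    model: nn.Module,
    patterns: Iterable[str],
    wrap_fn: Callable[[nn.Module], nn.Module],
) -> None:
    """Replace each named submodule matching any pattern with wrap_fn(submodule)."""
    regexes = [re.compile(p) for p in patterns]
    # Collect matches first so the module tree isn't mutated while iterating.
    matches = [
        name
        for name, _ in model.named_modules()
        if name and any(r.search(name) for r in regexes)
    ]
    for name in matches:
        parent_name, _, child_name = name.rpartition(".")
        parent = _get_submodule(model, parent_name) if parent_name else model
        setattr(parent, child_name, wrap_fn(_get_submodule(model, name)))
```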

I also seem to remember you saying that this wrapper factory approach mirrors the approach that FairScale took. Can you point me to some examples of that?

CHANGELOG.md (comment on lines 52 to 55):
- The type of the `grad_norm` parameter of `GradientDescentTrainer` is now `Union[float, bool]`,
with a default value of `False`. `False` means gradients are not rescaled and the gradient
norm is never even calculated. `True` means the gradients are still not rescaled but the gradient
norm is calculated and passed on to callbacks. A `float` value means gradients are rescaled.
Member:

I haven't seen the code yet, but I'm not too wild about this API. That means you have to know whether some other component needs the gradient norm or not. I'd rather provide a function called get_grad_norm() or something like that, which calculates it lazily.
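
A minimal sketch of what such a lazily computed helper could look like; get_grad_norm here is a hypothetical free function, not an existing trainer method:

```python
from typing import Optional

import torch


def get_grad_norm(model: torch.nn.Module) -> Optional[torch.Tensor]:
    """Compute the total L2 norm of all parameter gradients, or return None if there are none."""
    grads = [p.grad.detach() for p in model.parameters() if p.grad is not None]
    if not grads:
        return None
    return torch.norm(torch.stack([torch.norm(g) for g in grads]))
```

With this shape, callbacks that need the norm call the helper themselves, and callbacks that don't never pay for the computation.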

Resolved review threads: README.md, allennlp/models/model.py, allennlp/nn/module.py, allennlp/nn/parallel/ddp_wrapper.py, allennlp/nn/util.py, allennlp/training/checkpointer.py, allennlp/training/gradient_descent_trainer.py
return amp.GradScaler()


class DdpWrapper(Registrable):
Member:

This is not really a wrapper though. This is more like a wrapper factory. Is there any scenario where we would create this object and then wrap multiple models with it?

@epwalsh (Member, Author):

It's a terrible name. I've renamed it DdpAccelerator.
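
For a sense of the "wrapper factory" shape under discussion, a hypothetical sketch; the actual DdpAccelerator API in this PR may differ in method names and signatures:

```python
import torch.nn as nn

from allennlp.common.registrable import Registrable


class DdpAcceleratorSketch(Registrable):
    """
    Illustrative only: a registrable factory that produces wrapped modules
    rather than being a wrapper itself.
    """

    def wrap_model(self, model: nn.Module) -> nn.Module:
        """Wrap the root model for distributed training."""
        raise NotImplementedError

    def wrap_module(self, module: nn.Module) -> nn.Module:
        """Optionally wrap an individual submodule (e.g. with FairScale's FSDP)."""
        raise NotImplementedError
```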

Resolved review thread: scripts/py2md.py
@epwalsh epwalsh requested a review from dirkgr July 8, 2021 22:48
@dirkgr dirkgr linked an issue Jul 13, 2021 that may be closed by this pull request
"model", it gets specified as "ddp_wrapper" in the "distributed" part of the config, and is then
passed in to the model separately.
"model", it gets specified as "ddp_accelerator" in the "distributed" part of the config, and is then
passed in to the model automatically.
Member:

Why? Can we just pass it to the model and save ourselves one exception?
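
For reference, the documented option would sit in a training config roughly as sketched below, shown as the equivalent Python dict; the keys and type name are illustrative, not copied from the actual config in the models PR:

```python
# Hypothetical sketch of the relevant config section.
config = {
    "distributed": {
        "cuda_devices": [0, 1],
        # Declared under "distributed" rather than "model"; the trainer builds
        # it and passes it to the model automatically.
        "ddp_accelerator": {"type": "fairscale_fsdp"},
    },
    # ... "dataset_reader", "model", "trainer", etc.
}
```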

@epwalsh epwalsh merged commit ca656fc into main Jul 19, 2021
@epwalsh epwalsh deleted the fairscale branch July 19, 2021 23:39
Development

Successfully merging this pull request may close these issues:

Add option to set find_unused_parameters (if necessary?)