This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[T5] Support Distributed Training #4434

Merged
merged 2 commits into from Mar 21, 2022


klshuster
Contributor

Patch description
Per #4430, T5 was hanging during distributed training. The cause was the forced setting of the CUDA device, which is required for training with T5 model parallel. That call is now protected, so it no longer runs unconditionally.
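To illustrate the general idea of a protected device-setting call, here is a minimal sketch. It is not the code from this PR: the decorator name `guard_device_call`, the `should_set` predicate, and the stand-in `set_device` callback are all hypothetical, and the condition used (skip when running distributed) is an assumption about what "protected" means here. The stand-ins let the sketch run without a GPU.

```python
import functools
from typing import Callable


def guard_device_call(should_set: Callable[[], bool], set_device: Callable[[], None]):
    """Hypothetical helper: pin the device around a call only when should_set() is true."""

    def decorator(func):
        @functools.wraps(func)
        def wrap(*args, **kwargs):
            # Pin the device before the wrapped call, if allowed.
            if should_set():
                set_device()
            result = func(*args, **kwargs)
            # Pin it again afterwards, in case the wrapped call changed it.
            if should_set():
                set_device()
            return result

        return wrap

    return decorator


# Stand-ins (assumptions for illustration): record device pins instead of touching CUDA.
device_calls = []
state = {"distributed": False}


@guard_device_call(
    should_set=lambda: not state["distributed"],
    set_device=lambda: device_calls.append("cuda:0"),
)
def train_step(x):
    # Placeholder for a model-parallel forward/backward pass.
    return x * 2
```

With PyTorch, one might pass `lambda: torch.cuda.is_available() and not torch.distributed.is_initialized()` as the predicate and `lambda: torch.cuda.set_device(0)` as the setter, so that each distributed worker keeps the device it was assigned. Whether that matches the exact check in this PR is not stated in the description.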

Testing steps

  1. Tested with the command provided in "Fine-tuning T5 models with multiprocessing_train" (#4430).
  2. Added a new distributed test for T5:
$ pytest test_t5.py
======test session starts ======
platform linux -- Python 3.7.9, pytest-5.3.2, py-1.10.0, pluggy-0.13.1
rootdir: /private/home/kshuster/ParlAI, inifile: pytest.ini
plugins: hydra-core-1.1.0, requests-mock-1.7.0, regressions-2.1.1, datadir-1.3.1
collected 9 items

test_t5.py .........                                                                                                                                                          [100%]

======slowest 10 test durations ======
97.49s call     tests/nightly/gpu/test_t5.py::TestT5Distributed::test_t5_distributed
60.84s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_t5_model_parallel
46.50s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_t5_ft
19.85s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_t5_gen
16.18s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_small
12.19s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_summarization
10.75s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_translation_en_to_fr
9.85s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_translation_en_to_de
9.79s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_translation_en_to_ro

(0.00 durations hidden.  Use -vv to show these durations.)
======9 passed, 9 warnings in 290.46s (0:04:50) ======

Other information

Contributor

@stephenroller stephenroller left a comment

lordy house of cards

@klshuster klshuster merged commit f036542 into main Mar 21, 2022
@klshuster klshuster deleted the t5_distribuetd branch March 21, 2022 19:08