[feat] 3/n pp (Lightning-AI#5036)
* add pp doc

* update doc

* update doc

* update doc

* Update docs

* update doc

* update

* update doc

* update doc

* Formatting, update sharded zip link

* Update docs/source/multi_gpu.rst

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Apply suggestions from code review

* Reference directly to section

Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
4 people committed Dec 9, 2020
1 parent 69725ad commit 20b806a
Showing 4 changed files with 99 additions and 8 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -32,6 +32,6 @@ repos:
types: [python]

- repo: https://github.com/pre-commit/mirrors-mypy
-  rev: master
+  rev: v0.790
hooks:
- id: mypy
87 changes: 81 additions & 6 deletions docs/source/multi_gpu.rst
@@ -612,6 +612,7 @@ This is useful when dealing with large Transformer based models, or in environme
Lightning currently offers the following methods to leverage model parallelism:

- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
- Sequential Model Parallelism with Checkpointing (partition your :class:`nn.Sequential <torch.nn.Sequential>` module across multiple GPUs, leveraging checkpointing and microbatching for further memory savings and improved device utilization)

Sharded Training
^^^^^^^^^^^^^^^^
@@ -666,7 +667,7 @@ To use Sharded Training, you need to first install FairScale using the command below:

.. code-block:: bash

-    pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip
+    pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip

.. code-block:: python
@@ -678,6 +679,80 @@ Sharded Training can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.

Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
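
As a minimal sketch (assuming the ``ddp_sharded`` plugin string is accepted by the ``Trainer`` in this release, and using a hypothetical ``MyModel`` LightningModule), enabling Sharded Training looks roughly like this:

.. code-block:: python

    from pytorch_lightning import Trainer

    from my_project import MyModel  # hypothetical LightningModule

    model = MyModel()

    # ``ddp_sharded`` shards optimizer state and gradients across the 4 GPUs;
    # swap ``accelerator='ddp'`` for another DDP variant if needed.
    trainer = Trainer(gpus=4, accelerator='ddp', plugins='ddp_sharded')
    trainer.fit(model)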

----------

.. _sequential-parallelism:

Sequential Model Parallelism with Checkpointing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
We also provide auto-balancing techniques through FairScale, to find optimal balances for the model across GPUs.
In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to automatically minimize device under-utilization.

Reference: https://arxiv.org/abs/1811.06965

.. note:: DDPSequentialPlugin is currently supported only for PyTorch 1.6.

To get started, install FairScale through the Lightning extras using ``pip install pytorch-lightning["extra"]``

or directly using

.. code-block:: bash

    pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip

To use Sequential Model Parallelism, you must define a :class:`nn.Sequential <torch.nn.Sequential>` module that contains the layers you wish to parallelize across GPUs.
Keep it in the ``sequential_module`` attribute of your ``LightningModule``, as shown below.

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin


    class MyModel(LightningModule):
        def __init__(self):
            super().__init__()
            ...
            self.sequential_module = torch.nn.Sequential(my_layers)


    # Split my module across 4 GPUs, one layer each
    model = MyModel()
    plugin = DDPSequentialPlugin(balance=[1, 1, 1, 1])
    trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin])
    trainer.fit(model)

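The ``balance`` list controls how many consecutive layers of the ``nn.Sequential`` are placed on each GPU. As a rough sketch under the same API (the ``MySequentialModel`` name, layer sizes, and split below are illustrative, and the usual ``training_step``/``configure_optimizers`` hooks are omitted just as above), an uneven split across 2 GPUs could look like this:

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin


    class MySequentialModel(LightningModule):
        def __init__(self):
            super().__init__()
            # 4 layers in total; the plugin partitions them according to ``balance``
            self.sequential_module = torch.nn.Sequential(
                torch.nn.Linear(32, 128),
                torch.nn.ReLU(),
                torch.nn.Linear(128, 128),
                torch.nn.Linear(128, 10),
            )


    model = MySequentialModel()
    # First 3 layers on GPU 0, the last layer on GPU 1
    plugin = DDPSequentialPlugin(balance=[3, 1])
    trainer = Trainer(accelerator='ddp', gpus=2, plugins=[plugin])
    trainer.fit(model)
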
We provide a minimal example of Sequential Model Parallelism using a convolutional model trained on CIFAR-10 and split across GPUs `here <https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples/basic_examples/conv_sequential_example.py>`_.
To run the example, you need to install `Bolts <https://github.com/PyTorchLightning/pytorch-lightning-bolts>`_. Install with ``pip install pytorch-lightning-bolts``.

When running the Sequential Model Parallelism example on 2 GPUs, we achieve the following memory savings:

.. list-table:: GPU Memory Utilization
   :widths: 25 25 50
   :header-rows: 1

   * - GPUs
     - Without Balancing
     - With Balancing
   * - GPU 0
     - 4436 MB
     - 1554 MB
   * - GPU 1
     - ~0
     - 994 MB

To run the example with Sequential Model Parallelism:

.. code-block:: bash

    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential

To run the same example without Sequential Model Parallelism:

.. code-block:: bash

    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1

Batch size
----------
@@ -728,17 +803,17 @@ Lightning supports the use of TorchElastic to enable fault-tolerant and elastic
.. code-block:: python

    Trainer(gpus=8, accelerator='ddp')

Following the `TorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

.. code-block:: bash

    etcd --enable-v2
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001
         --advertise-client-urls PUBLIC_HOSTNAME:2379

And then launch the elastic job with:

.. code-block:: bash
@@ -750,7 +825,7 @@ And then launch the elastic job with:

        --rdzv_backend=etcd
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT
        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

See the official `TorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.
10 changes: 9 additions & 1 deletion docs/source/performance.rst
@@ -131,4 +131,12 @@ To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.

Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.


Sequential Model Parallelism with Checkpointing
---------------------------------------------------------------------
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

For more information, refer to :ref:`sequential-parallelism`.
8 changes: 8 additions & 0 deletions docs/source/training_tricks.rst
@@ -123,3 +123,11 @@ The algorithm in short works by:
:members: scale_batch_size

.. warning:: Batch size finder is not supported for DDP yet; it is coming soon.
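
As a hedged sketch of how the batch size finder is typically invoked (assuming the ``auto_scale_batch_size`` Trainer flag and ``trainer.tune`` are available in this release, and using a hypothetical ``MyModel`` that exposes a ``batch_size`` attribute or hparam):

.. code-block:: python

    from pytorch_lightning import Trainer

    from my_project import MyModel  # hypothetical LightningModule exposing ``batch_size``

    model = MyModel()

    # 'binsearch' grows the batch size until OOM, then binary-searches the largest value that fits
    trainer = Trainer(auto_scale_batch_size='binsearch', gpus=1)
    trainer.tune(model)   # runs the batch size finder and updates the model's batch size
    trainer.fit(model)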


Sequential Model Parallelism with Checkpointing
---------------------------------------------------------------------
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

For more information, refer to :ref:`sequential-parallelism`.
