merge changes from deepspeed master (microsoft#24)
* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
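
For context, a minimal sketch of the kind of guard that avoids both failure modes (an illustration, not the actual DeepSpeed scheduler code; names are hypothetical):

```python
import math

def warmup_scale(step, warmup_num_steps):
    # Hypothetical guard: log(0) blows up at step 0, and 1 / log(1) divides
    # by zero when warmup_num_steps == 1, so clamp both inputs first.
    warmup_num_steps = max(2, warmup_num_steps)  # avoids 1 / log(1)
    step = max(1, step)                          # avoids log(0)
    return min(1.0, math.log(step) / math.log(warmup_num_steps))
```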

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
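
A sketch of the idea (hypothetical names, not the exact code): drop any client param_group entry that lacks a 'params' key before handing the list to the optimizer.

```python
def filter_param_groups(param_groups):
    # Hypothetical sketch: entries without a 'params' key would break the
    # underlying torch optimizer, so remove them up front.
    return [group for group in param_groups if 'params' in group]
```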

* [doc] pipeline doc typos/improvements (microsoft#659)

Admin merging for a pure-doc PR that does not trigger a build.

* Samyamr/inference hook fix (microsoft#851)

* Fix mis-aligned-grad

When a parameter is not divisible by the world size, the partitioned gradients are misaligned due to incorrect padding handling. This PR fixes that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisible by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO Stage 2: Clear reduced gradients (microsoft#856)

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
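
Conceptually (a sketch under assumed names, not the stage-2 implementation): once a bucket has been reduced, each rank keeps only the gradient partition it owns and frees the rest.

```python
def clear_reduced_grads(bucket_params, owned_params):
    # Hypothetical sketch: gradients reduced into another rank's partition
    # are no longer needed locally, so release them to save memory and to
    # avoid stale values leaking into the next step.
    for p in bucket_params:
        if p not in owned_params:
            p.grad = None
```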

* [runner/launch] propagate the error (microsoft#854)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* docs: minor spelling tweaks (microsoft#858)

* Allow args to be optional in deepspeed.initialize (microsoft#825)
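
With this change, a caller can configure the engine purely through `config_params`; a usage sketch (the model below is a stand-in):

```python
import torch
import deepspeed

model = torch.nn.Linear(10, 10)  # placeholder model

# args is now optional; the engine can be configured entirely via config_params.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params={"train_batch_size": 8},
)
```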

* Fix ZeRO3 save_checkpoint (microsoft#857)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Make config objects json serializable (microsoft#862)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump version 0.3.13

* 1-bit Adam v2 (microsoft#817)

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

* NCCL-based implementation that provides better performance and usability than the MPI-based implementation.
* Support for momentum masks for parameters with constant zero gradients during training.
* Bug fixes (e.g., microsoft#813).
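
As an illustration, enabling the optimizer might look like the following config fragment, expressed here as a Python dict (the key names follow the 1-bit Adam tutorial but should be treated as assumptions):

```python
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,          # warmup steps before compression starts
            "comm_backend_name": "nccl",  # the new NCCL-based backend
        },
    },
}
```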

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (microsoft#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (microsoft#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm, phase 2. NCCL comm passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missing file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.
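
A sketch of the pattern (assumed names, not the DeepSpeed code): issue the collective asynchronously and overlap computation before waiting.

```python
import torch
import torch.distributed as dist

def nonblocking_gather(tensor, world_size):
    # Hypothetical sketch: async_op=True returns a handle immediately so the
    # caller can overlap computation with communication, then call wait().
    out = [torch.empty_like(tensor) for _ in range(world_size)]
    handle = dist.all_gather(out, tensor, async_op=True)
    return out, handle  # caller: do other work, then handle.wait()
```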

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Fix a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 7840085, reversing
changes made to a6dba72.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd98.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8
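
For instance, a guard along these lines (a sketch; the exact marker name and reason text are assumptions):

```python
import pytest
import torch

TORCH_MAJOR, TORCH_MINOR = (int(v) for v in torch.__version__.split('.')[:2])

requires_pt18 = pytest.mark.skipif(
    (TORCH_MAJOR, TORCH_MINOR) < (1, 8),
    reason="NCCL point-to-point used by the 1-bit tests requires torch >= 1.8",
)
```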

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoint handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* consistent checkpoint filenaming (microsoft#865)

* consistent checkpoint filenaming

* backward compatible rename

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
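
Illustratively, consistent naming centralizes the pattern in one helper so save and load can never drift apart (the pattern below is hypothetical, not DeepSpeed's exact scheme):

```python
def checkpoint_name(save_dir, tag, mp_rank):
    # Hypothetical sketch: a single helper builds every checkpoint filename.
    return f"{save_dir}/{tag}/mp_rank_{mp_rank:02d}_model_states.pt"
```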

* [doc] launcher (microsoft#868)

As discussed in microsoft#662, this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` cl arg in the correct place in the invocation script

Fixes: microsoft#662

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline (microsoft#888)

* [doc] pipeline

As @g-karthik flagged in microsoft#659 (comment), my previous correction PR had one sentence that said the wrong thing, so this PR rectifies that.

Thank you!

* tweak

* [debug utils] see_memory_usage fixes (microsoft#890)

* see_memory_usage fixes

* didn't expect pt-1.2

* fix the order of things

* fix the order of things
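
For reference, a hypothetical call to the debug helper (the import path and signature are assumptions):

```python
from deepspeed.runtime.utils import see_memory_usage

see_memory_usage("before forward pass")  # logs current/peak GPU memory usage
```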

* full fp32 weights reconstruction for zero 2+3 (microsoft#892)
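
Conceptually (a sketch, not the DeepSpeed API): each rank holds a flat fp32 partition of the optimizer state, and reconstruction stitches the partitions back into parameter-shaped tensors.

```python
import torch

def reconstruct_fp32_params(partitions, shapes):
    # Hypothetical sketch: concatenate the per-rank flat fp32 partitions,
    # then re-slice the flat buffer back into the original parameter shapes.
    flat = torch.cat(partitions)
    params, offset = [], 0
    for shape in shapes:
        numel = 1
        for d in shape:
            numel *= d
        params.append(flat[offset:offset + numel].view(shape))
        offset += numel
    return params
```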

* save_fp16_model consolidated for zero3 (microsoft#893)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix zero stage2 cpu_offload when some model trainable parameters are skipped in training (microsoft#861)

* Fix zero stage2 cpu_offload when some model trainable parameters are skipped in training, as in microsoft#707

Because some trainable parameters are skipped during training, their backward
hooks in self.create_reduce_and_remove_grad_hooks() never run,
so they have no norm_for_param_grads.

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
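
The fix amounts to a membership guard; a sketch with assumed names:

```python
# Hypothetical sketch: parameters whose backward hooks never fired have no
# entry in norm_for_param_grads, so skip them instead of raising a KeyError.
def grad_norm(params, norm_for_param_grads):
    total = 0.0
    for p in params:
        if p in norm_for_param_grads:
            total += norm_for_param_grads[p] ** 2
    return total ** 0.5
```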

* update kramdown (microsoft#901)

Addresses a security alert related to the older kramdown version.

* update backward api doc (microsoft#903)

* Bump kramdown from 2.3.0 to 2.3.1 in /docs (microsoft#905)

Bumps [kramdown](https://github.com/gettalong/kramdown) from 2.3.0 to 2.3.1.
- [Release notes](https://github.com/gettalong/kramdown/releases)
- [Changelog](https://github.com/gettalong/kramdown/blob/master/doc/news.page)
- [Commits](https://github.com/gettalong/kramdown/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* We're hiring! + integration posts

* [website] We're hiring! + integration posts

* [website] we're hiring!

* zero.Init() clarification (microsoft#880)

* zero.Init() clarification

Clarifies that if `model.half()` can't fit into GPU memory, `zero.Init()` is a must.

This proposal follows @samyam's clarification shared elsewhere.

Thank you.

* style

* add clarity

* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
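
A usage sketch (the model is a small stand-in for one that would not fit):

```python
import torch
import deepspeed

# If even the fp16 copy of the model cannot fit on a single GPU, construct it
# under zero.Init() so parameters are partitioned across workers at allocation
# time instead of being materialized whole.
with deepspeed.zero.Init():
    model = torch.nn.Linear(4096, 4096)  # stand-in for a much larger model
```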

* disable pipe test (microsoft#915)

This test has been giving us trouble for a bit; we've been seeing nondeterministic failures, so we're skipping it for now to avoid breaking our CI. We need to revisit it soon, though.

* Add link to AML examples. (microsoft#916)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: brett koonce <koonce@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: hamlet <gvvvv@163.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sid <sidney.black@aleph-alpha.de>
14 people committed Apr 6, 2021
1 parent 1ed4d80 commit 1dcfbca
Showing 41 changed files with 810 additions and 329 deletions.
7 changes: 6 additions & 1 deletion README.md
@@ -4,6 +4,8 @@
[![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![Docker Pulls](https://img.shields.io/docker/pulls/deepspeed/deepspeed)](https://hub.docker.com/r/deepspeed/deepspeed)

### 03/2021: DeepSpeed is hiring! Come join us: [SDE 2](https://careers.microsoft.com/us/en/job/1013160/Software-Engineer-2), [Sr. SDE](https://careers.microsoft.com/us/en/job/1017151/Senior-Software-Engineer), [Sr. Researcher](https://careers.microsoft.com/us/en/job/1016440/Senior-Researcher)

[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.

@@ -31,15 +33,18 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)


# News
* [2021/04/01] [[DeepSpeed on AzureML] Transformers and CIFAR examples are now available on AzureML GitHub](https://github.com/Azure/azureml-examples/tree/main/workflows/train/deepspeed)
* [2021/03/30] [[PyTorch Lightning Blog] Accessible Multi-Billion Parameter Model Training with PyTorch Lightning + DeepSpeed](https://medium.com/pytorch-lightning/accessible-multi-billion-parameter-model-training-with-pytorch-lightning-deepspeed-c9333ac3bb59)
* [2021/03/16] [1-bit Adam v2: NCCL-based implementation and more](https://www.deepspeed.ai/tutorials/onebit-adam/)
* [2021/03/08] [ZeRO-3 Offload: Scale your models to trillion parameters without code changes while leveraging both CPUs & GPUs](https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html)
* [2021/01/19] [[🤗Hugging Face Blog] Fit More and Train Faster With ZeRO via DeepSpeed and FairScale](https://huggingface.co/blog/zero-deepspeed-fairscale)
* [2020/11/12] [Simplified install, JIT compiled ops, PyPI releases, and reduced dependencies](#installation)
* [2020/11/10] [Efficient and robust compressed training through progressive layer dropping](https://www.deepspeed.ai/news/2020/10/28/progressive-layer-dropping-news.html)
* [2020/09/10] [DeepSpeed v0.3: Extreme-scale model training for everyone](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)
* [Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention-news.html)
* [Training a trillion parameters with pipeline parallelism](https://www.deepspeed.ai/news/2020/09/08/pipeline-parallelism.html)
* [Up to 5x less communication and 3.4x faster training through 1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-news.html)
* [10x bigger model training on a single GPU with ZeRO-Offload](https://www.deepspeed.ai/news/2020/09/08/ZeRO-Offload.html)
* [2020/08/07] [DeepSpeed Microsoft Research Webinar](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html) is now available on-demand


# Table of Contents
12 changes: 8 additions & 4 deletions deepspeed/__init__.py
@@ -49,8 +49,8 @@ def _parse_version(version_str):
sys.modules['deepspeed.pt.loss_scaler'] = deepspeed.runtime.fp16.loss_scaler


def initialize(args,
model,
def initialize(args=None,
model=None,
optimizer=None,
model_parameters=None,
training_data=None,
@@ -62,8 +62,7 @@
"""Initialize the DeepSpeed Engine.
Arguments:
args: a dictionary containing local_rank and deepspeed_config
file location
args: an object containing local_rank and deepspeed_config fields. This is optional if `config_params` is passed.
model: Required: nn.module class before apply any wrappers
@@ -88,6 +87,9 @@
mini-batch of Tensor(s). Used when using batched loading from a
map-style dataset.
config_params: Optional: Instead of requiring args.deepspeed_config you can pass your deepspeed config
as a dictionary instead.
Returns:
A tuple of ``engine``, ``optimizer``, ``training_dataloader``, ``lr_scheduler``
@@ -108,6 +110,8 @@
__git_branch__),
ranks=[0])

assert model is not None, "deepspeed.initialize requires a model"

if not isinstance(model, PipelineModule):
engine = DeepSpeedEngine(args=args,
model=model,
6 changes: 6 additions & 0 deletions deepspeed/launcher/runner.py
@@ -365,6 +365,12 @@ def main(args=None):
result = subprocess.Popen(cmd, env=env)
result.wait()

# In case of failure, we must propagate the error condition back to the caller (usually a shell). The
# actual error and traceback should have been printed in the subprocess, so to avoid
# unnecessary noise we just quietly exit here with the same code as the subprocess
if result.returncode > 0:
sys.exit(result.returncode)


if __name__ == "__main__":
main()
10 changes: 8 additions & 2 deletions deepspeed/profiling/config.py
@@ -3,12 +3,15 @@
Licensed under the MIT license.
"""

from deepspeed.runtime.config_utils import get_scalar_param
from deepspeed.runtime.config_utils import get_scalar_param, DeepSpeedConfigObject
from deepspeed.profiling.constants import *


class DeepSpeedFlopsProfilerConfig(object):
class DeepSpeedFlopsProfilerConfig(DeepSpeedConfigObject):
def __init__(self, param_dict):
"""
docstring
"""
super(DeepSpeedFlopsProfilerConfig, self).__init__()

self.enabled = None
Expand All @@ -24,6 +27,9 @@ def __init__(self, param_dict):
self._initialize(flops_profiler_dict)

def _initialize(self, flops_profiler_dict):
"""
docstring
"""
self.enabled = get_scalar_param(flops_profiler_dict,
FLOPS_PROFILER_ENABLED,
FLOPS_PROFILER_ENABLED_DEFAULT)
2 changes: 1 addition & 1 deletion deepspeed/profiling/flops_profiler/profiler.py
@@ -265,7 +265,7 @@ def del_extra_repr(module):
"Each module profile is listed after its name in the following order: \nnumber of parameters, percentage of total parameters, number of multiply-accumulate operations (MACs), percentage of total MACs, latency, percentage of total latency, number of floating point operations per second (FLOPS, computed as 2 * MACs / latency)."
)
print(
"Note: \n1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.\n2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throught.\n"
"Note: \n1. A module can have torch.nn.functional (e.g. to compute logits) along with submodules, thus making the difference between the parent's MACs(or latency) and the sum of its submodules'.\n2. Number of floating point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.\n"
)
print(self.model)

11 changes: 2 additions & 9 deletions deepspeed/runtime/activation_checkpointing/config.py
@@ -3,7 +3,7 @@
Licensed under the MIT license.
"""

from deepspeed.runtime.config_utils import get_scalar_param
from deepspeed.runtime.config_utils import get_scalar_param, DeepSpeedConfigObject

#########################################
# DeepSpeed Activation Checkpointing
@@ -56,7 +56,7 @@
}


class DeepSpeedActivationCheckpointingConfig(object):
class DeepSpeedActivationCheckpointingConfig(DeepSpeedConfigObject):
def __init__(self, param_dict):
super(DeepSpeedActivationCheckpointingConfig, self).__init__()

@@ -74,13 +74,6 @@ def __init__(self, param_dict):

self._initialize(act_chkpt_config_dict)

"""
For json serialization
"""

def repr(self):
return self.__dict__

def _initialize(self, act_chkpt_config_dict):
self.partition_activations = get_scalar_param(
act_chkpt_config_dict,
8 changes: 4 additions & 4 deletions deepspeed/runtime/comm/nccl.py
@@ -86,7 +86,7 @@ def compressed_allreduce(self,

# worker_scale = self.compression_backend.cupy2torch(cupy_worker_scale)
recvbuf_sign = self.compression_backend.cupy2torch(cupy_recvbuf_sign)
# recvbuf_scale = self.compression_backend.cupy2torch(cupy_recvbuf_scale)
#recvbuf_scale = self.compression_backend.cupy2torch(cupy_recvbuf_scale)
recvbuf_scale = [
torch.zeros(1,
dtype=worker_scale.dtype,
@@ -106,13 +106,13 @@ def compressed_allreduce(self,
cupy_sign_list_packed = None

cupy_recvbuf_sign = self.compression_backend.torch2cupy(recvbuf_sign)
# cupy_recvbuf_scale = self.compression_backend.torch2cupy(torch.stack(recvbuf_scale))
#cupy_recvbuf_scale = self.compression_backend.torch2cupy(torch.stack(recvbuf_scale))

compensated_server_m = self.compression_backend.cupy2torch(
(cupy.unpackbits(cupy_recvbuf_sign.flatten())).reshape(
self.size,
-1)).float().add_(-0.5).mul_(2.0).mul_(
torch.stack(recvbuf_scale).mul_(1 / self.size)).sum(0)
torch.stack(recvbuf_scale).mul_(1 / self.size)).sum(0)
compensated_server_m.add_(server_error)
server_scale = torch.norm(compensated_server_m) / np.sqrt(
compensated_server_m.numel())
@@ -172,7 +172,7 @@ def compressed_allreduce(self,
(cupy.unpackbits(cupy_recvbuf_sign_server.flatten())).reshape(
self.size,
-1)).float().add_(-0.5).mul_(2.0).mul_(
self.compression_backend.cupy2torch(
self.compression_backend.cupy2torch(
cupy_recvbuf_scale_server)).flatten().data)
if original_size != worker_error_size:
buffer_m = buffer_m[0:original_size]
13 changes: 12 additions & 1 deletion deepspeed/runtime/config_utils.py
@@ -5,10 +5,21 @@
"""
Collection of DeepSpeed configuration utilities
"""

import json
from collections import Counter


class DeepSpeedConfigObject(object):
"""
For json serialization
"""
def repr(self):
return self.__dict__

def __repr__(self):
return json.dumps(self.__dict__, sort_keys=True, indent=4)


def get_scalar_param(param_dict, param_name, param_default_value):
return param_dict.get(param_name, param_default_value)

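
A hypothetical subclass shows the effect: any config object deriving from the new base class above now renders as pretty-printed JSON.

```python
class ExampleConfig(DeepSpeedConfigObject):  # base class from the diff above
    def __init__(self):
        super(ExampleConfig, self).__init__()
        self.enabled = True
        self.profile_step = 1

print(ExampleConfig())
# {
#     "enabled": true,
#     "profile_step": 1
# }
```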