DeepSpeed using DistributedSampler with model parallelism #99

@ShadenSmith

Description

DeepSpeed's data loader will use DistributedSampler by default unless another is provided:

https://github.com/microsoft/DeepSpeed/blob/001abe2362d9edba062070fb05df40925f54cb3e/deepspeed/pt/deepspeed_dataloader.py#L43

If DeepSpeed is configured with model parallelism, or called from a library with a sub-group of the world processes, the default behavior of DistributedSampler is incorrect because it queries the global world size and rank information. We should specify num_replicas and rank when creating the sampler.

If mpu is provided to deepspeed.initialize(), we should query mpu.get_data_parallel_world_size() and mpu.get_data_parallel_rank() and forward that information to the sampler.
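A minimal sketch of the proposed fix. The `FakeMPU` class here is a stand-in for illustration only; a real `mpu` object is whatever the model-parallel library passes to `deepspeed.initialize()`. The key point is that `DistributedSampler` accepts explicit `num_replicas` and `rank` arguments, which override its default of querying the global process group:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# Hypothetical mpu-like object for illustration. A real mpu exposes the
# data-parallel group's size and this process's rank within that group.
class FakeMPU:
    def get_data_parallel_world_size(self):
        return 4

    def get_data_parallel_rank(self):
        return 1

def build_sampler(dataset, mpu=None):
    # If an mpu is provided, scope the sampler to the data-parallel group
    # instead of letting DistributedSampler query the global world size/rank,
    # which is wrong under model parallelism.
    if mpu is not None:
        return DistributedSampler(
            dataset,
            num_replicas=mpu.get_data_parallel_world_size(),
            rank=mpu.get_data_parallel_rank(),
        )
    # Without an mpu, fall back to the existing default behavior.
    return DistributedSampler(dataset)

dataset = TensorDataset(torch.arange(16))
sampler = build_sampler(dataset, mpu=FakeMPU())
# With 16 samples and 4 data-parallel replicas, each rank draws 4 samples.
```

With this change, each model-parallel group shares one data-parallel shard, rather than every global rank receiving a distinct (and therefore inconsistent) slice of the dataset.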

Labels: bug (Something isn't working), good first issue (Good for newcomers)
