DeepSpeed using DistributedSampler with model parallelism #99

@ShadenSmith

Description

DeepSpeed's data loader will use DistributedSampler by default unless another is provided:

https://github.com/microsoft/DeepSpeed/blob/001abe2362d9edba062070fb05df40925f54cb3e/deepspeed/pt/deepspeed_dataloader.py#L43

If DeepSpeed is configured with model parallelism, or called from a library with a sub-group of the world processes, the default behavior of DistributedSampler is incorrect because it queries the global world size and rank information. We should specify num_replicas and rank when creating the sampler.

If mpu is provided to deepspeed.initialize(), we should query mpu.get_data_parallel_world_size() and mpu.get_data_parallel_rank() and forward that information to the sampler.
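A minimal sketch of the proposed fix. The `FakeMPU` class here is a stand-in for illustration only; a real `mpu` object is whatever the model-parallel library passes to `deepspeed.initialize()`. The key point is that `DistributedSampler` accepts explicit `num_replicas` and `rank` arguments, which override its default of querying the global process group:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

# Hypothetical mpu-like object for illustration. A real mpu exposes the
# data-parallel group's size and this process's rank within that group.
class FakeMPU:
    def get_data_parallel_world_size(self):
        return 4

    def get_data_parallel_rank(self):
        return 1

def build_sampler(dataset, mpu=None):
    # If an mpu is provided, scope the sampler to the data-parallel group
    # instead of letting DistributedSampler query the global world size/rank,
    # which is wrong under model parallelism.
    if mpu is not None:
        return DistributedSampler(
            dataset,
            num_replicas=mpu.get_data_parallel_world_size(),
            rank=mpu.get_data_parallel_rank(),
        )
    # Without an mpu, fall back to the existing default behavior.
    return DistributedSampler(dataset)

dataset = TensorDataset(torch.arange(16))
sampler = build_sampler(dataset, mpu=FakeMPU())
# With 16 samples and 4 data-parallel replicas, each rank draws 4 samples.
```

With this change, each model-parallel group shares one data-parallel shard, rather than every global rank receiving a distinct (and therefore inconsistent) slice of the dataset.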

Labels: bug (Something isn't working), good first issue (Good for newcomers)
