I’m reading through the Autotuner code and found this function:
It computes `total_gpus = self.exp_num_nodes * self.exp_num_gpus` based on the autotuning config. If ZeRO is enabled, then depending on which stages are enabled, `optimizer_mem`, `gradients_mem`, and/or `params_mem` get sharded across those GPUs. But then, if `self.mp_size()` (for tensor parallelism, right?) is greater than 1, the total memory estimate is divided again by the tensor-parallel degree. So if ZeRO and tensor parallelism are both enabled, this is double-dipping, right? With N GPUs, the per-GPU memory usage can't get any smaller than 1/N of the total (see the sketch after the list below). I'm not sure whether:
- there's a bug here,
- the value of the `num_gpus` flag is supposed to be reduced by the amount of tensor parallelism, or
- I'm not understanding this correctly.
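To make the concern concrete, here is a minimal sketch (not the actual Autotuner code) of how the per-GPU estimate appears to be derived; the function name, argument names, and the `*_mem` inputs are stand-ins for whatever the real function uses:

```python
def estimate_per_gpu_memory(params_mem, gradients_mem, optimizer_mem,
                            activation_mem, num_nodes, num_gpus_per_node,
                            zero_stage, mp_size):
    # All GPUs in the experiment, including any model-parallel ranks.
    total_gpus = num_nodes * num_gpus_per_node

    # ZeRO sharding: each enabled stage divides its state across all GPUs.
    if zero_stage >= 1:
        optimizer_mem = optimizer_mem / total_gpus
    if zero_stage >= 2:
        gradients_mem = gradients_mem / total_gpus
    if zero_stage >= 3:
        params_mem = params_mem / total_gpus

    mem = params_mem + gradients_mem + optimizer_mem + activation_mem

    # Tensor parallelism: the total is divided again by mp_size, even though
    # the model-parallel GPUs were already counted in total_gpus above.
    if mp_size > 1:
        mem = mem / mp_size

    return mem
```

Under these assumptions, with 8 GPUs, ZeRO-3, and `mp_size = 2`, the parameter footprint would be reported as 1/16 of the full model, which seems too small if the model-parallel ranks are already included in `total_gpus`.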